Responsibilities
Administration
- Perform installation, uninstallation, and modification of software, patches, and system components.
- Review and analyze the monthly compliance reports provided by the customer, and create mitigation and resolution action plans.
- Execute server OS patching, including remediation and rollback for failed patches, as well as patch deployment for critical Common Vulnerabilities and Exposures (CVE) issues.
- Conduct server and hardware firmware upgrades as part of lifecycle management.
Problem Management
- Isolate, diagnose, and troubleshoot system-related incidents.
- Coordinate service incident management to ensure timely resolution and communication.
- Raise and manage service requests on behalf of the customer when required.
- Participate in root cause analysis (RCA) reviews and provide technical insights for preventive measures.
- Review monthly system logs, identify anomalies, and highlight issues with supporting justification for Authority investigations where applicable.
Requirements
- 4–5 years of experience in IT Operations, including OS upgrades, patching, and hardware/software lifecycle management.
- Hands-on experience with NVIDIA Base Command Manager for GPU cluster administration.
- Working knowledge of Kubernetes and NVIDIA GPU Operator.
- Experience with NVIDIA AI Enterprise solutions.
- Strong analytical, troubleshooting, and incident management skills.
- Ability to deliver professional, structured reports and technical findings.