We are seeking a Senior Hardware & Software Operations Engineer to take responsibility for the maintenance and repair of B200 GPU/X86 servers. This role requires deep hardware knowledge and strong troubleshooting skills to keep servers performing reliably.
Key Responsibilities
- Diagnose circuit board faults in B200 GPU/X86 servers.
- Perform regular hardware maintenance and inspections to prevent potential issues.
- Manage hardware assets, including repair records, part replacements, and inventory tracking.
- Work with vendors and manufacturers to resolve complex hardware issues, including warranty and RMA services.
- Prepare and update hardware maintenance reports; record, track, and summarize repair, maintenance, and testing results clearly, with a focus on outcomes.
- Provide technical support for urgent hardware-related incidents.
- Analyze hardware failure patterns and propose improvement measures to reduce future fault rates.
- Participate in hardware upgrades and configuration management to ensure continuous optimization.
- Provide remote Tier-2 support for service tickets in other regions as needed.
Qualifications
- Bachelor's degree or above in Electronics Engineering, Computer Science, or related fields.
- Minimum 3 years of experience in server/Linux system operations, particularly in GPU server environments.
- Proficient in analyzing hardware and OS logs to extract key information and identify fault points.
- Ability to read and understand circuit diagrams and hardware specifications.
- Strong communication and teamwork skills.
- Willingness to work on-call, including night shifts and weekends when required.
Technical Skills
- Familiar with major server brands (H3C, Inspur, Lenovo, DELL, HyperFusion, etc.); capable of independent server hardware diagnosis and repair, including firmware upgrades and troubleshooting.
- In-depth understanding of B200 GPU/X86 server architecture and components.
- Experienced with server hardware monitoring and management tools.
- Basic networking knowledge and ability to resolve server network connectivity issues.
- Familiar with asset management systems and related tools.
- Strong documentation skills, attention to detail, a sense of responsibility, and a service-oriented, collaborative approach.
Preferred Qualifications
- Relevant IT hardware certifications; vendor repair engineer or Red Hat certification is a plus.
- Experience in data center design and construction.
- Programming skills for developing automation scripts to enhance operational efficiency.
- Strong communication and customer service orientation.
- Good command of English, both written and spoken.