Job Summary
A company is looking for a Senior GPU and HPC Infrastructure Engineer - DGX Cloud.
Key Responsibilities
- Contribute to the automation of datacenter operations, break/fix, and lifecycle management for large-scale Machine Learning systems
- Implement monitoring and health management capabilities for GPU assets to ensure reliability and scalability
- Build automated test infrastructure for qualifying distributed systems and ensure software integration across engineering teams
Required Qualifications
- 5+ years of software engineering experience on large-scale production systems
- BS in Computer Science, Engineering, Physics, Mathematics, or equivalent experience
- Expert knowledge of a systems programming language (Go, Python) and Linux system administration
- Understanding of cluster management systems (Kubernetes, SLURM) and complex distributed systems
- Familiarity with performance, security, and reliability in distributed systems