Job Summary
A company is looking for a Senior Software Engineer, Distributed Systems - DGX Cloud.
Key Responsibilities
- Develop custom software for scheduling GPU resources on Kubernetes for scalable AI workloads
- Implement monitoring and health management capabilities for GPU asset reliability and scalability
- Collaborate with other teams to keep production AI clusters performing reliably, and improve services based on what incident management reveals
Required Qualifications
- 5+ years of software engineering experience in a technical organization with a focus on large-scale production systems
- Experience with Kubernetes APIs and frameworks, not just cluster operation
- BS in Computer Science, Engineering, Physics, Mathematics, or equivalent experience
- Proficiency in a systems programming language (Go, Python) and understanding of data structures and algorithms
- Strong motivation and ability to work with multi-functional teams across organizational boundaries