Job Summary
A company is looking for a GPU and HPC Infrastructure Engineer - New College Grad 2025.
Key Responsibilities
- Contribute to the automation of datacenter operations and lifecycle management for large-scale Machine Learning systems
- Implement monitoring and health management capabilities for GPU assets to ensure reliability and scalability
- Develop software for NVLink topology management and build automated test infrastructure for distributed systems
Required Qualifications
- Pursuing or recently completed a BS or MS in Computer Science, Engineering, Physics, Mathematics, or a comparable degree
- Software engineering experience on large-scale production systems
- Strong knowledge of a systems programming language (Go, Python) and understanding of Data Structures and Algorithms
- High-level knowledge of Linux system administration and cluster management systems (Kubernetes, Slurm)
- Understanding of performance, security, and reliability in complex distributed systems