Senior HPC Infrastructure Engineer

9/1/2025

Not specified

Job Summary

A company is looking for a Senior GPU and HPC Infrastructure Engineer - DGX Cloud.

Key Responsibilities
  • Contribute to the automation of datacenter operations, break/fix, and lifecycle management for large-scale Machine Learning systems
  • Implement monitoring and health management capabilities for GPU assets to ensure reliability and scalability
  • Build automated test infrastructure for qualifying distributed systems and ensure software integration across engineering teams

Required Qualifications
  • 5+ years of software engineering experience on large-scale production systems
  • BS in Computer Science, Engineering, Physics, Mathematics, or equivalent experience
  • Expert knowledge of a systems programming language (Go, Python) and Linux system administration
  • Understanding of cluster management systems (Kubernetes, SLURM) and complex distributed systems
  • Familiarity with performance, security, and reliability in distributed systems