Remote Jobs

Senior HPC Infrastructure Engineer

9/1/2025

Job Summary

A company is looking for a Senior GPU and HPC Infrastructure Engineer - DGX Cloud.

Key Responsibilities

Contribute to the automation of datacenter operations, break/fix, and lifecycle management for large-scale Machine Learning systems
Implement monitoring and health management capabilities for GPU assets to ensure reliability and scalability
Build automated test infrastructure for qualifying distributed systems and ensure software integration across engineering teams

Required Qualifications

5+ years of software engineering experience on large-scale production systems
BS in Computer Science, Engineering, Physics, Mathematics, or equivalent experience
Expert knowledge of a systems programming language (Go, Python) and Linux system administration
Understanding of cluster management systems (Kubernetes, SLURM) and complex distributed systems
Familiarity with performance, security, and reliability in distributed systems
