Remote Jobs

Senior Site Reliability Engineer

7/23/2025

No location specified

Job Summary

A company is looking for a Senior Site Reliability Engineer, AI Infrastructure.

Key Responsibilities

Provide leadership and strategic guidance on managing large-scale HPC systems, including deployment of compute, networking, and storage
Develop scalable automation solutions and improve the ecosystem around GPU-accelerated computing
Build and maintain heterogeneous AI and ML clusters on-premises and in the cloud

Required Qualifications

Bachelor's degree in Computer Science, Electrical Engineering, or related field, or equivalent experience
Minimum 8 years of experience designing and operating large-scale compute infrastructure
Experience with advanced AI/HPC job schedulers, such as Slurm or Kubernetes
Proficient in administering CentOS/RHEL and/or Ubuntu Linux distributions
Solid understanding of cluster configuration management tools such as Ansible or Puppet
