Remote Jobs

Senior Site Reliability Engineer

7/23/2025

No location specified

Job Summary

A company is looking for a Senior Site Reliability Engineer, AI Infrastructure.

Key Responsibilities
  • Provide leadership and strategic guidance on managing large-scale HPC systems, including deployment of compute, networking, and storage
  • Develop scalable automation solutions and improve the ecosystem around GPU-accelerated computing
  • Build and maintain AI and ML heterogeneous clusters on-premises and in the cloud

Required Qualifications
  • Bachelor's degree in Computer Science, Electrical Engineering, or related field, or equivalent experience
  • Minimum 8 years of experience designing and operating large-scale compute infrastructure
  • Experience with advanced AI/HPC job schedulers, such as Slurm or Kubernetes
  • Proficient in administering CentOS/RHEL and/or Ubuntu Linux distributions
  • Solid understanding of cluster configuration management tools such as Ansible or Puppet
