Job Summary
A company is looking for a Senior Site Reliability Engineer, AI Infrastructure.
Key Responsibilities
- Provide leadership and strategic guidance on managing large-scale HPC systems, including deployment of compute, networking, and storage
- Develop scalable automation solutions and improve the ecosystem around GPU-accelerated computing
- Build and maintain heterogeneous AI and ML clusters on-premises and in the cloud
Required Qualifications
- Bachelor's degree in Computer Science, Electrical Engineering, or related field, or equivalent experience
- Minimum 8 years of experience designing and operating large-scale compute infrastructure
- Experience with advanced AI/HPC job schedulers, such as Slurm or Kubernetes
- Proficient in administering CentOS/RHEL and/or Ubuntu Linux distributions
- Solid understanding of cluster configuration management tools such as Ansible or Puppet