Job Summary
A company is looking for a Platform Engineer - AI/ML Infrastructure.
Key Responsibilities
- Architect and maintain core computing platforms using Kubernetes on AWS and on-premises
- Develop and manage infrastructure using Infrastructure-as-Code (IaC) principles with Terraform
- Design, build, and optimize AI/ML job scheduling and orchestration systems that integrate Slurm with Kubernetes clusters
Required Qualifications
- 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE)
- Hands-on experience building and managing production infrastructure with Terraform
- Expert-level knowledge of Kubernetes architecture and operations in large-scale environments
- Experience with high-performance computing (HPC) job schedulers, specifically Slurm
- Experience managing bare-metal infrastructure, including server provisioning and lifecycle management