Job Summary
A company is looking for a Site Reliability Engineer.
Key Responsibilities
- Deploy clusters of 1,000+ GPUs and modify tools for customer solutions
- Validate and optimize compute, storage, and networking infrastructure
- Debug production issues and build internal tooling to enhance deployment efficiency
Required Qualifications
- 2+ years of experience in SRE, DevOps, Sysadmin, or HPC engineering
- Experience deploying and operating Kubernetes and/or SLURM clusters
- Proficiency in Go, Python, and Bash
- Familiarity with automation tools like Ansible and Terraform
- Strong engineering background in Computer Science, Software Engineering, Math, or related fields