Job Summary
A company is looking for a Senior Cluster Site Reliability Engineer.
Key Responsibilities
- Respond to and resolve urgent cluster outages or issues
- Ensure high cluster uptime and track SLAs for reliability
- Diagnose recurring problems and collaborate on engineering solutions
Required Qualifications
- 5+ years of experience in SRE or DevOps roles
- Knowledge of HPC/batch compute frameworks and machine learning training systems
- Ability to develop scripts in a common scripting language
- Familiarity with infrastructure-as-code and cloud infrastructure
- Bachelor's degree in computer science or equivalent experience