Job Summary
A company is looking for a Principal Site Reliability Engineer, AI Infrastructure.
Key Responsibilities
- Architect and scale globally distributed production systems for AI/ML and HPC across hybrid and multi-cloud environments
- Design and implement automation frameworks to enhance system resilience and operational efficiency
- Lead initiatives to assess operational maturity and establish long-term reliability strategies in collaboration with various teams
Required Qualifications
- 15+ years of experience in SRE, Production Engineering, or Cloud Infrastructure
- Deep expertise in Linux/Unix systems and public/private cloud platforms (AWS, GCP, Azure, OCI)
- Expert-level programming skills in Python and familiarity with languages such as C++, Go, or Rust
- Experience with Kubernetes, microservice orchestration, and observability frameworks
- Degree in Computer Science or related field, or equivalent experience