Job Summary
A company is looking for a Senior Site Reliability Engineer, AI Infrastructure.
Key Responsibilities
- Develop and maintain large-scale systems for AI Infrastructure, ensuring reliability and scalability
- Implement SRE fundamentals, including incident management and automation tooling, to improve operational efficiency
- Establish frameworks for operational maturity and lead incident response protocols to improve system resilience
Required Qualifications
- Degree in Computer Science or related field, or equivalent experience with 12+ years in Software Development, SRE, or Production Engineering
- Proficiency in Python and at least one additional programming language (C/C++, Go, Perl, Ruby)
- Expertise in systems engineering within Linux or Windows environments and cloud platforms (AWS, OCI, Azure, GCP)
- Strong understanding of SRE principles, including error budgets, and of Infrastructure as Code tools
- Hands-on experience with observability platforms and CI/CD systems