Job Summary
A company is looking for a Senior Site Reliability Engineer (AWS, AI/ML, & APM).
Key Responsibilities
- Provide on-call production support and resolve tickets raised by engineering and implementation teams
- Monitor and maintain system health, respond to alerts, and manage incidents
- Develop automation scripts, collaborate with software engineers, and assist in capacity planning
Required Qualifications
- 5+ years in site reliability engineering or system administration
- Experience with AI/ML infrastructure and AWS services
- Proficiency in Linux/Unix systems and cloud platforms (AWS, Azure, Google Cloud)
- Strong scripting skills (Python, Bash, Ruby) and programming experience (Go, Java, C++)
- Familiarity with the ELK Stack and configuration management tools