Job Summary
A company is looking for a Principal Infra and Ops Engineer to manage operations for a large-scale AI/ML platform.
Key Responsibilities:
- Implement automation across the infrastructure lifecycle using Infrastructure as Code (IaC) and DevOps principles
- Develop and implement monitoring frameworks for infrastructure to ensure high availability and performance optimization
- Provide SRE support to users on the AI/ML platform, including ticket response and customer liaison
Required Qualifications:
- Bachelor's degree in computer science, information technology, or a STEM-related field
- 8+ years of infrastructure experience with large-scale, cloud-based software platforms
- 6+ years of experience with Infrastructure as Code (IaC) and CI/CD tools such as Terraform and GitHub Actions
- 4+ years of experience in containerization technologies such as Kubernetes and Docker
- 3+ years of hands-on experience with Azure