Job Summary
A company is looking for a Staff Site Reliability Engineer focused on Machine Learning Infrastructure.
Key Responsibilities
- Design and implement robust ML infrastructure for training, deployment, monitoring, and scaling of machine learning models
- Improve reliability, availability, and scalability of ML infrastructure to support internal ML engineers and researchers
- Collaborate with teams to identify infrastructure requirements and optimize system performance and security
Required Qualifications
- 7+ years of experience in Site Reliability Engineering, DevOps, or infrastructure engineering roles
- Proven expertise with on-premises infrastructure for machine learning workloads (e.g., Kubernetes, Docker)
- Strong proficiency with infrastructure automation and configuration management tools (e.g., Terraform, Ansible)
- Experience implementing observability and monitoring for ML systems (e.g., Prometheus, Grafana)
- Familiarity with popular Python-based ML frameworks (e.g., PyTorch, TensorFlow)