Job Summary
A company is looking for a Staff Site Reliability Engineer focused on Machine Learning Infrastructure.
Key Responsibilities
- Design and implement robust ML infrastructure for training, deployment, monitoring, and scaling of machine learning models
- Improve reliability, availability, and scalability of ML infrastructure to support internal ML workflows
- Collaborate with various teams to identify infrastructure needs and optimize the ML lifecycle
Required Qualifications
- 7+ years of experience in Site Reliability Engineering (SRE), DevOps, or infrastructure engineering roles
- Expertise with on-premises infrastructure for machine learning workloads (e.g., Kubernetes, Docker)
- Proficiency with infrastructure automation and configuration management tools (e.g., Terraform, Ansible)
- Experience with observability and monitoring for ML systems (e.g., Prometheus, Grafana)
- Familiarity with Python-based ML frameworks (e.g., PyTorch, TensorFlow)