Remote Jobs

Staff Site Reliability Engineer

7/20/2025

No location specified

Job Summary

A company is looking for a Staff Site Reliability Engineer focused on Machine Learning Infrastructure.

Key Responsibilities
  • Design and implement robust ML infrastructure for training, deployment, monitoring, and scaling of machine learning models
  • Improve reliability, availability, and scalability of ML infrastructure to support internal ML workflows
  • Collaborate with various teams to identify infrastructure needs and optimize the ML lifecycle

Required Qualifications
  • 7+ years of experience in Site Reliability Engineering (SRE), DevOps, or infrastructure engineering roles
  • Expertise with on-premises infrastructure for machine learning workloads (e.g., Kubernetes, Docker)
  • Proficiency with infrastructure automation and configuration management tools (e.g., Terraform, Ansible)
  • Experience with observability and monitoring for ML systems (e.g., Prometheus, Grafana)
  • Familiarity with Python-based ML frameworks (e.g., PyTorch, TensorFlow)
