Staff Site Reliability Engineer

6/20/2025

No location specified

Job Summary

A company is looking for a Staff Site Reliability Engineer focused on Machine Learning Infrastructure.

Key Responsibilities
  • Design and implement robust ML infrastructure for training, deploying, monitoring, and scaling machine learning models
  • Improve reliability, availability, and scalability of ML infrastructure to support internal ML engineers and researchers
  • Collaborate with teams to identify infrastructure requirements and optimize system performance and security

Required Qualifications
  • 7+ years of experience in Site Reliability Engineering, DevOps, or infrastructure engineering roles
  • Proven expertise with on-premises infrastructure for machine learning workloads (e.g., Kubernetes, Docker)
  • Strong proficiency with infrastructure automation and configuration management tools (e.g., Terraform, Ansible)
  • Experience implementing observability and monitoring for ML systems (e.g., Prometheus, Grafana)
  • Familiarity with popular Python-based ML frameworks (e.g., PyTorch, TensorFlow)
