Remote Jobs

Staff Site Reliability Engineer

7/20/2025

No location specified

Job Summary

A company is looking for a Staff Site Reliability Engineer focused on Machine Learning Infrastructure.

Key Responsibilities

Design and implement robust ML infrastructure for training, deployment, monitoring, and scaling of machine learning models
Improve reliability, availability, and scalability of ML infrastructure to support internal ML workflows
Collaborate with various teams to identify infrastructure needs and optimize the ML lifecycle

Required Qualifications

7+ years of experience in Site Reliability Engineering (SRE), DevOps, or infrastructure engineering roles
Expertise with on-premises infrastructure for machine learning workloads (e.g., Kubernetes, Docker)
Proficiency with infrastructure automation and configuration management tools (e.g., Terraform, Ansible)
Experience with observability and monitoring for ML systems (e.g., Prometheus, Grafana)
Familiarity with Python-based ML frameworks (e.g., PyTorch, TensorFlow)
