Remote Jobs

Staff Site Reliability Engineer

6/20/2025

No location specified

Job Summary

A company is looking for a Staff Site Reliability Engineer focused on Machine Learning Infrastructure.

Key Responsibilities

Design and implement robust ML infrastructure for training, deployment, monitoring, and scaling of machine learning models
Improve reliability, availability, and scalability of ML infrastructure to support internal ML engineers and researchers
Collaborate with teams to identify infrastructure requirements and optimize system performance and security

Required Qualifications

7+ years of experience in Site Reliability Engineering, DevOps, or infrastructure engineering roles
Proven expertise with on-premises infrastructure for machine learning workloads (e.g., Kubernetes, Docker)
Strong proficiency with infrastructure automation and configuration management tools (e.g., Terraform, Ansible)
Experience implementing observability and monitoring for ML systems (e.g., Prometheus, Grafana)
Familiarity with popular Python-based ML frameworks (e.g., PyTorch, TensorFlow)
