Principal Site Reliability Engineer

7/31/2025

Remote

Job Summary

A company is looking for a Principal Site Reliability Engineer, AI Infrastructure.

Key Responsibilities
  • Architect and scale globally distributed production systems for AI/ML and HPC across hybrid and multi-cloud environments
  • Design and implement automation frameworks to enhance system resilience and operational efficiency
  • Lead initiatives to assess operational maturity and establish long-term reliability strategies in collaboration with various teams

Required Qualifications
  • 15+ years of experience in SRE, Production Engineering, or Cloud Infrastructure
  • Deep expertise in Linux/Unix systems and public/private cloud platforms (AWS, GCP, Azure, OCI)
  • Expert-level programming skills in Python and familiarity with languages such as C++, Go, or Rust
  • Experience with Kubernetes, microservice orchestration, and observability frameworks
  • Degree in Computer Science or related field, or equivalent experience