Job Summary
A company is looking for a Senior Site Reliability Engineer, DGX Cloud.
Key Responsibilities
- Support large-scale Kubernetes services and manage system creation, capacity, and launch reviews
- Build and maintain operational reliability for large-scale Kubernetes clusters with a focus on performance and monitoring
- Lead incident response and root-cause analysis, maintain service health, and optimize GPU workloads across cloud platforms
Required Qualifications
- BS in Computer Science or related technical field, or equivalent experience
- 12+ years of experience operating production services at scale
- Expert-level knowledge of Kubernetes administration and microservices architecture
- Experience with infrastructure automation tools and proficiency in at least one high-level programming language
- In-depth knowledge of Linux, networking fundamentals, and SRE principles