Job Summary
A company is looking for a Staff Site Reliability Engineer (Databricks).
Key Responsibilities
- Operate and improve the Databricks platform lifecycle, focusing on automation and cost optimization
- Design resilient and scalable infrastructure across cloud environments, driving initiatives for failover and capacity planning
- Build and maintain monitoring and logging infrastructure, defining SLOs/SLAs for critical services
Required Qualifications
- 6+ years of experience in SRE, platform engineering, or DevOps roles supporting data-intensive applications
- Hands-on experience with Databricks, including workspace setup and job management
- Deep understanding of cloud-native infrastructure, particularly AWS
- Proven expertise with observability tools and architecting monitoring solutions
- Strong command of CI/CD tooling and infrastructure-as-code practices