Job Summary
A company is looking for a Senior AI Observability Engineer to architect and implement distributed observability systems for AI and HPC clusters.
Key Responsibilities
- Collaborate with engineering and research teams to deliver observability solutions for AI/HPC clusters
- Develop, test, and deploy data collectors, pipelines, and visualization services
- Define data collection and retention policies to optimize network bandwidth and storage costs
Required Qualifications
- Experience developing large-scale distributed observability systems
- Proficiency in Python programming and API usage
- Experience with observability platforms like Apache Spark, Elastic/OpenSearch, and Grafana
- MS (preferred) or BS in Computer Science, Electrical Engineering, or related field
- 8+ years of proven experience in relevant fields