Job Summary
A company is looking for an HPC Site Reliability Engineer (SRE).
Key Responsibilities
- Implement monitoring solutions for critical infrastructure and applications
- Collect metrics on system performance, service availability, and user experience
- Respond to infrastructure alerts and user community tickets, and resolve the underlying issues
Required Qualifications
- Experience with HPC infrastructure management
- Proficiency in monitoring and automation tools
- Knowledge of metrics collection and analysis
- Familiarity with incident response and ticketing systems
- Understanding of documentation practices for software and procedures