Job Summary
A company is looking for a Member of Engineering focused on fault tolerance for pre-training and inference.
Key Responsibilities
- Identify, study, and troubleshoot hardware problems during training at scale
- Minimize GPU idle time during faults, both operationally and strategically
- Design and develop tools and add-ons to accelerate training recovery
Required Qualifications
- Strong engineering background with programming experience using the Linux API and the Linux kernel
- Basic understanding of large language models (LLMs) and deep learning fundamentals
- Proficiency in Python (PyTorch), C/C++, and the CUDA API
- Knowledge of distributed systems, reliability, and fault tolerance
- Experience with NCCL and modern development tools