Job Summary
A company is looking for a Machine Learning Engineer (Training Infrastructure).
Key Responsibilities:
- Performance engineering of training infrastructure for large language models
- Implementing parallelization strategies across various dimensions
- Profiling distributed training runs and optimizing performance bottlenecks
Required Qualifications:
- 3+ years of experience training large neural networks in production
- Expert-level knowledge of PyTorch or JAX for training code
- Experience with multi-node, multi-GPU training and debugging
- Familiarity with distributed training frameworks and cluster management
- Deep understanding of GPU memory management and optimization techniques
Comments