Remote Jobs

Machine Learning Engineer

6/21/2025

Job Summary

A company is looking for a Machine Learning Engineer (Training Infrastructure).

Key Responsibilities:
  • Performance engineering of training infrastructure for large language models
  • Implementing parallelization strategies across various dimensions
  • Profiling distributed training runs and optimizing performance bottlenecks

Required Qualifications:
  • 3+ years of experience training large neural networks in production
  • Expert-level knowledge of PyTorch or JAX for training code
  • Experience with multi-node, multi-GPU training and debugging
  • Familiarity with distributed training frameworks and cluster management
  • Deep understanding of GPU memory management and optimization techniques
