Engineering Member - Fault Tolerance

8/21/2025

No location specified

Job Summary

A company is looking for a Member of Engineering focused on pre-training and inference fault tolerance.

Key Responsibilities
  • Identify, study, and troubleshoot hardware problems during training at scale
  • Minimize GPU idle time during faults, both operationally and strategically
  • Design and develop tools and add-ons to accelerate training recovery

Required Qualifications
  • Strong engineering background with programming experience using the Linux API and the Linux kernel
  • Basic understanding of large language models (LLMs) and deep learning fundamentals
  • Proficiency in Python (PyTorch), C/C++, and the CUDA API
  • Knowledge of distributed systems, reliability, and fault tolerance
  • Experience with NCCL and modern development tools
