Responsibilities:

  • System development and optimization: Lead the development and optimization of large-scale model training and inference systems. Apply techniques such as hybrid parallelism, automatic parallelization, high-performance operator development, and communication optimization to significantly improve training speed and efficiency, accelerating model iteration cycles.
  • Solving complex system challenges: Tackle advanced machine learning system challenges related to high concurrency, reliability, and scalability, ensuring stable and efficient system behavior across diverse production scenarios.
  • End-to-end ML systems ownership: Own critical machine learning system domains, including resource scheduling, model training, model inference, and reinforcement learning training, and drive overall system performance improvements.
  • Performance analysis and technical innovation: Perform in-depth performance analysis of large-scale model training workloads to identify and resolve bottlenecks, maximizing training efficiency. Continuously evaluate and adopt cutting-edge technologies to fully leverage hardware capabilities.

Qualifications:

  • Bachelor’s degree or higher (or equivalent practical experience) in Computer Science, Software Engineering, or a related field.
  • Strong programming and systems expertise: Proficiency in at least one of C, C++, Python, or CUDA; hands-on experience with distributed training frameworks such as PyTorch FSDP, DeepSpeed, or Megatron-LM.
  • Technical rigor and systems thinking: Ability to evaluate and design technical solutions across dimensions such as system performance, stability, and efficiency, ensuring sound, scalable, and high-quality system architectures.
  • Practical experience and strong interest in one or more of the following areas:
    • Parallel systems: Deep experience with distributed training, efficient fine-tuning, reinforcement learning training, and inference engine optimization for foundation models, including parallel strategy design, quantization and compression techniques, and operator optimization.
    • High-performance operators and infrastructure: Familiarity with parallel computing frameworks (e.g., Triton, CUDA), communication libraries (e.g., NCCL, NVSHMEM), and AI compilers (e.g., MLIR, TVM, LLVM).

If you are interested in this opening, please submit your resume and cover letter to shandahr@shanda.com. Referrals from recruitment agencies are also welcome.