Responsibilities:

  • Build efficient data processing pipelines for large-scale unstructured data sources (including web data, books, and code) to support foundation model pre-training.
  • Leverage cloud-based big data computing frameworks (e.g., Spark, Ray) to optimize performance and cost for large-scale offline processing workloads.

Qualifications:

  • Bachelor’s degree or equivalent practical experience.
  • Experience with big data platforms, data warehouses, or data governance systems; familiarity with cloud-based data warehouse or data platform architectures.
  • Hands-on experience with at least one major cloud data PaaS ecosystem, including object storage, data lakes/warehouses, and large-scale processing frameworks (Spark, Flink, Ray); proven experience optimizing large offline workloads (experience at very large scale is a strong plus).
  • Strong understanding of data governance systems, with practical experience in data layering, metadata management, data lineage, and data version control.
  • Experience managing and governing large-scale unstructured datasets (text, web content, code, etc.); experience with pre-training corpus management is a plus.

If you are interested in this position, please submit your resume and cover letter to shandahr@shanda.com. Referrals from recruitment agencies are also welcome.