Big Data Engineer (Foundation Model Pre-training Data)
Responsibilities:
- Build efficient data processing pipelines for large-scale unstructured data sources (web data, books, and code) to support foundation model pre-training.
- Leverage cloud-based big data computing frameworks (e.g., Spark, Ray) to optimize performance and cost for large-scale offline processing workloads.
Qualifications:
- Bachelor’s degree or equivalent practical experience.
- Experience with big data platforms, data warehouses, or data governance systems; familiarity with cloud-based data warehouse or data platform architectures.
- Hands-on experience with at least one major cloud data PaaS ecosystem, including object storage, data lakes/warehouses, and large-scale processing frameworks (Spark, Flink, Ray); proven experience optimizing large offline workloads (experience at very large scale is a strong plus).
- Strong understanding of data governance systems, with practical experience in data layering, metadata management, data lineage, and data version control.
- Experience managing and governing large-scale unstructured datasets (text, web content, code, etc.); experience with pre-training corpus management is a plus.
If you are interested in these job openings, please submit your resume and cover letter to shandahr@shanda.com. We also welcome assistance from recruitment agencies.