Responsibilities:

  • Web corpus pipelines: Design, build, and lead large-scale web data processing pipelines (e.g., Common Crawl), including content extraction, quality filtering, classification, and multi-level deduplication.
  • Document parsing pipelines: Develop automated parsing and processing pipelines for massive collections of books and academic papers, covering OCR, layout reconstruction, formula and table extraction, quality filtering, and deduplication.
  • Code corpus pipelines: Build large-scale processing pipelines for code datasets, including quality filtering and deduplication.
  • Knowledge and domain data mining: Construct high-quality corpora representing “world knowledge,” and enhance knowledge density, diversity, factual accuracy, and domain-specific capabilities of foundation models through data synthesis and rewriting.
  • Intelligent pipelines and closed-loop validation: Collaborate closely with algorithm and research teams to productionize model capabilities (e.g., quality scoring, content rewriting, intelligent filtering) and integrate them into data cleaning pipelines. Validate data contributions through downstream benchmark performance and closed-loop evaluation.

Qualifications:

  • Bachelor’s degree or equivalent practical experience.
  • Proven experience leading or contributing substantially to at least one of the following: large-scale LLM pre-training data cleaning efforts, web quality evaluation platforms for large search engines, or content quality detection systems for large-scale content platforms.
  • Strong expertise in data cleaning and processing algorithms and tools, including content extraction, deduplication, and quality/toxicity filtering models.
  • Solid experience with large-scale data processing frameworks and strong data engineering skills, including end-to-end pipeline design and implementation.
  • Familiarity with applying LLMs to content quality assessment, content extraction, and content rewriting.

If you are interested in these job openings, please submit your resume and cover letter to shandahr@shanda.com. We also welcome assistance from recruitment agencies.