Head of Data
Responsibilities:
- Data infrastructure and governance: Lead the design of a cloud-based AGI data warehouse architecture for large-scale unstructured data. Establish clear standards for data layering, lineage, metadata management, and version control.
- Large-scale data engineering and optimization: Architect and operate PB-scale ETL pipelines, and lead deep performance optimization of distributed computing workloads running on tens of thousands of cores.
- Web-scale data acquisition: Lead the engineering, strategy optimization, and data management of a web crawling system capable of collecting data at a scale of tens of billions of pages.
- Ultra-large-scale corpus processing: Build automated pipelines for cleaning, parsing, and deduplicating massive corpora across web data, books, and code.
- Team leadership and closed-loop validation: Build and lead data engineering and data governance teams. Collaborate closely with algorithm and research teams to validate data quality through downstream benchmark performance and closed-loop evaluation.
Qualifications:
- Bachelor’s degree or equivalent practical experience.
- Proven experience leading large language model pre-training data processing and governance efforts; deep expertise in PB-scale data processing.
- Strong knowledge of governance frameworks for unstructured data warehouses or data lakes, including data layering, lineage, and metadata management.
- Demonstrated experience leading technical teams, with strong cross-functional collaboration skills across research, engineering, and legal/compliance teams.
If you are interested in these job openings, please submit your resume and cover letter to shandahr@shanda.com. We also welcome assistance from recruitment agencies.