About the role
Develop and maintain web data pipelines for AI model training.
As a Senior Member of Technical Staff specializing in web data for pre-training, you will develop the large-scale web data pipeline that underpins Cohere’s advanced language models.
Key Responsibilities
- Maintain large-scale pipelines for processing web corpora.
- Work on filtering and quality-scoring systems to identify high-value web documents.
- Analyze web data composition across domains, languages, and time periods.
- Develop and maintain high-performance deduplication pipelines.
- Collaborate with cross-functional teams, including researchers and engineers, to ensure data pipelines meet the demands of cutting-edge language models.
Requirements
- Strong software engineering skills, with proficiency in Python and experience building data pipelines.
- Familiarity with data processing frameworks such as Apache Spark, Apache Beam, Pandas, or similar tools.
- Experience working with large-scale web datasets.
- Knowledge of data quality assessment techniques and experimentation with data mixtures.
Tech stack
Python, Spark, Pandas
Match insights
Tech: Python, Spark, Pandas
Level: Senior