Senior Member of Technical Staff, Web Data

Cohere
Large Language company
Toronto
Senior
Data & AI

About the role

Develop and maintain web data pipelines for AI model training.

As a Senior Member of Technical Staff specializing in web data for pre-training, you will develop the large-scale web data pipeline that underpins Cohere’s advanced language models.

Key Responsibilities

  • Maintain large-scale pipelines for processing web corpora.
  • Work on filtering and quality-scoring systems to identify high-value web documents.
  • Analyze web data composition across domains, languages, and time periods.
  • Develop and maintain high-performance deduplication pipelines.
  • Collaborate with cross-functional teams, including researchers and engineers, to ensure data pipelines meet the demands of cutting-edge language models.

Requirements

  • Strong software engineering skills, with proficiency in Python and experience building data pipelines.
  • Familiarity with data processing frameworks such as Apache Spark, Apache Beam, Pandas, or similar tools.
  • Experience working with large-scale web datasets.
  • Knowledge of data quality assessment techniques and experimentation with data mixtures.

Tech stack

Python, Spark, Pandas

