Senior Member of Technical Staff, Web Data

Cohere, Large Language company

Toronto, Senior

Data & AI


About the role

Develop and maintain web data pipelines for AI model training.

As a Senior Member of Technical Staff specializing in web data for pre-training, you will develop the large-scale web data pipeline that underpins Cohere’s advanced language models.

Key Responsibilities

• Maintain large-scale pipelines for processing web corpora.
• Work on filtering and quality-scoring systems to identify high-value web documents.
• Analyze web data composition across domains, languages, and time periods.
• Develop and maintain highly performant deduplication pipelines.
• Collaborate with cross-functional teams, including researchers and engineers, to ensure data pipelines meet the demands of cutting-edge language models.

Requirements

• Strong software engineering skills, with proficiency in Python and experience building data pipelines.
• Familiarity with data processing frameworks such as Apache Spark, Apache Beam, Pandas, or similar tools.
• Experience working with large-scale web datasets.
• Knowledge of data quality assessment techniques and experimentation with data mixtures.


Tech stack

Python, Spark, Pandas

