Software Engineer, Data Infrastructure

Cartesia (Voice AI company)

San Francisco, United States

Senior

Data & AI


About the role

Build and manage data infrastructure for AI model training at Cartesia.

•Data is the lifeblood of our models, and we're looking for a Software Engineer to help build the training data and ML data infrastructure at Cartesia.
•This role sits at the intersection of data systems, model training, and inference.
•You'll design and ship the pipelines, datasets, and infrastructure that feed our pre-training and post-training, with particular depth in audio and other multimodal data.

Key Responsibilities

•Contribute to Cartesia's multimodal data strategy across pre-training and post-training, spanning human, synthetic, and web-scale sources, with particular depth in audio.
•Design and build scalable, high-throughput data pipelines for text, audio, and video covering ingestion, preprocessing, augmentation, dataset versioning, and data loading for training.
•Partner closely with research and inference teams so data systems are co-designed with training and serving infrastructure (batching, GPU-aware loading, evaluation pipelines).

Requirements

•Hands-on experience with ML data infrastructure: training data pipelines, dataset versioning, large-scale data loading, and the interplay between data systems and model training and inference.
•Working knowledge of multimodal data, especially audio: formats, preprocessing, augmentation, and large-scale storage and streaming patterns.
•Strong modern engineering execution: clean, well-tested code, fluency with current tools, and a willingness to pick the right tool for the problem rather than defaulting to familiar patterns.


Tech stack

Python, SQL, Airflow, Apache Kafka, AWS, Docker, Kubernetes

