Research Engineer, Infrastructure

Cognition

AI Software company

San Francisco, United States

Senior

Data & AI


About the role

Build and operate large-scale distributed training and experiment infrastructure for research.

•Build and operate large-scale distributed training and experiment infrastructure to accelerate research across thousands of GPUs.
•Partner closely with researchers to design reliable, high-performance systems for training, rollout, and data pipelines.

Key Responsibilities

•Own distributed training infrastructure: job launchers, checkpointing, fault tolerance, and monitoring.
•Scale and run massive agent rollouts in VM sandboxes and distributed environments.
•Profile and optimize training throughput, memory, and communication.
•Design experiment orchestration, tooling, and high-throughput data pipelines.
•Diagnose and improve reliability across GPUs, networking, and numerics.

Requirements

•Deep experience building and operating distributed training systems for large models.
•Strong systems engineering fundamentals across distributed systems, networking, and storage.
•Proficiency in Python and C++; systems-level experience with PyTorch.


Tech stack

Python, C++, PyTorch, Kubernetes, Docker, Prometheus, Grafana

