Research Engineer, Infrastructure
CognitionAI Software company
San Francisco, United States
Senior
Data & AI
About the role
Build and operate large-scale distributed training and experiment infrastructure for research.
- Build and operate large-scale distributed training and experiment infrastructure to accelerate research across thousands of GPUs.
- Partner closely with researchers to design reliable, high-performance systems for training, rollout, and data pipelines.
Key Responsibilities
- Own distributed training infrastructure: job launchers, checkpointing, fault tolerance, and monitoring.
- Scale and run massive agent rollouts in VM sandboxes and distributed environments.
- Profile and optimize training throughput, memory, and communication.
- Design experiment orchestration, tooling, and high-throughput data pipelines.
- Diagnose and improve reliability across GPUs, networking, and numerics.
Requirements
- Deep experience building and operating distributed training systems for large models.
- Strong systems engineering fundamentals across distributed systems, networking, and storage.
- Proficiency in Python and C++; systems-level experience with PyTorch.
Tech stack
Python, C++, PyTorch, Kubernetes, Docker, Prometheus, Grafana