Research Engineer, Infrastructure

Cognition, AI Software company
San Francisco, United States
Senior
Data & AI

About the role

Build and operate large-scale distributed training and experiment infrastructure for research.

  • Build and operate large-scale distributed training and experiment infrastructure to accelerate research across thousands of GPUs.
  • Partner closely with researchers to design reliable, high-performance systems for training, rollout, and data pipelines.

Key Responsibilities

  • Own distributed training infrastructure: job launchers, checkpointing, fault tolerance, and monitoring.
  • Scale and run massive agent rollouts in VM sandboxes and distributed environments.
  • Profile and optimize training throughput, memory, and communication.
  • Design experiment orchestration, tooling, and high-throughput data pipelines.
  • Diagnose and improve reliability across GPUs, networking, and numerics.

Requirements

  • Deep experience building and operating distributed training systems for large models.
  • Strong systems engineering fundamentals across distributed systems, networking, and storage.
  • Proficiency in Python and C++; systems-level experience with PyTorch.

Tech stack

Python, C++, PyTorch, Kubernetes, Docker, Prometheus, Grafana

Match insights

Tech: Python, C++, PyTorch, Kubernetes, Docker
Level: Senior
