About the role
Develop and manage software for a large fleet of GPU servers.
- Hands-on engineer building software and processes to manage a large fleet of GPU servers, focusing on provisioning, health monitoring, diagnostics, and recovery.
Key Responsibilities
- Build and maintain Python fleet tracking and server management tooling
- Automate provisioning, health checks, GPU diagnostics, recovery, and alerting
- Create metrics, dashboards, and alerting for hardware health
- Implement OS-level security and manage distributed/local storage
Requirements
- 3+ years managing bare-metal and cloud server fleets
- Strong Python and deep Linux systems knowledge
- Experience with Ansible, Terraform, cloud-init, and storage tech (LVM, NVMe, NFS)
- Ability to build internal tools and dashboards and drive cross-team decisions
Tech stack
Python, Linux, Ansible, Terraform
Match insights
Tech: Python, Linux, Ansible, Terraform
Level: Mid