
Software Engineer, Infrastructure

Fal AI · Generative Media company
Turkey · Mid-level
Software Engineering

About the role

Develop and manage software for a large fleet of GPU servers.

  • Hands-on engineer building software and processes to manage a large fleet of GPU servers, focusing on provisioning, health monitoring, diagnostics, and recovery.
Key Responsibilities

  • Build and maintain Python fleet tracking and server management tooling
  • Automate provisioning, health checks, GPU diagnostics, recovery, and alerting
  • Create metrics, dashboards, and alerting for hardware health
  • Implement OS-level security and manage distributed/local storage

Requirements

  • 3+ years managing bare-metal and cloud server fleets
  • Strong Python and deep Linux systems knowledge
  • Experience with Ansible, Terraform, cloud-init, and storage tech (LVM, NVMe, NFS)
  • Ability to build internal tools and dashboards and drive cross-team decisions

Tech stack

Python, Linux, Ansible, Terraform
