Software Engineer, Site Reliability

Fal AI
Generative Media company
San Francisco, United States
Senior
Software Engineering

About the role

Ensure reliability and availability of customer-facing systems.

  • Seasoned SRE responsible for reliability and availability of customer-facing systems, operating Kubernetes clusters, deployment pipelines, and networking at scale.
  • Focuses on SLOs, automation, and incident-driven improvements.

Key responsibilities

  • Operate Kubernetes infrastructure: lifecycle, upgrades, networking, multi-tenant isolation.
  • Build and maintain CI/CD pipelines and deployment infrastructure.
  • Automate incident analysis and resolution, and improve reliability using AI and tooling.
  • Implement dashboards, alerting, SLOs, incident response, and chaos engineering.

Requirements

  • 5+ years managing critical production systems and software workflows.
  • Experience operating Kubernetes at scale and using infrastructure-as-code (Terraform, Ansible).
  • Deep knowledge of Linux and container networking (CNI, VXLAN, BGP) and DNS.
  • Experience with CI/CD and GitOps (FluxCD, ArgoCD); proficiency in Python and Go or Bash.

Tech stack

Kubernetes, CI/CD, Terraform, Ansible, Python, Go, Bash, Linux, ArgoCD, Prometheus, Grafana
