Software Engineer, Site Reliability

Fal AI · Generative Media company

San Francisco, United States · Senior

Software Engineering

About the role

Ensure reliability and availability of customer-facing systems.

• Seasoned SRE responsible for reliability and availability of customer-facing systems, operating Kubernetes clusters, deployment pipelines, and networking at scale.
• Focuses on SLOs, automation, and incident-driven improvements.

Key Responsibilities

• Operate Kubernetes infrastructure: lifecycle, upgrades, networking, multi-tenant isolation.
• Build and maintain CI/CD pipelines and deployment infrastructure.
• Automate incident analysis/resolution and improve reliability using AI and tooling.
• Implement dashboards, alerting, SLOs, incident response, and chaos engineering.

Requirements

• 5+ years managing critical production systems and software workflows.
• Experience operating Kubernetes at scale and using infrastructure-as-code (Terraform, Ansible).
• Deep knowledge of Linux and container networking (CNI, VXLAN, BGP) and DNS.
• Experience with CI/CD and GitOps (FluxCD, ArgoCD); proficiency in Python and Go or Bash.

Tech stack

Kubernetes, CI/CD, Terraform, Ansible, Python, Go, Bash, Linux, ArgoCD, Prometheus, Grafana
