Software Engineer, Site Reliability
Fal AI
Generative Media company
San Francisco, United States
Senior
Software Engineering
About the role
Ensure reliability and availability of customer-facing systems.
- Seasoned SRE responsible for reliability and availability of customer-facing systems, operating Kubernetes clusters, deployment pipelines, and networking at scale.
- Focuses on SLOs, automation, and incident-driven improvements.
Key Responsibilities
- Operate Kubernetes infrastructure: lifecycle, upgrades, networking, multi-tenant isolation.
- Build and maintain CI/CD pipelines and deployment infrastructure.
- Automate incident analysis/resolution and improve reliability using AI and tooling.
- Implement dashboards, alerting, SLOs, incident response, and chaos engineering.
Requirements
- 5+ years managing critical production systems and software workflows.
- Experience operating Kubernetes at scale and using infrastructure-as-code (Terraform, Ansible).
- Deep knowledge of Linux and container networking (CNI, VXLAN, BGP) and DNS.
- Experience with CI/CD and GitOps (FluxCD, ArgoCD); proficiency in Python and Go or Bash.
Tech stack
Kubernetes, CI/CD, Terraform, Ansible, Python, Go, Bash, Linux, ArgoCD, Prometheus, Grafana