
Senior Network & Site Reliability Engineer

Company: Alembic (Causal AI)
Location: San Francisco
Level: Senior
Category: Software Engineering

About the role

Architect and operate scalable, secure network infrastructure for high-performance computing and ML workloads.

We're building infrastructure that has to perform under real-world scale, reliability, and security demands, and we're looking for an engineer who wants to own the foundation it runs on. This isn't a traditional "keep the lights on" role.

Key Responsibilities

  • Architect and operate a scalable, secure network that meets high-security requirements and supports large-scale machine learning workloads.
  • Own network device configuration management end to end, ensuring consistency and reliability across the fleet.
  • Improve system and network reliability and performance through automation, observability, and proactive capacity planning.
  • Build and maintain comprehensive monitoring, alerting, and incident response (SLOs, runbooks, and on-call rotations), and drive post-incident analysis and continuous improvement.

Requirements

  • 8+ years in network or infrastructure engineering, including 5+ years in datacenter operations and/or systems and network administration.

Tech stack

Ansible, Terraform, Kubernetes, Prometheus, Grafana, Datadog, Python, Bash, Spark, Airflow

Match insights

Tech: Ansible, Terraform, Kubernetes, Prometheus, Grafana
Level: Senior
