Senior Network & Site Reliability Engineer
Alembic
Causal AI, company
San Francisco
Senior
Software Engineering
About the role
Architect and operate scalable, secure network infrastructure for high-performance computing and ML workloads.
We're building infrastructure that has to perform under real-world scale, reliability, and security demands, and we're looking for an engineer who wants to own the foundation it runs on. This isn't a traditional "keep the lights on" role.
Key Responsibilities
- Architect and operate a scalable, secure network that meets high-security requirements and supports large-scale machine learning workloads.
- Own network device configuration management end to end, ensuring consistency and reliability across the fleet.
- Improve system and network reliability and performance through automation, observability, and proactive capacity planning.
- Build and maintain comprehensive monitoring, alerting, and incident response (SLOs, runbooks, and on-call rotations), and drive post-incident analysis and continuous improvement.
Requirements
- 8+ years in network or infrastructure engineering, including 5+ years in datacenter operations and/or systems and network administration.
Tech stack
Ansible, Terraform, Kubernetes, Prometheus, Grafana, Datadog, Python, Bash, Spark, Airflow