Senior Network & Site Reliability Engineer

Alembic, Causal AI company

San Francisco

Senior

Software Engineering

About the role

Architect and operate scalable, secure network infrastructure for high-performance computing and ML workloads.

We're building infrastructure that has to perform under real-world scale, reliability, and security demands, and we're looking for an engineer who wants to own the foundation it runs on. This isn't a traditional "keep the lights on" role.

Key Responsibilities

•Architect and operate scalable, secure network architecture for high-security requirements and large-scale machine learning workloads.
•Own network device configuration management end to end, ensuring consistency and reliability across the fleet.
•Improve system and network reliability and performance through automation, observability, and proactive capacity planning.
•Build and maintain comprehensive monitoring, alerting, and incident response (SLOs, runbooks, and on-call rotations), and drive post-incident analysis and continuous improvement.

Requirements

•8+ years in network or infrastructure engineering, including 5+ years in datacenter operations and/or systems and network administration.

View original posting for full requirements →

Tech stack

Ansible, Terraform, Kubernetes, Prometheus, Grafana, Datadog, Python, Bash, Spark, Airflow
