Site Reliability Engineer

Location: Redwood City, CA (On-Site) • Full-Time

About Hammerhead

We're unleashing AI with intelligent orchestration while addressing one of the most pressing bottlenecks for AI: access to power. Our cutting-edge platform optimizes data center power infrastructure to maximize AI workload throughput within existing electrical limits, without requiring new power plants or grid expansions.

Our platform uses reinforcement learning to intelligently orchestrate power, cooling, and compute in real time, enabling data centers to run significantly more AI workloads within their existing electrical and thermal limits. Our team previously optimized over 8 gigawatts of mission-critical power globally at AutoGrid. At Hammerhead, we're addressing a $64 billion-per-year market opportunity while dramatically reducing the environmental footprint of AI infrastructure.

About the Role

We are seeking a Site Reliability Engineer to own the reliability, scalability, and operational excellence of Hammerhead's AI-driven power orchestration platform. Our software runs in production data centers around the world, where real-time decisions directly affect gigawatts of compute infrastructure. Availability, latency, and correctness are not negotiable.

You will work at the boundary between software and infrastructure, building the systems that deploy, monitor, and protect Hammerhead's platform in production. You will partner with engineering teams to establish SLOs, automate toil, accelerate releases, and ensure that when things go wrong, we know fast and recover faster.

This is a foundational SRE role. You will be the first dedicated hire in this function. You will set the standard for how Hammerhead runs software in production.

You will report to the Head of Engineering.

Key Responsibilities

- Own production reliability for Hammerhead's platform: define and enforce SLOs, SLAs, and error budgets across services, and drive resolution when they are breached.
- Build and maintain the observability stack: metrics, logging, distributed tracing, and alerting across cloud and on-prem deployment environments.
- Architect and manage CI/CD pipelines that enable fast, safe, and repeatable deployments to production data center environments.
- Automate operational toil: provisioning, configuration management, scaling, failover, and incident response workflows.
- Lead incident response: act as the primary on-call escalation point, run blameless post-mortems, and drive systemic fixes that prevent recurrence.
- Partner with software and RL engineers to bake reliability into the development lifecycle: code reviews, deployment checklists, chaos testing, and load testing.
- Manage and evolve Hammerhead's cloud infrastructure (primarily AWS) and edge deployment infrastructure at customer data center sites.
- Establish security and compliance practices for production environments: secrets management, access controls, audit logging, and vulnerability remediation.
- Evaluate and introduce tooling that improves platform velocity and reliability, from container orchestration to infrastructure-as-code to incident management platforms.

Qualifications

Required:
- 4+ years of experience in site reliability engineering, DevOps, or platform/infrastructure engineering in production environments.
- Deep proficiency with Kubernetes and container orchestration in production.
- Strong infrastructure-as-code experience with Terraform, Pulumi, or equivalent.
- Hands-on experience with observability tools (Prometheus, Grafana, Datadog, OpenTelemetry, or equivalent).
- Expertise in Python for writing automation scripts, internal tooling, and operational runbooks.
- Experience managing CI/CD pipelines (GitHub Actions, ArgoCD, or equivalent).
- Strong incident response instincts.
- Comfortable working in environments with strict operational requirements.

Preferred:
- Experience deploying or operating software in industrial, energy, or data center environments.
- Familiarity with ML/AI system operations.
- Experience with advanced Kubernetes networking and security primitives.
- Background in chaos engineering or formal reliability testing frameworks.
- Prior experience as the first or founding SRE at an early-stage company.

What We Offer
- Competitive base salary + meaningful equity
- Comprehensive health, dental, and vision insurance
- The opportunity to build Hammerhead's reliability function from the ground up
- A collaborative, low-ego team of world-class engineers solving genuinely hard problems
- Work that matters: our platform reduces the energy footprint of AI at scale