SRE Runbook Template

First published by Atif Alam

A runbook is a step-by-step procedure for a known scenario (see Runbooks and Playbooks for the distinction from playbooks). The sections below are a practical skeleton you can copy into your wiki or repo; trim or extend per service. Optimize for someone paged at 3am, not for a leisurely read.

  • Title — Specific scenario (e.g. “High 5xx rate on checkout API”).
  • Owner — Team or named owner responsible for accuracy.
  • Last updated — Date (and optionally review cadence, e.g. quarterly or after every related incident).
  • Severity — Typical severity when this runbook applies, if helpful.
  • Linked alert / paging policy — Which alert routes here; link to on-call rotation or escalation group if your tool supports it.
  • Where this doc lives — Wiki URL, git path, or “source of truth” so copies do not diverge.
  • What this runbook covers — Symptom or failure mode in one short paragraph.
  • Service or system — Name, boundaries, critical dependencies.
  • Business impact — What degrades for users or internal customers if this issue is real (latency, failed orders, data staleness).
  • Out of scope (optional) — e.g. “Not for physical data center access” or “Not for client-side app bugs.”
  • Dependencies — Other teams, vendors, or services often involved—so responders know when to hand off early.
  • How the issue surfaces — Alert names, dashboard links, typical log patterns or metrics.
  • Normal vs broken — Baseline (“p99 ~40ms”) vs failure (“p99 over 2s sustained 5m”).
  • Noise vs signal — Known false positives, flapping, or noisy alerts; when to wait one evaluation vs act immediately.
  • Correlation — Check recent deploys, config changes, or change calendar; many incidents follow a change event.
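
The "Normal vs broken" and "Noise vs signal" bullets above can be made concrete with a small check. This is a minimal sketch, assuming one p99 latency sample per minute; the function name, sample values, and the 2000 ms threshold are illustrative, not taken from any particular alerting tool.

```python
from typing import Sequence


def is_sustained_breach(samples_ms: Sequence[float],
                        threshold_ms: float,
                        min_consecutive: int) -> bool:
    """Return True only if the most recent `min_consecutive` samples all exceed the threshold."""
    if min_consecutive <= 0 or len(samples_ms) < min_consecutive:
        return False
    return all(s > threshold_ms for s in samples_ms[-min_consecutive:])


# Hypothetical p99 samples in milliseconds, one per minute.
baseline = [38, 41, 40, 39, 42]
spike = [40, 2500, 41, 39, 40]           # one bad evaluation: likely noise
outage = [2100, 2300, 2600, 2400, 2200]  # over 2s for five minutes

print(is_sustained_breach(baseline, 2000, 5))  # False
print(is_sustained_breach(spike, 2000, 5))     # False
print(is_sustained_breach(outage, 2000, 5))    # True
```

Requiring consecutive breaches is one simple way to separate a flapping alert from a real failure; your monitoring system may express the same idea as a "for" duration on the alert rule.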

Link alerts to this runbook from your alerting tool so the runbook is one click away from the page that woke the responder.

  • Who or what is affected — Regions, tenants, feature flags, percentage of traffic.
  • Blast radius — How big could this get if unchecked (see multi-region ideas for geographic scope).
  • SLO / SLA / error budget — Whether this scenario burns budget or breaches a customer commitment.
  • When this becomes an incident — Pointer to Severity and classification so responders know when to declare and pull in comms or leadership.
  • Quick checks — Short, numbered diagnostics (commands, URLs) with expected output when possible.
  • Decision tree — “If X → go to section / runbook Y”; “If not X → continue.”
  • Stopping rule — When to stop triage and escalate (time box, failed check, missing access).
  • Trace and correlation IDs — How to pull request IDs or trace IDs from logs so you can follow one path through the system.
  • Numbered, unambiguous steps — No assumed expert knowledge; define terms or link to glossary once.
  • Commands — Show expected output or “success looks like …” for critical steps.
  • Preflight — Preconditions (maintenance window, quorum, backups) before destructive actions.
  • Rollback — If a fix can worsen things: how to revert config, flag, or deployment; who approves. For release-driven rollback criteria and change-window planning (what to document before a deploy), see Rollback plans—this runbook template stays incident-centric.
  • Secrets — Never paste credentials in the runbook. Link to your vault or secret manager and name the key reference.
  • Disruption — If remediation needs a window or customer-visible change, note it and link to Communicating during incidents when user impact is possible.
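
For the "SLO / SLA / error budget" bullet above, a short calculation can show responders how fast a scenario is burning budget. This is a sketch under stated assumptions: a 99.9% availability SLO, a 30-day budget window, and an observed error ratio of 0.5%. All three numbers and both function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Burn rate = observed error ratio / allowed error ratio (1 - SLO target)."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def hours_to_exhaustion(rate: float, window_hours: float = 30 * 24) -> float:
    """Hours until a full budget is spent if the current burn rate holds."""
    if rate <= 0:
        return float("inf")
    return window_hours / rate


# Assumed values: 0.5% of requests failing against a 99.9% SLO.
rate = burn_rate(error_ratio=0.005, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")
print(f"budget gone in about {hours_to_exhaustion(rate):.0f} hours")
```

A burn rate of 1 means the budget lasts exactly the window; anything well above 1 is a reasonable signal to note in the runbook as "this scenario burns budget".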

Automate stable, repeated steps over time; keep human approval for risky changes—see Toil and automation.
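
One way to automate repeated steps while keeping a human in the loop is to mark risky steps and gate them on approval. The sketch below is illustrative only: the `Step` type, the `run_steps` helper, and the example steps are hypothetical, and real actions would call your own tooling.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], str]  # returns a short result string for the incident log
    risky: bool = False        # True means a human must approve first


def run_steps(steps: List[Step], approve: Callable[[Step], bool]) -> List[str]:
    """Run steps in order; stop at the first risky step that is not approved."""
    log: List[str] = []
    for step in steps:
        if step.risky and not approve(step):
            log.append(f"STOPPED before '{step.name}': approval denied")
            break
        log.append(f"OK '{step.name}': {step.action()}")
    return log


# Hypothetical steps with stubbed actions.
steps = [
    Step("check queue depth", lambda: "depth=12"),
    Step("restart worker pool", lambda: "restarted", risky=True),
    Step("verify recovery", lambda: "healthy"),
]

# In practice `approve` might prompt the on-call; here it always denies.
for line in run_steps(steps, approve=lambda step: False):
    print(line)
```

Stopping at a denied step, instead of skipping it, avoids running later steps that assume the risky change was made.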

  • When to escalate — Time thresholds (e.g. “no progress in 15 minutes”), severity, or safety criteria.
  • Technical escalation — Next on-call line, platform team, or subject-matter expert; how to page (tool, Slack).
  • Non-engineering — When to involve vendor support, executive, or communications (customer-facing status)—align with On-call and escalation.
  • Incident ticket — Link pattern for Jira, PagerDuty incident, or internal tracker.
  • Postmortem — Pointer to your template or Post-incident review process.
  • Update this runbook — If steps were wrong, incomplete, or order mattered in a way the doc did not capture, edit the runbook as part of follow-up (blameless).
  • Feedback — Optional “Was this helpful?” link or owner inbox for continuous improvement.
  • Architecture — Diagram or link to system design doc.
  • Related runbooks — Link, do not duplicate overlapping procedures.
  • Dashboards and logs — Deep links to saved views.
  • Code and config repos — Repositories or IaC paths that matter for this scenario.
  • Status page / comms — Where customer-facing status is updated if applicable.
  • Internal incident channel — Template for Slack/Teams incident room if your org uses one.
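
The "When to escalate" criteria above are easier to follow under stress when they are written as an explicit rule. A minimal sketch, assuming the 15-minute no-progress threshold from the example and a hypothetical `blocked` flag for cases such as missing access or a failed check:

```python
from datetime import datetime, timedelta


def should_escalate(last_progress_at: datetime,
                    now: datetime,
                    limit: timedelta = timedelta(minutes=15),
                    blocked: bool = False) -> bool:
    """Escalate if the responder is blocked or has made no progress within `limit`."""
    return blocked or (now - last_progress_at) >= limit


# Hypothetical timeline for a page at 03:00.
paged = datetime(2024, 1, 1, 3, 0)
print(should_escalate(paged, paged + timedelta(minutes=5)))                # False
print(should_escalate(paged, paged + timedelta(minutes=15)))               # True
print(should_escalate(paged, paged + timedelta(minutes=5), blocked=True))  # True
```

The point is not to run this code during an incident but to force the runbook author to state the threshold and the blocking conditions unambiguously.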
Principle | Meaning
Actionable over explanatory | Steps and checks, not essays. Long context belongs in linked docs.
Tested | Runbooks rot. Validate in game days or fire drills; update when reality changes.
Linked, not duplicated | One canonical procedure; other runbooks point here instead of copying steps.
Audience-aware | Written for a tired on-call with standard access, not only for senior experts.
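
Because runbooks rot, the "Last updated" field and review cadence can be checked mechanically. The sketch below assumes a quarterly cadence of roughly 90 days and a hypothetical mapping of runbook titles to last-updated dates; how you extract those dates from your wiki or repo is left open.

```python
from datetime import date, timedelta
from typing import Dict, List


def stale_runbooks(last_updated: Dict[str, date],
                   today: date,
                   review_every_days: int = 90) -> List[str]:
    """Return titles whose last update is older than the review cadence, sorted by name."""
    cutoff = timedelta(days=review_every_days)
    return sorted(title for title, updated in last_updated.items()
                  if today - updated > cutoff)


# Hypothetical inventory of runbooks and their "Last updated" dates.
inventory = {
    "High 5xx rate on checkout API": date(2024, 1, 10),
    "Queue backlog on order worker": date(2024, 5, 20),
}

print(stale_runbooks(inventory, today=date(2024, 6, 1)))
# ['High 5xx rate on checkout API']
```

A check like this could run on a schedule and notify the owner named in the runbook, which complements validation in drills without replacing it.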