Production Readiness

First Published By Atif Alam

Production readiness is the set of practices that ensure a service is safe to run in production—before it goes live and as it evolves.

It covers readiness checklists, reliability reviews, service ownership models, on-call maturity, and risk assessments: how to decide when something is ready and who owns keeping it reliable.

Existing Incident Management content covers runbooks, on-call, severity, and post-incident review.

Production readiness ties them together with governance: who is responsible, what “ready” means, and how you review and improve over time.

A readiness checklist is a gate that a service or change must pass before it is allowed in production. It makes “are we ready?” explicit instead of implicit.

  • What to include — Observability (metrics, logs, traces; SLOs defined and instrumented); runbooks for common failures and escalation; on-call ownership and rotation; security and compliance (e.g. secrets, access); capacity (see Capacity Planning for the basics); backup and restore, with RTO/RPO where applicable. Tailor to your org: not every service needs every item, but the list should be consistent and reviewable.
  • Who runs it — Often the team that owns the service, with a sign-off from a reliability or platform lead for critical services. For smaller teams, a lightweight self-check with a shared template is enough; for high-risk or regulated workloads, a formal review may be required.
  • When — Before first production launch; for major changes (e.g. new dependency, new region); and periodically (e.g. annually) to catch drift. See Runbooks and Playbooks for keeping runbooks current as part of readiness.

Checklists prevent “we didn’t think of that” and create a baseline for reliability reviews.
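
One way to make the gate explicit is to keep the checklist as data and compute what is still outstanding. The following is a minimal sketch, not a prescribed tool: the `ChecklistItem` type, the `required` flag (standing in for "not every service needs every item"), and the example entries are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistItem:
    name: str
    required: bool  # False for items that do not apply to this service
    done: bool


def readiness_gaps(items: list[ChecklistItem]) -> list[str]:
    """Return the names of required items that are not yet done."""
    return [item.name for item in items if item.required and not item.done]


def is_ready(items: list[ChecklistItem]) -> bool:
    """The gate passes only when no required item is outstanding."""
    return not readiness_gaps(items)


# Hypothetical checklist for an example service.
checklist = [
    ChecklistItem("SLOs defined and instrumented", required=True, done=True),
    ChecklistItem("Runbooks for common failures", required=True, done=False),
    ChecklistItem("On-call assigned and documented", required=True, done=True),
    ChecklistItem("Backup and restore tested", required=False, done=False),
]

print(is_ready(checklist))
print(readiness_gaps(checklist))
```

With the entries above, the gate fails and the only reported gap is the missing runbook. The optional backup item does not block the launch.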

A reliability review is a periodic (e.g. quarterly) look at a service’s health: SLO performance, incident history, on-call load, tech debt, and upcoming risks.

It’s a governance practice that keeps production readiness from decaying.

  • Agenda — Review SLO and error budget burn; incidents since last review and post-incident follow-ups; runbook and on-call coverage; capacity and scaling; dependencies and failover readiness. Identify gaps and assign actions (e.g. add runbook, improve alerting, reduce toil).
  • Who attends — Service owners, on-call reps, and optionally SRE or platform. Keep it focused so it doesn’t become a status meeting. Output: a short summary and a list of reliability actions with owners.
  • Connection to SLO lifecycle — Reliability reviews can feed into SLO lifecycle (e.g. “we’re consistently missing this SLO; do we change the target or invest in fixes?”). They also surface when readiness has slipped (e.g. runbooks stale, no one on call for a critical path).

Reviews turn “we’re fine” into evidence and action.
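
The "SLO and error budget burn" agenda item can be backed by a small calculation. This sketch assumes a request-based SLO where the error budget is the fraction of requests allowed to fail in the review window. The function name, the 99.9% target, and the request counts are illustrative, not taken from any particular monitoring system.

```python
def error_budget_consumed(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget used in the window (1.0 means fully spent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0:
        return 0.0  # no traffic in the window, so nothing was spent
    allowed_bad = (1 - slo_target) * total  # failures the SLO tolerates
    actual_bad = total - good
    return actual_bad / allowed_bad


# Hypothetical quarter: 99.9% target, 1,000,000 requests, 600 of them failed.
consumed = error_budget_consumed(slo_target=0.999, good=999_400, total=1_000_000)
print(f"error budget consumed: {consumed:.0%}")

# A value above 1.0 means the SLO was missed and is worth an agenda slot.
budget_exhausted = consumed > 1.0
print(budget_exhausted)
```

In this example roughly 60% of the budget is spent, so the SLO was met. A result above 100% is the cue for the "change the target or invest in fixes?" discussion.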

Service ownership means a clear answer to “who is responsible for this service in production?” Without it, incidents and reliability work have no home.

  • Model — One or more named owners (team or person) per service. Owners are accountable for availability, SLOs, runbooks, on-call, and responding to incidents. They don’t have to do everything themselves (e.g. platform may run the infra) but they are the escalation point and the decision-makers for that service.
  • Visibility — Maintain a service catalog or registry: service name, owners, SLOs, runbooks, on-call link. New joiners and incident responders need to find owners quickly. See Incident Lifecycle for how ownership fits into detection and response.
  • Handoff — When ownership changes (team reorg, service deprecated or merged), update the catalog, runbooks, and on-call; run a handoff review so the new owners know the state of the service. Avoid orphaned services with no clear owner.

Ownership makes readiness and reliability someone’s job.
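
A service catalog entry can be as simple as a record with the fields listed above, plus a check that flags orphaned or incomplete entries. This is a sketch under stated assumptions: the `ServiceEntry` shape, the service and team names, and the `example.com` URLs are hypothetical and do not reflect any specific catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceEntry:
    name: str
    owners: list[str] = field(default_factory=list)  # team or person names
    slos: list[str] = field(default_factory=list)
    runbook_url: str = ""
    oncall_url: str = ""


def orphaned_services(catalog: list[ServiceEntry]) -> list[str]:
    """Services with no named owner."""
    return [entry.name for entry in catalog if not entry.owners]


def missing_fields(entry: ServiceEntry) -> list[str]:
    """Catalog fields that responders would look for but are empty."""
    checks = {
        "owners": entry.owners,
        "slos": entry.slos,
        "runbook_url": entry.runbook_url,
        "oncall_url": entry.oncall_url,
    }
    return [name for name, value in checks.items() if not value]


# Hypothetical catalog contents.
catalog = [
    ServiceEntry(
        name="checkout",
        owners=["payments-team"],
        slos=["99.9% successful requests"],
        runbook_url="https://example.com/runbooks/checkout",
        oncall_url="https://example.com/oncall/payments",
    ),
    ServiceEntry(name="legacy-reports"),  # left behind after a reorg
]

print(orphaned_services(catalog))
print(missing_fields(catalog[1]))
```

Running a check like this on a schedule, or during a handoff review, is one way to catch services that have lost their owner.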

On-call maturity is the level of structure and support around being on call: clear rotations, escalation, runbooks, and sustainable load.

It affects both readiness (can we respond?) and team health (is on-call burning people out?).

  • Baseline — On-Call and Escalation covers rotation models and escalation paths. Maturity adds: documented expectations (response time, what “acknowledge” means); runbooks that reduce “what do I do?” stress; and coverage so no one is permanently overloaded.
  • Improving maturity — Reduce pages that don’t need a human (e.g. alerting tuned to SLO burn rate instead of every blip); automate common fixes so runbooks are short; track on-call load (pages per rotation, incidents per person) and fix the worst offenders. See Toil and Automation for reducing repetitive operational work.
  • Readiness link — A service isn’t ready if no one is on call for it or if the on-call burden is unsustainable. Readiness checklists should include “on-call assigned and documented.”

Maturity keeps on-call from being a constant fire drill.
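
Tracking on-call load can start from a plain export of pages. The sketch below assumes each page is a `(responder, alert_name)` pair. The names, the sample data, and the threshold of three pages per rotation are made up for illustration, so pick limits that fit your own team.

```python
from collections import Counter

# Hypothetical pages from one rotation: (responder, alert name).
pages = [
    ("alice", "disk-full"),
    ("alice", "disk-full"),
    ("bob", "disk-full"),
    ("alice", "cert-expiry"),
    ("bob", "latency-slo-burn"),
    ("alice", "disk-full"),
]


def pages_per_person(pages: list[tuple[str, str]]) -> Counter:
    """Count how many pages each responder received."""
    return Counter(person for person, _alert in pages)


def noisiest_alerts(pages: list[tuple[str, str]], top_n: int) -> list[tuple[str, int]]:
    """Alerts that paged most often: candidates for tuning or automation."""
    return Counter(alert for _person, alert in pages).most_common(top_n)


def overloaded(pages: list[tuple[str, str]], max_pages: int) -> list[str]:
    """Responders who received more than max_pages pages, sorted by name."""
    load = pages_per_person(pages)
    return sorted(person for person, count in load.items() if count > max_pages)


print(pages_per_person(pages))
print(noisiest_alerts(pages, top_n=1))
print(overloaded(pages, max_pages=3))
```

Here a single alert accounts for most of the pages, which suggests that fixing or automating it would reduce load for everyone on the rotation.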

A risk assessment is a structured look at “what could go wrong and how bad would it be?” It informs readiness (e.g. we need a runbook for X) and prioritization (e.g. we should fix Y before we scale).

  • What to assess — Failure modes (dependency down, region down, data loss, bad deploy); likelihood and impact; mitigations (redundancy, failover, backup, progressive delivery); and residual risk. You don’t need a formal risk matrix for every service—a lightweight “what keeps us up at night?” list is often enough.
  • When — For new services or major changes (e.g. new region, new critical dependency); after a major incident (did we miss something?); and periodically for critical services. Risk assessments feed into readiness checklists (e.g. “if dependency X is a single point of failure, we need a runbook and escalation”).
  • Governance — In regulated or high-impact domains, risk assessments may be required and documented. In most teams, a short discussion in a reliability review or launch review is sufficient. The goal is to make risks explicit so they can be accepted or mitigated.

Risk assessments close the gap between “we think we’re ready” and “we’ve thought about what could fail.”
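
A lightweight risk list can still be ranked so that the discussion starts with the biggest concerns. The sketch below assumes a simple 1 to 3 scale for likelihood and impact and uses their product as a score. The scale, the scoring rule, and the example entries are illustrative choices, not a standard this guide prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    failure_mode: str
    likelihood: int  # 1 (rare) to 3 (likely); assumed scale
    impact: int      # 1 (minor) to 3 (severe); assumed scale
    mitigation: str = ""  # empty string means no mitigation yet

    @property
    def score(self) -> int:
        return self.likelihood * self.impact


def prioritized(risks: list[Risk]) -> list[Risk]:
    """Highest score first; ties keep their original order."""
    return sorted(risks, key=lambda risk: risk.score, reverse=True)


def needs_action(risks: list[Risk], min_score: int) -> list[str]:
    """Unmitigated risks at or above min_score: mitigate or explicitly accept."""
    return [
        risk.failure_mode
        for risk in prioritized(risks)
        if risk.score >= min_score and not risk.mitigation
    ]


# Hypothetical "what keeps us up at night?" list.
risks = [
    Risk("dependency down", likelihood=3, impact=2),
    Risk("region down", likelihood=1, impact=3, mitigation="failover to second region"),
    Risk("bad deploy", likelihood=2, impact=2, mitigation="progressive delivery"),
    Risk("data loss", likelihood=1, impact=3),
]

print([risk.failure_mode for risk in prioritized(risks)])
print(needs_action(risks, min_score=4))
```

The output of `needs_action` maps directly to checklist items: each entry either gets a mitigation and a runbook, or is recorded as an accepted risk.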