Incident Lifecycle

First published Feb 10, 2026, by Atif Alam

A field-tested sequence for SRE/DevOps on-call.

Each step calls out what to do, why it matters, and the concrete artifacts you should leave behind.

Phases At A Glance

#	Phase	What Success Looks Like	Key Actions & Artifacts
0	Preparation (always-on)	People, playbooks and telemetry are ready before anything breaks.	Clear on-call rotation and escalation matrix; severity matrix, incident channel template, runbooks, dashboards; regular game-days & tool fire-drills.
1	Detection & Declaration	The first responder spots the symptom and explicitly “calls the incident.”	Monitoring/alert fires, engineer verifies; declare severity (SEV-x), open incident ticket / Slack channel.
2	Assign Roles (ICS / 3 Cs)	Everyone knows who holds the Coordinate, Communicate, and Control roles.	Incident Commander (IC), Operations Lead (OL), Communications Lead (CL).
3	Triage & Mitigation	Stop the bleeding and shrink blast radius.	Gather current state → form hypotheses; execute safest mitigation first (rollback, failover, feature-flag); IC keeps a running timeline. For rollback and feature flag strategies, see Release Engineering. For region or site failover, see Disaster Recovery.
4	Internal & External Comms	Stakeholders know impact, action, ETA—no rumor mill.	CL posts regular updates (e.g., every 15 min); status-page entries, customer success briefings.
5	Recovery & Verification	Service Level Indicators (SLIs) back in the green.	Gradual traffic ramp-up / canary; IC declares “Resolved” when steady-state proven.
6	Blameless Post-Incident Review (PIR)	Learning is captured, not buried.	Within 24–72 h, write the postmortem (timeline, impact, root cause(s), contributing factors, “could this have been detected sooner?”).
7	Action-Items & Continuous Improvement	The same failure mode can’t bite twice.	Track remediation tasks in backlog with owners and due dates; trend analysis of incident classes to guide larger investments.
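
To make the artifacts from phases 1 to 5 concrete, here is a minimal sketch of an incident record in Python. It is not tied to any particular incident tool: the `Incident` class, its field names, and the 15-minute default cadence (taken from the example in phase 4) are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Incident:
    """Hypothetical incident record; names are illustrative, not a real tool's API."""
    title: str
    severity: str                                   # e.g. "SEV-2", per your own severity matrix
    declared_at: datetime
    roles: dict = field(default_factory=dict)       # role abbreviation -> person
    timeline: list = field(default_factory=list)    # (timestamp, note) tuples
    last_update_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None

    def log(self, when, note):
        """Append an entry to the running timeline the IC keeps."""
        self.timeline.append((when, note))

    def assign_roles(self, when, ic, ol, cl):
        """Record who holds the IC, OL, and CL roles."""
        self.roles = {"IC": ic, "OL": ol, "CL": cl}
        self.log(when, f"Roles assigned: IC={ic}, OL={ol}, CL={cl}")

    def post_update(self, when, message):
        """Record a stakeholder update posted by the CL."""
        self.last_update_at = when
        self.log(when, f"Update posted: {message}")

    def update_overdue(self, now, cadence=timedelta(minutes=15)):
        """True if the incident is open and the last update is older than the cadence."""
        if self.resolved_at is not None:
            return False
        last = self.last_update_at or self.declared_at
        return now - last > cadence

    def resolve(self, when):
        """Mark the incident resolved once steady state is proven."""
        self.resolved_at = when
        self.log(when, "IC declared Resolved")


# Example usage with made-up names and times.
t0 = datetime(2026, 2, 10, 9, 0, tzinfo=timezone.utc)
inc = Incident("Checkout errors", "SEV-2", declared_at=t0)
inc.log(t0, "Declared SEV-2")
inc.assign_roles(t0 + timedelta(minutes=2), ic="alice", ol="bob", cl="carol")
inc.post_update(t0 + timedelta(minutes=5), "Rolling back latest release")

# 20 minutes have passed since the last update, so the CL is due to post again.
print(inc.update_overdue(t0 + timedelta(minutes=25)))  # True
```

Keeping the timeline as plain timestamped notes means it can be pasted almost directly into the postmortem in phase 6.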

For how to define severity and when to declare an incident, see Severity and Classification. For on-call rotation and escalation design, see On-Call and Escalation.

Why this structure works

Role clarity cuts cognitive load; the Incident Command System scales from a two-person page to a cross-org SEV-0.
Rapid mitigation beats root-cause hunting during the fire; root-cause analysis can wait until systems are stable.
Blameless culture encourages full disclosure of mistakes, giving you the data needed for systemic fixes.
Postmortem action tracking turns lessons into reliability capital instead of “shelved docs.”
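
As a rough illustration of the action tracking and trend analysis in phase 7, the sketch below flags open remediation tasks that are past their due date and counts incidents by class. The `ActionItem` type, the helper names, and the sample data are hypothetical; in practice this information would come from your own backlog or ticketing system.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """Hypothetical remediation task with an owner and a due date."""
    summary: str
    owner: str
    due: date
    done: bool = False


def overdue(items, today):
    """Return open items whose due date has already passed."""
    return [i for i in items if not i.done and i.due < today]


def incident_class_trend(classes):
    """Count incidents per class, most frequent first."""
    return Counter(classes).most_common()


# Made-up sample data for illustration.
items = [
    ActionItem("Add alert on queue depth", "bob", date(2026, 2, 20)),
    ActionItem("Automate rollback", "alice", date(2026, 3, 15)),
    ActionItem("Fix runbook link", "carol", date(2026, 2, 12), done=True),
]
late = overdue(items, today=date(2026, 3, 1))
print([i.summary for i in late])  # ['Add alert on queue depth']

trend = incident_class_trend(
    ["deploy", "capacity", "deploy", "dns", "deploy", "capacity"]
)
print(trend)  # [('deploy', 3), ('capacity', 2), ('dns', 1)]
```

A class that keeps topping the trend list is a candidate for one of the larger investments mentioned in phase 7, rather than another one-off fix.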

Mnemonic

P-D-R-M-C-R-P-A

Prepare → Detect/Declare → Roles → Mitigate → Communicate → Restore → Postmortem → Action-items.
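
If you want the phase order available to tooling (for example, to label timeline entries), one possible encoding is an ordered enum. This is only a sketch: the `Phase` enum and the helpers are assumptions, and the strict linear order is a simplification, since communication in practice runs alongside mitigation and recovery.

```python
from enum import IntEnum


class Phase(IntEnum):
    """Hypothetical encoding of phases 0-7 from the table, in order."""
    PREPARE = 0
    DETECT_DECLARE = 1
    ROLES = 2
    MITIGATE = 3
    COMMUNICATE = 4
    RESTORE = 5
    POSTMORTEM = 6
    ACTION_ITEMS = 7


def mnemonic():
    """Build the mnemonic from the first letter of each phase name."""
    return "-".join(p.name[0] for p in Phase)


def next_phase(phase):
    """Return the phase that follows, or None after the last one."""
    return Phase(phase + 1) if phase < Phase.ACTION_ITEMS else None


print(mnemonic())                  # P-D-R-M-C-R-P-A
print(next_phase(Phase.MITIGATE))  # Phase.COMMUNICATE on most Python versions
```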

Post-incident phase in detail

For phase 6 in detail, including a repeatable 10-step postmortem checklist and guiding principles, see Post-incident review.