Incident Lifecycle
A field-tested sequence for SRE/DevOps on-call.
Each step calls out what to do, why it matters, and the concrete artifacts you should leave behind.
Phases At A Glance
| # | Phase | What Success Looks Like | Key Actions & Artifacts |
|---|---|---|---|
| 0 | Preparation (always-on) | People, playbooks and telemetry are ready before anything breaks. | Clear on-call rotation and escalation matrix; severity matrix, incident channel template, runbooks, dashboards; regular game-days & tool fire-drills. |
| 1 | Detection & Declaration | The first responder spots the symptom and explicitly “calls the incident.” | Monitoring/alert fires, engineer verifies; declare severity (SEV-x), open incident ticket / Slack channel. |
| 2 | Assign Roles (ICS / 3 Cs) | Everyone knows who coordinates, who communicates, and who controls. | Name an Incident Commander (IC), Operations Lead (OL), and Communications Lead (CL). |
| 3 | Triage & Mitigation | Stop the bleeding and shrink blast radius. | Gather current state → form hypotheses; execute safest mitigation first (rollback, failover, feature-flag); IC keeps a running timeline. For rollback and feature flag strategies, see Release Engineering. For region or site failover, see Disaster Recovery. |
| 4 | Internal & External Comms | Stakeholders know impact, action, ETA—no rumor mill. | CL posts regular updates (e.g., every 15 min); status-page entries, customer success briefings. |
| 5 | Recovery & Verification | Service Level Indicators (SLIs) back in the green. | Gradual traffic ramp-up / canary; IC declares “Resolved” when steady-state proven. |
| 6 | Blameless Post-Incident Review (PIR) | Learning is captured, not buried. | Within 24–72 h, write the postmortem (timeline, impact, root cause(s), contributing factors, “could this have been detected sooner?”). |
| 7 | Action-Items & Continuous Improvement | The same failure mode can’t bite twice. | Track remediation tasks in backlog with owners and due dates; trend analysis of incident classes to guide larger investments. |
For how to define severity and when to declare an incident, see Severity and Classification. For on-call rotation and escalation design, see On-Call and Escalation.
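The declaration, role-assignment, timeline, and update-cadence artifacts from the table can be kept in one small record. The following is a minimal sketch, not a prescribed tool: the `Incident` and `TimelineEntry` classes, their field names, and the `REQUIRED_ROLES` tuple are all hypothetical, and the 15-minute default simply mirrors the example cadence in phase 4.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

# Role abbreviations taken from the phases table (IC, OL, CL).
REQUIRED_ROLES = ("IC", "OL", "CL")


@dataclass
class TimelineEntry:
    at: datetime
    author: str
    note: str


@dataclass
class Incident:
    """Hypothetical incident record; names and fields are illustrative."""
    title: str
    severity: str                      # e.g. "SEV-2"; label scheme is an assumption
    declared_at: datetime
    roles: Dict[str, str] = field(default_factory=dict)
    timeline: List[TimelineEntry] = field(default_factory=list)
    last_update_at: Optional[datetime] = None

    def log(self, at: datetime, author: str, note: str) -> None:
        """Append one entry to the running timeline."""
        self.timeline.append(TimelineEntry(at, author, note))

    def assign(self, role: str, person: str, at: datetime) -> None:
        """Record who holds a role and note the assignment in the timeline."""
        if role not in REQUIRED_ROLES:
            raise ValueError(f"unknown role: {role}")
        self.roles[role] = person
        self.log(at, person, f"assigned as {role}")

    def missing_roles(self) -> List[str]:
        """Roles that nobody has picked up yet."""
        return [r for r in REQUIRED_ROLES if r not in self.roles]

    def post_update(self, at: datetime, author: str, message: str) -> None:
        """Record a stakeholder update and reset the cadence clock."""
        self.last_update_at = at
        self.log(at, author, f"update: {message}")

    def update_overdue(self, now: datetime,
                       cadence: timedelta = timedelta(minutes=15)) -> bool:
        """True when more than `cadence` has passed since the last update
        (or since declaration, if no update has been posted)."""
        reference = self.last_update_at or self.declared_at
        return now - reference > cadence


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    incident = Incident("checkout errors", "SEV-2", declared_at=start)
    incident.assign("IC", "alice", start)
    print("unassigned roles:", incident.missing_roles())
    print("update overdue:", incident.update_overdue(start + timedelta(minutes=20)))
```

Keeping the timeline as plain timestamped entries means it can be pasted straight into the postmortem later, which is the main reason the sketch logs role assignments and updates in the same list.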
Why this structure works
- Role clarity cuts cognitive load; the Incident Command System scales from a two-person page to a cross-org SEV-0.
- Rapid mitigation beats root-cause hunting during the fire; root-cause analysis can wait until systems are stable.
- Blameless culture encourages full disclosure of mistakes, giving you the data needed for systemic fixes.
- Postmortem action tracking turns lessons into reliability capital instead of “shelved docs.”
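Tracking action items with owners and due dates, plus a rough trend count per incident class, can be sketched in a few lines. This is an illustration under stated assumptions: the `ActionItem` fields, the class labels such as `"deploy"`, and both helper functions are hypothetical, and a real setup would more likely query an issue tracker than hold items in memory.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass
class ActionItem:
    """Hypothetical remediation task produced by a postmortem."""
    incident_id: str
    incident_class: str   # e.g. "deploy", "capacity"; labels are assumptions
    summary: str
    owner: str
    due: date
    done: bool = False


def overdue(items: Iterable[ActionItem], today: date) -> List[ActionItem]:
    """Open items whose due date has already passed."""
    return [item for item in items if not item.done and item.due < today]


def incidents_per_class(items: Iterable[ActionItem]) -> Counter:
    """Count distinct incidents per class, so an incident with several
    action items is only counted once."""
    seen = {(item.incident_id, item.incident_class) for item in items}
    return Counter(cls for _, cls in seen)


if __name__ == "__main__":
    backlog = [
        ActionItem("INC-1", "deploy", "add canary stage", "alice", date(2024, 2, 1)),
        ActionItem("INC-1", "deploy", "alert on error rate", "bob", date(2024, 3, 1), done=True),
        ActionItem("INC-2", "capacity", "raise queue limits", "carol", date(2024, 2, 10)),
        ActionItem("INC-3", "deploy", "document rollback", "dave", date(2024, 4, 1)),
    ]
    for item in overdue(backlog, today=date(2024, 2, 15)):
        print("overdue:", item.incident_id, item.summary, "owner:", item.owner)
    print(incidents_per_class(backlog))
```

Counting incidents rather than action items keeps one noisy postmortem from skewing the trend toward its class.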
Mnemonic
P-D-R-M-C-R-P-A
Prepare → Detect/Declare → Roles → Mitigate → Communicate → Restore → Postmortem → Action-items.
Post-incident phase in detail
For the detailed post-incident phase (phase 6), including a repeatable 10-step post-mortem checklist and guiding principles, see Post-incident review.