Game-Days and Drills
A chaos experiment tests a component. A game-day tests the whole system—technology, people, and process together.
Game-days are scheduled exercises where you simulate a failure scenario and walk through the response end-to-end: detection, diagnosis, mitigation, communication, and recovery.
They answer questions that automated experiments can’t: Do the right people get paged? Can the on-call follow the runbook? Does the team know how to communicate during an incident?
Game-Days vs. Drills
| | Game-Day | Drill |
|---|---|---|
| Scope | Broad: end-to-end failure scenario involving multiple teams or systems | Narrow: single procedure or tool (e.g. “restore this database from backup”) |
| Duration | 1-4 hours | 15-60 minutes |
| Participants | Cross-functional: engineering, on-call, sometimes leadership and comms | Usually one team or one person |
| Goal | Validate the full response chain: detection → response → recovery → communication | Validate a specific runbook step or tool works |
Both are valuable.
Drills build muscle memory for individual procedures. Game-days test how those procedures fit together under pressure.
Running A Game-Day
Before: Planning
- Pick a Scenario. Choose a realistic failure that would have high impact. Examples:
  - Primary database becomes unavailable.
  - A major dependency returns errors for 30 minutes.
  - A region goes down and you need to fail over.
  - A bad deploy causes elevated error rates for a critical API.
- Define Objectives. What are you testing? Examples:
  - “Can the on-call detect the problem within 5 minutes?”
  - “Does the failover runbook work end-to-end?”
  - “Can we communicate status to stakeholders within 10 minutes?”
- Assign Roles. At minimum:
  - Facilitator: Runs the exercise, injects the failure, keeps things on track. Does not participate as a responder.
  - Observers: Take notes on what happens, what goes well, and where things break down.
  - Responders: The team that would normally respond to this incident. They should not know the scenario in advance (or at least not the details).
- Set Safety Boundaries. Will you inject real failures or simulate them? If injecting, define abort conditions (see Chaos Experiments: Blast Radius Control). If simulating, prepare mock alerts, dashboards, or narrated scenarios.
- Notify Stakeholders. Let affected teams and leadership know a game-day is happening so real alerts during the exercise don’t cause confusion.
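The planning steps above can be captured as a small, checkable record. The following is a minimal sketch, not a prescribed format: the `GameDayPlan` class, its field names, and the wording of each check are all hypothetical and chosen only to mirror the steps in this section.

```python
from dataclasses import dataclass, field


@dataclass
class GameDayPlan:
    """Hypothetical planning record; fields mirror the planning steps."""
    scenario: str
    objectives: list[str]
    facilitator: str
    observers: list[str]
    responders: list[str]
    inject_real_failure: bool
    abort_conditions: list[str] = field(default_factory=list)
    stakeholders_notified: bool = False

    def problems(self) -> list[str]:
        """Return planning gaps; an empty list means nothing was flagged."""
        found = []
        if not self.objectives:
            found.append("no objectives defined")
        if not self.observers:
            found.append("no observers assigned")
        if not self.responders:
            found.append("no responders assigned")
        # The facilitator runs the exercise and should not also respond.
        if self.facilitator in self.responders:
            found.append("facilitator must not also be a responder")
        # Real injection requires abort conditions; simulation does not.
        if self.inject_real_failure and not self.abort_conditions:
            found.append("real failure injection needs abort conditions")
        if not self.stakeholders_notified:
            found.append("stakeholders not notified")
        return found


plan = GameDayPlan(
    scenario="Primary database becomes unavailable",
    objectives=["On-call detects the problem within 5 minutes"],
    facilitator="alex",
    observers=["sam"],
    responders=["alex", "jo"],
    inject_real_failure=True,
)
for problem in plan.problems():
    print(problem)
```

Running a check like this before the exercise is one way to catch a missing abort condition or a facilitator who has been double-booked as a responder.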
During: Execution
- Inject or Simulate the Failure. Start the clock.
- Let the Team Respond. Don’t coach—observe. The value is in seeing how the team actually responds, not how they respond with hints.
- Track Timeline. Note when the alert fired, when the on-call responded, when the runbook was opened, when mitigation was applied, when recovery was confirmed.
- Enforce Time Limits. If the team is stuck for too long, the facilitator can offer a hint or end the exercise. A game-day that runs 4 hours without progress isn’t productive.
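The timeline an observer tracks can be compared directly against the objectives set during planning. This is a rough sketch under stated assumptions: the `Timeline` class and the milestone names are hypothetical, chosen to match the events listed above, and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

# Milestones in the order the text lists them (names are illustrative).
MILESTONES = [
    "failure_injected",
    "alert_fired",
    "oncall_responded",
    "runbook_opened",
    "mitigation_applied",
    "recovery_confirmed",
]


class Timeline:
    """Records when each milestone happened during the exercise."""

    def __init__(self) -> None:
        self.events: dict[str, datetime] = {}

    def mark(self, milestone: str, at: Optional[datetime] = None) -> None:
        if milestone not in MILESTONES:
            raise ValueError(f"unknown milestone: {milestone}")
        self.events[milestone] = at or datetime.now()

    def elapsed(self, start: str, end: str) -> Optional[timedelta]:
        """Time between two milestones, or None if either is missing."""
        if start not in self.events or end not in self.events:
            return None
        return self.events[end] - self.events[start]

    def met(self, start: str, end: str, limit: timedelta) -> bool:
        """True only if both milestones were recorded within the limit."""
        delta = self.elapsed(start, end)
        return delta is not None and delta <= limit


t0 = datetime(2024, 1, 1, 10, 0)
timeline = Timeline()
timeline.mark("failure_injected", t0)
timeline.mark("alert_fired", t0 + timedelta(minutes=7))

# Objective: "Can the on-call detect the problem within 5 minutes?"
print(timeline.met("failure_injected", "alert_fired", timedelta(minutes=5)))  # False
```

A milestone that was never recorded counts as an unmet objective here, which matches the case where an alert did not fire at all.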
After: Debrief
Run a debrief (similar to a post-incident review) within a day:
- What Worked? — Detection was fast, runbook was accurate, communication was clear.
- What Didn’t? — Alert didn’t fire, runbook was outdated, nobody knew who to escalate to, recovery took too long.
- Action Items — Update the runbook, fix the alert, add a missing dashboard, schedule a follow-up drill on the weak area.
The debrief is the most valuable part. A game-day without a debrief is a missed opportunity.
Cadence
| Exercise Type | Suggested Cadence | Notes |
|---|---|---|
| Focused drills (single runbook or tool) | Monthly or quarterly | Low overhead, high value for muscle memory |
| Team game-day (one team, one scenario) | Quarterly | Core practice for on-call teams |
| Cross-team game-day (multi-team, complex scenario) | Annually or semi-annually | Higher coordination cost, tests organizational response |
| DR exercise (failover, backup restore) | See DR Planning and Testing | Specific cadence for disaster recovery scenarios |
Adjust based on your team’s maturity and how often your architecture changes.
New services or major redesigns are good triggers for an unscheduled game-day.
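One way to keep the cadence honest is to check periodically which exercise types are overdue. The sketch below is illustrative only: the exercise keys and the `overdue` function are hypothetical, and the day counts are an approximation that takes the longer end of each suggested range in the table. DR exercises are left out because their cadence is defined in DR Planning and Testing.

```python
from datetime import date

# Longest suggested interval per exercise type, in days (approximate).
MAX_INTERVAL_DAYS = {
    "focused_drill": 90,          # monthly or quarterly
    "team_game_day": 90,          # quarterly
    "cross_team_game_day": 365,   # annually or semi-annually
}


def overdue(last_run: dict[str, date], today: date) -> list[str]:
    """Return exercise types never run or last run longer ago than allowed."""
    result = []
    for kind, max_days in MAX_INTERVAL_DAYS.items():
        last = last_run.get(kind)
        if last is None or (today - last).days > max_days:
            result.append(kind)
    return result


last_run = {
    "focused_drill": date(2024, 1, 10),
    "team_game_day": date(2024, 5, 1),
}
print(overdue(last_run, today=date(2024, 6, 1)))
# ['focused_drill', 'cross_team_game_day']
```

Teams with a fast-changing architecture may want shorter intervals than these; the thresholds are a starting point to adjust, not a rule.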
Common Findings
Game-days frequently reveal:
- Runbooks Are Outdated. The infrastructure changed but the runbook wasn’t updated. Steps reference old tools or missing dashboards.
- Detection Is Slow. The failure happened but nobody noticed for 15 minutes because the alert was too noisy or didn’t exist.
- Escalation Paths Are Unclear. The on-call didn’t know who to page next, or the escalation contact was out of date.
- Communication Gaps. Stakeholders weren’t updated, or updates were too technical/too vague.
- Recovery Is Harder Than Expected. The “simple failover” turns out to require manual steps nobody practiced.
- Tool Access Issues. Someone needed to access a system but didn’t have the right credentials or permissions.
Relationship To DR Testing
DR testing (failover drills, backup restores) is a specialized form of game-day focused on disaster recovery scenarios.
DR Planning and Testing covers DR-specific exercises, including backup restore cadence and failover drills.
The practices overlap: a DR failover drill is a game-day with a DR scenario.
Use whichever framing fits your organization, but make sure both your operational failure scenarios (a dependency is slow, a deploy is bad) and your disaster scenarios (a region is down, data is corrupted) get exercised regularly.
See Also
- Chaos Experiments — Individual failure injection experiments that feed into game-day scenarios.
- Synthetic Testing and Load Replay — Continuous automated validation between game-days.
- Incident Lifecycle — The response process that game-days practice.
- Runbooks and Playbooks — The procedures that game-days validate.
- Post-Incident Review — The debrief format used after game-days.