Failover and Failback
Failover is shifting traffic and workload from a failed primary site or region to a standby.
Failback is returning to the primary once it’s healthy again.
Both require clear runbooks, tested procedures, and coordination with your incident response process.
What Is Failover
When the primary region or data center is unavailable—outage, disaster, or unrecoverable failure—you redirect traffic to a standby.
The standby may be:
- Active-passive — Standby is idle or lightly used until failover; you switch DNS, load balancer, or routing to point to it.
- Active-active — Both regions serve traffic normally; if one fails, the other absorbs the load. More complex (data consistency, split-brain) but no “cold” standby.
For design guidance on when to use each, see the Infrastructure / redundancy example and System Design Checklist Section 7.
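The difference between the two standby models can be sketched as a routing decision. This is a minimal illustration, not a real load balancer API: `Region`, `route_active_passive`, and `route_active_active` are hypothetical names, and health is assumed to come from your own checks.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str       # hypothetical region label
    healthy: bool   # assumed to be set by your own health checks


def route_active_passive(primary: Region, standby: Region) -> list[str]:
    """Send all traffic to the primary; use the standby only when the primary is down."""
    if primary.healthy:
        return [primary.name]
    if standby.healthy:
        return [standby.name]
    return []  # nothing healthy: there is no valid target


def route_active_active(regions: list[Region]) -> list[str]:
    """Spread traffic across every healthy region; survivors absorb the load."""
    return [r.name for r in regions if r.healthy]
```

In the active-passive case the standby receives nothing until the primary fails, which is why its health and data freshness need checking before a switch. In the active-active case a failed region simply drops out of the target list.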
When to Declare a Disaster and Fail Over
Not every outage is a disaster. Criteria for failover might include:
- Primary region or AZ is down or unreachable.
- Data center or cloud provider declares an incident affecting your primary.
- RTO is at risk: a manual fix would take longer than your recovery window.
Define criteria in your DR strategy and runbook.
The Incident lifecycle (detection, declaration, roles, mitigation) applies; failover is a mitigation action.
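Writing the criteria down as an explicit check can make the declaration less ambiguous under pressure. The sketch below encodes the three example criteria above; `OutageStatus` and `should_fail_over` are hypothetical names, and the result is a recommendation for whoever declares the incident, not an automatic trigger.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class OutageStatus:
    primary_reachable: bool            # from your own monitoring
    provider_incident_declared: bool   # e.g. read from the provider's status page
    outage_elapsed: timedelta          # time since the outage began
    estimated_fix: timedelta           # best guess for an in-place manual fix


def should_fail_over(status: OutageStatus, rto: timedelta) -> tuple[bool, str]:
    """Return (recommend_failover, reason) based on the example criteria."""
    if not status.primary_reachable:
        return True, "primary region or AZ is down or unreachable"
    if status.provider_incident_declared:
        return True, "provider declared an incident affecting the primary"
    # Time already lost plus time still needed must fit inside the RTO.
    if status.outage_elapsed + status.estimated_fix > rto:
        return True, "RTO at risk: manual fix exceeds the recovery window"
    return False, "criteria not met; continue in-place mitigation"
```

Returning the reason alongside the decision gives the Incident Commander a line to record when documenting why failover was or was not declared.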
Failover Runbook: Typical Steps
- Declare — Incident Commander (or equivalent) confirms failover criteria are met; document the decision.
- Verify standby — Confirm standby region/site is healthy and has recent data (replication lag, backup restore status).
- Switch traffic — Update DNS, load balancer, or routing to point to standby. May involve lowering TTLs beforehand.
- Verify — Monitor SLIs (availability, error rate, latency); confirm application is serving correctly.
- Communicate — Update status page, stakeholders, and incident channel. See Communicating During Incidents.
Details depend on your architecture. Keep the runbook in a known location and link it from your incident response playbooks.
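The steps above can be sketched as a small orchestration skeleton. Everything here is an assumption for illustration: `FailoverRunbook` is a hypothetical class, the four callables stand in for whatever probes and routing tools your architecture provides, and the thresholds are placeholders to derive from your own RPO and SLOs.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FailoverRunbook:
    replication_lag_seconds: Callable[[], float]  # hypothetical standby probe
    switch_traffic: Callable[[], None]            # hypothetical DNS / load balancer update
    error_rate: Callable[[], float]               # hypothetical SLI probe after the switch
    notify: Callable[[str], None]                 # hypothetical status page / channel hook
    max_lag_seconds: float = 60.0                 # assumed threshold; derive from your RPO
    max_error_rate: float = 0.01                  # assumed threshold; derive from your SLO
    log: list[str] = field(default_factory=list)

    def run(self, decision_note: str) -> bool:
        # 1. Declare: record why failover was called.
        self.log.append(f"declare: {decision_note}")
        # 2. Verify standby: stop for a human decision if its data is too stale.
        lag = self.replication_lag_seconds()
        if lag > self.max_lag_seconds:
            self.log.append(f"abort: standby lag {lag:.0f}s is over the limit")
            self.notify("Failover paused: standby data is too stale")
            return False
        self.log.append(f"verify standby: lag {lag:.0f}s")
        # 3. Switch traffic.
        self.switch_traffic()
        self.log.append("switch traffic: routing now points to standby")
        # 4. Verify: check an SLI once traffic has moved.
        rate = self.error_rate()
        healthy = rate <= self.max_error_rate
        self.log.append(f"verify: error rate {rate:.2%}")
        # 5. Communicate the outcome.
        self.notify("Failover complete" if healthy else "Failover done, errors elevated")
        return healthy
```

Pausing on a stale standby is one possible policy. Some teams would rather switch anyway and accept the data loss, so that branch should reflect what your DR strategy says. The `log` list doubles as a timeline for the post-incident review.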
Failback
Once the primary is restored and verified healthy, you need to return traffic to it.
- Data sync — If writes continued on the standby, you must reconcile or replicate changes back to primary before failback (or accept a cutover with potential data loss, depending on your RPO).
- Traffic switch — Reverse the traffic routing; consider a canary or gradual ramp.
- Verification — Confirm primary is serving correctly; monitor for issues.
- Standby state — Return standby to its pre-failover state (replication, backups) so it’s ready for the next incident.
Failback can be as risky as failover. Test it in drills if possible.
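A gradual ramp with automatic rollback is one way to limit that risk. The sketch below assumes data sync has already finished; `ramp_back_to_primary`, `set_primary_weight`, and `primary_is_healthy` are hypothetical names standing in for your weighted-routing control and health check, and the step percentages are arbitrary.

```python
from typing import Callable


def ramp_back_to_primary(
    set_primary_weight: Callable[[int], None],  # hypothetical weighted-routing control
    primary_is_healthy: Callable[[], bool],     # hypothetical SLI-based health check
    steps: tuple[int, ...] = (5, 25, 50, 100),  # assumed ramp, in percent of traffic
) -> int:
    """Shift traffic to the primary in stages; return the final primary weight."""
    for weight in steps:
        set_primary_weight(weight)
        # In practice, wait for a soak period here before checking health.
        if not primary_is_healthy():
            set_primary_weight(0)  # roll back: the standby serves everything again
            return 0
    return steps[-1] if steps else 0
```

The first small step acts as the canary: if the primary misbehaves at 5 percent, only a fraction of requests are affected before traffic returns to the standby.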
See Also
For more on documenting and practicing failover:
- DR Planning and Testing — How to document and test failover runbooks.
- Runbooks and Playbooks — What makes runbooks effective; connection to game-days.