Failover and Failback

First Published Feb 14, 2026 By Atif Alam

Failover is shifting traffic and workload from a failed primary site or region to a standby.

Failback is returning to the primary once it’s healthy again.

Both require clear runbooks, tested procedures, and coordination with your incident response process.

What Is Failover

When the primary region or data center is unavailable—outage, disaster, or unrecoverable failure—you redirect traffic to a standby.

The standby may be:

Active-passive — Standby is idle or lightly used until failover; you switch DNS, load balancer, or routing to point to it.
Active-active — Both regions serve traffic normally; if one fails, the other absorbs the load. More complex (data consistency, split-brain) but no “cold” standby.

For design guidance on when to use each, see the Infrastructure / redundancy example and System Design Checklist Section 7.
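To make the difference between the two standby models concrete, here is a minimal Python sketch of how traffic might be split in each mode. The function name, the region labels "a" and "b", and the even split in active-active mode are all illustrative assumptions, not part of any specific load balancer or DNS product.

```python
from typing import Dict


def traffic_weights(mode: str, a_healthy: bool, b_healthy: bool) -> Dict[str, int]:
    """Return the percentage of traffic each region should receive.

    Region "a" is the primary in active-passive mode (hypothetical labels).
    """
    if mode == "active-passive":
        # Standby "b" only receives traffic once primary "a" is down.
        if a_healthy:
            return {"a": 100, "b": 0}
        if b_healthy:
            return {"a": 0, "b": 100}
    elif mode == "active-active":
        # Both regions serve traffic; a survivor absorbs the full load.
        if a_healthy and b_healthy:
            return {"a": 50, "b": 50}
        if a_healthy:
            return {"a": 100, "b": 0}
        if b_healthy:
            return {"a": 0, "b": 100}
    else:
        raise ValueError(f"unknown mode: {mode}")
    raise RuntimeError("no healthy region available")
```

In practice the split would be applied through whatever routing layer you use (DNS weights, load balancer pools), and an active-active design also has to handle the data consistency concerns noted above, which this sketch ignores.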

Not every outage is a disaster, so decide in advance what justifies a failover. Define those criteria in your DR strategy and runbook.
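One way to keep failover criteria unambiguous during an incident is to write them down as data and evaluate them mechanically. The sketch below is only an illustration: the two thresholds (a minimum outage duration and a recovery time objective, RTO) and the decision rule are assumptions chosen for the example, and your own DR strategy may use entirely different criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverCriteria:
    # Hypothetical thresholds; real values come from your DR strategy.
    min_outage_minutes: float  # ignore brief blips shorter than this
    rto_minutes: float         # recovery time objective for the service


def should_fail_over(
    criteria: FailoverCriteria,
    outage_minutes: float,
    estimated_recovery_minutes: float,
) -> bool:
    """Suggest failover when waiting for the primary would exceed the RTO."""
    if outage_minutes < criteria.min_outage_minutes:
        return False
    # Projected total downtime if we wait for the primary to recover.
    projected_downtime = outage_minutes + estimated_recovery_minutes
    return projected_downtime > criteria.rto_minutes
```

A function like this only supports the decision; the Incident Commander still confirms and documents it.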

The Incident lifecycle (detection, declaration, roles, mitigation) applies; failover is a mitigation action. A typical failover sequence:

Declare — Incident Commander (or equivalent) confirms failover criteria are met; document the decision.
Verify standby — Confirm standby region/site is healthy and has recent data (replication lag, backup restore status).
Switch traffic — Update DNS, load balancer, or routing to point to standby. May involve lowering TTLs beforehand.
Verify — Monitor SLIs (availability, error rate, latency); confirm application is serving correctly.
Communicate — Update status page, stakeholders, and incident channel. See Communicating During Incidents.
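The steps above can be sketched as an ordered runbook that stops at the first failed step, so traffic is never switched to a standby that failed verification. This is a minimal Python sketch under stated assumptions: the `Step` type, the `action` hooks, and the lag threshold are hypothetical placeholders for whatever tooling your architecture actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    name: str
    action: Callable[[], bool]  # hypothetical hook; returns True on success


def standby_is_ready(healthy: bool, replication_lag_s: float, max_lag_s: float) -> bool:
    """Standby passes only if it is healthy and its data is recent enough."""
    return healthy and replication_lag_s <= max_lag_s


def run_failover(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; stop at the first failure and return a log."""
    log: List[str] = []
    for step in steps:
        ok = step.action()
        log.append(f"{step.name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            return False, log
    return True, log
```

For example, a runbook whose "verify standby" step calls `standby_is_ready(True, 400.0, 60.0)` would halt before the "switch traffic" step, because the replication lag exceeds the allowed maximum. The returned log can be pasted into the incident channel as a record of what ran.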

Details depend on your architecture. Keep the runbook in a known location and link it from your incident response playbooks.

Once the primary is restored and verified healthy, you need to return traffic to it. Failback typically involves:

Data sync — If writes continued on the standby, you must reconcile or replicate changes back to primary before failback (or accept a cutover with potential data loss, depending on your RPO).
Traffic switch — Reverse the traffic routing; consider a canary or gradual ramp.
Verification — Confirm primary is serving correctly; monitor for issues.
Standby state — Return standby to its pre-failover state (replication, backups) so it’s ready for the next incident.
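The gradual ramp mentioned in the traffic switch step might look like the following Python sketch. The stage percentages and the two callbacks (`set_primary_weight` and `primary_is_healthy`) are assumptions for illustration; in a real system they would wrap your DNS or load balancer API and your SLI checks, and each stage would normally include a soak period before the health check.

```python
from typing import Callable, Iterable


def ramp_to_primary(
    set_primary_weight: Callable[[int], None],
    primary_is_healthy: Callable[[], bool],
    stages: Iterable[int] = (5, 25, 50, 100),
) -> bool:
    """Shift traffic back to the primary in stages, reverting on trouble.

    Returns True if the primary ends up with all traffic, False if the
    ramp was aborted and traffic was sent back to the standby.
    """
    for pct in stages:
        set_primary_weight(pct)
        if not primary_is_healthy():
            set_primary_weight(0)  # revert: standby takes all traffic again
            return False
    return True
```

Starting with a small canary percentage limits the impact if the restored primary turns out to have a problem that verification missed.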

Failback can be as risky as failover. Test it in drills if possible.