
Active-Passive vs Active-Active Multi-Region

First published by Atif Alam

A useful mental model for steady state:

  1. Client resolves DNS or is routed by a global entry (anycast, global load balancer, or geo-DNS).
  2. Traffic lands in a region on your API tier (containers, functions, or VMs) behind a regional load balancer.
  3. The app may read regional cache (for example Redis), then databases or queues—often a primary for writes and replicas for reads, depending on design.

Nothing here requires multiple regions; it is the happy path you secure before you reason about regional failover.

  • Availability zones (AZs) are separate data centers within one region, connected by low-latency networks. Surviving one AZ is usually solved with redundant subnets, load balancers, and synchronous or fast failover inside the region.
  • A region is a geographic failure and isolation boundary in public clouds: power, networking, and control-plane scope are independent per region. Multi-region design addresses loss of a whole region (or placing workloads closer to users across continents).

Solve AZ-level problems with multi-AZ first. Add another region when you need cross-region DR, latency to distant users, or residency constraints that one region cannot satisfy.

  • RTO (recovery time objective) — the maximum acceptable time to restore service after a regional disaster.
  • RPO (recovery point objective) — how much data you may lose (time or volume) measured at failover.

These pair with replication: async replication usually improves write latency but widens RPO; sync replication tightens RPO but adds latency and coupling. See RTO and RPO for definitions and tradeoffs.
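These relationships are mostly arithmetic, and it helps to make them concrete. A back-of-envelope sketch (the input figures below are illustrative assumptions, not measurements):

```python
def estimate_rto_seconds(detection_s: float, decision_s: float,
                         promotion_s: float, dns_ttl_s: float) -> float:
    """Rough unplanned-failover RTO: time to notice the outage, decide to
    fail over, promote the replica, and wait out cached DNS answers."""
    return detection_s + decision_s + promotion_s + dns_ttl_s

def worst_case_rpo_seconds(replication_lag_s: float) -> float:
    """With async replication, writes committed locally but not yet shipped
    are lost on promotion, so RPO is at least the replication lag."""
    return replication_lag_s

# Example: 30s detection, 2min human decision, 1min promotion, 30s TTL
print(estimate_rto_seconds(30, 120, 60, 30))   # 240 seconds (~4 minutes)
print(worst_case_rpo_seconds(5.0))             # at least 5 seconds of data at risk
```

Tightening any one input (faster detection, automated decisions, lower TTL) shrinks RTO; only the replication model moves RPO.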

| Dimension | Active-Passive | Active-Active |
| --- | --- | --- |
| Traffic handling | One region serves production traffic; standby is cold or warm | Multiple regions serve traffic simultaneously |
| Complexity | Lower | Higher (routing, data, conflicts, ops) |
| Cost | Lower if standby is mostly idle | Higher—often full capacity in each active region |
| RTO | Minutes (failover, DNS, promotion) | Near zero for read paths if others absorb load |
| RPO | Seconds to minutes typical with async replication | Can approach zero with the right stores and sync model |
| Write conflicts | None across regions if single writer | Must be designed for (or avoided by funneling writes) |

Sync vs async replication (plain language)

  • Synchronous replication to another region waits for the remote copy before acknowledging the write. RPO can be very small, but write latency includes cross-region round trips (often tens to hundreds of milliseconds).
  • Asynchronous replication acknowledges after the local commit; the remote copy catches up in the background. Faster writes, but on failover you may lose lag worth of data—your RPO is at least that lag unless you have a compensating design.
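The difference is simply where the acknowledgement happens. A minimal sketch (the commit and replication callables are stand-ins for real database operations, not any particular API):

```python
def sync_write(local_commit, remote_replicate):
    """Acknowledge only after both copies are durable: RPO near zero,
    but the caller waits out the cross-region round trip."""
    local_commit()
    remote_replicate()      # blocks until the remote region acknowledges
    return "ack"

def async_write(local_commit, enqueue_replication):
    """Acknowledge after the local commit; a background shipper catches
    the remote copy up later. Whatever is still queued at failover time
    is the data-loss window."""
    local_commit()
    enqueue_replication()   # returns immediately; replication lags behind
    return "ack"
```

In `sync_write` the remote call sits on the latency-critical path; in `async_write` it does not, which is the entire tradeoff in two lines.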

Rough targets like 50,000 requests per second and P99 under 200ms push you toward:

  • Connection pooling at the database boundary (pooler or proxy) so failover does not cold-start thousands of connections.
  • Regional caches to absorb read variance and protect the database.
  • Fast health checks (for example on the order of seconds, not half a minute) so a bad region does not absorb traffic long enough to burn your error budget.
  • Warm standby or shadow traffic in active-passive so the passive region is not completely cold (JVM caches, connection pools).
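The health-check bullet is worth quantifying: detection time is roughly the check interval times the failure threshold, plus up to one interval before the first failed check. A sketch with assumed numbers:

```python
def detection_time_s(check_interval_s: float, unhealthy_threshold: int) -> float:
    """Worst case: the failure lands just after a passing check, then the
    checker needs `unhealthy_threshold` consecutive failures before it
    marks the target unhealthy and traffic starts to move."""
    return check_interval_s * (unhealthy_threshold + 1)

# 5-second checks with 3 strikes: a bad region is detected in ~20s.
# 30-second checks with 3 strikes: ~2 minutes of errors before traffic moves.
print(detection_time_s(5, 3), detection_time_s(30, 3))
```

At 50,000 requests per second, the difference between those two settings is millions of requests hitting a failing region.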

Strict global P99 under 200ms during a regional failover event is hard in pure active-passive because cutover and cold caches create spikes—active-active with regional capacity and routing often fits latency SLOs better if you can afford the data-layer complexity.

  • Global load balancer or DNS (patterns exist on AWS, GCP, Azure, Cloudflare, and others) with health checks aimed at the primary region.
  • The passive region receives no production traffic until failover (unless you use shadow traffic to keep it warm).
  • DNS TTL often kept low (tens of seconds) to shorten cutover time—at the cost of more DNS churn and operational attention.
  • Primary in the active region takes all writes (typical pattern).
  • Replica in the passive region via async or sync replication.
  • Sync tightens RPO but adds write latency; async is faster for writes but widens the data-loss window on promotion.
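Tying those pieces together, an unplanned cutover has a natural ordering: fence, promote, repoint. A sketch assuming hypothetical `db` and `dns` clients (these methods are illustrative, not a real cloud API):

```python
def active_passive_failover(db, dns, passive_region: str) -> float:
    """Minimal cutover order for an unplanned regional failover.
    `db` and `dns` are hypothetical clients used for illustration."""
    db.fence_old_primary()                # ensure the failed primary cannot take writes
    lag = db.replication_lag_seconds()    # data at risk if replication was async
    db.promote_replica(passive_region)    # replica becomes the new primary
    dns.point_to(passive_region)          # cutover completes as DNS TTLs expire
    return lag                            # report the realized RPO bound
```

Fencing before promotion matters: promoting while the old primary can still accept writes is how split-brain starts.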
| Failure | Response |
| --- | --- |
| Active region down | LB marks it unhealthy; traffic shifts to passive; degraded period is roughly DNS TTL plus promotion time |
| DB primary fails | Promote replica in passive region; replay or accept replication lag |
| Replication lag spike | RPO risk—monitor lag aggressively |
| Partial failure (one AZ) | Prefer AZ failover inside the active region before declaring a region dead |

Cold standby risk: After cutover, caches and pools may be cold—expect minutes of elevated latency unless you warm the passive path regularly.

  • Latency-based routing at the global tier so users hit a nearby region.
  • Regions are often sized for part of peak each (for example 60–70% per region) so one region can absorb a surge when another fails.
  • Circuit breakers per region to shed load when overloaded.
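The sizing bullet is worth making concrete. If each of n regions is provisioned for a fraction of total peak, a quick check shows whether survivors can absorb a failed region's share (a sketch; the numbers are illustrative):

```python
def survivor_utilization(n_regions: int, sized_fraction: float) -> float:
    """After one region fails, each survivor takes 1/(n-1) of total peak.
    Returns that share relative to each survivor's provisioned capacity;
    values above 1.0 mean the survivors are overloaded."""
    share_after_failure = 1.0 / (n_regions - 1)
    return share_after_failure / sized_fraction

# Three regions sized at 65% of peak each: survivors run at ~77% -> fits.
# Two regions sized at 60% each: the survivor needs ~167% of capacity -> overload.
print(survivor_utilization(3, 0.65), survivor_utilization(2, 0.60))
```

This is why two-region active-active either runs each region near full peak capacity or relies on load shedding during a failover.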

Three common patterns:

  1. Global or multi-region database with multi-master or consensus (for example DynamoDB Global Tables, CockroachDB, Spanner)—writes in multiple regions; conflict resolution is product-specific (for example last-write-wins).
  2. Geo-sharding — Data has a home region; writes go home; reads may be local from replicas.
  3. CQRS and event sourcing — Writes append to a log replicated across regions; each region builds read models; strong eventual consistency; read-heavy workloads often fit well.
  • Distributed tracing across regions (OpenTelemetry, vendor APM) to debug cross-region paths.
  • Conflict strategy wherever two regions can write the same entity.
  • Idempotency keys on mutating APIs so retries do not double-apply.
  • Session affinity only if stateful; otherwise externalize session state to a shared store or token.
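The idempotency-key bullet can be sketched as a thin wrapper around the mutation (in production the seen-keys store would be a replicated database with a TTL, not an in-memory dict):

```python
class IdempotentHandler:
    """Records the first response per idempotency key so a retried
    mutation replays the stored result instead of re-applying the
    side effect."""

    def __init__(self, apply_fn):
        self._seen = {}          # key -> recorded response
        self._apply = apply_fn   # the actual mutation

    def handle(self, idempotency_key: str, payload):
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]   # replay recorded response
        result = self._apply(payload)            # side effect happens once
        self._seen[idempotency_key] = result
        return result
```

The client supplies the key (typically a UUID per logical operation), so a retry after a cross-region timeout cannot double-charge or double-ship.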
| Failure | Response |
| --- | --- |
| One region down | Global routing stops sending traffic there; survivors must have headroom |
| Network partition | Regions may continue serving; divergent writes possible—reconcile on heal |
| Replication lag | Reads may be stale—define per-endpoint staleness SLOs |
| Split-brain on multi-master | Database or app conflict rules; idempotency limits duplicate side effects |
| Cascading overload | One region fails → surge elsewhere → second failure; needs load shedding and autoscaling buffer |

Conceptually (cloud products differ):

  1. Health checks mark the region’s endpoints unhealthy.
  2. The global layer stops steering new connections to that region (DNS updates, anycast withdrawal, or LB pool removal).
  3. In active-passive, you may promote a database replica and point traffic to the former passive region.
  4. In active-active, surviving regions take the share of traffic—pre-provisioned capacity and circuit breakers matter.

Active-passive often looks like a cutover (DNS or LB flip). Active-active looks like shedding a bad region while others absorb load.
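Shedding a bad region in active-active amounts to removing it from the routing table and renormalizing the survivors' traffic shares. A sketch of that renormalization:

```python
def redistribute_weights(weights: dict, failed_region: str) -> dict:
    """Drop a failed region from the routing table and scale the surviving
    regions' traffic shares proportionally so they sum to 1.0 again."""
    survivors = {r: w for r, w in weights.items() if r != failed_region}
    total = sum(survivors.values())
    if total == 0:
        raise RuntimeError("no healthy regions left to route to")
    return {r: w / total for r, w in survivors.items()}

# {"us": 0.5, "eu": 0.3, "ap": 0.2} with "us" failed -> {"eu": 0.6, "ap": 0.4}
```

Note that each survivor's absolute load roughly doubles here, which is exactly why pre-provisioned headroom and circuit breakers matter.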

Single-region outage: operational steps (high level)


This is not a full runbook—see Failover and Failback and DR planning and testing—but the arc is:

  1. Detect — Alarms, health checks, synthetic probes, or regional error rate.
  2. Decide — Confirm regional failure vs. transient blip; invoke incident command.
  3. Shift — Route traffic away from the bad region; scale or throttle surviving regions.
  4. Stabilize — Promote databases if needed, validate RPO, fix caches, communicate status.
  5. Record — Timeline for post-incident review and runbook updates.

When the failed region comes back (failback)


At architecture level:

  1. Verify replication and disk health before sending user traffic.
  2. Reattach the region as a replica or secondary, or rebuild it from backup if needed.
  3. Reconcile divergent data if the outage allowed split writes (details depend on store—see Data, ordering, and stores).
  4. Ramp traffic gradually; warm caches; watch replication lag and error rates.
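Step 4's gradual ramp can be sketched as a guarded loop (`set_weight` and `healthy` are stand-ins for your routing API and health signals, not a real library):

```python
def ramp_traffic(set_weight, healthy, steps=(0.01, 0.05, 0.25, 0.50, 1.0)):
    """Shift traffic back to a recovered region in stages, checking health
    signals (error rate, replication lag) after each step; roll the region
    back to zero traffic on any regression."""
    for weight in steps:
        set_weight(weight)
        if not healthy():
            set_weight(0.0)   # abort the ramp immediately on bad signals
            return False
    return True
```

The early 1% step doubles as a cache warmer: it pushes real traffic through cold caches and connection pools before they matter.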

Align spend and risk with need:

| Phase | Typical focus | Cost / complexity |
| --- | --- | --- |
| 1 | Single region hardened: multi-AZ, backups, tested restore | Lowest |
| 2 | Async DR replica in second region | Low–medium |
| 3 | Active-passive with tested failover; optional warm traffic | Medium |
| 4 | Multi-region reads (replicas + cache); write still centralized | Medium–high |
| 5 | Active-active or geo-partitioned writes or global DB | Highest |

Signals to move up: latency SLOs across geographies, hard RTO/RPO, regulatory in-region serving, or write/read scale that demands distribution.

Observability by phase: early on, prioritize success rate, latency percentiles, and replication lag. As you add active regions, add per-region saturation, cross-region traffic share, and failover annotations on dashboards—see Grafana multi-region dashboards.

  • TLS for all cross-region traffic; private connectivity where supported.
  • Secrets and keys—rotate them and scope them per region so compromising one region does not yield credentials that work globally.
  • Least privilege for replication agents, operators, and automation that can promote a database.

Design so one compromised region or one bad deploy can be contained: separate accounts or projects where appropriate, feature flags, and bulkheads for cross-region calls—see Scalability Patterns and the Infrastructure redundancy example.

  • If you need strict low P99 globally and can fund engineering complexity, active-active with a clear write strategy (for example geo-sharded writes or a global database) is often a better fit than pure active-passive at large scale.
  • If the workload is mostly single geography and you need DR, active-passive with a warm standby (not cold) is simpler and often sufficient.