
Grafana Dashboards for Multi-Region Systems

First published by Atif Alam

This page focuses on metrics you typically chart in Grafana (or similar) for a multi-region API stack. It is not the whole observability story: combine these dashboards with structured logs and distributed tracing so you can move from “P99 spiked” to “this trace shows which region and dependency failed.” See the Observability overview and Alerting.

Early phases (single region + DR replica, or passive second region):

  • Global and regional request success rate and latency (P50 / P95 / P99).
  • Replication lag between primary and secondary.
  • Error budget or simple SLO view if you have one.
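The global success rate and latency panels above can be sketched as PromQL queries. This is a sketch assuming a counter named `http_requests_total` with a `code` label and a latency histogram `http_request_duration_seconds_bucket`; substitute the metric names your services actually expose.

```promql
# Global success rate over 5m (non-5xx over total)
sum(rate(http_requests_total{code!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Global P99 latency from the request-duration histogram
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```

Each expression is a separate panel query; P50 and P95 are the same `histogram_quantile` call with 0.50 and 0.95.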

Later phases (multiple active regions, global data plane):

  • Everything above, plus per-region traffic share, saturation (connections, CPU, pool usage), cross-region replication or consensus metrics, cache hit rate per region, and consumer lag for streaming.

Expand dashboards when blast radius and routing complexity grow—see Multi-region topology.

Dashboard 1 — Global overview

Audience: Engineering leads and on-call—often the first screen in an incident.

| Metric | Why | Example alert threshold |
| --- | --- | --- |
| Global success rate | User-impacting health | Below SLO (for example 99.9%) |
| P50 / P95 / P99 latency | Tail behavior | P99 above target (for example 200 ms) |
| Error budget burn rate | Are we exhausting the monthly budget too fast? | Burn rate multiple over a sustained window |
| Requests per second | Total and per region | Sudden drop or spike |
| Region health | Green / yellow / red | Any region red |

Tip: Error budget burn rate is often the single best “should we page?” signal when paired with latency and success rate.
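Burn rate can be computed directly in PromQL as the observed error ratio divided by the budget the SLO allows. A sketch, again assuming an `http_requests_total` counter with a `code` label and a 99.9% SLO:

```promql
# Burn rate = error ratio / (1 - SLO). A value of 1 means the budget
# is being consumed exactly on schedule; 14.4 over 1h is a common
# fast-burn paging threshold in multiwindow setups.
(
  sum(rate(http_requests_total{code=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
) / (1 - 0.999)
```

Multiwindow alerting pairs a long window (for example 1h) with a short one (for example 5m) so the alert both fires quickly and clears quickly.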

Dashboard 2 — API / service

Audience: On-call engineer during triage.

Traffic

  • RPS by endpoint, region, HTTP method.
  • Traffic shift between regions—unexpected shifts can mean failover or routing misconfiguration.

Latency

  • Percentiles per endpoint; heatmap over time for tails.
  • Time to first byte vs total time—helps separate network from app/DB.
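The per-endpoint percentile panel can be expressed with `histogram_quantile` grouped by labels. A sketch assuming a `http_request_duration_seconds_bucket` histogram carrying `handler` and `region` labels (adjust to your label scheme):

```promql
# P99 latency per endpoint and region; graph as multiple series
histogram_quantile(0.99,
  sum by (le, handler, region) (
    rate(http_request_duration_seconds_bucket[5m])))
```

For the tail heatmap, chart the raw `_bucket` rates with Grafana's heatmap panel instead of collapsing to a single quantile.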

Errors

  • 4xx vs 5xx; 5xx by region; error type (timeout vs reset vs application).

Saturation

  • Thread / worker queue depth, connection pool utilization (critical at high RPS).
  • GC pauses if your runtime exposes them.

Dashboard 3 — Database

Label this dashboard for your actual engine (Postgres, CockroachDB, managed RDS, etc.).

Throughput

  • Read and write QPS per region.
  • Transactions committed vs aborted (if applicable).

Latency

  • P99 query or statement latency by query family.
  • Consensus or replication latency where relevant (for example Raft).

Consistency / ordering

  • Transaction retry rate, contention indicators, replication lag per region.

Saturation and health

  • Connections vs max, disk IO, compaction or vacuum backlog for LSM-style stores.
  • Under-replicated ranges or unhealthy nodes—non-zero often means risk.
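The replication-lag and under-replication panels above depend on your engine's exporter. Two illustrative queries, assuming metric names from postgres_exporter and CockroachDB's built-in Prometheus endpoint; verify the names against what your exporter actually emits:

```promql
# Postgres: replication lag in seconds (postgres_exporter metric)
pg_replication_lag

# CockroachDB: under-replicated ranges per node — non-zero means
# some data is below its replication target and warrants attention
sum by (instance) (ranges_underreplicated)
```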

Dashboard 4 — Cache (Redis or equivalent)

| Metric | Why it matters |
| --- | --- |
| Hit rate per region | Below expected (for example 80%) → more load on the DB |
| Eviction rate | Memory pressure or wrong TTL policy |
| Memory utilization | High utilization risks evictions and tail latency |
| Replication lag (if cross-region cache replication) | Stale cache in the secondary region |
| Command latency P99 | Should usually stay very low; spikes → network or overload |
| Connected clients | Sudden drops can indicate pool issues upstream |

Trend hit rate over time: gradual decline sometimes precedes DB overload by minutes.
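For Redis, hit rate per region can be derived from the keyspace counters. A sketch assuming redis_exporter metric names and a `region` label added by your scrape config:

```promql
# Hit rate = hits / (hits + misses), per region, over 5m
sum by (region) (rate(redis_keyspace_hits_total[5m]))
/
sum by (region) (
  rate(redis_keyspace_hits_total[5m])
  + rate(redis_keyspace_misses_total[5m]))
```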

Dashboard 5 — Kafka / streaming (if used)

| Metric | Alert threshold (example) |
| --- | --- |
| Consumer lag per topic/partition | Above SLA for that pipeline |
| Producer send latency P99 | Should stay low for fire-and-forget |
| Messages in vs out | Divergence → consumer falling behind |
| Under-replicated partitions | Any > 0 → investigate |
| Broker disk | Above ~70% → plan expansion |
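Consumer lag is typically scraped via kafka_exporter. A sketch assuming its `kafka_consumergroup_lag` metric, which carries `consumergroup`, `topic`, and `partition` labels:

```promql
# Lag per consumer group and topic; alert when it exceeds the
# pipeline's SLA or keeps growing over a sustained window
sum by (consumergroup, topic) (kafka_consumergroup_lag)
```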

Metrics many teams forget until an incident:

| Metric | What it tells you |
| --- | --- |
| Cross-region replication lag | Staleness budget for reads |
| Traffic % per region | Detect unintended failover or bad routing |
| Failover timeline with deploy markers | Correlation with releases |
| Write-region or primary round-trip latency | Budget for funneled writes |
| Inter-region bandwidth (if available) | Cost and saturation signal |
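Traffic share per region divides each region's request rate by the global rate. A sketch, again assuming `http_requests_total` with a `region` label:

```promql
# Each region's fraction of global RPS; a sudden shift without a
# planned failover usually means routing misconfiguration
sum by (region) (rate(http_requests_total[5m]))
/ ignoring(region) group_left
sum(rate(http_requests_total[5m]))
```

`ignoring(region) group_left` matches every regional series against the single global total, yielding one share series per region.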

Structure alerts in tiers so on-call is not flooded:

P1 — Page immediately

  • Error rate above threshold for sustained minutes.
  • P99 latency above SLO for sustained window.
  • Region unreachable or zero healthy backends.
  • Data safety signals (for example under-replicated ranges) where applicable.
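The "sustained minutes" qualifier on P1 alerts is usually implemented with a `for:` clause on a Prometheus alerting rule. An illustrative expression for the first item, with an assumed 2% threshold:

```promql
# Error ratio over 5m; pair with `for: 10m` in the alerting rule so a
# brief blip does not page, only a sustained breach does
(
  sum(rate(http_requests_total{code=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) > 0.02
```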

P2 — Slack / ticket

  • Error budget burn rate elevated.
  • Cache hit rate drop sustained.
  • Consumer lag growing.
  • Replication lag beyond agreed bound.

P3 — Backlog

  • Disk trending high.
  • Connection pools trending high without user impact yet.

Grafana tips

  • Exemplars: Link Prometheus exemplars to traces to jump from a latency spike to a trace.
  • Annotations: Record deploys and failovers on dashboards—many incidents correlate with change.
  • Variables: region, environment, and service as dropdowns for fast drill-down.
  • Composite panels: RPS, errors, and latency together for pattern recognition.