Grafana Dashboards for Multi-Region Systems
This page focuses on metrics you typically chart in Grafana (or similar) for a multi-region API stack. It is not the whole observability story: combine these dashboards with structured logs and distributed tracing so you can move from “P99 spiked” to “this trace shows which region and dependency failed.” See the Observability overview and Alerting.
Minimum viable observability by phase
Early phases (single region + DR replica, or passive second region):
- Global and regional request success rate and latency (P50 / P95 / P99).
- Replication lag between primary and secondary.
- Error budget or simple SLO view if you have one.
Later phases (multiple active regions, global data plane):
- Everything above, plus per-region traffic share, saturation (connections, CPU, pool usage), cross-region replication or consensus metrics, cache hit rate per region, and consumer lag for streaming.
Expand dashboards when blast radius and routing complexity grow—see Multi-region topology.
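Even the minimum set above maps to a handful of queries. As a sketch (the metric and label names `http_requests_total`, `code`, `region`, and `http_request_duration_seconds_bucket` are illustrative, not prescriptive; substitute whatever your instrumentation emits):

```promql
# Global success rate over 5 minutes: non-5xx requests over all requests.
sum(rate(http_requests_total{code!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Per-region P99 latency from a Prometheus histogram.
histogram_quantile(
  0.99,
  sum by (le, region) (rate(http_request_duration_seconds_bucket[5m]))
)
```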
Dashboard 1 — Executive / SLO
Audience: Engineering leads and on-call—often the first screen in an incident.
| Metric | Why | Example alert threshold |
|---|---|---|
| Global success rate | User-impacting health | Below SLO (for example 99.9%) |
| P50 / P95 / P99 latency | Tail behavior | P99 above target (for example 200 ms) |
| Error budget burn rate | Are we exhausting the monthly budget too fast? | Burn rate multiple over sustained window |
| Requests per second | Total and per region | Sudden drop or spike |
| Region health | Green / yellow / red | Any region red |
Tip: Error budget burn rate is often the single best “should we page?” signal when paired with latency and success rate.
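Burn rate is the observed error rate divided by the budgeted error rate. One common multiwindow sketch, assuming a 99.9% SLO (budget 0.001) and an illustrative `http_requests_total` counter, pages only when both a long and a short window burn fast, which filters out brief blips:

```promql
# Fast-burn condition: a sustained 14.4x burn over 1h exhausts a
# 30-day budget in roughly 2 days. Requiring the 5m window too
# ensures the problem is still happening when the page fires.
(
  sum(rate(http_requests_total{code=~"5.."}[1h]))
    / sum(rate(http_requests_total[1h]))
) / 0.001 > 14.4
and
(
  sum(rate(http_requests_total{code=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m]))
) / 0.001 > 14.4
```

The thresholds and windows are the widely used starting points from the multiwindow, multi-burn-rate pattern; tune them to your SLO period.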
Dashboard 2 — API / application
Audience: On-call engineer during triage.
Traffic
- RPS by endpoint, region, HTTP method.
- Traffic shift between regions—unexpected shifts can mean failover or routing misconfiguration.
Latency
- Percentiles per endpoint; heatmap over time for tails.
- Time to first byte vs total time—helps separate network from app/DB.
Errors
- 4xx vs 5xx; 5xx by region; error type (timeout vs reset vs application).
Saturation
- Thread / worker queue depth, connection pool utilization (critical at high RPS).
- GC pauses if your runtime exposes them.
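The latency panels above can be sketched with two queries, again assuming an illustrative histogram metric and an `endpoint` label:

```promql
# Per-endpoint P99, scoped by a dashboard region variable.
histogram_quantile(
  0.99,
  sum by (le, endpoint) (
    rate(http_request_duration_seconds_bucket{region="$region"}[5m])
  )
)

# For a Grafana heatmap panel, chart the raw buckets instead and set
# the panel's data format to "heatmap" so tails show up over time.
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
```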
Dashboard 3 — Database
Label this for your actual engine (Postgres, CockroachDB, managed RDS, etc.).
Throughput
- Read and write QPS per region.
- Transactions committed vs aborted (if applicable).
Latency
- P99 query or statement latency by query family.
- Consensus or replication latency where relevant (for example Raft).
Consistency / ordering
- Transaction retry rate, contention indicators, replication lag per region.
Saturation and health
- Connections vs max, disk IO, compaction or vacuum backlog for LSM-style stores.
- Under-replicated ranges or unhealthy nodes—non-zero often means risk.
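For a Postgres-flavored example of the lag and saturation panels, something like the following works with a community `postgres_exporter` setup; the exact metric names vary by exporter and version, so treat these as placeholders to verify against your own `/metrics` output:

```promql
# Replication lag in seconds, per replica (name varies by exporter version).
pg_replication_lag_seconds

# Connection saturation: active connections as a fraction of max_connections.
sum(pg_stat_activity_count) / scalar(pg_settings_max_connections)
```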
Dashboard 4 — Cache (Redis or equivalent)
| Metric | Why it matters |
|---|---|
| Hit rate per region | Below expected (for example 80%) → more load on DB |
| Eviction rate | Memory pressure or wrong TTL policy |
| Memory utilization | High utilization risks evictions and tail latency |
| Replication lag (if cross-region cache replication) | Stale secondary region cache |
| Command latency P99 | Should usually stay very low; spikes → network or overload |
| Connected clients | Sudden drops can indicate pool issues upstream |
Trend hit rate over time: gradual decline sometimes precedes DB overload by minutes.
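The per-region hit rate panel can be built from the hit and miss counters that `redis_exporter` exposes, assuming a `region` label has been attached via relabeling:

```promql
# Cache hit ratio per region: hits over (hits + misses).
sum by (region) (rate(redis_keyspace_hits_total[5m]))
/
sum by (region) (
  rate(redis_keyspace_hits_total[5m])
  + rate(redis_keyspace_misses_total[5m])
)
```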
Dashboard 5 — Kafka / streaming (if used)
| Metric | Alert threshold (example) |
|---|---|
| Consumer lag per topic/partition | Above SLA for that pipeline |
| Producer send latency P99 | Should stay low for fire-and-forget |
| Messages in vs out | Divergence → consumer falling behind |
| Under-replicated partitions | Any > 0 → investigate |
| Broker disk | Above ~70% → plan expansion |
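The two highest-signal rows above, as a sketch using metric names from a kafka_exporter-style setup (names differ across exporters, so verify against yours):

```promql
# Worst-case consumer lag per topic and consumer group.
max by (topic, consumergroup) (kafka_consumergroup_lag)

# Under-replicated partitions: any nonzero value warrants a look.
sum(kafka_topic_partition_under_replicated_partition) > 0
```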
Dashboard 6 — Multi-region specific
Metrics many teams forget until an incident:
| Metric | What it tells you |
|---|---|
| Cross-region replication lag | Staleness budget for reads |
| Traffic % per region | Detect unintended failover or bad routing |
| Failover timeline with deploy markers | Correlation with releases |
| Write-region or primary round-trip latency | Budget for funneled writes |
| Inter-region bandwidth (if available) | Cost and saturation signal |
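Traffic share per region is a ratio of regional to global request rate; a sketch with an illustrative `http_requests_total` counter:

```promql
# Fraction of global traffic served by each region. A sudden shift
# can mean unintended failover or a routing misconfiguration.
sum by (region) (rate(http_requests_total[5m]))
  / scalar(sum(rate(http_requests_total[5m])))
```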
Alerting tiers (example)
Structure alerting in tiers so on-call is not flooded:
P1 — Page immediately
- Error rate above threshold for sustained minutes.
- P99 latency above SLO for sustained window.
- Region unreachable or zero healthy backends.
- Data safety signals (for example under-replicated ranges) where applicable.
P2 — Slack / ticket
- Error budget burn rate elevated.
- Cache hit rate drop sustained.
- Consumer lag growing.
- Replication lag beyond agreed bound.
P3 — Backlog
- Disk trending high.
- Connection pools trending high without user impact yet.
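A P1 condition like "error rate above threshold for sustained minutes" is typically a ratio query plus a `for:` duration in the alert rule. As a hedged sketch (threshold and metric names illustrative):

```promql
# 5xx rate above 1% of traffic. Pair this expression with
# `for: 5m` in the Prometheus or Grafana alert rule so it only
# pages when the condition is sustained, not on a single blip.
sum(rate(http_requests_total{code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.01
```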
Grafana implementation tips
- Exemplars: Link Prometheus exemplars to traces so you can jump from a latency spike straight to a representative trace.
- Annotations: Record deploys and failovers on dashboards—many incidents correlate with change.
- Variables: `region`, `environment`, and `service` as dropdowns for fast drill-down.
- Composite panels: RPS, errors, and latency together for pattern recognition.
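Once defined, those variables can scope every panel's query, so one dashboard serves all regions and environments (variable and metric names here are illustrative):

```promql
# $region and $environment resolve to the dashboard dropdown selections.
sum(rate(http_requests_total{region="$region", environment="$environment"}[5m]))
```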