
Alerting

First published by Atif Alam

If you’ve wondered how to turn your SLOs and error budgets into alerts that actually help—without drowning the team in noise—you’re in the right place.

Here you’ll learn how to alert on error budget burn rate and how to design alerts for signal over noise.

Alerting on burn rate (how fast you’re consuming your error budget) focuses you on SLO risk instead of every availability or latency blip.

That reduces noise and keeps attention on what matters: are we at risk of missing our commitment?

What burn rate means — Burn rate is how quickly you’re eating through your error budget.

For example, burning at 14.4x means you’d exhaust a 30-day budget in about two days (720 h ÷ 14.4 ≈ 50 h), consuming roughly 2% of the budget every hour. A high burn rate is an early warning that you need to act.
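The arithmetic above can be sketched directly. This is a minimal illustration, assuming a 30-day SLO window; the function names are made up for the example.

```python
# Burn-rate arithmetic for a 30-day SLO window (names are illustrative).

SLO_WINDOW_HOURS = 30 * 24  # 720 h in a 30-day window


def burn_rate(observed_error_rate: float, allowed_error_rate: float) -> float:
    """Burn rate = observed error rate / error rate the SLO allows.

    With a 99.9% SLO the allowed error rate is 0.001, so an observed
    error rate of 1.44% gives a burn rate of 14.4.
    """
    return observed_error_rate / allowed_error_rate


def hours_to_exhaust(rate: float) -> float:
    """At a constant burn rate, hours until the 30-day budget is gone."""
    return SLO_WINDOW_HOURS / rate


print(round(burn_rate(0.0144, 0.001), 1))   # 14.4
print(round(hours_to_exhaust(14.4), 1))     # 50.0 hours, about 2 days
```

A burn rate of 1.0 means you would spend the budget exactly over the window; anything sustained above 1.0 eventually blows it.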

Fast vs slow burn — Fast burn is a short spike (e.g. a sudden outage); you want to page immediately. Slow burn is sustained degradation (e.g. error rate creeping up over hours); you want to alert before the budget is exhausted so you can fix the cause.

Multi-window burn rate — A common approach: alert when burn rate is high over both a short window (e.g. 1 hour) and a long window (e.g. 6 hours). That avoids flapping on brief blips while still catching sustained issues.

For how error budgets are defined and used, see Error budgets and SLOs, SLIs & SLAs.

Signal vs noise — Alert when something requires human action. Avoid alerting on every blip or on metrics that don’t correlate with user impact.

If an alert fires and no one needs to do anything, it’s noise.

Alert fatigue — Too many alerts lead to ignoring them. Prefer fewer, higher-signal alerts.

For example, error budget burn rate is usually a better primary alert than raw error rate spikes—it answers “are we at risk?” instead of “did something change?”

SLO-based vs symptom-based — SLO and error-budget alerts answer “are we at risk of missing our commitment?”

Symptom-based alerts (e.g. CPU spike, error rate jump) answer “something changed—investigate.”

Use both: SLO-based for user impact and prioritization; symptom-based for early warning and root cause when you’re already responding.

See Infrastructure metrics for how CPU, memory, and disk fit in.

Actionable alerts — Each alert should have a clear owner and action: a runbook link, severity, and escalation path. When an alert fires, the responder should know what to do next.
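One way to make that concrete is to treat owner, runbook, severity, and escalation path as required fields on every alert definition. A small sketch, with entirely hypothetical field values and a made-up check:

```python
# Sketch: metadata an actionable alert carries (all values are hypothetical).

alert = {
    "name": "HighErrorBudgetBurn",
    "severity": "page",                    # page vs ticket
    "owner": "checkout-team",              # a clear owner
    "runbook": "https://runbooks.example.com/high-burn",  # hypothetical URL
    "escalation": ["primary-on-call", "team-lead"],
}

REQUIRED = ("owner", "runbook", "severity", "escalation")


def is_actionable(a: dict) -> bool:
    """An alert missing an owner, runbook, severity, or escalation path
    is likely to fire into a void, i.e. to be noise."""
    return all(a.get(field) for field in REQUIRED)


print(is_actionable(alert))                      # True
print(is_actionable({"name": "OrphanAlert"}))    # False: no owner, no runbook
```

A check like this can run in CI over your alert rules, so unowned or runbook-less alerts never ship.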

Alerts feed detection—the first step in the Incident lifecycle.

The responder verifies the symptom, declares severity, and opens an incident if needed.

Well-designed alerts make that flow smoother: fewer false positives, clearer ownership, and runbooks at hand.

Chaos experiments and synthetic testing frequently reveal alerting gaps—missing alerts, noisy alerts, or alerts that fire too late. Use them to validate your alerting setup.