Error Budgets
An error budget is the amount of “failure” you can afford in a window while still meeting your SLO.
It turns SLOs into a concrete number (e.g. “we can have 43 minutes of downtime this month”) and informs when to alert, when to pause releases, and when to invest in reliability.
What it is
If your availability SLO is 99.9%, your budget is the remaining 0.1%: the share of requests that may fail, or the share of time the service may be down, without breaching the SLO.
For a 30-day window, that’s about 43 minutes of downtime (0.1% of 43,200 minutes). The same idea applies to latency SLOs: the budget is the percentage of requests allowed to exceed your latency threshold (e.g. with a p95 target, 5% of requests may be slower than the threshold).
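As a minimal sketch of the arithmetic above, the following Python computes the budget for an availability SLO, both as downtime and as a count of failed requests. The function names and the one-million-request figure are illustrative assumptions, not part of any particular tool.

```python
from datetime import timedelta


def downtime_budget(slo: float, window: timedelta) -> timedelta:
    """Return the downtime allowed in `window` while still meeting `slo`.

    `slo` is a fraction, e.g. 0.999 for 99.9%.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return window * (1.0 - slo)


def request_budget(slo: float, total_requests: int) -> int:
    """Return how many of `total_requests` may fail (nearest whole request)."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return round(total_requests * (1.0 - slo))


# 99.9% availability over a 30-day window.
budget = downtime_budget(0.999, timedelta(days=30))
print(f"{budget.total_seconds() / 60:.1f} minutes of allowed downtime")

# Hypothetical traffic volume: one million requests in the window.
print(request_budget(0.999, 1_000_000), "requests may fail")
```

Expressing the budget as a count of requests is often easier to track than minutes of downtime, because partial outages consume it in proportion to the traffic they affect.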
How it’s used
- Alerting: Alert when the burn rate is high (the budget is being consumed too fast) rather than on every blip. That reduces noise and focuses attention on real risk. See Alerting for burn rate and multi-window details.
- Release decisions: When the budget is exhausted, slow or pause feature work until the budget recovers or you’ve fixed the cause. That keeps reliability and delivery in balance. See Change Risk and Deployment SLOs for how error budgets gate deployments.
- Reliability work: Use the remaining budget to decide when to invest in fixes and hardening. If you’re burning budget fast, invest in reliability; if you have headroom, you can ship more.
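The alerting and release uses above can be sketched in a few lines of Python. Burn rate is taken here as the observed error rate divided by the error rate the SLO allows, so a value of 1.0 means the budget would be spent exactly at the end of the window. The function names and the alert threshold of 10.0 are assumptions chosen for illustration; real thresholds depend on your window and alerting policy.

```python
def burn_rate(failed: int, total: int, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total == 0:
        return 0.0  # no traffic, so nothing was burned
    return (failed / total) / (1.0 - slo)


def budget_remaining(failed_so_far: int, expected_total: int, slo: float) -> float:
    """Fraction of the window's budget still unspent (negative if overspent).

    `expected_total` is the request volume expected over the whole window.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    allowed_failures = expected_total * (1.0 - slo)
    return 1.0 - failed_so_far / allowed_failures


def should_alert(rate: float, threshold: float = 10.0) -> bool:
    """Alert only when the budget is burning much faster than sustainable."""
    return rate >= threshold


def can_release(remaining: float) -> bool:
    """Gate feature releases on having some budget left."""
    return remaining > 0.0
```

With a 99.9% SLO, 50 failures out of 10,000 recent requests gives a burn rate of roughly 5, which is elevated but below the hypothetical threshold. A single short spike barely moves `budget_remaining`, which is why alerting on burn rate is quieter than alerting on every failed request.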
Connection to SLOs
Error budgets are derived from SLOs. Define your SLOs first, then compute the budget and feed it into alerting and release policy. For the full framework, see SLOs, SLIs & SLAs.