
Error Rate and Throughput

By Atif Alam

Two other SLIs show up frequently in requirements and SLOs: error rate (how often requests fail) and throughput (how much traffic the system handles).

This page explains what they mean and when to use them.

Error rate

Definition — The proportion of requests that fail. You define “fail” (e.g. 5xx responses, timeouts, or your own rules).

How it’s measured — Failed requests / total requests over a time window.

Relation to availability — When availability is defined as success rate, error rate = 1 − availability. They’re two sides of the same coin: more failed requests mean lower availability.
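The measurement above can be sketched in a few lines. This is a minimal illustration, not a standard API: the function name, the `(timestamp, succeeded)` record shape, and the window length are all assumptions, and what counts as “failed” is decided before the data reaches it.

```python
from datetime import datetime, timedelta, timezone

def error_rate(requests, window, now=None):
    """Failed requests / total requests over a trailing time window.

    `requests` is a list of (timestamp, succeeded) pairs. Whether a
    request "failed" (5xx, timeout, custom rule) is decided upstream.
    """
    now = now or datetime.now(timezone.utc)
    recent = [ok for ts, ok in requests if now - ts <= window]
    if not recent:
        return 0.0  # no traffic in the window: report zero errors
    failed = sum(1 for ok in recent if not ok)
    return failed / len(recent)

# When availability is defined as success rate, it is the complement:
#   availability = 1 - error_rate(requests, window)
```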

Example — “Rollback if error rate > 5%,” “monitor error rate,” “alert when error rate spikes.”

Set thresholds and rollback criteria from your SLO and risk tolerance (e.g. 1% error rate might be acceptable for some endpoints, not for others).
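A per-endpoint threshold check like the one described above might look as follows. The endpoint paths, threshold values, and function name are hypothetical, chosen only to show thresholds varying with risk tolerance; real values come from your SLO.

```python
# Hypothetical per-endpoint error-rate limits derived from SLOs.
THRESHOLDS = {
    "/search": 0.01,     # 1% may be acceptable for a best-effort endpoint
    "/checkout": 0.001,  # 0.1% for a business-critical endpoint
}
DEFAULT_THRESHOLD = 0.05  # e.g. "rollback if error rate > 5%"

def should_roll_back(endpoint, observed_error_rate):
    """Return True when the observed error rate breaches the endpoint's limit."""
    limit = THRESHOLDS.get(endpoint, DEFAULT_THRESHOLD)
    return observed_error_rate > limit
```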

Throughput

When to use as an SLI — Capacity planning, scaling decisions, or whenever “how much” matters: events per second, requests per second (QPS), or writes per second.

How it appears in docs — “Read throughput,” “write throughput,” “high throughput,” “QPS.”

Throughput is typically a target or capacity indicator rather than a user-facing SLO. It answers “can we handle N requests per second?” rather than “how fast did each request complete?”

In practice — Often paired with latency: e.g. “support N QPS at p95 < X ms.”
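A combined check of the form “support N QPS at p95 < X ms” can be sketched from one measurement interval. This is an illustration under assumptions: the function name and record shape are invented, and p95 is computed with the nearest-rank method (the smallest sample with at least 95% of values at or below it).

```python
def meets_target(latencies_ms, duration_s, qps_target, p95_target_ms):
    """Check "support N QPS at p95 < X ms" for one measurement interval.

    `latencies_ms` holds one latency sample per completed request
    observed during `duration_s` seconds.
    """
    qps = len(latencies_ms) / duration_s
    ordered = sorted(latencies_ms)
    # Nearest-rank p95: index ceil(0.95 * n) - 1 into the sorted samples.
    p95 = ordered[max(0, -(-len(ordered) * 95 // 100) - 1)]
    return qps >= qps_target and p95 < p95_target_ms
```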

See Latency percentiles and targets for latency targets.

Both are SLIs you can set SLO targets for (e.g. “error rate < 0.1%,” “throughput ≥ N QPS”). For the full framework, see SLOs, SLIs & SLAs.