SLOs, SLIs & SLAs

By Atif Alam

When you run a service, you want to answer two questions:

  1. How is the service behaving for users?
  2. How good is “good enough”?

That maps to:

  • SLI (Service Level Indicator): what you measure about user experience.
  • SLO (Service Level Objective): the target you want to hit for that measurement.
  • SLA (Service Level Agreement): a customer-facing contract with consequences if you miss it. SLOs are usually set stricter than SLAs so you have headroom before breaching a commitment.

An SLI is a specific, measurable metric that describes how well a service is behaving from a user’s point of view.

Good SLIs are:

  • User-centric — they reflect what users experience
  • Quantitative — a number you can compute
  • Well-defined — clear formula, window, and scope

Metrics like CPU, memory, and disk are infrastructure metrics that support debugging and capacity planning, but they don’t directly tell you what users felt.

See Infrastructure metrics for how they relate to SLIs.

Availability / Success Rate

Question: “Are requests succeeding?”

Formula: Success Rate = (good requests) / (total requests)

“Good” should be explicitly defined: for example, HTTP 2xx (or 2xx/3xx) plus a successful business outcome (e.g., order_created=true). Client-canceled requests may optionally be excluded, but the exclusion must be defined just as clearly.
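
As a minimal sketch, the success-rate SLI could be computed like this (field names such as status, order_created, and client_canceled are illustrative, not from any particular system):

```python
# Success-rate SLI sketch. Field names (status, order_created,
# client_canceled) are illustrative, not from any particular system.

def success_rate(requests):
    """Good = HTTP 2xx AND successful business outcome; client-canceled excluded."""
    counted = [r for r in requests if not r.get("client_canceled", False)]
    if not counted:
        return None  # no traffic in the window: the SLI is undefined, not 100%
    good = [r for r in counted
            if 200 <= r["status"] < 300 and r.get("order_created", False)]
    return len(good) / len(counted)

reqs = [
    {"status": 200, "order_created": True},
    {"status": 200, "order_created": False},   # 2xx but business failure
    {"status": 500, "order_created": False},
    {"status": 499, "client_canceled": True},  # excluded by definition
]
print(success_rate(reqs))  # 1 good / 3 counted
```

Note the second request: it returned 2xx but the order was never created, which is exactly why “good” must include business correctness, not just status codes.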

Latency

Question: “How fast is it for users?”

Example: p95 latency for POST /payment is 280ms (last 7 days).

Averages hide user pain. A small number of very slow requests may barely move the average but still harm UX — percentiles (p95, p99) reveal the real picture. See Latency percentiles for how to interpret and set targets.
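
To see why, here is a small self-contained illustration using a nearest-rank percentile (the traffic numbers are synthetic):

```python
# 2% of requests are very slow. The mean rises only modestly and p95
# misses them entirely; p99 exposes them. Traffic numbers are synthetic.
latencies_ms = [100] * 98 + [8000] * 2

mean = sum(latencies_ms) / len(latencies_ms)  # 258.0

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ranked = sorted(values)
    k = -(-p * len(ranked) // 100) - 1  # ceil(p/100 * n) - 1
    return ranked[max(0, k)]

for p in (50, 95, 99):
    print(f"p{p} = {percentile(latencies_ms, p)}ms  (mean = {mean:.0f}ms)")
```

Here p50 and p95 are both 100ms while p99 is 8000ms: one user in fifty is waiting eight seconds, and neither the mean nor p50 would have told you.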

Correctness

Question: “Is the service returning the right thing?”

Examples:

  • Search: % of searches returning at least one relevant result (judged sample)
  • API: % responses consistent with expected business rules

Freshness (Data Systems)

Question: “How up-to-date is the data?”

Example: p99 “event time to queryable” data lag is under 5 minutes.
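
A minimal sketch of checking that kind of target, assuming per-event lag samples (queryable time minus event time, in seconds) have already been collected; the sample values are illustrative:

```python
import statistics

# Freshness SLI sketch: lag = (time the event became queryable) - (event time).
# Target: p99 lag under 5 minutes. Lag samples below are illustrative.
lags_seconds = [30, 45, 60, 90, 120, 150, 180, 200, 240, 280]

p99 = statistics.quantiles(lags_seconds, n=100, method="inclusive")[98]
target_seconds = 5 * 60
print(f"p99 lag = {p99:.0f}s, target met: {p99 <= target_seconds}")
```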

Durability / Loss (Storage, Queues, Streaming)

Question: “Do we lose data or messages?”

Examples:

  • Message loss rate = lost / produced
  • Successful write rate for storage operations

User Journeys (Synthetic and Real Users)

Question: “Does it work for key paths and real users?”

Examples:

  • % of sessions that complete checkout with no client-side errors
  • % of app launches without a crash

An SLO takes an SLI and adds:

  • A target/threshold (how good it must be)
  • A time window (over what period)
  • Sometimes scope details (which endpoints, regions, tenants)

Example:

  • SLI: Success rate for POST /checkout
  • SLO: Success rate ≥ 99.9% over rolling 30 days

A quick rule of thumb:

  • If it’s a computed measurement → SLI
  • If it’s a promise/goal about that measurement + a time window → SLO

Take this SLO: “POST /checkout request success rate is ≥ 99.9% over the last 30 days.”

The SLI part (the measurement):

  • Name: Checkout success rate
  • Formula: (good checkout requests) / (total checkout requests)
  • Definition of “good”: HTTP 2xx, business outcome success, documented exclusions
  • Scope: POST /checkout, prod traffic, specific regions/tenants if applicable

The SLO part (the commitment):

  • Target: ≥ 99.9%
  • Window: rolling 30 days
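
Putting the two parts together, here is a sketch of evaluating such an SLO from good/total request counts over the window, including the remaining error budget (all numbers illustrative):

```python
# Sketch: evaluate an SLO (target over a window) from good/total request
# counts, reporting the remaining error budget. All numbers are illustrative.

def evaluate_slo(good, total, target=0.999):
    sli = good / total
    budget = (1 - target) * total  # bad requests the window can absorb
    bad = total - good
    return {
        "sli": sli,
        "met": sli >= target,
        "budget_total": budget,
        "budget_remaining": budget - bad,
    }

# 30-day window: 1,000,000 requests, 800 failures
result = evaluate_slo(good=999_200, total=1_000_000)
print(result)  # SLO met, roughly 200 of ~1000 allowed failures left
```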

These are strong defaults because they are user/outcome-centric and hard to game.

Backend API

  1. Request success rate — % successful and valid outcomes (not just 2xx; include business correctness)
  2. Tail latency — p95 (and often p99) for critical endpoints
  3. Canary/synthetic availability — % time a canary request succeeds from multiple locations
  4. Throttling impact — % requests rejected due to throttling (429s)
  5. Correctness — % reads consistent with expected version; % writes durable

Typical pick: success rate + p95 latency + canary availability.

Frontend / Web App

  1. User journey completion rate — % sessions completing core flow (signup, checkout)
  2. Real-user performance (RUM) — p75/p95 LCP and INP for key pages
  3. Front-end error rate — JS exceptions per session + failed network calls
  4. Page/route availability — % page loads returning successful HTML + assets
  5. Conversion-impact latency — p95 time-to-interactive for checkout/critical routes

Typical pick: journey completion + LCP/INP p75 + JS error rate.

Data Pipelines

  1. Freshness / end-to-end lag — event time to queryable lag (p95/p99)
  2. Completeness — % expected records/partitions present vs baseline
  3. Quality pass rate — % rows passing validation (schema, ranges, referential integrity)
  4. Pipeline success rate — % jobs completing successfully (or streaming uptime)
  5. Duplication / loss rate — % duplicates or missing IDs

Typical pick: freshness + completeness + quality pass rate.

Kubernetes Platform (Cluster as a Product)

  1. Scheduling success and time — % pods scheduled successfully; p95 time-to-schedule
  2. Capacity availability — % time workloads have needed capacity (no prolonged Pending)
  3. Control-plane availability and latency — kube-apiserver success rate + p95 latency
  4. Networking SLI — in-cluster service-to-service success rate and p95 latency (via probes)
  5. Upgrade reliability — % upgrades completed without user-impacting disruption

Typical pick: time-to-schedule + API availability/latency + service networking success/latency.

When you write an SLI definition, be explicit about:

  • Numerator / denominator — what counts as “good” vs “total”
  • Scope — which endpoints/journeys, which environment (prod), which regions
  • Percentiles vs thresholds — e.g., p95 latency OR “% under 300ms”
  • Exclusions — client-canceled requests, test traffic, known bots (only if justified)
  • Measurement source — server metrics, synthetic checks, RUM, logs, tracing, pipeline metadata

The goal is a metric that is stable, meaningful, and clearly tied to what users experience.
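
One way to make such a definition explicit and reviewable is to record it as plain data alongside the code; every field value here is illustrative:

```python
# An SLI definition captured as plain data: numerator, denominator, scope,
# exclusions, and measurement source. Every field value is illustrative.
checkout_success_sli = {
    "name": "Checkout success rate",
    "numerator": "requests with HTTP 2xx AND order_created=true",
    "denominator": "all POST /checkout requests",
    "scope": {"endpoint": "POST /checkout", "environment": "prod"},
    "exclusions": ["client-canceled requests", "synthetic test traffic"],
    "measurement_source": "server metrics",
}
print(checkout_success_sli["name"])
```

Keeping the definition in one reviewable artifact makes it much harder for the numerator, denominator, and exclusions to drift apart across dashboards and alerts.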

Choosing a small set of meaningful SLIs and setting realistic targets is the start.

The real value comes from feeding them into alerting and incident response — for example, alerting when error budget is burning too fast rather than on every blip.
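
A common way to express “burning too fast” is a burn rate: the observed error rate divided by the budgeted error rate. A sketch, using the widely cited one-hour fast-burn threshold of 14.4 for a 30-day 99.9% SLO (sustaining that rate for an hour consumes 2% of the month's budget); the observed numbers are illustrative:

```python
# Burn rate = observed error rate / budgeted error rate.
# 1.0 means the error budget would be used up exactly over the full SLO
# window; a sustained 14.4 over one hour consumes 2% of a 30-day budget,
# a commonly cited paging threshold. Numbers below are illustrative.

def burn_rate(error_rate, slo_target=0.999):
    budget_fraction = 1 - slo_target  # e.g., 0.1% of requests may fail
    return error_rate / budget_fraction

rate = burn_rate(error_rate=0.02)  # 2% errors observed in the last hour
print(f"burn rate = {rate:.1f}: {'page' if rate > 14.4 else 'ok'}")
```

Alerting on burn rate rather than raw error rate means a brief blip that barely dents the budget stays quiet, while a fault that would exhaust the month's budget in hours pages immediately.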

See Error budgets for how error budgets inform when to slow feature work and invest in reliability. See Error rate and throughput for what those mean as SLIs. See Availability and the nines for what the nines mean and how availability is measured.