SLOs, SLIs & SLAs

By Atif Alam

When you run a service, you want to answer two questions:

  1. How is the service behaving for users?
  2. How good is “good enough”?

That maps to:

  • SLI (Service Level Indicator): what you measure about user experience.
  • SLO (Service Level Objective): the target you want to hit for that measurement.
  • SLA (Service Level Agreement): a customer-facing contract with consequences if you miss it. SLOs are usually set stricter than SLAs so you have headroom before breaching a commitment.

An SLI is a specific, measurable metric that describes how well a service is behaving from a user’s point of view.

Good SLIs are:

  • User-centric — they reflect what users experience
  • Quantitative — a number you can compute
  • Well-defined — clear formula, window, and scope

Metrics like CPU, memory, and disk are infrastructure metrics that support debugging and capacity planning, but they don’t directly tell you what users felt.

See Infrastructure metrics for how they relate to SLIs.

Availability / Success Rate

Question: “Are requests succeeding?”

Formula: Success Rate = (good requests) / (total requests)

“Good” should be explicitly defined: for example, HTTP 2xx (or 2xx/3xx) plus a successful business outcome (e.g., order_created=true). Client-canceled requests may optionally be excluded, but the exclusion must be defined just as clearly.
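
As a minimal sketch, the success-rate SLI could be computed like this (field names such as status, order_created, and client_canceled are illustrative, not from any particular system):

```python
# Success-rate SLI sketch. Field names (status, order_created,
# client_canceled) are illustrative, not from any particular system.

def success_rate(requests):
    """Good = HTTP 2xx AND successful business outcome; client-canceled excluded."""
    counted = [r for r in requests if not r.get("client_canceled", False)]
    if not counted:
        return None  # no traffic in the window: the SLI is undefined, not 100%
    good = [r for r in counted
            if 200 <= r["status"] < 300 and r.get("order_created", False)]
    return len(good) / len(counted)

reqs = [
    {"status": 200, "order_created": True},
    {"status": 200, "order_created": False},   # 2xx but business failure
    {"status": 500, "order_created": False},
    {"status": 499, "client_canceled": True},  # excluded by definition
]
print(success_rate(reqs))  # 1 good / 3 counted
```

Note the second request: it returned 2xx but the order was never created, which is exactly why “good” must include business correctness, not just status codes.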

Latency

Question: “How fast is it for users?”

Example: p95 latency for POST /payment is 280ms (last 7 days).

Averages hide user pain. A small number of very slow requests may barely move the average but still harm UX — percentiles (p95, p99) reveal the real picture. See Latency percentiles for how to interpret and set targets.
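
To see why, here is a small self-contained illustration using a nearest-rank percentile (the traffic numbers are synthetic):

```python
# 2% of requests are very slow. The mean rises only modestly and p95
# misses them entirely; p99 exposes them. Traffic numbers are synthetic.
latencies_ms = [100] * 98 + [8000] * 2

mean = sum(latencies_ms) / len(latencies_ms)  # 258.0

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ranked = sorted(values)
    k = -(-p * len(ranked) // 100) - 1  # ceil(p/100 * n) - 1
    return ranked[max(0, k)]

for p in (50, 95, 99):
    print(f"p{p} = {percentile(latencies_ms, p)}ms  (mean = {mean:.0f}ms)")
```

Here p50 and p95 are both 100ms while p99 is 8000ms: one user in fifty is waiting eight seconds, and neither the mean nor p50 would have told you.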

Correctness

Question: “Is the service returning the right thing?”

Examples:

  • Search: % of searches returning at least one relevant result (judged sample)
  • API: % responses consistent with expected business rules

Freshness (Data Systems)

Question: “How up-to-date is the data?”

Example: p99 “event time to queryable” data lag is under 5 minutes.
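
A minimal sketch of checking that kind of target, assuming per-event lag samples (queryable time minus event time, in seconds) have already been collected; the sample values are illustrative:

```python
import statistics

# Freshness SLI sketch: lag = (time the event became queryable) - (event time).
# Target: p99 lag under 5 minutes. Lag samples below are illustrative.
lags_seconds = [30, 45, 60, 90, 120, 150, 180, 200, 240, 280]

p99 = statistics.quantiles(lags_seconds, n=100, method="inclusive")[98]
target_seconds = 5 * 60
print(f"p99 lag = {p99:.0f}s, target met: {p99 <= target_seconds}")
```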

Durability / Loss (Storage, Queues, Streaming)

Question: “Do we lose data or messages?”

Examples:

  • Message loss rate = lost / produced
  • Successful write rate for storage operations

User Journeys (Synthetic and Real Users)

Question: “Does it work for key paths and real users?”

Examples:

  • % of sessions that complete checkout with no client-side errors
  • % of app launches without a crash

An SLO takes an SLI and adds:

  • A target/threshold (how good it must be)
  • A time window (over what period)
  • Sometimes scope details (which endpoints, regions, tenants)

Example:

  • SLI: Success rate for POST /checkout
  • SLO: Success rate ≥ 99.9% over rolling 30 days

A quick rule of thumb:

  • If it’s a computed measurement → SLI
  • If it’s a promise/goal about that measurement + a time window → SLO

Take this SLO: “POST /checkout request success rate is ≥ 99.9% over the last 30 days.”

The SLI part (the measurement):

  • Name: Checkout success rate
  • Formula: (good checkout requests) / (total checkout requests)
  • Definition of “good”: HTTP 2xx, business outcome success, documented exclusions
  • Scope: POST /checkout, prod traffic, specific regions/tenants if applicable

The SLO part (the commitment):

  • Target: ≥ 99.9%
  • Window: rolling 30 days
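
Putting the two parts together, here is a sketch of evaluating such an SLO from good/total request counts over the window, including the remaining error budget (all numbers illustrative):

```python
# Sketch: evaluate an SLO (target over a window) from good/total request
# counts, reporting the remaining error budget. All numbers are illustrative.

def evaluate_slo(good, total, target=0.999):
    sli = good / total
    budget = (1 - target) * total  # bad requests the window can absorb
    bad = total - good
    return {
        "sli": sli,
        "met": sli >= target,
        "budget_total": budget,
        "budget_remaining": budget - bad,
    }

# 30-day window: 1,000,000 requests, 800 failures
result = evaluate_slo(good=999_200, total=1_000_000)
print(result)  # SLO met, roughly 200 of ~1000 allowed failures left
```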

These are strong defaults because they are user/outcome-centric and hard to game.

Backend API

  1. Request success rate — % successful and valid outcomes (not just 2xx; include business correctness)
  2. Tail latency — p95 (and often p99) for critical endpoints
  3. Canary/synthetic availability — % time a canary request succeeds from multiple locations
  4. Throttling impact — % requests rejected due to throttling (429s)
  5. Correctness — % reads consistent with expected version; % writes durable

Typical pick: success rate + p95 latency + canary availability.

Frontend / Web App

  1. User journey completion rate — % sessions completing core flow (signup, checkout)
  2. Real-user performance (RUM) — p75/p95 LCP and INP for key pages
  3. Front-end error rate — JS exceptions per session + failed network calls
  4. Page/route availability — % page loads returning successful HTML + assets
  5. Conversion-impact latency — p95 time-to-interactive for checkout/critical routes

Typical pick: journey completion + LCP/INP p75 + JS error rate.

Data Pipelines

  1. Freshness / end-to-end lag — event time to queryable lag (p95/p99)
  2. Completeness — % expected records/partitions present vs baseline
  3. Quality pass rate — % rows passing validation (schema, ranges, referential integrity)
  4. Pipeline success rate — % jobs completing successfully (or streaming uptime)
  5. Duplication / loss rate — % duplicates or missing IDs

Typical pick: freshness + completeness + quality pass rate.

Kubernetes Platform (Cluster as a Product)

  1. Scheduling success and time — % pods scheduled successfully; p95 time-to-schedule
  2. Capacity availability — % time workloads have needed capacity (no prolonged Pending)
  3. Control-plane availability and latency — kube-apiserver success rate + p95 latency
  4. Networking SLI — in-cluster service-to-service success rate and p95 latency (via probes)
  5. Upgrade reliability — % upgrades completed without user-impacting disruption

Typical pick: time-to-schedule + API availability/latency + service networking success/latency.

When you write an SLI definition, be explicit about:

  • Numerator / denominator — what counts as “good” vs “total”
  • Scope — which endpoints/journeys, which environment (prod), which regions
  • Percentiles vs thresholds — e.g., p95 latency OR “% under 300ms”
  • Exclusions — client-canceled requests, test traffic, known bots (only if justified)
  • Measurement source — server metrics, synthetic checks, RUM, logs, tracing, pipeline metadata

The goal is a metric that is stable, meaningful, and clearly tied to what users experience.
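
One way to make such a definition explicit and reviewable is to record it as plain data alongside the code; every field value here is illustrative:

```python
# An SLI definition captured as plain data: numerator, denominator, scope,
# exclusions, and measurement source. Every field value is illustrative.
checkout_success_sli = {
    "name": "Checkout success rate",
    "numerator": "requests with HTTP 2xx AND order_created=true",
    "denominator": "all POST /checkout requests",
    "scope": {"endpoint": "POST /checkout", "environment": "prod"},
    "exclusions": ["client-canceled requests", "synthetic test traffic"],
    "measurement_source": "server metrics",
}
print(checkout_success_sli["name"])
```

Keeping the definition in one reviewable artifact makes it much harder for the numerator, denominator, and exclusions to drift apart across dashboards and alerts.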

Choosing a small set of meaningful SLIs and setting realistic targets is the start.

The real value comes from feeding them into alerting and incident response — for example, alerting when error budget is burning too fast rather than on every blip.
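
A common way to express “burning too fast” is a burn rate: the observed error rate divided by the budgeted error rate. A sketch, using the widely cited one-hour fast-burn threshold of 14.4 for a 30-day 99.9% SLO (sustaining that rate for an hour consumes 2% of the month's budget); the observed numbers are illustrative:

```python
# Burn rate = observed error rate / budgeted error rate.
# 1.0 means the error budget would be used up exactly over the full SLO
# window; a sustained 14.4 over one hour consumes 2% of a 30-day budget,
# a commonly cited paging threshold. Numbers below are illustrative.

def burn_rate(error_rate, slo_target=0.999):
    budget_fraction = 1 - slo_target  # e.g., 0.1% of requests may fail
    return error_rate / budget_fraction

rate = burn_rate(error_rate=0.02)  # 2% errors observed in the last hour
print(f"burn rate = {rate:.1f}: {'page' if rate > 14.4 else 'ok'}")
```

Alerting on burn rate rather than raw error rate means a brief blip that barely dents the budget stays quiet, while a fault that would exhaust the month's budget in hours pages immediately.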

See Error budgets for how error budgets inform when to slow feature work and invest in reliability. See Error rate and throughput for what those mean as SLIs. See Availability and the nines for what the nines mean and how availability is measured.