Observability Overview
Observability is how you understand what your system is doing from the outside—without constantly changing code to add one-off checks.
It rests on three pillars:
- Metrics (counters, gauges, histograms) answer "how much?" and "how fast?"
- Logs (timestamped events with context) answer "what happened?"
- Traces (request flows across services) answer "where did time go?"
Together they let you detect issues, debug failures, and reason about behavior in production.
The goal is simple: get from “something’s wrong” to “here’s why” quickly.
That means instrumenting the paths that matter, correlating logs and traces, and defining alerts and dashboards that reflect real user impact.
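Correlating logs and traces usually comes down to stamping every log line with the ID of the trace it belongs to. A minimal sketch (the field names and the `log_event` helper are illustrative, not from any specific logging library):

```python
import json
import time
import uuid

def log_event(level, message, trace_id, **fields):
    """Emit a structured (JSON) log line that carries the trace ID,
    so logs and traces for the same request can be joined later."""
    record = {
        "ts": time.time(),
        "level": level,
        "msg": message,
        "trace_id": trace_id,  # same ID propagated through the whole trace
        **fields,
    }
    print(json.dumps(record))
    return record

# One trace ID per request, reused across every log line it produces.
trace_id = uuid.uuid4().hex
log_event("INFO", "checkout started", trace_id, user_id=42)
log_event("ERROR", "payment timeout", trace_id, upstream="payments")
```

With this in place, "here's why" often starts by grepping logs for one `trace_id` and comparing what you find against the trace's span timings.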
This section is about building and using monitoring, logging, and tracing so you can understand and improve how your system behaves.
- SLOs, SLIs & SLAs — Defining, measuring, and committing to service-level objectives, indicators, and agreements.
- Latency percentiles and targets — What p50, p95, p99 mean and how to set latency SLO targets.
- Availability and the nines — What “99.9%” means and how availability is measured.
- Error rate and throughput — What error rate and throughput mean as SLIs, and how they show up in practice (e.g., rollback thresholds, QPS targets).
- Infrastructure metrics — CPU, memory, disk, and other resource metrics; how they relate to SLIs.
- Error budgets — How much failure you can afford; alerting and release decisions.
- Alerting — SLO-based alerting (burn rate), alert design, reducing noise and fatigue.
- Cost observability — Cost per request/tenant, cloud cost visibility, FinOps basics, right-sizing.
- SLO lifecycle — Negotiating, reviewing, evolving, and retiring SLOs; stakeholder alignment.
- Reliability metrics — MTTR, MTTD, MTBF: definitions, measurement, and how they relate to incidents and SLOs.
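To make the percentile items concrete: p50, p95, and p99 are points in the sorted distribution of observed latencies, not averages. A minimal nearest-rank sketch (real monitoring systems typically approximate percentiles from histogram buckets rather than raw samples):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest observed value such that
    at least p% of the samples are <= it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# A handful of request latencies in milliseconds; note how the
# slow outliers dominate the tail percentiles but not the median.
latencies_ms = [12, 15, 14, 200, 16, 13, 18, 950, 17, 14]
p50 = percentile(latencies_ms, 50)  # median experience
p95 = percentile(latencies_ms, 95)  # tail experience
p99 = percentile(latencies_ms, 99)  # worst ~1% of requests
```

This is why latency SLO targets are usually stated per percentile ("p99 < 300 ms") rather than as an average, which the outliers above would barely move.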
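The "nines" and error budgets are two views of the same arithmetic: an availability target of 99.9% leaves 0.1% of the window as allowable downtime. A quick sketch of that conversion (the 30-day window is an assumption; pick whatever window your SLO uses):

```python
def error_budget_minutes(availability_target, window_days=30):
    """Minutes of downtime permitted per window for a given
    availability target, e.g. 0.999 for 'three nines'."""
    total_minutes = window_days * 24 * 60
    return (1 - availability_target) * total_minutes

budget_999 = error_budget_minutes(0.999)    # three nines: ~43.2 min / 30 days
budget_9999 = error_budget_minutes(0.9999)  # four nines:  ~4.3 min / 30 days
```

Each extra nine cuts the budget by a factor of ten, which is why "how many nines?" is really a question about how much failure you can afford.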
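SLO-based (burn-rate) alerting pages on how fast the error budget is being consumed rather than on raw error counts. A minimal sketch, where the 14.4 paging threshold follows the commonly cited Google SRE workbook example (a burn rate of 14.4 over one hour spends 2% of a 30-day budget); actual thresholds and windows vary by team:

```python
def burn_rate(observed_error_rate, budget_fraction):
    """Ratio of the observed error rate to the rate the SLO allows.
    1.0 means exactly on budget; above 1.0 the budget is shrinking."""
    return observed_error_rate / budget_fraction

# A 99.9% availability SLO allows an error-rate budget of 0.1%.
BUDGET = 0.001
PAGE_THRESHOLD = 14.4  # assumed 1-hour fast-burn threshold

rate = burn_rate(observed_error_rate=0.02, budget_fraction=BUDGET)
should_page = rate >= PAGE_THRESHOLD
```

Alerting on burn rate rather than instantaneous errors is one of the main levers for reducing noise: a brief blip barely moves the budget, while a sustained burn pages quickly.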