Infrastructure Metrics

First published by Atif Alam

If you’ve wondered where CPU, memory, and disk fit alongside availability and latency, you’re in the right place.

Here you’ll learn what infrastructure metrics (or resource metrics) are, how they differ from SLIs, and how they help you understand and explain what’s happening when things go wrong.

Infrastructure metrics answer “how is the system using resources?” rather than “how does the user experience the service?”

Typical examples:

  • CPU utilization — How much of the CPU is in use (e.g. percentage).
  • Memory usage — How much memory is in use vs available.
  • Disk I/O and capacity — Read/write throughput and how much disk space is free.
  • Network I/O — Traffic in and out.

It helps to distinguish utilization (e.g. CPU %, memory in use) from capacity or headroom (e.g. disk space free, memory available).

Both are infrastructure metrics, and both can predict or explain when SLIs like latency or availability start to suffer.
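The utilization/headroom distinction can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring agent: `ResourceReading` and `disk_reading` are hypothetical names, and only the standard library's `shutil.disk_usage` is used to get real numbers.

```python
import shutil
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceReading:
    """One sample of a resource, in whatever unit the caller uses (bytes here)."""
    used: float
    available: float

    @property
    def utilization_pct(self) -> float:
        # Utilization: share of the usable total that is currently in use.
        total = self.used + self.available
        return 100.0 * self.used / total if total else 0.0

    @property
    def headroom(self) -> float:
        # Headroom: how much is left before the resource is exhausted.
        return self.available


def disk_reading(path: str = ".") -> ResourceReading:
    """Read disk usage for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return ResourceReading(used=usage.used, available=usage.free)


if __name__ == "__main__":
    reading = disk_reading()
    print(f"disk utilization: {reading.utilization_pct:.1f}%")
    print(f"disk headroom:    {reading.headroom / 1e9:.1f} GB")
```

The same reading yields both views: a percentage that is easy to compare across hosts, and an absolute headroom figure that says how much room is actually left.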

Supporting and leading indicators — High CPU or memory can precede or explain latency or error rate problems.

For example, CPU saturation often leads to higher latency, while a full disk or memory pressure can cause errors or out-of-memory (OOM) kills that show up in error rate and availability.
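One rough way to see why a saturated CPU hurts latency is the textbook M/M/1 queueing model, where mean response time is service time divided by (1 - utilization). The model is an assumption brought in for illustration; real services with several cores and bursty traffic behave differently, though the curve has a similar shape. The function name is hypothetical.

```python
def mm1_response_time(service_time_s: float, utilization: float) -> float:
    """Mean response time of an M/M/1 queue: S / (1 - rho)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1.0 - utilization)


if __name__ == "__main__":
    # Assumed 10 ms of pure service time per request.
    for rho in (0.5, 0.8, 0.9, 0.95):
        latency_ms = mm1_response_time(0.010, rho) * 1000
        print(f"{rho:.0%} busy -> {latency_ms:.0f} ms mean response time")
```

Under this model a request that needs 10 ms of work takes about 20 ms at 50% utilization and about 100 ms at 90%, which is why rising CPU utilization is worth watching before the latency SLI degrades.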

Not SLIs by default — SLIs are the metrics that matter for user experience: availability, latency, error rate, throughput.

Infrastructure metrics describe the system's internal state rather than what users see. They are useful for diagnosis and early warning, but they are not direct measures of user experience unless you choose to treat them as such.

Proxy SLIs when it helps — Internal or platform teams sometimes define SLOs on resource metrics (e.g. “CPU under 80%”) when the “user” is another service or the platform itself. In those cases, a resource metric is being used as a proxy SLI.
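A proxy SLI such as "CPU under 80%" can be evaluated like any other SLI: count the good samples and compare the ratio to a target. The sketch below assumes evenly spaced CPU samples and a 99% target; both function names and the target value are hypothetical.

```python
from typing import Sequence


def proxy_sli_compliance(cpu_samples_pct: Sequence[float],
                         threshold_pct: float = 80.0) -> float:
    """Fraction of samples where CPU stayed under the threshold."""
    if not cpu_samples_pct:
        raise ValueError("need at least one sample")
    good = sum(1 for sample in cpu_samples_pct if sample < threshold_pct)
    return good / len(cpu_samples_pct)


def meets_slo(compliance: float, target: float = 0.99) -> bool:
    """True when the observed compliance reaches the (assumed) SLO target."""
    return compliance >= target


if __name__ == "__main__":
    samples = [50.0, 60.0, 85.0, 70.0]
    compliance = proxy_sli_compliance(samples)
    print(f"compliance: {compliance:.2%}, SLO met: {meets_slo(compliance)}")
```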

Use infrastructure metrics in dashboards and runbooks to diagnose why an SLO was missed.

Prefer alerting on SLIs for user impact; use resource metrics for early warning or root-cause context (e.g. “latency is high—check CPU and memory”).
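The alerting guidance above can be sketched as a small decision function: page only when the SLI is breached, and attach resource metrics as context or as a lower-severity warning. All names and thresholds here (`decide`, `Decision`, the 85% warning level) are assumptions for illustration, not part of any alerting tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Decision:
    action: str                                  # "page", "warn", or "none"
    context: List[str] = field(default_factory=list)


def decide(latency_p99_ms: float, latency_slo_ms: float,
           cpu_pct: float, mem_pct: float,
           resource_warn_pct: float = 85.0) -> Decision:
    """Page on user impact; use resource metrics for context or early warning."""
    hot = []
    if cpu_pct >= resource_warn_pct:
        hot.append(f"CPU at {cpu_pct:.0f}%")
    if mem_pct >= resource_warn_pct:
        hot.append(f"memory at {mem_pct:.0f}%")

    if latency_p99_ms > latency_slo_ms:
        # SLI breached: users are affected, so page and include resource context.
        return Decision("page", hot)
    if hot:
        # No user impact yet: surface as an early warning, not a page.
        return Decision("warn", hot)
    return Decision("none")
```

With these assumed thresholds, `decide(900.0, 500.0, 92.0, 40.0)` pages with the CPU reading attached, while the same CPU reading with healthy latency only produces a warning.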

For the full SLO/SLI framework, see SLOs, SLIs & SLAs.

For how infrastructure metrics feed into scaling decisions and performance baselines, see Capacity Planning and Load and Stress Testing.