Load and Stress Testing
Load testing answers: “Can this system handle the traffic we expect?”
Stress testing answers: “What happens when we push past that?”
Both are about finding limits and understanding failure modes before your users do.
Types Of Performance Tests
| Test Type | What It Does | When To Use |
|---|---|---|
| Load Test | Simulates expected traffic (normal and peak) and measures latency, throughput, error rate | Before major releases, after architecture changes, on a regular schedule in CI |
| Stress Test | Pushes beyond expected traffic to find the breaking point | Capacity planning, understanding failure modes |
| Soak Test | Runs load for an extended period (hours or days) to surface memory leaks, connection exhaustion, disk growth | Before production launch, periodically for long-running services |
| Spike Test | Sends a sudden burst of traffic, then drops back to normal | Validating autoscaling, queue backpressure, rate limiting |
| Benchmark | Measures performance of a specific component in isolation (a query, an algorithm, a cache hit) | Comparing implementations, tracking regressions, capacity modeling |
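As a rough illustration of how the first four test types differ, the sketch below expresses each one as a target request rate over time. The function name `target_rps` and every multiplier and window (ramp over the first 20% of the run, 3x peak for stress, half of peak as normal load, a 5x burst for spike) are illustrative assumptions, not recommendations.

```python
def target_rps(kind: str, t: float, duration: float, peak: float) -> float:
    """Return the target request rate at time t (seconds) for a test shape.

    All ratios below are assumed values chosen only for illustration.
    """
    if duration <= 0 or not 0 <= t <= duration:
        raise ValueError("t must lie within [0, duration] and duration must be positive")
    frac = t / duration
    if kind == "load":
        # Ramp to the expected peak over the first 20% of the run, then hold.
        return peak * min(1.0, frac / 0.2)
    if kind == "stress":
        # Keep climbing linearly until reaching 3x the expected peak.
        return peak * 3.0 * frac
    if kind == "soak":
        # Hold normal load (assumed to be half of peak) for the whole run.
        return peak * 0.5
    if kind == "spike":
        # Normal load, with a burst to 5x peak in the middle 10% of the run.
        return peak * 5.0 if 0.45 <= frac < 0.55 else peak * 0.5
    raise ValueError(f"unknown test kind: {kind}")
```

A load generator could sample this function once per second to decide how many requests to send; a benchmark has no such shape because it measures one component in isolation.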
What To Measure
During any performance test, track:
- Latency — p50, p95, p99 response times. See Latency Percentiles for why percentiles matter more than averages.
- Throughput — Requests per second (RPS) or transactions per second (TPS) the system sustains.
- Error Rate — At what load do errors start appearing? How does the error rate climb?
- Resource Utilization — CPU, memory, disk I/O, network, connection pools. See Infrastructure Metrics.
- Saturation Point — The load at which latency spikes or errors jump. This is your practical capacity limit.
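The latency, throughput, and error-rate metrics above can be derived from raw per-request samples. The following is a minimal sketch, assuming each request is recorded as a hypothetical `Sample` with a latency and a success flag, and using the nearest-rank method for percentiles (one of several common definitions).

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    latency_ms: float
    ok: bool


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list; p is in (0, 100]."""
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def summarize(samples, duration_s):
    """Summarize one test run; failed requests are included in the latency figures."""
    if not samples or duration_s <= 0:
        raise ValueError("need at least one sample and a positive duration")
    latencies = sorted(s.latency_ms for s in samples)
    errors = sum(1 for s in samples if not s.ok)
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "throughput_rps": len(samples) / duration_s,
        "error_rate": errors / len(samples),
    }
```

Whether failed requests should count toward latency percentiles is a design choice; reporting them separately is equally reasonable, as long as the choice stays the same across runs.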
How To Run A Load Test
- Define the scenario. What endpoints, what mix of reads/writes, what payload sizes? Model real user behavior, not uniform synthetic traffic.
- Set the target. What throughput and latency are you testing against? Use your SLOs as the success criteria.
- Start below target, ramp up. Gradually increase load so you can see how the system behaves at each level. A sudden jump hides the inflection point.
- Monitor everything. Not just the system under test—also dependencies (database, cache, message queue, third-party APIs).
- Run against a production-like environment. A staging environment half the size of production gives misleading results, because bottlenecks such as connection pool limits and cache hit rates do not scale linearly with size. Match instance types, data volume, and configuration as closely as possible.
- Record and compare. Save results and compare across runs. Track trends: is latency creeping up over releases?
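The ramp-up and SLO steps above can be sketched as two small helpers: one that plans evenly spaced load levels, and one that scans per-level results for the first breach. The names `ramp_stages` and `first_slo_violation`, and the `(rps, p99_ms, error_rate)` result shape, are assumptions made for this example.

```python
def ramp_stages(start_rps, target_rps, steps):
    """Evenly spaced load levels from start_rps up to target_rps, inclusive."""
    if steps < 2:
        raise ValueError("need at least two steps to form a ramp")
    increment = (target_rps - start_rps) / (steps - 1)
    return [start_rps + i * increment for i in range(steps)]


def first_slo_violation(results, slo_p99_ms, slo_error_rate):
    """Return the first load level that breaches either SLO, or None.

    results: list of (rps, p99_ms, error_rate) tuples in increasing rps order.
    """
    for rps, p99_ms, error_rate in results:
        if p99_ms > slo_p99_ms or error_rate > slo_error_rate:
            return rps
    return None
```

For example, `ramp_stages(100, 500, 5)` plans five levels from 100 to 500 RPS. Holding each level long enough for latency to stabilize before moving on makes the inflection point easier to see.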
Stress Testing: Finding The Breaking Point
A stress test pushes past your expected peak to answer:
- Where does it break? — Which component fails first (database, cache, network, application thread pool)?
- How does it break? — Graceful degradation (slower responses, load shedding) or cascading failure (one component takes others down)?
- Does it recover? — After the overload passes, does the system return to normal or does it stay degraded?
If your system fails catastrophically under 2x traffic, you have a resilience problem, not just a capacity problem.
Understanding failure modes informs your scaling strategy, your backpressure design, and your incident response runbooks.
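Two of the questions above, where the system breaks and whether it recovers, can be checked mechanically from recorded results. The sketch below is one possible approach under assumed thresholds: a stage counts as "broken" once its error rate exceeds `error_limit`, and the system counts as "recovered" if the last few post-overload p99 readings are within `tolerance` of the baseline. All names and default values are hypothetical.

```python
def analyze_stress_run(baseline_p99_ms, overload_stages, recovery_p99_ms,
                       error_limit=0.05, tolerance=0.10, window=3):
    """Report the breaking point and whether latency returned to baseline.

    overload_stages: list of (load_multiplier, error_rate), increasing load.
    recovery_p99_ms: p99 readings taken after load dropped back to normal.
    """
    if not recovery_p99_ms:
        raise ValueError("need at least one post-overload reading")
    # First load multiplier at which errors exceed the assumed limit.
    breaking_point = next(
        (mult for mult, err in overload_stages if err > error_limit), None
    )
    # Only the most recent readings matter: early ones may still be draining queues.
    ceiling = baseline_p99_ms * (1 + tolerance)
    recovered = all(p99 <= ceiling for p99 in recovery_p99_ms[-window:])
    return {"breaking_point": breaking_point, "recovered": recovered}
```

This does not identify which component failed first or whether the failure cascaded; that still requires looking at per-dependency metrics for the stage where the break occurred.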
Soak Testing: The Slow Leaks
Some problems only appear after hours or days of sustained load:
- Memory Leaks — Gradual memory growth that leads to OOM (out of memory) kills.
- Connection Exhaustion — Connections that aren’t properly released accumulate over time.
- Disk Growth — Logs, temp files, or data accumulation that fills storage.
- GC (garbage collection) Pressure — Increasingly long GC pauses as heap grows.
A soak test runs your normal load for an extended period and monitors for these slow-burn issues.
If your service restarts weekly in production and nobody knows why, a soak test will likely reveal the cause.
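One way to spot a slow leak in soak-test data is to fit a straight line to a resource metric over time and project when it would hit a limit. The sketch below uses an ordinary least-squares slope; the function names and the `(hours_elapsed, value)` sample shape are assumptions, and real memory curves are often not linear, so treat the projection as a rough signal only.

```python
def growth_per_hour(samples):
    """Least-squares slope of (hours_elapsed, value) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one point in time")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return sxy / sxx


def hours_until_limit(samples, limit):
    """Project hours until the metric reaches limit, or None if it is not growing."""
    slope = growth_per_hour(samples)
    if slope <= 0:
        return None
    latest_value = samples[-1][1]
    return (limit - latest_value) / slope
```

The same helpers apply to any monotonically growing resource, such as open connections or disk usage, by swapping in that metric's samples and limit.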
Testing In CI
Integrate lightweight performance tests into your CI/CD pipeline so regressions are caught before production:
- Benchmark Critical Paths — Run benchmarks for key operations (e.g. “search query under 50ms”) as part of the build.
- Gate on Regression — If latency increases by more than a set threshold over a recorded baseline (e.g. 20%), fail the build or flag it for review.
- Full Load Tests on Schedule — Run comprehensive load tests nightly or weekly, not on every commit (they’re slow and expensive).
For how quality gates work in pipelines, see CI/CD for Applications.
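A regression gate of the kind described above can be as simple as comparing current benchmark results to a stored baseline. The following is a minimal sketch, assuming both are dictionaries mapping a benchmark name to a latency in milliseconds; `check_regression` and the benchmark names in the usage note are hypothetical.

```python
def check_regression(baseline_ms, current_ms, max_increase=0.20):
    """Return names of benchmarks whose latency rose by more than max_increase."""
    failures = []
    for name, base in baseline_ms.items():
        current = current_ms.get(name)
        if current is None:
            # Benchmark was removed or renamed; this sketch skips it.
            continue
        if current > base * (1 + max_increase):
            failures.append(name)
    return failures
```

With a baseline of `{"search": 40.0}` and a current result of `{"search": 50.0}`, the function returns `["search"]`, and a CI step could exit non-zero whenever the list is non-empty. Benchmark timings on shared CI runners are noisy, so comparing medians over several runs tends to produce fewer false alarms than a single measurement.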
See Also
- Capacity Planning — Use load test results to model capacity needs and scaling thresholds.
- Caching Strategies — Caching is often the first lever for improving performance under load.
- Chaos Experiments — Once you have performance baselines, inject failures to test resilience.
- Error Rate and Throughput — The SLIs you’re measuring during load tests.