
Toil and Automation

First Published By Atif Alam

Toil is manual, repetitive operational work that scales linearly with the system or the number of services—and that could be automated or eliminated.

The goal is to keep toil low so the team can focus on reliability and improvement.

Reducing toil frees time for reliability work, improves consistency, and makes on-call and production readiness more sustainable.

This page covers how to identify toil, measure it, decide what to automate (and when), and reduce operational burden over time.

The concept is central to SRE: if more than half of an SRE’s time is toil, there’s little capacity left for engineering that improves the system.

Toil has a few characteristics:

  1. Manual — A human does it repeatedly.
  2. Repetitive — The same steps each time.
  3. Scales with growth — More services, more instances, more of the same work.
  4. Doesn’t add long-term value — Fixing the same symptom again and again instead of fixing the cause.

Not all manual work is toil—incident response and one-off investigations are operational but not necessarily toil if they’re variable and lead to lasting fixes.

Examples of toil:

  • Manually scaling or restarting instances when alerts fire instead of fixing the root cause or improving autoscaling.
  • Copy-pasting runbook steps (e.g. clearing a queue, running a script) every time the same scenario occurs instead of automating the procedure.
  • Repeatedly applying the same config or patch across many nodes by hand instead of using automation or IaC.
  • Manually reconciling data or fixing the same class of data issue over and over instead of fixing the pipeline or adding safeguards.
  • Triage that could be automated: classifying alerts, routing tickets, or running the same diagnostic steps for every similar incident.

If you’re doing the same sequence of steps often, and it doesn’t require novel judgment each time, it’s a candidate for toil reduction.

You can’t reduce what you don’t measure. Measuring toil means tracking how much time goes into repetitive operational tasks, or how often those tasks occur.

  • Time-based — What percentage of the team’s time (or on-call time) is spent on manual, repetitive tasks? Surveys, time-tracking, or sampling (e.g. “what did you do this week?”) can approximate this. A common SRE guideline is to keep toil under ~50% so there’s capacity for project work.
  • Occurrence-based — How often do we run the same runbook, clear the same queue, or fix the same type of issue? Count incidents or tasks by type; high repeat counts signal toil and often point to automation or root-cause fixes.
  • Where to look — Runbooks that are run frequently; alerts that fire often and trigger the same response; recurring tickets or manual steps in release or config processes. See Runbooks and Playbooks for how runbooks relate to toil—automating runbook steps is a direct way to reduce toil.

Measuring gives you a baseline and helps prioritize: tackle the highest-time or highest-frequency toil first.
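As a rough illustration of the time-based and occurrence-based measures above, the sketch below summarizes a hand-kept task log. The `TaskEntry` fields, the task names, and the `is_toil` flag are assumptions made for the example; in practice the data might come from tickets, on-call logs, or a weekly survey.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import List


@dataclass
class TaskEntry:
    task_type: str   # hypothetical label, e.g. "clear_queue"
    minutes: int     # time spent on this occurrence
    is_toil: bool    # judged manually: repetitive, no lasting value


def toil_summary(entries: List[TaskEntry]) -> dict:
    """Return the toil share of total time and a per-type ranking."""
    total_minutes = sum(e.minutes for e in entries)
    minutes_by_type = defaultdict(int)
    count_by_type = Counter()
    for e in entries:
        if e.is_toil:
            minutes_by_type[e.task_type] += e.minutes
            count_by_type[e.task_type] += 1
    toil_minutes = sum(minutes_by_type.values())
    toil_percent = 100.0 * toil_minutes / total_minutes if total_minutes else 0.0
    # Rank by total time first; ties are broken alphabetically.
    ranked = sorted(minutes_by_type, key=lambda t: (-minutes_by_type[t], t))
    return {
        "toil_percent": toil_percent,
        "by_type": [(t, count_by_type[t], minutes_by_type[t]) for t in ranked],
    }


# Made-up sample week for one engineer.
week = [
    TaskEntry("clear_queue", 30, True),
    TaskEntry("clear_queue", 30, True),
    TaskEntry("clear_queue", 30, True),
    TaskEntry("restart_instance", 15, True),
    TaskEntry("restart_instance", 15, True),
    TaskEntry("incident_investigation", 120, False),
    TaskEntry("project_work", 360, False),
]

summary = toil_summary(week)
print(summary["toil_percent"])  # 20.0
print(summary["by_type"])       # [('clear_queue', 3, 90), ('restart_instance', 2, 30)]
```

The ranking puts the highest-time toil first, which matches the prioritization advice above; sorting by occurrence count instead is a one-line change.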

Not all toil is worth automating. Automation has a cost (build time, maintenance, testing) and sometimes the toil is rare enough or quick enough that automation isn’t justified. ROI is the return (time saved, errors avoided, consistency gained) relative to that cost.

  • When to automate — High frequency (we do this daily or weekly) or high impact (errors are costly, or the task blocks the critical path). Also when the procedure is stable and well-understood so the automation won’t break every time the system changes.
  • When to defer or avoid — One-off or rare tasks; procedures that change often; tasks that require judgment or context that’s hard to encode. Sometimes the better lever is to eliminate the need (e.g. fix the root cause so the alert stops firing) rather than automate the response.
  • Build vs buy — Use existing tooling (runbook automation, orchestration, IaC) where it fits; custom scripts or services when the workflow is specific. Prefer solutions that are maintainable and documented so the next person can change them. See Infrastructure as Code for automation of infrastructure and config.

Deciding what to automate is a product decision: you’re investing engineering time to reduce future operational cost. Prioritize by impact and frequency.
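The ROI reasoning above can be made concrete with a small back-of-the-envelope calculation. This is a sketch under stated assumptions: the function name, its parameters, and the 12-month horizon are hypothetical, and it counts only time saved, not errors avoided or consistency gained.

```python
from typing import Optional


def automation_roi(
    minutes_per_run: float,
    runs_per_month: float,
    build_hours: float,
    maintenance_hours_per_month: float,
    horizon_months: int = 12,
) -> dict:
    """Estimate net hours saved and the break-even point for automating a task."""
    hours_saved_per_month = minutes_per_run * runs_per_month / 60
    net_hours_per_month = hours_saved_per_month - maintenance_hours_per_month
    total_saved = hours_saved_per_month * horizon_months
    total_cost = build_hours + maintenance_hours_per_month * horizon_months
    # If maintenance eats all the savings, the automation never pays back.
    break_even: Optional[float] = (
        build_hours / net_hours_per_month if net_hours_per_month > 0 else None
    )
    return {
        "net_hours": total_saved - total_cost,
        "break_even_months": break_even,
    }


# Frequent task: 30 minutes, 20 times a month, 40 hours to build, 2 hours/month upkeep.
frequent = automation_roi(30, 20, build_hours=40, maintenance_hours_per_month=2)
print(frequent)  # {'net_hours': 56.0, 'break_even_months': 5.0}

# Rare task: 10 minutes, once a month, 8 hours to build, 1 hour/month upkeep.
rare = automation_roi(10, 1, build_hours=8, maintenance_hours_per_month=1)
print(rare["break_even_months"])  # None
```

The second case shows the "defer or avoid" situation: maintenance alone costs more than the manual work it replaces, so eliminating the cause or leaving the task manual is the better choice.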

Automating toil is one way to reduce operational burden; the other is to remove the need for the work altogether.

  • Eliminate the cause — If the same alert fires every week and you run the same runbook, fix the underlying issue (e.g. capacity, flaky dependency, bad config) so the alert stops firing. That’s better than automating the runbook.
  • Improve detection and design — Alerting tuned to SLO burn rate instead of every blip reduces alert fatigue and manual triage. Progressive delivery and Capacity Planning reduce incidents and surprise load, which reduces firefighting.
  • Automate the repetitive part — Once you’ve identified and measured toil, automate the stable, high-value parts. Start small (e.g. one runbook automated, one script that runs the same steps) and expand. Document what’s automated and how to maintain it so it doesn’t become hidden toil (maintaining brittle automation).
  • Culture and expectations — Encourage “automate or eliminate” as the default: when someone does a repetitive task, the question is “should we automate this or fix the cause?” Track toil in reliability reviews so it stays visible and prioritized.
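To illustrate what "alerting tuned to SLO burn rate" in the list above can look like, here is a minimal sketch. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The function names, the two-window check, and the threshold of 10 are assumptions for the example, not values taken from this page.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 10.0,  # assumed threshold, tune per service
) -> bool:
    """Page only when both a long and a short window burn fast.

    The long window filters out brief blips; the short window confirms
    the problem is still happening right now.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


SLO = 0.999  # 99.9% availability target, so the error budget is 0.1%

# Sustained problem: 2% errors in both windows.
print(should_page(0.02, 0.02, SLO))   # True

# Brief blip: the short window spikes but the long window is on budget.
print(should_page(0.001, 0.05, SLO))  # False

# Already recovered: the long window is still elevated, the short one is clean.
print(should_page(0.02, 0.0005, SLO)) # False
```

Only the first case pages a human, so the blip and the already-recovered case no longer generate manual triage.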

Operational burden goes down when toil is measured, prioritized, and either automated or eliminated—and when the system is designed and operated to need less manual intervention in the first place.
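One way to start small with runbook automation is to encode the steps as an ordered list that can be rehearsed before it is trusted. The sketch below is an assumption-laden example: `Step`, `run_runbook`, the step names, and the in-memory queue are all hypothetical stand-ins for real operational actions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]  # the operation a human used to run by hand


def run_runbook(steps: List[Step], dry_run: bool = True) -> List[str]:
    """Run steps in order and return a log; stop at the first failure."""
    log = []
    for step in steps:
        if dry_run:
            # Dry run records what would happen without touching anything.
            log.append(f"DRY-RUN {step.name}")
            continue
        try:
            step.action()
            log.append(f"OK {step.name}")
        except Exception as exc:
            log.append(f"FAILED {step.name}: {exc}")
            break  # leave the remaining steps for a human to decide
    return log


# Hypothetical "clear a stuck queue" runbook against an in-memory queue.
stuck_queue = ["job-1", "job-2", "job-3"]
clear_queue_runbook = [
    Step("pause consumers", lambda: None),
    Step("clear stuck queue", stuck_queue.clear),
    Step("resume consumers", lambda: None),
]

# Defaults to a dry run, so this prints the plan and changes nothing.
print(run_runbook(clear_queue_runbook))
```

Defaulting to a dry run and returning a log are deliberate choices: they keep the automation inspectable, which helps avoid the hidden toil of maintaining scripts nobody understands.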