Change Risk and Deployment SLOs

By Atif Alam

Every change is a potential incident.

The question is not “will this change cause a problem?” but “how likely is it, and how bad would it be?”

Change risk management is about answering that question before you deploy, and having policies that reduce risk when the stakes are high.

Not all changes carry the same risk.

A typo fix in a log message is not the same as a database schema migration.

Factors that increase risk:

  • Blast radius — How many users, services, or systems are affected if this goes wrong? A change to a shared library or a core database has a larger blast radius than a change to a single endpoint.
  • Reversibility — Can you roll back? Code deploys are usually reversible. Database migrations, data deletions, and third-party API changes often are not.
  • Novelty — Is this a well-understood change (e.g. a dependency bump you’ve done before) or something new (e.g. a new replication topology)?
  • Coupling — Does this change require coordinated changes across multiple services or teams?
  • Timing — Deploying during peak traffic or right before a holiday increases the cost of failure.

Risk categories:

| Risk Level | Characteristics | Example |
| --- | --- | --- |
| Low | Small blast radius, easily reversible, well-understood | Config change, minor UI fix |
| Medium | Moderate blast radius, reversible with effort | New API endpoint, feature flag rollout |
| High | Large blast radius, hard to reverse, or novel | Schema migration, core service rewrite, data pipeline change |
| Critical | Affects all users, irreversible, or crosses compliance boundaries | Payment system change, auth infrastructure change, data deletion |
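
One way to make the risk categories above repeatable is to encode them as a small function. The sketch below is only an illustration: the `Change` fields, their string values, and the "worst matching row wins" rule are assumptions made for this example, not a standard scoring model.

```python
from dataclasses import dataclass
from enum import IntEnum


class Risk(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class Change:
    # Hypothetical fields a team might record on a change request.
    blast_radius: str   # "small", "moderate", "large", or "all_users"
    reversible: str     # "easy", "with_effort", "hard", or "irreversible"
    novel: bool = False
    crosses_compliance_boundary: bool = False


def classify(change: Change) -> Risk:
    """Return the highest risk level whose characteristics the change matches."""
    if (change.blast_radius == "all_users"
            or change.reversible == "irreversible"
            or change.crosses_compliance_boundary):
        return Risk.CRITICAL
    if change.blast_radius == "large" or change.reversible == "hard" or change.novel:
        return Risk.HIGH
    if change.blast_radius == "moderate" or change.reversible == "with_effort":
        return Risk.MEDIUM
    return Risk.LOW


# A small, reversible, but unfamiliar change is still treated as high risk.
print(classify(Change("small", "easy", novel=True)).name)
```

Taking the worst matching row is a conservative choice: a single high-risk factor is enough to raise the level, even when every other factor looks benign.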

Deployments consume error budget. A bad deploy that causes elevated errors for 10 minutes eats into your monthly availability SLO.
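
To put a number on that, here is a rough calculation. It assumes a 99.9% availability SLO over a 30-day window and evenly distributed traffic; both are illustrative assumptions, and the function names are made up for this sketch.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of total unavailability the SLO allows in the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_consumed(bad_minutes: float, error_fraction: float,
                    slo: float, window_days: int = 30) -> float:
    """Fraction of the error budget used by one incident.

    error_fraction is the share of requests that failed while the
    incident lasted (1.0 means a full outage).
    """
    return bad_minutes * error_fraction / error_budget_minutes(slo, window_days)


print(round(error_budget_minutes(0.999), 1))       # 43.2 minutes of budget
print(round(budget_consumed(10, 1.0, 0.999), 2))   # 0.23 for a 10-minute full outage
print(round(budget_consumed(10, 0.5, 0.999), 2))   # 0.12 if half the requests failed
```

Under these assumptions, a single 10-minute bad deploy with a full outage uses nearly a quarter of the month's budget.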

Before and After Every Deploy: A Checklist

  • Before deploying, check your current error budget burn rate. If the budget is already low, the cost of a failed deploy is higher.
  • After deploying, monitor SLIs (error rate, latency) and compare against pre-deploy baselines.
  • Set a policy: e.g. “if error budget is below 20%, only low-risk changes may deploy without additional approval.”
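
The example policy above can be expressed as a small gate in a deploy pipeline. This is a minimal sketch: the function name, the risk labels, and the `approved` flag are hypothetical. It encodes only this one policy, so it does not replace any other review that high-risk changes require.

```python
RISK_LEVELS = ("low", "medium", "high", "critical")


def may_deploy(budget_remaining: float, risk: str,
               approved: bool = False, threshold: float = 0.20) -> bool:
    """Apply the error-budget policy to a single deploy request.

    budget_remaining is the fraction of the error budget still unspent
    (0.0 to 1.0). The 20% threshold is the example value from the policy.
    """
    if risk not in RISK_LEVELS:
        raise ValueError(f"unknown risk level: {risk!r}")
    if budget_remaining >= threshold:
        return True  # assumption: this policy does not restrict deploys above the threshold
    # Budget is low: low-risk changes go ahead, everything else needs approval.
    return risk == "low" or approved


print(may_deploy(0.15, "low"))                    # True
print(may_deploy(0.15, "medium"))                 # False
print(may_deploy(0.15, "medium", approved=True))  # True
```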

Deployment SLOs are targets for the deployment process itself:

  • Deploy frequency — How often can you deploy? More frequent = smaller changes = lower risk per deploy.
  • Deploy lead time — How long from commit to production? Shorter = faster feedback = faster fixes.
  • Deploy failure rate — What percentage of deploys cause a rollback or incident? Track and trend this.
  • MTTR for bad deploys — How long to detect and roll back a bad deploy? This is your deployment recovery time.

These correspond closely to the four DORA metrics: deployment frequency, lead time for changes, change failure rate, and time to restore service (often reported as MTTR).
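
As a rough sketch of how these four numbers could be derived from a deploy log, the example below assumes each deploy is recorded with a commit time, a deploy time, a failure flag, and a recovery time. The `Deploy` record and its field names are hypothetical; real data would come from your CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Deploy:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False                     # caused a rollback or incident
    recovered_at: Optional[datetime] = None  # when service was restored


def deployment_metrics(deploys: list[Deploy], window_days: int) -> dict:
    """Summarize deploy frequency, lead time, failure rate, and recovery time."""
    if not deploys:
        raise ValueError("no deploys in the window")
    failures = [d for d in deploys if d.failed]
    recoveries = [d.recovered_at - d.deployed_at
                  for d in failures if d.recovered_at is not None]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deploys),
        "failure_rate": len(failures) / len(deploys),
        # Mean time from a bad deploy going out to recovery, if any were recorded.
        "mean_time_to_recover": (sum(recoveries, timedelta()) / len(recoveries)
                                 if recoveries else None),
    }
```

The median is used for lead time so that one unusually slow change does not dominate the figure; a team could just as reasonably track a percentile instead.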

Change management is broader than deployment.

It covers any change that could affect production: code, config, infrastructure, database, network, third-party integrations.

Practices:

  • Change advisory / review — For high-risk or critical changes, require a review with relevant stakeholders before proceeding. This doesn’t have to be a formal board; a short sync with the on-call engineer and a tech lead may be enough.
  • Change windows — Some organizations define preferred deployment windows (e.g. “deploy during business hours when the team is available to respond, not on Friday afternoon”). The goal is to have people available if something goes wrong.
  • Change freeze — Periods where non-emergency changes are not allowed (e.g. during a major sales event, holiday season, or active incident). Define what qualifies as an exception.
  • Change log — Maintain a record of what changed, when, and by whom. Correlate with incidents to identify patterns (e.g. “most incidents follow config changes on Thursdays”).

Ways to reduce change risk:

  • Deploy small, deploy often. Smaller changes are easier to understand, review, test, and roll back. Batching changes increases risk.
  • Use progressive delivery. Canary and traffic shifting limit blast radius by exposing changes to a fraction of traffic first.
  • Use feature flags. Decouple deploy from release so you can disable a feature without rolling back the deploy.
  • Automate rollback. If SLIs breach a threshold, roll back automatically. Don’t rely on a human noticing and acting at 3 AM.
  • Test in production-like environments. The closer your staging environment matches production, the fewer surprises in prod. See CI/CD for Applications.
  • Backward compatibility. Design changes so the old and new versions can coexist. This makes rollback safe and reduces the risk of coordinated deployments.
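
The automated rollback practice can be reduced to a comparison between post-deploy SLIs and the pre-deploy baseline. The sketch below shows only the decision. The `Sli` type and the default thresholds are illustrative assumptions, and the code that actually triggers the rollback depends on your deployment tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sli:
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float


def should_roll_back(baseline: Sli, current: Sli,
                     max_error_rate_increase: float = 0.01,
                     max_latency_ratio: float = 1.5) -> bool:
    """Return True when the new version is measurably worse than the baseline."""
    error_regressed = (current.error_rate - baseline.error_rate
                       > max_error_rate_increase)
    latency_regressed = (current.p99_latency_ms
                         > baseline.p99_latency_ms * max_latency_ratio)
    return error_regressed or latency_regressed


baseline = Sli(error_rate=0.002, p99_latency_ms=200)
print(should_roll_back(baseline, Sli(0.003, 210)))  # False: within tolerance
print(should_roll_back(baseline, Sli(0.050, 210)))  # True: error rate jumped
print(should_roll_back(baseline, Sli(0.002, 400)))  # True: p99 latency doubled
```

Comparing against a baseline rather than a fixed limit keeps the check meaningful for services whose normal error rate or latency is already high or low. In practice the SLIs would be sampled over a window long enough to smooth out noise.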

Related topics:

  • Error Budgets: how error budget burn rate can gate deployments.
  • Incident Lifecycle: Phase 3 (Triage & Mitigation) often starts with “what changed recently?”
  • Alerting: SLO-based alerting can detect deploy-related degradation quickly.