Change Risk and Deployment SLOs
Every change is a potential incident. The question is not “will this change cause a problem?” but “how likely is it, and how bad would it be?” Change risk management is about answering that question before you deploy, and about having policies that reduce risk when the stakes are high.
Change Risk Assessment
Not all changes carry the same risk.
A typo fix in a log message is not the same as a database schema migration.
Factors that increase risk:
- Blast radius — How many users, services, or systems are affected if this goes wrong? A change to a shared library or a core database has a larger blast radius than a change to a single endpoint.
- Reversibility — Can you roll back? Code deploys are usually reversible. Database migrations, data deletions, and third-party API changes often are not.
- Novelty — Is this a well-understood change (e.g. a dependency bump you’ve done before) or something new (e.g. a new replication topology)?
- Coupling — Does this change require coordinated changes across multiple services or teams?
- Timing — Deploying during peak traffic or right before a holiday increases the cost of failure.
Risk categories:
| Risk Level | Characteristics | Example |
|---|---|---|
| Low | Small blast radius, easily reversible, well-understood | Config change, minor UI fix |
| Medium | Moderate blast radius, reversible with effort | New API endpoint, feature flag rollout |
| High | Large blast radius, hard to reverse, or novel | Schema migration, core service rewrite, data pipeline change |
| Critical | Affects all users, irreversible, or crosses compliance boundaries | Payment system change, auth infrastructure change, data deletion |
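A table like this can back a lightweight scoring heuristic. The sketch below is illustrative only: the factor weights and level thresholds are assumptions, not a standard, and any real rubric should be calibrated against your own incident history.

```python
# Hypothetical risk-scoring sketch for the factors above.
# Weights and thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Change:
    blast_radius: int   # 1 (single endpoint) .. 5 (all users)
    reversible: bool    # can it be rolled back?
    novel: bool         # first time doing this kind of change?
    coupled: bool       # requires coordinated multi-service changes?
    peak_window: bool   # peak traffic or right before a holiday?

def risk_level(c: Change) -> str:
    score = c.blast_radius
    score += 0 if c.reversible else 3   # irreversibility weighs heavily
    score += 2 if c.novel else 0
    score += 1 if c.coupled else 0
    score += 1 if c.peak_window else 0
    if score <= 2:
        return "low"
    if score <= 5:
        return "medium"
    if score <= 9:
        return "high"
    return "critical"
```

For example, a well-understood config change with a tiny blast radius scores “low”, while an irreversible, novel schema migration touching a core database lands in “high” or “critical”.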
Deployment Impact on SLOs
Deployments consume error budget. A bad deploy that causes elevated errors for 10 minutes eats into your monthly availability SLO.
Before and After Every Deploy: A Checklist
- Before deploying, check your current error budget burn rate. If the budget is already low, the cost of a failed deploy is higher.
- After deploying, monitor SLIs (error rate, latency) and compare against pre-deploy baselines.
- Set a policy: e.g. “if error budget is below 20%, only low-risk changes may deploy without additional approval.”
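A policy like the 20% example above can be expressed as a simple gate. A minimal sketch, assuming request-based SLIs; the helper names and the threshold are illustrative, not a specific tool's API:

```python
# Hypothetical error-budget gate for deploys (names and threshold are assumptions).

def error_budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget left in the current window."""
    allowed_bad = (1 - slo) * total   # budget, in requests
    actual_bad = total - good
    if allowed_bad == 0:
        return 0.0
    return max(0.0, 1 - actual_bad / allowed_bad)

def may_deploy(risk: str, budget_remaining: float, threshold: float = 0.20) -> bool:
    """Below the threshold, only low-risk changes deploy without approval."""
    if budget_remaining >= threshold:
        return True
    return risk == "low"
```

With a 99.9% SLO over 1,000,000 requests, the budget is 1,000 bad requests; 700 bad requests so far leaves 30% of the budget, so any risk level may deploy. At 10% remaining, only low-risk changes pass the gate.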
Deployment SLOs are targets for the deployment process itself:
- Deploy frequency — How often can you deploy? More frequent = smaller changes = lower risk per deploy.
- Deploy lead time — How long from commit to production? Shorter = faster feedback = faster fixes.
- Deploy failure rate — What percentage of deploys cause a rollback or incident? Track and trend this.
- MTTR for bad deploys — How long to detect and roll back a bad deploy? This is your deployment recovery time.
These correspond to the DORA metrics: deployment frequency, lead time for changes, change failure rate, and time to restore service (MTTR).
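Given a log of deploys, these indicators are straightforward to compute. A sketch, assuming a made-up record format of (commit time, deploy time, failed?, minutes to restore); real data would come from your CI/CD system and incident tracker:

```python
from datetime import datetime, timedelta

# Assumed record format: (commit_time, deploy_time, failed, minutes_to_restore)
deploys = [
    (datetime(2024, 6, 3, 10), datetime(2024, 6, 3, 12), False, 0),
    (datetime(2024, 6, 4, 9),  datetime(2024, 6, 4, 10), True,  25),
    (datetime(2024, 6, 5, 14), datetime(2024, 6, 5, 15), False, 0),
    (datetime(2024, 6, 6, 11), datetime(2024, 6, 6, 13), True,  35),
]

def change_failure_rate(log) -> float:
    """Fraction of deploys that caused a rollback or incident."""
    return sum(1 for d in log if d[2]) / len(log)

def mean_lead_time(log) -> timedelta:
    """Average time from commit to production."""
    return sum((d[1] - d[0] for d in log), timedelta()) / len(log)

def mttr_minutes(log) -> float:
    """Average time to restore service after a bad deploy."""
    bad = [d[3] for d in log if d[2]]
    return sum(bad) / len(bad)
```

The point is less the arithmetic than the habit: track these over time and watch the trend, not a single snapshot.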
Change Management
Change management is broader than deployment.
It covers any change that could affect production: code, config, infrastructure, database, network, third-party integrations.
Practices:
- Change advisory / review — For high-risk or critical changes, require a review with relevant stakeholders before proceeding. This doesn’t have to be a formal board; a short sync with the on-call engineer and a tech lead may be enough.
- Change windows — Some organizations define preferred deployment windows (e.g. “deploy during business hours when the team is available to respond, not on Friday afternoon”). The goal is to have people available if something goes wrong.
- Change freeze — Periods where non-emergency changes are not allowed (e.g. during a major sales event, holiday season, or active incident). Define what qualifies as an exception.
- Change log — Maintain a record of what changed, when, and by whom. Correlate with incidents to identify patterns (e.g. “most incidents follow config changes on Thursdays”).
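The “what changed recently?” correlation can start as a simple query over the change log. A sketch with an assumed one-hour lookback window and illustrative entries:

```python
from datetime import datetime, timedelta

# Hypothetical change-log entries: (change_id, change_type, timestamp).
changes = [
    ("cfg-142", "config", datetime(2024, 6, 6, 13, 0)),
    ("dep-901", "deploy", datetime(2024, 6, 6, 13, 40)),
    ("db-017",  "schema", datetime(2024, 6, 5, 9, 0)),
]

def changes_before(incident_start: datetime,
                   window: timedelta = timedelta(hours=1)) -> list[str]:
    """Return change IDs that landed within `window` before the incident."""
    return [cid for cid, _, ts in changes
            if timedelta(0) <= incident_start - ts <= window]
```

An incident starting at 14:00 on June 6 would surface the config change and the deploy from the preceding hour as the first suspects, while the day-old schema change falls outside the window.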
Risk Reduction Strategies
- Deploy small, deploy often. Smaller changes are easier to understand, review, test, and roll back. Batching changes increases risk.
- Use progressive delivery. Canary and traffic shifting limit blast radius by exposing changes to a fraction of traffic first.
- Use feature flags. Decouple deploy from release so you can disable a feature without rolling back the deploy.
- Automate rollback. If SLIs breach a threshold, roll back automatically. Don’t rely on a human noticing and acting at 3 AM.
- Test in production-like environments. The closer your staging environment matches production, the fewer surprises in prod. See CI/CD for Applications.
- Backward compatibility. Design changes so the old and new versions can coexist. This makes rollback safe and reduces the risk of coordinated deployments.
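The automated-rollback decision above can be as simple as comparing post-deploy SLIs against the pre-deploy baseline. A sketch; the hard ceiling and relative factor are assumptions to tune against your own SLOs, not recommended values:

```python
# Hypothetical rollback decision: trip on either an absolute error-rate
# ceiling or a large regression relative to the pre-deploy baseline.

def should_rollback(baseline_error_rate: float,
                    current_error_rate: float,
                    absolute_ceiling: float = 0.05,
                    relative_factor: float = 2.0) -> bool:
    """Roll back if errors exceed a hard ceiling or double the baseline."""
    if current_error_rate > absolute_ceiling:
        return True
    return current_error_rate > baseline_error_rate * relative_factor
```

Using both conditions matters: a relative check alone misses a service that was already unhealthy before the deploy, and an absolute check alone misses a tenfold regression that still sits under the ceiling.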
See Also
- Error Budgets — How error budget burn rate can gate deployments.
- Incident Lifecycle — Phase 3 (Triage & Mitigation) often starts with “what changed recently?”
- Alerting — SLO-based alerting can detect deploy-related degradation quickly.