Backup and Restore
Backups protect you from data loss—accidental deletion, corruption, ransomware, or a failed primary. But backups only help if you can restore them.
Aim for a strategy that meets your recovery point objective (RPO, the most data you can afford to lose, measured in time), keeps backups safe, and includes regular restore testing so restores work when you need them.
Why Backups Matter
- Data loss — Human error, bug, or malicious action deletes or corrupts data.
- Ransomware — Data is encrypted or held hostage; a clean restore from backup may be the only path.
- Disaster — Primary site or region is lost; restore to a new environment.
Backups are your last line of defense. Replication and failover can reduce downtime, but they replicate bad data too. Backups give you a point-in-time copy to restore from.
Backup Types
| Type | What it copies | When to use | Restore time |
|---|---|---|---|
| Full | Everything | Baseline; periodic (e.g. weekly) | Simplest restore: one backup set, no chain to replay (the backup itself is the slowest to take) |
| Incremental | Changes since last backup (full or incremental) | Daily or hourly; fast, small | Requires full + all incrementals; slower restore |
| Differential | Changes since last full | Mid-frequency; simpler restore than incremental | Full + latest differential |
| Continuous / log-based | Transaction logs or change stream | Near-zero RPO; point-in-time recovery | Base backup plus log replay; how far back you can go depends on log retention |
Many systems use a combination: full weekly, incremental daily, with transaction logs for point-in-time recovery.
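The restore requirements in the table can be sketched as a small function that picks which backups to apply for a given target time. This is a minimal illustration, not any particular backup tool's logic: the `Backup` class and `restore_chain` function are hypothetical names, and the sketch assumes a schedule does not mix incrementals and differentials after the same full backup.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Backup:
    kind: str  # "full", "incremental", or "differential"
    taken_at: datetime


def restore_chain(backups: list[Backup], target: datetime) -> list[Backup]:
    """Return the backups to apply, in order, to reach the newest state at or before target."""
    # Only backups taken at or before the target time are usable.
    usable = sorted((b for b in backups if b.taken_at <= target),
                    key=lambda b: b.taken_at)
    full_positions = [i for i, b in enumerate(usable) if b.kind == "full"]
    if not full_positions:
        raise ValueError("no full backup at or before the target time")

    # Every chain starts from the most recent full backup.
    base = usable[full_positions[-1]]
    later = usable[full_positions[-1] + 1:]

    # Differential schedule: the latest differential already contains
    # every change since the full backup, so it is the only extra piece.
    differentials = [b for b in later if b.kind == "differential"]
    if differentials:
        return [base, differentials[-1]]

    # Incremental schedule: every incremental since the full is required, in order.
    return [base] + [b for b in later if b.kind == "incremental"]
```

With a weekly full and daily incrementals, a restore late in the week returns a long chain, and a single missing or corrupt incremental breaks everything after it. That is the trade-off the table describes for incremental backups.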
Backup Storage
- Location — Offsite from primary; ideally a different region or cloud provider. Same-region backups can be lost with the primary.
- Retention — How long to keep backups. Driven by compliance, RPO, and cost.
- Immutability — Some backup systems support immutable or write-once storage to resist tampering or ransomware.
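A retention rule can be sketched as a function that splits backups into those to keep and those to prune. This is a simplified illustration under stated assumptions: `split_by_retention` is a hypothetical helper, backups are represented only by their timestamps, and the single time window stands in for whatever your compliance and cost constraints actually require.

```python
from datetime import datetime, timedelta


def split_by_retention(
    taken_at: list[datetime], now: datetime, keep_for: timedelta
) -> tuple[list[datetime], list[datetime]]:
    """Return (keep, prune): backups inside the retention window and older ones."""
    ordered = sorted(taken_at)
    cutoff = now - keep_for
    keep = [t for t in ordered if t >= cutoff]
    prune = [t for t in ordered if t < cutoff]

    # Safety net: if every backup is past the cutoff, hold on to the newest
    # one rather than pruning down to zero backups.
    if not keep and prune:
        keep.append(prune.pop())
    return keep, prune
```

If you use incremental or differential backups, a real pruning job would also have to avoid deleting a full backup that newer backups still depend on. This sketch does not model that dependency.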
Restore Testing
Backups that are never tested often fail when needed. Restore testing validates that your backups are usable.
- Cadence — Restore a backup at least quarterly; more often for critical systems. Run a full restore to a test environment annually.
- What to verify — Data integrity (checksums, record counts), the application can start and serve traffic, and restore time meets your recovery time objective (RTO).
- Runbooks — Document restore steps in a runbook and keep it updated.
See Runbooks and Playbooks for structure. Link restore runbooks from your DR planning docs.
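The integrity and timing checks from the restore test can be automated. The sketch below is one possible shape, assuming the restored artifact is a newline-delimited file with one record per line and that the expected checksum and record count were captured when the backup was taken. `sha256_of` and `verify_restore` are hypothetical helper names, and the check that the application starts and serves traffic is left out.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large restores do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(
    restored: Path,
    expected_sha256: str,
    expected_records: int,
    restore_seconds: float,
    rto_seconds: float,
) -> list[str]:
    """Return a list of failed checks; an empty list means the restore passed."""
    failures = []

    if sha256_of(restored) != expected_sha256:
        failures.append("checksum mismatch")

    # Assumption: one record per line in the restored file.
    with restored.open("rb") as f:
        records = sum(1 for _ in f)
    if records != expected_records:
        failures.append(f"record count {records} != expected {expected_records}")

    if restore_seconds > rto_seconds:
        failures.append(f"restore took {restore_seconds}s, RTO is {rto_seconds}s")

    return failures
```

Returning every failed check, rather than stopping at the first, gives the restore test report a complete picture in one run.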
Point-in-Time Recovery (PITR)
PITR lets you restore to a specific moment (e.g. “10 minutes before the corruption”). It requires transaction logs or a change stream, not just periodic snapshots.
- When to use — When RPO is minutes or zero; when corruption or bad data is discovered after the fact.
- How — Database log replay, or backup system that supports point-in-time restore. Retention of logs determines how far back you can go.
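The dependency on log retention can be made concrete with a small planning function. This is a sketch under assumptions, not how any specific database implements PITR: `plan_pitr` is a hypothetical helper, and it assumes the retained logs are continuous between `log_start` and `log_end`.

```python
from datetime import datetime, timedelta


def plan_pitr(
    base_backups: list[datetime],
    log_start: datetime,
    log_end: datetime,
    target: datetime,
) -> tuple[datetime, timedelta]:
    """Pick the base backup to restore and how much log to replay to reach target."""
    if not (log_start <= target <= log_end):
        raise ValueError("target is outside the retained log window")

    # A base backup is usable only if the logs cover the whole span from
    # that backup up to the target, so it must not predate the oldest log.
    usable = [b for b in base_backups if log_start <= b <= target]
    if not usable:
        raise ValueError("no base backup with continuous log coverage up to the target")

    # The newest usable base backup means the least log to replay.
    base = max(usable)
    return base, target - base
```

The second error case is the one that tends to surprise people: the target time may fall inside the log window, but if the only base backup before it is older than the oldest retained log, the replay cannot start.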
IaC and Reproducible Restore
Restoring data is one part of recovery. Restoring infrastructure (servers, networks, databases) is another.
Infrastructure as Code (IaC) lets you define environments in code so you can recreate a DR site from scratch. Combine IaC with backup restore for a full recovery path.