# Staged Design Examples
Every staged design example in this library (Feed, Chat, Search, URL shortener, and the rest) follows three stages. You don’t plan for Stage 3 on day one—you move forward when real signals tell you to.
## The Stage Model

- Stage 1 — MVP: One path in, one store (or a minimal set). Ship, learn, and validate. Keep it simple enough that one team can build and run it.
- Stage 2 — Growth: Bottlenecks or triggers appear. Add load balancing, cache, replicas, queues or a message bus, workers, or a dedicated index/store—whatever the signal calls for.
- Stage 3 — Advanced Scale: Multi-region, retention and replay, ranking or stream processing, invalidation at scale, cost control. You’re here because Stage 2 solutions have hit their limits.
## Signal → Component Quick Reference

Find your signal, read across, and get the component to consider plus the stage where it typically appears.
| Signal / Pain | Consider Adding | Stage |
|---|---|---|
| Read latency high, repeated same queries | Cache (Redis/Memcached), CDN | 2 |
| DB overloaded on reads | Cache, read replicas | 2 |
| DB overloaded on writes | Sharding / partitioning | 2–3 |
| Slow work in request path (e.g. transcode, email) | Queue + workers | 2 |
| Fan-out to many servers (chat, collab, cache invalidation) | Message bus (pub/sub) | 2 |
| Need ordering, retention, replay | Event stream / log (e.g. Kafka) | 2–3 |
| Complex queries, full-text, filtering + ranking | Search index (e.g. Elasticsearch) | 2 |
| Geographic latency for end users | CDN (static), multi-region (data) | 2–3 |
| Hot keys or uneven load | Cache + sharding, dedicated partitions | 2–3 |
| Many independent consumers of same data | Event stream / log | 2 |
| Connection count limit (WebSocket, long-poll) | Load balancer, horizontal servers | 2 |
| Need retention or “load history” | Event log, history/cold store | 2–3 |
| Data or index too stale for users | Cache invalidation (write-through/invalidate), indexing pipeline refresh, replication-lag SLO | 2 |
| Read-your-writes or strong consistency required | Primary/session routing for reads, write-through cache, sync replication or quorum reads | 2 |
| Cost growing fast (storage, scans) | Tiered storage, downsampling, retention | 3 |
| Manual recovery or ops toil | Orchestration, automation | 3 |
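The first two rows (high read latency on repeated queries, DB overloaded on reads) usually resolve to a cache-aside read path: check the cache, fall back to the database on a miss, and invalidate on write. A minimal sketch, with plain dicts standing in for a Redis client and the database, and TTL handling simplified to an expiry timestamp per key:

```python
import time

class CacheAside:
    """Cache-aside read path: check cache, fall back to the DB, repopulate on miss."""

    def __init__(self, db, ttl_seconds=60):
        self.db = db              # stand-in for the real database
        self.ttl = ttl_seconds
        self.cache = {}           # key -> (value, expires_at); stand-in for Redis

    def get(self, key):
        hit = self.cache.get(key)
        if hit and hit[1] > time.time():
            return hit[0]                                  # cache hit
        value = self.db[key]                               # miss: read from the DB
        self.cache[key] = (value, time.time() + self.ttl)  # populate for next reader
        return value

    def write(self, key, value):
        self.db[key] = value
        self.cache.pop(key, None)  # invalidate so readers don't serve stale data
```

Invalidate-on-write (rather than update-on-write) is the simpler Stage 2 choice; the "read-your-writes" row in the table is where you upgrade to write-through or primary routing.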
For goals, typical mechanisms, and risks per category, see Optimization Quick Reference.
## What Triggers the Next Stage?

Use these tables to identify where you are and what to consider adding. Find your pain on the left, then read across for the answer and an example to explore.
### Stage 1 → 2 Triggers

| Signal / Pain | Consider adding | Example(s) |
|---|---|---|
| Single DB is the bottleneck (CPU, connections, disk) | Read replicas, cache | Feed, Chat, Search |
| Read or write latency growing | Cache (reads), queue + workers (writes) | URL shortener, Analytics |
| Need to fan out to many servers | Message bus (pub/sub) | Chat, Collaborative editing |
| Need ordering or replay | Operation log or event stream | Collaborative editing, Streaming |
| Need search or complex queries | Dedicated search index | Autocomplete, Marketplace |
| Connection count or CPU limit on single server | Load balancer, multiple servers | Chat, Game backend |
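The "queue + workers (writes)" trigger above is the classic Stage 1 → 2 move: the request handler only enqueues and returns, and slow work (transcoding, email) happens elsewhere. A sketch using Python's in-process `queue` and `threading` as stand-ins for a real broker (RabbitMQ, SQS) and a worker fleet:

```python
import queue
import threading

def worker(tasks: queue.Queue, results: list) -> None:
    """Pull tasks until a None sentinel arrives; the slow work lives here,
    off the request path."""
    while True:
        job = tasks.get()
        if job is None:
            break
        results.append(f"processed:{job}")  # e.g. transcode, send email
        tasks.task_done()

tasks: queue.Queue = queue.Queue()
results: list = []
threads = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(2)]
for t in threads:
    t.start()

# The request handler only enqueues and returns immediately.
for job in ("video-1", "email-2", "video-3"):
    tasks.put(job)

tasks.join()            # wait for workers to drain the queue
for _ in threads:
    tasks.put(None)     # shut the workers down
for t in threads:
    t.join()
```

Each task goes to exactly one worker; a real broker adds what this sketch lacks, chiefly durability, retries, and a dead-letter queue.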
### Stage 2 → 3 Triggers

| Signal / Pain | Consider adding | Example(s) |
|---|---|---|
| Users in multiple regions need low latency | Multi-region replicas, regional routing | Chat, Distributed cache, Feed |
| Need retention or replay (e.g. “load last 7 days”) | Event log, history store | Chat, Streaming, IoT |
| Very high QPS or CCU | Sharding, dedicated stores | Feed, Search, Game backend |
| Need ranking or personalization | Ranking pipeline, dedicated feed store | Feed, Autocomplete |
| Invalidation at scale (many keys, many nodes) | Pub/sub for invalidation fan-out | Distributed cache |
| Cost growing fast | Tiered storage, downsampling, retention policy | IoT, Analytics |
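The "very high QPS" row reaches for sharding: every server must route a given key to the same shard, which means a stable hash rather than anything process-local. A minimal routing sketch (the function name is illustrative, not from any particular library):

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Route a key to a shard with a stable hash (md5, not Python's seeded
    hash()) so every server agrees on the placement."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

The catch with plain modulo is resharding: changing `num_shards` remaps almost every key, which is why systems that expect to grow their shard count use consistent hashing or a lookup table instead.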
## When Do You Add Messages, Streams, or Queues?

This is where most confusion happens. The three types—queue, pub/sub, and stream—solve different problems. Find what you need on the left and see which type fits.
| I need to… | Queue + Workers | Message Bus (Pub/Sub) | Event Stream / Log |
|---|---|---|---|
| Move slow work off the request path | ✓ | | |
| Retry failed deliveries with backoff + DLQ | ✓ | | |
| Async fan-out on write (e.g. feed to followers) | ✓ | | |
| Fan out to many servers in real time (e.g. chat) | | ✓ | |
| Invalidate cache across nodes | | ✓ | |
| Broadcast edits to all subscribers of a doc | | ✓ | |
| Order events per key and replay from any point | | | ✓ |
| Support multiple independent consumer groups | | | ✓ |
| Buffer high-volume ingest (IoT, analytics) | | | ✓ |
| Deliver or replicate across regions | | ✓ | ✓ |
Quick distinction: Queue = task goes to one worker. Pub/sub = message goes to all subscribers. Stream = ordered log, multiple consumers, replay.
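The quick distinction is easiest to see in code. A toy in-memory log sketching the stream column's properties (ordering, independent consumer groups, replay); by contrast, a queue would hand each event to exactly one worker, and pub/sub would retain nothing for late joiners:

```python
class EventLog:
    """Append-only log with per-group offsets: the Kafka-ish properties from
    the table (ordering, independent consumers, replay), minus everything
    that makes a real stream production-grade."""

    def __init__(self):
        self.events = []   # ordered, append-only
        self.offsets = {}  # consumer group -> next index to read

    def append(self, event):
        self.events.append(event)

    def poll(self, group):
        """Each group reads at its own pace; polling advances only that group."""
        start = self.offsets.get(group, 0)
        batch = self.events[start:]
        self.offsets[group] = len(self.events)
        return batch

    def replay(self, group, offset=0):
        """Rewind one group to any point; the log retains everything."""
        self.offsets[group] = offset
```

Two groups (say, billing and analytics) can poll the same log independently, and rewinding billing's offset never disturbs analytics.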
For a deeper comparison, see Redis vs Kafka: When to Use Which.
## How to Use This with the Examples

- Map your problem to the quality dimensions in the Quality Requirements Checklist on the System Design Requirements page. Which dimensions matter most?
- Identify your stage from the trigger tables above. Are you seeing Stage 1 → 2 signals, or Stage 2 → 3?
- Pick one or two example docs that match your problem shape:
  - Read-heavy + social graph → Feed / timeline
  - Real-time + many connections → Chat
  - High-volume ingest + rules → IoT / sensor ingestion or Streaming
  - Strong consistency + audit → Payments
  - Full-text search + ranking → Search or Autocomplete
Browse the full list on the System Design overview.