# Staged Design Examples
Every staged design example in this library (Feed, Chat, Search, URL shortener, and the rest) follows three stages. You don’t plan for Stage 3 on day one—you move forward when real signals tell you to.
## The Stage Model

- Stage 1 — MVP: One path in, one store (or a minimal set). Ship, learn, and validate. Keep it simple enough that one team can build and run it.
- Stage 2 — Growth: Bottlenecks or triggers appear. Add load balancing, cache, replicas, queues or a message bus, workers, or a dedicated index/store—whatever the signal calls for.
- Stage 3 — Advanced Scale: Multi-region, retention and replay, ranking or stream processing, invalidation at scale, cost control. You’re here because Stage 2 solutions have hit their limits.
## Signal → Component Quick Reference

Find your signal, read across, and get the component to consider plus the stage where it typically appears.
| Signal / Pain | Consider Adding | Stage |
|---|---|---|
| Read latency high, repeated same queries | Cache (Redis/Memcached), CDN | 2 |
| DB overloaded on reads | Cache, read replicas | 2 |
| DB overloaded on writes | Sharding / partitioning | 2–3 |
| Slow work in request path (e.g. transcode, email) | Queue + workers | 2 |
| Fan-out to many servers (chat, collab, cache invalidation) | Message bus (pub/sub) | 2 |
| Need ordering, retention, replay | Event stream / log (e.g. Kafka) | 2–3 |
| Complex queries, full-text, filtering + ranking | Search index (e.g. Elasticsearch) | 2 |
| Geographic latency for end users | CDN (static), multi-region (data) | 2–3 |
| Hot keys or uneven load | Cache + sharding, dedicated partitions | 2–3 |
| Many independent consumers of same data | Event stream / log | 2 |
| Connection count limit (WebSocket, long-poll) | Load balancer, horizontal servers | 2 |
| Need retention or “load history” | Event log, history/cold store | 2–3 |
| Data or index too stale for users | Cache invalidation (write-through/invalidate), indexing pipeline refresh, replication-lag SLO | 2 |
| Read-your-writes or strong consistency required | Primary/session routing for reads, write-through cache, sync replication or quorum reads | 2 |
| Cost growing fast (storage, scans) | Tiered storage, downsampling, retention | 3 |
| Manual recovery or ops toil | Orchestration, automation | 3 |
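The first two rows (high read latency on repeated queries, DB overloaded on reads) usually resolve to a cache-aside read path: check the cache, fall back to the database on a miss, and invalidate on write. A minimal sketch, with plain dicts standing in for a Redis client and the database, and TTL handling simplified to an expiry timestamp per key:

```python
import time

class CacheAside:
    """Cache-aside read path: check cache, fall back to the DB, repopulate on miss."""

    def __init__(self, db, ttl_seconds=60):
        self.db = db              # stand-in for the real database
        self.ttl = ttl_seconds
        self.cache = {}           # key -> (value, expires_at); stand-in for Redis

    def get(self, key):
        hit = self.cache.get(key)
        if hit and hit[1] > time.time():
            return hit[0]                                  # cache hit
        value = self.db[key]                               # miss: read from the DB
        self.cache[key] = (value, time.time() + self.ttl)  # populate for next reader
        return value

    def write(self, key, value):
        self.db[key] = value
        self.cache.pop(key, None)  # invalidate so readers don't serve stale data
```

Invalidate-on-write (rather than update-on-write) is the simpler Stage 2 choice; the "read-your-writes" row in the table is where you upgrade to write-through or primary routing.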
For goals, typical mechanisms, and risks per category, see Optimization Quick Reference.
## What Triggers the Next Stage?

Use these tables to identify where you are and what to consider adding. Find your pain on the left, then read across for the answer and an example to explore.
### Stage 1 → 2 Triggers

| Signal / Pain | Consider adding | Example(s) |
|---|---|---|
| Single DB is the bottleneck (CPU, connections, disk) | Read replicas, cache | Feed, Chat, Search |
| Read or write latency growing | Cache (reads), queue + workers (writes) | URL shortener, Analytics |
| Need to fan out to many servers | Message bus (pub/sub) | Chat, Collaborative editing |
| Need ordering or replay | Operation log or event stream | Collaborative editing, Streaming |
| Need search or complex queries | Dedicated search index | Autocomplete, Marketplace |
| Connection count or CPU limit on single server | Load balancer, multiple servers | Chat, Game backend |
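The "queue + workers (writes)" trigger above is the classic Stage 1 → 2 move: the request handler only enqueues and returns, and slow work (transcoding, email) happens elsewhere. A sketch using Python's in-process `queue` and `threading` as stand-ins for a real broker (RabbitMQ, SQS) and a worker fleet:

```python
import queue
import threading

def worker(tasks: queue.Queue, results: list) -> None:
    """Pull tasks until a None sentinel arrives; the slow work lives here,
    off the request path."""
    while True:
        job = tasks.get()
        if job is None:
            break
        results.append(f"processed:{job}")  # e.g. transcode, send email
        tasks.task_done()

tasks: queue.Queue = queue.Queue()
results: list = []
threads = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(2)]
for t in threads:
    t.start()

# The request handler only enqueues and returns immediately.
for job in ("video-1", "email-2", "video-3"):
    tasks.put(job)

tasks.join()            # wait for workers to drain the queue
for _ in threads:
    tasks.put(None)     # shut the workers down
for t in threads:
    t.join()
```

Each task goes to exactly one worker; a real broker adds what this sketch lacks, chiefly durability, retries, and a dead-letter queue.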
### Stage 2 → 3 Triggers

| Signal / Pain | Consider adding | Example(s) |
|---|---|---|
| Users in multiple regions need low latency | Multi-region replicas, regional routing | Chat, Distributed cache, Feed |
| Need retention or replay (e.g. “load last 7 days”) | Event log, history store | Chat, Streaming, IoT |
| Very high QPS or CCU | Sharding, dedicated stores | Feed, Search, Game backend |
| Need ranking or personalization | Ranking pipeline, dedicated feed store | Feed, Autocomplete |
| Invalidation at scale (many keys, many nodes) | Pub/sub for invalidation fan-out | Distributed cache |
| Cost growing fast | Tiered storage, downsampling, retention policy | IoT, Analytics |
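The "very high QPS" row reaches for sharding: every server must route a given key to the same shard, which means a stable hash rather than anything process-local. A minimal routing sketch (the function name is illustrative, not from any particular library):

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Route a key to a shard with a stable hash (md5, not Python's seeded
    hash()) so every server agrees on the placement."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

The catch with plain modulo is resharding: changing `num_shards` remaps almost every key, which is why systems that expect to grow their shard count use consistent hashing or a lookup table instead.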
## When Do You Add Messages, Streams, or Queues?

This is where most confusion happens. The three types—queue, pub/sub, and stream—solve different problems. Find what you need on the left and see which type fits.
| I need to… | Queue + Workers | Message Bus (Pub/Sub) | Event Stream / Log |
|---|---|---|---|
| Move slow work off the request path | ✓ | | |
| Retry failed deliveries with backoff + DLQ | ✓ | | |
| Async fan-out on write (e.g. feed to followers) | ✓ | | |
| Fan out to many servers in real time (e.g. chat) | | ✓ | |
| Invalidate cache across nodes | | ✓ | |
| Broadcast edits to all subscribers of a doc | | ✓ | |
| Order events per key and replay from any point | | | ✓ |
| Support multiple independent consumer groups | | | ✓ |
| Buffer high-volume ingest (IoT, analytics) | | | ✓ |
| Deliver or replicate across regions | | ✓ | ✓ |
Quick distinction: Queue = task goes to one worker. Pub/sub = message goes to all subscribers. Stream = ordered log, multiple consumers, replay.
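The quick distinction is easiest to see in code. A toy in-memory log sketching the stream column's properties (ordering, independent consumer groups, replay); by contrast, a queue would hand each event to exactly one worker, and pub/sub would retain nothing for late joiners:

```python
class EventLog:
    """Append-only log with per-group offsets: the Kafka-ish properties from
    the table (ordering, independent consumers, replay), minus everything
    that makes a real stream production-grade."""

    def __init__(self):
        self.events = []   # ordered, append-only
        self.offsets = {}  # consumer group -> next index to read

    def append(self, event):
        self.events.append(event)

    def poll(self, group):
        """Each group reads at its own pace; polling advances only that group."""
        start = self.offsets.get(group, 0)
        batch = self.events[start:]
        self.offsets[group] = len(self.events)
        return batch

    def replay(self, group, offset=0):
        """Rewind one group to any point; the log retains everything."""
        self.offsets[group] = offset
```

Two groups (say, billing and analytics) can poll the same log independently, and rewinding billing's offset never disturbs analytics.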
For a deeper comparison, see Redis vs Kafka: When to Use Which.
## How to Use This with the Examples

- Map your problem to the quality dimensions in the Quality Requirements Checklist on the System Design Requirements page. Which dimensions matter most?
- Identify your stage from the trigger tables above. Are you seeing Stage 1 → 2 signals, or Stage 2 → 3?
- Pick one or two example docs that match your problem shape:
  - Read-heavy + social graph → Feed / timeline
  - Real-time + many connections → Chat
  - High-volume ingest + rules → IoT / sensor ingestion or Streaming
  - Strong consistency + audit → Payments
  - Full-text search + ranking → Search or Autocomplete
Browse the full list on the System Design overview.