
System Design Checklist

By Atif Alam

This checklist walks through 10 system-design areas—from clients and edge through traffic, compute, data, caching, messaging, distribution, consistency, observability, and security.

In each area, Components are the concrete building blocks; Patterns & Concerns are the design choices, strategies, and trade-offs you apply to them.

Use it as a mental checklist when you’re designing or reviewing systems.

Components

  • API servers / app servers (stateless)
  • Service layout (monolith app vs microservices)
  • Background workers
  • Real-time servers (WebSocket/SSE)
  • Ranking / scoring service (e.g. feed ranking pipeline, search reranker, ML-based scoring)

Patterns & Concerns

  • Horizontal scaling strategy (statelessness, autoscaling triggers)
  • Backpressure + load shedding
  • Retry policy + timeouts + circuit breakers (service resilience)
  • Deployment strategy (blue/green, canary) if relevant — see Progressive Delivery
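
The retry, timeout, and circuit-breaker item above can be sketched as follows. This is a minimal illustration, not a production library: `CircuitBreaker`, `CircuitOpenError`, `retry_with_backoff`, and their parameters are hypothetical names, the clock and sleep functions are injectable so the behavior is easy to test, and jitter is left out for brevity.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without running."""


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, so let this one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def retry_with_backoff(fn: Callable[[], T], attempts: int = 3,
                       base_delay_s: float = 0.1,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry fn with exponential backoff; never retry against an open circuit."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay_s * 2 ** attempt)
    raise RuntimeError("attempts must be >= 1")
```

In practice the two are usually composed, for example `retry_with_backoff(lambda: breaker.call(fetch))`, and a real client would also add random jitter to the delay so that retries from many callers do not synchronize.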

Components

  • Relational databases (SQL)
  • NoSQL data stores (KV / document / wide-column / graph)
  • Object/blob storage
  • Search index
  • Time-series DB (when relevant)
  • Dedicated / purpose-built store (e.g. precomputed feed store, graph store) — when a single DB can’t serve mixed workloads

Patterns & Concerns

  • Data modeling choices (schema, indexing, access patterns)
  • Query patterns (read-heavy vs write-heavy, joins vs denormalization)
  • Hot partition avoidance (key design)
  • Data lifecycle (retention policy, archival, GDPR delete flows)
  • Polyglot persistence — using different store types for different access patterns (e.g. OLTP + search + cache + analytics)
  • Tiered storage (hot/warm/cold) — moving older or less-accessed data to cheaper storage for cost optimization
  • Downsampling — reducing granularity of old data (e.g. time-series, analytics) to save storage cost
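
As one illustration of the downsampling item above, the sketch below collapses raw time-series points into fixed-width buckets and keeps only the mean per bucket. The function name `downsample` and the `(timestamp, value)` tuple layout are assumptions made for the example; real time-series stores typically offer this as a built-in rollup or retention feature.

```python
from collections import defaultdict
from statistics import fmean
from typing import Dict, Iterable, List, Tuple


def downsample(points: Iterable[Tuple[int, float]],
               bucket_s: int) -> List[Tuple[int, float]]:
    """Average (unix_seconds, value) points into buckets of bucket_s seconds.

    Returns (bucket_start, mean_value) pairs sorted by bucket start.
    """
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - ts % bucket_s  # floor to the bucket boundary
        buckets[bucket_start].append(value)
    return [(start, fmean(values)) for start, values in sorted(buckets.items())]


raw = [(0, 1.0), (30, 3.0), (60, 10.0), (119, 20.0), (120, 5.0)]
per_minute = downsample(raw, bucket_s=60)
# per_minute == [(0, 2.0), (60, 15.0), (120, 5.0)]
```

Keeping only the mean loses peaks, so a rollup that also stores min, max, and count per bucket is often a better fit when old data is still used for incident analysis.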

Components

  • Clients: web, mobile, desktop, other services
  • DNS
  • Global routing / Anycast (conceptually “global traffic steering”)
  • CDN
  • WAF (Web Application Firewall) / DDoS (Distributed Denial of Service) protection

Patterns & Concerns

  • Edge caching strategy (what gets cached, TTLs, invalidation)
  • Geo routing goals (latency vs compliance vs cost)
  • Regional routing — directing user traffic to the nearest regional deployment for low latency (beyond DNS; includes request routing at the LB/app level)
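
The latency-versus-compliance trade-off in the geo routing items above can be sketched as a small selection function. Everything here is hypothetical: the region names, the `pick_region` function, and the idea that latency measurements and a data-residency allow-list are already available to the router.

```python
from typing import Dict, Optional, Set


def pick_region(latency_ms: Dict[str, float], healthy: Set[str],
                allowed: Optional[Set[str]] = None) -> str:
    """Pick the lowest-latency healthy region, restricted to `allowed` if given.

    `allowed` models a compliance constraint such as data residency;
    None means any region is acceptable.
    """
    candidates = [
        region for region in latency_ms
        if region in healthy and (allowed is None or region in allowed)
    ]
    if not candidates:
        raise LookupError("no healthy region satisfies the routing constraints")
    # Tie-break on the region name so the choice is deterministic.
    return min(candidates, key=lambda region: (latency_ms[region], region))


latencies = {"us-east": 20.0, "eu-west": 90.0, "ap-south": 180.0}
all_healthy = {"us-east", "eu-west", "ap-south"}
nearest = pick_region(latencies, all_healthy)
eu_only = pick_region(latencies, all_healthy, allowed={"eu-west"})
```

Note the ordering of concerns: compliance filters the candidate set first, and latency only ranks what remains. Cost could be added as a further term in the sort key.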

Components

  • Load balancer (L4/L7)
  • API Gateway / Reverse proxy

Patterns & Concerns

  • Rate limiting (token/leaky bucket; per-user/IP/key)
  • AuthN/AuthZ (OAuth/OIDC, JWT, service-to-service auth)
  • Request shaping: throttling, quotas, spike arrest
  • API versioning / backward compatibility (often discussed here)
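
To make the token-bucket item above concrete, here is a minimal single-process sketch. The class name `TokenBucket` and its parameters are illustrative assumptions; a gateway would normally keep one bucket per user, IP, or API key in a shared store rather than in local memory.

```python
import time
from typing import Callable


class TokenBucket:
    """Refills at rate_per_s tokens per second, up to capacity (the burst size)."""

    def __init__(self, rate_per_s: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.updated_at = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if enough are available."""
        now = self.clock()
        elapsed = now - self.updated_at
        self.updated_at = now
        # Lazy refill: add the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A per-key limiter can be built on top with a dictionary of buckets, for example `buckets.setdefault(api_key, TokenBucket(5, 10)).allow()`. A rejected request would typically map to an HTTP 429 response.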

Components

  • Client/device cache
  • CDN cache
  • In-memory cache (e.g. Redis/Memcached)

Patterns & Concerns

  • Cache strategy: cache-aside, write-through, write-behind — see Caching Strategies for detailed patterns
  • TTLs, eviction policies, invalidation
  • Read scaling via replicas
  • Stampede protection (locks, singleflight, request coalescing)
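
The cache-aside and stampede-protection items above can be combined in one small sketch. It assumes an in-process cache and a hypothetical `loader` callable standing in for the database read; `CacheAside` is an illustrative name, and the per-key lock plays the role that singleflight or request coalescing plays in other stacks.

```python
import threading
import time
from typing import Any, Callable, Dict, Optional, Tuple


class CacheAside:
    """Cache-aside reads with a per-key lock so only one caller loads a missing key."""

    def __init__(self, loader: Callable[[str], Any], ttl_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.loader = loader
        self.ttl_s = ttl_s
        self.clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)
        self._locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()  # protects the _locks dictionary

    def _fresh(self, key: str) -> Optional[Tuple[float, Any]]:
        entry = self._entries.get(key)
        if entry is not None and entry[0] > self.clock():
            return entry
        return None

    def get(self, key: str) -> Any:
        entry = self._fresh(key)
        if entry is not None:
            return entry[1]  # cache hit
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have filled the entry while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[1]
            value = self.loader(key)  # miss: read from the source of truth
            self._entries[key] = (self.clock() + self.ttl_s, value)
            return value

    def invalidate(self, key: str) -> None:
        """Drop a key, e.g. after the underlying record is written."""
        self._entries.pop(key, None)
```

With a distributed cache such as Redis the same idea needs a distributed lock or a short-lived "loading" marker, since a thread lock only coalesces requests within one process.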

Components

  • Queue (e.g. RabbitMQ, Amazon SQS, Redis)
  • Pub/Sub / event bus (e.g. Kafka, Google Pub/Sub, AWS SNS, Redis Pub/Sub)
  • Event log / durable stream (e.g. Kafka with retention) — ordered, replayable log for history and replay
  • Stream processing (e.g. Kafka Streams, Apache Flink, ksqlDB) when needed

Patterns & Concerns

  • Delivery semantics: at-most-once, at-least-once, or exactly-once (in practice, usually at-least-once delivery plus idempotent consumers)
  • Retries + dead-letter queues (DLQ)
  • Ordering guarantees (per key/partition)
  • Consumer scaling + backpressure
  • Retention and replay — how long to keep messages/events, replay from offset or timestamp, history/cold store for older data
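
The retry and dead-letter-queue items above can be illustrated with an in-memory queue. This is a sketch under stated assumptions: `Message`, `drain`, and `max_attempts` are hypothetical names, and a real broker (SQS, RabbitMQ, Kafka consumers) would handle redelivery and dead-lettering through its own configuration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List


@dataclass
class Message:
    body: str
    attempts: int = 0


def drain(queue: Deque[Message], handler: Callable[[str], None],
          dead_letters: List[Message], max_attempts: int = 3) -> None:
    """Process every message; retry failures, then park them in dead_letters."""
    while queue:
        message = queue.popleft()
        message.attempts += 1
        try:
            handler(message.body)
        except Exception:
            if message.attempts >= max_attempts:
                dead_letters.append(message)  # give up: keep it for inspection
            else:
                queue.append(message)  # redeliver later (at-least-once)


processed: List[str] = []


def handle(body: str) -> None:
    if body == "bad":
        raise ValueError("cannot parse message")
    processed.append(body)


queue = deque([Message("a"), Message("bad"), Message("b")])
dlq: List[Message] = []
drain(queue, handle, dlq)
# processed == ["a", "b"]; dlq holds the "bad" message after 3 attempts
```

Re-enqueueing at the tail means a retried message is processed after newer ones, so this simple approach gives up ordering. When per-key ordering matters, retries usually have to block the partition or be routed to a separate retry topic.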

7. Data Distribution and Reliability Primitives


Components

  • Replication setup (as a system capability; often in the DB/platform)
  • Sharding/partitioning mechanism (db-level or app-level)
  • Coordination system (leader election / consensus) when needed
  • ID generator service (if not using UUID)

Patterns & Concerns

  • Partitioning strategy: hash/range/geo; consistent hashing
  • Replication mode: sync vs async; RPO/RTO implications
  • Failover strategy (active-active vs active-passive)
  • Rebalancing and resharding strategy
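
Consistent hashing, mentioned in the partitioning item above, is easiest to see in code. The sketch below is a minimal ring with virtual nodes; `HashRing`, `vnodes`, and the SHA-256-based `_hash` helper are choices made for this example rather than anything prescribed by the checklist.

```python
import bisect
import hashlib
from typing import Dict, Iterable, List


def _hash(value: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Consistent-hash ring; each node owns `vnodes` points to even out load."""

    def __init__(self, nodes: Iterable[str] = (), vnodes: int = 100) -> None:
        self.vnodes = vnodes
        self._hashes: List[int] = []      # sorted ring positions
        self._owners: Dict[int, str] = {}  # ring position -> node
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        for i in range(self.vnodes):
            h = _hash(f"{node}#{i}")
            self._owners[h] = node
            bisect.insort(self._hashes, h)

    def remove(self, node: str) -> None:
        self._hashes = [h for h in self._hashes if self._owners[h] != node]
        self._owners = {h: n for h, n in self._owners.items() if n != node}

    def node_for(self, key: str) -> str:
        """Return the node owning the first ring position clockwise from the key."""
        if not self._hashes:
            raise LookupError("ring has no nodes")
        idx = bisect.bisect_right(self._hashes, _hash(key)) % len(self._hashes)
        return self._owners[self._hashes[idx]]
```

The property that matters for rebalancing is that adding a node only moves keys onto the new node, and on average about 1/N of them, whereas `hash(key) % N` reshuffles most keys whenever N changes.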

Components

  • Usually no new infrastructure—you apply consistency patterns to existing DBs, queues, and services; optionally:
  • Transaction coordinator / workflow engine — coordinates multi-step or cross-service transactions (e.g. two-phase commit, saga orchestration; tools like Temporal, AWS Step Functions)
  • Lock service — distributed locking so processes can coordinate on “who holds the lock” (e.g. etcd, ZooKeeper, Redis with Redlock)

Patterns & Concerns

  • Transactions and isolation choices — which level (read uncommitted → serializable) and where to use them
  • Idempotency keys — same request applied once even if retried; client sends a key, server dedupes
  • Distributed locks (and alternatives) — exclusive access across nodes; or use DB row locks / optimistic concurrency
  • Saga / outbox pattern (cross-service consistency) — multi-step flow with compensating actions, or write events to outbox table then publish
  • Deduplication (handling at-least-once) — accept duplicate deliveries and make them harmless (e.g. idempotency, unique constraint)
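
The idempotency-key, unique-constraint, and outbox items above fit together naturally, as the sketch below tries to show using SQLite from the standard library. The table layout and the function names `open_db`, `create_payment`, and `publish_pending` are assumptions for illustration; the `publish` callable stands in for whatever broker client is in use.

```python
import json
import sqlite3
from typing import Callable


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE payments (
            idempotency_key TEXT PRIMARY KEY,  -- unique constraint dedupes retries
            amount_cents    INTEGER NOT NULL
        );
        CREATE TABLE outbox (
            id        INTEGER PRIMARY KEY AUTOINCREMENT,
            payload   TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)
    return db


def create_payment(db: sqlite3.Connection, idempotency_key: str,
                   amount_cents: int) -> bool:
    """Apply the payment once. Returns False if this key was already processed."""
    event = {"type": "payment_created", "key": idempotency_key,
             "amount_cents": amount_cents}
    try:
        with db:  # one transaction: the state change and the event commit together
            db.execute("INSERT INTO payments VALUES (?, ?)",
                       (idempotency_key, amount_cents))
            db.execute("INSERT INTO outbox (payload) VALUES (?)",
                       (json.dumps(event),))
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate key: the retry is harmless


def publish_pending(db: sqlite3.Connection,
                    publish: Callable[[dict], None]) -> int:
    """Relay unpublished outbox rows to the broker, marking each one after sending."""
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id").fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Because a row is marked only after `publish` returns, a crash between the two steps republishes the event on the next run. The relay is therefore at-least-once, and downstream consumers still need to deduplicate.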

Components

  • Logging pipeline (collector/store) (e.g. ELK, Loki, Fluentd, CloudWatch Logs)
  • Metrics system (time-series metrics backend) (e.g. Prometheus, InfluxDB, Datadog; Grafana for dashboards)
  • Tracing system (e.g. Jaeger, Zipkin, X-Ray; OpenTelemetry for instrumentation and collection)
  • Alerting/on-call tooling (e.g. PagerDuty, Opsgenie, Alertmanager)
  • Health checks system (e.g. load balancer health checks, Kubernetes liveness/readiness)

Patterns & Concerns

  • SLOs/SLIs (availability/latency/error budgets)
  • Sampling strategy for tracing/logging
  • Autoscaling policy (CPU/QPS/latency)
  • Runbooks + incident response expectations
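
The error-budget arithmetic behind the SLO item above is short enough to write out. The function names below are hypothetical, and the example assumes a simple request-based availability SLI (good requests divided by total requests).

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full downtime an availability SLO tolerates over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    return (1.0 - slo_target) * window_days * 24 * 60


def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget left; negative means the SLO is breached."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# A 99.9% SLO over 30 days tolerates roughly 43.2 minutes of downtime.
downtime = allowed_downtime_minutes(0.999)
# 250 failures out of 1,000,000 requests spends about a quarter of the budget.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
```

Alerting on how fast the budget is being spent (burn rate) is usually more actionable than alerting on the raw error rate, since it ties the page directly to the SLO.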

Components

  • Secrets manager / KMS — store and serve secrets (passwords, API keys) and encryption keys with access control and audit (e.g. HashiCorp Vault, AWS Secrets Manager, AWS KMS, Azure Key Vault)
  • Audit log store

Patterns & Concerns

  • Encryption in transit (TLS) and at rest
  • Least privilege / IAM model
  • PII boundaries, retention, deletion
  • Threat modeling basics (abuse prevention, auth bypass, replay)
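
As one example of the replay concern in the threat-modeling item above, the sketch below signs each request with an HMAC over a timestamp, a nonce, and the body, then rejects stale or repeated requests. The names `sign` and `ReplayGuard`, the 300-second window, and the message layout are all assumptions for illustration; nonces are assumed to be hex or UUID strings without a "." character, and the seen-nonce set is kept in local memory rather than a shared store.

```python
import hashlib
import hmac
import time
from typing import Callable, Dict


def sign(secret: bytes, timestamp: int, nonce: str, body: bytes) -> str:
    """HMAC-SHA256 over timestamp, nonce, and body, returned as hex."""
    message = f"{timestamp}.{nonce}.".encode("utf-8") + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


class ReplayGuard:
    """Accepts a signed request once, and only within max_skew_s of its timestamp."""

    def __init__(self, secret: bytes, max_skew_s: int = 300,
                 clock: Callable[[], float] = time.time) -> None:
        self.secret = secret
        self.max_skew_s = max_skew_s
        self.clock = clock
        self._seen: Dict[str, int] = {}  # nonce -> request timestamp

    def verify(self, timestamp: int, nonce: str, body: bytes, signature: str) -> bool:
        now = self.clock()
        if abs(now - timestamp) > self.max_skew_s:
            return False  # too old or too far in the future
        expected = sign(self.secret, timestamp, nonce, body)
        if not hmac.compare_digest(expected, signature):
            return False  # tampered body or wrong key
        # Forget nonces whose timestamps have left the acceptance window.
        self._seen = {n: ts for n, ts in self._seen.items()
                      if now - ts <= self.max_skew_s}
        if nonce in self._seen:
            return False  # same request seen before: replay
        self._seen[nonce] = timestamp
        return True
```

The timestamp window bounds how long nonces must be remembered, and `hmac.compare_digest` avoids leaking information through comparison timing. TLS is still needed underneath, since the signature protects integrity and freshness but not confidentiality.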
