Architecture Debugging Checklist
When you’re debugging a backend or architecture problem—especially performance—you often don’t have all the information you need.
This checklist walks you through every layer of a typical web stack so you don’t miss a bottleneck. It applies to any kind of issue, not only performance. The goals are to know the parts of a multi-tier architecture and to identify clearly where the problem might be.
Use it as a high-level guide: work through each layer, then combine it with observability (metrics, logs, traces per layer) where you have it.
Once you suspect a layer, use the Optimization Quick Reference and Infrastructure Building Blocks for concrete fixes.
Frontend
What it is: The client (browser or app) that renders the UI and calls your APIs.
What to check
- Client-side rendering time, heavy scripts, or layout thrashing
- Asset size and load order (JS/CSS/images); blocking requests
- API call patterns (N+1, waterfall requests, missing caching)
- Client-side caching and staleness (e.g. service worker, local storage)
How well do you know this part? Can you distinguish “slow first load” vs “slow API” vs “slow after interaction” using dev tools or RUM (real user monitoring)?
Networking
What it is: DNS, load balancers, proxies, and the path requests take before they reach your app.
What to check
- DNS resolution delays or failures
- Load balancer health, timeouts, connection limits, and backend pool saturation
- Proxy timeouts and connection pooling (e.g. keep-alive, max connections)
- TLS/SSL handshake cost; certificate or cipher issues
- Geographic routing and latency to origin
How well do you know this part? Can you trace a request from client to app and identify where time is spent in the network path?
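One way to see where time goes on the network path is to split a single request into phases. The sketch below assumes you captured cumulative timings with curl's `--write-out` variables (`time_namelookup`, `time_connect`, `time_appconnect`, `time_starttransfer`, `time_total`). The function name `split_phases` and the sample numbers are illustrative assumptions, not measurements.

```python
def split_phases(t):
    """Convert curl's cumulative timings (seconds) into per-phase durations."""
    # For plain HTTP, curl reports time_appconnect as 0, so fall back to the
    # TCP connect time and record a zero-length TLS phase.
    tls_done = t["time_appconnect"] or t["time_connect"]
    return {
        "dns": t["time_namelookup"],
        "tcp_connect": t["time_connect"] - t["time_namelookup"],
        "tls_handshake": tls_done - t["time_connect"],
        # Request sent, round trip, and server processing until the first byte
        "server_wait": t["time_starttransfer"] - tls_done,
        "transfer": t["time_total"] - t["time_starttransfer"],
    }


# Made-up sample values, for illustration only
sample = {
    "time_namelookup": 0.030,
    "time_connect": 0.055,
    "time_appconnect": 0.140,
    "time_starttransfer": 0.620,
    "time_total": 0.650,
}
phases = split_phases(sample)
slowest = max(phases, key=phases.get)
print(slowest, round(phases[slowest], 3))
```

In this sample the wait for the first byte dominates, which would point past DNS and TLS toward the load balancer, proxy, or app tier.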
Web Application Server
What it is: The app tier (e.g. Ruby on Rails, Django, Node, .NET, Spring) that handles requests and talks to data stores.
What to check
- Request handling time per route or handler
- Thread or worker pool exhaustion; blocking I/O on the request path
- Connection pool usage to DB, cache, and downstream services
- Memory and CPU saturation; GC or event-loop stalls
- Cold starts, JIT, or initialization cost after deploys
How well do you know this part? Can you tell if slowness is in “our code,” in a library, or in a downstream call (DB, cache, API)?
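If you have no tracing yet, a rough way to separate app-tier time from downstream time is to wrap each downstream call in a timer and subtract. The following is a minimal sketch, not a replacement for a tracing library. `RequestTimer` and the span names are hypothetical, and it assumes spans run one after another (nested or concurrent spans would be double-counted).

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class RequestTimer:
    """Accumulates time spent in downstream calls during one request."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._start = clock()
        self.downstream = defaultdict(float)

    @contextmanager
    def span(self, name):
        # Wrap each DB, cache, or API call so its duration is attributed to `name`.
        began = self._clock()
        try:
            yield
        finally:
            self.downstream[name] += self._clock() - began

    def report(self):
        total = self._clock() - self._start
        result = dict(self.downstream)
        # Whatever is not attributed to a downstream call is app-tier time
        # (your code plus the libraries it runs in).
        result["app"] = total - sum(self.downstream.values())
        result["total"] = total
        return result


timer = RequestTimer()
with timer.span("db"):
    time.sleep(0.02)   # stand-in for a query
with timer.span("cache"):
    time.sleep(0.005)  # stand-in for a cache lookup
print(timer.report())
```

A large `app` share suggests looking at your own code or a library; a large `db` or `cache` share suggests moving on to that layer.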
Database
What it is: The primary data store (e.g. PostgreSQL, MySQL, SQL Server) for transactional or core application data.
What to check
- Connection pool exhaustion or connection leaks
- Missing or ineffective indexes; full table scans; lock contention
- Query shape (N+1, large result sets, unnecessary joins)
- Replication lag if reads go to replicas; primary saturation
- Schema or migration impact (locks, long-running DDL)
How well do you know this part? Can you read query plans and tie latency spikes to specific queries or connection pressure?
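Reading a query plan mostly means checking whether the database scans a whole table or searches an index. The sketch below uses SQLite from the Python standard library as a stand-in for your primary database. PostgreSQL, MySQL, and SQL Server have their own `EXPLAIN` output formats, and the `orders` table and index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

QUERY = "SELECT * FROM orders WHERE customer_id = ?"


def plan(sql, params):
    # Each EXPLAIN QUERY PLAN row ends with a human-readable detail string.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


before = plan(QUERY, (42,))  # no index on customer_id yet: expect a table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(QUERY, (42,))   # expect a search that uses the new index

print("before:", before)
print("after: ", after)
```

The exact wording of the plan varies by SQLite version, but the first plan should mention a scan and the second should name the index.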
Key-Value / Document Store
What it is: Caches or document/KV stores (e.g. Redis, MongoDB) used for sessions, caching, or specialized access patterns.
What to check
- Hit rate and eviction; memory or key-size growth
- Serialization cost and large values; pipelining vs. many round trips
- Connection pool or connection limits to the store
- Cluster topology (sharding, replicas) and hot keys or partitions
- Timeouts and retries; consistency vs availability trade-offs
How well do you know this part? Can you tell whether the store is the bottleneck (latency, errors, saturation) vs the app’s usage pattern?
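Hit rate and eviction are easier to judge over a time window than from lifetime totals. The sketch below assumes two snapshots of the cumulative counters that Redis exposes in its `INFO stats` section (`keyspace_hits`, `keyspace_misses`, `evicted_keys`). The function name and the 0.8 threshold are arbitrary assumptions you would tune for your workload.

```python
def cache_health(prev, curr, min_hit_rate=0.8):
    """Compare two snapshots of Redis-style counters and flag cache trouble."""
    # The counters are cumulative, so work with the change between snapshots.
    hits = curr["keyspace_hits"] - prev["keyspace_hits"]
    misses = curr["keyspace_misses"] - prev["keyspace_misses"]
    evicted = curr["evicted_keys"] - prev["evicted_keys"]

    lookups = hits + misses
    hit_rate = hits / lookups if lookups else None  # None: no traffic in window

    findings = []
    if hit_rate is not None and hit_rate < min_hit_rate:
        findings.append("low hit rate")
    if evicted > 0:
        findings.append("keys evicted")
    return {"hit_rate": hit_rate, "evicted": evicted, "findings": findings}


# Made-up snapshots taken some interval apart
earlier = {"keyspace_hits": 1000, "keyspace_misses": 200, "evicted_keys": 0}
later = {"keyspace_hits": 1600, "keyspace_misses": 800, "evicted_keys": 50}
report = cache_health(earlier, later)
print(report)
```

A low hit rate together with evictions hints at memory pressure in the store. A low hit rate without evictions more often points at the app's usage pattern, such as keys that expire too early or are never reused.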
Infrastructure
What it is: The compute and platform layer (e.g. EC2, Google Cloud, Azure): VMs, containers, quotas, and shared resources.
What to check
- CPU, memory, and disk I/O saturation; noisy neighbors
- Network throughput and packet loss within the region or AZ
- Quota limits (API rate limits, disk, connections) and throttling
- Scaling triggers and cooldowns (e.g. autoscaling lag)
- Placement and affinity (e.g. same AZ for app and DB to reduce latency)
How well do you know this part? Can you distinguish “the box is full” from “the app is inefficient” using OS-level or cloud metrics?
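One rough heuristic for telling the two apart is to compare CPU cost per request between a healthy baseline window and the current window. This is a sketch under stated assumptions, not an established rule: the field names, the function `diagnose`, and both thresholds are hypothetical, and it only looks at CPU (memory, disk, and network need the same treatment).

```python
def diagnose(baseline, current, saturation=0.85, cost_growth=1.5):
    """Each window is a dict with cpu_util (0..1), cores, and rps (> 0)."""

    def cpu_per_request(window):
        # CPU-seconds consumed per second, divided by requests per second
        return window["cpu_util"] * window["cores"] / window["rps"]

    growth = cpu_per_request(current) / cpu_per_request(baseline)
    if growth >= cost_growth:
        return "inefficient"   # each request costs more CPU than it used to
    if current["cpu_util"] >= saturation:
        return "saturated"     # same cost per request, but the box is full
    return "inconclusive"      # look at another resource or another layer


baseline = {"cpu_util": 0.40, "cores": 4, "rps": 200}
more_load = {"cpu_util": 0.90, "cores": 4, "rps": 450}
same_load = {"cpu_util": 0.90, "cores": 4, "rps": 200}

print(diagnose(baseline, more_load))
print(diagnose(baseline, same_load))
```

In the first case the CPU is busy because traffic grew, so scaling is a reasonable response. In the second, traffic is flat and the cost per request rose, so the app or a recent deploy is the better place to look.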
How to Use This Checklist
- Before working through layers: Circumscribe the problem: note what is working and what isn’t, and be precise about how the system is misbehaving. That keeps you from chasing the wrong layer.
- Simplify when you can: Create a minimal or simplified case that still reproduces the issue (e.g. one request type, smaller dataset). It makes hypothesis-testing, communication, and adding regression tests easier.
- When you have limited info: Work through the layers in order (frontend → networking → app → database → KV/document store → infrastructure). For each layer, form a hypothesis (e.g. “DB pool exhausted” or “cache miss storm”), then use observability to confirm or rule it out. That is usually faster than blindly splitting the system in half.
- Combine with observability: Use Observability (SLIs, dashboards, tracing) to get signals per layer. The checklist tells you what to look for; observability tells you what you see.
- Once you suspect a layer: Use the System Design Checklist for vocabulary and components, and the Optimization Quick Reference for “this pain → this fix.” The Infrastructure Building Blocks doc explains why each type of infrastructure exists and when to use it.
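The layer-by-layer approach above can be written down as a small loop: one hypothesis per layer, each with a check against whatever signals you have. The sketch below is illustrative only. The metric names (`lb_timeouts`, `db_pool_in_use`, `db_pool_size`, `cache_hit_rate`), the 0.5 threshold, and the helper names are assumptions, not part of any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Callable, Optional

LAYER_ORDER = ["frontend", "networking", "app", "database", "kv_store", "infrastructure"]


@dataclass
class Hypothesis:
    layer: str
    claim: str
    # Returns True (confirmed), False (ruled out), or None (no signal available)
    check: Callable[[dict], Optional[bool]]


def when_present(key, predicate):
    # Build a check that returns None when the metric is missing.
    return lambda signals: predicate(signals) if key in signals else None


def triage(hypotheses, signals):
    """Test hypotheses in layer order and stop at the first one confirmed."""
    log = []
    for h in sorted(hypotheses, key=lambda h: LAYER_ORDER.index(h.layer)):
        verdict = h.check(signals)
        status = {True: "confirmed", False: "ruled out", None: "no data"}[verdict]
        log.append((h.layer, h.claim, status))
        if verdict:
            break
    return log


hypotheses = [
    Hypothesis("database", "DB pool exhausted",
               when_present("db_pool_in_use", lambda s: s["db_pool_in_use"] >= s["db_pool_size"])),
    Hypothesis("kv_store", "cache miss storm",
               when_present("cache_hit_rate", lambda s: s["cache_hit_rate"] < 0.5)),
    Hypothesis("networking", "load balancer timeouts",
               when_present("lb_timeouts", lambda s: s["lb_timeouts"] > 0)),
]
signals = {"lb_timeouts": 0, "db_pool_in_use": 20, "db_pool_size": 20}
for entry in triage(hypotheses, signals):
    print(entry)
```

The "no data" outcome is worth keeping separate from "ruled out": it marks a layer where you lack observability, which is useful to mention in a handoff.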
Handoff: Writing a Useful Message to a Teammate
When you hand off a debugging task to someone else, a good message does three things:
- Gives context so they understand the issue and why it matters
- Summarizes what’s already known or ruled out so they don’t redo work
- Gives 1–3 concrete next steps and where to look (layer + observability)
Use this template when writing an email or Slack message:
**Subject:** [Brief symptom] — suspected [layer], next steps below
**What we know:** [1–2 sentences, who’s affected, symptom, any metrics/dashboards]
**What we’ve already checked / ruled out:** [e.g. "Frontend and network look fine; slowness starts at app tier."]
**Suggested next steps:**
1. [Concrete action, e.g. "Check app metrics for route X and DB pool usage."]
2. [Where to look, e.g. "Dashboard: …" or "Runbook: …"]
3. [Optional: "If that doesn’t show it, work through the checklist from [layer]."]
**Useful links:** [This checklist, relevant runbook, dashboard.]
Before You Close the Loop
After you find and fix the cause:
- Add a test (or scenario) that would have caught this bug so it doesn’t come back.
- Document the fix (runbook, postmortem, or code comment) so others benefit next time.
- Consider the bug class: Ask whether this is one instance of a broader failure mode and scan for similar cases.
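As an example of the first item, suppose the cause was an N+1 query pattern. A regression test can pin the number of queries a code path is allowed to issue. The sketch below uses an in-memory SQLite database and `unittest`. `CountingConnection`, `list_orders_with_customer`, and the schema are hypothetical stand-ins for your own data access code.

```python
import sqlite3
import unittest


class CountingConnection:
    """Thin wrapper that counts statements sent to the database."""

    def __init__(self, conn):
        self.conn = conn
        self.queries = 0

    def execute(self, sql, params=()):
        self.queries += 1
        return self.conn.execute(sql, params)


def list_orders_with_customer(db):
    # Fixed version: one JOIN instead of one customer lookup per order.
    return db.execute(
        "SELECT o.id, c.name FROM orders o "
        "JOIN customers c ON c.id = o.customer_id ORDER BY o.id"
    ).fetchall()


class QueryCountRegressionTest(unittest.TestCase):
    def setUp(self):
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
        conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
        conn.executemany("INSERT INTO customers VALUES (?, ?)",
                         [(i, f"c{i}") for i in range(5)])
        conn.executemany("INSERT INTO orders VALUES (?, ?)",
                         [(i, i % 5) for i in range(50)])
        self.db = CountingConnection(conn)

    def tearDown(self):
        self.db.conn.close()

    def test_listing_uses_constant_number_of_queries(self):
        rows = list_orders_with_customer(self.db)
        self.assertEqual(len(rows), 50)
        # An N+1 implementation would issue 51 queries here.
        self.assertLessEqual(self.db.queries, 1)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(QueryCountRegressionTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Asserting on the query count rather than on wall-clock time keeps the test stable across machines while still catching the bug class.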