Architecture Debugging Checklist

First published Mar 2, 2026 by Atif Alam

When you’re debugging a backend or architecture problem—especially performance—you often don’t have all the information you need.

This checklist walks you through every layer of a typical web stack so you don’t miss a bottleneck. It applies to any kind of issue, not only performance. The goals are to know the parts of a multi-tier architecture and to pinpoint where the problem might be.

Use it as a high-level guide: work through each layer, then combine with observability (metrics, logs, traces per layer) where you have it.

Once you suspect a layer, use the Optimization Quick Reference and Infrastructure Building Blocks for concrete fixes.

Frontend

What it is: The client—browser or app—that renders the UI and calls your APIs.

What to check

Client-side rendering time, heavy scripts, or layout thrashing
Asset size and load order (JS/CSS/images); blocking requests
API call patterns (N+1, waterfall requests, missing caching)
Client-side caching and staleness (e.g. service worker, local storage)

How well do you know this part? Can you distinguish “slow first load” vs “slow API” vs “slow after interaction” using dev tools or RUM (real user monitoring)?

Networking

What it is: DNS, load balancers, proxies, and the path requests take before they reach your app.

What to check

DNS resolution delays or failures
Load balancer health, timeouts, connection limits, and backend pool saturation
Proxy timeouts and connection pooling (e.g. keep-alive, max connections)
TLS/SSL handshake cost; certificate or cipher issues
Geographic routing and latency to origin

How well do you know this part? Can you trace a request from client to app and identify where time is spent in the network path?
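One rough way to see where time goes in the network path is to time each phase of a single HTTPS request separately. The sketch below uses only Python's standard library. `PhaseTimings` and `time_request` are hypothetical names chosen for illustration, and the measurement covers only the path from the machine it runs on, so individual load balancer or proxy hops are not visible in it.

```python
import socket
import ssl
import time
from dataclasses import dataclass


@dataclass
class PhaseTimings:
    """Milliseconds spent in each phase of one HTTPS request."""
    dns_ms: float
    connect_ms: float
    tls_ms: float
    first_byte_ms: float

    def slowest(self) -> str:
        """Return the name of the phase that took the longest."""
        phases = {
            "dns": self.dns_ms,
            "connect": self.connect_ms,
            "tls": self.tls_ms,
            "first_byte": self.first_byte_ms,
        }
        return max(phases, key=phases.get)


def time_request(host: str, path: str = "/", port: int = 443,
                 timeout: float = 5.0) -> PhaseTimings:
    """Time DNS, TCP connect, TLS handshake, and first byte for one GET."""
    t0 = time.perf_counter()
    # Use the first address the resolver returns (an assumption; a real
    # client may try several addresses).
    family, socktype, proto, _, addr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM)[0]
    t1 = time.perf_counter()

    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        sock.connect(addr)
        t2 = time.perf_counter()
        context = ssl.create_default_context()
        with context.wrap_socket(sock, server_hostname=host) as tls:
            t3 = time.perf_counter()
            request = (f"GET {path} HTTP/1.1\r\nHost: {host}\r\n"
                       "Connection: close\r\n\r\n")
            tls.sendall(request.encode("ascii"))
            tls.recv(1)  # block until the first response byte arrives
            t4 = time.perf_counter()
    finally:
        sock.close()

    return PhaseTimings(
        dns_ms=(t1 - t0) * 1000,
        connect_ms=(t2 - t1) * 1000,
        tls_ms=(t3 - t2) * 1000,
        first_byte_ms=(t4 - t3) * 1000,
    )
```

Calling `time_request("example.com").slowest()` names the dominant phase for that one request. A single sample is noisy and DNS results may be cached by the operating system, so treat the output as a hint about which item in the list above to look at first rather than as a measurement.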

Web Application Server

What it is: The app tier (e.g. Ruby on Rails, Django, Node, .NET, Spring) that handles requests and talks to data stores.

What to check

Request handling time per route or handler
Thread or worker pool exhaustion; blocking I/O on the request path
Connection pool usage to DB, cache, and downstream services
Memory and CPU saturation; GC or event-loop stalls
Cold starts, JIT, or initialization cost after deploys

How well do you know this part? Can you tell if slowness is in “our code,” in a library, or in a downstream call (DB, cache, API)?
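If tracing is not available yet, a small per-request timer can approximate the split between time spent in your own code and time spent waiting on downstream calls. The following is a minimal sketch, not a replacement for a tracing library; `RequestTimer` and the span names are hypothetical, and it assumes spans are neither nested nor overlapping (otherwise downstream time is counted twice).

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class RequestTimer:
    """Accumulates time spent in named downstream calls during one request."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._started = clock()
        self.downstream = defaultdict(float)

    @contextmanager
    def span(self, name):
        """Wrap a downstream call (DB, cache, API) to record its duration."""
        start = self._clock()
        try:
            yield
        finally:
            self.downstream[name] += self._clock() - start

    def breakdown(self):
        """Return seconds per downstream name, plus 'own_code' and 'total'."""
        total = self._clock() - self._started
        result = dict(self.downstream)
        # Whatever is not attributed to a span is treated as app-tier time.
        result["own_code"] = total - sum(self.downstream.values())
        result["total"] = total
        return result
```

In a handler you would create one `RequestTimer` at the start of the request, wrap each outbound call in `with timer.span("db"):` or `with timer.span("cache"):`, and log `timer.breakdown()` at the end. A large `own_code` share points at the handler or a library; a large named span points at that dependency.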

Database

What it is: The primary data store (e.g. PostgreSQL, MySQL, SQL Server) for transactional or core application data.

What to check

Connection pool exhaustion or connection leaks
Missing or unused indexes; full table scans; lock contention
Query shape (N+1, large result sets, unnecessary joins)
Replication lag if reads go to replicas; primary saturation
Schema or migration impact (locks, long-running DDL)

How well do you know this part? Can you read query plans and tie latency spikes to specific queries or connection pressure?
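As a small illustration of reading a query plan, the sketch below compares the plan for the same query before and after adding an index. It uses SQLite from Python's standard library purely as a stand-in so the example is self-contained; PostgreSQL, MySQL, and SQL Server each have their own `EXPLAIN` syntax and output format. The `orders` table, its columns, and the `query_plan` helper are hypothetical.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

sql = "SELECT * FROM orders WHERE customer_id = ?"

# Without an index on customer_id, the plan reports a scan of the table.
before = query_plan(conn, sql, (42,))

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# With the index in place, the plan reports an index search instead.
after = query_plan(conn, sql, (42,))

print("before:", before)
print("after: ", after)
```

The exact wording of the plan text varies between SQLite versions, but the shift from a scan to an index search is the signal to look for. The same before/after comparison applies when you tie a latency spike to a specific query in a production database.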

Key-Value / Document Store

What it is: Caches or document/KV stores (e.g. Redis, MongoDB) used for sessions, caching, or specialized access patterns.

What to check

Hit rate and eviction; memory or key-size growth
Serialization cost and large values; pipelining vs many round trips
Connection pool or connection limits to the store
Cluster topology (sharding, replicas) and hot keys or partitions
Timeouts and retries; consistency vs availability trade-offs

How well do you know this part? Can you tell whether the store is the bottleneck (latency, errors, saturation) vs the app’s usage pattern?
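Hit rate is most useful when computed over a recent interval rather than since the store started. The sketch below assumes the store exposes cumulative hit and miss counters. Redis reports these as `keyspace_hits` and `keyspace_misses` in its `INFO stats` output; other stores use different names. `interval_hit_rate` is a hypothetical helper and the snapshot values are made-up numbers.

```python
from typing import Mapping, Optional


def interval_hit_rate(before: Mapping[str, int],
                      after: Mapping[str, int]) -> Optional[float]:
    """Hit rate between two counter snapshots, or None if there were no lookups."""
    hits = after["keyspace_hits"] - before["keyspace_hits"]
    misses = after["keyspace_misses"] - before["keyspace_misses"]
    lookups = hits + misses
    if lookups == 0:
        return None
    return hits / lookups


# Two snapshots taken some time apart (illustrative values only).
snapshot_t0 = {"keyspace_hits": 1000, "keyspace_misses": 100}
snapshot_t1 = {"keyspace_hits": 1900, "keyspace_misses": 200}

rate = interval_hit_rate(snapshot_t0, snapshot_t1)
print(f"hit rate over interval: {rate:.0%}")
```

A hit rate that drops while store latency and error rates stay flat suggests the app's usage pattern (key churn, short TTLs, a cold cache after a deploy) rather than the store itself.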

Infrastructure

What it is: The compute and platform layer (e.g. EC2, Google Cloud, Azure)—VMs, containers, quotas, and shared resources.

What to check

CPU, memory, and disk I/O saturation; noisy neighbors
Network throughput and packet loss within the region or AZ
Quota limits (API rate limits, disk, connections) and throttling
Scaling triggers and cooldowns (e.g. autoscaling lag)
Placement and affinity (e.g. same AZ for app and DB to reduce latency)

How well do you know this part? Can you distinguish “the box is full” from “the app is inefficient” using OS-level or cloud metrics?

How to Use This Checklist

Before working through layers: Circumscribe the problem. Note what is working and what isn’t, and be precise about how the system is misbehaving. That keeps you from chasing the wrong layer.
Simplify when you can: Create a minimal or simplified case that still reproduces the issue (e.g. one request type, smaller dataset). It makes hypothesis-testing, communication, and adding regression tests easier.
When you have limited info: Work through the layers in order (frontend → networking → app → database → KV/store → infrastructure). For each layer, form a hypothesis (e.g. “DB pool exhausted” or “cache miss storm”), then use observability to confirm or rule it out—faster than blindly splitting the system in half.
Combine with observability: Use Observability (SLIs, dashboards, tracing) to get signals per layer. The checklist tells you what to look for; observability tells you what you see.
Once you suspect a layer: Use the System Design Checklist for vocabulary and components, and the Optimization Quick Reference for “this pain → this fix.” The Infrastructure Building Blocks doc explains why each type of infrastructure exists and when to use it.
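The layer-by-layer walk described above can be sketched as a small loop: order the hypotheses by layer, check each against whatever signal you have, and stop at the first one that is confirmed. This is only an illustration of the procedure. `Hypothesis`, `walk_layers`, and the layer keys are hypothetical names, and each `check` stands in for a manual look at a dashboard, log, or trace.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Layer order as given in the checklist.
LAYERS = ["frontend", "networking", "app", "database", "kv_store", "infrastructure"]


@dataclass
class Hypothesis:
    layer: str
    claim: str
    # True = confirmed, False = ruled out, None = no signal available.
    check: Callable[[], Optional[bool]]


def walk_layers(
    hypotheses: List[Hypothesis],
) -> Tuple[Optional[Hypothesis], List[Hypothesis], List[Hypothesis]]:
    """Check hypotheses in layer order; stop at the first confirmed one."""
    order = {layer: i for i, layer in enumerate(LAYERS)}
    ruled_out: List[Hypothesis] = []
    unknown: List[Hypothesis] = []
    for hypothesis in sorted(hypotheses, key=lambda h: order[h.layer]):
        verdict = hypothesis.check()
        if verdict is True:
            return hypothesis, ruled_out, unknown
        if verdict is False:
            ruled_out.append(hypothesis)
        else:
            unknown.append(hypothesis)
    return None, ruled_out, unknown
```

The `ruled_out` and `unknown` lists are worth keeping even when nothing is confirmed: they record what has already been checked and where observability is missing.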

Handoff: Writing a Useful Message to a Teammate

When you hand off a debugging task to someone else, a good message does three things:

Gives context so they understand the issue and why it matters
Summarizes what’s already known or ruled out so they don’t redo work
Gives 1–3 concrete next steps and where to look (layer + observability)

Use this template when writing an email or Slack message:

**Subject:** [Brief symptom] — suspected [layer], next steps below

**What we know:**
[1–2 sentences: who’s affected, the symptom, and any metrics/dashboards]

**What we’ve already checked / ruled out:**
[e.g. "Frontend and network look fine; slowness starts at app tier."]

**Suggested next steps:**
1. [Concrete action, e.g. "Check app metrics for route X and DB pool usage."]
2. [Where to look, e.g. "Dashboard: …" or "Runbook: …"]
3. [Optional: "If that doesn’t show it, work through the checklist from [layer]."]

**Useful links:** [This checklist, relevant runbook, dashboard.]

Before You Close the Loop

After you find and fix the cause:

Add a test (or scenario) that would have caught this bug so it doesn’t come back.
Document the fix (runbook, postmortem, or code comment) so others benefit next time.
Consider the bug class: Ask whether this is one instance of a broader failure mode and scan for similar cases.