Analytics / Reporting — Designed in Stages
You don’t need to design for scale on day one.
This guide is a staged design playbook: it tells you what to build at MVP, Growth, and Advanced scale so you don’t over- or under-build.
Define what you need—ingest events, run aggregations, and serve ad-hoc or scheduled reports—then build the simplest thing that works and evolve as event volume and query load grow.
When to use this: you’re designing or evolving an analytics system from scratch or from an existing MVP, you have event sources, and you need reports and dashboards. Skip or adapt if you only need one-off exports or a non–read-heavy workload.
Unlike designing for max scale up front, this adds complexity only when triggers appear. Unlike ad-hoc growth with no structure, you get a clear sequence: MVP → Growth → Advanced. If you over-build (e.g. real-time before you need it), you pay in cost and ops. If you under-invest in triggers (e.g. no warehouse when scans slow), you hit limits and firefight. The stages tie additions to triggers so you avoid both.
Here we use an analytics or reporting system as the running example: events, aggregations, and reports. The same staged thinking applies to product analytics, business intelligence, or any read-heavy, time-range query workload. When you need real-time dashboards or alerts, the Streaming example (event streams, stream processing) is a natural fit for the ingestion path.
Requirements and Constraints (no architecture yet)
Functional Requirements
- Ingest events — accept event data (user actions, system metrics, business events); batch or stream; store for later query and aggregation.
- Run aggregations — count, sum, average, group-by (e.g. by day, by user, by dimension); precomputed or ad-hoc.
- Ad-hoc or scheduled reports — dashboards (e.g. DAU, revenue by day); scheduled reports (daily summary, weekly digest); export (CSV, etc.).
- Time-range queries — most analytics queries filter by time (last 7 days, this month); storage and indexing should support efficient time-range scans.
Quality Requirements
- Read-heavy — queries and dashboard loads dominate; writes (event ingest) can be batched or streamed; optimize for read throughput and query latency.
- Freshness — how soon after an event occurs it appears in reports; batch ETL may be minutes to hours; a real-time pipeline can be seconds (see Stage 3).
- Query latency — dashboard load time and ad-hoc query time (e.g. p95 < 2–5 s for typical aggregations); large scans need cost control (limits, partitioning).
- Expected scale — events per second (or per day), storage volume, number of dashboards and concurrent users, retention (e.g. 90 days, 1 year).
Key Entities
- Events — the raw unit: timestamp, event type, dimensions (user_id, device, etc.), optional metrics (value, count); immutable once stored.
- Aggregations — precomputed or on-demand: e.g. daily active users (DAU), revenue by day, count by dimension; materialized views or summary tables.
- Reports — saved queries, dashboards, or scheduled jobs; definition (query, filters, group-by) and optionally cached result.
Primary Use Cases and Access Patterns
- Ingest — write path; batch (file or API bulk) or stream (event-by-event); dedup and schema validation; append-only.
- Aggregation queries — read path; scan or index by time and dimensions; aggregate in query or read from precomputed table.
- Dashboard load — read path; often same query repeatedly (e.g. “DAU last 30 days”); caching and precomputation reduce load.
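The ingest access pattern above (append-only, dedup, schema validation) can be sketched as a minimal in-memory handler. All names here (`handle_event`, `SEEN_IDS`, `STORE`) are hypothetical; a real system would back dedup with a unique index or key in the store, not process memory.

```python
import time

REQUIRED = {"event_id", "timestamp", "event_type"}
SEEN_IDS = set()   # stand-in for a unique index on event_id
STORE = []         # stand-in for the append-only event store

def handle_event(event: dict) -> bool:
    """Validate schema, drop duplicates, append; returns True if stored."""
    if not REQUIRED <= event.keys():
        raise ValueError(f"missing fields: {REQUIRED - event.keys()}")
    if event["event_id"] in SEEN_IDS:
        return False                      # dedup: already ingested
    SEEN_IDS.add(event["event_id"])
    STORE.append(event)                   # append-only; immutable once stored
    return True

handle_event({"event_id": "e1", "timestamp": time.time(), "event_type": "click"})
handle_event({"event_id": "e1", "timestamp": time.time(), "event_type": "click"})  # dropped as duplicate
```

The same shape works whether events arrive one by one or in batches; batch ingest just loops the handler inside one transaction.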
Given this, start with the simplest MVP: an API or ingest endpoint, one DB (or warehouse), batch ETL or simple inserts, and basic reporting queries, then evolve with a dedicated warehouse, ETL/ELT pipeline, and caching—and optionally a real-time tier—as volume and freshness requirements grow.
Stage 1 — MVP (simple, correct, not over-engineered)
Goal
Ship working analytics: events are ingested (batch or simple API), stored, and queryable for basic aggregations and reports. One ingest path, one store, minimal moving parts.
Components
- API or ingest endpoint — accept events (single or batch); validate schema; write to store (direct insert or to staging table for ETL). Auth (e.g. write key) and rate limiting to protect the system.
- DB (or warehouse) — store events in a table (or partitioned table) with timestamp and dimensions; can be same DB as OLTP for very small scale, or a separate DB/warehouse (e.g. Postgres, BigQuery, Snowflake) to avoid impacting production.
- Batch ETL or simple inserts — either (a) events inserted directly into analytics table(s), or (b) raw events in staging; batch job transforms and loads into aggregated or query-optimized tables. Simple daily or hourly batch is enough for MVP (e.g. nightly or every 6–12 hours).
- Basic reporting queries — run SQL (or query API) for aggregations (count, sum, group by time/dimension); dashboards or reports call these queries; no real-time requirement yet.
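A basic reporting query at this stage is plain SQL over an events table indexed by time. A minimal sketch using SQLite as a stand-in for the MVP store (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        ts TEXT NOT NULL,          -- ISO-8601 timestamp
        event_type TEXT NOT NULL,
        user_id TEXT NOT NULL
    )
""")
conn.execute("CREATE INDEX idx_events_ts ON events(ts)")  # supports time-range scans
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2024-05-01T09:00:00", "login", "u1"),
        ("2024-05-01T10:00:00", "click", "u1"),
        ("2024-05-01T11:00:00", "login", "u2"),
        ("2024-05-02T09:00:00", "login", "u1"),
    ],
)
# DAU: distinct users per day over a time range
rows = conn.execute("""
    SELECT substr(ts, 1, 10) AS day, COUNT(DISTINCT user_id) AS dau
    FROM events
    WHERE ts >= '2024-05-01' AND ts < '2024-05-03'
    GROUP BY day ORDER BY day
""").fetchall()
print(rows)  # [('2024-05-01', 2), ('2024-05-02', 1)]
```

The same query shape (time filter, group by day or dimension, aggregate) covers most MVP dashboards; only the store changes later.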
Minimal Diagram
Stage 1: one path in (ingest), one store, one path out (reporting).
Events (API or batch file)
        |
        v
+-----------------+
|   Ingest API    |
+-----------------+
        |
        v
DB or Warehouse
  - events table (time, dimensions, metrics)
  - optional: summary tables (batch ETL)
        ^
        |
Reporting / Dashboards (query API or SQL)

Patterns and Concerns (don’t overbuild)
- Schema for events: define event types and dimensions (user_id, date, etc.); optional metrics (value); partition or index by time for range queries.
- Separation from OLTP: if using same DB, use separate schema or database for analytics so heavy scans don’t block transactions; prefer separate store when possible.
- Basic monitoring: ingest rate, storage growth, query latency, failed ingest.
Still Avoid (common over-engineering here)
- Dedicated warehouse or columnar store before you have slow scans, storage limits, or many dashboards.
- Real-time stream ingestion and stream processing before you have sub-hour freshness requirements.
Why This Is a Correct MVP
- One ingest path, one store, batch ETL or simple inserts, basic queries → enough to ship dashboards and reports; easy to reason about.
- Vertical scaling (larger DB/warehouse) and nightly batch buy you time before you need a dedicated warehouse and real-time pipeline.
Stage 2 — Growth Phase (more events, heavier queries, dashboards)
You have a working MVP (ingest, one store, batch ETL, basic reports). Now one or more of the triggers below are true.
What Triggers the Growth Phase?
- Need dedicated warehouse: when single DB or simple schema doesn’t scale (slow scans, storage limits) → move to columnar store for scan and aggregation; clear separation from OLTP.
- Need query optimization at scale: when many dashboards and ad-hoc queries overwhelm the DB → add dedicated warehouse (columnar, compression) and query optimization.
- Need reproducible pipeline: when you need consistent transform, aggregate, and load → add ETL/ELT pipeline and clear separation from OLTP; scheduled (hourly, daily) or triggered; idempotent where possible.
- Need caching or precomputation: when dashboard load causes repeated heavy queries → cache query results or materialize dashboard tables; reduce load on warehouse.
Components to Add (incrementally)
- Dedicated warehouse or columnar store — move analytics to a data warehouse (e.g. BigQuery, Snowflake, Redshift) or columnar store (e.g. ClickHouse); optimized for scan and aggregation; separate from OLTP.
- ETL/ELT pipeline — extract from source (OLTP DB, event log, or ingest table), transform (clean, aggregate, join), load into warehouse; scheduled (hourly, daily) or triggered; idempotent where possible.
- Caching for dashboards — cache query results (e.g. by query key + time range) with TTL (e.g. 5–15 min for near-real-time, longer for historical); or materialize dashboard tables and refresh on schedule.
- Separation from OLTP — analytics reads only from warehouse or replica; no heavy query on production DB; clear ownership and cost attribution.
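The idempotent-pipeline idea can be sketched with the common delete-and-reload pattern: each run owns one time partition, deletes it, and recomputes it, so re-runs are safe. SQLite stands in for the warehouse; table names (`raw_events`, `dau_daily`) are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (day TEXT, user_id TEXT)")
conn.execute("CREATE TABLE dau_daily (day TEXT PRIMARY KEY, dau INTEGER)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [("2024-05-01", "u1"), ("2024-05-01", "u2"), ("2024-05-01", "u1")],
)

def load_dau(conn, day):
    """Idempotent load: delete the day's partition, then recompute it.
    Re-running for the same day always yields the same single row."""
    with conn:  # one transaction: delete + insert commit together
        conn.execute("DELETE FROM dau_daily WHERE day = ?", (day,))
        conn.execute("""
            INSERT INTO dau_daily
            SELECT day, COUNT(DISTINCT user_id) FROM raw_events
            WHERE day = ? GROUP BY day
        """, (day,))

load_dau(conn, "2024-05-01")
load_dau(conn, "2024-05-01")  # safe re-run: same row, no duplicates
print(conn.execute("SELECT * FROM dau_daily").fetchall())  # [('2024-05-01', 2)]
```

Warehouses offer equivalents (MERGE, partition overwrite); the property that matters is that a re-run converges to the same state.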
Growth Diagram
Stage 2: we add staging, scheduled ETL, warehouse, and cache for dashboards.
                    +------------------+
Events -----------> |    Ingest API    |
                    +------------------+
                             |
                             v
                     Staging / Raw store
                             |
                             v
              ETL/ELT (scheduled, e.g. hourly/daily)
                             |
                             v
                    +------------------+
                    |    Warehouse     |
                    |    (columnar)    |
                    +------------------+
                        |         |
                        v         v
                  Dashboards   Ad-hoc queries
                        |         |
                        v         v
            Cache (dashboard results, TTL e.g. 5–15 min)

Patterns and Concerns to Introduce (practical scaling)
- Partitioning by time: partition large tables by date (or hour); queries with time filter scan only relevant partitions; retention = drop old partitions.
- Precomputation: materialize common aggregations (e.g. DAU by day, revenue by day); ETL or scheduled job updates them; dashboards read from these tables for speed.
- Query optimization: avoid full scans when possible (filters, limits); cost control (scan limits, query timeout) to protect warehouse from runaway queries.
- Monitoring: ETL run duration and success, warehouse storage and compute cost, cache hit ratio, query p95 latency.
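The dashboard-caching pattern from the component list can be sketched as a result cache keyed by query key plus time range, with a TTL. `cached_query` and the module-level `_cache` are hypothetical names; in production this would typically be Redis or the BI tool's own cache.

```python
import time

_cache: dict = {}  # (query_key, time_range) -> (cached_at, result)

def cached_query(key, time_range, run_query, ttl=300):
    """Return a cached result if fresh, otherwise run the query and cache it."""
    cache_key = (key, time_range)
    hit = _cache.get(cache_key)
    now = time.monotonic()
    if hit is not None and now - hit[0] < ttl:
        return hit[1]                      # cache hit: warehouse is not touched
    result = run_query()
    _cache[cache_key] = (now, result)
    return result

calls = []
run = lambda: calls.append(1) or [("2024-05-01", 2)]  # pretend warehouse query
a = cached_query("dau", "last_30d", run, ttl=300)
b = cached_query("dau", "last_30d", run, ttl=300)     # served from cache
print(len(calls))  # 1
```

Keying on the time range matters: "last 30 days" issued at different times is a different query, so rolling windows usually get short TTLs while fixed historical ranges can be cached much longer.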
Still Avoid (common over-engineering here)
- Real-time stream ingestion and stream processing before you have sub-hour freshness requirements.
- Multiple tiers (batch + real-time) until product clearly needs both.
- Complex data mesh or many independent pipelines before boundaries are clear.
Stage 3 — Advanced Scale (stream ingestion, real-time pipeline, cost control)
You have a warehouse, ETL, and caching. Now you need sub-hour freshness or real-time alerts.
What Triggers Advanced Scale?
- Need near real-time: when product needs events visible in minutes or seconds → add stream ingestion and real-time pipeline (event stream → stream processing → real-time store or dashboard).
- Batch ETL window too long: when the batch window can’t meet freshness SLOs → add real-time analytics pipeline with windowed aggregations (e.g. 1-min, 5-min counts).
- Need tight cost and performance control: when query cost or latency need guardrails → add scan limits, query quotas, and optimization (e.g. approximate aggregates, sampling).
Components to Add (incrementally)
- Stream ingestion — events flow into a stream (e.g. Kafka; see Streaming example); real-time pipeline consumes and writes to warehouse or a real-time store (e.g. Druid, Pinot); batch ETL may still run for backfill and consistency.
- Real-time analytics pipeline — event stream → stream processor (e.g. Flink, Kafka Streams) → windowed aggregations (e.g. 1-min, 5-min counts) → write to store or push to dashboard/alerting; supports real-time dashboards and alerts.
- Real-time vs batch tiers — real-time tier for low-latency metrics and alerts; batch tier (warehouse) for historical reports and deep analysis; optionally sync batch into same schema for unified query.
- Query optimization and cost control — scan limits per query or per user; approximate aggregation (e.g. HyperLogLog for distinct count) when exact isn’t required; sampling for exploratory queries; quotas and budgets.
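The windowed aggregation at the heart of the real-time pipeline can be sketched in a few lines, independent of any stream processor. This shows tumbling windows only (each event lands in exactly one window); real engines like Flink add sliding windows, late-event handling, and state management on top of the same idea.

```python
from collections import defaultdict

def tumbling_counts(events, window_s=60):
    """Aggregate (timestamp_seconds, event_type) pairs into per-window counts.

    Tumbling windows: event at time t falls in the window starting at
    floor(t / window_s) * window_s, and no other."""
    counts = defaultdict(int)
    for ts, event_type in events:
        window_start = int(ts // window_s) * window_s
        counts[(window_start, event_type)] += 1
    return dict(counts)

events = [(3, "click"), (59, "click"), (61, "view"), (130, "click")]
print(tumbling_counts(events, window_s=60))
# {(0, 'click'): 2, (60, 'view'): 1, (120, 'click'): 1}
```

In a real pipeline the loop body runs per arriving event and each window is flushed to the real-time store (or alerting) once its close time plus an allowed-lateness grace period has passed.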
Advanced Diagram (conceptual)
Stage 3: batch and stream paths merge into dashboards and alerts.
                    +-------------------------+
Events -----------> | Ingest (batch + stream) |
                    +-------------------------+
                        |              |
                        v              v
                   Batch ETL     Event stream (e.g. Kafka)
                        |              |
                        v              v
                   Warehouse     Stream processor
                  (historical)   (windowed agg)
                        |              |
                        v              v
            Reports/Dashboards   Real-time store
                        |              |
                        +------+-------+
                               |
                               v
              Dashboards / Alerts (batch + real-time)

Patterns and Concerns at This Stage
- Consistency between batch and real-time: real-time may be eventually consistent with batch; define SLO for each tier; backfill or reconcile if needed.
- Stream processing: windowing (tumbling, sliding), late events, and exactly-once or at-least-once semantics; see Streaming example for event backbone.
- Cost control: warehouse scan limits, query timeout, per-user or per-team quotas; alert on budget overrun.
- SLO-driven ops: ingest latency, ETL freshness, dashboard load time, real-time pipeline lag; error budgets and on-call.
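The quota side of cost control can be sketched as a per-user scan budget checked before a query runs. The class name, budget unit, and limits here are illustrative; warehouses expose native equivalents (e.g. per-query byte limits), and a real tracker would persist usage rather than hold it in memory.

```python
from collections import defaultdict

class QueryQuota:
    """Minimal per-user scan budget (e.g. bytes scanned per day)."""
    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.used = defaultdict(int)

    def check(self, user, estimated_scan_bytes):
        """Reject a query whose estimated scan would exceed the budget."""
        if self.used[user] + estimated_scan_bytes > self.limit:
            raise PermissionError(f"{user}: scan budget exceeded")
        self.used[user] += estimated_scan_bytes

quota = QueryQuota(limit_bytes=10_000_000_000)  # 10 GB/day per user, for example
quota.check("analyst_a", 4_000_000_000)   # ok
quota.check("analyst_a", 5_000_000_000)   # ok, 9 GB used
# quota.check("analyst_a", 2_000_000_000) would raise PermissionError
```

Pairing this with a hard per-query scan limit and a timeout covers both failure modes: many medium queries eating the budget, and one runaway query eating the cluster.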
Still Avoid (common over-engineering here)
- Multi-region warehouse or active-active analytics before you have latency or DR requirements.
- Exactly-once semantics end-to-end before you have proven duplication issues.
- Separate real-time cluster or complex stream topology before a single pipeline is the bottleneck.
Summarizing the Evolution
MVP delivers analytics with an ingest endpoint, one DB or warehouse, batch ETL or simple inserts, and basic reporting queries. That’s enough to ship dashboards and reports.
As you grow, you add a dedicated warehouse (or columnar store), an ETL/ELT pipeline, and caching for dashboards. You keep analytics separate from OLTP and optimize for read-heavy, time-range queries.
At very high scale, you add stream ingestion and a real-time analytics pipeline (event stream → stream processing → real-time store or dashboards), real-time vs batch tiers, and query optimization with cost control. You cross-reference the Streaming example for event streams and processing; add complexity only where freshness and cost demand it.
This approach gives you:
- Start Simple — ingest, one store, batch ETL, basic queries; ship and learn.
- Scale Intentionally — add warehouse and ETL when volume and separation justify it; add real-time pipeline when freshness and alerts demand it.
- Add Complexity Only When Required — avoid real-time and multi-tier until product and SLOs require them; keep batch path clear and cost-controlled first.
Example: SaaS product analytics
- Stage 1: Postgres (or a single warehouse) + nightly batch, one or a few dashboards.
- Stage 2: When dashboards slow and ad-hoc queries grow, add a dedicated warehouse and ETL (e.g. hourly); add caching for hot dashboards.
- Stage 3: When leadership wants real-time funnels or alerts, add stream ingestion and a real-time pipeline (e.g. Kafka + Flink); keep batch for historical consistency.
Limits and confidence
This approach fits read-heavy, time-range–dominated analytics; adjust if your workload is different (e.g. heavy point lookups or transactional reporting). Use it as a heuristic, not a spec.
What do I do next?
- Capture your requirements using the sections above (functional, quality, entities, access patterns).
- Map your current system to Stage 1, 2, or 3.
- If you’re in Growth or Advanced, pick one trigger that applies and add the corresponding components first.