Search / Discovery — Designed in Stages
You don’t need to design for scale on day one.
Define what you need—full-text search, facets, maybe ranking—then build the simplest thing that works and evolve as document volume and query load grow.
Here we use a search or discovery system as the running example: a corpus of documents, a search index, and queries that return ranked results. The same staged thinking applies to product search, log search, or any system where indexing latency and search freshness are key trade-offs.
Requirements and Constraints (no architecture yet)
Functional Requirements
- Full-text search — users submit a query; system returns documents matching the query (keyword, phrase, or semantic depending on scope).
- Facets — filter or aggregate by attributes (category, date range, author, etc.); often alongside the main result list.
- Ranking — order results by relevance, recency, or other signals; simple (e.g. TF-IDF, BM25) or advanced (learning-to-rank, embeddings) as product demands.
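To make the simple end of that ranking spectrum concrete, here is a minimal BM25 scorer in plain Python (a sketch only — a real engine computes this inside the index; `k1` and `b` are the usual default parameters, and the toy corpus is illustrative):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with Okapi BM25."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        n = sum(1 for d in corpus if term in d)  # docs containing the term
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1)
        f = tf[term]
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

# Toy corpus of pre-tokenized documents
corpus = [["fast", "search", "index"], ["slow", "batch", "job"], ["search", "ranking"]]
score = bm25_score(["search"], corpus[0], corpus)
```

Documents that contain the query term score above zero; documents that don't contribute nothing, which is exactly why BM25 is a reasonable default before reaching for learning-to-rank.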
Quality Requirements
- Indexing latency — how long from document create/update until it is searchable (seconds to minutes for MVP; sub-second for real-time use cases).
- Search freshness — consistency between what is in the primary store (e.g. DB) and what the index reflects; eventual consistency is common.
- Query latency — p95 search response time (e.g. under 200–500 ms for full search; autocomplete typically demands the tighter end of that range).
- Expected scale — document count, index size, queries per second (QPS), update rate.
Key Entities
- Document — the unit of search; has content (title, body, metadata) and an identifier; stored in primary DB or source system and (often denormalized) in the search index.
- Index — the search engine’s representation of documents (inverted index, etc.); optimized for query, not for transactional updates.
Primary Use Cases and Access Patterns
- Index document — write path; add or update a document in the index (sync or async from primary store).
- Search — read path; submit query, get ranked results (and facets); dominant load.
- Delete or hide document — soft delete or filter; must be reflected in index so document no longer appears in results.
Given this, start with the simplest MVP: one API, one primary DB (source of truth), and one search index (e.g. Elasticsearch/OpenSearch) with simple sync or batch indexing. Then evolve with an indexing pipeline, replicas, and caching as volume and freshness requirements grow.
Stage 1 — MVP (simple, correct, not over-engineered)
Goal
Ship working search: documents are indexed (from DB or ingest), users can run full-text queries and get ranked results. One API, one primary store, one search index, minimal moving parts.
Components
- API — REST or GraphQL; auth, search endpoint (query, filters, pagination); optionally document create/update that also triggers index update.
- Primary DB — source of truth for documents (if applicable); stores metadata and content; may be the same system that receives writes from other services.
- Search index — e.g. Elasticsearch, OpenSearch, or similar; stores document content and metadata for full-text and facet queries; single node or small cluster for MVP.
- Simple sync or batch indexing — when a document is created or updated, either (a) sync: the API writes to the DB and updates the index in the same request or immediately after, or (b) batch: a periodic job reads from the DB (or changelog) and (re)indexes into the search engine. Choose based on freshness vs simplicity.
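The two indexing options can be sketched side by side. This is a toy illustration using in-memory dicts as stand-ins for the primary DB and the search index (in practice the index write would be e.g. an OpenSearch `index()` call):

```python
primary_db = {}     # doc_id -> document (source of truth)
search_index = {}   # doc_id -> document (what search queries hit)

def create_document_sync(doc_id, doc):
    """(a) Sync: write to the DB and update the index in the same request.
    Maximally fresh, but index failures surface on the write path."""
    primary_db[doc_id] = doc
    search_index[doc_id] = doc  # stand-in for a real index write

def reindex_batch():
    """(b) Batch: a periodic job re-reads the DB and (re)indexes.
    The write path stays simple; freshness is bounded by the job interval."""
    for doc_id, doc in primary_db.items():
        search_index[doc_id] = doc

# Batch flavor: a write touches only the DB...
primary_db["doc-2"] = {"title": "batch-indexed"}
assert "doc-2" not in search_index   # not searchable yet
reindex_batch()                       # ...until the periodic job runs
assert "doc-2" in search_index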
Minimal Diagram
```
     Client
       |
       v
+-----------------+
|       API       |
+-----------------+
   |           |
   | write doc | search
   v           v
Primary DB    Search Index
 (source)     (e.g. Elasticsearch/OpenSearch)
   |
   | sync or batch
   v
Search Index (indexing path)
```
Patterns and Concerns (don’t overbuild)
- Document model: define fields (title, body, facets); map from primary store to index schema; handle updates (full replace or partial update as supported).
- Query mapping: map user query to engine query (query string, filters, sort); pagination (cursor or offset) and limits to protect the cluster.
- Basic monitoring: index size, indexing lag, search latency, error rate.
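As an illustration of the query-mapping concern, here is a sketch that builds an Elasticsearch/OpenSearch-style request body from user input (field names like `title` and `body` and the boost factor are illustrative assumptions, not a fixed schema):

```python
def build_search_body(text, filters=None, page=0, page_size=20, max_page_size=100):
    """Map a user query + filters to a search-engine request body
    (bool query: full-text match in `must`, exact filters in `filter`)."""
    size = min(page_size, max_page_size)  # cap page size to protect the cluster
    return {
        "query": {
            "bool": {
                "must": [{"multi_match": {"query": text, "fields": ["title^2", "body"]}}],
                "filter": [{"term": {field: value}}
                           for field, value in (filters or {}).items()],
            }
        },
        "from": page * size,  # offset pagination; a cursor (search_after) avoids deep-paging cost
        "size": size,
    }

body = build_search_body("error handling", filters={"category": "docs"}, page=1, page_size=200)
```

Note the cap on `size` and the offset computation: unbounded page sizes and deep offsets are the two easiest ways for clients to hurt the cluster.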
Why This Is a Correct MVP
- One API, one primary store, one search index → clear data flow, easy to reason about; sync or batch keeps indexing simple.
- Vertical scaling (larger index node, more RAM/disk) buys you time before you need indexing pipelines and index replicas.
Stage 2 — Growth Phase (more documents, more QPS, freshness demands)
What Triggers the Growth Phase?
- Indexing can’t keep up (batch window too long, or sync path blocks API).
- Search QPS or index size outgrows a single node; need read scaling and possibly dedicated indexing path.
- Need better freshness (documents searchable within seconds) or higher throughput for index updates.
Components to Add (incrementally)
- Indexing pipeline (queue + workers) — document create/update publishes event or enqueues job; workers consume and update the search index. Decouples write path from indexing; allows retries and backpressure.
- Index replicas — run multiple index nodes (replica shards); search queries hit replicas; indexing writes go to primary shards and replicate. Scales read capacity.
- Caching for hot queries — cache full result sets or top-N for frequent queries (e.g. popular terms, trending); TTL and invalidation (e.g. on index refresh) to balance freshness and load.
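The indexing-pipeline shape above can be sketched with an in-memory queue standing in for SQS/Kafka and a dict standing in for the search index (hypothetical names; a real worker would loop forever and ack/nack messages):

```python
import queue

index_queue = queue.Queue()  # stand-in for a durable queue (SQS, Kafka, ...)
search_index = {}

def enqueue_update(doc_id, doc):
    """Write path: publish an event instead of touching the index directly,
    so the API never blocks on (or fails because of) the index."""
    index_queue.put({"doc_id": doc_id, "doc": doc})

def run_worker(max_retries=3):
    """Worker: consume events and update the index, retrying on failure."""
    while not index_queue.empty():
        event = index_queue.get()
        for attempt in range(max_retries):
            try:
                search_index[event["doc_id"]] = event["doc"]  # stand-in for an index write
                break
            except Exception:
                if attempt == max_retries - 1:
                    index_queue.put(event)  # or route to a dead-letter queue

enqueue_update("doc-1", {"title": "hello"})
run_worker()
```

The decoupling is the point: backpressure shows up as queue depth (which you monitor) rather than as API latency.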
Growth Diagram
```
                    +------------------+
Clients ----------> |  Load Balancer   |
                    +------------------+
                             |
                             v
                    +------------------+
                    |       API        |
                    +------------------+
                      |            |
                      | write doc  | search
                      v            v
                 Primary DB    Query Cache (hot queries)
                      |            |
                      v            |
               Indexing Queue      |
                      |            |
                      v            v
                  Workers ----> Search Index (primary + replicas)
                                   |
                                   v
                             Replicas (read scaling)
```
Patterns and Concerns to Introduce (practical scaling)
- Eventual consistency: index may lag behind primary; define SLO for indexing latency (e.g. < 1 min) and surface in product if needed (e.g. “recent items may take a moment to appear”).
- Idempotent indexing: workers should handle duplicate events (same doc id, same version) without corrupting index; use document id and optionally version or timestamp.
- Monitoring: indexing queue depth, worker lag, replica lag, cache hit rate, search p95 latency.
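Idempotent indexing usually comes down to a version check before applying an event. A minimal sketch (assuming each event carries a monotonically increasing version per document, e.g. from the primary store):

```python
search_index = {}  # doc_id -> {"version": int, "doc": ...}

def apply_event(doc_id, version, doc):
    """Apply an indexing event only if it is newer than what the index holds.
    Replayed or out-of-order events become no-ops instead of corrupting state."""
    current = search_index.get(doc_id)
    if current is not None and current["version"] >= version:
        return False  # stale or duplicate event: skip
    search_index[doc_id] = {"version": version, "doc": doc}
    return True
```

Engines like Elasticsearch/OpenSearch support versioned writes natively, so the check can also live in the index rather than in worker code.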
Still Avoid (common over-engineering here)
- Real-time indexing (sub-second) before product needs it.
- Multiple search clusters or complex routing until you have clear multi-tenant or multi-region requirements.
- Custom ranking pipeline (ML, embeddings) before you have relevance problems that justify it.
Stage 3 — Advanced Scale (very large index, real-time, multi-tenant)
What Triggers Advanced Scale?
- Single index cluster hits storage or throughput limits; need to shard or split.
- Product requires near real-time search (new documents searchable within seconds).
- Multiple tenants or use cases (e.g. product search vs log search) need separate scaling or isolation.
Components (common advanced additions)
- Index sharding — partition index by tenant, document type, or key range; route queries to relevant shards; allows horizontal scaling of index size and write throughput.
- Real-time indexing path — low-latency path from write to index (e.g. Kafka → indexer, or sync write to index with async durability); separate from batch backfill or reconciliation.
- Separate search cluster scaling — search cluster (or clusters) scaled independently from primary DB; possibly dedicated clusters per tenant or region; query routing and aggregation across shards or clusters.
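Shard routing by key is often just a stable hash over the shard key. A sketch (shard names and the choice of `tenant_id` as the key are illustrative; a production system would use consistent hashing or an explicit tenant-to-shard mapping to allow rebalancing):

```python
import hashlib

SHARDS = ["shard-a", "shard-b", "shard-c"]

def shard_for(tenant_id):
    """Route a write or a tenant-scoped query to one shard.
    sha256 keeps the assignment stable across processes and restarts
    (unlike Python's built-in hash(), which is salted per process)."""
    h = int(hashlib.sha256(tenant_id.encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]
```

Writes and tenant-scoped queries touch exactly one shard; only queries that span tenants need the cross-shard aggregation discussed below.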
Advanced Diagram (conceptual)
```
                    +------------------+
Clients ----------> |  API / Query GW  |
                    +------------------+
                             |
           +-----------------+-----------------+
           v                 v                 v
  Shard A (e.g. tenant 1)  Shard B           Shard C
           |                 |                 |
           v                 v                 v
     Search Index       Search Index      Search Index
  (primary+replicas) (primary+replicas) (primary+replicas)
           ^                 ^                 ^
           |                 |                 |
      Indexing pipeline (queue + workers; real-time + batch)
                             ^
                             |
                  Primary DB / Event stream
```
Patterns and Concerns at This Stage
- Shard routing: route index writes and search queries by shard key (tenant_id, doc_type, etc.); aggregate results from multiple shards when query spans shards.
- Real-time vs batch: real-time path for freshness; batch or backfill for historical data or repair; avoid duplicate indexing with idempotency or versioning.
- Query aggregation: when query hits multiple shards or clusters, aggregate and re-rank results; respect limits and timeouts.
- SLO-driven ops: indexing latency, search latency, index freshness; error budgets and on-call.
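The query-aggregation pattern is classic scatter-gather: fan out, collect per-shard top results, merge, re-rank globally, truncate. A sketch (each `shard_client` is a hypothetical callable returning `(doc_id, score)` pairs sorted by local score; real code would also apply per-shard timeouts):

```python
def scatter_gather(query, shard_clients, limit=10):
    """Fan a query out to every shard, then merge and re-rank by score.
    Each shard only needs to return its own top-`limit` hits, because
    no document outside a shard's local top-`limit` can make the global cut."""
    merged = []
    for search_shard in shard_clients:
        merged.extend(search_shard(query, limit))
    merged.sort(key=lambda hit: hit[1], reverse=True)  # global re-rank by score
    return merged[:limit]

# Toy shard clients standing in for per-shard search calls
shard_a = lambda q, n: [("a", 0.9), ("b", 0.5)][:n]
shard_b = lambda q, n: [("c", 0.7)][:n]
top = scatter_gather("example", [shard_a, shard_b], limit=2)
```

Note the assumption baked into the comment: scores must be comparable across shards (e.g. same analyzer and similarity settings), otherwise the global re-rank is meaningless.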
Summarizing the Evolution
MVP delivers search with one API, one primary store, and one search index (e.g. Elasticsearch/OpenSearch), with simple sync or batch indexing. That’s enough to ship and learn.
As you grow, you add an indexing pipeline (queue + workers) so the write path doesn’t block and indexing can scale, index replicas for read scaling, and caching for hot queries. You accept eventual consistency between primary and index and define freshness SLOs.
At very high scale, you introduce index sharding, a real-time indexing path for low-latency visibility, and separate search cluster scaling (and possibly multi-tenant or multi-region). You keep the document and query model clear and add complexity only where size, freshness, or isolation require it.
This approach gives you:
- Start Simple — one API, one primary store, one index, sync or batch indexing; ship and learn.
- Scale Intentionally — add indexing pipeline and replicas when throughput or QPS justify it; add sharding and real-time path when size or freshness demand it.
- Add Complexity Only When Required — avoid multi-cluster and custom ranking until bottlenecks and product needs are clear.