Webhooks (Outbound Delivery) — Designed in Stages
You don’t need to design for scale on day one.
Define what you need—register subscriptions (event types, URL, secret), publish events, and deliver HTTP POST to each subscriber with retries on failure—then build the simplest thing that works and evolve as subscriber count and delivery guarantees grow.
Here we use webhooks (OpenAI-, Stripe-style outbound delivery) as the running example: subscriptions, events, delivery attempts, and endpoints. The same staged thinking applies to any system that pushes events to external URLs: at-least-once delivery, retry with backoff, idempotency (optional), and security (signing, verification) are central.
Requirements and Constraints (no architecture yet)
Functional Requirements
- Register subscription — subscriber registers endpoint URL and optional secret; subscribes to one or more event types; store for delivery; optional update or delete subscription.
- Publish event — when something happens (e.g. payment completed, run finished), system publishes an event; event has type, payload, and optional id; all subscribers to that event type should receive it.
- Deliver — for each subscription matching the event type, send HTTP POST to subscriber’s URL with event payload; subscriber returns 2xx to acknowledge; non-2xx or timeout triggers retry.
- Retry on failure — if delivery fails (timeout, 4xx/5xx), retry with backoff; eventually give up and optionally move to dead-letter; at-least-once semantics (subscriber may receive duplicate; idempotency on their side or via id in payload).
Quality Requirements
- At-least-once delivery — every event should be delivered at least once to each subscriber; retries until success or max attempts; duplicates possible; subscriber should handle idempotency.
- Retry with backoff — exponential (or similar) backoff between retries to avoid hammering failing endpoints; configurable max attempts and backoff.
- Idempotency (optional) — include event id (or delivery id) in payload so subscriber can dedupe; or support idempotency key header so retries don’t cause duplicate side effects.
- Security — sign payload (e.g. HMAC with secret) so subscriber can verify request came from you; optional verification handshake (e.g. challenge/response when registering URL).
- Expected scale — pin down events per second, number of subscriptions, endpoints per event type, and the delivery latency SLO before choosing an architecture.
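The at-least-once and idempotency requirements above imply a simple subscriber-side pattern: dedupe by event id. A minimal sketch, assuming the payload carries an `event_id` field; the in-memory set stands in for a real processed-events table:

```python
# Subscriber-side idempotency under at-least-once delivery:
# remember which event_ids were already handled and skip duplicates.
processed_ids = set()

def handle_webhook(payload: dict) -> str:
    """Process a delivery once; ignore retries of the same event."""
    event_id = payload["event_id"]
    if event_id in processed_ids:
        return "duplicate"   # already handled; still safe to ack with 2xx
    processed_ids.add(event_id)
    # ... real side effects go here (update DB, send email, ...)
    return "processed"
```

Returning 2xx for duplicates matters: a non-2xx would only trigger yet another retry.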
Key Entities
- Subscription — subscriber’s registration; subscription_id, event_type(s), endpoint URL, secret (for signing), optional status (active, disabled); created_at.
- Event — something that happened; event_id, event_type, payload (JSON), created_at; immutable; published to all matching subscriptions.
- Delivery attempt — one try to deliver an event to one endpoint; attempt_id, event_id, subscription_id, attempt_number, status (pending, success, failed), response_code, attempted_at; used for retry and audit.
- Endpoint — the URL (and optional secret) from subscription; receives HTTP POST; must be HTTPS in production.
Primary Use Cases and Access Patterns
- Register subscription — write path; store (event_type, url, secret); validate URL (optional handshake); index by event_type for fan-out.
- Publish event — write path; create event record; find all subscriptions for event_type; for each, create delivery attempt(s) or enqueue; return ack to publisher (event accepted).
- Deliver — read + write; worker or process picks attempt; HTTP POST to URL with payload (and signature); record response; on failure, schedule retry with backoff; on success, mark attempt success.
- Retry — read path; find attempts that failed and are due for retry (backoff elapsed); re-attempt; increment attempt number; cap max attempts then move to DLQ or mark failed.
Given this, start with the simplest MVP: API + DB, store subscriptions, on event sync HTTP POST to each subscriber, store delivery status, simple retry (e.g. 3 attempts)—then add async delivery via queue, exponential backoff, dead-letter queue, idempotency and signature, and observability as scale and reliability demands grow.
Stage 1 — MVP (simple, correct, not over-engineered)
Goal
Ship working webhooks: subscribers register URL and event type(s), events are published, and each subscriber receives an HTTP POST; failures are retried a fixed number of times; delivery status stored. One API, one DB; sync delivery (blocking) or simple background job; single region.
Components
- API — REST or similar; auth (API key); create subscription (event_type, url, optional secret); list/update/delete subscription; internal or admin: publish event (event_type, payload). Publish may be in-process (your app calls “publish” when something happens) or via API.
- DB — subscriptions (id, event_type, url, secret, status); events (id, event_type, payload, created_at); delivery_attempts (id, event_id, subscription_id, attempt_number, status, response_code, attempted_at). Index subscriptions by event_type; index attempts by status and event_id.
- On event, sync HTTP POST — when event is published: load all subscriptions for event_type (active); for each, HTTP POST to url with payload (JSON); record attempt (success or failure, response code); if failure, retry immediately or in same loop up to 3 times with short delay. Sync means publisher may block until all deliveries are attempted; acceptable for few subscribers.
- Store delivery status — every attempt recorded in delivery_attempts; success = 2xx; failure = other or timeout; used for debugging and simple retry (e.g. cron that retries failed attempts).
- Simple retry — retry failed attempts: e.g. 3 attempts total with 1s delay between; then mark failed; optional cron to retry “failed” attempts once more later.
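The Stage 1 delivery path can be sketched as one loop. This is a hedged sketch, not a full implementation: the `post` callable is injected so the loop is testable without a network (in practice it would wrap `requests.post` with a 5-10 s timeout), and all field names are illustrative:

```python
import time

MAX_ATTEMPTS = 3
RETRY_DELAY_S = 1.0

def deliver_event(event: dict, subscriptions: list[dict], post,
                  attempts: list[dict], delay_s: float = RETRY_DELAY_S) -> None:
    """Sync MVP delivery: POST to each active matching subscription,
    record every attempt, retry up to MAX_ATTEMPTS with a short delay."""
    for sub in subscriptions:
        if sub["status"] != "active" or sub["event_type"] != event["event_type"]:
            continue
        for n in range(1, MAX_ATTEMPTS + 1):
            try:
                code = post(sub["url"], event["payload"])  # returns HTTP status
            except Exception:
                code = None                                 # timeout / connection error
            ok = code is not None and 200 <= code < 300
            attempts.append({
                "event_id": event["event_id"],
                "subscription_id": sub["subscription_id"],
                "attempt_number": n,
                "status": "success" if ok else "failed",
                "response_code": code,
            })
            if ok:
                break
            if n < MAX_ATTEMPTS:
                time.sleep(delay_s)
```

Because the loop runs in the publish path, a slow endpoint delays everything after it; that is exactly the pressure that pushes you to Stage 2's queue.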
Minimal Diagram
Publisher (your app)
        |
        v
  Publish event
        |
        v
+-----------------+
|  API / Worker   |
+-----------------+
        |
        v
DB (subscriptions, events, attempts)
        |
        v
For each subscription: HTTP POST to URL
        |
        v
Record attempt (success / fail)
        |
        v
Simple retry (e.g. 3 attempts)
Patterns and Concerns (don’t overbuild)
- Timeout: set HTTP timeout (e.g. 5–10 s) so slow endpoints don’t block; treat timeout as failure and retry.
- Payload: include event_id and event_type in body so subscriber can log and dedupe; optional timestamp.
- Basic monitoring: delivery success rate, failure rate by endpoint, retry count, latency.
Why This Is a Correct MVP
- API + DB, subscriptions, sync (or simple job) delivery, store attempts, simple retry → enough to ship webhooks for a small number of subscribers; easy to reason about.
- Sync delivery and few subscribers buy you time before you need a queue and async workers.
Stage 2 — Growth Phase (async queue, backoff, DLQ, signing)
What Triggers the Growth Phase?
- Many subscribers or high event rate; sync delivery blocks publisher or takes too long; need async delivery (event → queue → workers).
- Retries need exponential backoff so failing endpoints aren’t hammered; need scheduled retry (delay between attempts).
- Permanent failures (e.g. 4xx, or max retries exceeded) should go to dead-letter queue (DLQ) for inspection and manual replay.
- Subscribers need to verify request authenticity; sign payload (e.g. HMAC) and optionally include idempotency key or event id for dedupe.
Components to Add (incrementally)
- Async delivery via queue — when event is published: write event to DB; enqueue one message per (event_id, subscription_id) to a queue (e.g. SQS, RabbitMQ, Kafka); return ack to publisher immediately. Workers consume from queue; each message = one delivery attempt; HTTP POST to URL; record result; on failure, re-enqueue with delay (or use queue’s delay) for retry.
- Retry with exponential backoff — retry after 1 min, then 5 min, then 30 min (or similar); use queue delay or scheduler; max attempts (e.g. 5); after max, move to DLQ or mark permanently failed.
- Dead-letter queue — after max retries, move message to DLQ; store event and subscription info; ops can inspect, fix endpoint, and replay; or alert for manual handling.
- Idempotency key or event id in payload — include event_id (and optionally delivery_id) in POST body so subscriber can dedupe (e.g. store processed event_ids and ignore duplicate); optional header X-Idempotency-Key.
- Signature (e.g. HMAC) for verification — compute HMAC-SHA256 of body using subscription secret; send in header (e.g. X-Webhook-Signature); subscriber verifies before processing; prevents spoofing and ensures authenticity.
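HMAC signing and verification fit in a few lines of Python's standard library. The header name `X-Webhook-Signature` and the hex encoding are conventions assumed here, not a standard; real providers typically also sign a timestamp to limit replay:

```python
import hashlib
import hmac

def sign(body: bytes, secret: str) -> str:
    """HMAC-SHA256 of the raw request body, hex-encoded for the header."""
    return hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()

def verify(body: bytes, secret: str, signature: str) -> bool:
    """Subscriber side: recompute and compare in constant time."""
    expected = sign(body, secret)
    return hmac.compare_digest(expected, signature)
```

Two details matter in practice: sign the exact bytes you send (re-serializing JSON on the subscriber side can change the bytes and break verification), and always use a constant-time comparison like `hmac.compare_digest`.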
Growth Diagram
Publisher
   |
   v
Publish event → DB + enqueue (event_id, subscription_id) per subscriber
   |
   v
Queue (per delivery or per event)
   |
   v
Workers: consume → HTTP POST (signed) → record attempt
   |                    |
   v                    v
Success              Failure → re-enqueue with delay (backoff)
   |                    |
   v                    v
Done                 Max retries → DLQ
   |
   v
Subscriber verifies signature, processes (idempotent by event_id)
Patterns and Concerns to Introduce (practical scaling)
- Fan-out: one event → N subscriptions → N queue messages (or one message with list of endpoints); workers scale independently.
- Backoff: use queue delay or separate “retry at” timestamp; exponential: delay = base * 2^attempt_number; cap max delay.
- Monitoring: delivery latency (p50, p95), success rate per endpoint and globally, DLQ depth, retry rate.
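The backoff formula above (delay = base * 2^attempt_number, capped) is one line of code; the constants here are illustrative and should match your queue's delay limits:

```python
def backoff_delay(attempt_number: int,
                  base_s: float = 60.0,
                  max_s: float = 1800.0) -> float:
    """Seconds to wait before the given retry (0-indexed):
    exponential growth from base_s, capped at max_s."""
    return min(base_s * (2 ** attempt_number), max_s)
```

With these defaults the schedule is 1 min, 2 min, 4 min, ... capped at 30 min; a fixed ladder like 1/5/30 min works just as well, and production systems usually add jitter so retries from many deliveries don't land at the same instant.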
Still Avoid (common over-engineering here)
- Ordering guarantees (per key) and rate limiting per endpoint until product demands them.
- Complex verification handshake (e.g. challenge on register) unless required; HMAC in delivery is often enough.
Stage 3 — Advanced Scale (rate limiting, fan-out, ordering, observability)
What Triggers Advanced Scale?
- Many subscribers per event type; fan-out at scale (thousands of endpoints); need efficient batching or partitioning so queue and workers don’t become bottleneck.
- Subscriber endpoints may be slow or rate-limited; need rate limiting per endpoint (or per subscriber) so one bad endpoint doesn’t block others.
- Ordering: some use cases need events delivered in order per partition key (e.g. per user_id); use ordered queue or partition key.
- Observability: delivery latency, success rate per endpoint, alert on DLQ growth and failing endpoints; dashboard and runbooks.
Components (common advanced additions)
- Rate limiting per endpoint — limit concurrent in-flight requests per endpoint (or per subscription); or limit requests per second per endpoint; use token bucket or semaphore; avoid overwhelming subscriber and avoid one slow endpoint consuming all workers.
- Fan-out at scale — partition queue by event_type or subscription_id; many workers consume; optional batching (e.g. same endpoint gets multiple events in one POST) if subscriber supports it; otherwise one POST per event per subscription.
- Ordering guarantees (per key) — if events must be delivered in order per key (e.g. user_id): use partition key in queue (e.g. Kafka partition by key); single consumer per partition so order preserved; document “at-least-once with order per key.”
- Observability — metrics: delivery latency (event published to 2xx received), success rate, retry rate, DLQ depth; logs: event_id, subscription_id, attempt, response code; alert on DLQ growth, success rate drop; dashboard for ops and support.
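The per-endpoint token bucket mentioned above can be sketched as follows. Time is passed in explicitly so the logic is testable; rate and burst values are illustrative:

```python
class TokenBucket:
    """Per-endpoint limiter: `rate` tokens per second, capacity `burst`."""

    def __init__(self, rate: float, burst: float, now: float = 0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = burst     # start full
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill based on elapsed time, then try to take one token.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# One bucket per endpoint URL; denied deliveries are delayed, not dropped.
buckets: dict[str, TokenBucket] = {}

def allow_delivery(url: str, now: float,
                   rate: float = 5.0, burst: float = 10.0) -> bool:
    bucket = buckets.setdefault(url, TokenBucket(rate, burst, now))
    return bucket.allow(now)
```

A delivery that is denied goes back on the queue with a small delay rather than being dropped; combined with a cap on in-flight requests per endpoint, this keeps one slow subscriber from consuming every worker.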
Advanced Diagram (conceptual)
Publisher
   |
   v
Publish → DB + enqueue (partitioned by event_type or key)
   |
   v
Queue (partitioned for ordering if needed)
   |
   v
Workers (scale out)
   |  Rate limit per endpoint
   v
HTTP POST (signed, idempotency id)
   |
   v
Record attempt → Metrics & logs
   |
   v
DLQ (after max retries) → Alert & replay
   |
   v
Observability (latency, success rate, DLQ)
Patterns and Concerns at This Stage
- Per-endpoint limits: track in-flight and RPS per endpoint; reject or delay new attempts when limit reached; backpressure.
- Replay from DLQ: support “replay” action (re-enqueue from DLQ) after subscriber fixes endpoint; idempotency so replay doesn’t double-process.
- SLO-driven ops: delivery latency SLO (e.g. p95 < 30 s), success rate SLO; error budgets and on-call; runbooks for DLQ and failing endpoints.
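The DLQ replay action above reduces to "move each dead-lettered message back to the live queue with its attempt counter reset"; subscriber-side idempotency by event_id makes the replay safe even if a delivery had actually succeeded. A sketch with plain lists standing in for real queues:

```python
def replay_dlq(dlq: list[dict], queue: list[dict]) -> int:
    """Re-enqueue every dead-lettered delivery; return how many were replayed."""
    replayed = 0
    while dlq:
        msg = dlq.pop()
        msg["attempt_number"] = 0   # restart the backoff schedule
        queue.append(msg)
        replayed += 1
    return replayed
```

In practice this is triggered by an ops runbook after the subscriber confirms the endpoint is fixed, and it is often scoped to one subscription rather than the whole DLQ.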
Summarizing the Evolution
MVP delivers webhooks with API + DB, subscriptions stored, sync (or simple job) HTTP POST to each subscriber, delivery status stored, and simple retry (e.g. 3 attempts). That’s enough to push events to a few endpoints.
As you grow, you add async delivery via queue, retry with exponential backoff, dead-letter queue for permanent failures, idempotency (event id in payload), and signature (HMAC) for verification. You decouple publish from delivery and improve reliability.
At advanced scale, you add rate limiting per endpoint, fan-out at scale with partitioning, optional ordering per key, and observability (latency, success rate, DLQ, alerts). You scale subscribers and delivery without over-building on day one.
This approach gives you:
- Start Simple — API + DB, sync delivery, store attempts, simple retry; ship and learn.
- Scale Intentionally — add queue and async workers when publish path must not block; add backoff and DLQ when reliability demands it.
- Add Complexity Only When Required — avoid rate limiting and ordering until subscriber mix and product justify them; keep at-least-once and signing first.