
IoT / Sensor Ingestion — Designed in Stages

First published by Atif Alam

You don’t need to design for scale on day one.

Define what you need—ingest telemetry from devices or sensors, store by device and time, query by device and time range, and optionally evaluate rules and trigger alerts—then build the simplest thing that works and evolve as event volume and alert latency requirements grow.

Here we use IoT or sensor ingestion (temperature sensors, fitness wearables, industrial telemetry) as the running example: devices/sensors, telemetry events, and optional alerts. The same staged thinking applies to any high-volume time-series ingest: throughput (millions of events per second), durability, low latency for alerts, and cost (storage, retention) are central.

Requirements and Constraints (no architecture yet)


Functional Requirements

  • Ingest events — accept telemetry from devices (temperature, heart rate, pressure, etc.); high volume, small payloads per event; batch or single-event API; support many devices.
  • Store by device and time — persist each event with device_id (or sensor_id), timestamp, and optional dimensions (location, type); queryable by device and time range.
  • Query by device and time range — read path; “all events for device X in last hour” or “aggregate by hour for device X last 7 days”; support dashboards and ad-hoc analysis.
  • Rules and alerts (optional) — evaluate rules on incoming data (e.g. temperature > threshold, anomaly); trigger alert (notification, webhook, ticket); low latency for critical alerts.

Quality Requirements

  • Throughput — ingest must handle high event rate (thousands to millions of events per second at scale); small payloads; buffering and batching help.
  • Durability — events should not be lost once accepted; write path should be durable (WAL, replication, or ack after persist); at-least-once or exactly-once semantics.
  • Low latency for alerts — time from event to alert should be short when rules fire; sync evaluation or stream processing; avoid blocking ingest on alert path.
  • Cost (storage, retention) — raw events grow quickly; retention policy (e.g. 90 days raw, 1 year downsampled); compression and tiered storage; cost control.
  • Expected scale — events per second, number of devices, retention period, query QPS, number of rules.

Key Entities

  • Device / Sensor — source of telemetry; device_id, optional type, location, or metadata; registered or implicit (first event creates).
  • Telemetry event — one reading; device_id, timestamp, optional dimensions (sensor_type, location), metrics (value, or key-value); immutable once stored.
  • Alert (optional) — triggered when a rule matches; alert_id, rule_id, device_id, timestamp, payload; optional state (fired, acknowledged, resolved).

Primary Use Cases and Access Patterns

  • Ingest — write path; devices POST events (single or batch); validate schema; store with device_id and timestamp; return ack or 202; idempotent by (device_id, request_id) if needed.
  • Query by device and time — read path; filter by device_id and time range; scan or index; return events or aggregates; dominant read pattern for dashboards.
  • Rules and alerts — read + write; on ingest (sync) or via stream (async): evaluate rule (e.g. value > 80); if match, create alert and notify; decouple from ingest so ingest doesn’t block.
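The ingest write path above can be sketched as a minimal in-memory example: validate the schema, dedupe retries by (device_id, request_id), then store. All names here are illustrative; a real service sits behind an HTTP handler and a durable store.

```python
# In-memory sketch of the ingest write path: validate, dedupe, store.
STORE = []     # stored events (stand-in for the real store)
SEEN = set()   # (device_id, request_id) pairs already accepted

def ingest(event):
    """Return (status_code, body): 202 accepted, 200 duplicate, 400 invalid."""
    if "device_id" not in event or "timestamp" not in event:
        return 400, {"error": "device_id and timestamp are required"}
    rid = event.get("request_id")
    if rid is not None:
        key = (event["device_id"], rid)
        if key in SEEN:
            return 200, {"status": "duplicate"}   # idempotent retry: ack, don't re-store
        SEEN.add(key)
    STORE.append(event)
    return 202, {"status": "accepted"}
```

Returning 200 rather than an error on a retried request_id is what makes device retries safe: the device gets an ack either way, and the event lands exactly once.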

Given this, start with the simplest MVP: an API or ingest endpoint, a DB or time-series store, devices POST events stored by device_id and timestamp, simple query by device and range, and optional threshold alert (sync or cron)—then add buffered ingest (batch or stream), time-series or columnar store, rule engine, downsampling and retention as volume grows.

Stage 1 — MVP (simple, correct, not over-engineered)


Goal

Ship working ingest: devices send events, events are stored by device and timestamp, and you can query by device and time range. Optional: simple threshold alert (e.g. temperature > 90) evaluated synchronously or via cron. One ingest endpoint, one store; single region.

Components

  • API or ingest endpoint — REST or similar; auth (API key or device credential); POST events (single or batch: device_id, timestamp, metrics); validate schema; write to store; return 200 or 202. Rate limit per device or key to protect system.
  • DB or time-series store — store events; schema: device_id, timestamp, metrics (value or JSON); index or partition by (device_id, timestamp) for range queries. Can be relational (table with index) or dedicated time-series DB (InfluxDB, TimescaleDB, etc.); choose by volume and query shape.
  • Simple query — GET or query API: by device_id and time range (start, end); return events; paginate if large; optional aggregation (count, avg) in query or application.
  • Optional threshold alert — (a) Sync: on ingest, check value against threshold; if over, create alert row and trigger notification (email, webhook). Or (b) Cron: every N minutes, query “recent events where value > threshold”, create alerts for new violations. Sync is simpler but adds latency to ingest; cron decouples the two but adds delay.
  • Single region — one ingest cluster and one store; vertical scaling for MVP.
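As a concrete sketch of the MVP store, here is the (device_id, timestamp) schema and range query, using SQLite as a stand-in for whatever relational or time-series store you pick; table and column names are illustrative.

```python
import sqlite3

# MVP store sketch: SQLite stands in for any relational store. Events are
# indexed by (device_id, ts) so a range query scans one device's time slice.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        device_id TEXT    NOT NULL,
        ts        INTEGER NOT NULL,   -- epoch seconds (document whose clock)
        value     REAL    NOT NULL
    )""")
conn.execute("CREATE INDEX idx_device_ts ON events (device_id, ts)")

def write_event(device_id, ts, value):
    conn.execute("INSERT INTO events VALUES (?, ?, ?)", (device_id, ts, value))

def query_range(device_id, start, end):
    """All events for one device in [start, end), ordered by time."""
    cur = conn.execute(
        "SELECT ts, value FROM events"
        " WHERE device_id = ? AND ts >= ? AND ts < ? ORDER BY ts",
        (device_id, start, end))
    return cur.fetchall()
```

The composite index is the whole point: “all events for device X in a time range” becomes an index range scan rather than a full table scan, which is exactly the dominant dashboard read pattern.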

Minimal Diagram

Devices / Sensors
        |
        v
+-----------------+
|   Ingest API    |
+-----------------+
    |           |
    v           v
DB / Time-series store          Optional: threshold check (sync)
(device_id, timestamp, metrics)   → alert + notify
    |
    v
Query API (by device, time range)
    |
    v
Optional: cron job (threshold alert)

Patterns and Concerns (don’t overbuild)

  • Schema: define event schema (device_id, timestamp, required fields); reject invalid payloads; optional idempotency key to dedupe retries.
  • Time order: use server timestamp or device timestamp (document which); order by timestamp for range queries; handle clock skew if using device time.
  • Basic monitoring: ingest rate, write latency, query latency, alert fires, error rate.
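The timestamp policy above can be sketched as a small helper: trust the device clock unless it is too far from server time, then fall back to server time. The 5-minute skew tolerance is an illustrative assumption, not a recommendation.

```python
import time

# Sketch of a timestamp policy with clock-skew handling. The tolerance below
# is a hypothetical value; tune it to your devices' clock quality.
MAX_SKEW_SECONDS = 300

def effective_timestamp(device_ts, server_ts=None):
    """Return (timestamp_to_store, corrected?) for one incoming event."""
    if server_ts is None:
        server_ts = int(time.time())
    if device_ts is None or abs(device_ts - server_ts) > MAX_SKEW_SECONDS:
        return server_ts, True       # missing or badly skewed device clock
    return device_ts, False
```

Recording the corrected flag alongside the event keeps the decision auditable; whichever policy you choose, document it, because queries by time range depend on it.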

Why This Is a Correct MVP

  • One ingest endpoint, one store (DB or time-series), events by device and time, query by device and range, optional threshold alert → enough to ship a small IoT or sensor product; easy to reason about.
  • Vertical scaling and single store buy you time before you need buffered ingest, stream processing, and tiered retention.

Stage 2 — Growth Phase (buffered ingest, time-series store, rule engine)


What Triggers the Growth Phase?

  • Ingest rate grows; single DB or sync write doesn’t scale (latency or throughput); need buffered ingest (batch or stream) before write.
  • Query load grows or storage shape changes; need a dedicated time-series DB or columnar store (compression, efficient range scans).
  • Many rules or complex conditions; need rule engine (threshold, anomaly, multi-signal); decouple from ingest so alerts don’t block ingest.
  • Storage cost; need downsampling (e.g. keep 1-min raw, aggregate to 1-hour for older data) and retention policy (drop or archive after N days).

Components to Add (incrementally)

  • Buffered ingest (batch or stream) — events land in buffer first: message queue (e.g. Kafka, SQS) or batch endpoint (client batches, server flushes to store); workers or stream consumer read from buffer and write to store; ingest API returns ack quickly; durability = buffer + store.
  • Time-series DB or columnar store — dedicated store optimized for (device_id, timestamp) and range scans; compression; optional retention and downsampling built-in (e.g. InfluxDB, TimescaleDB, ClickHouse); or data lake with columnar format (Parquet) and query engine.
  • Rule engine for alerts — rules: condition (e.g. temperature > 80 for 5 min, or anomaly score > threshold); action: create alert, notify. Evaluate in stream consumer (on each event or window) or separate job that reads recent data; avoid evaluating in ingest API so ingest stays fast.
  • Downsampling and retention — raw events: retain 7–90 days; aggregate to 1-hour or 1-day for older data; drop or archive beyond retention; reduce storage and query cost.

Growth Diagram

Devices
    |
    v
+-----------------+
|   Ingest API    |
+-----------------+
    |
    v
Buffer (queue or batch)
    |
    v
Consumer / worker
    |              |
    v              v
Time-series store       Rule engine (threshold, anomaly)
    |                       |
    v                       v
Downsample + retention    Alerts → notify
    |
    v
Query API (by device, range, optional agg)

Patterns and Concerns to Introduce (practical scaling)

  • At-least-once: buffer + consumer may deliver duplicates; dedupe by (device_id, request_id) or idempotent write; or accept at-least-once and dedupe at query time for alerts.
  • Backpressure: if consumer falls behind, buffer grows; alert on lag; scale consumers or increase partition count; reject or throttle ingest if buffer full (circuit breaker).
  • Monitoring: ingest rate, buffer lag, write latency, rule evaluation latency, alert delivery success, storage growth.
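The buffer-full throttling above can be sketched with a bounded queue (the queue size and function names are illustrative):

```python
import queue

# Backpressure sketch: a tiny bounded buffer where a full queue makes ingest
# return 429 (throttle) instead of blocking or dropping silently.
buf = queue.Queue(maxsize=3)

def ingest_event(event):
    try:
        buf.put_nowait(event)
        return 202          # accepted into the buffer
    except queue.Full:
        return 429          # buffer full: ask the device to back off and retry
```

Rejecting at the edge with a retryable status keeps the buffer bounded and makes the overload visible to devices, instead of letting lag grow unbounded behind a slow consumer.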

Still Avoid (common over-engineering here)

  • Stream processing (Flink, etc.) and real-time complex rules until volume and latency requirements justify it.
  • Multi-tenant isolation and per-tenant retention until you have paying tenants with different SLAs.
  • Full data lake and batch analytics in the same pipeline; keep ingest and query path clear first.

Stage 3 — Advanced Scale (stream processing, real-time alerts, cost control)


What Triggers Advanced Scale?

  • Very high event rate; need stream processing (Kafka + Flink, Kafka Streams, etc.) for ingestion and rule evaluation; windowed aggregations and complex rules.
  • Real-time alerts: sub-second or few-second latency from event to alert; stream processing evaluates rules in pipeline; no batch or cron.
  • Multi-tenant: many customers or teams; isolation (per-tenant queues, stores, or namespaces); per-tenant retention and cost.
  • Cost control: tiered retention (hot/warm/cold), compression, scan limits, and quotas; stay within budget at scale.

Components (common advanced additions)

  • Stream processing — events flow into Kafka (or similar); stream job (e.g. Flink) consumes, optionally aggregates (tumbling/sliding window), writes to time-series store or downstream; rule engine can be inside stream job (e.g. filter + window + threshold) or separate consumer; exactly-once or at-least-once semantics.
  • Real-time alerts — rules evaluated in stream; on match, emit alert event or call notification service; low latency (seconds); optional dedupe (same rule + device + window fire once).
  • Multi-tenant — tenant_id or org_id on events and in store; separate topics or partitions per tenant; quota per tenant (ingest rate, storage); retention and rules per tenant.
  • Scale and cost control — tiered retention: hot (recent, fast query), warm (compressed, slower), cold (archive); scan limits and query timeout to avoid runaway queries; cost attribution and budgets; alert on overage.
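The per-(rule, device, window) alert dedupe above can be sketched as follows, assuming a simple tumbling window and threshold rule; the 60-second window and the threshold default are illustrative assumptions.

```python
# In-stream rule evaluation with dedupe: a threshold rule fires at most once
# per (rule_id, device_id, tumbling window). Window size is an assumption.
WINDOW_SECONDS = 60
fired = set()    # (rule_id, device_id, window_start) keys that already alerted
alerts = []

def evaluate(rule_id, device_id, ts, value, threshold=80.0):
    """Return True if the rule fires (and an alert is recorded)."""
    if value <= threshold:
        return False
    window_start = ts - (ts % WINDOW_SECONDS)    # tumbling-window bucket
    key = (rule_id, device_id, window_start)
    if key in fired:
        return False                             # same window: suppress duplicate
    fired.add(key)
    alerts.append({"rule_id": rule_id, "device_id": device_id, "ts": ts})
    return True
```

In a real stream job the fired set would live in checkpointed operator state (and be expired with the window) rather than process memory, but the dedupe key is the same.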

Advanced Diagram (conceptual)

Devices (many tenants)
    |
    v
Ingest API (per-tenant or shared)
    |
    v
Kafka (partitioned by device or tenant)
    |
    v
Stream processing (Flink, etc.)
    - write to time-series store
    - evaluate rules → alerts → notification service
    - optional: downsampling
    |
    v
Time-series store (hot/warm/cold)
    |
    v
Query API (with limits, quotas)

Patterns and Concerns at This Stage

  • Exactly-once vs at-least-once: stream processing can offer exactly-once with checkpointing and idempotent sink; or at-least-once with dedupe in store or at read; choose by consistency and complexity.
  • Rule complexity: simple threshold in stream; anomaly or ML-based may need separate service or batch; keep hot path simple.
  • SLO-driven ops: ingest latency (accept to ack), alert latency (event to notify), query p95, buffer lag, storage and cost; error budgets and on-call.

MVP delivers IoT/sensor ingest with an API or ingest endpoint, a DB or time-series store, events stored by device and timestamp, query by device and range, and optional threshold alert (sync or cron). That’s enough to ship a small deployment.

As you grow, you add buffered ingest (queue or batch), a time-series or columnar store, a rule engine for alerts, and downsampling and retention. You keep ingest throughput and durability and decouple alerts from the ingest path.

At advanced scale, you add stream processing (Kafka + Flink) for ingest and real-time rules, multi-tenant isolation, and cost control (tiered retention, quotas). You scale event rate and alert latency without over-building on day one.

This approach gives you:

  • Start Simple — ingest endpoint + store, query by device and range, optional threshold alert; ship and learn.
  • Scale Intentionally — add buffer and time-series store when rate and query shape demand it; add rule engine when alerts become critical.
  • Add Complexity Only When Required — avoid stream processing and multi-tenant until volume and product justify them; keep ingest durable and alerts decoupled first.