Media / Video (and music / audio streaming) — Designed in Stages
You don’t need to design for scale on day one.
Define what you need—upload media, transcode (or encode) for playback, and serve via CDN—then build the simplest thing that works and evolve as storage, transcode load, and playback scale grow.
Here we use media streaming (video or music) as the running example: assets, transcode jobs, metadata, and play (VOD—video on demand—or on-demand audio). The same patterns apply to YouTube/Netflix-style video and Spotify/Apple Music-style audio: ingest, store, encode/transcode, stream, CDN.
Requirements and Constraints (no architecture yet)
Functional Requirements
- Upload — ingest raw or mezzanine assets (video/audio files); store durably; trigger transcode (or encode) to playback formats.
- Transcode — convert source to one or more formats/bitrates (e.g. H.264 720p, 1080p; AAC for audio); store outputs in object storage; update metadata when done.
- Play — serve playback URLs (VOD or live for video; on-demand for music); support adaptive bitrate (e.g. HLS, DASH) for video; large blobs and high bandwidth imply CDN in front of storage.
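To make the adaptive-bitrate requirement concrete, here is a minimal sketch of client-side rendition selection: pick the highest-bitrate rendition that fits the measured throughput, with a safety margin so the buffer does not drain. The rendition ladder and the 0.8 margin are illustrative assumptions, not values from any specific player.

```python
# Sketch of client-side adaptive bitrate (ABR) selection: choose the highest
# rendition whose bitrate fits within a fraction of measured throughput.

RENDITIONS = [  # (name, bitrate in kbit/s), sorted highest first -- assumed ladder
    ("1080p", 5000),
    ("720p", 2800),
    ("480p", 1400),
    ("360p", 800),
]

def select_rendition(measured_kbps: float, safety: float = 0.8) -> str:
    """Return the highest-bitrate rendition that fits the bandwidth budget."""
    budget = measured_kbps * safety
    for name, bitrate in RENDITIONS:
        if bitrate <= budget:
            return name
    return RENDITIONS[-1][0]  # fall back to the lowest rendition

print(select_rendition(4000))  # budget 3200 kbit/s -> "720p"
print(select_rendition(500))   # budget 400 kbit/s -> "360p"
```

Real players (hls.js, Shaka, ExoPlayer) also factor in buffer level and switch hysteresis; this shows only the core throughput rule.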
Quality Requirements
- Large blobs — video/audio files are large; object storage and CDN are standard; avoid serving origin directly at scale.
- Transcode latency — time from upload to “playable”; sync transcode blocks upload response; async (queue + worker) is common for anything beyond trivial length.
- Playback latency and quality — start-up time (time to first frame/audio), buffering, and bitrate selection; CDN and multiple renditions improve both.
- Expected scale — storage volume, transcode jobs per hour, concurrent playback streams, geographic distribution.
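A quick back-of-envelope calculation helps pin down the "expected scale" bullet. All input numbers below are assumptions for illustration; substitute your own.

```python
# Back-of-envelope capacity estimate: storage growth and transcode load.
# Every input value here is an assumed example, not a recommendation.

uploads_per_day = 1_000
avg_source_gb = 2.0            # average source file size in GB
renditions = 3                 # e.g. 360p, 720p, 1080p
rendition_ratio = 0.5          # each rendition ~50% of source size (assumed)

daily_storage_gb = uploads_per_day * avg_source_gb * (1 + renditions * rendition_ratio)
transcode_jobs_per_hour = uploads_per_day * renditions / 24

print(f"storage growth: {daily_storage_gb:,.0f} GB/day")   # 5,000 GB/day
print(f"transcode load: {transcode_jobs_per_hour:.0f} jobs/hour")  # 125 jobs/hour
```

Even rough numbers like these tell you whether one worker and one bucket will hold, and when they won't.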
Key Entities
- Asset — the media item (video or audio); source file(s) and derived (transcoded) renditions; has an identifier and metadata (title, duration, etc.).
- Transcode job — one unit of work: source asset, target format/bitrate, status (pending, running, done, failed); may be one job per output or a pipeline (one job produces multiple outputs).
- Metadata — title, duration, format info, playback URLs; stored in DB and/or alongside assets; used by API and players.
Primary Use Cases and Access Patterns
- Upload — write path; multipart or resumable upload to object storage; register asset and optionally enqueue transcode.
- Transcode — async (or sync for very short clips); worker picks job, runs encoder, writes output to storage, updates metadata and job status.
- Play — read path; API returns playback URL(s) (e.g. HLS manifest); client fetches segments from CDN; CDN pulls from origin (object storage) on miss.
Given this, start with the simplest MVP: one API, one DB (metadata), object storage (blobs), and sync or simple async transcode (one worker), with CDN in front of storage for play, then evolve with a transcode queue, multiple formats/bitrates, and retention/cleanup as volume grows.
Stage 1 — MVP (simple, correct, not over-engineered)
Goal
Ship working media flow: upload asset, transcode to at least one playback format, and play via CDN. One API, one DB, object storage, minimal transcode path.
Components
- API — REST or similar; auth, upload (initiate and complete), get asset metadata and playback URL(s), list assets; optionally webhook or poll for transcode completion.
- Primary DB — stores asset metadata (id, title, status, duration, playback URLs, transcode job id); source and output keys in object storage.
- Object storage — stores source and transcoded blobs; durable, scalable; not served directly to users at scale (use CDN in front).
- Sync or simple async transcode (one worker) — either (a) sync: upload completes, API triggers transcode and waits (or blocks) until done, then returns; or (b) async: upload completes, enqueue job, one worker processes jobs and updates metadata when done; client polls or gets webhook for “ready.”
- CDN in front of storage — playback URLs point to CDN; CDN origin is object storage (or a read-through cache); reduces load on origin and improves latency for viewers.
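Option (b), the simple async path, can be sketched with an in-memory queue and one worker thread; the queue and dict stand in for a real queue and the metadata DB, and the URL format is assumed.

```python
# Sketch of "simple async" transcode: upload enqueues a job and returns
# immediately; one worker drains the queue and marks the asset ready.
import queue
import threading

jobs: "queue.Queue[str | None]" = queue.Queue()
metadata: dict[str, dict] = {}            # stands in for the primary DB

def handle_upload(asset_id: str) -> None:
    """Upload completed: register the asset, enqueue transcode, return fast."""
    metadata[asset_id] = {"status": "processing"}
    jobs.put(asset_id)

def worker() -> None:
    while True:
        asset_id = jobs.get()
        if asset_id is None:              # sentinel: stop the worker
            break
        # ... run encoder: read source from storage, write output ...
        metadata[asset_id] = {
            "status": "ready",
            "playback_url": f"https://cdn.example.com/assets/{asset_id}/720p.m3u8",
        }
        jobs.task_done()

t = threading.Thread(target=worker, daemon=True)
t.start()
handle_upload("a1")
jobs.join()                               # in reality the client polls for "ready"
jobs.put(None)
print(metadata["a1"]["status"])           # -> ready
```

Swapping the in-memory queue for a managed one (SQS, Pub/Sub, etc.) later changes the plumbing, not the shape of this loop.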
Minimal Diagram
Client (upload)          Client (play)
      |                        ^
      v                        |
+-----------------+      +-------------+
|       API       |      |     CDN     |
+-----------------+      +-------------+
      |                        ^
      v                        |
Primary DB (metadata)    Object Storage (blobs)
      |                        ^
      v                        |
Transcode (1 worker) ----------+
  - reads source from storage
  - writes output to storage
  - updates DB with playback URL
Patterns and Concerns (don’t overbuild)
- Upload: use multipart upload for large files; store source under a key (e.g. assets/{id}/source); create asset row and transcode job (or run sync transcode).
- Playback URL: generate signed or public URL pointing to CDN (e.g. https://cdn.example.com/assets/{id}/720p.m3u8); CDN configured with origin = object storage bucket.
- Basic monitoring: upload success rate, transcode job duration and failure rate, CDN cache hit ratio, origin bandwidth.
Why This Is a Correct MVP
- One API, one DB, object storage, one transcode worker, CDN → clear path from upload to play; sync or simple async keeps logic straightforward.
- Vertical scaling (larger worker, more storage) buys you time before you need a transcode queue and multiple workers or formats.
Stage 2 — Growth Phase (more uploads, multiple formats, retention)
What Triggers the Growth Phase?
- Single transcode worker or sync path becomes the bottleneck; uploads queue up or transcode latency is too high.
- Product needs multiple formats/bitrates (e.g. 360p, 720p, 1080p for video; or multiple codecs); one worker or one job per format doesn’t scale.
- Storage and bandwidth grow; need retention and cleanup policy; CDN and origin separation must be clear.
Components to Add (incrementally)
- Transcode queue + workers — upload writes to storage and enqueues a transcode job (one per asset or one per output format); multiple workers consume queue and run transcode; update metadata when each output is done; scale workers independently.
- Multiple formats/bitrates — each asset can have several outputs (e.g. H.264 720p, 1080p; HLS variants); transcode pipeline produces all; metadata stores list of playback URLs or manifest URL for adaptive streaming.
- CDN + origin — CDN edge caches segments; origin is object storage (or origin server that reads from storage); configure TTL and cache behavior; reduce origin egress and improve global latency.
- Retention and cleanup — policy for source and outputs (e.g. delete source after N days, or keep only certain formats); background job or lifecycle rules on object storage; update DB when assets are removed.
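The fan-out from one upload into per-profile jobs can be sketched with (asset_id, profile) as the idempotency key, so a retry never creates duplicate work. The in-memory set stands in for a unique constraint on the jobs table, and the profile names are assumed.

```python
# Sketch of idempotent per-profile job fan-out for the transcode queue.

PROFILES = ["h264_360p", "h264_720p", "h264_1080p"]

enqueued: set[tuple[str, str]] = set()    # stands in for a DB unique index
job_queue: list[tuple[str, str]] = []     # stands in for the real queue

def enqueue_transcodes(asset_id: str) -> int:
    """Enqueue one job per output profile; return how many were actually new."""
    new = 0
    for profile in PROFILES:
        key = (asset_id, profile)
        if key in enqueued:               # already enqueued: skip (idempotent)
            continue
        enqueued.add(key)
        job_queue.append(key)
        new += 1
    return new

print(enqueue_transcodes("a1"))  # -> 3
print(enqueue_transcodes("a1"))  # retry: -> 0, no duplicate jobs
```

With this shape, "re-run transcode for asset X" is always safe to call, which simplifies retry logic everywhere upstream.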
Growth Diagram
                     +------------------+
Clients (upload) --> |       API        |
                     +------------------+
                        |            |
                        v            v
                   Primary DB   Object Storage (source)
                                     |
                                     v
                              Transcode Queue
                                     |
                        +------------+------------+
                        v            v            v
                    Worker 1     Worker 2     Worker 3
                        |            |            |
                        v            v            v
            Object Storage (outputs: 360p, 720p, 1080p, ...)
                                     |
                                     v
                          CDN (origin = storage)
                                     ^
                                     |
                               Clients (play)
Patterns and Concerns to Introduce (practical scaling)
- Idempotent transcode jobs: same asset + format should not be transcoded twice; use job id or (asset_id, profile) as idempotency key; store output URL in metadata once.
- Retention and cost: define lifecycle (delete source after transcode, expire old outputs); use storage classes (e.g. cold for archive) to reduce cost.
- Monitoring: queue depth, transcode duration per profile, failure rate, CDN hit ratio, origin bandwidth, storage growth.
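A retention pass can be sketched as a background job that mirrors what object-storage lifecycle rules do natively: delete sources past a TTL once transcoding is complete. The asset shape, ages, and 30-day threshold below are illustrative.

```python
# Sketch of a retention/cleanup pass over asset metadata.

SOURCE_TTL_DAYS = 30                      # assumed policy threshold

def sources_to_delete(assets: list[dict], now_day: int) -> list[str]:
    """Return storage keys of sources past retention whose transcode is done."""
    return [
        a["source_key"]
        for a in assets
        if a["status"] == "ready" and now_day - a["uploaded_day"] > SOURCE_TTL_DAYS
    ]

assets = [
    {"source_key": "assets/a1/source", "status": "ready", "uploaded_day": 0},
    {"source_key": "assets/a2/source", "status": "ready", "uploaded_day": 80},
    {"source_key": "assets/a3/source", "status": "processing", "uploaded_day": 0},
]
print(sources_to_delete(assets, now_day=100))  # -> ['assets/a1/source']
```

If the bucket's native lifecycle rules can express the policy, prefer them and keep only the DB-update side in your own code.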
Still Avoid (common over-engineering here)
- Live pipeline (real-time encoding, ingest) before you have live product requirements.
- Multi-region origin and DRM before you have geographic or licensing requirements.
- Custom encoding pipeline (beyond standard profiles) until quality or cost demands it.
Stage 3 — Advanced Scale (multi-region origin, live, DRM, analytics)
What Triggers Advanced Scale?
- Viewers in multiple regions; single origin or single CDN region causes high latency or cost (cross-region egress).
- Product needs live streaming (real-time ingest, encode, and distribute).
- Content requires DRM or licensing (e.g. Widevine, FairPlay); need a license server and protected packaging.
- Product needs analytics on playback (watch time, quality, errors); need events from player and aggregation pipeline.
Components (common advanced additions)
- Multi-region origin — replicate or distribute assets to origins in multiple regions; CDN routes viewers to nearest origin; or use CDN with multiple origins and geo-routing.
- Live pipeline (if needed) — ingest (RTMP, SRT, or similar), real-time encode to HLS/DASH segments, write segments to storage or push to CDN; low latency (few seconds) or ultra-low latency (sub-second) depending on product.
- DRM / licensing — package content with DRM (e.g. CENC); license server issues keys to authorized clients; integrate with auth and entitlements.
- Analytics on playback — player emits events (play, pause, quality change, error); collect via API or stream; aggregate for dashboards (watch time, buffering ratio, errors by region/device).
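The analytics aggregation can be sketched by folding a time-ordered event stream into two of the metrics named above: watch time and buffering ratio. The event shape (type plus timestamp in seconds) is an assumption; real players emit richer payloads.

```python
# Sketch of aggregating player events into watch time and buffering ratio.

def playback_metrics(events: list[tuple[str, float]]) -> dict[str, float]:
    """events: time-ordered (type, t_seconds); types: play, pause, buffer_start, buffer_end."""
    watch = buffering = 0.0
    play_at = buf_at = None
    for etype, t in events:
        if etype == "play":
            play_at = t
        elif etype == "pause" and play_at is not None:
            watch += t - play_at
            play_at = None
        elif etype == "buffer_start":
            buf_at = t
        elif etype == "buffer_end" and buf_at is not None:
            buffering += t - buf_at
            buf_at = None
    ratio = buffering / watch if watch else 0.0
    return {"watch_time_s": watch, "buffering_ratio": round(ratio, 3)}

events = [("play", 0.0), ("buffer_start", 10.0), ("buffer_end", 12.0), ("pause", 100.0)]
print(playback_metrics(events))  # 100 s watched, 2 s buffering -> ratio 0.02
```

In production this fold runs in the aggregation pipeline (stream processor or batch job), grouped by session, region, and device.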
Advanced Diagram (conceptual)
                    +------------------+
Upload / Ingest --> |   API / Ingest   |
                    +------------------+
                      |       |       |
                      v       v       v
                 Transcode  Live    Metadata
                   (VOD)  Encoder     DB
                         (if live)
                      |       |
                      v       v
          Object Storage (multi-region origins)
                  |                |
                  v                v
        CDN (geo-routing)   License Server (DRM)
                  ^
                  |
        Clients (play) --> Analytics events
Patterns and Concerns at This Stage
- Multi-region: replicate assets (async) to regional buckets or origins; CDN configuration for multiple origins; consistency is eventual (new asset may not be in all regions immediately).
- Live: low latency vs reliability trade-off; redundant encoders and ingest; segment availability and CDN caching for live.
- DRM: key rotation, license caching, and entitlement checks; avoid blocking playback on license latency.
- SLO-driven ops: upload success rate, transcode SLA, playback availability and start time, CDN and origin error rates; error budgets and on-call.
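The "eventual" multi-region behavior above can be sketched as origin selection: route a viewer to the nearest origin that already holds the asset, falling back to the home region while async replication catches up. Region names and the replication state are assumed examples.

```python
# Sketch of origin selection under eventually consistent multi-region replication.

replicated: dict[str, set[str]] = {       # asset_id -> regions holding a copy
    "a1": {"us-east", "eu-west"},
    "a2": {"us-east"},                    # not yet replicated to eu-west
}

def pick_origin(asset_id: str, preferred: list[str], home: str = "us-east") -> str:
    """Return the first preferred region holding the asset, else the home origin."""
    regions = replicated.get(asset_id, set())
    for region in preferred:
        if region in regions:
            return region
    return home

print(pick_origin("a1", ["eu-west", "us-east"]))  # -> eu-west (replicated)
print(pick_origin("a2", ["eu-west", "us-east"]))  # -> us-east (not yet replicated)
```

The fallback to the home origin is what keeps new assets playable everywhere immediately, at the cost of cross-region egress until replication completes.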
Summarizing the Evolution
MVP delivers media flow with one API, one DB, object storage, sync or simple async transcode (one worker), and CDN in front of storage. That’s enough to ship and learn.
As you grow, you add a transcode queue and multiple workers, multiple formats/bitrates for adaptive streaming, and retention/cleanup. You keep CDN + origin separation clear and scale workers and storage independently.
At very high scale, you introduce multi-region origin for global latency, a live pipeline if the product needs it, DRM/licensing for protected content, and analytics on playback. You add complexity only where geography, live, or compliance require it.
This approach gives you:
- Start Simple — one API, one DB, object storage, one transcode path, CDN; ship and learn.
- Scale Intentionally — add transcode queue and multiple formats when upload and quality justify it; add multi-region and live when product and SLOs demand it.
- Add Complexity Only When Required — avoid live and DRM until product requires them; keep VOD path clear and scalable first.