How is video stored and delivered at scale?

Video files are large immutable blobs, so they belong in object storage - durable, cheap, and effectively unbounded - not in a database. Delivery happens through a CDN: thousands of edge locations cache video segments close to viewers, and the origin object storage is touched only on a cache miss. That offload is essential, because no single origin could serve the hundreds of terabits per second that global streaming demands.

What is adaptive bitrate streaming?

Adaptive bitrate streaming makes each video available in several renditions - different resolutions and bitrates - and chops every rendition into short segments of a few seconds. A manifest lists all of them. The client player measures its own download speed and buffer level and picks the rendition for each next segment, stepping down when the network is congested and back up when it recovers, so playback stays smooth instead of buffering.

How does video transcoding scale?

Transcoding is CPU-intensive, so a video is split into chunks at keyframe boundaries and the chunks are transcoded in parallel across an elastic worker fleet, each worker handling one chunk into one rendition. A one-hour video split into sixty chunks can finish up to roughly sixty times faster in wall-clock terms when the fleet has a worker free for every chunk, and a worker crash only re-does one chunk rather than the whole video.

Why is a CDN essential for a video platform?

Streaming egress runs to hundreds of terabits per second globally, far beyond what any origin can serve, and viewers are everywhere while the origin is not. A CDN caches immutable video segments at edge locations near viewers, serving the overwhelming majority of traffic locally with low latency. Video segments never change after transcoding, so this caching needs no invalidation.

Why upload large videos in resumable chunks?

A multi-gigabyte upload over a flaky network will frequently be interrupted, and restarting from zero each time wastes bandwidth and frustrates users. Resumable chunked upload sends the file in pieces and tracks which pieces succeeded, so a dropped connection resumes from the last good chunk. Uploading directly to object storage with pre-signed URLs also keeps the upload service from becoming a bandwidth bottleneck.

When is a video available to watch after upload?

Not immediately - the video moves through a status lifecycle of uploading, processing, ready, or failed, and it is playable only once transcoding into its renditions completes. The system is eventually consistent here by design: the upload returns quickly, transcoding runs asynchronously, and the video flips to ready when its renditions and manifest are published.

Design Video Streaming: System Design Interview 2026

A video streaming platform is built around one uncomfortable number: global streaming egress is measured in hundreds of terabits per second, and no origin server, datacenter, or database comes close to serving that. Every major design decision - storing video as blobs, transcoding it into many renditions, delivering it through a CDN, letting the client choose the bitrate - follows from accepting that the bytes must already be sitting close to the viewer before they press play.

This walkthrough assumes the 6-step system design framework and applies it at senior depth. It is Part 10 of a system design series.

The Problem
Step 1 - Clarify Requirements
Step 2 - Estimate Scale
Step 3 - API and Data Model
Step 4 - High-Level Design
Step 5 - Deep Dive: The Transcoding Pipeline, CDN, and Adaptive Bitrate
Step 6 - Bottlenecks and Trade-offs
Reference Architecture
Common Mistakes in the Interview
Quick Reference
Related Articles

The Problem

We are designing a video-on-demand platform: users upload videos, the platform processes them, and viewers around the world stream them on any device. The canonical examples are YouTube for user-uploaded content and Netflix for a curated catalogue.

The senior framing is a pipeline with two very different ends. The ingest end takes a large, immutable source file and fans it out through a parallel transcoding pipeline into many derived artifacts. The delivery end must put those artifacts in front of a global audience within a tight startup-latency budget and keep playback smooth on unpredictable networks. The connecting insight is that everything past the original is immutable - which is what makes caching, the CDN, and the whole delivery side tractable.

Step 1 - Clarify Requirements

Functional requirements:

Upload a video.
Process it into multiple renditions for different devices and bandwidths.
Stream it to viewers, adapting quality to their connection.

Out of scope (name, then defer): recommendations, comments, monetisation, DRM specifics, and live streaming - live is a genuinely different problem; we design video on demand.

Non-functional requirements:

Petabyte-to-exabyte storage, growing continuously.
Massive read bandwidth - video is the bulk of internet traffic.
Low startup latency and smooth playback - fast time-to-first-frame, minimal rebuffering.
Global low latency - viewers are everywhere.
Durability of uploaded originals.

The clarifying questions settle three things. First, this is VOD, not live. Second, we design the upload path, because user-uploaded content forces the hard transcoding-at-scale problem; a curated platform like Netflix simply runs the same transcoding offline over a fixed catalogue and skips per-user upload. Third, availability is asymmetric - uninterrupted playback matters more than instant upload.

Step 2 - Estimate Scale

Ingest. Assume 1 million videos uploaded/day, averaging ~500 MB of source: ~500 TB/day of originals. Each video is transcoded into ~6 segmented renditions, so processed output adds another 1-2 PB/day, and at that rate total storage passes an exabyte within a year or two.

Delivery. Assume 1 billion hours watched/day at an average ~3 Mbps. That is 1e9 h x 3600 s/h x 3e6 bits/s ≈ 1.1 x 10^19 bits/day; divided by 86,400 s/day, the average egress is roughly 125 Tbps - call it ~150 Tbps with headroom - and multiples of that at peak. No origin serves this; this number is the entire argument for a CDN.

Transcoding compute. 1M videos x ~10 min x ~6 renditions ≈ 60 million rendition-minutes/day of CPU-heavy work - an elastic worker fleet, sized to the queue.

The shape: exabyte storage, ~150 Tbps of egress that only a CDN can carry, and a large elastic transcoding fleet.
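The three estimates above can be reproduced with a few lines of arithmetic. This is a minimal sketch that uses only the assumed inputs from this section (1 million uploads/day, ~500 MB per source, 1 billion hours watched/day at ~3 Mbps, ~10 minutes and ~6 renditions per video); the function names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def ingest_tb_per_day(videos_per_day: int, avg_source_mb: float) -> float:
    """Original bytes uploaded per day, in terabytes (decimal units)."""
    return videos_per_day * avg_source_mb / 1_000_000  # MB -> TB


def average_egress_tbps(hours_watched_per_day: float, avg_bitrate_mbps: float) -> float:
    """Average delivery bandwidth in terabits per second."""
    bits_per_day = hours_watched_per_day * 3600 * avg_bitrate_mbps * 1e6
    return bits_per_day / SECONDS_PER_DAY / 1e12


def rendition_minutes_per_day(videos_per_day: int, avg_minutes: float, renditions: int) -> float:
    """Total minutes of output video the transcoding fleet must produce per day."""
    return videos_per_day * avg_minutes * renditions


print(f"ingest:  {ingest_tb_per_day(1_000_000, 500):.0f} TB/day")
print(f"egress:  {average_egress_tbps(1e9, 3):.0f} Tbps average")
print(f"compute: {rendition_minutes_per_day(1_000_000, 10, 6):,.0f} rendition-minutes/day")
```

The exact average works out to about 125 Tbps, the same order of magnitude as the ~150 Tbps headline figure; peak traffic is a multiple of either number.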

Step 3 - API and Data Model

Upload is a multi-step, resumable flow; playback is manifest-driven.

POST /api/videos                 -> { videoId, uploadUrl }     (initiate)
PUT  <uploadUrl>  (chunked, resumable, direct to object storage)
POST /api/videos/{id}/complete   -> 202 Accepted                (triggers processing)
 
GET  /api/videos/{id}/manifest   -> HLS .m3u8 / DASH .mpd       (rendition + segment list)
GET  <segment URL>  (served by the CDN)
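The resumable part of the upload flow comes down to tracking which chunks have landed and resending only the rest. The sketch below is an in-memory illustration of that bookkeeping; `UploadSession`, `upload_with_resume`, and the `send_chunk` callback are hypothetical names, and a real system would persist the session and rely on the object store's own multipart or resumable-upload mechanism.

```python
import math
from dataclasses import dataclass, field


@dataclass
class UploadSession:
    """Tracks which fixed-size chunks of one upload have been stored."""
    total_bytes: int
    chunk_bytes: int
    received: set[int] = field(default_factory=set)

    @property
    def chunk_count(self) -> int:
        return math.ceil(self.total_bytes / self.chunk_bytes)

    def record_chunk(self, index: int) -> None:
        if not 0 <= index < self.chunk_count:
            raise ValueError(f"chunk index {index} out of range")
        self.received.add(index)  # idempotent: a retried chunk is harmless

    def missing_chunks(self) -> list[int]:
        return [i for i in range(self.chunk_count) if i not in self.received]

    def is_complete(self) -> bool:
        return len(self.received) == self.chunk_count


def upload_with_resume(session: UploadSession, send_chunk, max_rounds: int = 5) -> bool:
    """Each round sends only the chunks still missing; send_chunk returns True on success."""
    for _ in range(max_rounds):
        for index in session.missing_chunks():
            if send_chunk(index):
                session.record_chunk(index)
        if session.is_complete():
            return True
    return False
```

After a dropped connection, the client asks for `missing_chunks()` and continues from there, so bytes that already arrived are never sent again.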

Element	Stored where
Video metadata	Metadata DB: `videoId`, uploader, title, `status`, duration, rendition list
Original file	Object storage - archived for future re-transcoding
Rendition segments	Object storage - many small immutable files per rendition
Manifest	Object storage - lists renditions and segment URLs

The status field - uploading, processing, ready, failed - is what tells a viewer whether the video can be played, and it is the spine of the consistency model below.
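The status lifecycle can be modelled as a small state machine. This is a sketch under one assumption that the text does not state: `ready` and `failed` are treated as terminal, so a re-transcode or a retry after failure would need extra transitions. The names `VideoStatus`, `transition`, and `is_playable` are illustrative.

```python
from enum import Enum


class VideoStatus(Enum):
    UPLOADING = "uploading"
    PROCESSING = "processing"
    READY = "ready"
    FAILED = "failed"


# Allowed moves between states; READY and FAILED are terminal in this sketch.
ALLOWED_TRANSITIONS = {
    VideoStatus.UPLOADING: {VideoStatus.PROCESSING, VideoStatus.FAILED},
    VideoStatus.PROCESSING: {VideoStatus.READY, VideoStatus.FAILED},
    VideoStatus.READY: set(),
    VideoStatus.FAILED: set(),
}


def transition(current: VideoStatus, target: VideoStatus) -> VideoStatus:
    """Return the new status, or raise if the move is not part of the lifecycle."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def is_playable(status: VideoStatus) -> bool:
    """Only a video whose renditions and manifest are published can be played."""
    return status is VideoStatus.READY
```

Rejecting illegal moves in one place keeps a late or duplicate pipeline event from flipping a video back to an earlier state.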

Step 4 - High-Level Design

flowchart TD
    Up([Uploader]) -->|resumable chunks| US[Upload Service]
    US -->|original| OS[(Object Storage)]
    US -->|new-video event| Q[Transcoding Queue]
    Q --> Pipe[Transcoding Pipeline]
    Pipe -->|segments + manifests| OS
    Pipe -->|status: ready| Meta[(Metadata DB)]
    Viewer([Viewer]) -->|manifest request| MS[Metadata / Manifest Service]
    Viewer -->|segment requests| CDN[CDN Edge]
    CDN -->|miss| OS
    MS --> Meta

Figure 1. The architecture separates the upload-and-transcode write path from the manifest-and-segment read path. Both meet at object storage, which holds the originals and the transcoded segments; the CDN sits between viewers and the origin and serves almost all bytes from edge caches. The status flowing from pipeline back to the metadata DB is what tells viewers when a video is playable.

The upload service streams a resumable upload straight into object storage and emits an event. The transcoding pipeline turns the original into renditions and marks the video ready. Viewers fetch a manifest, then pull immutable segments from the CDN, which reaches back to object storage only on a cache miss.

Step 5 - Deep Dive: The Transcoding Pipeline, CDN, and Adaptive Bitrate

This is the core. Three subsystems carry it: the pipeline that produces the renditions, the CDN that delivers them, and the adaptive-bitrate scheme that keeps playback smooth.

Part A - Why transcode, and the pipeline

An uploaded video is one file, one codec, one resolution. Viewers are not: a phone on cellular, a laptop on wifi, and a 4K television all need different resolutions, bitrates, and sometimes codecs (H.264 for compatibility, H.265 or AV1 for efficiency). So the platform must produce a matrix of renditions.

Transcoding is slow and CPU-bound. Doing it as one job per video means a one-hour video occupies one worker for a long time and a crash restarts the whole thing. The pipeline instead exploits chunk-level parallelism:

flowchart LR
    Src[Original] --> Split[Split at keyframe boundaries]
    Split --> C1[Chunk 1]
    Split --> C2[Chunk 2]
    Split --> C3[Chunk N]
    C1 --> TW[Transcode workers<br/>chunk x rendition, in parallel]
    C2 --> TW
    C3 --> TW
    TW --> Asm[Assemble segments per rendition]
    Asm --> Man[Generate manifests]
    Man --> Pub[Publish -> status: ready]

Figure 2. The transcoding pipeline made parallel. The source is split at keyframe boundaries, every chunk is transcoded into every rendition in parallel across the elastic worker fleet, then segments are assembled per rendition and manifests written. A one-hour video split into sixty chunks can finish up to roughly sixty times faster than a single monolithic job, given enough workers - and a worker crash only redoes one chunk.

The source is split at keyframe boundaries, each chunk is transcoded into each rendition in parallel across an elastic fleet, then segments are assembled per rendition and manifests generated. A one-hour video split into sixty chunks can finish up to roughly sixty times faster in wall-clock time; the speedup is bounded by the number of chunks and by how many workers are free. This is the durable-work-queue pattern of Part 3 and Part 8, with the senior twist that the unit of work is a chunk, which is what makes both speed and fault recovery cheap.
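The fan-out and assembly steps can be sketched with a thread pool standing in for the worker fleet. This is an illustration only: `transcode_chunk` is a placeholder that returns a segment name instead of invoking an encoder, the segment naming scheme is hypothetical, and a production pipeline would distribute tasks through a durable queue across machines rather than threads in one process.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def transcode_chunk(chunk_id: int, rendition: str) -> str:
    """Stand-in for a real encoder call; returns the output segment name."""
    return f"{rendition}/seg_{chunk_id:05d}.ts"


def transcode_video(chunk_count: int, renditions: list[str], max_workers: int = 8) -> dict[str, list[str]]:
    """Transcode every (chunk, rendition) pair in parallel, then group by rendition."""
    # One independent task per chunk x rendition: this is the unit of work.
    tasks = list(product(range(chunk_count), renditions))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so outputs line up with tasks.
        outputs = list(pool.map(lambda task: transcode_chunk(*task), tasks))

    # Assemble: collect each rendition's segments in chunk order.
    by_rendition: dict[str, list[str]] = {r: [] for r in renditions}
    for (_chunk_id, rendition), segment in zip(tasks, outputs):
        by_rendition[rendition].append(segment)
    return by_rendition


segments = transcode_video(chunk_count=4, renditions=["360p", "720p"])
```

Because each task touches exactly one chunk and one rendition, a failed task can be retried on its own without disturbing the others.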

Part B - Blob storage

Video files are large immutable blobs and belong in object storage, never a database: object storage is highly durable (replication giving many nines), cheap, and scales to exabytes. Originals are kept - archived to a cold, cheaper tier - because when a better codec like AV1 arrives you re-transcode from the original rather than from a lossy rendition. Segments live in object storage too, and since viewing follows a steep popularity curve, storage tiering keeps hot content on fast storage and the long tail on archival storage.

Part C - The CDN

Streaming egress is ~150 Tbps; no origin serves that, and viewers are global while the origin is not. A CDN - thousands of edge locations - caches video segments close to viewers. A player fetches each segment from its nearest edge: a hit is served locally with low latency and zero origin load, and a miss has the edge fetch from object storage and cache the result.

This works cleanly because segments are immutable - a transcoded segment never changes - so CDN caching needs no invalidation, the same immutability dividend seen in Part 1 and Part 4. Popular content stays hot at the edge; the long tail occasionally misses to origin. For a known surge - a major new release - the CDN is pre-warmed, pushing content to edges before launch so the first million viewers all hit a warm cache.
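The hit, miss, and pre-warm behaviour described above can be illustrated with a toy edge cache. The sketch assumes LRU eviction, which the text does not specify, and `EdgeCache` and its `fetch_from_origin` callback are hypothetical names. There is no invalidation path at all, which is the point: immutable segments never need one.

```python
from collections import OrderedDict


class EdgeCache:
    """Toy LRU cache of immutable segments in front of an origin fetch function."""

    def __init__(self, capacity: int, fetch_from_origin):
        self.capacity = capacity
        self.fetch_from_origin = fetch_from_origin
        self.store: OrderedDict[str, bytes] = OrderedDict()
        self.hits = 0
        self.misses = 0

    def _insert(self, url: str, data: bytes) -> None:
        self.store[url] = data
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict the least recently used entry

    def get(self, url: str) -> bytes:
        """Serve from the edge on a hit; fetch from origin and cache on a miss."""
        if url in self.store:
            self.store.move_to_end(url)  # mark as recently used
            self.hits += 1
            return self.store[url]
        self.misses += 1
        data = self.fetch_from_origin(url)
        self._insert(url, data)
        return data

    def prewarm(self, urls: list[str]) -> None:
        """Pull content to the edge ahead of a known surge, before viewers ask."""
        for url in urls:
            if url not in self.store:
                self._insert(url, self.fetch_from_origin(url))

    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Pre-warming moves the origin fetch to before launch, so the first viewer request for a new release is already a hit.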

Part D - Adaptive bitrate streaming

A viewer's bandwidth fluctuates, so a single fixed bitrate either rebuffers (too high) or looks poor (too low). Adaptive bitrate (ABR) streaming solves this. Every rendition is cut into short segments of a few seconds; a manifest (HLS .m3u8 or DASH .mpd) lists every rendition and its segments.
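To make the manifest concrete, the sketch below generates HLS playlists as plain text. In HLS the manifest is split in two: a master playlist lists the renditions, and each rendition has its own media playlist listing its segments. The `Rendition` class, the directory layout, and the segment file names are assumptions for illustration, and real playlists usually carry more attributes (codecs, frame rate, and so on).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Rendition:
    name: str           # hypothetical directory name, e.g. "720p"
    width: int
    height: int
    bandwidth_bps: int  # peak bitrate advertised to the player


def master_playlist(renditions: list[Rendition]) -> str:
    """List every rendition, lowest bitrate first, pointing at its media playlist."""
    lines = ["#EXTM3U"]
    for r in sorted(renditions, key=lambda r: r.bandwidth_bps):
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={r.bandwidth_bps},RESOLUTION={r.width}x{r.height}"
        )
        lines.append(f"{r.name}/playlist.m3u8")
    return "\n".join(lines) + "\n"


def media_playlist(segment_durations: list[float]) -> str:
    """List one rendition's segments in playback order for a finished (VOD) video."""
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        # Target duration must be an integer no smaller than any segment duration.
        f"#EXT-X-TARGETDURATION:{math.ceil(max(segment_durations))}",
        "#EXT-X-MEDIA-SEQUENCE:0",
        "#EXT-X-PLAYLIST-TYPE:VOD",
    ]
    for index, duration in enumerate(segment_durations):
        lines.append(f"#EXTINF:{duration:.3f},")
        lines.append(f"seg_{index:05d}.ts")
    lines.append("#EXT-X-ENDLIST")
    return "\n".join(lines) + "\n"
```

Both playlists are static text files, so they can be written to object storage once and cached like any other immutable artifact.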

sequenceDiagram
    participant P as Player
    participant C as CDN
 
    P->>C: GET manifest
    C-->>P: renditions + segment list
    P->>C: GET segment 1 @ low rendition (fast start)
    C-->>P: segment 1
    Note over P: measure throughput + buffer
    P->>C: GET segment 2 @ higher rendition
    C-->>P: segment 2
    Note over P: bandwidth drops
    P->>C: GET segment 3 @ lower rendition
    C-->>P: segment 3

Figure 3. Adaptive bitrate streaming in action. The player fetches the manifest, then chooses each next segment's rendition based on its own throughput and buffer measurements - starting low for a fast first frame, ramping up, stepping down on congestion. The server stays completely stateless: it serves immutable segments and a static manifest and makes no per-viewer decisions.

The decisive point: the adaptation logic lives in the client, per segment. The player measures throughput and buffer level and chooses the next segment's rendition - starting low for a fast first frame, ramping up, stepping down on congestion. The server stays completely stateless: it serves immutable segments and a static manifest and makes no per-viewer decisions. HLS and DASH are the two standard segmented formats; a platform typically offers both for device coverage.
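The client-side decision can be sketched as a simple throughput-plus-buffer heuristic. This is not the algorithm of any particular player: the safety factor, the low-buffer threshold, and the smoothing weight are assumed values chosen for illustration.

```python
def smooth_throughput(previous_bps: float, sample_bps: float, alpha: float = 0.3) -> float:
    """Exponentially weighted average, so one slow segment does not cause a panic switch."""
    return alpha * sample_bps + (1 - alpha) * previous_bps


def choose_rendition(
    ladder_bps: list[int],
    throughput_bps: float,
    buffer_seconds: float,
    safety: float = 0.8,
    low_buffer_seconds: float = 5.0,
) -> int:
    """Pick the bitrate for the next segment.

    Returns the highest rendition that fits within a safety margin of the
    measured throughput, or the lowest rendition when the buffer is nearly
    empty or nothing fits.
    """
    ladder = sorted(ladder_bps)
    if buffer_seconds < low_buffer_seconds:
        return ladder[0]  # protect against a stall before chasing quality
    budget = throughput_bps * safety
    affordable = [bitrate for bitrate in ladder if bitrate <= budget]
    return affordable[-1] if affordable else ladder[0]


ladder = [400_000, 1_500_000, 3_000_000, 6_000_000]
next_bitrate = choose_rendition(ladder, throughput_bps=5_000_000, buffer_seconds=20)
```

The function needs only the manifest's bitrate ladder and the player's own measurements, which is why the server can stay stateless.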

Consistency model

A video has an explicit status lifecycle, and the system is eventually consistent between upload and playability: the upload returns fast, transcoding runs asynchronously, and the video becomes ready only when its renditions and manifest are published. Segments, once written, are immutable, which is what makes CDN delivery consistent for free. The status is flipped to ready only after that publish step, so a viewer reliably sees processing until the video is genuinely playable.

Failure modes

Transcoding worker crash. Because the unit of work is a chunk, only that chunk is re-transcoded - at-least-once via the queue - not the whole video. This is the payoff of chunk-level granularity.
One rendition fails. Publish the renditions that succeeded so the video is watchable, and retry the failed one; a persistently bad source becomes a failed video after capped retries, the dead-letter idea from Part 3.
CDN edge down. Viewers route to the next-nearest edge - higher latency, not an outage.
New-release origin stampede. A viral premiere would flood the origin on cache misses; pre-warming plus the CDN's tiered caching (edge to regional to origin) absorbs it.
Object storage. Engineered for very high durability; originals are replicated so a re-transcode is always possible.
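The partial-publish and capped-retry behaviour in the list above can be sketched as follows. The sketch assumes a video counts as ready as soon as at least one rendition succeeds and as failed only when none do; `TranscodeError`, `process_renditions`, and the `transcode` callable are hypothetical, and the in-memory dead-letter list stands in for a real dead-letter queue.

```python
class TranscodeError(Exception):
    """Raised by the (hypothetical) transcode callable when a job fails."""


def process_renditions(renditions, transcode, max_attempts=3):
    """Try each rendition up to max_attempts times.

    Returns (status, published, dead_lettered): renditions that succeeded are
    published even if others fail, and renditions that exhaust their retries
    are set aside instead of being retried forever.
    """
    published, dead_lettered = [], []
    for rendition in renditions:
        for attempt in range(1, max_attempts + 1):
            try:
                transcode(rendition)
            except TranscodeError:
                if attempt == max_attempts:
                    dead_lettered.append(rendition)  # capped: stop retrying
                continue
            published.append(rendition)
            break
    status = "ready" if published else "failed"
    return status, published, dead_lettered
```

A transient failure costs one extra attempt on one rendition, while a persistently bad source ends in a definite `failed` status rather than an endless retry loop.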

Multi-region

The CDN is the multi-region delivery layer - distributing content globally is its entire purpose. Object storage is replicated across regions, transcoding compute runs wherever capacity is free, and the metadata DB is replicated. Uploads go to the nearest region and the original replicates outward. For global releases, content is pre-positioned to every region's edges ahead of time.

Evolution path

Stage	Approach
Launch	Upload to object storage, a single transcode job, serve directly from origin
Growth	Chunked parallel transcoding pipeline, a CDN in front of object storage
Scale	Multi-codec renditions, storage tiering, CDN pre-warming, multi-region

Build on object storage for blobs, the segmented ABR format, resumable upload, and the status lifecycle from day one - all four are structural and painful to retrofit. Defer multi-codec encoding, storage tiering, and pre-warming.

Observability

Track upload success rate, transcoding latency (upload to ready) at p50/p99, transcoding queue depth and failure rate, CDN cache hit ratio (the headline cost-and-performance metric), rebuffering ratio (the headline viewer-quality metric), startup time, egress bandwidth, and storage growth. Reasonable SLOs: 99% of videos ready within minutes of upload, p99 startup under 2 seconds, and a rebuffering ratio below 0.5%.
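The headline metrics reduce to simple ratios and percentiles. The sketch below assumes one common definition of rebuffering ratio (stall time divided by stall time plus playing time; definitions vary between platforms) and a nearest-rank percentile. The SLO thresholds are the ones suggested above, and the function names are illustrative.

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Fraction of segment requests served from the edge."""
    total = hits + misses
    return hits / total if total else 0.0


def rebuffering_ratio(stall_seconds: float, playing_seconds: float) -> float:
    """Share of session time spent stalled (assumed definition)."""
    total = stall_seconds + playing_seconds
    return stall_seconds / total if total else 0.0


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # integer ceiling of p * n / 100
    return ordered[max(rank, 1) - 1]


def meets_playback_slos(
    startup_seconds: list[float], stall_seconds: float, playing_seconds: float
) -> bool:
    """Check p99 startup under 2 seconds and rebuffering ratio below 0.5%."""
    return (
        percentile(startup_seconds, 99) < 2.0
        and rebuffering_ratio(stall_seconds, playing_seconds) < 0.005
    )
```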

Step 6 - Bottlenecks and Trade-offs

Delivery egress at ~150 Tbps can only be served by a CDN - the defining constraint of the whole design.
Transcoding compute is heavy, handled by an elastic fleet plus chunk-level parallelism.
Storage growth into the exabytes is contained by tiering hot and cold content.
Transcoding latency from upload to ready is cut by transcoding chunks in parallel.
The new-release stampede on the origin is absorbed by CDN pre-warming and tiered caching.

Reference Architecture

The pattern this problem teaches, reusable well beyond video:

Ingest a large immutable asset, fan it out through a parallel processing pipeline into many derived immutable artifacts, store them in blob storage, and serve them globally through a CDN while the client adapts to its own conditions.

flowchart LR
    subgraph Ingest["Ingest - parallel pipeline"]
        I1[Large original] --> I2[Split into chunks]
        I2 --> I3[Parallel transcode]
        I3 --> I4[(Blob storage)]
    end
    subgraph Deliver["Deliver - global, client-adaptive"]
        D1[CDN edge] --> D2[Client picks bitrate]
    end
    I4 --> D1

Figure 4. The reference architecture stripped to its two halves: ingest a large immutable asset through a parallel processing pipeline into many derived immutable artifacts, then deliver them globally through a CDN while the client adapts to its own conditions. The same shape applies to image-processing pipelines, document conversion, and any large-media platform.

The same shape recurs in any large-media or large-asset platform: image-processing pipelines, document conversion, satellite-imagery processing, ML pipelines over large blobs. Split a big immutable input, process the pieces in parallel, store immutable outputs in blob storage, and deliver them through a cache layer that immutability makes trivial.

Common Mistakes in the Interview

Storing video blobs in a database instead of object storage.
Transcoding as one monolithic job, losing parallelism and re-doing everything on a crash.
Serving video from the origin with no CDN - physically impossible at streaming scale.
Server-side bitrate switching instead of client-driven, per-segment ABR.
Forgetting resumable upload, so a dropped connection restarts a multi-gigabyte upload.
Discarding the original, leaving no way to re-transcode for a future codec.
A synchronous upload-then-transcode flow with no status lifecycle.
Ignoring the new-release stampede on the CDN origin.

Quick Reference

Topic	Key Point
Core pattern	Parallel transcoding pipeline + blob storage + CDN + client-side ABR
Storage	Object storage for blobs; keep originals; tier hot vs cold content
Transcoding	Split at keyframes, transcode chunks in parallel on an elastic fleet
CDN	Carries ~150 Tbps of egress; immutable segments need no invalidation
ABR	Renditions cut into segments; the client picks bitrate per segment
Upload	Resumable chunked upload, direct to object storage via pre-signed URLs
Consistency	Eventually consistent: status lifecycle uploading -> processing -> ready
Failure recovery	Chunk granularity - a crash re-transcodes one chunk, not the video
Hot content	Pre-warm the CDN before a known release to avoid an origin stampede
Multi-region	The CDN is the delivery layer; replicate storage, pre-position content

Related Articles

System Design Interview Problems: A Senior's Roadmap - the full series index and pattern library.
System Design Interview Guide: The 6-Step Framework - the method this walkthrough applies.
Design a Web Crawler - Part 8; the parallel work-queue pipeline pattern.
Design a Distributed Cache - Part 4; the caching and immutability ideas the CDN relies on.
Design a URL Shortener - Part 1; immutable data making caching nearly free.
Design a Payment System - Part 11; correctness-first design, where ingest-and-fan-out shifts to ledger-and-reconcile.

This is Part 10 of a 12-part system design series where each post solves one problem around one core pattern. Next: Design a Payment System.
