Design Video Streaming: System Design Interview 2026

Tags: system-design, video-streaming, cdn, architecture, backend, interview-preparation

A video streaming platform is built around one uncomfortable number: global streaming egress is measured in hundreds of terabits per second, and no origin server, datacenter, or database comes close to serving that. Every major design decision - storing video as blobs, transcoding it into many renditions, delivering it through a CDN, letting the client choose the bitrate - follows from accepting that the bytes must already be sitting close to the viewer before they press play.

This walkthrough assumes the 6-step system design framework and applies it at senior depth. It is Part 10 of a system design series.

Table of Contents

  1. The Problem
  2. Step 1 - Clarify Requirements
  3. Step 2 - Estimate Scale
  4. Step 3 - API and Data Model
  5. Step 4 - High-Level Design
  6. Step 5 - Deep Dive: The Transcoding Pipeline, CDN, and Adaptive Bitrate
  7. Step 6 - Bottlenecks and Trade-offs
  8. Reference Architecture
  9. Common Mistakes in the Interview
  10. Quick Reference
  11. Related Articles

The Problem

We are designing a video-on-demand platform: users upload videos, the platform processes them, and viewers around the world stream them on any device. The canonical examples are YouTube for user-uploaded content and Netflix for a curated catalogue.

The senior framing is a pipeline with two very different ends. The ingest end takes a large, immutable source file and fans it out through a parallel transcoding pipeline into many derived artifacts. The delivery end must put those artifacts in front of a global audience within a tight startup-latency budget and keep playback smooth on unpredictable networks. The connecting insight is that everything past the original is immutable - which is what makes caching, the CDN, and the whole delivery side tractable.


Step 1 - Clarify Requirements

Functional requirements:

  • Upload a video.
  • Process it into multiple renditions for different devices and bandwidths.
  • Stream it to viewers, adapting quality to their connection.

Out of scope (name, then defer): recommendations, comments, monetisation, DRM specifics, and live streaming - live is a genuinely different problem; we design video on demand.

Non-functional requirements:

  • Petabyte-to-exabyte storage, growing continuously.
  • Massive read bandwidth - video is the bulk of internet traffic.
  • Low startup latency and smooth playback - fast time-to-first-frame, minimal rebuffering.
  • Global low latency - viewers are everywhere.
  • Durability of uploaded originals.

The clarifying questions settle three things. First, this is VOD, not live. Second, we design the upload path, because user-uploaded content forces the hard transcoding-at-scale problem; a curated platform like Netflix simply runs the same transcoding offline over a fixed catalogue and skips per-user upload. Third, availability is asymmetric - uninterrupted playback matters more than instant upload.


Step 2 - Estimate Scale

Ingest. Assume 1 million videos uploaded/day, averaging ~500 MB of source: ~500 TB/day of originals. Each video becomes ~6 renditions of segments, so processed output adds 1-2 PB/day, and total storage climbs into the exabytes.

Delivery. Assume 1 billion hours watched/day at an average ~3 Mbps. That is 1e9 h x 3600 s/h x 3e6 bits/s ≈ 1.1 x 10^19 bits/day. Spread over 86,400 seconds, the average egress is roughly 125 Tbps, which this walkthrough rounds to ~150 Tbps - and multiples of that at peak. No origin serves this; this number is the entire argument for a CDN.

Transcoding compute. 1M videos x ~10 min x ~6 renditions ≈ 60 million rendition-minutes/day of CPU-heavy work - an elastic worker fleet, sized to the queue.

The shape: exabyte storage, ~150 Tbps of egress that only a CDN can carry, and a large elastic transcoding fleet.
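
The estimates above can be reproduced in a few lines. This is a minimal sketch that uses only the assumptions stated in this step; the constant names are illustrative. The computed average egress comes out near 125 Tbps, the same order of magnitude as the ~150 Tbps working figure used in the text.

```python
# Back-of-envelope scale estimate from the assumptions in Step 2.
UPLOADS_PER_DAY = 1_000_000
AVG_SOURCE_BYTES = 500e6        # ~500 MB per original
HOURS_WATCHED_PER_DAY = 1e9
AVG_BITRATE_BPS = 3e6           # ~3 Mbps average stream
AVG_VIDEO_MINUTES = 10
RENDITIONS = 6
SECONDS_PER_DAY = 86_400

# Ingest: bytes of originals per day, expressed in terabytes.
ingest_tb_per_day = UPLOADS_PER_DAY * AVG_SOURCE_BYTES / 1e12

# Delivery: total bits streamed per day, then the daily average rate.
bits_per_day = HOURS_WATCHED_PER_DAY * 3600 * AVG_BITRATE_BPS
avg_egress_tbps = bits_per_day / SECONDS_PER_DAY / 1e12

# Transcoding: one rendition-minute per minute of video per rendition.
rendition_minutes = UPLOADS_PER_DAY * AVG_VIDEO_MINUTES * RENDITIONS

print(f"ingest:      {ingest_tb_per_day:.0f} TB/day")
print(f"egress:      {avg_egress_tbps:.0f} Tbps average")
print(f"transcoding: {rendition_minutes:,} rendition-minutes/day")
```

Peak egress is a multiple of the average, so the CDN has to be provisioned for the peak rather than for this number.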


Step 3 - API and Data Model

Upload is a multi-step, resumable flow; playback is manifest-driven.

POST /api/videos                 -> { videoId, uploadUrl }     (initiate)
PUT  <uploadUrl>  (chunked, resumable, direct to object storage)
POST /api/videos/{id}/complete   -> 202 Accepted                (triggers processing)
 
GET  /api/videos/{id}/manifest   -> HLS .m3u8 / DASH .mpd       (rendition + segment list)
GET  <segment URL>  (served by the CDN)

| Element | Stored where |
| --- | --- |
| Video metadata | Metadata DB: videoId, uploader, title, status, duration, rendition list |
| Original file | Object storage - archived for future re-transcoding |
| Rendition segments | Object storage - many small immutable files per rendition |
| Manifest | Object storage - lists renditions and segment URLs |

The status field - uploading, processing, ready, failed - is what tells a viewer whether the video can be played, and it is the spine of the consistency model below.
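
The status lifecycle can be sketched as a small state machine. The four states come from the text; the allowed transitions below are an assumption (for example, both ready and failed are treated as terminal here), and the names `VideoStatus`, `can_transition`, `transition`, and `is_playable` are hypothetical.

```python
from enum import Enum


class VideoStatus(Enum):
    UPLOADING = "uploading"
    PROCESSING = "processing"
    READY = "ready"
    FAILED = "failed"


# Assumed transition table: a video only moves forward through the lifecycle.
ALLOWED = {
    VideoStatus.UPLOADING: {VideoStatus.PROCESSING, VideoStatus.FAILED},
    VideoStatus.PROCESSING: {VideoStatus.READY, VideoStatus.FAILED},
    VideoStatus.READY: set(),
    VideoStatus.FAILED: set(),
}


def can_transition(current: VideoStatus, target: VideoStatus) -> bool:
    """Return True if the lifecycle permits moving from current to target."""
    return target in ALLOWED[current]


def transition(current: VideoStatus, target: VideoStatus) -> VideoStatus:
    """Return the new status, or raise if the move skips a lifecycle step."""
    if not can_transition(current, target):
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def is_playable(status: VideoStatus) -> bool:
    """Viewers may only be handed a manifest once the video is ready."""
    return status is VideoStatus.READY


status = transition(VideoStatus.UPLOADING, VideoStatus.PROCESSING)
print(status.value, is_playable(status))
```

Rejecting a direct uploading-to-ready move is the point of the table: a video cannot become playable without passing through processing.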


Step 4 - High-Level Design

flowchart TD
    Up([Uploader]) -->|resumable chunks| US[Upload Service]
    US -->|original| OS[(Object Storage)]
    US -->|new-video event| Q[Transcoding Queue]
    Q --> Pipe[Transcoding Pipeline]
    Pipe -->|segments + manifests| OS
    Pipe -->|status: ready| Meta[(Metadata DB)]
    Viewer([Viewer]) -->|manifest request| MS[Metadata / Manifest Service]
    Viewer -->|segment requests| CDN[CDN Edge]
    CDN -->|miss| OS
    MS --> Meta

Figure 1. The architecture separates the upload-and-transcode write path from the manifest-and-segment read path. Both meet at object storage, which holds the originals and the transcoded segments; the CDN sits between viewers and the origin and serves almost all bytes from edge caches. The status flowing from pipeline back to the metadata DB is what tells viewers when a video is playable.

The upload service streams a resumable upload straight into object storage and emits an event. The transcoding pipeline turns the original into renditions and marks the video ready. Viewers fetch a manifest, then pull immutable segments from the CDN, which reaches back to object storage only on a cache miss.
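
To make "resumable" concrete, here is a minimal sketch of how a client might plan the byte ranges of a chunked upload and resume after a dropped connection. It assumes the storage endpoint can report how many bytes it has already stored and accepts ranges described by an HTTP `Content-Range` header; real resumable-upload protocols differ in detail, and the helper names are hypothetical.

```python
from typing import Iterator, Tuple


def chunk_ranges(total_size: int, chunk_size: int,
                 start_at: int = 0) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) byte ranges from start_at to the end of the file."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    offset = start_at
    while offset < total_size:
        end = min(offset + chunk_size, total_size) - 1
        yield offset, end
        offset = end + 1


def content_range(start: int, end: int, total_size: int) -> str:
    """Format the Content-Range header value for one chunk."""
    return f"bytes {start}-{end}/{total_size}"


# Fresh upload of a 10-byte file in 4-byte chunks.
fresh = list(chunk_ranges(10, 4))

# After a dropped connection, the server reports 8 bytes stored (assumed),
# so the client resumes from offset 8 instead of restarting.
resumed = list(chunk_ranges(10, 4, start_at=8))

print(fresh, resumed, content_range(*resumed[0], 10))
```

Only the unsent tail is retransmitted, which is what keeps a dropped connection from restarting a multi-gigabyte upload.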


Step 5 - Deep Dive: The Transcoding Pipeline, CDN, and Adaptive Bitrate

This is the core. Three subsystems carry it: the pipeline that produces the renditions, the CDN that delivers them, and the adaptive-bitrate scheme that keeps playback smooth.

Part A - Why transcode, and the pipeline

An uploaded video is one file, one codec, one resolution. Viewers are not: a phone on cellular, a laptop on wifi, and a 4K television all need different resolutions, bitrates, and sometimes codecs (H.264 for compatibility, H.265 or AV1 for efficiency). So the platform must produce a matrix of renditions.

Transcoding is slow and CPU-bound. Doing it as one job per video means a one-hour video occupies one worker for a long time and a crash restarts the whole thing. The pipeline instead exploits chunk-level parallelism:

flowchart LR
    Src[Original] --> Split[Split at keyframe boundaries]
    Split --> C1[Chunk 1]
    Split --> C2[Chunk 2]
    Split --> C3[Chunk N]
    C1 --> TW[Transcode workers<br/>chunk x rendition, in parallel]
    C2 --> TW
    C3 --> TW
    TW --> Asm[Assemble segments per rendition]
    Asm --> Man[Generate manifests]
    Man --> Pub[Publish -> status: ready]

Figure 2. The transcoding pipeline made parallel. The source is split at keyframe boundaries, every chunk is transcoded into every rendition in parallel across the elastic worker fleet, then segments are assembled per rendition and manifests written. A one-hour video split into sixty chunks can finish up to roughly sixty times faster than a single monolithic job, given enough workers and ignoring split and assembly overhead - and a worker crash only redoes one chunk.

With enough workers, wall-clock time approaches that of the slowest chunk plus the split and assembly overhead, rather than the length of the whole video. This is the durable-work-queue pattern of Part 3 and Part 8, with the senior twist that the unit of work is a chunk, which is what makes both speed and fault recovery cheap.
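
The fan-out can be sketched as one task per (chunk, rendition) pair, with results regrouped per rendition in chunk order. This is an illustration only: a thread pool stands in for the elastic worker fleet and queue, `transcode_chunk` is a placeholder for a real encoder invocation, and the three-rung ladder and segment naming are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from typing import Dict, List

RENDITIONS = ["360p", "720p", "1080p"]  # hypothetical rendition ladder


def transcode_chunk(chunk_index: int, rendition: str) -> str:
    """Placeholder for encoding one chunk into one rendition; returns the segment key."""
    return f"{rendition}/segment_{chunk_index:05d}.ts"


def transcode_video(num_chunks: int, renditions: List[str],
                    max_workers: int = 8) -> Dict[str, List[str]]:
    """Run every chunk x rendition task in parallel, then assemble per rendition."""
    # Tasks are ordered by chunk index, then rendition.
    tasks = list(product(range(num_chunks), renditions))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in task order, whatever order they finish in.
        segments = list(pool.map(lambda task: transcode_chunk(*task), tasks))

    by_rendition: Dict[str, List[str]] = {r: [] for r in renditions}
    for (_, rendition), segment in zip(tasks, segments):
        by_rendition[rendition].append(segment)
    return by_rendition


output = transcode_video(num_chunks=3, renditions=RENDITIONS)
print(output["720p"])
```

Because each task is independent, a failed task can be retried alone without touching the other chunks or renditions.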

Part B - Blob storage

Video files are large immutable blobs and belong in object storage, never a database: object storage is highly durable (replicated or erasure-coded for many nines of durability), cheap, and scales to exabytes. Originals are kept - archived to a cold, cheaper tier - because when a better codec like AV1 arrives you re-transcode from the original rather than from a lossy rendition. Segments live in object storage too, and since viewing follows a steep popularity curve, storage tiering keeps hot content on fast storage and the long tail on cheaper, slower tiers that can still serve an occasional cache miss.

Part C - The CDN

Streaming egress is ~150 Tbps; no origin serves that, and viewers are global while the origin is not. A CDN - thousands of edge locations - caches video segments close to viewers. A player fetches each segment from its nearest edge: a hit is served locally with low latency and zero origin load, and a miss has the edge fetch from object storage and cache the result.

This works cleanly because segments are immutable - a transcoded segment never changes - so CDN caching needs no invalidation, the same immutability dividend seen in Part 1 and Part 4. Popular content stays hot at the edge; the long tail occasionally misses to origin. For a known surge - a major new release - the CDN is pre-warmed, pushing content to edges before launch so the first million viewers all hit a warm cache.
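
The hit and miss paths can be sketched with a small in-memory cache. This is a toy model, assuming a least-recently-used eviction policy and a single edge; real CDNs use more elaborate policies and tiered caches. `EdgeCache` and `fetch_from_origin` are hypothetical names. Note that there is no invalidation path at all, which is only safe because segments are immutable.

```python
from collections import OrderedDict
from typing import Callable, List


class EdgeCache:
    """Toy LRU cache for immutable segments at one CDN edge."""

    def __init__(self, capacity: int, fetch_from_origin: Callable[[str], bytes]):
        self.capacity = capacity
        self.fetch_from_origin = fetch_from_origin
        self._store: "OrderedDict[str, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, segment_url: str) -> bytes:
        if segment_url in self._store:
            # Hit: serve locally and mark the segment as recently used.
            self._store.move_to_end(segment_url)
            self.hits += 1
            return self._store[segment_url]
        # Miss: fetch from object storage, cache it, evict the coldest entry if full.
        self.misses += 1
        data = self.fetch_from_origin(segment_url)
        self._store[segment_url] = data
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)
        return data


origin_requests: List[str] = []


def origin(url: str) -> bytes:
    origin_requests.append(url)
    return url.encode()


edge = EdgeCache(capacity=2, fetch_from_origin=origin)
for url in ["seg-a", "seg-a", "seg-b", "seg-c", "seg-a"]:
    edge.get(url)

print(edge.hits, edge.misses, origin_requests)
```

In the run above, the second request for `seg-a` is a hit, while the final one misses again because `seg-a` was evicted to make room for `seg-c`. That is the long-tail behaviour described in the text: cold content occasionally goes back to origin.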

Part D - Adaptive bitrate streaming

A viewer's bandwidth fluctuates, so a single fixed bitrate either rebuffers (too high) or looks poor (too low). Adaptive bitrate (ABR) streaming solves this. Every rendition is cut into short segments of a few seconds; a manifest (HLS .m3u8 or DASH .mpd) lists every rendition and its segments.
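
As an illustration of what the manifest contains, the sketch below builds a minimal HLS master playlist, in which each rendition is announced by an `#EXT-X-STREAM-INF` tag followed by the URI of that rendition's own playlist. The ladder values and URIs are hypothetical, and a production playlist would typically carry more attributes (codecs, frame rate, and so on).

```python
from typing import Dict, List

# Hypothetical rendition ladder: peak bandwidth in bits/s and frame size.
LADDER: List[Dict] = [
    {"bandwidth": 800_000, "width": 640, "height": 360, "uri": "360p/index.m3u8"},
    {"bandwidth": 2_500_000, "width": 1280, "height": 720, "uri": "720p/index.m3u8"},
    {"bandwidth": 5_000_000, "width": 1920, "height": 1080, "uri": "1080p/index.m3u8"},
]


def master_playlist(renditions: List[Dict]) -> str:
    """Render an HLS master playlist listing one variant stream per rendition."""
    lines = ["#EXTM3U"]
    for r in renditions:
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={r['bandwidth']},"
            f"RESOLUTION={r['width']}x{r['height']}"
        )
        lines.append(r["uri"])
    return "\n".join(lines) + "\n"


playlist = master_playlist(LADDER)
print(playlist)
```

The playlist is plain text and identical for every viewer, so it can be cached at the edge like any other static object.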

sequenceDiagram
    participant P as Player
    participant C as CDN
 
    P->>C: GET manifest
    C-->>P: renditions + segment list
    P->>C: GET segment 1 @ low rendition (fast start)
    C-->>P: segment 1
    Note over P: measure throughput + buffer
    P->>C: GET segment 2 @ higher rendition
    C-->>P: segment 2
    Note over P: bandwidth drops
    P->>C: GET segment 3 @ lower rendition
    C-->>P: segment 3

Figure 3. Adaptive bitrate streaming in action. The player fetches the manifest, then chooses each next segment's rendition based on its own throughput and buffer measurements - starting low for a fast first frame, ramping up, stepping down on congestion. The server stays completely stateless: it serves immutable segments and a static manifest and makes no per-viewer decisions.

The decisive point: the adaptation logic lives in the client and runs per segment. The player measures throughput and buffer level and picks the next segment's rendition; the server holds no per-viewer state. HLS and DASH are the two standard segmented formats; a platform typically offers both for device coverage.
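
A client-side selection rule can be sketched as follows. This is a deliberately simplified heuristic, not the algorithm of any particular player: it picks the highest bitrate that fits within a safety fraction of measured throughput, and drops to the lowest rung when the buffer is nearly empty. The `safety` and `low_buffer_s` defaults are assumptions chosen for illustration.

```python
from typing import List


def choose_bitrate(ladder_bps: List[int], throughput_bps: float, buffer_s: float,
                   safety: float = 0.8, low_buffer_s: float = 5.0) -> int:
    """Pick the bitrate of the next segment from the rendition ladder."""
    ladder = sorted(ladder_bps)
    # Nearly empty buffer: protect against a stall by taking the lowest rung.
    if buffer_s < low_buffer_s:
        return ladder[0]
    # Otherwise spend only a fraction of measured throughput, leaving headroom.
    budget = throughput_bps * safety
    affordable = [b for b in ladder if b <= budget]
    return affordable[-1] if affordable else ladder[0]


ladder = [800_000, 2_500_000, 5_000_000]
print(choose_bitrate(ladder, throughput_bps=4_000_000, buffer_s=20))   # mid rung
print(choose_bitrate(ladder, throughput_bps=10_000_000, buffer_s=20))  # top rung
print(choose_bitrate(ladder, throughput_bps=10_000_000, buffer_s=2))   # lowest rung
```

The function needs only local measurements and the ladder from the manifest, which is why the server side can stay stateless.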

Consistency model

A video has an explicit status lifecycle, and the system is eventually consistent between upload and playability: the upload returns fast, transcoding runs asynchronously, and the video becomes ready only when its renditions and manifest are published. Segments, once written, are immutable, which is what makes CDN delivery consistent for free. Metadata needs only one guarantee: the status flips to ready after the renditions and manifest are published, never before, so a viewer reliably sees processing until the video is genuinely playable.

Failure modes

  • Transcoding worker crash. Because the unit of work is a chunk, only that chunk is re-transcoded - at-least-once via the queue - not the whole video. This is the payoff of chunk-level granularity.
  • One rendition fails. Publish the renditions that succeeded so the video is watchable, and retry the failed one; a persistently bad source becomes a failed video after capped retries, the dead-letter idea from Part 3.
  • CDN edge down. Viewers route to the next-nearest edge - higher latency, not an outage.
  • New-release origin stampede. A viral premiere would flood the origin on cache misses; pre-warming plus the CDN's tiered caching (edge to regional to origin) absorbs it.
  • Object storage. Engineered for very high durability; originals are replicated so a re-transcode is always possible.
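
The first two failure modes share one mechanism: redeliver the failed task a bounded number of times, then give up and route it to a dead-letter path. A minimal sketch, assuming an in-process loop in place of a real queue and a cap of three attempts; the function and task names are hypothetical.

```python
from typing import Any, Callable, Tuple


def process_with_retries(task: str, handler: Callable[[str], Any],
                         max_attempts: int = 3) -> Tuple[str, Any]:
    """Return ("done", result) on success, or ("dead_letter", last_error) after the cap."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return "done", handler(task)
        except Exception as exc:  # a real worker would catch narrower error types
            last_error = exc
    return "dead_letter", last_error


# Demo: a handler that crashes twice and succeeds on the third delivery.
attempts = {"count": 0}


def flaky_transcode(task: str) -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("worker crashed")
    return f"{task}: transcoded"


def corrupt_source(task: str) -> str:
    raise ValueError("unreadable source")


recovered = process_with_retries("chunk-7/720p", flaky_transcode)
abandoned = process_with_retries("chunk-1/720p", corrupt_source)
print(recovered, abandoned[0])
```

At-least-once delivery means the same task may run more than once, so the handler should be idempotent. Writing an immutable segment to a deterministic key satisfies that naturally.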

Multi-region

The CDN is the multi-region delivery layer - distributing content globally is its entire purpose. Object storage is replicated across regions, transcoding compute runs wherever capacity is free, and the metadata DB is replicated. Uploads go to the nearest region and the original replicates outward. For global releases, content is pre-positioned to every region's edges ahead of time.

Evolution path

| Stage | Approach |
| --- | --- |
| Launch | Upload, a single transcode job, serve directly from origin |
| Growth | Chunked parallel transcoding pipeline, object storage, a CDN |
| Scale | Multi-codec renditions, storage tiering, CDN pre-warming, multi-region |

Build on object storage for blobs, the segmented ABR format, resumable upload, and the status lifecycle from day one - all four are structural and painful to retrofit. Defer multi-codec encoding, storage tiering, and pre-warming.

Observability

Track upload success rate, transcoding latency (upload to ready) at p50/p99, transcoding queue depth and failure rate, CDN cache hit ratio (the headline cost-and-performance metric), rebuffering ratio (the headline viewer-quality metric), startup time, egress bandwidth, and storage growth. Reasonable SLOs: 99% of videos ready within minutes of upload, p99 startup under 2 seconds, and a rebuffering ratio below 0.5%.
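
The headline metrics reduce to simple ratios. The sketch below assumes rebuffering ratio is defined as stalled time over total session time (definitions vary between teams) and uses a nearest-rank percentile; the sample numbers are invented purely to exercise the functions, and the thresholds are the SLOs quoted above.

```python
from typing import List


def cache_hit_ratio(hits: int, misses: int) -> float:
    """Fraction of segment requests served from the edge."""
    return hits / (hits + misses)


def rebuffering_ratio(stalled_s: float, played_s: float) -> float:
    """Stalled time as a fraction of total session time (assumed definition)."""
    return stalled_s / (stalled_s + played_s)


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


# Invented sample data for illustration only.
startup_times_s = [0.5] * 98 + [1.8, 3.0]
p99_startup = percentile(startup_times_s, 99)
rebuffer = rebuffering_ratio(stalled_s=30, played_s=9_970)
hit_ratio = cache_hit_ratio(hits=990, misses=10)

slo_met = p99_startup < 2.0 and rebuffer < 0.005
print(p99_startup, rebuffer, hit_ratio, slo_met)
```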


Step 6 - Bottlenecks and Trade-offs

  • Delivery egress at ~150 Tbps can only be served by a CDN - the defining constraint of the whole design.
  • Transcoding compute is heavy, handled by an elastic fleet plus chunk-level parallelism.
  • Storage growth into the exabytes is contained by tiering hot and cold content.
  • Transcoding latency from upload to ready is cut by transcoding chunks in parallel.
  • The new-release stampede on the origin is absorbed by CDN pre-warming and tiered caching.

Reference Architecture

The pattern this problem teaches, reusable well beyond video:

Ingest a large immutable asset, fan it out through a parallel processing pipeline into many derived immutable artifacts, store them in blob storage, and serve them globally through a CDN while the client adapts to its own conditions.

flowchart LR
    subgraph Ingest["Ingest - parallel pipeline"]
        I1[Large original] --> I2[Split into chunks]
        I2 --> I3[Parallel transcode]
        I3 --> I4[(Blob storage)]
    end
    subgraph Deliver["Deliver - global, client-adaptive"]
        D1[CDN edge] --> D2[Client picks bitrate]
    end
    I4 --> D1

Figure 4. The reference architecture stripped to its two halves: ingest a large immutable asset through a parallel processing pipeline into many derived immutable artifacts, then deliver them globally through a CDN while the client adapts to its own conditions. The same shape applies to image-processing pipelines, document conversion, and any large-media platform.

The shape is not specific to video: image-processing pipelines, document conversion, satellite-imagery processing, and ML pipelines over large blobs all follow it. In each case, immutable outputs are what make the cache layer trivial.


Common Mistakes in the Interview

  • Storing video blobs in a database instead of object storage.
  • Transcoding as one monolithic job, losing parallelism and re-doing everything on a crash.
  • Serving video from the origin with no CDN - physically impossible at streaming scale.
  • Server-side bitrate switching instead of client-driven, per-segment ABR.
  • Forgetting resumable upload, so a dropped connection restarts a multi-gigabyte upload.
  • Discarding the original, leaving no way to re-transcode for a future codec.
  • A synchronous upload-then-transcode flow with no status lifecycle.
  • Ignoring the new-release stampede on the CDN origin.

Quick Reference

| Topic | Key Point |
| --- | --- |
| Core pattern | Parallel transcoding pipeline + blob storage + CDN + client-side ABR |
| Storage | Object storage for blobs; keep originals; tier hot vs cold content |
| Transcoding | Split at keyframes, transcode chunks in parallel on an elastic fleet |
| CDN | Carries ~150 Tbps of egress; immutable segments need no invalidation |
| ABR | Renditions cut into segments; the client picks bitrate per segment |
| Upload | Resumable chunked upload, direct to object storage via pre-signed URLs |
| Consistency | Eventually consistent: status lifecycle uploading -> processing -> ready |
| Failure recovery | Chunk granularity - a crash re-transcodes one chunk, not the video |
| Hot content | Pre-warm the CDN before a known release to avoid an origin stampede |
| Multi-region | The CDN is the delivery layer; replicate storage, pre-position content |

This is Part 10 of a 12-part system design series where each post solves one problem around one core pattern. Next: Design a Payment System.
