Why use WebSocket instead of polling for a chat system?

HTTP polling forces the client to repeatedly ask whether new messages exist, wasting requests and adding latency equal to the poll interval. WebSocket establishes a single persistent, bidirectional connection so the server can push a message the instant it arrives. The cost is that connections are long-lived and stateful, which makes the connection servers stateful and complicates load balancing and failover.

How do you route a message between two users on different servers?

Each connection server registers its connected users in a shared connection registry that maps a user ID to the server holding their connection. When a message arrives, the sender's server persists it, looks up the recipient's server in the registry, and forwards the message over a pub/sub backplane to that server, which pushes it down the recipient's WebSocket. The registry plus the backplane are what let any server reach any connection.

How do you track online presence at scale?

Broadcasting every connect and disconnect to all of a user's contacts creates a presence storm at scale. Instead, a connection server marks a user online while their WebSocket is alive and refreshed by periodic heartbeats, storing presence with a short TTL so a crashed client expires to offline automatically. Presence is then queried on demand - only for the contacts a user is currently looking at - rather than pushed for every change.

How do you guarantee message ordering in a chat?

Ordering is needed only within a conversation, not globally. The message service assigns each message a monotonic per-conversation sequence number from a single sequencer for that conversation's partition, and clients sort by that number rather than by client clocks, which suffer from skew. Global ordering across all conversations is unnecessary and expensive, so it is deliberately not provided.

How are messages delivered to a user who is offline?

Every message is persisted to a durable store before any delivery attempt, so an offline recipient simply has messages waiting. When they reconnect, the client sends the sequence number of the last message it has, and the server streams everything after it. This same resync-by-sequence-number mechanism also recovers messages missed during a connection server crash.

Can a chat system guarantee exactly-once message delivery?

No - like any system crossing an unreliable network boundary, the achievable guarantee is at-least-once delivery plus deduplication. The client generates a unique message ID, retries sending until the server acknowledges persistence, and the server discards duplicates by that ID. Delivery to the recipient is acknowledged the same way, with redelivery on reconnect, producing an effectively-once result.

Design a Chat System: System Design Interview 2026

A chat system inverts the usual web architecture. Most services are stateless request-response; a chat system maintains hundreds of millions of persistent connections and must push a message from any one of them to any other within a couple of hundred milliseconds. The data model - messages in conversations - is simple. The difficulty is the connection layer: it is stateful, it is enormous, and any two users in a conversation are almost certainly attached to different servers that have no direct knowledge of each other.

This walkthrough assumes the 6-step system design framework and applies it at senior depth. It is Part 6 of a system design series.

The Problem
Step 1 - Clarify Requirements
Step 2 - Estimate Scale
Step 3 - API and Data Model
Step 4 - High-Level Design
Step 5 - Deep Dive: Real-Time Delivery and Connection Routing
Step 6 - Bottlenecks and Trade-offs
Reference Architecture
Common Mistakes in the Interview
Quick Reference
Related Articles

The Problem

We are designing a real-time chat system supporting one-to-one and group messaging, with delivery and read receipts, presence, and message history - the shape of WhatsApp, Messenger, or Slack.

The senior framing is that this is a routing problem over a stateful connection layer. Unlike every prior system in this series, the front tier is not stateless: a connection server owns the live WebSocket connections of the users attached to it. Delivering a message means finding which server holds the recipient and getting the message there - and staying correct when a server holding a hundred thousand connections suddenly dies.

Step 1 - Clarify Requirements

Functional requirements:

One-to-one messaging and group messaging.
Real-time delivery when the recipient is online.
Offline delivery: store messages for an offline recipient, deliver on reconnect.
Delivery and read receipts.
Presence: online / offline / last-seen.
Message history.

Out of scope (name, then defer): media and file storage, end-to-end encryption details, and voice/video calls.

Non-functional requirements:

Real-time latency. Delivery to an online recipient within ~200 ms.
Massive concurrent connections. Hundreds of millions of simultaneous persistent connections.
Reliability. No message is lost; the guarantee is at-least-once plus deduplication.
Per-conversation ordering. Messages in one conversation appear in a consistent order; global ordering is not required.

The clarifying questions that shape the design: the delivery guarantee is at-least-once with idempotency - exactly-once is not achievable, as established in Part 3. Ordering is per-conversation only. And group size matters: a small group and a 100,000-member broadcast channel need different fan-out, the same way Part 5's celebrity accounts did.

Step 2 - Estimate Scale

Connections. Assume 1 billion users and 500 million online at peak - so 500 million concurrent WebSocket connections. A single connection server, bounded by memory and file descriptors, holds on the order of 100,000 connections, so the connection layer needs roughly 5,000 servers.

Messages. At ~40 messages/user/day across 1 billion users, that is 40 billion messages/day, or ~460,000 messages/sec on average (40B / 86,400 s), with peaks past 2 million/sec.

Storage. At ~300 bytes per message, 40B/day is ~12 TB/day of message data, partitioned by conversation and retained per policy.

Connection memory. At ~10 KB of state per connection, 500M connections is ~5 TB of memory spread across the connection fleet.

Two numbers define the problem: 500 million persistent connections, and the fact that a message must hop between two arbitrary servers among 5,000.
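The estimates above can be reproduced with a few lines of arithmetic. This is only a sketch: every constant is one of the assumptions stated in this step, and the constant names are illustrative.

```python
# Back-of-envelope numbers; each input is an assumption taken from Step 2.
USERS = 1_000_000_000
ONLINE_AT_PEAK = 500_000_000
CONNS_PER_SERVER = 100_000
MSGS_PER_USER_PER_DAY = 40
BYTES_PER_MESSAGE = 300
STATE_BYTES_PER_CONN = 10_000  # ~10 KB, using decimal units
SECONDS_PER_DAY = 86_400

connection_servers = ONLINE_AT_PEAK // CONNS_PER_SERVER
messages_per_day = USERS * MSGS_PER_USER_PER_DAY
avg_messages_per_sec = messages_per_day / SECONDS_PER_DAY
storage_tb_per_day = messages_per_day * BYTES_PER_MESSAGE / 1e12
connection_memory_tb = ONLINE_AT_PEAK * STATE_BYTES_PER_CONN / 1e12

print(f"connection servers:    {connection_servers:,}")
print(f"messages per day:      {messages_per_day:,}")
print(f"avg messages per sec:  {avg_messages_per_sec:,.0f}")
print(f"storage per day (TB):  {storage_tb_per_day:.1f}")
print(f"connection memory (TB): {connection_memory_tb:.1f}")
```

Changing any single assumption (for example, connections per server) shows immediately how the fleet size moves.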

Step 3 - API and Data Model

Messaging does not use request-response REST - it uses a persistent WebSocket carrying typed frames: SEND, ACK, RECEIPT, PRESENCE, TYPING. A thin REST endpoint serves history: GET /conversations/{id}/messages?cursor=<opaque>.

| Entity | Key fields |
| --- | --- |
| Message | `conversationId`, `messageId` (client UUID), `seq` (per-conversation), `senderId`, `content`, `createdAt` |
| Conversation | `conversationId`, participants, type (1:1 / group) |
| Sync cursor | per user-conversation: last delivered and last read `seq` |
| Connection registry | `userId` -> connection server currently holding the socket |

Two IDs do two jobs. The messageId is generated by the client and used for deduplication - a retried send carries the same ID. The seq is a monotonic per-conversation sequence number assigned by the server and used for ordering and for resync. Messages are partitioned by conversationId.
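As a rough illustration of the two IDs, here is one way the SEND and ACK frames might look. The JSON encoding, the field names in snake_case, and the `encode`/`decode` helpers are assumptions made for this sketch; the article does not prescribe a wire format.

```python
import json
import uuid
from dataclasses import dataclass, asdict, field

FRAME_TYPES = {"SEND", "ACK", "RECEIPT", "PRESENCE", "TYPING"}


@dataclass
class SendFrame:
    conversation_id: str
    content: str
    # Client-generated; a retried send reuses the same value so the server can deduplicate.
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    type: str = "SEND"


@dataclass
class AckFrame:
    message_id: str
    seq: int  # server-assigned, monotonic within the conversation
    type: str = "ACK"


def encode(frame) -> str:
    """Serialize a frame dataclass to a JSON text frame (assumed wire format)."""
    return json.dumps(asdict(frame))


def decode(raw: str) -> dict:
    """Parse a JSON text frame and reject unknown frame types."""
    frame = json.loads(raw)
    if frame.get("type") not in FRAME_TYPES:
        raise ValueError(f"unknown frame type: {frame.get('type')!r}")
    return frame
```

The point of the split is visible in the types: the client owns `message_id`, while only the server can fill in `seq`.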

Step 4 - High-Level Design

flowchart TD
    CA([User A]) <-->|WebSocket| SA[Connection Server A]
    CB([User B]) <-->|WebSocket| SB[Connection Server B]
    SA --> MS[Message Service]
    MS --> Store[(Message Store<br/>partitioned by conversation)]
    SA -->|lookup recipient server| Reg[(Connection Registry)]
    SB -->|register / heartbeat| Reg
    SA -->|forward| BP[Pub/Sub Backplane]
    BP --> SB
    SB -.push.-> CB
    SA -.presence.-> Pres[(Presence Store - TTL)]

Figure 1. The chat architecture's distinguishing feature: a stateful connection layer (WebSocket servers) that owns persistent client connections, rather than a stateless fleet. The registry maps user to server so any sender can find any recipient; the pub/sub backplane carries messages between those two servers; and the durable message store sits behind both, which is what makes a dropped connection a resync rather than a loss.

The connection layer is a fleet of stateful WebSocket servers. The connection registry maps each user to the server holding their socket. The message service persists every message to the durable store - the source of truth - before delivery. The pub/sub backplane carries a message from the sender's server to the recipient's server. Presence lives in a TTL-backed store. Every message is durable first and delivered second, which is what makes a dropped connection a resync rather than a loss.

Step 5 - Deep Dive: Real-Time Delivery and Connection Routing

This is the core. Five things make real-time chat work: the transport, cross-server routing, group fan-out, presence, and ordered reliable delivery.

Part A - The transport

The options form a clear ladder. HTTP polling has the client repeatedly ask "anything new?" - wasteful, and latency equals the poll interval. Long polling holds a request open until data arrives - better, but it is one request per message with constant connection churn. Server-Sent Events push server-to-client only, which chat's bidirectional traffic outgrows. WebSocket is a single persistent, bidirectional connection over which the server pushes instantly - the correct choice.

The price is architectural: a WebSocket is long-lived and stateful, so the connection server is stateful. That single fact drives the registry, the backplane, and the failover story below - and it is the deliberate trade a senior candidate names rather than glosses over.

Part B - Connection routing

With 500M connections across 5,000 servers, the sender and recipient almost never share a server. Routing works in four steps:

1. The sender's server persists the message via the message service - durability first.
2. It looks up the recipient in the connection registry to find their server.
3. It forwards the message over the pub/sub backplane to that server.
4. The recipient's server pushes the message down the recipient's WebSocket.

sequenceDiagram
    participant A as User A
    participant SA as Server A
    participant MS as Message Service
    participant R as Registry
    participant BP as Backplane
    participant SB as Server B
    participant B as User B
 
    A->>SA: SEND (messageId, content)
    SA->>MS: persist + assign seq
    MS-->>SA: stored (seq)
    SA-->>A: ACK (persisted)
    SA->>R: where is User B?
    R-->>SA: Server B
    SA->>BP: forward message -> Server B
    BP->>SB: deliver
    SB->>B: push over WebSocket
    B->>SB: RECEIPT (delivered)

Figure 2. A message routing across two connection servers - the canonical case at scale. The sender's server persists first, then asks the registry where the recipient lives, then forwards over the backplane to that server, which pushes the message down its socket. Durability happens before any network hand-off, which is what makes the system tolerant of mid-flight failures.

Each connection server registers its users in the registry on connect and removes them on disconnect; entries carry a TTL so a crashed server's stale entries expire. The backplane decouples the 5,000 servers from one another - a server subscribes for the traffic destined to its connections and need not know the rest of the fleet.
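The four routing steps and the TTL-backed registry can be sketched in a single process. Everything here is a stand-in chosen for illustration: `ConnectionRegistry`, `Backplane`, and `ConnectionServer` are hypothetical classes, a Python list plays the durable store, and a list per user plays the WebSocket. A real deployment would use a shared key-value store and a real pub/sub system.

```python
import time
from collections import defaultdict


class ConnectionRegistry:
    """userId -> server, with a TTL so entries from a crashed server expire."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # user_id -> (server_id, expires_at)

    def register(self, user_id, server_id):
        self._entries[user_id] = (server_id, self._clock() + self._ttl)

    def unregister(self, user_id):
        self._entries.pop(user_id, None)

    def lookup(self, user_id):
        entry = self._entries.get(user_id)
        if entry is None:
            return None
        server_id, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[user_id]  # stale entry: treat the user as offline
            return None
        return server_id


class Backplane:
    """In-process stand-in for pub/sub: one channel per connection server."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, server_id, handler):
        self._subscribers[server_id] = handler

    def publish(self, server_id, message):
        handler = self._subscribers.get(server_id)
        if handler is not None:
            handler(message)


class ConnectionServer:
    def __init__(self, server_id, registry, backplane, store):
        self.server_id = server_id
        self.registry = registry
        self.backplane = backplane
        self.store = store  # list standing in for the durable message store
        self.sockets = defaultdict(list)  # user_id -> messages pushed to that user
        backplane.subscribe(server_id, self._on_backplane)

    def connect(self, user_id):
        self.registry.register(user_id, self.server_id)

    def send(self, sender_id, recipient_id, content):
        message = {"from": sender_id, "to": recipient_id, "content": content}
        self.store.append(message)  # 1. persist first
        target = self.registry.lookup(recipient_id)  # 2. find the recipient's server
        if target is not None:
            self.backplane.publish(target, message)  # 3. forward over the backplane
        # No target means the recipient is offline; the stored copy waits for resync.

    def _on_backplane(self, message):
        self.sockets[message["to"]].append(message)  # 4. push down the socket
```

Note that `send` never talks to another server directly: it only knows the registry and the backplane, which is the decoupling described above.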

Part C - Group messaging and fan-out

A 1:1 message is one delivery. A group of N members is N deliveries: the message is stored once per conversation, and the fan-out is purely a delivery concern - route a copy to each member's connection server, store for those offline.

Large broadcast channels - a Slack channel with 100,000 members - are the celebrity problem from Part 5 in a new guise. Do not synchronously fan out to 100,000 connections. Push to the online members, persist for the offline majority, and for very large channels let clients pull on read rather than receiving a push at all.
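A minimal sketch of that delivery decision follows. The function name `plan_fan_out` and the threshold value are hypothetical; the article gives no specific cut-off for what counts as a very large channel, and the message is assumed to be persisted already.

```python
# Hypothetical cut-off; the right value depends on the system and is not given in the text.
LARGE_CHANNEL_THRESHOLD = 10_000


def plan_fan_out(member_ids, online_ids, sender_id,
                 large_channel_threshold=LARGE_CHANNEL_THRESHOLD):
    """Decide how an already-persisted message reaches each member of a conversation."""
    recipients = [m for m in member_ids if m != sender_id]
    if len(member_ids) >= large_channel_threshold:
        # Very large channel: nobody gets a push; clients pull when they open it.
        return {"push": [], "store_for_resync": [], "pull_on_read": recipients}
    return {
        # Online members get a copy routed to their connection server.
        "push": [m for m in recipients if m in online_ids],
        # Offline members rely on the stored message and resync on reconnect.
        "store_for_resync": [m for m in recipients if m not in online_ids],
        "pull_on_read": [],
    }
```

Because the message is stored once per conversation, all three outcomes read the same row; only the delivery path differs.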

Part D - Presence

Presence is deceptively expensive. Broadcasting every connect and disconnect to all of a user's contacts produces a presence storm - at 500M users churning connections, the fan-out dwarfs the actual messaging traffic.

The scalable approach has two parts. A connection server knows a user is online while their socket is alive, refreshed by a periodic heartbeat; presence is written to a store with a short TTL, so a crashed client silently expires to offline with no explicit event. And presence is read on demand - when a user opens a chat or their contact list, the client queries presence for just those users - rather than pushed on every change. "Last seen" is simply a timestamp written on disconnect.
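The heartbeat-plus-TTL idea fits in a small class. This is an in-memory sketch with hypothetical names (`PresenceStore`, `heartbeat`, `query`); in practice the same behaviour usually comes from a key-value store with native key expiry. The 30-second TTL is an assumed value.

```python
import time


class PresenceStore:
    """Online state with a TTL: a client that stops heartbeating expires to offline."""

    def __init__(self, ttl_seconds=30.0, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at = {}  # user_id -> time after which the user counts as offline
        self._last_seen = {}   # user_id -> timestamp written on clean disconnect

    def heartbeat(self, user_id):
        # Called on connect and on every periodic heartbeat from the socket.
        self._expires_at[user_id] = self._clock() + self._ttl

    def disconnect(self, user_id):
        if self._expires_at.pop(user_id, None) is not None:
            self._last_seen[user_id] = self._clock()

    def is_online(self, user_id):
        expires_at = self._expires_at.get(user_id)
        return expires_at is not None and self._clock() < expires_at

    def query(self, user_ids):
        """On-demand read for just the contacts currently on screen."""
        return {user_id: self.is_online(user_id) for user_id in user_ids}

    def last_seen(self, user_id):
        return self._last_seen.get(user_id)
```

Nothing is broadcast here: state changes are writes to the store, and readers ask only about the users they are looking at.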

Ordering and reliable delivery

Ordering uses the per-conversation seq. A conversation's partition has a single sequencer, so its messages get a clean monotonic order; clients sort by seq, never by client timestamps, which drift with clock skew. Global cross-conversation ordering is not provided - it is expensive and no one needs it.
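To make the difference between `seq` and client timestamps concrete, here is a small sketch. `ConversationSequencer` and `display_order` are hypothetical names, and the in-memory counter stands in for whatever the partition's single sequencer really is.

```python
from collections import defaultdict


class ConversationSequencer:
    """Single sequencer for the conversations owned by one partition."""

    def __init__(self):
        self._last_seq = defaultdict(int)

    def next_seq(self, conversation_id):
        # Each conversation has its own counter; there is no global order.
        self._last_seq[conversation_id] += 1
        return self._last_seq[conversation_id]


def display_order(messages):
    """Client-side ordering: sort by server seq, never by client timestamps."""
    return sorted(messages, key=lambda m: m["seq"])


sequencer = ConversationSequencer()
# The first sender's clock runs ahead, so its client timestamp is the larger one.
first = {"seq": sequencer.next_seq("c1"), "client_ts": 2000, "content": "question"}
second = {"seq": sequencer.next_seq("c1"), "client_ts": 1000, "content": "answer"}
ordered = display_order([second, first])
```

Sorting `ordered` by `client_ts` instead would put the answer before the question, which is exactly the skew problem the sequencer avoids.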

Reliability is at-least-once plus deduplication. The client retries SEND with the same messageId until it gets a persistence ACK; the server discards duplicate IDs. Delivery to the recipient is acknowledged with a RECEIPT; if it does not arrive, the message is redelivered on reconnect. Receipts - sent, delivered, read - are themselves small messages flowing back through the same pipeline.
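The send half of that contract can be sketched as a server that is idempotent on `messageId` and a client that retries until it sees an ACK. `MessageService` and `send_until_acked` are hypothetical names, and the dictionary stands in for the durable store.

```python
class MessageService:
    """Persists messages and deduplicates retries by client messageId."""

    def __init__(self):
        self._by_message_id = {}  # message_id -> stored message
        self._last_seq = {}       # conversation_id -> last assigned seq

    def persist(self, conversation_id, message_id, content):
        existing = self._by_message_id.get(message_id)
        if existing is not None:
            # Duplicate send: store nothing new, re-ACK with the original seq.
            return existing["seq"]
        seq = self._last_seq.get(conversation_id, 0) + 1
        self._last_seq[conversation_id] = seq
        self._by_message_id[message_id] = {
            "conversation_id": conversation_id,
            "seq": seq,
            "content": content,
        }
        return seq

    def message_count(self):
        return len(self._by_message_id)


def send_until_acked(persist, conversation_id, message_id, content, max_attempts=5):
    """Client side: retry with the SAME message_id until a persistence ACK arrives."""
    for _ in range(max_attempts):
        try:
            return persist(conversation_id, message_id, content)
        except ConnectionError:
            continue  # a real client would back off before retrying
    raise ConnectionError("no persistence ACK after retries")
```

The interesting case is a lost ACK: the message was stored, the client does not know it, and the retry must not create a second copy.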

Offline delivery and resync share one mechanism: every message is durable before delivery, so a reconnecting client sends its last known seq per conversation and the server streams everything after it. The same resync recovers messages missed during a server crash.
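Resync by sequence number reduces to one query per conversation. In this sketch `messages_after` and `resync` are hypothetical helpers, and a plain list of dictionaries stands in for the partitioned message store.

```python
def messages_after(store, conversation_id, last_seq):
    """Everything in one conversation with seq greater than last_seq, in seq order."""
    return sorted(
        (m for m in store
         if m["conversation_id"] == conversation_id and m["seq"] > last_seq),
        key=lambda m: m["seq"],
    )


def resync(store, cursors):
    """cursors maps conversationId -> the last seq the reconnecting client holds."""
    return {
        conversation_id: messages_after(store, conversation_id, last_seq)
        for conversation_id, last_seq in cursors.items()
    }


store = [
    {"conversation_id": "c1", "seq": 1, "content": "a"},
    {"conversation_id": "c1", "seq": 2, "content": "b"},
    {"conversation_id": "c1", "seq": 3, "content": "c"},
    {"conversation_id": "c2", "seq": 1, "content": "x"},
]
# The client already has c1 up to seq 1 and has never seen anything in c2.
missed = resync(store, {"c1": 1, "c2": 0})
```

The server does not need to know why the client fell behind; being offline and a server crash look identical from this side.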

Consistency model

Per-conversation ordering is strongly consistent via the single sequencer; across conversations there is no ordering guarantee, by design. Presence is eventually consistent and approximate - a crashed client reads as online until its TTL lapses. Message delivery is at-least-once plus dedup, which users observe as effectively-once.

Failure modes

Connection server crash. Its ~100,000 sockets all drop. Clients detect the dead socket, reconnect through the load balancer onto a different server, re-register, and resync by seq. Because the message store is the source of truth, nothing is lost - this is the central failure case, and statefulness is what makes it interesting.
Reconnect storm. When a server dies, its 100,000 clients all try to reconnect at once, hitting the rest of the fleet and the registry. Clients must reconnect with backoff and jitter - the Part 3 discipline.
Stale registry entry. A registry record pointing at a dead server causes a forward into the void, but the message is already persisted, so the recipient gets it on resync. TTLs keep the registry self-cleaning.
Backplane outage. Real-time push stops, yet messages still persist; on recovery, delivery resumes and clients resync. Degraded, not lost.
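For the reconnect storm, one common way to combine backoff and jitter is to pick a uniformly random delay under an exponentially growing ceiling (often called "full jitter"). The article only says "backoff and jitter", so this variant, the `reconnect_delay` name, and the base and cap values are assumptions.

```python
import random


def reconnect_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    The ceiling doubles each attempt up to `cap`; the actual delay is drawn
    uniformly below it so clients dropped together do not retry together.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


# Delays one client might use for its first five attempts (seeded for repeatability).
rng = random.Random(42)
delays = [reconnect_delay(attempt, rng=rng) for attempt in range(5)]
```

Without the random component, all 100,000 clients from one dead server would retry on the same schedule and simply move the spike a few seconds later.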

Multi-region

Users connect to the nearest region. A conversation has a home region that owns its sequencer, keeping seq authoritative; a cross-region conversation simply pays extra latency to reach that sequencer. The message store is replicated, and a global cross-region stream on the backplane carries messages to a recipient connected in another region.

Evolution path

| Stage | Approach |
| --- | --- |
| Launch | One server, an in-memory connection map, WebSocket |
| Growth | Multiple connection servers, a shared registry, a pub/sub backplane |
| Scale | Thousands of connection servers, TTL + on-demand presence, large-group fan-out hybrid, multi-region |

Build the client messageId, the per-conversation seq, and the resync-by-sequence protocol from day one - they are the contract every reliability and recovery property depends on. Defer presence sophistication, large-group hybrids, and multi-region.

Observability

Track concurrent connections per server, connection churn rate, end-to-end delivery latency p99, backplane lag, registry lookup latency, reconnect-storm spikes, and offline-resync volume. A reasonable SLO: 99% of messages delivered to an online recipient within 500 ms, a looser tail bound than the ~200 ms typical-case target from Step 1.
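Checking that SLO against a window of latency samples is a percentile calculation. The sketch below uses the nearest-rank definition of a percentile, which is one of several in common use; `percentile` and `meets_slo` are hypothetical helper names.

```python
def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def meets_slo(latencies_ms, threshold_ms=500, pct=99):
    """True when the pct-th percentile delivery latency is within the threshold."""
    return percentile(latencies_ms, pct) <= threshold_ms
```

A window with one slow delivery in a hundred still passes a p99 target, while two in a hundred do not.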

Step 6 - Bottlenecks and Trade-offs

Connection count makes the front tier stateful and memory-bound - hence thousands of servers and a registry, instead of a stateless fleet.
Cross-server routing is bounded by the backplane's throughput, so it must be partitioned.
Presence fan-out would dominate all other traffic if pushed; TTL heartbeats plus on-demand reads contain it.
Large-group fan-out repeats the celebrity problem and needs the push/store/pull hybrid.
Connection server failover is the defining hard case - clients reconnect and resync, and statefulness is precisely what makes it non-trivial.

Reference Architecture

The pattern this problem teaches, reusable well beyond chat:

A stateful connection layer holding persistent client connections, a registry mapping clients to connection servers, and a pub/sub backplane routing events between servers - all backed by a durable store, so a dropped connection means resync, not data loss.

flowchart LR
    subgraph Conn["Stateful connection layer"]
        direction TB
        S1[Connection server]
        S2[Connection server]
    end
    S1 <-->|registry lookup| Reg[(Connection registry)]
    S2 <-->|registry lookup| Reg
    S1 <-->|route events| BP[Pub/Sub backplane]
    S2 <-->|route events| BP
    Conn --> Durable[(Durable message store)]

Figure 3. The reference architecture for any system that pushes real-time events to many connected clients. A stateful edge layer holds the connections; a registry tells servers where to find each other; a backplane routes events between them; a durable store underwrites the whole thing. This shape recurs in live notifications, collaborative editing, multiplayer game state, and the live driver-rider link of a ride-sharing service.

A stateful edge, a registry, a backplane, and a durable backing store form the toolkit for "deliver events to whichever server currently holds the client".

Common Mistakes in the Interview

Using polling, or choosing WebSocket without naming the statefulness it imposes.
No cross-server routing story - failing to explain how a message reaches a recipient on another server.
Broadcasting every presence change, producing a presence storm that dwarfs real traffic.
Ordering by client timestamps instead of a server-assigned per-conversation sequence.
Forgetting offline delivery and the reconnect-and-resync protocol.
Treating a 100,000-member channel like a small group with no fan-out hybrid.
Ignoring connection-server failover and the reconnect storm it triggers.
Claiming exactly-once delivery instead of at-least-once plus deduplication.

Quick Reference

| Topic | Key Point |
| --- | --- |
| Transport | WebSocket - persistent, bidirectional; makes connection servers stateful |
| Connection layer | Thousands of stateful servers, ~100k connections each |
| Routing | Registry maps user -> server; pub/sub backplane forwards between servers |
| Durability | Persist every message before delivery; the store is the source of truth |
| Group fan-out | Store once; push to online, store for offline; pull for huge channels |
| Presence | Heartbeat + TTL for online state; query on demand, never broadcast all |
| Ordering | Per-conversation `seq` from a single sequencer; never client clocks |
| Delivery | At-least-once + dedup by client `messageId`; receipts via the same path |
| Offline / recovery | Reconnect, send last `seq`, server streams everything after it |
| Failover | Connection server crash -> clients reconnect (backoff + jitter) and resync |

Related Articles

System Design Interview Problems: A Senior's Roadmap - the full series index and pattern library.
System Design Interview Guide: The 6-Step Framework - the method this walkthrough applies.
Design a Notification Service - Part 3; at-least-once delivery, deduplication, and backoff with jitter.
Design a News Feed - Part 5; the large-group fan-out is the same celebrity problem.
Design a Ride-Sharing Service - Part 7; the connection layer reused for live location streaming.
WebSockets Interview Questions - the transport behind the connection layer in depth.

This is Part 6 of a 12-part system design series where each post solves one problem around one core pattern. Next: Design a Ride-Sharing Service.
