Design a Chat System: System Design Interview 2026

14 min read
Tags: system-design, websockets, real-time, architecture, backend, interview-preparation

A chat system inverts the usual web architecture. Most services are stateless request-response; a chat system maintains hundreds of millions of persistent connections and must push a message from any one of them to any other within a couple of hundred milliseconds. The data model - messages in conversations - is simple. The difficulty is the connection layer: it is stateful, it is enormous, and any two users in a conversation are almost certainly attached to different servers that have no direct knowledge of each other.

This walkthrough assumes the 6-step system design framework and applies it at senior depth. It is Part 6 of a system design series.

Table of Contents

  1. The Problem
  2. Step 1 - Clarify Requirements
  3. Step 2 - Estimate Scale
  4. Step 3 - API and Data Model
  5. Step 4 - High-Level Design
  6. Step 5 - Deep Dive: Real-Time Delivery and Connection Routing
  7. Step 6 - Bottlenecks and Trade-offs
  8. Reference Architecture
  9. Common Mistakes in the Interview
  10. Quick Reference
  11. Related Articles

The Problem

We are designing a real-time chat system supporting one-to-one and group messaging, with delivery and read receipts, presence, and message history - the shape of WhatsApp, Messenger, or Slack.

The senior framing is that this is a routing problem over a stateful connection layer. Unlike every prior system in this series, the front tier is not stateless: a connection server owns the live WebSocket connections of the users attached to it. Delivering a message means finding which server holds the recipient and getting the message there - and staying correct when a server holding a hundred thousand connections suddenly dies.


Step 1 - Clarify Requirements

Functional requirements:

  • One-to-one messaging and group messaging.
  • Real-time delivery when the recipient is online.
  • Offline delivery: store messages for an offline recipient, deliver on reconnect.
  • Delivery and read receipts.
  • Presence: online / offline / last-seen.
  • Message history.

Out of scope (name, then defer): media and file storage, end-to-end encryption details, and voice/video calls.

Non-functional requirements:

  • Real-time latency. Delivery to an online recipient within ~200 ms.
  • Massive concurrent connections. Hundreds of millions of simultaneous persistent connections.
  • Reliability. No message is lost; the guarantee is at-least-once plus deduplication.
  • Per-conversation ordering. Messages in one conversation appear in a consistent order; global ordering is not required.

The clarifying questions that shape the design: the delivery guarantee is at-least-once with idempotency - exactly-once is not achievable, as established in Part 3. Ordering is per-conversation only. And group size matters: a small group and a 100,000-member broadcast channel need different fan-out, the same way Part 5's celebrity accounts did.


Step 2 - Estimate Scale

Connections. Assume 1 billion users and 500 million online at peak - so 500 million concurrent WebSocket connections. A single connection server, bounded by memory and file descriptors, holds on the order of 100,000 connections, so the connection layer needs roughly 5,000 servers.

Messages. At ~40 messages/user/day across 1 billion users, that is 40 billion messages/day, or roughly 460,000 messages/sec on average, with peaks past 2 million/sec.

Storage. At ~300 bytes per message, 40B/day is ~12 TB/day of message data, partitioned by conversation and retained per policy.

Connection memory. At ~10 KB of state per connection, 500M connections is ~5 TB of memory spread across the connection fleet.

Two numbers define the problem: 500 million persistent connections, and the fact that a message must hop between two arbitrary servers among 5,000.
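The arithmetic behind these estimates is easy to re-derive. A minimal sketch using only the assumptions stated above (decimal units, so 1 KB = 1,000 bytes; the variable names are illustrative):

```python
# Back-of-envelope check of the Step 2 estimates.
# Every input is an assumption stated in the text, not a measurement.
USERS = 1_000_000_000
PEAK_ONLINE = 500_000_000          # concurrent WebSocket connections
CONNS_PER_SERVER = 100_000
MSGS_PER_USER_PER_DAY = 40
BYTES_PER_MESSAGE = 300
STATE_BYTES_PER_CONN = 10_000      # ~10 KB of state per connection
SECONDS_PER_DAY = 86_400

connection_servers = PEAK_ONLINE // CONNS_PER_SERVER
messages_per_day = USERS * MSGS_PER_USER_PER_DAY
avg_messages_per_sec = messages_per_day / SECONDS_PER_DAY
storage_tb_per_day = messages_per_day * BYTES_PER_MESSAGE / 1e12
connection_memory_tb = PEAK_ONLINE * STATE_BYTES_PER_CONN / 1e12

print(f"connection servers:   {connection_servers:,}")        # 5,000
print(f"avg messages/sec:     {avg_messages_per_sec:,.0f}")   # 462,963
print(f"storage per day (TB): {storage_tb_per_day:.0f}")      # 12
print(f"connection RAM (TB):  {connection_memory_tb:.0f}")    # 5
```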


Step 3 - API and Data Model

Messaging does not use request-response REST - it uses a persistent WebSocket carrying typed frames: SEND, ACK, RECEIPT, PRESENCE, TYPING. A thin REST endpoint serves history: GET /conversations/{id}/messages?cursor=<opaque>.

| Entity | Key fields |
| --- | --- |
| Message | conversationId, messageId (client UUID), seq (per-conversation), senderId, content, createdAt |
| Conversation | conversationId, participants, type (1:1 / group) |
| Sync cursor | per user-conversation: last delivered and last read seq |
| Connection registry | userId -> connection server currently holding the socket |

Two IDs do two jobs. The messageId is generated by the client and used for deduplication - a retried send carries the same ID. The seq is a monotonic per-conversation sequence number assigned by the server and used for ordering and for resync. Messages are partitioned by conversationId.
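To make the split between the two IDs concrete, here is a rough sketch of a SEND frame and the stored message it becomes. The JSON wire encoding, the class names, and the `to_stored` helper are assumptions for illustration; the text does not specify a wire format.

```python
import json
import uuid
from dataclasses import asdict, dataclass
from enum import Enum


class FrameType(str, Enum):
    SEND = "SEND"
    ACK = "ACK"
    RECEIPT = "RECEIPT"
    PRESENCE = "PRESENCE"
    TYPING = "TYPING"


@dataclass(frozen=True)
class SendFrame:
    conversation_id: str
    message_id: str          # client-generated; identical on every retry
    content: str
    type: FrameType = FrameType.SEND


@dataclass(frozen=True)
class StoredMessage:
    conversation_id: str     # partition key
    message_id: str          # deduplication key
    seq: int                 # server-assigned, per conversation; ordering key
    sender_id: str
    content: str
    created_at: float


def new_send_frame(conversation_id: str, content: str) -> SendFrame:
    """The client mints the messageId once, before the first send attempt."""
    return SendFrame(conversation_id, str(uuid.uuid4()), content)


def encode(frame: SendFrame) -> str:
    """Hypothetical JSON wire encoding of a frame."""
    return json.dumps(asdict(frame))


def to_stored(frame: SendFrame, sender_id: str, seq: int, created_at: float) -> StoredMessage:
    """The server adds what only it can know: the sender, the seq, the timestamp."""
    return StoredMessage(frame.conversation_id, frame.message_id, seq,
                         sender_id, frame.content, created_at)
```

A retry re-sends the same `SendFrame` object, so the encoded bytes and the `message_id` are unchanged; only the server ever assigns `seq`.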


Step 4 - High-Level Design

flowchart TD
    CA([User A]) <-->|WebSocket| SA[Connection Server A]
    CB([User B]) <-->|WebSocket| SB[Connection Server B]
    SA --> MS[Message Service]
    MS --> Store[(Message Store<br/>partitioned by conversation)]
    SA -->|lookup recipient server| Reg[(Connection Registry)]
    SB -->|register / heartbeat| Reg
    SA -->|forward| BP[Pub/Sub Backplane]
    BP --> SB
    SB -.push.-> CB
    SA -.presence.-> Pres[(Presence Store - TTL)]

Figure 1. The chat architecture's distinguishing feature: a stateful connection layer (WebSocket servers) that owns persistent client connections, rather than a stateless fleet. The registry maps user to server so any sender can find any recipient; the pub/sub backplane carries messages between those two servers; and the durable message store sits behind both, which is what makes a dropped connection a resync rather than a loss.

The connection layer is a fleet of stateful WebSocket servers. The connection registry maps each user to the server holding their socket. The message service persists every message to the durable store - the source of truth - before delivery. The pub/sub backplane carries a message from the sender's server to the recipient's server. Presence lives in a TTL-backed store. Every message is durable first and delivered second, which is what makes a dropped connection a resync rather than a loss.


Step 5 - Deep Dive: Real-Time Delivery and Connection Routing

This is the core. Five things make real-time chat work: the transport, cross-server routing, group fan-out, presence, and ordered reliable delivery.

Part A - The transport

The options form a clear ladder. HTTP polling has the client repeatedly ask "anything new?" - wasteful, and latency equals the poll interval. Long polling holds a request open until data arrives - better, but it is one request per message with constant connection churn. Server-Sent Events push server-to-client only, which chat's bidirectional traffic outgrows. WebSocket is a single persistent, bidirectional connection over which the server pushes instantly - the correct choice.

The price is architectural: a WebSocket is long-lived and stateful, so the connection server is stateful. That single fact drives the registry, the backplane, and the failover story below - and it is the deliberate trade a senior candidate names rather than glosses over.

Part B - Connection routing

With 500M connections across 5,000 servers, the sender and recipient almost never share a server. Routing works in four steps:

  1. The sender's server persists the message via the message service - durability first.
  2. It looks up the recipient in the connection registry to find their server.
  3. It forwards the message over the pub/sub backplane to that server.
  4. The recipient's server pushes the message down the recipient's WebSocket.

sequenceDiagram
    participant A as User A
    participant SA as Server A
    participant MS as Message Service
    participant R as Registry
    participant BP as Backplane
    participant SB as Server B
    participant B as User B
 
    A->>SA: SEND (messageId, content)
    SA->>MS: persist + assign seq
    MS-->>SA: stored (seq)
    SA-->>A: ACK (persisted)
    SA->>R: where is User B?
    R-->>SA: Server B
    SA->>BP: forward message -> Server B
    BP->>SB: deliver
    SB->>B: push over WebSocket
    B->>SB: RECEIPT (delivered)

Figure 2. A message routing across two connection servers - the canonical case at scale. The sender's server persists first, then asks the registry where the recipient lives, then forwards over the backplane to that server, which pushes the message down its socket. Durability happens before any network hand-off, which is what makes the system tolerant of mid-flight failures.

Each connection server registers its users in the registry on connect and removes them on disconnect; entries carry a TTL so a crashed server's stale entries expire. The backplane decouples the 5,000 servers from one another - a server subscribes for the traffic destined to its connections and need not know the rest of the fleet.
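The four routing steps and the TTL-backed registry can be sketched in a single process. This is an in-memory illustration only: `Registry`, `Backplane`, and `ConnectionServer` are hypothetical names, a Python list stands in for the message service, and a list per user stands in for a WebSocket. A real deployment would put a networked registry and pub/sub system behind the same interfaces.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class Registry:
    """userId -> (serverId, expiry). Expired entries read as 'not connected'."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def register(self, user_id: str, server_id: str) -> None:
        # Called on connect and again on each heartbeat to refresh the TTL.
        self._entries[user_id] = (server_id, self._clock() + self._ttl)

    def unregister(self, user_id: str) -> None:
        self._entries.pop(user_id, None)

    def lookup(self, user_id: str) -> Optional[str]:
        entry = self._entries.get(user_id)
        if entry is None:
            return None
        server_id, expires_at = entry
        if expires_at <= self._clock():
            del self._entries[user_id]   # stale entry, e.g. from a crashed server
            return None
        return server_id


class Backplane:
    """One channel per connection server; each server subscribes only to its own."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[str, dict], None]] = {}

    def subscribe(self, server_id: str, handler: Callable[[str, dict], None]) -> None:
        self._handlers[server_id] = handler

    def publish(self, server_id: str, user_id: str, message: dict) -> bool:
        handler = self._handlers.get(server_id)
        if handler is None:
            return False                 # nobody is listening on that channel
        handler(user_id, message)
        return True


class ConnectionServer:
    def __init__(self, server_id: str, registry: Registry,
                 backplane: Backplane, store: List[dict]) -> None:
        self.server_id = server_id
        self.registry = registry
        self.backplane = backplane
        self.store = store                          # stand-in for the message service
        self.sockets: Dict[str, List[dict]] = {}    # userId -> frames pushed so far
        backplane.subscribe(server_id, self._on_forward)

    def connect(self, user_id: str) -> None:
        self.sockets[user_id] = []
        self.registry.register(user_id, self.server_id)

    def send(self, recipient_id: str, message: dict) -> str:
        self.store.append(message)                      # 1. persist first
        target = self.registry.lookup(recipient_id)     # 2. find the recipient's server
        if target is None:
            return "stored"                             # offline: delivered on resync
        forwarded = self.backplane.publish(target, recipient_id, message)  # 3. forward
        return "forwarded" if forwarded else "stored"

    def _on_forward(self, user_id: str, message: dict) -> None:
        socket = self.sockets.get(user_id)
        if socket is not None:
            socket.append(message)                      # 4. push down the socket


registry, backplane, store = Registry(), Backplane(), []
server_a = ConnectionServer("A", registry, backplane, store)
server_b = ConnectionServer("B", registry, backplane, store)
server_a.connect("alice")
server_b.connect("bob")
print(server_a.send("bob", {"messageId": "m1", "content": "hi"}))     # forwarded
print(server_a.send("carol", {"messageId": "m2", "content": "hey"}))  # stored
```

Note that `send` returns "stored" both for an offline recipient and for a forward that finds no listener: in either case the message is already durable, so the outcome is the same resync path.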

Part C - Group messaging and fan-out

A 1:1 message is one delivery. A group of N members is N deliveries: the message is stored once per conversation, and the fan-out is purely a delivery concern - route a copy to each member's connection server, store for those offline.

Large broadcast channels - a Slack channel with 100,000 members - are the celebrity problem from Part 5 in a new guise. Do not synchronously fan out to 100,000 connections. Push to the online members, persist for the offline majority, and for very large channels let clients pull on read rather than receiving a push at all.
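One way to express the push/store/pull hybrid is a small planning function that decides, per message, who gets a push. The cutoff of 10,000 members is a made-up value for illustration, as are the names `FanoutPlan` and `plan_fanout`; the text only says that "very large" channels switch to pull.

```python
from dataclasses import dataclass
from typing import List, Sequence, Set

PULL_THRESHOLD = 10_000   # hypothetical cutoff for "very large" channels


@dataclass
class FanoutPlan:
    push_to: List[str]     # online members who receive a real-time push
    pull_on_read: bool     # True: nobody is pushed; clients fetch when they open the channel


def plan_fanout(sender_id: str, members: Sequence[str], online: Set[str],
                pull_threshold: int = PULL_THRESHOLD) -> FanoutPlan:
    """The message is already stored once per conversation; this only plans delivery."""
    if len(members) > pull_threshold:
        return FanoutPlan(push_to=[], pull_on_read=True)
    recipients = [m for m in members if m != sender_id]
    # Offline members need no per-member copy: they resync from the stored log.
    return FanoutPlan(push_to=[m for m in recipients if m in online],
                      pull_on_read=False)


plan = plan_fanout("alice", ["alice", "bob", "carol"], online={"alice", "bob"})
print(plan.push_to, plan.pull_on_read)   # ['bob'] False
```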

Part D - Presence

Presence is deceptively expensive. Broadcasting every connect and disconnect to all of a user's contacts produces a presence storm - at 500M users churning connections, the fan-out dwarfs the actual messaging traffic.

The scalable approach has two parts. A connection server knows a user is online while their socket is alive, refreshed by a periodic heartbeat; presence is written to a store with a short TTL, so a crashed client silently expires to offline with no explicit event. And presence is read on demand - when a user opens a chat or their contact list, the client queries presence for just those users - rather than pushed on every change. "Last seen" is simply a timestamp written on disconnect.
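A minimal in-memory sketch of heartbeat-plus-TTL presence with on-demand reads follows. The 30-second TTL and the `PresenceStore` name are assumptions; in practice the same behaviour would come from a key-value store with per-key expiry rather than a Python dict.

```python
import time
from typing import Callable, Dict, Iterable, Optional


class PresenceStore:
    def __init__(self, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.time) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Dict[str, float] = {}
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, user_id: str) -> None:
        """Called by the connection server while the socket is alive."""
        self._expires_at[user_id] = self._clock() + self._ttl

    def disconnect(self, user_id: str) -> None:
        """Clean disconnect: go offline immediately and record last-seen."""
        self._expires_at.pop(user_id, None)
        self._last_seen[user_id] = self._clock()

    def is_online(self, user_id: str) -> bool:
        # A crashed client never calls disconnect(); its entry simply expires.
        expires_at = self._expires_at.get(user_id)
        return expires_at is not None and expires_at > self._clock()

    def query(self, user_ids: Iterable[str]) -> Dict[str, bool]:
        """On-demand read for just the users a client is looking at."""
        return {user_id: self.is_online(user_id) for user_id in user_ids}

    def last_seen(self, user_id: str) -> Optional[float]:
        return self._last_seen.get(user_id)
```

Nothing here publishes an event when a user's state changes, which is the point: the cost of presence scales with how often people look, not with how often connections churn.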

Part E - Ordering and reliable delivery

Ordering uses the per-conversation seq. A conversation's partition has a single sequencer, so its messages get a clean monotonic order; clients sort by seq, never by client timestamps, which drift with clock skew. Global cross-conversation ordering is not provided - it is expensive and no one needs it.

Reliability is at-least-once plus deduplication. The client retries SEND with the same messageId until it gets a persistence ACK; the server discards duplicate IDs. Delivery to the recipient is acknowledged with a RECEIPT; if it does not arrive, the message is redelivered on reconnect. Receipts - sent, delivered, read - are themselves small messages flowing back through the same pipeline.
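The sequencer and the deduplication check fit naturally in one append operation. A sketch for a single conversation partition, assuming an in-memory log (the `ConversationLog` name and its return shape are illustrative):

```python
from typing import Dict, List, Tuple


class ConversationLog:
    """Log for one conversation, owned by that conversation's single sequencer."""

    def __init__(self) -> None:
        self._next_seq = 1
        self._seq_by_message_id: Dict[str, int] = {}
        self.messages: List[dict] = []

    def append(self, message_id: str, sender_id: str, content: str) -> Tuple[int, bool]:
        """Return (seq, is_new). A retried messageId gets its original seq back."""
        known_seq = self._seq_by_message_id.get(message_id)
        if known_seq is not None:
            return known_seq, False          # duplicate: re-ACK, store nothing
        seq = self._next_seq
        self._next_seq += 1
        self._seq_by_message_id[message_id] = seq
        self.messages.append({"messageId": message_id, "seq": seq,
                              "senderId": sender_id, "content": content})
        return seq, True


log = ConversationLog()
first = log.append("m-1", "alice", "hello")
retry = log.append("m-1", "alice", "hello")   # ACK was lost, so the client retried
second = log.append("m-2", "bob", "hi")
print(first, retry, second)   # (1, True) (1, False) (2, True)
```

Returning the original `seq` on a duplicate lets the server send the same persistence ACK again, so the client cannot tell a retry from a first attempt.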

Offline delivery and resync share one mechanism: every message is durable before delivery, so a reconnecting client sends its last known seq per conversation and the server streams everything after it. The same resync recovers messages missed during a server crash.
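The resync exchange reduces to "give me everything after this seq, per conversation". A sketch, assuming the store is a dict of per-conversation message lists and the client reports a dict of cursors (both shapes are illustrative):

```python
from typing import Dict, List


def resync(store: Dict[str, List[dict]],
           last_seq_by_conversation: Dict[str, int]) -> Dict[str, List[dict]]:
    """Return, per conversation, the stored messages newer than the client's cursor."""
    missed: Dict[str, List[dict]] = {}
    for conversation_id, last_seq in last_seq_by_conversation.items():
        newer = [m for m in store.get(conversation_id, []) if m["seq"] > last_seq]
        if newer:
            missed[conversation_id] = sorted(newer, key=lambda m: m["seq"])
    return missed


store = {
    "c1": [{"seq": 1, "content": "a"}, {"seq": 2, "content": "b"}, {"seq": 3, "content": "c"}],
    "c2": [{"seq": 1, "content": "x"}],
}
# The client saw up to seq 1 in c1 and is fully caught up in c2.
print(resync(store, {"c1": 1, "c2": 1}))
# {'c1': [{'seq': 2, 'content': 'b'}, {'seq': 3, 'content': 'c'}]}
```

The same call serves a phone coming back from airplane mode and a client whose connection server crashed; the server does not need to know which case it is handling.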

Consistency model

Per-conversation ordering is strongly consistent via the single sequencer; across conversations there is no ordering guarantee, by design. Presence is eventually consistent and approximate - a crashed client reads as online until its TTL lapses. Message delivery is at-least-once with deduplication, which users observe as effectively-once.

Failure modes

  • Connection server crash. Its ~100,000 sockets all drop. Clients detect the dead socket, reconnect through the load balancer onto a different server, re-register, and resync by seq. Because the message store is the source of truth, nothing is lost - this is the central failure case, and statefulness is what makes it interesting.
  • Reconnect storm. A dead server dumps 100,000 clients reconnecting at once onto the rest of the fleet and the registry. Clients must reconnect with backoff and jitter - the Part 3 discipline.
  • Stale registry entry. A registry record pointing at a dead server causes a forward into the void, but the message is already persisted, so the recipient gets it on resync. TTLs keep the registry self-cleaning.
  • Backplane outage. Real-time push stops, yet messages still persist; on recovery, delivery resumes and clients resync. Degraded, not lost.
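Reconnecting with backoff and jitter is a client-side detail worth being able to write down. The sketch below uses the "full jitter" variant of exponential backoff, where each delay is drawn uniformly between zero and an exponentially growing ceiling. The base of 0.5 s and cap of 30 s are illustrative values, not figures from the text.

```python
import random
from typing import Callable


def reconnect_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                    rng: Callable[[], float] = random.random) -> float:
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    The ceiling doubles per attempt up to `cap`; the actual delay is a random
    fraction of that ceiling, so 100,000 dropped clients spread out instead of
    retrying in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


for attempt in range(5):
    print(f"attempt {attempt}: wait {reconnect_delay(attempt):.2f}s")
```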

Multi-region

Users connect to the nearest region. A conversation has a home region that owns its sequencer, keeping seq authoritative; a cross-region conversation simply pays extra latency to reach that sequencer. The message store is replicated, and a global cross-region stream on the backplane carries messages to a recipient connected in another region.

Evolution path

| Stage | Approach |
| --- | --- |
| Launch | One server, an in-memory connection map, WebSocket |
| Growth | Multiple connection servers, a shared registry, a pub/sub backplane |
| Scale | Thousands of connection servers, TTL + on-demand presence, large-group fan-out hybrid, multi-region |

Build the client messageId, the per-conversation seq, and the resync-by-sequence protocol from day one - they are the contract every reliability and recovery property depends on. Defer presence sophistication, large-group hybrids, and multi-region.

Observability

Track concurrent connections per server, connection churn rate, end-to-end delivery latency p99, backplane lag, registry lookup latency, reconnect-storm spikes, and offline-resync volume. A reasonable SLO: 99% of messages delivered to an online recipient within 500 ms, a tail bound that is deliberately looser than the ~200 ms delivery target from Step 1.
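Checking a window of latency samples against a percentile SLO is a short calculation. This sketch uses the nearest-rank percentile definition with an integer percentile; the function names and the sample data are made up for illustration.

```python
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100      # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def meets_slo(latencies_ms: Sequence[float], pct: int = 99,
              threshold_ms: float = 500.0) -> bool:
    return percentile(latencies_ms, pct) <= threshold_ms


# 98 fast deliveries and 2 slow ones: the 99th-ranked sample is a slow one.
window = [120.0] * 98 + [900.0, 950.0]
print(percentile(window, 99), meets_slo(window))   # 900.0 False
```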


Step 6 - Bottlenecks and Trade-offs

  • Connection count makes the front tier stateful and memory-bound - hence thousands of servers and a registry, instead of a stateless fleet.
  • Cross-server routing is bounded by the backplane's throughput, so it must be partitioned.
  • Presence fan-out would dominate all other traffic if pushed; TTL heartbeats plus on-demand reads contain it.
  • Large-group fan-out repeats the celebrity problem and needs the push/store/pull hybrid.
  • Connection server failover is the defining hard case - clients reconnect and resync, and statefulness is precisely what makes it non-trivial.

Reference Architecture

The pattern this problem teaches, reusable well beyond chat:

A stateful connection layer holding persistent client connections, a registry mapping clients to connection servers, and a pub/sub backplane routing events between servers - all backed by a durable store, so a dropped connection means resync, not data loss.

flowchart LR
    subgraph Conn["Stateful connection layer"]
        direction TB
        S1[Connection server]
        S2[Connection server]
    end
    S1 <-->|registry lookup| Reg[(Connection registry)]
    S2 <-->|registry lookup| Reg
    S1 <-->|route events| BP[Pub/Sub backplane]
    S2 <-->|route events| BP
    Conn --> Durable[(Durable message store)]

Figure 3. The reference architecture for any system that pushes real-time events to many connected clients. A stateful edge layer holds the connections; a registry tells servers where to find each other; a backplane routes events between them; a durable store underwrites the whole thing. This shape recurs in live notifications, collaborative editing, multiplayer game state, and the live driver-rider link of a ride-sharing service.

Whatever the events happen to be, a stateful edge, a registry, a backplane, and a durable backing store together form the toolkit for "deliver events to whichever server currently holds the client".


Common Mistakes in the Interview

  • Using polling, or choosing WebSocket without naming the statefulness it imposes.
  • No cross-server routing story - failing to explain how a message reaches a recipient on another server.
  • Broadcasting every presence change, producing a presence storm that dwarfs real traffic.
  • Ordering by client timestamps instead of a server-assigned per-conversation sequence.
  • Forgetting offline delivery and the reconnect-and-resync protocol.
  • Treating a 100,000-member channel like a small group with no fan-out hybrid.
  • Ignoring connection-server failover and the reconnect storm it triggers.
  • Claiming exactly-once delivery instead of at-least-once plus deduplication.

Quick Reference

| Topic | Key Point |
| --- | --- |
| Transport | WebSocket - persistent, bidirectional; makes connection servers stateful |
| Connection layer | Thousands of stateful servers, ~100k connections each |
| Routing | Registry maps user -> server; pub/sub backplane forwards between servers |
| Durability | Persist every message before delivery; the store is the source of truth |
| Group fan-out | Store once; push to online, store for offline; pull for huge channels |
| Presence | Heartbeat + TTL for online state; query on demand, never broadcast all |
| Ordering | Per-conversation seq from a single sequencer; never client clocks |
| Delivery | At-least-once + dedup by client messageId; receipts via the same path |
| Offline / recovery | Reconnect, send last seq, server streams everything after it |
| Failover | Connection server crash -> clients reconnect (backoff + jitter) and resync |

This is Part 6 of a 12-part system design series where each post solves one problem around one core pattern. Next: Design a Ride-Sharing Service.
