Design a Collaborative Editor: System Design Interview 2026

·17 min read
system-designcrdtoperational-transformationreal-timearchitectureinterview-preparation

A collaborative editor is two systems pretending to be one. There is the transport - a real-time fan-out of small operations to everyone editing a document, which is the Part 6 chat system in a different shirt. And there is the convergence - the guarantee that no matter what order edits arrive in, and no matter how long any one user was offline, all replicas end up with the same document. The convergence is what makes this its own problem, and it is one of the few in this series where two genuinely different algorithmic approaches - Operational Transformation and CRDTs - both work, with very different trade-offs.

This walkthrough assumes the 6-step system design framework and applies it at senior-plus depth. It is Part 15 of an extended system design series.

Table of Contents

  1. The Problem
  2. Step 1 - Clarify Requirements
  3. Step 2 - Estimate Scale
  4. Step 3 - API and Data Model
  5. Step 4 - High-Level Design
  6. Step 5 - Deep Dive: CRDT vs Operational Transformation
  7. Step 6 - Bottlenecks and Trade-offs
  8. Reference Architecture
  9. Common Mistakes in the Interview
  10. Quick Reference
  11. Related Articles

The Problem

We are designing a real-time collaborative editor: many users edit the same document, see each other's changes in near real-time, and end up with the same document regardless of the order edits arrived in or how long any user was offline. The canonical examples are Google Docs, Figma, Notion, and Linear.

The senior framing is that this is a convergence problem on top of a real-time transport. The transport is essentially solved by reusing the connection-layer pattern from Part 6; the convergence is the genuinely interesting part, and the field has produced two different correct answers - Operational Transformation and CRDTs. The deep dive is largely a careful comparison of those two, ending with the trade-offs that explain why modern editors pick CRDTs.


Step 1 - Clarify Requirements

Functional requirements:

  • Multiple users edit the same document concurrently.
  • Every user sees every edit in near real-time.
  • Edits are never lost, and conflicts never surface to the user as a "merge this" dialog.
  • Offline editing: a user disconnects, edits locally, reconnects, and the document merges cleanly.
  • Cursor and selection awareness.
  • Undo and history.

Out of scope (name, then defer): permissions and access control, comments and suggestions (a different feature layer), and document storage at multi-petabyte archive scale.

Non-functional requirements:

  • Real-time latency - a keystroke should reach other collaborators within ~100 ms.
  • Tens to hundreds of concurrent collaborators per document.
  • Strong eventual consistency - replicas that have seen the same operations are in the same state.
  • High availability, including survival of brief network partitions.

The decisive clarifying question is convergence: all replicas must end up identical, regardless of operation arrival order or duplication. This is strong eventual consistency (SEC) and is the central correctness property of the design - and the property that OT and CRDTs deliver in very different ways.


Step 2 - Estimate Scale

Sessions. Hundreds of millions of users, ~5 million concurrent active editing sessions at peak. The connection layer carries that many persistent WebSocket connections - the same order of magnitude as Part 6.

Operation rate. At ~5 ops/sec during active typing, ~25 million ops/sec at peak across the platform. Per document with ~5 collaborators that is ~25 ops/sec - trivial.

Per-op bandwidth. A character-level op is on the order of ~50 bytes (op type, position anchor, ID, timestamp), so ~25M ops x 50 bytes ≈ ~1.25 GB/sec of edit traffic - dominated by metadata, not content.

Document storage. ~10 million actively edited documents at ~100 KB each is ~1 TB of live state; full operation histories add more, controlled by periodic snapshotting.

The defining shape: a chat-scale connection layer, modest per-document throughput, document-level state where the algorithmic choice (OT vs CRDT) determines almost everything else.


Step 3 - API and Data Model

The session uses a persistent WebSocket (the Part 6 transport) with a handful of frame types:

FrameDirectionPurpose
SYNCserver -> clientInitial snapshot + tail of recent ops on join
OPbothA single edit (CRDT update or OT operation)
ACKserver -> clientConfirms persistence (OT) or just receipt (CRDT)
CURSORbothEphemeral cursor / selection - not part of document state
EntityStored
Document snapshotAuthoritative state at a checkpoint
Operation logOrdered (OT) or causal (CRDT) op history since the snapshot
PresencePer-session ephemeral - cursor, selection - discarded on disconnect

Snapshot plus a tail of ops is the sync primitive: a reconnecting or new client gets the snapshot and then catches up by replaying the tail. The same primitive recovers from a server failover.


Step 4 - High-Level Design

flowchart TD
    A([User A]) <-->|WebSocket| GA[Gateway A]
    B([User B]) <-->|WebSocket| GB[Gateway B]
    C([User C]) <-->|WebSocket| GA
    GA --> DocOwn[Document Owner<br/>one per document]
    GB --> DocOwn
    DocOwn -->|persist ops| Store[(Op Log + Snapshots)]
    DocOwn -->|broadcast| GA
    DocOwn -->|broadcast| GB
    Reg[(Document Registry)] -.lookup owner.-> GA
    Reg -.lookup owner.-> GB

Figure 1. The architecture: WebSocket gateways (the Part 6 connection layer) hold client connections; one document owner per document persists and broadcasts operations through an op log plus snapshots. For OT the owner is the convergence authority; for CRDT the owner is mostly a relay - operations commute, so its ordering only affects fan-out timing, not correctness.

The connection layer is the chat connection layer from Part 6. Each document has one owner - the per-document analogue of a chat's per-conversation sequencer. For OT, the owner is the authority that transforms and orders operations. For CRDT, the owner is mostly a relay and a persistence target - operations commute, so the order it picks does not affect correctness, only fan-out timing.


Step 5 - Deep Dive: CRDT vs Operational Transformation

This is the core. The same convergence problem has two correct answers; the deep dive explains both and the reasons modern systems mostly pick one.

Part A - The convergence problem

Two users type into the same document. Alice inserts X at position 5, Bob inserts Y at position 3, both at the same time. If each user applies their own op locally then receives the other's op and applies it literally, the positions are wrong: Alice's op had position 5 before Bob inserted at 3, but by the time Alice receives Bob's Y her local document has shifted - so naively applying Bob's Y at 3 lands it in the wrong place. The documents diverge.

A correct solution must either rewrite incoming operations to compensate for concurrent ones (OT) or express operations in a way that makes order irrelevant (CRDT).

Part B - Operational Transformation

OT routes every op through a central authority - the document owner. The server orders the operations it receives into a single linear log; clients send their ops to the server tentatively and apply incoming ops locally, transforming them against any of their own tentative ops that the server has not yet acknowledged.

The transform function is the heart of OT. Given two concurrent ops, it produces a new op with the same intended effect:

Insert("X", pos=5)  transformed against  Insert("Y", pos=3)
   --->  Insert("X", pos=6)        // because Y at 3 shifted everything right
sequenceDiagram
    participant A as Alice
    participant S as Server (OT)
    participant B as Bob
 
    Note over A,B: Both see doc "hello world"
    A->>S: Insert("X", pos=5) - rev 0
    B->>S: Insert("Y", pos=3) - rev 0
    Note over S: Server orders: Bob first, then Alice
    S->>B: ack Insert("Y", pos=3) -> rev 1
    S->>A: incoming Insert("Y", pos=3)
    Note over A: transform local pending against incoming<br/>still pos=5? no - shift to 6
    S->>A: ack Insert("X", pos=6) -> rev 2
    S->>B: incoming Insert("X", pos=6) -> rev 2
    Note over A,B: both converge to "helYlowXorld"

Figure 2. Operational Transformation on a concurrent edit. Both clients send to the server with the same revision; the server picks an order and transforms each operation against the others so positions stay correct. Alice's Insert at pos=5 becomes pos=6 after Bob's earlier insert at pos=3 shifts everything right - the transform function is what makes this convergence possible, and also what makes OT notoriously hard to get right.

OT is efficient - operations carry only a position and a payload, no per-character metadata - and is proven in production at Google Docs scale. The cost is that the transform function is notoriously hard to get right: several published OT algorithms had subtle bugs that took years to discover, and adding new operation types (especially structural ones in rich documents) means adding more transform cases. OT also requires the central authority on the convergence path; true peer-to-peer is much harder.

Part C - CRDTs

A Conflict-free Replicated Data Type sidesteps transformation by designing the data so concurrent ops commute - applying A then B yields the same state as B then A. For text the canonical idea is to give every character a globally unique identifier (typically (siteId, counter) from a Lamport timestamp) and express inserts as between two identifiers rather than at a numeric position:

Insert("X", after=id_7, before=id_8, id=(siteA, 42))
flowchart TD
    A["Doc: [h,id1] [e,id2] [l,id3] [l,id4] [o,id5]"]
    A --> Op1["Alice: insert X between id3 and id4<br/>new id (siteA, 42)"]
    A --> Op2["Bob: insert Y between id1 and id2<br/>new id (siteB, 31)"]
    Op1 --> Merge["Apply both - either order"]
    Op2 --> Merge
    Merge --> Final["Doc: [h,id1] [Y,(B,31)] [e,id2] [l,id3] [X,(A,42)] [l,id4] [o,id5]"]

Figure 3. CRDT insertions making order irrelevant. Each character has a globally unique ID, and inserts anchor between existing IDs rather than at numeric positions, so Alice's and Bob's concurrent inserts can be applied in either order and produce the same document. There is no transformation - operations commute by construction, which is why CRDTs need no central authority for convergence.

Because each insert anchors itself to existing IDs that are not moved by other inserts, applying ops in any order yields the same document. There is no transformation, and no central authority is required for correctness - the server, if there is one, is a relay and a persistence layer rather than an arbiter.

The cost is bookkeeping. Every character carries metadata (its id, and often the IDs of its neighbours), so the in-memory representation is larger than the raw text. Deletes leave tombstones - the character is marked deleted but kept, because a concurrent op may still reference it as an anchor - so state grows over time and needs background garbage collection.

Production CRDT implementations - Yjs, Automerge, the algorithms behind Figma's text layer and Notion's blocks - all add structural optimisations (RGA, YATA, LSEQ) to keep the metadata overhead manageable.

Part D - The comparison

AspectOperational TransformationCRDT
Central authority on the convergence pathYes - server orders and transformsNo - peers converge regardless of order
Metadata per characterMinimal (operation has a position)Larger (each character carries an ID)
Algorithm difficultyTransform function is famously hard to get rightOperations commute by construction
Offline editingPossible but the transform batch gets complicatedTrivial - apply the missing ops in any order
Peer-to-peerHard - needs the authorityNatural - operations commute
Memory growthBounded by the documentGrows with edits; needs tombstone GC
Production examplesGoogle Docs (historical)Figma, Notion, Linear, Yjs ecosystem

The reason modern editors largely adopt CRDTs is offline and multi-device. A long disconnected session in OT produces a large batch of operations the server must transform against everything that happened in the meantime, and any subtle bug in the transform function turns silent corruption into a quiet data-loss bug. In CRDT the same batch is applied in any order and converges by construction - the catch-up has no correctness pitfalls, only metadata bandwidth.

Causality

Both approaches need each op to carry causality - which ops the producer had seen. Lamport timestamps (a single integer that strictly increases past anything seen) and vector clocks (the Part 14 idea: a per-site counter map) are the two options. CRDT IDs are typically (siteId, Lamport) pairs; concurrent ops are detected naturally, and ties between concurrent ops are broken deterministically by site ID so all replicas pick the same order.

Consistency model

Strong eventual consistency: any two replicas that have received the same set of operations are byte-for-byte identical. There is no global ordering of all writes (CRDT) or only the server's ordering (OT). During a partition each side keeps editing, and on heal both converge - silently in CRDT, via a transform batch in OT.

This is a stricter eventual consistency than the Part 14 Dynamo store, which can surface concurrent writes as siblings the application must merge. CRDTs decide the merge automatically because the data structure embeds the rule.

Failure modes

  • Connection drop. The client reconnects, sends a SYNC with its last known op version, and the owner streams the tail. Same as Part 6's reconnect-and-resync.
  • Document-owner crash. A new owner is elected (the Part 12 leader-election pattern), loads the latest snapshot plus tail, and resumes. Clients re-sync.
  • Server-side conflict in OT. Means the transform function is wrong - a real bug, never a graceful degradation. This is exactly why CRDTs are preferred for correctness-critical text.
  • Tombstone bloat in CRDT. State grows; background garbage collection removes tombstones that no live replica can still anchor against (often coordinated by a global frontier of "everyone has seen up to here").
  • Long offline session. The reconnect carries a large op batch; CRDT applies it cleanly, OT transforms it through every concurrent op since the disconnect.

Multi-region

Each document has a home region - the owner - to keep persistence simple and gateway routing local. Cross-region collaborators connect to their nearest gateway, which forwards to the home owner. A CRDT design can in principle let peers in different regions sync directly without a home owner, but most production systems keep the home for the persistence and access-control reasons rather than the convergence one.

Evolution path

StageApproach
LaunchSingle-user edits, no live collaboration
GrowthOT with a central server - efficient and well-trodden
ScaleCRDT (Yjs / Automerge) for offline-first, multi-device, and peer-to-peer extensions

The algorithm choice is a one-way decision - data formats and client implementations are tied to it, and migrating between OT and CRDT means a coordinated rewrite. Pick deliberately on the offline-and-multi-device requirements rather than because one is fashionable.

Observability

Track ops/sec per document, op-propagation p99 latency, reconnect sync time and op-batch size, convergence checksums (sampled cross-replica state hashes - any divergence is a serious incident), per-document CRDT state size (tombstone growth), garbage-collection lag, and presence-event rate. Convergence checks belong on the dashboard regardless of algorithm choice.


Step 6 - Bottlenecks and Trade-offs

  • Per-document throughput is bounded by the single owner that fans out to collaborators - rarely a problem, since a document's own collaborators are few.
  • CRDT metadata overhead is the cost of commutativity; clever implementations (RGA, YATA) and run-length encoding keep it manageable.
  • Tombstones grow until garbage-collected and are the standing CRDT trade-off.
  • OT's transform function is the standing OT trade-off - effort and care, indefinitely.
  • Offline-batch size is bounded only by the disconnect duration and is far easier for CRDTs.

Reference Architecture

The pattern this problem teaches, reusable beyond editors:

Conflict-free concurrent state via either operational transformation against a central authority, or a CRDT whose operations commute by construction - ridden over a persistent connection layer for real-time fan-out, snapshot-plus-tail for sync.

flowchart LR
    subgraph Choice["Two correct answers"]
        direction TB
        OT["Operational Transformation<br/>central authority transforms ops"]
        CRDT["CRDT<br/>ops commute by construction"]
    end
    subgraph Common["Shared infrastructure"]
        direction TB
        Conn[Persistent connection layer]
        Snap[(Snapshot + op log)]
    end
    Choice --> Common

Figure 4. The reference architecture: two correct convergence algorithms (OT and CRDT) sharing the same operational infrastructure - a persistent connection layer for real-time fan-out and a snapshot-plus-op-log for sync. The same shape recurs in multiplayer game state, shared cursors, real-time dashboards, and distributed configuration - any system where many writers act on shared state and must converge.

The same shape recurs in multiplayer-game shared state, shared cursors, real-time dashboards, and distributed configuration - any system where many writers act on the same logical state and must converge without a single arbiter (CRDT) or with a deliberately chosen one (OT).


Common Mistakes in the Interview

  • "Last-write-wins on the document" - silently loses changes and is not what any production editor does.
  • Lock-based editing - one user at a time, terrible UX, irrelevant to the actual problem.
  • Confusing OT and CRDT, or claiming "we use CRDT" without naming why operations commute.
  • Underestimating OT's transform-function complexity by treating "just shift positions" as the whole story.
  • Ignoring offline editing - the requirement that drove the industry to CRDTs.
  • No mention of tombstones in a CRDT design, and therefore no garbage-collection story.
  • Per-character operations without causality metadata, leaving the system unable to detect concurrency at all.

Quick Reference

TopicKey Point
Core problemStrong eventual consistency on a shared document under concurrent edits
OTCentral authority orders + transforms ops; efficient; transform function is hard
CRDTOps commute by construction; no central authority needed; metadata overhead
CausalityLamport timestamps or vector clocks on each op; deterministic tie-break
OfflineCRDT trivially merges; OT becomes complicated but is doable
TombstonesCRDT keeps deleted elements as anchors; background GC required
TransportPersistent WebSocket connection layer reused from Part 6
Document ownerOne per document; CRDT owner is a relay, OT owner is the authority
FailoverOwner re-elected; clients resync via snapshot + tail
Algorithm choiceA one-way decision tied to data format; pick on offline-and-P2P needs

This is Part 15 of an extended system design series. Next: Design a Distributed File System.

Ready to ace your interview?

Get 550+ interview questions with detailed answers in our comprehensive PDF guides.

View PDF Guides