Design a Key-Value Store: System Design Interview 2026

·16 min read
system-designdynamoquorumvector-clocksarchitectureinterview-preparation

A Dynamo-style key-value store is the architecture you reach for when "the database refuses writes during a partition" is unacceptable - a shopping cart, a session store, a distributed counter, anything whose business cost of being unavailable is higher than the cost of briefly inconsistent reads. It is the textbook AP point of the CAP spectrum, opposite the CP stance of the payment system in Part 11, and the deep dive is mostly about three things: a partitioning scheme that scales horizontally without coordination, a quorum mechanism that is tunable rather than fixed, and a set of techniques for reconciling the divergence that always-writable inevitably produces.

This walkthrough assumes the 6-step system design framework and applies it at senior-plus depth. It is Part 14 of an extended system design series.

Table of Contents

  1. The Problem
  2. Step 1 - Clarify Requirements
  3. Step 2 - Estimate Scale
  4. Step 3 - API and Data Model
  5. Step 4 - High-Level Design
  6. Step 5 - Deep Dive: Tunable Quorum, Vector Clocks, and Anti-Entropy
  7. Step 6 - Bottlenecks and Trade-offs
  8. Reference Architecture
  9. Common Mistakes in the Interview
  10. Quick Reference
  11. Related Articles

The Problem

We are designing a distributed key-value store with a deliberately narrow contract - get, put, delete by key - and a deliberately broad consistency model: eventually consistent and always writable, with the ability to opt into strong consistency on a per-request basis. The canonical example is Amazon Dynamo, with descendants including Cassandra, Riak, ScyllaDB, and the storage layers underneath several cloud services.

The senior framing is that this is the AP point of the CAP spectrum, picked deliberately. The system never refuses a write - not even during a network partition - and it pays for that with the obligation to reconcile divergent replicas afterwards. Every interesting technique in the deep dive is a mechanism for doing that reconciliation correctly while preserving the no-refused-write invariant.


Step 1 - Clarify Requirements

Functional requirements:

  • get, put, and delete by primary key.
  • Per-request tunable consistency through the quorum parameters R and W.
  • No transactions across keys, no joins, no secondary indexes. The narrow contract is the design.

Out of scope (name, then defer): SQL-style transactions, range scans across arbitrary keys, secondary indexes (some descendants like Cassandra add these; classic Dynamo does not).

Non-functional requirements:

  • Always writable - even during partitions and node failures.
  • Massive horizontal scale - billions of keys, hundreds of nodes.
  • Low latency - tens of milliseconds.
  • Survive any single-node or single-rack failure without losing recent writes.
  • Eventual consistency by default; strong consistency available on demand.

The decisive clarifying question: the system is AP, not CP. The contrast with Part 11's payment system is total - there, correctness dominated availability and every operation was strongly consistent; here, availability dominates strict consistency and the application accepts the responsibility of resolving conflicts when they happen. A senior answer states this trade-off up front rather than implying both can be had at once.


Step 2 - Estimate Scale

Data. 100 billion keys at ~1 KB each is ~100 TB; with replication factor N = 3 the cluster holds ~300 TB. At ~10 TB of usable storage per node, the cluster is ~100 nodes order-of-magnitude.

Throughput. Peak around 1 million reads/sec and 100,000 writes/sec - the read-write ratio is firmly read-heavy, but every write fans out to W replicas.

Latency budget. Sub-50 ms p99 for typical (N, W, R) = (3, 2, 2) requests; tighter for R = 1 reads, slower for R = 3 reads or W = 3 writes.

The defining shape: large total data, modest throughput per node, every request touches multiple replicas in the same datacenter to meet its quorum.


Step 3 - API and Data Model

The API is intentionally tiny and surfaces the quorum knobs:

get(key, R)              -> value(s) + context
put(key, value, context, W)
delete(key, context, W)  // writes a tombstone
ElementNotes
valueOpaque bytes - the store does not interpret it
contextOpaque token containing the vector clock of the version observed
value(s)A read may return more than one when concurrent writes have produced siblings
N, R, WReplication factor and quorum sizes; N is fixed per key, R and W are per request

The context is the mechanism that makes vector clocks invisible to most callers: a client reads a value, holds the context it received, and passes it back on a subsequent write. The store uses the context to know which version the writer was acting on and how to update the vector clock without losing concurrent updates.


Step 4 - High-Level Design

flowchart TD
    Client([Client]) -->|get/put| Coord[Coordinator<br/>any node]
    Coord -->|hash key -> ring| Ring[(Consistent Hash Ring)]
    Ring -->|preference list of N nodes| R1[(Replica 1)]
    Ring -->|preference list of N nodes| R2[(Replica 2)]
    Ring -->|preference list of N nodes| R3[(Replica 3)]
    Coord -->|forward request| R1
    Coord -->|forward request| R2
    Coord -->|forward request| R3
    Coord -->|wait for W (writes) or R (reads) acks| Client
    Hint[Hinted Handoff Node]
    R1 -.target down.-> Hint
    AE[Anti-Entropy / Merkle] -.background.-> R1
    AE -.background.-> R2
    AE -.background.-> R3
    Gossip[Gossip Membership] -.cluster view.-> Coord

Figure 1. The cluster forms a consistent-hash ring; each key's preference list is the next N nodes clockwise. Any node can coordinate a request; gossip maintains membership; hinted handoff and Merkle anti-entropy run as background processes converging drifted replicas. The architectural point is that there is no central coordinator - Dynamo-style systems are peer-to-peer by design.

The cluster forms a consistent-hash ring (Part 4). Each key's preference list is the next N nodes clockwise from the key's hash. Any node can act as coordinator for a request: it forwards to the N replicas in parallel and replies to the client as soon as W writes (or R reads) have acknowledged. Cluster membership propagates by gossip, no central coordinator. Two background processes - hinted handoff and Merkle-tree anti-entropy - converge replicas that drift.


Step 5 - Deep Dive: Tunable Quorum, Vector Clocks, and Anti-Entropy

This is the core. Five mechanisms cooperate: consistent-hashing partition + replication, tunable quorum, vector clocks for conflict detection, sloppy quorum with hinted handoff for always-writable, and Merkle-tree anti-entropy for background repair.

Part A - Partition and replication

The ring partitions the key space (the Part 4 idea with virtual nodes for balance) and each key's preference list - the next N distinct physical nodes clockwise - holds its replicas. N is a cluster-wide parameter; 3 is the canonical default.

Choosing replicas from distinct physical nodes (skipping vnode duplicates) and ideally from distinct failure domains (different racks) is what makes the cluster survive single-rack failures with no data loss. State this explicitly - it is a piece of policy the consistent-hashing primitive does not enforce on its own.

Part B - Tunable quorum and R + W > N

For each request the client (or system) chooses R and W independently from N:

flowchart LR
    subgraph Cfg["N = 3 - common configurations"]
        direction TB
        S1["W=2, R=2: R+W=4 > 3"] --> Strong["Strong: read sees latest write"]
        S2["W=1, R=1: R+W=2 (less than 3)"] --> Eventual["Eventual: fastest, may read stale"]
        S3["W=3, R=1: R+W=4 > 3"] --> WriteHeavy["Strong; slow writes, fast reads"]
        S4["W=1, R=3: R+W=4 > 3"] --> ReadHeavy["Strong; fast writes, slow reads"]
    end

Figure 2. Four common (N, R, W) configurations for N = 3 and what each one buys. The R + W > N rule is the structural guarantee of strong consistency; choosing R + W <= N picks fast, always-available, possibly-stale operations. The store exposes (R, W) per request precisely so each operation can sit at a different point on the latency-consistency trade-off.

The guarantee is structural. The latest write touches at least W replicas; a read touches at least R replicas; R + W > N forces the two sets to overlap by at least one replica, so a read always observes the latest write. Choosing R + W <= N picks faster, more available operations that may see stale data - perfectly acceptable for a shopping cart, unacceptable for an idempotency-key check.

A senior answer enumerates the trade-offs: higher W means slower writes and lower write availability (more replicas must be up); higher R means slower reads; choosing both small means the system is always-writable, always-readable, and sometimes wrong. The right knob depends on the operation, and the store exposes it precisely so each call can choose.

Part C - Vector clocks and sibling resolution

Without coordination on every write, two replicas can accept writes to the same key concurrently and produce conflicting versions. The store must be able to tell which case it is in:

  1. One version happened after the other - keep the later one.
  2. The versions happened concurrently - neither is more authoritative.

A vector clock is a map from node ID to a per-node counter, attached to every value. When a replica accepts a write, it increments its own entry. Comparing two clocks:

  • If clock A's entries are all >= clock B's, and at least one is strictly greater, A causally follows B - keep A, discard B.
  • Otherwise the writes are concurrent; both versions are kept as siblings, and a subsequent read returns both. The client merges them per its domain - the canonical example is a shopping cart whose merge is the union of items.
sequenceDiagram
    participant C1 as Client 1
    participant C2 as Client 2
    participant N1 as Node A
    participant N2 as Node B
 
    Note over N1,N2: clock for key K starts: {}
    C1->>N1: put K = "cart with [book]"
    N1-->>C1: ok, clock {A:1}
    Note over N1,N2: network partition - replicas diverge
    C2->>N2: put K = "cart with [pen]" (saw {})
    N2-->>C2: ok, clock {B:1}
    Note over N1,N2: partition heals
    C1->>N1: get K
    N1-->>C1: SIBLINGS: ["cart with [book]" @ {A:1}, "cart with [pen]" @ {B:1}]
    Note over C1: merge by union -> "cart with [book, pen]"
    C1->>N1: put K = "cart with [book, pen]" (context: {A:1, B:1})
    N1-->>C1: ok, clock {A:2, B:1}

Figure 3. Vector clocks in action across a partition. Two clients write to different replicas while the partition isolates them; on heal, neither clock dominates the other, so both versions are kept as siblings and returned to the next reader to merge. The shopping-cart-merge-by-union is the canonical Dynamo example - LWW would have silently dropped one of the puts.

The alternative - last-write-wins by timestamp - is simpler but silently drops data whenever clocks drift or two writes are truly simultaneous. For values where merging is genuinely impossible, LWW is the only option; for everything else, vector clocks preserve the information and let the application do the right thing.

Part D - Sloppy quorum and hinted handoff

A strict quorum fails when fewer than W intended replicas are reachable, which contradicts "always writable". A sloppy quorum instead promotes the next live node further along the ring into the write set: that node accepts the write as a hint, marked as held on behalf of the intended owner. When the owner recovers, the hint is handed off - delivered, then deleted from the temporary holder.

The trade-off is explicit: data is briefly stored at the "wrong" node, and a read that does not include the hinting node can miss the most recent write. The system is always writable in exchange for that brief misplacement, and anti-entropy ensures the durable state converges.

Part E - Merkle-tree anti-entropy

Replicas drift over time: hinted handoffs are delayed, messages are dropped, bits flip. The system periodically anti-entropies each replica pair, reconciling differences without re-shipping all the data. Each replica builds a Merkle tree - a tree of hashes - over the key range it owns, and reconciliation starts at the roots:

  • Matching roots mean identical data, no further work.
  • Differing roots prompt recursion into the differing subtrees.
  • Only the genuinely differing leaves are exchanged.

The data exchanged is therefore proportional to the actual divergence, not to the total data. The same idea underpins Git's object protocol and Bitcoin's transaction trees, and it is what makes anti-entropy affordable on a multi-terabyte node.

Consistency model

The model is tunable, defaulting to eventual. With R + W > N a single request is strongly consistent; with R + W <= N it is fast and possibly stale. The system never refuses a write because of unavailable replicas - the worst it does is store the write at a hinting node. Concurrent writes become siblings the client must resolve; LWW is available for values that cannot be merged, with the caveats noted above.

This is the deliberate opposite of Part 11's payment system: there, correctness dominated availability. Here, availability dominates, and the application accepts the burden of resolving conflicts in exchange.

Failure modes

  • Single-node failure. The preference list slides to the next live node via sloppy quorum; hinted handoff repairs once the owner returns; anti-entropy closes any remaining drift.
  • Network partition. Each side keeps accepting writes through sloppy quorum; on healing, vector clocks let the system distinguish concurrent writes from causally-ordered ones, surfacing siblings where appropriate.
  • Rack failure. Designing the preference list across racks keeps every key with at least one surviving replica.
  • Coordinator failure mid-request. The client retries with a different coordinator; vector clocks make the retry safe.
  • Silent bit rot. Anti-entropy detects the divergence and repairs.

Multi-region

Cross-region replication is normally per-region rings with asynchronous replication between them - strong consistency is local; cross-region is eventual. A single N that spans regions is technically possible but pays a cross-region round trip on every quorum, which is almost always unacceptable. Senior designs keep N within one region and let global replication be a separate, eventually-consistent layer above.

Evolution path

StageApproach
LaunchA single relational store with primary + replicas - simple
GrowthShard, add read replicas
ScaleDynamo-style ring with tunable (N, R, W), vector clocks, sloppy quorum, anti-entropy - or adopt Cassandra, DynamoDB, Riak, ScyllaDB rather than build

Build the (N, R, W) abstraction and the context-and-siblings API contract from day one - both are very hard to retrofit. Defer multi-region.

Observability

Track per-key read and write p99 latencies, quorum-success rate at each (R, W), hinted-handoff queue size and age, sibling-creation rate (a high one means concurrent-write hotspots), anti-entropy lag and bandwidth, ring rebalancing events, and gossip-convergence time. A high sibling rate on operations whose merge is hard is a real product signal worth surfacing.


Step 6 - Bottlenecks and Trade-offs

  • Latency versus consistency is the standing trade-off: R + W > N costs latency and write availability for strong reads.
  • Hot keys are a Part 4 problem unchanged - the per-key load is on the preference list, regardless of quorum maths.
  • Sibling resolution shifts work to the client; designs should keep the merge cheap or be honest that LWW is acceptable.
  • Anti-entropy bandwidth is bounded by Merkle's actual-divergence proportionality but is non-zero - schedule it off peak.
  • Cross-region quorum is rarely the right answer; keep N local and replicate asynchronously above.

Reference Architecture

The pattern this problem teaches, reusable beyond key-value stores:

Partition and replicate with consistent hashing, tune R and W per request to choose between latency and consistency, track causality with vector clocks so concurrent writes surface as siblings, stay writable through sloppy quorum and hinted handoff, and converge replicas in the background with Merkle-tree anti-entropy.

flowchart LR
    subgraph Hot["Hot path"]
        H1[Consistent hash -> preference list of N] --> H2[Coordinator]
        H2 --> H3["Replicas (wait for R or W)"]
    end
    subgraph Bg["Background"]
        B1[Hinted handoff] --> B2[Merkle anti-entropy]
        B2 --> B3[Replicas converge]
    end
    Hot -.always writable, eventually consistent.-> Bg

Figure 4. The reference architecture splits the hot path from background work. The hot path serves quorum-tunable get/put operations through the preference list; background processes - hinted handoff and Merkle anti-entropy - converge replicas that drift. "Always writable, eventually consistent" is the resulting stance, and the deferred reconciliation in the background is what makes it work.

The same shape recurs across the AP family - Cassandra, Riak, ScyllaDB, and parts of DynamoDB - and the central insight that "always writable plus tunable consistency plus deferred reconciliation" is a coherent point in the design space is one of the most influential ideas in modern distributed systems.


Common Mistakes in the Interview

  • Claiming strong consistency without doing the R + W > N arithmetic.
  • Using last-write-wins for non-trivial values, silently dropping concurrent writes.
  • Ignoring vector clocks and not explaining how concurrent writes are even detected.
  • Pretending the client never sees siblings, hiding the conflict-resolution responsibility instead of acknowledging it.
  • A strict quorum with no story for how the system stays writable during partitions.
  • No anti-entropy, letting drift accumulate until reads visibly diverge.
  • A single global N spanning regions, paying a cross-region round trip on every quorum.

Quick Reference

TopicKey Point
StanceAP - always writable, eventually consistent, tunable to strong
PartitionConsistent-hashing ring with virtual nodes; preference list = next N distinct nodes
QuorumR + W > N guarantees the read set overlaps the latest write set
TuningPer-request (R, W); N is per key
ConcurrencyVector clocks distinguish causal order from concurrent; concurrent -> siblings
Conflict resolutionApplication merges siblings; LWW only when merge is impossible
Sloppy quorumWrite to next live node on the ring; hinted handoff returns the data later
Anti-entropyMerkle tree narrows reconciliation to the actually-divergent keys
Failure domainSpread the preference list across racks to survive rack loss
Multi-regionKeep N local; replicate asynchronously across regions

This is Part 14 of an extended system design series. Next: Design a Collaborative Editor.

Ready to ace your interview?

Get 550+ interview questions with detailed answers in our comprehensive PDF guides.

View PDF Guides