Design a Rate Limiter: System Design Interview 2026

·16 min read
system-designrate-limitingdistributed-countersarchitecturebackendinterview-preparation

A rate limiter is the smallest component in most architectures and one of the most consequential: it sits in the critical path of every single request, and when it is wrong it either lets an abusive client melt your backend or it throttles legitimate users. That tension - microsecond-level latency budget against correctness under concurrency and failure - is why "design a rate limiter" is a sharp senior interview question. The data model is two numbers per client; the difficulty is making those two numbers correct when a thousand gateway nodes update them at once and the store underneath can disappear.

This walkthrough assumes the 6-step system design framework and applies it at the depth expected of a senior or staff candidate.

Table of Contents

  1. The Problem
  2. Step 1 - Clarify Requirements
  3. Step 2 - Estimate Scale
  4. Step 3 - API and Data Model
  5. Step 4 - High-Level Design
  6. Step 5 - Deep Dive: Algorithms and Distributed Counters
  7. Step 6 - Bottlenecks and Trade-offs
  8. Reference Architecture
  9. Common Mistakes in the Interview
  10. Quick Reference
  11. Related Articles

The Problem

We are designing a rate limiter that caps how many requests a given client can make within a time window - the mechanism behind "429 Too Many Requests" and the X-RateLimit-* headers on every serious API. It exists to protect backends from abuse and accidental overload, to enforce fair use across tenants, and to contain cascading failure when one client misbehaves.

The senior framing is that a rate limiter is a distributed counter in the hot path. Every request reads and mutates shared state, that state is contended by the whole gateway fleet, and the component must stay fast and safe even when its backing store is slow or gone. Almost every interesting decision flows from those three facts.


Step 1 - Clarify Requirements

Scope the problem out loud before designing. The clarifying questions here are themselves a signal.

Functional requirements:

  • Limit requests per client identity (API key, user ID, or IP) to N per time window.
  • Support multiple rules and tiers - a free tier and a paid tier enforce different limits.
  • Support per-endpoint granularity, not just a single global limit per client.
  • On rejection, return 429 with a Retry-After header; on every response, expose X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset.

Non-functional requirements:

  • Latency: the limiter runs on every request, so its own overhead must be tiny - target a p99 well under ~2 ms.
  • Availability: the limiter must not become a single point of failure. If it cannot reach its store, it needs a defined degraded behaviour (see fail-open vs fail-closed below).
  • Accuracy: strict exactness is not required. Slight over-admission under contention or failover is acceptable; the design question is how much and bounded by what.
  • Distributed correctness: the limit is global across the gateway fleet, so per-process counting is not enough - the nodes must share state.

The key clarifying question: is the limit per-process or global across the fleet? A per-process limit is trivial and needs no shared state, but N gateway nodes then admit N times the intended rate. A global limit is the realistic requirement, and it forces a shared counter store - which is where the whole design gets interesting.


Step 2 - Estimate Scale

The numbers here are unusual: the rate limiter's workload is the traffic it guards.

Throughput. Assume the platform peaks at 1,000,000 requests/sec across the gateway fleet. Every request triggers exactly one limiter check, so the counter store must sustain ~1M operations/sec. A single Redis instance handles roughly 100k-200k ops/sec, so this alone forces a sharded or clustered store - partitioned by client key.

State size. Suppose 10 million distinct clients (API keys plus tracked IPs). Each client needs a few numbers per rule - on the order of 100 bytes. Total counter state is roughly 1 GB, which sits comfortably in memory. The store is throughput-bound, not memory-bound.

Latency budget. A same-datacenter Redis round trip is ~0.5-1 ms. Since that lands on every request, co-locating the store with the gateway and keeping the check to a single round trip is not an optimisation - it is the design.


Step 3 - API and Data Model

A rate limiter is not a public API; it is a decision made by middleware. Its internal contract is:

checkLimit(clientId, rule) -> { allowed: bool, remaining: int, retryAfter: int }

The result maps onto standard HTTP semantics:

200 OK
  X-RateLimit-Limit: 1000
  X-RateLimit-Remaining: 742
  X-RateLimit-Reset: 1716400000
 
429 Too Many Requests
  Retry-After: 23

The per-client state depends on the algorithm chosen in the deep dive, and is deliberately small:

AlgorithmState per client/rule
Fixed windowOne counter + a window key with TTL
Sliding window counterTwo counters: previous window, current window
Sliding window logA sorted set of request timestamps - grows with traffic
Token bucketTwo values: token count, last refill timestamp

The state being tiny and fixed-size (for every option except the log) is what lets the limiter scale to millions of clients.


Step 4 - High-Level Design

The limiter is middleware inside the stateless API gateway. The gateway nodes hold no counter state themselves; they share a counter store partitioned by client key.

flowchart TD
    Client([Client]) --> GW[API Gateway<br/>stateless fleet]
    GW --> RL{Rate Limiter<br/>middleware}
    RL -->|atomic check| Store[(Counter Store<br/>Redis Cluster)]
    RL -->|allowed| Backend[Backend Services]
    RL -->|rejected| Deny[429 + Retry-After]
    RL -.->|store unreachable| Local[Local fallback limiter]

Figure 1. The limiter as middleware inside the stateless API gateway. The shared counter store is the single source of truth for per-client state, and the dashed arrow to the local fallback limiter is what keeps the system protective during a store outage instead of becoming a single point of failure.

Two properties matter. First, the limiter rejects abusive traffic at the edge, before it costs any backend capacity - the gateway is the cheapest place to say no. Second, each gateway node carries a local fallback limiter that engages only when the shared store is unreachable, so a store outage degrades accuracy instead of removing protection.


Step 5 - Deep Dive: Algorithms and Distributed Counters

This is the core of the problem. It has two halves: which counting algorithm to use, and how to mutate the shared counter correctly under fleet-wide concurrency.

Part A - The counting algorithms

A senior answer compares the options on burst behaviour, memory, and accuracy rather than naming one.

Fixed window counter. Count requests in a fixed clock window - say, per minute - and reset at the boundary. Implementation is a single atomic INCR with a TTL. It is the cheapest option, but it has a real flaw: a client can send the full limit in the last second of one window and the full limit again in the first second of the next, producing up to 2x the intended rate across the boundary.

flowchart LR
    A["Window 12:00:00-12:00:59<br/>100 requests at :59"] --> B["Window 12:01:00-12:01:59<br/>100 requests at :00"]
    B --> C["200 requests in ~2 seconds<br/>limit was 100/min"]

Figure 2. Why the fixed window's boundary is a flaw, not a quirk. A client sends the full limit at the end of one window and the full limit again at the start of the next, producing roughly twice the intended rate across the boundary - the very load the limiter exists to stop. Sliding-window algorithms remove this by counting over a rolling interval.

Sliding window log. Store the timestamp of every request in a sorted set and count entries within [now - window, now]. It is exactly accurate and has no boundary burst, but memory is O(requests) per client - a high-volume client can occupy megabytes. Use it only when exactness is worth the cost.

Sliding window counter. The pragmatic middle ground. Keep the previous and current fixed-window counts and estimate the rolling rate by weighting the previous window by how much it still overlaps "now":

estimated = current_count + previous_count * (overlap_fraction)

It is O(1) memory, removes almost all of the boundary burst, and is accurate enough for nearly every real system. This is the usual senior default.

Token bucket. A bucket of capacity C refills at rate R tokens/sec; each request consumes one token, and an empty bucket means rejection. It stores just {tokens, lastRefillTs} and deliberately permits bursts up to C while holding the long-run average at R - ideal for APIs that want to tolerate spiky-but-bounded clients. Refill is computed lazily on read, so there is no background timer.

Leaky bucket. A queue drained at a fixed rate. It shapes traffic into a perfectly smooth output stream with no bursts at all - the right choice when the downstream has a hard fixed capacity, but it adds queueing latency and is less common for API limiting.

AlgorithmBurst handlingMemoryAccuracy
Fixed windowAllows 2x at boundaryO(1)Low
Sliding window logNoneO(requests)Exact
Sliding window counterMinimalO(1)High (approx.)
Token bucketAllows bounded burstsO(1)High
Leaky bucketNone - smooths outputO(1) + queueHigh

Part B - Atomic counters across the fleet

The algorithm is the easy half. The hard half is that hundreds of gateway nodes mutate the same client's counter concurrently. A naive read-modify-write races:

sequenceDiagram
    participant A as Gateway A
    participant B as Gateway B
    participant R as Redis (counter store)
 
    Note over R: client K has 1 token remaining
    A->>R: GET tokens for K
    R-->>A: 1
    B->>R: GET tokens for K
    R-->>B: 1
    Note over A: sees 1 - allow
    Note over B: sees 1 - allow
    A->>R: SET tokens = 0
    B->>R: SET tokens = -1
    Note over A,B: both requests admitted on a single token - over-admission

Figure 3. The naive read-then-write race that defeats a non-atomic counter. Both gateways read the same "1 token left", both decide to allow, and the system admits two requests on a single token. Naming this race out loud is what motivates the atomic Lua-script fix introduced next.

The fix is to make the entire refill-and-consume a single atomic operation. In Redis this means a Lua script: Redis executes scripts single-threaded, so the read, the refill calculation, the decrement, and the write happen indivisibly in one round trip.

-- token bucket: atomic refill + consume, KEYS[1] = client bucket
local state    = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens   = tonumber(state[1]) or tonumber(ARGV[3])  -- capacity
local last     = tonumber(state[2]) or tonumber(ARGV[4])  -- now
local rate     = tonumber(ARGV[1])                        -- tokens/sec
local capacity = tonumber(ARGV[2])
local now      = tonumber(ARGV[4])
 
tokens = math.min(capacity, tokens + (now - last) * rate) -- lazy refill
local allowed = tokens >= 1
if allowed then tokens = tokens - 1 end
 
redis.call('HMSET', KEYS[1], 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', KEYS[1], ARGV[5])
return allowed and 1 or 0

Prefer a Lua script over WATCH/MULTI: optimistic transactions retry under contention, and a hot client is exactly the high-contention case, so retries would spike latency precisely when you can least afford it. The Lua script is one round trip with no retry loop. For a plain fixed window, a bare INCR plus EXPIRE is already atomic and needs no script.

Consistency model. The counter store is the single source of truth and is strongly consistent for single-key operations - which is all the limiter uses. The weak spot is failover: if a Redis primary dies before replicating recent increments, the replica resumes with slightly stale counts and briefly over-admits. This is acceptable - rate limiting is a system that tolerates approximation by design - but you should state the bound rather than imply perfect accuracy.

Failure modes and degradation

  • Counter store unreachable. This is the decision every interviewer probes: fail-open or fail-closed? Fail-open preserves availability and suits user-facing latency-critical paths; fail-closed protects a fragile backend. The strong answer is that it is configurable per rule, and that each gateway node runs a local in-process fallback limiter sized to roughly globalLimit / nodeCount. When the store is gone, the fleet still enforces an approximate limit instead of either flooding the backend or rejecting everyone.
  • Hot key. One abusive client concentrates load on the single shard owning its key. The limiter is already rejecting most of that traffic, but the check itself still costs a round trip. Mitigations: a short-lived local decision cache so a client already known to be over-limit is rejected without touching Redis, or sharding one client's counter into K sub-counters across shards.
  • Latency spikes. Because the check is on every request, a slow store directly slows the whole platform. Co-locate the store with the gateway, pipeline where possible, and budget the check explicitly.

Multi-region

A strictly accurate global limit requires cross-region coordination on every request, and that round-trip latency is unacceptable in the hot path. State the trade-off and pick one:

  • Partition the limit per region. Each region enforces limit / regionCount locally with zero coordination. Simple and fast; slightly unfair if regional traffic is uneven.
  • Asynchronously replicate counters. Each region counts locally and gossips counts; the global view is eventually consistent, and over-admission is bounded by (regionCount - 1) x regionalBurst.

Strict global rate limiting across regions is a known hard problem - naming it as a trade-off, not solving it perfectly, is the senior signal.

Evolution path

StageApproach
LaunchIn-process limiter; fine for a single node or sticky routing
GrowthShared Redis, fixed window - simplest correct distributed limiter
ScaleToken bucket / sliding window counter via Lua, clustered store, local fallback, per-region partitioning

Build the middleware abstraction with a pluggable backend and the X-RateLimit-* header contract from day one - both are painful to retrofit. Defer multi-region and per-client counter sharding until traffic forces them.

Observability

Track the 429 rate per client and per rule, counter-store p99 latency, the limiter's own decision latency, and - critically - how often the local fallback engages, since fallback activation means the shared store is in trouble. Alert on fallback activation and on sudden 429 spikes, which usually indicate either an attack or a misconfigured client.


Step 6 - Bottlenecks and Trade-offs

  • Counter-store throughput is the first ceiling: ~1M ops/sec exceeds a single Redis node, so the store must be clustered and partitioned by client key from the start.
  • Per-request latency is the second: the limiter taxes every request, so the check must stay one local round trip.
  • Accuracy versus cost is the standing trade-off: the sliding window log is exact but memory-heavy, while the sliding window counter and token bucket are approximate but O(1) - and approximate is almost always the right call.
  • Concurrency correctness is non-negotiable: without an atomic check-and-decrement the limiter silently over-admits under exactly the load it exists to stop.

Reference Architecture

The pattern this problem teaches, reusable well beyond rate limiting:

An atomic shared counter mutated in the request hot path, fronted by a fast co-located store and backed by a local degraded fallback for when that store is unavailable.

flowchart LR
    subgraph Hot["Hot path - every request"]
        direction TB
        H1[Gateway middleware] --> H2{Atomic check<br/>Lua script}
        H2 --> H3[(Sharded counter store)]
    end
    subgraph Degraded["Degraded path - store down"]
        direction TB
        D1[Local fallback limiter]
    end
    Hot -.failover.-> Degraded

Figure 4. The reference architecture stripped to its two paths. The hot path mutates a shared counter through one atomic operation; the degraded path runs a local fallback when the counter store cannot be reached. The same shape applies to quotas, concurrency caps, and any contended-state-in-the-hot-path system.

The same shape recurs in API quota enforcement, concurrency limiting (caps on in-flight requests), and budget-bounded feature flags: a small piece of contended shared state, mutated atomically, with a defined answer for what happens when the shared state is unreachable.


Common Mistakes in the Interview

  • Proposing a fixed window without mentioning the 2x boundary-burst flaw.
  • A non-atomic read-modify-write counter - the silent over-admission bug that defeats the limiter under contention.
  • No story for a counter-store outage - failing to discuss fail-open versus fail-closed and a local fallback.
  • A limiter check that dominates request latency because it is slow, chatty, or not co-located.
  • Assuming a strict global limit across regions is free, instead of partitioning the limit or accepting bounded over-admission.
  • Choosing the sliding window log for high-volume clients and ignoring its O(requests) memory blow-up.
  • Per-process counting presented as a global limit - N nodes then admit N times the rate.

Quick Reference

TopicKey Point
PlacementAt the API gateway / edge; reject before backend cost is incurred
Default algorithmSliding window counter or token bucket - O(1) state, minimal burst
Fixed window flawAllows up to 2x the limit across the window boundary
Token bucketPermits bounded bursts up to capacity; steady average refill rate
AtomicitySingle atomic check-and-decrement - Redis Lua script, not WATCH/MULTI
StoreClustered/sharded by client key; throughput-bound, not memory-bound
FailureConfigurable fail-open/closed + local in-process fallback limiter
Multi-regionPartition the limit per region, or replicate counters eventually
Observability429 rate, store p99, fallback activation rate

This is Part 2 of a 12-part system design series where each post solves one problem around one core pattern. Next: Design a Notification Service.

Ready to ace your interview?

Get 550+ interview questions with detailed answers in our comprehensive PDF guides.

View PDF Guides