What is the difference between token bucket and sliding window rate limiting?

Token bucket allows short bursts up to the bucket capacity while enforcing a steady average refill rate, which suits APIs that want to tolerate spiky-but-bounded clients. Sliding window counts requests over a rolling time window, giving smoother enforcement with no boundary burst. Token bucket stores two numbers per client and naturally models bursts; the sliding window counter stores two window counts and approximates a true rolling window cheaply.

Where should a rate limiter sit in the architecture?

The primary limiter belongs at the API gateway or edge, so abusive traffic is rejected before it consumes any backend capacity. Larger systems also add per-service limiters as defense in depth, protecting an individual service even when traffic arrives through internal callers. The gateway is the cheapest place to say no.

How do you make rate limiter counters atomic across many servers?

A naive read-then-write race lets two gateway nodes both see one token left and both allow a request, over-admitting traffic. The fix is to make the check-and-decrement a single atomic operation - typically a Lua script in Redis, which executes scripts single-threaded so the refill and consume happen indivisibly in one round trip. Plain INCR works for a fixed-window counter because INCR is itself atomic.

Should a rate limiter fail open or fail closed?

It depends on what you are protecting. Fail-open (allow all requests when the counter store is unreachable) preserves availability and is common for user-facing latency-critical paths. Fail-closed (reject all) protects a fragile backend from being overwhelmed. A strong design makes this configurable and adds a local in-process fallback limiter so a counter-store outage degrades accuracy rather than disabling protection entirely.

Why does a fixed window counter allow bursts at the window boundary?

A fixed window resets its count at fixed clock boundaries, so a client can send the full limit in the last moment of one window and the full limit again in the first moment of the next. That produces up to twice the intended rate across the boundary. Sliding window algorithms remove this by counting over a rolling interval rather than a fixed one.

Can you enforce a global rate limit across multiple regions?

A strictly accurate global limit requires cross-region coordination on every request, and that latency is usually unacceptable. The practical approaches are to partition the limit per region (each region enforces limit divided by region count) or to replicate counters asynchronously and accept bounded over-admission. Naming this trade-off explicitly is what an interviewer is looking for.

Design a Rate Limiter: System Design Interview 2026

A rate limiter is the smallest component in most architectures and one of the most consequential: it sits in the critical path of every single request, and when it is wrong it either lets an abusive client melt your backend or it throttles legitimate users. That tension - microsecond-level latency budget against correctness under concurrency and failure - is why "design a rate limiter" is a sharp senior interview question. The data model is two numbers per client; the difficulty is making those two numbers correct when a thousand gateway nodes update them at once and the store underneath can disappear.

This walkthrough assumes the 6-step system design framework and applies it at the depth expected of a senior or staff candidate.

The Problem
Step 1 - Clarify Requirements
Step 2 - Estimate Scale
Step 3 - API and Data Model
Step 4 - High-Level Design
Step 5 - Deep Dive: Algorithms and Distributed Counters
Step 6 - Bottlenecks and Trade-offs
Reference Architecture
Common Mistakes in the Interview
Quick Reference
Related Articles

The Problem

We are designing a rate limiter that caps how many requests a given client can make within a time window - the mechanism behind "429 Too Many Requests" and the X-RateLimit-* headers on every serious API. It exists to protect backends from abuse and accidental overload, to enforce fair use across tenants, and to contain cascading failure when one client misbehaves.

The senior framing is that a rate limiter is a distributed counter in the hot path. Every request reads and mutates shared state, that state is contended by the whole gateway fleet, and the component must stay fast and safe even when its backing store is slow or gone. Almost every interesting decision flows from those three facts.

Step 1 - Clarify Requirements

Scope the problem out loud before designing. The clarifying questions here are themselves a signal.

Functional requirements:

Limit requests per client identity (API key, user ID, or IP) to N per time window.
Support multiple rules and tiers - a free tier and a paid tier enforce different limits.
Support per-endpoint granularity, not just a single global limit per client.
On rejection, return 429 with a Retry-After header; on every response, expose X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset.

Non-functional requirements:

Latency: the limiter runs on every request, so its own overhead must be tiny - target a p99 well under ~2 ms.
Availability: the limiter must not become a single point of failure. If it cannot reach its store, it needs a defined degraded behaviour (see fail-open vs fail-closed below).
Accuracy: strict exactness is not required. Slight over-admission under contention or failover is acceptable; the design question is how much and bounded by what.
Distributed correctness: the limit is global across the gateway fleet, so per-process counting is not enough - the nodes must share state.

The key clarifying question: is the limit per-process or global across the fleet? A per-process limit is trivial and needs no shared state, but N gateway nodes then admit N times the intended rate. A global limit is the realistic requirement, and it forces a shared counter store - which is where the whole design gets interesting.

Step 2 - Estimate Scale

The numbers here are unusual: the rate limiter's workload is the traffic it guards.

Throughput. Assume the platform peaks at 1,000,000 requests/sec across the gateway fleet. Every request triggers exactly one limiter check, so the counter store must sustain ~1M operations/sec. A single Redis instance handles roughly 100k-200k ops/sec, so this alone forces a sharded or clustered store - partitioned by client key.

State size. Suppose 10 million distinct clients (API keys plus tracked IPs). Each client needs a few numbers per rule - on the order of 100 bytes. Total counter state is roughly 1 GB, which sits comfortably in memory. The store is throughput-bound, not memory-bound.

Latency budget. A same-datacenter Redis round trip is ~0.5-1 ms. Since that lands on every request, co-locating the store with the gateway and keeping the check to a single round trip is not an optimisation - it is the design.

Step 3 - API and Data Model

A rate limiter is not a public API; it is a decision made by middleware. Its internal contract is:

checkLimit(clientId, rule) -> { allowed: bool, remaining: int, retryAfter: int }

The result maps onto standard HTTP semantics:

200 OK
  X-RateLimit-Limit: 1000
  X-RateLimit-Remaining: 742
  X-RateLimit-Reset: 1716400000
 
429 Too Many Requests
  Retry-After: 23

The per-client state depends on the algorithm chosen in the deep dive, and is deliberately small:

Algorithm	State per client/rule
Fixed window	One counter + a window key with TTL
Sliding window counter	Two counters: previous window, current window
Sliding window log	A sorted set of request timestamps - grows with traffic
Token bucket	Two values: token count, last refill timestamp

The state being tiny and fixed-size (for every option except the log) is what lets the limiter scale to millions of clients.

Step 4 - High-Level Design

The limiter is middleware inside the stateless API gateway. The gateway nodes hold no counter state themselves; they share a counter store partitioned by client key.

flowchart TD
    Client([Client]) --> GW[API Gateway<br/>stateless fleet]
    GW --> RL{Rate Limiter<br/>middleware}
    RL -->|atomic check| Store[(Counter Store<br/>Redis Cluster)]
    RL -->|allowed| Backend[Backend Services]
    RL -->|rejected| Deny[429 + Retry-After]
    RL -.->|store unreachable| Local[Local fallback limiter]

Figure 1. The limiter as middleware inside the stateless API gateway. The shared counter store is the single source of truth for per-client state, and the dashed arrow to the local fallback limiter is what keeps the system protective during a store outage instead of becoming a single point of failure.

Two properties matter. First, the limiter rejects abusive traffic at the edge, before it costs any backend capacity - the gateway is the cheapest place to say no. Second, each gateway node carries a local fallback limiter that engages only when the shared store is unreachable, so a store outage degrades accuracy instead of removing protection.

Step 5 - Deep Dive: Algorithms and Distributed Counters

This is the core of the problem. It has two halves: which counting algorithm to use, and how to mutate the shared counter correctly under fleet-wide concurrency.

Part A - The counting algorithms

A senior answer compares the options on burst behaviour, memory, and accuracy rather than naming one.

Fixed window counter. Count requests in a fixed clock window - say, per minute - and reset at the boundary. Implementation is a single atomic INCR with a TTL. It is the cheapest option, but it has a real flaw: a client can send the full limit in the last second of one window and the full limit again in the first second of the next, producing up to 2x the intended rate across the boundary.

flowchart LR
    A["Window 12:00:00-12:00:59<br/>100 requests at :59"] --> B["Window 12:01:00-12:01:59<br/>100 requests at :00"]
    B --> C["200 requests in ~2 seconds<br/>limit was 100/min"]

Figure 2. Why the fixed window's boundary is a flaw, not a quirk. A client sends the full limit at the end of one window and the full limit again at the start of the next, producing roughly twice the intended rate across the boundary - the very load the limiter exists to stop. Sliding-window algorithms remove this by counting over a rolling interval.

Sliding window log. Store the timestamp of every request in a sorted set and count entries within [now - window, now]. It is exactly accurate and has no boundary burst, but memory is O(requests) per client - a high-volume client can occupy megabytes. Use it only when exactness is worth the cost.

Sliding window counter. The pragmatic middle ground. Keep the previous and current fixed-window counts and estimate the rolling rate by weighting the previous window by how much it still overlaps "now":

estimated = current_count + previous_count * (overlap_fraction)

It is O(1) memory, removes almost all of the boundary burst, and is accurate enough for nearly every real system. This is the usual senior default.

Token bucket. A bucket of capacity C refills at rate R tokens/sec; each request consumes one token, and an empty bucket means rejection. It stores just {tokens, lastRefillTs} and deliberately permits bursts up to C while holding the long-run average at R - ideal for APIs that want to tolerate spiky-but-bounded clients. Refill is computed lazily on read, so there is no background timer.

Leaky bucket. A queue drained at a fixed rate. It shapes traffic into a perfectly smooth output stream with no bursts at all - the right choice when the downstream has a hard fixed capacity, but it adds queueing latency and is less common for API limiting.

Algorithm	Burst handling	Memory	Accuracy
Fixed window	Allows 2x at boundary	O(1)	Low
Sliding window log	None	O(requests)	Exact
Sliding window counter	Minimal	O(1)	High (approx.)
Token bucket	Allows bounded bursts	O(1)	High
Leaky bucket	None - smooths output	O(1) + queue	High

Part B - Atomic counters across the fleet

The algorithm is the easy half. The hard half is that hundreds of gateway nodes mutate the same client's counter concurrently. A naive read-modify-write races:

sequenceDiagram
    participant A as Gateway A
    participant B as Gateway B
    participant R as Redis (counter store)
 
    Note over R: client K has 1 token remaining
    A->>R: GET tokens for K
    R-->>A: 1
    B->>R: GET tokens for K
    R-->>B: 1
    Note over A: sees 1 - allow
    Note over B: sees 1 - allow
    A->>R: SET tokens = 0
    B->>R: SET tokens = -1
    Note over A,B: both requests admitted on a single token - over-admission

Figure 3. The naive read-then-write race that defeats a non-atomic counter. Both gateways read the same "1 token left", both decide to allow, and the system admits two requests on a single token. Naming this race out loud is what motivates the atomic Lua-script fix introduced next.

The fix is to make the entire refill-and-consume a single atomic operation. In Redis this means a Lua script: Redis executes scripts single-threaded, so the read, the refill calculation, the decrement, and the write happen indivisibly in one round trip.

-- token bucket: atomic refill + consume, KEYS[1] = client bucket
local state    = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens   = tonumber(state[1]) or tonumber(ARGV[3])  -- capacity
local last     = tonumber(state[2]) or tonumber(ARGV[4])  -- now
local rate     = tonumber(ARGV[1])                        -- tokens/sec
local capacity = tonumber(ARGV[2])
local now      = tonumber(ARGV[4])
 
tokens = math.min(capacity, tokens + (now - last) * rate) -- lazy refill
local allowed = tokens >= 1
if allowed then tokens = tokens - 1 end
 
redis.call('HMSET', KEYS[1], 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', KEYS[1], ARGV[5])
return allowed and 1 or 0

Prefer a Lua script over WATCH/MULTI: optimistic transactions retry under contention, and a hot client is exactly the high-contention case, so retries would spike latency precisely when you can least afford it. The Lua script is one round trip with no retry loop. For a plain fixed window, a bare INCR plus EXPIRE is already atomic and needs no script.

Consistency model. The counter store is the single source of truth and is strongly consistent for single-key operations - which is all the limiter uses. The weak spot is failover: if a Redis primary dies before replicating recent increments, the replica resumes with slightly stale counts and briefly over-admits. This is acceptable - rate limiting is a system that tolerates approximation by design - but you should state the bound rather than imply perfect accuracy.

Failure modes and degradation

Counter store unreachable. This is the decision every interviewer probes: fail-open or fail-closed? Fail-open preserves availability and suits user-facing latency-critical paths; fail-closed protects a fragile backend. The strong answer is that it is configurable per rule, and that each gateway node runs a local in-process fallback limiter sized to roughly globalLimit / nodeCount. When the store is gone, the fleet still enforces an approximate limit instead of either flooding the backend or rejecting everyone.
Hot key. One abusive client concentrates load on the single shard owning its key. The limiter is already rejecting most of that traffic, but the check itself still costs a round trip. Mitigations: a short-lived local decision cache so a client already known to be over-limit is rejected without touching Redis, or sharding one client's counter into K sub-counters across shards.
Latency spikes. Because the check is on every request, a slow store directly slows the whole platform. Co-locate the store with the gateway, pipeline where possible, and budget the check explicitly.

Multi-region

A strictly accurate global limit requires cross-region coordination on every request, and that round-trip latency is unacceptable in the hot path. State the trade-off and pick one:

Partition the limit per region. Each region enforces limit / regionCount locally with zero coordination. Simple and fast; slightly unfair if regional traffic is uneven.
Asynchronously replicate counters. Each region counts locally and gossips counts; the global view is eventually consistent, and over-admission is bounded by (regionCount - 1) x regionalBurst.

Strict global rate limiting across regions is a known hard problem - naming it as a trade-off, not solving it perfectly, is the senior signal.

Evolution path

Stage	Approach
Launch	In-process limiter; fine for a single node or sticky routing
Growth	Shared Redis, fixed window - simplest correct distributed limiter
Scale	Token bucket / sliding window counter via Lua, clustered store, local fallback, per-region partitioning

Build the middleware abstraction with a pluggable backend and the X-RateLimit-* header contract from day one - both are painful to retrofit. Defer multi-region and per-client counter sharding until traffic forces them.

Observability

Track the 429 rate per client and per rule, counter-store p99 latency, the limiter's own decision latency, and - critically - how often the local fallback engages, since fallback activation means the shared store is in trouble. Alert on fallback activation and on sudden 429 spikes, which usually indicate either an attack or a misconfigured client.

Step 6 - Bottlenecks and Trade-offs

Counter-store throughput is the first ceiling: ~1M ops/sec exceeds a single Redis node, so the store must be clustered and partitioned by client key from the start.
Per-request latency is the second: the limiter taxes every request, so the check must stay one local round trip.
Accuracy versus cost is the standing trade-off: the sliding window log is exact but memory-heavy, while the sliding window counter and token bucket are approximate but O(1) - and approximate is almost always the right call.
Concurrency correctness is non-negotiable: without an atomic check-and-decrement the limiter silently over-admits under exactly the load it exists to stop.

Reference Architecture

The pattern this problem teaches, reusable well beyond rate limiting:

An atomic shared counter mutated in the request hot path, fronted by a fast co-located store and backed by a local degraded fallback for when that store is unavailable.

flowchart LR
    subgraph Hot["Hot path - every request"]
        direction TB
        H1[Gateway middleware] --> H2{Atomic check<br/>Lua script}
        H2 --> H3[(Sharded counter store)]
    end
    subgraph Degraded["Degraded path - store down"]
        direction TB
        D1[Local fallback limiter]
    end
    Hot -.failover.-> Degraded

Figure 4. The reference architecture stripped to its two paths. The hot path mutates a shared counter through one atomic operation; the degraded path runs a local fallback when the counter store cannot be reached. The same shape applies to quotas, concurrency caps, and any contended-state-in-the-hot-path system.

The same shape recurs in API quota enforcement, concurrency limiting (caps on in-flight requests), and budget-bounded feature flags: a small piece of contended shared state, mutated atomically, with a defined answer for what happens when the shared state is unreachable.

Common Mistakes in the Interview

Proposing a fixed window without mentioning the 2x boundary-burst flaw.
A non-atomic read-modify-write counter - the silent over-admission bug that defeats the limiter under contention.
No story for a counter-store outage - failing to discuss fail-open versus fail-closed and a local fallback.
A limiter check that dominates request latency because it is slow, chatty, or not co-located.
Assuming a strict global limit across regions is free, instead of partitioning the limit or accepting bounded over-admission.
Choosing the sliding window log for high-volume clients and ignoring its O(requests) memory blow-up.
Per-process counting presented as a global limit - N nodes then admit N times the rate.

Quick Reference

Topic	Key Point
Placement	At the API gateway / edge; reject before backend cost is incurred
Default algorithm	Sliding window counter or token bucket - O(1) state, minimal burst
Fixed window flaw	Allows up to 2x the limit across the window boundary
Token bucket	Permits bounded bursts up to capacity; steady average refill rate
Atomicity	Single atomic check-and-decrement - Redis Lua script, not WATCH/MULTI
Store	Clustered/sharded by client key; throughput-bound, not memory-bound
Failure	Configurable fail-open/closed + local in-process fallback limiter
Multi-region	Partition the limit per region, or replicate counters eventually
Observability	429 rate, store p99, fallback activation rate

System Design Interview Problems: A Senior's Roadmap - the full series index and pattern library.
System Design Interview Guide: The 6-Step Framework - the method this walkthrough applies.
Design a URL Shortener - Part 1 of this series; another read-heavy system built on a shared store.
Design a Notification Service - Part 3; the token-bucket limiter reused to pace external providers.
Redis Interview Questions - the counter store in depth: atomic operations, Lua scripting, clustering, and persistence.

This is Part 2 of a 12-part system design series where each post solves one problem around one core pattern. Next: Design a Notification Service.

Design a Rate Limiter: System Design Interview 2026

Table of Contents

The Problem

Step 1 - Clarify Requirements

Step 2 - Estimate Scale

Step 3 - API and Data Model

Step 4 - High-Level Design

Step 5 - Deep Dive: Algorithms and Distributed Counters

Part A - The counting algorithms

Part B - Atomic counters across the fleet

Failure modes and degradation

Multi-region

Evolution path

Observability

Step 6 - Bottlenecks and Trade-offs

Reference Architecture

Common Mistakes in the Interview

Quick Reference

Ready to ace your interview?

Table of Contents

The Problem

Step 1 - Clarify Requirements

Step 2 - Estimate Scale

Step 3 - API and Data Model

Step 4 - High-Level Design

Step 5 - Deep Dive: Algorithms and Distributed Counters

Part A - The counting algorithms

Part B - Atomic counters across the fleet

Failure modes and degradation

Multi-region

Evolution path

Observability

Step 6 - Bottlenecks and Trade-offs

Reference Architecture

Common Mistakes in the Interview

Quick Reference

Related Articles

Ready to ace your interview?