Why does a notification service need a message queue?

Delivery to third-party providers like APNs, FCM, or an SMS gateway is slow, rate-limited, and unreliable, so the producing service must not call them synchronously. A durable queue lets the producer enqueue in milliseconds and return immediately, absorbs traffic spikes such as a marketing campaign, and isolates provider outages from upstream services. Queue depth then becomes the natural backpressure signal.

Can a notification service guarantee exactly-once delivery?

No - exactly-once delivery is not achievable end to end, because the boundary between calling a provider and recording that it succeeded can never be made atomic. The realistic guarantee is at-least-once delivery combined with idempotency: deduplicate on an idempotency key at ingestion, and atomically claim a send marker before each provider call. The observable result is effectively-once.

How do you handle a notification provider failing?

Distinguish transient failures (timeouts, 503s, throttling) from permanent ones (invalid device token, unsubscribed user). Retry transient failures with exponential backoff and jitter up to a capped number of attempts; do not retry permanent failures. Messages that exhaust their retries move to a dead-letter queue for inspection and alerting, and a sustained provider outage can trigger failover to a secondary provider.

What is a dead-letter queue and why is it needed?

A dead-letter queue holds messages that could not be processed after exhausting their retries, including poison messages that fail every time. Without it, a permanently failing message either blocks the queue or is retried forever. The dead-letter queue keeps the main pipeline flowing and preserves failed messages for diagnosis, and its size is a direct health signal worth alerting on.

How do you stop a marketing campaign from delaying transactional notifications?

Separate the two priorities into physically distinct queues with their own worker pools. Transactional notifications such as order confirmations always have dedicated capacity, while bulk campaign traffic drains through its own pool with whatever throughput remains. Priority ordering within a single shared queue is fragile under load and tends to collapse exactly when a campaign is running.

How do you prevent users from receiving duplicate notifications?

Deduplicate at two points. At ingestion, reject a request whose idempotency key has already been seen within the dedup window. At delivery, atomically claim a per-notification send marker before calling the provider, so a redelivered message after a worker crash does not send twice. Idempotency keys are kept with a TTL that comfortably covers all retry windows, typically 24 to 48 hours.

Design a Notification Service: System Design Interview 2026

A notification service looks like a thin wrapper around "send an email" - until you notice that it sits between hundreds of upstream services and a handful of slow, rate-limited, occasionally-down third-party providers, and that a single marketing campaign can ask it to deliver fifty million messages in five minutes. The service exists precisely to absorb that mismatch. This is why it is a favourite senior interview problem: almost every decision is about handling failure and load asymmetry, not about the happy path.

This walkthrough assumes the 6-step system design framework and applies it at the depth expected of a senior or staff candidate. It is Part 3 of a system design series.

The Problem
Step 1 - Clarify Requirements
Step 2 - Estimate Scale
Step 3 - API and Data Model
Step 4 - High-Level Design
Step 5 - Deep Dive: Queue-Based Asynchronous Processing
Step 6 - Bottlenecks and Trade-offs
Reference Architecture
Common Mistakes in the Interview
Quick Reference
Related Articles

The Problem

We are designing a service that delivers notifications to users across multiple channels - mobile push, email, SMS, and in-app - on behalf of many upstream services. An order service wants to confirm a purchase, a social service wants to announce a new follower, a marketing team wants to blast a campaign. They all hand work to one notification service.

The senior framing is that this is an event-driven pipeline bridging a fast, reliable producer side and a slow, unreliable consumer side. The provider boundary - APNs, FCM, an email or SMS gateway - is where latency, rate limits, and outages live. Every important decision in the design is about decoupling from that boundary and staying correct when it misbehaves.

Step 1 - Clarify Requirements

Scope the problem out loud before designing.

Functional requirements:

Accept a notification request from any upstream service.
Deliver across multiple channels: push, email, SMS, in-app.
Render content from templates.
Respect user preferences: channel choice, opt-outs, and quiet hours.
Support two priority classes: transactional (an order confirmation - urgent, must not be lost) and bulk (a marketing campaign - tolerant of minutes of delay).

Out of scope (name them, then defer): the template-authoring UI and click/open analytics. We will note where analytics changes the design.

Non-functional requirements:

Throughput is spiky. Steady-state load is modest; a campaign produces a massive short burst.
Reliability. A transactional notification must not be silently dropped.
Latency is tiered. Transactional delivery should complete in seconds; bulk can take minutes.
The providers are the weak link. They are slow (100 ms+ per call), rate-limited, and they fail. The system must absorb that.
Idempotency. Upstream services retry; the same logical notification must not be sent twice.

The decisive clarifying question: what delivery guarantee do we promise? The honest answer is at-least-once, made effectively-once by idempotency. Exactly-once is not achievable end to end, and a candidate who claims it has missed the central difficulty of the problem. Establishing this early shapes everything that follows.

Step 2 - Estimate Scale

Make the arithmetic visible; it justifies the queue, the worker count, and the storage.

Throughput. Assume 1 billion notifications/day.

Average: 1B / 86,400 s ≈ 11,600/sec.
A campaign of 50 million messages pushed in ~5 minutes adds 50M / 300 s ≈ 167,000/sec on top - so design for a peak near 200,000/sec while the steady state is roughly 17 times lower. This spike-to-average ratio is the whole reason a buffer exists.

Worker count. A provider call averages ~150 ms. A single worker thread doing one call at a time manages 1 / 0.15 s ≈ 7/sec; with concurrency of 50 per worker process, ≈ 330/sec. Steady state therefore needs on the order of 35 worker processes (11,600 / 330), and the ~200,000/sec peak needs roughly 600, so the fleet must autoscale roughly 15-20x. Workers are cheap and stateless - the queue lets you add them on demand.

Storage. A status record per notification is ~300 bytes. At 1B/day that is ~300 GB/day of status and idempotency data. It does not need to live forever: idempotency keys carry a 24-48 hour TTL, and status records are archived to cold storage after a short retention window.
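The estimates above can be reproduced in a few lines. This is a minimal sketch: every input is one of the assumptions stated in this step (1 billion/day, a 50-million campaign in 5 minutes, 150 ms per provider call, concurrency of 50, 300-byte records), and the variable names are illustrative only.

```python
# Back-of-envelope scale estimate. All inputs are the assumptions from Step 2.
SECONDS_PER_DAY = 86_400

daily_notifications = 1_000_000_000
campaign_size = 50_000_000
campaign_window_s = 5 * 60
provider_call_s = 0.150           # average provider latency
concurrency_per_process = 50      # in-flight provider calls per worker process
status_record_bytes = 300

avg_rate = daily_notifications / SECONDS_PER_DAY        # about 11,574/sec
campaign_rate = campaign_size / campaign_window_s       # about 166,667/sec
peak_rate = avg_rate + campaign_rate                    # about 178,000/sec

per_process_rate = concurrency_per_process / provider_call_s  # about 333/sec
steady_processes = avg_rate / per_process_rate                # about 35
peak_processes = peak_rate / per_process_rate                 # about 535

storage_gb_per_day = daily_notifications * status_record_bytes / 1e9

print(f"average rate     : {avg_rate:,.0f}/sec")
print(f"campaign rate    : {campaign_rate:,.0f}/sec")
print(f"peak rate        : {peak_rate:,.0f}/sec")
print(f"worker processes : {steady_processes:.0f} steady, {peak_processes:.0f} peak")
print(f"autoscale factor : {peak_processes / steady_processes:.1f}x")
print(f"status storage   : {storage_gb_per_day:.0f} GB/day")
```

Rounding the peak up to 200,000/sec, as the text does, leaves headroom on top of the computed figure; the autoscale factor lands inside the 15-20x range either way.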

Step 3 - API and Data Model

The ingestion API is asynchronous by contract. It accepts work and returns immediately - it does not wait for delivery.

POST /api/notifications
  body: { "idempotencyKey": "...", "userId": "...", "templateId": "...",
          "payload": { ... }, "priority": "transactional" }
  202 Accepted   { "notificationId": "..." }

Returning 202 Accepted, not 200 OK, is a deliberate signal: the notification is durably queued, not delivered. Promising 200 would force the slow provider call into the request path - the exact mistake the architecture exists to avoid.
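To make the contract concrete, here is a minimal sketch of the ingestion handler. It is not a real web framework: `post_notification`, `idempotency_store`, and `queues` are hypothetical names, and the in-memory dict and deques stand in for a durable idempotency store and durable queues. TTL expiry of keys is left out.

```python
import uuid
from collections import deque

# Hypothetical in-memory stand-ins for the idempotency store and the durable queues.
idempotency_store: dict[str, str] = {}
queues: dict[str, deque] = {"transactional": deque(), "bulk": deque()}

REQUIRED_FIELDS = ("idempotencyKey", "userId", "templateId", "priority")


def post_notification(body: dict) -> tuple[int, dict]:
    """Validate, deduplicate, enqueue, return. No provider call happens here."""
    if any(field not in body for field in REQUIRED_FIELDS):
        return 400, {"error": "missing required field"}
    if body["priority"] not in queues:
        return 400, {"error": "unknown priority"}

    candidate_id = str(uuid.uuid4())
    # setdefault mimics a set-if-absent write: the first writer for a key wins.
    notification_id = idempotency_store.setdefault(body["idempotencyKey"], candidate_id)
    if notification_id != candidate_id:
        # Key already seen (an upstream retry): return the original id, enqueue nothing.
        return 202, {"notificationId": notification_id}

    queues[body["priority"]].append({"notificationId": notification_id, **body})
    return 202, {"notificationId": notification_id}
```

Calling the handler twice with the same `idempotencyKey` returns the same `notificationId` both times and leaves exactly one message on the queue, which is the behaviour an upstream retry needs.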

The core entities:

Entity	Key fields
Notification	`id`, `idempotencyKey`, `userId`, `templateId`, `priority`, `channel`, `status`, `attempts`, `createdAt`
Idempotency record	`idempotencyKey` -> `notificationId`, with a 24-48h TTL
User preferences	`userId` -> channels, opt-outs, quiet hours, device tokens

A notification moves through a well-defined lifecycle, and naming those states makes the failure handling concrete:

stateDiagram-v2
    [*] --> QUEUED: ingested
    QUEUED --> PROCESSING: worker picks up
    PROCESSING --> SENT: provider accepted
    PROCESSING --> RETRYING: transient failure
    RETRYING --> PROCESSING: backoff elapsed
    RETRYING --> DEAD_LETTER: max attempts reached
    PROCESSING --> FAILED: permanent error
    SENT --> DELIVERED: receipt webhook
    DEAD_LETTER --> [*]
    FAILED --> [*]
    DELIVERED --> [*]

Figure 1. The notification lifecycle as an explicit state machine. The split between SENT (we handed the message to the provider) and DELIVERED (the provider confirmed device receipt) is deliberate: one is strongly known, the other depends on an unreliable receipt webhook - and conflating the two is how teams overpromise their delivery guarantee.

The consistency model in Step 5 returns to this split: the system knows SENT strongly and DELIVERED only weakly.
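One way to keep the lifecycle honest in code is to encode the diagram's edges as a transition table and reject anything else. The sketch below is illustrative; the `Status` enum and `transition` helper are hypothetical names, and the allowed edges are copied from the state diagram.

```python
from enum import Enum


class Status(Enum):
    QUEUED = "QUEUED"
    PROCESSING = "PROCESSING"
    RETRYING = "RETRYING"
    SENT = "SENT"
    DELIVERED = "DELIVERED"
    FAILED = "FAILED"
    DEAD_LETTER = "DEAD_LETTER"


# Allowed edges, taken one-for-one from the lifecycle diagram.
ALLOWED = {
    Status.QUEUED: {Status.PROCESSING},
    Status.PROCESSING: {Status.SENT, Status.RETRYING, Status.FAILED},
    Status.RETRYING: {Status.PROCESSING, Status.DEAD_LETTER},
    Status.SENT: {Status.DELIVERED},
    Status.DELIVERED: set(),
    Status.FAILED: set(),
    Status.DEAD_LETTER: set(),
}


def transition(current: Status, new: Status) -> Status:
    """Return the new status, or raise if the diagram has no such edge."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new
```

With this table a worker cannot jump from QUEUED straight to SENT, and nothing can leave a terminal state, so a bug in retry handling surfaces as an exception rather than as a silently wrong status.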

Step 4 - High-Level Design

The architecture is a pipeline. Each stage does one thing and is separated from the next by a durable queue.

flowchart TD
    Up[Upstream Services] --> API[Ingestion API]
    API -->|dedup check| Idem[(Idempotency Store)]
    API -->|enqueue| QT[Transactional Queue]
    API -->|enqueue| QB[Bulk Queue]
    QT --> WT[Transactional Workers]
    QB --> WB[Bulk Workers]
    WT --> Disp[Channel Dispatchers]
    WB --> Disp
    Disp -->|throttled| Prov[Push / Email / SMS Providers]
    Disp -->|transient fail| RQ[Retry Queue]
    RQ --> Disp
    Disp -->|exhausted / poison| DLQ[Dead-Letter Queue]
    WT -.read.-> Pref[(User Preferences)]
    Disp -.update.-> Status[(Status Store)]

Figure 2. The pipeline architecture. The ingestion API does only the fast work - dedup, enqueue, return 202. Every slow or unreliable step happens behind a queue, and the transactional and bulk priorities have physically separate queues so a marketing campaign cannot delay an order confirmation.

The ingestion API is thin: validate, deduplicate, enqueue, return 202. Workers render templates, apply user preferences, and split a notification into per-channel deliveries. Channel dispatchers make the actual provider calls under a per-provider rate limit. Failures flow into a retry queue and, when exhausted, into a dead-letter queue. Every tier is stateless and scales horizontally; the queues are the backbone.

Step 5 - Deep Dive: Queue-Based Asynchronous Processing

This is the core of the problem. Four things make the pipeline correct under load and failure: the decoupling itself, the delivery semantics, the retry and dead-letter machinery, and provider-aware throttling.

Part A - Why the queue is the architecture

Putting a durable queue between the producer and the workers is not an implementation detail; it is the design.

The producer never waits. It enqueues in single-digit milliseconds and returns 202. A provider being slow has zero effect on upstream latency.
The queue absorbs spikes. A 50-million-message campaign lands in the queue almost instantly; workers then drain it at a rate the providers can survive. The buffer converts a 200,000/sec spike into a steady, controlled outflow.
Queue depth is the backpressure signal. When providers slow down, workers acknowledge messages more slowly, and the queue grows. That growth is observable and actionable - autoscale workers, or shed bulk load - rather than a hidden cascading failure.
Failure domains are isolated. A provider outage fills a queue; it does not propagate back into the order service.
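The spike-absorption argument can be checked with a toy simulation. This is a sketch under stated assumptions: the arrival rate is the campaign figure from Step 2, while the 60,000/sec drain rate is a hypothetical number standing in for "what the providers can survive".

```python
def simulate_queue(arrivals_per_s: list[int], drain_per_s: int) -> list[int]:
    """Return the queue depth after each one-second tick."""
    depth = 0
    history = []
    for arriving in arrivals_per_s:
        # Workers remove at most drain_per_s messages per second.
        depth = max(0, depth + arriving - drain_per_s)
        history.append(depth)
    return history


# A 5-minute campaign burst followed by 10 quiet minutes.
burst = [167_000] * 300 + [0] * 600
depths = simulate_queue(burst, drain_per_s=60_000)  # drain rate is an assumption

peak_depth = max(depths)
seconds_to_drain_after_burst = depths.index(0) + 1 - 300

print(f"peak queue depth  : {peak_depth:,}")
print(f"drained after burst: {seconds_to_drain_after_burst} s")
```

Under these numbers the queue peaks at about 32 million messages and empties roughly nine minutes after the burst ends. The producers saw none of that; the only visible effect is a depth metric that rises and then recovers.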

Part B - Delivery semantics: at-least-once plus idempotency

A durable queue gives at-least-once delivery. A worker can crash after calling the provider but before acknowledging the message; the queue, seeing no ack, redelivers it, and a second worker sends the notification again. You cannot eliminate this: the provider call and the ack are two separate steps with no atomic bracket around them. Exactly-once is therefore impossible end to end - the achievable goal is at-least-once delivery made effectively-once by deduplication at two points:

At ingestion. The producer supplies an idempotencyKey. The API checks the idempotency store; a key already seen returns the original notificationId without enqueuing again. This absorbs upstream retries.
At delivery. Just before calling the provider, the dispatcher atomically claims a send marker for that notification (a conditional write - "set sent=true if not already set"). If the claim fails, a previous attempt already sent it, and the dispatcher skips the provider call. This closes the worker-crash redelivery gap.

Idempotency keys are retained with a TTL of 24-48 hours - long enough to cover every realistic retry window, short enough to keep the store bounded. Where the provider itself accepts an idempotency token, pass it through for a third layer of protection.
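The delivery-time claim can be sketched as follows. `SendMarkerStore` is a hypothetical in-memory stand-in for a store that supports conditional writes; a lock plays the role of the store's atomicity, and `call_provider` is whatever function performs the actual send.

```python
import threading
from typing import Callable


class SendMarkerStore:
    """In-memory stand-in for a store with an atomic set-if-absent operation."""

    def __init__(self) -> None:
        self._claimed: set[str] = set()
        self._lock = threading.Lock()

    def claim(self, notification_id: str) -> bool:
        """Return True only for the first caller to claim this notification."""
        with self._lock:
            if notification_id in self._claimed:
                return False
            self._claimed.add(notification_id)
            return True


def dispatch(notification_id: str, markers: SendMarkerStore,
             call_provider: Callable[[str], None]) -> str:
    """Call the provider only if this attempt wins the send-marker claim."""
    if not markers.claim(notification_id):
        return "skipped"          # an earlier attempt already claimed the send
    call_provider(notification_id)
    return "sent"
```

If the queue redelivers the same message after a worker crash, the second `dispatch` call returns `"skipped"` and the provider is called once. One consequence worth saying aloud in an interview: a crash after the claim but before the provider call leaves a marker with no send, so a real implementation might record the claim with a lease or timestamp that lets a stale claim be retried.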

Part C - Retries, backoff, and the dead-letter queue

Provider failures split into two kinds, and conflating them is a classic mistake:

Transient - timeouts, 503s, throttling responses. Retry these.
Permanent - an invalid device token, an unsubscribed address, a malformed payload. Retrying these is pure waste; mark them FAILED and, for a dead token, trigger cleanup.

Transient retries must use exponential backoff with jitter. Backoff prevents hammering a struggling provider; jitter prevents every failed message in a spike from retrying in lockstep and re-creating the spike.

delay = min(cap, base * 2^attempt) * random(0.5, 1.0)

Retries are capped - typically around five attempts. A message that exhausts them, or a poison message that fails deterministically every time, must not circle forever or block the queue. It moves to a dead-letter queue: a separate queue holding unprocessable messages for inspection, manual replay, and alerting. The DLQ's size is itself a health metric.
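The classification, the backoff formula, and the attempt cap fit together as sketched below. The error labels, the one-second base, the 60-second cap, and the function names are illustrative assumptions; only the formula and the "around five attempts" cap come from the text.

```python
import random

# Hypothetical error labels; a real dispatcher would map provider responses to these.
PERMANENT = {"invalid_token", "unsubscribed", "malformed_payload"}
MAX_ATTEMPTS = 5


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0, rng=random) -> float:
    """delay = min(cap, base * 2^attempt) * random(0.5, 1.0)"""
    return min(cap, base * 2 ** attempt) * rng.uniform(0.5, 1.0)


def next_step(error: str, attempts_made: int):
    """Decide what happens after a failed attempt.

    Returns (new_status, delay_seconds); delay is None unless retrying.
    Anything not listed as permanent is treated as transient in this sketch.
    """
    if error in PERMANENT:
        return "FAILED", None                  # retrying would be pure waste
    if attempts_made >= MAX_ATTEMPTS:
        return "DEAD_LETTER", None             # retries exhausted
    return "RETRYING", backoff_delay(attempts_made)
```

With these defaults the delay after the first failure falls between 1 and 2 seconds, and from the sixth doubling onward it is pinned between 30 and 60 seconds by the cap. The random factor is what keeps a spike of failures from retrying in lockstep.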

sequenceDiagram
    participant W as Dispatcher
    participant P as Provider
    participant R as Retry Queue
    participant D as Dead-Letter Queue
 
    W->>P: deliver (attempt 1)
    P-->>W: 503 - transient
    W->>R: requeue with backoff + jitter
    R->>W: redeliver after delay
    Note over W,R: intermediate retries omitted
    W->>P: deliver (final attempt)
    P-->>W: timeout
    Note over W: attempts exhausted (cap reached)
    W->>D: move to dead-letter queue
    Note over D: alert fires on DLQ growth

Figure 3. The retry-and-dead-letter sequence. A transient provider failure is retried with exponential backoff and jitter; once the attempt cap is reached the message moves to the dead-letter queue rather than circling forever. The DLQ's growth is the headline alerting signal for a struggling provider.

Part D - Provider rate limits and channel isolation

Every provider enforces its own rate limit, and exceeding it gets you throttled or banned. Each channel dispatcher therefore sits behind its own token-bucket rate limiter tuned to that provider's allowance - the same primitive built in Part 2, reused here to pace outflow rather than reject callers.

Channels must also be isolated from one another: a slow SMS gateway must not starve push delivery. Give each channel its own queue and worker pool so one degraded provider cannot consume the whole fleet's capacity.
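A token bucket used for pacing, rather than rejecting, can be sketched like this. The class below is a minimal illustration, not the implementation from Part 2; the injectable `clock` parameter is an assumption added so the behaviour can be tested without sleeping.

```python
import time


class TokenBucket:
    """Paces outbound provider calls: one token per call, refilled at a fixed rate."""

    def __init__(self, rate_per_s: float, capacity: float, clock=time.monotonic) -> None:
        self.rate = rate_per_s
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def _refill(self) -> None:
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def try_acquire(self) -> bool:
        """Take one token if available; the dispatcher may then call the provider."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def seconds_until_next(self) -> float:
        """How long the dispatcher should wait before a token becomes available."""
        self._refill()
        return max(0.0, (1 - self.tokens) / self.rate)
```

A dispatcher would hold one bucket per provider, each tuned to that provider's allowance, and sleep for `seconds_until_next()` when `try_acquire()` returns False. Because the message stays on the queue while the dispatcher waits, the limiter slows outflow without dropping anything.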

Priority isolation

Transactional and bulk traffic get physically separate queues and worker pools. An order confirmation must never sit behind ten million campaign messages. Priority ordering inside a single shared queue looks simpler but is fragile: under exactly the load that matters - a campaign in progress - it tends to collapse. Separate queues guarantee the transactional pool always has capacity.

Consistency model

The status store is eventually consistent with reality. SENT is known strongly - we received the provider's acceptance. DELIVERED is known only weakly: it depends on an asynchronous delivery-receipt webhook that some channels send late and some never send at all. State this explicitly. The system can promise "handed to the provider"; it cannot, by itself, promise "appeared on the device".

Failure modes

Worker crash mid-delivery. The message is redelivered; the delivery-time send-marker claim prevents a duplicate send.
Queue unavailable. Ingestion cannot enqueue, so the API returns 503 and relies on producer retries; a transactional path may add a small durable local buffer as a safety net.
Total provider outage. Messages accumulate in the retry queue and then the DLQ; alerts fire, and traffic for that channel can fail over to a secondary provider.
Poison message. Capped retries plus the DLQ guarantee one bad message never stalls the pipeline.

Multi-region

Run the pipeline per region with regional queues and worker pools; providers are globally reachable. Replicate the user-preference and device-token stores. The subtle point is the idempotency store: if a producer's retries can land in different regions, a local dedup check misses the duplicate. The clean fix is region affinity by userId - route all of a user's notifications to one region - so the dedup check stays local and fast instead of forcing a slow globally-replicated store.
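Region affinity by userId amounts to a stable mapping from user to region. The sketch below uses a plain hash-modulo mapping; the region names and the `home_region` function are hypothetical.

```python
import hashlib

REGIONS = ["us-east", "eu-west", "ap-south"]  # hypothetical region names


def home_region(user_id: str, regions: list[str] = REGIONS) -> str:
    """Map a user to one region deterministically, so retries land in the same place."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(regions)
    return regions[index]
```

Every ingestion node computes the same answer for the same userId, so a producer's retry is routed to the region that already holds the idempotency key. A plain modulo mapping reshuffles most users when the region list changes; a consistent-hashing scheme would move fewer of them, at the cost of more machinery.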

Evolution path

Stage	Approach
Launch	Synchronous send inside the app; one provider per channel
Growth	Introduce a durable queue and stateless workers; add retries with backoff
Scale	Priority queues, per-provider rate limiting, dead-letter queue, multi-region, provider failover

Build the asynchronous 202 contract and the idempotency-key contract from day one - both are painful to retrofit because they are promises to every upstream caller. Defer multi-region, provider failover, and elaborate priority schemes until volume demands them.

Observability

The single most important metric is queue depth per priority and channel - the backpressure signal that reveals trouble before users do. Track also: delivery success rate per provider, retry rate, dead-letter queue size, and end-to-end latency per priority tier (p50/p99). A reasonable SLO: 99% of transactional notifications handed to a provider within 30 seconds. Alert on DLQ growth and on any queue depth trending up without recovery.
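The latency percentiles and the SLO check can be computed from a window of hand-off latencies. This is a sketch: the nearest-rank percentile method and the function names are choices made for illustration, and the 30-second threshold and 99% target are the SLO figures suggested above.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile, with pct in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def slo_met(latencies_s: list[float], threshold_s: float = 30.0, target: float = 0.99) -> bool:
    """True if at least `target` of notifications reached a provider within threshold_s."""
    if not latencies_s:
        raise ValueError("no samples")
    within = sum(1 for latency in latencies_s if latency <= threshold_s)
    return within / len(latencies_s) >= target
```

In practice these would be evaluated per priority tier over a sliding window, so a slow bulk campaign does not mask or distort the transactional numbers.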

Step 6 - Bottlenecks and Trade-offs

Provider throughput is the ultimate ceiling. Mitigate with request batching where the provider supports it, multiple provider accounts, and failover to a secondary provider.
Queue throughput is the next ceiling - partition queues by channel and priority so no single queue is a chokepoint.
The idempotency store is on the hot path - every notification reads and writes it, so size it for the ~200,000/sec peak rather than the ~11,600/sec average.
Campaign spikes versus transactional latency is the standing tension, resolved by physical queue separation rather than in-queue prioritisation.
Retry storms are self-inflicted: backoff without jitter synchronises every retry and re-spikes a provider that is already struggling.

Reference Architecture

The pattern this problem teaches, reusable far beyond notifications:

An event-driven pipeline: a durable queue between a fast producer and slow, unreliable consumers, with retries under backoff, a dead-letter queue for the unprocessable, and idempotent delivery turning at-least-once into effectively-once.

flowchart LR
    subgraph Fast["Producer side - fast, reliable"]
        F1[Ingestion API] --> F2[(Durable Queue)]
    end
    subgraph Slow["Consumer side - slow, unreliable"]
        S1[Workers] --> S2[Rate-limited dispatch]
        S2 --> S3[External providers]
    end
    F2 --> S1
    S2 -.retry / backoff.-> S1
    S2 -.exhausted.-> DLQ[(Dead-Letter Queue)]

Figure 4. The reference architecture made explicit: a fast, reliable producer side and a slow, unreliable consumer side, with a durable queue absorbing the mismatch and a dead-letter queue catching what the consumers cannot. The same shape applies to any integration with a slow external system.

The same shape recurs whenever you integrate with a slow or unreliable external system: webhook delivery, payment capture, third-party data sync, document processing. A durable queue, idempotent consumers, bounded retries, and a dead-letter queue is the default toolkit for "make an unreliable boundary reliable".

Common Mistakes in the Interview

Claiming exactly-once delivery. It is impossible end to end; the correct answer is at-least-once plus idempotency.
Synchronous provider calls in the request path, defeating the entire point of the service.
Retrying without jitter, which synchronises retries into a storm that re-overwhelms the provider.
One queue for transactional and bulk traffic, letting a marketing campaign delay order confirmations.
No dead-letter queue, so a poison message either blocks the pipeline or is retried forever.
Ignoring per-provider rate limits, which gets the platform throttled or banned.
Not deduplicating, so an upstream retry delivers the same notification several times.
Treating SENT as DELIVERED - conflating "handed to the provider" with "reached the device".

Quick Reference

Topic	Key Point
Core pattern	Durable queue decoupling a fast producer from slow, unreliable consumers
API contract	Asynchronous: return `202 Accepted`, never block on the provider
Delivery guarantee	At-least-once + idempotency = effectively-once; exactly-once is impossible
Idempotency	Dedup at ingestion (key) and at delivery (atomic send-marker claim)
Retries	Exponential backoff with jitter; cap attempts; classify transient vs permanent
Dead-letter queue	Holds exhausted and poison messages; its size is a health signal
Priority	Physically separate queues and worker pools for transactional vs bulk
Providers	Per-provider token-bucket rate limiting; isolate channels from each other
Consistency	`SENT` is strong, `DELIVERED` depends on a weak async receipt webhook
Multi-region	Region affinity by `userId` keeps the idempotency check local
Observability	Queue depth per priority/channel; DLQ size; success rate per provider

Related Articles

System Design Interview Problems: A Senior's Roadmap - the full series index and pattern library.
System Design Interview Guide: The 6-Step Framework - the method this walkthrough applies.
Design a Rate Limiter - Part 2; the token-bucket primitive reused here to pace provider calls.
Design a Distributed Cache - Part 4; consistent hashing for the partitioned stores this pipeline relies on.
Apache Kafka Interview Questions - the durable queue and consumer-group mechanics behind this pipeline.

This is Part 3 of a 12-part system design series where each post solves one problem around one core pattern. Next: Design a Distributed Cache.
