How does a real-time chat system deliver one message to 50,000 online users in under 200ms · while sustaining 10B+ messages per day at peak 200K msg/sec?
Scope: Real-time delivery path only · not search, file storage, or push to offline users. The hard question: how does one message fan-out to thousands of live WebSocket connections without write amplification killing the system?
10B+
messages / day
~115K avg msg/sec
200K
msg/sec peak
~1.7· avg load
10M+
concurrent WebSockets
~20K Gateway servers
<200ms
delivery P99
~11ms same-region
Functional Requirements
What the system must do · the core user-facing behaviours
Must Have (Core)
1. User can send a text message to a channel or DM 2. All online members of a channel receive the message in real time (<200ms) 3. Messages are ordered · every user sees them in the same sequence 4.Offline users receive missed messages on reconnect (catch-up) 5. Users can see presence · who is currently online in a channel 6. Messages are persisted durably · never lost after the sender gets an ACK
Out of Scope (this design)
? Full-text message search (separate Elasticsearch pipeline) ?Push notifications to mobile when app is backgrounded (APNs / FCM) ?File / media uploads (separate blob storage path) ?Reactions, threads, edits, deletes (same delivery path, different events) ?User authentication and workspace management ?Read receipts at scale (separate aggregation problem)
Non-Functional Requirements
The quality constraints that shape every architectural decision
Property
Target
Why It Matters / Design Impact
Latency
P99 <200ms same-region · <250ms cross-region
Chat feels broken above ~500ms. Drives WebSocket over polling, Redis over DB for hot path.
Client reconnects + seq-based catch-up means a Gateway crash is invisible to the user within seconds.
Security
TLS everywhere · workspace isolation
All WS connections over WSS. Vitess shard-per-workspace prevents cross-tenant data leaks.
Key tension:Availability vs. Consistency. Slack chooses availability · a message may arrive 80ms later for a cross-region user, but the system never blocks. Ordering is guaranteed within a channel, not globally.
Scale Estimation
Given numbers from the question ? derive infrastructure sizing. Show this reasoning in an interview.
Given (from question):10B messages/day · 50,000 online users per channel · <200ms P99 delivery · 10M+ concurrent users
Step
What to Derive
Calculation
Result
Design Decision
1
Avg msg/sec
10B · 86,400 sec/day
~115K msg/sec
Baseline throughput · Kafka must sustain this 24/7
2
Peak msg/sec
115K · 1.7· peak factor
~200K msg/sec
Size Kafka partitions and fan-out workers for peak, not avg
3
Gateway servers needed
10M connections · 500K conn/server (epoll limit)
~20 Gateway servers
Each Gateway holds 500K persistent WebSockets; Redis maps user ? gateway
4
Fan-out ops/sec (worst case)
200K msg/sec · 50K members/channel
10B ops/sec ?
Impossible ? must skip offline. Only ~10% online ? 200K · 5K = 1B ops/sec ?
5
Kafka partitions
200K msg/sec · ~20 msg/sec per partition (safe throughput)
~10K partitions
1 partition per hot channel; hash-assigned for the rest. 1 fan-out worker per partition.
Kafka (~5ms) + Redis (~1ms) + gRPC (~3ms) + WS push (~2ms) = ~11ms. Well within 200ms budget.
Interview tip: Start with step 1·2 (traffic), then step 3 (connections), then step 4 (fan-out math) · that's where the key insight lives. The 10B ops/sec impossibility is what forces the hybrid fan-out design. Derive the architecture from the numbers, not the other way around.
Why WebSocket over SSE or long-polling? SSE is server?client only · can't send messages. Long-polling adds 1 RTT per message and hammers the server. WebSocket is full-duplex, persistent, and handles 500K connections per server with epoll. The only tradeoff is stateful connections requiring sticky routing · solved by the Redis user:{id}:gateway mapping.
?? HTTP/2 · Bearer Token Auth · JSON Responses
POST/v1/ws/connect
Returns the nearest Gateway WSS URL via GeoDNS. Token is short-lived (60s), used only for the WebSocket upgrade handshake.
200 OK401 Unauthorized403 Not in channel404 Channel not found
Idempotency:client_msg_id (client UUID) is the dedup key for REST retries · server returns 409 with the original response body. channel_seq (server-assigned monotonic) is the dedup key on the client side for WS delivery · duplicate frames are silently dropped.
? Fan-out Worker ? Gateway (Internal Only)
RPCPushMessages
Fan-out worker groups users by gateway, calls each gateway once with a batch. not_found = user disconnected since Redis lookup.
Why gRPC? Fan-out worker ? Gateway calls are high-frequency and latency-sensitive. gRPC gives multiplexed HTTP/2 connections, typed protobuf contracts, and built-in deadline propagation. A single fan-out worker batches all users on the same gateway into one PushMessages call · not one call per user.
Hot path (Redis, sub-ms) for presence and routing · Cold path (Vitess/MySQL) for durable message history
-- -- COLD PATH: Vitess/MySQL (sharded by workspace_id) ----------------------
CREATE TABLE messages (
workspace_id BIGINT NOT NULL, -- shard key: all workspace data on one shard
channel_id BIGINT NOT NULL,
channel_seq BIGINT NOT NULL, -- monotonic per channel (gap detection + backfill)
user_id BIGINT NOT NULL,
client_msg_id VARCHAR(64), -- client UUID for idempotent dedup (indexed)
content TEXT,
ts TIMESTAMP(6),
PRIMARY KEY (workspace_id, channel_id, channel_seq),
UNIQUE KEY uq_client_msg (workspace_id, user_id, client_msg_id)
-- PK doubles as the range-scan index for catch-up reads
);
CREATE TABLE channel_members (
workspace_id BIGINT NOT NULL,
channel_id BIGINT NOT NULL,
user_id BIGINT NOT NULL,
joined_at TIMESTAMP,
PRIMARY KEY (workspace_id, channel_id, user_id)
);
-- -- HOT PATH: Redis (sub-ms, ephemeral) -------------------------------------
channel:{id}:online ? SET of user_ids currently connected (updated on WS connect/disconnect)
user:{id}:gateway ? STRING "gateway-host-7" (which Gateway holds this user's WS)
channel:{id}:members ? SET of all user_ids (cached from MySQL, TTL-refreshed)
-- Fan-out worker reads channel:{id}:online ? groups by gateway ? gRPC batch push
-- On disconnect: DEL user:{id}:gateway + SREM channel:{id}:online {user_id}
Hot/cold split: Redis holds only live session state · it's ephemeral and can be rebuilt from MySQL on restart. MySQL holds the durable ordered log. This means a Redis failure doesn't lose messages; it only temporarily disrupts real-time delivery until presence is rebuilt.
Sharding key choice:workspace_id (not channel_id) keeps all data for a workspace on one shard, enabling SQL joins and transactions within a workspace. Large enterprise customers get a dedicated shard to prevent noisy-neighbour problems.
Fan-out Strategy · The Hard Part
A 50K-member channel with 200K msg/sec would require 10 billion writes/sec if you pushed to every member. The fix: push only to online members (~10% of total).
Channel Size
Strategy
Why
<100 members
Write-time fan-out · push to every member's Gateway immediately
Fan-out is tiny; latency matters more than write cost
100 · 50K members
Hybrid · push to online members only, skip offline
50K members but only ~5K online ? 90% less work. Offline users catch up on reconnect.
>50K members
Read-time fan-out · write to channel log; clients stream from cursor
Write amplification at this scale is prohibitive regardless of online ratio
Presence tracking:channel:{id}:online ? Redis SET, updated on WS connect/disconnect. Fan-out worker iterates this set · not the full membership list. For a 50K-member channel with 5K online, that's 90% fewer Gateway calls per message. See Scale Estimation (step 4) for the full math.
Deep dive: For channels with 1M+ members (Telegram scale), even the hybrid approach breaks. See Telegram Large Group Fan-out for the read-time pull architecture that handles 20· this scale.
Presence Protocol
How does the system know who is online? Heartbeat-based presence with Redis TTL · no gossip protocol needed at this scale
Heartbeat Mechanism
Client ping: Every 15 seconds via WebSocket Gateway timeout: No ping for 30 seconds ? close connection Redis TTL: Presence keys expire in 45 seconds (safety net) Jitter: Clients add ·3s random jitter to avoid thundering herd on reconnect
State Transitions
online ? WS connected + ping received within 30s away ? No user activity for 5 min (client sends status change) offline ? WS closed OR no ping for 30s OR Redis TTL expired Broadcast: Only on state transitions, not on every heartbeat
Scalability Considerations
No gossip: Centralized Redis · simpler than distributed gossip at 10M users Batch updates: Gateway batches presence changes every 5s to Redis (not per-ping) Large channels: Presence broadcast throttled to 1 update/sec for channels >10K Memory: 10M users · 50 bytes = ~500MB · fits in single Redis instance
Interview insight: Presence is an eventually consistent system · a user might appear online for up to 45s after crashing. This is acceptable for chat. The alternative (synchronous presence with consensus) would add latency to every message delivery.
Deep dive: At 100M+ concurrent users (WhatsApp scale), centralized Redis breaks. See Presence at 100M Scale for the distributed sharded presence architecture with subscription-based delivery.
Ordering & Consistency
All messages in #general must arrive in the same order for every user · achieved via Kafka partition-per-channel + monotonic sequence numbers
How Ordering Works
1. All messages for #general ? same Kafka partition (key = channel_id) 2. Single consumer per partition ? messages processed in strict order 3. Each message assigned channel_seq · monotonic increment per channel 4. Client tracks last received seq. Gap detected ? request backfill from Vitess 5. On reconnect: client sends last_seq=47 ? server returns msgs 48+
Failure Scenarios
WS drop: Client reconnects, sends last_seq ? backfill from Vitess Gateway crash: All users on that gateway reconnect; Redis mapping updated Duplicate delivery: Client deduplicates by channel_seq (idempotent render) @channel in 50K: Rate-limit fan-out, stagger over 1·2s to avoid thundering herd Cross-region lag: EU user sees msg ~80ms after US user · acceptable for chat
Guarantees Provided
Per-channel ordering: Kafka partition key = channel_id ensures strict order within a channel No cross-channel ordering: Different channels may be on different partitions · that's fine At-least-once delivery: Gaps detected client-side; backfill closes them Idempotent render: Duplicate messages silently dropped by seq dedup
Sequence Number Generation
How does the system assign monotonic channel_seq without becoming a bottleneck? Single-writer-per-channel via Kafka partition ownership.
Why This Works
Single writer: Kafka guarantees one consumer per partition ? no race conditions Redis INCR: Atomic O(1) operation, ~0.1ms latency ? no bottleneck Crash recovery: New consumer reads last seq from Redis ? resumes from correct position No gaps: Unlike DB auto-increment, Redis INCR never creates gaps (no rollbacks)
Edge Cases
Partition rebalance: New consumer reads current seq from Redis before processing Redis failure: Fall back to Vitess SELECT MAX(channel_seq) · slower but correct Duplicate INCR: If worker crashes after INCR but before Vitess write ? seq gap. Client backfill handles gracefully (empty gap = no-op) Hot channel: One partition = one worker. If channel is too hot, split into sub-channels at app layer.
Interview tip: This is a common follow-up: "How do you generate monotonic IDs without a single point of failure?" The answer is partition-level single writer · Kafka gives you distributed single-writer semantics for free. Each partition has exactly one active consumer, so no coordination needed.
Resilience & Edge Cases
Five failure modes and how the architecture handles each without data loss or visible gaps
#
Failure / Edge Case
What Breaks
How It's Handled
1
WebSocket drops mid-session
Client misses messages sent while disconnected
On reconnect, client sends last_seq ? server backfills from Vitess. Gap closed transparently.
2
Gateway server crash
All ~500K connections on that host lost
Clients reconnect to any Gateway. Redis user:{id}:gateway mapping updated. Fan-out resumes immediately.
3
Duplicate delivery
Same message rendered twice in UI
Client deduplicates by channel_seq · idempotent render, second copy silently dropped.
4
@channel in 50K-member channel
Thundering herd: 50K fan-out ops at once
Rate-limit fan-out worker; stagger delivery over 1·2s for huge channels. Acceptable for broadcast notifications.
5
Kafka consumer lag spike
Fan-out falls behind; messages delayed
Partition count = number of hot channels. Scale fan-out workers horizontally. Alert on consumer lag > threshold.
Key insight: The system tolerates delivery failures by making the client responsible for gap detection. The server never needs to track per-user delivery state · it just stores the ordered log and lets clients request what they missed.
Backpressure & Flow Control
What happens when a Gateway is overwhelmed, Kafka consumers lag, or a viral message hits a 50K channel? Layered defense from client to storage.
Key Principle: Graceful Degradation
Never lose messages: Even under extreme load, messages are persisted in Kafka + Vitess. Delivery may be delayed, but data is never lost. Prioritize small channels: Under load, large broadcast channels degrade first (switch to pull). DMs and small channels keep real-time delivery. Client self-heals: Gap detection + backfill means any dropped delivery is automatically recovered on the client side.
Monitoring & Alerts
Kafka consumer lag: Alert if any partition > 10K messages behind Gateway connection count: Alert at 80% capacity (400K connections) gRPC error rate: Alert if > 1% of PushMessages calls fail P99 delivery latency: Alert if > 500ms (2.5· budget) Circuit breaker state: Alert on any circuit opening
Interview insight: Interviewers love hearing about graceful degradation. The system doesn't crash under 10· load · it progressively reduces real-time guarantees for large channels while protecting small channels and DMs. Messages are never lost, only delayed.
Multi-Region Architecture
How does a message sent in US-East reach a user in EU-West within 250ms? GeoDNS + regional Kafka clusters + async cross-region replication
Replication Strategy
Kafka MirrorMaker 2: Async replication US?EU, EU?US Lag: ~60-80ms (network RTT between regions) Vitess: Read replicas in each region for history queries Redis: Independent per region · each region tracks its own online users
Write Routing
Workspace home region: All writes for a workspace go to its primary region GeoDNS: Routes user to nearest Gateway, but writes proxy to primary EU user sends msg: Gateway EU ? Msg Service US (primary) ? Kafka US ? replicate to EU Extra latency: ~80ms for cross-region write · acceptable for sender ACK
Failover
Region down: GeoDNS removes unhealthy region in ~30s Promote replica: EU Kafka becomes primary; Vitess promotes read replica Data loss window: Up to ~80ms of unreplicated messages (async replication) Recovery: Clients reconnect, send last_seq ? backfill closes any gaps
Key tradeoff: Async replication means EU users see messages ~80ms after US users. The alternative · synchronous cross-region writes · would add 80ms to every message for every user. Slack chooses availability over strict consistency across regions.
Deep dive: For systems requiring causal ordering guarantees across regions (not just eventual consistency), see Multi-Region Message Ordering · covers HLC, vector clocks, and conflict resolution during region failover.
Tech Stack & Tradeoffs
Each choice made for a specific reason · and what was rejected
Component
Technology
Why This
Why Not X
WebSocket Gateway
Custom (Go / Java)
500K conn per server via epoll; full control over heartbeat, reconnect, backpressure
Nginx/HAProxy: not designed for stateful long-lived connections at this density
Message Bus
Kafka
Partition by channel_id ? strict ordering. Durable, replayable. 1M+ msg/sec.
RabbitMQ: no replay, lower throughput, fan-out harder. SQS: no ordering across messages.
Primary Storage
Vitess (MySQL)
Shard by workspace_id · all workspace data co-located. Transparent resharding. SQL joins work within shard.
Cassandra: no transactions, harder to query by seq range. DynamoDB: vendor lock-in, cost at this scale.
REST/HTTP: higher overhead per call. Raw TCP: no service discovery, no load balancing.
Search
Elasticsearch
Full-text search across message history. Separate from delivery path · no latency coupling.
MySQL FULLTEXT: doesn't scale to 10B messages. Solr: operationally heavier.
DNS Routing
GeoDNS
Routes each user to nearest Gateway region. Reduces cross-region hops by ~80ms.
Single-region: unacceptable latency for global users. Anycast: complex to operate.
Decision rule:Stateful hot path (connections, presence) ? Redis. Ordered durable log (messages, fan-out) ? Kafka. Queryable history (catch-up, search) ? Vitess + Elasticsearch. Each component owns exactly one concern · no overlap.
Real-world validation:Slack uses this exact stack. Discord uses a similar pattern (Cassandra instead of Vitess). WhatsApp uses Erlang/BEAM for the gateway layer instead of Go/Java · same architectural shape, different runtime.
Related Deep Dives
Each problem below explores one aspect of this system at greater depth or different scale
The 8 things an interviewer wants to hear · say these and you've covered the design
? WebSocket + Redis routing
Each user holds one persistent WebSocket to a Gateway server. Redis maps user:{id} ? gateway-host. Fan-out worker looks up this mapping to know which gateway to call.
? Kafka partition per channel
Partition key = channel_id. All messages for a channel land on the same partition ? strict ordering guaranteed. One fan-out worker per partition ? no race conditions.
? Hybrid fan-out (skip offline)
Push only to online members (Redis SET). 50K members but 5K online = 90% less work. Channels >50K use read-time fan-out · write once to log, clients pull.
? Monotonic channel_seq + backfill
Every message gets a channel_seq. Client tracks last seen. Gap detected ? request backfill from Vitess. On reconnect, client sends last_seq ? server returns missed messages.
? Vitess sharded by workspace_id
All data for a workspace on one shard ? SQL joins work. Large enterprise gets dedicated shard. Vitess handles transparent resharding as workspaces grow.
? Presence via heartbeat + Redis TTL
Client pings every 15s. Gateway closes on 30s timeout. Redis SET with 45s TTL auto-expires stale presence. Broadcast only on state transitions · not every heartbeat. Eventually consistent (up to 45s stale).
? Multi-region: async replication + local fan-out
Writes go to workspace's primary region. Kafka MirrorMaker replicates to other regions (~60-80ms). Each region has local fan-out workers + Redis + Gateways. EU users see messages ~80ms after US users · acceptable for chat.
? Backpressure: 4-layer graceful degradation
Client rate limit ? Gateway admission control ? Fan-out circuit breaker ? Kafka consumer auto-scale. Under extreme load, large channels degrade to pull-based while DMs keep real-time. Messages never lost, only delayed.
One-liner answer:"WebSocket per user with Redis routing, Kafka partitioned by channel for ordering, hybrid fan-out to online-only members, monotonic sequence numbers for gap detection and catch-up, Vitess sharded by workspace, heartbeat-based presence with Redis TTL, async multi-region replication via MirrorMaker, and 4-layer backpressure for graceful degradation."