How does Slack handle 10+ billion messages per day?

How does a real-time chat system deliver one message to 50,000 online users in under 200ms · while sustaining 10B+ messages per day at peak 200K msg/sec?

Scope: Real-time delivery path only · not search, file storage, or push to offline users. The hard question: how does one message fan-out to thousands of live WebSocket connections without write amplification killing the system✗

10B+

messages / day

~115K avg msg/sec

200K

msg/sec peak

~1.7· avg load

10M+

concurrent WebSockets

~20K Gateway servers

<200ms

delivery P99

~11ms same-region

Functional Requirements

What the system must do · the core user-facing behaviours

Must Have (Core)

1. User can send a text message to a channel or DM
2. All online members of a channel receive the message in real time (<200ms)
3. Messages are ordered · every user sees them in the same sequence
4. Offline users receive missed messages on reconnect (catch-up)
5. Users can see presence · who is currently online in a channel
6. Messages are persisted durably · never lost after the sender gets an ACK

Out of Scope (this design)

✗ Full-text message search (separate Elasticsearch pipeline)
✗ Push notifications to mobile when app is backgrounded (APNs / FCM)
✗ File / media uploads (separate blob storage path)
✗ Reactions, threads, edits, deletes (same delivery path, different events)
✗ User authentication and workspace management
✗ Read receipts at scale (separate aggregation problem)

Non-Functional Requirements

The quality constraints that shape every architectural decision

Property	Target	Why It Matters / Design Impact
Latency	P99 <200ms same-region · <250ms cross-region	Chat feels broken above ~500ms. Drives WebSocket over polling, Redis over DB for hot path.
Throughput	200K msg/sec peak sustained	Requires Kafka (1M+ msg/sec headroom), partitioned fan-out workers, and Gateway horizontal scale.
Availability	99.99% (<52 min/year downtime)	No single point of failure. Gateway, Kafka, Redis, Vitess all multi-node. GeoDNS for region failover.
Durability	Zero message loss after producer ACK	Kafka `acks=all` + `min.insync.replicas=2`. Vitess synchronous replication within shard.
Consistency	Per-channel ordering (not global)	All members see messages in the same order within a channel. Cross-channel ordering not required.
Scalability	Linear scale to 10M+ concurrent WS	Stateless fan-out workers + stateful Gateways (each holds ~500K connections). Add nodes independently.
Fault tolerance	Survive single-node failures transparently	Client reconnects + seq-based catch-up means a Gateway crash is invisible to the user within seconds.
Security	TLS everywhere · workspace isolation	All WS connections over WSS. Vitess shard-per-workspace prevents cross-tenant data leaks.

Key tension: Availability vs. Consistency. Slack chooses availability · a message may arrive 80ms later for a cross-region user, but the system never blocks. Ordering is guaranteed within a channel, not globally.

Scale Estimation

Given numbers from the question → derive infrastructure sizing. Show this reasoning in an interview.

Given (from question): 10B messages/day · 50,000 online users per channel · <200ms P99 delivery · 10M+ concurrent users

Step	What to Derive	Calculation	Result	Design Decision
1	Avg msg/sec	10B · 86,400 sec/day	~115K msg/sec	Baseline throughput · Kafka must sustain this 24/7
2	Peak msg/sec	115K · 1.7· peak factor	~200K msg/sec	Size Kafka partitions and fan-out workers for peak, not avg
3	Gateway servers needed	10M connections · 500K conn/server (epoll limit)	~20 Gateway servers	Each Gateway holds 500K persistent WebSockets; Redis maps user → gateway
4	Fan-out ops/sec (worst case)	200K msg/sec · 50K members/channel	10B ops/sec ✗	Impossible → must skip offline. Only ~10% online → 200K · 5K = 1B ops/sec ✔
5	Kafka partitions	200K msg/sec · ~20 msg/sec per partition (safe throughput)	~10K partitions	1 partition per hot channel; hash-assigned for the rest. 1 fan-out worker per partition.
6	Message storage / day	10B msgs · 1KB avg payload	~10TB/day	Vitess sharded by workspace_id. 7-day retention → ~70TB total. Kafka retention separate.
7	Redis memory (presence)	10M users · 50 bytes (user_id + gateway mapping)	~500MB	Fits comfortably in Redis. Channel online sets add ~100MB more. Total <1GB hot data.
8	Latency budget breakdown	200ms budget - 80ms cross-region gRPC = 120ms same-region budget	~11ms actual	Kafka (~5ms) + Redis (~1ms) + gRPC (~3ms) + WS push (~2ms) = ~11ms. Well within 200ms budget.

Interview tip: Start with step 1·2 (traffic), then step 3 (connections), then step 4 (fan-out math) · that's where the key insight lives. The 10B ops/sec impossibility is what forces the hybrid fan-out design. Derive the architecture from the numbers, not the other way around.

APIs

Three surfaces: WebSocket (real-time delivery) · REST (send + history) · Internal gRPC (fan-out Worker → Gateway)

▸ Connection Handshake

1

Client fetches WSS endpoint via REST
Then upgrades the connection:

GET /v1/ws?token=<jwt>&workspace_id=<id>
Upgrade: websocket
Connection: Upgrade

2

Server responds

101 Switching Protocols 401 Unauthorized 429 Too Many Requests

3

Client sends hello frame · declares last known position per channel

{ "type": "hello", "last_seq_map": { "C123": 47, "C456": 12 } }

4

Server replies with hello_ack + backfill for any gaps

{ "type": "hello_ack", "user_id": "U789", "session_id": "sess_abc" }
// If gaps detected:
{ "type": "message.backfill", "channel_id": "C123", "messages": [...] }

▸ Client → Server Events

C → S hello

Declare last known seq per channel. Triggers backfill for any gaps.

{ last_seq_map: { channel_id: seq } }

Close code:4001 if token invalid

C → S message.send

Send a message. client_msg_id is a client-generated UUID for dedup on retry.

{ channel_id, text, client_msg_id }

C → S ping

Heartbeat · Gateway closes connection if no ping received within 30s.

{ ts }

Close code:1001 if no ping in 30s

▸ Server → Client Events

S → C hello_ack

Confirms auth succeeded. Client can now send messages.

{ user_id, session_id }

S → C message.new

Real-time push of a new message. Client renders and advances last_seq.

{ channel_id, channel_seq, user_id, text, ts }

S → C message.ack

Confirms server received and persisted the sent message.

{ client_msg_id, channel_seq, ts }

S → C message.backfill

Catch-up batch after hello or when client detects a gap in sequence numbers.

{ channel_id, messages: [...], has_more }

S → C presence.update

Presence change broadcast to all online channel members.

{ user_id, channel_id, status: "online" | "away" | "offline" }

S → C error

Error response for invalid operations.

{ code, message, ref_id }
// codes: "not_in_channel", "rate_limited", "msg_too_long"

S → C pong

Heartbeat reply.

{ ts }

Why WebSocket over SSE or long-polling? SSE is server→client only · can't send messages. Long-polling adds 1 RTT per message and hammers the server. WebSocket is full-duplex, persistent, and handles 500K connections per server with epoll. The only tradeoff is stateful connections requiring sticky routing · solved by the Redis user:{id}:gateway mapping.

▸ HTTP/2 · Bearer Token Auth · JSON Responses

POST /v1/ws/connect

Returns the nearest Gateway WSS URL via GeoDNS. Token is short-lived (60s), used only for the WebSocket upgrade handshake.

// Request
{ workspace_id: "W123" }

// Response 200
{ ws_url: "wss://gw-us-east-7.slack.com/ws", token: "eyJ...", expires_in: 60 }

200 OK 401 Bad auth 404 Workspace not found

POST /v1/messages

Fallback send when WebSocket is unavailable. Idempotent · same client_msg_id returns 409 with the original response.

// Request
{ channel_id: "C123", text: "hello team!", client_msg_id: "uuid-abc-123" }

// Response 201
{ message_id: "M789", channel_seq: 48, ts: "2025-01-15T10:30:00Z" }

201 Created 400 Invalid payload 401 Unauthorized 403 Not in channel 409 Duplicate client_msg_id 429 Rate limited

GET /v1/channels/{id}/messages

Paginated message history. Reads from Vitess by (channel_id, channel_seq) index. Max limit=200.

// Request params
?before_seq=47&limit=50

// Response 200
{ messages: [{ channel_seq: 46, user_id: "U1", text: "..." }, ...], has_more: true, next_cursor: "seq:45" }

200 OK 400 Invalid params 401 Unauthorized 403 Not in channel 404 Channel not found

GET /v1/channels/{id}/members/online

Reads Redis channel:{id}:online SET. as_of_ts indicates data freshness (Redis TTL-based).

// Response 200
{ user_ids: ["U1", "U2", "U3"], count: 3, as_of_ts: "2025-01-15T10:30:02Z" }

200 OK 401 Unauthorized 403 Not in channel 404 Channel not found

Idempotency: client_msg_id (client UUID) is the dedup key for REST retries · server returns 409 with the original response body. channel_seq (server-assigned monotonic) is the dedup key on the client side for WS delivery · duplicate frames are silently dropped.

▸ Fan-out Worker → Gateway (Internal Only)

RPC PushMessages

Fan-out worker groups users by gateway, calls each gateway once with a batch. not_found = user disconnected since Redis lookup.

// Request
{ user_ids: ["U1", "U2", "U3"], message: MessageProto }

// Response
{ delivered: ["U1", "U2"], not_found: ["U3"], failed: [] }

OK UNAVAILABLE · gateway overloaded DEADLINE_EXCEEDED · >50ms

RPC GetConnectedUsers

Used to prune stale Redis entries after a gateway crash. Called lazily, not on every message.

// Request
{ user_ids: ["U1", "U2", "U3"] }

// Response
{ connected: ["U1", "U3"], disconnected: ["U2"] }

OK INVALID_ARGUMENT · empty list

Why gRPC? Fan-out Worker → Gateway calls are high-frequency and latency-sensitive. gRPC gives multiplexed HTTP/2 connections, typed protobuf contracts, and built-in deadline propagation. A single fan-out worker batches all users on the same gateway into one PushMessages call · not one call per user.

Architecture Overview

Write path (send) → Kafka → Fan-out workers → Gateway → WebSocket push to clients

Data Model

Hot path (Redis, sub-ms) for presence and routing · Cold path (Vitess/MySQL) for durable message history

-- -- COLD PATH: Vitess/MySQL (sharded by workspace_id) ----------------------

CREATE TABLE messages (
  workspace_id   BIGINT NOT NULL,    -- shard key: all workspace data on one shard
  channel_id     BIGINT NOT NULL,
  channel_seq    BIGINT NOT NULL,    -- monotonic per channel (gap detection + backfill)
  user_id        BIGINT NOT NULL,
  client_msg_id  VARCHAR(64),        -- client UUID for idempotent dedup (indexed)
  content        TEXT,
  ts             TIMESTAMP(6),
  PRIMARY KEY (workspace_id, channel_id, channel_seq),
  UNIQUE KEY uq_client_msg (workspace_id, user_id, client_msg_id)
  -- PK doubles as the range-scan index for catch-up reads
);

CREATE TABLE channel_members (
  workspace_id   BIGINT NOT NULL,
  channel_id     BIGINT NOT NULL,
  user_id        BIGINT NOT NULL,
  joined_at      TIMESTAMP,
  PRIMARY KEY (workspace_id, channel_id, user_id)
);

-- -- HOT PATH: Redis (sub-ms, ephemeral) -------------------------------------

channel:{id}:online    → SET of user_ids currently connected (updated on WS connect/disconnect)
user:{id}:gateway      → STRING "gateway-host-7"  (which Gateway holds this user's WS)
channel:{id}:members   → SET of all user_ids       (cached from MySQL, TTL-refreshed)

-- Fan-out worker reads channel:{id}:online → groups by gateway → gRPC batch push
-- On disconnect: DEL user:{id}:gateway + SREM channel:{id}:online {user_id}

Hot/cold split: Redis holds only live session state · it's ephemeral and can be rebuilt from MySQL on restart. MySQL holds the durable ordered log. This means a Redis failure doesn't lose messages; it only temporarily disrupts real-time delivery until presence is rebuilt.

Sharding key choice: workspace_id (not channel_id) keeps all data for a workspace on one shard, enabling SQL joins and transactions within a workspace. Large enterprise customers get a dedicated shard to prevent noisy-neighbour problems.

Fan-out Strategy · The Hard Part

A 50K-member channel with 200K msg/sec would require 10 billion writes/sec if you pushed to every member. The fix: push only to online members (~10% of total).

Channel Size	Strategy	Why
<100 members	Write-time fan-out · push to every member's Gateway immediately	Fan-out is tiny; latency matters more than write cost
100 · 50K members	Hybrid · push to online members only, skip offline	50K members but only ~5K online → 90% less work. Offline users catch up on reconnect.
>50K members	Read-time fan-out · write to channel log; clients stream from cursor	Write amplification at this scale is prohibitive regardless of online ratio

Presence tracking: channel:{id}:online → Redis SET, updated on WS connect/disconnect. Fan-out worker iterates this set · not the full membership list. For a 50K-member channel with 5K online, that's 90% fewer Gateway calls per message. See Scale Estimation (step 4) for the full math.

Deep dive: For channels with 1M+ members (Telegram scale), even the hybrid approach breaks. See Telegram Large Group Fan-out for the read-time pull architecture that handles 20· this scale.

Presence Protocol

How does the system know who is online? Heartbeat-based presence with Redis TTL · no gossip protocol needed at this scale

Heartbeat Mechanism

Client ping: Every 15 seconds via WebSocket
Gateway timeout: No ping for 30 seconds → close connection
Redis TTL: Presence keys expire in 45 seconds (safety net)
Jitter: Clients add ·3s random jitter to avoid thundering herd on reconnect

State Transitions

online → WS connected + ping received within 30s
away → No user activity for 5 min (client sends status change)
offline → WS closed OR no ping for 30s OR Redis TTL expired
Broadcast: Only on state transitions, not on every heartbeat

Scalability Considerations

No gossip: Centralized Redis · simpler than distributed gossip at 10M users
Batch updates: Gateway batches presence changes every 5s to Redis (not per-ping)
Large channels: Presence broadcast throttled to 1 update/sec for channels >10K
Memory: 10M users · 50 bytes = ~500MB · fits in single Redis instance

Interview insight: Presence is an eventually consistent system · a user might appear online for up to 45s after crashing. This is acceptable for chat. The alternative (synchronous presence with consensus) would add latency to every message delivery.

Deep dive: At 100M+ concurrent users (WhatsApp scale), centralized Redis breaks. See Presence at 100M Scale for the distributed sharded presence architecture with subscription-based delivery.

Ordering & Consistency

All messages in #general must arrive in the same order for every user · achieved via Kafka partition-per-channel + monotonic sequence numbers

How Ordering Works

1. All messages for #general → same Kafka partition (key = channel_id)
2. Single consumer per partition → messages processed in strict order
3. Each message assigned channel_seq · monotonic increment per channel
4. Client tracks last received seq. Gap detected → request backfill from Vitess
5. On reconnect: client sends last_seq=47 → server returns msgs 48+

Failure Scenarios

WS drop: Client reconnects, sends last_seq → backfill from Vitess
Gateway crash: All users on that gateway reconnect; Redis mapping updated
Duplicate delivery: Client deduplicates by channel_seq (idempotent render)
@channel in 50K: Rate-limit fan-out, stagger over 1·2s to avoid thundering herd
Cross-region lag: EU user sees msg ~80ms after US user · acceptable for chat

Guarantees Provided

Per-channel ordering: Kafka partition key = channel_id ensures strict order within a channel
No cross-channel ordering: Different channels may be on different partitions · that's fine
At-least-once delivery: Gaps detected client-side; backfill closes them
Idempotent render: Duplicate messages silently dropped by seq dedup

Sequence Number Generation

How does the system assign monotonic channel_seq without becoming a bottleneck? Single-writer-per-channel via Kafka partition ownership.

Why This Works

Single writer: Kafka guarantees one consumer per partition → no race conditions
Redis INCR: Atomic O(1) operation, ~0.1ms latency → no bottleneck
Crash recovery: New consumer reads last seq from Redis → resumes from correct position
No gaps: Unlike DB auto-increment, Redis INCR never creates gaps (no rollbacks)

Edge Cases

Partition rebalance: New consumer reads current seq from Redis before processing
Redis failure: Fall back to Vitess SELECT MAX(channel_seq) · slower but correct
Duplicate INCR: If worker crashes after INCR but before Vitess write → seq gap. Client backfill handles gracefully (empty gap = no-op)
Hot channel: One partition = one worker. If channel is too hot, split into sub-channels at app layer.

Interview tip: This is a common follow-up: "How do you generate monotonic IDs without a single point of failure?" The answer is partition-level single writer · Kafka gives you distributed single-writer semantics for free. Each partition has exactly one active consumer, so no coordination needed.

Resilience & Edge Cases

Five failure modes and how the architecture handles each without data loss or visible gaps

#	Failure / Edge Case	What Breaks	How It's Handled
1	WebSocket drops mid-session	Client misses messages sent while disconnected	On reconnect, client sends `last_seq` → server backfills from Vitess. Gap closed transparently.
2	Gateway server crash	All ~500K connections on that host lost	Clients reconnect to any Gateway. Redis `user:{id}:gateway` mapping updated. Fan-out resumes immediately.
3	Duplicate delivery	Same message rendered twice in UI	Client deduplicates by `channel_seq` · idempotent render, second copy silently dropped.
4	@channel in 50K-member channel	Thundering herd: 50K fan-out ops at once	Rate-limit fan-out worker; stagger delivery over 1·2s for huge channels. Acceptable for broadcast notifications.
5	Kafka consumer lag spike	Fan-out falls behind; messages delayed	Partition count = number of hot channels. Scale fan-out workers horizontally. Alert on consumer lag > threshold.

Key insight: The system tolerates delivery failures by making the client responsible for gap detection. The server never needs to track per-user delivery state · it just stores the ordered log and lets clients request what they missed.

Backpressure & Flow Control

What happens when a Gateway is overwhelmed, Kafka consumers lag, or a viral message hits a 50K channel? Layered defense from client to storage.

Key Principle: Graceful Degradation

Never lose messages: Even under extreme load, messages are persisted in Kafka + Vitess. Delivery may be delayed, but data is never lost.
Prioritize small channels: Under load, large broadcast channels degrade first (switch to pull). DMs and small channels keep real-time delivery.
Client self-heals: Gap detection + backfill means any dropped delivery is automatically recovered on the client side.

Monitoring & Alerts

Kafka consumer lag: Alert if any partition > 10K messages behind
Gateway connection count: Alert at 80% capacity (400K connections)
gRPC error rate: Alert if > 1% of PushMessages calls fail
P99 delivery latency: Alert if > 500ms (2.5· budget)
Circuit breaker state: Alert on any circuit opening

Interview insight: Interviewers love hearing about graceful degradation. The system doesn't crash under 10· load · it progressively reduces real-time guarantees for large channels while protecting small channels and DMs. Messages are never lost, only delayed.

Multi-Region Architecture

How does a message sent in US-East reach a user in EU-West within 250ms? GeoDNS + regional Kafka clusters + async cross-region replication

Replication Strategy

Kafka MirrorMaker 2: Async replication US?EU, EU?US
Lag: ~60-80ms (network RTT between regions)
Vitess: Read replicas in each region for history queries
Redis: Independent per region · each region tracks its own online users

Write Routing

Workspace home region: All writes for a workspace go to its primary region
GeoDNS: Routes user to nearest Gateway, but writes proxy to primary
EU user sends msg: Gateway EU → Msg Service US (primary) → Kafka US → replicate to EU
Extra latency: ~80ms for cross-region write · acceptable for sender ACK

Failover

Region down: GeoDNS removes unhealthy region in ~30s
Promote replica: EU Kafka becomes primary; Vitess promotes read replica
Data loss window: Up to ~80ms of unreplicated messages (async replication)
Recovery: Clients reconnect, send last_seq → backfill closes any gaps

Key tradeoff: Async replication means EU users see messages ~80ms after US users. The alternative · synchronous cross-region writes · would add 80ms to every message for every user. Slack chooses availability over strict consistency across regions.

Deep dive: For systems requiring causal ordering guarantees across regions (not just eventual consistency), see Multi-Region Message Ordering · covers HLC, vector clocks, and conflict resolution during region failover.

Tech Stack & Tradeoffs

Each choice made for a specific reason · and what was rejected

Component	Technology	Why This	Why Not X
WebSocket Gateway	Custom (Go / Java)	500K conn per server via epoll; full control over heartbeat, reconnect, backpressure	Nginx/HAProxy: not designed for stateful long-lived connections at this density
Message Bus	Kafka	Partition by channel_id → strict ordering. Durable, replayable. 1M+ msg/sec.	RabbitMQ: no replay, lower throughput, fan-out harder. SQS: no ordering across messages.
Primary Storage	Vitess (MySQL)	Shard by workspace_id · all workspace data co-located. Transparent resharding. SQL joins work within shard.	Cassandra: no transactions, harder to query by seq range. DynamoDB: vendor lock-in, cost at this scale.
Hot Data / Presence	Redis	Sub-ms SET/GET. Channel online sets + user?gateway mapping. Pub/Sub for internal fan-out.	Memcached: no sets/sorted sets. In-memory on Gateway: doesn't survive restarts, not shared.
Internal RPC	gRPC	Fan-out Worker → Gateway. Multiplexed, low latency, typed contracts via protobuf.	REST/HTTP: higher overhead per call. Raw TCP: no service discovery, no load balancing.
Search	Elasticsearch	Full-text search across message history. Separate from delivery path · no latency coupling.	MySQL FULLTEXT: doesn't scale to 10B messages. Solr: operationally heavier.
DNS Routing	GeoDNS	Routes each user to nearest Gateway region. Reduces cross-region hops by ~80ms.	Single-region: unacceptable latency for global users. Anycast: complex to operate.

Decision rule: Stateful hot path (connections, presence) → Redis. Ordered durable log (messages, fan-out) → Kafka. Queryable history (catch-up, search) → Vitess + Elasticsearch. Each component owns exactly one concern · no overlap.

Real-world validation: Slack uses this exact stack. Discord uses a similar pattern (Cassandra instead of Vitess). WhatsApp uses Erlang/BEAM for the gateway layer instead of Go/Java · same architectural shape, different runtime.

Related Deep Dives

Each problem below explores one aspect of this system at greater depth or different scale

WhatsApp Offline Delivery

Store-and-forward for 2B+ users. E2E encryption. Server as temporary mailbox. Erlang/BEAM.

Discord WebSocket Infrastructure

Guild-level sharding vs user-level routing. Voice signaling. Elixir/BEAM for 1M conn/node.

Multi-Region Ordering

HLC for causal ordering. Conflict resolution. Region failover without message loss.

Telegram 1M+ Fan-out

Read-time pull from append-only log. Why write-time fan-out is impossible at 1M+.

Presence at 100M

Distributed sharded presence. Subscription model. Flap suppression. 20· Slack scale.

Typing Indicators

Ephemeral events. No Kafka. Redis Pub/Sub. Fire-and-forget. 8s TTL.

Chat Search (10B msgs)

Elasticsearch NRT indexing. Permission-aware queries. BM25 + recency ranking.

Push Notifications

APNs/FCM. Idempotency dedup. Token lifecycle. Collapse keys for batching.

Multi-Device Sync

pts/qts sequences. getDifference API. LWW conflict resolution.

Top Active Channels

Sliding window aggregation. Redis sorted sets. Count-Min Sketch for approximate top-K.

Interview Cheat Sheet

The 8 things an interviewer wants to hear · say these and you've covered the design

? WebSocket + Redis routing

Each user holds one persistent WebSocket to a Gateway server. Redis maps user:{id} → gateway-host. Fan-out worker looks up this mapping to know which gateway to call.

? Kafka partition per channel

Partition key = channel_id. All messages for a channel land on the same partition → strict ordering guaranteed. One fan-out worker per partition → no race conditions.

? Hybrid fan-out (skip offline)

Push only to online members (Redis SET). 50K members but 5K online = 90% less work. Channels >50K use read-time fan-out · write once to log, clients pull.

? Monotonic channel_seq + backfill

Every message gets a channel_seq. Client tracks last seen. Gap detected → request backfill from Vitess. On reconnect, client sends last_seq → server returns missed messages.

? Vitess sharded by workspace_id

All data for a workspace on one shard → SQL joins work. Large enterprise gets dedicated shard. Vitess handles transparent resharding as workspaces grow.

? Presence via heartbeat + Redis TTL

Client pings every 15s. Gateway closes on 30s timeout. Redis SET with 45s TTL auto-expires stale presence. Broadcast only on state transitions · not every heartbeat. Eventually consistent (up to 45s stale).

? Multi-region: async replication + local fan-out

Writes go to workspace's primary region. Kafka MirrorMaker replicates to other regions (~60-80ms). Each region has local fan-out workers + Redis + Gateways. EU users see messages ~80ms after US users · acceptable for chat.

? Backpressure: 4-layer graceful degradation

Client rate limit → Gateway admission control → Fan-out circuit breaker → Kafka consumer auto-scale. Under extreme load, large channels degrade to pull-based while DMs keep real-time. Messages never lost, only delayed.

One-liner answer: "WebSocket per user with Redis routing, Kafka partitioned by channel for ordering, hybrid fan-out to online-only members, monotonic sequence numbers for gap detection and catch-up, Vitess sharded by workspace, heartbeat-based presence with Redis TTL, async multi-region replication via MirrorMaker, and 4-layer backpressure for graceful degradation."

System Design Case Study