System Design Case Study

How does Discord maintain 10M+ concurrent WebSocket connections across thousands of gateway servers?

?? Design a real-time communication platform that maintains 10M+ concurrent WebSocket connections with heartbeats, shard assignment, graceful failover, and voice signaling
Concepts Involved

Problem Statement

How does a chat platform maintain 10M+ concurrent WebSocket connections across thousands of gateway servers, handling heartbeats, shard assignment, graceful failover, and voice signaling without dropping connections?

Scope: WebSocket gateway infrastructure, connection lifecycle, guild sharding, and voice signaling path. Unlike Slack (which focuses on message delivery fan-out), Discord's primary challenge is maintaining stateful connections for real-time voice + event distribution across millions of guilds.
10M+
concurrent WebSockets
persistent stateful connections
150M+
monthly active users
peak during gaming hours
19M+
active servers (guilds)
each assigned to a shard
<100ms
voice latency target
WebRTC + SFU architecture

Functional Requirements

What the system must do · core real-time communication behaviours

Must Have (Core)

1. Maintain persistent WebSocket connections for all online users
2. Distribute guilds across shards · each guild assigned to exactly one shard
3. Heartbeat mechanism to detect zombie connections (41.25s interval)
4. Session resume · reconnect without missing events using session_id + seq
5. Voice signaling · negotiate WebRTC connections for voice/video channels
6. Event distribution · deliver guild events (messages, presence, typing) to all connected members

Out of Scope (this design)

? Message persistence and search (Cassandra/ScyllaDB layer)
? Push notifications for mobile (APNs / FCM pipeline)
? CDN and media delivery (attachments, embeds, avatars)
? Bot framework and application commands
? User authentication and OAuth2 flows
? Rate limiting logic details (separate middleware)

Non-Functional Requirements

Quality constraints shaping the gateway infrastructure design

PropertyTargetWhy It Matters / Design Impact
Connection Capacity10M+ concurrent WSRequires thousands of gateway servers. Each Elixir node handles ~1M connections via BEAM scheduler.
Voice Latency<100ms mouth-to-earWebRTC with SFU topology. Separate voice servers from gateway. Regional voice pools.
Availability99.99% for gateway layerSession resume + graceful failover. No single gateway is a SPOF for any guild.
Reconnect Time<5s session resumeClient replays missed events from server-side buffer. No full re-identify needed.
Shard RebalanceZero-downtime migrationGuilds moved between shards without disconnecting users. Drain + redirect pattern.
ScalabilityLinear scale with guild countAdd shards as guilds grow. Shard count = f(total_guilds, max_guilds_per_shard).
Fault ToleranceSurvive gateway crash transparentlySession state buffered server-side. Client reconnects, resumes from last seq number.
SecurityTLS everywhere · token rotationWSS only. Bot tokens rate-limited separately. Identify payload validated per connection.
Key tension: Statefulness vs. Scalability. Each gateway holds in-memory session state (event buffer, guild subscriptions). This makes horizontal scaling harder than stateless services · solved by guild-level sharding and session resume protocol.

Scale Estimation

Derive infrastructure sizing from the given constraints. Show this reasoning in an interview.

Given: 10M+ concurrent WS · 150M+ MAU · 19M+ guilds · <100ms voice latency · 41.25s heartbeat interval
StepWhat to DeriveCalculationResultDesign Decision
1 Gateway servers needed 10M connections · 1M conn/server (Elixir/BEAM) ~10·15 gateway nodes Elixir lightweight processes handle 1M+ WS per node with BEAM scheduler
2 Shard count 19M guilds · ~2,500 guilds/shard (max comfortable) ~7,600 shards Bots with >2,500 guilds must shard. Formula: (guild_count - 1) / 2500 + 1
3 Heartbeat load 10M connections · (1 heartbeat / 41.25s) ~242K heartbeats/sec Lightweight · just update last_seen timestamp in memory. No DB write.
4 Events/sec (peak) ~5M events/sec (messages + presence + typing + voice state) ~5M events/sec Event distribution must be guild-scoped to avoid broadcast storms
5 Session buffer memory 10M sessions · 128 events · 1KB avg = 1.28TB ~1.3TB distributed Spread across gateway nodes. ~130GB per node for resume buffers.
6 Voice servers needed ~1.5M concurrent voice users · 1,000 users/SFU ~1,500 voice servers Separate from gateway. Regional pools for latency. SFU topology.
7 Reconnect storm (worst case) 1 gateway crash ? 1M reconnects in ~5s = 200K/sec 200K reconnects/sec Jittered backoff + session resume (not full re-identify) reduces load
8 Redis presence memory 10M users · 64 bytes (user_id + status + guild_ids) ~640MB Redis cluster for presence. Guild member lists cached separately.
Interview tip: Start with connection count (step 1), then shard math (step 2), then the reconnect storm (step 7) · that's the key insight. A single gateway crash creates a thundering herd that must be mitigated with jittered backoff and session resume.

Architecture Overview

Client ? Gateway (WS termination) ? Session Service ? Guild Sharding ? Event Distribution

CLIENT LAYER · Desktop / Mobile / Bot connections Desktop Client Electron + WS Mobile Client iOS / Android Bot Client sharded WS WSS + zlib GATEWAY LAYER · WebSocket Termination (Elixir/Erlang BEAM) Gateway Node 1 Shards 0·999 ~1M connections Gateway Node 2 Shards 1000·1999 ~1M connections Gateway Node N Shards N000·N999 ~1M connections SESSION & GUILD SHARDING · State Management Session Service session_id ? seq buffer resume token mgmt Guild Shard Map guild_id ? shard_id (guild_id >> 22) % num_shards Presence Service Redis cluster user status + activity EVENT DISTRIBUTION · Pub/Sub to Guild Members Event Publisher guild-scoped fan-out Guild Subscription Ring Erlang PG (process groups) ? Gateway WS Push to all subscribed clients KEY CONCEPT Each guild is assigned to exactly ONE shard. shard_id = (guild_id >> 22) % num_shards All events for a guild route through its shard. Voice: separate path ? signaling over WS, media over WebRTC/UDP Guild-level sharding means: one message in a guild only fans out to members on THAT shard's gateway connections · not all gateways. Compare: Slack uses user?gateway mapping via Redis. Discord uses guild?shard?gateway mapping for locality.

Gateway Sharding

How guilds are distributed across shards · the core routing mechanism. Slack uses user?gateway mapping via Redis · Discord uses guild?shard mapping instead, giving better event locality.

SHARD ASSIGNMENT FORMULA shard_id = (guild_id >> 22) % num_shards Deterministic · any node can compute which shard owns a guild without lookup GUILD ? SHARD DISTRIBUTION Shard 0 Guild A (id: 0) Guild D (id: 3) Guild G (id: 6) ~2,500 guilds Shard 1 Guild B (id: 1) Guild E (id: 4) Guild H (id: 7) ~2,500 guilds Shard 2 Guild C (id: 2) Guild F (id: 5) Guild I (id: 8) ~2,500 guilds Shard N Guild X, Y, Z... ... ~2,500 guilds SHARD REBALANCING · Zero-Downtime Migration 1. Mark shard DRAINING Stop accepting new guilds Continue serving existing 2. Transfer guild state Copy subscriptions + buffer to new shard assignment 3. Redirect clients Send RECONNECT opcode with new gateway URL 4. Complete migration Old shard decommissioned Clients resume on new shard Slack vs Discord Routing Comparison Slack: user_id ? Redis ? gateway_host (user-level routing, any gateway can serve any channel) Discord: guild_id ? shard_id ? gateway_node (guild-level sharding, events stay local to shard)
Why guild-level sharding? A guild's events (messages, presence, voice state) are highly correlated · members of the same guild need the same events. By co-locating all guild members on the same shard, Discord avoids cross-node communication for intra-guild events. This is fundamentally different from Slack's approach where any user can be on any gateway.

Connection Lifecycle

State machine: CONNECTING ? IDENTIFYING ? RESUMING/READY ? CONNECTED ? DISCONNECTED. Similar heartbeat mechanism to Slack's presence system but with different timeout values (41.25s vs 30s).

CONNECTION STATE MACHINE CONNECTING WS handshake Hello (op 10) IDENTIFYING send token + intents Ready (op 0) Resume (op 6) READY new session RESUMING replay missed CONNECTED receiving events close/timeout DISCONNECTED reconnect logic jittered backoff ? reconnect (resume if session valid, else re-identify) HEARTBEAT MECHANISM Server sends Hello with heartbeat_interval: 41250ms Client must send Heartbeat (op 1) within interval Server responds with Heartbeat ACK (op 11) No ACK received ? zombie connection ? close + reconnect RESUME vs RECONNECT Resume (op 6): session_id + seq ? server replays missed events ? Fast (~1s), no guild re-subscription, no READY payload Reconnect (op 2): full re-identify, new session, full READY ? Slow (~5s), re-subscribe all guilds, large READY payload GATEWAY OPCODES op 0 · Dispatch Server ? Client event (MESSAGE_CREATE, PRESENCE_UPDATE, etc.) op 1 · Heartbeat Client ? Server keepalive (send last seq number) op 2 · Identify Client ? Server auth (token, intents, shard info) op 6 · Resume Client ? Server reconnect (session_id, seq) op 7 · Reconnect Server ? Client: you should reconnect op 9 · Invalid Session Server ? Client: session expired, re-identify op 10 · Hello Server ? Client: connection established, heartbeat_interval op 11 · Heartbeat ACK Server ? Client: heartbeat received Zombie Detection: If client sends heartbeat but gets no ACK: ? close WS with code 1001 ? wait random(1-5s) jitter ? reconnect with Resume Close Codes: 4000 · Unknown error (reconnect) 4004 · Auth failed (don't reconnect)
Why 41.25s heartbeat? Discord uses a jittered interval around 41.25s (server sends the exact value in Hello). This is longer than Slack's 30s timeout · the tradeoff is slower zombie detection vs. less heartbeat traffic at 10M+ connections. At 242K heartbeats/sec, even small reductions matter.

Voice Architecture

Signaling path (WebSocket) is separate from media path (WebRTC/UDP). SFU topology for multi-user voice channels.

SIGNALING PATH · Over WebSocket (Gateway) Client Voice State Update op 4 Gateway routes to voice service Voice Service assigns voice server Voice Server Ready endpoint + token ? client Client ? Voice WS separate WS to voice server Flow: Client sends Voice State Update (op 4) ? Gateway responds with Voice Server Update ? Client opens separate WS to voice endpoint ? exchanges SDP/ICE Voice WS handles: session description, speaking indicators, heartbeat (separate from gateway heartbeat) MEDIA PATH · WebRTC / UDP (Rust Voice Servers) User A Opus audio User B Opus audio SFU (Selective Forwarding Unit) Written in Rust for performance Receives all streams, forwards selectively No mixing · each client decodes independently User C receives A+B User D receives A+B Why SFU over MCU? MCU: mixes audio server-side (CPU heavy) SFU: just forwards packets (low CPU) Tradeoff: more client bandwidth but scales to 1000s of servers VOICE PROTOCOL DETAILS Audio Codec: Opus 48kHz, 64kbps, 20ms frames Adaptive bitrate based on packet loss Transport: UDP (RTP/SRTP) Encrypted with SRTP (AES-128-CM) Fallback to TCP if UDP blocked Latency Budget: <100ms Encode: ~5ms | Network: ~30-50ms Jitter buffer: ~20ms | Decode: ~5ms Capacity: ~1000 users/SFU Regional voice server pools Auto-migrate on server overload
Key insight: Voice signaling and media are completely decoupled. The gateway WS handles signaling (which voice channel, who's speaking), while a separate Rust-based voice server handles the actual audio/video streams over UDP. This means a gateway crash doesn't kill active voice · the voice server continues independently.

Resilience & Edge Cases

How the system handles failures without dropping connections or losing events

ScenarioProblemSolutionRecovery Time
Gateway Crash 1M connections lost instantly. Users see disconnect. Session state buffered server-side. Clients reconnect with Resume (op 6) · server replays missed events from seq buffer. Jittered backoff prevents thundering herd. <5s with resume
Shard Rebalance Need to move guilds between shards without downtime. Drain pattern: mark shard DRAINING ? transfer state ? send Reconnect (op 7) to affected clients ? clients resume on new shard. <10s per guild
Thundering Herd Gateway crash ? 1M clients reconnect simultaneously ? overload remaining gateways. Jittered exponential backoff: random(1-5s) initial delay, then 2· backoff. Server returns 429 with Retry-After header. Rate limit reconnects per IP. ~30s to stabilize
Voice Server Failover Voice server crash ? active voice channels interrupted. Voice service detects failure ? assigns new voice server ? sends Voice Server Update to affected clients ? clients reconnect voice WS + re-negotiate WebRTC. ~2-3s audio gap
Rate Limiting Bots Abusive bots flood gateway with events, degrading service for legitimate users. Per-connection rate limits (120 events/60s). Identify payload includes intents · bots only receive events they need. Global rate limit headers. Close with 4008 (rate limited). Immediate rejection
Zombie Connections Client appears connected but is actually dead (network issue). Heartbeat ACK mechanism: if client sends heartbeat but receives no ACK, it closes and reconnects. Server-side: if no heartbeat received within interval · 1.5, server closes connection. ~62s detection
Large Guild Events Guild with 500K+ members ? presence updates would flood all connections. Lazy loading: guilds >250 members don't send full member list in READY. Client requests member chunks on demand. Presence only for visible members. N/A (prevention)
Worst case: Full region failure. All gateways in a region go down simultaneously. Mitigation: GeoDNS failover to next-closest region. Clients reconnect to new region. Session resume may fail (buffer lost) ? full re-identify. Expected recovery: 30-60s with degraded latency until region recovers.

Tech Stack

Discord's infrastructure choices · each optimized for its specific workload

ComponentTechnologyWhy This Choice
Gateway Servers Elixir / Erlang (BEAM VM) BEAM handles millions of lightweight processes. Each WS connection = 1 Erlang process (~2KB). Built-in supervision trees for fault tolerance. Hot code reloading for zero-downtime deploys.
Voice Servers Rust Low-latency audio processing requires predictable performance. No GC pauses. Zero-cost abstractions for packet forwarding. Custom SFU implementation.
Message Storage Cassandra ? ScyllaDB Write-heavy workload (billions of messages/day). ScyllaDB: C++ rewrite of Cassandra with better tail latencies. Partition key = channel_id for time-series access pattern.
Presence & Routing Redis Cluster Sub-ms lookups for user status, guild member lists, and shard?gateway mapping. Pub/Sub for cross-node event distribution.
API Services Python (Flask) ? Rust Originally Python for rapid development. Migrating hot paths to Rust for performance. API gateway handles REST + rate limiting.
WS Protocol Custom binary over WebSocket JSON for identify/resume. ETF (Erlang Term Format) or zlib-compressed JSON for events. Reduces bandwidth 60-80% vs raw JSON.
Service Mesh gRPC + Envoy Internal service communication. Protobuf contracts. Load balancing, circuit breaking, and observability built into the mesh.
Orchestration Kubernetes (GKE) Auto-scaling gateway pods based on connection count. Rolling deploys with drain + reconnect pattern.
Key migration: Discord moved from Cassandra to ScyllaDB for message storage, achieving P99 latency reduction from 40-125ms to 15ms. The BEAM VM (Elixir) was chosen specifically because each WebSocket connection maps to a lightweight Erlang process · enabling 1M+ connections per node with built-in fault isolation.

Interview Cheat Sheet

5 key points to nail a Discord WebSocket infrastructure interview question

1. Guild-Level Sharding is the Core Insight
Unlike Slack (user?gateway mapping), Discord assigns each guild to a specific shard. This means all events for a guild are processed by the same gateway node · no cross-node fan-out for intra-guild events. Formula: shard_id = (guild_id >> 22) % num_shards. This is deterministic · any node can compute it without a lookup.
2. Session Resume Prevents Thundering Herd
When a gateway crashes, 1M clients don't all re-identify (expensive). Instead, they Resume with session_id + last seq number. The server replays missed events from an in-memory buffer. Combined with jittered backoff, this turns a catastrophic reconnect storm into a manageable trickle. Resume is O(missed_events), not O(all_guilds).
3. Voice is Completely Decoupled from Gateway
Signaling (who joins/leaves voice) goes through the gateway WS. But actual audio goes through separate Rust-based voice servers using WebRTC/UDP with SFU topology. A gateway crash doesn't interrupt active voice calls. Voice servers are regionally deployed for <100ms latency.
4. Elixir/BEAM is Purpose-Built for This
Each WebSocket = 1 Erlang process (~2KB memory). BEAM's preemptive scheduler ensures no single connection can starve others. Supervision trees auto-restart crashed processes. Hot code reloading enables zero-downtime deploys. This is why Discord handles 1M+ connections per gateway node.
5. Intents + Lazy Loading Solve the Large Guild Problem
Guilds with 500K+ members would overwhelm connections with presence updates. Solution: Gateway Intents let clients declare which events they want (bots don't need typing indicators). Guilds >250 members use lazy member loading · presence only sent for visible members. This reduces event volume by 90%+ for large guilds.