How does Discord maintain 10M+ concurrent WebSocket connections across thousands of gateway servers?
?? Design a real-time communication platform that maintains 10M+ concurrent WebSocket connections with heartbeats, shard assignment, graceful failover, and voice signaling
How does a chat platform maintain 10M+ concurrent WebSocket connections across thousands of gateway servers, handling heartbeats, shard assignment, graceful failover, and voice signaling without dropping connections?
Scope: WebSocket gateway infrastructure, connection lifecycle, guild sharding, and voice signaling path. Unlike Slack (which focuses on message delivery fan-out), Discord's primary challenge is maintaining stateful connections for real-time voice + event distribution across millions of guilds.
10M+
concurrent WebSockets
persistent stateful connections
150M+
monthly active users
peak during gaming hours
19M+
active servers (guilds)
each assigned to a shard
<100ms
voice latency target
WebRTC + SFU architecture
Functional Requirements
What the system must do · core real-time communication behaviours
Must Have (Core)
1. Maintain persistent WebSocket connections for all online users 2. Distribute guilds across shards · each guild assigned to exactly one shard 3.Heartbeat mechanism to detect zombie connections (41.25s interval) 4.Session resume · reconnect without missing events using session_id + seq 5.Voice signaling · negotiate WebRTC connections for voice/video channels 6.Event distribution · deliver guild events (messages, presence, typing) to all connected members
Out of Scope (this design)
? Message persistence and search (Cassandra/ScyllaDB layer) ?Push notifications for mobile (APNs / FCM pipeline) ?CDN and media delivery (attachments, embeds, avatars) ?Bot framework and application commands ?User authentication and OAuth2 flows ?Rate limiting logic details (separate middleware)
Non-Functional Requirements
Quality constraints shaping the gateway infrastructure design
Property
Target
Why It Matters / Design Impact
Connection Capacity
10M+ concurrent WS
Requires thousands of gateway servers. Each Elixir node handles ~1M connections via BEAM scheduler.
Voice Latency
<100ms mouth-to-ear
WebRTC with SFU topology. Separate voice servers from gateway. Regional voice pools.
Availability
99.99% for gateway layer
Session resume + graceful failover. No single gateway is a SPOF for any guild.
Reconnect Time
<5s session resume
Client replays missed events from server-side buffer. No full re-identify needed.
Shard Rebalance
Zero-downtime migration
Guilds moved between shards without disconnecting users. Drain + redirect pattern.
Scalability
Linear scale with guild count
Add shards as guilds grow. Shard count = f(total_guilds, max_guilds_per_shard).
Fault Tolerance
Survive gateway crash transparently
Session state buffered server-side. Client reconnects, resumes from last seq number.
Key tension:Statefulness vs. Scalability. Each gateway holds in-memory session state (event buffer, guild subscriptions). This makes horizontal scaling harder than stateless services · solved by guild-level sharding and session resume protocol.
Scale Estimation
Derive infrastructure sizing from the given constraints. Show this reasoning in an interview.
Event distribution must be guild-scoped to avoid broadcast storms
5
Session buffer memory
10M sessions · 128 events · 1KB avg = 1.28TB
~1.3TB distributed
Spread across gateway nodes. ~130GB per node for resume buffers.
6
Voice servers needed
~1.5M concurrent voice users · 1,000 users/SFU
~1,500 voice servers
Separate from gateway. Regional pools for latency. SFU topology.
7
Reconnect storm (worst case)
1 gateway crash ? 1M reconnects in ~5s = 200K/sec
200K reconnects/sec
Jittered backoff + session resume (not full re-identify) reduces load
8
Redis presence memory
10M users · 64 bytes (user_id + status + guild_ids)
~640MB
Redis cluster for presence. Guild member lists cached separately.
Interview tip: Start with connection count (step 1), then shard math (step 2), then the reconnect storm (step 7) · that's the key insight. A single gateway crash creates a thundering herd that must be mitigated with jittered backoff and session resume.
Architecture Overview
Client ? Gateway (WS termination) ? Session Service ? Guild Sharding ? Event Distribution
Gateway Sharding
How guilds are distributed across shards · the core routing mechanism. Slack uses user?gateway mapping via Redis · Discord uses guild?shard mapping instead, giving better event locality.
Why guild-level sharding? A guild's events (messages, presence, voice state) are highly correlated · members of the same guild need the same events. By co-locating all guild members on the same shard, Discord avoids cross-node communication for intra-guild events. This is fundamentally different from Slack's approach where any user can be on any gateway.
Connection Lifecycle
State machine: CONNECTING ? IDENTIFYING ? RESUMING/READY ? CONNECTED ? DISCONNECTED. Similar heartbeat mechanism to Slack's presence system but with different timeout values (41.25s vs 30s).
Why 41.25s heartbeat? Discord uses a jittered interval around 41.25s (server sends the exact value in Hello). This is longer than Slack's 30s timeout · the tradeoff is slower zombie detection vs. less heartbeat traffic at 10M+ connections. At 242K heartbeats/sec, even small reductions matter.
Voice Architecture
Signaling path (WebSocket) is separate from media path (WebRTC/UDP). SFU topology for multi-user voice channels.
Key insight: Voice signaling and media are completely decoupled. The gateway WS handles signaling (which voice channel, who's speaking), while a separate Rust-based voice server handles the actual audio/video streams over UDP. This means a gateway crash doesn't kill active voice · the voice server continues independently.
Resilience & Edge Cases
How the system handles failures without dropping connections or losing events
Scenario
Problem
Solution
Recovery Time
Gateway Crash
1M connections lost instantly. Users see disconnect.
Session state buffered server-side. Clients reconnect with Resume (op 6) · server replays missed events from seq buffer. Jittered backoff prevents thundering herd.
<5s with resume
Shard Rebalance
Need to move guilds between shards without downtime.
Drain pattern: mark shard DRAINING ? transfer state ? send Reconnect (op 7) to affected clients ? clients resume on new shard.
Jittered exponential backoff: random(1-5s) initial delay, then 2· backoff. Server returns 429 with Retry-After header. Rate limit reconnects per IP.
~30s to stabilize
Voice Server Failover
Voice server crash ? active voice channels interrupted.
Voice service detects failure ? assigns new voice server ? sends Voice Server Update to affected clients ? clients reconnect voice WS + re-negotiate WebRTC.
~2-3s audio gap
Rate Limiting Bots
Abusive bots flood gateway with events, degrading service for legitimate users.
Per-connection rate limits (120 events/60s). Identify payload includes intents · bots only receive events they need. Global rate limit headers. Close with 4008 (rate limited).
Immediate rejection
Zombie Connections
Client appears connected but is actually dead (network issue).
Heartbeat ACK mechanism: if client sends heartbeat but receives no ACK, it closes and reconnects. Server-side: if no heartbeat received within interval · 1.5, server closes connection.
~62s detection
Large Guild Events
Guild with 500K+ members ? presence updates would flood all connections.
Lazy loading: guilds >250 members don't send full member list in READY. Client requests member chunks on demand. Presence only for visible members.
N/A (prevention)
Worst case: Full region failure. All gateways in a region go down simultaneously. Mitigation: GeoDNS failover to next-closest region. Clients reconnect to new region. Session resume may fail (buffer lost) ? full re-identify. Expected recovery: 30-60s with degraded latency until region recovers.
Tech Stack
Discord's infrastructure choices · each optimized for its specific workload
Component
Technology
Why This Choice
Gateway Servers
Elixir / Erlang (BEAM VM)
BEAM handles millions of lightweight processes. Each WS connection = 1 Erlang process (~2KB). Built-in supervision trees for fault tolerance. Hot code reloading for zero-downtime deploys.
Voice Servers
Rust
Low-latency audio processing requires predictable performance. No GC pauses. Zero-cost abstractions for packet forwarding. Custom SFU implementation.
Message Storage
Cassandra ? ScyllaDB
Write-heavy workload (billions of messages/day). ScyllaDB: C++ rewrite of Cassandra with better tail latencies. Partition key = channel_id for time-series access pattern.
Presence & Routing
Redis Cluster
Sub-ms lookups for user status, guild member lists, and shard?gateway mapping. Pub/Sub for cross-node event distribution.
API Services
Python (Flask) ? Rust
Originally Python for rapid development. Migrating hot paths to Rust for performance. API gateway handles REST + rate limiting.
WS Protocol
Custom binary over WebSocket
JSON for identify/resume. ETF (Erlang Term Format) or zlib-compressed JSON for events. Reduces bandwidth 60-80% vs raw JSON.
Service Mesh
gRPC + Envoy
Internal service communication. Protobuf contracts. Load balancing, circuit breaking, and observability built into the mesh.
Orchestration
Kubernetes (GKE)
Auto-scaling gateway pods based on connection count. Rolling deploys with drain + reconnect pattern.
Key migration: Discord moved from Cassandra to ScyllaDB for message storage, achieving P99 latency reduction from 40-125ms to 15ms. The BEAM VM (Elixir) was chosen specifically because each WebSocket connection maps to a lightweight Erlang process · enabling 1M+ connections per node with built-in fault isolation.
Interview Cheat Sheet
5 key points to nail a Discord WebSocket infrastructure interview question
1. Guild-Level Sharding is the Core Insight Unlike Slack (user?gateway mapping), Discord assigns each guild to a specific shard. This means all events for a guild are processed by the same gateway node · no cross-node fan-out for intra-guild events. Formula: shard_id = (guild_id >> 22) % num_shards. This is deterministic · any node can compute it without a lookup.
2. Session Resume Prevents Thundering Herd When a gateway crashes, 1M clients don't all re-identify (expensive). Instead, they Resume with session_id + last seq number. The server replays missed events from an in-memory buffer. Combined with jittered backoff, this turns a catastrophic reconnect storm into a manageable trickle. Resume is O(missed_events), not O(all_guilds).
3. Voice is Completely Decoupled from Gateway Signaling (who joins/leaves voice) goes through the gateway WS. But actual audio goes through separate Rust-based voice servers using WebRTC/UDP with SFU topology. A gateway crash doesn't interrupt active voice calls. Voice servers are regionally deployed for <100ms latency.
4. Elixir/BEAM is Purpose-Built for This Each WebSocket = 1 Erlang process (~2KB memory). BEAM's preemptive scheduler ensures no single connection can starve others. Supervision trees auto-restart crashed processes. Hot code reloading enables zero-downtime deploys. This is why Discord handles 1M+ connections per gateway node.
5. Intents + Lazy Loading Solve the Large Guild Problem Guilds with 500K+ members would overwhelm connections with presence updates. Solution: Gateway Intents let clients declare which events they want (bots don't need typing indicators). Guilds >250 members use lazy member loading · presence only sent for visible members. This reduces event volume by 90%+ for large guilds.