Discord WebSocket Infrastructure

🎯 Design a real-time communication platform that maintains 10M+ concurrent WebSocket connections with heartbeats, shard assignment, graceful failover, and voice signaling

Concepts Involved

WebSocket Sharding Service Discovery Load Balancer Fault Tolerance UDP (Voice/WebRTC)

Problem Statement

How does a chat platform maintain 10M+ concurrent WebSocket connections across thousands of gateway servers, handling heartbeats, shard assignment, graceful failover, and voice signaling without dropping connections?

Scope: WebSocket gateway infrastructure, connection lifecycle, guild sharding, and voice signaling path. Unlike Slack (which focuses on message delivery fan-out), Discord's primary challenge is maintaining stateful connections for real-time voice + event distribution across millions of guilds.

10M+

concurrent WebSockets

persistent stateful connections

150M+

monthly active users

peak during gaming hours

19M+

active servers (guilds)

each assigned to a shard

<100ms

voice latency target

WebRTC + SFU architecture

Functional Requirements

What the system must do · core real-time communication behaviours

Must Have (Core)

1. Maintain persistent WebSocket connections for all online users
2. Distribute guilds across shards · each guild assigned to exactly one shard
3. Heartbeat mechanism to detect zombie connections (41.25s interval)
4. Session resume · reconnect without missing events using session_id + seq
5. Voice signaling · negotiate WebRTC connections for voice/video channels
6. Event distribution · deliver guild events (messages, presence, typing) to all connected members

Out of Scope (this design)

✗ Message persistence and search (Cassandra/ScyllaDB layer)
✗ Push notifications for mobile (APNs / FCM pipeline)
✗ CDN and media delivery (attachments, embeds, avatars)
✗ Bot framework and application commands
✗ User authentication and OAuth2 flows
✗ Rate limiting logic details (separate middleware)

Non-Functional Requirements

Quality constraints shaping the gateway infrastructure design

Property	Target	Why It Matters / Design Impact
Connection Capacity	10M+ concurrent WS	Requires thousands of gateway servers. Each Elixir node handles ~1M connections via BEAM scheduler.
Voice Latency	<100ms mouth-to-ear	WebRTC with SFU topology. Separate voice servers from gateway. Regional voice pools.
Availability	99.99% for gateway layer	Session resume + graceful failover. No single gateway is a SPOF for any guild.
Reconnect Time	<5s session resume	Client replays missed events from server-side buffer. No full re-identify needed.
Shard Rebalance	Zero-downtime migration	Guilds moved between shards without disconnecting users. Drain + redirect pattern.
Scalability	Linear scale with guild count	Add shards as guilds grow. Shard count = f(total_guilds, max_guilds_per_shard).
Fault Tolerance	Survive gateway crash transparently	Session state buffered server-side. Client reconnects, resumes from last seq number.
Security	TLS everywhere · token rotation	WSS only. Bot tokens rate-limited separately. Identify payload validated per connection.

Key tension: Statefulness vs. Scalability. Each gateway holds in-memory session state (event buffer, guild subscriptions). This makes horizontal scaling harder than stateless services · solved by guild-level sharding and session resume protocol.

Scale Estimation

Derive infrastructure sizing from the given constraints. Show this reasoning in an interview.

Given: 10M+ concurrent WS · 150M+ MAU · 19M+ guilds · <100ms voice latency · 41.25s heartbeat interval

Step	What to Derive	Calculation	Result	Design Decision
1	Gateway servers needed	10M connections · 1M conn/server (Elixir/BEAM)	~10·15 gateway nodes	Elixir lightweight processes handle 1M+ WS per node with BEAM scheduler
2	Shard count	19M guilds · ~2,500 guilds/shard (max comfortable)	~7,600 shards	Bots with >2,500 guilds must shard. Formula: (guild_count - 1) / 2500 + 1
3	Heartbeat load	10M connections · (1 heartbeat / 41.25s)	~242K heartbeats/sec	Lightweight · just update last_seen timestamp in memory. No DB write.
4	Events/sec (peak)	~5M events/sec (messages + presence + typing + voice state)	~5M events/sec	Event distribution must be guild-scoped to avoid broadcast storms
5	Session buffer memory	10M sessions · 128 events · 1KB avg = 1.28TB	~1.3TB distributed	Spread across gateway nodes. ~130GB per node for resume buffers.
6	Voice servers needed	~1.5M concurrent voice users · 1,000 users/SFU	~1,500 voice servers	Separate from gateway. Regional pools for latency. SFU topology.
7	Reconnect storm (worst case)	1 gateway crash → 1M reconnects in ~5s = 200K/sec	200K reconnects/sec	Jittered backoff + session resume (not full re-identify) reduces load
8	Redis presence memory	10M users · 64 bytes (user_id + status + guild_ids)	~640MB	Redis cluster for presence. Guild member lists cached separately.

Interview tip: Start with connection count (step 1), then shard math (step 2), then the reconnect storm (step 7) · that's the key insight. A single gateway crash creates a thundering herd that must be mitigated with jittered backoff and session resume.

Architecture Overview

Client → Gateway (WS termination) → Session Service → Guild Sharding → Event Distribution

Gateway Sharding

How guilds are distributed across shards · the core routing mechanism. Slack uses user?gateway mapping via Redis · Discord uses guild?shard mapping instead, giving better event locality.

Why guild-level sharding? A guild's events (messages, presence, voice state) are highly correlated · members of the same guild need the same events. By co-locating all guild members on the same shard, Discord avoids cross-node communication for intra-guild events. This is fundamentally different from Slack's approach where any user can be on any gateway.

Connection Lifecycle

State machine: CONNECTING → IDENTIFYING → RESUMING/READY → CONNECTED → DISCONNECTED. Similar heartbeat mechanism to Slack's presence system but with different timeout values (41.25s vs 30s).

Why 41.25s heartbeat? Discord uses a jittered interval around 41.25s (server sends the exact value in Hello). This is longer than Slack's 30s timeout · the tradeoff is slower zombie detection vs. less heartbeat traffic at 10M+ connections. At 242K heartbeats/sec, even small reductions matter.

Voice Architecture

Signaling path (WebSocket) is separate from media path (WebRTC/UDP). SFU topology for multi-user voice channels.

Key insight: Voice signaling and media are completely decoupled. The gateway WS handles signaling (which voice channel, who's speaking), while a separate Rust-based voice server handles the actual audio/video streams over UDP. This means a gateway crash doesn't kill active voice · the voice server continues independently.

Resilience & Edge Cases

How the system handles failures without dropping connections or losing events

Scenario	Problem	Solution	Recovery Time
Gateway Crash	1M connections lost instantly. Users see disconnect.	Session state buffered server-side. Clients reconnect with Resume (op 6) · server replays missed events from seq buffer. Jittered backoff prevents thundering herd.	<5s with resume
Shard Rebalance	Need to move guilds between shards without downtime.	Drain pattern: mark shard DRAINING → transfer state → send Reconnect (op 7) to affected clientS → Clients resume on new shard.	<10s per guild
Thundering Herd	Gateway crash → 1M clients reconnect simultaneously → overload remaining gateways.	Jittered exponential backoff: random(1-5s) initial delay, then 2· backoff. Server returns `429` with `Retry-After` header. Rate limit reconnects per IP.	~30s to stabilize
Voice Server Failover	Voice server crash → active voice channels interrupted.	Voice service detects failure → assigns new voice server → sends Voice Server Update to affected clientS → Clients reconnect voice WS + re-negotiate WebRTC.	~2-3s audio gap
Rate Limiting Bots	Abusive bots flood gateway with events, degrading service for legitimate users.	Per-connection rate limits (120 events/60s). Identify payload includes `intents` · bots only receive events they need. Global rate limit headers. Close with `4008` (rate limited).	Immediate rejection
Zombie Connections	Client appears connected but is actually dead (network issue).	Heartbeat ACK mechanism: if client sends heartbeat but receives no ACK, it closes and reconnects. Server-side: if no heartbeat received within interval · 1.5, server closes connection.	~62s detection
Large Guild Events	Guild with 500K+ members → presence updates would flood all connections.	Lazy loading: guilds >250 members don't send full member list in READY. Client requests member chunks on demand. Presence only for visible members.	N/A (prevention)

Worst case: Full region failure. All gateways in a region go down simultaneously. Mitigation: GeoDNS failover to next-closest region. Clients reconnect to new region. Session resume may fail (buffer lost) → full re-identify. Expected recovery: 30-60s with degraded latency until region recovers.

Tech Stack

Discord's infrastructure choices · each optimized for its specific workload

Component	Technology	Why This Choice
Gateway Servers	Elixir / Erlang (BEAM VM)	BEAM handles millions of lightweight processes. Each WS connection = 1 Erlang process (~2KB). Built-in supervision trees for fault tolerance. Hot code reloading for zero-downtime deploys.
Voice Servers	Rust	Low-latency audio processing requires predictable performance. No GC pauses. Zero-cost abstractions for packet forwarding. Custom SFU implementation.
Message Storage	Cassandra → ScyllaDB	Write-heavy workload (billions of messages/day). ScyllaDB: C++ rewrite of Cassandra with better tail latencies. Partition key = channel_id for time-series access pattern.
Presence & Routing	Redis Cluster	Sub-ms lookups for user status, guild member lists, and shard?gateway mapping. Pub/Sub for cross-node event distribution.
API Services	Python (Flask) → Rust	Originally Python for rapid development. Migrating hot paths to Rust for performance. API gateway handles REST + rate limiting.
WS Protocol	Custom binary over WebSocket	JSON for identify/resume. ETF (Erlang Term Format) or zlib-compressed JSON for events. Reduces bandwidth 60-80% vs raw JSON.
Service Mesh	gRPC + Envoy	Internal service communication. Protobuf contracts. Load balancing, circuit breaking, and observability built into the mesh.
Orchestration	Kubernetes (GKE)	Auto-scaling gateway pods based on connection count. Rolling deploys with drain + reconnect pattern.

Key migration: Discord moved from Cassandra to ScyllaDB for message storage, achieving P99 latency reduction from 40-125ms to 15ms. The BEAM VM (Elixir) was chosen specifically because each WebSocket connection maps to a lightweight Erlang process · enabling 1M+ connections per node with built-in fault isolation.

Interview Cheat Sheet

5 key points to nail a Discord WebSocket infrastructure interview question

1. Guild-Level Sharding is the Core Insight
Unlike Slack (user?gateway mapping), Discord assigns each guild to a specific shard. This means all events for a guild are processed by the same gateway node · no cross-node fan-out for intra-guild events. Formula: shard_id = (guild_id >> 22) % num_shards. This is deterministic · any node can compute it without a lookup.

2. Session Resume Prevents Thundering Herd
When a gateway crashes, 1M clients don't all re-identify (expensive). Instead, they Resume with session_id + last seq number. The server replays missed events from an in-memory buffer. Combined with jittered backoff, this turns a catastrophic reconnect storm into a manageable trickle. Resume is O(missed_events), not O(all_guilds).

3. Voice is Completely Decoupled from Gateway
Signaling (who joins/leaves voice) goes through the gateway WS. But actual audio goes through separate Rust-based voice servers using WebRTC/UDP with SFU topology. A gateway crash doesn't interrupt active voice calls. Voice servers are regionally deployed for <100ms latency.

4. Elixir/BEAM is Purpose-Built for This
Each WebSocket = 1 Erlang process (~2KB memory). BEAM's preemptive scheduler ensures no single connection can starve others. Supervision trees auto-restart crashed processes. Hot code reloading enables zero-downtime deploys. This is why Discord handles 1M+ connections per gateway node.

5. Intents + Lazy Loading Solve the Large Guild Problem
Guilds with 500K+ members would overwhelm connections with presence updates. Solution: Gateway Intents let clients declare which events they want (bots don't need typing indicators). Guilds >250 members use lazy member loading · presence only sent for visible members. This reduces event volume by 90%+ for large guilds.

System Design Case Study