How does Zoom handle 300-person meetings with adaptive video quality?

🎯 Zoom · 300 participants · <200ms latency · adaptive quality per participant

Concepts Involved

WebSocket · Load Balancer · UDP · Rate Limiting · Multi-Region

How does Zoom handle 300-person meetings with screen sharing, adapting per-participant video quality in real-time based on available bandwidth while keeping total meeting latency under 200ms?

Scope: Real-time media routing · SFU architecture, simulcast, bandwidth estimation, speaker detection. Not scheduling, recording, or chat. The hard question: how do you route 300 video streams selectively to each participant based on their individual bandwidth and viewport, all under 200ms?

300 participants · per meeting
<200ms end-to-end latency · glass-to-glass
2-5 Mbps per participant · upload + download
48kHz audio sample rate · Opus codec

Functional Requirements

Feature	Description
Simulcast	Each client sends multiple quality layers (e.g., 180p, 360p, 720p) · SFU picks which to forward
Speaker Detection	Audio energy analysis identifies active speaker; SFU prioritizes their high-quality stream
Screen Share	Separate stream with higher resolution (1080p) and lower framerate (5-15fps), prioritized over video
Bandwidth Estimation	Per-participant REMB/TWCC feedback; SFU adapts forwarded quality every 1-2 seconds
Gallery View	Show up to 25-49 thumbnails; SFU sends low-res for non-speakers, high-res for active speaker
SRTP Encryption	All media encrypted hop-by-hop between client and SFU (SRTP); optional E2EE for sensitive meetings
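
The speaker detection row can be illustrated with a small sketch. It assumes the server can see an audio level per frame (here computed as RMS over raw samples; in practice the level might instead be reported by each client, for example through the RTP audio-level header extension). The class name, smoothing weight, and silence floor are illustrative assumptions, not Zoom's actual values.

```python
import math
from collections import defaultdict


class ActiveSpeakerDetector:
    """Tracks a smoothed audio level per participant and picks the loudest."""

    def __init__(self, smoothing=0.8, silence_floor=0.01):
        self.smoothing = smoothing          # assumed weight on the previous level
        self.silence_floor = silence_floor  # assumed level below which nobody is "speaking"
        self.levels = defaultdict(float)

    @staticmethod
    def rms(samples):
        """Root-mean-square energy of one frame (samples in [-1.0, 1.0])."""
        if not samples:
            return 0.0
        return math.sqrt(sum(s * s for s in samples) / len(samples))

    def update(self, participant_id, samples):
        """Blend the new frame's energy into the participant's running level."""
        level = self.rms(samples)
        prev = self.levels[participant_id]
        self.levels[participant_id] = (
            self.smoothing * prev + (1 - self.smoothing) * level
        )

    def active_speaker(self):
        """Return the loudest participant, or None if everyone is quiet."""
        if not self.levels:
            return None
        pid, level = max(self.levels.items(), key=lambda kv: kv[1])
        return pid if level >= self.silence_floor else None
```

The smoothing keeps a single cough or keyboard click from stealing the speaker slot; a production detector would likely add a minimum hold time as well.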

Non-Functional Requirements

Property	Target	Design Impact
Latency	<200ms glass-to-glass	SFU (no transcoding) over MCU. UDP transport. Geo-distributed media servers.
Scalability	300 participants per room	SFU selective forwarding · O(N) not O(N²). Only forward what each client needs.
Adaptability	Per-participant quality	Simulcast layers + bandwidth estimation. Each receiver gets optimal quality for their link.
Reliability	99.9% uptime	Redundant SFU clusters. Seamless failover. FEC + NACK for packet loss recovery.
Audio Priority	Audio never drops	Audio packets prioritized over video. Separate DSCP marking. Opus FEC built-in.
Security	SRTP + DTLS	All streams encrypted. Key exchange via DTLS-SRTP. Optional E2EE layer.
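
To make the scalability row concrete, the following back-of-the-envelope sketch compares a full peer-to-peer mesh with an SFU. The function names are hypothetical, and the 25 visible tiles and 3 simulcast layers are taken from the figures quoted elsewhere in this case study.

```python
def mesh_total_connections(n):
    """Pairwise connections in a full mesh: n * (n - 1) / 2, which grows as O(N^2)."""
    return n * (n - 1) // 2


def mesh_streams_per_client(n):
    """In a mesh, every client uploads to and downloads from every other client."""
    return {"uploads": n - 1, "downloads": n - 1}


def sfu_streams_per_client(n, visible_tiles=25, simulcast_layers=3):
    """With an SFU, a client uploads its simulcast layers once and
    downloads only the tiles it actually displays."""
    return {
        "uploads": simulcast_layers,
        "downloads": min(visible_tiles, n - 1),
    }


if __name__ == "__main__":
    n = 300
    print("mesh connections:", mesh_total_connections(n))
    print("mesh per client:", mesh_streams_per_client(n))
    print("SFU per client:", sfu_streams_per_client(n))
```

At 300 participants the mesh needs 44,850 pairwise connections and 299 uploads per client, while the SFU keeps each client at one connection with a handful of streams in each direction.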

High-Level Architecture

SFU-based selective forwarding · each participant sends once, SFU decides what to forward to whom
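
"What to forward to whom" can be sketched as a per-receiver forwarding plan. This is a minimal illustration under stated assumptions: the function name, the fixed "720p"/"180p" layer labels, and the 25-tile cap are hypothetical choices drawn from the gallery view and simulcast figures in this case study, and a real SFU would also factor in each receiver's bandwidth estimate.

```python
def forwarding_plan(receiver, participants, active_speaker, max_tiles=25):
    """Return {sender_id: layer} describing what the SFU forwards to one receiver.

    The active speaker (if any, and if not the receiver) gets the high layer;
    remaining tiles are filled with the low layer up to max_tiles.
    """
    plan = {}
    if active_speaker is not None and active_speaker != receiver:
        plan[active_speaker] = "720p"

    for pid in participants:
        if len(plan) >= max_tiles:
            break  # gallery is full; everyone else is not forwarded at all
        if pid == receiver or pid in plan:
            continue  # never echo a client's own video back to it
        plan[pid] = "180p"
    return plan


if __name__ == "__main__":
    room = [f"p{i}" for i in range(300)]
    plan = forwarding_plan("p0", room, active_speaker="p7")
    print(len(plan), "streams forwarded; speaker layer:", plan["p7"])
```

The receiver ends up with 25 inbound video streams out of 299 possible senders, which is the selective part of selective forwarding.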

Key Design Decisions

Decision	Choice	Why
Topology	SFU (Selective Forwarding Unit)	No transcoding = <200ms latency. MCU adds 300-500ms for mixing.
Encoding	Simulcast (3 quality layers per sender)	SFU picks layer per receiver without re-encoding. Adapts instantly.
Transport	UDP + SRTP + DTLS	UDP for low latency. SRTP for encryption. DTLS for key exchange.
Bandwidth	TWCC (Transport-Wide Congestion Control)	Per-receiver feedback every 100ms. SFU adjusts forwarded layer.
Audio	Opus codec, always prioritized over video	Audio is critical for meetings. Never dropped, even when a receiver's bandwidth falls to 500kbps.
Screen Share	Separate stream, higher priority than video	Content sharing is primary use case. Gets bandwidth allocation first.
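
The priority order in the table (audio first, then screen share, then camera video) can be sketched as a greedy allocation against one receiver's bandwidth budget. The function name, stream dictionaries, and bitrates below are illustrative assumptions rather than measured values.

```python
# Lower number = higher priority, following the order described in the table.
PRIORITY = {"audio": 0, "screen_share": 1, "video": 2}


def allocate(budget_kbps, streams):
    """Greedily grant bandwidth in priority order.

    streams: list of dicts with 'id', 'kind', and 'kbps'.
    Returns the ids that fit in the budget, highest priority first.
    """
    forwarded = []
    remaining = budget_kbps
    # sorted() is stable, so streams of the same kind keep their input order.
    for stream in sorted(streams, key=lambda s: PRIORITY[s["kind"]]):
        if stream["kbps"] <= remaining:
            forwarded.append(stream["id"])
            remaining -= stream["kbps"]
    return forwarded


if __name__ == "__main__":
    streams = [
        {"id": "cam", "kind": "video", "kbps": 150},
        {"id": "deck", "kind": "screen_share", "kbps": 400},
        {"id": "mic", "kind": "audio", "kbps": 40},
    ]
    print(allocate(500, streams))   # constrained link
    print(allocate(2000, streams))  # healthy link
```

On the constrained 500 kbps link the camera stream is the one that gets dropped, while audio and the shared screen still go through.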

SFU vs MCU: An MCU (Multipoint Control Unit) decodes all streams, composites them into one video, and re-encodes · adding 300-500ms latency. An SFU just forwards packets selectively · <50ms added latency. At 300 participants, SFU is the only viable choice for real-time interaction.

Simulcast trade-off: Each sender uploads 3 layers (~2.5 Mbps total) instead of 1 (~1 Mbps). The extra upload cost enables the SFU to instantly switch quality per receiver without requesting a new encode from the sender · critical for real-time adaptation.
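
A minimal sketch of per-receiver layer selection follows. The per-layer bitrates are assumptions chosen so the three layers add up to the ~2.5 Mbps upload figure above, and the 15% headroom is likewise an illustrative safety margin, not a documented value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    kbps: int


# Assumed bitrates: 200 + 600 + 1700 = 2500 kbps total upload per sender.
LAYERS = [Layer("180p", 200), Layer("360p", 600), Layer("720p", 1700)]


def pick_layer(estimated_kbps, layers=LAYERS, headroom=0.85):
    """Pick the highest layer that fits within the receiver's estimated bandwidth.

    Only a fraction (headroom) of the estimate is used, so a slightly
    optimistic estimate does not immediately cause congestion.
    Returns None when even the lowest layer does not fit (audio only).
    """
    usable = estimated_kbps * headroom
    best = None
    for layer in sorted(layers, key=lambda l: l.kbps):
        if layer.kbps <= usable:
            best = layer
    return best


if __name__ == "__main__":
    for estimate in (3000, 1000, 250, 100):
        chosen = pick_layer(estimate)
        print(estimate, "kbps ->", chosen.name if chosen else "audio only")
```

Because every layer is already being uploaded, switching a receiver is just a change in which packets the SFU forwards; no new encode is requested from the sender.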

Failure handling: If a participant's bandwidth drops, the SFU switches to a lower simulcast layer within 1-2 seconds. If an SFU node fails, clients reconnect to a backup SFU (ICE restart). Audio is never sacrificed · video drops first.
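
One way to sketch the downgrade behavior is a switcher that drops layers immediately when bandwidth falls but climbs back one layer at a time, only after several consecutive good estimates. This hysteresis policy, the class name, the layer bitrates, and the `upgrade_after` threshold are all assumptions for illustration; the source only states that the switch happens within 1-2 seconds.

```python
class LayerSwitcher:
    """Downgrade immediately, upgrade cautiously (one step per streak)."""

    # Assumed (name, required kbps) pairs, lowest quality first.
    LAYERS = [("180p", 200), ("360p", 600), ("720p", 1700)]

    def __init__(self, upgrade_after=3):
        self.index = 0                  # start on the lowest layer
        self.upgrade_after = upgrade_after
        self.good_reports = 0           # consecutive estimates above current layer

    def _affordable_index(self, kbps):
        """Index of the highest layer that fits, or -1 if none does."""
        idx = -1
        for i, (_, need) in enumerate(self.LAYERS):
            if need <= kbps:
                idx = i
        return idx

    def on_estimate(self, kbps):
        """Feed one bandwidth estimate; return the layer name to forward,
        or None when video is paused and only audio is sent."""
        target = self._affordable_index(kbps)
        if target < self.index:
            self.index = target         # bandwidth dropped: switch down now
            self.good_reports = 0
        elif target > self.index:
            self.good_reports += 1
            if self.good_reports >= self.upgrade_after:
                self.index += 1         # stable headroom: step up one layer
                self.good_reports = 0
        else:
            self.good_reports = 0
        return self.LAYERS[self.index][0] if self.index >= 0 else None
```

Stepping up slowly avoids oscillating between layers on a link whose estimate hovers around a threshold, while the immediate step down protects audio as soon as congestion appears.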

Interview Tips

Must mention: SFU architecture (not MCU for large meetings). Simulcast (multiple quality layers). Per-participant bandwidth estimation. Speaker detection for priority routing. UDP transport for low latency.

Bonus points: TWCC feedback mechanism. Opus FEC for audio resilience. Cascaded SFUs for geo-distribution. SRTP/DTLS security model. Explain why screen share is a separate stream with different encoding parameters.

Common mistakes: Suggesting MCU · doesn't scale to 300 with <200ms. Peer-to-peer mesh · O(N²) connections impossible at 300. TCP transport · head-of-line blocking kills real-time. Single quality stream · can't adapt per receiver.

Interview Cheat Sheet

The 8 things to say for video calling design

1. SFU (Selective Forwarding Unit) · server receives all streams, selectively forwards (not MCU mixing)
2. Simulcast · sender encodes 3 quality layers (high/med/low), SFU picks per receiver
3. Per-participant bandwidth estimation · TWCC feedback, adapt forwarded quality individually
4. Active speaker detection · only forward video of top 4-9 speakers (not all 300)
5. UDP + SRTP · low latency transport, encrypted, tolerates packet loss (vs TCP retransmit)
6. Cascaded SFUs · geo-distributed servers, each region has local SFU, inter-SFU relay
7. Opus codec for audio · FEC (forward error correction) handles 10% packet loss gracefully
8. Separate screen share stream · different encoding params (high res, low fps, text-optimized)
