System Design Case Study

How does Zoom handle 300-person meetings with screen sharing and adaptive video quality?

Zoom · 300 participants · <200ms latency · adaptive quality per participant

How does Zoom handle 300-person meetings with screen sharing, adapting per-participant video quality in real-time based on available bandwidth while keeping total meeting latency under 200ms?

Scope: Real-time media routing (SFU architecture, simulcast, bandwidth estimation, speaker detection). Not scheduling, recording, or chat. The hard question: how do you route 300 video streams selectively to each participant, based on their individual bandwidth and viewport, all under 200ms?
300 participants per meeting
<200ms end-to-end latency (glass-to-glass)
2-5 Mbps per participant (upload + download)
48kHz audio sample rate (Opus codec)

Functional Requirements

Feature | Description
Simulcast | Each client sends multiple quality layers (e.g., 180p, 360p, 720p); the SFU picks which to forward
Speaker Detection | Audio energy analysis identifies the active speaker; the SFU prioritizes their high-quality stream
Screen Share | Separate stream with higher resolution (1080p) and lower framerate (5-15fps), prioritized over video
Bandwidth Estimation | Per-participant REMB/TWCC feedback; the SFU adapts forwarded quality every 1-2 seconds
Gallery View | Show up to 25-49 thumbnails; the SFU sends low-res for non-speakers, high-res for the active speaker
SRTP Encryption | All media encrypted hop-by-hop between client and SFU (SRTP); optional E2EE for sensitive meetings
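
To make the Speaker Detection row concrete, here is a minimal sketch of ranking participants by smoothed audio energy. It is an illustration under assumptions, not Zoom's implementation: the class name `ActiveSpeakerDetector`, the smoothing factor `alpha`, and the use of normalized PCM samples are all hypothetical choices. Where the samples come from (decoded audio or a client-reported level) is left open.

```python
import math


class ActiveSpeakerDetector:
    """Tracks a smoothed audio energy per participant and ranks speakers."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha   # EMA smoothing factor (assumed value, not from the source)
        self.energy = {}     # participant id -> smoothed RMS level

    def on_audio_frame(self, participant, samples):
        # RMS energy of one frame of PCM samples normalized to [-1.0, 1.0].
        rms = math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0
        prev = self.energy.get(participant, 0.0)
        # The exponential moving average keeps one loud frame from flipping the speaker.
        self.energy[participant] = (1 - self.alpha) * prev + self.alpha * rms

    def top_speakers(self, n=1):
        # Highest smoothed energy first; the SFU would prioritize these senders.
        ranked = sorted(self.energy, key=self.energy.get, reverse=True)
        return ranked[:n]


detector = ActiveSpeakerDetector()
detector.on_audio_frame("alice", [0.5, -0.5, 0.5, -0.5])   # talking
detector.on_audio_frame("bob", [0.01, -0.01])              # background noise
detector.on_audio_frame("carol", [])                       # silent frame
print(detector.top_speakers(2))
```

The smoothing factor trades responsiveness for stability: a larger `alpha` switches speakers faster but is more easily fooled by coughs and keyboard noise.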

Non-Functional Requirements

Property | Target | Design Impact
Latency | <200ms glass-to-glass | SFU (no transcoding) over MCU. UDP transport. Geo-distributed media servers.
Scalability | 300 participants per room | SFU selective forwarding: O(N), not O(N²). Only forward what each client needs.
Adaptability | Per-participant quality | Simulcast layers + bandwidth estimation. Each receiver gets optimal quality for their link.
Reliability | 99.9% uptime | Redundant SFU clusters. Seamless failover. FEC + NACK for packet loss recovery.
Audio Priority | Audio never drops | Audio packets prioritized over video. Separate DSCP marking. Opus FEC built-in.
Security | SRTP + DTLS | All streams encrypted. Key exchange via DTLS-SRTP. Optional E2EE layer.
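
The O(N) versus O(N²) claim in the Scalability row can be checked with a few lines of arithmetic. The sketch below counts media streams for an SFU and for a full peer-to-peer mesh; the function names are illustrative, and the count treats each sender's simulcast upload as one stream.

```python
def sfu_streams(participants, visible_per_client):
    """Each client uploads once and downloads only the streams it displays."""
    uploads = participants
    downloads = participants * visible_per_client
    return uploads + downloads


def mesh_streams(participants):
    """In a full mesh every client sends a stream to every other client."""
    return participants * (participants - 1)


print(sfu_streams(300, 25))   # 300 uploads + 300 * 25 downloads
print(mesh_streams(300))      # 300 * 299 directed streams
```

With 300 participants and about 25 visible tiles each, this gives 7,800 streams for the SFU against 89,700 for a mesh, matching the figures quoted in the architecture summary.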

High-Level Architecture

SFU-based selective forwarding: each participant sends once, and the SFU decides what to forward to whom.

LAYER 1: SENDER. Each participant encodes 3 simulcast layers + audio (Opus with FEC).

Participant (sender) encodes video at 3 quality layers, all sent to the SFU simultaneously:
Layer 3: 1080p @30fps (~2.5 Mbps)
Layer 2: 720p @30fps (~1.0 Mbps)
Layer 1: 180p @15fps (~0.15 Mbps)

Audio encoding: Opus codec @ 48kHz. FEC (Forward Error Correction) handles 10% packet loss gracefully. Always prioritized over video; never dropped (even at 500kbps).

Transport (UDP/SRTP): UDP for low latency (no head-of-line blocking). SRTP encryption (DTLS key exchange). Tolerates packet loss rather than paying TCP's retransmit delay. NACK for selective retransmit (critical frames).

LAYER 2: SFU. Selective Forwarding Unit (per-participant bandwidth estimation, layer selection).

The SFU receives all 300 streams and forwards a SUBSET to each client (no transcoding). Its components:
Bandwidth Estimator (TWCC per receiver)
Active Speaker Detector (audio energy)
Layer Selector (pick quality per receiver)
Packet Router (O(1) per packet, no decode)

Decision logic (adapts every 1-2s to each receiver's bandwidth):
speaker: forward 1080p/720p
gallery: forward 180p thumbnails
off-screen: forward nothing

Cascaded SFUs (geo): Multi-region meetings use one SFU per region (e.g., US-East, EU-West, AP-South), and users connect to the nearest SFU. An inter-SFU relay carries 1 stream per region to minimize cross-region hops. 300 users does not mean 300 cross-region streams; only unique streams are relayed between SFUs.

LAYER 3: RECEIVER. Each gets 1 high-quality speaker + N thumbnails (gallery/speaker view).

Client B (5 Mbps) receives: speaker @720p + 24 gallery thumbnails @180p + screen share @1080p (if active) + audio for all active speakers. Total: ~5 Mbps download.
Client C (1 Mbps) receives: speaker @360p + 4 thumbnails @180p only + screen share @720p (downgraded) + audio (always full quality). Total: ~1 Mbps download.
Client D (mobile/cellular) receives: speaker @360p only + audio for active speakers. No gallery thumbnails (save data); speaker view only on a small screen. Total: ~0.5 Mbps download.

Summary: SFU not MCU (no server-side mixing) | Simulcast: sender encodes 3 layers, SFU picks per receiver | UDP: tolerate loss vs TCP delay | <200ms glass-to-glass | 300 uploads + ~25 downloads each = 7,800 streams (vs 89,700 for P2P) | Audio never dropped
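
The decision logic in Layer 2 can be sketched as a small budget allocator. This is a simplified illustration, not the actual selection algorithm: it uses the three sender layer bitrates listed above, assumes a 50 kbps audio reservation (a made-up figure), ignores screen share, and the function name `select_streams` is hypothetical.

```python
# Layer bitrates in kbps, taken from the sender layer list in the architecture.
LAYER_KBPS = {"1080p": 2500, "720p": 1000, "180p": 150}
AUDIO_KBPS = 50  # assumed audio reservation; not a figure from the source


def select_streams(budget_kbps, gallery_slots):
    """Pick a speaker layer and thumbnail count that fit a receiver's budget."""
    remaining = budget_kbps - AUDIO_KBPS   # audio is reserved first, always
    thumb = LAYER_KBPS["180p"]
    # Try speaker layers from best to worst while keeping every visible thumbnail.
    for layer in ("1080p", "720p", "180p"):
        if LAYER_KBPS[layer] + gallery_slots * thumb <= remaining:
            return {"speaker": layer, "thumbnails": gallery_slots}
    # Not everything fits: keep the speaker at the lowest layer and trim thumbnails.
    if remaining >= thumb:
        extra = (remaining - thumb) // thumb
        return {"speaker": "180p", "thumbnails": min(gallery_slots, extra)}
    return {"speaker": None, "thumbnails": 0}   # audio only


print(select_streams(5000, 24))   # a well-connected desktop client
print(select_streams(500, 10))    # a constrained link
```

For a 5 Mbps receiver with 24 gallery tiles, the sketch lands on the 720p speaker layer plus all 24 thumbnails, in line with the Client B example. Off-screen participants never enter the calculation, which is the "forward nothing" branch.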

Key Design Decisions

Decision | Choice | Why
Topology | SFU (Selective Forwarding Unit) | No transcoding = <200ms latency. MCU adds 300-500ms for mixing.
Encoding | Simulcast (3 quality layers per sender) | SFU picks layer per receiver without re-encoding. Adapts instantly.
Transport | UDP + SRTP + DTLS | UDP for low latency. SRTP for encryption. DTLS for key exchange.
Bandwidth | TWCC (Transport-Wide Congestion Control) | Per-receiver feedback every 100ms. SFU adjusts forwarded layer.
Audio | Opus codec, always prioritized over video | Audio is critical for meetings. Never dropped even at 500kbps.
Screen Share | Separate stream, higher priority than video | Content sharing is primary use case. Gets bandwidth allocation first.
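
The priority order in the Audio and Screen Share rows (audio first, then screen share, then camera video) can be illustrated with a tiny per-tick send scheduler. This is a sketch under assumptions: the packet representation, the `schedule` function, and the per-tick byte budget are all hypothetical.

```python
# Lower number = higher priority, following the ordering in the table above.
PRIORITY = {"audio": 0, "screen": 1, "video": 2}


def schedule(packets, budget_bytes):
    """packets: list of (kind, size_bytes). Returns (sent, dropped) for one tick."""
    sent, dropped = [], []
    # sorted() is stable, so packets of the same kind keep their arrival order.
    for kind, size in sorted(packets, key=lambda p: PRIORITY[p[0]]):
        # Audio is always sent, even if that overdraws the budget; video pays for it.
        if kind == "audio" or size <= budget_bytes:
            sent.append((kind, size))
            budget_bytes -= size
        else:
            dropped.append((kind, size))
    return sent, dropped


queue = [("video", 1200), ("audio", 160), ("screen", 1000), ("video", 1200)]
sent, dropped = schedule(queue, budget_bytes=1500)
print(sent)
print(dropped)
```

With a 1,500-byte budget the audio and screen share packets go out and both camera video packets are dropped, which is the "video drops first" behavior described in this design.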
SFU vs MCU: An MCU (Multipoint Control Unit) decodes all streams, composites them into one video, and re-encodes, adding 300-500ms of latency. An SFU just forwards packets selectively, adding less than 50ms. At 300 participants, SFU is the only viable choice for real-time interaction.
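
To show why forwarding is cheap compared with mixing, here is a minimal sketch of a packet router that treats media as opaque bytes and only consults a subscription table. The `PacketRouter` class and its methods are hypothetical; a production SFU would also have to handle details such as sequence-number continuity when a receiver changes layers, which this sketch omits.

```python
from collections import defaultdict


class PacketRouter:
    """Forwards opaque packets by table lookup; nothing is decoded or re-encoded."""

    def __init__(self):
        # (sender, layer) -> receivers currently subscribed to that layer
        self.subscribers = defaultdict(set)
        # (receiver, sender) -> layer, so a switch can remove the old subscription
        self.current = {}

    def set_layer(self, receiver, sender, layer):
        """Point a receiver at one layer of a sender; layer=None unsubscribes."""
        old = self.current.pop((receiver, sender), None)
        if old is not None:
            self.subscribers[(sender, old)].discard(receiver)
        if layer is not None:
            self.subscribers[(sender, layer)].add(receiver)
            self.current[(receiver, sender)] = layer

    def route(self, sender, layer, packet):
        """Return (receiver, packet) pairs for one incoming packet."""
        receivers = self.subscribers.get((sender, layer), ())
        return [(r, packet) for r in sorted(receivers)]


router = PacketRouter()
router.set_layer("bob", "alice", "720p")
router.set_layer("carol", "alice", "180p")
print(router.route("alice", "720p", b"\x80payload"))
router.set_layer("bob", "alice", "180p")   # bob's bandwidth dropped
print(router.route("alice", "180p", b"\x80payload"))
```

Switching a receiver's quality is a table update rather than a new encode, which is what lets the SFU adapt per receiver without adding mixing latency.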
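Simulcast trade-off: Each sender uploads 3 layers instead of 1. With the layer bitrates listed in the architecture (~2.5 + ~1.0 + ~0.15 Mbps), that is about 3.65 Mbps of upload versus ~2.5 Mbps for the top layer alone. The extra upload cost lets the SFU switch quality per receiver instantly, without requesting a new encode from the sender, which is critical for real-time adaptation.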
Failure handling: If a participant's bandwidth drops, the SFU switches to a lower simulcast layer within 1-2 seconds. If an SFU node fails, clients reconnect to a backup SFU (ICE restart). Audio is never sacrificed; video drops first.
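
The bandwidth-drop case can be sketched as a layer switcher that downgrades immediately but upgrades cautiously, so a noisy estimate does not make the picture flap between qualities. The hysteresis policy is an assumption added for illustration: the `headroom` factor, the `stable_samples` count, and the `LayerSwitcher` class are not from the source.

```python
# (name, kbps) from lowest to highest, using the layer bitrates in the architecture.
LAYERS = [("180p", 150), ("720p", 1000), ("1080p", 2500)]


class LayerSwitcher:
    def __init__(self, headroom=1.2, stable_samples=3):
        self.index = 0                        # start on the lowest layer
        self.headroom = headroom              # assumed safety margin before upgrading
        self.stable_samples = stable_samples  # assumed number of good estimates required
        self._good = 0

    def on_estimate(self, kbps):
        """Feed one bandwidth estimate; returns the layer to forward."""
        # Downgrade right away while the current layer no longer fits.
        while self.index > 0 and kbps < LAYERS[self.index][1]:
            self.index -= 1
            self._good = 0
        # Upgrade only after several consecutive estimates with headroom to spare.
        can_go_up = self.index + 1 < len(LAYERS)
        if can_go_up and kbps >= LAYERS[self.index + 1][1] * self.headroom:
            self._good += 1
            if self._good >= self.stable_samples:
                self.index += 1
                self._good = 0
        else:
            self._good = 0
        return LAYERS[self.index][0]


switcher = LayerSwitcher()
for estimate in [1500, 1500, 1500, 1500, 400]:
    print(estimate, switcher.on_estimate(estimate))
```

In this run the receiver climbs to 720p after three stable estimates and falls back to 180p as soon as the estimate drops to 400 kbps. The sketch never turns video off entirely; a real system would also need an audio-only state.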

Interview Tips

Must mention: SFU architecture (not MCU for large meetings). Simulcast (multiple quality layers). Per-participant bandwidth estimation. Speaker detection for priority routing. UDP transport for low latency.
Bonus points: TWCC feedback mechanism. Opus FEC for audio resilience. Cascaded SFUs for geo-distribution. SRTP/DTLS security model. Explain why screen share is a separate stream with different encoding parameters.
Common mistakes: Suggesting MCU: it doesn't scale to 300 participants with <200ms latency. Peer-to-peer mesh: O(N²) connections are impossible at 300. TCP transport: head-of-line blocking kills real-time. Single quality stream: it can't adapt per receiver.
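
For the TWCC feedback point, it can help to show what per-packet send and arrival times are used for. The sketch below is a deliberately simplified delay-gradient estimator, far cruder than the congestion controllers used in production WebRTC stacks: if packets arrive more spread out than they were sent, a queue is building and the estimate is cut; otherwise it grows slowly. The class name, thresholds, and multipliers are all assumed values.

```python
class DelayBasedEstimator:
    """Toy bandwidth estimator driven by per-packet send/arrival timestamps."""

    def __init__(self, start_kbps=1000.0, min_kbps=100.0, max_kbps=5000.0,
                 threshold_ms=5.0):
        self.estimate_kbps = start_kbps
        self.min_kbps = min_kbps
        self.max_kbps = max_kbps
        self.threshold_ms = threshold_ms   # assumed tolerance for jitter
        self._prev = None                  # (send_ms, recv_ms) of the previous packet

    def on_feedback(self, send_ms, recv_ms):
        if self._prev is not None:
            send_delta = send_ms - self._prev[0]
            recv_delta = recv_ms - self._prev[1]
            # Positive gradient: arrivals are spreading out, so a queue is growing.
            gradient = recv_delta - send_delta
            if gradient > self.threshold_ms:
                self.estimate_kbps *= 0.85   # back off (assumed factor)
            else:
                self.estimate_kbps *= 1.05   # probe upward (assumed factor)
            self.estimate_kbps = max(self.min_kbps,
                                     min(self.max_kbps, self.estimate_kbps))
        self._prev = (send_ms, recv_ms)
        return self.estimate_kbps


est = DelayBasedEstimator()
est.on_feedback(send_ms=0, recv_ms=50)
print(est.on_feedback(send_ms=20, recv_ms=70))    # steady one-way delay
print(est.on_feedback(send_ms=40, recv_ms=120))   # delay jumped by 30ms
```

The output of an estimator like this is what a layer selector would consume when deciding which simulcast layer each receiver can afford.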

Interview Cheat Sheet

The 8 things to say for video calling design

1. SFU (Selective Forwarding Unit): server receives all streams, selectively forwards (not MCU mixing)
2. Simulcast: sender encodes 3 quality layers (high/med/low), SFU picks per receiver
3. Per-participant bandwidth estimation: TWCC feedback, adapt forwarded quality individually
4. Active speaker detection: only forward video of top 4-9 speakers (not all 300)
5. UDP + SRTP: low latency transport, encrypted, tolerates packet loss (vs TCP retransmit)
6. Cascaded SFUs: geo-distributed servers, each region has local SFU, inter-SFU relay
7. Opus codec for audio: FEC (forward error correction) handles 10% packet loss gracefully
8. Separate screen share stream: different encoding params (high res, low fps, text-optimized)
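
Item 6 is easier to explain with a small count of what actually crosses regions. The sketch below assumes a hypothetical data model (a participant-to-region map and a receiver-to-senders subscription map) and computes the set of inter-SFU relays needed: one per sender per remote region that has at least one viewer.

```python
def relayed_streams(home_region, subscriptions):
    """home_region: participant -> region name.
    subscriptions: receiver -> set of senders whose video it is watching.
    Returns the set of (sender, destination_region) relays between SFUs."""
    relays = set()
    for receiver, senders in subscriptions.items():
        dest = home_region[receiver]
        for sender in senders:
            # Only streams leaving the sender's home region need a relay,
            # and a set deduplicates viewers sharing the same regional SFU.
            if home_region[sender] != dest:
                relays.add((sender, dest))
    return relays


home = {"a": "us-east", "b": "eu-west", "c": "eu-west", "d": "eu-west"}
watching = {"b": {"a"}, "c": {"a"}, "d": {"a", "b"}}
print(relayed_streams(home, watching))
```

Three EU viewers watching one US speaker need a single relayed stream, because the EU SFU fans it out locally. That is the sense in which only unique streams cross regions.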