How does Zoom handle 300-person meetings with screen sharing, adapting per-participant video quality in real-time based on available bandwidth while keeping total meeting latency under 200ms?
Scope: Real-time media routing · SFU architecture, simulcast, bandwidth estimation, speaker detection. Not scheduling, recording, or chat. The hard question: how do you route 300 video streams selectively to each participant based on their individual bandwidth and viewport, all under 200ms?
300 participants per meeting
<200ms end-to-end latency (glass-to-glass)
2-5 Mbps per participant (upload + download)
48kHz audio sample rate (Opus codec)
Functional Requirements
Feature | Description
Simulcast | Each client sends multiple quality layers (e.g., 180p, 360p, 720p) · SFU picks which to forward
Speaker Detection | Audio energy analysis identifies the active speaker; SFU prioritizes their high-quality stream
Screen Share | Separate stream with higher resolution (1080p) and lower framerate (5-15fps), prioritized over video
Bandwidth Estimation | Per-participant REMB/TWCC feedback; SFU adapts forwarded quality every 1-2 seconds
Gallery View | Show up to 25-49 thumbnails; SFU sends low-res for non-speakers, high-res for the active speaker
SRTP Encryption | All media encrypted hop-by-hop between client and SFU (SRTP); optional E2EE for sensitive meetings
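The speaker-detection row can be illustrated with a minimal sketch. It assumes each audio frame arrives as a list of float PCM samples in [-1.0, 1.0]; the class name, smoothing factor, and -50 dBFS speech threshold are hypothetical values chosen for illustration, not figures from any real SFU.

```python
import math

SILENCE_DB = -100.0  # assumed floor used for empty or all-zero frames


def rms_dbfs(samples):
    """RMS level of one audio frame in dBFS; samples are floats in [-1.0, 1.0]."""
    if not samples:
        return SILENCE_DB
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square <= 0.0:
        return SILENCE_DB
    return max(SILENCE_DB, 10.0 * math.log10(mean_square))


class ActiveSpeakerDetector:
    """Tracks a smoothed level per participant and reports the loudest one."""

    def __init__(self, smoothing=0.2, speech_threshold_db=-50.0):
        self.smoothing = smoothing                      # weight of the newest frame
        self.speech_threshold_db = speech_threshold_db  # below this counts as silence
        self.levels = {}

    def observe(self, participant_id, samples):
        """Fold one frame into the participant's exponentially smoothed level."""
        level = rms_dbfs(samples)
        previous = self.levels.get(participant_id, level)
        self.levels[participant_id] = (
            (1.0 - self.smoothing) * previous + self.smoothing * level
        )

    def active_speaker(self):
        """Return the loudest participant above the threshold, or None."""
        audible = {p: lvl for p, lvl in self.levels.items()
                   if lvl > self.speech_threshold_db}
        return max(audible, key=audible.get) if audible else None
```

The smoothing keeps a single cough from stealing the speaker slot. A forwarding server does not necessarily have to decode audio to do this: clients can report per-packet levels in the RTP audio-level header extension (RFC 6464), in which case `observe` would take that level directly.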
Non-Functional Requirements
Property | Target | Design Impact
Latency | <200ms glass-to-glass | SFU (no transcoding) over MCU. UDP transport. Geo-distributed media servers.
Scalability | 300 participants per room | SFU selective forwarding · O(N) not O(N²). Only forward what each client needs.
Adaptability | Per-participant quality | Simulcast layers + bandwidth estimation. Each receiver gets optimal quality for their link.
Reliability | 99.9% uptime | Redundant SFU clusters. Seamless failover. FEC + NACK for packet loss recovery.
Audio Priority | Audio never drops | Audio packets prioritized over video. Separate DSCP marking. Opus FEC built-in.
Security | SRTP + DTLS | All streams encrypted. Key exchange via DTLS-SRTP. Optional E2EE layer.
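The O(N) versus O(N²) claim in the scalability row can be checked with quick arithmetic. The sketch below counts directed media streams; the function names are hypothetical, and the 25 visible tiles per receiver is an assumption taken from the low end of the gallery-view range.

```python
def mesh_streams(n):
    """Full mesh: every participant sends a stream to every other participant."""
    return n * (n - 1)


def sfu_uplinks(n):
    """SFU: each participant uploads once to the server."""
    return n


def sfu_downlinks(n, visible_per_receiver):
    """SFU: each receiver only gets the streams it actually displays."""
    return n * min(visible_per_receiver, n - 1)


if __name__ == "__main__":
    n = 300
    print("mesh streams: ", mesh_streams(n))        # 89700
    print("SFU uplinks:  ", sfu_uplinks(n))         # 300
    print("SFU downlinks:", sfu_downlinks(n, 25))   # 7500
```

In a mesh each client would also have to upload 299 copies of its own video, which is what makes the topology unworkable at this size. With an SFU the per-client upload stays constant, and the fan-out cost moves to the server.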
High-Level Architecture
SFU-based selective forwarding · each participant sends once, SFU decides what to forward to whom
Key Design Decisions
Decision | Choice | Why
Topology | SFU (Selective Forwarding Unit) | No transcoding = <200ms latency. MCU adds 300-500ms for mixing.
Encoding | Simulcast (3 quality layers per sender) | SFU picks layer per receiver without re-encoding. Adapts instantly.
Transport | UDP + SRTP + DTLS | UDP for low latency. SRTP for encryption. DTLS for key exchange.
Bandwidth | TWCC (Transport-Wide Congestion Control) | Per-receiver feedback every 100ms. SFU adjusts forwarded layer.
Audio | Opus codec, always prioritized over video | Audio is critical for meetings. Never dropped even at 500kbps.
Screen Share | Separate stream, higher priority than video | Content sharing is primary use case. Gets bandwidth allocation first.
SFU vs MCU: An MCU (Multipoint Control Unit) decodes all streams, composites them into one video, and re-encodes · adding 300-500ms latency. An SFU just forwards packets selectively · <50ms added latency. At 300 participants, SFU is the only viable choice for real-time interaction.
Simulcast trade-off: Each sender uploads 3 layers (~2.5 Mbps total) instead of 1 (~1 Mbps). The extra upload cost enables the SFU to instantly switch quality per receiver without requesting a new encode from the sender · critical for real-time adaptation.
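A minimal sketch of the per-receiver layer choice follows. The per-layer bitrates are assumptions picked so the three layers sum to the ~2.5 Mbps upload quoted above; `pick_layer` and the 0.85 headroom factor are hypothetical and only illustrate the idea of choosing the highest layer that fits the receiver's estimated bandwidth.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    kbps: int


# Assumed bitrates (not from the source); they sum to 2500 kbps.
LAYERS = (Layer("180p", 150), Layer("360p", 500), Layer("720p", 1850))


def pick_layer(available_kbps, layers=LAYERS, headroom=0.85):
    """Return the highest-bitrate layer that fits within the receiver's budget.

    `headroom` leaves a safety margin below the bandwidth estimate.
    Returns None when even the lowest layer does not fit (audio only).
    """
    budget = available_kbps * headroom
    best = None
    for layer in sorted(layers, key=lambda l: l.kbps):
        if layer.kbps <= budget:
            best = layer
    return best


if __name__ == "__main__":
    for estimate in (3000, 1000, 200, 100):
        chosen = pick_layer(estimate)
        print(estimate, "kbps ->", chosen.name if chosen else "audio only")
```

Because all three layers are already arriving at the server, switching a receiver from 720p to 360p is only a forwarding decision. No round trip to the sender's encoder is needed.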
Failure handling: If a participant's bandwidth drops → SFU switches to lower simulcast layer within 1-2 seconds. If SFU node fails → clients reconnect to backup SFU (ICE restart). Audio is never sacrificed · video drops first.
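The "video drops first" rule can be sketched as a strict-priority allocation over a receiver's bandwidth budget. The function name and the bitrates in the example (48 kbps audio, 300 kbps screen share, 500 kbps video) are illustrative assumptions, not measured values.

```python
def allocate(budget_kbps, demands):
    """Grant bandwidth in strict priority order.

    `demands` is a list of (stream_name, kbps_wanted) tuples, highest
    priority first. Lower-priority streams only get what is left over.
    """
    grants = {}
    remaining = max(budget_kbps, 0)
    for name, wanted in demands:
        granted = min(wanted, remaining)
        grants[name] = granted
        remaining -= granted
    return grants


if __name__ == "__main__":
    # Priority order from the design: audio, then screen share, then video.
    demands = [("audio", 48), ("screen_share", 300), ("video", 500)]
    print(allocate(500, demands))  # video is squeezed to the leftover 152 kbps
    print(allocate(100, demands))  # audio intact, screen share degraded, no video
```

When the budget shrinks, the shortfall lands on the last entries in the list, so camera video degrades before screen share and screen share before audio.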
Interview Tips
Must mention: SFU architecture (not MCU for large meetings). Simulcast (multiple quality layers). Per-participant bandwidth estimation. Speaker detection for priority routing. UDP transport for low latency.
Bonus points: TWCC feedback mechanism. Opus FEC for audio resilience. Cascaded SFUs for geo-distribution. SRTP/DTLS security model. Explain why screen share is a separate stream with different encoding parameters.
Common mistakes: Suggesting MCU · doesn't scale to 300 with <200ms. Peer-to-peer mesh · O(N²) connections impossible at 300. TCP transport · head-of-line blocking kills real-time. Single quality stream · can't adapt per receiver.
Interview Cheat Sheet
The 8 things to say for video calling design
1. SFU (Selective Forwarding Unit) · server receives all streams, selectively forwards (not MCU mixing)
2. Simulcast · sender encodes 3 quality layers (high/med/low), SFU picks per receiver
3. Per-participant bandwidth estimation · TWCC feedback, adapt forwarded quality individually
4. Active speaker detection · only forward video of top 4-9 speakers (not all 300)
5. UDP + SRTP · low latency transport, encrypted, tolerates packet loss (vs TCP retransmit)
6. Cascaded SFUs · geo-distributed servers, each region has local SFU, inter-SFU relay
7. Opus codec for audio · FEC (forward error correction) handles 10% packet loss gracefully
8. Separate screen share stream · different encoding params (high res, low fps, text-optimized)
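Point 4 can be sketched as a top-K selection over per-participant audio levels. The function name, the dBFS level map, and the `pinned` parameter (for example, a participant the viewer has spotlighted) are hypothetical; a real SFU would likely add hysteresis so tiles do not reshuffle on every frame.

```python
import heapq


def select_forwarded(levels_db, k=9, pinned=()):
    """Pick which participants' video to forward to one receiver.

    `levels_db` maps participant id to a smoothed audio level in dBFS.
    Pinned participants come first, then the loudest remaining speakers,
    up to `k` entries in total.
    """
    # Keep pinned ids that exist, de-duplicated and in their given order.
    chosen = list(dict.fromkeys(p for p in pinned if p in levels_db))[:k]
    rest = {p: lvl for p, lvl in levels_db.items() if p not in chosen}
    chosen += heapq.nlargest(k - len(chosen), rest, key=rest.get)
    return chosen


if __name__ == "__main__":
    # 300 participants with synthetic levels; p299 is the loudest.
    levels = {f"p{i}": -60.0 + i * 0.1 for i in range(300)}
    print(select_forwarded(levels, k=4))
    print(select_forwarded(levels, k=4, pinned=("p0",)))
```

Whatever the meeting size, each receiver gets at most `k` video streams, which is why download bandwidth stays roughly flat as the room grows from 30 to 300 participants.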