System Design Case Study

How does a group chat with 1M+ members deliver a single message without causing a fan-out explosion?

?? Design a large-group messaging system that delivers messages to 1M+ members with <2s latency · without write-time fan-out
Concepts Involved

Problem Statement

How does a group chat with 1M+ members deliver a single message without causing a fan-out explosion, keeping delivery latency under 2 seconds for all participants regardless of group size?

Scope: Large channel/group delivery path · not 1:1 DMs, not media upload, not E2E encryption. The hard question: how do you avoid O(N) write amplification when N = 1,000,000?
1M+
members per channel
some channels exceed 5M
200K+
channels at this scale
public broadcast channels
<2s
delivery target
all members, any region
700M+
monthly active users
global distribution
Cross-reference: Slack's Fan-out Strategy covers hybrid fan-out up to 50K members · this problem goes 20· further. At 1M+ members, even Slack's "push to online only" model breaks because 10% online = 100K simultaneous pushes per message.

Functional Requirements

What the system must do · core behaviours for 1M+ member channels

Must Have (Core)

1. Admin/allowed user can post a message to a channel with 1M+ members
2. All online members receive the message within 2 seconds
3. Offline members see the message on next app open (pull-based catch-up)
4. Messages are ordered · all members see the same sequence
5. Notification tiering · @mentions push immediately, regular messages update badge only
6. System handles member join/leave storms without disrupting delivery

Out of Scope (this design)

? End-to-end encryption (not applicable to large public channels)
? Media/file delivery (separate CDN pipeline)
? Message editing/deletion propagation
? Bot API and webhook integrations
? User authentication and session management
? Small group chat (<1K members · standard fan-out works fine)

Non-Functional Requirements

Quality constraints that shape the architecture for extreme fan-out

PropertyTargetWhy It Matters / Design Impact
Latency<2s delivery for all online membersUsers expect near-instant delivery. Read-time pull with long-polling achieves this without write amplification.
Throughput100M+ deliveries/day per large channelA single channel with 1M members and 100 msgs/day = 100M delivery events. Must not scale linearly with writes.
Availability99.99% (<52 min/year)Channels are broadcast infrastructure for governments and media. Downtime = public trust loss.
DurabilityZero message loss after server ACKAppend-only log with replication. Messages never deleted from server (unlike E2E chats).
ScalabilityLinear with readers, not writersWrite cost = O(1) per message. Read cost distributed across millions of clients pulling independently.
Fan-out ratio1 write : 0 server-side copiesNo per-user inbox. Single log entry serves all members. Eliminates write amplification entirely.
ConsistencyPer-channel total orderAll members see messages in identical sequence. Monotonic message IDs enable gap detection.
Fault toleranceSurvive datacenter failuresMulti-DC replication of channel log. Clients reconnect to any DC and resume from cursor.
Key insight: At 1M+ members, the system cannot afford O(N) work per message. The architecture must make message posting O(1) and shift the cost to readers · each reader does O(1) work to fetch their unread messages.

Scale Estimation

Derive why write-time fan-out is impossible at 1M+ members · and what replaces it

Given: 1M members per channel · 100 messages/day per channel · <2s delivery · 700M+ MAU
StepWhat to DeriveCalculationResultDesign Decision
1 Deliveries per channel/day 1M members · 100 msgs/day 100M deliveries/day Per-channel delivery volume · impossible to write-amplify
2 Write-time fan-out cost 1M inbox writes · 100 msgs/day 100M writes/day/channel ? With 200K channels: 20 trillion writes/day. Completely infeasible.
3 Read-time cost (log model) 1 append per message (O(1) write) 100 writes/day/channel ? Single append-only log. Readers pull. Write cost independent of member count.
4 Online members at any time 1M · 5% online rate ~50K concurrent readers 50K long-poll connections per large channel · manageable with connection pooling
5 Notification fan-out (push) Only @mentions trigger push: ~1% of msgs ~1 push/day per user Push notifications are rare · most msgs only increment badge counter
6 Storage per channel/day 100 msgs · 2KB avg (text + metadata) ~200KB/day/channel Trivial storage. 200K channels · 200KB = 40GB/day total for all large channels
7 Cursor storage 700M users · 8 bytes (last_read_msg_id) ~5.6GB Per-user cursor per channel. Fits in distributed cache. No per-user message copies.
8 Latency budget Long-poll wakeup (~50ms) + log read (~5ms) + network (~100ms) ~155ms actual Well within 2s budget. Long-poll gives near-real-time without persistent connections.
The impossibility proof: Write-time fan-out at 1M members means writing 1M inbox entries per message. At 100 msgs/day that's 100M writes/day for ONE channel. Multiply by 200K large channels = 20 trillion writes/day. No storage system can sustain this. Read-time pull from a shared log is the only viable architecture.

Architecture Overview

Single append-only channel log eliminates write amplification · clients pull from shared log independently

Admin posts Server appends to log Online clients pull (long-poll) Offline clients pull on reconnect Real-time UI Catch-up sync
Key insight: The architecture is read-time pull from a shared channel log. One write serves all members. Online clients long-poll for new entries; offline clients fetch from their last-seen cursor on reconnect. See detailed strategies below.

Fan-out Strategies at Scale

Comparing three approaches · and why read-time pull wins at 1M+ members

Fan-out Strategy Comparison at 1M+ Members ? Write-time Fan-out (Push to All) 1 message ? 1M inbox writes Message Fan-out Writer 1M copies Inbox 1 Inbox 2 Inbox 1M ... Cost: 100M writes/day/channel Storage explodes. Latency unbounded. IMPOSSIBLE. ? Read-time Pull (Log-based) 1 message ? 1 log append. Clients pull. Message Channel Log append-only Client 1 pull Client 2 pull Client N pull Cost: 100 writes/day (O(1) per msg) Reads distributed. Scales to any N. WINNER at 1M+. ?? Hybrid (Push Online + Pull Offline) Push to 50K online + offline pull later Message Fan-out (online) 50K pushes Online 1 Online 2 Offline pull Cost: 50K pushes/msg (still O(N)) Works for Slack (50K). Breaks at 1M+ (100K online pushes). Why Read-time Wins at 1M+ · Write cost: O(1) · single log append regardless of member count · Read cost: distributed across clients · each does O(1) cursor-based fetch · No per-user state on server · cursor lives on client · Notification push is separate (only @mentions, not content) Scale Thresholds · When Each Strategy Breaks <100 Write-time OK 100·50K Hybrid (Slack) 50K·1M Hybrid breaks 1M+ Read-time only
Cross-reference: Slack uses hybrid fan-out for channels up to 50K members · push to online, offline users catch up on reconnect. At 1M+ even the "push to online" part becomes 50K·100K pushes per message, creating unacceptable latency spikes. Telegram's solution: don't push message content at all · push only a lightweight "new message available" signal and let clients pull.

Channel Log Architecture

Single append-only log per channel · clients maintain their own cursor. Server tracks zero per-user delivery state.

CHANNEL LOG · Single Append-Only Sequence msg_id: 1 Hello world msg_id: 2 Breaking news msg_id: 3 Update v2.0 msg_id: 4 New feature msg_id: 5 Latest msg ? append direction CLIENT CURSORS · Each Client Tracks Its Own Position Client A (online) cursor = msg_id: 5 (caught up) Long-poll: waiting for msg_id > 5 Server holds connection until new msg arrives Client B (behind) cursor = msg_id: 3 (2 msgs behind) Fetches: GET msgs after id:3 ? [4, 5] Advances cursor to 5, then long-polls Client C (offline 2 days) cursor = msg_id: 1 (far behind) On open: paginated fetch from id:1 Catches up in batches of 100 New Member D cursor = msg_id: 5 (join point) Starts from latest · no backfill Can scroll up to load history on demand Key Properties of the Log Model O(1) Write Single append per message Independent of member count 1M members = same cost as 10 Client-side Cursor Server stores NO per-user read state Client persists last_read_msg_id locally Eliminates 1M cursor updates/msg Long-poll for Real-time Client: "give me msgs after cursor" Server holds request until new msg ~50ms wakeup latency Replicated Log Multi-DC replication for durability Any DC can serve reads Survives datacenter failure
Why this works: The server does zero per-user bookkeeping. It doesn't know who has read what · that's the client's job. When a client opens the app, it says "give me everything after msg_id X" and the server does a simple range scan on the log. This makes the system scale with readers (distributed) rather than writers (centralized).

Notification Tiering

Only push notifications fan-out · not message content. Tiered by priority to minimize push volume.

Notification Priority Tiers ?? Tier 1: @mention / Reply Push notification immediately via APNs/FCM Fan-out: YES · but only to mentioned users Typical: 1·10 users per @mention Payload: "User X mentioned you in Channel Y" ?? Tier 2: Regular Message Badge count increment only (silent push) Fan-out: MINIMAL · counter update via APNs Batched: aggregate unread count, send every 30s Payload: { badge: 42 } · no message content ?? Tier 3: Muted Channel Silent · no push, no badge Fan-out: NONE · zero server work Client discovers new msgs on next pull Most large channel members are muted (~80%) Notification Decision Flow New Message Has @mentions? Parse message text yes no Push to @mentioned users User muted? yes no No push (silent) Badge +1 (batched) Push Volume Reduction 1M members · 80% muted = 800K get NO push 200K unmuted · badge batching (30s) = ~7K pushes/msg Effective fan-out: 7K (not 1M) · 99.3% reduction
Key insight: Notification fan-out and message delivery are completely decoupled. Message content lives in the channel log (pulled by clients). Push notifications are lightweight signals ("you have N unread") that fan-out only to unmuted, offline users · and even those are batched to reduce volume by 99%+.

Resilience & Edge Cases

What breaks at 1M+ scale · and how the architecture handles it

Edge CaseProblemSolutionImpact if Unhandled
Viral message in 1M channel Sudden spike: 50K clients simultaneously pull after a popular post CDN-like caching of recent messages at edge. Hot channel log replicated to read replicas. Rate-limit pulls to 1/sec per client. Log server overload, cascading timeouts for all channels on same shard
Member join/leave storms 100K users join a channel in minutes (viral invite link). Membership list update becomes bottleneck. Membership is eventually consistent. New members start cursor at latest msg_id. Membership changes don't affect the log · only the "who can read" ACL check on pull. Membership write lock blocks message delivery if coupled
Cursor management for offline users User offline for 30 days. Channel has 3000 new messages. Full catch-up is expensive. Paginated fetch (100 msgs/page). Show latest first, load history on scroll-up. If gap > 1000 msgs, show "jump to latest" with unread count only. Client OOM on massive payload. Server timeout on huge range scan.
Thundering herd on popular channel Channel admin posts ? 50K long-poll connections wake up simultaneously ? 50K read requests hit log server at once Staggered wakeup: add random jitter (0·200ms) to long-poll response. Coalesce reads: batch wakeup notifications, serve from cache. Connection-level rate limiting. Log server CPU spike, increased latency for all channels on same partition
Multi-device sync User has phone + desktop + tablet. Each device has its own cursor. Must not show duplicate notifications. Server-side "last notified msg_id" per user (not per device). Devices sync cursors via shared user state. Push notification sent once, all devices update badge. Triple notifications per message. User disables notifications entirely.
Log compaction / retention Channel with 1M msgs over 5 years. Log grows unbounded. Old messages rarely accessed. Tiered storage: hot (last 7 days, SSD), warm (last 90 days, HDD), cold (archive, object storage). Cursor-based access works across tiers transparently. Storage costs grow linearly forever. SSD capacity exhaustion.
Design principle: Every edge case is handled by keeping the write path simple (single append) and pushing complexity to the read path (caching, pagination, rate limiting). The log is the source of truth · everything else is derived.

Tech Stack

Telegram's infrastructure choices optimized for large-scale message delivery

LayerTechnologyWhy This Choice
Protocol MTProto (custom) Binary protocol over TCP. Smaller payloads than JSON/HTTP. Built-in encryption, compression, and multiplexing. Supports long-poll natively.
Transport TCP + custom framing Persistent TCP connections with keep-alive. No HTTP overhead. Supports multiple concurrent requests on single connection (multiplexed).
Message Storage Distributed append-only log Custom storage engine optimized for sequential writes and range reads. Sharded by channel_id. Replicated across 3+ DCs.
Metadata Store Distributed KV store User profiles, channel membership, notification preferences. Eventually consistent with strong consistency for writes.
Media Delivery Global CDN + edge caching Photos/videos served from nearest edge. Message log contains only media_id reference. Decouples media from text delivery.
Push Notifications APNs / FCM + custom gateway Custom aggregation layer batches badge updates. Direct APNs/FCM for @mentions. Reduces API calls to Apple/Google by 99%.
Service Mesh Custom RPC framework Internal services communicate via lightweight binary RPC. Service discovery, load balancing, and circuit breaking built-in.
Datacenter Strategy Multi-DC active-active 5+ DCs globally. Users connect to nearest DC. Channel logs replicated synchronously within region, async across regions.
Why custom protocol? HTTP/JSON adds ~200 bytes overhead per request. At 700M users making frequent pulls, that's 140GB/day of wasted bandwidth just in headers. MTProto's binary framing reduces this to ~20 bytes overhead · a 10· improvement that matters at Telegram's scale.

Interview Cheat Sheet

5 key points to nail the Telegram large-group fan-out question

?? 5 Points That Win the Interview

1. Write-time fan-out is impossible at 1M+ · prove it with math: 1M members · 100 msgs/day = 100M writes/day per channel. Show the interviewer you understand why the naive approach fails before proposing the solution.
2. Read-time pull from append-only log · single write per message, clients pull from their cursor. Write cost is O(1) regardless of channel size. This is the core architectural insight.
3. Decouple notifications from content delivery · push notifications are lightweight signals (badge count), not message content. Only @mentions trigger immediate push. 80% of members are muted = zero push work.
4. Client-side cursor eliminates server state · server doesn't track per-user read position for 1M users. Client stores last_read_msg_id locally. On reconnect, client tells server "give me everything after X."
5. Thundering herd mitigation · when a message arrives, 50K long-poll connections wake up. Solve with: jittered wakeup, read replicas, edge caching of recent messages, and connection-level rate limiting.
Interviewer follow-ups to prepare for:
· "How do you handle ordering?" ? Monotonic msg_id per channel. Client detects gaps and re-fetches.
· "What about read receipts?" ? Not feasible at 1M. Show "X people saw this" as an approximate counter, not a full list.
· "How does this differ from Kafka?" ? Similar concept (log + consumer offset), but Kafka tracks offsets server-side. Telegram pushes offset tracking to the client to avoid 1M offset updates per message.
· "What if the log server goes down?" ? Multi-DC replication. Client reconnects to any DC and resumes from cursor. No data loss after ACK.