How does Twitter deliver posts to 400M+ timelines in real-time?

🎯 Design a timeline delivery system: 400M users, celebrity fan-out, <5s delivery

Concepts Involved

Redis, Kafka, Pub/Sub, Sharding, Caching, Consistent Hashing

Problem Statement

How does a social platform deliver posts to 400M+ users' timelines in real-time, handling the asymmetry between users with few followers and celebrities with millions, while keeping timeline delivery under 5 seconds?

Core challenge: A celebrity with 100M followers posts a tweet. You cannot write to 100M timelines simultaneously: that is a fan-out explosion. But followers expect to see it within seconds. How do you balance write amplification against read latency?

Metric	Value	Note
Active users	400M+	timelines to serve
Max followers	100M	celebrity problem
Delivery target	<5s	post → visible in feed
Peak tweets/sec	500K+	during major events

Functional Requirements

Must Have

1. User posts a tweet → appears in all followers' timelines
2. Timeline is ranked (not purely chronological): relevance + recency
3. Real-time updates: new tweets appear without manual refresh
4. Handle celebrity asymmetry: users with up to 100M followers
5. Support infinite scroll with pagination
6. Engagement signals update in real-time (like, retweet, and reply counts)

Out of Scope

✗ Tweet composition and media upload
✗ Search and trending topics
✗ Direct messages
✗ Notifications
✗ Content moderation

Non-Functional Requirements

Property	Target	Design Impact
Delivery Latency	<5s tweet → visible in follower timeline	Fan-out workers must process within seconds, not minutes
Read Latency	<100ms timeline load	Pre-computed timeline in Redis (no DB query on read)
Throughput	500K tweets/sec peak (major events)	Kafka + parallel fan-out workers, auto-scale on lag
Availability	99.99%	Redis replicas, multi-AZ, graceful degradation (serve stale)
Consistency	Eventual: tweets may appear 1-5s late	Fan-out is async. Acceptable, since this is not a banking system.
Scalability	400M users, linear scale with user count	Shard timeline cache by user_id. Add Redis nodes as users grow.
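
The "shard timeline cache by user_id" row can be illustrated with a small consistent-hash ring, one of the concepts listed at the top. This is a minimal sketch under assumptions: the `HashRing` class, the node names, and the choice of MD5 with 100 virtual nodes are all hypothetical. Redis Cluster itself assigns keys to 16384 hash slots rather than a ring, so this applies to client-side sharding only.

```python
import bisect
import hashlib


class HashRing:
    """Illustrative consistent-hash ring mapping user_ids to cache nodes."""

    def __init__(self, nodes, vnodes=100):
        self._ring = []  # sorted list of (hash, node_name)
        for node in nodes:
            self.add_node(node, vnodes)

    @staticmethod
    def _hash(key):
        # First 8 bytes of MD5 as a 64-bit integer (hash choice is an assumption).
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def add_node(self, node, vnodes=100):
        # Each node gets many virtual points so load spreads evenly.
        for i in range(vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def node_for(self, user_id):
        # Walk clockwise to the first virtual point at or after the key's hash.
        h = self._hash(f"timeline:{user_id}")
        idx = bisect.bisect_right(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["redis-0", "redis-1", "redis-2"])  # hypothetical node names
print(ring.node_for(123))
```

The point of the ring is that adding a node only moves the keys that now land on the new node; the rest keep their existing cache.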

High-Level Architecture

Hybrid fan-out: write for normal users, read for celebrities

Why hybrid? Pure fan-out-on-write: a single celebrity tweet triggers 100M Redis writes, which means minutes of lag. Pure fan-out-on-read: every timeline load queries all followed users, which makes reads slow. Hybrid: 99% of tweets fan out on write (fast reads), while celebrity tweets are merged on read (no write explosion).
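
The routing decision on the write side can be sketched as follows. This is a simplified in-memory illustration, not a production design: `publish`, `followers`, `timelines`, and `celebrity_tweets` are hypothetical names, and plain dictionaries stand in for the social graph, Redis, and the tweet store.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 10_000  # follower cutoff used throughout this case study

followers = defaultdict(set)          # followee_id -> set of follower ids
timelines = defaultdict(list)         # follower_id -> [(timestamp, tweet_id)]
celebrity_tweets = defaultdict(list)  # author_id -> [(timestamp, tweet_id)]


def publish(author_id, tweet_id, timestamp):
    """Route a new tweet to the write path or the read path."""
    fans = followers[author_id]
    if len(fans) > CELEBRITY_THRESHOLD:
        # Celebrity: store once; followers pick it up when they load a timeline.
        celebrity_tweets[author_id].append((timestamp, tweet_id))
        return "read-path"
    # Normal account: one write per follower timeline.
    for fan in fans:
        timelines[fan].append((timestamp, tweet_id))
    return "write-path"


followers["alice"] = {"bob", "carol"}
print(publish("alice", "tweet_abc", 1704067200))
```

In a real deployment the loop over followers would be spread across fan-out workers consuming from a queue, and the threshold check would use a cached celebrity list rather than counting followers per tweet.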

Timeline cache: Redis sorted set per user. Score = timestamp. ZREVRANGE for latest N tweets. Trim to 800 entries (older tweets fetched from DB). Cache hit rate: ~99% for active users.
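
To make the sorted-set behaviour concrete, here is a small in-memory stand-in for the per-user timeline. It only mimics the semantics of ZADD, ZREVRANGE, and the trim step; the `TimelineCache` class is hypothetical, and unlike a real sorted set it does not de-duplicate members.

```python
import bisect

MAX_ENTRIES = 800  # trim size taken from the text


class TimelineCache:
    """In-memory stand-in for one Redis sorted set per user (illustrative)."""

    def __init__(self, max_entries=MAX_ENTRIES):
        self.max_entries = max_entries
        self._entries = {}  # user_id -> ascending list of (timestamp, tweet_id)

    def add(self, user_id, timestamp, tweet_id):
        entries = self._entries.setdefault(user_id, [])
        bisect.insort(entries, (timestamp, tweet_id))  # roughly ZADD
        del entries[:-self.max_entries]                # drop all but the newest max_entries

    def latest(self, user_id, n=20):
        entries = self._entries.get(user_id, [])
        # Newest first, roughly ZREVRANGE key 0 n-1.
        return [tweet_id for _, tweet_id in reversed(entries[-n:])]


cache = TimelineCache()
cache.add("user123", 1704067200, "tweet_abc")
print(cache.latest("user123"))
```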

Failure modes: Fan-out worker lag → tweets delayed for some followers (acceptable: eventual). Redis node failure → rebuild from DB (cold start ~30s). Celebrity tweet during event → read path handles gracefully (no write storm).

Real-world: Twitter has publicly described a hybrid approach along these lines; treat the ~10K-follower threshold as an illustrative cutoff. Instagram uses similar fan-out-on-write for its feed. Facebook uses primarily fan-out-on-read with aggressive caching (TAO).

Scale Estimation

Back-of-envelope math for timeline delivery

Given: 400M active users, 500K tweets/sec peak, avg 200 followers, top 1% have >10K followers

Step	Derivation	Result	Design Impact
1	Fan-out writes (normal): 500K tweets/sec × 200 followers × 99%	~99M Redis writes/sec	Redis Cluster with 100+ shards for timeline cache
2	Celebrity tweets skipped: 500K × 1%	~5K tweets/sec (read path)	Merged at read time; no write amplification
3	Timeline reads: 400M users × 10 opens/day ÷ 86,400 s/day	~46K timeline reads/sec	Redis cache hit rate ~99%, so most reads are served from cache
4	Timeline cache size: 400M users × 800 tweet_ids × 8 bytes	~2.5 TB Redis (raw IDs, before per-entry overhead)	Sorted sets, trim to 800 entries per user
5	Fan-out worker count: 99M writes/sec ÷ 100K writes/sec per worker	~990 fan-out workers	Stateless, horizontally scalable, Kafka-driven
6	Tweet storage: 500K/sec × 1KB × 86,400 s/day (peak rate sustained)	~43 TB/day	Sharded by tweet_id, replicated 3×
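
The six steps above can be reproduced with a few lines of arithmetic. This is only a check of the table's numbers under its own givens; the variable names are illustrative, and 1KB is taken as 1,000 bytes.

```python
SECONDS_PER_DAY = 86_400

active_users = 400_000_000
peak_tweets_per_sec = 500_000
avg_followers = 200
celebrity_share_pct = 1  # share of tweets routed to the read path

# Step 1: fan-out writes for the 99% of tweets on the write path
fanout_writes_per_sec = peak_tweets_per_sec * avg_followers * (100 - celebrity_share_pct) // 100
# Step 2: celebrity tweets merged at read time
celebrity_tweets_per_sec = peak_tweets_per_sec * celebrity_share_pct // 100
# Step 3: timeline reads, 10 opens per user per day
timeline_reads_per_sec = active_users * 10 / SECONDS_PER_DAY
# Step 4: raw cache size, 800 ids of 8 bytes per user
timeline_cache_tb = active_users * 800 * 8 / 1e12
# Step 5: workers at 100K writes/sec each
fanout_workers = fanout_writes_per_sec // 100_000
# Step 6: tweet storage per day at the sustained peak rate
tweet_storage_tb_per_day = peak_tweets_per_sec * 1_000 * SECONDS_PER_DAY / 1e12

print(f"{fanout_writes_per_sec:,} writes/sec, {fanout_workers} workers")
print(f"{timeline_reads_per_sec:,.0f} reads/sec, {timeline_cache_tb:.2f} TB cache")
print(f"{celebrity_tweets_per_sec:,} celebrity tweets/sec, {tweet_storage_tb_per_day:.1f} TB/day")
```

Note that step 6 assumes the peak rate holds for a full day, so it is an upper bound rather than a typical daily figure.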

Data Model

Timeline cache (Redis) + Tweet store (MySQL/Manhattan) + Social graph (who follows whom)

// --- Timeline Cache (Redis Sorted Set) ---
Key:   timeline:{user_id}
Value: Sorted Set, score=timestamp, member=tweet_id
Ops:   ZADD timeline:user123 1704067200 tweet_abc    // fan-out write
       ZREVRANGE timeline:user123 0 19               // get latest 20
       ZREMRANGEBYRANK timeline:user123 0 -801       // trim to 800

// --- Tweet Store (Sharded MySQL / Manhattan) ---
tweets {
  tweet_id:    BIGINT (Snowflake ID)    -- partition key
  user_id:     BIGINT
  text:        VARCHAR(280)
  media_urls:  JSON
  created_at:  TIMESTAMP
  reply_to:    BIGINT NULL
  retweet_of:  BIGINT NULL
  like_count:  INT (eventually consistent counter)
  rt_count:    INT
}

// --- Social Graph (who follows whom) ---
follows {
  follower_id:  BIGINT
  followee_id:  BIGINT
  created_at:   TIMESTAMP
}
-- Index: (followee_id) → get all followers (for fan-out)
-- Index: (follower_id) → get all following (for timeline merge)

// --- Celebrity List (cached) ---
celebrities: SET of user_ids with followers > 10K
-- Updated hourly, cached in memory on fan-out workers
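
The schema marks tweet_id as a Snowflake ID. As a rough illustration of why such IDs sort by creation time, the sketch below packs a millisecond timestamp, a machine id, and a sequence number into one 64-bit integer using the 41/10/12-bit layout of the open-source Snowflake project. The helper names are hypothetical, and the exact layout used by any production system is an assumption here.

```python
TWITTER_EPOCH_MS = 1288834974657  # custom epoch from the open-source Snowflake code

MACHINE_BITS = 10
SEQUENCE_BITS = 12


def make_snowflake(timestamp_ms, machine_id, sequence):
    """Pack timestamp, machine id, and per-millisecond sequence into one integer."""
    if not 0 <= machine_id < (1 << MACHINE_BITS):
        raise ValueError("machine_id out of range")
    if not 0 <= sequence < (1 << SEQUENCE_BITS):
        raise ValueError("sequence out of range")
    elapsed = timestamp_ms - TWITTER_EPOCH_MS
    return (elapsed << (MACHINE_BITS + SEQUENCE_BITS)) | (machine_id << SEQUENCE_BITS) | sequence


def snowflake_timestamp_ms(tweet_id):
    """Recover the creation time in milliseconds from an id."""
    return (tweet_id >> (MACHINE_BITS + SEQUENCE_BITS)) + TWITTER_EPOCH_MS


tweet_id = make_snowflake(1704067200000, machine_id=7, sequence=0)
print(tweet_id, snowflake_timestamp_ms(tweet_id))
```

Because the timestamp occupies the high bits, ordering by tweet_id approximates ordering by creation time, which is convenient when merging feeds.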

Resilience & Edge Cases

Failure	Impact	Recovery
Fan-out worker lag	Tweets delayed for some followers	Acceptable: eventual delivery. Auto-scale workers on Kafka lag.
Redis node failure	Timeline cache lost for shard	Rebuild from DB: query follows → fetch recent tweets → populate cache (~30s cold start)
Celebrity tweets during event	Could overwhelm read path	Pre-compute celebrity timelines, cache aggressively, CDN for trending tweets
New follow	Followee's older tweets missing from timeline	On follow of a normal account: backfill its last N tweets into the follower's timeline. Celebrity tweets need no backfill because they are merged at read time.
User unfollows	Stale tweets in timeline	Lazy removal: filter at read time. Async cleanup in background.
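
Two of the recovery strategies in the table, rebuilding a lost timeline and lazily filtering tweets after an unfollow, can be sketched as follows. The function names and the dictionary-based inputs are hypothetical stand-ins for the social graph and tweet store.

```python
def rebuild_timeline(user_id, following, recent_tweets_by_author, limit=800):
    """Cold-start rebuild: follows -> recent tweets -> newest first, capped at limit."""
    merged = []
    for author_id in following.get(user_id, ()):
        merged.extend(recent_tweets_by_author.get(author_id, ()))
    merged.sort(reverse=True)  # entries are (timestamp, tweet_id) tuples
    return merged[:limit]


def filter_unfollowed(entries, tweet_author, still_following):
    """Lazy removal: drop cached tweets whose author is no longer followed."""
    return [(ts, tid) for ts, tid in entries if tweet_author[tid] in still_following]


following = {"user123": {"alice", "bob"}}
recent = {"alice": [(30, "t3"), (10, "t1")], "bob": [(20, "t2")]}
timeline = rebuild_timeline("user123", following, recent)
print(timeline)
print(filter_unfollowed(timeline, {"t1": "alice", "t2": "bob", "t3": "alice"}, {"alice"}))
```

Filtering at read time keeps the unfollow action cheap; the background cleanup can remove the stale entries from the cache later.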

Tech Stack & Tradeoffs

Component	Technology	Why	Rejected
Timeline Cache	Redis Cluster (sorted sets)	Sub-ms reads, atomic ZADD, TTL, sharding	Memcached (no sorted sets), DynamoDB (higher latency)
Tweet Store	Manhattan (Twitter's KV) / MySQL	Sharded by tweet_id, high write throughput	Cassandra (less familiar), PostgreSQL (single-node limits)
Fan-out Queue	Kafka	Durable, replayable, handles 500K tweets/sec	RabbitMQ (no replay), SQS (standard queues: no ordering guarantee, no replay)
Social Graph	FlockDB (Twitter's graph DB)	Optimized for follower lookups, sharded	Neo4j (doesn't scale), MySQL (slow for graph traversal)
Ranking	ML model (TensorFlow Serving)	Personalized relevance scoring per user	Chronological only (lower engagement)
Real-time Push	WebSocket / Server-Sent Events	New tweets appear without refresh	Polling (wasteful, laggy)

Interview Cheat Sheet

The 8 things to say for timeline/feed design

1. Hybrid fan-out: write for normal users (<10K followers), read for celebrities (>10K)
2. Redis sorted set per user: score=timestamp, ZREVRANGE for latest, trim to 800
3. Fan-out workers consume from Kafka: tweet event → lookup followers → ZADD to each timeline
4. Celebrity merge at read time: query "my celebrity follows" → fetch their recent tweets → merge + rank
5. ML ranking: not purely chronological. Score by relevance, engagement prediction, recency
6. Cache hit rate ~99%: active users' timelines always warm. Cold start on first login after long absence.
7. Social graph sharded by user_id: follower lookup is the hot path during fan-out
8. Engagement counters eventually consistent: likes/RTs updated async, not blocking delivery
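
Point 4, the celebrity merge at read time, can be sketched with a k-way merge over lists that are already sorted newest first. This is a minimal illustration: `read_timeline` is a hypothetical helper, and it orders by recency only, leaving out the ML ranking step from point 5.

```python
import heapq
from itertools import islice


def read_timeline(cached, celebrity_feeds, n=20):
    """Merge the pre-computed timeline with followed celebrities' recent tweets.

    Every input is a list of (timestamp, tweet_id) tuples sorted newest first.
    """
    merged = heapq.merge(cached, *celebrity_feeds, reverse=True)
    return [tweet_id for _, tweet_id in islice(merged, n)]


cached = [(30, "friend_2"), (10, "friend_1")]
celebrity_feeds = [[(40, "celeb_2"), (20, "celeb_1")]]
print(read_timeline(cached, celebrity_feeds))
```

Because `heapq.merge` is lazy, only the first n entries are ever materialized, which matters when a user follows many celebrities.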
