How does Meta invalidate cache across 1000+ servers in 1 second?

? Design a cache invalidation system: 1000+ servers, <1s propagation, no thundering herd

Concepts Involved

Caching Strategies Redis CDN Pub/Sub Consistency

Problem Statement

How does a social platform invalidate cached objects across 1000+ servers within 1 second of a write, preventing stale reads while avoiding thundering herd on the backing store?

Core challenge: A user updates their profile photo. That photo URL is cached on 1000+ servers worldwide. You must invalidate all copies within 1 second without causing 1000 simultaneous DB reads (thundering herd).

1000+

cache servers

globally distributed

<1s

invalidation latency

write → all caches cleared

10B+

cache reads/sec

99.9% hit rate target

thundering herd

single refill per key

Functional Requirements

What the system must do

Must Have

1. On write to DB, invalidate all cached copies within 1 second
2. Prevent thundering herd · only one request refills cache per key
3. Support graph-structured data (objects + associations)
4. Maintain 99.9%+ cache hit rate despite invalidations
5. Handle cross-region invalidation with bounded staleness
6. Support lease-based protection against stale sets

Out of Scope

✗ Full database design (focus on cache layer)
✗ Application-level business logic
✗ User authentication and authorization
✗ CDN for static assets (separate system)

Non-Functional Requirements

Quality constraints shaping architecture

Property	Target	Design Impact
Invalidation Latency	<1 second globally	Pub/sub invalidation bus (not polling). McRouter + mcrouter-based fanout.
Hit Rate	99.9%+	Tiered caching (L1 local + L2 regional). Warm cache on startup.
Throughput	10B+ reads/sec	Memcached clusters per region. Consistent hashing for key distribution.
Consistency	Read-after-write for writer	Writer reads from DB directly after write. Others get eventual consistency.
Availability	99.99%	Cache miss falls through to DB. Gutter pool absorbs failed nodes.

High-Level Architecture

Meta's TAO: a graph-aware distributed cache with invalidation

? TAO Architecture · Write Path & Invalidation

Key Design Decisions

The architectural choices that make this work at Meta's scale

Decision	Choice	Why	Alternative Rejected
Invalidation strategy	Delete on write (not update)	Avoids race conditions between concurrent writes	Write-through (complex ordering)
Thundering herd	Lease-based (token on first miss)	Only 1 request hits DB per key per invalidation	Mutex lock (doesn't scale across servers)
Propagation	Pub/sub invalidation bus	<1s fanout to all followers	Polling (too slow), TTL-only (too stale)
Topology	Leader-follower per region	Writes go to leader, reads from local follower	Flat (no write coordination)
Failure handling	Gutter pool (backup cache tier)	Failed node traffic goes to gutter, not DB	Direct DB fallback (overloads DB)
Data model	Graph-aware (objects + edges)	Social data is naturally a graph	Key-value only (loses relationship semantics)

Why DELETE not UPDATE: If two writes happen concurrently (W1 sets value=A, W2 sets value=B), update-based invalidation can leave cache with stale value if messages arrive out of order. DELETE is commutative · order doesn't matter, next read always gets fresh data from DB.

Lease mechanism: On cache miss, server issues a lease token (short-lived, ~10s). Only the lease holder can SET the value. If an invalidation arrives while lease is active, the lease is revoked · preventing stale data from being cached. This solves both thundering herd AND stale-set race conditions.

Tradeoffs: Eventual consistency · followers may serve stale data for up to 1 second after write. Complexity · leader/follower topology adds operational overhead. Cross-region latency · invalidation across regions takes longer (bounded by network RTT).

Real-world: Meta TAO · serves 10B+ reads/sec for social graph. Netflix EVCache · similar lease-based thundering herd protection. Twitter · Manhattan cache with pub/sub invalidation. Memcached at Facebook paper (2013) · foundational design for this pattern.

Interview Cheat Sheet

The 8 things to say for cache invalidation design

1. DELETE on write, not UPDATE · commutative, order-independent, prevents stale-set races
2. Lease mechanism · only lease holder can SET value, revoked on invalidation (prevents thundering herd + stale)
3. Leader/follower cache topology · writes go to leader, async replicate to followers (<1s)
4. Pub/Sub invalidation bus · write triggers invalidation event → all cache servers subscribe
5. Thundering herd protection · on miss, only 1 request goes to DB (others wait for lease result)
6. Eventual consistency accepted · followers may serve stale for ~1s (acceptable for social data)
7. Version/generation numbers · detect stale writes (only accept SET if version matches)
8. Cross-region: async replication · invalidation propagates across regions via dedicated channel

System Design Case Study