How does a social platform invalidate cached objects across 1000+ servers within 1 second of a write, preventing stale reads while avoiding thundering herd on the backing store?
Core challenge: A user updates their profile photo. That photo URL is cached on 1000+ servers worldwide. You must invalidate all copies within 1 second without causing 1000 simultaneous DB reads (thundering herd).
1000+
cache servers
globally distributed
<1s
invalidation latency
write ? all caches cleared
10B+
cache reads/sec
99.9% hit rate target
0
thundering herd
single refill per key
Functional Requirements
What the system must do
Must Have
1. On write to DB, invalidate all cached copies within 1 second 2. Prevent thundering herd · only one request refills cache per key 3. Support graph-structured data (objects + associations) 4. Maintain 99.9%+ cache hit rate despite invalidations 5. Handle cross-region invalidation with bounded staleness 6. Support lease-based protection against stale sets
Out of Scope
? Full database design (focus on cache layer) ? Application-level business logic ? User authentication and authorization ? CDN for static assets (separate system)
Non-Functional Requirements
Quality constraints shaping architecture
Property
Target
Design Impact
Invalidation Latency
<1 second globally
Pub/sub invalidation bus (not polling). McRouter + mcrouter-based fanout.
Hit Rate
99.9%+
Tiered caching (L1 local + L2 regional). Warm cache on startup.
Throughput
10B+ reads/sec
Memcached clusters per region. Consistent hashing for key distribution.
Consistency
Read-after-write for writer
Writer reads from DB directly after write. Others get eventual consistency.
Availability
99.99%
Cache miss falls through to DB. Gutter pool absorbs failed nodes.
High-Level Architecture
Meta's TAO: a graph-aware distributed cache with invalidation
? TAO Architecture · Write Path & Invalidation
Key Design Decisions
The architectural choices that make this work at Meta's scale
Decision
Choice
Why
Alternative Rejected
Invalidation strategy
Delete on write (not update)
Avoids race conditions between concurrent writes
Write-through (complex ordering)
Thundering herd
Lease-based (token on first miss)
Only 1 request hits DB per key per invalidation
Mutex lock (doesn't scale across servers)
Propagation
Pub/sub invalidation bus
<1s fanout to all followers
Polling (too slow), TTL-only (too stale)
Topology
Leader-follower per region
Writes go to leader, reads from local follower
Flat (no write coordination)
Failure handling
Gutter pool (backup cache tier)
Failed node traffic goes to gutter, not DB
Direct DB fallback (overloads DB)
Data model
Graph-aware (objects + edges)
Social data is naturally a graph
Key-value only (loses relationship semantics)
Why DELETE not UPDATE: If two writes happen concurrently (W1 sets value=A, W2 sets value=B), update-based invalidation can leave cache with stale value if messages arrive out of order. DELETE is commutative · order doesn't matter, next read always gets fresh data from DB.
Lease mechanism: On cache miss, server issues a lease token (short-lived, ~10s). Only the lease holder can SET the value. If an invalidation arrives while lease is active, the lease is revoked · preventing stale data from being cached. This solves both thundering herd AND stale-set race conditions.
Tradeoffs:Eventual consistency · followers may serve stale data for up to 1 second after write. Complexity · leader/follower topology adds operational overhead. Cross-region latency · invalidation across regions takes longer (bounded by network RTT).
Real-world:Meta TAO · serves 10B+ reads/sec for social graph. Netflix EVCache · similar lease-based thundering herd protection. Twitter · Manhattan cache with pub/sub invalidation. Memcached at Facebook paper (2013) · foundational design for this pattern.
Interview Cheat Sheet
The 8 things to say for cache invalidation design
1.DELETE on write, not UPDATE · commutative, order-independent, prevents stale-set races 2.Lease mechanism · only lease holder can SET value, revoked on invalidation (prevents thundering herd + stale) 3.Leader/follower cache topology · writes go to leader, async replicate to followers (<1s) 4.Pub/Sub invalidation bus · write triggers invalidation event ? all cache servers subscribe 5.Thundering herd protection · on miss, only 1 request goes to DB (others wait for lease result) 6.Eventual consistency accepted · followers may serve stale for ~1s (acceptable for social data) 7.Version/generation numbers · detect stale writes (only accept SET if version matches) 8.Cross-region: async replication · invalidation propagates across regions via dedicated channel