How does Slack shard 10B+ messages with Vitess?

➔? Design a database sharding layer: 10B+ messages, online resharding, 100· connection reduction

Concepts Involved

Sharding SQL Indexing Replication Partitioning

Problem Statement

How does a database sharding layer transparently shard 10B+ messages across hundreds of shards, supporting online resharding (split/merge without downtime) and connection pooling that reduces backend connections by 100·?

Core challenge: Single MySQL can't hold 10B messages. You need hundreds of shards. But resharding (splitting a hot shard) traditionally requires downtime. And 10K app servers · 100 shards = 1M DB connections (impossible). Vitess solves both.

10B+

messages stored

growing 10TB/day

100s

MySQL shards

transparent to app

Zero

downtime resharding

online split/merge

100·

connection reduction

VTGate connection pooling

Architecture · Vitess Sharding Layer

Component	Role	How
VTGate	Query router + connection pool	Parses SQL, routes to correct shard by vindex (shard key). Pools connections: 10K apps → 1 VTGate pool → 100 shards
VTTablet	Shard-local agent	Sits in front of each MySQL. Handles replication, health checks, query rewriting
Topology (etcd)	Shard map + metadata	Stores which keyrange → which shard. Updated during resharding
VReplication	Online resharding engine	Streams rows from source to target shard. Catches up binlog. Atomic cutover.

Shard key: workspace_id · all messages for one workspace on same shard. Ensures channel queries never cross shards. Vindex maps workspace_id → keyrange → shard. Lookup vindex for secondary access patterns.

Online resharding: → Create target shards → VReplication copies existing data → Binlog streaming catches up writes → When lag < 1s, atomic cutover (freeze writes for ~1s, switch routing) → Old shard becomes read-only, then decommissioned. Zero downtime for reads, ~1s pause for writes.

Limitations: Cross-shard joins · not supported (denormalize or scatter-gather). Cross-shard transactions · 2PC available but expensive. Hot shards · one large workspace can overwhelm a shard (solution: dedicated shard for enterprise customers).

Real-world: Slack · Vitess for all message storage. YouTube · Vitess originated at Google (open-sourced). GitHub · Vitess for MySQL sharding. Square · Vitess for payment data. PlanetScale · managed Vitess as a service.

Scale Estimation

Step	Derivation	Result	Design Impact
1	Messages: 10B/day · 86400	~115K writes/sec	Single MySQL maxes at ~10K writes/sec → need 12+ shards minimum
2	Storage growth: 10B · 1KB/msg	~10 TB/day	7-day hot retention = 70TB across shards
3	Shards needed: 115K writes · 5K/shard (safe)	~23 shards minimum	Use 100+ for headroom and per-workspace isolation
4	Connections without Vitess: 10K apps · 100 shards	1M connections (impossible)	VTGate pools: 10K apps → 1 pool → 100 shards = manageable
5	Read queries: 10· write ratio	~1.15M reads/sec	Read replicas per shard (2 replicas each)

Resilience & Edge Cases

Failure	Impact	Recovery
Shard primary failure	Writes blocked for that shard's workspaces	VTTablet promotes replica within seconds. VTGate retries transparently. ~5s write unavailability.
Hot shard (large workspace)	One shard overwhelmed	Online resharding: split hot shard into 2 via VReplication. Or: dedicated shard for enterprise customers.
Cross-shard query needed	Can't JOIN across shards	VTGate scatter-gather (slow). Better: denormalize data or use separate search index (Elasticsearch).
VTGate failure	App can't reach any shard	Multiple VTGate instances behind load balancer. Stateless · any VTGate can serve any query.

Interview Cheat Sheet

The 6 things to say for database sharding

1. Shard key = workspace_id · all data for one tenant on same shard (no cross-shard joins)
2. VTGate connection pooling · 10K apps → 1 pool → 100 shards (100· reduction)
3. Online resharding via VReplication · stream + catch up binlog + atomic cutover (~1s pause)
4. Vindex · maps shard key → keyrange → physical shard (like consistent hashing)
5. No cross-shard joins · denormalize or scatter-gather (design constraint)
6. Dedicated shards for large tenants · enterprise customers get their own shard to avoid noisy-neighbor

System Design Case Study

Problem Statement

Architecture · Vitess Sharding Layer

Scale Estimation

Resilience & Edge Cases

Interview Cheat Sheet