System Design Case Study

How does Slack shard 10B+ messages across hundreds of shards with online resharding?

Design a database sharding layer: 10B+ messages, online resharding, 100× connection reduction

Problem Statement

How does a database sharding layer transparently shard 10B+ messages across hundreds of shards, supporting online resharding (split/merge without downtime) and connection pooling that reduces backend connections by 100×?

Core challenge: A single MySQL instance can't hold 10B messages, so you need hundreds of shards. But resharding (splitting a hot shard) traditionally requires downtime, and 10K app servers × 100 shards = 1M DB connections, which is infeasible. Vitess solves both.
10B+ messages stored (growing 10 TB/day)
100s of MySQL shards (transparent to app)
Zero-downtime resharding (online split/merge)
100× connection reduction (VTGate connection pooling)

Architecture: Vitess Sharding Layer

[Figure: Vitess sharding architecture. App layer: ~10K app servers using a standard MySQL driver with no shard awareness. Vitess layer: VTGate (stateless, horizontally scalable router that parses SQL, routes by vindex (hash of workspace_id), pools connections to ~100 backend connections per shard, and scatter-gathers cross-shard queries); VTTablet (one per shard: health checks and failover, replication management, query rewriting and throttling, promotes a replica on primary failure in ~5s); Topology (etcd: keyrange → node shard map, updated on reshard). Storage layer: 100+ MySQL shards, each with a primary and 2 replicas (example keyranges 00-40, 40-80, 80-FF), handling 115K writes/sec in total. Online resharding via VReplication: stream rows to target → binlog catch-up → lag < 1s → atomic cutover (~1s write pause) → done.]
| Component | Role | How |
|---|---|---|
| VTGate | Query router + connection pool | Parses SQL, routes to the correct shard by vindex (shard key). Pools connections: 10K apps → 1 VTGate pool → 100 shards |
| VTTablet | Shard-local agent | Sits in front of each MySQL. Handles replication, health checks, query rewriting |
| Topology (etcd) | Shard map + metadata | Stores which keyrange → which shard. Updated during resharding |
| VReplication | Online resharding engine | Streams rows from source to target shard. Catches up on the binlog. Atomic cutover. |
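The pooling role in the table can be illustrated with a minimal sketch, assuming a fixed-size pool per shard that many client requests borrow from in turn. The `ShardPool` class and the connection names are hypothetical; this is not how VTGate or VTTablet implement pooling internally, only the general idea of multiplexing many callers over a few backend connections.

```python
import queue
from contextlib import contextmanager

class ShardPool:
    """Hypothetical fixed-size pool of backend connections for one shard."""

    def __init__(self, shard: str, size: int):
        self._idle = queue.Queue()
        for i in range(size):
            # Strings stand in for real MySQL connections.
            self._idle.put(f"{shard}-conn-{i}")

    @contextmanager
    def borrow(self, timeout: float = 1.0):
        # Blocks while every connection is in use; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Return the connection so the next request can reuse it.
            self._idle.put(conn)
```

Any number of app requests can pass through `borrow()`, but the shard never sees more than `size` open connections, which is where the reduction comes from.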
Shard key: workspace_id. All messages for one workspace live on the same shard, which ensures channel queries never cross shards. The vindex maps workspace_id → keyrange → shard. A lookup vindex serves secondary access patterns.
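The workspace_id → keyrange → shard mapping can be sketched as follows. This is an illustration under assumptions: MD5 is only a stand-in for whatever hash the real vindex uses, and `SHARD_MAP` with its shard names is hypothetical, mirroring the three example keyranges (00-40, 40-80, 80-FF).

```python
import hashlib
from bisect import bisect_right

# Hypothetical shard map: (exclusive upper bound of the first keyspace-id byte, shard name).
SHARD_MAP = [(0x40, "shard-00-40"), (0x80, "shard-40-80"), (0x100, "shard-80-ff")]

def keyspace_id(workspace_id: int) -> bytes:
    """Hash the shard key to an 8-byte keyspace id (MD5 is a stand-in hash)."""
    return hashlib.md5(str(workspace_id).encode()).digest()[:8]

def shard_for_keyspace_id(ksid: bytes, shard_map=SHARD_MAP) -> str:
    """Find the keyrange whose [lower, upper) interval contains the first byte."""
    bounds = [upper for upper, _ in shard_map]
    return shard_map[bisect_right(bounds, ksid[0])][1]

def shard_for_workspace(workspace_id: int) -> str:
    """workspace_id -> keyspace id -> keyrange -> shard."""
    return shard_for_keyspace_id(keyspace_id(workspace_id))
```

Because the shard is derived only from workspace_id, every message in a workspace resolves to the same shard, and splitting a shard only requires editing the map, not rehashing keys.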
Online resharding:
1. Create target shards.
2. VReplication copies existing data.
3. Binlog streaming catches up on writes.
4. When lag < 1s, atomic cutover (freeze writes for ~1s, switch routing).
5. Old shard becomes read-only, then is decommissioned.
Zero downtime for reads, ~1s pause for writes.
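The resharding sequence can be simulated in memory to show why only the final cutover needs a write pause. This is a toy model, not VReplication itself: `Shard`, `replay`, `split_shard`, and the `live_writes` batches are all hypothetical names, and a Python list stands in for the binlog.

```python
from dataclasses import dataclass, field

@dataclass
class Shard:
    name: str
    rows: dict = field(default_factory=dict)
    read_only: bool = False

def replay(binlog, start, targets, pick_target):
    """Apply binlog entries from index `start` onward; return the new applied position."""
    for key, value in binlog[start:]:
        targets[pick_target(key)].rows[key] = value
    return len(binlog)

def split_shard(source, targets, pick_target, live_writes):
    """Split `source` into pre-created `targets` while writes keep arriving."""
    binlog, applied = [], 0
    # Copy phase: stream the existing rows to their target shards.
    for key, value in list(source.rows.items()):
        targets[pick_target(key)].rows[key] = value
    # Catch-up phase: replay what is logged so far while new writes still land on the source.
    for batch in live_writes:
        applied = replay(binlog, applied, targets, pick_target)
        for key, value in batch:
            source.rows[key] = value
            binlog.append((key, value))
    # Cutover: freeze writes, drain the remaining lag, then routing can switch to the targets.
    source.read_only = True
    replay(binlog, applied, targets, pick_target)
    return targets
```

The source keeps serving reads and writes during the copy and catch-up phases; the frozen window covers only the last, small slice of the log.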
Limitations:
Cross-shard joins: not supported (denormalize or scatter-gather).
Cross-shard transactions: 2PC available but expensive.
Hot shards: one large workspace can overwhelm a shard (solution: dedicated shard for enterprise customers).
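Scatter-gather, the fallback for queries that cannot be pinned to one shard, can be sketched as a parallel fan-out followed by a merge. The example assumes in-memory lists of row dictionaries with a `ts` field in place of real shards; `query_shard` and `scatter_gather` are hypothetical helpers, not Vitess APIs.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def query_shard(shard_rows, predicate):
    """Stand-in for running the same query against one shard."""
    return [row for row in shard_rows if predicate(row)]

def scatter_gather(shards, predicate, limit):
    """Fan the query out to every shard in parallel, then merge newest-first."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = pool.map(lambda rows: query_shard(rows, predicate), shards)
        merged = [row for part in partials for row in part]
    # Each shard returns its own matches; global ordering is only known after the merge.
    return heapq.nlargest(limit, merged, key=lambda row: row["ts"])
```

Latency is bounded by the slowest shard and every shard does work for every such query, which is why the text treats scatter-gather as a last resort compared with denormalizing.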
Real-world: Slack (Vitess for all message storage), YouTube (Vitess originated there, at Google, and was later open-sourced), GitHub (Vitess for MySQL sharding), Square (Vitess for payment data), PlanetScale (managed Vitess as a service).

Scale Estimation

| Step | Derivation | Result | Design Impact |
|---|---|---|---|
| 1 | Messages: 10B/day ÷ 86,400 | ~115K writes/sec | Single MySQL maxes out at ~10K writes/sec → need 12+ shards minimum |
| 2 | Storage growth: 10B × 1 KB/msg | ~10 TB/day | 7-day hot retention = 70 TB across shards |
| 3 | Shards needed: 115K writes ÷ 5K/shard (safe) | ~23 shards minimum | Use 100+ for headroom and per-workspace isolation |
| 4 | Connections without Vitess: 10K apps × 100 shards | 1M connections (impossible) | VTGate pools: 10K apps → 1 pool → 100 shards = manageable |
| 5 | Read queries: 10× write ratio | ~1.15M reads/sec | Read replicas per shard (2 replicas each) |
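The estimates above can be reproduced with a few lines of arithmetic. The inputs are the figures stated in this case study; the value of roughly 100 pooled backend connections per shard comes from the architecture diagram and is treated here as an assumption. Note that rounding up the unrounded write rate gives 24 shards at 5K writes/sec each, slightly above the table's ~23, which uses the rounded 115K figure.

```python
import math

SECONDS_PER_DAY = 86_400
messages_per_day = 10_000_000_000
bytes_per_message = 1_000          # 1 KB per message (decimal units)
single_mysql_max_wps = 10_000      # rough ceiling for one MySQL primary
safe_wps_per_shard = 5_000         # conservative per-shard target
app_servers = 10_000
shards = 100
pooled_conns_per_shard = 100       # assumption taken from the architecture diagram
read_write_ratio = 10

writes_per_sec = messages_per_day / SECONDS_PER_DAY
storage_tb_per_day = messages_per_day * bytes_per_message / 1e12
hot_storage_tb = 7 * storage_tb_per_day
min_shards_at_max = math.ceil(writes_per_sec / single_mysql_max_wps)
min_shards_safe = math.ceil(writes_per_sec / safe_wps_per_shard)
direct_connections = app_servers * shards
pooled_connections = shards * pooled_conns_per_shard
reads_per_sec = writes_per_sec * read_write_ratio

print(f"writes/sec          ~{writes_per_sec:,.0f}")
print(f"storage             {storage_tb_per_day:.0f} TB/day, {hot_storage_tb:.0f} TB hot")
print(f"shards (max/safe)   {min_shards_at_max} / {min_shards_safe}")
print(f"connections         {direct_connections:,} direct vs {pooled_connections:,} pooled")
print(f"reads/sec           ~{reads_per_sec:,.0f}")
```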

Resilience & Edge Cases

| Failure | Impact | Recovery |
|---|---|---|
| Shard primary failure | Writes blocked for that shard's workspaces | VTTablet promotes a replica within seconds. VTGate retries transparently. ~5s write unavailability. |
| Hot shard (large workspace) | One shard overwhelmed | Online resharding: split the hot shard into 2 via VReplication. Or: dedicated shard for enterprise customers. |
| Cross-shard query needed | Can't JOIN across shards | VTGate scatter-gather (slow). Better: denormalize data or use a separate search index (Elasticsearch). |
| VTGate failure | App can't reach any shard | Multiple VTGate instances behind a load balancer. Stateless: any VTGate can serve any query. |
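The "retries transparently" behavior during a primary failover can be approximated with a bounded retry loop. This is a generic sketch rather than VTGate's actual retry logic: `PrimaryUnavailable`, `write_with_retry`, and the delay and budget values are hypothetical, with the budget chosen to comfortably cover a ~5s promotion.

```python
import time

class PrimaryUnavailable(Exception):
    """Hypothetical error raised while a shard has no writable primary."""

def write_with_retry(do_write, budget_s=10.0, base_delay_s=0.1,
                     max_delay_s=1.0, sleep=time.sleep):
    """Retry a write with capped exponential backoff until the wait budget is spent."""
    waited, delay = 0.0, base_delay_s
    while True:
        try:
            return do_write()
        except PrimaryUnavailable:
            # Give up once the next wait would exceed the total budget.
            if waited + delay > budget_s:
                raise
            sleep(delay)
            waited += delay
            delay = min(delay * 2, max_delay_s)
```

Retrying blindly is only safe when the write is idempotent or the caller knows the failed attempt was never applied, so a real router has to distinguish those cases.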

Interview Cheat Sheet

The 6 things to say for database sharding

1. Shard key = workspace_id: all data for one tenant on the same shard (no cross-shard joins)
2. VTGate connection pooling: 10K apps → 1 pool → 100 shards (100× reduction)
3. Online resharding via VReplication: stream + catch up binlog + atomic cutover (~1s pause)
4. Vindex: maps shard key → keyrange → physical shard (like consistent hashing)
5. No cross-shard joins: denormalize or scatter-gather (design constraint)
6. Dedicated shards for large tenants: enterprise customers get their own shard to avoid noisy neighbors