How does a distributed key-value store achieve consensus across 5 nodes, handling leader election, log replication, and split-brain prevention requiring majority quorum (3 of 5) for all decisions?
Core challenge: 5 nodes must agree on the same sequence of operations. A node crashes. A network partition splits the cluster. How do you guarantee all surviving nodes have the same data, elect exactly one leader, and prevent split-brain?
Follower times out (150-300ms random) ? becomes candidate ? requests votes ? majority grants ? becomes leader
Exactly one leader per term. Random timeout prevents split vote.
Log Replication
Leader appends entry ? sends AppendEntries to all followers ? majority ACK ? entry committed
Committed entries never lost. All nodes converge to same log.
Safety
Candidate must have all committed entries to win election (log completeness)
New leader always has all committed data. No data loss on failover.
Heartbeat
Leader sends empty AppendEntries every 100ms
Followers know leader is alive. No unnecessary elections.
Membership Change
Joint consensus: old + new config both must agree during transition
Safe add/remove nodes without split-brain during reconfiguration.
Split-brain prevention: Majority quorum (3/5) means at most one partition can have a majority. The minority partition cannot elect a leader (can't get 3 votes). Stale leader in minority detects it lost majority ? steps down. Fencing tokens (monotonic term numbers) prevent stale leaders from making progress.
etcd specifics: Backbone of Kubernetes (stores all cluster state). Watch API · clients subscribe to key changes (efficient notification). Lease-based TTL · keys auto-expire (used for leader election, service discovery). Linearizable reads via ReadIndex (leader confirms it's still leader before serving read).
Limitations:Write throughput limited by leader (~10K writes/sec). Cross-region latency · consensus requires RTT to majority (100ms+ cross-region). Large values · etcd optimized for small KV (<1.5MB per value). Cluster size · 3 or 5 nodes (more = slower consensus).
Real-world:Kubernetes · all cluster state in etcd. CockroachDB · Raft per data range. TiKV · Raft for distributed KV. Consul · Raft for service catalog. Kafka KRaft · replacing ZooKeeper with Raft controller.
Interview Cheat Sheet
The 7 things to say for consensus/Raft design
1.Majority quorum (2f+1) · 5 nodes tolerates 2 failures. Only one partition can have majority. 2.Leader handles all writes · single point of serialization, no conflicts 3.Random election timeout (150-300ms) · prevents split vote, ensures one candidate wins 4.Log completeness · candidate must have all committed entries to win election (no data loss) 5.Committed = majority ACK · entry is durable once 3/5 nodes have it 6.Heartbeats every 100ms · followers know leader is alive, no unnecessary elections 7.Fencing tokens (term numbers) · stale leader can't make progress (monotonically increasing term)