How does Google Docs handle 100 users editing simultaneously?

🎯 Design a collaborative editor resolving conflicting edits without locks, converging within 50ms

Concepts Involved

Conflict Resolution · Eventual Consistency · WebSocket · Pub/Sub

Problem Statement

How does a collaborative editor handle 100 users editing the same paragraph simultaneously, resolving conflicting character insertions without a central lock while maintaining convergence across all clients within 50ms?

Core challenge: Two users type at the same position at the same time. Without coordination, their documents diverge permanently. How do you guarantee all clients converge to the same final state without locking?

100+ concurrent editors · same document
<50ms convergence latency · local-first, sync async
1B+ documents · Google Workspace scale
Zero data loss · every keystroke preserved

Functional Requirements

What the system must do · core collaborative editing behaviours

Must Have (Core)

Multiple users edit same document simultaneously
All clients converge to identical state (consistency)
No locking · users never blocked from typing
Preserve user intent (insertions don't overwrite each other)
Real-time cursor/selection visibility of other users
Undo/redo works correctly per-user in collaborative context

Should Have

Offline editing with sync on reconnect
Version history with point-in-time restore
Comments and suggestions (non-destructive)
Rich text formatting (bold, headings, lists)
Presence indicators (who's viewing/editing)
Permission levels (view, comment, edit)

Non-Functional Requirements

Requirement	Target	Why
Latency	<50ms local apply, <200ms remote sync	Typing must feel instant; remote changes appear quickly
Consistency	Strong eventual convergence	All clients must reach same state, regardless of operation order
Availability	99.99% uptime	Users depend on Docs for daily work
Scalability	100+ concurrent editors per doc	Large team meetings, live editing sessions
Durability	Zero data loss	Every keystroke must be persisted
Bandwidth	<10KB/sec per user	Works on mobile/slow connections

High-Level Architecture

Client-side OT with server as single source of truth

OT vs CRDT · Core Algorithm Choice

Two fundamentally different approaches to conflict resolution

Operational Transformation (OT) · Google's Approach

How OT works: Each edit is an operation (insert 'H' at position 5, delete at position 3). When two ops are concurrent, a transform function adjusts their positions so both can apply correctly. The server maintains a linear revision history · all ops are serialized through one point.

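// OT Transform Example (document "abcdef" at revision 5, 0-based positions):
// User A: insert('X', pos=3)  at revision 5
// User B: insert('Y', pos=1)  at revision 5  (concurrent!)

// Server receives A first → applies at pos 3 → rev 6 ("abcXdef")
// Server receives B (based on rev 5, but now rev 6 exists)
// Transform: B's pos=1 < A's pos=3, so B stays at pos=1 → rev 7
// A client that already applied B locally shifts A to pos=4 instead
// Result: "aYbcXdef" · both insertions preserved, positions adjusted

// Pairwise transform: adjusts each op so it can apply after the other.
// Ties (same pos) go to A, so every replica must agree on which op is "A".
transform(insertA, insertB):
  if A.pos <= B.pos: B.pos += len(A.text)   // A lands first, B shifts right
  else:              A.pos += len(B.text)   // B lands first, A shifts right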
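
To make the transform concrete, here is a minimal runnable sketch for insert-vs-insert only. The `Insert` class, the `site` field, and the rule that the smaller site id wins a position tie are assumptions made for illustration, not Google's actual implementation. It checks that both application orders reach the same document.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Insert:
    pos: int   # 0-based index in the document the op was created against
    text: str
    site: str  # hypothetical client id, used only to break position ties


def apply(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(op: Insert, other: Insert) -> Insert:
    """Rewrite `op` so it can be applied after `other` has been applied."""
    other_first = other.pos < op.pos or (
        other.pos == op.pos and other.site < op.site
    )
    return replace(op, pos=op.pos + len(other.text)) if other_first else op


doc = "abcdef"
a = Insert(3, "X", "A")
b = Insert(1, "Y", "B")

# Replica 1 applies A then transformed B; replica 2 applies B then transformed A.
via_a = apply(apply(doc, a), transform(b, a))
via_b = apply(apply(doc, b), transform(a, b))
print(via_a, via_b)  # aYbcXdef aYbcXdef
```

Deletes and rich-text ops need their own transform pairs, which is where most of the real complexity sits.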

CRDT (Conflict-Free Replicated Data Types) · Figma's Approach

How CRDTs work: Each character has a unique ID (not position-based). Operations reference IDs, not indices. Merge is commutative, associative, and idempotent · delivery order doesn't matter, the result is always the same. No central server is needed for correctness.
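
As an illustration of ID-based merging, the sketch below is a deliberately small state-based sequence CRDT in the spirit of RGA. Everything in it (the `Replica` class, `(counter, site)` ids, the "insert after this id" model, tombstones for deletes) is an assumption chosen for brevity; production libraries such as Yjs and Automerge use far more compact encodings.

```python
from dataclasses import dataclass, field

ROOT = (0, "")  # sentinel id meaning "start of document"


@dataclass
class Replica:
    site: str
    clock: int = 0
    chars: dict = field(default_factory=dict)    # id -> (after_id, char)
    tombstones: set = field(default_factory=set)  # ids of deleted chars

    def insert(self, after_id, char):
        self.clock += 1
        new_id = (self.clock, self.site)  # globally unique
        self.chars[new_id] = (after_id, char)
        return new_id

    def delete(self, char_id):
        self.tombstones.add(char_id)  # keep the id so later merges still agree

    def merge(self, other):
        # Union of both states: commutative, associative, idempotent.
        self.chars.update(other.chars)
        self.tombstones |= other.tombstones
        self.clock = max(self.clock, other.clock)

    def text(self):
        children = {}
        for cid, (after, _) in self.chars.items():
            children.setdefault(after, []).append(cid)
        out, stack = [], [ROOT]
        while stack:
            node = stack.pop()
            if node != ROOT and node not in self.tombstones:
                out.append(self.chars[node][1])
            # Push ascending so the largest (newest) sibling is visited first.
            stack.extend(sorted(children.get(node, [])))
        return "".join(out)


a, b = Replica("A"), Replica("B")
h = a.insert(ROOT, "H")
i = a.insert(h, "i")
b.merge(a)

a.insert(i, "!")  # concurrent inserts after the same character
b.insert(i, "?")
a.merge(b)
b.merge(a)
print(a.text(), b.text())  # Hi?! Hi?!
```

Both replicas order the concurrent characters the same way because the order depends only on the ids, never on arrival order.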

Aspect	OT (Google Docs)	CRDT (Figma, Yjs, Automerge)
Server	Required · serializes ops	Optional · peer-to-peer possible
Complexity	Transform functions (O(n²) worst case)	Unique IDs per character (metadata overhead)
Offline	Limited (must sync through server)	Excellent · merge on reconnect
Memory	Low (ops are small)	Higher (tombstones, unique IDs)
Correctness	Proven for specific transform pairs	Mathematically guaranteed convergence
Latency	Server round-trip for confirmation	Instant local, async merge
Used by	Google Docs, Etherpad	Figma (CRDT-inspired), Apple Notes, Notion (partial); libraries such as Yjs and Automerge

Key Design Decisions

Critical choices that determine system behavior

Client-Side OT Pipeline

Local apply: user types → apply immediately (no network round-trip)
Buffer: store pending ops not yet ACKed by server
Send: send op with base revision to server
ACK: server confirms → remove from buffer
Remote op arrives: transform against pending buffer
Apply transformed: update local doc with remote changes
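
The six steps above can be sketched as a small client class. This is a simplified model under stated assumptions: inserts only, one op in flight at a time (the head of `pending`), and a plain list named `sent` standing in for the WebSocket. The class and method names are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str
    site: str  # hypothetical client id, used to break position ties


def transform(op, other):
    """Rewrite `op` so it applies after `other`."""
    other_first = other.pos < op.pos or (
        other.pos == op.pos and other.site < op.site
    )
    return replace(op, pos=op.pos + len(other.text)) if other_first else op


class Client:
    def __init__(self, site, doc="", revision=0):
        self.site, self.doc, self.revision = site, doc, revision
        self.pending = []  # local ops not yet ACKed; pending[0] is in flight
        self.sent = []     # stand-in for the socket: (op, base_revision)

    def _apply(self, op):
        self.doc = self.doc[:op.pos] + op.text + self.doc[op.pos:]

    def _flush(self):
        if self.pending:
            self.sent.append((self.pending[0], self.revision))

    def local_insert(self, pos, text):
        op = Insert(pos, text, self.site)
        self._apply(op)          # local apply, no round-trip
        self.pending.append(op)  # buffer until ACKed
        if len(self.pending) == 1:
            self._flush()        # send with base revision

    def on_ack(self):
        self.pending.pop(0)      # server accepted our in-flight op
        self.revision += 1
        self._flush()            # send the next buffered op, if any

    def on_remote(self, op):
        rebased = []
        for mine in self.pending:  # transform in both directions
            mine, op = transform(mine, op), transform(op, mine)
            rebased.append(mine)
        self.pending = rebased
        self._apply(op)          # apply the transformed remote op
        self.revision += 1


client = Client("A", doc="abcdef", revision=5)
client.local_insert(3, "X")              # shown instantly: "abcXdef"
client.on_remote(Insert(1, "Y", "B"))    # B's op was serialized first
client.on_ack()
print(client.doc, client.revision)       # aYbcXdef 7
```

Keeping a single op in flight means the server only ever has to transform a client's op against other clients' ops.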

Server-Side Processing

Receive op: client sends (op, baseRevision)
Transform: against all ops since baseRevision
Apply: to server document state
Assign revision: monotonic increment
Persist: append to revision log (Spanner)
Broadcast: send transformed op to all other clients
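
A minimal sketch of that server loop, again for inserts only. The in-memory `log` list stands in for the durable revision log, and the sketch assumes every op after `base_revision` came from a different client (one op in flight per client). `DocServer` and `receive` are hypothetical names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str
    site: str  # hypothetical client id, used to break position ties


def transform(op: Insert, other: Insert) -> Insert:
    """Rewrite `op` so it applies after the already-applied `other`."""
    other_first = other.pos < op.pos or (
        other.pos == op.pos and other.site < op.site
    )
    return replace(op, pos=op.pos + len(other.text)) if other_first else op


class DocServer:
    def __init__(self, doc: str = "") -> None:
        self.doc = doc
        self.log: list[Insert] = []  # log[i] is the op that produced revision i + 1

    @property
    def revision(self) -> int:
        return len(self.log)

    def receive(self, op: Insert, base_revision: int) -> tuple[int, Insert]:
        # Transform against every op the sender had not seen yet.
        for applied in self.log[base_revision:]:
            op = transform(op, applied)
        self.doc = self.doc[:op.pos] + op.text + self.doc[op.pos:]
        self.log.append(op)      # persist, which also assigns the revision
        return self.revision, op  # caller broadcasts this to other clients


server = DocServer("abcdef")
rev_a, sent_a = server.receive(Insert(3, "X", "A"), base_revision=0)
rev_b, sent_b = server.receive(Insert(1, "Y", "B"), base_revision=0)
rev_c, sent_c = server.receive(Insert(6, "Z", "C"), base_revision=0)
print(server.revision, server.doc)  # 3 aYbcXdefZ
```

Because one process per document runs this loop, ops get a total order without any locking on the client side.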

Cursor & Presence

Cursor positions sent as ephemeral (not persisted)
Throttled to 50ms intervals (avoid flooding)
Transformed alongside document ops
Color-coded per user (up to ~20 visible)
Selection ranges shown as highlights
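
Cursor handling can be sketched in two parts: shifting a caret index when a remote op lands, and throttling outgoing presence updates. The function names, the choice to move the caret when text is inserted exactly at its position, and the injectable `clock` parameter are all illustrative assumptions.

```python
import time


def shift_for_insert(cursor: int, pos: int, length: int) -> int:
    """Move the caret right if text was inserted at or before it."""
    return cursor + length if pos <= cursor else cursor


def shift_for_delete(cursor: int, pos: int, length: int) -> int:
    """Move the caret left, collapsing it to `pos` if it sat inside the deleted range."""
    if cursor <= pos:
        return cursor
    return max(pos, cursor - length)


class PresenceThrottle:
    """Forward at most one cursor update per interval (50 ms by default)."""

    def __init__(self, send, interval_s=0.05, clock=time.monotonic):
        self.send, self.interval_s, self.clock = send, interval_s, clock
        self.last_sent = None

    def update(self, cursor):
        now = self.clock()
        if self.last_sent is None or now - self.last_sent >= self.interval_s:
            self.send(cursor)  # ephemeral: never written to the revision log
            self.last_sent = now


now = [0.0]   # fake clock so the example is deterministic
sent = []
throttle = PresenceThrottle(sent.append, clock=lambda: now[0])
throttle.update(1)
now[0] = 0.02
throttle.update(2)   # dropped, only 20 ms since the last send
now[0] = 0.06
throttle.update(3)
print(sent)  # [1, 3]
```

A fuller version would also flush the last dropped position once the interval expires, so the final cursor location is never lost.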

Failure Handling

Disconnect: buffer ops locally, resync on reconnect
Server crash: replay from revision log
Conflict: OT converges automatically as long as the transform functions are correct (no manual merge)
Slow client: server compacts ops into snapshots
Version history: periodic snapshots + op replay
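
Crash recovery and point-in-time restore both reduce to "load the nearest snapshot, then replay the log". A minimal sketch, assuming the log holds already-transformed insert ops as `(pos, text)` tuples and snapshots are kept in a dict keyed by revision (both are hypothetical layouts):

```python
def recover(snapshots: dict, log: list, target_revision: int) -> str:
    """Rebuild the document at `target_revision`.

    snapshots: revision -> full document text at that revision
    log:       log[i] is the (pos, text) insert that produced revision i + 1
    """
    base = max(r for r in snapshots if r <= target_revision)
    doc = snapshots[base]
    for pos, text in log[base:target_revision]:
        doc = doc[:pos] + text + doc[pos:]
    return doc


log = [(0, "Hello"), (5, " world"), (5, ","), (12, "!")]
snapshots = {0: "", 2: "Hello world"}

print(recover(snapshots, log, 4))  # Hello, world!
print(recover(snapshots, log, 1))  # Hello
```

Snapshot frequency trades storage against replay time: the more often a snapshot is taken, the fewer ops a restore has to replay.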

Scaling & Production Considerations

Challenge	Solution	Detail
Hot documents	Dedicated server per doc	Shard by docId, pin to single server for serialization
100+ editors	Op batching + throttling	Batch rapid keystrokes into single op (debounce 50ms)
Large documents	Chunked loading	Load visible portion, lazy-load rest on scroll
Version history	Periodic snapshots	Snapshot every N ops, replay from nearest snapshot
Undo/Redo	Per-user inverse ops	Undo transforms against subsequent ops (complex!)
Rich text	Structured ops	Ops include formatting attributes, not just text
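
Two rows of the table lend themselves to a short sketch: batching rapid keystrokes and pinning a document to one server. Both functions are illustrative assumptions. `coalesce` merges runs of adjacent inserts that a caller would flush on a timer (for example every 50 ms), and `server_for` uses simple hash-modulo routing, which remaps documents whenever the server count changes; a production system would more likely use consistent hashing or a lookup service.

```python
import hashlib


def coalesce(ops: list) -> list:
    """Merge left-to-right typing runs: [(5,'h'), (6,'i')] becomes [(5,'hi')]."""
    batched = []
    for pos, text in ops:
        if batched and batched[-1][0] + len(batched[-1][1]) == pos:
            prev_pos, prev_text = batched[-1]
            batched[-1] = (prev_pos, prev_text + text)
        else:
            batched.append((pos, text))
    return batched


def server_for(doc_id: str, servers: list) -> str:
    """Pin every op for a document to the same server via a stable hash."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


keystrokes = [(5, "h"), (6, "i"), (7, "!"), (0, "A")]
print(coalesce(keystrokes))  # [(5, 'hi!'), (0, 'A')]

servers = ["collab-0", "collab-1", "collab-2"]  # hypothetical server names
print(server_for("doc-123", servers) == server_for("doc-123", servers))  # True
```

`hashlib` is used rather than the built-in `hash()` because the latter is randomized per process for strings, which would break routing across machines.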

Google's approach: OT with the server as serialization point. Each document has a single collaboration server (sharded by docId). The server maintains a linear revision history. Clients apply optimistically, then the server transforms and broadcasts. A Jupiter-style protocol (the client-server OT scheme from Xerox PARC's Jupiter system, adopted by Google Wave) handles the client-server transform.

Real-world numbers: Google Docs handles 1B+ documents, 100+ concurrent editors per doc, <50ms local latency, <200ms sync latency. Revision log stored in Spanner for strong consistency. Presence via Colossus (Google's distributed file system).

Common mistakes: Position-based ops without a transform · replicas diverge. Locking paragraphs · terrible UX. Last-write-wins · loses edits silently. Sending the full document on each edit · bandwidth explosion.

Interview Cheat Sheet

The 8 things to say for collaborative editing design

1. OT (Operational Transform) · transform concurrent ops against each other to preserve intent
2. Server as serialization point · single source of truth, assigns revision numbers
3. Optimistic local apply · user sees their edit instantly, sync happens async
4. Transform function · insert(pos=5) vs a concurrent one-character insert(pos=3) → shift the first to pos=6
5. CRDTs as alternative · no server needed (P2P), but larger metadata overhead
6. Cursor/selection sync · broadcast cursor positions via presence channel (ephemeral)
7. Undo = inverse operation · not "restore previous state" (would undo others' edits)
8. Revision log for history · every op stored, enables time-travel and "see changes"
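
Point 7 is easiest to see with a worked example. The sketch below is an assumption-laden simplification: it handles only undoing an insert, and only transforms the resulting delete against later inserts that fall outside the deleted range. All class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str


@dataclass(frozen=True)
class Delete:
    pos: int
    length: int


def apply(doc: str, op) -> str:
    if isinstance(op, Insert):
        return doc[:op.pos] + op.text + doc[op.pos:]
    return doc[:op.pos] + doc[op.pos + op.length:]


def invert(op: Insert) -> Delete:
    """The inverse of inserting text is deleting that same range."""
    return Delete(op.pos, len(op.text))


def transform_delete(d: Delete, ins: Insert) -> Delete:
    """Rewrite delete `d` so it applies after a later insert `ins`."""
    if ins.pos <= d.pos:
        return Delete(d.pos + len(ins.text), d.length)  # range moved right
    if ins.pos >= d.pos + d.length:
        return d                                        # insert is past the range
    raise NotImplementedError("insert inside the range: the delete must be split")


doc = "abcdef"
a = Insert(3, "X")      # user A types X
b = Insert(1, "Y")      # user B types Y afterwards
doc = apply(apply(doc, a), b)               # "aYbcXdef"

undo_a = transform_delete(invert(a), b)     # Delete(pos=4, length=1)
print(apply(doc, undo_a))                   # aYbcdef
```

Restoring A's previous state ("abcdef") would have silently removed B's "Y"; the transformed inverse removes only A's own "X".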
