System Design Case Study

How does Google Docs handle 100 users editing the same paragraph simultaneously?

Design a collaborative editor that resolves conflicting edits without locks and converges within 50ms.

Problem Statement

How does a collaborative editor handle 100 users editing the same paragraph simultaneously, resolving conflicting character insertions without a central lock while maintaining convergence across all clients within 50ms?

Core challenge: Two users type at the same position at the same time. Without coordination, their documents diverge permanently. How do you guarantee all clients converge to the same final state without locking?
  • 100+ concurrent editors (same document)
  • <50ms convergence latency (local-first, sync async)
  • 1B+ documents (Google Workspace scale)
  • 0 data loss (every keystroke preserved)

Functional Requirements

What the system must do · core collaborative editing behaviours

Must Have (Core)

  • Multiple users edit same document simultaneously
  • All clients converge to identical state (consistency)
  • No locking · users never blocked from typing
  • Preserve user intent (insertions don't overwrite each other)
  • Real-time cursor/selection visibility of other users
  • Undo/redo works correctly per-user in collaborative context

Should Have

  • Offline editing with sync on reconnect
  • Version history with point-in-time restore
  • Comments and suggestions (non-destructive)
  • Rich text formatting (bold, headings, lists)
  • Presence indicators (who's viewing/editing)
  • Permission levels (view, comment, edit)

Non-Functional Requirements

| Requirement | Target | Why |
|---|---|---|
| Latency | <50ms local apply, <200ms remote sync | Typing must feel instant; remote changes appear quickly |
| Consistency | Strong eventual convergence | All clients must reach the same state, regardless of operation order |
| Availability | 99.99% uptime | Users depend on Docs for daily work |
| Scalability | 100+ concurrent editors per doc | Large team meetings, live editing sessions |
| Durability | Zero data loss | Every keystroke must be persisted |
| Bandwidth | <10KB/sec per user | Works on mobile/slow connections |

High-Level Architecture

Client-side OT with server as single source of truth

Google Docs · End-to-End Collaborative Editing Architecture

CLIENT LAYER · Optimistic local apply (instant feedback, no waiting for server)

Client A (Browser)
1. User types → create operation (insert 'H' at pos 5)
2. Apply locally immediately (no wait)
3. Add to pending ops buffer
4. Send to server via WebSocket
OT engine: transforms incoming ops against pending ops

Client B (Browser)
1. User types → create operation (insert 'X' at pos 5)
2. Apply locally immediately
3. Add to pending ops buffer
4. Send to server via WebSocket
OT engine: transforms incoming ops against pending ops

Clients C...N
  • 100+ concurrent editors
  • Same architecture per client
  • Cursor positions shared

Each client sends its op (base_rev=N) to the server over a WebSocket.

SERVER LAYER · Single source of truth (serialization point per document)

Collaboration Server (per docId): one server instance per document, sharded by docId hash
1. Receive op from Client A (base_rev=5, insert 'H' at pos 5)
2. Transform against ops since base_rev (if there are concurrent ops)
3. Append to revision log → assign rev=6
4. ACK to sender (your op is rev=6)
5. Broadcast transformed op to all other clients

Presence Service
  • Cursor positions (ephemeral)
  • Who's viewing which section
  • Broadcast every 100ms

Doc Router / LB
  • hash(docId) → server
  • Sticky sessions per doc
  • Failover → new server

STORAGE LAYER · Durable persistence + history

Revision Log (Spanner): append-only, immutable
  • rev 1: insert 'Hello' at 0
  • rev 2: insert ' World' at 5
  • rev 3: delete at pos 3, len 2

Document Snapshot (Bigtable / Cloud Storage)
  • Materialized state @ rev N
  • Periodic checkpoint (every 100 revs)
  • Speeds up doc open (no full replay)

Media / Blob (GCS / S3)
  • Images, embedded files
  • Referenced by ops
  • CDN for delivery

Access Control
  • Per-doc permissions: viewer / editor / owner
  • Checked on connect
  • Sharing links

OT Transform · How Conflicts Are Resolved

Client A: insert('H', pos=5) at rev=5
Client B: insert('X', pos=5) at rev=5
Conflict: both edit the same position.
Server receives A first → A gets rev=6. Transform B: pos 5 → pos 6 (shift right).
Result: "HelloHXWorld". Both insertions are preserved, with order determined by server arrival.
All clients converge to the same final state regardless of network delays; this is guaranteed by the OT transform properties.
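
The Doc Router step, hash(docId) → server, can be sketched as below. This is a minimal illustration under stated assumptions: the function name `route`, the server names, and the choice of SHA-256 with modulo sharding are all hypothetical and not taken from Google's implementation.

```python
import hashlib

# Sketch only: names are illustrative, not from any real routing library.
def route(doc_id: str, servers: list) -> str:
    """Map a document ID to exactly one collaboration server.

    A stable hash (not Python's per-process salted hash()) keeps the mapping
    identical across router instances, so every editor of a document lands
    on the same serialization point.
    """
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[bucket]

servers = ["collab-0", "collab-1", "collab-2"]   # hypothetical server names
assigned = route("doc-123", servers)
```

Plain modulo sharding remaps most documents when the server list changes; consistent hashing is a common alternative if failover should move only the affected documents.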

OT vs CRDT · Core Algorithm Choice

Two fundamentally different approaches to conflict resolution

Operational Transformation (OT) · Google's Approach
How OT works: Each edit is an operation (insert 'H' at position 5, delete at position 3). When two ops conflict, a transform function adjusts their positions so both can apply correctly. The server maintains a linear revision history: all ops are serialized through one point.
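# OT transform example, starting from "abcdef" at revision 5:
#   User A: insert('X', pos=3) at revision 5
#   User B: insert('Y', pos=1) at revision 5  (concurrent)
#
# Server receives A first -> applies at pos 3 -> rev 6 ("abcXdef")
# Server receives B (based on rev 5, but rev 6 now exists)
# Transform: B's pos=1 < A's pos=3, so B stays at pos=1 -> rev 7
# Result: "aYbcXdef". Both insertions are preserved. A client that already
# applied B locally shifts A from pos 3 to pos 4 to reach the same text.

def transform(a, b):
    # a, b: concurrent inserts as dicts {"pos": int, "text": str}; a was received first.
    # Returns (a', b'): a' is applied after b, b' is applied after a. a wins ties.
    if a["pos"] <= b["pos"]:
        return a, {**b, "pos": b["pos"] + len(a["text"])}
    return {**a, "pos": a["pos"] + len(b["text"])}, b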
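
To see why an insert/insert transform leads to convergence, the sketch below applies two concurrent inserts in both orders and compares the results. It is a minimal illustration, not Google's code: `Insert`, `apply_insert`, and `transform_insert` are hypothetical names, and breaking ties in favour of the op the server saw first is an assumption of this sketch.

```python
from dataclasses import dataclass, replace

# Sketch only: names are illustrative, not from any real OT library.
@dataclass(frozen=True)
class Insert:
    pos: int
    text: str

def apply_insert(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]

def transform_insert(op: Insert, other: Insert, other_wins_tie: bool) -> Insert:
    """Rewrite `op` so it can be applied after `other` has already been applied."""
    if other.pos < op.pos or (other.pos == op.pos and other_wins_tie):
        return replace(op, pos=op.pos + len(other.text))
    return op

def converge(doc: str, first: Insert, second: Insert):
    """Return the text reached on the server and on the client that sent `second`."""
    # Server order: `first`, then `second` transformed against it.
    server = apply_insert(apply_insert(doc, first),
                          transform_insert(second, first, other_wins_tie=True))
    # That client already applied `second`, then receives `first`.
    client = apply_insert(apply_insert(doc, second),
                          transform_insert(first, second, other_wins_tie=False))
    return server, client

server, client_b = converge("abcdef", Insert(3, "X"), Insert(1, "Y"))
```

Both paths end at "aYbcXdef". The same check with two inserts at one position, `converge("HelloWorld", Insert(5, "H"), Insert(5, "X"))`, gives "HelloHXWorld" on both sides.
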
CRDT (Conflict-free Replicated Data Types) · Figma's Approach
How CRDTs work: Each character has a unique ID (not position-based). Operations reference IDs, not indices. Merge is commutative and associative, so the order in which operations arrive doesn't matter and the result is always the same. No central server is needed for correctness.
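
The ID-based idea can be sketched with a small sequence CRDT in the style of RGA. This is an illustrative sketch, not the algorithm Figma, Yjs, or Automerge uses: the `Replica` class and its methods are hypothetical, IDs are assumed to be (Lamport counter, site) pairs, and ops are assumed to arrive after the ops they depend on (causal delivery).

```python
from dataclasses import dataclass

# Sketch only: a simplified RGA-style sequence, names are illustrative.
@dataclass
class Elem:
    id: tuple              # (lamport_counter, site): unique and totally ordered
    char: str
    deleted: bool = False  # tombstone: deleted characters stay in the list

class Replica:
    def __init__(self, site: str):
        self.site, self.clock, self.elems = site, 0, []

    def _index(self, elem_id):
        return next(i for i, e in enumerate(self.elems) if e.id == elem_id)

    def local_insert(self, after_id, char):
        """Insert `char` after the element with `after_id` (None = document start)."""
        self.clock += 1
        op = ("ins", (self.clock, self.site), after_id, char)
        self.apply(op)
        return op              # the op to send to other replicas

    def local_delete(self, elem_id):
        op = ("del", elem_id)
        self.apply(op)
        return op

    def apply(self, op):
        if op[0] == "del":
            self.elems[self._index(op[1])].deleted = True
            return
        _, new_id, after_id, char = op
        if any(e.id == new_id for e in self.elems):
            return             # duplicate delivery is a no-op
        self.clock = max(self.clock, new_id[0])
        i = 0 if after_id is None else self._index(after_id) + 1
        # Skip concurrent siblings (and their descendants) that carry larger IDs.
        while i < len(self.elems) and self.elems[i].id > new_id:
            i += 1
        self.elems.insert(i, Elem(new_id, char))

    def text(self) -> str:
        return "".join(e.char for e in self.elems if not e.deleted)

a, b = Replica("A"), Replica("B")
op1 = a.local_insert(None, "a")
op2 = a.local_insert(op1[1], "b")
b.apply(op1); b.apply(op2)
op_x = a.local_insert(op1[1], "X")   # concurrent: both insert after "a"
op_y = b.local_insert(op1[1], "Y")
a.apply(op_y); b.apply(op_x)         # exchanged in opposite orders
```

Both replicas end with the same text without any server deciding the order; the tie between "X" and "Y" is settled by comparing IDs. The tombstones and per-character IDs are the metadata overhead the comparison below refers to.
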
| | OT (Google Docs) | CRDT (Figma, Yjs, Automerge) |
|---|---|---|
| Server | Required; serializes ops | Optional; peer-to-peer possible |
| Complexity | Transform functions (O(n²) worst case) | Unique IDs per character (metadata overhead) |
| Offline | Limited (must sync through server) | Excellent; merge on reconnect |
| Memory | Low (ops are small) | Higher (tombstones, unique IDs) |
| Correctness | Proven for specific transform pairs | Mathematically guaranteed convergence |
| Latency | Server round-trip for confirmation | Instant local, async merge |
| Used by | Google Docs, Etherpad | Figma, Apple Notes, Notion (partial), Yjs |

Key Design Decisions

Critical choices that determine system behavior

Client-Side OT Pipeline

  • Local apply: user types → apply immediately (no network round-trip)
  • Buffer: store pending ops not yet ACKed by server
  • Send: send op with base revision to server
  • ACK: server confirms → remove from buffer
  • Remote op arrives: transform against pending buffer
  • Apply transformed: update local doc with remote changes
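
The pipeline above can be sketched as a small client class. This is a minimal illustration that handles inserts only and omits the network send; the names `Client`, `Insert`, and `transform_pair` are hypothetical, and letting the server-ordered op win ties is an assumption of this sketch.

```python
from dataclasses import dataclass, replace

# Sketch only: insert-only client, names are illustrative.
@dataclass(frozen=True)
class Insert:
    pos: int
    text: str

def apply_insert(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]

def transform_pair(remote: Insert, pending: Insert):
    """Return (remote', pending'). The remote op is already ordered by the
    server, so it wins ties and the pending op shifts right."""
    if remote.pos <= pending.pos:
        return remote, replace(pending, pos=pending.pos + len(remote.text))
    return replace(remote, pos=remote.pos + len(pending.text)), pending

class Client:
    def __init__(self, doc: str = "", rev: int = 0):
        self.doc, self.rev, self.pending = doc, rev, []

    def local_insert(self, pos: int, text: str):
        op = Insert(pos, text)
        self.doc = apply_insert(self.doc, op)   # local apply, no waiting
        self.pending.append(op)                 # buffer until ACKed
        return op, self.rev                     # payload to send: (op, base revision)

    def on_ack(self, rev: int):
        self.pending.pop(0)                     # oldest pending op is confirmed
        self.rev = rev

    def on_remote(self, op: Insert, rev: int):
        rebased = []
        for p in self.pending:                  # transform against the pending buffer
            op, p = transform_pair(op, p)
            rebased.append(p)
        self.pending = rebased
        self.doc = apply_insert(self.doc, op)   # apply the transformed remote op
        self.rev = rev

client_b = Client("HelloWorld", rev=5)
client_b.local_insert(5, "X")                   # shows "HelloXWorld" at once
client_b.on_remote(Insert(5, "H"), rev=6)       # A's op was serialized first
```

After the remote op arrives, client B shows "HelloHXWorld" and its pending op has been rebased to position 6, which matches what the server will compute when B's op reaches it.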

Server-Side Processing

  • Receive op: client sends (op, baseRevision)
  • Transform: against all ops since baseRevision
  • Apply: to server document state
  • Assign revision: monotonic increment
  • Persist: append to revision log (Spanner)
  • Broadcast: send transformed op to all other clients
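
The server-side steps can be sketched as follows. This is an insert-only illustration, not Google's server: `DocServer` and `transform_against` are hypothetical names, the in-memory list stands in for the revision log, and the sketch assumes each client waits for its ACK before sending the next op, so an op is never transformed against the sender's own earlier ops.

```python
from dataclasses import dataclass, replace

# Sketch only: insert-only server, names are illustrative.
@dataclass(frozen=True)
class Insert:
    pos: int
    text: str

def transform_against(op: Insert, committed: Insert) -> Insert:
    """Shift `op` past an already-committed insert; the committed op wins ties."""
    if committed.pos <= op.pos:
        return replace(op, pos=op.pos + len(committed.text))
    return op

class DocServer:
    def __init__(self, doc: str = "", rev: int = 0):
        self.doc = doc
        self.snapshot_rev = rev   # revision of the starting state
        self.log = []             # log[i] produced revision snapshot_rev + i + 1

    @property
    def rev(self) -> int:
        return self.snapshot_rev + len(self.log)

    def receive(self, op: Insert, base_rev: int):
        # Transform against every op committed since the client's base revision.
        for committed in self.log[base_rev - self.snapshot_rev:]:
            op = transform_against(op, committed)
        self.doc = self.doc[:op.pos] + op.text + self.doc[op.pos:]  # apply
        self.log.append(op)       # persist (in-memory stand-in for the log)
        return self.rev, op       # ACK the rev to the sender, broadcast op to others

server = DocServer("HelloWorld", rev=5)
ack_a = server.receive(Insert(5, "H"), base_rev=5)   # A arrives first
ack_b = server.receive(Insert(5, "X"), base_rev=5)   # B is transformed against A
```

A's op is committed unchanged as rev 6, B's op is shifted to position 6 and committed as rev 7, and the server document becomes "HelloHXWorld".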

Cursor & Presence

  • Cursor positions sent as ephemeral (not persisted)
  • Throttled to 50ms intervals (avoid flooding)
  • Transformed alongside document ops
  • Color-coded per user (up to ~20 visible)
  • Selection ranges shown as highlights
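
Transforming cursors alongside document ops can be sketched as below. The names `Op`, `transform_cursor`, and `transform_selection` are hypothetical, and leaving the cursor in place when text is inserted exactly at its position is a design choice of this sketch, not a documented Google Docs behaviour.

```python
from typing import NamedTuple

# Sketch only: names are illustrative.
class Op(NamedTuple):
    kind: str     # "insert" or "delete"
    pos: int      # index where the op starts
    length: int   # number of characters inserted or deleted

def transform_cursor(cursor: int, op: Op) -> int:
    """Move a cursor index so it points at the same text after `op` is applied."""
    if op.kind == "insert":
        # Text inserted strictly before the cursor pushes it right.
        return cursor + op.length if op.pos < cursor else cursor
    end = op.pos + op.length
    if cursor <= op.pos:
        return cursor                # deletion is entirely after the cursor
    if cursor >= end:
        return cursor - op.length    # deletion is entirely before the cursor
    return op.pos                    # cursor was inside the deleted range

def transform_selection(selection: tuple, op: Op) -> tuple:
    """A selection is a (start, end) pair; transform each end independently."""
    start, end = selection
    return transform_cursor(start, op), transform_cursor(end, op)
```

For example, a cursor at index 6 moves to 7 when a remote user inserts one character at index 5, and a selection (2, 8) shrinks to (2, 6) when two characters starting at index 3 are deleted.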

Failure Handling

  • Disconnect: buffer ops locally, resync on reconnect
  • Server crash: replay from revision log
  • Conflict: OT guarantees convergence (no manual merge)
  • Slow client: server compacts ops into snapshots
  • Version history: periodic snapshots + op replay
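
Crash recovery and version history both reduce to "load the nearest snapshot, then replay the log". The sketch below illustrates that with in-memory stand-ins; `apply_op`, `restore`, and the tuple encoding of ops are hypothetical, and the sample log reuses the three revisions from the architecture overview.

```python
# Sketch only: in-memory stand-ins for the snapshot store and revision log.
def apply_op(doc: str, op: tuple) -> str:
    kind = op[0]
    if kind == "insert":
        _, pos, text = op
        return doc[:pos] + text + doc[pos:]
    if kind == "delete":
        _, pos, length = op
        return doc[:pos] + doc[pos + length:]
    raise ValueError(f"unknown op kind: {kind}")

def restore(snapshots: dict, log: list, target_rev: int) -> str:
    """Rebuild the document at `target_rev`.

    snapshots maps revision -> materialized text (must include revision 0);
    log[i] is the op that produced revision i + 1.
    """
    base = max(rev for rev in snapshots if rev <= target_rev)
    doc = snapshots[base]
    for op in log[base:target_rev]:   # replay only the ops after the snapshot
        doc = apply_op(doc, op)
    return doc

log = [
    ("insert", 0, "Hello"),    # rev 1
    ("insert", 5, " World"),   # rev 2
    ("delete", 3, 2),          # rev 3
]
snapshots = {0: "", 2: "Hello World"}
latest = restore(snapshots, log, target_rev=3)
```

Restoring rev 3 starts from the rev 2 snapshot and replays one op, giving "Hel World"; restoring rev 1 falls back to the empty rev 0 snapshot and replays one op, giving "Hello".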

Scaling & Production Considerations

| Challenge | Solution | Detail |
|---|---|---|
| Hot documents | Dedicated server per doc | Shard by docId, pin to single server for serialization |
| 100+ editors | Op batching + throttling | Batch rapid keystrokes into single op (debounce 50ms) |
| Large documents | Chunked loading | Load visible portion, lazy-load rest on scroll |
| Version history | Periodic snapshots | Snapshot every N ops, replay from nearest snapshot |
| Undo/Redo | Per-user inverse ops | Undo transforms against subsequent ops (complex!) |
| Rich text | Structured ops | Ops include formatting attributes, not just text |
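
The op batching row can be illustrated with a small coalescing function. This is a sketch under stated assumptions: `coalesce` is a hypothetical name, ops are (pos, text) inserts from one user in typing order, and the 50ms debounce timer that decides when to flush a batch is left to the caller.

```python
# Sketch only: coalesces one user's consecutive inserts, names are illustrative.
def coalesce(ops: list) -> list:
    """Merge runs of inserts where each one continues right after the previous.

    Typing "H", "i", "!" at positions 5, 6, 7 is equivalent to a single
    insert of "Hi!" at position 5, so only one op needs to be sent.
    """
    batched = []
    for pos, text in ops:
        if batched and pos == batched[-1][0] + len(batched[-1][1]):
            last_pos, last_text = batched[-1]
            batched[-1] = (last_pos, last_text + text)   # extend the current run
        else:
            batched.append((pos, text))                  # cursor jumped: new op
    return batched

keystrokes = [(5, "H"), (6, "i"), (7, "!"), (0, ">")]
batch = coalesce(keystrokes)
```

Four keystrokes become two ops, `(5, "Hi!")` and `(0, ">")`, which cuts both message count and the number of transforms the server performs.
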
Google's approach: OT with the server as the serialization point. Each document has a single collaboration server (sharded by docId) that maintains a linear revision history. Clients apply edits optimistically, and the server transforms and broadcasts them. The client-server transform follows the Jupiter protocol, a client-server OT scheme that originated at Xerox PARC and that Google's OT builds on.
Real-world numbers: Google Docs handles 1B+ documents, 100+ concurrent editors per doc, <50ms local latency, and <200ms sync latency. The revision log is stored in Spanner for strong consistency. Presence is ephemeral: it is handled by the presence service and is not persisted.
Common mistakes: Position-based ops without transform (divergence guaranteed). Locking paragraphs (terrible UX). Last-write-wins (loses edits silently). Sending the full document on each edit (bandwidth explosion).

Interview Cheat Sheet

The 8 things to say for collaborative editing design

1. OT (Operational Transform) · transform concurrent ops against each other to preserve intent
2. Server as serialization point · single source of truth, assigns revision numbers
3. Optimistic local apply · user sees their edit instantly, sync happens async
4. Transform function · insert(pos=5) vs insert(pos=3) → shift the first to pos=6 (for a one-character insert)
5. CRDTs as alternative · no server needed (P2P), but larger metadata overhead
6. Cursor/selection sync · broadcast cursor positions via presence channel (ephemeral)
7. Undo = inverse operation · not "restore previous state" (would undo others' edits)
8. Revision log for history · every op stored, enables time-travel and "see changes"
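
Point 7 can be made concrete with a small sketch: undo builds the inverse of the user's own op and transforms it against every op applied since, so other users' edits survive. The names `Insert`, `Delete`, `invert`, `transform_delete`, and `undo` are hypothetical, and the sketch only covers undoing an insert when later ops are inserts outside the undone text.

```python
from dataclasses import dataclass, replace

# Sketch only: undo of an insert against later inserts, names are illustrative.
@dataclass(frozen=True)
class Insert:
    pos: int
    text: str

@dataclass(frozen=True)
class Delete:
    pos: int
    length: int

def invert(op: Insert) -> Delete:
    """The inverse of inserting text is deleting that same range."""
    return Delete(op.pos, len(op.text))

def transform_delete(d: Delete, later: Insert) -> Delete:
    """Shift the inverse op past an insert that was applied after the original."""
    if later.pos <= d.pos:
        return replace(d, pos=d.pos + len(later.text))
    if later.pos >= d.pos + d.length:
        return d
    raise NotImplementedError("insert landed inside the text being undone")

def undo(doc: str, op: Insert, later_ops: list) -> str:
    d = invert(op)
    for later in later_ops:
        d = transform_delete(d, later)
    return doc[:d.pos] + doc[d.pos + d.length:]

# B inserted "X" at 5, then A inserted ">> " at 0: the doc is ">> HelloXWorld".
result = undo(">> HelloXWorld", Insert(5, "X"), [Insert(0, ">> ")])
```

The undo removes only B's "X" and leaves A's prefix in place, giving ">> HelloWorld". Restoring the pre-edit snapshot "HelloWorld" instead would have silently discarded A's edit.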