System Design Case Study

How does Stripe process millions of charges without double-charging?

Design an idempotent payment system: millions of charges, zero double-charge, two-phase state machine

Problem Statement

How does a payment platform process millions of charges without double-charging, ensuring idempotent request handling, two-phase state transitions (pending → captured → settled), and at-least-once delivery with server-side deduplication?

Core challenge: The network is unreliable. A client sends a charge request and the server processes it, but the response is lost, so the client retries. Without idempotency, the customer is charged twice. At 10M charges/day, even a 0.01% duplicate rate is about 1,000 double charges per day.
Millions of charges / day: across 3M+ merchants
Zero double charges: idempotency guarantee
2-phase state machine: pending → captured → settled
99.999% availability: payments can't go down

Functional Requirements

Must Have

1. Accept charge request with idempotency key (client-generated UUID)
2. Deduplicate retries: same key returns same result without re-processing
3. Two-phase capture: authorize (hold funds) → capture (charge) → settle (transfer)
4. Atomic state transitions: no partial states visible
5. Support refunds as compensating transactions
6. Audit trail: every state change logged immutably

Out of Scope

- Fraud detection and risk scoring
- Multi-currency and FX
- Merchant onboarding and KYC
- Subscription billing and invoicing
- PCI-DSS card tokenization

Non-Functional Requirements

| Property | Target | Design Impact |
|---|---|---|
| Correctness | Zero double-charges | Idempotency key + atomic state machine + DB constraints |
| Availability | 99.999% (~5 min downtime/year) | Multi-region active-active, no single point of failure |
| Latency | <2s end-to-end charge | Most time is card network RTT (~1s). Internal processing <100ms. |
| Durability | Zero transaction loss | Synchronous replication, WAL, append-only ledger |
| Auditability | Every state change logged immutably | Double-entry ledger, event sourcing for state transitions |
| Compliance | PCI-DSS Level 1 | Card data tokenized, encrypted at rest, access logged |

High-Level Architecture

Idempotency key → dedup check → state machine → double-entry ledger

Figure: Stripe Idempotent Payments, End-to-End Architecture

REQUEST LAYER (Idempotency Check)
- Client: POST /charges with header Idempotency-Key: uuid
- API Gateway: auth, rate limit, route to payment service
- Idempotency Check: Key EXISTS → return cached response (no reprocess). Key NEW → INSERT key + lock → proceed to processing.
- INSERT ... ON CONFLICT DO NOTHING (atomic). request_hash check: different body + same key → 422. TTL: 24h.
- Stored: {key → request_hash + response_code + response_body + created_at + expires_at}
- Idempotency Key Store: Redis + PostgreSQL, ~10M active keys (24h TTL), unique index on key, hourly cleanup cron, fits in memory (~10M × 1KB), multi-region replicated

PROCESSING LAYER (Payment State Machine)
- PENDING → (authorize) → AUTHORIZED → (capture) → CAPTURED → (settle) → SETTLED. FAILED on timeout/decline → void.
- Timeout: PENDING > 30min → auto-void. Each transition is an atomic DB transaction (forward-only).
- Compensating actions: refund creates reverse ledger entries. Async reconciliation for network timeouts.
- Card Network (external): Visa / Mastercard / Amex, authorize → capture flow. External and unreliable (~1s RTT). Timeout → store PENDING; async reconciliation resolves. Duplicate webhook → state machine ignores it (idempotent).

LEDGER LAYER (Double-Entry Accounting & Delivery)
- Double-Entry Ledger: DEBIT customer_account $100.00, CREDIT merchant_account $97.00, CREDIT platform_fee $3.00. SUM(debits) = SUM(credits) always.
- Webhook Delivery: notify merchant of state changes, ~30M webhooks/day (3 per charge), exponential backoff retry, DLQ after 5 failures, at-least-once.
- Reconciliation Batch: daily comparison of our ledger vs card network settlements. Detects missing captures and orphan auths. Auto-resolves PENDING > 24h. Auditors verify: total debits = total credits.

Idempotency in Action: Why Retries Are Safe
1. Request: POST /charges {key: "abc-123", amount: $100} → server processes → charges customer → stores response
2. Response LOST (network timeout): client doesn't know if charge succeeded or failed
3. Retry: POST /charges {key: "abc-123", amount: $100} → key exists → return CACHED {id: "ch_xyz", status: "captured"}
4. Customer charged exactly once. Retry is safe. Response is identical. Zero double-charges.
Idempotency key design: Client generates a UUID before the first attempt. Server stores {key → request_hash + response + created_at}. On retry: if the key exists AND the request matches → return the stored response. If the key exists but the request differs → return 422 (misuse). Key expires after 24h.
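The lookup rules above can be sketched in a few lines. This is a minimal, single-threaded illustration that uses an in-memory dict in place of the Redis/PostgreSQL store; the class name `IdempotencyStore`, the `handle` method, and the `process` callback are hypothetical names chosen for the example. A production version would claim the key atomically (for instance with `INSERT ... ON CONFLICT DO NOTHING`) before processing, which this sketch does not attempt.

```python
import hashlib
import json
import time
from dataclasses import dataclass

TTL_SECONDS = 24 * 60 * 60  # 24h key lifetime, as stated in the text


@dataclass
class StoredResult:
    request_hash: str
    response_code: int
    response_body: dict
    expires_at: float


class IdempotencyStore:
    """In-memory stand-in for the key store (illustrative, not concurrency-safe)."""

    def __init__(self):
        self._keys = {}

    @staticmethod
    def _hash(body):
        # Canonical JSON so that dict key order does not change the hash.
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def handle(self, key, body, process, now=None):
        """Return (status_code, response_body); process(body) runs at most once per live key."""
        now = time.time() if now is None else now
        request_hash = self._hash(body)

        stored = self._keys.get(key)
        if stored is not None and stored.expires_at <= now:
            del self._keys[key]  # key expired: treat the request as new
            stored = None

        if stored is not None:
            if stored.request_hash != request_hash:
                # Same key, different body: reject instead of guessing intent.
                return 422, {"error": "idempotency key reused with a different request"}
            return stored.response_code, stored.response_body  # cached replay

        code, response = process(body)
        self._keys[key] = StoredResult(request_hash, code, response, now + TTL_SECONDS)
        return code, response
```

Calling `handle` twice with the same key and body invokes `process` once and returns the same tuple both times; changing the body under the same key yields 422 without invoking `process`.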
State machine guarantees: Each transition is atomic (one DB transaction). Transitions are forward-only: a charge can't go from CAPTURED back to PENDING. Timeout handlers: PENDING > 30min → auto-void. Compensating actions: a refund creates reverse ledger entries.
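A forward-only state machine can be expressed as a table of allowed transitions. The sketch below is one possible encoding, not the platform's actual rules: the exact edge set (for example, allowing a refund from either CAPTURED or SETTLED, or a void from AUTHORIZED) is an assumption, and `transition` and `InvalidTransition` are hypothetical names. It only validates the move; in the real system the status update and ledger writes would share one DB transaction.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    AUTHORIZED = "authorized"
    CAPTURED = "captured"
    SETTLED = "settled"
    REFUNDED = "refunded"
    FAILED = "failed"


# Allowed forward edges (assumed for illustration). Anything absent is rejected.
ALLOWED = {
    Status.PENDING: {Status.AUTHORIZED, Status.FAILED},
    Status.AUTHORIZED: {Status.CAPTURED, Status.FAILED},
    Status.CAPTURED: {Status.SETTLED, Status.REFUNDED},
    Status.SETTLED: {Status.REFUNDED},
    Status.REFUNDED: set(),
    Status.FAILED: set(),
}


class InvalidTransition(Exception):
    pass


def transition(current, target):
    """Return the new status, or raise if the move is not a forward edge."""
    if current == target:
        # Duplicate event (e.g. a repeated webhook): treat as a no-op.
        return current
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.value} -> {target.value}")
    return target
```

Treating a repeated target as a no-op is what makes a duplicate "settled" webhook harmless, while a backward move such as CAPTURED to PENDING raises.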
Failure modes: Network timeout to card network → store as PENDING; async reconciliation resolves it. DB crash mid-transaction → idempotency key not committed → retry is safe. Duplicate webhook from card network → the idempotent state machine ignores duplicate transitions.
Real-world: Stripe accepts an Idempotency-Key header on mutating API requests. PayPal uses a PayPal-Request-Id header for deduplication. Square takes an idempotency_key with a 24h window. Adyen offers a reference field for merchant-side dedup. Major payment processors commonly implement some form of this pattern.

Scale Estimation

Back-of-envelope math for a payment platform

Given: Millions of charges/day (10M assumed below) · 3M+ merchants · 99.999% uptime · Zero double-charges

| Step | Derivation | Result | Design Impact |
|---|---|---|---|
| 1 | Charges/sec: 10M charges/day ÷ 86,400 s/day | ~115 charges/sec avg | Not high throughput; correctness matters more than speed |
| 2 | Peak: 115 × 10 (Black Friday) | ~1,150 charges/sec peak | Must handle a 10× burst without dropping transactions |
| 3 | Idempotency keys stored: 10M/day × 24h TTL | ~10M active keys | Redis or DB index; fits in memory easily |
| 4 | Ledger entries: 10M charges × 3 entries each | ~30M ledger rows/day | Append-only, partitioned by merchant_id |
| 5 | Webhook deliveries: 10M charges × 3 events each | ~30M webhooks/day | Async delivery with retry queue (exponential backoff) |
| 6 | Uptime: 99.999% = 5.26 min downtime/year | Multi-region active-active | Can't rely on a single region; payments must always work |
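The arithmetic in the table can be reproduced directly. The sketch below assumes the same inputs as the table (10M charges/day, a 10× peak factor, 3 ledger entries and 3 webhooks per charge) plus the ~1KB per key figure from the architecture diagram; the function name `estimate` is hypothetical.

```python
def estimate(charges_per_day=10_000_000, peak_factor=10,
             entries_per_charge=3, events_per_charge=3,
             bytes_per_key=1_000, availability=0.99999):
    """Back-of-envelope numbers for the payment platform."""
    avg_cps = charges_per_day / 86_400            # seconds per day
    minutes_per_year = 365 * 24 * 60
    return {
        "avg_charges_per_sec": avg_cps,
        "peak_charges_per_sec": avg_cps * peak_factor,
        # One key per charge, each living 24h, so about one day's worth is active.
        "active_keys": charges_per_day,
        "key_store_gb": charges_per_day * bytes_per_key / 1e9,
        "ledger_rows_per_day": charges_per_day * entries_per_charge,
        "webhooks_per_day": charges_per_day * events_per_charge,
        "downtime_min_per_year": (1 - availability) * minutes_per_year,
    }
```

The unrounded average is about 115.7 charges/sec, so the peak comes out nearer 1,157 than the table's rounded 1,150; the difference is immaterial at this level of estimation.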

Data Model

Double-entry ledger + state machine + idempotency store

State diagram: PENDING → (authorize) → AUTHORIZED → (capture) → CAPTURED → (settle) → SETTLED → (refund) → REFUNDED. FAILED / VOIDED on timeout/decline.
Forward-only: PENDING → AUTHORIZED → CAPTURED → SETTLED (never backward). Each transition is an atomic DB transaction.
// --- Charges Table ---
charges {
  id:              UUID (ch_xxx)        -- primary key
  idempotency_key: VARCHAR(255)         -- unique index, client-provided
  merchant_id:     UUID                 -- partition key for sharding
  amount:          BIGINT               -- in smallest currency unit (cents)
  currency:        CHAR(3)              -- ISO 4217 (usd, eur, gbp)
  status:          ENUM(pending, authorized, captured, settled, failed, refunded)
  payment_method:  UUID → payment_methods table
  created_at:      TIMESTAMP
  captured_at:     TIMESTAMP NULL
  settled_at:      TIMESTAMP NULL
  metadata:        JSONB                -- merchant-provided key-value
}

// --- Idempotency Keys Table ---
idempotency_keys {
  key:             VARCHAR(255)         -- primary key (client UUID)
  merchant_id:     UUID
  request_hash:    CHAR(64)            -- SHA-256 of request body
  response_code:   INT
  response_body:   JSONB               -- cached response
  created_at:      TIMESTAMP
  expires_at:      TIMESTAMP           -- created_at + 24h
}
-- Index: (key) UNIQUE
-- Cleanup: DELETE WHERE expires_at < NOW() (cron every hour)

// --- Ledger Entries (Double-Entry) ---
ledger_entries {
  id:              BIGSERIAL
  charge_id:       UUID → charges
  account_id:      UUID                -- debit or credit account
  entry_type:      ENUM(debit, credit)
  amount:          BIGINT
  balance_after:   BIGINT              -- running balance
  created_at:      TIMESTAMP
}
-- Constraint: SUM(debits) = SUM(credits) per charge_id (always balanced)
Double-entry guarantee: Every charge creates at least two ledger entries whose debits and credits balance. Debit customer $100, credit merchant $97, credit platform $3. If any entry fails, the whole transaction rolls back. Auditors can verify that total debits = total credits across the entire system.
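The balancing rule is easy to check in code. The following sketch builds the three entries from the $100 example and verifies that debits equal credits; the 3% fee (`fee_bps=300`), the account names, and the helper names are assumptions made for illustration, and amounts are integers in cents as in the charges table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LedgerEntry:
    charge_id: str
    account_id: str
    entry_type: str  # "debit" or "credit"
    amount: int      # smallest currency unit (cents)


def entries_for_charge(charge_id, amount, fee_bps=300):
    """Debit the customer; credit the merchant and the platform fee account."""
    fee = amount * fee_bps // 10_000  # integer math, no floating-point cents
    return [
        LedgerEntry(charge_id, "customer_account", "debit", amount),
        LedgerEntry(charge_id, "merchant_account", "credit", amount - fee),
        LedgerEntry(charge_id, "platform_fee", "credit", fee),
    ]


def is_balanced(entries):
    """True when the sum of debits equals the sum of credits."""
    debits = sum(e.amount for e in entries if e.entry_type == "debit")
    credits = sum(e.amount for e in entries if e.entry_type == "credit")
    return debits == credits


def refund_entries(entries):
    """Compensating entries: same accounts and amounts, debit and credit swapped."""
    flip = {"debit": "credit", "credit": "debit"}
    return [LedgerEntry(e.charge_id, e.account_id, flip[e.entry_type], e.amount)
            for e in entries]
```

Because the merchant credit is computed as `amount - fee`, the entries balance for any amount, and a refund appends reversed rows rather than editing or deleting the originals, which keeps the ledger append-only.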

Resilience & Edge Cases

Payment systems must handle every failure mode gracefully: money can never be lost or duplicated

| Failure | Impact | Recovery |
|---|---|---|
| Network timeout to card network | Don't know if charge succeeded | Store as PENDING. Async reconciliation query to card network. Resolve within minutes. |
| DB crash mid-transaction | Idempotency key not committed | Client retries → key doesn't exist → safe to reprocess. No double-charge. |
| Client retries after success | Could double-charge | Idempotency key exists → return cached response. Identical result. |
| Card network sends duplicate webhook | Could process settlement twice | State machine: CAPTURED → SETTLED is idempotent. Second attempt is a no-op. |
| Merchant sends different body with same key | Ambiguous intent | Compare request_hash. If different → return 422 (idempotency key reuse error). |
| Region failure | All charges in that region fail | Active-active: route to other region. Reconcile ledgers after recovery. |
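The first row, resolving charges stuck in PENDING, might look like the sweep below. It is a sketch under stated assumptions: charges are plain dicts, `lookup_network_status` is a hypothetical callable standing in for a status query to the card network, and the 30-minute auto-void threshold is taken from the state machine description.

```python
from datetime import datetime, timedelta

PENDING_TIMEOUT = timedelta(minutes=30)  # auto-void threshold from the text


def reconcile_pending(charges, lookup_network_status, now: datetime):
    """Resolve PENDING charges in place and return the ids that changed.

    charges: list of dicts with "id", "status", and "created_at" (datetime).
    lookup_network_status: hypothetical callable returning "authorized",
    "declined", or None when the network has no answer yet.
    """
    changed = []
    for charge in charges:
        if charge["status"] != "pending":
            continue  # only PENDING charges are ambiguous
        outcome = lookup_network_status(charge["id"])
        if outcome == "authorized":
            charge["status"] = "authorized"
        elif outcome == "declined":
            charge["status"] = "failed"
        elif now - charge["created_at"] > PENDING_TIMEOUT:
            charge["status"] = "failed"  # auto-void: still unknown after 30 min
        else:
            continue  # still unknown and still fresh: retry on the next sweep
        changed.append(charge["id"])
    return changed
```

The network's answer takes priority over the timeout, so a charge that was in fact authorized is not voided just because the original response was lost.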

Interview Cheat Sheet

The 8 things to say for payment system design

1. Idempotency key: client-generated UUID, server deduplicates retries, 24h TTL
2. Two-phase state machine: PENDING → AUTHORIZED → CAPTURED → SETTLED (forward-only)
3. Double-entry ledger: every charge = balanced debit + credit entries (auditable)
4. Atomic state transitions: DB transaction wraps state change + ledger entries
5. Timeout handling: PENDING > 30min → auto-void. Never leave money in limbo.
6. Reconciliation: daily batch compares our ledger vs card network settlements
7. Webhooks with retry: notify merchant of state changes, exponential backoff, DLQ
8. Multi-region active-active: payments can't go down. Route to healthy region on failure.
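Item 7 is the one point in the list not yet shown in code, so here is a minimal sketch of at-least-once webhook delivery with exponential backoff and a dead-letter queue. The limit of 5 attempts comes from the architecture diagram; the base delay, the delay cap, the use of full jitter, and the names `deliver`, `send`, and `dead_letter` are assumptions for illustration.

```python
import random
import time

MAX_ATTEMPTS = 5            # DLQ after 5 failures, per the diagram
BASE_DELAY_SECONDS = 1.0    # assumed starting delay
MAX_DELAY_SECONDS = 3600.0  # assumed cap on any single delay


def backoff_delay(attempt, jitter=random.random):
    """Delay after failed attempt number `attempt` (1-based), with full jitter."""
    ceiling = min(MAX_DELAY_SECONDS, BASE_DELAY_SECONDS * 2 ** (attempt - 1))
    return ceiling * jitter()


def deliver(event, send, dead_letter, sleep=time.sleep):
    """Try send(event) up to MAX_ATTEMPTS times; return True on success.

    send: hypothetical callable returning True when the merchant acknowledged.
    dead_letter: list standing in for the DLQ.
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            if send(event):
                return True
        except Exception:
            pass  # a raised error counts as a failed delivery
        if attempt < MAX_ATTEMPTS:
            sleep(backoff_delay(attempt))
    dead_letter.append(event)  # give up; keep the event for manual replay
    return False
```

Since delivery is at-least-once, a merchant can receive the same event more than once (for example when its acknowledgement is lost), so the receiving side should deduplicate on an event id.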