System Design Case Study

How does Uber process 1M+ ride events per second with exactly-once semantics?

Design an event streaming platform: 1M+ events/sec, city-level locality, exactly-once, real-time surge pricing

Problem Statement

How does a ride-hailing platform process 1M+ ride events per second with city-level locality, guaranteeing exactly-once processing semantics and computing real-time surge pricing from the event stream?

Core challenge: Every ride generates 10+ events (requested, matched, en-route, arrived, started, GPS×N, completed, rated). At Uber's scale, that's millions of events/sec that must be processed exactly once, partitioned by city, and fed into real-time pricing models.
1M+ events/second: peak during rush hour
Exactly-once processing guarantee: no double-charge, no lost ride
<10s surge price update: supply/demand per zone
600+ cities worldwide: city-level partitioning

Functional Requirements

What the system must do

Must Have (Core)

1. Ingest ride lifecycle events (requested → matched → started → GPS → completed → rated)
2. Guarantee exactly-once processing: no duplicate charges, no lost events
3. Partition events by city/zone for locality-aware processing
4. Compute real-time surge pricing per zone (supply/demand ratio) within 10 seconds
5. Feed downstream consumers: billing, analytics, ML models, driver payouts
6. Support event replay for debugging and reprocessing after bug fixes

Out of Scope

- Driver-rider matching algorithm (separate service)
- GPS routing and ETA calculation
- Payment processing (downstream consumer)
- Push notifications to riders/drivers
- Historical analytics (batch pipeline)

Non-Functional Requirements

Quality constraints shaping architecture

Property | Target | Design Impact
Throughput | 1M+ events/sec sustained, 3M peak | Kafka with 10K+ partitions, multi-cluster
Latency | <10s for surge pricing update | Flink streaming with tumbling windows
Exactly-once | Zero duplicates in downstream state | Kafka transactions + idempotent producers + Flink checkpoints
Ordering | Per-ride ordering (not global) | Partition key = ride_id ensures all events for one ride are ordered
Durability | Zero event loss after producer ACK | acks=all, min.insync.replicas=2, RF=3
Availability | 99.99% (rides can't stop) | Multi-AZ Kafka, cross-region replication for DR
Scalability | Linear scale with city count | City-based topic partitioning, independent consumer groups per use case
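
The ordering row relies on keyed partitioning: records with the same key always map to the same partition, so they are appended in the order they were produced. The following is a minimal sketch of that idea, not Kafka's implementation. `partition_for`, `NUM_PARTITIONS`, and the sample ride IDs are hypothetical, and CRC32 stands in for the murmur2 hash that Kafka's default partitioner uses.

```python
import zlib

NUM_PARTITIONS = 12  # illustrative only; the design above targets far more


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a record key to a partition with a stable hash.

    CRC32 is a stand-in; Kafka's default partitioner uses murmur2.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [
    ("ride-42", "RIDE_REQUESTED"),
    ("ride-7", "RIDE_REQUESTED"),
    ("ride-42", "DRIVER_MATCHED"),
    ("ride-42", "TRIP_COMPLETED"),
]

# Append each event to the log of the partition chosen by its key.
partitions: dict[int, list[tuple[str, str]]] = {}
for ride_id, event_type in events:
    partitions.setdefault(partition_for(ride_id), []).append((ride_id, event_type))

# Every ride-42 event sits in one partition, in production order.
ride_42_log = [e for e in partitions[partition_for("ride-42")] if e[0] == "ride-42"]
print(ride_42_log)
```

Ordering holds only within a partition, which is why the requirement is stated as per-ride rather than global.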

High-Level Architecture

Event-driven pipeline: produce → partition → process → materialize

Figure: Uber Kafka Ride Events, End-to-End Architecture

Producer layer
- Rider App: ride requests, ratings
- Driver App: GPS every 4s, status
- Trip Service: lifecycle events
- Payment Service: billing events
- Idempotent producer: key=ride_id, acks=all
- Partition key: ride_id (per-ride ordering) or city_zone (GPS locality)

Kafka cluster (10K+ partitions, RF=3, acks=all)
- ride-events: partitioned by ride_id, ~8.5K events/sec peak
- gps-updates: partitioned by city_zone, 1.25M updates/sec
- surge-input: partitioned by city, supply/demand signals
- billing-events: partitioned by ride_id (fare, payment, refund)
- Idempotent producer + Kafka transactions, min.insync.replicas=2, 7-day retention then S3 tiered
- Avro + Schema Registry (backward-compatible evolution), consumer groups per use case

Flink stream processing
- Surge Pricing: 10s tumbling windows, supply/demand per H3 zone
- Trip State Machine: requested → matched → started → completed
- Fraud Detection: anomaly patterns, GPS spoofing check
- ETA Compute: real-time routing, traffic-aware
- Checkpoints every 30s, exactly-once state, auto-scale on consumer lag (KEDA)
- ~600 Flink task slots (1 per city), DLQ after 3 retries for poison messages

Sink layer
- Redis: surge multiplier per H3 zone
- Cassandra Trip DB: ride state + history
- S3 / ClickHouse: analytics + data lake

Exactly-once guarantee across the end-to-end pipeline
1. Idempotent producer: producer_id + sequence_number → Kafka deduplicates retries at the broker
2. Kafka transactions: atomic write across multiple partitions (all-or-nothing commit)
3. Flink checkpoints: consumer offset + operator state snapshot committed atomically every 30s
4. Idempotent sinks: upsert with ride_id as key → safe to replay on recovery
End-to-end: produce exactly-once → process exactly-once → sink idempotently = zero duplicates in downstream state
Without this: network retry → duplicate event → rider charged twice for the same ride, or surge computed from stale data
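
The first link in that chain, broker-side deduplication of producer retries, can be illustrated with a toy model. This is a sketch under simplifying assumptions rather than Kafka's actual logic: `PartitionLog` is a hypothetical class, it tracks only the last sequence number per producer, and it ignores the out-of-order (gap) detection that a real broker also performs.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    """Toy broker-side log for one partition (hypothetical, not Kafka code)."""
    records: list = field(default_factory=list)
    last_seq: dict = field(default_factory=dict)  # producer_id -> last sequence seen

    def append(self, producer_id: int, seq: int, payload: str) -> bool:
        """Append unless this producer already wrote this sequence number."""
        if seq <= self.last_seq.get(producer_id, -1):
            return False  # retry of a write that already landed: drop it
        self.last_seq[producer_id] = seq
        self.records.append(payload)
        return True


log = PartitionLog()
log.append(producer_id=1, seq=0, payload="RIDE_REQUESTED")
log.append(producer_id=1, seq=1, payload="PAYMENT_CHARGED")
# The acknowledgement for seq=1 is lost, so the producer retries the same write.
retry_accepted = log.append(producer_id=1, seq=1, payload="PAYMENT_CHARGED")
print(log.records, retry_accepted)
```

The retry is rejected because its sequence number was already recorded, so the log holds one PAYMENT_CHARGED record rather than two.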

Key Design Decisions

The architectural choices that make this work at Uber's scale

Decision | Choice | Why | Alternative Rejected
Partition Key | ride_id for ride events, city_zone for GPS | All events for one ride land in the same partition → ordered | Random (loses ordering), user_id (hot partition for frequent riders)
Exactly-Once | Kafka transactions + Flink checkpoints | End-to-end guarantee without application-level dedup | At-least-once + app dedup (complex, error-prone)
Surge Computation | Flink tumbling window (10s) | Real-time supply/demand ratio per zone | Batch (too slow), polling DB (doesn't scale)
Multi-City | Topic-per-city or city prefix in partition key | City-level isolation, independent scaling | Single global topic (cross-city interference)
Schema | Avro + Schema Registry | Backward-compatible evolution, compact binary | JSON (no schema enforcement), Protobuf (less Kafka ecosystem support)
Retention | 7 days in Kafka, then S3 (Parquet) | Replay window for bug fixes + cold storage for analytics | Infinite retention (cost), short retention (can't replay)
Surge pricing algorithm: Every 10 seconds, Flink counts active ride requests (demand) and available drivers (supply) per H3 hex zone. When the demand-to-supply ratio exceeds 1.5, a surge multiplier is applied. An exponential moving average smooths the result to prevent oscillation. Results are written to Redis, and the rider app reads them on its next request.
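
A minimal sketch of that computation in plain Python, standing in for the Flink job. The 10-second window and the 1.5 threshold come from the text. The smoothing factor `ALPHA`, the choice to use the ratio itself as the multiplier, and the zone name `zone_a` are assumptions made for illustration.

```python
from collections import Counter

WINDOW_SEC = 10        # tumbling window size from the text
SURGE_THRESHOLD = 1.5  # demand/supply ratio above which surge applies
ALPHA = 0.5            # EMA smoothing factor (assumed value)


def window_start(ts_sec: float) -> int:
    """Start of the tumbling window that contains ts_sec."""
    return int(ts_sec // WINDOW_SEC) * WINDOW_SEC


def raw_multiplier(demand: int, supply: int) -> float:
    """Ratio when above the threshold, else 1.0 (assumed mapping)."""
    ratio = demand / max(supply, 1)  # avoid division by zero
    return ratio if ratio > SURGE_THRESHOLD else 1.0


def smooth(previous: float, current: float, alpha: float = ALPHA) -> float:
    """Exponential moving average of the multiplier."""
    return alpha * current + (1 - alpha) * previous


# (timestamp_sec, zone, kind) where kind is "request" or "driver"
events = [
    (0.5, "zone_a", "request"), (1.0, "zone_a", "request"),
    (2.0, "zone_a", "request"), (3.0, "zone_a", "request"),
    (4.0, "zone_a", "driver"), (5.0, "zone_a", "driver"),
    (12.0, "zone_a", "request"), (13.0, "zone_a", "driver"),
]
counts = Counter((window_start(ts), zone, kind) for ts, zone, kind in events)

multiplier = 1.0
history = []
for start in sorted({w for w, _, _ in counts}):
    demand = counts[(start, "zone_a", "request")]
    supply = counts[(start, "zone_a", "driver")]
    multiplier = smooth(multiplier, raw_multiplier(demand, supply))
    history.append((start, multiplier))

print(history)  # [(0, 1.5), (10, 1.25)]
```

In the first window the ratio is 4/2 = 2.0, and smoothing pulls the published multiplier to 1.5. In the second window demand drops to parity, and the multiplier decays to 1.25 rather than snapping back to 1.0.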
Real-world scale: Uber's Kafka processes 4+ trillion messages/day across multiple clusters. Each city has independent consumer groups. GPS events alone account for ~1.25M updates/sec globally (5M drivers, one update every 4 seconds).
Failure modes: Consumer lag → surge pricing goes stale → auto-scale consumers via KEDA. Kafka broker failure → an ISR replica takes over, no data loss. Flink checkpoint failure → restart from the last checkpoint and reprocess (idempotent sinks handle duplicates).
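
The last failure mode depends on sinks tolerating replays. The sketch below simulates a crash between checkpoints and a restart from the last checkpointed offset. It is a toy model, not Flink: `run`, the in-memory `sink` dict, and the sample events are hypothetical.

```python
def run(events, sink, start_offset, crash_after=None):
    """Process events from start_offset, upserting into sink by ride_id.

    Returns the offset reached. crash_after simulates a failure before the
    next checkpoint, so the caller must restart from an older offset.
    """
    offset = start_offset
    for ride_id, status in events[start_offset:]:
        if crash_after is not None and offset >= crash_after:
            break
        sink[ride_id] = status  # upsert: writing the same event twice is harmless
        offset += 1
    return offset


events = [
    ("r1", "REQUESTED"), ("r1", "MATCHED"),
    ("r2", "REQUESTED"), ("r1", "COMPLETED"),
]

sink = {}
checkpoint = 1                        # last checkpoint covers offset 0 only
run(events, sink, 0, crash_after=3)   # offsets 0..2 applied, then the job crashes
run(events, sink, checkpoint)         # restart: offsets 1..2 replayed, 3 is new

clean = {}
run(events, clean, 0)                 # reference run with no failure
print(sink == clean)                  # True
```

Offsets 1 and 2 are applied twice, yet the sink ends in the same state as the failure-free run. A sink that appended rows or incremented counters would not have this property.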

Scale Estimation

Back-of-envelope math: derive infrastructure from given numbers

Given: 1M events/sec peak; 600 cities; 5M active drivers (GPS every 4s); 10 event types per ride; avg 15M rides/day
Step | Derivation | Result | Design Impact
1 | GPS updates: 5M drivers ÷ 4s interval | 1.25M GPS/sec | Separate topic for GPS (high volume, lower criticality)
2 | Ride events: 15M rides × 10 events ÷ 86,400s | ~1,700 ride events/sec avg | Much lower than GPS, but each is critical (exactly-once)
3 | Peak ride events: 1,700 × 5 (peak factor) | ~8,500 ride events/sec peak | Kafka handles easily; GPS is the real throughput challenge
4 | Total events: 1.25M GPS + 8.5K ride + misc | ~1.3M events/sec peak | Need 10K+ Kafka partitions across multiple clusters
5 | Kafka partitions: 1.3M ÷ 5K msg/partition/sec | ~260 partitions minimum | Use 1000+ for headroom and city-level isolation
6 | Storage: 1.3M × 1KB avg × 86,400s × 7 days | ~786 TB retention | Tiered storage: 7d hot (SSD) → S3 cold (Parquet)
7 | Flink workers for surge: 600 cities × 1 worker | ~600 Flink task slots | Each city independently computes surge every 10s
8 | Kafka brokers: 1.3M msg/sec ÷ 100K/broker | ~13 brokers minimum | Use 30+ for replication (RF=3) and headroom
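
The arithmetic in the table can be reproduced with a short script. The inputs are the given numbers. The variable names are illustrative, and `total_peak` uses the rounded 1.3M figure from step 4 rather than the exact sum.

```python
import math

drivers = 5_000_000
gps_interval_s = 4
rides_per_day = 15_000_000
events_per_ride = 10
peak_factor = 5
avg_event_bytes = 1_000        # 1 KB
retention_days = 7
msgs_per_partition = 5_000     # per second
msgs_per_broker = 100_000      # per second
seconds_per_day = 86_400

gps_per_sec = drivers / gps_interval_s                           # step 1
ride_avg = rides_per_day * events_per_ride / seconds_per_day     # step 2
ride_peak = ride_avg * peak_factor                               # step 3
total_peak = 1_300_000                                           # step 4, rounded up for misc traffic
partitions = math.ceil(total_peak / msgs_per_partition)          # step 5
storage_tb = total_peak * avg_event_bytes * seconds_per_day * retention_days / 1e12  # step 6
brokers = math.ceil(total_peak / msgs_per_broker)                # step 8

print(f"GPS/sec: {gps_per_sec:,.0f}")
print(f"Ride events/sec avg: {ride_avg:,.0f}, peak: {ride_peak:,.0f}")
print(f"Partitions: {partitions}, brokers: {brokers}, storage: {storage_tb:,.0f} TB")
```

Two details are worth noting. The exact peak for ride events is about 8,700/sec, because the table rounds 1,736 down to 1,700 before multiplying. The 786 TB figure also counts a single copy of the data, so with RF=3 the raw disk footprint would be roughly three times larger.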

APIs

Event schema + producer/consumer interfaces

// --- Ride Event Schema (Avro) ---
{
  "type": "record",
  "name": "RideEvent",
  "namespace": "com.uber.events",
  "fields": [
    {"name": "event_id",    "type": "string", "doc": "UUID, idempotency key"},
    {"name": "ride_id",     "type": "string", "doc": "partition key"},
    {"name": "event_type",  "type": {"type": "enum", "name": "EventType", "symbols": [
      "RIDE_REQUESTED", "DRIVER_MATCHED", "DRIVER_EN_ROUTE",
      "RIDER_PICKED_UP", "TRIP_IN_PROGRESS", "GPS_UPDATE",
      "TRIP_COMPLETED", "FARE_CALCULATED", "PAYMENT_CHARGED", "RATING_SUBMITTED"
    ]}},
    {"name": "timestamp",   "type": "long", "doc": "epoch millis"},
    {"name": "city_id",     "type": "string", "doc": "for routing"},
    {"name": "driver_id",   "type": ["null", "string"], "default": null},
    {"name": "rider_id",    "type": "string"},
    {"name": "location",    "type": ["null", {"type": "record", "name": "Location",
      "fields": [{"name": "lat", "type": "double"}, {"name": "lng", "type": "double"}]
    }], "default": null},
    {"name": "metadata",    "type": {"type": "map", "values": "string"}, "default": {}}
  ]
}

// --- Producer (Trip Service) ---
POST /internal/events/publish
Headers: X-Idempotency-Key: {event_id}
Body: { ride_id, event_type, ... }
→ Kafka produce with key=ride_id, transactional

// --- Consumer (Surge Pricing) ---
Kafka Consumer Group: "surge-pricing-service"
Topic: "uber.ride-events" (filtered: RIDE_REQUESTED, TRIP_COMPLETED)
Processing: Flink tumbling window (10s) → count per H3 zone → write Redis

// --- Surge Query API ---
GET /api/surge?lat={lat}&lng={lng}
→ Redis HGET surge:{h3_cell} multiplier
Response: { "multiplier": 1.8, "zone": "h3_abc123", "expires_in_sec": 10 }
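
On the producer side, a service would build an event with the schema's fields and hand Kafka a key and a value. The sketch below shows only that shaping step. `build_ride_event` and `to_record` are hypothetical helpers, JSON encoding stands in for Avro serialization, and no Kafka client is involved.

```python
import json
import time
import uuid

EVENT_TYPES = {
    "RIDE_REQUESTED", "DRIVER_MATCHED", "DRIVER_EN_ROUTE",
    "RIDER_PICKED_UP", "TRIP_IN_PROGRESS", "GPS_UPDATE",
    "TRIP_COMPLETED", "FARE_CALCULATED", "PAYMENT_CHARGED", "RATING_SUBMITTED",
}


def build_ride_event(ride_id, event_type, city_id, rider_id,
                     driver_id=None, location=None, metadata=None):
    """Build a dict carrying the fields of the RideEvent schema."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event_type: {event_type}")
    return {
        "event_id": str(uuid.uuid4()),         # idempotency key
        "ride_id": ride_id,                    # partition key
        "event_type": event_type,
        "timestamp": int(time.time() * 1000),  # epoch millis
        "city_id": city_id,
        "driver_id": driver_id,
        "rider_id": rider_id,
        "location": location,
        "metadata": metadata or {},
    }


def to_record(event):
    """Return (key, value) as bytes; JSON stands in for Avro encoding."""
    return event["ride_id"].encode("utf-8"), json.dumps(event).encode("utf-8")


event = build_ride_event("ride-42", "RIDE_REQUESTED", "sf", "rider-9",
                         location={"lat": 37.77, "lng": -122.42})
key, value = to_record(event)
print(key, len(value))
```

Keeping `event_id` separate from `ride_id` matters: the first identifies one event for deduplication, while the second groups all events of a ride onto one partition.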

Data Model

Hot path (Kafka + Redis) for real-time; cold path (Cassandra + S3) for state and analytics

Store | Data | Access Pattern | Retention
Kafka | All raw events (immutable log) | Sequential read by consumer groups | 7 days hot, then tiered to S3
Redis | Surge multipliers per zone | GET surge:{h3_cell}, <1ms | TTL 10s (auto-refresh by Flink)
Redis | Active ride state machine | HGET ride:{ride_id} status | TTL 24h (ride lifecycle)
Cassandra | Completed ride history | Query by rider_id or driver_id + time range | Forever (compliance)
S3 (Parquet) | Archived events for analytics | Batch queries via Spark/Presto | Years (data lake)
Schema Registry | Avro schemas (versioned) | Lookup by schema_id on deserialize | Forever (all versions kept)
Ride state machine (Redis): Each ride is a hash: {status, driver_id, rider_id, pickup_lat, pickup_lng, started_at, ...}. State transitions are applied atomically by a Lua script that allows only valid transitions (for example, COMPLETED → EN_ROUTE is rejected). Flink updates the state on each event.
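
The transition check can be sketched in Python, with a plain dict standing in for the Redis hash. The transition table is an assumption derived from the lifecycle listed in the problem statement, and `apply_transition` is a hypothetical helper. In the design above this logic would run inside a Lua script so that the check and the write happen atomically.

```python
# Allowed next states for each status (assumed from the ride lifecycle).
VALID_TRANSITIONS = {
    "REQUESTED": {"MATCHED"},
    "MATCHED": {"EN_ROUTE"},
    "EN_ROUTE": {"ARRIVED"},
    "ARRIVED": {"STARTED"},
    "STARTED": {"COMPLETED"},
    "COMPLETED": set(),  # terminal state
}


def apply_transition(ride: dict, new_status: str) -> bool:
    """Move the ride to new_status if allowed; return whether it was applied."""
    allowed = VALID_TRANSITIONS.get(ride["status"], set())
    if new_status not in allowed:
        return False  # invalid or out-of-order event: leave state untouched
    ride["status"] = new_status
    return True


ride = {"status": "REQUESTED", "driver_id": None, "rider_id": "rider-9"}
results = [apply_transition(ride, s)
           for s in ["MATCHED", "EN_ROUTE", "ARRIVED", "STARTED", "COMPLETED"]]
rejected = apply_transition(ride, "EN_ROUTE")  # COMPLETED -> EN_ROUTE is not allowed
print(results, rejected, ride["status"])
```

Rejecting a transition instead of overwriting also makes replays safe: a late or duplicated event for an earlier state cannot move a completed ride backwards.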

Resilience & Edge Cases

What breaks and how the system recovers

Failure | Impact | Recovery
Kafka broker dies | Partitions on that broker unavailable for ~10s | ISR replica promoted to leader. Producers retry. Zero data loss (acks=all).
Flink job crashes | Surge pricing stale for affected cities | Restart from last checkpoint (every 30s). Reprocess buffered events. Idempotent Redis writes.
Consumer lag > 5min | Surge pricing outdated → wrong prices | KEDA auto-scales consumers. Alert at 2min lag. Fallback: use last-known surge.
Schema incompatibility | Consumer can't deserialize new events | Schema Registry rejects breaking changes in CI. Backward-compatible only.
Duplicate events (network retry) | Could double-charge rider | Idempotent producer (producer_id + seq). Flink dedup by event_id. Sink upsert by ride_id.
City-wide outage | All events for one city lost | Cross-region Kafka replication (MirrorMaker 2). Failover to DR cluster within 60s.
Poison message | Consumer crashes in loop | DLQ after 3 retries. Alert. Manual inspection. Fix + replay from DLQ.
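
The poison-message row can be sketched as a retry loop that diverts a message to a dead-letter queue once retries are exhausted. This is an illustrative model with in-memory lists, not a Kafka consumer. `consume` and `handle` are hypothetical names, and "3 retries" is read here as one initial attempt plus three retries.

```python
import json

MAX_RETRIES = 3  # from the table; total attempts = 1 + MAX_RETRIES (assumed reading)


def handle(message: str) -> str:
    """Example handler: parse the payload and return its ride_id."""
    return json.loads(message)["ride_id"]


def consume(messages, handler, max_retries=MAX_RETRIES):
    """Process messages in order; park any that keep failing in a DLQ."""
    processed, dlq = [], []
    for msg in messages:
        for attempt in range(1, max_retries + 2):
            try:
                processed.append(handler(msg))
                break
            except Exception as exc:
                if attempt == max_retries + 1:
                    # Retries exhausted: record the failure and move on.
                    dlq.append({"message": msg, "error": str(exc), "attempts": attempt})
    return processed, dlq


messages = ['{"ride_id": "r1"}', "not-json", '{"ride_id": "r2"}']
processed, dlq = consume(messages, handle)
print(processed, [d["message"] for d in dlq])
```

The malformed message no longer blocks the partition: r2 is still processed, and the DLQ entry keeps the payload and error for inspection and later replay.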

Tech Stack & Tradeoffs

Why each technology was chosen

Component | Technology | Why Chosen | What Was Rejected
Event Bus | Apache Kafka | 1M+ msg/sec, durable, replayable, exactly-once | RabbitMQ (no replay), SQS (no ordering), Pulsar (less mature)
Stream Processing | Apache Flink | Exactly-once, event-time windows, checkpointing | Spark Streaming (micro-batch, higher latency), Kafka Streams (simpler but less powerful)
Hot State | Redis Cluster | Sub-ms reads, atomic ops (Lua), TTL | Memcached (no persistence), DynamoDB (higher latency)
Cold Storage | Apache Cassandra | Write-heavy, time-series friendly, linear scale | PostgreSQL (doesn't scale writes), MongoDB (less predictable at scale)
Analytics | S3 + Parquet + Presto | Cheap, columnar, SQL-on-anything | Snowflake (cost at this volume), Redshift (less flexible)
Schema | Avro + Confluent Schema Registry | Compact binary, backward-compatible evolution | Protobuf (less Kafka tooling), JSON (no schema enforcement)

Interview Cheat Sheet

The 8 things to say in a system design interview for this problem

1. Partition by ride_id: guarantees per-ride ordering without global coordination
2. Exactly-once = idempotent producer + Kafka transactions + Flink checkpoints + idempotent sinks
3. Separate topics by event type: GPS (high volume, lossy OK) vs ride events (low volume, critical)
4. Surge pricing via Flink tumbling windows: 10s window, count supply/demand per H3 zone, write to Redis
5. City-level isolation: independent consumer groups per city, no cross-city interference
6. Schema Registry: enforce backward compatibility, reject breaking changes in CI
7. Tiered storage: 7d hot in Kafka, then S3 Parquet for analytics (cost-effective)
8. DLQ + auto-scaling: poison messages isolated, consumer lag triggers KEDA scale-up