System Design Case Study

How does Uber serve ML predictions at 1M+ requests/sec with P99 <10ms?

Design an ML serving platform: 1M+ req/sec, 100+ models, <10ms P99, version rollouts

Problem Statement

How does a ride platform serve ML predictions (ETA, surge, fraud) at 1M+ requests/sec with P99 <10ms latency, loading 100+ models into GPU memory, handling model version rollouts, and falling back to rule-based systems when models are unavailable?

Core challenge: Every ride request needs 5+ ML predictions (ETA, price, fraud score, driver match, route). Each prediction must complete in <10ms at P99. Models are retrained daily. Bad model = bad ETAs = angry users. How do you serve, version, and roll back models safely?
1M+ predictions/sec (across all models)
<10ms P99 latency (per prediction)
100+ production models (retrained daily)
Canary rollout strategy (auto-rollback on regression)

Architecture · ML Serving Platform

LAYER 1 · FEATURE (Feature Store: Same Logic in Training + Serving)

Request: "predict ETA" (user_id, pickup, dropoff, time), 1M+ req/sec
Online: Redis (<5ms). Real-time features: last 5 rides, current location. Pre-computed, low-latency lookup.
Batch: Hive (hourly). Historical features: lifetime stats, driver rating. Pre-computed hourly, joined at serve time.
Streaming: Flink (real-time). Rolling windows: surge in last 5 min, demand. Continuous computation → Redis.
Train-serve consistency: same feature logic in training + serving. No skew → no silent accuracy loss.
Feature vector: [surge_5m, dist_km, driver_rating, user_rides, time_of_day, ...], assembled in <5ms total.

LAYER 2 · SERVING (Model Inference → Canary Router → Response)

Model serving (TF Serving / Triton): GPU batching (group requests, amortize overhead). 100+ models loaded in GPU memory. P99 <10ms per prediction. Auto-scale by QPS, warm model pool. No cold start: models pre-loaded.
Canary router: 95% → Model v2 (production, stable). 5% → Model v3 (canary, monitoring). Deterministic: hash(user_id) % 100.
Prediction response: ETA: 12 minutes. Confidence: 0.92. Model version: v2.3.1. Total latency: <10ms P99, including feature fetch + inference.

LAYER 3 · LIFECYCLE (Registry + Monitoring + Fallback)

Model registry (S3): versioned, immutable artifacts. Metadata: accuracy, lineage, owner. Approval workflow before deploy. Rollback: instant revert to previous version. Pipeline: train → validate → shadow → canary → full.
Monitoring + auto-rollback: accuracy drift >2% → auto-rollback canary. Feature drift detected → retrain trigger. Latency P99 >10ms → alert + scale-up. Compare canary vs. production metrics. Dashboard: accuracy, latency, throughput.
Fallback (rule-based): if the model times out or errors, use the historical average ETA. Degraded accuracy is acceptable; full failure is not. Never block the ride request.

Rollout: train → validate offline → shadow mode → canary 5% → monitor 24h → full rollout
Auto-rollback: accuracy regression >2% | Feature store ensures train-serve consistency | Fallback: rule-based heuristic if model timeout
Layer | Component | Role
Feature Store | Online store (Redis) + Offline store (Hive) | Serve pre-computed features at prediction time (<5ms). Batch features joined with real-time features.
Model Registry | Versioned model artifacts (S3 + metadata DB) | Store trained models with metrics, lineage, approval status. Immutable versions.
Serving Layer | TensorFlow Serving / Triton / custom | Load model into GPU/CPU memory. Batch inference requests. Auto-scale by QPS.
Router | Traffic splitting + canary | Route 5% to new model version, 95% to current. Compare metrics. Auto-promote or rollback.
Fallback | Rule-based system | If model times out or errors → use heuristic (e.g., historical average ETA). Never block the ride.
Monitoring | Prediction quality metrics | Track accuracy, latency, feature drift. Alert on degradation. Auto-rollback trigger.
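
The router's deterministic split (the hash(user_id) % 100 rule in the architecture) can be sketched as follows. This is a minimal illustration, not Uber's implementation: the function names and the choice of SHA-256 are assumptions. A stable digest is used because Python's built-in hash() for strings is salted per process, so it would assign the same user to different buckets on different servers.

```python
import hashlib

CANARY_PERCENT = 5  # share of traffic sent to the canary version (from the rollout plan)


def bucket_for(user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100). Hypothetical helper."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce to 100 buckets.
    return int.from_bytes(digest[:8], "big") % 100


def route(user_id: str, stable: str = "v2", canary: str = "v3") -> str:
    """Return the model version that should serve this user."""
    return canary if bucket_for(user_id) < CANARY_PERCENT else stable


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    canary_share = sum(route(u) == "v3" for u in users) / len(users)
    print(f"canary share: {canary_share:.1%}")
```

Because the bucket depends only on user_id, a given user sees the same model version on every request, which keeps canary and production metrics comparable per user.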
Feature serving: ML models need features at prediction time. Online features (user's last 5 rides, current location) are served from Redis (<5ms). Batch features (user lifetime stats, driver rating) are pre-computed hourly. Real-time features (surge in last 5 min) are computed by Flink and written to Redis. The feature store ensures training-serving consistency: the same feature logic runs in both paths.
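
As a rough sketch of how a feature vector might be assembled at serve time, the example below merges online, batch, and request-derived values in a fixed order. Plain dictionaries stand in for Redis and Hive lookups; the function name, the precedence order, and the default values are assumptions for illustration. The feature names come from the architecture's example vector.

```python
from typing import List, Mapping

# Fixed column order shared by training and serving (names from the example vector).
FEATURE_ORDER = ["surge_5m", "dist_km", "driver_rating", "user_rides", "time_of_day"]

# Hypothetical defaults used when a feature is missing (e.g. store unavailable).
DEFAULTS = {
    "surge_5m": 1.0,
    "dist_km": 0.0,
    "driver_rating": 4.5,
    "user_rides": 0.0,
    "time_of_day": 12.0,
}


def assemble_features(
    online: Mapping[str, float],
    batch: Mapping[str, float],
    request: Mapping[str, float],
) -> List[float]:
    """Merge feature sources; later sources override earlier ones."""
    merged = {**DEFAULTS, **batch, **online, **request}
    return [float(merged[name]) for name in FEATURE_ORDER]


if __name__ == "__main__":
    vector = assemble_features(
        online={"surge_5m": 1.8},                         # stand-in for a Redis lookup
        batch={"driver_rating": 4.9, "user_rides": 120},  # stand-in for hourly batch output
        request={"dist_km": 7.2, "time_of_day": 18},      # derived from the request itself
    )
    print(vector)  # [1.8, 7.2, 4.9, 120.0, 18.0]
```

Calling the same assemble_features function when building training rows and when serving is one simple way to get the consistency described above: both paths share one definition of column order and defaults.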
Model rollout: Train new model → Validate offline (A/B metrics on historical data) → Shadow mode (run alongside prod, compare outputs, no user impact) → Canary (5% traffic) → Monitor for 24h → Full rollout or auto-rollback. Key metric: ETA accuracy (predicted vs. actual). Regression > 2% → auto-rollback.
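
The promote-or-rollback gate at the end of the canary window could look roughly like this. The source only says "regression > 2%"; this sketch assumes that means a relative increase in mean absolute ETA error of the canary versus production, and all function names are hypothetical.

```python
from statistics import fmean
from typing import Sequence

REGRESSION_THRESHOLD = 0.02  # 2% regression triggers rollback (from the rollout rule)


def mean_abs_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean absolute error between predicted and actual ETAs (minutes)."""
    return fmean([abs(p - a) for p, a in zip(predicted, actual)])


def canary_decision(prod_mae: float, canary_mae: float) -> str:
    """Return 'rollback' if the canary's error is more than 2% worse, else 'promote'.

    Assumption: regression is measured as the relative increase in MAE.
    """
    if prod_mae <= 0:
        raise ValueError("production MAE must be positive")
    regression = (canary_mae - prod_mae) / prod_mae
    return "rollback" if regression > REGRESSION_THRESHOLD else "promote"


if __name__ == "__main__":
    prod = mean_abs_error([10, 12, 20], [11, 15, 18])    # (1 + 3 + 2) / 3 = 2.0
    canary = mean_abs_error([10, 12, 20], [11, 15, 17])  # (1 + 3 + 3) / 3
    print(canary_decision(prod, canary))  # canary is about 17% worse, so: rollback
```

In practice the comparison would run on matched traffic windows for both versions, which is what the deterministic canary split makes possible.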
Failure modes: Model timeout → fallback to rules (never block the ride request). Feature store down → use default/cached features (degraded accuracy, not failure). Bad model deployed → canary catches it within minutes, auto-rollback. Training-serving skew → feature store ensures the same computation in both paths.
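
The timeout-then-fallback behavior can be sketched with the standard library's thread pool. This is an illustration under stated assumptions: predict_with_fallback, historical_average_eta, the 15-minute default, and the "historical_avg_eta_min" key are hypothetical, and a production server would more likely enforce deadlines at the RPC layer.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Shared pool so a slow model call does not block the caller when it times out.
_POOL = ThreadPoolExecutor(max_workers=4)
DEFAULT_TIMEOUT_S = 0.010  # 10 ms budget, matching the P99 target


def historical_average_eta(features: dict) -> float:
    """Rule-based heuristic: a precomputed average, with a hypothetical default."""
    return float(features.get("historical_avg_eta_min", 15.0))


def predict_with_fallback(model_fn, features: dict, timeout_s: float = DEFAULT_TIMEOUT_S) -> dict:
    """Try the model within the time budget; on timeout or error, use the heuristic."""
    future = _POOL.submit(model_fn, features)
    try:
        return {"eta_min": future.result(timeout=timeout_s), "source": "model"}
    except Exception:
        # Covers both a timeout while waiting and an exception raised by the model.
        # cancel() only helps if the call has not started; a running call finishes
        # in the background and its result is discarded.
        future.cancel()
        return {"eta_min": historical_average_eta(features), "source": "fallback"}


def slow_model(features: dict) -> float:
    time.sleep(0.2)  # simulates a model that misses the deadline
    return 12.0


if __name__ == "__main__":
    print(predict_with_fallback(slow_model, {"historical_avg_eta_min": 14.0}))
    # {'eta_min': 14.0, 'source': 'fallback'}
```

The response records which path produced the value, so monitoring can track the fallback rate as its own signal.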
Real-world: Uber: Michelangelo (end-to-end ML platform). Netflix: Metaflow for ML pipelines. Google: TFX (TensorFlow Extended). Meta: FBLearner for model training + serving. Spotify: ML for Discover Weekly recommendations.

Interview Cheat Sheet

The 7 things to say for ML serving design

1. Feature store (online + offline): Redis for real-time features (<5ms), Hive for batch features
2. Model registry with versioning: immutable artifacts, approval workflow, rollback capability
3. Canary rollout: 5% traffic to new model, monitor accuracy, auto-rollback on regression
4. Shadow mode: run new model alongside prod, compare outputs, no user impact
5. Fallback to rules: if model times out or errors, use heuristic (never block the request)
6. Training-serving consistency: feature store ensures same computation in both paths
7. Request batching for inference: group multiple predictions into one GPU call to amortize per-call overhead
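
Request batching (item 7) can be illustrated with a small queue-based collector: wait for the first request, then keep gathering until the batch is full or a short deadline passes. Servers such as Triton and TF Serving provide their own batching; this is only a sketch of the idea, and collect_batch, run_batch, and the 2 ms wait are assumptions.

```python
import queue
import time
from typing import Callable, List


def collect_batch(q: "queue.Queue", max_batch: int = 32, max_wait_s: float = 0.002) -> List:
    """Block for the first request, then gather more until full or the deadline passes."""
    batch = [q.get()]
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch


def run_batch(model_batch_fn: Callable[[List], List], batch: List) -> List:
    """One model call for the whole batch; outputs align with inputs by position."""
    outputs = model_batch_fn(batch)
    if len(outputs) != len(batch):
        raise ValueError("model must return one output per input")
    return outputs


if __name__ == "__main__":
    requests: "queue.Queue" = queue.Queue()
    for i in range(5):
        requests.put([float(i), 1.0])

    def toy_model(rows):
        # Stand-in for a GPU call: one invocation scores every row.
        return [sum(r) for r in rows]

    batch = collect_batch(requests, max_batch=3, max_wait_s=0.05)
    print(run_batch(toy_model, batch))  # [1.0, 2.0, 3.0]
```

The trade-off is latency for throughput: a longer max_wait_s builds larger batches but eats into the 10 ms budget, so the wait is usually a small fraction of it.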