How does Uber serve ML predictions at 1M+ req/sec with <10ms P99 latency?

🎯 Design an ML serving platform: 1M+ req/sec, 100+ models, <10ms P99, version rollouts

Concepts Involved

Caching · Load Balancer · Kafka · Docker/K8s · Stream Processing

Problem Statement

How does a ride platform serve ML predictions (ETA, surge, fraud) at 1M+ requests/sec with P99 <10ms latency, loading 100+ models into GPU memory, handling model version rollouts, and falling back to rule-based systems when models are unavailable?

Core challenge: Every ride request needs 5+ ML predictions (ETA, price, fraud score, driver match, route). Each prediction must complete in <10ms at P99. Models are retrained daily. Bad model = bad ETAs = angry users. How do you serve, version, and roll back safely?
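A quick back-of-envelope check of what these numbers imply. This is a rough sketch only: it assumes exactly 5 predictions per ride request and, as a worst case, that every prediction takes the full 10 ms. The in-flight figure uses Little's law (concurrency = arrival rate × time in system). The variable names are illustrative.

```python
# Figures taken from the problem statement; the "exactly 5" is an assumption
# (the text says 5+ predictions per ride request).
PREDICTIONS_PER_SEC = 1_000_000
PREDICTIONS_PER_RIDE = 5
P99_LATENCY_S = 0.010

# Upper bound on ride requests per second the platform is sized for.
ride_requests_per_sec = PREDICTIONS_PER_SEC // PREDICTIONS_PER_RIDE

# Little's law: in-flight predictions = arrival rate * time in system.
# Using the P99 budget as the time gives a pessimistic estimate.
max_in_flight = round(PREDICTIONS_PER_SEC * P99_LATENCY_S)

print(ride_requests_per_sec, max_in_flight)
```

Under these assumptions the platform handles on the order of 200,000 ride requests per second and around 10,000 predictions in flight at any instant, which is why a single serving node is not an option.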

1M+ predictions / sec · across all models

<10ms P99 latency · per prediction

100+ production models · retrained daily

Canary rollout strategy · auto-rollback on regression

Architecture · ML Serving Platform

Layer	Component	Role
Feature Store	Online store (Redis) + Offline store (Hive)	Serve pre-computed features at prediction time (<5ms). Batch features joined with real-time features.
Model Registry	Versioned model artifacts (S3 + metadata DB)	Store trained models with metrics, lineage, approval status. Immutable versions.
Serving Layer	TensorFlow Serving / Triton / custom	Load model into GPU/CPU memory. Batch inference requests. Auto-scale by QPS.
Router	Traffic splitting + canary	Route 5% to new model version, 95% to current. Compare metrics. Auto-promote or rollback.
Fallback	Rule-based system	If model times out or errors → use heuristic (e.g., historical average ETA). Never block the ride.
Monitoring	Prediction quality metrics	Track accuracy, latency, feature drift. Alert on degradation. Auto-rollback trigger.
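One way the Router row's 5% / 95% split could be implemented is by hashing a request id into a bucket, so the same id always reaches the same model version. This is a minimal sketch, not how any particular platform does it. The function name, the version labels, and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib

CANARY_PERCENT = 5  # share of traffic for the new version, per the Router row


def pick_version(request_id: str, current: str, canary: str,
                 canary_percent: int = CANARY_PERCENT) -> str:
    """Deterministically map a request id to a model version."""
    # Hash the id so routing is stable across retries and router replicas.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return canary if bucket < canary_percent else current


if __name__ == "__main__":
    print(pick_version("ride-12345", current="eta-v41", canary="eta-v42"))
```

Hash-based bucketing keeps the split stateless: any router instance makes the same decision for the same id, and raising `canary_percent` only moves additional buckets to the new version.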

Feature serving: ML models need features at prediction time. Online features (user's last 5 rides, current location) served from Redis (<2ms). Batch features (user lifetime stats, driver rating) pre-computed hourly. Real-time features (surge in last 5 min) computed by Flink, written to Redis. Feature store ensures training-serving consistency · same feature logic in both paths.
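The lookup path described above can be sketched as a merge of three feature sources into one vector, with defaults filling any gaps. This is an illustrative sketch only: plain dictionaries stand in for Redis, and the feature names and default values are hypothetical.

```python
from typing import Dict, Mapping

FeatureMap = Dict[str, float]

# Hypothetical defaults used when a feature is missing from every store.
DEFAULTS: FeatureMap = {
    "rides_last_5_avg_km": 0.0,
    "driver_rating": 4.5,
    "surge_5min": 1.0,
}


def assemble_features(entity_id: str,
                      online: Mapping[str, FeatureMap],
                      batch: Mapping[str, FeatureMap],
                      realtime: Mapping[str, FeatureMap],
                      defaults: FeatureMap = DEFAULTS) -> FeatureMap:
    """Merge batch, online, and real-time features for one entity."""
    features = dict(defaults)  # start from defaults so no key is ever missing
    # Later sources override earlier ones: batch, then online, then real-time.
    for store in (batch, online, realtime):
        features.update(store.get(entity_id, {}))
    return features


if __name__ == "__main__":
    online = {"u1": {"rides_last_5_avg_km": 7.2}}
    batch = {"u1": {"driver_rating": 4.9}}
    print(assemble_features("u1", online, batch, realtime={}))
```

Starting from defaults means a missing key or an empty store degrades accuracy rather than failing the request.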

Model rollout: Train new model → Validate offline (evaluate metrics on historical data) → Shadow mode (run alongside prod, compare outputs, no user impact) → Canary (5% traffic) → Monitor for 24h → Full rollout or auto-rollback. Key metric: ETA accuracy (predicted vs actual). A regression of more than 2% triggers auto-rollback.
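The promote-or-rollback step at the end of the canary can be sketched as a comparison of error metrics. This sketch assumes that "regression" means a relative increase in mean absolute ETA error of the canary over the current model. The text does not define the metric precisely, so treat that definition, and the function names, as assumptions.

```python
from statistics import mean
from typing import Sequence

REGRESSION_THRESHOLD = 0.02  # the 2% regression limit from the rollout rule


def mean_abs_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean absolute error between predicted and actual ETAs."""
    return mean(abs(p - a) for p, a in zip(predicted, actual))


def canary_decision(baseline_mae: float, canary_mae: float,
                    threshold: float = REGRESSION_THRESHOLD) -> str:
    """Return 'rollback' if the canary's error is worse by more than threshold."""
    if baseline_mae <= 0:
        raise ValueError("baseline_mae must be positive")
    relative_change = (canary_mae - baseline_mae) / baseline_mae
    return "rollback" if relative_change > threshold else "promote"


if __name__ == "__main__":
    # Canary error is 3% higher than baseline, which exceeds the 2% limit.
    print(canary_decision(baseline_mae=100.0, canary_mae=103.0))
```

A real monitor would also require a minimum sample size before deciding, so that a few noisy trips at the start of the canary cannot trigger a rollback on their own.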

Failure modes: Model timeout → fallback to rules (never block ride request). Feature store down → use default/cached features (degraded accuracy, not failure). Bad model deployed → canary catches within minutes, auto-rollback. Training-serving skew → feature store ensures same computation in both paths.
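The "model timeout → fallback to rules" path can be sketched with a thread pool and a deadline on the result. This is a simplified illustration using Python's standard `concurrent.futures`. The function names, the feature key `historical_avg_eta_s`, and the 10 ms default are assumptions, and a production server would normally enforce the deadline at the RPC layer.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Dict, Tuple

Features = Dict[str, float]

_POOL = ThreadPoolExecutor(max_workers=4)


def predict_with_fallback(model_fn: Callable[[Features], float],
                          features: Features,
                          fallback_fn: Callable[[Features], float],
                          timeout_s: float = 0.010) -> Tuple[float, str]:
    """Return (prediction, source); source is 'model' or 'fallback'."""
    future = _POOL.submit(model_fn, features)
    try:
        return future.result(timeout=timeout_s), "model"
    except FutureTimeout:
        future.cancel()  # best effort: an already-running call is not interrupted
        return fallback_fn(features), "fallback"
    except Exception:
        # Any model error also falls back, so the ride request is never blocked.
        return fallback_fn(features), "fallback"


def historical_average_eta(features: Features) -> float:
    """Rule-based heuristic: reuse a precomputed historical average."""
    return features["historical_avg_eta_s"]
```

Returning the source alongside the value lets monitoring count how often the fallback fires, which is itself a useful health signal for the serving layer.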

Real-world: Uber · Michelangelo (end-to-end ML platform). Netflix · Metaflow for ML pipelines. Google · TFX (TensorFlow Extended). Meta · FBLearner for model training + serving. Spotify · ML for Discover Weekly recommendations.

Interview Cheat Sheet

The 7 things to say for ML serving design

1. Feature store (online + offline) · Redis for real-time features (<5ms), Hive for batch features
2. Model registry with versioning · immutable artifacts, approval workflow, rollback capability
3. Canary rollout · 5% traffic to new model, monitor accuracy, auto-rollback on regression
4. Shadow mode · run new model alongside prod, compare outputs, no user impact
5. Fallback to rules · if model times out or errors, use heuristic (never block the request)
6. Training-serving consistency · feature store ensures same computation in both paths
7. Request batching · group multiple predictions into one inference call, amortize GPU overhead
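Point 7 can be illustrated with a small sketch that groups individual requests into fixed-size batches, so the model is invoked once per batch rather than once per request. The function names and the batch size of 32 are assumptions. The sketch also omits the short flush timer that serving systems typically add so a partial batch does not wait indefinitely.

```python
from typing import Callable, List, Sequence

BatchPredict = Callable[[Sequence[float]], List[float]]


def predict_in_batches(inputs: Sequence[float],
                       batch_predict: BatchPredict,
                       max_batch_size: int = 32) -> List[float]:
    """Run batch_predict over inputs in chunks, preserving input order."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    outputs: List[float] = []
    for start in range(0, len(inputs), max_batch_size):
        batch = inputs[start:start + max_batch_size]
        outputs.extend(batch_predict(batch))  # one model call per batch
    return outputs


if __name__ == "__main__":
    calls = []

    def double_all(batch: Sequence[float]) -> List[float]:
        calls.append(len(batch))  # record each batch size for inspection
        return [2 * x for x in batch]

    results = predict_in_batches(list(range(70)), double_all)
    print(len(results), calls)
```

With 70 inputs and a batch size of 32, the model is called three times instead of 70, which is where the per-call overhead gets amortized.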
