System Design Case Study

How does Uber trace requests across 1000+ microservices with Jaeger?

?? Design a distributed tracing system: 1000+ services, async boundaries, intelligent sampling
Concepts Involved

Problem Statement

How does a distributed tracing system correlate requests across 1000+ microservices, propagating trace context through async boundaries (Kafka, queues) and sampling intelligently to keep storage under 1% of total traffic while capturing all error traces?

Core challenge: A single ride request touches 50+ services. When it fails, which service caused it? You need end-to-end visibility. But tracing every request at 1M+ RPS = petabytes/day. Intelligent sampling must capture errors while discarding boring successes.
1000+
microservices
per request: 50+ spans
1%
sampling rate
but 100% of errors
Async
context propagation
Kafka, queues, cron
<5ms
overhead per span
must be invisible

Architecture · Trace Collection Pipeline

LAYER 1 · INSTRUMENTATION (4 Services with OTel SDK Auto-Instrumenting) API Gateway OTel SDK auto-instrument HTTP inbound/outbound Span: 200ms total trace_id generated here root span (parent=null) Trip Service OTel SDK auto-instrument gRPC calls instrumented Span: 50ms parent=API Gateway span tags: trip_id, city Payment Service OTel SDK auto-instrument HTTP + DB calls Span: 80ms parent=Trip span tags: amount, currency DB (PostgreSQL) OTel SDK auto-instrument SQL queries captured Span: 15ms parent=Payment span tags: db.statement, rows spans emitted LAYER 2 · COLLECTION (Context Propagation ? Collector ? Storage) Context Propagation W3C traceparent header 00-{trace_id}-{span_id}-01 HTTP ? header injection Kafka ? message header gRPC ? metadata field Cron ? job metadata inject OTel Collector Buffer spans in memory (reduce writes) Batch: group spans before flush Tail-Based Sampling: 100% errors + slow traces kept 1% success traces (probabilistic) Result: ~1% storage, 100% error visibility Trace Storage Elasticsearch (Jaeger) / Tempo Indexed by trace_id Partitioned by time (hourly) Retention: 7-14 days Secondary index: service, operation Object store for cold traces LAYER 3 · VISUALIZATION (Jaeger UI) Waterfall View (Nested Spans) API Gateway 200ms Trip 50ms Payment 80ms DB 15ms ? slow query! Service Dependency Map API Trip Pay DB Root Cause Analysis Latency histograms per service P50 / P95 / P99 breakdown Identify: which span caused the overall latency spike? Compare error vs success traces Sampling: head-based 1% + tail-based 100% errors | <5ms overhead per span | Context: HTTP header + Kafka header + gRPC metadata Clock skew: NTP ·10ms acceptable | High cardinality tags: limit unique values | Trace depth limit: prevent fan-out explosion
ComponentRoleImplementation
InstrumentationGenerate spans at each serviceOpenTelemetry SDK · auto-instrument HTTP, gRPC, DB calls
Context PropagationPass trace_id + span_id across boundariesW3C Traceparent header (HTTP), Kafka header (async), baggage items
SamplingDecide which traces to keepHead-based (decide at entry) + tail-based (decide after seeing outcome)
Agent/CollectorBuffer and forward spansOTel Collector sidecar ? central collector ? storage
StorageStore spans for queryingElasticsearch (Jaeger), Tempo (Grafana), X-Ray (AWS)
Query/UIVisualize traces, find bottlenecksJaeger UI · waterfall view, service dependency graph, latency histograms
Sampling strategies: Head-based (1%) · decide at ingress, propagate decision. Simple but misses rare errors. Tail-based (100% errors) · buffer spans, decide after trace completes. Captures all errors + slow traces. Adaptive · increase sampling for services with high error rates. Uber uses tail-based for errors + 1% probabilistic for baseline.
Async propagation: Kafka message carries traceparent in headers. Consumer creates child span linked to producer's span. Same for SQS (message attributes), cron jobs (inject trace_id in job metadata). Result: full trace from HTTP request ? Kafka ? consumer ? DB · all correlated.
Challenges: Clock skew · spans from different hosts have different clocks (use NTP + accept ·10ms). High cardinality · too many unique tag values explode storage. Trace too large · fan-out to 1000 services = 1000 spans per trace (limit depth). Overhead · must be <5ms per span or developers disable it.
Real-world: Uber · Jaeger (open-sourced, CNCF). Google · Dapper (paper that started it all). AWS · X-Ray. Datadog · APM with distributed tracing. Grafana · Tempo (cost-effective, object storage backend).

Interview Cheat Sheet

The 7 things to say for distributed tracing design

1. Trace = tree of spans · each span = one operation in one service, linked by trace_id
2. Context propagation · W3C Traceparent header (HTTP), Kafka headers (async)
3. Head-based sampling (1%) · decide at ingress, propagate decision downstream
4. Tail-based sampling (100% errors) · buffer spans, keep all error/slow traces
5. OpenTelemetry SDK · auto-instrument HTTP, gRPC, DB calls (zero code changes)
6. Collector pipeline · agent (sidecar) ? central collector ? storage (ES/Tempo)
7. <5ms overhead per span · must be invisible to application performance