How does Uber trace requests across 1000+ microservices?

🎯 Design a distributed tracing system: 1000+ services, async boundaries, intelligent sampling

Concepts Involved

Tracing Kafka Elasticsearch Stream Processing Sampling

Problem Statement

How does a distributed tracing system correlate requests across 1000+ microservices, propagating trace context through async boundaries (Kafka, queues) and sampling intelligently to keep storage under 1% of total traffic while capturing all error traces?

Core challenge: A single ride request touches 50+ services. When it fails, which service caused it✗ You need end-to-end visibility. But tracing every request at 1M+ RPS = petabytes/day. Intelligent sampling must capture errors while discarding boring successes.

1000+

microservices

per request: 50+ spans

sampling rate

but 100% of errors

Async

context propagation

Kafka, queues, cron

<5ms

overhead per span

must be invisible

Architecture · Trace Collection Pipeline

Component	Role	Implementation
Instrumentation	Generate spans at each service	OpenTelemetry SDK · auto-instrument HTTP, gRPC, DB calls
Context Propagation	Pass trace_id + span_id across boundaries	W3C Traceparent header (HTTP), Kafka header (async), baggage items
Sampling	Decide which traces to keep	Head-based (decide at entry) + tail-based (decide after seeing outcome)
Agent/Collector	Buffer and forward spans	OTel Collector sidecar → central collector → storage
Storage	Store spans for querying	Elasticsearch (Jaeger), Tempo (Grafana), X-Ray (AWS)
Query/UI	Visualize traces, find bottlenecks	Jaeger UI · waterfall view, service dependency graph, latency histograms

Sampling strategies: Head-based (1%) · decide at ingress, propagate decision. Simple but misses rare errors. Tail-based (100% errors) · buffer spans, decide after trace completes. Captures all errors + slow traces. Adaptive · increase sampling for services with high error rates. Uber uses tail-based for errors + 1% probabilistic for baseline.

Async propagation: Kafka message carries traceparent in headers. Consumer creates child span linked to producer's span. Same for SQS (message attributes), cron jobs (inject trace_id in job metadata). Result: full trace from HTTP request → Kafka → consumer → DB · all correlated.

Challenges: Clock skew · spans from different hosts have different clocks (use NTP + accept ·10ms). High cardinality · too many unique tag values explode storage. Trace too large · fan-out to 1000 services = 1000 spans per trace (limit depth). Overhead · must be <5ms per span or developers disable it.

Real-world: Uber · Jaeger (open-sourced, CNCF). Google · Dapper (paper that started it all). AWS · X-Ray. Datadog · APM with distributed tracing. Grafana · Tempo (cost-effective, object storage backend).

Interview Cheat Sheet

The 7 things to say for distributed tracing design

1. Trace = tree of spans · each span = one operation in one service, linked by trace_id
2. Context propagation · W3C Traceparent header (HTTP), Kafka headers (async)
3. Head-based sampling (1%) · decide at ingress, propagate decision downstream
4. Tail-based sampling (100% errors) · buffer spans, keep all error/slow traces
5. OpenTelemetry SDK · auto-instrument HTTP, gRPC, DB calls (zero code changes)
6. Collector pipeline · agent (sidecar) → central collector → storage (ES/Tempo)
7. <5ms overhead per span · must be invisible to application performance

System Design Case Study

Problem Statement

Architecture · Trace Collection Pipeline

Interview Cheat Sheet