How does a distributed tracing system correlate requests across 1000+ microservices, propagating trace context through async boundaries (Kafka, queues) and sampling intelligently to keep storage under 1% of total traffic while capturing all error traces?
Core challenge: A single ride request touches 50+ services. When it fails, which service caused it? You need end-to-end visibility. But tracing every request at 1M+ RPS = petabytes/day. Intelligent sampling must capture errors while discarding boring successes.
1000+
microservices
per request: 50+ spans
1%
sampling rate
but 100% of errors
Async
context propagation
Kafka, queues, cron
<5ms
overhead per span
must be invisible
Architecture · Trace Collection Pipeline
Component
Role
Implementation
Instrumentation
Generate spans at each service
OpenTelemetry SDK · auto-instrument HTTP, gRPC, DB calls
Head-based (decide at entry) + tail-based (decide after seeing outcome)
Agent/Collector
Buffer and forward spans
OTel Collector sidecar ? central collector ? storage
Storage
Store spans for querying
Elasticsearch (Jaeger), Tempo (Grafana), X-Ray (AWS)
Query/UI
Visualize traces, find bottlenecks
Jaeger UI · waterfall view, service dependency graph, latency histograms
Sampling strategies:Head-based (1%) · decide at ingress, propagate decision. Simple but misses rare errors. Tail-based (100% errors) · buffer spans, decide after trace completes. Captures all errors + slow traces. Adaptive · increase sampling for services with high error rates. Uber uses tail-based for errors + 1% probabilistic for baseline.
Async propagation: Kafka message carries traceparent in headers. Consumer creates child span linked to producer's span. Same for SQS (message attributes), cron jobs (inject trace_id in job metadata). Result: full trace from HTTP request ? Kafka ? consumer ? DB · all correlated.
Challenges:Clock skew · spans from different hosts have different clocks (use NTP + accept ·10ms). High cardinality · too many unique tag values explode storage. Trace too large · fan-out to 1000 services = 1000 spans per trace (limit depth). Overhead · must be <5ms per span or developers disable it.
Real-world:Uber · Jaeger (open-sourced, CNCF). Google · Dapper (paper that started it all). AWS · X-Ray. Datadog · APM with distributed tracing. Grafana · Tempo (cost-effective, object storage backend).
Interview Cheat Sheet
The 7 things to say for distributed tracing design
1.Trace = tree of spans · each span = one operation in one service, linked by trace_id 2.Context propagation · W3C Traceparent header (HTTP), Kafka headers (async) 3.Head-based sampling (1%) · decide at ingress, propagate decision downstream 4.Tail-based sampling (100% errors) · buffer spans, keep all error/slow traces 5.OpenTelemetry SDK · auto-instrument HTTP, gRPC, DB calls (zero code changes) 6.Collector pipeline · agent (sidecar) ? central collector ? storage (ES/Tempo) 7.<5ms overhead per span · must be invisible to application performance