How does LaunchDarkly evaluate 1M+ flag checks/sec with <5ms latency?

🎯 Design a feature flag system: 1M+ checks/sec, P99 <5ms, 10s propagation to 100K servers

Concepts Involved

Caching · CDN · WebSocket · Pub/Sub · Consistency

Problem Statement

How does a feature flag platform evaluate 1M+ flag checks per second with P99 <5ms latency, support complex targeting rules (user segments, percentages, custom attributes), and propagate flag changes to 100K+ servers within 10 seconds?

Core challenge: Flag evaluation must be local and fast (no network call per check), but changes must propagate globally in seconds. Complex targeting rules (10% of users in segment X with attribute Y) need efficient evaluation without per-request API calls.

1M+ flag checks / sec
P99 <5ms evaluation latency
10s global propagation
100K+ connected servers

Architecture

Local evaluation: The SDK downloads all flag rules at startup and keeps them in memory. Every flag check is a local computation · the user context is matched against the targeting rules, with hashing used only for percentage rollouts. No network call per evaluation → sub-millisecond evaluation, well inside the P99 <5ms target.
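A minimal sketch of the local-evaluation idea, assuming a simplified flag shape (`on`, `value`, `off_value`). The class name `LocalFlagStore` and its methods are hypothetical and are not taken from any real SDK; the point is only that the hot path is a dictionary lookup in process memory.

```python
import threading
from typing import Any, Dict


class LocalFlagStore:
    """Hypothetical in-process flag store (illustrative names, not a real SDK API)."""

    def __init__(self) -> None:
        self._flags: Dict[str, Dict[str, Any]] = {}
        self._write_lock = threading.Lock()

    def load_all(self, flags: Dict[str, Dict[str, Any]]) -> None:
        # Called at startup with the full rule set fetched from the flag service.
        # The snapshot is copied and swapped in as a whole, so readers never
        # observe a half-loaded rule set.
        snapshot = dict(flags)
        with self._write_lock:
            self._flags = snapshot

    def variation(self, key: str, default: Any) -> Any:
        # Hot path: a dictionary lookup in process memory, no network I/O.
        flag = self._flags.get(key)
        if flag is None:
            return default  # unknown flag: fall back to the hardcoded default
        if not flag.get("on", False):
            return flag.get("off_value", default)
        return flag["value"]


store = LocalFlagStore()
store.load_all({"new-checkout": {"on": True, "value": True}})
enabled = store.variation("new-checkout", default=False)
```

Swapping the whole dictionary reference rather than mutating it in place is one way to keep reads lock-free; a real SDK may make a different choice.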

Propagation via streaming: Flag changes are pushed through an SSE/WebSocket relay layer. Delta updates (only the changed flags) minimize bandwidth. Relay proxies handle the fan-out to 100K+ connected SDKs, so the origin does not hold every connection itself. Fallback: the SDK polls every 30s if the stream disconnects.
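How an SDK might apply one delta update can be sketched as follows. The message shape (`key`, `version`, `data`) and the per-flag version check are assumptions made for illustration; the source only states that updates carry the changed flags. The version guard is one common way to make updates safe against duplicates and out-of-order delivery.

```python
from typing import Any, Dict


def apply_delta(flags: Dict[str, Dict[str, Any]], delta: Dict[str, Any]) -> bool:
    """Apply one changed flag to the in-memory store.

    Returns True if the store was modified, False if the delta was ignored.
    The delta format here is hypothetical: {"key", "version", "data"}.
    """
    key = delta["key"]
    version = delta["version"]
    current = flags.get(key)
    # Ignore duplicates and updates older than what is already stored.
    if current is not None and current["version"] >= version:
        return False
    flags[key] = {**delta["data"], "version": version}
    return True


flags: Dict[str, Dict[str, Any]] = {}
apply_delta(flags, {"key": "new-checkout", "version": 2, "data": {"on": True}})
# A late-arriving older update does not overwrite the newer state.
apply_delta(flags, {"key": "new-checkout", "version": 1, "data": {"on": False}})
```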

Anti-patterns: API call per flag check · adds latency, creates SPOF. Polling-only updates · minutes of stale flags. No fallback defaults · service crashes if flag service is down. Unbounded rule complexity · evaluation time explodes.

Targeting rules: Evaluated in order: individual targets → segment rules → percentage rollout → default. Percentage uses deterministic hashing (same user always gets same variant) · consistent experience across requests.
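The evaluation order above can be sketched as a single function. The flag layout (`targets`, `rules`, `rollout`, `default`) and the choice of SHA-256 for bucketing are assumptions for illustration, not the format or hash any particular platform uses. Note that Python's built-in `hash()` is randomized per process for strings, so a stable digest is needed for a bucket that stays the same across servers and restarts.

```python
import hashlib
from typing import Any, Dict


def bucket(user_id: str, flag_key: str) -> int:
    """Map (user, flag) to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def evaluate(flag: Dict[str, Any], user: Dict[str, Any]) -> Any:
    # 1. Individual targets: explicit user-to-variation overrides.
    targets = flag.get("targets", {})
    if user["id"] in targets:
        return targets[user["id"]]
    # 2. Segment rules: first rule whose attribute matches wins.
    for rule in flag.get("rules", []):
        if user.get(rule["attribute"]) in rule["values"]:
            return rule["variation"]
    # 3. Percentage rollout: deterministic, so a user keeps the same variant.
    rollout = flag.get("rollout")
    if rollout is not None and bucket(user["id"], flag["key"]) < rollout["percent"]:
        return rollout["variation"]
    # 4. Default.
    return flag["default"]


flag = {
    "key": "new-checkout",
    "targets": {"u1": "on"},
    "rules": [{"attribute": "plan", "values": ["beta"], "variation": "on"}],
    "rollout": {"percent": 10, "variation": "on"},
    "default": "off",
}
variant = evaluate(flag, {"id": "u2", "plan": "beta"})  # matched by the segment rule
```

Including the flag key in the hash input means a user who lands in the first 10% for one flag is not automatically in the first 10% for every other flag.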

Scale Estimation

Step	Derivation	Result	Design Impact
1	Flag checks: 1M/sec across all services	1M evaluations/sec	Must be local (in-process) · no network call per check
2	Flags per project: ~500 flags × 10 rules each	~5K rules in memory	Fits in ~10MB RAM per SDK instance · trivial
3	Connected SDKs: 100K servers × 1 connection each	100K persistent connections	Relay proxy layer for fan-out (not direct to origin)
4	Propagation: change → relay → all SDKs	<10s end-to-end	SSE streaming with delta updates (only changed flags)
5	Event analytics: 1M checks/sec × 100 bytes	~100 MB/sec telemetry	Sampled (1-10%) + aggregated client-side before sending
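The arithmetic behind the table can be checked in a few lines. The inputs come from the table itself; the one added assumption is `bytes_per_rule`, an average of about 2 KB per rule, which is simply what the ~10MB figure implies for 5K rules and is not a measured value.

```python
# Memory footprint of the rule set held by each SDK instance.
flags_per_project = 500
rules_per_flag = 10
total_rules = flags_per_project * rules_per_flag      # 5,000 rules

bytes_per_rule = 2_000  # assumption: ~2 KB average, implied by the ~10MB estimate
memory_mb = total_rules * bytes_per_rule / 1e6         # 10.0 MB

# Raw telemetry volume if every check emitted an event.
checks_per_sec = 1_000_000
bytes_per_event = 100
raw_telemetry_mb_s = checks_per_sec * bytes_per_event / 1e6  # 100.0 MB/sec

# Volume after sampling at 1% and 10%, before any client-side aggregation.
sampled_mb_s = {pct: raw_telemetry_mb_s * pct / 100 for pct in (1, 10)}
```

Sampling alone brings 100 MB/sec down to between 1 and 10 MB/sec; client-side aggregation reduces it further.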

Resilience & Edge Cases

Failure	Impact	Recovery
Stream disconnected	SDK can't receive flag updates	Use last-known state (in-memory cache). Fallback: poll every 30s. Hardcoded defaults as last resort.
Flag service completely down	No updates propagate	SDK continues with cached rules indefinitely. App never crashes due to flag service outage.
Bad flag rule deployed	Feature broken for all users	Kill switch: emergency flag override, pushed to all SDKs within seconds via streaming. Audit log for rollback.
Percentage rollout inconsistency	User sees different variant on different requests	Deterministic hash: hash(user_id + flag_key) % 100 · always same result for same user.
Empty SDK cache after deploy	New service instance has no flags	SDK fetches the full flag set on startup (blocking init). The instance is ready to serve only after the initial load completes.
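The degradation chain in the table (last-known state first, hardcoded default last) can be sketched as a wrapper around evaluation. The function name `variation_or_default` and its signature are hypothetical; the evaluator is passed in as a parameter so the sketch stays self-contained.

```python
from typing import Any, Callable, Dict

Evaluator = Callable[[Dict[str, Any], Dict[str, Any]], Any]


def variation_or_default(
    cache: Dict[str, Dict[str, Any]],
    key: str,
    user: Dict[str, Any],
    default: Any,
    evaluate: Evaluator,
) -> Any:
    """Evaluate from the last-known in-memory state and never raise to the caller."""
    flag = cache.get(key)
    if flag is None:
        # Flag not loaded (or never existed): use the hardcoded default.
        return default
    try:
        return evaluate(flag, user)
    except Exception:
        # A malformed rule must not take the application down.
        return default


# With an empty cache the call site still gets a usable answer.
result = variation_or_default({}, "new-checkout", {"id": "u1"}, False, lambda f, u: f["on"])
```

Requiring a default at every call site is what lets the application keep running when the flag service, the stream, and the cache are all unavailable.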

Interview Cheat Sheet

1. In-process evaluation · SDK holds rules locally, no network call per check (<1ms)
2. Streaming propagation · SSE/WebSocket push, delta updates, 10s global propagation
3. Deterministic hashing · hash(user_id + flag) % 100 for consistent percentage rollouts
4. Rule evaluation order · individual → segment → percentage → default
5. Graceful degradation · SDK uses last-known state if stream disconnects, hardcoded defaults as last resort
6. Relay proxy · intermediate cache for high fan-out, reduces load on origin
