System Design Case Study

How does LaunchDarkly evaluate 1M+ flag checks/sec with <5ms latency?

Design a feature flag system: 1M+ checks/sec, P99 <5ms, 10s propagation to 100K servers

Problem Statement

How does a feature flag platform evaluate 1M+ flag checks per second with P99 <5ms latency, support complex targeting rules (user segments, percentages, custom attributes), and propagate flag changes to 100K+ servers within 10 seconds?

Core challenge: Flag evaluation must be local and fast (no network call per check), but changes must propagate globally in seconds. Complex targeting rules (10% of users in segment X with attribute Y) need efficient evaluation without per-request API calls.
1M+ flag checks/sec
P99 <5ms evaluation latency
10s global propagation
100K+ connected servers

Architecture

LAYER 1: MANAGEMENT (Dashboard → Flag Store → Change Event)

Dashboard
- Toggle flags on/off
- Edit targeting rules
- Define user segments
- % rollout configuration
- Audit log per change

Flag Store DB
- Rules + segments + targeting conditions
- Versioned configs (v1, v2, v3...)
- ~500 flags × 10 rules = ~5K rules total
- Fits in ~10MB RAM per SDK
- Immutable versions for rollback

Change Event
- Trigger propagation on any edit
- Delta payload (only the diff)
- Version N → N+1 (incremental)
- No full sync per change
- Minimizes bandwidth to 100K+ SDKs

LAYER 2: DISTRIBUTION (Stream Relay → 100K+ SDK Instances)

Stream Relay Cluster
- SSE / WebSocket persistent connections
- Fan-out delta updates to 100K+ SDKs
- Horizontal scale: N relay nodes, each holds a subset of connections
- Intermediate cache: reduces load on origin DB
- Fallback: polling every 30s if stream disconnects

100K+ SDK Instances (SDK Instance A, SDK Instance B, ...)
- In-process cache, local eval <1ms
- All rules cached in process memory
- No network call per flag check
- Graceful degradation: use last-known state if stream disconnects

LAYER 3: EVALUATION (Local In-Process, <1ms, Deterministic)

App Code
- if (flag("new-ui")) { showNewUI(); }
- Response: <1ms, pure in-memory lookup

SDK Evaluates Locally (<1ms): Rule Evaluation Order
1. Individual Targets: user_id in list?
2. Segment Rules: user in segment?
3. % Rollout: hash(uid + flag) % 100
4. Default: fallback value

Deterministic: hash(user_id + flag_key) % 100, so the same user always gets the same variant. Consistent experience across requests, no randomness, reproducible.

Summary: 10s propagation end-to-end | 1M+ evals/sec across fleet | ~10MB RAM per SDK | Polling fallback every 30s | Hardcoded defaults as last resort
Local evaluation: The SDK downloads all flag rules at startup and keeps them in memory. Every flag check is a local computation that matches the user context against the targeting rules. With no network call per evaluation, latency stays sub-millisecond, well inside the P99 <5ms target.
Propagation via streaming: Flag changes are pushed through an SSE/WebSocket relay layer. Delta updates (only changed flags) minimize bandwidth. Relay proxies handle fan-out to 100K+ connected SDKs. Fallback: polling every 30s if the stream disconnects.
Anti-patterns: API call per flag check (adds latency, creates a SPOF). Polling-only updates (flags can stay stale for minutes). No fallback defaults (the service crashes if the flag service is down). Unbounded rule complexity (evaluation time explodes).
Targeting rules: Evaluated in order: individual targets → segment rules → percentage rollout → default. The percentage rollout uses deterministic hashing, so the same user always gets the same variant and sees a consistent experience across requests.
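The evaluation order and the deterministic bucketing described above can be sketched in a few lines of Python. This is an illustrative sketch, not LaunchDarkly's actual SDK code: the names Flag, bucket and evaluate are hypothetical, variants are simplified to booleans, and SHA-256 with a ":" separator is an assumed choice (the text only specifies hash(user_id + flag_key) % 100). A stable digest is used because Python's built-in hash() is randomized per process for strings, which would break consistency across servers.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Flag:
    # Hypothetical in-memory shape of one flag's rules.
    key: str
    default: bool = False
    individual_targets: dict = field(default_factory=dict)  # user_id -> variant
    segment_rules: list = field(default_factory=list)       # (segment, variant) pairs
    rollout_percent: int = 0                                 # 0..100


def bucket(user_id: str, flag_key: str) -> int:
    """Map (user, flag) to a stable bucket in 0..99."""
    # Assumed hash: SHA-256 over "user_id:flag_key"; any stable hash would do.
    digest = hashlib.sha256(f"{user_id}:{flag_key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def evaluate(flag: Flag, user_id: str, user_segments: set) -> bool:
    # 1. Individual targets take priority over everything else.
    if user_id in flag.individual_targets:
        return flag.individual_targets[user_id]
    # 2. Segment rules: the first matching segment wins.
    for segment, variant in flag.segment_rules:
        if segment in user_segments:
            return variant
    # 3. Percentage rollout: users in buckets below the threshold are enabled.
    if bucket(user_id, flag.key) < flag.rollout_percent:
        return True
    # 4. Default fallback value.
    return flag.default
```

Because the bucket depends only on the user ID and the flag key, raising rollout_percent from 10 to 20 keeps every user who was already enabled and adds new ones, and each flag gets an independent split of the user base.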

Scale Estimation

Step | Derivation | Result | Design Impact
1 | Flag checks: 1M/sec across all services | 1M evaluations/sec | Must be local (in-process), no network call per check
2 | Flags per project: ~500 flags × 10 rules each | ~5K rules in memory | Fits in ~10MB RAM per SDK instance (trivial)
3 | Connected SDKs: 100K servers × 1 connection each | 100K persistent connections | Relay proxy layer for fan-out (not direct to origin)
4 | Propagation: change → relay → all SDKs | <10s end-to-end | SSE streaming with delta updates (only changed flags)
5 | Event analytics: 1M checks/sec × 100 bytes | ~100 MB/sec telemetry | Sampled (1-10%) + aggregated client-side before sending
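The arithmetic behind the table can be checked with a short back-of-the-envelope script. Two inputs are assumptions added for illustration and do not come from the estimate itself: roughly 2 KB per serialized rule (which is what makes ~5K rules come to ~10MB), and a hypothetical capacity of 10,000 connections per relay node.

```python
import math

# Step 2: rules held in memory per SDK instance.
FLAGS = 500
RULES_PER_FLAG = 10
BYTES_PER_RULE = 2_000            # assumption: ~2 KB per serialized rule
rules = FLAGS * RULES_PER_FLAG                        # 5_000 rules
rule_memory_mb = rules * BYTES_PER_RULE // 1_000_000  # 10 MB

# Step 3: relay nodes needed for the persistent connections.
SDKS = 100_000
CONNS_PER_RELAY = 10_000          # assumption: hypothetical per-node capacity
relay_nodes = math.ceil(SDKS / CONNS_PER_RELAY)       # 10 nodes

# Step 5: raw and sampled telemetry volume.
CHECKS_PER_SEC = 1_000_000
EVENT_BYTES = 100
raw_mb_per_sec = CHECKS_PER_SEC * EVENT_BYTES // 1_000_000  # 100 MB/sec


def sampled_mb_per_sec(sample_percent: int) -> float:
    """Telemetry volume after sampling, before client-side aggregation."""
    return raw_mb_per_sec * sample_percent / 100


print(rules, rule_memory_mb, relay_nodes, raw_mb_per_sec)
print(sampled_mb_per_sec(1), sampled_mb_per_sec(10))
```

Sampling at 1-10% brings the telemetry stream down to roughly 1-10 MB/sec before any client-side aggregation is applied.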

Resilience & Edge Cases

Failure | Impact | Recovery
Stream disconnected | SDK can't receive flag updates | Use last-known state (in-memory cache). Fallback: poll every 30s. Hardcoded defaults as last resort.
Flag service completely down | No updates propagate | SDK continues with cached rules indefinitely. App never crashes due to a flag service outage.
Bad flag rule deployed | Feature broken for all users | Kill switch: emergency flag override, propagated via streaming within the 10s window. Audit log for rollback.
Percentage rollout inconsistency | User sees different variant on different requests | Deterministic hash: hash(user_id + flag_key) % 100 always gives the same result for the same user.
Stale SDK cache after deploy | New service instance has no flags | SDK fetches the full flag set on startup (blocking init). Ready to serve only after the initial load.
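The recovery behaviors in this table (full load at startup, incremental N → N+1 deltas, last-known state, hardcoded defaults) can be illustrated with a small in-process store. This is a minimal sketch under assumptions: FlagStore and its methods are hypothetical names, flag configs are reduced to a dict with an "on" key, and a version gap is handled by refusing the delta so the caller can trigger a full resync.

```python
from dataclasses import dataclass, field


@dataclass
class FlagStore:
    # Hypothetical SDK-side cache of flag configs.
    version: int = 0
    flags: dict = field(default_factory=dict)  # flag_key -> {"on": bool}
    initialized: bool = False

    def load_full(self, version: int, flags: dict) -> None:
        """Blocking init or resync: replace the whole cache."""
        self.version = version
        self.flags = dict(flags)
        self.initialized = True

    def apply_delta(self, from_version: int, to_version: int,
                    changed: dict, removed: tuple = ()) -> bool:
        """Apply an incremental update; return False if a full resync is needed."""
        if not self.initialized or from_version != self.version:
            # A missed update would leave the cache inconsistent, so keep
            # the last-known state untouched and ask for a full sync.
            return False
        self.flags.update(changed)
        for key in removed:
            self.flags.pop(key, None)
        self.version = to_version
        return True

    def is_enabled(self, key: str, hardcoded_default: bool) -> bool:
        """Local lookup; falls back to the caller's default if the flag is unknown."""
        config = self.flags.get(key)
        if config is None:
            return hardcoded_default
        return config["on"]
```

A short usage example:

```python
store = FlagStore()
store.load_full(3, {"new-ui": {"on": False}})
store.apply_delta(3, 4, {"new-ui": {"on": True}})  # accepted, version is now 4
store.apply_delta(6, 7, {"new-ui": {"on": False}})  # rejected: versions 5 and 6 were missed
store.is_enabled("new-ui", hardcoded_default=False)  # still served from last-known state
```

Passing the hardcoded default at the call site means a flag check still returns a sensible value even when the SDK has never managed to load anything.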

Interview Cheat Sheet

1. In-process evaluation: SDK holds rules locally, no network call per check (<1ms)
2. Streaming propagation: SSE/WebSocket push, delta updates, 10s global propagation
3. Deterministic hashing: hash(user_id + flag) % 100 for consistent percentage rollouts
4. Rule evaluation order: individual → segment → percentage → default
5. Graceful degradation: SDK uses last-known state if stream disconnects, hardcoded defaults as last resort
6. Relay proxy: intermediate cache for high fan-out, reduces load on origin