1. Chat & Messaging (WhatsApp, Slack, Discord, Telegram)
Real-time delivery, fan-out, presence, ordering at scale · 11 problems
| # | Problem | Company / Scale |
| 1 | How can a real-time chat system deliver a single message to 50,000 online users within 200ms while handling 10B+ messages per day (~115K msgs/sec, peak ~200K msgs/sec) without hitting scalability bottlenecks? | Slack · 10B msg/day |
| 2 | How does a messaging system guarantee delivery to 2B+ users who go offline for hours/days, ensuring zero message loss, correct ordering on reconnect, and end-to-end encryption without the server ever seeing plaintext? | WhatsApp · 2B users |
| 3 | How does a chat platform maintain 10M+ concurrent WebSocket connections across thousands of gateway servers, handling heartbeats, shard assignment, graceful failover, and voice signaling without dropping connections? | Discord · millions WS |
| 4 | How does a multi-region chat system guarantee causal message ordering within conversations when an entire cloud region goes down, while maintaining consistency and conflict resolution across geographically distributed nodes? | Multi-region ordering |
| 5 | How does a group chat with 1M+ members deliver a single message without causing a fan-out explosion, keeping delivery latency under 2 seconds for all participants regardless of group size? | Telegram · 1M members |
| 6 | How does a presence system track online/offline status for 100M+ concurrent users in real-time, delivering status changes to relevant contacts within seconds without flooding the network with unnecessary updates? | WhatsApp · 100M presence |
| 7 | How does a chat system handle millions of ephemeral "typing..." events per second without persisting anything to disk, while ensuring sub-100ms delivery to all conversation participants? | Slack · ephemeral events |
| 8 | How does a search system index 10B+ messages with <5s indexing latency from send to searchable, and return full-text results in <100ms across the entire message corpus? | Slack · billions msgs search |
| 9 | How does a notification system deliver push notifications to millions of devices without duplicates, handling device token lifecycle, retry logic, and cross-platform delivery (iOS/Android) reliably? | WhatsApp · dedup push |
| 10 | How does a multi-device messaging app keep messages, read status, and edits perfectly synced across phone/tablet/desktop, resolving conflicts when edits happen simultaneously on different devices? | Telegram · multi-device sync |
| 11 | Given message events across millions of channels, design a system that continuously computes the top active channels/users for the last 1 hour without scanning historical data, updating rankings within seconds of activity changes. | Slack · top active channels |
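Problem 11 asks for a rolling top-K that never rescans history. One common way to approach it, shown here only as a minimal single-process sketch, is to keep per-minute count buckets plus a running total and subtract a bucket when it slides out of the window. The class name `SlidingTopK`, the bucket size, and the assumption that timestamps arrive in non-decreasing order are all illustrative choices, not part of the problem statement.

```python
import heapq
from collections import Counter, deque


class SlidingTopK:
    """Top-K event counts over a sliding window built from fixed time buckets."""

    def __init__(self, window_seconds=3600, bucket_seconds=60):
        self.window_seconds = window_seconds
        self.bucket_seconds = bucket_seconds
        self.buckets = deque()   # (bucket_start, Counter), oldest first
        self.totals = Counter()  # running counts for the whole window

    def _expire(self, now):
        cutoff = now - self.window_seconds
        # A bucket covers [start, start + bucket_seconds); drop it once it
        # ends at or before the cutoff, subtracting only its own counts.
        while self.buckets and self.buckets[0][0] + self.bucket_seconds <= cutoff:
            _, counts = self.buckets.popleft()
            for key, n in counts.items():
                self.totals[key] -= n
                if self.totals[key] <= 0:
                    del self.totals[key]

    def record(self, key, ts):
        """Count one event; assumes timestamps arrive in non-decreasing order."""
        self._expire(ts)
        start = ts - ts % self.bucket_seconds
        if not self.buckets or self.buckets[-1][0] != start:
            self.buckets.append((start, Counter()))
        self.buckets[-1][1][key] += 1
        self.totals[key] += 1

    def top(self, k, now):
        """Return the k busiest keys in the window ending at `now`."""
        self._expire(now)
        return heapq.nlargest(k, self.totals.items(), key=lambda kv: kv[1])
```

At real scale the same idea would typically be partitioned by channel and merged across workers; late or out-of-order events would need extra handling that this sketch omits.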
2. Real-Time Collaboration (Google Docs, Figma, Notion, Miro)
Concurrent editing, CRDTs, OT, multiplayer cursors · 7 problems
| # | Problem | Company / Scale |
| 1 | How does a collaborative editor handle 100 users editing the same paragraph simultaneously, resolving conflicting character insertions without a central lock while maintaining convergence across all clients within 50ms? | Google Docs · OT |
| 2 | How does a design tool render 50+ cursors moving at 60fps in real-time, achieving conflict-free state merge across all clients while keeping bandwidth under 10KB/sec per user? | Figma · CRDTs |
| 3 | How does a block-based editor sync granular edits (move block, change text, nest) across devices in <100ms, handling concurrent modifications to the same block structure without data loss? | Notion · block sync |
| 4 | How does a collaborative whiteboard handle 1000+ objects being dragged simultaneously by 50 users, merging conflicting position updates without visual glitches or lost operations? | Miro · spatial CRDT |
| 5 | How does a collaborative code editor sync cursor positions, selections, and edits across continents with <150ms latency, maintaining consistent document state despite network delays between geographically distributed participants? | VS Code Live Share |
| 6 | How does a multiplayer game sync world state for 100 players at 60fps, handling client-side prediction, server reconciliation, and entity interpolation while keeping perceived latency below 100ms? | Gaming · state sync |
| 7 | How does a design platform handle real-time collaboration on documents with 500MB+ media assets, ensuring edit operations remain fast (<100ms) regardless of total document size? | Canva · media + collab |
3. Video Streaming (Netflix, YouTube, Twitch, Spotify)
Adaptive bitrate, CDN, transcoding, live vs VOD · 8 problems
| # | Problem | Company / Scale |
| 1 | How does a streaming platform deliver 4K HDR video to 230M+ subscribers globally without buffering, adapting quality in real-time to each viewer's bandwidth while minimizing rebuffer events to <0.1% of sessions? | Netflix · 230M users |
| 2 | How does a video platform transcode 500+ hours of video uploaded every minute into 8+ resolutions, finishing 4K transcoding in <30 minutes while prioritizing live content over on-demand uploads? | YouTube · 500hr/min |
| 3 | How does a live streaming platform deliver video to 5M+ concurrent viewers with <3 second glass-to-glass latency, while also supporting sub-second interactive streams for smaller audiences? | Twitch · live low-latency |
| 4 | How does a streaming platform handle 25M concurrent viewers during a single live event, gracefully degrading quality under load while maintaining stream continuity and minimizing regional failures? | Hotstar · 25M concurrent |
| 5 | How does a short-video platform achieve instant playback (<200ms to first frame) for an infinite-scroll feed, ensuring zero perceived loading time as users swipe between videos? | TikTok · instant playback |
| 6 | How does adaptive bitrate streaming prevent buffering on degrading networks, seamlessly switching quality mid-stream without visible artifacts while maximizing video quality for available bandwidth? | ABR · HLS/DASH |
| 7 | How do CDNs cache and serve video segments at the edge for 1B+ daily requests, maximizing cache hit rates while minimizing origin load and ensuring popular content is always available at the nearest edge? | CDN · edge caching |
| 8 | How does an audio streaming platform achieve gapless playback with offline mode, ensuring zero gaps between tracks, seamless quality adaptation over cellular, and encrypted local storage for offline content? | Spotify · audio stream |
4. Video Calling / WebRTC (Zoom, Google Meet, Discord)
P2P, SFU, MCU, signaling, NAT traversal · 6 problems
| # | Problem | Company / Scale |
| 1 | How does a video conferencing system handle 300-person meetings with screen sharing, adapting per-participant video quality in real-time based on available bandwidth while keeping total meeting latency under 200ms? | Zoom · 300 participants |
| 2 | How does a video calling platform dynamically switch between peer-to-peer (2 users, lowest latency) and server-relayed (3+ users) topology, performing seamless mid-call migration without audio/video interruption? | Google Meet · adaptive |
| 3 | How does a voice chat platform handle 100+ users in a single voice channel with <50ms audio latency, selectively mixing only active speakers while maintaining clear audio for all participants? | Discord · voice channels |
| 4 | How does a real-time communication system establish peer-to-peer connections through NATs and firewalls, discovering public endpoints and falling back to relay when direct connection is impossible, while minimizing connection setup time? | WebRTC · NAT traversal |
| 5 | How does a telehealth video platform ensure HIPAA-compliant calls with recording, maintaining end-to-end encryption, consent-based recording, audit trails, and data residency controls per jurisdiction? | Telehealth · compliance |
| 6 | How does a live audio room platform handle thousands of listeners with <500ms latency, dynamically promoting speakers from the audience while distributing audio efficiently to large audiences? | Twitter Spaces · audio rooms |
5. Ticket Booking (Uber, Airbnb, Amazon, Ticketmaster)
Seat locking, double-booking prevention, flash sales · 6 problems
| # | Problem | Company / Scale |
| 1 | Design a ticket booking system where 5M users attempt to book 50K concert seats simultaneously, guaranteeing no double booking, fair queue ordering with position tracking, distributed seat locking with expiry, and bot detection. | BookMyShow · 5M users, 50K seats |
| 2 | How does a movie theater chain handle seat selection for 500+ screens simultaneously, showing real-time seat availability updates to thousands of concurrent users while preventing double-booking through optimistic locking with sub-second conflict resolution? | AMC · real-time seat map |
| 3 | How does a rental platform prevent double-booking across time zones when two users in different continents try to book the same dates, detecting and resolving calendar conflicts with automatic resolution? | Airbnb · calendar sync |
| 4 | How do airlines handle seat selection with distributed inventory across 100+ booking channels (website, app, agents, GDS), preventing overselling while maintaining responsive seat availability display? | Airlines · distributed inv |
| 5 | How does a ticketing platform manage 1M+ users in a virtual queue with fair ordering, providing real-time position updates and estimated wait times while preventing bot abuse and queue jumping? | Ticketmaster · queue |
| 6 | Design a flash sale system where 20M users try to buy the same product within 2 minutes, preventing overselling while maintaining inventory consistency across 5 regions and confirming orders within seconds. | Amazon · 20M users, 2 min flash |
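The "seat locking with expiry" requirement in problem 1 can be illustrated with a small in-memory sketch: a hold belongs to one user, lapses after a timeout, and only the current holder may confirm. Everything here (`SeatLocks`, `hold_seconds`, the injectable `clock`) is a hypothetical name chosen for illustration; a distributed deployment would need an atomic set-if-absent-with-TTL primitive in a shared store rather than a Python dict.

```python
import time


class SeatLocks:
    """Single-process sketch of seat holds that expire if not confirmed."""

    def __init__(self, hold_seconds=300, clock=time.monotonic):
        self.hold_seconds = hold_seconds
        self.clock = clock   # injectable so expiry can be tested without sleeping
        self.holds = {}      # seat_id -> (user_id, expires_at)
        self.booked = {}     # seat_id -> user_id

    def hold(self, seat_id, user_id):
        """Try to place (or refresh) a temporary hold; False if the seat is taken."""
        now = self.clock()
        if seat_id in self.booked:
            return False
        current = self.holds.get(seat_id)
        if current and current[1] > now and current[0] != user_id:
            return False     # someone else holds a still-valid lock
        self.holds[seat_id] = (user_id, now + self.hold_seconds)
        return True

    def confirm(self, seat_id, user_id):
        """Turn a live hold into a booking; fails if the hold expired or is not ours."""
        now = self.clock()
        current = self.holds.get(seat_id)
        if seat_id in self.booked or current is None:
            return False
        if current[0] != user_id or current[1] <= now:
            return False
        self.booked[seat_id] = user_id
        del self.holds[seat_id]
        return True
```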
6. Cache & CDN (Cloudflare, Redis, Akamai)
Cache invalidation, thundering herd, edge compute · 8 problems
| # | Problem | Company / Scale |
| 1 | How does a social platform invalidate cached objects across 1000+ servers within 1 second of a write, preventing stale reads while avoiding thundering herd on the backing store? | Meta · TAO cache |
| 2 | How does a streaming platform prevent thundering herd when a hot cache key expires and 100K requests simultaneously hit the database, ensuring only one request rebuilds the cache while others wait or receive stale data? | Netflix · EVCache |
| 3 | How does an edge network serve 45M+ requests/sec from 300+ PoPs without hitting origin, maximizing cache hit rates through tiered caching and intelligent routing for cache misses? | Cloudflare · edge |
| 4 | How does a social platform cache the home timeline for 400M+ users, handling the asymmetry between normal users (hundreds of followers) and celebrities (100M+ followers) without overwhelming write capacity? | Twitter · timeline cache |
| 5 | How does a distributed cache cluster handle 10M+ ops/sec with automatic failover, completing replica promotion within 2 seconds of node failure while maintaining consistent routing during topology changes? | Redis Cluster · 10M ops/sec |
| 6 | How do CDNs purge cached content globally within 150ms for breaking news updates, invalidating stale content at all edge locations without causing origin overload from simultaneous cache misses? | CDN · instant purge |
| 7 | How does a social platform cache ephemeral content (Stories) that expires after 24 hours, ensuring instant access for sequential viewing while automatically evicting expired content without manual cleanup? | Instagram · TTL cache |
| 8 | How does an edge computing platform execute custom logic (authentication, A/B routing, header manipulation) at 300+ PoPs with <1ms cold start, deploying code changes globally in <30 seconds without origin round-trips? | Cloudflare Workers · edge compute |
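Problem 2 describes the "only one request rebuilds the cache while others wait" behaviour, often called request coalescing or single-flight. The sketch below shows that idea with threads inside one process; `SingleFlightCache` and `loader` are hypothetical names, and TTLs and stale-while-revalidate are deliberately left out. It is not a description of how EVCache or any named system implements it.

```python
import threading


class SingleFlightCache:
    """On a miss, one caller loads the value; concurrent callers wait for it."""

    def __init__(self, loader):
        self._loader = loader
        self._values = {}
        self._inflight = {}           # key -> Event set when the load finishes
        self._lock = threading.Lock()

    def get(self, key):
        while True:
            with self._lock:
                if key in self._values:
                    return self._values[key]
                event = self._inflight.get(key)
                leader = event is None
                if leader:
                    event = threading.Event()
                    self._inflight[key] = event
            if leader:
                try:
                    value = self._loader(key)   # the only backend call for this miss
                    with self._lock:
                        self._values[key] = value
                    return value
                finally:
                    with self._lock:
                        del self._inflight[key]
                    event.set()                 # wake waiters even if the load failed
            event.wait()
            # Loop again: read the value the leader stored, or, if the leader
            # failed, compete to become the next leader.
```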
7. Queues & Events (Kafka, RabbitMQ, AWS SQS)
Kafka, exactly-once, dead letters, event sourcing, streaming analytics, notifications · 12 problems
| # | Problem | Company / Scale |
| 1 | How does a ride-hailing platform process 1M+ ride events per second with city-level locality, guaranteeing exactly-once processing semantics and computing real-time surge pricing from the event stream? | Uber · Kafka 1M/sec |
| 2 | How does a payment platform guarantee exactly-once processing when network retries can duplicate requests, ensuring no double-charges while maintaining at-least-once delivery guarantees from upstream systems? | Stripe · exactly-once |
| 3 | How does a professional network handle 4 trillion events/day across 100K+ partitions, supporting schema evolution for backward compatibility while auto-scaling consumers based on lag? | LinkedIn · 4T events |
| 4 | How do you design a dead letter queue that never loses messages, isolating poison messages from healthy processing while providing retry policies, monitoring with alerting, and manual replay tooling for operations? | DLQ · poison messages |
| 5 | How does a streaming platform use event sourcing for microservices, storing all state changes as immutable events, rebuilding materialized views on demand, and handling schema evolution without breaking consumers? | Netflix · event sourcing |
| 6 | How does an e-commerce platform handle order events across 1M+ merchants in a multi-tenant event cluster, enforcing per-merchant quotas and providing priority lanes for high-volume sellers without noisy-neighbor effects? | Shopify · multi-tenant |
| 7 | How do you implement the Saga pattern for a distributed order→payment→inventory→shipping transaction, coordinating compensating transactions for rollback and handling timeouts across independently deployed services? | Saga · choreography |
| 8 | How does a distributed event streaming system handle consumer group rebalancing without message loss or duplicate processing, minimizing rebalance time while maintaining exactly-once delivery semantics? | Kafka · rebalance |
| 9 | Given ad impression and click events from billions of daily ad requests, design an analytics system that computes CTR/CPC dashboards per advertiser with less than 1-minute delay from event occurrence to dashboard visibility. | Google Ads · CTR <1min delay |
| 10 | Given IoT telemetry from 500M smart devices sending events every few seconds, design a real-time anomaly detection system that detects outages/spikes within 10 seconds, maintaining statistical baselines per device and alerting on deviations. | IoT · 500M devices anomaly |
| 11 | Given repository events (push, fork, PR, issue, star) from millions of repositories, build a "Trending Repositories" system that updates rankings globally every minute, weighting recent activity higher than older activity and segmenting by language/topic. | GitHub · trending repos |
| 12 | How does a notification orchestration platform decide what to send (push/email/SMS/in-app), when to send it (optimal timing per user), and how to batch/deduplicate across channels, while processing 1B+ notification decisions/day with user preference enforcement? | Notification orchestration |
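For problem 4, the core of a dead letter queue is a bounded retry policy that moves a poison message aside instead of blocking healthy ones. The function below is a toy in-memory sketch of that policy; `process_with_dlq`, `handler`, and `max_attempts` are hypothetical names, and a production system would persist both queues and add backoff, alerting, and replay tooling.

```python
from collections import deque


def process_with_dlq(messages, handler, max_attempts=3):
    """Run handler over messages, retrying failures and dead-lettering poison ones."""
    queue = deque((msg, 1) for msg in messages)   # (message, attempt number)
    results = []
    dead_letters = []
    while queue:
        msg, attempt = queue.popleft()
        try:
            results.append(handler(msg))
        except Exception as exc:
            if attempt >= max_attempts:
                # Keep the payload and failure context so operators can replay it.
                dead_letters.append(
                    {"message": msg, "attempts": attempt, "error": repr(exc)}
                )
            else:
                # Requeue at the back so healthy messages are not blocked.
                queue.append((msg, attempt + 1))
    return results, dead_letters
```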
8. Social Media Feeds & Recommendations (X, Instagram, TikTok, Pinterest)
Fan-out on write/read, ranking, real-time updates, personalization engines, trending computation · 16 problems
| # | Problem | Company / Scale |
| 1 | How does a social platform deliver posts to 400M+ users' timelines in real-time, handling the asymmetry between users with few followers and celebrities with millions, while keeping timeline delivery under 5 seconds? | Twitter · fan-out |
| 2 | How does a photo-sharing platform rank your feed from millions of candidate posts, balancing relevance, recency, and popularity while maintaining exploration/exploitation balance to avoid filter bubbles? | Instagram · ML ranking |
| 3 | How does a short-video platform learn your preferences within 3 minutes of first use, leveraging real-time engagement signals (watch time, replays, shares) to personalize recommendations for brand-new users with no history? | TikTok · cold-start rec |
| 4 | How does a forum handle nested comments with millions of votes, supporting deep threading, real-time vote counts, and efficient pagination ("load more replies") without N+1 query performance degradation? | Reddit · comment tree |
| 5 | How does a video platform count views accurately at 1B+ views/day without double-counting, filtering bot traffic and fraudulent views while keeping the public count updated within 5 minutes of actual views? | YouTube · view counting |
| 6 | How does a social platform handle the celebrity problem (a user with 100M followers posts, and you can't write to 100M timelines simultaneously) while still delivering the post to active followers within seconds? | Facebook · celebrity fan-out |
| 7 | How does a professional network generate "People You May Know" recommendations in real-time, computing relationship suggestions from graph connections (friends-of-friends, shared attributes) and updating as new connections form? | LinkedIn · graph rec |
| 8 | How does a visual discovery platform handle infinite scroll with personalized content, pre-fetching upcoming pages, computing layout server-side, and re-ranking in real-time based on scroll behavior and engagement signals? | Pinterest · infinite scroll |
| 9 | How does a music streaming platform generate personalized playlists (Discover Weekly) for 600M+ users, balancing familiar preferences with novel discovery while avoiding filter bubbles and repetitive recommendations? | Spotify · Discover Weekly |
| 10 | How does a video platform recommend the next video with 80%+ click-through rate, narrowing candidates from 1B+ videos to a ranked shortlist in <50ms while incorporating real-time watch signals (watch time, skip, replay)? | YouTube · next video rec |
| 11 | Given an API that records every song play event (userId, songId, albumId, timestamp, country), build a system that continuously computes the top 100 songs/albums for the last 5 min / 1 hour / 1 week globally and per country, processing 5B+ listen events/day with rankings updating within seconds of activity changes. | Spotify · top charts 5B/day |
| 12 | Given a stream of video watch events (videoId, userId, watchDuration, region), design a system that updates the "Trending Videos" page every 30 seconds while handling 50M+ concurrent viewers, weighting scores by views, watch percentage, and velocity. | YouTube · Trending 50M concurrent |
| 13 | How does a social platform implement real-time content moderation at scale, classifying 500M+ posts/day for policy violations using multi-modal ML (text + image + video), routing edge cases to human reviewers within minutes, and handling appeals with audit trails? | Meta · content moderation |
| 14 | How does a social platform implement real-time A/B testing for feed ranking algorithms, splitting traffic across 100+ concurrent experiments, measuring engagement metrics with statistical significance within hours, and safely rolling back experiments that degrade user experience? | Meta · feed experimentation |
| 15 | How does a social platform serve personalized notifications to 2B+ users, deciding what to notify (likes, comments, follows), batching low-priority notifications, computing optimal send times per user based on activity patterns, and suppressing notifications during quiet hours? | Instagram · smart notifications |
| 16 | How does a social platform detect and suppress viral misinformation in real-time, scoring content credibility within seconds of posting using engagement velocity anomalies, cross-referencing fact-check databases, and applying distribution throttling before content reaches millions? | Meta · misinformation detection |
9. Geo, Ride Sharing & Food Delivery (Uber, Google Maps, DoorDash, Zomato)
Location tracking, geofencing, matching, ETA, order lifecycle, real-time tracking · 13 problems
| # | Problem | Company / Scale |
| 1 | How does a ride-hailing platform match riders to the nearest available driver within 3 seconds, scoring candidates by distance, ETA, and driver rating while avoiding global scans of all drivers? | Uber · H3 matching |
| 2 | How does a maps platform calculate ETA for millions of simultaneous route requests, incorporating real-time traffic data from GPS probes and ML-based travel time prediction for accuracy within 10% of actual travel time? | Google Maps · ETA |
| 3 | How does a delivery platform optimize routes for 1M+ concurrent orders, batching nearby orders and dynamically re-routing when new orders arrive mid-delivery while minimizing total delivery time? | DoorDash · routing |
| 4 | How does a ride platform ingest and query 5M+ driver location updates every 4 seconds, supporting spatial queries ("find drivers within 2km") without scanning all drivers globally? | Uber · location ingestion |
| 5 | How does an AR game handle millions of players interacting with geo-anchored objects, maintaining server-authoritative state for shared objects while providing smooth client-side rendering at 60fps? | Pokémon GO · geo-spatial |
| 6 | Given ride request and completion events (driverId, riderId, lat, long, timestamp), build a surge pricing system that recomputes hotspot pricing for every city zone within 10 seconds during peak traffic, computing supply/demand ratios per zone with smoothing to avoid price oscillation. | Uber · surge 10s recompute |
| 7 | How does a local search platform find "restaurants near me" from 200M+ businesses in <50ms, supporting radius queries, real-time availability filtering, and ranking by distance, rating, and relevance? | Google Maps · local search |
| 8 | How does a food delivery platform push live driver-location updates to the customer's map every 2 seconds, showing smooth movement interpolation on the client even when GPS updates arrive irregularly? | Zomato/Swiggy · live tracking |
| 9 | How does a food delivery platform calculate multi-leg ETA (restaurant prep + driver pickup + delivery) that updates in real-time, incorporating per-restaurant prep time models, live traffic, and Bayesian updates as each leg completes? | Uber Eats · multi-leg ETA |
| 10 | How does a food delivery platform manage the 3-party order lifecycle (customer → restaurant → driver) with state machine transitions (placed→accepted→preparing→ready→picked_up→delivered), timeout handlers per state, and compensating actions on cancellation? | Swiggy · order lifecycle |
| 11 | How does a food delivery platform assign orders to drivers optimizing for delivery time, driver earnings, and restaurant wait, enforcing constraints (distance < 3km, capacity = 2 orders) and re-assigning when a driver rejects within 30s? | DoorDash · driver dispatch |
| 12 | Build a traffic analytics platform where GPS pings arrive every 3 seconds from 100M vehicles, computing congestion levels per road segment and ETA updates that refresh within 5 seconds, pushing tile-based map updates to clients. | Google Maps · 100M vehicles traffic |
| 13 | How does a geofencing platform detect when millions of devices enter/exit custom geographic boundaries in real-time, processing 10M+ location updates/sec against 100M+ geofences with sub-second trigger latency for marketing notifications and compliance alerts? | Radar · geofencing 10M/sec |
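Problems 1 and 4 both hinge on answering "find drivers within 2km" without scanning every driver. The table labels mention H3; the sketch below substitutes a plain latitude/longitude grid as a simpler stand-in for the same idea (bucket by cell, scan only nearby cells, then filter by exact distance). `GridIndex`, `cell_deg`, and the bounding-box approximation are assumptions of this sketch, which ignores the poles and the antimeridian.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


class GridIndex:
    """Buckets moving points into fixed-size lat/lon cells for radius queries."""

    def __init__(self, cell_deg=0.02):
        self.cell_deg = cell_deg
        self.cells = defaultdict(set)   # (row, col) -> driver ids
        self.positions = {}             # driver id -> (lat, lon)

    def _cell(self, lat, lon):
        return (math.floor(lat / self.cell_deg), math.floor(lon / self.cell_deg))

    def update(self, driver_id, lat, lon):
        """Move a driver to a new position, removing it from its old cell."""
        old = self.positions.get(driver_id)
        if old is not None:
            self.cells[self._cell(*old)].discard(driver_id)
        self.positions[driver_id] = (lat, lon)
        self.cells[self._cell(lat, lon)].add(driver_id)

    def nearby(self, lat, lon, radius_km):
        """Return ids within radius_km, nearest first, scanning only nearby cells."""
        dlat = math.degrees(radius_km / EARTH_RADIUS_KM)
        # Widen the longitude span using the latitude farthest from the equator.
        edge_lat = min(abs(lat) + dlat, 89.9)
        dlon = dlat / math.cos(math.radians(edge_lat))
        lo_row, lo_col = self._cell(lat - dlat, lon - dlon)
        hi_row, hi_col = self._cell(lat + dlat, lon + dlon)
        hits = []
        for row in range(lo_row, hi_row + 1):
            for col in range(lo_col, hi_col + 1):
                for driver_id in self.cells.get((row, col), ()):
                    dist = haversine_km(lat, lon, *self.positions[driver_id])
                    if dist <= radius_km:
                        hits.append((dist, driver_id))
        return [driver_id for _, driver_id in sorted(hits)]
```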
10. Payment Systems (Stripe, PayPal, Apple Pay, Visa)
Idempotency, ledgers, reconciliation, fraud, trading · 9 problems
| # | Problem | Company / Scale |
| 1 | How does a payment platform process millions of charges without double-charging, ensuring idempotent request handling, two-phase state transitions (pending→captured→settled), and at-least-once delivery with server-side deduplication? | Stripe · idempotency |
| 2 | How does a payment platform detect fraud in real-time across 400M accounts, scoring each transaction in sub-100ms using velocity checks, device fingerprinting, and geo-anomaly detection while routing edge cases to human review? | PayPal · fraud ML |
| 3 | How does a cross-border payment platform handle transfers across 80+ currencies, locking FX rates within 30-second windows, managing multi-currency ledgers, and batching settlements to minimize wire fees? | Wise · FX ledger |
| 4 | How does a trading platform handle stock orders with sub-millisecond latency, maintaining deterministic order matching, lock-free order book operations, and complete audit replay for regulatory compliance? | Robinhood · trading |
| 5 | How does a digital wallet handle offline tap-to-pay transactions, using pre-authorized tokens with device-local transaction limits and deferred settlement that reconciles when connectivity returns? | Apple Pay · offline |
| 6 | How do banks reconcile millions of transactions daily without losing a cent, using double-entry ledgers, end-of-day batch reconciliation, exception queues for mismatches, and cryptographic audit trails for regulatory compliance? | Banking · reconciliation |
| 7 | How does an e-commerce platform handle checkout for 1M+ merchants during Black Friday peak, maintaining order confirmation within seconds while gracefully degrading non-critical features under extreme load? | Shopify · flash checkout |
| 8 | How does a cryptocurrency exchange handle 100K+ trades/sec with real-time order matching, maintaining a deterministic in-memory order book, supporting limit/market/stop orders, and providing guaranteed execution ordering with nanosecond timestamps for regulatory audit? | Binance · crypto exchange |
| 9 | Design a stock market analytics system ingesting millions of trades per second that continuously updates top gainers, losers, and unusual volume spikes with millisecond latency, computing OHLC aggregations per symbol and pushing updates to trading terminals in real-time. | Stock exchange · ms-latency analytics |
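Problem 1 combines two ideas: an idempotency key so a retried request returns the original charge, and a strict pending→captured→settled state machine. The sketch below shows both in memory; `PaymentStore`, the `ch_` id format, and the rule that a reused key with a different amount is rejected are illustrative assumptions, not a description of any provider's API.

```python
class PaymentStore:
    """In-memory sketch of idempotent charge creation with guarded state changes."""

    ALLOWED = {
        "pending": {"captured"},
        "captured": {"settled"},
        "settled": set(),
    }

    def __init__(self):
        self._by_key = {}   # idempotency key -> charge record

    def create_charge(self, idempotency_key, amount_cents):
        """Create a charge once; a retry with the same key returns the first result."""
        existing = self._by_key.get(idempotency_key)
        if existing is not None:
            if existing["amount_cents"] != amount_cents:
                raise ValueError("idempotency key reused with different parameters")
            return existing
        charge = {
            "id": f"ch_{len(self._by_key) + 1}",
            "amount_cents": amount_cents,
            "state": "pending",
        }
        self._by_key[idempotency_key] = charge
        return charge

    def transition(self, idempotency_key, new_state):
        """Advance the charge; a replayed transition is a no-op, a skipped one fails."""
        charge = self._by_key[idempotency_key]
        if charge["state"] == new_state:
            return charge
        if new_state not in self.ALLOWED[charge["state"]]:
            raise ValueError(f"illegal transition {charge['state']} -> {new_state}")
        charge["state"] = new_state
        return charge
```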
11. API Gateway & Backend (Cloudflare, Netflix, Kong)
Rate limiting, auth, circuit breaking, service mesh · 8 problems
12. Database & Storage (Postgres, MongoDB, Redis, Elasticsearch)
Sharding, replication, consistency, migrations · 10 problems
| # | Problem | Company / Scale |
| 1 | How does a database sharding layer transparently shard 10B+ messages across hundreds of shards, supporting online resharding (split/merge without downtime) and connection pooling that reduces backend connections by 100×? | Slack · Vitess |
| 2 | How does a globally distributed database achieve strong consistency with <10ms reads, bounding clock uncertainty across regions and using consensus-based replication across 5+ regions? | Google Spanner · TrueTime |
| 3 | How does a chat platform store trillions of messages in a wide-column database, partitioning by channel with time-ordered clustering, and tuning compaction strategies for time-series append patterns? | Discord · ScyllaDB |
| 4 | How does a platform migrate billions of rows between database schemas with zero downtime, validating correctness through shadow reads and providing instant rollback capability during cutover? | Uber · online migration |
| 5 | How does a serverless database achieve single-digit ms latency at any scale, routing requests to the correct partition via in-memory partition maps and providing adaptive burst capacity? | DynamoDB · partition |
| 6 | How does a productivity platform shard a relational database for millions of workspaces, routing queries by workspace_id, pooling connections efficiently, and automatically rebalancing shards as workspaces grow? | Notion · Postgres shard |
| 7 | How does an object storage service achieve 99.999999999% (11 nines) durability, splitting data into fragments across availability zones with integrity checksums on every read and automatic repair of degraded objects within hours? | S3 · 11 nines |
| 8 | How does a streaming platform handle write-heavy workloads (1M+ writes/sec) in a distributed database, tuning consistency levels and compaction strategies based on read/write ratio while maintaining token-aware routing? | Netflix · Cassandra |
| 9 | How does a distributed SQL database survive entire region failures without data loss, using consensus per data range, leaseholder placement policies, and non-voting replicas for fast failover (<10s RTO)? | CockroachDB · multi-region |
| 10 | How does a search engine index and search petabytes of logs in <100ms, using inverted indexes with time-based rotation and scatter-gather queries across 1000s of shards with early termination? | Elasticsearch · log search |
13. Distributed Systems (Apache, Google Spanner, CockroachDB)
Consensus, leader election, clock sync, partition tolerance · 7 problems
| # | Problem | Company / Scale |
| 1 | How does a distributed key-value store achieve consensus across 5 nodes, handling leader election, log replication, and split-brain prevention requiring majority quorum (3 of 5) for all decisions? | etcd · Raft |
| 2 | How does a distributed event streaming platform handle leader election when a broker dies, reassigning partitions within seconds without message loss while promoting only in-sync replicas? | Kafka · KRaft |
| 3 | How does a distributed system achieve causal ordering of events across data centers without synchronized clocks, using hybrid logical clocks (HLC) to bound uncertainty and provide happens-before guarantees for cross-region transactions? | HLC · causal ordering |
| 4 | How does a distributed system implement linearizable reads without sacrificing availability, choosing between leader leases, quorum reads, and read-repair strategies based on consistency requirements and latency budgets? | Linearizability · read strategies |
| 5 | How does a coordination service manage distributed locks, leader election, and configuration for 1000s of services, detecting liveness via ephemeral sessions and providing sequential ordering guarantees for distributed queues? | ZooKeeper · coordination |
| 6 | How does a wide-column database maintain availability during network splits (AP in CAP), offering tunable consistency levels, hinted handoff for downed nodes, and anti-entropy repair to eventually converge divergent replicas? | Cassandra · AP system |
| 7 | How do distributed systems detect and recover from split-brain scenarios, invalidating stale leaders with monotonic fencing tokens, epoch-based leadership with lease expiry, and ensuring only one leader can make progress at any time? | Split-brain · fencing |
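The fencing-token idea in problem 7 fits in a few lines: every new leader receives a strictly larger token, and the storage layer rejects any write carrying a token older than the highest it has seen. `LeaseService`, `FencedStore`, and `StaleTokenError` are hypothetical names for this sketch, which omits lease expiry and assumes the token issuer itself is consistent.

```python
class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already accepted."""


class LeaseService:
    """Grants leadership together with a strictly increasing fencing token."""

    def __init__(self):
        self._epoch = 0
        self.holder = None

    def acquire(self, node_id):
        self._epoch += 1
        self.holder = node_id
        return self._epoch


class FencedStore:
    """Storage that remembers the highest token seen and rejects older ones."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            raise StaleTokenError(
                f"token {token} is older than {self.highest_token}"
            )
        self.highest_token = token
        self.data[key] = value
```

The check lives in the store rather than in the leader because a paused or partitioned old leader cannot be trusted to notice that its lease has lapsed.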
14. Live Sports & Real-Time Event Broadcasting (ESPN, Cricbuzz, Dream11, Hotstar)
Live score push, ball-by-ball updates, millions of concurrent readers consuming the same event stream · 8 problems
| # | Problem | Company / Scale |
| 1 | Build a live scoring platform where every ball event (runs, wicket, over, batsman change) must reach 30M concurrent users globally within sub-second latency during World Cup finals, supporting both persistent connections and fallback polling for all client types. | Cricbuzz · 30M concurrent, sub-second |
| 2 | How does a sports platform handle 10M+ simultaneous score poll requests during a World Cup final without melting the backend, serving fresh scores (≤1s stale) from edge while protecting origin servers from traffic spikes? | ESPN · World Cup traffic |
| 3 | How does a fantasy sports platform lock/unlock player selections in real-time as a match starts, performing atomic state transitions triggered by match-start events with eventual consistency for leaderboard updates? | Dream11 · lineup lock |
| 4 | How does a fantasy sports platform calculate live leaderboard rankings for 10M+ users as each ball is bowled, applying pre-computed point deltas per event and updating rankings incrementally without full recomputation? | Dream11 · live leaderboard |
| 5 | How does a live sports platform ingest events from stadium data feeds (ball tracking, hawk-eye) and normalize them into a unified event stream within 200ms, deduplicating events and handling out-of-order delivery from multiple feed sources? | Sports data · event ingestion |
| 6 | How does a betting platform update live odds for 1000+ markets simultaneously as match events occur, recalculating odds within milliseconds and handling stale-data rollback when events are corrected? | Betting · live odds |
| 7 | How does a live commentary platform handle millions of users receiving the same text/audio commentary stream without per-user fan-out, efficiently broadcasting identical content to massive audiences with minimal per-user resource cost? | Commentary · broadcast |
| 8 | Build a live sports notification platform where wicket/goal/touchdown events must push notifications to 100M subscribed users within a few seconds globally, classifying event priority (high: goal/wicket vs low: boundary) and distributing push load across regional gateways. | Sports · 100M push notifications |
15. File Upload & Media Processing · Dropbox, Google Drive, YouTube
Chunked upload, sync engines, transcoding pipelines, deduplication · 7 problems
| # | Problem | Company / Scale |
| 1 | How does a cloud storage platform sync file changes across millions of devices within seconds, deduplicating content at the block level and tracking per-file block maps for efficient delta sync? | Dropbox · sync engine |
| 2 | How does a file platform handle resumable uploads for 5GB+ files over unreliable networks, uploading in chunks with per-chunk checksum verification and automatic retry from the last successful chunk? | Google Drive · resumable upload |
| 3 | How does a video platform process 500+ hours of uploaded video per minute into multiple formats, prioritizing live content over VOD, executing DAG-based transcoding pipelines, and storing intermediate results between stages? | YouTube · transcoding pipeline |
| 4 | How does a cloud storage platform deduplicate files across 700M+ users to save petabytes of storage, using content-defined chunking, block-level hashing, and reference counting to safely garbage-collect unreferenced blocks? | Dropbox · deduplication |
| 5 | How does a photo platform generate thumbnails, apply filters, and extract metadata for 100M+ uploads/day, processing each image into multiple resolutions in parallel and pre-warming CDN caches for popular images? | Instagram · image processing |
| 6 | How does a collaboration platform handle concurrent edits to the same file by multiple users, using transform-based resolution for text and lock-based editing for binary files with conflict resolution UI for manual merge? | Google Docs · concurrent edit |
| 7 | How does a cloud platform implement file versioning and point-in-time restore for billions of objects, using copy-on-write semantics, version chains, and lifecycle policies that auto-delete versions older than 30 days? | S3 / Dropbox · versioning |
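As a rough illustration of the block-level hashing and reference counting named in problem 4, here is a minimal in-memory sketch. It assumes fixed-size blocks instead of content-defined chunking (a deliberate simplification), and the `BlockStore` class and its method names are hypothetical, not any vendor's API.

```python
import hashlib


class BlockStore:
    """Toy deduplicating store: identical blocks are stored once and ref-counted."""

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size  # tiny on purpose, to keep the example readable
        self.blocks = {}    # SHA-256 hex digest -> block bytes
        self.refcount = {}  # SHA-256 hex digest -> number of references
        self.files = {}     # file name -> ordered list of digests (the block map)

    def put(self, name: str, data: bytes) -> None:
        if name in self.files:
            self.delete(name)  # overwrite: release the old block map first
        block_map = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                self.blocks[digest] = block  # first time this content is seen
            self.refcount[digest] = self.refcount.get(digest, 0) + 1
            block_map.append(digest)
        self.files[name] = block_map

    def get(self, name: str) -> bytes:
        return b"".join(self.blocks[d] for d in self.files[name])

    def delete(self, name: str) -> None:
        for digest in self.files.pop(name):
            self.refcount[digest] -= 1
            if self.refcount[digest] == 0:
                # No file references this block any more, so it can be collected.
                del self.refcount[digest]
                del self.blocks[digest]
```

Two files that share a prefix store the shared blocks once; deleting one file frees only the blocks that no other file still references. A production system would also need durable metadata and a garbage collector that is safe against concurrent uploads, which this sketch does not attempt.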
16. Search Systems · Google, Elasticsearch, Algolia
Inverted indexes, ranking, typeahead, personalization, freshness · 9 problems
| # | Problem | Company / Scale |
| 1 | How does a web search engine index 100B+ pages and return ranked results in <200ms, combining authority scoring with relevance ranking across a tiered index serving architecture with early termination? | Google · web search |
| 2 | How does an e-commerce platform search 500M+ products with filters (price, rating, brand) in <50ms, boosting results by relevance, recency, and popularity while personalizing re-ranking based on purchase history? | Amazon · product search |
| 3 | How does a professional network search 900M+ member profiles with complex filters (location, skills, company), supporting real-time index updates for profile changes within 10 seconds of modification? | LinkedIn · people search |
| 4 | How does a search engine implement typeahead/autocomplete that returns suggestions in <50ms as the user types, serving pre-computed top-K suggestions per prefix with personalized ranking based on recent searches? | Google · typeahead |
| 5 | How does a search engine handle spelling correction and "did you mean" suggestions in real-time, computing edit distance, mining query logs for common corrections, and supporting phonetic matching? | Google · spell correction |
| 6 | How does a food delivery platform search restaurants with geo-filtering, real-time availability, and delivery time estimation, re-ranking results by ETA and updating availability as orders come in? | DoorDash · local search |
| 7 | How does a search engine keep its index fresh when millions of pages change daily, prioritizing crawl frequency by change rate, supporting real-time index updates for breaking news, and periodically re-indexing for consistency? | Google · index freshness |
| 8 | How does a platform implement semantic/vector search that understands meaning beyond keywords, combining embedding-based similarity with traditional keyword scoring for hybrid results using approximate nearest neighbor search? | Semantic · vector search |
| 9 | Given search query logs from millions of searches per second, design a system that continuously computes the most searched queries/products/categories globally and regionally, using hierarchical aggregation and approximate top-K algorithms for memory-efficient ranking. | Search · trending queries |
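Problem 9 mentions approximate top-K algorithms for memory-efficient ranking. One well-known option is the Misra-Gries summary, sketched below as an illustration; the problem statement does not say which algorithm is intended, and the function name is hypothetical. With at most `k - 1` counters, any query that makes up more than `1/k` of the stream is guaranteed to survive in the summary, although the stored counts are underestimates.

```python
from typing import Dict, Hashable, Iterable


def misra_gries(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """Return a summary holding at most k - 1 candidate heavy hitters."""
    if k < 2:
        raise ValueError("k must be at least 2")
    counters: Dict[Hashable, int] = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free slot: decrement every counter and drop those reaching zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


queries = ["ipl", "nba", "ipl", "f1", "ipl", "nfl", "ipl", "ipl"]
summary = misra_gries(queries, k=3)
```

Here `"ipl"` accounts for 5 of 8 queries, which is more than one third, so it must appear in `summary`. Memory stays bounded by `k` regardless of how many distinct queries the stream contains.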
17. Scheduling & Calendar · Google Calendar, Calendly, Temporal
Recurring events, timezone handling, conflict detection, availability · 6 problems
18. Observability & Monitoring · Datadog, Grafana, Prometheus
Distributed tracing, metrics aggregation, alerting pipelines, log analytics · 8 problems
| # | Problem | Company / Scale |
| 1 | How does a distributed tracing system correlate requests across 1000+ microservices, propagating trace context through async boundaries (Kafka, queues) and sampling intelligently to keep storage under 1% of total traffic while capturing all error traces? | Uber · Jaeger tracing |
| 2 | How does a metrics platform ingest 500M+ time-series data points per second, supporting real-time aggregation (P50/P99/max) with 10-second granularity and multi-dimensional queries across 100K+ metric names? | Datadog · 500M metrics/sec |
| 3 | How does an alerting system evaluate 10M+ alert rules every 15 seconds without false positives, supporting complex conditions (rate-of-change, anomaly detection), alert grouping, and escalation policies with on-call rotation? | PagerDuty · alerting at scale |
| 4 | How does a log analytics platform ingest 1PB+ of logs daily from millions of sources, indexing them for sub-second search while applying retention policies and providing real-time tail functionality for debugging? | Splunk · PB-scale logs |
| 5 | How does a platform implement real-time error tracking that groups millions of exceptions into unique issues, detecting regressions within minutes of deployment and auto-assigning to the team that owns the failing code path? | Sentry · error grouping |
| 6 | How does a cloud platform build real-time service dependency maps from trace data, detecting cascading failures within seconds and identifying the root-cause service in a chain of 20+ dependent services? | AWS X-Ray · dependency map |
| 7 | How does a platform implement SLO-based monitoring that continuously computes error budgets across 10K+ services, triggering automated responses (traffic shifting, rollback) when burn rate exceeds thresholds? | Google · SLO monitoring |
| 8 | How does a real-time dashboard system serve 100K+ concurrent viewers watching the same metrics, pushing incremental updates via WebSocket without recomputing full queries per viewer? | Grafana · live dashboards |
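The burn rate in problem 7 is the observed error rate divided by the error budget (1 minus the SLO target). The sketch below shows that arithmetic plus a two-window check, under the assumption that an automated response should fire only when both a short and a long window exceed the threshold. The function names and the example threshold of 14.4 are illustrative choices, not values taken from the problem statement.

```python
from typing import Tuple

Window = Tuple[int, int]  # (error_count, request_count) observed in a time window


def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if requests == 0:
        return 0.0  # no traffic, so no budget is being consumed
    error_budget = 1.0 - slo_target
    return (errors / requests) / error_budget


def should_respond(short: Window, long: Window, slo_target: float,
                   threshold: float) -> bool:
    """Fire only if both windows burn faster than the threshold."""
    return (burn_rate(*short, slo_target) > threshold
            and burn_rate(*long, slo_target) > threshold)
```

For a 99.9% SLO, 50 errors in 10,000 requests is a 0.5% error rate against a 0.1% budget, so the burn rate is about 5. Requiring both windows to agree is one way to avoid reacting to a brief spike that has already subsided.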
19. ML Model Serving & Feature Stores · OpenAI, TensorFlow, PyTorch, Meta
Real-time inference, feature computation, model deployment, A/B testing · 7 problems
| # | Problem | Company / Scale |
| 1 | How does a ride platform serve ML predictions (ETA, surge, fraud) at 1M+ requests/sec with P99 <10ms latency, loading 100+ models into GPU memory, handling model version rollouts, and falling back to rule-based systems when models are unavailable? | Uber · Michelangelo |
| 2 | How does a feature store compute and serve real-time features (user's last 5 actions, rolling 1-hour spend) for ML models at prediction time, combining batch-computed features with streaming features while maintaining point-in-time correctness? | Feast · real-time features |
| 3 | How does a search platform deploy new ranking models to production without degrading relevance, using shadow scoring, interleaved experiments, and gradual traffic ramp with automatic rollback on metric regression? | Google · model deployment |
| 4 | How does a recommendation platform retrain models on fresh data every hour, incorporating the latest user interactions while ensuring training-serving skew stays below 1% and new models don't catastrophically forget learned patterns? | TikTok · online learning |
| 5 | How does an LLM serving platform handle 10K+ concurrent inference requests with variable-length outputs, optimizing GPU utilization through continuous batching, KV-cache management, and speculative decoding for 3x throughput improvement? | OpenAI · LLM serving |
| 6 | How does an ad platform compute click-through-rate predictions for 10B+ ad candidates daily in <50ms per request, combining sparse features (user history) with dense embeddings and serving from a distributed model across 1000+ inference nodes? | Meta · ads prediction |
| 7 | How does a content platform implement real-time embedding generation for new content (images, videos, text), indexing embeddings for approximate nearest neighbor search within seconds of upload for immediate recommendation eligibility? | Pinterest · embedding pipeline |
20. Security & Authentication · Cloudflare, Auth0, Okta
Session management, DDoS mitigation, rate limiting, zero-trust · 7 problems
| # | Problem | Company / Scale |
| 1 | How does a platform manage sessions for 2B+ users across multiple devices, supporting instant revocation (password change invalidates all sessions), sliding expiry, and device-specific session limits without checking a central store on every request? | Google · session management |
| 2 | How does a CDN mitigate L7 DDoS attacks at 100M+ requests/sec, distinguishing legitimate traffic from bot traffic using behavioral analysis, JavaScript challenges, and adaptive rate limiting without blocking real users during an attack? | Cloudflare · DDoS mitigation |
| 3 | How does a platform implement distributed rate limiting across 300+ edge PoPs, enforcing per-user and per-IP limits with eventual consistency between nodes while using fail-open policies to avoid blocking legitimate traffic during sync delays? | Stripe · distributed rate limit |
| 4 | How does a zero-trust architecture authenticate and authorize every service-to-service call in a 5000+ microservice mesh, issuing short-lived certificates, enforcing least-privilege policies, and detecting lateral movement without adding more than 1ms latency per hop? | Google BeyondCorp · zero-trust |
| 5 | How does an OAuth provider handle 1M+ token issuance/sec during peak login, supporting PKCE flows, token rotation, and cross-device SSO while detecting token theft through binding tokens to device fingerprints? | Auth0 · OAuth at scale |
| 6 | How does a platform implement real-time account takeover detection, scoring login attempts using device fingerprint, geo-velocity, and behavioral biometrics, triggering step-up authentication (MFA) for suspicious sessions within milliseconds? | Netflix · ATO detection |
| 7 | How does a secrets management platform handle 100K+ services fetching credentials, supporting automatic rotation every 24 hours, lease-based access with revocation, and zero-downtime rotation without service restarts? | HashiCorp Vault · secrets |
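The per-user and per-IP limits in problem 3 are often built on a token bucket held locally at each node. The sketch below covers only that local building block; synchronizing counts between PoPs and the fail-open policy are out of scope. The `TokenBucket` class is hypothetical, and the clock is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Local token bucket: refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so an initial burst is allowed
        self.last = now         # timestamp of the last refill

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time elapsed since the last call; ignore clock regressions.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(rate=1.0, capacity=2.0)
decisions = [bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0),
             bucket.allow(1.0), bucket.allow(1.0)]
```

The bucket admits a burst of two requests at time 0, rejects the third, and admits one more after a second of refill. In a distributed setup each node would typically hold one bucket per key and reconcile usage with its peers asynchronously.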
21. URL Shortening & Redirection · Bitly, TinyURL, Google
ID generation, redirect scaling, analytics, hot key caching · 4 problems
| # | Problem | Company / Scale |
| 1 | How does a URL shortener generate globally unique short codes at 1000+ URLs/sec, ensuring no collisions across distributed nodes while keeping codes short (7 chars) and supporting custom aliases? | Bitly · ID generation |
| 2 | How does a URL shortener handle 100K+ redirect requests/sec with P99 <10ms latency, caching hot URLs at the edge while handling expired links, geo-targeted redirects, and A/B test routing? | TinyURL · redirect at scale |
| 3 | How does a link analytics platform track clicks in real-time (referrer, geo, device, timestamp) for billions of redirects/day without adding latency to the redirect path, computing dashboards with <1min freshness? | Bitly · click analytics |
| 4 | How does a URL shortener handle hot keys (viral links getting 1M+ clicks/sec) without melting the cache layer, using consistent hashing, request coalescing, and tiered caching (L1 in-process → L2 Redis → L3 DB)? | t.co · hot key caching |
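For the 7-character codes in problem 1, one common approach is to base62-encode a unique integer ID. The sketch below assumes the ID has already been allocated without collisions (for example, from disjoint per-node ranges), which is the harder part of the problem and is not shown. The alphabet ordering and function names are illustrative choices.

```python
import string

# 10 digits + 26 lowercase + 26 uppercase = 62 symbols (ordering is an assumption)
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)
CODE_LENGTH = 7


def encode_base62(n: int, length: int = CODE_LENGTH) -> str:
    """Encode a non-negative integer as a fixed-width, zero-padded base62 string."""
    if n < 0 or n >= BASE ** length:
        raise ValueError("id does not fit in the requested code length")
    chars = []
    for _ in range(length):
        n, remainder = divmod(n, BASE)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))  # most significant symbol first


def decode_base62(code: str) -> int:
    """Inverse of encode_base62."""
    n = 0
    for ch in code:
        n = n * BASE + ALPHABET.index(ch)
    return n
```

Seven base62 characters cover 62^7 = 3,521,614,606,208 distinct IDs, so at 1,000 new URLs per second the code space would last roughly a century. Custom aliases would need a separate uniqueness check, since they do not come from the ID sequence.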
22. Email Systems · Gmail, SendGrid, Mailchimp
SMTP delivery, spam filtering, inbox indexing, threading · 4 problems
| # | Problem | Company / Scale |
| 1 | How does an email platform deliver 500M+ emails/day with high deliverability, managing IP reputation, DKIM/SPF/DMARC authentication, bounce handling, and throttling per recipient domain to avoid being blacklisted? | SendGrid · email delivery |
| 2 | How does an email provider classify 1B+ incoming emails/day as spam/ham in real-time using content analysis, sender reputation, behavioral signals, and ML models, with a <0.1% false positive rate? | Gmail · spam filtering |
| 3 | How does an email platform index 1B+ mailboxes for instant full-text search, supporting complex queries (from:, has:attachment, date range) with <200ms latency while handling 100K+ new emails/sec ingestion? | Gmail · inbox search |
| 4 | How does an email client implement conversation threading that correctly groups replies, forwards, and CC chains using In-Reply-To/References headers, handling broken threads and cross-client compatibility? | Gmail · threading |
23. Content Moderation & Trust · Meta, YouTube, Twitch
Spam detection, abuse prevention, ML moderation, reporting systems · 4 problems
24. Configuration & Feature Flags · LaunchDarkly, Netflix, GitHub
Dynamic config, experimentation, A/B testing, rollout systems · 3 problems
25. Web Crawling & Indexing · Google, Cloudflare, Amazon
Distributed crawlers, politeness, deduplication, indexing pipelines · 3 problems
| # | Problem | Company / Scale |
| 1 | How does a web crawler discover and fetch 10B+ pages across the internet, respecting robots.txt, managing crawl politeness (rate limits per domain), prioritizing fresh/important pages, and deduplicating content? | Google · web crawler |
| 2 | How does a search engine build and maintain an inverted index over 100B+ documents, supporting incremental updates (new/modified pages) without full re-indexing, and serving queries across a distributed index in <200ms? | Google · indexing pipeline |
| 3 | How does a price comparison platform crawl 10M+ product pages daily from 1000+ e-commerce sites, extracting structured data (price, availability, specs) using site-specific parsers, and detecting price changes within minutes? | PriceRunner · product crawling |
26. ID Generation Systems · Twitter, Stripe, Snowflake
Snowflake IDs, distributed unique IDs, ordering guarantees, collision avoidance · 3 problems
27. Ad Serving & Monetization · Google Ads, Meta, Amazon
Ad ranking, real-time bidding, targeting, impression tracking · 4 problems
| # | Problem | Company / Scale |
| 1 | How does an ad platform select the best ad from 10M+ candidates in <100ms per page load, scoring by predicted CTR, bid amount, relevance, and advertiser budget, while serving 10B+ ad requests/day? | Google Ads · ad ranking |
| 2 | How does a real-time bidding (RTB) exchange conduct auctions across 100+ demand-side platforms within 100ms, handling 1M+ bid requests/sec with timeout-based fallback and fraud detection? | OpenRTB · real-time bidding |
| 3 | How does an ad platform track impressions, clicks, and conversions across billions of events/day without double-counting, attributing conversions across devices/sessions, and computing ROI dashboards in near-real-time? | Meta Ads · attribution tracking |
| 4 | How does an ad platform enforce advertiser budgets in real-time across distributed serving nodes, preventing overspend while maximizing delivery, handling budget changes mid-campaign, and pacing spend evenly throughout the day? | Google Ads · budget pacing |
28. Developer Platform & CI/CD · GitHub Actions, Jenkins, Docker, Kubernetes
Build systems, deployment orchestration, artifact storage, release pipelines · 3 problems
| # | Problem | Company / Scale |
| 1 | How does a CI platform execute 10M+ builds/day across a distributed fleet of workers, scheduling jobs by priority and resource requirements, caching build artifacts for a 10× speedup, and providing real-time build logs? | GitHub Actions · build at scale |
| 2 | How does a deployment platform orchestrate zero-downtime rollouts across 100K+ servers, supporting canary deployments (1% → 10% → 100%), automatic rollback on health check failure, and blue-green switching? | Spinnaker · deployment orchestration |
| 3 | How does an artifact registry store and serve 1PB+ of build artifacts (Docker images, npm packages, Maven JARs) with global replication, content-addressable deduplication, and vulnerability scanning on upload? | Artifactory · artifact storage |
29. AI & LLM Systems · OpenAI, Anthropic, Google, Perplexity
LLM serving, RAG pipelines, agent orchestration, AI gateways, guardrails, embedding search · 10 problems
| # | Problem | Company / Scale |
| 1 | How does an LLM serving platform handle 100K+ concurrent chat sessions with variable-length outputs, achieving <500ms time-to-first-token through continuous batching, KV-cache paging (PagedAttention), and speculative decoding, while keeping GPU utilization above 80%? | OpenAI · LLM serving at scale |
| 2 | How does a RAG-powered search engine ingest 10M+ documents, chunk them optimally, generate embeddings, and serve grounded answers with citations in <2s, while keeping hallucination rate below 5% through hybrid retrieval (dense + sparse) and cross-encoder reranking? | Perplexity · RAG at scale |
| 3 | How does an AI gateway route 1M+ requests/day across multiple LLM providers (GPT-4, Claude, Llama), implementing semantic caching (30-60% cost reduction), automatic fallback on provider outages, and per-team token budgets, all with <50ms added latency? | Enterprise · AI Gateway |
| 4 | How does an AI coding assistant serve real-time code completions to 10M+ developers with <200ms latency, retrieving relevant context from the user's repository (100K+ files), ranking suggestions by relevance, and adapting to per-user coding patterns? | GitHub Copilot · code AI |
| 5 | How does a multi-agent system orchestrate 5+ specialized AI agents (researcher, coder, reviewer, planner) to complete complex tasks, managing shared memory, tool execution, inter-agent communication, and graceful failure handling, with total cost under $0.50 per task? | AutoGen · Agent orchestration |
| 6 | How does a vector search platform index 1B+ embeddings and serve similarity queries in <10ms at 50K QPS, supporting real-time index updates (new documents searchable in <5s), metadata filtering, and multi-tenancy with per-tenant isolation? | Pinecone · Vector search at scale |
| 7 | How does a content moderation system classify 500M+ user-generated posts/day using multi-modal AI (text + image + video), achieving <2s classification latency, routing edge cases to human reviewers, and handling adversarial attacks that try to bypass filters? | Meta · AI content moderation |
| 8 | How does a conversational AI platform maintain context across multi-turn conversations for 50M+ daily active users, managing conversation memory (short-term buffer + long-term vector store), session persistence, and personalization, while keeping per-user storage costs under $0.001/day? | ChatGPT · Conversation memory |
| 9 | How does a real-time AI translation system serve 1B+ translation requests/day across 100+ language pairs with <300ms latency, dynamically selecting between specialized models per language pair and falling back to general models for rare pairs? | Google Translate · AI at scale |
| 10 | How does an AI safety platform detect prompt injection attacks, jailbreak attempts, and PII leakage across 100M+ LLM requests/day in <50ms per request, using layered classifiers (fast regex → ML model → LLM judge) with <0.1% false positive rate on legitimate queries? | Lakera · AI guardrails at scale |