System Design Concepts

No fluff โ€” visual, concise, interview-ready

๐Ÿ“จ 8 ยท MESSAGING

Message Queues (RabbitMQ / SQS)

Messages wait in a queue for consumers to process โ€” task distribution & load balancing with retry + DLQ

Guarantees: At-least-once delivery (consumer must be idempotent). Ordering per queue (FIFO). Dead Letter Queue โ€” after N retries, message moved to DLQ. Visibility timeout โ€” message invisible to others while being processed.
Limitations: No message replay. Limited throughput for high-volume streaming. Fan-out harder than Kafka. No long-term retention.
RabbitMQ exchanges: Direct (exact key) ยท Fanout (broadcast) ยท Topic (pattern: order.*.created) ยท Headers. SQS: Standard (at-least-once, best-effort order) vs FIFO (exactly-once, strict order, 3K msg/sec).

Message Streams - Apache Kafka

Messages continuously appended to a durable ordered log โ€” 1M+ events/sec, replayable history, multiple independent readers.

โ–ธ How Kafka Works
Producer 1 Producer 2 Producer 3 Kafka Cluster Partition 0 Partition 1 Partition 2 offsets โ†’ Group A C1 C2 Group B C3 C4 Each group reads ALL independently Partitions replicated (ISR) acks=all โ†’ no data loss
โ–ธ Top Kafka Use Cases

Event Streaming

๐Ÿ“ฑ๐ŸŒ๐Ÿ–ฅ๏ธ Kafka Spark Stream ๐Ÿ“Š Real-time analytics, ML pipelines

Log Aggregation

Srv1 Srv2 Connect Kafka ELK Stack Centralize logs from all services

Message Queuing

Prod 1 Prod 2 Kafka Cons 1 Cons 2 Decouple services, async processing

Web Activity Tracking

๐Ÿ“ฑ๐Ÿ–ฅ๏ธ Connect Kafka Spark ๐Ÿ“Š Clicks, views โ†’ real-time dashboards

Data Replication (CDC)

DB1 DB2 Connect Kafka DB3 DB4 Debezium CDC โ†’ replicate across DCs
ConceptDetailGuarantee
PartitionUnit of parallelism. Append-only log.Ordered within partition, not across.
OffsetSequential ID per message.Consumer can rewind/replay from any offset.
ISRIn-Sync Replicas caught up with leader.acks=all + min.insync.replicas=2 = no data loss.
Consumer GroupEach partition โ†’ exactly 1 consumer per group.Multiple groups = independent reads.
Compacted TopicKeep latest value per key.State snapshots, changelogs.
KRaftReplaces ZooKeeper (Kafka 3.3+).Simpler operations.
Delivery Guarantees: At-most-once (acks=0, lossy). At-least-once (acks=all, may duplicate). Exactly-once (idempotent producer + transactions).
Limitations: Operational complexity. Ordering per partition only. No per-message acking. Overkill for low volume.
Real-world: LinkedIn 4T+ events/day. Uber โ€” trip events, surge pricing. Netflix โ€” recommendations pipeline.
โ–ธ Real-World: Uber Ride Events in Kafka โ€” Complete Flow
Uber Ride Lifecycle โ€” Events Produced to Kafka E1 RIDE_REQUESTED E2 DRIVER_MATCHED E3 DRIVER_EN_ROUTE E4 RIDER_PICKED_UP E5 TRIP_IN_PROGRESS E6 GPS_UPDATE ร—N E7 TRIP_COMPLETED E8 FARE_CALCULATED E9 PAYMENT_CHARGED E10 RATING_SUBMITTED Producers Rider App Driver App Trip Service Payment Service key = trip_id hash(trip_id) % 3 Kafka Cluster โ€” Topic: "uber.ride-events" 3 Partitions ร— RF=3 | acks=all | min.insync.replicas=2 | retention=7d Partition 0 (Leader: Broker 1) key: trip_101 โ†’ hash % 3 = 0 RIDE_REQ offset 0 DRV_MATCH offset 1 EN_ROUTE offset 2 PICKED_UP offset 3 IN_PROG offset 4 COMPLETE offset 5 FARE_CALC offset 6 โ† ORDERED: all trip_101 events in sequence (same partition) โ†’ Partition 1 (Leader: Broker 2) key: trip_202 โ†’ hash % 3 = 1 RIDE_REQ offset 0 DRV_MATCH offset 1 GPS ร—3 offset 2-4 COMPLETE offset 5 PAYMENT offset 6 RATING offset 7 HW=8 โ† ORDERED: all trip_202 events in sequence โ†’ Partition 2 (Leader: Broker 3) key: trip_303 โ†’ hash % 3 = 2 RIDE_REQ offset 0 DRV_MATCH offset 1 GPS ร—5 offset 2-6 IN_PROG offset 7 LEO=8 โ† trip_303 still in progress... more events coming โ†’ RF=3: Each partition replicated to ALL 3 brokers (1 Leader โ˜… + 2 ISR followers) ISR = In-Sync Replicas | HW = High Watermark (committed) | LEO = Log End Offset (latest written) KRaft Controller Quorum โ€” leader election, partition assignment (no ZooKeeper) Consumer Group: surge-pricing Flink โ€” count rides/zone/5min โ†’ dynamic pricing C1 โ†’ P0 committed: 5 C2 โ†’ P1 committed: 7 C3 โ†’ P2 committed: 6 lag: P0=1 | P1=1 | P2=2 (events behind) __consumer_offsets topic stores committed positions Consumer Group: analytics Spark batch โ€” daily trip reports โ†’ BigQuery C4 โ†’ P0 committed: 3 C5 โ†’ P1,P2 committed: 4, 5 lag: P0=3 | P1=4 | P2=3 (batch behind real-time) auto.offset.reset=earliest โ†’ replay from beginning on new deploy Consumer Group: eta-service Real-time ETA โ†’ push notification to rider C6 โ†’ P0 committed: 6 C7 โ†’ P1 committed: 8 C8 โ†’ P2 committed: 8 lag: P0=0 | P1=0 | P2=0 (fully caught up!) max.poll.interval.ms=300000 | session.timeout.ms=10000 Offset Mechanics โ€” How Consumers Track Position Partition 0: 0 1 2 3 4 5 6 ... analytics (C4) committed=3 surge (C1) committed=5 eta (C6) committed=6 HW=6 LEO=7 Offset Concepts: Committed Offset: last offset consumer confirmed processing (stored in __consumer_offsets) High Watermark (HW): last offset replicated to ALL ISR replicas (consumers can only read up to HW) Log End Offset (LEO): last offset written to leader (not yet replicated to all ISR) Consumer Lag: HW - committed offset = messages behind (monitor with Burrow/Kafka Lag Exporter) End-to-End Flow: trip_101 Lifecycle Through Kafka โ‘  Rider App produce(key=trip_101 val=RIDE_REQUESTED) acks=all โ‘ก Partitioner hash(trip_101) % 3 = Partition 0 batch + lz4 compress โ‘ข Leader (B1) append to P0 log offset=0, LEO=1 write to .log segment โ‘ฃ ISR Replicate B2, B3 fetch from B1 both ACK โ†’ HW=1 min.insync.replicas=2 โœ“ โ‘ค ACK Producer acks=all satisfied producer callback OK idempotent: seq=0 โ‘ฅ Consumer Poll C1 polls P0 from offset 0 (or last+1) max.poll.records=500 โ‘ฆ Process + Commit process event โ†’ commit offset=1 stored in __consumer_offsets at-least-once (process then commit) ORDERING GUARANTEE: All events for trip_101 arrive at P0 in exact producer order (E1โ†’E2โ†’E3โ†’...โ†’E10) Different trips (trip_202, trip_303) may be on different partitions โ€” NO ordering across partitions, only WITHIN Every Kafka Concept in This Diagram Partitioning: key=trip_id โ†’ same partition RF=3 + ISR: 3 copies, min 2 must ACK Offsets: per-consumer position tracking HW/LEO: committed vs written boundary Consumer Groups: independent reads, own offsets KRaft: controller quorum, no ZooKeeper Ordering: guaranteed within partition only Lag: HW - committed = consumer behind
โ–ธ Kafka: Producers, Consumers, Brokers

Producers

Acks: 0=fire-forget ยท 1=leader ACK ยท all=all ISR ACK (safest)
Retries: On failure, retry with backoff. Idempotent producer (enable.idempotence=true) deduplicates retries via sequence numbers.
Batching: linger.ms (wait to fill batch) + batch.size (max bytes). Bigger batch = higher throughput, slight latency.
Compression: snappy (fast), lz4 (balanced), zstd (best ratio). Compress at producer, decompress at consumer.
Partitioner: Default = hash(key) % partitions. Sticky partitioner for null keys (batch to one partition, then rotate).

Consumers

Consumer Group: Each partition โ†’ exactly 1 consumer in group. Add consumers = parallel. Max consumers = partitions.
Offsets: Track position per partition. Committed to __consumer_offsets topic. auto.offset.reset: earliest (replay all) / latest (new only).
Delivery: At-most-once (commit before process) ยท At-least-once (process then commit, may re-read) ยท Exactly-once (transactions).
Rebalance: Consumer joins/leaves โ†’ partitions reassigned. Cooperative/incremental rebalance (Kafka 2.4+) avoids stop-the-world.
Poll: max.poll.records, max.poll.interval.ms. Too slow = kicked from group.

Brokers & Topics

Broker: Single Kafka server. Cluster = N brokers. Each broker stores subset of partitions.
Replication: RF=3 โ†’ 3 copies. Leader handles reads/writes. Followers in ISR (In-Sync Replicas) replicate. min.insync.replicas=2 + acks=all = no data loss.
Segments: Partition = ordered log split into segment files (1GB default). Each segment has .log + .index + .timeindex.
Retention: Time-based (retention.ms=7 days) or size-based (retention.bytes=1TB). Log compaction = keep latest value per key (state snapshots).
KRaft: Replaces ZooKeeper (Kafka 3.3+). Controller quorum via Raft. Simpler ops, faster failover.
ConfigSettingEffect
acks=all + min.insync=2Producer waits for 2+ replicasZero data loss (even if 1 broker dies)
enable.idempotence=trueSequence numbers per producerNo duplicates on retry
TransactionsAtomic multi-partition writesExactly-once end-to-end
Unclean leader election=falseOnly ISR members can become leaderNo data loss (may reduce availability)
Log compactionKeep latest value per keyState snapshots, changelogs (KTable)

Pub/Sub (SNS / Google Pub/Sub)

Publishers broadcast (fan-out) to a topic โ€” every subscriber receives its own independent copy

Producer publishes to TOPIC (not a queue)
       โ†“ topic: "order-events"
  โ”Œโ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
  โ†“    โ†“        โ†“            โ†“
 Sub A  Sub B   Sub C       Sub D
(Email) (Analytics) (Audit) (Webhook)

Each subscriber gets its OWN copy of every message.
Subscribers are independent โ€” different speeds, different retry logic.
Guarantees: At-least-once delivery to every subscriber. Independent processing. Fully managed. Google Pub/Sub: seek to timestamp, ordering keys, exactly-once delivery. SNS: push to SQS/Lambda/HTTP/email/SMS.
Limitations: No long retention by default. No consumer-side replay in SNS. Ordering not guaranteed by default. Not suited for point-to-point work queues.
Real-world: Spotify โ€” Google Pub/Sub for event-driven microservices. Shopify โ€” SNS+SQS for order event fan-out. Uber โ€” Google Pub/Sub for cross-service event propagation.

Message Queues vs Event Streams vs Pub/Sub

Three fundamentally different messaging patterns

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ MESSAGE QUEUE (Point-to-Point)                               โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
Producer โ†’ [ Queue ] โ†’ Consumer A
                         (message deleted after ACK)
        One message โ†’ one consumer (ownership)
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ EVENT STREAM (Log/replay-based)                              โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
Producer โ†’ [ Append-only Log ] โ†’ offset 0 โ†’ offset 5 โ†’ offset 9
                                  โ”‚            โ”‚            โ”‚
                              Group A      Group B      Group C
      Data is NEVER removed, only appended
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ PUB/SUB (Fan-out)                                            โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
Producer โ†’ [ Topic ]
              โ”œโ†’ Subscriber A โ†’ receives COPY (independent)
              โ”œโ†’ Subscriber B โ†’ receives COPY (independent)
              โ””โ†’ Subscriber C โ†’ receives COPY (independent)
      One event โ†’ many independent receivers
FeatureMessage QueueEvent StreamPub/Sub
DefinitionMessages wait in a queue for a consumer to pick up & processMessages appended to a durable ordered log โ€” replayable historyPublisher broadcasts to a topic; every subscriber gets its own copy
Core IdeaQueue of tasks / jobsDurable ordered event logBroadcast events to subscribers
Concrete Question"Which worker should process this image upload / email / payment retry?""What exactly happened in this system / user journey over time?""Which systems should know that an order was placed?"
Why this fits bestBest when exactly one worker should process a task / jobBest when events must be retained as an ordered replayable historyBest when one event must be broadcast (fan-out) to many independent subscribers
Why others don't fitPub/Sub broadcasts to many consumers causing duplicate processing; Streams retain/replay history โ€” unnecessary for one-time task executionQueue removes messages after processing; Pub/Sub focuses on live delivery rather than durable replayable event historyQueue is designed for task distribution where only one consumer gets the message; Streams add retention/replay/ordering overhead when only real-time notification is needed
ModelCompeting consumers โ€” task distribution & load balancingIndependent readers โ€” each group reads the full log at its own paceFan-out โ€” one publish, N independent deliveries
ConsumptionOne consumer per messageMultiple groups, each reads allAll subscribers get copy
After readDeletedRetainedGone (SNS) or short retention
DeliveryAt-least-once (consumer must be idempotent)At-least-once ยท exactly-once (idempotent producer + transactions)At-least-once to every subscriber
RetentionNo long-term retention โ€” gone after ACKConfigurable (days / size / compaction)Short (SNS: none ยท Pub/Sub: 7 days)
Replayโœ—โœ“ (rewind to any offset)Limited
Orderingโœ“ per queue (FIFO)โœ“ per partitionโœ— by default
Throughput~10K-100K msg/sec~1M+ msg/sec~100K-1M msg/sec
Best forTask distribution, job queuesEvent sourcing, CDC, analyticsNotifications, fan-out, triggers
ExampleRabbitMQ, SQSKafka, Kinesis, Redis StreamsSNS, Google Pub/Sub
Key Insight: The fundamental difference is the consumption model. Queues are competing consumers (work distribution). Streams are independent consumers (each reads everything). Pub/Sub is broadcast (everyone gets a copy). Many real systems combine them โ€” e.g., SNS โ†’ SQS โ†’ Lambda.

Dead Letter Queue (DLQ)

Quarantine poison messages so the main queue keeps flowing โ€” critical for fault isolation & observability

โ–ธ DLQ Lifecycle โ€” Retry, Quarantine & Recovery
Producer publishes msg Main Queue maxReceiveCount = 5 visibilityTimeout = 30s retention = 14 days Consumer process message ACK on success NACK on failure retry (back-off: 1s โ†’ 2s โ†’ 4s โ†’ 8s โ†’ 16s) after 5 failures receiveCount > max โ˜  DLQ poison messages retention = 14 days โš  ALARM: depth > 0 ๐Ÿ”” CloudWatch PagerDuty / Slack alert โ™ป Redrive fix bug โ†’ replay back to main queue ๐Ÿ” Inspect debug payload archive to S3 ๐Ÿ—‘ Purge discard if obsolete โœ“ ACK msg deleted Legend happy path failure path retry loop recovery optional
ConceptDetailBest Practice
maxReceiveCountNumber of delivery attempts before moving to DLQ3โ€“5 retries with exponential back-off
Visibility TimeoutTime message is hidden from other consumers during processingSet to 6ร— avg processing time
Redrive PolicyConfig linking source queue to its DLQAlways configure โ€” never lose messages silently
Redrive Allow PolicyControls which source queues can target this DLQRestrict to specific source queues
Retention PeriodHow long DLQ keeps messages before auto-deletion14 days (max for SQS) โ€” gives time to investigate
โ–ธ DLQ Implementation by Broker
BrokerMechanismConfigurationRecovery
SQSSource queue โ†’ redrive policy โ†’ DLQ (separate queue)RedrivePolicy: {deadLetterTargetArn, maxReceiveCount}StartMessageMoveTask API (redrive)
RabbitMQDead-letter exchange (DLX) on TTL/reject/nackx-dead-letter-exchange + x-dead-letter-routing-keyShovel plugin or manual republish
KafkaApp writes failed events to topic.DLTSpring Kafka: @RetryableTopic(attempts=3) + @DltHandlerReplay from DLT topic (seek to beginning)
Azure SBBuilt-in $DeadLetterQueue sub-queueMaxDeliveryCount on subscription/queueService Bus Explorer โ€” peek & resubmit
GCP Pub/SubDead-letter topic on subscriptiondeadLetterPolicy: {deadLetterTopic, maxDeliveryAttempts}Pull from dead-letter subscription
โ–ธ Why Messages End Up in DLQ

Poison Messages

  • Malformed payload (invalid JSON/XML)
  • Schema mismatch (missing required fields)
  • Encoding errors (wrong charset)
  • Message too large for consumer buffer

Transient Failures

  • Downstream service unavailable
  • Database connection timeout
  • Rate limiting / throttling
  • Network partition (temporary)

Logic Errors

  • Unhandled exception in consumer code
  • Business rule violation
  • Referential integrity failure
  • Idempotency key collision
Best Practices: Always alarm on DLQ depth > 0 โ€” even 1 message means something is broken. Set up CloudWatch / Prometheus metrics on ApproximateNumberOfMessagesVisible. Include correlation IDs and original timestamps in message headers for debugging. Use separate DLQs per source queue to isolate failure domains.
Recovery Playbook: โ‘  Alert fires โ†’ โ‘ก Inspect DLQ messages (peek, don't consume) โ†’ โ‘ข Identify root cause (schema? downstream? bug?) โ†’ โ‘ฃ Fix the consumer/downstream โ†’ โ‘ค Redrive messages back to source queue โ†’ โ‘ฅ Verify processing succeeds โ†’ โ‘ฆ Post-mortem if recurring.
Anti-patterns: Ignoring DLQ โ€” messages expire silently. No alarm โ€” failures go unnoticed for days. Infinite retries without DLQ โ€” blocks the entire queue. Same retention on DLQ as source โ€” messages may expire before investigation. No metadata โ€” can't trace why message failed.
Real-world: Uber โ€” DLQ per microservice, auto-redrive after circuit breaker resets. Netflix โ€” DLQ + S3 archival for compliance audit trail. Stripe โ€” webhook DLQ with exponential backoff (5 retries over 3 days).

Event Sourcing

Persist facts (events) as the source of truth โ€” derive state by replaying the event log. Never mutate, only append.

โ–ธ Event Sourcing Architecture โ€” Append, Replay, Project
Command CreateOrder Aggregate validate business rules emit event(s) optimistic concurrency Event Store append-only, immutable log OrderCreated v1 | seq 0 ItemAdded v1 | seq 1 OrderPaid v2 | seq 2 stream: order-{orderId} expectedVersion for concurrency control retention: forever (or compacted snapshots) ๐Ÿ“ธ Snapshot state @ seq N speeds up replay every 100 events subscribe / poll Projections (Read Models) Order List PostgreSQL denormalized view Search Index Elasticsearch full-text search Analytics ClickHouse OLAP aggregates Query GetOrderById โช Time Travel / Replay Rebuild projection from event 0 Debug: "what was state at 3pm yesterday?" Fix bug โ†’ replay โ†’ corrected view ๐Ÿ“‹ Audit Trail (Free) Every state change is an event Who did what, when, why Compliance: SOX, GDPR, PCI-DSS ๐Ÿ”„ Temporal Queries Point-in-time state reconstruction Bi-temporal: valid time + transaction time Retroactive corrections possible
ConceptDetailImplementation
EventImmutable fact that happened โ€” past tense namingOrderCreated, PaymentReceived, ItemShipped
StreamOrdered sequence of events for one aggregateorder-{orderId} โ€” one stream per entity instance
AggregateConsistency boundary โ€” validates commands, emits eventsLoad from stream โ†’ apply events โ†’ check invariants โ†’ emit
ProjectionRead model built by processing eventsSubscribe to stream โ†’ update denormalized view (async)
SnapshotMaterialized state at a point in timeStore every N events โ€” replay only from snapshot forward
IdempotencyProcessing same event twice must produce same resultTrack last processed position / use event ID as dedup key
โ–ธ Event Store Technologies
StoreTypeStrengthsConsiderations
EventStoreDBPurpose-builtNative projections, subscriptions, optimistic concurrencySmaller community, self-hosted
KafkaLog-basedHigh throughput, built-in replication, ecosystemNo per-stream concurrency, compaction โ‰  snapshots
PostgreSQLRDBMSACID, familiar, NOTIFY for subscriptionsManual stream management, polling for projections
DynamoDBNoSQLServerless, DynamoDB Streams for projections25-item transaction limit, cost at scale
MartenLibrary (.NET)PostgreSQL-backed, projections built-in.NET ecosystem only
โ–ธ Key Patterns & Challenges

Schema Evolution

  • Upcasting: transform old events to new schema on read
  • Versioning: OrderCreated_v2 with migration
  • Weak schema: add optional fields (backward compatible)
  • Never delete/modify stored events

Snapshotting Strategy

  • Snapshot every N events (e.g., 100)
  • Snapshot on time interval (e.g., daily)
  • Snapshot on aggregate complexity threshold
  • Store snapshot + version in separate stream

Handling Side Effects

  • Process managers / sagas for multi-aggregate workflows
  • Outbox pattern: event + side-effect in same transaction
  • Idempotent handlers: safe to replay
  • Compensating events for "undo" (not delete)

Common Pitfalls

  • GDPR: "right to erasure" conflicts with immutability โ†’ crypto-shredding
  • Large aggregates: too many events โ†’ slow replay โ†’ need snapshots
  • Event granularity: too fine = noise, too coarse = lost info
  • Projection lag: eventual consistency confuses users
Wins: Perfect audit trail โ€” every state change recorded. Time-travel debugging โ€” reconstruct state at any point. Rebuild projections โ€” fix bugs, replay, get corrected views. Natural fit for financial ledgers, order systems, collaboration tools.
Costs: Schema evolution of events is hard (upcasting, versioning). Eventual consistency on read models. Snapshot/compaction strategy needed for long-lived aggregates. Steeper learning curve โ€” team must think in events, not state.
Real-world: Stripe โ€” payment state machine as events. Datomic โ€” immutable database (event-sourced by design). LMAX Exchange โ€” event-sourced trading engine (6M orders/sec). Git โ€” commits are events, working tree is projection.

CQRS (Command Query Responsibility Segregation)

Separate the write model (commands) from the read model (queries) โ€” scale, optimize, and evolve them independently

โ–ธ CQRS Architecture โ€” Write Path vs Read Path
WRITE SIDE (Command Path) Client POST /orders Command Handler validate input load aggregate apply business rules Write Store normalized schema optimized for writes PostgreSQL / EventStore Event Bus publish domain events Kafka / SNS / EventBridge guaranteed delivery async sync (eventual consistency) READ SIDE (Query Path) Projector consume events update read models Read Stores Redis Elastic DynamoDB Materialized Query Handler GET /orders?status= optimized reads Client GET (fast) Key Benefits โœ“ Scale reads independently โœ“ Optimize each store for its job โœ“ Multiple read models per write โœ“ Evolve independently Tradeoffs โš  Eventual consistency โš  More infrastructure โš  Sync failure handling
ConceptWrite SideRead Side
ModelDomain model / aggregates โ€” enforces invariantsDenormalized views โ€” optimized for specific queries
StoreNormalized RDBMS or event storeRedis, Elasticsearch, DynamoDB, materialized views
ScaleVertical (consistency matters)Horizontal (read replicas, caches, CDN)
ConsistencyStrong (ACID transactions)Eventual (async projection updates)
Schema3NF โ€” no redundancyDenormalized โ€” pre-joined, pre-computed
โ–ธ When to Use (and When NOT to)

โœ“ Good Fit

  • Read/write ratio heavily skewed (100:1 reads)
  • Complex queries need different data shapes
  • Write model has complex business rules
  • Multiple read representations needed (search, reports, API)
  • Event sourcing already in use
  • Microservices with separate read/write services

โœ— Bad Fit

  • Simple CRUD โ€” overhead not justified
  • Strong consistency required on reads (banking UI)
  • Small team โ€” dual model doubles maintenance
  • Low traffic โ€” no scaling benefit
  • Read-after-write needed immediately
  • Domain is simple with no complex queries
โ–ธ CQRS + Event Sourcing (Power Combo)
Pattern: Commands โ†’ Aggregate โ†’ Events persisted โ†’ Projectors subscribe โ†’ Read models updated async. The event store IS the write model. Projections ARE the read models. Rebuild any projection by replaying events from the beginning.
Consistency strategies: Pull-based: query handler checks if projection is up-to-date (compare position). Push-based: projector publishes "ready" event. Hybrid: serve stale + indicate "updating" in UI. Inline projection: update synchronously in same transaction (sacrifices scalability for consistency).
Anti-patterns: Querying the write model โ€” defeats the purpose. Bidirectional sync โ€” creates conflicts. Shared database for read/write โ€” coupling returns. Over-engineering โ€” CQRS for a TODO app.
Real-world: Microsoft โ€” Azure architecture patterns (official CQRS guidance). Uber โ€” trip service (write) + rider-facing API (read from cache). Netflix โ€” catalog writes vs personalized read views. Shopify โ€” order writes vs merchant dashboard reads.

Ordering Guarantees

Where order is preserved โ€” and where it is not. Critical for financial transactions, state machines, and causal consistency.

โ–ธ Partition-Based Ordering โ€” How Keys Determine Order
Topic: "orders" โ€” 3 partitions, key = orderId Producers order-A โ†’ P0 order-B โ†’ P1 order-C โ†’ P2 murmur2(key) % N P0 (order-A) CREATED off:0 PAID off:1 SHIPPED off:2 DELIVERED off:3 โœ“ ORDERED P1 (order-B) CREATED off:0 CANCELLED off:1 โœ“ ORDERED P2 (order-C) CREATED off:0 PAID off:1 SHIPPED off:2 โœ“ ORDERED โš  NO ordering across partitions (A vs B vs C independent) Consumer Group C1 โ† P0 C2 โ† P1 C3 โ† P2 1 partition : 1 consumer Ordering Strategies โœ“ Same key = same partition = ordered โšก Single partition = global order (no parallelism) ๐Ÿ”ข Sequence numbers for cross-partition
BrokerOrder ScopeMechanismThroughput Impact
KafkaPer-partition (key โ†’ partition)murmur2(key) % numPartitionsMore partitions = more parallelism, less global order
SQS StandardBest effort โ€” may reorderDistributed architecture, no ordering guaranteeUnlimited throughput
SQS FIFOStrict per MessageGroupIdDeduplication + sequencing per group3,000 msg/sec (batching: 30K)
RabbitMQPer queue (single consumer)FIFO within a single queuePrefetch count affects perceived order
Pub/SubOptional ordering keysorderingKey on publishOrdered messages go to same region
KinesisPer shard (partition key)MD5(partitionKey) โ†’ shard1 MB/sec or 1000 records/sec per shard
Azure Event HubsPer partitionPartition key โ†’ consistent hash1 MB/sec ingress per throughput unit
โ–ธ Patterns for Maintaining Order

Partition Key Design

  • userId: all user events ordered (profile, orders, sessions)
  • orderId: order lifecycle events in sequence
  • accountId: financial transactions ordered per account
  • deviceId: IoT telemetry ordered per device
  • โš  Hot keys: popular users โ†’ partition hotspot

Cross-Partition Ordering

  • Sequence numbers: embed monotonic seq in payload
  • Vector clocks: causal ordering across partitions
  • Lamport timestamps: logical ordering
  • Single partition: sacrifice parallelism for global order
  • External sequencer: Redis INCR or DB sequence
Key Insight: Use a stable partition key (userId, accountId, orderId) so all related events land in the same partition and stay ordered. Choose the key based on your consistency boundary โ€” what entity needs its events in order?
Rebalancing risk: When partitions are added/removed, key-to-partition mapping changes. Use sticky partitioning or consistent hashing to minimize disruption. During rebalance, consumers may see temporary out-of-order delivery โ€” handle with buffering + reordering window.
Anti-patterns: Random partition key โ€” destroys ordering. Too few partitions โ€” bottleneck on throughput. Assuming global order in multi-partition topics. Processing out-of-order without idempotency โ€” corrupted state.

Schema Registry

Central contract for event payloads โ€” prevents "it broke prod". Enforces backward/forward compatibility across producer and consumer versions.

โ–ธ Schema Registry โ€” Serialize, Register, Validate
Producer serialize payload embed schema ID [magic byte][schema_id][data] โ‘  register schema โ‘ก produce Schema Registry stores schema versions validates compatibility returns schema ID v1 โ†’ v2 โ†’ v3 (Avro/Protobuf/JSON) Kafka Topic [0][schema_id=42][avro bytes...] schema ID embedded in every message Consumer read schema ID fetch schema deserialize with schema โ‘ฃ fetch schema by ID โ‘ข consume CI/CD Pipeline validate schema before merge compatibility check Compatibility Modes BACKWARD FORWARD FULL NONE (dangerous) Serialization Formats Avro โ€” compact binary, schema evolution Protobuf โ€” typed, fast, gRPC native JSON Schema โ€” human readable Thrift โ€” legacy, Facebook origin Plain JSON โ€” no schema (risky)
CompatibilityRuleProducers MayConsumers MayUse Case
BackwardNew schema can read old dataAdd optional fields, remove fieldsRead old data with new codeMost common โ€” consumers upgrade first
ForwardOld schema can read new dataAdd fields, remove optional fieldsRead new data with old codeProducers upgrade first
FullBoth backward + forwardOnly add/remove optional fieldsBoth directions workSafest โ€” independent deployments
TransitiveCompatible with ALL previous versionsStrictest constraintsAny version reads any otherLong-lived topics, many consumers
NoneNo checksAnythingMay breakDevelopment only โ€” never in prod
โ–ธ Schema Registry Implementations
ToolFormatsIntegrationKey Feature
Confluent Schema RegistryAvro, Protobuf, JSON SchemaKafka native, REST APIDe facto standard, subject strategies, schema references
AWS Glue Schema RegistryAvro, JSON Schema, ProtobufMSK, Kinesis, LambdaServerless, IAM integration, auto-registration
Apicurio RegistryAvro, Protobuf, JSON, OpenAPI, GraphQLKafka, HTTP, gRPCOpen source, CNCF, multi-format
Azure Schema RegistryAvroEvent HubsAzure-native, RBAC, client-side caching
Buf (BSR)ProtobufgRPC, ConnectBreaking change detection, linting, code gen
โ–ธ Schema Evolution Best Practices

Safe Changes (Backward Compatible)

  • Add optional field with default value
  • Add new enum value (if consumer ignores unknown)
  • Widen numeric type (int โ†’ long)
  • Add new union member (Avro)
  • Deprecate field (keep, stop writing)

Breaking Changes (Avoid)

  • Remove required field
  • Rename field (without alias)
  • Change field type (string โ†’ int)
  • Remove enum value
  • Change field from optional to required
Best Practices: Enforce compatibility in CI โ€” reject PRs that break schema contracts. Use subject naming strategies (TopicNameStrategy, RecordNameStrategy) to control scope. Cache schemas on producer/consumer side (schema ID โ†’ schema). Set FULL_TRANSITIVE for critical topics.
Subject Strategies: TopicNameStrategy (default) โ€” one schema per topic. RecordNameStrategy โ€” schema per record type (multiple types in one topic). TopicRecordNameStrategy โ€” schema per topic+record combo (most flexible).
Anti-patterns: No schema registry โ€” "just use JSON" leads to silent breakage. Compatibility = NONE in prod โ€” ticking time bomb. Not versioning schemas โ€” can't roll back. Tight coupling โ€” producer and consumer must deploy simultaneously.
Real-world: LinkedIn โ€” Avro + Confluent SR for all Kafka topics (thousands of schemas). Uber โ€” Protobuf + custom registry for gRPC services. Netflix โ€” Avro schemas with automated compatibility testing in CI. Shopify โ€” Protobuf for event-driven architecture with Buf for linting.