System Design Concepts

No fluff: visual, concise, interview-ready

🏗️ 5 · INFRASTRUCTURE

Load Balancer

L4 (TCP: fast, IP+port) vs L7 (HTTP: smart, URL/headers). ALB, NLB, NGINX, HAProxy

▸ What is a Load Balancer?
[Diagram: three clients → ⚖️ Load Balancer → Server 1, Server 2, Server 3 → DB]
▸ Layer 4 vs Layer 7

L4: Transport Layer

[Diagram: client 44.1.1.1 → L4 LB (IP + Port) → backends 44.3.3.3, 44.4.4.4, 44.5.5.5. Data packet: From 44.1.1.1, To 44.2.2.2, Message]
Routes by IP + Port only; blind to content.

L7: Application Layer

[Diagram: client 44.1.1.1 → HTTPS → L7 LB (URL/Headers) → /posts to Post Service, /comments to Comment Svc. Data packet: From 44.1.1.1, To 44.2.2.2, Data: /posts]
Routes by URL, headers, cookies; content-aware.
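▸ Load Balancing Algorithms

Round Robin

[Diagram: LB → Srv 1, Srv 2, Srv 3] Requests rotate in order: 1→2→3→1→2→3…

Weighted RR

[Diagram: LB → Srv 1 (×3), Srv 2 (×1)] Srv 1 receives 75% of requests, Srv 2 receives 25%.

Least Conn

[Diagram: LB → server with 100 conn, server with 10 conn ✓] Picks the least busy server.

IP Hash

[Diagram: LB → Srv 1, Srv 2, Srv 3] hash(IP): same client → same server (session affinity).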
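A minimal Python sketch of the four algorithms above. Server names and function names are illustrative, not taken from any particular load balancer, and the weighted variant uses a naive expansion (production balancers usually interleave weighted picks more evenly).

```python
import hashlib
import itertools


def round_robin(servers):
    """Yield servers in a repeating 1 -> 2 -> 3 -> 1 ... cycle."""
    return itertools.cycle(servers)


def weighted_round_robin(weights):
    """Yield servers in proportion to integer weights (naive expansion)."""
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(expanded)


def least_connections(active):
    """Pick the server with the fewest open connections."""
    return min(active, key=active.get)


def ip_hash(client_ip, servers):
    """Map a client IP to a fixed server so repeat requests land together."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


servers = ["srv1", "srv2", "srv3"]

rr = round_robin(servers)
first_six = [next(rr) for _ in range(6)]

wrr = weighted_round_robin({"srv1": 3, "srv2": 1})
first_four = [next(wrr) for _ in range(4)]   # 3 of 4 picks go to srv1 (75%)

least_busy = least_connections({"srv1": 100, "srv2": 10})
sticky = ip_hash("44.1.1.1", servers)        # stable for a given IP and server list
```

Modulo hashing remaps most clients whenever the server list changes, so IP Hash affinity only holds while the pool is stable.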
▸ L4 vs L7 Comparison
Aspect | L4 (Transport Layer) | L7 (Application Layer)
Works On | IP + TCP/UDP: raw bytes, no content inspection | HTTP, gRPC, WebSocket: reads headers, URL, cookies
Routing By | IP address + port only | URL path, host, method, headers, cookies
Speed | Faster: no parsing overhead | Slightly slower: inspects every request
SSL | Passthrough: encrypted traffic forwarded as-is | Termination: decrypts, inspects, re-encrypts
Sticky Sessions | IP-based only | Cookie / header based, more reliable
Content Awareness | None: blind to payload | Full: can rewrite headers, redirect, A/B route
Use Case | TCP proxying, DB connections, raw throughput, gaming | API gateway, microservices routing, canary deploys
Examples | AWS NLB, HAProxy (TCP mode), NGINX (stream) | AWS ALB, NGINX (http), Envoy, Traefik, Istio
Health Check | TCP connect only (is port open?) | HTTP /health endpoint: checks actual app response
WebSocket | Native: TCP passthrough, no extra config | Needs explicit Upgrade header proxying config
Guarantees: No single point of failure (if LB is HA). Health checks auto-remove unhealthy backends. SSL termination at L7 offloads crypto from backends.
Sticky Sessions: Route the same client to the same backend. Problems: hot spots, and sessions are lost when that backend dies. Better: externalize session state to a shared store such as Redis.

API Gateway

Single entry point: auth, rate limiting, routing, SSL. Kong, Apigee, AWS API Gateway

[Diagram: request flow through an API Gateway]
Clients (🌐 Web, 📱 Mobile, 🖥️ PC) → 1 HTTP Request → API Gateway → Microservices (with Elasticsearch and Redis alongside)
Gateway steps as numbered in the diagram: 2 Parameter Validation · 3 Allow-list / Deny-list · 4 Authentication + Authorization · 5 Rate Limiting · 6 Dynamic Routing · 7 Service Discovery · 8 Protocol Conversion · 9 Error Handling + Logging · 10 Circuit Breaker + Cache
Gateway vs LB vs Mesh:
LB: distributes traffic (no business logic)
Gateway: North-South (external → internal)
Mesh: East-West (internal → internal)
BFF: separate gateway per client (web / mobile)
Tools: Kong · Apigee · AWS API GW · Envoy
Guarantee: An API Gateway provides a single enforcement point for cross-cutting concerns: auth, rate limiting, logging, and circuit breaking happen once at the edge, not duplicated in every service.
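Rate limiting is one of the checks a gateway applies at the edge. A token bucket is one common way to do it; the sketch below is illustrative only. The class, its parameters, and the fake clock are assumptions made for the example, and real gateways expose rate limiting as configuration rather than application code.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock                  # injectable so the demo is deterministic
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        """Return True if the request may pass, False if it should get a 429."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill for the time that passed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class FakeClock:
    """Hypothetical test clock: time only moves when we say so."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()
bucket = TokenBucket(rate=1, capacity=2, clock=clock)

burst = [bucket.allow() for _ in range(3)]   # two pass, the third is rejected
clock.now += 1.0                             # one second later: one token refilled
after_refill = bucket.allow()
```

In practice a gateway keeps one bucket per API key or client IP, usually in a shared store so every gateway node sees the same counts.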

Forward & Reverse Proxy

Forward = hide client (corporate proxy). Reverse = hide servers (NGINX, Cloudflare)

Reverse Proxy Guarantees: Backend isolation (clients never see internal IPs). SSL termination (decrypt at proxy, HTTP internally). Caching (serve cached responses without hitting origin). DDoS absorption at edge.

NGINX

Reverse proxy, LB, web server, API gateway. ~34% of all websites. Event-driven, non-blocking.

Architecture

[Diagram: 🌐 → Master → workers W1, W2, W3; non-blocking I/O (epoll); 1 thread = 1000s of conns]
Master manages config. Workers handle all connections via an event loop; no thread-per-request.

Web Server

[Diagram: 🌐 Request → NGINX web server → Response; 📄 HTML / CSS / JS / images served directly from disk; 100K+ concurrent connections]
Serves static content under high traffic. Event-driven: massive concurrency with minimal resources.

Reverse Proxy + LB

[Diagram: 🌐 📱 🖥️ clients → NGINX → App 1, App 2, App 3; clients never see backend IPs]
Hides backends, distributes traffic. Round Robin / Least Conn / IP Hash / Weighted.

SSL Termination

[Diagram: 🌐 HTTPS → NGINX (decrypt) → HTTP → App 1, App 2, App 3; 🔒 HTTPS in → 🔓 HTTP out; offloads crypto from backends]

Content Cache

[Diagram: Master → W1, W2, W3 → Proxy Cache, with Cache Loader and Cache Manager]
proxy_cache serves repeated requests without hitting the backend. TTL-based eviction.
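# Pool of app servers; least_conn picks the backend with the fewest active connections.
upstream backend {
    least_conn;
    server app1:8080 weight=3;  # weighted 3x relative to app2
    server app2:8080;
}
server {
    listen 443 ssl;
    # "listen ... ssl" requires a certificate and key; these paths are placeholders.
    ssl_certificate     /etc/nginx/certs/example.crt;
    ssl_certificate_key /etc/nginx/certs/example.key;
    location /api/ { proxy_pass http://backend; }  # TLS ends here; plain HTTP to the pool
    location / { root /var/www/html; }             # static files served directly from disk
}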
Real-world: Netflix (video delivery), Dropbox (replaced Apache, cut servers 75%), Kubernetes (ingress-nginx is a widely used Ingress controller; Kubernetes itself ships no default). Addresses C10K: event-driven workers handle 100K+ concurrent connections, where Apache's classic MPMs dedicate a process or thread to each connection.

Docker & Kubernetes

Docker packages apps into containers. Kubernetes (K8s) orchestrates them at scale.

Docker Concept | Detail
Image | Immutable template with app + dependencies. Built from a Dockerfile. Stored in a registry (Docker Hub, ECR).
Container | Running instance of an image. Lightweight isolation (shared kernel, not a full VM). Starts in seconds.
Volume | Persistent storage that survives container restarts.

K8s Concept | Detail
Pod | Smallest unit. 1+ containers sharing network/storage. Ephemeral.
Service | Stable network endpoint for pods (ClusterIP, NodePort, LoadBalancer).
Deployment | Declarative desired state. Rolling updates, rollbacks.
HPA | Horizontal Pod Autoscaler: scales on CPU/memory/custom metrics.
StatefulSet | Ordered, stable pod identities. For DBs, Kafka, ZooKeeper.
Ingress | HTTP routing rules (NGINX Ingress, Traefik). External traffic → services.
Deployment Strategies: Rolling (gradual, default) · Blue-Green (swap envs) · Canary (5% traffic first) · A/B (feature flags, Istio traffic split)
Guarantees: K8s continuously reconciles actual state toward desired state: if a pod dies, its controller creates a replacement. Self-healing via liveness probes (restart stuck containers) and readiness probes (stop routing to pods that are not ready). Service discovery via DNS.
Real-world: Google (K8s grew out of its internal Borg system). Spotify 2000+ services on K8s. Managed: EKS (AWS), GKE (Google), AKS (Azure).
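Desired-state reconciliation can be pictured as a loop that compares "what should exist" with "what does exist" and closes the gap. The sketch below is a toy model, not the Kubernetes controller code; the function name and pod naming scheme are made up for illustration.

```python
def reconcile(desired_replicas, running_pods, next_id):
    """One reconciliation pass: return the corrected pod list and next free id."""
    pods = list(running_pods)
    # Too few pods: create replacements until the count matches.
    while len(pods) < desired_replicas:
        pods.append(f"pod-{next_id}")
        next_id += 1
    # Too many pods: remove the surplus.
    del pods[desired_replicas:]
    return pods, next_id


# Initial rollout: nothing is running, three replicas are desired.
pods, next_id = reconcile(3, [], 0)
initial = list(pods)

# A pod dies; the next pass notices the gap and creates a replacement.
pods.remove("pod-1")
pods, next_id = reconcile(3, pods, next_id)
healed = list(pods)

# Desired state changes to one replica; the next pass scales down.
pods, next_id = reconcile(1, pods, next_id)
```

The key property is that the loop never needs to know why state diverged; it only acts on the difference.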

Service Mesh

Istio, Linkerd: sidecar proxies (Envoy in Istio, linkerd2-proxy in Linkerd) for service-to-service networking

Guarantees: mTLS everywhere (zero-trust). Automatic observability (metrics, traces per call). Traffic management (canary, retries, circuit breaking) via YAML, not code. vs API Gateway: Gateway = north-south (external→internal). Mesh = east-west (internal→internal).
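A mesh configures circuit breaking declaratively, but the behavior the proxy applies is a small state machine. The sketch below shows one simple version (closed → open → half-open trial); the class, thresholds, and exception name are assumptions for illustration and do not mirror any mesh's actual implementation.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the backend."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on too many failures, or re-open if the trial call failed.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def flaky():
    raise ConnectionError("backend down")


now = [0.0]                                 # hypothetical controllable clock
breaker = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])

outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError:
        outcomes.append("rejected")         # backend was not even contacted
    except ConnectionError:
        outcomes.append("failed")
```

After two real failures the third call is rejected immediately, which is what protects a struggling backend from retry storms.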

Multi-Region & Multi-Tenant

Deploying across regions for low latency, disaster recovery, and compliance

Pattern | How | Trade-off
Active-Passive | Primary region serves traffic; standby for failover | Simple but standby idle; failover delay (minutes)
Active-Active | Both regions serve traffic; data replicated | Low latency globally but conflict resolution needed
Follow-the-Sun | Route to region where it's business hours | Good for support/ops workloads
Guarantees: Multi-region provides disaster recovery (entire region can fail) and data residency compliance (GDPR: EU data stays in EU). Trade-off: cross-region replication lag and conflict resolution complexity.
▸ Multi-Tenant Architecture Types

Shared App, Shared DB

[Diagram: Tenants A, B, C → one App → one DB (rows keyed by tenant_id)]
Pro: Cheapest, simple ops
Con: Noisy neighbor, data leak risk
Ex: Salesforce, Slack

Shared App, Multi DB

[Diagram: Tenants A, B, C → one App → DB-A, DB-B, DB-C]
Pro: Strong data isolation, shared compute
Con: More DB ops, connection pooling
Ex: Shopify, GitHub Enterprise

Multi App, Multi DB

[Diagram: Tenant A → App A → DB-A; Tenant B → App B → DB-B; Tenant C → App C → DB-C]
Pro: Full isolation, no noisy neighbor
Con: Expensive, complex ops at scale
Ex: AWS accounts, dedicated SaaS
Choose: Shared/Shared for cost (Salesforce) → Shared/Multi-DB for data isolation (Shopify) → Multi/Multi for full isolation (enterprise/compliance). Most SaaS products start shared and migrate to a hybrid as they scale.
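For the Shared App, Shared DB model, isolation rests entirely on every query filtering by tenant_id. A small sketch using Python's built-in sqlite3; the table, column, and tenant names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (tenant_id TEXT NOT NULL, item TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO orders (tenant_id, item) VALUES (?, ?)",
    [("tenant-a", "laptop"), ("tenant-a", "mouse"), ("tenant-b", "desk")],
)


def orders_for(tenant_id):
    """Return one tenant's items; the WHERE clause is the only isolation boundary."""
    rows = conn.execute(
        "SELECT item FROM orders WHERE tenant_id = ? ORDER BY item",
        (tenant_id,),
    )
    return [item for (item,) in rows]


tenant_a_orders = orders_for("tenant-a")
tenant_b_orders = orders_for("tenant-b")
```

A single query that forgets the filter exposes every tenant's rows, which is the "data leak risk" listed above; the multi-DB models remove that failure mode at the cost of more databases to operate.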

Service Discovery

How services find each other in a dynamic fleet: IPs change constantly as containers scale, restart, and migrate

▸ Service Discovery: Client-Side vs Server-Side
[Diagram: three discovery patterns]
Client-Side Discovery: Client + SDK / library queries the registry (Consul / Eureka), which returns an IP list (10.0.0.5, 10.0.0.6, 10.0.0.7); the client load balances locally. ✓ No extra hop. ✗ SDK per language, client complexity.
Server-Side Discovery: Client sends simple HTTP to an LB / proxy (ALB / Envoy), which routes to a healthy instance (svc-A:1, svc-A:2, svc-A:3). ✓ Simple client, language-agnostic. ✗ Extra network hop, LB is SPOF.
DNS-Based Discovery (K8s CoreDNS / Consul DNS): Pod A resolves svc-b.ns → CoreDNS (A/SRV records) → K8s Service (ClusterIP / headless) → Pod B:1, Pod B:2. ✓ Zero SDK, stdlib resolver. ✗ TTL caching can serve stale IPs, no client-side LB.
K8s: ClusterIP (virtual IP + kube-proxy) vs Headless (returns all pod IPs directly)
Pattern | Who Resolves | Examples | Pros | Cons
Client-side | App library / SDK | Eureka + Ribbon, Consul SDK, gRPC name resolver | No extra hop, client LB | SDK per language, stale cache
Server-side | Load balancer / proxy | AWS ALB + Cloud Map, Envoy, Istio | Simple client, language-agnostic | Extra hop, LB is SPOF
DNS-based | Stdlib DNS resolver | K8s CoreDNS, Consul DNS, AWS Route 53 | Zero SDK, universal | TTL caching, no health-aware LB
Service Mesh | Sidecar proxy (transparent) | Istio/Envoy, Linkerd, Consul Connect | Zero app changes, mTLS, observability | Complexity, resource overhead
▸ Registry Implementations
Tool | Consensus | Health Check | Key Feature
Consul | Raft | HTTP, TCP, gRPC, script | Multi-DC, service mesh (Connect), KV store
etcd | Raft | Lease-based TTL | K8s backbone, strong consistency, watch API
Eureka | AP (peer replication) | Heartbeat (30s default) | Netflix OSS, self-preservation mode
ZooKeeper | ZAB | Ephemeral nodes | Mature, Kafka/Hadoop ecosystem
AWS Cloud Map | Managed | Route 53 health checks | Native AWS, API + DNS discovery
K8s (built-in) | etcd | Liveness + readiness probes | Zero setup, CoreDNS, Endpoints API
Health checks remove dead instances within seconds, which is critical for fast failover. Use liveness (is it alive?) + readiness (can it serve traffic?) probes. Deregister unhealthy instances immediately; don't wait for TTL.
K8s patterns: ClusterIP (virtual IP, kube-proxy routes; the default). Headless (returns all pod IPs; for StatefulSets and client-side LB). ExternalName (CNAME to an external service). Service Mesh (Envoy sidecar intercepts all traffic transparently).
Anti-patterns: Hardcoded IPs (break on any scale event). Long DNS TTLs (route to dead instances). No health checks (registry serves stale entries). Single registry without replication (SPOF).
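Client-side discovery with heartbeat-based health can be sketched as a registry that forgets instances whose heartbeat is older than a TTL. This is a toy in-memory model, not the Consul or Eureka API; the class, method names, and the 30-second TTL are assumptions for the example.

```python
import itertools
import time


class Registry:
    """In-memory registry: instances stay listed only while they heartbeat."""

    def __init__(self, ttl=30.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.last_seen = {}                 # (service, address) -> heartbeat time

    def heartbeat(self, service, address):
        """Register the instance, or refresh it if already known."""
        self.last_seen[(service, address)] = self.clock()

    def deregister(self, service, address):
        """Explicit removal, e.g. on graceful shutdown (no waiting for TTL)."""
        self.last_seen.pop((service, address), None)

    def lookup(self, service):
        """Return addresses whose last heartbeat is within the TTL."""
        now = self.clock()
        return sorted(
            addr
            for (svc, addr), seen in self.last_seen.items()
            if svc == service and now - seen <= self.ttl
        )


now = [0.0]                                 # hypothetical controllable clock
registry = Registry(ttl=30.0, clock=lambda: now[0])
for addr in ("10.0.0.5", "10.0.0.6", "10.0.0.7"):
    registry.heartbeat("svc-a", addr)

now[0] = 20.0                               # .5 and .6 keep heartbeating, .7 goes silent
registry.heartbeat("svc-a", "10.0.0.5")
registry.heartbeat("svc-a", "10.0.0.6")

now[0] = 40.0                               # .7 was last seen 40s ago: expired
healthy = registry.lookup("svc-a")

# The client load balances locally over the healthy list (round robin here).
picker = itertools.cycle(healthy)
targets = [next(picker) for _ in range(3)]
```

The gap between an instance dying and its TTL expiring is the window in which clients can still be routed to it, which is why explicit deregistration and short TTLs matter.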

CI/CD & Deployment Strategies

Ship safely without taking the site down: automate everything from commit to production

▸ Deployment Strategies Compared
[Diagram: four deployment strategies and the CI/CD pipeline]
Rolling Update: v1 v1 v2 v2 v2, replace 1-by-1. Zero downtime; mixed versions during rollout.
Blue / Green: BLUE (live) and GREEN; flip the router instantly. Instant rollback; 2× infrastructure cost.
Canary: 95% → v1, 5% → canary; monitor errors → ramp up. Low blast radius; slow rollout, complex routing.
Feature Flag: flag ON; decoupled from deploy, per-user targeting. Flag debt, testing complexity.
CI/CD Pipeline: Commit (PR merge) → Build (compile, lint) → Test (unit + integration) → Image (Docker push) → Staging (e2e tests) → Canary (5% → monitor) → Prod (100% full rollout) → Rollback (auto on SLO)
Auto-rollback on: error rate > 1% | p99 latency > 500ms | SLO breach
Tools: GitHub Actions, GitLab CI, ArgoCD, Spinnaker, Flux
Strategy | Downtime | Rollback Speed | Risk | Best For
Rolling | Zero | Minutes (re-roll) | Mixed versions during rollout | Stateless services, K8s default
Blue/Green | Zero | Instant (flip router) | 2× cost, DB schema must be compatible | Critical services, instant rollback needed
Canary | Zero | Fast (route 0% to canary) | Slow rollout, needs good observability | High-traffic services, gradual confidence
Feature Flag | Zero | Instant (toggle off) | Flag debt, testing matrix grows | Per-user rollout, A/B testing, kill switch
Recreate | Yes | Redeploy old version | Downtime during swap | Dev/staging, stateful apps that can't run mixed
Pipeline: commit → build → unit tests → image → deploy staging → e2e → promote prod (canary 5% → 25% → 100%) → auto-rollback on SLO breach. Use GitOps (ArgoCD/Flux) for declarative, auditable deployments.
Anti-patterns: Manual deploys (error-prone, no audit trail). No rollback plan ("we'll fix forward" fails at 3am). Big-bang releases (all changes at once, impossible to debug). No staging environment (prod is your test environment).
Real-world: Netflix (Spinnaker canary with automated analysis via Kayenta). Google (1% → 10% → 50% → 100% over days). Amazon (one-box deployment: single host first). GitHub (feature flags + Scientist for safe refactoring).
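The canary ramp and auto-rollback rule can be expressed as a small decision loop. The sketch below reuses the thresholds quoted above (error rate > 1%, p99 > 500ms) and the 5% → 25% → 100% stages; the function names and the sample metrics are hypothetical, and a real pipeline would pull metrics from its monitoring system.

```python
def canary_verdict(error_rate, p99_latency_ms, max_error_rate=0.01, max_p99_ms=500):
    """Return 'rollback' if either threshold is breached, else 'promote'."""
    if error_rate > max_error_rate or p99_latency_ms > max_p99_ms:
        return "rollback"
    return "promote"


def run_rollout(stages, metrics_for):
    """Walk through traffic percentages; stop at the first stage that breaches."""
    for percent in stages:
        error_rate, p99_latency_ms = metrics_for(percent)
        if canary_verdict(error_rate, p99_latency_ms) == "rollback":
            return ("rolled_back", percent)
    return ("promoted", stages[-1])


def healthy_metrics(percent):
    """Hypothetical release that stays healthy at every stage."""
    return (0.002, 180)


def degrading_metrics(percent):
    """Hypothetical release that only misbehaves once it sees real load."""
    return (0.002, 180) if percent < 25 else (0.03, 620)


stages = [5, 25, 100]
good_release = run_rollout(stages, healthy_metrics)
bad_release = run_rollout(stages, degrading_metrics)
```

The second release shows why the ramp has more than one step: a problem invisible at 5% traffic is caught at 25%, before it reaches everyone.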

Serverless / FaaS

Pay-per-invocation compute that scales to zero: no servers to manage, auto-scales per request

▸ Serverless Execution Model: Cold Start & Warm Invocation
[Diagram: event sources → execution environment → function → downstream]
Event Sources: API Gateway (HTTP), SQS / SNS / EventBridge, S3 (object events), DynamoDB Streams, Schedule (cron)
Execution Environment:
❄ Cold Start (first invocation): 1. Download code + layers 2. Init runtime (JVM, Node, Python). 100ms to 10s (Java worst)
🔥 Warm Invocation (reuse): handler() called directly, 1-5ms overhead
Function (your business logic): max 15 min (Lambda), max 60 min (GCF gen2); memory 128MB to 10GB; stateless (use external state); concurrency 1000 default
Downstream: DynamoDB / RDS Proxy, S3 / ElastiCache, SQS / SNS (async), External APIs, Step Functions (orchestrate)
Cost Model: requests × duration × memory. Free tier: 1M requests + 400K GB-seconds/month | ~$0.20 per 1M requests after
Good For | Bad For | Why
Event-driven glue | Long-running jobs (>15 min) | Timeout limits, cost per duration
Spiky / low-volume traffic | Sustained high RPS | Cost exceeds containers at ~1M req/day
Image/video processing | Stateful sessions / WebSockets | Stateless by design, no persistent connections
Cron jobs + queue workers | Low-latency APIs (p99 < 10ms) | Cold start adds 100ms-10s
Prototyping / MVPs | Complex orchestration | Use Step Functions for multi-step workflows
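The cost model (requests × duration × memory) is easy to turn into arithmetic. The sketch below uses the free tier and request price quoted in this section; the per-GB-second price is an assumption based on a commonly published Lambda x86 rate, so check current pricing before relying on it. The function name is illustrative.

```python
PRICE_PER_MILLION_REQUESTS = 0.20       # quoted in this section
PRICE_PER_GB_SECOND = 0.0000166667      # assumption: verify against current pricing
FREE_REQUESTS = 1_000_000               # free tier, per month
FREE_GB_SECONDS = 400_000               # free tier, per month


def monthly_cost(requests, avg_duration_ms, memory_mb):
    """Estimate a monthly bill in USD from request count, duration, and memory."""
    # Compute usage is billed in GB-seconds: seconds run x GB of memory configured.
    gb_seconds = requests * (avg_duration_ms / 1000) * (memory_mb / 1024)
    billable_requests = max(0, requests - FREE_REQUESTS)
    billable_gb_seconds = max(0.0, gb_seconds - FREE_GB_SECONDS)
    request_cost = billable_requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    compute_cost = billable_gb_seconds * PRICE_PER_GB_SECOND
    return round(request_cost + compute_cost, 2)


# 5M requests/month, 200 ms each, 1 GB memory:
# 1,000,000 GB-seconds used, 600,000 billable, plus 4M billable requests.
moderate = monthly_cost(5_000_000, 200, 1024)

# A small workload that fits inside the free tier.
tiny = monthly_cost(500_000, 100, 128)
```

Because duration and memory multiply, halving either one halves the compute part of the bill, which is why trimming execution time pays off quickly at volume.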
▸ Cold Start Mitigation Strategies

Reduce Cold Start

  • Provisioned Concurrency: pre-warm N instances ($$)
  • SnapStart: snapshot after init, restore on invoke (Java)
  • Slim runtimes: Go, Rust (10-50ms cold start)
  • Smaller packages: tree-shake, no unused deps
  • Keep-warm pings: scheduled invoke every 5 min
  • Init outside handler: DB connections in global scope

Serverless Platforms

  • AWS Lambda: most mature, 15 min max, SnapStart
  • GCP Cloud Functions: gen2 (Cloud Run based), 60 min
  • Azure Functions: durable functions for orchestration
  • Cloudflare Workers: V8 isolates, 0ms cold start, edge
  • Vercel/Netlify: frontend-focused, edge functions
  • Knative: K8s-native serverless (scale to zero)
Mitigate cold start: Provisioned concurrency for latency-sensitive paths. Slim runtimes (Go, Rust: 10-50ms cold start vs Java: 3-10s). SnapStart (Lambda Java). Init outside handler: DB connections and SDK clients in global scope are reused across warm invocations.
Anti-patterns: Lambda monolith (one giant function doing everything). Synchronous chains (Lambda → Lambda → Lambda; use Step Functions). VPC without NAT (a VPC-attached function has no internet access unless it goes through a NAT gateway or VPC endpoints; before AWS's 2019 networking change, VPC attachment could also add roughly 6-10s to cold starts). Ignoring concurrency limits (throttled at the 1000 default).
Real-world: Netflix (Lambda for encoding pipeline triggers). Coca-Cola (vending machine backend: spiky, event-driven). iRobot (IoT event processing). Capital One (real-time fraud detection). BBC (on-demand video transcoding).
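"Init outside handler" means expensive setup runs once per execution environment (the cold start) and is reused by every warm invocation. A minimal sketch in the shape of a Lambda-style Python handler; sqlite3 stands in for a real database client, and the counter exists only to make the reuse visible.

```python
import sqlite3

INIT_COUNT = 0                          # demo-only counter of how often setup ran


def create_connection():
    """Stand-in for expensive setup such as opening a DB connection."""
    global INIT_COUNT
    INIT_COUNT += 1
    return sqlite3.connect(":memory:")


# Global scope: executed once when the environment starts (cold start),
# then kept alive between invocations.
DB = create_connection()


def handler(event, context):
    """Per-request work only; reuses the connection created at init."""
    row = DB.execute("SELECT ?", (event["value"],)).fetchone()
    return {"echo": row[0], "connections_opened": INIT_COUNT}


first = handler({"value": 1}, None)     # first invocation in this environment
second = handler({"value": 2}, None)    # warm invocation: no new connection
```

Had create_connection() been called inside handler, every request would pay the setup cost, not only the cold one.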

Infrastructure as Code

Version-control your cloud just like your app: reproducible, auditable, reviewable infrastructure

▸ IaC Tools: Declarative vs Imperative
Tool | Language | Approach | State | Strength
Terraform | HCL (declarative) | Plan → Apply | S3 + DynamoDB lock / TF Cloud | Multi-cloud, huge provider catalog, modules
OpenTofu | HCL (declarative) | Plan → Apply | Same as Terraform | Open-source fork, community-driven
CloudFormation | YAML/JSON | Stack-based | AWS-managed (free) | Native AWS, drift detection, StackSets
Pulumi | TS/Python/Go/C# | Real code | Pulumi Cloud / self-managed | Loops, tests, abstractions, type safety
AWS CDK | TS/Python/Java/Go | Synthesizes to CFN | CloudFormation | L2/L3 constructs, AWS-blessed patterns
Crossplane | YAML (K8s CRDs) | Reconciliation loop | K8s etcd | GitOps-native, K8s-first, compositions
▸ IaC Workflow & Best Practices

GitOps Workflow

  • PR: change infra code → terraform plan in CI
  • Review: team reviews plan diff (what will change)
  • Merge: terraform apply runs automatically
  • State: remote backend (S3 + DynamoDB lock)
  • Drift: detect with scheduled plan runs
  • Modules: reusable components (VPC, EKS, RDS)

Best Practices

  • Environments: separate state per env (dev/staging/prod)
  • Least privilege: CI role has only needed permissions
  • No manual changes: all changes through code
  • Blast radius: small stacks, not one mega-stack
  • Secrets: never in code; use Vault, SSM, SOPS
  • Testing: tflint, checkov, terratest
# Terraform example: S3 bucket with versioning
resource "aws_s3_bucket" "logs" {
  bucket = "acme-logs-prod"
  tags   = { env = "prod", team = "platform" }
}

resource "aws_s3_bucket_versioning" "logs" {
  bucket = aws_s3_bucket.logs.id
  versioning_configuration { status = "Enabled" }
}
Workflow: PR → plan in CI (diff visible in PR comment) → team review → apply on merge. State in S3 + DynamoDB lock or Terraform Cloud. Use workspaces or directory structure for environment separation.
Testing IaC: tflint (lint HCL for errors). checkov / tfsec (security scanning: open S3 buckets, missing encryption). terratest (integration tests: deploy, validate, destroy). OPA/Sentinel (policy-as-code: enforce tagging, region restrictions).
Anti-patterns: ClickOps (manual console changes that drift from code). Mega-stack (one state file for everything: slow, risky). Secrets in state (the state file contains sensitive values, so encrypt it). No locking (concurrent applies corrupt state).
Real-world: HashiCorp (Terraform manages millions of cloud resources globally). Shopify (CDK for AWS infrastructure). Uber (custom IaC for multi-cloud). GitLab (Terraform + GitOps for all infrastructure changes).
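The plan step and drift detection both come down to diffing desired configuration against actual state. The sketch below is a toy illustration of that idea, not how Terraform is implemented; the resource names and the dictionary shape are invented for the example.

```python
def plan(desired, actual):
    """Compare desired config with actual state and list the changes needed."""
    return {
        "create": sorted(desired.keys() - actual.keys()),
        "delete": sorted(actual.keys() - desired.keys()),
        "update": sorted(
            name
            for name in desired.keys() & actual.keys()
            if desired[name] != actual[name]
        ),
    }


def has_drift(desired, actual):
    """Drift exists when a plan against unchanged code is not empty."""
    return any(plan(desired, actual).values())


# What the code says should exist.
desired = {
    "bucket.logs": {"versioning": "Enabled"},
    "bucket.assets": {"versioning": "Enabled"},
}

# What actually exists: versioning was changed by hand, and someone
# created an extra bucket in the console (ClickOps).
actual = {
    "bucket.logs": {"versioning": "Suspended"},
    "bucket.tmp": {"versioning": "Suspended"},
}

changes = plan(desired, actual)
```

Running the same comparison on a schedule, with no code changes, is the "detect drift with scheduled plan runs" practice listed above.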