HTTP, gRPC, WebSocket: reads headers, URL, cookies
| Aspect | L4 (transport) | L7 (application) |
| --- | --- | --- |
| Routing By | IP address + port only | URL path, host, method, headers, cookies |
| Speed | Faster: no parsing overhead | Slightly slower: inspects every request |
| SSL | Passthrough: encrypted traffic forwarded as-is | Termination: decrypts, inspects, re-encrypts |
| Sticky Sessions | IP-based only | Cookie / header based: more reliable |
| Content Awareness | None: blind to payload | Full: can rewrite headers, redirect, A/B route |
| Use Case | TCP proxying, DB connections, raw throughput, gaming | API gateway, microservices routing, canary deploys |
| Examples | AWS NLB, HAProxy (TCP mode), NGINX (stream) | AWS ALB, NGINX (http), Envoy, Traefik, Istio |
| Health Check | TCP connect only (is port open?) | HTTP /health endpoint: checks actual app response |
| WebSocket | Native: TCP passthrough, no extra config | Needs explicit Upgrade header proxying config |
Guarantees: No single point of failure (if the LB itself is deployed HA). Health checks automatically remove unhealthy backends. SSL termination at L7 offloads crypto from backends.
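As a rough illustration of how health checks take a backend out of rotation, here is a minimal round-robin sketch. The class name, the `report_health` hook, and the addresses are assumptions made for this example; a real load balancer runs its own probes on a timer.

```python
from itertools import count


class RoundRobinBalancer:
    """Round-robin over backends, skipping any marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._counter = count()

    def report_health(self, backend, ok):
        # Called by a (hypothetical) health checker after each probe.
        if ok:
            self.healthy.add(backend)
        else:
            self.healthy.discard(backend)

    def pick(self):
        # Try each backend at most once per call, starting from the next slot.
        for _ in range(len(self.backends)):
            candidate = self.backends[next(self._counter) % len(self.backends)]
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends")


lb = RoundRobinBalancer(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
lb.report_health("10.0.0.2:8080", ok=False)  # failed probe
picks = [lb.pick() for _ in range(4)]        # traffic skips 10.0.0.2
```

Marking the backend healthy again with `report_health(..., ok=True)` returns it to rotation without restarting anything.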
Sticky Sessions: Route the same client to the same backend. Problems: hot spots, and sessions are lost when that backend dies. Better: externalize session state to a shared store such as Redis.
API Gateway
Single entry point for auth, rate limiting, routing, SSL. Examples: Kong, Apigee, AWS API Gateway.
Guarantee: An API gateway provides a single enforcement point for cross-cutting concerns: auth, rate limiting, logging, and circuit breaking happen once at the edge, not duplicated in every service.
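The "enforce once at the edge" idea can be sketched as a single request handler that checks auth, then a per-key token bucket, then the route table. Everything named here (`VALID_KEYS`, `ROUTES`, `handle`, the rate and capacity values) is hypothetical and chosen only for illustration; it is not how Kong, Apigee, or AWS API Gateway are configured.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


VALID_KEYS = {"key-alice"}          # hypothetical API keys
ROUTES = {"/orders": "orders-svc"}  # hypothetical path -> service map
buckets = {}                        # one bucket per API key


def handle(path, api_key, clock=time.monotonic):
    """Return (status, detail) after auth, rate limiting, then routing."""
    if api_key not in VALID_KEYS:
        return 401, "unauthorized"
    if api_key not in buckets:
        buckets[api_key] = TokenBucket(rate=1, capacity=2, clock=clock)
    if not buckets[api_key].allow():
        return 429, "rate limited"
    service = ROUTES.get(path)
    if service is None:
        return 404, "no route"
    return 200, f"forwarded to {service}"
```

Because the checks live in one place, backend services behind the gateway do not each need their own copy of the auth and throttling logic.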
Guarantees: K8s guarantees desired-state reconciliation: if a pod dies, its controller replaces it. Self-healing via liveness/readiness probes. Service discovery via DNS.
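Desired-state reconciliation boils down to comparing what should exist with what does exist and emitting the difference. The sketch below is a toy model of that loop; the `reconcile` function and the `web-N` pod names are assumptions for illustration and not the Kubernetes controller API.

```python
def reconcile(desired_replicas, running_pods, next_id):
    """Return (pods, actions) that move the running set toward the desired count."""
    pods = list(running_pods)
    actions = []
    while len(pods) < desired_replicas:
        name = f"web-{next_id}"
        next_id += 1
        pods.append(name)
        actions.append(("create", name))
    while len(pods) > desired_replicas:
        actions.append(("delete", pods.pop()))
    return pods, actions


pods = ["web-0", "web-1", "web-2"]
pods.remove("web-1")                           # a pod dies
pods, actions = reconcile(3, pods, next_id=3)  # controller notices and repairs
```

Running `reconcile` again on the repaired set produces no actions, which is the property that makes the loop safe to run continuously.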
Real-world: Google (K8s descends from its internal Borg system). Spotify: 2000+ services on K8s. Managed offerings: EKS (AWS), GKE (Google), AKS (Azure).
Service Mesh
Istio, Linkerd: sidecar proxies (Envoy in Istio, linkerd2-proxy in Linkerd) for service-to-service networking
Guarantees: mTLS everywhere (zero-trust). Automatic observability (metrics and traces per call). Traffic management (canary, retries, circuit breaking) via YAML, not code. vs API Gateway: gateway = north-south (external → internal); mesh = east-west (internal → internal).
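In a mesh the sidecar applies circuit breaking from configuration, so application code never contains it. To show what that behavior amounts to, here is a small in-process sketch; the `CircuitBreaker` class and its thresholds are assumptions for this example and do not mirror Istio or Linkerd settings.

```python
import time


class CircuitBreaker:
    """Open after `max_failures` consecutive errors; retry after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

While the circuit is open, callers get an immediate error instead of waiting on a backend that is already struggling, which gives that backend room to recover.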
Multi-Region & Multi-Tenant
Deploying across regions for low latency, disaster recovery, and compliance
| Pattern | How | Trade-off |
| --- | --- | --- |
| Active-Passive | Primary region serves traffic; standby for failover | Simple, but standby sits idle; failover delay (minutes) |
| Active-Active | Both regions serve traffic; data replicated | Low latency globally, but conflict resolution needed |
| Follow-the-Sun | Route to region where it's business hours | Good for support/ops workloads |
Guarantees: Multi-region provides disaster recovery (an entire region can fail) and data residency compliance (GDPR: EU data stays in EU). Trade-off: cross-region replication lag and conflict-resolution complexity.
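One common (and lossy) way to handle active-active conflicts is last-writer-wins. The sketch below assumes each record carries a timestamp and a region name; the record layout and the `lww_merge` helper are illustrative choices, not a description of any particular database.

```python
def lww_merge(local, remote):
    """Merge two replicas of {key: (timestamp, region, value)}; newest write wins."""
    merged = dict(local)
    for key, record in remote.items():
        # Compare (timestamp, region): region is a deterministic tie-break.
        if key not in merged or record[:2] > merged[key][:2]:
            merged[key] = record
    return merged


us = {"cart:42": (100, "us-east", ["book"]), "name:42": (90, "us-east", "Ann")}
eu = {"cart:42": (105, "eu-west", ["pen"]), "name:42": (80, "eu-west", "Anne")}

merged = lww_merge(us, eu)
```

Both regions converge on the same result regardless of merge order, but the concurrent `["book"]` update is silently discarded. That loss is the conflict-resolution trade-off in concrete form, and it is why some systems use merge functions or CRDTs instead.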
Multi-Tenant Architecture Types
| Model | Pro | Con | Examples |
| --- | --- | --- | --- |
| Shared App, Shared DB | Cheapest, simple ops | Noisy neighbor, data leak risk | Salesforce, Slack |
| Shared App, Multi DB | Strong data isolation, shared compute | More DB ops, connection pooling | Shopify, GitHub Enterprise |
| Multi App, Multi DB | Full isolation, no noisy neighbor | Expensive, complex ops at scale | AWS accounts, dedicated SaaS |
Choose: Shared/Shared for cost (Salesforce); Shared/Multi-DB for data isolation (Shopify); Multi/Multi for full isolation (enterprise/compliance). Most SaaS products start shared and migrate to a hybrid model as they scale.
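A hybrid model usually comes down to a lookup that decides, per tenant, which database to use and whether queries need a tenant filter. The following sketch assumes made-up connection strings and a hypothetical `resolve_tenant` helper.

```python
SHARED_DSN = "postgres://db-shared/app"              # hypothetical shared database
DEDICATED_DSNS = {"acme": "postgres://db-acme/app"}  # hypothetical per-tenant databases


def resolve_tenant(tenant_id):
    """Return (dsn, row_filter) for a tenant under a hybrid tenancy model."""
    if tenant_id in DEDICATED_DSNS:
        # Shared App, Multi DB: isolation comes from the separate database.
        return DEDICATED_DSNS[tenant_id], None
    # Shared App, Shared DB: every query must be scoped by tenant_id.
    return SHARED_DSN, {"tenant_id": tenant_id}
```

For example, `resolve_tenant("acme")` yields the dedicated database with no row filter, while any other tenant lands on the shared database with a mandatory `tenant_id` filter. Forgetting that filter is the data-leak risk noted for the shared model.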
Service Discovery
How services find each other in a dynamic fleet: IPs change constantly as containers scale, restart, and migrate
Service Discovery: Client-Side vs Server-Side
| Pattern | Who Resolves | Examples | Pros | Cons |
| --- | --- | --- | --- | --- |
| Client-side | App library / SDK | Eureka + Ribbon, Consul SDK, gRPC name resolver | No extra hop, client LB | SDK per language, stale cache |
| Server-side | Load balancer / proxy | AWS ALB + Cloud Map, Envoy, Istio | Simple client, language-agnostic | Extra hop, LB is SPOF |
| DNS-based | Stdlib DNS resolver | K8s CoreDNS, Consul DNS, AWS Route 53 | Zero SDK, universal | TTL caching, no health-aware LB |
| Service Mesh | Sidecar proxy (transparent) | Istio/Envoy, Linkerd, Consul Connect | Zero app changes, mTLS, observability | Complexity, resource overhead |
Registry Implementations
| Tool | Consensus | Health Check | Key Feature |
| --- | --- | --- | --- |
| Consul | Raft | HTTP, TCP, gRPC, script | Multi-DC, service mesh (Connect), KV store |
| etcd | Raft | Lease-based TTL | K8s backbone, strong consistency, watch API |
| Eureka | AP (peer replication) | Heartbeat (30s default) | Netflix OSS, self-preservation mode |
| ZooKeeper | ZAB | Ephemeral nodes | Mature, Kafka/Hadoop ecosystem |
| AWS Cloud Map | Managed | Route 53 health checks | Native AWS, API + DNS discovery |
| K8s (built-in) | etcd | Liveness + readiness probes | Zero setup, CoreDNS, Endpoints API |
Health checks remove dead instances within seconds, which is critical for fast failover. Use liveness (is it alive?) and readiness (can it serve traffic?) probes. Deregister unhealthy instances immediately; don't wait for the TTL.
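The difference between waiting for a TTL and deregistering immediately can be seen in a toy registry. This sketch is an in-memory stand-in with invented names (`Registry`, `heartbeat`, `deregister`, `lookup`); real registries such as Consul or Eureka expose different APIs.

```python
import time


class Registry:
    """In-memory service registry; instances expire unless they heartbeat."""

    def __init__(self, ttl=30.0, clock=time.monotonic):
        self.ttl, self.clock = ttl, clock
        self._last_seen = {}  # (service, address) -> time of last heartbeat

    def heartbeat(self, service, address):
        self._last_seen[(service, address)] = self.clock()

    def deregister(self, service, address):
        # Explicit removal on shutdown or a failed health check: no TTL wait.
        self._last_seen.pop((service, address), None)

    def lookup(self, service):
        now = self.clock()
        return sorted(
            addr
            for (svc, addr), seen in self._last_seen.items()
            if svc == service and now - seen <= self.ttl
        )
```

An instance that simply stops sending heartbeats keeps receiving traffic for up to `ttl` seconds, whereas `deregister` removes it from the next `lookup` onward.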
K8s patterns: ClusterIP is a virtual IP routed by kube-proxy (the default). Headless returns all pod IPs (for StatefulSets and client-side LB). ExternalName is a CNAME to an external service. Service mesh: an Envoy sidecar intercepts all traffic transparently.
Anti-patterns: Hardcoded IPs break on any scale event. Long DNS TTLs route traffic to dead instances. Without health checks, the registry serves stale entries. A single registry without replication is a SPOF.
CI/CD & Deployment Strategies
Ship safely without taking the site down: automate everything from commit to production
Deployment Strategies Compared
| Strategy | Downtime | Rollback Speed | Risk | Best For |
| --- | --- | --- | --- | --- |
| Rolling | Zero | Minutes (re-roll) | Mixed versions during rollout | Stateless services, K8s default |
| Blue/Green | Zero | Instant (flip router) | 2× cost, DB schema must be compatible | Critical services, instant rollback needed |
| Canary | Zero | Fast (route 0% to canary) | Slow rollout, needs good observability | High-traffic services, gradual confidence |
| Feature Flag | Zero | Instant (toggle off) | Flag debt, testing matrix grows | Per-user rollout, A/B testing, kill switch |
| Recreate | Yes | Redeploy old version | Downtime during swap | Dev/staging, stateful apps that can't run mixed |
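Per-user feature-flag rollout is typically done by hashing the user into a stable bucket, so the same user always gets the same answer and raising the percentage only ever adds users. A minimal sketch, assuming a hypothetical `flag_enabled` helper and 100 buckets:

```python
import hashlib


def flag_enabled(flag, user_id, rollout_percent):
    """Deterministically place a user in one of 100 buckets for this flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent
```

Setting `rollout_percent` to 0 acts as the kill switch, and including the flag name in the hash keeps the same users from always being first in line for every flag.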
Pipeline: commit → build → unit tests → image → deploy staging → e2e → promote prod (canary 5% → 25% → 100%) → auto-rollback on SLO breach. Use GitOps (ArgoCD/Flux) for declarative, auditable deployments.
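The canary promotion with auto-rollback at the end of the pipeline can be sketched as a loop over traffic steps that stops at the first SLO breach. The `run_canary` function, the `error_rate_at` hook, and the 1% error budget are assumptions for illustration; real pipelines read these signals from a metrics system.

```python
def run_canary(error_rate_at, steps=(5, 25, 100), max_error_rate=0.01):
    """Advance traffic through `steps`; roll back to 0% on the first SLO breach.

    `error_rate_at(percent)` is a hypothetical hook returning the canary's
    observed error rate after it has served `percent` of traffic for a while.
    """
    history = []
    for percent in steps:
        rate = error_rate_at(percent)
        history.append((percent, rate))
        if rate > max_error_rate:
            return {"status": "rolled_back", "traffic": 0, "history": history}
    return {"status": "promoted", "traffic": steps[-1], "history": history}


healthy = run_canary(lambda percent: 0.001)
regressed = run_canary(lambda percent: 0.05 if percent >= 25 else 0.001)
```

In the second run the regression only shows up at 25% traffic, so the rollout stops there and the remaining 75% of users never see the bad build.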
Anti-patterns: Manual deploys are error-prone and leave no audit trail. Having no rollback plan ("we'll fix forward") fails at 3am. Big-bang releases ship all changes at once, so it is hard to tell which change broke things. Without a staging environment, prod is your test environment.
Real-world: Netflix (Spinnaker canary with automated analysis via Kayenta); Google (1% → 10% → 50% → 100% over days); Amazon (one-box deployment: single host first); GitHub (feature flags + Scientist for safe refactoring).
Serverless / FaaS
Pay-per-invocation compute that scales to zero: no servers to manage, auto-scales per request
Serverless Execution Model: Cold Start & Warm Invocation
| Good For | Bad For | Why |
| --- | --- | --- |
| Event-driven glue | Long-running jobs (>15 min) | Timeout limits, cost per duration |
| Spiky / low-volume traffic | Sustained high RPS | Cost exceeds containers at ~1M req/day |
| Image/video processing | Stateful sessions / WebSockets | Stateless by design, no persistent connections |
| Cron jobs + queue workers | Low-latency APIs (p99 < 10ms) | Cold start adds 100ms-10s |
| Prototyping / MVPs | Complex orchestration | Use Step Functions for multi-step workflows |
Cold Start Mitigation Strategies
Reduce Cold Start
Provisioned Concurrency: pre-warm N instances ($$)
SnapStart: snapshot after init, restore on invoke (Java)
Slim runtimes: Go, Rust (10-50ms cold start)
Smaller packages: tree-shake, no unused deps
Keep-warm pings: scheduled invoke every 5 min
Init outside handler: DB connections in global scope
Serverless Platforms
AWS Lambda: most mature, 15 min max, SnapStart
GCP Cloud Functions: gen2 (Cloud Run based), 60 min
Azure Functions: durable functions for orchestration
Mitigate cold start: Provisioned concurrency for latency-sensitive paths. Slim runtimes (Go, Rust: 10-50ms cold start vs Java: 3-10s). SnapStart (Lambda Java). Init outside the handler: DB connections and SDK clients created in global scope are reused across warm invocations.
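"Init outside the handler" looks like this in a Python function. The `_connect` helper and its simulated latency are stand-ins invented for this sketch; only the `handler(event, context)` shape follows the usual Lambda convention.

```python
import time

INIT_COUNT = 0


def _connect():
    """Stand-in for an expensive setup step such as opening a DB connection."""
    global INIT_COUNT
    INIT_COUNT += 1
    time.sleep(0.05)  # simulated connection latency
    return {"connected": True}


# Module scope runs once per execution environment (the cold start),
# and the result is reused by every warm invocation that follows.
DB = _connect()


def handler(event, context):
    # Only per-request work belongs here; DB is already initialized.
    return {"statusCode": 200, "body": f"hello {event.get('name', 'world')}"}
```

If `_connect()` were called inside `handler`, every invocation would pay the setup cost, not just the cold one.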
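Anti-patterns: Lambda monolith (one giant function doing everything). Synchronous chains: Lambda → Lambda → Lambda (use Step Functions instead). VPC attachment without planning: it historically added several seconds of cold start for network interface setup (largely fixed by AWS in 2019), and a VPC function with no NAT gateway or VPC endpoints cannot reach the internet. Ignoring concurrency limits: the default account limit is 1,000 concurrent executions per region, and requests beyond it are throttled.
Real-world: Netflix (Lambda for encoding pipeline triggers); Coca-Cola (spiky, event-driven vending machine backend); iRobot (IoT event processing); Capital One (real-time fraud detection); BBC (on-demand video transcoding).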
Infrastructure as Code
Version-control your cloud just like your app: reproducible, auditable, reviewable infrastructure
IaC Tools: Declarative vs Imperative
| Tool | Language | Approach | State | Strength |
| --- | --- | --- | --- | --- |
| Terraform | HCL (declarative) | Plan → Apply | S3 + DynamoDB lock / TF Cloud | Multi-cloud, huge provider catalog, modules |
| OpenTofu | HCL (declarative) | Plan → Apply | Same as Terraform | Open-source fork, community-driven |
| CloudFormation | YAML/JSON | Stack-based | AWS-managed (free) | Native AWS, drift detection, StackSets |
| Pulumi | TS/Python/Go/C# | Real code | Pulumi Cloud / self-managed | Loops, tests, abstractions, type safety |
| AWS CDK | TS/Python/Java/Go | Synthesizes to CFN | CloudFormation | L2/L3 constructs, AWS-blessed patterns |
| Crossplane | YAML (K8s CRDs) | Reconciliation loop | K8s etcd | GitOps-native, K8s-first, compositions |
IaC Workflow & Best Practices
GitOps Workflow
PR: change infra code → terraform plan in CI
Review: team reviews plan diff (what will change)
Merge: terraform apply runs automatically
State: remote backend (S3 + DynamoDB lock)
Drift: detect with scheduled plan runs
Modules: reusable components (VPC, EKS, RDS)
Best Practices
Environments: separate state per env (dev/staging/prod)
Least privilege: CI role has only needed permissions
No manual changes: all changes through code
Blast radius: small stacks, not one mega-stack
Secrets: never in code; use Vault, SSM, SOPS
Testing: tflint, checkov, terratest
// Terraform example: S3 bucket with versioning
resource "aws_s3_bucket" "logs" {
  bucket = "acme-logs-prod"
  tags   = { env = "prod", team = "platform" }
}

// Versioning is configured through its own resource, keyed by the bucket ID
resource "aws_s3_bucket_versioning" "logs" {
  bucket = aws_s3_bucket.logs.id
  versioning_configuration { status = "Enabled" }
}
Workflow: PR → plan in CI (diff visible in PR comment) → team review → apply on merge. State in S3 + DynamoDB lock or Terraform Cloud. Use workspaces or directory structure for environment separation.
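The plan-then-apply split can be modeled as a pure diff between desired configuration and recorded state, followed by a separate step that executes the reviewed diff. This is a conceptual sketch only; the `plan` and `apply` functions and the resource names are invented here and say nothing about Terraform's internals.

```python
def plan(desired, current):
    """Compare desired config with recorded state; return the actions to take."""
    actions = []
    for name, config in sorted(desired.items()):
        if name not in current:
            actions.append(("create", name))
        elif current[name] != config:
            actions.append(("update", name))
    for name in sorted(current.keys() - desired.keys()):
        actions.append(("delete", name))
    return actions


def apply(desired, current, actions):
    """Execute a reviewed plan and return the new state."""
    state = dict(current)
    for op, name in actions:
        if op == "delete":
            del state[name]
        else:
            state[name] = desired[name]
    return state


desired = {"bucket.logs": {"versioning": True}, "vpc.main": {"cidr": "10.0.0.0/16"}}
current = {"bucket.logs": {"versioning": False}, "queue.old": {}}

actions = plan(desired, current)            # what reviewers see in the PR
new_state = apply(desired, current, actions)
```

Keeping `plan` free of side effects is what makes it safe to run in CI on every PR, and a second `plan` against `new_state` returning nothing is the signal that code and infrastructure agree.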
Anti-patterns: ClickOps (manual console changes that drift from code). Mega-stack (one state file for everything: slow and risky). Secrets in state: the state file can contain sensitive values, so encrypt it and restrict access. No locking: concurrent applies can corrupt state.
Real-world: HashiCorp (Terraform manages millions of cloud resources globally); Shopify (CDK for AWS infrastructure); Uber (custom IaC for multi-cloud); GitLab (Terraform + GitOps for all infrastructure changes).