How does GitHub Actions execute 10M+ builds/day at scale?

🎯 Design a CI/CD system: 10M builds/day, distributed workers, artifact caching, real-time logs

Concepts Involved

Message Queues, Docker/K8s, Kafka, Caching, Auto-Scaling

Problem Statement

How does a CI/CD platform execute 10M+ builds per day using distributed workers, with intelligent artifact caching providing a 10× speedup, job scheduling with dependency graphs, and real-time log streaming to developers?

Core challenge: Builds are bursty (Monday 9am = 10× weekend traffic), each needs an isolated environment, jobs have complex dependency DAGs, and developers expect real-time log output, not "check back in 5 minutes."

10M+ builds / day

10× cache speedup

Distributed worker pool (auto-scale)

Real-time log streaming

Architecture

Job DAG scheduling: Workflow YAML defines jobs with needs: dependencies. The orchestrator builds a DAG → schedules independent jobs in parallel → waits for dependencies before starting downstream jobs. Matrix builds fan out (e.g., test on 5 OS × 3 Node versions = 15 parallel jobs).
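The scheduling loop described above can be sketched with Python's standard `graphlib` module. This is a minimal illustration, not how any particular CI platform implements it: the job names, the `schedule_waves` and `expand_matrix` helpers, and the idea of grouping jobs into "waves" are assumptions made for the example. A real orchestrator would mark jobs done as workers report back rather than in lockstep.

```python
from graphlib import TopologicalSorter
from itertools import product


def expand_matrix(job_name, matrix):
    """Fan a job out into one concrete job name per combination of matrix values."""
    keys = sorted(matrix)
    combos = product(*(matrix[k] for k in keys))
    return [f"{job_name} ({', '.join(str(v) for v in combo)})" for combo in combos]


def schedule_waves(needs):
    """Group jobs into waves; all jobs within a wave can run in parallel.

    `needs` maps each job name to the set of jobs it depends on.
    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    sorter = TopologicalSorter(needs)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose dependencies have all finished
        waves.append(ready)
        sorter.done(*ready)  # simplification: the whole wave finishes together
    return waves


# Hypothetical workflow: "test" needs "build", "deploy" needs "lint" and "test".
workflow = {
    "lint": set(),
    "build": set(),
    "test": {"build"},
    "deploy": {"lint", "test"},
}
waves = schedule_waves(workflow)

# 5 operating systems x 3 Node versions = 15 concrete jobs.
matrix_jobs = expand_matrix(
    "test",
    {
        "os": ["ubuntu", "debian", "fedora", "macos", "windows"],
        "node": ["18", "20", "22"],
    },
)
```

Here `lint` and `build` land in the first wave because neither has dependencies, while `deploy` waits until both of its upstream jobs are done.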

Artifact caching: Cache key = hash(package-lock.json + OS + Node version). On a hit, restore node_modules from the blob store (seconds vs. minutes). Content-addressed storage: the same dependencies across branches share a cache entry. Layer caching for Docker builds (each layer cached independently).
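A cache key along these lines could be derived as follows. This is a sketch under stated assumptions: SHA-256 as the hash, the `npm-...` key format, and the in-memory `BlobCache` class are all illustrative stand-ins, not the API of any real cache service.

```python
import hashlib


def cache_key(lockfile_bytes: bytes, os_name: str, node_version: str) -> str:
    """Derive a cache key from the lockfile contents plus the build environment."""
    h = hashlib.sha256()
    for part in (os_name.encode(), node_version.encode(), lockfile_bytes):
        # Length-prefix each part so different inputs cannot concatenate to the same bytes.
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return f"npm-{os_name}-node{node_version}-{h.hexdigest()[:16]}"


class BlobCache:
    """In-memory stand-in for a blob store keyed by cache key (hypothetical)."""

    def __init__(self):
        self._blobs = {}

    def restore(self, key):
        """Return the cached archive, or None on a cache miss."""
        return self._blobs.get(key)

    def save(self, key, archive):
        # Entries are treated as immutable: the first writer for a key wins.
        self._blobs.setdefault(key, archive)


lockfile = b'{"name": "app", "lockfileVersion": 3}'
key_main = cache_key(lockfile, "linux", "20")
key_feature_branch = cache_key(lockfile, "linux", "20")  # same inputs on another branch
key_other_node = cache_key(lockfile, "linux", "22")
```

Because the key depends only on content and environment, two branches with an identical lockfile resolve to the same entry, while a different Node version produces a separate one.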

Anti-patterns: No job isolation (one build's side effects break another). Polling for logs (wastes bandwidth, poor UX). No concurrency limits (one repo monopolizes all workers). Caching everything blindly (stale caches cause flaky builds).
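To make the concurrency-limit point concrete, here is one possible way to stop a single tenant from monopolizing workers: round-robin dispatch across tenants with a per-tenant cap on running jobs. The `FairShareQueue` class, its method names, and the cap of 2 are assumptions for illustration only.

```python
from collections import deque


class FairShareQueue:
    """Round-robin dispatch across orgs with a cap on running jobs per org."""

    def __init__(self, max_running_per_org):
        self.max_running_per_org = max_running_per_org
        self.queues = {}      # org -> deque of pending job ids
        self.running = {}     # org -> number of jobs currently running
        self.order = deque()  # round-robin order of orgs

    def submit(self, org, job_id):
        if org not in self.queues:
            self.queues[org] = deque()
            self.running[org] = 0
            self.order.append(org)
        self.queues[org].append(job_id)

    def next_job(self):
        """Return (org, job_id) for the next job to dispatch, or None if nothing is eligible."""
        for _ in range(len(self.order)):
            org = self.order[0]
            self.order.rotate(-1)  # move this org to the back of the rotation
            if self.queues[org] and self.running[org] < self.max_running_per_org:
                self.running[org] += 1
                return org, self.queues[org].popleft()
        return None

    def finish(self, org):
        """Record that one of the org's running jobs has completed."""
        self.running[org] -= 1


q = FairShareQueue(max_running_per_org=2)
for i in range(5):
    q.submit("big-org", f"big-{i}")
q.submit("small-org", "small-0")
dispatched = [q.next_job() for _ in range(4)]
```

In this run the small org's single job is dispatched second rather than waiting behind all five of the big org's jobs, and the fourth call returns `None` because the big org has hit its cap.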

Real-time log streaming: Workers stream stdout/stderr via WebSocket to a log aggregator. The client connects to the aggregator for a live tail. Logs are also persisted to blob storage for later retrieval. Chunked upload: don't wait for job completion to see output.
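The aggregator's role can be illustrated with a small in-memory sketch. It deliberately leaves out the WebSocket transport and the blob store: `LogAggregator`, its methods, and the callback-based subscription are hypothetical names standing in for those pieces, showing only the replay-then-live-tail behavior.

```python
class LogAggregator:
    """Receives log chunks from workers and fans them out to live subscribers."""

    def __init__(self):
        self._chunks = {}       # job_id -> list of chunks received so far
        self._subscribers = {}  # job_id -> list of callbacks for live tail

    def append(self, job_id, chunk):
        """Called by a worker for each chunk, without waiting for the job to finish."""
        self._chunks.setdefault(job_id, []).append(chunk)
        for callback in self._subscribers.get(job_id, []):
            callback(chunk)

    def subscribe(self, job_id, callback, from_offset=0):
        """Replay chunks already received (from from_offset), then deliver new ones live."""
        for chunk in self._chunks.get(job_id, [])[from_offset:]:
            callback(chunk)
        self._subscribers.setdefault(job_id, []).append(callback)

    def persisted(self, job_id):
        """Full log so far; a real system would read this back from blob storage."""
        return "".join(self._chunks.get(job_id, []))


aggregator = LogAggregator()
seen = []
aggregator.append("job-1", "npm ci\n")          # emitted before the client connects
aggregator.subscribe("job-1", seen.append)      # client opens the live tail
aggregator.append("job-1", "running tests\n")   # emitted while the client is watching
```

A client that reconnects can pass `from_offset` equal to the number of chunks it has already rendered, so it resumes without duplicates.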

Interview Cheat Sheet

1. Job DAG: parse workflow dependencies, maximize parallelism, respect ordering constraints
2. Ephemeral workers: fresh VM/container per job, auto-scale 0→10K based on queue depth
3. Content-addressed caching: hash(lockfile) as cache key, 10× speedup on cache hit
4. Fair-share scheduling: per-org quotas prevent one customer from starving others
5. Real-time logs: WebSocket streaming from worker → aggregator → client browser
6. Isolation: each job gets clean environment, secrets injected at runtime, destroyed after
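As a rough illustration of point 2, a queue-depth autoscaler might compute its target worker count like this. The function name, the 10% headroom, and the one-worker-per-job assumption are choices made for the sketch; only the 0 to 10K range comes from the list above.

```python
def desired_workers(queued, running, headroom_percent=10, max_workers=10_000):
    """Target pool size assuming one ephemeral worker per job.

    Demand is queued plus running jobs, padded by some headroom so a burst
    does not wait on cold starts, then capped at max_workers.
    Zero demand scales the pool down to zero.
    """
    demand = queued + running
    # Integer ceiling of demand * headroom_percent / 100, avoiding float rounding.
    headroom = (demand * headroom_percent + 99) // 100
    return min(max_workers, demand + headroom)


weekend_target = desired_workers(queued=0, running=0)
steady_target = desired_workers(queued=100, running=50)
burst_target = desired_workers(queued=20_000, running=5_000)
```

Counting running jobs in the demand keeps the autoscaler from shrinking the pool below the number of workers that are still busy.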
