System Design Case Study

How does GitHub Actions execute 10M+ builds/day at scale?

Design a CI/CD system: 10M builds/day, distributed workers, artifact caching, real-time logs

Problem Statement

How does a CI/CD platform execute 10M+ builds per day on distributed workers, with artifact caching that yields a 10× speedup on cache hits, job scheduling driven by dependency graphs, and real-time log streaming to developers?

Core challenge: Builds are bursty (Monday 9am = 10× weekend traffic), each needs an isolated environment, jobs have complex dependency DAGs, and developers expect real-time log output, not "check back in 5 minutes."
10M+ builds/day
10× cache speedup
Distributed worker pool (auto-scale)
Real-time log streaming

Architecture

LAYER 1: TRIGGER - Git Push → Orchestrator → Matrix Expansion

Git Push: webhook trigger; PR / branch / tag event; payload includes commit SHA and author.
Orchestrator: parse workflow YAML (.github/workflows/); build job DAG (dependency graph); identify needs: constraints; evaluate if: conditions per job.
Matrix Expansion: 5 OS × 3 versions = 15 parallel jobs; fan-out from single definition; each combo = independent job; include/exclude overrides.
Job DAG: build → test; test → deploy; lint → test; maximize parallelism.

LAYER 2: EXECUTION - Job Queue (fair-share) → Worker Pool (ephemeral, auto-scale 0→10K)

Job Queue: priority scheduling (paid tiers first); fair-share per org (prevent starvation); concurrency limits enforced per repo; queue depth → auto-scale signal; dequeue when worker available; per-org quotas prevent monopolization.
Worker Pool (Ephemeral): ephemeral VMs / containers per job; auto-scale 0 → 10K workers on demand; isolated execution (no cross-job leakage); secrets injected at runtime (vault integration); destroyed after job completes (clean slate); warm pool for fast startup (~5s).
Auto-Scale: queue depth → scale workers.

LAYER 3: OUTPUT - Artifact Cache + Log Streaming + Status Reporting

Artifact Cache (Blob Store): content-addressed by hash(lockfile); key: hash(package-lock + OS + Node); cache hit → skip install → 10× speedup; shared across branches (same deps); Docker layer caching; eviction: LRU after 7 days.
Log Streaming (Real-time): WebSocket path worker → aggregator → browser; real-time tail (chunked upload); persisted to blob storage for later; structured log lines (timestamp + step); collapsible step groups in UI; push-based, no polling.
Status Reporting: GitHub Checks API integration; PR status: pending → pass/fail; per-job annotations (errors, warnings); Slack/Teams notifications; deployment status tracking; branch protection enforcement.

DAG: maximize parallelism, respect ordering | Cache key: hash(package-lock.json + OS + Node) | Fair-share: per-org quotas prevent starvation
10M builds/day | Bursty (Mon 9am = 10× weekend) | Ephemeral workers destroyed after each job | WebSocket log streaming (no polling)

Job DAG scheduling: Workflow YAML defines jobs with needs: dependencies. The orchestrator builds a DAG, schedules independent jobs in parallel, and holds each downstream job until its dependencies have finished. Matrix builds fan out (e.g., test on 5 OS × 3 Node versions = 15 parallel jobs).
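
The scheduling and fan-out described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular platform implements it: the function names, the OS labels, and the Node versions are made up for the example, and the job graph mirrors the build/lint/test/deploy edges from the architecture outline. It relies on the standard-library graphlib module (Python 3.9+).

```python
from graphlib import TopologicalSorter
from itertools import product


def expand_matrix(matrix):
    """Fan one matrix definition out into one dict per combination."""
    keys = list(matrix)
    return [dict(zip(keys, combo)) for combo in product(*(matrix[k] for k in keys))]


def schedule_waves(needs):
    """Group jobs into waves; all jobs within a wave can run in parallel.

    `needs` maps a job name to the set of jobs it depends on.
    """
    sorter = TopologicalSorter(needs)
    sorter.prepare()  # raises graphlib.CycleError if the workflow has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)  # pretend the whole wave finished
    return waves


# Hypothetical workflow: test needs build and lint, deploy needs test.
needs = {"build": set(), "lint": set(), "test": {"build", "lint"}, "deploy": {"test"}}
waves = schedule_waves(needs)

# Hypothetical matrix: 5 OS labels x 3 Node versions.
matrix_jobs = expand_matrix({
    "os": ["ubuntu", "debian", "fedora", "windows", "macos"],
    "node": [18, 20, 22],
})

print(waves)
print(len(matrix_jobs))
```

Grouping into waves keeps the sketch short. A production scheduler would more likely call done() per job as each one completes, so a downstream job starts the moment its own dependencies finish rather than waiting for the slowest job in the wave.
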
Artifact caching: Cache key = hash(package-lock.json + OS + Node version). On a hit, node_modules is restored from the blob store in seconds instead of being reinstalled over minutes. Because storage is content-addressed, branches with the same dependencies share one cache entry. Docker builds use layer caching, with each layer cached independently.
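
The cache-key idea above might look like the following sketch. Several things here are assumptions for illustration: SHA-256 as the hash, the in-memory ArtifactCache class standing in for the blob store, and the install_dependencies and fake_install helpers. The source only specifies that the key is a hash of the lockfile, OS, and Node version.

```python
import hashlib


def cache_key(lockfile_bytes: bytes, os_name: str, node_version: str) -> str:
    """Derive a key from lockfile contents + OS + Node version (SHA-256 assumed)."""
    h = hashlib.sha256()
    for part in (lockfile_bytes, os_name.encode(), node_version.encode()):
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()


class ArtifactCache:
    """In-memory stand-in for the blob store (hypothetical)."""

    def __init__(self):
        self._blobs = {}

    def restore(self, key):
        return self._blobs.get(key)

    def save(self, key, archive):
        self._blobs[key] = archive


def install_dependencies(cache, lockfile, os_name, node_version, run_install):
    """Return (archive, cache_hit); run the slow install only on a miss."""
    key = cache_key(lockfile, os_name, node_version)
    archive = cache.restore(key)
    if archive is not None:
        return archive, True
    archive = run_install()
    cache.save(key, archive)
    return archive, False


calls = []


def fake_install():
    # Stands in for the slow dependency install step.
    calls.append(1)
    return b"node_modules.tar"


cache = ArtifactCache()
lock = b'{"lockfileVersion": 3}'
first = install_dependencies(cache, lock, "linux", "20", fake_install)
second = install_dependencies(cache, lock, "linux", "20", fake_install)
print(first[1], second[1], len(calls))
```

Nothing in the key mentions the branch, which is why two branches with an identical lockfile land on the same entry. Changing any one of the three inputs produces a different key and therefore a clean miss.
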
Anti-patterns: No job isolation: one build's side effects break another. Polling for logs: wastes bandwidth, poor UX. No concurrency limits: one repo monopolizes all workers. Caching everything blindly: stale caches cause flaky builds.
Real-time log streaming: Workers stream stdout/stderr via WebSocket to a log aggregator. The client connects to the aggregator for a live tail. Logs are also persisted to blob storage for later retrieval. Output is uploaded in chunks, so developers see it while the job is still running instead of waiting for completion.
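
The worker → aggregator → client flow can be modeled in-process to show the push-based shape. The sketch below deliberately swaps the WebSocket transport for asyncio queues so it runs with only the standard library; the LogAggregator class, its method names, and the sample log lines are all hypothetical.

```python
import asyncio
import time


class LogAggregator:
    """Fans log chunks out to live subscribers and keeps a persisted copy."""

    def __init__(self):
        self.persisted = []    # stand-in for blob storage
        self.subscribers = []  # one queue per connected client
        self.closed = False

    def subscribe(self):
        q = asyncio.Queue()
        for chunk in self.persisted:  # late joiners get the backlog first
            q.put_nowait(chunk)
        if self.closed:
            q.put_nowait(None)
        self.subscribers.append(q)
        return q

    def publish(self, chunk):
        self.persisted.append(chunk)
        for q in self.subscribers:  # push to clients; nobody polls
            q.put_nowait(chunk)

    def close(self):
        self.closed = True
        for q in self.subscribers:
            q.put_nowait(None)  # sentinel: the job has finished


async def worker(agg, lines):
    for step, line in lines:
        # Structured log line: timestamp + step + text.
        agg.publish({"ts": time.time(), "step": step, "line": line})
        await asyncio.sleep(0)  # yield so clients can read mid-job
    agg.close()


async def client(q):
    seen = []
    while (chunk := await q.get()) is not None:
        seen.append(chunk["line"])
    return seen


async def main():
    agg = LogAggregator()
    q = agg.subscribe()
    lines = [("install", "npm ci"), ("test", "42 passed")]
    _, seen = await asyncio.gather(worker(agg, lines), client(q))
    return seen, agg.persisted


seen, persisted = asyncio.run(main())
print(seen)
```

In a real deployment each queue would be replaced by a WebSocket connection, and the persisted list by chunked writes to blob storage. The replay-on-subscribe step is what lets a developer who opens the page mid-build see earlier output before the live tail.
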

Interview Cheat Sheet

1. Job DAG: parse workflow dependencies, maximize parallelism, respect ordering constraints
2. Ephemeral workers: fresh VM/container per job, auto-scale 0→10K based on queue depth
3. Content-addressed caching: hash(lockfile) as cache key, 10× speedup on cache hit
4. Fair-share scheduling: per-org quotas prevent one customer from starving others
5. Real-time logs: WebSocket streaming from worker → aggregator → client browser
6. Isolation: each job gets a clean environment, secrets injected at runtime, destroyed after
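
Item 4 is the one most often asked about in follow-ups, so here is one possible sketch of it: round-robin across organizations with a per-org concurrency quota. The FairShareQueue class, its method names, and the quota of 2 are assumptions for illustration. The priority tiers mentioned in the architecture outline are left out to keep it short.

```python
from collections import defaultdict, deque


class FairShareQueue:
    """Round-robin across orgs, capped by a per-org concurrency quota."""

    def __init__(self, per_org_limit):
        self.per_org_limit = per_org_limit
        self.pending = {}                # org -> deque of queued jobs
        self.rotation = deque()          # orgs with queued jobs, in turn order
        self.running = defaultdict(int)  # org -> jobs currently executing

    def submit(self, org, job):
        if org not in self.pending:
            self.pending[org] = deque()
            self.rotation.append(org)
        self.pending[org].append(job)

    def next_job(self):
        """Return (org, job) for the next runnable job, or None if none can run."""
        for _ in range(len(self.rotation)):
            org = self.rotation.popleft()
            if self.running[org] >= self.per_org_limit:
                self.rotation.append(org)  # at quota: skip, but keep it in line
                continue
            job = self.pending[org].popleft()
            self.running[org] += 1
            if self.pending[org]:
                self.rotation.append(org)  # more work queued: back of the line
            else:
                del self.pending[org]
            return org, job
        return None

    def finish(self, org):
        self.running[org] -= 1


queue = FairShareQueue(per_org_limit=2)
for i in range(1, 6):
    queue.submit("big", f"b{i}")  # one org floods the queue
queue.submit("small", "s1")       # another org submits a single job

picks = [queue.next_job() for _ in range(4)]
print(picks)
```

Even though "big" submitted five jobs before "small" submitted one, "small" is served second, and "big" stalls at its quota until queue.finish("big") frees a slot.
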