System Design Case Study

How does SendGrid deliver 500M+ emails/day with 99%+ deliverability?

Design an email delivery system: 500M emails/day, 99%+ deliverability, IP warm-up, bounce handling

Problem Statement

How does an email delivery platform send 500M+ emails/day while maintaining 99%+ deliverability through IP reputation management, authentication (DKIM/SPF/DMARC), intelligent bounce handling, and per-domain throttling?

Core challenge: email deliverability depends on IP reputation, and one bad actor can poison an entire IP pool. The system must balance throughput against per-ISP rate limits (Gmail accepts ~500/hr from new IPs), handle bounces instantly, and authenticate every message.

500M+: emails delivered / day
99%+: deliverability rate
IP warm-up: gradual ramp over weeks
<2%: bounce rate threshold

Architecture

LAYER 1: INGESTION (API → Validation → Per-Domain Queue)
  API Ingestion: POST /v3/mail/send (to, from, subject, body); ~5,800/sec avg, 29K/sec peak; 500M emails/day
  Validation: DKIM sign (private key → body + headers); SPF check (authorized sending IP?); suppression list check (Bloom filter, ~1B entries); reject if suppressed or invalid; content scanning for spam signals
  Per-Domain Queue: Gmail queue 500/hr limit; Outlook queue 1K/hr limit; Yahoo queue 800/hr limit; adaptive rate limiting; back off on 4xx, pause on 5xx
  Rate Scheduler: per-ISP throttle control; respect Retry-After; adaptive (learn ISP limits); priority: transactional > marketing; queue depth monitoring

LAYER 2: DELIVERY (MTA Pool → ISP Mailbox → Inbox)
  MTA Pool (300 IPs): IP rotation by reputation score; warm IPs → more traffic allocated; cold IPs → ramp slowly (100→200→400/day); bad sender isolation (dedicated pools); full warm-up: 4-6 weeks
  ISP Mailbox: Gmail / Outlook / Yahoo / iCloud; response: accept / reject / defer; checks: DKIM signature, SPF, DMARC policy; reputation lookup on sending IP; content filtering (spam score)
  Inbox: delivered successfully; DKIM + SPF + DMARC aligned = inbox placement (not spam)

LAYER 3: FEEDBACK (Bounce Handling + Reputation + Suppression)
  Hard Bounce: invalid address (550) or domain doesn't exist → suppress immediately; never send again; add to suppression list
  Soft Bounce: mailbox full (452) or temporary server issue → retry 3× over 72h; exponential backoff; suppress after 3 fails
  Spam Complaint: FBL report from ISP (user marked as spam) → auto-unsubscribe; target: <0.1% rate; exceeding = IP blacklist
  Reputation Score: per-IP tracking; bounce rate <2%; complaint rate <0.1%; score drives IP allocation; low score → less traffic
  Suppression List: Bloom filter (~1B entries); O(1) lookup per send; check before every send; hard bounces + complaints + manual unsubscribes

DKIM + SPF + DMARC = inbox placement | IP warm-up: 100→200→400/day over 4-6 weeks | Bounce <2%, complaint <0.1% | Per-domain throttling | Bad sender isolation | Suppression Bloom filter ~1B addresses

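
The suppression check in the architecture relies on a Bloom filter for a constant-time "should I send?" lookup. A minimal sketch of that idea follows, using only the Python standard library. The class name `BloomFilter`, the helpers `normalize` and `may_send`, the double-hashing scheme over SHA-256, and the lowercasing of addresses are all assumptions made for illustration, not details taken from the platform.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter; bit positions come from double hashing a SHA-256 digest."""

    def __init__(self, expected_items: int, false_positive_rate: float) -> None:
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = math.ceil(
            -expected_items * math.log(false_positive_rate) / math.log(2) ** 2
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd, non-zero stride
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def normalize(address: str) -> str:
    # Assumption: addresses are compared case-insensitively.
    return address.strip().lower()


# Small filter for the demo; the document's list would be sized for ~1B entries.
suppressed = BloomFilter(expected_items=10_000, false_positive_rate=0.001)
suppressed.add(normalize("Bounced@Example.com"))


def may_send(address: str) -> bool:
    """False means 'possibly suppressed'; True means 'definitely not suppressed'."""
    return normalize(address) not in suppressed


print(may_send("bounced@example.com"))
print(may_send("new.user@example.org"))
```

A Bloom filter never misses an address that was added, but it can occasionally report an address as suppressed when it is not. A "possibly suppressed" answer would therefore typically be confirmed against an exact store before the email is dropped.
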
IP reputation management: Dedicated IP pools per customer segment. New IPs warm up gradually (100→200→400/day). Reputation score tracks bounces, spam complaints, and engagement. Bad senders are isolated to prevent pool contamination.
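
The warm-up ramp and reputation-driven allocation can be sketched as follows. This is an illustration under stated assumptions: the cap doubles every two days (which reaches the ~100 emails/sec/IP figure from the scale estimate in roughly five weeks), reputation is a hypothetical 0.0 to 1.0 score, and `SendingIP`, `warmup_daily_cap`, and `pick_ip` are invented names.

```python
import random
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400
# Full volume per IP: ~100 emails/sec sustained for a day (from the scale estimate).
FULL_DAILY_VOLUME = 100 * SECONDS_PER_DAY


def warmup_daily_cap(age_days: int, start: int = 100, doubling_every_days: int = 2) -> int:
    """Daily send cap for an IP that has been warming up for `age_days` days."""
    cap = start * 2 ** (age_days // doubling_every_days)
    return min(cap, FULL_DAILY_VOLUME)


@dataclass
class SendingIP:
    address: str
    age_days: int
    reputation: float  # hypothetical score in [0.0, 1.0]


def pick_ip(pool, sent_today, rng=random):
    """Pick an IP with remaining warm-up budget, weighted by reputation score."""
    eligible = [
        ip
        for ip in pool
        if ip.reputation > 0
        and sent_today.get(ip.address, 0) < warmup_daily_cap(ip.age_days)
    ]
    if not eligible:
        return None  # every IP is at its cap; the message stays queued
    weights = [ip.reputation for ip in eligible]
    return rng.choices(eligible, weights=weights, k=1)[0]


pool = [
    SendingIP("192.0.2.10", age_days=40, reputation=0.9),  # warm
    SendingIP("192.0.2.11", age_days=0, reputation=0.5),   # cold, cap is 100/day
]
sent_today = {"192.0.2.11": 100}  # the cold IP has used its budget
print(pick_ip(pool, sent_today).address)
```

Weighting by reputation means a low score gradually starves an IP of traffic instead of cutting it off abruptly, which matches the "low score, less traffic" rule in the diagram.
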
Per-domain throttling: Each ISP has different rate limits. For example, Gmail accepts ~500/hr from new IPs and Outlook ~1,000/hr. The scheduler maintains per-domain queues with adaptive rate limiting: it backs off on 4xx responses and pauses on 5xx.
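
One way to picture the adaptive throttle is a pacing object per destination domain. The sketch below is an assumption-heavy illustration: `DomainThrottle` is a hypothetical class, halving the rate on 4xx, the 5-minute pause on 5xx, and the 5% recovery step on success are arbitrary choices, and `retry_after` stands for whatever delay hint the ISP supplies, if any. Time is passed in explicitly so the behaviour is easy to follow.

```python
class DomainThrottle:
    """Paces sends to one destination domain and adapts to SMTP responses."""

    def __init__(self, max_per_hour: float, min_per_hour: float = 10.0,
                 pause_seconds: float = 300.0) -> None:
        self.max_per_hour = max_per_hour
        self.min_per_hour = min_per_hour
        self.pause_seconds = pause_seconds
        self.rate_per_hour = max_per_hour  # current learned rate
        self.next_send_at = 0.0            # earliest time the next send is allowed

    def try_acquire(self, now: float) -> bool:
        """Return True if a message may be sent at time `now` (seconds)."""
        if now < self.next_send_at:
            return False
        self.next_send_at = now + 3600.0 / self.rate_per_hour
        return True

    def on_response(self, smtp_code: int, now: float, retry_after: float = None) -> None:
        if 200 <= smtp_code < 300:
            # Success: recover slowly toward the ceiling.
            self.rate_per_hour = min(self.max_per_hour, self.rate_per_hour * 1.05)
        elif 400 <= smtp_code < 500:
            # Deferral: halve the rate and wait at least the hinted delay.
            self.rate_per_hour = max(self.min_per_hour, self.rate_per_hour / 2)
            delay = retry_after if retry_after is not None else 3600.0 / self.rate_per_hour
            self.next_send_at = max(self.next_send_at, now + delay)
        elif smtp_code >= 500:
            # Rejection: pause this domain's queue for a while.
            self.next_send_at = max(self.next_send_at, now + self.pause_seconds)


# Starting limits taken from the figures quoted in this case study.
throttles = {
    "gmail.com": DomainThrottle(max_per_hour=500),
    "outlook.com": DomainThrottle(max_per_hour=1000),
    "yahoo.com": DomainThrottle(max_per_hour=800),
}

gmail = throttles["gmail.com"]
print(gmail.try_acquire(now=0.0))   # first send is allowed
print(gmail.try_acquire(now=1.0))   # too soon at 500/hr (one send per 7.2 s)
gmail.on_response(421, now=1.0)     # deferral halves the rate
print(gmail.rate_per_hour)
```
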
Anti-patterns: Blast all at once: ISPs throttle/block. Shared IPs without isolation: one spammer ruins everyone. Ignoring bounces: reputation tanks within hours. No DKIM/SPF: straight to spam folder.
Bounce handling: Hard bounces (invalid address) → immediately suppress. Soft bounces (mailbox full) → retry 3× over 72 hrs. Complaint feedback loops (FBL) → auto-unsubscribe. Keep bounce rate <2% or risk IP blacklisting.
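
The bounce rules above can be sketched as a small state machine. The names (`BounceState`, `handle_smtp_result`, `handle_complaint`, `bounce_rate_ok`) are hypothetical, and the retry spacing of 8h, 16h, and 32h is an assumption chosen so that three exponentially spaced retries fit inside the 72-hour window.

```python
from dataclasses import dataclass, field

MAX_SOFT_RETRIES = 3
BASE_RETRY_HOURS = 8        # assumption: delays of 8h, 16h, 32h (56h total, under 72h)
BOUNCE_RATE_LIMIT = 0.02    # the <2% threshold from the text


@dataclass
class BounceState:
    suppressed: set = field(default_factory=set)
    soft_failures: dict = field(default_factory=dict)  # address -> failed attempts


def handle_smtp_result(state: BounceState, address: str, smtp_code: int):
    """Return (action, retry_delay_hours) for one delivery attempt."""
    if 200 <= smtp_code < 300:
        state.soft_failures.pop(address, None)
        return ("delivered", None)
    if smtp_code >= 500:
        # Hard bounce: suppress immediately and never send again.
        state.suppressed.add(address)
        return ("suppress", None)
    # Anything else (4xx) is treated as a soft bounce.
    failures = state.soft_failures.get(address, 0) + 1
    state.soft_failures[address] = failures
    if failures > MAX_SOFT_RETRIES:
        # All three retries failed as well: give up on this address.
        state.suppressed.add(address)
        return ("suppress", None)
    return ("retry", BASE_RETRY_HOURS * 2 ** (failures - 1))


def handle_complaint(state: BounceState, address: str) -> None:
    """Feedback-loop report: the recipient marked the mail as spam."""
    state.suppressed.add(address)


def bounce_rate_ok(bounced: int, sent: int) -> bool:
    return sent == 0 or bounced / sent < BOUNCE_RATE_LIMIT


state = BounceState()
print(handle_smtp_result(state, "gone@example.com", 550))
print(handle_smtp_result(state, "full@example.com", 452))
print(bounce_rate_ok(bounced=30, sent=1000))
```
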

Scale Estimation

Step | Derivation | Result | Design Impact
1 | Emails/sec: 500M ÷ 86,400 | ~5,800 emails/sec avg | Moderate throughput; throttling per domain is the constraint
2 | Peak: 5,800 × 5 (marketing blast) | ~29K emails/sec peak | Queue absorbs bursts, MTA pool drains at ISP-allowed rate
3 | IP pool: 29K/sec ÷ ~100 emails/sec/IP | ~300 sending IPs | Warm IPs get more traffic, cold IPs ramp slowly
4 | Bounce processing: 2% of 500M | ~10M bounces/day | Real-time suppression list update (Redis set)
5 | Suppression list: 10M/day × 365 days accumulated | ~1B suppressed addresses (the raw product is ~3.65B bounce events/year; ~1B assumes only unique, permanently suppressed addresses are kept) | Bloom filter for fast "should I send?" check
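
The arithmetic in the table can be checked with a few lines of Python. The Bloom filter sizing at the end is an added illustration: the 1% false-positive rate is an assumption, and the formulas are the standard ones, m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions.

```python
import math

EMAILS_PER_DAY = 500_000_000
SECONDS_PER_DAY = 86_400
PEAK_MULTIPLIER = 5
EMAILS_PER_SEC_PER_IP = 100
BOUNCE_RATE = 0.02
SUPPRESSED_ADDRESSES = 1_000_000_000   # the document's ~1B figure
BLOOM_FALSE_POSITIVE_RATE = 0.01       # assumption, not from the document

avg_per_sec = EMAILS_PER_DAY / SECONDS_PER_DAY                    # ~5,787
peak_per_sec = avg_per_sec * PEAK_MULTIPLIER                      # ~28,935
sending_ips = math.ceil(peak_per_sec / EMAILS_PER_SEC_PER_IP)     # 290, rounded to ~300
bounces_per_day = round(EMAILS_PER_DAY * BOUNCE_RATE)             # 10,000,000
bounce_events_per_year = bounces_per_day * 365                    # 3.65B raw events

# Bloom filter sizing for the suppression list.
bloom_bits = (
    -SUPPRESSED_ADDRESSES * math.log(BLOOM_FALSE_POSITIVE_RATE) / math.log(2) ** 2
)
bloom_gigabytes = bloom_bits / 8 / 1e9                            # ~1.2 GB
bloom_hashes = round(bloom_bits / SUPPRESSED_ADDRESSES * math.log(2))

print(f"avg: {avg_per_sec:,.0f}/sec, peak: {peak_per_sec:,.0f}/sec")
print(f"sending IPs: {sending_ips}")
print(f"bounces/day: {bounces_per_day:,}, events/year: {bounce_events_per_year:,}")
print(f"Bloom filter: {bloom_gigabytes:.2f} GB, {bloom_hashes} hash functions")
```

Under these assumptions a filter for ~1B addresses needs a little over 1 GB, small enough to keep in memory on each sending node, which is the practical reason a Bloom filter suits the pre-send check.
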

Resilience & Edge Cases

Failure | Impact | Recovery
IP blacklisted | All emails from that IP go to spam | Rotate to warm backup IP. Investigate cause (bad sender). Apply for delisting. Isolate offending customer.
ISP rate-limits (4xx) | Emails queued, delivery delayed | Exponential backoff per domain. Spread load across more IPs. Respect Retry-After header.
DKIM key compromised | Attacker can forge emails from your domain | Rotate DKIM keys immediately. Publish new key in DNS. Old signatures invalid within TTL.
Bulk sender sends to purchased list | High bounce rate → IP reputation tanks | Pre-send validation (check MX records). Rate-limit new senders. Suspend accounts exceeding bounce threshold.
Email queue backlog | Delivery delayed hours | Priority queues (transactional > marketing). Auto-scale MTA workers. Alert on queue depth > 1M.
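
The backlog recovery row combines two ideas: transactional mail drains before marketing mail, and queue depth triggers an alert. A minimal sketch using the standard `heapq` module is shown below; `OutboundQueue` and its method names are hypothetical, and the 1M alert threshold is the figure from the table.

```python
import heapq
import itertools

TRANSACTIONAL = 0  # lower number is served first
MARKETING = 1


class OutboundQueue:
    """Priority queue: transactional before marketing, FIFO within each class."""

    def __init__(self, alert_depth: int = 1_000_000) -> None:
        self._heap = []
        self._sequence = itertools.count()  # tie-breaker that preserves insertion order
        self.alert_depth = alert_depth

    def push(self, priority: int, message_id: str) -> None:
        heapq.heappush(self._heap, (priority, next(self._sequence), message_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def depth(self) -> int:
        return len(self._heap)

    def needs_alert(self) -> bool:
        return self.depth() > self.alert_depth


queue = OutboundQueue(alert_depth=2)  # tiny threshold for the demo
queue.push(MARKETING, "newsletter-1")
queue.push(TRANSACTIONAL, "password-reset-1")
queue.push(MARKETING, "newsletter-2")

print(queue.needs_alert())
print([queue.pop() for _ in range(queue.depth())])
```
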

Interview Cheat Sheet

1. DKIM/SPF/DMARC: cryptographic signing + IP authorization + policy alignment for inbox placement
2. IP warm-up: gradual volume increase over 4-6 weeks to build reputation with ISPs
3. Per-domain throttling: respect ISP rate limits (Gmail ≠ Outlook ≠ Yahoo)
4. Bounce classification: hard (suppress immediately) vs soft (retry with backoff)
5. IP pool isolation: separate bad senders to protect shared reputation
6. Feedback loops: process spam complaints in real-time, auto-suppress complainers
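
To make the "IP authorization" part of item 1 concrete, here is a deliberately simplified sketch of an SPF-style check. It only evaluates `ip4:` and `ip6:` mechanisms in a record string that is passed in directly. Real SPF evaluation also performs DNS lookups and handles `include`, `a`, `mx`, `redirect`, and qualifiers, none of which is covered here. The function name `spf_ip_authorized` and the example record are assumptions.

```python
import ipaddress


def spf_ip_authorized(spf_record: str, sending_ip: str) -> bool:
    """Return True if `sending_ip` matches an ip4:/ip6: mechanism in the record."""
    terms = spf_record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return False  # not an SPF record
    ip = ipaddress.ip_address(sending_ip)
    for term in terms[1:]:
        mechanism = term.lstrip("+").lower()  # "+" is the default (pass) qualifier
        if mechanism.startswith(("ip4:", "ip6:")):
            network = ipaddress.ip_network(mechanism[4:], strict=False)
            if ip.version == network.version and ip in network:
                return True
    return False  # no match: a trailing "-all" would make this a hard fail


# Hypothetical record using documentation address ranges.
record = "v=spf1 ip4:192.0.2.0/24 ip4:198.51.100.7 -all"
print(spf_ip_authorized(record, "192.0.2.55"))
print(spf_ip_authorized(record, "203.0.113.9"))
```
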