Cache Problems & Real-World Solutions

Caching issues can turn a snappy application into a sluggish mess (or a total outage) in seconds. Here are some real-world-inspired scenarios and how high-scale companies handle them.


1. Thundering Herd (The “Midnight Expiry”)

This happens when you set a fixed TTL (Time To Live) for a large batch of data.

  • Case Study: A major E-commerce platform during a holiday sale.
    • The Scenario: They cached 100,000 product descriptions at exactly 12:00 AM with a 24-hour expiry. At 12:00 AM the next day, all 100,000 cache keys expired simultaneously.
    • The Result: The next wave of user requests found “Cache Misses” for everything. The database was suddenly hit with 50,000+ concurrent queries to “re-warm” the cache, causing a database CPU spike to 100% and crashing the site.
    • The Fix: Jitter. By adding a random offset to the TTL (e.g., 24 hours + rand(0, 300) seconds), the expirations are staggered over several minutes rather than hitting all at once.
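
Here is a minimal sketch of jittered TTLs, assuming a Redis-style client; the cache_with_jitter helper and the key names are illustrative, not the platform's actual code:

```python
import json
import random

BASE_TTL = 24 * 60 * 60   # 24-hour base expiry, in seconds
MAX_JITTER = 300          # spread expirations over an extra 0-5 minutes

def cache_with_jitter(cache, key, value):
    """Store a value with a randomized TTL so keys written together
    do not all expire at the same instant."""
    ttl = BASE_TTL + random.randint(0, MAX_JITTER)
    cache.set(key, json.dumps(value), ex=ttl)

# Usage (hypothetical): warming 100,000 product descriptions at midnight
# now produces expirations staggered across a 5-minute window.
# for product in products:
#     cache_with_jitter(redis_client, f"product:{product.id}", product.description)
```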

2. Cache Penetration (The “Ghost Key” Attack)

This occurs when requests are made for data that exists in neither the cache nor the database.

  • Case Study: A Social Media Startup under a malicious bot attack.
    • The Scenario: An attacker used a script to request profiles with non-existent IDs (e.g., example.com/user/-9999 or random UUIDs).
    • The Result: The application checked the cache (Miss) and then queried the database (Null). Since the result was null, nothing was cached. The attacker sent millions of these requests, bypassing the cache entirely and overwhelming the database disk I/O.
    • The Fix: Bloom Filters. They implemented a Bloom filter, a space-efficient probabilistic data structure populated with every valid ID, at the application level. If the Bloom filter says “No,” the app rejects the request immediately without even touching the database.
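
Here is a minimal sketch of that gate, using a small hand-rolled Bloom filter rather than the startup's actual implementation; the get_profile helper and the cache/db calls are illustrative assumptions:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: may return false positives, never false negatives."""
    def __init__(self, size_bits=1_000_000, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

valid_ids = BloomFilter()
# Populate once from the source of truth (hypothetical loader):
# for user_id in db.all_user_ids():
#     valid_ids.add(user_id)

def get_profile(user_id, cache, db):
    if not valid_ids.might_contain(user_id):
        return None                                  # definitely not a real user: skip cache and DB
    return cache.get(user_id) or db.fetch_user(user_id)   # normal cache-aside path
```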

3. Cache Breakdown (The “Hot Key” Collapse)

Unlike Thundering Herd, this involves a single extremely popular key (a “Hot Key”).

  • Case Study: A News Outlet during a breaking world event.
    • The Scenario: A viral news article was being requested 10,000 times per second. The cache key for that article expired.
    • The Result: In the few milliseconds it took for the first “miss” to fetch the data from the DB and write it back to the cache, 5,000 other concurrent requests also saw a “miss” and rushed the database for the exact same row. This is often called “Cache Stampede.”
    • The Fix: Mutex Locks. The first request to see a “miss” acquires a lock. Other requests for that same key are told to wait or “sleep” for a few milliseconds until the first request updates the cache.
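
Here is a minimal sketch of the lock-and-wait pattern using Redis SET with NX and an expiry; the fetch_article helper, key names, and timings are illustrative assumptions:

```python
import json
import time

LOCK_TTL = 5   # seconds; the lock auto-expires so a crashed rebuilder cannot block forever

def get_article(redis, db, article_id):
    key = f"article:{article_id}"
    cached = redis.get(key)
    if cached is not None:
        return json.loads(cached)

    # Cache miss: only one request may rebuild; the rest briefly wait and re-check.
    lock_key = f"lock:{key}"
    if redis.set(lock_key, "1", nx=True, ex=LOCK_TTL):
        try:
            article = db.fetch_article(article_id)        # single DB hit for the hot key
            redis.set(key, json.dumps(article), ex=60)
            return article
        finally:
            redis.delete(lock_key)
    else:
        time.sleep(0.05)                                  # back off a few milliseconds
        return get_article(redis, db, article_id)         # retry; the cache is likely warm now
```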

4. Cache Crash (The “Total Blackout”)

This is the nightmare scenario where the entire caching layer (e.g., Redis cluster) goes offline.

  • Case Study: Facebook (2010 Outage).
    • The Scenario: An automated system attempted to fix a configuration error but instead triggered a feedback loop that took down their caching cluster.
    • The Result: With the cache down, the massive volume of traffic shifted directly to the databases. The databases, not scaled for that kind of load, buckled instantly.
    • The Fix: Circuit Breakers and Multi-Level Caching. Modern architectures use a “Circuit Breaker” pattern. If the cache is down, the system might serve “stale” data from a local backup, or simply return an error or a limited version of the site, to keep the databases from collapsing entirely.
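
Here is a minimal sketch of a circuit breaker wrapped around a cache client; the thresholds, method names, and fallback behavior are illustrative assumptions, not Facebook's actual design:

```python
import time

class CacheCircuitBreaker:
    """Stops calling a failing cache for a cool-down period, so lookups fail fast
    instead of piling up behind cache timeouts."""
    def __init__(self, cache, failure_threshold=5, reset_timeout=30):
        self.cache = cache
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed (cache in use)

    def get(self, key):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                return None                   # circuit open: skip the cache entirely
            self.opened_at = None             # cool-down over: try the cache again
            self.failures = 0
        try:
            return self.cache.get(key)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()  # trip the breaker
            return None

# Callers treat None as a miss and fall back to a rate-limited DB read, a stale
# local copy, or a degraded page, rather than stampeding the database.
```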

Comparison Summary

Problem            Target               Primary Solution
Thundering Herd    Many keys            TTL Jitter
Penetration        Non-existent keys    Bloom Filter / Cache Nulls
Breakdown          One “Hot” key        Mutex Locks (SetNX)
Crash              Entire System        Circuit Breakers / Redundancy
