Day 28 — Caching strategies for data systems

Phase 1: Foundations & Frameworks | Category: Caching & Performance

Why Caching Is a Senior DE Topic (Not Just SWE)

Most caching content is oriented toward application engineers — caching user sessions, profile data, API responses. For data engineering interviews, caching appears in a different context: pre-computing expensive analytical aggregations, serving ML features at low latency, reducing warehouse query costs, and managing the freshness-vs-performance trade-off across your entire data platform. As System Design Handbook puts it: “Stale reads are OK in feeds, but not in account balances.” The DE version of this: stale reads are OK in a dashboard showing last week’s GMV, but not in a fraud detection feature store.

The Caching Hierarchy: Where Caches Live

Before choosing a strategy, you need to know WHERE caching happens in a data system:

L1 — Query Engine (BigQuery / Snowflake)
    Query result cache (in-engine, free, seconds TTL)
    ↓ on miss

L2 — Materialized Views / Summary Tables
    Pre-computed aggregates (warehouse, minutes–hours)
    ↓ on miss

L3 — Serving Cache (Redis)
    Computed results (in-memory, milliseconds)
    ↓ on miss

L4 — Data Warehouse / OLAP Store
    Source of truth (seconds–minutes query time)

Each layer has different latency, freshness, and cost characteristics. A senior DE designs which data lives at which layer — and why.

The Four Caching Patterns

1. Cache-Aside (Lazy Loading)

The most common pattern. The application manages the cache directly.

READ
  1. Check Redis for key
  2. Cache HIT → return data (sub-ms)
  3. Cache MISS → query warehouse/DB → store in Redis with TTL → return data

WRITE
  → Write to source of truth only
  → Let cache expire naturally (TTL-driven invalidation)
     OR explicitly delete the cache key on write (event-driven invalidation)

Characteristics:

Cache only contains data that was actually requested (“hot” data only)
Cold start problem: first request always hits the source
Read performance improves over time as cache warms up
Stale data between updates (TTL determines max staleness)

Data engineering use case: Caching pre-aggregated dashboard metrics. When an analyst requests “DAU for March,” check Redis first. On miss, query BigQuery, store result with a 30-minute TTL, return.

When to choose: Read-heavy workloads where writes are infrequent and some staleness is tolerable. The default pattern for 80% of analytical caching use cases.

2. Write-Through

Every write goes to BOTH the cache and the database synchronously.

WRITE
  1. Write to Redis (cache)
  2. Write to database (source of truth)
  3. Both succeed → ACK to client
     Either fails → entire operation fails (atomicity)

READ
  → Always hits cache (warm on every key that was ever written)

Characteristics:

Cache is always consistent with the database
Higher write latency (two synchronous writes)
Cache populated on writes, not reads → no cold start for recently written data
Wastes cache space: every write populates cache even if that data is never read

Data engineering use case: Feature store serving. When a batch job updates user LTV features, it writes to both Redis (online store) and the data warehouse (offline store) atomically. The ML inference service reads from Redis and always gets fresh features.

When to choose: Data where stale reads are genuinely harmful. User account balances, inventory counts, real-time feature stores. The cost (higher write latency) is worth the benefit (no stale reads).

3. Write-Behind (Write-Back)

Write to cache first, flush to database asynchronously.

WRITE
  1. Write to Redis immediately → ACK to client (fast)
  2. Async flush to database in background (queue-based, batched)

READ
  → Reads from cache (always warm for recently written data)

Characteristics:

Fastest write path — application doesn’t wait for DB write
Risk: data loss if Redis fails before async flush completes
Batching multiple writes → fewer DB writes → lower DB load
Complexity: need reliable flush queue + retry logic

Data engineering use case: High-frequency metric counters. A counter for “active connections” updated 10K times/sec — writing to Postgres at that rate is impractical. Write to Redis counters, flush aggregated counts to Postgres every 5 seconds. If Redis fails, you lose at most 5 seconds of counts — acceptable for non-financial metrics.

When to choose: Write-heavy workloads where DB write throughput is the bottleneck AND some data loss is tolerable. NOT for financial or compliance data.

4. Read-Through

Like cache-aside, but the cache itself is responsible for loading from the database on a miss (not the application).

READ
  1. Application queries cache only
  2. Cache HIT → return data
  3. Cache MISS → cache fetches from DB, stores locally, returns to app

Application never directly queries the database.

Characteristics:

Application code is simpler — no cache-miss logic in application
Cache handles the DB call → centralizes the “load on miss” behavior
Less common in distributed data systems (Redis doesn’t natively do this)

Where it appears in DE context: Database proxy layers (PlanetScale’s branching, Snowflake query result caching). The warehouse’s built-in query result cache is effectively read-through — same query, same result, served from cache automatically.

TTL Strategy: The Practical Tuning Knob

TTL (Time-to-Live) is the most direct control over freshness vs. performance.

TTL selection framework:

Data type	Acceptable staleness	Recommended TTL
Static reference data (country codes, categories)	Hours to days	24 hours
User profile (non-financial)	Minutes	5–15 minutes
Dashboard aggregates (internal)	Minutes	5–30 minutes
Feature store (non-critical ML)	Minutes	5–15 minutes
Feature store (fraud / billing ML)	Seconds	30–60 seconds
Real-time metric (live dashboard)	Seconds	10–30 seconds
Account balances, inventory counts	Near-zero	No cache, or write-through + immediate invalidation

TTL pitfalls to mention in interviews:

Thundering herd (cache stampede): Many keys expire simultaneously → all requests hit the database at once → DB overload.

Solutions:

Jitter: Add random offset to TTL (TTL = base_ttl + random(0, base_ttl * 0.1))
Probabilistic early expiration: Before TTL expires, some requests proactively refresh the cache
Request coalescing (mutex lock): On cache miss, only one request queries the DB; others wait for it to repopulate the cache

# Request coalescing with Redis distributed lock
def get_metric(key):
    result = redis.get(key)
    if result:
        return result
    lock_key = f"lock:{key}"
    if redis.set(lock_key, 1, nx=True, ex=5):  # acquire lock
        result = query_warehouse(key)
        redis.setex(key, ttl=300, value=result)
        redis.delete(lock_key)
        return result
    else:
        time.sleep(0.1)  # wait for lock holder to populate cache
        return redis.get(key)  # retry

Cache penetration: Requests for keys that don’t exist in cache OR database bypass the cache and hammer the DB on every request.

Solution: Cache negative results. SET user:99999 “null” EX 60 — subsequent requests return the cached null rather than hitting the DB.

Hot keys: A single cache key receives disproportionate read traffic (e.g., the metrics for “total global DAU” — every dashboard reads it). Single Redis node becomes a bottleneck.

Solutions:

Key sharding: Replicate hot key across multiple Redis nodes with a random suffix: dau:2026-04-10:shard0, :shard1, … :shard9. Read from random shard.
Local in-process cache: For extremely hot read-only data, cache in application memory for 1-5 seconds. Eliminates network hop entirely.

Materialized Views: Caching at the Warehouse Layer

Materialized views are “cache” in the warehouse layer — pre-computed query results stored as a table, refreshed periodically.

When to use materialized views:

Scenario	Recommendation
Dashboard query runs identically 1000×/day	Materialize — compute once, serve many
Query scans 5 TB but aggregates to ~1000 rows	Materialize at gold — avoid 5 TB scan every time
Ad-hoc query where results change with every filter	Don’t materialize — flexibility beats pre-computation
Sub-second dashboard SLA on a large table	Materialize + Redis on top

BigQuery materialized views:

CREATE MATERIALIZED VIEW gold.daily_dau_by_country
OPTIONS (
  enable_refresh = true,
  refresh_interval_minutes = 60
) AS
SELECT
  event_date,
  country,
  COUNT(DISTINCT user_id) AS dau
FROM gold.fact_user_events
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
GROUP BY event_date, country;

BigQuery automatically rewrites queries against the base table to use the materialized view when possible — transparent optimization.

Snowflake dynamic tables (2025-2026): Define a target lag (how stale is acceptable) and Snowflake automatically refreshes the table:

CREATE DYNAMIC TABLE gold.daily_revenue
  TARGET_LAG = '1 hour'
  WAREHOUSE = analytics_wh
AS
SELECT order_date, SUM(line_total) AS revenue
FROM silver.fact_orders
GROUP BY order_date;

Pre-Computation vs On-Demand: The Core DE Trade-off

This is the senior-level framing that ties caching, materialized views, and pipeline design together.

Topic	Details
Query latency	Very low (read pre-computed result) Higher (compute at query time)
Flexibility	Fixed dimensions/metrics only Any query, any filter
Freshness	Depends on refresh schedule Always current
Storage cost	Higher (store aggregates) Lower (only raw data)
Compute cost	Computed once, read many times Computed on every read
Engineering complexity	Higher (define & maintain pre-comp jobs) Lower (just query the table)

Decision matrix for data engineering:

How often is this query pattern run?

├── Hundreds/thousands of times daily → Pre-compute
│       Is the metric fixed (same dimensions always)?
│       ├── YES → Materialized view (warehouse layer)
│       └── NO → Pre-compute common dimension combos + cache remainder
└── Rarely (ad-hoc) → On-demand from warehouse

What's the acceptable latency?

├── <100 ms (user-facing dashboard) → Redis + materialized view
├── <5 s (analyst dashboard) → Materialized view in warehouse
└── Minutes (scheduled reports) → On-demand at batch time

What's the query size?

├── Scans >1 TB → Almost always worth pre-computing
└── Scans <100 GB → On-demand is often fine with good partitioning

The layered approach (what you should propose in interviews):

Tier 1 — Redis cache (<100 ms)
  Top ~50 most common queries, 5–30 min TTL

Tier 2 — Materialized views in warehouse (<5 s)
  Common aggregations, hourly refresh

Tier 3 — Partitioned base tables (<30 s)
  Ad-hoc queries with date partition pruning

Tier 4 — Full historical data (minutes)
  Rare deep queries, cold storage

Redis for Data Engineering: Specific Patterns

Feature Store Pattern

Offline pipeline (nightly Spark job)
  Compute user features → write to Redis (online) + Iceberg (offline)
  Redis key: "features:{user_id}"
  TTL: 24 hours (refreshed daily by pipeline)

Online inference path
  Model needs features → GET "features:{user_id}" from Redis
  Cache miss (new user) → default features or cold-start logic

Pre-computed Metric Cache

Airflow job (every 5 min)
  Query BigQuery: SELECT country, DAU FROM gold.dau_5min
  Write each row: SETEX "dau:{country}:{window}" 300 {value}

Dashboard API
  GET /metrics/dau?country=US → MGET ["dau:US:{window}"] → sub-ms response

Leaderboard / Sorted Metrics (Redis Sorted Sets)

Use case: "Top 10 content pieces by views in the last hour"

Redis sorted set
  ZADD "top_content:2026-04-10-14"
  → Updated every minute from Flink streaming aggregations

API
  ZREVRANGE "top_content:2026-04-10-14" 0 9 WITHSCORES
  → Top 10 with scores in microseconds

Interview Questions

Q1: “Your analytics API serves 10,000 dashboard queries per minute, all reading from BigQuery. Query costs are high and latency is 2-3 seconds. How do you improve this?”

Model Answer: “I’d implement a three-layer caching strategy. Layer 1: BigQuery’s built-in query result cache is free and already active — the same query returning the same result within 24 hours is served from cache at no cost. First step is audit: are dashboards sending identical queries? If not, normalize them (consistent parameter formatting, no unnecessary UDFs that defeat caching). Layer 2: For the top 20% of queries that drive 80% of the traffic, materialize them as BigQuery materialized views refreshed hourly. Now those queries hit a 1000-row result table instead of scanning 10TB. Layer 3: For the absolute top queries (same metric, same dimensions, refreshed every 5 minutes), use a Redis cache in front of the API. The Airflow job runs BigQuery every 5 minutes, writes results to Redis with a 5-minute TTL, and the API reads from Redis first. This reduces BigQuery calls for those queries by 99%. Result: latency drops from 2-3 seconds to < 100ms for cached queries, cost drops by estimated 70% from eliminating redundant scans.”

Q2: “How does caching interact with data freshness in a fraud detection feature store? Can you cache features if freshness is critical?”

Model Answer: “Yes, but with a short TTL and write-through pattern rather than cache-aside. For fraud detection, the risk model needs features like ‘transactions in the last 5 minutes’ and ‘account balance change in the last hour.’ Serving stale features means potentially missing a fraud signal. My design: batch features (computed daily — like historical average transaction amount) get a 1-hour TTL in Redis — they don’t change frequently and a 1-hour staleness is fine. Real-time features (last 5 minutes of activity, computed by Flink) use write-through — the streaming pipeline writes to Redis and the feature store simultaneously. TTL = 6 minutes (slightly longer than the recomputation window, so there’s always a valid value). This way, the feature store always has a value, it’s at most 5 minutes stale for real-time features, and the inference service reads at sub-10ms. For the absolute freshest state (current account balance) — no cache, direct DB read with Redis acting only as a fallback if the DB is unavailable, returning a slightly stale value rather than an error.”

Think About This

You’re in a Google interview. The prompt: “Design the caching layer for Google Ads analytics. Advertisers check their campaign performance dashboards hundreds of times per day. The underlying data warehouse processes billions of impressions and clicks daily. Current dashboard load time is 8 seconds. Target: < 500ms.”

Walk through:

Where is the 8-second latency coming from? (Likely scanning large unpartitioned or poorly partitioned BigQuery tables + no caching layer)
What’s the first optimization before adding Redis? (Partitioning by date + clustering on campaign_id in BigQuery — reduces scans by 100x for typical date-filtered queries)
What can be materialized? (Daily campaign performance metrics at daily grain, pre-aggregated by campaign + ad group + keyword — covers 90% of dashboard queries)
What needs Redis? (Real-time spend monitoring — advertisers check “how much have I spent today?” constantly — put this in Redis, updated every minute from Flink, TTL = 2 minutes)
What can’t be cached? (Custom ad-hoc date ranges, multi-campaign cross-account analysis — route these to BigQuery, set user expectation that these take 5-30 seconds)

The insight: you can hit the < 500ms target for 90% of queries with materialized views alone, without Redis. Redis is only needed for the remaining 10% that require real-time freshness. Propose the simplest solution first.

Quick Reference

Cache-Aside: Application manages cache. Best for read-heavy, tolerable-staleness use cases. Default for analytical caching.
Write-Through: Writes go to cache AND DB synchronously. Best for critical data where staleness is harmful (feature stores, account state).
Write-Behind: Write to cache first, async flush to DB. Best for write-heavy non-critical counters. Risk: data loss on cache failure.
TTL strategy: Match TTL to business tolerance for staleness. Add jitter to prevent thundering herd. Cache negative results to prevent cache penetration.
Hot key fixes: Key sharding (random suffix across N replicas) + in-process L0 cache for extremely hot read-only data.
Materialized views = warehouse-layer cache: Pre-compute expensive aggregations. BigQuery and Snowflake support automatic refresh with lag targets.
Pre-compute when: Query runs hundreds/day, scans > 1 TB, fixed metric/dimension pattern.
On-demand when: Ad-hoc queries, rarely run, flexible filters.
The layered answer: Redis (sub-100ms) → Materialized views (sub-5s) → Partitioned table (sub-30s) → Full scan (minutes). Each layer serves a different freshness/cost/flexibility point.

Tomorrow’s Preview

Day 29: Pipeline Observability & Monitoring — Metrics (freshness, completeness, volume), logging, alerting. Circuit breakers for pipelines. SLOs/SLIs for data platforms. Incident response for data outages. The operational maturity topic that distinguishes senior DEs who’ve run systems in production from those who’ve only built them.

Day 28: Caching strategies for data systems

Why Caching Is a Senior DE Topic (Not Just SWE)

The Caching Hierarchy: Where Caches Live

The Four Caching Patterns

1. Cache-Aside (Lazy Loading)

2. Write-Through

3. Write-Behind (Write-Back)

4. Read-Through

TTL Strategy: The Practical Tuning Knob

Materialized Views: Caching at the Warehouse Layer

Pre-Computation vs On-Demand: The Core DE Trade-off

Redis for Data Engineering: Specific Patterns

Feature Store Pattern

Pre-computed Metric Cache

Leaderboard / Sorted Metrics (Redis Sorted Sets)

Interview Questions

Think About This

Quick Reference

Tomorrow’s Preview

Feedback