Phase 1: Foundations & Frameworks | Category: Caching & Performance
Why Caching Is a Senior DE Topic (Not Just SWE)
Most caching content is oriented toward application engineers — caching user sessions, profile data, API responses. For data engineering interviews, caching appears in a different context: pre-computing expensive analytical aggregations, serving ML features at low latency, reducing warehouse query costs, and managing the freshness-vs-performance trade-off across your entire data platform. As System Design Handbook puts it: “Stale reads are OK in feeds, but not in account balances.” The DE version of this: stale reads are OK in a dashboard showing last week’s GMV, but not in a fraud detection feature store.
The Caching Hierarchy: Where Caches Live
Before choosing a strategy, you need to know WHERE caching happens in a data system:
L1 — Query Engine (BigQuery / Snowflake)
Query result cache (in-engine, free, seconds TTL)
↓ on miss
L2 — Materialized Views / Summary Tables
Pre-computed aggregates (warehouse, minutes–hours)
↓ on miss
L3 — Serving Cache (Redis)
Computed results (in-memory, milliseconds)
↓ on miss
L4 — Data Warehouse / OLAP Store
Source of truth (seconds–minutes query time)
Each layer has different latency, freshness, and cost characteristics. A senior DE designs which data lives at which layer — and why.
The Four Caching Patterns
1. Cache-Aside (Lazy Loading)
The most common pattern. The application manages the cache directly.
READ
1. Check Redis for key
2. Cache HIT → return data (sub-ms)
3. Cache MISS → query warehouse/DB → store in Redis with TTL → return data
WRITE
→ Write to source of truth only
→ Let cache expire naturally (TTL-driven invalidation)
OR explicitly delete the cache key on write (event-driven invalidation)
Characteristics:
-
Cache only contains data that was actually requested (“hot” data only)
-
Cold start problem: first request always hits the source
-
Read performance improves over time as cache warms up
-
Stale data between updates (TTL determines max staleness)
Data engineering use case: Caching pre-aggregated dashboard metrics. When an analyst requests “DAU for March,” check Redis first. On miss, query BigQuery, store result with a 30-minute TTL, return.
When to choose: Read-heavy workloads where writes are infrequent and some staleness is tolerable. The default pattern for 80% of analytical caching use cases.
2. Write-Through
Every write goes to BOTH the cache and the database synchronously.
WRITE
1. Write to Redis (cache)
2. Write to database (source of truth)
3. Both succeed → ACK to client
Either fails → entire operation fails (atomicity)
READ
→ Always hits cache (warm on every key that was ever written)
Characteristics:
-
Cache is always consistent with the database
-
Higher write latency (two synchronous writes)
-
Cache populated on writes, not reads → no cold start for recently written data
-
Wastes cache space: every write populates cache even if that data is never read
Data engineering use case: Feature store serving. When a batch job updates user LTV features, it writes to both Redis (online store) and the data warehouse (offline store) atomically. The ML inference service reads from Redis and always gets fresh features.
When to choose: Data where stale reads are genuinely harmful. User account balances, inventory counts, real-time feature stores. The cost (higher write latency) is worth the benefit (no stale reads).
3. Write-Behind (Write-Back)
Write to cache first, flush to database asynchronously.
WRITE
1. Write to Redis immediately → ACK to client (fast)
2. Async flush to database in background (queue-based, batched)
READ
→ Reads from cache (always warm for recently written data)
Characteristics:
-
Fastest write path — application doesn’t wait for DB write
-
Risk: data loss if Redis fails before async flush completes
-
Batching multiple writes → fewer DB writes → lower DB load
-
Complexity: need reliable flush queue + retry logic
Data engineering use case: High-frequency metric counters. A counter for “active connections” updated 10K times/sec — writing to Postgres at that rate is impractical. Write to Redis counters, flush aggregated counts to Postgres every 5 seconds. If Redis fails, you lose at most 5 seconds of counts — acceptable for non-financial metrics.
When to choose: Write-heavy workloads where DB write throughput is the bottleneck AND some data loss is tolerable. NOT for financial or compliance data.
4. Read-Through
Like cache-aside, but the cache itself is responsible for loading from the database on a miss (not the application).
READ
1. Application queries cache only
2. Cache HIT → return data
3. Cache MISS → cache fetches from DB, stores locally, returns to app
Application never directly queries the database.
Characteristics:
-
Application code is simpler — no cache-miss logic in application
-
Cache handles the DB call → centralizes the “load on miss” behavior
-
Less common in distributed data systems (Redis doesn’t natively do this)
Where it appears in DE context: Database proxy layers (PlanetScale’s branching, Snowflake query result caching). The warehouse’s built-in query result cache is effectively read-through — same query, same result, served from cache automatically.
TTL Strategy: The Practical Tuning Knob
TTL (Time-to-Live) is the most direct control over freshness vs. performance.
TTL selection framework:
| Data type | Acceptable staleness | Recommended TTL |
|---|---|---|
| Static reference data (country codes, categories) | Hours to days | 24 hours |
| User profile (non-financial) | Minutes | 5–15 minutes |
| Dashboard aggregates (internal) | Minutes | 5–30 minutes |
| Feature store (non-critical ML) | Minutes | 5–15 minutes |
| Feature store (fraud / billing ML) | Seconds | 30–60 seconds |
| Real-time metric (live dashboard) | Seconds | 10–30 seconds |
| Account balances, inventory counts | Near-zero | No cache, or write-through + immediate invalidation |
TTL pitfalls to mention in interviews:
Thundering herd (cache stampede): Many keys expire simultaneously → all requests hit the database at once → DB overload.
Solutions:
-
Jitter: Add random offset to TTL (TTL = base_ttl + random(0, base_ttl * 0.1))
-
Probabilistic early expiration: Before TTL expires, some requests proactively refresh the cache
-
Request coalescing (mutex lock): On cache miss, only one request queries the DB; others wait for it to repopulate the cache
# Request coalescing with Redis distributed lock
def get_metric(key):
result = redis.get(key)
if result:
return result
lock_key = f"lock:{key}"
if redis.set(lock_key, 1, nx=True, ex=5): # acquire lock
result = query_warehouse(key)
redis.setex(key, ttl=300, value=result)
redis.delete(lock_key)
return result
else:
time.sleep(0.1) # wait for lock holder to populate cache
return redis.get(key) # retry
Cache penetration: Requests for keys that don’t exist in cache OR database bypass the cache and hammer the DB on every request.
Solution: Cache negative results. SET user:99999 “null” EX 60 — subsequent requests return the cached null rather than hitting the DB.
Hot keys: A single cache key receives disproportionate read traffic (e.g., the metrics for “total global DAU” — every dashboard reads it). Single Redis node becomes a bottleneck.
Solutions:
-
Key sharding: Replicate hot key across multiple Redis nodes with a random suffix: dau:2026-04-10:shard0, :shard1, … :shard9. Read from random shard.
-
Local in-process cache: For extremely hot read-only data, cache in application memory for 1-5 seconds. Eliminates network hop entirely.
Materialized Views: Caching at the Warehouse Layer
Materialized views are “cache” in the warehouse layer — pre-computed query results stored as a table, refreshed periodically.
When to use materialized views:
| Scenario | Recommendation |
|---|---|
| Dashboard query runs identically 1000×/day | Materialize — compute once, serve many |
| Query scans 5 TB but aggregates to ~1000 rows | Materialize at gold — avoid 5 TB scan every time |
| Ad-hoc query where results change with every filter | Don’t materialize — flexibility beats pre-computation |
| Sub-second dashboard SLA on a large table | Materialize + Redis on top |
BigQuery materialized views:
CREATE MATERIALIZED VIEW gold.daily_dau_by_country
OPTIONS (
enable_refresh = true,
refresh_interval_minutes = 60
) AS
SELECT
event_date,
country,
COUNT(DISTINCT user_id) AS dau
FROM gold.fact_user_events
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
GROUP BY event_date, country;
BigQuery automatically rewrites queries against the base table to use the materialized view when possible — transparent optimization.
Snowflake dynamic tables (2025-2026): Define a target lag (how stale is acceptable) and Snowflake automatically refreshes the table:
CREATE DYNAMIC TABLE gold.daily_revenue
TARGET_LAG = '1 hour'
WAREHOUSE = analytics_wh
AS
SELECT order_date, SUM(line_total) AS revenue
FROM silver.fact_orders
GROUP BY order_date;
Pre-Computation vs On-Demand: The Core DE Trade-off
This is the senior-level framing that ties caching, materialized views, and pipeline design together.
| Topic | Details |
|---|---|
| Query latency | Very low (read pre-computed result) Higher (compute at query time) |
| Flexibility | Fixed dimensions/metrics only Any query, any filter |
| Freshness | Depends on refresh schedule Always current |
| Storage cost | Higher (store aggregates) Lower (only raw data) |
| Compute cost | Computed once, read many times Computed on every read |
| Engineering complexity | Higher (define & maintain pre-comp jobs) Lower (just query the table) |
Decision matrix for data engineering:
How often is this query pattern run?
├── Hundreds/thousands of times daily → Pre-compute
│ Is the metric fixed (same dimensions always)?
│ ├── YES → Materialized view (warehouse layer)
│ └── NO → Pre-compute common dimension combos + cache remainder
└── Rarely (ad-hoc) → On-demand from warehouse
What's the acceptable latency?
├── <100 ms (user-facing dashboard) → Redis + materialized view
├── <5 s (analyst dashboard) → Materialized view in warehouse
└── Minutes (scheduled reports) → On-demand at batch time
What's the query size?
├── Scans >1 TB → Almost always worth pre-computing
└── Scans <100 GB → On-demand is often fine with good partitioning
The layered approach (what you should propose in interviews):
Tier 1 — Redis cache (<100 ms)
Top ~50 most common queries, 5–30 min TTL
Tier 2 — Materialized views in warehouse (<5 s)
Common aggregations, hourly refresh
Tier 3 — Partitioned base tables (<30 s)
Ad-hoc queries with date partition pruning
Tier 4 — Full historical data (minutes)
Rare deep queries, cold storage
Redis for Data Engineering: Specific Patterns
Feature Store Pattern
Offline pipeline (nightly Spark job)
Compute user features → write to Redis (online) + Iceberg (offline)
Redis key: "features:{user_id}"
TTL: 24 hours (refreshed daily by pipeline)
Online inference path
Model needs features → GET "features:{user_id}" from Redis
Cache miss (new user) → default features or cold-start logic
Pre-computed Metric Cache
Airflow job (every 5 min)
Query BigQuery: SELECT country, DAU FROM gold.dau_5min
Write each row: SETEX "dau:{country}:{window}" 300 {value}
Dashboard API
GET /metrics/dau?country=US → MGET ["dau:US:{window}"] → sub-ms response
Leaderboard / Sorted Metrics (Redis Sorted Sets)
Use case: "Top 10 content pieces by views in the last hour"
Redis sorted set
ZADD "top_content:2026-04-10-14"
→ Updated every minute from Flink streaming aggregations
API
ZREVRANGE "top_content:2026-04-10-14" 0 9 WITHSCORES
→ Top 10 with scores in microseconds
Interview Questions
Q1: “Your analytics API serves 10,000 dashboard queries per minute, all reading from BigQuery. Query costs are high and latency is 2-3 seconds. How do you improve this?”
Model Answer: “I’d implement a three-layer caching strategy. Layer 1: BigQuery’s built-in query result cache is free and already active — the same query returning the same result within 24 hours is served from cache at no cost. First step is audit: are dashboards sending identical queries? If not, normalize them (consistent parameter formatting, no unnecessary UDFs that defeat caching). Layer 2: For the top 20% of queries that drive 80% of the traffic, materialize them as BigQuery materialized views refreshed hourly. Now those queries hit a 1000-row result table instead of scanning 10TB. Layer 3: For the absolute top queries (same metric, same dimensions, refreshed every 5 minutes), use a Redis cache in front of the API. The Airflow job runs BigQuery every 5 minutes, writes results to Redis with a 5-minute TTL, and the API reads from Redis first. This reduces BigQuery calls for those queries by 99%. Result: latency drops from 2-3 seconds to < 100ms for cached queries, cost drops by estimated 70% from eliminating redundant scans.”
Q2: “How does caching interact with data freshness in a fraud detection feature store? Can you cache features if freshness is critical?”
Model Answer: “Yes, but with a short TTL and write-through pattern rather than cache-aside. For fraud detection, the risk model needs features like ‘transactions in the last 5 minutes’ and ‘account balance change in the last hour.’ Serving stale features means potentially missing a fraud signal. My design: batch features (computed daily — like historical average transaction amount) get a 1-hour TTL in Redis — they don’t change frequently and a 1-hour staleness is fine. Real-time features (last 5 minutes of activity, computed by Flink) use write-through — the streaming pipeline writes to Redis and the feature store simultaneously. TTL = 6 minutes (slightly longer than the recomputation window, so there’s always a valid value). This way, the feature store always has a value, it’s at most 5 minutes stale for real-time features, and the inference service reads at sub-10ms. For the absolute freshest state (current account balance) — no cache, direct DB read with Redis acting only as a fallback if the DB is unavailable, returning a slightly stale value rather than an error.”
Think About This
You’re in a Google interview. The prompt: “Design the caching layer for Google Ads analytics. Advertisers check their campaign performance dashboards hundreds of times per day. The underlying data warehouse processes billions of impressions and clicks daily. Current dashboard load time is 8 seconds. Target: < 500ms.”
Walk through:
-
Where is the 8-second latency coming from? (Likely scanning large unpartitioned or poorly partitioned BigQuery tables + no caching layer)
-
What’s the first optimization before adding Redis? (Partitioning by date + clustering on campaign_id in BigQuery — reduces scans by 100x for typical date-filtered queries)
-
What can be materialized? (Daily campaign performance metrics at daily grain, pre-aggregated by campaign + ad group + keyword — covers 90% of dashboard queries)
-
What needs Redis? (Real-time spend monitoring — advertisers check “how much have I spent today?” constantly — put this in Redis, updated every minute from Flink, TTL = 2 minutes)
-
What can’t be cached? (Custom ad-hoc date ranges, multi-campaign cross-account analysis — route these to BigQuery, set user expectation that these take 5-30 seconds)
The insight: you can hit the < 500ms target for 90% of queries with materialized views alone, without Redis. Redis is only needed for the remaining 10% that require real-time freshness. Propose the simplest solution first.
Quick Reference
-
Cache-Aside: Application manages cache. Best for read-heavy, tolerable-staleness use cases. Default for analytical caching.
-
Write-Through: Writes go to cache AND DB synchronously. Best for critical data where staleness is harmful (feature stores, account state).
-
Write-Behind: Write to cache first, async flush to DB. Best for write-heavy non-critical counters. Risk: data loss on cache failure.
-
TTL strategy: Match TTL to business tolerance for staleness. Add jitter to prevent thundering herd. Cache negative results to prevent cache penetration.
-
Hot key fixes: Key sharding (random suffix across N replicas) + in-process L0 cache for extremely hot read-only data.
-
Materialized views = warehouse-layer cache: Pre-compute expensive aggregations. BigQuery and Snowflake support automatic refresh with lag targets.
-
Pre-compute when: Query runs hundreds/day, scans > 1 TB, fixed metric/dimension pattern.
-
On-demand when: Ad-hoc queries, rarely run, flexible filters.
-
The layered answer: Redis (sub-100ms) → Materialized views (sub-5s) → Partitioned table (sub-30s) → Full scan (minutes). Each layer serves a different freshness/cost/flexibility point.
Tomorrow’s Preview
Day 29: Pipeline Observability & Monitoring — Metrics (freshness, completeness, volume), logging, alerting. Circuit breakers for pipelines. SLOs/SLIs for data platforms. Incident response for data outages. The operational maturity topic that distinguishes senior DEs who’ve run systems in production from those who’ve only built them.