Phase 2: Deep Dives | Category: System Design Practice
The Prompt
“Design the analytics data pipeline for a CDN like Netflix’s Open Connect or Meta’s CDN. The CDN serves billions of requests daily from thousands of edge nodes globally. The analytics system must provide real-time dashboards for operations teams (cache hit rate, bandwidth, error rates by edge node), historical reporting for content teams (most-served content, geographic distribution), and support anomaly detection for outages. Handle deduplication across edge nodes and late-arriving logs.”
Step 1: Requirements
Functional requirements
- Collect access logs from thousands of edge nodes worldwide
- Real-time operational dashboards (
< 2 minutesfreshness):- cache hit rate
- bandwidth
- error rates
- latency percentiles by edge node and region
- Historical analytics:
- content performance
- geographic distribution
- bandwidth trends
- Anomaly detection:
- edge node failures
- cache poisoning
- unusual traffic patterns
- Deduplicate logs (same request logged by multiple systems/nodes)
- Handle late-arriving logs from nodes with connectivity issues
Non-functional requirements
Scale (illustrative):
10B requests/day ≈ 115K req/sec average (higher at peak)
Log size ~500B/request
Freshness:
Ops dashboards: < 2 minutes
Historical: hourly refresh; daily reports ready by morning
Availability: 99.99% (outage detection is time-critical)
Completeness:
Real-time: best-effort (e.g., <1% loss acceptable)
Historical: near-perfect (batch closes the gaps)
Deduplication:
Each request has a unique request_id; may be logged by edge + origin
Scope:
Log collection, dedup, real-time processing, storage/serving layers
Not designing CDN routing itself
Step 2: Scale Estimation (and the key correction)
Naively streaming raw logs from every edge node is expensive.
If you truly had:
1,000 edge nodes × 10K req/sec × 500B ≈ 5 GB/sec raw logs
Per day: 5 GB/sec × 86,400 ≈ 432 TB/day raw
That pushes you toward a hybrid approach:
Stream 1 (real-time): edge pre-aggregated metrics every 10 seconds
Stream 2 (historical): raw logs buffered at edge, compressed, batch uploaded
This hybrid architecture is the main senior-level insight: real-time ops doesn’t need per-request logs.
Step 3: Architecture
EDGE NODES (global fleet)
Each edge node runs:
1) Log agent (Vector/Fluent Bit/Fluentd)
- tails access logs
- local buffer for batch upload (resilient to network flaps)
- emits pre-aggregated metrics every 10s
2) Pre-aggregation (every 10s emit):
{node_id, region, ts_10s,
request_count, cache_hits, cache_misses,
bytes_served, error_4xx, error_5xx,
p50_ms, p95_ms, p99_ms}
3) Raw log batch:
gzip/zstd compress and upload every ~5 minutes to object store:
s3://cdn-logs/raw/region=.../node=.../year=.../month=.../day=.../hour=.../minute=.../logs.gz
↓ (10s aggregates) ↓ (5m raw batches)
┌──────────────────────────────┐ ┌──────────────────────────────┐
│ KAFKA: cdn_metrics │ │ S3 (Bronze): raw logs │
│ - partition by node_id │ │ - compressed batch files │
│ - resilient ingestion │ │ - partitioned by region/node/time│
└───────────────┬──────────────┘ └───────────────┬──────────────┘
↓ ↓
┌──────────────────────────────┐ ┌──────────────────────────────┐
│ FLINK (real-time) │ │ SPARK (batch, hourly/daily) │
│ - 1m windows over aggregates │ │ - parse raw logs → Parquet │
│ - per-node + per-region KPIs │ │ - dedup by request_id │
│ - anomaly detection + alerts │ │ - join content catalog │
│ - output → real-time OLAP │ │ - write Silver/Gold tables │
└───────────────┬──────────────┘ └───────────────┬──────────────┘
↓ ↓
┌──────────────────────────────┐ ┌──────────────────────────────┐
│ ClickHouse (ops serving) │ │ Iceberg (Silver/Gold) │
│ - sub-second time-series OLAP │ │ - Silver: parsed deduped logs │
│ - ~30d retention │ │ - Gold: content/geo trends │
└──────────────────────────────┘ └──────────────────────────────┘
Step 4: Deduplication (the hard problem)
Common duplicate sources:
- Edge + origin both log the same request
- Log agent retries cause duplicate uploads of the same batch
- Failover causes a short interval to be logged by two nodes
Layer 1: Global request_id at ingestion
The CDN assigns a globally unique request_id at first edge contact.
log_entry = {
"request_id": "req-20260413-edge042-000000123",
"edge_node_id": "edge-042",
"origin_logged": False,
"timestamp_ms": 1744530000123,
"url": "/content/movie-001/segment-005.m4s",
"bytes_served": 524288,
"cache_hit": True,
"status_code": 200,
"latency_ms": 12,
"client_ip_hashed": "a8f2c...",
}
Layer 2: Real-time stream avoids per-request dedup entirely
Because the Kafka stream is pre-aggregated metrics (not raw events), the real-time path does not need per-request dedup.
Layer 3: Batch dedup in Silver (raw logs)
Dedup by request_id and keep the “best” record.
from pyspark.sql.functions import row_number, col
from pyspark.sql.window import Window
window = (
Window.partitionBy("request_id")
.orderBy(
col("edge_logged").desc(),
col("cache_hit").desc(),
col("timestamp_ms").asc(),
)
)
deduped_logs = (
raw_logs_df.withColumn("rank", row_number().over(window))
.filter("rank = 1")
.drop("rank")
)
deduped_logs.write.mode("overwrite").partitionBy("log_date", "region").format("iceberg").saveAsTable(
"silver.cdn_access_logs"
)
Expectation: duplicate rate should be very low in a mature CDN (often < 0.1%), but the system must handle retries/failover safely.
Step 5: Late-Arriving Logs
Edge nodes may upload hours late due to connectivity.
Real-time path:
- Watermarks + allowed lateness (e.g., accept up to 5 minutes late)
- Beyond that, route to side output / ignore for ops dashboards
- Ops UI shows per-node
last_heartbeattime
Batch path:
- Hourly Spark job processes whatever files exist in S3
- If a file arrives 4 hours late, it’s included in the next batch and in daily reports
- Daily reporting can use a 24-hour completeness buffer to capture stragglers
Step 6: What the Ops Dashboard Shows
The dashboard should be built from the pre-aggregated stream:
Per-region KPIs:
requests/sec
cache hit rate
bandwidth (Gbps)
error 4xx/5xx rate
p95/p99 latency
Per-node drilldown:
node health, last heartbeat time, node-level cache hit rate, error spikes
Active alerts:
- node unreachable (no heartbeat)
- 5xx spike
- cache hit rate collapse
- regional traffic anomaly
Step 7: Anomaly Detection
Three patterns:
- Edge node failure:
# Conceptual: detect sudden drop in requests + missing heartbeats
# Alert within ~30–60s of missing expected 10s windows
- Cache poisoning / cache flush anomaly:
# Signal: cache_hit_rate drops sharply while bandwidth remains high
# Often scoped to specific URL/content families in batch analysis
- Geographic traffic anomaly (DDoS or routing issue):
# Signal: one region spikes 5× while others remain flat
# Add supporting signals: IP diversity collapse, user-agent anomalies
Baseline approach:
Compute baseline = 7-day same-hour rolling mean and stddev
Alert if |current - baseline| > 3σ for consecutive windows
Step 8: Trade-offs to Articulate
Trade-off 1: Pre-aggregate at edge vs stream raw logs to Kafka
- Pre-aggregation reduces throughput by orders of magnitude and makes
<2 minutedashboards easy. - Trade-off: lose per-request debugging in real time; raw logs arrive with batch latency.
Trade-off 2: ClickHouse vs Druid vs BigQuery for real-time ops OLAP
- ClickHouse: great for low-latency time-series OLAP and straightforward operations.
- Druid: also strong but often higher operational complexity.
- BigQuery: excellent for historical analytics, not for sub-second ops dashboards at this tier.
Trade-off 3: 5-minute batch upload vs true raw-log streaming
- Batch: better compression, resiliency to network flaps, lower cost.
- Streaming: lower raw-log latency but significantly higher infra complexity/cost at edge.
Interview Questions
Q1: Viral content becomes 30% of global traffic. How do you detect and alert?
Model answer:
- Real-time: detect through aggregate streams (bandwidth spike + content-level top-N if included in pre-aggregation).
- Batch: confirm with raw logs; compute geo distribution; identify hot spots; drive cache pre-warm actions.
- Alert on: single content exceeding bandwidth share threshold for N windows and on total bandwidth anomaly.
Q2: How do you distinguish “log agent died” vs “node has no traffic”?
Model answer:
- Heartbeat semantics: every node emits an aggregate every 10 seconds even if request_count = 0.
- If no message for > 60 seconds → agent/network/node issue (alert).
- If message arrives with zeros → node alive, no traffic (no alert).
Self-Assessment: 5 Questions
- Why pre-aggregate at the edge instead of streaming raw logs to Kafka?
- What are three duplicate sources, and which layer addresses each?
- Why ClickHouse for real-time ops instead of BigQuery?
- How do heartbeats separate “no traffic” from “agent failure”?
- How do late logs get into historical reports without corrupting real-time KPIs?
Quick Reference
- Hybrid pipeline: edge aggregates for ops + raw batches for history.
- Dedup: request_id + batch dedup in Silver.
- Late logs: strict lateness in real-time; eventual correctness in batch.
- Serving: ClickHouse for ops; Iceberg/Trino (or warehouse) for history.
- Alerts: heartbeats, error spikes, cache hit drops, geo anomalies.
Tomorrow’s Preview
Day 60: Phase 2 Review & Self-Assessment — consolidation checkpoint before Phase 3 mock interviews.