Phase 2: Deep Dives | Category: System Design Practice

The Prompt

“Design the analytics data pipeline for a CDN like Netflix’s Open Connect or Meta’s CDN. The CDN serves billions of requests daily from thousands of edge nodes globally. The analytics system must provide real-time dashboards for operations teams (cache hit rate, bandwidth, error rates by edge node), historical reporting for content teams (most-served content, geographic distribution), and support anomaly detection for outages. Handle deduplication across edge nodes and late-arriving logs.”

Step 1: Requirements

Functional requirements

  • Collect access logs from thousands of edge nodes worldwide
  • Real-time operational dashboards (< 2 minutes freshness):
    • cache hit rate
    • bandwidth
    • error rates
    • latency percentiles by edge node and region
  • Historical analytics:
    • content performance
    • geographic distribution
    • bandwidth trends
  • Anomaly detection:
    • edge node failures
    • cache poisoning
    • unusual traffic patterns
  • Deduplicate logs (same request logged by multiple systems/nodes)
  • Handle late-arriving logs from nodes with connectivity issues

Non-functional requirements

Scale (illustrative):
  10B requests/day ≈ 115K req/sec average (higher at peak)
  Log size ~500B/request

Freshness:
  Ops dashboards: < 2 minutes
  Historical: hourly refresh; daily reports ready by morning

Availability: 99.99% (outage detection is time-critical)
Completeness:
  Real-time: best-effort (e.g., <1% loss acceptable)
  Historical: near-perfect (batch closes the gaps)

Deduplication:
  Each request has a unique request_id; may be logged by edge + origin

Scope:
  Log collection, dedup, real-time processing, storage/serving layers
  Not designing CDN routing itself

Step 2: Scale Estimation (and the key correction)

Naively streaming raw logs from every edge node is expensive.

If you truly had:
  1,000 edge nodes × 10K req/sec × 500B ≈ 5 GB/sec raw logs
  Per day: 5 GB/sec × 86,400 ≈ 432 TB/day raw

That pushes you toward a hybrid approach:
  Stream 1 (real-time): edge pre-aggregated metrics every 10 seconds
  Stream 2 (historical): raw logs buffered at edge, compressed, batch uploaded

This hybrid architecture is the main senior-level insight: real-time ops doesn’t need per-request logs.

Step 3: Architecture

EDGE NODES (global fleet)
  Each edge node runs:
    1) Log agent (Vector/Fluent Bit/Fluentd)
       - tails access logs
       - local buffer for batch upload (resilient to network flaps)
       - emits pre-aggregated metrics every 10s

    2) Pre-aggregation (every 10s emit):
       {node_id, region, ts_10s,
        request_count, cache_hits, cache_misses,
        bytes_served, error_4xx, error_5xx,
        p50_ms, p95_ms, p99_ms}

    3) Raw log batch:
       gzip/zstd compress and upload every ~5 minutes to object store:
       s3://cdn-logs/raw/region=.../node=.../year=.../month=.../day=.../hour=.../minute=.../logs.gz

          ↓ (10s aggregates)                           ↓ (5m raw batches)
┌──────────────────────────────┐              ┌──────────────────────────────┐
│ KAFKA: cdn_metrics            │              │ S3 (Bronze): raw logs         │
│ - partition by node_id        │              │ - compressed batch files      │
│ - resilient ingestion         │              │ - partitioned by region/node/time│
└───────────────┬──────────────┘              └───────────────┬──────────────┘
                ↓                                             ↓
┌──────────────────────────────┐              ┌──────────────────────────────┐
│ FLINK (real-time)             │              │ SPARK (batch, hourly/daily)  │
│ - 1m windows over aggregates  │              │ - parse raw logs → Parquet    │
│ - per-node + per-region KPIs  │              │ - dedup by request_id         │
│ - anomaly detection + alerts  │              │ - join content catalog        │
│ - output → real-time OLAP     │              │ - write Silver/Gold tables    │
└───────────────┬──────────────┘              └───────────────┬──────────────┘
                ↓                                             ↓
┌──────────────────────────────┐              ┌──────────────────────────────┐
│ ClickHouse (ops serving)      │              │ Iceberg (Silver/Gold)        │
│ - sub-second time-series OLAP │              │ - Silver: parsed deduped logs │
│ - ~30d retention              │              │ - Gold: content/geo trends    │
└──────────────────────────────┘              └──────────────────────────────┘

Step 4: Deduplication (the hard problem)

Common duplicate sources:

  • Edge + origin both log the same request
  • Log agent retries cause duplicate uploads of the same batch
  • Failover causes a short interval to be logged by two nodes

Layer 1: Global request_id at ingestion

The CDN assigns a globally unique request_id at first edge contact.

log_entry = {
    "request_id": "req-20260413-edge042-000000123",
    "edge_node_id": "edge-042",
    "origin_logged": False,
    "timestamp_ms": 1744530000123,
    "url": "/content/movie-001/segment-005.m4s",
    "bytes_served": 524288,
    "cache_hit": True,
    "status_code": 200,
    "latency_ms": 12,
    "client_ip_hashed": "a8f2c...",
}

Layer 2: Real-time stream avoids per-request dedup entirely

Because the Kafka stream is pre-aggregated metrics (not raw events), the real-time path does not need per-request dedup.

Layer 3: Batch dedup in Silver (raw logs)

Dedup by request_id and keep the “best” record.

from pyspark.sql.functions import row_number, col
from pyspark.sql.window import Window

window = (
    Window.partitionBy("request_id")
    .orderBy(
        col("edge_logged").desc(),
        col("cache_hit").desc(),
        col("timestamp_ms").asc(),
    )
)

deduped_logs = (
    raw_logs_df.withColumn("rank", row_number().over(window))
    .filter("rank = 1")
    .drop("rank")
)

deduped_logs.write.mode("overwrite").partitionBy("log_date", "region").format("iceberg").saveAsTable(
    "silver.cdn_access_logs"
)

Expectation: duplicate rate should be very low in a mature CDN (often < 0.1%), but the system must handle retries/failover safely.

Step 5: Late-Arriving Logs

Edge nodes may upload hours late due to connectivity.

Real-time path:

  • Watermarks + allowed lateness (e.g., accept up to 5 minutes late)
  • Beyond that, route to side output / ignore for ops dashboards
  • Ops UI shows per-node last_heartbeat time

Batch path:

  • Hourly Spark job processes whatever files exist in S3
  • If a file arrives 4 hours late, it’s included in the next batch and in daily reports
  • Daily reporting can use a 24-hour completeness buffer to capture stragglers

Step 6: What the Ops Dashboard Shows

The dashboard should be built from the pre-aggregated stream:

Per-region KPIs:
  requests/sec
  cache hit rate
  bandwidth (Gbps)
  error 4xx/5xx rate
  p95/p99 latency

Per-node drilldown:
  node health, last heartbeat time, node-level cache hit rate, error spikes

Active alerts:
  - node unreachable (no heartbeat)
  - 5xx spike
  - cache hit rate collapse
  - regional traffic anomaly

Step 7: Anomaly Detection

Three patterns:

  1. Edge node failure:
# Conceptual: detect sudden drop in requests + missing heartbeats
# Alert within ~30–60s of missing expected 10s windows
  1. Cache poisoning / cache flush anomaly:
# Signal: cache_hit_rate drops sharply while bandwidth remains high
# Often scoped to specific URL/content families in batch analysis
  1. Geographic traffic anomaly (DDoS or routing issue):
# Signal: one region spikes 5× while others remain flat
# Add supporting signals: IP diversity collapse, user-agent anomalies

Baseline approach:

Compute baseline = 7-day same-hour rolling mean and stddev
Alert if |current - baseline| > 3σ for consecutive windows

Step 8: Trade-offs to Articulate

Trade-off 1: Pre-aggregate at edge vs stream raw logs to Kafka

  • Pre-aggregation reduces throughput by orders of magnitude and makes <2 minute dashboards easy.
  • Trade-off: lose per-request debugging in real time; raw logs arrive with batch latency.

Trade-off 2: ClickHouse vs Druid vs BigQuery for real-time ops OLAP

  • ClickHouse: great for low-latency time-series OLAP and straightforward operations.
  • Druid: also strong but often higher operational complexity.
  • BigQuery: excellent for historical analytics, not for sub-second ops dashboards at this tier.

Trade-off 3: 5-minute batch upload vs true raw-log streaming

  • Batch: better compression, resiliency to network flaps, lower cost.
  • Streaming: lower raw-log latency but significantly higher infra complexity/cost at edge.

Interview Questions

Q1: Viral content becomes 30% of global traffic. How do you detect and alert?

Model answer:

  • Real-time: detect through aggregate streams (bandwidth spike + content-level top-N if included in pre-aggregation).
  • Batch: confirm with raw logs; compute geo distribution; identify hot spots; drive cache pre-warm actions.
  • Alert on: single content exceeding bandwidth share threshold for N windows and on total bandwidth anomaly.

Q2: How do you distinguish “log agent died” vs “node has no traffic”?

Model answer:

  • Heartbeat semantics: every node emits an aggregate every 10 seconds even if request_count = 0.
  • If no message for > 60 seconds → agent/network/node issue (alert).
  • If message arrives with zeros → node alive, no traffic (no alert).

Self-Assessment: 5 Questions

  • Why pre-aggregate at the edge instead of streaming raw logs to Kafka?
  • What are three duplicate sources, and which layer addresses each?
  • Why ClickHouse for real-time ops instead of BigQuery?
  • How do heartbeats separate “no traffic” from “agent failure”?
  • How do late logs get into historical reports without corrupting real-time KPIs?

Quick Reference

  • Hybrid pipeline: edge aggregates for ops + raw batches for history.
  • Dedup: request_id + batch dedup in Silver.
  • Late logs: strict lateness in real-time; eventual correctness in batch.
  • Serving: ClickHouse for ops; Iceberg/Trino (or warehouse) for history.
  • Alerts: heartbeats, error spikes, cache hit drops, geo anomalies.

Tomorrow’s Preview

Day 60: Phase 2 Review & Self-Assessment — consolidation checkpoint before Phase 3 mock interviews.