Phase 1: Foundations & Frameworks | Category: System Design Practice

What This Day Is

This is your first full end-to-end system design session. Every concept from Days 1-23 converges here. The goal is to practice drawing a complete design under interview conditions — requirements, scale, architecture, deep dives, and trade-offs — all in 45 minutes.

Set a timer. Work through each section before reading the answer. This is the single most effective interview prep exercise.

The Prompt

“Design a real-time analytics dashboard for a product like Netflix or Meta. The dashboard shows business metrics (DAU, session duration, content performance, engagement rate) and refreshes every 10-30 seconds. The system must handle millions of user events per second.”

Step 1: Requirements Gathering (5 minutes)

Your questions before drawing anything:

Functional:

  • Who are the consumers? (Product managers, executives, data scientists, or all of the above?)

  • What metrics specifically? (DAU, MAU, session duration, click-through rate, content completion?)

  • What dimensions? (By country, device type, content category, cohort?)

  • What time ranges? (Last 1 hour, 24 hours, 7 days, 30 days?)

  • Is this internal-only or also customer-facing? (Complexity scales with external SLAs)

  • Do we need drill-down capability (summary → detail)?

Non-functional:

  • Freshness SLA: 10-30 second refresh → near-real-time (rules out pure batch)

  • Scale: millions of events/sec → distributed streaming mandatory

  • Read concurrency: how many analysts query simultaneously?

  • Historical retention: how far back must we serve? (Impacts storage tier design)

  • Availability: 99.9% for internal? 99.99% for customer-facing?

  • Write assumptions per DataVidhya: “If the problem says near real-time, ask what that means. For some teams it means 1 second; for others it means 15 minutes.”

State your assumptions explicitly:

I'll design for:
  - 2M events/sec peak ingestion (e.g., Netflix scale at prime time)
  - 30-second dashboard refresh SLA
  - Metrics: DAU, session duration, content views, engagement rate
  - Dimensions: by country, content category, device type
  - Historical view: up to 90 days
  - 500 concurrent dashboard users, 50 power users running ad-hoc queries
  - Internal team (not customer-facing)

Step 2: Scale Estimation (3 minutes)

Ingestion:

  • 2M events/sec × 500 bytes avg event size = 1 GB/sec = 86 TB/day raw

  • Compressed Parquet (6x): ~14 TB/day to store

  • Annual storage (90-day hot + remainder warm): ~1.3 PB/year

Kafka sizing:

  • 2M events/sec → need ~20-50 partitions (100K events/sec/partition is safe)

  • Replication factor 3 → 3 GB/sec total write throughput

  • Standard Kafka cluster: 5-10 brokers handles this comfortably

Serving layer reads:

  • 500 analysts × 10 dashboard refreshes/minute = 5,000 reads/min = ~83 QPS

  • 50 power users running complex ad-hoc queries: higher compute, lower QPS

  • This is moderate — an in-memory OLAP or well-partitioned warehouse handles it

Aggregation compute:

  • Pre-aggregate to 1-min, 5-min, 30-min, 1-hour granularities

  • Reduces serving-layer query to key-value lookups vs. real-time aggregation

**Architecture implication from numbers:**2M events/sec requires streaming (not batch). 30-second SLA and 83 read QPS means we can pre-aggregate rather than compute on-demand. Historical 90-day range rules out in-memory-only solutions.

Step 3: High-Level Architecture (10-12 minutes)

Draw this left-to-right on the whiteboard:

┌─────────────────┐
│   Event Sources  │  Apps, web, mobile, CDN logs
└────────┬────────┘
         ↓ HTTP/SDK
┌─────────────────┐
│  Ingestion Tier  │  Event collectors (nginx, custom SDK)
│                 │  Schema validation, PII masking
└────────┬────────┘
         ↓ Avro + Schema Registry
┌─────────────────────────────────────────┐
│            Apache Kafka                  │
│  topic: user-events (50 partitions)      │
│  topic: content-views (50 partitions)    │
│  Retention: 7 days for replay            │
└──────────┬──────────────┬───────────────┘
           ↓              ↓
┌──────────────┐  ┌──────────────────────┐
│ Flink Cluster│  │  S3 / GCS (Bronze)   │
│(Stream Proc.)│  │  Raw Parquet + Zstd  │
│ - Windowing  │  │  (7-day → 90-day     │
│ - Dedup      │  │   tiered storage)    │
│ - Enrich     │  └──────────┬───────────┘
└──────┬───────┘             │
       ↓ (hourly batch)      ↓ (pre-aggregated)
       Spark/dbt → Silver → Gold

┌─────────────────┐   ┌─────────────────────────┐
│ Real-Time OLAP  │   │  Data Warehouse          │
│ (ClickHouse /   │   │  (BigQuery/Snowflake)    │
│  Apache Pinot)  │   │  - Historical analysis   │
│ - 10-sec SLA    │   │  - Ad-hoc SQL queries    │
│ - 1-min windows │   │  - 90-day retention      │
└──────┬──────────┘   └──────────┬──────────────┘
       └──────────────┬──────────┘

               ┌────────────────┐
               │  Dashboard API  │
               │  + WebSocket   │
               │  (push updates) │
               └────────────────┘

               ┌────────────────┐
               │   Dashboard    │
               │  (Superset /   │
               │   Looker /     │
               │   Custom UI)   │
               └────────────────┘

Step 4: Component Deep Dives (12-15 minutes)

Ingestion Tier

Client-side event batching: Don’t send one HTTP request per event — batch events on the client (100ms or 50-event batch, whichever comes first). Reduces connection overhead dramatically.

PII masking at ingestion: Before events hit Kafka, strip/tokenize PII fields (user emails, IP addresses). Once in Kafka, data is clean for all downstream consumers.

Schema validation: Events that fail schema validation go to a dead letter queue, not to the main topic. Alert on DLQ volume — sudden spike means a client shipped a breaking change.

What Flink does here:

  1. Deduplication: Users may send duplicate events (retry logic, network issues). Deduplicate by event_id within a 1-hour stateful window using Flink’s keyed state.

  2. Enrichment: Join event stream with a broadcast stream containing user dimension data (country, device, subscription tier). Broadcast state keeps dimension data in memory.

  3. Windowed aggregation: Compute metrics over tumbling 1-minute windows per dimension combination.

Flink windowed aggregation:

Tumbling 1-minute window, keyed by (country, device_type, content_category):
    - event_count: COUNT(*)
    - unique_users: COUNT(DISTINCT user_id)   // use HyperLogLog sketch for scale
    - total_duration: SUM(session_duration_sec)
    - completion_rate: SUM(completed) / COUNT(*)

Output: Pre-aggregated records written to ClickHouse every minute. Dashboard queries read from ClickHouse — no real-time aggregation at query time.

Real-Time OLAP: ClickHouse vs Pinot vs Druid

Per Ksolves comparison:

MetricApache PinotApache DruidClick House
Latency~10-100ms (ultra-low)~100-300msSub-second to a few seconds
IngestionReal-time Kafka nativeReal-time Kafka/KinesisBatch-first, Kafka via connector
Best forUser-facing analytics, Linked In-styleTime-series explorationComplex analytics, log analysis
Used atLinked In, UberTwitter, NetflixCloudflare, Contentsquare
Query languagePinot SQL (slightly limited)Druid SQLFull ANSI SQL

For this use case (30-second dashboard SLA, pre-aggregated data):

  • ClickHouse is the recommended choice — full SQL, excellent compression, sub-second queries on pre-aggregated tables. If user-facing (< 10ms) were required, choose Pinot.

ClickHouse schema for pre-aggregated metrics:

CREATE TABLE dashboard_metrics (
    window_start     DateTime,
    country          String,
    device_type      LowCardinality(String),
    content_category LowCardinality(String),
    event_count      UInt64,
    unique_users_hll AggregateFunction(uniqHLL12, String),  -- HyperLogLog
    total_duration   Float64,
    completion_rate  Float32
) ENGINE = SummingMergeTree()
PARTITION BY toYYYYMMDD(window_start)
ORDER BY (country, device_type, content_category, window_start);

SummingMergeTree automatically sums numeric columns when merging duplicates — perfect for pre-aggregated metrics that may arrive multiple times.

Serving / API Layer

WebSocket push for 30-second refresh (more efficient than polling):

Dashboard opens WebSocket connection
Server pushes updated metric snapshot every 30 seconds
Dashboard re-renders only changed values

REST fallback: For clients that can’t maintain WebSocket connections. Dashboard polls every 30 seconds via HTTP.

Caching layer (Redis): Cache the latest metric snapshot (last 30 seconds of pre-aggregated data). Each dashboard refresh reads from Redis (~1ms) instead of ClickHouse (~50ms). TTL = 30 seconds. Cache invalidated when Flink publishes new aggregates.

For ad-hoc queries: Route to warehouse (BigQuery/Snowflake) instead of ClickHouse. Analysts can run complex joins on 90 days of history without impacting dashboard performance.

Step 5: Trade-offs & Design Decisions (8-10 minutes)

Trade-off 1: Pre-aggregation vs on-demand aggregation

  • Pre-aggregation (our choice): Flink computes metrics every minute. Dashboard queries are fast key lookups. Limitation: dashboard can only show pre-defined metrics and dimensions. Adding a new dimension requires a pipeline change.

  • On-demand aggregation: Every dashboard query runs against raw events. Fully flexible but requires a very fast OLAP engine (Pinot) and can be expensive at 2M events/sec scale.

  • Our choice: Pre-aggregate known KPIs in Flink → ClickHouse. Route exploratory/ad-hoc to warehouse. This serves 80% of use cases with 10x better performance.

Trade-off 2: Late-arriving data

Events from mobile may arrive up to 60 seconds late (offline mode, network issues). For a 30-second dashboard:

  • Business dashboard option: Late events roll into the next window (accept slight inaccuracy for simplicity)

  • Financial/accuracy-critical option: Flink allowed lateness of 60 seconds + Flink retractions to update ClickHouse when late events change a window’s result

“What does the business actually need? If this is an executive engagement dashboard, 1-2% inaccuracy due to late mobile events is acceptable. If this is ad billing, we need corrections.”

Trade-off 3: Exactly-once vs at-least-once

  • At-least-once + deduplication (our choice): Simpler and sufficient for analytics metrics. Duplicate events are deduplicated in Flink using event_id within a 1-hour window. Minor risk: if Flink state is lost and recovered from checkpoint, a small window of events may be reprocessed — but idempotent MERGE into ClickHouse handles this correctly.

  • Exactly-once: More complex (Flink 2PC + ClickHouse transactional sink). Reserve for billing pipelines.

Trade-off 4: HyperLogLog for unique users

Counting exact DAU at 2M events/sec requires tracking billions of user IDs in memory — expensive. HyperLogLog (HLL) provides ~2% error for COUNT DISTINCT using ~12KB per dimension combination. For a dashboard showing “4.2M DAU,” showing “4.18M - 4.22M DAU” is indistinguishable. Use HLL.

Failure Mode Analysis

FailureImpactMitigation
Kafka broker goes downIngestion pauses for affected partitionsRF=3, min.insync.replicas=2. Auto leader election restores in < 30 sec.
Flink job crashesDashboard falls behind by checkpoint intervalFlink recovers from last checkpoint (every 30 sec). Max lag = 30 sec + recovery time.
ClickHouse node failsRead queries slow downReplication factor 2. Queries fail over to replica.
Late data spike (mobile users reconnect)Window metrics incorrect brieflyAllowed lateness + retractions OR alert consumers “metrics may be slightly behind”
Source schema changeFlink deserialization failsSchema Registry blocks incompatible changes. DLQ for any that get through. Alert fires.
Dashboard API overloadSlow dashboard rendersRedis cache absorbs burst. Rate limit per user. Load balancer + horizontal scaling.

Monitoring: The Operational Layer

Kafka consumer lag (Flink consumers) → alert if lag > 100K events (growing)
Flink checkpoint duration → alert if > 10 sec (may indicate GC or resource pressure)
ClickHouse query p99 latency → alert if > 500ms (dashboard SLA impact)
Dashboard freshness → measure time from event generation to dashboard render
    → alert if > 60 seconds (2x the 30-sec SLA = clear violation)
Event volume anomaly → alert if events/sec drops > 30% from 1-hour baseline
    → may indicate client-side tracking bug or ingestion failure

Interview Q&A — Anticipate These Probes

“What if we need sub-second dashboard updates instead of 30 seconds?”

Switch from pre-aggregation to event-time streaming with Pinot’s real-time ingestion. Pinot consumes directly from Kafka, segments are queryable within seconds of ingestion. Dashboard polls every 1 second. Cost: ~3-5x infrastructure cost vs. 30-second ClickHouse design. “Is sub-second refresh worth that cost? What business decision requires freshness under one second that 30 seconds doesn’t serve?”

“How do you add a new metric (e.g., cart abandonment rate) to the dashboard?”

  1. Ensure the relevant events exist in Kafka (cart_viewed, cart_abandoned events)

  2. Add a new Flink aggregation job or extend the existing one for the new metric

  3. Add a new column to the ClickHouse schema

  4. Add the metric to the dashboard query

  5. Backfill from S3 bronze layer via Spark if historical data needed

The bronze layer on S3 is the safety net — it preserves all raw events, enabling backfill of any new metric.

“How would this design change for Meta at 10x the scale (20M events/sec)?”

  • Kafka: 10x more partitions (500-1000), multi-cluster with Kafka MirrorMaker for cross-region

  • Flink: More parallelism (10x task managers), careful state management to avoid memory pressure

  • ClickHouse: Multi-node cluster (10-20 nodes), sharded by dimension key

  • Ingestion: Edge pre-aggregation — reduce to 10x lower volume before hitting central Kafka

  • Same architecture fundamentally — just scaled horizontally

Think About This (Self-Assessment)

After reading this design, answer these questions without looking back:

  1. Why did we choose pre-aggregation over on-demand aggregation? What’s the limitation?

  2. Why is HyperLogLog used for DAU? What’s the error rate?

  3. What’s in the bronze layer and why is it critical?

  4. How does the serving layer handle 500 concurrent dashboard users at 30-second refresh?

  5. What happens if Flink crashes? How long is the dashboard stale?

If you can answer all 5 fluently in 60 seconds each, this design is yours.

Quick Reference

  • The architecture stack: SDK → Kafka → Flink (dedup + enrich + window) → ClickHouse (pre-aggregated) + S3/Iceberg (raw) + Warehouse (historical ad-hoc) → Dashboard API → WebSocket push

  • Key sizing numbers: 2M events/sec = ~50 Kafka partitions, 1 GB/sec raw, ~14 TB/day compressed

  • Real-time OLAP choice: ClickHouse for sub-second dashboards with full SQL. Pinot when < 100ms or user-facing. Druid for time-series exploration.

  • Critical design choices: Pre-aggregate in Flink (not on-demand) → HLL for COUNT DISTINCT → Redis cache in front of ClickHouse → WebSocket push vs polling → at-least-once + deduplication

  • The bronze layer is always your safety net — all raw events on S3, any future metric can be backfilled

Tomorrow’s Preview

Day 25: Design: Event Logging & Telemetry System — Building for billions of events per day. Schema design for events. Ingestion at scale (Kafka/Kinesis), processing and storage (columnar), querying. Handling late data and deduplication. This is the foundational infrastructure problem at every FAANG company.