Phase 1: Foundations & Frameworks | Category: System Design Practice
What This Day Is
This is your first full end-to-end system design session. Every concept from Days 1-23 converges here. The goal is to practice drawing a complete design under interview conditions — requirements, scale, architecture, deep dives, and trade-offs — all in 45 minutes.
Set a timer. Work through each section before reading the answer. This is the single most effective interview prep exercise.
The Prompt
“Design a real-time analytics dashboard for a product like Netflix or Meta. The dashboard shows business metrics (DAU, session duration, content performance, engagement rate) and refreshes every 10-30 seconds. The system must handle millions of user events per second.”
Step 1: Requirements Gathering (5 minutes)
Your questions before drawing anything:
Functional:
-
Who are the consumers? (Product managers, executives, data scientists, or all of the above?)
-
What metrics specifically? (DAU, MAU, session duration, click-through rate, content completion?)
-
What dimensions? (By country, device type, content category, cohort?)
-
What time ranges? (Last 1 hour, 24 hours, 7 days, 30 days?)
-
Is this internal-only or also customer-facing? (Complexity scales with external SLAs)
-
Do we need drill-down capability (summary → detail)?
Non-functional:
-
Freshness SLA: 10-30 second refresh → near-real-time (rules out pure batch)
-
Scale: millions of events/sec → distributed streaming mandatory
-
Read concurrency: how many analysts query simultaneously?
-
Historical retention: how far back must we serve? (Impacts storage tier design)
-
Availability: 99.9% for internal? 99.99% for customer-facing?
-
Write assumptions per DataVidhya: “If the problem says near real-time, ask what that means. For some teams it means 1 second; for others it means 15 minutes.”
State your assumptions explicitly:
I'll design for:
- 2M events/sec peak ingestion (e.g., Netflix scale at prime time)
- 30-second dashboard refresh SLA
- Metrics: DAU, session duration, content views, engagement rate
- Dimensions: by country, content category, device type
- Historical view: up to 90 days
- 500 concurrent dashboard users, 50 power users running ad-hoc queries
- Internal team (not customer-facing)
Step 2: Scale Estimation (3 minutes)
Ingestion:
-
2M events/sec × 500 bytes avg event size = 1 GB/sec = 86 TB/day raw
-
Compressed Parquet (6x): ~14 TB/day to store
-
Annual storage (90-day hot + remainder warm): ~1.3 PB/year
Kafka sizing:
-
2M events/sec → need ~20-50 partitions (100K events/sec/partition is safe)
-
Replication factor 3 → 3 GB/sec total write throughput
-
Standard Kafka cluster: 5-10 brokers handles this comfortably
Serving layer reads:
-
500 analysts × 10 dashboard refreshes/minute = 5,000 reads/min = ~83 QPS
-
50 power users running complex ad-hoc queries: higher compute, lower QPS
-
This is moderate — an in-memory OLAP or well-partitioned warehouse handles it
Aggregation compute:
-
Pre-aggregate to 1-min, 5-min, 30-min, 1-hour granularities
-
Reduces serving-layer query to key-value lookups vs. real-time aggregation
**Architecture implication from numbers:**2M events/sec requires streaming (not batch). 30-second SLA and 83 read QPS means we can pre-aggregate rather than compute on-demand. Historical 90-day range rules out in-memory-only solutions.
Step 3: High-Level Architecture (10-12 minutes)
Draw this left-to-right on the whiteboard:
┌─────────────────┐
│ Event Sources │ Apps, web, mobile, CDN logs
└────────┬────────┘
↓ HTTP/SDK
┌─────────────────┐
│ Ingestion Tier │ Event collectors (nginx, custom SDK)
│ │ Schema validation, PII masking
└────────┬────────┘
↓ Avro + Schema Registry
┌─────────────────────────────────────────┐
│ Apache Kafka │
│ topic: user-events (50 partitions) │
│ topic: content-views (50 partitions) │
│ Retention: 7 days for replay │
└──────────┬──────────────┬───────────────┘
↓ ↓
┌──────────────┐ ┌──────────────────────┐
│ Flink Cluster│ │ S3 / GCS (Bronze) │
│(Stream Proc.)│ │ Raw Parquet + Zstd │
│ - Windowing │ │ (7-day → 90-day │
│ - Dedup │ │ tiered storage) │
│ - Enrich │ └──────────┬───────────┘
└──────┬───────┘ │
↓ (hourly batch) ↓ (pre-aggregated)
Spark/dbt → Silver → Gold
┌─────────────────┐ ┌─────────────────────────┐
│ Real-Time OLAP │ │ Data Warehouse │
│ (ClickHouse / │ │ (BigQuery/Snowflake) │
│ Apache Pinot) │ │ - Historical analysis │
│ - 10-sec SLA │ │ - Ad-hoc SQL queries │
│ - 1-min windows │ │ - 90-day retention │
└──────┬──────────┘ └──────────┬──────────────┘
└──────────────┬──────────┘
↓
┌────────────────┐
│ Dashboard API │
│ + WebSocket │
│ (push updates) │
└────────────────┘
↓
┌────────────────┐
│ Dashboard │
│ (Superset / │
│ Looker / │
│ Custom UI) │
└────────────────┘
Step 4: Component Deep Dives (12-15 minutes)
Ingestion Tier
Client-side event batching: Don’t send one HTTP request per event — batch events on the client (100ms or 50-event batch, whichever comes first). Reduces connection overhead dramatically.
PII masking at ingestion: Before events hit Kafka, strip/tokenize PII fields (user emails, IP addresses). Once in Kafka, data is clean for all downstream consumers.
Schema validation: Events that fail schema validation go to a dead letter queue, not to the main topic. Alert on DLQ volume — sudden spike means a client shipped a breaking change.
Flink Stream Processing
What Flink does here:
-
Deduplication: Users may send duplicate events (retry logic, network issues). Deduplicate by event_id within a 1-hour stateful window using Flink’s keyed state.
-
Enrichment: Join event stream with a broadcast stream containing user dimension data (country, device, subscription tier). Broadcast state keeps dimension data in memory.
-
Windowed aggregation: Compute metrics over tumbling 1-minute windows per dimension combination.
Flink windowed aggregation:
Tumbling 1-minute window, keyed by (country, device_type, content_category):
- event_count: COUNT(*)
- unique_users: COUNT(DISTINCT user_id) // use HyperLogLog sketch for scale
- total_duration: SUM(session_duration_sec)
- completion_rate: SUM(completed) / COUNT(*)
Output: Pre-aggregated records written to ClickHouse every minute. Dashboard queries read from ClickHouse — no real-time aggregation at query time.
Real-Time OLAP: ClickHouse vs Pinot vs Druid
Per Ksolves comparison:
| Metric | Apache Pinot | Apache Druid | Click House |
|---|---|---|---|
| Latency | ~10-100ms (ultra-low) | ~100-300ms | Sub-second to a few seconds |
| Ingestion | Real-time Kafka native | Real-time Kafka/Kinesis | Batch-first, Kafka via connector |
| Best for | User-facing analytics, Linked In-style | Time-series exploration | Complex analytics, log analysis |
| Used at | Linked In, Uber | Twitter, Netflix | Cloudflare, Contentsquare |
| Query language | Pinot SQL (slightly limited) | Druid SQL | Full ANSI SQL |
For this use case (30-second dashboard SLA, pre-aggregated data):
- ClickHouse is the recommended choice — full SQL, excellent compression, sub-second queries on pre-aggregated tables. If user-facing (< 10ms) were required, choose Pinot.
ClickHouse schema for pre-aggregated metrics:
CREATE TABLE dashboard_metrics (
window_start DateTime,
country String,
device_type LowCardinality(String),
content_category LowCardinality(String),
event_count UInt64,
unique_users_hll AggregateFunction(uniqHLL12, String), -- HyperLogLog
total_duration Float64,
completion_rate Float32
) ENGINE = SummingMergeTree()
PARTITION BY toYYYYMMDD(window_start)
ORDER BY (country, device_type, content_category, window_start);
SummingMergeTree automatically sums numeric columns when merging duplicates — perfect for pre-aggregated metrics that may arrive multiple times.
Serving / API Layer
WebSocket push for 30-second refresh (more efficient than polling):
Dashboard opens WebSocket connection
Server pushes updated metric snapshot every 30 seconds
Dashboard re-renders only changed values
REST fallback: For clients that can’t maintain WebSocket connections. Dashboard polls every 30 seconds via HTTP.
Caching layer (Redis): Cache the latest metric snapshot (last 30 seconds of pre-aggregated data). Each dashboard refresh reads from Redis (~1ms) instead of ClickHouse (~50ms). TTL = 30 seconds. Cache invalidated when Flink publishes new aggregates.
For ad-hoc queries: Route to warehouse (BigQuery/Snowflake) instead of ClickHouse. Analysts can run complex joins on 90 days of history without impacting dashboard performance.
Step 5: Trade-offs & Design Decisions (8-10 minutes)
Trade-off 1: Pre-aggregation vs on-demand aggregation
-
Pre-aggregation (our choice): Flink computes metrics every minute. Dashboard queries are fast key lookups. Limitation: dashboard can only show pre-defined metrics and dimensions. Adding a new dimension requires a pipeline change.
-
On-demand aggregation: Every dashboard query runs against raw events. Fully flexible but requires a very fast OLAP engine (Pinot) and can be expensive at 2M events/sec scale.
-
Our choice: Pre-aggregate known KPIs in Flink → ClickHouse. Route exploratory/ad-hoc to warehouse. This serves 80% of use cases with 10x better performance.
Trade-off 2: Late-arriving data
Events from mobile may arrive up to 60 seconds late (offline mode, network issues). For a 30-second dashboard:
-
Business dashboard option: Late events roll into the next window (accept slight inaccuracy for simplicity)
-
Financial/accuracy-critical option: Flink allowed lateness of 60 seconds + Flink retractions to update ClickHouse when late events change a window’s result
“What does the business actually need? If this is an executive engagement dashboard, 1-2% inaccuracy due to late mobile events is acceptable. If this is ad billing, we need corrections.”
Trade-off 3: Exactly-once vs at-least-once
-
At-least-once + deduplication (our choice): Simpler and sufficient for analytics metrics. Duplicate events are deduplicated in Flink using event_id within a 1-hour window. Minor risk: if Flink state is lost and recovered from checkpoint, a small window of events may be reprocessed — but idempotent MERGE into ClickHouse handles this correctly.
-
Exactly-once: More complex (Flink 2PC + ClickHouse transactional sink). Reserve for billing pipelines.
Trade-off 4: HyperLogLog for unique users
Counting exact DAU at 2M events/sec requires tracking billions of user IDs in memory — expensive. HyperLogLog (HLL) provides ~2% error for COUNT DISTINCT using ~12KB per dimension combination. For a dashboard showing “4.2M DAU,” showing “4.18M - 4.22M DAU” is indistinguishable. Use HLL.
Failure Mode Analysis
| Failure | Impact | Mitigation |
|---|---|---|
| Kafka broker goes down | Ingestion pauses for affected partitions | RF=3, min.insync.replicas=2. Auto leader election restores in < 30 sec. |
| Flink job crashes | Dashboard falls behind by checkpoint interval | Flink recovers from last checkpoint (every 30 sec). Max lag = 30 sec + recovery time. |
| ClickHouse node fails | Read queries slow down | Replication factor 2. Queries fail over to replica. |
| Late data spike (mobile users reconnect) | Window metrics incorrect briefly | Allowed lateness + retractions OR alert consumers “metrics may be slightly behind” |
| Source schema change | Flink deserialization fails | Schema Registry blocks incompatible changes. DLQ for any that get through. Alert fires. |
| Dashboard API overload | Slow dashboard renders | Redis cache absorbs burst. Rate limit per user. Load balancer + horizontal scaling. |
Monitoring: The Operational Layer
Kafka consumer lag (Flink consumers) → alert if lag > 100K events (growing)
Flink checkpoint duration → alert if > 10 sec (may indicate GC or resource pressure)
ClickHouse query p99 latency → alert if > 500ms (dashboard SLA impact)
Dashboard freshness → measure time from event generation to dashboard render
→ alert if > 60 seconds (2x the 30-sec SLA = clear violation)
Event volume anomaly → alert if events/sec drops > 30% from 1-hour baseline
→ may indicate client-side tracking bug or ingestion failure
Interview Q&A — Anticipate These Probes
“What if we need sub-second dashboard updates instead of 30 seconds?”
Switch from pre-aggregation to event-time streaming with Pinot’s real-time ingestion. Pinot consumes directly from Kafka, segments are queryable within seconds of ingestion. Dashboard polls every 1 second. Cost: ~3-5x infrastructure cost vs. 30-second ClickHouse design. “Is sub-second refresh worth that cost? What business decision requires freshness under one second that 30 seconds doesn’t serve?”
“How do you add a new metric (e.g., cart abandonment rate) to the dashboard?”
-
Ensure the relevant events exist in Kafka (cart_viewed, cart_abandoned events)
-
Add a new Flink aggregation job or extend the existing one for the new metric
-
Add a new column to the ClickHouse schema
-
Add the metric to the dashboard query
-
Backfill from S3 bronze layer via Spark if historical data needed
The bronze layer on S3 is the safety net — it preserves all raw events, enabling backfill of any new metric.
“How would this design change for Meta at 10x the scale (20M events/sec)?”
-
Kafka: 10x more partitions (500-1000), multi-cluster with Kafka MirrorMaker for cross-region
-
Flink: More parallelism (10x task managers), careful state management to avoid memory pressure
-
ClickHouse: Multi-node cluster (10-20 nodes), sharded by dimension key
-
Ingestion: Edge pre-aggregation — reduce to 10x lower volume before hitting central Kafka
-
Same architecture fundamentally — just scaled horizontally
Think About This (Self-Assessment)
After reading this design, answer these questions without looking back:
-
Why did we choose pre-aggregation over on-demand aggregation? What’s the limitation?
-
Why is HyperLogLog used for DAU? What’s the error rate?
-
What’s in the bronze layer and why is it critical?
-
How does the serving layer handle 500 concurrent dashboard users at 30-second refresh?
-
What happens if Flink crashes? How long is the dashboard stale?
If you can answer all 5 fluently in 60 seconds each, this design is yours.
Quick Reference
-
The architecture stack: SDK → Kafka → Flink (dedup + enrich + window) → ClickHouse (pre-aggregated) + S3/Iceberg (raw) + Warehouse (historical ad-hoc) → Dashboard API → WebSocket push
-
Key sizing numbers: 2M events/sec = ~50 Kafka partitions, 1 GB/sec raw, ~14 TB/day compressed
-
Real-time OLAP choice: ClickHouse for sub-second dashboards with full SQL. Pinot when < 100ms or user-facing. Druid for time-series exploration.
-
Critical design choices: Pre-aggregate in Flink (not on-demand) → HLL for COUNT DISTINCT → Redis cache in front of ClickHouse → WebSocket push vs polling → at-least-once + deduplication
-
The bronze layer is always your safety net — all raw events on S3, any future metric can be backfilled
Tomorrow’s Preview
Day 25: Design: Event Logging & Telemetry System — Building for billions of events per day. Schema design for events. Ingestion at scale (Kafka/Kinesis), processing and storage (columnar), querying. Handling late data and deduplication. This is the foundational infrastructure problem at every FAANG company.