Phase 1: Foundations | Category: System Design Practice
The Prompt
“Design a telemetry and event logging system that ingests billions of user events per day from web, mobile, and server-side sources. The system must support both real-time analytics (anomaly detection, dashboards) and batch analytics (daily reports, cohort analysis). Handle late-arriving events, deduplication, and schema evolution.”
Set a timer for 45 minutes and work through each section before reading.
Step 1: Requirements (5 min)
Functional:
-
Ingest events from multiple surfaces: web (JS SDK), mobile (iOS/Android SDK), server (backend services)
-
Store raw events immutably for replay and audit
-
Support real-time analytics (anomaly detection, live dashboards)
-
Support batch analytics (retention, cohort analysis, A/B test results)
-
Handle schema evolution without breaking downstream consumers
-
Support GDPR right-to-deletion
Non-functional:
Scale: 500B events/day (Netflix/Google range), ~6M events/sec peak
Latency: Real-time path: < 30 sec. Batch path: < 4 hours (available by 6 AM)
Freshness: Streaming dashboards: 30-second SLA
Retention: Hot data: 30 days. Warm: 1 year. Cold: 7 years
Durability: Zero data loss (events are the source of truth)
Late data: Mobile events may arrive up to 2 hours late
Duplicates: Mobile retries cause duplicates — must deduplicate
Step 2: Scale Estimation (3 min)
6M events/sec × 500 bytes avg = 3 GB/sec raw ingestion
3 GB/sec × 86,400 sec = ~259 TB/day raw
Parquet + Zstd (6x compression) = ~43 TB/day compressed storage
Annual hot storage: 43 TB × 365 = ~15.7 PB/year
Kafka sizing:
6M events/sec → need 60-100 partitions (100K events/sec per partition)
Replication factor 3 → ~9 GB/sec total Kafka write throughput
Retention: 7 days on Kafka for replay = 43 TB × 7 / 6 ≈ ~50 TB Kafka storage
→ Tiered Kafka storage (hot disks for recent, S3 for older)
Step 3: Architecture
┌──────────────────────────────────────────────────────┐
│ EVENT SOURCES │
│ Web SDK → JS → batch(50ms/50 events) → HTTPS │
│ Mobile SDK → iOS/Android → batch with retry logic │
│ Server → gRPC → high-throughput direct publish │
└────────────────────────┬─────────────────────────────┘
↓
┌──────────────────────────────────────────────────────┐
│ INGESTION GATEWAY (stateless, horizontally │
│ scalable, behind load balancer) │
│ • Auth token validation │
│ • Schema validation (Avro + Schema Registry) │
│ • PII masking / tokenization (before Kafka) │
│ • Rate limiting per client │
│ • Invalid events → Dead Letter Queue (DLQ) │
└────────────────────────┬─────────────────────────────┘
↓ Avro serialized
┌──────────────────────────────────────────────────────┐
│ APACHE KAFKA │
│ Topics: │
│ user-events (100 partitions, key=user_id) │
│ server-events (50 partitions, key=service_name) │
│ dlq-events (invalid / rejected events) │
│ Retention: 7 days (tiered storage to S3 for 30 days) │
└──────────────┬─────────────────┬────────────────────┘
↓ ↓
┌──────────────────┐ ┌─────────────────────────────┐
│ FLINK (STREAM) │ │ SPARK STRUCTURED STREAMING │
│ • Deduplication │ │ (BATCH-FIRST PATH) │
│ • Late data (2h │ │ • Micro-batch every 5 min │
│ watermark) │ │ • Write to S3/Iceberg │
│ • Sessionization│ │ • Bronze layer, partitioned │
│ • Enrichment │ │ by date + hour │
│ • Aggregation │ └──────────────┬──────────────┘
└────────┬─────────┘ │
└──────────────┬───────────────┘
↓
(Bronze → Silver → Gold — hourly/daily dbt/Spark)
┌─────────────────────┐ ┌─────────────────────────┐
│ S3 / GCS + Iceberg │ │ ClickHouse / Pinot │
│ BRONZE LAYER │ │ (real-time OLAP) │
│ (raw, immutable) │ │ • 30-sec dashboards │
└──────────┬──────────┘ └──────────┬──────────────┘
↓ ↓ (hourly dbt/Spark)
┌──────────────────────┐ ┌──────────────────────┐
│ SILVER LAYER │ │ GOLD LAYER │
│ (cleaned, deduped, │ │ (star schema, │
│ sessionized) │ │ retention, cohorts,│
└──────────┬───────────┘ │ A/B metrics) │
└──────────────────┴──────────┬───────────
↓
┌─────────────────┐
│ SERVING LAYER │
│ Dashboard API │
│ + BigQuery/ │
│ Snowflake for │
│ ad-hoc queries │
└─────────────────┘
Step 4: The Hard Problems Deep Dive
Deduplication at Scale
Mobile clients retry failed HTTP requests. A single event may arrive 2-3 times.
Three-layer deduplication strategy:
Layer 1 — Client side: Generate a deterministic event_id = hash(user_id + event_type + client_timestamp + session_id). Idempotent retries send the same event_id.
Layer 2 — Flink streaming dedup: Keyed state on event_id with 1-hour state TTL. First occurrence passes; duplicates dropped.
Flink: KeyedState, TTL = 1 hour
→ process(event):
if state.get(event_id) == null: emit + setState(true)
else: drop
Layer 3 — Batch dedup in Silver layer: Daily Spark job deduplicates across partition boundaries (for duplicates that arrived in different micro-batch windows):
-- Silver dedup: keep first occurrence per event_id per day
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER ( PARTITION BY event_id
ORDER BY processing_time ASC ) AS rn
FROM bronze.user_events
WHERE date = '2026-04-10' ) WHERE rn = 1
Late-Arriving Events (2-hour window)
Mobile users go offline and reconnect — events arrive with timestamps 2 hours in the past.
Flink watermark configuration:
Watermark = MAX(event_time_seen) - 2 hours Window [10:00-10:05] fires when watermark reaches 12:05 (2 hours after the window end time)
Cost: Each window stays open 2 extra hours → memory pressure for stateful operations. Mitigation: use incremental aggregation (Flink’s AggregateFunction) to compute partial aggregates rather than storing all events.
For events arriving BEYOND the 2-hour window:
-
Route to a side output (DLQ-like)
-
Batch correction job runs hourly: takes late events from the side output, merges into the Iceberg silver layer using MERGE INTO
-
Downstream gold tables recalculate affected partitions
Monitoring: Alert on % late events > 5% — sudden spike signals a mobile client issue or region-wide connectivity problem.
Schema Evolution
As the product adds new event types and fields:
Event v1: { event_id, user_id, event_type, timestamp } Event v2: + device_type, app_version (added fields, backward compatible) Event v3: + geo_lat, geo_lng (another compatible addition)
Enforcement:
-
Schema Registry with BACKWARD_TRANSITIVE mode
-
New field additions with defaults → compatible, no pipeline changes needed
-
Breaking changes (renames, type changes) → blocked by registry, require dual-field migration
Bronze layer resilience: Iceberg mergeSchema = true automatically adds new columns. Old partitions don’t have the new column (NULL). Downstream SQL uses COALESCE(new_col, default_value).
GDPR Deletion
User requests deletion: must be removed from all tables and storage.
1. Tag user_id in a "deletion_requests" table with request_timestamp
2. Hard delete from all Iceberg tables: DELETE FROM silver.user_events WHERE user_id = 'U-12345' (Iceberg delete files handle this; compaction physically removes later)
3. Replay Bronze → Silver after deletion marks are applied (for audit)
4. Kafka: configure log compaction + tombstone messages (key=user_id, value=null)
5. Audit record: "user U-12345 deleted from all layers at 2026-04-10 14:23 UTC"
Key Trade-offs
DecisionChoiceRationaleDecisionChoiceRationaleAt-least-once vs exactly-onceAt-least-once + dedup3-layer dedup cheaper than distributed 2PC at 6M/secFlink vs Spark for stream pathFlinkSub-second state management, native watermarks, 2-hour late windowSpark for batch pathSpark Structured Streaming (micro-batch)Unified batch + stream codebase; 5-min micro-batch is sufficient for bronze layerEvent id strategyClient-generated deterministic hashEnables dedup without central coordination; works offlineKafka retention7 days hot + tiered S37 days covers operational replay; tiered storage eliminates retention cost concerns
Self-Assessment Questions
-
Why do we deduplicate at 3 layers instead of just one?
-
How does the 2-hour watermark affect memory in Flink?
-
What happens to the dashboard if Flink crashes and recovers from a 30-second-old checkpoint?
-
How is GDPR deletion handled across Kafka + Iceberg + ClickHouse?