Phase 1: Foundations | Category: System Design Practice

The Prompt

“Design a telemetry and event logging system that ingests billions of user events per day from web, mobile, and server-side sources. The system must support both real-time analytics (anomaly detection, dashboards) and batch analytics (daily reports, cohort analysis). Handle late-arriving events, deduplication, and schema evolution.”

Set a timer for 45 minutes and work through each section before reading.

Step 1: Requirements (5 min)

Functional:

  • Ingest events from multiple surfaces: web (JS SDK), mobile (iOS/Android SDK), server (backend services)

  • Store raw events immutably for replay and audit

  • Support real-time analytics (anomaly detection, live dashboards)

  • Support batch analytics (retention, cohort analysis, A/B test results)

  • Handle schema evolution without breaking downstream consumers

  • Support GDPR right-to-deletion

Non-functional:

Scale:      500B events/day (Netflix/Google range), ~6M events/sec peak
Latency:    Real-time path: < 30 sec. Batch path: < 4 hours (available by 6 AM)
Freshness:  Streaming dashboards: 30-second SLA
Retention:  Hot data: 30 days. Warm: 1 year. Cold: 7 years
Durability: Zero data loss (events are the source of truth)
Late data:  Mobile events may arrive up to 2 hours late
Duplicates: Mobile retries cause duplicates — must deduplicate

Step 2: Scale Estimation (3 min)

6M events/sec × 500 bytes avg = 3 GB/sec raw ingestion
3 GB/sec × 86,400 sec = ~259 TB/day raw
Parquet + Zstd (6x compression) = ~43 TB/day compressed storage
Annual hot storage: 43 TB × 365 = ~15.7 PB/year
Kafka sizing:
    6M events/sec → need 60-100 partitions (100K events/sec per partition)
    Replication factor 3 → ~9 GB/sec total Kafka write throughput
    Retention: 7 days on Kafka for replay = 43 TB × 7 / 6 ≈ ~50 TB Kafka storage
    → Tiered Kafka storage (hot disks for recent, S3 for older)

Step 3: Architecture

┌──────────────────────────────────────────────────────┐
│              EVENT SOURCES                            │
│  Web SDK → JS → batch(50ms/50 events) → HTTPS        │
│  Mobile SDK → iOS/Android → batch with retry logic   │
│  Server → gRPC → high-throughput direct publish       │
└────────────────────────┬─────────────────────────────┘

┌──────────────────────────────────────────────────────┐
│         INGESTION GATEWAY (stateless, horizontally   │
│         scalable, behind load balancer)               │
│  • Auth token validation                              │
│  • Schema validation (Avro + Schema Registry)         │
│  • PII masking / tokenization (before Kafka)          │
│  • Rate limiting per client                           │
│  • Invalid events → Dead Letter Queue (DLQ)           │
└────────────────────────┬─────────────────────────────┘
                         ↓ Avro serialized
┌──────────────────────────────────────────────────────┐
│              APACHE KAFKA                             │
│  Topics:                                              │
│    user-events (100 partitions, key=user_id)          │
│    server-events (50 partitions, key=service_name)  │
│    dlq-events (invalid / rejected events)           │
│  Retention: 7 days (tiered storage to S3 for 30 days) │
└──────────────┬─────────────────┬────────────────────┘
               ↓                 ↓
┌──────────────────┐    ┌─────────────────────────────┐
│  FLINK (STREAM)  │    │  SPARK STRUCTURED STREAMING │
│  • Deduplication │    │  (BATCH-FIRST PATH)          │
│  • Late data (2h │    │  • Micro-batch every 5 min   │
│    watermark)    │    │  • Write to S3/Iceberg       │
│  • Sessionization│    │  • Bronze layer, partitioned │
│  • Enrichment    │    │    by date + hour            │
│  • Aggregation   │    └──────────────┬──────────────┘
└────────┬─────────┘                   │
         └──────────────┬───────────────┘

         (Bronze → Silver → Gold — hourly/daily dbt/Spark)

┌─────────────────────┐       ┌─────────────────────────┐
│  S3 / GCS + Iceberg │       │  ClickHouse / Pinot       │
│  BRONZE LAYER       │       │  (real-time OLAP)         │
│  (raw, immutable)   │       │  • 30-sec dashboards      │
└──────────┬──────────┘       └──────────┬──────────────┘
           ↓                    ↓ (hourly dbt/Spark)
┌──────────────────────┐      ┌──────────────────────┐
│  SILVER LAYER        │      │  GOLD LAYER          │
│  (cleaned, deduped,  │      │  (star schema,       │
│   sessionized)       │      │   retention, cohorts,│
└──────────┬───────────┘      │   A/B metrics)        │
           └──────────────────┴──────────┬───────────

                               ┌─────────────────┐
                               │  SERVING LAYER  │
                               │  Dashboard API  │
                               │  + BigQuery/    │
                               │  Snowflake for  │
                               │  ad-hoc queries │
                               └─────────────────┘

Step 4: The Hard Problems Deep Dive

Deduplication at Scale

Mobile clients retry failed HTTP requests. A single event may arrive 2-3 times.

Three-layer deduplication strategy:

Layer 1 — Client side: Generate a deterministic event_id = hash(user_id + event_type + client_timestamp + session_id). Idempotent retries send the same event_id.

Layer 2 — Flink streaming dedup: Keyed state on event_id with 1-hour state TTL. First occurrence passes; duplicates dropped.

Flink: KeyedState, TTL = 1 hour
→ process(event):
    if state.get(event_id) == null: emit + setState(true)
    else: drop

Layer 3 — Batch dedup in Silver layer: Daily Spark job deduplicates across partition boundaries (for duplicates that arrived in different micro-batch windows):

-- Silver dedup: keep first occurrence per event_id per day
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (      PARTITION BY event_id
ORDER BY processing_time ASC    ) AS rn
FROM bronze.user_events
WHERE date = '2026-04-10'  ) WHERE rn = 1

Late-Arriving Events (2-hour window)

Mobile users go offline and reconnect — events arrive with timestamps 2 hours in the past.

Flink watermark configuration:

Watermark = MAX(event_time_seen) - 2 hours  Window [10:00-10:05] fires when watermark reaches 12:05  (2 hours after the window end time)

Cost: Each window stays open 2 extra hours → memory pressure for stateful operations. Mitigation: use incremental aggregation (Flink’s AggregateFunction) to compute partial aggregates rather than storing all events.

For events arriving BEYOND the 2-hour window:

  1. Route to a side output (DLQ-like)

  2. Batch correction job runs hourly: takes late events from the side output, merges into the Iceberg silver layer using MERGE INTO

  3. Downstream gold tables recalculate affected partitions

Monitoring: Alert on % late events > 5% — sudden spike signals a mobile client issue or region-wide connectivity problem.

Schema Evolution

As the product adds new event types and fields:

Event v1: { event_id, user_id, event_type, timestamp }  Event v2: + device_type, app_version (added fields, backward compatible)  Event v3: + geo_lat, geo_lng (another compatible addition)

Enforcement:

  • Schema Registry with BACKWARD_TRANSITIVE mode

  • New field additions with defaults → compatible, no pipeline changes needed

  • Breaking changes (renames, type changes) → blocked by registry, require dual-field migration

Bronze layer resilience: Iceberg mergeSchema = true automatically adds new columns. Old partitions don’t have the new column (NULL). Downstream SQL uses COALESCE(new_col, default_value).

GDPR Deletion

User requests deletion: must be removed from all tables and storage.

1. Tag user_id in a "deletion_requests" table with request_timestamp
2. Hard delete from all Iceberg tables:      DELETE FROM silver.user_events WHERE user_id = 'U-12345'     (Iceberg delete files handle this; compaction physically removes later)
3. Replay Bronze → Silver after deletion marks are applied (for audit)
4. Kafka: configure log compaction + tombstone messages (key=user_id, value=null)
5. Audit record: "user U-12345 deleted from all layers at 2026-04-10 14:23 UTC"

Key Trade-offs

DecisionChoiceRationaleDecisionChoiceRationaleAt-least-once vs exactly-onceAt-least-once + dedup3-layer dedup cheaper than distributed 2PC at 6M/secFlink vs Spark for stream pathFlinkSub-second state management, native watermarks, 2-hour late windowSpark for batch pathSpark Structured Streaming (micro-batch)Unified batch + stream codebase; 5-min micro-batch is sufficient for bronze layerEvent id strategyClient-generated deterministic hashEnables dedup without central coordination; works offlineKafka retention7 days hot + tiered S37 days covers operational replay; tiered storage eliminates retention cost concerns

Self-Assessment Questions

  1. Why do we deduplicate at 3 layers instead of just one?

  2. How does the 2-hour watermark affect memory in Flink?

  3. What happens to the dashboard if Flink crashes and recovers from a 30-second-old checkpoint?

  4. How is GDPR deletion handled across Kafka + Iceberg + ClickHouse?