Phase 2: Deep Dives | Category: Advanced Data Modeling

Why This Matters for Your Target Companies

Every FAANG-scale company — Meta (engagement events), Netflix (viewing events), Google (search events), OpenAI (conversation events) — ingests billions of events daily. The data modeling choice for event data fundamentally shapes how quickly analysts can answer questions, how much pipeline maintenance is required, and how well the model handles evolving business questions. At senior DE interviews, you’ll be given ambiguous event modeling scenarios and expected to reason through the trade-offs between Activity Schema, Kimball fact tables, and OBT approaches without a prescribed answer.

The Problem With Traditional Modeling for Events

Consider this scenario: You have user events (views, clicks, purchases, shares, opens). With Kimball star schema, you build:

  • fact_page_views
  • fact_clicks
  • fact_purchases
  • fact_email_opens
  • fact_shares

The hidden cost: A new business question arrives: “Show me users who clicked within 1 hour of receiving an email.” In a Kimball model, this requires joining two separate fact tables on user_id and time — a complex, potentially expensive cross-fact join. And every new question requires a new model or a new join pattern.

As the Activity Schema spec on GitHub describes: “Traditional approaches repeatedly failed to deliver at companies like WeWork — tables were question-specific, every new question required new modeling work, and numbers never matched across tables.”

The Three Event Modeling Patterns

Pattern 1: Kimball Fact Tables (One Table Per Event Type)

fact_page_views
:    session_id, user_id, page_id, view_date, duration_sec  fact_clicks:        click_id, user_id, element_id, click_date, context  fact_purchases:     order_id, user_id, product_id, purchase_date, amount  fact_email_opens:   open_id, user_id, email_id, open_date, device_type

Strengths:

  • Clear, purpose-built schema for each event type
  • Optimized columns per event type (no NULLs for irrelevant attributes)
  • Easy to understand and query for single-event-type questions
  • Natural fit for financial data where each event type has distinct attributes

Weaknesses:

  • Cross-event queries require multi-table joins
  • New event types = new tables = new pipeline work
  • Numbers may diverge across tables (different definitions of “user” per table)
  • Pipeline maintenance multiplies with each new event type
  • Retention queries (“users who did X then Y”) are expensive cross-joins

Use when: Event types are heterogeneous, each with unique attributes (purchase has amount, view has duration — little overlap). Analytics are primarily single-event-type.

Pattern 2: Activity Schema (Wide Time-Series, Narrow Schema)

Proposed by Ahmed Elsamadisi, codified in the ActivitySchema spec. All activities for an entity type are stored in a single narrow table:

CREATE
TABLE customer_stream (      activity_id        STRING NOT NULL,       -- unique event identifier      ts                 TIMESTAMP NOT NULL,    -- event timestamp      customer_id        STRING NOT NULL,       -- the entity      activity_name      STRING NOT NULL,       -- 'page_view', 'click', 'purchase', 'email_open'      anonymous_id       STRING,               -- for pre-login events      revenue_impact     FLOAT64,              -- optional financial measure      feature_1          STRING,              -- activity-specific attribute 1      feature_2          STRING,              -- activity-specific attribute 2      feature_3          STRING               -- activity-specific attribute 3  )  PARTITION BY DATE(ts)  CLUSTER BY customer_id, activity_name;

The 11-column design principle: The spec targets ~11 columns (vs. 50+ columns in a wide event table). Activity-specific attributes are limited to feature_1/2/3 — if you need more, create a per-activity table alongside the stream.

How it handles cross-event queries: Self-joins using temporal operators:

--
"Users who clicked within 1 hour of email open"  -- In Activity Schema: self-join on same entity + time relationship  SELECT      opens.customer_id,      opens.ts as email_open_time,      clicks.ts as click_time,      TIMESTAMP_DIFF(clicks.ts, opens.ts, MINUTE) as minutes_to_click  FROM customer_stream opens  JOIN customer_stream clicks      ON opens.customer_id = clicks.customer_id      AND clicks.ts BETWEEN opens.ts AND TIMESTAMP_ADD(opens.ts, INTERVAL 1 HOUR)  WHERE opens.activity_name = 'email_open'    AND clicks.activity_name = 'click'    AND opens.ts >= '2026-04-01'

The temporal operators (the “10 relationship patterns” from the spec):

  1. within_next(X minutes) — did Y happen after X?

  2. first_ever — what was the first activity of type X?

  3. last_before — most recent X before this event

  4. count_between(start, end) — how many X happened between two timestamps?

  5. ever_done — has the customer ever done X?

Strengths:

  • Any question answerable from one table (no cross-table joins — only self-joins)
  • New event types: just insert new rows with a different activity_name — zero schema changes
  • Numbers always match (single source of truth)
  • Temporal relationship queries are natural
  • Fast in columnar warehouses: few columns, many rows

Weaknesses:

  • feature_1/2/3 are generic — meaning is implicit (“feature_1 means page_url for page_view, but product_id for purchase”)
  • Self-joins on large tables can be expensive if not partitioned/clustered correctly
  • Less familiar to analysts used to Kimball
  • Very heterogeneous events with many unique attributes don’t fit the 3-feature limit

Use when: Many event types, common entity (customer), temporal relationship queries are primary, schema evolves frequently, analyst self-serve is a priority.

Pattern 3: One Big Table (OBT) — Wide Event Table

All events in one table, with a column for every possible attribute across all event types:

CREATE
TABLE events (      event_id           STRING,      event_timestamp    TIMESTAMP,      event_type         STRING,    -- 'page_view', 'click', 'purchase', ...      user_id            STRING,      session_id         STRING,      -- Page view columns      page_url           STRING,      page_category      STRING,      view_duration_sec  INT64,      -- Click columns      element_id         STRING,      element_type       STRING,      click_x            FLOAT64,      click_y            FLOAT64,      -- Purchase columns      order_id           STRING,      product_id         STRING,      purchase_amount    FLOAT64,      -- Email columns      email_campaign_id  STRING,      email_subject      STRING      -- ... potentially 100+ columns  )

The OBT trade-off per LinkedIn OBT analysis: “OBT is designed to solve the Read Performance puzzle. It involves denormalizing your data into a single massive table. ‘Big’ refers to scope, not a ‘dumping ground.’”

Strengths:

  • No JOINs for any query — everything is in one scan
  • Very fast for columnar engines (BigQuery, Snowflake skip non-relevant columns)
  • Simple mental model for analysts

Weaknesses:

  • Column explosion: 100+ columns with 90% NULLs for most rows
  • Wasteful storage (NULL columns still occupy space in some engines)
  • Schema changes are expensive (adding a column touches the whole table)
  • Tends to become a “dumping ground” without governance

Use when: Very few event types with mostly shared attributes, ML feature tables (analysts/scientists need a wide flat table), serving layers where analysts expect one-table access.

Schema-on-Read vs Schema-on-Write for Events

This is the cross-cutting tension in all event modeling:

AspectSchema-on-write (strict)Schema-on-read (flexible)
DefinitionSchema enforced at write timeSchema inferred/applied at query time
Storage formatStructured (Parquet with known schema)JSON/JSONB, or Parquet with generic columns
New event fieldsRequires ALTER TABLE → schema migrationJust write new JSON key → no migration
Query performanceFast (columnar pruning, type-safe)Slower (JSON parsing overhead)
Data qualityEnforced at ingestion (fail early)Must validate at query time
Analytics compatibilityBI tools work nativelyRequires JSON extraction in SQL
Best forStable event schemas, BI-heavy analyticsRapidly evolving schemas, exploratory
ExampleParquet with defined columnsJSONB in Postgres, variant/JSON in Snowflake

The 2026 production pattern: Hybrid per layer. Bronze: schema-on-read (raw JSON events, no schema enforcement → preserve all fields, handle schema evolution). Silver: schema-on-write (parse JSON, enforce typed columns, fail bad events to DLQ). Gold: strict schema (Kimball or Activity Schema, typed, BI-ready).

The Modern Event Schema Pattern at FAANG Scale

In practice, the winning pattern at your target companies is:

Ingestion schema (events arriving via Kafka/Scribe):

{    "event_id": "uuid",    "event_type": "page_view",    "entity_id": "user-12345",    "entity_type": "user",    "timestamp": "2026-04-12T00:00:00Z",    "session_id": "sess-abc",    "properties": {      "page_url": "https://netflix.com/watch/123",      "content_id": "nm0123456",      "surface": "homepage"    },    "context": {      "device_type": "mobile",      "country": "US",      "app_version": "5.2.1"    }  }

Silver layer (structured, typed, per-event-type tables):

--
One table per major event type, structured columns  silver.viewing_events:    event_id, entity_id, timestamp, session_id,    content_id, surface, view_duration_sec,    device_type, country, app_version  -- Plus a common properties stored as STRUCT/JSON for extensibility    extra_properties JSON

Gold layer (Activity Schema for temporal/funnel queries, Kimball for aggregated metrics):

--
Gold Activity Stream (for temporal analysis, funnel queries)  gold.user_stream: activity_id, ts, user_id, activity_name, feature_1, feature_2, feature_3  -- Gold Fact Tables (for pre-aggregated metrics)  gold.daily_viewing_metrics: date, user_segment, content_category, views, watch_time

The insight: Activity Schema and Kimball are not competing. Activity Schema lives in the analytical layer for temporal questions. Kimball aggregated tables serve BI dashboards. Both read from the same silver structured events.

Session Windows: The Hardest Event Modeling Problem

Sessionization deserves special attention — it comes up in almost every DE system design that involves events.

The problem: Events are timestamped, but sessions have no explicit start/end. A session is a group of events from the same user where no gap exceeds N minutes (typically 30 minutes).

The SQL approach (for batch pipelines):

--
Step 1: flag session starts (first event or gap > 30 min from prior event)  WITH events_with_gaps AS (    SELECT      user_id, event_id, event_timestamp,      LAG(event_timestamp) OVER (PARTITION BY user_id ORDER BY event_timestamp) AS prior_ts,      TIMESTAMP_DIFF(event_timestamp,        LAG(event_timestamp) OVER (PARTITION BY user_id ORDER BY event_timestamp),        MINUTE) AS gap_minutes    FROM silver.events  ),  session_starts AS (    SELECT *,      CASE WHEN prior_ts IS NULL OR gap_minutes > 30 THEN 1 ELSE 0 END AS is_session_start    FROM events_with_gaps  ),  -- Step 2: assign session IDs via cumulative sum of session starts  session_assignments AS (    SELECT *,      -- Cumulative sum of session starts = session number per user      SUM(is_session_start) OVER (PARTITION BY user_id ORDER BY event_timestamp) AS session_num    FROM session_starts  )  SELECT    user_id,    session_num,    CONCAT(user_id, '_', session_num) AS session_id,    MIN(event_timestamp) AS session_start,    MAX(event_timestamp) AS session_end,    TIMESTAMP_DIFF(MAX(event_timestamp), MIN(event_timestamp), MINUTE) AS session_duration_min,    COUNT(*) AS event_count  FROM session_assignments  GROUP BY user_id, session_num

The streaming approach (Flink session windows):

on

Flink: session window with 30-minute inactivity gap  stream    .keyBy("user_id")    .window(EventTimeSessionWindows.withGap(Time.minutes(30)))    .process(SessionWindowFunction())

Edge cases you must address in interviews:

  1. Multi-device sessions: Same user on mobile and laptop simultaneously — are these one session or two? (Usually two, by device_id)

  2. Very long sessions: A user leaves Netflix running → 8-hour “session.” Cap at reasonable max (4 hours) or use content-specific signals (pause events)

  3. Late-arriving events: Event with timestamp 2 hours ago arrives now — does it extend a past session? (Use allowed lateness in Flink, or batch correction in the next day’s job)

  4. Cross-midnight sessions: A session spanning midnight breaks date partitioning. Solution: assign session to the date it started (not ended)

Interview Questions

Q1: “You’re designing the data model for Meta’s engagement data — billions of events/day across likes, comments, shares, reactions, views. How do you model this?”

Model Answer: “I’d use a layered approach. Bronze: raw JSON events via Scribe, no schema enforcement, all fields preserved for replay. Silver: structured per-event-type tables — silver.likes, silver.comments, silver.views — with typed columns extracted from JSON. This gives me the performance of columnar storage and the type safety for downstream consumers. Gold has two paths: For product analytics dashboards needing pre-aggregated metrics (‘daily engagement by surface’), I build Kimball-style aggregated fact tables partitioned by date and clustered by surface and content_category. For analyst questions requiring temporal analysis (‘users who commented within 1 hour of viewing’), I maintain an Activity Schema user_stream table — all event types unified with activity_name, feature_1/2/3, partitioned by date and clustered by user_id. The Activity Schema enables self-join temporal queries without cross-table joins, which is critical when analysts ask funnel and retention questions that span multiple event types. The Kimball tables exist only for pre-computed metrics that dashboards need at sub-second latency. Both read from the same silver layer — single source of truth.”

Q2: “Should you use schema-on-read or schema-on-write for a rapidly evolving event schema?”

Model Answer: “Both, at different layers. Schema-on-read at the bronze layer: event producers (mobile apps, web SDKs) add new properties constantly — if I enforce schema at ingestion, every new property field causes a pipeline failure. Bronze accepts any valid JSON, stores it raw, and preserves maximum fidelity for replay. This is critical: the next feature could require a property we’re not capturing yet, and we can always backfill from bronze. Schema-on-write at silver and gold: once data is in bronze, my ETL pipeline applies schema validation during the bronze→silver transform. New properties that aren’t in the schema go into an extra_properties JSON column — they’re captured but not typed. Known critical columns (user_id, event_type, timestamp) are enforced strictly — NULL in these is a DLQ event, not a tolerated NULL. The silver→gold transform uses explicit column selection — only typed, validated columns reach the gold layer. This gives me the flexibility of schema-on-read for production resilience AND the performance/governance of schema-on-write for analytics. The tradeoff is dual storage (raw JSON in bronze + typed columns in silver) but at today’s storage costs, that’s the right call.”

Think About This

You’re in a Netflix interview. The prompt: “Netflix is launching a new interactive feature where users can ‘choose their own adventure’ during certain shows. Model the event data for this feature.”

Walk through:

  1. What new event types does this create? (choice_presented, choice_selected, outcome_viewed, branch_completed, episode_end)

  2. How does sessionization change? (A session now has a branching tree structure, not a linear sequence. Session_id groups all events from one viewing, but the sequence within the session has a tree, not a list.)

  3. Which modeling pattern? (Activity Schema — because the key analytics question is temporal: “What was the most common choice path? What percentage of users who chose A in scene 3 chose B in scene 7?” These are multi-hop temporal self-joins against a single stream table.)

  4. What’s the schema for feature_1/2/3 for choice_selected? (feature_1 = choice_node_id, feature_2 = choice_option_id, feature_3 = path_id from root)

  5. How do you track the full decision tree per user per session? (Accumulating snapshot fact table — one row per user-session-episode, with columns for each decision node’s choice. This enables analytics on “path to completion” without nested tree queries.)

Quick Reference

  • Three event modeling patterns: Kimball fact tables (one per event type, optimized per type), Activity Schema (single time-series table, temporal self-joins), OBT (all events wide, no JOINs, column explosion)
  • Activity Schema formula: entity_id + timestamp + activity_name + feature_1/2/3 — enables any temporal query via self-join
  • Schema-on-read vs schema-on-write: Bronze = schema-on-read (accept any JSON, preserve for replay). Silver/Gold = schema-on-write (type safety, column pruning, BI tool compatibility).
  • Sessionization SQL pattern: LAG() to compute gaps → cumulative SUM(is_session_start) to assign session numbers. One of the most commonly tested SQL patterns in DE interviews.
  • The production 2026 answer: Activity Schema + Kimball together — Activity Schema for temporal analysis (analyst self-serve), Kimball aggregates for dashboard performance. Both from the same silver source.
  • Interview signal: Knowing Activity Schema by name and understanding when it beats Kimball is a strong senior DE signal — most candidates know only Kimball.

Tomorrow’s Preview

Day 42: Graph Data Modeling — Graph databases (Neo4j, Neptune), property graph model, adjacency list patterns in relational databases, social network modeling, fraud detection graphs, and when graph traversal queries actually need a graph database vs. when SQL handles them just fine.