Phase 2: Deep Dives | Category: Advanced Data Modeling
Why This Matters for Your Target Companies
Every FAANG-scale company — Meta (engagement events), Netflix (viewing events), Google (search events), OpenAI (conversation events) — ingests billions of events daily. The data modeling choice for event data fundamentally shapes how quickly analysts can answer questions, how much pipeline maintenance is required, and how well the model handles evolving business questions. At senior DE interviews, you’ll be given ambiguous event modeling scenarios and expected to reason through the trade-offs between Activity Schema, Kimball fact tables, and OBT approaches without a prescribed answer.
The Problem With Traditional Modeling for Events
Consider this scenario: You have user events (views, clicks, purchases, shares, opens). With Kimball star schema, you build:
- fact_page_views
- fact_clicks
- fact_purchases
- fact_email_opens
- fact_shares
The hidden cost: A new business question arrives: “Show me users who clicked within 1 hour of receiving an email.” In a Kimball model, this requires joining two separate fact tables on user_id and time — a complex, potentially expensive cross-fact join. And every new question requires a new model or a new join pattern.
As the Activity Schema spec on GitHub describes: “Traditional approaches repeatedly failed to deliver at companies like WeWork — tables were question-specific, every new question required new modeling work, and numbers never matched across tables.”
The Three Event Modeling Patterns
Pattern 1: Kimball Fact Tables (One Table Per Event Type)
fact_page_views
: session_id, user_id, page_id, view_date, duration_sec fact_clicks: click_id, user_id, element_id, click_date, context fact_purchases: order_id, user_id, product_id, purchase_date, amount fact_email_opens: open_id, user_id, email_id, open_date, device_type
Strengths:
- Clear, purpose-built schema for each event type
- Optimized columns per event type (no NULLs for irrelevant attributes)
- Easy to understand and query for single-event-type questions
- Natural fit for financial data where each event type has distinct attributes
Weaknesses:
- Cross-event queries require multi-table joins
- New event types = new tables = new pipeline work
- Numbers may diverge across tables (different definitions of “user” per table)
- Pipeline maintenance multiplies with each new event type
- Retention queries (“users who did X then Y”) are expensive cross-joins
Use when: Event types are heterogeneous, each with unique attributes (purchase has amount, view has duration — little overlap). Analytics are primarily single-event-type.
Pattern 2: Activity Schema (Wide Time-Series, Narrow Schema)
Proposed by Ahmed Elsamadisi, codified in the ActivitySchema spec. All activities for an entity type are stored in a single narrow table:
CREATE
TABLE customer_stream ( activity_id STRING NOT NULL, -- unique event identifier ts TIMESTAMP NOT NULL, -- event timestamp customer_id STRING NOT NULL, -- the entity activity_name STRING NOT NULL, -- 'page_view', 'click', 'purchase', 'email_open' anonymous_id STRING, -- for pre-login events revenue_impact FLOAT64, -- optional financial measure feature_1 STRING, -- activity-specific attribute 1 feature_2 STRING, -- activity-specific attribute 2 feature_3 STRING -- activity-specific attribute 3 ) PARTITION BY DATE(ts) CLUSTER BY customer_id, activity_name;
The 11-column design principle: The spec targets ~11 columns (vs. 50+ columns in a wide event table). Activity-specific attributes are limited to feature_1/2/3 — if you need more, create a per-activity table alongside the stream.
How it handles cross-event queries: Self-joins using temporal operators:
--
"Users who clicked within 1 hour of email open" -- In Activity Schema: self-join on same entity + time relationship SELECT opens.customer_id, opens.ts as email_open_time, clicks.ts as click_time, TIMESTAMP_DIFF(clicks.ts, opens.ts, MINUTE) as minutes_to_click FROM customer_stream opens JOIN customer_stream clicks ON opens.customer_id = clicks.customer_id AND clicks.ts BETWEEN opens.ts AND TIMESTAMP_ADD(opens.ts, INTERVAL 1 HOUR) WHERE opens.activity_name = 'email_open' AND clicks.activity_name = 'click' AND opens.ts >= '2026-04-01'
The temporal operators (the “10 relationship patterns” from the spec):
-
within_next(X minutes) — did Y happen after X?
-
first_ever — what was the first activity of type X?
-
last_before — most recent X before this event
-
count_between(start, end) — how many X happened between two timestamps?
-
ever_done — has the customer ever done X?
Strengths:
- Any question answerable from one table (no cross-table joins — only self-joins)
- New event types: just insert new rows with a different activity_name — zero schema changes
- Numbers always match (single source of truth)
- Temporal relationship queries are natural
- Fast in columnar warehouses: few columns, many rows
Weaknesses:
- feature_1/2/3 are generic — meaning is implicit (“feature_1 means page_url for page_view, but product_id for purchase”)
- Self-joins on large tables can be expensive if not partitioned/clustered correctly
- Less familiar to analysts used to Kimball
- Very heterogeneous events with many unique attributes don’t fit the 3-feature limit
Use when: Many event types, common entity (customer), temporal relationship queries are primary, schema evolves frequently, analyst self-serve is a priority.
Pattern 3: One Big Table (OBT) — Wide Event Table
All events in one table, with a column for every possible attribute across all event types:
CREATE
TABLE events ( event_id STRING, event_timestamp TIMESTAMP, event_type STRING, -- 'page_view', 'click', 'purchase', ... user_id STRING, session_id STRING, -- Page view columns page_url STRING, page_category STRING, view_duration_sec INT64, -- Click columns element_id STRING, element_type STRING, click_x FLOAT64, click_y FLOAT64, -- Purchase columns order_id STRING, product_id STRING, purchase_amount FLOAT64, -- Email columns email_campaign_id STRING, email_subject STRING -- ... potentially 100+ columns )
The OBT trade-off per LinkedIn OBT analysis: “OBT is designed to solve the Read Performance puzzle. It involves denormalizing your data into a single massive table. ‘Big’ refers to scope, not a ‘dumping ground.’”
Strengths:
- No JOINs for any query — everything is in one scan
- Very fast for columnar engines (BigQuery, Snowflake skip non-relevant columns)
- Simple mental model for analysts
Weaknesses:
- Column explosion: 100+ columns with 90% NULLs for most rows
- Wasteful storage (NULL columns still occupy space in some engines)
- Schema changes are expensive (adding a column touches the whole table)
- Tends to become a “dumping ground” without governance
Use when: Very few event types with mostly shared attributes, ML feature tables (analysts/scientists need a wide flat table), serving layers where analysts expect one-table access.
Schema-on-Read vs Schema-on-Write for Events
This is the cross-cutting tension in all event modeling:
| Aspect | Schema-on-write (strict) | Schema-on-read (flexible) |
|---|---|---|
| Definition | Schema enforced at write time | Schema inferred/applied at query time |
| Storage format | Structured (Parquet with known schema) | JSON/JSONB, or Parquet with generic columns |
| New event fields | Requires ALTER TABLE → schema migration | Just write new JSON key → no migration |
| Query performance | Fast (columnar pruning, type-safe) | Slower (JSON parsing overhead) |
| Data quality | Enforced at ingestion (fail early) | Must validate at query time |
| Analytics compatibility | BI tools work natively | Requires JSON extraction in SQL |
| Best for | Stable event schemas, BI-heavy analytics | Rapidly evolving schemas, exploratory |
| Example | Parquet with defined columns | JSONB in Postgres, variant/JSON in Snowflake |
The 2026 production pattern: Hybrid per layer. Bronze: schema-on-read (raw JSON events, no schema enforcement → preserve all fields, handle schema evolution). Silver: schema-on-write (parse JSON, enforce typed columns, fail bad events to DLQ). Gold: strict schema (Kimball or Activity Schema, typed, BI-ready).
The Modern Event Schema Pattern at FAANG Scale
In practice, the winning pattern at your target companies is:
Ingestion schema (events arriving via Kafka/Scribe):
{ "event_id": "uuid", "event_type": "page_view", "entity_id": "user-12345", "entity_type": "user", "timestamp": "2026-04-12T00:00:00Z", "session_id": "sess-abc", "properties": { "page_url": "https://netflix.com/watch/123", "content_id": "nm0123456", "surface": "homepage" }, "context": { "device_type": "mobile", "country": "US", "app_version": "5.2.1" } }
Silver layer (structured, typed, per-event-type tables):
--
One table per major event type, structured columns silver.viewing_events: event_id, entity_id, timestamp, session_id, content_id, surface, view_duration_sec, device_type, country, app_version -- Plus a common properties stored as STRUCT/JSON for extensibility extra_properties JSON
Gold layer (Activity Schema for temporal/funnel queries, Kimball for aggregated metrics):
--
Gold Activity Stream (for temporal analysis, funnel queries) gold.user_stream: activity_id, ts, user_id, activity_name, feature_1, feature_2, feature_3 -- Gold Fact Tables (for pre-aggregated metrics) gold.daily_viewing_metrics: date, user_segment, content_category, views, watch_time
The insight: Activity Schema and Kimball are not competing. Activity Schema lives in the analytical layer for temporal questions. Kimball aggregated tables serve BI dashboards. Both read from the same silver structured events.
Session Windows: The Hardest Event Modeling Problem
Sessionization deserves special attention — it comes up in almost every DE system design that involves events.
The problem: Events are timestamped, but sessions have no explicit start/end. A session is a group of events from the same user where no gap exceeds N minutes (typically 30 minutes).
The SQL approach (for batch pipelines):
--
Step 1: flag session starts (first event or gap > 30 min from prior event) WITH events_with_gaps AS ( SELECT user_id, event_id, event_timestamp, LAG(event_timestamp) OVER (PARTITION BY user_id ORDER BY event_timestamp) AS prior_ts, TIMESTAMP_DIFF(event_timestamp, LAG(event_timestamp) OVER (PARTITION BY user_id ORDER BY event_timestamp), MINUTE) AS gap_minutes FROM silver.events ), session_starts AS ( SELECT *, CASE WHEN prior_ts IS NULL OR gap_minutes > 30 THEN 1 ELSE 0 END AS is_session_start FROM events_with_gaps ), -- Step 2: assign session IDs via cumulative sum of session starts session_assignments AS ( SELECT *, -- Cumulative sum of session starts = session number per user SUM(is_session_start) OVER (PARTITION BY user_id ORDER BY event_timestamp) AS session_num FROM session_starts ) SELECT user_id, session_num, CONCAT(user_id, '_', session_num) AS session_id, MIN(event_timestamp) AS session_start, MAX(event_timestamp) AS session_end, TIMESTAMP_DIFF(MAX(event_timestamp), MIN(event_timestamp), MINUTE) AS session_duration_min, COUNT(*) AS event_count FROM session_assignments GROUP BY user_id, session_num
The streaming approach (Flink session windows):
on
Flink: session window with 30-minute inactivity gap stream .keyBy("user_id") .window(EventTimeSessionWindows.withGap(Time.minutes(30))) .process(SessionWindowFunction())
Edge cases you must address in interviews:
-
Multi-device sessions: Same user on mobile and laptop simultaneously — are these one session or two? (Usually two, by device_id)
-
Very long sessions: A user leaves Netflix running → 8-hour “session.” Cap at reasonable max (4 hours) or use content-specific signals (pause events)
-
Late-arriving events: Event with timestamp 2 hours ago arrives now — does it extend a past session? (Use allowed lateness in Flink, or batch correction in the next day’s job)
-
Cross-midnight sessions: A session spanning midnight breaks date partitioning. Solution: assign session to the date it started (not ended)
Interview Questions
Q1: “You’re designing the data model for Meta’s engagement data — billions of events/day across likes, comments, shares, reactions, views. How do you model this?”
Model Answer: “I’d use a layered approach. Bronze: raw JSON events via Scribe, no schema enforcement, all fields preserved for replay. Silver: structured per-event-type tables — silver.likes, silver.comments, silver.views — with typed columns extracted from JSON. This gives me the performance of columnar storage and the type safety for downstream consumers. Gold has two paths: For product analytics dashboards needing pre-aggregated metrics (‘daily engagement by surface’), I build Kimball-style aggregated fact tables partitioned by date and clustered by surface and content_category. For analyst questions requiring temporal analysis (‘users who commented within 1 hour of viewing’), I maintain an Activity Schema user_stream table — all event types unified with activity_name, feature_1/2/3, partitioned by date and clustered by user_id. The Activity Schema enables self-join temporal queries without cross-table joins, which is critical when analysts ask funnel and retention questions that span multiple event types. The Kimball tables exist only for pre-computed metrics that dashboards need at sub-second latency. Both read from the same silver layer — single source of truth.”
Q2: “Should you use schema-on-read or schema-on-write for a rapidly evolving event schema?”
Model Answer: “Both, at different layers. Schema-on-read at the bronze layer: event producers (mobile apps, web SDKs) add new properties constantly — if I enforce schema at ingestion, every new property field causes a pipeline failure. Bronze accepts any valid JSON, stores it raw, and preserves maximum fidelity for replay. This is critical: the next feature could require a property we’re not capturing yet, and we can always backfill from bronze. Schema-on-write at silver and gold: once data is in bronze, my ETL pipeline applies schema validation during the bronze→silver transform. New properties that aren’t in the schema go into an extra_properties JSON column — they’re captured but not typed. Known critical columns (user_id, event_type, timestamp) are enforced strictly — NULL in these is a DLQ event, not a tolerated NULL. The silver→gold transform uses explicit column selection — only typed, validated columns reach the gold layer. This gives me the flexibility of schema-on-read for production resilience AND the performance/governance of schema-on-write for analytics. The tradeoff is dual storage (raw JSON in bronze + typed columns in silver) but at today’s storage costs, that’s the right call.”
Think About This
You’re in a Netflix interview. The prompt: “Netflix is launching a new interactive feature where users can ‘choose their own adventure’ during certain shows. Model the event data for this feature.”
Walk through:
-
What new event types does this create? (choice_presented, choice_selected, outcome_viewed, branch_completed, episode_end)
-
How does sessionization change? (A session now has a branching tree structure, not a linear sequence. Session_id groups all events from one viewing, but the sequence within the session has a tree, not a list.)
-
Which modeling pattern? (Activity Schema — because the key analytics question is temporal: “What was the most common choice path? What percentage of users who chose A in scene 3 chose B in scene 7?” These are multi-hop temporal self-joins against a single stream table.)
-
What’s the schema for feature_1/2/3 for choice_selected? (feature_1 = choice_node_id, feature_2 = choice_option_id, feature_3 = path_id from root)
-
How do you track the full decision tree per user per session? (Accumulating snapshot fact table — one row per user-session-episode, with columns for each decision node’s choice. This enables analytics on “path to completion” without nested tree queries.)
Quick Reference
- Three event modeling patterns: Kimball fact tables (one per event type, optimized per type), Activity Schema (single time-series table, temporal self-joins), OBT (all events wide, no JOINs, column explosion)
- Activity Schema formula: entity_id + timestamp + activity_name + feature_1/2/3 — enables any temporal query via self-join
- Schema-on-read vs schema-on-write: Bronze = schema-on-read (accept any JSON, preserve for replay). Silver/Gold = schema-on-write (type safety, column pruning, BI tool compatibility).
- Sessionization SQL pattern: LAG() to compute gaps → cumulative SUM(is_session_start) to assign session numbers. One of the most commonly tested SQL patterns in DE interviews.
- The production 2026 answer: Activity Schema + Kimball together — Activity Schema for temporal analysis (analyst self-serve), Kimball aggregates for dashboard performance. Both from the same silver source.
- Interview signal: Knowing Activity Schema by name and understanding when it beats Kimball is a strong senior DE signal — most candidates know only Kimball.
Tomorrow’s Preview
Day 42: Graph Data Modeling — Graph databases (Neo4j, Neptune), property graph model, adjacency list patterns in relational databases, social network modeling, fraud detection graphs, and when graph traversal queries actually need a graph database vs. when SQL handles them just fine.