Phase 1: Foundations & Frameworks | Category: Data Modeling
Why Dimensional Modeling Is Still Central to DE Interviews
Even with the rise of lakehouses and the modern data stack, dimensional modeling remains the backbone of analytics data design. Every one of your target companies — Meta, Netflix, Google, OpenAI, Anthropic — expects senior data engineers to fluently design star schemas, reason about grain, and choose the right fact table type for a given business process. As the Kimball Group puts it: “The grain declaration becomes a binding contract on the design.” Get this wrong and everything downstream crumbles.
At your level, interviewers won’t ask “what is a star schema?” — they’ll give you an ambiguous business scenario and expect you to derive the model from first principles, defend your grain choice, and explain trade-offs.
The Four-Step Kimball Design Process
This is the framework you should follow at the whiteboard for any data modeling question:
-
Select the business process — What activity are we measuring? (orders, clicks, video views, model inference requests)
-
Declare the grain — What does one row in the fact table represent?
-
Identify the dimensions — What context surrounds each fact? (who, what, where, when, how)
-
Identify the facts/measures — What are we measuring? (amount, count, duration, cost)
Never skip step 2. Never.
Grain: The Most Important Concept
The grain is the atomic level of detail that one row in your fact table represents. It’s the single most impactful decision in your data model.
Rules of grain:
-
Always start at the most atomic level available. You can always aggregate up, but you can never drill down past your grain.
-
Never mix grains in the same fact table. A table with both individual order lines and daily aggregates is broken.
-
The grain dictates which dimensions are valid. If grain = one order line item, then “customer” is a valid dimension. If grain = daily store summary, “individual order” is not.
How to articulate grain in an interview:
“The grain of this fact table is one row per video view event per user per content item. Each row captures a single viewing session with start time, duration, completion percentage, and device type.”
That single sentence tells the interviewer:
-
You understand atomic grain
-
You know the primary dimensions (user, content, time)
-
You’ve thought about the measures (duration, completion)
-
The model supports drill-down at any level
Common grain mistakes to flag proactively:
-
Mixing transaction-level and summary-level rows
-
Defining grain too coarsely (“one row per user per day” when you need session-level analysis)
-
Forgetting that grain determines join cardinality
Star Schema vs. Snowflake Schema
| Aspect | Star Schema | Snowflake Schema |
|---|---|---|
| Structure | Fact table surrounded by denormalized dimension tables | Fact table with normalized (multi-level) dimension tables |
| Joins | Fewer joins, simpler queries | More joins, more complex queries |
| Query performance | Faster — fewer joins, columnar engines optimize wide tables | Slower — multiple joins, though modern MPP engines mitigate this |
| Storage | More redundant (denormalized dimensions) | Less redundant |
| Maintainability | Simpler to understand and build | More complex schema management |
| Best for | BI dashboards, ad-hoc analytics, most OLAP use cases | Cases where dimension integrity matters, very large dimension tables, or storage-constrained environments |
Senior-level take: In modern columnar warehouses (BigQuery, Snowflake, Redshift), star schema wins 95% of the time. The storage cost of denormalization is negligible compared to the query performance and simplicity benefits. Snowflake schema only makes sense when a dimension is extremely large (millions of rows) and frequently updated — normalizing it avoids write amplification.
What to say in the interview: “I default to star schema for the gold/serving layer because modern columnar engines handle wide denormalized tables efficiently, and it reduces cognitive load for analysts. I’d only snowflake a dimension if it’s very large and volatile — for example, a product catalog with millions of SKUs and daily price changes, where I’d separate product_attributes from product_pricing to avoid rewriting the entire dimension daily.”
The Three Types of Fact Tables
This is a Kimball fundamental that interviewers love to probe because it tests whether you can match business requirements to the right modeling pattern (Holistics, Kimball Group).
1. Transaction Fact Table
What it captures: One row per discrete business event at the atomic grain.
Characteristics:
-
Most common type
-
Rows are insert-only (immutable once written)
-
Grain = individual event (one click, one purchase, one video view)
-
Supports additive measures (you can sum across any dimension)
-
Can grow very large but is highly flexible for ad-hoc analysis
Example — E-commerce order line:
fact_order_line — transaction fact (grain: one row per order line)
Keys / FKs
order_line_id (PK)
order_id (FK → dim_order)
customer_id (FK → dim_customer)
product_id (FK → dim_product)
store_id (FK → dim_store)
order_date_id (FK → dim_date)
Measures
quantity
unit_price
discount_amount
line_total
When to use: Almost always your default. Start here. If someone asks you to design a model and doesn’t specify otherwise, build a transaction fact table.
2. Periodic Snapshot Fact Table
What it captures: A measurement at regular time intervals (daily, weekly, monthly), regardless of whether any activity occurred.
Characteristics:
-
One row per entity per period
-
Rows always inserted, even if no transactions occurred (this is the key difference — a null row means “nothing happened,” which is meaningful)
-
Measures are typically semi-additive (you can sum across non-time dimensions, but summing across time periods is usually meaningless — e.g., you don’t add Monday’s inventory to Tuesday’s inventory)
Example — Daily inventory snapshot:
fact_daily_inventory — periodic snapshot (grain: one row per product × warehouse × day)
Keys / FKs
snapshot_date_id (FK → dim_date)
product_id (FK → dim_product)
warehouse_id (FK → dim_warehouse)
Measures (semi-additive — careful summing across time)
quantity_on_hand
quantity_on_order
days_of_supply
When to use: When the business question is about state at a point in time — inventory levels, account balances, pipeline stage counts, user counts. The key signal is: “What was the value of X on date Y?”
Modern nuance: In modern MPP warehouses, you can often compute periodic snapshots on-the-fly from transaction data rather than materializing them as separate tables. Only pre-compute when (a) the computation is expensive, (b) it’s queried frequently, or (c) the source transaction volume is enormous.
3. Accumulating Snapshot Fact Table
What it captures: The lifecycle of a process with well-defined milestones. Rows are updated as the process progresses.
Characteristics:
-
One row per entity (order, application, claim, pipeline run)
-
Multiple date foreign keys — one for each milestone
-
Dates start as NULL and fill in as milestones are reached
-
Includes lag measures (days between milestones) to measure velocity
-
Rows are updated, not appended — this is the only fact table type where UPDATE is expected
Example — Order fulfillment pipeline:
fact_order_fulfillment — accumulating snapshot (grain: one row per order; row updated over time)
Keys / FKs
order_id (PK)
customer_id (FK → dim_customer)
product_id (FK → dim_product)
Milestone dates (NULL until reached)
order_placed_date_id (FK → dim_date)
payment_confirmed_date_id (FK → dim_date)
shipped_date_id (FK → dim_date)
delivered_date_id (FK → dim_date)
returned_date_id (FK → dim_date)
Measures
days_to_confirm
days_to_ship
days_to_deliver
order_total
current_status (degenerate dimension)
When to use: Workflows, pipelines, funnels — anything with a defined beginning, end, and milestones in between. The business question is: “How fast are things moving through our process, and where are the bottlenecks?”
Real-world relevance at your target companies:
-
Meta: Ad campaign lifecycle (created → reviewed → approved → live → completed)
-
Netflix: Content pipeline (pitched → greenlit → in-production → post-production → released)
-
OpenAI/Anthropic: Model training pipeline (data prepared → training started → checkpointed → evaluated → deployed)
Dimension Table Design
Dimensions provide the “who, what, where, when, how” context around facts.
Key principles:
| Principle | Description |
|---|---|
| Surrogate keys | Always use integer surrogate keys (not business keys) as the PK of dimension tables. Business keys change; surrogate keys don’t. |
| Natural keys preserved | Keep the business/natural key as an attribute for lookups, but don’t use it as the join key. |
| Descriptive attributes | Pack dimensions with as many descriptive attributes as relevant. Wide dimensions are good — they reduce the need for joins. |
| Conformed dimensions | Shared dimensions (customer, product, date) should have a single authoritative source used across all fact tables. This enables cross-process analysis. |
| Date dimension | Always use a dedicated date dimension with pre-computed attributes (day_of_week, is_holiday, fiscal_quarter, etc.). Never just store a raw timestamp in the fact table. |
The date dimension is non-negotiable:
dim_date
date_id (PK, INT, YYYYMMDD)
full_date (DATE)
day_of_week (VARCHAR)
is_weekend (BOOLEAN)
is_holiday (BOOLEAN)
week_number (INT)
month_name (VARCHAR)
fiscal_quarter (VARCHAR)
fiscal_year (INT)
This pre-computation avoids repeated date logic in every query and enables BI tools to filter/group by any time attribute without transformation.
Dimensional Modeling in the Modern Lakehouse
How does Kimball fit with the medallion architecture? Per Databricks and common practice:
| Lakehouse Layer | Modeling Approach | Purpose |
|---|---|---|
| Bronze | Raw/schema-on-read, append-only | Preserve source fidelity. No modeling. |
| Silver | Cleaned, 3NF-like or source-aligned | Deduplicated, typed, validated. Light normalization. |
| Gold | Kimball star schema / OBT | Business-ready, denormalized, query-optimized |
The senior insight: Your gold layer is where dimensional modeling lives. Bronze and silver are staging infrastructure. In an interview, when asked “where do you apply dimensional modeling?”, the answer is: “In the gold/serving layer, after data has been cleaned and conformed in silver. The gold layer is Kimball star schema optimized for the consumption pattern — BI dashboards get star schema, ML teams get wide denormalized feature tables (OBT), and applications get pre-aggregated serving tables.”
OBT (One Big Table): A modern pattern where you denormalize everything into a single wide table. Useful for:
-
Data science / ML feature tables
-
Simple dashboards with a single data source
-
Small-to-medium dimension cardinality
Trade-off: Loses the flexibility of star schema for complex multi-fact analysis, and has higher storage/refresh cost due to duplication.
Interview Questions
Q1: “Design the data model for a ride-sharing platform like Uber. Walk me through your approach.”
Model Answer: “I’d start with the business process: trip fulfillment. The grain is one row per completed trip — the most atomic measurable event.
Fact table — fact_trip:
-
trip_id (degenerate dimension)
-
rider_id → dim_rider
-
driver_id → dim_driver
-
pickup_location_id → dim_location
-
dropoff_location_id → dim_location
-
trip_start_date_id → dim_date
-
vehicle_type_id → dim_vehicle_type
-
Measures: trip_duration_seconds, trip_distance_miles, fare_amount, surge_multiplier, tip_amount, driver_rating
Dimensions: dim_rider (name, signup_date, tier), dim_driver (name, rating, vehicle), dim_location (lat, lng, city, zone, airport_flag), dim_date, dim_vehicle_type (UberX, Pool, Black).
I’d also build a second fact table — a periodic snapshot — fact_daily_driver_activity with grain = one row per driver per day, capturing: trips_completed, hours_online, total_earnings, avg_rating. This supports the ‘driver earnings trend’ use case without scanning billions of trip records.
For the rider funnel (request → match → pickup → trip → rating), I’d use an accumulating snapshot with date keys for each milestone and lag measures for time-to-match and time-to-pickup. This lets ops teams identify where riders drop off and where drivers are slow.”
Q2: “When would you NOT use a star schema?”
Model Answer: “Three scenarios. First, if the serving layer is an application API with key-value access patterns — I’d use a denormalized document model in DynamoDB or a wide-column store instead of a relational star schema. Second, for ML feature serving where data scientists need a single flat table with all features — an OBT pattern or feature store is more appropriate. Third, in the silver layer of a lakehouse where I want source-aligned, lightly normalized tables for maximum flexibility before committing to a specific analytical model in gold. The star schema is specifically optimized for dimensional analysis via BI tools — if the consumption pattern is different, the model should be different.”
Think About This
You’re in a Netflix interview. The prompt: “Design the data model for Netflix’s content viewing analytics.”
Walk through:
-
What is the grain? (One row per viewing session? Per play event? Per content-user-day? What are the trade-offs of each?)
-
What are your fact tables? (Think about whether you need a transaction fact, a periodic snapshot for daily engagement, and an accumulating snapshot for the content lifecycle)
-
What conformed dimensions would be shared across multiple fact tables?
-
How would you handle a user who pauses, resumes, and switches devices within a single viewing session — does your grain handle this, or does it break?
The key insight: “viewing session” is deceptively complex. A user might watch 10 minutes, pause, resume 2 hours later, then switch to their phone. You need to define session boundaries explicitly (e.g., a gap > 30 minutes = new session). This definition becomes part of your grain, and it’s the kind of nuance that distinguishes a senior answer from a mid-level one.
Quick Reference
-
Grain first, always. Declare it explicitly before choosing any dimensions or facts. “One row per X” is your anchor statement.
-
Three fact types: Transaction (events, insert-only), Periodic Snapshot (state at intervals, semi-additive), Accumulating Snapshot (process lifecycle, updated rows with milestone dates)
-
Star > Snowflake in modern columnar warehouses. Default to star; only snowflake when a dimension is massive and volatile.
-
Conformed dimensions enable cross-process analysis. dim_customer and dim_date should be the same table across all fact tables.
-
Gold = Kimball. In a lakehouse, dimensional modeling lives in the gold layer. Bronze is raw, silver is cleaned, gold is business-ready star schema.
Tomorrow’s Preview
Day 5: Slowly Changing Dimensions (SCD) — SCD Types 0 through 6, when to use each, implementation patterns in modern warehouses (MERGE INTO, Delta Lake), and how to handle the classic interview curveball: “What happens when a customer changes their address — how does that affect your historical reports?”