Day 4 — Dimensional modeling fundamentals

Phase 1: Foundations & Frameworks | Category: Data Modeling

Why Dimensional Modeling Is Still Central to DE Interviews

Even with the rise of lakehouses and the modern data stack, dimensional modeling remains the backbone of analytics data design. Every one of your target companies — Meta, Netflix, Google, OpenAI, Anthropic — expects senior data engineers to fluently design star schemas, reason about grain, and choose the right fact table type for a given business process. As the Kimball Group puts it: “The grain declaration becomes a binding contract on the design.” Get this wrong and everything downstream crumbles.

At your level, interviewers won’t ask “what is a star schema?” — they’ll give you an ambiguous business scenario and expect you to derive the model from first principles, defend your grain choice, and explain trade-offs.

The Four-Step Kimball Design Process

This is the framework you should follow at the whiteboard for any data modeling question:

Select the business process — What activity are we measuring? (orders, clicks, video views, model inference requests)
Declare the grain — What does one row in the fact table represent?
Identify the dimensions — What context surrounds each fact? (who, what, where, when, how)
Identify the facts/measures — What are we measuring? (amount, count, duration, cost)

Never skip step 2. Never.

Grain: The Most Important Concept

The grain is the atomic level of detail that one row in your fact table represents. It’s the single most impactful decision in your data model.

Rules of grain:

Always start at the most atomic level available. You can always aggregate up, but you can never drill down past your grain.
Never mix grains in the same fact table. A table with both individual order lines and daily aggregates is broken.
The grain dictates which dimensions are valid. If grain = one order line item, then “customer” is a valid dimension. If grain = daily store summary, “individual order” is not.

How to articulate grain in an interview:

“The grain of this fact table is one row per video view event per user per content item. Each row captures a single viewing session with start time, duration, completion percentage, and device type.”

That single sentence tells the interviewer:

You understand atomic grain
You know the primary dimensions (user, content, time)
You’ve thought about the measures (duration, completion)
The model supports drill-down at any level

Common grain mistakes to flag proactively:

Mixing transaction-level and summary-level rows
Defining grain too coarsely (“one row per user per day” when you need session-level analysis)
Forgetting that grain determines join cardinality

Star Schema vs. Snowflake Schema

Aspect	Star Schema	Snowflake Schema
Structure	Fact table surrounded by denormalized dimension tables	Fact table with normalized (multi-level) dimension tables
Joins	Fewer joins, simpler queries	More joins, more complex queries
Query performance	Faster — fewer joins, columnar engines optimize wide tables	Slower — multiple joins, though modern MPP engines mitigate this
Storage	More redundant (denormalized dimensions)	Less redundant
Maintainability	Simpler to understand and build	More complex schema management
Best for	BI dashboards, ad-hoc analytics, most OLAP use cases	Cases where dimension integrity matters, very large dimension tables, or storage-constrained environments

Senior-level take: In modern columnar warehouses (BigQuery, Snowflake, Redshift), star schema wins 95% of the time. The storage cost of denormalization is negligible compared to the query performance and simplicity benefits. Snowflake schema only makes sense when a dimension is extremely large (millions of rows) and frequently updated — normalizing it avoids write amplification.

What to say in the interview: “I default to star schema for the gold/serving layer because modern columnar engines handle wide denormalized tables efficiently, and it reduces cognitive load for analysts. I’d only snowflake a dimension if it’s very large and volatile — for example, a product catalog with millions of SKUs and daily price changes, where I’d separate product_attributes from product_pricing to avoid rewriting the entire dimension daily.”

The Three Types of Fact Tables

This is a Kimball fundamental that interviewers love to probe because it tests whether you can match business requirements to the right modeling pattern (Holistics, Kimball Group).

1. Transaction Fact Table

What it captures: One row per discrete business event at the atomic grain.

Characteristics:

Most common type
Rows are insert-only (immutable once written)
Grain = individual event (one click, one purchase, one video view)
Supports additive measures (you can sum across any dimension)
Can grow very large but is highly flexible for ad-hoc analysis

Example — E-commerce order line:

fact_order_line  —  transaction fact (grain: one row per order line)

Keys / FKs
  order_line_id (PK)
  order_id       (FK → dim_order)
  customer_id    (FK → dim_customer)
  product_id     (FK → dim_product)
  store_id       (FK → dim_store)
  order_date_id  (FK → dim_date)

Measures
  quantity
  unit_price
  discount_amount
  line_total

When to use: Almost always your default. Start here. If someone asks you to design a model and doesn’t specify otherwise, build a transaction fact table.

2. Periodic Snapshot Fact Table

What it captures: A measurement at regular time intervals (daily, weekly, monthly), regardless of whether any activity occurred.

Characteristics:

One row per entity per period
Rows always inserted, even if no transactions occurred (this is the key difference — a null row means “nothing happened,” which is meaningful)
Measures are typically semi-additive (you can sum across non-time dimensions, but summing across time periods is usually meaningless — e.g., you don’t add Monday’s inventory to Tuesday’s inventory)

Example — Daily inventory snapshot:

fact_daily_inventory  —  periodic snapshot (grain: one row per product × warehouse × day)

Keys / FKs
  snapshot_date_id  (FK → dim_date)
  product_id          (FK → dim_product)
  warehouse_id        (FK → dim_warehouse)

Measures (semi-additive — careful summing across time)
  quantity_on_hand
  quantity_on_order
  days_of_supply

When to use: When the business question is about state at a point in time — inventory levels, account balances, pipeline stage counts, user counts. The key signal is: “What was the value of X on date Y?”

Modern nuance: In modern MPP warehouses, you can often compute periodic snapshots on-the-fly from transaction data rather than materializing them as separate tables. Only pre-compute when (a) the computation is expensive, (b) it’s queried frequently, or (c) the source transaction volume is enormous.

3. Accumulating Snapshot Fact Table

What it captures: The lifecycle of a process with well-defined milestones. Rows are updated as the process progresses.

Characteristics:

One row per entity (order, application, claim, pipeline run)
Multiple date foreign keys — one for each milestone
Dates start as NULL and fill in as milestones are reached
Includes lag measures (days between milestones) to measure velocity
Rows are updated, not appended — this is the only fact table type where UPDATE is expected

Example — Order fulfillment pipeline:

fact_order_fulfillment  —  accumulating snapshot (grain: one row per order; row updated over time)

Keys / FKs
  order_id      (PK)
  customer_id   (FK → dim_customer)
  product_id    (FK → dim_product)

Milestone dates (NULL until reached)
  order_placed_date_id           (FK → dim_date)
  payment_confirmed_date_id      (FK → dim_date)
  shipped_date_id                (FK → dim_date)
  delivered_date_id              (FK → dim_date)
  returned_date_id               (FK → dim_date)

Measures
  days_to_confirm
  days_to_ship
  days_to_deliver
  order_total
  current_status                 (degenerate dimension)

When to use: Workflows, pipelines, funnels — anything with a defined beginning, end, and milestones in between. The business question is: “How fast are things moving through our process, and where are the bottlenecks?”

Real-world relevance at your target companies:

Meta: Ad campaign lifecycle (created → reviewed → approved → live → completed)
Netflix: Content pipeline (pitched → greenlit → in-production → post-production → released)
OpenAI/Anthropic: Model training pipeline (data prepared → training started → checkpointed → evaluated → deployed)

Dimension Table Design

Dimensions provide the “who, what, where, when, how” context around facts.

Key principles:

Principle	Description
Surrogate keys	Always use integer surrogate keys (not business keys) as the PK of dimension tables. Business keys change; surrogate keys don’t.
Natural keys preserved	Keep the business/natural key as an attribute for lookups, but don’t use it as the join key.
Descriptive attributes	Pack dimensions with as many descriptive attributes as relevant. Wide dimensions are good — they reduce the need for joins.
Conformed dimensions	Shared dimensions (customer, product, date) should have a single authoritative source used across all fact tables. This enables cross-process analysis.
Date dimension	Always use a dedicated date dimension with pre-computed attributes (day_of_week, is_holiday, fiscal_quarter, etc.). Never just store a raw timestamp in the fact table.

The date dimension is non-negotiable:

dim_date

  date_id         (PK, INT, YYYYMMDD)
  full_date       (DATE)
  day_of_week     (VARCHAR)
  is_weekend      (BOOLEAN)
  is_holiday      (BOOLEAN)
  week_number     (INT)
  month_name      (VARCHAR)
  fiscal_quarter  (VARCHAR)
  fiscal_year     (INT)

This pre-computation avoids repeated date logic in every query and enables BI tools to filter/group by any time attribute without transformation.

Dimensional Modeling in the Modern Lakehouse

How does Kimball fit with the medallion architecture? Per Databricks and common practice:

Lakehouse Layer	Modeling Approach	Purpose
Bronze	Raw/schema-on-read, append-only	Preserve source fidelity. No modeling.
Silver	Cleaned, 3NF-like or source-aligned	Deduplicated, typed, validated. Light normalization.
Gold	Kimball star schema / OBT	Business-ready, denormalized, query-optimized

The senior insight: Your gold layer is where dimensional modeling lives. Bronze and silver are staging infrastructure. In an interview, when asked “where do you apply dimensional modeling?”, the answer is: “In the gold/serving layer, after data has been cleaned and conformed in silver. The gold layer is Kimball star schema optimized for the consumption pattern — BI dashboards get star schema, ML teams get wide denormalized feature tables (OBT), and applications get pre-aggregated serving tables.”

OBT (One Big Table): A modern pattern where you denormalize everything into a single wide table. Useful for:

Data science / ML feature tables
Simple dashboards with a single data source
Small-to-medium dimension cardinality

Trade-off: Loses the flexibility of star schema for complex multi-fact analysis, and has higher storage/refresh cost due to duplication.

Interview Questions

Q1: “Design the data model for a ride-sharing platform like Uber. Walk me through your approach.”

Model Answer: “I’d start with the business process: trip fulfillment. The grain is one row per completed trip — the most atomic measurable event.

Fact table — fact_trip:

trip_id (degenerate dimension)
rider_id → dim_rider
driver_id → dim_driver
pickup_location_id → dim_location
dropoff_location_id → dim_location
trip_start_date_id → dim_date
vehicle_type_id → dim_vehicle_type
Measures: trip_duration_seconds, trip_distance_miles, fare_amount, surge_multiplier, tip_amount, driver_rating

Dimensions: dim_rider (name, signup_date, tier), dim_driver (name, rating, vehicle), dim_location (lat, lng, city, zone, airport_flag), dim_date, dim_vehicle_type (UberX, Pool, Black).

I’d also build a second fact table — a periodic snapshot — fact_daily_driver_activity with grain = one row per driver per day, capturing: trips_completed, hours_online, total_earnings, avg_rating. This supports the ‘driver earnings trend’ use case without scanning billions of trip records.

For the rider funnel (request → match → pickup → trip → rating), I’d use an accumulating snapshot with date keys for each milestone and lag measures for time-to-match and time-to-pickup. This lets ops teams identify where riders drop off and where drivers are slow.”

Q2: “When would you NOT use a star schema?”

Model Answer: “Three scenarios. First, if the serving layer is an application API with key-value access patterns — I’d use a denormalized document model in DynamoDB or a wide-column store instead of a relational star schema. Second, for ML feature serving where data scientists need a single flat table with all features — an OBT pattern or feature store is more appropriate. Third, in the silver layer of a lakehouse where I want source-aligned, lightly normalized tables for maximum flexibility before committing to a specific analytical model in gold. The star schema is specifically optimized for dimensional analysis via BI tools — if the consumption pattern is different, the model should be different.”

Think About This

You’re in a Netflix interview. The prompt: “Design the data model for Netflix’s content viewing analytics.”

Walk through:

What is the grain? (One row per viewing session? Per play event? Per content-user-day? What are the trade-offs of each?)
What are your fact tables? (Think about whether you need a transaction fact, a periodic snapshot for daily engagement, and an accumulating snapshot for the content lifecycle)
What conformed dimensions would be shared across multiple fact tables?
How would you handle a user who pauses, resumes, and switches devices within a single viewing session — does your grain handle this, or does it break?

The key insight: “viewing session” is deceptively complex. A user might watch 10 minutes, pause, resume 2 hours later, then switch to their phone. You need to define session boundaries explicitly (e.g., a gap > 30 minutes = new session). This definition becomes part of your grain, and it’s the kind of nuance that distinguishes a senior answer from a mid-level one.

Quick Reference

Grain first, always. Declare it explicitly before choosing any dimensions or facts. “One row per X” is your anchor statement.
Three fact types: Transaction (events, insert-only), Periodic Snapshot (state at intervals, semi-additive), Accumulating Snapshot (process lifecycle, updated rows with milestone dates)
Star > Snowflake in modern columnar warehouses. Default to star; only snowflake when a dimension is massive and volatile.
Conformed dimensions enable cross-process analysis. dim_customer and dim_date should be the same table across all fact tables.
Gold = Kimball. In a lakehouse, dimensional modeling lives in the gold layer. Bronze is raw, silver is cleaned, gold is business-ready star schema.

Tomorrow’s Preview

Day 5: Slowly Changing Dimensions (SCD) — SCD Types 0 through 6, when to use each, implementation patterns in modern warehouses (MERGE INTO, Delta Lake), and how to handle the classic interview curveball: “What happens when a customer changes their address — how does that affect your historical reports?”

Day 4: Dimensional modeling fundamentals

Why Dimensional Modeling Is Still Central to DE Interviews

The Four-Step Kimball Design Process

Grain: The Most Important Concept

Star Schema vs. Snowflake Schema

The Three Types of Fact Tables

1. Transaction Fact Table

2. Periodic Snapshot Fact Table

3. Accumulating Snapshot Fact Table

Dimension Table Design

Dimensional Modeling in the Modern Lakehouse

Interview Questions

Think About This

Quick Reference

Tomorrow’s Preview

Feedback