Phase 1: Foundations & Frameworks | Category: Data Modeling

Why SCDs Matter at Senior Level

SCDs are the most frequently asked data modeling topic in DE interviews. At your level, interviewers won’t just ask you to define the types — they’ll present a business scenario and expect you to choose the right SCD strategy, defend the trade-offs, explain the implementation in a modern lakehouse (MERGE INTO on Delta/Iceberg), and handle edge cases like late-arriving data, retroactive corrections, and multi-attribute changes.

The core question SCDs answer: When a dimension attribute changes, what happens to the historical facts that were associated with the old value?

SCD Types at a Glance

TypeStrategyHistory Preserved?Storage ImpactComplexityUse When
0Retain original, never updateOriginal value onlyNoneTrivialAttribute should never change (join date, original signup source)
1OverwriteNoNoneLowHistory doesn’t matter (typo fix, data backfill, current-state-only reporting)
2Add new row with versioningFull historyHigh (row per change)Medium-HighMost common. Needed when historical analysis must reflect the state at time of the fact.
3Add column for previous valueOne prior valueLowLowOnly need current + previous (e.g., previous department after reorg)
4Mini-dimension for rapidly changing attributesFull in separate tableMediumMediumHigh-frequency changes on a subset of attributes (customer risk score, engagement tier)
6Hybrid of 1 + 2 + 3Full + current columnHighHighNeed both point-in-time queries AND easy current-value access without self-joins

SCD Type 0: Fixed Attributes

The simplest — some attributes should never change after initial load.

Examples: Employee hire date, customer first purchase date, original signup channel, account creation timestamp.

Implementation: Simply exclude these columns from any UPDATE logic. If a source system erroneously sends a changed value, ignore it or log an alert.

When to mention in interviews: When you’re designing a dimension and want to signal that you’ve thought about which attributes are mutable vs. immutable. “I’d apply SCD Type 0 to signup_date and original_referral_source — these are historical facts about the customer that shouldn’t change regardless of source updates.”

SCD Type 1: Overwrite

The old value is gone. Replaced. No history.

Schema:

dim_customer

_______________________________

customer_id (PK/SK)
customer_name
email
city ← simply overwritten on change

When it’s the right choice:

  • Correcting data quality errors (misspelled name, wrong zip code)

  • Backfilling attributes that were NULL at initial load

  • Attributes where historical value is genuinely irrelevant to analysis

  • Operational dashboards that only need current state

Trade-off: Simple and fast, but you lose the ability to answer “what city was the customer in when they made that purchase last March?” If the business ever needs that analysis, you’re stuck.

Senior-level nuance: Even when using Type 2 for key attributes, you’ll typically apply Type 1 to low-value attributes on the same dimension. A customer dimension might be SCD2 on city and membership_tier but SCD1 on phone_number and profile_photo_url. Mixed SCD types on a single dimension are the norm, not the exception.

SCD Type 2: Full Row Versioning

The gold standard. Each change creates a new row. Full history preserved.

Schema:

dim_customer

_______________________________

customer_sk (surrogate key, PK)
customer_id (natural/business key)
customer_name
email
city
membership_tier

_______________________________

valid_from (TIMESTAMP)
valid_to (TIMESTAMP, NULL or '9999-12-31' for current)
is_current (BOOLEAN)

Example — customer moves from Seattle to San Francisco:

customer_skcustomer_idcityvalid_fromvalid_tois_current
1001C-5678Seattle2024-01-152026-02-28false
1002C-5678San Francisco2026-03-019999-12-31true

Why surrogate keys are mandatory: The natural key (customer_id = C-5678) appears in multiple rows. The surrogate key (customer_sk) is the unique PK that the fact table references. This is how historical facts link to the correct version:

  • An order from Feb 2025 joins to customer_sk = 1001 → shows “Seattle”

  • An order from Mar 2026 joins to customer_sk = 1002 → shows “San Francisco”

Point-in-time queries: “What was this customer’s city when they placed order X?”

SELECT f.order_amount, d.city
FROM fact_orders f
JOIN dim_customer d ON f.customer_sk = d.customer_sk
-- The join already gives you the correct historical state
-- because the fact was loaded with the customer_sk that was current at order time

Current-state queries: “What is this customer’s current city?”

SELECT *
FROM dim_customer
WHERE customer_id = 'C-5678'
  AND is_current = true;

SCD Type 2 Implementation with MERGE INTO

This is what interviewers at Meta/Google/Netflix expect you to know. The pattern from Start Data Engineering is the industry standard:

The logic in three operations within a single MERGE:

  1. WHEN MATCHED (existing customer with changed attributes): Set is_current = false and valid_to = change_timestamp on the old row

  2. WHEN NOT MATCHED (new customer OR updated version to insert): Insert new row with is_current = true, valid_from = change_timestamp, valid_to = ‘9999-12-31’

  3. WHEN NOT MATCHED BY SOURCE (deleted in source): Set is_active = false

The NULL join_key trick (critical for interviews):

MERGE INTO dim_customer t
USING (
  -- New customers + existing customers to expire
  SELECT customer_id AS join_key, * FROM staging_customer
  UNION ALL
  -- Updated customers to insert as new version (NULL key forces NOT MATCHED)
  SELECT NULL AS join_key, * FROM customers_with_changes
) s
ON t.customer_id = s.join_key
WHEN MATCHED AND t.is_current = true AND s.updated_at > t.updated_at THEN
  UPDATE SET is_current = false, valid_to = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, city, valid_from, is_current)
  VALUES (s.customer_id, s.name, s.city, s.updated_at, true);

The NULL AS join_key pattern causes updated customer rows to hit the NOT MATCHED branch, forcing an INSERT of the new version. This lets you handle both the expire (UPDATE) and the new version (INSERT) in a single MERGE statement — which is both atomic and performant on Delta Lake and Iceberg.

SCD Type 3: Previous-Value Column

Instead of adding rows, you add a column to store the prior value.

Schema:

dim_customer

_______________________________

customer_id (PK)
customer_name
current_city
previous_city
city_change_date

Limitations: Only stores one prior value. If the customer moves three times, you only see current and most recent previous. No full history.

When it’s useful: Organizational changes where you need exactly one “before/after” comparison — e.g., after a company reorg, current_department and previous_department let you analyze movement patterns without the overhead of SCD2.

Senior-level take: Type 3 is rarely used alone in practice. If you mention it in an interview, always acknowledge its limitation: “Type 3 only gives me one prior state. If I need full history, I’d use Type 2. If I only need current state, Type 1 is simpler. Type 3 sits in a narrow sweet spot where exactly one previous value is useful for comparison analysis.”

SCD Type 4: Mini-Dimension

Separates rapidly changing attributes into their own dimension table to prevent bloating the main dimension with excessive SCD2 rows.

Problem it solves: Imagine a customer dimension with 50 attributes. If risk_score changes monthly for every customer, SCD2 creates 12 rows/year/customer just for one attribute — duplicating all 49 stable attributes each time.

Solution: Extract volatile attributes into a mini-dimension:

dim_customer (main, stable — SCD1 or SCD2 on rare changes)

_______________________________

customer_sk (PK)
customer_id
customer_name
signup_date
...
dim_customer_profile (mini-dimension — changes frequently)

_______________________________

profile_sk (PK)
risk_score_band
engagement_tier
ltv_segment
income_bracket
fact_orders

_______________________________

order_id
customer_sk → dim_customer
profile_sk → dim_customer_profile
...

When to propose this: When the interviewer’s scenario has a dimension with a mix of stable and rapidly changing attributes. “I’d separate the volatile attributes — risk score, engagement tier — into a mini-dimension to avoid row explosion in the main customer dimension. The fact table carries FKs to both.”

SCD Type 6: The Hybrid (1 + 2 + 3)

Combines Type 2 (full row versioning) with Type 3 (current value column on every row) and Type 1 (overwrite the current-value column).

Schema:

dim_customer

_______________________________

customer_sk (PK, surrogate)
customer_id (natural key)
historical_city (the city at the time of this version — SCD2)
current_city (always the latest city — SCD1 overwrite on all rows)
valid_from
valid_to
is_current

When customer moves from Seattle → San Francisco:

customer_skcustomer_idhistorical_citycurrent_cityvalid_fromvalid_tois_current
1001C-5678SeattleSan Francisco2024-01-152026-02-28false
1002C-5678San FranciscoSan Francisco2026-03-019999-12-31true

The power: You can do both historical analysis (historical_city) and current-state grouping (current_city) without self-joins or window functions. “Show me all orders from customer C-5678, with the city they were in at the time AND their current city” — one simple join.

Trade-off: Every attribute change requires updating the current_* column on ALL historical rows for that customer. At massive scale (billions of rows), this update cost can be significant.

When to propose: When the business explicitly needs both time-travel AND current-state analysis on the same dimension, and the dimension size is manageable.

Decision Framework: Which SCD Type to Use

Use this flowchart in interviews to demonstrate structured thinking:

Does the business need historical analysis on this attribute?
├── NO → SCD Type 1 (overwrite)  │         Is this attribute immutable by nature?  │
└── YES → SCD Type 0
└── YES → Does the attribute change frequently (monthly or more)?
├── YES → SCD Type 4 (mini-dimension)
└── NO → Do consumers need BOTH historical AND current value                     in the same query without self-joins?
├── YES → SCD Type 6 (hybrid)
└── NO → SCD Type 2 (standard row versioning)

Edge Cases Interviewers Love

1. Late-arriving facts: An order from March arrives in April, but the customer moved in between. Which customer_sk do you use?

  • Answer: Look up the customer version that was is_current = true as of the order date, not as of the load date. This requires a point-in-time lookup: WHERE order_date BETWEEN valid_from AND valid_to.

2. Late-arriving dimension changes: You discover a customer actually moved on Feb 15, but you didn’t process the change until March 1. Your SCD2 has the wrong valid_from.

  • Answer: This requires retroactive correction — splitting or adjusting existing rows. It’s operationally complex and can affect already-loaded facts. Mention that this is why some teams add an effective_date (business date of change) separate from loaded_date (when the pipeline processed it).

3. Multi-attribute changes: Customer changes both city and tier in the same batch.

  • Answer: One new SCD2 row captures both changes simultaneously. Don’t create separate rows for each attribute change in the same batch — that would violate the grain.

4. Deletes from source: Customer is deleted upstream.

  • Answer: Soft delete. Set is_active = false and valid_to = current_timestamp() on the current row. Never physically delete from the dimension — downstream facts still reference it.

Interview Questions

Q1: “A customer changes their address. How does that affect your historical sales reports?”

Model Answer: “It depends on the SCD type. With Type 1, the address is overwritten — all historical sales now appear under the new address, which distorts geographic analysis. With Type 2, a new dimension row is created with the new address. Historical facts still point to the old surrogate key, so ‘sales by region last year’ correctly reflects where the customer was at the time of purchase. New orders going forward get the new surrogate key. I’d default to Type 2 for address because geographic analysis is a core use case. However, I’d apply Type 1 to non-analytical attributes like email or phone number. In practice, most dimensions use mixed SCD types — Type 0 for immutable fields like signup_date, Type 1 for operational fields, Type 2 for analytically important attributes.”

Q2: “How would you implement SCD2 in a Delta Lake / Iceberg environment?”

Model Answer: “I’d use a single MERGE INTO statement with three branches. First, I identify changed records by comparing the staging table against is_current = true rows in the dimension on the natural key. For the MERGE source, I UNION the staging data twice — once with the natural key as join_key (to match and expire existing rows), once with NULL as join_key (to force a NOT MATCHED insert of the new version). The WHEN MATCHED branch sets is_current = false and valid_to to the change timestamp. The WHEN NOT MATCHED branch inserts both new customers and new versions of changed customers. The WHEN NOT MATCHED BY SOURCE branch handles deletes by setting is_active = false. This runs as a single atomic transaction on Delta Lake or Iceberg, leveraging their ACID guarantees. For performance at scale, I’d partition the dimension by a hash of customer_id to parallelize the MERGE operation.”

Think About This

You’re in an OpenAI interview. The prompt: “Design the dimension model for tracking API usage by customers. Customer tier (free, plus, team, enterprise) changes based on subscription events, and we need to analyze usage patterns relative to what tier customers were on at the time of each API call.”

Walk through:

  1. What SCD type would you use for customer_tier? (Type 2 — historical analysis is explicitly required)

  2. How would you handle a customer who upgrades mid-day with millions of API calls that day? (Need sub-day precision on valid_from/valid_to — timestamp, not just date)

  3. The rate limit is different per tier. If a customer upgraded at 2 PM, should API calls at 1 PM show the old tier’s rate limit or the new one? (Old — this is why SCD2 with proper time ranges matters)

  4. At OpenAI’s scale (millions of customers, billions of API calls), would you SCD2 every attribute or use a mini-dimension? (Mini-dimension for rapidly changing attributes like usage_quota_remaining; SCD2 for tier)

Quick Reference

  • SCD2 is your default for interview answers — it’s the most commonly expected and covers the most use cases

  • Mixed SCD types on one dimension is normal — Type 0 for immutable, Type 1 for operational, Type 2 for analytical attributes

  • MERGE INTO with the NULL join_key pattern is the standard implementation on Delta Lake / Iceberg — know this cold

  • Surrogate keys are mandatory for SCD2 — natural keys repeat across versions; surrogate keys uniquely identify each version

  • Always handle edge cases: late-arriving facts (point-in-time lookup), late-arriving dimension changes (retroactive correction), and source deletes (soft delete, never physical)

Tomorrow’s Preview

Day 6: Normalization vs Denormalization Trade-offs — 1NF through BCNF, when to denormalize for analytics, write amplification vs read performance, OLTP vs OLAP modeling decisions, and how to discuss this trade-off spectrum in interviews at your target companies.