Phase 1: Foundations & Frameworks | Category: Data Modeling

Why This Is a Senior-Level Differentiator

Every data engineer knows “normalize for OLTP, denormalize for OLAP.” But interviewers at Meta, Netflix, Google, OpenAI, and Anthropic want you to reason through the full spectrum — when to break the rules, what the real costs are, and how the decision changes across lakehouse layers. As CelerData puts it: “Normalization and denormalization are not opposing design philosophies — they are two sides of the same database engineering coin.” Your job is to pick the right point on that spectrum for a given scenario and defend it.

Normal Forms: Quick Reference

You won’t be asked to define these from scratch, but you must be able to identify violations and explain why they matter.

1NF — Atomic Values

  • Every column holds a single, indivisible value

  • No repeating groups or arrays in a cell

Violation: A phone_numbers column storing “555-1234, 555-5678”. Fix: separate table customer_phones with one row per number.

Why it matters in practice: Semi-structured data (JSON columns in Postgres, VARIANT in Snowflake) intentionally violates 1NF for flexibility. In interviews, acknowledge the trade-off: “Storing a JSON blob violates 1NF but gives us schema flexibility at the cost of queryability and indexing.”

2NF — No Partial Dependencies

  • Must be in 1NF

  • Every non-key attribute depends on the entire composite key, not just part of it

Violation: Table with composite PK (order_id, product_id) that includes customer_name. Customer name depends only on order_id, not on the full key. Fix: move customer_name to an orders table keyed on order_id.

Why it matters: Partial dependencies cause update anomalies — changing a customer’s name requires updating every row for every product they ordered.

3NF — No Transitive Dependencies

  • Must be in 2NF

  • No non-key attribute depends on another non-key attribute

Violation: orders table with customer_id, customer_city, customer_state. City and state depend on customer_id, not on the order. Fix: move them to dim_customer.

The classic mnemonic: “Every non-key attribute must provide a fact about the key, the whole key, and nothing but the key — so help me Codd.”

BCNF — Stricter 3NF

  • For every functional dependency A → B, A must be a superkey

  • Resolves edge cases where 3NF still allows some anomalies with overlapping candidate keys

When it matters in interviews: Rarely asked directly, but if you’re designing a highly normalized OLTP schema and the interviewer probes, saying “I’d take this to BCNF to eliminate the remaining dependency between instructor and department” signals depth.

The Core Trade-off Matrix

TopicDetails
RedundancyNormalized: minimal — single source of truth. Denormalized: intentional duplication across tables
Write performanceNormalized: fast — update one place. Denormalized: slower — same change propagates to many rows
Read performanceNormalized: slower — multi-table joins. Denormalized: fast — fewer/no joins, pre-computed values
Data integrityNormalized: strong — constraints enforced structurally. Denormalized: weaker — consistency maintained by ETL
Query complexityNormalized: high — many JOINs for analytics. Denormalized: low — simple SELECTs, BI-tool friendly
StorageNormalized: lower. Denormalized: higher (2–5× typical)
Schema evolutionNormalized: easier — each table models one concept. Denormalized: harder — changes cascade across wide tables
Best forNormalized: OLTP, source systems, Silver layer. Denormalized: OLAP, dashboards, Gold layer, ML features

When to Normalize

1. Source/operational systems (OLTP)

  • High write frequency, transactional integrity critical

  • Updates must propagate instantly without anomalies

  • Example: Amazon’s order management system — an address change should update in one place, not across millions of order rows

2. Silver layer of a lakehouse

  • Cleaned, deduplicated, source-aligned tables

  • 3NF-like structure preserves flexibility for multiple downstream use cases

  • You haven’t committed to a specific analytical model yet

3. Large, volatile dimensions

  • As discussed on Day 5 (SCD), if a dimension has millions of rows and changes frequently, keeping it normalized reduces write amplification

  • Example: A product catalog with 10M SKUs where prices change daily — normalize product_attributes from product_pricing

4. When consistency outweighs read speed

  • Financial systems, billing, compliance

  • “One row of truth” prevents the $1M-bug where two copies of a customer’s billing address disagree

When to Denormalize

1. Gold/serving layer for analytics

  • Star schema with denormalized dimensions — this is Kimball’s entire philosophy

  • BI tools (Looker, Tableau, Power BI) perform best with pre-joined, wide tables

  • Modern columnar engines (BigQuery, Snowflake, Redshift) are optimized for scanning wide denormalized tables

2. Read-heavy workloads with predictable query patterns

  • Dashboards that run the same 20 queries thousands of times per day

  • Pre-aggregated summary tables for known KPIs

  • According to Pure Storage benchmarks, denormalized structures show 10-15x performance improvements for complex analytical queries

3. ML feature tables

  • Data scientists need a single flat table with all features for a given entity

  • One Big Table (OBT) pattern: one row per user with 200+ feature columns

  • No joins during model training = faster iteration

4. Real-time serving layers

  • APIs serving < 10ms responses can’t afford multi-table joins

  • Pre-join at write time, serve from a denormalized store (DynamoDB, Redis, Cassandra)

  • Stream processing handles denormalized events 2-3x faster than normalized data requiring cross-stream joins

5. Event/streaming data

  • Events are inherently denormalized — each event carries its full context (user_id, user_name, device, geo, timestamp)

  • Normalizing clickstream events after the fact adds join overhead at query time with no write-performance benefit (events are append-only)

The Write Amplification Problem

This is the core cost of denormalization that you must articulate in interviews.

Scenario: Customer dimension is denormalized into the fact table. Customer “Alice” has her city in every order row.

  • Normalized: Alice moves → update 1 row in dim_customer

  • Denormalized: Alice moves → update 500,000 order rows that contain her city

At Amazon scale: 300M customers × 100 orders each = 30B rows. If 1% of customers change an attribute monthly, that’s 3M customers × 100 rows = 300M row updates per month — just from dimension changes.

How to discuss this: “The trade-off with denormalization is write amplification. If I embed customer city in the fact table and 1% of customers move monthly, I’m rewriting hundreds of millions of rows. In a lakehouse with immutable files (Parquet on S3), that means rewriting entire partitions. I’d only accept this cost if the read-side query simplification justifies it — and I’d mitigate it by partitioning on date so only recent partitions need rewriting.”

Modern Nuances: Where the Textbook Rules Bend

Columnar storage changes the join calculus

In row-oriented databases (Postgres, MySQL), joins are expensive because they touch many pages. In columnar engines (BigQuery, Snowflake, ClickHouse), joins on integer keys with sorted/clustered data are highly optimized. This means the read-performance gap between normalized and denormalized is smaller than textbooks suggest.

Senior-level statement: “In BigQuery, a star schema join between a 10B-row fact table and a 1M-row dimension on an integer key is fast because the dimension fits in memory and the engine broadcast-joins it. I don’t need to denormalize just to avoid a join — I only denormalize when I need sub-100ms latency or when the BI tool generates cleaner SQL against a flat table.”

Lakehouse layers allow progressive denormalization

Bronze: Raw/1NF at best → append-only, schema-on-read
Silver: 3NF-like → cleaned, source-aligned, normalized
Gold: Star schema/OBT → denormalized for consumption

You don’t choose one or the other. You apply the right level at each layer. This is the answer interviewers are looking for.

Materialized views as “selective denormalization”

Instead of physically denormalizing tables, you can use materialized views to pre-compute join results. The normalized base tables remain the source of truth; the materialized view is a denormalized read cache.

Trade-off: Materialized views need refresh (incremental or full). Stale materialized views = stale dashboards. But you get the read performance of denormalization with the write integrity of normalization.

The Spectrum in Practice

Here’s how to think about where your target companies sit:

TopicDetails
OLTP source(Amazon order system) — 3NF / BCNF. Write-optimized, transactional integrity
CDC stream(Kafka events) Denormalized — events carry full context, no joins in flight
Bronze lakeRaw / ~1NF — as-is from source, preserve fidelity
Silver layer3NF-like — cleaned, source-aligned, maximum flexibility
Gold / data martStar schema — read-optimized, BI-friendly, dimensional model
OBT / feature tableFully denormalized — ML consumption, one flat table per entity
Serving API store(DynamoDB) Fully denormalized — sub-10ms reads, no joins possible
Real-time OLAP(Druid / ClickHouse) Partially denormalized — pre-join dimensions into facts for fast aggregation

Interview Questions

Q1: “You’re designing the data platform for a new product at Meta. The product team wants real-time dashboards AND the data science team wants flexible ad-hoc analysis. How do you approach normalization?”

Model Answer: “I’d use layered modeling. The silver layer stays normalized — 3NF-like tables aligned to source entities (users, actions, content). This gives the data science team maximum flexibility for ad-hoc joins and exploration. For the product dashboards, I’d build a gold layer with star schema models — denormalized fact tables with pre-joined dimensions, partitioned by date, clustered on the primary filter dimensions. For any dashboard requiring sub-second latency, I’d push further and create pre-aggregated summary tables or pipe into a real-time OLAP store like Druid. The key principle: normalize for flexibility, denormalize for performance — and let the consumption pattern dictate which layer you serve from. The normalized silver layer is the single source of truth; the gold layer is a read-optimized projection.”

Q2: “When would you denormalize even in an OLTP system?”

Model Answer: “Three cases. First, caching a computed value to avoid repeated aggregation — like storing total_order_amount on the order header instead of recalculating from line items on every read. The trade-off is maintaining consistency on insert/update/delete of line items, usually via triggers or application logic. Second, embedding a frequently accessed attribute to avoid a join in a hot path — like storing customer_name directly on the order row for the order details API that serves millions of requests per second. Third, in NoSQL stores like DynamoDB where joins simply don’t exist — you must denormalize at write time because there’s no join at read time. In all cases, I’d be explicit about the consistency mechanism: application-layer enforcement, eventual consistency via async update, or transactional guarantee with triggers.”

Think About This

You’re in a Google interview. The prompt: “We have a normalized 3NF schema for our ad auction system that processes 10 million auctions per second. Downstream, we need analytics on auction outcomes by advertiser, keyword, and geography. How would you design the analytics layer?”

Walk through:

  1. Would you query the 3NF tables directly for analytics? (No — joins at 10M events/sec would be prohibitive for repeated dashboard queries)

  2. Where does denormalization happen — in the streaming pipeline or in the warehouse? (Both — partially denormalize in the streaming ETL, fully denormalize in the gold warehouse layer)

  3. How would you handle the fact that advertiser_name or keyword_category might change? (SCD2 in the dimension, join at ETL time to embed the correct version, avoid re-denormalizing historical data)

  4. What’s the cost trade-off between storing denormalized data in BigQuery vs. computing it on-the-fly? (At 10M events/sec × 86,400 sec = 864B events/day — pre-computed denormalized gold tables are far cheaper than running massive joins for every dashboard refresh)

The key insight: at Google scale, denormalization isn’t optional for the analytics layer — it’s a necessity. But the 3NF source system remains critical for transactional correctness in the auction itself. Both exist, serving different purposes.

Quick Reference

  • Normalize for writes, denormalize for reads — the fundamental trade-off, but modern columnar engines have narrowed the read-side gap

  • Normal forms in 10 seconds: 1NF = atomic values, 2NF = no partial dependencies, 3NF = no transitive dependencies, BCNF = every determinant is a superkey

  • Write amplification is the real cost of denormalization — quantify it in interviews (“1% of customers changing monthly means X million row rewrites”)

  • Lakehouse layers solve the dilemma — 3NF in silver for flexibility, star schema in gold for performance, OBT for ML consumption

  • Materialized views are “selective denormalization” — normalized base tables + denormalized read cache with refresh

Tomorrow’s Preview

Day 7: Data Vault 2.0 & Advanced Modeling — Hubs, links, and satellites. Why Data Vault exists (auditability, agility, parallel loading). How it compares to Kimball and Inmon. When interviewers at your target companies might expect you to know this, and when it’s overkill.