Phase 1: Foundations | Category: System Design Practice

The Prompt

“Design the data warehouse for an e-commerce platform. The platform processes millions of orders per day across products, sellers, and customers globally. Business teams need dashboards for revenue, inventory, returns, and customer lifetime value. Data scientists need training datasets. Handle SCD, multiple fact types, and scalable query performance.”

Step 1: Requirements (5 min)

Business processes to model (Kimball step 1):

  1. Order transactions (primary)

  2. Product inventory (snapshot)

  3. Returns and refunds

  4. Customer lifetime value (accumulating)

  5. Seller performance

Key consumers and access patterns:

  • Finance: daily revenue by geography, product category, payment method

  • Merchandising: top/bottom products, inventory levels, sell-through rates

  • Customer success: LTV, retention, return rates by segment

  • Data science: training data for recommendation and pricing models

  • Exec: high-level KPI dashboards (DAU, GMV, conversion rate)

Non-functional:

Scale:      5M orders/day, 50M order line items/day, ~500 GB/day new data
Latency:
Finance dashboard: data by 8 AM. Real-time GMV: 5-min freshness
Retention:  7 years (financial compliance)
Availability: 99.9% for analytics reads

Step 2: Scale Estimation (3 min)

50M order line items/day × 500 bytes = 25 GB/day fact table additions  Parquet + Zstd (5x): ~5 GB/day compressed
Annual fact table: 5 GB × 365 = ~1.8 TB/year  7-year retention: ~12.5 TB — very manageable for any modern warehouse
Serving:     200 analysts × 20 queries/day = 4,000 queries/day = ~0.05 QPS average    Peak 9AM-11AM: ~200 concurrent queries → moderate concurrency    → BigQuery or Snowflake handles this easily

Architectural implication: At 500 GB/day, this is NOT a streaming problem. A well-designed batch pipeline with hourly refreshes and a nightly full rebuild of gold tables is completely sufficient.

Step 3: Data Model Design

The Core Business Processes → Fact Tables

Fact 1: fact_order_line (Transaction Fact)

Grain: one row per order line item (one product in one order)  fact_order_line

_______________________________

order_line_id (degenerate dim)
order_id (degenerate dim)
customer_sk → dim_customer (SCD2)
product_sk → dim_product (SCD2)
seller_sk → dim_seller
store_sk → dim_store (online/marketplace/physical)
order_date_id → dim_date
ship_date_id → dim_date (nullable until shipped)
geo_sk → dim_geography
payment_method_sk → dim_payment_method

_______________________________

quantity (additive)
unit_price (semi-additive)
discount_amount (additive)
line_total (additive) -- quantity × unit_price - discount
gross_margin (additive) -- line_total - unit_cost
is_returned (flag)

Why this grain?: One row per line item (not per order) supports product-level analysis — top products, category performance, margin by SKU. You can always roll up to order level with GROUP BY order_id.

Fact 2: fact_daily_inventory (Periodic Snapshot)

Grain: one row per product × warehouse × date  fact_daily_inventory

_______________________________

snapshot_date_id → dim_date
product_sk → dim_product
warehouse_sk → dim_warehouse

_______________________________

quantity_on_hand (semi-additive — don't SUM across dates)
quantity_on_order (semi-additive)
quantity_reserved (semi-additive)
days_of_supply (semi-additive)
reorder_flag (boolean)

Why periodic snapshot? “What was inventory level on March 15?” is a point-in-time question that a transaction fact can’t answer efficiently. The snapshot materializes the answer.

Fact 3: fact_order_fulfillment (Accumulating Snapshot)

Grain: one row per order (lifecycle tracking)  fact_order_fulfillment

_______________________________

order_id
customer_sk → dim_customer

_______________________________

order_placed_date_id → dim_date
payment_confirmed_date_id → dim_date (NULL until confirmed)
warehouse_picked_date_id → dim_date (NULL until picked)
shipped_date_id → dim_date (NULL until shipped)
delivered_date_id → dim_date (NULL until delivered)
returned_date_id → dim_date (NULL unless returned)

_______________________________

days_to_confirm (lag measure)
days_to_ship (lag measure)
days_to_deliver (lag measure)
order_total (additive)
current_status (degenerate dim: pending/shipped/delivered/returned)

Why accumulating snapshot?: Operations and logistics need “where are orders stuck?” analysis. “What % of orders ship within 2 days?” This is a funnel/milestone question that requires all milestone dates in one row.

Dimension Tables

dim_customer (SCD Type 2 on key analytical attributes)

dim_customer

_______________________________

customer_sk (PK, surrogate)
customer_id (natural key)
first_name, last_name
email_hash (hashed, not raw PII)
city, state, country
registration_date (SCD0 — never changes)
customer_segment (SCD2 — changes as customer moves between segments)
is_prime_member (SCD2 — changes on subscribe/cancel)
valid_from, valid_to, is_current

dim_product (SCD Type 2 on price, category)

dim_product

_______________________________

product_sk (PK, surrogate)
product_id (natural key — SKU)
product_name (SCD1 — minor name fixes OK)
brand
category_l1, category_l2, category_l3 (hierarchy embedded — star schema)
unit_cost (SCD2 — cost changes affect margin calculations)
list_price (SCD2 — must know historical price for accurate margin)
is_active (SCD1)
valid_from, valid_to, is_current

dim_date (static, 10+ years pre-generated)

dim_date

_______________________________

date_id (INT, YYYYMMDD format — fast join)
full_date (DATE)
day_of_week, is_weekend, is_holiday
week_number, month_name, quarter, year
fiscal_quarter, fiscal_year
days_before_holiday (useful for demand forecasting)

dim_geography

dim_geography

_______________________________

geo_sk
city, state, country, region, timezone
latitude, longitude (for distance calculations)
is_domestic

Step 4: Full Warehouse Architecture

Source systems
    Order Management System (PostgreSQL/MySQL)
    Inventory System
    Product Catalog
    Customer Platform
    ↓ CDC (Debezium) + batch extracts
Bronze Layer (S3 + Iceberg)
    raw_orders, raw_order_lines, raw_products; raw_customers, raw_inventory
    Append-only, schema-on-read, full history
    ↓ Hourly Spark / dbt
Silver Layer (S3 + Iceberg)
    Cleaned, typed, deduped, validated; source-aligned (orders, customers, products); SCD2 on dimension tables
    ↓ Nightly dbt (3 AM run, data by 6 AM)
Gold Layer (BigQuery / Snowflake)
    fact_order_line (partitioned by order_date), fact_daily_inventory (partitioned by snapshot_date), fact_order_fulfillment (partitioned by order_placed_date)
    dim_customer (SCD2), dim_product (SCD2), dim_date, dim_geography (static)
    ↓ Serving
BI (Looker, Tableau) → Gold star schema; ad-hoc analysts → Gold + Silver; data scientists → Silver (Spark); real-time GMV → 5-min CDC stream → ClickHouse

Step 5: Pipeline Design

SCD2 processing in the silver layer (daily Airflow DAG):

1. Load raw dim_customer changes from bronze
2. Compare against current silver.dim_customer (is_current = true)
3. For changed records:     a. UPDATE old row: is_current = false, valid_to = change_date     b. INSERT new row: is_current = true, valid_from = change_date, valid_to = '9999-12-31'
4. For new records: INSERT with valid_from = today

Fact table load (daily, incremental):

-- Idempotent: overwrite today's partition
INSERT OVERWRITE gold.fact_order_line
PARTITION (order_date = '2026-04-09')
SELECT
    ol.order_line_id,
    ol.order_id,
    c.customer_sk,
    -- join to current dim at order time
    p.product_sk,
    d.date_id as order_date_id,
    ol.quantity,
    ol.unit_price,
    ol.discount_amount,
    ol.quantity * ol.unit_price - ol.discount_amount as line_total,
    ol.quantity * (ol.unit_price - p.unit_cost) as gross_margin,
    CASE WHEN r.order_line_id IS NOT NULL THEN 1 ELSE 0 END as is_returned
FROM silver.order_lines ol
-- Point-in-time join: get the product price/cost that was current at order time
JOIN gold.dim_product p
  ON ol.product_id = p.product_id
  AND ol.order_placed_at BETWEEN p.valid_from AND p.valid_to
JOIN gold.dim_customer c
  ON ol.customer_id = c.customer_id
  AND ol.order_placed_at BETWEEN c.valid_from AND c.valid_to
JOIN gold.dim_date d ON DATE(ol.order_placed_at) = d.full_date
LEFT JOIN silver.returns r ON ol.order_line_id = r.order_line_id
WHERE DATE(ol.order_placed_at) = '2026-04-09'

The point-in-time join is the most important SQL pattern here: when loading historical facts, always join to the dimension version that was current AT THE TIME OF THE FACT, not the current version. This is why SCD2 exists.

Step 6: Key Trade-offs

Trade-off 1: How to handle returns?

Option A: Add is_returned flag to fact_order_line (our choice). Simple. One fact table serves both orders and return analysis.

Option B: Separate fact_returns table. Better for complex return analysis (partial returns, return reasons).

Choice depends on return complexity. For most e-commerce, Option A is sufficient. Mention both.

Trade-off 2: LTV calculation

LTV is a derived metric — sum of all orders per customer. Options:

  • Compute on-demand from fact_order_line (always fresh, but costly if queried frequently)

  • Materialize as a dim_customer_ltv snapshot rebuilt nightly (fast reads, 24-hour stale)

  • Stream-updated using CDC → real-time LTV in a serving store (complex, justified only if LTV drives real-time decisions)

For most use cases, nightly materialization is the right answer.

Trade-off 3: Real-time GMV vs batch

The daily batch gives accurate GMV by 8 AM. Product teams want to see GMV updating every 5 minutes for live monitoring. Solution: dual path — batch (accurate, for finance) + CDC → 5-min micro-batch into a ClickHouse GMV table (fast, slightly approximate due to lag). Dashboard clearly labels each source.

Self-Assessment Questions

  1. Why does fact_order_line use a point-in-time join to dim_product?

  2. Why do we need both fact_order_line AND fact_daily_inventory? Could we derive inventory from orders alone?

  3. What changes in the data model when a product can be fulfilled from multiple warehouses in a single order?

  4. What’s the grain of fact_order_fulfillment and why is it different from fact_order_line?

Quick Reference — Both Designs

Day 25 (Event Logging):

  • Architecture: SDK → Gateway (PII mask) → Kafka → Flink (dedup+enrich) → ClickHouse + S3/Iceberg

  • Three-layer dedup: client-generated event_id → Flink stateful → batch SQL ROW_NUMBER()

  • Watermark = 2 hours for mobile late data → side output → hourly batch correction

  • Schema evolution: BACKWARD_TRANSITIVE in Schema Registry + Iceberg mergeSchema

Day 26 (E-Commerce Warehouse):

  • Three fact table types in action: Transaction (fact_order_line) + Periodic Snapshot (fact_daily_inventory) + Accumulating Snapshot (fact_order_fulfillment)

  • SCD2 on customer segment, product price, product cost — because historical margin analysis depends on what price/cost was at time of order

  • Point-in-time join is mandatory — always join to the dimension version current at the fact’s timestamp

  • Dual path: CDC → ClickHouse for real-time GMV + batch → warehouse for accurate financial reporting

Tomorrow’s Preview

Day 27: API Design for Data Systems — REST vs gRPC vs GraphQL for data access layers. Rate limiting, pagination, versioning. How data engineers design APIs for self-serve analytics, data products, and feature serving — a topic that increasingly shows up in senior DE interviews at Meta, OpenAI, and Anthropic.