Phase 1: Foundations | Category: System Design Practice
The Prompt
“Design the data warehouse for an e-commerce platform. The platform processes millions of orders per day across products, sellers, and customers globally. Business teams need dashboards for revenue, inventory, returns, and customer lifetime value. Data scientists need training datasets. Handle SCD, multiple fact types, and scalable query performance.”
Step 1: Requirements (5 min)
Business processes to model (Kimball step 1):
-
Order transactions (primary)
-
Product inventory (snapshot)
-
Returns and refunds
-
Customer lifetime value (accumulating)
-
Seller performance
Key consumers and access patterns:
-
Finance: daily revenue by geography, product category, payment method
-
Merchandising: top/bottom products, inventory levels, sell-through rates
-
Customer success: LTV, retention, return rates by segment
-
Data science: training data for recommendation and pricing models
-
Exec: high-level KPI dashboards (DAU, GMV, conversion rate)
Non-functional:
Scale: 5M orders/day, 50M order line items/day, ~500 GB/day new data
Latency:
Finance dashboard: data by 8 AM. Real-time GMV: 5-min freshness
Retention: 7 years (financial compliance)
Availability: 99.9% for analytics reads
Step 2: Scale Estimation (3 min)
50M order line items/day × 500 bytes = 25 GB/day fact table additions Parquet + Zstd (5x): ~5 GB/day compressed
Annual fact table: 5 GB × 365 = ~1.8 TB/year 7-year retention: ~12.5 TB — very manageable for any modern warehouse
Serving: 200 analysts × 20 queries/day = 4,000 queries/day = ~0.05 QPS average Peak 9AM-11AM: ~200 concurrent queries → moderate concurrency → BigQuery or Snowflake handles this easily
Architectural implication: At 500 GB/day, this is NOT a streaming problem. A well-designed batch pipeline with hourly refreshes and a nightly full rebuild of gold tables is completely sufficient.
Step 3: Data Model Design
The Core Business Processes → Fact Tables
Fact 1: fact_order_line (Transaction Fact)
Grain: one row per order line item (one product in one order) fact_order_line
_______________________________
order_line_id (degenerate dim)
order_id (degenerate dim)
customer_sk → dim_customer (SCD2)
product_sk → dim_product (SCD2)
seller_sk → dim_seller
store_sk → dim_store (online/marketplace/physical)
order_date_id → dim_date
ship_date_id → dim_date (nullable until shipped)
geo_sk → dim_geography
payment_method_sk → dim_payment_method
_______________________________
quantity (additive)
unit_price (semi-additive)
discount_amount (additive)
line_total (additive) -- quantity × unit_price - discount
gross_margin (additive) -- line_total - unit_cost
is_returned (flag)
Why this grain?: One row per line item (not per order) supports product-level analysis — top products, category performance, margin by SKU. You can always roll up to order level with GROUP BY order_id.
Fact 2: fact_daily_inventory (Periodic Snapshot)
Grain: one row per product × warehouse × date fact_daily_inventory
_______________________________
snapshot_date_id → dim_date
product_sk → dim_product
warehouse_sk → dim_warehouse
_______________________________
quantity_on_hand (semi-additive — don't SUM across dates)
quantity_on_order (semi-additive)
quantity_reserved (semi-additive)
days_of_supply (semi-additive)
reorder_flag (boolean)
Why periodic snapshot? “What was inventory level on March 15?” is a point-in-time question that a transaction fact can’t answer efficiently. The snapshot materializes the answer.
Fact 3: fact_order_fulfillment (Accumulating Snapshot)
Grain: one row per order (lifecycle tracking) fact_order_fulfillment
_______________________________
order_id
customer_sk → dim_customer
_______________________________
order_placed_date_id → dim_date
payment_confirmed_date_id → dim_date (NULL until confirmed)
warehouse_picked_date_id → dim_date (NULL until picked)
shipped_date_id → dim_date (NULL until shipped)
delivered_date_id → dim_date (NULL until delivered)
returned_date_id → dim_date (NULL unless returned)
_______________________________
days_to_confirm (lag measure)
days_to_ship (lag measure)
days_to_deliver (lag measure)
order_total (additive)
current_status (degenerate dim: pending/shipped/delivered/returned)
Why accumulating snapshot?: Operations and logistics need “where are orders stuck?” analysis. “What % of orders ship within 2 days?” This is a funnel/milestone question that requires all milestone dates in one row.
Dimension Tables
dim_customer (SCD Type 2 on key analytical attributes)
dim_customer
_______________________________
customer_sk (PK, surrogate)
customer_id (natural key)
first_name, last_name
email_hash (hashed, not raw PII)
city, state, country
registration_date (SCD0 — never changes)
customer_segment (SCD2 — changes as customer moves between segments)
is_prime_member (SCD2 — changes on subscribe/cancel)
valid_from, valid_to, is_current
dim_product (SCD Type 2 on price, category)
dim_product
_______________________________
product_sk (PK, surrogate)
product_id (natural key — SKU)
product_name (SCD1 — minor name fixes OK)
brand
category_l1, category_l2, category_l3 (hierarchy embedded — star schema)
unit_cost (SCD2 — cost changes affect margin calculations)
list_price (SCD2 — must know historical price for accurate margin)
is_active (SCD1)
valid_from, valid_to, is_current
dim_date (static, 10+ years pre-generated)
dim_date
_______________________________
date_id (INT, YYYYMMDD format — fast join)
full_date (DATE)
day_of_week, is_weekend, is_holiday
week_number, month_name, quarter, year
fiscal_quarter, fiscal_year
days_before_holiday (useful for demand forecasting)
dim_geography
dim_geography
_______________________________
geo_sk
city, state, country, region, timezone
latitude, longitude (for distance calculations)
is_domestic
Step 4: Full Warehouse Architecture
Source systems
Order Management System (PostgreSQL/MySQL)
Inventory System
Product Catalog
Customer Platform
↓ CDC (Debezium) + batch extracts
Bronze Layer (S3 + Iceberg)
raw_orders, raw_order_lines, raw_products; raw_customers, raw_inventory
Append-only, schema-on-read, full history
↓ Hourly Spark / dbt
Silver Layer (S3 + Iceberg)
Cleaned, typed, deduped, validated; source-aligned (orders, customers, products); SCD2 on dimension tables
↓ Nightly dbt (3 AM run, data by 6 AM)
Gold Layer (BigQuery / Snowflake)
fact_order_line (partitioned by order_date), fact_daily_inventory (partitioned by snapshot_date), fact_order_fulfillment (partitioned by order_placed_date)
dim_customer (SCD2), dim_product (SCD2), dim_date, dim_geography (static)
↓ Serving
BI (Looker, Tableau) → Gold star schema; ad-hoc analysts → Gold + Silver; data scientists → Silver (Spark); real-time GMV → 5-min CDC stream → ClickHouse
Step 5: Pipeline Design
SCD2 processing in the silver layer (daily Airflow DAG):
1. Load raw dim_customer changes from bronze
2. Compare against current silver.dim_customer (is_current = true)
3. For changed records: a. UPDATE old row: is_current = false, valid_to = change_date b. INSERT new row: is_current = true, valid_from = change_date, valid_to = '9999-12-31'
4. For new records: INSERT with valid_from = today
Fact table load (daily, incremental):
-- Idempotent: overwrite today's partition
INSERT OVERWRITE gold.fact_order_line
PARTITION (order_date = '2026-04-09')
SELECT
ol.order_line_id,
ol.order_id,
c.customer_sk,
-- join to current dim at order time
p.product_sk,
d.date_id as order_date_id,
ol.quantity,
ol.unit_price,
ol.discount_amount,
ol.quantity * ol.unit_price - ol.discount_amount as line_total,
ol.quantity * (ol.unit_price - p.unit_cost) as gross_margin,
CASE WHEN r.order_line_id IS NOT NULL THEN 1 ELSE 0 END as is_returned
FROM silver.order_lines ol
-- Point-in-time join: get the product price/cost that was current at order time
JOIN gold.dim_product p
ON ol.product_id = p.product_id
AND ol.order_placed_at BETWEEN p.valid_from AND p.valid_to
JOIN gold.dim_customer c
ON ol.customer_id = c.customer_id
AND ol.order_placed_at BETWEEN c.valid_from AND c.valid_to
JOIN gold.dim_date d ON DATE(ol.order_placed_at) = d.full_date
LEFT JOIN silver.returns r ON ol.order_line_id = r.order_line_id
WHERE DATE(ol.order_placed_at) = '2026-04-09'
The point-in-time join is the most important SQL pattern here: when loading historical facts, always join to the dimension version that was current AT THE TIME OF THE FACT, not the current version. This is why SCD2 exists.
Step 6: Key Trade-offs
Trade-off 1: How to handle returns?
Option A: Add is_returned flag to fact_order_line (our choice). Simple. One fact table serves both orders and return analysis.
Option B: Separate fact_returns table. Better for complex return analysis (partial returns, return reasons).
Choice depends on return complexity. For most e-commerce, Option A is sufficient. Mention both.
Trade-off 2: LTV calculation
LTV is a derived metric — sum of all orders per customer. Options:
-
Compute on-demand from fact_order_line (always fresh, but costly if queried frequently)
-
Materialize as a dim_customer_ltv snapshot rebuilt nightly (fast reads, 24-hour stale)
-
Stream-updated using CDC → real-time LTV in a serving store (complex, justified only if LTV drives real-time decisions)
For most use cases, nightly materialization is the right answer.
Trade-off 3: Real-time GMV vs batch
The daily batch gives accurate GMV by 8 AM. Product teams want to see GMV updating every 5 minutes for live monitoring. Solution: dual path — batch (accurate, for finance) + CDC → 5-min micro-batch into a ClickHouse GMV table (fast, slightly approximate due to lag). Dashboard clearly labels each source.
Self-Assessment Questions
-
Why does fact_order_line use a point-in-time join to dim_product?
-
Why do we need both fact_order_line AND fact_daily_inventory? Could we derive inventory from orders alone?
-
What changes in the data model when a product can be fulfilled from multiple warehouses in a single order?
-
What’s the grain of fact_order_fulfillment and why is it different from fact_order_line?
Quick Reference — Both Designs
Day 25 (Event Logging):
-
Architecture: SDK → Gateway (PII mask) → Kafka → Flink (dedup+enrich) → ClickHouse + S3/Iceberg
-
Three-layer dedup: client-generated event_id → Flink stateful → batch SQL ROW_NUMBER()
-
Watermark = 2 hours for mobile late data → side output → hourly batch correction
-
Schema evolution: BACKWARD_TRANSITIVE in Schema Registry + Iceberg mergeSchema
Day 26 (E-Commerce Warehouse):
-
Three fact table types in action: Transaction (fact_order_line) + Periodic Snapshot (fact_daily_inventory) + Accumulating Snapshot (fact_order_fulfillment)
-
SCD2 on customer segment, product price, product cost — because historical margin analysis depends on what price/cost was at time of order
-
Point-in-time join is mandatory — always join to the dimension version current at the fact’s timestamp
-
Dual path: CDC → ClickHouse for real-time GMV + batch → warehouse for accurate financial reporting
Tomorrow’s Preview
Day 27: API Design for Data Systems — REST vs gRPC vs GraphQL for data access layers. Rate limiting, pagination, versioning. How data engineers design APIs for self-serve analytics, data products, and feature serving — a topic that increasingly shows up in senior DE interviews at Meta, OpenAI, and Anthropic.