Phase 1: Foundations & Frameworks | Category: Data Modeling
Why You Need to Know Data Vault (But Probably Won’t Build One in an Interview)
Data Vault is unlikely to be the primary modeling technique tested at Meta, Netflix, Google, OpenAI, or Anthropic — those companies lean toward Kimball star schemas and modern lakehouse patterns. However, it comes up in two scenarios: (1) the interviewer asks you to compare modeling approaches and wants to see breadth of knowledge, and (2) you’re designing an enterprise data integration layer where multiple source systems feed into a central warehouse. Knowing Data Vault signals that you understand modeling beyond the basics and can reason about when different approaches fit. As a Reddit commenter with Data Vault experience noted: “My #1 Data Vault tip is: just do not unless the business has pretty good data maturity.”
The Three Modeling Philosophies
Before diving into Data Vault, understand where it sits relative to the approaches you already know:
| Topic | Details |
|---|---|
| Core idea | Build star schema marts organized by business process Build a single normalized enterprise warehouse, then derive marts Separate identity, relationships, and attributes into distinct table types |
| Entry point | Business process → dimensional model Enterprise-wide 3NF model → derived marts Business keys → hubs, links, satellites |
| Integration approach | Conformed dimensions across marts Single integrated 3NF warehouse Hubs as shared business key anchors |
| History tracking | SCD Types on dimensions Timestamps on 3NF rows Insert-only satellites (built-in SCD2) |
| Schema flexibility | Medium — adding sources may require redesign Low — changing the enterprise model is expensive High — add new satellites without touching existing tables |
| Query friendliness | High — star schemas are BI-friendly Medium — many joins Low — requires a presentation layer on top |
| Best for | Analytics, BI, well-understood query patterns Enterprise-wide single source of truth Multi-source integration, regulated industries, high-change environments |
| Time to first value | Fast Slow Medium (fast loading, slow querying without marts) |
Data Vault 2.0: The Three Building Blocks
Created by Dan Linstedt, Data Vault separates data into three distinct table types, each handling a different concern:
1. Hubs — Business Entity Identity
A hub captures the unique business keys for a core business entity. Nothing else — no descriptive attributes, no relationships.
hub_customer
_______________________________
hub_customer_hk (PK, hash key of business key)
customer_bk (business key, e.g., "CUST-12345")
load_date (when first loaded)
record_source (which source system provided this key)
Key principles:
-
One hub per core business concept (customer, product, order, employee)
-
The business key is the anchor — it never changes
-
hub_customer_hk is typically a hash (SHA-1, MD5) of the business key for fast joins
-
Hubs are insert-only — once a business key exists, it’s never updated or deleted
-
If the same customer appears in 3 source systems, they get ONE hub row (after business key resolution)
Why hash keys? In traditional Kimball, you’d use auto-incrementing surrogate keys. Data Vault uses hash keys because they can be computed independently by parallel ETL processes without sequence coordination. This enables massive parallel loading — a key advantage at scale.
2. Links — Relationships Between Entities
Links capture the many-to-many relationships between hubs. They’re the “verbs” connecting the “nouns.”
link_customer_order
_______________________________
link_customer_order_hk (PK, hash of combined business keys)
hub_customer_hk (FK → hub_customer)
hub_order_hk (FK → hub_order)
load_date
record_source
Key principles:
-
Links represent business relationships, not foreign keys
-
They are insert-only — a relationship once recorded is never deleted
-
A link can connect two or more hubs (ternary relationships are valid)
-
Hierarchical links connect a hub to itself (e.g., employee → manager)
-
Adding a new relationship between existing entities = adding a new link table, not modifying any existing table
3. Satellites — Descriptive Attributes & History
Satellites store the descriptive context (attributes) for hubs or links, along with their full change history.
sat_customer_details
_______________________________
hub_customer_hk (FK → hub_customer, part of composite PK)
load_date (part of composite PK — each change = new row)
load_end_date (when this version was superseded)
record_source
_______________________________
customer_name
email
city
membership_tier
hash_diff (hash of all attribute values for change detection)
Key principles:
-
Every attribute change inserts a new row (never update — pure SCD Type 2 by design)
-
hash_diff is a hash of all satellite attributes — used to detect whether anything actually changed (avoids inserting duplicate rows when data is reloaded but unchanged)
-
Different source systems can feed different satellites for the same hub. If CRM provides customer contact info and billing provides payment info, those are two separate satellites on hub_customer
-
This separation means adding a new data source = adding new satellites, NOT modifying existing ones
How Data Vault Fits in a Lakehouse
Per Databricks, the pattern maps cleanly to lakehouse layers:
Bronze: Raw staging from source systems (as-is)
↓
Silver: RAW VAULT — Hubs, Links, Satellites (insert-only, full history, no business rules applied)
↓ BUSINESS VAULT — Enriched satellites with derived attributes, business rules, calculated fields
↓
Gold: DATA MARTS — Star schema / dimensional models built from vault objects for BI consumption
The critical insight: Data Vault is NOT a replacement for Kimball. It’s an integration layer that sits between raw data and dimensional marts. You still build star schemas in the gold layer for BI tools. Data Vault handles the hard problem of integrating diverse sources and preserving full history; Kimball handles the presentation problem of making data queryable by analysts.
When Data Vault Shines
| Scenario | Why Data Vault Fits |
|---|---|
| Multiple disparate source systems | Each source gets its own satellites. Adding a new source doesn’t require redesigning existing tables. |
| Regulated industries | (finance, healthcare, government) Full audit trail is built-in. Every change is recorded with source and timestamp. Insert-only = immutable history. |
| Parallel development teams | Hubs, links, and satellites can be loaded independently. Team A loads customer satellites while Team B loads order satellites — no pipeline conflicts. |
| Frequent schema changes | New attributes from a source? Add a new satellite. No ALTER TABLE on existing production tables. |
| Very large-scale enterprises | Hash keys enable fully parallel ETL without sequence coordination across distributed systems. |
When Data Vault Is Overkill
| Scenario | Why Kimball/Simple Modeling Is Better |
|---|---|
| Single source system | No multi-source integration problem to solve. Star schema directly from the source is simpler. |
| Small team / startup | Data Vault’s learning curve and operational overhead aren’t justified. OBT or simple star schema gets you to value faster. |
| BI-first use case | Data Vault requires a presentation layer on top. If your primary goal is dashboards, build the star schema directly. |
| Low data maturity | Data Vault depends on well-defined business keys. If the organization can’t identify consistent business keys across systems, the methodology breaks down. |
| AI/ML feature tables | ML teams want flat, wide feature tables. The multi-table join structure of Data Vault is the opposite of what they need. |
Data Vault vs Kimball: The Interview Comparison
This is the comparison you should be able to articulate fluently (synthesized from AnalyticsCreator and DEV Community):
FactorData VaultKimball Star SchemaSource flexibilityHigh — new satellites, no redesignModerate — may need to alter dimensionsAudit trailBuilt-in (insert-only, record_source)Optional (SCDs, must design for it)Query simplicityLow — 3+ joins for basic attributesHigh — fact + dimension = doneBI tool compatibilityLow — needs presentation layerHigh — native star schema supportLearning curveHighModerateParallel ETL loadingExcellent — no dependencies between table loadsModerate — dimension must load before factsAdding new sourcesEasy — add satellitesHarder — may need to redesign conformed dimensionsHistory preservationDefault behaviorMust explicitly choose SCD type
The Hybrid Approach (What Most Enterprises Actually Do)
In practice, many organizations combine approaches. Here’s the realistic pattern:
Source Systems → Bronze (raw staging)
↓ Silver / Integration Layer ┌─────────────────────────┐
│ Option A: Data Vault │ ← Complex multi-source enterprise │ Option B: 3NF-like │ ← Moderate complexity │ Option C: Source-aligned│ ← Simple, single-source
└─────────────────────────┘
↓ Gold / Presentation Layer ┌─────────────────────────┐
│ Kimball Star Schema │ ← BI / dashboards │ OBT / Feature Tables │ ← ML / data science │ Denormalized APIs │ ← Application serving
└─────────────────────────┘
What to say in an interview: “I’d choose the integration layer modeling based on the problem complexity. For a single-source pipeline with stable schema, I’d go source-aligned in silver and star schema in gold — clean and simple. If we’re integrating 10+ source systems with conflicting keys and regulatory audit requirements, Data Vault in the silver layer gives us the flexibility and auditability we need, with star schema marts in gold for consumption. The gold layer is always Kimball — it’s what BI tools expect and what analysts can self-serve on.”
Key Data Vault 2.0 Concepts to Know
Hash keys: SHA-1 or MD5 of business key(s). Enables parallel computation without sequence coordination. Deterministic — same input always produces same hash.
Hash diff: Hash of all satellite attributes. Used in the loading process: if hash_diff matches the current row’s hash_diff, nothing changed — skip the insert. This is how you handle idempotent reloads without inserting duplicate satellite rows.
Record source: Every row tracks which source system it came from. Essential for auditing and debugging data lineage.
Load date vs business date: load_date is when the ETL processed the row. Business effective date may be in a satellite attribute. This separation is important for audit vs. business timeline analysis.
Raw Vault vs Business Vault: Raw Vault stores data as-is from sources (no business rules). Business Vault applies transformations, derived calculations, and business logic on top. This separation means you can change business rules without reloading source data.
Point-in-Time (PIT) tables: Pre-computed joins of hub + all satellites at each point in time. These are performance optimization tables that materialize the most common join pattern to avoid expensive multi-satellite joins at query time.
Bridge tables: Pre-computed link traversals for common multi-hop queries. Like PIT tables, these are query accelerators, not modeling primitives.
Interview Questions
Q1: “You’re integrating data from 5 different source systems (CRM, billing, web analytics, mobile app, support tickets) into a unified customer view. How would you model this?”
Model Answer: “This is a classic multi-source integration problem where Data Vault earns its keep. I’d create hub_customer with the enterprise business key — likely customer email or a cross-system ID. Each source system gets its own satellite: sat_customer_crm (name, address, account manager), sat_customer_billing (payment method, plan, MRR), sat_customer_web (last_visit, page_views, sessions), sat_customer_mobile (app_version, device, push_enabled), sat_customer_support (ticket_count, last_contact, satisfaction_score). Links connect customers to products, subscriptions, and interactions. The beauty is: when marketing adds a 6th source next quarter, I add sat_customer_marketing without touching anything existing. For the BI team, I’d build a dim_customer star schema view in the gold layer that joins the hub with the latest version from each satellite — they query a clean, wide dimension table and never see the vault complexity.”
Q2: “When would you NOT recommend Data Vault?”
Model Answer: “Three scenarios. First, startups or small teams — the overhead of maintaining hubs, links, and satellites isn’t justified when you have one or two source systems and a 3-person data team. A simple medallion architecture with star schema in gold is faster to build and easier to maintain. Second, pure ML/AI use cases like at OpenAI or Anthropic — data scientists need flat feature tables, not multi-join vault structures. The vault’s strength in auditability and source integration doesn’t serve the ML consumption pattern. Third, when business key identification is immature — Data Vault fundamentally depends on well-defined, consistent business keys across systems. If the organization can’t agree on what uniquely identifies a customer across systems, the hub layer breaks down. In those cases, I’d start with source-aligned silver tables and invest in business key resolution as a separate workstream before considering Data Vault.”
Think About This
You’re in a Netflix interview. The prompt: “Netflix acquires a gaming company. Now you need to integrate gaming user data (player profiles, in-game purchases, session data) with existing streaming data (viewing history, ratings, subscriptions). Both systems use different user IDs. How would you model the integration layer?”
Walk through:
-
What’s the business key for hub_user? (Email? Neither system’s internal ID. You need a cross-system identity resolution step — this is the hardest part.)
-
How would you handle users who exist in one system but not the other? (Hub captures all known business keys; satellites populate independently. A user with only gaming data has sat_user_gaming but no sat_user_streaming yet.)
-
Would you use Data Vault or a simpler approach? (This is legitimately a good Data Vault use case — two complex source systems with independent schemas, likely regulatory requirements around purchase data, and an expectation that more gaming titles/platforms will be added over time.)
-
What does the gold layer look like? (A unified dim_user that combines the latest from both satellites, plus domain-specific star schemas for gaming analytics and streaming analytics that share conformed dimensions.)
Quick Reference
-
Data Vault = Hubs (identity) + Links (relationships) + Satellites (attributes + history)
-
Insert-only by design — full audit trail, built-in SCD2, no updates or deletes
-
Hash keys enable parallel ETL without sequence coordination — critical for scale
-
Data Vault is an integration layer, not a presentation layer — you still need Kimball star schemas on top for BI consumption
-
Use Data Vault when: multiple source systems, regulatory audit requirements, frequent schema changes, parallel development teams
-
Skip Data Vault when: single source, small team, ML-first use case, immature business key definitions
Tomorrow’s Preview
Day 8: ETL vs ELT — Architecture & Trade-offs — Traditional ETL vs modern ELT with push-down compute. When each is appropriate. The tools landscape (Spark, dbt, Dataflow, Glue). How this architectural choice comes up in interviews at your target companies, and why “ELT on a lakehouse” is the dominant modern pattern.