Day 20 — Data lineage and cataloging

Phase 1: Foundations & Frameworks | Category: Data Quality & Governance

Why Lineage Is a Senior-Level Competency

You just spent yesterday on data quality checks. Lineage is the infrastructure that makes quality actionable at scale. As Monte Carlo puts it: “When a dashboard shows incorrect numbers or a pipeline fails, lineage acts as your debugging roadmap. Without lineage, engineers waste time playing detective, checking log files across multiple platforms. With lineage, you see the entire data flow at a glance — trace backward through the transformations and find exactly where things went wrong.” At your target companies, where hundreds or thousands of pipelines run in production, lineage is the difference between a 5-minute incident resolution and a 5-hour one.

What Data Lineage Is

Data lineage tracks the origin, movement, and transformation of data throughout its lifecycle. It answers three questions:

Where did this data come from? (upstream dependencies)
What transformations were applied? (processing history)
Where does this data go? (downstream consumers)

Source DB (orders table)
    ↓ [CDC via Debezium]
Kafka topic (orders_events)
    ↓ [Flink sessionization job]
Silver: fact_order_sessions (Iceberg)
    ↓ [dbt model]
Gold: daily_revenue_summary (BigQuery)
    ↓ [BI query]
VP Revenue Dashboard (Looker)

Lineage maps this entire graph — table to table, column to column, job to job.

Table-Level vs Column-Level Lineage

Table-Level Lineage

Tracks which tables are read and written by each job. The standard starting point.

job: daily_revenue_transform
  inputs: [silver.fact_orders, silver.dim_customer, silver.dim_date]
  outputs: [gold.daily_revenue_summary]

What it enables:

Impact analysis: If I change silver.fact_orders, what downstream tables are affected?
Root cause traversal: The dashboard broke. Which pipeline ran upstream?
Dependency mapping: Before deprecating a table, know who consumes it
Pipeline visualization: See the full DAG from source to consumption

Limitation: Doesn’t tell you which specific columns are involved. If revenue is wrong, you know the upstream table, but not which column within it.
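A minimal sketch of how table-level lineage supports impact analysis: the job record above is turned into a graph and walked downstream. The second job (exec_dashboard_refresh) and its output name are hypothetical, added only so the traversal has more than one hop.

```python
from collections import defaultdict, deque

# Job I/O records in the shape shown above. The second job is hypothetical.
JOBS = [
    {
        "job": "daily_revenue_transform",
        "inputs": ["silver.fact_orders", "silver.dim_customer", "silver.dim_date"],
        "outputs": ["gold.daily_revenue_summary"],
    },
    {
        "job": "exec_dashboard_refresh",
        "inputs": ["gold.daily_revenue_summary"],
        "outputs": ["looker.vp_revenue_dashboard"],
    },
]


def build_downstream_index(jobs):
    """Map each input table to the set of tables written by jobs that read it."""
    downstream = defaultdict(set)
    for job in jobs:
        for src in job["inputs"]:
            downstream[src].update(job["outputs"])
    return downstream


def impacted_tables(table, jobs):
    """Breadth-first walk: every table reachable downstream of `table`."""
    index = build_downstream_index(jobs)
    seen, queue = set(), deque([table])
    while queue:
        current = queue.popleft()
        for child in index.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


print(impacted_tables("silver.fact_orders", JOBS))
```

Reversing the edge direction in build_downstream_index gives the root cause traversal from the same records.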

Column-Level Lineage

Tracks which source columns produce which destination columns through transformations (OpenLineage, OneUptime):

gold.daily_revenue_summary.total_revenue
  ← AGGREGATION (SUM)
  ← silver.fact_orders.line_amount
  ← TRANSFORMATION (amount * (1 - discount_rate))
  ← bronze.raw_orders.amount
  ← bronze.raw_orders.discount_rate

What it enables:

Precise debugging: total_revenue is wrong → trace to line_amount → trace to discount_rate → find the bug
Regulatory compliance: Which columns contain PII? Which downstream reports include them?
GDPR right-to-deletion: Delete user_id from source → automatically know every derived column downstream that includes it
Schema change impact: Renaming amount to order_amount → automatically know all downstream column expressions that reference it

The complexity: Column-level lineage requires analyzing query/transformation logic, not just I/O metadata. It’s harder to collect and maintain — typically requires integration with query parsers (SQL AST analysis for BigQuery/Snowflake jobs, Spark logical plan traversal for Spark jobs).
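To make the debugging path concrete, here is a small sketch that stores the column-level edges from the example above and walks them upstream. The dictionary layout is an assumption chosen for illustration. In practice a query parser or a lineage backend would produce these edges.

```python
# Column-level edges: destination column -> (transformation, source columns).
# Names mirror the example above; the mapping format itself is an assumption.
COLUMN_LINEAGE = {
    "gold.daily_revenue_summary.total_revenue": (
        "AGGREGATION (SUM)",
        ["silver.fact_orders.line_amount"],
    ),
    "silver.fact_orders.line_amount": (
        "TRANSFORMATION (amount * (1 - discount_rate))",
        ["bronze.raw_orders.amount", "bronze.raw_orders.discount_rate"],
    ),
}


def trace_upstream(column, lineage, depth=0):
    """Return indented lines describing how `column` was derived."""
    lines = ["  " * depth + column]
    if column in lineage:
        transformation, sources = lineage[column]
        lines.append("  " * (depth + 1) + "<- " + transformation)
        for source in sources:
            lines.extend(trace_upstream(source, lineage, depth + 2))
    return lines


def root_columns(column, lineage):
    """Source columns with no recorded upstream: where the trace ends."""
    if column not in lineage:
        return {column}
    roots = set()
    for source in lineage[column][1]:
        roots |= root_columns(source, lineage)
    return roots


target = "gold.daily_revenue_summary.total_revenue"
print("\n".join(trace_upstream(target, COLUMN_LINEAGE)))
print(sorted(root_columns(target, COLUMN_LINEAGE)))
```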

Senior-level advice: “Start with table-level lineage. Column-level lineage is valuable but complex. Add detail incrementally.” In interviews, proposing table-level as the foundation with column-level as Phase 2 for critical paths shows pragmatism.

OpenLineage: The Open Standard

OpenLineage is the open-source standard for lineage event collection. Think of it as the “OpenTelemetry for data pipelines.”

How it works:

Pipeline job runs
    ↓ OpenLineage client (embedded in Spark, Airflow, dbt, Flink, etc.)
    ↓ emits events (RunEvent with job start, complete, fail)
    ↓ OpenLineage backend / Marquez (open-source lineage server)
    ↓ Query lineage graph via API / UI

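OpenLineage event structure (simplified for readability; a spec-compliant RunEvent also requires eventTime, producer, and schemaURL, and its facets carry more detail than shown here):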

{
  "eventType": "COMPLETE",
  "run": { "runId": "abc-123" },
  "job": { "namespace": "spark", "name": "daily_revenue_transform" },
  "inputs": [
    {
      "namespace": "bigquery",
      "name": "project.silver.fact_orders",
      "facets": {
        "schema": { "fields": ["order_id", "amount", "customer_id", "..."] }
      }
    }
  ],
  "outputs": [
    {
      "namespace": "bigquery",
      "name": "project.gold.daily_revenue_summary",
      "facets": {
        "columnLineage": {
          "total_revenue": {
            "inputFields": [{ "field": "amount", "transformationType": "AGGREGATION" }]
          }
        }
      }
    }
  ]
}
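As a rough sketch of the run lifecycle, the snippet below builds a START and a COMPLETE event for the same run using only the standard library. It does not use the OpenLineage client and sends nothing over the network. The PRODUCER value is a made-up placeholder, and the spec version in SCHEMA_URL is an assumption to check against the spec your client targets.

```python
import json
import uuid
from datetime import datetime, timezone

# Placeholder values (assumptions): point these at your own emitter and the
# spec version your client targets.
PRODUCER = "https://example.com/pipelines/revenue"
SCHEMA_URL = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"


def build_run_event(event_type, run_id, job_name, inputs=(), outputs=()):
    """Assemble a RunEvent-shaped dict for one state change of a run."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": "spark", "name": job_name},
        "inputs": [{"namespace": "bigquery", "name": n} for n in inputs],
        "outputs": [{"namespace": "bigquery", "name": n} for n in outputs],
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
    }


# START and COMPLETE share one runId so a backend can stitch them together.
run_id = str(uuid.uuid4())
start = build_run_event("START", run_id, "daily_revenue_transform")
complete = build_run_event(
    "COMPLETE",
    run_id,
    "daily_revenue_transform",
    inputs=["project.silver.fact_orders"],
    outputs=["project.gold.daily_revenue_summary"],
)

print(json.dumps(complete, indent=2))
```

The shared runId is the point of the sketch: it is what lets a lineage backend treat the two events as one run.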

Integrations: Apache Airflow, Apache Spark, dbt, Apache Flink, Great Expectations, Snowflake, BigQuery (via Marquez or DataHub integration).

Why it matters for interviews: “I’d use OpenLineage as the collection standard — it’s engine-agnostic, so the same lineage model captures metadata from Spark jobs, dbt models, and Airflow DAGs. The lineage graph is stored in a backend like DataHub or Marquez. This gives me a unified view across the entire pipeline, regardless of which tools the team uses.”

Lineage Tools Landscape

Tool	Type	Key Strengths	Best For
OpenLineage	Open standard	Engine-agnostic event collection, broad integrations	Any stack (the collection layer)
Marquez	Open-source server	Stores and visualizes OpenLineage events	Small/medium teams wanting self-hosted lineage
DataHub (LinkedIn)	Open-source platform	Full data catalog + lineage + search + governance, column-level lineage UI	Large enterprises, multi-cloud, rich discovery needs
Amundsen (Lyft)	Open-source catalog	Search-first data discovery, table/column metadata	Teams prioritizing search and discovery over lineage depth
Apache Atlas	Open-source	Hadoop/Hive ecosystem native, governance-heavy	Hadoop-heavy orgs (declining relevance in 2026)
Monte Carlo	Commercial	ML-powered anomaly detection + automatic lineage	Observability-first, data reliability platform
Alation	Commercial	Business-user-friendly catalog, policy enforcement	Governance-heavy enterprises
Collibra	Commercial	Enterprise governance, policy workflows, stewardship	Regulated industries (finance, healthcare)
Unity Catalog (Databricks)	Commercial	Native lineage within Databricks ecosystem	Databricks-centric stacks
BigQuery Data Catalog	GCP-native	Tag templates, policy enforcement, GCP-native lineage	GCP-only stacks

The 2026 reality: Your target companies (Meta, Netflix, Google, OpenAI, Anthropic) largely build internal lineage tooling because of their scale. They still expect you to understand the principles and to be able to reference these tools for smaller-scale or OSS implementations.

Data Catalogs: Discovery + Context

Lineage tells you how data flows. A data catalog tells you what data exists and what it means.

Core catalog capabilities:

Capability	What It Provides
Search & Discovery	“Where is user purchase data?” → find all relevant tables across 10,000 datasets
Schema browsing	View columns, types, sample values, cardinality estimates
Business glossary	“What does ‘active user’ mean?” → authoritative business definition
Ownership	Who owns this table? Who do I contact if data quality fails?
Lineage integration	Where does this table come from? What depends on it?
Usage statistics	How often is this table queried? By whom? Last accessed when?
Quality signals	Pass/fail status from quality tests, anomaly alerts
Documentation	Description, known issues, SLAs, example queries

The catalog + lineage combination creates the full data observability picture:

Catalog answers: “What exists and what does it mean?”
Lineage answers: “Where does it come from and who uses it?”

DataHub in practice:

Search "revenue" →
Results: dim_revenue_segment, fact_daily_revenue, revenue_summary_v2
Click fact_daily_revenue →
Schema: order_date, total_revenue, order_count, avg_order_value
Lineage: upstream [fact_orders, dim_date], downstream [exec_dashboard, ml_revenue_model]
Column lineage: total_revenue ← SUM(fact_orders.amount)
Owner: data-platform@company.com    SLA: refreshed by 8 AM PT daily
Quality: 98/100 (2 warnings: null rate slightly elevated last 3 days)
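The walkthrough above can be approximated with a tiny in-memory catalog. This is only a sketch of the data model, not DataHub's API. The CatalogEntry fields and the search function are hypothetical, and the fact_orders entry is added so the search has something to exclude.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    columns: list = field(default_factory=list)
    upstream: list = field(default_factory=list)
    downstream: list = field(default_factory=list)
    owner: str = ""
    sla: str = ""


CATALOG = [
    CatalogEntry("dim_revenue_segment"),
    CatalogEntry(
        "fact_daily_revenue",
        columns=["order_date", "total_revenue", "order_count", "avg_order_value"],
        upstream=["fact_orders", "dim_date"],
        downstream=["exec_dashboard", "ml_revenue_model"],
        owner="data-platform@company.com",
        sla="refreshed by 8 AM PT daily",
    ),
    CatalogEntry("revenue_summary_v2"),
    CatalogEntry("fact_orders", columns=["order_id", "amount", "customer_id"]),
]


def search(catalog, term):
    """Case-insensitive match on table name or any column name."""
    term = term.lower()
    return [
        entry
        for entry in catalog
        if term in entry.name.lower()
        or any(term in column.lower() for column in entry.columns)
    ]


for entry in search(CATALOG, "revenue"):
    print(entry.name, "| owner:", entry.owner or "unassigned")
```

Keeping ownership and lineage on the same record is what lets one search result answer both "what is this?" and "who do I ask?".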

How Lineage Enables Debugging

This is the practical use case interviewers probe. Here’s the workflow:

Scenario: The Executive Revenue Dashboard shows $0 revenue for yesterday.

Without lineage:

Engineer opens Airflow, checks which jobs ran
Manually traces: did fact_orders load? Did daily_revenue_summary build?
Checks 10 different tables across 3 systems
Asks other team members: “Did you change anything?”
Total time: 2-4 hours

With lineage:

Engineer opens DataHub/Monte Carlo, searches for the dashboard
Clicks the impacted metric → sees upstream dependency graph
Immediately sees: gold.daily_revenue_summary failed to build at 4:15 AM
Traces one hop upstream: silver.fact_orders was empty
Traces one more hop: CDC job from orders_db produced 0 rows
Root cause: orders_db had a connection timeout at 3:45 AM
Total time: 8 minutes
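The "with lineage" steps amount to walking upstream while assets are unhealthy. A minimal sketch follows, assuming the lineage graph is acyclic and that each asset has a last-run status. Asset names follow the scenario above, while the status strings and dictionary layout are assumptions.

```python
# Upstream edges and last-run status per asset. Layout is an assumption.
UPSTREAM = {
    "exec_revenue_dashboard": ["gold.daily_revenue_summary"],
    "gold.daily_revenue_summary": ["silver.fact_orders", "silver.dim_date"],
    "silver.fact_orders": ["cdc.orders_db"],
    "silver.dim_date": [],
    "cdc.orders_db": [],
}
STATUS = {
    "exec_revenue_dashboard": "stale",
    "gold.daily_revenue_summary": "failed",
    "silver.fact_orders": "empty",
    "silver.dim_date": "ok",
    "cdc.orders_db": "connection timeout",
}


def find_root_causes(asset, upstream, status, path=()):
    """Follow unhealthy upstream assets.

    Returns every path that ends at an asset whose own inputs are all
    healthy, which makes that asset the likely root cause.
    """
    path = path + (asset,)
    bad_parents = [p for p in upstream.get(asset, []) if status.get(p) != "ok"]
    if not bad_parents:
        return [path]
    causes = []
    for parent in bad_parents:
        causes.extend(find_root_causes(parent, upstream, status, path))
    return causes


for chain in find_root_causes("exec_revenue_dashboard", UPSTREAM, STATUS):
    print(" <- ".join(chain), "|", STATUS[chain[-1]])
```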

Impact analysis (proactive): Before deploying a change to silver.fact_orders:

Query lineage graph for downstream consumers
Find: 8 tables, 3 ML models, 2 external API feeds depend on it
Know exactly what to test and which teams to notify

GDPR deletion: User requests data deletion:

Search catalog for all tables containing user_id
Lineage traces all derived tables, aggregates, ML features
Automated deletion script targets every occurrence
Audit trail proves compliance
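For the deletion flow, column-level edges can be walked downstream from the source identifier to list every column a deletion job would need to review. This is a sketch under assumptions: every table and column name other than user_id is hypothetical, and a real process would still need judgment about aggregates that no longer identify a person.

```python
from collections import defaultdict

# (source column, derived column) pairs. Names other than user_id are hypothetical.
COLUMN_EDGES = [
    ("raw.users.user_id", "silver.dim_customer.user_id"),
    ("silver.dim_customer.user_id", "gold.user_ltv.user_id"),
    ("silver.dim_customer.user_id", "ml.churn_features.user_key"),
    ("raw.orders.amount", "gold.daily_revenue_summary.total_revenue"),
]


def deletion_targets(source_column, edges):
    """Return the source column plus every column derived from it."""
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)
    targets, stack = {source_column}, [source_column]
    while stack:
        for child in children[stack.pop()]:
            if child not in targets:
                targets.add(child)
                stack.append(child)
    return sorted(targets)


for column in deletion_targets("raw.users.user_id", COLUMN_EDGES):
    print(column)
```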

Building Lineage Into Your Pipeline Design

The interview differentiation: don’t just describe tools — describe how you’d architect lineage collection:

Option 1: Push-based (runtime capture)

Pipelines emit OpenLineage events at job start/complete/fail. The lineage backend receives events in real-time. Most accurate — captures actual I/O, not just declared dependencies.

Airflow DAG runs → OpenLineage listener emits events → Marquez/DataHub
Spark job runs → OpenLineage SparkListener captures I/O → DataHub
dbt runs via the openlineage-dbt wrapper (dbt-ol) → emits model lineage → DataHub

Option 2: Pull-based (static analysis)

Scan query logs, SQL ASTs, and pipeline configs. Cheaper operationally, but may miss dynamic queries. Used by tools like Atlan and Alation for warehouse-native lineage.

Option 3: Declarative (config-driven)

Developers declare upstream/downstream dependencies in pipeline config. Cheap to implement, but depends on humans keeping it accurate. Good for smaller teams.
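One way to limit the "depends on humans" risk is to compare declared dependencies against what runtime capture observed. The sketch below assumes both sides are available as simple job-to-inputs mappings. The function name and report shape are illustrative only.

```python
def lineage_drift(declared, observed):
    """Compare declared inputs per job with the inputs seen at runtime."""
    report = {}
    for job in sorted(set(declared) | set(observed)):
        want = set(declared.get(job, []))
        seen = set(observed.get(job, []))
        if want != seen:
            report[job] = {
                "undeclared": sorted(seen - want),  # read at runtime, missing from config
                "unused": sorted(want - seen),      # declared, never read
            }
    return report


declared = {
    "daily_revenue_transform": ["silver.fact_orders", "silver.dim_customer"],
}
observed = {
    "daily_revenue_transform": [
        "silver.fact_orders",
        "silver.dim_customer",
        "silver.dim_date",
    ],
}

print(lineage_drift(declared, observed))
```

Run as a scheduled check, a non-empty report is a prompt to update the config before the declared graph misleads someone during an incident.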

Best practice: Combine runtime capture (OpenLineage) for production pipelines with catalog metadata (DataHub) for business context. Runtime gives you accuracy; catalog gives you meaning.

Interview Questions

Q1: “An important table in your warehouse is being deprecated. How do you safely decommission it?”

Model Answer: “Lineage makes this safe and systematic. First, I’d query the lineage graph to find all downstream consumers — tables, jobs, dashboards, ML models, APIs. Without lineage, this is a dangerous manual process. With lineage, I get a complete dependency list in 30 seconds. Second, I’d identify the owners of each downstream asset via the data catalog and notify them with a deprecation timeline — 60 days minimum for tier-1 tables. Third, I’d monitor usage statistics in the catalog: is anyone still querying this table? If active queries drop to zero after 30 days, it’s safe to proceed. Fourth, I’d add a deprecation tag in the catalog with the replacement table and migration guide. Fifth, rename the table to [tablename]_deprecated temporarily — breaking change detection triggers alerts for anyone still running queries. Finally, after the notice period and zero active queries confirmed, delete. The entire process is tracked in the catalog as an audit record. Without lineage and catalog, this decommissioning would take weeks of manual investigation and stakeholder hunting.”
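The final gate in that answer (notice period elapsed and no recent queries) is easy to encode. A minimal sketch, using the 60-day notice and 30-day quiet window from the answer as defaults. The function name and inputs are hypothetical, and the dates would come from catalog usage statistics.

```python
from datetime import date, timedelta


def safe_to_drop(notice_sent, last_query, today, notice_days=60, quiet_days=30):
    """True when the notice period has elapsed and the table has been quiet.

    `last_query` is the most recent query date, or None if never queried.
    """
    notice_elapsed = today - notice_sent >= timedelta(days=notice_days)
    quiet = last_query is None or today - last_query >= timedelta(days=quiet_days)
    return notice_elapsed and quiet


today = date(2026, 6, 1)
print(safe_to_drop(date(2026, 3, 1), date(2026, 4, 15), today))  # quiet long enough
print(safe_to_drop(date(2026, 3, 1), date(2026, 5, 20), today))  # still being queried
```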

Q2: “Your analytics team complains that they can’t trust the data. How do you build data trust at scale?”

Model Answer: “Data trust is an engineering problem, not a communication problem. It has three technical pillars. First, data catalog with ownership and SLAs — every tier-1 table has a named owner, documented definition, known SLA, and quality signal visible in one place. Analysts can see at a glance: ‘This table refreshes by 8 AM and passed all 12 quality checks today.’ Second, quality visibility — dbt test results and anomaly detection scores are surfaced in the catalog entry for each table. A green checkmark means tests passed today. A warning flag means investigate before using for critical decisions. Third, lineage transparency — when an analyst sees a number they don’t understand, they can trace exactly which transformations produced it. ‘This revenue figure comes from fact_orders, which was built from these three CDC sources, with this deduplication logic.’ That explainability is what turns skepticism into trust. The catalog, quality signals, and lineage together form a data observability layer. Teams that invest in this report dramatically fewer ‘I don’t trust this data’ conversations because answers are self-service rather than requiring an engineer to investigate.”

Think About This

You’re in a Netflix interview. The prompt: “Netflix has thousands of data pipelines and hundreds of analysts. An analyst finds that the ‘weekly content engagement score’ metric used in a major content investment decision has a calculation that looks wrong. How would you design a system to diagnose this and prevent it from happening again?”

Walk through:

How do you diagnose which pipeline produced the wrong value? (Start from the metric in the catalog → trace lineage to the upstream dbt model → find the SQL defining ‘engagement score’ → version history shows someone changed the session definition 3 days ago → root cause found in 10 minutes)
What downstream impact did the wrong value have? (Lineage shows the engagement score fed into 3 other models and 2 external reports — all potentially affected)
How do you prevent this in the future? (CI/CD with dbt tests — the session definition change should have triggered a test. Add a test: ‘engagement score should be within 20% of prior week average.’ Changes to tier-1 metric definitions require peer review. Breaking changes trigger lineage-based impact notifications to downstream owners.)
What catalog + lineage features make this possible? (Column-level lineage to know which columns feed ‘engagement score.’ Ownership metadata to know who to notify. Version history to pinpoint when the definition changed. Quality signals showing the test that would have caught this.)
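The "within 20% of prior week average" test from the prevention step could look like the sketch below. It is plain Python rather than a dbt test, the function name is hypothetical, and it assumes the prior values are the earlier scores you choose as a baseline.

```python
def within_tolerance(current, prior_values, tolerance=0.20):
    """True if `current` is within `tolerance` of the mean of `prior_values`."""
    if not prior_values:
        raise ValueError("need at least one prior value")
    baseline = sum(prior_values) / len(prior_values)
    if baseline == 0:
        return current == 0
    return abs(current - baseline) / abs(baseline) <= tolerance


prior_week = [100.0, 104.0, 96.0]  # illustrative scores, not real data
print(within_tolerance(110.0, prior_week))  # small drift
print(within_tolerance(60.0, prior_week))   # large drop, should block the deploy
```

A failing check here would not explain why the definition changed, but it would stop the change from reaching the metric unnoticed.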

Quick Reference

Lineage = origin + movement + transformation of data. Answers: where did it come from, what happened to it, where does it go.
Table-level lineage: Start here. Fast to implement, and as a rule of thumb it covers roughly 80% of debugging use cases.
Column-level lineage: High value for compliance (PII tracking, GDPR), precise debugging, schema change impact. Complex to collect — add incrementally.
OpenLineage: The standard for lineage event collection. Engine-agnostic. Integrates with Spark, Airflow, dbt, Flink.
Catalog vs lineage: Catalog = what exists and what it means. Lineage = how data flows. Together = data observability.
Three key use cases: (1) Incident debugging — trace root cause in minutes, not hours. (2) Impact analysis — know what breaks before you change something. (3) Compliance — GDPR deletion, PII tracking, audit trails.
Deprecation checklist: Query lineage → notify owners → monitor usage → add deprecation tag → rename to _deprecated → delete after zero-usage confirmation.

Tomorrow’s Preview

Day 21: Schema Evolution & Data Contracts — Forward/backward/full compatibility, Avro & Protobuf schema registry, data contracts between producers and consumers, and how to prevent breaking changes in pipelines — one of the most operationally painful topics that senior DEs are expected to solve architecturally.
