Day 29 — Pipeline observability and monitoring

Phase 1: Foundations & Frameworks | Category: Observability

Why This Is the “Grown-Up” DE Topic

Anyone can build a pipeline that works on the happy path. Senior DEs build pipelines that fail gracefully, alert intelligently, and recover quickly. As DataVidhya frames it: “The most commonly missed metric is silent failure — the pipeline runs successfully but produces zero rows because the source changed its API format.” At Meta, Netflix, Google, OpenAI, and Anthropic, where data powers product decisions worth millions of dollars daily, the question isn’t “will your pipeline fail?” but “how quickly will you know, and how will you recover?”

Interviewers at these companies listen for whether you treat observability as an afterthought (“and then I’d add some monitoring”) or as a first-class design concern woven throughout your architecture.

The Five Observability Layers

DataVidhya defines the canonical five-layer monitoring model for data pipelines:

Layer 1: PIPELINE HEALTH

Job status (running/failed/degraded/skipped)
Last successful run timestamp vs. SLA
Execution duration trend (creeping up = resource pressure or data growth)
Retry count per run

Layer 2: DATA FRESHNESS

MAX(load_timestamp) in destination table vs. current time
“Data as of” lag — how old is the data consumers are reading?
SLO breach alert if freshness exceeds threshold

Layer 3: DATA VOLUME

Row counts per run vs. historical baseline
Byte volumes per partition
Anomaly detection: > 30% deviation from 7-day rolling average = alert

Layer 4: DATA QUALITY

dbt test pass/fail rates
Null percentage per critical column
Schema drift alerts (new column, missing column, type change)
Duplicate key rate

Layer 5: INFRASTRUCTURE

Kafka consumer lag (growing = throughput gap)
Spark executor memory utilization, GC pauses
Flink checkpoint duration and failure count
Cloud compute cost per pipeline (cost spike = efficiency regression)
Storage growth rate vs. budget

The senior insight: Most teams instrument Layer 1 (job success/fail) and think they’re done. Layers 2-4 are where real incidents originate. A job that succeeds but produces 0 rows is invisible to Layer 1 monitoring — it takes Layer 3 to catch it.
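A minimal sketch of the Layer 3 check that catches a zero-row run, assuming the 30% threshold and 7-day baseline listed above. The function name `volume_anomaly` and the sample row counts are hypothetical, chosen only for illustration.

```python
from statistics import mean


def volume_anomaly(todays_rows: int, last_7_days: list[int], max_deviation: float = 0.30) -> bool:
    """Return True when today's row count deviates from the 7-day average by more than max_deviation."""
    if not last_7_days:
        raise ValueError("need at least one day of history")
    baseline = mean(last_7_days)
    if baseline == 0:
        # No meaningful baseline: flag any non-zero load as a change worth reviewing.
        return todays_rows != 0
    deviation = abs(todays_rows - baseline) / baseline
    return deviation > max_deviation


# Hypothetical history of daily row counts.
history = [5_100_000, 5_250_000, 5_180_000, 5_300_000, 5_220_000, 5_150_000, 5_240_000]
print(volume_anomaly(0, history))          # a "successful" run that wrote nothing
print(volume_anomaly(5_200_000, history))  # an ordinary day
```

A job-status check would report both runs as green; only the comparison against the baseline separates them.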

SLIs, SLOs, and Error Budgets for Data Pipelines

Borrowing the SRE framework (Days 19, 20 applied this to data quality — today we apply it to pipeline operations):

Defining your SLIs (what you measure):

SLI	Definition
Freshness	current time minus MAX(event_date) in gold.fact_orders (lag in hours)
Completeness	(rows loaded today / rows in source today) × 100
Volume	daily_row_count / 7_day_rolling_avg_row_count × 100
Availability	(successful pipeline runs / total scheduled runs) × 100
Latency	time from source write to destination availability (end-to-end latency)
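The SLI definitions above reduce to one time difference and a few ratios. The sketch below shows that arithmetic under stated assumptions: the function names and all sample numbers are hypothetical, and in practice the inputs would come from warehouse queries and scheduler metadata.

```python
from datetime import datetime, timezone


def freshness_lag_hours(max_event_ts: datetime, now: datetime) -> float:
    """Freshness SLI: hours between the newest loaded event and now."""
    return (now - max_event_ts).total_seconds() / 3600


def ratio_pct(numerator: float, denominator: float) -> float:
    """Shared shape of the completeness, volume, and availability SLIs."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return numerator / denominator * 100


# Hypothetical sample values.
now = datetime(2026, 4, 10, 6, 0, tzinfo=timezone.utc)
newest = datetime(2026, 4, 10, 0, 0, tzinfo=timezone.utc)

print(freshness_lag_hours(newest, now))   # freshness lag in hours
print(ratio_pct(5_234_891, 5_235_000))    # completeness: rows loaded / rows in source
print(ratio_pct(5_234_891, 5_200_000))    # volume: today's rows / 7-day average
print(ratio_pct(199, 200))                # availability: successful runs / scheduled runs
```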

Defining your SLOs (what you commit to):

SLO	Target
Freshness	gold.fact_orders refreshed within 4 hours of midnight PT, 29/30 days
Completeness	≥ 99% of source rows loaded within the same partition window
Volume	daily row count within 20% of 7-day rolling average
Availability	≥ 99.5% of scheduled runs complete successfully (≤ 0.5% failure rate)
Latency	End-to-end CDC pipeline latency < 5 minutes, p99

Error budget: If the freshness SLO is 29/30 days (96.7%), the error budget is 1 missed day per month. Track budget burn rate — if you’ve already missed 1 day on April 10, the remaining budget for April is exhausted. This triggers a reliability sprint before any new feature work.
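The budget arithmetic can be made explicit. This is a small sketch assuming a day-based SLO like the 29/30 freshness target; `error_budget` and `budget_status` are hypothetical helper names.

```python
def error_budget(slo_good_days: int, period_days: int) -> int:
    """Number of days per period that are allowed to miss the SLO."""
    return period_days - slo_good_days


def budget_status(missed_days: int, slo_good_days: int = 29, period_days: int = 30) -> str:
    """Classify the month so far by how much error budget is left."""
    remaining = error_budget(slo_good_days, period_days) - missed_days
    if remaining > 0:
        return "within budget"
    if remaining == 0:
        return "budget exhausted"  # the point at which reliability work takes priority
    return "SLO breached"


print(budget_status(missed_days=0))
print(budget_status(missed_days=1))
print(budget_status(missed_days=2))
```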

The Observability Stack: Tools by Layer

Layer	What to monitor	Tools
Pipeline health	Job status, duration, retry count	Airflow metrics → Prometheus + Grafana; Dagster built-in; CloudWatch (Glue)
Data freshness	MAX(load_ts) per table, automated freshness checks	Monte Carlo, Soda, custom SQL probes, dbt source freshness
Data volume	Row counts, byte sizes, anomaly detection	Great Expectations, Monte Carlo ML anomaly detection, custom Airflow sensors
Data quality	Null rates, schema drift, FK violations	dbt tests, Great Expectations, Soda
Infrastructure	Kafka lag, Spark metrics, cost	Kafka JMX → Prometheus, Spark UI → Prometheus, AWS Cost Explorer API
End-to-end tracing	Lineage + execution correlation	OpenLineage → DataHub/Marquez

The golden path for pipeline logs:

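# Every pipeline task should emit structured logs with consistent fields
import structlog

# Render events as JSON lines; structlog's default output is a
# human-readable console format, which is harder to alert on.
structlog.configure(
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)

log = structlog.get_logger()

log.info(
    "pipeline.task.complete",
    dag_id="daily_revenue",
    task_id="transform_orders",
    run_date="2026-04-10",
    rows_processed=5_234_891,
    rows_failed=12,
    duration_sec=142.3,
    output_path="s3://gold/fact_orders/date=2026-04-10/",
    source_row_count=5_235_000,
    completeness_pct=99.998,
)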

Structured logs (JSON, not free-text) are parseable by Datadog, CloudWatch Insights, Splunk — enabling automated alerting on any field.
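To illustrate why JSON logs make field-level alerting straightforward, here is a sketch of a check that a log processor might run on each line. The function `check_task_log` and the 99% completeness threshold are assumptions for illustration; log platforms express the same rule in their own query languages.

```python
import json


def check_task_log(line: str, min_completeness_pct: float = 99.0) -> list[str]:
    """Parse one JSON log line and return a list of problems found in it."""
    event = json.loads(line)
    problems = []
    if event["rows_processed"] == 0:
        problems.append("zero rows processed")
    if event["completeness_pct"] < min_completeness_pct:
        problems.append("completeness below threshold")
    return problems


# Hypothetical log lines with the same fields as the logging example.
healthy = json.dumps({"event": "pipeline.task.complete", "rows_processed": 5_234_891,
                      "completeness_pct": 99.998})
silent_failure = json.dumps({"event": "pipeline.task.complete", "rows_processed": 0,
                             "completeness_pct": 0.0})

print(check_task_log(healthy))
print(check_task_log(silent_failure))
```

The same rule against free-text logs would need a fragile regular expression per message format.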

Alerting Architecture: Tier-Based, Not Everything-Fires

The most common observability failure: alerting on every task failure → alert fatigue → on-call engineers ignore pages → real incidents missed.

Two-tier alerting model:

P1 — Page on-call immediately (PagerDuty)

Triggers:

Tier-1 table freshness SLO breach
Pipeline producing zero rows (silent failure)
Kafka consumer lag growing continuously > 30 min
Financial/billing pipeline failure

Response expectation: 15 minutes
Must have: Runbook attached to every P1 alert

P2 — Create ticket, notify Slack channel (non-blocking)

Triggers:

Tier-2/3 pipeline failure
Quality check failure on non-critical table
Volume anomaly > 20% (not yet SLO breach)
Cost spike > 50% above weekly baseline

Response expectation: Same business day
Must have: Self-service remediation steps in alert body
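The two tiers can be encoded as a small routing function so that the decision is explicit and testable rather than scattered across alert definitions. This is only a sketch: the `Alert` fields and the `kind` strings are hypothetical names, and defaulting unknown alerts to P2 is a design choice made here, not something the tiers above prescribe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    kind: str                # hypothetical labels, e.g. "zero_rows", "pipeline_failure"
    table_tier: int = 3      # 1 = most critical
    financial: bool = False  # True for finance/billing pipelines


def route(alert: Alert) -> str:
    """Return "P1" (page on-call) or "P2" (ticket plus Slack notification)."""
    if alert.kind == "zero_rows":
        return "P1"  # silent failure
    if alert.kind == "kafka_lag_growing_30min":
        return "P1"
    if alert.kind == "freshness_slo_breach" and alert.table_tier == 1:
        return "P1"
    if alert.kind == "pipeline_failure" and alert.financial:
        return "P1"
    return "P2"  # everything else is non-blocking


print(route(Alert(kind="zero_rows")))
print(route(Alert(kind="pipeline_failure", table_tier=2)))
print(route(Alert(kind="pipeline_failure", financial=True)))
```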

Multi-window burn rate alerting (SRE pattern applied to data):

Instead of alerting only when the SLO is breached, alert when you’re BURNING THROUGH THE ERROR BUDGET TOO FAST:

Fast burn (1-hour window):
  If burn rate > 14x normal, you'll exhaust the monthly budget in about 2 days (30 / 14)
  → Immediate P1 page

Slow burn (6-hour window):
  If burn rate > 6x normal, budget exhausted in 5 days (30 / 6)
  → P2 ticket, investigate before it becomes P1
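Burn rate is the observed error rate in a window divided by the error rate the SLO allows, so a 30-day budget lasts 30 / burn rate days. The sketch below works through that arithmetic for a run-based SLI such as availability; the function names and sample counts are hypothetical, and the 14x and 6x thresholds are the ones used in this section.

```python
import math


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total_events == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo_target
    return (bad_events / total_events) / allowed_error_rate


def days_to_exhaustion(rate: float, period_days: int = 30) -> float:
    """How long the period's budget lasts if the current burn rate continues."""
    return math.inf if rate == 0 else period_days / rate


def classify(fast_window_rate: float, slow_window_rate: float) -> str:
    """Multi-window decision: fast burn pages, slow burn opens a ticket."""
    if fast_window_rate > 14:
        return "P1"
    if slow_window_rate > 6:
        return "P2"
    return "ok"


# Hypothetical counts: 8 failed runs out of 100 in the last hour, against a 99.5% SLO.
fast = burn_rate(bad_events=8, total_events=100, slo_target=0.995)
slow = burn_rate(bad_events=20, total_events=600, slo_target=0.995)
print(round(fast, 1), round(days_to_exhaustion(fast), 1))
print(classify(fast, slow))
```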

Circuit Breakers: Stop Bad Data Before It Spreads

A circuit breaker in a data pipeline is a validation gate that halts downstream propagation when data quality falls below a threshold (Monte Carlo, FirstEigen).

Without circuit breakers:

Bad source data → Bronze layer → Silver layer → Gold layer → Dashboard
                                                               ↑
                                                    VP sees wrong numbers
                                                    3 days after the bad data arrived
                                                    Backfill cost: 3 days × pipeline cost

With circuit breakers:

Bad source data → Bronze layer → [CIRCUIT BREAKER FIRES] → STOP
    ↓ Alert on-call
    Quarantine bad partition
    Silver/Gold remain unaffected
    No backfill needed downstream

Implementing circuit breakers in Airflow:

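# Pattern: validate → publish (separate steps, publish only on validation pass)
from airflow.operators.python import BranchPythonOperator, PythonOperator
# Assumes the apache-airflow-providers-common-sql package is installed.
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

# get_staging_row_count, get_7day_avg_row_count, check_null_rate, and
# trigger_alert are project-specific helpers assumed to be defined elsewhere.


def validate_data(**context):
    """Run quality checks on today's staging data. Return the task_id to branch to."""
    row_count = get_staging_row_count(context["data_interval_start"])
    expected = get_7day_avg_row_count()
    null_rate = check_null_rate("order_id")  # fraction of rows with a NULL order_id
    if row_count < expected * 0.7:  # > 30% volume drop
        trigger_alert(
            f"CIRCUIT BREAKER: Volume {row_count} is 30%+ below expected {expected}"
        )
        return "quarantine_partition"  # Branch to quarantine
    if null_rate > 0.001:  # > 0.1% null rate on PK
        trigger_alert("CIRCUIT BREAKER: order_id null rate exceeds threshold")
        return "quarantine_partition"
    return "publish_to_gold"  # Branch to normal publish


validate = BranchPythonOperator(
    task_id="validate_staging",
    python_callable=validate_data,
)
# "..." marks arguments elided here (SQL, connection, callable).
publish = SQLExecuteQueryOperator(task_id="publish_to_gold", ...)
quarantine = PythonOperator(task_id="quarantine_partition", ...)

# The branch operator runs exactly one of the two downstream tasks.
validate >> [publish, quarantine]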

Circuit breaker states (borrowed from microservices):

Closed: Data is passing quality checks. Pipeline runs normally.
Open: Quality check failed. Pipeline halts. No downstream propagation.
Half-open: After manual investigation, a test run is allowed. If it passes, circuit closes. If it fails again, circuit stays open.
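The three states can be sketched as a small state machine. This is an illustrative model only, assuming a single boolean quality-check result per run and a manual step to permit the trial run; the class and method names are hypothetical.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class DataCircuitBreaker:
    def __init__(self) -> None:
        self.state = State.CLOSED

    def record_check(self, passed: bool) -> None:
        """Feed in the result of a quality check for the latest run."""
        if self.state is State.CLOSED and not passed:
            self.state = State.OPEN
        elif self.state is State.HALF_OPEN:
            # The trial run decides: pass closes the circuit, fail reopens it.
            self.state = State.CLOSED if passed else State.OPEN
        # While OPEN, results are ignored until an operator allows a trial run.

    def allow_trial_run(self) -> None:
        """Manual step after investigation: permit one test run."""
        if self.state is State.OPEN:
            self.state = State.HALF_OPEN

    def can_run(self) -> bool:
        return self.state is not State.OPEN

    def can_publish(self) -> bool:
        return self.state is State.CLOSED


breaker = DataCircuitBreaker()
breaker.record_check(passed=False)   # quality check fails, circuit opens
print(breaker.state, breaker.can_publish())
breaker.allow_trial_run()            # operator investigates and allows a test run
breaker.record_check(passed=True)    # trial run passes, circuit closes
print(breaker.state, breaker.can_publish())
```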

The “Data-First” Architecture: Decoupling Through Data Readiness

This is a senior-level design pattern that Modern Data 101 calls “the best defense.” Instead of chaining pipeline jobs directly:

Traditional (pipeline-first) — fragile:

Airflow: extract_job → transform_job → load_job
If extract_job fails at 3 AM → transform_job fails → load_job fails
→ Cascade failure, multiple alerts, dashboard stale

Data-first — resilient:

Airflow:
  extract_job writes to S3 + marks partition complete
  transform_job uses a DataSensor to WAIT for partition freshness:
    "Is bronze.orders/date=2026-04-10 fresh and complete?"
    If YES → proceed
    If NO  → wait (poke every 5 min, timeout 4 hours)
  transform_job never FAILS due to upstream delay — it WAITS
Result: No cascade failures. Transform job SLA alert fires
  only when data is ACTUALLY late, not just because a job happened to run first.

Implementation:

# Assumes the apache-airflow-providers-common-sql package is installed.
from airflow.providers.common.sql.sensors.sql import SqlSensor

# Wait for upstream data to be fresh (not just for the job to run).
# The sensor succeeds once the first cell of the first row is non-zero.
wait_for_fresh_orders = SqlSensor(
    task_id="wait_for_orders_freshness",
    conn_id="warehouse",
    sql="""
        SELECT COUNT(*) FROM bronze.orders
        WHERE date = '{{ data_interval_start.date() }}'
        AND row_count > 0
        AND is_complete = true
    """,
    poke_interval=300,  # check every 5 minutes
    timeout=14400,  # timeout after 4 hours (SLA breach)
    mode="reschedule",  # release worker slot while waiting
)
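The producer side of this contract, where the extract job marks a partition complete, can be sketched with a marker file. The example below uses the local filesystem so it runs anywhere; the `_SUCCESS` file name, the metadata fields, and both function names are assumptions for illustration, and an object store or a metadata table would play the same role in practice.

```python
import json
import tempfile
from pathlib import Path

MARKER_NAME = "_SUCCESS"  # hypothetical marker file name


def mark_partition_complete(partition_dir: Path, row_count: int) -> None:
    """Producer side: call only after all data files for the partition are written."""
    partition_dir.mkdir(parents=True, exist_ok=True)
    marker = partition_dir / MARKER_NAME
    marker.write_text(json.dumps({"row_count": row_count, "is_complete": True}))


def is_partition_ready(partition_dir: Path) -> bool:
    """Consumer side: ready means the marker exists and reports at least one row."""
    marker = partition_dir / MARKER_NAME
    if not marker.exists():
        return False
    meta = json.loads(marker.read_text())
    return bool(meta.get("is_complete")) and meta.get("row_count", 0) > 0


with tempfile.TemporaryDirectory() as root:
    partition = Path(root) / "bronze" / "orders" / "date=2026-04-10"
    print(is_partition_ready(partition))   # before the extract job finishes
    mark_partition_complete(partition, row_count=5_235_000)
    print(is_partition_ready(partition))   # after the marker is written
```

Writing the marker last means a consumer never sees a half-written partition as ready, and a marker with zero rows still blocks downstream work.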

Incident Response for Data Outages

When a data pipeline incident is detected, the response framework matters as much as the detection:

Incident severity tiers:

Tier	Definition	Example	Response time
P1	SLA-critical table is stale, finance/safety impact	Revenue dashboard stale at 9 AM	15 minutes
P2	Tier-1 pipeline failing but within SLA window	Pipeline started 30 min late, still has 2 hours	1 hour
P3	Tier-2/3 pipeline failure, no immediate business impact	Internal data science table delayed	Same business day

The 5-step incident response:

1. DETECT (automated)
   Alert fires: "gold.fact_orders freshness SLO breached — data 6 hours stale"

2. TRIAGE (< 15 min for P1)
   Which task failed? (Airflow UI)
   What was the error? (Structured logs)
   Is this a data issue or infrastructure issue?
   What's the blast radius? (Lineage: which tables depend on fact_orders?)

3. CONTAIN
   Stop cascade: circuit breaker prevents bad data from reaching gold layer
   Notify consumers: "fact_orders is stale, will be updated by 10 AM"

4. RESOLVE
   Fix root cause (source outage, schema change, resource exhaustion)
   Re-run pipeline from last successful checkpoint
   Validate output before publishing to gold

5. POST-MORTEM
   Document timeline, root cause, and remediation
   Add a monitoring rule to detect this faster next time
   Update runbook
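The blast-radius question in the triage step is a graph traversal over lineage. Below is a minimal sketch assuming lineage is available as a mapping from each table to its direct consumers; the table names and the `downstream_tables` helper are hypothetical, and a lineage tool would normally supply the graph.

```python
from collections import deque


def downstream_tables(lineage: dict[str, list[str]], start: str) -> set[str]:
    """Return every table that directly or transitively depends on start."""
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        table = queue.popleft()
        for child in lineage.get(table, []):
            if child not in seen and child != start:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical lineage: table -> tables that read from it.
lineage = {
    "silver.orders": ["gold.fact_orders"],
    "gold.fact_orders": ["gold.daily_revenue", "ml.churn_features"],
    "gold.daily_revenue": ["dash.revenue"],
}

print(sorted(downstream_tables(lineage, "gold.fact_orders")))
```

The result is the list of consumers to notify during the CONTAIN step.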

The runbook: Every P1 alert MUST have a runbook. A runbook is a decision tree with steps to diagnose and resolve the incident. Without it, the on-call engineer wastes 30 minutes investigating from scratch every time the same issue recurs.

Monitoring Anti-Patterns to Name in Interviews

These show production experience — not just theoretical knowledge:

Anti-pattern	Problem	Fix
Alerting only on job failure	Silent failures (succeeds with 0 rows) go undetected	Add volume anomaly detection
Static thresholds	“< 1000 rows” fires false alarms during seasonal lows and misses real drops during peaks (Black Friday: 100x normal, so a 90% drop still clears the threshold)	Use dynamic thresholds (7-day rolling average ± N stdev)
Alerting on every dbt test failure	Alert fatigue, critical issues buried in noise	Tier alerts: P1 for tier-1 tables, P2/P3 for everything else
Monitoring only at pipeline completion	Pipeline runs for 4 hours, fails at the last step — no visibility for 3.5 hours	Add intermediate checkpoints and progress metrics
No consumer-side monitoring	Dashboard shows stale data but pipeline monitoring shows green (pipeline ran, but wrote 0 rows)	Monitor freshness AT the serving layer, not just in the pipeline
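The dynamic-threshold fix from the table can be sketched with the standard library. This assumes a short history of daily row counts and a band of N sample standard deviations around the mean; the function names and the sample history are hypothetical.

```python
from statistics import mean, stdev


def dynamic_bounds(history: list[float], n_stdev: float = 3.0) -> tuple[float, float]:
    """Lower and upper bounds: rolling mean plus or minus n_stdev standard deviations."""
    if len(history) < 2:
        raise ValueError("need at least two data points")
    mu = mean(history)
    sigma = stdev(history)
    return mu - n_stdev * sigma, mu + n_stdev * sigma


def is_anomalous(value: float, history: list[float], n_stdev: float = 3.0) -> bool:
    low, high = dynamic_bounds(history, n_stdev)
    return not (low <= value <= high)


# Hypothetical row counts (in thousands) for the last 7 days.
last_7_days = [100, 102, 98, 101, 99, 100, 100]
print(dynamic_bounds(last_7_days))
print(is_anomalous(100, last_7_days), is_anomalous(0, last_7_days))
```

Because the bounds move with the data, the same rule keeps working as volume grows or shrinks seasonally, where a fixed row-count threshold would need manual retuning.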

Interview Questions

Q1: “A critical daily pipeline has been running fine for 6 months. Today, the Airflow DAG shows green for all tasks, but your finance team says the revenue dashboard is showing yesterday’s numbers. How do you diagnose this?”

Model Answer: “All tasks green but stale data is a classic silent failure scenario, and it is exactly why task success ≠ data freshness. My diagnosis sequence: First, check the actual data with two probes: SELECT MAX(order_date) FROM gold.fact_orders, and SELECT COUNT(*) FROM gold.fact_orders WHERE order_date = CURRENT_DATE. If MAX(order_date) < today or the count is 0, the pipeline ran but produced nothing for today. Second, check the volume metrics in the transform task’s structured logs: did it process 0 source rows? If yes, the source was empty. Third, check the source system: did the upstream extraction actually pull today’s data, or did it silently return an empty result set? Common culprits: source changed its API format (now returns a different date field), source had a maintenance window that returned 200 OK with empty body, or the extraction time window shifted (job ran at 11:59 PM and pulled ‘yesterday’ when it should have pulled ‘today’). Fix for the future: add a volume circuit breaker. If the transform task processes < 50% of the 7-day average row count, fail the task (don’t let it succeed with 0 rows). This converts a silent failure into a loud one that pages on-call before the finance team opens their dashboards.”

Q2: “How would you design the observability layer for a new real-time streaming pipeline ingesting 500K events/sec?”

Model Answer: “I’d instrument across five layers. Layer 1 pipeline health: Kafka consumer lag as the primary health metric — if lag is growing continuously for > 5 minutes, the pipeline has a throughput gap and needs scaling. Flink checkpoint duration alerts if > 10 seconds (risk of falling behind). Layer 2 data freshness: The ClickHouse downstream table’s MAX(event_time) should be within 60 seconds of real-time. Probe this every 30 seconds — alert if lag > 2 minutes. Layer 3 volume: Flink emits a custom metric events_processed_per_sec to Prometheus. Alert on 30%+ drop from the prior 15-minute average using a sliding window — this catches both source failures and Flink processing issues. Layer 4 data quality: For streaming, run quality checks on sampled windows — every 5-minute tumbling window, validate null rates on required fields, verify event types are within the schema’s accepted_values. Route quality violations to a DLQ topic, not to the main output. Layer 5 infrastructure: Grafana dashboard with Kafka consumer lag, Flink checkpoint duration, Flink GC pause time, memory utilization, and processing throughput. Set two alert tiers: P1 for sustained throughput drop or freshness SLO breach, P2 for elevated DLQ volume or single-checkpoint failure. Every P1 alert links to a runbook with diagnostic steps.”

Think About This

You’re in a Netflix interview. The prompt: “Netflix’s content availability pipeline — which tracks which titles are available in which countries — has been silently serving incorrect data for 3 days before anyone noticed. How would you redesign the monitoring to catch this in minutes, not days?”

Walk through:

Why was it silent for 3 days? (No freshness monitoring at the serving layer. Pipeline “succeeded” but wrote stale data. No volume check. Consumers didn’t notice until a licensing issue caused 404s on the UI.)
What monitoring would have caught it immediately? (Freshness probe: MAX(updated_at) in gold.content_availability WHERE country = ‘US’ — alert if > 6 hours old. Volume probe: row count per country vs. prior day — alert on > 20% drop.)
What circuit breaker would have prevented stale data from reaching production? (After transform, validate that content availability record count matches a threshold: COUNT(*) BETWEEN 95% and 110% of last week’s count. Fail the publish step if outside bounds.)
What’s the consumer-side monitoring? (The content playback service checks is_available at playback time — instrument the rate of 404 “not available” responses by country and alert on sudden spikes. This catches what pipeline monitoring misses.)

The insight: multiple independent monitoring signals (pipeline health + data freshness + volume + consumer-side) create overlapping protection. If any one signal misses the problem, another catches it. Relying on only one signal (task success) is how incidents go silent for 3 days.
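Two of the checks from the walkthrough, the publish gate and the per-country volume probe, can be sketched as plain functions. The 95% to 110% bounds and the 20% drop threshold come from the walkthrough; the function names and sample counts are hypothetical.

```python
def within_publish_bounds(current_count: int, last_week_count: int,
                          low: float = 0.95, high: float = 1.10) -> bool:
    """Publish gate: the record count must stay within a band around last week's count."""
    return low * last_week_count <= current_count <= high * last_week_count


def countries_with_volume_drop(today: dict[str, int], prior_day: dict[str, int],
                               max_drop: float = 0.20) -> list[str]:
    """Per-country probe: list countries whose row count fell by more than max_drop."""
    flagged = []
    for country, prior in prior_day.items():
        if prior == 0:
            continue  # no baseline to compare against
        drop = (prior - today.get(country, 0)) / prior
        if drop > max_drop:
            flagged.append(country)
    return sorted(flagged)


# Hypothetical counts of availability records.
print(within_publish_bounds(97_000, 100_000))   # inside the band, publish
print(within_publish_bounds(80_000, 100_000))   # outside the band, fail the publish step
print(countries_with_volume_drop({"US": 990, "FR": 500}, {"US": 1000, "FR": 800, "JP": 700}))
```

A country missing from today's counts is treated as a full drop, which is the case a global row count would hide.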

Quick Reference

Five observability layers: Pipeline health → Data freshness → Data volume → Data quality → Infrastructure. Most teams only monitor Layer 1.
Silent failure is the most dangerous failure: A job that succeeds but writes 0 rows. Caught only by Layer 3 (volume anomaly detection).
SLI/SLO/error budget for pipelines: Define measurable SLIs (MAX lag in hours), set SLOs (≤ 4 hours, 29/30 days), track burn rate, trigger reliability work when budget is burning too fast.
Circuit breakers: Validate before publish. If quality/volume checks fail, halt propagation to gold layer. Prevents bad data from reaching dashboards. Far cheaper than backfilling 3 days of downstream tables.
Data-first architecture: Downstream jobs WAIT for data readiness rather than triggering after upstream job completion. Eliminates cascade failures.
Two-tier alerting: P1 (SLO breach, silent failure, financial impact) → PagerDuty + runbook. P2 (quality degradation, cost spike) → Slack + ticket. Never alert on everything.
Runbooks are mandatory for P1: Without them, the on-call engineer investigates from scratch every time. Write runbooks at pipeline build time, not after the first incident.

Tomorrow’s Preview

Day 30: Phase 1 Comprehensive Review & Self-Assessment — You’ve completed the first 30 days of Phase 1: Foundations. Tomorrow is your review day — a structured self-assessment across all Phase 1 topics, identification of your weak areas, and a practice exercise: explain 3 complete designs end-to-end in 30 minutes each. No new material — pure consolidation before Phase 2 begins.
