Day 22 — Workflow orchestration patterns

Phase 1: Foundations & Frameworks | Category: Orchestration

Why Orchestration Depth Matters at Senior Level

Orchestration is where architecture meets operations. Anyone can write a DAG. Senior DEs design orchestration systems that handle late data, safe retries, backfills, SLA monitoring, and failure isolation across hundreds of interdependent pipelines. As Abstract Algorithms puts it: “Pipeline orchestration is less about drawing DAGs and more about controlling freshness, replay, and recovery when one upstream dataset arrives late, wrong, or twice.” At your target companies — where thousands of pipelines run in production — your interviewer wants to know if you can design orchestration that survives failure gracefully, not just orchestration that works on a happy path.

Orchestrator Comparison: 2026 Landscape

Topic	Details
Model	Task-centric DAGs Asset-centric (Software-Defined Assets) Flow/task functions
Scheduling	Cron + sensors Schedule on assets or time-based Deployments + work pools
Execution model	DAG parse time (static) At runtime with dependencies Dynamic (Python control flow)
State management	Metadata DB (Postgres/My SQL) Built-in asset catalog + runs DBPrefect Cloud or self-hosted server
Data lineage	Limited (task-level) First-class (asset-level lineage built-in) Limited
Local dev	Heavy (scheduler + webserver + worker) Excellent (run assets locally) Simple (prefect server start)
Ops complexity	High (multiple components to manage) Medium Low-Medium
Ecosystem	Largest (500+ providers, most battle-tested) Growing fast, dbt-native Moderate
Best for	Large orgs, complex dependencies, existing investment New greenfield, dbt-heavy stacks, asset lineage Small teams, simple pipelines, dynamic workflows
Used at	Meta, Airbnb, Linked In, most large enterprises Growing adoption in modern data stacks Mid-size companies

The 2026 consensus: Start with Airflow unless starting fresh. If greenfield with a dbt-heavy stack, Dagster’s asset-centric model is a meaningful upgrade. The best orchestrator is the one your team can operate confidently.

AWS Step Functions: Serverless, event-driven, for AWS-native pipelines. No DAG files to manage. Deep integration with Lambda, ECS, Glue, EMR. Better for event-triggered pipelines than time-scheduled batch jobs.

Core Airflow Concepts: What Interviewers Probe

Since Airflow is the dominant production tool, know these cold:

DAG (Directed Acyclic Graph): The workflow definition. A Python file that defines tasks and their dependencies.

with DAG(      dag_id="daily_revenue_pipeline",      schedule="0 4 * * *",
# 4 AM UTC daily      start_date=datetime(2026, 1, 1),      catchup=True,
# Backfill from start_date      max_active_runs=1,
# Only 1 concurrent run      default_args={          "retries": 3,          "retry_delay": timedelta(minutes=10),          "retry_exponential_backoff": True,          "execution_timeout": timedelta(hours=2),          "on_failure_callback": alert_slack,      },  ) as dag:      ...

Execution Date vs Data Interval: The most misunderstood Airflow concept.

schedule = "0 4 * * *" (runs at 4 AM daily)
Data interval start: 2026-04-09 00:00:00  ← the DATA being processed
Data interval end:   2026-04-10 00:00:00  ←
Execution date:      2026-04-10 04:00:00  ← when the job actually runs
# This is critical: the job running at 4 AM on April 10  # is processing April 9's data — NOT today's data  # Always use {{ data_interval_start }} in your SQL, NOT {{ ts }}

Wrong (non-idempotent):

-- Uses execution time — different result on rerun
SELECT *
FROM events
WHERE created_at >= NOW() - INTERVAL '1 day';

Right (idempotent):

-- Uses data interval — same result every time for the same interval
SELECT *
FROM events
WHERE created_at >= '{{ data_interval_start }}'
  AND created_at < '{{ data_interval_end }}';

Key DAG parameters to know:

ParameterPurposeSenior Insightcatchup=TrueBackfill from start_dateEnable for new pipelines to process historical data; disable for real-time pipelines where backfill makes no sensemax_active_runs=1Prevent concurrent interval runsCritical for idempotency — prevents two runs writing to the same partition simultaneouslydepends_on_past=TrueDon’t run if previous interval failedUse for accumulating snapshots and sequential pipelines. Risk: one failure cascades indefinitely — use with SLA alertspoolLimit concurrent tasks by resourcePrevent overloading warehouse connections or API rate limitspriority_weightTask execution priority within poolHigher weight = runs first in contention

The Production-Grade DAG Pattern

The senior answer to “how do you design a batch pipeline in Airflow” is not “extract → transform → load.” It’s the Stage → Validate → Publish pattern:

wait_for_upstream
    ↓ extract_to_staging — write to staging area, NOT production table
    ↓ transform_partition — apply transformations in staging
    ↓ validate_data — run quality checks on staging output
[pass?]
├── YES → publish_to_production — atomic overwrite of partition
└── NO  → quarantine_partition → alert_owner → await_manual_review

Why stage first? If you write directly to production and the job fails halfway, your table is in a corrupt state. Staging lets you retry safely — the staging area gets overwritten, but production never sees partial data. Only after validation passes does the publish step atomically move data to production.

Why a separate publish step? A successful transform does NOT mean trusted data. Validation gates determine trustworthiness. Separating them makes this explicit. Dashboards and consumers should read from production only after the publish step completes — not just because the transform task turned green.

Idempotency: The Non-Negotiable Design Principle

Definition: Running a task N times with the same inputs produces the same output as running it once. No duplicates, no corruption, no phantom data.

The four mechanisms:

1. Partition-scoped staging + atomic publish:

# Transform writes to a staging path scoped to the interval
staging_path = f"s3://data/staging/revenue/date={data_interval_start.date()}/run_{run_id}/"
# Publish overwrites only that date's production partition
production_path = f"s3://data/gold/revenue/date={data_interval_start.date()}/"
# On retry: staging path is rewritten (clean), production overwrites same partition

2. MERGE INTO instead of INSERT:

-- Idempotent: run this 10 times, result is identical
MERGE INTO gold.fact_orders t
USING staging.fact_orders s ON t.order_id = s.order_id
AND t.order_date = s.order_date
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

3. Deterministic output paths (interval-based, not wall-clock):

# GOOD: same interval = same path = same result
output_path = f"s3://data/gold/revenue/date={data_interval_start.date()}"
# BAD: wall-clock time = different path each retry
output_path = f"s3://data/gold/revenue/run_{datetime.now().isoformat()}"

4. Deduplication keys:

Build deduplication into your SQL — any reprocessed records are upserted, not appended.

Retry Strategies: Not All Failures Are Equal

Per-task retry configuration:

extract_task = PythonOperator(      task_id="extract_from_api",      retries=5,      retry_delay=timedelta(minutes=2),      retry_exponential_backoff=True,
# 2min, 4min, 8min, 16min, 32min      max_retry_delay=timedelta(minutes=30),  )
validate_task = PythonOperator(      task_id="validate_data",      retries=1,
# Only retry once — data quality failures are usually not transient      retry_delay=timedelta(minutes=5),      on_failure_callback=quarantine_and_alert,  )

When to retry vs. when to quarantine:

Failure TypeRetry?How Many?Action After ExhaustionNetwork timeout / API rate limitYes3-5 with backoffAlert, then quarantineSource file not arrived yetYesVia sensor with poke intervalAlert after timeoutData quality check failureNo/11 (to confirm it’s not flaky)Quarantine partition, alert ownerOOM / resource errorYes2-3 after resource adjustmentAlert, require manual reviewSchema mismatchNo0Quarantine, alert immediately

Retry storm prevention: Cap max concurrent retries across all tasks using Airflow pools. A batch of 100 failed tasks retrying simultaneously can overwhelm your warehouse or API.

SLA Monitoring: Alerting Before Consumers Notice

An SLA in Airflow defines when a DAG run should complete. Missing an SLA triggers a callback — alerting you before the analyst opens the dashboard at 9 AM and finds stale data.

from airflow import DAG
from airflow.models import SlaMiss
def sla_miss_callback(dag, task_list, blocking_task_list, slas, blocking_tis):      message = f"SLA MISSED: DAG {dag.dag_id} - tasks {task_list}"      send_pagerduty_alert(message)
# P1 alert for tier-1 tables      send_slack_message("#data-alerts", message)
with DAG(      dag_id="daily_revenue_pipeline",      schedule="0 4 * * *",      sla_miss_callback=sla_miss_callback,      default_args={          "sla": timedelta(hours=2),
# Each task should complete within 2 hours      },  ) as dag:      ...

The dataset-level freshness SLA pattern (better than task-level SLA):Track the actual data freshness, not just whether the task ran:

-- Freshness monitor: runs every 30 minutes
SELECT       MAX(order_date) as latest_data_date,      CURRENT_DATE - MAX(order_date) as lag_days
FROM gold.fact_orders
-- Alert if lag_days > 1 (data is more than 1 day stale)

This catches the scenario where the task ran but produced no data — a task that succeeds but writes 0 rows still misses the SLA.

Fan-Out / Fan-In Patterns

Fan-out: One upstream task triggers multiple parallel downstream tasks.

# Fan-out: extract once, transform in parallel for 5 regions  extract >> [transform_us, transform_eu, transform_apac, transform_latam, transform_mena]

Fan-in: Multiple parallel tasks must all complete before proceeding.

# Fan-in: all regional transforms must complete before merge
[transform_us, transform_eu, transform_apac, transform_latam, transform_mena] >> merge_global

Dynamic task generation (Airflow’s @task.expand / TaskGroup):

from airflow.decorators import task


@task
def get_regions():
    return ["us", "eu", "apac", "latam", "mena"]


@task
def transform_region(region: str):
    # Process data for this region
    run_spark_job(f"transform_{region}")


# Dynamically creates 5 parallel tasks at runtime
regions = get_regions()
transform_region.expand(region=regions) >> merge_global

This is a senior-level pattern — demonstrates understanding of dynamic DAG generation over hardcoded parallelism.

Backfill: The Operational Safety Net

What backfill means: Re-running a pipeline for historical date ranges — either to process data that didn’t exist when the pipeline was first deployed, or to reprocess after fixing a bug.

Airflow CLI backfill:

# Reprocess 7 days of data after fixing a bug in the revenue calculation
airflow dags backfill \
  --start-date 2026-04-01 \
  --end-date 2026-04-07 \
  --reset-dagruns \
  daily_revenue_pipeline

Critical backfill design rules:

Backfill has its own capacity pool: Don’t let backfill jobs compete with today’s production run for warehouse slots. Create a dedicated Airflow pool for backfills.
max_active_runs during backfill: Airflow will try to run all 7 missed dates in parallel. If your warehouse can’t handle 7x load, set max_active_runs=2 for controlled backfill.
depends_on_past=False during backfill: If depends_on_past=True and Day 1 fails, Days 2-7 are blocked. Temporarily override for backfills.
Idempotency is mandatory for backfill: If your pipeline isn’t idempotent, backfilling 7 days will duplicate 7 days of data. There’s no backfill without idempotency.

Dagster’s Asset-Centric Model (Know This for Modern Stack Interviews)

Dagster shifts the mental model: instead of “run these tasks,” you define “these assets depend on those assets.”

from dagster import asset, AssetIn  @asset
def raw_orders():      return extract_from_source()
# Bronze layer  @asset(ins={"raw_orders": AssetIn()})
def clean_orders(raw_orders):      return deduplicate(raw_orders)
# Silver layer  @asset(ins={"clean_orders": AssetIn()})
def daily_revenue(clean_orders):      return aggregate_revenue(clean_orders)
# Gold layer

Why this matters:

Dagster tracks which assets are “fresh” vs. “stale” automatically
Materializing daily_revenue automatically materializes upstream assets if needed
Asset lineage is first-class — you see which assets depend on which, without separate lineage tools
dbt models are native first-class assets in Dagster

The key insight for interviews: Airflow thinks in tasks (verbs). Dagster thinks in assets (nouns). For teams building data platforms where the product IS the data, asset-centric is a better mental model.

Interview Questions

Q1: “Your Airflow DAG runs at 4 AM, but today it failed at 3:47 AM halfway through a transform task. At 4 AM the scheduler retries. Is your data safe? How would you design for this?”

Model Answer: “Whether the data is safe depends entirely on whether the pipeline was designed for idempotency. If the failed transform wrote directly to the production table, we potentially have a partial write — data from 3:00-3:47 AM but not 3:47-4:00 AM. The retry at 4 AM might append duplicates or miss the gap. This is the classic non-idempotent pipeline problem. The correct design: the transform writes to a staging path scoped to the data interval (e.g., staging/date=2026-04-09/run_id=abc). The production table is never touched until a subsequent publish step atomically overwrites that date’s partition. On retry, the staging path is simply rewritten from scratch — no corruption possible. The publish step only runs after the validation step passes. With this design, the 4 AM retry is completely safe: it starts fresh at the staging step, rewrites staging, passes validation, publishes atomically. I’d also set max_active_runs=1 to prevent two retry runs from conflicting.”

Q2: “How do you monitor pipeline health across 200 DAGs in production?”

Model Answer: “I’d layer three levels of monitoring. First, task-level: Airflow’s native SLA miss callbacks for each DAG, wired to Slack for warnings and PagerDuty for critical tables. This catches ‘the task didn’t finish on time.’ Second, data-level: separate freshness monitors that query actual tables — MAX(load_date) > CURRENT_DATE - 1. Task success doesn’t guarantee data quality. A task that ran successfully but wrote 0 rows still fails the SLA. These run every 30 minutes and alert independently of Airflow. Third, pipeline health dashboard: aggregate metrics across all 200 DAGs — DAGs with failures in the last 24 hours, average task duration trends (rising duration = data growth or performance regression), retry rates by DAG (high retry rate = fragile pipeline or flaky dependency), backlog size (how many intervals are waiting to run). For the 200 DAGs, I’d tier them — 20 tier-1 critical DAGs get PagerDuty alerts and runbooks; 180 tier-2/3 get Slack notifications and daily digest summaries. Alert fatigue is the enemy of effective monitoring.”

Think About This

You’re in a Meta interview. The prompt: “Meta runs thousands of Airflow DAGs processing petabytes of data daily. A downstream dashboard used by the ads revenue team shows stale data. The Airflow UI shows all tasks are green for today’s run. How is this possible, and how would you redesign the system to prevent it?”

Walk through:

How can all tasks be green but data be stale? (The pipeline ran but processed 0 new records — a sensor timeout that returned “success” without finding data, or a transform that silently filtered everything out. Task success ≠ data freshness.)
What’s missing from the monitoring? (Dataset-level freshness monitoring independent of Airflow. SLA defined as “the table must contain data from today by 8 AM,” not “the task must turn green by 8 AM.”)
How do you redesign? (Add a validation task that explicitly checks: row count > 0, max event date = today, no quality check failures. If this task fails, the publish step doesn’t run. Add a freshness monitor that queries the production table every 30 minutes — alerts if MAX(event_date) < CURRENT_DATE.)
What’s the publish gate? (A final task that writes a _SUCCESS marker file or updates a metadata registry. The dashboard reads from a view that only shows data with a valid success marker for the current date. If the pipeline fails, the dashboard shows yesterday’s data with a “stale data” indicator — not silently wrong data.)

The insight: “all tasks green” is a necessary but not sufficient condition for “data is correct and fresh.” Senior DEs design validation gates and freshness monitors that decouple pipeline health from data health.

Quick Reference

Airflow = task-centric, mature, largest ecosystem. Default choice for most teams. Master: data_interval vs execution_date, catchup, max_active_runs, pools, SLA callbacks.
Dagster = asset-centric, first-class lineage, excellent dbt integration. Best for greenfield modern data stacks.
Prefect = function-first, simplest ops, dynamic workflows. Best for small teams or dynamic pipeline generation.
Stage → Validate → Publish: The production-grade pattern. Never write directly to production. Staging + atomic publish = safe retries.
Idempotency = interval-scoped paths + MERGE (not INSERT) + deterministic logic. Non-negotiable for pipelines that retry.
Retry strategy: Transient failures (network, API) → exponential backoff retries. Data quality failures → quarantine, not retry.
SLA monitoring = dataset-level freshness, not just task success. “All tasks green” ≠ “data is correct and fresh.”
Backfill safety net: Idempotency + dedicated capacity pool + max_active_runs control = safe historical reprocessing.

Tomorrow’s Preview

Day 23: Change Data Capture (CDC) — Log-based CDC (Debezium, DMS) vs query-based CDC. Use cases: real-time sync, event sourcing, warehouse updates. Handling deletes, schema changes, and ordering. The connective tissue between your OLTP sources and your lakehouse — one of the most practical topics for data engineering interviews.

Day 22: Workflow orchestration patterns