Phase 1: Foundations & Frameworks | Category: ETL/ELT Workflows

The Shift That Defined Modern Data Engineering

If there’s one architectural choice that separates legacy data engineering from the modern practice, it’s the move from ETL to ELT. As a senior DE at Amazon, you’ve lived through this transition. In interviews at Meta, Netflix, Google, OpenAI, and Anthropic, you need to articulate not just the difference, but WHY the shift happened, when ETL still wins, and how the hybrid ELT pattern dominates greenfield architectures in 2026.

ETL: Extract, Transform, Load

Data flow: Source → External transformation engine → Target warehouse/lake

Source Systems
    ↓ Extract
Staging Area (external compute)
    ↓ Transform (cleanse, join, aggregate, apply business rules)
Data Warehouse
    ↓ Load (final, cleaned data)

How it works: Data is extracted from source systems, transformed in a separate processing engine (Informatica, Talend, SSIS, custom Spark jobs on a dedicated cluster), and only then loaded into the destination. The target only ever sees clean, conformed data.

Characteristics:

AspectDetail
Transform locationExternal engine (separate from storage/warehouse)
Raw data preserved?No — only transformed data reaches the warehouse
Compute modelDedicated ETL servers/clusters, sized for peak transform workload
Iteration speedSlow — changing transformation logic requires reprocessing from source
Data quality controlStrong — enforced before data enters the warehouse
Cost profilePredictable compute cost, but expensive to scale
Typical toolsInformatica, Talend, SSIS, AWS Glue (in ETL mode), custom Spark on EMR

When ETL still makes sense:

  • Compliance-first environments: PII must be stripped/masked BEFORE it touches the warehouse (HIPAA, GDPR). You cannot land raw data containing sensitive fields.

  • Legacy on-prem warehouses: Systems like Teradata, Oracle, Netezza have limited compute. Transforming before loading reduces warehouse load.

  • Fragile source systems: If the source is a production OLTP database, you want to extract once, stage externally, and minimize re-extractions.

  • Strict data contracts: When downstream consumers require guaranteed schema and quality, pre-transform enforcement prevents bad data from propagating.

ELT: Extract, Load, Transform

Data flow: Source → Target (raw) → Transform in-place using target’s compute

Source Systems
    ↓ Extract
Data Lake / Warehouse (raw/bronze layer)
    ↓ Load (raw, unmodified)
    ↓ Transform IN-PLACE using warehouse/lakehouse compute
    ↓ Serve (silver/gold layers)

How it works: Data is extracted and immediately loaded into the target in raw form. Transformations happen inside the target system using its own compute engine — BigQuery SQL, Snowflake SQL, Spark on Databricks, dbt models. The raw data is preserved; transformations create new tables/views on top.

Characteristics:

AspectDetail
Transform locationInside the target (warehouse/lakehouse compute)
Raw data preserved?Yes — bronze layer retains source fidelity
Compute modelElastic, pay-per-query or auto-scaling (cloud-native)
Iteration speedFast — raw data already loaded, just change SQL and re-run
Data quality controlApplied in transformation layer (silver), not at ingestion
Cost profileVariable — depends on query volume and compute usage
Typical toolsdbt, Spark (on Databricks/EMR), Big Query scheduled queries, Snowflake Tasks

The Core Trade-off Table

DimensionETLELT
Transform timingBefore loadAfter load
Raw data availabilityLost (only transformed data stored)Preserved (can reprocess anytime)
Iteration speedSlow — re-extract + re-transformFast — re-transform from raw
Compute elasticityFixed/provisionedElastic/serverless
Storage costLower (only final data stored)Higher (raw + transformed)
Compute costPredictable but inflexibleVariable but scalable
Schema flexibilitySchema-on-write (rigid)Schema-on-read (flexible)
Data quality enforcementAt ingestion boundaryAt transformation layer
Reprocessing/backfillPainful — must re-extract from sourceEasy — re-run transforms on raw data
Tool complexityComplex ETL engine (Informatica, SSIS)SQL-first tools (dbt) + orchestration
Best forOn-prem, compliance-heavy, legacyCloud-native, lakehouse, modern stack

Why ELT Won: The Three Shifts

1. Cloud storage became essentially free

The main argument for ETL was “don’t waste expensive warehouse storage on raw data.” When S3 costs $0.023/GB/month and BigQuery storage is $0.02/GB/month, that argument collapses. Storing 10 TB of raw data costs ~$200/month. The value of being able to reprocess from raw data far exceeds that cost.

2. Compute became elastic

ETL required sizing a transformation cluster for peak load. ELT leverages the target warehouse’s elastic compute — BigQuery auto-scales, Snowflake spins up virtual warehouses on demand, Databricks scales Spark clusters dynamically. You pay for what you use, and you never under-provision.

3. dbt changed the transformation paradigm

dbt (data build tool) made SQL-first transformations a first-class engineering practice with version control, testing, documentation, lineage, and CI/CD — all the things ETL tools had but SQL scripts lacked. This democratized the “T” in ELT and made analytics engineers (SQL-proficient, not necessarily Python-proficient) productive.

The Modern ELT Stack

Here’s how the ELT pattern maps to real tools — know this for interviews:

EXTRACT + LOAD                    TRANSFORM                     ORCHESTRATE

_______________________________

─────────────

_______________________________

Fivetran / Airbyte
dbt (SQL-first)
Airflow
(managed connectors)
Spark/PySpark
Dagster
AWS DMS / Datastream
BigQuery SQL
Prefect
(CDC from databases)
Snowflake Tasks
dbt Cloud
Kafka / Debezium
Dataflow (Beam)
Step Functions
(real-time streaming)
AWS Glue
Custom API extractors
Polars/DuckDB (small data)
↓ Raw data → Bronze
↓ Silver → Gold
↓ Schedule + monitor

Company-specific stacks to know:

CompanyLikely ELT Stack
MetaCustom ingestion → Spark → Hive/Presto. Internal tools, but same ELT principle.
NetflixS3 → Spark → Iceberg tables → Trino for querying. Heavy ELT with internal orchestration.
GooglePub/Sub → Dataflow (streaming) → Big Query. Or batch: GCS → Big Query load → Big Query SQL transforms.
OpenAILikely: custom ingestion → Spark/Ray → Delta Lake/Iceberg → dbt or custom transforms.
AnthropicSimilar to OpenAI: streaming ingestion → distributed processing → lakehouse transforms.

Push-Down Compute: The Key ELT Concept

“Push-down” means the transformation logic is pushed into the engine where data already lives, rather than moving data to a separate engine.

ETL approach (pull data out):

10 TB in BigQuery → Extract to Spark cluster → Transform → Write back to BigQuery

Network transfer: 10 TB out + 10 TB back in = 20 TB moved. Slow, expensive.

ELT push-down approach (compute where data lives):

10 TB in BigQuery → Run SQL transformation IN BigQuery → New table in BigQuery

Network transfer: 0 TB. The data never leaves. This is why ELT is faster and cheaper at scale.

dbt is a push-down tool: When you write a dbt model, you write a SQL SELECT. dbt compiles it into a CREATE TABLE AS SELECT or CREATE VIEW and runs it inside your warehouse. The data never leaves the warehouse. This is pure push-down compute.

Spark as ELT: When Spark reads from and writes to the same lakehouse (Delta Lake on S3), it’s also push-down — compute and storage are in the same ecosystem, even if technically separate services.

When the Answer Is “Both” — The Hybrid Pattern

In 2026, most production architectures are hybrid ELT: ELT as the default, with selective ETL for specific requirements.

Source → [ETL: PII masking, schema validation] → Bronze (raw-ish)  Bronze → [ELT: dbt/Spark transforms in-place] → Silver (cleaned)  Silver → [ELT: aggregation, business rules] → Gold (serving)

Where ETL still appears in an ELT-dominant architecture:

  • Ingestion layer: Fivetran/Airbyte do light transformation during extract (type casting, deduplication, flattening nested JSON). This is technically ETL.

  • PII/compliance gate: A Spark job strips PII before data lands in the lake. Data that reaches bronze is already anonymized. This is ETL by necessity.

  • Schema enforcement at ingestion: Validating that incoming data matches expected schema before loading prevents garbage in the lake.

  • Cross-system joins during ingestion: If you need to enrich streaming events with a lookup table during ingestion (e.g., geo-IP resolution), that’s a transform-before-load step.

What to say in the interview: “I default to ELT — land raw data, transform in-place using the warehouse’s compute. But I apply ETL patterns selectively: PII masking before data reaches the lake, schema validation at ingestion to prevent garbage, and any compliance-required transformations that must happen before storage. The architecture is ELT-first with ETL guardrails.”

The dbt Revolution: Why It Matters for Interviews

dbt deserves special attention because it’s the tool that made ELT practical at scale and it comes up frequently in interviews.

What dbt does:

  • Defines transforms as SQL SELECT statements (models)

  • Manages dependencies between models (DAG)

  • Runs tests (not null, unique, accepted values, referential integrity)

  • Generates documentation and lineage automatically

  • Supports version control (Git), CI/CD, and environments (dev/staging/prod)

  • Compiles to native SQL for your warehouse — zero data movement

What dbt does NOT do:

  • Extraction or loading (it’s the “T” only — pair with Fivetran/Airbyte for E+L)

  • Real-time/streaming processing (batch/scheduled only)

  • Complex procedural logic (it’s SQL-first; Python models exist but are secondary)

  • Orchestration (needs Airflow/Dagster/dbt Cloud for scheduling)

dbt in the medallion architecture:

Bronze (raw) → dbt staging models → Silver (cleaned, typed, deduped)  Silver → dbt mart models → Gold (star schema, aggregates, OBTs)

Each dbt model is a SQL SELECT with ref() calls to other models, forming a DAG that dbt compiles and executes in dependency order.

Interview Questions

Q1: “You’re building a new analytics platform from scratch on GCP. Would you use ETL or ELT, and why?”

Model Answer: “ELT, without question, for a greenfield GCP platform. Here’s why: BigQuery has elastic compute that scales automatically — there’s no reason to maintain a separate transformation engine. I’d use Pub/Sub or Cloud Storage for ingestion, land raw data directly into BigQuery’s bronze dataset, then transform in-place using dbt running BigQuery SQL. Raw data is preserved for reprocessing, transformations are version-controlled and tested, and I never move data out of BigQuery for processing. The exception: if we’re ingesting data with PII that must be masked before storage under GDPR, I’d add a Dataflow job (ETL pattern) between ingestion and bronze to strip/tokenize PII fields. Everything downstream of bronze is pure ELT. Cost-wise, BigQuery’s on-demand pricing means I only pay for compute when transformations run, and storage at $0.02/GB/month for compressed columnar data is negligible.”

Q2: “Your company has a legacy Informatica ETL pipeline that takes 8 hours to run nightly. The business wants data refreshed every 15 minutes. How would you modernize this?”

Model Answer: “This is a classic ETL-to-ELT migration. The 8-hour runtime is likely dominated by data movement — extracting from source, transforming in Informatica’s compute engine, then loading into the warehouse. Step 1: Replace the extract with CDC (Debezium or AWS DMS) to capture incremental changes from source databases and stream them into a lake or warehouse bronze layer. This eliminates full-extract overhead. Step 2: Replace Informatica transforms with dbt models running push-down SQL in the warehouse. No data movement, just in-place computation. Step 3: Schedule dbt runs every 15 minutes using Airflow or dbt Cloud, processing only the incremental data from the last window. The transformation that took 8 hours on full data might take 2-3 minutes on a 15-minute increment. Step 4: Validate with parallel runs — keep Informatica running alongside the new pipeline until we confirm data matches. The key architectural shift is from ‘move data to compute’ to ‘push compute to data,’ combined with incremental processing instead of full refreshes.”

Think About This

You’re in an Anthropic interview. The prompt: “We process millions of conversations daily for safety evaluation. Currently, a nightly batch job extracts conversations from the production database, transforms them (tokenizes, classifies, computes safety scores), and loads results into an analytics warehouse. Researchers want results within 30 minutes of a conversation happening. How would you redesign this?”

Walk through:

  1. Which parts of this are currently ETL, and where is the bottleneck? (Full extract from production DB is the bottleneck — ETL pattern)

  2. How would you make this incremental? (CDC from the production DB, or publish conversation events to a stream)

  3. Where does the transformation happen in the new design? (Safety scoring likely requires ML inference — this can’t be pure SQL push-down. It’s a hybrid: streaming ingestion → ML inference service → results loaded into warehouse → dbt transforms for aggregation and reporting)

  4. Is this pure ELT? (No — the ML inference step is a transform-before-load pattern for the safety scores. But the downstream analytics transforms are ELT. This is a realistic hybrid.)

The insight: not everything fits neatly into “ETL or ELT.” ML inference pipelines, real-time scoring, and complex enrichment are transform-before-load by nature. The downstream aggregation and reporting are ELT. Senior candidates recognize this and design for both within the same architecture.

Quick Reference

  • ETL = transform before load (external compute). Still valid for PII masking, compliance, legacy systems, and ML inference.

  • ELT = load then transform in-place (push-down compute). The modern default for cloud-native analytics. Storage is cheap, compute is elastic, raw data is preserved.

  • Push-down compute is the key concept: move the logic to the data, not the data to the logic. dbt, BigQuery SQL, Spark on lakehouse all do this.

  • The 2026 reality is hybrid: ELT-first with selective ETL guardrails for compliance, PII, schema enforcement, and ML inference.

  • dbt = the “T” in ELT: SQL-first, version-controlled, tested, documented transformations running natively in your warehouse. Know its capabilities AND its limitations (no streaming, no extraction, no complex procedural logic).

Tomorrow’s Preview

Day 9: Batch Processing Architecture — Spark internals (DAG, partitioning, shuffles, stages), handling large-scale batch jobs, idempotency and reprocessing patterns, and how to discuss Spark architecture decisions in interviews at your target companies.