Phase 1: Foundations & Frameworks | Category: Storage Systems
The Problem That Created the Lakehouse
By 2018, most companies had two separate systems:
-
Data warehouse (Redshift/Teradata): SQL, ACID, fast analytics — but expensive, closed, and poor support for ML/unstructured data
-
Data lake (S3/HDFS): Cheap, flexible, supports any file format — but no ACID, no schema enforcement, “data swamp” by year two
The lakehouse pattern (Databricks, Hacker News) merges both: object storage costs + open formats + warehouse-grade reliability. By 2026, it’s the dominant production architecture. A senior DE at any of your target companies is expected to design and defend this architecture fluently.
The Data Lake Problem (Why Plain S3 Wasn’t Enough)
Source systems → S3 (raw Parquet files)
This looks simple. In practice, plain S3 tables have critical limitations:
| Problem | Description |
|---|---|
| No ACID | Concurrent writers corrupt files. A failed job leaves partial data. No rollback. |
| No schema enforcement | Anyone can write any schema. A column rename breaks all downstream consumers silently. |
| No deletes/updates | S3 objects are immutable. Updating one record requires rewriting entire partitions. |
| No time travel | Yesterday’s data is gone once overwritten. No way to debug “what happened to Monday’s data.” |
| Inefficient deletes | GDPR right-to-deletion requires full partition rewrites. Painful at scale. |
| Small file explosion | Streaming writes create millions of tiny files. Query performance collapses. |
Open table formats (Delta Lake, Iceberg, Hudi) solve every one of these. They add a metadata layer on top of Parquet files stored in S3/GCS/ADLS, transforming a raw file store into a transactional table system.
How Open Table Formats Work
The core concept is a transaction log maintained alongside data files:
S3 bucket/
├── data/ │
├── part-00001.parquet (file snapshot 1) │
├── part-00002.parquet (file snapshot 2) │
└── part-00003.parquet (new file from latest write)
└── metadata/ ← THE MAGIC
├── v1.metadata.json (snapshot 1: which files = table state)
├── v2.metadata.json (snapshot 2: added part-00003)
└── current → v2.metadata.json
Writing: A new write creates new Parquet files + a new metadata snapshot pointing to the current set of valid files. The old snapshot still exists. This is MVCC (Multi-Version Concurrency Control) on object storage.
Reading: A query reads the current snapshot’s manifest to find which files constitute the table. Concurrent readers always see a consistent view.
Time travel: Query against any past snapshot to see historical data.
Deletes/updates: Mark old files as deleted in metadata, write new files. The actual deletion is deferred to compaction (periodic background process that rewrites small files and removes deleted data).
The Three Table Formats
Delta Lake
Origin: Databricks (2019). Born from the need to bring reliability to Spark-based data pipelines.
Metadata: _delta_log/ directory with JSON commit files. Every commit is a JSON log entry. Every 10 commits, a Parquet checkpoint is written for fast state reconstruction.
Key features:
-
ACID transactions: Fully supported via the transaction log
-
Schema enforcement: Rejects writes that violate the table schema
-
Schema evolution: mergeSchema option to add new columns
-
Time travel: VERSION AS OF or TIMESTAMP AS OF queries
-
Z-Order clustering: Multi-dimensional co-location for high-cardinality column pruning
-
Liquid clustering (Delta 3.x): Adaptive clustering that replaces manual Z-Order
-
Change Data Feed: Captures row-level changes for incremental downstream processing
-
Delta Sharing: Securely share live tables across organizations without copying data
-
UniForm (Delta 3.x): Exposes Delta tables as Iceberg-compatible for multi-engine access
Ecosystem: Deep Spark/Databricks integration. Supported by Spark, Flink, Trino, Presto, BigQuery (via external tables), Snowflake (via external tables).
Best for: Databricks-first architectures, teams standardized on Spark, Spark Structured Streaming workloads.
Apache Iceberg
Origin: Netflix (2018). Born to solve cloud storage scale problems — specifically, listing millions of files on S3 was taking minutes for metadata operations.
Metadata: Three-layer architecture:
Catalog (tracks current metadata pointer)
↓ Metadata file (table schema, partition spec, snapshot history)
↓ Manifest list (which manifest files belong to this snapshot)
↓ Manifest files (which data files, with min/max statistics per file)
↓ Data files (Parquet/ORC/Avro on S3)
This hierarchical structure means pruning happens at multiple levels — manifest → file → column block — enabling efficient queries at petabyte scale.
Key features:
-
Hidden partitioning: You don’t write partition columns in queries. Iceberg handles partition pruning transparently. Queries don’t need to know the physical partition layout.
-
Partition evolution: Change partition strategy for new data without rewriting old data. Old data keeps its original partition; new data uses the new partition spec.
-
Schema evolution: Full support including column renames, reorders, and type promotions — without rewriting data files.
-
Snapshot isolation: Each write creates a new snapshot. Readers always see a consistent snapshot. Multiple engines read simultaneously without conflict.
-
Multi-engine first: Engine-agnostic design. Spark, Flink, Trino, Presto, DuckDB, Snowflake, BigQuery all read Iceberg natively.
-
Row-level deletes: Copy-on-write or merge-on-read delete modes.
Ecosystem: Netflix, Apple, LinkedIn, Adobe. Cloud-native and engine-neutral — the choice when you use multiple query engines.
Best for: Multi-engine environments (Spark + Trino + Flink), avoiding vendor lock-in, complex partition evolution scenarios.
Apache Hudi
Origin: Uber (2016). Born to handle billions of ride events requiring frequent upserts — correcting fare calculations, updating driver ratings in near-real-time.
Metadata: Timeline-based. Every operation (commit, compaction, clean) is recorded on a timeline. LSM-tree based metadata indexing for fast record-level lookups.
Key features:
-
Record-level indexing: Bloom filters, HBase indexes, or Bucket indexes for fast key-based lookups. Enables O(1) upsert performance regardless of table size.
-
Incremental pipelines: Native change streams. Consume only records changed since your last read — without reprocessing the full table.
-
Two table types:
-
Copy-on-Write (COW): Data files rewritten on each update. Optimized for read-heavy workloads.
-
Merge-on-Read (MOR): Updates written as delta logs, merged at read time. Optimized for write-heavy/streaming ingestion.
-
-
DeltaStreamer: Built-in Kafka/CDC ingestion tool with exactly-once semantics. Battle-tested at Uber/Amazon scale.
-
Streaming-first: Designed for continuous ingestion. Unique dual-format (row for streaming writes, columnar for batch reads) enables low-latency ingest AND fast analytics.
Best for: CDC ingestion, high-frequency upserts/deletes, streaming-first architectures, incremental ETL patterns.
Head-to-Head Comparison
| Topic | Details |
|---|---|
| Origin | Databricks Netflix Uber |
| Primary strength | Spark ecosystem, simple ops Multi-engine, partition evolution Streaming upserts, CDC |
| Metadata model | Transaction log (JSON + Parquet checkpoints) Hierarchical manifests Timeline + record-level indexes |
| Hidden partitioning | No (manual partition columns) Yes No |
| Partition evolution | Limited Full Limited |
| Schema evolution | Good (enforcement + evolution) Best (most flexible) Good |
| Upsert performance | Good Good Best (record indexes) |
| Incremental reads | Change Data Feed (requires enabling) Append-only incremental Native, first-class |
| Multi-engine | Good (Uni Form adds Iceberg compat) Best Good |
| Streaming ingestion | Good (Structured Streaming) Good Best (MOR + Delta Streamer) |
| Ecosystem | Databricks, AWS, Azure Neutral (all clouds/engines) AWS, Uber, Amazon scale |
| Operational simplicity | Highest in Databricks High Medium (more configuration) |
The 2026 decision rule:
-
Databricks / Spark lakehouse → Delta Lake
-
Multi-engine lakehouse (Spark + Trino + Flink) → Iceberg
-
CDC / streaming upserts / high-frequency updates → Hudi
The convergence: Delta Lake UniForm now exposes Delta tables as Iceberg-compatible. The choice is becoming less binary — but you still pick a primary format.
The Medallion Architecture: The Standard Lakehouse Pattern
This is the answer to “how do you organize a lakehouse?” at every interview:
Bronze (Raw) Silver (Clean) Gold (Curated)
_______________________________
─────────────────
_______________________________
• Append-only
• Deduped
• Star schema
• Schema-on-read
• Typed
• Aggregates
• Source-aligned
• Validated
• OBTs for ML
• Full history
• 3NF-like
• Conformed dims
• Iceberg/Delta
• Iceberg/Delta
• Warehouse SQL
on S3
on S3
• BI-optimized
Bronze layer: Land exactly what the source sent. No transformation. No filtering. The audit trail. If anything goes wrong downstream, bronze is your reprocessing source.
Silver layer: Cleaned, typed, deduplicated, validated. Source-aligned tables. 3NF-like structure for flexibility. Multiple gold layer models can draw from the same silver table.
Gold layer: Business-ready. Star schema for BI. Wide denormalized feature tables for ML. Pre-aggregated summary tables for dashboards. This is what analysts query.
Key Lakehouse Capabilities for Interviews
ACID on object storage (S3 is not a database):
-
S3 objects are immutable — you can’t update a byte within a file
-
Table formats work around this: writes create new files, metadata atomically points to the new file set
-
Multi-writer conflicts are resolved via optimistic concurrency control (OCC): detect conflict at commit time, retry if needed
Time travel (critical for debugging pipelines):
-- Delta Lake
SELECT * FROM events VERSION AS OF 42
SELECT * FROM events TIMESTAMP AS OF '2026-03-15'
-- Iceberg
SELECT * FROM events FOR VERSION AS OF 42
SELECT * FROM events FOR TIMESTAMP AS OF TIMESTAMP '2026-03-15 00:00:00'
Use cases: Debug a bad ETL run (compare before/after), reprocess historical data from a specific point, comply with audit requests, train ML models on data as-it-was at time T.
Schema evolution (pipelines don’t break when source adds columns):
OperationDelta LakeIcebergAdd column✅ (with mergeSchema)✅ (automatic)Drop column✅✅Rename column✅✅ (no data rewrite)Change typeLimited✅ (type promotion)Reorder columns❌✅
Compaction (managing small files from streaming writes):Streaming pipelines write many small files. Over time this degrades query performance. Compaction merges small files into optimal-sized (128-512 MB) files.
-
Delta: OPTIMIZE command, or auto-optimize in Databricks
-
Iceberg: rewrite_data_files procedure
-
Hudi: Automatic inline or async compaction
Real-World Lakehouse at Your Target Companies
| Topic | Details |
|---|---|
| Netflix | Iceberg (they built it) S3 + Iceberg + Spark for batch, Flink for streaming, Trino for ad-hoc queries, Druid for real-time OLAP |
| Uber | Hudi (they built it) Petabyte-scale, streaming CDC, real-time ride data with frequent upserts |
| Meta | Custom + Hive/Iceberg Internal systems at hyper-scale |
| Big Lake (Iceberg-compatible) GCS + Big Lake Metastore + Big Query/Dataproc | |
| Amazon | S3 + Iceberg/HudiAWS-native: EMR, Glue, Athena all support both formats |
Interview Questions
Q1: “Why would you choose a lakehouse over just loading everything into BigQuery?”
Model Answer: “Three reasons. First, cost at scale: storing petabytes of raw and historical data in BigQuery at $0.02/GB is significantly more expensive than S3 at $0.023/GB, and for cold data that’s rarely queried, S3 Glacier is $0.004/GB. The lakehouse stores 80% of data on cheap object storage, loading only hot/curated data into BigQuery for frequent queries. Second, flexibility: ML workloads need raw data and large-scale feature extraction — Spark running directly on Iceberg tables in S3 is more cost-effective than loading terabytes into BigQuery for every training run. Third, openness: a lakehouse on S3 + Iceberg can be queried by Spark, Trino, Flink, BigQuery, Snowflake, and DuckDB simultaneously. BigQuery creates vendor lock-in. That said, I wouldn’t eliminate BigQuery — I’d use it as the gold layer for BI workloads where its serverless query engine and BI tool integrations add clear value. The lakehouse handles the raw and silver layers; BigQuery handles the gold serving layer.”
Q2: “Your streaming pipeline writes 10K events/sec to an Iceberg table. After two weeks, queries are getting slower. What’s happening and how do you fix it?”
Model Answer: “Small file problem. At 10K events/sec with micro-batch commits every 30 seconds, you’re creating 2 files × 2,880 batches/day = 5,760 files per day. After two weeks, ~80K small files. Query engines must open each file to check min/max statistics and read data — massive metadata overhead and I/O amplification. Fix: implement regular compaction. For Iceberg, I’d run rewrite_data_files as a scheduled Spark job targeting files smaller than 128 MB, consolidating them into 256-512 MB target files. I’d also run expire_snapshots to remove old snapshot metadata — two weeks of 30-second commits accumulates millions of metadata entries. Going forward, I’d use Flink’s streaming write with larger buffer intervals (5-10 minutes instead of 30 seconds) to reduce write frequency. Or use Hudi’s MOR table type, which is specifically designed for high-frequency streaming writes — it buffers delta logs and merges them during compaction, keeping file counts manageable. The principle: streaming writes and analytics reads have opposing file size preferences. Compaction bridges that gap.”
Think About This
You’re in an Anthropic interview. The prompt: “We need to store and process conversation data for model training. We have 500 million conversations, growing at 5 million/day. Each conversation is ~10KB. We need to: (1) ingest new conversations in near-real-time, (2) run nightly data quality and PII scrubbing jobs, (3) provide data scientists point-in-time access to training datasets as they existed on any historical date, (4) support GDPR right-to-deletion requests.”
Walk through:
-
What table format? (Iceberg — multi-engine access for data science, partition evolution as data patterns change, strong schema evolution, and snapshot isolation for point-in-time reads. Netflix built it for exactly this kind of scale.)
-
What does the medallion architecture look like? (Bronze: raw conversations on S3 Iceberg, append-only. Silver: PII-scrubbed, quality-filtered, deduplicated. Gold: versioned training dataset snapshots.)
-
How do you handle point-in-time access? (Iceberg time travel — SELECT * FROM conversations FOR TIMESTAMP AS OF ‘2026-01-01’. Training dataset versions are tagged snapshots.)
-
How do you handle GDPR deletion? (Iceberg row-level deletes — mark rows for deletion in metadata, MOR delete files. Compaction physically removes the data. Iceberg tracks which snapshots a deleted row appeared in for audit compliance.)
-
What’s the daily data volume? (5M conversations × 10KB = 50 GB/day. Annual: ~18 TB. Reasonable for S3 + Iceberg with date partitioning.)
Quick Reference
-
Lakehouse = object storage (S3/GCS) + open table format (Iceberg/Delta/Hudi) + multiple query engines. Warehouse reliability at data lake cost.
-
Three formats: Delta (Databricks ecosystem), Iceberg (multi-engine, partition evolution), Hudi (streaming upserts, CDC)
-
All three have: ACID, time travel, schema evolution, compaction
-
Medallion: Bronze (raw, append-only) → Silver (clean, source-aligned) → Gold (curated, BI-ready)
-
Small file problem: Streaming writes create many small files. Compaction merges them. Schedule it or use Hudi MOR.
-
Decision rule: Databricks → Delta. Multi-engine → Iceberg. High-frequency upserts/CDC → Hudi.
-
Time travel is a killer feature: Debug pipelines, reprocess from checkpoints, ML reproducibility, GDPR compliance.
Tomorrow’s Preview
Day 18: File Formats & Compression — Parquet vs ORC vs Avro vs JSON. Row-oriented vs columnar. Compression algorithms (Snappy, Zstd, Gzip). How your file format choice impacts query performance, storage cost, and pipeline design at your target companies.