Phase 1: Foundations & Frameworks | Category: ETL/ELT Workflows
The Core Problem Both Architectures Solve
You need analytics that are both accurate (complete, correct, eventually consistent) AND fast (low-latency, real-time). Historically, batch systems gave you accuracy but with hours of delay. Streaming systems gave you speed but with weaker correctness guarantees (ordering, exactly-once, complex joins). Lambda and Kappa are two competing answers to this tension.
Lambda Architecture: Batch + Speed Layers
Proposed by Nathan Marz (creator of Apache Storm), Lambda runs two parallel pipelines on the same data:
                   ┌─────────────────────────┐
                   │ BATCH LAYER             │
                   │ (Spark, MapReduce)      │
Source ──→ Kafka ──┤ Full recompute daily    ├──→ Serving Layer ──→ Queries
                   │ High accuracy, high     │    (merges both
                   │ latency                 │     views)
                   ├─────────────────────────┤
                   │ SPEED LAYER             │
                   │ (Flink, Storm, Spark SS)│
                   │ Real-time incremental   │
                   │ Low latency, approximate│
                   └─────────────────────────┘
How it works:
- Batch layer: Processes ALL historical data periodically (nightly, hourly). Produces the “canonical truth”: complete, correct, but stale.
- Speed layer: Processes only the data that has arrived since the last batch run, in real time. Produces fast but potentially approximate results.
- Serving layer: Merges batch views + speed views to answer queries. Real-time results are served from the speed layer until the batch catches up and overwrites them.
The key insight: The batch layer is the safety net. Even if the speed layer has bugs, loses events, or produces approximate aggregations, the batch layer will eventually recompute everything correctly and overwrite the speed layer’s results.
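The three layers can be sketched in a few lines of Python. This is a minimal illustration, not how any particular serving store works: it assumes the views are simple per-key counts and that a single batch cutoff timestamp separates the two layers. `Event`, `batch_view`, `speed_view`, and `serve` are hypothetical names introduced here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    key: str        # hypothetical grouping key, e.g. a customer id
    timestamp: int  # event time, epoch seconds


def batch_view(events, cutoff):
    """Batch layer: recompute counts from ALL events up to the cutoff."""
    return Counter(e.key for e in events if e.timestamp <= cutoff)


def speed_view(events, cutoff):
    """Speed layer: count only events newer than the last batch cutoff."""
    return Counter(e.key for e in events if e.timestamp > cutoff)


def serve(batch, speed):
    """Serving layer: answer queries from the batch view plus the speed delta."""
    return batch + speed


events = [Event("a", 10), Event("b", 20), Event("a", 30), Event("a", 40)]

# The last batch run covered everything up to t=20; the speed layer fills the gap.
merged = serve(batch_view(events, cutoff=20), speed_view(events, cutoff=20))

# After the next batch run (cutoff=40) the speed view is empty, same answer.
after_next_batch = serve(batch_view(events, cutoff=40), speed_view(events, cutoff=40))
```

Advancing the cutoff moves events out of the speed view and into the batch view without changing the merged answer, which is the "batch overwrites speed" behavior in miniature.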
Real-world example — Netflix (from interview guides):
- Batch layer: Nightly Spark jobs compute complete viewing history analytics, content performance, A/B test results
- Speed layer: Flink processes real-time viewing events for live dashboards, immediate recommendations
- Serving: Dashboards show real-time numbers (speed layer) with a note “batch-corrected daily at 6 AM”
Kappa Architecture: Stream-Only
Proposed by Jay Kreps (co-creator of Kafka), Kappa eliminates the batch layer entirely:
Source ──→ Kafka (immutable log, long retention) ──→ Stream Processor ──→ Serving Layer ──→ Queries
                                                     (Flink, Spark SS)
             ↑                                              │
             └──── Replay from any offset for reprocessing ─┘
How it works:
- ALL data flows through a single streaming pipeline
- Kafka retains the full event log (days, weeks, or indefinitely with tiered storage)
- Real-time processing is the ONLY processing path
- If logic changes or reprocessing is needed, replay the Kafka log through the updated streaming application
- No separate batch pipeline, no merge logic
The key insight: If your streaming pipeline is correct (handles ordering, late data, exactly-once), why maintain a separate batch pipeline that does the same thing slower? One codebase, one set of semantics, one operational burden.
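The replay idea can be illustrated without a real broker. The sketch below is an in-memory stand-in, not Kafka's API: `EventLog`, `run_pipeline`, `logic_v1`, and `logic_v2` are hypothetical names, and the "refunds subtract" rule is an invented example of a logic change.

```python
class EventLog:
    """Append-only log standing in for a Kafka topic (in-memory stand-in)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset=0):
        # Records are never mutated, so any offset can be re-read later.
        yield from self._records[offset:]


def run_pipeline(log, transform, from_offset=0):
    """Fold the log into per-user totals using the given logic version."""
    totals = {}
    for record in log.read_from(from_offset):
        user, amount = transform(record)
        totals[user] = totals.get(user, 0) + amount
    return totals


def logic_v1(record):
    # Original rule: every record adds its amount.
    return record["user"], record["amount"]


def logic_v2(record):
    # Changed rule (hypothetical): refunds subtract instead of add.
    sign = -1 if record["type"] == "refund" else 1
    return record["user"], sign * record["amount"]


log = EventLog()
log.append({"user": "u1", "type": "order", "amount": 50})
log.append({"user": "u1", "type": "refund", "amount": 20})
log.append({"user": "u2", "type": "order", "amount": 30})

view_v1 = run_pipeline(log, logic_v1)  # output of the original application
view_v2 = run_pipeline(log, logic_v2)  # full replay from offset 0 with new logic
```

The same code path serves both live processing and reprocessing; only the starting offset and the logic version differ.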
Real-world examples:
- Shopify: Presented “It’s Time To Stop Using Lambda Architecture” at Kafka Summit, using Kafka + Flink + Kafka Streams
- Twitter/X: Migrated from Lambda (Hadoop + Kafka on-prem) to Kappa (Kafka on GCP), processing 400B events/day at PB scale
- Spotify: Stream-first architecture for real-time event processing
Head-to-Head Comparison
| Topic | Lambda | Kappa |
|---|---|---|
| Processing paths | Two (batch + stream) | One (stream only) |
| Codebase | Dual: batch logic + stream logic (often different languages/frameworks) | Single: one streaming application |
| Operational complexity | High: maintain, monitor, and debug two pipelines | Lower: one pipeline, but streaming ops is inherently complex |
| Data correctness | Batch layer is the “source of truth,” eventually corrects speed layer errors | Correctness depends entirely on streaming pipeline quality |
| Reprocessing | Re-run batch job on full history (well-understood, reliable) | Replay Kafka log through streaming app (requires sufficient retention + stateful replay capability) |
| Latency | Real-time via speed layer; batch view has hours of delay | Real-time is native for everything |
| Merge complexity | Must reconcile batch and speed layer outputs (the “architectural tax”) | No merge needed: single output |
| Team skills | Works when team has strong batch skills, streaming is secondary | Requires deep streaming expertise as the default |
| Cost | Higher: two compute pipelines, duplicated storage | Lower infra, but streaming compute for historical replays can be expensive |
| Maturity | Batch tools are extremely mature and well-understood | Streaming tools (Flink, Kafka) have matured significantly by 2026 but still require more expertise |
The Lambda Problem: Dual Codebase Drift
The biggest practical issue with Lambda isn’t theoretical — it’s operational. You end up with:
Batch pipeline: Spark SQL → aggregate by customer_id → write to warehouse
Speed pipeline: Flink → aggregate by customer_id → write to real-time store
These are two implementations of the “same” logic. Over time:
- A developer fixes a bug in the batch logic but forgets to update the streaming logic
- The batch pipeline uses a LEFT JOIN but the streaming pipeline uses an INNER JOIN
- Business rules diverge: batch includes returns, streaming doesn’t
The result: batch and speed layers produce different numbers. The serving layer’s “merge” becomes a nightmare of reconciliation logic. This is why many teams have moved to Kappa — one codebase = one truth.
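A toy version of the “returns” divergence shows how quietly this happens. The sketch below is illustrative only: both functions are plain Python rather than Spark or Flink, and `batch_revenue`, `stream_revenue`, and `find_drift` are hypothetical names.

```python
orders = [
    {"customer_id": "c1", "amount": 100, "is_return": False},
    {"customer_id": "c1", "amount": 40, "is_return": True},
    {"customer_id": "c2", "amount": 25, "is_return": False},
]


def batch_revenue(rows):
    """Batch implementation: nets returns out of revenue."""
    totals = {}
    for row in rows:
        signed = -row["amount"] if row["is_return"] else row["amount"]
        totals[row["customer_id"]] = totals.get(row["customer_id"], 0) + signed
    return totals


def stream_revenue(rows):
    """Streaming implementation: silently skips returns (the drifted rule)."""
    totals = {}
    for row in rows:
        if row["is_return"]:
            continue
        totals[row["customer_id"]] = totals.get(row["customer_id"], 0) + row["amount"]
    return totals


def find_drift(batch, stream):
    """Return {key: (batch_value, stream_value)} for every disagreeing key."""
    keys = batch.keys() | stream.keys()
    return {
        k: (batch.get(k), stream.get(k))
        for k in sorted(keys)
        if batch.get(k) != stream.get(k)
    }


drift = find_drift(batch_revenue(orders), stream_revenue(orders))
```

Both functions look like “aggregate by customer_id”, yet customer c1 gets 60 from one and 100 from the other. A check like `find_drift` is the kind of reconciliation logic the serving layer ends up carrying.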
The Kappa Challenge: Reprocessing at Scale
Kappa’s Achilles’ heel is reprocessing. If you need to:
- Fix a bug in your aggregation logic and recompute the last 6 months
- Add a new dimension to your output that didn’t exist before
- Backfill after a schema change
You must replay 6 months of Kafka data through your streaming application. At scale (PB of data), this is:
- Slow: Replay speed is capped by partition count, broker read throughput, and sink write capacity, and records within a partition are read in order; replaying months of data takes hours to days
- Expensive: You need a temporary cluster sized for the replay throughput
- Stateful: If your app maintains state (session windows, running aggregates), the state must be rebuilt from scratch during replay
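A back-of-the-envelope estimate makes the “hours to days” figure concrete. All numbers below are hypothetical (2 TB/day of events, 180 days of history, 64 parallel tasks at 100 MB/s each), and the helper `replay_hours` is introduced here for illustration. It ignores state rebuilds and sink limits, so treat the result as a lower bound.

```python
def replay_hours(total_bytes, per_task_bytes_per_sec, parallel_tasks):
    """Lower-bound wall-clock hours to re-read a log at a sustained rate."""
    aggregate_rate = per_task_bytes_per_sec * parallel_tasks
    return total_bytes / aggregate_rate / 3600


TB = 10**12
MB = 10**6

# Hypothetical workload: 180 days of history at 2 TB/day = 360 TB.
backlog = 180 * 2 * TB

# Hypothetical cluster: 64 parallel tasks, each sustaining 100 MB/s.
hours = replay_hours(backlog, 100 * MB, 64)

# Doubling parallelism halves the estimate, if the topic has enough partitions.
hours_doubled = replay_hours(backlog, 100 * MB, 128)
```

With these assumptions the replay takes roughly 15.6 hours at 64 tasks. Since parallelism cannot exceed the partition count, partitioning decisions made at topic creation bound how fast a later replay can go.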
Practical solutions:
- Kafka tiered storage: Offload old data to S3/GCS, replay from object storage. Solves retention cost but not replay speed.
- Parallel replay: Spin up a new streaming app version alongside the existing one. The new version processes the replay while the old one continues serving real-time. When caught up, switch over.
- Hybrid: Use Kappa for real-time, but keep a batch “escape hatch” for massive reprocessing (this is what most companies actually do).
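The parallel-replay option can be sketched as a lag check followed by a switch. This is a simplified simulation, not a Flink or Kafka Streams deployment procedure: the log is a static Python list, and `Pipeline` and `cut_over` are hypothetical names.

```python
class Pipeline:
    """A pipeline version that consumes a shared log at its own offset."""

    def __init__(self, name, transform):
        self.name = name
        self.transform = transform
        self.offset = 0   # next log position to read
        self.output = []

    def poll(self, log, max_records):
        batch = log[self.offset:self.offset + max_records]
        self.output.extend(self.transform(r) for r in batch)
        self.offset += len(batch)

    def lag(self, log):
        return len(log) - self.offset


def cut_over(log, old, new, replay_batch=4, max_lag=0):
    """Replay `new` alongside `old`; return the version that should serve reads."""
    while new.lag(log) > max_lag:
        old.poll(log, max_records=1)             # old version keeps running
        new.poll(log, max_records=replay_batch)  # new version catches up
    return new


log = list(range(10))
v1 = Pipeline("v1", lambda x: x * 2)
v1.poll(log, max_records=10)          # v1 is already caught up and serving
v2 = Pipeline("v2", lambda x: x * 3)  # corrected logic, starts at offset 0
active = cut_over(log, v1, v2)
```

In a real system the log keeps growing during the replay, so the new version must consume faster than producers write or its lag never reaches the switch-over threshold.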
The 2026 Reality: Neither Pure Lambda Nor Pure Kappa
Most production systems in 2026 are hybrid. The modern lakehouse pattern effectively combines both:
Source ──→ Kafka ──→ Flink/Spark Streaming ──→ Bronze (Iceberg/Delta, append)
                                                 ↓
                                               dbt / Spark Batch ──→ Silver → Gold
                                                 ↓
                                               Real-time OLAP (Druid/ClickHouse)
What this looks like in practice:
- Streaming path: Kafka → Flink → Iceberg tables (near-real-time ingestion, seconds latency)
- Batch path: dbt/Spark transforms on Iceberg tables (hourly/daily, complex joins, full accuracy)
- Serving: BI tools query the batch-produced gold tables; real-time dashboards query the streaming OLAP store
This is “Lambda-like” in that there are two processing paths, but “Kappa-like” in that:
- Both paths read from the same source (Kafka / Iceberg)
- The streaming path writes to the same lakehouse tables
- There’s no separate “merge” layer: the batch job reads what streaming wrote and transforms it further
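The “batch builds on what streaming landed” relationship can be sketched with in-memory tables. This is a toy model under stated assumptions: a Python list stands in for an Iceberg/Delta bronze table, and `stream_ingest`, `build_silver`, `build_gold`, and the record fields are hypothetical.

```python
from collections import defaultdict

bronze = []  # append-only table written by the streaming path


def stream_ingest(event):
    """Streaming path: land each raw event in bronze as it arrives."""
    bronze.append(dict(event))


def build_silver(rows):
    """Batch path, step 1: drop duplicate event_ids and malformed rows."""
    seen, clean = set(), []
    for row in rows:
        if row["event_id"] in seen or row["amount"] is None:
            continue
        seen.add(row["event_id"])
        clean.append(row)
    return clean


def build_gold(rows):
    """Batch path, step 2: aggregate revenue per day."""
    daily = defaultdict(int)
    for row in rows:
        daily[row["day"]] += row["amount"]
    return dict(daily)


for e in [
    {"event_id": 1, "day": "2026-01-01", "amount": 10},
    {"event_id": 1, "day": "2026-01-01", "amount": 10},    # duplicate delivery
    {"event_id": 2, "day": "2026-01-01", "amount": None},  # malformed record
    {"event_id": 3, "day": "2026-01-02", "amount": 7},
]:
    stream_ingest(e)

# The scheduled batch job reads bronze directly; there is nothing to merge.
gold = build_gold(build_silver(bronze))
```

The streaming path stays simple (append only), while deduplication and aggregation live in one place, the batch transforms, so there is no second implementation to drift.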
What to say in the interview: “I don’t think in terms of pure Lambda or pure Kappa anymore. The modern pattern is a unified lakehouse where streaming ingestion lands data continuously, and batch transforms build on top of what streaming produced. Streaming gives me freshness; batch gives me complex, correct analytics. They share the same storage layer (Iceberg/Delta), so there’s no merge problem. If the interview requires me to name one, I’d say this is closer to Kappa for ingestion with batch for heavy transforms — a pragmatic hybrid.”
Decision Framework for Interviews
Does the use case require sub-second latency?
├── YES → Streaming is mandatory
│         Is the logic simple (filter, transform, light aggregation)?
│         ├── YES → Pure Kappa (Kafka → Flink → sink)
│         └── NO → Kappa for ingestion + batch for complex transforms (hybrid)
└── NO → What latency is acceptable?
          ├── Minutes → Micro-batch (Spark Structured Streaming, 30-sec triggers)
          ├── Hours → Pure batch (Spark/dbt, scheduled)
          └── Needs BOTH real-time AND deep historical analysis?
              └── Hybrid: streaming for freshness + batch for depth
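For practice, the tree above can be written as a small function. The function name, parameters, and return strings are hypothetical; it encodes only the branches listed above and is a memory aid rather than a substitute for discussing trade-offs.

```python
def recommend(sub_second, simple_logic=False, latency="hours",
              needs_realtime_and_history=False):
    """Walk the decision tree and return the suggested architecture."""
    if sub_second:
        # Streaming is mandatory; the only question is where heavy logic lives.
        if simple_logic:
            return "pure Kappa"
        return "hybrid: Kappa ingestion + batch transforms"
    if needs_realtime_and_history:
        return "hybrid: streaming for freshness + batch for depth"
    if latency == "minutes":
        return "micro-batch"
    if latency == "hours":
        return "pure batch"
    raise ValueError(f"unsupported latency: {latency!r}")


fraud_alerts = recommend(sub_second=True, simple_logic=True)
cfo_report = recommend(sub_second=False, latency="hours")
ops_dashboard = recommend(sub_second=False, latency="minutes")
```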
When to propose Lambda-style (two explicit paths):
- Regulatory/financial systems where batch must recompute everything for audit
- Team has strong batch skills and limited streaming expertise
- Streaming output is “fast but eventually corrected,” and that is acceptable for the use case
- Complex joins across multiple large datasets that streaming can’t handle efficiently
When to propose Kappa-style (stream-first):
- Event-driven microservices architecture
- Real-time-first use cases (fraud detection, live dashboards, recommendations)
- Team has strong streaming expertise
- Logic is relatively straightforward and can be expressed in a single streaming pipeline
- Kafka retention covers the full reprocessing window needed
Interview Questions
Q1: “You’re designing a data platform for an e-commerce company. They need real-time order tracking for customers AND accurate daily financial reports for the CFO. Lambda or Kappa?”
Model Answer: “This is a textbook case for the modern hybrid approach. For real-time order tracking, I’d build a streaming pipeline: order events from Kafka → Flink for status enrichment and sessionization → a low-latency serving store (Redis or DynamoDB) that the customer-facing API reads from. For the CFO’s financial reports, I’d build batch transforms with dbt: the same order events land in an Iceberg lakehouse via the streaming pipeline, and nightly dbt jobs compute revenue, returns, margins, and reconcile with the payment system. Both paths share the same source data in the lakehouse — streaming writes to bronze/silver continuously, batch reads from silver and builds gold financial tables. I wouldn’t call this Lambda because there’s no explicit ‘merge’ layer — the batch simply transforms what streaming already landed. And I wouldn’t call it pure Kappa because complex financial reconciliation with multiple source joins doesn’t belong in a streaming pipeline. It’s a pragmatic hybrid: stream for speed, batch for depth.”
Q2: “Your company currently runs Lambda architecture. The CTO wants to simplify. How would you migrate to Kappa, and what risks would you flag?”
Model Answer: “I’d migrate incrementally, not big-bang. Step 1: Ensure Kafka retention covers the full reprocessing window — if we need to replay 90 days, Kafka must retain 90 days (tiered storage makes this cost-effective). Step 2: Identify the simplest batch pipeline and rewrite it as a Flink job processing the same Kafka topic. Run both in parallel, compare outputs for correctness — this builds confidence. Step 3: Gradually migrate more pipelines, starting with stateless transforms and moving to stateful ones. Risks I’d flag: (a) Reprocessing time — replaying months of data through Flink is slower than a batch re-run on a Spark cluster reading Parquet; the CTO needs to accept this trade-off. (b) Stateful complexity — if our batch pipelines do complex multi-table joins, replicating that in a streaming pipeline with managed state is significantly harder. (c) Team readiness — streaming debugging and operations require different skills than batch. I’d invest in training before migrating the hardest pipelines. (d) I’d keep a batch ‘escape hatch’ for truly massive reprocessing — the goal is to eliminate the dual-codebase problem, not to dogmatically avoid batch compute entirely.”
Think About This
You’re in an OpenAI interview. The prompt: “We process millions of ChatGPT conversations daily. We need real-time safety monitoring (flag harmful content within seconds) AND weekly aggregate quality reports (response accuracy, hallucination rate, user satisfaction trends). How would you architect this?”
Walk through:
- Is this Lambda, Kappa, or hybrid? (Hybrid: safety monitoring is streaming, quality reports are batch.)
- What does the streaming path look like? (Conversation events → Kafka → Flink → ML safety classifier → alerts topic → action service. End-to-end latency must stay within the few seconds the prompt allows.)
- What does the batch path look like? (Same events land in Iceberg via streaming. Weekly Spark/dbt jobs compute aggregates, join with human evaluation data, build quality dashboards.)
- Could you make this pure Kappa? (The safety monitoring, yes. The weekly quality reports need complex joins across conversations, human evaluations, and model metadata; doing that in a streaming pipeline would be over-engineering. Batch is the right tool.)
- Where do the two paths share infrastructure? (Kafka as the single source, Iceberg as the shared storage. The streaming path writes to Iceberg; the batch path reads from it.)
Quick Reference
- Lambda = batch layer (accuracy) + speed layer (freshness) + serving layer (merge). Dual codebase is the main pain point.
- Kappa = single streaming pipeline + reprocessing via log replay. Simpler, but reprocessing at scale and stateful complexity are real challenges.
- 2026 reality = most companies run a pragmatic hybrid: streaming ingestion into a lakehouse + batch transforms for complex analytics. Same storage, no merge layer.
- Choose stream-first when: sub-second latency, event-driven architecture, straightforward logic, strong streaming team.
- Keep batch when: complex multi-source joins, financial/regulatory accuracy, massive historical reprocessing, team expertise is batch-centric.
- The interview answer is almost never “pure Lambda” or “pure Kappa.” It’s “here’s why I’d use streaming for X and batch for Y, and here’s how they share the same storage layer.”
Tomorrow’s Preview
Day 12: CAP Theorem & Consistency Models — CAP theorem explained with real databases. Strong vs eventual vs causal consistency. PACELC theorem. How to reason about consistency trade-offs in data pipeline design — critical for system design interviews at all your target companies.