Phase 2: Deep Dives | Category: Advanced Streaming
The Core Insight: The Same Data, Two Views
The stream-table duality is one of the most powerful mental models in modern data engineering. As Julian Hyde (Apache Calcite pioneer) put it: “Streams are like derivatives of tables, and tables are integrals of streams.”
- A table is a snapshot of state at a point in time — all the accumulated changes from the beginning of time, collapsed into current values
- A stream is the changelog of a table — the sequence of every insert, update, and delete that produced the current table state
They are mathematically dual: you can always derive one from the other.
Table
(current state): user_id | balance | last_updated --------+---------+------------- u-001 | $350 | 2026-04-12 u-002 | $120 | 2026-04-11 Stream (changelog that produced the table): {op: INSERT, user_id: u-001, balance: $0} {op: UPDATE, user_id: u-001, balance: $100} // deposit $100 {op: UPDATE, user_id: u-001, balance: $450} // deposit $350 {op: UPDATE, user_id: u-001, balance: $350} // withdraw $100 {op: INSERT, user_id: u-002, balance: $120} Table = last state reached by applying all stream operations in order Stream = complete history of every state transition
Why this matters for data engineering: Every Kafka topic is a stream. Every database table can be streamed (via CDC). When you apply aggregations (GROUP BY), you materialize a stream into a table. When you observe changes to a table (CDC), you produce a stream. Your entire data architecture is just transformations between these two forms.
Flink’s Dynamic Tables: SQL on Streams
Flink SQL is built entirely on the stream-table duality concept. Flink treats every data source as a dynamic table — a table that changes over time.
A Flink streaming job expressed in SQL:
--
Create a source table backed by a Kafka topic -- This is a STREAM presented as a TABLE CREATE TABLE orders ( order_id STRING, user_id STRING, amount DECIMAL(10,2), status STRING, order_time TIMESTAMP(3), WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND ) WITH ( 'connector' = 'kafka', 'topic' = 'orders', 'format' = 'json' ); -- Run a CONTINUOUS QUERY that produces a dynamic table -- This query runs forever and updates as new orders arrive SELECT user_id, COUNT(*) AS order_count, SUM(amount) AS total_spend FROM orders WHERE status = 'completed' GROUP BY user_id;
This query looks identical to a batch SQL query, but in Flink it runs continuously and produces an updatable result table — every time a new order arrives, the affected user_id row in the result is updated.
The changelog output: When Flink writes this dynamic table to a downstream system, it emits a changelog stream:
{op: INSERT, user_id: "u-001", order_count: 1, total_spend: 99.99} {op: UPDATE_BEFORE, user_id: "u-001", order_count: 1, total_spend: 99.99} {op: UPDATE_AFTER, user_id: "u-001", order_count: 2, total_spend: 249.98}
This is the changelog that a downstream consumer (Kafka topic, database, another Flink job) receives to maintain its materialized view of the query result.
Flink’s four changelog operations:
- +I (INSERT): new row
- -U (UPDATE_BEFORE): old value before update
- +U (UPDATE_AFTER): new value after update
- -D (DELETE): row removed
The Materialized View Spectrum
There’s a spectrum of “materialized view” implementations with very different freshness, cost, and complexity trade-offs:
1. Traditional Materialized Views (Warehouse Layer)
--
BigQuery / Snowflake CREATE MATERIALIZED VIEW gold.daily_order_summary OPTIONS (refresh_interval_minutes = 60) AS SELECT order_date, user_segment, COUNT(*) AS orders, SUM(amount) AS revenue FROM gold.fact_orders GROUP BY order_date, user_segment;
Mechanism: Full recompute (or incremental recompute) triggered by a refresh schedule or base table change.
Freshness: Minutes to hours (limited by refresh interval)
Best for: Pre-aggregating expensive queries for BI dashboards. BigQuery auto-rewrites queries against base tables to use the materialized view transparently.
The refresh storm problem (per DataVidhya): “If 50 views depend on the same base table, a single load can trigger 50 concurrent rebuilds, saturating your warehouse compute.” Mitigation: stagger refresh intervals, use incremental refresh where available, limit materialized view fan-out from high-write tables.
2. Streaming Materialized Views (Flink SQL / ksqlDB)
--
Flink SQL: this view updates in REAL-TIME as events arrive in Kafka CREATE VIEW user_order_stats AS SELECT user_id, COUNT(*) AS order_count, SUM(amount) AS total_spend, MAX(order_time) AS last_order_time FROM orders GROUP BY user_id; -- Or persist to an external system: INSERT INTO user_stats_sink SELECT * FROM user_order_stats;
Mechanism: Incremental computation — only the changed portion of the view is recomputed when new events arrive. Flink maintains state (the current aggregated values per user_id in RocksDB) and applies incremental updates.
Freshness: Seconds (bounded by checkpoint interval and watermark)
Best for: Real-time dashboards, serving layers that need current state, reducing load on databases by pre-materializing the query.
ksqlDB equivalent:
--
ksqlDB: similar streaming materialized view CREATE TABLE user_order_stats AS SELECT user_id, COUNT(*) AS order_count, SUM(amount) AS total_spend FROM orders_stream GROUP BY user_id EMIT CHANGES;
3. Incrementally Maintained Views (Materialize DB)
Materialize is a streaming SQL database that maintains ALL views incrementally — not just simple aggregations, but arbitrary SQL including multi-table JOINs — with sub-second freshness.
--
Materialize: this multi-table join view updates in milliseconds CREATE MATERIALIZED VIEW enriched_orders AS SELECT o.order_id, o.amount, u.user_tier, u.country, p.product_category, o.order_time FROM orders o JOIN users u ON o.user_id = u.user_id JOIN products p ON o.product_id = p.product_id;
As a new order arrives in Kafka → Materialize updates only the affected row in enriched_orders in milliseconds — without rerunning the full JOIN. This is powered by differential dataflow (compute incremental changes to joins, not full recomputation).
The dbt integration: With Materialize, you write a dbt model once. The first dbt run creates the materialized view. You never run it again — Materialize keeps it current automatically. This is genuinely transformative for teams who want continuous real-time views with a batch modeling workflow.
Event Sourcing & CQRS: The Architectural Pattern
These concepts appear in senior DE interviews when the conversation touches on system design for microservices + data platforms.
Event Sourcing: Instead of storing current state, store the FULL HISTORY of events. Current state = replay all events.
Traditional
DB: Event Sourcing (Kafka as event log): ┌─────────────────────┐ Kafka topic "account-events": │ accounts │ {event: "opened", account_id: "a1", amount: 0} │ id | balance │ → {event: "deposit", account_id: "a1", amount: 100} │ ----+-------- │ {event: "deposit", account_id: "a1", amount: 50} │ a1 | $150 │ {event: "withdraw", account_id: "a1", amount: 100} └─────────────────────┘ Current state derived by replaying these 4 events → $50 balance Full audit trail always available
Why event sourcing matters for data engineering: The Kafka topic IS your source of truth. Any view of the data (user balance, order status, inventory count) is derived by subscribing to the event log and materializing the current state. New read models can be created by replaying events from the beginning — no ETL needed.
CQRS (Command Query Responsibility Segregation): Separate the write path from the read path.
WRITE
SIDE: READ SIDE: Commands → Event log (Kafka) Event log → Materialized views "place_order" ──────────────► order_by_user: {user: "u1", orders: [o1, o2]} "cancel_order" order_by_status: {status: "pending", count: 47} "update_status" user_dashboard: {u1: {total_spend: $450, ...}}
The key insight: The event log (Kafka topic with long/infinite retention) is the authoritative source. Multiple read-side views can be derived from it independently — each optimized for its specific query pattern. If a read-side view needs to be rebuilt (bug fix, new query pattern), just replay events from the beginning of the topic.
CQRS with ksqlDB:
--
Write side: events land in Kafka topic "shopping_cart_events" -- Read side: streaming aggregation produces a materialized table CREATE STREAM shopping_cart_events ( customer VARCHAR, item VARCHAR, qty INT, op VARCHAR -- 'add', 'remove' ) WITH (kafka_topic='shopping_cart_events', value_format='json', partitions=1); -- Streaming CQRS read model: current cart contents per customer CREATE TABLE cart_contents AS SELECT customer, item, SUM(CASE WHEN op = 'add' THEN qty ELSE -qty END) AS quantity FROM shopping_cart_events GROUP BY customer, item HAVING SUM(CASE WHEN op = 'add' THEN qty ELSE -qty END) > 0 EMIT CHANGES;
The cart_contents table is always up-to-date. When the application queries “what’s in user X’s cart?”, it reads from this table — not from the event log. The event log is the write side; the materialized view is the read side.
Practical Patterns for Interview Design Questions
Pattern 1: Streaming ETL with Flink SQL
--
Normalize CDC events from Postgres into a clean streaming table CREATE TABLE orders_raw ( order_id STRING, user_id STRING, amount DECIMAL, created_at TIMESTAMP(3), op STRING -- 'c' (create), 'u' (update), 'd' (delete) from Debezium ) WITH (connector = 'kafka', topic = 'postgres.public.orders', format = 'debezium-json'); -- Streaming view: only active (non-deleted) orders CREATE VIEW active_orders AS SELECT order_id, user_id, amount, created_at FROM orders_raw WHERE op <> 'd'; -- Streaming aggregation: real-time revenue by user CREATE TABLE user_revenue AS SELECT user_id, SUM(amount) AS total_revenue FROM active_orders GROUP BY user_id; -- Sink to serving layer (Cassandra for low-latency lookup) INSERT INTO user_revenue_cassandra_sink SELECT * FROM user_revenue;
Pattern 2: Multi-source enrichment with windowing
--
Join two Kafka streams (orders + inventory) with a time window SELECT o.order_id, o.product_id, o.amount, i.stock_remaining FROM orders o JOIN inventory_updates i ON o.product_id = i.product_id AND o.order_time BETWEEN i.update_time AND i.update_time + INTERVAL '1' MINUTE WHERE o.order_time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR;
Interview Questions
Q1: “Explain stream-table duality and how Flink SQL uses it. Why does this matter for data engineering?”
Model Answer: “Stream-table duality means that tables and streams are two representations of the same data. A table represents the current state — the result of applying all historical changes. A stream represents the history of changes — every insert, update, and delete. They’re interchangeable: accumulate (integrate) a stream to get a table; differentiate a table’s changes over time to get a stream. Flink SQL exploits this by treating every data source as a ‘dynamic table’ — a table that changes over time as new events arrive. You write standard SQL against this dynamic table, and Flink executes it as a continuous query: it never stops running, it just updates results as new data arrives. This matters for data engineering because it lets you express real-time pipelines in SQL rather than in Spark or Flink’s procedural API. A product analyst can define a ‘rolling 7-day user engagement’ metric as a SQL query, and Flink ensures that metric is always current in the serving layer. The output of the continuous query is itself a stream (a changelog), which can feed other views, databases, or dashboards.”
Q2: “When would you use Flink SQL’s streaming materialized view vs a BigQuery materialized view vs Materialize DB?”
Model Answer: “These sit at different points on the freshness/cost/complexity spectrum. BigQuery materialized views are refreshed on a schedule — hourly at minimum. Use them when the consumer is a BI dashboard that refreshes on a human timescale (minutes) and the transformation is a simple aggregation. Cost: only the refresh compute, not per-query compute. Best for the gold layer of a data warehouse where freshness of minutes is acceptable. Flink SQL streaming views update in seconds. Use them when the consumer needs near-real-time data — a fraud detection dashboard, a real-time leaderboard, a recommendation feature serving layer. The complexity cost is operating a Flink cluster with stateful checkpointing. Best when the transformation is expressible in SQL and latency matters (seconds, not minutes). Materialize DB maintains views incrementally in milliseconds — including complex multi-table JOINs that would be prohibitively expensive in Flink state. Use it when you want the freshness of streaming but the expressiveness of arbitrary SQL, and your team prefers a database-centric workflow (dbt compatibility). The operational cost is running the Materialize cluster and managing source connectivity. For most data platforms, I’d use BigQuery materialized views for the analytics gold layer, Flink SQL streaming views for the real-time serving layer (feature store, operational dashboards), and consider Materialize only when Flink SQL’s limitations (complex JOINs requiring large state) or the team’s SQL-first preference makes it a better fit.”
Think About This
You’re in a Netflix interview. The prompt: “Design the data infrastructure for Netflix’s real-time content popularity dashboard — showing which titles are trending in the last 5 minutes globally and by country.”
Walk through:
-
What’s the stream-table duality here? (The stream = individual viewing events. The table = count of views per title per country in the last 5 minutes — a windowed aggregation that materializes the stream into current state.)
-
What does the Flink SQL query look like? (SELECT country, title_id, COUNT(*) AS views FROM viewing_events GROUP BY HOP(event_time, INTERVAL ‘1’ MINUTE, INTERVAL ‘5’ MINUTE), country, title_id — a hopping window that gives sliding 5-minute counts updated every minute)
-
Where does the result land? (Redis or Cassandra for sub-10ms reads. Flink emits changelog updates → the serving store receives increments and decrements as titles enter/leave the 5-minute window)
-
What happens with the table-to-stream direction? (When a title drops off the trending list, Flink emits a DELETE event to the serving store — this is the retraction in Flink’s changelog output)
-
How does the duality enable historical analysis? (The same viewing events in Kafka (stream) can be accumulated into a daily/weekly fact table (table) via a separate batch pipeline. Same data source, two different materializations for two different consumers)
Quick Reference
- Duality formula: Table = integral of stream (accumulate events). Stream = derivative of table (observe changes). Every table can be streamed; every stream can be materialized into a table.
- Flink dynamic tables: SQL runs continuously on streaming data. GROUP BY creates state. Windowing triggers output. Results are themselves changelog streams.
- Three materialized view tiers: BigQuery/Snowflake (minutes, SQL-native, BI-friendly) → Flink SQL / ksqlDB (seconds, stateful streaming aggregation) → Materialize DB (milliseconds, incremental arbitrary SQL)
- CQRS pattern: Kafka topic = event log (write side). Streaming materialized views = read models (read side). Multiple read models from the same event log. Rebuild any read model by replaying events.
- Event sourcing: Store events, not state. Current state = replay events. Complete audit trail. New read models = replay from beginning. Kafka with infinite retention is the event log.
- The refresh storm anti-pattern: Don’t have 50 materialized views refreshing from the same high-write base table. Stagger refresh intervals, use incremental refresh, limit materialized view depth.
Tomorrow’s Preview
Day 46: Real-Time Feature Engineering — Feature stores (Feast, Tecton), online vs offline feature computation, point-in-time correctness, backfilling features, serving features for ML inference at low latency. Directly relevant for your OpenAI, Anthropic, and Meta interviews where ML data infrastructure is a core topic.