Phase 2: Deep Dives | Category: Advanced Streaming

The Core Insight: The Same Data, Two Views

The stream-table duality is one of the most powerful mental models in modern data engineering. As Julian Hyde (Apache Calcite pioneer) put it: “Streams are like derivatives of tables, and tables are integrals of streams.”

  • A table is a snapshot of state at a point in time — all the accumulated changes from the beginning of time, collapsed into current values
  • A stream is the changelog of a table — the sequence of every insert, update, and delete that produced the current table state

They are mathematically dual: you can always derive one from the other.

Table
(current state):    user_id | balance | last_updated    --------+---------+-------------    u-001   | $350    | 2026-04-12    u-002   | $120    | 2026-04-11  Stream (changelog that produced the table):    {op: INSERT, user_id: u-001, balance: $0}    {op: UPDATE, user_id: u-001, balance: $100}   // deposit $100    {op: UPDATE, user_id: u-001, balance: $450}   // deposit $350    {op: UPDATE, user_id: u-001, balance: $350}   // withdraw $100    {op: INSERT, user_id: u-002, balance: $120}  Table = last state reached by applying all stream operations in order  Stream = complete history of every state transition

Why this matters for data engineering: Every Kafka topic is a stream. Every database table can be streamed (via CDC). When you apply aggregations (GROUP BY), you materialize a stream into a table. When you observe changes to a table (CDC), you produce a stream. Your entire data architecture is just transformations between these two forms.

Flink SQL is built entirely on the stream-table duality concept. Flink treats every data source as a dynamic table — a table that changes over time.

A Flink streaming job expressed in SQL:

--
Create a source table backed by a Kafka topic  -- This is a STREAM presented as a TABLE  CREATE TABLE orders (      order_id    STRING,      user_id     STRING,      amount      DECIMAL(10,2),      status      STRING,      order_time  TIMESTAMP(3),      WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND  ) WITH (      'connector' = 'kafka',      'topic' = 'orders',      'format' = 'json'  );  -- Run a CONTINUOUS QUERY that produces a dynamic table  -- This query runs forever and updates as new orders arrive  SELECT      user_id,      COUNT(*) AS order_count,      SUM(amount) AS total_spend  FROM orders  WHERE status = 'completed'  GROUP BY user_id;

This query looks identical to a batch SQL query, but in Flink it runs continuously and produces an updatable result table — every time a new order arrives, the affected user_id row in the result is updated.

The changelog output: When Flink writes this dynamic table to a downstream system, it emits a changelog stream:

{op: INSERT, user_id: "u-001", order_count: 1, total_spend: 99.99}  {op: UPDATE_BEFORE, user_id: "u-001", order_count: 1, total_spend: 99.99}  {op: UPDATE_AFTER,  user_id: "u-001", order_count: 2, total_spend: 249.98}

This is the changelog that a downstream consumer (Kafka topic, database, another Flink job) receives to maintain its materialized view of the query result.

Flink’s four changelog operations:

  • +I (INSERT): new row
  • -U (UPDATE_BEFORE): old value before update
  • +U (UPDATE_AFTER): new value after update
  • -D (DELETE): row removed

The Materialized View Spectrum

There’s a spectrum of “materialized view” implementations with very different freshness, cost, and complexity trade-offs:

1. Traditional Materialized Views (Warehouse Layer)

--
BigQuery / Snowflake  CREATE MATERIALIZED VIEW gold.daily_order_summary  OPTIONS (refresh_interval_minutes = 60)  AS  SELECT      order_date,      user_segment,      COUNT(*) AS orders,      SUM(amount) AS revenue  FROM gold.fact_orders  GROUP BY order_date, user_segment;

Mechanism: Full recompute (or incremental recompute) triggered by a refresh schedule or base table change.

Freshness: Minutes to hours (limited by refresh interval)

Best for: Pre-aggregating expensive queries for BI dashboards. BigQuery auto-rewrites queries against base tables to use the materialized view transparently.

The refresh storm problem (per DataVidhya): “If 50 views depend on the same base table, a single load can trigger 50 concurrent rebuilds, saturating your warehouse compute.” Mitigation: stagger refresh intervals, use incremental refresh where available, limit materialized view fan-out from high-write tables.

--
Flink SQL: this view updates in REAL-TIME as events arrive in Kafka  CREATE VIEW user_order_stats AS  SELECT      user_id,      COUNT(*) AS order_count,      SUM(amount) AS total_spend,      MAX(order_time) AS last_order_time  FROM orders  GROUP BY user_id;  -- Or persist to an external system:  INSERT INTO user_stats_sink  SELECT * FROM user_order_stats;

Mechanism: Incremental computation — only the changed portion of the view is recomputed when new events arrive. Flink maintains state (the current aggregated values per user_id in RocksDB) and applies incremental updates.

Freshness: Seconds (bounded by checkpoint interval and watermark)

Best for: Real-time dashboards, serving layers that need current state, reducing load on databases by pre-materializing the query.

ksqlDB equivalent:

--
ksqlDB: similar streaming materialized view  CREATE TABLE user_order_stats AS  SELECT      user_id,      COUNT(*) AS order_count,      SUM(amount) AS total_spend  FROM orders_stream  GROUP BY user_id  EMIT CHANGES;

3. Incrementally Maintained Views (Materialize DB)

Materialize is a streaming SQL database that maintains ALL views incrementally — not just simple aggregations, but arbitrary SQL including multi-table JOINs — with sub-second freshness.

--
Materialize: this multi-table join view updates in milliseconds  CREATE MATERIALIZED VIEW enriched_orders AS  SELECT      o.order_id,      o.amount,      u.user_tier,      u.country,      p.product_category,      o.order_time  FROM orders o  JOIN users u ON o.user_id = u.user_id  JOIN products p ON o.product_id = p.product_id;

As a new order arrives in Kafka → Materialize updates only the affected row in enriched_orders in milliseconds — without rerunning the full JOIN. This is powered by differential dataflow (compute incremental changes to joins, not full recomputation).

The dbt integration: With Materialize, you write a dbt model once. The first dbt run creates the materialized view. You never run it again — Materialize keeps it current automatically. This is genuinely transformative for teams who want continuous real-time views with a batch modeling workflow.

Event Sourcing & CQRS: The Architectural Pattern

These concepts appear in senior DE interviews when the conversation touches on system design for microservices + data platforms.

Event Sourcing: Instead of storing current state, store the FULL HISTORY of events. Current state = replay all events.

Traditional
DB:               Event Sourcing (Kafka as event log):  ┌─────────────────────┐       Kafka topic "account-events":  │  accounts           │       {event: "opened", account_id: "a1", amount: 0}  │  id | balance       │  →    {event: "deposit", account_id: "a1", amount: 100}  │ ----+--------       │       {event: "deposit", account_id: "a1", amount: 50}  │  a1 | $150          │       {event: "withdraw", account_id: "a1", amount: 100}  └─────────────────────┘                                       Current state derived by replaying these 4 events → $50 balance                                Full audit trail always available

Why event sourcing matters for data engineering: The Kafka topic IS your source of truth. Any view of the data (user balance, order status, inventory count) is derived by subscribing to the event log and materializing the current state. New read models can be created by replaying events from the beginning — no ETL needed.

CQRS (Command Query Responsibility Segregation): Separate the write path from the read path.

WRITE
SIDE:                           READ SIDE:  Commands → Event log (Kafka)          Event log → Materialized views    "place_order"      ──────────────►  order_by_user:   {user: "u1", orders: [o1, o2]}    "cancel_order"                      order_by_status: {status: "pending", count: 47}    "update_status"                     user_dashboard:  {u1: {total_spend: $450, ...}}

The key insight: The event log (Kafka topic with long/infinite retention) is the authoritative source. Multiple read-side views can be derived from it independently — each optimized for its specific query pattern. If a read-side view needs to be rebuilt (bug fix, new query pattern), just replay events from the beginning of the topic.

CQRS with ksqlDB:

--
Write side: events land in Kafka topic "shopping_cart_events"  -- Read side: streaming aggregation produces a materialized table  CREATE STREAM shopping_cart_events (      customer VARCHAR,      item VARCHAR,      qty INT,      op VARCHAR  -- 'add', 'remove'  ) WITH (kafka_topic='shopping_cart_events', value_format='json', partitions=1);  -- Streaming CQRS read model: current cart contents per customer  CREATE TABLE cart_contents AS  SELECT      customer,      item,      SUM(CASE WHEN op = 'add' THEN qty ELSE -qty END) AS quantity  FROM shopping_cart_events  GROUP BY customer, item  HAVING SUM(CASE WHEN op = 'add' THEN qty ELSE -qty END) > 0  EMIT CHANGES;

The cart_contents table is always up-to-date. When the application queries “what’s in user X’s cart?”, it reads from this table — not from the event log. The event log is the write side; the materialized view is the read side.

Practical Patterns for Interview Design Questions

Pattern 1: Streaming ETL with Flink SQL

--
Normalize CDC events from Postgres into a clean streaming table  CREATE TABLE orders_raw (      order_id    STRING,      user_id     STRING,      amount      DECIMAL,      created_at  TIMESTAMP(3),      op          STRING  -- 'c' (create), 'u' (update), 'd' (delete) from Debezium  ) WITH (connector = 'kafka', topic = 'postgres.public.orders', format = 'debezium-json');  -- Streaming view: only active (non-deleted) orders  CREATE VIEW active_orders AS  SELECT order_id, user_id, amount, created_at  FROM orders_raw  WHERE op <> 'd';  -- Streaming aggregation: real-time revenue by user  CREATE TABLE user_revenue AS  SELECT user_id, SUM(amount) AS total_revenue  FROM active_orders  GROUP BY user_id;  -- Sink to serving layer (Cassandra for low-latency lookup)  INSERT INTO user_revenue_cassandra_sink  SELECT * FROM user_revenue;

Pattern 2: Multi-source enrichment with windowing

--
Join two Kafka streams (orders + inventory) with a time window  SELECT      o.order_id,      o.product_id,      o.amount,      i.stock_remaining  FROM orders o  JOIN inventory_updates i      ON o.product_id = i.product_id      AND o.order_time BETWEEN i.update_time AND i.update_time + INTERVAL '1' MINUTE  WHERE o.order_time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR;

Interview Questions

Q1: “Explain stream-table duality and how Flink SQL uses it. Why does this matter for data engineering?”

Model Answer: “Stream-table duality means that tables and streams are two representations of the same data. A table represents the current state — the result of applying all historical changes. A stream represents the history of changes — every insert, update, and delete. They’re interchangeable: accumulate (integrate) a stream to get a table; differentiate a table’s changes over time to get a stream. Flink SQL exploits this by treating every data source as a ‘dynamic table’ — a table that changes over time as new events arrive. You write standard SQL against this dynamic table, and Flink executes it as a continuous query: it never stops running, it just updates results as new data arrives. This matters for data engineering because it lets you express real-time pipelines in SQL rather than in Spark or Flink’s procedural API. A product analyst can define a ‘rolling 7-day user engagement’ metric as a SQL query, and Flink ensures that metric is always current in the serving layer. The output of the continuous query is itself a stream (a changelog), which can feed other views, databases, or dashboards.”

Q2: “When would you use Flink SQL’s streaming materialized view vs a BigQuery materialized view vs Materialize DB?”

Model Answer: “These sit at different points on the freshness/cost/complexity spectrum. BigQuery materialized views are refreshed on a schedule — hourly at minimum. Use them when the consumer is a BI dashboard that refreshes on a human timescale (minutes) and the transformation is a simple aggregation. Cost: only the refresh compute, not per-query compute. Best for the gold layer of a data warehouse where freshness of minutes is acceptable. Flink SQL streaming views update in seconds. Use them when the consumer needs near-real-time data — a fraud detection dashboard, a real-time leaderboard, a recommendation feature serving layer. The complexity cost is operating a Flink cluster with stateful checkpointing. Best when the transformation is expressible in SQL and latency matters (seconds, not minutes). Materialize DB maintains views incrementally in milliseconds — including complex multi-table JOINs that would be prohibitively expensive in Flink state. Use it when you want the freshness of streaming but the expressiveness of arbitrary SQL, and your team prefers a database-centric workflow (dbt compatibility). The operational cost is running the Materialize cluster and managing source connectivity. For most data platforms, I’d use BigQuery materialized views for the analytics gold layer, Flink SQL streaming views for the real-time serving layer (feature store, operational dashboards), and consider Materialize only when Flink SQL’s limitations (complex JOINs requiring large state) or the team’s SQL-first preference makes it a better fit.”

Think About This

You’re in a Netflix interview. The prompt: “Design the data infrastructure for Netflix’s real-time content popularity dashboard — showing which titles are trending in the last 5 minutes globally and by country.”

Walk through:

  1. What’s the stream-table duality here? (The stream = individual viewing events. The table = count of views per title per country in the last 5 minutes — a windowed aggregation that materializes the stream into current state.)

  2. What does the Flink SQL query look like? (SELECT country, title_id, COUNT(*) AS views FROM viewing_events GROUP BY HOP(event_time, INTERVAL ‘1’ MINUTE, INTERVAL ‘5’ MINUTE), country, title_id — a hopping window that gives sliding 5-minute counts updated every minute)

  3. Where does the result land? (Redis or Cassandra for sub-10ms reads. Flink emits changelog updates → the serving store receives increments and decrements as titles enter/leave the 5-minute window)

  4. What happens with the table-to-stream direction? (When a title drops off the trending list, Flink emits a DELETE event to the serving store — this is the retraction in Flink’s changelog output)

  5. How does the duality enable historical analysis? (The same viewing events in Kafka (stream) can be accumulated into a daily/weekly fact table (table) via a separate batch pipeline. Same data source, two different materializations for two different consumers)

Quick Reference

  • Duality formula: Table = integral of stream (accumulate events). Stream = derivative of table (observe changes). Every table can be streamed; every stream can be materialized into a table.
  • Flink dynamic tables: SQL runs continuously on streaming data. GROUP BY creates state. Windowing triggers output. Results are themselves changelog streams.
  • Three materialized view tiers: BigQuery/Snowflake (minutes, SQL-native, BI-friendly) → Flink SQL / ksqlDB (seconds, stateful streaming aggregation) → Materialize DB (milliseconds, incremental arbitrary SQL)
  • CQRS pattern: Kafka topic = event log (write side). Streaming materialized views = read models (read side). Multiple read models from the same event log. Rebuild any read model by replaying events.
  • Event sourcing: Store events, not state. Current state = replay events. Complete audit trail. New read models = replay from beginning. Kafka with infinite retention is the event log.
  • The refresh storm anti-pattern: Don’t have 50 materialized views refreshing from the same high-write base table. Stagger refresh intervals, use incremental refresh, limit materialized view depth.

Tomorrow’s Preview

Day 46: Real-Time Feature Engineering — Feature stores (Feast, Tecton), online vs offline feature computation, point-in-time correctness, backfilling features, serving features for ML inference at low latency. Directly relevant for your OpenAI, Anthropic, and Meta interviews where ML data infrastructure is a core topic.