Day 10 — Stream processing fundamentals

Phase 1: Foundations & Frameworks | Category: ETL/ELT Workflows

Why Streaming Matters More Than Ever

Netflix runs over 20,000 Apache Flink jobs processing 14+ trillion records daily. Meta processes billions of events per second for real-time ranking. Google built Dataflow/Beam to unify batch and streaming. At your target companies, streaming isn’t a niche — it’s core infrastructure. In 2026, Kafka and Flink form the foundation of the modern data streaming platform, and interviewers expect senior DEs to design end-to-end streaming architectures, reason about delivery guarantees, and handle the hard problems: late data, ordering, and exactly-once semantics.

But first, the critical question from Hello Interview’s Flink deep dive: “Before embarking on a stream processing solution, ask yourself: do I really need real-time latencies?” Most companies think they need real-time. They usually don’t. Knowing when NOT to stream is as important as knowing how.

The Streaming Architecture Stack

Producers (apps, DBs, IoT, CDC)
    ↓ Message Broker / Event Streaming Platform (Kafka, Kinesis, Pub/Sub)
    ↓ Stream Processing Engine (Flink, Spark Structured Streaming, Dataflow/Beam)
    ↓ Sinks (warehouse, lake, real-time OLAP, cache, API)

Message Brokers: Kafka vs Kinesis vs Pub/Sub

Topic	Details
Model	Distributed log, consumer pulls Managed shards, consumer pulls Managed topic/subscription, push or pull
Ordering	Per-partition guaranteed Per-shard guaranteed Not guaranteed (unless ordering key)
Retention	Configurable (days to infinite)1-365 days7 days default
Throughput	~~1M msgs/sec per broker~~1 MB/sec per shard (scales with shards) Auto-scales, no shard management
Ops overhead	High (self-managed) or medium (Confluent Cloud) Low (fully managed) Low (fully managed)
Best for	High throughput, replay, ecosystem (Connect, Streams) AWS-native, simpler use casesGCP-native, serverless patterns
Used at	Netflix, Meta, most companies Amazon internal + AWS customers Google, GCP customers

Kafka fundamentals you must know:

Topics and partitions: A topic is a logical stream. Partitions are the unit of parallelism and ordering. Messages within a partition are strictly ordered; across partitions, there’s no ordering guarantee.

Partition key: Determines which partition a message goes to (hash of key mod partition count). All messages with the same key go to the same partition → guaranteed ordering for that key. Choose carefully — a bad partition key causes skew.

Consumer groups: Multiple consumers in a group divide partitions among themselves. Each partition is consumed by exactly one consumer in the group. Adding consumers (up to partition count) increases parallelism.

Offsets: Each message in a partition has a sequential offset. Consumers track their position via offsets. Resetting offsets = replaying data from any point in history.

Retention: Kafka retains messages for a configurable period (or forever with compacted topics). This is what makes Kafka a replayable log, not just a message queue.

Stream Processing Engines: Flink vs Spark Streaming

Topic	Details
Processing model	True event-at-a-time (with micro-batching option) Micro-batch (triggers at intervals)
Latency	Milliseconds to low seconds Seconds to minutes (trigger interval)
State management	First-class: RocksDB-backed, checkpointed, queryable Limited: state stored in memory/checkpointed
Windowing	Rich: tumbling, sliding, session, custom, late-data handling Good: tumbling, sliding, session (improving)
Watermarks	Native, per-source, sophisticated late-data handling Supported but less flexible
Exactly-once	Checkpointing + 2PC sinks Checkpointing + idempotent sinks
Backpressure	Built-in credit-based flow control Handled via micro-batch scheduling
Batch support	Unified batch+streaming (same API) Native batch engine, streaming added on top
Ecosystem	Growing, strong in streaming-first use cases Massive, strong in batch + ML
Used at	Netflix (20K+ jobs), Uber, Alibaba Databricks ecosystem, many enterprises

When to choose Flink: Sub-second latency requirements, complex event processing, sophisticated windowing, large stateful computations. If you’re designing for Netflix, Meta, or any streaming-heavy company, Flink is the expected answer.

When to choose Spark Streaming: Your team already uses Spark for batch, latency of seconds-to-minutes is acceptable, and you want a unified batch+streaming codebase. Good for near-real-time ETL into a lakehouse.

Windowing: How to Aggregate Unbounded Data

Streams are infinite. To compute aggregates (count, sum, average), you need to define finite chunks — windows.

Tumbling Windows

Fixed-size, non-overlapping, gap-free.

|--- Window 1 (00:00-00:05) ---|--- Window 2 (00:05-00:10) ---|  Events:  e1  e2  e3             e4  e5  e6  e7

Use case: “Count events per 5-minute interval.” Each event belongs to exactly one window. Simple, most common.

Sliding Windows

Fixed-size, overlapping. Defined by window size + slide interval.

Window size: 10 min, Slide: 5 min  |--- Window 1 (00:00-00:10) ---|           |--- Window 2 (00:05-00:15) ---|                    |--- Window 3 (00:10-00:20) ---|

Use case: “Compute a rolling 10-minute average, updated every 5 minutes.” Each event may belong to multiple windows. More compute-intensive than tumbling.

Session Windows

Dynamic size, defined by a gap of inactivity. No fixed duration.

Gap timeout: 5 min  |-- Session 1 --|     (gap > 5 min)     |-- Session 2 --|  e1 e2   e3                              e4 e5 e6

Use case: “Group user activity into sessions, where a session ends after 5 minutes of inactivity.” Window size varies per key. More complex to implement — requires stateful tracking of the last event per key.

Interview phrasing: “For computing real-time engagement metrics, I’d use tumbling windows for periodic snapshots (events per minute) and session windows for user-level session analysis (session duration, events per session). The choice depends on the business question.”

Watermarks: Solving the Late Data Problem

The fundamental problem: In event-time processing, events can arrive out of order. An event with timestamp 10:00:03 might arrive after an event with timestamp 10:00:07. When can you safely close the 10:00:00-10:00:05 window?

Watermark: A timestamp assertion that says “I believe all events with timestamps ≤ W have arrived.” When the watermark passes a window’s end time, the window fires.

Watermark = max_event_time - allowed_lateness
Events arriving: t=10:07, t=10:03, t=10:09, t=10:05                                            ^                                      Watermark at t=10:07                                      (if allowed_lateness = 2 sec)                                      → Window [10:00-10:05] can fire                                        because 10:07 - 2 = 10:05 ≥ window end

What happens to REALLY late data (after watermark passes)?

Three strategies:

Drop it: The default. Data that arrives after the watermark is ignored. Simple, but you lose accuracy.
Allowed lateness: Flink/Spark allow a grace period after the watermark. Late events within this period trigger window re-computation. Increases accuracy but adds complexity.
Side output: Late events are routed to a separate “late data” stream for special handling (e.g., periodic batch correction).

What to say in interviews: “I’d set the watermark based on the expected maximum delay from source to processing — if events typically arrive within 30 seconds of generation, I’d set allowed lateness to 60 seconds as a 2x buffer. For the long tail of extremely late events, I’d route them to a side output and handle them in a periodic batch correction job. This gives us accuracy for 99.9% of events in real-time, with a batch safety net for the rest.”

Delivery Semantics: The Three Guarantees

Guarantee	Description	Data Loss?	Duplicates?	Use Case
At-most-once	Fire and forget. No retries.	Possible	No	Metrics where occasional loss is acceptable (clickstream sampling)
At-least-once	Retry on failure.	No	Possible	Most streaming use cases. Handle duplicates downstream.
Exactly-once	Each event processed exactly once, even on failure.	No	No	Financial transactions, billing, anything where duplication = real cost

Exactly-Once: How It Actually Works

Exactly-once is NOT magic. It’s achieved through a combination of mechanisms:

Within Kafka (producer → Kafka → consumer):

Idempotent producer: Kafka assigns a producer ID and sequence number. Duplicate messages are detected and discarded.
Transactional API: Produce to multiple partitions atomically. Either all writes succeed or none do.
Consumer: Read from committed offsets only (isolation.level=read_committed).

End-to-end (Kafka + processing engine + sink):

This is the hard part. You need:

Checkpointing (Flink/Spark): Periodically snapshot the processing state and Kafka offsets together.
Atomic commit to sink: Either use a transactional sink (JDBC transaction) or an idempotent sink (write with deterministic keys so re-processing produces same result).
Two-Phase Commit (2PC): Flink’s approach — pre-commit to sink during checkpoint, finalize on checkpoint success, rollback on failure.

The practical truth: True exactly-once across systems is expensive and complex. Most production systems use at-least-once + idempotent processing — ensure your consumer/sink can handle duplicates (e.g., MERGE INTO with natural key, or deduplication by event ID).

What to say in interviews: “I’d design for at-least-once delivery with idempotent sinks. Each event carries a unique event_id. The sink uses MERGE INTO or INSERT … ON CONFLICT to deduplicate. This gives us effectively-exactly-once semantics without the overhead and complexity of distributed transactions. True exactly-once via Kafka transactions and Flink 2PC is available but only worth the cost for financial/billing pipelines where a duplicate has real monetary impact.”

Backpressure: When Processing Can’t Keep Up

Problem: Producer sends 100K events/sec, but the consumer can only process 50K events/sec. The gap grows unboundedly.

Flink’s approach: Credit-based flow control. Downstream operators signal upstream how much data they can accept. Upstream slows down accordingly. Backpressure propagates all the way to the source.

Kafka’s natural buffer: Kafka acts as a shock absorber between producers and consumers. If consumers fall behind, Kafka retains the data (within retention). Consumer lag grows but no data is lost. When consumers catch up, lag shrinks.

Consumer lag is the key metric to monitor: latest_offset - consumer_offset. If lag is growing continuously, your processing is too slow. Fix: add more consumer instances (up to partition count), optimize processing, or increase partitions.

Real-World Streaming Architecture at Your Target Companies

Netflix (from their tech blog and conference talk):

20,000+ Flink jobs, processing 14+ trillion records/day
Kafka as the transport layer, Flink for processing
Streaming SQL (Flink SQL) to democratize stream processing
CDC from databases → Kafka → Flink enrichment → Iceberg/Cassandra
Use cases: real-time recommendations, ad processing, game analytics

Meta: Custom streaming infrastructure at billions-of-events/sec scale. Scuba for real-time analytics. Real-time ranking signals computed via streaming for News Feed and ads.

Google: Pub/Sub → Dataflow (Apache Beam) → BigQuery. Dataflow provides unified batch+streaming with auto-scaling. Beam’s programming model is the same for both modes.

Interview Questions

Q1: “Design a real-time fraud detection pipeline for a payment system.”

Model Answer: “I’d use Kafka as the ingestion layer — every payment event published to a payments topic, partitioned by customer_id for ordering. Flink consumes the stream and maintains per-customer state: rolling 1-hour transaction count, transaction velocity, geographic spread. For each event, Flink applies rule-based checks (amount > threshold, velocity > limit, geo-anomaly) and a real-time ML model score from a feature store. Suspicious transactions are emitted to a fraud-alerts topic for immediate action (block transaction) and simultaneously written to a fraud analytics store. For delivery guarantees, I’d use at-least-once with idempotent sinks — each fraud alert has a deterministic alert_id so downstream systems can deduplicate. Watermarks set to 10 seconds to handle slight event delays. For late data beyond the watermark, side-output to a batch correction job. The critical design choice is the stateful window: a session window per customer with a 30-minute inactivity gap catches burst patterns that fixed windows would miss.”

Q2: “When would you choose Spark Structured Streaming over Flink?”

Model Answer: “Three scenarios. First, when the team already has deep Spark expertise and the latency requirement is seconds-to-minutes, not milliseconds. Near-real-time ELT into a lakehouse — CDC events via Kafka, Spark Structured Streaming micro-batch every 30 seconds, writing to Delta Lake — is a sweet spot. Second, when I need a unified batch and streaming codebase and the streaming use case is relatively simple (filter, transform, aggregate, write). Spark’s DataFrame API is identical for batch and streaming. Third, in the Databricks ecosystem where Structured Streaming is deeply integrated with Delta Lake, auto-scaling, and Unity Catalog. I’d choose Flink when I need sub-second latency, complex stateful processing (session windows, pattern detection), or when the streaming workload is the primary use case rather than an add-on to a batch platform.”

Think About This

You’re in a Google interview. The prompt: “Design a real-time analytics system that shows advertisers their campaign performance metrics updated every 30 seconds.”

Walk through:

What’s the event source? (Ad impression and click events from serving infrastructure, published to Pub/Sub)
What processing engine and why? (Dataflow/Beam — GCP-native, auto-scaling, unified batch+streaming. Tumbling windows of 30 seconds to compute metrics per campaign.)
How do you handle late-arriving click events? (A click may be attributed to an impression that happened minutes ago. Watermark with 2-minute allowed lateness. Late clicks beyond that go to a batch correction.)
Where do aggregated metrics land? (BigQuery for historical queries + Bigtable or Memorystore for low-latency dashboard reads at 30-second refresh.)
What delivery guarantee? (At-least-once + idempotent writes. Metric aggregates are written with a deterministic key of campaign_id + window_timestamp. Re-computing a window produces the same result.)

The key insight: 30-second refresh doesn’t require true sub-second streaming. Dataflow micro-batch or even Spark Structured Streaming with a 30-second trigger is sufficient. Don’t over-engineer with full event-at-a-time processing when micro-batch meets the SLA.

Quick Reference

Kafka: Distributed log, ordered within partitions, consumer group parallelism, replayable. The default choice for event ingestion.
Flink: True stream processing, millisecond latency, rich stateful windowing, native watermarks. Choose for low-latency or complex event processing.
Spark Structured Streaming: Micro-batch, seconds-to-minutes latency, unified with batch Spark. Choose for near-real-time ELT and Databricks/lakehouse integration.
Windowing: Tumbling (fixed, non-overlapping), Sliding (overlapping), Session (gap-based). Match the window type to the business question.
Watermarks: max_event_time - allowed_lateness. Defines when a window can safely close. Late data goes to side output or is dropped.
Exactly-once: Achievable within Kafka (idempotent producers + transactions). End-to-end requires checkpointing + idempotent or transactional sinks. Most production systems use at-least-once + deduplication.

Tomorrow’s Preview

Day 11: Lambda vs Kappa Architecture — Batch + speed layers (Lambda) vs stream-only (Kappa). Trade-offs in complexity, consistency, and latency. Real-world examples of each, and when interviewers expect you to propose one over the other.

Day 10: Stream processing fundamentals