Day 9 — Batch processing architecture

Phase 1: Foundations & Frameworks | Category: ETL/ELT Workflows

Why Spark Internals Matter at Senior Level

At your target companies, you won’t be asked “what is Spark?” You’ll be asked “Your Spark job processing 300GB daily just went from 40 minutes to 3 hours without code changes — how do you investigate?”. Senior-level Spark knowledge means understanding the execution model deeply enough to diagnose, tune, and architect — not just write transformations. As one interviewer summarized: “At senior level, Spark expertise means being able to answer one question: Why is this job slow — and what is the most impactful fix?”

Spark Execution Model: The Full Picture

Your Code (DataFrame/SQL)
    ↓ Catalyst Optimizer (logical plan → optimized logical plan → physical plan)
    ↓ DAG Scheduler (physical plan → stages)
    ↓ Task Scheduler (stages → tasks, assigned to executors)
    ↓ Executors (tasks run on partitions across the cluster)

1. Catalyst Optimizer

When you write DataFrame operations or SQL, Spark doesn’t execute them immediately (lazy evaluation). Instead, Catalyst builds a logical plan, optimizes it (predicate pushdown, column pruning, join reordering), and produces a physical execution plan.

What to mention in interviews: “Catalyst is why I prefer DataFrames/SQL over RDDs — the optimizer can push filters down to the scan level, prune unnecessary columns before reading from Parquet, and reorder joins based on table statistics. With RDDs, none of this happens — you’re writing the physical plan yourself.”

2. DAG → Stages → Tasks

This is the core mental model:

DAG (Directed Acyclic Graph): The full graph of all transformations from input to output.

Stages: The DAG is broken into stages at shuffle boundaries (wide dependencies). Within a stage, all operations are narrow (map, filter, project) and can be pipelined without data movement.

Tasks: Each stage is divided into tasks — one task per partition. Tasks are the unit of parallelism.

Stage 1: Read Parquet → Filter → Map (narrow ops, pipelined)
    ↓ SHUFFLE (groupBy key)
Stage 2: Aggregate → Write (narrow ops, pipelined)

The key rule: A new stage is created every time data needs to be redistributed across the cluster (shuffle). Shuffles are the #1 performance killer in Spark.

3. Narrow vs Wide Transformations

Type	Examples	Data Movement	Stage Impact
Narrow	map, filter, select, union, coalesce	None — each input partition maps to one output partition	Same stage
Wide	group By, join, repartition, distinct, order By	Shuffle — data redistributed across all partitions	New stage boundary

Senior insight: Every wide transformation triggers a shuffle, writing intermediate data to disk (shuffle files) and reading it back. A job with 5 groupBys has at least 6 stages and 5 shuffles. Reducing the number of shuffles is often the highest-impact optimization.

Partitioning: The Foundation of Performance

Partitions are Spark’s unit of parallelism. Everything flows from getting partitions right.

Key principles:

Topic	Details
Partition count	Target 2-4x the total number of cores in your cluster
Partition size	Target 128-512 MB per partition (compressed). Too small = task scheduling overhead. Too large = memory pressure, GC pauses.
Input partitions	Determined by file count and size. 1 small file = 1 partition (even if only 1 KB — the small files problem).
Shuffle partitions	Controlled by spark.sql.shuffle.partitions (default: 200). Must be tuned per job.

The small files problem: If your input is 10,000 files of 1 MB each, Spark creates 10,000 partitions/tasks. Task scheduling overhead dominates. Fix: use coalesce() after read, or compact files upstream.

coalesce vs repartition:

coalesce(n): Reduces partitions WITHOUT a full shuffle. Merges adjacent partitions. Use when reducing partition count.
repartition(n): Full shuffle to redistribute data evenly across n partitions. Use when you need balanced partitions (e.g., before a write to avoid skewed output files).
repartition(n, col): Shuffle + hash partition by column. Use before repeated joins/aggregations on that column.

Shuffles: The Performance Killer

A shuffle involves:

Map side: Each task writes its output to local disk, sorted by target partition
Network transfer: Data is fetched by reducers across the network
Reduce side: Data is read, merged, and processed

Why shuffles are expensive:

Disk I/O (write + read shuffle files)
Network transfer (cross-node data movement)
Serialization/deserialization
Memory pressure (sort buffers)

Shuffle spill: When the in-memory sort buffer is too small for the shuffle data, Spark spills to disk — dramatically slower. Check the Spark UI for “spill (memory)” and “spill (disk)” metrics.

How to reduce shuffles:

Broadcast join (small table / dimension): ship the small side to every executor so the join does not shuffle the large fact.

from pyspark.sql.functions import broadcast

result = large_df.join(broadcast(small_df), "key")

Pre-partition data: If you frequently join on customer_id, write your data partitioned/bucketed by customer_id. Subsequent joins avoid shuffle.
Combine operations: Multiple aggregations on the same key? Combine them into one groupBy instead of chaining separate ones.
Partial aggregation: Use reduceByKey over groupByKey (RDD API), or trust Catalyst to do partial aggregation for DataFrame groupBy (it does by default).

Data Skew: The Silent Killer

Skew occurs when one partition has far more data than others. 999 tasks finish in 10 seconds; 1 task takes 30 minutes processing the skewed key.

How to detect:

Spark UI: One task in a stage takes 10-100x longer than others
Spark UI: One task processes 10-100x more data (input/shuffle read)
Symptoms: Low overall CPU usage but the job is stuck

How to fix:

1. Salting (most common interview answer): add a random suffix to the skewed key, aggregate partially, then aggregate again:

# Salt the skewed key
df = df.withColumn("salted_key", concat(col("key"), lit("_"), (rand() * 10).cast("int")))

# Partial aggregation on salted key
partial = df.groupBy("salted_key").agg(sum("amount").alias("partial_sum"))

# Remove salt and final aggregation
partial = partial.withColumn("key", split(col("salted_key"), "_")[0])
result = partial.groupBy("key").agg(sum("partial_sum").alias("total"))

2. Isolate the skewed key: process the hot key separately from the rest:

hot_key_df = df.filter(col("key") == "HOT_KEY")
normal_df = df.filter(col("key") != "HOT_KEY")
# Process separately, then union results

3. Broadcast join: if the skew is in a join, broadcast the smaller side (eliminates the skewed shuffle for that side when it fits driver/executor memory limits).

4. Adaptive Query Execution (AQE): Spark 3.0+ can dynamically coalesce shuffle partitions, optimize skewed joins, and switch join strategies at runtime. Enable with:

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

Idempotency & Reprocessing: The Architecture Pattern

Batch pipelines fail. Networks glitch. Data arrives late. Your pipeline must handle re-runs without duplicating or corrupting data.

Idempotent pipeline design principles:

1. Partition-based overwrite (most common):

# Write output partitioned by date, overwrite only the target partition
df.write.mode("overwrite").partitionBy("process_date").parquet("s3://warehouse/gold/metrics/")

Re-running for process_date=2026-03-25 overwrites only that partition. All other dates untouched. This is the simplest idempotency pattern.

2. MERGE INTO for upserts (Delta Lake / Iceberg):

MERGE INTO gold.metrics t
USING staging.metrics s
ON t.metric_id = s.metric_id AND t.date = s.date
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

Re-running produces the same result — matched rows are updated, new rows inserted, no duplicates.

3. Deterministic output paths:

Use the logical processing date, not datetime.now(), for output paths:

# GOOD: deterministic, re-runnable
output_path = f"s3://warehouse/gold/metrics/date={process_date}"

# BAD: changes on every run
output_path = f"s3://warehouse/gold/metrics/run_{datetime.now()}"

4. Input-based processing, not time-based:

# GOOD: process a specific date's data
df = spark.read.parquet(f"s3://raw/events/date={target_date}")

# BAD: process "everything since last run"
df = spark.read.parquet("s3://raw/events/").filter(col("date") > last_run_time)

The second approach is fragile — if last_run_time is wrong or the job runs twice, you get duplicates or gaps.

Spark Cluster Sizing: The Interview Calculation

Per a common senior DE interview pattern:

Scenario: Process 10 TB of data in Spark.

Step 1 — Partitions:

Data: 10 TB = 10,240 GB
Target partition size: 128 MB
Partitions needed: 10,240 GB × 1,024 MB/GB ÷ 128 MB = ~81,920 partitions

Step 2 — Parallelism:

40 worker nodes × 16 cores each = 640 parallel tasks
Waves: 81,920 ÷ 640 = ~128 waves
At ~10 sec/wave → ~21 minutes compute time (without shuffle)

Step 3 — Memory:

Each node: 256 GB RAM
Usable Spark memory: ~70% = 179 GB
4 executors per node → ~45 GB per executor
Each task should use < 1.5 GB → safe for 128 MB partitions even with expansion during joins

Step 4 — Shuffle estimation:

If a join causes 2x data expansion, shuffle volume ≈ 20 TB
At ~2 GB/sec disk throughput: ~10,000 sec worst case (~2.8 hours)
This tells you whether to broadcast, filter early, or redesign the join

What to say: “I don’t guess cluster size — I calculate from data volume, target partition size, and available parallelism. Then I estimate shuffle volume to decide whether the job is compute-bound or shuffle-bound.”

Common Spark Optimization Checklist

Category	Optimization	Impact
Storage	Use Parquet/ORC (columnar, compressed)	High
Storage	Target 128-512 MB file sizes (avoid small files)	High
Storage	Enable predicate pushdown + column pruning	High
Shuffle	Broadcast small tables in joins (< 1 GB)	High
Shuffle	Tune shuffle.partitions (don’t leave at default 200)	Medium
Shuffle	Enable AQE (adaptive.enabled = true)	Medium
Skew	Salt skewed keys or isolate hot keys	High
Memory	Cache/persist intermediate Data Frames reused >1 time	Medium
Memory	Use Kryo serialization (spark.serializer)	Low-Medium
Partitioning	Partition output by common filter column (date)	High
Partitioning	coalesce before write to control output file count	Medium
Joins	Filter early (push predicates before joins)	High
Joins	Use sort-merge join for large-large joins	Context-dependent

Interview Questions

Q1: “A Spark job processing ~300GB daily recently went from 40 minutes to 3 hours without code changes. How would you investigate?”

Model Answer: “No code changes but 3x runtime increase — I’d investigate in this order. First, data growth: has the input volume increased? 300GB might now be 900GB due to upstream changes. Check input partition count and sizes. Second, data skew: a new hot key in the data could cause one task to take 100x longer. I’d check the Spark UI for tasks with disproportionate input size or duration. Third, small files proliferation: if the source started writing more, smaller files, the partition count explodes with scheduling overhead. Fourth, shuffle spill: check if shuffle data is spilling to disk — this could be caused by increased data volume hitting the same memory configuration. Fifth, cluster resource changes: was the cluster downsized, or are other jobs now sharing resources? I’d start with the Spark UI — the stage and task views give me input size, shuffle read/write, spill metrics, and task duration distribution. The fix depends on the root cause: if it’s data growth, I’d increase parallelism and possibly the cluster; if it’s skew, I’d salt the key; if it’s small files, I’d compact input.”

Q2: “How would you design a Spark pipeline that supports replay/backfill without duplicating results?”

Model Answer: “Three design principles. First, partition-based overwrite: output is partitioned by processing date, and re-running a date overwrites only that partition. This makes every run idempotent — running it twice produces the same output. Second, input-based processing: the pipeline takes a target date as a parameter and reads exactly that date’s input data, not ‘everything since last run.’ This eliminates dependency on run state and makes backfills trivial — just loop over the date range. Third, for upsert scenarios on Delta Lake or Iceberg, I’d use MERGE INTO with a deterministic business key so re-processing inserts new records and updates existing ones without duplicates. For the orchestration layer, Airflow with catchup=True and depends_on_past=False lets me backfill any date range independently. The key principle is: the pipeline should produce identical output regardless of how many times it runs for the same input — that’s idempotency.”

Think About This

You’re in a Netflix interview. The prompt: “We have a nightly Spark job that joins a 5 TB user-events fact table with a 500 GB content-metadata dimension, aggregates by content title and region, and writes to the analytics warehouse. The job takes 4 hours. How would you get it under 1 hour?”

Walk through:

Is the 500 GB dimension small enough to broadcast? (No — default broadcast threshold is 10 MB, and even tuned up, 500 GB is far too large. Sort-merge join is required.)
Where is time being spent? (Likely the shuffle for the join — moving 5 TB + 500 GB across the network. Second suspect: the aggregation shuffle.)
How would you reduce shuffle? (Pre-partition both tables by the join key. If events are already partitioned by content_id on disk, and content-metadata is bucketed the same way, the join becomes a partition-local operation with zero shuffle.)
What about skew? (Popular content on Netflix — top 1% of titles might account for 30% of events. A salt-and-two-phase aggregation would handle this.)
Is 4 hours compute-bound or I/O-bound? (Check Spark UI: if CPU utilization is low and shuffle write/read is high, it’s I/O-bound. If CPU is maxed, add more cores.)

Quick Reference

DAG → Stages (split at shuffles) → Tasks (one per partition) — this is the Spark execution model in one sentence
Shuffles are the #1 performance killer — reduce them via broadcast joins, pre-partitioning, and combining aggregations
Partition sizing: 128-512 MB per partition, 2-4x total cores for partition count
Data skew fixes: salting, key isolation, broadcast join, AQE skew join optimization
Idempotent pipelines: partition-based overwrite, input-based processing (not time-based), deterministic output paths, MERGE INTO for upserts
AQE (Spark 3.0+): Enable it — dynamic partition coalescing, skew join handling, and runtime join strategy switching

Tomorrow’s Preview

Day 10: Stream Processing Fundamentals — Kafka, Kinesis, Pub/Sub. Stream processing with Flink and Spark Streaming. Windowing (tumbling, sliding, session). Watermarks and late-arriving data. Exactly-once semantics. How real-time pipelines are designed at Meta, Netflix, and Google.

Day 9: Batch processing architecture