Day 33 — Netflix Data Infrastructure & Interview Patterns

Phase 2: Company-Specific | Category: Netflix-Specific

Netflix at a Glance: The Numbers That Frame Everything

Before any architecture discussion, internalize the scale context:

Metric	Value
Paid subscribers	300M+ across 190 countries
Daily events processed	2+ trillion
Flink jobs running	15,000–20,000+ processing 60 PB/day
Data lake size	Exabyte-scale (S3 + Iceberg)
Streaming hours	Billions per month across 190 countries
Content titles	15,000+ (movies, shows, games)

This is the scale you’re designing for. Unlike Meta (social interactions), Netflix is dominated by viewing behavior — what people watch, when they stop watching, what they skip, what they rewatch. Every data system at Netflix serves the ultimate question: “What will this person want to watch next?”

Netflix’s Data Infrastructure Stack

Netflix has been extraordinarily transparent about their technology via the Netflix Tech Blog. These are the systems your interviewer will expect you to know:

The Storage Layer: Iceberg-First (2023-Present)

Netflix built Apache Iceberg and has been migrating from Hive to an Iceberg-only architecture. This is the most important tech context for Netflix interviews.

Why Netflix migrated from Hive to Iceberg:

Hive’s metadata bottleneck: listing millions of files on S3 took minutes for metadata operations at exabyte scale
No ACID guarantees: concurrent writers corrupted tables
No schema evolution without rewriting all data
No time travel for auditing and debugging

Iceberg’s three-layer metadata solved all of these:

Catalog
(Iceberg REST Catalog — Netflix built this)      ↓  Metadata file (snapshot history, partition specs, schema versions)      ↓  Manifest list → Manifest files (per-file statistics, min/max per column)      ↓  Parquet data files on S3

Netflix’s Iceberg-specific features:

Secure Iceberg tables: Column-level security integrated into the Iceberg catalog
Iceberg REST Catalog: Netflix built and open-sourced this — the catalog that makes Iceberg engine-agnostic
Hidden partitioning: Queries don’t reference partition columns — Iceberg prunes automatically
Partition evolution: Migrated from hourly to daily partitions for older data without rewriting files

What to say in a Netflix interview: “I’d use Iceberg as the table format throughout — Netflix built it for exactly this scale. The hidden partitioning means I can change partition strategies as data patterns evolve without rewriting historical data, and the Iceberg REST Catalog enables any query engine (Spark, Trino, Flink) to read the same tables with consistent ACID semantics.”

The Query Layer

Trino (formerly PrestoSQL — Netflix was a major contributor):

Primary interactive analytics engine for analysts and data scientists
Federated queries across Iceberg tables, MySQL, Cassandra, Druid
Used for ad-hoc exploration and dashboard queries on the data lake

Apache Spark:

Primary batch processing engine
ETL pipelines, feature engineering, model training data preparation
Runs on Netflix’s internal cluster management (Titus — Netflix’s own container scheduling platform)
Most complex transformations (sessionization, complex joins, ML features) run here

Apache Flink (15,000+ jobs):

Real-time stream processing at massive scale
Use cases: data movement, personalization signals, messaging, finance
Jobs range from stateless (filter/transform/route) to deeply stateful (session tracking, real-time aggregation)
Netflix built significant internal tooling on top of Flink for job management at this scale

Apache Druid:

Real-time OLAP for sub-second analytics
Powers real-time dashboards (content performance, ad metrics for the ad tier)
Particularly important post-ad-tier launch (2022-present) for real-time ad delivery analytics

The Messaging Layer

Apache Kafka:

Backbone for all event streaming
All user interactions (views, pauses, searches, ratings) flow through Kafka
Topics partitioned by user_id for per-user ordering
Feeds both Flink real-time processing and Spark batch processing

The Orchestration Layer

Maestro (Netflix’s internal workflow orchestrator, open-sourced):

Netflix’s custom-built orchestration platform, more scalable than Airflow for their use case
Handles thousands of concurrent DAGs with complex dependencies
Supports both streaming and batch workflows in the same DAG
Knowledge of Maestro signals deep Netflix-specific preparation; generic Airflow knowledge is also acceptable

Metaflow (open-sourced by Netflix):

ML workflow orchestration — from data prep through model training and deployment
Manages versioning of ML artifacts, experiments, and models
Data engineers at Netflix use Metaflow to build training data pipelines that ML engineers consume

Netflix’s Culture: What Interviewers Are Actually Testing

Per DataInterview.com, the biggest hiring failures at Netflix are cultural, not technical. Understanding this shapes how you answer every question.

“Freedom & Responsibility” applied to data engineering:

Cultural value	What it means for a DE
Ownership	You own your pipeline from design through production monitoring. “I built it, now it’s the ops team’s problem” is disqualifying.
Keeper test	Would your manager fight to keep you if you said you were leaving? Netflix optimizes for density of high performers. “Average” is a departure signal.
Candor	Disagree and commit, but disagree loudly first. If your skip-level’s architecture is wrong, you say so with data.
Context, not control	You’re given the business context to make decisions, not a list of steps to follow. You’re expected to define your own best path.
Impact	Everything is measured by impact on the business, not effort. A well-designed pipeline that nobody uses is worthless.

The two values most emphasized for DE roles per DataInterview.com: Impact and Courage. Prepare stories that show measurable impact AND instances where you pushed back on a bad decision, raised an uncomfortable truth, or took a calculated risk.

“Treating data engineers as software engineers”: Netflix expects production-grade code, unit tests, CI/CD pipelines for data assets, and code review rigor. Unlike some companies where DE is “build the pipeline and move on,” Netflix DE ownership is full-stack: design, build, test, deploy, monitor, improve, deprecate.

Netflix Interview Structure

Per Reddit r/interviewstack, Exponent, and DataInterview.com:

Timeline: 4-6 weeks from first contact to offer

Process for L5 (Senior) and L6+ (Staff):

Round	Focus	Duration
Recruiter screen	Culture alignment, background, compensation	30 min
Technical phone screen	SQL, Python/Scala, pipeline concepts	45–60 min
Onsite Round 1	SQL deep dive — complex window functions, CTEs on viewing data	45 min
Onsite Round 2	Coding — Python/Scala data manipulation, transformations	45 min
Onsite Round 3	System design / data architecture	60 min
Onsite Round 4	Behavioral / culture (L5) or 2nd system design (L6+)	45–60 min
Onsite Round 5 (L6+)	Cross-functional leadership, architectural vision	45 min

L6/Staff additions: Two system design rounds. The second goes deeper into architectural trade-offs. Cross-org influence scenarios: “How have you set data standards that other teams adopted?” Engineering manager, product manager, and director involvement.

What Netflix Interviewers Specifically Probe

Per DataInterview.com and LinkedIn:

Technical depth areas:

Iceberg deep knowledge: Schema evolution, partition evolution, time travel, merge-on-read vs copy-on-write, compaction strategies
Spark tuning: Real scenarios — “Your Spark job processing 300GB took 40 min, now takes 3 hours. Diagnose.” (Day 9 applies directly)
Data quality and testing: Not “I wrote some tests.” Specific framework (Great Expectations, dbt contracts), what you tested, how you measured quality over time, what SLOs you defined
Cost optimization: Netflix’s data bills are enormous. “Tell me about a time you reduced cloud spend without sacrificing performance.” Concrete numbers expected.
Backfill and reprocessing: “How would you backfill 6 months of historical data without interrupting the live pipeline?” Iceberg time travel + parallel processing answer expected.
Schema evolution without downtime: “Kafka stream schema changes without downtime” — Schema Registry + Avro BACKWARD_TRANSITIVE + dual-field migration (Day 21 applies directly)

Product understanding (Netflix-specific):

“Why would viewing hours drop 5% week-over-week? Walk me through how you’d diagnose using data.”
“How would you measure the impact of a new recommendation algorithm?”
Netflix interviewers probe whether you understand the business domain, not just the pipeline mechanics.

The Netflix Data Stack Mental Model

When designing for Netflix, think in this layered stack:

User
devices (TV, mobile, web)      ↓ event SDK (batch 100ms)  Kafka (partitioned by user_id, RF=3)      ├── Flink path (real-time)      │     • Sessionization (gap = 30 min inactivity)      │     • Real-time viewing features (current session, recent engagement)      │     • Write to Iceberg (streaming writes, 5-min micro-commits)      │     • Write to Druid (real-time OLAP, last 7 days)      │     • Write to Cassandra (user activity state, serving)      └── Spark Structured Streaming (micro-batch, every 15 min)            • Write to Iceberg Bronze on S3  Iceberg on S3 (Bronze)      ↓ Spark ETL (hourly/daily via Maestro)  Iceberg Silver (cleaned, sessionized, joined with dim tables)      ↓ dbt / Spark (nightly)  Iceberg Gold (star schema, analytics-ready, Kimball)  Serving:    Trino → Iceberg Gold (analyst ad-hoc queries)    Druid → Real-time data (< 7 days, sub-second dashboards)    Metaflow → ML training datasets from Iceberg Silver    Cassandra → Online feature serving for recommendation model

The Three Netflix DE Interview Differentiators

1. Iceberg fluency beyond basics

Most candidates know “Iceberg gives you ACID and time travel.” Netflix interviewers want depth:

“Explain the difference between copy-on-write and merge-on-read delete modes in Iceberg and when you’d choose each.”
- Copy-on-write (COW): rewrite affected files on every update/delete. Optimized for reads — no merge at query time. Expensive for write-heavy tables.
- Merge-on-read (MOR): write delta files on update/delete. Merge at read time. Optimized for writes. Higher read latency for heavily updated tables.
- Netflix choice: COW for analytical tables (write-once, read-many). MOR for CDC-driven tables (frequent updates, read latency acceptable).
“How would you use Iceberg’s partition evolution to migrate a table from hourly to daily partitions?”
- ALTER TABLE events ADD PARTITION FIELD day(event_time) — new data uses daily partition. Old data keeps hourly. No rewrite needed. Query engine handles both transparently via hidden partitioning.

2. Trade-off fluency over prescriptive answers

Per Exponent: “Netflix wants to hear you reason about cost vs latency vs correctness.” A prescriptive “I’d use Spark because it’s scalable” fails. The senior answer: “I’d choose Flink over Spark Structured Streaming for this sessionization job because we need sub-5-second latency on the session-end signal for real-time recommendations. The trade-off is operational complexity — Flink’s stateful recovery and RocksDB management require more expertise than Spark Structured Streaming. If latency of 2-3 minutes were acceptable, I’d stay with Spark to reduce ops burden.”

3. End-to-end ownership language

Use language that signals full ownership: “I designed AND operated AND improved AND deprecated” rather than “I built.” Netflix interviewers listen for “when my pipeline failed at 3 AM” stories — not “when the ops team paged me.” Show you own the entire lifecycle.

Interview Questions

Q1: “Describe how you’d architect a pipeline to produce viewing analytics for content strategy teams. They need daily metrics by title, genre, and region — and they occasionally need to go back 3 years.”

Model Answer: “I’d build a Medallion architecture on S3 with Iceberg as the table format throughout — Netflix’s own technology, ideal for schema evolution and time travel.

Bronze layer: Kafka → Spark Structured Streaming (micro-batch every 15 min) → Iceberg on S3. Raw viewing events, append-only, partitioned by event_date. Schema enforced via Avro with Schema Registry. No transformations — preserve source fidelity.

Silver layer: Daily Spark job via Maestro. Sessionize viewing events (gap > 30 min = new session), deduplicate by event_id, join with dim_content (title, genre, maturity rating) and dim_user (country, subscription tier, device type) using point-in-time joins on the SCD2 dimensions. Output: silver.viewing_sessions on Iceberg, partitioned by session_date.

Gold layer: Nightly dbt models building star schema. fact_daily_title_metrics (grain: one row per title × region × date), aggregating watch hours, completion rates, start rates. dim_title with SCD2 on genre, maturity_rating for historical accuracy.

For the 3-year lookback: Iceberg’s time travel lets the content team query SELECT * FROM gold.title_metrics FOR TIMESTAMP AS OF ‘2023-04-10’ — seeing the data exactly as it was on that date, including dim values. Cold data tiered to S3 Glacier for cost optimization while remaining queryable.

Cost optimization for 3-year access: partition pruning by date is critical. Trino queries with a date filter scan only the relevant partitions — a 3-year full table scan on a large title metrics table would be expensive. I’d enforce a partition filter requirement in the table DDL.

Monitoring: data freshness probe on gold.fact_daily_title_metrics — alert if MAX(session_date) < CURRENT_DATE - 1 by 8 AM. Volume anomaly detection via Monte Carlo. SLO: gold layer available by 8 AM PT, 29/30 days.”

Q2: “Your Flink job processing viewing events has been running for 6 months. A new streaming service launched in a new country, and event volume doubled overnight. The Flink job is falling behind — Kafka consumer lag is growing. How do you fix this without losing data?”

Model Answer: “Kafka’s durability is the safety net — data isn’t lost while consumer lag grows. I have time to fix the Flink job. My diagnosis and fix sequence:

First, diagnose the bottleneck: check the Flink UI for which operators are backpressured. Is it the source (reading from Kafka), a specific operator (likely any stateful operation like sessionization), or the sink (writing to Iceberg)? The Flink task manager metrics will show where throughput drops.

If the bottleneck is Kafka source throughput: increase Flink parallelism on the source operator — match it to the Kafka partition count. If Kafka has 50 partitions, source parallelism should be 50. Adding parallelism scales horizontally on Titus.

If the bottleneck is a stateful operator (sessionization): this is the most common issue at 2x volume. RocksDB state backend may be disk I/O-bound. Options: (1) increase RocksDB instance memory (faster than disk), (2) increase parallelism of the stateful operator (rescale the Flink job), (3) if the state is enormous, consider a tiered approach — keep only the last 24 hours of state in Flink, offload older sessions to Iceberg for late-session resolution.

If the bottleneck is Iceberg sink: batch larger commit sizes (more records per Iceberg transaction), increase Flink checkpointing interval (commits align with checkpoints), consider temporarily switching from COW to MOR on the sink table to reduce write amplification.

Regardless of the fix: I’d scale the Flink cluster first (add task managers) to stop the lag from growing, buy time to diagnose properly, then implement the structural fix. After the fix, monitor consumer lag recovery — it should trend toward 0 over 2-4 hours as the cluster catches up on the backlog. Never manually skip offsets to ‘catch up’ — that’s data loss.”

Think About This

You’re preparing for the Netflix system design round. The prompt you’re most likely to receive:

“Design the data infrastructure for Netflix’s recommendation system. The system needs to serve personalized recommendations to 300M users with < 100ms latency, use viewing history going back 3 years, and update recommendations as users watch content in real-time.”

Before Day 34 (where we design this end-to-end), mentally sketch:

What are the two data paths? (Online: Flink → Cassandra for real-time features. Offline: Spark → Iceberg → Metaflow for training and batch features)
What’s the feature store design? (Online: Cassandra/Redis for < 10ms feature lookup. Offline: Iceberg silver layer for ML training)
How do you handle 300M users × 15K titles = 4.5 trillion possible combinations? (Two-tower model: user embeddings + content embeddings, ANN search to retrieve top-K candidates)
What’s the freshness SLA for features? (Viewing history updated < 30 sec for the current session. Longer-term patterns updated hourly. Historical patterns daily.)
How does Netflix A/B test new recommendation algorithms? (Experiment assignment stored in Iceberg. Viewing metrics joined to experiment assignment for offline evaluation. Online evaluation via interleaving — show users a mix of both algorithms’ recommendations.)

Quick Reference: Netflix-Specific

Scale: 2+ trillion events/day, 15K+ Flink jobs, 60 PB/day, exabyte data lake. State this context upfront.
Tech stack: Kafka → Flink (real-time) + Spark (batch) → Iceberg on S3 → Trino (ad-hoc) + Druid (real-time OLAP) + Cassandra (serving) → Metaflow (ML training)
Iceberg-first: Netflix built Iceberg and is migrating to Iceberg-only. Know COW vs MOR, hidden partitioning, partition evolution, time travel, and the REST Catalog.
Culture differentiators: Ownership (you own design → production → monitoring). Impact + Courage (measurable results + pushing back on bad decisions). “Full-cycle developer” mindset.
Interview differentiators: Iceberg depth (partition evolution, COW vs MOR), trade-off fluency (cost vs latency vs correctness), end-to-end ownership language, product understanding (viewing hours, retention, content performance).
Avoid: “I’d add more servers” answers. “That’s the ops team’s job” language. Vague cultural answers (“I believe in ownership” without a concrete story).

Tomorrow’s Preview

Day 34: Design: Netflix Streaming Recommendation Data Pipeline — The full end-to-end system design: user behavior events → feature engineering → ML training pipeline → real-time serving → A/B testing infrastructure. Multi-region, high availability, and the specific Iceberg + Flink + Cassandra + Metaflow stack Netflix actually uses.

Day 33: Netflix Data Infrastructure & Interview Patterns