Phase 2: Company-Specific | Category: Netflix-Specific

The Prompt

“Design the data infrastructure for Netflix’s recommendation system. 300M subscribers. The system must serve personalized recommendations in < 100ms, use viewing history going back 3 years, update recommendations as users watch in real-time, and support A/B testing of new recommendation algorithms.”

Set a 45-minute timer. This is the most likely Netflix system design prompt.

Step 1: Requirements (5 min)

Functional:

  • Ingest all user interaction events (views, plays, pauses, completions, ratings, searches, skips)
  • Compute and serve real-time features for the active viewing session
  • Compute batch features from long-term viewing history (90 days to 3 years)
  • Generate candidate recommendations via content retrieval (two-tower model)
  • Rank candidates using a scoring model with combined real-time + batch features
  • Support A/B testing of recommendation algorithm variants
  • Support GDPR deletion (remove user data and update recommendations)

Non-functional:

Scale
:         300M subscribers × ~10 recommendation requests/day = 3B req/day = 35K req/sec  Event volume:  ~2T events/day = ~23M events/sec peak (Netflix actual scale)  Latency:       p99 < 100ms end-to-end recommendation serving  Freshness:     Real-time features: < 30 sec after event occurs                 Batch feature updates: within 4 hours of midnight  Availability:  99.99% (recommendation failure = black screen for users)  Storage:       3 years of viewing history × 300M users × ~500 events/user/day ≈ 160 PB

Scoping: “I’ll design the full data pipeline — ingestion, feature engineering (online + offline), candidate generation, ranking, and A/B testing infrastructure. I’ll describe the ML model architecture at a high level but won’t design the model training in detail — that’s ML Engineering territory.”

Step 2: Scale Estimation (3 min)

Ingestion
:    23M events/sec × 400 bytes avg = 9.2 GB/sec raw    Parquet + Zstd (6x compression): ~1.5 GB/sec = 130 TB/day  Feature serving at recommendation time:    35K recommendation requests/sec    Each request: ~2,500 features (user + content)    → Feature store must serve 35K × 2,500 = 87.5M feature lookups/sec    → Requires aggressive multi-tier caching (L1 in-process → L2 Redis → L3 Cassandra)  Recommendation serving latency budget (100ms):    Network + API layer: ~10ms    Feature retrieval: ~15ms (from Redis/Cassandra)    Candidate generation (ANN search): ~20ms    Ranking model inference: ~40ms    Post-processing + response: ~15ms    Total: ~100ms ✓  Storage:    3 years × 130 TB/day = ~142 PB (compressed, tiered)    Hot (last 30 days): 3.9 PB on fast S3 Standard    Warm (31-365 days): 47 PB on S3 Infrequent Access    Cold (1-3 years): 91 PB on S3 Glacier Instant Retrieval

Step 3: High-Level Architecture

┌────────────────────────────────────────────────────────────────────────┐  │                        USER DEVICES                                     │  │  Smart TVs, iOS, Android, Web — event SDK batches every 100ms          │  └──────────────────────────────┬─────────────────────────────────────────┘                                  ↓ Avro + Schema Registry (BACKWARD_TRANSITIVE)  ┌────────────────────────────────────────────────────────────────────────┐  │                    KAFKA (Netflix's transport layer)                    │  │  Topics: viewing_events, playback_events, search_events                │  │  Partitioned by: user_id (per-user ordering)                           │  │  Retention: 7 days + Iceberg archival                                  │  └──────────────┬─────────────────────────────┬──────────────────────────┘                 ↓ (real-time path)             ↓ (batch path)  ┌──────────────────────────────┐   ┌──────────────────────────────────┐  │   FLINK STREAM PROCESSOR     │   │  SPARK BATCH (via Maestro DAGs)  │  │   15,000+ jobs at Netflix    │   │  Daily/hourly runs               │  │                              │   │                                   │  │  • Deduplication (1hr window)│   │  Reads from: Iceberg Bronze      │  │  • Sessionization (30min gap)│   │  Computes: 30/90/365-day features│  │  • Real-time feature compute │   │  Generates: training datasets    │  │    - session_duration_min    │   │  Writes to: Iceberg Silver+Gold  │  │    - recent_genre_affinity   │   │                                   │  │    - current_content_signals │   │  Also manages: model retraining  │  │  • Write to Iceberg Bronze   │   │  via Metaflow pipelines          │  │    (micro-commit every 5min) │   └──────────────┬───────────────────┘  └──────┬───────────────────────┘                  ↓         ↓ online features                  ┌────────────────────────────┐  ┌──────────────────────────┐              │  ICEBERG DATA LAKE (S3)    │  │  ONLINE FEATURE STORE    │              │  Bronze: raw events        │  │                          │              │  Silver: sessions, joined  │  │  L1: In-process cache    │              │  Gold: analytics, features │  │      (< 1ms, JVM heap)   │              │  Time travel for A/B eval  │  │  L2: Redis Cluster       │              │  + GDPR deletion           │  │      (< 5ms, per user)   │              └──────────────┬─────────────┘  │  L3: Cassandra           │                             ↓  │      (< 15ms, durable)   │              ┌────────────────────────────┐  │                          │              │  OFFLINE FEATURE STORE     │  │  TTL: 30 sec (real-time) │              │  (Iceberg Silver on S3)    │  │       1 hour (session)   │              │  Batch features, point-in- │  │       24 hours (stable)  │              │  time correct, versioned   │  └──────┬───────────────────┘              └──────────────┬─────────────┘         └──────────────────────┬───────────────────────────┘                                ↓ feature retrieval (15ms budget)  ┌────────────────────────────────────────────────────────────────────────┐  │              RECOMMENDATION SERVING PIPELINE                           │  │                                                                         │  │  [1] Candidate Generation Service                                       │  │       • Two-Tower: user_embedding → ANN search → top-500 candidates   │  │       • User embedding from: online features (session) + batch         │  │         features (long-term preferences) combined in user tower        │  │       • Content embeddings: pre-computed nightly by Spark, stored in   │  │         FAISS/Vespa vector index                                        │  │       Latency: ~20ms                                                    │  │                                                                         │  │  [2] Ranking Service                                                    │  │       • Deep NN on top-500 candidates with 2,500 features each         │  │       • Multi-task: predicts P(play), P(complete), P(rate_high)        │  │       • Combined score: V = w1×P(play) + w2×P(complete) + w3×P(rate) │  │       Latency: ~40ms                                                    │  │                                                                         │  │  [3] Post-Ranking                                                       │  │       • Content diversity (genres, languages, content types)           │  │       • Business rules (new releases boost, licensed content caps)     │  │       • A/B variant injection (assign experiment variant per user)     │  │       Latency: ~10ms                                                    │  │                                                                         │  │  Result: Top 25-50 ranked recommendations served to client             │  └───────────────────────────────────────────┬───────────────────────────┘                                              ↓ < 100ms total                                          Netflix Client

Step 4: Feature Engineering — The Core DE Value

These capture what the user is doing RIGHT NOW in this session:

Feature
Name                    | Window        | Update Freq | Storage  ────────────────────────────────|───────────────|─────────────|─────────  current_genre_last_30min        | Sliding 30min | 30 sec      | Redis  session_duration_sec            | Session window| Real-time   | Redis  plays_abandoned_last_1hr        | Tumbling 1hr  | 5 min       | Redis  content_completion_rate_today   | Tumbling day  | 5 min       | Redis  current_device_type             | Event field   | Real-time   | Redis  search_terms_last_5min          | Sliding 5min  | 1 min       | Redis

Flink session window for viewing sessions:

on

Session window: gap = 30 minutes of inactivity = new session  stream    .keyBy("user_id")    .window(EventTimeSessionWindows.withGap(Time.minutes(30)))    .aggregate(SessionAggregator())  # builds session summary    .addSink(RedisSink("user_session:{user_id}"))

Batch Features (Spark → Iceberg → Cassandra, updated daily)

These capture long-term preferences and stable characteristics:

Feature
Name                    | Lookback  | Update Freq | Storage  ────────────────────────────────|───────────|─────────────|─────────  genre_affinity_vector[18]       | 30 days   | Daily       | Cassandra  content_completion_rate_p50     | 90 days   | Daily       | Cassandra  avg_session_duration_min        | 30 days   | Daily       | Cassandra  preferred_time_of_day           | 30 days   | Daily       | Cassandra  language_preferences            | 90 days   | Daily       | Cassandra  country, subscription_tier      | Static    | On change   | Cassandra  watch_hours_30d, 90d, 365d      | Multi-win | Daily       | Cassandra  top_directors_30d               | 30 days   | Daily       | Cassandra

Point-in-Time Correctness for Training (Critical Concept)

When training the ranking model on historical data, features MUST reflect the state known at the time of the training label — not future information.

on

WRONG: uses features computed with data after the label timestamp  training_data = join(labels, current_features, on="user_id")  # RIGHT: point-in-time join — features as of the label timestamp  training_data = spark.sql("""  SELECT      l.user_id,      l.content_id,      l.played,  -- training label: did user play this content?      f.genre_affinity_vector,      f.completion_rate_p50  FROM labels l  -- Join to the feature snapshot that was current at label timestamp  JOIN silver.user_features_daily f      ON l.user_id = f.user_id      AND l.label_date = f.feature_date  -- same-day features only  WHERE l.label_date = '2026-04-10'  """)

Iceberg time travel makes this possible: SELECT * FROM silver.user_features FOR TIMESTAMP AS OF label_timestamp retrieves exactly the features that existed when the recommendation was made.

Step 5: The Two-Tower Architecture (What DE Supports)

The DE’s job is to build the data infrastructure the ML model depends on. Here’s what that means:

TWO-TOWER
MODEL:  User Tower                          Item Tower  ─────────────                       ──────────────  Inputs: user_id, genre_affinity,    Inputs: content_id, genre, maturity,          completion_rate, session     runtime, language, release_year,          duration, device_type        cast embeddings, description text  Neural network                      Neural network     ↓                                   ↓  User embedding (128d vector)        Content embedding (128d vector)                      Similarity score = dot_product(user_emb, content_emb)

What DE builds for this model:

Training data pipeline (Maestro DAG, daily):

1.
Extract positive examples: (user_id, content_id, label=1) for viewed content  2. Negative sampling: random unviewed content (label=0)  3. Feature join: attach user and content features with point-in-time correctness  4. Write training dataset to Iceberg with version tag (date + model version)  5. Trigger Metaflow training pipeline

Pre-computed content embeddings (Spark batch, nightly):

1.
Run item tower inference on all 15K+ content items  2. Store content_id → 128d embedding in a vector index (FAISS or Vespa)  3. This enables offline ANN (Approximate Nearest Neighbor) search  4. Users are never stored in the vector index — their embeddings are computed at query time

Why pre-compute content, not user, embeddings?

  • 15K content items vs 300M users — content is 20,000x smaller
  • Content catalog changes slowly (new titles weekly, not hourly)
  • User preferences change with every interaction — must be computed fresh at query time using real-time + batch features

Step 6: A/B Testing Infrastructure

Netflix is deeply data-driven. Every algorithm change goes through experimentation.

EXPERIMENT
ASSIGNMENT:    User opens Netflix → Retrieve experiment assignment from Cassandra:      "user_123 is in experiment RECO-2026-04, variant B (new ranking model)"    Assignment is deterministic (hash of user_id + experiment_id)    Assignment is consistent: same user always gets same variant  EXPERIMENT LOGGING:    Every recommendation request logs:      - experiment_id, variant_id, user_id_hash      - ranked_items list (what was shown)      - position of each item    Every viewing event logs:      - experiment_id, variant_id at the time of recommendation      - This is critical: the variant must be the one active when the recommendation was made,        not when the user eventually watched  METRIC COMPUTATION:    Daily Spark job joins:      experiment_assignments × viewing_events × recommendation_logs    Computes per-variant metrics:      - Play rate (plays / recommendations shown)      - Stream minutes per user      - 7-day retention (did user come back?)      - Content diversity (Gini coefficient of genre distribution)    Statistical significance: sequential testing (allows early stopping)    Guardrail metrics: streaming quality must not degrade in any variant

Iceberg’s role in A/B evaluation: Time travel enables retrospective analysis. “What were the recommendations shown to variant B users on April 5th?” — query gold.recommendation_log FOR TIMESTAMP AS OF ‘2026-04-05’. This is critical for debugging experiment anomalies.

The holdout group: Netflix maintains a permanent holdout group (~1% of users) that never receives any algorithm improvements. This measures the cumulative long-term impact of all recommendation improvements over years — answering “what would Netflix engagement look like without all our ML improvements?”

Step 7: Multi-Region & Fault Tolerance

Netflix operates in 190 countries. The recommendation data infrastructure is global.

Data plane (storage):

  • Primary S3 region: US-East-1 (master copy)
  • Cross-region replication to EU-West-1 and AP-Southeast-1 for local reads
  • Iceberg metadata catalog replicated across regions
  • Feature computation runs in US-East-1, feature data replicated to regional Cassandra clusters

Serving plane (< 100ms latency globally):

  • Recommendation serving deployed in multiple AWS regions
  • Regional Redis caches for real-time features (user’s session data in their home region)
  • Cassandra multi-datacenter replication for batch features (replication factor 3 per region)
  • If US-East region is unavailable: regional failover serves recommendations from a degraded model using cached batch features only (< 100ms still achievable, slight quality degradation)

Graceful degradation hierarchy:

Full
mode:    Real-time features + batch features + full ranking model → p99 < 100ms  Degraded:     Batch features only + simplified ranking → p99 < 80ms (faster, less personalized)  Fallback:     Pre-computed default recommendations (top-20 per genre per country) → p99 < 10ms  Emergency:    Cached last recommendations per user → any latency

Step 8: Trade-offs to Articulate

Trade-off 1: Two-tower vs single model”A single model could theoretically learn better joint representations but cannot scale to 300M users × 15K items at serving time. The two-tower model pre-computes content embeddings offline — ANN search over 15K vectors takes milliseconds. The cost is that the model can’t learn cross-feature interactions between user and item directly during candidate generation. We compensate with a powerful ranker that takes the full feature set on the top-500 candidates.”

Trade-off 2: Flink vs Spark for real-time features”Flink for sub-30-second feature freshness. Spark Structured Streaming’s micro-batch would introduce 30-second to 2-minute latency — unacceptable for session features needed within seconds of a viewing action. The trade-off: Flink requires dedicated expertise and operational overhead (15,000 jobs is complex). Netflix has invested heavily in their managed Flink platform to absorb this complexity.”

Trade-off 3: Redis vs Cassandra for online feature storage”Redis for the hottest, smallest features (session state, current viewing context) — sub-5ms reads with full in-memory storage. Cassandra for the broader feature set that doesn’t fit in Redis budget — 15ms reads with multi-datacenter replication. Redis TTL = 30 seconds to 1 hour. Cassandra TTL = 24-48 hours. The combination gives us the speed of Redis for hot features and the durability + capacity of Cassandra for the full feature set.”

Trade-off 4: Recommendation quality vs latency”Our 100ms budget is allocated: 15ms features, 20ms candidate generation, 40ms ranking, 25ms overhead. If we wanted higher quality (more candidates, deeper model), we’d increase ranking time but break the 100ms SLA. Netflix resolves this with a two-pass approach: fast candidate retrieval on a simpler model, then expensive ranking on a smaller set. For TV-experience (large screen, relaxed browsing), we could potentially increase the budget to 200ms — the user interface pacing is slower.”

Interview Questions

Q1: “How do you handle the cold start problem for new Netflix subscribers with no viewing history?”

Model Answer: “Three strategies for cold start. First, at signup: collect explicit preferences — favorite genres, languages, preferred content types. Use these as the initial ‘user features’ for the first session. Second, demographic-based recommendations: use country, device type, and time-of-day to serve popular content from appropriate categories. Country is a strong signal — what’s trending in Brazil differs from Japan. Third, immediate personalization: after the user’s first interaction (play, skip, browse), Flink updates real-time features within 30 seconds. The session window begins immediately. After 3-5 interactions in the first session, the recommendations noticeably shift to reflect demonstrated preferences — the system learns faster than the user expects.

For the feature store: new users get a ‘default’ feature profile bootstrapped from their signup preferences. Flink begins filling in real-time features from their first event. Batch features take 24-48 hours to compute meaningful signals — until then, real-time features dominate.

A/B testing implication: cold start users must be tracked separately in experiments. A recommendation algorithm optimized for users with rich history may perform poorly for new users — different optimization objective for this segment.”

Q2: “A new recommendation algorithm shows a 3% lift in play rate in an A/B test. The product team wants to ship it. What data questions do you ask first?”

Model Answer: “Three categories of questions before shipping. First, guardrail metrics: did streaming quality degrade? Did complaint rates increase? Did content diversity decrease? A 3% lift in play rate means nothing if users are clicking but not actually watching, or if they’re all being funneled to the same 3 titles. Diversity (Gini coefficient of genre distribution) must be checked.

Second, user segment breakdown: does the lift hold across all segments or is it driven by one segment? A 3% average lift could be +10% for power users and -2% for casual users. If casual users are degraded, that’s a churn risk.

Third, time dynamics: is the lift stable over the test duration, or does it spike early and decay (novelty effect)? Netflix typically runs experiments for multiple weeks to measure post-novelty steady-state impact, including 7-day and 30-day retention.

Fourth, complementary metrics: did subscriber cancellation rate change? Did watch hours per session change (not just play rate)? A recommendation that gets more clicks but shorter sessions might hurt long-term engagement.

If all of these look good, I’d suggest a staged rollout — 10% → 25% → 50% → 100% — rather than a hard cutover. Monitor key metrics at each stage before expanding.”

Self-Assessment: 5 Questions

  1. Why does Netflix pre-compute content embeddings but not user embeddings?

  2. What are the three tiers of the online feature store (L1/L2/L3) and their latencies?

  3. What is point-in-time correctness and why is it critical for training data?

  4. How does Iceberg time travel support A/B test evaluation?

  5. What’s the graceful degradation path when Flink is down — how do recommendations still serve?

Quick Reference: Netflix Recommendation Data Pipeline

  • Event flow: Kafka → Flink (real-time features → Redis/Cassandra) + Spark (batch features → Iceberg → Cassandra) → Feature store
  • Two-tower: User tower (real-time + batch features) → user embedding → ANN search over pre-computed content embeddings → top-500 candidates → Deep NN ranker (40ms)
  • Feature tiers: L1 in-process cache (< 1ms) → L2 Redis (< 5ms) → L3 Cassandra (< 15ms) → Iceberg Silver (offline training)
  • Training data: Point-in-time joins using Iceberg time travel. Feature version = date of the recommendation being evaluated.
  • A/B testing: Experiment assignment in Cassandra. Variant logged with every recommendation and viewing event. Metrics computed daily by Spark joining assignment + events + recommendations.
  • Cold start: Signup preferences + country/device defaults + rapid real-time learning from first session.
  • Graceful degradation: Full model → batch features only → pre-computed defaults → cached last recommendations.

Tomorrow’s Preview

Day 35: Google Data Engineering & GCP Deep Dive — BigQuery architecture (Capacitor, Dremel, Colossus), Dataflow/Beam, Pub/Sub, Dataproc, Cloud Composer. Google interview emphasis on GCP operational fluency, cost optimization, and designing systems that use managed services effectively. The mental model shift from “build your own” to “compose managed services.”