Day 32 — Design: Meta News Feed Data Pipeline

Phase 2: Company-Specific | Category: Meta-Specific

The Prompt

“Design the data infrastructure for Meta’s News Feed. The system must support real-time ranking (content shown to users must be ranked by relevance, not just recency), feature engineering for ML models, analytics for product teams, and privacy-compliant data access. Scale: 3 billion daily active users.”

This is the most commonly asked Meta DE interview question. Set a 45-minute timer.

Step 1: Requirements (5 min)

Functional:

Ingest all user engagement events (views, likes, comments, shares, clicks, dwells) across Feed, Reels, Stories
Support real-time ranking signals for the News Feed ML model (freshness-sensitive features)
Support batch feature computation for training (historical engagement patterns)
Provide analytics for product teams (engagement metrics, A/B test results, retention)
Comply with privacy requirements (no individual user data exposed to advertisers or analysts)

Non-functional:

Scale:        3B DAU × ~500 events/user/day = 1.5T events/day ≈ 17M events/sec on average (peaks run higher)
Latency:      Ranking feature reads served in < 100ms at feed request time (online path)
              Batch features updated daily for training (offline path)
Freshness:    Real-time features: < 30 second lag
              Analytics dashboards: < 4 hours (batch)
Availability: 99.99% for ranking feature serving (feed breaks without it)
Privacy:      User-level data never exposed externally. Minimum aggregation thresholds enforced.
Retention:    Hot: 30 days. Warm: 1 year. Cold: 7 years (regulatory)

Scoping for the interview: “I’ll design three interconnected pipelines: (1) the event ingestion layer, (2) the real-time feature pipeline for serving, and (3) the batch analytics pipeline. I’ll connect them to the ranking system at a high level. I won’t design the ML model itself — the focus is data infrastructure.”

Step 2: Scale Estimation (3 min)

17M events/sec × 300 bytes avg (Scribe log format, structured JSON) = ~5 GB/sec raw
Compressed Parquet (6x): ~850 MB/sec = ~73 TB/day compressed
Storage: ~27 PB/year compressed, so ~160 PB over 6 years of warm + cold retention
Fan-out calculation (critical for Meta):
  Avg user has 300 friends/follows
  17M events/sec → need to update feed caches for affected users
  Fan-out ratio: 1 event → ~300 cache writes = ~5B cache writes/sec
  → Cannot fan-out synchronously for every event → must be selective
Ranking model features:
  Per user: ~2,000 features (historical preferences, affinity scores)
  Per content piece: ~500 features (engagement velocity, content quality signals)
  Feature serving: 3B DAU × 10 feed opens/day × ~2,500 feature lookups = 75T lookups/day ≈ 870M lookups/sec
  → Requires massive in-memory feature store with multiple layers of caching

Architecture implications: 17M events/sec rules out synchronous fan-out for all users. Roughly 870M feature lookups/sec requires tiered caching (in-memory → Redis → persistent feature store). 73 TB/day requires distributed storage with tiering.
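The estimates above are a handful of multiplications, so it is worth being able to reproduce them on a whiteboard. A minimal sketch that recomputes them from the stated inputs (all variable names are illustrative; the inputs are the assumptions given in this section):

```python
# Back-of-envelope scale estimate; every input is an assumption stated in this section.
SECONDS_PER_DAY = 86_400

dau = 3e9                   # daily active users
events_per_user = 500       # engagement events per user per day
bytes_per_event = 300       # average raw event size
compression_ratio = 6       # raw bytes -> compressed Parquet
avg_followers = 300         # average friends/follows per user
feed_opens_per_user = 10    # feed opens per user per day
lookups_per_open = 2_500    # feature lookups per feed open

events_per_day = dau * events_per_user
events_per_sec = events_per_day / SECONDS_PER_DAY          # daily average, not peak
raw_gb_per_sec = events_per_sec * bytes_per_event / 1e9
# Unrounded this is 75 TB/day; the ~73 TB/day figure comes from rounding to 17M events/sec.
compressed_tb_per_day = events_per_day * bytes_per_event / compression_ratio / 1e12
naive_fanout_writes_per_sec = events_per_sec * avg_followers
lookups_per_day = dau * feed_opens_per_user * lookups_per_open
lookups_per_sec = lookups_per_day / SECONDS_PER_DAY

print(f"events/sec (avg):      {events_per_sec / 1e6:.1f}M")
print(f"raw ingest:            {raw_gb_per_sec:.1f} GB/sec")
print(f"compressed storage:    {compressed_tb_per_day:.0f} TB/day")
print(f"naive fan-out writes:  {naive_fanout_writes_per_sec / 1e9:.1f}B/sec")
print(f"feature lookups:       {lookups_per_sec / 1e6:.0f}M/sec")
```

All of these are daily averages; a real capacity plan would multiply by a peak-to-average factor, which this section does not specify.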

Step 3: High-Level Architecture

┌────────────────────────────────────────────────────────────────────┐
│                        EVENT SOURCES                               │
│  Facebook web/mobile → Scribe SDK (batch 100ms or 50 events)       │
│  Instagram → Scribe SDK                                            │
│  WhatsApp (anonymized signals) → Scribe SDK                        │
│  Backend services (impression logging) → Scribe direct             │
└───────────────────────────┬────────────────────────────────────────┘
                            ↓ structured event logs
┌────────────────────────────────────────────────────────────────────┐
│              SCRIBE (Meta's internal Kafka-equivalent)             │
│  Topics: user_engagements, content_views, ad_impressions, signals  │
│  Partitioned by user_id (per-user event ordering)                  │
│  Retention: 7 days hot, tiered to cold storage                     │
└──────────────┬────────────────────────────┬────────────────────────┘
               ↓ (real-time path)           ↓ (batch path)
┌──────────────────────────┐   ┌────────────────────────────────────┐
│  FLINK STREAM PROCESSOR  │   │   SPARK BATCH PROCESSOR            │
│  • Dedup (1hr window)    │   │   • Daily batch runs               │
│  • PII tokenization      │   │   • Complex feature computation    │
│  • Session detection     │   │   • Training dataset generation    │
│  • Real-time aggregation │   │   • Gold layer analytics tables    │
│  • Affinity scoring      │   │   Reads from: Hive/Iceberg bronze  │
└──────┬───────────────────┘   └──────────────┬─────────────────────┘
       ↓                                      ↓
┌──────────────────────────┐   ┌────────────────────────────────────┐
│  ONLINE FEATURE STORE    │   │  OFFLINE DATA LAKE (Hive/Iceberg)  │
│  (RocksDB + Redis cache) │   │  Bronze: raw events (immutable)    │
│  • Real-time features    │   │  Silver: cleaned, deduped, joined  │
│  • User engagement state │   │  Gold: fact tables, features, A/B  │
│  • Content signals       │   │  Partitioned: date + surface_type  │
│  p99 read: < 5ms         │   │  Query: Presto (interactive)       │
└──────┬───────────────────┘   │           Spark (batch ETL)        │
       │                       └──────────────┬─────────────────────┘
       ↓                                      ↓
┌──────────────────────────┐   ┌────────────────────────────────────┐
│  FEED RANKING SERVICE    │   │  SCUBA (Real-time OLAP)            │
│  Calls feature store at  │   │  • Analyst self-serve              │
│  feed request time       │   │  • Last 30 days hot data           │
│  Presto for feature      │   │  • Product team dashboards         │
│  federation queries      │   │  • A/B experiment results          │
└──────────────────────────┘   └────────────────────────────────────┘

Step 4: The Fan-Out Architecture — The Meta Signature Topic

Per Hello Interview and LinkedIn, this is the critical design decision Meta interviewers probe:

The Three Fan-Out Strategies

Fan-out on Write (Push Model):

User A posts → Fetch A's 300 followers → Write post to each follower's feed cache
Result: Pre-computed feeds, sub-ms read latency
Cost:   300 writes per post. For 100M posts/day × 300 followers = 30B cache writes/day
Problem: What if User A is a celebrity with 50M followers?
          50M cache writes for one post = thundering herd

Fan-out on Read (Pull Model):

User opens feed → Fetch all users they follow → Fetch recent posts from each → Rank → Serve
Result: Always fresh, no pre-computation
Cost:   High read latency (merge 300+ user timelines + ranking on every request)
Problem: If user follows 500 accounts, 500 DB reads per feed open × 3B DAU = prohibitive

Meta’s Hybrid (Production):

Regular users (< 10K followers): Fan-out on WRITE
  → When they post, pre-populate followers' feed caches
  → Fast reads, manageable write amplification
Celebrities / High-follower accounts (> 10K followers): Fan-out on READ
  → Their posts are NOT pushed to all follower caches
  → At feed-load time, fetch their recent posts separately and merge
Active vs Inactive users:
  → Active users: pre-compute feed (fan-out on write)
  → Inactive users (not logged in > 30 days): no pre-computation
    On return, compute feed on-demand (fan-out on read)

The data pipeline implication: The fan-out service needs a real-time classifier of user type (regular vs celebrity, active vs inactive). This is a Flink stateful job that maintains per-user follower counts and activity state, and routes each post to the appropriate fan-out path.
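The routing decision itself is small once follower counts and activity state are available. A minimal sketch of the hybrid rule, assuming the 10K-follower and 30-day thresholds from this section; the class and function names are hypothetical, and in production this logic would live inside the stateful stream job rather than in plain Python:

```python
from dataclasses import dataclass
from enum import Enum

CELEBRITY_FOLLOWER_THRESHOLD = 10_000  # above this, merge at read time
INACTIVE_AFTER_DAYS = 30               # no pre-computation for users idle longer than this


class FanOut(Enum):
    WRITE = "write"
    READ = "read"


@dataclass
class Author:
    user_id: str
    follower_count: int


@dataclass
class Follower:
    user_id: str
    days_since_last_login: int


def author_strategy(author: Author) -> FanOut:
    """High-follower authors are merged at read time; everyone else is pushed."""
    if author.follower_count > CELEBRITY_FOLLOWER_THRESHOLD:
        return FanOut.READ
    return FanOut.WRITE


def push_targets(author: Author, followers: list[Follower]) -> list[str]:
    """Follower feed caches to write to when the author publishes one post."""
    if author_strategy(author) is FanOut.READ:
        return []  # celebrity posts are fetched and merged at feed-load time
    # Skip inactive followers; their feed is computed on demand when they return.
    return [f.user_id for f in followers if f.days_since_last_login <= INACTIVE_AFTER_DAYS]
```

For example, a regular author with followers idle for 0, 45, and 30 days triggers two cache writes (the 45-day follower is skipped), while a celebrity author triggers none.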

Step 5: The Ranking Pipeline — Four Stages

Per Meta’s engineering blog and YouTube ML system design interview, News Feed ranking uses a multi-pass pipeline:

Feed Request (user opens Facebook)
    ↓
STAGE 1: CANDIDATE GENERATION
  - Pull eligible posts from feed cache (fan-out on write) or
    from social graph query (fan-out on read for celebrities)
  - ~500-2,000 candidate posts for each user
  - Fast retrieval, coarse filtering (deleted posts, blocked users, seen content)
    ↓
STAGE 2: LIGHTWEIGHT SCORING (Pass 1)
  - Apply fast logistic regression model on ~50 features
  - Reduces 2,000 candidates to top 500
  - Features: recency, author affinity, content type preference
  - Latency budget: ~10ms
    ↓
STAGE 3: HEAVYWEIGHT SCORING (Pass 2)
  - Apply deep neural network on top 500 candidates
  - ~2,500 features per candidate (from feature store)
  - Multi-task: predicts likelihood of like, comment, share, video watch time
  - Aggregate into single ranking score V = w1×P(like) + w2×P(comment) + w3×P(share)
  - Latency budget: ~50ms
    ↓
STAGE 4: POST-RANKING ADJUSTMENTS
  - Content diversity (don't show 5 videos in a row)
  - Business rules (ads insertion, integrity filtering)
  - Freshness boost for very recent content
  - Final ranked list of 20-50 items
    ↓
Serve to user

The data pipeline’s job: Provide the features for Stages 2 and 3 in < 100ms total.
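The funnel shape matters for the data pipeline because each stage reads a different number of features for a different number of candidates. A toy sketch of Stages 2 and 3, assuming made-up coefficients and feature names (`author_affinity`, `age_hours`, `p_like`, and so on are illustrative, not Meta's actual features); Stage 4 adjustments are omitted:

```python
import math

Post = dict  # hypothetical: a candidate post represented as a dict of feature values


def light_score(post: Post) -> float:
    """Stage 2 stand-in: a tiny logistic model over a few cheap features."""
    z = 1.5 * post["author_affinity"] - 0.1 * post["age_hours"]
    return 1.0 / (1.0 + math.exp(-z))


def heavy_score(post: Post, weights: dict[str, float]) -> float:
    """Stage 3 stand-in: V = w1*P(like) + w2*P(comment) + w3*P(share)."""
    return sum(w * post[f"p_{action}"] for action, w in weights.items())


def rank(candidates: list[Post], weights: dict[str, float],
         keep_after_light: int = 500, final_size: int = 20) -> list[Post]:
    # Cheap model prunes the candidate set first ...
    shortlist = sorted(candidates, key=light_score, reverse=True)[:keep_after_light]
    # ... so the expensive model only scores the survivors.
    ranked = sorted(shortlist, key=lambda p: heavy_score(p, weights), reverse=True)
    return ranked[:final_size]
```

One consequence worth saying out loud in an interview: a post the heavy model would have loved can still be dropped by the lightweight pass, which is the price paid for staying inside the latency budget.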

Step 6: Feature Engineering — The Core DE Deliverable

The data pipeline must produce two types of features:

Online Features (Real-Time, < 30 sec lag)

These are computed by Flink from the Scribe stream and stored in RocksDB:

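Feature	Computation	Update frequency
`post_like_count_last_1hr`	Sliding 1-hour window (1-minute slide), COUNT(likes) per `post_id`	Every minute
`user_session_length_min`	Session window per user (gap = 30 min), current duration	Real-time
`user_engagement_last_30min`	Sliding 30-min window, SUM(actions) per user	Every 30 seconds
`author_post_velocity`	Count of posts by author in last 6 hours	Every 5 minutes
`content_virality_score`	Rate of change of like count (delta / time)	Every minute

Flink job example:

# Flink stateful aggregation (PyFlink sketch): likes per post over the last hour, refreshed every minute.
# A tumbling 1-hour window would emit only once per hour, so a sliding window is used instead.
# CountAggregate (an AggregateFunction) and feature_store_sink are user-defined, not Flink built-ins.
from pyflink.common.time import Time
from pyflink.datastream.window import SlidingEventTimeWindows

(stream
    .filter(lambda e: e["event_type"] == "like")
    .key_by(lambda e: e["post_id"])
    .window(SlidingEventTimeWindows.of(Time.hours(1), Time.minutes(1)))
    .aggregate(CountAggregate())
    .add_sink(feature_store_sink("post_like_count_last_1hr")))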
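To make the window semantics concrete without a Flink cluster, here is a plain-Python sketch of a rolling one-hour like count for a single post, plus the delta-over-time virality score from the table. It assumes events arrive in event-time order; the class and function names are illustrative, and Flink would hold the equivalent state per `post_id` in its keyed state backend:

```python
from collections import deque


class RollingLikeCounter:
    """Likes for one post over the trailing window, by event time in seconds."""

    def __init__(self, window_seconds: int = 3600):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def add_like(self, event_time: float) -> None:
        self._timestamps.append(event_time)

    def count(self, now: float) -> int:
        # Evict likes that have aged out of the window (now - window, now].
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


def virality_score(count_now: int, count_before: int, elapsed_seconds: float) -> float:
    """Rate of change of the like count: delta / time, in likes per second."""
    return (count_now - count_before) / elapsed_seconds
```

Storing every timestamp is exact but memory-hungry; at production volume the same result is usually approximated with per-minute buckets, which is effectively what a one-minute slide does.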

Offline Features (Batch, Updated Daily)

These are computed by Spark from the data lake and stored in Hive/Iceberg:

Feature	Computation	Update frequency
`user_category_affinity`	Last 90 days of views by content category, normalized	Daily
`user_friend_interaction_score`	Mutual engagement history per friend pair	Daily
`author_quality_score`	Engagement rate, share rate, report rate over 30 days	Daily
`user_video_watch_rate_p90`	Percentile of video completion rates	Daily
`user_time_of_day_preference`	Engagement distribution by hour	Daily

The point-in-time correctness requirement: When training a model on historical data, features must reflect what was known at the time of the training label — not future knowledge. This requires time-windowed feature computation in the training pipeline: “For each (user, post, timestamp) training example, compute features using only data available before timestamp.”
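The rule "only data available before timestamp" reduces to an as-of lookup against versioned feature snapshots. A minimal sketch, assuming each feature is stored as a sorted list of `(computed_at, value)` pairs; the function names and the tuple layout are assumptions for illustration, and a Spark job would express the same thing as an as-of join:

```python
from bisect import bisect_left


def feature_as_of(snapshots: list[tuple[int, float]], label_ts: int,
                  default: float = 0.0) -> float:
    """Latest feature value computed strictly before label_ts.

    snapshots: (computed_at, value) pairs sorted by computed_at.
    """
    times = [t for t, _ in snapshots]
    idx = bisect_left(times, label_ts)  # index of first snapshot at or after label_ts
    return snapshots[idx - 1][1] if idx > 0 else default


def build_training_rows(examples, snapshots_by_user):
    """examples: iterable of (user, post, timestamp, label) tuples."""
    rows = []
    for user, post, ts, label in examples:
        feat = feature_as_of(snapshots_by_user.get(user, []), ts)
        rows.append((user, post, ts, feat, label))
    return rows
```

Using the latest snapshot instead (the easy bug) leaks future engagement into the features, so offline metrics look better than what the model achieves online.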

Step 7: Privacy-Aware Design

This is where Meta interviews differentiate. You must proactively address privacy — don’t wait to be asked.

PII tokenization at ingestion (Layer 1):

Raw event: { "user_id": "12345", "ip": "192.168.1.1", "name": "Alice" }
After PII gate: { "user_id_hash": "a8f2c...", "geo_region": "US-West", "name": [REDACTED] }
Only the hash enters the pipeline; the raw PII never leaves the ingestion layer
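A sketch of what the PII gate could look like. The section only says "hash"; using a keyed hash (HMAC-SHA256) is an assumption added here, because an unkeyed hash of a numeric ID can be reversed by hashing every possible ID. `ip_to_region` and `SAFE_FIELDS` are hypothetical names, and the secret key is assumed to live in a secrets manager outside the pipeline:

```python
import hashlib
import hmac

# Hypothetical allowlist: only these non-PII fields pass through unchanged.
SAFE_FIELDS = {"event_type", "post_id", "timestamp"}


def tokenize_event(raw: dict, secret_key: bytes, ip_to_region) -> dict:
    """Replace direct identifiers with a keyed hash and a coarse region.

    ip_to_region is a caller-supplied function (for example a GeoIP lookup).
    """
    digest = hmac.new(secret_key, raw["user_id"].encode("utf-8"), hashlib.sha256).hexdigest()
    event = {k: v for k, v in raw.items() if k in SAFE_FIELDS}
    event["user_id_hash"] = digest
    event["geo_region"] = ip_to_region(raw["ip"])
    event["name"] = "[REDACTED]"
    return event
```

An allowlist is used rather than a blocklist so that a newly added client field is dropped by default instead of leaking by default.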

Access control by layer (Layer 2):

Bronze (raw, hashed): accessible to data platform team only
Silver (cleaned, aggregated): accessible to ML and product teams
Gold (analytics, aggregated): accessible to all analysts
  → Minimum threshold: no metric shown for < 1000 users (prevents re-identification)
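The minimum-threshold rule is easiest to enforce inside the aggregation itself, so that small groups never reach a dashboard. A minimal sketch, assuming rows of `(segment, user_id_hash, engagement_count)`; the row layout and function name are illustrative:

```python
from collections import defaultdict

MIN_USERS = 1000  # minimum aggregation threshold from the access rules above


def aggregate_with_threshold(rows, min_users: int = MIN_USERS) -> dict:
    """rows: iterable of (segment, user_id_hash, engagement_count)."""
    users = defaultdict(set)
    totals = defaultdict(int)
    for segment, user_hash, count in rows:
        users[segment].add(user_hash)
        totals[segment] += count
    # Segments with fewer distinct users than the threshold are suppressed entirely.
    return {
        seg: {"users": len(users[seg]), "engagements": totals[seg]}
        for seg in totals
        if len(users[seg]) >= min_users
    }
```

The threshold counts distinct users, not rows: one very active user generating 1,000 events must not unlock a segment.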

Differential privacy for model training (Layer 3):

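Model training uses DP-SGD (Differentially Private Stochastic Gradient Descent): per-example gradients are clipped and calibrated noise is added before each update
This bounds how much any single user’s data can influence, or be inferred from, the trained model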
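The core of DP-SGD is one aggregation step: clip each example's gradient to a maximum L2 norm, sum, add Gaussian noise scaled to that norm, then average. A dependency-free sketch of that step; the function name and list-of-lists gradient format are illustrative, and privacy accounting (tracking the resulting epsilon) is deliberately left out:

```python
import math
import random


def dp_sgd_gradient(per_example_grads, clip_norm: float, noise_multiplier: float, rng=random):
    """Clip each example's gradient, sum, add Gaussian noise, and average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down only gradients whose L2 norm exceeds clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            total[i] += g * scale
    sigma = noise_multiplier * clip_norm  # noise std is tied to the clipping bound
    n = len(per_example_grads)
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]
```

Clipping caps any one example's contribution, and the noise hides what remains; setting `noise_multiplier` to 0 reduces this to ordinary clipped SGD with no privacy guarantee.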

Right to erasure:

User requests deletion:
  1. user_id_hash added to deletion_requests table
  2. Iceberg row-level deletes from bronze/silver tables (soft delete, physically removed at compaction)
  3. RocksDB features for that user evicted from online store
  4. Model training pipeline excludes deleted users from training sets
  5. Audit log: deletion completed within 24 hours per regulatory SLA

Step 8: Data Model — The Gold Layer

The gold layer serves three consumers with different models:

For product analytics teams (Presto on Hive/Iceberg):

gold.fact_feed_engagements (Transaction Fact)
  Grain: one row per engagement event (view, like, comment, share) per user per content
  Partitioned: by event_date, clustered by surface_type
  engagement_id, user_id_hash, content_id, author_id_hash,
  surface_type (feed/reels/stories), event_type,
  session_id, engagement_timestamp,
  content_age_min, is_organic, is_sponsored,
  event_date → dim_date, user_segment_sk → dim_user_segment

For ML training (Spark reads from silver):

silver.user_content_interactions (Wide event table, schema-on-read)
  All raw engagement signals with full feature context
  Point-in-time query support via Iceberg time travel
  Partitioned by event_date

For A/B testing:

gold.experiment_assignments (Transaction Fact)
  Grain: one row per user per experiment per day
  experiment_id, user_id_hash, variant_id, assignment_date, country
gold.experiment_metrics (Periodic Snapshot)
  Grain: one row per experiment × variant × metric × day
  Joined with fact_feed_engagements to compute metric deltas

Step 9: Trade-offs to Articulate

Trade-off 1: Fan-out threshold (10K followers)
“I chose 10K followers as the celebrity threshold for switching from write to read fan-out. Below 10K, the write amplification (at most 10K cache writes per post) is acceptable given the read-latency benefit. Above 10K, the celebrity’s posts are merged at read time. The exact threshold should be tuned empirically; Meta likely uses something similar based on write throughput and cache budget.”

Trade-off 2: Online vs offline features
“I split features into online (Flink, < 30 sec) and offline (Spark, daily) based on how quickly they change and their impact on ranking accuracy. Rapidly changing signals (post virality, current session behavior) must be online: a post with 10K likes in the last hour is far more relevant than one with the same engagement over a week. Historical preferences (category affinity, friend interaction scores) change slowly, so a daily batch is sufficient and far cheaper to compute.”

Trade-off 3: Exactness vs scalability for engagement counts
“For post_like_count_last_1hr, I’d use HyperLogLog sketches for unique user counts (COUNT DISTINCT) and exact counts for total likes. HLL introduces ~2% error but reduces memory per counter from O(n) to O(log log n); at 17M events/sec, this is a multi-TB vs multi-GB difference in Flink state.”
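For intuition on why HyperLogLog is so compact, here is a small self-contained implementation of the textbook algorithm. It is a teaching sketch, not what Flink or Meta ships: the SHA-256-based 64-bit hash and the default of 4,096 registers (p = 12, roughly 1.6% standard error) are choices made for this example, and the large-range correction is omitted:

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p                      # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128

    def add(self, item: str) -> None:
        # 64-bit hash taken from the first 8 bytes of SHA-256.
        h = int.from_bytes(hashlib.sha256(item.encode("utf-8")).digest()[:8], "big")
        idx = h >> (64 - self.p)             # first p bits choose the register
        rest = h & ((1 << (64 - self.p)) - 1)
        # Rank = position of the leftmost 1-bit in the remaining (64 - p) bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        est = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:    # small-range correction (linear counting)
            est = self.m * math.log(self.m / zeros)
        return est
```

Each register only needs to hold a small integer, so the whole sketch is a few kilobytes regardless of how many users liked the post, and two sketches can be merged by taking the element-wise maximum of their registers.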

Trade-off 4: Scuba vs Presto for analytics
“Scuba serves last-30-days real-time analytics for product teams who need to check feature impact immediately after a launch. Presto on Hive/Iceberg serves historical analytical queries. They’re complementary: Scuba is fast and fresh but limited in retention; Presto is flexible and historical but queries take seconds-to-minutes.”

Failure Mode Analysis

Failure	Impact	Mitigation
Flink job crashes	Online features go stale, ranking uses cached values	Flink recovers from 30-sec checkpoint. Fallback: serve last known feature values for up to 5 min
RocksDB node failure	Feature lookups fail, ranking degrades	Replicated RocksDB. Graceful degradation: serve simplified ranking without real-time features
Scribe topic lag	Events delayed, features stale	Consumer lag monitoring on Scribe topics. Alert at > 1M events lag. Scale Flink consumers.
Fan-out service overload	Feeds not updated for some users	Async queue. Users see slightly stale feed — acceptable for eventual consistency
Privacy gate bug	PII reaches pipeline	Defense-in-depth: automated PII scanner on bronze layer sends alerts + blocks downstream tables
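The first two mitigations in the table amount to a three-way fallback at read time: fresh value, last known value if it is recent enough, otherwise degrade. A minimal sketch, assuming the 5-minute staleness limit from the table; the names and the `(value, mode)` return shape are illustrative:

```python
from dataclasses import dataclass
from typing import Optional

MAX_STALENESS_SECONDS = 300  # serve last known feature values for up to 5 min


@dataclass
class CachedFeature:
    value: float
    updated_at: float  # epoch seconds of the last successful update


def resolve_feature(fresh: Optional[float], cached: Optional[CachedFeature], now: float):
    """Return (value, mode) where mode is 'fresh', 'stale', or 'degraded'."""
    if fresh is not None:
        return fresh, "fresh"
    if cached is not None and now - cached.updated_at <= MAX_STALENESS_SECONDS:
        return cached.value, "stale"
    return None, "degraded"  # caller falls back to ranking without real-time features
```

Returning the mode alongside the value lets the ranking service emit a metric per mode, which doubles as an alert signal when the share of stale or degraded reads climbs.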

Interview Questions

Q1: “How does your pipeline handle the case where a post goes viral — from 100 likes to 10 million likes in 30 minutes?”

Model Answer: “The virality scenario is exactly why the online feature pipeline must be Flink-based with short windows, not daily batch. When a post starts going viral, the post_like_count_last_1hr feature in RocksDB is refreshed every minute by Flink’s windowed aggregation. The ranking model sees the rapidly increasing like count and progressively boosts the post’s rank for every feed ranked from that point on, including feeds of users who have not opened the app yet. Distribution can also adapt: once the post’s engagement velocity crosses a high-reach threshold (analogous to the follower-count celebrity threshold), it may be treated as ‘high-reach content’ and pushed more aggressively to relevant users via the content recommendation path (distinct from the friend-based social graph fan-out). The key is that feature freshness drives the ranking update organically, so I don’t need a special ‘viral content’ code path. The pipeline just needs to update features fast enough that the model sees the signal within minutes.”

Q2: “The product team wants to run an A/B test comparing ranked feeds vs chronological feeds. What data infrastructure do you need?”

Model Answer: “This requires three pipeline additions. First, experiment assignment: a deterministic assignment service assigns each user to a variant (ranked or chronological) based on user_id hash. The assignment is logged to gold.experiment_assignments, one row per user per experiment per day. Second, variant-aware feature logging: every engagement event in fact_feed_engagements must include experiment_id and variant_id columns, set at event generation time. This is the key: you can’t retroactively determine which variant a user was in for past events. Third, metric computation: a daily Spark job joins experiment_assignments with fact_feed_engagements on user_id_hash and date to compute per-variant metrics (engagement rate, session length, 7-day retention). Statistical significance is computed using sequential testing, which allows early stopping. Privacy consideration: the experiment metrics are aggregated, so no individual user behavior is shown per variant, only aggregate statistics. The minimum reporting threshold (1000 users per variant) applies here too.”
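Deterministic assignment can be sketched in a few lines. Mixing the experiment ID into the hash input is an assumption added here (the answer above only says "based on user_id hash"); without it, the same users would land in the same bucket for every experiment. The function name and variant labels are illustrative:

```python
import hashlib


def assign_variant(user_id_hash: str, experiment_id: str,
                   variants=("ranked", "chronological")) -> str:
    """Same user + experiment always maps to the same variant, with no lookup table."""
    key = f"{experiment_id}:{user_id_hash}".encode("utf-8")
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % len(variants)
    return variants[bucket]
```

Because the function is pure, the client, the ranking service, and the offline Spark job can all recompute the assignment independently and agree, which makes the logged `variant_id` easy to audit.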

Self-Assessment: 5 Questions Before Moving On

Why does Meta use a hybrid fan-out model rather than pure fan-out on write?
What are the four stages of the ranking pipeline and what does each reduce?
What’s the difference between online and offline features — give a concrete example of each?
How does point-in-time correctness affect training data feature computation?
At what layer is PII stripped and why must it happen there (not later)?

Quick Reference: Meta News Feed Data Pipeline

Fan-out hybrid: Regular users → write (pre-compute). Celebrities (> 10K followers) → read (merge at query time). Inactive users → pull on return.
Four ranking stages: Candidate generation → Lightweight scoring (logistic regression) → Heavyweight scoring (deep NN, 2500 features) → Post-ranking adjustments (diversity, ads, freshness).
Features split: Online (Flink, RocksDB, < 30 sec) for rapidly changing signals. Offline (Spark, Hive, daily) for stable historical patterns.
Privacy layers: PII tokenized at ingestion (never enters pipeline). Access control by layer. Minimum aggregation thresholds. Right-to-deletion via Iceberg row deletes.
The gold layer: fact_feed_engagements (transaction fact), experiment_assignments + experiment_metrics (A/B testing), wide silver table for ML training.
Meta’s stack: Scribe → Flink → RocksDB (online features) + Hive/Iceberg (offline) → Scuba (real-time analytics) + Presto (ad-hoc).

Tomorrow’s Preview

Day 33: Netflix Data Infrastructure & Interview Patterns — Netflix’s Iceberg-first architecture, the culture of freedom and responsibility, ownership expectations, how Netflix interviews emphasize architectural trade-off fluency over memorized solutions, and the specific stack (S3, Spark, Flink, Iceberg, Trino, Druid) you must know cold going into a Netflix interview.
