Day 31 — Meta Data Infrastructure & Interview Patterns

Phase 2: Deep Dives & Company-Specific | Category: Meta-Specific

Welcome to Phase 2

Phase 1 gave you the foundations. Phase 2 translates those foundations into the specific language, scale, and architectural philosophy of each target company. The difference between a good interview answer and a great one at Meta isn’t technical knowledge — it’s knowing that Meta’s engineers think in billions of users as the baseline, not as a stretch goal.

What Meta Looks For at Senior/Staff Level

Per Exponent, Interview Query, and Fonzi AI, Meta evaluates senior DE candidates on four dimensions:

Dimension	What it means at Meta
Scale thinking	Every design must proactively address billions of users, petabytes of data. Generic “add more servers” answers fail. Discuss specific sharding, caching, partitioning strategies.
Product sense	Data engineers at Meta are expected to reason about why data matters, not just how to move it. “How would you measure Reels engagement?” is a DE question at Meta.
Ownership	Meta’s culture evolved from “move fast and break things” to “move fast with stable infrastructure.” You’re expected to own systems end-to-end: build, operate, improve, and advocate for your pipeline.
Trade-off articulation	Meta interviewers care more about your reasoning than your specific choices. Every decision must be justified with explicit trade-offs.

The #1 Meta interview mistake per Exponent: “Designing without addressing scale. Presenting a system that works for 1,000 users but has no clear path to billions. At Meta, scale isn’t a follow-up question. It’s a baseline expectation. If you don’t proactively address it, your interviewer will, and you’ll be playing catch-up.”

Meta’s Data Infrastructure Stack

Understanding the actual stack is what lets you say “I’d use Presto for this” instead of “I’d use a query engine for this.” From Meta’s own engineering blog and public talks:

The Query Layer

Presto (now PrestoDB / Velox-native)

Meta’s primary interactive analytics query engine — powers analyst self-serve, ad-hoc exploration
Federated queries across Hive, Iceberg, RocksDB, MySQL, Kafka
Designed for sub-minute interactive queries on petabyte datasets
Meta contributed Velox (C++ vectorized execution engine) which now backs Presto’s execution
Use when: interactive analytics, ad-hoc SQL, analyst self-serve, multi-source federation

Spark

Batch ETL for heavy transformation jobs
Used for daily/hourly data processing at scale — the workhorse of Meta’s pipeline layer
Runs on Meta’s internal cluster management rather than a stock Hadoop/YARN deployment
Use when: large-scale batch processing, complex multi-stage transforms

Hive

The legacy Metastore — still widely used for metadata catalog (table definitions, partition specs)
Hive tables = the metadata standard. Even Spark and Presto jobs reference the Hive Metastore.
Hive queries (MapReduce-backed) are largely replaced by Presto/Spark for execution
Use when: discussing metadata catalog and table registry — not execution

The Storage Layer

Hive tables on HDFS/ORC (legacy) and increasingly Iceberg tables on HDFS/object storage

Meta is actively migrating to Iceberg for ACID support, time travel, schema evolution
Partitioned by date, often hourly for high-volume event data
Parquet (newer pipelines) and ORC (legacy) as file formats

RocksDB

Embedded key-value store used for online feature storage and serving
Ultra-low latency reads for ML inference — effectively their online feature store
Also used in some streaming state management (Flink stateful processing)

TAO (The Associations and Objects)

Meta’s distributed social graph storage — the backing store for the social graph
Custom-built graph database optimized for friendship/follower relationships
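Rarely asked about directly in DE interviews, but worth knowing why Meta built it: the social graph is an extremely read-heavy workload that needed a purpose-built caching and storage layer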

The Streaming Layer

Scribe (Meta’s internal log aggregation system)

Every user interaction on Facebook/Instagram/WhatsApp generates a Scribe log
Scribe fills the role Kafka plays elsewhere: a distributed log transport that collects and delivers streaming event data
Real-time event data flows through Scribe before landing in batch storage

Flink + Kafka Streams

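Open-source stand-ins for real-time processing on top of Scribe; Meta’s own published stream processors are internal systems (Puma, Stylus, Swift), so name Flink in the interview and acknowledge the internal equivalents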
Powers real-time ranking signals, fraud detection, and live dashboards

Scuba

Meta’s internal real-time analytics database — the equivalent of ClickHouse or Druid
Stores recent time-series data (last ~30 days) for interactive real-time queries
Interviewers may ask “how would you design a system like Scuba” — it’s their real-time OLAP

The Orchestration Layer

Dataswarm (Meta’s internal DAG scheduler), with Airflow as the closest open-source analogue

Airflow was created at Airbnb by an engineer who had previously worked on data tooling at Facebook, which is why the two are often compared
Large-scale deployments with thousands of DAGs
Internal tooling for Meta-specific features (SLA tracking, lineage)

Meta Scale: The Numbers You Must Internalize

When you design at Meta, these are your mental defaults:

Metric	Meta scale
Monthly active users	~3.3 billion (across all apps)
Daily active users	~3.27 billion (2025)
Events per day	Trillions (across Facebook, Instagram, WhatsApp, Messenger)
Photos uploaded per day	~100 million
Video minutes watched	Billions per day
Ad impressions per day	~100 billion
Engineers	~70,000+
Data centers	Global, multi-region

Translating to pipeline design:

3B DAU × 200 events/user/day = 600B events/day = ~7M events/sec
7M events/sec × 500 bytes avg = 3.5 GB/sec raw ingestion
Compressed Parquet (6x): ~580 MB/sec = ~50 TB/day compressed
Annual storage: ~18 PB/year for event data alone
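
The arithmetic above can be reproduced with a short script, which is a useful habit for interview prep because it lets you swap in different assumptions quickly. This is a minimal sketch: the function name `estimate_ingest` and the `IngestEstimate` fields are hypothetical, and the inputs (200 events per user per day, 500-byte events, 6x compression) are the illustrative figures from the text, not published Meta numbers.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class IngestEstimate:
    events_per_day: float
    events_per_sec: float            # daily average, not peak
    raw_bytes_per_sec: float
    compressed_bytes_per_sec: float
    compressed_tb_per_day: float     # decimal terabytes (1e12 bytes)
    compressed_pb_per_year: float    # decimal petabytes (1e15 bytes)


def estimate_ingest(dau, events_per_user, avg_event_bytes, compression_ratio):
    """Back-of-envelope ingestion sizing from four assumed inputs."""
    events_per_day = dau * events_per_user
    events_per_sec = events_per_day / SECONDS_PER_DAY
    raw_bps = events_per_sec * avg_event_bytes
    compressed_bps = raw_bps / compression_ratio
    tb_per_day = compressed_bps * SECONDS_PER_DAY / 1e12
    pb_per_year = tb_per_day * 365 / 1000
    return IngestEstimate(
        events_per_day, events_per_sec, raw_bps,
        compressed_bps, tb_per_day, pb_per_year,
    )


est = estimate_ingest(dau=3e9, events_per_user=200,
                      avg_event_bytes=500, compression_ratio=6)
print(f"{est.events_per_sec / 1e6:.2f}M events/sec average")
print(f"{est.raw_bytes_per_sec / 1e9:.2f} GB/sec raw")
print(f"{est.compressed_tb_per_day:.1f} TB/day compressed")
print(f"{est.compressed_pb_per_year:.2f} PB/year compressed")
```

Note that these are daily averages. Real traffic is bursty, so a design should state an assumed peak-to-average ratio on top of them.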

This is why Meta built custom solutions (Scribe, Scuba, TAO) — nothing off-the-shelf handled these numbers in 2008. In 2026, cloud-native tools can approach this scale, but Meta’s internal systems are still more battle-tested at their specific load profile.

The Meta Data Engineering Mental Model

Think top-down, not bottom-up: Start with the business process and the consumer’s query pattern. Only then choose technology.

Privacy by design: Meta operates under intense regulatory scrutiny. Every data engineering design must include PII handling, access controls, and data minimization. This shows up in interviews — don’t design a pipeline without mentioning how sensitive user data is handled.

Fan-out-on-write vs fan-out-on-read: Meta’s News Feed is the canonical fan-out example. When a user posts, do you write to all followers’ feeds immediately (fan-out-on-write) or generate the feed at read time by joining the social graph (fan-out-on-read)? This trade-off appears in Meta DE interviews constantly.
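
To make the trade-off concrete, here is a toy in-memory sketch of both strategies. It is not how Meta's feed is implemented; the class names `FanOutOnWrite` and `FanOutOnRead` and the sample graph are hypothetical, and ranking is reduced to "newest first". What it does show is where the cost lands: on-write pays per follower at post time, on-read pays per followed author at request time.

```python
from collections import defaultdict


class FanOutOnWrite:
    """Push model: each post is copied into every follower's feed."""

    def __init__(self, followers):
        self.followers = followers          # author -> set of follower ids
        self.feeds = defaultdict(list)      # user -> [(ts, post_id), ...]

    def post(self, author, post_id, ts):
        for follower in self.followers.get(author, ()):
            self.feeds[follower].append((ts, post_id))

    def read(self, user, limit=10):
        newest_first = sorted(self.feeds[user], reverse=True)
        return [post_id for _, post_id in newest_first[:limit]]


class FanOutOnRead:
    """Pull model: posts are stored once and merged at request time."""

    def __init__(self, following):
        self.following = following          # user -> set of authors
        self.posts = defaultdict(list)      # author -> [(ts, post_id), ...]

    def post(self, author, post_id, ts):
        self.posts[author].append((ts, post_id))

    def read(self, user, limit=10):
        merged = [item
                  for author in self.following.get(user, ())
                  for item in self.posts[author]]
        return [post_id for _, post_id in sorted(merged, reverse=True)[:limit]]


followers = {"alice": {"bob", "carol"}, "dave": {"bob"}}
following = {"bob": {"alice", "dave"}, "carol": {"alice"}}

push, pull = FanOutOnWrite(followers), FanOutOnRead(following)
for author, post_id, ts in [("alice", "p1", 1), ("dave", "p2", 2), ("alice", "p3", 3)]:
    push.post(author, post_id, ts)
    pull.post(author, post_id, ts)

stored_push = sum(len(feed) for feed in push.feeds.values())
stored_pull = sum(len(posts) for posts in pull.posts.values())
print(push.read("bob"), pull.read("bob"), stored_push, stored_pull)
```

Both models return the same feed, but the push model stores one entry per follower per post. That write amplification is the number to quote when an interviewer asks about accounts with millions of followers.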

Consistency at social scale: Social data is inherently eventually consistent. If your friend likes a post and you see the like count a few seconds later — that’s fine. Designing for strong consistency everywhere at Meta’s scale would be prohibitively expensive. Know when eventual consistency is acceptable.

Multi-surface data: One user generates data on Facebook, Instagram, WhatsApp, and Messenger. Conformed dimensions (especially user identity) that work across all surfaces are a core Meta DE competency.

Meta’s Data Modeling Expectations for Senior DEs

From Interview Query’s Meta guide — what interviewers specifically probe:

1. Multi-surface engagement modeling

“Design a data model to track engagement across Facebook Feed, Reels, and Stories.”

The senior answer must address:

Unified engagement event schema that works across surfaces (not three separate schemas)
Grain: one row per engagement event per surface per user per content
Conformed dim_content and dim_user across all surfaces
SCD2 on dim_user (user_tier, is_verified) — analytics must reflect historical user state
Partitioning: by event_date + surface for partition pruning
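
One way to picture the unified schema and its partitioning is as a single typed record plus a function that derives the partition path. This is an illustrative sketch only: the field names (`user_key`, `content_key`, `action`) and the Hive-style path layout are assumptions, not a schema from Meta.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class Surface(str, Enum):
    FEED = "feed"
    REELS = "reels"
    STORIES = "stories"


@dataclass(frozen=True)
class EngagementEvent:
    """One row per engagement event per surface per user per content."""
    event_id: str
    user_key: int        # surrogate key into the conformed dim_user
    content_key: int     # surrogate key into the conformed dim_content
    surface: Surface
    action: str          # e.g. "view", "like", "share"
    event_ts: datetime   # must be timezone-aware


def partition_path(event: EngagementEvent) -> str:
    """Derive an event_date + surface partition, with dates taken in UTC."""
    event_date = event.event_ts.astimezone(timezone.utc).date().isoformat()
    return f"event_date={event_date}/surface={event.surface.value}"


example = EngagementEvent(
    event_id="e1", user_key=1, content_key=2, surface=Surface.REELS,
    action="like",
    event_ts=datetime(2025, 1, 15, 23, 30, tzinfo=timezone.utc),
)
print(partition_path(example))
```

Keeping `surface` as a column in one schema, rather than one schema per surface, is what lets a query such as "Reels only, last 7 days" prune to a handful of partitions.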

2. Schema drift at Meta scale

“How would you manage schema evolution when the Reels team adds new event fields?”

Senior answer: “I’d enforce backward compatibility via a schema registry. New optional fields with defaults are automatically compatible. Reels teams push schema changes through a review process, and any breaking change requires 30-day notice and a dual-field migration period. The bronze layer uses Iceberg’s mergeSchema write option (set on the Spark writer) to absorb new columns automatically. Silver and gold models use explicit column selection so that upstream additions don’t affect them.”
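
The compatibility rule in that answer can be expressed as a small check that a review process might run before accepting a schema change. This is a sketch under assumptions: schemas are modeled as plain dicts, the function name `breaking_changes` is hypothetical, and a real schema registry applies richer rules (type promotion, nested fields) than the three shown here.

```python
def breaking_changes(old, new):
    """Return reasons why `new` is not backward compatible with `old`.

    Each schema maps field name -> {"type": str, "required": bool, "default": ...}.
    The "default" key is present only when the field has a default value.
    """
    problems = []
    for name, spec in old.items():
        if name not in new:
            problems.append(f"removed field: {name}")
        elif new[name]["type"] != spec["type"]:
            problems.append(f"type changed: {name}")
    for name, spec in new.items():
        is_new = name not in old
        if is_new and (spec.get("required", False) or "default" not in spec):
            problems.append(f"new field must be optional with a default: {name}")
    return problems


old = {
    "event_id": {"type": "string", "required": True},
    "user_key": {"type": "long", "required": True},
}
# Compatible: adds one optional field with a default.
new_ok = {**old, "watch_ms": {"type": "long", "required": False, "default": 0}}
# Breaking: changes a type, drops a field, and adds a required field.
new_bad = {
    "event_id": {"type": "int", "required": True},
    "audio_id": {"type": "string", "required": True},
}

print(breaking_changes(old, new_ok))
print(breaking_changes(old, new_bad))
```

An empty list means the change can ship without notice; anything else would go through the 30-day breaking-change process described in the answer.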

3. Handling the social graph in data models

“How do you model friend/follower relationships for analytics?”

Senior answer: “For analytics (not application serving), I’d model this as a fact_relationships table with user_id, friend_user_id, relationship_type, created_date, status. This is actually a transaction fact table at relationship-action grain. For graph traversal at serving (not analytics), Meta uses TAO — but for analytics queries like ‘retention by friend network size cohort,’ the fact table approach is sufficient and queryable with standard SQL.”
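
To show that the fact table approach really is queryable with standard SQL, here is a small sketch using Python's built-in sqlite3 module. The sample rows and the cohort boundaries ("1-2" vs "3+") are made up for illustration, and only the friend-count cohorting step is shown; joining the cohorts to a retention table would follow the same pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_relationships (
    user_id           INTEGER,
    friend_user_id    INTEGER,
    relationship_type TEXT,
    created_date      TEXT,
    status            TEXT
);
INSERT INTO fact_relationships VALUES
    (1, 2, 'friend', '2025-01-01', 'active'),
    (1, 3, 'friend', '2025-01-02', 'active'),
    (1, 4, 'friend', '2025-01-03', 'active'),
    (2, 1, 'friend', '2025-01-01', 'active'),
    (2, 3, 'friend', '2025-01-05', 'removed'),
    (3, 1, 'friend', '2025-01-02', 'active');
""")

COHORT_QUERY = """
WITH friend_counts AS (
    SELECT user_id, COUNT(*) AS friend_count
    FROM fact_relationships
    WHERE relationship_type = 'friend' AND status = 'active'
    GROUP BY user_id
)
SELECT CASE WHEN friend_count >= 3 THEN '3+' ELSE '1-2' END AS cohort,
       COUNT(*) AS users
FROM friend_counts
GROUP BY cohort
ORDER BY cohort
"""

cohorts = conn.execute(COHORT_QUERY).fetchall()
print(cohorts)
```

The removed relationship is filtered out by `status`, which is the reason to keep relationship state in the fact table instead of deleting rows.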

Meta’s Four Interview Rounds: What to Expect

Round 1: SQL / Data Manipulation (45 min)

Complex SQL on realistic Meta schemas (user_events, posts, ads_impressions)
Window functions, CTEs, complex aggregations
“Write a query to find users who engaged with Reels but not Feed this week”
Performance/scalability discussion: “How does this query scale to 100B rows?”
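
The Reels-but-not-Feed prompt has a compact answer that also sets up the scalability follow-up. The sketch below runs it on sqlite3 with made-up rows; the `user_events` columns are assumptions about what the interview schema might look like. It uses conditional aggregation in a single pass over the week's partitions instead of a self-join or NOT IN subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_events (user_id INTEGER, surface TEXT, event_date TEXT);
INSERT INTO user_events VALUES
    (1, 'reels', '2025-01-13'),
    (1, 'feed',  '2025-01-14'),
    (2, 'reels', '2025-01-15'),
    (3, 'feed',  '2025-01-15'),
    (4, 'reels', '2025-01-16'),
    (4, 'feed',  '2025-01-02');
""")

REELS_NOT_FEED = """
SELECT user_id
FROM user_events
WHERE event_date >= :week_start AND event_date < :week_end
GROUP BY user_id
HAVING SUM(CASE WHEN surface = 'reels' THEN 1 ELSE 0 END) > 0
   AND SUM(CASE WHEN surface = 'feed'  THEN 1 ELSE 0 END) = 0
ORDER BY user_id
"""

rows = conn.execute(
    REELS_NOT_FEED, {"week_start": "2025-01-13", "week_end": "2025-01-20"}
).fetchall()
reels_only = [user_id for (user_id,) in rows]
print(reels_only)
```

User 4 qualifies because their Feed activity falls outside the week. For the "100B rows" follow-up, the points to make are that the date filter prunes partitions and that the query needs one shuffle on `user_id`, not two scans joined together.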

Round 2: Data Modeling & Pipeline Design (45 min) — The most weighted round

Design a data model + ETL flow + validation approach
“Design the data infrastructure for tracking ad performance at Meta”
Excalidraw whiteboarding + SQL to validate the model
Focus: requirements first, grain declaration, partitioning, SLA

Round 3: Product Sense & Analytical Reasoning (45 min)

Open-ended product data questions
“How would you measure the success of a new Instagram feature?”
“Why might daily active users on Facebook have dropped 5% this week?”
Focus: metric definition, data interpretation, connecting data to product decisions
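
For the "DAU dropped 5%" style of question, a structured first step is to decompose the change by segment before speculating about causes. The following is a minimal sketch with invented numbers; the function name `decompose_change` and the platform segments are hypothetical, and in practice you would repeat the cut by country, app version, and surface.

```python
def decompose_change(before, after):
    """Break a metric change into per-segment contributions.

    Returns (segment, absolute_delta, share_of_prior_total) tuples,
    sorted so the largest drop comes first.
    """
    total_before = sum(before.values())
    rows = []
    for segment in sorted(set(before) | set(after)):
        delta = after.get(segment, 0) - before.get(segment, 0)
        rows.append((segment, delta, delta / total_before))
    return sorted(rows, key=lambda row: row[1])


# Hypothetical DAU by platform, in millions, for two consecutive weeks.
before = {"android": 500, "ios": 400, "web": 100}
after = {"android": 450, "ios": 400, "web": 100}

for segment, delta, share in decompose_change(before, after):
    print(f"{segment:8s} {delta:+5d}  {share:+.1%}")
```

If the whole drop sits in one segment, as it does here, the next questions are about that segment specifically (a bad release, a logging change, an outage) instead of a product-wide explanation.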

Round 4: Behavioral / Ownership (30 min)

STAR format, emphasis on ownership and impact
“Tell me about a time you designed a data system that served multiple teams”
“Describe a situation where your pipeline failed in production and how you handled it”

Interview Questions

Q1: “You’re designing a data pipeline that tracks ad performance metrics for all advertisers on Meta’s platform. How do you think about scale, and what are the key design decisions?”

Model Answer: “I’d start with the scale context: Meta serves approximately 10 million advertisers with 100 billion daily ad impressions across Facebook, Instagram, and the Audience Network. At 100B impressions/day, that’s roughly 1.2M events/sec on average (peaks run higher), each event about 500 bytes, so ~600 MB/sec raw, compressed to ~100 MB/sec as Parquet at 6x. This rules out any single-node solution.

The core data model: grain is one row per ad impression event (ad_id, campaign_id, advertiser_id, user_id_hash, timestamp, surface, placement, device_type). I’d hash user_id immediately — advertisers shouldn’t see individual user data, only aggregated signals. Fact table partitioned by event_date, clustered on campaign_id — the dominant filter pattern.

For the pipeline: Scribe/Kafka ingestion at the edge, Flink for deduplication and near-real-time aggregation into 1-minute windows, Scuba for the real-time advertiser dashboard (last 24 hours), and Presto on Hive/Iceberg tables for the historical analytics layer.
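
The deduplication and 1-minute windowing step can be sketched in plain Python to show the logic independent of any engine. This is not Flink code: the function name `aggregate_impressions` is hypothetical, timestamps are assumed to be epoch seconds, and the unbounded `seen` set stands in for what would be keyed state with a TTL in a real stream processor.

```python
from collections import Counter


def aggregate_impressions(events, window_sec=60):
    """Drop duplicate event_ids, then count impressions per campaign
    per tumbling window (keyed by the window's start timestamp)."""
    seen = set()        # in production: keyed state with expiry, not a set
    counts = Counter()
    for event in events:
        if event["event_id"] in seen:
            continue    # duplicate delivery, e.g. a client retry
        seen.add(event["event_id"])
        window_start = event["ts"] - event["ts"] % window_sec
        counts[(event["campaign_id"], window_start)] += 1
    return dict(counts)


events = [
    {"event_id": "a", "campaign_id": 7, "ts": 120},
    {"event_id": "a", "campaign_id": 7, "ts": 120},   # duplicate of "a"
    {"event_id": "b", "campaign_id": 7, "ts": 150},
    {"event_id": "c", "campaign_id": 7, "ts": 185},
    {"event_id": "d", "campaign_id": 9, "ts": 130},
]
windowed = aggregate_impressions(events)
print(windowed)
```

Deduplicating before aggregating matters because once counts are rolled up, a double-counted impression can no longer be identified or removed.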

Key trade-offs: Real-time vs accurate reporting. Advertisers see near-real-time spend in their dashboard (Scuba, 30-second freshness). Billing reconciliation uses the batch layer (6-hour delayed, fully accurate). These are explicitly two different systems serving two different SLAs: the dashboard trades a little accuracy for freshness, while billing accepts delay but cannot tolerate errors.

Privacy: All user-level aggregation happens server-side before data is accessible to advertisers. Minimum threshold (e.g., 50 unique users) before a demographic breakdown is shown — prevents individual user identification. PII stripped before data leaves the ingestion layer.”
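
The two privacy controls in that answer, pseudonymizing the user ID and suppressing small groups, can be sketched as follows. The function names, the keyed-hash choice (HMAC-SHA256), and the sample data are assumptions for illustration; the threshold of 50 comes from the example in the text. A keyed hash is used here because a plain unsalted hash of a numeric ID can be reversed by enumerating IDs.

```python
import hashlib
import hmac
from collections import defaultdict


def pseudonymize(user_id, secret: bytes) -> str:
    """Keyed hash of a user ID; the secret stays inside the ingestion layer."""
    return hmac.new(secret, str(user_id).encode(), hashlib.sha256).hexdigest()


def demographic_breakdown(impressions, min_users=50):
    """Count unique users per segment, dropping segments below the threshold.

    `impressions` is an iterable of (user_hash, segment) pairs.
    """
    users_by_segment = defaultdict(set)
    for user_hash, segment in impressions:
        users_by_segment[segment].add(user_hash)
    return {
        segment: len(users)
        for segment, users in users_by_segment.items()
        if len(users) >= min_users
    }


secret = b"demo-only-secret"   # hypothetical; real keys live in a secrets manager
impressions = (
    [(pseudonymize(uid, secret), "18-24") for uid in range(60)]
    + [(pseudonymize(uid, secret), "25-34") for uid in range(100, 110)]
)
report = demographic_breakdown(impressions)
print(report)
```

The "25-34" segment has only 10 unique users, so it is omitted from the advertiser-facing report entirely instead of being shown with a small count.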

Q2: “Meta’s news feed ranking team asks you to build a pipeline providing real-time features for their ranking model. What do you design?”

Model Answer: “This is a feature pipeline for ML serving — it spans both offline (batch training data) and online (real-time inference). I’d design it in two paths.

Offline path: Nightly Spark job processes all user engagement events from the last 90 days, computes user-level features (avg session duration, content category affinity, time-of-day engagement patterns), and writes to Hive/Iceberg as the offline feature store. Training jobs read from here.

Online path: Flink consumes from Scribe in real-time and maintains per-user stateful aggregations (events in last 30 minutes, recent content interactions, current session duration) in RocksDB-backed Flink state. Flink publishes the latest values to a RocksDB-backed online feature store, and on a feed request the serving layer reads real-time features from that store, with a p99 target < 5ms.

The hardest problem: training-serving skew. The offline features are computed with Spark; the online features are computed with Flink. Different computation engines, potential semantic differences. I’d address this by defining all feature logic as SQL and running it through a shared feature computation library — same logic, different execution engines. Regular skew detection jobs compare offline vs online feature distributions.
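
A skew detection job of the kind mentioned above needs some measure of how far the online feature distribution has drifted from the offline one. One common choice is the Population Stability Index (PSI), sketched here with the standard library only. The binning scheme, the smoothing constant, and the rule-of-thumb alert level of 0.25 are assumptions to tune per feature, not anything the source prescribes.

```python
import math


def psi(offline, online, bins=10):
    """Population Stability Index between two samples of one feature.

    Both samples are bucketed into equal-width bins over their combined
    range; 0.0 means identical binned distributions, larger means more drift.
    """
    lo = min(min(offline), min(online))
    hi = max(max(offline), max(online))
    width = (hi - lo) / bins or 1.0   # fall back to 1.0 if all values are equal

    def binned_shares(values):
        counts = [0] * bins
        for value in values:
            index = min(int((value - lo) / width), bins - 1)
            counts[index] += 1
        # Additive smoothing so empty bins do not produce log(0).
        return [(c + 0.5) / (len(values) + 0.5 * bins) for c in counts]

    p, q = binned_shares(offline), binned_shares(online)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))


offline_values = [i / 100 for i in range(100)]       # e.g. from the Spark job
online_same = list(offline_values)                   # no skew
online_shifted = [v + 0.5 for v in offline_values]   # simulated skew

print(psi(offline_values, online_same))
print(psi(offline_values, online_shifted) > 0.25)
```

Running this per feature on a schedule, and alerting when the value crosses the chosen threshold, turns "regular skew detection" from a talking point into a concrete job you can describe.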

Feature freshness tiers: Static features (age, location) are updated daily. Engagement features are updated every 5 minutes from Flink windowed aggregates. Session features are updated in real-time (sub-second) from Flink. Each tier has a different storage backend: Hive/Iceberg for daily, a real-time OLAP store (Scuba at Meta; ClickHouse or Druid elsewhere) for 5-min, RocksDB for real-time.”

Think About This

You’re preparing for the Meta data modeling round. The prompt will likely be:

“Design the data model and pipeline for tracking user engagement across Facebook Feed, Instagram Reels, and Stories. The product team needs daily retention metrics, the ML team needs real-time engagement signals, and the ads team needs cross-surface attribution.”

Before Day 32 (where we design this end-to-end), mentally sketch:

What’s the grain? (One row per engagement event per surface per user per content)
How do you unify three surfaces into one model? (Conformed dim_content with surface_type as an attribute; dim_user shared across all surfaces)
How do you serve three different consumers (product → SQL, ML → features, ads → attribution)? (Three different gold-layer projections from the same silver events table)
What’s the PII strategy? (user_id hashed at ingestion, aggregation-only for ads)
Where does real-time meet batch? (Flink for ML real-time features, Presto/Spark for product daily metrics, Scuba for ad-hoc real-time exploration)

Quick Reference: Meta-Specific

Scale baseline: 3.3B MAU, 600B+ events/day, petabyte-scale storage, ~7M events/sec on average (peaks run higher). State this upfront; don’t wait to be asked.
Tech stack: Scribe (ingestion) → Flink (stream) → Presto (interactive SQL) + Spark (batch ETL) → Hive/Iceberg (storage) → Scuba (real-time OLAP). Know these names.
Privacy first: Every design includes PII tokenization at ingestion, minimum aggregation thresholds, access control by team/role.
Fan-out trade-off: Fan-out-on-write (pre-compute feed) vs fan-out-on-read (compute at request). The usual answer is a hybrid: fan-out-on-write for typical accounts, with read-time merging for very high-follower accounts and for freshness.
Multi-surface conformed dims: dim_user and dim_content must work across Facebook, Instagram, WhatsApp. Cross-surface analysis is a key use case.
The ownership round: Meta’s behavioral interview is specifically called the “Ownership” round. Prepare stories about systems you owned end-to-end, drove impact across teams, and improved after production incidents.
Data modeling carries the most weight in the onsite loop per Exponent. Prioritize this in your Meta prep.

Tomorrow’s Preview

Day 32: Design: Meta News Feed Data Pipeline — The full system design for Meta’s most important product. Ranking data pipeline, real-time engagement signals, batch features, fan-out architecture, and the privacy-aware design that Meta specifically evaluates. This is the practice version of what you’ll see in the Meta data modeling interview round.
