The 5-step data engineering system design framework

Step 1: Clarify requirements and scope (5–8 minutes)

This is the step most senior candidates still underperform on. You’re not just “asking questions” — you’re demonstrating that you think about systems the way a principal engineer does: from the consumer backward.

Functional requirements — work backward from the consumer

  • Who consumes this data? (BI analysts, ML models, applications, external partners, GenAI/RAG systems)
  • What are the access patterns? (Dashboards, SQL queries, APIs, feature serving)
  • What business entities and metrics matter? (Define grain early)
  • What’s the core use case vs. nice-to-have? (Assign P0/P1/P2)

Non-functional requirements — the senior differentiator

  • Latency SLA: Does the consumer need data in seconds (real-time), minutes (near-real-time), or hours (batch)? This single answer reshapes your entire architecture.
  • Throughput: Events/sec for ingestion, QPS for serving
  • Availability vs. consistency: Can we tolerate eventual consistency, or do we need strong consistency?
  • Data freshness: What staleness is acceptable?
  • Durability: Can we afford to lose any data?
  • Scale: Current volume and expected growth (10x in 2 years?)
  • Cost constraints: Build vs. buy, cloud budget

Senior-level move: State assumptions explicitly. “I’ll assume we need sub-minute freshness for the fraud detection use case, and daily batch is fine for the reporting layer. I’ll design for both.” This shows you can hold two architectures in your head simultaneously — which is exactly what companies like Netflix expect with their emphasis on ownership and trade-off fluency.

Outcome: A written list on the whiteboard of functional requirements, non-functional requirements, and explicit assumptions. This is your contract for the rest of the interview.

Step 2: Estimate scale and capacity (3–5 minutes)

Back-of-envelope math isn’t about precision — it’s about proving your numbers drive your architecture, not the other way around.

Key estimates for data engineering

  • Ingestion volume: events/sec × avg event size = MB/sec ingest rate
  • Storage growth: daily ingest × retention period × replication factor
  • Read QPS: how many queries/sec hit the serving layer
  • Write vs. read ratio: write-heavy (event logging) vs. read-heavy (dashboards)

Numbers worth memorizing

Metric                Value
1 day                 ~86,400 seconds (~100K)
1 million req/day     ~12 req/sec
1 billion req/day     ~12,000 req/sec
1 KB × 1 billion      ~1 TB
1 MB × 1 million      ~1 TB
1 Gbps network link   ~125 MB/sec
SSD sequential read   ~500 MB/sec
HDD sequential read   ~100 MB/sec

Senior-level move: Tie estimates to architectural decisions. “At 500K events/sec with ~100-byte events, that’s ~50 MB/sec, over 4 TB/day of raw ingestion. That rules out a single-node solution — we need distributed stream processing like Flink or Spark Streaming, with Kafka as the buffer.” This is what Google interviewers look for — showing that your numbers directly inform your component choices.
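The estimation formulas above can be sketched as a small calculator. The event size, retention window, and replication factor are illustrative assumptions, not real workload numbers:

```python
# Back-of-envelope capacity math for Step 2.
# All inputs are illustrative assumptions, not measured workload numbers.

def ingest_rate_mb_per_sec(events_per_sec: int, avg_event_bytes: int) -> float:
    """events/sec x avg event size = MB/sec ingest rate."""
    return events_per_sec * avg_event_bytes / 1_000_000

def storage_growth_tb(daily_ingest_gb: float, retention_days: int,
                      replication: int) -> float:
    """daily ingest x retention period x replication factor, in TB."""
    return daily_ingest_gb * retention_days * replication / 1_000

# Example: 500K events/sec at an assumed ~100 bytes per event.
mb_s = ingest_rate_mb_per_sec(500_000, 100)   # 50.0 MB/sec
gb_per_hour = mb_s * 3600 / 1_000             # 180 GB/hour
daily_gb = gb_per_hour * 24                   # 4,320 GB/day (~4.3 TB)
print(f"Ingest: {mb_s:.0f} MB/sec, ~{gb_per_hour:.0f} GB/hour")
print(f"90-day storage at 3x replication: "
      f"{storage_growth_tb(daily_gb, 90, 3):,.0f} TB")
```

Doing this arithmetic out loud is the point: the interviewer should see the numbers fall out of the formulas, then see the architecture fall out of the numbers.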

Step 3: Define the data model and API contracts (5–8 minutes)

For data engineering specifically, this step is more important than in generic SWE system design. Your data model IS your product.

Data model approach

  1. Identify core entities (users, events, products, sessions)
  2. Define the grain of your fact tables (one row = one event? one session? one daily aggregate?)
  3. Sketch fact and dimension tables or event schemas
  4. Specify keys: natural vs. surrogate, partition keys, sort keys
  5. Note SCD strategy if dimensions change over time
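As one concrete illustration of point 5, here is a minimal SCD Type 2 update: a dimension change closes the current row and opens a new version, so facts can join to the attribute values that were valid at event time. The table and field names (`user_sk`, `plan`) are hypothetical:

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class DimUser:
    user_sk: int              # surrogate key, one per version
    user_id: str              # natural key
    plan: str                 # slowly changing attribute
    valid_from: date
    valid_to: Optional[date]  # None marks the current version

def apply_scd2_change(rows, user_id, new_plan, change_date, next_sk):
    """Close the current row for user_id and append a new version."""
    out = []
    for row in rows:
        if row.user_id == user_id and row.valid_to is None:
            out.append(replace(row, valid_to=change_date))  # close old version
        else:
            out.append(row)
    out.append(DimUser(next_sk, user_id, new_plan, change_date, None))
    return out

dim = [DimUser(1, "u42", "free", date(2024, 1, 1), None)]
dim = apply_scd2_change(dim, "u42", "premium", date(2024, 6, 1), next_sk=2)
# dim now holds two versions: the closed "free" row and the current "premium" row.
```

In the interview you would not write this out, but naming the mechanism (surrogate key per version, validity window, current-row flag) signals you have actually operated such tables.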

For streaming systems: Define the event schema (Avro/Protobuf with schema registry) and the topic/partition key strategy.
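A sketch of what that might look like: an Avro-style schema expressed as a plain dict (in practice it would be registered in a schema registry), plus a partition-key function that keeps all of one user's events in the same partition to preserve per-user ordering. Field names and the partition count are hypothetical:

```python
import hashlib

# Avro-style event schema as a plain dict; the real schema would be
# registered with a schema registry. Field names are hypothetical.
ENGAGEMENT_EVENT_SCHEMA = {
    "type": "record",
    "name": "EngagementEvent",
    "fields": [
        {"name": "event_id", "type": "string"},
        {"name": "user_id",  "type": "string"},
        {"name": "action",   "type": "string"},
        {"name": "event_ts", "type": {"type": "long",
                                      "logicalType": "timestamp-millis"}},
    ],
}

def partition_for(user_id: str, num_partitions: int = 64) -> int:
    """Hash the key so every event for one user lands in the same
    partition, preserving per-user ordering for downstream consumers."""
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The mapping is deterministic: the same user always hits the same partition.
assert partition_for("u42") == partition_for("u42")
```

The design point worth saying aloud: choosing `user_id` as the partition key buys ordering per user but risks hot partitions if a few users dominate traffic.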

API contracts (if there’s a serving layer)

  • Define read APIs: What queries does the consumer run?
  • Define write APIs: How does data enter the system?
  • REST vs. gRPC vs. GraphQL — justify your choice

Medallion / 3-hop architecture (commonly expected at interviews, per Start Data Engineering)

  • Bronze: Raw ingestion, schema-on-read, append-only
  • Silver: Cleaned, typed, deduplicated → facts and dimensions
  • Gold: Aggregated, business-metric tables → served to consumers

Senior-level move: Explicitly connect your data model to the query patterns from Step 1. “Since the primary access pattern is 7-day retention by experience, I’ll partition the fact table by event_date and cluster on experience_id. This minimizes scan cost in BigQuery and aligns with the BI team’s daily dashboard refresh.”
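The scan-cost intuition behind that answer can be made concrete with a rough calculation (partition sizes are invented for illustration; BigQuery bills by bytes scanned, and date partitioning lets a 7-day query read 7 partitions instead of the whole table):

```python
# Rough scan-cost illustration for a date-partitioned fact table.
# Sizes are invented for illustration only.

TOTAL_DAYS = 365
GB_PER_DAY = 200  # hypothetical size of one daily partition

def scanned_gb(days_queried: int, partitioned: bool) -> int:
    """GB scanned by a query touching `days_queried` days of data."""
    if partitioned:
        return days_queried * GB_PER_DAY  # prune to matching partitions
    return TOTAL_DAYS * GB_PER_DAY        # full table scan

full = scanned_gb(7, partitioned=False)   # 73,000 GB
pruned = scanned_gb(7, partitioned=True)  # 1,400 GB
print(f"7-day query: {full:,} GB unpartitioned vs {pruned:,} GB partitioned")
```

A 50x reduction in scanned bytes, from one schema decision, is exactly the kind of requirement-to-design link interviewers want quantified.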

Step 4: High-level architecture (10–12 minutes)

Now you draw. The architecture should flow left-to-right: sources → ingestion → processing → storage → serving → consumers.

Standard data engineering building blocks

  • Ingestion: Sqoop, APIs, SFTP, CDC (Debezium/DMS) for batch; Kafka, Kinesis, Pub/Sub for streaming
  • Processing: Spark, dbt, Dataflow (Beam) for batch; Flink, Spark Streaming, Dataflow for streaming
  • Storage: data warehouse (BigQuery, Redshift, Snowflake) or data lake (S3/GCS + Iceberg/Delta) for batch; real-time OLAP (Druid, Pinot, ClickHouse), Redis, or DynamoDB for streaming
  • Orchestration: Airflow, Dagster, Prefect, Step Functions for batch; typically none for streaming (long-running jobs manage their own scheduling)
  • Serving: SQL interfaces, BI tools, APIs for batch; low-latency APIs and feature stores for streaming

For each component, briefly state:

  1. What it does
  2. Why you chose it over alternatives
  3. How it connects to adjacent components

Senior-level move: Don’t just draw boxes — describe the data flow for both the write path and the read path separately. “On the write path, events flow from the app through Kafka, get processed by Flink for sessionization, and land in Iceberg tables on S3. On the read path, analysts query through Trino, while the ML team reads features from the feature store backed by Redis.”

At Meta scale, interviewers expect you to address billions of users and petabytes of data with specific strategies — sharding, caching layers, CDNs, event-driven architectures — not generic “add more servers” answers (Exponent).

Step 5: Deep-dive and trade-offs (10–15 minutes)

This is where senior candidates win or lose. The interviewer will either pick a component to drill into, or you should proactively offer: “I’d like to deep-dive into the stream processing layer — would that be interesting to you?”

What to deep-dive on

  • Failure handling: What happens when Kafka goes down? When a Flink job crashes? When a source sends bad data?
  • Exactly-once vs. at-least-once: Do you need it? How do you achieve it? What’s the cost?
  • Backpressure: What happens when ingestion outpaces processing?
  • Data quality: Where do you validate? What happens when validation fails?
  • Scalability: How does this handle 10x growth? Where’s the bottleneck?
  • Cost: What’s the most expensive component and how would you optimize it?
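One way to frame the exactly-once discussion: most real systems settle for at-least-once delivery plus idempotent processing, which yields exactly-once results. A minimal sketch of downstream dedup by event ID (a keyed-state pattern; in production the seen-set would live in Flink keyed state or be an upsert/merge key in the sink, with a TTL to bound memory):

```python
# At-least-once delivery means duplicates are possible; deduplicating by
# event_id makes the end result exactly-once-equivalent. In production
# the seen-set would be Flink keyed state or an upsert key in the sink,
# bounded with a TTL rather than an unbounded Python set.

def dedupe(events):
    """Drop events whose event_id has already been processed."""
    seen = set()
    out = []
    for event in events:
        if event["event_id"] in seen:
            continue  # duplicate from a redelivery; skip it
        seen.add(event["event_id"])
        out.append(event)
    return out

# A retried batch redelivers event "e2":
batch = [
    {"event_id": "e1", "value": 10},
    {"event_id": "e2", "value": 20},
    {"event_id": "e2", "value": 20},  # duplicate
]
assert [e["event_id"] for e in dedupe(batch)] == ["e1", "e2"]
```

Being able to name the cost (state storage plus the TTL trade-off on how long you can detect duplicates) is what separates "we need exactly-once" from actually knowing how to get it.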

Trade-off articulation framework

For every decision, use this pattern:

I chose X over Y because [reason tied to requirements]. The trade-off is [downside of X], but that’s acceptable because [justification from Step 1 requirements].

Example: “I chose Flink over Spark Streaming because our sub-second latency requirement rules out micro-batch. The trade-off is operational complexity — Flink’s checkpointing and state management require more expertise — but given that this is a fraud detection pipeline where milliseconds matter, it’s the right call.”

Senior-level move: Proactively mention what you’d do differently at different scales. “At our current scale of 100K events/sec, this design works well. If we hit 10M events/sec, I’d introduce a tiered architecture with edge pre-aggregation before the events hit the central Kafka cluster.”
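The pre-aggregation idea can be sketched in a few lines: instead of shipping every raw event, an edge node collapses each window into per-key counts, shrinking the central ingest rate in proportion to events-per-key. The window contents and grouping key (surface, action) are hypothetical:

```python
from collections import Counter

# Edge pre-aggregation sketch: collapse raw events within a window into
# (key, count) records before they reach the central Kafka cluster.
# The grouping key (surface, action) is hypothetical.

def pre_aggregate(window_events):
    """Reduce one window of raw events to per-key counts."""
    counts = Counter((e["surface"], e["action"]) for e in window_events)
    return [{"surface": s, "action": a, "count": c}
            for (s, a), c in counts.items()]

window = [
    {"surface": "feed",  "action": "view"},
    {"surface": "feed",  "action": "view"},
    {"surface": "reels", "action": "like"},
]
rollup = pre_aggregate(window)
# 3 raw events collapse to 2 aggregate records; at scale the reduction
# grows with the number of duplicate keys per window.
```

The trade-off to state explicitly: pre-aggregation loses event-level detail at the center, so it only works for metrics whose grain Step 3 already fixed at the aggregate level.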

Time allocation summary

Step                          45 min    60 min    What the interviewer evaluates
1. Requirements               5 min     8 min     Can you scope ambiguity?
2. Estimation                 3 min     5 min     Do numbers drive your decisions?
3. Data model and APIs        5 min     8 min     Do you think data-first?
4. High-level design          10 min    12 min    Can you architect end-to-end?
5. Deep-dive and trade-offs   10 min    15 min    Can you reason under pressure?
Buffer / Q&A                  2 min     2 min

Interview questions

Q1: “You’re asked to design the data infrastructure for a new product at Meta that tracks user engagement across Reels, Stories, and Feed. Where do you start?”

Model answer: “Before drawing anything, I’d clarify: Who consumes this data — product managers via dashboards, ML models for recommendations, or both? What’s the latency requirement — do we need real-time engagement signals for ranking, or is daily aggregate reporting sufficient? What’s the scale — how many events/sec across all three surfaces? Once I have those answers, I’d define the core entities: engagement events (view, like, share, comment) as the fact table with user, content, and surface as dimensions. The grain would be one row per engagement event. For the architecture, given Meta’s scale of billions of daily events, I’d propose Kafka for ingestion, Flink for real-time sessionization and metric computation, writing to both a real-time serving layer (Scuba-like for operational dashboards) and a batch warehouse (Hive/Presto for deep analytics). The key trade-off is maintaining two serving layers — it increases operational cost but lets us serve sub-second real-time dashboards AND complex ad-hoc analytical queries.”

Q2: “Walk me through how you’d structure a 60-minute system design interview if you were the candidate.”

Model answer: “I’d spend the first 8 minutes on requirements — functional, non-functional, and explicit assumptions written on the board. Then 5 minutes on back-of-envelope estimation to anchor the design. Next, 8 minutes on the data model and API contracts — this is where I define what the system produces before how it produces it. Then 12 minutes on high-level architecture, walking through write and read paths. The remaining 15 minutes I’d spend on deep-dives — either where the interviewer directs or on the most complex component. I’d leave 2 minutes to recap. The key principle is: requirements before architecture, data model before pipeline, and trade-offs throughout.”

Think about this

Imagine you’re in a Netflix interview. The prompt is: “Design a fault-tolerant video streaming analytics system.” You have 60 minutes. Before reading further, mentally walk through:

  1. What are the first 3 clarifying questions you’d ask?
  2. What non-functional requirement would fundamentally change your architecture if the answer were different?
  3. What would be your write path vs. read path?

The answer to #2 is latency. If analytics need to be available in under 5 seconds (for real-time abuse detection), you’re building a streaming system with Flink and a real-time OLAP store. If next-morning is fine (for content performance reporting), you’re building a batch pipeline with Spark and a warehouse. The entire architecture pivots on that single question — which is why Step 1 is the most important step.

Quick reference

  • Always start with requirements, never with architecture. The first 5–8 minutes determine whether you solve the right problem.
  • Numbers drive decisions. If you can’t estimate the scale, you can’t justify your component choices.
  • Data model before pipeline. Define what you produce before how you produce it. For data engineers, the data model IS the product.
  • Articulate trade-offs with the pattern: “I chose X over Y because [requirement]. The downside is [cost], which is acceptable because [justification].”
  • Senior expectation: You own the conversation. Don’t wait for the interviewer to steer you. Proactively suggest deep-dive areas and check in: “Should I go deeper here, or move to the next component?”

Tomorrow’s preview

Day 2: Functional vs non-functional requirements — We’ll go deep on how to extract and prioritize requirements for data-intensive systems, with specific requirement patterns for each of your target companies (Meta’s scale requirements, Netflix’s availability obsession, Google’s GCP-native constraints, OpenAI/Anthropic’s AI infrastructure needs).