Day 2/90: Functional vs non-functional requirements

Phase 1: Foundations & Frameworks | Category: System Design Methodology

Why this day matters

Yesterday you learned the 5-step framework. Today we sharpen the single most important step: requirements gathering. At your level, interviewers don’t just want you to “ask questions” — they want to see you systematically decompose an ambiguous prompt into a scoped, prioritized contract that drives every downstream decision. The difference between a mid-level and senior answer often comes down to this step alone. As System Design Handbook puts it: as your experience increases, you are expected to spend half or more of your deep-dive time on non-functional requirements.

Functional requirements: what the system does

Functional requirements are your “the system should be able to…” statements. For data engineering, these aren’t about UI features — they’re about what data the system produces, for whom, and how they access it.

The data engineering functional requirements checklist

| Question | Why it matters | Example |
| --- | --- | --- |
| Who are the consumers? | Drives serving layer choice | BI analysts → SQL warehouse; ML team → feature store; App → low-latency API |
| What entities and metrics? | Drives data model & grain | “Daily active users by country” → fact_user_activity at daily grain, dim_country |
| What access pattern? | Drives storage & indexing | Dashboard with filters → pre-aggregated tables; Ad-hoc SQL → columnar warehouse |
| What’s the source data? | Drives ingestion architecture | OLTP database → CDC; Clickstream → Kafka; Third-party API → scheduled pull |
| Is this one-off or ongoing? | Drives orchestration needs | One-time backfill vs. daily scheduled pipeline vs. continuous streaming |
| What transformations? | Drives processing complexity | Simple aggregation → SQL/dbt; Complex sessionization → Spark/Flink |

Senior-level move: Group functional requirements into P0 (must-have for MVP), P1 (important but can be phased), and P2 (nice-to-have). Say it out loud: “I’ll design for P0 first, then show how the architecture extends to P1.” This mirrors how Exponent’s framework recommends scoping — and it shows the interviewer you can ship incrementally, which is exactly how Meta, Netflix, and Google operate.

Example at the whiteboard:

Prompt: “Design a data pipeline for user engagement analytics at Netflix.”

  • P0: Ingest viewing events, compute daily/weekly engagement metrics (watch time, completion rate), serve to BI dashboards
  • P1: Real-time engagement signals for the recommendation model
  • P2: Self-serve ad-hoc exploration for data scientists

Each priority level implies a fundamentally different architecture — P0 is a batch pipeline, P1 adds a streaming layer, P2 adds a query-on-demand OLAP store. By making this explicit, you control scope instead of boiling the ocean.
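The P0/P1/P2 grouping can also be written down as plain data, which makes the MVP scope explicit and easy to reference. A minimal sketch, where the requirement strings paraphrase the Netflix example and the `scope` helper is a hypothetical name:

```python
# Prioritized functional requirements from the Netflix engagement example.
# Tier labels sort lexicographically ("P0" < "P1" < "P2"), which the helper exploits.
requirements = [
    ("P0", "Ingest viewing events"),
    ("P0", "Compute daily/weekly engagement metrics"),
    ("P0", "Serve metrics to BI dashboards"),
    ("P1", "Real-time engagement signals for the recommendation model"),
    ("P2", "Self-serve ad-hoc exploration for data scientists"),
]

def scope(reqs, max_priority="P0"):
    """Return only the requirements at or above the given priority tier."""
    return [text for prio, text in reqs if prio <= max_priority]

mvp = scope(requirements)              # P0 only: the batch pipeline
phase_two = scope(requirements, "P1")  # adds the streaming layer
```

Designing against `mvp` first, then showing how the architecture extends to `phase_two`, is exactly the out-loud scoping move described above.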

Non-functional requirements: how the system behaves

This is where senior candidates differentiate themselves. Non-functional requirements (NFRs) are the constraints and quality attributes that shape your architecture. According to System Design Handbook, ignoring NFRs is a red flag — engineers who focus only on data models and APIs but overlook scalability or fault tolerance miss what makes systems work at scale.

The 7 NFRs that matter for data engineering

1. Latency / data freshness

The single most architecture-shaping NFR. It determines batch vs. streaming vs. hybrid.

| Freshness requirement | Architecture implication |
| --- | --- |
| Real-time (< 5 sec) | Streaming: Kafka → Flink → real-time OLAP (Druid/Pinot/ClickHouse) |
| Near-real-time (1–15 min) | Micro-batch: Spark Structured Streaming, or streaming with larger windows |
| Hourly / Daily | Batch: Airflow → Spark/dbt → Warehouse (BigQuery/Redshift/Snowflake) |

Interview phrasing: “What’s the acceptable staleness of data for the primary consumer?” — not just “what’s the latency?”
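The freshness tiers above can be sketched as a small decision helper. The 5-second and 15-minute boundaries come from the table; treat them as rules of thumb, not hard cutoffs:

```python
def architecture_for_freshness(staleness_seconds: float) -> str:
    """Map acceptable data staleness (seconds) to an architecture tier."""
    if staleness_seconds < 5:
        return "streaming"    # Kafka → Flink → real-time OLAP
    if staleness_seconds <= 15 * 60:
        return "micro-batch"  # Spark Structured Streaming
    return "batch"            # Airflow → Spark/dbt → warehouse
```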

2. Throughput / scale

How much data flows through the system per unit time.

  • Ingestion throughput: Events/sec, MB/sec into the pipeline
  • Processing throughput: Records/sec transformed
  • Query throughput: QPS on the serving layer

Why it matters: 10K events/sec vs. 10M events/sec are different universes. The first can run on a single Flink TaskManager; the second needs a distributed Kafka cluster with dozens of partitions and multi-node Flink with careful state management.
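A back-of-envelope sizing helper makes the "different universes" point concrete. The ~1 KB/event and ~10 MB/s sustained throughput per Kafka partition figures are illustrative assumptions; measure your own workload:

```python
def ingestion_sizing(events_per_sec: int, event_bytes: int = 1_000,
                     partition_mb_per_sec: float = 10.0):
    """Estimate ingestion bandwidth and a rough Kafka partition count."""
    mb_per_sec = events_per_sec * event_bytes / 1_000_000
    partitions = max(1, round(mb_per_sec / partition_mb_per_sec))
    return mb_per_sec, partitions

print(ingestion_sizing(10_000))      # (10.0, 1): one partition suffices
print(ingestion_sizing(10_000_000))  # (10000.0, 1000): a large cluster
```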

3. Availability

How tolerant is the business of pipeline downtime?

  • Data pipeline availability: Can the pipeline go down for 2 hours without business impact? Or does every minute of downtime mean lost revenue (e.g., real-time ad bidding)?
  • Serving layer availability: Does the dashboard need 99.99% uptime, or is 99.9% fine?

Data engineering nuance: Pipeline availability is different from application availability. A batch pipeline that runs at 2 AM can tolerate some downtime as long as data is ready by 8 AM. A real-time fraud detection pipeline cannot tolerate any gap.

Netflix operates with a baseline expectation of multi-AZ, multi-region architectures — bringing up availability in a Netflix interview is not optional.
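To make the 99.9% vs. 99.99% question concrete, translate each SLO into a monthly downtime budget. A quick sketch, assuming a 30-day month:

```python
def monthly_downtime_minutes(availability: float, days: int = 30) -> float:
    """Downtime budget per month implied by an availability target."""
    return (1 - availability) * days * 24 * 60

print(round(monthly_downtime_minutes(0.999), 1))   # 99.9%  → 43.2 min/month
print(round(monthly_downtime_minutes(0.9999), 1))  # 99.99% → 4.3 min/month
```

The order-of-magnitude gap between those two budgets is what justifies (or rules out) multi-AZ and multi-region designs.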

4. Consistency

In data systems, this manifests as: Can different consumers see different versions of the truth at the same time?

| Pattern | Consistency level | Example |
| --- | --- | --- |
| Single source of truth warehouse | Strong | All dashboards read from same tables, same numbers |
| Lambda architecture (batch + speed) | Eventual | Real-time layer may show slightly different numbers than batch |
| Replicated serving stores | Eventual | Cross-region replicas may lag by seconds |

Interview phrasing: “Is it acceptable for the real-time dashboard to show slightly different numbers than the daily report, as long as they converge within a few hours?”
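If the answer is "eventual consistency is fine," the follow-up is how you verify convergence. A minimal reconciliation sketch, assuming a relative-tolerance check between the real-time and batch layers (the tolerance value is illustrative):

```python
def converged(realtime: float, batch: float, tolerance: float = 0.01) -> bool:
    """True if the two layers agree within a relative tolerance (default 1%)."""
    if batch == 0:
        return realtime == 0
    return abs(realtime - batch) / batch <= tolerance

# 0.5% drift between layers: acceptable; 10% drift: page someone.
```

In practice this kind of check runs as a scheduled reconciliation job that alerts when the layers fail to converge.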

5. Durability / data loss tolerance

Can you afford to lose any data?

  • Zero loss: Financial transactions, compliance data → need WAL, replication, exactly-once semantics
  • Tolerable loss: Clickstream analytics → at-least-once with deduplication is fine, occasional duplicates acceptable
  • Sampling OK: High-volume telemetry → can sample 10% and extrapolate

This directly impacts your Kafka config (acks=all vs. acks=1), your processing guarantees (exactly-once vs. at-least-once), and your storage replication factor.
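Those knobs can be written down as config. A hedged sketch of the two delivery tiers using standard Kafka producer configuration keys (`acks`, `enable.idempotence`, and `retries` are real Kafka settings; the dict names and exact values here are illustrative):

```python
# Zero-loss tier: financial transactions, compliance data.
zero_loss_producer = {
    "acks": "all",               # wait for all in-sync replicas to ack
    "enable.idempotence": True,  # no duplicates from producer retries
    "retries": 2_147_483_647,    # retry transient broker errors indefinitely
}

# Tolerable-loss tier: clickstream analytics.
tolerable_loss_producer = {
    "acks": "1",                 # leader-only ack: faster, small loss window
    "enable.idempotence": False, # dedupe downstream instead
}
```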

6. Cost

Often overlooked in interviews, but mentioning it signals senior thinking.

  • Compute cost: Spark clusters, Flink clusters, serverless pricing
  • Storage cost: Hot vs. warm vs. cold tiering
  • Query cost: Pay-per-scan (BigQuery) vs. provisioned (Redshift)
  • Network cost: Cross-region data transfer

Interview phrasing: “Before I finalize the storage layer, are there budget constraints that should influence whether I choose a provisioned warehouse like Redshift or a pay-per-query model like BigQuery?”
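The pay-per-scan math is worth doing out loud. A sketch assuming an illustrative on-demand rate of $6.25 per TiB scanned (check current BigQuery pricing before citing a number):

```python
def monthly_scan_cost(tib_scanned_per_day: float, usd_per_tib: float = 6.25,
                      days: int = 30) -> float:
    """Rough monthly bill for a pay-per-scan query model."""
    return tib_scanned_per_day * usd_per_tib * days

print(monthly_scan_cost(50))  # 50 TiB/day → $9,375/month
```

At that scale, partitioning and clustering that cut scanned bytes by 10x save real money, which is why the Google row in the company table below leads with scan-cost minimization.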

7. Maintainability / operability

How easy is it to debug, extend, and operate the system?

  • Schema evolution: Can producers add fields without breaking consumers?
  • Backfill capability: Can you reprocess 6 months of data without a heroic effort?
  • Observability: Can you answer “why is the dashboard showing stale data?” within 5 minutes?
  • Team expertise: Does the team know Flink, or would Spark Streaming be more maintainable?
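Schema evolution in particular can be checked mechanically. A minimal sketch in the spirit of schema-registry backward-compatibility rules, with fields modeled as a simplified name-to-has-default mapping (real registries compare full schemas):

```python
def backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """old/new map field name -> whether the field has a default value."""
    # No field an existing consumer reads may disappear...
    if any(name not in new_fields for name in old_fields):
        return False
    # ...and every newly added field must carry a default.
    added = set(new_fields) - set(old_fields)
    return all(new_fields[name] for name in added)
```

Running a check like this in CI is how producers "add fields without breaking consumers" in practice.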

Company-specific NFR patterns

Each of your target companies weighs NFRs differently. Knowing this helps you prioritize in the interview:

| Company | Primary NFR focus | What they want to hear |
| --- | --- | --- |
| Meta | Scale & throughput | “This handles billions of events from 3B+ users. Here’s how I’d shard/partition…” |
| Netflix | Availability & latency | “Multi-region, fault-tolerant, sub-100ms serving. Graceful degradation if a zone fails.” |
| Google | Cost & operational efficiency | “I’d use BigQuery with partitioning by event_date and clustering on user_id to minimize scan cost. Dataflow for auto-scaling stream processing.” |
| OpenAI | Durability & data quality | “Training data quality is the moat. I’d build rigorous deduplication, PII filtering, and version every dataset.” |
| Anthropic | Scalability & progressive complexity | “Start with the base case, then layer on: what changes at 10x? What changes when we add LLM reranking?” |

The requirements matrix: a practical tool

When you’re at the whiteboard, write a quick 2-column matrix. This takes 60 seconds and gives you a contract to reference throughout:

FUNCTIONAL (P0)                    NON-FUNCTIONAL
─────────────────────────────      ─────────────────────────────
• Ingest user events               • Freshness: < 5 min
• Compute daily engagement         • Scale: 500K events/sec
  metrics (watch time, completion) • Availability: 99.9% pipeline
• Serve to BI dashboards           • Consistency: eventual OK for
                                      real-time, strong for daily
FUNCTIONAL (P1)                    • Durability: at-least-once
─────────────────────────────      • Cost: optimize for storage
• Real-time signals for rec model    (multi-PB retention)
• Self-serve SQL access

This matrix becomes your anchor. Every architectural decision in Steps 3–5 should trace back to a specific line here.
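Outside the interview room, the same matrix can live as data in a design doc, so reviews and tests can reference it. A sketch with the values from the whiteboard example above (field names are hypothetical):

```python
# The whiteboard matrix as a checkable contract.
contract = {
    "functional_p0": [
        "ingest user events",
        "compute daily engagement metrics",
        "serve to BI dashboards",
    ],
    "functional_p1": [
        "real-time signals for rec model",
        "self-serve SQL access",
    ],
    "nfr": {
        "freshness_seconds": 300,          # < 5 min
        "scale_events_per_sec": 500_000,
        "pipeline_availability": 0.999,
        "delivery": "at-least-once",
    },
}
```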

Interview questions

Q1: “You’re designing a data pipeline for Google Ads click analytics. What requirements would you gather before starting?”

Model answer: “I’d start with functional requirements: What metrics do advertisers need — click-through rate, cost-per-click, conversion attribution? What’s the access pattern — self-serve dashboards, API for programmatic advertisers, or both? What’s the attribution window — last-click, multi-touch? Then non-functional: What’s the freshness SLA — do advertisers expect real-time click counts or is hourly acceptable? My guess is near-real-time for spend monitoring but daily for reporting. Scale: Google Ads processes billions of clicks daily, so I’d estimate ~100K events/sec peak. Durability is critical here — every click is tied to billing, so zero data loss with exactly-once semantics. Cost matters too — at this scale, the difference between pay-per-scan and provisioned compute is millions per year. I’d write these on the board, confirm with the interviewer, and then design the batch + streaming hybrid architecture that the requirements demand.”
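The throughput estimate in that answer is easy to sanity-check on the spot. Assuming ~5 billion clicks/day (an illustrative figure, not a published one) and peak traffic at roughly 2x the daily average:

```python
clicks_per_day = 5_000_000_000
average_per_sec = clicks_per_day / 86_400  # seconds in a day
peak_per_sec = average_per_sec * 2

print(f"{average_per_sec:,.0f}/sec avg, {peak_per_sec:,.0f}/sec peak")
# roughly 58K/sec average, 116K/sec peak: consistent with "~100K events/sec"
```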

Q2: “An interviewer says: ‘Design a real-time analytics system.’ You ask about latency and they say ‘as fast as possible.’ How do you handle this?”

Model answer: “I wouldn’t accept ‘as fast as possible’ — that’s not a requirement, it’s a wish. I’d reframe: ‘Let me propose tiers. Sub-second means we need a streaming architecture with a real-time OLAP store like Druid or ClickHouse — that’s significantly more complex and expensive. Under 5 minutes means micro-batch with Spark Streaming into a warehouse. Under an hour means simple batch. Which use case are we optimizing for? For example, if this is fraud detection, sub-second matters. If it’s executive dashboards, 5-minute staleness is likely fine.’ By offering concrete tiers tied to architecture implications, I’m showing the interviewer I understand the cost-complexity spectrum and I’m forcing a scoping decision rather than over-engineering.”

Think about this

You’re in an Anthropic interview. The prompt: “Design the data infrastructure for evaluating Claude’s response quality across millions of conversations daily.”

Walk through mentally:

  1. What are the functional requirements? (What metrics define “quality”? Who consumes the evaluation results — researchers, PMs, automated systems?)
  2. What single non-functional requirement would you ask about first, and why?
  3. How would the architecture change if they said “we need quality scores within 10 seconds of each conversation” vs. “overnight batch is fine”?

The key insight: at an AI company like Anthropic, data quality and durability are likely the top NFRs, not just latency. Every conversation is potential training signal. Losing evaluation data or having inconsistent quality scores could corrupt the feedback loop for model improvement. This is where you show you understand that NFR prioritization is domain-specific — not every system is just “make it fast and available.”

Quick reference

  • Functional requirements for DE = what data, for whom, how accessed, from what source — not UI features
  • Prioritize with P0/P1/P2 — shows you can scope and ship incrementally
  • Latency/freshness is the #1 architecture-shaping NFR — it determines batch vs. streaming vs. hybrid before anything else
  • Always quantify NFRs — “low latency” means nothing; “< 200ms p95 on the serving layer” drives real decisions
  • Write the requirements matrix on the whiteboard — it takes 60 seconds and gives you a contract to reference throughout the interview

Tomorrow’s preview

Day 3: Back-of-envelope estimation for data systems — How to quickly estimate storage, throughput, and compute needs for data pipelines. Key numbers to memorize, and how your estimates should directly drive architectural choices (not be a disconnected math exercise).