Day 2/90: Functional vs non-functional requirements

Phase 1: Foundations & Frameworks | Category: System Design Methodology

Why this day matters

Yesterday you learned the 5-step framework. Today we sharpen the single most important step: requirements gathering. At your level, interviewers don’t just want you to “ask questions” — they want to see you systematically decompose an ambiguous prompt into a scoped, prioritized contract that drives every downstream decision. The difference between a mid-level and senior answer often comes down to this step alone. As System Design Handbook puts it: as your experience increases, you are expected to spend half or more of your deep-dive time on non-functional requirements.

Functional requirements: what the system does

Functional requirements are your “the system should be able to…” statements. For data engineering, these aren’t about UI features — they’re about what data the system produces, for whom, and how they access it.

The data engineering functional requirements checklist

| Question | Why it matters | Example |
| --- | --- | --- |
| Who are the consumers? | Drives serving layer choice | BI analysts → SQL warehouse; ML team → feature store; App → low-latency API |
| What entities and metrics? | Drives data model & grain | “Daily active users by country” → fact_user_activity at daily grain, dim_country |
| What access pattern? | Drives storage & indexing | Dashboard with filters → pre-aggregated tables; Ad-hoc SQL → columnar warehouse |
| What’s the source data? | Drives ingestion architecture | OLTP database → CDC; Clickstream → Kafka; Third-party API → scheduled pull |
| Is this one-off or ongoing? | Drives orchestration needs | One-time backfill vs. daily scheduled pipeline vs. continuous streaming |
| What transformations? | Drives processing complexity | Simple aggregation → SQL/dbt; Complex sessionization → Spark/Flink |

Senior-level move: Group functional requirements into P0 (must-have for MVP), P1 (important but can be phased), and P2 (nice-to-have). Say it out loud: “I’ll design for P0 first, then show how the architecture extends to P1.” This mirrors how Exponent’s framework recommends scoping — and it shows the interviewer you can ship incrementally, which is exactly how Meta, Netflix, and Google operate.

Example at the whiteboard:

Prompt: “Design a data pipeline for user engagement analytics at Netflix.”

  • P0: Ingest viewing events, compute daily/weekly engagement metrics (watch time, completion rate), serve to BI dashboards
  • P1: Real-time engagement signals for the recommendation model
  • P2: Self-serve ad-hoc exploration for data scientists

Each priority level implies a fundamentally different architecture — P0 is a batch pipeline, P1 adds a streaming layer, P2 adds a query-on-demand OLAP store. By making this explicit, you control scope instead of boiling the ocean.
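The P0/P1/P2 grouping can also be written down as plain data, which makes the MVP scope explicit and easy to reference. A minimal sketch, where the requirement strings paraphrase the Netflix example and the `scope` helper is a hypothetical name:

```python
# Prioritized functional requirements from the Netflix engagement example.
# Tier labels sort lexicographically ("P0" < "P1" < "P2"), which the helper exploits.
requirements = [
    ("P0", "Ingest viewing events"),
    ("P0", "Compute daily/weekly engagement metrics"),
    ("P0", "Serve metrics to BI dashboards"),
    ("P1", "Real-time engagement signals for the recommendation model"),
    ("P2", "Self-serve ad-hoc exploration for data scientists"),
]

def scope(reqs, max_priority="P0"):
    """Return only the requirements at or above the given priority tier."""
    return [text for prio, text in reqs if prio <= max_priority]

mvp = scope(requirements)              # P0 only: the batch pipeline
phase_two = scope(requirements, "P1")  # adds the streaming layer
```

Designing against `mvp` first, then showing how the architecture extends to `phase_two`, is exactly the out-loud scoping move described above.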

Non-functional requirements: how the system behaves

This is where senior candidates differentiate themselves. Non-functional requirements (NFRs) are the constraints and quality attributes that shape your architecture. According to System Design Handbook, ignoring NFRs is a red flag — engineers who focus only on data models and APIs but overlook scalability or fault tolerance miss what makes systems work at scale.

The 7 NFRs that matter for data engineering

1. Latency / data freshness

The single most architecture-shaping NFR. It determines batch vs. streaming vs. hybrid.

| Freshness requirement | Architecture implication |
| --- | --- |
| Real-time (< 5 sec) | Streaming: Kafka → Flink → real-time OLAP (Druid/Pinot/ClickHouse) |
| Near-real-time (1–15 min) | Micro-batch: Spark Structured Streaming, or streaming with larger windows |
| Hourly / Daily | Batch: Airflow → Spark/dbt → Warehouse (BigQuery/Redshift/Snowflake) |

Interview phrasing: “What’s the acceptable staleness of data for the primary consumer?” — not just “what’s the latency?”
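The freshness tiers above can be sketched as a small decision helper. The 5-second and 15-minute boundaries come from the table; treat them as rules of thumb, not hard cutoffs:

```python
def architecture_for_freshness(staleness_seconds: float) -> str:
    """Map acceptable data staleness (seconds) to an architecture tier."""
    if staleness_seconds < 5:
        return "streaming"    # Kafka → Flink → real-time OLAP
    if staleness_seconds <= 15 * 60:
        return "micro-batch"  # Spark Structured Streaming
    return "batch"            # Airflow → Spark/dbt → warehouse
```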

2. Throughput / scale

How much data flows through the system per unit time.

  • Ingestion throughput: Events/sec, MB/sec into the pipeline
  • Processing throughput: Records/sec transformed
  • Query throughput: QPS on the serving layer

Why it matters: 10K events/sec vs. 10M events/sec are different universes. The first can run on a single Flink TaskManager; the second needs a distributed Kafka cluster with dozens of partitions and multi-node Flink with careful state management.
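A back-of-envelope sizing helper makes the "different universes" point concrete. The ~1 KB/event and ~10 MB/s sustained throughput per Kafka partition figures are illustrative assumptions; measure your own workload:

```python
def ingestion_sizing(events_per_sec: int, event_bytes: int = 1_000,
                     partition_mb_per_sec: float = 10.0):
    """Estimate ingestion bandwidth and a rough Kafka partition count."""
    mb_per_sec = events_per_sec * event_bytes / 1_000_000
    partitions = max(1, round(mb_per_sec / partition_mb_per_sec))
    return mb_per_sec, partitions

print(ingestion_sizing(10_000))      # (10.0, 1): one partition suffices
print(ingestion_sizing(10_000_000))  # (10000.0, 1000): a large cluster
```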

3. Availability

How tolerant is the business of pipeline downtime?

  • Data pipeline availability: Can the pipeline go down for 2 hours without business impact? Or does every minute of downtime mean lost revenue (e.g., real-time ad bidding)?
  • Serving layer availability: Does the dashboard need 99.99% uptime, or is 99.9% fine?

Data engineering nuance: Pipeline availability is different from application availability. A batch pipeline that runs at 2 AM can tolerate some downtime as long as data is ready by 8 AM. A real-time fraud detection pipeline cannot tolerate any gap.

Netflix operates with a baseline expectation of multi-AZ, multi-region architectures — bringing up availability in a Netflix interview is not optional.
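To make the 99.9% vs. 99.99% question concrete, translate each SLO into a monthly downtime budget. A quick sketch, assuming a 30-day month:

```python
def monthly_downtime_minutes(availability: float, days: int = 30) -> float:
    """Downtime budget per month implied by an availability target."""
    return (1 - availability) * days * 24 * 60

print(round(monthly_downtime_minutes(0.999), 1))   # 99.9%  → 43.2 min/month
print(round(monthly_downtime_minutes(0.9999), 1))  # 99.99% → 4.3 min/month
```

The order-of-magnitude gap between those two budgets is what justifies (or rules out) multi-AZ and multi-region designs.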

4. Consistency

In data systems, this manifests as: Can different consumers see different versions of the truth at the same time?

| Pattern | Consistency level | Example |
| --- | --- | --- |
| Single source of truth warehouse | Strong | All dashboards read from same tables, same numbers |
| Lambda architecture (batch + speed) | Eventual | Real-time layer may show slightly different numbers than batch |
| Replicated serving stores | Eventual | Cross-region replicas may lag by seconds |

Interview phrasing: “Is it acceptable for the real-time dashboard to show slightly different numbers than the daily report, as long as they converge within a few hours?”
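If the answer is "eventual consistency is fine," the follow-up is how you verify convergence. A minimal reconciliation sketch, assuming a relative-tolerance check between the real-time and batch layers (the tolerance value is illustrative):

```python
def converged(realtime: float, batch: float, tolerance: float = 0.01) -> bool:
    """True if the two layers agree within a relative tolerance (default 1%)."""
    if batch == 0:
        return realtime == 0
    return abs(realtime - batch) / batch <= tolerance

# 0.5% drift between layers: acceptable; 10% drift: page someone.
```

In practice this kind of check runs as a scheduled reconciliation job that alerts when the layers fail to converge.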

5. Durability / data loss tolerance

Can you afford to lose any data?

  • Zero loss: Financial transactions, compliance data → need WAL, replication, exactly-once semantics
  • Tolerable loss: Clickstream analytics → at-least-once with deduplication is fine, occasional duplicates acceptable
  • Sampling OK: High-volume telemetry → can sample 10% and extrapolate

This directly impacts your Kafka config (acks=all vs. acks=1), your processing guarantees (exactly-once vs. at-least-once), and your storage replication factor.
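Those knobs can be written down as config. A hedged sketch of the two delivery tiers using standard Kafka producer configuration keys (`acks`, `enable.idempotence`, and `retries` are real Kafka settings; the dict names and exact values here are illustrative):

```python
# Zero-loss tier: financial transactions, compliance data.
zero_loss_producer = {
    "acks": "all",               # wait for all in-sync replicas to ack
    "enable.idempotence": True,  # no duplicates from producer retries
    "retries": 2_147_483_647,    # retry transient broker errors indefinitely
}

# Tolerable-loss tier: clickstream analytics.
tolerable_loss_producer = {
    "acks": "1",                 # leader-only ack: faster, small loss window
    "enable.idempotence": False, # dedupe downstream instead
}
```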

6. Cost

Often overlooked in interviews, but mentioning it signals senior thinking.

  • Compute cost: Spark clusters, Flink clusters, serverless pricing
  • Storage cost: Hot vs. warm vs. cold tiering
  • Query cost: Pay-per-scan (BigQuery) vs. provisioned (Redshift)
  • Network cost: Cross-region data transfer

Interview phrasing: “Before I finalize the storage layer, are there budget constraints that should influence whether I choose a provisioned warehouse like Redshift or a pay-per-query model like BigQuery?”
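The pay-per-scan math is worth doing out loud. A sketch assuming an illustrative on-demand rate of $6.25 per TiB scanned (check current BigQuery pricing before citing a number):

```python
def monthly_scan_cost(tib_scanned_per_day: float, usd_per_tib: float = 6.25,
                      days: int = 30) -> float:
    """Rough monthly bill for a pay-per-scan query model."""
    return tib_scanned_per_day * usd_per_tib * days

print(monthly_scan_cost(50))  # 50 TiB/day → $9,375/month
```

At that scale, partitioning and clustering that cut scanned bytes by 10x save real money, which is why the Google row in the company table below leads with scan-cost minimization.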

7. Maintainability / operability

How easy is it to debug, extend, and operate the system?

  • Schema evolution: Can producers add fields without breaking consumers?
  • Backfill capability: Can you reprocess 6 months of data without a heroic effort?
  • Observability: Can you answer “why is the dashboard showing stale data?” within 5 minutes?
  • Team expertise: Does the team know Flink, or would Spark Streaming be more maintainable?
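Schema evolution in particular can be checked mechanically. A minimal sketch in the spirit of schema-registry backward-compatibility rules, with fields modeled as a simplified name-to-has-default mapping (real registries compare full schemas):

```python
def backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """old/new map field name -> whether the field has a default value."""
    # No field an existing consumer reads may disappear...
    if any(name not in new_fields for name in old_fields):
        return False
    # ...and every newly added field must carry a default.
    added = set(new_fields) - set(old_fields)
    return all(new_fields[name] for name in added)
```

Running a check like this in CI is how producers "add fields without breaking consumers" in practice.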

Company-specific NFR patterns

Each of your target companies weighs NFRs differently. Knowing this helps you prioritize in the interview:

| Company | Primary NFR focus | What they want to hear |
| --- | --- | --- |
| Meta | Scale & throughput | “This handles billions of events from 3B+ users. Here’s how I’d shard/partition…” |
| Netflix | Availability & latency | “Multi-region, fault-tolerant, sub-100ms serving. Graceful degradation if a zone fails.” |
| Google | Cost & operational efficiency | “I’d use BigQuery with partitioning by event_date and clustering on user_id to minimize scan cost. Dataflow for auto-scaling stream processing.” |
| OpenAI | Durability & data quality | “Training data quality is the moat. I’d build rigorous deduplication, PII filtering, and version every dataset.” |
| Anthropic | Scalability & progressive complexity | “Start with the base case, then layer on: what changes at 10x? What changes when we add LLM reranking?” |

The requirements matrix: a practical tool

When you’re at the whiteboard, write a quick 2-column matrix. This takes 60 seconds and gives you a contract to reference throughout:

FUNCTIONAL (P0)                    NON-FUNCTIONAL
─────────────────────────────      ─────────────────────────────
• Ingest user events               • Freshness: < 5 min
• Compute daily engagement         • Scale: 500K events/sec
  metrics (watch time, completion) • Availability: 99.9% pipeline
• Serve to BI dashboards           • Consistency: eventual OK for
                                      real-time, strong for daily
FUNCTIONAL (P1)                    • Durability: at-least-once
─────────────────────────────      • Cost: optimize for storage
• Real-time signals for rec model    (multi-PB retention)
• Self-serve SQL access

This matrix becomes your anchor. Every architectural decision in Steps 3–5 should trace back to a specific line here.
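Outside the interview room, the same matrix can live as data in a design doc, so reviews and tests can reference it. A sketch with the values from the whiteboard example above (field names are hypothetical):

```python
# The whiteboard matrix as a checkable contract.
contract = {
    "functional_p0": [
        "ingest user events",
        "compute daily engagement metrics",
        "serve to BI dashboards",
    ],
    "functional_p1": [
        "real-time signals for rec model",
        "self-serve SQL access",
    ],
    "nfr": {
        "freshness_seconds": 300,          # < 5 min
        "scale_events_per_sec": 500_000,
        "pipeline_availability": 0.999,
        "delivery": "at-least-once",
    },
}
```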

Interview questions

Q1: “You’re designing a data pipeline for Google Ads click analytics. What requirements would you gather before starting?”

Model answer: “I’d start with functional requirements: What metrics do advertisers need — click-through rate, cost-per-click, conversion attribution? What’s the access pattern — self-serve dashboards, API for programmatic advertisers, or both? What’s the attribution window — last-click, multi-touch? Then non-functional: What’s the freshness SLA — do advertisers expect real-time click counts or is hourly acceptable? My guess is near-real-time for spend monitoring but daily for reporting. Scale: Google Ads processes billions of clicks daily, so I’d estimate ~100K events/sec peak. Durability is critical here — every click is tied to billing, so zero data loss with exactly-once semantics. Cost matters too — at this scale, the difference between pay-per-scan and provisioned compute is millions per year. I’d write these on the board, confirm with the interviewer, and then design the batch + streaming hybrid architecture that the requirements demand.”
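The throughput estimate in that answer is easy to sanity-check on the spot. Assuming ~5 billion clicks/day (an illustrative figure, not a published one) and peak traffic at roughly 2x the daily average:

```python
clicks_per_day = 5_000_000_000
average_per_sec = clicks_per_day / 86_400  # seconds in a day
peak_per_sec = average_per_sec * 2

print(f"{average_per_sec:,.0f}/sec avg, {peak_per_sec:,.0f}/sec peak")
# roughly 58K/sec average, 116K/sec peak: consistent with "~100K events/sec"
```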

Q2: “An interviewer says: ‘Design a real-time analytics system.’ You ask about latency and they say ‘as fast as possible.’ How do you handle this?”

Model answer: “I wouldn’t accept ‘as fast as possible’ — that’s not a requirement, it’s a wish. I’d reframe: ‘Let me propose tiers. Sub-second means we need a streaming architecture with a real-time OLAP store like Druid or ClickHouse — that’s significantly more complex and expensive. Under 5 minutes means micro-batch with Spark Streaming into a warehouse. Under an hour means simple batch. Which use case are we optimizing for? For example, if this is fraud detection, sub-second matters. If it’s executive dashboards, 5-minute staleness is likely fine.’ By offering concrete tiers tied to architecture implications, I’m showing the interviewer I understand the cost-complexity spectrum and I’m forcing a scoping decision rather than over-engineering.”

Think about this

You’re in an Anthropic interview. The prompt: “Design the data infrastructure for evaluating Claude’s response quality across millions of conversations daily.”

Walk through mentally:

  1. What are the functional requirements? (What metrics define “quality”? Who consumes the evaluation results — researchers, PMs, automated systems?)
  2. What single non-functional requirement would you ask about first, and why?
  3. How would the architecture change if they said “we need quality scores within 10 seconds of each conversation” vs. “overnight batch is fine”?

The key insight: at an AI company like Anthropic, data quality and durability are likely the top NFRs, not just latency. Every conversation is potential training signal. Losing evaluation data or having inconsistent quality scores could corrupt the feedback loop for model improvement. This is where you show you understand that NFR prioritization is domain-specific — not every system is just “make it fast and available.”

Quick reference

  • Functional requirements for DE = what data, for whom, how accessed, from what source — not UI features
  • Prioritize with P0/P1/P2 — shows you can scope and ship incrementally
  • Latency/freshness is the #1 architecture-shaping NFR — it determines batch vs. streaming vs. hybrid before anything else
  • Always quantify NFRs — “low latency” means nothing; “< 200ms p95 on the serving layer” drives real decisions
  • Write the requirements matrix on the whiteboard — it takes 60 seconds and gives you a contract to reference throughout the interview

Tomorrow’s preview

Day 3: Back-of-envelope estimation for data systems — How to quickly estimate storage, throughput, and compute needs for data pipelines. Key numbers to memorize, and how your estimates should directly drive architectural choices (not be a disconnected math exercise).