Day 3/90: Back-of-envelope estimation for data systems

Phase 1: Foundations & Frameworks | Category: System Design Methodology

Why this matters at senior level

Estimation isn’t a math exercise — it’s how you prove your architecture is grounded in reality. As System Design Handbook puts it: “Strong candidates naturally anchor their designs in numbers. They talk about traffic before databases, capacity before sharding, and growth before optimization.” At your level, interviewers expect your numbers to directly drive component choices — not live in a disconnected calculation on the side of the whiteboard.

The key mindset shift: you’re not trying to be precise. You’re trying to land in the right order of magnitude so your architecture makes sense. Being off by 2x is fine. Being off by 100x means you designed the wrong system.

The numbers you must memorize

Powers of two — your unit converter

| Power of 2 | Approx value | Bytes |
|---|---|---|
| 10 | 1 thousand | 1 KB |
| 20 | 1 million | 1 MB |
| 30 | 1 billion | 1 GB |
| 40 | 1 trillion | 1 TB |
| 50 | 1 quadrillion | 1 PB |

The one shortcut that matters: 1 KB × 1 billion = 1 TB. This single formula handles most data engineering estimations.

Time constants

| Period | Seconds | Rounded |
|---|---|---|
| 1 day | 86,400 | ~100K |
| 1 month | 2.6M | ~2.5M |
| 1 year | 31.5M | ~30M |

Quick conversion trick per this cheat sheet:

  • 1M requests/day ≈ 12 req/sec
  • 1B requests/day ≈ 12K req/sec
  • 100M requests/month ≈ 40 req/sec
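These conversions are simple enough to verify with a two-line helper; a minimal sketch using the exact time constants from the table above:

```python
# Throughput conversions using exact time constants (86,400 sec/day, ~2.6M sec/month).
SECONDS_PER_DAY = 86_400
SECONDS_PER_MONTH = 2_600_000

def per_day_to_per_sec(n_per_day: float) -> float:
    """Convert a daily count to an average per-second rate."""
    return n_per_day / SECONDS_PER_DAY

def per_month_to_per_sec(n_per_month: float) -> float:
    """Convert a monthly count to an average per-second rate."""
    return n_per_month / SECONDS_PER_MONTH

print(round(per_day_to_per_sec(1_000_000)))       # ~12 req/sec
print(round(per_day_to_per_sec(1_000_000_000)))   # ~11,574, i.e. ~12K req/sec
print(round(per_month_to_per_sec(100_000_000)))   # ~38, i.e. ~40 req/sec
```

In an interview you round aggressively (100K seconds per day), but the exact constants confirm the cheat-sheet figures are in the right range.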

Modern hardware capabilities (2025–2026)

This is where many candidates use outdated numbers. According to Hello Interview’s modern hardware guide, today’s numbers may surprise you:

| Component | Capability | What it means |
|---|---|---|
| Single server RAM | 512 GB standard, up to 4 TB (X1e), 24 TB extreme | You can cache a lot more than you think |
| Single Postgres/MySQL | Up to 64 TiB storage, 10–20K TPS | Don’t shard until you actually need to |
| SSD (NVMe) | 3–7 GB/sec sequential read, 60 TB per instance | Storage is rarely the bottleneck |
| Network (datacenter) | 25 Gbps standard, 50–100 Gbps high-perf | ~3–12 GB/sec between services |
| Cross-AZ latency | 1–2 ms | Cross-region: 50–150 ms |
| Single Kafka broker | Up to 1M messages/sec, 50 TB storage | One broker is surprisingly powerful |
| Redis (single instance) | 100K+ ops/sec | Often enough without clustering |

Senior-level insight: Many candidates over-engineer because they use 2015-era numbers. A single well-tuned Postgres handles more than most people think. Hello Interview explicitly warns: “One of the biggest giveaways of book knowledge without hands-on experience is relying on outdated hardware constraints.”

The 4-step estimation framework for data engineering

Step 1: Estimate ingestion volume

Start with the source. How much data enters the system?

Formula: Daily events × avg event size = daily ingest volume

Example — User engagement pipeline at Netflix scale:

  • 250M daily active users
  • Each user generates ~200 events/day (views, pauses, searches, ratings)
  • 250M × 200 = 50B events/day
  • 50B / 100K seconds per day = ~500K events/sec
  • Avg event size: ~500 bytes (JSON with user_id, content_id, event_type, timestamp, metadata)
  • 50B × 500 bytes = 25 TB/day raw ingestion

That single calculation tells you: this is a streaming system (500K events/sec can’t wait for batch), you need Kafka with significant partition count, and you’re looking at petabyte-scale storage annually.
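The Step 1 arithmetic is worth writing out explicitly; a sketch using the stated assumptions (250M DAU, 200 events/user/day, 500-byte events, rounded 100K seconds per day):

```python
# Step 1: ingestion volume = daily events × avg event size.
DAU = 250_000_000
EVENTS_PER_USER_PER_DAY = 200
AVG_EVENT_BYTES = 500
SECONDS_PER_DAY = 100_000  # rounded for mental math

events_per_day = DAU * EVENTS_PER_USER_PER_DAY           # 50B events/day
events_per_sec = events_per_day / SECONDS_PER_DAY        # ~500K events/sec
daily_ingest_tb = events_per_day * AVG_EVENT_BYTES / 1e12  # 25 TB/day raw

print(f"{events_per_day/1e9:.0f}B events/day, "
      f"{events_per_sec/1e3:.0f}K events/sec, "
      f"{daily_ingest_tb:.0f} TB/day")
```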

Step 2: Estimate storage requirements

Formula: Daily ingest × retention × replication factor × format overhead

Continuing the Netflix example:

  • Raw data: 25 TB/day
  • Columnar compression (Parquet): ~5:1 compression → 5 TB/day compressed
  • Retention: 2 years = 730 days
  • 5 TB × 730 = 3.65 PB for the main fact table
  • Replication factor 3 (standard for durability): ~11 PB total storage
  • Plus dimensions, aggregates, intermediate tables: multiply by ~1.5x → ~16 PB

Storage tier breakdown (cost-driven):

  • Hot (last 30 days): 150 TB in a fast OLAP store (Druid/ClickHouse) or warehouse
  • Warm (30 days – 1 year): in columnar format on S3/GCS with Iceberg
  • Cold (1–2 years): S3 Glacier / GCS Archive
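The Step 2 formula chains four multipliers, so it helps to see them applied in order; a sketch with the compression ratio, retention, replication factor, and derived-data overhead assumed above:

```python
# Step 2: daily ingest × retention × replication factor × format overhead.
RAW_TB_PER_DAY = 25
COMPRESSION_RATIO = 5       # assumed ~5:1 for Parquet
RETENTION_DAYS = 730        # 2 years
REPLICATION_FACTOR = 3      # standard for durability
DERIVED_OVERHEAD = 1.5      # dimensions, aggregates, intermediate tables

compressed_tb_per_day = RAW_TB_PER_DAY / COMPRESSION_RATIO     # 5 TB/day
fact_table_pb = compressed_tb_per_day * RETENTION_DAYS / 1000  # ~3.65 PB
replicated_pb = fact_table_pb * REPLICATION_FACTOR             # ~11 PB
total_pb = replicated_pb * DERIVED_OVERHEAD                    # ~16 PB

print(f"fact table {fact_table_pb:.2f} PB, total ~{total_pb:.1f} PB")
```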

Step 3: Estimate processing capacity

For Spark batch jobs:

Following the approach from a common Spark cluster sizing interview pattern:

  • Data to process: 25 TB/day (raw)
  • Target partition size: 128 MB
  • Partitions needed: 25 TB / 128 MB = ~200K partitions
  • Cluster: 50 worker nodes × 16 cores = 800 parallel tasks
  • Waves: 200K / 800 = 250 waves
  • ~10 sec per wave → ~42 minutes for a full day’s transformation
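The same sizing can be checked in a few lines; a sketch where the node count, cores per node, and seconds per wave are the assumptions stated above (exact arithmetic gives ~245 waves, which the prose rounds to 250):

```python
# Step 3 (batch): partitions → parallel tasks → waves → runtime.
DATA_TB = 25
TARGET_PARTITION_MB = 128
NODES = 50
CORES_PER_NODE = 16
SECONDS_PER_WAVE = 10  # assumed average task wave duration

partitions = DATA_TB * 1_000_000 // TARGET_PARTITION_MB  # ~195K (~200K)
parallel_tasks = NODES * CORES_PER_NODE                  # 800 concurrent tasks
waves = -(-partitions // parallel_tasks)                 # ceiling division, ~245
runtime_min = waves * SECONDS_PER_WAVE / 60              # ~41 minutes

print(f"{partitions} partitions, {waves} waves, ~{runtime_min:.0f} min")
```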

For Kafka/streaming:

  • 500K events/sec
  • Single Kafka broker handles ~1M msg/sec
  • Need ~5–10 brokers with headroom (for replication, consumer lag, spikes)
  • Partition count: at least 100–200 partitions for parallelism
  • Flink parallelism should match partition count for max throughput
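The partition floor follows from how fast a single consumer can read; a sketch where the 5K events/sec per-partition consumer rate is an illustrative assumption, not a figure from the source:

```python
# Step 3 (streaming): minimum partitions from an assumed per-consumer read rate.
PEAK_EVENTS_PER_SEC = 500_000
PER_PARTITION_CONSUMER_RATE = 5_000  # assumed: one consumer keeps up with 5K events/sec

min_partitions = PEAK_EVENTS_PER_SEC // PER_PARTITION_CONSUMER_RATE  # 100
print(f"at least {min_partitions} partitions")
```

A slower consumer (heavier per-event processing) pushes this toward the 200-partition end of the range.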

Step 4: Estimate query/serving load

Formula: DAU × queries per session × peak multiplier / seconds in peak hour

  • 10K analysts running 20 queries/day = 200K queries/day
  • Peak hour handles 30% of daily load: 60K queries in 3,600 sec = ~17 QPS
  • If serving an application: 10M DAU × 5 API calls = 50M/day = ~580 QPS avg, ~2K QPS peak

These numbers determine whether you need a provisioned warehouse (Redshift/Snowflake) vs. pay-per-query (BigQuery), and whether you need a caching layer in front.
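The Step 4 formula, applied to both scenarios above; the 3.5x peak multiplier for the application case is an assumption in the typical 3–5x range:

```python
# Step 4: serving load for analysts and for an application audience.
ANALYSTS = 10_000
QUERIES_PER_ANALYST_PER_DAY = 20
PEAK_HOUR_SHARE = 0.30  # peak hour carries 30% of daily load

daily_queries = ANALYSTS * QUERIES_PER_ANALYST_PER_DAY  # 200K/day
peak_qps = daily_queries * PEAK_HOUR_SHARE / 3_600      # ~17 QPS

APP_DAU = 10_000_000
API_CALLS_PER_USER = 5
avg_qps = APP_DAU * API_CALLS_PER_USER / 86_400  # ~580 QPS average
peak_app_qps = avg_qps * 3.5                     # ~2K QPS at an assumed 3.5x peak

print(f"analyst peak ~{peak_qps:.0f} QPS, app peak ~{peak_app_qps:.0f} QPS")
```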

Estimation-to-architecture decision map

This is the senior move — connecting your numbers to specific architectural choices:

| Estimation result | Architecture decision |
|---|---|
| < 10K events/sec | Single-node processing (DuckDB, Polars). Batch with Airflow + dbt is fine. |
| 10K – 100K events/sec | Distributed batch (Spark) or light streaming (Spark Structured Streaming) |
| 100K – 1M events/sec | Kafka + Flink/Spark Streaming. Multiple Kafka brokers. Serious partitioning strategy needed. |
| > 1M events/sec | Multi-cluster Kafka, dedicated Flink clusters, consider edge pre-aggregation |
| Storage < 1 TB | Single Postgres/MySQL. No warehouse needed. |
| Storage 1–100 TB | Data warehouse (BigQuery, Snowflake, Redshift). Single-node OLAP for real-time. |
| Storage 100 TB – 10 PB | Lakehouse (Iceberg/Delta on S3) + warehouse for gold layer. Storage tiering essential. |
| Storage > 10 PB | Full data lake architecture. Object storage with multiple query engines. Cost optimization is critical. |
| Query QPS < 50 | Warehouse is fine. No caching needed. |
| Query QPS 50–1000 | Materialized views, pre-aggregation, consider caching hot queries |
| Query QPS > 1000 | Dedicated serving store (Redis, DynamoDB, Druid/Pinot). Cache layer mandatory. |
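If it helps to internalize the map, the throughput column can be written as a lookup helper; a sketch whose thresholds and recommendations simply mirror the table:

```python
# Decision map (throughput rows) as a lookup function; thresholds mirror the table above.
def processing_tier(events_per_sec: float) -> str:
    if events_per_sec < 10_000:
        return "single-node (DuckDB/Polars), batch with Airflow + dbt"
    if events_per_sec < 100_000:
        return "distributed batch (Spark) or light streaming"
    if events_per_sec < 1_000_000:
        return "Kafka + Flink/Spark Streaming, multiple brokers"
    return "multi-cluster Kafka, dedicated Flink, edge pre-aggregation"

print(processing_tier(500_000))  # the Netflix-scale example lands in the Kafka tier
```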

Worked example: Google Ads click analytics

Let’s walk through a full estimation the way you’d do it in a Google interview:

Prompt: Design a click analytics pipeline for Google Ads.

Step 1 — Ingestion:

  • Google serves ~8.5B searches/day, assume ~30% have ads, ~5% CTR
  • 8.5B × 0.30 × 0.05 = ~130M clicks/day
  • 130M / 100K = ~1,300 clicks/sec (average), ~5K/sec peak
  • Avg click event: ~1 KB (user_id, ad_id, campaign_id, keyword, bid, timestamp, geo, device, landing_url)
  • Daily volume: 130M × 1 KB = ~130 GB/day raw

Step 2 — Storage:

  • Compressed (Parquet): ~25 GB/day
  • 3-year retention for billing/audit: 25 GB × 1,095 = ~27 TB
  • With replication and derived tables: ~100 TB total

Step 3 — Processing:

  • 5K events/sec peak — manageable with Pub/Sub → Dataflow (streaming)
  • Batch aggregation for daily reports: 130 GB — small enough for a single Dataflow job or BigQuery scheduled query

Step 4 — Serving:

  • Advertiser dashboard QPS: ~10K advertisers × 10 queries/hour in peak = ~28 QPS
  • This is well within BigQuery’s capability without any caching

Architecture implication: At 130 GB/day and ~28 QPS serving, this doesn’t need a complex streaming architecture for the analytics side. A near-real-time micro-batch (every 1–5 min) into BigQuery with partitioning by date and clustering on campaign_id would be sufficient and cost-effective. The streaming component is needed only for the real-time spend monitoring API that advertisers check frequently.
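The whole Google Ads estimate fits in a few lines; a sketch using only the assumed rates stated above (search volume, ad coverage, CTR, event size, compression, retention), without rounding intermediate values:

```python
# End-to-end Google Ads estimate; all rates are the assumptions stated in the walkthrough.
searches_per_day = 8.5e9
ad_share, ctr = 0.30, 0.05

clicks_per_day = searches_per_day * ad_share * ctr    # ~128M (prose rounds to 130M)
clicks_per_sec = clicks_per_day / 100_000             # ~1,300/sec average
raw_gb_per_day = clicks_per_day * 1_000 / 1e9         # ~128 GB/day at ~1 KB/event
compressed_tb_3yr = raw_gb_per_day / 5 * 1095 / 1000  # ~28 TB over 3-year retention

print(f"{clicks_per_day/1e6:.0f}M clicks/day, {raw_gb_per_day:.0f} GB/day, "
      f"~{compressed_tb_3yr:.0f} TB compressed over 3 years")
```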

This is the kind of nuanced, estimation-driven reasoning Google interviewers look for — not defaulting to the most complex architecture, but letting the numbers guide you to the right level of complexity.

Common estimation mistakes to avoid

  1. Premature sharding: Don’t propose database sharding when your estimate shows 50 GB of data. A single Postgres handles 64 TiB.
  2. Ignoring compression: Raw JSON at 1 KB/event becomes ~200 bytes in Parquet. A 5:1 ratio changes your storage estimate dramatically.
  3. Forgetting peak vs. average: Peak traffic is typically 3–5x average. Design for peak, budget for average.
  4. Neglecting derived data: Your raw data is just the start. Transformations, aggregates, intermediate tables, and indexes can 2–3x total storage.
  5. Using outdated numbers: A single Kafka broker handles 1M msg/sec today, not 10K. A Redis node does 100K ops/sec. Don’t over-provision based on 2015 capabilities.

Interview questions

Q1: “We need to build a pipeline for processing 1 billion user events per day. How would you estimate the infrastructure requirements?”

Model answer: “1 billion events/day is about 12K events/sec average, maybe 40–50K peak. If each event is ~500 bytes, that’s 500 GB/day raw, compressing to ~100 GB/day in Parquet. For ingestion, I’d need Kafka with maybe 5–10 brokers and 50–100 partitions to handle peak throughput comfortably. For processing, a Spark or Flink cluster with ~20 workers could handle this — 500 GB / 128 MB per partition gives about 4,000 partitions, at 320 parallel tasks (20 nodes × 16 cores), that’s about 12 waves at 10 sec each — roughly 2 minutes per batch run. Annually, compressed storage is ~36 TB, well within a single warehouse cluster. The key takeaway: this is moderate scale. It needs distributed processing but doesn’t require exotic architecture. Kafka → Flink for streaming, Spark/dbt for batch, warehouse for serving.”

Q2: “How would you estimate differently for Meta vs. a startup with 100K users?”

Model answer: “At a startup with 100K users generating maybe 50 events/user/day, that’s 5M events/day — about 60 events/sec. At 500 bytes each, that’s 2.5 GB/day raw, 500 MB compressed. Annually it’s ~180 GB. This entire pipeline can run on a single node — DuckDB or Polars for processing, Postgres for serving, Airflow + dbt for orchestration. No Kafka, no Spark, no warehouse needed. At Meta scale, 3B users × 200 events = 600B events/day, roughly 7M events/sec. That’s a different universe: multi-cluster Kafka, large Flink deployments, petabyte-scale lakehouse, multiple serving tiers. The architecture fundamentally changes — and this is why estimation comes before architecture, not after.”

Think about this

You’re interviewing at OpenAI. The prompt: “Design the data pipeline for processing user conversations with ChatGPT for quality analysis.”

Estimate:

  1. ChatGPT has ~200M weekly active users. If each averages 5 conversations/week with 10 messages each, how many messages/day is that?
  2. If each message (prompt + response) averages 2 KB, what’s the daily and annual storage?
  3. If you need to run embedding generation on each conversation (one embedding per conversation, 1536 dimensions × 4 bytes = ~6 KB), how much additional storage and compute does that add?
  4. Based on these numbers, what processing framework would you choose and why?

The insight: the stated assumptions give ~1.4B messages/day (~16K/sec), so the text pipeline itself is moderate scale. The compute bottleneck is embedding generation: producing embeddings for ~143M conversations/day using a model takes significant GPU time. This is where your estimation shifts from traditional data engineering to AI-infrastructure thinking, which is exactly what OpenAI would probe.
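One way to sanity-check your answers, using only the assumptions given in the prompt:

```python
# Exercise arithmetic from the stated assumptions (WAU, conversations, message size).
WAU = 200_000_000
CONV_PER_USER_PER_WEEK = 5
MSGS_PER_CONV = 10
MSG_BYTES = 2_000           # ~2 KB per message (prompt + response)
EMBEDDING_BYTES = 1536 * 4  # one embedding per conversation, ~6 KB

msgs_per_day = WAU * CONV_PER_USER_PER_WEEK * MSGS_PER_CONV / 7  # ~1.4B/day
msgs_per_sec = msgs_per_day / 86_400                             # ~16.5K/sec
conv_per_day = WAU * CONV_PER_USER_PER_WEEK / 7                  # ~143M/day
text_tb_per_day = msgs_per_day * MSG_BYTES / 1e12                # ~2.9 TB/day raw
embed_gb_per_day = conv_per_day * EMBEDDING_BYTES / 1e9          # ~880 GB/day

print(f"{msgs_per_day/1e9:.1f}B msgs/day, {conv_per_day/1e6:.0f}M conv/day, "
      f"{text_tb_per_day:.1f} TB text/day, {embed_gb_per_day:.0f} GB embeddings/day")
```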

Quick reference

  • 1 KB × 1 billion = 1 TB — the one formula that handles most estimations
  • 1M requests/day ≈ 12 req/sec — quick throughput conversion
  • Always state assumptions out loud — interviewers want to see your reasoning, not just the final number
  • Numbers must drive architecture — if your estimate says 100 GB/day, don’t propose a 50-node Spark cluster. Let the math justify the complexity.
  • Use modern hardware numbers — single Kafka broker = 1M msg/sec, single Postgres = 64 TiB, single Redis = 100K ops/sec. Don’t over-engineer.

Tomorrow’s preview

Day 4: Dimensional modeling fundamentals — Star schema vs. snowflake, fact table types (transaction, periodic snapshot, accumulating snapshot), dimension tables, grain definition, and how to present data models in interviews for your target companies.