Day 3/90: Back-of-envelope estimation for data systems
Phase 1: Foundations & Frameworks | Category: System Design Methodology
Why this matters at senior level
Estimation isn’t a math exercise — it’s how you prove your architecture is grounded in reality. As System Design Handbook puts it: “Strong candidates naturally anchor their designs in numbers. They talk about traffic before databases, capacity before sharding, and growth before optimization.” At your level, interviewers expect your numbers to directly drive component choices — not live in a disconnected calculation on the side of the whiteboard.
The key mindset shift: you’re not trying to be precise. You’re trying to land in the right order of magnitude so your architecture makes sense. Being off by 2x is fine. Being off by 100x means you designed the wrong system.
The numbers you must memorize
Powers of two — your unit converter
| Power of 2 | Approx value | Bytes |
|---|---|---|
| 10 | 1 thousand | 1 KB |
| 20 | 1 million | 1 MB |
| 30 | 1 billion | 1 GB |
| 40 | 1 trillion | 1 TB |
| 50 | 1 quadrillion | 1 PB |
The one shortcut that matters: 1 KB × 1 billion = 1 TB. This single formula handles most data engineering estimations.
Time constants
| Period | Seconds | Rounded |
|---|---|---|
| 1 day | 86,400 | ~100K |
| 1 month | 2.6M | ~2.5M |
| 1 year | 31.5M | ~30M |
Quick conversion tricks:
- 1M requests/day ≈ 12 req/sec
- 1B requests/day ≈ 12K req/sec
- 100M requests/month ≈ 40 req/sec
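The conversions above are just divisions by the time constants. A minimal sketch (the helper names are illustrative, not from any library):

```python
# Quick traffic conversions using the time constants above.
SECONDS_PER_DAY = 86_400       # round to ~100K for mental math
SECONDS_PER_MONTH = 2_600_000

def per_day_to_per_sec(requests_per_day: float) -> float:
    """Average requests/sec from a daily request count."""
    return requests_per_day / SECONDS_PER_DAY

def per_month_to_per_sec(requests_per_month: float) -> float:
    """Average requests/sec from a monthly request count."""
    return requests_per_month / SECONDS_PER_MONTH

print(round(per_day_to_per_sec(1_000_000)))      # ~12 req/sec
print(round(per_day_to_per_sec(1_000_000_000)))  # ~12K req/sec
print(round(per_month_to_per_sec(100_000_000)))  # ~38 req/sec (≈40)
```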
Modern hardware capabilities (2025–2026)
This is where many candidates use outdated numbers. According to Hello Interview’s modern hardware guide, today’s numbers may surprise you:
| Component | Capability | What it means |
|---|---|---|
| Single server RAM | 512 GB standard, up to 4 TB (X1e), 24 TB extreme | You can cache a lot more than you think |
| Single Postgres/MySQL | Up to 64 TiB storage, 10–20K TPS | Don’t shard until you actually need to |
| SSD (NVMe) | 3–7 GB/sec sequential read, 60 TB per instance | Storage is rarely the bottleneck |
| Network (datacenter) | 25 Gbps standard, 50–100 Gbps high-perf | ~3–12 GB/sec between services |
| Cross-AZ latency | 1–2 ms | Cross-region: 50–150 ms |
| Single Kafka broker | Up to 1M messages/sec, 50 TB storage | One broker is surprisingly powerful |
| Redis (single instance) | 100K+ ops/sec | Often enough without clustering |
Senior-level insight: Many candidates over-engineer because they use 2015-era numbers. A single well-tuned Postgres handles more than most people think. Hello Interview explicitly warns: “One of the biggest giveaways of book knowledge without hands-on experience is relying on outdated hardware constraints.”
The 4-step estimation framework for data engineering
Step 1: Estimate ingestion volume
Start with the source. How much data enters the system?
Formula: Daily events × avg event size = daily ingest volume
Example — User engagement pipeline at Netflix scale:
- 250M daily active users
- Each user generates ~200 events/day (views, pauses, searches, ratings)
- 250M × 200 = 50B events/day
- 50B / 100K seconds per day = ~500K events/sec
- Avg event size: ~500 bytes (JSON with user_id, content_id, event_type, timestamp, metadata)
- 50B × 500 bytes = 25 TB/day raw ingestion
That single calculation tells you: this is a streaming system (500K events/sec can’t wait for batch), you need Kafka with significant partition count, and you’re looking at petabyte-scale storage annually.
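The Step 1 arithmetic condenses to a few lines. A sketch using the example's assumed inputs (all figures are the illustrative assumptions from the text, not measured values):

```python
# Step 1: ingestion estimate with the example's assumed inputs.
DAU = 250_000_000               # daily active users (assumed)
EVENTS_PER_USER_PER_DAY = 200   # views, pauses, searches, ratings (assumed)
AVG_EVENT_BYTES = 500           # small JSON event (assumed)
SECONDS_PER_DAY = 100_000       # rounded for back-of-envelope math

daily_events = DAU * EVENTS_PER_USER_PER_DAY      # 50 billion events/day
events_per_sec = daily_events / SECONDS_PER_DAY   # ~500K events/sec
daily_bytes = daily_events * AVG_EVENT_BYTES      # 25 TB/day raw
print(f"{daily_events / 1e9:.0f}B events/day, "
      f"{events_per_sec / 1e3:.0f}K events/sec, "
      f"{daily_bytes / 1e12:.0f} TB/day")
```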
Step 2: Estimate storage requirements
Formula: Daily ingest × retention × replication factor × format overhead
Continuing the Netflix example:
- Raw data: 25 TB/day
- Columnar compression (Parquet): ~5:1 compression → 5 TB/day compressed
- Retention: 2 years = 730 days
- 5 TB × 730 = 3.65 PB for the main fact table
- Replication factor 3 (standard for durability): ~11 PB total storage
- Plus dimensions, aggregates, intermediate tables: multiply by ~1.5x → ~16 PB
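The storage formula, applied to the same assumed inputs (compression ratio, retention, and overhead factors are the example's assumptions):

```python
# Step 2: storage estimate for the continuing example (assumed inputs).
raw_tb_per_day = 25
compression_ratio = 5        # Parquet ~5:1 (assumed)
retention_days = 730         # 2 years
replication = 3              # standard durability factor
derived_overhead = 1.5       # dimensions, aggregates, intermediates

compressed_tb_per_day = raw_tb_per_day / compression_ratio       # 5 TB/day
fact_table_pb = compressed_tb_per_day * retention_days / 1_000   # ~3.65 PB
total_pb = fact_table_pb * replication * derived_overhead        # ~16 PB
print(f"fact table ~{fact_table_pb:.2f} PB, total ~{total_pb:.0f} PB")
```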
Storage tier breakdown (cost-driven):
- Hot (last 30 days): 150 TB in a fast OLAP store (Druid/ClickHouse) or warehouse
- Warm (30 days – 1 year): ~1.7 PB in columnar format on S3/GCS with Iceberg
- Cold (1–2 years): S3 Glacier / GCS Archive
Step 3: Estimate processing capacity
For Spark batch jobs:
Following a common Spark cluster-sizing pattern:
- Data to process: 25 TB/day (raw)
- Target partition size: 128 MB
- Partitions needed: 25 TB / 128 MB = ~200K partitions
- Cluster: 50 worker nodes × 16 cores = 800 parallel tasks
- Waves: 200K / 800 = 250 waves
- ~10 sec per wave → ~42 minutes for a full day’s transformation
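The wave math is worth being able to reproduce on demand. A sketch with the example's inputs (the text's ~200K/250/~42 min are these values rounded):

```python
# Step 3: Spark batch sizing — partitions, waves, wall-clock time.
data_bytes = 25 * 10**12        # 25 TB/day raw (assumed)
partition_bytes = 128 * 10**6   # 128 MB target partition size
workers, cores_per_worker = 50, 16
sec_per_wave = 10               # rough per-wave task time (assumed)

partitions = data_bytes // partition_bytes      # ~195K (text rounds to ~200K)
parallel_tasks = workers * cores_per_worker     # 800 concurrent tasks
waves = -(-partitions // parallel_tasks)        # ceiling division
runtime_min = waves * sec_per_wave / 60
print(f"{partitions:,} partitions, {waves} waves, ~{runtime_min:.0f} min")
```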
For Kafka/streaming:
- 500K events/sec
- Single Kafka broker handles ~1M msg/sec
- Need ~5–10 brokers with headroom (for replication, consumer lag, spikes)
- Partition count: at least 100–200 partitions for parallelism
- Flink parallelism should match partition count for max throughput
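A quick broker-count floor can be derived from the same numbers. This sketch (capacity and headroom factors are assumptions, consistent with the figures above) gives a minimum; the 5–10 brokers suggested above adds further operational margin:

```python
# Streaming sizing sketch: minimum broker count with replication + headroom.
import math

peak_events_per_sec = 500_000
broker_capacity = 1_000_000   # modern single-broker ceiling, msgs/sec (assumed)
replication = 3               # each message is written to 3 brokers
headroom = 2.0                # spikes, consumer lag, rebalances (assumed)

effective_writes = peak_events_per_sec * replication * headroom  # 3M writes/sec
min_brokers = math.ceil(effective_writes / broker_capacity)
print(min_brokers)  # a floor, not a recommendation
```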
Step 4: Estimate query/serving load
Formula: DAU × queries per session × peak multiplier / seconds in peak hour
- 10K analysts running 20 queries/day = 200K queries/day
- Peak hour handles 30% of daily load: 60K queries in 3,600 sec = ~17 QPS
- If serving an application: 10M DAU × 5 API calls = 50M/day = ~580 QPS avg, ~2K QPS peak
These numbers determine whether you need a provisioned warehouse (Redshift/Snowflake) vs. pay-per-query (BigQuery), and whether you need a caching layer in front.
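Both serving estimates above reduce to the same formula. A sketch with the stated assumptions:

```python
# Step 4: serving-load estimates (inputs are the examples' assumptions).
analysts, queries_per_analyst = 10_000, 20
daily_queries = analysts * queries_per_analyst        # 200K queries/day
peak_hour_share = 0.30                                # 30% of load in peak hour
peak_qps = daily_queries * peak_hour_share / 3_600    # ~17 QPS

dau, calls_per_user = 10_000_000, 5
avg_qps = dau * calls_per_user / 86_400               # ~580 QPS average
print(f"analyst peak ~{peak_qps:.0f} QPS, app avg ~{avg_qps:.0f} QPS")
```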
Estimation-to-architecture decision map
This is the senior move — connecting your numbers to specific architectural choices:
| Estimation result | Architecture decision |
|---|---|
| < 10K events/sec | Single-node processing (DuckDB, Polars). Batch with Airflow + dbt is fine. |
| 10K – 100K events/sec | Distributed batch (Spark) or light streaming (Spark Structured Streaming) |
| 100K – 1M events/sec | Kafka + Flink/Spark Streaming. Multiple Kafka brokers. Serious partitioning strategy needed. |
| > 1M events/sec | Multi-cluster Kafka, dedicated Flink clusters, consider edge pre-aggregation |
| Storage < 1 TB | Single Postgres/MySQL. No warehouse needed. |
| Storage 1–100 TB | Data warehouse (BigQuery, Snowflake, Redshift). Single-node OLAP for real-time. |
| Storage 100 TB – 10 PB | Lakehouse (Iceberg/Delta on S3) + warehouse for gold layer. Storage tiering essential. |
| Storage > 10 PB | Full data lake architecture. Object storage with multiple query engines. Cost optimization is critical. |
| Query QPS < 50 | Warehouse is fine. No caching needed. |
| Query QPS 50–1000 | Materialized views, pre-aggregation, consider caching hot queries |
| Query QPS > 1000 | Dedicated serving store (Redis, DynamoDB, Druid/Pinot). Cache layer mandatory. |
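The ingestion rows of the map can be read as a simple threshold lookup. A toy function mirroring the table (the thresholds and labels are the table's, the function itself is illustrative):

```python
# Toy lookup mirroring the ingestion rows of the decision map above.
def ingest_tier(events_per_sec: float) -> str:
    if events_per_sec < 10_000:
        return "single-node batch (DuckDB/Polars, Airflow + dbt)"
    if events_per_sec < 100_000:
        return "distributed batch (Spark) or light streaming"
    if events_per_sec < 1_000_000:
        return "Kafka + Flink/Spark Streaming, multiple brokers"
    return "multi-cluster Kafka, dedicated Flink, edge pre-aggregation"

print(ingest_tier(60))       # startup-scale example
print(ingest_tier(500_000))  # Netflix-scale example
```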
Worked example: Google Ads click analytics
Let’s walk through a full estimation the way you’d do it in a Google interview:
Prompt: Design a click analytics pipeline for Google Ads.
Step 1 — Ingestion:
- Google serves ~8.5B searches/day, assume ~30% have ads, ~5% CTR
- 8.5B × 0.30 × 0.05 = ~130M clicks/day
- 130M / 100K = ~1,300 clicks/sec (average), ~5K/sec peak
- Avg click event: ~1 KB (user_id, ad_id, campaign_id, keyword, bid, timestamp, geo, device, landing_url)
- Daily volume: 130M × 1 KB = ~130 GB/day raw
Step 2 — Storage:
- Compressed (Parquet): ~25 GB/day
- 3-year retention for billing/audit: 25 GB × 1,095 = ~27 TB
- With replication and derived tables: ~100 TB total
Step 3 — Processing:
- 5K events/sec peak — manageable with Pub/Sub → Dataflow (streaming)
- Batch aggregation for daily reports: 130 GB — small enough for a single Dataflow job or BigQuery scheduled query
Step 4 — Serving:
- Advertiser dashboard QPS: ~10K advertisers × 10 queries/hour in peak = ~28 QPS
- This is well within BigQuery’s capability without any caching
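The whole worked example condenses to a few lines (inputs are the stated assumptions; the text's figures are these values rounded):

```python
# Google Ads click-analytics estimate, end to end (assumed inputs).
searches_per_day = 8.5e9
ad_rate, ctr = 0.30, 0.05
clicks_per_day = searches_per_day * ad_rate * ctr   # ~130M clicks/day
clicks_per_sec = clicks_per_day / 100_000           # ~1,300/sec average
raw_gb_per_day = clicks_per_day * 1_000 / 1e9       # ~130 GB/day at ~1 KB/click
compressed_tb = raw_gb_per_day / 5 * 1_095 / 1_000  # ~27 TB over 3 years
print(f"{clicks_per_day / 1e6:.0f}M clicks/day, ~{clicks_per_sec:.0f}/sec, "
      f"{raw_gb_per_day:.0f} GB/day raw, ~{compressed_tb:.0f} TB retained")
```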
Architecture implication: At 130 GB/day and ~28 QPS serving, this doesn’t need a complex streaming architecture for the analytics side. A near-real-time micro-batch (every 1–5 min) into BigQuery with partitioning by date and clustering on campaign_id would be sufficient and cost-effective. The streaming component is needed only for the real-time spend monitoring API that advertisers check frequently.
This is the kind of nuanced, estimation-driven reasoning Google interviewers look for — not defaulting to the most complex architecture, but letting the numbers guide you to the right level of complexity.
Common estimation mistakes to avoid
- Premature sharding: Don’t propose database sharding when your estimate shows 50 GB of data. A single Postgres handles 64 TiB.
- Ignoring compression: Raw JSON at 1 KB/event becomes ~200 bytes in Parquet. A 5:1 ratio changes your storage estimate dramatically.
- Forgetting peak vs. average: Peak traffic is typically 3–5x average. Design for peak, budget for average.
- Neglecting derived data: Your raw data is just the start. Transformations, aggregates, intermediate tables, and indexes can 2–3x total storage.
- Using outdated numbers: A single Kafka broker handles 1M msg/sec today, not 10K. A Redis node does 100K ops/sec. Don’t over-provision based on 2015 capabilities.
Interview questions
Q1: “We need to build a pipeline for processing 1 billion user events per day. How would you estimate the infrastructure requirements?”
Model answer: “1 billion events/day is about 12K events/sec average, maybe 40–50K peak. If each event is ~500 bytes, that’s 500 GB/day raw, compressing to ~100 GB/day in Parquet. For ingestion, I’d need Kafka with maybe 5–10 brokers and 50–100 partitions to handle peak throughput comfortably. For processing, a Spark or Flink cluster with ~20 workers could handle this — 500 GB / 128 MB per partition gives about 4,000 partitions, at 320 parallel tasks (20 nodes × 16 cores), that’s about 12 waves at 10 sec each — roughly 2 minutes per batch run. Annually, compressed storage is ~36 TB, well within a single warehouse cluster. The key takeaway: this is moderate scale. It needs distributed processing but doesn’t require exotic architecture. Kafka → Flink for streaming, Spark/dbt for batch, warehouse for serving.”
Q2: “How would you estimate differently for Meta vs. a startup with 100K users?”
Model answer: “At a startup with 100K users generating maybe 50 events/user/day, that’s 5M events/day — about 60 events/sec. At 500 bytes each, that’s 2.5 GB/day raw, 500 MB compressed. Annually it’s ~180 GB. This entire pipeline can run on a single node — DuckDB or Polars for processing, Postgres for serving, Airflow + dbt for orchestration. No Kafka, no Spark, no warehouse needed. At Meta scale, 3B users × 200 events = 600B events/day, roughly 7M events/sec. That’s a different universe: multi-cluster Kafka, large Flink deployments, petabyte-scale lakehouse, multiple serving tiers. The architecture fundamentally changes — and this is why estimation comes before architecture, not after.”
Think about this
You’re interviewing at OpenAI. The prompt: “Design the data pipeline for processing user conversations with ChatGPT for quality analysis.”
Estimate:
- ChatGPT has ~200M weekly active users. If each averages 5 conversations/week with 10 messages each, how many messages/day is that?
- If each message (prompt + response) averages 2 KB, what’s the daily and annual storage?
- If you need to run embedding generation on each conversation (one embedding per conversation, 1536 dimensions × 4 bytes = ~6 KB), how much additional storage and compute does that add?
- Based on these numbers, what processing framework would you choose and why?
The insight: at ~1.4B messages/day (~16K/sec), the text pipeline is moderate scale. But the embedding generation is the compute bottleneck — generating embeddings for ~143M conversations/day using a model takes significant GPU time. This is where your estimation shifts from traditional data engineering to AI-infrastructure thinking, which is exactly what OpenAI would probe.
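One way to run the exercise arithmetic, using exactly the stated assumptions (all inputs are hypothetical figures from the prompt):

```python
# Exercise arithmetic with the stated assumptions (hypothetical figures).
wau = 200_000_000                       # weekly active users
conv_per_user_week, msgs_per_conv = 5, 10
msg_bytes, emb_bytes = 2_000, 6_000     # ~2 KB/message, ~6 KB/embedding

conv_per_day = wau * conv_per_user_week / 7     # ~143M conversations/day
msgs_per_day = conv_per_day * msgs_per_conv     # ~1.4B messages/day
msgs_per_sec = msgs_per_day / 86_400            # ~16.5K messages/sec
text_tb_day = msgs_per_day * msg_bytes / 1e12   # ~2.9 TB text/day
emb_tb_day = conv_per_day * emb_bytes / 1e12    # ~0.86 TB embeddings/day
print(f"{msgs_per_day / 1e9:.1f}B msgs/day, {msgs_per_sec / 1e3:.1f}K/sec, "
      f"{text_tb_day:.1f} TB text/day, {emb_tb_day:.2f} TB embeddings/day")
```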
Quick reference
- 1 KB × 1 billion = 1 TB — the one formula that handles most estimations
- 1M requests/day ≈ 12 req/sec — quick throughput conversion
- Always state assumptions out loud — interviewers want to see your reasoning, not just the final number
- Numbers must drive architecture — if your estimate says 100 GB/day, don’t propose a 50-node Spark cluster. Let the math justify the complexity.
- Use modern hardware numbers — single Kafka broker = 1M msg/sec, single Postgres = 64 TiB, single Redis = 100K ops/sec. Don’t over-engineer.
Tomorrow’s preview
Day 4: Dimensional modeling fundamentals — Star schema vs. snowflake, fact table types (transaction, periodic snapshot, accumulating snapshot), dimension tables, grain definition, and how to present data models in interviews for your target companies.