Day 3/90: Back-of-envelope estimation for data systems
Phase 1: Foundations & Frameworks | Category: System Design Methodology
Why this matters at senior level
Estimation isn’t a math exercise — it’s how you prove your architecture is grounded in reality. As System Design Handbook puts it: “Strong candidates naturally anchor their designs in numbers. They talk about traffic before databases, capacity before sharding, and growth before optimization.” At your level, interviewers expect your numbers to directly drive component choices — not live in a disconnected calculation on the side of the whiteboard.
The key mindset shift: you’re not trying to be precise. You’re trying to land in the right order of magnitude so your architecture makes sense. Being off by 2x is fine. Being off by 100x means you designed the wrong system.
The numbers you must memorize
Powers of two — your unit converter
| Power of 2 | Approx value | Bytes |
|---|---|---|
| 10 | 1 thousand | 1 KB |
| 20 | 1 million | 1 MB |
| 30 | 1 billion | 1 GB |
| 40 | 1 trillion | 1 TB |
| 50 | 1 quadrillion | 1 PB |
The one shortcut that matters: 1 KB × 1 billion = 1 TB. This single formula handles most data engineering estimations.
Time constants
| Period | Seconds | Rounded |
|---|---|---|
| 1 day | 86,400 | ~100K |
| 1 month | 2.6M | ~2.5M |
| 1 year | 31.5M | ~30M |
Quick conversion tricks:
- 1M requests/day ≈ 12 req/sec
- 1B requests/day ≈ 12K req/sec
- 100M requests/month ≈ 40 req/sec
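The conversions above are just divisions by the time constants. A minimal sketch (the helper names are illustrative, not from any library):

```python
# Quick traffic conversions using the time constants above.
SECONDS_PER_DAY = 86_400       # round to ~100K for mental math
SECONDS_PER_MONTH = 2_600_000

def per_day_to_per_sec(requests_per_day: float) -> float:
    """Average requests/sec from a daily request count."""
    return requests_per_day / SECONDS_PER_DAY

def per_month_to_per_sec(requests_per_month: float) -> float:
    """Average requests/sec from a monthly request count."""
    return requests_per_month / SECONDS_PER_MONTH

print(round(per_day_to_per_sec(1_000_000)))      # ~12 req/sec
print(round(per_day_to_per_sec(1_000_000_000)))  # ~12K req/sec
print(round(per_month_to_per_sec(100_000_000)))  # ~38 req/sec (≈40)
```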
Modern hardware capabilities (2025–2026)
This is where many candidates use outdated numbers. According to Hello Interview’s modern hardware guide, today’s numbers may surprise you:
| Component | Capability | What it means |
|---|---|---|
| Single server RAM | 512 GB standard, up to 4 TB (X1e), 24 TB extreme | You can cache a lot more than you think |
| Single Postgres/MySQL | Up to 64 TiB storage, 10–20K TPS | Don’t shard until you actually need to |
| SSD (NVMe) | 3–7 GB/sec sequential read, 60 TB per instance | Storage is rarely the bottleneck |
| Network (datacenter) | 25 Gbps standard, 50–100 Gbps high-perf | ~3–12 GB/sec between services |
| Cross-AZ latency | 1–2 ms | Cross-region: 50–150 ms |
| Single Kafka broker | Up to 1M messages/sec, 50 TB storage | One broker is surprisingly powerful |
| Redis (single instance) | 100K+ ops/sec | Often enough without clustering |
Senior-level insight: Many candidates over-engineer because they use 2015-era numbers. A single well-tuned Postgres handles more than most people think. Hello Interview explicitly warns: “One of the biggest giveaways of book knowledge without hands-on experience is relying on outdated hardware constraints.”
The 4-step estimation framework for data engineering
Step 1: Estimate ingestion volume
Start with the source. How much data enters the system?
Formula: Daily events × avg event size = daily ingest volume
Example — User engagement pipeline at Netflix scale:
- 250M daily active users
- Each user generates ~200 events/day (views, pauses, searches, ratings)
- 250M × 200 = 50B events/day
- 50B / 100K seconds per day = ~500K events/sec
- Avg event size: ~500 bytes (JSON with user_id, content_id, event_type, timestamp, metadata)
- 50B × 500 bytes = 25 TB/day raw ingestion
That single calculation tells you: this is a streaming system (500K events/sec can’t wait for batch), you need Kafka with significant partition count, and you’re looking at petabyte-scale storage annually.
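The Step 1 arithmetic condenses to a few lines. A sketch using the example's assumed inputs (all figures are the illustrative assumptions from the text, not measured values):

```python
# Step 1: ingestion estimate with the example's assumed inputs.
DAU = 250_000_000               # daily active users (assumed)
EVENTS_PER_USER_PER_DAY = 200   # views, pauses, searches, ratings (assumed)
AVG_EVENT_BYTES = 500           # small JSON event (assumed)
SECONDS_PER_DAY = 100_000       # rounded for back-of-envelope math

daily_events = DAU * EVENTS_PER_USER_PER_DAY      # 50 billion events/day
events_per_sec = daily_events / SECONDS_PER_DAY   # ~500K events/sec
daily_bytes = daily_events * AVG_EVENT_BYTES      # 25 TB/day raw
print(f"{daily_events / 1e9:.0f}B events/day, "
      f"{events_per_sec / 1e3:.0f}K events/sec, "
      f"{daily_bytes / 1e12:.0f} TB/day")
```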
Step 2: Estimate storage requirements
Formula: Daily ingest × retention × replication factor × format overhead
Continuing the Netflix example:
- Raw data: 25 TB/day
- Columnar compression (Parquet): ~5:1 compression → 5 TB/day compressed
- Retention: 2 years = 730 days
- 5 TB × 730 = 3.65 PB for the main fact table
- Replication factor 3 (standard for durability): ~11 PB total storage
- Plus dimensions, aggregates, intermediate tables: multiply by ~1.5x → ~16 PB
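The storage formula, applied to the same assumed inputs (compression ratio, retention, and overhead factors are the example's assumptions):

```python
# Step 2: storage estimate for the continuing example (assumed inputs).
raw_tb_per_day = 25
compression_ratio = 5        # Parquet ~5:1 (assumed)
retention_days = 730         # 2 years
replication = 3              # standard durability factor
derived_overhead = 1.5       # dimensions, aggregates, intermediates

compressed_tb_per_day = raw_tb_per_day / compression_ratio       # 5 TB/day
fact_table_pb = compressed_tb_per_day * retention_days / 1_000   # ~3.65 PB
total_pb = fact_table_pb * replication * derived_overhead        # ~16 PB
print(f"fact table ~{fact_table_pb:.2f} PB, total ~{total_pb:.0f} PB")
```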
Storage tier breakdown (cost-driven):
- Hot (last 30 days): 150 TB in a fast OLAP store (Druid/ClickHouse) or warehouse
- Warm (30 days – 1 year): ~1.7 PB in columnar format on S3/GCS with Iceberg
- Cold (1–2 years): S3 Glacier / GCS Archive
Step 3: Estimate processing capacity
For Spark batch jobs:
Following a common Spark cluster-sizing pattern:
- Data to process: 25 TB/day (raw)
- Target partition size: 128 MB
- Partitions needed: 25 TB / 128 MB = ~200K partitions
- Cluster: 50 worker nodes × 16 cores = 800 parallel tasks
- Waves: 200K / 800 = 250 waves
- ~10 sec per wave → ~42 minutes for a full day’s transformation
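The wave math is worth being able to reproduce on demand. A sketch with the example's inputs (the text's ~200K/250/~42 min are these values rounded):

```python
# Step 3: Spark batch sizing — partitions, waves, wall-clock time.
data_bytes = 25 * 10**12        # 25 TB/day raw (assumed)
partition_bytes = 128 * 10**6   # 128 MB target partition size
workers, cores_per_worker = 50, 16
sec_per_wave = 10               # rough per-wave task time (assumed)

partitions = data_bytes // partition_bytes      # ~195K (text rounds to ~200K)
parallel_tasks = workers * cores_per_worker     # 800 concurrent tasks
waves = -(-partitions // parallel_tasks)        # ceiling division
runtime_min = waves * sec_per_wave / 60
print(f"{partitions:,} partitions, {waves} waves, ~{runtime_min:.0f} min")
```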
For Kafka/streaming:
- 500K events/sec
- Single Kafka broker handles ~1M msg/sec
- Need ~5–10 brokers with headroom (for replication, consumer lag, spikes)
- Partition count: at least 100–200 partitions for parallelism
- Flink parallelism should match partition count for max throughput
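A quick broker-count floor can be derived from the same numbers. This sketch (capacity and headroom factors are assumptions, consistent with the figures above) gives a minimum; the 5–10 brokers suggested above adds further operational margin:

```python
# Streaming sizing sketch: minimum broker count with replication + headroom.
import math

peak_events_per_sec = 500_000
broker_capacity = 1_000_000   # modern single-broker ceiling, msgs/sec (assumed)
replication = 3               # each message is written to 3 brokers
headroom = 2.0                # spikes, consumer lag, rebalances (assumed)

effective_writes = peak_events_per_sec * replication * headroom  # 3M writes/sec
min_brokers = math.ceil(effective_writes / broker_capacity)
print(min_brokers)  # a floor, not a recommendation
```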
Step 4: Estimate query/serving load
Formula: DAU × queries per session × peak multiplier / seconds in peak hour
- 10K analysts running 20 queries/day = 200K queries/day
- Peak hour handles 30% of daily load: 60K queries in 3,600 sec = ~17 QPS
- If serving an application: 10M DAU × 5 API calls = 50M/day = ~580 QPS avg, ~2K QPS peak
These numbers determine whether you need a provisioned warehouse (Redshift/Snowflake) vs. pay-per-query (BigQuery), and whether you need a caching layer in front.
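Both serving estimates above reduce to the same formula. A sketch with the stated assumptions:

```python
# Step 4: serving-load estimates (inputs are the examples' assumptions).
analysts, queries_per_analyst = 10_000, 20
daily_queries = analysts * queries_per_analyst        # 200K queries/day
peak_hour_share = 0.30                                # 30% of load in peak hour
peak_qps = daily_queries * peak_hour_share / 3_600    # ~17 QPS

dau, calls_per_user = 10_000_000, 5
avg_qps = dau * calls_per_user / 86_400               # ~580 QPS average
print(f"analyst peak ~{peak_qps:.0f} QPS, app avg ~{avg_qps:.0f} QPS")
```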
Estimation-to-architecture decision map
This is the senior move — connecting your numbers to specific architectural choices:
| Estimation result | Architecture decision |
|---|---|
| < 10K events/sec | Single-node processing (DuckDB, Polars). Batch with Airflow + dbt is fine. |
| 10K – 100K events/sec | Distributed batch (Spark) or light streaming (Spark Structured Streaming) |
| 100K – 1M events/sec | Kafka + Flink/Spark Streaming. Multiple Kafka brokers. Serious partitioning strategy needed. |
| > 1M events/sec | Multi-cluster Kafka, dedicated Flink clusters, consider edge pre-aggregation |
| Storage < 1 TB | Single Postgres/MySQL. No warehouse needed. |
| Storage 1–100 TB | Data warehouse (BigQuery, Snowflake, Redshift). Single-node OLAP for real-time. |
| Storage 100 TB – 10 PB | Lakehouse (Iceberg/Delta on S3) + warehouse for gold layer. Storage tiering essential. |
| Storage > 10 PB | Full data lake architecture. Object storage with multiple query engines. Cost optimization is critical. |
| Query QPS < 50 | Warehouse is fine. No caching needed. |
| Query QPS 50–1000 | Materialized views, pre-aggregation, consider caching hot queries |
| Query QPS > 1000 | Dedicated serving store (Redis, DynamoDB, Druid/Pinot). Cache layer mandatory. |
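The ingestion rows of the map can be read as a simple threshold lookup. A toy function mirroring the table (the thresholds and labels are the table's, the function itself is illustrative):

```python
# Toy lookup mirroring the ingestion rows of the decision map above.
def ingest_tier(events_per_sec: float) -> str:
    if events_per_sec < 10_000:
        return "single-node batch (DuckDB/Polars, Airflow + dbt)"
    if events_per_sec < 100_000:
        return "distributed batch (Spark) or light streaming"
    if events_per_sec < 1_000_000:
        return "Kafka + Flink/Spark Streaming, multiple brokers"
    return "multi-cluster Kafka, dedicated Flink, edge pre-aggregation"

print(ingest_tier(60))       # startup-scale example
print(ingest_tier(500_000))  # Netflix-scale example
```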
Worked example: Google Ads click analytics
Let’s walk through a full estimation the way you’d do it in a Google interview:
Prompt: Design a click analytics pipeline for Google Ads.
Step 1 — Ingestion:
- Google serves ~8.5B searches/day, assume ~30% have ads, ~5% CTR
- 8.5B × 0.30 × 0.05 = ~130M clicks/day
- 130M / 100K = ~1,300 clicks/sec (average), ~5K/sec peak
- Avg click event: ~1 KB (user_id, ad_id, campaign_id, keyword, bid, timestamp, geo, device, landing_url)
- Daily volume: 130M × 1 KB = ~130 GB/day raw
Step 2 — Storage:
- Compressed (Parquet): ~25 GB/day
- 3-year retention for billing/audit: 25 GB × 1,095 = ~27 TB
- With replication and derived tables: ~100 TB total
Step 3 — Processing:
- 5K events/sec peak — manageable with Pub/Sub → Dataflow (streaming)
- Batch aggregation for daily reports: 130 GB — small enough for a single Dataflow job or BigQuery scheduled query
Step 4 — Serving:
- Advertiser dashboard QPS: ~10K advertisers × 10 queries/hour in peak = ~28 QPS
- This is well within BigQuery’s capability without any caching
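The whole worked example condenses to a few lines (inputs are the stated assumptions; the text's figures are these values rounded):

```python
# Google Ads click-analytics estimate, end to end (assumed inputs).
searches_per_day = 8.5e9
ad_rate, ctr = 0.30, 0.05
clicks_per_day = searches_per_day * ad_rate * ctr   # ~130M clicks/day
clicks_per_sec = clicks_per_day / 100_000           # ~1,300/sec average
raw_gb_per_day = clicks_per_day * 1_000 / 1e9       # ~130 GB/day at ~1 KB/click
compressed_tb = raw_gb_per_day / 5 * 1_095 / 1_000  # ~27 TB over 3 years
print(f"{clicks_per_day / 1e6:.0f}M clicks/day, ~{clicks_per_sec:.0f}/sec, "
      f"{raw_gb_per_day:.0f} GB/day raw, ~{compressed_tb:.0f} TB retained")
```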
Architecture implication: At 130 GB/day and ~28 QPS serving, this doesn’t need a complex streaming architecture for the analytics side. A near-real-time micro-batch (every 1–5 min) into BigQuery with partitioning by date and clustering on campaign_id would be sufficient and cost-effective. The streaming component is needed only for the real-time spend monitoring API that advertisers check frequently.
This is the kind of nuanced, estimation-driven reasoning Google interviewers look for — not defaulting to the most complex architecture, but letting the numbers guide you to the right level of complexity.
Common estimation mistakes to avoid
- Premature sharding: Don’t propose database sharding when your estimate shows 50 GB of data. A single Postgres handles 64 TiB.
- Ignoring compression: Raw JSON at 1 KB/event becomes ~200 bytes in Parquet. A 5:1 ratio changes your storage estimate dramatically.
- Forgetting peak vs. average: Peak traffic is typically 3–5x average. Design for peak, budget for average.
- Neglecting derived data: Your raw data is just the start. Transformations, aggregates, intermediate tables, and indexes can 2–3x total storage.
- Using outdated numbers: A single Kafka broker handles 1M msg/sec today, not 10K. A Redis node does 100K ops/sec. Don’t over-provision based on 2015 capabilities.
Interview questions
Q1: “We need to build a pipeline for processing 1 billion user events per day. How would you estimate the infrastructure requirements?”
Model answer: “1 billion events/day is about 12K events/sec average, maybe 40–50K peak. If each event is ~500 bytes, that’s 500 GB/day raw, compressing to ~100 GB/day in Parquet. For ingestion, I’d need Kafka with maybe 5–10 brokers and 50–100 partitions to handle peak throughput comfortably. For processing, a Spark or Flink cluster with ~20 workers could handle this — 500 GB / 128 MB per partition gives about 4,000 partitions, at 320 parallel tasks (20 nodes × 16 cores), that’s about 12 waves at 10 sec each — roughly 2 minutes per batch run. Annually, compressed storage is ~36 TB, well within a single warehouse cluster. The key takeaway: this is moderate scale. It needs distributed processing but doesn’t require exotic architecture. Kafka → Flink for streaming, Spark/dbt for batch, warehouse for serving.”
Q2: “How would you estimate differently for Meta vs. a startup with 100K users?”
Model answer: “At a startup with 100K users generating maybe 50 events/user/day, that’s 5M events/day — about 60 events/sec. At 500 bytes each, that’s 2.5 GB/day raw, 500 MB compressed. Annually it’s ~180 GB. This entire pipeline can run on a single node — DuckDB or Polars for processing, Postgres for serving, Airflow + dbt for orchestration. No Kafka, no Spark, no warehouse needed. At Meta scale, 3B users × 200 events = 600B events/day, roughly 7M events/sec. That’s a different universe: multi-cluster Kafka, large Flink deployments, petabyte-scale lakehouse, multiple serving tiers. The architecture fundamentally changes — and this is why estimation comes before architecture, not after.”
Think about this
You’re interviewing at OpenAI. The prompt: “Design the data pipeline for processing user conversations with ChatGPT for quality analysis.”
Estimate:
- ChatGPT has ~200M weekly active users. If each averages 5 conversations/week with 10 messages each, how many messages/day is that?
- If each message (prompt + response) averages 2 KB, what’s the daily and annual storage?
- If you need to run embedding generation on each conversation (one embedding per conversation, 1536 dimensions × 4 bytes = ~6 KB), how much additional storage and compute does that add?
- Based on these numbers, what processing framework would you choose and why?
The insight: at ~1.4B messages/day (~16K/sec), the text pipeline is moderate scale. But the embedding generation is the compute bottleneck — generating embeddings for ~143M conversations/day using a model takes significant GPU time. This is where your estimation shifts from traditional data engineering to AI-infrastructure thinking, which is exactly what OpenAI would probe.
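One way to run the exercise arithmetic, using exactly the stated assumptions (all inputs are hypothetical figures from the prompt):

```python
# Exercise arithmetic with the stated assumptions (hypothetical figures).
wau = 200_000_000                       # weekly active users
conv_per_user_week, msgs_per_conv = 5, 10
msg_bytes, emb_bytes = 2_000, 6_000     # ~2 KB/message, ~6 KB/embedding

conv_per_day = wau * conv_per_user_week / 7     # ~143M conversations/day
msgs_per_day = conv_per_day * msgs_per_conv     # ~1.4B messages/day
msgs_per_sec = msgs_per_day / 86_400            # ~16.5K messages/sec
text_tb_day = msgs_per_day * msg_bytes / 1e12   # ~2.9 TB text/day
emb_tb_day = conv_per_day * emb_bytes / 1e12    # ~0.86 TB embeddings/day
print(f"{msgs_per_day / 1e9:.1f}B msgs/day, {msgs_per_sec / 1e3:.1f}K/sec, "
      f"{text_tb_day:.1f} TB text/day, {emb_tb_day:.2f} TB embeddings/day")
```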
Quick reference
- 1 KB × 1 billion = 1 TB — the one formula that handles most estimations
- 1M requests/day ≈ 12 req/sec — quick throughput conversion
- Always state assumptions out loud — interviewers want to see your reasoning, not just the final number
- Numbers must drive architecture — if your estimate says 100 GB/day, don’t propose a 50-node Spark cluster. Let the math justify the complexity.
- Use modern hardware numbers — single Kafka broker = 1M msg/sec, single Postgres = 64 TiB, single Redis = 100K ops/sec. Don’t over-engineer.
Tomorrow’s preview
Day 4: Dimensional modeling fundamentals — Star schema vs. snowflake, fact table types (transaction, periodic snapshot, accumulating snapshot), dimension tables, grain definition, and how to present data models in interviews for your target companies.