Phase 1: Foundations & Frameworks | Category: Storage Systems
Why This Is Core to Every DE Interview
The data warehouse is where your pipelines deliver value. Whether you’re building ETL at Meta or ELT at Google, the destination is an analytical warehouse. At your target companies, interviewers expect you to reason about warehouse internals — not just “I’d use BigQuery” but WHY, how columnar storage affects your partition strategy, when to choose serverless vs provisioned, and how to optimize for cost at petabyte scale.
Foundational Concepts
Columnar Storage: Why Warehouses Are Fast
Traditional databases (PostgreSQL, MySQL) store data row-by-row. Warehouses store data column-by-column.
Row-oriented (OLTP): Column-oriented (OLAP): ┌──────┬───────┬────────┬───────┐
┌──────────────────────────┐
│ id │ name │ city │ spend │ │ id: 1, 2, 3, 4, 5... │
├──────┼───────┼────────┼───────┤ │ name: A, B, C, D, E... │ │ 1 │ Alice │ NYC │ 500 │ │ city: NYC, SF, NYC... │ │ 2 │ Bob │ SF │ 300 │ │ spend: 500, 300, 700... │ │ 3 │ Carol │ NYC │ 700 │
└──────────────────────────┘
└──────┴───────┴────────┴───────┘
Why columnar wins for analytics:
-
Column pruning: SELECT city, SUM(spend) FROM users GROUP BY city only reads the city and spend columns. In row-oriented storage, it would read all 4 columns.
-
Better compression: Columns contain homogeneous data types. City names compress far better when stored together than interleaved with IDs and numbers. Typical compression ratios: 5-10x.
-
Vectorized execution: Processing a batch of values from a single column fits CPU cache lines and enables SIMD instructions. Much faster than row-at-a-time.
-
Predicate pushdown: Filters can be applied at the storage layer before data reaches the query engine. Min/max metadata per column block enables skipping entire blocks.
The trade-off: Columnar storage is optimized for reads (scan many rows, few columns) and slow for point writes/updates (must rewrite entire column blocks). This is why warehouses are append-heavy and use MERGE for updates rather than row-level UPDATE.
Massively Parallel Processing (MPP)
All modern warehouses distribute query execution across multiple nodes:
Client SQL query
↓ Query Planner (leader/coordinator node)
↓ distributes work
┌─────────────┬─────────────┬─────────────┐
│ Compute │ Compute │ Compute │
│ Node 1 │ Node 2 │ Node 3 │
│ (scans its │ (scans its │ (scans its │
│ partition) │ partition) │ partition) │
└──────┬──────┴──────┬──────┴──────┬──────┘
└─────────────┼─────────────┘
↓ aggregate
Final Result
Each node processes its slice of data in parallel. The coordinator merges partial results. Scaling = adding more nodes = more parallelism. This is how warehouses handle petabyte-scale queries in seconds.
The Big Three: Architecture Comparison
| Topic | Details |
|---|---|
| Architecture | Serverless, fully decoupled SaaS, 3-layer (storage/compute/services) Cluster-based MPP (RA3 adds decoupling) |
| Compute model | Auto-scaling slots; no provisioning Independent virtual warehouses; you choose size Provisioned nodes (or Serverless option) |
| Storage | Colossus (Google’s distributed filesystem) Cloud object storage (S3/GCS/ADLS) micro-partitions Local SSD + Managed Storage (S3 for RA3) |
| Compute-Storage separation | Full Full Partial (RA3 nodes) or coupled (DC2) |
| Scaling | Instant, automatic Manual or auto-scale warehouse; add warehouses for concurrency Resize cluster or add concurrency scaling |
| Concurrency | High (managed by Google) Excellent (separate warehouses per workload) Can bottleneck; concurrency scaling helps |
| Pricing | Per-query (on-demand) or flat-rate slots Credits per compute-second + storage per TBPer-node-hour + storage |
| Ops overhead | Near zero Low Medium (sort keys, distribution keys, vacuum) |
| Ecosystem | GCP-native (Dataflow, Pub/Sub, Looker, Vertex AI) Multi-cloud (AWS, Azure, GCP) AWS-native (S3, Glue, Sage Maker, Quick Sight) |
BigQuery: Serverless, Zero-Ops
Architecture: Storage on Colossus, compute via Dremel engine. Query is decomposed into a tree of workers that process data in parallel. You never see or manage servers.
Key features for interviews:
-
Partitioning: By date/timestamp column, integer range, or ingestion time. Partition pruning is the #1 cost and performance optimization.
-
Clustering: Sorts data within partitions by up to 4 columns. Enables block-level pruning on clustered columns.
-
Slot-based execution: On-demand = pay per TB scanned ($5/TB). Flat-rate = reserve slots for predictable cost.
-
Streaming inserts: Real-time ingestion via streaming buffer, queryable immediately.
-
Cost trap: Every query scans entire columns unless you partition and cluster. An unpartitioned 10 TB table costs $50 per full scan.
When to choose: GCP ecosystem, ad-hoc analytics, minimal ops, variable workloads.
Snowflake: Virtual Warehouses, Multi-Cloud
Architecture: Three layers — (1) centralized storage on cloud object storage in compressed micro-partitions, (2) independent virtual warehouses (compute clusters) that scale separately, (3) cloud services layer for metadata, security, optimization.
Key features for interviews:
-
Virtual warehouses: Independent compute clusters. Finance team runs on warehouse-A, data science on warehouse-B. Zero contention.
-
Micro-partitions: Auto-created, ~50-500 MB each. Snowflake automatically organizes data with min/max pruning metadata. No manual partitioning needed (but clustering keys help).
-
Clustering keys: Hint to Snowflake about which columns to co-locate. Not partitions — Snowflake re-clusters data in the background.
-
Time travel: Query data as it was at any point in the last 1-90 days. Enables easy rollback from bad ETL runs.
-
Zero-copy cloning: Instant database/table copy using metadata pointers. No data duplication. Perfect for dev/staging environments.
When to choose: Multi-cloud, workload isolation via virtual warehouses, teams wanting low ops overhead, data sharing across organizations.
Redshift: AWS-Native MPP
Architecture: Leader node (query planning) + compute nodes (data storage and processing). RA3 nodes decouple storage to S3 Managed Storage. DC2 nodes keep data local on SSD.
Key features for interviews:
-
Distribution keys: How data is distributed across nodes. EVEN (round-robin), KEY (hash on a column), ALL (full copy on every node for small tables).
-
Sort keys: Physical sort order on disk. Compound (multi-column, first column dominates) or Interleaved (equal weight to all columns).
-
VACUUM: Required to reclaim space after deletes/updates and re-sort data. Unique to Redshift — BigQuery and Snowflake don’t need this.
-
Concurrency scaling: Auto-provisions additional clusters during peak demand.
-
Redshift Spectrum: Query data in S3 directly without loading into Redshift. Extends your warehouse to the data lake.
When to choose: AWS-native organizations, existing Redshift expertise, tight integration with AWS services.
Compute-Storage Separation: The Defining Architecture Shift
This is the single most important warehouse architecture concept for interviews:
Coupled (old model — Redshift DC2, Teradata):
Node 1: [Compute + Storage] Node 2: [Compute + Storage] Node 3: [Compute + Storage]
-
Storage scales with compute. Need more storage? Add more compute nodes (wasteful if CPU isn’t the bottleneck).
-
Data is local = fast reads. But scaling is inflexible.
Decoupled (modern — BigQuery, Snowflake, Redshift RA3):
Compute: [Warehouse A] [Warehouse B] [Warehouse C] ← scale independently ↕ ↕ ↕
Storage: [Shared cloud object storage (S3/GCS)] ← scale independently
-
Storage and compute scale independently. Pay for each separately.
-
Multiple compute clusters read from the same storage. No data duplication.
-
Compute can be paused (Snowflake auto-suspend) — pay $0 when idle.
-
Trade-off: Network hop between compute and storage adds latency vs. local SSD. Caching mitigates this.
What to say in interviews: “I’d choose a warehouse with decoupled compute and storage because our storage grows at 50 TB/year but our compute needs peak only during business hours. With a coupled architecture, I’d be paying for 24/7 compute just to hold the storage. With decoupled, I spin up compute during the 10-hour business window and auto-suspend overnight — cutting compute cost by ~60%.”
Cost Optimization: The Senior-Level Skill
This is where you prove you’ve operated warehouses in production, not just read about them:
| Topic | Details |
|---|---|
| Partition/prune | Partition by date + cluster by common filters. Use require_partition_filter. Choose clustering keys matching common WHERE clauses. Choose distribution key matching JOIN key; sort key matching WHERE clause. |
| Reduce scan size | SELECT only needed columns (columnar pruning). Avoid SELECT *. Same — columnar pruning is automatic. Same. Also: use late-binding views to defer materialization. |
| Control compute | Use flat-rate slots for predictable cost; on-demand for ad-hoc. Right-size warehouses. Auto-suspend after 1-5 min of inactivity. Multi-cluster only if concurrency demands it. Right-size nodes. Use Reserved Instances for steady workloads. Concurrency scaling for spikes. |
| Materialized views | Yes — auto-refresh on base table changes. Yes — background maintenance. Yes — manual or auto-refresh. |
| Storage tiering | Auto: active storage ($0.02/GB) → long-term ($0.01/GB) after 90 days. Auto: active → Fail-safe.S3 tiering for RA3 managed storage. |
| Query optimization | Check execution graph in Big Query UI. Look for slot contention, excessive shuffle. Check query profile. Look for spillage to disk, inefficient joins. Check EXPLAIN plan and STL_QUERY system tables. Look for distribution skew, disk-based joins. |
Warehouse in the Lakehouse Context
The modern pattern isn’t “warehouse vs lake” — it’s both:
Sources → Kafka/Fivetran → S3/GCS (Bronze, Iceberg/Delta)
↓ Spark/dbt (Silver → Gold)
↓
┌───────────┴───────────┐
▼ ▼
Data Warehouse Lakehouse Query Engine
(BigQuery/Snowflake/ (Trino, Spark SQL,
Redshift) Athena, Starburst)
Gold tables loaded Query Iceberg tables
for BI & self-serve directly on S3
When to load into a warehouse vs query in-place:
-
Load into warehouse: BI dashboards with sub-second response, concurrent analyst queries, governed data marts
-
Query in-place: Data science exploration, large-scale batch transforms, cost-sensitive ad-hoc queries on cold data
BigQuery, Snowflake, and Redshift all now support querying external data (BigQuery external tables, Snowflake external tables, Redshift Spectrum) — blurring the line between warehouse and lakehouse.
Interview Questions
Q1: “You’re building the analytics platform for a company processing 500 GB of new data daily, with 50 analysts running reports and dashboards. BigQuery, Snowflake, or Redshift?”
Model Answer: “I’d evaluate based on three factors. First, cloud ecosystem: if we’re on GCP, BigQuery is the natural choice with deep integration into Dataflow, Pub/Sub, and Looker. If AWS, Redshift. If multi-cloud or cloud-agnostic, Snowflake. Second, operations capacity: if the team is small and wants zero-ops, BigQuery’s serverless model wins — no clusters to manage, no vacuuming, no right-sizing. Third, concurrency pattern: if 50 analysts run queries simultaneously during business hours, Snowflake’s virtual warehouses provide the best workload isolation — finance gets their own warehouse, marketing gets theirs, no contention. At 500 GB/day (~180 TB/year), all three handle the volume easily. I’d partition by date, cluster on the most common filter dimensions, and set up materialized views for the top 10 dashboard queries. For cost control, I’d auto-suspend compute during off-hours and use storage tiering for data older than 90 days.”
Q2: “A query that used to take 10 seconds now takes 5 minutes after data volume tripled. How do you diagnose and fix this in BigQuery?”
Model Answer: “Step 1: Check the query execution graph in BigQuery UI. Look for stages with disproportionate data processing — that’s usually a full table scan on a non-partitioned or non-clustered column. Step 2: Verify partition pruning. If the query doesn’t filter on the partition column, it scans the entire table. Adding WHERE event_date = ‘2026-03-31’ to a date-partitioned table can reduce scan from 180 TB to 500 GB instantly. Step 3: Check for data skew in the execution plan — one slot processing far more data than others indicates an uneven partition or a skewed JOIN key. Step 4: Review clustering effectiveness — if the table was clustered on column A but queries now primarily filter on column B, re-clustering may help. Step 5: Consider materialized views for this specific query pattern if it runs frequently. At 3x data growth, the fix is almost always better pruning, not more compute. BigQuery scales compute automatically — if the query is slow, it’s because it’s scanning too much data, not because there aren’t enough slots.”
Think About This
You’re in a Netflix interview. The prompt: “Netflix’s analytics warehouse stores petabytes of viewing data. Multiple teams query it: content strategy (complex ad-hoc), product analytics (recurring dashboards), data science (large ML feature extraction). How would you architect the warehouse layer?”
Walk through:
-
Would you use one warehouse or multiple? (Snowflake-style: multiple virtual warehouses. Content strategy gets a large warehouse for complex queries. Dashboards get a dedicated warehouse sized for known workload. Data science gets an XL warehouse for feature extraction that auto-suspends when not in use.)
-
How would you organize the data? (Medallion: bronze on S3/Iceberg for raw viewing events, silver for cleaned/sessionized data, gold star schema in the warehouse for BI consumption. Data science reads from silver via Spark, not from gold — different access pattern.)
-
What’s the cost optimization strategy? (Auto-suspend idle warehouses. Partition gold tables by date. Materialized views for the top 20 dashboard queries. Long-term storage tier for data > 90 days. Charge-back per team based on warehouse usage.)
-
How do you handle the petabyte scale? (Compute-storage separation is mandatory. Data lives in object storage. Query engines scale horizontally. Partition pruning and clustering prevent full-table scans.)
Quick Reference
-
Columnar storage = read only the columns you need + better compression + vectorized execution. The foundation of all warehouse performance.
-
MPP = parallelize query across multiple nodes. More nodes = more parallelism = faster queries.
-
Compute-storage separation = scale each independently. Pay for compute only when running queries. The defining architecture of modern warehouses.
-
The Big Three: BigQuery (serverless, zero-ops, GCP), Snowflake (virtual warehouses, multi-cloud), Redshift (AWS-native, most tuning knobs).
-
Cost optimization top 3: (1) Partition + cluster to minimize scan size, (2) Auto-suspend idle compute, (3) Materialized views for hot queries.
-
Always state your warehouse choice with reasoning: “I chose Snowflake because we need workload isolation across 5 teams, multi-cloud flexibility, and minimal ops overhead.”
Tomorrow’s Preview
Day 17: Data Lake & Lakehouse Architecture — S3/GCS + Delta Lake, Iceberg, Hudi. ACID on object storage. Time travel, schema evolution. The lakehouse pattern that combines the best of warehouses and lakes — and why it’s the dominant modern architecture.