Phase 2: Deep Dives | Category: Cost & Performance
Why Cost Optimization Is a Senior Interview Signal
Junior engineers build things that work. Senior engineers build things that work efficiently. At Google, Netflix, Meta, and Amazon — where data infrastructure bills can hit millions per month — cost optimization is first-class engineering.
Interview expectation: quantify the inefficiency, identify root cause, fix at the architectural level (not just “add more compute”).
The Three Cost Drivers: Compute, Storage, Network
Typical breakdown:
Compute: 50–70% (Spark clusters, warehouse slots, streaming processors)
Storage: 20–35% (S3/GCS, warehouse storage, lakehouse files)
Network: 5–15% (cross-AZ, cross-region, internet egress)
Other: 5–10% (managed services, catalog, monitoring)
Optimization priority:
- Compute: highest impact (largest spend, most addressable)
- Storage: medium (grows continuously; compress/tier)
- Network: lower but often overlooked; can be huge at PB scale
Compute Cost Optimization
1. Spot / Preemptible Instances: Highest-Impact Change
Strategy:
Master/Core nodes (cluster stability): On-Demand or Reserved
Task nodes (parallelism, elastic): Spot — up to ~90% cheaper
Why task nodes are safe on Spot:
- Spark retries failed tasks automatically
- S3 data isn't lost when a task node is interrupted
- Only shuffle data is lost → Spark reruns from last shuffle stage
- Managed scaling can handle Spot interruptions
EMR spot configuration (instance fleet mix):
instance_fleet_config = {
"InstanceFleetType": "TASK",
"TargetSpotCapacity": 20,
"TargetOnDemandCapacity": 0,
"InstanceTypeConfigs": [
{"InstanceType": "r5.4xlarge", "WeightedCapacity": 1},
{"InstanceType": "r5d.4xlarge", "WeightedCapacity": 1},
{"InstanceType": "r5a.4xlarge", "WeightedCapacity": 1},
{"InstanceType": "r4.4xlarge", "WeightedCapacity": 1},
],
"LaunchSpecifications": {
"SpotSpecification": {
"TimeoutDurationMinutes": 20,
"TimeoutAction": "SWITCH_TO_ON_DEMAND",
}
},
}
Spot-safe workloads:
- Batch ETL (checkpoint/retry)
- Training data generation
- Historical backfills
- Non-SLA-critical analytics jobs
Keep On-Demand:
- Real-time streaming (Flink, Kafka Streams)
- Interactive BI queries (interruptions hurt UX)
- Tight SLAs (< 30 minutes)
- HDFS core nodes (data risk)
2. Auto-Termination: Kill Idle Clusters
Most common waste: clusters left running while idle.
# EMR: auto-terminate after 1 hour of idle
emr_cluster = EMRCluster(
auto_termination_policy={"IdleTimeout": 3600}
)
# Databricks: auto-terminate after 30 min of idle
cluster_config = {
"autotermination_minutes": 30,
"enable_elastic_disk": True,
}
Snowflake auto-suspend:
CREATE WAREHOUSE analytics_wh
WAREHOUSE_SIZE = MEDIUM
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE;
3. Right-Sizing: Don’t Over-Provision
Right-sizing loop:
- Run job, capture CPU/memory utilization
- If peak CPU < 60%, reduce nodes ~30%
- If peak memory < 60%, consider smaller instance type
- If runtime > 2× SLA, increase size or fix skew/IO bottlenecks
Instance family heuristics:
- Spark ETL: memory-bound →
r*families - ML training: compute-bound →
c*families - Streaming: balanced →
m*families
4. Warehouse Slot / Virtual Warehouse Optimization
BigQuery pricing options:
On-demand: $6.25/TiB scanned
Good for: variable/ad-hoc workloads, smaller spend
Risk: one runaway query can cost hundreds+
Flat-rate (reserved slots): fixed cost
Good for: predictable batch, higher query spend
Break-even: ~ $10–15K/month (order of magnitude)
Snowflake virtual warehouse sizing:
-- Credits/hour by size (illustrative):
-- XS: 1, S: 2, M: 4, L: 8, XL: 16, 2XL: 32
CREATE WAREHOUSE analyst_wh
WAREHOUSE_SIZE = SMALL
AUTO_SUSPEND = 60;
CREATE WAREHOUSE etl_wh
WAREHOUSE_SIZE = LARGE
AUTO_SUSPEND = 30;
CREATE WAREHOUSE concurrent_wh
WAREHOUSE_SIZE = MEDIUM
MAX_CLUSTER_COUNT = 3
MIN_CLUSTER_COUNT = 1
SCALING_POLICY = ECONOMY;
Storage Cost Optimization
1. Tiered Storage: Biggest Storage Win
S3 tiers (example):
Standard: $0.023/GB-month
Infrequent Access: $0.0125/GB-month
Glacier Instant: $0.004/GB-month
Glacier Deep Archive: $0.00099/GB-month
Typical access pattern:
Last 30 days: queried daily → Standard
31–365 days: queried monthly → IA
1–3 years: queried quarterly → Glacier Instant
3+ years: compliance → Deep Archive
Lifecycle policy example:
s3_lifecycle = aws_s3_bucket_lifecycle_configuration(
"data-lifecycle",
rules=[
{
"id": "tier-by-age",
"status": "Enabled",
"transitions": [
{"days": 30, "storage_class": "STANDARD_IA"},
{"days": 90, "storage_class": "GLACIER_IR"},
{"days": 365, "storage_class": "DEEP_ARCHIVE"},
],
"expiration": {"days": 2557}, # ~7 years
}
],
)
Savings example (100 TB):
All Standard: 100 TB × $0.023 × 12 = $27,600/year
Tiered (30% hot, 30% IA, 40% Glacier):
30 TB × $0.023 + 30 TB × $0.0125 + 40 TB × $0.004 = $9,600/year
Savings: $18,000/year (~65%)
2. Compaction and Small File Cleanup
Small files waste money and slow queries (too many tasks, too much metadata).
Iceberg compaction:
spark.sql("""
CALL system.rewrite_data_files(
table => 'silver.user_events',
options => map(
'target-file-size-bytes', '134217728',
'min-file-size-bytes', '67108864',
'max-file-size-bytes', '536870912'
)
)
""")
Delta Lake:
spark.sql("OPTIMIZE silver.user_events ZORDER BY (user_id, event_date)")
Rule of thumb: streaming writes every 30 seconds produce 2,880 files/day/table. Daily compaction can reduce file counts by ~100× and cut object-store API costs dramatically.
3. Format and Compression: One-Time Win
Converting JSON/CSV → Parquet+Zstd often yields ~6–8× storage reduction and big query savings (scan less data).
Query Cost Optimization (BigQuery Deep Dive)
BigQuery charges per data scanned. Optimize in this order:
-- 1) PARTITION PRUNING
-- BAD: scans all partitions
SELECT user_id, event_type, COUNT(*)
FROM gold.fact_events
GROUP BY 1, 2;
-- GOOD: filter on partition key
SELECT user_id, event_type, COUNT(*)
FROM gold.fact_events
WHERE event_date = '2026-04-13'
GROUP BY 1, 2;
-- 2) COLUMN SELECTION
-- BAD
SELECT * FROM gold.fact_events WHERE event_date = '2026-04-13';
-- GOOD
SELECT user_id, event_type
FROM gold.fact_events
WHERE event_date = '2026-04-13';
-- 3) APPROX AGGREGATION (HLL)
SELECT
DATE(event_timestamp) AS date,
APPROX_COUNT_DISTINCT(user_id) AS dau_approx
FROM gold.fact_events
GROUP BY 1;
-- 4) MATERIALIZED VIEWS
CREATE MATERIALIZED VIEW gold.mv_daily_dau
OPTIONS (refresh_interval_minutes = 60)
AS
SELECT DATE(event_timestamp) AS date, COUNT(DISTINCT user_id) AS dau
FROM gold.fact_events
GROUP BY 1;
Enforce guardrails: require_partition_filter = TRUE.
CREATE TABLE gold.fact_events
PARTITION BY event_date
OPTIONS (require_partition_filter = TRUE);
Network Cost Optimization
Often overlooked but can be significant at PB scale.
Common expensive transfers:
| Transfer type | Example cost driver | Mitigation |
|---|---|---|
| S3 → EC2 via internet/NAT | NAT + egress $/GB | Use VPC endpoints (AWS backbone) |
| Cross-AZ transfer | replication/shuffles across AZs | Minimize cross-AZ; co-locate compute+storage |
| Cross-region transfer | multi-region copies/reads | Keep processing in region; avoid unnecessary replication |
| Internet egress | exporting data externally | Minimize; use CDNs where appropriate |
VPC endpoints example:
# Without endpoint: EMR → NAT → internet → S3 (paid)
# With gateway endpoint: EMR → VPC endpoint → S3 (on AWS backbone)
The Cost Optimization Monitoring Stack
You can’t optimize what you can’t see.
Budget alerts
aws_budgets.Budget(
"team-budget",
budget_type="COST",
limit_amount="5000",
limit_unit="USD",
time_unit="MONTHLY",
cost_filters={"TagKeyValue": ["user:team$marketing"]},
notifications=[
{
"notification_type": "ACTUAL",
"comparison_operator": "GREATER_THAN",
"threshold": 80,
"subscribers": [{"type": "EMAIL", "address": "data-platform@company.com"}],
}
],
)
Cost anomaly detection
SELECT
team,
today_cost,
avg_7day_cost,
today_cost / avg_7day_cost AS cost_ratio,
today_cost - avg_7day_cost AS cost_delta
FROM (
SELECT
labels.team AS team,
SUM(IF(usage_date = CURRENT_DATE - 1, cost, 0)) AS today_cost,
AVG(IF(usage_date >= CURRENT_DATE - 8, cost, NULL)) AS avg_7day_cost
FROM billing_export
GROUP BY labels.team
)
WHERE cost_ratio > 1.5
ORDER BY cost_delta DESC;
Per-query cost alerting (BigQuery)
SELECT
user_email,
query,
total_bytes_billed / 1e12 * 6.25 AS cost_usd,
creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE DATE(creation_time) = CURRENT_DATE - 1
AND total_bytes_billed / 1e12 * 6.25 > 50
ORDER BY cost_usd DESC;
The Cost–Performance Trade-off Framework
For each optimization:
1) Cost savings: $X/month
2) Performance impact: latency/SLA risk
3) Implementation cost: engineer-days
4) Risk: availability, data loss, correctness
Examples:
Spot for batch ETL:
Savings: large (often 40–80% of compute)
Performance: occasional retries
Risk: not suitable for real-time SLAs
Enable require_partition_filter:
Savings: eliminates accidental full scans
Impact: some queries fail until fixed
Risk: temporary dashboard breakage; coordinate rollout
Interview Questions
Q1: Bill grew from $800K/month to $2M/month without a new launch. What do you do?
Model answer:
- Cost attribution: breakdown by service (compute/storage/network) and by team (tagging).
- If compute: check utilization, Spot ratio, auto-termination, and over-provisioned clusters.
- If storage: check growth vs business growth, retention/tiering, duplicates, and file format/compression.
- If query spend: identify top expensive queries; partition filter failures are usually dominant; enforce guardrails and alerts.
- Target outcome: 40–60% reduction via Spot + auto-terminate + right-size + partition guardrails + tiering.
Q2: BigQuery training query takes 45 minutes and costs $200; run 10×/day. How do you fix?
Model answer:
- Diagnose: read query plan + bytes billed. $200/query at $6.25/TB implies ~32 TB scanned.
- Biggest lever: partition filter and column selection (reduce scan from TBs to GBs).
- Next: materialized views / precomputed feature tables.
- If it truly needs multi-decade history: move training data prep to Spark on Spot, then load curated features back.
Think About This
Google prompt: data platform costs $50M/year. Reduce by 30% without degrading SLAs.
Walk through:
- Estimate breakdown (compute ~60%, storage ~30%, network ~10%).
- First lever: compute (Spot/preemptible on batch + reservations for predictable warehouse spend).
- Second: storage tiering + compression/format standardization + compaction.
- Third: network (endpoints, region locality, avoid NAT/egress for internal transfers).
- Risk management: start with nightly batch; measure reliability; expand while tracking savings weekly.
Quick Reference
- Compute is biggest lever (Spot + auto-terminate + right-size).
- Storage wins: tiering + compaction + Parquet+Zstd.
- Query wins: partition filters > column pruning > approx aggregation > materialized views.
- Guardrails: enforce
require_partition_filter = TRUEfor large fact tables. - Network wins: avoid NAT/egress for internal object-store access; keep processing in-region.
Tomorrow’s Preview
Day 56: Performance Tuning Data Pipelines — Spark tuning, SQL execution plans, bottleneck identification, and systematic production debugging.