Day 55 — Cost Optimization for Data Platforms

Phase 2: Deep Dives | Category: Cost & Performance

Why Cost Optimization Is a Senior Interview Signal

Junior engineers build things that work. Senior engineers build things that work efficiently. At Google, Netflix, Meta, and Amazon — where data infrastructure bills can hit millions per month — cost optimization is first-class engineering.

Interview expectation: quantify the inefficiency, identify root cause, fix at the architectural level (not just “add more compute”).

The Three Cost Drivers: Compute, Storage, Network

Typical breakdown:

Compute:  50–70% (Spark clusters, warehouse slots, streaming processors)
Storage:  20–35% (S3/GCS, warehouse storage, lakehouse files)
Network:   5–15% (cross-AZ, cross-region, internet egress)
Other:     5–10% (managed services, catalog, monitoring)

Optimization priority:

Compute: highest impact (largest spend, most addressable)
Storage: medium (grows continuously; compress/tier)
Network: lower but often overlooked; can be huge at PB scale

Compute Cost Optimization

1. Spot / Preemptible Instances: Highest-Impact Change

Strategy:

Master/Core nodes (cluster stability):  On-Demand or Reserved
Task nodes (parallelism, elastic):       Spot — up to ~90% cheaper

Why task nodes are safe on Spot:
  - Spark retries failed tasks automatically
  - S3 data isn't lost when a task node is interrupted
  - Only shuffle data is lost → Spark reruns from last shuffle stage
  - Managed scaling can handle Spot interruptions

EMR spot configuration (instance fleet mix):

instance_fleet_config = {
    "InstanceFleetType": "TASK",
    "TargetSpotCapacity": 20,
    "TargetOnDemandCapacity": 0,
    "InstanceTypeConfigs": [
        {"InstanceType": "r5.4xlarge", "WeightedCapacity": 1},
        {"InstanceType": "r5d.4xlarge", "WeightedCapacity": 1},
        {"InstanceType": "r5a.4xlarge", "WeightedCapacity": 1},
        {"InstanceType": "r4.4xlarge", "WeightedCapacity": 1},
    ],
    "LaunchSpecifications": {
        "SpotSpecification": {
            "TimeoutDurationMinutes": 20,
            "TimeoutAction": "SWITCH_TO_ON_DEMAND",
        }
    },
}

Spot-safe workloads:

Batch ETL (checkpoint/retry)
Training data generation
Historical backfills
Non-SLA-critical analytics jobs

Keep On-Demand:

Real-time streaming (Flink, Kafka Streams)
Interactive BI queries (interruptions hurt UX)
Tight SLAs (< 30 minutes)
HDFS core nodes (data risk)

2. Auto-Termination: Kill Idle Clusters

Most common waste: clusters left running while idle.

# EMR: auto-terminate after 1 hour of idle
emr_cluster = EMRCluster(
    auto_termination_policy={"IdleTimeout": 3600}
)

# Databricks: auto-terminate after 30 min of idle
cluster_config = {
    "autotermination_minutes": 30,
    "enable_elastic_disk": True,
}

Snowflake auto-suspend:

CREATE WAREHOUSE analytics_wh
    WAREHOUSE_SIZE = MEDIUM
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE;

3. Right-Sizing: Don’t Over-Provision

Right-sizing loop:

Run job, capture CPU/memory utilization
If peak CPU < 60%, reduce nodes ~30%
If peak memory < 60%, consider smaller instance type
If runtime > 2× SLA, increase size or fix skew/IO bottlenecks

Instance family heuristics:

Spark ETL: memory-bound → r* families
ML training: compute-bound → c* families
Streaming: balanced → m* families

4. Warehouse Slot / Virtual Warehouse Optimization

BigQuery pricing options:

On-demand: $6.25/TiB scanned
  Good for: variable/ad-hoc workloads, smaller spend
  Risk: one runaway query can cost hundreds+

Flat-rate (reserved slots): fixed cost
  Good for: predictable batch, higher query spend
  Break-even: ~ $10–15K/month (order of magnitude)

Snowflake virtual warehouse sizing:

-- Credits/hour by size (illustrative):
-- XS: 1, S: 2, M: 4, L: 8, XL: 16, 2XL: 32

CREATE WAREHOUSE analyst_wh
    WAREHOUSE_SIZE = SMALL
    AUTO_SUSPEND = 60;

CREATE WAREHOUSE etl_wh
    WAREHOUSE_SIZE = LARGE
    AUTO_SUSPEND = 30;

CREATE WAREHOUSE concurrent_wh
    WAREHOUSE_SIZE = MEDIUM
    MAX_CLUSTER_COUNT = 3
    MIN_CLUSTER_COUNT = 1
    SCALING_POLICY = ECONOMY;

Storage Cost Optimization

1. Tiered Storage: Biggest Storage Win

S3 tiers (example):

Standard:              $0.023/GB-month
Infrequent Access:     $0.0125/GB-month
Glacier Instant:       $0.004/GB-month
Glacier Deep Archive:  $0.00099/GB-month

Typical access pattern:
  Last 30 days: queried daily → Standard
  31–365 days: queried monthly → IA
  1–3 years: queried quarterly → Glacier Instant
  3+ years: compliance → Deep Archive

Lifecycle policy example:

s3_lifecycle = aws_s3_bucket_lifecycle_configuration(
    "data-lifecycle",
    rules=[
        {
            "id": "tier-by-age",
            "status": "Enabled",
            "transitions": [
                {"days": 30, "storage_class": "STANDARD_IA"},
                {"days": 90, "storage_class": "GLACIER_IR"},
                {"days": 365, "storage_class": "DEEP_ARCHIVE"},
            ],
            "expiration": {"days": 2557},  # ~7 years
        }
    ],
)

Savings example (100 TB):

All Standard: 100 TB × $0.023 × 12 = $27,600/year

Tiered (30% hot, 30% IA, 40% Glacier):
  30 TB × $0.023 + 30 TB × $0.0125 + 40 TB × $0.004 = $9,600/year

Savings: $18,000/year (~65%)

2. Compaction and Small File Cleanup

Small files waste money and slow queries (too many tasks, too much metadata).

Iceberg compaction:

spark.sql("""
    CALL system.rewrite_data_files(
        table => 'silver.user_events',
        options => map(
            'target-file-size-bytes', '134217728',
            'min-file-size-bytes',    '67108864',
            'max-file-size-bytes',    '536870912'
        )
    )
""")

Delta Lake:

spark.sql("OPTIMIZE silver.user_events ZORDER BY (user_id, event_date)")

Rule of thumb: streaming writes every 30 seconds produce 2,880 files/day/table. Daily compaction can reduce file counts by ~100× and cut object-store API costs dramatically.

3. Format and Compression: One-Time Win

Converting JSON/CSV → Parquet+Zstd often yields ~6–8× storage reduction and big query savings (scan less data).

Query Cost Optimization (BigQuery Deep Dive)

BigQuery charges per data scanned. Optimize in this order:

-- 1) PARTITION PRUNING
-- BAD: scans all partitions
SELECT user_id, event_type, COUNT(*)
FROM gold.fact_events
GROUP BY 1, 2;

-- GOOD: filter on partition key
SELECT user_id, event_type, COUNT(*)
FROM gold.fact_events
WHERE event_date = '2026-04-13'
GROUP BY 1, 2;

-- 2) COLUMN SELECTION
-- BAD
SELECT * FROM gold.fact_events WHERE event_date = '2026-04-13';

-- GOOD
SELECT user_id, event_type
FROM gold.fact_events
WHERE event_date = '2026-04-13';

-- 3) APPROX AGGREGATION (HLL)
SELECT
  DATE(event_timestamp) AS date,
  APPROX_COUNT_DISTINCT(user_id) AS dau_approx
FROM gold.fact_events
GROUP BY 1;

-- 4) MATERIALIZED VIEWS
CREATE MATERIALIZED VIEW gold.mv_daily_dau
OPTIONS (refresh_interval_minutes = 60)
AS
SELECT DATE(event_timestamp) AS date, COUNT(DISTINCT user_id) AS dau
FROM gold.fact_events
GROUP BY 1;

Enforce guardrails: require_partition_filter = TRUE.

CREATE TABLE gold.fact_events
PARTITION BY event_date
OPTIONS (require_partition_filter = TRUE);

Network Cost Optimization

Often overlooked but can be significant at PB scale.

Common expensive transfers:

Transfer type	Example cost driver	Mitigation
S3 → EC2 via internet/NAT	NAT + egress $/GB	Use VPC endpoints (AWS backbone)
Cross-AZ transfer	replication/shuffles across AZs	Minimize cross-AZ; co-locate compute+storage
Cross-region transfer	multi-region copies/reads	Keep processing in region; avoid unnecessary replication
Internet egress	exporting data externally	Minimize; use CDNs where appropriate

VPC endpoints example:

# Without endpoint: EMR → NAT → internet → S3 (paid)
# With gateway endpoint: EMR → VPC endpoint → S3 (on AWS backbone)

The Cost Optimization Monitoring Stack

You can’t optimize what you can’t see.

Budget alerts

aws_budgets.Budget(
    "team-budget",
    budget_type="COST",
    limit_amount="5000",
    limit_unit="USD",
    time_unit="MONTHLY",
    cost_filters={"TagKeyValue": ["user:team$marketing"]},
    notifications=[
        {
            "notification_type": "ACTUAL",
            "comparison_operator": "GREATER_THAN",
            "threshold": 80,
            "subscribers": [{"type": "EMAIL", "address": "data-platform@company.com"}],
        }
    ],
)

Cost anomaly detection

SELECT
    team,
    today_cost,
    avg_7day_cost,
    today_cost / avg_7day_cost AS cost_ratio,
    today_cost - avg_7day_cost AS cost_delta
FROM (
    SELECT
        labels.team AS team,
        SUM(IF(usage_date = CURRENT_DATE - 1, cost, 0)) AS today_cost,
        AVG(IF(usage_date >= CURRENT_DATE - 8, cost, NULL)) AS avg_7day_cost
    FROM billing_export
    GROUP BY labels.team
)
WHERE cost_ratio > 1.5
ORDER BY cost_delta DESC;

Per-query cost alerting (BigQuery)

SELECT
    user_email,
    query,
    total_bytes_billed / 1e12 * 6.25 AS cost_usd,
    creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE DATE(creation_time) = CURRENT_DATE - 1
  AND total_bytes_billed / 1e12 * 6.25 > 50
ORDER BY cost_usd DESC;

The Cost–Performance Trade-off Framework

For each optimization:

1) Cost savings: $X/month
2) Performance impact: latency/SLA risk
3) Implementation cost: engineer-days
4) Risk: availability, data loss, correctness

Examples:

Spot for batch ETL:
  Savings: large (often 40–80% of compute)
  Performance: occasional retries
  Risk: not suitable for real-time SLAs

Enable require_partition_filter:
  Savings: eliminates accidental full scans
  Impact: some queries fail until fixed
  Risk: temporary dashboard breakage; coordinate rollout

Interview Questions

Q1: Bill grew from $800K/month to $2M/month without a new launch. What do you do?

Model answer:

Cost attribution: breakdown by service (compute/storage/network) and by team (tagging).
If compute: check utilization, Spot ratio, auto-termination, and over-provisioned clusters.
If storage: check growth vs business growth, retention/tiering, duplicates, and file format/compression.
If query spend: identify top expensive queries; partition filter failures are usually dominant; enforce guardrails and alerts.
Target outcome: 40–60% reduction via Spot + auto-terminate + right-size + partition guardrails + tiering.

Q2: BigQuery training query takes 45 minutes and costs $200; run 10×/day. How do you fix?