Day 18 — File formats and compression

Phase 1: Foundations & Frameworks | Category: Storage Systems

Why This Is a Practical Interview Topic

File format choice is one of the highest-leverage decisions in data engineering — it directly impacts query speed, storage cost, pipeline throughput, and schema flexibility. Interviewers don’t just ask “what is Parquet?” — they ask “why did you choose Parquet over Avro for this use case?” or “your pipeline is slow and expensive — what would you look at first?” As OneUptime puts it: “If you have ever wondered why analytical queries on Parquet files run 10-100x faster than the same queries on CSV or JSON, the answer is columnar storage.” At your level, the answer must go deeper than that.

Row-Oriented vs Columnar: The Mental Model

This is the foundation everything else builds on.

Row-oriented stores each record contiguously on disk:

Record 1: [user_id=101, name="Alice", city="NYC", spend=500.00, device="mobile"]  Record 2: [user_id=102, name="Bob",   city="SF",  spend=300.00, device="desktop"]  Record 3: [user_id=103, name="Carol", city="NYC", spend=700.00, device="mobile"]

Columnar stores each column contiguously:

user_id column:  [101, 102, 103, ...]  name column:     ["Alice", "Bob", "Carol", ...]  city column:     ["NYC", "SF", "NYC", ...]  spend column:    [500.00, 300.00, 700.00, ...]  device column:   ["mobile", "desktop", "mobile", ...]

Why columnar wins for analytics:

Benefit	How It Works
Column pruning	SELECT city, SUM(spend) — only reads 2 of 5 columns. 60% of I/O eliminated.
Better compression	Homogeneous data (all cities together) compresses far better than interleaved mixed types.
Vectorized execution	CPU processes 1024 values from the same column in one SIMD instruction batch.
Predicate pushdown	Row group statistics (min/max per column) let the engine skip entire blocks without reading them.
Dictionary encoding	”NYC” repeated 1M times stored as integer codes → massive compression for low-cardinality columns.

When row-oriented wins:

Full record access (read ALL columns for a row) — OLTP, user-facing APIs
Write-heavy workloads with frequent row updates
Streaming ingestion where you write one event at a time
Small files where the columnar overhead isn’t justified

The Formats

Parquet: The Default for Analytics

What it is: Columnar, self-describing binary format. The de facto standard for data lakes and warehouses. Created by Twitter and Cloudera.

Internal structure:

File
├── Row Group 1 (128MB default)  │
├── Column Chunk: user_id    [compressed pages]  │
├── Column Chunk: city       [dictionary encoded]  │
└── Column Chunk: spend      [delta encoded]
├── Row Group 2  │
└── ...
└── Footer (schema + row group statistics: min/max per column chunk)

Key features:

Row groups: The unit of parallelism. Spark reads one row group per task. Larger row groups = fewer tasks but more memory per task.
Column statistics: Min/max per column per row group stored in the footer. Engines read the footer first, then skip row groups where the predicate can’t possibly match.
Encoding: Before compression, Parquet applies efficient encodings:
- Dictionary encoding: Replace repeated values with integer codes (great for city, category, status)
- Run-Length Encoding (RLE): Compress runs of repeated values
- Delta encoding: Store differences between successive values (great for timestamps, sequential IDs)
Nested data: Supports complex types (arrays, maps, structs) via the Dremel encoding model

Compression algorithms (applied after encoding):

Topic	Details
Snappy	Moderate (~3-5x) Fastest Fast Default for Spark, interactive queries
Zstd	High (~5-8x) Fast Fast
2026 default choice	— best balance
Gzip	High (~5-8x) Slow Very slow Cold storage, maximum compression
LZ4	Moderate Fastest Fastest Streaming, low-latency read paths
Brotli	Highest Slow Slow Static web assets, archival

The 2026 recommendation: According to real benchmarks and OneUptime: “Zstd has become the default choice for most use cases — compression ratios close to Gzip with speeds close to Snappy.” Snappy for streaming/interactive, Zstd for batch/archival, Gzip only for maximum cold storage compression where read speed doesn’t matter.

Ecosystem: Spark, Flink, BigQuery, Snowflake, Redshift, Athena, Trino, Presto, Pandas, DuckDB, Arrow. Universally supported.

Use when: Analytics gold/silver layer, data lake storage, Parquet is your default for 90% of pipeline outputs.

ORC: Optimized for Hive

What it is: Columnar format, also row-group based. Created by Hortonworks for the Hive/Hadoop ecosystem. Very similar to Parquet but with different design choices.

ORC vs Parquet:

Topic	Details
Ecosystem	Hive, Spark (good), Presto Virtually everywhere
Compression	Slightly better in some benchmarks Competitive
Bloom filters	Built-in, per column Optional, plugin
ACID support	Built into Hive ACIDVia table formats (Delta/Iceberg)
Nested types	More limited Excellent (Dremel model)
Adoption	Declining (Hive declining) Growing — the clear winner

When to choose ORC: Only if you’re deeply invested in the Hive ecosystem. For any modern data stack, Parquet wins on ecosystem breadth. At your target companies, the answer is Parquet.

Avro: The Streaming/Serialization Format

What it is: Row-oriented, binary serialization format with a rich schema evolution model. JSON-defined schema embedded in or registered separately. Created by the Hadoop project.

Internal structure:

File  ├── Header (schema JSON, codec, sync marker)  ├── Block 1 (compressed binary records, row-by-row)  ├── Block 2  └── ...

Key features:

Schema-first: Schema is defined before data and embedded in or referenced by every file
Schema evolution: Add fields (with defaults), remove fields, rename fields — with strong compatibility guarantees
- Backward compatible: New schema can read old data
- Forward compatible: Old schema can read new data
- Full compatible: Both directions work
Schema Registry integration: Confluent Schema Registry stores Avro schemas centrally. Kafka messages contain only a schema ID + binary payload — extremely compact for streaming
Compact binary encoding: No column overhead. For row-by-row streaming, Avro is more efficient than Parquet (which has row group overhead)
Language agnostic: Auto-generated code for Java, Python, Go, C++

When to use Avro:

Kafka messages: Schema Registry + Avro is the standard pattern for typed Kafka topics
Inter-service serialization: gRPC uses Protobuf, but Avro is common in the Hadoop/Kafka ecosystem
Schema evolution critical: When producers and consumers evolve independently and you need compatibility guarantees
Write-heavy, read-partial: When you always read the full record (not a subset of columns)

When NOT to use Avro:

Analytics / warehouse storage → use Parquet (columnar pruning, compression)
Long-term analytical storage → Parquet’s column stats and encoding are far more efficient

JSON / JSON Lines: The Interchange Format

What it is: Text-based, self-describing, human-readable. The lingua franca of APIs and web data.

JSON Lines (JSONL / NDJSON): One JSON object per line — key distinction for data engineering. Enables line-by-line streaming without loading the whole file.

Performance reality:

Same dataset, different formats:  JSON Lines:        150.8 MB  (keys repeated for every record)  CSV (gzip):         25.0 MB  Parquet (zstd):      6.3 MB  (24x smaller than JSON, fastest reads)

When to accept JSON (not use as primary storage):

API responses — you receive JSON, immediately convert to Parquet for storage
Bronze layer when source provides only JSON — land as-is, transform to Parquet in silver
Config files, metadata, small reference data
Debugging and human inspection

Anti-pattern: Storing terabytes of analytical data as JSON. Every query scans everything (no column pruning, no predicate pushdown). Storage cost is 10-25x higher than Parquet.

Protobuf and Arrow: Honorable Mentions

Protobuf: Google’s binary serialization (gRPC-native). More compact than Avro, schema defined in .proto files. Used in Google systems and increasingly at OpenAI/Anthropic for internal APIs.

Apache Arrow: In-memory columnar format. NOT a file format for storage — a format for in-memory computation and zero-copy data exchange between systems (Spark → Pandas → DuckDB). When you use PyArrow or pandas with Parquet, Arrow is the in-memory representation.

File Format by Pipeline Layer

Topic	Details
Message bus (Kafka)	Avro (with Schema Registry) Schema evolution, compact binary, typed streaming
Bronze (raw ingestion)	As-is from source (JSON/CSV/Avro) Preserve source fidelity. Convert ASAP.
Bronze → Silver transform output	Parquet + Zstd Begin analytics optimization immediately
Silver layer	Parquet + Zstd Columnar for downstream analytics
Gold layer	Parquet + Zstd Same — query engines read Parquet natively
ML feature tables	Parquet + Zstd Pandas/Spark/Py Arrow all read Parquet efficiently
Cold/archive storage	Parquet + Gzip or Zstd-9Maximum compression for infrequently read data
Streaming intermediate	Avro or JSON (in-flight) Low latency, schema flexibility
Cross-system exchange	JSON or Parquet JSON for compatibility; Parquet for efficiency

Practical Sizing: The Impact of Format Choice

Working through a real example helps cement this in interviews:

Scenario: 1 billion user events/day, avg 500 bytes raw JSON.

FormatStorageAnnual Cost (S3 @ $0.023/GB)Raw JSON500 GB/day → 182 TB/year~$4,200/yearParquet + Snappy~~100 GB/day (5x compression) → 36 TB/year~~$830/yearParquet + Zstd~~80 GB/day (6.3x compression) → 29 TB/year~~$670/yearParquet + Gzip~~65 GB/day (7.7x compression) → 24 TB/year~~$550/year

Annual saving switching from JSON to Parquet+Zstd: ~$3,500/year storage — PLUS the 10-25x query performance improvement and reduced compute cost from scanning less data. On multi-petabyte data lakes, this is millions of dollars annually.

What to say in an interview: “One of the first things I’d check when a pipeline is expensive to run is the file format. Converting from JSON to Parquet with Zstd typically reduces storage by 6-8x and cuts query time by 10-25x from column pruning alone. At a billion events/day, that’s a 6x storage cost reduction plus the compute savings from scanning 80% less data on every analytical query.”

Interview Questions

Q1: “You inherit a data lake where all data is stored as gzipped JSON. Queries take 10 minutes and storage costs are $50K/month. What do you change and why?”

Model Answer: “The root cause is the file format. Gzipped JSON is the worst of both worlds for analytical queries: it’s row-oriented (no column pruning), gzip is not splittable (Spark can’t parallelize within a file), and the verbose key-name-per-record structure wastes storage. I’d migrate to Parquet with Zstd compression. Migration approach: run a one-time Spark job to convert existing data from JSON to Parquet, partitioned by date — spark.read.json().write.parquet() with spark.sql.shuffle.partitions tuned appropriately. Going forward, the pipeline writes Parquet directly. Expected results: 6-8x storage reduction brings $50K/month to ~$7K/month. Query time drops from 10 minutes to under 1 minute from column pruning and predicate pushdown. Gzip’s non-splittability was causing single-task processing of each file — Parquet row groups enable parallel reads. One caveat: I’d validate schema consistency in the JSON data before migration, as JSON’s schema-on-read flexibility means some files may have inconsistent types or missing fields that need handling.”

Q2: “Why would you use Avro for Kafka messages instead of JSON or Parquet?”

Model Answer: “Three reasons. First, schema evolution with compatibility guarantees: Avro with Confluent Schema Registry gives producers and consumers independent deployment. I can add a new field with a default value — old consumers ignore it, new consumers use it. With JSON, there’s no enforcement; schema drift causes silent bugs. With Protobuf it’s similar to Avro but less common in the Kafka ecosystem. Second, compactness: an Avro message over Kafka contains just a 5-byte schema ID prefix plus binary-encoded payload — no key names in every message. At 500K events/sec, eliminating repeated field names reduces network bandwidth and storage by 30-50% vs JSON. Third, type safety: Avro enforces data types at serialization time. A producer writing a string where a long is expected fails immediately rather than silently corrupting downstream consumers. Parquet isn’t suitable here because it’s a file format with row group overhead — it’s not designed for individual message serialization over a message bus.”

Think About This

You’re in a Netflix interview. The prompt: “Netflix ingests viewing events via Kafka (500K events/sec), stores them in a data lake, and uses them for both real-time recommendations (< 100ms) and weekly analytics (batch). How would you choose file formats across this pipeline?”

Walk through:

Kafka messages format? (Avro with Schema Registry — typed, compact, schema-evolvable as Netflix’s event schema evolves)
Bronze landing format? (Parquet + Snappy — fast writes from Flink, readable by Spark for batch downstream. Snappy for speed over compression ratio at this write volume.)
Silver/gold format? (Parquet + Zstd — better compression for storage cost at Netflix’s petabyte scale, fast reads for weekly batch analytics jobs)
Real-time recommendation path? (Events don’t go to Parquet for real-time — they go from Kafka through Flink directly to a feature store (Redis) or real-time OLAP (Druid/Pinot). Parquet is for batch paths only.)
What about the 500K events/sec write rate — does file size matter? (Yes — write too frequently and you get small files (Parquet overhead per file). Solution: Flink buffers events and writes row groups of 128 MB+, creating properly sized Parquet files.)

The insight: file format choice is layer-specific. Avro for the message bus (streaming needs), Parquet for storage (analytics needs). The real-time path skips file storage entirely — events go straight to low-latency serving stores.

Quick Reference

Parquet = columnar, analytics default. Column pruning + predicate pushdown + vectorized execution. Use for silver/gold layers in 90% of DE use cases.
ORC = columnar, Hive ecosystem. Near-feature-parity with Parquet. Default to Parquet unless deeply invested in Hive.
Avro = row-oriented, schema evolution, serialization. The Kafka message format standard. Use for streaming ingestion, not long-term analytical storage.
JSON = accept from sources, convert immediately to Parquet. Never store analytics data long-term as JSON.
Compression rule: Zstd = new default (best ratio/speed balance). Snappy = streaming/interactive. Gzip = cold archival only. LZ4 = extreme low-latency reads.
The ROI argument: JSON → Parquet+Zstd = 6-8x storage reduction + 10-25x query speedup. At scale, this is millions of dollars annually.

Tomorrow’s Preview

Day 19: Data Quality Frameworks — The six dimensions of data quality, building quality checks into pipelines (Great Expectations, dbt tests, Dataplex), defining SLIs for data, and alerting on quality failures — the operational maturity topic that senior DE candidates are expected to own end-to-end.

Day 18: File formats and compression