Phase 1: Foundations & Frameworks | Category: Storage Systems
Why This Is a Practical Interview Topic
File format choice is one of the highest-leverage decisions in data engineering — it directly impacts query speed, storage cost, pipeline throughput, and schema flexibility. Interviewers don’t just ask “what is Parquet?” — they ask “why did you choose Parquet over Avro for this use case?” or “your pipeline is slow and expensive — what would you look at first?” As OneUptime puts it: “If you have ever wondered why analytical queries on Parquet files run 10-100x faster than the same queries on CSV or JSON, the answer is columnar storage.” At your level, the answer must go deeper than that.
Row-Oriented vs Columnar: The Mental Model
This is the foundation everything else builds on.
Row-oriented stores each record contiguously on disk:
Record 1: [user_id=101, name="Alice", city="NYC", spend=500.00, device="mobile"] Record 2: [user_id=102, name="Bob", city="SF", spend=300.00, device="desktop"] Record 3: [user_id=103, name="Carol", city="NYC", spend=700.00, device="mobile"]
Columnar stores each column contiguously:
user_id column: [101, 102, 103, ...] name column: ["Alice", "Bob", "Carol", ...] city column: ["NYC", "SF", "NYC", ...] spend column: [500.00, 300.00, 700.00, ...] device column: ["mobile", "desktop", "mobile", ...]
Why columnar wins for analytics:
| Benefit | How It Works |
|---|---|
| Column pruning | SELECT city, SUM(spend) — only reads 2 of 5 columns. 60% of I/O eliminated. |
| Better compression | Homogeneous data (all cities together) compresses far better than interleaved mixed types. |
| Vectorized execution | CPU processes 1024 values from the same column in one SIMD instruction batch. |
| Predicate pushdown | Row group statistics (min/max per column) let the engine skip entire blocks without reading them. |
| Dictionary encoding | ”NYC” repeated 1M times stored as integer codes → massive compression for low-cardinality columns. |
When row-oriented wins:
-
Full record access (read ALL columns for a row) — OLTP, user-facing APIs
-
Write-heavy workloads with frequent row updates
-
Streaming ingestion where you write one event at a time
-
Small files where the columnar overhead isn’t justified
The Formats
Parquet: The Default for Analytics
What it is: Columnar, self-describing binary format. The de facto standard for data lakes and warehouses. Created by Twitter and Cloudera.
Internal structure:
File
├── Row Group 1 (128MB default) │
├── Column Chunk: user_id [compressed pages] │
├── Column Chunk: city [dictionary encoded] │
└── Column Chunk: spend [delta encoded]
├── Row Group 2 │
└── ...
└── Footer (schema + row group statistics: min/max per column chunk)
Key features:
-
Row groups: The unit of parallelism. Spark reads one row group per task. Larger row groups = fewer tasks but more memory per task.
-
Column statistics: Min/max per column per row group stored in the footer. Engines read the footer first, then skip row groups where the predicate can’t possibly match.
-
Encoding: Before compression, Parquet applies efficient encodings:
-
Dictionary encoding: Replace repeated values with integer codes (great for city, category, status)
-
Run-Length Encoding (RLE): Compress runs of repeated values
-
Delta encoding: Store differences between successive values (great for timestamps, sequential IDs)
-
-
Nested data: Supports complex types (arrays, maps, structs) via the Dremel encoding model
Compression algorithms (applied after encoding):
| Topic | Details |
|---|---|
| Snappy | Moderate (~3-5x) Fastest Fast Default for Spark, interactive queries |
| Zstd | High (~5-8x) Fast Fast |
| 2026 default choice | — best balance |
| Gzip | High (~5-8x) Slow Very slow Cold storage, maximum compression |
| LZ4 | Moderate Fastest Fastest Streaming, low-latency read paths |
| Brotli | Highest Slow Slow Static web assets, archival |
The 2026 recommendation: According to real benchmarks and OneUptime: “Zstd has become the default choice for most use cases — compression ratios close to Gzip with speeds close to Snappy.” Snappy for streaming/interactive, Zstd for batch/archival, Gzip only for maximum cold storage compression where read speed doesn’t matter.
Ecosystem: Spark, Flink, BigQuery, Snowflake, Redshift, Athena, Trino, Presto, Pandas, DuckDB, Arrow. Universally supported.
Use when: Analytics gold/silver layer, data lake storage, Parquet is your default for 90% of pipeline outputs.
ORC: Optimized for Hive
What it is: Columnar format, also row-group based. Created by Hortonworks for the Hive/Hadoop ecosystem. Very similar to Parquet but with different design choices.
ORC vs Parquet:
| Topic | Details |
|---|---|
| Ecosystem | Hive, Spark (good), Presto Virtually everywhere |
| Compression | Slightly better in some benchmarks Competitive |
| Bloom filters | Built-in, per column Optional, plugin |
| ACID support | Built into Hive ACIDVia table formats (Delta/Iceberg) |
| Nested types | More limited Excellent (Dremel model) |
| Adoption | Declining (Hive declining) Growing — the clear winner |
When to choose ORC: Only if you’re deeply invested in the Hive ecosystem. For any modern data stack, Parquet wins on ecosystem breadth. At your target companies, the answer is Parquet.
Avro: The Streaming/Serialization Format
What it is: Row-oriented, binary serialization format with a rich schema evolution model. JSON-defined schema embedded in or registered separately. Created by the Hadoop project.
Internal structure:
File ├── Header (schema JSON, codec, sync marker) ├── Block 1 (compressed binary records, row-by-row) ├── Block 2 └── ...
Key features:
-
Schema-first: Schema is defined before data and embedded in or referenced by every file
-
Schema evolution: Add fields (with defaults), remove fields, rename fields — with strong compatibility guarantees
-
Backward compatible: New schema can read old data
-
Forward compatible: Old schema can read new data
-
Full compatible: Both directions work
-
-
Schema Registry integration: Confluent Schema Registry stores Avro schemas centrally. Kafka messages contain only a schema ID + binary payload — extremely compact for streaming
-
Compact binary encoding: No column overhead. For row-by-row streaming, Avro is more efficient than Parquet (which has row group overhead)
-
Language agnostic: Auto-generated code for Java, Python, Go, C++
When to use Avro:
-
Kafka messages: Schema Registry + Avro is the standard pattern for typed Kafka topics
-
Inter-service serialization: gRPC uses Protobuf, but Avro is common in the Hadoop/Kafka ecosystem
-
Schema evolution critical: When producers and consumers evolve independently and you need compatibility guarantees
-
Write-heavy, read-partial: When you always read the full record (not a subset of columns)
When NOT to use Avro:
-
Analytics / warehouse storage → use Parquet (columnar pruning, compression)
-
Long-term analytical storage → Parquet’s column stats and encoding are far more efficient
JSON / JSON Lines: The Interchange Format
What it is: Text-based, self-describing, human-readable. The lingua franca of APIs and web data.
JSON Lines (JSONL / NDJSON): One JSON object per line — key distinction for data engineering. Enables line-by-line streaming without loading the whole file.
Performance reality:
Same dataset, different formats: JSON Lines: 150.8 MB (keys repeated for every record) CSV (gzip): 25.0 MB Parquet (zstd): 6.3 MB (24x smaller than JSON, fastest reads)
When to accept JSON (not use as primary storage):
-
API responses — you receive JSON, immediately convert to Parquet for storage
-
Bronze layer when source provides only JSON — land as-is, transform to Parquet in silver
-
Config files, metadata, small reference data
-
Debugging and human inspection
Anti-pattern: Storing terabytes of analytical data as JSON. Every query scans everything (no column pruning, no predicate pushdown). Storage cost is 10-25x higher than Parquet.
Protobuf and Arrow: Honorable Mentions
Protobuf: Google’s binary serialization (gRPC-native). More compact than Avro, schema defined in .proto files. Used in Google systems and increasingly at OpenAI/Anthropic for internal APIs.
Apache Arrow: In-memory columnar format. NOT a file format for storage — a format for in-memory computation and zero-copy data exchange between systems (Spark → Pandas → DuckDB). When you use PyArrow or pandas with Parquet, Arrow is the in-memory representation.
File Format by Pipeline Layer
| Topic | Details |
|---|---|
| Message bus (Kafka) | Avro (with Schema Registry) Schema evolution, compact binary, typed streaming |
| Bronze (raw ingestion) | As-is from source (JSON/CSV/Avro) Preserve source fidelity. Convert ASAP. |
| Bronze → Silver transform output | Parquet + Zstd Begin analytics optimization immediately |
| Silver layer | Parquet + Zstd Columnar for downstream analytics |
| Gold layer | Parquet + Zstd Same — query engines read Parquet natively |
| ML feature tables | Parquet + Zstd Pandas/Spark/Py Arrow all read Parquet efficiently |
| Cold/archive storage | Parquet + Gzip or Zstd-9Maximum compression for infrequently read data |
| Streaming intermediate | Avro or JSON (in-flight) Low latency, schema flexibility |
| Cross-system exchange | JSON or Parquet JSON for compatibility; Parquet for efficiency |
Practical Sizing: The Impact of Format Choice
Working through a real example helps cement this in interviews:
Scenario: 1 billion user events/day, avg 500 bytes raw JSON.
FormatStorageAnnual Cost (S3 @ $0.023/GB)Raw JSON500 GB/day → 182 TB/year~$4,200/yearParquet + Snappy100 GB/day (5x compression) → 36 TB/year$830/yearParquet + Zstd80 GB/day (6.3x compression) → 29 TB/year$670/yearParquet + Gzip65 GB/day (7.7x compression) → 24 TB/year$550/year
Annual saving switching from JSON to Parquet+Zstd: ~$3,500/year storage — PLUS the 10-25x query performance improvement and reduced compute cost from scanning less data. On multi-petabyte data lakes, this is millions of dollars annually.
What to say in an interview: “One of the first things I’d check when a pipeline is expensive to run is the file format. Converting from JSON to Parquet with Zstd typically reduces storage by 6-8x and cuts query time by 10-25x from column pruning alone. At a billion events/day, that’s a 6x storage cost reduction plus the compute savings from scanning 80% less data on every analytical query.”
Interview Questions
Q1: “You inherit a data lake where all data is stored as gzipped JSON. Queries take 10 minutes and storage costs are $50K/month. What do you change and why?”
Model Answer: “The root cause is the file format. Gzipped JSON is the worst of both worlds for analytical queries: it’s row-oriented (no column pruning), gzip is not splittable (Spark can’t parallelize within a file), and the verbose key-name-per-record structure wastes storage. I’d migrate to Parquet with Zstd compression. Migration approach: run a one-time Spark job to convert existing data from JSON to Parquet, partitioned by date — spark.read.json().write.parquet() with spark.sql.shuffle.partitions tuned appropriately. Going forward, the pipeline writes Parquet directly. Expected results: 6-8x storage reduction brings $50K/month to ~$7K/month. Query time drops from 10 minutes to under 1 minute from column pruning and predicate pushdown. Gzip’s non-splittability was causing single-task processing of each file — Parquet row groups enable parallel reads. One caveat: I’d validate schema consistency in the JSON data before migration, as JSON’s schema-on-read flexibility means some files may have inconsistent types or missing fields that need handling.”
Q2: “Why would you use Avro for Kafka messages instead of JSON or Parquet?”
Model Answer: “Three reasons. First, schema evolution with compatibility guarantees: Avro with Confluent Schema Registry gives producers and consumers independent deployment. I can add a new field with a default value — old consumers ignore it, new consumers use it. With JSON, there’s no enforcement; schema drift causes silent bugs. With Protobuf it’s similar to Avro but less common in the Kafka ecosystem. Second, compactness: an Avro message over Kafka contains just a 5-byte schema ID prefix plus binary-encoded payload — no key names in every message. At 500K events/sec, eliminating repeated field names reduces network bandwidth and storage by 30-50% vs JSON. Third, type safety: Avro enforces data types at serialization time. A producer writing a string where a long is expected fails immediately rather than silently corrupting downstream consumers. Parquet isn’t suitable here because it’s a file format with row group overhead — it’s not designed for individual message serialization over a message bus.”
Think About This
You’re in a Netflix interview. The prompt: “Netflix ingests viewing events via Kafka (500K events/sec), stores them in a data lake, and uses them for both real-time recommendations (< 100ms) and weekly analytics (batch). How would you choose file formats across this pipeline?”
Walk through:
-
Kafka messages format? (Avro with Schema Registry — typed, compact, schema-evolvable as Netflix’s event schema evolves)
-
Bronze landing format? (Parquet + Snappy — fast writes from Flink, readable by Spark for batch downstream. Snappy for speed over compression ratio at this write volume.)
-
Silver/gold format? (Parquet + Zstd — better compression for storage cost at Netflix’s petabyte scale, fast reads for weekly batch analytics jobs)
-
Real-time recommendation path? (Events don’t go to Parquet for real-time — they go from Kafka through Flink directly to a feature store (Redis) or real-time OLAP (Druid/Pinot). Parquet is for batch paths only.)
-
What about the 500K events/sec write rate — does file size matter? (Yes — write too frequently and you get small files (Parquet overhead per file). Solution: Flink buffers events and writes row groups of 128 MB+, creating properly sized Parquet files.)
The insight: file format choice is layer-specific. Avro for the message bus (streaming needs), Parquet for storage (analytics needs). The real-time path skips file storage entirely — events go straight to low-latency serving stores.
Quick Reference
-
Parquet = columnar, analytics default. Column pruning + predicate pushdown + vectorized execution. Use for silver/gold layers in 90% of DE use cases.
-
ORC = columnar, Hive ecosystem. Near-feature-parity with Parquet. Default to Parquet unless deeply invested in Hive.
-
Avro = row-oriented, schema evolution, serialization. The Kafka message format standard. Use for streaming ingestion, not long-term analytical storage.
-
JSON = accept from sources, convert immediately to Parquet. Never store analytics data long-term as JSON.
-
Compression rule: Zstd = new default (best ratio/speed balance). Snappy = streaming/interactive. Gzip = cold archival only. LZ4 = extreme low-latency reads.
-
The ROI argument: JSON → Parquet+Zstd = 6-8x storage reduction + 10-25x query speedup. At scale, this is millions of dollars annually.
Tomorrow’s Preview
Day 19: Data Quality Frameworks — The six dimensions of data quality, building quality checks into pipelines (Great Expectations, dbt tests, Dataplex), defining SLIs for data, and alerting on quality failures — the operational maturity topic that senior DE candidates are expected to own end-to-end.