Phase 1 Complete | Category: Consolidation

What You’ve Built in 30 Days

You’ve completed the entire foundation layer. Before Phase 2 begins with company-specific deep dives, today is about closing gaps, consolidating patterns, and building the confidence to walk into any system design room and own it.

No new material today. 100% active recall and synthesis.

The Phase 1 Mastery Checklist

Rate each item below: Green (can explain + defend trade-offs), Yellow (know the concept but struggle with trade-offs), or Red (need to revisit).

System Design Methodology (Days 1-3)

  • Can you walk through the 5-step framework in under 2 minutes from memory?

  • Can you extract functional vs non-functional requirements for a data system and prioritize them P0/P1/P2?

  • Can you estimate storage, throughput, and query load for a 1B events/day system from scratch?

  • Do your estimations directly drive architectural decisions, rather than sitting in a disconnected math box?
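For the estimation items above, a minimal back-of-envelope sketch may help make the arithmetic concrete. It uses the 1 KB event size from the quick-reference rule of thumb and the low end of the 6-8x Parquet + Zstd figure quoted in the pattern library. The peak factor of 3 is an illustrative assumption, and `estimate` is a hypothetical helper, not part of any library.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class Estimate:
    raw_tb_per_day: float
    avg_events_per_sec: float
    peak_events_per_sec: float
    stored_tb_per_year: float


def estimate(events_per_day: int, avg_event_bytes: int,
             peak_factor: float = 3.0,          # assumption: peak is 3x average
             compression_ratio: float = 6.0) -> Estimate:
    """Back-of-envelope sizing; 1 TB is treated as 10**12 bytes."""
    raw_tb = events_per_day * avg_event_bytes / 1e12
    avg_eps = events_per_day / SECONDS_PER_DAY
    return Estimate(
        raw_tb_per_day=raw_tb,
        avg_events_per_sec=avg_eps,
        peak_events_per_sec=avg_eps * peak_factor,
        stored_tb_per_year=raw_tb * 365 / compression_ratio,
    )


est = estimate(events_per_day=1_000_000_000, avg_event_bytes=1_000)
print(est)
```

Under these assumptions the sketch gives about 1 TB/day raw, roughly 11.6K events/sec on average, and about 61 TB/year stored. Those are the numbers that should then justify the partitioning scheme and engine choice.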

Data Modeling (Days 4-7)

  • Can you define grain before drawing any tables?

  • Can you choose the right fact table type (transaction vs periodic snapshot vs accumulating snapshot) given a business scenario?

  • Can you explain SCD Types 0/1/2/3/6 and choose between them based on the analytical requirement?

  • Can you implement SCD2 with a MERGE INTO using the NULL join_key pattern?

  • Can you articulate when to normalize vs denormalize and at which lakehouse layer each applies?

  • Can you explain Data Vault (hubs/links/satellites) and when it’s overkill?
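The NULL join_key pattern from the SCD2 item can be emulated in plain Python to see why two staged branches are needed. This is a sketch of the logic only, not a warehouse MERGE statement. The names `customer_id`, `tier`, and `scd2_merge` are hypothetical, and the sketch assumes at most one update per key per run.

```python
from datetime import date


def scd2_merge(dim, updates, today):
    """Apply SCD2 changes. dim rows: customer_id, tier, valid_from, valid_to, is_current."""
    # Snapshot of current rows, like the target side of a MERGE.
    current = {row["customer_id"]: row for row in dim if row["is_current"]}

    # Branch 1: every update keyed by its natural key (a match expires the old row).
    staged = [{"merge_key": u["customer_id"], **u} for u in updates]
    # Branch 2: changed rows again with a NULL merge key (never matches, so it inserts).
    staged += [
        {"merge_key": None, **u}
        for u in updates
        if u["customer_id"] in current
        and current[u["customer_id"]]["tier"] != u["tier"]
    ]

    for s in staged:
        target = current.get(s["merge_key"]) if s["merge_key"] is not None else None
        if target is not None:
            # WHEN MATCHED AND the attribute changed: close the current version.
            if target["tier"] != s["tier"]:
                target["valid_to"] = today
                target["is_current"] = False
        else:
            # WHEN NOT MATCHED: brand-new key, or the new version of a changed key.
            dim.append({"customer_id": s["customer_id"], "tier": s["tier"],
                        "valid_from": today, "valid_to": None, "is_current": True})
    return dim


dim = [
    {"customer_id": 1, "tier": "free", "valid_from": date(2024, 1, 1),
     "valid_to": None, "is_current": True},
    {"customer_id": 2, "tier": "pro", "valid_from": date(2024, 1, 1),
     "valid_to": None, "is_current": True},
]
updates = [{"customer_id": 1, "tier": "pro"},    # changed
           {"customer_id": 2, "tier": "pro"},    # unchanged
           {"customer_id": 3, "tier": "free"}]   # new
result = scd2_merge(dim, updates, today=date(2024, 6, 1))
```

Without the second branch, the changed customer would only be expired: a single MERGE can update or insert per source row, not both, which is why the changed row has to appear twice in the staged input.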

ETL/ELT & Processing (Days 8-11)

  • Can you explain push-down compute and why ELT won over ETL?

  • Can you describe Spark’s DAG → stages → tasks execution model and explain why shuffles are the #1 performance killer?

  • Can you handle data skew with salting, key isolation, and AQE?

  • Can you explain the three window types (tumbling, sliding, session) and when to use each?

  • Can you explain watermarks and what happens to events beyond the allowed lateness?

  • Can you compare Flink vs Spark Streaming with concrete trade-offs?

  • Can you explain Lambda vs Kappa and describe the 2026 hybrid that most companies actually use?
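For the window and watermark items above, a small simulation can make the late-event behaviour tangible. This is a simplified model, not Flink or Spark code: it assumes the watermark trails the maximum observed event time by a fixed delay and treats that delay as the allowed lateness. `tumbling_counts` and its parameters are hypothetical names.

```python
from collections import defaultdict


def tumbling_counts(events, window_sec=60, allowed_lateness_sec=30):
    """Count events per (window_start, key); events is (event_time_sec, key) in arrival order."""
    counts = defaultdict(int)
    dropped = []
    max_event_time = float("-inf")

    for event_time, key in events:
        # The watermark only moves forward as newer event times are observed.
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - allowed_lateness_sec

        window_start = event_time - (event_time % window_sec)
        window_end = window_start + window_sec

        if window_end <= watermark:
            # The window was already finalized, so the event is too late to count.
            dropped.append((event_time, key))
            continue
        counts[(window_start, key)] += 1

    return dict(counts), dropped


events = [(10, "a"), (70, "a"), (130, "a"), (65, "a"), (20, "a")]
counts, dropped = tumbling_counts(events)
print(counts, dropped)
```

In this run the event at t=65 arrives out of order but its window is still open, so it is counted. The event at t=20 belongs to a window that closed before the watermark and is dropped. In a real pipeline such events would typically go to a side output or dead-letter path rather than vanish.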

Distributed Systems (Days 12-14)

  • Can you explain the CAP theorem as a binary CP/AP choice during a partition (not a trilemma)?

  • Can you use PACELC to reason about the normal-operation latency/consistency trade-off?

  • Can you apply per-feature consistency (strong for billing, eventual for feeds)?

  • Can you explain range vs hash vs consistent hashing and handle hot partitions?

  • Can you compare leader-follower vs multi-leader vs leaderless replication?

  • Can you explain quorum math (W + R > N) and design for different fault tolerance profiles?

  • Can you explain WAL and why it enables CDC?
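For the partitioning item above, a minimal consistent-hashing ring shows the property that matters in interviews: adding a node moves only the keys that the new node takes over. This is a sketch under stated assumptions (SHA-256 for placement, 100 virtual nodes per physical node, hypothetical node names), not how any particular database implements it.

```python
import bisect
import hashlib


class HashRing:
    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted (position, node) pairs
        for node in nodes:
            self.add(node)

    @staticmethod
    def _position(key: str) -> int:
        # First 8 bytes of SHA-256 as an integer position on the ring.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def add(self, node: str) -> None:
        # Virtual nodes spread each physical node around the ring to even out load.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._position(f"{node}#{i}"), node))

    def get(self, key: str) -> str:
        # First ring position at or after the key's hash, wrapping to the start.
        idx = bisect.bisect_left(self._ring, (self._position(key),))
        return self._ring[idx % len(self._ring)][1]


keys = [f"user-{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.get(k) for k in keys}

ring.add("node-d")
after = {k: ring.get(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all to node-d")
```

With plain `hash(key) % N`, changing N would remap most keys. Here only the slice claimed by the new node moves, which is what makes rebalancing cheap.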

Storage Systems (Days 15-18)

  • Can you apply the database selection decision tree (relational vs document vs key-value vs wide-column)?

  • Can you explain the three warehouse architectures (BigQuery serverless, Snowflake virtual warehouses, Redshift MPP)?

  • Can you compare Delta Lake, Iceberg, and Hudi with their origin story, strength, and best-fit scenario?

  • Can you explain why Parquet + Zstd is the default and quantify the storage savings vs JSON?

  • Can you explain the medallion architecture (bronze/silver/gold) with the modeling approach at each layer?

Data Quality & Governance (Days 19-23)

  • Can you name the six quality dimensions and map each to a pipeline layer?

  • Can you design a defense-in-depth quality framework (ingestion → transform → output → consumer)?

  • Can you explain column-level vs table-level lineage and their respective use cases?

  • Can you explain BACKWARD_TRANSITIVE vs FORWARD compatibility and give examples of breaking vs non-breaking changes?

  • Can you design a data contract with schema + semantics + quality rules + SLA?

  • Can you explain log-based CDC vs query-based CDC and choose between them based on write rate and whether deletes must be captured?

  • Can you describe the Stage → Validate → Publish Airflow pattern and explain why it enables safe retries?
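The Stage → Validate → Publish item can be illustrated without Airflow. The sketch below is plain Python over local files, assuming each step would be a separate task in a real DAG. `run_interval`, the directory layout, and the `order_id` business key are hypothetical. The point is that nothing becomes visible until validation passes, and a retry overwrites the same interval-scoped path.

```python
import json
import os
import tempfile
from pathlib import Path


def run_interval(base_dir: Path, interval_start: str, rows: list) -> Path:
    """Stage -> Validate -> Publish for one interval; safe to re-run."""
    staging = base_dir / "_staging" / f"dt={interval_start}.json"
    published = base_dir / "published" / f"dt={interval_start}.json"
    staging.parent.mkdir(parents=True, exist_ok=True)
    published.parent.mkdir(parents=True, exist_ok=True)

    # Stage: write to a path that consumers never read.
    staging.write_text(json.dumps(rows))

    # Validate: fail before anything becomes visible downstream.
    staged_rows = json.loads(staging.read_text())
    if not staged_rows:
        raise ValueError(f"{interval_start}: staged output is empty")
    if any(r.get("order_id") is None for r in staged_rows):
        raise ValueError(f"{interval_start}: null business key")

    # Publish: atomic rename onto the interval-scoped path (replaces a prior run).
    os.replace(staging, published)
    return published


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    first = run_interval(base, "2024-01-01", [{"order_id": 1}])
    retry = run_interval(base, "2024-01-01", [{"order_id": 1}, {"order_id": 2}])
    published_files = sorted(p.name for p in (base / "published").iterdir())
    published_rows = json.loads(retry.read_text())
```

Running the same interval twice leaves exactly one published file, which is the property that makes retries safe. A failed validation leaves the previously published output untouched.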

System Design Practice (Days 24-29)

  • Can you design a real-time analytics dashboard end-to-end (Day 24)?

  • Can you design an event logging & telemetry system end-to-end (Day 25)?

  • Can you design an e-commerce data warehouse with all three fact table types (Day 26)?

  • Can you choose REST vs gRPC vs GraphQL for a given data access pattern (Day 27)?

  • Can you design a three-tier caching strategy (Redis → materialized views → partitioned table) (Day 28)?

  • Can you design a pipeline observability system with circuit breakers and two-tier alerting (Day 29)?
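For the three-tier caching item, a read-through sketch shows the fall-through order. It is an in-memory stand-in only: a dict plays the role of Redis, another dict the materialized view, and a callable the partitioned base table. `TieredReader` and the sample keys are hypothetical, and the 300-second TTL is one value from the 5-30 minute range mentioned in the pattern library.

```python
import time


class TieredReader:
    def __init__(self, materialized_view, base_table_query,
                 ttl_sec=300, clock=time.monotonic):
        self._hot = {}                   # stand-in for Redis: key -> (expires_at, value)
        self._mv = materialized_view     # stand-in for a materialized view
        self._query = base_table_query   # stand-in for a partitioned base-table scan
        self._ttl = ttl_sec
        self._clock = clock

    def get(self, key):
        """Return (value, tier_that_served_it)."""
        now = self._clock()
        hit = self._hot.get(key)
        if hit is not None and hit[0] > now:
            return hit[1], "cache"

        if key in self._mv:
            value, tier = self._mv[key], "materialized_view"
        else:
            value, tier = self._query(key), "base_table"

        # Populate the hot tier so the next read within the TTL is served from it.
        self._hot[key] = (now + self._ttl, value)
        return value, tier


fake_now = [0.0]  # controllable clock so the TTL behaviour is easy to see
reader = TieredReader({"daily_revenue": 100}, lambda key: len(key),
                      ttl_sec=300, clock=lambda: fake_now[0])

r1 = reader.get("daily_revenue")   # served by the materialized view
r2 = reader.get("daily_revenue")   # served by the hot cache
fake_now[0] = 301.0
r3 = reader.get("daily_revenue")   # TTL expired, falls through again
r4 = reader.get("adhoc")           # not pre-computed, hits the base table
```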

Phase 1 Pattern Library: The 15 Answers That Win Interviews

These are the high-signal responses that separate senior answers from junior answers. Memorize the pattern, not the words.

1. On grain: “The grain of this fact table is one row per X. I’m stating this explicitly before naming any dimensions because every subsequent decision — which dimensions are valid, whether aggregation is additive — depends on getting the grain right.”

2. On SCD2: “I’d use Type 2 for [attribute] because historical analysis must reflect the state at time of fact. The key implementation detail is the NULL join_key pattern in the MERGE — one UNION branch with the natural key to expire the old row, one with NULL to force-insert the new version.”

3. On consistency: “I wouldn’t make a blanket consistency choice. [Financial feature] needs strong CP/EC — an incorrect read costs real money. [Dashboard feature] is fine with eventual AP/EL — showing numbers from 30 seconds ago doesn’t affect any decision.”

4. On ETL vs ELT: “I default to ELT — land raw, transform in-place using the warehouse’s compute. Push-down compute means data never leaves the warehouse. I apply ETL guardrails selectively: PII masking before storage, schema validation at ingestion, ML inference that can’t be expressed as SQL.”

5. On Lambda vs Kappa: “I don’t think in pure Lambda or pure Kappa. The modern pattern is streaming ingestion into a lakehouse with batch transforms on top. Streaming gives freshness; batch gives correctness for complex analytics. They share the same Iceberg storage — no merge layer problem.”

6. On database selection: “I start with PostgreSQL unless requirements push me elsewhere. It handles 64 TiB and 20K TPS. I switch to DynamoDB when the access pattern is purely key-value with < 10ms SLA. Cassandra when write throughput exceeds 5K/sec with time-series patterns. The choice is always access-pattern first, not technology first.”

7. On file formats: “I default to Parquet + Zstd for analytics storage — 6-8x smaller than JSON, column pruning, predicate pushdown. Avro for Kafka messages — schema evolution, compact binary, Schema Registry integration. Never JSON for long-term analytical storage.”

8. On partitioning: “I choose the partition key that matches the dominant WHERE clause — usually event_date for time-series data. I state the expected partition size (100MB-1GB target), the write pattern it supports, and why the cardinality is appropriate. Hot partition mitigation: salting or composite key.”

9. On idempotency: “Three mechanisms: interval-scoped output paths (same interval = same path = overwrite), MERGE INTO instead of INSERT (deduplication by business key), and deterministic logic (use data_interval_start, not now()). A pipeline that isn’t idempotent can’t be safely retried.”

10. On schema evolution: “BACKWARD_TRANSITIVE in the Schema Registry blocks breaking changes before they reach production. Adding fields with defaults is always safe. Renames and removals require the expand-contract pattern: add new field → migrate consumers → remove old field. Never force a breaking change in a single deploy.”

11. On CDC: “I default to log-based CDC (Debezium reading WAL/binlog) for any table with > 2K writes/sec or where deletes must be captured. Query-based polling is acceptable for slowly-changing reference tables where simplicity matters more than completeness. The crossover point is ~2K-5K writes/sec.”

12. On observability: “Most teams monitor only Layer 1 (job success/fail) and miss silent failures. The most dangerous failure is a pipeline that runs successfully but writes 0 rows. Volume anomaly detection (> 30% deviation from 7-day rolling average) catches this. Circuit breakers halt downstream propagation before bad data reaches dashboards.”

13. On caching: “I design three tiers: Redis (sub-100ms, 5-30 min TTL) for the hottest queries, materialized views (sub-5s, hourly refresh) for common aggregation patterns, partitioned base tables (sub-30s) for ad-hoc. Pre-computation when the query runs 100+/day; on-demand when it’s flexible or rare.”

14. On API design: “REST for public/external (caching, broad support), gRPC for internal high-throughput (binary encoding, HTTP/2), GraphQL for flexible client queries (analyst self-serve). Cursor-based pagination for anything > 100K records — offset degrades to O(N) and causes consistency bugs on inserts.”

15. On trade-offs: The universal pattern: “I chose X over Y because [specific requirement from Step 1]. The trade-off is [concrete downside of X]. That’s acceptable because [justification tied to the requirements]. If the requirement were different — specifically if [condition] — I’d switch to Y.”
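Pattern 14's claim about cursor-based pagination is easy to demonstrate. The sketch below uses an in-memory SQLite table as a stand-in for a real API backend. The `events` table, its columns, and `fetch_page` are hypothetical, and ids are assumed to start at 1 so that `after_id=0` means "from the beginning".

```python
import sqlite3


def fetch_page(conn, after_id=0, page_size=3):
    """Keyset pagination: seek past the cursor instead of skipping OFFSET rows."""
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()
    # A short page means there is nothing more to read.
    next_cursor = rows[-1][0] if len(rows) == page_size else None
    return rows, next_cursor


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(i, f"e{i}") for i in range(1, 8)])

page1, cursor = fetch_page(conn)

# A row lands ahead of the cursor between the two page requests.
conn.execute("INSERT INTO events VALUES (0, 'late')")

page2, cursor2 = fetch_page(conn, after_id=cursor)
offset_page2 = conn.execute(
    "SELECT id FROM events ORDER BY id LIMIT 3 OFFSET 3"
).fetchall()

print([r[0] for r in page2])         # cursor-based second page
print([r[0] for r in offset_page2])  # offset-based second page
```

After the insert, the cursor-based second page continues at id 4, while the offset-based page starts at id 3 again and repeats a row the client already saw. That is the consistency bug the pattern refers to. The seek also uses the primary-key index rather than scanning and discarding skipped rows.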

The 30-Minute Timed Practice (Do This Now)

Set three separate 10-minute timers. For each design, speak out loud as if you’re at a whiteboard. Draw the architecture on paper.

Design 1 (10 min): “Design the data infrastructure for a ride-sharing surge pricing system.”

Constraint: Real-time pricing must update within 30 seconds of demand changes.

Target elements: CDC or event streaming from ride request system → Flink for geospatial aggregation (demand per zone over a 5-min sliding window, re-emitted often enough to meet the 30-second target) → Redis for sub-second pricing lookup → batch pipeline to BigQuery for pricing model analysis.

Design 2 (10 min): “Design a data warehouse for a music streaming platform.”

Constraint: Analysts need track-level, artist-level, and user-level analysis.

Target elements: Three fact tables (fact_stream — transaction, fact_daily_listeners — periodic snapshot, fact_artist_release_lifecycle — accumulating snapshot). SCD2 on dim_track (genre, price) and dim_user (subscription tier). Star schema in gold layer.

Design 3 (10 min): “Your company’s most important daily finance report was wrong this morning. Walk me through how you’d diagnose and prevent recurrence.”

Constraint: You have 30 minutes before the CFO reviews the numbers.

Target elements: Five observability layers. Work backward from serving layer (query the table directly). Check volume anomaly. Check Airflow logs for errors. Check CDC / source completeness. Root cause in 30 minutes because of structured logs + lineage. Prevention: circuit breaker on the transform step validating row count within 15% of 7-day average.
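The prevention step in Design 3 can be sketched as a small volume check. It assumes the 15% tolerance and 7-day average named above. `CircuitOpen` and `check_volume` are hypothetical names, and a real pipeline would run this between the transform and publish steps.

```python
from statistics import mean


class CircuitOpen(Exception):
    """Raised to halt downstream tasks when a volume check fails."""


def check_volume(todays_rows: int, last_7_days: list, tolerance: float = 0.15) -> float:
    """Return today's relative deviation, or raise if it exceeds the tolerance."""
    baseline = mean(last_7_days)
    if baseline == 0:
        raise CircuitOpen("no baseline volume to compare against")

    deviation = abs(todays_rows - baseline) / baseline
    if deviation > tolerance:
        raise CircuitOpen(
            f"row count {todays_rows} deviates {deviation:.0%} from "
            f"7-day average {baseline:.0f} (limit {tolerance:.0%})"
        )
    return deviation


history = [1000, 1020, 980, 1010, 990, 1005, 995]  # 7-day average is 1000

ok_deviation = check_volume(1100, history)  # 10% off: within tolerance

try:
    check_volume(0, history)                # job "succeeded" but wrote 0 rows
    tripped = False
except CircuitOpen as exc:
    tripped = True
    print(exc)
```

The zero-row case is the silent failure that job-level monitoring misses: the task exits cleanly, yet the check still trips and blocks the finance report from refreshing with bad data.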

Self-Assessment: What to Do With Your Gaps

After the checklist and practice, you likely have 3-5 Red items. Here’s how to close them before Phase 2 starts:

For Data Modeling gaps: Re-read Days 4-7. Draw the three fact table types on paper with concrete e-commerce examples. Practice stating the grain out loud for five different business processes.

For Streaming/Processing gaps: Re-read Days 9-11. Draw the Spark DAG → stages → tasks diagram. Walk through the Lambda vs Kappa decision for three scenarios: fraud detection (Kappa), financial reporting (Lambda-like), ride-sharing real-time pricing (hybrid).

For Distributed Systems gaps: Re-read Days 12-14. Practice the CAP decision for each database in the comparison table. Draw the quorum formula (W + R > N) and work through three configurations for N=5.
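For the quorum exercise, a short script can check the three N=5 configurations and confirm by brute force why W + R > N is the right condition. The configuration labels and function names are hypothetical, and the fault-tolerance figures only count how many replicas can be down while an operation still reaches its quorum.

```python
from itertools import combinations


def always_overlaps(n: int, w: int, r: int) -> bool:
    """Brute force: does every possible write set share a replica with every read set?"""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


def quorum_profile(n: int, w: int, r: int) -> dict:
    return {
        "overlap_guaranteed": w + r > n,   # a read always sees the latest acked write
        "writes_survive_failures": n - w,  # replicas that can be down for writes
        "reads_survive_failures": n - r,   # replicas that can be down for reads
    }


N = 5
configs = {"balanced": (3, 3), "read_optimized": (5, 1), "write_optimized": (1, 5)}
profiles = {name: quorum_profile(N, w, r) for name, (w, r) in configs.items()}

for name, profile in profiles.items():
    print(name, profile)

# Counter-example: W=2, R=2 is faster but a read can miss the latest write.
print("W=2, R=2 overlaps:", always_overlaps(N, 2, 2))
```

The balanced configuration tolerates two failed replicas for both reads and writes. The read-optimized one makes reads cheap but blocks writes on any single failure, and the write-optimized one is the mirror image.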

For the full system designs: Re-read Days 24-26. Attempt to redraw each architecture from scratch on paper without looking. If you can’t fill in the boxes without the notes, the design isn’t yours yet.

Phase 1 → Phase 2 Transition

You’ve covered the fundamentals that every senior DE must know regardless of company. Starting Day 31, Phase 2 shifts to company-specific patterns:

  • Day 31: Meta — scale thinking, petabyte data infrastructure, social graph

  • Day 32: Meta system design practice — News Feed data pipeline

  • Day 33: Netflix — Iceberg-first architecture, ownership culture, fault tolerance

  • Day 34: Netflix system design practice — Recommendation pipeline

  • Day 35: Google — GCP-native design (BigQuery + Dataflow + Pub/Sub)

  • Day 36: Google system design practice — Large-scale search analytics

  • Day 37: OpenAI — LLM training data pipelines, AI-native infrastructure

  • Day 38: OpenAI design practice — Training data pipeline for LLMs

  • Day 39: Anthropic — progressive complexity, safety-first, distributed search

  • Day 40: Anthropic design practice — Distributed search at billion-document scale

The Phase 1 foundations are what Phase 2 builds on. Every company-specific design will require you to apply data modeling, streaming, storage selection, quality frameworks, and observability — just in the context of their specific scale and architecture philosophy.

Quick Reference: Phase 1 in One Page

PHASE 1 — QUICK REFERENCE (ONE PAGE)

METHODOLOGY
  Requirements → Estimation → Data Model → Architecture → Trade-offs

ESTIMATION
  1 KB × 1 B events ≈ 1 TB/day.  1 M req/day ≈ 12 req/s average.
  Size partitions so files land at ~128–512 MB each.

DATA MODEL
  Grain first. Three fact types (transaction / periodic snapshot / accumulating).
  SCD2 for historical attributes. Star schema in gold.

PROCESSING
  Batch (Spark, idempotent) + stream (Flink, watermarks). Hybrid is normal.

DISTRIBUTION
  CAP → CP or AP under partition. PACELC for latency vs consistency day-to-day.

STORAGE
  PostgreSQL default. DynamoDB for KV. Cassandra for write-heavy. Warehouse for analytics.

FORMATS
  Parquet + Zstd for analytics storage. Avro for Kafka.
  Delta / Iceberg / Hudi for ACID lakehouse tables.

QUALITY
  Six dimensions. Defense-in-depth. Circuit breakers stop cascade failures.

GOVERNANCE
  Lineage = debuggability. Contracts = schema drift prevention.

CACHING
  Redis (<100 ms) → materialized views (<5 s) → partitioned table (<30 s).

APIS
  REST (external). gRPC (internal, high-perf). GraphQL (flexible clients).

OBSERVABILITY
  Five layers. Volume anomaly detection catches silent failures. SLOs, not only alerts.

TRADE-OFFS
  "I chose X over Y because [requirement]. Trade-off is [cost]. Acceptable because [justification]."

See You at Day 31

Phase 1 is done. You’ve covered 30 days of deep, senior-level data engineering foundations. Phase 2 starts with Meta — where your challenge is thinking at the scale of 3 billion users, petabytes of daily data, and an engineering culture that expects every design decision to be defensible at 10x scale.