Phase 1 Complete | Category: Consolidation
What You’ve Built in 30 Days
You’ve completed the entire foundation layer. Before Phase 2 begins with company-specific deep dives, today is about closing gaps, consolidating patterns, and building the confidence to walk into any system design room and own it.
No new material today. 100% active recall and synthesis.
The Phase 1 Mastery Checklist
| Rating | Meaning |
|---|---|
| Green | Can explain the topic and defend its trade-offs |
| Yellow | Know the concept but struggle with trade-offs |
| Red | Need to revisit |
System Design Methodology (Days 1-3)
- Can you walk through the 5-step framework in under 2 minutes from memory?
- Can you extract functional vs non-functional requirements for a data system and prioritize them P0/P1/P2?
- Can you estimate storage, throughput, and query load for a 1B events/day system from scratch?
- Do your estimations directly drive architectural decisions rather than living in a disconnected math box?
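To rehearse the estimation step, here is a minimal back-of-envelope sketch for the 1B events/day case. The 1 KB event size matches the quick-reference figure; the peak factor of 3, the 6x compression ratio, and the 365-day retention are illustrative assumptions, not numbers from the course.

```python
SECONDS_PER_DAY = 86_400


def estimate(events_per_day, avg_event_bytes, peak_factor=3.0,
             compression_ratio=6.0, retention_days=365):
    """Back-of-envelope sizing. All defaults are illustrative assumptions."""
    raw_bytes_per_day = events_per_day * avg_event_bytes
    avg_events_per_sec = events_per_day / SECONDS_PER_DAY
    compressed_bytes_per_day = raw_bytes_per_day / compression_ratio
    return {
        "raw_tb_per_day": raw_bytes_per_day / 1e12,
        "avg_events_per_sec": avg_events_per_sec,
        "peak_events_per_sec": avg_events_per_sec * peak_factor,
        "retained_tb_compressed": compressed_bytes_per_day * retention_days / 1e12,
    }


est = estimate(events_per_day=1_000_000_000, avg_event_bytes=1_000)
for name, value in est.items():
    print(f"{name}: {value:,.1f}")
```

The point of the exercise is the next sentence you say out loud: the peak events/sec figure sizes the ingestion tier, and the retained terabytes figure sizes storage and partitioning.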
Data Modeling (Days 4-7)
- Can you define grain before drawing any tables?
- Can you choose the right fact table type (transaction vs periodic snapshot vs accumulating snapshot) given a business scenario?
- Can you explain SCD Types 0/1/2/3/6 and choose between them based on the analytical requirement?
- Can you implement SCD2 with a MERGE INTO using the NULL join_key pattern?
- Can you articulate when to normalize vs denormalize and at which lakehouse layer each applies?
- Can you explain Data Vault (hubs/links/satellites) and when it’s overkill?
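If the NULL join_key pattern is a Yellow item, it can help to trace it outside SQL. The sketch below simulates the MERGE in plain Python: it is not the warehouse statement itself, and the schema (`customer_id`, `tier`, `valid_from`, `valid_to`, `is_current`) is hypothetical. It assumes at most one source row per natural key.

```python
from datetime import date


def scd2_merge(dim, source, as_of):
    """Simulate an SCD2 MERGE that uses the NULL merge-key staging pattern."""
    current = {r["customer_id"]: r for r in dim if r["is_current"]}

    # Branch 1: every source row keyed by its natural key (can match the target).
    staged = [{"merge_key": s["customer_id"], **s} for s in source]
    # Branch 2: changed rows again with merge_key=None, which can never match,
    # so the merge is forced to insert them as new versions.
    staged += [
        {"merge_key": None, **s}
        for s in source
        if s["customer_id"] in current
        and current[s["customer_id"]]["tier"] != s["tier"]
    ]

    result = [dict(r) for r in dim]  # copy so the input is not mutated
    targets = {r["customer_id"]: r for r in result if r["is_current"]}
    for s in staged:
        target = targets.get(s["merge_key"]) if s["merge_key"] is not None else None
        if target is None:
            # WHEN NOT MATCHED: insert a new current version.
            result.append({
                "customer_id": s["customer_id"], "tier": s["tier"],
                "valid_from": as_of, "valid_to": None, "is_current": True,
            })
        elif target["tier"] != s["tier"]:
            # WHEN MATCHED AND changed: expire the old version.
            target["valid_to"] = as_of
            target["is_current"] = False
    return result


dim = [{"customer_id": 1, "tier": "free", "valid_from": date(2024, 1, 1),
        "valid_to": None, "is_current": True}]
source = [{"customer_id": 1, "tier": "pro"}, {"customer_id": 2, "tier": "free"}]
merged = scd2_merge(dim, source, as_of=date(2024, 6, 1))
```

Customer 1 appears twice in the staged set: once with its key (expires the old row) and once with `None` (inserts the new version). Customer 2 is new, so its keyed row simply fails to match and is inserted.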
ETL/ELT & Processing (Days 8-11)
- Can you explain push-down compute and why ELT won over ETL?
- Can you describe Spark’s DAG → stages → tasks execution model and explain why shuffles are the #1 performance killer?
- Can you handle data skew with salting, key isolation, and AQE?
- Can you explain the three window types (tumbling, sliding, session) and when to use each?
- Can you explain watermarks and what happens to events beyond the allowed lateness?
- Can you compare Flink vs Spark Streaming with concrete trade-offs?
- Can you explain Lambda vs Kappa and describe the 2026 hybrid that most companies actually use?
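For the watermark question, a toy model is often enough to make the behavior stick. The sketch below is a simplified, single-threaded illustration of tumbling windows with a watermark, not how Flink or Spark implement it: it assumes the watermark is the maximum event time seen so far minus one lateness bound, and it drops any event whose window has already closed.

```python
from collections import defaultdict


def tumbling_counts(events, window_sec, allowed_lateness_sec):
    """Count events per (window_start, key); events arrive as (event_time, key)."""
    counts = defaultdict(int)
    dropped = []
    max_event_time = float("-inf")
    for event_time, key in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - allowed_lateness_sec
        window_start = event_time - event_time % window_sec
        window_end = window_start + window_sec
        if window_end <= watermark:
            # The window closed before this event arrived: too late to count.
            dropped.append((event_time, key))
            continue
        counts[(window_start, key)] += 1
    return dict(counts), dropped


# Event times in seconds, listed in arrival order (note the out-of-order ones).
arrivals = [(10, "a"), (65, "a"), (50, "a"), (130, "a"), (20, "a")]
counts, dropped = tumbling_counts(arrivals, window_sec=60, allowed_lateness_sec=30)
```

The event at time 50 arrives late but is still counted because its window has not passed the watermark; the event at time 20 arrives after the watermark has moved past its window and is dropped. In a real system you would route dropped events to a side output or a batch correction path rather than discard them.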
Distributed Systems (Days 12-14)
- Can you explain CAP theorem as a binary CP/AP choice during a partition (not a trilemma)?
- Can you use PACELC to reason about the normal-operation latency/consistency trade-off?
- Can you apply per-feature consistency (strong for billing, eventual for feeds)?
- Can you explain range vs hash vs consistent hashing and handle hot partitions?
- Can you compare leader-follower vs multi-leader vs leaderless replication?
- Can you explain quorum math (W + R > N) and design for different fault tolerance profiles?
- Can you explain WAL and why it enables CDC?
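To check your understanding of consistent hashing, here is a minimal hash ring with virtual nodes. It is a teaching sketch under assumptions (SHA-256 truncated to 64 bits, 100 virtual nodes per node, node names `a` to `d` are placeholders), not the scheme any particular database uses.

```python
import bisect
import hashlib


def _hash(value):
    # Any stable hash works; SHA-256 truncated to 64 bits is an arbitrary choice.
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._points = []  # sorted list of (hash, node) positions on the ring
        for node in nodes:
            self.add(node)

    def add(self, node):
        self._points.extend((_hash(f"{node}#{i}"), node) for i in range(self.vnodes))
        self._points.sort()

    def remove(self, node):
        self._points = [p for p in self._points if p[1] != node]

    def lookup(self, key):
        # First ring position clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_left(self._points, (_hash(key),))
        return self._points[idx % len(self._points)][1]


keys = [f"user-{i}" for i in range(1000)]
ring = HashRing(["a", "b", "c"])
before = {k: ring.lookup(k) for k in keys}
ring.add("d")
moved = [k for k in keys if ring.lookup(k) != before[k]]
print(f"{len(moved)} of {len(keys)} keys moved after adding a node")
```

The property to state in an interview: adding a node only moves keys onto the new node, roughly 1/N of them, whereas `hash(key) % N` reshuffles almost everything.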
Storage Systems (Days 15-18)
- Can you apply the database selection decision tree (relational vs document vs key-value vs wide-column)?
- Can you explain the three warehouse architectures (BigQuery serverless, Snowflake virtual warehouses, Redshift MPP)?
- Can you compare Delta Lake, Iceberg, and Hudi with their origin story, strength, and best-fit scenario?
- Can you explain why Parquet + Zstd is the default and quantify the storage savings vs JSON?
- Can you explain the medallion architecture (bronze/silver/gold) with the modeling approach at each layer?
Data Quality & Governance (Days 19-23)
- Can you name the six quality dimensions and map each to a pipeline layer?
- Can you design a defense-in-depth quality framework (ingestion → transform → output → consumer)?
- Can you explain column-level vs table-level lineage and their respective use cases?
- Can you explain BACKWARD_TRANSITIVE vs FORWARD compatibility and give examples of breaking vs non-breaking changes?
- Can you design a data contract with schema + semantics + quality rules + SLA?
- Can you explain log-based CDC vs query-based CDC and choose between them based on write rate and delete requirement?
- Can you describe the Stage → Validate → Publish Airflow pattern and explain why it enables safe retries?
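The Stage → Validate → Publish idea can be rehearsed without Airflow. The sketch below uses the local file system as a stand-in for a data lake: the directory layout, the `amount` null check, and the function name `run_interval` are all hypothetical. It relies on `os.replace`, which swaps the file in a single step on POSIX systems when both paths are on the same file system; object stores need a different publish mechanism.

```python
import json
import os
import tempfile
from pathlib import Path


def run_interval(rows, interval, root, min_rows=1):
    """Stage, validate, then publish one interval to an interval-scoped path."""
    root = Path(root)
    staging = root / "_staging" / f"event_date={interval}.json"
    final = root / "published" / f"event_date={interval}" / "part-0.json"

    # Stage: write somewhere consumers never read.
    staging.parent.mkdir(parents=True, exist_ok=True)
    staging.write_text(json.dumps(rows))

    # Validate: fail before anything visible changes.
    staged = json.loads(staging.read_text())
    if len(staged) < min_rows or any(r.get("amount") is None for r in staged):
        staging.unlink()
        raise ValueError(f"validation failed for interval {interval}")

    # Publish: same interval always maps to the same path, so a retry overwrites.
    final.parent.mkdir(parents=True, exist_ok=True)
    os.replace(staging, final)
    return final


with tempfile.TemporaryDirectory() as tmp:
    rows = [{"order_id": 1, "amount": 10.0}]
    first = run_interval(rows, "2024-01-01", tmp)
    retry = run_interval(rows, "2024-01-01", tmp)  # safe: overwrites the same file
    print(first == retry)
```

Retries are safe because a failed run never touches the published path, and a successful rerun lands on exactly the same path as the first attempt.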
System Design Practice (Days 24-29)
- Can you design a real-time analytics dashboard end-to-end (Day 24)?
- Can you design an event logging & telemetry system end-to-end (Day 25)?
- Can you design an e-commerce data warehouse with all three fact table types (Day 26)?
- Can you choose REST vs gRPC vs GraphQL for a given data access pattern (Day 27)?
- Can you design a three-tier caching strategy (Redis → materialized views → partitioned table) (Day 28)?
- Can you design a pipeline observability system with circuit breakers and two-tier alerting (Day 29)?
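For the three-tier caching item, the read path is worth being able to write from memory. This is a minimal sketch with in-memory stand-ins: `TTLCache` plays the role of Redis, a dict plays the materialized view, and `query_base_table` is a hypothetical callable for the partitioned table. None of these names come from a real client library.

```python
import time


class TTLCache:
    """Tiny in-memory stand-in for a Redis-like cache with a per-entry TTL."""

    def __init__(self, ttl_sec, clock=time.monotonic):
        self.ttl_sec = ttl_sec
        self.clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl_sec)


def tiered_lookup(key, hot_cache, materialized_view, query_base_table):
    """Try the hot cache, then the materialized view, then the base table."""
    value = hot_cache.get(key)
    if value is not None:  # simplification: None results are never cached
        return value, "cache"
    if key in materialized_view:
        value, tier = materialized_view[key], "materialized_view"
    else:
        value, tier = query_base_table(key), "base_table"
    hot_cache.set(key, value)  # populate the hot tier for the next reader
    return value, tier


cache = TTLCache(ttl_sec=300)
view = {"revenue:2024-01-01": 125_000}
print(tiered_lookup("revenue:2024-01-01", cache, view, lambda k: 0))
print(tiered_lookup("revenue:2024-01-01", cache, view, lambda k: 0))
```

The first call is served by the materialized view and warms the cache; the second is served by the cache until the TTL expires.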
Phase 1 Pattern Library: The 15 Answers That Win Interviews
These are the high-signal responses that separate senior answers from junior answers. Memorize the pattern, not the words.
1. On grain: “The grain of this fact table is one row per X. I’m stating this explicitly before naming any dimensions because every subsequent decision — which dimensions are valid, whether aggregation is additive — depends on getting the grain right.”
2. On SCD2: “I’d use Type 2 for [attribute] because historical analysis must reflect the state at time of fact. The key implementation detail is the NULL join_key pattern in the MERGE — one UNION branch with the natural key to expire the old row, one with NULL to force-insert the new version.”
3. On consistency: “I wouldn’t make a blanket consistency choice. [Financial feature] needs strong CP/EC — an incorrect read costs real money. [Dashboard feature] is fine with eventual AP/EL — showing numbers from 30 seconds ago doesn’t affect any decision.”
4. On ETL vs ELT: “I default to ELT — land raw, transform in-place using the warehouse’s compute. Push-down compute means data never leaves the warehouse. I apply ETL guardrails selectively: PII masking before storage, schema validation at ingestion, ML inference that can’t be expressed as SQL.”
5. On Lambda vs Kappa: “I don’t think in pure Lambda or pure Kappa. The modern pattern is streaming ingestion into a lakehouse with batch transforms on top. Streaming gives freshness; batch gives correctness for complex analytics. They share the same Iceberg storage — no merge layer problem.”
6. On database selection: “I start with PostgreSQL unless requirements push me elsewhere. It handles 64 TiB and 20K TPS. I switch to DynamoDB when the access pattern is purely key-value with < 10ms SLA. Cassandra when write throughput exceeds 5K/sec with time-series patterns. The choice is always access-pattern first, not technology first.”
7. On file formats: “I default to Parquet + Zstd for analytics storage — 6-8x smaller than JSON, column pruning, predicate pushdown. Avro for Kafka messages — schema evolution, compact binary, Schema Registry integration. Never JSON for long-term analytical storage.”
8. On partitioning: “I choose the partition key that matches the dominant WHERE clause — usually event_date for time-series data. I state the expected partition size (100MB-1GB target), the write pattern it supports, and why the cardinality is appropriate. Hot partition mitigation: salting or composite key.”
9. On idempotency: “Three mechanisms: interval-scoped output paths (same interval = same path = overwrite), MERGE INTO instead of INSERT (deduplication by business key), and deterministic logic (use data_interval_start, not now()). A pipeline that isn’t idempotent can’t be safely retried.”
10. On schema evolution: “BACKWARD_TRANSITIVE in the Schema Registry blocks breaking changes before they reach production. Adding fields with defaults is always safe. Renames and removals require the expand-contract pattern: add new field → migrate consumers → remove old field. Never force a breaking change in a single deploy.”
11. On CDC: “I default to log-based CDC (Debezium reading WAL/binlog) for any table with > 2K writes/sec or where deletes must be captured. Query-based polling is acceptable for slowly-changing reference tables where simplicity matters more than completeness. The crossover point is ~2K-5K writes/sec.”
12. On observability: “Most teams monitor only Layer 1 (job success/fail) and miss silent failures. The most dangerous failure is a pipeline that runs successfully but writes 0 rows. Volume anomaly detection (> 30% deviation from 7-day rolling average) catches this. Circuit breakers halt downstream propagation before bad data reaches dashboards.”
13. On caching: “I design three tiers: Redis (sub-100ms, 5-30 min TTL) for the hottest queries, materialized views (sub-5s, hourly refresh) for common aggregation patterns, partitioned base tables (sub-30s) for ad-hoc. Pre-computation when the query runs 100+/day; on-demand when it’s flexible or rare.”
14. On API design: “REST for public/external (caching, broad support), gRPC for internal high-throughput (binary encoding, HTTP/2), GraphQL for flexible client queries (analyst self-serve). Cursor-based pagination for anything > 100K records — offset degrades to O(N) and causes consistency bugs on inserts.”
15. On trade-offs: The universal pattern: “I chose X over Y because [specific requirement from Step 1]. The trade-off is [concrete downside of X]. That’s acceptable because [justification tied to the requirements]. If the requirement were different — specifically if [condition] — I’d switch to Y.”
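Pattern 14's claim about offset pagination is easy to demonstrate. The sketch below uses an in-memory SQLite table named `events` (a hypothetical schema) and shows one row being inserted between two page fetches: the cursor-based second page is unaffected, while the offset-based second page repeats a row the client has already seen.

```python
import sqlite3


def fetch_page(conn, after_id, page_size):
    """Cursor (keyset) pagination: seek past the last id the client saw."""
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()
    # A short page means there is nothing left to fetch.
    next_cursor = rows[-1][0] if len(rows) == page_size else None
    return rows, next_cursor


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i, f"event-{i}") for i in range(10, 101, 10)],
)

page1, cursor = fetch_page(conn, after_id=0, page_size=4)

# A row lands earlier in the sort order between the two page requests.
conn.execute("INSERT INTO events VALUES (5, 'late-insert')")

page2, cursor = fetch_page(conn, after_id=cursor, page_size=4)
offset_page2 = conn.execute(
    "SELECT id, payload FROM events ORDER BY id LIMIT 4 OFFSET 4"
).fetchall()

print([r[0] for r in page1], [r[0] for r in page2], [r[0] for r in offset_page2])
```

With the cursor, the database seeks directly to `id > 40` through the primary key index. With `OFFSET`, it must walk past the skipped rows, and the shifted offset returns id 40 a second time.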
The 30-Minute Timed Practice (Do This Now)
Set three separate 10-minute timers. For each design, speak out loud as if you’re at a whiteboard. Draw the architecture on paper.
Design 1 (10 min): “Design the data infrastructure for a ride-sharing surge pricing system.”
Constraint: Real-time pricing must update within 30 seconds of demand changes.
Target elements: CDC or event streaming from ride request system → Flink for geospatial aggregation (demand per zone over a 5-minute sliding window that advances at least every 30 seconds, so prices meet the freshness constraint) → Redis for sub-second pricing lookup → batch pipeline to BigQuery for pricing model analysis.
Design 2 (10 min): “Design a data warehouse for a music streaming platform.”
Constraint: Analysts need track-level, artist-level, and user-level analysis.
Target elements: Three fact tables (fact_stream — transaction, fact_daily_listeners — periodic snapshot, fact_artist_release_lifecycle — accumulating snapshot). SCD2 on dim_track (genre, price) and dim_user (subscription tier). Star schema in gold layer.
Design 3 (10 min): “Your company’s most important daily finance report was wrong this morning. Walk me through how you’d diagnose and prevent recurrence.”
Constraint: You have 30 minutes before the CFO reviews the numbers.
Target elements: Five observability layers. Work backward from serving layer (query the table directly). Check volume anomaly. Check Airflow logs for errors. Check CDC / source completeness. Root cause in 30 minutes because of structured logs + lineage. Prevention: circuit breaker on the transform step validating row count within 15% of 7-day average.
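The prevention step in Design 3 can be sketched in a few lines. This is a minimal illustration of a row-count circuit breaker using the 15% tolerance against a 7-day average from the target elements above; the exception name `CircuitOpen` and the function name are hypothetical, and a real pipeline would wire the failure into the orchestrator so downstream tasks do not run.

```python
from statistics import mean


class CircuitOpen(Exception):
    """Raised to stop downstream tasks from consuming a suspicious load."""


def check_row_count(todays_rows, previous_counts, tolerance=0.15):
    """Return the deviation from baseline, or raise if it exceeds the tolerance."""
    baseline = mean(previous_counts)
    if baseline == 0:
        raise CircuitOpen("no baseline volume to compare against")
    deviation = abs(todays_rows - baseline) / baseline
    if deviation > tolerance:
        raise CircuitOpen(
            f"row count {todays_rows} deviates {deviation:.0%} "
            f"from 7-day baseline {baseline:.0f}"
        )
    return deviation


last_7_days = [980, 1010, 1000, 995, 1020, 990, 1005]
print(check_row_count(950, last_7_days))  # within tolerance, pipeline continues

try:
    check_row_count(0, last_7_days)  # the "succeeded but wrote 0 rows" case
except CircuitOpen as exc:
    print(f"halted: {exc}")
```

Note that the check is symmetric: a sudden doubling of rows (for example from a duplicated load) trips the breaker just like a drop to zero.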
Self-Assessment: What to Do With Your Gaps
After the checklist and practice, you likely have 3-5 Red items. Here’s how to close them before Phase 2 starts:
For Data Modeling gaps: Re-read Days 4-7. Draw the three fact table types on paper with concrete e-commerce examples. Practice stating the grain out loud for five different business processes.
For Streaming/Processing gaps: Re-read Days 9-11. Draw the Spark DAG → stages → tasks diagram. Walk through the Lambda vs Kappa decision for three scenarios: fraud detection (Kappa), financial reporting (Lambda-like), ride-sharing real-time pricing (hybrid).
For Distributed Systems gaps: Re-read Days 12-14. Practice the CAP decision for each database in the comparison table. Draw the quorum formula (W + R > N) and work through three configurations for N=5.
For the full system designs: Re-read Days 24-26. Attempt to redraw each architecture from scratch on paper without looking. If you can’t fill in the boxes without the notes, the design isn’t yours yet.
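For the quorum exercise with N=5, a small helper makes the three configurations concrete. The function name and the output keys are illustrative; the only rule it encodes is W + R > N from the checklist, plus the arithmetic of how many replicas can be down while writes or reads still reach their quorum.

```python
def quorum_profile(n, w, r):
    """Summarize what a (N, W, R) configuration guarantees and tolerates."""
    return {
        # Read and write sets must overlap in at least one replica.
        "overlapping_quorums": w + r > n,
        # Replicas that can be unavailable while the operation still succeeds.
        "writes_tolerate_down": n - w,
        "reads_tolerate_down": n - r,
    }


configs = {
    "balanced (W=3, R=3)": (5, 3, 3),
    "read-optimized (W=5, R=1)": (5, 5, 1),
    "fast but weak (W=1, R=1)": (5, 1, 1),
}
for label, (n, w, r) in configs.items():
    print(label, quorum_profile(n, w, r))
```

The balanced setup survives two failed replicas for both reads and writes. The read-optimized setup gives cheap reads, but a single failed replica blocks writes. The last setup is the most available and gives no overlap guarantee, so reads can return stale data.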
Phase 1 → Phase 2 Transition
You’ve covered the fundamentals that every senior DE must know regardless of company. Starting Day 31, Phase 2 shifts to company-specific patterns:
- Day 31: Meta (scale thinking, petabyte data infrastructure, social graph)
- Day 32: Meta system design practice: News Feed data pipeline
- Day 33: Netflix (Iceberg-first architecture, ownership culture, fault tolerance)
- Day 34: Netflix system design practice: Recommendation pipeline
- Day 35: Google (GCP-native design: BigQuery + Dataflow + Pub/Sub)
- Day 36: Google system design practice: Large-scale search analytics
- Day 37: OpenAI (LLM training data pipelines, AI-native infrastructure)
- Day 38: OpenAI design practice: Training data pipeline for LLMs
- Day 39: Anthropic (progressive complexity, safety-first, distributed search)
- Day 40: Anthropic design practice: Distributed search at billion-document scale
The Phase 1 foundations are what Phase 2 builds on. Every company-specific design will require you to apply data modeling, streaming, storage selection, quality frameworks, and observability — just in the context of their specific scale and architecture philosophy.
Quick Reference: Phase 1 in One Page
METHODOLOGY
Requirements → Estimation → Data Model → Architecture → Trade-offs
ESTIMATION
1 KB × 1 B events ≈ 1 TB/day. 1 M req/day ≈ 12 req/s average.
Size partitions so that files land at roughly 128–512 MB each.
DATA MODEL
Grain first. Three fact types (transaction / periodic snapshot / accumulating).
SCD2 for historical attributes. Star schema in gold.
PROCESSING
Batch (Spark, idempotent) + stream (Flink, watermarks). Hybrid is normal.
DISTRIBUTION
CAP → CP or AP under partition. PACELC for latency vs consistency day-to-day.
STORAGE
PostgreSQL default. DynamoDB for KV. Cassandra for write-heavy. Warehouse for analytics.
FORMATS
Parquet + Zstd for analytics storage. Avro for Kafka.
Delta / Iceberg / Hudi for ACID lakehouse tables.
QUALITY
Six dimensions. Defense-in-depth. Circuit breakers stop cascade failures.
GOVERNANCE
Lineage = debuggability. Contracts = schema drift prevention.
CACHING
Redis (<100 ms) → materialized views (<5 s) → partitioned table (<30 s).
APIS
REST (external). gRPC (internal, high-perf). GraphQL (flexible clients).
OBSERVABILITY
Five layers. Volume anomaly detection catches silent failures. SLOs, not only alerts.
TRADE-OFFS
"I chose X over Y because [requirement]. Trade-off is [cost]. Acceptable because [justification]."
See You at Day 31
Phase 1 is done. You’ve covered 30 days of deep, senior-level data engineering foundations. Phase 2 starts with Meta — where your challenge is thinking at the scale of 3 billion users, petabytes of daily data, and an engineering culture that expects every design decision to be defensible at 10x scale.