Day 30 — Phase 1 review and self-assessment

Phase 1 Complete | Category: Consolidation

What You’ve Built in 30 Days

You’ve completed the entire foundation layer. Before Phase 2 begins with company-specific deep dives, today is about closing gaps, consolidating patterns, and building the confidence to walk into any system design room and own it.

No new material today. 100% active recall and synthesis.

The Phase 1 Mastery Checklist

Rate each item below with one of three colors:
Green: can explain and defend the trade-offs.
Yellow: know the concept but struggle with the trade-offs.
Red: need to revisit.

System Design Methodology (Days 1-3)

Can you walk through the 5-step framework in under 2 minutes from memory?
Can you extract functional vs non-functional requirements for a data system and prioritize them P0/P1/P2?
Can you estimate storage, throughput, and query load for a 1B events/day system from scratch?
Do your estimations directly drive architectural decisions (not live in a disconnected math box)?
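If the estimation item is Yellow or Red, it can help to write the arithmetic down once. The following is a minimal sketch, assuming a 1 KB average event (the figure used in the quick reference) and a hypothetical 3x peak-to-average factor. The function name and the peak factor are illustrative assumptions, not values from earlier days.

```python
SECONDS_PER_DAY = 86_400


def estimate_load(events_per_day: int, avg_event_bytes: int, peak_factor: float = 3.0) -> dict:
    """Back-of-envelope throughput and raw storage for an event pipeline."""
    avg_eps = events_per_day / SECONDS_PER_DAY
    raw_bytes_per_day = events_per_day * avg_event_bytes
    return {
        "avg_events_per_sec": avg_eps,
        "peak_events_per_sec": avg_eps * peak_factor,  # assumed peak factor
        "raw_tb_per_day": raw_bytes_per_day / 1e12,    # decimal terabytes
        "raw_pb_per_year": raw_bytes_per_day * 365 / 1e15,
    }


estimate = estimate_load(events_per_day=1_000_000_000, avg_event_bytes=1_000)
for name, value in estimate.items():
    print(f"{name}: {value:,.2f}")
```

The numbers only earn their place once they drive a decision: roughly 11.6K events/s on average (about 35K/s at the assumed peak) argues for a partitioned log rather than a single database writer, and about 1 TB/day of raw data argues for compressed columnar storage partitioned by date.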

Data Modeling (Days 4-7)

Can you define grain before drawing any tables?
Can you choose the right fact table type (transaction vs periodic snapshot vs accumulating snapshot) given a business scenario?
Can you explain SCD Types 0/1/2/3/6 and choose between them based on the analytical requirement?
Can you implement SCD2 with a MERGE INTO using the NULL join_key pattern?
Can you articulate when to normalize vs denormalize and at which lakehouse layer each applies?
Can you explain Data Vault (hubs/links/satellites) and when it’s overkill?
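For the SCD2 item, the NULL join_key pattern is easier to defend once you have traced it by hand. In a warehouse this would be a SQL MERGE INTO; the sketch below only simulates the same semantics on in-memory lists. The column names (`user_id`, `tier`, `merge_key`) are hypothetical, and it assumes at most one update per key per batch.

```python
from datetime import date


def build_staged(dim: list, updates: list) -> list:
    """Build the source side of the MERGE: the UNION of two branches."""
    current = {row["user_id"]: row for row in dim if row["is_current"]}
    staged = []
    for upd in updates:
        # Branch 1: merge_key = natural key. Matches the current row so it
        # can be expired, or inserts a key that has never been seen.
        staged.append({"merge_key": upd["user_id"], **upd})
        old = current.get(upd["user_id"])
        if old is not None and old["tier"] != upd["tier"]:
            # Branch 2: merge_key = NULL. NULL never equals anything, so this
            # copy always falls into NOT MATCHED and becomes the new version.
            staged.append({"merge_key": None, **upd})
    return staged


def merge_scd2(dim: list, staged: list) -> list:
    """Simulate MERGE ... ON dim.user_id = staged.merge_key AND dim.is_current."""
    current = {row["user_id"]: row for row in dim if row["is_current"]}
    to_expire, to_insert = [], []
    for src in staged:  # decide every action against the pre-merge state
        target = None if src["merge_key"] is None else current.get(src["merge_key"])
        if target is None:
            to_insert.append(src)                              # WHEN NOT MATCHED
        elif target["tier"] != src["tier"]:
            to_expire.append((target, src["effective_date"]))  # WHEN MATCHED AND changed
    for target, end_date in to_expire:
        target["valid_to"], target["is_current"] = end_date, False
    for src in to_insert:
        dim.append({"user_id": src["user_id"], "tier": src["tier"],
                    "valid_from": src["effective_date"], "valid_to": None,
                    "is_current": True})
    return dim


dim_user = [{"user_id": 1, "tier": "free", "valid_from": date(2024, 1, 1),
             "valid_to": None, "is_current": True}]
updates = [{"user_id": 1, "tier": "premium", "effective_date": date(2024, 6, 1)},
           {"user_id": 2, "tier": "free", "effective_date": date(2024, 6, 1)}]
merge_scd2(dim_user, build_staged(dim_user, updates))
```

User 1 ends up with an expired "free" row and a current "premium" row; user 2 gets a single current row. Running the same batch a second time changes nothing, which is the property that makes the pattern safe to retry.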

ETL/ELT & Processing (Days 8-11)

Can you explain push-down compute and why ELT won over ETL?
Can you describe Spark’s DAG → stages → tasks execution model and explain why shuffles are the #1 performance killer?
Can you handle data skew with salting, key isolation, and AQE?
Can you explain the three window types (tumbling, sliding, session) and when to use each?
Can you explain watermarks and what happens to events beyond the allowed lateness?
Can you compare Flink vs Spark Streaming with concrete trade-offs?
Can you explain Lambda vs Kappa and describe the 2026 hybrid that most companies actually use?
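For the window and watermark items, a toy model makes the "what happens to late events" answer concrete. This is a simplified sketch, not how any specific engine implements it: it assumes integer timestamps in seconds, advances the watermark on every event (real engines typically advance it periodically or per micro-batch), and simply drops late events rather than routing them elsewhere. The function names are hypothetical.

```python
from collections import defaultdict


def window_start(event_time: int, size: int) -> int:
    """Tumbling windows: each event belongs to exactly one fixed-size bucket."""
    return event_time - (event_time % size)


def count_with_watermark(events, window_size: int, lateness: int):
    """Count events per (window, key); drop events that arrive after the
    watermark has passed the end of their window."""
    counts = defaultdict(int)
    dropped = []
    max_event_time = None
    for event_time, key in events:  # events are in ARRIVAL order
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - lateness
        start = window_start(event_time, window_size)
        if start + window_size <= watermark:
            dropped.append((event_time, key))  # window already finalized
            continue
        counts[(start, key)] += 1
    return dict(counts), dropped


arrivals = [(100, "zone-a"), (130, "zone-a"), (200, "zone-a"),
            (175, "zone-a"),   # late, but its window [120, 180) is still open
            (105, "zone-a")]   # too late: window [60, 120) closed at watermark 170
counts, dropped = count_with_watermark(arrivals, window_size=60, lateness=30)
```

The event at time 175 is out of order but still counted, because the watermark (170) has not passed the end of its window. The event at time 105 is dropped, because its window ended at 120.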

Distributed Systems (Days 12-14)

Can you explain CAP theorem as a binary CP/AP choice during a network partition (not a trilemma)?
Can you use PACELC to reason about the normal-operation latency/consistency trade-off?
Can you apply per-feature consistency (strong for billing, eventual for feeds)?
Can you explain range vs hash vs consistent hashing and handle hot partitions?
Can you compare leader-follower vs multi-leader vs leaderless replication?
Can you explain quorum math (W + R > N) and design for different fault tolerance profiles?
Can you explain WAL and why it enables CDC?
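For the partitioning item, the property worth demonstrating is that consistent hashing moves only a fraction of keys when a node joins. The following is a minimal sketch of a hash ring with virtual nodes. The class name, the node names, and the choice of MD5 with 100 virtual nodes per physical node are all illustrative assumptions.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # MD5 is used only as a stable, well-spread hash, not for security.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes: int = 100):
        self._vnodes = vnodes
        self._hashes = []   # sorted positions on the ring
        self._owners = {}   # ring position -> physical node
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each physical node gets many ring positions to smooth the load.
        for i in range(self._vnodes):
            position = _hash(f"{node}#{i}")
            bisect.insort(self._hashes, position)
            self._owners[position] = node

    def get_node(self, key: str) -> str:
        # Walk clockwise to the first position after the key, wrapping around.
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._hashes)
        return self._owners[self._hashes[idx]]


keys = [f"user-{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.get_node(k) for k in keys}
ring.add_node("node-d")
moved = [k for k in keys if ring.get_node(k) != before[k]]
print(f"{len(moved)} of {len(keys)} keys moved")
```

Every key that moves lands on the new node, and the rest stay where they were. With plain `hash(key) % N`, changing N from 3 to 4 would instead remap most keys.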

Storage Systems (Days 15-18)

Can you apply the database selection decision tree (relational vs document vs key-value vs wide-column)?
Can you explain the three warehouse architectures (BigQuery serverless, Snowflake virtual warehouses, Redshift MPP)?
Can you compare Delta Lake, Iceberg, and Hudi with their origin story, strength, and best-fit scenario?
Can you explain why Parquet + Zstd is the default and quantify the storage savings vs JSON?
Can you explain the medallion architecture (bronze/silver/gold) with the modeling approach at each layer?

Data Quality & Governance (Days 19-23)

Can you name the six quality dimensions and map each to a pipeline layer?
Can you design a defense-in-depth quality framework (ingestion → transform → output → consumer)?
Can you explain column-level vs table-level lineage and their respective use cases?
Can you explain BACKWARD_TRANSITIVE vs FORWARD compatibility and give examples of breaking vs non-breaking changes?
Can you design a data contract with schema + semantics + quality rules + SLA?
Can you explain log-based CDC vs query-based CDC and choose between them based on write rate and delete requirement?
Can you describe the Stage → Validate → Publish Airflow pattern and explain why it enables safe retries?
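For the Stage → Validate → Publish item, the reason retries are safe is that every step writes to a path derived from the run's interval and overwrites rather than appends. Below is a minimal sketch in plain Python, with a dict standing in for object storage. The path layout, the `order_id` check, and the function names are hypothetical; in Airflow the interval would come from the run's `data_interval_start`.

```python
from datetime import datetime, timezone


class ValidationError(Exception):
    """Raised when staged data fails a quality gate."""


def interval_path(prefix: str, data_interval_start: datetime) -> str:
    # Derived from the scheduled interval, never from now(), so a retry
    # of the same run resolves to the same path.
    return f"{prefix}/dt={data_interval_start:%Y-%m-%d}"


def stage_validate_publish(store: dict, rows: list, data_interval_start: datetime) -> str:
    staging = interval_path("staging/orders", data_interval_start)
    published = interval_path("published/orders", data_interval_start)

    store[staging] = list(rows)                      # Stage: overwrite, not append
    staged = store[staging]
    if not staged or any(r.get("order_id") is None for r in staged):
        raise ValidationError(f"quality gate failed for {staging}")  # Validate
    store[published] = staged                        # Publish: only validated data
    return published


store = {}
interval = datetime(2024, 6, 1, tzinfo=timezone.utc)
good_rows = [{"order_id": 1, "amount": 20}, {"order_id": 2, "amount": 35}]
stage_validate_publish(store, good_rows, interval)
snapshot = dict(store)
stage_validate_publish(store, good_rows, interval)   # retry: same end state
```

A retry leaves the store identical to the first run, and a run that fails validation never touches the published path, so consumers keep reading the last good version.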

System Design Practice (Days 24-29)

Can you design a real-time analytics dashboard end-to-end (Day 24)?
Can you design an event logging & telemetry system end-to-end (Day 25)?
Can you design an e-commerce data warehouse with all three fact table types (Day 26)?
Can you choose REST vs gRPC vs GraphQL for a given data access pattern (Day 27)?
Can you design a three-tier caching strategy (Redis → materialized views → partitioned table) (Day 28)?
Can you design a pipeline observability system with circuit breakers and two-tier alerting (Day 29)?
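For the caching item, the read path is worth being able to sketch from memory: try the fastest tier, fall through on a miss, and backfill the cache on the way out. The following is a minimal sketch under stated assumptions: `TTLCache` is a hypothetical in-process stand-in for Redis, a dict stands in for the materialized view, and a callable stands in for the query against the partitioned base table.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a Redis cache with per-key expiry."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]   # expired: treat as a miss
            return None
        return value

    def set(self, key, value) -> None:
        self._data[key] = (value, self._clock() + self._ttl)


def tiered_get(key, cache: TTLCache, materialized_view: dict, query_base_table):
    """Return (value, tier_that_answered)."""
    cached = cache.get(key)
    if cached is not None:
        return cached, "cache"
    if key in materialized_view:
        value, tier = materialized_view[key], "materialized_view"
    else:
        value, tier = query_base_table(key), "base_table"
    cache.set(key, value)   # backfill so the next read is a cache hit
    return value, tier


fake_now = [0.0]
cache = TTLCache(ttl_seconds=300, clock=lambda: fake_now[0])
view = {"daily_revenue": 42}
base_calls = []


def slow_query(key):
    base_calls.append(key)
    return 7
```

Injecting the clock keeps the expiry logic testable without sleeping. Whether ad-hoc base-table results should be cached at all is a design choice; this sketch caches them for simplicity.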

Phase 1 Pattern Library: The 15 Answers That Win Interviews

These are the high-signal responses that separate senior answers from junior answers. Memorize the pattern, not the words.

1. On grain: “The grain of this fact table is one row per X. I’m stating this explicitly before naming any dimensions because every subsequent decision — which dimensions are valid, whether aggregation is additive — depends on getting the grain right.”

2. On SCD2: “I’d use Type 2 for [attribute] because historical analysis must reflect the state at time of fact. The key implementation detail is the NULL join_key pattern in the MERGE — one UNION branch with the natural key to expire the old row, one with NULL to force-insert the new version.”

3. On consistency: “I wouldn’t make a blanket consistency choice. [Financial feature] needs strong CP/EC — an incorrect read costs real money. [Dashboard feature] is fine with eventual AP/EL — showing numbers from 30 seconds ago doesn’t affect any decision.”

4. On ETL vs ELT: “I default to ELT — land raw, transform in-place using the warehouse’s compute. Push-down compute means data never leaves the warehouse. I apply ETL guardrails selectively: PII masking before storage, schema validation at ingestion, ML inference that can’t be expressed as SQL.”

5. On Lambda vs Kappa: “I don’t think in pure Lambda or pure Kappa. The modern pattern is streaming ingestion into a lakehouse with batch transforms on top. Streaming gives freshness; batch gives correctness for complex analytics. They share the same Iceberg storage — no merge layer problem.”

6. On database selection: “I start with PostgreSQL unless requirements push me elsewhere. It handles 64 TiB and 20K TPS. I switch to DynamoDB when the access pattern is purely key-value with < 10ms SLA. Cassandra when write throughput exceeds 5K/sec with time-series patterns. The choice is always access-pattern first, not technology first.”

7. On file formats: “I default to Parquet + Zstd for analytics storage — 6-8x smaller than JSON, column pruning, predicate pushdown. Avro for Kafka messages — schema evolution, compact binary, Schema Registry integration. Never JSON for long-term analytical storage.”

8. On partitioning: “I choose the partition key that matches the dominant WHERE clause — usually event_date for time-series data. I state the expected partition size (100MB-1GB target), the write pattern it supports, and why the cardinality is appropriate. Hot partition mitigation: salting or composite key.”

9. On idempotency: “Three mechanisms: interval-scoped output paths (same interval = same path = overwrite), MERGE INTO instead of INSERT (deduplication by business key), and deterministic logic (use data_interval_start, not now()). A pipeline that isn’t idempotent can’t be safely retried.”

10. On schema evolution: “BACKWARD_TRANSITIVE in the Schema Registry blocks breaking changes before they reach production. Adding fields with defaults is always safe. Renames and removals require the expand-contract pattern: add new field → migrate consumers → remove old field. Never force a breaking change in a single deploy.”

11. On CDC: “I default to log-based CDC (Debezium reading WAL/binlog) for any table with > 2K writes/sec or where deletes must be captured. Query-based polling is acceptable for slowly-changing reference tables where simplicity matters more than completeness. The crossover point is ~2K-5K writes/sec.”

12. On observability: “Most teams monitor only Layer 1 (job success/fail) and miss silent failures. The most dangerous failure is a pipeline that runs successfully but writes 0 rows. Volume anomaly detection (> 30% deviation from 7-day rolling average) catches this. Circuit breakers halt downstream propagation before bad data reaches dashboards.”

13. On caching: “I design three tiers: Redis (sub-100ms, 5-30 min TTL) for the hottest queries, materialized views (sub-5s, hourly refresh) for common aggregation patterns, partitioned base tables (sub-30s) for ad-hoc. Pre-computation when the query runs 100+/day; on-demand when it’s flexible or rare.”

14. On API design: “REST for public/external (caching, broad support), gRPC for internal high-throughput (binary encoding, HTTP/2), GraphQL for flexible client queries (analyst self-serve). Cursor-based pagination for anything > 100K records — offset degrades to O(N) and causes consistency bugs on inserts.”

15. On trade-offs: The universal pattern: “I chose X over Y because [specific requirement from Step 1]. The trade-off is [concrete downside of X]. That’s acceptable because [justification tied to the requirements]. If the requirement were different — specifically if [condition] — I’d switch to Y.”
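Pattern 14's claim that offset pagination causes consistency bugs on inserts is easy to show on a small example. The sketch below models a table as a sorted list of ids. The function names are hypothetical, and the in-memory slices do not reproduce the O(N) scan cost a database pays for a large OFFSET; they only show the duplicate-row behavior.

```python
import bisect


def page_by_offset(ids: list, offset: int, limit: int) -> list:
    # Equivalent to: ORDER BY id LIMIT :limit OFFSET :offset
    return ids[offset:offset + limit]


def page_by_cursor(ids: list, after_id, limit: int):
    # Equivalent to: WHERE id > :after_id ORDER BY id LIMIT :limit
    start = 0 if after_id is None else bisect.bisect_right(ids, after_id)
    page = ids[start:start + limit]
    return page, (page[-1] if page else None)


ids = [10, 20, 30, 40]
first_offset_page = page_by_offset(ids, 0, 2)
first_cursor_page, cursor = page_by_cursor(ids, None, 2)

bisect.insort(ids, 5)  # a new row lands before the rows already served

second_offset_page = page_by_offset(ids, 2, 2)
second_cursor_page, _ = page_by_cursor(ids, cursor, 2)
print(second_offset_page, second_cursor_page)
```

After the insert, the offset-based second page repeats id 20, which the client already saw, while the cursor-based page continues from id 30.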

The 30-Minute Timed Practice (Do This Now)

Set three separate 10-minute timers. For each design, speak out loud as if you’re at a whiteboard. Draw the architecture on paper.

Design 1 (10 min): “Design the data infrastructure for a ride-sharing surge pricing system.”

Constraint: Real-time pricing must update within 30 seconds of demand changes.

Target elements: CDC or event streaming from ride request system → Flink for geospatial aggregation (demand per zone per 5-min window) → Redis for sub-second pricing lookup → batch pipeline to BigQuery for pricing model analysis.

Design 2 (10 min): “Design a data warehouse for a music streaming platform.”

Constraint: Analysts need track-level, artist-level, and user-level analysis.

Target elements: Three fact tables (fact_stream — transaction, fact_daily_listeners — periodic snapshot, fact_artist_release_lifecycle — accumulating snapshot). SCD2 on dim_track (genre, price) and dim_user (subscription tier). Star schema in gold layer.

Design 3 (10 min): “Your company’s most important daily finance report was wrong this morning. Walk me through how you’d diagnose and prevent recurrence.”

Constraint: You have 30 minutes before the CFO reviews the numbers.

Target elements: Use the five observability layers. Work backward from the serving layer: query the table directly, check for a volume anomaly, check the Airflow logs for errors, then check CDC and source completeness. Structured logs and lineage are what make a root cause achievable within 30 minutes. Prevention: a circuit breaker on the transform step that validates the row count is within 15% of the 7-day average.
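The prevention step in Design 3 can be sketched as a gate that raises before anything downstream runs. This is a minimal sketch: the 15% tolerance and 7-day baseline come from the target elements above, while the exception name, function name, and sample history are hypothetical.

```python
from statistics import mean


class CircuitOpen(Exception):
    """Raised to stop downstream tasks from consuming suspect data."""


def row_count_gate(todays_rows: int, last_7_days: list, tolerance: float = 0.15) -> float:
    """Return the relative deviation, or raise if it exceeds the tolerance."""
    baseline = mean(last_7_days)
    if baseline == 0:
        raise CircuitOpen("no baseline volume; refusing to publish")
    deviation = abs(todays_rows - baseline) / baseline
    if deviation > tolerance:
        raise CircuitOpen(
            f"row count {todays_rows} deviates {deviation:.0%} "
            f"from 7-day average {baseline:.0f}"
        )
    return deviation


history = [1000, 1020, 980, 1010, 990, 1005, 995]  # illustrative daily row counts
```

A job that "succeeds" with 0 rows trips the gate, because a 100% deviation is far outside the tolerance. Raising an exception, rather than logging a warning, is what keeps the bad data from reaching the finance report.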

Self-Assessment: What to Do With Your Gaps

After the checklist and practice, you likely have 3-5 Red items. Here’s how to close them before Phase 2 starts:

For Data Modeling gaps: Re-read Days 4-7. Draw the three fact table types on paper with concrete e-commerce examples. Practice stating the grain out loud for five different business processes.

For Streaming/Processing gaps: Re-read Days 9-11. Draw the Spark DAG → stages → tasks diagram. Walk through the Lambda vs Kappa decision for three scenarios: fraud detection (Kappa), financial reporting (Lambda-like), ride-sharing real-time pricing (hybrid).

For Distributed Systems gaps: Re-read Days 12-14. Practice the CAP decision for each database in the comparison table. Draw the quorum formula (W + R > N) and work through three configurations for N=5.
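One way to work through the N=5 configurations is to tabulate them. The sketch below assumes a leaderless setup where a write needs W acknowledgements and a read needs R responses. The function name and dictionary keys are hypothetical, and a fourth configuration is included as a counterexample that breaks the W + R > N rule.

```python
def quorum_profile(n: int, w: int, r: int) -> dict:
    """Summarize a leaderless replication setting with N replicas."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("W and R must be between 1 and N")
    return {
        # W + R > N guarantees every read set overlaps the latest write set.
        "read_overlaps_latest_write": w + r > n,
        "writes_survive_node_failures": n - w,
        "reads_survive_node_failures": n - r,
    }


configs = {
    "balanced (W=3, R=3)": quorum_profile(5, 3, 3),
    "read-optimized (W=5, R=1)": quorum_profile(5, 5, 1),
    "write-optimized (W=1, R=5)": quorum_profile(5, 1, 5),
    "fast but stale (W=2, R=2)": quorum_profile(5, 2, 2),
}
for name, profile in configs.items():
    print(name, profile)
```

The trade-off to say out loud: the read-optimized setting gives the cheapest reads but cannot accept writes if even one replica is down, while the W=2, R=2 setting tolerates three failures on either path at the cost of possibly reading stale data.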

For the full system designs: Re-read Days 24-26. Attempt to redraw each architecture from scratch on paper without looking. If you can’t fill in the boxes without the notes, the design isn’t yours yet.

Phase 1 → Phase 2 Transition

You’ve covered the fundamentals that every senior DE must know regardless of company. Starting Day 31, Phase 2 shifts to company-specific patterns:

Day 31: Meta — scale thinking, petabyte data infrastructure, social graph
Day 32: Meta system design practice — News Feed data pipeline
Day 33: Netflix — Iceberg-first architecture, ownership culture, fault tolerance
Day 34: Netflix system design practice — Recommendation pipeline
Day 35: Google — GCP-native design (BigQuery + Dataflow + Pub/Sub)
Day 36: Google system design practice — Large-scale search analytics
Day 37: OpenAI — LLM training data pipelines, AI-native infrastructure
Day 38: OpenAI design practice — Training data pipeline for LLMs
Day 39: Anthropic — progressive complexity, safety-first, distributed search
Day 40: Anthropic design practice — Distributed search at billion-document scale

The Phase 1 foundations are what Phase 2 builds on. Every company-specific design will require you to apply data modeling, streaming, storage selection, quality frameworks, and observability — just in the context of their specific scale and architecture philosophy.

Quick Reference: Phase 1 in One Page

PHASE 1 — QUICK REFERENCE (ONE PAGE)

METHODOLOGY
  Requirements → Estimation → Data Model → Architecture → Trade-offs

ESTIMATION
  1 KB × 1 B events ≈ 1 TB/day.  1 M req/day ≈ 12 req/s average.
  Size partitions so files land at ~128–512 MB each.

DATA MODEL
  Grain first. Three fact types (transaction / periodic snapshot / accumulating).
  SCD2 for historical attributes. Star schema in gold.

PROCESSING
  Batch (Spark, idempotent) + stream (Flink, watermarks). Hybrid is normal.

DISTRIBUTION
  CAP → CP or AP under partition. PACELC for latency vs consistency day-to-day.

STORAGE
  PostgreSQL default. DynamoDB for KV. Cassandra for write-heavy. Warehouse for analytics.

FORMATS
  Parquet + Zstd for analytics storage. Avro for Kafka.
  Delta / Iceberg / Hudi for ACID lakehouse tables.

QUALITY
  Six dimensions. Defense-in-depth. Circuit breakers stop cascade failures.

GOVERNANCE
  Lineage = debuggability. Contracts = schema drift prevention.

CACHING
  Redis (<100 ms) → materialized views (<5 s) → partitioned table (<30 s).

APIS
  REST (external). gRPC (internal, high-perf). GraphQL (flexible clients).

OBSERVABILITY
  Five layers. Volume anomaly detection catches silent failures. SLOs, not only alerts.

TRADE-OFFS
  "I chose X over Y because [requirement]. Trade-off is [cost]. Acceptable because [justification]."

See You at Day 31

Phase 1 is done. You’ve covered 30 days of deep, senior-level data engineering foundations. Phase 2 starts with Meta — where your challenge is thinking at the scale of 3 billion users, petabytes of daily data, and an engineering culture that expects every design decision to be defensible at 10x scale.
