Day 60 — Phase 2 Comprehensive Review & Self-Assessment

Phase 2 Complete | Category: Consolidation

What You’ve Built in Phase 2 (Days 31–60)

30 days of advanced, company-specific, and cross-cutting depth. Phase 2 turns Phase 1 foundations into interview-ready fluency at senior/staff level.

The Phase 2 Mastery Checklist

Rate each area:

Green: can defend trade-offs fluently
Yellow: know the concept, struggle under pressure
Red: need review

Company-Specific Patterns (Days 31–40)

Meta (Days 31–32)

Can you articulate Meta’s “scale baseline” and how it drives design decisions?
Can you design the News Feed data pipeline including fan-out-on-write vs fan-out-on-read for celebrity accounts?
Can you explain the four ranking stages and what features the DE provides for each?
Can you describe the stack: Scribe → Flink → Presto + Hive/Iceberg → Scuba + Presto?

Netflix (Days 33–34)

Can you explain Netflix’s Iceberg-first architecture and why Iceberg exists?
Can you design the recommendation pipeline with online (Flink → Cassandra) and offline (Spark → Iceberg) feature paths?
Can you explain the two-tower model and what DE builds to support it?
Can you describe Netflix’s culture of ownership and how it shows up in interview answers?

Google (Days 35–36)

Can you explain BigQuery internals: Capacitor + Colossus + Dremel + Jupiter?
Can you choose between Dataflow and Dataproc with specific justification?
Can you design a GCP-native pipeline (Pub/Sub → Dataflow → BigQuery) and identify when not to use BigQuery?
Can you cite the cost optimization hierarchy: require_partition_filter → partitioning → clustering → materialized views → BI Engine?

OpenAI (Days 37–38)

Can you design an LLM training data pipeline: ingestion → PII gate → bronze → dedup (MinHash LSH) → quality filter → toxicity filter → gold?
Can you explain the RLHF data pipeline and the preference data model?
Can you explain the LLM-as-transformer pattern and its cost/quality trade-offs?
Can you describe the quality gates that block a training run?
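
The dedup stage in the pipeline above names MinHash LSH. As a rough illustration only, the sketch below shows the idea in pure Python: shingle each document, keep one minimum hash per seeded hash function, and use banding so that only documents sharing a band become comparison candidates. The function names, the 3-word shingle size, 128 hashes, and 32 bands are all assumptions chosen for the example, not values from the course material.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set[str]:
    """Split a document into overlapping k-word shingles."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items: set[str], num_hashes: int = 128) -> list[int]:
    """Keep the minimum hash per seeded hash function (the seed is the salt)."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def band_keys(signature: list[int], bands: int = 32) -> set[tuple]:
    """LSH banding: documents that share any (band, rows) key are candidates."""
    rows = len(signature) // bands
    return {(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)}


doc_a = ("the quick brown fox jumps over the lazy dog near the quiet "
         "river bank at dawn every single day")
doc_b = ("the quick brown fox jumps over the lazy dog near the quiet "
         "river bank at dawn every single morning")
doc_c = "kafka partitions are assigned to consumers within a group by the coordinator"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))

print("a vs b:", estimated_jaccard(sig_a, sig_b))   # near-duplicates score high
print("a vs c:", estimated_jaccard(sig_a, sig_c))   # unrelated documents score low
print("a, b are LSH candidates:", bool(band_keys(sig_a) & band_keys(sig_b)))
```

In an interview answer, the point to defend is that banding avoids the all-pairs comparison: only documents that collide in at least one band are compared in full.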

Anthropic (Days 39–40)

Can you describe the progressive-complexity interview style and respond to evolving constraints?
Can you design a distributed search system with hybrid BM25 + vector retrieval?
Can you explain HNSW vs IVF-PQ trade-offs at billion-vector scale?
Can you proactively integrate safety considerations without being asked?
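
Hybrid BM25 + vector retrieval needs a way to merge two ranked lists whose scores are not comparable. The notes do not say which fusion method to use; the sketch below uses reciprocal rank fusion (RRF) as one common option, with the conventional constant k = 60 taken as an assumption. The document IDs and the two input rankings are made up for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists by summing 1 / (k + rank) for each document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword (BM25) index and a vector index.
bm25_hits = ["d1", "d2", "d3"]
vector_hits = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)  # ['d1', 'd3', 'd2', 'd4']
```

RRF uses ranks only, so it sidesteps score normalization between the two retrievers; a learned reranker can then reorder the fused top-k.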

Advanced Data Modeling (Days 41–43)

Can you explain Activity Schema and when it beats Kimball?
Can you design a graph data model for fraud detection and know when to use a graph DB vs SQL?
Can you design a time-series data model with tiered retention policies?
Can you choose between InfluxDB, TimescaleDB, and Prometheus for a given scenario?

Advanced Streaming (Days 44–45)

Can you explain end-to-end exactly-once (Kafka transactions + Flink 2PC) vs at-least-once + idempotent sinks?
Can you explain stream-table duality and what Flink dynamic tables mean in practice?
Can you compare BigQuery materialized views vs Flink SQL streaming views vs Materialize?
Can you explain CQRS with Kafka as the event log?
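
The "at-least-once + idempotent sinks" alternative can be shown without any streaming framework. The sketch below is a toy model, not Kafka or Flink code: the class names and the event shape (event_id, amount) are assumptions. It replays one event, as a consumer restart would, and compares a keyed upsert sink against a naive append sink.

```python
class IdempotentSink:
    """Keyed upsert sink: a redelivered event overwrites its own row."""

    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}

    def write(self, event: dict) -> None:
        self.rows[event["event_id"]] = event

    def total_amount(self) -> int:
        return sum(row["amount"] for row in self.rows.values())


class AppendSink:
    """Naive sink for contrast: every delivery adds a new row."""

    def __init__(self) -> None:
        self.rows: list[dict] = []

    def write(self, event: dict) -> None:
        self.rows.append(event)

    def total_amount(self) -> int:
        return sum(row["amount"] for row in self.rows)


events = [{"event_id": "e1", "amount": 10}, {"event_id": "e2", "amount": 25}]
# At-least-once delivery: e2 arrives again after a simulated consumer restart.
deliveries = events + [events[1]]

idempotent, naive = IdempotentSink(), AppendSink()
for event in deliveries:
    idempotent.write(event)
    naive.write(event)

print(idempotent.total_amount())  # 35: the replay had no effect
print(naive.total_amount())       # 60: the replay double-counted e2
```

The trade-off to state out loud: this approach needs a stable event key and a sink that supports upserts, while transactional exactly-once needs coordinated commits across the pipeline.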

ML Data Infrastructure (Days 46–49)

Can you design a feature store with online (Redis/Cassandra) and offline (Iceberg) paths?
Can you explain point-in-time correctness and why it prevents training data leakage?
Can you explain training-serving skew and how a feature store prevents it architecturally?
Can you design an ML pipeline with data versioning, experiment tracking, and a model registry promotion workflow?
Can you design a RAG pipeline end-to-end: chunking → embedding → vector store → hybrid retrieval → reranking?
Can you explain HNSW, IVF, and IVF-PQ with memory/recall trade-offs?
Can you design an A/B testing data infrastructure with SRM detection, sequential testing, and guardrail metrics?
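
Point-in-time correctness means each training label sees only feature values that existed at or before the label's timestamp. The sketch below is a minimal in-memory version under assumed data shapes (tuples with integer timestamps); a real offline store would do this as an as-of join in Spark or SQL. Whether a feature written at exactly the label timestamp counts as "known" is a design choice; here it is included.

```python
from bisect import bisect_right


def point_in_time_join(labels, feature_rows):
    """For each (entity, label_ts, label), attach the latest feature value
    with feature_ts <= label_ts, or None if no such value exists."""
    by_entity: dict[str, tuple[list[int], list[float]]] = {}
    for entity, ts, value in sorted(feature_rows, key=lambda r: (r[0], r[1])):
        times, values = by_entity.setdefault(entity, ([], []))
        times.append(ts)
        values.append(value)

    joined = []
    for entity, label_ts, label in labels:
        times, values = by_entity.get(entity, ([], []))
        idx = bisect_right(times, label_ts) - 1  # last feature at or before label_ts
        joined.append((entity, label_ts, values[idx] if idx >= 0 else None, label))
    return joined


# Hypothetical feature history: (entity, feature_ts, value).
features = [("u1", 10, 1.0), ("u1", 20, 2.0), ("u1", 30, 9.0)]
# Hypothetical labels: (entity, label_ts, label).
labels = [("u1", 25, 1), ("u1", 5, 0), ("u2", 15, 1)]

for row in point_in_time_join(labels, features):
    print(row)
```

The label at timestamp 25 gets the value 2.0, not the later 9.0; joining on "latest value" instead would leak future information into training.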

Data Platform Design (Days 50–52)

Can you explain the four data mesh principles and the hybrid reality?
Can you distinguish data mesh (organizational) from data fabric (technology) and choose between them?
Can you design a multi-tenant data platform with compute isolation, cost chargeback, and tenant onboarding automation?

Security & Privacy (Days 53–54)

Can you layer RBAC → RLS → CLS/ABAC → encryption → audit logging as defense-in-depth?
Can you explain masking vs tokenization vs encryption and choose correctly per scenario?
Can you explain GDPR right-to-erasure as a distributed systems engineering problem?
Can you design privacy-by-design with PII detection, consent flags, and retention policies?
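
The masking vs tokenization distinction is easier to defend with a concrete contrast. The sketch below is illustrative only: mask_email and TokenVault are hypothetical names, and the in-memory dictionary stands in for a secured token vault service. Encryption is left out because the Python standard library has no symmetric cipher; it would differ from tokenization in that the original value is recoverable with a key rather than by a vault lookup.

```python
import secrets


def mask_email(email: str) -> str:
    """Masking: irreversible, keeps just enough shape for display or debugging."""
    local, _, domain = email.partition("@")
    return local[:1] + "***@" + domain


class TokenVault:
    """Tokenization: swap the value for a random token; the vault holds the mapping."""

    def __init__(self) -> None:
        self._token_by_value: dict[str, str] = {}
        self._value_by_token: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        if value not in self._token_by_value:
            token = "tok_" + secrets.token_hex(8)
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def detokenize(self, token: str) -> str:
        # In a real system this call sits behind strict access control and audit.
        return self._value_by_token[token]


vault = TokenVault()
email = "alice@example.com"
token = vault.tokenize(email)

print(mask_email(email))                 # a***@example.com
print(token == vault.tokenize(email))    # True: same value, same token, so joins still work
print(vault.detokenize(token) == email)  # True: reversible only through the vault
```

A rule of thumb that follows from the sketch: mask when nobody downstream needs the original, tokenize when datasets must still join on the field.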

Cost & Performance (Days 55–56)

Can you calculate savings from Spot instances, auto-termination, and storage tiering?
Can you diagnose a slow pipeline by bottleneck category: source, transform (skew/OOM), sink, orchestration, contention?
Can you read a Spark UI and identify the root cause of a slow job?
Can you explain AQE and enable the three settings that solve most production issues?
Can you explain sargable predicates and why wrapping an indexed column in a function defeats the index?
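
The sargable-predicate point can be checked directly with SQLite's query planner, which ships with Python. The sketch below is a small demonstration under assumptions: the events table, its columns, and the index name are invented, and other engines phrase their plans differently. It compares a predicate that wraps the indexed column in a function with an equivalent range predicate on the bare column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT, payload TEXT)"
)
conn.execute("CREATE INDEX idx_events_created_at ON events (created_at)")


def plan(sql: str) -> str:
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the detail text


# Function applied to the indexed column: the index cannot be used for lookup.
non_sargable = "SELECT * FROM events WHERE date(created_at) = '2024-01-01'"

# Same intent as a range on the bare column: the index can be searched.
sargable = (
    "SELECT * FROM events "
    "WHERE created_at >= '2024-01-01' AND created_at < '2024-01-02'"
)

print("non-sargable:", plan(non_sargable))
print("sargable:    ", plan(sargable))
```

The exact plan text varies by SQLite version, but the first query reports a full scan while the second reports a search using idx_events_created_at.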

System Design Practice (Days 57–59)

Can you design surge pricing: H3 indexing → Kafka (partitioned by zone_id) → Flink sliding windows → Redis serving?
Can you design CDN analytics: edge pre-aggregation → Kafka → Flink → ClickHouse + S3 batch → Iceberg?
Can you design fraud detection: two-path architecture (sync scoring + async ring detection) with feedback loop?

Phase 2 Master Pattern Library: The 15 Advanced Answers

On company-specific scale: state it up front to frame decisions.
On GCP-native design: default Pub/Sub → Dataflow → BigQuery; deviate only with clear latency/consistency needs.
On LLM training data: PII stripped before pipeline; dedup is mandatory; quality gates must block training runs.
On feature stores: training-serving skew is solved architecturally by a single source of feature computation logic.
On RAG: chunking strategy often matters more than embeddings; use hybrid search (BM25 + vector) by default.
On A/B infra: SRM detection is mandatory; sequential testing enables valid early stopping; guardrails should auto-pause experiments.
On data mesh: pure mesh often fails at scale; hybrid approach wins (platform/fabric core + mesh ownership principles).
On security: defense-in-depth (network → IAM → fine-grained policy → encryption → audit).
On GDPR erasure: deletion propagates to Kafka tombstones, lakehouse deletes+compaction, feature eviction, and exclusion registry for ML.
On cost: quantify first; Spot is highest ROI for batch; require_partition_filter prevents runaway warehouse spend.
On performance: diagnose bottleneck category first; Spark UI is the starting point; keep AQE enabled (it is on by default since Spark 3.2).
On geospatial: H3 for spatial aggregation; partition Kafka by cell ID to keep state local and avoid shuffle.
On CDN/log analytics: pre-aggregate at edge; raw logs batch to object store for history.
On fraud: two paths, two latency tiers; never put ring detection in the sync authorization path.
On the meta-skill: every decision must tie to a requirement + trade-off + what would change your mind.
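
For the A/B infra answer in the list above, the SRM (sample ratio mismatch) check is a chi-square goodness-of-fit test on assignment counts. The sketch below covers the two-arm case only, where the p-value has a closed form via the complementary error function. The function names and the 0.001 alert threshold are assumptions for illustration; teams pick their own threshold.

```python
import math


def srm_p_value(control_n: int, treatment_n: int,
                expected_control_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value for a two-arm split (1 degree of freedom)."""
    total = control_n + treatment_n
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (treatment_n - expected_treatment) ** 2 / expected_treatment)
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def has_srm(control_n: int, treatment_n: int, alpha: float = 0.001) -> bool:
    """Flag the experiment when the observed split is implausible under the design."""
    return srm_p_value(control_n, treatment_n) < alpha


print(has_srm(50_000, 50_000))  # False: exactly the designed 50/50 split
print(has_srm(5_000, 5_100))    # False: small imbalance, consistent with chance
print(has_srm(50_000, 52_000))  # True: too large a gap for this sample size
```

When the check fires, the experiment's metrics should not be trusted until the assignment or logging bug is found, which is why the check belongs in the pipeline rather than in an analyst's notebook.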

The Phase 2 Timed Practice (Do This Weekend)

Set aside ~4 hours: four 45-minute timed exercises, plus breaks between them.

Exercise 1 (45 min): Meta News Feed + Privacy

“Design the data infrastructure for Facebook’s News Feed. New requirements: EU user data must remain in EU regions only, and an EU user’s data must be deleted within 7 days of an erasure request.”

Focus:

How data residency changes fan-out and serving
How right-to-erasure propagates through derived datasets and feature stores

Exercise 2 (45 min): Google CDN Analytics (no notes)

From memory: edge nodes → Kafka → Flink → ClickHouse + S3 batch → Iceberg.

Include:

Dedup strategy
Anomaly detection logic
Latency budget and freshness targets

Exercise 3 (45 min): Anthropic RAG + Safety

“Design enterprise retrieval for Claude: tenant isolation, 10M docs/tenant, and retrieval must not return harmful content.”

Apply progressive complexity:

Standard RAG → tenant isolation → safety filtering at retrieval

Exercise 4 (45 min): Netflix ML Feature Pipeline

From memory: online path (Flink → Cassandra), offline path (Spark → Iceberg), point-in-time joins, and A/B testing infra.

Gaps Identified: Common Weak Areas

Gap 1: Company-specific vocabulary

Can you say “Scuba” at Meta, “Dremel” at Google, and other stack terms naturally (without forcing it)?

Gap 2: Trade-off quantification

Mid-level: “Use Spot.”

Senior: “If 70% of our $500K/month batch compute ($350K) can move to Spot at an assumed ~60% discount, savings are ~$210K/month; nightly ETL tolerates interruptions via retries/checkpoints.”
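
A quantified Spot claim rests on two inputs: the share of spend that can move to Spot and the discount Spot offers on that share. Savings are the moved spend times the discount, not the moved spend itself. The sketch below works through the arithmetic; the 60% discount is a placeholder assumption, since real discounts vary by instance type, region, and time.

```python
def spot_monthly_savings(batch_spend: float,
                         spot_eligible_share: float,
                         spot_discount: float) -> float:
    """Savings = spend moved to Spot x discount on that spend."""
    moved_spend = batch_spend * spot_eligible_share
    return moved_spend * spot_discount


# Figures from the example: $500K/month batch compute, 70% Spot-eligible.
# The 60% discount is an assumed value for illustration.
savings = spot_monthly_savings(500_000, 0.70, 0.60)
print(f"${savings:,.0f} per month")
```

Stating the discount assumption explicitly is part of the senior-level answer: it tells the interviewer which number to challenge.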

Gap 3: Progressive complexity under pressure

For any design: add one constraint and re-evaluate what breaks first (latency, consistency, privacy, cost).

Gap 4: Safety and privacy integration

Weave both into the first diagram and first data flow rather than adding them as an afterthought.

Phase 2 → Phase 3 Transition

Phase 3 (Days 61–90) is application-only: timed mocks and integration, not new concepts.

The Single Most Important Preparation for Phase 3

Practice speaking, not reading.

Pick three designs from Phase 2 and explain them out loud, timed, as if you’re in the interview. The hesitation points are what to drill next.

See You at Day 61

Phase 3 starts with a full mock: Design Instagram Stories Analytics (Meta style).
