Phase 2 Complete | Category: Consolidation
What You’ve Built in Phase 2 (Days 31–60)
30 days of advanced, company-specific, and cross-cutting depth. Phase 2 turns Phase 1 foundations into interview-ready fluency at senior/staff level.
The Phase 2 Mastery Checklist
Rate each area:
- Green: can defend trade-offs fluently
- Yellow: know the concept, struggle under pressure
- Red: need review
Company-Specific Patterns (Days 31–40)
Meta (Days 31–32)
- Can you articulate Meta’s “scale baseline” and how it drives design decisions?
- Can you design the News Feed data pipeline including fan-out-on-write vs fan-out-on-read for celebrity accounts?
- Can you explain the four ranking stages and what features the DE provides for each?
- Can you describe the stack: Scribe → Flink → Presto + Hive/Iceberg → Scuba + Presto?
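The fan-out question above can be sketched as a hybrid: ordinary authors push posts into follower inboxes at write time, while celebrity posts are pulled and merged at read time. This is a minimal in-memory illustration, not Meta's implementation; `FeedStore`, `CELEBRITY_THRESHOLD`, and the tuple layout are hypothetical names chosen for the example.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical cutoff; real systems use far larger values


class FeedStore:
    def __init__(self):
        self.followers = defaultdict(set)  # author -> follower ids
        self.following = defaultdict(set)  # user -> authors followed
        self.inbox = defaultdict(list)     # user -> precomputed (ts, author, post)
        self.outbox = defaultdict(list)    # author -> own posts (ts, author, post)

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) >= CELEBRITY_THRESHOLD

    def publish(self, author, ts, post):
        entry = (ts, author, post)
        self.outbox[author].append(entry)
        if not self.is_celebrity(author):
            # Fan-out-on-write: cost is paid once per follower at publish time.
            for follower in self.followers[author]:
                self.inbox[follower].append(entry)

    def read_feed(self, user, limit=10):
        # A set removes duplicates if an author crossed the threshold
        # after some posts had already been fanned out.
        entries = set(self.inbox[user])
        for author in self.following[user]:
            if self.is_celebrity(author):
                # Fan-out-on-read: celebrity posts are merged at query time.
                entries.update(self.outbox[author])
        return sorted(entries, reverse=True)[:limit]


store = FeedStore()
for u in ("u1", "u2", "u3"):
    store.follow(u, "celeb")
store.follow("u1", "alice")
store.publish("alice", 1, "hi")
store.publish("celeb", 2, "yo")
print(store.read_feed("u1"))
```

The trade-off to defend: write amplification for high-follower accounts versus extra read-time merging for everyone who follows them.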
Netflix (Days 33–34)
- Can you explain Netflix’s Iceberg-first architecture and why Iceberg exists?
- Can you design the recommendation pipeline with online (Flink → Cassandra) and offline (Spark → Iceberg) feature paths?
- Can you explain the two-tower model and what DE builds to support it?
- Can you describe Netflix’s culture of ownership and how it shows up in interview answers?
Google (Days 35–36)
- Can you explain BigQuery internals: Capacitor + Colossus + Dremel + Jupiter?
- Can you choose between Dataflow and Dataproc with specific justification?
- Can you design a GCP-native pipeline (Pub/Sub → Dataflow → BigQuery) and identify when not to use BigQuery?
- Can you cite the cost optimization hierarchy: require_partition_filter → partitioning → clustering → materialized views → BI Engine?
OpenAI (Days 37–38)
- Can you design an LLM training data pipeline: ingestion → PII gate → bronze → dedup (MinHash LSH) → quality filter → toxicity filter → gold?
- Can you explain the RLHF data pipeline and the preference data model?
- Can you explain the LLM-as-transformer pattern and its cost/quality trade-offs?
- Can you describe the quality gates that block a training run?
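The MinHash LSH dedup step in the pipeline above can be sketched in plain Python. This is a toy illustration under stated assumptions: character 5-gram shingles, 64 hash functions derived from salted BLAKE2b, and 16 bands are arbitrary choices for the example, and a production pipeline would use a distributed implementation.

```python
import hashlib


def shingles(text, k=5):
    """Character k-grams over whitespace-normalized, lowercased text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed, shingle):
    # Salting BLAKE2b with the seed yields a family of independent-looking hashes.
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text, num_perm=64):
    """Signature: the minimum hash of the shingle set under each hash function."""
    sh = shingles(text)
    return [min(_hash(seed, s) for s in sh) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def lsh_buckets(sig, bands=16):
    """Split the signature into bands; documents sharing any bucket are candidates."""
    rows = len(sig) // bands
    return {(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(bands)}


doc = "The quick brown fox jumps over the lazy dog near the river bank"
near_dup = "The quick brown fox jumps over the lazy dog near the river bend"
unrelated = "Quarterly revenue grew in the enterprise segment last year"

sig_doc, sig_near, sig_other = minhash(doc), minhash(near_dup), minhash(unrelated)
print(estimated_jaccard(sig_doc, sig_near), estimated_jaccard(sig_doc, sig_other))
```

Banding is what makes this scale: only documents that collide in at least one bucket are compared pairwise, instead of every pair in the corpus.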
Anthropic (Days 39–40)
- Can you describe the progressive-complexity interview style and respond to evolving constraints?
- Can you design a distributed search system with hybrid BM25 + vector retrieval?
- Can you explain HNSW vs IVF-PQ trade-offs at billion-vector scale?
- Can you proactively integrate safety considerations without being asked?
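One common way to combine BM25 and vector results in a hybrid retriever is reciprocal rank fusion (RRF), which needs only the rank positions from each retriever, not comparable scores. The sketch below assumes the two ranked lists already exist; the document IDs are made up, and `k=60` is the conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs; higher fused score ranks first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Tie-break on doc_id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["d1", "d2", "d3"]    # hypothetical lexical results
vector_hits = ["d3", "d1", "d4"]  # hypothetical embedding results
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
```

Documents that appear in both lists rise to the top, which is the behavior you want when lexical and semantic signals agree.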
Advanced Data Modeling (Days 41–43)
- Can you explain Activity Schema and when it beats Kimball?
- Can you design a graph data model for fraud detection and know when to use a graph DB vs SQL?
- Can you design a time-series data model with tiered retention policies?
- Can you choose between InfluxDB, TimescaleDB, and Prometheus for a given scenario?
Advanced Streaming (Days 44–45)
- Can you explain end-to-end exactly-once (Kafka transactions + Flink 2PC) vs at-least-once + idempotent sinks?
- Can you explain stream-table duality and what Flink dynamic tables mean in practice?
- Can you compare BigQuery materialized views vs Flink SQL streaming views vs Materialize?
- Can you explain CQRS with Kafka as the event log?
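The "at-least-once + idempotent sinks" alternative to end-to-end exactly-once can be sketched as a keyed, versioned upsert: duplicates and stale redeliveries become no-ops. This is an in-memory illustration only; `IdempotentSink`, the `event_id` and `version` fields, and the simulated redelivery are assumptions for the example, not a Kafka or Flink API.

```python
class IdempotentSink:
    """Keyed upsert: applying the same event twice leaves the same state."""

    def __init__(self):
        self.rows = {}

    def upsert(self, event):
        current = self.rows.get(event["event_id"])
        # Ignore anything older than what is already stored.
        if current is None or event["version"] >= current["version"]:
            self.rows[event["event_id"]] = event


def deliver_at_least_once(events, sink):
    """Simulate a consumer that redelivers every event after a retry."""
    for event in events:
        sink.upsert(event)
        sink.upsert(event)  # duplicate delivery


events = [
    {"event_id": "e1", "version": 1, "amount": 10},
    {"event_id": "e2", "version": 1, "amount": 5},
    {"event_id": "e1", "version": 2, "amount": 12},
]
sink = IdempotentSink()
deliver_at_least_once(events, sink)
sink.upsert(events[0])  # late, stale redelivery of e1 v1 is ignored
total = sum(row["amount"] for row in sink.rows.values())
print(total)
```

The design choice to articulate: idempotent sinks move the correctness burden to key and version design, while transactional exactly-once moves it to coordination between the stream processor and the sink.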
ML Data Infrastructure (Days 46–49)
- Can you design a feature store with online (Redis/Cassandra) and offline (Iceberg) paths?
- Can you explain point-in-time correctness and why it prevents training data leakage?
- Can you explain training-serving skew and how a feature store prevents it architecturally?
- Can you design an ML pipeline with data versioning, experiment tracking, and a model registry promotion workflow?
- Can you design a RAG pipeline end-to-end: chunking → embedding → vector store → hybrid retrieval → reranking?
- Can you explain HNSW, IVF, and IVF-PQ with memory/recall trade-offs?
- Can you design an A/B testing data infrastructure with SRM detection, sequential testing, and guardrail metrics?
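SRM detection from the last item can be sketched as a chi-square goodness-of-fit test on assignment counts. For two arms the test has one degree of freedom, so the p-value can be computed with `math.erfc` and no external libraries. The `alpha=0.001` threshold is an assumption here; it is a commonly used strict cutoff, not a value from this course.

```python
import math


def srm_check(control, treatment, expected_ratio=0.5, alpha=0.001):
    """Return (chi2, p_value, srm_detected) for a two-arm experiment."""
    total = control + treatment
    expected_control = total * expected_ratio
    expected_treatment = total - expected_control
    chi2 = (
        (control - expected_control) ** 2 / expected_control
        + (treatment - expected_treatment) ** 2 / expected_treatment
    )
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value < alpha


print(srm_check(50_000, 48_000))  # a 2,000-user gap at this scale is suspicious
print(srm_check(5_000, 5_100))    # a 100-user gap at this scale is plausible noise
```

When SRM fires, the experiment's metrics should not be trusted until the assignment or logging bug is found, which is why the check belongs in the pipeline rather than in an analyst's notebook.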
Data Platform Design (Days 50–52)
- Can you explain the four data mesh principles and the hybrid reality?
- Can you distinguish data mesh (organizational) from data fabric (technology) and choose between them?
- Can you design a multi-tenant data platform with compute isolation, cost chargeback, and tenant onboarding automation?
Security & Privacy (Days 53–54)
- Can you layer RBAC → RLS → CLS/ABAC → encryption → audit logging as defense-in-depth?
- Can you explain masking vs tokenization vs encryption and choose correctly per scenario?
- Can you explain GDPR right-to-erasure as a distributed systems engineering problem?
- Can you design privacy-by-design with PII detection, consent flags, and retention policies?
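The difference between masking and tokenization is easiest to see side by side: masking is irreversible and lossy, while tokenization is reversible through a vault and can be "forgotten" per value. The sketch below is illustrative only; `TokenVault` and the `tok_` prefix are hypothetical, and encryption is omitted because it would need a cryptography library beyond the standard library.

```python
import secrets


def mask_email(email):
    """Irreversible masking: keeps the domain for analytics, hides the identity."""
    local, _, domain = email.partition("@")
    return local[:1] + "***@" + domain


class TokenVault:
    """Vault-based tokenization: random tokens, reversible only via the vault."""

    def __init__(self):
        self._to_token = {}
        self._to_value = {}

    def tokenize(self, value):
        if value not in self._to_token:
            token = "tok_" + secrets.token_hex(8)
            self._to_token[value] = token
            self._to_value[token] = value
        return self._to_token[value]

    def detokenize(self, token):
        return self._to_value[token]

    def forget(self, value):
        # Dropping the mapping makes tokens already stored downstream
        # unresolvable, which is one way to support erasure requests.
        token = self._to_token.pop(value, None)
        if token is not None:
            self._to_value.pop(token, None)


vault = TokenVault()
token = vault.tokenize("alice@example.com")
print(mask_email("alice@example.com"), token.startswith("tok_"))
```

Tokens are stable per value, so joins on the tokenized column still work; masked values generally cannot be joined back to the original.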
Cost & Performance (Days 55–56)
- Can you calculate savings from Spot instances, auto-termination, and storage tiering?
- Can you diagnose a slow pipeline by bottleneck category: source, transform (skew/OOM), sink, orchestration, contention?
- Can you read a Spark UI and identify the root cause of a slow job?
- Can you explain AQE and enable the three settings that solve most production issues?
- Can you explain sargable ("search-argument-able") predicates and why wrapping an indexed column in a function defeats the index?
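The last item can be checked empirically. The sketch below uses SQLite (via Python's standard `sqlite3` module) as a stand-in for any indexed store: the same logical filter is written once with a function wrapped around the indexed column and once as a plain range, and the query plans are compared. The table and index names are made up for the example, and exact plan wording varies by SQLite version.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT, payload TEXT)"
)
conn.execute("CREATE INDEX idx_events_created_at ON events (created_at)")

# Function on the indexed column: the planner cannot seek into the index.
non_sargable = query_plan(
    conn,
    "SELECT payload FROM events WHERE strftime('%Y', created_at) = '2024'",
)

# Same intent as a range on the bare column: the planner can use the index.
sargable = query_plan(
    conn,
    "SELECT payload FROM events "
    "WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01'",
)

print(non_sargable)
print(sargable)
```

The first plan reports a full scan and the second an index search; the same rewrite (function on the column becomes a range on the column) applies to partition pruning in warehouses.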
System Design Practice (Days 57–59)
- Can you design surge pricing: H3 indexing → Kafka (partitioned by zone_id) → Flink sliding windows → Redis serving?
- Can you design CDN analytics: edge pre-aggregation → Kafka → Flink → ClickHouse + S3 batch → Iceberg?
- Can you design fraud detection: two-path architecture (sync scoring + async ring detection) with feedback loop?
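The windowing logic behind the surge pricing design can be sketched without Flink or H3: keep recent request timestamps per zone and turn the demand-to-supply ratio into a capped multiplier. This is a simplified, continuously evaluated window rather than Flink's hop-based sliding windows; it assumes in-order timestamps, and `zone_id` values, the 300-second window, and the 3.0 cap are illustrative assumptions.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Counts events per zone in the window (now - window_s, now]."""

    def __init__(self, window_s=300):
        self.window_s = window_s
        self.events = defaultdict(deque)  # zone_id -> event timestamps

    def add(self, zone_id, ts):
        self.events[zone_id].append(ts)

    def count(self, zone_id, now):
        q = self.events[zone_id]
        # Evict timestamps that have fallen out of the window.
        while q and q[0] <= now - self.window_s:
            q.popleft()
        return len(q)


def surge_multiplier(requests, drivers, cap=3.0):
    """Demand/supply ratio clamped to [1.0, cap]."""
    if drivers == 0:
        return cap
    return round(min(cap, max(1.0, requests / drivers)), 2)


requests = SlidingWindowCounter(window_s=300)
for ts in (0, 100, 400):
    requests.add("zone_a", ts)
print(requests.count("zone_a", now=400), surge_multiplier(10, 4))
```

Because all state is keyed by zone, partitioning the input by the same key keeps each zone's window on a single worker.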
Phase 2 Master Pattern Library: The 15 Advanced Answers
- On company-specific scale: state it up front to frame decisions.
- On GCP-native design: default Pub/Sub → Dataflow → BigQuery; deviate only with clear latency/consistency needs.
- On LLM training data: PII stripped before pipeline; dedup is mandatory; quality gates must block training runs.
- On feature stores: training-serving skew is solved architecturally by a single source of feature computation logic.
- On RAG: chunking strategy often matters more than embeddings; use hybrid search (BM25 + vector) by default.
- On A/B infra: SRM detection is mandatory; sequential testing enables valid early stopping; guardrails should auto-pause experiments.
- On data mesh: pure mesh often fails at scale; hybrid approach wins (platform/fabric core + mesh ownership principles).
- On security: defense-in-depth (network → IAM → fine-grained policy → encryption → audit).
- On GDPR erasure: deletion propagates to Kafka tombstones, lakehouse deletes+compaction, feature eviction, and exclusion registry for ML.
- On cost: quantify first; Spot is highest ROI for batch; require_partition_filter prevents runaway warehouse spend.
- On performance: diagnose bottleneck category first; Spark UI is the starting point; keep AQE enabled (it is on by default since Spark 3.2).
- On geospatial: H3 for spatial aggregation; partition Kafka by cell ID to keep state local and avoid shuffle.
- On CDN/log analytics: pre-aggregate at edge; raw logs batch to object store for history.
- On fraud: two paths, two latency tiers; never put ring detection in the sync authorization path.
- On the meta-skill: every decision must tie to a requirement + trade-off + what would change your mind.
The Phase 2 Timed Practice (Do This Weekend)
Set aside ~4 hours: four 45-minute timed exercises (3 hours of working time) plus breaks and review.
Exercise 1 (45 min): Meta News Feed + Privacy
“Design the data infrastructure for Facebook’s News Feed. New requirements: EU user data must remain in EU regions only, and an EU user’s data must be deleted within 7 days of an erasure request.”
Focus:
- How data residency changes fan-out and serving
- How right-to-erasure propagates through derived datasets and feature stores
Exercise 2 (45 min): Google CDN Analytics (no notes)
From memory: edge nodes → Kafka → Flink → ClickHouse + S3 batch → Iceberg.
Include:
- Dedup strategy
- Anomaly detection logic
- Latency budget and freshness targets
Exercise 3 (45 min): Anthropic RAG + Safety
“Design enterprise retrieval for Claude: tenant isolation, 10M docs/tenant, and retrieval must not return harmful content.”
Apply progressive complexity:
- Standard RAG → tenant isolation → safety filtering at retrieval
Exercise 4 (45 min): Netflix ML Feature Pipeline
From memory: online path (Flink → Cassandra, matching the Days 33–34 design), offline path (Spark → Iceberg), point-in-time joins, and A/B testing infra.
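A point-in-time join attaches to each training label the latest feature value known at or before the label's timestamp, never a later one. The sketch below shows the core lookup with a binary search; the entity IDs, timestamps, and tuple layouts are hypothetical, and whether a feature stamped exactly at the label time counts as "known" is a design choice (here it does).

```python
from bisect import bisect_right


def point_in_time_join(labels, feature_history):
    """
    labels: list of (entity, label_ts, label)
    feature_history: entity -> list of (feature_ts, value), sorted by feature_ts
    Returns (entity, label_ts, feature_value_or_None, label) per label.
    """
    joined = []
    for entity, label_ts, label in labels:
        history = feature_history.get(entity, [])
        timestamps = [ts for ts, _ in history]
        # Index of the first feature strictly after label_ts.
        idx = bisect_right(timestamps, label_ts)
        value = history[idx - 1][1] if idx > 0 else None
        joined.append((entity, label_ts, value, label))
    return joined


history = {"u1": [(10, 0.1), (20, 0.5), (30, 0.9)]}
labels = [("u1", 25, 1), ("u1", 5, 0), ("u2", 25, 1)]
print(point_in_time_join(labels, history))
```

The label at time 25 gets the value written at 20, not the one written at 30; using the later value would leak future information into training.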
Gaps Identified: Common Weak Areas
Gap 1: Company-specific vocabulary
Can you say “Scuba” at Meta, “Dremel” at Google, and other stack terms naturally (without forcing it)?
Gap 2: Trade-off quantification
Mid-level: “Use Spot.”
Senior: “If 70% of our $500K/month batch compute ($350K) can move to Spot at an assumed ~60% discount, savings are ~$210K/month; nightly ETL tolerates interruptions via retries/checkpoints.”
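When quantifying a Spot claim, it helps to separate the spend that moves to Spot from the savings on that spend: the savings are the moved spend multiplied by the discount. The sketch below uses the $500K/month and 70% figures from the Senior answer; the 60% discount is an assumption for illustration, since actual Spot discounts vary by instance type, region, and time.

```python
def spot_savings(monthly_batch_spend, movable_fraction, spot_discount):
    """Return (spend moved to Spot, monthly savings on that spend)."""
    moved = monthly_batch_spend * movable_fraction
    savings = moved * spot_discount
    return moved, savings


# spot_discount=0.60 is an assumed figure, not a quoted price.
moved, savings = spot_savings(500_000, movable_fraction=0.70, spot_discount=0.60)
print(f"moved to Spot: ${moved:,.0f}/month, savings: ${savings:,.0f}/month")
```

Stating the discount assumption out loud is part of the senior-level answer: it tells the interviewer which input would change the conclusion.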
Gap 3: Progressive complexity under pressure
For any design: add one constraint and re-evaluate what breaks first (latency, consistency, privacy, cost).
Gap 4: Safety and privacy integration
Weave it into the first diagram and first data flow, not as an afterthought.
Phase 2 → Phase 3 Transition
Phase 3 (Days 61–90) is application-only: timed mocks and integration, not new concepts.
The Single Most Important Preparation for Phase 3
Practice speaking, not reading.
Pick three designs from Phase 2 and explain them out loud, timed, as if you’re in the interview. The hesitation points are what to drill next.
See You at Day 61
Phase 3 starts with a full mock: Design Instagram Stories Analytics (Meta style).