Phase 2 Complete | Category: Consolidation

What You’ve Built in Phase 2 (Days 31–60)

30 days of advanced, company-specific, and cross-cutting depth. Phase 2 turns Phase 1 foundations into interview-ready fluency at senior/staff level.

The Phase 2 Mastery Checklist

Rate each area:

  • Green: can defend trade-offs fluently
  • Yellow: know the concept, struggle under pressure
  • Red: need review

Company-Specific Patterns (Days 31–40)

Meta (Days 31–32)

  • Can you articulate Meta’s “scale baseline” and how it drives design decisions?
  • Can you design the News Feed data pipeline including fan-out-on-write vs fan-out-on-read for celebrity accounts?
  • Can you explain the four ranking stages and what features the DE provides for each?
  • Can you describe the stack: Scribe → Flink → Presto + Hive/Iceberg → Scuba + Presto?
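The fan-out trade-off for celebrity accounts can be sketched as a hybrid: push posts into follower inboxes for ordinary authors, and pull celebrity posts at read time. This is a minimal in-memory illustration, not Meta's implementation. The `FeedStore` class and the `celebrity_threshold` cutoff are hypothetical names and values chosen for the example, and ranking and ordering are left out.

```python
from collections import defaultdict


class FeedStore:
    """Hybrid fan-out: write-time push for normal authors, read-time pull for celebrities."""

    def __init__(self, followers, celebrity_threshold=10_000):
        self.followers = followers                  # author -> set of follower ids
        self.celebrity_threshold = celebrity_threshold
        self.following = defaultdict(set)           # user -> set of authors they follow
        for author, fans in followers.items():
            for fan in fans:
                self.following[fan].add(author)
        self.inbox = defaultdict(list)              # user -> precomputed post ids
        self.outbox = defaultdict(list)             # author -> the author's own post ids

    def is_celebrity(self, author):
        return len(self.followers.get(author, ())) >= self.celebrity_threshold

    def publish(self, author, post_id):
        self.outbox[author].append(post_id)
        if self.is_celebrity(author):
            return                                  # skip the expensive write amplification
        for fan in self.followers.get(author, ()):
            self.inbox[fan].append(post_id)         # fan-out-on-write

    def read_feed(self, user):
        feed = list(self.inbox[user])
        for author in self.following[user]:
            if self.is_celebrity(author):
                feed.extend(self.outbox[author])    # fan-out-on-read
        return feed


# Tiny threshold so the example is easy to follow.
store = FeedStore({"celeb": {"a", "b", "c"}, "friend": {"a"}}, celebrity_threshold=3)
store.publish("celeb", "p1")
store.publish("friend", "p2")
print(sorted(store.read_feed("a")))
```

The design choice to defend in an interview is where the threshold sits: a lower cutoff saves write amplification but adds read-time merge latency for more users.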

Netflix (Days 33–34)

  • Can you explain Netflix’s Iceberg-first architecture and why Iceberg exists?
  • Can you design the recommendation pipeline with online (Flink → Cassandra) and offline (Spark → Iceberg) feature paths?
  • Can you explain the two-tower model and what DE builds to support it?
  • Can you describe Netflix’s culture of ownership and how it shows up in interview answers?

Google (Days 35–36)

  • Can you explain BigQuery internals: Capacitor + Colossus + Dremel + Jupiter?
  • Can you choose between Dataflow and Dataproc with specific justification?
  • Can you design a GCP-native pipeline (Pub/Sub → Dataflow → BigQuery) and identify when not to use BigQuery?
  • Can you cite the cost optimization hierarchy: require_partition_filter → partitioning → clustering → materialized views → BI Engine?

OpenAI (Days 37–38)

  • Can you design an LLM training data pipeline: ingestion → PII gate → bronze → dedup (MinHash LSH) → quality filter → toxicity filter → gold?
  • Can you explain the RLHF data pipeline and the preference data model?
  • Can you explain the LLM-as-transformer pattern and its cost/quality trade-offs?
  • Can you describe the quality gates that block a training run?
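The dedup stage named above (MinHash LSH) can be sketched with the standard library alone. This is a toy illustration under stated assumptions: word-level shingles, 64 hash functions simulated by salting BLAKE2b, and 16 LSH bands. All function names and parameter values are hypothetical; a production pipeline would use a tuned library and a distributed bucket store.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles for a document."""
    tokens = text.lower().split()
    if len(tokens) <= k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _hash(item, seed):
    # Salted 64-bit hash; each seed stands in for one random permutation.
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_perm=64):
    """Keep the minimum hash per seed; equal minima correlate with set overlap."""
    return [min(_hash(item, seed) for item in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands=16):
    """Documents sharing any full band land in the same bucket and become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for doc_ids in buckets.values():
        pairs.update(combinations(sorted(doc_ids), 2))
    return pairs


docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick brown fox jumps over the lazy dog near the river bank",
    "c": "quarterly revenue grew while operating costs fell across all regions",
}
signatures = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
candidates = lsh_candidate_pairs(signatures)
print(candidates)
```

Banding is what keeps dedup sub-quadratic: only documents that collide in at least one band are compared, rather than every pair.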

Anthropic (Days 39–40)

  • Can you describe the progressive-complexity interview style and respond to evolving constraints?
  • Can you design a distributed search system with hybrid BM25 + vector retrieval?
  • Can you explain HNSW vs IVF-PQ trade-offs at billion-vector scale?
  • Can you proactively integrate safety considerations without being asked?
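Hybrid BM25 + vector retrieval needs a way to merge two ranked lists whose scores are not comparable. One common option, shown here as an assumption rather than anything the checklist prescribes, is Reciprocal Rank Fusion (RRF), which uses only ranks. The document ids and the two input rankings are made up for illustration; `k=60` is the conventional default constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each list contributes 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ranking = ["d1", "d2", "d3"]      # hypothetical lexical results
vector_ranking = ["d3", "d1", "d4"]    # hypothetical embedding results
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)  # ['d1', 'd3', 'd2', 'd4']
```

Documents that appear in both lists rise to the top, which is the behavior to call out when an interviewer asks why hybrid beats either retriever alone.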

Advanced Data Modeling (Days 41–43)

  • Can you explain Activity Schema and when it beats Kimball?
  • Can you design a graph data model for fraud detection and know when to use a graph DB vs SQL?
  • Can you design a time-series data model with tiered retention policies?
  • Can you choose between InfluxDB, TimescaleDB, and Prometheus for a given scenario?

Advanced Streaming (Days 44–45)

  • Can you explain end-to-end exactly-once (Kafka transactions + Flink 2PC) vs at-least-once + idempotent sinks?
  • Can you explain stream-table duality and what Flink dynamic tables mean in practice?
  • Can you compare BigQuery materialized views vs Flink SQL streaming views vs Materialize?
  • Can you explain CQRS with Kafka as the event log?
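The at-least-once + idempotent sink alternative from the first bullet can be shown in a few lines: redelivered events are harmless when the sink upserts by a stable key. This is a minimal sketch with in-memory stand-ins; the `event_id` and `amount` fields and both function names are hypothetical.

```python
def apply_naive(deliveries):
    """Append-style sink: every delivery counts, so redeliveries double-count."""
    total = 0
    for event in deliveries:
        total += event["amount"]
    return total


def apply_idempotent(deliveries):
    """Upsert keyed by event_id: replaying an event rewrites the same row."""
    table = {}
    for event in deliveries:
        table[event["event_id"]] = event["amount"]
    return sum(table.values())


events = [
    {"event_id": "e1", "amount": 10},
    {"event_id": "e2", "amount": 5},
]
# Simulate a consumer restart that replays e1 (at-least-once delivery).
deliveries = events + [events[0]]

print(apply_naive(deliveries))       # 25: duplicate counted
print(apply_idempotent(deliveries))  # 15: effectively-once result
```

The trade-off versus Kafka transactions plus Flink two-phase commit is that the sink must support keyed upserts, in exchange for lower latency and simpler recovery.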

ML Data Infrastructure (Days 46–49)

  • Can you design a feature store with online (Redis/Cassandra) and offline (Iceberg) paths?
  • Can you explain point-in-time correctness and why it prevents training data leakage?
  • Can you explain training-serving skew and how a feature store prevents it architecturally?
  • Can you design an ML pipeline with data versioning, experiment tracking, and a model registry promotion workflow?
  • Can you design a RAG pipeline end-to-end: chunking → embedding → vector store → hybrid retrieval → reranking?
  • Can you explain HNSW, IVF, and IVF-PQ with memory/recall trade-offs?
  • Can you design an A/B testing data infrastructure with SRM detection, sequential testing, and guardrail metrics?
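Point-in-time correctness is easiest to explain with a concrete lookup: for each label, use the latest feature value known at or before the label's timestamp, never a later one. A minimal sketch follows, assuming integer timestamps and a per-entity history sorted by time; the function names and sample data are hypothetical.

```python
from bisect import bisect_right


def point_in_time_lookup(feature_history, as_of):
    """Return the latest value with timestamp <= as_of, or None if none exists.

    feature_history is a list of (timestamp, value) sorted by timestamp.
    """
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_right(timestamps, as_of)
    return feature_history[idx - 1][1] if idx else None


def build_training_rows(labels, features):
    """Join each (entity, label_ts, label) to the feature value known at label_ts."""
    rows = []
    for entity, label_ts, label in labels:
        value = point_in_time_lookup(features.get(entity, []), label_ts)
        rows.append((entity, label_ts, value, label))
    return rows


# purchases_7d for user u1, recorded at times 1, 5, and 9.
features = {"u1": [(1, 0), (5, 3), (9, 7)]}
labels = [("u1", 6, 1)]  # label observed at time 6

rows = build_training_rows(labels, features)
leaky_value = features["u1"][-1][1]  # "latest value" join leaks the time-9 update

print(rows)         # [('u1', 6, 3, 1)]
print(leaky_value)  # 7
```

The leaky join trains on a value (7) that did not exist when the label was observed, which is exactly the leakage the point-in-time join prevents.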

Data Platform Design (Days 50–52)

  • Can you explain the four data mesh principles and the hybrid reality?
  • Can you distinguish data mesh (organizational) from data fabric (technology) and choose between them?
  • Can you design a multi-tenant data platform with compute isolation, cost chargeback, and tenant onboarding automation?

Security & Privacy (Days 53–54)

  • Can you layer RBAC → RLS → CLS/ABAC → encryption → audit logging as defense-in-depth?
  • Can you explain masking vs tokenization vs encryption and choose correctly per scenario?
  • Can you explain GDPR right-to-erasure as a distributed systems engineering problem?
  • Can you design privacy-by-design with PII detection, consent flags, and retention policies?

Cost & Performance (Days 55–56)

  • Can you calculate savings from Spot instances, auto-termination, and storage tiering?
  • Can you diagnose a slow pipeline by bottleneck category: source, transform (skew/OOM), sink, orchestration, contention?
  • Can you read a Spark UI and identify the root cause of a slow job?
  • Can you explain AQE and enable the three settings that solve most production issues?
  • Can you explain sargable (Search ARGument ABLE) predicates and why wrapping an indexed column in a function defeats the index?
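The sargable-predicate point can be demonstrated with the query planner itself. The sketch below uses SQLite as a stand-in (an assumption made so the example runs anywhere; warehouses and Spark show the same effect through index use or partition pruning). The `orders` table, its columns, and the index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_orders_created_at ON orders (created_at)")
conn.executemany(
    "INSERT INTO orders (created_at, amount) VALUES (?, ?)",
    [
        ("2024-01-14 23:10:00", 10.0),
        ("2024-01-15 08:00:00", 20.0),
        ("2024-01-15 17:30:00", 30.0),
        ("2024-01-16 00:05:00", 40.0),
    ],
)


def plan(sql):
    """Return the planner's description; the detail text is the last column."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# Function applied to the indexed column: the index cannot be used.
non_sargable = "SELECT amount FROM orders WHERE date(created_at) = '2024-01-15'"
# Same filter rewritten as a range on the bare column: the index applies.
sargable = ("SELECT amount FROM orders "
            "WHERE created_at >= '2024-01-15' AND created_at < '2024-01-16'")

non_sargable_plan = plan(non_sargable)
sargable_plan = plan(sargable)
print(non_sargable_plan)
print(sargable_plan)

# Both forms return the same rows; only the access path differs.
same_rows = (sorted(conn.execute(non_sargable).fetchall())
             == sorted(conn.execute(sargable).fetchall()))
print(same_rows)
```

The first plan reports a full scan and the second a search using the index, even though the two queries are logically equivalent.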

System Design Practice (Days 57–59)

  • Can you design surge pricing: H3 indexing → Kafka (partitioned by zone_id) → Flink sliding windows → Redis serving?
  • Can you design CDN analytics: edge pre-aggregation → Kafka → Flink → ClickHouse + S3 batch → Iceberg?
  • Can you design fraud detection: two-path architecture (sync scoring + async ring detection) with feedback loop?

Phase 2 Master Pattern Library: The 15 Advanced Answers

  1. On company-specific scale: state it up front to frame decisions.
  2. On GCP-native design: default Pub/Sub → Dataflow → BigQuery; deviate only with clear latency/consistency needs.
  3. On LLM training data: PII stripped before pipeline; dedup is mandatory; quality gates must block training runs.
  4. On feature stores: training-serving skew is solved architecturally by a single source of feature computation logic.
  5. On RAG: chunking strategy often matters more than embeddings; use hybrid search (BM25 + vector) by default.
  6. On A/B infra: SRM detection is mandatory; sequential testing enables valid early stopping; guardrails should auto-pause experiments.
  7. On data mesh: pure mesh often fails at scale; hybrid approach wins (platform/fabric core + mesh ownership principles).
  8. On security: defense-in-depth (network → IAM → fine-grained policy → encryption → audit).
  9. On GDPR erasure: deletion must propagate to every copy: Kafka tombstones (on compacted topics), lakehouse row deletes plus compaction, feature-store eviction, and an exclusion registry for ML training data.
  10. On cost: quantify first; Spot is highest ROI for batch; require_partition_filter prevents runaway warehouse spend.
  11. On performance: diagnose bottleneck category first; Spark UI is the starting point; keep AQE enabled (it is on by default since Spark 3.2).
  12. On geospatial: H3 for spatial aggregation; partition Kafka by cell ID to keep state local and avoid shuffle.
  13. On CDN/log analytics: pre-aggregate at edge; raw logs batch to object store for history.
  14. On fraud: two paths, two latency tiers; never put ring detection in the sync authorization path.
  15. On the meta-skill: every decision must tie to a requirement + trade-off + what would change your mind.

The Phase 2 Timed Practice (Do This Weekend)

Set aside ~4 hours for four timed 45-minute exercises: 3 hours of timed work plus time for self-review between rounds.

Exercise 1 (45 min): Meta News Feed + Privacy

“Design the data infrastructure for Facebook’s News Feed. New requirements: EU user data must remain in EU regions only, and an EU user’s data must be deleted within 7 days of an erasure request.”

Focus:

  • How data residency changes fan-out and serving
  • How right-to-erasure propagates through derived datasets and feature stores

Exercise 2 (45 min): Google CDN Analytics (no notes)

From memory: edge nodes → Kafka → Flink → ClickHouse + S3 batch → Iceberg.

Include:

  • Dedup strategy
  • Anomaly detection logic
  • Latency budget and freshness targets

Exercise 3 (45 min): Anthropic RAG + Safety

“Design enterprise retrieval for Claude: tenant isolation, 10M docs/tenant, and retrieval must not return harmful content.”

Apply progressive complexity:

  • Standard RAG → tenant isolation → safety filtering at retrieval

Exercise 4 (45 min): Netflix ML Feature Pipeline

From memory: online path (Flink → Cassandra), offline path (Spark → Iceberg), point-in-time joins, and A/B testing infra.

Gaps Identified: Common Weak Areas

Gap 1: Company-specific vocabulary

Can you say “Scuba” at Meta, “Dremel” at Google, and other stack terms naturally (without forcing it)?

Gap 2: Trade-off quantification

Mid-level: “Use Spot.”

Senior: “If 70% of our $500K/month batch compute ($350K) can move to Spot at an assumed ~65% discount, savings are ~$230K/month; nightly ETL tolerates interruptions via retries/checkpoints.”
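The arithmetic behind a quantified Spot answer is worth rehearsing: the dollars saved are the Spot-eligible spend multiplied by the discount rate, not the eligible spend itself. A minimal sketch follows; the 65% discount is an assumed figure for illustration (real Spot discounts vary by instance type, region, and time), and the function name is hypothetical.

```python
def spot_savings(monthly_batch_spend, spot_eligible_fraction, spot_discount):
    """Monthly savings = spend moved to Spot x fractional discount on that spend."""
    eligible_spend = monthly_batch_spend * spot_eligible_fraction
    return eligible_spend * spot_discount


# $500K/month batch compute, 70% eligible, assumed 65% Spot discount.
eligible = 500_000 * 0.70
savings = spot_savings(500_000, 0.70, 0.65)

print(f"Eligible spend: ${eligible:,.0f}/month")   # Eligible spend: $350,000/month
print(f"Savings:        ${savings:,.0f}/month")    # Savings:        $227,500/month
```

Stating the discount assumption out loud is part of the senior-level answer: it shows which number would change the conclusion.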

Gap 3: Progressive complexity under pressure

For any design: add one constraint and re-evaluate what breaks first (latency, consistency, privacy, cost).

Gap 4: Safety and privacy integration

Weave it into the first diagram and first data flow, not as an afterthought.

Phase 2 → Phase 3 Transition

Phase 3 (Days 61–90) is application-only: timed mocks and integration, not new concepts.

The Single Most Important Preparation for Phase 3

Practice speaking, not reading.

Pick three designs from Phase 2 and explain them out loud, timed, as if you’re in the interview. The hesitation points are what to drill next.

See You at Day 61

Phase 3 starts with a full mock: Design Instagram Stories Analytics (Meta style).