Phase 2: Company-Specific | Category: Google-Specific
The Prompt
“Design a data pipeline to process Google Search analytics. The system ingests search query events from billions of daily searches, computes metrics for the Search Quality team (query success rate, zero-result rate, spelling correction usage, latency distribution), powers real-time anomaly detection, and provides historical trend analysis. Cost efficiency is a first-class requirement.”
Set a 45-minute timer. Apply everything from Day 35.
Step 1: Requirements (5 min)
Functional:
- Ingest search query events from all surfaces (Search, Google Maps, Gmail Search, Image Search, etc.)
- Compute quality metrics: query success rate, zero-result rate, spelling correction rate, p50/p95/p99 latency
- Real-time anomaly detection: flag sudden metric degradations within 5 minutes
- Historical trend analysis: year-over-year, week-over-week comparison going back 3+ years
- Support dimensional slicing: by country, language, query category, device type, surface
- Self-serve analytics for Search Quality team (~500 analysts)
- A/B test support: compare metric distributions across experiment variants
Non-functional:
Scale
: 8.5B searches/day = ~100K searches/sec peak Estimate: 8.5B × 800 bytes avg event = ~6.8 TB/day raw Compressed: ~1 TB/day (Capacitor-equivalent, ~7x compression) Latency: Anomaly detection: < 5 minutes from event to alert Real-time dashboards: < 60 seconds data freshness Batch analytics: data by 8 AM PT for prior day Availability: 99.99% for ingestion (Search can't drop queries) 99.9% for analytics serving Cost: Minimize BigQuery scan costs — this is critical at this scale Privacy: User search queries are sensitive PII — strict access controls required Retention: Hot (90 days): queryable in BigQuery Warm (1-3 years): GCS + BigQuery external tables Cold (3+ years): GCS Nearline/Archive
Scope: “I’ll design the full analytics pipeline from event ingestion through serving. I won’t design the Search ranking system itself — just the data infrastructure that analyzes search quality.”
Step 2: Scale Estimation (3 min)
Ingestion
: 100K events/sec × 800 bytes = 80 MB/sec = ~6.8 TB/day raw Post-compression (Capacitor/Parquet ~7x): ~1 TB/day Annual retention (3 years): ~1.1 PB compressed BigQuery cost estimation (on-demand pricing $6.25/TiB): If analysts run unoptimized queries on 3-year history (1.1 PB): 1,100 TB × $6.25 = $6,875 per full-scan query — UNACCEPTABLE With date partitioning (query one day of 365): 3 TB / 365 = ~8 GB per day-partition = $0.05 per day-query → GOOD Cost optimization is non-negotiable at this scale. Pub/Sub throughput: 80 MB/sec raw → well within Pub/Sub's auto-scaling capacity No shard management needed — Pub/Sub handles this automatically Dataflow workers: For streaming path: 80 MB/sec / ~10 MB/sec per worker = ~8-12 workers steady state With 2x headroom for peak: 20-25 workers auto-scaled by Dataflow Anomaly detection: 500 metric time series (metric × dimension combinations) Updated every 1 minute → 500 writes/min to monitoring → trivial
Architecture implication: Pub/Sub + Dataflow + BigQuery is the clear GCP-native answer. Strict partitioning + clustering + materialized views are mandatory — not optional — to make the cost model work.
Step 3: Architecture (10-12 min)
┌─────────────────────────────────────────────────────────────────────┐ │ SEARCH EVENT SOURCES │ │ Web search, Mobile search, Maps, Gmail, Images, Shopping │ │ Each search generates 1 event with: query_id, query_text_hash, │ │ surface, country, language, device_type, latency_ms, result_count, │ │ spelling_corrected, clicked, timestamp │ │ NOTE: query_text is HASHED before leaving the search stack (PII) │ └──────────────────────────────┬──────────────────────────────────────┘ ↓ Avro + schema validation at source ┌─────────────────────────────────────────────────────────────────────┐ │ CLOUD PUB/SUB │ │ Topic: search-quality-events │ │ Subscriptions: │ │ → sub-dataflow-realtime (Dataflow streaming consumer) │ │ → sub-dataflow-batch (fan-out to GCS via Dataflow batch sink) │ │ No partition key needed — Pub/Sub handles distribution │ │ Retention: 7 days (for replay on pipeline failure) │ └──────────────┬──────────────────────────┬───────────────────────────┘ ↓ real-time path ↓ batch path ┌──────────────────────────────┐ ┌──────────────────────────────────┐ │ DATAFLOW STREAMING JOB │ │ DATAFLOW BATCH SINK │ │ (Apache Beam pipeline) │ │ Pub/Sub → GCS (Parquet/Avro) │ │ │ │ Raw events, partitioned by hour │ │ • Dedup (query_id, 1hr win) │ │ Runs continuously in micro-batch │ │ • Validate schema │ │ 5-min commit interval │ │ • DLQ for invalid events │ └──────────────┬───────────────────┘ │ • 1-min tumbling windows │ ↓ (raw data on GCS) │ • Aggregate metrics: │ ┌──────────────────────────────────┐ │ - query_count per min │ │ CLOUD STORAGE (GCS) — BRONZE │ │ - zero_result_rate │ │ gs://search-analytics/raw/ │ │ - p50/p95/p99 latency │ │ Partitioned: year/month/day/hr │ │ - spelling_correction_rate│ │ Format: Parquet + Zstd │ │ • Dimensioned by: │ │ Retention: 3 years (auto-tier) │ │ country, surface, language│ │ Active → Nearline after 90 days │ │ • Anomaly detection: │ └──────────────┬───────────────────┘ │ Z-score vs 7-day baseline │ ↓ (Cloud Composer DAG) │ • Anomaly → Cloud Monitoring│ ┌──────────────────────────────────┐ │ custom metric → Alert │ │ DATAFLOW BATCH ETL (daily) │ └──────┬───────────────────────┘ │ GCS → transform → BigQuery │ ↓ (real-time metrics) │ • Join with dim tables │ ┌──────────────────────────────┐ │ • Compute session metrics │ │ BIGQUERY STREAMING TABLE │ │ • Quality score aggregation │ │ analytics.realtime_metrics │ │ • A/B experiment joining │ │ (Storage Write API) │ │ Runs: 2 AM PT, data by 8 AM │ │ Queryable within < 60 sec │ └──────────────┬───────────────────┘ │ Partitioned by minute │ ↓ │ Retention: 30 days │ ┌──────────────────────────────────┐ └──────┬───────────────────────┘ │ BIGQUERY GOLD TABLES │ │ │ gold.fact_search_events │ │ │ gold.daily_search_quality │ └───────────────┬───────────│ gold.experiment_metrics │ ↓ └──────────────┬───────────────────┘ ┌────────────────────────────────────┐ │ SERVING LAYER │ │ Looker → BigQuery Gold (dashboards)│ │ Dataplex → governance + lineage │ │ Cloud Monitoring → anomaly alerts │ │ Vertex AI → trend forecasting │ └────────────────────────────────────┘
Step 4: BigQuery Table Design — Cost First
This is where Google interviews focus most heavily. Every design decision must be justified by cost and performance.
Primary Fact Table: gold.fact_search_events
CREATE
TABLE gold.fact_search_events ( -- Keys query_id STRING NOT NULL, event_timestamp TIMESTAMP NOT NULL, event_date DATE NOT NULL, -- materialized for partitioning -- Dimensions (denormalized — star schema for query efficiency) country_code STRING, -- ISO 3166 language_code STRING, -- BCP 47 surface STRING, -- 'web', 'maps', 'gmail', 'images' device_category STRING, -- 'desktop', 'mobile', 'tablet' query_category STRING, -- ML-classified: 'navigational', 'informational', 'transactional' -- Metrics latency_ms INT64, result_count INT64, is_zero_result BOOL, is_spelling_corrected BOOL, is_clicked BOOL, -- Experiment tracking experiment_id STRING, variant_id STRING ) PARTITION BY event_date CLUSTER BY country_code, surface, language_code OPTIONS ( require_partition_filter = TRUE, -- MANDATORY: prevent $6,875 full scans partition_expiration_days = 90 -- auto-delete after 90 days (older data on GCS) );
Design justifications:
- PARTITION BY event_date: All quality queries filter by date range. Partition pruning = scan only relevant dates.
- CLUSTER BY country_code, surface, language_code: The three most common filter/group-by dimensions for Search Quality team.
- require_partition_filter = TRUE: Hard guard against full-table scans. Non-negotiable at this scale.
- partition_expiration_days = 90: Older data lives on GCS as Parquet, queryable via BigQuery external tables at $0 storage cost in BigQuery.
Daily Aggregation Table: gold.daily_search_quality
CREATE
TABLE gold.daily_search_quality ( metric_date DATE NOT NULL, country_code STRING NOT NULL, surface STRING NOT NULL, language_code STRING NOT NULL, device_category STRING NOT NULL, query_category STRING NOT NULL, -- Pre-aggregated metrics (avoid scanning fact table for dashboard) search_volume INT64, zero_result_count INT64, zero_result_rate FLOAT64, spelling_correction_rate FLOAT64, click_rate FLOAT64, -- Latency percentiles (use APPROX_QUANTILES at aggregation time) latency_p50_ms INT64, latency_p95_ms INT64, latency_p99_ms INT64 ) PARTITION BY metric_date CLUSTER BY country_code, surface;
Why this exists: Dashboard queries asking “show me zero_result_rate by country for last 30 days” should NOT scan 30 days of the fact table (billions of rows). They should scan this aggregation table (30 rows × few thousand dimension combos = microseconds to fetch). The daily aggregation is built nightly by the batch Dataflow job.
Materialized Views for the Hottest Dashboard Queries
--
Top 10 most-run dashboard query: last 7-day trend for a specific country CREATE MATERIALIZED VIEW gold.mv_7day_quality_by_country OPTIONS (enable_refresh = true, refresh_interval_minutes = 60) AS SELECT metric_date, country_code, SUM(search_volume) as searches, AVG(zero_result_rate) as avg_zero_result_rate, AVG(latency_p95_ms) as avg_p95_latency FROM gold.daily_search_quality WHERE metric_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) GROUP BY metric_date, country_code;
BigQuery automatically rewrites queries against gold.daily_search_quality to use this materialized view when the WHERE and GROUP BY match. Dashboards scanning 7 days of country data hit an in-memory pre-computed result instead of a fact table.
Step 5: Real-Time Anomaly Detection Design
The 5-minute SLA for anomaly detection requires the streaming path, not batch.
Dataflow anomaly detection logic (Beam):
on
1-minute tumbling window per (surface, country) combination window_results = ( events | 'Window' >> beam.WindowInto(FixedWindows(60)) # 1-minute tumbling | 'Key by dimension' >> beam.Map(lambda e: ((e.surface, e.country_code), e)) | 'Aggregate' >> beam.CombinePerKey(SearchMetricsAggregator()) ) # Anomaly detection: compare window metric to 7-day rolling baseline anomalies = ( window_results | 'Detect anomalies' >> beam.ParDo(AnomalyDetector( baseline_store=BigTableBaselineStore(), # reads 7-day rolling avg from Bigtable threshold_stddev=3.0 # Z-score threshold )) | 'Filter significant' >> beam.Filter(lambda x: x.is_significant) | 'Publish alert' >> beam.io.WriteToPubSub('projects/p/topics/quality-alerts') )
Baseline stored in Bigtable: The 7-day rolling baseline (mean and stddev per metric per dimension) is precomputed daily and stored in Bigtable for sub-10ms lookup at streaming time. Row key: {surface}#{country_code}#{metric_name}. Dataflow reads baseline for each window to compute Z-scores.
Alert routing:
Pub
/Sub quality-alerts topic → Cloud Monitoring custom metric (for dashboard) → Cloud Functions → PagerDuty (P1: zero_result_rate > 3σ above baseline) → Cloud Functions → Slack #search-quality-alerts (P2: any 2σ deviation)
Step 6: Privacy-Aware Design
Search queries are among the most sensitive PII imaginable. Google’s design:
Query text handling:
-
Raw query text is NEVER stored in the analytics pipeline
-
At the search serving layer (before Pub/Sub), the query is replaced with:
-
query_hash: SHA-256 of normalized query (for deduplication only, not reversible)
-
query_category: ML classifier output (navigational, informational, transactional)
-
query_language: detected language
-
-
The analytics pipeline receives no raw query text — only derived signals
IAM structure:
Raw
events (GCS Bronze): Only pipeline service accounts Fact table (BigQuery): Only data engineering team + approved ML jobs Daily aggregation (BigQuery): Search Quality team (analysts can self-serve) Country-level aggregations: Minimum threshold: 1000 queries/day (prevents re-identification) Public-facing metrics: Only heavily aggregated, no individual or small-group data
Audit logging: All BigQuery queries to the fact table logged to Cloud Audit Logs → BigQuery (stored separately) → compliance team monitors access patterns.
Step 7: The Batch ETL Pipeline (Cloud Composer)
on
Cloud Composer (Airflow) DAG: daily search quality pipeline with DAG( dag_id="search_quality_daily", schedule_interval="0 2 * * *", # 2 AM PT = 10 AM UTC default_args={"retries": 3, "retry_delay": timedelta(minutes=10)}, sla_miss_callback=pagerduty_alert ) as dag: wait_for_raw_data = BigQuerySensor( task_id="wait_for_gcs_landing", # Verify yesterday's hourly partitions are all complete sql="""SELECT COUNT(*) >= 24 FROM gcs_metadata WHERE date = '{{ ds }}' AND all_hours_complete = true""", poke_interval=300 ) run_quality_etl = DataflowCreateJavaJobOperator( task_id="run_quality_etl_dataflow", job_name="search-quality-daily-{{ ds }}", job_class="com.google.search.analytics.QualityETL", parameters={ "date": "{{ ds }}", "input": "gs://search-analytics/raw/{{ ds }}/*", "output": "search-analytics:gold.fact_search_events${{ ds_nodash }}" } ) run_dbt_aggregations = BashOperator( task_id="run_dbt_aggregations", bash_command="dbt run --select gold.daily_search_quality --vars '{date: {{ ds }}}'" ) validate_output = BigQueryCheckOperator( task_id="validate_daily_volume", sql=""" SELECT CASE WHEN COUNT(*) > (SELECT search_volume * 0.85 FROM gold.daily_search_quality WHERE metric_date = DATE_SUB('{{ ds }}', INTERVAL 7 DAY) AND country_code = 'US') THEN 1 ELSE 0 END FROM gold.fact_search_events WHERE event_date = '{{ ds }}' AND country_code = 'US' """ ) # Circuit breaker: fail if US volume < 85% of prior week same day publish_success = BigQueryInsertJobOperator( task_id="mark_partition_complete", configuration={"query": { "query": f"INSERT INTO metadata.pipeline_status VALUES ('{{ ds }}', CURRENT_TIMESTAMP(), 'SUCCESS')" }} ) wait_for_raw_data >> run_quality_etl >> run_dbt_aggregations >> validate_output >> publish_success
Step 8: Cost Optimization Manifest
Explicitly naming cost decisions signals Google-level operational maturity:
| Optimization | Estimated impact |
|---|---|
require_partition_filter = TRUE on fact table | Eliminates accidental $6,875+ scans |
partition_expiration_days = 90 in BQ, older data on GCS | ~75% BigQuery storage cost reduction for 3-year retention |
| GCS auto-tier to Nearline after 90 days ($0.01/GB vs $0.02/GB) | ~50% storage cost reduction for warm data |
gold.daily_search_quality pre-aggregation table | 99%+ reduction in bytes scanned for dashboard queries |
| Materialized views for top 10 dashboard patterns | Dashboard queries at near $0 per query |
APPROX_QUANTILES(latency_ms, 100) instead of exact percentiles | 10x cheaper than PERCENTILE_CONT at this volume |
| BigQuery BI Engine reservation for Looker | Eliminates query slot consumption for repetitive dashboard loads |
| Flat-rate slot reservation for nightly batch | Predictable cost, cheaper than on-demand for scheduled workloads |
INFORMATION_SCHEMA.JOBS alerting | Auto-alert when any single query exceeds $50 threshold |
Total cost reduction vs naive implementation (no partitioning, no pre-aggregation): estimated 95-99% reduction in BigQuery compute costs.
Step 9: Dataplex for Governance
Google’s Dataplex is the GCP-native data governance layer — mentioning it in a Google interview is a differentiator.
Dataplex
Lake: search-analytics Zone: raw (GCS Bronze) Zone: curated (BigQuery Gold) Zone: sandbox (analyst experiments, auto-delete after 30 days) Data Quality Rules on gold.fact_search_events: - event_date NOT NULL (completeness) - latency_ms BETWEEN 0 AND 60000 (validity — no impossible values) - country_code matches ISO 3166 (validity) - search_volume > 0 for each partition (completeness) Quality score visible in Data Catalog for each table Policy Tags for PII columns: - query_hash: RESTRICTED — requires explicit IAM permission - Column-level security enforced via Data Catalog Policy Tags
Interview Questions
Q1: “The Search Quality team says their BigQuery dashboard takes 45 seconds per query. How do you fix this?”
Model Answer: “45-second queries on a search analytics table almost certainly means one of three problems. First, I’d check if the query is filtering on the partition column. INFORMATION_SCHEMA.JOBS_BY_PROJECT will show bytes billed — if it’s scanning more than a day’s worth of data for a ‘show today’s metrics’ query, partition pruning isn’t working. Common causes: function on the partition column (EXTRACT(DATE FROM timestamp) instead of WHERE event_date = CURRENT_DATE()), or the WHERE clause uses a column that isn’t the partition column.
Second, if partition pruning is working but still slow, I’d check if they’re hitting the fact table directly instead of the pre-aggregated daily_search_quality table. Most dashboard queries just need daily aggregates — scanning the 8.5B-event fact table for those is wasteful when a 50K-row aggregation table exists.
Third, if the query is legitimately complex and needs the fact table — add a BigQuery BI Engine reservation. For Looker dashboards with repetitive query patterns, BI Engine serves results from in-memory cache at sub-second latency with no slot consumption.
My structured fix: (1) Check INFORMATION_SCHEMA.JOBS to find the query and its bytes_billed. (2) Add require_partition_filter = TRUE if not set — forces proper date filtering. (3) Redirect dashboard to daily_search_quality for standard metrics. (4) Create materialized view for the specific query pattern if it runs > 100 times/day. (5) Set up per-user query quota alerts to catch future high-cost queries before they accumulate.”
Q2: “How would you detect if a new Search algorithm change caused degradation in zero-result rate within 5 minutes of rollout?”
Model Answer: “This is exactly the streaming anomaly detection path. When the new algorithm is rolled out, it affects search results immediately — I need to detect the signal in the Dataflow streaming layer, not the batch layer. My design:
The Dataflow streaming job computes zero_result_rate in 1-minute tumbling windows, keyed by (surface, country_code). For each window, it fetches the 7-day rolling baseline (same hour, same day-of-week) from Bigtable — sub-10ms lookup. If the current window’s zero_result_rate exceeds baseline by more than 3 standard deviations, a Cloud Monitoring custom metric fires and triggers an alert.
For an algorithm rollout, I’d add one more dimension: experiment_variant. If the rollout is done as a 1% / 99% A/B test, the zero_result_rate is windowed by (surface, country_code, experiment_variant). If variant B shows 3σ higher zero_result_rate than variant A for the same window, that’s the degradation signal — and it fires within minutes of rollout, not hours.
Alert routing: P1 to PagerDuty (the algorithm team gets paged immediately if production zero-result rate spikes 3σ). The alert body includes: current metric, baseline, Z-score, time range, and a link to the Looker dashboard filtered to the affected surface and time window. The team can evaluate within minutes whether to roll back the algorithm change.”
Self-Assessment: 5 Questions
-
Why is require_partition_filter = TRUE non-negotiable at this scale? What’s the cost of not having it?
-
What data lives in BigQuery vs GCS for the 3-year history, and why?
-
Why is the query text never stored in the analytics pipeline?
-
What’s the difference between the real-time anomaly detection path and the batch analytics path — what does each serve?
-
How does Dataplex fit into this design — what does it provide that BigQuery alone doesn’t?
Quick Reference: Google Search Analytics Pipeline
- GCP Streaming Trinity: Pub/Sub (shock absorber) → Dataflow (stateful processing) → BigQuery (analytics warehouse). This is the answer for every streaming analytics problem at Google.
- BigQuery cost hierarchy: require_partition_filter → date partitioning → clustering → pre-aggregation tables → materialized views → BI Engine. Apply ALL of these.
- Anomaly detection: Dataflow 1-min tumbling windows → Z-score vs Bigtable 7-day baseline → Cloud Monitoring custom metric → Alerting policy. 5-minute SLA achievable.
- Privacy-first: Raw query text NEVER enters the analytics pipeline. Hashes and ML-classified categories only. Column-level security via Dataplex Policy Tags. Minimum aggregation thresholds.
- Cost math: Without optimization = $6,875 per full scan. With date partitioning + clustering + require_partition_filter = $0.05 per typical dashboard query. 99%+ cost reduction.
- Dataplex: GCP-native governance — data quality rules, policy tags for PII, zone-based access control, Data Catalog for discovery. Mentioning it signals Google-specific depth.
Tomorrow’s Preview
Day 37: OpenAI Data Engineering & AI-Native Pipelines — How OpenAI thinks about data infrastructure differently from traditional tech companies. Training data pipelines at 100TB+ scale, RLHF data infrastructure, embedding pipelines, and what “AI-native data engineering” means when your product IS the AI model. How OpenAI interviews differ from FAANG and what they specifically probe.