Day 36 — Design: Large-Scale Search Analytics Pipeline (Google Style)

Phase 2: Company-Specific | Category: Google-Specific

The Prompt

“Design a data pipeline to process Google Search analytics. The system ingests search query events from billions of daily searches, computes metrics for the Search Quality team (query success rate, zero-result rate, spelling correction usage, latency distribution), powers real-time anomaly detection, and provides historical trend analysis. Cost efficiency is a first-class requirement.”

Set a 45-minute timer. Apply everything from Day 35.

Step 1: Requirements (5 min)

Functional:

Ingest search query events from all surfaces (Search, Google Maps, Gmail Search, Image Search, etc.)
Compute quality metrics: query success rate, zero-result rate, spelling correction rate, p50/p95/p99 latency
Real-time anomaly detection: flag sudden metric degradations within 5 minutes
Historical trend analysis: year-over-year, week-over-week comparison going back 3+ years
Support dimensional slicing: by country, language, query category, device type, surface
Self-serve analytics for Search Quality team (~500 analysts)
A/B test support: compare metric distributions across experiment variants

Non-functional:

Scale
:         8.5B searches/day = ~100K searches/sec peak                 Estimate: 8.5B × 800 bytes avg event = ~6.8 TB/day raw                 Compressed: ~1 TB/day (Capacitor-equivalent, ~7x compression)  Latency:       Anomaly detection: < 5 minutes from event to alert                 Real-time dashboards: < 60 seconds data freshness                 Batch analytics: data by 8 AM PT for prior day  Availability:  99.99% for ingestion (Search can't drop queries)                 99.9% for analytics serving  Cost:          Minimize BigQuery scan costs — this is critical at this scale  Privacy:       User search queries are sensitive PII — strict access controls required  Retention:     Hot (90 days): queryable in BigQuery                 Warm (1-3 years): GCS + BigQuery external tables                 Cold (3+ years): GCS Nearline/Archive

Scope: “I’ll design the full analytics pipeline from event ingestion through serving. I won’t design the Search ranking system itself — just the data infrastructure that analyzes search quality.”

Step 2: Scale Estimation (3 min)

Ingestion
:    100K events/sec × 800 bytes = 80 MB/sec = ~6.8 TB/day raw    Post-compression (Capacitor/Parquet ~7x): ~1 TB/day    Annual retention (3 years): ~1.1 PB compressed  BigQuery cost estimation (on-demand pricing $6.25/TiB):    If analysts run unoptimized queries on 3-year history (1.1 PB):      1,100 TB × $6.25 = $6,875 per full-scan query — UNACCEPTABLE    With date partitioning (query one day of 365):      3 TB / 365 = ~8 GB per day-partition = $0.05 per day-query → GOOD    Cost optimization is non-negotiable at this scale.  Pub/Sub throughput:    80 MB/sec raw → well within Pub/Sub's auto-scaling capacity    No shard management needed — Pub/Sub handles this automatically  Dataflow workers:    For streaming path: 80 MB/sec / ~10 MB/sec per worker = ~8-12 workers steady state    With 2x headroom for peak: 20-25 workers auto-scaled by Dataflow  Anomaly detection:    500 metric time series (metric × dimension combinations)    Updated every 1 minute → 500 writes/min to monitoring → trivial

Architecture implication: Pub/Sub + Dataflow + BigQuery is the clear GCP-native answer. Strict partitioning + clustering + materialized views are mandatory — not optional — to make the cost model work.

Step 3: Architecture (10-12 min)

┌─────────────────────────────────────────────────────────────────────┐  │              SEARCH EVENT SOURCES                                    │  │  Web search, Mobile search, Maps, Gmail, Images, Shopping           │  │  Each search generates 1 event with: query_id, query_text_hash,     │  │  surface, country, language, device_type, latency_ms, result_count, │  │  spelling_corrected, clicked, timestamp                              │  │  NOTE: query_text is HASHED before leaving the search stack (PII)   │  └──────────────────────────────┬──────────────────────────────────────┘                                  ↓ Avro + schema validation at source  ┌─────────────────────────────────────────────────────────────────────┐  │              CLOUD PUB/SUB                                          │  │  Topic: search-quality-events                                       │  │  Subscriptions:                                                     │  │    → sub-dataflow-realtime (Dataflow streaming consumer)            │  │    → sub-dataflow-batch (fan-out to GCS via Dataflow batch sink)    │  │  No partition key needed — Pub/Sub handles distribution             │  │  Retention: 7 days (for replay on pipeline failure)                 │  └──────────────┬──────────────────────────┬───────────────────────────┘                 ↓ real-time path            ↓ batch path  ┌──────────────────────────────┐   ┌──────────────────────────────────┐  │   DATAFLOW STREAMING JOB     │   │  DATAFLOW BATCH SINK             │  │   (Apache Beam pipeline)     │   │  Pub/Sub → GCS (Parquet/Avro)   │  │                              │   │  Raw events, partitioned by hour  │  │  • Dedup (query_id, 1hr win) │   │  Runs continuously in micro-batch │  │  • Validate schema           │   │  5-min commit interval            │  │  • DLQ for invalid events    │   └──────────────┬───────────────────┘  │  • 1-min tumbling windows    │                  ↓ (raw data on GCS)  │  • Aggregate metrics:        │   ┌──────────────────────────────────┐  │    - query_count per min     │   │  CLOUD STORAGE (GCS) — BRONZE   │  │    - zero_result_rate        │   │  gs://search-analytics/raw/      │  │    - p50/p95/p99 latency     │   │  Partitioned: year/month/day/hr  │  │    - spelling_correction_rate│   │  Format: Parquet + Zstd          │  │  • Dimensioned by:           │   │  Retention: 3 years (auto-tier)  │  │    country, surface, language│   │  Active → Nearline after 90 days │  │  • Anomaly detection:        │   └──────────────┬───────────────────┘  │    Z-score vs 7-day baseline │                  ↓ (Cloud Composer DAG)  │  • Anomaly → Cloud Monitoring│   ┌──────────────────────────────────┐  │    custom metric → Alert     │   │  DATAFLOW BATCH ETL (daily)     │  └──────┬───────────────────────┘   │  GCS → transform → BigQuery     │         ↓ (real-time metrics)       │  • Join with dim tables          │  ┌──────────────────────────────┐   │  • Compute session metrics       │  │ BIGQUERY STREAMING TABLE     │   │  • Quality score aggregation     │  │ analytics.realtime_metrics   │   │  • A/B experiment joining        │  │ (Storage Write API)          │   │  Runs: 2 AM PT, data by 8 AM    │  │ Queryable within < 60 sec    │   └──────────────┬───────────────────┘  │ Partitioned by minute        │                  ↓  │ Retention: 30 days           │   ┌──────────────────────────────────┐  └──────┬───────────────────────┘   │  BIGQUERY GOLD TABLES           │         │                           │  gold.fact_search_events         │         │                           │  gold.daily_search_quality       │         └───────────────┬───────────│  gold.experiment_metrics         │                         ↓           └──────────────┬───────────────────┘                ┌────────────────────────────────────┐                │         SERVING LAYER              │                │  Looker → BigQuery Gold (dashboards)│                │  Dataplex → governance + lineage    │                │  Cloud Monitoring → anomaly alerts  │                │  Vertex AI → trend forecasting      │                └────────────────────────────────────┘

Step 4: BigQuery Table Design — Cost First

This is where Google interviews focus most heavily. Every design decision must be justified by cost and performance.

Primary Fact Table: gold.fact_search_events

CREATE
TABLE gold.fact_search_events (    -- Keys    query_id            STRING NOT NULL,    event_timestamp     TIMESTAMP NOT NULL,    event_date          DATE NOT NULL,        -- materialized for partitioning    -- Dimensions (denormalized — star schema for query efficiency)    country_code        STRING,              -- ISO 3166    language_code       STRING,              -- BCP 47    surface             STRING,              -- 'web', 'maps', 'gmail', 'images'    device_category     STRING,              -- 'desktop', 'mobile', 'tablet'    query_category      STRING,              -- ML-classified: 'navigational', 'informational', 'transactional'    -- Metrics    latency_ms          INT64,    result_count        INT64,    is_zero_result      BOOL,    is_spelling_corrected BOOL,    is_clicked          BOOL,    -- Experiment tracking    experiment_id       STRING,    variant_id          STRING  )  PARTITION BY event_date  CLUSTER BY country_code, surface, language_code  OPTIONS (    require_partition_filter = TRUE,    -- MANDATORY: prevent $6,875 full scans    partition_expiration_days = 90      -- auto-delete after 90 days (older data on GCS)  );

Design justifications:

PARTITION BY event_date: All quality queries filter by date range. Partition pruning = scan only relevant dates.
CLUSTER BY country_code, surface, language_code: The three most common filter/group-by dimensions for Search Quality team.
require_partition_filter = TRUE: Hard guard against full-table scans. Non-negotiable at this scale.
partition_expiration_days = 90: Older data lives on GCS as Parquet, queryable via BigQuery external tables at $0 storage cost in BigQuery.

Daily Aggregation Table: gold.daily_search_quality

CREATE
TABLE gold.daily_search_quality (    metric_date         DATE NOT NULL,    country_code        STRING NOT NULL,    surface             STRING NOT NULL,    language_code       STRING NOT NULL,    device_category     STRING NOT NULL,    query_category      STRING NOT NULL,    -- Pre-aggregated metrics (avoid scanning fact table for dashboard)    search_volume       INT64,    zero_result_count   INT64,    zero_result_rate    FLOAT64,    spelling_correction_rate FLOAT64,    click_rate          FLOAT64,    -- Latency percentiles (use APPROX_QUANTILES at aggregation time)    latency_p50_ms      INT64,    latency_p95_ms      INT64,    latency_p99_ms      INT64  )  PARTITION BY metric_date  CLUSTER BY country_code, surface;

Why this exists: Dashboard queries asking “show me zero_result_rate by country for last 30 days” should NOT scan 30 days of the fact table (billions of rows). They should scan this aggregation table (30 rows × few thousand dimension combos = microseconds to fetch). The daily aggregation is built nightly by the batch Dataflow job.

Materialized Views for the Hottest Dashboard Queries

--
Top 10 most-run dashboard query: last 7-day trend for a specific country  CREATE MATERIALIZED VIEW gold.mv_7day_quality_by_country  OPTIONS (enable_refresh = true, refresh_interval_minutes = 60)  AS  SELECT    metric_date,    country_code,    SUM(search_volume) as searches,    AVG(zero_result_rate) as avg_zero_result_rate,    AVG(latency_p95_ms) as avg_p95_latency  FROM gold.daily_search_quality  WHERE metric_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)  GROUP BY metric_date, country_code;

BigQuery automatically rewrites queries against gold.daily_search_quality to use this materialized view when the WHERE and GROUP BY match. Dashboards scanning 7 days of country data hit an in-memory pre-computed result instead of a fact table.

Step 5: Real-Time Anomaly Detection Design

The 5-minute SLA for anomaly detection requires the streaming path, not batch.

Dataflow anomaly detection logic (Beam):

on

1-minute tumbling window per (surface, country) combination  window_results = (      events      | 'Window' >> beam.WindowInto(FixedWindows(60))  # 1-minute tumbling      | 'Key by dimension' >> beam.Map(lambda e: ((e.surface, e.country_code), e))      | 'Aggregate' >> beam.CombinePerKey(SearchMetricsAggregator())  )  # Anomaly detection: compare window metric to 7-day rolling baseline  anomalies = (      window_results      | 'Detect anomalies' >> beam.ParDo(AnomalyDetector(          baseline_store=BigTableBaselineStore(),  # reads 7-day rolling avg from Bigtable          threshold_stddev=3.0                    # Z-score threshold        ))      | 'Filter significant' >> beam.Filter(lambda x: x.is_significant)      | 'Publish alert' >> beam.io.WriteToPubSub('projects/p/topics/quality-alerts')  )

Baseline stored in Bigtable: The 7-day rolling baseline (mean and stddev per metric per dimension) is precomputed daily and stored in Bigtable for sub-10ms lookup at streaming time. Row key: {surface}#{country_code}#{metric_name}. Dataflow reads baseline for each window to compute Z-scores.

Alert routing:

Pub
/Sub quality-alerts topic      → Cloud Monitoring custom metric (for dashboard)      → Cloud Functions → PagerDuty (P1: zero_result_rate > 3σ above baseline)      → Cloud Functions → Slack #search-quality-alerts (P2: any 2σ deviation)

Step 6: Privacy-Aware Design

Search queries are among the most sensitive PII imaginable. Google’s design:

Query text handling:

Raw query text is NEVER stored in the analytics pipeline
At the search serving layer (before Pub/Sub), the query is replaced with:
- query_hash: SHA-256 of normalized query (for deduplication only, not reversible)
- query_category: ML classifier output (navigational, informational, transactional)
- query_language: detected language
The analytics pipeline receives no raw query text — only derived signals

IAM structure:

Raw
events (GCS Bronze):      Only pipeline service accounts  Fact table (BigQuery):        Only data engineering team + approved ML jobs  Daily aggregation (BigQuery): Search Quality team (analysts can self-serve)  Country-level aggregations:   Minimum threshold: 1000 queries/day (prevents re-identification)  Public-facing metrics:        Only heavily aggregated, no individual or small-group data

Audit logging: All BigQuery queries to the fact table logged to Cloud Audit Logs → BigQuery (stored separately) → compliance team monitors access patterns.

Step 7: The Batch ETL Pipeline (Cloud Composer)

on

Cloud Composer (Airflow) DAG: daily search quality pipeline  with DAG(      dag_id="search_quality_daily",      schedule_interval="0 2 * * *",  # 2 AM PT = 10 AM UTC      default_args={"retries": 3, "retry_delay": timedelta(minutes=10)},      sla_miss_callback=pagerduty_alert  ) as dag:      wait_for_raw_data = BigQuerySensor(          task_id="wait_for_gcs_landing",          # Verify yesterday's hourly partitions are all complete          sql="""SELECT COUNT(*) >= 24 FROM gcs_metadata                 WHERE date = '{{ ds }}' AND all_hours_complete = true""",          poke_interval=300      )      run_quality_etl = DataflowCreateJavaJobOperator(          task_id="run_quality_etl_dataflow",          job_name="search-quality-daily-{{ ds }}",          job_class="com.google.search.analytics.QualityETL",          parameters={              "date": "{{ ds }}",              "input": "gs://search-analytics/raw/{{ ds }}/*",              "output": "search-analytics:gold.fact_search_events${{ ds_nodash }}"          }      )      run_dbt_aggregations = BashOperator(          task_id="run_dbt_aggregations",          bash_command="dbt run --select gold.daily_search_quality --vars '{date: {{ ds }}}'"      )      validate_output = BigQueryCheckOperator(          task_id="validate_daily_volume",          sql="""              SELECT CASE                WHEN COUNT(*) > (SELECT search_volume * 0.85 FROM gold.daily_search_quality                                 WHERE metric_date = DATE_SUB('{{ ds }}', INTERVAL 7 DAY)                                   AND country_code = 'US')                THEN 1 ELSE 0 END              FROM gold.fact_search_events              WHERE event_date = '{{ ds }}' AND country_code = 'US'          """      )  # Circuit breaker: fail if US volume < 85% of prior week same day      publish_success = BigQueryInsertJobOperator(          task_id="mark_partition_complete",          configuration={"query": {              "query": f"INSERT INTO metadata.pipeline_status VALUES ('{{ ds }}', CURRENT_TIMESTAMP(), 'SUCCESS')"          }}      )      wait_for_raw_data >> run_quality_etl >> run_dbt_aggregations >> validate_output >> publish_success

Step 8: Cost Optimization Manifest

Explicitly naming cost decisions signals Google-level operational maturity:

Optimization	Estimated impact
`require_partition_filter = TRUE` on fact table	Eliminates accidental $6,875+ scans
`partition_expiration_days = 90` in BQ, older data on GCS	~75% BigQuery storage cost reduction for 3-year retention
GCS auto-tier to Nearline after 90 days ($0.01/GB vs $0.02/GB)	~50% storage cost reduction for warm data
`gold.daily_search_quality` pre-aggregation table	99%+ reduction in bytes scanned for dashboard queries
Materialized views for top 10 dashboard patterns	Dashboard queries at near $0 per query
`APPROX_QUANTILES(latency_ms, 100)` instead of exact percentiles	10x cheaper than `PERCENTILE_CONT` at this volume
BigQuery BI Engine reservation for Looker	Eliminates query slot consumption for repetitive dashboard loads
Flat-rate slot reservation for nightly batch	Predictable cost, cheaper than on-demand for scheduled workloads
`INFORMATION_SCHEMA.JOBS` alerting	Auto-alert when any single query exceeds $50 threshold

Total cost reduction vs naive implementation (no partitioning, no pre-aggregation): estimated 95-99% reduction in BigQuery compute costs.

Step 9: Dataplex for Governance

Google’s Dataplex is the GCP-native data governance layer — mentioning it in a Google interview is a differentiator.

Dataplex
Lake: search-analytics    Zone: raw (GCS Bronze)    Zone: curated (BigQuery Gold)    Zone: sandbox (analyst experiments, auto-delete after 30 days)  Data Quality Rules on gold.fact_search_events:    - event_date NOT NULL (completeness)    - latency_ms BETWEEN 0 AND 60000 (validity — no impossible values)    - country_code matches ISO 3166 (validity)    - search_volume > 0 for each partition (completeness)    Quality score visible in Data Catalog for each table  Policy Tags for PII columns:    - query_hash: RESTRICTED — requires explicit IAM permission    - Column-level security enforced via Data Catalog Policy Tags

Interview Questions

Q1: “The Search Quality team says their BigQuery dashboard takes 45 seconds per query. How do you fix this?”

Model Answer: “45-second queries on a search analytics table almost certainly means one of three problems. First, I’d check if the query is filtering on the partition column. INFORMATION_SCHEMA.JOBS_BY_PROJECT will show bytes billed — if it’s scanning more than a day’s worth of data for a ‘show today’s metrics’ query, partition pruning isn’t working. Common causes: function on the partition column (EXTRACT(DATE FROM timestamp) instead of WHERE event_date = CURRENT_DATE()), or the WHERE clause uses a column that isn’t the partition column.

Second, if partition pruning is working but still slow, I’d check if they’re hitting the fact table directly instead of the pre-aggregated daily_search_quality table. Most dashboard queries just need daily aggregates — scanning the 8.5B-event fact table for those is wasteful when a 50K-row aggregation table exists.

Third, if the query is legitimately complex and needs the fact table — add a BigQuery BI Engine reservation. For Looker dashboards with repetitive query patterns, BI Engine serves results from in-memory cache at sub-second latency with no slot consumption.

My structured fix: (1) Check INFORMATION_SCHEMA.JOBS to find the query and its bytes_billed. (2) Add require_partition_filter = TRUE if not set — forces proper date filtering. (3) Redirect dashboard to daily_search_quality for standard metrics. (4) Create materialized view for the specific query pattern if it runs > 100 times/day. (5) Set up per-user query quota alerts to catch future high-cost queries before they accumulate.”

Q2: “How would you detect if a new Search algorithm change caused degradation in zero-result rate within 5 minutes of rollout?”

Model Answer: “This is exactly the streaming anomaly detection path. When the new algorithm is rolled out, it affects search results immediately — I need to detect the signal in the Dataflow streaming layer, not the batch layer. My design:

The Dataflow streaming job computes zero_result_rate in 1-minute tumbling windows, keyed by (surface, country_code). For each window, it fetches the 7-day rolling baseline (same hour, same day-of-week) from Bigtable — sub-10ms lookup. If the current window’s zero_result_rate exceeds baseline by more than 3 standard deviations, a Cloud Monitoring custom metric fires and triggers an alert.

For an algorithm rollout, I’d add one more dimension: experiment_variant. If the rollout is done as a 1% / 99% A/B test, the zero_result_rate is windowed by (surface, country_code, experiment_variant). If variant B shows 3σ higher zero_result_rate than variant A for the same window, that’s the degradation signal — and it fires within minutes of rollout, not hours.

Alert routing: P1 to PagerDuty (the algorithm team gets paged immediately if production zero-result rate spikes 3σ). The alert body includes: current metric, baseline, Z-score, time range, and a link to the Looker dashboard filtered to the affected surface and time window. The team can evaluate within minutes whether to roll back the algorithm change.”

Self-Assessment: 5 Questions

Why is require_partition_filter = TRUE non-negotiable at this scale? What’s the cost of not having it?
What data lives in BigQuery vs GCS for the 3-year history, and why?
Why is the query text never stored in the analytics pipeline?
What’s the difference between the real-time anomaly detection path and the batch analytics path — what does each serve?
How does Dataplex fit into this design — what does it provide that BigQuery alone doesn’t?

Quick Reference: Google Search Analytics Pipeline

GCP Streaming Trinity: Pub/Sub (shock absorber) → Dataflow (stateful processing) → BigQuery (analytics warehouse). This is the answer for every streaming analytics problem at Google.
BigQuery cost hierarchy: require_partition_filter → date partitioning → clustering → pre-aggregation tables → materialized views → BI Engine. Apply ALL of these.
Anomaly detection: Dataflow 1-min tumbling windows → Z-score vs Bigtable 7-day baseline → Cloud Monitoring custom metric → Alerting policy. 5-minute SLA achievable.
Privacy-first: Raw query text NEVER enters the analytics pipeline. Hashes and ML-classified categories only. Column-level security via Dataplex Policy Tags. Minimum aggregation thresholds.
Cost math: Without optimization = $6,875 per full scan. With date partitioning + clustering + require_partition_filter = $0.05 per typical dashboard query. 99%+ cost reduction.
Dataplex: GCP-native governance — data quality rules, policy tags for PII, zone-based access control, Data Catalog for discovery. Mentioning it signals Google-specific depth.

Tomorrow’s Preview

Day 37: OpenAI Data Engineering & AI-Native Pipelines — How OpenAI thinks about data infrastructure differently from traditional tech companies. Training data pipelines at 100TB+ scale, RLHF data infrastructure, embedding pipelines, and what “AI-native data engineering” means when your product IS the AI model. How OpenAI interviews differ from FAANG and what they specifically probe.

Day 36: Design: Large-Scale Search Analytics Pipeline (Google Style)