Day 35 — Google Data Engineering & GCP Deep Dive

Phase 2: Company-Specific | Category: Google-Specific

The Google DE Interview Mindset

Per DataInterview.com: “From hundreds of mock interviews, one pattern stands out with Google DE candidates: they over-prepare for coding and under-prepare for the infrastructure half of the loop. BigQuery, Dataflow, and Pub/Sub fluency isn’t a bonus — it’s the core of what you’ll be evaluated on.”

Google is unique among your five targets: you’re designing with Google’s own tooling. At Meta, you’d reference their stack. At Netflix, you’d design on AWS with OSS. At Google, you’re expected to use GCP-native services as first-class design components — not as a garnish on top of a generic architecture. A candidate who defaults to Kafka + Spark + Postgres when the answer should be Pub/Sub + Dataflow + BigQuery will not pass a Google interview.

The mental model shift: compose managed services, don’t build your own.

BigQuery Internals: What Interviewers Actually Test

Most candidates know “BigQuery is serverless columnar.” Google interviewers want depth. Here’s what’s under the hood:

The Four-Layer Architecture

BigQuery
= Dremel (execution) + Colossus (storage) + Capacitor (format) + Jupiter (network)

Colossus (Luminousmen, Panoply):

Google’s distributed file system (successor to GFS)
Completely decouples storage from compute — the foundation of BigQuery’s serverless model
Metadata stored in Bigtable (high-performance NoSQL) — no single metadata bottleneck
Data blocks distributed across independent storage nodes with Reed-Solomon erasure coding
Geo-replication across data centers within a region automatically
Background processes (curators + custodians) continuously repair, rebalance, and optimize layout — invisible to users

Capacitor (Google Cloud Blog, Luminousmen):

BigQuery’s proprietary columnar format (NOT Parquet or ORC)
Each column stored as independently addressable chunks with per-chunk metadata (zone maps: min/max, null counts, row count)
Zone maps enable chunk-level skipping — skips entire chunks without reading them if predicate can’t match
Automatic compression: dictionary encoding, RLE, delta encoding, bit-packing — adapts automatically per column
Typical compression: 5-15x (better than Parquet’s typical 3-8x)
Nested + repeated fields stored as separate column streams without flattening — enables scanning events.type without reading events.ts
Background optimization: continuously rewrites storage when it finds improvement opportunities, atomically swaps in new optimized layout without disrupting running queries

Dremel:

BigQuery’s distributed query engine — the compute layer
Tree architecture: root server → intermediate servers → leaf nodes
Root server receives SQL query, decomposes into parallel tasks
Leaf nodes (potentially thousands) read from Colossus in parallel
Results aggregated up the tree back to the root
Each query dynamically allocates “slots” (units of compute) — on-demand: auto-scales to query complexity; flat-rate: reserved slots for predictable performance

Jupiter:

Google’s internal network fabric — 1 Petabit/sec bisection bandwidth
Enables compute and storage to be truly separate (data movement between Dremel workers and Colossus storage is over Jupiter, not a bottleneck)
This is why BigQuery can scale to thousands of workers on a single query — Jupiter handles the data movement

What to say in interviews: “BigQuery’s performance comes from four internal layers working together. Capacitor stores data with per-chunk zone maps that enable predicate pushdown at the storage layer — chunks that can’t satisfy the WHERE clause are skipped before any data is sent to Dremel. Jupiter provides the high-bandwidth network fabric that makes compute-storage separation non-latent. And Dremel dynamically allocates thousands of leaf workers for a single query. When I choose BigQuery, I’m choosing this entire stack — not just a SQL interface.”

The GCP Data Services Map

Know when to choose each service:

ServiceWhat It IsWhen to UseBigQueryServerless columnar warehouseAnalytics, ad-hoc SQL, batch ELT transforms, ML features (via BigQuery ML)Pub/SubManaged message queue / event streamingEvent ingestion, fan-out messaging, decoupling producers from consumers. Global, auto-scales, at-least-once deliveryDataflowManaged Apache Beam executionUnified batch + streaming ETL. Auto-scales workers. Exactly-once optional. Stateful processing.DataprocManaged Spark/Hadoop clustersExisting Spark code, custom libraries, ML training with Spark MLlib, cheaper than Dataflow for batch at scaleCloud Storage (GCS)Object storageRaw data lake (Bronze/Silver), Parquet/ORC/Avro files, BigQuery external tables, Dataflow stagingBigtableManaged wide-column NoSQL (HBase-compatible)Low-latency serving (< 10ms), time-series, IoT, feature store online serving, high-write throughputCloud SpannerGlobally distributed SQL with strong consistencyFinancial transactions, inventory counts, global ACID at scaleFirestoreManaged document databaseMobile/web app data, real-time sync to clients, small-to-medium scaleMemorystoreManaged Redis/ValkeyCaching layer, session storage, leaderboardsDataplexUnified data governance + qualityData catalog, lineage, quality checks, zone-based lake managementVertex AIML platformModel training, feature store, ML pipelines, model registry, online + batch predictionCloud ComposerManaged Apache AirflowOrchestration of complex multi-step pipelines across GCP services

BigQuery Service Selection: When NOT to Use It

Google interviewers specifically test whether you know BigQuery’s limitations:

ScenarioDon’t Use BigQueryUse InsteadLow-latency serving (< 10ms)❌ BigQuery min latency ~300msBigtable, Memorystore (Redis)High-frequency row updates❌ DML is expensive in BigQueryCloud Spanner, BigtableTransactional OLTP❌ Not designed for row-at-a-time opsCloud Spanner, PostgreSQL on Cloud SQLReal-time feature serving❌ Too slow for inference pathBigtable + MemorystoreStreaming with < 1 sec latency⚠️ Streaming insert API works but costs morePub/Sub → Dataflow → Bigtable for serving

BigQuery Design Best Practices (Interview Essentials)

Partitioning in BigQuery

Date/timestamp partitioning (most common):

CREATE
TABLE events.user_clicks  PARTITION BY DATE(event_timestamp)  OPTIONS (    partition_expiration_days = 365,   -- auto-delete partitions after 1 year    require_partition_filter = TRUE    -- prevent accidental full-table scans (VERY IMPORTANT)  )  AS SELECT ...

require_partition_filter = TRUE is a cost safeguard — queries without a partition filter fail rather than scan the entire table. Always use for large partitioned tables.

Integer range partitioning: For tables without a date column — partition by user_id ranges (0-999M, 1B-1.999B, etc.)

Ingestion time partitioning: When event_timestamp isn’t available. Uses _PARTITIONTIME pseudo-column. Less precise but zero schema changes required.

Clustering in BigQuery

Clustering physically sorts data within partitions by up to 4 columns. Block-level min/max pruning skips blocks where the predicate can’t match.

CREATE
TABLE events.user_clicks  PARTITION BY DATE(event_timestamp)  CLUSTER BY user_id, event_type, country    -- ordered by importance (first column gets most benefit)  AS SELECT ...

Partition vs Clustering decision:

NeedUseFilter on time range (almost every analytical query)Partition by dateFilter on high-cardinality columns (user_id, campaign_id)ClusterBoth time and dimension filteringPartition by date + Cluster by dimensionFilter with exact value (WHERE campaign_id = ‘X’)Cluster (can skip non-matching blocks)Filter with range (WHERE spend > 1000)Cluster still helps (block min/max pruning)

Cost estimation: Unlike partitioned tables where you know bytes scanned before running, clustered tables can only estimate after query execution. Mention this trade-off: “Clustering gives better performance but no upfront cost guarantee — I’d use it for read-heavy tables where query patterns are well-understood, and combine with partition pruning for cost control.”

BigQuery MERGE for Incremental Loads

--
Idempotent incremental load: upsert today's data  MERGE INTO gold.fact_orders T  USING staging.fact_orders_today S  ON T.order_id = S.order_id     AND T.order_date = S.order_date   -- include partition key in join for performance  WHEN MATCHED THEN    UPDATE SET T.status = S.status, T.updated_at = S.updated_at  WHEN NOT MATCHED THEN    INSERT (order_id, order_date, customer_id, amount, status)    VALUES (S.order_id, S.order_date, S.customer_id, S.amount, S.status);

Always include the partition column in the MERGE ON clause — this tells BigQuery which partitions to scan for matches rather than scanning the full table.

Dataflow / Apache Beam: The Unified Pipeline Engine

Dataflow is Google’s managed execution of Apache Beam pipelines. The key insight: same code runs as batch or streaming — just change the runner configuration.

Beam Programming Model

on

Beam pipeline — works for both batch and streaming  with beam.Pipeline(options=pipeline_options) as p:      (p       | 'Read from Pub/Sub' >> beam.io.ReadFromPubSub(topic='projects/proj/topics/events')       | 'Parse JSON' >> beam.Map(parse_event)       | 'Filter valid' >> beam.Filter(is_valid)       | 'Add session window' >> beam.WindowInto(             Sessions(gap_size=30 * 60),  # 30-minute session gap             trigger=AfterWatermark(late=AfterCount(1)),             accumulation_mode=AccumulationMode.DISCARDING         )       | 'Aggregate metrics' >> beam.CombinePerKey(sum)       | 'Write to BigQuery' >> beam.io.WriteToBigQuery(             'project:dataset.table',             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND         )      )

Dataflow vs Dataproc — When to choose each:

Factor	Dataflow	Dataproc
Streaming workloads	✅ First choice	❌ Spark Streaming is secondary
Serverless/zero-ops	✅ Fully managed, auto-scale	❌ Cluster management required
Existing Spark code	❌ Requires rewrite to Beam	✅ Lift-and-shift
Cost for large batch	⚠️ Can be expensive vs Dataproc	✅ Often cheaper for pure batch
Complex ML with Spark MLlib	❌ Beam doesn’t have ML APIs	✅ Spark MLlib native
Exactly-once semantics	✅ Native support	❌ Spark at-least-once
Unified batch + streaming	✅ Same code, different runner	❌ Different APIs

What to say in Google interviews: “I default to Dataflow for new pipelines because the unified batch+streaming model reduces long-term code maintenance — I write it once. For a pipeline that only runs batch and already has significant Spark code investment, Dataproc is more cost-effective for the migration path. But for new streaming work, Dataflow’s exactly-once guarantees and serverless scaling are worth the premium.”

The Standard GCP Data Pipeline Architecture

This is what every Google interview expects you to draw:

Event
Sources (apps, mobile, IoT, backend services)      ↓ HTTP/SDK  Cloud Pub/Sub (event ingestion, global, auto-scale)      ├── Path 1: Real-time (Dataflow streaming job)      │     ↓ Beam streaming pipeline      │   Dataflow (dedup, enrich, aggregate)      │     ├── Bigtable (low-latency serving, < 10ms)      │     └── BigQuery Streaming Insert (real-time queryable)      └── Path 2: Batch (Cloud Storage landing)            ↓ Cloud Storage (GCS)          Dataflow or Dataproc (batch ETL)            ↓          BigQuery (warehouse — Gold layer)            ↓ Cloud Composer (Airflow) orchestrates all jobs  Analytics Layer:    BigQuery → Looker / Looker Studio (dashboards)    BigQuery → Vertex AI (ML training, feature store)    BigQuery → BigQuery ML (in-place model training with SQL)    Bigtable → Online serving APIs (recommendation, personalization)

Cost Optimization at Google Scale

Google interviewers probe cost awareness extensively per LinkedIn: “How did you optimize BigQuery cost?” is a near-universal question.

BigQuery cost model:

On-demand: $5/TB scanned (only pay for bytes read)
Flat-rate: Reserve slots for predictable workloads ($0 per query, fixed slot cost)
Storage: $0.02/GB/month (active) → $0.01/GB/month (long-term, after 90 days, automatic)

Cost optimization checklist for interviews:

Optimization	Impact	How
Partition pruning	High	Always filter on partition column. `require_partition_filter = TRUE`.
Clustering	Medium	Cluster on most-queried filter columns. Reduces scan within partitions.
Column selection	High	Never `SELECT *`. Select only needed columns — columnar storage makes this free performance.
Materialized views	High	Pre-compute expensive aggregations. Auto-refreshed. Queries rewritten automatically.
Approximate aggregation	Medium	`APPROX_COUNT_DISTINCT` instead of `COUNT(DISTINCT)` for large-scale analytics — ~2% error, much cheaper.
Table expiration	Medium	Set `partition_expiration_days` for operational tables that don’t need long retention.
Slot reservations	High (at scale)	For predictable workloads (daily batch runs), flat-rate slots cheaper than on-demand at high query volume.
External tables	Situational	Query cold data directly in GCS without loading into BigQuery — no storage cost, but slower queries.

Scenario-based cost answer: “A Looker dashboard was scanning 20 TB per query because it joined a massive fact table to a dimension without partition filters. I added require_partition_filter = TRUE to the fact table, updated the Looker Explore to always include a date filter, and added a materialized view for the 3 most common aggregations. Query cost dropped from $100/query to $0.02/query — a 5,000x reduction. The dashboard load time dropped from 45 seconds to 2 seconds.”

IAM and Security: The Production-Readiness Test

Google interviewers specifically test IAM design per DataInterview.com:

Principle of least privilege for Dataflow jobs:

Dataflow
service account needs:    roles/pubsub.subscriber          → on the specific input subscription    roles/bigquery.dataEditor        → on the target dataset (not project)    roles/bigquery.jobUser           → on the project (to run BQ jobs)    roles/storage.objectViewer       → on the specific GCS bucket (if reading staged files)  NEVER: roles/editor or roles/owner on the project  NEVER: user credentials embedded in code (use service accounts)

IAM for BigQuery datasets:

Dataset-level
:    roles/bigquery.dataViewer   → analysts can read    roles/bigquery.dataEditor   → ETL service accounts can write  Row-level security:    Row Access Policies filter data per user group    "SELECT * FROM gold.sales WHERE region = SESSION_USER_REGION()"  Column-level security:    Data Catalog Policy Tags on sensitive columns    Only users with tag-reader permission can see PII columns

Interview Questions

Q1: “Design a GCP-native real-time click analytics pipeline for Google Ads. Advertisers need dashboards refreshing every 60 seconds, plus historical reporting going back 3 years.”

Model Answer: “I’ll design this as a dual-path GCP-native pipeline.

Ingestion: Ad click events published to Pub/Sub. Each click event carries: ad_id, campaign_id, advertiser_id, user_id_hash, device_type, geo, timestamp, placement. Pub/Sub handles global scale and auto-scales to peak load during ad campaigns.

Real-time path (Dataflow streaming → BigQuery Streaming Insert):Dataflow Beam pipeline: Read from Pub/Sub → parse and validate → deduplicate by click_id in stateful Beam state (1-hour window) → 1-minute tumbling window aggregations (click_count, impression_count, CTR per campaign) → Write to analytics.realtime_campaign_metrics via BigQuery Streaming Insert. This table is queryable within seconds, serves the 60-second dashboard refresh.

Batch path (GCS → Dataflow/Dataproc → BigQuery):Pub/Sub also fans out to a Cloud Storage Pub/Sub subscription via a Dataflow job that writes raw events to GCS in Parquet format, partitioned by date+hour. Daily Cloud Composer (Airflow) DAG triggers a Dataflow batch job: read from GCS, apply business logic (attribution, fraud filtering), join with dim_campaign and dim_advertiser, write to gold.fact_ad_clicks in BigQuery. This serves the historical 3-year reporting.

BigQuery table design:

CREATE
TABLE gold.fact_ad_clicks  PARTITION BY DATE(click_date)  CLUSTER BY advertiser_id, campaign_id  OPTIONS (require_partition_filter = TRUE)

Partition by date (all queries filter by date range). Cluster by advertiser_id + campaign_id — the dominant filter pattern for advertiser-facing reports.

Cost control: require_partition_filter = TRUE prevents accidental full scans. Materialized view for top 50 most common campaign-level aggregations. On-demand pricing for ad-hoc; flat-rate slots for the nightly batch jobs.

Serving: Looker connects to BigQuery Gold layer for historical reports. Real-time dashboard queries analytics.realtime_campaign_metrics — sub-second since it’s a small aggregated table refreshed every minute.

Security: Dataflow service account has minimum IAM — Pub/Sub Subscriber on the specific subscription, BigQuery Data Editor on analytics and gold datasets, BigQuery Job User at project level. Advertiser data access controlled via BigQuery row-level access policies: each advertiser’s service account can only read rows where advertiser_id = their_id.”

Q2: “An analyst says a BigQuery query on a 10 TB table takes 3 minutes and costs $50 per run. It runs 200 times per day. How do you fix this?”

Model Answer: “200 runs × $50 = $10,000/day. That’s a $3.6M/year query — definitely worth fixing. My diagnosis and fix sequence:

First, check the query: is it doing SELECT *? Is it missing a partition filter? Run INFORMATION_SCHEMA.JOBS_BY_PROJECT to see bytes billed and the actual query.

If no partition filter: add WHERE date_column BETWEEN ’…’ AND ’…’ to the query. If the table is partitioned by date, this alone reduces scan from 10 TB to maybe 50 GB (200x reduction). Cost: $0.25 per run → $50/day.

If the table isn’t partitioned: run CREATE OR REPLACE TABLE [table] PARTITION BY DATE(timestamp_col) CLUSTER BY [frequent_filter_cols] AS SELECT * FROM [old_table]. Then add require_partition_filter = TRUE. This is the highest-impact change.

If it’s already partitioned but still slow: check if the WHERE clause is using the partition column correctly — functions on the partition column (e.g., DATE_TRUNC(event_timestamp, MONTH)) can defeat partition pruning. Rewrite as WHERE event_timestamp >= ‘2026-04-01’ AND event_timestamp < ‘2026-05-01’.

For the dashboard use case (same query 200 times/day): create a BigQuery materialized view for the exact aggregation this query computes. Cost of building the materialized view is O(new data since last refresh), not O(full table). The dashboard reads from the materialized view in < 1 second at effectively $0 per query.

For the ad-hoc analyst use case: consider flat-rate slot reservations. 200 queries/day × 3 min each = 10 hours of query time/day. Depending on slot intensity, flat-rate reservations may be cheaper than $50/query on-demand.”

Think About This

You’re in a Google interview. The prompt: “You need to build a data pipeline that ingests audit logs from all Google Cloud services (BigQuery job logs, Dataflow job logs, GCS access logs) across 50 GCP projects. The logs must be analyzed for compliance and cost anomaly detection, and an alert should fire within 5 minutes of an anomaly.”

Walk through:

How do you collect logs from 50 projects? (Cloud Logging log sinks — aggregate sink from the organization level routes all logs to a central Pub/Sub topic. No per-project configuration needed.)
What processes the logs in real-time? (Dataflow Beam streaming pipeline: Pub/Sub → parse log entries → 5-minute tumbling window → anomaly detection (Z-score on cost metrics) → Cloud Monitoring custom metrics → Alerting policy fires within 5 minutes)
Where does the data land for historical compliance queries? (BigQuery with date partitioning, clustered by project_id and service_name. Compliance team queries 90+ day history via Looker.)
How do you handle the schema diversity of different log types? (BigQuery JSON column for raw log payload. Extracted structured fields in typed columns. Schema-on-read for the payload, schema-on-write for the important fields.)
What IAM design? (Organization-level log sink service account with Pub/Sub Publisher only. Dataflow service account with Pub/Sub Subscriber + BigQuery Data Editor on the compliance dataset only.)

Quick Reference: Google GCP-Specific

BigQuery internals: Capacitor (columnar format, zone maps, auto-compression) + Colossus (distributed storage, geo-replicated) + Dremel (tree execution engine, dynamic slot allocation) + Jupiter (network fabric enabling compute-storage separation)
Design principle: Compose managed services. Pub/Sub → Dataflow → BigQuery is the canonical streaming ETL. Don’t propose Kafka + Spark + Postgres when the GCP-native answer is simpler and zero-ops.
Dataflow vs Dataproc: Dataflow for streaming (first choice) and new unified batch+stream code. Dataproc for existing Spark code, cheaper pure-batch at scale, Spark MLlib.
BigQuery partitioning + clustering: Always partition by date for time-series. Cluster on top filter columns within partitions. require_partition_filter = TRUE prevents cost overruns.
Cost optimization: Column selection, partition pruning, materialized views, APPROX_COUNT_DISTINCT, flat-rate slots for predictable batch workloads.
IAM: Service accounts with minimum permissions. Never project-level Editor. Pub/Sub Subscriber on specific subscription + BigQuery Data Editor on specific dataset + BigQuery Job User at project level.
The interview differentiator: Know when NOT to use BigQuery (low-latency serving → Bigtable, OLTP → Spanner, real-time feature serving → Bigtable + Memorystore). Generic “use BigQuery for everything” fails the interview.

Tomorrow’s Preview

Day 36: Design: Large-Scale Search Analytics Pipeline (Google Style) — The full system design using GCP-native services. Web-scale event processing, BigQuery for analytics warehouse, Dataflow for streaming ETL, Dataplex for data quality, and the cost management approach Google interviewers specifically evaluate. This is the practice version of the Google system design round.

Day 35: Google Data Engineering & GCP Deep Dive