Day 23 — Change data capture (CDC)

Phase 1: Foundations & Frameworks | Category: Orchestration

Why CDC Is Central to Senior DE Interviews

CDC is the connective tissue between your operational systems and your data platform. Every real-time analytics pipeline, every data lakehouse that reflects production state, every search index that stays in sync — all of it depends on CDC working correctly. As DataVidhya puts it: “The failure mode with log-based CDC is schema evolution — when the source adds a column, your downstream pipeline breaks unless you have schema registry integration and forward-compatible Avro/Protobuf schemas.” At your target companies, CDC questions test whether you understand the full operational picture: not just how to set it up, but how to handle the hard cases that break naive implementations.

The Three CDC Approaches

1. Log-Based CDC (The Production Default)

Reads the database’s internal transaction log — the same log the database uses for crash recovery. Every insert, update, and delete is captured as it happens.

Source DB (PostgreSQL WAL / MySQL binlog / Oracle redo log)
    ↓ read by Debezium connector
Kafka Connect cluster
    ↓ publishes change events as Avro/JSON messages
Kafka topic (one per source table)
    ↓ consumed by
Flink / Spark / Kafka Streams → warehouse / data lake / search index

What a Debezium change event looks like:

{
  "before": {
    // Row state BEFORE the change
    "order_id": 42,
    "status": "pending",
    "amount": 99.99
  },
  "after": {
    // Row state AFTER the change
    "order_id": 42,
    "status": "confirmed",
    "amount": 99.99
  },
  // u=update, c=create, d=delete, r=read(snapshot)
  "op": "u",
  // Transaction timestamp (ms since epoch)
  "ts_ms": 1744300800000,
  "source": {
    "db": "orders_db",
    "table": "orders",
    // PostgreSQL Log Sequence Number — the ordering signal
    "lsn": 23502984,
    // Transaction ID
    "txId": 1234
  }
}

Five advantages over query-based:

Captures ALL changes including hard deletes (query-based can’t)
Captures intermediate states — if a row changes 3 times between polls, you see all 3 changes
Near-zero load on source database (reads the log sequentially, no SELECT queries)
Sub-second latency (no polling interval)
Can resume from exact position after downtime (log sequence number = no missed events)

Key tools:

Debezium: Open-source, Kafka Connect-based. Connectors for PostgreSQL, MySQL, MongoDB, SQL Server, Oracle, Cassandra. The industry standard.
AWS DMS (Database Migration Service): Managed CDC for AWS-native deployments. Supports Oracle, SQL Server, PostgreSQL, MySQL. Less flexible than Debezium but zero ops overhead.
Google Datastream: GCP-native CDC. PostgreSQL, MySQL, Oracle → BigQuery, GCS.
Fivetran / Airbyte: SaaS connectors that use log-based CDC under the hood.

2. Query-Based CDC (Polling)

Periodically runs SQL queries against the source, using a high-water mark column to find changed rows.

-- Run every 5 minutes
SELECT *
FROM orders
WHERE updated_at > :last_checkpoint
   OR created_at > :last_checkpoint;
-- Save MAX(updated_at) as new checkpoint

Advantages: Simple, no database permissions beyond read access, universal (any database with a timestamp column).

Critical limitations:

Cannot capture hard deletes — once a row is deleted, it’s gone. No timestamp to filter on.
Misses intermediate changes — if a row is inserted and deleted between two polls, it’s never captured.
Adds load to source DB — periodic full-table or range scans compete with production traffic.
Requires source cooperation — tables must have reliable updated_at columns. Many legacy systems don’t.
Latency — minimum = polling interval (typically 1-5 minutes).

When query-based still makes sense:

< 1,000 writes/sec with 1-5 minute lag tolerance
No database-level log access (managed databases, strict security policies)
Slowly-changing dimension tables (reference data, user profiles — low write volume)
Simple implementation is more important than completeness

The crossover point: At ~2,000-5,000 writes/sec, query-based starts overloading the OLTP database. At 5,000+ writes/sec, log-based CDC is mandatory.

Decision Framework

Do you need to capture hard deletes?
├── YES → Log-based CDC only (query-based cannot)
└── NO → Continue...  Write rate?
├── < 1K writes/sec + 1-5 min lag acceptable → Query-based OK
├── 1K-5K writes/sec → Evaluate: query-based may cause DB load issues
└── > 5K writes/sec → Log-based CDC required  Source system access?
├── Can read transaction logs → Log-based (Debezium/DMS)
└── Read-only SQL access only → Query-based  Need intermediate state history (all changes, not just latest)?
├── YES → Log-based CDC only
└── NO → Either works  Latency requirement?
├── < 1 minute → Log-based CDC
└── Minutes acceptable → Query-based may be sufficient

The Full CDC Architecture: Debezium + Kafka + Data Lake

This is the architecture you draw in interviews:

PostgreSQL (OLTP)
    ↓ WAL logical replication
Debezium Connector (Kafka Connect)
    ↓ change events (Avro + Schema Registry)
Kafka topic: "orders.public.orders"
├── Consumer 1: Flink → MERGE INTO silver.orders (Iceberg/Delta)
├── Consumer 2: Flink → Elasticsearch index update
├── Consumer 3: Flink → Redis cache invalidation
└── Consumer 4: Spark Structured Streaming → hourly warehouse batch

Topic naming convention: Debezium uses .. — e.g., mypostgres.public.orders. Each source table gets its own topic.

Kafka message key: The row’s primary key. This guarantees that all changes to a specific row land in the same partition, preserving order per row (though not across rows).

Initial snapshot: Debezium can take an initial snapshot of the current table state before switching to log-based capture. This bootstraps your downstream system with the baseline. Critical: the snapshot must be consistent — Debezium reads in a transaction to get a point-in-time view.

The Hard Problems: What Interviewers Probe

1. Handling Deletes

Log-based CDC captures deletes as tombstone events (op: “d”):

{
  "before": { "order_id": 42, "status": "cancelled" },
  // null = deleted row
  "after": null,
  "op": "d"
}

How to propagate deletes to the warehouse:

Option A: Hard delete

-- Delta Lake / Iceberg MERGE handles the delete case
MERGE INTO silver.orders t
USING kafka_changes s ON t.order_id = s.order_id
WHEN MATCHED AND s.op = 'd' THEN DELETE
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED AND s.op != 'd' THEN INSERT *

Option B: Soft delete (better for audit/SCD2)

-- Add is_deleted flag rather than physical delete  WHEN MATCHED AND s.op = 'd' THEN UPDATE SET is_deleted = true, deleted_at = s.ts_ms

Soft delete is almost always the right choice for data warehouses — physical deletes destroy history that may be needed for analytics, auditing, or model training.

Soft deletes in the source: Many applications implement deletes by setting an is_deleted = true flag rather than physical deletion. This is an UPDATE event in CDC, not a DELETE. Make sure your CDC event filter includes the is_deleted column, or you’ll silently miss these “deletes.”

2. Ordering Guarantees

Within a single row: Guaranteed. Debezium uses the Log Sequence Number (LSN in PostgreSQL, position in MySQL binlog) to preserve exact commit order. The Kafka message key (primary key) ensures all changes to a row land in the same partition, maintaining per-row order.

Across rows in the same transaction: Debezium preserves transaction boundaries — it emits a BEGIN marker, all row changes, and a COMMIT marker. Downstream consumers can use these to apply changes atomically if needed.

Across different tables: NOT guaranteed in default Kafka setup. Changes to orders and order_items in the same transaction may land in different topics and be processed in different orders. Solution: use transactional outbox pattern or ensure consumers handle eventual consistency.

The interview answer: “I’d partition Kafka topics by the primary key of the source table. This guarantees ordering for all changes to a specific entity. For cross-table ordering (e.g., order header and order lines in the same transaction), I’d either use Debezium’s transaction metadata topic, or accept eventual consistency with idempotent MERGE operations that tolerate out-of-order application.”

3. Schema Changes in the Source

This is the most operationally painful part of CDC:

What happens when a DBA adds a column to the source table:

Debezium captures the DDL change and updates the Avro schema in the Schema Registry
New events include the new column
Old consumers using the old schema…
- With BACKWARD compatibility and Avro defaults → continue working (new column ignored or defaults to null)
- Without compatibility rules → deserialization failure → pipeline stops

Defense strategy (three layers):

Layer 1: Schema Registry with BACKWARD_TRANSITIVE compatibility

New column added with default null → BACKWARD compatible  Producer schema updated in registry → consumers still work

Layer 2: Transformations that don’t assume fixed columns

# BAD: breaks if source adds a column you weren't expecting
spark.sql("SELECT order_id, customer_id, amount FROM raw_orders")

# GOOD: explicit column selection, ignores new columns safely
columns_of_interest = ["order_id", "customer_id", "amount"]
df.select([col for col in columns_of_interest if col in df.columns])

Layer 3: Warehouse schema evolution with mergeSchema

# Delta Lake: automatically adds new columns from source
df.write.option("mergeSchema", "true").mode("append").saveAsTable("silver.orders")

Breaking changes (column rename, type change, column drop): These require coordinated migration. The schema registry blocks them with BACKWARD_TRANSITIVE. When a breaking change is genuinely necessary, follow the expand-contract pattern (Day 21).

4. Large Transactions

A single database transaction that modifies 10 million rows produces 10 million CDC events that must all be processed before downstream state is consistent.

Strategies:

Buffer events in Kafka (Kafka’s retention absorbs the burst)
Apply downstream using micro-batches rather than row-at-a-time
Monitor consumer lag — if lag spikes during a large transaction, it’s expected to drain
For extremely large batch operations, coordinate: pause CDC, apply the batch, restart CDC

5. Monitoring CDC Pipelines

Three key metrics to always mention:

Topic	Details
Replication lag	now() - timestamp of latest CDC event processed> 5 minutes for tier-1 tables
Consumer lag	Kafka consumer group offset delta (latest - consumed) Growing continuously = sustained throughput gap
Row count reconciliation	COUNT(source table) vs COUNT(destination table) hourly> 0.1% discrepancy

Use Cases: When to Propose CDC in an Interview

ScenarioCDC SolutionReal-time data sync (OLTP → warehouse)Log-based CDC → Kafka → MERGE INTO Iceberg/DeltaSearch index freshnessLog-based CDC → Kafka → Elasticsearch updateCache invalidationLog-based CDC → Kafka → Redis cache evictionSCD2 trackingCDC before/after fields → derive which attributes changed → apply SCD2 logicGDPR deletion propagationCDC delete events → cascade to all downstream tablesMicroservice event sourcingCDC as the event bus — DB change = event to downstream servicesData lake freshnessReplace nightly batch with continuous CDC → near-real-time bronze layer

Interview Questions

Q1: “Design a CDC pipeline to sync an e-commerce PostgreSQL orders database to a data warehouse with under 5-minute latency.”

Model Answer: “I’d use log-based CDC with Debezium reading PostgreSQL’s WAL via logical replication. Debezium publishes change events to Kafka — one topic per table, primary key as the Kafka message key for per-row ordering. Events are serialized as Avro with Confluent Schema Registry enforcing BACKWARD_TRANSITIVE compatibility — protecting against source schema changes breaking the pipeline. A Flink consumer reads from the Kafka topics and applies upserts to the warehouse using MERGE INTO on Delta Lake or the warehouse’s native MERGE. The MERGE key is the order’s primary key plus the event timestamp to handle out-of-order events. For deletes, I’d apply soft deletes in the warehouse (set is_deleted = true) rather than physical deletes, preserving analytical history. For initial bootstrap, Debezium takes a consistent snapshot of the current order table, then seamlessly switches to log-based streaming. Monitoring: replication lag alert if now() - last_event_timestamp > 5 minutes, consumer lag trending up, and hourly row count reconciliation between source and warehouse. For the sub-5-minute SLA: Kafka acts as the buffer, Flink processes events within seconds, so end-to-end latency is typically 10-30 seconds — well within the SLA.”

Q2: “Your CDC pipeline captures changes from a high-volume MySQL database. The DBA announces they’ll rename a column from amount to order_total next week. What do you do?”

Model Answer: “A column rename is a breaking change — not compatible with existing consumers. I’d treat this as an emergency requiring coordination. Immediate steps: First, alert all downstream consumers of the impending breaking change. Second, work with the DBA to implement the expand-contract pattern: have MySQL add order_total as a NEW column populated alongside the existing amount for a transition period — no rename yet. Third, update all downstream consumers to read order_total instead of amount. Fourth, validate each consumer has migrated using the Schema Registry’s consumer tracking. Fifth, after all consumers are confirmed migrated, the DBA removes the old amount column — which IS backward compatible since all consumers have switched. The Schema Registry with BACKWARD_TRANSITIVE would have blocked a direct rename from happening, which is exactly why we have it. At layer 2, I’d also audit downstream transformations for any SELECT * patterns that might have hidden column name dependencies — this is a good time to enforce explicit column selection across all CDC consumers.”

Think About This

You’re in a Netflix interview. The prompt: “Netflix’s recommendation engine requires near-real-time updates when users add titles to their watchlist, rate content, or complete viewing sessions. How would you design the data pipeline from the OLTP database to the recommendation feature store?”

Walk through:

What capture method? (Log-based CDC — deletes matter when a user removes something from their watchlist, and sub-second latency is required for recommendations to feel responsive.)
What does the event flow look like? (PostgreSQL WAL → Debezium → Kafka topics for watchlist, ratings, viewing_sessions → Flink enrichment jobs → feature store in Redis for online serving, Iceberg for offline training.)
How do you handle ordering? (Partition Kafka by user_id — all events for a user in the same partition, guaranteed ordering per user.)
What about a user completing a 2-hour movie — is there a large transaction concern? (No — each viewing event is a row update, not a bulk transaction. Large transaction concern applies to bulk operations like a new feature launch that pre-populates data for all users.)
How do you monitor freshness? (Kafka consumer lag metric, plus an end-to-end latency probe: inject test events with timestamps, measure time from source update to feature store update. Alert if > 30 seconds.)

Quick Reference

Log-based CDC = read transaction log (WAL/binlog). Captures ALL changes including deletes. Sub-second latency. Near-zero source load. Requires log access permissions. The default for production.
Query-based CDC = poll with updated_at timestamp. Simple. Cannot capture deletes. Adds load to source. Latency = polling interval. Use only for low-volume, delete-free tables.
Debezium: Open-source Kafka Connect-based CDC. Connectors for all major databases. Industry standard.
Kafka topic key = source primary key: Guarantees per-row ordering. All changes to a specific row land in the same partition.
Three hard problems: (1) Deletes → tombstone events, apply as soft deletes in warehouse. (2) Schema changes → Schema Registry with BACKWARD_TRANSITIVE + mergeSchema in lakehouse. (3) Ordering → Kafka partitioning by PK ensures per-entity order.
Three monitoring metrics: Replication lag, consumer lag, row count reconciliation.
Decision crossover: Log-based when > 5K writes/sec, deletes needed, < 1 min latency, or intermediate state history required.

Tomorrow’s Preview

Day 24: Design: Real-Time Analytics Dashboard — The first full end-to-end system design practice session. End-to-end: event ingestion → stream processing → aggregation → serving layer → dashboard. You’ll apply everything from Days 1-23 in a single connected design — calibrated for how your target companies structure system design interviews.

Day 23: Change data capture (CDC)