Phase 1: Foundations & Frameworks | Category: Data Quality & Governance
The Silent Killer of Data Pipelines
Schema drift — when a producer changes data structure without telling consumers — is one of the most common causes of production pipeline failures. A developer renames a column in an upstream service. No announcement. Your Flink job silently starts producing NULLs. Three days later, an analyst finds the revenue dashboard is $0. As Soda.io puts it: “Data contracts did not emerge because the industry fell in love with governance. They emerged because modern data platforms can no longer scale on implicit assumptions and scattered validations.” At your target companies, where hundreds of teams produce and consume data independently, schema evolution and data contracts are the engineering discipline that keeps everything from breaking.
Schema Evolution: The Core Problem
A schema is a contract about data structure. The challenge: systems evolve. Producers add fields, remove fields, rename columns. The question is: how do you evolve schemas without breaking consumers?
Three axes of compatibility, each with a regular and transitive variant:
Backward Compatibility
Definition: A NEW schema can read data written by the OLD schema.
Old producer writes: { "user_id": 123, "name": "Alice" }
New schema adds: { "user_id": int, "name": string, "email": string (default: null) }
New consumer reads old data → works because email defaults to null
Producers don’t need to upgrade in lockstep. Consumers move to the new schema first, and data written with the old schema is still readable.
Safe backward-compatible changes:
- Add a new optional field with a default value ✅
- Delete a field (new readers simply don’t get the field) ✅
- Widen a numeric type (int → long → float → double) ✅ (Avro)
Breaking changes (NOT backward compatible):
- Add a required field with no default ❌ (old data doesn’t have it → deserialization fails)
- Change a field’s data type incompatibly (string → int) ❌
- Rename a field without an alias ❌
BACKWARD_TRANSITIVE: New schema can read data from ALL previous versions, not just the last one. Critical when you store historical data (Iceberg, S3) that consumers may replay.
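To make the backward rule concrete, here is a minimal sketch of how a reader might project stored data onto its own schema. It is a simplified model, not Avro's actual resolution algorithm; `REQUIRED` and `read_with_schema` are hypothetical names introduced for illustration.

```python
from typing import Any

# Sentinel meaning "this field has no default", i.e. it is required.
REQUIRED = object()


def read_with_schema(record: dict[str, Any], reader_fields: dict[str, Any]) -> dict[str, Any]:
    """Project a stored record onto the reader's fields, filling in defaults."""
    result = {}
    for name, default in reader_fields.items():
        if name in record:
            result[name] = record[name]
        elif default is not REQUIRED:
            result[name] = default
        else:
            raise ValueError(f"missing required field: {name}")
    # Fields present in the record but absent from the reader are dropped.
    return result


old_record = {"user_id": 123, "name": "Alice"}

# New schema adds an optional email with a default: old data stays readable.
new_schema = {"user_id": REQUIRED, "name": REQUIRED, "email": None}
print(read_with_schema(old_record, new_schema))

# New schema adds a required field with no default: old data is rejected.
breaking_schema = {"user_id": REQUIRED, "name": REQUIRED, "email": REQUIRED}
try:
    read_with_schema(old_record, breaking_schema)
except ValueError as err:
    print(err)
```

Under BACKWARD_TRANSITIVE, the same check would have to pass for records written with every earlier schema version, not only the most recent one.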
Forward Compatibility
Definition: An OLD schema can read data written by the NEW schema.
New producer writes: { "user_id": 123, "name": "Alice", "email": "alice@example.com" }
Old consumer reads: ignores unknown "email" field → works
Old consumers don’t need to update. You deploy new producers; old consumers continue processing.
Safe forward-compatible changes:
- Add a new field (old consumers ignore it) ✅
- Delete an optional field with a default ✅
Breaking changes (NOT forward compatible):
- Remove a field that old consumers require ❌
- Change a field type that old consumers can’t handle ❌
Full Compatibility
Definition: Both backward AND forward compatible. Old schemas read new data; new schemas read old data.
Safe full-compatible changes:
- Add a field with a default value ✅ (backward: new schema reads old data using default; forward: old schema ignores new field)
- Remove a field with a default value ✅
This is the most restrictive mode but the safest for independent producer/consumer deployments. Fields must always have defaults.
Compatibility Summary Matrix
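| Change | Backward | Forward | Full |
|---|---|---|---|
| Add optional field with default | ✅ | ✅ | ✅ |
| Delete optional field with default | ✅ | ✅ | ✅ |
| Add required field (no default) | ❌ | ✅ | ❌ |
| Delete required field | ✅ | ❌ | ❌ |
| Rename field (no alias) | ❌ | ❌ | ❌ |
| Rename field (with alias) | ✅ (Avro) | ❌ | ❌ |
| Change type incompatibly | ❌ | ❌ | ❌ |
| Widen numeric type (int→long) | ✅ | ❌ | ❌ |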
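The add and delete rows of the matrix can be reproduced with a small checker. This is a sketch under simplifying assumptions: a schema is modeled as a dict of field name to default, types and aliases are ignored, and `REQUIRED`, `can_read`, and `classify` are hypothetical names.

```python
REQUIRED = object()  # marks a field that has no default


def can_read(reader: dict, writer: dict) -> bool:
    """True if every reader field is present in the writer's data or has a default."""
    return all(
        name in writer or default is not REQUIRED
        for name, default in reader.items()
    )


def classify(old: dict, new: dict) -> dict:
    """Report which compatibility modes a change from `old` to `new` satisfies."""
    backward = can_read(reader=new, writer=old)  # new schema reads old data
    forward = can_read(reader=old, writer=new)   # old schema reads new data
    return {"backward": backward, "forward": forward, "full": backward and forward}


v1 = {"user_id": REQUIRED, "name": REQUIRED}

# Add an optional field with a default: compatible in both directions.
print(classify(v1, {**v1, "email": None}))

# Add a required field with no default: forward only.
print(classify(v1, {**v1, "email": REQUIRED}))

# Delete a required field: backward only.
print(classify(v1, {"user_id": REQUIRED}))
```

A real registry also compares types and, for Avro, considers aliases, so treat this as a way to reason about the matrix rather than a replacement for the registry's check.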
Schema Registry: The Enforcement Mechanism
A schema registry centralizes schema management and enforces compatibility rules before any data flows. The Confluent Schema Registry is the de facto standard in Kafka ecosystems:
Producer attempts to publish with a new schema
↓ Schema Registry API: POST /subjects/orders-value/versions
Registry checks: is new schema compatible with the registered version?
├── Compatible → assign schema ID → allow publish
└── Incompatible → 409 Conflict → block publish
How it works in a Kafka pipeline:
Producer: serialize event with Avro → prepend a 5-byte header (1 magic byte + 4-byte schema ID) to the message → send to Kafka
Consumer: read Kafka message → extract schema_id from the 5-byte header → fetch schema from registry (cached locally) → deserialize using that writer schema → apply evolution rules if the consumer’s own schema is a different version
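The framing step can be sketched with the standard library alone. The header layout (a zero magic byte followed by a 4-byte big-endian schema ID) follows the Confluent wire format; the function names are hypothetical, and the payload here is an opaque placeholder rather than real Avro bytes.

```python
import struct

MAGIC_BYTE = 0  # Confluent wire format marker


def encode_frame(schema_id: int, payload: bytes) -> bytes:
    """Prepend the 5-byte header: 1 magic byte + 4-byte big-endian schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload


def decode_frame(message: bytes) -> tuple[int, bytes]:
    """Split a framed message into (schema_id, serialized payload)."""
    if len(message) < 5:
        raise ValueError("message shorter than the 5-byte header")
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[5:]


frame = encode_frame(42, b"serialized-avro-bytes")
schema_id, payload = decode_frame(frame)
print(schema_id, payload)
```

In practice the Confluent serializers handle this framing for you; the sketch only shows why a consumer can find the writer schema without any out-of-band coordination.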
Key configuration:
# BACKWARD is the registry's default mode
compatibility = BACKWARD
# For streaming pipelines that replay all history
compatibility = BACKWARD_TRANSITIVE
# For independent producer/consumer teams
compatibility = FULL
Pre-register schemas in CI/CD:
# Don't auto-register in production — enforce via CI pipeline
auto.register.schemas=false
# In CI/CD: validate before deploy
mvn schema-registry:test-compatibility
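If the build is not Maven-based, the same pre-deploy check can be made against the registry's REST compatibility endpoint. A minimal sketch, assuming a registry at a hypothetical `http://localhost:8081` and the `orders-value` subject used earlier; the helper names are illustrative.

```python
import json
import urllib.request


def build_compatibility_request(base_url: str, subject: str, schema: dict) -> urllib.request.Request:
    """Build a POST that asks whether `schema` is compatible with the latest version."""
    url = f"{base_url.rstrip('/')}/compatibility/subjects/{subject}/versions/latest"
    # The registry expects the schema itself as a JSON-encoded string.
    body = json.dumps({"schema": json.dumps(schema)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    )


def is_compatible(request: urllib.request.Request) -> bool:
    """Send the request; needs a reachable registry, so it is not called here."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["is_compatible"]


candidate = {
    "type": "record",
    "name": "Order",
    "fields": [{"name": "order_id", "type": "string"}],
}
request = build_compatibility_request("http://localhost:8081", "orders-value", candidate)
print(request.get_method(), request.full_url)
```

A CI job would call `is_compatible(request)` and fail the build when it returns false, so the incompatible schema never reaches production.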
AWS Glue Schema Registry and Google Cloud Schema Registry: Cloud-native alternatives for AWS and GCP ecosystems respectively. Same principles, different management interfaces.
Avro, Protobuf, JSON Schema: Evolution Behavior
| Aspect | Avro | Protobuf | JSON Schema |
|---|---|---|---|
| Field identification | By name | By field number (tag) | By name |
| Default values | Explicit, required for optional fields | Implicit defaults per type | Explicit |
| Field rename | Via aliases property | Rename freely (tag unchanged) | Breaking: consumers use names |
| Add field | Need default value for backward compat | Safe: use new tag number | Need default |
| Remove field | Safe if had default | Safe: old readers ignore missing | May break if required |
| Type change | Limited widening (int→long→float→double) | Limited (int32→int64 safe) | Very limited |
| Schema evolution design | Built-in, Confluent default | Built-in via tag stability | Weaker, less standardized |
Critical Protobuf rule: Never reuse field tag numbers. Even if you delete a field, mark its tag as reserved to prevent future misuse:
message Order {
  int32 order_id = 1;
  string customer_name = 2;
  reserved 3;                  // was "old_status_code"; never reuse this tag
  reserved "old_status_code";  // also reserve the name
  string status = 4;
}
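The reserved-tag rule lends itself to an automated check in code review or CI. The sketch below assumes you have already extracted tag-to-name maps from the old and new .proto files (the parsing is out of scope); `check_tags` is a hypothetical helper, not part of any Protobuf tooling.

```python
def check_tags(old_fields: dict[int, str], new_fields: dict[int, str], reserved: set[int]) -> list[str]:
    """Flag removed tags that were not reserved, and reserved tags that were reused."""
    problems = []
    for tag in sorted(old_fields.keys() - new_fields.keys() - reserved):
        problems.append(f"tag {tag} ('{old_fields[tag]}') was removed but not reserved")
    for tag in sorted(new_fields.keys() & reserved):
        problems.append(f"tag {tag} is reserved but reused by '{new_fields[tag]}'")
    return problems


old = {1: "order_id", 2: "customer_name", 3: "old_status_code"}

# Matches the Order message above: tag 3 removed and reserved, status added as tag 4.
good = {1: "order_id", 2: "customer_name", 4: "status"}
print(check_tags(old, good, reserved={3}))

# A later edit that reuses tag 3 for an unrelated field.
bad = {1: "order_id", 2: "customer_name", 3: "discount_code", 4: "status"}
print(check_tags(old, bad, reserved={3}))
```

The Protobuf compiler itself rejects a field that uses a reserved tag, so the first check (removed but never reserved) is the one that adds protection.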
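Avro aliases for rename (backward-compatible rename). Here "amount" is the old field name; a reader using this schema can still resolve records that were written with it. JSON does not allow comments, so the note lives outside the schema:
{
  "type": "record",
  "name": "Order",
  "fields": [
    {
      "name": "order_total",
      "aliases": ["amount"],
      "type": "double",
      "default": 0.0
    }
  ]
}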
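To see what the alias buys you, here is a simplified imitation of how a reader might match fields by name or alias. It is not the Avro library's resolution code; `resolve_with_aliases` is a hypothetical helper and records are plain dicts.

```python
def resolve_with_aliases(record: dict, reader_fields: list[dict]) -> dict:
    """Map a written record onto reader fields, matching by name, then alias, then default."""
    resolved = {}
    for field in reader_fields:
        candidates = [field["name"], *field.get("aliases", [])]
        for key in candidates:
            if key in record:
                resolved[field["name"]] = record[key]
                break
        else:
            # No name or alias matched: fall back to the default, if any.
            if "default" not in field:
                raise ValueError(f"no value or default for field: {field['name']}")
            resolved[field["name"]] = field["default"]
    return resolved


reader_fields = [
    {"name": "order_total", "aliases": ["amount"], "type": "double", "default": 0.0}
]

# Record written before the rename: the alias maps "amount" to "order_total".
print(resolve_with_aliases({"amount": 19.99}, reader_fields))

# Record written after the rename: matched directly by name.
print(resolve_with_aliases({"order_total": 5.0}, reader_fields))
```

Note the direction: the alias helps readers on the new schema consume old data. Readers still on the old schema gain nothing from it, which is why the rename stays forward-incompatible.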
Data Contracts: The Organizational Layer
A data contract is broader than a schema. It’s a formal, enforceable agreement between a data producer and all its consumers (Soda.io, Dataskew.io):
Data Contract for: orders-events topic
_______________________________
Producer: payments-service team
Version: 2.1.0
SLA: 99.9% delivery, p99 latency < 500ms, data by 8:00 AM PT daily
SCHEMA:
  order_id: string (required, UUID format)
  customer_id: string (required)
  amount: double (required, > 0)
  currency: string (required, ISO 4217 code)
  status: string (enum: pending|confirmed|shipped|delivered|cancelled)
  created_at: timestamp (required)
  metadata: map (optional, default: {})
QUALITY RULES:
- order_id must be unique per day
- amount must be positive
- currency must be 3-character ISO code
- status must be in allowed enum values
EVOLUTION POLICY:
- Compatibility: FULL (backward + forward)
- Breaking changes require 30-day notice + major version bump
- New fields must have defaults
CONSUMERS:
- revenue-analytics (team: data platform, criticality: tier-1)
- fraud-detection (team: risk, criticality: tier-1)
- customer-service-dashboard (team: cx, criticality: tier-2)
The four components of a data contract:
- Schema: Structure (field names, types, required/optional)
- Semantics: Meaning. What does amount mean? Gross or net? In cents or dollars?
- Quality rules: Constraints, e.g. amount > 0, status in allowed values
- SLA: Freshness, availability, latency commitments
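The schema and quality-rule parts of the orders-events contract can be turned into a record-level validator. This is a sketch, not a contract framework: `validate_order` is a hypothetical function, the currency check only tests the three-letter shape rather than membership in ISO 4217, and the per-day uniqueness rule is left out because it needs state across records.

```python
import re
import uuid

ALLOWED_STATUSES = {"pending", "confirmed", "shipped", "delivered", "cancelled"}


def validate_order(event: dict) -> list[str]:
    """Return a list of contract violations for one event (empty means valid)."""
    errors = []
    try:
        uuid.UUID(str(event.get("order_id")))
    except ValueError:
        errors.append("order_id must be a UUID")
    if not event.get("customer_id"):
        errors.append("customer_id is required")
    amount = event.get("amount")
    if not isinstance(amount, (int, float)) or isinstance(amount, bool) or amount <= 0:
        errors.append("amount must be a positive number")
    currency = event.get("currency")
    if not isinstance(currency, str) or not re.fullmatch(r"[A-Z]{3}", currency):
        errors.append("currency must be a 3-letter code")
    if event.get("status") not in ALLOWED_STATUSES:
        errors.append("status must be one of the allowed values")
    return errors


event = {
    "order_id": "123e4567-e89b-12d3-a456-426614174000",
    "customer_id": "c-42",
    "amount": 19.99,
    "currency": "USD",
    "status": "confirmed",
}
print(validate_order(event))
print(validate_order({**event, "amount": -5, "status": "lost"}))
```

Semantics and SLA are the two components a function like this cannot check; they need documentation and monitoring respectively.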
Implementing Data Contracts: The Three Approaches
Approach 1: Schema Registry-Based (Streaming)
For Kafka/streaming pipelines. Schema Registry enforces structural compatibility. Add quality rules via CEL expressions:
# Register schema WITH quality rules (data contract in Schema Registry)
rule = Rule(
name="check_amount_positive",
kind="CONDITION",
type="CEL",
mode="WRITE",
expr="message.amount > 0",
on_failure="ERROR", # or "DLQ" to route violations to dead letter queue
)
Approach 2: dbt Contract (Warehouse Layer)
dbt 1.5+ supports model contracts — enforce schema at the transformation layer:
# models/gold/fact_orders.yml
models:
  - name: fact_orders
    config:
      contract:
        enforced: true  # dbt will fail if schema doesn't match
    columns:
      - name: order_id
        data_type: string
        constraints:
          - type: not_null
          - type: unique
      - name: amount
        data_type: float
        constraints:
          - type: not_null
          - type: check
            expression: "amount > 0"
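With contract: enforced: true, dbt validates that the model’s output matches the declared column names and data types. A column type mismatch or missing column fails the run, which keeps a bad schema from reaching downstream models. Constraints such as unique and check are passed to the warehouse, and whether they are actually enforced depends on the platform.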
Approach 3: YAML-Based Contracts (Code-Defined)
Tools like Soda, Great Expectations, or custom YAML contracts define expectations as code in Git, enforced at pipeline checkpoints:
# contracts/orders.yaml
dataset: gold.fact_orders
owner: data-platform@company.com
sla:
  freshness: daily by 08:00 PT
  availability: 99.9%
schema:
  - name: order_id
    type: string
    required: true
    unique: true
  - name: amount
    type: float
    required: true
    min: 0
quality:
  - name: no_future_dates
    sql: "SELECT COUNT(*) FROM fact_orders WHERE order_date > CURRENT_DATE"
    expected: 0
  - name: completeness
    sql: "SELECT COUNT(*) * 1.0 / (SELECT COUNT(*) FROM raw.orders) FROM fact_orders"
    expected: "> 0.99"
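A custom contract like this needs a small evaluator at the pipeline checkpoint. The sketch below covers only the `expected` values shown in the contract (a bare number or a comparison such as "> 0.99"); `meets_expectation` is a hypothetical helper, and running the SQL to obtain the actual value is left to your warehouse client.

```python
import operator
import re

_OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


def meets_expectation(actual: float, expected) -> bool:
    """Compare a query result with an `expected` entry from the contract."""
    if isinstance(expected, (int, float)):
        return actual == expected
    match = re.fullmatch(r"\s*(>=|<=|==|>|<)\s*(-?\d+(?:\.\d+)?)\s*", expected)
    if match is None:
        raise ValueError(f"unsupported expectation: {expected!r}")
    op, threshold = match.groups()
    return _OPERATORS[op](actual, float(threshold))


# no_future_dates: the count must be exactly 0.
print(meets_expectation(0, 0))

# completeness: the ratio must exceed 0.99.
print(meets_expectation(0.995, "> 0.99"))
print(meets_expectation(0.98, "> 0.99"))
```

Tools such as Soda and Great Expectations ship their own check syntax, so this kind of evaluator is only needed for the custom YAML route.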
Handling Breaking Changes: The Migration Patterns
Sometimes breaking changes are unavoidable (major refactors, compliance requirements). The patterns:
Pattern 1: Parallel topics / dual-write
Phase 1: Write to both old topic (v1) and new topic (v2)
Phase 2: Migrate consumers to v2 one by one
Phase 3: Deprecate v1 topic after all consumers migrated
Zero downtime. Old consumers keep working. New consumers adopt v2.
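Phase 1 of the dual-write pattern might look like the following sketch. Everything here is illustrative: `InMemoryProducer` stands in for a real Kafka producer, the topic names `orders.v1` and `orders.v2` are hypothetical, and the v2 change is assumed to be a rename of amount to order_total.

```python
class InMemoryProducer:
    """Stand-in for a Kafka producer; records sends per topic for inspection."""

    def __init__(self) -> None:
        self.topics: dict[str, list[dict]] = {}

    def send(self, topic: str, value: dict) -> None:
        self.topics.setdefault(topic, []).append(value)


def to_v2(event_v1: dict) -> dict:
    """Translate a v1 event into the (assumed) v2 shape."""
    event_v2 = dict(event_v1)
    event_v2["order_total"] = event_v2.pop("amount")
    return event_v2


def dual_write(producer: InMemoryProducer, event_v1: dict) -> None:
    """Publish the same business event to both the old and the new topic."""
    producer.send("orders.v1", event_v1)
    producer.send("orders.v2", to_v2(event_v1))


producer = InMemoryProducer()
dual_write(producer, {"order_id": "o-1", "amount": 10.0})
print(producer.topics)
```

With a real broker the two sends are not atomic, so a production version would need transactions or an outbox to keep the topics from diverging on partial failure.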
Pattern 2: In-place migration with transformation rules
Schema Registry migration rules transform data at read time. Producers write v2; consumers expecting v1 receive a v1-compatible view via declarative transformation rules. Complex but avoids topic proliferation.
Pattern 3: Versioned endpoints
/v1/orders → old schema (deprecated, sunset date: 2026-06-01)
/v2/orders → new schema (current)
Consumers choose their version explicitly. Cleaner but doubles maintenance burden temporarily.
The golden rule: “Never remove or rename a field without a deprecation period. Add the new field, keep the old field for N days, then remove it after all consumers have migrated.”
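During the deprecation window, both sides can be written defensively. A minimal sketch, assuming the rename of amount to order_total discussed in this lesson; both helper names are hypothetical.

```python
def with_both_fields(event: dict) -> dict:
    """Producer side: populate the old and the new field with the same value."""
    out = dict(event)
    out["order_total"] = out["amount"]
    return out


def read_total(event: dict) -> float:
    """Consumer side: prefer the new field, fall back to the old one."""
    if "order_total" in event:
        return event["order_total"]
    return event["amount"]


published = with_both_fields({"order_id": "o-1", "amount": 5.0})
print(published)

# Works on events from before, during, and after the deprecation window.
print(read_total({"amount": 5.0}), read_total(published), read_total({"order_total": 7.0}))
```

Once every consumer reads through something like `read_total`, removing the old field stops being a coordinated release and becomes a routine producer change.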
Interview Questions
Q1: “A producer team wants to rename the field amount to order_total in a Kafka topic consumed by 8 downstream pipelines. How do you handle this?”
Model Answer: “Renaming a field is a breaking change: it’s neither backward nor forward compatible unless handled carefully. I’d follow a three-phase approach. Phase 1: dual-field period. The producer adds order_total, with a default, alongside the existing amount field. Both are populated identically. This is fully backward compatible, so no consumers break. Announce the migration with a 30-day timeline. Phase 2: consumer migration. Each of the 8 downstream teams migrates their pipeline to read order_total. We track migration progress in the data contract registry. Phase 3: deprecation. After all 8 consumers confirm migration, the producer removes amount. With the schema registry enforcing FULL compatibility, this final change passes only if amount was declared with a default, and it is safe in practice because no consumers reference it anymore. For Avro specifically, Phase 1 can use the aliases mechanism to make the rename backward compatible, but the dual-field approach is cleaner operationally because it avoids relying on all consumers correctly handling Avro aliases. Throughout, dbt contracts on the downstream models catch any consumer that hasn’t migrated.”
Q2: “How do you prevent schema drift in a lakehouse with 50 source systems feeding data into your bronze layer?”
Model Answer: “Schema drift at the bronze layer is inevitable — you can’t control 50 source systems. The strategy is detect-early and isolate. At ingestion, I’d implement schema validation as the first transformation step: compare the incoming schema against the registered contract. Violations go to a quarantine table, not the main bronze table. The pipeline continues processing valid records. An alert fires for schema drift — the diff shows exactly which fields changed. Second, use a schema registry or catalog to maintain the ‘known schema’ for each source. The ingestion job compares incoming schemas against the registered version and alerts on any deviation. Third, for Avro/Protobuf sources, the Schema Registry enforces compatibility automatically — producers can’t publish an incompatible schema. For untyped JSON sources (where drift is most common), I’d use schema inference with Great Expectations to detect new/missing/type-changed columns. Fourth, in the bronze-to-silver transformation, dbt model contracts enforce the expected output schema. A schema drift upstream becomes a failed dbt run, which is visible in CI/CD before it reaches production. The combination of quarantine + alerting + dbt contracts means schema drift is caught within one pipeline run, not three days later when an analyst notices wrong numbers.”
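The detect-and-isolate step from this answer can be sketched in a few lines. Assumptions: records are plain dicts, the registered contract is a field-to-type map, and unexpected fields count as drift (some teams choose to allow them); `schema_diff` and `route` are hypothetical names.

```python
def schema_diff(expected: dict[str, type], record: dict) -> dict[str, list[str]]:
    """Describe how a record deviates from the registered contract."""
    missing = sorted(expected.keys() - record.keys())
    unexpected = sorted(record.keys() - expected.keys())
    wrong_type = sorted(
        name
        for name in expected.keys() & record.keys()
        if not isinstance(record[name], expected[name])
    )
    return {"missing": missing, "unexpected": unexpected, "wrong_type": wrong_type}


def route(records: list[dict], expected: dict[str, type]) -> tuple[list[dict], list[dict]]:
    """Send conforming records to bronze and the rest, with their diff, to quarantine."""
    valid, quarantine = [], []
    for record in records:
        diff = schema_diff(expected, record)
        if any(diff.values()):
            quarantine.append({"record": record, "diff": diff})
        else:
            valid.append(record)
    return valid, quarantine


contract = {"order_id": str, "amount": float}
incoming = [
    {"order_id": "o-1", "amount": 9.5},
    {"order_id": "o-2", "order_total": 9.5},  # upstream renamed the column
]
valid, quarantine = route(incoming, contract)
print(valid)
print(quarantine)
```

The diff stored next to each quarantined record is what makes the alert actionable: it names the renamed or missing field instead of leaving someone to guess why NULLs appeared.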
Think About This
You’re in an Anthropic interview. The prompt: “We have a pipeline that collects conversation data from Claude. The format of conversations will evolve significantly as we add new features — new message types, new metadata fields, new safety annotations. How would you design the schema evolution strategy?”
Walk through:
- What compatibility mode? (BACKWARD_TRANSITIVE: you’ll be replaying historical conversations for model training. New schemas must be able to read all historical data, not just data from the last schema version.)
- How do you handle new message types? (In Avro: add the new message type as a branch of a union. This is backward compatible, since new readers still handle old data, but old readers cannot decode the new branch, so upgrade consumers first. In Protobuf: use oneof for message type variants with new types added carefully.)
- What’s the migration strategy for major restructures? (Versioned training datasets: store conversations with their schema version tag. Training jobs can specify which schema versions to include. Dual-write during transitions.)
- How do you prevent safety annotation schema changes from breaking downstream safety analysis pipelines? (Data contracts on the safety annotation schema. Breaking changes require 60-day notice, parallel schema support, and sign-off from all downstream safety analysis teams. Safety-related schemas get stricter governance than product feature schemas.)
The insight: at Anthropic, some schemas (safety annotations, evaluation data) are more critical than others and warrant stricter governance. Tiering your schema evolution policy by impact — just like tiering data quality — shows senior thinking.
Quick Reference
- Backward: New schema reads old data. Safe: add optional field with default. Breaking: add required field with no default.
- Forward: Old schema reads new data. Safe: add any field (old readers ignore it). Breaking: remove a field old readers need.
- Full: Both directions. Safe: add/remove fields WITH defaults. Most restrictive but enables independent producer/consumer deployment.
- TRANSITIVE variants: Apply compatibility to ALL historical versions, not just the last. Required when replaying historical data.
- Schema Registry: Blocks incompatible schema registration. Enforces your chosen compatibility mode. Pre-register in CI/CD; never auto-register in production.
- Data contract = schema + semantics + quality rules + SLA. The formal interface between producers and consumers.
- Breaking change process: Dual-field period → consumer migration → old field removal. Never remove without a deprecation period.
- Protobuf golden rule: Never reuse field tag numbers. Mark deleted tags as reserved.
Tomorrow’s Preview
Day 22: Workflow Orchestration Patterns — Airflow, Dagster, Prefect, Step Functions. DAG design patterns. Idempotency, retries, backfills. SLA monitoring. Task dependencies and fan-out/fan-in. How to design orchestration architectures that interviewers at your target companies recognize as production-grade.