Day 14 — Replication and fault tolerance

Phase 1: Foundations & Frameworks | Category: Distributed Systems

The Pairing with Day 13

Yesterday you learned how to split data across nodes (partitioning). Today you learn how to copy data across nodes (replication). Together they answer the two fundamental questions of distributed data: where does each piece of data live (partition), and how many copies exist (replication). Every data system you’ll discuss in interviews uses both.

Why Replicate?

Goal	How Replication Helps
Fault tolerance	If one node dies, another replica has the data. No data loss.
High availability	Read from any healthy replica. System stays up during node failures.
Read scaling	Distribute read load across multiple replicas. One writer, many readers.
Geographic locality	Place replicas near users in different regions. Lower latency.

The cost: keeping replicas in sync. Every replication strategy is a different answer to the question: how do we propagate writes to all copies, and what consistency do we sacrifice to do it fast?

The Three Replication Architectures

1. Leader-Follower (Single-Leader)

The most common model. One node (leader/primary) accepts all writes. Changes flow to followers (replicas/secondaries) via a replication log.

Client writes ──→ LEADER ──→ replication log ──→ Follower 1
                                                ──→ Follower 2
                                                ──→ Follower 3
Client reads  ──→ Any node (leader or follower)

Sync vs Async replication:

Mode	How It Works	Trade-off
Synchronous	Leader waits for follower(s) to ACK before confirming write to client	Zero data loss on leader failure. Higher write latency.
Asynchronous	Leader confirms write immediately, replicates in background	Low write latency. Risk: if leader dies before replication, writes are lost.
Semi-synchronous	Leader waits for 1 follower to ACK, rest are async	Compromise: one guaranteed replica, reasonable latency. Postgre SQL’s common config.

Leader failure → Leader election: When the leader goes down, a follower must be promoted. Mechanisms: Raft consensus (CockroachDB, etcd), Paxos (Spanner), ZooKeeper-based (Kafka, HBase), or manual failover.

Replication lag: The time between a write on the leader and that write appearing on a follower. During this lag, followers return stale data. This is why “read-after-write consistency” is a real problem — you write to the leader, then immediately read from a follower that hasn’t caught up yet.

Systems: PostgreSQL streaming replication, MySQL binlog replication, MongoDB replica sets, Kafka (leader per partition with ISR followers), Redis Sentinel.

Best for: Most OLTP workloads, data pipelines where one writer is sufficient, any system where strong consistency is preferred over write throughput.

2. Multi-Leader

Multiple nodes accept writes simultaneously. Each leader replicates its writes to all other leaders.

Region US:   Leader A  ←──replication──→  Leader B  :Region EU
                 ↑ writes                     ↑ writes
            US clients                   EU clients

The fundamental problem: write conflicts. If User X updates a record on Leader A and User Y updates the same record on Leader B at the same time, which write wins?

Conflict resolution strategies:

Strategy	How It Works	Trade-off
Last-Write-Wins (LWW)	Timestamp-based: later write wins	Simple but lossy — the “losing” write is silently discarded. Clock skew can cause wrong resolution.
Custom application logic	Application defines merge rules per data type	Most accurate but complex. Must handle every conflict type.
CRDTs	(Conflict-Free Replicated Data Types) Data structures designed to merge without conflicts (counters, sets, registers)	Elegant for specific data types. Not general-purpose.
Operational Transform	Transforms concurrent operations to produce consistent result (Google Docs)	Complex to implement. Works well for text/document editing.

When to use multi-leader:

Multi-datacenter deployments where each DC needs local write latency
Offline-capable apps (each device is a “leader” that syncs when online)
Collaborative editing (each user’s session is a “leader”)

Systems: CouchDB multi-master, Cassandra multi-DC (technically leaderless but behaves similarly), Active-Active PostgreSQL (BDR).

Important misconception to flag in interviews: “Multi-leader replication does NOT improve consistency. It trades consistency for availability and geographic write performance. It reduces consistency by introducing conflicts.”

3. Leaderless

No designated leader. Any node can accept writes. Clients write to multiple nodes simultaneously and read from multiple nodes simultaneously. Quorum rules determine success.

Client write ──→ Node 1 (ACK)                ──→ Node 2 (ACK)
      W=2 of N=3 → write succeeds                ──→ Node 3 (timeout)
Client read  ──→ Node 1 (returns v2)                ──→ Node 2 (returns v2)
      R=2 of N=3 → returns v2                ──→ Node 3 (returns v1 — stale, ignored)

The quorum formula:

W + R > N  → guarantees strong consistency (read always sees latest write)
Where:
    N = total replicas
    W = write quorum (replicas that must ACK a write)
    R = read quorum (replicas that must respond to a read)

Common configurations with N=3:

Config	W	R	Consistency	Fault Tolerance	Use Case
Strong	2	2	Strong (W+R=4 > 3)	Tolerates 1 failure for both reads/writes	Default balanced config
Write-heavy	1	3	Strong (W+R=4 > 3)	Tolerates 2 write failures, 0 read failures	High write throughput, reads query all nodes
Read-heavy	3	1	Strong (W+R=4 > 3)	Tolerates 0 write failures, 2 read failures	Fast reads from any single node
Eventual	1	1	Eventual (W+R=2 ≤ 3)	Tolerates 2 failures each	Maximum availability, stale reads possible

Repair mechanisms (to fix stale replicas):

Read repair: During a read, if one replica returns stale data, the client sends the fresh value to that replica
Anti-entropy: Background process compares replicas and reconciles differences (DynamoDB does this)
Hinted handoff: If a target node is down during write, another node temporarily stores the write and forwards it when the target recovers (sloppy quorum)

Systems: DynamoDB, Cassandra, Riak, Voldemort.

Write-Ahead Log (WAL): The Foundation of Durability

The WAL is the mechanism that makes everything above possible (System Design Classroom, Architecture Weekly):

Core principle: Never modify data directly. First, log the intended change to a durable, append-only file. Then apply the change to the actual data. If the system crashes after logging but before applying, replay the log on recovery.

1. Client: INSERT INTO orders (id, amount) VALUES (42, 99.99)
2. Database: Write log entry to WAL → flush to disk (fsync)
3. Database: ACK to client ("write committed")
4. Database: Later, apply change to actual data files (async)
5. If crash between 3 and 4: replay WAL on startup → change applied

Why WAL matters for data engineering:

Topic	Details
Crash recovery	Replay uncommitted WAL entries to restore consistent state
Streaming replication	Ship WAL segments from leader to followers (Postgre SQL, My SQL)
CDC (Change Data Capture)	Read the WAL to capture changes in real-time (Debezium reads Postgre SQL WAL / My SQL binlog)
Point-in-time recovery	Archive WAL segments + base backup → restore to any timestamp
Kafka’s architecture	Kafka IS a distributed WAL. The entire commit log is the data structure.

The connection to CDC: When you use Debezium to capture changes from PostgreSQL for a streaming pipeline, Debezium reads the PostgreSQL WAL (via logical replication slots). The WAL is what makes CDC possible without querying the database. This is why you should always mention WAL-based CDC as superior to query-based CDC — it’s non-intrusive, captures all changes including deletes, and maintains ordering.

Replication in Data Pipeline Design

Here’s where you connect replication theory to practical DE architecture:

Kafka Replication

Each partition has one leader and N-1 follower replicas (ISR = In-Sync Replicas)
Producers write to the leader; acks=all waits for all ISR replicas to ACK
If the leader fails, a new leader is elected from the ISR
Replication factor of 3 is standard (tolerates 1 broker failure)
min.insync.replicas=2 prevents writes if ISR drops below 2 (CP behavior)

Warehouse Replication

BigQuery: Automatically replicated across zones within a region. Multi-region datasets replicate across US/EU.
Snowflake: Replication across regions/clouds for DR. Time travel and fail-safe use internal replication.
Redshift: Each block replicated to another node in the cluster. Cross-region snapshots for DR.

Lakehouse Replication

Iceberg/Delta on S3: S3 provides 11 nines of durability via internal replication. Table-level replication = copy metadata + data files to another region.
Cross-region replication: S3 Cross-Region Replication (CRR) for DR.

How Replication Affects Your Pipeline SLA

Pipeline ComponentReplication ConfigFailure ImpactKafka (RF=3, min.ISR=2, acks=all)2 replicas in sync at all timesTolerates 1 broker failure with zero data loss. If 2 brokers fail, writes rejected (CP).PostgreSQL (sync replication to 1 standby)1 synchronous followerTolerates 1 node failure with zero data loss. Write latency includes replication round-trip.S3 storage11 nines durability (auto-replicated)Essentially never loses data. Region-level failure requires CRR to another region.Flink checkpoints (to S3)Checkpoints stored on S3 (auto-replicated)Job restart replays from last checkpoint. Data since last checkpoint reprocessed from Kafka.

Interview Questions

Q1: “Your streaming pipeline writes to both Cassandra (serving) and S3/Iceberg (warehouse). The Cassandra cluster loses a node during a write. What happens?”

Model Answer: “It depends on the consistency level configured. With Cassandra replication factor 3 and consistency level QUORUM (W=2), losing one node still allows writes to succeed — the write commits to 2 of 3 replicas. The failed node receives the data via hinted handoff when it recovers, or anti-entropy repair in the background. With consistency level ALL (W=3), the write would fail because all 3 nodes must ACK. I’d configure QUORUM for this serving use case — the dashboard can tolerate a briefly stale replica over a failed write. For the S3/Iceberg write, it’s a separate concern: S3 provides 11 nines of durability via internal replication. If the Flink writer fails mid-write to S3, Flink’s checkpointing ensures it retries from the last consistent state. The key design principle: different sinks can have different replication and consistency guarantees within the same pipeline.”

Q2: “How would you design replication for a global analytics platform serving teams in US, EU, and APAC?”

Model Answer: “The primary warehouse lives in one region — say US-West — where the batch pipelines produce gold tables. For global access, I have two options. Option A: read replicas in EU and APAC. The primary in US replicates asynchronously to follower warehouses in EU/APAC. Replication lag of minutes is acceptable for analytics — teams see data that’s 5-10 minutes behind the primary. This is leader-follower replication: one writer, multiple readers. Option B: for BigQuery, I’d use multi-region datasets that automatically replicate across US and EU. APAC reads route to the nearest region. I’d choose leader-follower over multi-leader because analytics workloads are write-once, read-many — there’s no need for multiple write endpoints. All ETL writes to the US primary; all regions read from their local replica. The RPO (Recovery Point Objective) is the replication lag — if the US region goes down, EU/APAC lose the last few minutes of data. For DR, I’d maintain cross-region snapshots with a defined RTO (Recovery Time Objective) of under 1 hour.”

Think About This

You’re in a Netflix interview. The prompt: “Netflix’s content metadata service stores information about every title (movies, shows, episodes). This metadata is read millions of times per second globally for the Netflix UI. How would you design the replication?”

Walk through:

What’s the read/write ratio? (Extremely read-heavy. Metadata changes infrequently — a show’s title, description, and cast rarely change. Millions of reads per second, maybe hundreds of writes per hour.)
What replication model? (Leader-follower. One leader accepts the rare writes. Many followers across regions serve the massive read load.)
Sync or async replication? (Async. A few seconds of lag is fine — if a show’s description updates 3 seconds later in APAC, nobody notices. Sync replication would add cross-region latency to every metadata write, which is unnecessary.)
How many replicas per region? (Multiple read replicas behind a load balancer in each region. Netflix has 3 AWS regions — probably 3-5 replicas per region for read throughput and zone-level fault tolerance.)
What happens if the leader fails? (Automated leader election from an in-region follower. Writes pause briefly during election. Reads continue unaffected from all followers globally — this is the availability benefit of leader-follower with async replication.)

Quick Reference

Leader-Follower: One writer, many readers. Simplest model. Risk: replication lag causes stale reads. Used by PostgreSQL, MySQL, MongoDB, Kafka (per partition).
Multi-Leader: Multiple writers for geographic distribution. Must handle write conflicts (LWW, CRDTs, app logic). Use only when multi-DC write latency matters.
Leaderless: No designated leader. Quorum-based (W + R > N for strong consistency). Used by DynamoDB, Cassandra. Maximum fault tolerance but operationally complex.
Quorum math: N=3, W=2, R=2 is the standard balanced config. Tolerates 1 node failure. W+R > N guarantees read-after-write consistency.
WAL: Log the change before applying it. Enables crash recovery, streaming replication, CDC, and point-in-time recovery. Kafka’s entire architecture IS a distributed WAL.
For data pipelines: Match replication to the component. Kafka RF=3 with acks=all for ingestion durability. Async read replicas for serving dashboards. S3’s built-in 11-nines durability for lakehouse storage.

Tomorrow’s Preview

Day 15: Storage Layer Deep Dive — SQL vs NoSQL — Relational (Postgres, MySQL) vs Document (MongoDB) vs Wide-column (Cassandra, HBase) vs Key-Value (Redis, DynamoDB). Selection criteria for data engineering use cases, and how to justify your storage choice in system design interviews.

Day 14: Replication and fault tolerance