Day 49 — A/B Testing Data Infrastructure

Phase 2: Deep Dives | Category: ML Data Infrastructure

Why This Is a Senior DE Interview Topic (Not Just Data Science)

A/B testing is a data engineering problem as much as it is a statistics problem. The pipeline that collects experiment assignments, logs user interactions, joins them correctly, computes metrics, detects statistical significance, and surfaces results — this is built and owned by data engineers.

The data engineering challenges are subtle: one bad join, one incorrectly tracked variant, one assignment logged after the fact — and the entire experiment is invalid.

The Full A/B Testing Data Pipeline

┌───────────────────────────────────────────────────────────────────┐
│  EXPERIMENT DEFINITION                                             │
│  • Hypothesis: "Adding video previews increases engagement"        │
│  • Success metric: 7-day retention rate                            │
│  • Guardrail metrics: stream start time, error rate, revenue       │
│  • Randomization unit: user_id                                     │
│  • Traffic allocation: 50% control / 50% treatment                 │
│  • Duration: 14 days (power analysis → MDE 0.5% at 80% power)      │
│  Stored in: Experiment Registry (Postgres)                         │
└──────────────────────────┬────────────────────────────────────────┘
                            ↓
┌───────────────────────────────────────────────────────────────────┐
│  ASSIGNMENT SERVICE (real-time)                                    │
│  • User arrives → deterministic hash(user_id + experiment_id)      │
│    → bucket assignment (0-49: control, 50-99: treatment)           │
│  • Assignment logged to Kafka topic "experiment_assignments"       │
│  • Served variant feature flags to the application                 │
│  • Assignment is STICKY (same user always gets same variant)       │
└──────────────────────────┬────────────────────────────────────────┘
                            ↓ (alongside existing event collection)
┌───────────────────────────────────────────────────────────────────┐
│  EVENT LOGGING                                                     │
│  • All user events carry: experiment_id, variant_id                │
│    OR events are later joined to assignment logs                   │
│  • Kafka topic: "user_events" (enriched with variant metadata)     │
└──────────────────────────┬────────────────────────────────────────┘
                            ↓ (daily Spark/dbt job)
┌───────────────────────────────────────────────────────────────────┐
│  METRICS COMPUTATION                                               │
│  • Join assignments to events on user_id + date range              │
│  • Compute per-user metrics (retention, session length, revenue)   │
│  • Aggregate: mean, variance per variant                           │
│  • Statistical tests: t-test, z-test, bootstrap                     │
│  • Guardrail checks: is stream start time worse in treatment?      │
│  Output: gold.experiment_metrics table                             │
└──────────────────────────┬────────────────────────────────────────┘
                            ↓
┌───────────────────────────────────────────────────────────────────┐
│  RESULTS SURFACING                                                 │
│  • Experiment dashboard (Tableau/custom UI)                        │
│  • P-value, confidence interval, effect size per metric            │
│  • Guardrail status: PASS / FAIL / WARNING                         │
│  • Statistical significance reached? If sequential testing: YES    │
│  Decision: Ship / Roll back / Extend / Iterate                     │
└───────────────────────────────────────────────────────────────────┘

Experiment Assignment: Getting It Right

Deterministic hashing (the correct approach):

import hashlib
from typing import Dict, Tuple

def assign_variant(
    user_id: str,
    experiment_id: str,
    buckets: Dict[str, Tuple[int, int]],
) -> str:
    """
    Deterministic bucket assignment.
    Same user_id + experiment_id → always same variant.
    No state required — recompute anytime.
    """
    # Hash combines user_id and experiment_id
    # experiment_id prevents correlation across experiments
    hash_input = f"{user_id}:{experiment_id}"
    hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
    bucket = hash_value % 100  # 0-99

    for variant_name, (low, high) in buckets.items():
        if low <= bucket < high:
            return variant_name

    return "control"  # fallback

assignment = assign_variant(
    user_id="user-12345",
    experiment_id="exp-video-preview-v1",
    buckets={"control": (0, 50), "treatment": (50, 100)},
)

Why include experiment_id in the hash? Without it, the same user_id always hashes to the same bucket across experiments, causing correlated assignments.

Assignment persistence choices:

Approach	How it works	Trade-offs
Deterministic hash (recompute)	Recompute from user_id + exp_id every time	No state/DB lookup. Can’t change the hash function mid-experiment.
Assignment store (DynamoDB/Redis)	Write assignment on first exposure, read on subsequent	Supports overrides/QA and allocation changes. Requires infra.
Cookie / client-side	Store assignment in browser cookie	Lost if cookie cleared; can be manipulated.

Netflix/Meta/Google often use deterministic hashing for production traffic (stateless) with an assignment store as a cache and as the logging source of truth.

Common assignment bugs to flag in interviews:

Assignment logging lag: user assigned at 2 PM, assignment logged at 3 PM → events in between have no assignment to join to
Post-treatment assignment: assignment recorded after events occurred (usually logging delay/backfill bug)
Survivor bias: only users who return after assignment are logged → denominator is wrong

Experiment Data Model

-- Experiment registry (source of truth for experiment metadata)
CREATE TABLE experiments (
    experiment_id      STRING PRIMARY KEY,
    name               STRING,
    hypothesis         TEXT,
    owner              STRING,
    status             STRING,   -- 'running', 'paused', 'completed', 'rolled_back'
    start_date         DATE,
    end_date           DATE,
    randomization_unit STRING,  -- 'user_id', 'session_id', 'device_id'
    traffic_pct        FLOAT,   -- % of eligible users in experiment
    variants           JSONB    -- {"control": 0.5, "treatment": 0.5}
);

-- Assignment table (grain: one row per user per experiment)
CREATE TABLE experiment_assignments (
    user_id           STRING,
    experiment_id     STRING,
    variant           STRING,
    first_exposed_at  TIMESTAMP,
    assignment_date   DATE,
    PRIMARY KEY (user_id, experiment_id)
) PARTITION BY assignment_date;

-- Metric observations (grain: one row per user per experiment per metric per day)
CREATE TABLE experiment_metrics_daily (
    experiment_id     STRING,
    variant           STRING,
    metric_date       DATE,
    user_id           STRING,
    sessions          INT,
    stream_minutes    FLOAT,
    revenue           FLOAT,
    is_retained_7d    BOOLEAN
) PARTITION BY metric_date;

-- Aggregated results (grain: one row per experiment × variant × metric)
CREATE TABLE experiment_results (
    experiment_id     STRING,
    variant           STRING,
    metric_name       STRING,
    computed_at       TIMESTAMP,
    control_mean      FLOAT,
    treatment_mean    FLOAT,
    relative_lift_pct FLOAT,
    p_value           FLOAT,
    confidence_interval_lower FLOAT,
    confidence_interval_upper FLOAT,
    is_significant    BOOLEAN,
    sample_size_control INT,
    sample_size_treatment INT
);

Metrics Computation: The SQL at Scale

WITH exposed_users AS (
    SELECT
        ea.user_id,
        ea.experiment_id,
        ea.variant,
        ea.first_exposed_at
    FROM experiment_assignments ea
    WHERE ea.experiment_id = 'exp-video-preview-v1'
      AND ea.first_exposed_at BETWEEN '2026-04-01' AND '2026-04-14'
),
user_metrics AS (
    SELECT
        eu.user_id,
        eu.variant,
        MAX(CASE WHEN ue.event_timestamp
                      BETWEEN eu.first_exposed_at
                          AND eu.first_exposed_at + INTERVAL '7 days'
                 AND ue.event_type = 'session_start'
                 THEN 1 ELSE 0 END) AS is_retained_7d,
        SUM(CASE WHEN ue.event_timestamp > eu.first_exposed_at
                 THEN ue.stream_duration_min ELSE 0 END) AS stream_minutes_7d,
        SUM(CASE WHEN ue.event_timestamp > eu.first_exposed_at
                 THEN ue.revenue ELSE 0 END) AS revenue_7d
    FROM exposed_users eu
    LEFT JOIN user_events ue ON eu.user_id = ue.user_id
    GROUP BY eu.user_id, eu.variant
),
variant_stats AS (
    SELECT
        variant,
        COUNT(*) AS sample_size,
        AVG(is_retained_7d) AS retention_rate,
        STDDEV(is_retained_7d) AS retention_stddev,
        AVG(stream_minutes_7d) AS avg_stream_minutes,
        AVG(revenue_7d) AS avg_revenue
    FROM user_metrics
    GROUP BY variant
)
SELECT * FROM variant_stats;

Statistical Significance: The Data Engineering Layer

Two-sample z-test for proportions (binary metrics like retention):

from scipy import stats
import numpy as np

def compute_ab_test_results(
    control_successes: int, control_n: int,
    treatment_successes: int, treatment_n: int
) -> dict:
    """
    Two-sample z-test for difference in proportions.
    Used for binary metrics (conversion, retention).
    """
    p_control = control_successes / control_n
    p_treatment = treatment_successes / treatment_n

    p_pooled = (control_successes + treatment_successes) / (control_n + treatment_n)
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/control_n + 1/treatment_n))

    z_score = (p_treatment - p_control) / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))

    ci_lower = (p_treatment - p_control) - 1.96 * se
    ci_upper = (p_treatment - p_control) + 1.96 * se

    relative_lift = (p_treatment - p_control) / p_control * 100

    return {
        "control_rate": p_control,
        "treatment_rate": p_treatment,
        "relative_lift_pct": relative_lift,
        "z_score": z_score,
        "p_value": p_value,
        "ci_lower": ci_lower,
        "ci_upper": ci_upper,
        "is_significant": p_value < 0.05,
    }

T-test for continuous metrics (e.g., stream minutes):

from scipy.stats import ttest_ind

control_values = df[df.variant == 'control']['stream_minutes'].values
treatment_values = df[df.variant == 'treatment']['stream_minutes'].values

t_stat, p_value = ttest_ind(control_values, treatment_values, equal_var=False)

Sequential Testing: The Netflix/Meta Approach

Fixed-horizon tests require “no peeking.” Sequential testing allows valid early stopping with adjusted thresholds.

checkpoints = {
    7:  {"alpha_threshold": 0.005},
    10: {"alpha_threshold": 0.014},
    14: {"alpha_threshold": 0.043},
}

for checkpoint_day, thresholds in checkpoints.items():
    results = compute_ab_test_results(...)
    if results["p_value"] < thresholds["alpha_threshold"]:
        trigger_early_stop(experiment_id, checkpoint_day)
        break

Guardrail Metrics: The Safety Net

Guardrail metrics must NOT get significantly worse, even if the primary metric improves.

SELECT
    experiment_id,
    variant,
    metric_name,
    p_value,
    relative_lift_pct,
    CASE
        WHEN metric_name IN ('stream_start_time_ms', 'error_rate', 'cancellation_rate')
             AND relative_lift_pct < -0.5
             AND p_value < 0.05
        THEN 'GUARDRAIL_VIOLATED'
        WHEN metric_name IN ('stream_start_time_ms', 'error_rate', 'cancellation_rate')
             AND relative_lift_pct < -0.5
             AND p_value < 0.10
        THEN 'GUARDRAIL_WARNING'
        ELSE 'PASS'
    END AS guardrail_status
FROM experiment_results
WHERE experiment_id = 'exp-video-preview-v1'
ORDER BY metric_name;

GUARDRAIL_VIOLATED should trigger automatic experiment pause + alert.

Common Experiment Data Engineering Pitfalls

Simpson’s paradox: better in every segment, worse overall → always segment results by major dimensions
Network effects: social features violate independence → cluster randomization
Novelty effect: week 1 spike then decay → run long enough to outlast novelty
SRM (Sample Ratio Mismatch): expected 50/50 but observed 51/49 → assignment/logging bug

SRM detection (chi-squared):

SELECT
    variant,
    COUNT(*) AS actual_count,
    SUM(COUNT(*)) OVER () * 0.5 AS expected_count,
    (COUNT(*) - SUM(COUNT(*)) OVER () * 0.5)^2 /
        (SUM(COUNT(*)) OVER () * 0.5) AS chi_squared_contribution
FROM experiment_assignments
WHERE experiment_id = 'exp-video-preview-v1'
GROUP BY variant;
-- If SUM(chi_squared_contribution) > 3.84 (p < 0.05): SRM detected → experiment invalid

Metric contamination: control users see treatment content → monitor cross-variant exposure rates

Interview Questions

Q1: “Design the data infrastructure for Meta’s experimentation platform, which runs 10,000 concurrent experiments across billions of users.”

Model Answer: “Five components: experiment registry (Postgres control plane), assignment service (deterministic hashing, stateless), assignment logging (Kafka → warehouse table), event logging (events carry experiment metadata when possible), metrics computation (Spark jobs + sequential testing + automated guardrails), and a self-serve results dashboard for p-values, CIs, lifts, and guardrail status.”

Q2: “Your experiment shows treatment increased 7-day retention by 2% (p=0.03). Stream start time increased by 8% (p=0.04). Do you ship?”

Model Answer: “No — guardrails exist to prevent shipping regressions in UX. Pause the experiment, investigate the root cause, fix performance, and rerun. A mature platform should auto-pause on guardrail violation.”

Think About This

You’re in a Netflix interview: test a new recommendation algorithm for New Releases. Design the A/B test data infrastructure.

Walk through:

Randomization unit (user_id to avoid contamination)
Success metrics and guardrails
Expected run time + sequential testing checkpoints
SRM check after 48 hours

Quick Reference

Stages: definition (registry) → assignment (hash) → logging (events with variant) → results (join + stats + guardrails)
Deterministic hashing: hash(user_id + experiment_id) % 100 (experiment_id avoids correlated assignment)
SRM: chi-squared test; if mismatch detected → experiment invalid
Sequential testing: alpha spending functions enable early stopping without false positives inflation
Guardrails: enforced by platform (auto-pause), not manual review
Pre-exposure balance checks and network effects handling are critical

Tomorrow’s Preview

Day 50: Self-Serve Data Platform Architecture — Data mesh principles, domain ownership, data-as-a-product, federated governance, and self-serve platforms.

Day 49: A/B Testing Data Infrastructure