Phase 2: Deep Dives | Category: ML Data Infrastructure
Why This Is a Senior DE Interview Topic (Not Just Data Science)
A/B testing is a data engineering problem as much as it is a statistics problem. The pipeline that collects experiment assignments, logs user interactions, joins them correctly, computes metrics, detects statistical significance, and surfaces results — this is built and owned by data engineers.
The data engineering challenges are subtle: one bad join, one incorrectly tracked variant, one assignment logged after the fact — and the entire experiment is invalid.
The Full A/B Testing Data Pipeline
┌───────────────────────────────────────────────────────────────────┐
│ EXPERIMENT DEFINITION │
│ • Hypothesis: "Adding video previews increases engagement" │
│ • Success metric: 7-day retention rate │
│ • Guardrail metrics: stream start time, error rate, revenue │
│ • Randomization unit: user_id │
│ • Traffic allocation: 50% control / 50% treatment │
│ • Duration: 14 days (power analysis → MDE 0.5% at 80% power) │
│ Stored in: Experiment Registry (Postgres) │
└──────────────────────────┬────────────────────────────────────────┘
↓
┌───────────────────────────────────────────────────────────────────┐
│ ASSIGNMENT SERVICE (real-time) │
│ • User arrives → deterministic hash(user_id + experiment_id) │
│ → bucket assignment (0-49: control, 50-99: treatment) │
│ • Assignment logged to Kafka topic "experiment_assignments" │
│ • Served variant feature flags to the application │
│ • Assignment is STICKY (same user always gets same variant) │
└──────────────────────────┬────────────────────────────────────────┘
↓ (alongside existing event collection)
┌───────────────────────────────────────────────────────────────────┐
│ EVENT LOGGING │
│ • All user events carry: experiment_id, variant_id │
│ OR events are later joined to assignment logs │
│ • Kafka topic: "user_events" (enriched with variant metadata) │
└──────────────────────────┬────────────────────────────────────────┘
↓ (daily Spark/dbt job)
┌───────────────────────────────────────────────────────────────────┐
│ METRICS COMPUTATION │
│ • Join assignments to events on user_id + date range │
│ • Compute per-user metrics (retention, session length, revenue) │
│ • Aggregate: mean, variance per variant │
│ • Statistical tests: t-test, z-test, bootstrap │
│ • Guardrail checks: is stream start time worse in treatment? │
│ Output: gold.experiment_metrics table │
└──────────────────────────┬────────────────────────────────────────┘
↓
┌───────────────────────────────────────────────────────────────────┐
│ RESULTS SURFACING │
│ • Experiment dashboard (Tableau/custom UI) │
│ • P-value, confidence interval, effect size per metric │
│ • Guardrail status: PASS / FAIL / WARNING │
│ • Statistical significance reached? If sequential testing: YES │
│ Decision: Ship / Roll back / Extend / Iterate │
└───────────────────────────────────────────────────────────────────┘
Experiment Assignment: Getting It Right
Deterministic hashing (the correct approach):
import hashlib
from typing import Dict, Tuple
def assign_variant(
user_id: str,
experiment_id: str,
buckets: Dict[str, Tuple[int, int]],
) -> str:
"""
Deterministic bucket assignment.
Same user_id + experiment_id → always same variant.
No state required — recompute anytime.
"""
# Hash combines user_id and experiment_id
# experiment_id prevents correlation across experiments
hash_input = f"{user_id}:{experiment_id}"
hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
bucket = hash_value % 100 # 0-99
for variant_name, (low, high) in buckets.items():
if low <= bucket < high:
return variant_name
return "control" # fallback
assignment = assign_variant(
user_id="user-12345",
experiment_id="exp-video-preview-v1",
buckets={"control": (0, 50), "treatment": (50, 100)},
)
Why include experiment_id in the hash? Without it, the same user_id always hashes to the same bucket across experiments, causing correlated assignments.
Assignment persistence choices:
| Approach | How it works | Trade-offs |
|---|---|---|
| Deterministic hash (recompute) | Recompute from user_id + exp_id every time | No state/DB lookup. Can’t change the hash function mid-experiment. |
| Assignment store (DynamoDB/Redis) | Write assignment on first exposure, read on subsequent | Supports overrides/QA and allocation changes. Requires infra. |
| Cookie / client-side | Store assignment in browser cookie | Lost if cookie cleared; can be manipulated. |
Netflix/Meta/Google often use deterministic hashing for production traffic (stateless) with an assignment store as a cache and as the logging source of truth.
Common assignment bugs to flag in interviews:
- Assignment logging lag: user assigned at 2 PM, assignment logged at 3 PM → events in between have no assignment to join to
- Post-treatment assignment: assignment recorded after events occurred (usually logging delay/backfill bug)
- Survivor bias: only users who return after assignment are logged → denominator is wrong
Experiment Data Model
-- Experiment registry (source of truth for experiment metadata)
CREATE TABLE experiments (
experiment_id STRING PRIMARY KEY,
name STRING,
hypothesis TEXT,
owner STRING,
status STRING, -- 'running', 'paused', 'completed', 'rolled_back'
start_date DATE,
end_date DATE,
randomization_unit STRING, -- 'user_id', 'session_id', 'device_id'
traffic_pct FLOAT, -- % of eligible users in experiment
variants JSONB -- {"control": 0.5, "treatment": 0.5}
);
-- Assignment table (grain: one row per user per experiment)
CREATE TABLE experiment_assignments (
user_id STRING,
experiment_id STRING,
variant STRING,
first_exposed_at TIMESTAMP,
assignment_date DATE,
PRIMARY KEY (user_id, experiment_id)
) PARTITION BY assignment_date;
-- Metric observations (grain: one row per user per experiment per metric per day)
CREATE TABLE experiment_metrics_daily (
experiment_id STRING,
variant STRING,
metric_date DATE,
user_id STRING,
sessions INT,
stream_minutes FLOAT,
revenue FLOAT,
is_retained_7d BOOLEAN
) PARTITION BY metric_date;
-- Aggregated results (grain: one row per experiment × variant × metric)
CREATE TABLE experiment_results (
experiment_id STRING,
variant STRING,
metric_name STRING,
computed_at TIMESTAMP,
control_mean FLOAT,
treatment_mean FLOAT,
relative_lift_pct FLOAT,
p_value FLOAT,
confidence_interval_lower FLOAT,
confidence_interval_upper FLOAT,
is_significant BOOLEAN,
sample_size_control INT,
sample_size_treatment INT
);
Metrics Computation: The SQL at Scale
WITH exposed_users AS (
SELECT
ea.user_id,
ea.experiment_id,
ea.variant,
ea.first_exposed_at
FROM experiment_assignments ea
WHERE ea.experiment_id = 'exp-video-preview-v1'
AND ea.first_exposed_at BETWEEN '2026-04-01' AND '2026-04-14'
),
user_metrics AS (
SELECT
eu.user_id,
eu.variant,
MAX(CASE WHEN ue.event_timestamp
BETWEEN eu.first_exposed_at
AND eu.first_exposed_at + INTERVAL '7 days'
AND ue.event_type = 'session_start'
THEN 1 ELSE 0 END) AS is_retained_7d,
SUM(CASE WHEN ue.event_timestamp > eu.first_exposed_at
THEN ue.stream_duration_min ELSE 0 END) AS stream_minutes_7d,
SUM(CASE WHEN ue.event_timestamp > eu.first_exposed_at
THEN ue.revenue ELSE 0 END) AS revenue_7d
FROM exposed_users eu
LEFT JOIN user_events ue ON eu.user_id = ue.user_id
GROUP BY eu.user_id, eu.variant
),
variant_stats AS (
SELECT
variant,
COUNT(*) AS sample_size,
AVG(is_retained_7d) AS retention_rate,
STDDEV(is_retained_7d) AS retention_stddev,
AVG(stream_minutes_7d) AS avg_stream_minutes,
AVG(revenue_7d) AS avg_revenue
FROM user_metrics
GROUP BY variant
)
SELECT * FROM variant_stats;
Statistical Significance: The Data Engineering Layer
Two-sample z-test for proportions (binary metrics like retention):
from scipy import stats
import numpy as np
def compute_ab_test_results(
control_successes: int, control_n: int,
treatment_successes: int, treatment_n: int
) -> dict:
"""
Two-sample z-test for difference in proportions.
Used for binary metrics (conversion, retention).
"""
p_control = control_successes / control_n
p_treatment = treatment_successes / treatment_n
p_pooled = (control_successes + treatment_successes) / (control_n + treatment_n)
se = np.sqrt(p_pooled * (1 - p_pooled) * (1/control_n + 1/treatment_n))
z_score = (p_treatment - p_control) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))
ci_lower = (p_treatment - p_control) - 1.96 * se
ci_upper = (p_treatment - p_control) + 1.96 * se
relative_lift = (p_treatment - p_control) / p_control * 100
return {
"control_rate": p_control,
"treatment_rate": p_treatment,
"relative_lift_pct": relative_lift,
"z_score": z_score,
"p_value": p_value,
"ci_lower": ci_lower,
"ci_upper": ci_upper,
"is_significant": p_value < 0.05,
}
T-test for continuous metrics (e.g., stream minutes):
from scipy.stats import ttest_ind
control_values = df[df.variant == 'control']['stream_minutes'].values
treatment_values = df[df.variant == 'treatment']['stream_minutes'].values
t_stat, p_value = ttest_ind(control_values, treatment_values, equal_var=False)
Sequential Testing: The Netflix/Meta Approach
Fixed-horizon tests require “no peeking.” Sequential testing allows valid early stopping with adjusted thresholds.
checkpoints = {
7: {"alpha_threshold": 0.005},
10: {"alpha_threshold": 0.014},
14: {"alpha_threshold": 0.043},
}
for checkpoint_day, thresholds in checkpoints.items():
results = compute_ab_test_results(...)
if results["p_value"] < thresholds["alpha_threshold"]:
trigger_early_stop(experiment_id, checkpoint_day)
break
Guardrail Metrics: The Safety Net
Guardrail metrics must NOT get significantly worse, even if the primary metric improves.
SELECT
experiment_id,
variant,
metric_name,
p_value,
relative_lift_pct,
CASE
WHEN metric_name IN ('stream_start_time_ms', 'error_rate', 'cancellation_rate')
AND relative_lift_pct < -0.5
AND p_value < 0.05
THEN 'GUARDRAIL_VIOLATED'
WHEN metric_name IN ('stream_start_time_ms', 'error_rate', 'cancellation_rate')
AND relative_lift_pct < -0.5
AND p_value < 0.10
THEN 'GUARDRAIL_WARNING'
ELSE 'PASS'
END AS guardrail_status
FROM experiment_results
WHERE experiment_id = 'exp-video-preview-v1'
ORDER BY metric_name;
GUARDRAIL_VIOLATED should trigger automatic experiment pause + alert.
Common Experiment Data Engineering Pitfalls
- Simpson’s paradox: better in every segment, worse overall → always segment results by major dimensions
- Network effects: social features violate independence → cluster randomization
- Novelty effect: week 1 spike then decay → run long enough to outlast novelty
- SRM (Sample Ratio Mismatch): expected 50/50 but observed 51/49 → assignment/logging bug
SRM detection (chi-squared):
SELECT
variant,
COUNT(*) AS actual_count,
SUM(COUNT(*)) OVER () * 0.5 AS expected_count,
(COUNT(*) - SUM(COUNT(*)) OVER () * 0.5)^2 /
(SUM(COUNT(*)) OVER () * 0.5) AS chi_squared_contribution
FROM experiment_assignments
WHERE experiment_id = 'exp-video-preview-v1'
GROUP BY variant;
-- If SUM(chi_squared_contribution) > 3.84 (p < 0.05): SRM detected → experiment invalid
- Metric contamination: control users see treatment content → monitor cross-variant exposure rates
Interview Questions
Q1: “Design the data infrastructure for Meta’s experimentation platform, which runs 10,000 concurrent experiments across billions of users.”
Model Answer: “Five components: experiment registry (Postgres control plane), assignment service (deterministic hashing, stateless), assignment logging (Kafka → warehouse table), event logging (events carry experiment metadata when possible), metrics computation (Spark jobs + sequential testing + automated guardrails), and a self-serve results dashboard for p-values, CIs, lifts, and guardrail status.”
Q2: “Your experiment shows treatment increased 7-day retention by 2% (p=0.03). Stream start time increased by 8% (p=0.04). Do you ship?”
Model Answer: “No — guardrails exist to prevent shipping regressions in UX. Pause the experiment, investigate the root cause, fix performance, and rerun. A mature platform should auto-pause on guardrail violation.”
Think About This
You’re in a Netflix interview: test a new recommendation algorithm for New Releases. Design the A/B test data infrastructure.
Walk through:
- Randomization unit (
user_idto avoid contamination) - Success metrics and guardrails
- Expected run time + sequential testing checkpoints
- SRM check after 48 hours
Quick Reference
- Stages: definition (registry) → assignment (hash) → logging (events with variant) → results (join + stats + guardrails)
- Deterministic hashing:
hash(user_id + experiment_id) % 100(experiment_id avoids correlated assignment) - SRM: chi-squared test; if mismatch detected → experiment invalid
- Sequential testing: alpha spending functions enable early stopping without false positives inflation
- Guardrails: enforced by platform (auto-pause), not manual review
- Pre-exposure balance checks and network effects handling are critical
Tomorrow’s Preview
Day 50: Self-Serve Data Platform Architecture — Data mesh principles, domain ownership, data-as-a-product, federated governance, and self-serve platforms.