Phase 2: Deep Dives | Category: ML Data Infrastructure
The DE’s Role in ML: Infrastructure, Not Models
Senior data engineers (DEs) at AI-heavy companies (OpenAI, Anthropic, Meta, Netflix, Google) are expected to own the data infrastructure that ML teams depend on. Per DataInterview.com: “ML knowledge is rated medium, yet candidates over-index on it constantly.”
The interview isn’t asking you to build neural networks — it’s asking whether you can build the reliable, versioned, reproducible data pipelines that make ML possible at scale. The DE’s job in ML: data prep, versioning, training pipeline infrastructure, and the handoff to serving.
The Full ML Pipeline Lifecycle
Raw Data (warehouse, lake, streams)
↓
[1] DATA PREPARATION PIPELINE
• Feature engineering (Day 46 topics)
• Training dataset construction (point-in-time joins)
• Data versioning (DVC / lakeFS)
• Quality gates (schema validation, distribution checks)
↓
Training Dataset (versioned, immutable)
↓
[2] EXPERIMENT TRACKING
• Hyperparameter logging
• Metric tracking (loss, accuracy, F1)
• Artifact storage (model weights, evaluation plots)
• Tools: MLflow, Weights & Biases, Neptune.ai
↓
Trained Model Artifact
↓
[3] MODEL REGISTRY
• Version control for models (v1.0, v1.1, v2.0)
• Promotion workflow: dev → staging → production
• Lineage: model → training dataset → code version
• Rollback capability
↓
Production Model
↓
[4] SERVING / INFERENCE
• Online (real-time API): p99 < 100ms, feature store integration
• Batch (scheduled scoring): millions of records, cost-efficient
↓
[5] MONITORING & RETRAINING TRIGGERS
• Data drift detection
• Model performance monitoring
• Automated retraining pipelines
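Step [1] mentions point-in-time joins for training dataset construction. As a minimal sketch of the idea (the feature history, timestamps, and the helper name `feature_as_of` are all hypothetical), each label row looks up the latest feature value that already existed at the label's own timestamp, so no future information leaks into training:

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history for one entity: (effective_time, value), sorted by time.
feature_history = [
    (datetime(2026, 4, 1), 3),
    (datetime(2026, 4, 3), 5),
    (datetime(2026, 4, 6), 9),
]


def feature_as_of(history, event_time):
    """Return the latest feature value with effective_time <= event_time, else None."""
    times = [t for t, _ in history]
    idx = bisect_right(times, event_time)
    return history[idx - 1][1] if idx > 0 else None


# Each label event may only see features that existed at its own timestamp.
label_events = [datetime(2026, 3, 31), datetime(2026, 4, 3), datetime(2026, 4, 5)]
training_rows = [(ts, feature_as_of(feature_history, ts)) for ts in label_events]
```

The first event predates every feature value and gets `None`; the event on April 5 sees the April 3 value, not the April 6 one. At warehouse scale the same rule is usually expressed as an as-of join rather than a per-row lookup.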
The Two Pipeline Types
Training Pipeline: Throughput-Optimized
Training dataset (Parquet/Iceberg on S3)
↓
Data loader (batched reads, shuffling, preprocessing)
↓
Distributed training (multiple GPUs/TPUs)
↓
Checkpointing (save model state periodically)
↓
Evaluation (validation set)
↓
Model artifact (weights + config + preprocessing code)
↓
Model registry
DE-owned components:
- Training dataset construction and versioning
- Efficient data loading (minimize GPU idle time)
- Storage for model artifacts
- Training pipeline orchestration (Airflow/Metaflow)
- Evaluation dataset management
The training data bottleneck: if your 500-GPU training job is waiting for data, you are paying for idle GPUs. DEs own this problem:
- Pre-shard data to match training parallelism
- Use columnar formats with efficient column-subset reads
- Pre-tokenize/pre-process (don’t do it in the training loop)
- Use streaming datasets for data that is too large to stage on each training node’s local storage
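The first bullet, pre-sharding to match training parallelism, can be sketched as a simple assignment of shard files to worker ranks. This is an illustrative sketch only: the bucket path, shard naming, and the function `assign_shards` are assumptions, not part of any specific training framework.

```python
def assign_shards(shard_paths, world_size):
    """Round-robin shards across workers so each rank reads a disjoint subset."""
    if len(shard_paths) < world_size:
        raise ValueError("need at least one shard per worker")
    return {rank: shard_paths[rank::world_size] for rank in range(world_size)}


# Hypothetical shard layout: 16 Parquet shards feeding 4 data-parallel workers.
shards = [f"s3://example-bucket/train/shard-{i:05d}.parquet" for i in range(16)]
plan = assign_shards(shards, world_size=4)
```

Choosing a shard count that is a multiple of the worker count keeps the per-rank load even, which is the point of matching the sharding to the parallelism.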
Inference Pipeline: Latency-Optimized
Two types:
ONLINE (real-time):
API request → feature retrieval (feature store) → model inference → response
Latency: p99 < 100ms total (p99 < 10ms features, p99 < 40ms inference, remainder overhead)
Scale: thousands of requests/sec
Tools: TorchServe, TensorFlow Serving, Triton Inference Server, vLLM (LLMs)
BATCH (scheduled):
Input dataset → parallel model inference → output dataset
Latency: hours (not a concern)
Scale: millions of records
Tools: Spark + ONNX, Spark UDF wrapping the model, SageMaker Batch Transform
DE-owned components in inference:
- Batch inference pipeline orchestration (Airflow)
- Input data preparation for batch scoring
- Output data storage and downstream integration
- Feature retrieval optimization in the online path
Data Versioning: The Foundation of Reproducibility
An ML experiment is only reproducible if you can recreate the exact same training data, code, and environment used for a given training run.
DVC (Data Version Control)
DVC extends Git to version large files (datasets, model weights) that Git can’t handle efficiently.
# Initialize DVC alongside Git
git init && dvc init
# Add a dataset to DVC tracking
dvc add data/training_features.parquet
# DVC records the file's content hash in data/training_features.parquet.dvc;
# dvc push uploads the actual file to the configured remote (S3/GCS)
dvc push
git add data/training_features.parquet.dvc data/.gitignore
git commit -m "Add training features v1"
# Alternative: define a versioned pipeline.
# A path is tracked either by `dvc add` or as a stage output, not both.
# dvc.yaml:
stages:
  prepare_features:
    cmd: python src/prepare_features.py --date 2026-04-05
    deps:
      - src/prepare_features.py
      - data/raw_events/
    outs:
      - data/training_features.parquet
  train_model:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/training_features.parquet
    params:
      - params.yaml:
          - model.learning_rate
          - model.batch_size
    outs:
      - models/trained_model.pkl
# Run the pipeline (only re-runs stages whose dependencies changed)
dvc repro
# To reproduce this exact experiment 6 months later:
git checkout experiment-v2.3
dvc pull   # retrieves the exact dataset version from S3
dvc repro  # re-runs any stage whose outputs are missing or out of date
DVC tracks:
- Dataset files (content-hashed, stored in S3/GCS)
- Pipeline stage definitions (code + input → output dependencies)
- Parameters (hyperparameters)
- Metrics (model evaluation scores)
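"Content-hashed" in the list above means a file's identity is derived from its bytes, not from its name or modification time. The sketch below illustrates that idea only and is not DVC's implementation: it uses SHA-256 for the illustration (DVC's `.dvc` files record an MD5), and `content_hash` and `cache_key` are hypothetical helper names.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(path, chunk_size=1 << 20):
    """Hash a file's bytes in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_key(file_hash):
    """Content-addressed layout: first two hex chars as directory, rest as file name."""
    return f"{file_hash[:2]}/{file_hash[2:]}"


# Two files with identical bytes get the same hash, so they are stored once.
with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "a.parquet"
    b = Path(tmp) / "b.parquet"
    a.write_bytes(b"same bytes")
    b.write_bytes(b"same bytes")
    hash_a, hash_b = content_hash(a), content_hash(b)
```

Because the key depends only on content, an unchanged dataset is never re-uploaded, and any change to the bytes produces a new, distinct version.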
lakeFS: Git for Data Lakes
DVC versions individual files and directories alongside a Git repository. lakeFS versions entire data lake namespaces by creating Git-like branches over S3/GCS/ADLS storage.
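from lakefs_client import models
from lakefs_client.client import LakeFSClient

client = LakeFSClient(...)  # configuration elided
repo = "training-data"

# Create a branch for this experiment
client.branches.create_branch(
    repository=repo,
    branch_creation=models.BranchCreation(
        name="experiment-dedup-v2",
        source="main",
    ),
)

# Run deduplication pipeline on the branch (doesn't affect main).
# run_deduplication_pipeline is your own pipeline entry point.
run_deduplication_pipeline(
    input="lakefs://training-data/main/raw/",
    output="lakefs://training-data/experiment-dedup-v2/features/",
)

# If it improves metrics → merge to main (changes must be committed on the branch first)
# If it doesn't → abandon the branch
if metrics_improved:  # placeholder flag set by your evaluation step
    client.refs.merge_into_branch(
        repository=repo,
        source_ref="experiment-dedup-v2",
        destination_branch="main",
    )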
lakeFS is the right tool when multiple experiments run simultaneously on the same data lake and may modify the data (dedup/filter/transform). Without branching, concurrent experiments can overwrite or corrupt each other’s data.
Experiment Tracking: MLflow
MLflow is a widely used open-source tool for tracking ML experiments. The DE’s job: ensure the training pipeline integrates with MLflow and that experiment artifacts are stored reliably.
What MLflow tracks:
import mlflow

# model, training_df, train_one_epoch, and evaluate are defined by the training code.
# Start an experiment run
with mlflow.start_run(run_name="sft-v2.3-lr-1e-4"):
    # Log parameters
    mlflow.log_params({
        "learning_rate": 1e-4,
        "batch_size": 32,
        "epochs": 10,
        "training_dataset_version": "v2026-04-05-abc123",  # DE-provided
        "model_architecture": "gpt2-medium",
    })
    # Log metrics during training
    for epoch in range(10):
        train_loss = train_one_epoch()
        val_loss = evaluate()
        mlflow.log_metrics({
            "train_loss": train_loss,
            "val_loss": val_loss,
        }, step=epoch)
    # Log the trained model artifact
    mlflow.pytorch.log_model(model, "model")
    # Log the evaluation results
    mlflow.log_metrics({
        "accuracy": 0.847,
        "f1_score": 0.832,
        "perplexity": 12.4,
    })
    # DE-owned: log dataset info for reproducibility
    # (from_spark takes the location as path=, not source=)
    mlflow.log_input(
        mlflow.data.from_spark(training_df, path="s3://data/features/v20260405/"),
        context="training",
    )
The MLflow experiment database: MLflow stores metadata (parameters, metrics, run IDs) in a SQL database (SQLite locally, PostgreSQL or similar in production) and artifacts (model weights, plots) in object storage (S3/GCS). The DE owns the infrastructure: database HA, storage lifecycle, backups.
Model Registry: The Production Gateway
The model registry is the workflow control point between experimentation and production. It must store lineage and enable rollbacks.
MLflow Model Registry workflow:
import mlflow

# Register a trained model (run_id comes from the training run)
model_version = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name="fraud-detection-v2",
    tags={
        "training_dataset": "v2026-04-05-abc123",
        "code_commit": "git_hash_abc123",
        "trained_by": "data_pipeline_team",
    },
)
# Promotion workflow: Staging → Production
# Note: stage transitions are deprecated as of MLflow 2.9 in favor of model version aliases.
client = mlflow.MlflowClient()
# Promote to staging for evaluation
client.transition_model_version_stage(
    name="fraud-detection-v2",
    version=model_version.version,
    stage="Staging",
)
# After passing evaluation gates → promote to production
client.transition_model_version_stage(
    name="fraud-detection-v2",
    version=model_version.version,
    stage="Production",
    # Moves the old Production version to "Archived"; the default (False) leaves it in place
    archive_existing_versions=True,
)
What the model registry should store:
- Model artifact (weights, config, preprocessing code)
- Training metadata (dataset version, hyperparameters, code commit hash)
- Evaluation metrics (accuracy, F1, business metrics)
- Training timestamp and duration
- Promotion history (who approved it, when)
- Downstream dependencies (which inference services use this model)
Golden rule: every production model must be traceable to the exact training dataset version and code commit. If a model misbehaves in production, you should be able to answer “what data was this trained on?” within minutes.
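One way to make the golden rule concrete is to treat lineage as an immutable record keyed by model name and version. The sketch below is a hypothetical in-memory stand-in for registry tags (the class `ModelLineage`, the `LINEAGE` dict, the version number 7, and the run ID are all made up for illustration); a real registry would persist the same fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelLineage:
    model_name: str
    model_version: int
    dataset_version: str
    code_commit: str
    mlflow_run_id: str


# Hypothetical in-memory store; a real registry would persist these records.
LINEAGE = {}


def register(lineage):
    """Record lineage once; refuse to overwrite an existing record."""
    key = (lineage.model_name, lineage.model_version)
    if key in LINEAGE:
        raise ValueError(f"lineage for {key} is already recorded and is immutable")
    LINEAGE[key] = lineage


def trained_on(model_name, model_version):
    """Answer 'what data was this trained on?' with a single lookup."""
    return LINEAGE[(model_name, model_version)].dataset_version


register(ModelLineage("fraud-detection-v2", 7, "v2026-04-05-abc123", "git_hash_abc123", "run-0001"))
```

The write-once check matters as much as the lookup: if lineage can be edited after the fact, the "within minutes" answer stops being trustworthy.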
Continuous Training: Automated Retraining Pipelines
A model deployed to production degrades over time as the data distribution shifts. DE’s job: build the retraining pipeline that keeps models fresh.
Retraining triggers:
- Input drift: distribution of input features shifts
- Performance drift: model metrics drop below threshold (needs labels or proxies)
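Input drift is commonly quantified with the Population Stability Index (PSI), which compares the binned distribution of a feature today against its training baseline. The following is a minimal pure-Python sketch, assuming equal-width bins derived from the baseline and a small epsilon for empty bins; the bin count, epsilon, and sample data are illustrative choices, not fixed conventions.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a baseline sample and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if the baseline is constant

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp so values outside the baseline range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        total = len(values)
        # eps avoids log(0) and division by zero for empty bins.
        return [max(c / total, eps) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(1000)]      # roughly uniform on [0, 10)
shifted = [3 + i / 100 for i in range(1000)]   # same shape, shifted right by 3

drift_detected = psi(baseline, shifted) > 0.2  # 0.2 is the rule-of-thumb threshold
```

Comparing a sample with itself yields a PSI of zero, and the shifted sample lands well above the 0.2 threshold. In practice the score is computed per feature and the retraining check looks at the worst offenders.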
Automated retraining pipeline:
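from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator, PythonOperator
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# Import paths assume Airflow 2.4+ with a recent cncf.kubernetes provider.
# compute_model_metrics, compute_psi_score, drift_or_degradation, build_training_dataset,
# compare_model_versions, and update_model_registry are project callables defined elsewhere.
with DAG(
    "fraud_model_retraining",
    schedule="0 6 * * *",
    start_date=datetime(2026, 1, 1),  # placeholder start date
    catchup=False,
) as dag:
    check_performance = PythonOperator(
        task_id="check_model_performance",
        python_callable=compute_model_metrics,
    )
    check_data_drift = PythonOperator(
        task_id="check_data_drift",
        python_callable=compute_psi_score,
        # PSI compares feature distributions today vs training baseline.
        # PSI > 0.2 = significant drift
    )
    should_retrain = BranchPythonOperator(
        task_id="should_retrain",
        # A branch callable must return the task_id of the path to follow
        python_callable=lambda: "prepare_training_data" if drift_or_degradation() else "skip_retraining",
    )
    skip_retraining = EmptyOperator(task_id="skip_retraining")
    prepare_training_data = PythonOperator(
        task_id="prepare_training_data",
        python_callable=build_training_dataset,
        # Point-in-time correct feature construction
    )
    train_model = KubernetesPodOperator(
        task_id="train_model",
        name="train-model",
        image="ml-training:latest",
        # GPU pod for training; logs to MLflow
    )
    evaluate_new_model = PythonOperator(
        task_id="evaluate_new_model",
        python_callable=compare_model_versions,
        # Shadow test gate: new model must beat production by > 1%.
        # Raise on failure so the promotion task does not run.
    )
    promote_model = PythonOperator(
        task_id="promote_model_to_production",
        python_callable=update_model_registry,
        # Only runs if the evaluation gate passes
    )
    check_performance >> check_data_drift >> should_retrain
    should_retrain >> [prepare_training_data, skip_retraining]
    prepare_training_data >> train_model >> evaluate_new_model >> promote_model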
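The two decision points in the DAG, the retrain branch and the evaluation gate, reduce to small pure functions that are easy to unit test outside Airflow. This is a sketch under stated assumptions: the function names, the `skip_retraining` task ID, and the reading of "beat production by > 1%" as a relative gain are illustrative choices. The returned strings are assumed to match task IDs in the DAG, since a branch callable selects its downstream path by task ID.

```python
PSI_THRESHOLD = 0.2        # drift threshold quoted in the DAG comments
MIN_RELATIVE_GAIN = 0.01   # "beat production by > 1%", read here as relative gain


def choose_branch(max_feature_psi, current_metric, metric_floor):
    """Return the task_id the branch operator should follow."""
    drifted = max_feature_psi > PSI_THRESHOLD
    degraded = current_metric < metric_floor
    return "prepare_training_data" if (drifted or degraded) else "skip_retraining"


def passes_evaluation_gate(candidate_metric, production_metric):
    """True only if the candidate beats production by more than the required margin."""
    return candidate_metric > production_metric * (1 + MIN_RELATIVE_GAIN)
```

Keeping the thresholds as named constants makes the gate auditable: a reviewer can see exactly what "drift" and "better" meant for any given retraining run.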
The DE-ML Interface: What You Own vs What ML Owns
| Responsibility | DE owns | ML engineer / data scientist owns |
|---|---|---|
| Raw data ingestion and storage | ✅ | ❌ |
| Feature engineering pipelines | ✅ (infra + batch) | ✅ (logic definition) |
| Training dataset construction | ✅ | Defines requirements |
| Data versioning (DVC/lakeFS) | ✅ | Consumes |
| Experiment tracking infra (MLflow) | ✅ | Logs experiments |
| Model registry infra | ✅ | Promotes models |
| Training pipeline orchestration | ✅ | Triggers / monitors |
| Batch inference pipeline | ✅ | Defines scoring logic |
| Online feature serving infra | ✅ | Defines features |
| Model architecture, hyperparameters | ❌ | ✅ |
| Model training code | Infra only | ✅ (training logic) |
| Model evaluation logic | ❌ | ✅ |
| Online inference server (TorchServe, vLLM) | ❌ | ✅ (with ML Eng) |
Interview Questions
Q1: “A data scientist says their model performance dropped in production but they trained it last week and it looked great. How do you diagnose this as a DE?”
Model Answer: “This is usually one of four things: data drift, training-serving skew, an infrastructure bug, or label drift. I’d investigate all four in parallel. First, data drift: check whether input feature distributions shifted between training and now (PSI or KL divergence; PSI > 0.2 is a common threshold). Second, training-serving skew: compare feature values served in production against training dataset distributions; if they diverge, there is likely a mismatch between offline and online feature computation. Third, infrastructure: confirm the deployed model matches the evaluated registry version (hash/version mismatches happen). Fourth, label drift: if the ground-truth definition changed, your evaluation signal changes with it. The fix depends on the root cause: drift → retrain; skew → unify feature logic; infrastructure → redeploy; label drift → update labeling and retrain.”
Q2: “How would you design a versioned training data pipeline that ensures any ML experiment from the past 2 years can be reproduced exactly?”
Model Answer: “Four components. First, immutable raw storage: append-only raw events in object storage (Iceberg on S3) with durable retention. Second, versioned training datasets: training pipelines produce datasets tagged with a version hash that encodes inputs (raw partitions, feature code commit, params). Store write-once datasets under s3://ml-data/datasets/{version_hash}/. Third, DVC or lakeFS integration: track the dataset version hash alongside the code commit and MLflow run ID, creating full lineage (run → model → dataset → raw partitions + code). Fourth, pipeline determinism/idempotency: no wall-clock timestamps in outputs, deterministic ordering and seeds so reruns produce identical outputs. Repro = git checkout <commit> + pull dataset version + run training.”
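The "version hash that encodes inputs" from the answer above can be sketched as a hash over a canonical manifest of everything that defines the dataset. This is one possible scheme, not a standard: the function name, the manifest fields, the partition naming, and the 16-character truncation are assumptions made for illustration.

```python
import hashlib
import json


def dataset_version_hash(raw_partitions, feature_code_commit, params):
    """Derive a deterministic version id from everything that defines the dataset."""
    manifest = {
        "raw_partitions": sorted(raw_partitions),  # input order must not change the hash
        "feature_code_commit": feature_code_commit,
        "params": params,
    }
    # Canonical JSON: sorted keys and fixed separators give byte-identical output.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


v1 = dataset_version_hash(["dt=2026-04-04", "dt=2026-04-05"], "abc123", {"dedup": True})
v2 = dataset_version_hash(["dt=2026-04-05", "dt=2026-04-04"], "abc123", {"dedup": True})
v3 = dataset_version_hash(["dt=2026-04-04", "dt=2026-04-05"], "def456", {"dedup": True})

# Write-once output location keyed by the version hash.
output_prefix = f"s3://ml-data/datasets/{v1}/"
```

Here `v1` and `v2` are equal because only the listing order differs, while `v3` differs because the feature code changed. The same inputs always map to the same prefix, so a rerun can detect that the dataset already exists instead of rebuilding it.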
Think About This
You’re in an OpenAI interview. The prompt: “ChatGPT is fine-tuned every 2 weeks on new conversation data. Design the end-to-end ML pipeline that makes this reliable and reproducible.”
Walk through:
- What data goes into each fine-tuning run?
- How do you version the training data?
- What validation gates must pass before the model ships?
- What happens if the model fails evaluation?
- How is rollback triggered?
Quick Reference
- DE owns: data prep, versioning, training pipeline infra, batch inference, feature store infra, MLflow/registry infra
- ML owns: model architecture, training logic, evaluation criteria
- Training pipeline: data prep → versioned dataset → GPU training → MLflow → model registry → eval gates → prod
- DVC: version datasets/pipelines/params/metrics; reproducibility via git checkout + dvc pull
- lakeFS: branch the data lake namespace per experiment; merge on success, abandon on failure
- MLflow: run metadata + artifacts; model registry manages staging/production promotion lifecycle
- Continuous training triggers: PSI > 0.2 (drift) or model metric < threshold; automated retrain DAG
- Reproducibility requirement: immutable raw + versioned datasets + code commit + deterministic pipelines
Tomorrow’s Preview
Day 48: Embedding & Vector Data Pipelines — Embedding generation at scale, vector DBs (Pinecone, Weaviate, pgvector), ANN indexing (HNSW, IVF), RAG pipelines, and why vector infra is critical in OpenAI/Anthropic interviews.