Phase 2: Deep Dives | Category: ML Data Infrastructure

The DE’s Role in ML: Infrastructure, Not Models

Senior DEs at AI-heavy companies (OpenAI, Anthropic, Meta, Netflix, Google) are expected to own the data infrastructure that ML teams depend on. Per DataInterview.com: “ML knowledge is rated medium, yet candidates over-index on it constantly.”

The interview isn’t asking you to build neural networks — it’s asking whether you can build the reliable, versioned, reproducible data pipelines that make ML possible at scale. The DE’s job in ML: data prep, versioning, training pipeline infrastructure, and the handoff to serving.

The Full ML Pipeline Lifecycle

Raw Data (warehouse, lake, streams)

[1] DATA PREPARATION PIPELINE
  • Feature engineering (Day 46 topics)
  • Training dataset construction (point-in-time joins)
  • Data versioning (DVC / lakeFS)
  • Quality gates (schema validation, distribution checks)

Training Dataset (versioned, immutable)

[2] EXPERIMENT TRACKING
  • Hyperparameter logging
  • Metric tracking (loss, accuracy, F1)
  • Artifact storage (model weights, evaluation plots)
  • Tools: MLflow, Weights & Biases, Neptune.ai

Trained Model Artifact

[3] MODEL REGISTRY
  • Version control for models (v1.0, v1.1, v2.0)
  • Promotion workflow: dev → staging → production
  • Lineage: model → training dataset → code version
  • Rollback capability

Production Model

[4] SERVING / INFERENCE
  • Online (real-time API): p99 < 100ms, feature store integration
  • Batch (scheduled scoring): millions of records, cost-efficient

[5] MONITORING & RETRAINING TRIGGERS
  • Data drift detection
  • Model performance monitoring
  • Automated retraining pipelines

The Two Pipeline Types

Training Pipeline: Throughput-Optimized

Training dataset (Parquet/Iceberg on S3)

Data loader (batched reads, shuffling, preprocessing)

Distributed training (multiple GPUs/TPUs)

Checkpointing (save model state periodically)

Evaluation (validation set)

Model artifact (weights + config + preprocessing code)

Model registry

DE-owned components:

  • Training dataset construction and versioning
  • Efficient data loading (minimize GPU idle time)
  • Storage for model artifacts
  • Training pipeline orchestration (Airflow/Metaflow)
  • Evaluation dataset management

The training data bottleneck: if your 500-GPU training job is waiting for data, you’re wasting huge GPU spend. DEs own this problem:

  • Pre-shard data to match training parallelism
  • Use columnar formats with efficient column-subset reads
  • Pre-tokenize/pre-process (don’t do it in the training loop)
  • Use streaming datasets for data that doesn’t fit in storage budget

Inference Pipeline: Latency-Optimized

Two types:

ONLINE (real-time):
  API request → feature retrieval (feature store) → model inference → response
  Latency: p99 < 100ms total (p99 < 10ms features, p99 < 40ms inference, remainder overhead)
  Scale: thousands of requests/sec
  Tools: TorchServe, TensorFlow Serving, Triton Inference Server, vLLM (LLMs)

BATCH (scheduled):
  Input dataset → parallel model inference → output dataset
  Latency: hours (not a concern)
  Scale: millions of records
  Tools: Spark + ONNX, Spark UDF wrapping the model, SageMaker Batch Transform

DE-owned components in inference:

  • Batch inference pipeline orchestration (Airflow)
  • Input data preparation for batch scoring
  • Output data storage and downstream integration
  • Feature retrieval optimization in the online path

Data Versioning: The Foundation of Reproducibility

An ML experiment is only reproducible if you can recreate the exact same training data, code, and environment used for a given training run.

DVC (Data Version Control)

DVC extends Git to version large files (datasets, model weights) that Git can’t handle efficiently.

# Initialize DVC alongside Git
git init && dvc init

# Add a dataset to DVC tracking
dvc add data/training_features.parquet
# DVC stores the file hash in data/training_features.parquet.dvc
# and pushes the actual file to remote storage (S3/GCS)
git add data/training_features.parquet.dvc .gitignore
git commit -m "Add training features v1"

# Define a versioned pipeline
# dvc.yaml:
stages:
  prepare_features:
    cmd: python src/prepare_features.py --date 2026-04-05
    deps:
      - src/prepare_features.py
      - data/raw_events/
    outs:
      - data/training_features.parquet

  train_model:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/training_features.parquet
    params:
      - params.yaml:
          - model.learning_rate
          - model.batch_size
    outs:
      - models/trained_model.pkl

# Run the pipeline (only re-runs stages with changed deps)
dvc repro

# To reproduce this exact experiment 6 months later:
git checkout experiment-v2.3
dvc pull  # retrieves exact dataset version from S3
dvc repro # runs exact same pipeline

DVC tracks:

  • Dataset files (content-hashed, stored in S3/GCS)
  • Pipeline stage definitions (code + input → output dependencies)
  • Parameters (hyperparameters)
  • Metrics (model evaluation scores)

lakeFS: Git for Data Lakes

DVC handles individual files. lakeFS handles entire data lake namespaces — creating Git-like branches of S3/GCS/ADLS storage.

import lakefs_client

client = lakefs_client.LakeFSClient(...)
repo = "training-data"

# Create a branch for this experiment
client.branches.create_branch(
    repo,
    BranchCreation(
        name="experiment-dedup-v2",
        source="main"
    )
)

# Run deduplication pipeline on the branch (doesn't affect main)
run_deduplication_pipeline(
    input="lakefs://training-data/main/raw/",
    output="lakefs://training-data/experiment-dedup-v2/features/"
)

# If it improves metrics → merge to main
# If it doesn't → abandon branch
client.refs.merge_into_branch(
    repo,
    source_ref="experiment-dedup-v2",
    destination_branch="main"
)

lakeFS is the right tool when multiple experiments run simultaneously on the same data lake and may modify the data (dedup/filter/transform). Without branching, experiments corrupt each other’s data.

Experiment Tracking: MLflow

MLflow is a standard for tracking ML experiments. DE’s job: ensure the training pipeline integrates with MLflow and that experiment artifacts are stored reliably.

What MLflow tracks:

import mlflow

# Start an experiment run
with mlflow.start_run(run_name="sft-v2.3-lr-1e-4"):

    # Log parameters
    mlflow.log_params({
        "learning_rate": 1e-4,
        "batch_size": 32,
        "epochs": 10,
        "training_dataset_version": "v2026-04-05-abc123",  # DE-provided
        "model_architecture": "gpt2-medium"
    })

    # Log metrics during training
    for epoch in range(10):
        train_loss = train_one_epoch()
        val_loss = evaluate()

        mlflow.log_metrics({
            "train_loss": train_loss,
            "val_loss": val_loss
        }, step=epoch)

    # Log the trained model artifact
    mlflow.pytorch.log_model(model, "model")

    # Log the evaluation results
    mlflow.log_metrics({
        "accuracy": 0.847,
        "f1_score": 0.832,
        "perplexity": 12.4
    })

    # DE-owned: log dataset info for reproducibility
    mlflow.log_input(
        mlflow.data.from_spark(training_df, source="s3://data/features/v20260405/"),
        context="training"
    )

The MLflow experiment database: MLflow stores metadata (parameters, metrics, run IDs) in a SQL database (SQLite for local, PostgreSQL for production). Artifacts (model weights, plots) in object storage (S3/GCS). DE owns the infrastructure: database HA, storage lifecycle, backups.

Model Registry: The Production Gateway

The model registry is the workflow control point between experimentation and production. It must store lineage and enable rollbacks.

MLflow Model Registry workflow:

# Register a trained model
model_version = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name="fraud-detection-v2",
    tags={
        "training_dataset": "v2026-04-05-abc123",
        "code_commit": "git_hash_abc123",
        "trained_by": "data_pipeline_team"
    }
)

# Promotion workflow: Staging → Production
client = mlflow.MlflowClient()

# Promote to staging for evaluation
client.transition_model_version_stage(
    name="fraud-detection-v2",
    version=model_version.version,
    stage="Staging"
)

# After passing evaluation gates → promote to production
client.transition_model_version_stage(
    name="fraud-detection-v2",
    version=model_version.version,
    stage="Production"
)

# Old production version automatically moves to "Archived"

What the model registry should store:

  • Model artifact (weights, config, preprocessing code)
  • Training metadata (dataset version, hyperparameters, code commit hash)
  • Evaluation metrics (accuracy, F1, business metrics)
  • Training timestamp and duration
  • Promotion history (who approved it, when)
  • Downstream dependencies (which inference services use this model)

Golden rule: every production model must be traceable to the exact training dataset version and code commit. If a model misbehaves in production, you should be able to answer “what data was this trained on?” within minutes.

Continuous Training: Automated Retraining Pipelines

A model deployed to production degrades over time as the data distribution shifts. DE’s job: build the retraining pipeline that keeps models fresh.

Data drift triggers:

  • Input drift: distribution of input features shifts
  • Performance drift: model metrics drop below threshold (needs labels or proxies)

Automated retraining pipeline:

with DAG("fraud_model_retraining", schedule="0 6 * * *") as dag:

    check_performance = PythonOperator(
        task_id="check_model_performance",
        python_callable=compute_model_metrics,
    )

    check_data_drift = PythonOperator(
        task_id="check_data_drift",
        python_callable=compute_psi_score,
        # PSI compares feature distributions today vs training baseline.
        # PSI > 0.2 = significant drift
    )

    should_retrain = BranchPythonOperator(
        task_id="should_retrain",
        python_callable=lambda: "retrain" if drift_or_degradation() else "skip",
    )

    prepare_training_data = PythonOperator(
        task_id="prepare_training_data",
        python_callable=build_training_dataset,
        # Point-in-time correct feature construction
    )

    train_model = KubernetesPodOperator(
        task_id="train_model",
        image="ml-training:latest",
        # GPU pod for training; logs to MLflow
    )

    evaluate_new_model = PythonOperator(
        task_id="evaluate_new_model",
        python_callable=compare_model_versions,
        # Shadow test gate: new model must beat production by > 1%
    )

    promote_model = PythonOperator(
        task_id="promote_model_to_production",
        python_callable=update_model_registry,
        # Only if evaluation gate passes
    )

    check_performance >> check_data_drift >> should_retrain
    should_retrain >> prepare_training_data >> train_model >> evaluate_new_model >> promote_model

The DE-ML Interface: What You Own vs What ML Owns

ResponsibilityDE ownsML engineer / data scientist owns
Raw data ingestion and storage
Feature engineering pipelines✅ (infra + batch)✅ (logic definition)
Training dataset constructionDefines requirements
Data versioning (DVC/lakeFS)Consumes
Experiment tracking infra (MLflow)Logs experiments
Model registry infraPromotes models
Training pipeline orchestrationTriggers / monitors
Batch inference pipelineDefines scoring logic
Online feature serving infraDefines features
Model architecture, hyperparameters
Model training code❌ (infra)✅ (training logic)
Model evaluation logic
Online inference server (TorchServe, vLLM)✅ (with ML Eng)

Interview Questions

Q1: “A data scientist says their model performance dropped in production but they trained it last week and it looked great. How do you diagnose this as a DE?”

Model Answer: “This is either data drift, training-serving skew, or an infrastructure bug — I’d investigate all three in parallel. First, data drift: check whether input feature distributions shifted between training and now (PSI or KL divergence; PSI > 0.2 is a common threshold). Second, training-serving skew: compare feature values served in production against training dataset distributions; if they diverge, there’s likely offline vs online computation mismatch. Third, infrastructure: confirm the deployed model matches the evaluated registry version (hash/version mismatch happens). Fourth, label drift: if the ground truth definition changed, your evaluation signal changes. Fix depends on root cause: drift → retrain; skew → unify feature logic; infrastructure → redeploy; label drift → update labeling + retraining.”

Q2: “How would you design a versioned training data pipeline that ensures any ML experiment from the past 2 years can be reproduced exactly?”

Model Answer: “Four components. First, immutable raw storage: append-only raw events in object storage (Iceberg on S3) with durable retention. Second, versioned training datasets: training pipelines produce datasets tagged with a version hash that encodes inputs (raw partitions, feature code commit, params). Store write-once datasets under s3://ml-data/datasets/{version_hash}/. Third, DVC or lakeFS integration: track the dataset version hash alongside the code commit and MLflow run ID, creating full lineage (run → model → dataset → raw partitions + code). Fourth, pipeline determinism/idempotency: no wall-clock timestamps in outputs, deterministic ordering and seeds so reruns produce identical outputs. Repro = git checkout <commit> + pull dataset version + run training.”

Think About This

You’re in an OpenAI interview. The prompt: “ChatGPT is fine-tuned every 2 weeks on new conversation data. Design the end-to-end ML pipeline that makes this reliable and reproducible.”

Walk through:

  • What data goes into each fine-tuning run?
  • How do you version the training data?
  • What validation gates must pass before the model ships?
  • What happens if the model fails evaluation?
  • How is rollback triggered?

Quick Reference

  • DE owns: data prep, versioning, training pipeline infra, batch inference, feature store infra, MLflow/registry infra
  • ML owns: model architecture, training logic, evaluation criteria
  • Training pipeline: data prep → versioned dataset → GPU training → MLflow → model registry → eval gates → prod
  • DVC: version datasets/pipelines/params/metrics; reproducibility via git checkout + dvc pull
  • lakeFS: branch the data lake namespace per experiment; merge on success, abandon on failure
  • MLflow: run metadata + artifacts; model registry manages staging/production promotion lifecycle
  • Continuous training triggers: PSI > 0.2 (drift) or model metric < threshold; automated retrain DAG
  • Reproducibility requirement: immutable raw + versioned datasets + code commit + deterministic pipelines

Tomorrow’s Preview

Day 48: Embedding & Vector Data Pipelines — Embedding generation at scale, vector DBs (Pinecone, Weaviate, pgvector), ANN indexing (HNSW, IVF), RAG pipelines, and why vector infra is critical in OpenAI/Anthropic interviews.