Day 47 — ML Pipeline Architecture

Phase 2: Deep Dives | Category: ML Data Infrastructure

The DE’s Role in ML: Infrastructure, Not Models

Senior DEs at AI-heavy companies (OpenAI, Anthropic, Meta, Netflix, Google) are expected to own the data infrastructure that ML teams depend on. Per DataInterview.com: “ML knowledge is rated medium, yet candidates over-index on it constantly.”

The interview isn’t asking you to build neural networks — it’s asking whether you can build the reliable, versioned, reproducible data pipelines that make ML possible at scale. The DE’s job in ML: data prep, versioning, training pipeline infrastructure, and the handoff to serving.

The Full ML Pipeline Lifecycle

Raw Data (warehouse, lake, streams)
    ↓
[1] DATA PREPARATION PIPELINE
  • Feature engineering (Day 46 topics)
  • Training dataset construction (point-in-time joins)
  • Data versioning (DVC / lakeFS)
  • Quality gates (schema validation, distribution checks)
    ↓
Training Dataset (versioned, immutable)
    ↓
[2] EXPERIMENT TRACKING
  • Hyperparameter logging
  • Metric tracking (loss, accuracy, F1)
  • Artifact storage (model weights, evaluation plots)
  • Tools: MLflow, Weights & Biases, Neptune.ai
    ↓
Trained Model Artifact
    ↓
[3] MODEL REGISTRY
  • Version control for models (v1.0, v1.1, v2.0)
  • Promotion workflow: dev → staging → production
  • Lineage: model → training dataset → code version
  • Rollback capability
    ↓
Production Model
    ↓
[4] SERVING / INFERENCE
  • Online (real-time API): p99 < 100ms, feature store integration
  • Batch (scheduled scoring): millions of records, cost-efficient
    ↓
[5] MONITORING & RETRAINING TRIGGERS
  • Data drift detection
  • Model performance monitoring
  • Automated retraining pipelines
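The quality gates in stage [1] can be sketched as a function that checks a batch of rows against an expected schema and a null-rate bound before the dataset is published. This is a minimal sketch, assuming rows arrive as dictionaries. The schema, column names, and threshold are hypothetical, not taken from any particular validation library.

```python
from typing import Any

# Hypothetical expected schema: column name -> Python type.
EXPECTED_SCHEMA = {"user_id": int, "amount": float, "country": str}


def run_quality_gate(
    rows: list[dict[str, Any]],
    schema: dict[str, type] = EXPECTED_SCHEMA,
    max_null_rate: float = 0.05,
) -> list[str]:
    """Return a list of violations; an empty list means the gate passes."""
    if not rows:
        return ["dataset is empty"]

    violations: list[str] = []
    for column, expected_type in schema.items():
        values = [row.get(column) for row in rows]
        nulls = sum(value is None for value in values)

        # Distribution check: too many missing values fails the gate.
        null_rate = nulls / len(rows)
        if null_rate > max_null_rate:
            violations.append(
                f"{column}: null rate {null_rate:.2f} exceeds {max_null_rate:.2f}"
            )

        # Schema check: every non-null value must have the expected type.
        if any(v is not None and not isinstance(v, expected_type) for v in values):
            violations.append(
                f"{column}: unexpected type, expected {expected_type.__name__}"
            )

    return violations
```

A pipeline would typically fail the run when the returned list is non-empty, so that a bad dataset never receives a version tag.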

The Two Pipeline Types

Training Pipeline: Throughput-Optimized

Training dataset (Parquet/Iceberg on S3)
    ↓
Data loader (batched reads, shuffling, preprocessing)
    ↓
Distributed training (multiple GPUs/TPUs)
    ↓
Checkpointing (save model state periodically)
    ↓
Evaluation (validation set)
    ↓
Model artifact (weights + config + preprocessing code)
    ↓
Model registry

DE-owned components:

Training dataset construction and versioning
Efficient data loading (minimize GPU idle time)
Storage for model artifacts
Training pipeline orchestration (Airflow/Metaflow)
Evaluation dataset management

The training data bottleneck: if your 500-GPU training job is waiting for data, every idle GPU-hour is paid-for compute doing no work. DEs own this problem:

Pre-shard data to match training parallelism
Use columnar formats with efficient column-subset reads
Pre-tokenize/pre-process (don’t do it in the training loop)
Use streaming datasets when the data is too large to copy onto each training node’s local disk
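Pre-sharding to match training parallelism can be illustrated with a round-robin assignment of shard files to workers. This is a sketch under assumptions: the function names are hypothetical, and real data loaders usually provide their own sharding hooks.

```python
def shards_for_rank(shard_paths: list[str], rank: int, world_size: int) -> list[str]:
    """Assign shards round-robin so each training worker reads a disjoint subset."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world size {world_size}")
    # Sort first so every worker computes the same global ordering.
    return sorted(shard_paths)[rank::world_size]


def check_shard_balance(num_shards: int, world_size: int) -> bool:
    """True when shards divide evenly, so no worker runs out of data early."""
    return num_shards % world_size == 0
```

Writing a shard count that is a multiple of the worker count is the design choice that matters here. An uneven split leaves some workers waiting at the end of each epoch.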

Inference Pipeline: Latency-Optimized

Two types:

ONLINE (real-time):
  API request → feature retrieval (feature store) → model inference → response
  Latency: p99 < 100ms total (p99 < 10ms features, p99 < 40ms inference, remainder overhead)
  Scale: thousands of requests/sec
  Tools: TorchServe, TensorFlow Serving, Triton Inference Server, vLLM (LLMs)

BATCH (scheduled):
  Input dataset → parallel model inference → output dataset
  Latency: hours (not a concern)
  Scale: millions of records
  Tools: Spark + ONNX, Spark UDF wrapping the model, SageMaker Batch Transform

DE-owned components in inference:

Batch inference pipeline orchestration (Airflow)
Input data preparation for batch scoring
Output data storage and downstream integration
Feature retrieval optimization in the online path
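The batch path (input dataset → model inference → output dataset) can be sketched as a generator that scores records in fixed-size batches. This is a minimal single-process illustration. `predict_batch` is a hypothetical stand-in for whatever model wrapper the ML team provides, and a production job would distribute the same loop across Spark partitions or similar.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def score_in_batches(
    records: Iterable[dict],
    predict_batch: Callable[[list[dict]], list[float]],
    batch_size: int = 1000,
) -> Iterator[dict]:
    """Stream records through the model in fixed-size batches."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        scores = predict_batch(batch)
        if len(scores) != len(batch):
            raise ValueError("model returned a different number of scores than inputs")
        for record, score in zip(batch, scores):
            # Keep the input fields so downstream joins need no extra lookup.
            yield {**record, "score": score}
```

Batching amortizes per-call model overhead, and streaming keeps memory flat regardless of input size.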

Data Versioning: The Foundation of Reproducibility

An ML experiment is only reproducible if you can recreate the exact same training data, code, and environment used for a given training run.

DVC (Data Version Control)

DVC extends Git to version large files (datasets, model weights) that Git can’t handle efficiently.

# Initialize DVC alongside Git
git init && dvc init

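# Add a dataset to DVC tracking
dvc add data/training_features.parquet
# DVC stores the file hash in data/training_features.parquet.dvc;
# `dvc push` then uploads the actual file to remote storage (S3/GCS)
git add data/training_features.parquet.dvc .gitignore
git commit -m "Add training features v1"
dvc push  # requires a remote configured with `dvc remote add`

# Define a versioned pipeline
# (a path cannot be both `dvc add`-tracked and a stage output: before adopting
# the pipeline below, run `dvc remove data/training_features.parquet.dvc`)
# dvc.yaml: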
stages:
  prepare_features:
    cmd: python src/prepare_features.py --date 2026-04-05
    deps:
      - src/prepare_features.py
      - data/raw_events/
    outs:
      - data/training_features.parquet

  train_model:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/training_features.parquet
    params:
      - params.yaml:
          - model.learning_rate
          - model.batch_size
    outs:
      - models/trained_model.pkl

# Run the pipeline (only re-runs stages with changed deps)
dvc repro

# To reproduce this exact experiment 6 months later:
git checkout experiment-v2.3
dvc pull  # retrieves exact dataset version from S3
dvc repro # runs exact same pipeline

DVC tracks:

Dataset files (content-hashed, stored in S3/GCS)
Pipeline stage definitions (code + input → output dependencies)
Parameters (hyperparameters)
Metrics (model evaluation scores)
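Content hashing is what makes a dataset version verifiable: identical bytes always produce the same identifier. The sketch below shows the general idea with an MD5 digest, the kind of hash DVC records in its `.dvc` files by default. It illustrates the concept and is not DVC's own implementation; the function name is hypothetical.

```python
import hashlib
from pathlib import Path


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """MD5 of a file's bytes, read in chunks so large files never load fully into memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with the same hash can share one stored copy, and a changed hash signals a new dataset version.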

lakeFS: Git for Data Lakes

DVC tracks individual files and directories. lakeFS versions entire data lake namespaces, creating Git-like branches of S3/GCS/ADLS storage.

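import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

# LakeFSClient expects a lakefs_client.Configuration (host and credentials)
client = LakeFSClient(...)
repo = "training-data"

# Create a branch for this experiment
client.branches.create_branch(
    repo,
    models.BranchCreation(
        name="experiment-dedup-v2",
        source="main"
    )
)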

# Run deduplication pipeline on the branch (doesn't affect main)
run_deduplication_pipeline(
    input="lakefs://training-data/main/raw/",
    output="lakefs://training-data/experiment-dedup-v2/features/"
)

# If it improves metrics → merge to main
# If it doesn't → abandon branch
client.refs.merge_into_branch(
    repo,
    source_ref="experiment-dedup-v2",
    destination_branch="main"
)

lakeFS is the right tool when multiple experiments run simultaneously on the same data lake and may modify the data (dedup/filter/transform). Without branching, experiments corrupt each other’s data.

Experiment Tracking: MLflow

MLflow is a widely used open-source platform for tracking ML experiments. The DE’s job: ensure the training pipeline integrates with MLflow and that experiment artifacts are stored reliably.

What MLflow tracks:

import mlflow

# Start an experiment run
with mlflow.start_run(run_name="sft-v2.3-lr-1e-4"):

    # Log parameters
    mlflow.log_params({
        "learning_rate": 1e-4,
        "batch_size": 32,
        "epochs": 10,
        "training_dataset_version": "v2026-04-05-abc123",  # DE-provided
        "model_architecture": "gpt2-medium"
    })

    # Log metrics during training
    for epoch in range(10):
        train_loss = train_one_epoch()
        val_loss = evaluate()

        mlflow.log_metrics({
            "train_loss": train_loss,
            "val_loss": val_loss
        }, step=epoch)

    # Log the trained model artifact
    mlflow.pytorch.log_model(model, "model")

    # Log the evaluation results
    mlflow.log_metrics({
        "accuracy": 0.847,
        "f1_score": 0.832,
        "perplexity": 12.4
    })

    # DE-owned: log dataset info for reproducibility
    # (from_spark takes the storage location as `path`)
    mlflow.log_input(
        mlflow.data.from_spark(training_df, path="s3://data/features/v20260405/"),
        context="training"
    )

The MLflow experiment database: MLflow stores metadata (parameters, metrics, run IDs) in a SQL database (SQLite for local use, PostgreSQL for production) and artifacts (model weights, plots) in object storage (S3/GCS). The DE owns the infrastructure: database HA, storage lifecycle, backups.

Model Registry: The Production Gateway

The model registry is the workflow control point between experimentation and production. It must store lineage and enable rollbacks.

MLflow Model Registry workflow:

# Register a trained model
model_version = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name="fraud-detection-v2",
    tags={
        "training_dataset": "v2026-04-05-abc123",
        "code_commit": "git_hash_abc123",
        "trained_by": "data_pipeline_team"
    }
)

# Promotion workflow: Staging → Production
# (stage transitions are deprecated in recent MLflow releases in favor of
# model version aliases; the stage API is shown because it is still common)
client = mlflow.MlflowClient()

# Promote to staging for evaluation
client.transition_model_version_stage(
    name="fraud-detection-v2",
    version=model_version.version,
    stage="Staging"
)

# After passing evaluation gates → promote to production
client.transition_model_version_stage(
    name="fraud-detection-v2",
    version=model_version.version,
    stage="Production",
    archive_existing_versions=True
)

# With archive_existing_versions=True, the old production version moves to "Archived"

What the model registry should store:

Model artifact (weights, config, preprocessing code)
Training metadata (dataset version, hyperparameters, code commit hash)
Evaluation metrics (accuracy, F1, business metrics)
Training timestamp and duration
Promotion history (who approved it, when)
Downstream dependencies (which inference services use this model)

Golden rule: every production model must be traceable to the exact training dataset version and code commit. If a model misbehaves in production, you should be able to answer “what data was this trained on?” within minutes.
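One way to enforce the golden rule is a promotion gate that rejects any model version missing its lineage tags. This is a sketch, assuming tags are available as a plain dictionary. The tag keys mirror the registration example above, and the function names are hypothetical.

```python
REQUIRED_LINEAGE_TAGS = ("training_dataset", "code_commit")


def missing_lineage_tags(tags: dict[str, str]) -> list[str]:
    """Return required lineage tags that are absent or blank."""
    return [key for key in REQUIRED_LINEAGE_TAGS if not tags.get(key, "").strip()]


def assert_promotable(model_name: str, version: str, tags: dict[str, str]) -> None:
    """Block promotion when a model version cannot be traced to data and code."""
    missing = missing_lineage_tags(tags)
    if missing:
        raise ValueError(
            f"{model_name} version {version} lacks lineage tags: {', '.join(missing)}"
        )
```

Running this check before any stage transition means "what data was this trained on?" is answerable by construction, not by investigation.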

Continuous Training: Automated Retraining Pipelines

A model deployed to production degrades over time as the data distribution shifts. DE’s job: build the retraining pipeline that keeps models fresh.

Retraining triggers:

Input drift: distribution of input features shifts
Performance drift: model metrics drop below threshold (needs labels or proxies)
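Input drift is commonly quantified with the Population Stability Index (PSI), which compares the binned distribution of a feature today against its training baseline. The sketch below uses equal-width bins derived from the baseline. The bin count and the `eps` floor are illustrative assumptions, and production implementations often use quantile bins instead.

```python
import math


def psi(expected: list[float], actual: list[float], bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a baseline sample and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def bucket_shares(values: list[float]) -> list[float]:
        counts = [0] * bins
        for value in values:
            # Clamp so values outside the baseline range land in the edge bins.
            index = min(max(int((value - lo) / width), 0), bins - 1)
            counts[index] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(count / len(values), eps) for count in counts]

    e, a = bucket_shares(expected), bucket_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))
```

A drift check would call `psi(baseline_values, todays_values)` per feature and flag the feature when the result crosses the chosen threshold.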

Automated retraining pipeline:

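from datetime import datetime

# Airflow 2.x import paths; older cncf.kubernetes provider releases expose
# KubernetesPodOperator under ...operators.kubernetes_pod instead.
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator, PythonOperator
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# The python_callable functions and drift_or_degradation() are assumed to be
# defined elsewhere in the project. start_date is a placeholder.
with DAG(
    "fraud_model_retraining",
    schedule="0 6 * * *",
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:

    check_performance = PythonOperator(
        task_id="check_model_performance",
        python_callable=compute_model_metrics,
    )

    check_data_drift = PythonOperator(
        task_id="check_data_drift",
        python_callable=compute_psi_score,
        # PSI compares feature distributions today vs training baseline.
        # PSI > 0.2 = significant drift
    )

    should_retrain = BranchPythonOperator(
        task_id="should_retrain",
        # A branch callable must return the task_id of the path to follow
        python_callable=lambda: (
            "prepare_training_data" if drift_or_degradation() else "skip_retraining"
        ),
    )

    # No-op branch target for days when no retraining is needed
    skip_retraining = EmptyOperator(task_id="skip_retraining")

    prepare_training_data = PythonOperator(
        task_id="prepare_training_data",
        python_callable=build_training_dataset,
        # Point-in-time correct feature construction
    )

    train_model = KubernetesPodOperator(
        task_id="train_model",
        image="ml-training:latest",
        # GPU pod for training; logs to MLflow
    )

    evaluate_new_model = PythonOperator(
        task_id="evaluate_new_model",
        python_callable=compare_model_versions,
        # Shadow test gate: new model must beat production by > 1%
    )

    promote_model = PythonOperator(
        task_id="promote_model_to_production",
        python_callable=update_model_registry,
        # Only if evaluation gate passes
    )

    check_performance >> check_data_drift >> should_retrain
    should_retrain >> [prepare_training_data, skip_retraining]
    prepare_training_data >> train_model >> evaluate_new_model >> promote_model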

The DE-ML Interface: What You Own vs What ML Owns

Responsibility	DE owns	ML engineer / data scientist owns
Raw data ingestion and storage	✅	❌
Feature engineering pipelines	✅ (infra + batch)	✅ (logic definition)
Training dataset construction	✅	Defines requirements
Data versioning (DVC/lakeFS)	✅	Consumes
Experiment tracking infra (MLflow)	✅	Logs experiments
Model registry infra	✅	Promotes models
Training pipeline orchestration	✅	Triggers / monitors
Batch inference pipeline	✅	Defines scoring logic
Online feature serving infra	✅	Defines features
Model architecture, hyperparameters	❌	✅
Model training code	❌ (owns only the infra it runs on)	✅ (training logic)
Model evaluation logic	❌	✅
Online inference server (TorchServe, vLLM)	❌	✅ (with ML Eng)

Interview Questions

Q1: “A data scientist says their model performance dropped in production but they trained it last week and it looked great. How do you diagnose this as a DE?”

Model Answer: “This is most likely data drift, training-serving skew, an infrastructure bug, or label drift, and I’d investigate all four in parallel. First, data drift: check whether input feature distributions shifted between training and now (PSI or KL divergence; PSI > 0.2 is a common threshold). Second, training-serving skew: compare feature values served in production against training dataset distributions; if they diverge, there’s likely an offline vs online computation mismatch. Third, infrastructure: confirm the deployed model matches the evaluated registry version (hash/version mismatches happen). Fourth, label drift: if the ground truth definition changed, your evaluation signal changes. The fix depends on root cause: drift → retrain; skew → unify feature logic; infrastructure → redeploy; label drift → update labeling + retraining.”
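The training-serving skew check in this answer can be sketched as a per-entity comparison of one feature computed offline and served online. This is a minimal sketch, assuming both sides are available as dictionaries keyed by entity ID. The function name and tolerances are hypothetical.

```python
import math


def skew_rate(
    offline: dict[str, float],
    online: dict[str, float],
    rel_tol: float = 1e-6,
) -> float:
    """Fraction of shared entity keys whose offline and online feature values disagree."""
    shared = offline.keys() & online.keys()
    if not shared:
        raise ValueError("no overlapping entity keys to compare")
    mismatches = sum(
        not math.isclose(offline[key], online[key], rel_tol=rel_tol, abs_tol=1e-9)
        for key in shared
    )
    return mismatches / len(shared)
```

A skew rate well above zero on a feature that should be computed identically points at divergent offline and online feature logic, not at drift in the underlying data.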

Q2: “How would you design a versioned training data pipeline that ensures any ML experiment from the past 2 years can be reproduced exactly?”

Model Answer: “Four components. First, immutable raw storage: append-only raw events in object storage (Iceberg on S3) with durable retention. Second, versioned training datasets: training pipelines produce datasets tagged with a version hash that encodes inputs (raw partitions, feature code commit, params). Store write-once datasets under s3://ml-data/datasets/{version_hash}/. Third, DVC or lakeFS integration: track the dataset version hash alongside the code commit and MLflow run ID, creating full lineage (run → model → dataset → raw partitions + code). Fourth, pipeline determinism/idempotency: no wall-clock timestamps in outputs, deterministic ordering and seeds so reruns produce identical outputs. Repro = git checkout <commit> + pull dataset version + run training.”
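The version hash from the second component can be derived by hashing a canonical manifest of everything that shapes the dataset. This is a sketch under assumptions: the manifest fields and the 16-character truncation are illustrative choices, and the function names are hypothetical.

```python
import hashlib
import json


def dataset_version_hash(raw_partitions: list[str], code_commit: str, params: dict) -> str:
    """Derive a deterministic version ID from the inputs that shape the dataset."""
    manifest = {
        "raw_partitions": sorted(raw_partitions),  # order must not change the hash
        "code_commit": code_commit,
        "params": params,
    }
    # sort_keys and fixed separators make the serialization canonical.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def dataset_path(version_hash: str) -> str:
    """Write-once location following the layout named in the answer above."""
    return f"s3://ml-data/datasets/{version_hash}/"
```

Because the hash depends only on inputs, rerunning the same pipeline on the same partitions with the same code and parameters resolves to the same path, which makes accidental overwrites with different content detectable.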

Think About This

You’re in an OpenAI interview. The prompt: “ChatGPT is fine-tuned every 2 weeks on new conversation data. Design the end-to-end ML pipeline that makes this reliable and reproducible.”

Walk through:

What data goes into each fine-tuning run?
How do you version the training data?
What validation gates must pass before the model ships?
What happens if the model fails evaluation?
How is rollback triggered?

Quick Reference

DE owns: data prep, versioning, training pipeline infra, batch inference, feature store infra, MLflow/registry infra
ML owns: model architecture, training logic, evaluation criteria
Training pipeline: data prep → versioned dataset → GPU training → MLflow → model registry → eval gates → prod
DVC: version datasets/pipelines/params/metrics; reproducibility via git checkout + dvc pull
lakeFS: branch the data lake namespace per experiment; merge on success, abandon on failure
MLflow: run metadata + artifacts; model registry manages staging/production promotion lifecycle
Continuous training triggers: PSI > 0.2 (drift) or model metric < threshold; automated retrain DAG
Reproducibility requirement: immutable raw + versioned datasets + code commit + deterministic pipelines

Tomorrow’s Preview

Day 48: Embedding & Vector Data Pipelines — Embedding generation at scale, vector DBs (Pinecone, Weaviate, pgvector), ANN indexing (HNSW, IVF), RAG pipelines, and why vector infra is critical in OpenAI/Anthropic interviews.
