Day 38 — Design: LLM Training Data Pipeline

Phase 2: Company-Specific | Category: OpenAI-Specific

The Prompt

“Design an end-to-end pipeline that produces a high-quality fine-tuning dataset for ChatGPT from user conversations. Include PII redaction, deduplication, toxicity filtering, and quality scoring. Specify your storage layers, idempotent reprocessing strategy, and the data quality gates that block a training run from starting.”

This is the exact question from DataInterview.com’s OpenAI interview guide. Set a 45-minute timer.

Step 1: Requirements (5 min)

Functional:

Ingest raw user conversations from ChatGPT (billions daily at the scale assumed below)
PII detection and redaction before any human reviewer or trainer sees data
Deduplication at document and near-duplicate level
Toxicity filtering (remove harmful content, optionally quarantine for safety research)
Quality scoring (is this conversation worth learning from?)
RLHF preference data collection workflow (human comparison labels)
Dataset versioning (exact reproducibility of any training run)
Quality gates that block training runs when criteria aren’t met

Non-functional:

Scale:       ~500M ChatGPT users × ~5 conversations/day = 2.5B conversations/day
             Each conversation avg ~2 KB = ~5 TB/day raw ingestion
Latency:     Fine-tuning datasets produced within 24 hours of collection window
Privacy:     PII must be redacted BEFORE data leaves the ingestion layer
             (before any human sees it, before any ML model trains on it)
Durability:  Zero tolerance for data loss; immutable storage at every stage
Compliance:  EU AI Act (August 2026 enforcement) requires training data provenance
             Every training run must be fully reproducible and auditable
Versioning:  Any training dataset must be exactly reproducible 18 months later
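The privacy requirement amounts to a redact-then-write rule: nothing is persisted until typed placeholders have replaced detected entities. A minimal sketch of that rule, assuming simplified regexes stand in for the NER-based detection a production gate would use (the patterns, the `redact` helper, and the sample text are all illustrative and not part of the original design):

```python
import re

# Illustrative patterns only: a real PII gate would combine an NER model
# with far more robust detectors than these simplified regexes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(text: str) -> tuple[str, int]:
    """Replace each match with a typed placeholder; return text and entity count."""
    total = 0
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        total += n
    return text, total


redacted, entity_count = redact("Mail jane@example.com or call 415-555-0100.")
print(redacted)
print(entity_count)
```

The returned entity count is the kind of signal a quarantine rule for PII-heavy conversations could key on.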

Scoping: “I’ll design the full pipeline from raw conversation ingestion through RLHF preference collection to training-ready dataset. I’ll describe the data model, storage layers, deduplication strategy, and quality gates in detail.”

Step 2: Scale Estimation (3 min)

Raw ingestion:
  2.5B conversations/day × 2 KB avg = 5 TB/day raw
  Compressed (Parquet Zstd ~5x): ~1 TB/day in bronze layer

Post-filtering estimate (typical rejection rates):
  Raw:                   2.5B conversations (100%)
  PII issues:            ~2% removed or heavily redacted → 2.45B remaining
  Deduplication:         ~15% near-duplicates removed → 2.1B remaining
  Too short/low quality: ~20% filtered → 1.68B remaining
  Toxicity:              ~3% filtered (some quarantined for safety research) → 1.63B remaining
  Final gold dataset:    ~1.6B high-quality conversations/day

RLHF labeling:
  Sample ~50K conversations/day for human preference labeling
  Each conversation generates 2-4 model responses for pairwise comparison
  ~100K-200K comparison pairs/day → at an assumed ~1 minute per comparison,
  roughly 1,700-3,300 labeler-hours/day

Storage per year:
  Bronze (raw):          1 TB/day × 365 = 365 TB/year
  Silver (cleaned):      ~300 TB/year
  Gold (training-ready): ~200 TB/year
  RLHF labels:           ~50 GB/year (tiny compared to conversations)
  Total:                 ~865 TB/year, i.e. roughly 1 PB/year
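The funnel is easy to sanity-check in a few lines. The sketch below applies the stated rejection rates in order; the stage names and the `funnel` helper are illustrative. Compounding the rates without intermediate rounding lands at about 1.62B, consistent with the rounded ~1.6B figure.

```python
# Back-of-envelope funnel using the rejection rates stated above.
RAW_PER_DAY = 2.5e9   # conversations/day
AVG_BYTES = 2_000     # ~2 KB per conversation

STAGES = [            # (stage name, fraction removed)
    ("pii", 0.02),
    ("dedup", 0.15),
    ("low_quality", 0.20),
    ("toxicity", 0.03),
]


def funnel(raw: float, stages: list[tuple[str, float]]) -> dict[str, float]:
    """Apply each rejection rate in order and record what remains."""
    remaining = raw
    out = {}
    for name, removed in stages:
        remaining *= 1.0 - removed
        out[name] = remaining
    return out


survivors = funnel(RAW_PER_DAY, STAGES)
raw_tb_per_day = RAW_PER_DAY * AVG_BYTES / 1e12

for name, count in survivors.items():
    print(f"after {name:<12} {count / 1e9:.2f}B")
print(f"raw ingestion: {raw_tb_per_day:.1f} TB/day")
```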

Step 3: Full Architecture

CHATGPT SERVING LAYER
  Each completed conversation → event published to message queue
  Fields: conversation_id, user_id_encrypted, messages[],
          model_version, start_time, end_time, country_code
  NOTE: user_id is already pseudonymized at source
    ↓
INGESTION + PII GATE (MUST BE FIRST)
  Streaming consumer (Kafka/Azure Event Hub):
  1. Validate schema (reject malformed events to DLQ)
  2. PII DETECTION + REDACTION:
     • Microsoft Presidio / custom NER model
     • Detect: names, emails, phones, SSNs, addresses, IPs
     • Replace with typed placeholders: [PERSON], [EMAIL]
     • PII-heavy conversations (> 10 entities): quarantine tier
  3. Generate: document_id = SHA256(conversation_id + pipeline_v)
  4. Write to BRONZE (raw-but-redacted, immutable)
    ↓
BRONZE LAYER (S3 / Azure Blob)
  s3://training-data/bronze/date=YYYY-MM-DD/hour=HH/
  Format: Parquet + Zstd
  Content: PII-redacted conversations, append-only, immutable
  Schema: document_id, conversation_id_hash, messages[],
          redacted_entity_count, country_code, model_version,
          conversation_date, pipeline_version
  Retention: 7 years (regulatory compliance)
    ↓ (daily Spark batch job, 2 AM PT)
PROCESSING PIPELINE (Spark + Ray)

  STAGE 1: DEDUPLICATION
    Exact dedup: hash(normalize(conversation_text)) → DynamoDB
    Near-dedup: MinHash LSH (Jaccard threshold 0.7)
      - Shingling: 5-gram character shingles
      - 128 MinHash permutations → compact signature
      - LSH bands (b=16, r=8) → candidate pairs
      - Exact Jaccard confirm on candidates (2-stage)
      - Mark duplicates: keep longest/highest-quality copy
      ↓
  STAGE 2: QUALITY FILTERING
    Heuristic filters (fast, cheap, applied first):
      • Min conversation turns: ≥ 2
      • Min total tokens: ≥ 50
      • Max total tokens: ≤ 32,768 (context window limit)
      • Language detection (keep English + major languages)
      • Repetition ratio < 30% (templated/spam detection)
    ML quality score (costlier, applied to heuristic-passes):
      • Classifier: "Is this a high-quality helpful exchange?"
      • Score 0-1. Threshold: > 0.6 for training inclusion
      • OR use a small LLM (e.g., GPT-4o mini) as LLM-as-classifier
        (faster/cheaper than GPT-4, still a high-quality signal)
      ↓
  STAGE 3: TOXICITY FILTERING
    Category-specific classifiers:
      • Hate speech → REMOVE (never include in training)
      • Violence/self-harm → REMOVE
      • Mild profanity → INCLUDE (model must handle real users)
      • Safety research cases → QUARANTINE (separate dataset
        used only for safety training, strict access control)
    Human audit: 0.1% random sample reviewed weekly
      ↓
  STAGE 4: FORMAT + TOKENIZE
    Convert to training format: instruction-following format
    [{"role": "user", "content": "..."}, {"role": "assistant"}]
    Tokenize with tiktoken (cl100k_base)
    Pack sequences to training sequence length (8192 tokens)
    ↓
GOLD LAYER + VERSIONING
  s3://training-data/gold/dataset_version={version_hash}/
  Format: WebDataset (streaming-friendly shards, ~500MB each)

  Version hash = SHA256(
    bronze_data_hash +
    pii_model_version +
    dedup_params_hash +
    quality_model_version +
    toxicity_model_version +
    tokenizer_version
  )

  Lineage manifest per dataset version:
  {
    "version": "v2026-04-11-abc123",
    "source_bronze_partitions": ["2026-04-10", "2026-04-11"],
    "row_count": 1628450000,
    "token_count": 412000000000,
    "stage_versions": {
      "pii_model": "presidio-v2.4.0",
      "quality_classifier": "sft-quality-v1.2",
      "toxicity_classifier": "perspective-v7.1",
      "tokenizer": "cl100k_base-v1"
    },
    "quality_metrics": {
      "pii_false_negative_rate": 0.0003,
      "dedup_rate": 0.147,
      "toxicity_removal_rate": 0.031,
      "quality_acceptance_rate": 0.651
    }
  }
    ↓
QUALITY GATES (BLOCKING)
  Training run cannot start until ALL pass

  Gate 1: Minimum token count ≥ 100B tokens (otherwise dataset too small)
  Gate 2: PII audit: scan 0.5% sample with stricter model;
          false-negative rate must be < 0.05%
  Gate 3: Toxicity audit: sample 0.1% manually;
          harmful content rate must be < 0.1%
  Gate 4: Language distribution: no single language > 70%
  Gate 5: Dedup rate between 5-25% (too low = not deduping,
          too high = losing unique content)
  Gate 6: Quality score distribution: median score ≥ 0.65
  Gate 7: Dataset hash matches lineage manifest (immutability)

  If any gate fails:
    → Block training run via training job coordinator API
    → Create PagerDuty P2 incident with failing metric
    → Notify RLHF team and safety team
    → Log failure to audit trail (required for EU AI Act)
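The blocking gates can be expressed as a small table of predicates over the dataset's metrics. This is a sketch under assumptions: the metric key names (`harmful_rate`, `max_language_share`, and so on) and the `failed_gates` helper are hypothetical, while the thresholds mirror the seven gates listed above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Gate:
    name: str
    check: Callable[[dict], bool]  # returns True when the gate passes


# Thresholds mirror the gates listed above; metric names are hypothetical.
GATES = [
    Gate("min_token_count", lambda m: m["token_count"] >= 100e9),
    Gate("pii_audit", lambda m: m["pii_false_negative_rate"] < 0.0005),
    Gate("toxicity_audit", lambda m: m["harmful_rate"] < 0.001),
    Gate("language_mix", lambda m: m["max_language_share"] <= 0.70),
    Gate("dedup_rate", lambda m: 0.05 <= m["dedup_rate"] <= 0.25),
    Gate("quality_median", lambda m: m["median_quality_score"] >= 0.65),
    Gate("hash_integrity", lambda m: m["dataset_hash"] == m["manifest_hash"]),
]


def failed_gates(metrics: dict) -> list[str]:
    """Names of every gate that fails; an empty list means training may start."""
    return [g.name for g in GATES if not g.check(metrics)]


metrics = {
    "token_count": 412e9,
    "pii_false_negative_rate": 0.0003,
    "harmful_rate": 0.0004,
    "max_language_share": 0.62,
    "dedup_rate": 0.147,
    "median_quality_score": 0.71,
    "dataset_hash": "abc123",
    "manifest_hash": "abc123",
}
print(failed_gates(metrics))
print(failed_gates({**metrics, "dedup_rate": 0.31}))
```

Returning every failing gate, not just the first, gives the incident and audit-trail entries the full picture in one pass.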

Step 4: Deduplication Deep Dive

Deduplication is the most technically complex step. At 2.5B conversations/day, naive pairwise comparison is O(n²), on the order of 3 × 10^18 pairs per day, which is infeasible. The approach below follows DZone’s MinHash LSH walkthrough.

Two-stage strategy:

Stage 1: Exact deduplication (fast, catches templates and copy-pastes)


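import hashlib
from datetime import date

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def normalize(text: str) -> str:
    # 1. Normalize: lowercase and collapse whitespace (formatting stripping omitted here)
    return " ".join(text.lower().split())

def is_exact_duplicate(conversation_text: str) -> bool:
    # 2. Hash the normalized text
    doc_hash = hashlib.sha256(normalize(conversation_text).encode("utf-8")).hexdigest()
    # 3. Check hash store with an atomic check-and-insert (conditional write).
    #    "hash" is a DynamoDB reserved word, so it is aliased as #h.
    try:
        dynamodb.put_item(
            TableName="dedup-hashes",
            Item={"hash": {"S": doc_hash}, "first_seen": {"S": date.today().isoformat()}},
            ConditionExpression="attribute_not_exists(#h)",
            ExpressionAttributeNames={"#h": "hash"},
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return True   # hash exists: mark as duplicate, skip
        raise
    return False          # hash inserted: continue to Stage 2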

Stage 2: Near-duplicate detection via MinHash LSH

For each document:
  1. Extract 5-character shingles: {"hello", "ello ", "llo w", ...}
  2. Compute MinHash signature (128 permutations):
     signature[i] = min(h_i(shingle) for shingle in shingles)
  3. Divide signature into 16 bands of 8 values each
  4. Hash each band → bucket in band's hash table
  5. Candidate pairs: documents sharing ≥ 1 bucket in any band

The math:
  P(candidate pair) = 1 - (1 - s^r)^b
  where s = Jaccard similarity, r = rows/band = 8, b = bands = 16
  At s = 0.9: P(candidate) > 0.99
  At s = 0.8: P(candidate) ≈ 0.95
  At s = 0.7: P(candidate) ≈ 0.61 → the S-curve's approximate threshold,
              (1/b)^(1/r) ≈ 0.71, sits right at the 0.7 cutoff
  At s = 0.5: P(candidate) ≈ 0.06 → weaker matches rarely become candidates
  For higher recall at s = 0.7, use more bands with fewer rows
  (e.g., b=32, r=4 gives P > 0.99 at s = 0.7 but P ≈ 0.87 at s = 0.5,
  so far more candidates to verify)

For candidates: compute exact Jaccard similarity to confirm
  (two-stage prevents false positives from dominating storage)
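Because the band/row split determines where the S-curve rises, it is worth evaluating the formula directly before committing to parameters. A short sketch (the function name `candidate_probability` is illustrative) that tabulates the candidate probability for r = 8, b = 16 and computes the approximate threshold (1/b)^(1/r):

```python
def candidate_probability(s: float, rows: int, bands: int) -> float:
    """P(two docs share at least one LSH bucket) at Jaccard similarity s."""
    return 1.0 - (1.0 - s ** rows) ** bands


ROWS, BANDS = 8, 16

for s in (0.5, 0.6, 0.7, 0.8, 0.9):
    p = candidate_probability(s, ROWS, BANDS)
    print(f"s={s:.1f}  P(candidate)={p:.3f}")

# Similarity at which the S-curve rises most steeply, approximately (1/b)^(1/r).
threshold = (1 / BANDS) ** (1 / ROWS)
print(f"approximate threshold: {threshold:.3f}")
```

Rerunning the loop with other (rows, bands) pairs that multiply to 128 shows the trade-off between recall at the target similarity and the number of candidates sent to exact confirmation.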

Distributed Spark implementation:

Compute MinHash signatures: one Spark map task per document
Band hashing: group by (band_id, band_hash) → candidate pairs
Exact confirmation: join candidate pairs → filter on Jaccard ≥ 0.7
Mark duplicates: keep highest-quality conversation per near-duplicate cluster
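The four steps above can be sketched on a single machine before porting them to Spark. This is an in-memory illustration, not the distributed job: a dictionary keyed by (band_id, band values) plays the role of the group-by, and seeded BLAKE2b hashes stand in for the 128 random permutations (an assumption; any well-mixed hash family would do). The sample documents are made up.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM, BANDS, ROWS = 128, 16, 8  # 16 bands x 8 rows = 128 values


def shingles(text: str, k: int = 5) -> set[str]:
    """Set of overlapping k-character shingles."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _h(seed: int, shingle: str) -> int:
    # Seeded 64-bit hash; stands in for one random permutation.
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash(text: str) -> list[int]:
    """One minimum per seeded hash function."""
    sh = shingles(text)
    return [min(_h(seed, s) for s in sh) for seed in range(NUM_PERM)]


def candidate_pairs(docs: dict[str, str]) -> set[tuple[str, str]]:
    """Group documents by (band_id, band values); co-bucketed docs are candidates."""
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash(text)
        for b in range(BANDS):
            buckets[(b, tuple(sig[b * ROWS:(b + 1) * ROWS]))].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a: str, b: str) -> float:
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


docs = {
    "a": "how do i reset the password for my online banking account after changing phones",
    "b": "how do i reset the password for my online banking account after changing phones?",
    "c": "write a limerick about a sleepy orange cat",
}
pairs = candidate_pairs(docs)
# Exact confirmation: keep only candidates at or above the Jaccard threshold.
confirmed = {p for p in pairs if jaccard(docs[p[0]], docs[p[1]]) >= 0.7}
print(confirmed)
```

In the Spark version, `minhash` becomes the per-document map, the `buckets` dictionary becomes a group-by on (band_id, band_hash), and the final set comprehension becomes the join-and-filter step.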

Incremental dedup (for daily runs without reprocessing all history):

Persist MinHash signatures in object storage (one file per date partition)
New day’s documents checked against both the new-day LSH index AND a persistent all-history LSH index
Persistent index updated after each day’s run

Scale sanity check: 2.5B conversations × 128 MinHash values × 4 bytes = ~1.28 TB of MinHash signatures per day. Distributed across 200 Spark workers = manageable.

Step 5: RLHF Preference Data Collection

Beyond the supervised fine-tuning (SFT) pipeline above, RLHF requires a separate preference data collection pipeline:

FROM GOLD SFT DATASET:
  Sample prompts (~50K/day) for RLHF preference collection
  Prioritize prompts from underrepresented categories
    ↓
MODEL SAMPLING SERVICE:
  For each prompt, generate N=4 responses from current model checkpoint
  (on-policy data per RLHF Book [rlhfbook.com])
  Store: prompt_id, response_1..4_ids, model_checkpoint_version
    ↓
LABELING TASK QUEUE:
  Each task: "Which response is better and why?"
  Routing rules:
    • General tasks → general labelers
    • Safety cases → certified safety reviewers only
    • Technical tasks (code, math) → domain experts
  Labeler metadata: labeler_id, time_spent_sec, confidence
    ↓
PREFERENCE DATABASE (Postgres + S3):
  comparison_id
  prompt_id, model_version
  response_a_id, response_b_id
  winner: "a" | "b" | "tie" | "both_bad"
  reason_categories: ["helpful", "accurate", "safe", "well_formatted"]
  labeler_id, labeled_at, time_spent_sec, confidence
  is_gold_validated (human-reviewed for quality control)
    ↓
QUALITY CONTROL:
  Gold set: 500 comparisons with known correct answers
  Weekly labeler audit: each labeler scored on gold set
  Labeler agreement rate must stay > 75%
  If labeler drops below 65%: flag for retraining/review
    ↓
REWARD MODEL TRAINING DATASET:
  Filter: is_gold_validated = true OR confidence ≥ 4
  Filter: exclude "both_bad" responses (reward model can't learn from these)
  Balance: ensure safety categories have minimum representation
  Version: same hash-based versioning as SFT dataset
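The two reward-model filters are simple to state as code. A minimal sketch, assuming a trimmed-down record with only the fields the filters touch and a 1-5 confidence scale (the scale is an assumption; the text only says confidence ≥ 4):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comparison:
    comparison_id: str
    prompt_id: str
    response_a_id: str
    response_b_id: str
    winner: str              # "a" | "b" | "tie" | "both_bad"
    confidence: int          # assumed 1-5 scale
    is_gold_validated: bool


def reward_model_rows(rows: list[Comparison]) -> list[Comparison]:
    """Keep trusted labels (gold-validated or confidence >= 4), drop 'both_bad'."""
    return [
        r for r in rows
        if (r.is_gold_validated or r.confidence >= 4) and r.winner != "both_bad"
    ]


rows = [
    Comparison("c1", "p1", "r1", "r2", "a", 5, False),        # kept: high confidence
    Comparison("c2", "p1", "r1", "r3", "both_bad", 5, True),  # dropped: both_bad
    Comparison("c3", "p2", "r4", "r5", "b", 2, False),        # dropped: untrusted label
    Comparison("c4", "p2", "r4", "r6", "tie", 2, True),       # kept: gold-validated
]
kept_ids = [r.comparison_id for r in reward_model_rows(rows)]
print(kept_ids)
```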

Step 6: Idempotency Design

The pipeline processes hundreds of billions of conversations. It WILL fail and be rerun. Idempotency is non-negotiable.

Three idempotency mechanisms:

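1. Document-level dedup store (DynamoDB): The hash store for exact deduplication doubles as a processing tracker. A document whose hash is already in the store has already been processed, so the pipeline skips it on rerun. No duplicate processing, no duplicate writes. One caveat: the entry should also record which document_id first claimed the hash. Otherwise a rerun after a mid-run crash would see its own earlier insert, treat the original document as a duplicate of itself, and drop it.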
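One way to keep the hash store safe across reruns is to store the claiming document's ID alongside the hash. This is an assumption layered on the design, not something the original specifies. In the sketch, a plain dict stands in for the DynamoDB table and `claim` is a hypothetical helper:

```python
def claim(store: dict[str, str], doc_hash: str, document_id: str) -> str:
    """Return 'process' for the hash owner (first sight or rerun), else 'duplicate'."""
    # setdefault inserts only if the hash is absent, mimicking a conditional write.
    owner = store.setdefault(doc_hash, document_id)
    return "process" if owner == document_id else "duplicate"


store: dict[str, str] = {}
first = claim(store, "h1", "doc-1")    # first sight of this content
rerun = claim(store, "h1", "doc-1")    # same document after a pipeline rerun
other = claim(store, "h1", "doc-2")    # different document, same content
print(first, rerun, other)
```

Because the owner is processed again on rerun, downstream writes still need to be idempotent, which is what the deterministic output paths provide.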

2. Partition-scoped output paths:

Output path: s3://training-data/silver/date=2026-04-11/run_id=abc123/

The run_id is deterministic: SHA256(date + pipeline_version). Same inputs, same run_id, same output path. Re-running overwrites the same path with the same content.
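A deterministic run_id takes only a few lines. In this sketch the `|` separator and the truncation to 12 hex characters are illustrative choices, and the pipeline version strings are made up:

```python
import hashlib


def run_id(date: str, pipeline_version: str) -> str:
    """Deterministic run id: same date and pipeline version give the same id."""
    payload = f"{date}|{pipeline_version}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def silver_path(date: str, pipeline_version: str) -> str:
    """Partition-scoped output path that a rerun overwrites in place."""
    return (f"s3://training-data/silver/date={date}/"
            f"run_id={run_id(date, pipeline_version)}/")


path_first = silver_path("2026-04-11", "pipeline-v1")   # hypothetical version string
path_rerun = silver_path("2026-04-11", "pipeline-v1")
path_new_code = silver_path("2026-04-11", "pipeline-v2")
print(path_first)
print(path_first == path_rerun, path_first == path_new_code)
```

An explicit separator avoids ambiguity when concatenated fields could otherwise run together.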

3. Dataset version hash for training runs: The dataset version hash is computed from all inputs. If nothing changed, the hash is identical to the previous run, so the training coordinator knows this dataset is already built and skips regeneration.
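The skip-if-built check can be sketched as a fingerprint lookup. The input keys follow the version-hash components named in the architecture. The canonical-JSON encoding, the `needs_build` helper, and the sample values (including "perspective-v7.2") are assumptions for illustration.

```python
import hashlib
import json


def dataset_version_hash(inputs: dict[str, str]) -> str:
    """Hash a canonical JSON encoding so key order cannot change the result."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_build(inputs: dict[str, str], built_versions: set[str]) -> bool:
    """True when no dataset with this exact input fingerprint exists yet."""
    return dataset_version_hash(inputs) not in built_versions


inputs = {
    "bronze_data_hash": "b0",            # placeholder values
    "pii_model_version": "presidio-v2.4.0",
    "dedup_params_hash": "d0",
    "quality_model_version": "sft-quality-v1.2",
    "toxicity_model_version": "perspective-v7.1",
    "tokenizer_version": "cl100k_base-v1",
}
built = {dataset_version_hash(inputs)}

unchanged = needs_build(inputs, built)
upgraded = needs_build({**inputs, "toxicity_model_version": "perspective-v7.2"}, built)
print(unchanged, upgraded)
```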

Step 7: Training Data Lineage (EU AI Act Compliance)

Per Atlan’s training data lineage guide, EU AI Act Article 10 (August 2026 enforcement) requires documented data provenance for LLM training data. This is now a legal requirement, not just good practice.

Lineage record per dataset version (stored in a governance database):

{
  "dataset_version": "v2026-04-11-abc123",
  "creation_timestamp": "2026-04-11T08:23:14Z",
  "source_systems": [
    {"name": "ChatGPT", "collection_date_range": "2026-04-01 to 2026-04-11",
     "consent_mechanism": "Terms of Service v3.2", "license": "proprietary"}
  ],
  "transformations": [
    {"stage": "pii_redaction", "model": "presidio-v2.4.0", "params_hash": "def456"},
    {"stage": "deduplication", "strategy": "minhash_lsh_0.7", "params_hash": "ghi789"},
    {"stage": "quality_filter", "model": "sft-quality-v1.2", "threshold": 0.6},
    {"stage": "toxicity_filter", "model": "perspective-v7.1", "categories": [...]}
  ],
  "quality_metrics": {...},
  "approved_by": "RLHF team lead",
  "training_runs_using_this_dataset": ["gpt-5-finetune-run-001", "gpt-5-test-003"]
}

This lineage record answers: “Which version of this model used this data?” and “If we find a data problem, which models are affected?”
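The second question becomes a simple query once the manifests are loaded. A sketch, assuming lineage records shaped like the one above are available as a list of dicts (the `affected_runs` helper and the second manifest are hypothetical):

```python
def affected_runs(manifests: list[dict], stage: str, model: str) -> dict[str, list[str]]:
    """Map dataset_version -> training runs for datasets built with a given stage model."""
    hits = {}
    for m in manifests:
        used = any(t["stage"] == stage and t.get("model") == model
                   for t in m["transformations"])
        if used:
            hits[m["dataset_version"]] = m["training_runs_using_this_dataset"]
    return hits


manifests = [
    {
        "dataset_version": "v2026-04-11-abc123",
        "transformations": [
            {"stage": "deduplication", "strategy": "minhash_lsh_0.7"},
            {"stage": "toxicity_filter", "model": "perspective-v7.1"},
        ],
        "training_runs_using_this_dataset": ["gpt-5-finetune-run-001", "gpt-5-test-003"],
    },
    {
        "dataset_version": "v2026-04-12-def456",   # hypothetical later version
        "transformations": [
            {"stage": "toxicity_filter", "model": "perspective-v7.2"},
        ],
        "training_runs_using_this_dataset": [],
    },
]

impacted = affected_runs(manifests, "toxicity_filter", "perspective-v7.1")
print(impacted)
```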

Interview Questions

Q1: “Your toxicity classifier missed harmful content in 0.3% of conversations. This dataset was used for a training run. What do you do?”

Model Answer: “This is a data incident with potential model safety impact. Three immediate actions.

First, impact scope: query the lineage database to find which training runs used this dataset version. If a model trained on this data is already in production, escalate to the safety team immediately — this may require model evaluation for the affected behavior and potentially pulling the model from production.

Second, dataset remediation: re-run toxicity filtering on the affected dataset with the improved classifier. The version hash changes (new toxicity model version = new hash). The new clean dataset is a separate version — the old one is retained with a flag ‘known_contaminated’ for audit purposes, but blocked from future training runs. Never delete old dataset versions — they’re part of the regulatory audit trail.

Third, system improvement: tighten the toxicity gate threshold (0.1% was insufficient; lower it to 0.05%). Add a second-pass human review for flagged-but-borderline content. Increase the random audit sample rate for the next 30 days to monitor the new classifier’s performance.

Root cause analysis: was this a distribution shift (new types of harmful content not in the classifier’s training data), a threshold misconfiguration, or a model regression? Different causes have different fixes. I’d also examine the gate that should have caught this: a 0.1% human audit of 1.6B conversations is 1.6M reviewed samples, and at a 0.3% miss rate roughly 4,800 of those samples would contain harmful content. So the investigation question is: did the human audit actually catch it, but it wasn’t escalated properly?”
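The audit arithmetic behind that last point can be checked directly. This sketch assumes the audit is a uniform random sample and that harmful items occur independently at the stated 0.3% rate. Both function names are illustrative.

```python
import math


def p_audit_misses(sample_size: int, harmful_rate: float) -> float:
    """Probability a random audit sample contains zero harmful items."""
    return math.exp(sample_size * math.log1p(-harmful_rate))


def sample_size_for_detection(harmful_rate: float, confidence: float) -> int:
    """Smallest sample that contains at least one harmful item with this probability."""
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-harmful_rate))


audit_size = 1_600_000_000 // 1000       # 0.1% of ~1.6B conversations
expected_hits = audit_size * 0.003       # harmful items expected in the audit

print(audit_size, round(expected_hits))
print(p_audit_misses(audit_size, 0.003))
print(sample_size_for_detection(0.003, 0.99))
```

Under these assumptions even a sample of a couple of thousand conversations would almost certainly surface the problem, which is why the escalation path is the more likely failure point than the sample size.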

Q2: “How do you ensure the fine-tuning dataset is reproducible 18 months later for a regulatory audit?”

Model Answer: “Dataset reproducibility is a hard requirement, not a goal. Three components.

First, immutable storage: every dataset version is written to S3 with object versioning enabled and a bucket policy that prevents deletions (Object Lock in compliance mode). The version hash is a content hash — the dataset content can be verified against the hash at any time. Retention policy: 7 years minimum (EU AI Act requirement).

Second, version pinning of all dependencies: the lineage manifest records the exact version of every component — PII model, quality classifier, toxicity classifier, tokenizer, Spark version, deduplication threshold. To reproduce the dataset: install the same versions, run the same pipeline on the same bronze data partitions. The output should be bitwise identical (pipeline is deterministic — same seed, same processing order, same random samples if any sampling is used).

Third, environment snapshot: the pipeline code is version-controlled (Git hash included in the lineage manifest). Docker images for each pipeline version are archived in a container registry with no-delete policy. This means 18 months later, you can spin up the exact same container, point it at the same bronze data, and produce the same output.

The governance database stores the lineage manifest with a cryptographic signature from the pipeline operator (me or the system service account). A regulatory auditor can verify: ‘the dataset used to train model X was produced by pipeline version Y from data source Z, with these quality metrics, approved by person W on this date.’ That’s a complete audit trail.”
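Manifest signing can be illustrated with the standard library. This sketch uses an HMAC over a canonical JSON encoding, which is a simplification: a shared-secret HMAC is an assumption made to keep the example self-contained, and a production system would more likely use an asymmetric signature managed by a key service. The key and manifest values are made up.

```python
import hashlib
import hmac
import json


def sign_manifest(manifest: dict, key: bytes) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the manifest."""
    body = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def verify_manifest(manifest: dict, signature: str, key: bytes) -> bool:
    """Constant-time comparison against a freshly computed signature."""
    return hmac.compare_digest(sign_manifest(manifest, key), signature)


key = b"demo-key-not-for-production"
manifest = {"dataset_version": "v2026-04-11-abc123", "approved_by": "RLHF team lead"}
signature = sign_manifest(manifest, key)

ok = verify_manifest(manifest, signature, key)
tampered = verify_manifest({**manifest, "approved_by": "someone else"}, signature, key)
print(ok, tampered)
```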

Self-Assessment: 5 Questions

Why must PII redaction happen BEFORE any human sees the data (not after quality filtering)?
What’s the two-stage deduplication approach and why do we need both stages?
What makes this pipeline idempotent — how does it handle reruns without duplicating data?
What are the 7 quality gates and why does each one exist?
How does the dataset version hash enable reproducibility 18 months later?

Quick Reference: LLM Training Data Pipeline

Layer order is critical: Ingest → PII redaction (BEFORE ANYTHING ELSE) → Bronze (immutable) → Dedup → Quality filter → Toxicity filter → Tokenize → Gold (versioned)
Dedup two-stage: Exact hash dedup (catches templates, O(n) with hash store) → MinHash LSH (near-dup, O(n log n) with LSH index, 128 permutations, Jaccard threshold 0.7)
Quality gates are BLOCKING: A failed gate stops the training run. Bad data = bad model shipped to hundreds of millions of users. Gates: size, PII audit, toxicity audit, language diversity, dedup rate, quality score, hash integrity
RLHF data model: one row per comparison (prompt, response_A, response_B, winner). On-policy preference data is critical — collect from current model checkpoint.
Immutability is mandatory: old dataset versions are NEVER modified or deleted. New processing logic = new dataset version with new hash. Audit trail requires all versions.
EU AI Act compliance: lineage manifest with source provenance, transformation log, quality metrics, approver, and list of model training runs that used each dataset version.

Tomorrow’s Preview

Day 39: Anthropic Data Engineering & Safety-First Design — Anthropic’s unique focus on AI safety, the progressive complexity interview style, distributed systems at scale, and what “safety-first data design” means when your data infrastructure shapes a model’s alignment. How Anthropic interviews differ from OpenAI and what they specifically look for in senior DE candidates.
