Day 48 — Embedding & Vector Data Pipelines

Phase 2: Deep Dives | Category: ML Data Infrastructure

Why This Is Critical for Your Target Companies

RAG is now core infrastructure. OpenAI is building the platform that powers it. Anthropic increasingly uses retrieval to ground responses. Meta uses embedding-based search for recommendations. Google uses vector search inside LLM infrastructure.

Per Jobright AI’s OpenAI interview guide: “Design a Vector Database: How to store and search billions of embeddings efficiently?” is an explicit OpenAI system design question.

What Embeddings Are (The Foundation)

An embedding is a dense numerical vector that represents the semantic meaning of a piece of data. The key property: semantically similar items have geometrically close vectors.

Text: "The cat sat on the mat" → [0.23, -0.45, 0.87, ..., 0.12]  (1536 dimensions)
Text: "A feline rested on the rug" → [0.25, -0.43, 0.85, ..., 0.14]  (close!)
Text: "Stock market crashed today" → [-0.78, 0.33, -0.21, ..., 0.56]  (far away)

Why 1536 dimensions? OpenAI’s text-embedding-3-small produces 1536-dim vectors (text-embedding-3-large produces 3072). More dimensions can capture finer distinctions and often improve retrieval quality, with diminishing returns. Trade-off: storage and per-comparison compute grow linearly with dimensions.

Similarity measurement: cosine similarity (angle between vectors, not magnitude) is standard for text embeddings:

cosine_similarity(v1, v2) = dot(v1, v2) / (|v1| × |v2|)
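# Range: -1 (opposite direction) to 1 (same direction)
# Threshold: > 0.7 is a common starting point for "semantically related", but the right cutoff depends on the model and the data

# Inner product = cosine similarity for unit-normalized vectors
# OpenAI embeddings are already unit-normalized, and many vector DBs can normalize at index time → cosine search reduces to inner product search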
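The formula above can be checked with a few lines of standard-library Python. This is a minimal sketch: the 3-dim vectors are made-up toy values (not real model output), and the helper names are illustrative only.

```python
import math
from typing import List, Sequence


def inner_product(v1: Sequence[float], v2: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(v1, v2))


def norm(v: Sequence[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(inner_product(v, v))


def cosine_similarity(v1: Sequence[float], v2: Sequence[float]) -> float:
    """Cosine of the angle between v1 and v2; magnitude is ignored."""
    n1, n2 = norm(v1), norm(v2)
    if n1 == 0.0 or n2 == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return inner_product(v1, v2) / (n1 * n2)


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    n = norm(v)
    if n == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [a / n for a in v]


# Toy vectors (hypothetical values chosen for illustration)
cat = [0.23, -0.45, 0.87]
feline = [0.25, -0.43, 0.85]
stocks = [-0.78, 0.33, -0.21]
```

After normalizing, `inner_product(normalize(cat), normalize(feline))` gives the same value as `cosine_similarity(cat, feline)`, which is why an index over unit vectors can use the cheaper inner product.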

Embedding Generation at Scale

The compute challenge: generating embeddings for 1 billion documents:

1B documents × avg 500 tokens/doc = 500B tokens
500B tokens × $0.02/1M tokens (text-embedding-3-small) = $10,000 for the full corpus
+ 1B × 1536 dims × 4 bytes (float32) ≈ 6.1 TB storage
+ Time: at 10K docs/sec throughput → ~28 hours

For a 100M document corpus: $1,000 compute + ~614 GB storage, which is manageable
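The back-of-envelope estimate above is easy to parameterize. The sketch below assumes one embedding per document, float32 storage, and the price and throughput figures quoted in this section; the function and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EmbeddingEstimate:
    api_cost_usd: float
    storage_bytes: int
    wall_clock_hours: float


def estimate_embedding_job(
    num_docs: int,
    avg_tokens_per_doc: int = 500,
    usd_per_million_tokens: float = 0.02,  # text-embedding-3-small price used above
    dims: int = 1536,
    bytes_per_dim: int = 4,                # float32
    docs_per_second: float = 10_000.0,     # assumed sustained throughput
) -> EmbeddingEstimate:
    """Estimate API cost, raw vector storage, and run time for a corpus."""
    total_tokens = num_docs * avg_tokens_per_doc
    return EmbeddingEstimate(
        api_cost_usd=total_tokens / 1_000_000 * usd_per_million_tokens,
        storage_bytes=num_docs * dims * bytes_per_dim,
        wall_clock_hours=num_docs / docs_per_second / 3600,
    )


billion = estimate_embedding_job(1_000_000_000)
hundred_million = estimate_embedding_job(100_000_000)
```

Changing `dims` or `usd_per_million_tokens` shows how quickly a larger model moves the storage and cost lines.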

Efficient batch embedding generation (Spark + async API calls):

import asyncio
from typing import Dict, Iterator, List

import openai
from openai import AsyncOpenAI

async def embed_batch(texts: List[str], client: AsyncOpenAI, max_retries: int = 5) -> List[List[float]]:
    """Generate embeddings for a batch of texts, retrying on rate limits."""
    for attempt in range(max_retries):
        try:
            response = await client.embeddings.create(
                model="text-embedding-3-small",  # 1536 dims, $0.02/1M tokens
                input=texts,                    # batch up to 2048 texts at once
                encoding_format="float"
            )
            return [item.embedding for item in response.data]
        except openai.RateLimitError:
            if attempt == max_retries - 1:
                raise                           # give up instead of retrying forever
            await asyncio.sleep(min(60, 2 ** attempt))  # exponential backoff, capped at 60s

def embed_partition(iterator: Iterator[Dict], batch_size: int = 256) -> Iterator[Dict]:
    """Embed a Spark partition of documents (each a dict with a "chunk_text" key)."""
    loop = asyncio.new_event_loop()             # one event loop per partition, reused for every batch
    client = AsyncOpenAI()                      # reads OPENAI_API_KEY from the environment

    def flush(buffer: List[Dict]) -> List[Dict]:
        embeddings = loop.run_until_complete(
            embed_batch([d["chunk_text"] for d in buffer], client)
        )
        return [{**doc, "embedding": emb} for doc, emb in zip(buffer, embeddings)]

    try:
        buffer = []
        for doc in iterator:
            buffer.append(doc)
            if len(buffer) >= batch_size:
                yield from flush(buffer)
                buffer = []
        if buffer:
            yield from flush(buffer)
    finally:
        loop.run_until_complete(client.close())
        loop.close()

# Rows are converted to dicts first because {**doc} needs a mapping:
# embedded_df = documents_df.rdd.map(lambda r: r.asDict()).mapPartitions(embed_partition).toDF()

Cost vs quality trade-off of embedding models:

Model	Dimensions	Cost / 1M tokens	Retrieval quality	Use case
text-embedding-3-small	1536	$0.02	Good	Most RAG applications
text-embedding-3-large	3072	$0.13	Excellent	High-stakes retrieval
BGE-M3 (open source)	1024	$0 (compute)	Very good	Self-hosted, privacy-sensitive
GTE-Qwen2 (open source)	1536 (1.5B) / 3584 (7B)	$0 (compute)	Excellent	Enterprise, no API dependency

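For OpenAI/Anthropic interviews: calling a hosted embedding API (such as OpenAI’s) adds an external dependency and network latency to both ingestion and every query. Self-hosted embedding models remove the per-token API cost and the external dependency but require GPU infrastructure that you operate.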

ANN Indexing: How Vector Search Works at Scale

Exact nearest neighbor search (brute force) is too slow for large corpora. Approximate Nearest Neighbor (ANN) indexes trade a small accuracy reduction for massive speed gains.

HNSW (Hierarchical Navigable Small World)

The dominant ANN algorithm for production use. Conceptually:

Layer 2 (sparse): A ─────────────────── D
                              │
Layer 1 (medium): A ────── B ──── C ── D
                              │         │
Layer 0 (dense): A ── A2 ── B ── B2 ── C ── C2 ── D ── D2

Search: start at layer 2, greedily navigate to nearest node
        descend to layer 1, refine, descend to layer 0, final answer

HNSW parameters:

M (16-64): connections per node. Higher = better recall, more memory.
ef_construction (100-500): build-time quality. Higher = better index, slower build.
ef_search (50-200): query-time quality. Higher = better recall, slower query.
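The layered descent in the diagram can be sketched as a greedy walk repeated per layer. This is a deliberately simplified illustration, not a full HNSW implementation: it keeps a single candidate (real HNSW keeps a beam of ef_search candidates), uses hand-built layers instead of randomized insertion, and places the diagram's nodes on a 1-D line. All names are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]
Graph = Dict[str, List[str]]  # node id -> neighbour ids on one layer


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def greedy_search_layer(graph: Graph, vectors: Dict[str, Vector], query: Vector, entry: str) -> str:
    """Move to the closest neighbour until no neighbour is closer than the current node."""
    current = entry
    while True:
        best = min(
            graph.get(current, []),
            key=lambda n: squared_distance(vectors[n], query),
            default=current,
        )
        if squared_distance(vectors[best], query) >= squared_distance(vectors[current], query):
            return current
        current = best


def layered_search(layers: List[Graph], vectors: Dict[str, Vector], query: Vector, entry: str) -> str:
    """Search from the sparsest layer down, reusing each result as the next entry point."""
    for graph in layers:
        entry = greedy_search_layer(graph, vectors, query, entry)
    return entry


def chain(nodes: List[str]) -> Graph:
    """Link each node to its immediate left and right neighbour."""
    graph: Graph = {n: [] for n in nodes}
    for left, right in zip(nodes, nodes[1:]):
        graph[left].append(right)
        graph[right].append(left)
    return graph


names = ["A", "A2", "B", "B2", "C", "C2", "D", "D2"]
vectors = {name: [float(i)] for i, name in enumerate(names)}  # toy 1-D positions
layers = [chain(["A", "D"]), chain(["A", "B", "C", "D"]), chain(names)]  # layer 2, 1, 0
```

For a query at position 4.9, the walk goes A → D on layer 2, D → C on layer 1, and C → C2 on layer 0, touching only a handful of nodes.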

HNSW trade-offs:

Excellent recall (95-99%) at low latency (often 5-15ms for ~1M vectors)
High memory usage
Supports incremental additions (no full rebuild on insert)
Requires sharding for very large corpora

IVF (Inverted File Index)

Groups vectors into K clusters (k-means). Search: embed query → find nearest clusters → search within those clusters.

1M vectors split into K=1000 clusters ≈ 1000 vectors/cluster
Query: compare against 1000 cluster centroids → find top-3 nearest clusters
       search only those 3 clusters (~3000 vectors), so ~4000 distance computations in total (not 1M) → ~250x fewer

nprobe = number of clusters to search (higher = better recall, slower)
nprobe=3: fast but lower recall
nprobe=50: slower but higher recall
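The effect of nprobe can be seen in a toy inverted-file index. This sketch assumes the centroids are already given (the k-means training step is omitted) and uses four hand-picked 2-D vectors; function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors: Dict[int, Vector], centroids: List[Vector]) -> Dict[int, List[int]]:
    """Assign every vector id to the inverted list of its nearest centroid."""
    lists: Dict[int, List[int]] = defaultdict(list)
    for vec_id, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda c: squared_distance(centroids[c], vec))
        lists[nearest].append(vec_id)
    return dict(lists)


def ivf_search(
    query: Vector,
    vectors: Dict[int, Vector],
    centroids: List[Vector],
    inverted_lists: Dict[int, List[int]],
    nprobe: int = 1,
    top_k: int = 1,
) -> List[int]:
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: squared_distance(centroids[c], query))[:nprobe]
    candidates = [vec_id for c in probe for vec_id in inverted_lists.get(c, [])]
    candidates.sort(key=lambda vec_id: squared_distance(vectors[vec_id], query))
    return candidates[:top_k]


centroids = [[0.0, 0.0], [10.0, 0.0]]
vectors = {0: [1.0, 0.0], 1: [4.8, 0.0], 2: [6.0, 0.0], 3: [9.0, 0.0]}
inverted_lists = build_ivf(vectors, centroids)
query = [5.1, 0.0]
```

Here the true nearest neighbour (id 1) sits just across the cluster boundary, so nprobe=1 misses it and returns id 2, while nprobe=2 finds it. That is the recall versus speed trade-off in miniature.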

IVF-PQ (Product Quantization for Memory Efficiency)

Compresses vectors for memory efficiency. Necessary when the corpus is too large for full-precision in-memory search.

Memory comparison for 1B vectors at 1536 dims:

Flat (exact):    1B × 1536 × 4 bytes ≈ 6.1 TB  ← impractical to hold in one node's memory
HNSW:            full vectors + graph links (larger than flat; still enormous without sharding)
IVF-PQ (m=64 subquantizers, 8 bits each):   1B × 64 bytes ≈ 64 GB  ← feasible with compression (recall trade-off)
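The memory arithmetic above can be reproduced with two small helpers. This is a rough sketch: it counts only raw vectors and PQ codes, and ignores codebooks, inverted-list bookkeeping, and ids. The function names are hypothetical.

```python
def flat_index_bytes(num_vectors: int, dims: int, bytes_per_dim: int = 4) -> int:
    """Raw size of full-precision vectors (float32 by default)."""
    return num_vectors * dims * bytes_per_dim


def pq_code_bytes(num_vectors: int, num_subquantizers: int, bits_per_code: int = 8) -> int:
    """Size of product-quantized codes: one small code per subquantizer per vector."""
    return num_vectors * num_subquantizers * bits_per_code // 8


flat = flat_index_bytes(1_000_000_000, 1536)
compressed = pq_code_bytes(1_000_000_000, 64)
compression_ratio = flat / compressed
```

With 64 one-byte codes per vector, the codes are 96 times smaller than the float32 vectors, which is where the drop from terabytes to tens of gigabytes comes from.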

Chunking Strategy: The Most Impactful RAG Design Decision

Each chunk gets one embedding. If a chunk is too large, the embedding averages too much meaning → poor retrieval precision. Too small → loses context → the LLM can’t answer well.

Four chunking strategies:

1. Fixed-size (simplest)

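from langchain.text_splitter import RecursiveCharacterTextSplitter

# Measure chunk length in tokens (via tiktoken), not characters.
# Passing length_function=len would count characters instead.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # tokenizer used by the text-embedding-3 models
    chunk_size=512,      # tokens per chunk
    chunk_overlap=50,    # overlap to preserve context across boundaries
)
chunks = splitter.split_text(document)  # document: str holding the parsed text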

2. Semantic (boundary-aware)

Split at natural document boundaries (paragraphs/sections/headings).

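def approx_token_count(text: str) -> int:
    # Rough heuristic (~4 characters per token for English); swap in a real tokenizer for accuracy.
    return max(1, len(text) // 4)

def semantic_split(text: str, max_tokens: int = 512, count_tokens=approx_token_count):
    """Pack whole paragraphs into chunks of at most max_tokens tokens."""
    chunks = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if count_tokens(candidate) <= max_tokens:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para  # a single oversized paragraph is kept whole here
    if current:
        chunks.append(current)
    return chunks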

3. Hierarchical (parent-child)

Store chunks at multiple granularities: document → section → paragraph. When a paragraph matches, return its parent section for richer context.

Document: "2026 Annual Report"
  Section: "Financial Results" → embedding_1
    Paragraph: "Revenue grew 15%" → embedding_2 (child of embedding_1)
    Paragraph: "EBITDA margin expanded" → embedding_3

Query "revenue growth" matches embedding_2 (paragraph)
Return: full "Financial Results" section (parent) for LLM context
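The parent lookup in this example can be sketched with a small in-memory structure. The `Chunk` class and `expand_to_parents` function are hypothetical names, and the matched ids are assumed to come from a vector search that is not shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Chunk:
    chunk_id: str
    text: str
    parent_id: Optional[str] = None  # None for top-level chunks


def expand_to_parents(matched_ids: List[str], chunks: Dict[str, Chunk]) -> List[Chunk]:
    """Swap each matched chunk for its parent (if it has one), keeping order and dropping duplicates."""
    seen = set()
    expanded: List[Chunk] = []
    for chunk_id in matched_ids:
        chunk = chunks[chunk_id]
        target = chunks[chunk.parent_id] if chunk.parent_id else chunk
        if target.chunk_id not in seen:
            seen.add(target.chunk_id)
            expanded.append(target)
    return expanded


chunks = {
    "sec-1": Chunk("sec-1", "Financial Results: Revenue grew 15%. EBITDA margin expanded."),
    "par-1": Chunk("par-1", "Revenue grew 15%", parent_id="sec-1"),
    "par-2": Chunk("par-2", "EBITDA margin expanded", parent_id="sec-1"),
}
context = expand_to_parents(["par-1", "par-2"], chunks)
```

Both paragraph hits collapse into a single copy of the parent section, so the LLM prompt is not padded with the same section twice.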

4. Rule-based (domain-specific)

For structured documents: code files (split by function), legal docs (split by clause), medical records (split by section). Usually best for domain-specific RAG.
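As one illustration of rule-based chunking, Python source can be split at top-level function and class boundaries using the standard-library `ast` module. This is a sketch for Python files only; `split_python_by_definition` is a hypothetical helper, and module-level statements and decorators are not included in the chunks.

```python
import ast
from typing import List


def split_python_by_definition(source: str) -> List[str]:
    """Return one chunk per top-level function or class in a Python source string."""
    tree = ast.parse(source)
    chunks: List[str] = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                chunks.append(segment)
    return chunks


SAMPLE = '''import os

def load(path):
    return open(path).read()

def clean(text):
    return text.strip()
'''

code_chunks = split_python_by_definition(SAMPLE)
```

Each chunk is a complete definition, so an embedding never straddles two unrelated functions.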

Rule of thumb: 256-512 token chunks with ~10% overlap. Validate with real queries.

Vector Database Selection

Database	Deployment	p50 latency (1M)	Hybrid search	Best for
Qdrant	Self-hosted / Managed	~6ms	✅ Native	OSS performance, cost-sensitive
Pinecone	Fully managed	~8ms	✅ Sparse-dense	Zero-ops production, burst traffic
Weaviate	Self-hosted / Managed	~12ms	✅ BM25+vector	Rich metadata filtering, multimodal
pgvector	Postgres extension	20-50ms (< 1M)	❌ Manual	Existing Postgres, < 1M vectors
Milvus	Self-hosted	~10ms	✅	Large-scale, billion vectors
ChromaDB	Local / Self-hosted	~15ms	❌ Manual	Prototyping
FAISS	Library (not a DB)	2-5ms	❌	Custom implementations, batch

Guideline: if you have fewer than ~1M vectors and are already on Postgres, pgvector can be the right answer; don’t add infra complexity.

pgvector for small-scale RAG:

-- Enable pgvector
CREATE EXTENSION IF NOT EXISTS vector;

-- Create table with vector column
CREATE TABLE document_chunks (
    id          UUID PRIMARY KEY,
    document_id TEXT NOT NULL,
    chunk_text  TEXT NOT NULL,
    embedding   vector(1536),    -- OpenAI text-embedding-3-small
    metadata    JSONB,
    created_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Create HNSW index for fast ANN search
CREATE INDEX ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

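-- ANN semantic search
-- A SELECT alias cannot be referenced in WHERE, so the similarity filter
-- is applied to the 10 nearest rows returned by the index scan.
SELECT chunk_text, metadata, cosine_similarity
FROM (
    SELECT
        chunk_text,
        metadata,
        1 - (embedding <=> $1::vector) AS cosine_similarity
    FROM document_chunks
    ORDER BY embedding <=> $1::vector  -- cosine distance, uses the HNSW index
    LIMIT 10
) AS nearest
WHERE cosine_similarity > 0.7
ORDER BY cosine_similarity DESC;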

The Production RAG Pipeline

This is the full architecture OpenAI/Anthropic interviewers expect you to draw:

INGESTION PIPELINE (offline, async):
  ┌──────────────────────────────────────────────────────────────┐
  │  Document Sources (PDFs, web pages, DBs, code repos)        │
  │       ↓ async queue (BullMQ / Kafka)                        │
  │  Document Parser (Apache Tika / Unstructured.io)            │
  │       ↓ clean text + metadata                               │
  │  Chunker (semantic or fixed-size, 256-512 tokens)           │
  │       ↓ chunks with metadata                                │
  │  Embedding Model (text-embedding-3-small or BGE-M3)         │
  │       ↓ batch API calls, async, with retry                  │
  │  Vector Store (Qdrant / Pinecone / pgvector)                │
  │       + Metadata: source, date, document_id, chunk_idx      │
  │       + Full text in separate store (Postgres) for reranking│
  └──────────────────────────────────────────────────────────────┘

QUERY PIPELINE (online, p99 < 200ms goal for retrieval stage):
  ┌──────────────────────────────────────────────────────────────┐
  │  User query                                                  │
  │       ↓ cache check (Redis, TTL 5 min, query hash key)      │
  │  Query preprocessing (expand abbreviations, resolve coref)   │
  │       ↓ optionally rewrite with LLM for better retrieval     │
  │  Embedding (same model as ingestion — CRITICAL)              │
  │       ↓ query vector                                         │
  │  Hybrid Retrieval (parallel):                                │
  │    ├─ Dense: ANN vector search → top-50 chunks               │
  │    └─ Sparse: BM25 keyword search → top-50 chunks            │
  │       ↓ RRF fusion → top-100 candidates                      │
  │  Cross-encoder Reranker (Cohere Rerank v3 or a local model)  │
  │       ↓ top-10 highest relevance chunks                      │
  │  Context injection into LLM prompt                           │
  │       ↓ "Answer based on these documents: [chunks]"          │
  │  LLM generation (streaming response)                         │
  │       ↓ with source citations (chunk → document lineage)     │
  │  Cache response, return to user                              │
  └──────────────────────────────────────────────────────────────┘
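The RRF fusion step in the query pipeline is small enough to sketch directly. Reciprocal Rank Fusion scores each chunk as the sum of 1 / (k + rank) over the ranked lists it appears in; k = 60 is the commonly used constant and is an assumption here, as are the chunk ids.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]],
    k: int = 60,
    top_n: Optional[int] = None,
) -> List[str]:
    """Fuse several ranked lists of chunk ids into one list ordered by RRF score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic
    fused = sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))
    return fused if top_n is None else fused[:top_n]


dense_hits = ["c1", "c2", "c3"]   # hypothetical ANN results, best first
sparse_hits = ["c3", "c1", "c4"]  # hypothetical BM25 results, best first
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Because RRF uses only ranks, it needs no score calibration between the cosine similarities of the dense retriever and the BM25 scores of the sparse one.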

Embedding Pipeline Monitoring: What to Track

Embedding/model version drift: if the embedding model version changes, existing embeddings may become incompatible. Version-tag all embeddings. On upgrade, reindex the corpus into a new index and cut over atomically.
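The version-tag and cut-over idea can be sketched with a toy in-memory registry. This stands in for whatever alias or collection-swap mechanism the chosen vector store provides; the class and method names are hypothetical.

```python
from typing import Dict, List, Optional


class VersionedIndexRegistry:
    """Toy stand-in for a vector store holding one index per embedding model version."""

    def __init__(self) -> None:
        self._indexes: Dict[str, Dict[str, List[float]]] = {}
        self.live_version: Optional[str] = None

    def write(self, model_version: str, chunk_id: str, vector: List[float]) -> None:
        """Write a vector into the index that belongs to its model version."""
        self._indexes.setdefault(model_version, {})[chunk_id] = vector

    def cut_over(self, model_version: str) -> Optional[str]:
        """Point live traffic at another index; return the previous version for rollback."""
        if model_version not in self._indexes:
            raise KeyError(f"no index built for version {model_version!r}")
        previous, self.live_version = self.live_version, model_version
        return previous

    def searchable_vectors(self, query_model_version: str) -> Dict[str, List[float]]:
        """Refuse to search when the query was embedded with a different model version."""
        if query_model_version != self.live_version:
            raise ValueError("query and index embedding versions differ")
        return self._indexes[query_model_version]
```

The new index is filled while the old one keeps serving, and the single assignment in `cut_over` is the only moment traffic changes.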

Retrieval quality metrics (RAGAS framework):

Context precision
Context recall
Answer faithfulness
Answer relevance
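RAGAS computes these metrics with LLM-based judgments. As a simpler, label-based approximation of the two context metrics, precision@k and recall@k can be computed from a hand-labeled set of relevant chunks. The sketch below is not the RAGAS implementation, and the function names are illustrative.

```python
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are labeled relevant."""
    top = list(retrieved[:k])
    if not top:
        return 0.0
    return sum(1 for chunk_id in top if chunk_id in relevant) / len(top)


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all labeled-relevant chunks that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


retrieved = ["a", "b", "c", "d"]  # hypothetical retrieval output, best first
relevant = {"a", "d", "e"}        # hypothetical ground-truth labels
```

Tracking both over a fixed query set makes regressions visible when chunking or the embedding model changes.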

Operational metrics:

p50/p95/p99 vector search latency
Cache hit rate (query hash)
Index freshness (time since last document indexed)
Embedding generation queue depth (backlog)
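The latency percentiles in this list can be computed from raw samples with the nearest-rank method. A minimal sketch, assuming latencies are collected in milliseconds; production systems usually rely on histogram-based metrics instead of sorting raw samples.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize search latency at the three percentiles tracked above."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


summary = latency_summary(list(range(1, 101)))  # toy data: 1 ms .. 100 ms
```

Alerting on p99 instead of the mean catches the slow tail that users actually notice.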

Interview Questions

Q1: “Design a RAG pipeline for an enterprise where employees can ask questions about 10 million internal documents.”

Model Answer: “I’d design this as two pipelines. Ingestion: documents from SharePoint, Confluence, and internal wikis land in a processing queue (Kafka). Each document is parsed (Apache Tika for PDFs, Mammoth for DOCX), chunked semantically (paragraph boundaries, 512 tokens max), and embedded with a self-hosted model (BGE-M3, so confidential documents never go to an external API). Chunks and embeddings are written to Qdrant (HNSW) with metadata (source system, document date, access group); the full chunk text is stored in Postgres for reranking. Sizing: 10M documents × ~20 chunks/doc ≈ 200M vectors, which at 1024 dims × 4 bytes is roughly 0.8 TB of raw vectors, so the index has to be sharded or quantized. Query pipeline: Redis cache → embed query (same model) → hybrid retrieval (Qdrant dense top-50 + Elasticsearch BM25 top-50) → RRF fusion → cross-encoder reranker → top-10 chunks → inject into the prompt and cite sources.”

Security: each document tagged with access groups; retrieval enforces filters so users only retrieve authorized content.
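The access check can be sketched as a filter over retrieved candidates. The names are hypothetical, and this version filters after retrieval for clarity; in practice the same condition is usually pushed into the vector store query as a metadata filter so unauthorized chunks do not use up the top-k.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class RetrievedChunk:
    chunk_id: str
    score: float
    access_groups: FrozenSet[str]


def filter_authorized(candidates: List[RetrievedChunk], user_groups: FrozenSet[str]) -> List[RetrievedChunk]:
    """Keep only chunks that share at least one access group with the user."""
    return [c for c in candidates if c.access_groups & user_groups]


candidates = [
    RetrievedChunk("hr-policy", 0.91, frozenset({"hr", "all-staff"})),
    RetrievedChunk("salary-bands", 0.88, frozenset({"hr"})),
    RetrievedChunk("eng-runbook", 0.80, frozenset({"engineering"})),
]
visible = filter_authorized(candidates, frozenset({"engineering", "all-staff"}))
```

An engineer in the `all-staff` group sees the policy and the runbook but never the HR-only salary document, regardless of how well it matched the query.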

Q2: “You’re using OpenAI’s embedding API, and the model version changes. How does this affect your vector database?”

Model Answer: “A model version change can change the vector space — new vectors are not semantically comparable to old vectors. You cannot mix versions in the same index. Remediation: tag every vector with embedding model version. When the model changes, build a new index, re-embed the corpus, populate the new index, then cut traffic over atomically. Keep the old index for rollback. If using API models, version-pin and test upgrades on a shadow index.”

Think About This

You’re in an Anthropic interview. The prompt: “Claude.ai has a projects feature where users upload documents and ask questions about them. Each user can upload up to 1000 documents. There are 50 million users. Design the vector database architecture.”

Walk through:

How many total vectors? (order-of-magnitude estimate; compression likely required)
How do you isolate user data? (shared index + namespace/metadata filter vs per-user index)
What’s the latency target? (end-to-end UX; retrieval budget)
How do you shard? (e.g., hash(user_id) routing to shard)
Upload path: parse → chunk → embed → write; track ingestion state
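Two of the walk-through items lend themselves to a quick sketch: the upper bound on vector count and the hash(user_id) routing. The 20 chunks per document figure is an assumption borrowed from Q1, real usage will sit far below the per-user cap, and the function names are hypothetical. A cryptographic hash is used because Python's built-in hash() for strings is randomized per process.

```python
import hashlib


def upper_bound_vectors(users: int, docs_per_user: int, chunks_per_doc: int) -> int:
    """Worst case: every user uploads the maximum number of documents."""
    return users * docs_per_user * chunks_per_doc


def shard_for_user(user_id: str, num_shards: int) -> int:
    """Route all of a user's vectors to one shard with a hash that is stable across processes."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


worst_case = upper_bound_vectors(50_000_000, 1000, 20)
shard = shard_for_user("user-12345", num_shards=64)
```

Keeping each user on a single shard means a query touches one shard only, and per-user isolation reduces to a filter on `user_id` within that shard.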

Quick Reference

Embedding = dense vector encoding semantic meaning; same model must produce query AND document embeddings
ANN indexes: Flat (exact, small), HNSW (best recall/speed, high memory), IVF (clustered), IVF-PQ (compressed, billion-scale)
Chunking is the biggest lever: fixed-size, semantic, hierarchical, rule-based; rule-of-thumb 256-512 tokens + overlap
Hybrid search: dense + sparse (BM25) + RRF fusion; commonly best in production
Vector DB selection: pgvector (< 1M), Qdrant (OSS perf), Pinecone (managed), Weaviate (metadata-heavy), Milvus (billion scale)
Monitoring: embedding version tags, RAGAS metrics, p99 search latency, index freshness, embedding backlog

Tomorrow’s Preview

Day 49: A/B Testing Data Infrastructure — Experiment assignment/logging, metrics pipelines, statistical significance in streaming, guardrail metrics, and how Netflix and Meta run experiments at scale.
