Phase 2: Deep Dives | Category: ML Data Infrastructure
Why This Is Critical for Your Target Companies
RAG is now core infrastructure. OpenAI is building the platform that powers it. Anthropic increasingly uses retrieval to ground responses. Meta uses embedding-based search for recommendations. Google uses vector search inside LLM infrastructure.
Per Jobright AI’s OpenAI interview guide: “Design a Vector Database: How to store and search billions of embeddings efficiently?” is an explicit OpenAI system design question.
What Embeddings Are (The Foundation)
An embedding is a dense numerical vector that represents the semantic meaning of a piece of data. The key property: semantically similar items have geometrically close vectors.
Text: "The cat sat on the mat" → [0.23, -0.45, 0.87, ..., 0.12] (1536 dimensions)
Text: "A feline rested on the rug" → [0.25, -0.43, 0.85, ..., 0.14] (close!)
Text: "Stock market crashed today" → [-0.78, 0.33, -0.21, ..., 0.56] (far away)
Why 1536 dimensions? OpenAI’s text-embedding-3-small produces 1536-dim vectors (text-embedding-3-large produces 3072). More dimensions generally allow a more nuanced representation and better retrieval quality, with diminishing returns. Trade-off: storage and computation cost grow linearly with dimensions.
Similarity measurement: cosine similarity (angle between vectors, not magnitude) is standard for text embeddings:
cosine_similarity(v1, v2) = dot(v1, v2) / (|v1| × |v2|)
# Range: -1 (opposite direction) to 1 (identical direction)
# Threshold: > 0.7 often indicates related text, but the cutoff is model-dependent; calibrate it on your own data
# Inner product = cosine similarity for unit-normalized vectors
# Many vector DBs store unit-normalized vectors, so cosine search reduces to inner product search
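The relationship between cosine similarity and inner product can be checked with a minimal sketch. The 4-dimensional vectors below are toy stand-ins for real embeddings (an assumption for illustration), and the helper names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(v1: Sequence[float], v2: Sequence[float]) -> float:
    """Cosine of the angle between v1 and v2 (magnitude is ignored)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)


def normalize(v: Sequence[float]) -> List[float]:
    """Scale v to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def inner_product(v1: Sequence[float], v2: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(v1, v2))


# Toy 4-dimensional stand-ins for real 1536-dim embeddings.
cat = [0.23, -0.45, 0.87, 0.12]
feline = [0.25, -0.43, 0.85, 0.14]
stocks = [-0.78, 0.33, -0.21, 0.56]

sim_close = cosine_similarity(cat, feline)   # near 1: similar direction
sim_far = cosine_similarity(cat, stocks)     # negative: pointing away
# After unit normalization, a plain inner product gives the same number.
ip_close = inner_product(normalize(cat), normalize(feline))
```

Because the normalized inner product matches the cosine value, an index that stores unit vectors can skip the per-query norm computation.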
Embedding Generation at Scale
The compute challenge: generating embeddings for 1 billion documents:
1B documents × avg 500 tokens/doc × $0.02/1M tokens (text-embedding-3-small)
  = $10,000 for the full corpus
+ 1B × 1536 dims × 4 bytes ≈ 6.1 TB storage
+ Time: at 10K docs/sec throughput → ≈ 28 hours
For a 100M document corpus: $1,000 compute + 600 GB storage — manageable
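The back-of-envelope numbers above can be reproduced with a small calculator. This is a sketch under the same assumptions as the text (500 tokens per document, one float32 embedding per document, 10K docs/sec); the function name and parameters are hypothetical.

```python
def estimate_embedding_job(
    num_docs: int,
    avg_tokens_per_doc: int = 500,
    usd_per_million_tokens: float = 0.02,
    dims: int = 1536,
    bytes_per_float: int = 4,
    docs_per_second: float = 10_000,
) -> dict:
    """Rough API cost, raw vector storage, and wall-clock time for one pass."""
    total_tokens = num_docs * avg_tokens_per_doc
    return {
        "api_cost_usd": total_tokens / 1_000_000 * usd_per_million_tokens,
        "storage_tb": num_docs * dims * bytes_per_float / 1e12,
        "hours": num_docs / docs_per_second / 3600,
    }


billion = estimate_embedding_job(1_000_000_000)
hundred_million = estimate_embedding_job(100_000_000)
```

Storage here counts raw vectors only; index structures and metadata add to it.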
Efficient batch embedding generation (Spark + async API calls):
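import asyncio
from typing import Dict, Iterator, List

import openai
from openai import AsyncOpenAI


async def embed_batch(
    texts: List[str], client: AsyncOpenAI, max_retries: int = 5
) -> List[List[float]]:
    """Generate embeddings for a batch of texts, retrying on rate limits."""
    for attempt in range(max_retries):
        try:
            response = await client.embeddings.create(
                model="text-embedding-3-small",  # 1536 dims, $0.02/1M tokens
                input=texts,  # batch up to 2048 texts at once
                encoding_format="float",
            )
            return [item.embedding for item in response.data]
        except openai.RateLimitError:
            # Exponential backoff (1s, 2s, 4s, ...) capped at 60s. The retry count
            # is bounded so a persistent rate limit cannot recurse forever.
            await asyncio.sleep(min(60, 2 ** attempt))
    raise RuntimeError(f"Still rate limited after {max_retries} attempts")


def embed_partition(iterator: Iterator[Dict], batch_size: int = 256) -> Iterator[Dict]:
    """Embed a Spark partition of documents (each element is a dict with "chunk_text")."""
    client = AsyncOpenAI()
    # One event loop per partition, so the async client is reused on the same loop
    # for every batch instead of a fresh loop per asyncio.run() call.
    loop = asyncio.new_event_loop()

    def flush(buffer: List[Dict]) -> Iterator[Dict]:
        texts = [d["chunk_text"] for d in buffer]
        embeddings = loop.run_until_complete(embed_batch(texts, client))
        for doc, emb in zip(buffer, embeddings):
            yield {**doc, "embedding": emb}

    try:
        buffer: List[Dict] = []
        for doc in iterator:
            buffer.append(doc)
            if len(buffer) >= batch_size:
                yield from flush(buffer)
                buffer = []
        if buffer:  # final partial batch
            yield from flush(buffer)
    finally:
        loop.close()


# Rows are converted to dicts first so that {**doc, ...} works:
# embedded_df = documents_df.rdd.map(lambda r: r.asDict()).mapPartitions(embed_partition).toDF()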
Cost vs quality trade-off of embedding models:
| Model | Dimensions | Cost / 1M tokens | Retrieval quality | Use case |
|---|---|---|---|---|
| text-embedding-3-small | 1536 | $0.02 | Good | Most RAG applications |
| text-embedding-3-large | 3072 | $0.13 | Excellent | High-stakes retrieval |
| BGE-M3 (open source) | 1024 | No API fee (self-hosted compute) | Very good | Self-hosted, privacy-sensitive |
| GTE-Qwen2 (open source) | 1536 (1.5B model) / 3584 (7B model) | No API fee (self-hosted compute) | Excellent | Enterprise, no API dependency |
For OpenAI/Anthropic interviews: calling a hosted embedding API such as OpenAI’s adds an external dependency and network latency. Self-hosted embedding models eliminate the API cost and the external dependency but require GPU infrastructure.
ANN Indexing: How Vector Search Works at Scale
Exact nearest neighbor search (brute force) is too slow for large corpora. Approximate Nearest Neighbor (ANN) indexes trade a small accuracy reduction for massive speed gains.
HNSW (Hierarchical Navigable Small World)
The dominant ANN algorithm for production use. Conceptually:
Layer 2 (sparse): A ─────────────────── D
│
Layer 1 (medium): A ────── B ──── C ── D
│ │
Layer 0 (dense): A ── A2 ── B ── B2 ── C ── C2 ── D ── D2
Search: start at layer 2, greedily navigate to nearest node
descend to layer 1, refine, descend to layer 0, final answer
HNSW parameters:
- M (16-64): connections per node. Higher = better recall, more memory.
- ef_construction (100-500): build-time quality. Higher = better index, slower build.
- ef_search (50-200): query-time quality. Higher = better recall, slower query.
HNSW trade-offs:
- Excellent recall (95-99%) at low latency (often 5-15ms for ~1M vectors)
- High memory usage
- Supports incremental additions (no full rebuild on insert)
- Requires sharding for very large corpora
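To make the role of ef_search concrete, here is a minimal single-layer sketch of the best-first search that HNSW runs within a layer. It is not a full HNSW: there is no layer hierarchy, and the graph is built by brute force as a stand-in for real incremental insertion. All names are hypothetical.

```python
import heapq
import math
import random
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_knn_graph(points: List[Point], m: int) -> Dict[int, List[int]]:
    """Brute-force stand-in for HNSW insertion: link every node to its m
    nearest neighbors and keep the links bidirectional."""
    links: Dict[int, set] = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        for j in sorted(others, key=lambda j: l2(p, points[j]))[:m]:
            links[i].add(j)
            links[j].add(i)
    return {i: sorted(nbrs) for i, nbrs in links.items()}


def search_layer(points, graph, query, entry: int, ef: int, k: int) -> List[int]:
    """Best-first graph search that tracks the ef closest nodes seen so far."""
    d0 = l2(query, points[entry])
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap: closest unexpanded node first
    best = [(-d0, entry)]        # max-heap via negation: current ef best results
    while candidates:
        dist, node = heapq.heappop(candidates)
        if dist > -best[0][0]:   # nearest candidate is worse than the worst result
            break
        for nbr in graph[node]:
            if nbr in visited:
                continue
            visited.add(nbr)
            d = l2(query, points[nbr])
            if len(best) < ef or d < -best[0][0]:
                heapq.heappush(candidates, (d, nbr))
                heapq.heappush(best, (-d, nbr))
                if len(best) > ef:
                    heapq.heappop(best)   # drop the current worst result
    ranked = sorted((-neg_d, node) for neg_d, node in best)
    return [node for _, node in ranked[:k]]


random.seed(0)
points = [(random.random(), random.random()) for _ in range(200)]
graph = build_knn_graph(points, m=8)
query = (0.5, 0.5)
approx = search_layer(points, graph, query, entry=0, ef=32, k=5)
exact = sorted(range(len(points)), key=lambda i: l2(query, points[i]))[:5]
recall = len(set(approx) & set(exact)) / 5
```

Raising ef widens the result set the search maintains, which tends to improve recall at the cost of more distance computations; m plays the role of M in the parameter list above.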
IVF (Inverted File Index)
Groups vectors into K clusters (k-means). Search: embed query → find nearest clusters → search within those clusters.
1M vectors in K=1000 clusters ≈ 1000 vectors/cluster
Query: compare against 1000 cluster centroids → find top-3 nearest clusters
search only those 3 clusters: 1000 + 3000 = 4000 comparisons (not 1M) → ~250x fewer
nprobe = number of clusters to search (higher = better recall, slower)
nprobe=3: fast but lower recall
nprobe=50: slower but higher recall
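The cluster-then-probe idea can be sketched in pure Python. This is a toy illustration with a deliberately tiny k-means, not a production IVF; the class and function names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def l2(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points: List[Point], k: int, iters: int = 10, seed: int = 0) -> List[Point]:
    """Tiny Lloyd's k-means; returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        buckets: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: l2(p, centroids[c]))
            buckets[nearest].append(p)
        for c, bucket in enumerate(buckets):
            if bucket:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(bucket) for dim in zip(*bucket))
    return centroids


class IVFIndex:
    def __init__(self, points: List[Point], k: int):
        self.points = points
        self.centroids = kmeans(points, k)
        self.lists: List[List[int]] = [[] for _ in range(k)]  # cluster -> point ids
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda c: l2(p, self.centroids[c]))
            self.lists[nearest].append(i)

    def search(self, query: Point, top_k: int, nprobe: int) -> List[int]:
        """Scan only the nprobe clusters whose centroids are closest to the query."""
        probe = sorted(range(len(self.centroids)),
                       key=lambda c: l2(query, self.centroids[c]))[:nprobe]
        candidates = [i for c in probe for i in self.lists[c]]
        candidates.sort(key=lambda i: l2(query, self.points[i]))
        return candidates[:top_k]


random.seed(1)
points = [(random.random(), random.random()) for _ in range(500)]
index = IVFIndex(points, k=10)
query = (0.3, 0.7)
fast = index.search(query, top_k=5, nprobe=2)    # scans a fraction of the data
full = index.search(query, top_k=5, nprobe=10)   # probes every cluster
exact = sorted(range(len(points)), key=lambda i: l2(query, points[i]))[:5]
```

Probing every cluster degenerates to exact search, which is why nprobe acts as a recall/latency dial.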
IVF-PQ (Product Quantization for Memory Efficiency)
Compresses vectors for memory efficiency. Necessary when the corpus is too large for full-precision in-memory search.
Memory comparison for 1B vectors at 1536 dims:
Flat (exact): 1B × 1536 × 4 bytes ≈ 6.1 TB ← impractical to hold in RAM on one node
HNSW: full vectors + graph links (still enormous without sharding)
IVF-PQ (64 sub-quantizers × 8-bit codes): 1B × 64 bytes ≈ 64 GB ← feasible, ~96x smaller (recall trade-off)
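The memory comparison can be written as a few one-line formulas. This is a sketch: the HNSW figure assumes roughly 2×M layer-0 links of 4 bytes each per vector and ignores upper layers, and the PQ figure ignores codebooks and IVF centroids. Function names are hypothetical.

```python
def flat_bytes(n: int, dims: int, bytes_per_float: int = 4) -> int:
    """Raw float32 vectors."""
    return n * dims * bytes_per_float


def hnsw_bytes(n: int, dims: int, m: int, bytes_per_float: int = 4) -> int:
    """Full vectors plus an assumed 2*m four-byte neighbor links per vector."""
    return n * (dims * bytes_per_float + 2 * m * 4)


def pq_bytes(n: int, num_subquantizers: int, bits_per_code: int = 8) -> int:
    """One code per sub-quantizer per vector."""
    return n * num_subquantizers * bits_per_code // 8


N, DIMS = 1_000_000_000, 1536
flat = flat_bytes(N, DIMS)                    # 6.144e12 bytes
hnsw = hnsw_bytes(N, DIMS, m=32)              # 6.4e12 bytes
pq = pq_bytes(N, num_subquantizers=64)        # 6.4e10 bytes
compression_ratio = flat / pq
```

Note that the sub-quantizer count in PQ is unrelated to the M parameter of HNSW, even though both are often written as M.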
Chunking Strategy: The Most Impactful RAG Design Decision
Each chunk gets one embedding. If a chunk is too large, the embedding averages too much meaning → poor retrieval precision. Too small → loses context → the LLM can’t answer well.
Four chunking strategies:
1. Fixed-size (simplest)
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,       # max chunk length, measured by length_function
    chunk_overlap=50,     # overlap to preserve context across boundaries
    length_function=len,  # len counts characters; pass a tokenizer-based function to size in tokens
)
chunks = splitter.split_text(document)
2. Semantic (boundary-aware)
Split at natural document boundaries (paragraphs/sections/headings).
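def semantic_split(text: str, max_tokens: int = 512, chars_per_token: int = 4):
    """Pack consecutive paragraphs into chunks of roughly max_tokens tokens."""
    # Lengths are measured in characters; ~4 characters per token is a rough
    # heuristic for English text, not an exact token count.
    max_chars = max_tokens * chars_per_token
    paragraphs = text.split("\n\n")
    chunks = []
    current = ""
    for para in paragraphs:
        if len(current) + len(para) < max_chars:
            current += "\n\n" + para
        else:
            if current:
                chunks.append(current.strip())
            current = para  # a single oversized paragraph is kept whole
    if current:
        chunks.append(current.strip())
    return chunks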
3. Hierarchical (parent-child)
Store chunks at multiple granularities: document → section → paragraph. When a paragraph matches, return its parent section for richer context.
Document: "2026 Annual Report"
  Section: "Financial Results" → embedding_1
    Paragraph: "Revenue grew 15%" → embedding_2 (child of embedding_1)
    Paragraph: "EBITDA margin expanded" → embedding_3 (child of embedding_1)
Query "revenue growth" matches embedding_2 (paragraph)
Return: full "Financial Results" section (parent) for LLM context
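A minimal sketch of the parent lookup step: after vector search returns paragraph-level hits, each hit is swapped for its parent section. The in-memory dict stands in for whatever store holds chunk text, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Chunk:
    chunk_id: str
    text: str
    parent_id: Optional[str] = None  # None for top-level chunks


def expand_to_parents(matched_ids: List[str], store: Dict[str, Chunk]) -> List[Chunk]:
    """Replace each matched chunk with its parent (if any), de-duplicating
    while preserving the original ranking order."""
    seen = set()
    result = []
    for chunk_id in matched_ids:
        chunk = store[chunk_id]
        target = store[chunk.parent_id] if chunk.parent_id else chunk
        if target.chunk_id not in seen:
            seen.add(target.chunk_id)
            result.append(target)
    return result


store = {
    "financial_results": Chunk("financial_results", "Financial Results: full section text"),
    "p_revenue": Chunk("p_revenue", "Revenue grew 15%", parent_id="financial_results"),
    "p_ebitda": Chunk("p_ebitda", "EBITDA margin expanded", parent_id="financial_results"),
}
# Two paragraph hits from the same section collapse into one parent section.
context = expand_to_parents(["p_revenue", "p_ebitda"], store)
```

De-duplication matters here: several sibling paragraphs often match the same query, and returning the parent once keeps the prompt compact.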
4. Rule-based (domain-specific)
For structured documents: code files (split by function), legal docs (split by clause), medical records (split by section). Usually best for domain-specific RAG.
Rule of thumb: 256-512 token chunks with ~10% overlap. Validate with real queries.
Vector Database Selection
| Database | Deployment | p50 latency (1M) | Hybrid search | Best for |
|---|---|---|---|---|
| Qdrant | Self-hosted / Managed | ~6ms | ✅ Native | OSS performance, cost-sensitive |
| Pinecone | Fully managed | ~8ms | ✅ Sparse-dense | Zero-ops production, burst traffic |
| Weaviate | Self-hosted / Managed | ~12ms | ✅ BM25+vector | Rich metadata filtering, multimodal |
| pgvector | Postgres extension | 20-50ms (< 1M) | ❌ Manual | Existing Postgres, < 1M vectors |
| Milvus | Self-hosted | ~10ms | ✅ | Large-scale, billion vectors |
| ChromaDB | Local / Self-hosted | ~15ms | ❌ Manual | Prototyping |
| FAISS | Library (not a DB) | 2-5ms | ❌ | Custom implementations, batch |
Guideline: if you have fewer than ~1M vectors and already run Postgres, pgvector can be the right answer; don’t add infra complexity.
pgvector for small-scale RAG:
-- Enable pgvector
CREATE EXTENSION IF NOT EXISTS vector;
-- Create table with vector column
CREATE TABLE document_chunks (
    id UUID PRIMARY KEY,
    document_id TEXT NOT NULL,
    chunk_text TEXT NOT NULL,
    embedding vector(1536), -- OpenAI text-embedding-3-small
    metadata JSONB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Create HNSW index for fast ANN search
CREATE INDEX ON document_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
-- ANN semantic search ($1 = query embedding)
-- A SELECT alias cannot be referenced in WHERE at the same query level,
-- so the similarity threshold is applied in an outer query.
SELECT chunk_text, metadata, cosine_similarity
FROM (
    SELECT
        chunk_text,
        metadata,
        1 - (embedding <=> $1::vector) AS cosine_similarity
    FROM document_chunks
    ORDER BY embedding <=> $1::vector -- cosine distance; lets the HNSW index drive the scan
    LIMIT 10
) AS nearest
WHERE cosine_similarity > 0.7
ORDER BY cosine_similarity DESC;
The Production RAG Pipeline
This is the full architecture OpenAI/Anthropic interviewers expect you to draw:
INGESTION PIPELINE (offline, async):
┌──────────────────────────────────────────────────────────────┐
│ Document Sources (PDFs, web pages, DBs, code repos)          │
│   ↓ async queue (BullMQ / Kafka)                             │
│ Document Parser (Apache Tika / Unstructured.io)              │
│   ↓ clean text + metadata                                    │
│ Chunker (semantic or fixed-size, 256-512 tokens)             │
│   ↓ chunks with metadata                                     │
│ Embedding Model (text-embedding-3-small or BGE-M3)           │
│   ↓ batch API calls, async, with retry                       │
│ Vector Store (Qdrant / Pinecone / pgvector)                  │
│   + Metadata: source, date, document_id, chunk_idx           │
│   + Full text in separate store (Postgres) for reranking     │
└──────────────────────────────────────────────────────────────┘
QUERY PIPELINE (online, p99 < 200ms goal for retrieval stage):
┌──────────────────────────────────────────────────────────────┐
│ User query                                                   │
│   ↓ cache check (Redis, TTL 5 min, query hash key)           │
│ Query preprocessing (expand abbreviations, resolve coref)    │
│   ↓ optionally rewrite with LLM for better retrieval         │
│ Embedding (same model as ingestion: CRITICAL)                │
│   ↓ query vector                                             │
│ Hybrid Retrieval (parallel):                                 │
│   ├─ Dense: ANN vector search → top-50 chunks                │
│   └─ Sparse: BM25 keyword search → top-50 chunks             │
│   ↓ RRF fusion → up to 100 candidates                        │
│ Cross-encoder Reranker (Cohere rerank-v3 or a local model)   │
│   ↓ top-10 highest relevance chunks                          │
│ Context injection into LLM prompt                            │
│   ↓ "Answer based on these documents: [chunks]"              │
│ LLM generation (streaming response)                          │
│   ↓ with source citations (chunk → document lineage)         │
│ Cache response, return to user                               │
└──────────────────────────────────────────────────────────────┘
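The RRF (Reciprocal Rank Fusion) step in the query pipeline can be sketched in a few lines. Each chunk scores the sum of 1 / (k + rank) over the lists it appears in; k = 60 is a conventional default, treated here as an assumption. The chunk ids are made up.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60, top_n: int = 100) -> List[str]:
    """Fuse ranked id lists with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a deterministic order.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


dense = ["c3", "c1", "c7", "c2"]    # ANN vector search order
sparse = ["c1", "c9", "c3", "c4"]   # BM25 keyword search order
fused = rrf_fuse([dense, sparse])
# → ["c1", "c3", "c9", "c7", "c2", "c4"]
```

RRF uses only ranks, so it needs no score normalization between the dense and sparse retrievers; chunks that appear in both lists (c1, c3) rise to the top.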
Embedding Pipeline Monitoring: What to Track
Embedding/model version drift: if the embedding model version changes, existing embeddings may become incompatible. Version-tag all embeddings. On upgrade, reindex the corpus into a new index and cut over atomically.
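The version-tag and cutover procedure can be sketched as a small registry that keeps one index per embedding model version and flips a single pointer. This in-memory class is hypothetical; in a real deployment the pointer would live in whatever alias or routing mechanism your vector store provides.

```python
from typing import Dict, List, Optional


class VersionedIndexRegistry:
    """One index per embedding model version, plus an 'active' pointer."""

    def __init__(self) -> None:
        self._indexes: Dict[str, Dict[str, List[float]]] = {}
        self.active_version: Optional[str] = None
        self._previous_version: Optional[str] = None

    def create_index(self, model_version: str) -> None:
        self._indexes.setdefault(model_version, {})

    def upsert(self, model_version: str, chunk_id: str, embedding: List[float]) -> None:
        self._indexes[model_version][chunk_id] = embedding

    def cut_over(self, model_version: str, expected_count: int) -> None:
        """Switch traffic only once the new index is fully populated."""
        if len(self._indexes[model_version]) < expected_count:
            raise ValueError("new index is incomplete; refusing to cut over")
        self._previous_version = self.active_version
        self.active_version = model_version  # single pointer flip

    def rollback(self) -> None:
        if self._previous_version is None:
            raise ValueError("no previous index to roll back to")
        self.active_version, self._previous_version = (
            self._previous_version,
            self.active_version,
        )

    def index_for_query(self, query_model_version: str) -> Dict[str, List[float]]:
        """Reject queries embedded with a different model than the active index."""
        if query_model_version != self.active_version:
            raise ValueError("query embedding version does not match active index")
        return self._indexes[query_model_version]


registry = VersionedIndexRegistry()
registry.create_index("v1")
registry.upsert("v1", "chunk-1", [0.1, 0.2])
registry.cut_over("v1", expected_count=1)
registry.create_index("v2")                    # re-embed into a fresh index
registry.upsert("v2", "chunk-1", [0.3, 0.4])
registry.cut_over("v2", expected_count=1)      # old index kept for rollback
```

Checking the query's model version at read time catches the silent failure mode where mixed-version vectors still return results, just poor ones.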
Retrieval quality metrics (RAGAS framework):
- Context precision
- Context recall
- Answer faithfulness
- Answer relevance
Operational metrics:
- p50/p95/p99 vector search latency
- Cache hit rate (query hash)
- Index freshness (time since last document indexed)
- Embedding generation queue depth (backlog)
Interview Questions
Q1: “Design a RAG pipeline for an enterprise where employees can ask questions about 10 million internal documents.”
Model Answer: “I’d design this as two pipelines. Ingestion: documents from SharePoint, Confluence, and internal wikis land in a processing queue (Kafka). Each document is parsed (Apache Tika for PDFs, Mammoth for DOCX), chunked semantically (paragraph boundaries, 512-token max), and embedded with a self-hosted model (BGE-M3, which avoids an external API dependency for confidential documents). Chunks and embeddings are written to Qdrant (HNSW) with metadata (source system, document date, access group), and the full chunk text is stored in Postgres for reranking. At 10M documents × ~20 chunks/doc, that is roughly 200M vectors. Query pipeline: Redis cache → embed query (same model) → hybrid retrieval (Qdrant dense top-50 + Elasticsearch BM25 top-50) → RRF fusion → cross-encoder reranker → top-10 chunks → inject into the prompt and cite sources.”
Security: each document tagged with access groups; retrieval enforces filters so users only retrieve authorized content.
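The access-group rule can be illustrated with a minimal post-filter. The data class and function are hypothetical, and the sketch filters in application code for clarity; in practice the same predicate is usually pushed into the vector store query as a metadata filter so that unauthorized chunks do not crowd out the top-k.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class ScoredChunk:
    chunk_id: str
    score: float
    access_groups: FrozenSet[str]


def authorized_top_k(
    candidates: List[ScoredChunk], user_groups: FrozenSet[str], k: int
) -> List[ScoredChunk]:
    """Keep only chunks sharing at least one access group with the user."""
    allowed = [c for c in candidates if c.access_groups & user_groups]
    return sorted(allowed, key=lambda c: c.score, reverse=True)[:k]


candidates = [
    ScoredChunk("hr-policy", 0.91, frozenset({"hr"})),
    ScoredChunk("eng-runbook", 0.88, frozenset({"engineering"})),
    ScoredChunk("all-hands", 0.75, frozenset({"hr", "engineering", "sales"})),
]
visible = authorized_top_k(candidates, frozenset({"engineering"}), k=10)
```

The highest-scoring chunk is dropped for this user because it belongs only to the hr group, which is the behavior the security requirement asks for.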
Q2: “You’re using OpenAI’s embedding API, and the model version changes. How does this affect your vector database?”
Model Answer: “A model version change can change the vector space — new vectors are not semantically comparable to old vectors. You cannot mix versions in the same index. Remediation: tag every vector with embedding model version. When the model changes, build a new index, re-embed the corpus, populate the new index, then cut traffic over atomically. Keep the old index for rollback. If using API models, version-pin and test upgrades on a shadow index.”
Think About This
You’re in an Anthropic interview. The prompt: “Claude.ai has a projects feature where users upload documents and ask questions about them. Each user can upload up to 1000 documents. There are 50 million users. Design the vector database architecture.”
Walk through:
- How many total vectors? (order-of-magnitude estimate; compression likely required)
- How do you isolate user data? (shared index + namespace/metadata filter vs per-user index)
- What’s the latency target? (end-to-end UX; retrieval budget)
- How do you shard? (e.g., hash(user_id) routing to a shard)
- Upload path: parse → chunk → embed → write; track ingestion state
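A starting point for the first and fourth questions, as a sketch rather than a model answer. The 20 chunks per document figure is an assumption borrowed from Q1, the result is an upper bound (most users will upload far fewer than 1000 documents), and the function names and shard count are hypothetical.

```python
import hashlib


def estimate_total_vectors(users: int, docs_per_user: int, chunks_per_doc: int) -> int:
    """Upper bound assuming every user uploads the maximum number of documents."""
    return users * docs_per_user * chunks_per_doc


def shard_for_user(user_id: str, num_shards: int) -> int:
    """Stable routing: the same user always maps to the same shard.

    hashlib is used instead of the built-in hash(), which is salted per
    process for strings and would route inconsistently across workers.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


upper_bound = estimate_total_vectors(50_000_000, 1_000, 20)   # 10**12 vectors
shard = shard_for_user("user-12345", num_shards=1024)
```

Routing by user keeps each query on a single shard, since a user only ever searches their own documents; the trade-off is that plain modulo routing reshuffles most users when the shard count changes.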
Quick Reference
- Embedding = dense vector encoding semantic meaning; same model must produce query AND document embeddings
- ANN indexes: Flat (exact, small), HNSW (best recall/speed, high memory), IVF (clustered), IVF-PQ (compressed, billion-scale)
- Chunking is the biggest lever: fixed-size, semantic, hierarchical, rule-based; rule-of-thumb 256-512 tokens + overlap
- Hybrid search: dense + sparse (BM25) + RRF fusion; commonly best in production
- Vector DB selection: pgvector (< 1M), Qdrant (OSS perf), Pinecone (managed), Weaviate (metadata-heavy), Milvus (billion scale)
- Monitoring: embedding version tags, RAGAS metrics, p99 search latency, index freshness, embedding backlog
Tomorrow’s Preview
Day 49: A/B Testing Data Infrastructure — Experiment assignment/logging, metrics pipelines, statistical significance in streaming, guardrail metrics, and how Netflix and Meta run experiments at scale.