Day 40 — Design: Distributed Search System for Billion Documents (Anthropic Style)

Phase 2: Company-Specific | Category: Anthropic-Specific

The Prompt

Per IGotAnOffer: “Design a distributed search system capable of handling 1 billion documents at 1 million queries per second, covering sharding, caching, and LLM inference scaling.”

This is the most consistently reported Anthropic system design question. Set a 45-minute timer. But unlike previous designs, expect the interviewer to add constraints at minutes 15, 25, and 35.

Step 1: Requirements (5 min)

Functional:

Full-text search AND semantic search across 1 billion documents
Natural language queries (not just keyword)
Ranked results by relevance
Support for LLM-powered reranking of top-K results
Document ingestion pipeline (batch + incremental updates)
Safety filtering (relevant at Anthropic — search must not surface harmful content)

Non-functional:

Scale:
  1 billion documents
  1 million QPS (queries per second): 100K avg + 10x peak headroom

Latency:
  p95 < 500ms end-to-end (retrieval + reranking)
  p99 < 1 second

Availability: 99.99%

Document size avg: 2 KB → 1B × 2 KB = 2 TB documents
Embedding dim: 1536 (e.g. text-embedding-3-small; text-embedding-3-large defaults to 3072) → 1B × 1536 × 4 bytes ≈ 6 TB vectors
Inverted index: ~3x document size → ~6 TB

Latency budget (500ms):
  Query embedding: 20ms (local embedding model, not API call)
  ANN retrieval: 30ms (HNSW on sharded vector index)
  BM25 retrieval: 20ms (inverted index lookup)
  Fusion + dedup: 5ms
  LLM reranking top-50: 200ms (batched cross-encoder or LLM)
  Cache check + response: 25ms
  Total: ~300ms (a conservative sum, since ANN and BM25 run in parallel; leaves 200ms headroom for p95)

Scope: “I’ll design the ingestion pipeline, the retrieval layer (hybrid BM25 + vector), the reranking layer, caching strategy, and sharding. I’ll address safety filtering and the progressive complexity additions.”

Step 2: Scale Estimation (3 min)

Index size:
  1B documents × 2 KB avg = 2 TB raw text
  Inverted index: ~6 TB (3x documents due to posting lists)
  Vector embeddings: 1B × 1536d × 4 bytes ≈ 6 TB
  Total index: ~12 TB → too large for one node (max ~2-4 TB per NVMe node)

Sharding requirement:
  12 TB / 1 TB per index shard = 12 shards minimum
  With 2x replication for availability: 24 index nodes minimum
  For 1M QPS: assume each shard node sustains ~1K searches/sec
  → Scatter-gather sends every query to all 12 shards, so one full replica set serves ~1K QPS in total, not 12 × 1K
  → Uncached, 1M QPS would need ~1,000 replica sets, so the result cache and routing layer must absorb most of the load before it reaches the index

Embedding generation for queries:
  1M QPS × 20ms per embedding = 20M GPU-ms of work per second = 20,000 GPU-seconds per second
  → ~20,000 GPUs if every query were embedded alone; batching ~100-200 queries per forward pass brings this down to a dedicated embedding inference cluster of ~100-200 GPUs
  (In practice: batch embeddings, use a smaller model, or cache repeated queries)

LLM reranking cost:
  1M QPS × top-50 candidates × ~500 tokens each → 25B tokens/sec
  This is prohibitively expensive at full throughput → must be tiered/cached
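The estimates above can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the figures stated in this step; the batch size of 128 is a hypothetical value chosen to land inside the ~100-200 GPU range, and decimal units (1 TB = 10^12 bytes) are assumed.

```python
# Back-of-envelope sizing from the figures in this step (1 TB = 1e12 bytes).
DOCS = 1_000_000_000
AVG_DOC_BYTES = 2_000           # 2 KB average document
EMBED_DIM = 1536
BYTES_PER_FLOAT32 = 4
INVERTED_INDEX_FACTOR = 3       # inverted index ~3x raw text, as stated above
QPS = 1_000_000
EMBED_LATENCY_S = 0.020         # 20 ms per query embedding
EMBED_BATCH_SIZE = 128          # hypothetical batch size, not from the original
TB = 1e12

raw_text_tb = DOCS * AVG_DOC_BYTES / TB
inverted_tb = raw_text_tb * INVERTED_INDEX_FACTOR
vectors_tb = DOCS * EMBED_DIM * BYTES_PER_FLOAT32 / TB
index_tb = inverted_tb + vectors_tb

# GPU-seconds of embedding work arriving each second if queries ran one at a time.
gpu_seconds_per_second = QPS * EMBED_LATENCY_S
# Assumes one forward pass embeds a whole batch in the same 20 ms.
gpus_with_batching = gpu_seconds_per_second / EMBED_BATCH_SIZE

rerank_tokens_per_second = QPS * 50 * 500

print(f"index: {index_tb:.1f} TB, embedding GPUs: {gpus_with_batching:.0f}")
```

Changing EMBED_BATCH_SIZE shows how sensitive the GPU count is to batching, which is why the query embedding path is usually batched or cached.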

Step 3: Full Architecture

┌─ INGESTION PIPELINE ────────────────────────────────────────────
│ (Batch: crawl data; incremental: new document events)
│
│ [Document Queue] → [Preprocessor]
│   • Language detection, chunking (512 tokens, 50-token overlap)
│   • Normalize: lowercase, remove markup, deduplicate
│   • Safety filter: remove harmful documents before indexing
│
│ Two parallel paths:
│   Path A → Embedding Service → Vector Index Shards
│   Path B → BM25 Indexer → Inverted Index Shards
│ Both updates are eventually consistent (commit lag < 5 min)
└─────────────────────────────────────────────────────────────────
                              ↓
┌─ INDEX LAYER (Sharded) ─────────────────────────────────────────
│ Sharding strategy: consistent hashing on document_id
│ 16 shards × 2 replicas = 32 index nodes
│
│ Each shard contains:
│   INVERTED INDEX (Elasticsearch/Lucene-based)
│     • BM25 scoring
│     • Posting lists: term → [doc_ids, positions, tf]
│   VECTOR INDEX (HNSW) (FAISS, 62.5M docs per shard)
│     • HNSW: M=32, efSearch=100
│     • IVF-PQ for memory: quantized to 8-bit codes
│
│ Memory per shard: ~300 GB (inverted) + ~450 GB (HNSW) = 750 GB
│ Node: 1 TB RAM, 16-core, NVMe SSD for overflow
└─────────────────────────────────────────────────────────────────
                        ↑↓ query fan-out
┌─ QUERY PROCESSING LAYER ────────────────────────────────────────
│ Query Router (load balanced, stateless)
│ ├── Cache check: is this query in the result cache? (Redis)
│ │     Cache hit → return cached result (~5ms)
│ │     Cache miss → continue
│ │
│ ├── Query Embedding Service
│ │     Local embedding model (not API call): ~20ms
│ │     Same model as ingestion (critical for consistency)
│ │
│ ├── Parallel Retrieval Fan-out
│ │     Send query simultaneously to ALL 16 shards:
│ │     • BM25 request: keyword retrieval, top-50 per shard
│ │     • Vector request: ANN retrieval, top-50 per shard
│ │     Wait for all shards (with timeout: 100ms max)
│ │
│ ├── Merge + Fusion
│ │     Collect top-50 from each of 16 shards (both BM25 + vector)
│ │     → up to 800 candidates per retrieval type (1,600 total)
│ │     → RRF fusion: score = Σ 1/(k + rank_i), k=60
│ │     → Deduplicate by document_id
│ │     → Take global top-100 candidates
│ │
│ └── Reranking (tiered, explained below)
└─────────────────────────────────────────────────────────────────
                              ↓
┌─ RERANKING LAYER (Tiered) ──────────────────────────────────────
│ Tier 1 (all queries, ~10ms):
│   Cross-encoder model (lightweight, 110M params)
│   Input: query + document summary (first 200 tokens)
│   Output: relevance score 0-1 for top-100 candidates
│   Rerank → top-10
│
│ Tier 2 (premium/flagged queries, +200ms):
│   LLM reranker (Claude Haiku or equivalent)
│   Input: query + full document chunks for top-10
│   Output: explanation + confidence per result
│   Triggered when: tier-1 confidence is low, query is complex
│
│ Result cache: store (query_hash → ranked_results) in Redis
│ TTL: 5 minutes for dynamic content, 1 hour for stable content
└─────────────────────────────────────────────────────────────────
                              ↓
                         API Response
          top-10 results + scores + (optional) explanations
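The parallel fan-out with a 100ms timeout can be sketched with asyncio. This is an illustration under stated assumptions, not the actual router: search_shard is a hypothetical stand-in for a shard RPC that returns fake scores, and shard 3 is deliberately slow to show how a straggler is dropped rather than awaited.

```python
import asyncio
import heapq
import random

NUM_SHARDS = 16
SHARD_TIMEOUT_S = 0.100    # 100 ms fan-out budget from the design above
TOP_K_PER_SHARD = 50

async def search_shard(shard_id: int, query: str) -> list[tuple[float, str]]:
    """Hypothetical shard RPC; returns (score, doc_id) pairs with fake scores."""
    await asyncio.sleep(0.5 if shard_id == 3 else 0.001)   # shard 3 is a straggler
    rng = random.Random(f"{shard_id}:{query}")
    return [(rng.random(), f"s{shard_id}-d{i}") for i in range(TOP_K_PER_SHARD)]

async def scatter_gather(query: str, top_n: int = 100):
    """Query every shard in parallel and merge whatever arrives before the timeout."""
    tasks = [asyncio.create_task(search_shard(s, query)) for s in range(NUM_SHARDS)]
    done, pending = await asyncio.wait(tasks, timeout=SHARD_TIMEOUT_S)
    for task in pending:
        task.cancel()                      # degrade gracefully: skip slow shards
    candidates = []
    for task in done:
        if task.exception() is None:       # ignore shards that errored
            candidates.extend(task.result())
    return heapq.nlargest(top_n, candidates), len(done)

if __name__ == "__main__":
    results, answered = asyncio.run(scatter_gather("explain quantum entanglement"))
    print(f"{answered}/{NUM_SHARDS} shards answered, {len(results)} candidates kept")
```

Returning the number of shards that answered lets the caller mark a response as partial, which matters when deciding whether to cache it.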

Step 4: Hybrid Retrieval Deep Dive

Why hybrid is mandatory for this system (per BM25 vs Vector Search, Alok and DEV Community):

Query Type	BM25 Win	Vector Win
“Python asyncio tutorial” (keyword)	✅ exact match	❌ may return “concurrency patterns”
“Why does my code fail when threads share state?” (semantic)	❌ no keyword match	✅ understands intent
“Claude model hallucination rate” (hybrid)	✅ finds docs with these exact terms	✅ finds docs about AI accuracy

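Hybrid retrieval is commonly reported to improve recall by roughly 15-30% over either method alone, though the gain depends on the corpus and query mix. At Anthropic, where search supports Claude’s knowledge retrieval, this is the difference between finding the right context and hallucinating.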

RRF Fusion — why it works without tuning:

For each candidate document:
  RRF_score = Σ(i) 1 / (60 + rank_i)
  where rank_i is the document's position in retrieval list i (BM25 or vector)

  A document ranked #1 in both: 1/61 + 1/61 = 0.0328
  A document ranked #1 in BM25, #20 in vector: 1/61 + 1/80 = 0.0289
  A document ranked #3 in both: 1/63 + 1/63 = 0.0317

RRF uses only ranks, never raw scores, so BM25 scores and cosine similarities need no normalization. k=60 is the common default and usually works without per-corpus tuning.
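The formula above translates almost directly into code. A minimal sketch, assuming each retriever hands back a list of document IDs ordered best first; the function name rrf_fuse and the sample IDs are illustrative.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over every list a doc appears in."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):   # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; a doc missing from a list simply gets no term for it.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    bm25_ranking = ["doc_a", "doc_b", "doc_c"]
    vector_ranking = ["doc_a", "doc_c", "doc_d"]
    for doc_id, score in rrf_fuse([bm25_ranking, vector_ranking]):
        print(f"{doc_id}: {score:.4f}")
```

Because a document absent from one list contributes nothing for that list, documents found by both retrievers naturally rise to the top, which also takes care of deduplication.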

HNSW Index Selection (per DEV Community HNSW deep dive):

At 1B documents / 16 shards = 62.5M documents per shard

Index type selection:
  < 100K docs: FAISS Flat (exact), brute force is fast enough
  100K - 10M docs: FAISS HNSW, best recall/speed trade-off
  > 10M docs: FAISS IVF-PQ, compressed index fits in memory

At 62.5M docs per shard: HNSW gives the best recall but is memory-intensive
  Memory: 62.5M × 1536d × 4 bytes = 384 GB per shard for raw vectors → need IVF-PQ compression
  With IVF-PQ (8-bit codes, 192 bytes per vector): 62.5M × 1536/8 bytes = 12 GB per shard ✓

Trade-off: IVF-PQ has ~5% recall degradation vs uncompressed HNSW
  For search, this is acceptable: you're finding approximately nearest documents,
  and the reranking layer compensates for retrieval imprecision
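The memory figures above follow from simple per-vector byte counts. The sketch below reproduces them; it counts only the stored vectors or PQ codes and ignores HNSW graph links, IVF centroids, codebooks and ID maps, so real footprints will be somewhat larger. The helper names are illustrative.

```python
def flat_vector_bytes(num_vectors: int, dim: int, bytes_per_component: int = 4) -> int:
    """Uncompressed float32 storage: one 4-byte float per dimension."""
    return num_vectors * dim * bytes_per_component

def pq_code_bytes(num_vectors: int, num_subquantizers: int, bits_per_code: int = 8) -> int:
    """Product-quantized storage: one small code per sub-vector."""
    return num_vectors * num_subquantizers * bits_per_code // 8

VECTORS_PER_SHARD = 62_500_000
DIM = 1536
NUM_SUBQUANTIZERS = DIM // 8    # 192 one-byte codes, matching "1536/8 bytes" above

flat_gb = flat_vector_bytes(VECTORS_PER_SHARD, DIM) / 1e9
pq_gb = pq_code_bytes(VECTORS_PER_SHARD, NUM_SUBQUANTIZERS) / 1e9
compression_ratio = flat_gb / pq_gb

print(f"flat: {flat_gb:.0f} GB, PQ: {pq_gb:.0f} GB, ratio: {compression_ratio:.0f}x")
```

The 32x ratio is what makes a per-shard vector index fit comfortably in RAM, at the price of the recall loss discussed above.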

Step 5: Sharding Strategy

Why consistent hashing for document sharding:

16 virtual nodes per physical shard on the ring

Document assignment: hash(document_id) is placed on the ring, and the document goes to the shard that owns the next virtual node clockwise
  (a plain hash(document_id) % 16 is modulo hashing, which remaps most documents whenever the shard count changes)

Query fan-out: every query goes to ALL 16 shards (scatter-gather)
  → Can't predict which shard has relevant documents
  → Must query all, merge results
  → This is standard for search systems (Elasticsearch does the same by default)

Alternative: partition by topic/domain
  "Documents about climate → shard 1, AI papers → shard 2"
  Pros: queries can target fewer shards
  Cons: uneven distribution, hot shards, complex routing
  Verdict: consistent hashing is simpler and good enough

Replication:
  Each shard replicated 2x (primary + replica)
  Reads go to either primary or replica (load balancing)
  Writes go to primary, async replicated to replica
  If primary fails: replica promoted within seconds (Raft/Paxos leader election)
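A consistent hash ring with virtual nodes can be sketched in a few lines. This is a minimal illustration, assuming SHA-256 as the ring hash and shard names like "shard-0"; the HashRing class and its method names are hypothetical, not taken from any particular system.

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring: each shard owns several virtual nodes on the ring."""

    def __init__(self, shards, vnodes_per_shard=16):
        self._ring = sorted(
            (self._hash(f"{shard}#{v}"), shard)
            for shard in shards
            for v in range(vnodes_per_shard)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        # First 8 bytes of SHA-256 give a stable 64-bit position on the ring.
        return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")

    def shard_for(self, document_id: str) -> str:
        # First virtual node clockwise from the key, wrapping around at the end.
        idx = bisect.bisect_right(self._points, self._hash(document_id))
        return self._ring[idx % len(self._ring)][1]

if __name__ == "__main__":
    docs = [f"doc-{i}" for i in range(10_000)]
    before = HashRing([f"shard-{i}" for i in range(16)])
    after = HashRing([f"shard-{i}" for i in range(17)])   # add one shard
    moved = sum(before.shard_for(d) != after.shard_for(d) for d in docs)
    print(f"{moved} of {len(docs)} documents moved after adding a shard")
```

Adding a shard only moves the documents that now fall on the new shard's virtual nodes; everything else stays put, which is the property modulo hashing lacks.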

Step 6: Progressive Complexity — The Anthropic Interview Additions

Addition 1: “Now support real-time document updates with < 5-minute index freshness”

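The challenge: HNSW handles heavy write traffic poorly. Every insertion has to search the graph and rewire neighbor links, and deletions or updates are typically handled with tombstones that degrade the graph until it is rebuilt. At 100K new documents/minute, rebuilding the full index in batch is too slow.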

Solution — two-tier index:

HOT TIER (new documents, last 24 hours):
  Smaller HNSW index per shard, rebuilt every few minutes (rebuild interval plus build time must stay under the 5-minute freshness target)
  ~1M new docs × 1536d × 4 bytes ≈ 6 GB additional index per shard
  Queries check both hot tier and main index, merge results

COLD TIER (stable documents):
  Main HNSW index rebuilt nightly during low-traffic hours
  Previous night's hot tier merged into cold tier

Inverted index is easier: Lucene supports incremental segment additions
  New segments written immediately → merged in background
  Lucene's Near-Real-Time (NRT) reader: new docs visible within seconds
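Querying both tiers and merging can be sketched as follows. This is a toy illustration: search_tier uses brute-force dot products as a stand-in for an ANN index, and the rule that a hot-tier copy shadows the cold-tier copy of the same document is an assumption about how updates would be handled, not something the design above specifies.

```python
import heapq

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search_tier(index, query_vec, k):
    """Brute-force stand-in for an ANN search; returns (score, doc_id), best first."""
    scored = ((dot(vec, query_vec), doc_id) for doc_id, vec in index.items())
    return heapq.nlargest(k, scored)

def search_two_tier(hot, cold, query_vec, k=10):
    """Search both tiers and merge, letting fresh hot-tier copies shadow cold ones."""
    hot_hits = search_tier(hot, query_vec, k)
    # Drop cold hits for documents that were re-indexed into the hot tier today.
    cold_hits = [(s, d) for s, d in search_tier(cold, query_vec, k) if d not in hot]
    return heapq.nlargest(k, hot_hits + cold_hits)

if __name__ == "__main__":
    cold = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
    hot = {"b": [1.0, 0.0], "c": [0.5, 0.5]}     # "b" was updated today
    print(search_two_tier(hot, cold, [1.0, 0.0], k=2))
```

Filtering stale cold hits after the per-tier top-k can leave fewer than k cold candidates; over-fetching from the cold tier is one way to compensate.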

Addition 2: “10% of queries are adversarial — users trying to extract harmful information”

This is the Anthropic safety dimension. Pure technical optimization isn’t enough.

Defense-in-depth at three layers:

Layer 1 (ingestion gate):
  All documents pass through safety classifier before indexing
  Documents with safety_score < safety_threshold are excluded from the index (higher score = safer)
  Category: "this document contains instructions for harm" → never indexed

Layer 2 (query intent classification, at query time):
  Lightweight classifier (< 5ms) on incoming query
  "How to synthesize dangerous chemicals" → HIGH_RISK
  HIGH_RISK queries: bypass LLM reranker, apply stricter result filter, log for review
  Threshold: recall matters more than precision (a false positive, i.e. a missed helpful result, is better than a false negative, i.e. harmful content returned)

Layer 3 (result filtering):
  Each document in top-K results has a stored safety score (computed at index time)
  Results with safety_score < threshold are excluded from the response
  Even a document that passed the ingestion filter can be re-evaluated against the specific query context ("document is neutral but specifically answers this harmful query")

Audit log (immutable):
  Every HIGH_RISK flagged query + response → append-only log
  Human review team monitors this log
  The query classifier is retrained on accumulated audit data
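The three layers and the audit log can be wired together as below. This is a sketch under loud assumptions: the threshold values, the keyword list standing in for a trained intent classifier, and the in-memory audit_log list are all hypothetical, and a higher safety_score is taken to mean a safer document.

```python
from dataclasses import dataclass

SAFETY_THRESHOLD = 0.5       # hypothetical value; higher safety_score means safer
HIGH_RISK_THRESHOLD = 0.8    # hypothetical stricter cut-off for HIGH_RISK queries
RISKY_TERMS = ("synthesize", "weapon")   # toy stand-in for a trained classifier

@dataclass(frozen=True)
class Doc:
    doc_id: str
    safety_score: float      # computed once at index time

audit_log = []               # stand-in for an append-only store

def admit_to_index(doc: Doc) -> bool:
    """Layer 1: keep unsafe documents out of the index entirely."""
    return doc.safety_score >= SAFETY_THRESHOLD

def classify_query(query: str) -> str:
    """Layer 2: flag risky intent at query time."""
    lowered = query.lower()
    return "HIGH_RISK" if any(term in lowered for term in RISKY_TERMS) else "NORMAL"

def filter_results(query: str, results: list[Doc]) -> list[Doc]:
    """Layer 3: drop results below the threshold, stricter for HIGH_RISK queries."""
    risk = classify_query(query)
    threshold = HIGH_RISK_THRESHOLD if risk == "HIGH_RISK" else SAFETY_THRESHOLD
    kept = [doc for doc in results if doc.safety_score >= threshold]
    if risk == "HIGH_RISK":
        audit_log.append({"query": query, "returned": [d.doc_id for d in kept]})
    return kept

if __name__ == "__main__":
    corpus = [Doc("a", 0.9), Doc("b", 0.6), Doc("c", 0.2)]
    indexed = [doc for doc in corpus if admit_to_index(doc)]
    print(filter_results("how to synthesize dangerous chemicals", indexed))
```

Keeping the per-document score in the index means Layer 3 is a cheap comparison at query time; only the query classifier runs on the hot path.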

Addition 3: “Reduce the cost of LLM reranking — it’s too expensive at 1M QPS”

At 1M QPS × LLM reranking = 1M × ~$0.0001/query = $100/second = $8.6M/day
UNACCEPTABLE.

Cost optimization strategy:

1. Tiered reranking (already in design):
   Only 5-10% of queries use LLM reranking (complex, low-confidence, premium users)
   90-95% use lightweight cross-encoder: cost ≈ $0.000001/query vs $0.0001

2. Result caching for top queries:
   Top 10K queries account for ~60% of traffic (Zipf distribution)
   Cache these with 5-minute TTL → 60% reduction in reranking calls
   Cache hit: ~5ms response, $0 compute cost

3. Reranking batching:
   Instead of reranking immediately per query, batch 100 queries together
   GPU utilization: 1 GPU-second handles 100 queries instead of 1 (an idealized upper bound)
   Introduces ~50ms additional latency, acceptable within the 500ms budget

4. Smaller models for most queries:
   Claude Haiku (fastest) for most LLM reranking
   Claude Sonnet only for "premium" or safety-flagged queries
   Cost difference: ~10x

Combined: 1M QPS × 5% LLM rate × 40% cache miss rate × 1/100 batch efficiency × $0.0001
= 1M × 0.05 × 0.40 × 0.01 × $0.0001 = $0.02/second ≈ $1.7K/day
Achievable.
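The cost figures can be checked with a simple multiplicative model. A minimal sketch, assuming the factors are independent and using the per-query price and percentages stated above; rerank_cost_per_day is an illustrative helper, not a pricing API.

```python
SECONDS_PER_DAY = 86_400

def rerank_cost_per_day(qps, cost_per_query, llm_fraction=1.0,
                        cache_hit_rate=0.0, batch_factor=1.0):
    """Daily LLM reranking spend in dollars under a simple multiplicative model."""
    # Only queries routed to the LLM tier that also miss the cache are billable.
    billable_qps = qps * llm_fraction * (1.0 - cache_hit_rate)
    return billable_qps * batch_factor * cost_per_query * SECONDS_PER_DAY

# Every query reranked by the LLM, no caching, no batching.
baseline = rerank_cost_per_day(1_000_000, 0.0001)

# 5% LLM rate, 60% cache hits, idealized 100x batching gain.
optimized = rerank_cost_per_day(1_000_000, 0.0001, llm_fraction=0.05,
                                cache_hit_rate=0.60, batch_factor=0.01)

print(f"baseline: ${baseline:,.0f}/day, optimized: ${optimized:,.0f}/day")
```

Setting batch_factor back to 1.0 shows how much of the saving rests on the batching assumption, which is the least certain of the three factors.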

Addition 4: “The document corpus contains sensitive proprietary documents with different access permissions”

Access control at query time (not index time):
  Store per-document ACL (access control list) in a fast KV store (Redis/DynamoDB)
  At retrieval: fan-out query to all shards as normal
  At merge/fusion: filter candidates by checking ACL for requesting user
  Only documents the user can access are returned

Index-level separation (for very sensitive data):
  Most sensitive documents (top-secret clearance) → separate index cluster
  Only queries with appropriate permissions are routed to that cluster
  Physical separation > logical filtering for highest-sensitivity data

Trade-off:
  Per-query ACL check adds ~5ms (Redis lookup, batched)
  Separate index clusters add complexity but provide stronger isolation
  For Anthropic's use case: model knowledge is often not access-controlled
  but customer enterprise data is → separate index for enterprise RAG
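The merge-time ACL check can be sketched as a set intersection per candidate. This is an illustration only: the dict acl_store stands in for the Redis/DynamoDB lookup, group-based permissions are an assumed ACL model, and denying documents with no ACL entry is a design choice made here for safety.

```python
def filter_by_acl(candidates, user_groups, acl_store):
    """Keep only candidates whose ACL shares at least one group with the user."""
    allowed = []
    for doc_id, score in candidates:
        doc_groups = acl_store.get(doc_id, frozenset())   # no ACL entry means deny
        if doc_groups & user_groups:
            allowed.append((doc_id, score))
    return allowed

if __name__ == "__main__":
    # Hypothetical ACL data; a real system would batch these lookups in one round trip.
    acl_store = {
        "doc-1": frozenset({"everyone"}),
        "doc-2": frozenset({"finance"}),
        "doc-3": frozenset({"finance", "legal"}),
    }
    candidates = [("doc-1", 0.91), ("doc-2", 0.88), ("doc-3", 0.75), ("doc-4", 0.70)]
    print(filter_by_acl(candidates, frozenset({"everyone", "legal"}), acl_store))
```

Because filtering happens after retrieval, a user with narrow permissions can end up with fewer than the requested top-K; fetching extra candidates per shard is a common mitigation.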

Step 7: Failure Mode Analysis

Failure	Impact	Mitigation
One index shard fails	1/16 of documents unavailable	Secondary replica promoted within 30 sec. Query continues with 15/16 shards, degrades gracefully.
Embedding service overload	Query embedding fails	Circuit breaker: fall back to BM25-only search. Lower quality, but no embedding call is needed.
All replicas of one shard fail	1/16 of documents unavailable until recovery	Data loss prevented by WAL. Shard rebalanced from other nodes. Estimated recovery: 30 min.
LLM reranker unavailable	Reranking degrades to cross-encoder	Tier-1 cross-encoder is always-on backup. QPS limit on LLM reranker prevents cascading failure.
Cache cluster fails	All queries hit index	Index layer sized for full throughput without cache. Performance degrades 3–5x but system stays up.
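The circuit-breaker fallback from the table can be sketched as below. This is a minimal illustration, not a production breaker: the failure threshold and cool-down are hypothetical values, and embed, bm25_search and vector_search are placeholder callables supplied by the caller.

```python
import time

class CircuitBreaker:
    """Opens after repeated failures so callers stop hammering a sick service."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cool-down, calls are allowed again; the next failure re-opens it.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

def retrieve(query, embed, bm25_search, vector_search, breaker):
    """Hybrid retrieval that degrades to BM25-only when embedding is unavailable."""
    if breaker.allow():
        try:
            query_vec = embed(query)
            breaker.record_success()
            return "hybrid", bm25_search(query) + vector_search(query_vec)
        except Exception:
            breaker.record_failure()
    return "bm25_only", bm25_search(query)
```

While the breaker is open, retrieve never calls the embedding service, which gives it room to recover instead of absorbing retries.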

Interview Questions

Q1: “The HNSW index rebuild at 1B documents takes too long. The nightly rebuild overlaps with business hours in Asia-Pacific. How do you handle this?”

Model Answer: “Three strategies. First, rolling rebuilds: instead of rebuilding all 16 shards simultaneously, rebuild one shard at a time (each takes ~1/16 of the time). Shard 1 rebuilds while shards 2-16 serve traffic. The shard being rebuilt temporarily goes offline; traffic routes to its replica. Total rebuild time is the same, but only 1/16 of capacity is affected at any time. Second, blue-green index deployment: build the new index alongside the existing one on a separate set of nodes. When the new index is ready (tested, validated), switch traffic atomically from old to new. Zero downtime, but doubles the index node cost temporarily. Third, for the APAC concern specifically — region-specific builds. If the document corpus doesn’t require 100% global consistency, allow APAC region’s index to rebuild during their off-peak (which is US prime time), while US rebuilds during US off-peak. Each region runs on a slightly different schedule. The trade-off: a 2-6 hour window where US and APAC indexes differ slightly.”

Q2: “Walk me through what happens end-to-end when a user submits the query ‘explain quantum entanglement’ to your system.”

Model Answer: “1. The query router receives the HTTP request. It computes a cache key: SHA256(‘explain quantum entanglement’). It checks Redis: cache miss (assume cold start).
2. The query embedding service receives the text. It runs the query through the same embedding model used during document ingestion (critical: same model = consistent vector space). Returns a 1536-dimensional vector in ~20ms.
3. The router fans out in parallel to all 16 shards. Each shard receives two sub-queries simultaneously: a BM25 keyword query for ‘explain quantum entanglement’ (tokenized, stop words removed: ‘quantum entanglement explain’) AND an ANN vector search for the query embedding against HNSW. Each shard returns top-50 BM25 results and top-50 vector results in ~20-30ms.
4. The router receives 16 × 100 = 1,600 candidate (document_id, score) pairs in total, 800 per retrieval type. It applies RRF fusion across the BM25 and vector rankings and deduplicates by document_id. The document about ‘quantum entanglement for beginners’ ranked #2 in BM25 and #4 in vector gets RRF score = 1/62 + 1/64 ≈ 0.032. The document about ‘Bell’s theorem and quantum non-locality’ ranked #8 in BM25 and #1 in vector gets RRF score = 1/68 + 1/61 ≈ 0.031. The two scores are very close, so a small rank change in either list could flip the order. Top 100 candidates selected.
5. The cross-encoder reranker receives (query, document_excerpt) pairs for top-100. It processes them in a batch on GPU in ~100ms. It re-scores based on joint query-document understanding: ‘explain’ intent means accessible explanations score higher than technical papers. Returns top-10.
6. Safety check on top-10 results: all have safety_score > threshold (quantum physics is not harmful). Pass.
7. Results cached in Redis with 60-minute TTL (physics content is stable). Response returned to user. Total: ~25ms (cache check) + ~20ms (embedding) + ~30ms (retrieval) + ~100ms (reranking) + ~5ms (cache write) ≈ 180ms p50, within the 500ms p95 budget.”
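The cache-key step in the walkthrough can be sketched as follows. Normalizing case and whitespace before hashing is an assumption added here so that trivially different spellings share one cache entry (the walkthrough hashes the raw query); the "search:" prefix and the helper names are illustrative, and the TTL values are the ones given in the reranking layer.

```python
import hashlib

STABLE_TTL_S = 3600     # 1 hour for stable content, as in the reranking layer
DYNAMIC_TTL_S = 300     # 5 minutes for dynamic content

def cache_key(query: str) -> str:
    """SHA-256 of the normalized query; normalization is an assumption, see above."""
    normalized = " ".join(query.lower().split())
    return "search:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def cache_ttl(is_stable_content: bool) -> int:
    """Pick the TTL tier for a result set."""
    return STABLE_TTL_S if is_stable_content else DYNAMIC_TTL_S

if __name__ == "__main__":
    key = cache_key("Explain   quantum entanglement")
    print(key, cache_ttl(is_stable_content=True))
```

If results depend on the user (for example after ACL filtering), the key would also need to include a permission scope so that one user's results are never served to another.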

Self-Assessment: 5 Questions

Why is consistent hashing the right sharding strategy for search, and why must every query fan-out to all shards?
What’s the RRF formula and why does it work without tuning across different score scales?
What are the three layers of safety defense and what does each one catch?
How does the two-tier hot/cold index solve the HNSW real-time update problem?
Why is LLM reranking prohibitively expensive at full throughput, and what’s the cost reduction strategy?

Quick Reference: Distributed Search (Anthropic Style)

Three-layer architecture: Ingestion (embed + index) → Retrieval (BM25 + vector, sharded) → Reranking (cross-encoder → optional LLM)
Hybrid retrieval is mandatory: BM25 for exact keyword match, HNSW vector search for semantic similarity. RRF fusion (k=60) combines them without tuning.
16 shards, consistent hashing, scatter-gather: every query fans out to all shards, merges top-K from each.
HNSW at scale: IVF-PQ compression needed for 62.5M vectors per shard (12 GB vs 384 GB). ~5% recall loss, acceptable with reranking.
Tiered reranking: lightweight cross-encoder (~10ms, always on) → LLM reranker (~200ms, 5-10% of queries only). Cost = 100x reduction.
Real-time updates: two-tier index (hot/cold). Lucene NRT for inverted index, small hot HNSW rebuilt every few minutes to stay within the 5-minute freshness target.
Safety at Anthropic: filter at ingestion, classify query intent, filter results by safety score. Immutable audit log.
The Anthropic differentiator: You must address safety unprompted. Don’t wait to be asked.

Phase 2 Complete — What’s Coming

You’ve finished all five company-specific deep dives:

Meta: Scale thinking, News Feed data pipeline, fan-out architecture
Netflix: Iceberg-first, ownership culture, recommendation data pipeline
Google: GCP-native design, BigQuery cost optimization, search analytics pipeline
OpenAI: LLM training data, RLHF, AI-native pipelines
Anthropic: Safety-first, progressive complexity, distributed search

Phase 3 (Days 61-90) starts in 21 days with full mock designs, cross-cutting topics, and interview-day simulation. Before then, Days 41-60 cover the advanced Phase 2 topics: streaming feature engineering, A/B testing infrastructure, data mesh, security/privacy, cost optimization, and more complex designs.

Tomorrow’s Preview

Day 41: Activity Schema & Event Modeling — Event-driven data modeling, the Activity Schema pattern, wide event tables, how to model clickstreams and app events at scale, and when schema-on-read beats schema-on-write — directly relevant to OpenAI conversation data and Meta engagement pipelines.
