Phase 2: Company-Specific | Category: Anthropic-Specific
The Prompt
Per IGotAnOffer: “Design a distributed search system capable of handling 1 billion documents at 1 million queries per second, covering sharding, caching, and LLM inference scaling.”
This is the most consistently reported Anthropic system design question. Set a 45-minute timer. Unlike the previous designs, expect the interviewer to add constraints at minutes 15, 25, and 35.
Step 1: Requirements (5 min)
Functional:
- Full-text search AND semantic search across 1 billion documents
- Natural language queries (not just keyword)
- Ranked results by relevance
- Support for LLM-powered reranking of top-K results
- Document ingestion pipeline (batch + incremental updates)
- Safety filtering (relevant at Anthropic — search must not surface harmful content)
Non-functional:
- Scale: 1 billion documents; 1 million QPS (queries per second) at peak, i.e. 100K average with 10x peak headroom
- Latency: p95 < 500 ms end-to-end (retrieval + reranking); p99 < 1 second
- Availability: 99.99%
- Average document size: 2 KB → 1B × 2 KB = 2 TB of documents
- Embedding dimension: 1536 (e.g., text-embedding-3-small; text-embedding-3-large defaults to 3072) → 1B × 1536 × 4 bytes ≈ 6 TB of vectors
- Inverted index: ~3x document size → ~6 TB
Latency budget (500 ms):
- Query embedding: 20 ms (local embedding model, not an API call)
- ANN retrieval: 30 ms (HNSW on sharded vector index)
- BM25 retrieval: 20 ms (inverted index lookup)
- Fusion + dedup: 5 ms
- LLM reranking of top-50: 200 ms (batched cross-encoder or LLM)
- Cache check + response: 25 ms
- Total: ~300 ms (leaves 200 ms of headroom for p95)
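The latency budget can be sanity-checked in a few lines. This is a minimal sketch: the stage names are illustrative labels, the values are the ones listed in the budget, and the critical-path figure assumes BM25 and ANN retrieval run in parallel, as the architecture later describes.

```python
# Stage latencies in milliseconds, copied from the budget above.
BUDGET_MS = {
    "query_embedding": 20,
    "ann_retrieval": 30,
    "bm25_retrieval": 20,
    "fusion_dedup": 5,
    "llm_reranking_top50": 200,
    "cache_check_and_response": 25,
}
P95_TARGET_MS = 500
PARALLEL_STAGES = ("ann_retrieval", "bm25_retrieval")


def serial_total_ms(budget=BUDGET_MS):
    """Sum of all stages, as if they ran one after another."""
    return sum(budget.values())


def critical_path_ms(budget=BUDGET_MS):
    """Total when the two retrieval stages overlap: only the slower one counts."""
    sequential = sum(v for k, v in budget.items() if k not in PARALLEL_STAGES)
    return sequential + max(budget[k] for k in PARALLEL_STAGES)


def headroom_ms(total_ms, target_ms=P95_TARGET_MS):
    return target_ms - total_ms
```

Summing serially reproduces the ~300 ms total; overlapping the two retrieval calls shortens the critical path to 280 ms, which is worth stating out loud in the interview.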
Scope: “I’ll design the ingestion pipeline, the retrieval layer (hybrid BM25 + vector), the reranking layer, caching strategy, and sharding. I’ll address safety filtering and the progressive complexity additions.”
Step 2: Scale Estimation (3 min)
Index size:
- Raw text: 1B documents × 2 KB avg = 2 TB
- Inverted index: ~6 TB (3x the documents, due to posting lists)
- Vector embeddings: 1B × 1536d × 4 bytes ≈ 6 TB
- Total index: ~12 TB → too large for one node (max ~2-4 TB per NVMe node)
Sharding requirement:
- 12 TB / 1 TB per index shard = 12 shards minimum
- With 2x replication for availability: 24 index nodes minimum
Query throughput (1M QPS):
- Assuming 1K QPS per shard search: 12 shards × 1K = 12K QPS from shards
- Need 1M / 12K ≈ 84 parallel query "lanes"; the routing layer handles fan-out
- Caveat: with scatter-gather (Step 5) every query touches every shard, so 12K QPS counts shard-level searches rather than end-user queries. Treat 84 lanes as a lower bound that the result cache and additional read replicas must cover.
Embedding generation for queries:
- 1M QPS × 20 ms per embedding = 20M GPU-ms per second, i.e. 20,000 GPU-seconds of work per wall-clock second
- Unbatched, that is ~20,000 GPUs; reaching the dedicated embedding inference cluster of ~100-200 GPUs assumed here requires batching on the order of 100-200 queries per forward pass
- In practice: batch embeddings, use a smaller model, or cache repeated queries
LLM reranking cost:
- 1M QPS × top-50 candidates × ~500 tokens each → 25B tokens/sec
- This is prohibitively expensive at full throughput → must be tiered and cached
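The storage arithmetic above can be reproduced with a short script. This is a back-of-envelope sketch: the constant and function names are illustrative, the values are the assumptions stated in this section, and units are decimal (1 TB = 10^12 bytes).

```python
import math

# Assumptions taken from the estimates in this section (decimal units).
NUM_DOCS = 1_000_000_000
AVG_DOC_BYTES = 2_000             # 2 KB average document
EMBED_DIM = 1536
BYTES_PER_FLOAT32 = 4
INVERTED_INDEX_FACTOR = 3         # inverted index is ~3x the raw text
SHARD_CAPACITY_BYTES = 1e12       # ~1 TB of index per shard
TB = 1e12


def estimate_index_footprint():
    """Return raw text, inverted index, vector, and total index sizes in TB."""
    raw_text = NUM_DOCS * AVG_DOC_BYTES
    inverted = raw_text * INVERTED_INDEX_FACTOR
    vectors = NUM_DOCS * EMBED_DIM * BYTES_PER_FLOAT32
    return {
        "raw_text_tb": raw_text / TB,
        "inverted_tb": inverted / TB,
        "vectors_tb": vectors / TB,
        "index_total_tb": (inverted + vectors) / TB,
    }


def min_shards(index_total_tb):
    """Smallest shard count that fits the index at SHARD_CAPACITY_BYTES each."""
    return math.ceil(index_total_tb * TB / SHARD_CAPACITY_BYTES)


def rerank_tokens_per_sec(qps=1_000_000, candidates=50, tokens_per_candidate=500):
    """Token throughput if every query were LLM-reranked over its top candidates."""
    return qps * candidates * tokens_per_candidate
```

With the rounded 12 TB figure this yields 12 shards; the unrounded vector size (6.144 TB) tips the total just over 12 TB and the result to 13, which is one reason to provision above the minimum.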
Step 3: Full Architecture
INGESTION PIPELINE (batch: crawl data; incremental: new document events)
- [Document Queue] → [Preprocessor]
  - Language detection, chunking (512 tokens, 50-token overlap)
  - Normalize: lowercase, remove markup, deduplicate
  - Safety filter: remove harmful documents before indexing
- Two parallel paths:
  - Path A → Embedding Service → Vector Index Shards
  - Path B → BM25 Indexer → Inverted Index Shards
- Both updates are eventually consistent (commit lag < 5 min)

↓

INDEX LAYER (sharded)
- Sharding strategy: consistent hashing on document_id
- 16 shards × 2 replicas = 32 index nodes
- Each shard contains:
  - Inverted index (Elasticsearch/Lucene-based): BM25 scoring; posting lists map term → [doc_ids, positions, tf]
  - Vector index (FAISS, ~62.5M docs per shard): HNSW with M=32, efSearch=100; IVF-PQ for memory, quantized to 8-bit
- Memory per shard: ~300 GB (inverted) + ~450 GB (HNSW) = 750 GB
- Node: 1 TB RAM, 16 cores, NVMe SSD for overflow

↑↓ query fan-out

QUERY PROCESSING LAYER
- Query Router (load balanced, stateless)
  - Cache check: is this query in the result cache (Redis)?
    - Cache hit → return cached result (~5 ms)
    - Cache miss → continue
  - Query Embedding Service
    - Local embedding model (not an API call): ~20 ms
    - Same model as ingestion (critical for consistency)
  - Parallel retrieval fan-out: send the query simultaneously to ALL 16 shards
    - BM25 request: keyword retrieval, top-50 per shard
    - Vector request: ANN retrieval, top-50 per shard
    - Wait for all shards (with timeout: 100 ms max)
  - Merge + fusion
    - Collect top-50 from each of 16 shards for both BM25 and vector → up to 800 candidates per retrieval type (1,600 in total)
    - RRF fusion: score = Σ 1/(k + rank_i), k=60
    - Deduplicate by document_id
    - Take the global top-100 candidates
  - Reranking (tiered, explained below)

↓

RERANKING LAYER (tiered)
- Tier 1 (all queries, ~10 ms):
  - Cross-encoder model (lightweight, 110M params)
  - Input: query + document summary (first 200 tokens)
  - Output: relevance score 0-1 for the top-100 candidates
  - Rerank → top-10
- Tier 2 (premium/flagged queries, +200 ms):
  - LLM reranker (Claude Haiku or equivalent)
  - Input: query + full document chunks for the top-10
  - Output: explanation + confidence per result
  - Triggered when tier-1 confidence is low or the query is complex
- Result cache: store (query_hash → ranked_results) in Redis
  - TTL: 5 minutes for dynamic content, 1 hour for stable content

↓

API Response: top-10 results + scores + (optional) explanations
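The fan-out step in the query processing layer can be sketched with asyncio. This is a minimal illustration under stated assumptions: `search_shard` is a hypothetical stand-in that simulates a shard RPC with random latency and scores, and the 16-shard count and 100 ms deadline come from the design above. A real router would issue network calls and track per-shard errors.

```python
import asyncio
import random

NUM_SHARDS = 16
SHARD_TIMEOUT_S = 0.100  # 100 ms fan-out deadline from the design above


async def search_shard(shard_id, query, top_k=50):
    """Hypothetical shard call: returns (doc_id, score) pairs, best first."""
    await asyncio.sleep(random.uniform(0.001, 0.005))  # simulated latency
    rng = random.Random(f"{shard_id}:{query}")          # deterministic fake scores
    scores = sorted((rng.random() for _ in range(top_k)), reverse=True)
    return [(f"s{shard_id}-d{i}", score) for i, score in enumerate(scores)]


async def scatter_gather(query, search=search_shard, timeout=SHARD_TIMEOUT_S):
    """Query every shard in parallel; drop shards that miss the deadline.

    Returns (candidates, number_of_shards_that_timed_out).
    """
    tasks = [asyncio.create_task(search(sid, query)) for sid in range(NUM_SHARDS)]
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()  # degrade gracefully instead of blocking the response
    candidates = []
    for task in done:
        if task.exception() is None:  # skip shards that raised
            candidates.extend(task.result())
    return candidates, len(pending)
```

Calling `asyncio.run(scatter_gather("explain quantum entanglement"))` returns up to 16 × 50 candidates for one retrieval type. Returning the count of missing shards lets the caller decide whether a partial result is good enough to cache.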
Step 4: Hybrid Retrieval Deep Dive
Why hybrid is mandatory for this system (per BM25 vs Vector Search, Alok and DEV Community):
| Query type | BM25 | Vector |
|---|---|---|
| "Python asyncio tutorial" (keyword) | ✅ exact match | ❌ may return "concurrency patterns" |
| "Why does my code fail when threads share state?" (semantic) | ❌ no keyword match | ✅ understands intent |
| "Claude model hallucination rate" (hybrid) | ✅ finds docs with these exact terms | ✅ finds docs about AI accuracy |
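Hybrid retrieval is commonly reported to improve recall by roughly 15-30% over either method alone, though the gain depends on the corpus and query mix. At Anthropic, where search supports Claude’s knowledge retrieval, this is the difference between finding the right context and hallucinating.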
RRF Fusion — why it works without tuning:
For each candidate document:
RRF_score = Σ_i 1 / (60 + rank_i)
where rank_i is the document's position in retrieval list i (BM25 or vector).
- A document ranked #1 in both: 1/61 + 1/61 = 0.0328
- A document ranked #1 in BM25, #20 in vector: 1/61 + 1/80 = 0.0289
- A document ranked #3 in both: 1/63 + 1/63 = 0.0317
Because RRF uses ranks rather than raw scores, BM25 scores and vector similarities never have to be normalized against each other. The k=60 constant dampens the influence of the very top ranks and is a common default that usually works without tuning.
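The fusion rule above fits in a few lines. This is a minimal sketch, assuming each input is a ranked list of document ids (best first) with no duplicates inside a single list; the function name and the `top_n` parameter are illustrative.

```python
from collections import defaultdict

RRF_K = 60  # the constant used throughout this design


def rrf_fuse(ranked_lists, k=RRF_K, top_n=100):
    """Fuse ranked lists of doc ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) for every document it contains;
    documents missing from a list contribute nothing for that list.
    Ties are broken by doc id so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_n]
```

For example, `rrf_fuse([["a", "x", "c"], ["a", "y", "c"]])` places "a" (first in both lists) ahead of "c" (third in both), which in turn beats "x" and "y" (second in only one list). Agreement between retrievers is rewarded more than a single strong rank.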
HNSW Index Selection (per DEV Community HNSW deep dive):
At 1B documents / 16 shards = 62.5M documents per shard.
Index type selection:
- < 100K docs: FAISS Flat (exact); brute force is fast enough
- 100K - 10M docs: FAISS HNSW; best recall/speed trade-off
- > 10M docs: FAISS IVF-PQ; compressed index fits in memory
At 62.5M docs per shard, uncompressed HNSW would give the best recall but is memory-intensive:
- Memory: 62.5M × 1536d × 4 bytes = 384 GB per shard for raw vectors → need IVF-PQ compression
- With IVF-PQ (8-bit codes, one per 8-dimension sub-vector): 62.5M × 1536/8 bytes = 12 GB per shard ✓
Trade-off: IVF-PQ has ~5% recall degradation vs uncompressed HNSW. For search, this is acceptable: you are finding approximately nearest documents, and the reranking layer compensates for retrieval imprecision.
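The selection rule and the two memory figures can be checked with a small calculator. This is a sketch under assumptions: the thresholds are the ones listed above, and the "1536/8 bytes" figure is read as 192 one-byte PQ codes per vector (one code per 8-dimension sub-vector). It ignores IVF centroids, document ids, and HNSW graph links, so real footprints are somewhat larger.

```python
def choose_index(num_vectors):
    """Pick an index family using the size thresholds from this section."""
    if num_vectors < 100_000:
        return "Flat"
    if num_vectors <= 10_000_000:
        return "HNSW"
    return "IVF-PQ"


def raw_vector_bytes(num_vectors, dim=1536, bytes_per_component=4):
    """Memory for uncompressed float32 vectors."""
    return num_vectors * dim * bytes_per_component


def pq_code_bytes(num_vectors, dim=1536, dims_per_subvector=8, bits_per_code=8):
    """Memory for product-quantized codes only (assumed layout, see lead-in)."""
    num_subvectors = dim // dims_per_subvector
    return num_vectors * num_subvectors * bits_per_code // 8


DOCS_PER_SHARD = 1_000_000_000 // 16  # 62.5M
```

At 62.5M vectors per shard, `raw_vector_bytes` gives 384 GB and `pq_code_bytes` gives 12 GB, a 32x reduction that matches the figures above.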
Step 5: Sharding Strategy
Why consistent hashing for document sharding:
Ring layout: 16 virtual nodes per physical shard on the ring.
Document assignment: hash document_id onto the ring and assign the document to the shard that owns the next virtual node clockwise. (A plain hash(document_id) % 16 is modulo hashing, which remaps most documents whenever the shard count changes.)
Query fan-out: every query goes to ALL 16 shards (scatter-gather)
- You can't predict which shard has relevant documents
- So you must query all shards and merge the results
- This is standard for search systems (Elasticsearch also fans queries out to every shard of an index)
Alternative: partition by topic/domain ("Documents about climate → shard 1, AI papers → shard 2")
- Pros: queries can target fewer shards
- Cons: uneven distribution, hot shards, complex routing
- Verdict: consistent hashing is simpler and good enough
Replication:
- Each shard is replicated 2x (primary + replica)
- Reads go to either primary or replica (load balancing)
- Writes go to the primary and are asynchronously replicated to the replica
- If the primary fails, the replica is promoted within seconds (Raft/Paxos leader election)
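A hash ring with virtual nodes can be sketched as follows. This is an illustrative implementation, not a production one: the class name, the vnode labelling scheme, and the use of SHA-256 truncated to 64 bits are all assumptions. The point it demonstrates is that adding a shard only moves documents onto the new shard.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a 64-bit position on the ring."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    def __init__(self, shards, vnodes_per_shard=16):
        self.vnodes_per_shard = vnodes_per_shard
        self._ring = []        # sorted list of (position, shard)
        self._positions = []   # positions only, kept in sync for bisect
        for shard in shards:
            self.add_shard(shard)

    def add_shard(self, shard):
        """Place the shard's virtual nodes on the ring."""
        for v in range(self.vnodes_per_shard):
            bisect.insort(self._ring, (_hash(f"{shard}#vnode{v}"), shard))
        self._positions = [pos for pos, _ in self._ring]

    def shard_for(self, document_id):
        """Owner is the first virtual node at or after the document's position."""
        idx = bisect.bisect_left(self._positions, _hash(document_id))
        return self._ring[idx % len(self._ring)][1]  # wrap around the ring
```

Growing the ring from 16 to 17 shards with `add_shard("shard-16")` reassigns only the documents that now fall on the new shard's arcs (about 1/17 of them on average); every other document keeps its owner, which is what keeps rebalancing cheap.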
Step 6: Progressive Complexity — The Anthropic Interview Additions
Addition 1: “Now support real-time document updates with < 5-minute index freshness”
The challenge: HNSW does not handle high-rate updates efficiently. Every insertion requires graph reconnection, and deletes or updates are costlier still. At 100K new documents/minute, rebuilding the main index in batch is too slow.
Solution — two-tier index:
HOT TIER (new documents, last 24 hours):
- Smaller HNSW index per shard, rebuilt every 10 minutes (a rebuild alone would miss the 5-minute target, so new vectors are also inserted incrementally between rebuilds)
- ~1M new docs × 1536d × 4 bytes ≈ 6 GB of additional index per shard (at 100K docs/minute spread over 16 shards, a full day is ~9M docs per shard, or ~55 GB)
- Queries check both the hot tier and the main index, then merge results
COLD TIER (stable documents):
- Main HNSW index rebuilt nightly during low-traffic hours
- Previous night's hot tier merged into the cold tier
The inverted index is easier, because Lucene supports incremental segment additions:
- New segments are written immediately and merged in the background (NRT search)
- Lucene's Near-Real-Time (NRT) reader makes new docs visible within seconds
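The query-time behaviour of the two tiers can be illustrated with a toy index. This is a sketch only: brute-force dot products stand in for HNSW search, the class and method names are hypothetical, and a document present in both tiers is served from the hot tier so that updates take effect before the nightly merge.

```python
import heapq


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class TieredVectorIndex:
    """Toy hot/cold index; brute-force scoring stands in for HNSW."""

    def __init__(self):
        self.cold = {}  # doc_id -> vector, rebuilt nightly
        self.hot = {}   # doc_id -> vector, recent inserts and updates

    def add(self, doc_id, vector):
        self.hot[doc_id] = vector  # searchable immediately

    def nightly_merge(self):
        """Fold the hot tier into the cold tier and start a fresh hot tier."""
        self.cold.update(self.hot)
        self.hot = {}

    def _search_tier(self, tier, query, top_k):
        scored = ((_dot(query, vec), doc_id) for doc_id, vec in tier.items())
        return heapq.nlargest(top_k, scored)

    def search(self, query, top_k=10):
        hot_hits = self._search_tier(self.hot, query, top_k)
        # Over-fetch from cold so that dropping superseded docs cannot
        # leave fewer than top_k valid candidates.
        cold_hits = [
            hit
            for hit in self._search_tier(self.cold, query, top_k + len(self.hot))
            if hit[1] not in self.hot
        ]
        merged = heapq.nlargest(top_k, hot_hits + cold_hits)
        return [(doc_id, score) for score, doc_id in merged]
```

The over-fetch on the cold tier is the design choice worth mentioning: filtering stale copies after a plain top-k search could silently shrink the result set.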
Addition 2: “10% of queries are adversarial — users trying to extract harmful information”
This is the Anthropic safety dimension. Pure technical optimization isn’t enough.
Defense-in-depth at three layers:
Layer 1: Ingestion gate
- All documents pass through a safety classifier before indexing
- Documents with a safety_score below the threshold are excluded from the index
- Category "this document contains instructions for harm" → never indexed
Layer 2: Query intent classification (at query time)
- Lightweight classifier (< 5 ms) on the incoming query
- "How to synthesize dangerous chemicals" → HIGH_RISK
- HIGH_RISK queries: bypass the LLM reranker, apply a stricter result filter, log for review
- Threshold: recall matters more than precision (a false positive, i.e. a missed helpful result, is better than a false negative, i.e. harmful content returned)
Layer 3: Result filtering
- Each document in the top-K results has a stored safety score (computed at index time)
- Results with safety_score < threshold are excluded from the response
- Even a document that passed the ingestion filter can be re-evaluated against the specific query context ("document is neutral but specifically answers this harmful query")
Audit log (immutable):
- Every HIGH_RISK flagged query + response → append-only log
- A human review team monitors this log
- The query classifier is retrained on accumulated audit data
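The three layers compose as a simple pipeline. The sketch below is illustrative only: the thresholds, the `Doc` type, and the keyword set standing in for a trained intent classifier are all assumptions, and scores are taken to lie in [0, 1] with higher meaning safer. A plain list stands in for the append-only audit store.

```python
from dataclasses import dataclass

SAFETY_THRESHOLD = 0.5         # assumed: scores in [0, 1], higher = safer
STRICT_SAFETY_THRESHOLD = 0.8  # assumed stricter bar for HIGH_RISK queries
RISKY_TERMS = {"synthesize", "weapon", "explosive"}  # placeholder for a real classifier


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    safety_score: float  # computed once at index time


def admit_to_index(doc):
    """Layer 1: only documents at or above the threshold are indexed."""
    return doc.safety_score >= SAFETY_THRESHOLD


def classify_query(query):
    """Layer 2: toy intent classifier based on keyword overlap."""
    tokens = set(query.lower().split())
    return "HIGH_RISK" if tokens & RISKY_TERMS else "LOW_RISK"


def filter_results(results, risk):
    """Layer 3: apply a stricter bar when the query itself looks risky."""
    bar = STRICT_SAFETY_THRESHOLD if risk == "HIGH_RISK" else SAFETY_THRESHOLD
    return [doc for doc in results if doc.safety_score >= bar]


def handle_query(query, retrieved, audit_log):
    risk = classify_query(query)
    safe = filter_results(retrieved, risk)
    if risk == "HIGH_RISK":
        # Append-only record of the flagged query and what was returned.
        audit_log.append({"query": query, "returned": [d.doc_id for d in safe]})
    return risk, safe
```

Tying the result threshold to the query's risk level is one way to implement "apply a stricter result filter"; a real system would more likely re-score each document against the query.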
Addition 3: “Reduce the cost of LLM reranking — it’s too expensive at 1M QPS”
At full throughput: 1M QPS × LLM reranking = 1M × ~$0.0001/query = $100/second ≈ $8.6M/day. Unacceptable.
Cost optimization strategy:
1. Tiered reranking (already in the design):
   - Only 5-10% of queries use LLM reranking (complex, low-confidence, premium users)
   - 90-95% use the lightweight cross-encoder: cost ≈ $0.000001/query vs $0.0001
2. Result caching for top queries:
   - The top 10K queries account for ~60% of traffic (Zipf distribution)
   - Cache these with a 5-minute TTL → 60% reduction in reranking calls
   - Cache hit: ~5 ms response, $0 compute cost
3. Reranking batching:
   - Instead of reranking immediately per query, batch 100 queries together
   - GPU utilization: 1 GPU-second handles 100 queries instead of 1
   - Introduces ~50 ms of additional latency, acceptable within the 500 ms budget
4. Smaller models for most queries:
   - Claude Haiku (fastest) for most LLM reranking
   - Claude Sonnet only for "premium" or safety-flagged queries
   - Cost difference: ~10x
Combined: 1M QPS × 5% LLM rate × 40% cache miss rate × batch efficiency = 1M × 0.05 × 0.40 × 0.01 × $0.0001 = $0.02/second ≈ $1.7K/day. Achievable.
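The cost arithmetic is easy to get wrong under time pressure, so it helps to write it as a function. This is a sketch with illustrative parameter names; the inputs are the factors stated above (5% LLM share, 60% cache hit rate, a 0.01 batching factor, $0.0001 per query), and applying the batching factor directly to per-query cost assumes self-hosted GPUs rather than per-token API pricing.

```python
SECONDS_PER_DAY = 86_400


def llm_rerank_cost_per_day(
    qps,
    cost_per_query,
    llm_fraction=1.0,    # share of queries routed to the LLM reranker
    cache_hit_rate=0.0,  # share of those served from the result cache
    batch_factor=1.0,    # relative cost after batching (0.01 = 100x cheaper)
):
    """Daily LLM reranking spend in dollars under the stated assumptions."""
    cost_per_second = (
        qps * llm_fraction * (1.0 - cache_hit_rate) * batch_factor * cost_per_query
    )
    return cost_per_second * SECONDS_PER_DAY


baseline = llm_rerank_cost_per_day(1_000_000, 0.0001)
optimized = llm_rerank_cost_per_day(
    1_000_000, 0.0001, llm_fraction=0.05, cache_hit_rate=0.60, batch_factor=0.01
)
```

With these inputs the baseline is about $8.64M/day and the optimized figure is about $1,728/day ($0.02/second). Note that the cache term enters as the miss rate (1 − 0.60), not the hit rate.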
Addition 4: “The document corpus contains sensitive proprietary documents with different access permissions”
Access control at query time (not index time):
- Store a per-document ACL (access control list) in a fast KV store (Redis/DynamoDB)
- At retrieval: fan out the query to all shards as normal
- At merge/fusion: filter candidates by checking the ACL for the requesting user
- Only documents the user can access are returned
Index-level separation (for very sensitive data):
- The most sensitive documents (top-secret clearance) → separate index cluster
- Only queries with appropriate permissions are routed to that cluster
- Physical separation > logical filtering for highest-sensitivity data
Trade-off:
- The per-query ACL check adds ~5 ms (Redis lookup, batched)
- Separate index clusters add complexity but provide stronger isolation
- For Anthropic's use case: model knowledge is often not access-controlled but customer enterprise data is → separate index for enterprise RAG
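Query-time ACL filtering at the merge step might look like the following. This is a sketch under assumptions: `acl_lookup` is a hypothetical batched call (one round trip for all candidate ids, e.g. a Redis multi-get) that returns the set of groups allowed to read each document, and a document with no ACL entry is denied by default.

```python
def filter_by_acl(candidates, user_groups, acl_lookup):
    """Keep only candidates the requesting user may read.

    candidates:  list of (doc_id, score) pairs after fusion
    user_groups: set of group names the user belongs to
    acl_lookup:  callable(doc_ids) -> {doc_id: set_of_allowed_groups}
    """
    acls = acl_lookup([doc_id for doc_id, _ in candidates])  # one batched call
    return [
        (doc_id, score)
        for doc_id, score in candidates
        if acls.get(doc_id, set()) & user_groups  # missing ACL means deny
    ]
```

Because filtering happens after retrieval, a user with narrow permissions can end up with fewer than K results; over-fetching candidates before the filter is the usual mitigation.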
Step 7: Failure Mode Analysis
| Failure | Impact | Mitigation |
|---|---|---|
| One index shard fails | 1/16 of documents unavailable | Secondary replica promoted within 30 sec. Query continues with 15/16 shards, degrades gracefully. |
| Embedding service overload | Query embedding fails | Circuit breaker: fall back to BM25-only search. Lower quality, but no embedding call is needed. |
| All replicas of one shard fail | 1/16 of documents unavailable until recovery | Data loss prevented by WAL. Shard restored onto replacement nodes. Estimated recovery: 30 min. |
| LLM reranker unavailable | Reranking degrades to cross-encoder | Tier-1 cross-encoder is always-on backup. QPS limit on LLM reranker prevents cascading failure. |
| Cache cluster fails | All queries hit index | Index layer sized for full throughput without cache. Performance degrades 3–5x but system stays up. |
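The embedding-overload row relies on a circuit breaker with a BM25-only fallback. The sketch below shows one minimal way to wire that up; the class, the thresholds, and the `embed`, `bm25_search`, and `hybrid_search` callables are hypothetical placeholders rather than part of the design above.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures; allows a trial call after a cooldown."""

    def __init__(self, failure_threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: once the cooldown has passed, let a trial request through.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def retrieve(query, embed, bm25_search, hybrid_search, breaker):
    """Use hybrid retrieval when embedding is healthy, BM25-only otherwise."""
    if breaker.allow():
        try:
            vector = embed(query)
        except Exception:
            breaker.record_failure()
        else:
            breaker.record_success()
            return "hybrid", hybrid_search(query, vector)
    return "bm25_only", bm25_search(query)
```

While the circuit is open the embedding service is not called at all, which is what gives an overloaded cluster room to recover.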
Interview Questions
Q1: “The HNSW index rebuild at 1B documents takes too long. The nightly rebuild overlaps with business hours in Asia-Pacific. How do you handle this?”
Model Answer: “Three strategies. First, rolling rebuilds: instead of rebuilding all 16 shards simultaneously, rebuild one shard at a time (each takes ~1/16 of the time). Shard 1 rebuilds while shards 2-16 serve traffic. The shard being rebuilt temporarily goes offline; traffic routes to its replica. Total rebuild time is the same, but only 1/16 of capacity is affected at any time. Second, blue-green index deployment: build the new index alongside the existing one on a separate set of nodes. When the new index is ready (tested, validated), switch traffic atomically from old to new. Zero downtime, but doubles the index node cost temporarily. Third, for the APAC concern specifically — region-specific builds. If the document corpus doesn’t require 100% global consistency, allow APAC region’s index to rebuild during their off-peak (which is US prime time), while US rebuilds during US off-peak. Each region runs on a slightly different schedule. The trade-off: a 2-6 hour window where US and APAC indexes differ slightly.”
Q2: “Walk me through what happens end-to-end when a user submits the query ‘explain quantum entanglement’ to your system.”
Model Answer:
1. The query router receives the HTTP request. It computes a cache key: SHA256(‘explain quantum entanglement’). It checks Redis: cache miss (assume cold start).
2. The query embedding service receives the text. It runs the query through the same embedding model used during document ingestion (critical: same model = consistent vector space). It returns a 1536-dimensional vector in ~20ms.
3. The router fans out in parallel to all 16 shards. Each shard receives two sub-queries simultaneously: a BM25 keyword query for ‘explain quantum entanglement’ (tokenized, stop words removed: ‘quantum entanglement explain’) and an ANN vector search for the query embedding against HNSW. Each shard returns top-50 BM25 results and top-50 vector results in ~20-30ms.
4. The router receives 16 × 100 = 1,600 candidate (document_id, score) pairs in total (800 per retrieval type). It deduplicates by document_id and applies RRF fusion. The document about ‘quantum entanglement for beginners’, ranked #2 in BM25 and #4 in vector, gets RRF score = 1/62 + 1/64 ≈ 0.032. The document about ‘Bell’s theorem and quantum non-locality’, ranked #8 in BM25 and #1 in vector, gets RRF score = 1/68 + 1/61 ≈ 0.031. The scores are very close, so the fused ordering is sensitive to small rank changes. The top 100 candidates are selected.
5. The cross-encoder reranker receives (query, document_excerpt) pairs for the top-100. It processes them in a batch on GPU in ~100ms. It re-scores based on joint query-document understanding: the ‘explain’ intent means accessible explanations score higher than technical papers. It returns the top-10.
6. Safety check on the top-10 results: all have safety_score > threshold (quantum physics is not harmful). Pass.
7. Results are cached in Redis with a 60-minute TTL (physics content is stable). The response is returned to the user. Total: ~25ms (cache check) + ~20ms (embedding) + ~30ms (retrieval) + ~100ms (reranking) + ~5ms (cache write) ≈ 180ms p50, within the 500ms p95 budget.
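The first and last steps of the walk-through (cache key, cache check, cache write with TTL) can be sketched as follows. An in-memory dictionary stands in for Redis here, and the whitespace and case normalization before hashing is an added assumption: the walk-through hashes the raw query string.

```python
import hashlib
import time


def cache_key(query):
    """SHA-256 of the query after light normalization (an assumption, see above)."""
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory stand-in for a Redis result cache with per-entry expiry."""

    def __init__(self, clock=time.monotonic):
        self._store = {}  # key -> (value, expires_at)
        self._clock = clock

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # lazily evict expired entries
            return None
        return value

    def set(self, key, value, ttl_s):
        self._store[key] = (value, self._clock() + ttl_s)
```

In the flow above, the router would call `get(cache_key(query))` first and, on a miss, call `set(..., ttl_s=3600)` after reranking for stable content or `ttl_s=300` for dynamic content.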
Self-Assessment: 5 Questions
- Why is consistent hashing the right sharding strategy for search, and why must every query fan-out to all shards?
- What’s the RRF formula and why does it work without tuning across different score scales?
- What are the three layers of safety defense and what does each one catch?
- How does the two-tier hot/cold index solve the HNSW real-time update problem?
- Why is LLM reranking prohibitively expensive at full throughput, and what’s the cost reduction strategy?
Quick Reference: Distributed Search (Anthropic Style)
- Three-layer architecture: Ingestion (embed + index) → Retrieval (BM25 + vector, sharded) → Reranking (cross-encoder → optional LLM)
- Hybrid retrieval is mandatory: BM25 for exact keyword match, HNSW vector search for semantic similarity. RRF fusion (k=60) combines them without tuning.
- 16 shards, consistent hashing, scatter-gather: every query fans out to all shards, merges top-K from each.
- HNSW at scale: IVF-PQ compression needed for 62.5M vectors per shard (12 GB vs 384 GB). ~5% recall loss, acceptable with reranking.
- Tiered reranking: lightweight cross-encoder (~10ms, always on) → LLM reranker (~200ms, 5-10% of queries only). The cross-encoder is roughly 100x cheaper per query than the LLM reranker.
- Real-time updates: two-tier index (hot/cold). Lucene NRT for inverted index, small hot HNSW rebuilt every 10 minutes.
- Safety at Anthropic: filter at ingestion, classify query intent, filter results by safety score. Immutable audit log.
- The Anthropic differentiator: You must address safety unprompted. Don’t wait to be asked.
Phase 2 Complete — What’s Coming
You’ve finished all five company-specific deep dives:
- Meta: Scale thinking, News Feed data pipeline, fan-out architecture
- Netflix: Iceberg-first, ownership culture, recommendation data pipeline
- Google: GCP-native design, BigQuery cost optimization, search analytics pipeline
- OpenAI: LLM training data, RLHF, AI-native pipelines
- Anthropic: Safety-first, progressive complexity, distributed search
Phase 3 (Days 61-90) starts in 21 days with full mock designs, cross-cutting topics, and interview-day simulation. Before then, Days 41-60 cover the advanced Phase 2 topics: streaming feature engineering, A/B testing infrastructure, data mesh, security/privacy, cost optimization, and more complex designs.
Tomorrow’s Preview
Day 41: Activity Schema & Event Modeling — Event-driven data modeling, the Activity Schema pattern, wide event tables, how to model clickstreams and app events at scale, and when schema-on-read beats schema-on-write — directly relevant to OpenAI conversation data and Meta engagement pipelines.