Phase 2: Deep Dives | Category: System Design Practice
The Prompt
“Design a real-time fraud detection system for a payment platform processing 100,000 transactions per second at peak. The system must score each transaction within 300ms, detect fraud rings (coordinated multi-account attacks), balance precision/recall trade-offs, and feed back detection results to improve the ML model over time.”
Step 1: Requirements
Functional requirements
- Ingest transactions from all channels (card, mobile, web, API)
- Score each transaction in real time (fraud probability 0–1)
- Apply rule-based checks + ML model scoring
- Detect coordinated fraud rings (multi-account attacks)
- Block high-confidence fraud before authorization
- Alert on suspicious patterns below block threshold
- Route confirmed labels back for model retraining
Non-functional requirements
Scale: 100K TPS peak; design for 300K TPS capacity (3× margin)
Latency: Decision < 300ms total (scoring ideally < 100ms)
Async alerting < 30s
Availability: 99.999% (authorization failures = revenue loss)
Consistency: Strong for blocking decisions; eventual for analytics
False positives: ≤ 0.3% FP (customer experience)
False negatives: target ≤ 0.01% escape (loss)
Key tension: the blocking path must be extremely fast, so heavy work (ring detection, deep joins, long scans) must be async.
Step 2: Scale Estimation
Ingestion:
100K TPS × 500B ≈ 50 MB/sec ≈ 4.3 TB/day raw
Compressed ≈ 700 GB/day ≈ 255 TB/year
Kafka:
100K TPS → ~10+ partitions (order-of-magnitude)
Replication factor 3 → higher write throughput requirement
Latency budget (illustrative):
Network → fraud svc: 10ms
Feature retrieval (Redis): 15ms
Rule engine: 5ms
ML inference: 30ms
Write block decision: 10ms
Response: 10ms
Total: ~80ms (within 300ms)
Feature store sizing: per-account history is large (TBs), so use Redis for hot keys + a durable store (e.g., Cassandra) for full population.
Step 3: Two-Path Architecture
Synchronous path: block/allow in < 300ms.
Asynchronous path: feature updates, ring detection, analytics, retraining.
Payment Service
↓ (sync call, <300ms)
┌─────────────────────────────────────────────────────────────────────┐
│ REAL-TIME SCORING SERVICE (SYNC) │
│ 1) Feature retrieval (<15ms) │
│ - Redis: hot features (p99 < 5ms) │
│ - Cassandra: cold features on miss (<15ms) │
│ 2) Rule engine (<5ms) │
│ - velocity, geo, amount anomaly, blocklists │
│ - hard-block rules short-circuit │
│ 3) ML inference (<30ms) │
│ - XGBoost/LightGBM/NN via serving runtime │
│ 4) Decision (<5ms) │
│ - score > 0.9: BLOCK │
│ - score > 0.7: REVIEW / step-up auth │
│ - else: ALLOW │
│ 5) Write decision (<10ms) │
│ - if BLOCK: set block:{account_id} with TTL │
└──────────────────────┬──────────────────────────────────────────────┘
↓ (async event fanout)
┌─────────────────────────────────────────────────────────────────────┐
│ KAFKA TOPICS │
│ transactions.raw │
│ transactions.scored │
│ fraud.alerts (score>0.7) │
│ fraud.confirmed (labels later) │
└───────────────┬───────────────────────────────┬─────────────────────┘
↓ ↓
┌───────────────────────────┐ ┌────────────────────────────────────┐
│ FLINK FEATURE UPDATER │ │ FRAUD RING DETECTION │
│ - updates velocity counts │ │ Batch (daily Spark): graph + CC │
│ - maintains rolling stats │ │ Streaming (near-RT): CEP patterns │
│ - writes to Redis + │ │ │
│ Cassandra │ │ Output: ring_risk_score/account │
└───────────────────────────┘ └────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────┐
│ DATA LAKE (Iceberg on S3) │
│ Bronze: raw + scored (immutable) │
│ Silver: labeled transactions │
│ Gold: feature tables for retraining │
└─────────────────────────────────────────────────────────────────────┘
Step 4: Feature Engineering (Core DE Value)
Real-time features (Redis; refreshed by Flink; seconds-level freshness):
account_features = {
"tx_count_30min": 3,
"tx_count_1hr": 7,
"tx_count_24hr": 23,
"amount_sum_1hr": 450.00,
"amount_max_30min": 200.00,
"avg_tx_amount_30d": 45.50,
"stddev_tx_amount_30d": 23.20,
"typical_merchants_30d": ["grocery_001", "gas_002", "coffee_003"],
"typical_hours_30d": [8, 9, 12, 13, 18, 19],
"last_transaction_country": "US",
"last_transaction_city": "San Francisco",
"last_transaction_timestamp": "2026-04-13T09:30:00Z",
"trusted_devices": ["device_hash_123", "device_hash_456"],
"last_device_id": "device_hash_123",
"risk_score_rolling_avg": 0.15,
"dispute_rate_90d": 0.001,
"fraud_flag_count_90d": 0,
}
Derived features (computed at score time, ideally sub-millisecond):
def derive_features(transaction, account_features):
return {
"amount_z_score": (
(transaction.amount - account_features["avg_tx_amount_30d"])
/ max(account_features["stddev_tx_amount_30d"], 1)
),
"velocity_anomaly": account_features["tx_count_30min"] > 5,
"impossible_travel": check_impossible_travel(
account_features["last_transaction_city"],
transaction.merchant_city,
account_features["last_transaction_timestamp"],
transaction.timestamp,
),
"is_new_merchant": transaction.merchant_id
not in set(account_features["typical_merchants_30d"]),
"unusual_hour": transaction.hour not in set(account_features["typical_hours_30d"]),
"is_new_device": transaction.device_id
not in set(account_features["trusted_devices"]),
}
Step 5: Fraud Ring Detection (Graph Layer)
Transaction-by-transaction scoring can miss coordinated rings. Ring detection links entities via shared attributes.
Ring pattern:
20+ accounts controlled by one actor
- created in a tight time window
- shared IP ranges / devices / cards / addresses
- coordinated timing and merchant targets
Graph construction (daily batch concept):
# Pseudocode (Spark-ish) for relationship edges
edges_same_ip = (
transactions_df.groupBy("account_id", "ip_address")
.count()
.filter("count >= 2")
.withColumn("edge_type", lit("SAME_IP"))
)
edges_same_device = (
transactions_df.groupBy("account_id", "device_fingerprint")
.count()
.filter("count >= 3")
.withColumn("edge_type", lit("SAME_DEVICE"))
)
# Union edges → community detection / connected components
edges = edges_same_ip.unionByName(edges_same_device)
Community scoring signals:
- High community density
- Account creation burst (e.g., within 48 hours)
- Synchronized transaction timing
- Low merchant entropy (same targets)
Output: ring_risk_score per account_id written to Redis as a feature for the sync scorer.
Step 6: Precision/Recall Trade-offs (Business Impact)
Confusion matrix:
Predicted FRAUD Predicted NOT-FRAUD
Actual FRAUD True Positive False Negative
Actual NOT-FRAUD False Positive True Negative
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Operating points must be chosen with explicit cost trade-offs.
Threshold management:
BLOCK_THRESHOLD = 0.90
REVIEW_THRESHOLD = 0.70
MONITOR_THRESHOLD = 0.50
thresholds_by_category = {
"luxury_retail": {"block": 0.80, "review": 0.60},
"grocery": {"block": 0.95, "review": 0.80},
"crypto_exchange": {"block": 0.70, "review": 0.50},
}
Step 7: Feedback Loop (Model Retraining)
Fraud labels arrive late (e.g., disputes 30–90 days later), so the system needs a point-in-time training pipeline.
transactions.scored (features at score time)
↓
fraud.confirmed (chargeback/dispute label later)
↓
Daily label reconciliation job:
join by transaction_id
label=1 if confirmed fraud
label=0 if no dispute after N days
↓
Iceberg Silver: labeled training set with point-in-time features
↓
Weekly retrain (MLflow + model registry)
↓
Deploy with A/B test (5% new vs 95% current)
↓
Promote or auto-rollback on FP/FN regression
Key DE requirement: the feature vector must reflect what was known at scoring time, not what is known today.
Step 8: Failure Modes and Mitigations
| Failure | Impact | Mitigation |
|---|---|---|
| Redis down | can’t fetch hot features | degrade to rules-only; fail over; use Cassandra fallback |
| Model service down | no ML inference | rules-only + conservative handling; fast failover |
| Kafka lag | async paths delayed | lag monitoring; autoscale consumers |
| FP spike after new model | customer pain | A/B rollout + auto-rollback thresholds |
| Ring detection silent failure | rings missed | DQ checks on ring outputs; alert on suspicious zero-output |
Interview Questions
Q1: New attack: 2 tx/day/account but 10,000 accounts simultaneously. Detect?
Model answer:
- Individual velocity rules won’t trigger.
- Graph layer detects shared attributes (IP/device/creation bursts) and coordinated timing.
- Add merchant-level velocity anomaly: even if per-account rate is low, 10,000 accounts hitting the same merchant pattern is detectable quickly.
Q2: Model has 99% precision but 60% recall. Improve recall without raising FP.
Model answer:
- Don’t just lower thresholds (that increases FP).
- Do FN breakdown: which fraud types are missed (category/device/channel/time)?
- Fix feature gaps and freshness gaps (real-time features vs daily batch).
- Audit missing/null feature rates; improve upstream data completeness.
- Improve training set coverage for rare patterns (reweighting/oversampling by fraud type), and retest at fixed FP budget.
Self-Assessment: 5 Questions
- What happens in the synchronous path and where is the latency spent?
- Why is ring detection daily batch (plus some streaming heuristics) rather than fully synchronous?
- Precision-optimized vs recall-optimized: what changes and why?
- How does a confirmed fraud label 6 weeks later affect the next model?
- What’s the degraded behavior if Redis is down?
Quick Reference
- Two-path architecture: synchronous (
p99 < 300ms) + async (seconds–days). - Scoring: rules first (short-circuit hard blocks), then ML inference.
- Feature tiers: Redis hot, Cassandra cold, derived at score time.
- Three thresholds: block/review/monitor; optionally category-specific.
- Ring detection: graph communities →
ring_risk_scorefeature. - Feedback loop: scored → labels later → point-in-time training set → retrain → A/B → promote/rollback.
Tomorrow’s Preview
Day 60: Phase 2 Review — comprehensive review and self-assessment before Phase 3 mock interviews.