Phase 2: Deep Dives | Category: System Design Practice

The Prompt

“Design a real-time fraud detection system for a payment platform processing 100,000 transactions per second at peak. The system must score each transaction within 300ms, detect fraud rings (coordinated multi-account attacks), balance precision/recall trade-offs, and feed back detection results to improve the ML model over time.”

Step 1: Requirements

Functional requirements

  • Ingest transactions from all channels (card, mobile, web, API)
  • Score each transaction in real time (fraud probability 0–1)
  • Apply rule-based checks + ML model scoring
  • Detect coordinated fraud rings (multi-account attacks)
  • Block high-confidence fraud before authorization
  • Alert on suspicious patterns below block threshold
  • Route confirmed labels back for model retraining

Non-functional requirements

Scale:         100K TPS peak; design for 300K TPS capacity (3× margin)
Latency:       Decision < 300ms total (scoring ideally < 100ms)
               Async alerting < 30s
Availability:  99.999% (authorization failures = revenue loss)
Consistency:   Strong for blocking decisions; eventual for analytics
False positives: ≤ 0.3% FP (customer experience)
False negatives: target ≤ 0.01% escape (loss)

Key tension: the blocking path must be extremely fast, so heavy work (ring detection, deep joins, long scans) must be async.

Step 2: Scale Estimation

Ingestion:
  100K TPS × 500B ≈ 50 MB/sec ≈ 4.3 TB/day raw
  Compressed ≈ 700 GB/day ≈ 255 TB/year

Kafka:
  100K TPS → ~10+ partitions (order-of-magnitude)
  Replication factor 3 → higher write throughput requirement

Latency budget (illustrative):
  Network → fraud svc:           10ms
  Feature retrieval (Redis):     15ms
  Rule engine:                   5ms
  ML inference:                 30ms
  Write block decision:         10ms
  Response:                     10ms
  Total:                        ~80ms (within 300ms)

Feature store sizing: per-account history is large (TBs), so use Redis for hot keys + a durable store (e.g., Cassandra) for full population.

Step 3: Two-Path Architecture

Synchronous path: block/allow in < 300ms.

Asynchronous path: feature updates, ring detection, analytics, retraining.

Payment Service
  ↓ (sync call, <300ms)
┌─────────────────────────────────────────────────────────────────────┐
│ REAL-TIME SCORING SERVICE (SYNC)                                    │
│  1) Feature retrieval (<15ms)                                       │
│     - Redis: hot features (p99 < 5ms)                               │
│     - Cassandra: cold features on miss (<15ms)                      │
│  2) Rule engine (<5ms)                                              │
│     - velocity, geo, amount anomaly, blocklists                     │
│     - hard-block rules short-circuit                                │
│  3) ML inference (<30ms)                                            │
│     - XGBoost/LightGBM/NN via serving runtime                        │
│  4) Decision (<5ms)                                                 │
│     - score > 0.9: BLOCK                                            │
│     - score > 0.7: REVIEW / step-up auth                            │
│     - else: ALLOW                                                   │
│  5) Write decision (<10ms)                                          │
│     - if BLOCK: set block:{account_id} with TTL                      │
└──────────────────────┬──────────────────────────────────────────────┘
                       ↓ (async event fanout)
┌─────────────────────────────────────────────────────────────────────┐
│ KAFKA TOPICS                                                        │
│  transactions.raw                                                   │
│  transactions.scored                                                │
│  fraud.alerts (score>0.7)                                           │
│  fraud.confirmed (labels later)                                     │
└───────────────┬───────────────────────────────┬─────────────────────┘
                ↓                               ↓
┌───────────────────────────┐   ┌────────────────────────────────────┐
│ FLINK FEATURE UPDATER      │   │ FRAUD RING DETECTION               │
│  - updates velocity counts │   │  Batch (daily Spark): graph + CC   │
│  - maintains rolling stats │   │  Streaming (near-RT): CEP patterns │
│  - writes to Redis +       │   │                                    │
│    Cassandra               │   │ Output: ring_risk_score/account     │
└───────────────────────────┘   └────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│ DATA LAKE (Iceberg on S3)                                           │
│  Bronze: raw + scored (immutable)                                   │
│  Silver: labeled transactions                                       │
│  Gold: feature tables for retraining                                │
└─────────────────────────────────────────────────────────────────────┘

Step 4: Feature Engineering (Core DE Value)

Real-time features (Redis; refreshed by Flink; seconds-level freshness):

account_features = {
    "tx_count_30min": 3,
    "tx_count_1hr": 7,
    "tx_count_24hr": 23,
    "amount_sum_1hr": 450.00,
    "amount_max_30min": 200.00,

    "avg_tx_amount_30d": 45.50,
    "stddev_tx_amount_30d": 23.20,
    "typical_merchants_30d": ["grocery_001", "gas_002", "coffee_003"],
    "typical_hours_30d": [8, 9, 12, 13, 18, 19],

    "last_transaction_country": "US",
    "last_transaction_city": "San Francisco",
    "last_transaction_timestamp": "2026-04-13T09:30:00Z",

    "trusted_devices": ["device_hash_123", "device_hash_456"],
    "last_device_id": "device_hash_123",

    "risk_score_rolling_avg": 0.15,
    "dispute_rate_90d": 0.001,
    "fraud_flag_count_90d": 0,
}

Derived features (computed at score time, ideally sub-millisecond):

def derive_features(transaction, account_features):
    return {
        "amount_z_score": (
            (transaction.amount - account_features["avg_tx_amount_30d"])
            / max(account_features["stddev_tx_amount_30d"], 1)
        ),
        "velocity_anomaly": account_features["tx_count_30min"] > 5,
        "impossible_travel": check_impossible_travel(
            account_features["last_transaction_city"],
            transaction.merchant_city,
            account_features["last_transaction_timestamp"],
            transaction.timestamp,
        ),
        "is_new_merchant": transaction.merchant_id
        not in set(account_features["typical_merchants_30d"]),
        "unusual_hour": transaction.hour not in set(account_features["typical_hours_30d"]),
        "is_new_device": transaction.device_id
        not in set(account_features["trusted_devices"]),
    }

Step 5: Fraud Ring Detection (Graph Layer)

Transaction-by-transaction scoring can miss coordinated rings. Ring detection links entities via shared attributes.

Ring pattern:

20+ accounts controlled by one actor
  - created in a tight time window
  - shared IP ranges / devices / cards / addresses
  - coordinated timing and merchant targets

Graph construction (daily batch concept):

# Pseudocode (Spark-ish) for relationship edges
edges_same_ip = (
    transactions_df.groupBy("account_id", "ip_address")
    .count()
    .filter("count >= 2")
    .withColumn("edge_type", lit("SAME_IP"))
)

edges_same_device = (
    transactions_df.groupBy("account_id", "device_fingerprint")
    .count()
    .filter("count >= 3")
    .withColumn("edge_type", lit("SAME_DEVICE"))
)

# Union edges → community detection / connected components
edges = edges_same_ip.unionByName(edges_same_device)

Community scoring signals:

  • High community density
  • Account creation burst (e.g., within 48 hours)
  • Synchronized transaction timing
  • Low merchant entropy (same targets)

Output: ring_risk_score per account_id written to Redis as a feature for the sync scorer.

Step 6: Precision/Recall Trade-offs (Business Impact)

Confusion matrix:

                Predicted FRAUD      Predicted NOT-FRAUD
Actual FRAUD     True Positive       False Negative
Actual NOT-FRAUD False Positive      True Negative

Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)

Operating points must be chosen with explicit cost trade-offs.

Threshold management:

BLOCK_THRESHOLD = 0.90
REVIEW_THRESHOLD = 0.70
MONITOR_THRESHOLD = 0.50

thresholds_by_category = {
    "luxury_retail": {"block": 0.80, "review": 0.60},
    "grocery": {"block": 0.95, "review": 0.80},
    "crypto_exchange": {"block": 0.70, "review": 0.50},
}

Step 7: Feedback Loop (Model Retraining)

Fraud labels arrive late (e.g., disputes 30–90 days later), so the system needs a point-in-time training pipeline.

transactions.scored (features at score time)

fraud.confirmed (chargeback/dispute label later)

Daily label reconciliation job:
  join by transaction_id
  label=1 if confirmed fraud
  label=0 if no dispute after N days

Iceberg Silver: labeled training set with point-in-time features

Weekly retrain (MLflow + model registry)

Deploy with A/B test (5% new vs 95% current)

Promote or auto-rollback on FP/FN regression

Key DE requirement: the feature vector must reflect what was known at scoring time, not what is known today.

Step 8: Failure Modes and Mitigations

FailureImpactMitigation
Redis downcan’t fetch hot featuresdegrade to rules-only; fail over; use Cassandra fallback
Model service downno ML inferencerules-only + conservative handling; fast failover
Kafka lagasync paths delayedlag monitoring; autoscale consumers
FP spike after new modelcustomer painA/B rollout + auto-rollback thresholds
Ring detection silent failurerings missedDQ checks on ring outputs; alert on suspicious zero-output

Interview Questions

Q1: New attack: 2 tx/day/account but 10,000 accounts simultaneously. Detect?

Model answer:

  • Individual velocity rules won’t trigger.
  • Graph layer detects shared attributes (IP/device/creation bursts) and coordinated timing.
  • Add merchant-level velocity anomaly: even if per-account rate is low, 10,000 accounts hitting the same merchant pattern is detectable quickly.

Q2: Model has 99% precision but 60% recall. Improve recall without raising FP.

Model answer:

  • Don’t just lower thresholds (that increases FP).
  • Do FN breakdown: which fraud types are missed (category/device/channel/time)?
  • Fix feature gaps and freshness gaps (real-time features vs daily batch).
  • Audit missing/null feature rates; improve upstream data completeness.
  • Improve training set coverage for rare patterns (reweighting/oversampling by fraud type), and retest at fixed FP budget.

Self-Assessment: 5 Questions

  • What happens in the synchronous path and where is the latency spent?
  • Why is ring detection daily batch (plus some streaming heuristics) rather than fully synchronous?
  • Precision-optimized vs recall-optimized: what changes and why?
  • How does a confirmed fraud label 6 weeks later affect the next model?
  • What’s the degraded behavior if Redis is down?

Quick Reference

  • Two-path architecture: synchronous (p99 < 300ms) + async (seconds–days).
  • Scoring: rules first (short-circuit hard blocks), then ML inference.
  • Feature tiers: Redis hot, Cassandra cold, derived at score time.
  • Three thresholds: block/review/monitor; optionally category-specific.
  • Ring detection: graph communities → ring_risk_score feature.
  • Feedback loop: scored → labels later → point-in-time training set → retrain → A/B → promote/rollback.

Tomorrow’s Preview

Day 60: Phase 2 Review — comprehensive review and self-assessment before Phase 3 mock interviews.