Phase 2: Deep Dives | Category: ML Data Infrastructure

Why Feature Engineering Belongs in Senior DE Interviews

Feature stores bridge the gap between data engineering and machine learning. At Meta (ranking), Netflix (recommendations), Google (search), OpenAI (inference), and Anthropic (safety scoring), the data engineer’s job doesn’t end at the warehouse — it extends to the serving layer that feeds the ML model in production.

Per Educative.io: “Feature store questions expose subtle correctness challenges. The central concern is training-serving skew where the features used during training differ from those served during inference, silently degrading model accuracy.” This is the problem that makes feature engineering a serious systems design challenge, not just data transformation.

The Two Core Problems Feature Stores Solve

Problem 1: Training-Serving Skew

Your model trains on features computed one way but serves predictions using features computed another way. At inference time it receives inputs from a distribution it never saw in training.

Training pipeline (offline, Spark):
  def compute_user_activity(user_id, lookback_days=30):
      return count_events(user_id, lookback_days)  # 30-day window

Serving pipeline (online, ad-hoc code):
  def compute_user_activity(user_id, lookback_days=7):  # BUG: 7 days, not 30
      return count_recent_events(user_id, lookback_days)

Result: Model was trained expecting 30-day activity counts
        but served 7-day counts. Silent degradation.
        Active users look like low-activity users to the model, so it scores them wrong.
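
One common guard against this bug is to define the window once and have both pipelines call the same function. The sketch below is a minimal illustration, not a feature-store API: `LOOKBACK_DAYS` and `user_activity_count` are hypothetical names, and the event data is made up.

```python
from datetime import datetime, timedelta

# Hypothetical shared module: the window is defined once and imported by both
# the training job and the serving service, so the two cannot drift apart.
LOOKBACK_DAYS = 30


def user_activity_count(event_times, as_of, lookback_days=LOOKBACK_DAYS):
    """Count events in the half-open window (as_of - lookback_days, as_of]."""
    window_start = as_of - timedelta(days=lookback_days)
    return sum(1 for t in event_times if window_start < t <= as_of)


# Toy data: one event per day for the 40 days ending at as_of.
as_of = datetime(2026, 4, 1)
events = [as_of - timedelta(days=d) for d in range(40)]

training_value = user_activity_count(events, as_of)  # offline path
serving_value = user_activity_count(events, as_of)   # online path, same code
skewed_value = user_activity_count(events, as_of, lookback_days=7)  # the bug above
```

With one event per day, the shared definition yields 30 on both paths, while the hard-coded 7-day variant yields 7: the same user, two very different model inputs.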

Problem 2: Data Leakage (Future Information in Training Data)

Training data accidentally includes information that wouldn’t have been available at prediction time — the model learns from the future, appears to perform well in testing, then fails in production.

Label: did this user churn? (known after the fact)
Event: churn_date = 2026-03-15

BAD training join:
  user features as of 2026-04-01 (AFTER churn) joined with churn label from 2026-03-15
  The model learns "user with 0 sessions in last month = churned" — but at the
  time of prediction (2026-03-15), that future 0-session period hadn't happened yet!

GOOD training join (point-in-time correct):
  user features as of 2026-03-14 (one day BEFORE churn event)
  The model learns from what was actually available at decision time
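
A cheap safeguard is to assert, before training, that no feature snapshot is newer than the moment the prediction would have been made. The following is a sketch under assumed field names (`feature_time`, `prediction_time`), which are hypothetical and not taken from any particular tool.

```python
from datetime import datetime


def leaking_rows(training_rows):
    """Return rows whose feature snapshot is later than the prediction time."""
    return [
        row for row in training_rows
        if row["feature_time"] > row["prediction_time"]
    ]


rows = [
    # BAD: features snapshotted on 2026-04-01, after the 2026-03-15 churn date
    {"user_id": "u-001", "feature_time": datetime(2026, 4, 1),
     "prediction_time": datetime(2026, 3, 15), "label": 1},
    # GOOD: features snapshotted the day before the churn date
    {"user_id": "u-001", "feature_time": datetime(2026, 3, 14),
     "prediction_time": datetime(2026, 3, 15), "label": 1},
]

bad = leaking_rows(rows)
```

Running a check like this in the training pipeline turns a silent leak into a failing assertion.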

Feature Store Architecture: The Two Stores

                    ┌─────────────────────────────┐
                    │        FEATURE STORE        │
                    │                             │
┌────────────┐      │  ┌───────────────────────┐  │       ┌──────────────┐
│   Batch    │      │  │     OFFLINE STORE     │  │       │   Training   │
│  Pipeline  │─────→│  │   (Parquet/Iceberg)   │  │──────→│   Pipeline   │
│ (Spark/dbt)│      │  │  Point-in-time joins  │  │       └──────────────┘
└────────────┘      │  │  Historical features  │  │
                    │  └───────────┬───────────┘  │
                    │              │ Materialization
                    │              ↓              │
                    │  ┌───────────────────────┐  │
                    │  │     ONLINE STORE      │  │──────→ ML Model
┌────────────┐      │  │   (Redis/Cassandra/   │  │        (p99 < 10ms)
│   Stream   │─────→│  │   DynamoDB/Bigtable)  │  │
│  Pipeline  │      │  │  Current values only  │  │
│  (Flink)   │      │  │  Low-latency serving  │  │
└────────────┘      │  └───────────────────────┘  │
                    └─────────────────────────────┘
  • Offline store: historical feature values with timestamps. Used for point-in-time correct training datasets. Optimized for throughput, not latency. Typically Parquet/Iceberg on S3 or a data warehouse.
  • Online store: current feature values per entity (user_id, item_id). Used for real-time inference. Optimized for latency (p99 < 10ms). Typically Redis, Cassandra, DynamoDB, or Bigtable. Stores only the latest value (no history).

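The materialization process: batch pipelines compute features and land them in the offline store, and a materialization job then copies the latest value per entity into the online store (some platforms instead write both stores from the same job). Streaming pipelines write real-time features to the online store with low lag. The offline store therefore receives both batch writes (historical features) and periodic snapshots of the online store.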
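
Conceptually, the online store is a "latest value per entity" reduction over the offline history. The sketch below uses plain dicts as stand-ins for both stores; `materialize_latest` is a hypothetical helper and the rows are toy data.

```python
def materialize_latest(offline_rows):
    """Build an online view holding only the newest row per entity key."""
    online = {}
    for row in offline_rows:
        key = row["user_id"]
        current = online.get(key)
        # ISO-8601 strings in the same format sort chronologically.
        if current is None or row["timestamp"] > current["timestamp"]:
            online[key] = row
    return online


offline_rows = [
    {"user_id": "u-001", "timestamp": "2026-03-01T00:00", "login_count_30d": 12},
    {"user_id": "u-001", "timestamp": "2026-03-15T00:00", "login_count_30d": 18},
    {"user_id": "u-002", "timestamp": "2026-03-10T00:00", "login_count_30d": 4},
]
online_store = materialize_latest(offline_rows)
```

The offline rows keep every timestamped value for training, while the online view keeps one row per user for serving.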

Point-in-Time Correctness: The Critical Concept

This is the feature store’s most important technical contribution. Per ApX Machine Learning: “Implementing point-in-time correct feature lookups is a mechanism essential for preventing data leakage.”

How it works — the “as-of join”:

Feature history for user u-001:
  timestamp          | login_count_30d
  -------------------|----------------
  2026-03-01 00:00   | 12
  2026-03-15 00:00   | 18
  2026-04-01 00:00   | 7    ← (user churned, logged in less)

Training label:
  user_id | churn_event_timestamp | label
  u-001   | 2026-03-18 14:30      | 1 (churned)

Point-in-time correct feature join:
  As of 2026-03-18 14:30 → login_count_30d = 18
  (NOT 7 — that's future data after the churn event)

Result: Training example → {login_count_30d: 18, label: 1}
Model learns: users with 18 logins can still churn → correct behavior
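
The lookup above can be sketched for a single entity with a binary search over a sorted history. This is an illustration of the as-of rule only, using the u-001 values from the example; `as_of` is a hypothetical helper, not a library function.

```python
from bisect import bisect_right
from datetime import datetime

# Feature history for u-001 from the example above, sorted by timestamp.
history = [
    (datetime(2026, 3, 1), 12),
    (datetime(2026, 3, 15), 18),
    (datetime(2026, 4, 1), 7),
]


def as_of(history, event_time):
    """Return the latest value whose timestamp is <= event_time, else None."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_time)
    return history[idx - 1][1] if idx > 0 else None


value_at_churn = as_of(history, datetime(2026, 3, 18, 14, 30))
value_before_history = as_of(history, datetime(2026, 2, 1))
```

The second call shows the edge case worth deciding up front: a label that predates all feature history gets no value, and the training pipeline has to drop or impute that row.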

Implementation in Feast:

import pandas as pd
from feast import FeatureStore

# Assumes a Feast repo (feature_store.yaml) in the current directory
fs = FeatureStore(repo_path=".")

# The spine: entity + label timestamp
spine_df = pd.DataFrame({
    "user_id": ["u-001", "u-002", "u-001"],
    # Feast expects datetime values here, not strings
    "event_timestamp": pd.to_datetime([
        "2026-03-18T14:30:00Z",  # churn event for u-001
        "2026-03-25T09:00:00Z",  # observation for u-002 (label 0: no churn)
        "2026-04-01T00:00:00Z",  # different event for u-001
    ], utc=True),
    "label": [1, 0, 1],
})

# Feast retrieves point-in-time correct features for each row
training_data = fs.get_historical_features(
    entity_df=spine_df,
    features=["user_features:login_count_30d", "user_features:session_duration_avg"],
).to_df()
# Each row gets the feature value as it existed at event_timestamp
# NOT the current value

SQL equivalent (without a feature store):

-- Point-in-time join: for each label event, get the most recent feature value
-- that existed at or before the event timestamp
SELECT
    user_id,
    event_timestamp,
    label,
    login_count_30d,
    session_duration_avg
FROM (
    SELECT
        labels.user_id,
        labels.event_timestamp,
        labels.label,
        features.login_count_30d,
        features.session_duration_avg,
        -- Rank candidate feature rows per label event, newest first
        ROW_NUMBER() OVER (
            PARTITION BY labels.user_id, labels.event_timestamp
            ORDER BY features.feature_timestamp DESC
        ) AS rn
    FROM labels
    LEFT JOIN feature_history AS features
        ON features.user_id = labels.user_id
       AND features.feature_timestamp <= labels.event_timestamp  -- only past data
) ranked
WHERE rn = 1;

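A query of this shape works but becomes prohibitively expensive at scale: each label row is joined against every earlier feature row for that user before the ranking discards all but one. Feature stores optimize this with specialized indexing, partitioning by timestamp, and storage layouts aligned for as-of queries.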

Online vs Offline Features: The Split Decision

Feature type            | Where computed           | Where stored       | Freshness      | Example
------------------------+--------------------------+--------------------+----------------+----------------------------------------
Static                  | Batch (daily)            | Online + Offline   | 24 hours       | user_country, account_age_days, subscription_tier
Slow-moving             | Batch (daily)            | Online + Offline   | 24 hours       | avg_session_duration_30d, category_affinity_vector, ltv_segment
Fast-moving             | Streaming (Flink)        | Online only        | Seconds        | events_last_5min, session_duration_current, items_viewed_this_session
On-demand / real-time   | Computed at request time | Not stored         | Always current | time_since_last_login, request_hour_of_day
Pre-computed embeddings | Batch (nightly)          | Online (vector DB) | 24 hours       | user_embedding_128d, content_embedding_128d

The serving path at inference time:

ML model inference request: {user_id: "u-001", item_id: "i-999"}

Feature retrieval (parallel, ~15ms total, bounded by the slowest lookup):
  L1 cache check (in-process, <1ms): session features → HIT for user active in session
  L2 Redis (5ms): user slow-moving features → user_ltv_segment: "high", login_count_30d: 18
  L3 Cassandra (15ms): item features → item_popularity_score: 0.87, item_age_days: 42
  On-demand compute (<1ms): request_hour_of_day: 14 (computed from clock)

Feature vector assembled: [18, "high", 0.87, 42, 14, ...]

Model inference: → P(purchase) = 0.73
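
The parallel retrieval step can be sketched with `asyncio.gather`. Everything here is a stand-in: the three dicts play the role of the cache, Redis, and Cassandra tiers, the sleeps simulate the latencies quoted above, and `lookup` and `fetch_features` are hypothetical names.

```python
import asyncio
import time

# Toy stand-ins for the three stores; latencies mirror the figures above.
SESSION_CACHE = {"u-001": {"items_viewed_this_session": 3}}
USER_STORE = {"u-001": {"user_ltv_segment": "high", "login_count_30d": 18}}
ITEM_STORE = {"i-999": {"item_popularity_score": 0.87, "item_age_days": 42}}


async def lookup(store, key, latency_s):
    """Simulate a network lookup; a miss returns an empty dict."""
    await asyncio.sleep(latency_s)
    return store.get(key, {})


async def fetch_features(user_id, item_id, request_hour):
    # The three lookups run concurrently, so wall time tracks the slowest one.
    session, user, item = await asyncio.gather(
        lookup(SESSION_CACHE, user_id, 0.001),
        lookup(USER_STORE, user_id, 0.005),
        lookup(ITEM_STORE, item_id, 0.015),
    )
    on_demand = {"request_hour_of_day": request_hour}  # computed, not fetched
    return {**session, **user, **item, **on_demand}


start = time.perf_counter()
features = asyncio.run(fetch_features("u-001", "i-999", request_hour=14))
elapsed_ms = (time.perf_counter() - start) * 1000
```

Because the lookups overlap, `elapsed_ms` should land near the slowest tier's latency instead of the sum of all three, which is why the slowest store sets the retrieval budget.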

Feature Freshness: Tiers and SLAs

Feature            | Freshness SLO    | Breach Impact
-------------------+------------------+---------------------------
user_tier          | 24 hours         | Serving wrong tier offers
session_duration   | 5 minutes        | Slightly stale personalization
fraud_signals      | 30 seconds       | Missed fraud detection window
real-time_activity | 5 seconds        | Stale recommendation context

Monitoring freshness:

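from datetime import datetime, timezone

# Assumes `r` is a redis.Redis client created with decode_responses=True, and that
# each last_updated value is an ISO-8601 string with an offset, e.g. "2026-04-13T12:00:00+00:00".
# critical_features, feature_slo_seconds, and alert() are defined elsewhere.
for feature_key in critical_features:
    raw = r.get(f"metadata:{feature_key}:last_updated")
    if raw is None:
        alert(f"FEATURE FRESHNESS: no update metadata for {feature_key}")
        continue
    lag_seconds = (datetime.now(timezone.utc) - datetime.fromisoformat(raw)).total_seconds()

    if lag_seconds > feature_slo_seconds[feature_key]:
        alert(f"FEATURE FRESHNESS SLO BREACH: {feature_key} is {lag_seconds:.0f}s stale")
        # Also serve a default/fallback feature value rather than crashing inference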
Feature Store Tool Comparison

Aspect                       | Feast (OSS)                               | Tecton (Enterprise)                    | Databricks Feature Store | Vertex AI Feature Store
-----------------------------+-------------------------------------------+----------------------------------------+--------------------------+-------------------------
Transformations              | Not included; bring your own (dbt, Spark) | Built-in (batch, streaming, real-time) | Built-in (Delta Lake)    | Limited
Online serving               | Redis, DynamoDB, Cassandra                | Managed, sub-10ms SLA                  | Delta + online serving   | Bigtable
Training-serving consistency | Engineer’s responsibility                 | Architecturally enforced               | Within Databricks        | Within GCP
Backfill                     | Manual                                    | Automatic (detects gaps)               | Manual                   | Manual
Streaming features           | External Flink/Spark Streaming            | Native support                         | Spark Streaming          | Dataflow
Cost                         | Low (infra cost only)                     | High (enterprise pricing)              | Included with Databricks | GCP pricing
Best for                     | Flexible, batch-heavy, cost-conscious     | Real-time fraud/recs, enterprise SLAs  | Databricks-centric ML    | GCP-centric ML

The right choice by scenario:

  • Startup / small team: Feast + Redis + Spark (dbt for offline, Feast for serving)
  • Real-time fraud detection: Tecton (managed SLA, streaming feature pipelines, monitoring)
  • Databricks-native ML: Databricks Feature Store
  • GCP ML workloads: Vertex AI Feature Store

The Backfilling Problem

When you add a new feature or change feature computation logic, you need to backfill historical values for model retraining.

Backfilling strategies:

Option 1: Reprocess from raw events (most accurate):

import pandas as pd

# Recompute login_count_30d for every day since 2024-01-01.
# compute_login_count_30d and offline_store stand in for your own job and sink.
for date in pd.date_range("2024-01-01", "2026-04-13", freq="D"):
    features = compute_login_count_30d(
        events_up_to=date,  # only use data available on that date
        lookback_days=30
    )
    offline_store.write(date=date, features=features)

Option 2: Approximate from existing data (faster, less accurate):

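# Single set-based Spark pass over raw events instead of a per-day loop.
# Approximate: a user only gets a row on days with at least one login.
spark.sql("""
    INSERT INTO offline_store.user_features
    SELECT user_id, event_date, login_count_30d
    FROM (
        SELECT user_id, event_date,
            -- current day plus the 29 days before it
            SUM(daily_logins) OVER (
                PARTITION BY user_id
                ORDER BY datediff(event_date, DATE '1970-01-01')
                RANGE BETWEEN 29 PRECEDING AND CURRENT ROW
            ) AS login_count_30d
        FROM (
            SELECT user_id, event_date, COUNT(*) AS daily_logins
            FROM raw_events
            WHERE event_type = 'login'
              -- read 29 extra days so the first windows are complete
              AND event_date BETWEEN DATE '2023-12-03' AND DATE '2026-04-13'
            GROUP BY user_id, event_date
        ) daily
    ) windowed
    WHERE event_date >= DATE '2024-01-01'
""")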

Backfill in Tecton: change the transformation definition → Tecton detects which feature values are missing → automatically recomputes from stored raw events.
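
To make the "only use data available on that date" rule concrete, here is a small pure-Python sketch of a point-in-time backfill. The function names and the three login dates are invented for illustration; a real backfill would run the same logic as a distributed job.

```python
from datetime import date, timedelta


def login_count_30d(login_dates, as_of_date, lookback_days=30):
    """Logins in the lookback_days ending on as_of_date (inclusive)."""
    start = as_of_date - timedelta(days=lookback_days - 1)
    return sum(1 for d in login_dates if start <= d <= as_of_date)


def backfill(login_dates, first_day, last_day):
    """Yield one (day, value) row per calendar day, using only past data."""
    day = first_day
    while day <= last_day:
        yield day, login_count_30d(login_dates, day)
        day += timedelta(days=1)


# Toy history: the user logged in on three days.
logins = [date(2026, 1, 1), date(2026, 1, 10), date(2026, 2, 5)]
rows = dict(backfill(logins, date(2026, 1, 1), date(2026, 2, 10)))
```

Each backfilled day gets its own value, including days with no login, and the count drops as soon as an old login slides out of the 30-day window (the value for 2026-01-31 no longer includes the 2026-01-01 login).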

Interview Questions

Q1: “You’re building a fraud detection model. What features would you engineer, and how would you design the feature serving pipeline to meet a p99 < 50ms SLA?”

Model Answer: “I’d design three tiers of features with different freshness and serving paths. First, static account features (account_age_days, account_country, is_verified, historical_fraud_flag) — updated daily by a Spark batch job, stored in Redis with 24-hour TTL. Second, real-time behavioral features (transactions_last_1hr, transactions_last_5min, unique_merchants_last_24hr, current_session_amount_total) — computed by a Flink job reading from the transaction event stream. These update within 30 seconds of each transaction and are stored in Redis with 1-hour TTL. Third, on-demand features (transaction_hour_of_day, is_unusual_amount_for_user, distance_from_usual_location) — computed at inference time from the current transaction and cached account history.

Serving path: when a transaction arrives, fetch static features (Redis ~2ms) and real-time features (Redis ~3ms) in parallel. Compute on-demand features (~1ms). Total feature retrieval: ~5ms. Model inference: ~20ms. Remaining ~25ms for overhead.

For offline training, I’d use point-in-time joins: for each historical transaction (with fraud label), retrieve feature values as they existed at transaction time. This prevents leakage from the target: a transaction labeled fraudulent shouldn’t use features computed after the fraud occurred.”
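
The latency budget in this answer can be sanity-checked with a few lines of arithmetic. The numbers are the rough figures quoted in the answer, not measurements, and a real p99 budget would use tail latencies for each step.

```python
SLA_MS = 50

static_fetch_ms = 2    # Redis, static features
realtime_fetch_ms = 3  # Redis, streaming features
on_demand_ms = 1       # computed from the request
inference_ms = 20

# Parallel fetches cost as much as the slower one; on-demand runs afterwards.
retrieval_ms = max(static_fetch_ms, realtime_fetch_ms) + on_demand_ms
headroom_ms = SLA_MS - retrieval_ms - inference_ms
```

This gives 4 ms of retrieval and 26 ms of headroom, which the answer rounds to roughly 5 ms and 25 ms.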

Q2: “Explain training-serving skew and how a feature store prevents it.”

Model Answer: “Training-serving skew occurs when feature computation logic differs between training and serving. It’s insidious because offline evaluation can look great while production silently degrades.

A feature store prevents skew by being the single source of feature definitions: define the transformation once, then materialize the same logic into both offline (training) and online (serving) stores. There’s no separate serving-side code path that can diverge. You still test for skew explicitly: periodically compare offline feature distributions against online-logged feature values.”
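
The explicit skew test mentioned at the end of the answer can be as simple as comparing summary statistics of the same feature from both paths. The sketch below is one crude option (production systems often use PSI or a KS test instead); `skew_report`, the threshold, and the sample values are all illustrative assumptions.

```python
from statistics import mean, pstdev


def skew_report(offline_values, online_values, threshold=0.25):
    """Flag a feature when the online mean drifts from the offline mean.

    Drift is measured in offline standard deviations; the 0.25 threshold is
    an arbitrary illustrative default, not a recommended setting.
    """
    off_mean, on_mean = mean(offline_values), mean(online_values)
    off_std = pstdev(offline_values) or 1.0  # avoid dividing by zero
    shift = abs(on_mean - off_mean) / off_std
    return {"shift_in_std": shift, "skewed": shift > threshold}


# Toy samples of login_count_30d: training saw 30-day counts,
# while the buggy serving path logged values that look like 7-day counts.
offline = [18, 22, 25, 30, 27, 21]
online_ok = [19, 23, 24, 29, 26, 22]
online_bug = [4, 5, 6, 7, 6, 5]

ok = skew_report(offline, online_ok)
bug = skew_report(offline, online_bug)
```

The healthy sample passes and the 7-day-style sample is flagged, which is the kind of signal that would have caught the lookback-window bug described earlier.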

Think About This

You’re in a Meta interview. The prompt: “You’re building the feature store for Meta’s News Feed ranking model. The model uses 2,500 features. Some features are user-level (engagement history), some are content-level (post quality scores), and some are user-content pair-level (historical interactions). How do you design the feature store architecture?”

Walk through:

  • How do you partition features? (by entity type / namespace)
  • How do you handle user-content pair features at Meta scale?
  • What’s the online store backend?
  • How do you handle 2,500 features without 2,500 Redis lookups?
  • What’s the freshness SLA for content features?
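
As a starting point for the lookup-count question, one possible direction is to pack all features for an entity under a single key so a request needs one batched read per entity, not one per feature. The sketch below uses a dict as a stand-in for the key-value store; the key format, helper names, and the 1,500/1,000 split are arbitrary assumptions.

```python
import json

online_store = {}  # stand-in for a key-value store such as Redis


def write_entity(entity_type, entity_id, features):
    """Store every feature for one entity under a single key."""
    online_store[f"{entity_type}:{entity_id}"] = json.dumps(features)


def read_entities(keys):
    """One batched read (think MGET) returns all features for each entity."""
    return {k: json.loads(online_store[k]) for k in keys if k in online_store}


write_entity("user", "u-001", {f"user_f{i}": i for i in range(1500)})
write_entity("post", "p-42", {f"post_f{i}": i * 0.5 for i in range(1000)})

rows = read_entities(["user:u-001", "post:p-42"])
feature_count = sum(len(v) for v in rows.values())
```

Two keys return all 2,500 values here. The trade-off to discuss in the interview is payload size and write amplification: updating one fast-moving feature means rewriting the whole blob unless features are grouped by update cadence.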

Quick Reference

  • Two problems: training-serving skew + data leakage
  • Two stores: offline (historical, point-in-time joins) + online (current values, p99 < 10ms)
  • Feature tiers: static → slow-moving → fast-moving → on-demand → embeddings
  • Point-in-time join: for each label at time T, retrieve the latest feature value at or before T
  • Tools: Feast (OSS), Tecton (enterprise), Databricks FS, Vertex AI FS
  • Backfill: recompute from raw events with point-in-time correctness; automate if possible
  • Monitoring: freshness lag per feature, online/offline distribution drift, null/default rate

Tomorrow’s Preview

Day 47: ML Pipeline Architecture — Training pipelines vs inference pipelines, data versioning (DVC, lakeFS), experiment tracking, model registry, and how data engineers support ML teams at your target companies.