Phase 2: Deep Dives | Category: Security & Privacy

Why Privacy Engineering Is a Data Engineering Interview Topic

Privacy requirements are architecture requirements. GDPR’s “right to erasure” is a distributed systems engineering problem. CCPA’s data minimization shapes pipeline design. At OpenAI, Anthropic, and Meta, privacy is designed in — not bolted on.

The PII Spectrum: What You’re Protecting

Direct PII (unambiguously identifies individuals):

  • Name, email address, phone number
  • Social Security Number, passport number
  • Credit card number, bank account number
  • Biometric data (fingerprints, face recognition)
  • Precise geolocation (GPS coordinates)

Quasi-PII (can identify when combined):

  • Birth date + zip code + gender
  • IP address (GDPR treats as personal data)
  • Device fingerprint, cookie ID
  • Browsing history patterns

Sensitive PII (requires extra protection):

  • Health/medical data (HIPAA)
  • Race, ethnicity, religion
  • Sexual orientation, gender identity
  • Financial data, credit scores
  • Political opinions

Data engineering implication: classification must happen at ingestion. Every column in your schema should have a sensitivity classification.
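
One way to enforce this at ingestion is to reject any schema that contains unclassified columns. The sketch below is a minimal illustration, not a reference implementation; the `Sensitivity` levels and the `validate_schema` helper are hypothetical names chosen to mirror the categories above.

```python
from enum import Enum
from typing import Optional


class Sensitivity(Enum):
    """Hypothetical sensitivity levels mirroring the PII spectrum above."""
    NONE = "none"
    QUASI_PII = "quasi_pii"
    DIRECT_PII = "direct_pii"
    SENSITIVE_PII = "sensitive_pii"


def validate_schema(columns: dict[str, Optional[str]]) -> dict[str, Sensitivity]:
    """Parse column classifications, raising if any column lacks a valid one.

    `columns` maps column name -> classification label (None if missing).
    """
    valid_labels = {level.value for level in Sensitivity}
    bad = sorted(name for name, label in columns.items() if label not in valid_labels)
    if bad:
        raise ValueError(f"Unclassified or unknown columns: {bad}")
    return {name: Sensitivity(label) for name, label in columns.items()}
```

Running a check like this in CI or at the ingestion gateway means an unclassified column fails fast instead of silently landing in the lake.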

The Three Techniques: Masking, Tokenization, Encryption

These are frequently confused. The choice has significant implications for pipeline design.

Masking (Irreversible Substitution)

Replace real values with realistic but fictitious values. The original cannot be recovered.

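import hashlib
import hmac

def mask_email(email: str, secret_key: bytes) -> str:
    """Replace the local part with a keyed hash; same input + key gives the same output."""
    local, sep, domain = email.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("not a valid email address")
    # An unsalted MD5 of an email can be reversed by hashing candidate addresses,
    # so use HMAC-SHA256 with a secret key that stays inside the masking job.
    local_hash = hmac.new(secret_key, email.lower().encode(), hashlib.sha256).hexdigest()[:16]
    return f"user_{local_hash}@{domain}"  # keep domain for analytics (company distribution)

def mask_ssn(ssn: str) -> str:
    """Redact all but the last 4 digits of an SSN."""
    return f"XXX-XX-{ssn[-4:]}"  # keep last 4 for partial lookup support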

When to use masking:

  • Non-production environments (dev/staging/QA)
  • Analytics layers where individual identity doesn’t matter
  • Training data for ML models
  • When irreversibility is a feature (breach-safe)

Masking cannot support “show me user Alice’s data” after masking. For use cases requiring original data retrieval, use tokenization.

Tokenization (Reversible Substitution)

Replace real values with tokens. The original values are stored in a secure vault, and tokens can be de-tokenized only by authorized callers.

Real value: "4111-1111-1111-1111" (credit card)
Token:      "8743-5521-9067-1111" (format-preserving token: all digits, last four kept)
Vault:      { "8743-5521-9067-1111" → "4111-1111-1111-1111" } (access-controlled)

Pipeline:
  Card number → Tokenization service → token → Kafka → analytics pipeline
  When customer service needs to verify: token → vault API → original number

Format-preserving tokenization (FPT): token has the same format as the original. Critical when downstream systems validate schema/format.

When to use tokenization:

  • Payment card data (PCI-DSS)
  • Healthcare data that must be retrievable (HIPAA)
  • Any scenario where the original must sometimes be recovered (customer service, legal holds)
  • Cross-system joins where you need a consistent pseudonymous identifier across systems

Token vault security: the vault is the most sensitive component (encrypt, access control, replicate, audit). Many orgs use managed tokenization rather than building vaults.
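
To make the vault mechanics concrete, here is a toy in-memory sketch of tokenize and de-tokenize. It is illustrative only: `InMemoryTokenVault` and its `authorized` flag are hypothetical, and a real vault would add encryption at rest, proper access control, replication, and audit logging.

```python
import secrets


class InMemoryTokenVault:
    """Toy vault for illustration. NOT suitable for production use."""

    def __init__(self) -> None:
        self._token_to_value: dict[str, str] = {}
        self._value_to_token: dict[str, str] = {}

    def tokenize(self, card_number: str) -> str:
        """Return a consistent, format-preserving token for a card number."""
        if card_number in self._value_to_token:
            return self._value_to_token[card_number]  # same input -> same token
        digit_positions = [i for i, ch in enumerate(card_number) if ch.isdigit()]
        if len(digit_positions) <= 4:
            raise ValueError("need more than four digits to tokenize")
        keep = set(digit_positions[-4:])  # preserve the last four digits
        while True:
            # Randomize every other digit; separators stay where they are.
            token = "".join(
                str(secrets.randbelow(10)) if ch.isdigit() and i not in keep else ch
                for i, ch in enumerate(card_number)
            )
            if token != card_number and token not in self._token_to_value:
                break
        self._token_to_value[token] = card_number
        self._value_to_token[card_number] = token
        return token

    def detokenize(self, token: str, authorized: bool) -> str:
        """Return the original value; the flag stands in for a real authz check."""
        if not authorized:
            raise PermissionError("caller is not authorized to detokenize")
        return self._token_to_value[token]
```

Because the token is random rather than derived from the card number, stealing tokens from the analytics pipeline reveals nothing without access to the vault itself.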

Encryption (Reversible with Key)

Encrypt the original value; decrypt with the key when needed.

AES-256-GCM (authenticated encryption — common standard):

from cryptography.hazmat.primitives.ciphers.aead import AESGCM
import os

def encrypt_pii_field(plaintext: str, key: bytes) -> tuple[bytes, bytes]:
    """Encrypt a PII field. Returns (nonce, ciphertext)."""
    nonce = os.urandom(12)  # 96-bit nonce for GCM mode
    aesgcm = AESGCM(key)
    ciphertext = aesgcm.encrypt(nonce, plaintext.encode(), None)
    return nonce, ciphertext

def decrypt_pii_field(nonce: bytes, ciphertext: bytes, key: bytes) -> str:
    """Decrypt a PII field."""
    aesgcm = AESGCM(key)
    plaintext = aesgcm.decrypt(nonce, ciphertext, None)
    return plaintext.decode()

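# Usage in a pipeline (envelope encryption):
# resp = kms_client.generate_data_key(KeyId="arn:aws:kms:...", KeySpec="AES_256")
# key = resp["Plaintext"]  # plaintext DEK from KMS; keep it in memory only
# nonce, encrypted_email = encrypt_pii_field("alice@gmail.com", key)
# Store nonce + ciphertext + resp["CiphertextBlob"] (the KMS-encrypted DEK) in the DB;
# never persist the plaintext key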

When to use encryption:

  • When data must be stored securely but queries don’t need to run on the encrypted field
  • Full-column encryption for highly sensitive fields (SSN in regulated DB)
  • End-to-end encryption for data in transit

Encryption limitation for analytics: you can’t efficiently group/sort/join on encrypted values. That’s why masking/tokenization exist.

Decision Framework

Scenario | Technique | Why
Analytics layer: need email domain distribution | Masking (keep domain, hash local part) | Original not needed; domain preserved
Payment processing: need card number later | Tokenization (FPT) | Must retrieve original; format preserved
ML training data | Masking (irreversible) | Original never needed; breach-safe
Cross-system user identity | Tokenization (consistent token per user) | Same user = same token across systems
SSN stored in DB, never queried | Encryption (AES-256-GCM) | Secure storage; occasional decrypt for display
Data in transit (API, Kafka) | TLS + field-level encryption for ultra-sensitive fields | Defense in depth
Dev/staging environments | Masking | Production data never in non-prod

GDPR & CCPA: The Engineering Implications

These are architecture requirements.

GDPR (EU General Data Protection Regulation)

The 7 principles that drive DE decisions

Principle | What it means for your pipeline
Lawfulness, fairness, transparency | Log what you collect and why; provide privacy notices
Purpose limitation | Separate pipelines per consent; separate training vs analytics with explicit consent tracking
Data minimization | Collect only what you need; don’t SELECT * by default
Accuracy | Detect/correct inaccurate PII via data quality pipelines
Storage limitation | Retention policies are mandatory; delete when no longer needed
Integrity and confidentiality | Encryption, access control, audit logs
Accountability | Data lineage, audit trails, DPIAs

Right to erasure (Article 17): deletion is a distributed systems problem

When a user requests deletion, you must delete from every system that stores the data.

User "alice" requests deletion under GDPR Article 17

Systems that must delete:
  ✓ Operational DB (PostgreSQL) → DELETE WHERE user_id = 'alice-hash'
  ✓ Kafka topics (log compaction) → produce tombstone: key='alice-hash', value=null
  ✓ Bronze layer (Iceberg on S3) → row-level delete + compaction
  ✓ Silver layer (derived data) → delete rows where user_id = 'alice-hash'
  ✓ Gold layer (aggregates) → re-run aggregates excluding alice
  ✗ ML training datasets → models cannot be "untrained"
     → exclude from future training + periodic retraining
  ✓ Feature store → evict online features; remove from offline store
  ✓ Backups → restore process must re-apply deletions

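Timeline: GDPR requires a response without undue delay and within one month of the request (Article 12(3)), extendable by two further months for complex or numerous requests. Design for 7-day execution with buffer.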

ML training: standard approach is (1) deletion registry, (2) exclude deleted users from future training, (3) periodic retraining, (4) consider differential privacy to reduce memorization.
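
Steps (1) and (2) can be sketched as a registry lookup applied when a training set is built. This is a minimal illustration under stated assumptions: `DeletionRegistry` and `filter_training_rows` are hypothetical names, and a real registry would be a durable, audited table rather than an in-memory dict.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional


class DeletionRegistry:
    """Hypothetical in-memory registry of users who requested erasure."""

    def __init__(self) -> None:
        self._requested_at: dict[str, datetime] = {}

    def record(self, user_id: str, requested_at: Optional[datetime] = None) -> None:
        # Keep the first recorded request time if the user is recorded twice.
        self._requested_at.setdefault(user_id, requested_at or datetime.now(timezone.utc))

    def is_deleted(self, user_id: str) -> bool:
        return user_id in self._requested_at


def filter_training_rows(rows: Iterable[dict], registry: DeletionRegistry) -> list[dict]:
    """Drop rows that belong to users in the deletion registry."""
    return [row for row in rows if not registry.is_deleted(row["user_id"])]
```

Applying the filter at dataset-build time, rather than trusting upstream deletes to have completed, gives a second line of defense if one storage layer lags behind.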

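Data localization and international transfers (Articles 44-50)

EU personal data can’t be transferred outside the EU/EEA without safeguards such as an adequacy decision or standard contractual clauses. GDPR does not mandate localization outright, but the simplest engineering response is to store and process EU data in EU regions and avoid routing it through non-EU systems.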
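
A residency rule like this usually reduces to a routing decision made at ingestion. The sketch below assumes the event carries a reliable country code; the region labels and the `storage_region` helper are hypothetical placeholders, and treating the three extra EEA countries like EU members is an assumption you should confirm with legal counsel.

```python
# ISO 3166-1 alpha-2 codes for the 27 EU member states.
EU_MEMBER_STATES = frozenset({
    "AT", "BE", "BG", "HR", "CY", "CZ", "DK", "EE", "FI", "FR", "DE", "GR", "HU", "IE",
    "IT", "LV", "LT", "LU", "MT", "NL", "PL", "PT", "RO", "SK", "SI", "ES", "SE",
})
# Assumption: Iceland, Liechtenstein, and Norway (EEA) are routed like EU members.
EEA_EXTRA = frozenset({"IS", "LI", "NO"})

# Hypothetical region labels; substitute your cloud provider's identifiers.
EU_REGION = "eu-region"
DEFAULT_REGION = "default-region"


def storage_region(country_code: str) -> str:
    """Pick the storage/processing region for a user's country code."""
    code = country_code.strip().upper()
    if code in EU_MEMBER_STATES or code in EEA_EXTRA:
        return EU_REGION
    return DEFAULT_REGION
```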

CCPA (California Consumer Privacy Act)

Key rights affecting pipelines:

  • Right to know: what data collected about me → requires inventory + queryable personal data
  • Right to delete: similar to GDPR erasure
  • Right to opt-out of sale: consent flags must flow through pipelines
  • Right to non-discrimination

CCPA opt-out enforcement:

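def should_include_in_analytics(user_id: str, data_use: str) -> bool:
    # consent_service is assumed to be an existing client for the consent store;
    # cache lookups in a real stream to avoid one remote call per event
    consent = consent_service.get_consent(user_id)
    return consent.allows(data_use)

# Branch both streams from the unfiltered source so each purpose is checked independently
analytics_stream = events_stream.filter(
    lambda e: should_include_in_analytics(e.user_id, "analytics")
)

shared_data_stream = events_stream.filter(
    lambda e: should_include_in_analytics(e.user_id, "data_sharing")
)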

Privacy by Design: Seven Principles Applied to Data Engineering

Proactive, not reactive: design privacy in at day zero.

Privacy as default: new fields → assume PII unless proven otherwise. New tables → restricted by default. New pipelines → don’t over-collect.

Schema contract example embedding privacy:

schema:
  - name: user_id
    type: STRING
    pii_classification: PSEUDONYMOUS

  - name: email
    type: STRING
    pii_classification: DIRECT_PII
    data_handling:
      collection_purpose: account_authentication
      allowed_uses: [authentication, account_recovery]
      not_allowed: [analytics, ml_training, sharing]
      retention_days: 365
      masking_in_analytics: domain_only

  - name: ip_address
    type: STRING
    pii_classification: QUASI_PII
    data_handling:
      retention_days: 30
      masking_in_analytics: subnet_only
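
A contract like this only helps if pipelines check it. The sketch below shows one possible deny-by-default check against the `email` entry above, written as a plain dict so it runs without a YAML parser; `is_use_allowed` is a hypothetical helper, not part of any schema-registry API.

```python
# The email field from the contract above, expressed as a Python dict.
EMAIL_CONTRACT = {
    "name": "email",
    "type": "STRING",
    "pii_classification": "DIRECT_PII",
    "data_handling": {
        "collection_purpose": "account_authentication",
        "allowed_uses": ["authentication", "account_recovery"],
        "not_allowed": ["analytics", "ml_training", "sharing"],
        "retention_days": 365,
    },
}


def is_use_allowed(field_contract: dict, use: str) -> bool:
    """Deny by default: a use is permitted only if explicitly listed as allowed."""
    handling = field_contract.get("data_handling", {})
    if use in handling.get("not_allowed", []):
        return False
    return use in handling.get("allowed_uses", [])
```

A pipeline declaring its purpose as `ml_training` would be refused access to `email`, and so would any purpose the contract has never heard of.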

Visibility and transparency: lineage shows every consumer of PII. When a user asks “where is my data?” you can answer quickly.
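
Answering "where is my data?" is a graph traversal over lineage metadata. The sketch below assumes lineage is available as an adjacency mapping from each dataset to its direct consumers; the dataset names in the usage note and the `downstream_consumers` helper are hypothetical.

```python
from collections import deque


def downstream_consumers(lineage: dict[str, list[str]], source: str) -> set[str]:
    """Return every dataset reachable from `source` via breadth-first search."""
    seen: set[str] = set()
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    seen.discard(source)  # a cycle should not report the source as its own consumer
    return seen
```

For example, with `{"bronze.users": ["silver.users"], "silver.users": ["gold.signups", "features.user_profile"]}`, calling `downstream_consumers(lineage, "bronze.users")` returns the silver table, the gold aggregate, and the feature table, which is the list a deletion job needs to visit.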

Retention Policies: Engineering Implementation

Retention is mandatory to implement storage limitation and data minimization.

Iceberg retention example:

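# Table settings: 30-day max snapshot age (2592000000 ms); deletes are written as merge-on-read delete files
spark.sql("""
    ALTER TABLE silver.user_events
    SET TBLPROPERTIES (
        'history.expire.max-snapshot-age-ms' = '2592000000',
        'write.delete.mode' = 'merge-on-read'
    )
""")

# 1. Delete rows past the retention window (a logical delete under merge-on-read)
spark.sql("""
    DELETE FROM silver.user_events
    WHERE event_date < date_sub(current_date(), 30)
""")

# 2. Rewrite data files that have deletes so the rows are physically dropped from the new files
spark.sql("""
    CALL system.rewrite_data_files(
        table => 'silver.user_events',
        options => map('delete-file-threshold', '1')
    )
""")

# 3. Expire old snapshots so files that still contain the deleted rows can be removed
#    (the timestamp is a placeholder; compute it from the retention window)
spark.sql("""
    CALL system.expire_snapshots(
        table => 'silver.user_events',
        older_than => TIMESTAMP '2026-03-13',
        retain_last => 1
    )
""")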

Retention tiers in practice:

Data category | Retention | Enforcement
Raw PII (email, phone, SSN) | 30 days | Iceberg expire_snapshots + rewrite_data_files
Pseudonymized event data | 2 years | Partition expiry by date
Aggregated analytics (no PII) | 7 years | Standard compliance retention
Audit logs (who accessed what) | 7 years | Immutable, append-only storage
ML training datasets with PII | Excluded from training | Deletion registry enforced
Backup/DR copies | Must comply | Restore process re-applies deletions
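
The time-based tiers above can be driven from a single policy map so that every cleanup job computes its cutoff the same way. This is a sketch under stated assumptions: the category keys and helper names are hypothetical, and years are approximated as 365 days.

```python
from datetime import date, timedelta

# Retention windows in days, taken from the tiers above (years approximated as 365 days).
RETENTION_DAYS = {
    "raw_pii": 30,
    "pseudonymized_events": 2 * 365,
    "aggregated_analytics": 7 * 365,
    "audit_logs": 7 * 365,
}


def retention_cutoff(category: str, today: date) -> date:
    """Oldest record date still inside the retention window for a category."""
    return today - timedelta(days=RETENTION_DAYS[category])


def is_expired(category: str, record_date: date, today: date) -> bool:
    """True if a record is older than its category's retention window."""
    return record_date < retention_cutoff(category, today)
```

Passing `today` explicitly, instead of calling `date.today()` inside the helpers, keeps the policy deterministic and easy to test.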

Interview Questions

Q1: Mobile app events with IPs + device IDs + user IDs. How to comply with GDPR?

Model answer:

  • At ingestion: treat IP as personal data; truncate to the /24 subnet (e.g., 192.168.1.100 → 192.168.1.0) at the edge, which still supports coarse network analytics.
  • Device ID: pseudonymize with consistent tokenization.
  • User ID: ensure it’s a stable pseudonymous identifier (not reversible to direct identity).
  • Storage: bronze layer retains raw for 30 days max; silver/gold never see raw IP/device identifiers.
  • Consent: consent flags travel with events; branch or filter pipelines at entry.
  • Erasure: deletion job runs within 7 days across bronze/silver/gold and feature stores; immutable audit record proves compliance.
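
The first two bullets can be sketched with the Python standard library. The /24 prefix follows the example above; the /48 prefix for IPv6 and the keyed-hash approach to device IDs are assumptions for illustration, and `truncate_ip` and `pseudonymize_device_id` are hypothetical helper names.

```python
import hashlib
import hmac
import ipaddress


def truncate_ip(ip: str) -> str:
    """Zero the host bits: /24 for IPv4, /48 for IPv6 (the IPv6 prefix is an assumption)."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(network.network_address)


def pseudonymize_device_id(device_id: str, secret_key: bytes) -> str:
    """Consistent pseudonym: same device + same key always yields the same value."""
    return hmac.new(secret_key, device_id.encode(), hashlib.sha256).hexdigest()
```

Because the pseudonym depends on a secret key, rotating or destroying the key is also a way to break linkability for historical data.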

Q2: Data scientist wants ML training on user data. What privacy considerations?

Model answer:

  • Consent + purpose alignment: exclude users opted out of ML use.
  • Data minimization + de-identification: use derived features from pseudonymized data; avoid raw email/phone.
  • Erasure compliance: enforce deletion registry; exclude deleted users from future training; periodic retraining; consider differential privacy (DP-SGD) for sensitive training.
  • Auditability: dataset versions include consent snapshot date, de-id method, and deletion-exclusion list applied.

Think About This

Meta prompt: a user in Germany requests deletion (GDPR Article 17). Design the deletion pipeline at petabyte scale.

Walk through:

  • Systems covered: operational DBs, Kafka, lake/warehouse, caches, search indexes, ML datasets, third-party exports
  • Architecture: central deletion registry {user_id, requested_at} → deletion events topic → per-system deletion handlers → audit confirmations
  • ML: enforce deletion exclusion list for future training; retrain periodically; DP training reduces memorization risk
  • Multi-region: propagate to all regional replicas; EU residency deletes first
  • Proof: central audit report showing per-system completion, timestamps, and method
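
The registry → handlers → audit flow can be sketched as a fan-out that records one audit entry per system. This is a simplified single-process illustration: in practice each handler would consume from the deletion events topic, and `AuditRecord`, `run_deletion`, and `pending_systems` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass
class AuditRecord:
    user_id: str
    system: str
    status: str  # "deleted" or "failed"
    completed_at: datetime


def run_deletion(user_id: str, handlers: dict[str, Callable[[str], None]]) -> list[AuditRecord]:
    """Invoke every per-system handler and record the outcome of each."""
    audit: list[AuditRecord] = []
    for system, handler in handlers.items():
        try:
            handler(user_id)
            status = "deleted"
        except Exception:
            status = "failed"  # keep going; failed systems are retried later
        audit.append(AuditRecord(user_id, system, status, datetime.now(timezone.utc)))
    return audit


def pending_systems(audit: list[AuditRecord]) -> list[str]:
    """Systems that still need a retry before the request can be closed."""
    return [record.system for record in audit if record.status != "deleted"]
```

One failing handler does not block the others, and the request is only reported complete once `pending_systems` returns an empty list.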

Quick Reference

  • Masking: irreversible, analytics-safe
  • Tokenization: reversible via vault, supports joins and retrieval
  • Encryption: reversible with key, not analytics-friendly
  • Consent is first-class data; enforce at pipeline entry
  • Right-to-erasure: registry → events → handlers → audit confirmations
  • Retention policies must be automated

Tomorrow’s Preview

Day 55: Cost Optimization for Data Platforms — cloud cost drivers (compute, storage, network), spot/preemptible, tiered storage, query optimization for cost, and chargeback models.