Day 54 — PII Handling & Privacy Engineering

Phase 2: Deep Dives | Category: Security & Privacy

Why Privacy Engineering Is a Data Engineering Interview Topic

Privacy requirements are architecture requirements. GDPR’s “right to erasure” is a distributed systems problem: a deletion has to reach every copy of a user’s data. CCPA’s data minimization shapes what a pipeline collects in the first place. At OpenAI, Anthropic, and Meta, privacy is designed in, not bolted on.

The PII Spectrum: What You’re Protecting

Direct PII (unambiguously identifies individuals):

Name, email address, phone number
Social Security Number, passport number
Credit card number, bank account number
Biometric data (fingerprints, face recognition)
Precise geolocation (GPS coordinates)

Quasi-PII (can identify when combined):

Birth date + zip code + gender
IP address (GDPR treats as personal data)
Device fingerprint, cookie ID
Browsing history patterns

Sensitive PII (requires extra protection):

Health/medical data (HIPAA)
Race, ethnicity, religion
Sexual orientation, gender identity
Financial data, credit scores
Political opinions

Data engineering implication: classification must happen at ingestion. Every column in your schema should have a sensitivity classification.
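
One way to picture classification at ingestion is a registry that every incoming column must appear in. The sketch below is illustrative only: the `Sensitivity` levels, the `COLUMN_CLASSIFICATION` registry, and both function names are assumptions made for this example, not part of any particular library.

```python
from enum import Enum


class Sensitivity(Enum):
    NONE = "none"
    QUASI_PII = "quasi_pii"
    DIRECT_PII = "direct_pii"
    SENSITIVE_PII = "sensitive_pii"


# Hypothetical registry: one entry per column the pipeline is allowed to ingest.
COLUMN_CLASSIFICATION = {
    "email": Sensitivity.DIRECT_PII,
    "ip_address": Sensitivity.QUASI_PII,
    "health_condition": Sensitivity.SENSITIVE_PII,
    "event_type": Sensitivity.NONE,
}


def unclassified_columns(record: dict) -> list[str]:
    """Return the columns in a record that have no sensitivity classification."""
    return sorted(col for col in record if col not in COLUMN_CLASSIFICATION)


def validate_at_ingestion(record: dict) -> dict:
    """Reject records that carry columns nobody has classified yet."""
    missing = unclassified_columns(record)
    if missing:
        raise ValueError(f"unclassified columns: {missing}")
    return record
```

Failing closed on unknown columns means a new field cannot reach storage until someone has decided how sensitive it is.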

The Three Techniques: Masking, Tokenization, Encryption

These are frequently confused. The choice has significant implications for pipeline design.

Masking (Irreversible Substitution)

Replace real values with realistic but fictitious values. The original cannot be recovered.

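import hashlib
import hmac

def mask_email(email: str, secret_key: bytes) -> str:
    """Replace the local part with a keyed hash; the same input always yields the same mask."""
    # A bare MD5 of the email can be reversed by hashing a list of known addresses.
    # A keyed HMAC-SHA256 resists that as long as secret_key stays out of the analytics layer.
    digest = hmac.new(secret_key, email.lower().encode(), hashlib.sha256).hexdigest()[:8]
    domain = email.rsplit("@", 1)[1]  # keep domain for analytics; raises IndexError if "@" is missing
    return f"user_{digest}@{domain}"

def mask_ssn(ssn: str) -> str:
    """Redact all but the last four digits of an SSN."""
    return f"XXX-XX-{ssn[-4:]}"  # keep last 4 for partial lookup support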

When to use masking:

Non-production environments (dev/staging/QA)
Analytics layers where individual identity doesn’t matter
Training data for ML models
When irreversibility is a feature (breach-safe)

Masking cannot support “show me user Alice’s data” after masking. For use cases requiring original data retrieval, use tokenization.

Tokenization (Reversible Substitution)

Replace real values with tokens. Original values stored in a secure vault. Tokens can be de-tokenized when authorized.

Real value: "4111-1111-1111-1111" (credit card)
Token:      "8743-2096-5518-1111" (format-preserving token; last four digits kept)
Vault:      { "8743-2096-5518-1111" → "4111-1111-1111-1111" } (access-controlled)

Pipeline:
  Card number → Tokenization service → token → Kafka → analytics pipeline
  When customer service needs to verify: token → vault API → original number

Format-preserving tokenization (FPT): token has the same format as the original. Critical when downstream systems validate schema/format.

When to use tokenization:

Payment card data (PCI-DSS)
Healthcare data that must be retrievable (HIPAA)
Any scenario where the original must sometimes be recovered (customer service, legal holds)
Cross-system joins where you need a consistent pseudonymous identifier across systems

Token vault security: the vault is the most sensitive component (encrypt, access control, replicate, audit). Many orgs use managed tokenization rather than building vaults.
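
To make the vault idea concrete, here is a toy, in-memory sketch. It assumes random tokens that keep the separators and last four digits of the original; the class and method names are invented for illustration, and a real vault would be an encrypted, access-controlled, audited service rather than a Python dict.

```python
import secrets


class InMemoryTokenVault:
    """Toy vault mapping tokens to originals. Illustration only, not production-safe."""

    def __init__(self) -> None:
        self._token_to_value: dict[str, str] = {}
        self._value_to_token: dict[str, str] = {}

    def _random_token(self, value: str) -> str:
        # Randomize every digit except the last four; keep separators in place.
        total_digits = sum(ch.isdigit() for ch in value)
        seen = 0
        out = []
        for ch in value:
            if ch.isdigit():
                seen += 1
                keep = seen > total_digits - 4
                out.append(ch if keep else str(secrets.randbelow(10)))
            else:
                out.append(ch)
        return "".join(out)

    def tokenize(self, value: str) -> str:
        """Return a consistent token for a value, creating one on first sight."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        while True:
            token = self._random_token(value)
            # Retry on the rare collision with an existing token or the value itself.
            if token != value and token not in self._token_to_value:
                break
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str, caller_is_authorized: bool) -> str:
        """Return the original value, but only for authorized callers."""
        if not caller_is_authorized:
            raise PermissionError("caller may not detokenize")
        return self._token_to_value[token]
```

Because the same value always maps to the same token, downstream systems can join on the token without ever seeing the card number.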

Encryption (Reversible with Key)

Encrypt the original value; decrypt with the key when needed.

AES-256-GCM (authenticated encryption — common standard):

from cryptography.hazmat.primitives.ciphers.aead import AESGCM
import os

def encrypt_pii_field(plaintext: str, key: bytes) -> tuple[bytes, bytes]:
    """Encrypt a PII field. Returns (nonce, ciphertext)."""
    nonce = os.urandom(12)  # 96-bit nonce for GCM mode
    aesgcm = AESGCM(key)
    ciphertext = aesgcm.encrypt(nonce, plaintext.encode(), None)
    return nonce, ciphertext

def decrypt_pii_field(nonce: bytes, ciphertext: bytes, key: bytes) -> str:
    """Decrypt a PII field."""
    aesgcm = AESGCM(key)
    plaintext = aesgcm.decrypt(nonce, ciphertext, None)
    return plaintext.decode()

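# Usage in a pipeline (envelope encryption; kms_client is a boto3 KMS client):
# resp = kms_client.generate_data_key(KeyId="arn:aws:kms:...", KeySpec="AES_256")
# key = resp["Plaintext"]  # plaintext DEK: keep in memory only, never persist it
# nonce, encrypted_email = encrypt_pii_field("alice@gmail.com", key)
# Store nonce + ciphertext + resp["CiphertextBlob"] (the KMS-encrypted DEK) in the DB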

When to use encryption:

When data must be stored securely but queries don’t need to run on the encrypted field
Full-column encryption for highly sensitive fields (SSN in regulated DB)
End-to-end encryption for data in transit

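Encryption limitation for analytics: with a random nonce, the same plaintext encrypts to a different ciphertext each time, so you can’t group, sort, or join on encrypted values without decrypting them first. That’s why masking and tokenization exist.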

Decision Framework

Scenario	Technique	Why
Analytics layer — need email domain distribution	Masking (keep domain, hash local part)	Original not needed; domain preserved
Payment processing — need card number later	Tokenization (FPT)	Must retrieve original; format preserved
ML training data	Masking (irreversible)	Original never needed; breach-safe
Cross-system user identity	Tokenization (consistent token per user)	Same user = same token across systems
SSN stored in DB, never queried	Encryption (AES-256-GCM)	Secure storage; occasional decrypt for display
Data in transit (API, Kafka)	TLS + field-level encryption for ultra-sensitive	Defense in depth
Dev/staging environments	Masking	Production data never in non-prod

GDPR & CCPA: The Engineering Implications

These regulations are architecture requirements.

GDPR (EU General Data Protection Regulation)

The 7 principles that drive DE decisions

Principle	What it means for your pipeline
Lawfulness, fairness, transparency	Log what you collect and why; provide privacy notices
Purpose limitation	Separate pipelines per purpose; keep training and analytics flows apart, each with explicit consent tracking
Data minimization	Collect only what you need; don’t SELECT * by default
Accuracy	Detect/correct inaccurate PII via data quality pipelines
Storage limitation	Retention policies are mandatory — delete when no longer needed
Integrity and confidentiality	Encryption, access control, audit logs
Accountability	Data lineage, audit trails, DPIAs

Right to erasure (Article 17): deletion is a distributed systems problem

When a user requests deletion, you must delete from every system that stores the data.

User "alice" requests deletion under GDPR Article 17

Systems that must delete:
  ✓ Operational DB (PostgreSQL) → DELETE WHERE user_id = 'alice-hash'
  ✓ Kafka topics (log compaction) → produce tombstone: key='alice-hash', value=null
  ✓ Bronze layer (Iceberg on S3) → row-level delete + compaction
  ✓ Silver layer (derived data) → delete rows where user_id = 'alice-hash'
  ✓ Gold layer (aggregates) → re-run aggregates excluding alice
  ✗ ML training datasets → models cannot be "untrained"
     → exclude from future training + periodic retraining
  ✓ Feature store → evict online features; remove from offline store
  ✓ Backups → restore process must re-apply deletions

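Timeline: GDPR requires a response without undue delay and within one month of the request (extendable by two further months for complex or numerous requests). Design for 7-day execution with buffer.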

ML training: standard approach is (1) deletion registry, (2) exclude deleted users from future training, (3) periodic retraining, (4) consider differential privacy to reduce memorization.
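
The fan-out described in this section can be sketched as a registry that records each request, calls one handler per system, and keeps a confirmation timestamp for the audit trail. Everything here is an assumption for illustration: the class names, the in-memory storage, and the handler signature (a callable taking a `user_id`). Real handlers would issue the SQL deletes, Kafka tombstones, and Iceberg row-level deletes listed above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass
class DeletionRequest:
    user_id: str
    requested_at: datetime


@dataclass
class DeletionRegistry:
    """Tracks requests and per-system confirmations (in memory, for illustration)."""

    handlers: dict[str, Callable[[str], None]]
    requests: dict[str, DeletionRequest] = field(default_factory=dict)
    confirmations: dict[str, dict[str, datetime]] = field(default_factory=dict)

    def request_deletion(self, user_id: str) -> None:
        now = datetime.now(timezone.utc)
        self.requests[user_id] = DeletionRequest(user_id, now)
        self.confirmations.setdefault(user_id, {})

    def run_handlers(self, user_id: str) -> list[str]:
        """Call each system's handler; return the systems that failed and need a retry."""
        failed = []
        for system, handler in self.handlers.items():
            if system in self.confirmations[user_id]:
                continue  # already confirmed on an earlier run
            try:
                handler(user_id)
                self.confirmations[user_id][system] = datetime.now(timezone.utc)
            except Exception:
                failed.append(system)
        return failed

    def is_deleted(self, user_id: str) -> bool:
        """True once a deletion request is on file; training jobs can filter on this."""
        return user_id in self.requests
```

Re-running `run_handlers` until it returns an empty list gives at-least-once delivery, which is why each handler should be idempotent.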

Data localization (Article 44)

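Under GDPR Chapter V (Articles 44 to 50), EU personal data can’t be transferred outside the EEA unless the destination has an adequacy decision or safeguards such as Standard Contractual Clauses are in place. Engineering implication: store and process EU data in EU regions where you can, and avoid routing it through non-EU systems.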
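
A minimal sketch of region pinning follows, assuming the user's country of residence is known as an ISO 3166-1 alpha-2 code. The region names and function names are placeholders invented for this example, and residency by country code is a simplification of what a legal team would actually require.

```python
# EU member states plus Iceland, Liechtenstein, and Norway (the EEA).
EEA_COUNTRIES = {
    "AT", "BE", "BG", "HR", "CY", "CZ", "DK", "EE", "FI", "FR",
    "DE", "GR", "HU", "IE", "IT", "LV", "LT", "LU", "MT", "NL",
    "PL", "PT", "RO", "SK", "SI", "ES", "SE", "IS", "LI", "NO",
}

# Hypothetical region identifiers; substitute whatever your cloud provider uses.
EU_REGION = "eu-west-1"
DEFAULT_REGION = "us-east-1"


def storage_region_for(country_code: str) -> str:
    """Pick the storage region for a user's data based on country of residence."""
    return EU_REGION if country_code.upper() in EEA_COUNTRIES else DEFAULT_REGION


def assert_write_allowed(country_code: str, target_region: str) -> None:
    """Fail loudly if EEA data is about to be written outside the EU region."""
    if country_code.upper() in EEA_COUNTRIES and target_region != EU_REGION:
        raise ValueError(f"EEA data must not be written to {target_region}")
```

A guard like `assert_write_allowed` at the sink catches misrouted records that a routing function alone would not.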

CCPA (California Consumer Privacy Act)

Key rights affecting pipelines:

Right to know: what data collected about me → requires inventory + queryable personal data
Right to delete: similar to GDPR erasure
Right to opt-out of sale or sharing: consent flags must flow through pipelines
Right to non-discrimination

CCPA opt-out enforcement:

# consent_service and events_stream are assumed to be provided by the pipeline runtime.
def should_include_in_analytics(user_id: str, data_use: str) -> bool:
    consent = consent_service.get_consent(user_id)
    return consent.allows(data_use)

# Filter each branch from the unfiltered source so one consent does not gate the other.
analytics_stream = events_stream.filter(
    lambda e: should_include_in_analytics(e.user_id, "analytics")
)

shared_data_stream = events_stream.filter(
    lambda e: should_include_in_analytics(e.user_id, "data_sharing")
)

Privacy by Design: Seven Principles Applied to Data Engineering

Proactive, not reactive: design privacy in at day zero.

Privacy as default: new fields → assume PII unless proven otherwise. New tables → restricted by default. New pipelines → don’t over-collect.

Schema contract example embedding privacy:

schema:
  - name: user_id
    type: STRING
    pii_classification: PSEUDONYMOUS

  - name: email
    type: STRING
    pii_classification: DIRECT_PII
    data_handling:
      collection_purpose: account_authentication
      allowed_uses: [authentication, account_recovery]
      not_allowed: [analytics, ml_training, sharing]
      retention_days: 365
      masking_in_analytics: domain_only

  - name: ip_address
    type: STRING
    pii_classification: QUASI_PII
    data_handling:
      retention_days: 30
      masking_in_analytics: subnet_only

Visibility and transparency: lineage shows every consumer of PII. When a user asks “where is my data?” you can answer quickly.
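
A contract like the one above only helps if something enforces it. The sketch below mirrors two of the YAML fields as a Python dict (to stay free of a YAML dependency) and denies any use that is not explicitly allowed. The `CONTRACT` structure and function names are assumptions for illustration.

```python
# Hand-written mirror of the schema contract; in practice this would be loaded from the YAML.
CONTRACT = {
    "email": {
        "pii_classification": "DIRECT_PII",
        "allowed_uses": {"authentication", "account_recovery"},
    },
    "ip_address": {
        "pii_classification": "QUASI_PII",
        # No allowed_uses listed in the contract, so every use is denied by default.
    },
}


def is_use_allowed(column: str, use: str) -> bool:
    """Privacy as default: unknown columns and unlisted uses are denied."""
    handling = CONTRACT.get(column)
    if handling is None:
        return False
    return use in handling.get("allowed_uses", set())


def columns_for_use(columns: list[str], use: str) -> list[str]:
    """Project a column list down to what the contract permits for this use."""
    return [col for col in columns if is_use_allowed(col, use)]
```

Running `columns_for_use` at the start of a job turns the contract into a projection, so a pipeline physically cannot read a column its purpose does not cover.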

Retention Policies: Engineering Implementation

Retention is mandatory to implement storage limitation and data minimization.

Iceberg retention example:

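# 1. Table settings: snapshot expiry window (30 days in ms) and row-level delete mode.
spark.sql("""
    ALTER TABLE silver.user_events
    SET TBLPROPERTIES (
        'history.expire.max-snapshot-age-ms' = '2592000000',
        'write.delete.mode' = 'merge-on-read'
    )
""")

# 2. Logically delete rows past retention. Compaction alone never removes rows.
spark.sql("""
    DELETE FROM silver.user_events
    WHERE event_date < date_sub(current_date(), 30)
""")

# 3. Rewrite data files that carry delete files so the rows are physically dropped.
spark.sql("""
    CALL system.rewrite_data_files(
        table => 'silver.user_events',
        options => map('delete-file-threshold', '1')
    )
""")

# 4. Expire old snapshots so files from before the delete become unreachable and are removed.
#    older_than is hard-coded here; a scheduled job would compute it from the run date.
spark.sql("""
    CALL system.expire_snapshots(
        table => 'silver.user_events',
        older_than => TIMESTAMP '2026-03-13',
        retain_last => 1
    )
""")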

Retention tiers in practice:

Data category	Retention	Enforcement
Raw PII (email, phone, SSN)	30 days	Iceberg `DELETE` + `rewrite_data_files` + `expire_snapshots`
Pseudonymized event data	2 years	Partition expiry by date
Aggregated analytics (no PII)	7 years	Standard compliance retention
Audit logs (who accessed what)	7 years	Immutable, append-only storage
ML training datasets with PII	Excluded from training	Deletion registry enforced
Backup/DR copies	Must comply	Restore process re-applies deletions
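
As a hedged sketch of how tiers like these might drive a scheduled job, the function below works out which date partitions have aged past their retention window. The category keys, the day counts (7 years approximated as 2555 days), and the function name are assumptions for this example.

```python
from datetime import date, timedelta

# Hypothetical retention windows in days, keyed by data category.
RETENTION_DAYS = {
    "raw_pii": 30,
    "pseudonymized_events": 730,   # 2 years
    "aggregates": 2555,            # roughly 7 years
}


def expired_partitions(category: str, partition_dates: list[date], today: date) -> list[date]:
    """Return the partitions strictly older than the category's retention cutoff."""
    cutoff = today - timedelta(days=RETENTION_DAYS[category])
    return sorted(d for d in partition_dates if d < cutoff)


partitions = [date(2026, 3, 1), date(2026, 3, 13), date(2026, 4, 1)]
to_drop = expired_partitions("raw_pii", partitions, today=date(2026, 4, 12))
print(to_drop)  # only 2026-03-01 is older than the 2026-03-13 cutoff
```

Passing `today` explicitly keeps the function deterministic, which makes retention logic easy to test and to replay for an audit.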

Interview Questions

Q1: Mobile app events with IPs + device IDs + user IDs. How to comply with GDPR?

Model answer:

At ingestion: treat IP as personal data; truncate to the /24 subnet at the edge (e.g., 192.168.1.100 → 192.168.1.0) so network-level analytics still work without storing the full address.
Device ID: pseudonymize with consistent tokenization.
User ID: ensure it’s a stable pseudonymous identifier (not reversible to direct identity).
Storage: bronze layer retains raw for 30 days max; silver/gold never see raw IP/device identifiers.
Consent: consent flags travel with events; branch or filter pipelines at entry.
Erasure: deletion job runs within 7 days across bronze/silver/gold and feature stores; immutable audit record proves compliance.
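
The first two steps of this answer can be sketched with the standard library alone. The /24 prefix comes from the example above; the /48 prefix for IPv6 and the use of HMAC-SHA256 for device IDs are assumptions chosen for this sketch, and the function names are illustrative.

```python
import hashlib
import hmac
import ipaddress


def truncate_ip(ip: str) -> str:
    """Zero the host bits: /24 for IPv4, /48 for IPv6 (the IPv6 prefix is an assumption)."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


def pseudonymize_device_id(device_id: str, secret_key: bytes) -> str:
    """Keyed hash: stable per device, not reversible without the key."""
    return hmac.new(secret_key, device_id.encode(), hashlib.sha256).hexdigest()


print(truncate_ip("192.168.1.100"))  # 192.168.1.0
```

Keeping the HMAC key in a secrets manager, outside the analytics environment, is what stops someone from rebuilding the mapping by hashing known device IDs.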

Q2: Data scientist wants ML training on user data. What privacy considerations?

Model answer:

Consent + purpose alignment: exclude users opted out of ML use.
Data minimization + de-identification: use derived features from pseudonymized data; avoid raw email/phone.
Erasure compliance: enforce deletion registry; exclude deleted users from future training; periodic retraining; consider differential privacy (DP-SGD) for sensitive training.
Auditability: dataset versions include consent snapshot date, de-id method, and deletion-exclusion list applied.
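
The exclusion and auditability points can be combined in one small step when a training set is built. This is a sketch under assumptions: rows are dicts with a `user_id` key, the two ID sets come from a deletion registry and a consent service that are not shown, and the manifest fields are invented for illustration.

```python
def build_training_rows(
    rows: list[dict],
    deleted_user_ids: set[str],
    ml_opted_out_user_ids: set[str],
) -> tuple[list[dict], dict]:
    """Drop rows for deleted or opted-out users and return a manifest for the audit trail."""
    excluded = deleted_user_ids | ml_opted_out_user_ids
    kept = [row for row in rows if row["user_id"] not in excluded]
    present_ids = {row["user_id"] for row in rows}
    manifest = {
        "rows_in": len(rows),
        "rows_kept": len(kept),
        # Only list exclusions that actually matched a row in this dataset.
        "excluded_users": sorted(excluded & present_ids),
    }
    return kept, manifest
```

Storing the manifest next to each dataset version gives a record of which exclusions were applied when a given model was trained.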

Think About This

Meta prompt: a user in Germany requests deletion (GDPR Article 17). Design the deletion pipeline at petabyte scale.

Walk through:

Systems covered: operational DBs, Kafka, lake/warehouse, caches, search indexes, ML datasets, third-party exports
Architecture: central deletion registry {user_id, requested_at} → deletion events topic → per-system deletion handlers → audit confirmations
ML: enforce deletion exclusion list for future training; retrain periodically; DP training reduces memorization risk
Multi-region: propagate to all regional replicas; EU residency deletes first
Proof: central audit report showing per-system completion, timestamps, and method
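
For the proof step, a report can be derived from the per-system confirmations. The sketch below assumes confirmations arrive as a mapping from system name to completion timestamp; the system names follow the list above, the 7-day internal target comes from earlier in this lesson, and the function and field names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Systems that must confirm before a deletion counts as complete (names are illustrative).
REQUIRED_SYSTEMS = (
    "operational_db", "kafka", "lake", "cache",
    "search_index", "ml_datasets", "third_party_exports",
)
INTERNAL_SLA = timedelta(days=7)


def deletion_report(
    requested_at: datetime,
    confirmations: dict[str, datetime],
    now: datetime,
) -> dict:
    """Summarize which systems have confirmed and whether the internal SLA still holds."""
    pending = [s for s in REQUIRED_SYSTEMS if s not in confirmations]
    complete = not pending
    if complete:
        elapsed = max(confirmations[s] for s in REQUIRED_SYSTEMS) - requested_at
    else:
        elapsed = now - requested_at  # still running: measure against the clock
    return {
        "complete": complete,
        "pending_systems": pending,
        "within_internal_sla": elapsed <= INTERNAL_SLA,
    }
```

A report that is incomplete and already past the internal SLA is the signal to page someone well before the legal deadline is at risk.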

Quick Reference

Masking: irreversible, analytics-safe
Tokenization: reversible via vault, supports joins and retrieval
Encryption: reversible with key, not analytics-friendly
Consent is first-class data; enforce at pipeline entry
Right-to-erasure: registry → events → handlers → audit confirmations
Retention policies must be automated

Tomorrow’s Preview

Day 55: Cost Optimization for Data Platforms — cloud cost drivers (compute, storage, network), spot/preemptible, tiered storage, query optimization for cost, and chargeback models.
