Phase 2: Deep Dives | Category: Security & Privacy
Why Privacy Engineering Is a Data Engineering Interview Topic
Privacy requirements are architecture requirements. GDPR’s “right to erasure” is a distributed systems engineering problem. CCPA’s data minimization shapes pipeline design. At OpenAI, Anthropic, and Meta, privacy is designed in — not bolted on.
The PII Spectrum: What You’re Protecting
Direct PII (unambiguously identifies individuals):
- Name, email address, phone number
- Social Security Number, passport number
- Credit card number, bank account number
- Biometric data (fingerprints, face recognition)
- Precise geolocation (GPS coordinates)
Quasi-PII (can identify when combined):
- Birth date + zip code + gender
- IP address (GDPR treats as personal data)
- Device fingerprint, cookie ID
- Browsing history patterns
Sensitive PII (requires extra protection):
- Health/medical data (HIPAA)
- Race, ethnicity, religion
- Sexual orientation, gender identity
- Financial data, credit scores
- Political opinions
Data engineering implication: classification must happen at ingestion. Every column in your schema should have a sensitivity classification.
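One way to enforce classification at ingestion is to refuse any schema that contains an unclassified column. The sketch below is a minimal illustration under assumptions: the `Sensitivity` levels mirror the categories above, while `validate_schema`, `assert_classified`, and the example column names are hypothetical.

```python
from enum import Enum


class Sensitivity(Enum):
    """Sensitivity levels mirroring the PII spectrum above (names are illustrative)."""
    NON_PII = "non_pii"
    QUASI_PII = "quasi_pii"
    DIRECT_PII = "direct_pii"
    SENSITIVE_PII = "sensitive_pii"


def validate_schema(columns: list[str], classifications: dict[str, Sensitivity]) -> list[str]:
    """Return the columns that have no sensitivity classification."""
    return [col for col in columns if col not in classifications]


def assert_classified(columns: list[str], classifications: dict[str, Sensitivity]) -> None:
    """Fail ingestion setup if any column is unclassified."""
    missing = validate_schema(columns, classifications)
    if missing:
        raise ValueError(f"Unclassified columns: {missing}")


# Hypothetical table: 'referrer' was added without a classification.
columns = ["user_id", "email", "ip_address", "referrer"]
classifications = {
    "user_id": Sensitivity.QUASI_PII,
    "email": Sensitivity.DIRECT_PII,
    "ip_address": Sensitivity.QUASI_PII,
}
unclassified = validate_schema(columns, classifications)
```

Running a check like this in CI or at pipeline startup means a new column cannot reach storage until someone has decided how sensitive it is.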
The Three Techniques: Masking, Tokenization, Encryption
These are frequently confused. The choice has significant implications for pipeline design.
Masking (Irreversible Substitution)
Replace real values with realistic but fictitious values. The original cannot be recovered.
import hashlib
import hmac

def mask_email(email: str, secret_key: bytes) -> str:
    """Replace the local part with a keyed hash; the same input always maps to the same output."""
    # A keyed HMAC resists the dictionary attacks that a bare MD5 of a known email allows.
    digest = hmac.new(secret_key, email.lower().encode(), hashlib.sha256).hexdigest()[:8]
    domain = email.rpartition("@")[2]  # keep domain for analytics (company distribution)
    return f"user_{digest}@{domain}"

def mask_ssn(ssn: str) -> str:
    """Redact all but the last four digits of an SSN."""
    return f"XXX-XX-{ssn[-4:]}"  # keep last 4 for partial lookup support
When to use masking:
- Non-production environments (dev/staging/QA)
- Analytics layers where individual identity doesn’t matter
- Training data for ML models
- When irreversibility is a feature (breach-safe)
Masking cannot support “show me user Alice’s data” after masking. For use cases requiring original data retrieval, use tokenization.
Tokenization (Reversible Substitution)
Replace real values with tokens. Original values stored in a secure vault. Tokens can be de-tokenized when authorized.
Real value: "4111-1111-1111-1111" (credit card)
Token: "8743-5092-6617-1111" (format-preserving token: all digits, last four kept)
Vault: { "8743-5092-6617-1111" → "4111-1111-1111-1111" } (access-controlled)
Pipeline:
Card number → Tokenization service → token → Kafka → analytics pipeline
When customer service needs to verify: token → vault API → original number
Format-preserving tokenization (FPT): token has the same format as the original. Critical when downstream systems validate schema/format.
When to use tokenization:
- Payment card data (PCI-DSS)
- Healthcare data that must be retrievable (HIPAA)
- Any scenario where the original must sometimes be recovered (customer service, legal holds)
- Cross-system joins where you need a consistent pseudonymous identifier across systems
Token vault security: the vault is the most sensitive component (encrypt, access control, replicate, audit). Many orgs use managed tokenization rather than building vaults.
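To make the vault's role concrete, here is a toy in-memory sketch, not a production design. The class `InMemoryTokenVault`, its method names, and the `caller_is_authorized` flag are all hypothetical; a real vault would persist mappings encrypted, authenticate callers, and audit every lookup.

```python
import secrets


class InMemoryTokenVault:
    """Toy vault: maps card numbers to random format-preserving tokens and back."""

    def __init__(self) -> None:
        self._token_by_value: dict[str, str] = {}
        self._value_by_token: dict[str, str] = {}

    def tokenize(self, card_number: str) -> str:
        """Return a stable token with the same layout as the input, keeping the last 4 digits."""
        if card_number in self._token_by_value:
            return self._token_by_value[card_number]  # same value -> same token
        digit_positions = [i for i, ch in enumerate(card_number) if ch.isdigit()]
        if len(digit_positions) <= 4:
            raise ValueError("need more than four digits to tokenize")
        to_replace = set(digit_positions[:-4])  # every digit except the last four
        while True:
            token = "".join(
                secrets.choice("0123456789") if i in to_replace else ch
                for i, ch in enumerate(card_number)
            )
            # Retry on the rare collision with an existing token or the original value.
            if token not in self._value_by_token and token != card_number:
                break
        self._token_by_value[card_number] = token
        self._value_by_token[token] = card_number
        return token

    def detokenize(self, token: str, caller_is_authorized: bool) -> str:
        """Return the original value; a real vault would verify identity and write an audit log."""
        if not caller_is_authorized:
            raise PermissionError("caller may not detokenize")
        return self._value_by_token[token]


vault = InMemoryTokenVault()
token = vault.tokenize("4111-1111-1111-1111")
```

Because the token is random rather than derived from the card number, a leaked token reveals nothing without access to the vault, which is why the vault itself needs the strongest controls.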
Encryption (Reversible with Key)
Encrypt the original value; decrypt with the key when needed.
AES-256-GCM (authenticated encryption — common standard):
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
import os

def encrypt_pii_field(plaintext: str, key: bytes) -> tuple[bytes, bytes]:
    """Encrypt a PII field. Returns (nonce, ciphertext)."""
    nonce = os.urandom(12)  # 96-bit nonce for GCM mode; never reuse a nonce with the same key
    aesgcm = AESGCM(key)
    ciphertext = aesgcm.encrypt(nonce, plaintext.encode(), None)
    return nonce, ciphertext

def decrypt_pii_field(nonce: bytes, ciphertext: bytes, key: bytes) -> str:
    """Decrypt a PII field."""
    aesgcm = AESGCM(key)
    plaintext = aesgcm.decrypt(nonce, ciphertext, None)
    return plaintext.decode()

# Usage in a pipeline:
# resp = kms_client.generate_data_key(KeyId="arn:aws:kms:...", KeySpec="AES_256")
# key = resp["Plaintext"]  # DEK from KMS; persist resp["CiphertextBlob"], never the plaintext key
# nonce, encrypted_email = encrypt_pii_field("alice@gmail.com", key)
# Store nonce + ciphertext in the DB; key managed by KMS
When to use encryption:
- When data must be stored securely but queries don’t need to run on the encrypted field
- Full-column encryption for highly sensitive fields (SSN in regulated DB)
- End-to-end encryption for data in transit
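Encryption limitation for analytics: with a random nonce, the same plaintext encrypts to a different ciphertext every time, so you can’t group, sort, or join on encrypted values without decrypting them first. That’s why masking and tokenization exist.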
Decision Framework
| Scenario | Technique | Why |
|---|---|---|
| Analytics layer — need email domain distribution | Masking (keep domain, hash local part) | Original not needed; domain preserved |
| Payment processing — need card number later | Tokenization (FPT) | Must retrieve original; format preserved |
| ML training data | Masking (irreversible) | Original never needed; breach-safe |
| Cross-system user identity | Tokenization (consistent token per user) | Same user = same token across systems |
| SSN stored in DB, never queried | Encryption (AES-256-GCM) | Secure storage; occasional decrypt for display |
| Data in transit (API, Kafka) | TLS + field-level encryption for ultra-sensitive | Defense in depth |
| Dev/staging environments | Masking | Production data never in non-prod |
GDPR & CCPA: The Engineering Implications
These are architecture requirements.
GDPR (EU General Data Protection Regulation)
The 7 principles that drive DE decisions
| Principle | What it means for your pipeline |
|---|---|
| Lawfulness, fairness, transparency | Log what you collect and why; provide privacy notices |
| Purpose limitation | Separate pipelines per consent; separate training vs analytics with explicit consent tracking |
| Data minimization | Collect only what you need; don’t SELECT * by default |
| Accuracy | Detect/correct inaccurate PII via data quality pipelines |
| Storage limitation | Retention policies are mandatory — delete when no longer needed |
| Integrity and confidentiality | Encryption, access control, audit logs |
| Accountability | Data lineage, audit trails, DPIAs |
Right to erasure (Article 17): deletion is a distributed systems problem
When a user requests deletion, you must delete from every system that stores the data.
User "alice" requests deletion under GDPR Article 17
Systems that must delete:
✓ Operational DB (PostgreSQL) → DELETE WHERE user_id = 'alice-hash'
✓ Kafka topics (log compaction) → produce tombstone: key='alice-hash', value=null
✓ Bronze layer (Iceberg on S3) → row-level delete + compaction
✓ Silver layer (derived data) → delete rows where user_id = 'alice-hash'
✓ Gold layer (aggregates) → re-run aggregates excluding alice
✗ ML training datasets → models cannot be "untrained"
→ exclude from future training + periodic retraining
✓ Feature store → evict online features; remove from offline store
✓ Backups → restore process must re-apply deletions
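Timeline: GDPR requires action without undue delay and at the latest within one month of the request (Article 12(3)), extendable by two further months for complex or numerous requests. Design for 7-day execution with buffer.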
ML training: standard approach is (1) deletion registry, (2) exclude deleted users from future training, (3) periodic retraining, (4) consider differential privacy to reduce memorization.
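Steps (1) and (2) can be sketched as a registry lookup applied when a training set is assembled. This is a minimal in-memory illustration; `DeletionRegistry`, `filter_training_rows`, and the row layout are hypothetical names, and a real registry would be a durable, audited table.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DeletionRegistry:
    """Records which users asked for erasure and when (in-memory stand-in)."""
    requested_at: dict[str, datetime] = field(default_factory=dict)

    def record(self, user_id: str) -> None:
        # Keep the earliest request time if the user asks more than once.
        self.requested_at.setdefault(user_id, datetime.now(timezone.utc))

    def is_deleted(self, user_id: str) -> bool:
        return user_id in self.requested_at


def filter_training_rows(rows: list[dict], registry: DeletionRegistry) -> list[dict]:
    """Drop every row that belongs to a user in the deletion registry."""
    return [row for row in rows if not registry.is_deleted(row["user_id"])]


registry = DeletionRegistry()
registry.record("u2")
rows = [
    {"user_id": "u1", "feature": 0.4},
    {"user_id": "u2", "feature": 0.9},
    {"user_id": "u3", "feature": 0.1},
]
training_rows = filter_training_rows(rows, registry)
```

Applying the filter at dataset build time, rather than once at ingestion, means every retraining run picks up deletions recorded since the previous run.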
Data localization (Article 44)
EU personal data can’t be transferred outside the EU/EEA without safeguards such as an adequacy decision or standard contractual clauses. Engineering implication: the simplest way to comply is to store and process EU data in EU regions and avoid routing it through non-EU systems.
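A pipeline can enforce this with a residency check before every write. The sketch below assumes a deliberately conservative policy (EU data only in EU regions), and the region names, residency labels, and `check_write_allowed` function are illustrative, not taken from any particular platform.

```python
# Assumed deployment: region names and residency labels are illustrative.
REGIONS_BY_RESIDENCY = {
    "EU": {"eu-west-1", "eu-central-1"},
    "US": {"us-east-1", "us-west-2", "eu-west-1", "eu-central-1"},
}


def check_write_allowed(user_residency: str, target_region: str) -> None:
    """Raise if a record for this user must not be written to the target region."""
    allowed = REGIONS_BY_RESIDENCY.get(user_residency)
    if allowed is None:
        # Privacy as default: refuse to guess where unknown residencies may go.
        raise ValueError(f"unknown residency: {user_residency}")
    if target_region not in allowed:
        raise PermissionError(
            f"{user_residency} data may not be written to {target_region}"
        )


check_write_allowed("EU", "eu-central-1")  # passes silently
```

An organization that relies on transfer safeguards instead of strict regional storage would encode those approved destinations in the same lookup table.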
CCPA (California Consumer Privacy Act)
Key rights affecting pipelines:
- Right to know: what data collected about me → requires inventory + queryable personal data
- Right to delete: similar to GDPR erasure
- Right to opt-out of sale: consent flags must flow through pipelines
- Right to non-discrimination
CCPA opt-out enforcement:
# consent_service is assumed to be an existing client for the consent store
def is_use_allowed(user_id: str, data_use: str) -> bool:
    consent = consent_service.get_consent(user_id)
    return consent.allows(data_use)

# Derive each stream from the raw events so an opt-out for one purpose does not affect another
analytics_stream = events_stream.filter(
    lambda e: is_use_allowed(e.user_id, "analytics")
)
shared_data_stream = events_stream.filter(
    lambda e: is_use_allowed(e.user_id, "data_sharing")
)
Privacy by Design: Seven Principles Applied to Data Engineering
Proactive, not reactive: design privacy in at day zero.
Privacy as default: new fields → assume PII unless proven otherwise. New tables → restricted by default. New pipelines → don’t over-collect.
Schema contract example embedding privacy:
schema:
  - name: user_id
    type: STRING
    pii_classification: PSEUDONYMOUS
  - name: email
    type: STRING
    pii_classification: DIRECT_PII
    data_handling:
      collection_purpose: account_authentication
      allowed_uses: [authentication, account_recovery]
      not_allowed: [analytics, ml_training, sharing]
      retention_days: 365
      masking_in_analytics: domain_only
  - name: ip_address
    type: STRING
    pii_classification: QUASI_PII
    data_handling:
      retention_days: 30
      masking_in_analytics: subnet_only
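A contract like this only helps if something checks it. The sketch below shows one possible check, with part of the contract copied into a plain dict to stay dependency-free. The function `is_use_permitted` is hypothetical, and treating a field with no declared `allowed_uses` as unrestricted is an assumption of this sketch, not something the contract format defines.

```python
# Part of the contract above as plain dicts (loading from YAML is omitted).
CONTRACT = {
    "user_id": {"pii_classification": "PSEUDONYMOUS"},
    "email": {
        "pii_classification": "DIRECT_PII",
        "data_handling": {
            "allowed_uses": ["authentication", "account_recovery"],
            "not_allowed": ["analytics", "ml_training", "sharing"],
        },
    },
}


def is_use_permitted(field_name: str, purpose: str, contract: dict) -> bool:
    """Check a purpose against a field's allowed_uses and not_allowed lists."""
    if field_name not in contract:
        return False  # privacy as default: unknown fields are denied
    handling = contract[field_name].get("data_handling", {})
    if purpose in handling.get("not_allowed", []):
        return False
    allowed = handling.get("allowed_uses")
    # Assumption: a field that declares no allowed_uses is not purpose-restricted.
    return allowed is None or purpose in allowed


requested = ["user_id", "email"]
analytics_columns = [c for c in requested if is_use_permitted(c, "analytics", CONTRACT)]
```

A query planner or pipeline linter could call such a check for every column a job selects and fail the job before any data is read.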
Visibility and transparency: lineage shows every consumer of PII. When a user asks “where is my data?” you can answer quickly.
Retention Policies: Engineering Implementation
Retention is mandatory to implement storage limitation and data minimization.
Iceberg retention example:
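# 1. Keep snapshots for at most 30 days (in ms) and use merge-on-read row-level deletes.
spark.sql("""
    ALTER TABLE silver.user_events
    SET TBLPROPERTIES (
        'history.expire.max-snapshot-age-ms' = '2592000000',
        'write.delete.mode' = 'merge-on-read'
    )
""")
# 2. Logically delete rows past retention; snapshot and file maintenance alone removes no live rows.
spark.sql("""
    DELETE FROM silver.user_events
    WHERE event_date < current_date() - INTERVAL 30 DAYS
""")
# 3. Compact the affected data files.
spark.sql("""
    CALL system.rewrite_data_files(
        table => 'silver.user_events',
        where => 'event_date < current_date() - interval 30 days'
    )
""")
# 4. Expire old snapshots so files referenced only by them can be physically removed.
spark.sql("""
    CALL system.expire_snapshots(
        table => 'silver.user_events',
        older_than => TIMESTAMP '2026-03-13',
        retain_last => 1
    )
""")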
Retention tiers in practice:
| Data category | Retention | Enforcement |
|---|---|---|
| Raw PII (email, phone, SSN) | 30 days | Iceberg expire_snapshots + rewrite_data_files |
| Pseudonymized event data | 2 years | Partition expiry by date |
| Aggregated analytics (no PII) | 7 years | Standard compliance retention |
| Audit logs (who accessed what) | 7 years | Immutable, append-only storage |
| ML training datasets with PII | Excluded from training | Deletion registry enforced |
| Backup/DR copies | Must comply | Restore process re-applies deletions |
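The time-based tiers in the table can be driven from a single lookup so that enforcement jobs never hard-code a retention period. This is a sketch under assumptions: the category keys, `is_expired`, and `expired_partitions` are hypothetical names, and seven years is approximated as 7 × 365 days.

```python
from datetime import date, timedelta

# Retention per category in days, taken from the table above (7 years ~ 7 * 365).
RETENTION_DAYS = {
    "raw_pii": 30,
    "pseudonymized_events": 2 * 365,
    "aggregated_analytics": 7 * 365,
    "audit_logs": 7 * 365,
}


def is_expired(category: str, created_on: date, today: date) -> bool:
    """True when a record has outlived its category's retention window."""
    return today - created_on > timedelta(days=RETENTION_DAYS[category])


def expired_partitions(category: str, partition_dates: list[date], today: date) -> list[date]:
    """Select the date partitions that a scheduled job should drop."""
    return [d for d in partition_dates if is_expired(category, d, today)]


today = date(2026, 4, 12)
partitions = [date(2026, 3, 1), date(2026, 3, 20), date(2026, 4, 10)]
to_drop = expired_partitions("raw_pii", partitions, today)
```

Passing `today` explicitly keeps the function deterministic, which makes retention logic straightforward to unit test.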
Interview Questions
Q1: Mobile app events with IPs + device IDs + user IDs. How to comply with GDPR?
Model answer:
- At ingestion: treat IP as personal data; truncate to the /24 subnet (e.g., 192.168.1.100 → 192.168.1.0) at the edge to preserve network analytics.
- Device ID: pseudonymize with consistent tokenization.
- User ID: ensure it’s a stable pseudonymous identifier (not reversible to direct identity).
- Storage: bronze layer retains raw for 30 days max; silver/gold never see raw IP/device identifiers.
- Consent: consent flags travel with events; branch or filter pipelines at entry.
- Erasure: deletion job runs within 7 days across bronze/silver/gold and feature stores; immutable audit record proves compliance.
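The IP truncation step in this answer can be sketched with Python's standard `ipaddress` module. The function name `truncate_ip` is hypothetical, and the /48 prefix for IPv6 is an assumption of this sketch, since the answer above only specifies the IPv4 case.

```python
import ipaddress


def truncate_ip(ip: str) -> str:
    """Zero the host bits: keep /24 for IPv4 and /48 for IPv6 (IPv6 prefix is an assumption)."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    # strict=False lets ip_network accept an address with host bits set and mask them off.
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


truncated = truncate_ip("192.168.1.100")  # "192.168.1.0"
```

Doing this at the edge, before events reach the bronze layer, means the full address is never stored and never has to be deleted later.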
Q2: Data scientist wants ML training on user data. What privacy considerations?
Model answer:
- Consent + purpose alignment: exclude users opted out of ML use.
- Data minimization + de-identification: use derived features from pseudonymized data; avoid raw email/phone.
- Erasure compliance: enforce deletion registry; exclude deleted users from future training; periodic retraining; consider differential privacy (DP-SGD) for sensitive training.
- Auditability: dataset versions include consent snapshot date, de-id method, and deletion-exclusion list applied.
Think About This
Meta prompt: a user in Germany requests deletion (GDPR Article 17). Design the deletion pipeline at petabyte scale.
Walk through:
- Systems covered: operational DBs, Kafka, lake/warehouse, caches, search indexes, ML datasets, third-party exports
- Architecture: central deletion registry {user_id, requested_at} → deletion events topic → per-system deletion handlers → audit confirmations
- ML: enforce deletion exclusion list for future training; retrain periodically; DP training reduces memorization risk
- Multi-region: propagate to all regional replicas; EU residency deletes first
- Proof: central audit report showing per-system completion, timestamps, and method
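The registry → handlers → audit flow can be sketched as a small fan-out. This is a synchronous, in-process stand-in for what would really be asynchronous consumers on a deletion events topic; `Confirmation`, `run_erasure`, `is_complete`, and the example handlers are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass
class Confirmation:
    """One audit record: which system handled which user, when, and how."""
    system: str
    user_id: str
    succeeded: bool
    completed_at: datetime
    detail: str


def run_erasure(user_id: str, handlers: dict[str, Callable[[str], str]]) -> list[Confirmation]:
    """Call every per-system handler and collect one audit record per system."""
    report = []
    for system, handler in handlers.items():
        try:
            detail, ok = handler(user_id), True
        except Exception as exc:  # a failing system must not block the others
            detail, ok = f"failed: {exc}", False
        report.append(Confirmation(system, user_id, ok, datetime.now(timezone.utc), detail))
    return report


def is_complete(report: list[Confirmation]) -> bool:
    """The request is done only when every system has confirmed."""
    return all(c.succeeded for c in report)


def failing_search_handler(user_id: str) -> str:
    raise RuntimeError("index unavailable")


handlers = {
    "postgres": lambda uid: "row delete",
    "kafka": lambda uid: "tombstone produced",
    "search_index": failing_search_handler,
}
report = run_erasure("alice-hash", handlers)
```

Systems whose confirmation shows `succeeded=False` would be retried until the report is complete, and the final report serves as the proof of compliance.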
Quick Reference
- Masking: irreversible, analytics-safe
- Tokenization: reversible via vault, supports joins and retrieval
- Encryption: reversible with key, not analytics-friendly
- Consent is first-class data; enforce at pipeline entry
- Right-to-erasure: registry → events → handlers → audit confirmations
- Retention policies must be automated
Tomorrow’s Preview
Day 55: Cost Optimization for Data Platforms — cloud cost drivers (compute, storage, network), spot/preemptible, tiered storage, query optimization for cost, and chargeback models.