Phase 2: Deep Dives | Category: Security & Privacy
Why Data Engineers Must Own Security
Security in data systems isn’t a separate team’s responsibility — it’s designed in by the data engineer. Senior DEs are expected to design data systems that are secure by default, not secure by bolt-on.
The Security Layers: Defense in Depth
Layer 1: Network perimeter
VPC, private subnets, security groups, VPC endpoints
"Can this network traffic even reach the data?"
Layer 2: Authentication
IAM, service accounts, Okta SSO, MFA
"Is this identity who it claims to be?"
Layer 3: Coarse authorization (RBAC)
Roles, permissions, IAM policies
"Is this role allowed to access this resource?"
Layer 4: Fine-grained authorization (RLS/CLS/ABAC)
Row-level security, column masking, attribute-based policies
"Can this specific user see this specific row/column?"
Layer 5: Encryption
At rest (KMS), in transit (TLS), client-side, tokenization
"Even if someone accesses the data, can they read it?"
Layer 6: Audit logging
CloudTrail, QUERY_HISTORY, access logs
"What happened? Who touched what, when?"
Each layer is independently necessary. Skipping fine-grained controls because you have RBAC means a user with the right IAM role can see ALL rows. Skipping encryption means an IAM breach can become a data breach.
RBAC, ABAC, RLS: The Three Access Control Models
Per NextLytics 2026: “RBAC, RLS, and ABAC each solve a distinct problem in managing data access.” Use them in layers.
RBAC (Role-Based Access Control): The Foundation
Assign users to roles; grant permissions to roles. Users inherit permissions from their roles.
# Unity Catalog RBAC example
spark.sql("GRANT SELECT ON TABLE gold.fact_orders TO `analyst-role`")
spark.sql("GRANT MODIFY ON TABLE silver.events TO `data-engineer-role`")
spark.sql("GRANT ALL PRIVILEGES ON CATALOG marketing TO `marketing-data-role`")
Principle of least privilege: assign the minimum permissions needed. Avoid GRANT ALL to end users. Use separate roles for read/write/admin.
RBAC limitation: RBAC controls which tables you can access — not which rows within those tables. For row-level restrictions, add RLS.
RLS (Row-Level Security): Data-Level Filtering
Restricts which rows a user sees when querying a table. Enforced at query time by the engine; users can’t bypass it.
Unity Catalog row filter (Databricks):
spark.sql("""
CREATE OR REPLACE FUNCTION security.orders_region_filter(region STRING)
RETURN
is_member('global-sales')
OR region = current_user_attribute('sales_region')
""")
spark.sql("""
ALTER TABLE gold.fact_orders
SET ROW FILTER security.orders_region_filter ON (region)
""")
BigQuery row-level security:
CREATE ROW ACCESS POLICY orders_region_policy
ON gold.fact_orders
GRANT TO ("group:emea-sales@company.com")
FILTER USING (region IN ('UK', 'DE', 'FR'));
Key RLS design rules:
- Always include a super-user bypass (admins need global visibility)
- Filter on low-cardinality columns (region, department, tenant_id)
- RLS is an AND condition applied to all queries
Column-Level Security / Column Masking
Some users see raw values; others see masked values. Schema stays the same; values vary by user.
spark.sql("""
CREATE OR REPLACE FUNCTION security.mask_email(email STRING)
RETURN CASE
WHEN is_member('pii-access-group') THEN email
WHEN is_member('analyst-group') THEN CONCAT('***@', SPLIT(email, '@')[1])
ELSE '***@***.***'
END
""")
spark.sql("""
ALTER TABLE gold.dim_customer
ALTER COLUMN email
SET MASK security.mask_email
""")
Common masking patterns:
| Data type | PII team (sees) | Analyst (sees) | Public (sees) |
|---|---|---|---|
| alice@gmail.com | ***@gmail.com | @.*** | |
| SSN | 123-45-6789 | *--6789 | --*** |
| Credit card | 4111111111111111 | **** **** **** 1111 | REDACTED |
| Phone | +1-415-555-1234 | +1-415-*-** | REDACTED |
| IP address | 192.168.1.100 | 192.168.x.x | REDACTED |
ABAC (Attribute-Based Access Control): Policy-Driven at Scale
ABAC uses attributes (tags on data assets + attributes of users) to make dynamic decisions. Instead of writing policies per table, write global policies that apply everywhere.
# Step 1: Tag sensitive columns across tables
spark.sql("ALTER TABLE gold.dim_customer ALTER COLUMN email SET TAGS ('pii'='email')")
spark.sql("ALTER TABLE silver.events ALTER COLUMN ip_address SET TAGS ('pii'='ip')")
spark.sql("ALTER TABLE gold.fact_orders ALTER COLUMN customer_name SET TAGS ('pii'='name')")
# Step 2: One policy for all tagged columns
spark.sql("""
CREATE OR REPLACE FUNCTION security.pii_column_mask(col STRING, pii_type STRING)
RETURN CASE
WHEN is_member('pii-access-group') THEN col
WHEN pii_type = 'email' THEN CONCAT('***@', SPLIT(col, '@')[1])
WHEN pii_type = 'ip' THEN REGEXP_REPLACE(col, '\\\\d+$', 'x')
WHEN pii_type = 'name' THEN CONCAT(LEFT(col, 1), '***')
ELSE 'REDACTED'
END
""")
ABAC scales better than per-table rules: with 10K tables, per-table policy maintenance becomes operationally impossible.
Encryption: At Rest and In Transit
Encryption at rest
- Provider-managed keys (default): simplest, often sufficient
- Customer-managed keys (CMK): required for highly sensitive/regulatory cases
Envelope encryption (how KMS works):
Your data → encrypted with a Data Encryption Key (DEK) → stored in S3/BigQuery/Snowflake
DEK → encrypted with a Key Encryption Key (KEK) in KMS → stored alongside encrypted data
To read: fetch encrypted DEK → call KMS to decrypt DEK → use DEK to decrypt data
To revoke access: disable/delete the KEK → data becomes unreadable
AWS example (S3 + CMK):
s3_bucket = s3.Bucket(
"sensitive-data-bucket",
server_side_encryption_configuration={
"rule": {
"apply_server_side_encryption_by_default": {
"sse_algorithm": "aws:kms",
"kms_master_key_id": "arn:aws:kms:us-east-1:123456789012:key/key-id",
},
"bucket_key_enabled": True,
}
},
block_public_acls=True,
block_public_policy=True,
ignore_public_acls=True,
restrict_public_buckets=True,
)
Encryption in transit
Non-negotiables:
- HTTPS/TLS for APIs (TLS 1.2+; prefer 1.3)
- DB connections over SSL (
sslmode='require') - S3 bucket policy enforcing HTTPS (deny
aws:SecureTransport=false) - Kafka SSL listeners for producer/consumer
- Consider mTLS/service mesh for service-to-service traffic
Network Perimeter: VPCs and Private Endpoints
Data systems should not be internet-accessible.
Public subnet:
Load balancers only
NAT gateway (outbound)
Bastion/jump host (emergency admin)
Private subnet:
Warehouse / DBs
Spark/EMR clusters
Kafka brokers
Airflow workers
VPC endpoints:
S3 gateway endpoint
DynamoDB gateway endpoint
KMS interface endpoint
S3 VPC endpoint example:
aws_ec2.VpcEndpoint(
"s3-endpoint",
vpc_id=vpc.id,
service_name="com.amazonaws.us-east-1.s3",
vpc_endpoint_type="Gateway",
route_table_ids=[private_route_table.id],
)
PII Detection and Classification
Before you can protect PII, you must know where it is.
Automated PII detection pipeline:
from presidio_analyzer import AnalyzerEngine
analyzer = AnalyzerEngine()
def scan_sample_for_pii(df: DataFrame, column: str) -> list[str]:
sample = df.select(column).limit(1000).toPandas()[column].dropna()
pii_types_found = set()
for value in sample:
results = analyzer.analyze(text=str(value), language="en")
for result in results:
if result.score > 0.7:
pii_types_found.add(result.entity_type)
return list(pii_types_found)
for table in new_tables:
df = spark.read.table(table)
for col in df.columns:
pii_types = scan_sample_for_pii(df, col)
if pii_types:
spark.sql(
f\"\"\"\n ALTER TABLE {table} ALTER COLUMN {col}\n SET TAGS ({', '.join([f\"'pii_{t.lower()}'='true'\" for t in pii_types])})\n \"\"\"\n )
Audit Logging: The Compliance Foundation
Every data access must be logged (GDPR, CCPA, SOC2, HIPAA, etc.).
What to log:
For every query:
WHO: user/service account
WHAT: tables + columns accessed
WHEN: timestamp
WHERE: source IP/service
HOW: query text (if safe)
RESULT: success/error
For every write:
+ rows written/deleted
BigQuery audit log query:
SELECT
protopayload_auditlog.authenticationInfo.principalEmail AS user,
timestamp,
protopayload_auditlog.resourceName AS resource,
protopayload_auditlog.methodName AS operation
FROM `project.cloudaudit_googleapis_com_data_access`
WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
AND protopayload_auditlog.resourceName LIKE '%customer_pii%'
ORDER BY timestamp DESC;
Snowflake audit:
SELECT
user_name,
query_text,
start_time,
status,
rows_produced,
bytes_scanned
FROM snowflake.account_usage.query_history
WHERE start_time > DATEADD('hour', -24, CURRENT_TIMESTAMP())
AND query_text ILIKE '%customer_pii%'
ORDER BY start_time DESC;
Interview Questions
Q1: “How would you handle an analyst request to access a table containing customer PII?”
Model Answer: “Default is no raw PII. First, validate necessity and try to satisfy with masked/aggregated views. If raw PII is truly required: documented business justification, manager + owner approvals, training certification, time-limited access with auto-expiry, and enhanced audit review. PII access must be justified, approved, and auditable.”
Q2: “Migrating petabytes on-prem → S3. What security controls during and after migration?”
Model Answer: “During: inventory/classify first, encrypt in transit (DataSync/HTTPS; Snowball AES-256), prefer Direct Connect. After: Block Public Access everywhere, SSE-KMS with CMK for confidential data, enforce HTTPS-only bucket policies, VPC endpoints, S3 access logging + CloudTrail data events, least-privileged IAM roles, Access Points for per-team policies, Macie/Config rules for drift detection.”
Think About This
Anthropic-style prompt: enterprise customers require isolation and “not used for training”. Design the architecture.
Walk through:
- Storage isolation (separate prefixes + IAM + CMK)
- Training pipeline exclusion list enforced automatically
- Audit trail of every dataset version and included tenants
- Break-glass access with time-limited permissions and immutable audit logs
Quick Reference
- Six layers: network → authn → RBAC → RLS/CLS/ABAC → encryption → audit logs
- RBAC controls tables/schemas; RLS/CLS control rows/values
- ABAC scales to 10K+ tables via governed tags + global policies
- Encryption: provider-managed vs CMK; always TLS in transit
- VPC endpoints keep data plane traffic off the internet and reduce NAT cost
- PII pipeline: detect → tag → mask → restrict → log