Phase 2: Deep Dives | Category: Security & Privacy

Why Data Engineers Must Own Security

Security in data systems isn’t a separate team’s responsibility — it’s designed in by the data engineer. Senior DEs are expected to design data systems that are secure by default, not secure by bolt-on.

The Security Layers: Defense in Depth

Layer 1: Network perimeter
  VPC, private subnets, security groups, VPC endpoints
  "Can this network traffic even reach the data?"

Layer 2: Authentication
  IAM, service accounts, Okta SSO, MFA
  "Is this identity who it claims to be?"

Layer 3: Coarse authorization (RBAC)
  Roles, permissions, IAM policies
  "Is this role allowed to access this resource?"

Layer 4: Fine-grained authorization (RLS/CLS/ABAC)
  Row-level security, column masking, attribute-based policies
  "Can this specific user see this specific row/column?"

Layer 5: Encryption
  At rest (KMS), in transit (TLS), client-side, tokenization
  "Even if someone accesses the data, can they read it?"

Layer 6: Audit logging
  CloudTrail, QUERY_HISTORY, access logs
  "What happened? Who touched what, when?"

Each layer is independently necessary. Skipping fine-grained controls because you have RBAC means a user with the right IAM role can see ALL rows. Skipping encryption means an IAM breach can become a data breach.

RBAC, ABAC, RLS: The Three Access Control Models

Per NextLytics 2026: “RBAC, RLS, and ABAC each solve a distinct problem in managing data access.” Use them in layers.

RBAC (Role-Based Access Control): The Foundation

Assign users to roles; grant permissions to roles. Users inherit permissions from their roles.

# Unity Catalog RBAC example
spark.sql("GRANT SELECT ON TABLE gold.fact_orders TO `analyst-role`")
spark.sql("GRANT MODIFY ON TABLE silver.events TO `data-engineer-role`")
spark.sql("GRANT ALL PRIVILEGES ON CATALOG marketing TO `marketing-data-role`")

Principle of least privilege: assign the minimum permissions needed. Avoid GRANT ALL to end users. Use separate roles for read/write/admin.

RBAC limitation: RBAC controls which tables you can access — not which rows within those tables. For row-level restrictions, add RLS.

RLS (Row-Level Security): Data-Level Filtering

Restricts which rows a user sees when querying a table. Enforced at query time by the engine; users can’t bypass it.

Unity Catalog row filter (Databricks):

spark.sql("""
CREATE OR REPLACE FUNCTION security.orders_region_filter(region STRING)
RETURN
    is_member('global-sales')
    OR region = current_user_attribute('sales_region')
""")

spark.sql("""
ALTER TABLE gold.fact_orders
SET ROW FILTER security.orders_region_filter ON (region)
""")

BigQuery row-level security:

CREATE ROW ACCESS POLICY orders_region_policy
ON gold.fact_orders
GRANT TO ("group:emea-sales@company.com")
FILTER USING (region IN ('UK', 'DE', 'FR'));

Key RLS design rules:

  • Always include a super-user bypass (admins need global visibility)
  • Filter on low-cardinality columns (region, department, tenant_id)
  • RLS is an AND condition applied to all queries

Column-Level Security / Column Masking

Some users see raw values; others see masked values. Schema stays the same; values vary by user.

spark.sql("""
CREATE OR REPLACE FUNCTION security.mask_email(email STRING)
RETURN CASE
    WHEN is_member('pii-access-group') THEN email
    WHEN is_member('analyst-group')    THEN CONCAT('***@', SPLIT(email, '@')[1])
    ELSE '***@***.***'
END
""")

spark.sql("""
ALTER TABLE gold.dim_customer
ALTER COLUMN email
SET MASK security.mask_email
""")

Common masking patterns:

Data typePII team (sees)Analyst (sees)Public (sees)
Emailalice@gmail.com***@gmail.com@.***
SSN123-45-6789*--6789--***
Credit card4111111111111111**** **** **** 1111REDACTED
Phone+1-415-555-1234+1-415-*-**REDACTED
IP address192.168.1.100192.168.x.xREDACTED

ABAC (Attribute-Based Access Control): Policy-Driven at Scale

ABAC uses attributes (tags on data assets + attributes of users) to make dynamic decisions. Instead of writing policies per table, write global policies that apply everywhere.

# Step 1: Tag sensitive columns across tables
spark.sql("ALTER TABLE gold.dim_customer ALTER COLUMN email SET TAGS ('pii'='email')")
spark.sql("ALTER TABLE silver.events ALTER COLUMN ip_address SET TAGS ('pii'='ip')")
spark.sql("ALTER TABLE gold.fact_orders ALTER COLUMN customer_name SET TAGS ('pii'='name')")

# Step 2: One policy for all tagged columns
spark.sql("""
CREATE OR REPLACE FUNCTION security.pii_column_mask(col STRING, pii_type STRING)
RETURN CASE
    WHEN is_member('pii-access-group') THEN col
    WHEN pii_type = 'email' THEN CONCAT('***@', SPLIT(col, '@')[1])
    WHEN pii_type = 'ip'    THEN REGEXP_REPLACE(col, '\\\\d+$', 'x')
    WHEN pii_type = 'name'  THEN CONCAT(LEFT(col, 1), '***')
    ELSE 'REDACTED'
END
""")

ABAC scales better than per-table rules: with 10K tables, per-table policy maintenance becomes operationally impossible.

Encryption: At Rest and In Transit

Encryption at rest

  • Provider-managed keys (default): simplest, often sufficient
  • Customer-managed keys (CMK): required for highly sensitive/regulatory cases

Envelope encryption (how KMS works):

Your data → encrypted with a Data Encryption Key (DEK) → stored in S3/BigQuery/Snowflake
DEK → encrypted with a Key Encryption Key (KEK) in KMS → stored alongside encrypted data

To read: fetch encrypted DEK → call KMS to decrypt DEK → use DEK to decrypt data
To revoke access: disable/delete the KEK → data becomes unreadable

AWS example (S3 + CMK):

s3_bucket = s3.Bucket(
    "sensitive-data-bucket",
    server_side_encryption_configuration={
        "rule": {
            "apply_server_side_encryption_by_default": {
                "sse_algorithm": "aws:kms",
                "kms_master_key_id": "arn:aws:kms:us-east-1:123456789012:key/key-id",
            },
            "bucket_key_enabled": True,
        }
    },
    block_public_acls=True,
    block_public_policy=True,
    ignore_public_acls=True,
    restrict_public_buckets=True,
)

Encryption in transit

Non-negotiables:

  • HTTPS/TLS for APIs (TLS 1.2+; prefer 1.3)
  • DB connections over SSL (sslmode='require')
  • S3 bucket policy enforcing HTTPS (deny aws:SecureTransport=false)
  • Kafka SSL listeners for producer/consumer
  • Consider mTLS/service mesh for service-to-service traffic

Network Perimeter: VPCs and Private Endpoints

Data systems should not be internet-accessible.

Public subnet:
  Load balancers only
  NAT gateway (outbound)
  Bastion/jump host (emergency admin)

Private subnet:
  Warehouse / DBs
  Spark/EMR clusters
  Kafka brokers
  Airflow workers

VPC endpoints:
  S3 gateway endpoint
  DynamoDB gateway endpoint
  KMS interface endpoint

S3 VPC endpoint example:

aws_ec2.VpcEndpoint(
    "s3-endpoint",
    vpc_id=vpc.id,
    service_name="com.amazonaws.us-east-1.s3",
    vpc_endpoint_type="Gateway",
    route_table_ids=[private_route_table.id],
)

PII Detection and Classification

Before you can protect PII, you must know where it is.

Automated PII detection pipeline:

from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

def scan_sample_for_pii(df: DataFrame, column: str) -> list[str]:
    sample = df.select(column).limit(1000).toPandas()[column].dropna()
    pii_types_found = set()
    for value in sample:
        results = analyzer.analyze(text=str(value), language="en")
        for result in results:
            if result.score > 0.7:
                pii_types_found.add(result.entity_type)
    return list(pii_types_found)

for table in new_tables:
    df = spark.read.table(table)
    for col in df.columns:
        pii_types = scan_sample_for_pii(df, col)
        if pii_types:
            spark.sql(
                f\"\"\"\n                ALTER TABLE {table} ALTER COLUMN {col}\n                SET TAGS ({', '.join([f\"'pii_{t.lower()}'='true'\" for t in pii_types])})\n                \"\"\"\n            )

Audit Logging: The Compliance Foundation

Every data access must be logged (GDPR, CCPA, SOC2, HIPAA, etc.).

What to log:

For every query:
  WHO: user/service account
  WHAT: tables + columns accessed
  WHEN: timestamp
  WHERE: source IP/service
  HOW: query text (if safe)
  RESULT: success/error

For every write:
  + rows written/deleted

BigQuery audit log query:

SELECT
    protopayload_auditlog.authenticationInfo.principalEmail AS user,
    timestamp,
    protopayload_auditlog.resourceName AS resource,
    protopayload_auditlog.methodName AS operation
FROM `project.cloudaudit_googleapis_com_data_access`
WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
  AND protopayload_auditlog.resourceName LIKE '%customer_pii%'
ORDER BY timestamp DESC;

Snowflake audit:

SELECT
    user_name,
    query_text,
    start_time,
    status,
    rows_produced,
    bytes_scanned
FROM snowflake.account_usage.query_history
WHERE start_time > DATEADD('hour', -24, CURRENT_TIMESTAMP())
  AND query_text ILIKE '%customer_pii%'
ORDER BY start_time DESC;

Interview Questions

Q1: “How would you handle an analyst request to access a table containing customer PII?”

Model Answer: “Default is no raw PII. First, validate necessity and try to satisfy with masked/aggregated views. If raw PII is truly required: documented business justification, manager + owner approvals, training certification, time-limited access with auto-expiry, and enhanced audit review. PII access must be justified, approved, and auditable.”

Q2: “Migrating petabytes on-prem → S3. What security controls during and after migration?”

Model Answer: “During: inventory/classify first, encrypt in transit (DataSync/HTTPS; Snowball AES-256), prefer Direct Connect. After: Block Public Access everywhere, SSE-KMS with CMK for confidential data, enforce HTTPS-only bucket policies, VPC endpoints, S3 access logging + CloudTrail data events, least-privileged IAM roles, Access Points for per-team policies, Macie/Config rules for drift detection.”

Think About This

Anthropic-style prompt: enterprise customers require isolation and “not used for training”. Design the architecture.

Walk through:

  • Storage isolation (separate prefixes + IAM + CMK)
  • Training pipeline exclusion list enforced automatically
  • Audit trail of every dataset version and included tenants
  • Break-glass access with time-limited permissions and immutable audit logs

Quick Reference

  • Six layers: network → authn → RBAC → RLS/CLS/ABAC → encryption → audit logs
  • RBAC controls tables/schemas; RLS/CLS control rows/values
  • ABAC scales to 10K+ tables via governed tags + global policies
  • Encryption: provider-managed vs CMK; always TLS in transit
  • VPC endpoints keep data plane traffic off the internet and reduce NAT cost
  • PII pipeline: detect → tag → mask → restrict → log