Day 52 — Multi-Tenancy in Data Systems

Phase 2: Deep Dives | Category: Data Platform Design

Why Multi-Tenancy Is a Senior DE Interview Topic

Multi-tenancy is the engineering core of everything you designed in Days 50–51 (data mesh, self-serve platforms). It’s also how every cloud data platform works internally — BigQuery, Snowflake, Databricks — and how company-built platforms serve many internal teams from shared infrastructure.

Per Educative.io: “Multi-tenancy is central to Atlassian’s cloud architecture. Each organization expects data isolation, fair resource allocation, and accurate billing. A hot tenant running a massive bulk import should not cause latency spikes for other tenants sharing the same infrastructure.”

The Multi-Tenancy Spectrum: Three Isolation Models

Model 1: Shared Everything (Row-Level Security)

All tenants share the same database/schema/tables. Every row is tagged with tenant_id. Access is enforced by query-level filtering (RLS).

CREATE TABLE user_events (
    tenant_id    STRING NOT NULL,
    user_id      STRING,
    event_type   STRING,
    event_time   TIMESTAMP
);

-- Row-level security policy (BigQuery example):
CREATE ROW ACCESS POLICY events_tenant_policy
ON user_events
GRANT TO ("group:marketing-team@company.com")
FILTER USING (tenant_id = 'marketing');

-- Marketing team query → automatically filtered to their tenant_id
SELECT event_type, COUNT(*) FROM user_events GROUP BY event_type;

Pros	Cons
Lowest cost (shared infra)	Noisy neighbor risk (one heavy query impacts all)
Simplest to operate (one schema)	Security misconfiguration can leak data
Schema changes apply once	Hard to customize per tenant
Best utilization	Compliance/audit isolation harder to demonstrate

Best for: internal platforms with trusted tenants and similar data structures, where cost efficiency dominates strict isolation.

Model 2: Separate Schema Per Tenant

Each tenant gets their own namespace within a shared database/warehouse. Data is logically isolated; infra is shared.

BigQuery:
  dataset: marketing.events
  dataset: finance.events
  dataset: product.events

Snowflake:
  database: ANALYTICS
    schema: MARKETING
    schema: FINANCE
    schema: PRODUCT

Unity Catalog:
  catalog: company_data
    schema: marketing.*
    schema: finance.*
    schema: product.*

Pros	Cons
Clear logical isolation + ownership	Schema changes may need replication
Easier compliance (clear boundaries)	Slightly harder cross-tenant queries
Per-tenant customization possible	More catalog/namespace overhead
Good balance of cost and isolation	Still shares compute (noisy neighbor at query layer)

Best for: most internal data platform deployments (dominant model at FAANG-scale internal platforms).

Model 3: Separate Database / Cluster Per Tenant

Each tenant gets dedicated infrastructure.

Marketing: dedicated Snowflake virtual warehouse (XL)
Finance:   dedicated Databricks cluster
DS team:   dedicated Databricks GPU cluster
Ad-hoc:    shared pool with slot reservation

Pros	Cons
Strongest performance isolation	Highest cost (underutilization)
No noisy neighbor	Ops overhead multiplies with tenants
Strongest compliance posture	Changes managed per tenant
Custom SLAs per tenant	Cold starts if truly dedicated

Best for: external SaaS, highly regulated tenants, or contractual SLAs.

The Noisy Neighbor Problem: The Core Engineering Challenge

Multi-tenancy offers strong economics, but tenants can interfere with each other.

Monday 9 AM: Finance runs weekly P&L report
  Query scans 3 months of events (500 GB)
  Consumes most available BigQuery slots for 15 minutes
  Marketing dashboard refresh slows from 30s to 10m
  Root cause: unoptimized finance query during peak hours

Three mechanisms to solve noisy neighbor:

1. Compute Quotas (Hard Limits)

BigQuery slot reservations:
  Finance:   500 slots reserved
  Marketing: 200 slots reserved
  Ad-hoc:    100 slots (shared)

Finance can’t steal marketing’s pool.

# Snowflake: separate virtual warehouses per team (isolated compute)
finance_wh = SnowflakeWarehouse(
    name="finance_warehouse",
    size="LARGE",
    auto_suspend=300,
    auto_resume=True,
)

marketing_wh = SnowflakeWarehouse(
    name="marketing_warehouse",
    size="MEDIUM",
    auto_suspend=60,
)

2. Query-Level Throttling

@rate_limit(
    queries_per_minute=50,
    bytes_scanned_per_hour=1_000_000_000_000,  # 1 TB/hour
    concurrent_queries=10,
)
def execute_query(tenant_id: str, sql: str):
    ...

# Emergency kill switch: auto-cancel if > 10 TB scanned or > 30 min runtime.

3. Weighted Fair Queuing

Priority queues:
  HIGH (SLA-critical pipelines)
  MEDIUM (dashboards)
  LOW (batch exports, exploration)

Tenant weighting:
  Each tenant gets N tokens/hour
  Each query costs tokens proportional to estimated compute
  Out of tokens → queries go to LOW queue
  Prevents permanent monopolization

Tenant Isolation in Data Lakes: The Multi-Layer Approach

For an internal lake serving many teams:

Storage isolation (S3/GCS):
  s3://company-lake/raw/marketing/  → IAM: marketing-read role
  s3://company-lake/raw/finance/    → IAM: finance-read role
  s3://company-lake/gold/shared/    → readable by all teams

Compute isolation (Spark/EMR):
  Per-team job queues + max CPU/mem per queue
  Tags per job for accounting

Catalog isolation (Hive/Unity Catalog):
  Catalog: marketing → only marketing can create tables
  Catalog: finance   → only finance can create tables

Query isolation (Trino):
  Resource groups per team
  Max concurrent: 20, max queued: 100
  Soft mem: 100 GB, hard mem: 150 GB (kill query instead of OOM)

Cost Allocation and Chargeback: The Economics Layer

Multi-tenancy economics only work if you can measure and report what each tenant consumes. Otherwise heavy users free-ride.

Tag everything from day 1:

spark_conf = {
    "spark.yarn.tags": f"team={team_id},project={project_id},env=prod",
    "spark.databricks.clusterUsageTags.customTags": json.dumps({
        "team": team_id,
        "cost_center": cost_center_id,
        "data_product": pipeline_name,
    }),
}

query_config = bigquery.QueryJobConfig(
    labels={"team": team_id, "pipeline": pipeline_name, "priority": "high"}
)

Monthly chargeback report:

SELECT
    labels.team AS team,
    SUM(total_bytes_processed) / 1e12 AS tb_scanned,
    SUM(total_bytes_processed) / 1e12 * 5.0 AS bq_compute_cost_usd,

    COUNT(*) AS query_count,
    AVG(total_slot_ms) / 1000 AS avg_slot_seconds,

    SUM(s3_storage_gb) AS storage_gb,
    SUM(s3_storage_gb) * 0.023 AS storage_cost_usd,

    SUM(databricks_dbu) AS databricks_dbus,
    SUM(databricks_dbu) * 0.22 AS databricks_cost_usd,

    SUM(total_bytes_processed) / 1e12 * 5.0 +
    SUM(s3_storage_gb) * 0.023 +
    SUM(databricks_dbu) * 0.22 AS total_cost_usd
FROM cloud_billing_export
WHERE EXTRACT(MONTH FROM usage_date) = 4
  AND EXTRACT(YEAR FROM usage_date) = 2026
GROUP BY labels.team
ORDER BY total_cost_usd DESC;

Chargeback models:

Model	Description	When to use
Show-back	Report usage per team, no billing	Early adoption
Soft chargeback	Cost visibility with manager review	Encourage responsibility without penalty
Hard chargeback	Deduct from team budget	Mature orgs with cost accountability
Budget alerts	Alerts at % of monthly budget	Proactive management

Access Control Patterns: Defense in Depth

Access control must work at multiple layers simultaneously:

Layer 1: Authentication
  Okta SSO, service account tokens

Layer 2: Authorization (coarse)
  IAM roles, Unity Catalog perms, dataset-level IAM

Layer 3: Authorization (fine-grained)
  Row-level security + column-level security

Layer 4: Data masking
  Mask sensitive values even for some authorized users

Layer 5: Audit logging
  Log every access for compliance review

Unity Catalog example:

spark.sql("""
CREATE OR REPLACE FUNCTION pii_mask.mask_email(email STRING)
RETURN CASE
    WHEN is_member('pii-access-group') THEN email
    ELSE CONCAT('***@', SPLIT(email, '@')[1])
END;
""")

spark.sql("""
ALTER TABLE gold.users
ALTER COLUMN email
SET MASK pii_mask.mask_email;
""")

Row-level filter example:

spark.sql("""
CREATE ROW FILTER sales_region_filter ON gold.sales_opportunities
AS (region) ->
    is_member('global-sales') OR
    array_contains(current_user_groups(), CONCAT('sales-', region));
""")

Tenant Onboarding Automation

Manual onboarding doesn’t scale; provisioning must be automated.

def onboard_new_tenant(team_id: str, config: TenantConfig):
    # 1) Storage namespace + IAM
    create_s3_prefix(f"s3://company-lake/{team_id}/")
    apply_iam_policy(
        f"s3://company-lake/{team_id}/",
        principal=f"role/{team_id}-data-role",
    )

    # 2) Compute quota/policy
    create_databricks_cluster_policy(
        team=team_id,
        max_instances=config.max_cluster_size,
        allowed_instance_types=config.allowed_instances,
        auto_terminate_minutes=60,
    )

    # 3) Catalog namespace
    spark.sql(f"CREATE CATALOG IF NOT EXISTS {team_id}")
    spark.sql(f"GRANT CREATE SCHEMA ON CATALOG {team_id} TO `{team_id}-data-role`")

    # 4) Register in data catalog
    datahub.create_domain(
        name=team_id,
        owner=config.team_lead_email,
        description=config.team_description,
    )

    # 5) Budget alert
    create_budget_alert(
        team=team_id,
        monthly_budget_usd=config.monthly_budget,
        alert_at_pct=[50, 80, 100],
    )

    send_onboarding_email(team_id, config.team_lead_email)

Goal: a new team is provisioned with isolated storage/compute/catalog/access in minutes, not days.

Interview Questions

Q1: “Internal data platform for 100 teams. One team runs a 10-hour Spark job nightly that starves others. Fix architecturally.”

Model Answer: “Classic noisy neighbor. Three-layer fix. (1) Compute isolation: per-team resource pools/cluster policies so the job can’t consume more than its quota. (2) Scheduling + priority: SLA-critical jobs first; long batch jobs run in low priority during off-peak windows. (3) Cost visibility/chargeback: the team sees the true cost and is incentivized to optimize (partition pruning, skew fixes, predicate pushdown). Add guardrails: auto-kill jobs beyond duration limits unless explicitly approved.”

Q2: “Analyst accidentally queries another team’s sensitive customer data. How does the platform prevent it, and what should catch it?”

Model Answer: “Defense in depth. (1) Catalog visibility: restricted tables should be non-discoverable or clearly flagged. (2) Execution-time permissions: query should fail with permission denied via Unity Catalog/IAM. (3) Masking: even when cross-team access is valid, PII should be masked unless in privileged groups. (4) Audit logging: query history/audit logs trigger alerts when users access restricted tables.”

Think About This

Meta-style prompt: “Design a multi-tenant data platform for 1,000+ internal product teams with huge variance in size. How do you design isolation, quotas, and cost management?”

Walk through:

Isolation model: separate schemas + compute quotas (balanced)
Quota tiers by team size and criticality
Cost tagging and chargeback + budget alerts
Automated onboarding via portal + Terraform
Emergency kill switch for platform-wide incidents

Quick Reference

Isolation models: shared + RLS → separate schema per tenant → dedicated cluster per tenant
Noisy neighbor fixes: quotas, throttling, weighted fair queuing
Defense in depth: auth → IAM → RLS/CLS → masking → audit logs
Cost allocation: tag everything from day 1 + monthly chargeback + alerts
Tenant onboarding: automated provisioning for storage/compute/catalog/IAM/budget

Tomorrow’s Preview

Day 53: Data Security & Access Control — IAM, RBAC, ABAC, encryption, key management, network perimeters, and audit.

Day 52: Multi-Tenancy in Data Systems