Phase 2: Deep Dives | Category: Data Platform Design
Why Multi-Tenancy Is a Senior DE Interview Topic
Multi-tenancy is the engineering core of everything you designed in Days 50–51 (data mesh, self-serve platforms). It’s also how every cloud data platform works internally — BigQuery, Snowflake, Databricks — and how company-built platforms serve many internal teams from shared infrastructure.
Per Educative.io: “Multi-tenancy is central to Atlassian’s cloud architecture. Each organization expects data isolation, fair resource allocation, and accurate billing. A hot tenant running a massive bulk import should not cause latency spikes for other tenants sharing the same infrastructure.”
The Multi-Tenancy Spectrum: Three Isolation Models
Model 1: Shared Everything (Row-Level Security)
All tenants share the same database/schema/tables. Every row is tagged with tenant_id. Access is enforced by query-level filtering (RLS).
CREATE TABLE user_events (
tenant_id STRING NOT NULL,
user_id STRING,
event_type STRING,
event_time TIMESTAMP
);
-- Row-level security policy (BigQuery example):
CREATE ROW ACCESS POLICY events_tenant_policy
ON user_events
GRANT TO ("group:marketing-team@company.com")
FILTER USING (tenant_id = 'marketing');
-- Marketing team query → automatically filtered to their tenant_id
SELECT event_type, COUNT(*) FROM user_events GROUP BY event_type;
| Pros | Cons |
|---|---|
| Lowest cost (shared infra) | Noisy neighbor risk (one heavy query impacts all) |
| Simplest to operate (one schema) | Security misconfiguration can leak data |
| Schema changes apply once | Hard to customize per tenant |
| Best utilization | Compliance/audit isolation harder to demonstrate |
Best for: internal platforms with trusted tenants and similar data structures, where cost efficiency dominates strict isolation.
Model 2: Separate Schema Per Tenant
Each tenant gets their own namespace within a shared database/warehouse. Data is logically isolated; infra is shared.
BigQuery:
dataset: marketing.events
dataset: finance.events
dataset: product.events
Snowflake:
database: ANALYTICS
schema: MARKETING
schema: FINANCE
schema: PRODUCT
Unity Catalog:
catalog: company_data
schema: marketing.*
schema: finance.*
schema: product.*
| Pros | Cons |
|---|---|
| Clear logical isolation + ownership | Schema changes may need replication |
| Easier compliance (clear boundaries) | Slightly harder cross-tenant queries |
| Per-tenant customization possible | More catalog/namespace overhead |
| Good balance of cost and isolation | Still shares compute (noisy neighbor at query layer) |
Best for: most internal data platform deployments (dominant model at FAANG-scale internal platforms).
Model 3: Separate Database / Cluster Per Tenant
Each tenant gets dedicated infrastructure.
Marketing: dedicated Snowflake virtual warehouse (XL)
Finance: dedicated Databricks cluster
DS team: dedicated Databricks GPU cluster
Ad-hoc: shared pool with slot reservation
| Pros | Cons |
|---|---|
| Strongest performance isolation | Highest cost (underutilization) |
| No noisy neighbor | Ops overhead multiplies with tenants |
| Strongest compliance posture | Changes managed per tenant |
| Custom SLAs per tenant | Cold starts if truly dedicated |
Best for: external SaaS, highly regulated tenants, or contractual SLAs.
The Noisy Neighbor Problem: The Core Engineering Challenge
Multi-tenancy offers strong economics, but tenants can interfere with each other.
Monday 9 AM: Finance runs weekly P&L report
Query scans 3 months of events (500 GB)
Consumes most available BigQuery slots for 15 minutes
Marketing dashboard refresh slows from 30s to 10m
Root cause: unoptimized finance query during peak hours
Three mechanisms to solve noisy neighbor:
1. Compute Quotas (Hard Limits)
BigQuery slot reservations:
Finance: 500 slots reserved
Marketing: 200 slots reserved
Ad-hoc: 100 slots (shared)
Finance can’t steal marketing’s pool.
# Snowflake: separate virtual warehouses per team (isolated compute)
finance_wh = SnowflakeWarehouse(
name="finance_warehouse",
size="LARGE",
auto_suspend=300,
auto_resume=True,
)
marketing_wh = SnowflakeWarehouse(
name="marketing_warehouse",
size="MEDIUM",
auto_suspend=60,
)
2. Query-Level Throttling
@rate_limit(
queries_per_minute=50,
bytes_scanned_per_hour=1_000_000_000_000, # 1 TB/hour
concurrent_queries=10,
)
def execute_query(tenant_id: str, sql: str):
...
# Emergency kill switch: auto-cancel if > 10 TB scanned or > 30 min runtime.
3. Weighted Fair Queuing
Priority queues:
HIGH (SLA-critical pipelines)
MEDIUM (dashboards)
LOW (batch exports, exploration)
Tenant weighting:
Each tenant gets N tokens/hour
Each query costs tokens proportional to estimated compute
Out of tokens → queries go to LOW queue
Prevents permanent monopolization
Tenant Isolation in Data Lakes: The Multi-Layer Approach
For an internal lake serving many teams:
Storage isolation (S3/GCS):
s3://company-lake/raw/marketing/ → IAM: marketing-read role
s3://company-lake/raw/finance/ → IAM: finance-read role
s3://company-lake/gold/shared/ → readable by all teams
Compute isolation (Spark/EMR):
Per-team job queues + max CPU/mem per queue
Tags per job for accounting
Catalog isolation (Hive/Unity Catalog):
Catalog: marketing → only marketing can create tables
Catalog: finance → only finance can create tables
Query isolation (Trino):
Resource groups per team
Max concurrent: 20, max queued: 100
Soft mem: 100 GB, hard mem: 150 GB (kill query instead of OOM)
Cost Allocation and Chargeback: The Economics Layer
Multi-tenancy economics only work if you can measure and report what each tenant consumes. Otherwise heavy users free-ride.
Tag everything from day 1:
spark_conf = {
"spark.yarn.tags": f"team={team_id},project={project_id},env=prod",
"spark.databricks.clusterUsageTags.customTags": json.dumps({
"team": team_id,
"cost_center": cost_center_id,
"data_product": pipeline_name,
}),
}
query_config = bigquery.QueryJobConfig(
labels={"team": team_id, "pipeline": pipeline_name, "priority": "high"}
)
Monthly chargeback report:
SELECT
labels.team AS team,
SUM(total_bytes_processed) / 1e12 AS tb_scanned,
SUM(total_bytes_processed) / 1e12 * 5.0 AS bq_compute_cost_usd,
COUNT(*) AS query_count,
AVG(total_slot_ms) / 1000 AS avg_slot_seconds,
SUM(s3_storage_gb) AS storage_gb,
SUM(s3_storage_gb) * 0.023 AS storage_cost_usd,
SUM(databricks_dbu) AS databricks_dbus,
SUM(databricks_dbu) * 0.22 AS databricks_cost_usd,
SUM(total_bytes_processed) / 1e12 * 5.0 +
SUM(s3_storage_gb) * 0.023 +
SUM(databricks_dbu) * 0.22 AS total_cost_usd
FROM cloud_billing_export
WHERE EXTRACT(MONTH FROM usage_date) = 4
AND EXTRACT(YEAR FROM usage_date) = 2026
GROUP BY labels.team
ORDER BY total_cost_usd DESC;
Chargeback models:
| Model | Description | When to use |
|---|---|---|
| Show-back | Report usage per team, no billing | Early adoption |
| Soft chargeback | Cost visibility with manager review | Encourage responsibility without penalty |
| Hard chargeback | Deduct from team budget | Mature orgs with cost accountability |
| Budget alerts | Alerts at % of monthly budget | Proactive management |
Access Control Patterns: Defense in Depth
Access control must work at multiple layers simultaneously:
Layer 1: Authentication
Okta SSO, service account tokens
Layer 2: Authorization (coarse)
IAM roles, Unity Catalog perms, dataset-level IAM
Layer 3: Authorization (fine-grained)
Row-level security + column-level security
Layer 4: Data masking
Mask sensitive values even for some authorized users
Layer 5: Audit logging
Log every access for compliance review
Unity Catalog example:
spark.sql("""
CREATE OR REPLACE FUNCTION pii_mask.mask_email(email STRING)
RETURN CASE
WHEN is_member('pii-access-group') THEN email
ELSE CONCAT('***@', SPLIT(email, '@')[1])
END;
""")
spark.sql("""
ALTER TABLE gold.users
ALTER COLUMN email
SET MASK pii_mask.mask_email;
""")
Row-level filter example:
spark.sql("""
CREATE ROW FILTER sales_region_filter ON gold.sales_opportunities
AS (region) ->
is_member('global-sales') OR
array_contains(current_user_groups(), CONCAT('sales-', region));
""")
Tenant Onboarding Automation
Manual onboarding doesn’t scale; provisioning must be automated.
def onboard_new_tenant(team_id: str, config: TenantConfig):
# 1) Storage namespace + IAM
create_s3_prefix(f"s3://company-lake/{team_id}/")
apply_iam_policy(
f"s3://company-lake/{team_id}/",
principal=f"role/{team_id}-data-role",
)
# 2) Compute quota/policy
create_databricks_cluster_policy(
team=team_id,
max_instances=config.max_cluster_size,
allowed_instance_types=config.allowed_instances,
auto_terminate_minutes=60,
)
# 3) Catalog namespace
spark.sql(f"CREATE CATALOG IF NOT EXISTS {team_id}")
spark.sql(f"GRANT CREATE SCHEMA ON CATALOG {team_id} TO `{team_id}-data-role`")
# 4) Register in data catalog
datahub.create_domain(
name=team_id,
owner=config.team_lead_email,
description=config.team_description,
)
# 5) Budget alert
create_budget_alert(
team=team_id,
monthly_budget_usd=config.monthly_budget,
alert_at_pct=[50, 80, 100],
)
send_onboarding_email(team_id, config.team_lead_email)
Goal: a new team is provisioned with isolated storage/compute/catalog/access in minutes, not days.
Interview Questions
Q1: “Internal data platform for 100 teams. One team runs a 10-hour Spark job nightly that starves others. Fix architecturally.”
Model Answer: “Classic noisy neighbor. Three-layer fix. (1) Compute isolation: per-team resource pools/cluster policies so the job can’t consume more than its quota. (2) Scheduling + priority: SLA-critical jobs first; long batch jobs run in low priority during off-peak windows. (3) Cost visibility/chargeback: the team sees the true cost and is incentivized to optimize (partition pruning, skew fixes, predicate pushdown). Add guardrails: auto-kill jobs beyond duration limits unless explicitly approved.”
Q2: “Analyst accidentally queries another team’s sensitive customer data. How does the platform prevent it, and what should catch it?”
Model Answer: “Defense in depth. (1) Catalog visibility: restricted tables should be non-discoverable or clearly flagged. (2) Execution-time permissions: query should fail with permission denied via Unity Catalog/IAM. (3) Masking: even when cross-team access is valid, PII should be masked unless in privileged groups. (4) Audit logging: query history/audit logs trigger alerts when users access restricted tables.”
Think About This
Meta-style prompt: “Design a multi-tenant data platform for 1,000+ internal product teams with huge variance in size. How do you design isolation, quotas, and cost management?”
Walk through:
- Isolation model: separate schemas + compute quotas (balanced)
- Quota tiers by team size and criticality
- Cost tagging and chargeback + budget alerts
- Automated onboarding via portal + Terraform
- Emergency kill switch for platform-wide incidents
Quick Reference
- Isolation models: shared + RLS → separate schema per tenant → dedicated cluster per tenant
- Noisy neighbor fixes: quotas, throttling, weighted fair queuing
- Defense in depth: auth → IAM → RLS/CLS → masking → audit logs
- Cost allocation: tag everything from day 1 + monthly chargeback + alerts
- Tenant onboarding: automated provisioning for storage/compute/catalog/IAM/budget
Tomorrow’s Preview
Day 53: Data Security & Access Control — IAM, RBAC, ABAC, encryption, key management, network perimeters, and audit.