Phase 2: Deep Dives | Category: Data Platform Design
Milestone: 50 Days In
You’ve covered the entire Phase 1 foundation (Days 1-30) and two-thirds of Phase 2’s deep dives. In 40 more days you’ll have completed 90 days of senior DE interview preparation. Today’s topic — data platform architecture — is one of the highest-signal topics for staff-level interviews. It tests whether you think at the individual pipeline level (mid-level) or at the organizational systems level (senior/staff).
The Problem: Why Central Data Teams Don’t Scale
The traditional model: one central data engineering team owns all pipelines, all datasets, all transformations. It breaks down predictably as the organization grows:
Central DE Team bottleneck:
Marketing: "We need a new dashboard" → 3-week queue
Sales: "We need a new metric" → 2-week queue
Product: "We need feature data" → 4-week queue
Finance: "Our numbers don't match" → nobody knows whose pipeline is wrong
Data quality issues:
5 different tables named "users" → which is authoritative?
"Revenue" defined 3 different ways across teams
Pipelines break and nobody knows which team owns them
By 2024-2025, companies reaching roughly 200-1,000 data consumers had hit this wall. Data Mesh and self-serve platforms emerged as the answer. Per Thoughtworks' 2026 assessment: “Data Mesh has evolved from industry hype into a mature socio-technical paradigm.”
Data Mesh: The Four Principles
The term was coined by Zhamak Dehghani at ThoughtWorks in 2019.
Principle 1: Domain-Oriented Decentralized Data Ownership
Ownership of data moves from a central team to the business domains that CREATE and UNDERSTAND the data.
Before (centralized):
Marketing → Central DE → "marketing_data" warehouse table → Analysts
After (domain-owned):
Marketing domain team → owns "marketing_campaigns" data product
Sales domain team → owns "sales_pipeline" data product
Product domain team → owns "user_engagement" data product
Each domain:
- Builds and operates its own data pipelines
- Defines its own schemas and transformations
- Ensures its own data quality SLAs
- Publishes data products that other domains can consume
Why domain ownership improves quality: the domain team understands the edge cases and semantics far better than a central DE team without that context.
What this means for the platform team: the central DE team shifts from building all pipelines to building the infrastructure that lets domain teams build their own pipelines safely.
Principle 2: Data as a Product
Domain teams treat their data as a product — with the same rigor as a software product — not as a side effect of their operational systems.
What a data product must have (the “data product” contract):
# data_product.yaml: the interface contract for a data product
name: "marketing_campaign_performance"
version: "2.1.0"
owner: "marketing-data@company.com"
domain: "marketing"
schema:
  - name: campaign_id
    type: STRING
    description: "Unique identifier for marketing campaign"
  - name: impressions
    type: INT64
    description: "Total ad impressions in the measurement window"
  - name: attributed_revenue
    type: FLOAT64
    description: "Revenue attributed via last-touch model"
sla:
  freshness: "daily by 8:00 AM PT"
  availability: "99.5%"
  completeness: "> 99% of campaign rows"
access:
  discovery: public   # anyone can see this product exists
  data: role_based    # requires "marketing-reader" role
  pii: none           # no PII in this product
output_ports:
  - type: BigQuery
    location: "project.marketing.campaign_performance"
  - type: API
    endpoint: "https://data.company.com/api/marketing/campaigns"
Data products are NOT:
- Raw database dumps (no schema, no quality guarantee)
- Internal intermediate tables (unstable, undocumented)
- Dashboards (analytical products, not data products)
- Ad-hoc query results (not versioned, not managed)
Data products ARE:
- Stable, versioned, documented
- Quality-guaranteed with explicit SLAs
- Discoverable (in a catalog)
- Owned by a named team
- Designed for consumption by other teams
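One way to make the contract above enforceable is a small check that runs in CI before a data product is published. The sketch below is a minimal illustration, not a platform API: the function name `validate_contract` is hypothetical, the required field list is taken from the example contract, and it assumes the YAML has already been parsed into a plain dictionary.

```python
# Hypothetical contract validator; assumes data_product.yaml is already parsed to a dict.
REQUIRED_FIELDS = (
    "name", "version", "owner", "domain",
    "schema", "sla", "access", "output_ports",
)


def validate_contract(contract: dict) -> list:
    """Return a list of problems; an empty list means the contract passes."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in contract]

    # The version must look like MAJOR.MINOR.PATCH so consumers can reason about breakage.
    if "version" in contract:
        version = str(contract["version"])
        parts = version.split(".")
        if len(parts) != 3 or not all(p.isdigit() for p in parts):
            problems.append(f"version is not MAJOR.MINOR.PATCH: {version!r}")

    # Every published column needs a name, a type, and a description.
    for col in contract.get("schema", []):
        for key in ("name", "type", "description"):
            if not col.get(key):
                problems.append(f"column {col.get('name', '?')!r} lacks {key}")

    return problems


example = {
    "name": "marketing_campaign_performance",
    "version": "2.1.0",
    "owner": "marketing-data@company.com",
    "domain": "marketing",
    "schema": [
        {"name": "campaign_id", "type": "STRING",
         "description": "Unique identifier for marketing campaign"},
    ],
    "sla": {"freshness": "daily by 8:00 AM PT"},
    "access": {"discovery": "public", "data": "role_based", "pii": "none"},
    "output_ports": [{"type": "BigQuery",
                      "location": "project.marketing.campaign_performance"}],
}

if __name__ == "__main__":
    print(validate_contract(example) or "contract OK")
```

Returning a list of problems rather than raising on the first one lets a CI job report every gap in a single run, which keeps the feedback loop short for the domain team.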
Principle 3: Self-Serve Data Platform
The central platform team provides infrastructure that makes it easy for domain teams to build, deploy, operate, and consume data products — without needing to be infrastructure experts.
What the self-serve platform provides:
| Capability | What it does | Platform component |
|---|---|---|
| Storage | Object storage + table format | S3 + Iceberg, or BigQuery |
| Compute | Managed Spark/SQL compute | Databricks, BigQuery, EMR |
| Ingestion | Connectors to source systems | Fivetran, Airbyte, internal CDC |
| Orchestration | Pipeline scheduling and monitoring | Airflow, Dagster, managed templates |
| Data catalog | Discovery and documentation | DataHub, Amundsen, Unity Catalog |
| Access control | Row/column security | Unity Catalog RBAC, Okta integration |
| Quality checks | Standardized tests and anomaly detection | Great Expectations templates, Monte Carlo |
| CI/CD | Pipeline deployment automation | GitHub Actions templates, Terraform |
| Observability | Pipeline monitoring templates | Pre-built Grafana dashboards |
| Cost tracking | Per-team compute and storage | Cloud billing tags, chargeback reports |
The platform team’s product mindset (per Thoughtworks 2026): “The platform team must treat its internal platform as a product. This means running user research with domain teams, having a public roadmap, managing its own services with SLOs and relentlessly prioritizing features that reduce friction. The goal is to provide a streamlined developer experience that makes the ‘right way’ (using the platform) the ‘easy way’.”
Practical example — the “golden path” for a domain team:
# A marketing data engineer creates a new data product using platform templates
# Step 1: Scaffold from template (platform-provided)
data-platform new-product \
  --name "campaign_performance" \
  --domain "marketing" \
  --output-store "bigquery" \
  --quality-checks "standard"
# This creates:
# - dbt model with standard tests pre-configured
# - Airflow DAG template with SLA monitoring
# - data_product.yaml contract file
# - GitHub Actions CI/CD pipeline
# - DataHub registration automatically on PR merge
# Step 2: Domain team writes their transformation logic (the "last mile")
# dbt/models/marketing/campaign_performance.sql
SELECT
  campaign_id,
  SUM(impressions) AS impressions,
  SUM(attributed_revenue) AS attributed_revenue
FROM {{ ref('raw_ad_events') }}
GROUP BY campaign_id
# Step 3: PR review → tests run automatically → deploy to staging → approve → prod
git push origin feature/campaign-performance
# CI/CD runs dbt test, quality checks, publishes to DataHub catalog on merge
Principle 4: Federated Computational Governance
Global rules enforced automatically by the platform, with domain autonomy for local decisions.
GLOBAL rules (platform enforces automatically):
- All data products must have an owner registered in the catalog
- All PII columns must be tagged with data classification
- All production data products must have freshness SLAs
- Cross-domain joins require data access agreements
- All data at rest must be encrypted
- Audit log required for all PII column access
DOMAIN rules (domain teams decide locally):
- Which metrics to expose
- What transformation logic to use
- Which columns to include
- Refresh frequency within SLA bounds
- Internal naming conventions
The key insight: governance should not mean “a committee reviews every pipeline change.” It means “the platform automatically enforces the rules, so domain teams can move fast without requiring central approval.”
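"Computational" governance can be pictured as a set of small rule functions that the platform runs on every deploy. The sketch below is a hedged illustration of three of the global rules listed above (owner, PII classification, freshness SLA). The field names `pii`, `classification`, and `environment` are assumptions made for this example and are not part of the contract format shown earlier.

```python
from typing import Callable, List, Optional

# A rule inspects a data product description and returns a violation message or None.
Rule = Callable[[dict], Optional[str]]


def has_owner(product: dict) -> Optional[str]:
    return None if product.get("owner") else "no owner registered"


def pii_columns_classified(product: dict) -> Optional[str]:
    # Assumed convention: PII columns carry pii=True and a classification tag.
    untagged = [
        col["name"]
        for col in product.get("schema", [])
        if col.get("pii") and not col.get("classification")
    ]
    return f"PII columns without classification: {untagged}" if untagged else None


def production_has_freshness_sla(product: dict) -> Optional[str]:
    is_prod = product.get("environment") == "production"
    if is_prod and not product.get("sla", {}).get("freshness"):
        return "production product has no freshness SLA"
    return None


GLOBAL_RULES: List[Rule] = [
    has_owner,
    pii_columns_classified,
    production_has_freshness_sla,
]


def check_governance(product: dict) -> List[str]:
    """Run every global rule; an empty result means the deploy may proceed."""
    violations = []
    for rule in GLOBAL_RULES:
        message = rule(product)
        if message is not None:
            violations.append(message)
    return violations


if __name__ == "__main__":
    draft = {"environment": "production",
             "schema": [{"name": "email", "pii": True}]}
    for violation in check_governance(draft):
        print("BLOCKED:", violation)
```

Because the rules are plain functions in a list, the platform team can add a new global rule without touching domain code, and domain teams get the same answer locally as they would in CI.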
The 2026 Reality: Data Mesh-ish
Many teams adopt data-as-a-product principles within specific domains and prioritize data contracts for essential pipelines — without a full organizational transformation.
What actually works in 2026 (the pragmatic hybrid):
CENTRALIZED (platform team owns):
- Storage infrastructure (S3, GCS, BigQuery)
- Compute management (Databricks, EMR, BigQuery slots)
- Identity and access control (Okta, Unity Catalog)
- Data catalog (DataHub, Amundsen)
- CI/CD pipeline templates
- Quality monitoring infrastructure
- Cost tracking and chargeback
DECENTRALIZED (domain teams own):
- Their own data products (pipelines, schemas, quality)
- Their own transformation logic
- Their own SLA definitions (within platform constraints)
- Their own metrics definitions
THE BOUNDARY (platform enforces via "golden path"):
- Data product contract format (standardized YAML)
- Required quality checks (domain can add more, but minimum is mandatory)
- Catalog registration (automatic on deploy)
- Access control pattern (standard RBAC on top of platform primitives)
Multi-Tenancy: The Engineering Core of Self-Serve
The technical foundation of self-serve platforms is multi-tenancy: multiple teams (tenants) sharing the same infrastructure with proper isolation.
1. Compute isolation
Databricks:
  Marketing team → dedicated SQL warehouse (their own cluster)
  Finance team → dedicated SQL warehouse (budget for their workloads)
  Data science team → dedicated compute cluster (GPU-enabled)
BigQuery:
  Marketing team → dedicated slot reservation
  Finance team → dedicated slot reservation
  Ad-hoc analysts → on-demand slots (shared pool)
2. Storage isolation
Unity Catalog hierarchy:
  Catalog: marketing
    Schema: raw_data
    Schema: silver_data
    Schema: gold_products
  Catalog: finance
    Schema: raw_data
    Schema: silver_data
    Schema: gold_products
Row-level security:
Sales team reads orders table → WHERE region = 'their_region'
Finance reads orders table → no restriction (all rows)
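The row-level rule above can be sketched as a per-role predicate with deny-by-default behavior. In practice the query engine enforces this (for example through row filters or row-level access policies), so the code below is only a conceptual model. The role names and the `read_orders` helper are hypothetical.

```python
# Conceptual model of row-level security: each role maps to a row predicate.
# Role names are hypothetical; a real platform would enforce this in the engine.
ROW_POLICIES = {
    "sales-emea": lambda row: row["region"] == "EMEA",
    "sales-amer": lambda row: row["region"] == "AMER",
    "finance": lambda row: True,  # no restriction: finance sees all rows
}


def read_orders(rows, role):
    """Return only the rows the role may see; unknown roles are denied."""
    policy = ROW_POLICIES.get(role)
    if policy is None:
        raise PermissionError(f"role {role!r} has no access to orders")
    return [row for row in rows if policy(row)]


orders = [
    {"order_id": 1, "region": "EMEA", "amount": 120.0},
    {"order_id": 2, "region": "AMER", "amount": 80.0},
]

if __name__ == "__main__":
    print(read_orders(orders, "sales-emea"))
    print(len(read_orders(orders, "finance")))
```

The important design choice is the default: a role with no policy gets an error rather than the full table, so forgetting to register a policy fails closed.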
3. Cost isolation and chargeback
-- Unit rates below are illustrative; substitute your contracted prices.
SELECT
  team,
  SUM(bytes_processed_gb) AS gb_scanned,
  SUM(compute_dbu) AS databricks_dbus,
  SUM(storage_gb_months) AS storage_gb,
  SUM(bytes_processed_gb) * 0.005 +
    SUM(compute_dbu) * 0.22 +
    SUM(storage_gb_months) * 0.023 AS total_cost_usd
FROM cloud_cost_attribution
WHERE month = '2026-04'
GROUP BY team
ORDER BY total_cost_usd DESC;
Anti-Patterns to Name in Interviews
These show production experience — not just theoretical knowledge:
Anti-pattern 1: Data mesh for a team of 10 (over-engineering)
As a rule of thumb, data mesh pays off in organizations with roughly 50+ data producers and 500+ data consumers. For smaller orgs, a well-modeled warehouse with good dbt practices beats a premature mesh.
Anti-pattern 2: Domain ownership without platform support (chaos)
Decentralizing responsibility without providing infrastructure support creates 50 domain teams each building their own orchestration, quality framework, and CI/CD. Massive duplication. The platform must provide the plumbing.
Anti-pattern 3: Dashboards as data products
Dashboards are products, but they don’t qualify as data products. A data product is a dataset with a stable interface, versioned schema, and SLA. A dashboard built on top is an analytics product.
Anti-pattern 4: Governance by committee
If every schema change requires a steering committee approval, teams route around governance. Governance must be computational (enforced by the platform), not human-in-the-loop.
Interview Questions
Q1: “Your company has grown from 5 data engineers to 50. You have 20 different product teams all waiting on the central data team. How do you redesign the data platform?”
Model Answer: “This is the scaling problem that data mesh was designed to address. I’d approach it in three phases. Phase 1 (3 months): build the platform foundations — standardized storage, managed compute, CI/CD templates, and a data catalog. Create the golden path so a domain team can create a new data product in an afternoon. Define the data product contract format (owner, schema, SLA, access policy). Phase 2 (6 months): identify 3-5 high-impact domains and migrate their highest-value pipelines from central to domain ownership as a partnership (platform provides infrastructure/templates; domain provides business knowledge + ownership). Phase 3 (ongoing): treat the platform as a product — surveys, a public roadmap, SLOs, chargeback, and computational governance enforced automatically.”
Q2: “A domain team deploys a data product that breaks three downstream consumers. How does your platform prevent this, and how do you handle it when it happens anyway?”
Model Answer: “Prevention is architectural. First, data contracts enforced in CI/CD: schema compatibility checks block backward-incompatible changes unless a major version bump and a migration path are provided. Second, dependency tracking in the catalog: lineage makes impact analysis automatic. Third, semantic versioning for data products: breaking changes require a major version and an explicit consumer migration; additive changes are safe. Fourth, canary/parallel deploys for major versions: keep v1 and v2 available side by side for 30 days. When it happens anyway, lineage identifies affected consumers in minutes and the platform can roll back (for example, an Iceberg snapshot rollback to the pre-change state).”
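The schema compatibility check and the semantic versioning rule from this answer can be combined into one CI gate. The sketch below is a simplified illustration under stated assumptions: schemas are reduced to a column-name-to-type mapping, any removed or retyped column counts as breaking, and the function names `required_bump` and `ci_gate` are hypothetical.

```python
def required_bump(old: dict, new: dict) -> str:
    """Compare two {column: type} schemas and return 'major', 'minor', or 'none'."""
    removed = old.keys() - new.keys()
    retyped = {col for col in old.keys() & new.keys() if old[col] != new[col]}
    if removed or retyped:
        return "major"  # consumers reading those columns would break
    if new.keys() - old.keys():
        return "minor"  # additive change: existing queries keep working
    return "none"


def ci_gate(old_version: str, new_version: str, old: dict, new: dict) -> bool:
    """Return True when the declared version bump covers the schema change."""
    needed = required_bump(old, new)
    old_major, old_minor = (int(p) for p in old_version.split(".")[:2])
    new_major, new_minor = (int(p) for p in new_version.split(".")[:2])
    if needed == "major":
        return new_major > old_major
    if needed == "minor":
        return (new_major, new_minor) > (old_major, old_minor)
    return True


current = {"campaign_id": "STRING", "impressions": "INT64"}

if __name__ == "__main__":
    dropped = {"campaign_id": "STRING"}
    # Dropping a column under a minor bump should be rejected by the gate.
    print(ci_gate("2.1.0", "2.2.0", current, dropped))
```

A real check would also need rules for nullability, nested types, and semantic changes that keep the same type, which a name-and-type comparison cannot detect.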
Think About This
You’re in a Meta interview. The prompt: “Meta has 1,000+ data engineers across 100+ product teams. The central data platform team has become a bottleneck — 6-week queues for new data products. Design a self-serve data platform that lets Meta scale to 10x more data products without adding central team headcount.”
Walk through:
- What does the central platform provide?
- What do domain teams own?
- How does governance scale computationally (not with committees)?
- How does the platform team measure success?
- What’s the biggest risk and how do you mitigate it?
Quick Reference
- Data mesh = 4 principles: domain ownership, data as a product, self-serve platform, federated computational governance
- Domain ownership: responsibility lives where knowledge lives
- Data product contract: name, version, owner, schema, SLA, access policy, output ports
- Self-serve platform: central team provides plumbing; domain teams provide last-mile logic
- Federated governance: global rules enforced automatically by the platform (PII tagging, compatibility, catalog)
- Multi-tenancy: compute isolation, storage isolation, cost isolation/chargeback
- 2026 reality: data mesh-ish (hub-and-spoke + strong platform + contracts) beats pure mesh
Tomorrow’s Preview
Day 51: Data Mesh vs Data Fabric — Organizational patterns for data at scale: data mesh (decentralized, domain-driven) vs data fabric (unified, metadata-driven). Pros, cons, and when each fits.