Phase 1: Foundations & Frameworks | Category: Distributed Systems
Why Partition Key Design Is the Highest-Leverage Decision
As Data Vidhya puts it: “A 10TB table with no partitioning. Every query scans every row. 45 minutes to answer ‘what happened yesterday.’ Add a single date partition key, and that same query returns in 30 seconds. A 90x improvement from one design decision.” At senior level, interviewers expect you to choose and justify a partitioning strategy based on query patterns, data volume, and write patterns — not just define what partitioning is. Staff-level candidates explain WHY they chose a strategy, not just WHAT it is.
Partitioning vs Sharding: Terminology
These terms are often used interchangeably, but there’s a distinction:
-
Partitioning: Dividing a single logical table into multiple physical segments within the same system. BigQuery partition by date, Hive partitions on S3, Iceberg partition specs.
-
Sharding: Distributing data across multiple independent database instances/nodes. Each shard is a separate database.
For data engineering interviews, you’ll discuss partitioning most often (warehouse/lakehouse table design). For system design interviews, sharding comes up when scaling databases beyond a single node.
The Three Partitioning Strategies
1. Range Partitioning
Data is divided into contiguous ranges of the partition key.
Partition 1: 2026-01-01 to 2026-01-31 Partition 2: 2026-02-01 to 2026-02-28 Partition 3: 2026-03-01 to 2026-03-31
Advantages:
-
Efficient range scans: “Give me all events from March” reads only one partition
-
Natural alignment with time-series data and batch pipeline patterns
-
Easy to understand and implement
-
Supports partition pruning in query engines (BigQuery, Spark, Hive)
Disadvantages:
-
Hot partitions: All current writes hit the latest partition. If today is March 31, all writes go to the March partition while Jan/Feb sit idle.
-
Uneven partition sizes if data distribution is skewed across ranges
Best for: Time-series data (the default for 80% of DE use cases), date-partitioned fact tables, log data, event streams.
Interview phrasing: “I’d partition this events table by event_date because our queries always filter by date range, daily partitions will be approximately 500MB each, and it supports our daily batch backfill pattern — I can overwrite a single day’s partition without touching the rest.”
2. Hash Partitioning
A hash function is applied to the partition key, distributing data evenly across partitions.
Partition = hash(user_id) % num_partitions user_id: 12345 → hash → partition 3 user_id: 67890 → hash → partition 7 user_id: 11111 → hash → partition 1
Advantages:
-
Even distribution of data and write load across partitions
-
Prevents hot partitions caused by skewed key values
-
Works well for point lookups by key
Disadvantages:
-
No range queries: “Give me all users with IDs 1000-2000” must scan all partitions because hash destroys ordering
-
Resharding pain: Changing from N to N+1 partitions (simple modulo) remaps almost every key. Requires massive data movement.
Best for: User-level data in application databases, DynamoDB tables, Cassandra tables, any scenario where even write distribution matters more than range scans.
3. Consistent Hashing
A specialized form of hash partitioning that minimizes data movement when nodes are added or removed.
Hash Ring: 0 ─────── Node A ─────── Node B ─────── Node C ─────── 2^32 Key "user:123" → hash → lands at position X → assigned to next node clockwise
How it works:
-
Both keys and nodes are mapped onto a circular hash space (ring)
-
Each key is assigned to the nearest node clockwise on the ring
-
Adding a node only affects keys between the new node and its predecessor
-
Virtual nodes (multiple positions per physical node) improve distribution
Key benefit: Adding/removing a node moves only ~1/N of the data (N = number of nodes), vs. nearly all data with simple modulo hashing.
Used by: DynamoDB, Cassandra, memcached, CDN routing, Kafka (internally for consumer group assignment).
Partition Key Design: The Framework
This 4-question framework works for any technology:
1. What does your query WHERE clause look like?
The partition key must match your most common filter. If 90% of queries filter by date, partition by date. If they filter by customer_id, partition by customer_id.
2. Does the key produce reasonably sized partitions?
Target 100 MB to 1 GB per partition file. Smaller = small file problem (metadata overhead, too many tasks). Larger = reduced pruning benefit.
3. Does the key support your pipeline’s write pattern?
If your pipeline processes daily batches, a date partition key lets you overwrite one day at a time. If you need to update a specific customer’s data, partitioning by customer_id makes that efficient.
4. Does the key have high cardinality and even distribution?
A boolean column (true/false) makes a terrible partition key — only 2 partitions. A timestamp to the millisecond creates millions of tiny partitions. Find the sweet spot.
Partition Key Design by Technology
| Technology | Partitioning Mechanism | Key Considerations |
|---|---|---|
| Big Query | Partition by date/timestamp column or integer range. Clustering on additional columns. | Partition = pruning unit. Queries without partition filter scan everything (and you pay for it). Always filter on partition column. |
| Snowflake | Automatic micro-partitioning. Clustering keys for co-locating related data. | You don’t explicitly choose partitions. You choose clustering keys that influence how micro-partitions are organized. |
| Iceberg/Delta Lake | Partition spec on any column(s). Hidden partitioning in Iceberg (partition transforms: year, month, day, hour, bucket, truncate). | Iceberg’s hidden partitioning means queries don’t need to know the partition layout — the engine prunes automatically. |
| Kafka | Partition by message key (hash). Number of partitions set at topic creation. | Partition count = max consumer parallelism. Partition key determines ordering scope. All messages with same key → same partition → ordered. |
| DynamoDB | Partition key (hash). Optional sort key for range queries within a partition. | Partition key determines data placement AND throughput distribution. Hot partition = throttled reads/writes on that partition. |
| Cassandra | Partition key (consistent hashing on the ring). Clustering columns for sort order within partition. | Large partitions (>100MB) cause GC pressure and repair issues. Design for partition sizes < 100MB. |
Hot Partitions: The Real-World Problem
A hot partition occurs when a disproportionate share of traffic hits a single partition. This is the most common partitioning failure in production.
Common causes:
-
Time-based partition + write-heavy workload: All current writes hit today’s partition
-
Celebrity/viral content: One user_id or content_id generates 1000x normal traffic
-
Null keys: All records with NULL partition key land in the same partition
-
Low cardinality key: Partitioning by country when 80% of traffic is from the US
Detection:
-
DynamoDB: CloudWatch ConsumedCapacity per partition
-
Kafka: Consumer lag on one partition while others are caught up
-
Spark: One task in a stage takes 100x longer than others
-
BigQuery: Slot utilization skewed to one partition
Mitigation strategies:
1. Salting / Write sharding:
Instead of: PK = user_id Use: PK = user_id + "#" + (hash(timestamp) % 10) Result: user_123#0, user_123#1, ... user_123#9 → Spreads one hot user across 10 partitions
Trade-off: Reads for a single user now require scatter-gather across 10 partitions.
2. Composite partition key:
PK = event_date + region → Instead of one "2026-03-31" partition, you get: 2026-03-31/us-east, 2026-03-31/us-west, 2026-03-31/eu-west
Trade-off: More partitions to manage. Queries spanning all regions can’t prune to a single partition.
3. Time-based bucketing:
PK = user_id + "#" + YYYYMM → Spreads historical data across monthly buckets → Each month's bucket is a manageable size
4. Isolate the hot key: Process the hot key in a dedicated pipeline/partition. Everything else goes through the normal path. This is the “VIP lane” pattern.
5. Caching: For read-hot partitions, put a cache (Redis, Memcached) in front. The hot partition serves cache misses only.
Cross-Partition Operations: The Cost You Must Acknowledge
Whenever you partition/shard data, some operations become expensive:
OperationWithin PartitionCross-PartitionPoint lookup by partition keyFast (single partition read)N/ARange scan on partition keyFast (read contiguous partitions)N/AFilter on non-partition columnScans entire partitionScans ALL partitionsJOIN between two tablesFast if co-partitioned on same keyExpensive shuffle/broadcastAggregation (COUNT, SUM)Fast per partitionRequires scatter-gather across all partitionsUPDATE by partition keyTouches one partitionN/AGlobal sort / ORDER BYPer-partition sort is fastFull sort requires all-partition shuffle
The critical interview point: “Every partitioning choice optimizes for one access pattern at the cost of others. If I partition by user_id, point lookups per user are fast, but ‘give me all events from yesterday’ requires scanning every user partition. If I partition by date, time-range queries are fast, but per-user lookups scan every date partition. The key is matching the partition strategy to the dominant access pattern.”
Co-partitioning for joins: If two tables are frequently joined (fact_orders + dim_customer on customer_id), partitioning both by customer_id eliminates the shuffle during the join. Spark, Flink, and Hive all support bucketed/co-partitioned joins.
Interview Questions
Q1: “You’re designing a DynamoDB table for storing user activity events. Users generate between 1 and 10,000 events per day. Some celebrity users generate 1M+ events per day. How do you choose the partition key?”
Model Answer: “The naive choice — partition key = user_id — creates hot partitions for celebrity users. DynamoDB throttles at the partition level, so one hot user could throttle reads/writes for events co-located on that partition. I’d use a composite approach: partition key = user_id#bucket where bucket = hash(event_id) % 10 for users exceeding a threshold. Normal users (< 10K events) use a single bucket (user_id#0). Celebrity users are automatically spread across 10 buckets. For reads, I’d use a scatter-gather pattern: query all 10 buckets for a celebrity user in parallel and merge results. The sort key would be the event timestamp for time-ordered retrieval within each bucket. I’d also put DynamoDB Accelerator (DAX) in front for read-hot celebrity profiles. The key principle: design the partition key for even write distribution first, then optimize reads with caching and parallel queries.”
Q2: “Your BigQuery analytics table has 5 TB of event data. Analysts complain queries are slow and expensive. How would you redesign the partitioning?”
Model Answer: “First, I’d check if the table is partitioned at all — a 5TB unpartitioned table means every query does a full scan. I’d partition by event_date since analytics queries almost always include a date range filter. This alone could reduce scan size by 30x if analysts typically query one month of a year’s data. Next, I’d add clustering on the columns that appear most frequently in WHERE and GROUP BY — likely user_segment, event_type, and region. BigQuery clustering physically sorts data within partitions, enabling further pruning. I’d also set a require_partition_filter table option to prevent accidental full-table scans. For cost control, I’d estimate: 5 TB / 365 days ≈ 14 GB per daily partition. At BigQuery’s $5/TB scan price, a query hitting one day costs $0.07 vs. $25 for a full table scan. That’s the ROI I’d present to justify the migration effort — rewrite the table with CREATE TABLE AS SELECT with partition and cluster specs.”
Think About This
You’re in a Meta interview. The prompt: “Design the storage layer for Facebook’s messaging system. Users send billions of messages per day. How would you partition the message data?”
Walk through:
-
What’s the primary access pattern? (User opens a conversation → fetch recent messages for that conversation. Access pattern = point lookup by conversation_id, sorted by timestamp.)
-
What’s the partition key? (conversation_id — all messages in a conversation live on the same partition for fast retrieval. Sort key = message_timestamp for chronological ordering.)
-
What about hot conversations? (A viral group chat with 10K members sending messages simultaneously. Shard the conversation: partition key = conversation_id#bucket, distribute writes across buckets.)
-
What about “search all my messages”? (This is a cross-partition query — doesn’t align with the conversation_id partition. Build a secondary index or a separate search index (Elasticsearch) partitioned by user_id.)
-
How does this scale? (Consistent hashing across database nodes. Adding a node redistributes ~1/N of conversations. Each conversation’s messages stay co-located for efficient reads.)
The key insight: there’s no single partition key that serves all access patterns. The primary partition (conversation_id) serves the dominant pattern (read conversation). Secondary access patterns (search by user, global analytics) require separate indexes or materialized views with different partition strategies.
Quick Reference
-
Range partitioning: Best for time-series data (80% of DE use cases). Efficient range scans. Risk: hot partitions on the latest range.
-
Hash partitioning: Even distribution, prevents hotspots. Cost: no range queries. Default for application databases and Kafka topics.
-
Consistent hashing: Hash partitioning that minimizes data movement when scaling. Used by DynamoDB, Cassandra, CDN routing.
-
Partition key framework: Match WHERE clause, target 100MB-1GB per partition, support your write/backfill pattern, ensure high cardinality + even distribution.
-
Hot partition fixes: salting, composite keys, time-bucketing, isolate hot keys, caching.
-
Always state your partition key in interviews: “I’ll partition by X because [query pattern], [partition size estimate], [write pattern support].” This 10-second statement signals physical data layout awareness.
-
Don’t shard too early: A well-tuned single database handles more than you think (64 TiB Postgres, 10K+ TPS). Shard only when you’ve exhausted vertical scaling.
Tomorrow’s Preview
Day 14: Replication & Fault Tolerance — Leader-follower vs multi-leader vs leaderless replication. Quorum reads/writes. Write-Ahead Log (WAL). How replication affects data pipeline design and SLAs — the other side of the distributed systems coin from today’s partitioning lesson.