Day 13 — Partitioning and sharding strategies

Phase 1: Foundations & Frameworks | Category: Distributed Systems

Why Partition Key Design Is the Highest-Leverage Decision

As Data Vidhya puts it: “A 10TB table with no partitioning. Every query scans every row. 45 minutes to answer ‘what happened yesterday.’ Add a single date partition key, and that same query returns in 30 seconds. A 90x improvement from one design decision.” At senior level, interviewers expect you to choose and justify a partitioning strategy based on query patterns, data volume, and write patterns — not just define what partitioning is. Staff-level candidates explain WHY they chose a strategy, not just WHAT it is.

Partitioning vs Sharding: Terminology

These terms are often used interchangeably, but there’s a distinction:

Partitioning: Dividing a single logical table into multiple physical segments within the same system. BigQuery partition by date, Hive partitions on S3, Iceberg partition specs.
Sharding: Distributing data across multiple independent database instances/nodes. Each shard is a separate database.

For data engineering interviews, you’ll discuss partitioning most often (warehouse/lakehouse table design). For system design interviews, sharding comes up when scaling databases beyond a single node.

The Three Partitioning Strategies

1. Range Partitioning

Data is divided into contiguous ranges of the partition key.

Partition 1:  2026-01-01 to 2026-01-31  Partition 2:  2026-02-01 to 2026-02-28  Partition 3:  2026-03-01 to 2026-03-31

Advantages:

Efficient range scans: “Give me all events from March” reads only one partition
Natural alignment with time-series data and batch pipeline patterns
Easy to understand and implement
Supports partition pruning in query engines (BigQuery, Spark, Hive)

Disadvantages:

Hot partitions: All current writes hit the latest partition. If today is March 31, all writes go to the March partition while Jan/Feb sit idle.
Uneven partition sizes if data distribution is skewed across ranges

Best for: Time-series data (the default for 80% of DE use cases), date-partitioned fact tables, log data, event streams.

Interview phrasing: “I’d partition this events table by event_date because our queries always filter by date range, daily partitions will be approximately 500MB each, and it supports our daily batch backfill pattern — I can overwrite a single day’s partition without touching the rest.”

2. Hash Partitioning

A hash function is applied to the partition key, distributing data evenly across partitions.

Partition = hash(user_id) % num_partitions  user_id: 12345 → hash → partition 3  user_id: 67890 → hash → partition 7  user_id: 11111 → hash → partition 1

Advantages:

Even distribution of data and write load across partitions
Prevents hot partitions caused by skewed key values
Works well for point lookups by key

Disadvantages:

No range queries: “Give me all users with IDs 1000-2000” must scan all partitions because hash destroys ordering
Resharding pain: Changing from N to N+1 partitions (simple modulo) remaps almost every key. Requires massive data movement.

Best for: User-level data in application databases, DynamoDB tables, Cassandra tables, any scenario where even write distribution matters more than range scans.

3. Consistent Hashing

A specialized form of hash partitioning that minimizes data movement when nodes are added or removed.

Hash Ring:      0 ─────── Node A ─────── Node B ─────── Node C ─────── 2^32  Key "user:123" → hash → lands at position X → assigned to next node clockwise

How it works:

Both keys and nodes are mapped onto a circular hash space (ring)
Each key is assigned to the nearest node clockwise on the ring
Adding a node only affects keys between the new node and its predecessor
Virtual nodes (multiple positions per physical node) improve distribution

Key benefit: Adding/removing a node moves only ~1/N of the data (N = number of nodes), vs. nearly all data with simple modulo hashing.

Used by: DynamoDB, Cassandra, memcached, CDN routing, Kafka (internally for consumer group assignment).

Partition Key Design: The Framework

This 4-question framework works for any technology:

1. What does your query WHERE clause look like?

The partition key must match your most common filter. If 90% of queries filter by date, partition by date. If they filter by customer_id, partition by customer_id.

2. Does the key produce reasonably sized partitions?

Target 100 MB to 1 GB per partition file. Smaller = small file problem (metadata overhead, too many tasks). Larger = reduced pruning benefit.

3. Does the key support your pipeline’s write pattern?

If your pipeline processes daily batches, a date partition key lets you overwrite one day at a time. If you need to update a specific customer’s data, partitioning by customer_id makes that efficient.

4. Does the key have high cardinality and even distribution?

A boolean column (true/false) makes a terrible partition key — only 2 partitions. A timestamp to the millisecond creates millions of tiny partitions. Find the sweet spot.

Partition Key Design by Technology

Technology	Partitioning Mechanism	Key Considerations
Big Query	Partition by date/timestamp column or integer range. Clustering on additional columns.	Partition = pruning unit. Queries without partition filter scan everything (and you pay for it). Always filter on partition column.
Snowflake	Automatic micro-partitioning. Clustering keys for co-locating related data.	You don’t explicitly choose partitions. You choose clustering keys that influence how micro-partitions are organized.
Iceberg/Delta Lake	Partition spec on any column(s). Hidden partitioning in Iceberg (partition transforms: year, month, day, hour, bucket, truncate).	Iceberg’s hidden partitioning means queries don’t need to know the partition layout — the engine prunes automatically.
Kafka	Partition by message key (hash). Number of partitions set at topic creation.	Partition count = max consumer parallelism. Partition key determines ordering scope. All messages with same key → same partition → ordered.
DynamoDB	Partition key (hash). Optional sort key for range queries within a partition.	Partition key determines data placement AND throughput distribution. Hot partition = throttled reads/writes on that partition.
Cassandra	Partition key (consistent hashing on the ring). Clustering columns for sort order within partition.	Large partitions (>100MB) cause GC pressure and repair issues. Design for partition sizes < 100MB.

Hot Partitions: The Real-World Problem

A hot partition occurs when a disproportionate share of traffic hits a single partition. This is the most common partitioning failure in production.

Common causes:

Time-based partition + write-heavy workload: All current writes hit today’s partition
Celebrity/viral content: One user_id or content_id generates 1000x normal traffic
Null keys: All records with NULL partition key land in the same partition
Low cardinality key: Partitioning by country when 80% of traffic is from the US

Detection:

DynamoDB: CloudWatch ConsumedCapacity per partition
Kafka: Consumer lag on one partition while others are caught up
Spark: One task in a stage takes 100x longer than others
BigQuery: Slot utilization skewed to one partition

Mitigation strategies:

1. Salting / Write sharding:

Instead of:  PK = user_id  Use:         PK = user_id + "#" + (hash(timestamp) % 10)  Result: user_123#0, user_123#1, ... user_123#9  → Spreads one hot user across 10 partitions

Trade-off: Reads for a single user now require scatter-gather across 10 partitions.

2. Composite partition key:

PK = event_date + region  → Instead of one "2026-03-31" partition, you get:     2026-03-31/us-east, 2026-03-31/us-west, 2026-03-31/eu-west

Trade-off: More partitions to manage. Queries spanning all regions can’t prune to a single partition.

3. Time-based bucketing:

PK = user_id + "#" + YYYYMM  → Spreads historical data across monthly buckets  → Each month's bucket is a manageable size

4. Isolate the hot key: Process the hot key in a dedicated pipeline/partition. Everything else goes through the normal path. This is the “VIP lane” pattern.

5. Caching: For read-hot partitions, put a cache (Redis, Memcached) in front. The hot partition serves cache misses only.

Cross-Partition Operations: The Cost You Must Acknowledge

Whenever you partition/shard data, some operations become expensive:

OperationWithin PartitionCross-PartitionPoint lookup by partition keyFast (single partition read)N/ARange scan on partition keyFast (read contiguous partitions)N/AFilter on non-partition columnScans entire partitionScans ALL partitionsJOIN between two tablesFast if co-partitioned on same keyExpensive shuffle/broadcastAggregation (COUNT, SUM)Fast per partitionRequires scatter-gather across all partitionsUPDATE by partition keyTouches one partitionN/AGlobal sort / ORDER BYPer-partition sort is fastFull sort requires all-partition shuffle

The critical interview point: “Every partitioning choice optimizes for one access pattern at the cost of others. If I partition by user_id, point lookups per user are fast, but ‘give me all events from yesterday’ requires scanning every user partition. If I partition by date, time-range queries are fast, but per-user lookups scan every date partition. The key is matching the partition strategy to the dominant access pattern.”

Co-partitioning for joins: If two tables are frequently joined (fact_orders + dim_customer on customer_id), partitioning both by customer_id eliminates the shuffle during the join. Spark, Flink, and Hive all support bucketed/co-partitioned joins.

Interview Questions

Q1: “You’re designing a DynamoDB table for storing user activity events. Users generate between 1 and 10,000 events per day. Some celebrity users generate 1M+ events per day. How do you choose the partition key?”

Model Answer: “The naive choice — partition key = user_id — creates hot partitions for celebrity users. DynamoDB throttles at the partition level, so one hot user could throttle reads/writes for events co-located on that partition. I’d use a composite approach: partition key = user_id#bucket where bucket = hash(event_id) % 10 for users exceeding a threshold. Normal users (< 10K events) use a single bucket (user_id#0). Celebrity users are automatically spread across 10 buckets. For reads, I’d use a scatter-gather pattern: query all 10 buckets for a celebrity user in parallel and merge results. The sort key would be the event timestamp for time-ordered retrieval within each bucket. I’d also put DynamoDB Accelerator (DAX) in front for read-hot celebrity profiles. The key principle: design the partition key for even write distribution first, then optimize reads with caching and parallel queries.”

Q2: “Your BigQuery analytics table has 5 TB of event data. Analysts complain queries are slow and expensive. How would you redesign the partitioning?”

Model Answer: “First, I’d check if the table is partitioned at all — a 5TB unpartitioned table means every query does a full scan. I’d partition by event_date since analytics queries almost always include a date range filter. This alone could reduce scan size by 30x if analysts typically query one month of a year’s data. Next, I’d add clustering on the columns that appear most frequently in WHERE and GROUP BY — likely user_segment, event_type, and region. BigQuery clustering physically sorts data within partitions, enabling further pruning. I’d also set a require_partition_filter table option to prevent accidental full-table scans. For cost control, I’d estimate: 5 TB / 365 days ≈ 14 GB per daily partition. At BigQuery’s $5/TB scan price, a query hitting one day costs $0.07 vs. $25 for a full table scan. That’s the ROI I’d present to justify the migration effort — rewrite the table with CREATE TABLE AS SELECT with partition and cluster specs.”

Think About This

You’re in a Meta interview. The prompt: “Design the storage layer for Facebook’s messaging system. Users send billions of messages per day. How would you partition the message data?”

Walk through:

What’s the primary access pattern? (User opens a conversation → fetch recent messages for that conversation. Access pattern = point lookup by conversation_id, sorted by timestamp.)
What’s the partition key? (conversation_id — all messages in a conversation live on the same partition for fast retrieval. Sort key = message_timestamp for chronological ordering.)
What about hot conversations? (A viral group chat with 10K members sending messages simultaneously. Shard the conversation: partition key = conversation_id#bucket, distribute writes across buckets.)
What about “search all my messages”? (This is a cross-partition query — doesn’t align with the conversation_id partition. Build a secondary index or a separate search index (Elasticsearch) partitioned by user_id.)
How does this scale? (Consistent hashing across database nodes. Adding a node redistributes ~1/N of conversations. Each conversation’s messages stay co-located for efficient reads.)

The key insight: there’s no single partition key that serves all access patterns. The primary partition (conversation_id) serves the dominant pattern (read conversation). Secondary access patterns (search by user, global analytics) require separate indexes or materialized views with different partition strategies.

Quick Reference

Range partitioning: Best for time-series data (80% of DE use cases). Efficient range scans. Risk: hot partitions on the latest range.
Hash partitioning: Even distribution, prevents hotspots. Cost: no range queries. Default for application databases and Kafka topics.
Consistent hashing: Hash partitioning that minimizes data movement when scaling. Used by DynamoDB, Cassandra, CDN routing.
Partition key framework: Match WHERE clause, target 100MB-1GB per partition, support your write/backfill pattern, ensure high cardinality + even distribution.
Hot partition fixes: salting, composite keys, time-bucketing, isolate hot keys, caching.
Always state your partition key in interviews: “I’ll partition by X because [query pattern], [partition size estimate], [write pattern support].” This 10-second statement signals physical data layout awareness.
Don’t shard too early: A well-tuned single database handles more than you think (64 TiB Postgres, 10K+ TPS). Shard only when you’ve exhausted vertical scaling.

Tomorrow’s Preview

Day 14: Replication & Fault Tolerance — Leader-follower vs multi-leader vs leaderless replication. Quorum reads/writes. Write-Ahead Log (WAL). How replication affects data pipeline design and SLAs — the other side of the distributed systems coin from today’s partitioning lesson.

Day 13: Partitioning and sharding strategies