Phase 1: Foundations & Frameworks | Category: Storage Systems

The Senior-Level Trap to Avoid

The biggest mistake candidates make at your level: “I’ll use NoSQL because it’s scalable.” As System Design Nuggets warns: “Candidates say ‘I’ll use a NoSQL database because it’s scalable’ without specifying the access pattern. DynamoDB scales beautifully for key-value lookups but forces you to model queries at table-design time. If you need ad-hoc joins, a relational database behind read replicas may scale just as well — and be far simpler.” And Hello Interview is blunt: “Most of the time, the right answer is a relational database. It’s the default unless your requirements clearly signal a specialized model.”

The pro move: tie your database choice to the specific workload — access pattern, read/write ratio, consistency requirement, and scale.

The Five Database Categories

1. Relational (SQL)

Systems: PostgreSQL, MySQL, Aurora, Spanner, CockroachDB, AlloyDB

Data model: Tables with rows and columns. Fixed schema. Relationships via foreign keys and JOINs.

Strengths:

  • ACID transactions — atomicity, consistency, isolation, durability

  • Complex queries: JOINs, aggregations, window functions, subqueries

  • Strong consistency by default

  • Mature ecosystem: decades of tooling, optimization, expertise

  • SQL is universal — skills transfer across all relational databases

Limitations:

  • Horizontal scaling is harder (sharding requires application-level routing)

  • Schema changes require migrations (though modern tools like Alembic/Flyway make this manageable)

  • Vertical scaling has a ceiling (but that ceiling is now 64 TiB / 20K TPS per instance — higher than most people think)

When to choose:

  • Complex relationships between entities (users, orders, products, payments)

  • ACID transactions are required (financial systems, inventory, billing)

  • Ad-hoc analytical queries and reporting

  • Default choice unless a specific requirement rules it out

Data engineering relevance:

  • Source systems for CDC pipelines (Debezium reads PostgreSQL WAL)

  • Metadata stores for orchestration (Airflow metadata DB)

  • Small-to-medium serving layers for gold data marts

  • Feature stores with complex joins (PostgreSQL + pgvector for embeddings)

2. Document Store

Systems: MongoDB, Couchbase, Firestore, DocumentDB

Data model: JSON/BSON documents. Flexible schema. Nested objects and arrays within a single document.

Strengths:

  • Schema flexibility — different documents can have different fields

  • Natural fit for hierarchical/nested data (user profiles with variable attributes)

  • Good horizontal scaling via sharding

  • Rich query language (MongoDB’s aggregation pipeline)

  • ACID transactions within and across documents (MongoDB 4.0+)

Limitations:

  • JOINs across collections are expensive (designed for denormalized data)

  • Schema flexibility can become schema chaos without discipline

  • Less efficient for highly relational data

When to choose:

  • Rapidly evolving schemas (early-stage products, content management)

  • Hierarchical data that would require many JOINs in SQL

  • Different records have vastly different structures (product catalogs with varying attributes)

Data engineering relevance:

  • Source system for CDC (MongoDB change streams → Kafka)

  • Application data stores that feed into your data lake

  • Rarely the destination for analytics pipelines (warehouses are better for that)

3. Key-Value Store

Systems: Redis, Memcached, DynamoDB, Riak

Data model: Simple key → value pairs. Value can be a string, JSON, binary blob, or complex data structure (Redis supports lists, sets, sorted sets, hashes).

Strengths:

  • Blazing fast: sub-millisecond latency (Redis: 100K+ ops/sec per instance)

  • Simple API: GET, SET, DELETE

  • Horizontal scaling via hash partitioning

  • DynamoDB: single-digit millisecond latency at any scale, fully managed

Limitations:

  • No JOINs, no complex queries, no ad-hoc analytics

  • Must design your data model around access patterns upfront (especially DynamoDB)

  • Limited query flexibility — if you didn’t plan for a query, you can’t run it

When to choose:

  • Caching layer (Redis/Memcached in front of a primary database)

  • Session storage, shopping carts, leaderboards

  • High-throughput, low-latency serving with simple access patterns

  • DynamoDB: serverless applications, predictable performance at any scale, AWS-native

Data engineering relevance:

  • Redis as a real-time feature store (online feature serving at < 5ms)

  • DynamoDB as a serving layer for pre-computed pipeline outputs

  • Caching layer between warehouse and BI tools for hot queries

  • Rate limiting and deduplication stores in streaming pipelines

4. Wide-Column Store

Systems: Cassandra, HBase, Bigtable, ScyllaDB

Data model: Rows organized by partition key, with columns grouped into column families. Different rows can have different columns. Data sorted by clustering key within a partition.

Strengths:

  • Massive write throughput (Cassandra: 1M+ writes/sec per cluster)

  • Linear horizontal scaling — add nodes to increase capacity

  • Multi-datacenter replication built-in (Cassandra)

  • Excellent for time-series and append-heavy workloads

  • Tunable consistency (Cassandra: ONE, QUORUM, ALL per query)

Limitations:

  • No JOINs, limited ad-hoc queries

  • Must model around query patterns at design time (like DynamoDB but more flexible)

  • Operational complexity (Cassandra: JVM tuning, compaction, repair)

  • Reads can be slower than writes (LSM tree structure)

When to choose:

  • Massive write volumes: IoT sensor data, event logging, telemetry, metrics

  • Time-series data with high cardinality

  • Multi-region active-active deployments

  • Apple runs 100,000+ Cassandra nodes for iMessage and Siri

Data engineering relevance:

  • Cassandra/HBase as the real-time serving layer for streaming pipeline outputs

  • Bigtable as the serving layer for Google-scale analytics

  • Time-series metrics storage for pipeline monitoring

  • Netflix uses Cassandra for real-time data that feeds recommendations

5. Graph Database

Systems: Neo4j, Amazon Neptune, TigerGraph, JanusGraph

Data model: Nodes (entities) and edges (relationships). Properties on both nodes and edges. Query by traversing relationships.

Strengths:

  • Relationship-centric queries are fast (friends-of-friends, shortest path, recommendation graphs)

  • Natural model for social networks, knowledge graphs, fraud detection

  • Query languages designed for graph traversal (Cypher, Gremlin)

Limitations:

  • Operational complexity and smaller ecosystem

  • Doesn’t scale as well for non-graph workloads

  • Most “graph problems” can be solved with SQL at moderate scale

When to choose: Rarely in interviews. Hello Interview notes: “Graph databases are a common mistake in interviews. Even LinkedIn and Twitter use SQL for their core relationship data.” Only propose if the problem is fundamentally graph-shaped (fraud rings, knowledge graphs, complex recommendation traversals).

Data engineering relevance:

  • Knowledge graph construction pipelines

  • Entity resolution and identity graphs

  • Netflix recently built a real-time distributed graph for content metadata relationships

The Decision Framework

Use this flowchart in interviews. Start with the access pattern, not the database:

What's the primary access pattern?  │
├─ Complex queries with JOINs, aggregations, ad-hoc analytics?  │
└─→ Relational (PostgreSQL, MySQL, Spanner)  │
├─ Simple key-based lookups at massive scale with < 10ms latency?  │
└─→ Key-Value (DynamoDB, Redis)  │
├─ Massive write throughput, time-series, append-heavy?  │
└─→ Wide-Column (Cassandra, HBase, Bigtable)  │
├─ Hierarchical/nested documents with variable schema?  │
└─→ Document (MongoDB)  │
├─ Relationship traversals (friends-of-friends, shortest path)?  │
└─→ Graph (Neo4j, Neptune) — but only if truly graph-shaped  │
└─ Not sure / moderate scale / multiple access patterns?
└─→ Start with PostgreSQL. Add specialized stores for specific hot paths.

The multi-database reality: Production systems almost always use multiple databases. The right question isn’t “SQL or NoSQL?” — it’s “which database for which access pattern?”

The Polyglot Persistence Pattern

Real systems at your target companies use multiple storage technologies:

┌──────────────────────────────┐
│     Application Layer         │
└──────────────┬───────────────┘

    ┌──────────┼──────────┬──────────────┬──────────────────┐
    ▼          ▼          ▼              ▼                  ▼
┌──────────┐ ┌──────────┐ ┌──────────────┐ ┌──────────────┐
│PostgreSQL│ │  Redis   │ │  Cassandra   │ │ Elasticsearch│
│(source   │ │(cache +  │ │(real-time    │ │(full-text    │
│ of truth)│ │ features)│ │ serving)     │ │ search)      │
└────┬─────┘ └──────────┘ └──────┬───────┘ └──────────────┘
     │ CDC                       │
     └─────────────┬─────────────┘

            ┌──────────┐
            │  Kafka   │──→ Flink ────────→ Write to serving stores
            └────┬─────┘

            ┌──────────────┐
            │ S3 / Iceberg │──→ dbt ──→ BigQuery/Snowflake (analytics)
            └──────────────┘

Interview phrasing: “I’d use PostgreSQL as the transactional source of truth with ACID guarantees. CDC via Debezium streams changes to Kafka. From Kafka, Flink writes to Cassandra for high-throughput real-time serving and to Elasticsearch for full-text search. Redis serves as a cache and online feature store for sub-millisecond lookups. The same Kafka stream also lands in Iceberg for batch analytics via dbt into BigQuery. Each store is optimized for its specific access pattern.”

Storage Selection for Data Engineering Pipelines

TopicDetails
Raw/Bronze layerS3/GCS + Iceberg/Delta (object storage) Cheap, durable, schema-on-read, supports any file format
Silver/Gold warehouseBig Query, Snowflake, Redshift, Databricks MPP columnar, SQL-native, BI-tool friendly, push-down compute
Real-time servingRedis, DynamoDB, Cassandra Low-latency, high-throughput, key-based access
Real-time OLAPClick House, Druid, Pinot, Star Rocks Sub-second aggregations on streaming data, columnar
Feature store (online)Redis, DynamoDB< 10ms feature lookup at model inference time
Feature store (offline)Iceberg/Delta on S3, warehouse Batch feature computation, point-in-time correctness
SearchElasticsearch, Open Search Full-text search, fuzzy matching, faceted navigation
Metadata / catalogPostgre SQL, MySQLRelational metadata for Airflow, Hive Metastore, data catalogs
Vector / embeddingspgvector, Pinecone, Weaviate, MilvusANN search for RAG, semantic search, recommendations

Interview Questions

Q1: “You’re designing the data infrastructure for a social media platform. How do you choose your databases?”

Model Answer: “I’d use polyglot persistence, matching each access pattern to the right store. PostgreSQL for the core user accounts and relationships — ACID transactions for signups, follows, and payments. The social graph (who follows whom) lives in PostgreSQL too; at moderate scale, a well-indexed adjacency table with SQL outperforms a graph database operationally. Redis for caching hot user profiles and as a session store — sub-millisecond reads for the profile hover card that fires millions of times per day. For the activity feed, I’d use Cassandra — it’s write-optimized for the fan-out-on-write pattern where each post generates writes to followers’ feeds. Elasticsearch for the search feature — full-text search over posts and user profiles. S3 + Iceberg as the data lake for analytics, with CDC from PostgreSQL and Cassandra flowing through Kafka. The key principle: the user-facing product uses specialized databases per access pattern; the analytics platform unifies everything in the lakehouse.”

Q2: “When would you choose DynamoDB over PostgreSQL?”

Model Answer: “Three conditions must all be true. First, the access pattern is simple and fully known upfront — key-value lookups, maybe with a sort key for range queries within a partition. No ad-hoc JOINs, no complex aggregations. Second, I need predictable single-digit millisecond latency at any scale with zero ops overhead — DynamoDB auto-scales and is fully managed. Third, I’m in the AWS ecosystem and the cost model (pay-per-request or provisioned capacity) makes sense for the workload. A concrete example: a serving layer that holds pre-computed recommendation results. The pipeline writes user_id → [recommended_items] to DynamoDB, and the application reads by user_id at < 5ms. No JOINs, no analytics, just fast key-based lookup. For anything requiring ad-hoc queries, JOINs, or complex transactions, I’d stay with PostgreSQL — it handles far more than people assume, up to 64 TiB and 20K TPS on a single instance.”

Think About This

You’re in an OpenAI interview. The prompt: “Design the storage layer for ChatGPT’s conversation history. Users can view past conversations, search within them, and we use conversation data for model evaluation and fine-tuning.”

Walk through:

  1. What are the distinct access patterns? (User views a specific conversation by ID; user lists all their conversations; user searches within conversations; batch pipeline reads all conversations for model training)

  2. What database for each? (Conversation retrieval by ID → DynamoDB or MongoDB with conversation_id as key. Conversation listing per user → same store with user_id as partition key, sorted by timestamp. Full-text search → Elasticsearch index over conversation content. Batch model training → conversations replicated to S3/Iceberg for Spark jobs.)

  3. How do these stores stay in sync? (Write to the primary store, CDC to Kafka, Kafka feeds Elasticsearch and the data lake. Single write path, multiple read-optimized views.)

  4. Could you use just PostgreSQL? (For a startup, yes. At OpenAI’s scale with hundreds of millions of users and billions of conversations — PostgreSQL hits write and storage limits. The polyglot approach lets each store handle what it does best.)

Quick Reference

  • Start with PostgreSQL unless a specific requirement rules it out. It handles more than you think (64 TiB, 20K TPS, JSON support, pgvector for embeddings).

  • Tie database choice to access pattern, not to buzzwords. “I chose X because the access pattern is Y” beats “I chose X because it’s scalable.”

  • Polyglot persistence is the norm at scale: relational for source of truth, key-value for caching/serving, wide-column for write-heavy workloads, search engine for full-text, object storage for the lake.

  • The five categories: Relational (complex queries, ACID), Document (flexible schema, nested data), Key-Value (simple lookups, ultra-low latency), Wide-Column (massive writes, time-series), Graph (relationship traversals — use sparingly).

  • For data pipelines: Object storage (S3/GCS) + table format (Iceberg/Delta) is the foundation. Warehouse for SQL analytics. Specialized stores for real-time serving. CDC connects them all.

Tomorrow’s Preview

Day 16: Data Warehouse Architecture — MPP architecture (Redshift, BigQuery, Snowflake), columnar storage, compute-storage separation, partitioning and clustering strategies, and cost optimization — the core of where your analytics pipelines deliver value.