Phase 2: Deep Dives | Category: Advanced Data Modeling
The Interview Reality: Graph vs Relational Is a Decision Question
Interviewers at your target companies almost never ask “explain Neo4j.” They ask: “You need to model a social network / fraud detection system / knowledge graph — how do you approach it, and when do you actually need a graph database?” The senior answer demonstrates you can reason about when the graph model genuinely wins vs when it’s over-engineering. As Hello Interview bluntly notes: “Graph databases are a common mistake in interviews. Even LinkedIn and Twitter use SQL for their core relationship data.” The skill is knowing when to cross the threshold.
The Property Graph Model: Core Vocabulary
All modern graph databases (Neo4j, Amazon Neptune, TigerGraph) use the labeled property graph model:
| Component | Description | Example |
|---|---|---|
| Node | An entity in the graph | (user:User {id: "u123", name: "Alice"}) |
| Relationship | A directed connection between nodes | [:FOLLOWS], [:PURCHASED], [:KNOWS] |
| Properties | Key-value pairs on nodes or relationships | {since: "2024-01-15", weight: 0.8} |
| Labels | Node type grouping | :User, :Product, :Account, :Transaction |
Index-free adjacency: The critical performance advantage. In a graph database, each node stores a direct pointer to its neighboring nodes — no index lookup required to find connected nodes. Traversing (Alice)-[:KNOWS]->(Bob) is O(1) per hop, regardless of total graph size. In a relational database, finding Bob from Alice requires a join on the friendships table, which scales with table size (O(log n) with an index, O(n) without).
The scaling formula (when graph wins): For k-hop traversals on n nodes:
- Graph DB: O(k) per query — constant with respect to total graph size, depends only on traversal depth
- Relational DB: O(n^k) in the worst case — exponential with depth for unindexed joins
When this matters: At depth 1-2, SQL with good indexes is competitive. At depth 3+, graph traversal performance diverges dramatically. This is the threshold.
When to Choose a Graph Database: The Decision Framework
Does
your primary query involve traversing relationships across an unknown or large number of hops? ├── NO (1-2 hops, bounded): Relational (SQL) is fine │ Example: "Find all orders for user X" = 1 hop from User → Orders │ Example: "Find user's friends" = 1 hop from User → Users │ These work perfectly with indexed foreign keys in Postgres └── YES (3+ hops, or variable depth): Are relationships the first-class concern of your analysis? ├── NO: Consider columnar store + SQL with window functions └── YES (social graphs, fraud rings, knowledge traversal): Is real-time traversal needed (< 100ms)? ├── NO: Graph algorithms on Spark (GraphX, GraphFrames) └── YES: Native graph database (Neo4j, Neptune, TigerGraph)
The depth threshold in practice:
- “Friends of a user” = 1 hop → SQL: SELECT friend_id FROM friendships WHERE user_id = X
- “Friends of friends” = 2 hops → SQL: one self-join, still manageable
- “People within 3 connections who share an interest” = 3 hops + filter → SQL: expensive recursive CTE
- “Shortest path between two users in a social network” = variable hops → SQL: CTE with recursion, O(n) per level; Graph: O(k×branching factor), dramatically faster
Modeling Patterns: Five Real-World Scenarios
1. Social Network Graph
Property graph model:
(:User {id, name, email, signup_date}) -[:FOLLOWS {since}]→ (:User) -[:LIKED]→ (:Content {id, type, created_at}) -[:CREATED]→ (:Content) -[:MEMBER_OF]→ (:Group {id, name, topic})
Cypher query — “Suggest friends (friends of friends, not already following)”:
MATCH
(user:User {id: $user_id})-[:FOLLOWS]->(friend)-[:FOLLOWS]->(suggestion) WHERE NOT (user)-[:FOLLOWS]->(suggestion) AND suggestion <> user RETURN suggestion.id, suggestion.name, count(friend) AS mutual_friends ORDER BY mutual_friends DESC LIMIT 10
SQL equivalent (for comparison):
--
SQL: friends of friends WITH friends AS ( SELECT followed_id FROM follows WHERE follower_id = $user_id ), fof AS ( SELECT f2.followed_id, COUNT(*) as mutual_count FROM follows f2 JOIN friends f ON f2.follower_id = f.followed_id WHERE f2.followed_id NOT IN (SELECT followed_id FROM friends) AND f2.followed_id <> $user_id GROUP BY f2.followed_id ) SELECT users.name, fof.mutual_count FROM fof JOIN users ON users.id = fof.followed_id ORDER BY mutual_count DESC LIMIT 10;
At 100K users, both work. At 1B users (Meta scale), the SQL query requires scanning a follows table with potentially hundreds of billions of rows. The graph traversal starts at one node and follows pointers.
The important nuance: LinkedIn and Twitter/X actually use SQL (MySQL/PostgreSQL) for their core relationship data — they built custom sharding and caching layers. The adjacency table pattern with aggressive indexing handles most social graph use cases. A dedicated graph database is only justified when multi-hop traversal is the core query pattern at very high QPS.
2. Fraud Detection Graph
This is the most compelling real-world use case for graph databases. Fraudsters exploit the fact that fraud detection systems check individual entities (one account, one transaction) but don’t look at the network connecting them.
Graph model:
(:Account {id, status, creation_date, country}) -[:OWNS]→ (:Device {fingerprint, ip, type}) -[:INITIATED]→ (:Transaction {id, amount, timestamp}) -[:RECEIVED_BY]→ (:Account) -[:SHARES_DEVICE]→ (:Account) // if two accounts use same device -[:SAME_IP]→ (:Account) // if two accounts share IP
Key fraud patterns detectable with graph traversal:
Ring pattern — money laundering cycle:
// Find accounts that send money in a cycle (circular flow) MATCH path=(a:Account)-[:INITIATED]->(:Transaction)-[:RECEIVED_BY]-> (:Account)-[:INITIATED]->(:Transaction)-[:RECEIVED_BY]->(a) WHERE ALL(t IN nodes(path) WHERE t.amount > 1000) RETURN path
Shared resource pattern — multiple accounts sharing devices/IPs (common in account takeover fraud):
// Find clusters of accounts sharing the same device fingerprint MATCH (a1:Account)-[:OWNS]->(:Device)<-[:OWNS]-(a2:Account) WHERE a1 <> a2 AND a1.creation_date > datetime() - duration('P7D') // new accounts RETURN a1.id, a2.id, count(*) as shared_devices HAVING count(*) > 3
Why SQL struggles here: To detect a 5-hop money laundering ring in SQL requires a recursive CTE with 5 levels, and at transaction volumes of millions/day it’s prohibitively slow. The same query in Neo4j with relationship indexes traverses the ring in milliseconds.
3. Knowledge Graph
Used at OpenAI and Anthropic for grounding LLM responses with structured factual knowledge.
Property graph model:
(:Entity {id, name, type: "person|org|concept|place"}) -[:WORKS_AT {since, role}]→ (:Entity {type: "org"}) -[:LOCATED_IN]→ (:Entity {type: "place"}) -[:RELATED_TO {confidence: 0.87, source}]→ (:Entity) -[:INSTANCE_OF]→ (:Class {name, description})
Query — multi-hop semantic reasoning:
// "Find all researchers who work at organizations that have published // research related to transformer architectures" MATCH (researcher:Entity {type: "person"}) -[:WORKS_AT]→ (org:Entity {type: "org"}) -[:PUBLISHED]→ (paper:Entity {type: "paper"}) -[:RELATED_TO*1..2]→ (concept:Entity {name: "transformer"}) RETURN DISTINCT researcher.name, org.name, count(paper) as papers ORDER BY papers DESC
The *1..2 in Cypher means “traverse 1 to 2 hops” — variable-depth traversal that’s natural in Cypher but painful in recursive SQL.
4. Recommendation Graph (Collaborative Filtering)
(:User)-[:PURCHASED]→ (:Product) (:User)-[:RATED {score}]→ (:Product) (:User)-[:VIEWED]→ (:Product) (:Product)-[:IN_CATEGORY]→ (:Category) (:Product)-[:FREQUENTLY_BOUGHT_WITH]→ (:Product)
“Users who bought X also bought Y” query:
MATCH
(user:User)-[:PURCHASED]→ (product:Product {id: $product_id}) <-[:PURCHASED]-(similar_user:User) -[:PURCHASED]→ (recommendation:Product) WHERE NOT (user)-[:PURCHASED]→ (recommendation) AND recommendation <> product RETURN recommendation.name, count(similar_user) AS frequency ORDER BY frequency DESC LIMIT 10
5. Data Lineage Graph
Highly relevant for data engineering — tracking how data flows between tables, jobs, and systems:
(:Dataset {name, location, schema_version}) -[:PRODUCED_BY]→ (:Job {name, schedule, version}) -[:READS_FROM]→ (:Dataset) -[:WRITES_TO]→ (:Dataset) -[:DEPENDS_ON]→ (:Dataset)
This is how tools like DataHub and OpenLineage model lineage internally — a graph where any upstream change can be traced forward to all downstream impacts.
Adjacency List in Relational DBs: The SQL Alternative
For moderate-scale graph problems, a relational adjacency list often works fine:
--
Nodes table CREATE TABLE nodes ( id UUID PRIMARY KEY, label VARCHAR(50), properties JSONB ); -- Edges table CREATE TABLE edges ( id UUID PRIMARY KEY, source_id UUID REFERENCES nodes(id), target_id UUID REFERENCES nodes(id), relationship VARCHAR(50), properties JSONB, created_at TIMESTAMP ); CREATE INDEX ON edges(source_id); CREATE INDEX ON edges(target_id); CREATE INDEX ON edges(relationship);
Recursive CTE for BFS traversal:
--
Find all nodes within 3 hops of a starting node WITH RECURSIVE graph_traversal AS ( -- Base: starting node SELECT id, 0 AS depth, ARRAY[id] AS path FROM nodes WHERE id = $start_id UNION ALL -- Recursive: traverse edges SELECT n.id, gt.depth + 1, gt.path || n.id FROM nodes n JOIN edges e ON e.target_id = n.id JOIN graph_traversal gt ON gt.id = e.source_id WHERE gt.depth < 3 AND NOT (n.id = ANY(gt.path)) -- prevent cycles ) SELECT DISTINCT id, depth FROM graph_traversal ORDER BY depth;
When this is good enough: Social graphs up to ~10M users and 2-3 hop traversals. Fraud detection for small transaction volumes. Data lineage for hundreds of datasets (not thousands).
When it breaks down: Deep recursive CTEs on billions of rows are extremely slow. Graph algorithms (PageRank, community detection, shortest path) are much harder to implement in SQL than in native graph query languages.
Graph Algorithms for Data Engineering Use Cases
These algorithms matter for interviews at companies building recommendation or fraud systems:
AlgorithmWhat It ComputesUse CasePageRankNode importance based on link structureMost influential users, authoritative documentsBetweenness CentralityNodes that bridge communitiesFraud hubs that connect multiple ringsCommunity Detection (Louvain)Clusters of densely connected nodesFraud rings, social communitiesShortest Path (Dijkstra)Minimum-hop path between nodesHow is user A connected to user B?Connected ComponentsIsolated subgraphsIdentify separate fraud clustersNode SimilarityHow similar two nodes are by connection patternsRecommendation systems
For data engineering interviews: You rarely need to implement these. But knowing they exist and naming the relevant one for a given problem is a strong signal. “I’d run community detection to identify fraud rings, and betweenness centrality to find the accounts that are acting as hubs within a ring” is a good answer.
Interview Questions
Q1: “You’re designing the data infrastructure for Meta’s fraud detection system. Users, devices, and accounts are involved. When would you use a graph database vs Spark GraphX vs SQL?”
Model Answer: “The answer depends on the latency requirement and the traversal depth. For real-time fraud detection during a transaction (< 100ms to block or allow), I’d use a native graph database like Amazon Neptune or TigerGraph. The pattern queries — ‘does this account share a device with a flagged account?’ (1-2 hops) or ‘is this account part of a known fraud ring?’ (3-6 hops) — need millisecond traversal. Only a native graph database gives that. Neptune stores 8-10 billion relationships and handles graph traversals at sub-second latency. For offline fraud analytics — ‘identify new fraud ring patterns we haven’t seen before’ — I’d use Spark GraphX on the full transaction graph. GraphX handles PB-scale graph computation that Neptune can’t, but takes minutes to hours, not milliseconds. For simple cross-table fraud signals that only need 1-2 hops — ‘has this IP been used in more than 5 different accounts this week?’ — a SQL query on a properly indexed transaction table is faster to build and operationally simpler. I wouldn’t use a graph database for that. The architecture: Neptune for real-time pattern matching, Spark GraphX for offline community detection and new pattern discovery, SQL for simple signal computation. All three read from the same bronze/silver event data in the lakehouse.”
Q2: “LinkedIn has a social graph of 900M users. Would you model it in Neo4j or SQL? Defend your choice.”
Model Answer: “SQL — and that’s actually what LinkedIn uses. Here’s why. LinkedIn’s primary queries are: ‘Who are my 1st connections?’ (1 hop), ‘Who are my 2nd connections?’ (2 hops), and ‘How am I connected to this person?’ (shortest path, variable hops). For 1st and 2nd degree connections, a well-indexed adjacency table in MySQL with read replicas handles the load — LinkedIn’s engineering blog documents this approach. The graph database would shine for the variable-hop ‘how am I connected’ query, but LinkedIn uses a separate graph computation service for that (offline Hadoop-based), not Neo4j in the serving path. The operational complexity and cost of maintaining a 900M-node Neo4j cluster at LinkedIn’s traffic doesn’t justify the performance improvement for their primary access patterns. Where I would recommend a graph database: if LinkedIn were building a real-time fraud detection layer looking for 5-hop account manipulation rings, or if they were building a knowledge graph where semantic multi-hop reasoning is the primary use case. For the core social graph serving profile pages and connection lists? SQL with sharding and caching is battle-tested, simpler to operate, and more than adequate.”
Think About This
You’re in an OpenAI interview. The prompt: “Design the data infrastructure for Claude’s knowledge graph — a structured representation of facts, entities, and relationships that Claude can query during inference to ground its responses.”
Walk through:
-
What are the entities and relationships? (Entity: Person, Organization, Concept, Place, Event. Relationships: WORKS_AT, LOCATED_IN, RELATED_TO, OCCURRED_ON, INSTANCE_OF)
-
Should this be a graph database or a vector database? (Both, for different query types: graph DB for structured fact traversal “what organization does person X work at?”; vector DB for semantic similarity “what concepts are related to transformer architectures?”)
-
How do you keep it fresh? (Batch pipeline: web crawl → entity extraction → knowledge graph updates. Incremental CDC for high-frequency entities. Full rebuild quarterly for schema changes.)
-
At inference time, how does Claude query this? (RAG pattern: embed the user question → retrieve relevant entities + relationships from vector store → fetch structured facts from graph DB → inject as context into the prompt)
-
What’s the pipeline for adding new facts? (NLP pipeline: document → NER (entity extraction) → RE (relationship extraction) → entity linking (resolve “OpenAI” to the canonical Organization node) → graph write)
Quick Reference
- The depth threshold: 1-2 hops → SQL with indexed adjacency table. 3+ hops or variable depth → consider graph DB.
- Index-free adjacency: Each node stores direct pointer to neighbors — O(1) per hop, regardless of total graph size. SQL join is O(log n) with index, O(n) without.
- Property graph = nodes + relationships + properties + labels. Relationships are first-class citizens with their own properties.
- Use graph DB when: Fraud ring detection (3-6 hop traversals), knowledge graphs (semantic multi-hop), recommendation traversals (bought-with-bought-with), data lineage graphs.
- Don’t use graph DB when: Social graph for profile page display (1-2 hops, SQL is fine), simple transaction lookups, analytics aggregations (columnar warehouse is better).
- Graph algorithms to name: PageRank (influence), Community Detection/Louvain (fraud rings, clusters), Shortest Path (connection finder), Betweenness Centrality (hub detection).
- The senior answer on any graph question: Start with SQL adjacency table, justify the upgrade to a graph DB only when traversal depth, query patterns, or real-time requirements genuinely require it.
Tomorrow’s Preview
Day 43: Time-Series Data Modeling — Time-series databases (InfluxDB, TimescaleDB, Prometheus), downsampling and retention policies, modeling metrics, logs, and sensor data, compression strategies for time-series, and why time-series is a fundamentally different problem from analytical data warehousing.