Phase 2: Deep Dives | Category: Advanced Data Modeling
Why Time-Series Is a Fundamentally Different Problem
Time-series data has four properties that make it unlike analytical warehouse data:
-
Writes are append-only: New measurements always arrive at the current time — no random updates, only inserts at the “now” end of the timeline
-
Queries are almost always range-based: “Give me all readings for sensor X between T1 and T2” is the dominant access pattern
-
Compression is extreme: Consecutive timestamps differ by a fixed interval (e.g., 1000ms), so delta encoding compresses timestamps to almost nothing. The Facebook Gorilla paper showed XOR compression achieving 1.37 bytes per data point for float64 values
-
Data ages out: Recent data at full resolution; older data downsampled or deleted. This lifecycle is automatic in purpose-built TSDBs
When these properties appear in your system design: pipeline observability (Kafka lag, Flink checkpoint duration), product analytics (DAU over time), ML training metrics (loss curves), IoT sensors, infrastructure monitoring (CPU, memory), financial tick data. Any system with a “time” axis as the primary dimension.
The Three Major Time-Series Databases
InfluxDB: Purpose-Built TSDB
Data model (tags + fields + timestamp):
Measurement
: sensor_temperature Tags (indexed, string, used for filtering): device_id = "sensor-042" location = "factory-floor-3" unit = "celsius" Fields (un-indexed, the actual values): temperature = 23.4 humidity = 61.2 Timestamp: 2026-04-12T00:00:00Z
Critical design rule — tags vs fields:
- Tags: metadata for filtering, grouping. Low cardinality (device_id, not request_id). Always strings. Indexed. Stored in the index structure.
- Fields: actual measurements. Any type (float, int, string, bool). NOT indexed. Value you’re tracking.
- Cardinality trap: Using a high-cardinality value as a tag (user_id = 500M users) creates 500M separate time series → massive index memory pressure → InfluxDB’s performance collapses. This is the #1 InfluxDB production mistake.
TSM Storage Engine (Time-Structured Merge Tree):
- Write path: WAL → in-memory cache → TSM files (columnar, compressed)
- Compression: delta encoding for timestamps (if data arrives every second, delta is always 1000ms → compresses to nearly nothing), XOR for floats
- InfluxDB 3.0: replaced TSM with Apache Parquet on object storage — compute-storage separation, dramatically lower storage cost
Query (Flux language):
from
(bucket: "sensors") |> range(start: -1h) |> filter(fn: (r) => r.device_id == "sensor-042") |> aggregateWindow(every: 5m, fn: mean) |> yield(name: "5min_avg_temperature")
Or SQL (InfluxDB 3.0+):
SELECT
TIME_BUCKET('5 minutes', time) AS bucket, device_id, AVG(temperature) AS avg_temp FROM sensor_temperature WHERE time >= NOW() - INTERVAL '1 hour' AND device_id = 'sensor-042' GROUP BY 1, 2 ORDER BY 1;
Retention policy per bucket:
Bucket
: "sensors-raw" retention: 30 days (full resolution, 10-sec intervals) Bucket: "sensors-hourly" retention: 1 year (hourly aggregates, downsampled) Bucket: "sensors-daily" retention: 7 years (daily aggregates, compliance)
TimescaleDB: PostgreSQL + Time-Series Superpowers
The key insight: TimescaleDB IS PostgreSQL. Every tool, driver, ORM, BI connector, and SQL feature that works with PostgreSQL works with TimescaleDB.
Hypertable creation:
--
Step 1: Create a regular PostgreSQL table CREATE TABLE sensor_readings ( time TIMESTAMPTZ NOT NULL, device_id TEXT NOT NULL, location TEXT, temperature FLOAT8, humidity FLOAT8 ); -- Step 2: Convert to hypertable (automatic time-based partitioning) SELECT create_hypertable('sensor_readings', 'time', chunk_time_interval => INTERVAL '1 day'); -- TimescaleDB automatically creates a new chunk/partition each day -- Queries with time filters automatically skip irrelevant chunks (chunk exclusion)
Continuous Aggregates (the killer feature):
--
Define an automatically-refreshed materialized view at hourly granularity CREATE MATERIALIZED VIEW sensor_hourly WITH (timescaledb.continuous) AS SELECT time_bucket('1 hour', time) AS bucket, device_id, AVG(temperature) AS avg_temp, MAX(temperature) AS max_temp, MIN(temperature) AS min_temp, COUNT(*) AS reading_count FROM sensor_readings GROUP BY 1, 2; -- Set auto-refresh policy: update when new data arrives SELECT add_continuous_aggregate_policy('sensor_hourly', start_offset => INTERVAL '3 hours', end_offset => INTERVAL '1 hour', schedule_interval => INTERVAL '1 hour');
A query for “hourly avg temperature last 30 days” automatically reads from sensor_hourly (1,440 rows) instead of sensor_readings (26M rows for a 10-second sensor). Zero application code change.
Native compression:
--
Compress chunks older than 1 day (90-97% compression ratio typical) SELECT add_compression_policy('sensor_readings', INTERVAL '1 day'); -- Compressed chunks remain queryable — no application change needed SELECT time, temperature FROM sensor_readings -- reads compressed + uncompressed seamlessly WHERE time > NOW() - INTERVAL '7 days' AND device_id = 'sensor-042';
Retention + downsampling policy:
--
Delete raw data older than 30 days SELECT add_retention_policy('sensor_readings', INTERVAL '30 days'); -- Keep hourly aggregates for 1 year SELECT add_retention_policy('sensor_hourly', INTERVAL '365 days'); -- Result: 30 days of 10-sec raw data → 1 year of hourly aggregates → forever at daily level
Why TimescaleDB wins when data must JOIN with relational tables:
--
This query is trivial in TimescaleDB — impossible in InfluxDB without application-level work SELECT sr.device_id, d.asset_name, d.maintenance_schedule, d.location_building, AVG(sr.temperature) AS avg_temp FROM sensor_readings sr JOIN devices d ON sr.device_id = d.id -- JOIN with regular PostgreSQL table! WHERE sr.time >= NOW() - INTERVAL '24 hours' AND sr.temperature > d.temperature_threshold -- JOIN condition using dimension table GROUP BY 1, 2, 3, 4 ORDER BY avg_temp DESC;
Prometheus: The Kubernetes Monitoring Standard
Pull model (the key architectural difference):
- Prometheus doesn’t receive data — it scrapes /metrics endpoints at configured intervals
- Application exposes:
http_requests_total{method="GET",status="200"} 1234 - Prometheus pulls this value every 15-60 seconds
- Advantage: Prometheus controls scrape rate → won’t be overwhelmed by a misbehaving producer
PromQL (pull-oriented query language):
#
Rate of HTTP requests per second over last 5 minutes rate(http_requests_total{job="api-server"}[5m]) # P99 latency from histogram histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) # Alert: error rate > 5% sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
The cardinality problem: Labels with high cardinality (user_id, request_id) create one time series per unique label combination. At 1M unique request IDs × 10 metrics each = 10M time series → Prometheus memory explodes. Rule: never use high-cardinality values as Prometheus labels.
For long-term storage: Prometheus default retention is 15 days. For longer retention, pair with Thanos (object storage) or Mimir (horizontally scalable Prometheus). These add block compaction + downsampling at 5-min and 1-hour resolution for old data.
Downsampling: The Critical Time-Series Lifecycle Pattern
The problem: A sensor emitting every 10 seconds generates:
- 8,640 data points/day/sensor
- 3,153,600 data points/year/sensor
- At 10,000 sensors: 31.5 billion data points/year
Without downsampling: storage explodes, old-data queries get slower.
The tiered resolution pattern:
Time
Range Resolution Storage Use Case ──────────────────────────────────────────────────────────────────── Last 24 hours 10 seconds Hot (SSD) Real-time dashboards Last 30 days 1 minute Warm (SSD/HDD) Operational monitoring Last 1 year 1 hour Cool (Object) Trend analysis Last 7 years 1 day Cold (Archive) Compliance, reporting
Implementation in TimescaleDB:
--
3 continuous aggregates = automatic tiered resolution CREATE MATERIALIZED VIEW sensor_1min WITH (timescaledb.continuous) AS SELECT time_bucket('1 minute', time) bucket, device_id, AVG(temperature) avg_temp FROM sensor_readings GROUP BY 1,2; CREATE MATERIALIZED VIEW sensor_1hour WITH (timescaledb.continuous) AS SELECT time_bucket('1 hour', bucket) bucket, device_id, AVG(avg_temp) avg_temp FROM sensor_1min GROUP BY 1,2; CREATE MATERIALIZED VIEW sensor_1day WITH (timescaledb.continuous) AS SELECT time_bucket('1 day', bucket) bucket, device_id, AVG(avg_temp) avg_temp FROM sensor_1hour GROUP BY 1,2; -- TimescaleDB query routing: automatically uses the right resolution -- Query for last 1 hour → reads sensor_readings (raw 10-sec, ~360 rows) -- Query for last 30 days → reads sensor_1min (43,200 rows) -- Query for last 1 year → reads sensor_1hour (8,760 rows) -- Query for last 7 years → reads sensor_1day (2,555 rows)
Time-Series vs Analytical Warehouse: The Decision Framework
RequirementTime-Series DB (InfluxDB/TimescaleDB)Analytical Warehouse (BigQuery/Snowflake)Sub-second write latency✅ Optimized append writes❌ Batch loading is preferredAuto-downsampling + retention✅ Built-in policies❌ Manual dbt jobs requiredMillisecond read latency✅ Chunk exclusion, compressed scans❌ Typically seconds minimumAggregating 1B rows per query❌ Not the sweet spot✅ MPP execution on petabytesJOIN with business entitiesTimescaleDB ✅ / InfluxDB ❌✅ Native SQL JOINsBI tool compatibilityTimescaleDB ✅ / InfluxDB ⚠️✅ NativeCost at petabyte scale❌ Expensive per-row storage✅ Columnar compression, cheapDevOps/Kubernetes monitoringPrometheus ✅❌ Wrong tool entirely
The hybrid architecture (what production systems actually do):
Sensors
/Servers/Apps ↓ Prometheus (short-term metrics, alerting, Kubernetes monitoring) ↓ remote_write / archival TimescaleDB or InfluxDB (operational time-series, 1-90 days hot) ↓ downsampled aggregates via nightly batch BigQuery / Snowflake (long-term analytics, joining with business data, BI dashboards)
Time-Series Modeling Anti-Patterns
1. High-cardinality tags (InfluxDB)
BAD
: tags: {user_id: "u-12345678"} ← millions of unique series GOOD: tags: {user_tier: "premium"} ← low cardinality fields: {user_id: "u-12345678"} ← high cardinality in fields (not indexed)
2. Using timestamps as primary keys in relational DBs
--
BAD: in PostgreSQL without TimescaleDB CREATE TABLE metrics ( id BIGSERIAL PRIMARY KEY, timestamp TIMESTAMP, -- not the primary access path device_id TEXT, value FLOAT ); -- Result: scans entire table for time-range queries on timestamp column -- GOOD: TimescaleDB hypertable with time as the partitioning dimension SELECT create_hypertable('metrics', 'time'); -- Time-range queries now do chunk exclusion: O(n_chunks_in_range) vs O(n_total_rows)
3. Not setting retention policies (storage explosion)
Production
incident: "Our InfluxDB instance grew from 500 GB to 5 TB in 3 months" Root cause: no retention policy set Fix: set retention policy immediately, implement tiered downsampling
4. Querying raw high-frequency data for long time ranges
--
BAD: scanning 10-second data for a 1-year trend SELECT date_trunc('day', time), AVG(temperature) FROM sensor_readings -- 31.5M rows for 1 sensor/year WHERE device_id = 'sensor-042' GROUP BY 1; -- GOOD: query the daily continuous aggregate SELECT bucket, avg_temp FROM sensor_1day -- 365 rows for 1 sensor/year WHERE device_id = 'sensor-042'; -- 86,000x fewer rows to scan
Interview Questions
Q1: “You’re building observability for a data pipeline processing 10 million Kafka events per second. What metrics would you collect and how would you store them?”
Model Answer: “Five metric categories, each with different collection strategies. First, Kafka metrics: consumer lag per topic/partition, producer throughput, broker disk/network utilization. These are already exposed via JMX → collected by Prometheus Kafka exporter, stored in Prometheus with 15-day retention. Critical alert: consumer lag growing continuously for > 5 minutes = throughput gap, pages on-call. Second, Flink metrics: checkpoint duration, restart count, backpressure ratio, processing latency percentiles. Flink exposes these via its metrics interface → Prometheus. Third, data quality metrics: records processed per minute, schema validation failures, DLQ message count. These come from my pipeline code instrumented with Prometheus client library. Fourth, data freshness: MAX(event_timestamp) in the destination table vs current time. This is NOT a Prometheus metric — I’d run a SQL query every 5 minutes against the serving table and write the result to InfluxDB or as a Prometheus gauge. Fifth, infrastructure: CPU, memory, disk I/O on cluster nodes. Prometheus node_exporter, scraping every 15 seconds. Architecture: Prometheus for all operational metrics (short-term, alerting), with Thanos for long-term retention. For pipeline-specific business metrics (data freshness, quality rates) that analysts need in dashboards — write to TimescaleDB so they can JOIN with pipeline configuration tables and use standard SQL in Grafana.”
Q2: “A factory deploys 50,000 IoT sensors each emitting temperature, humidity, and pressure every 10 seconds. Design the data storage layer.”
Model Answer: “At 50,000 sensors × 3 metrics × 6 readings/minute = 900,000 data points/minute = 15,000 points/second. Five-year retention required. First, the write path: sensors → MQTT broker (Mosquitto) → MQTT bridge → InfluxDB 3.0 (object storage backend). InfluxDB 3.0’s Parquet-on-S3 storage handles 15K points/sec easily and costs ~10x less than InfluxDB 2.x local disk at this scale. Second, data model: measurement: factory_metrics, tags: {sensor_id, location_building, asset_type} (low cardinality — 50K sensors is actually fine as a tag in InfluxDB 3.0’s improved architecture), fields: {temperature, humidity, pressure}. Third, retention and downsampling: raw 10-second data retained 7 days in hot InfluxDB bucket. 1-minute aggregates retained 30 days. 1-hour aggregates retained 1 year. 1-day aggregates retained 5 years. This is configured as InfluxDB tasks (or continuous aggregates in TimescaleDB if we needed SQL JOINs). Fourth, alerting: Telegraf agents at the edge do threshold alerting (temperature > 80°C) before data even reaches InfluxDB — sub-second local response. Prometheus handles infrastructure monitoring for the MQTT and InfluxDB cluster itself. Fifth, analytics: for business reporting (maintenance scheduling, efficiency analysis), a daily Dataflow/Spark job reads daily aggregates from InfluxDB and writes to BigQuery for SQL-accessible historical analytics with JOINs against asset registration and maintenance records.”
Think About This
You’re in a Netflix interview. The prompt: “Netflix monitors 15,000+ Flink jobs streaming 60 PB/day. Design the observability data platform for these pipelines.”
Walk through:
-
What are the metric categories? (Flink internals: checkpoint duration, backpressure, restart count. Data metrics: records/sec, lag, DLQ rate. Infrastructure: CPU, memory, network per task manager.)
-
Which tool for each? (Flink + infrastructure → Prometheus + Grafana, 30-day retention with Thanos. Data freshness → custom probes writing to TimescaleDB since analysts query it alongside pipeline config tables. Long-term trend analysis → BigQuery with daily aggregates from TimescaleDB.)
-
What’s the critical alert for data freshness? (MAX(event_time) > threshold for any of 15K jobs — a join against a job registry to know expected freshness per job. Some jobs process real-time (alert at 2 min lag), others are hourly batch (alert at 2 hour lag).)
-
How do you avoid alert fatigue at 15K jobs? (Tiered: P1 = SLO-critical pipelines only (200 jobs) → PagerDuty. P2 = tier-2 pipelines → Slack channel. P3 = everything else → daily digest. The alert routing rules live in a database where each pipeline has a tier field.)
Quick Reference
- Time-series properties: append-only writes, time-range queries, extreme compression (timestamps: delta encoding; values: XOR/delta), data lifecycle (downsample then delete).
- InfluxDB: purpose-built TSDB, tags+fields+timestamp model. Avoid high-cardinality tags. InfluxDB 3.0 = Parquet on S3, 10x cheaper storage, SQL support.
- TimescaleDB: PostgreSQL extension. Hypertables = automatic time-based partitioning. Continuous aggregates = auto-refreshing tiered resolution. Full SQL + JOINs with relational tables. The choice when time-series must coexist with relational data.
- Prometheus: pull-model, scrapes /metrics endpoints. Kubernetes-native. Short retention (15 days). PromQL for rate(), histogram_quantile(). High-cardinality labels kill performance.
- Downsampling tiers: Raw → 1-min → 1-hour → 1-day. Each tier reduces storage ~60-86x. TimescaleDB continuous aggregates automate this.
- The hybrid architecture: Prometheus (short-term operational) → TimescaleDB/InfluxDB (medium-term, sub-second queries) → BigQuery/Snowflake (long-term analytical + business JOINs).
- Decision threshold: Start with PostgreSQL (< 50M rows, simple queries). Add TimescaleDB (millions of writes/day, time-range queries dominant). Use InfluxDB (pure sensor/monitoring workload, no SQL JOINs needed). Use Prometheus (Kubernetes observability, alerting).
Tomorrow’s Preview
Day 44: Exactly-Once Semantics Deep Dive — At-most-once, at-least-once, exactly-once. Kafka transactions, Flink checkpointing, idempotent producers, two-phase commit. Why exactly-once is hard and when you actually need it vs when at-least-once + deduplication is the right (and cheaper) answer.