Building a scalable time-series database on PostgreSQL
Erik Nordström, Ph.D. Senior Software Engineer
github.com/timescale
Time-series Data is Emerging Everywhere
• Industrial Machines
• Transportation & Logistics
• Datacenter & DevOps
• Financial & Marketing
Why time-series data?
• Past: Historical analysis
• Present: Real-time monitoring, troubleshooting
• Future: Predict & avoid problems
…and there’s a lot of time-series data
• 44 ZB of data collected from IoT devices by 2020 (IDC)
• 25 GB of data collected per hour by connected cars (McKinsey)
• 71% of global businesses now collecting IoT data (451 Research)
@timescale
Many common scenarios (measurements + metadata)
time             device_id  cpu_avg  temperature  humidity  device_type   location_id
1/1/17 01:01:00  0s9djf     100      70           0.6       floor_sensor  439
1/1/17 01:01:01  al8fk2     51       35           0.3       roof_sensor   523
1/1/17 01:02:00  5k2ga      85       71           0.6       floor_sensor  124
1/1/17 01:02:05  fj8va2     40       60           0.3       roof_sensor   124
1/1/17 01:03:02  0s9djf     88       72           0.6       floor_sensor  439
1/1/17 01:03:08  jw385z     36       60           0.3       roof_sensor   632
1/1/17 01:04:00  1j86aq     82       73           0.6       floor_sensor  370
1/1/17 01:04:09  98aat1     10       -5           0.3       roof_sensor   621
1/1/17 01:05:04  0s9djf     80       70           0.6       floor_sensor  439
1/1/17 01:05:05  al8fk2     30       60           0.3       roof_sensor   523
Common scenario: Measurements + Metadata
primary key: container
time             container  temperature  humidity  location_id  type  cargo
1/1/17 01:01:00  0s9djf     70           0.6       439          40S   electronics
1/1/17 01:01:01  al8fk2     35           0.3       523          40R   flowers
1/1/17 01:02:00  5k2ga      71           0.6       124          40S   toys
1/1/17 01:02:05  fj8va2     60           0.3       124          40S   food
1/1/17 01:03:02  0s9djf     72           0.6       439          40H   furniture
1/1/17 01:03:08  jw385z     60           0.3       632          20S   toys
1/1/17 01:04:00  1j86aq     73           0.6       370          45H   car parts
1/1/17 01:04:09  98aat1     -5           0.3       621          40R   frozen food
1/1/17 01:05:04  0s9djf     70           0.6       439          40S   electronics
1/1/17 01:05:05  al8fk2     60           0.3       523          40S   food
What database for time-series data?
• Relational: 32%
• NoSQL: 68%
Source: https://www.percona.com/blog/2017/02/10/percona-blog-poll-database-engine-using-store-time-series-data/
Postgres 9.6.2 on Azure standard DS4 v2 (8 cores), SSD (premium LRS storage). Each row has 12 columns (1 timestamp, indexed 1 host ID, 10 metrics).
Challenge in scaling up
• Time is primary dimension:
  – B-tree index on time column for time-oriented queries (rollups, group by time)
  – Separate B-tree for each secondary index
• As table grows large:
  – Data and indexes no longer fit in memory
  – Reads/writes to random locations in B-tree
NoSQL champion: Log-Structured Merge Trees
• Key-value store with indexed key lookup at high write rates
• Range queries over sorted keys on disk
• Compressed data storage
• Common approach for time series: use key <name,tags,field,time>
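The <name,tags,field,time> key layout can be illustrated with a small sketch. It assumes an in-memory sorted Python list as a stand-in for the sorted key order an LSM tree keeps on disk; the helper names (make_key, put, range_scan) are hypothetical and do not come from any particular NoSQL store.

```python
import bisect

def make_key(name, tags, field, time):
    # Composite key <name,tags,field,time>; tags are sorted so equal tag sets compare equal.
    return (name, tuple(sorted(tags.items())), field, time)

# A sorted list stands in for the sorted key order an LSM tree keeps on disk.
store = []

def put(name, tags, field, time, value):
    bisect.insort(store, (make_key(name, tags, field, time), value))

def range_scan(name, tags, field, t_start, t_end):
    # Points of one series with t_start <= time < t_end are contiguous in key order.
    lo = bisect.bisect_left(store, (make_key(name, tags, field, t_start),))
    hi = bisect.bisect_left(store, (make_key(name, tags, field, t_end),))
    return [(key[3], value) for key, value in store[lo:hi]]
```

Because time is the last key component, a time-range query for one series is a single contiguous scan. A query that filters on tags alone has no such ordering to exploit, which is where the separate tag index mentioned on the next slide comes in.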
NoSQL + LSMTs Come at a Cost
• Requires other (in-memory) index to efficiently map tags to keys => high-cardinality issue
• Poor/no secondary index support
• Less powerful queries
• No JOINs
• Loss of SQL ecosystem
OLTP
• Primarily UPDATEs
• Writes randomly distributed
• Transactions to multiple primary keys
Time-series
• Primarily INSERTs
• Writes to recent time interval
• Writes associated with a timestamp and primary key
The Hypertable Abstraction
Hypertable (composed of Chunks)
• Indexes
• Triggers
• Constraints
• UPSERTs
• Table mgmt
Familiar Management
$ psql
psql (9.6.3)
Type "help" for help.

tsdb=# CREATE TABLE data (
  time TIMESTAMP WITH TIME ZONE NOT NULL,
  device_id TEXT NOT NULL,
  temperature NUMERIC NULL,
  humidity NUMERIC NULL
);
tsdb=# SELECT create_hypertable('data', 'time', 'device_id', 8);
tsdb=# INSERT INTO data (SELECT * FROM old_data);

Creating/migrating is easy
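The create_hypertable call above names a time column, a partitioning column (device_id), and 8 partitions. As a rough sketch of what that implies, the Python below maps a row to a chunk by time slice and hashed device_id. The one-day interval and the CRC32 hash are assumptions chosen for illustration, not TimescaleDB's internal behavior, and chunk_for is a hypothetical helper.

```python
import zlib
from datetime import datetime, timedelta, timezone

CHUNK_INTERVAL = timedelta(days=1)   # assumed interval, for illustration only
NUM_SPACE_PARTITIONS = 8             # matches the 8 in the create_hypertable call
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def chunk_for(time, device_id):
    """Return (time_slice_start, space_partition) identifying the chunk for a row."""
    slices = (time - EPOCH) // CHUNK_INTERVAL          # whole intervals since the epoch
    slice_start = EPOCH + slices * CHUNK_INTERVAL
    # CRC32 is a stand-in hash; it is stable across runs, unlike Python's hash() for str.
    partition = zlib.crc32(device_id.encode("utf-8")) % NUM_SPACE_PARTITIONS
    return slice_start, partition
```

Rows for the same device within the same day land in the same chunk, so recent writes touch only a small, memory-resident part of the data.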
Adaptive chunking
• New approach: Adaptive intervals
• Partitions created with fixed time interval, but interval adapts to changes in data volumes
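One way to picture an adapting interval is a simple proportional rule: if the last chunk came out larger than a target size, shorten the next interval by the same ratio. The sketch below assumes that rule; the function name, the target-size parameter, and the one-minute floor are all hypothetical, and the slides do not say this is the algorithm actually used.

```python
from datetime import timedelta

def next_chunk_interval(current_interval, observed_bytes, target_bytes,
                        min_interval=timedelta(minutes=1)):
    """Scale the interval so the next chunk is expected to hold about target_bytes."""
    if observed_bytes <= 0:
        return current_interval            # no data seen: keep the interval unchanged
    scaled = current_interval * (target_bytes / observed_bytes)
    return max(scaled, min_interval)       # never shrink below the assumed floor
```

For example, a 24-hour chunk that grew to four times the target size would be followed by a 6-hour interval under this rule.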
Common mechanisms for scaling up & out
• Chunks spread across many disks
• Faster inserts
• Parallel queries
Common mechanisms for scaling up & out
• Chunks spread across servers
• Chunk-aware query optimizations (push-down LIMITs and aggregates, server-wide merge appends, etc.)
Avoid querying chunks via constraint exclusion
SELECT time, temp FROM data WHERE time > now() - interval '7 days' AND device_id = '12345'
SELECT time, device_id, temp FROM data WHERE time > '2017-08-22 18:18:00+00'
SELECT time, device_id, temp FROM data WHERE time > now() - interval '24 hours'
Plain Postgres won't exclude chunks when the bound is computed from now(): the expression is not a constant at plan time, which is when constraint exclusion is applied.
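The idea behind chunk exclusion can be sketched in a few lines of Python. Each chunk carries a time-range constraint, and a chunk is skipped when that range cannot satisfy the query's time predicate. The Chunk class, the daily chunk layout, and chunks_to_scan are hypothetical names for illustration; this is not how the planner is implemented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class Chunk:
    name: str
    start: datetime   # inclusive lower bound of the chunk's time constraint
    end: datetime     # exclusive upper bound

def chunks_to_scan(chunks, time_after):
    """Keep only chunks that can contain rows with time > time_after."""
    # A chunk covering [start, end) is skipped when it ends at or before the cutoff.
    return [c for c in chunks if c.end > time_after]

# Four daily chunks; the cutoff is evaluated first, so it is a constant when pruning.
chunks = [Chunk(f"chunk_{d}", datetime(2017, 8, d), datetime(2017, 8, d + 1))
          for d in range(20, 24)]
cutoff = datetime(2017, 8, 23, 18, 18) - timedelta(hours=24)
selected = [c.name for c in chunks_to_scan(chunks, cutoff)]
```

The key step is that the "now minus 24 hours" expression is reduced to a concrete timestamp before the chunk constraints are checked; once that is done, only the two most recent chunks remain in `selected`.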
Democratize your data
Empower your entire org to analyze data
psql (9.6.3)
Type "help" for help.
postgres=#
“As a bit of an update, we’ve scaled to beyond 500 billion rows now and things are still working very smoothly. The performance we’re seeing is monstrous, almost 70% faster queries!”
- Software Engineer, Large Enterprise Computer Company
INSERT performance
Postgres 9.6.2 on Azure standard DS4 v2 (8 cores), SSD (premium LRS storage). Each row has 12 columns (1 timestamp, indexed 1 host ID, 10 metrics).
• 144K metrics/s
• 14.4K inserts/s
• >20x
• 1.11M metrics/s
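The two units on this slide are related through the row layout in the benchmark caption: each row carries 10 metrics. Assuming "inserts" counts rows inserted, the conversion is simple arithmetic, sketched here with hypothetical helper names.

```python
METRICS_PER_ROW = 10  # from the benchmark caption: 10 metrics per row

def metrics_per_second(rows_per_second):
    # Each inserted row contributes METRICS_PER_ROW metric values.
    return rows_per_second * METRICS_PER_ROW

def rows_per_second(metrics_rate):
    # Inverse conversion: metric values per second back to rows per second.
    return metrics_rate / METRICS_PER_ROW
```

Under that assumption, 14.4K inserts/s corresponds to 144K metrics/s, and 1.11M metrics/s corresponds to roughly 111K rows/s.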
Query Performance
Query type                          Speedup
Table scans, simple column rollups  0-20%
GROUP BYs                           20-200%
Time-ordered GROUP BYs              400-10000x
DELETEs                             2000x
Source: https://blog.timescale.com/timescaledb-vs-6a696248104e
High-Level Differences from Plain Postgres
• Insert performance
• Automatic partitioning
• Easier management
• Functions for time-based analysis
• Time-aware query optimizations
• Much faster deletes
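A typical example of a function for time-based analysis is bucketing: flooring timestamps to fixed-width intervals so rows can be grouped and aggregated per interval. TimescaleDB exposes this in SQL as time_bucket; the Python function of the same name below is only a stand-in for the idea, and its origin parameter and the bucketed_average helper are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def time_bucket(width, ts, origin=datetime(2000, 1, 1)):
    """Floor ts to the start of its bucket of the given width."""
    return ts - (ts - origin) % width

def bucketed_average(rows, width):
    """rows: iterable of (timestamp, value); returns {bucket_start: mean value}."""
    sums = defaultdict(lambda: [0.0, 0])
    for ts, value in rows:
        acc = sums[time_bucket(width, ts)]
        acc[0] += value      # running total for this bucket
        acc[1] += 1          # number of rows in this bucket
    return {bucket: total / count for bucket, (total, count) in sorted(sums.items())}
```

With a five-minute width, readings taken at 01:01:00 and 01:02:00 fall into the 01:00 bucket, while readings at 01:05:04 and 01:05:05 fall into the 01:05 bucket.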
High-Level Differences from NoSQL
• Secondary indexes
• Lower memory requirements
• No high cardinality problems
• Un-siloed data (JOINs!)
• SQL
Scalable
• High write rates
• Time-oriented optimizations
• Fast complex queries
• 10s of billions of rows / node
Reliable
• Inherits 20+ years of PostgreSQL reliability and ecosystem
• Streaming replication, backups, HA clustering
Easy to Use
• Supports full SQL
• Time-oriented features
• Connectors: Just speaks PostgreSQL
• 1 DB for relational & time-series data