Inside the InfluxDB Storage Engine
TSDB
Inverted Index
preliminary intro materials…
Everything is indexed by time and series
Shards
10/10/2015   10/11/2015   10/12/2015   10/13/2015
Data organized into Shards of time; each is an underlying DB, which makes it efficient to drop old data
InfluxDB data

temperature,device=dev1,building=b1 internal=80,external=18 1443782126

Measurement: temperature
Tags:        device=dev1,building=b1
Fields:      internal=80,external=18
Timestamp:   1443782126
We actually store up to ns scale timestamps but I couldn’t fit on the slide
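As a rough illustration of the four parts called out above, here is a minimal Go sketch that splits a line-protocol point into measurement, tags, fields, and timestamp. It is illustrative only; the names (Point, parseLine) are invented for this example, and InfluxDB's real parser also handles escaping, quoting, multiple field types, and ns-precision timestamps.

package main

import (
	"fmt"
	"strconv"
	"strings"
)

type Point struct {
	Measurement string
	Tags        map[string]string
	Fields      map[string]float64
	Timestamp   int64
}

func parseLine(line string) (Point, error) {
	p := Point{Tags: map[string]string{}, Fields: map[string]float64{}}

	// Shape: "measurement,tag=val,... field=val,... timestamp"
	parts := strings.SplitN(line, " ", 3)
	if len(parts) != 3 {
		return p, fmt.Errorf("malformed line: %q", line)
	}

	// Measurement and tags.
	keyParts := strings.Split(parts[0], ",")
	p.Measurement = keyParts[0]
	for _, kv := range keyParts[1:] {
		pair := strings.SplitN(kv, "=", 2)
		p.Tags[pair[0]] = pair[1]
	}

	// Fields (only float values handled in this sketch).
	for _, kv := range strings.Split(parts[1], ",") {
		pair := strings.SplitN(kv, "=", 2)
		v, err := strconv.ParseFloat(pair[1], 64)
		if err != nil {
			return p, err
		}
		p.Fields[pair[0]] = v
	}

	// Timestamp (second precision here for brevity).
	ts, err := strconv.ParseInt(parts[2], 10, 64)
	p.Timestamp = ts
	return p, err
}

func main() {
	pt, _ := parseLine("temperature,device=dev1,building=b1 internal=80,external=18 1443782126")
	fmt.Printf("%+v\n", pt)
}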
Each series and field maps to a unique ID:

temperature,device=dev1,building=b1#internal -> 1
temperature,device=dev1,building=b1#external -> 2

Data per ID is tuples ordered by time:

1 -> (1443782126,80)
2 -> (1443782126,18)
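A small sketch of the two mappings above, assuming a toy in-memory layout: a series+field key gets a small integer ID, and each ID owns a time-ordered slice of (timestamp, value) tuples. The type and function names (seriesStore, write) are invented for illustration, not taken from InfluxDB.

package main

import (
	"fmt"
	"sort"
)

type tuple struct {
	T int64   // unix timestamp
	V float64 // field value
}

type seriesStore struct {
	ids    map[string]uint64  // "measurement,tags#field" -> ID
	nextID uint64
	data   map[uint64][]tuple // ID -> tuples ordered by time
}

func newSeriesStore() *seriesStore {
	return &seriesStore{ids: map[string]uint64{}, data: map[uint64][]tuple{}}
}

func (s *seriesStore) write(key string, t int64, v float64) {
	id, ok := s.ids[key]
	if !ok {
		s.nextID++
		id = s.nextID
		s.ids[key] = id
	}
	points := append(s.data[id], tuple{t, v})
	// Keep each ID's tuples sorted by time so range scans stay cheap.
	sort.Slice(points, func(i, j int) bool { return points[i].T < points[j].T })
	s.data[id] = points
}

func main() {
	s := newSeriesStore()
	s.write("temperature,device=dev1,building=b1#internal", 1443782126, 80)
	s.write("temperature,device=dev1,building=b1#external", 1443782126, 18)
	fmt.Println(s.ids)  // each series+field key mapped to a unique ID
	fmt.Println(s.data) // per-ID (timestamp, value) tuples
}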
Arranging in Key/Value Stores

Key (ID,Time)       Value
1,1443782126        80
2,1443782126        18

new data: 1,1443782127 -> 81

The key space is ordered:

Key (ID,Time)       Value
1,1443782126        80
1,1443782127        81
2,1443782126        18
2,1443782130        17
2,1443782256        15
3,1443700126        18
Many existing storage engines have this model
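One common way to get this layout, sketched here under the assumption of a byte-ordered key/value store (LevelDB, BoltDB, and similar): encode the key as an 8-byte series ID followed by an 8-byte timestamp, both big-endian, so plain byte comparison sorts by ID first and time second. The helper name (key) is illustrative.

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"sort"
)

// key packs (series ID, timestamp) so that byte order == (ID, time) order.
func key(id uint64, ts int64) []byte {
	k := make([]byte, 16)
	binary.BigEndian.PutUint64(k[0:8], id)
	binary.BigEndian.PutUint64(k[8:16], uint64(ts))
	return k
}

func main() {
	keys := [][]byte{
		key(2, 1443782126),
		key(1, 1443782127),
		key(1, 1443782126),
		key(3, 1443700126),
	}
	// A byte-ordered store keeps keys in exactly this order.
	sort.Slice(keys, func(i, j int) bool { return bytes.Compare(keys[i], keys[j]) < 0 })
	for _, k := range keys {
		fmt.Println(binary.BigEndian.Uint64(k[0:8]), binary.BigEndian.Uint64(k[8:16]))
	}
}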
New Storage Engine?!

First we used LSM Trees
• deletes expensive
• too many open file handles

Then mmap COW B+Trees
• write throughput
• compression

Neither met our requirements:
• High write throughput
• Awesome read performance
• Better compression
• Writes can't block reads
• Reads can't block writes
• Write multiple ranges simultaneously
• Hot backups
• Many databases open in a single process
Enter InfluxDB's Time Structured Merge Tree (TSM Tree)
like LSM, but different
Components

Component          Similar to LSM Trees
WAL                same
In-memory cache    like MemTables
Index Files        like SSTables
awesome time series data
-> WAL (an append-only file)
-> in-memory index
-> on-disk index (periodic flushes)
Memory mapped!
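A toy sketch of that write path, not InfluxDB's tsm1 code: append the point to the WAL for durability, then add it to the in-memory structure so it is queryable; a periodic flush (omitted here) would rewrite the in-memory data as an immutable, memory-mapped on-disk file and truncate the WAL. The type names (point, engine) are invented for this example.

package main

import (
	"fmt"
	"os"
)

type point struct {
	key string // series + field
	t   int64
	v   float64
}

type engine struct {
	wal   *os.File
	cache map[string][]point // the in-memory, queryable copy
}

func (e *engine) write(p point) error {
	// 1. Durability: append to the WAL first.
	if _, err := fmt.Fprintf(e.wal, "%s %d %g\n", p.key, p.t, p.v); err != nil {
		return err
	}
	// 2. Serve reads: add to the in-memory structure.
	e.cache[p.key] = append(e.cache[p.key], p)
	return nil
}

// A periodic flush would write the cache out as an immutable, sorted,
// memory-mapped file and then truncate the WAL; omitted here.

func main() {
	wal, _ := os.CreateTemp("", "wal")
	defer os.Remove(wal.Name())
	e := &engine{wal: wal, cache: map[string][]point{}}
	_ = e.write(point{"temperature,device=dev1,building=b1#internal", 1443782126, 80})
	fmt.Println(len(e.cache), "series in the cache")
}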
TSM File · TSM File · TSM File · … (the on-disk index is a set of TSM files)
Compression
Timestamps: encoding based on precision and deltas
Timestamps (best case): Run length encoding
Deltas are all the same for a block
Timestamps (good case): Simple8B
Anh and Moffat in "Index compression using 64-bit words"
Timestamps (worst case): raw values
nanosecond timestamps with large deltas
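A sketch of that decision, with an invented encodeTimestamps helper: compute the block's deltas, run-length encode when they are all identical, pack them with something like Simple8b when they are small, and fall back to raw values otherwise. The small-delta threshold below is illustrative, not the engine's actual cutoff.

package main

import "fmt"

// encodeTimestamps picks an encoding for one block of timestamps.
func encodeTimestamps(ts []int64) string {
	if len(ts) < 2 {
		return "raw"
	}
	same, maxDelta := true, int64(0)
	var first int64
	for i := 1; i < len(ts); i++ {
		d := ts[i] - ts[i-1]
		if i == 1 {
			first = d
		}
		if d != first {
			same = false
		}
		if d > maxDelta {
			maxDelta = d
		}
	}
	switch {
	case same:
		// Best case: run-length encode (first value, delta, count).
		return fmt.Sprintf("RLE: first=%d delta=%d count=%d", ts[0], first, len(ts))
	case maxDelta < 1<<28: // illustrative threshold only
		return "simple8b: pack the deltas into 64-bit words"
	default:
		return "raw: e.g. nanosecond timestamps with large deltas"
	}
}

func main() {
	fmt.Println(encodeTimestamps([]int64{1443782126, 1443782136, 1443782146, 1443782156}))
}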
float64: double delta (Facebook's Gorilla; google: gorilla time series facebook)
https://github.com/dgryski/go-tsz
booleans are bits!
int64 uses double delta, zig-zag (same zig-zag as Protobufs)
string uses Snappy (same compression LevelDB uses)
(might add dictionary compression)
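For reference, the zig-zag step mentioned above is the same mapping Protocol Buffers use: it folds small negative and positive deltas onto small unsigned integers so they pack tightly. A minimal Go sketch:

package main

import "fmt"

// zigzag maps 0,-1,1,-2,2,... to 0,1,2,3,4,... so small deltas stay small.
func zigzag(v int64) uint64   { return uint64((v << 1) ^ (v >> 63)) }
func unzigzag(u uint64) int64 { return int64(u>>1) ^ -int64(u&1) }

func main() {
	for _, d := range []int64{0, -1, 1, -2, 2, 80} {
		z := zigzag(d)
		fmt.Println(d, "->", z, "->", unzigzag(z))
	}
}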
Updates: write, resolve at query time
Deletes: tombstone, resolve at query time & compaction
Compactions
• Combine multiple TSM files
• Put all series points into same file
• Series points in 1k blocks
• Multiple levels
• Full compaction when cold for writes
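A toy illustration of what a compaction does for one series, under the assumptions on this slide (1k points per block): merge the series' points from several files in time order and re-chunk them into fixed-size blocks for one new file. This is a conceptual sketch, not the tsm1 compactor.

package main

import (
	"fmt"
	"sort"
)

type pt struct {
	T int64
	V float64
}

const blockSize = 1000 // points per block, per the slide

// compactSeries merges one series' points from multiple files and
// splits the result into blockSize chunks.
func compactSeries(files ...[]pt) [][]pt {
	var all []pt
	for _, f := range files {
		all = append(all, f...)
	}
	sort.Slice(all, func(i, j int) bool { return all[i].T < all[j].T })

	var blocks [][]pt
	for len(all) > 0 {
		n := blockSize
		if len(all) < n {
			n = len(all)
		}
		blocks = append(blocks, all[:n])
		all = all[n:]
	}
	return blocks
}

func main() {
	a := []pt{{1, 80}, {3, 81}}
	b := []pt{{2, 79}, {4, 82}}
	fmt.Println(compactSeries(a, b))
}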
Example Query
select percentile(90, value) from cpu where time > now() - 12h and “region” = ‘west’ group by time(10m), host
How to map to series?
Inverted Index!
Inverted Index

series to ID:
cpu,host=A,region=west#idle -> 1
cpu,host=B,region=west#idle -> 2

measurement to fields:
cpu -> [idle]

host to values:
host -> [A, B]

region to values:
region -> [west]

postings lists:
cpu -> [1, 2]
host=A -> [1]
host=B -> [2]
region=west -> [1, 2]
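A sketch of how the inverted index serves the example query's "from cpu ... where region = 'west'" part: look up the postings list for the measurement and for the tag filter, then intersect them to get the series IDs to scan. The postings map below is hand-built from the slide's example; none of this is InfluxDB's index code.

package main

import "fmt"

func intersect(a, b []uint64) []uint64 {
	in := map[uint64]bool{}
	for _, id := range a {
		in[id] = true
	}
	var out []uint64
	for _, id := range b {
		if in[id] {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	postings := map[string][]uint64{
		"measurement:cpu": {1, 2},
		"region=west":     {1, 2},
		"host=A":          {1},
		"host=B":          {2},
	}
	ids := intersect(postings["measurement:cpu"], postings["region=west"])
	fmt.Println("series to scan:", ids) // [1 2]; grouping by host then uses host=A / host=B
}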
Index V1
• In-memory
• Load on boot
• Memory constrained
• Slower boot times with high cardinality
Index V2

time series meta data
-> check the in-memory index and on-disk index (do we already have this series? nope)
-> WAL (an append-only file)
-> on-disk indices (periodic flushes)
-> (compactions) on-disk index
Index File Layout
Robin Hood Hashing
• Can fully load table
• No linked lists for lookup
• Perfect for read-only hashes
Worked example, table of 5 slots (Positions 0-4):

Start:
  Keys          [  ,  ,  ,  ,   ]
  Probe Lengths [ 0, 0, 0, 0, 0 ]

Insert A (hashes to 0): slot 0 is empty.
  Keys          [ A,  ,  ,  ,   ]
  Probe Lengths [ 0, 0, 0, 0, 0 ]

Insert B (hashes to 1): slot 1 is empty.
  Keys          [ A, B,  ,  ,   ]
  Probe Lengths [ 0, 0, 0, 0, 0 ]

Insert C (hashes to 1): slot 1 is taken, so C probes on to slot 2 (probe 1).
  Keys          [ A, B, C,  ,   ]
  Probe Lengths [ 0, 0, 1, 0, 0 ]

Insert D (hashes to 0): slot 0 is taken; at slot 1, D's probe (1) beats B's (0),
so D steals slot 1 and B is displaced; B probes on past C and lands in slot 3 (probe 2).
  Keys          [ A, D, C, B,   ]
  Probe Lengths [ 0, 1, 1, 2, 0 ]

Steal from the probe rich, give to the probe poor.

Refinement: track the average probe length and start lookups there.
  Keys          [ A, D, C, B,   ]
  Probe Lengths [ 0, 1, 1, 2, 0 ]   Average: 1
  Lookup D: D hashes to 0, so start probing at 0 + 1 = slot 1.
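A minimal Go sketch of Robin Hood insertion as walked through above: the incoming key keeps probing, and whenever its probe distance exceeds the resident key's, the two swap and the displaced key continues probing. The hash function is hard-wired to the example (A and D hash to 0, B and C to 1), it assumes the table has a free slot, and it is not the index file's implementation.

package main

import "fmt"

type table struct {
	keys   []string
	probes []int
	hash   func(string) int // home slot for a key
}

// insert assumes the table still has a free slot.
func (t *table) insert(key string) {
	pos, probe := t.hash(key), 0
	for {
		i := (pos + probe) % len(t.keys)
		if t.keys[i] == "" {
			t.keys[i], t.probes[i] = key, probe
			return
		}
		if probe > t.probes[i] {
			// "Probe rich" resident: swap, then carry the displaced key forward.
			key, t.keys[i] = t.keys[i], key
			probe, t.probes[i] = t.probes[i], probe
			pos = t.hash(key)
		}
		probe++
	}
}

func main() {
	home := map[string]int{"A": 0, "B": 1, "C": 1, "D": 0}
	t := &table{
		keys:   make([]string, 5),
		probes: make([]int, 5),
		hash:   func(k string) int { return home[k] },
	}
	for _, k := range []string{"A", "B", "C", "D"} {
		t.insert(k)
	}
	fmt.Println(t.keys, t.probes) // [A D C B ] [0 1 1 2 0]
}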
Cardinality Estimation
HyperLogLog++
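For context, a very small sketch of the basic HyperLogLog idea behind this: hash each series key, use the top bits to pick a register, and track the longest run of leading zeros seen in the remaining bits. HyperLogLog++ adds bias correction and a sparse representation, neither of which is shown here; the precision and hash choice below are illustrative.

package main

import (
	"fmt"
	"hash/fnv"
	"math"
	"math/bits"
)

const p = 14     // precision bits
const m = 1 << p // number of registers

type hll struct{ reg [m]uint8 }

func (h *hll) add(key string) {
	f := fnv.New64a()
	f.Write([]byte(key))
	x := f.Sum64()
	idx := x >> (64 - p)                          // top p bits pick the register
	rho := uint8(bits.LeadingZeros64(x<<p|1)) + 1 // rank of the first 1-bit in the rest
	if rho > h.reg[idx] {
		h.reg[idx] = rho
	}
}

func (h *hll) estimate() float64 {
	alpha := 0.7213 / (1 + 1.079/float64(m))
	sum := 0.0
	for _, r := range h.reg {
		sum += 1 / math.Pow(2, float64(r))
	}
	return alpha * m * m / sum
}

func main() {
	var h hll
	for i := 0; i < 100000; i++ {
		h.add(fmt.Sprintf("cpu,host=%d#idle", i))
	}
	fmt.Printf("estimated distinct series: %.0f\n", h.estimate())
}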
More Details & Write Up to Come!