A Technical Introduction to WiredTiger
![Page 2: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/2.jpg)
Presenter
• Keith Bostic
• Co-architect, WiredTiger
• Senior Staff Engineer, MongoDB
Ask questions as we go, or [email protected]
![Page 3: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/3.jpg)
This presentation is not…
• How to write stand-alone WiredTiger apps – contact [email protected]
• How to configure MongoDB with WiredTiger for your workload
![Page 4: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/4.jpg)
WiredTiger
• Embedded database engine
  – general purpose toolkit
  – high performing: scalable throughput with low latency
• Key-value store (NoSQL)
• Schema layer
  – data typing, indexes
• Single-node
• OO APIs
  – Python, C, C++, Java
• Open Source
![Page 5: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/5.jpg)
Deployments
• Amazon AWS
• ORC/Tbricks: financial trading solution
And, most important of all:
• MongoDB: next-generation document store
![Page 6: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/6.jpg)
You may have seen this:
![Page 7: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/7.jpg)
or this…
![Page 8: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/8.jpg)
MongoDB’s Storage Engine API
• Allows different storage engines to "plug in"
  – different workloads have different performance characteristics
  – mmapV1 is not ideal for all workloads
  – more flexibility
    • mix storage engines on same replica set/sharded cluster
• Opportunity to innovate further
  – HDFS, encrypted, other workloads
• WiredTiger is MongoDB’s general-purpose workhorse
![Page 9: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/9.jpg)
Topics
➢ WiredTiger Architecture
• In-memory performance
• Record-level concurrency
• Compression
• Durability and the journal
• Future features
![Page 10: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/10.jpg)
Motivation for WiredTiger
• Traditional engines struggle with modern hardware:
  – lots of CPU cores
  – lots of RAM
• Avoid thread contention for resources
  – lock-free algorithms, for example, hazard pointers
  – concurrency control without blocking
• Hotter cache, more work per I/O
  – compact file formats
  – compression
  – big blocks
![Page 11: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/11.jpg)
WiredTiger Architecture
[Diagram: the WiredTiger engine. Labeled components: Python API, C API, Java API; Schema & Cursors; Transactions; Snapshots; Row storage; Column storage; Page read/write; Logging; Cache; Block management; Database Files; Log Files.]
![Page 12: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/12.jpg)
Column-store, LSM
• Column-store
  – implemented inside the B+tree
  – 64-bit record number keys
  – valued by the key’s position in the tree
  – variable-length or fixed-length
• LSM
  – forest of B+trees (row-store or column-store)
  – Bloom filters (fixed-length column-store)
• Mix-and-match
  – sparse, wide table: column-store primary, LSM indexes
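The LSM bullet above (a forest of trees, each guarded by a Bloom filter) can be illustrated with a toy model. This is a minimal sketch, not WiredTiger's implementation: `ToyLSM`, `Chunk`, and `BloomFilter` are hypothetical names, each "tree" is just a Python dict, and the filter sizes are arbitrary.

```python
import hashlib


class BloomFilter:
    """Bit array with k hash probes: false positives possible, false negatives not."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


class Chunk:
    """One immutable run; stands in for a single B+tree in the forest."""

    def __init__(self, items):
        self.items = dict(items)
        self.bloom = BloomFilter()
        for key in self.items:
            self.bloom.add(key)


class ToyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}            # newest data, still in memory
        self.memtable_limit = memtable_limit
        self.chunks = []              # flushed chunks, newest last

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.chunks.append(Chunk(self.memtable))   # flush to a new chunk
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for chunk in reversed(self.chunks):            # search newest chunk first
            if not chunk.bloom.might_contain(key):
                continue                               # Bloom filter rules this chunk out
            if key in chunk.items:
                return chunk.items[key]
        return None
```

Writes only ever touch the in-memory table, which is why the design suits random-insert workloads; reads may have to visit several chunks, and the Bloom filters exist to skip most of them.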
![Page 13: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/13.jpg)
Topics
✓ WiredTiger Architecture
➢ In-memory performance
• Record-level concurrency
• Compression
• Durability and the journal
• Future features
![Page 14: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/14.jpg)
Trees in cache
[Diagram: a tree held in cache. A root page points to internal pages, which point to leaf pages; labels mark an "ordinary pointer" and a "non-resident child".]
![Page 15: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/15.jpg)
Hazard Pointers
[Diagram: the same tree of root, internal, and leaf pages, with labels for an "ordinary pointer" and a "non-resident child", annotated "1 memory flush".]
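The idea behind hazard pointers can be sketched as follows: a reader publishes a pointer to the page it is about to use, and eviction backs off from any page that some reader has published. This is an illustration only, with hypothetical names (`HazardTable`, `Page`). Python cannot express memory barriers, so a lock stands in for the "1 memory flush" on the slide; a real implementation would not take a lock here.

```python
import threading


class Page:
    def __init__(self, name):
        self.name = name
        self.resident = True


class HazardTable:
    """One hazard slot per reader; eviction scans every slot before freeing a page."""

    def __init__(self):
        self._slots = {}                # reader id -> page currently in use
        self._lock = threading.Lock()   # stand-in for a memory barrier (assumption)

    def acquire(self, reader_id, page):
        # Publish the hazard pointer first, then re-check that the page is still there.
        with self._lock:
            self._slots[reader_id] = page
            if not page.resident:
                del self._slots[reader_id]
                return False
            return True

    def release(self, reader_id):
        with self._lock:
            self._slots.pop(reader_id, None)

    def try_evict(self, page):
        # Eviction succeeds only if no reader has published a pointer to this page.
        with self._lock:
            if any(held is page for held in self._slots.values()):
                return False
            page.resident = False
            return True
```

The order in `acquire` matters: publishing before re-checking means eviction either sees the hazard pointer or the reader sees that the page is gone.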
![Page 16: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/16.jpg)
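Pages in cache
[Diagram: the cache above the data files. Page images in the data files are loaded as on-disk page images, each with an index; a clean page holds only the image and index, while a dirty page also carries updates.]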
![Page 17: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/17.jpg)
Skiplists
• Updates stored in skiplists
  – ordered linked lists with forward “skip” pointers
• William Pugh, 1989
  – simpler, as fast as binary-search, less space
  – likely binary-search performance plus cache prefetch
  – more space for an existing data set
• Implementation
  – insert without locking
  – forward/backward traversal without locking, while inserting
  – removal requires locking
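For readers who have not met skiplists, here is a minimal single-threaded sketch of insert and search. It is not WiredTiger's code: the class name, the maximum level, and the 1/4 promotion probability are assumptions for illustration, and the lock-free insert mentioned above is not modeled.

```python
import random


class _Node:
    def __init__(self, key, value, level):
        self.key = key
        self.value = value
        self.forward = [None] * level   # forward[i] is the next node at level i


class SkipList:
    MAX_LEVEL = 8

    def __init__(self, seed=0):
        self._head = _Node(None, None, self.MAX_LEVEL)
        self._level = 1
        self._rng = random.Random(seed)

    def _random_level(self):
        level = 1
        while level < self.MAX_LEVEL and self._rng.random() < 0.25:
            level += 1
        return level

    def insert(self, key, value):
        update = [self._head] * self.MAX_LEVEL   # last node before `key` at each level
        node = self._head
        for i in reversed(range(self._level)):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        nxt = node.forward[0]
        if nxt is not None and nxt.key == key:
            nxt.value = value                    # key already present: overwrite
            return
        level = self._random_level()
        self._level = max(self._level, level)
        new = _Node(key, value, level)
        for i in range(level):                   # link bottom-up, level 0 first
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def search(self, key):
        node = self._head
        for i in reversed(range(self._level)):   # start high, drop down a level at a time
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node.value if node is not None and node.key == key else None

    def items(self):
        node = self._head.forward[0]
        while node is not None:
            yield node.key, node.value
            node = node.forward[0]
```

Linking the new node bottom-up is the property that lets a concurrent version publish each level with a single pointer swap: a node is reachable at level 0 before any higher-level shortcut points at it.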
![Page 18: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/18.jpg)
In-memory performance
• Cache trees/pages optimized for in-memory access
• Follow pointers to traverse a tree
• No locking to read or write
• Keep updates separate from initial data
  – updates are stored in skiplists
  – updates are atomic in almost all cases
• Do structural changes (eviction, splits) in background threads
![Page 19: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/19.jpg)
Topics
✓ WiredTiger Architecture
✓ In-memory performance
➢ Record-level concurrency
• Compression
• Durability and the journal
• Future features
![Page 20: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/20.jpg)
Multiversion Concurrency Control (MVCC)
• Multiple versions of records maintained in cache
• Readers see most recently committed version
  – read-uncommitted or snapshot isolation available
  – configurable per-transaction or per-handle
• Writers can create new versions concurrent with readers
• Concurrent updates to a single record cause write conflicts
  – one of the updates wins
  – other generally retries with back-off
• No locking, no lock manager
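The behavior described above can be sketched with a per-record version chain and a per-transaction snapshot. This is a toy model under stated assumptions: `Store`, `Txn`, and `WriteConflict` are hypothetical names, only snapshot isolation is shown, and rollback and retry with back-off are left out.

```python
class WriteConflict(Exception):
    pass


class Store:
    def __init__(self):
        self.versions = {}        # key -> list of (txn_id, value), newest first
        self.committed = set()    # ids of committed transactions
        self.next_txn_id = 1

    def begin(self):
        txn = Txn(self, self.next_txn_id, frozenset(self.committed))
        self.next_txn_id += 1
        return txn


class Txn:
    def __init__(self, store, txn_id, snapshot):
        self.store = store
        self.id = txn_id
        self.snapshot = snapshot  # transactions already committed when this one began

    def _visible(self, writer_id):
        return writer_id == self.id or writer_id in self.snapshot

    def read(self, key):
        # Walk the chain from newest to oldest and return the first visible version.
        for writer_id, value in self.store.versions.get(key, []):
            if self._visible(writer_id):
                return value
        return None

    def write(self, key, value):
        chain = self.store.versions.setdefault(key, [])
        # If the newest version was written by a transaction this one cannot see,
        # the two updates are concurrent: the earlier writer wins.
        if chain and not self._visible(chain[0][0]):
            raise WriteConflict(key)
        chain.insert(0, (self.id, value))

    def commit(self):
        self.store.committed.add(self.id)
```

Readers never block: a reader that started before a writer committed keeps seeing the older version, and the only cost of a conflict is that the losing writer has to retry.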
![Page 21: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/21.jpg)
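Pages in cache
[Diagram: the cache above the data files, as before. Page images in the data files are loaded as on-disk page images, each with an index; a clean page holds only the image and index, while a dirty page also carries updates, here labeled as a skiplist.]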
![Page 22: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/22.jpg)
MVCC In Action
[Diagram: three views of an on-disk page image and its index. In the first, update1(txn, value) hangs off an index entry; in the second, update2(txn, value) is chained with update1(txn, value); the third shows updates attached to the index.]
![Page 23: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/23.jpg)
Topics
✓ WiredTiger Architecture
✓ In-memory performance
✓ Record-level concurrency
➢ Compression
• Durability and the journal
• Future features
![Page 24: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/24.jpg)
Block manager
• Block allocation
  – fragmentation
  – allocation policy
• Checksums
  – block compression is at a higher level
• Checkpoints
  – involved in durability guarantees
• Opaque address cookie
  – stored as internal page key’s “value”
• Pluggable
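To make the checksum and address-cookie bullets concrete, here is a small sketch of a block manager that hands back an opaque cookie on write and verifies a checksum on read. The cookie layout (offset, size, CRC32), the `ToyBlockManager` name, and the in-memory "file" are assumptions for illustration, not WiredTiger's actual format.

```python
import io
import struct
import zlib

# Hypothetical cookie layout: 64-bit offset, 32-bit size, 32-bit checksum.
_COOKIE = struct.Struct("<QII")


class ToyBlockManager:
    def __init__(self):
        self._file = io.BytesIO()    # stands in for a data file

    def write(self, payload: bytes) -> bytes:
        """Append a block and return an opaque address cookie for it."""
        offset = self._file.seek(0, io.SEEK_END)
        self._file.write(payload)
        return _COOKIE.pack(offset, len(payload), zlib.crc32(payload))

    def read(self, cookie: bytes) -> bytes:
        """Fetch the block a cookie refers to, verifying its checksum."""
        offset, size, checksum = _COOKIE.unpack(cookie)
        self._file.seek(offset)
        payload = self._file.read(size)
        if len(payload) != size or zlib.crc32(payload) != checksum:
            raise IOError("block checksum mismatch")
        return payload
```

Callers treat the cookie as opaque bytes, which is what allows a parent page to store it as the "value" for a child without knowing how blocks are laid out, and what makes the block manager replaceable.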
![Page 25: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/25.jpg)
Write path
[Diagram: the cache above the data files, as in "Pages in cache". A dirty page's on-disk page image, index, and updates are reconciled during write to produce a new page image in the data files.]
![Page 26: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/26.jpg)
In-memory Compression
• Prefix compression
  – index keys usually have a common prefix
  – rolling, per-block, requires instantiation for performance
• Huffman/static encoding
  – burns CPU
• Dictionary lookup
  – single value per page
• Run-length encoding
  – column-store values
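Two of the techniques above are easy to show in a few lines. The sketch below illustrates rolling prefix compression of sorted keys and run-length encoding of repeated values; the function names and the (shared length, suffix) representation are assumptions for illustration, not WiredTiger's page format.

```python
import os
from itertools import groupby


def prefix_compress(sorted_keys):
    """Store each key as (shared_prefix_length, suffix) relative to the previous key."""
    out, prev = [], ""
    for key in sorted_keys:
        shared = len(os.path.commonprefix([prev, key]))
        out.append((shared, key[shared:]))
        prev = key
    return out


def prefix_decompress(entries):
    """Rebuild full keys; each one depends on the key before it."""
    keys, prev = [], ""
    for shared, suffix in entries:
        prev = prev[:shared] + suffix
        keys.append(prev)
    return keys


def rle_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    return [value for value, count in pairs for _ in range(count)]
```

Because the prefix scheme is rolling, reading one key means rebuilding it from the keys before it in the block, which is presumably why the slide notes that keys need to be instantiated for performance.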
![Page 27: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/27.jpg)
On-disk Compression
• Compression algorithms:
  – snappy [default]: good compression, low overhead
  – LZ4: good compression, low overhead, better page layout
  – zlib: better compression, high overhead
  – pluggable
• Optional
  – compressing filesystem instead
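The trade-off between compression ratio and overhead can be seen with Python's standard `zlib` module, the only one of the three algorithms above that ships with the standard library. This is a sketch on a synthetic page image, not a benchmark: the record layout is made up, and the two zlib levels merely stand in for "cheaper" versus "tighter" compressors.

```python
import zlib


def compress_block(page_image: bytes, level: int = 6) -> bytes:
    """Compress a page image just before it is written out."""
    return zlib.compress(page_image, level)


def decompress_block(block: bytes) -> bytes:
    """Restore the page image after reading the block back."""
    return zlib.decompress(block)


# A synthetic page image: repetitive records, which compress well.
page = b"".join(
    b"flight=%05d;carrier=AA;origin=SFO;dest=JFK\n" % i for i in range(500)
)

fast = compress_block(page, level=1)    # less CPU, usually a larger result
small = compress_block(page, level=9)   # more CPU, usually a smaller result
print(len(page), len(fast), len(small))
```

Since compression happens on the way to disk, the in-memory page stays uncompressed and only I/O pays the CPU cost.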
![Page 28: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/28.jpg)
Compression in Action
Flights database
![Page 29: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/29.jpg)
Topics
✓ WiredTiger Architecture
✓ In-memory performance
✓ Record-level concurrency
✓ Compression
➢ Durability and the journal
• Future features
![Page 30: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/30.jpg)
Journal and Recovery
• Write-ahead logging (aka journal) enabled by default
• Only written at transaction commit
  – only write redo records
• Log records are compressed
• Group commit for concurrency
• Automatic log archival / removal
  – bounded by checkpoint frequency
• On startup, find a consistent checkpoint in the metadata
  – use the checkpoint to figure out how much to roll forward
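The recovery rule above (start from a consistent checkpoint, then roll the log forward) can be sketched with a toy engine. The names (`ToyEngine`, `lsn`) and the in-memory "durable" structures are assumptions for illustration; log compression and group commit are not modeled.

```python
import copy


class ToyEngine:
    def __init__(self):
        self.data = {}              # volatile state, lost on a crash
        self.log = []               # durable: redo records as (lsn, changes)
        self.checkpoint = (0, {})   # durable: (last lsn covered, snapshot of data)
        self.next_lsn = 1

    def commit(self, changes):
        # Write the redo record first, then apply it to the in-memory state.
        self.log.append((self.next_lsn, dict(changes)))
        self.next_lsn += 1
        self.data.update(changes)

    def take_checkpoint(self):
        covered = self.next_lsn - 1
        self.checkpoint = (covered, copy.deepcopy(self.data))
        # Records covered by the checkpoint are no longer needed for recovery.
        self.log = [rec for rec in self.log if rec[0] > covered]

    def crash_and_recover(self):
        covered, snapshot = self.checkpoint
        self.data = copy.deepcopy(snapshot)
        for lsn, changes in self.log:   # roll forward from the checkpoint
            if lsn > covered:
                self.data.update(changes)
```

Only redo records are needed because uncommitted changes never reach the log, and the amount of log that must be kept is bounded by how often checkpoints run.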
![Page 31: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/31.jpg)
Durability without Journaling
• MongoDB’s MMAP storage requires the journal for consistency
  – running with “nojournal” is unsafe
• WiredTiger is a no-overwrite data store
  – with “nojournal”, updates since the last checkpoint may be lost
  – data will still be consistent
  – checkpoints every N seconds by default
• Replication can guarantee durability
  – the network is generally faster than disk I/O
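Why a no-overwrite store stays consistent without a journal can be shown with a toy model: new versions are always written to fresh blocks, and a checkpoint is a single pointer switch to the newest root. The class name and the one-dict-per-version "tree" are simplifying assumptions, not how WiredTiger lays out data.

```python
class NoOverwriteStore:
    def __init__(self):
        self.blocks = []              # append-only "file": blocks are never modified
        self.checkpoint_root = None   # durable pointer to the last checkpointed root
        self.live_root = None         # newest root, not durable until a checkpoint

    def _append(self, block):
        self.blocks.append(block)
        return len(self.blocks) - 1

    def put(self, key, value):
        current = dict(self.blocks[self.live_root]) if self.live_root is not None else {}
        current[key] = value
        self.live_root = self._append(current)   # new version goes to a new block

    def checkpoint(self):
        self.checkpoint_root = self.live_root    # one atomic pointer switch

    def crash(self):
        # Without a journal, everything after the last checkpoint is discarded.
        self.live_root = self.checkpoint_root

    def get(self, key):
        if self.live_root is None:
            return None
        return self.blocks[self.live_root].get(key)
```

After a crash the store falls back to the last checkpoint: recent updates are lost, but the blocks the checkpoint refers to were never overwritten, so what remains is a consistent image.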
![Page 32: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/32.jpg)
Topics
✓ WiredTiger Architecture
✓ In-memory performance
✓ Record-level concurrency
✓ Compression
✓ Durability and the journal
➢ Future features
![Page 33: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/33.jpg)
![Page 34: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/34.jpg)
What’s next for WiredTiger?
• Our Big Year of Tuning
  – applications doing “interesting” things
  – stalls during checkpoints with 100GB+ caches
  – MongoDB capped collections
• Encryption
• Advanced transactional semantics
  – updates not stable until confirmed by replica majority
![Page 35: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/35.jpg)
WiredTiger LSM support
• Random insert workloads
• Data set much larger than cache
• Query performance less important
• Background maintenance overhead acceptable
• Bloom filters
![Page 36: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/36.jpg)
![Page 37: A Technical Introduction to WiredTiger](https://reader034.fdocuments.net/reader034/viewer/2022042605/586f912f1a28ab54768b7b4b/html5/thumbnails/37.jpg)
Benchmarks
Mark Callaghan at Facebook: http://smalldatum.blogspot.com/