FOEDUS:OLTPEngineforaThousandCoresandNVRAM
HideakiKimuraHPLabs,PaloAlto,CA
SlidesBy:HideakiKimuraPresentedBy:AmanPreetSingh
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 4/16
Next-Generation Server Hardware?
HPThe Machine
UC BerkeleyFirebox
IntelRack Scale
Architecture...
Differences Commonalities
• HP Photonics? Infiniband? QPI?• HP Memristor? PCM? Flash?• ...
• 1,000s of CPU Cores• Fast Interconnect• Huge Low-Latency NVRAM• Endurance will be an issue
GAP ?• Traditional OLTP databases do not scale to large number of
cores.• Recent main-memory databases achieve better scalability but
they do not efficiently handle NVRAM.
GAP?• Softwareshouldplacedatawisely.• Systemonachip.• Systemscapableofhandlingcachein-coherrent architectures.• Non-uniformMemoryaccess(NUMA)awaresystems.
• Aimshouldbetotrytomakethedatalocal.
Solution ?• Toprovidethesecapabilities,FOEDUSwasbuilt.• Groundupdatabase• Opensource• FullyACID• Designedtoscaleuptothousandsofcores• MakesthebestuseofDRAMforOLTP,• AndNVRAMforOLAPqueries.
FOEDUS Key Principles• Bringin-memoryspeedtoNVRAM• SILO-likeLightweightOptimisticConcurrencyControl
• Dual-Page:physicallyindependent,logicallyequivalent• MutableVolatilePagesinDRAM• ImmutableSnapshotPagesinNVRAM
• Master-Tree:SimpleandScalableOCCforNVRAM• Masstree +FosterB-Tree+Foster-Twin
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 6/16
SILO: Lightweight/Decentralized OCC [Tu et al]
Read Version
Fence
Read Data
Fence
Read VersionWrite Epoch
Write Data
(consumeor acquire)
(acquire)
Same Version?
yes, commit
no, abort(retry).
Writer Side(After Commit)
Com
mit
Tim
e
EpochStatus-
bits Data (Payload)Version (64 bits)R
ecor
d
Fence
(==unlock)
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 7/16
SILO: Decentralized Logger with Epoch [Tu et al]
core
SOC
Circular Log Buffer DiskEpoch
X~YLog
Durable Committed Tail Epoch Y~ZLog
Epoch File
- Global DurableEpoch
- Offsets etc
Log Writer
Log Manager
.........
For each epoch(10ms~50ms)
fsync
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 8/16
SILO’s OCC Benefits
9 Extremely Lightweight and ScalableBlock-free (lock-free with retries)No Write for Reads (cf., read-lock)
9 In-page Lock MechanismNo centralized lock managerCache Friendly
But, how can we go beyond DRAM?
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 9/16
In-Memory DBMS Beyond DRAM
DiskPages
All Key/All
Index
Tuple Data
a) Traditional Databases b) H-Store/Anti-Cache
Evict/Page-In
Evict/Retrieve
Bufferpool
DRAM
NVM
Xct
ColdStore
d) FOEDUSc) Hekaton/Siberia
BloomFilters
Volatilepages
Snapshotpages
Dual Pages
HotStore
[Pavlo et al.]
[Larson et al.]
Physically Dependent
No LogicalEquivalence
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 10/16
Logically Equivalent, Physically Independent Dual�Volatile Page: Mutable, on DRAM�Snapshot Page: Immutable, on NVRAM
NULL NULL
X NULL
NULL Y
X Y
: nullptr
: Volatile-Page is just made. No Snapshot yet.
: Snapshot-Page is latest. No modification.
: X is the latest truth, but Y may be equivalent.
VolatilePointer
SnapshotPointer
Dro
p Vo
lati
le (i
f X≡Y
)
Mod
ify/
Add
Inst
all-
SP
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 18/16
Log Gleaner
Transactions
On NVRAM
Epoch X~Y
Log FileLog W
riter
StratifiedSnapshots
In SOC-2
Dual Pages
Snapshot Cache
On DRAM
VolatilePages
0xABCD
SP1,PID.
null
SP2,PID.
SP1 SP2 SP3
Epoch Y~Z
Log File
Vola. Ptr.
Snap. Ptr.
SnapshotPages
SOC-
1
SOC-
2
SequentialDump
TID Tuple
TID Tuple
Read-Set
Write-Set
Durable CurrentCommitted
SILO-like DecentralizedLogging and OCC
MasterTree
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 19/16
1. Assign Partitions
2. Run Mappers/Reducers
3. Combine Root/Metadata
4. Install/Drop Pointers
Part. 1
Part. 2
Log Files
Part. 1
Part. 2
Log Files
SortedRuns
Log Files
Part. 1
Part. 2
Buff.
New Snapshot
Mapper
Reducer 1 Reducer 2
Mapper
Part 1 Part 2
Batch-Apply
Constructing Stratified Snapshot from Logical Logs
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 20/16
Benefits of Stratified Snapshots
¾Serializable transactions check only a single version of data ↔ LSM-Trees¾Drastically more scalable and efficient construction of NVRAM-resident data pages¾Separate from transactions/logging
¾Everything Batched, Processed in a tight loop¾Large Sequential Write only. No frequent flush.
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 21/16
Master-Tree in a Nutshell
�Mass-tree: Cache Crafty B-tree and OCC with in-page Lock Mechanism.�Foster B-tree: Single incoming pointer via foster-child to ease page in/out.�Foster-Twins: Drastically Simplifies OCC and Reduces Aborts/Retries.
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 22/16
OCC Problem 1: Unnecessary Retries/Aborts
Mass-tree
1 10 1 5 5 10
RCU
1 100
Further, SILO must take Page-Version set and abort if changed by pre-commit. Lots of Aborts!
Global Retry!
Local Retry!
TID 7 TID 7
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 24/16
Foster Twins here to Save You!
• Strong Invariants to drastically simplify OCCand eliminate unnecessary aborts/retries
• Still everything in-page• No per-tuple GC or compaction/migration• No centralized data structure
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 25/16
Mass-tree
1 10 1 5 5 10
RCU
Foster B-Tree
1015
5 10
Adopt
Bw-Tree
Master-Tree
1 101 5 5 10
Retired1 10
Adopt
1 100 1 100
1 100 1 100
PID Ptr
P
Q
Mapping Table
Foster-child
1 5 5 10
FosterTwins
Pre-commit Verify
TrackMoved
• All data/metadata placed in-page
• Immutable range• All information
always track-able• All invariants
recursively hold
P
10
Q1
55 10
Split Δ
Moved
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 26/16
Foster-Twins Commit Protocol
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 11/16
Dual Pages: Benefits
9 Snapshot Pages are ImmutableDrastically simplifies Caching/Replication/Traversal/Commit/etc
9 Physically IndependentTransactions never interfere w/ construction/retrieval/eviction of Snapshot Pages
9 Logically Equivalent� No Bloom Filter or global data needed for Serializability� Volatile Pages guaranteed to contain latest data� Snapshot Pages guaranteed to be complete as of snapshot-epoch
↔ LSM-Trees
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 12/16
Experiments
� TPC-C, Serializable� vs. H-Store, SILO, Shore-MT� (Traditional DBs (e.g., MySQL) are too slow to compare with.)
� HP Superdome X (DragonHawk):16 sockets, 240 cores, 12TB DRAM.
� Emulated NVRAM
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 13/16
Experiment 1: vs In-Memory DBMS
1.E+03
1.E+04
1.E+05
1.E+06
1.E+07
0 20 40 60
TPC-
C Th
roug
hput
[TPS
]
Remote Transactions [%]
Observations1. SILO-style OCC much more
resilient to contention2. 100x~ faster than H-Store.3. More Cores = More Speedup
FOEDUS
SILO
H-Store
CoresSpeed-up[/H-Store]
16 33x
60 66x
240 394x
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 14/16
Experiment 2: DBMS on NVRAM (emulated)
1.E+04
1.E+05
1.E+06
1.E+07
1.E+02 1.E+03 1.E+04 1.E+05
TPC-
C Th
roug
hput
[TPS
]
Emulated NVRAM Latency [ns]
Environments- tmpfs-based NVRAM
emulation (varied latency)
- Data/xlog in NVRAM- H-Store w/ anti-cache
Observations1. FOEDUS 100x~ faster2. Performs best when
NVRAM read <10us
FOEDUS
FOEDUS (No Snapshot Cache)
H-Store + Anticache
Shore-MT
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 15/16
Experiment 3: OLAP Workload
1.E+04
1.E+05
1.E+06
1.E+07
1.E+08
Thro
ughp
ut [T
PS]
MAX_OL_CNT (Size of DB)
Workload- 30x Larger Data- Cursor-Heavy- Read-Only- Volatile Pages Only vs
Snapshot Pages Only
Observations1. 100x~ faster than
H-Store.2. FOEDUS runs faster
when database is cold (!!)
In Paper: how Dual Pages achieve it15 100 500
FOEDUS-Volatile
FOEDUS-Snapshot
H-Store
Ideas• ThekeyideaofFEODUSistousethemaster-treemodeltoreducecontention,makingOCClightweight.Canthisbeextendedtodistributedmemorysystemswithsharednothingtopology,especiallywherereadingremotememoryisnotverycostly,likeRDMA?
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 17/16
Reserved Slides
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 16/16
Ongoing Work
9Try on 1,000 Cores~9Further Resilience to ContentionsCombining Pessimistic Approach When Beneficial9SIMD, Xeon-PhiLog-Gleaner’s Sorting, Page Construction, etc9Non-C++ InterfaceSQL/JDBC/ODBC for OLAP. But, for OLTP???
© Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. 27/16
Easy OCC with Foster Twins
�Simple, Robust, and Efficient Search:
• No Page-Version-set nor unnecessary aborts: All Records/TIDs always track-able via Foster-Twin!
Find a pointer that probably contains the key;Follow the page pointer;If the page’s key-range contain the key: go on;Else: locally retry;
No hand-over-hand verification/split counters.No unnecessary local retry. Absolutely no global retry.
Top Related