Post on 24-Jan-2021
Optimizing Flash-based Key-value Cache Systems
The 8th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage’16)
†Department of Computing, Hong Kong Polytechnic University
‡Computer Science & Engineering, Louisiana State University
Zhaoyan Shen†, Feng Chen‡, Yichen Jia‡, Zili Shao†
Key-value Information
1
• Key-value access is dominant in web services
– Many apps simply store and retrieve key-value pairs
– Key-value cache is the first line of defense• Benefits: Improve throughput, reduce latency, reduce server load
– In-memory KV cache is popular (Memcache)
• High speed but has cost, power, capacity problem
Key-value Slabs (Flash LBA)
Log-Structured Logical Flash Space
Flash based Key-value Cache
2
+
Slab(Slab, Slot)Key
Hash Index (Memory)
SHA-1(Key_1)
Key-value cache Flash SSD
• In-memory hash map to track key-to-value mapping
• Slabs are used in a log-structured way
• Updated value item written to a new location and old values recycled later
Critical Issues
3
• Redundant mapping
• Double garbage collection
• Over-over-provisioning
Critical Issue 1: Redundant Mapping
4
• Redundant mapping at application- and FTL-level – KVC: An in-memory hash table (Key Slab, Offset)
– FTL: An on-device page mapping table (LBA PBA)
• Problems – Two mapping structures (unnecessarily) co-exist at two levels
– A significant waste of on-device DRAM space (e.g., 1GB for 1TB)• The on-device DRAM buffer is costly, unreliable, and could be used for buffering.
SHA-1(Key)
AAA,BBBBBB,CCCCCC,DDDDDD,EEE
Slab SpaceHash table
(1, 2)K1
Mapping Table
0123
AAA, BBBBBB,CCCCCC,DDDDDD,EEE
……
PBALBA
Critical Issues
KVC software Mapping FTL Mapping in hardware
Critical Issue 2: Double Garbage Collection
5
• Garbage collection (GC) at app- and FTL- levels– KVC: Recycle deleted or changed key-value items
– FTL: Recycle trimmed or changed pages
• Problems– Semantic validity of a key-value entry is not seen at FTL
– Redundant data copy operation
SHA-1(Key)
Hash table
(1, 2)K1
AAA,BBBBBB,CCCCCC,DDDDDD,EEE
PBASlab Space
AAA,BBBBBB,CCCCCC,DDDDDD,EEE
012345
LBA
CCC,DDD
Mapping Table
(2, 1) AAA,BBBBBB,CCCCCC,DDDDDD,EEE
CCC,DDD
FTL-level GCKVC-level GC
Critical Issues
Critical Issue 3: Over-over-provisioning
6
• Over-provisioning at FTL-level– FTL has a portion (20-30%) of flash as Over-Provisioning Space (OPS)
– OPS space is invisible and unusable to the host applications
• Problems– OPS is reserved for dealing the worst-case situation, as a storage
– Over-over provisioning for Key-value caches• Key-value caches are dominated by read (GET) traffic, not writes
– Key-value cache hit ratio is highly sensitive to usable cache size• If 20-30% space can be released, the cache hit ratio can be greatly improved
SHA-1(Key)
AAA,BBBBBB,CCCCCC,DDDDDD,EEE
Slab SpaceHash table
(1, 2)K1
LBA
0123
AAA,BBBBBB,CCCCCC,DDDDDD,EEE
……
PBA
……
FFF,GGGHHH,IIIJJJ, KKK
Mapping Table
Unusable
Critical Issues
Semantic Gap Problem
7
Semantic Gap
• Fine-grained GC• Key-to-value mapping• Validity of slab entries
• Physical data layout on flash• Direct flash memory control• Proper mapping granularity
In the current SW/HW architecture, we can do little to address these three issues.
Key-value cache Flash SSD
Critical Issues
Optimization Goals
8
• Redundant mapping Single mapping
• Double garbage collection App-driven GC
• Over-over-provisioning Dynamic OPS
Design Overview
9
• An enhanced flash-aware key-value cache
• A thin intermediate library layer (libssd)
• A specialized flash memory PCI-E SSD hardware
Flash SSD
Garbage Collection
PageMapping
Bad Block Mgm.
Flash Operations
WearLeveling
OPSMgm.
Operating Systems
Page CacheDeviceDriver
I/OScheduling
Key-value Cache
K/V Mapping
Slab Mgm.CacheMgm.
Open-channel SSD
Flash Operations
Operating Systems
Page CacheDeviceDriver
I/OScheduling
Library (libssd)
OperationConversion
Slab/FlashTranslation
Bad Block Mgm.
Key-value Cache
Key/SlabMapping
KVC-drivenGC
OPSManagement
Slab Manager Cache Manager
An Enhanced Flash-aware Key-value Cache
10
• Slab management
• Unified direct mapping
• Garbage collection
• OPS management
Key-value Cache
• Our choice – Directly use a flash block as a slab (8MB)
• Slab buffer: An in-memory slab maintained for each class
– Parallelize and asynchronize the slab write I/Os to the flash memory
• Round-robin allocation of in-flash slab for load-balancing across channels
In-Memory Slab Buffer
Slab Management
11
Class i Class i+1 Class nClass i+2
Key/Value
Channel #0 Channel #1 Channel #2 Channel #3 Channel #4
Key-value Cache
FlashSSD
SHA-1(Key)
AAA,BBBBBB,CCCCCC,DDDDDD,EEE
Slab SpaceHash table
(1, 2)K1
Mapping Table
0123
AAA,BBBBBB,CCCCCC,DDDDDD,EEE
……
Flash Memory
Unified Direct Mapping
12
• Collapse multiple levels of indirect mapping to only one– Prior mapping: KeySlabLBAPBA
– Current mapping: KeySlab (i.e., Flash Block)
• Benefits– Removes the time overhead for lookup intermediate layers (Speed+)
– Only one single must-have in-memory hash table is needed (Cost-)
– On-device RAM space can be completely removed (or for other uses)
Key-value Cache
SHA-1(Key)
Hash table
(Slab, Offset)K1
AAA,BBBBBB,CCCCCC,DDDDDD,EEE
Flash Memory
…………
Slab #1
Slab #2
CDC, CDC
• One single GC is driven directly by key-value cache system– All slab writes are in units of blocks (no need for device-level GC)
– GC is directly triggered and controlled by application-level KVC
• Two GC policies– Greedy GC: the least occupied slab is evicted to move minimum slots
– Quick clean: the LRU slab is immediately dropped recycling valid slots
– Adaptively used for different circumstances (busy or idle)
Target Slab
App-driven Garbage Collection
13
Key-value Cache
Greedy GC
AAA,BBBBBB,CCCCCC,DDDDDD,EEE
copyCCC,DDD
Victim Slab
AAA,BBBBBB,CCCCCC,DDDDDD,EEE
Quick Clean
eee,fffccc,dddbbb,cccaaa,bbb
LRU Slab MRU Slab
333,444555,666777,888888,999
Over-Provisioning Space Management
14
• Dynamically tuning OPS space online– Rationale – KVC is typically read-intensive and OPS can be small
– Goal – keep just enough OPS space (adaptive to intensity of writes)
• OPS management policies– Heuristic method: An OPS window (WL and WH) to estimate size
• Low watermark hit – Trigger quick clean, raise the OPS window
• High watermark hit – OPS is over-allocated, lower the OPS window
Key-value Cache
OP
S
Time
WH
WL
Max OSP
Quick clean initiated
Preliminary Experiments
15
• Implementation– Key-value cache on Twitter’s Fatcache to fit hardward
– Libssd Library (621 lines of code in C)
• Experimental Setup– Intel Xeon E-1225, 32GB Memory, 1TB Disk, Open-Channel SSD
– Ubuntu 14.04 LTS, Linux 3.17.8, Ext4 filesystem
• Hardware: Open-channel SSD– A PCI-E based with 12 channel, and 192 LUNs
– Direct control to the device (via ioctl interface)
Experiments
SET – Throughput and Latency
16
• SET Workloads: 40milliom requests of 1KB key/value pairs
• Both set throughput/latency from our scheme are the best
Experiments
Our Scheme
Stock Fatcache
Our Scheme Stock Fatcache
Conclusion
• KV stores become critical as they are one of the most important building blocks in modern web infrastructures and high-performance data-intensive applications.
• We build a highly efficient flash-based cache system which enables three benefits, namely a unified single-level direct mapping, a cache-driven fine-grained garbage collection, and an adaptive over-provisioning scheme
• We are implementing a prototype on the Open-Channel SSD hardware and our preliminary results show that it is highly promising
17
Conclusion
Thank You !
19
Backup Slides
GET – Throughput and Latency
20
• GET performance is largely determined by the raw speed
• GET latencies are among the lowest in the set of SSDs
Our Scheme
Stock Fatcache
Our SchemeStock Fatcache
SET Latencies – A Zoom-in View
21
• The direct control successfully removes high cost GC effects
• Quick clean removes long I/Os under high write pressure
Our Scheme Stock Fatchche + KingSpec
Block Erase Count
22
• Trace collected with running Fatcache on Samsung SSD
• Block trace is replayed on MSR’s SSD simulator for erase count
• Our solution reduces erase count by 30%
0
500
1000
1500
2000
2500
3000
3500
1 2 4 6 8
Erase Count
SSDSim Open-channel SSD
Quick Clean kicks in
Effect of the In-memory Slab Buffer
23
• 10x buffer size difference does not affect latency significantly
• A small (56MB) in-memory slab buffer is sufficient for use