Flash-based (cloud) storage systems
Lecture 25
Aditya Akella
• BufferHash: invented in the context of network de-dup (e.g., inter-DC log transfers)
• SILT: more “traditional” key-value store
Cheap and Large CAMs for High Performance Data-Intensive Networked Systems
Ashok Anand, Chitra Muthukrishnan, Steven Kappes, and Aditya Akella
University of Wisconsin-Madison
Suman Nath
Microsoft Research
New data-intensive networked systems
Large hash tables (10s to 100s of GBs)
New data-intensive networked systems
[Figure: two WAN optimizers, one in a data center and one in a branch office, connected over a WAN. Each maintains an object store (~4 TB) and a hash table (~32 GB): objects are split into 4 KB chunks, and a 20 B key is looked up to obtain a chunk pointer.]
Requirements:
• Large hash tables (32 GB)
• High-speed (~10 K/sec) inserts and evictions
• High-speed (~10 K/sec) lookups for a 500 Mbps link
New data-intensive networked systems
• Other systems
– De-duplication in storage systems (e.g., Data Domain)
– CCN cache (Jacobson et al., CoNEXT 2009)
– DONA directory lookup (Koponen et al., SIGCOMM 2007)

Common need: cost-effective large hash tables, i.e., Cheap Large CAMs (CLAMs)
Candidate options

Option      Random reads/sec   Random writes/sec   Cost (128 GB)
DRAM        300K               300K                $120K+   (too expensive)
Disk        250                250                 $30+     (too slow)
Flash-SSD   10K*               5K*                 $225+    (slow writes)

* Derived from latencies on Intel M-18 SSD in experiments
+ Price statistics from 2008-09

DRAM-based hash tables yield only 2.5 ops/sec/$. How do we deal with the slow writes of flash SSDs?
CLAM design
• New data structure "BufferHash" + Flash
• Key features
– Avoid random writes; perform sequential writes in a batch
• Sequential writes are 2X faster than random writes (Intel SSD)
• Batched writes reduce the number of writes going to Flash
– Bloom filters for optimizing lookups

BufferHash performs orders of magnitude better than DRAM-based traditional hash tables in ops/sec/$
Flash/SSD primer
• Random writes are expensive → avoid random page writes
• Reads and writes happen at the granularity of a flash page → I/O smaller than a page should be avoided, if possible
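As a concrete illustration of the page-granularity rule above, here is a minimal sketch of a writer that accumulates small writes and issues only whole-page writes. The 4 KB page size and the list-backed "device" are assumptions for illustration, not details from the talk.

```python
PAGE = 4096  # assumed flash page size; real devices vary


class PageAlignedWriter:
    """Accumulate small writes and emit only full-page writes,
    avoiding I/O smaller than a flash page."""

    def __init__(self, device):
        self.device = device        # stand-in for the SSD: a list of pages
        self.pending = bytearray()  # bytes not yet forming a full page

    def write(self, data: bytes):
        self.pending += data
        while len(self.pending) >= PAGE:
            # one sequential, page-sized write to the device
            self.device.append(bytes(self.pending[:PAGE]))
            del self.pending[:PAGE]


ssd = []
w = PageAlignedWriter(ssd)
w.write(b"x" * 5000)   # only one full page reaches the device; 904 B stay pending
```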
Conventional hash table on Flash/SSD
Flash
Keys are likely to hash to random locations
Random writes
SSDs: the FTL handles random writes to some extent, but garbage collection overhead is high
~200 lookups/sec and ~200 inserts/sec with the WAN optimizer workload, far below the required 10 K/s and 5 K/s
Conventional hash table on Flash/SSD
DRAM
Flash
Can’t assume locality in requests – DRAM as cache won’t work
Our approach: Buffering insertions
• Control the impact of random writes
• Maintain a small hash table (buffer) in memory
• As the in-memory buffer gets full, write it to flash
– We call the in-flash copy of the buffer an incarnation

Buffer: in-memory hash table (DRAM)
Incarnation: in-flash hash table (Flash SSD)
Two-level memory hierarchy
[Figure: the buffer lives in DRAM; the incarnation table in flash holds incarnations 4 3 2 1, from latest to oldest.]
Net hash table = buffer + all incarnations
Lookups are impacted due to buffers
[Figure: a lookup key is checked in the in-memory buffer first, then in each in-flash incarnation from latest (4) to oldest (1).]
Multiple in-flash lookups. Can we limit them to only one?
Bloom filters for optimizing lookups
[Figure: one in-memory Bloom filter per incarnation; a lookup key is checked against the Bloom filters first, and flash is read only on a filter hit, which may be a false positive.]
Configure carefully! 2 GB of Bloom filters for 32 GB of flash gives a false positive rate < 0.01.
Update: naïve approach
[Figure: updating a key in place means locating it in an old incarnation and overwriting it there.]
In-place updates require expensive random writes. Discard this naïve approach.
Lazy updates
[Figure: the update is handled as a plain insert of (key, new value) into the buffer; (key, old value) stays behind in an older incarnation.]
Lookups check the latest incarnations first, so they always see the new value.
Eviction for streaming apps
• Eviction policies may depend on the application
– LRU, FIFO, priority-based eviction, etc.
• Two BufferHash primitives
– Full Discard: evict all items
• Naturally implements FIFO
– Partial Discard: retain a few items
• Enables priority-based eviction by retaining high-priority items
• BufferHash is best suited for FIFO
– Incarnations are arranged by age
– Other useful policies come at some additional cost
• Details in paper
Issues with using one buffer
• A single buffer in DRAM serves all operations and eviction policies
• High worst-case insert latency
– A few seconds to flush a 1 GB buffer
– New lookups stall during the flush
Partitioning buffers
• Partition buffers based on the first few bits of the key space
[Figure: keys beginning with 0 go to one buffer, keys beginning with 1 to another; each buffer has its own incarnations in flash.]
• Buffer size > page: avoids I/O smaller than a page
• Buffer size >= block: avoids random page writes
• Reduces worst-case latency
• Eviction policies apply per buffer
BufferHash: Putting it all together
• Multiple buffers in memory
• Multiple incarnations per buffer in flash
• One in-memory Bloom filter per incarnation
[Figure: buffers 1..K in DRAM, each with its own column of incarnations in flash.]
Net hash table = all buffers + all incarnations
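The pieces above can be sketched in a few dozen lines. This is an illustrative model, not the paper's implementation: the "flash" is a Python list of immutable dicts, the per-incarnation Bloom filters are stand-in key sets (so no false positives), and the partition count, buffer capacity, and incarnation limit are made-up constants.

```python
from hashlib import blake2b

NUM_BUFFERS = 4          # partitions, chosen by key hash (illustrative)
BUFFER_CAPACITY = 4      # entries per buffer before a flush (tiny, for demo)
MAX_INCARNATIONS = 8     # incarnations retained per buffer (FIFO eviction)


def _h(key: str) -> int:
    return int.from_bytes(blake2b(key.encode(), digest_size=8).digest(), "big")


class BufferHash:
    """Sketch of BufferHash: per-partition in-memory buffer, flushed as
    immutable 'incarnations' (simulated flash), with one Bloom-filter
    stand-in per incarnation."""

    def __init__(self):
        self.buffers = [dict() for _ in range(NUM_BUFFERS)]
        self.incarnations = [[] for _ in range(NUM_BUFFERS)]  # newest first
        self.filters = [[] for _ in range(NUM_BUFFERS)]

    def _part(self, key):
        return _h(key) % NUM_BUFFERS

    def insert(self, key, value):
        p = self._part(key)
        self.buffers[p][key] = value          # lazy update: just insert
        if len(self.buffers[p]) >= BUFFER_CAPACITY:
            self._flush(p)

    def _flush(self, p):
        # one sequential batched write; the buffer becomes the newest incarnation
        self.incarnations[p].insert(0, dict(self.buffers[p]))
        self.filters[p].insert(0, set(self.buffers[p]))
        self.buffers[p].clear()
        if len(self.incarnations[p]) > MAX_INCARNATIONS:  # Full Discard (FIFO)
            self.incarnations[p].pop()
            self.filters[p].pop()

    def lookup(self, key):
        p = self._part(key)
        if key in self.buffers[p]:            # in-memory buffer first
            return self.buffers[p][key]
        for filt, inc in zip(self.filters[p], self.incarnations[p]):
            if key in filt:                   # filter gate, newest to oldest,
                return inc[key]               # so lazy updates win
        return None
```

Because lookups scan incarnations from newest to oldest, a lazily updated key returns its latest value even while the stale copy still sits in an older incarnation.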
Latency analysis
• Insertion latency
– Worst case ∝ size of buffer
– Average case is constant for buffer > block size
• Lookup latency
– Average case ∝ number of incarnations
– Average case ∝ false positive rate of the Bloom filters
Parameter tuning: Total size of buffers
Given fixed DRAM, how much should be allocated to buffers (total = B1 + B2 + … + BN)?
• # incarnations = flash size / total buffer size
• Lookup cost ∝ # incarnations × false positive rate
• Total Bloom filter size = DRAM − total size of buffers, and the false positive rate increases as the Bloom filters shrink
• Too small is not optimal; too large is not optimal either
• Optimal = 2 * SSD/entry
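The trade-off can be explored numerically. Everything below is an assumption-laden sketch of the reasoning, not the paper's analysis: the entry size, DRAM/flash sizes, and the 0.6185^bits approximation for an optimally configured Bloom filter's false-positive rate are all illustrative.

```python
GB = 2**30
DRAM, FLASH, ENTRY = 4 * GB, 32 * GB, 24   # assumed sizes (24 B/entry)
n = FLASH / ENTRY                          # entries indexed on flash


def expected_flash_reads(buffer_bytes):
    """Per-lookup flash reads: incarnations scanned x false-positive rate."""
    incarnations = FLASH / buffer_bytes
    bloom_bits_per_entry = 8 * (DRAM - buffer_bytes) / n
    fp = 0.6185 ** bloom_bits_per_entry    # optimal-k Bloom approximation
    return incarnations * fp


candidates = [2**i * 64 * 2**20 for i in range(6)]   # 64 MB .. 2 GB
best = min(candidates, key=expected_flash_reads)
print(f"best total buffer size: {best / 2**20:.0f} MB")
```

The cost curve is U-shaped: small buffers mean many incarnations, large buffers starve the Bloom filters, and the minimum sits in between.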
Parameter tuning: Per-buffer size
What should the size of an individual buffer (e.g., B1) be?
• Affects worst-case insertion latency
• Adjusted according to application requirements (128 KB – 1 block)
SILT: A Memory-Efficient,High-Performance Key-Value Store
Hyeontaek Lim, Bin Fan, David G. Andersen, Michael Kaminsky†
Carnegie Mellon University, †Intel Labs
2011-10-24
Key-Value Store
Clients
PUT(key, value)
value = GET(key)
DELETE(key)
DELETE(key)
Key-Value Store Cluster
• E-commerce (Amazon)
• Web server acceleration (Memcached)
• Data deduplication indexes
• Photo storage (Facebook)
• SILT goal: use much less memory than previous systems while retaining high performance.
Three Metrics to Minimize
• Memory overhead = index size per entry
– Ideally 0 (no memory overhead)
• Read amplification = flash reads per query
– Limits query throughput
– Ideally 1 (no wasted flash reads)
• Write amplification = flash writes per entry
– Limits insert throughput
– Also reduces flash life expectancy
– Must be small enough for the flash to last a few years
Landscape before SILT
[Chart: read amplification (0-6) vs. memory overhead (0-12 bytes/entry) for FAWN-DS, HashCache, BufferHash, FlashStore, and SkimpyStash; the region near the origin is empty. Can any system get there?]
Solution Preview: (1) Three Stores with (2) New Index Data Structures
[Figure: in memory, the SILT Sorted Index (memory efficient), SILT Filter, and SILT Log Index (write friendly); each indexes a corresponding store on flash.]
• Inserts only go to the Log
• Data are moved in the background
• Queries look up the stores in sequence (from new to old)
LogStore: No Control over Data Layout
• Memory overhead: 6.5+ bytes/entry (SILT Log Index; a naïve hashtable needs 48+ B/entry)
• Write amplification: 1
• Inserted entries are appended to an on-flash log, ordered from older to newer
SortedStore: Space-Optimized Layout
• Memory overhead: 0.4 bytes/entry (SILT Sorted Index)
• Write amplification: high
• Entries live in an on-flash sorted array
• Need to perform bulk-inserts to amortize the cost
Combining SortedStore and LogStore
[Figure: the LogStore (on-flash log + SILT Log Index) is periodically merged into the SortedStore (on-flash sorted array + SILT Sorted Index).]
Achieving both Low Memory Overhead and Low Write Amplification
• SortedStore: low memory overhead, high write amplification
• LogStore: high memory overhead, low write amplification
• Combining them, we can achieve simultaneously:
– Write amplification = 5.4 → 3-year flash life
– Memory overhead = 1.3 B/entry
• With "HashStores", memory overhead = 0.7 B/entry!
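The 3-year figure can be reproduced with back-of-envelope arithmetic. The drive capacity, endurance rating, and ingest rate below are illustrative assumptions, not numbers from the talk.

```python
CAPACITY_GB = 256          # assumed drive capacity
PE_CYCLES = 10_000         # assumed program/erase endurance per cell
WRITE_AMP = 5.4            # SILT's write amplification
INGEST_MB_PER_S = 5.0      # assumed sustained user insert rate

raw_write_budget_gb = CAPACITY_GB * PE_CYCLES      # total writes the drive absorbs
user_write_budget_gb = raw_write_budget_gb / WRITE_AMP
seconds = user_write_budget_gb * 1024 / INGEST_MB_PER_S
years = seconds / (365 * 24 * 3600)
print(f"estimated flash life: {years:.1f} years")
```

Under these assumptions the drive lasts about three years; halving the write amplification would roughly double the lifetime.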
SILT's Design (Recap)
• Memory overhead: 0.7 bytes/entry
• Read amplification: 1.01
• Write amplification: 5.4
[Figure: the LogStore (on-flash log + SILT Log Index) is converted into a HashStore (on-flash hashtables + SILT Filter), and HashStores are merged into the SortedStore (on-flash sorted array + SILT Sorted Index).]
New Index Data Structures in SILT
• Partial-key cuckoo hashing (SILT Filter & Log Index)
– For HashStore & LogStore
– Compact (2.2 & 6.5 B/entry)
– Very fast (> 1.8 M lookups/sec)
• Entropy-coded tries (SILT Sorted Index)
– For SortedStore
– Highly compressed (0.4 B/entry)
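A minimal sketch of the partial-key idea (the table size, tag width, and hash choice are illustrative assumptions; the real SILT structures are far more compact): each slot stores only a short tag plus an offset into the on-flash store, and the alternate bucket is computed from the current bucket and the tag alone, so entries can be displaced during cuckoo insertion without reading full keys from flash.

```python
import hashlib

NBUCKETS = 1 << 10   # power of two, so the XOR trick below is an involution
TAG_BITS = 15        # illustrative partial-key width


def _h(data: bytes) -> int:
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


class PartialKeyCuckoo:
    def __init__(self):
        self.slots = [None] * NBUCKETS   # each slot: (tag, flash_offset)

    def _bucket_and_tag(self, key: bytes):
        hv = _h(key)
        return hv % NBUCKETS, (hv >> 32) & ((1 << TAG_BITS) - 1)

    def _alt(self, i: int, tag: int) -> int:
        # depends only on (bucket, tag): displacement needs no flash read
        return (i ^ _h(tag.to_bytes(2, "big"))) % NBUCKETS

    def insert(self, key: bytes, offset: int, max_kicks: int = 128) -> bool:
        i, tag = self._bucket_and_tag(key)
        for _ in range(max_kicks):
            if self.slots[i] is None:
                self.slots[i] = (tag, offset)
                return True
            victim = self.slots[i]          # kick the resident entry
            self.slots[i] = (tag, offset)
            tag, offset = victim
            i = self._alt(i, tag)           # victim moves to its alternate
        return False   # table too full

    def lookup(self, key: bytes):
        i, tag = self._bucket_and_tag(key)
        for b in (i, self._alt(i, tag)):
            if self.slots[b] is not None and self.slots[b][0] == tag:
                return self.slots[b][1]     # caller verifies full key on flash
        return None
```

Since only tags are stored in memory, a lookup can return a false match; the caller verifies the full key against the flash entry at the returned offset.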
Landscape
[Chart: the same read amplification vs. memory overhead plot; SILT sits near the origin, below and to the left of FAWN-DS, HashCache, BufferHash, FlashStore, and SkimpyStash.]
BufferHash: Backup
Outline
• Background and motivation
• Our CLAM design
– Key operations (insert, lookup, update)
– Eviction
– Latency analysis and performance tuning
• Evaluation
Evaluation
• Configuration
– 4 GB DRAM, 32 GB Intel SSD, Transcend SSD
– 2 GB buffers, 2 GB Bloom filters, 0.01 false positive rate
– FIFO eviction policy
BufferHash performance
• WAN optimizer workload
– Random key lookups followed by inserts
– Hit rate (40%)
– Also used workloads from real packet traces
• Comparison with BerkeleyDB (traditional hash table) on Intel SSD

Average latency   BufferHash   BerkeleyDB
Lookup (ms)       0.06         4.6
Insert (ms)       0.006        4.8

Better lookups! Better inserts!
Insert performance
[CDF of insert latency (ms, log scale) on Intel SSD: BufferHash completes 99% of inserts in < 0.1 ms thanks to buffering, while BerkeleyDB sees 40% of inserts take > 5 ms because random writes are slow.]
Lookup performance
[CDF of lookup latency (ms, log scale) for the 40% hit workload: BufferHash completes 99% of lookups in < 0.2 ms; 60% of lookups don't go to flash at all, and the rest see the ~0.15 ms Intel SSD read latency. BerkeleyDB sees 40% of lookups take > 5 ms due to garbage collection overhead caused by writes.]
Performance in Ops/sec/$
• 16K lookups/sec and 160K inserts/sec
• Overall cost of $400
• 42 lookups/sec/$ and 420 inserts/sec/$
– Orders of magnitude better than the 2.5 ops/sec/$ of DRAM-based hash tables
Other workloads
• Varying fractions of lookups
• Results on Transcend SSD

Average latency per operation:
Lookup fraction   BufferHash   BerkeleyDB
0                 0.007 ms     18.4 ms
0.5               0.09 ms      10.3 ms
1                 0.12 ms      0.3 ms

• BufferHash is ideally suited for write-intensive workloads
Evaluation summary
• BufferHash performs orders of magnitude better in ops/sec/$ than traditional hashtables on DRAM (and disks)
• BufferHash is best suited for a FIFO eviction policy
– Other policies can be supported at additional cost; details in paper
• A WAN optimizer using BufferHash can operate optimally at 200 Mbps, much better than the 10 Mbps with BerkeleyDB
– Details in paper
Related Work
• FAWN (Vasudevan et al., SOSP 2009)
– Cluster of wimpy nodes with flash storage
– Each wimpy node has its hash table in DRAM
– We target:
• Hash tables much bigger than DRAM
• Low-latency as well as high-throughput systems
• HashCache (Badam et al., NSDI 2009)
– In-memory hash table for objects stored on disk
WAN optimizer using BufferHash
• With BerkeleyDB, throughput up to 10 Mbps
• With BufferHash, throughput up to 200 Mbps with Transcend SSD– 500 Mbps with Intel SSD
• At 10 Mbps, average throughput per object improves by 65% with BufferHash
SILT Backup Slides
Evaluation
1. Various combinations of indexing schemes
2. Background operations (merge/conversion)
3. Query latency

Experiment setup:
CPU            2.80 GHz (4 cores)
Flash drive    SATA 256 GB (48 K random 1024-byte reads/sec)
Workload size  20-byte key, 1000-byte value, ≥ 50 M keys
Query pattern  Uniformly distributed (worst case for SILT)
LogStore Alone: Too Much Memory
Workload: 90% GET (50-100 M keys) + 10% PUT (50 M keys)
LogStore+SortedStore: Still Much Memory
Workload: 90% GET (50-100 M keys) + 10% PUT (50 M keys)
Full SILT: Very Memory Efficient
Workload: 90% GET (50-100 M keys) + 10% PUT (50 M keys)
Small Impact from Background Operations
[Throughput drops only from 40 K to 33 K queries/sec when background merge/conversion runs. Workload: 90% GET (100~ M keys) + 10% PUT. One throughput dip is caused by bursty TRIM from the ext4 FS.]
Low Query Latency
[Throughput vs. number of I/O threads, workload: 100% GET (100 M keys). Best throughput at 16 threads; median latency = 330 μs, 99.9th percentile = 1510 μs.]
Conclusion
• SILT provides a memory-efficient and high-performance key-value store
– Multi-store approach
– Entropy-coded tries
– Partial-key cuckoo hashing
• Full source code is available
– https://github.com/silt/silt