Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.
![Page 1: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/1.jpg)
Separating Data and Metadata for Robustness and Scalability
Yang Wang, University of Texas at Austin
![Page 2: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/2.jpg)
Goal: A better storage system
• Data is important.
• Data grows bigger.
• Data is accessed in different ways.
![Page 3: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/3.jpg)
Challenge: achieve multiple goals simultaneously
• Robustness
  - Durable and available despite failures
• Scalability
  - Thousands of machines or more
• Efficiency
  - Good performance at a reasonable cost
![Page 4: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/4.jpg)
Solution
Separating data and metadata
![Page 5: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/5.jpg)
My works
Design: Gnothi, Salus
Evaluate: Exalt
![Page 6: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/6.jpg)
My works
Gnothi
Salus
Exalt
Small scale; crash failures
![Page 7: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/7.jpg)
My works
Gnothi
Salus
Exalt
Large scale; arbitrary failures
![Page 8: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/8.jpg)
How to design?
• Problem: stronger protection -> higher cost
• Key observation:
  - Data: big (4K to several MBs)
  - Metadata: small (tens of bytes); can validate data
• Solution
  - Strong protection for metadata -> robustness
  - Minimal replication for data -> scalability and efficiency
![Page 9: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/9.jpg)
How to evaluate?
Gnothi
Salus
Exalt
Evaluate large-scale storage systems on small to medium platforms
![Page 10: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/10.jpg)
Outline
• Gnothi: Efficient and Available Storage Replication
  - Small scale; tolerates crash faults and timing errors
• Salus: Robust and Scalable Block Store
  - Large scale; tolerates arbitrary failures
• Exalt: Evaluate large-scale storage systems
![Page 11: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/11.jpg)
Resolving a long-standing trade-off
• Efficiency
  - Write to f+1 nodes and read from 1 node
• Robustness
  - Availability: aggressive timeout for failure detection
  - Consistency: a read returns the data of the latest write
Synchronous Primary Backup
![Page 12: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/12.jpg)
Resolving a long-standing trade-off
• Efficiency
  - Write to f+1 nodes and read from 1 node
• Robustness
  - Availability: aggressive timeout for failure detection
  - Consistency: a read returns the data of the latest write
Asynchronous Replication
![Page 13: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/13.jpg)
Resolving a long-standing trade-off
• Efficiency
  - Write to f+1 nodes and read from 1 node
• Robustness
  - Availability: aggressive timeout for failure detection
  - Consistency: a read returns the data of the latest write
Gnothi
![Page 14: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/14.jpg)
Gnothi Overview
Gnothi resolves the trade-off, but only for block storage, meaning:
  - A fixed number of fixed-size blocks.
  - A request reads or writes a single block.
Key ideas:
  - Don’t insist that nodes have identical state.
  - A node knows which blocks are fresh and which are stale.
Gnothi Seauton – Know yourself
![Page 15: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/15.jpg)
Separating Data and Metadata
Metadata size: 24 bytes for a block (4K to 1M)
Figure: clients send write requests over a LAN to 2f+1 nodes. A write request carries data and metadata (blockNo, client ID, ...).
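To make the size gap concrete, the sketch below packs a 24-byte metadata record and computes its overhead for the block sizes on this slide. Only `blockNo` and the client ID are named on the slide; the field widths and the third field (`request_seq`) are assumptions made purely for illustration, not Gnothi's actual layout.

```python
import struct

# Hypothetical layout: three unsigned 64-bit fields = 24 bytes.
# blockNo and client ID appear on the slide; request_seq is an assumed
# placeholder, since the slide does not list the remaining fields.
METADATA_FORMAT = ">QQQ"
METADATA_SIZE = struct.calcsize(METADATA_FORMAT)


def pack_metadata(block_no, client_id, request_seq):
    """Serialize the metadata of one write into a fixed-size record."""
    return struct.pack(METADATA_FORMAT, block_no, client_id, request_seq)


def metadata_overhead(block_size):
    """Fraction of extra bytes the metadata adds on top of one data block."""
    return METADATA_SIZE / block_size


record = pack_metadata(block_no=2, client_id=7, request_seq=41)
small = metadata_overhead(4 * 1024)       # 4K block
large = metadata_overhead(1024 * 1024)    # 1M block
```

Even for the smallest block size the metadata is well under 1% of the data, which is why replicating it to every node is cheap.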
![Page 16: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/16.jpg)
Rest of Gnothi
• Why is the trade-off challenging?
• How does Gnothi resolve the trade-off?
• How well does Gnothi perform?
![Page 17: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/17.jpg)
Why is the trade-off challenging?
How to handle a timeout?
• Can we have both f+1 replication and a short timeout?
• Synchronous Primary Backup (Remus, HBase, Hypervisor, …)
  - On a timeout, continue with 1 node
  - Use a conservative timeout
• Asynchronous Replication (Paxos, …)
  - Send to 2f+1 nodes and wait for f+1 ACKs
![Page 18: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/18.jpg)
Why is the trade-off challenging?
With partial replication (f+1 of the nodes), what should happen on a timeout?
× Continue with 1 node? Not safe.
× Wait? Not live.
? Switch to another node (Cheap Paxos, ZZ, …)? However, the state of the newly enlisted node may be incomplete.
  - One solution: on a switch, copy all data to the new node. This gives bad availability.
![Page 19: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/19.jpg)
Rest of Gnothi
• Why is the trade-off challenging?
• How does Gnothi resolve the trade-off?
• How well does Gnothi perform?
![Page 20: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/20.jpg)
Gnothi: Nodes can be incomplete
• A new write will overwrite the block anyway.
• Reads can be processed correctly
  - As long as a node knows which blocks are stale
• Recovery can be processed correctly
  - As long as a node knows which block is the latest one
Figure labels, read case: "Read block 2", answered with "I do not have current version of block 2".
Figure labels, recovery case: "Write block 2", "Fetch block 2", "Write latest version of block 2".
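The idea on this slide can be sketched as a replica that tracks, per block, whether its copy is current. This is a minimal illustration rather than Gnothi's implementation: the `Node` class, the per-block `version` counter, and the method names are all assumptions.

```python
class Node:
    """One replica: metadata for every block, current data only for some."""

    def __init__(self, num_blocks, block_size=4096):
        self.version = [0] * num_blocks        # assumed per-block write counter
        self.fresh = [True] * num_blocks       # "do I have the current data?"
        self.data = [bytes(block_size)] * num_blocks

    def apply_write(self, block_no, version, data=None):
        """Record a write. `data` is None on nodes that get metadata only."""
        if version <= self.version[block_no]:
            return                             # old or duplicate write
        self.version[block_no] = version
        if data is None:
            self.fresh[block_no] = False       # we know our copy is stale
        else:
            self.data[block_no] = data         # a new write overwrites anyway
            self.fresh[block_no] = True

    def read(self, block_no):
        """Return the block, or None for 'I do not have the current version'."""
        return self.data[block_no] if self.fresh[block_no] else None

    def fetch_from(self, peer, block_no):
        """Recovery: copy the latest version of a stale block from a peer."""
        peer_is_current = (peer.fresh[block_no]
                           and peer.version[block_no] == self.version[block_no])
        if not self.fresh[block_no] and peer_is_current:
            self.data[block_no] = peer.data[block_no]
            self.fresh[block_no] = True


a, b, c = Node(4), Node(4), Node(4)
a.apply_write(2, 1, b"new")    # data replica
b.apply_write(2, 1, b"new")    # data replica
c.apply_write(2, 1)            # metadata only
```

Here node `c` cannot serve block 2 until it either fetches the block from `a` or receives a later write that carries data.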
![Page 21: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/21.jpg)
How does Gnothi work?
How to perform writes and reads efficiently when no failures occur?
- Write to f+1 and read from 1
How to continue processing requests during failures?
- Still write to f+1 and read from 1
How to recover the failed node efficiently?
![Page 22: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/22.jpg)
How to perform writes and reads efficiently when no failures occur?
• Data replicated f+1 times
• Maintain a single bit for each block: “do I have the current data?”
• Metadata ensures that reads can be processed correctly.
Figure: write and read paths from a client to Nodes 1-3. Legend: client; node with both data and metadata; node with only metadata.
Gaios: Bolosky et al. NSDI 2011
![Page 23: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/23.jpg)
Load-balanced Data Distribution
• Divide the virtual disk's space into multiple slices
• Evenly distribute slices to different preferred nodes
Figure: Gnothi block drivers reach Nodes 1-3 over a LAN. Preferred storage: Node 1 holds Slices 1 and 2, Node 2 holds Slices 2 and 3, Node 3 holds Slices 1 and 3. Reserve storage: Node 1 holds Slice 3, Node 2 holds Slice 1, Node 3 holds Slice 2.
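One way to produce the layout drawn on this slide is a simple rotation. The function below is an assumed generalization of the three-node figure, not a rule stated in the talk; it uses 0-based indices and assumes as many slices as nodes.

```python
def slice_layout(f):
    """Map each of the 2f+1 nodes to (preferred slices, reserve slices).

    Each node prefers f+1 consecutive slices (wrapping around), so every
    slice has exactly f+1 preferred nodes. The remaining slices are the
    node's reserve storage.
    """
    n = 2 * f + 1
    layout = {}
    for node in range(n):
        preferred = [(node + k) % n for k in range(f + 1)]
        reserve = [(node + k) % n for k in range(f + 1, n)]
        layout[node] = (preferred, reserve)
    return layout


layout = slice_layout(1)
# Node 0 prefers slices 0 and 1, node 1 prefers 1 and 2, node 2 prefers 2 and 0.
```

With f=1 this matches the figure once indices are shifted to start at 1: the nodes prefer slices (1, 2), (2, 3), and (3, 1), and hold slices 3, 1, and 2 in reserve.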
![Page 24: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/24.jpg)
Load-balanced Data Distribution
• Divide the virtual disk's space into multiple slices
• Evenly distribute slices to different preferred nodes
Figure: the virtual disk is split into Slices 1-3. Preferred storage: Node 1 holds 1 and 2, Node 2 holds 2 and 3, Node 3 holds 1 and 3. Reserve storage: Node 1 holds 3, Node 2 holds 1, Node 3 holds 2.
![Page 25: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/25.jpg)
How to continue processing requests during failures?
• Metadata replicated 2f+1 times
• Do not wait for data or metadata transfer
• Metadata allows a node to process requests correctly.
Figure: write and read paths to Nodes 1-3 during a failure.
![Page 26: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/26.jpg)
Catch-up problem in recovery
• Recovery speed vs. execution speed
  - Traditional systems have the catch-up problem
Figure: among Nodes 1-3, the recovering node asks "Can I catch up?"
![Page 27: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/27.jpg)
How to recover the failed node efficiently?
• Separate metadata and data recovery
  - Phase 1: Metadata recovery (fast)
![Page 28: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/28.jpg)
How to recover the failed node efficiently?
• Separate metadata and data recovery
  - Phase 1: Metadata recovery (fast)
  - Phase 2: Data recovery (slow, in background)
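The two phases can be sketched as two small functions. This is an illustration under assumptions: per-block version numbers stand in for Gnothi's metadata, plain dictionaries stand in for the nodes, and `budget` is a hypothetical knob for how many blocks to copy per background step.

```python
def metadata_recovery(local_versions, peer_versions):
    """Phase 1 (fast): copy the small per-block versions from a peer.

    Returns the set of block numbers whose local data is now known stale.
    After this phase the node can already take part in new requests.
    """
    stale = set()
    for block_no, version in peer_versions.items():
        if version > local_versions.get(block_no, 0):
            local_versions[block_no] = version
            stale.add(block_no)
    return stale


def data_recovery(stale, local_data, peer_data, budget):
    """Phase 2 (slow): fetch at most `budget` stale blocks per call.

    Meant to be called repeatedly in the background. A new write to a
    stale block would also clear it, since it overwrites the block anyway.
    """
    for block_no in sorted(stale)[:budget]:
        local_data[block_no] = peer_data[block_no]
        stale.discard(block_no)
    return stale


local_versions = {0: 1, 1: 1}
peer_versions = {0: 1, 1: 3, 2: 2}
local_data = {0: b"a", 1: b"old"}
peer_data = {0: b"a", 1: b"new", 2: b"x"}

stale = metadata_recovery(local_versions, peer_versions)
while stale:
    data_recovery(stale, local_data, peer_data, budget=1)
```

Because the set of stale blocks is fixed once phase 1 completes, the background copy has a bounded amount of work no matter how fast new requests arrive.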
![Page 29: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/29.jpg)
Rest of Gnothi
• Why is the trade-off challenging?
• How does Gnothi resolve the trade-off?
• How well does Gnothi perform?
![Page 30: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/30.jpg)
Evaluation
• Throughput
  - Compare to G’, a Gaios-like system (Bolosky et al., NSDI 2011).
  - Sequential/random read/write
  - f=1 (Gnothi-3, G’-3) and f=2 (Gnothi-5, G’-5)
  - Block sizes 4K, 64K, and 1M
• Failure recovery
  - Compare Gnothi to G’ and Cheap Paxos
  - How long does recovery take?
  - What is the client throughput during recovery?
![Page 31: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/31.jpg)
Gnothi achieves higher throughput
Gnothi can achieve 40%-64% more write throughput and scalable read throughput.
Figure: throughput plots annotated "More write throughput" and "Scalable read throughput".
![Page 32: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/32.jpg)
Higher throughput during recovery
• Gnothi does not block for long on failures.
• Gnothi can achieve 100%-200% more throughput during recovery.
Figure: throughput over time across a kill and a restart. Annotations: "Cheap Paxos blocks for data copy", "No blocking", "100%-200% more throughput", "Complete recovery at almost the same time".
![Page 33: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/33.jpg)
Gnothi can always catch up
• Tunable recovery speed.
• In Gnothi, the recovering node can always catch up with the others.
Figure: two plots of throughput (MB/s), one for Gnothi and one for G’, with the annotations "Catch up" (twice) and "Cannot catch up".
![Page 34: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/34.jpg)
Gnothi conclusion
• Separate data and metadata
  - Replication
    • Improve efficiency.
    • Ensure availability during failures.
  - Recovery
    • Ensure catch-up.
![Page 35: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/35.jpg)
Outline
• Gnothi: Efficient and Available Storage Replication
  - Small scale; tolerates crash faults and timing errors
• Salus: Robust and Scalable Block Store
  - Large scale; tolerates arbitrary failures
• Exalt: Evaluate large-scale storage systems
![Page 36: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/36.jpg)
Problem: Not enough machines
• In practice
  - WAS in Microsoft: 60PB
  - HDFS in Facebook: 4000 servers
  - …
• In research
  - Salus: 100 servers
  - COPS: 300 servers
  - Spanner: 200 servers
Research should go beyond practice.
![Page 37: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/37.jpg)
Public testbeds
• Utah Emulab: 588 machines
• CMU Emulab: 1024 machines
• TACC (Texas Advanced Computing Center)
  - 6400 machines, but not enough storage
• Amazon EC2
  - Cost $1400 for our Salus experiment (108 servers)
![Page 38: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/38.jpg)
Solution 1: Extrapolation
• Measure with a small cluster
• Predict the bottleneck
• Assumption: resource consumption grows linearly with the scale
Figure: resource utilization vs. scale. At 100 nodes, CPU is at 10% and network at 5%. Extrapolate: the system can scale to 1,000 nodes.
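The extrapolation on this slide is simple arithmetic, sketched below. It reads the figure labels as CPU at 10% and network at 5% with 100 nodes; the function names are made up for illustration, and the linear-growth assumption is exactly the one the slide states.

```python
def saturation_scale(measured_nodes, utilization_pct):
    """Nodes at which each resource would reach 100%, assuming linear growth."""
    return {res: measured_nodes * 100 / pct
            for res, pct in utilization_pct.items()}


def predicted_limit(measured_nodes, utilization_pct):
    """Return (bottleneck resource, predicted maximum number of nodes)."""
    scales = saturation_scale(measured_nodes, utilization_pct)
    bottleneck = min(scales, key=scales.get)
    return bottleneck, scales[bottleneck]


bottleneck, limit = predicted_limit(100, {"cpu": 10, "network": 5})
# bottleneck == "cpu", limit == 1000.0
```

The prediction is only as good as the linearity assumption: a resource whose use grows faster than linearly would saturate earlier than this estimate.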
![Page 39: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/39.jpg)
Solution 1: Extrapolation
• Measure with a small cluster
• Predict the bottleneck
• Problem: the assumption may not be true.
Figure: resource utilization vs. scale, with CPU and network curves and a 10% mark at 100 nodes.
![Page 40: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/40.jpg)
Solution 2: Stub
• Build stub components to simulate real components
• Problem: stub component can be as complex as the original one
![Page 41: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/41.jpg)
Solution 3: Simulation
![Page 42: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/42.jpg)
Exalt: Evaluate 10,000 nodes on 100 machines
• Run real code
• Use fewer resources
• Seems impossible?
  - In general, yes.
  - For storage systems with big data, it is achievable.
![Page 43: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/43.jpg)
Key insight
• I/O is the bottleneck.
• However, the content of data does not matter.
• Solution:
  - We can choose a highly compressible data pattern.
  - Build emulated I/O devices that compress data.
Figure: 1 million zeros ("00000000…") are compressed before entering the emulated network and decompressed on the other side.
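As a rough illustration of an emulated device, the sketch below stores all-zero blocks as a length only. It is a deliberately naive stand-in (it scans every byte and handles only pure-zero blocks), and the `EmulatedDisk` class and its 8-byte accounting for a stored length are assumptions, not Exalt's design.

```python
class EmulatedDisk:
    """Toy disk that keeps all-zero blocks as a length instead of bytes."""

    def __init__(self):
        self.blocks = {}   # block number -> ("zeros", length) or ("raw", bytes)

    def write(self, block_no, data):
        if data.count(0) == len(data):           # every byte is zero
            self.blocks[block_no] = ("zeros", len(data))
        else:
            self.blocks[block_no] = ("raw", bytes(data))

    def read(self, block_no):
        kind, payload = self.blocks[block_no]
        # bytes(n) builds n zero bytes, which "decompresses" the block.
        return bytes(payload) if kind == "zeros" else payload

    def stored_bytes(self):
        """Bytes actually kept, counting 8 bytes for each stored length."""
        return sum(8 if kind == "zeros" else len(payload)
                   for kind, payload in self.blocks.values())


disk = EmulatedDisk()
disk.write(0, bytes(1_000_000))            # 1 million zeros
disk.write(1, b"header" + bytes(10))       # mixed content stays uncompressed
```

The second write shows the limitation: as soon as a block mixes other bytes with the zeros, this simple scheme stores it in full.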
![Page 44: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/44.jpg)
Challenge
• System may add metadata
• System may split data (possibly nondeterministically)
• Existing approaches are either inaccurate or inefficient on such mixed patterns.
00000… 00000… 00000…
![Page 45: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/45.jpg)
Goals
• Cannot lose metadata
• High compression ratio
• Computationally efficient
• Can work with the mixed pattern
![Page 46: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/46.jpg)
Existing approaches
• David (FAST 11): discard file content
  - Loses metadata, since it is mixed with data
• Gzip, etc.
  - Not efficient
• Write all zeros and scan for zeros
  - Still not efficient enough
![Page 47: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/47.jpg)
Solution: Tardis
• Key: we cannot choose metadata, but we can choose data
  - Make data distinguishable from metadata
Figure labels: a magic sequence of bytes that does not exist in metadata; an integer representing the number of bytes left.
![Page 48: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/48.jpg)
Tardis compression
• Search for the magic sequence
• Retrieve the number of bytes left (Nleft)
• Jump Nleft bytes
• Search for the magic sequence again
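The search-and-jump loop can be sketched as follows. Everything concrete here is an assumption for illustration: the `MAGIC` value, the 16-byte record (8-byte magic plus an 8-byte big-endian count of the bytes left after the record), and the token format. The sketch also skips the check for metadata inserted in the middle of data.

```python
import struct

MAGIC = bytes.fromhex("9f3c17a2e45b08d6")   # hypothetical 8-byte magic sequence
RECORD = len(MAGIC) + 8                      # magic + "bytes left" counter


def pattern(nleft, length):
    """Regenerate `length` bytes of data whose first record says `nleft`."""
    out = bytearray()
    while len(out) < length:
        out += MAGIC + struct.pack(">Q", nleft)
        nleft -= RECORD
    return bytes(out[:length])


def make_data(n_records):
    """Benchmark data: each record is MAGIC plus the bytes left after it."""
    return pattern((n_records - 1) * RECORD, n_records * RECORD)


def compress(buf):
    """Return ('lit', bytes) tokens for metadata and ('run', nleft, length)."""
    tokens, pos = [], 0
    while pos < len(buf):
        hit = buf.find(MAGIC, pos)               # search for the magic sequence
        if hit < 0 or hit + RECORD > len(buf):   # no complete record remains
            tokens.append(("lit", buf[pos:]))
            break
        if hit > pos:
            tokens.append(("lit", buf[pos:hit]))  # metadata is kept verbatim
        (nleft,) = struct.unpack_from(">Q", buf, hit + len(MAGIC))
        end = min(hit + RECORD + nleft, len(buf))  # jump, clamped if data was split
        tokens.append(("run", nleft, end - hit))
        pos = end
    return tokens


def decompress(tokens):
    out = bytearray()
    for tok in tokens:
        out += tok[1] if tok[0] == "lit" else pattern(tok[1], tok[2])
    return bytes(out)


payload = make_data(1000)
stream = b"HDR:block-7;" + payload + b";CRC=1234"
tokens = compress(stream)
```

In this example the 16,000 bytes of data collapse into a single `run` token while the surrounding header and checksum bytes survive as literals. Because each record carries its own count, a buffer that starts or ends in the middle of the data still compresses.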
![Page 49: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/49.jpg)
Problems
• How to find a magic sequence
  - A randomly chosen 8-byte one works for HDFS.
  - Run the system, record a trace, and analyze it.
• What if the system inserts metadata into data?
  - After jumping, check whether the data there matches the jumped bytes.
  - If not, binary search until a match is found.
![Page 50: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/50.jpg)
Use Exalt
• Emulated devices have inaccurate performance.
• If one or several nodes are the bottleneck
  - Run those nodes in real mode
  - Run other nodes in emulation mode
![Page 51: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/51.jpg)
Use Exalt
• What if the behavior depends on a large number of nodes?
  - E.g., 99th-percentile latency and parallel recovery
• Need to model the behavior of emulated devices
Figure labels: number of bytes; disk/network latency; energy consumption.
![Page 52: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/52.jpg)
Implementation
• Bytecode Instrumentation (BCI)
• Emulated devices:
  - Disk (transparent)
  - Network (transparent)
  - Memory (need to modify code)
![Page 53: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/53.jpg)
Preliminary results on HDFS
![Page 54: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/54.jpg)
Preliminary results on HDFS
![Page 55: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/55.jpg)
Proposed work
• Apply “separating data and metadata” to active storage in Salus
• Complete Exalt:
  - Incorporate latency modeling
  - Apply Exalt to more applications
  - Complete Tardis implementation
• Multiple-RSM communication
  - Join the project led by Manos
  - Not part of my thesis
![Page 56: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/56.jpg)
Publications
"Robustness in the Salus scalable block store". Y. Wang, M. Kapritsos, Z. Ren, P. Mahajan, J. Kirubanandam, L. Alvisi, and M. Dahlin, in NSDI 2013.
"All about Eve: Execute-Verify Replication for Multi-Core Servers". M. Kapritsos, Y. Wang, V. Quema, A. Clement, L. Alvisi, and M. Dahlin, in OSDI 2012.
"Gnothi: Separating Data and Metadata for Efficient and Available Storage Replication". Y. Wang, L. Alvisi, and M. Dahlin, in USENIX ATC 2012.
"UpRight Cluster Services". A. Clement, M. Kapritsos, S. Lee, Y. Wang, L. Alvisi, M. Dahlin, T. Riche, in SOSP 2009.
![Page 57: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/57.jpg)
Backup slides
![Page 58: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/58.jpg)
Cost of Gnothi
• Higher write latency:
  - In a LAN, the major latency comes from disk.
  - Write metadata and data together to disk.
  - Rethink-the-sync writes should also help.
• Loses generality
  - Gnothi is only designed for block storage.
![Page 59: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/59.jpg)
How does Gnothi compare to GFS/HDFS/xFS/… ?
• Those systems have a metadata server and multiple data servers.
• Gnothi updates metadata for every write and checks metadata for every read.
• They do that at a coarse granularity
  - Advantages: high scalability
  - Disadvantages: weaker consistency guarantee, append-only interface, worse availability, …
![Page 60: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/60.jpg)
Efficient Recovery
• Recovery speed vs. execution speed
  - Traditional systems have the catch-up problem
Figure: among Nodes 1-3, the recovering node asks "Can I catch up?"
![Page 61: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/61.jpg)
Is timing error a real threat?
• Can cause data inconsistency
• Reasons:
  - Network partitions
  - Server overloading
  - …
• A real concern in practical systems
  - HBASE-2238: “Because HDFS and ZK are partitioned (in the sense that there's no communication between them) and there may be an unknown delay between acquiring the lock and performing the operation on HDFS you have no way of knowing that you still own the lock, like you say.”
![Page 62: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/62.jpg)
Interface & Models
• Disk interface
  - A fixed number of fixed-size blocks
  - A request can read or write a single block
  - Linearizable reads and writes
• Asynchronous model: no maximum delay
  - Omission failures only
  - Always safe
  - Live when the network is synchronous
![Page 63: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/63.jpg)
Architecture
• Fully replicated metadata
• Partially replicated data
  - Load balancing
  - Preferred Storage
  - Reserve Storage
Figure: a virtual disk split into Slices 0, 1, and 2. Preferred storage on the three nodes holds Slices 0 and 1, Slices 1 and 2, and Slices 0 and 2. Reserve storage holds Slice 2, Slice 0, and Slice 1 respectively.
![Page 64: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/64.jpg)
Data can be stored outside its preferred replicas.
Figure: data and metadata flow during a network problem. Replica 0 does not have current data. Only Replica 2 has current data.
![Page 65: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/65.jpg)
Gnothi: Available and Efficient
• Availability: same as Asynchronous Replication
  - Safe regardless of timing errors
  - Can use aggressive timeouts
Figure: apps use Gnothi block drivers to reach Gnothi storage servers over a LAN.
![Page 66: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/66.jpg)
Gnothi: Available and Efficient
• Efficiency:
  - Storage/bandwidth efficiency: write to f+1 replicas
  - Read efficiency: read from 1 replica
Figure: apps use Gnothi block drivers to reach Gnothi storage servers over a LAN.
![Page 67: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/67.jpg)
Previous work cannot achieve both
• Synchronous Primary Backup (Remus, Hypervisor, HBase, …): use conservative timeouts
• Preferred Quorum (Cheap Paxos, ZZ, …): use cold backups
• Gaios: scalable read
• Asynchronous Replication (Paxos, …): use 2f+1 replicas
• Gnothi: separating data and metadata
Figure: each approach is rated on availability and efficiency; for Gaios, efficiency is split into read efficiency and storage/bandwidth efficiency.
![Page 68: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/68.jpg)
Resolving a long-standing trade-off
• Efficiency
  - Write to f+1 replicas and read from 1 replica
• Availability
  - Aggressive timeout for failure detection
• Consistency
  - A read always returns the data of the latest write.
Systems compared: Synchronous Primary Backup, Asynchronous Replication, Gnothi (this talk)
![Page 69: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/69.jpg)
Catch-up problem in recovery
• Recovery speed vs. execution speed
  - Traditional systems have the catch-up problem
• Traditional approaches: fetch missing data before processing new requests
  - Cannot catch up
  - Have to block or throttle
Figure: timeline of Nodes 1-3 across a failure and a recovery.
![Page 70: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/70.jpg)
Separate Metadata and Data Recovery
• Metadata Recovery: fast
• Data Recovery: slow; in background
• The recovering node can process new requests after Metadata Recovery.
Figure: Metadata Recovery as Node 1 recovers alongside Nodes 2 and 3.
![Page 71: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/71.jpg)
Separate Metadata and Data Recovery
• Metadata Recovery: fast
• Data Recovery: slow; in background
• Release reserve storage
Figure: Data Recovery as Node 1 recovers alongside Nodes 2 and 3.
![Page 72: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/72.jpg)
Gnothi ensures catch-up
• Traditional approaches: fetch missing data before processing new requests
• Gnothi: fetch missing metadata before processing new requests
• Node 1 is never left behind after Metadata Recovery.
Figure: timeline of Nodes 1-3 across a failure and a recovery.
![Page 73: Separating Data and Metadata for Robustness and Scalability Yang Wang University of Texas at Austin.](https://reader036.fdocuments.net/reader036/viewer/2022062322/56649e7d5503460f94b7fb4f/html5/thumbnails/73.jpg)
How does Gnothi work?
• How to perform writes and reads efficiently when no failures occur? (Write, Read)
• How to continue processing requests during failures? (Write, Read)
• How to recover the failed node efficiently? (Recovery)