Robustness in the Salus scalable block store
Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam,
Lorenzo Alvisi, and Mike Dahlin
University of Texas at Austin
1
Storage: Do not lose data
Past: Fsck, IRON, ZFS, …
Now & Future
2
Salus: Provide robustness to scalable systems
Scalable systems: crash failures + certain disk corruption (GFS/Bigtable, HDFS/HBase, WAS, …)
Strong protections: arbitrary failures (BFT, end-to-end verification, …)
Salus aims to combine the two.
3
Start from “Scalable”
4
[Figure: file system, key-value store, database, and block store positioned along Scalable and Robust axes]
Borrow scalable designs
Scalable architecture
[Figure: clients, a metadata server, and data servers; data is transferred to the data servers in parallel]
Data is replicated for durability and availability.
5
[Figure: the same architecture, with computing nodes (no persistent storage) in front of storage nodes]
Problems
• No ordering guarantees for writes
• Single points of failure:
– Computing nodes can corrupt data.
– Clients can accept corrupted data.
6
1. No ordering guarantees for writes
[Figure: writes 1, 2, 3 issued in order by the client may reach the data servers out of order]
A block store requires barrier semantics.
7
2. Computing nodes can corrupt data
Single point of failure
8
3. Client can accept corrupted data
Single point of failure
Does a simple disk checksum work? No: it cannot prevent errors in memory, etc.
9
Salus: Overview
Metadata server
Block drivers (single driver per virtual disk)
Pipelined commit
Active storage
End-to-end verification
10
Failure model: Almost arbitrary but no impersonation
Pipelined commit
• Goal: barrier semantics
– A request can be marked as a barrier.
– All previous requests must be executed before it (a batching sketch follows below).
• Naïve solution:
– Wait at each barrier: loses parallelism
• Looks similar to a distributed transaction
– Well-known solution: two-phase commit (2PC)
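As a concrete illustration of the goal above, here is a minimal sketch (not Salus' actual code) of how a block driver might cut its write stream into batches at barrier boundaries; the names Write and split_into_batches are assumptions made for this example.

# Illustrative sketch: splitting a write stream into batches at barriers.
# Write and split_into_batches are assumed names, not Salus' real API.
from dataclasses import dataclass
from typing import List

@dataclass
class Write:
    block: int             # block number within the volume
    data: bytes            # block payload
    barrier: bool = False  # if True, all earlier writes must be executed first

def split_into_batches(writes: List[Write]) -> List[List[Write]]:
    """Group writes so that each batch ends at a barrier.

    Writes inside a batch can be sent and prepared in parallel;
    the batches themselves must commit in order (pipelined commit).
    """
    batches, current = [], []
    for w in writes:
        current.append(w)
        if w.barrier:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches

# Example: writes 1-3 form batch i, writes 4-5 form batch i+1.
stream = [Write(1, b"a"), Write(2, b"b"), Write(3, b"c", barrier=True),
          Write(4, b"d"), Write(5, b"e", barrier=True)]
assert [len(b) for b in split_into_batches(stream)] == [3, 2]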
11
Problem of 2PC
[Figure: the driver issues writes 1-3 as batch i and writes 4-5 as batch i+1 to the servers; each batch has its own leader and the servers report "prepared"; barriers separate the batches]
12
Problem of 2PC
[Figure: the two leaders send their commits independently, so batch i+1 can commit even if batch i does not]
The problem can still happen: we need an ordering guarantee between different batches.
13
Pipelined commit
[Figure: data transfer for batches i and i+1 proceeds in parallel, but the leader of batch i sends its commit only after learning that batch i-1 committed, and the leader of batch i+1 waits for batch i in turn]
Parallel data transfer, sequential commit
14
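The pipelined commit just shown can be summarized in a small hedged sketch: prepares (phase 1) run in parallel for many batches, but the leader of batch i issues its commit only after batch i-1 has committed. The class and method names below are illustrative assumptions, not the actual implementation.

# Illustrative sketch of pipelined commit: parallel data transfer/prepare,
# strictly ordered commits. All names are assumptions for this example.
from concurrent.futures import ThreadPoolExecutor

class PipelinedCommit:
    def __init__(self):
        self.last_committed = 0   # highest batch number known to have committed

    def prepare(self, batch_no, writes):
        # Phase 1: transfer the batch's writes to the data servers in parallel.
        # (Here we simply pretend every server acknowledges.)
        print(f"batch {batch_no}: prepared {len(writes)} writes")
        return True

    def commit(self, batch_no):
        # Phase 2: the leader of batch i waits for "batch i-1 committed"
        # before sending its own commit, which yields sequential commit order.
        assert batch_no == self.last_committed + 1, "commit out of order"
        self.last_committed = batch_no
        print(f"batch {batch_no}: committed")

pc = PipelinedCommit()
batches = {1: ["w1", "w2", "w3"], 2: ["w4", "w5"]}
# Prepares may run concurrently for different batches...
with ThreadPoolExecutor() as pool:
    list(pool.map(lambda item: pc.prepare(*item), batches.items()))
# ...but commits are issued strictly in batch order.
for n in sorted(batches):
    pc.commit(n)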
Outline
• Challenges
• Salus
– Pipelined commit
– Active storage
– Scalable end-to-end checks
• Evaluation
15
Active storage
• Goal: a single node cannot corrupt data
• Well-known solution: BFT replication
– Problem: requires at least 2f+1 replicas
• Salus' goal: use f+1 computing nodes
– f+1 is enough to guarantee safety
– What about availability/liveness?
16
Active storage: unanimous consent
Computing node
Storage nodes
17
Active storage: unanimous consent
Computing nodes
Storage nodes
• Unanimous consent:
– All updates must be agreed on by all f+1 computing nodes (a certificate-checking sketch follows below).
• Additional benefit:
– Colocating computing and storage saves network bandwidth
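A minimal sketch of the certificate check implied by unanimous consent, under the assumption that each of the f+1 computing nodes vouches for every update (here with an HMAC, purely for illustration) and a storage node applies an update only if all certificates verify; the keys and function names are hypothetical.

# Illustrative sketch: a storage node accepts an update only with matching
# certificates from all f+1 computing nodes. HMAC-based certificates and the
# shared keys below are assumptions made for this example.
import hashlib
import hmac

KEYS = {f"compute-{i}": f"secret-{i}".encode() for i in range(2)}  # f = 1, so f+1 = 2 nodes

def certificate(node: str, update: bytes) -> bytes:
    return hmac.new(KEYS[node], update, hashlib.sha256).digest()

def storage_node_apply(update: bytes, certs: dict) -> bool:
    """Apply the update only if every computing node vouches for it."""
    return all(hmac.compare_digest(certificate(n, update), certs.get(n, b""))
               for n in KEYS)

update = b"write block 7"
certs = {n: certificate(n, update) for n in KEYS}
assert storage_node_apply(update, certs)          # unanimous consent: accepted
bad = dict(certs, **{"compute-1": b"forged"})
assert not storage_node_apply(update, bad)        # one faulty node cannot push a corrupt update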
18
Active storage: restore availability
Computing nodes
Storage nodes
• What if something goes wrong?
– Problem: we may not know which one is faulty.
• Replace all computing nodes
19
Active storage: restore availability
Computing nodes
Storage nodes
• What if something goes wrong?
– Problem: we may not know which one is faulty.
• Replace all computing nodes
– Computing nodes do not have persistent state.
20
Discussion
• Does Salus achieve BFT with f+1 replication?
– No. A fork can happen under a combination of failures.
• Is it worth using 2f+1 replicas to eliminate this case?
– No. A fork can still happen as a result of other events (e.g., geo-replication).
21
Outline
• Challenges
• Salus
– Pipelined commit
– Active storage
– Scalable end-to-end checks
• Evaluation
22
Evaluation
• Is Salus robust against failures?
• Do the additional mechanisms hurt scalability?
23
Is Salus robust against failures?
24
• Always ensures barrier semantics
– Experiment: crash the block driver
– Result: requests after "holes" are discarded
• A single computing node cannot corrupt data
– Experiment: inject errors into a computing node
– Result: storage nodes reject it and trigger replacement
• Block drivers never accept corrupted data
– Experiment: inject errors into a computing/storage node
– Result: the block driver detects the errors and retries other nodes
Do the additional mechanisms hurt scalability?
25
Conclusion
• Pipelined commit
– Writes in parallel
– Provides ordering guarantees
• Active storage
– Eliminates single points of failure
– Consumes less network bandwidth
26
High robustness ≠ Low performance
Salus: scalable end-to-end verification
• Goal: read safely from one node
– The client should be able to verify the reply.
– If the reply is corrupted, the client retries another node.
• Well-known solution: Merkle tree
– Problem: scalability
• Salus' solution:
– Single writer
– Distribute the tree among the servers
27
Scalable end-to-end verification
[Figure: the Merkle tree is distributed across servers 1-4, each holding a subtree; the client maintains the top of the tree]
The client does not need to store anything persistently; it can rebuild the top tree from the servers. (A verification sketch follows below.)
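A hedged sketch of the end-to-end check: the client keeps only the top of the Merkle tree (here a single region root, for illustration) and verifies each block it reads against it using sibling hashes returned by the untrusted server. The tree layout and helper names are assumptions made for this example.

# Illustrative sketch: client-side verification against a Merkle root.
# A four-block region with a two-level tree is an assumption for illustration.
import hashlib

def h(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

# Server side (untrusted): the blocks and the hashes over them.
blocks = [b"block0", b"block1", b"block2", b"block3"]
leaves = [h(b) for b in blocks]
region_root = h(h(leaves[0], leaves[1]), h(leaves[2], leaves[3]))

def server_read(i: int):
    """Return the block plus the sibling hashes needed to recompute the root."""
    sibling = leaves[i ^ 1]
    other_pair = h(leaves[2], leaves[3]) if i < 2 else h(leaves[0], leaves[1])
    return blocks[i], (sibling, other_pair)

# Client side: it stores only the top of the tree (here, region_root).
def client_verify(i: int, block: bytes, proof, root: bytes) -> bool:
    sibling, other_pair = proof
    pair = h(h(block), sibling) if i % 2 == 0 else h(sibling, h(block))
    node = h(pair, other_pair) if i < 2 else h(other_pair, pair)
    return node == root

blk, proof = server_read(2)
assert client_verify(2, blk, proof, region_root)               # reply verifies
assert not client_verify(2, b"corrupted", proof, region_root)  # corrupted reply: retry another node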
28
Scalable and robust storage
More hardware, more complex software → more failures
29
Start from “Scalable”
30
Pipelined commit - challenge
• Is 2PC slow?
– Locking? Single writer, so no locking.
– Phase 2 network overhead? Small compared to Phase 1.
– Phase 2 disk overhead? Eliminated; the complexity is pushed to recovery.
• Ordering across transactions
– Easy when failure-free.
– Complicates recovery.
Key challenge: Recovery
31
Scalable architecture
Computation node:
• No persistent storage
• Data forwarding, garbage collection, etc.
• Tablet server (Bigtable), region server (HBase), stream manager (WAS), …
32
Computation node
Storage nodes
Salus: Overview
• Block store interface (Amazon EBS):
– Volume: a fixed number of fixed-size blocks (an interface sketch follows below)
– Single block driver per volume: barrier semantics
• Failure model: arbitrary, but no identity forging
• Key ideas:
– Pipelined commit: write in parallel and in order
– Active storage: replicate computation nodes
– Scalable end-to-end checks: the client verifies replies
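A minimal sketch of the EBS-like interface described above: a volume is a fixed number of fixed-size blocks accessed through a single block driver, and a write may carry a barrier flag. The class and constant names are illustrative assumptions.

# Illustrative sketch of an EBS-like volume interface. Names and the 4 KB
# block size are assumptions made for this example.
BLOCK_SIZE = 4096

class Volume:
    def __init__(self, num_blocks: int):
        self.blocks = [bytes(BLOCK_SIZE)] * num_blocks

    def write(self, block_no: int, data: bytes, barrier: bool = False) -> None:
        # barrier=True means all earlier writes must be executed before this one.
        assert len(data) == BLOCK_SIZE
        self.blocks[block_no] = data

    def read(self, block_no: int) -> bytes:
        return self.blocks[block_no]

vol = Volume(num_blocks=1024)
vol.write(0, b"x" * BLOCK_SIZE, barrier=True)
assert vol.read(0) == b"x" * BLOCK_SIZE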
33
Complex recovery
• Still need to guarantee barrier semantics.
• What if servers fail concurrently?
• What if different replicas diverge?
34
Support concurrent users
• No perfect solution today
• Give up linearizability
– Causal or Fork-Join-Causal consistency (Depot)
• Give up scalability
– SUNDR
35
How to handle storage node failures?
• Computing nodes generate certificates when writing to storage nodes
• If a storage node fails, the computing nodes agree on the current state
• Approach: look for the longest valid prefix (sketched below)
• After that, re-replicate the data
• As long as one correct storage node is available, there is no data loss
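A hedged sketch of the recovery rule above: the computing nodes collect each surviving storage node's log of certified updates and adopt the longest prefix whose certificates all verify. The placeholder cert_ok check and the data layout are assumptions made for this example.

# Illustrative sketch: agree on state by taking the longest valid prefix of
# certified updates across surviving storage nodes. cert_ok is a stand-in for
# real certificate verification.
def cert_ok(entry: dict) -> bool:
    return entry.get("cert") == "valid"          # placeholder check

def longest_valid_prefix(log: list) -> list:
    """Return the longest prefix of a log whose entries all carry valid certificates."""
    prefix = []
    for entry in log:
        if not cert_ok(entry):
            break
        prefix.append(entry)
    return prefix

def agree_on_state(replica_logs: list) -> list:
    """Adopt the longest valid prefix found on any surviving storage node."""
    return max((longest_valid_prefix(log) for log in replica_logs), key=len)

logs = [
    [{"seq": 1, "cert": "valid"}, {"seq": 2, "cert": "valid"}],
    [{"seq": 1, "cert": "valid"}, {"seq": 2, "cert": "corrupt"}, {"seq": 3, "cert": "valid"}],
]
assert [e["seq"] for e in agree_on_state(logs)] == [1, 2]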
36
Why barrier semantics?
• Goal: provide atomicity
• Example (file system): to append to a file, we need to modify a data block and the corresponding pointer to that block.
• It is bad if the pointer is modified but the data is not.
• Solution: modify the data first and then the pointer. The pointer write is attached to a barrier (see the sketch below).
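A small sketch of the file-system example above: write the new data block first, then write the pointer with a barrier, so the pointer can never become durable before the data it names. FakeDriver and append_to_file are hypothetical names used only for this example.

# Illustrative sketch: appending to a file on a barrier-aware block store.
# FakeDriver only records write order; a real driver would be the volume's
# single Salus block driver.
class FakeDriver:
    def __init__(self):
        self.log = []

    def write(self, block_no: int, data: bytes, barrier: bool = False) -> None:
        self.log.append((block_no, data, barrier))

def append_to_file(driver, data_block_no, data, pointer_block_no, pointer):
    # 1. Write the new data block first; ordering does not matter yet.
    driver.write(data_block_no, data)
    # 2. Write the pointer with a barrier: it may only be executed after all
    #    earlier writes, so it can never point at missing data.
    driver.write(pointer_block_no, pointer, barrier=True)

d = FakeDriver()
append_to_file(d, data_block_no=42, data=b"new data", pointer_block_no=7, pointer=b"-> 42")
assert d.log[-1][2] is True   # the pointer write carries the barrier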
37
Cost of robustness
38
[Figure: robustness vs. cost. Crash tolerance and Salus sit at the low-cost end; eliminating forks requires 2f+1 replication; fully BFT requires signatures and N-way programming.]