Robustness in the Salus scalable block store
Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam,
Lorenzo Alvisi, and Mike Dahlin
University of Texas at Austin
1
Storage: Do not lose data
Past: Fsck, IRON, ZFS, …
Now & Future
2
Salus: Provide robustness to scalable systems
Scalable systems: crash failures + certain disk corruption (GFS/Bigtable, HDFS/HBase, WAS, …)
Strong protections: arbitrary failures (BFT, end-to-end verification, …)
Salus aims to combine the two.
3
Start from “Scalable”
4
[Figure: file system, key-value store, database, and block store positioned along Scalable and Robust axes]
Borrow scalable designs
Scalable architecture
[Figure: clients, a metadata server, and data servers; data is transferred to the data servers in parallel]
Data is replicated for durability and availability.
5
[Figure: the same architecture, with computing nodes (no persistent storage) in front of storage nodes]
Problems
• No ordering guarantees for writes
• Single points of failure:
– Computing nodes can corrupt data.
– Clients can accept corrupted data.
6
1. No ordering guarantees for writes
[Figure: writes 1, 2, 3 issued in order by the client may reach the data servers out of order]
A block store requires barrier semantics.
7
2. Computing nodes can corrupt data
Single point of failure
8
3. Client can accept corrupted data
Single point of failure
Does a simple disk checksum work? No: it cannot prevent errors in memory, etc.
9
Salus: Overview
Metadata server
Block drivers (single driver per virtual disk)
Pipelined commit
Active storage
End-to-end verification
10
Failure model: Almost arbitrary but no impersonation
Pipelined commit
• Goal: barrier semantics
– A request can be marked as a barrier.
– All previous requests must be executed before it (a batching sketch follows below).
• Naïve solution:
– Wait at each barrier: loses parallelism
• Looks similar to a distributed transaction
– Well-known solution: two-phase commit (2PC)
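As a concrete illustration of the goal above, here is a minimal sketch (not Salus' actual code) of how a block driver might cut its write stream into batches at barrier boundaries; the names Write and split_into_batches are assumptions made for this example.

# Illustrative sketch: splitting a write stream into batches at barriers.
# Write and split_into_batches are assumed names, not Salus' real API.
from dataclasses import dataclass
from typing import List

@dataclass
class Write:
    block: int             # block number within the volume
    data: bytes            # block payload
    barrier: bool = False  # if True, all earlier writes must be executed first

def split_into_batches(writes: List[Write]) -> List[List[Write]]:
    """Group writes so that each batch ends at a barrier.

    Writes inside a batch can be sent and prepared in parallel;
    the batches themselves must commit in order (pipelined commit).
    """
    batches, current = [], []
    for w in writes:
        current.append(w)
        if w.barrier:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches

# Example: writes 1-3 form batch i, writes 4-5 form batch i+1.
stream = [Write(1, b"a"), Write(2, b"b"), Write(3, b"c", barrier=True),
          Write(4, b"d"), Write(5, b"e", barrier=True)]
assert [len(b) for b in split_into_batches(stream)] == [3, 2]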
11
Problem of 2PC
[Figure: the driver issues writes 1-3 as batch i and writes 4-5 as batch i+1 to the servers; each batch has its own leader and the servers report "prepared"; barriers separate the batches]
12
Problem of 2PC
[Figure: the two leaders send their commits independently, so batch i+1 can commit even if batch i does not]
The problem can still happen: we need an ordering guarantee between different batches.
13
Pipelined commit
[Figure: data transfer for batches i and i+1 proceeds in parallel, but the leader of batch i sends its commit only after learning that batch i-1 committed, and the leader of batch i+1 waits for batch i in turn]
Parallel data transfer, sequential commit
14
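The pipelined commit just shown can be summarized in a small hedged sketch: prepares (phase 1) run in parallel for many batches, but the leader of batch i issues its commit only after batch i-1 has committed. The class and method names below are illustrative assumptions, not the actual implementation.

# Illustrative sketch of pipelined commit: parallel data transfer/prepare,
# strictly ordered commits. All names are assumptions for this example.
from concurrent.futures import ThreadPoolExecutor

class PipelinedCommit:
    def __init__(self):
        self.last_committed = 0   # highest batch number known to have committed

    def prepare(self, batch_no, writes):
        # Phase 1: transfer the batch's writes to the data servers in parallel.
        # (Here we simply pretend every server acknowledges.)
        print(f"batch {batch_no}: prepared {len(writes)} writes")
        return True

    def commit(self, batch_no):
        # Phase 2: the leader of batch i waits for "batch i-1 committed"
        # before sending its own commit, which yields sequential commit order.
        assert batch_no == self.last_committed + 1, "commit out of order"
        self.last_committed = batch_no
        print(f"batch {batch_no}: committed")

pc = PipelinedCommit()
batches = {1: ["w1", "w2", "w3"], 2: ["w4", "w5"]}
# Prepares may run concurrently for different batches...
with ThreadPoolExecutor() as pool:
    list(pool.map(lambda item: pc.prepare(*item), batches.items()))
# ...but commits are issued strictly in batch order.
for n in sorted(batches):
    pc.commit(n)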
Outline
• Challenges
• Salus
– Pipelined commit
– Active storage
– Scalable end-to-end checks
• Evaluation
15
Active storage
• Goal: a single node cannot corrupt data
• Well-known solution: BFT replication
– Problem: requires at least 2f+1 replicas
• Salus' goal: use f+1 computing nodes
– f+1 is enough to guarantee safety
– What about availability/liveness?
16
Active storage: unanimous consent
Computing node
Storage nodes
17
Active storage: unanimous consent
Computing nodes
Storage nodes
• Unanimous consent:
– All updates must be agreed on by all f+1 computing nodes (a certificate-checking sketch follows below).
• Additional benefit:
– Colocating computing and storage saves network bandwidth
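A minimal sketch of the certificate check implied by unanimous consent, under the assumption that each of the f+1 computing nodes vouches for every update (here with an HMAC, purely for illustration) and a storage node applies an update only if all certificates verify; the keys and function names are hypothetical.

# Illustrative sketch: a storage node accepts an update only with matching
# certificates from all f+1 computing nodes. HMAC-based certificates and the
# shared keys below are assumptions made for this example.
import hashlib
import hmac

KEYS = {f"compute-{i}": f"secret-{i}".encode() for i in range(2)}  # f = 1, so f+1 = 2 nodes

def certificate(node: str, update: bytes) -> bytes:
    return hmac.new(KEYS[node], update, hashlib.sha256).digest()

def storage_node_apply(update: bytes, certs: dict) -> bool:
    """Apply the update only if every computing node vouches for it."""
    return all(hmac.compare_digest(certificate(n, update), certs.get(n, b""))
               for n in KEYS)

update = b"write block 7"
certs = {n: certificate(n, update) for n in KEYS}
assert storage_node_apply(update, certs)          # unanimous consent: accepted
bad = dict(certs, **{"compute-1": b"forged"})
assert not storage_node_apply(update, bad)        # one faulty node cannot push a corrupt update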
18
Active storage: restore availability
Computing nodes
Storage nodes
• What if something goes wrong?
– Problem: we may not know which one is faulty.
• Replace all computing nodes
19
Active storage: restore availability
Computing nodes
Storage nodes
• What if something goes wrong?
– Problem: we may not know which one is faulty.
• Replace all computing nodes
– Computing nodes do not have persistent state.
20
Discussion
• Does Salus achieve BFT with f+1 replication?
– No. A fork can happen under a combination of failures.
• Is it worth using 2f+1 replicas to eliminate this case?
– No. A fork can still happen as a result of other events (e.g., geo-replication).
21
Outline
• Challenges
• Salus
– Pipelined commit
– Active storage
– Scalable end-to-end checks
• Evaluation
22
Evaluation
• Is Salus robust against failures?
• Do the additional mechanisms hurt scalability?
23
Is Salus robust against failures?
24
• Always ensures barrier semantics
– Experiment: crash the block driver
– Result: requests after "holes" are discarded
• A single computing node cannot corrupt data
– Experiment: inject errors into a computing node
– Result: storage nodes reject it and trigger replacement
• Block drivers never accept corrupted data
– Experiment: inject errors into a computing/storage node
– Result: the block driver detects the errors and retries other nodes
Do the additional mechanisms hurt scalability?
25
Conclusion
• Pipelined commit
– Writes in parallel
– Provides ordering guarantees
• Active storage
– Eliminates single points of failure
– Consumes less network bandwidth
26
High robustness ≠ Low performance
Salus: scalable end-to-end verification
• Goal: read safely from one node
– The client should be able to verify the reply.
– If the reply is corrupted, the client retries another node.
• Well-known solution: Merkle tree
– Problem: scalability
• Salus' solution:
– Single writer
– Distribute the tree among the servers
27
Scalable end-to-end verification
[Figure: the Merkle tree is distributed across servers 1-4, each holding a subtree; the client maintains the top of the tree]
The client does not need to store anything persistently; it can rebuild the top tree from the servers. (A verification sketch follows below.)
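A hedged sketch of the end-to-end check: the client keeps only the top of the Merkle tree (here a single region root, for illustration) and verifies each block it reads against it using sibling hashes returned by the untrusted server. The tree layout and helper names are assumptions made for this example.

# Illustrative sketch: client-side verification against a Merkle root.
# A four-block region with a two-level tree is an assumption for illustration.
import hashlib

def h(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

# Server side (untrusted): the blocks and the hashes over them.
blocks = [b"block0", b"block1", b"block2", b"block3"]
leaves = [h(b) for b in blocks]
region_root = h(h(leaves[0], leaves[1]), h(leaves[2], leaves[3]))

def server_read(i: int):
    """Return the block plus the sibling hashes needed to recompute the root."""
    sibling = leaves[i ^ 1]
    other_pair = h(leaves[2], leaves[3]) if i < 2 else h(leaves[0], leaves[1])
    return blocks[i], (sibling, other_pair)

# Client side: it stores only the top of the tree (here, region_root).
def client_verify(i: int, block: bytes, proof, root: bytes) -> bool:
    sibling, other_pair = proof
    pair = h(h(block), sibling) if i % 2 == 0 else h(sibling, h(block))
    node = h(pair, other_pair) if i < 2 else h(other_pair, pair)
    return node == root

blk, proof = server_read(2)
assert client_verify(2, blk, proof, region_root)               # reply verifies
assert not client_verify(2, b"corrupted", proof, region_root)  # corrupted reply: retry another node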
28
Scalable and robust storage
More hardware, more complex software → more failures
29
Start from “Scalable”
30
Pipelined commit - challenge
• Is 2PC slow?
– Locking? Single writer, so no locking.
– Phase 2 network overhead? Small compared to Phase 1.
– Phase 2 disk overhead? Eliminated; the complexity is pushed to recovery.
• Ordering across transactions
– Easy when failure-free.
– Complicates recovery.
Key challenge: Recovery
31
Scalable architecture
Computation node:
• No persistent storage
• Data forwarding, garbage collection, etc.
• Tablet server (Bigtable), region server (HBase), stream manager (WAS), …
32
Computation node
Storage nodes
Salus: Overview
• Block store interface (Amazon EBS):
– Volume: a fixed number of fixed-size blocks (an interface sketch follows below)
– Single block driver per volume: barrier semantics
• Failure model: arbitrary, but no identity forging
• Key ideas:
– Pipelined commit: write in parallel and in order
– Active storage: replicate computation nodes
– Scalable end-to-end checks: the client verifies replies
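A minimal sketch of the EBS-like interface described above: a volume is a fixed number of fixed-size blocks accessed through a single block driver, and a write may carry a barrier flag. The class and constant names are illustrative assumptions.

# Illustrative sketch of an EBS-like volume interface. Names and the 4 KB
# block size are assumptions made for this example.
BLOCK_SIZE = 4096

class Volume:
    def __init__(self, num_blocks: int):
        self.blocks = [bytes(BLOCK_SIZE)] * num_blocks

    def write(self, block_no: int, data: bytes, barrier: bool = False) -> None:
        # barrier=True means all earlier writes must be executed before this one.
        assert len(data) == BLOCK_SIZE
        self.blocks[block_no] = data

    def read(self, block_no: int) -> bytes:
        return self.blocks[block_no]

vol = Volume(num_blocks=1024)
vol.write(0, b"x" * BLOCK_SIZE, barrier=True)
assert vol.read(0) == b"x" * BLOCK_SIZE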
33
Complex recovery
• Still need to guarantee barrier semantics.
• What if servers fail concurrently?
• What if different replicas diverge?
34
Support concurrent users
• No perfect solution today
• Give up linearizability
– Causal or Fork-Join-Causal consistency (Depot)
• Give up scalability
– SUNDR
35
How to handle storage node failures?
• Computing nodes generate certificates when writing to storage nodes
• If a storage node fails, the computing nodes agree on the current state
• Approach: look for the longest valid prefix (sketched below)
• After that, re-replicate the data
• As long as one correct storage node is available, there is no data loss
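A hedged sketch of the recovery rule above: the computing nodes collect each surviving storage node's log of certified updates and adopt the longest prefix whose certificates all verify. The placeholder cert_ok check and the data layout are assumptions made for this example.

# Illustrative sketch: agree on state by taking the longest valid prefix of
# certified updates across surviving storage nodes. cert_ok is a stand-in for
# real certificate verification.
def cert_ok(entry: dict) -> bool:
    return entry.get("cert") == "valid"          # placeholder check

def longest_valid_prefix(log: list) -> list:
    """Return the longest prefix of a log whose entries all carry valid certificates."""
    prefix = []
    for entry in log:
        if not cert_ok(entry):
            break
        prefix.append(entry)
    return prefix

def agree_on_state(replica_logs: list) -> list:
    """Adopt the longest valid prefix found on any surviving storage node."""
    return max((longest_valid_prefix(log) for log in replica_logs), key=len)

logs = [
    [{"seq": 1, "cert": "valid"}, {"seq": 2, "cert": "valid"}],
    [{"seq": 1, "cert": "valid"}, {"seq": 2, "cert": "corrupt"}, {"seq": 3, "cert": "valid"}],
]
assert [e["seq"] for e in agree_on_state(logs)] == [1, 2]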
36
Why barrier semantics?
• Goal: provide atomicity
• Example (file system): to append to a file, we need to modify a data block and the corresponding pointer to that block.
• It is bad if the pointer is modified but the data is not.
• Solution: modify the data first and then the pointer. The pointer write is attached to a barrier (see the sketch below).
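A small sketch of the file-system example above: write the new data block first, then write the pointer with a barrier, so the pointer can never become durable before the data it names. FakeDriver and append_to_file are hypothetical names used only for this example.

# Illustrative sketch: appending to a file on a barrier-aware block store.
# FakeDriver only records write order; a real driver would be the volume's
# single Salus block driver.
class FakeDriver:
    def __init__(self):
        self.log = []

    def write(self, block_no: int, data: bytes, barrier: bool = False) -> None:
        self.log.append((block_no, data, barrier))

def append_to_file(driver, data_block_no, data, pointer_block_no, pointer):
    # 1. Write the new data block first; ordering does not matter yet.
    driver.write(data_block_no, data)
    # 2. Write the pointer with a barrier: it may only be executed after all
    #    earlier writes, so it can never point at missing data.
    driver.write(pointer_block_no, pointer, barrier=True)

d = FakeDriver()
append_to_file(d, data_block_no=42, data=b"new data", pointer_block_no=7, pointer=b"-> 42")
assert d.log[-1][2] is True   # the pointer write carries the barrier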
37
Cost of robustness
38
[Figure: robustness vs. cost. Crash tolerance and Salus sit at the low-cost end; eliminating forks requires 2f+1 replication; fully BFT requires signatures and N-way programming.]