ZooKeeper

CSCE 678

Weak vs Strong Consistency

• In distributed systems, consistency is often the guarantee that gets weakened

• Examples of weak consistency:
  • NoSQL servers – Dynamo
  • Datanodes in HDFS

• But sometimes, strong consistency is needed

Use Cases of Strong Consistency

• Configuration management

• Message Queues

• Group membership

• Synchronization:
  • Mutexes and read/write locks
  • Barriers: process joins

(Will talk about them one by one)

Use Case: Configuration & Group Membership

[Figure: the primary pushes a configuration record with the fields Primary: …, Secondaries: …, Port IDs: …, Created by: …]

Use Case: Message Queues

[Figure: Producer A and Producer B enqueue messages; a Consumer dequeues them.]

Use Case: Synchronization

• Mutexes

lock();
last_x = read(x);
write(y, last_x);
write(x, last_x + 1);
unlock();

• Read/write locks

rd_lock();                  // shared among readers
if (queue.size > 0) {
    wr_lock();              // exclusive for one writer
    x = queue.dequeue();
    wr_unlock();
}
rd_unlock();

How to Define Strong Consistency?

• Linearizability
  • Also called atomic consistency or immediate consistency
  • As soon as a write operation finishes, the whole system should see the latest data.

Serializability vs Linearizability

• In distributed systems, these two terms are used in very specific contexts

• Serializability: for databases
  • Equivalent to serializable isolation (the I in ACID)
  • No ordering constraint between concurrent transactions

• Linearizability: for strongly consistent reads/writes
  • With respect to individual operations, not transactions
  • Considers the global ordering of reads/writes

Serializability vs Linearizability

• Serializability

[Figure: timeline. Client A submits the transaction read(x), write(y), read(y). Client B submits the transactions read(y), write(x), write(y) and read(y), read(x), write(y). The system executes them one after another: the transaction from A, then the two from B.]

No ordering constraint between concurrent transactions.

Serializability vs Linearizability

• Serializability can actually be implemented on top of linearizability (i.e., with a linearizable lock service)

[Figure: Server 1 manages lock(x) and Server 2 manages lock(y). The transaction read(x), write(y), read(y) is granted a read (shared) lock and a write (exclusive) lock (ok, ok). The transaction read(y), write(x), write(y) is blocked.]

Two-phase locking (2PL); unrelated to two-phase commit (2PC).
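The locking scheme on this slide can be sketched as a small in-memory lock table. This is only an illustration under assumptions: `LockTable`, `acquire`, and `release_all` are hypothetical names, and the table returns False where a real lock service would block the caller.

```python
class LockTable:
    """Shared/exclusive locks per key; a non-blocking stand-in for a lock service."""

    def __init__(self):
        self.readers = {}   # key -> set of transactions holding a shared lock
        self.writer = {}    # key -> transaction holding the exclusive lock

    def acquire(self, txn, key, exclusive):
        """Return True if granted, False if the caller would have to block."""
        owner = self.writer.get(key)
        others_reading = self.readers.get(key, set()) - {txn}
        if owner not in (None, txn):
            return False                      # another transaction writes this key
        if exclusive:
            if others_reading:
                return False                  # another transaction reads this key
            self.writer[key] = txn
        elif owner is None:
            self.readers.setdefault(key, set()).add(txn)
        return True

    def release_all(self, txn):
        """Shrinking phase of 2PL: drop every lock when the transaction ends."""
        for holders in self.readers.values():
            holders.discard(txn)
        for key in [k for k, t in self.writer.items() if t == txn]:
            del self.writer[key]


locks = LockTable()
# T1 = read(x), write(y), read(y): shared lock on x, exclusive lock on y.
t1_granted = [locks.acquire("T1", "x", exclusive=False),
              locks.acquire("T1", "y", exclusive=True)]
# T2 = read(y), write(x), write(y): needs y, which T1 holds exclusively.
t2_blocked = not locks.acquire("T2", "y", exclusive=True)
locks.release_all("T1")
t2_granted = (locks.acquire("T2", "y", exclusive=True)
              and locks.acquire("T2", "x", exclusive=True))
```

T2 only proceeds after T1 has released everything, so the two transactions behave as if they ran one after the other.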

Serializability vs Linearizability

• Linearizability: as soon as a write operation finishes, the whole system should see the latest data.

[Figure: timeline. Client A issues write(x, 1) => ok, and the value is inserted on Server 1 and Server 2. Client B then issues read(x) => 1.]

More on Linearizability

• Reads concurrent with a write

[Figure: timeline with Clients A, B, and C. Client C issues write(x, 1) => ok. Reads by the other clients return read(x) => 0 before the write takes effect and read(x) => 1 afterwards.]

Any read that completes before the write begins must see the old value.

Any read that overlaps the write may see the old or the new value; but as soon as one client sees the new value, all later reads by any client must see it too.
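The rule for reads that overlap a write can be checked mechanically for a single register. The sketch below is a brute-force checker, not how any real system verifies linearizability; `Op`, `is_linearizable`, and the timestamps are made up for illustration.

```python
from itertools import permutations
from typing import NamedTuple


class Op(NamedTuple):
    client: str
    kind: str      # "read" or "write"
    value: int     # value written, or value the read returned
    start: float   # invocation time
    end: float     # response time


def is_linearizable(history, initial=0):
    """Try every total order; accept one that respects real time and register semantics."""
    for order in permutations(history):
        # Real-time rule: an op that finished before another started must come first.
        respects_time = all(not (later.end < earlier.start)
                            for i, earlier in enumerate(order)
                            for later in order[i + 1:])
        if not respects_time:
            continue
        # Register rule: every read returns the most recently written value.
        current = initial
        for op in order:
            if op.kind == "write":
                current = op.value
            elif op.value != current:
                break
        else:
            return True
    return False


# The write by C spans t=2..6; reads overlap it.
ok_history = [
    Op("C", "write", 1, 2, 6),
    Op("A", "read", 0, 1, 3),   # overlaps the write, old value is allowed
    Op("B", "read", 1, 4, 5),   # overlaps the write, new value is allowed
    Op("A", "read", 1, 7, 8),   # after the write, must be the new value
]
# B already saw 1, then A reads 0 afterwards: the value went "back in time".
bad_history = [
    Op("C", "write", 1, 2, 6),
    Op("B", "read", 1, 3, 4),
    Op("A", "read", 0, 5, 5.5),
]
ok_result = is_linearizable(ok_history)
bad_result = is_linearizable(bad_history)
```

The search is factorial in the number of operations, so it is only meant for reasoning about tiny histories like the ones on the slide.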

More on Linearizability

• Write concurrent with another write

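[Figure: timeline with Clients A, B, and C. write(x, 1) => ok is followed by read(x) => 1; write(x, 2) => ok is followed by read(x) => 2.]

A write that completes before another write begins must be committed first. Writes that overlap may be committed in either order, but every client must observe the same order.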

Linearizability vs Causality

• Linearizability implies causal consistency

(1) Causality is based on the happens-before relationship within a client.

If a client sets x = 1 and then sets y = 0, the two values are causally related.

[Figure: timeline. Client A issues write(x, 1) and write(y, 0). Client B issues read(y) => 0 and then read(x) => 0. This violates the causal relationship: having seen the write to y, B must also see the earlier write to x.]

Linearizability vs Causality

• Linearizability implies causal consistency

(2) Linearizability implies a total order of the operations on each object, so it automatically preserves happens-before.

A linearizable system doesn’t have to do anything extra to preserve causal consistency.

[Figure: timeline. Client A issues write(x, 1) and write(y, 0). Client B issues read(x) => 0 before the writes, then read(y) => 0 and read(x) => 1 after them, consistent with the total order.]

How to Implement Linearizability?

• Read/write from a single leader
  • Failover to a replica may lose linearizability

• Consensus algorithms
  • Using two-phase commit (2PC) or Paxos
  • Prevent stale replicas

• Most likely not linearizable: multi-leader replication

Q: How does linearizability apply to the CAP theorem?

Linearizability in CAP Theorem

• Linearizability = Strong C
  • Read/write through a single leader: lose A & P
  • Read through replicas, write through leader: lose P

• With consensus, linearizability can be:
  • Partition tolerant as long as more than ½ of the replicas are connected
  • Fairly available, with wait-free operations and fast, lossless leader recovery

Apache ZooKeeper

• A coordination service for all use cases of strong consistency in a distributed system

• Wait-free: no operation blocks on other slow or failed clients

• ZooKeeper has no API for locking, but its primitives can be used to implement locking mechanisms

System Overview

[Figure: the ZooKeeper service consists of five servers, one leader and four followers. Clients connect to the servers; followers forward operations to the leader.]

Namespace

• ZooKeeper uses a filesystem-like namespace

/
    /App1
        /App1/p_1
        /App1/p_2
        /App1/p_3
    /App2

Each path is a znode that holds data: not large data files, more like metadata.

Namespace

• ZooKeeper has two types of znodes (paths)
  • Permanent (regular): clients explicitly create and delete the znodes
  • Ephemeral: clients create the znodes, and either delete them explicitly or let the system delete them automatically when the client session times out

API

• create(path, data, flags)
  • flags: regular/ephemeral, sequential (appends a sequence number to the name)

• getData(path, watch) -> (data, version)

• setData(path, data, version)

• delete(path, version)

• exists(path, watch) -> true/false

• getChildren(path, watch) -> [paths]

• sync(path): the path argument is currently ignored
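The version argument of setData and delete enables conditional updates. The sketch below imitates that behaviour with an in-memory store; `ZNodeStore`, `BadVersion`, and `increment` are hypothetical names and not the ZooKeeper client API. One simplification to note: the sequence counter here is global, whereas ZooKeeper keeps one per parent znode.

```python
class BadVersion(Exception):
    """Raised when the caller's expected version does not match the stored one."""


class ZNodeStore:
    def __init__(self):
        self.nodes = {}     # path -> (data, version)
        self.counter = 0    # source of sequence numbers (simplified: global)

    def create(self, path, data, sequential=False):
        if sequential:
            self.counter += 1
            path = f"{path}{self.counter:010d}"   # append a zero-padded seqnum
        if path in self.nodes:
            raise KeyError(f"{path} already exists")
        self.nodes[path] = (data, 0)
        return path

    def get_data(self, path):
        return self.nodes[path]                   # (data, version)

    def set_data(self, path, data, version):
        _, current = self.nodes[path]
        if version != -1 and version != current:  # -1 skips the version check
            raise BadVersion(path)
        self.nodes[path] = (data, current + 1)


def increment(store, path):
    """Optimistic read-modify-write: retry if someone else updated the znode first."""
    while True:
        data, version = store.get_data(path)
        try:
            store.set_data(path, data + 1, version)
            return
        except BadVersion:
            continue


store = ZNodeStore()
store.create("/counter", 0)
data, version = store.get_data("/counter")
store.set_data("/counter", 5, version)        # another client wins the race
try:
    store.set_data("/counter", 1, version)    # stale version is rejected
    stale_rejected = False
except BadVersion:
    stale_rejected = True
increment(store, "/counter")
member = store.create("/App1/p_", b"", sequential=True)
```

A write with a stale version fails instead of silently overwriting, which lets clients build atomic updates without a lock.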

Asynchronous Operations

• Operations can be synchronous or asynchronous

• A client can queue up multiple asynchronous operations

• Server responses trigger the callbacks registered by the client

[Figure: a client queues getData(x), create(y), … to the ZooKeeper service without waiting. The responses arrive in the same order and invoke the callback for getData(x), the callback for create(y), …]
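A minimal sketch of the asynchronous style: requests are queued without waiting, and callbacks fire in submission order. `AsyncClient` and the toy `server` function are assumptions made for illustration and do not mirror the real client library.

```python
from collections import deque


class AsyncClient:
    def __init__(self, server):
        self.server = server          # any callable: (op, arg) -> result
        self.outstanding = deque()    # requests sent but not yet answered

    def submit(self, op, arg, callback):
        """Queue a request and return immediately, without waiting for the reply."""
        self.outstanding.append((op, arg, callback))

    def deliver_responses(self):
        """Answer requests in submission (FIFO) order, invoking each callback."""
        while self.outstanding:
            op, arg, callback = self.outstanding.popleft()
            callback(op, arg, self.server(op, arg))


data = {"x": 1}


def server(op, arg):
    """Toy server supporting only the two operations shown on the slide."""
    if op == "getData":
        return data.get(arg)
    if op == "create":
        data[arg] = None
        return arg
    raise ValueError(op)


events = []
client = AsyncClient(server)
client.submit("getData", "x", lambda op, arg, result: events.append((op, result)))
client.submit("create", "y", lambda op, arg, result: events.append((op, result)))
queued_before_reply = len(client.outstanding)
client.deliver_responses()
```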

Event Notification

• All operations except sync are wait-free

• No locking API but clients can implement locks using watch events

Lock (very naïve version)

1 l = "/my-lock";
2 if exists(l, watch=true) then wait for watch event;
3 n = create(l, EPHEMERAL);
4 if n is error then goto 2;

Unlock

1 delete(l);

The server pushes an event to the client when the watched path is updated.
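The naïve lock can be simulated in a single process to see how watch events drive it. Everything here is a hypothetical stand-in: `TinyStore` only models an ephemeral znode with one-shot watches, and, as a simplification, a watch is registered only when the znode already exists.

```python
class NodeExists(Exception):
    pass


class TinyStore:
    def __init__(self):
        self.nodes = {}      # path -> session that owns the ephemeral znode
        self.watches = {}    # path -> one-shot callbacks waiting for a change

    def exists(self, path, watch=None):
        present = path in self.nodes
        if present and watch is not None:
            self.watches.setdefault(path, []).append(watch)
        return present

    def create_ephemeral(self, path, session):
        if path in self.nodes:
            raise NodeExists(path)
        self.nodes[path] = session

    def delete(self, path):
        del self.nodes[path]
        for callback in self.watches.pop(path, []):
            callback()       # push the event to every watcher

    def close_session(self, session):
        """Session timeout: the system deletes the session's ephemeral znodes."""
        for path in [p for p, s in self.nodes.items() if s == session]:
            self.delete(path)


def acquire(store, session, on_acquired, path="/my-lock"):
    # Step 2: if the lock znode exists, wait for the watch event, then retry.
    if store.exists(path, watch=lambda: acquire(store, session, on_acquired, path)):
        return
    try:
        store.create_ephemeral(path, session)              # step 3
    except NodeExists:
        acquire(store, session, on_acquired, path)         # step 4: goto 2
        return
    on_acquired(session)


store = TinyStore()
holders = []
acquire(store, "A", holders.append)
acquire(store, "B", holders.append)   # waits on the watch
acquire(store, "C", holders.append)   # waits on the watch
after_first = list(holders)
store.delete("/my-lock")              # A unlocks: B and C are both woken, B wins
after_unlock = list(holders)
store.close_session("B")              # B's session ends: its lock znode disappears
```

Every waiter is woken on each release even though only one can win, which is presumably one reason the slide calls this version naïve.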

A(asynchronous)-Linearizability

• Local order: all operations from the same client are processed FIFO (first-in-first-out)

• Global order: Linearizable writes

[Figure: Client A and Client B each send write Op 1, write Op 2, … to their own server.]

A consensus protocol ensures that the write ops are applied on every server exactly in their global order.

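A(asynchronous)-Linearizability

• Reads are not linearized (may not see the latest state)
  • Reads are served directly by the connected server (servers are replicas)
  • That server may still have pending writes in its queue
  • Solution: call sync before a read that must observe the latest writes

[Figure: Client A sends write Op 1, write Op 2, sync to its server; Client B sends read Op 1, read Op 2, … to a different server.]

A read issued after sync completes sees all writes committed before the sync. sync only brings the caller's own server up to date, so a reader on another server (Client B here) should issue its own sync before reading.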
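Why a local read can be stale, and what sync changes, can be shown with two toy classes. `SyncLeader` and `LaggingFollower` are hypothetical; the leader's log stands in for the globally ordered, committed writes.

```python
class SyncLeader:
    def __init__(self):
        self.log = []                 # committed writes in global order

    def write(self, key, value):
        self.log.append((key, value))


class LaggingFollower:
    def __init__(self, leader):
        self.leader = leader
        self.applied = 0              # how many log entries were applied locally
        self.state = {}

    def read(self, key):
        """Served from local state, without contacting the leader."""
        return self.state.get(key)

    def sync(self):
        """Apply every write the leader committed before this call."""
        for key, value in self.leader.log[self.applied:]:
            self.state[key] = value
        self.applied = len(self.leader.log)


leader = SyncLeader()
follower = LaggingFollower(leader)
leader.write("x", 1)                  # committed, but not yet applied here
stale = follower.read("x")
follower.sync()
fresh = follower.read("x")
```

In this sketch the reader's server is the one that catches up, so the sync is issued on the reading side just before the read.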

Zab (Atomic Broadcast)

• All servers forward messages to a single leader
  • The leader plays an important role as a sequencer
  • The leader can change if it is partitioned away or fails

• The leader broadcasts (proposes) the messages to be delivered to all followers

[Figure: the ZooKeeper service with one leader and four followers; followers forward operations to the leader.]

Zab (Atomic Broadcast)

• Two-phase commit (2PC)

[Figure: the leader receives a request, e.g., write(x, 1), and exchanges messages with two followers.]

Step 1. The leader assigns a monotonically increasing id (zxid) to the request.

Step 2. The leader proposes the change, with its zxid, to the followers.

Step 3. The followers acknowledge the proposal.

Step 4. The leader commits the change if it receives acks (votes) from more than ½ of the servers.

Q: In what situation may the follower reject the proposal?
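The four steps can be sketched as a toy broadcast round. This is not the actual Zab implementation: `ZabLeader`, `ZabFollower`, and the `reachable` flag are illustrative names, and the leader is assumed to count its own vote toward the majority.

```python
class ZabFollower:
    def __init__(self, name):
        self.name = name
        self.reachable = True
        self.pending = {}      # zxid -> proposed change, not yet committed
        self.committed = []    # (zxid, change) in commit order

    def propose(self, zxid, change):
        """Step 3: acknowledge the proposal if the follower can be reached."""
        if not self.reachable:
            return False
        self.pending[zxid] = change
        return True

    def commit(self, zxid):
        if self.reachable and zxid in self.pending:
            self.committed.append((zxid, self.pending.pop(zxid)))


class ZabLeader:
    def __init__(self, followers):
        self.followers = followers
        self.next_zxid = 1
        self.committed = []

    def broadcast(self, change):
        zxid = self.next_zxid                 # step 1: monotonically increasing id
        self.next_zxid += 1
        # Step 2: propose to all followers; the leader's own vote counts as 1.
        acks = 1 + sum(f.propose(zxid, change) for f in self.followers)
        ensemble = len(self.followers) + 1
        if acks * 2 <= ensemble:              # step 4 needs MORE than half
            return None
        self.committed.append((zxid, change))
        for f in self.followers:
            f.commit(zxid)
        return zxid


followers = [ZabFollower(f"f{i}") for i in range(4)]
leader = ZabLeader(followers)
followers[0].reachable = False
followers[1].reachable = False
first = leader.broadcast(("x", 1))    # 3 of 5 servers ack: committed
followers[2].reachable = False
second = leader.broadcast(("x", 2))   # 2 of 5 servers ack: not committed
```

With five servers the service keeps committing while two are unreachable and stops once a third is lost.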

Leader Election

• If a server finds the leader disconnected or failed, it tries to become the leader (same 2PC protocol).

[Figure: a candidate for leader and two followers.]

A candidate becomes the leader when it receives more than ½ of the votes.

If followers receive multiple proposals, they vote for the candidate with the highest zxid.
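The voting rule can be sketched as a single counting round. The function `elect` and its inputs are hypothetical, and breaking zxid ties by server name is an assumption added only to make the example deterministic.

```python
from collections import Counter


def elect(last_zxid, heard_from, ensemble_size):
    """Return the winning candidate, or None if nobody has a majority.

    last_zxid:   candidate name -> highest zxid that candidate has seen
    heard_from:  voter name -> list of candidates whose proposals reached it
    """
    votes = Counter()
    for voter, candidates in heard_from.items():
        if candidates:
            # Vote for the candidate with the highest zxid (ties: by name).
            choice = max(candidates, key=lambda c: (last_zxid[c], c))
            votes[choice] += 1
    for candidate, count in votes.items():
        if count * 2 > ensemble_size:         # more than half of all servers
            return candidate
    return None


last_zxid = {"s1": 7, "s2": 9}
heard_from = {
    "s1": ["s1", "s2"],
    "s2": ["s1", "s2"],
    "s3": ["s1", "s2"],
    "s4": ["s2"],
    "s5": [],                                 # partitioned away, casts no vote
}
winner = elect(last_zxid, heard_from, ensemble_size=5)
no_quorum = elect(last_zxid, {"s1": ["s1"], "s2": ["s2"]}, ensemble_size=5)
```

Preferring the highest zxid means the new leader has seen at least as many proposals as any competing candidate.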

References

• “ZooKeeper: Wait-free coordination for Internet-scale systems”, USENIX ATC ’10 (by Hunt et al.)

• “A simple totally ordered broadcast protocol” (Zab protocol), LADIS ’08 (by Reed and Junqueira)

• “Linearizability: A Correctness Condition for Concurrent Objects”, TOPLAS 1990 (by Herlihy and Wing)

• “Designing Data-Intensive Applications”, O’Reilly 2017 (by Martin Kleppmann)