12 Consistency & Replication


Transcript of 12 Consistency & Replication

Page 1: 12 Consistency & Replication

Distributed Systems

Consistency & Replication (II)

Page 2: 12 Consistency & Replication

Client-centric Consistency Models

• Guarantees for a single client
• How to hide inconsistencies from a client?
  – … assuming a data store where concurrent conflicting updates are rare
  – … and relatively easy to resolve
• Examples:
  – DNS
    • Single naming authority per zone
    • “lazy” propagation of updates
  – WWW
    • No write-write conflicts
    • Usually acceptable to serve slightly out-of-date pages from a cache
  – Bayou (Terry et al., 1994)

Page 3: 12 Consistency & Replication

Eventual Consistency

• The principle of a mobile user accessing different replicas of a distributed database.

If no updates take place for some time, all replicas gradually converge to a consistent state …

Page 4: 12 Consistency & Replication

Alternative client-centric models

• x_i[t]: version of object x at local copy L_i at time t
  – … the result of a series of writes performed at L_i since system initialization
  – WS(x_i[t]): that series of writes
  – WS(x_i[t1]; x_j[t2]): the series of writes WS(x_i[t1]) that has also been performed at copy L_j by the later time t2
• Assume an “owner” for each data item
  – … to avoid write-write conflicts
• Monotonic reads
• Monotonic writes
• Read-your-writes
• Writes-follow-reads

Page 5: 12 Consistency & Replication

Monotonic Reads

• The read operations performed by a single process P at two different local copies of the same data store.
  a) A monotonic-read consistent data store
  b) A data store that does not provide monotonic reads

If a process has seen a value of x at time t, it will never see an older value at a later time.

Example: replicated mailboxes with on-demand propagation of updates

In case (a), WS(x1) is part of WS(x2).

Page 6: 12 Consistency & Replication

Monotonic Writes

• The write operations performed by a single process P at two different local copies of the same data store.
  a) A monotonic-write consistent data store
  b) A data store that does not provide monotonic-write consistency

If an update is made to a copy, all preceding updates must have been completed first.

Example: a software library

Requires FIFO propagation of updates by each process.

Note that a write may affect only part of the state of a data item, so there is no guarantee that x at L2 has the same value as x at L1 at the time W(x1) completed.

Page 7: 12 Consistency & Replication

Read Your Writes

a) A data store that provides read-your-writes consistency.
b) A data store that does not.

A write is completed before a successive read, no matter where the read takes place.

Negative examples:
- updates of Web pages
- changes of passwords

In case (b), the effects of the previous write at L1 have not yet been propagated!

Page 8: 12 Consistency & Replication

Writes Follow Reads

a) A writes-follow-reads consistent data store
b) A data store that does not provide writes-follow-reads consistency

Any successive write will be performed on a copy that is up-to-date with the value most recently read by the process.

Example:
- updates to a newsgroup: responses are visible only after the original posting has been received

Page 9: 12 Consistency & Replication

Implementing client-centric models (I)

• Globally unique ID per write operation
  – Assigned by the initiating server
• Per-client state:
  – Read set
    • IDs of the writes relevant to the client’s read operations
  – Write set
    • IDs of the writes performed by the client
• Major performance issue:
  – Size of the read/write sets? (a sketch of this state follows)
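A minimal sketch of this per-client state in Python; the names `WriteId` and `SessionState` are illustrative, not from the source:

```python
from dataclasses import dataclass, field

# One globally unique ID per write operation, assigned by the initiating
# server: (server name, per-server sequence number) is one simple encoding.
WriteId = tuple[str, int]

@dataclass
class SessionState:
    read_set: set = field(default_factory=set)    # IDs of writes behind the client's reads
    write_set: set = field(default_factory=set)   # IDs of writes performed by the client
```

The performance issue named on this slide is already visible here: both sets grow with every operation, which is what the vector-timestamp representation later in the section addresses.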

Page 10: 12 Consistency & Replication

Implementing client-centric models (II)

• Monotonic reads:
  – When a client issues a read, the server is given the client’s read set, to check whether all the identified writes have taken place locally
    • If not, the server contacts other servers to bring itself up-to-date
  – After the read, the client’s read set is updated with the server’s “relevant” writes
• Monotonic writes:
  – When a client issues a write, the server is given the client’s write set
    • … to ensure that all specified writes have been applied (in order)
  – The write operation’s ID is appended to the client’s write set (both checks are sketched below)
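A sketch of both checks, reusing `SessionState` from above. `Server` is a toy in-memory replica and `pull_writes` stands in for contacting other servers; all names are assumptions, not from the source:

```python
import itertools

class Server:
    """Toy in-memory replica; `peers` stands in for the other servers."""
    _seq = itertools.count()   # source of per-server sequence numbers

    def __init__(self, name, peers=()):
        self.name = name
        self.peers = list(peers)
        self.seen = {}         # write ID -> update payload, applied locally

    def pull_writes(self, missing):
        # Contact other servers to bring this replica up-to-date.
        for peer in self.peers:
            for wid in missing & peer.seen.keys():
                self.seen[wid] = peer.seen[wid]

    def read(self, session):
        # Monotonic reads: all writes identified in the client's read set
        # must have taken place locally before the read is served.
        missing = session.read_set - self.seen.keys()
        if missing:
            self.pull_writes(missing)
        # After the read, the read set absorbs this server's "relevant"
        # writes (simplified here to all writes applied locally).
        session.read_set |= self.seen.keys()
        return list(self.seen.values())

    def write(self, session, update):
        # Monotonic writes: the client's earlier writes must have been
        # applied here before the new write is performed.
        missing = session.write_set - self.seen.keys()
        if missing:
            self.pull_writes(missing)
        wid = (self.name, next(Server._seq))   # globally unique write ID
        self.seen[wid] = update
        session.write_set.add(wid)
        return wid
```

Note the simplification: a real implementation would apply the pulled writes in order, as the slide requires.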

Page 11: 12 Consistency & Replication

Implementing client-centric models (III)

• Read-your-writes:
  – Before serving a read request, the server fetches (from other servers) all writes in the client’s write set
• Writes-follow-reads:
  – The server is first brought up-to-date with the writes in the client’s read set
  – After the write, the new ID is added to the client’s write set, along with the IDs in the read set
    • … as these have become “relevant” for the write just performed (see the sketch below)
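The remaining two guarantees follow the same pattern, again as a sketch on the same toy `Server`:

```python
def read_your_writes(server, session):
    # Fetch all writes in the client's write set before serving the read.
    missing = session.write_set - server.seen.keys()
    if missing:
        server.pull_writes(missing)
    return server.read(session)

def writes_follow_reads(server, session, update):
    # First bring the server up-to-date with the writes behind the
    # client's reads, then perform the write ...
    missing = session.read_set - server.seen.keys()
    if missing:
        server.pull_writes(missing)
    wid = server.write(session, update)
    # ... and fold the read set into the write set: those writes have
    # become "relevant" for the write just performed.
    session.write_set |= session.read_set
    return wid

# Example: a client writes at s1, then reads its own write at s2.
s1 = Server("s1"); s2 = Server("s2", peers=[s1])
session = SessionState()
s1.write(session, "x := 5")
print(read_your_writes(s2, session))   # s2 pulls the write from s1 first
```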

Page 12: 12 Consistency & Replication

Implementing client-centric models (IV)

• Grouping a client’s read and write operations into sessions
  – A session is typically associated with an application
    • … but may also be associated with an application that can be temporarily shut down (e.g., an email agent)
  – What if the client never closes a session?
• How to represent the read & write sets?
  – As lists of IDs of write operations
    • … but not all of these are actually needed!

Page 13: 12 Consistency & Replication

Implementing client-centric models (V)

• Using vector timestamps to improve efficiency:
  – When server S_i accepts a write operation, it assigns to it a globally unique ID, WID, and a timestamp ts(WID)
  – Each server maintains a vector RCVD(i)
    • RCVD(i)[j] := timestamp of the latest write initiated at server S_j that has been received & processed at S_i
  – The server returns its current vector timestamp with its responses to read/write requests
  – The client adjusts the timestamps of its own read/write sets accordingly

Page 14: 12 Consistency & Replication

Implementing client-centric models (VI)

• Efficient representation of a read/write set A:
  – VT(A): vector timestamp
    • VT(A)[i] := maximum timestamp of all operations in A that were initiated at server S_i
  – Union of two sets of write IDs:
    • VT(A+B)[i] := max{ VT(A)[i], VT(B)[i] }
  – Efficient check whether A is contained in B:
    • VT(A)[i] <= VT(B)[i] for all i
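In code, this compressed representation makes the set operations cheap. A sketch, assuming servers are indexed 0..n-1:

```python
def vt_union(a, b):
    # VT(A+B)[i] = max{ VT(A)[i], VT(B)[i] }
    return [max(x, y) for x, y in zip(a, b)]

def vt_contained_in(a, b):
    # A is contained in B iff VT(A)[i] <= VT(B)[i] for all i
    return all(x <= y for x, y in zip(a, b))

# Example: the client folds a server's returned timestamp into its read set.
read_set_vt = vt_union([3, 0, 7], [2, 5, 7])    # -> [3, 5, 7]
assert vt_contained_in([2, 5, 7], read_set_vt)  # containment check is O(n)
```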

Page 15: 12 Consistency & Replication


Replica Placement (I)

• The logical organization of different kinds of copies of a data store into three concentric rings.

Page 16: 12 Consistency & Replication

Replica Placement (II)

• Permanent copies
  – The basis of the distributed data store
  – Examples from the Web:
    • Anycasting & round-robin clusters
    • Mirror sites
• Server-initiated
  – Push caches
    • Dynamic replication to handle bursts
    • Read-only
  – Content Distribution Networks (CDNs)
• Client-initiated
  – Improve access time to data
    • Danger of “stale” data
  – Private vs shared caches

Page 17: 12 Consistency & Replication

Server-Initiated Replicas

• Counting access requests from different clients
  – At each server: a count of accesses for each file, and the originating clients
  – cntQ(P, F): the access count for file F, aggregated by the server P “closest” to the requesting clients
  – A routing DB determines the “closest” server for a client C (here, P is the closest server for both C1 & C2)
• Deletion threshold: del(S, F)
• Replication threshold: rep(S, F)
• Dynamic decisions to delete/migrate/replicate file F to server S (sketched below)
  – Extra care to ensure that at least one copy remains!
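A hypothetical rendering of the decision rule; the threshold values and the migration case are illustrative, not a fixed algorithm from the slide:

```python
def decide(cnt, del_thresh, rep_thresh, total_copies):
    # cnt: access count for file F at server S; del_thresh = del(S, F),
    # rep_thresh = rep(S, F), with del_thresh < rep_thresh.
    if cnt < del_thresh:
        # Too little traffic: drop the copy, but never the last one!
        return "delete" if total_copies > 1 else "keep"
    if cnt > rep_thresh:
        # Enough traffic to justify a new replica closer to the clients.
        return "replicate"
    # In between: consider migrating the copy toward the requesting clients.
    return "consider migration"

assert decide(cnt=3, del_thresh=5, rep_thresh=50, total_copies=1) == "keep"
```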

Page 18: 12 Consistency & Replication

Update propagation

• State vs operations:
  – Notification of an update
    • Invalidation protocols
    • Best for a low read/write ratio
  – Transfer data from one copy to another
    • Transfer of the actual data … or a log of changes
    • Batching
    • Best for a relatively high read/write ratio
  – Propagate the update operation to the other copies
    • Active replication
• Pull vs push:
  – Push: replicas maintain a high degree of consistency
    • Updates are expected to be of use to multiple readers
  – Pull: best for a low read/write ratio
  – Hybrid scheme based on the lease model
• Unicast vs multicast:
  – Push: multicast group
  – Pull: a single server or client requests an update

Page 19: 12 Consistency & Replication

Leases

• A promise by a server that it will push updates for a specified time period
  – After expiration, the client has to “pull” for updates
• Alternatives:
  – Age-based leases
    • Depending on the last time an item was modified
    • Long-lasting leases for items that are expected to remain unmodified
  – Renewal-frequency-based leases
    • Short-term leases for clients that only occasionally ask for a specific item
  – Leases based on the state-space overhead at the server
    • Lower expiration times as the server approaches overload (a combined sketch follows)
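The three policies could be combined along the following lines; this is purely illustrative, with made-up factors and ranges rather than anything prescribed by the slide:

```python
import time

def lease_duration(last_modified, renewal_rate, server_load, base=60.0):
    # Age-based: items unmodified for longer get longer leases.
    age_factor = min((time.time() - last_modified) / 3600.0, 10.0)
    # Renewal-frequency-based: frequent renewers get longer leases.
    renewal_factor = min(renewal_rate, 5.0)
    # State-space overhead: a loaded server (load in [0, 1]) hands out
    # shorter leases so that its per-client state expires sooner.
    load_factor = max(1.0 - server_load, 0.1)
    return base * (1.0 + age_factor) * (0.5 + renewal_factor) * load_factor
```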

Page 20: 12 Consistency & Replication

Pull versus Push Protocols

• Comparison between push-based & pull-based protocols in the case of multiple-client, single-server systems.

Issue                     Push-based                                Pull-based
State of server           List of client replicas and caches        None
Messages sent             Update (and possibly fetch update later)  Poll and update
Response time at client   Immediate (or fetch-update time)          Fetch-update time

Note that a push server is stateful: it keeps track of all caches.

Page 21: 12 Consistency & Replication


Remote-Write Protocols (I)

• Primary-based remote-write protocol with a fixed server to which all read & write operations are forwarded.

Page 22: 12 Consistency & Replication

Remote-Write Protocols (II)

• The principle of the primary-backup protocol.

Page 23: 12 Consistency & Replication

Primary-backup protocols

• Blocking updates
  – … a straightforward implementation of sequential consistency
    • The primary orders all updates
    • Processes see the effects of their most recent write
• Non-blocking updates
  – … reduce the blocking delay for the process that initiated the update
    • The process only waits for the primary’s ACK
  – Fault tolerance? (both variants are sketched below)
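A toy sketch of the two variants at the primary; the class and method names are assumptions, with `apply` standing for a synchronous round-trip to a backup and `apply_async` for fire-and-forget propagation:

```python
class Backup:
    def __init__(self):
        self.log = []
    def apply(self, op):         # synchronous: returning models the backup's ACK
        self.log.append(op)
    def apply_async(self, op):   # stand-in for background propagation
        self.log.append(op)

class Primary:
    def __init__(self, backups):
        self.backups = backups
        self.log = []            # the primary orders all updates

    def update_blocking(self, op):
        # Blocking: wait until every backup has applied the update, so a
        # process later sees the effect of its most recent write at any copy.
        self.log.append(op)
        for b in self.backups:
            b.apply(op)
        return "ack"

    def update_nonblocking(self, op):
        # Non-blocking: acknowledge once the primary has ordered the update;
        # propagation happens in the background, which is exactly where the
        # fault-tolerance question above comes from.
        self.log.append(op)
        for b in self.backups:
            b.apply_async(op)
        return "ack"
```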

Page 24: 12 Consistency & Replication

Local-Write Protocols (I)

• Primary-based local-write protocol in which a single copy is migrated between processes.

How is each data item’s current location kept track of?

Page 25: 12 Consistency & Replication

Local-Write Protocols (II)

• Primary-backup protocol in which the primary migrates to the process wanting to perform an update.

Suitable for disconnected operation.

Page 26: 12 Consistency & Replication


Active Replication (I)

• The problem of replicated invocations.

Page 27: 12 Consistency & Replication


Active Replication (II)

(a) Forwarding an invocation request from a replicated object.

(b) Returning a reply to a replicated object.

Page 28: 12 Consistency & Replication

Gifford’s quorum scheme (I)

• Version numbers or timestamps per copy
• A number of votes is assigned to each physical copy
  – a “weight” related to the demand for that particular copy
  – totV(g): total number of votes for a group g of RMs
  – totV: total votes
• Obtain a quorum before a read/write:
  – R votes before a read
  – W votes before a write
  – W > 0.5 · totV  ⇒  no write-write conflicts
  – (R + W) > totV(g)  ⇒  no read-write conflicts
• Any quorum pair must contain common copies
  – In case of a partition, it is not possible to perform conflicting operations on the same copy (a constraint check is sketched below)
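A small check of the two vote constraints, as a sketch that collapses the slide’s totV(g)/totV into a single total:

```python
def valid_quorums(votes, R, W):
    # votes[i] is the number of votes assigned to physical copy i.
    totV = sum(votes)
    no_ww_conflicts = W > totV / 2    # any two write quorums overlap
    no_rw_conflicts = R + W > totV    # any read quorum meets any write quorum
    return no_ww_conflicts and no_rw_conflicts

# Example 1 from the examples table later in this section: votes [1, 0, 0], R = W = 1.
assert valid_quorums([1, 0, 0], R=1, W=1)
# ROWA (read one, write all) over three equal copies: R = 1, W = 3.
assert valid_quorums([1, 1, 1], R=1, W=3)
```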

Page 29: 12 Consistency & Replication

Gifford’s quorum scheme (II)

• Read:
  – Version-number inquiries to find a set g of RMs with totV(g) >= R
  – Not all copies need to be up-to-date
    • Every read quorum contains at least one current copy
• Write:
  – Version-number inquiries to find a set g of RMs with totV(g) >= W whose copies are up-to-date
  – If there are insufficient up-to-date copies, replace a non-current copy with a copy of the current one
• Groups of RMs can be configured to provide different performance/reliability characteristics
  – Decrease W to improve writes
  – Decrease R to improve reads (the read side is sketched below)
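The read side can be sketched as follows; the `Copy` record and the poll-then-select logic are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Copy:
    votes: int      # votes assigned to this physical copy
    version: int    # per-copy version number
    value: object

def quorum_read(copies, R):
    # Version-number inquiries until at least R votes are assembled.
    quorum, votes = [], 0
    for c in copies:
        quorum.append(c)
        votes += c.votes
        if votes >= R:
            break
    else:
        raise RuntimeError("read quorum unavailable (the blocking case)")
    # Every read quorum contains at least one current copy, so the
    # highest version number in the quorum identifies an up-to-date copy.
    return max(quorum, key=lambda c: c.version).value

# Example 2's configuration from the examples table that follows: votes
# (2, 1, 1) with R = 2, so the first copy alone satisfies the read quorum.
print(quorum_read([Copy(2, 4, "x=8"), Copy(1, 4, "x=8"), Copy(1, 3, "x=5")], R=2))
```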

Page 30: 12 Consistency & Replication

Gifford’s quorum scheme (III)

• Performance penalty for reads
  – Due to the need to collect a read quorum
• Support for copies on the local disks of clients
  – Assigned zero votes: “weak representatives”
    • These copies cannot be included in a quorum
  – After obtaining a read quorum, the read may be carried out on the local copy if it is up-to-date
• Blocking probability:
  – In some cases, a quorum cannot be obtained

Page 31: 12 Consistency & Replication

Gifford’s quorum scheme (IV)

                             Example 1   Example 2   Example 3
Latency        Replica 1        75          75          75
(milliseconds) Replica 2        65         100         750
               Replica 3        65         750         750
Voting         Replica 1         1           2           1
configuration  Replica 2         0           1           1
               Replica 3         0           1           1
Quorum         R                 1           2           1
sizes          W                 1           3           3

Derived performance of the file suite:

Read   Latency                  65          75          75
       Blocking probability      0.01        0.0002      0.000001
Write  Latency                  75         100         750
       Blocking probability      0.01        0.0101      0.03

The examples assume 99% availability for each RM.
• Example 1: a file with a high read-to-write ratio.
• Example 2: a file with a moderate read-to-write ratio; reads can be satisfied by the local RM, but writes must also access one remote RM.
• Example 3: a file with a very high read-to-write ratio.

Page 32: 12 Consistency & Replication

Quorum-Based Protocols

Three examples of the voting algorithm:
a) A correct choice of read & write sets
b) A choice that may lead to write-write conflicts
c) A correct choice, known as ROWA (read one, write all)

Page 33: 12 Consistency & Replication

Transactions with Replicated Data

• Better performance
  – Concurrent service
  – Reduced latency
• Higher availability
• Fault tolerance
  – What if a replica fails or becomes isolated?
    • Upon rejoining, it must “catch up”
• Replicated transaction service
  – Data replicated at a set of replica managers
• Replication transparency
  – One-copy serializability
  – Read one, write all

Failures must be observed to have “happened before” any active transactions at other servers.

Page 34: 12 Consistency & Replication

Network Partitions

• Separate but viable groups of servers
• Optimistic schemes validate on recovery
  – Available copies with validation
• Pessimistic schemes limit availability until recovery

[Figure: a partition separates transaction T, performing withdraw(B), from transaction U, performing deposit(B); each operates on a different replica of B.]

Page 35: 12 Consistency & Replication

Fault Tolerance

• Design to recover after a failure with no loss of (committed) data.
• Designs for fault tolerance:
  – Single server that fails and recovers
  – Primary server with “trailing” backups
  – Replicated service

Page 36: 12 Consistency & Replication

Fault Tolerance = ?

• Define correctness criteria.
• When two replicas are separated by a network partition:
  – Both are deemed “incorrect” & stop serving.
  – One (the master) continues & the other ceases service.
  – One (the master) continues to accept updates & both continue to supply reads (of possibly stale data).
  – Both continue service & subsequently synchronise.

Page 37: 12 Consistency & Replication

Passive Replication (I)

• At any time, the system has a single primary RM
• One or more secondary (backup) RMs
• Front ends communicate with the primary; the primary executes requests and propagates updates to all backups
• If the primary fails, one backup is promoted to primary
• The new primary starts from the “coordination phase” for each new request
• What happens if the primary crashes before/during/after the agreement phase?

Page 38: 12 Consistency & Replication

Passive Replication (II)

[Figure: two clients (C), each behind a front end (FE), communicate with the primary RM, which propagates updates to two backup RMs.]

Page 39: 12 Consistency & Replication

Passive replication (III)

• Satisfies linearizability
• Front end: looks up the new primary when the current primary does not respond
• The primary RM is a performance bottleneck
• Can tolerate F failures with F+1 RMs
• A variation: clients can access backup RMs (linearizability is lost, but clients get sequential consistency)
• Sun NIS (yellow pages) uses passive replication: clients can contact primary or backup servers for reads, but only the primary server for updates

Page 40: 12 Consistency & Replication

Active replication (I)

• RMs are state machines with equivalent roles
• Front ends communicate the client requests to the RM group, using totally ordered reliable multicast
• RMs process requests independently & reply to the front end (correct RMs process each request identically)
• The front end can synthesize the final response to the client (tolerating Byzantine failures)
• Active replication provides sequential consistency if the multicast is reliable & totally ordered
• Byzantine failures (F out of 2F+1): the front end waits until it gets F+1 identical responses (sketched below)
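The F+1 rule at the front end can be sketched as a simple vote over the collected replies (a toy illustration, not a full Byzantine-fault-tolerant protocol):

```python
from collections import Counter

def synthesize_response(responses, F):
    # With at most F Byzantine RMs out of 2F+1, a value reported by F+1
    # RMs must have come from at least one correct RM.
    value, n = Counter(responses).most_common(1)[0]
    if n >= F + 1:
        return value
    raise RuntimeError("no F+1 identical responses yet; keep waiting")

assert synthesize_response(["ok", "ok", "bad"], F=1) == "ok"
```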

Page 41: 12 Consistency & Replication

Active replication (II)

[Figure: two clients (C), each behind a front end (FE), multicast requests to a group of three RMs.]

Page 42: 12 Consistency & Replication

Replication Architectures

• How many replicas are required?
  – All, or a majority?
• Forward all updates as soon as they are received.
• Two-phase commit protocol
  – The contacted replica acts as coordinator
  – What if one of the replicas isn’t available?
• Primary-copy replication

[Figure: a transaction T issues getBalance(A) and deposit(B) through replica managers; A is held at two replica managers and B at three.]

Page 43: 12 Consistency & Replication

Available Copies Replication

• Not all copies will always be available.
• Failures
  – Timeout at a failed replica
  – Rejection by a recovering, unsynchronised replica

[Figure: replica managers X and Y hold copies of A; M, N, and P hold copies of B. Transaction T performs getBalance(A) at X and then deposit(B); transaction U performs getBalance(B) at N and then deposit(A).]

Page 44: 12 Consistency & Replication

Local Validation

• Failure & recovery events do not occur during a transaction.
  – Failure and recovery must be serialised just like a transaction: they occur before or after a transaction, but not during it.
• Example:
  – T reads A before server X’s failure, therefore T → fail_X
  – T observes server N’s failure when it writes B, therefore fail_N → T
  – Combined: fail_N → T.getBalance(A) → T.deposit(B) → fail_X
  – Similarly for U: fail_X → U.getBalance(B) → U.deposit(A) → fail_N

Taken together: server X fails, followed by transaction U, followed by server N’s failure, followed by transaction T, followed by server X’s failure again. This ordering is cyclic and therefore inconsistent, so the transactions must not be allowed to commit. (A sketch of this check follows.)
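One way to mechanise this validation is as cycle detection over the observed happened-before edges; a minimal sketch, with the edge set encoding exactly the example above (the graph representation and helper names are illustrative):

```python
from collections import defaultdict

def has_cycle(edges):
    # Depth-first search for a back edge in the happened-before graph.
    graph = defaultdict(list)
    for a, b in edges:
        graph[a].append(b)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)

    def dfs(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:          # back edge: cycle found
                return True
            if color[nxt] == WHITE and dfs(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in list(graph))

# fail_N -> T -> fail_X -> U -> fail_N, from the example above.
edges = [("fail_N", "T"), ("T", "fail_X"),
         ("fail_X", "U"), ("U", "fail_N")]
assert has_cycle(edges)   # inconsistent: the transactions must not commit
```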