Coordination and Agreement

Master 2007. Subject: Coordination and Agreement

Description: Distributed systems

Transcript of Coordination and Agreement

  • Outline: Introduction; Distributed Mutual Exclusion; Election Algorithms; Group Communication; Consensus and Related Problems

  • Main assumptions: each pair of processes is connected by a reliable channel; processes are independent of each other; the network does not disconnect; processes fail only by crashing; each process has a local failure detector.

  • Distributed Mutual Exclusion (1): Mutual exclusion is very important: it prevents interference and ensures consistency when accessing shared resources. (Figure: processes P1, P2, P3, …, Pn competing for access to a shared resource.)

  • Distributed Mutual Exclusion (2): Shared resources are accessed inside a critical section, delimited by two operations: enter(), which blocks until entry to the critical section is granted, and exit(), which leaves the critical section. Mutual exclusion is useful when the server managing the resources does not use locks.

  • Distributed Mutual Exclusion (3): Distributed mutual exclusion uses no shared variables, only message passing. Properties:

    Safety: at most one process may execute in the critical section at a time. Liveness: requests to enter and exit the critical section eventually succeed. Ordering: if one request to enter the CS happened-before another, then entry to the CS is granted in that order.

  • Mutual Exclusion Algorithms: Basic hypotheses: the system is asynchronous, processes do not fail, and message transmission is reliable. Algorithms covered: the Central Server Algorithm, the Ring-Based Algorithm, Mutual Exclusion using Multicast and Logical Clocks, and Maekawa's Voting Algorithm, followed by a comparison.

  • Central Server Algorithm: A server manages a token that grants permission to enter the critical section. 1) A process requests the token from the server and waits; 2) on leaving the critical section it releases the token back to the server; 3) the server grants the token to the next waiting process. (Figure: the server grants the token among processes P1 to P4; one process holds the token while another waits.) A minimal sketch follows below.
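
    A minimal Python sketch of the server-side logic, assuming a single-threaded simulation in which requests and releases arrive as method calls; the process identifiers and the FIFO queue policy are illustrative, not taken from the slides.

        from collections import deque

        class TokenServer:
            """Central server: holds the token or records who currently has it."""
            def __init__(self):
                self.holder = None          # process currently holding the token
                self.queue = deque()        # processes waiting for the token

            def request(self, pid):
                """Process pid asks to enter the critical section.
                Returns True if the token is granted immediately."""
                if self.holder is None:
                    self.holder = pid       # grant the token at once
                    return True
                self.queue.append(pid)      # otherwise the caller waits
                return False

            def release(self, pid):
                """Process pid leaves the critical section and returns the token.
                Returns the next holder (or None if nobody is waiting)."""
                assert self.holder == pid
                self.holder = self.queue.popleft() if self.queue else None
                return self.holder

        # Example: P1 gets the token, P2 waits, then is granted on release.
        server = TokenServer()
        print(server.request("P1"))   # True  -> P1 enters the critical section
        print(server.request("P2"))   # False -> P2 waits
        print(server.release("P1"))   # P2    -> token granted to P2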

  • Ring-Based Algorithm (1): The processes form a group in a network (e.g., on an Ethernet) with no inherent ordering; for mutual exclusion they are arranged in a logical ring.

  • Ring-Based Algorithm (2): A token circulates around the ring P1, P2, …, Pn. Before calling enter(), a process waits until the token reaches it; it holds the token while it is in the critical section and forwards it to its neighbour on exit(). A small simulation follows below.
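
    A minimal Python sketch of the token circulating around a logical ring, assuming a round-based simulation in which some processes want the critical section; the function name and the wants_cs set are illustrative.

        def ring_token_simulation(n, wants_cs, rounds=10):
            """Pass the token around a ring of n processes (0..n-1).
            A process that holds the token and wants the CS uses it, then passes it on."""
            token_at = 0
            for _ in range(rounds):
                if token_at in wants_cs:
                    print(f"P{token_at} enters and leaves the critical section")
                    wants_cs.discard(token_at)
                token_at = (token_at + 1) % n   # forward the token to the neighbour

        # Example: P0 and P2 want the critical section; the token visits them in ring order.
        ring_token_simulation(4, {2, 0})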

  • Mutual Exclusion using Multicast and Logical Clocks (1): (Figure: P1 and P2 request entry to the critical section simultaneously, with timestamps 19 and 23; P3 is not interested and replies OK to both; the request with the smaller timestamp, 19, wins, so the other process defers its OK until it has left the critical section.)

  • Mutual Exclusion using Multicast and Logical Clocks (2): Main steps of the algorithm. Initialization: state := RELEASED. For process pi to enter the critical section: state := WANTED; T := the request's timestamp; multicast the request <T, pi> to all processes; wait until (number of replies received = N − 1); state := HELD.

  • Mutual Exclusion using Multicast and Logical Clocks (3): Main steps of the algorithm (cont'd). On receipt of a request <Ti, pi> at pj (i ≠ j): if (state = HELD) or (state = WANTED and (T, pj) < (Ti, pi)) then queue the request from pi without replying, else reply immediately to pi. To quit the critical section: state := RELEASED; reply to any queued requests. A sketch of the receive-side rule follows below.
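
    A minimal Python sketch of the receive-side rule above, assuming the caller supplies the local state, the local pending request as a (timestamp, pid) pair, the incoming request as the same kind of pair, and a queue list; all names are illustrative.

        RELEASED, WANTED, HELD = "RELEASED", "WANTED", "HELD"

        def on_request(state, own_request, incoming, queue):
            """Decide whether pj replies to an incoming request or defers it.
            own_request / incoming are (timestamp, pid) pairs; ties break on pid."""
            defer = state == HELD or (state == WANTED and own_request < incoming)
            if defer:
                queue.append(incoming)      # reply later, when leaving the CS
                return None
            return ("OK", incoming[1])      # reply immediately to the requester

        # Example: pj (id 2) is WANTED with timestamp 19; pi (id 1) requests with 23.
        queue = []
        print(on_request(WANTED, (19, 2), (23, 1), queue))  # None -> request deferred
        print(on_request(RELEASED, None, (23, 1), queue))   # ('OK', 1) -> immediate reply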

  • Maekawa's Voting Algorithm (1): Each process pi maintains a voting set Vi (i = 1, …, N), where Vi ⊆ {p1, …, pN}. A candidate process must collect sufficient votes to enter the critical section. The sets Vi are chosen such that, for all i, j: pi ∈ Vi; Vi ∩ Vj ≠ ∅ (any two voting sets have at least one common member); |Vi| = K (fairness); and each process pj is contained in M of the voting sets Vi.

  • Maekawa's Voting Algorithm (2): Main steps of the algorithm. Initialization: state := RELEASED; voted := FALSE. For pi to enter the critical section: state := WANTED; multicast a request to all processes in Vi − {pi}; wait until (number of replies received = K − 1); state := HELD.

  • Maekawa's Voting Algorithm (3): Main steps of the algorithm (cont'd). On receipt of a request from pi at pj (i ≠ j): if (state = HELD or voted = TRUE) then queue the request from pi without replying, else reply immediately to pi and set voted := TRUE. For pi to exit the critical section: state := RELEASED; multicast a release to all processes in Vi − {pi}.

  • Maekawa's Voting Algorithm (4): Main steps of the algorithm (cont'd). On receipt of a release from pi at pj (i ≠ j): if the queue of requests is non-empty then remove the head of the queue, say pk, send a reply to pk, and set voted := TRUE; else set voted := FALSE. A sketch of the voter's logic follows below.
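
    A minimal Python sketch of a single voter pj's behaviour in the steps above, assuming message sending is replaced by returned values; the Voter class and its method names are illustrative.

        from collections import deque

        class Voter:
            """State kept by pj for granting its single vote to requesting processes."""
            def __init__(self):
                self.voted = False
                self.queue = deque()

            def on_request(self, pi, state_held=False):
                """Grant the vote if it is free, otherwise queue the request.
                Returns the pid a reply is sent to, if any."""
                if state_held or self.voted:
                    self.queue.append(pi)
                    return None
                self.voted = True
                return pi                      # reply (vote) sent to pi

            def on_release(self):
                """The current holder released: pass the vote to the next queued request."""
                if self.queue:
                    pk = self.queue.popleft()
                    self.voted = True
                    return pk                  # reply (vote) sent to pk
                self.voted = False
                return None

        # Example: pj votes for P1, queues P2, then hands the vote to P2 on release.
        v = Voter()
        print(v.on_request("P1"))   # P1
        print(v.on_request("P2"))   # None (queued)
        print(v.on_release())       # P2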

  • M. E. Algorithms Comparison:

    Algorithm       | Messages per Enter()/Exit() | Delay before Enter() | Problems
    Central server  | 3                           | 2                    | Crash of the server
    Virtual ring    | 1 to ∞                      | 0 to N−1             | Crash of a process; token lost; ordering not satisfied
    Logical clocks  | 2(N−1)                      | 2(N−1)               | Crash of a process
    Maekawa's alg.  | 3√N                         | 2√N                  | Crash of a process that votes

  • Outline: Introduction; Distributed Mutual Exclusion; Election Algorithms; Group Communication; Consensus and Related Problems

  • Election Algorithms (1): Objective: elect one process pi from a group of processes p1, …, pN. Utility: electing a primary manager, a master process, a coordinator, or a central server. Each process pi maintains the identity of the elected process in the variable Electedi (NIL if it is not yet defined). Properties to satisfy for every pi, even if multiple elections have been started simultaneously: Safety: Electedi = NIL or Electedi = P, the chosen process; Liveness: every process pi participates and eventually sets Electedi ≠ NIL, or crashes.

  • Election Algorithms (2): Ring-Based Election Algorithm; Bully Algorithm; comparison of the election algorithms.

  • Ring-Based Election Algorithm (1): (Figure: a ring of processes with identifiers including 3, 5, 9, 16, and 25; process 5 starts the election, and the election message carries the largest identifier seen so far as it travels around the ring.)

  • Ring-Based Election Algorithm (2): Initialization: Participanti := FALSE; Electedi := NIL. When pi starts an election: Participanti := TRUE; send the message (Election, pi) to its neighbour. On receipt of a message (Elected, pj) at pi: if pi ≠ pj then Electedi := pj; Participanti := FALSE; send the message on to its neighbour.

  • Ring-Based Election Algorithm (3): On receipt of an election message (Election, pi) at pj: if pi > pj then forward the message to its neighbour and set Participantj := TRUE; else if pi < pj and Participantj = FALSE then substitute pj's identifier, send the message to its neighbour, and set Participantj := TRUE; else if pi = pj then pj is elected: Electedj := pj; Participantj := FALSE; send the message (Elected, pj) to its neighbour. A small simulation follows below.
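
    A minimal Python sketch simulating one complete ring election of this kind, assuming no failures and a single initiator; the function name, the ring order, and the way the winner is reported are illustrative.

        def ring_election(ids, starter_index):
            """Ring election on a list of process identifiers arranged as a ring.
            Returns the identifier of the elected process (the largest id)."""
            n = len(ids)
            participant = [False] * n
            # The starter sends an election message carrying its own id to its neighbour.
            participant[starter_index] = True
            msg = ids[starter_index]
            pos = (starter_index + 1) % n
            while True:
                pid = ids[pos]
                if msg > pid:
                    participant[pos] = True            # forward the message unchanged
                elif msg < pid and not participant[pos]:
                    msg = pid                          # substitute own (larger) identifier
                    participant[pos] = True
                elif msg == pid:
                    return pid                         # own id came back: elected
                # (with a single initiator, a participant never sees a smaller id here)
                pos = (pos + 1) % n

        # Example: ring 3 -> 9 -> 25 -> 5 -> 16 (and back to 3); the process with id 5 starts.
        print(ring_election([3, 9, 25, 5, 16], starter_index=3))   # 25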

  • Bully Algorithm (1): Characteristic: allows processes to crash during an election. Hypotheses: reliable message transmission; synchronous system. Failures are detected with the timeout T = 2·Dtrans + Dprocess, where Dtrans is the maximum message transmission delay and Dprocess is the maximum message processing delay.

  • Bully Algorithm (2): Hypotheses (cont'd): each process knows which processes have higher identifiers and can communicate with all of them. An election is started by a process when it notices, through timeouts, that the coordinator has failed. Three types of messages: Election: starts an election; OK: sent in response to an election message; Coordinator: announces the new coordinator.

  • Bully Algorithm (3): (Figure: the coordinator fails; process 5 detects the failure first and sends Election messages to the processes with higher identifiers; the highest-numbered surviving process announces itself as the new coordinator.)

  • Bully Algorithm (4): Initialization: Electedi := NIL. When pi starts the election: send the message (Election, pi) to every pj with pj > pi and wait for the (OK, pj) answers. If no (OK, pj) message arrives during the timeout T, then Electedi := pi and pi sends (Coordinator, pi) to every pj with pj < pi; else pi waits for a (Coordinator) message (if it does not arrive during another timeout T', it begins another election).

  • Bully Algorithm (5): On receipt of (Coordinator, pj) at pi: Electedi := pj. On receipt of (Election, pj) at pi: send (OK, pi) to pj and start an election unless one has already begun. When a process is started to replace a crashed process, it begins an election. A round-based sketch follows below.
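
    A minimal Python sketch of one bully election among live processes, assuming crashes are modelled simply as absence from the alive set and that timeouts always expire for crashed processes; the function is illustrative, not a message-level implementation.

        def bully_election(alive, starter):
            """Return the new coordinator's id after `starter` begins a bully election.
            `alive` is the set of ids of non-crashed processes; the highest id wins."""
            higher = [p for p in alive if p > starter]
            if not higher:
                return starter                  # nobody higher answers: starter is coordinator
            # Every higher live process answers OK and takes over the election;
            # recursing on the smallest of them mimics the cascade of elections.
            return bully_election(alive, min(higher))

        # Example: coordinator 8 has crashed; process 5 detects it and starts the election.
        print(bully_election(alive={1, 2, 3, 4, 5, 6, 7}, starter=5))   # 7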

  • Election Algorithms Comparison:

    Election algorithm | Number of messages | Problems
    Virtual ring       | 2N to 3N−1         | Does not tolerate faults
    Bully              | N−2 to O(N²)       | The system must be synchronous

  • Outline: Introduction; Distributed Mutual Exclusion; Election Algorithms; Group Communication; Consensus and Related Problems

  • Group Communication (1): Objective: each member of a group of processes must receive copies of the messages sent to the group. Group communication requires coordination and agreement: agreement on the set of messages that is received and on the delivery ordering. We study multicast communication among processes whose membership is known (static groups).

  • Group Communication (2): Groups may be closed or open. System: a collection of processes that can communicate reliably over one-to-one channels. Processes: members of groups; they may fail only by crashing.

  • Group Communication (3): Primitives: multicast(g, m) sends the message m to all members of group g; sender(m) is the unique identifier of the process that sent message m; group(m) is the unique identifier of the group to which m was sent.

  • Group Communication (4): Basic Multicast; Reliable Multicast; Ordered Multicast.

  • Basic Multicast: Primitives: B_multicast, B_deliver. Objective: guarantee that a correct process will eventually deliver the message, as long as the multicaster does not crash. Implementation: use reliable one-to-one communication. To B_multicast(g, m): for each process p ∈ g, send(p, m). On receive(m) at p: B_deliver(m) at p. Threads can be used to perform the send operations simultaneously, but the primitive then becomes unreliable: acknowledgments may be dropped. A sketch follows below.
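
    A minimal Python sketch of B_multicast built on a reliable one-to-one send, assuming processes are modelled as objects whose receive method stands in for the channel, and using threads to perform the sends concurrently; all names are illustrative.

        import threading

        class Process:
            """A toy process: receiving a message immediately B-delivers it."""
            def __init__(self, name):
                self.name = name
                self.delivered = []

            def receive(self, m):
                self.b_deliver(m)

            def b_deliver(self, m):
                self.delivered.append(m)

        def b_multicast(group, m):
            """Send m to every member of the group over one-to-one 'channels' (method calls)."""
            threads = [threading.Thread(target=p.receive, args=(m,)) for p in group]
            for t in threads:
                t.start()
            for t in threads:
                t.join()

        # Example: three processes all B-deliver the same message.
        g = [Process(f"P{i}") for i in range(3)]
        b_multicast(g, "hello")
        print([p.delivered for p in g])   # [['hello'], ['hello'], ['hello']]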

  • Reliable Multicast (1): Primitives: R_multicast, R_deliver. Validity: if a correct process multicasts a message m, then it will eventually deliver m. Agreement: if a correct process delivers the message m, then all other correct processes in group(m) will eventually deliver m.

  • Reliable Multicast (2): Implementation using B-multicast. Initialization: msgReceived := {}. R-multicast(g, m) by p: B-multicast(g, m) (with p ∈ g). On B-deliver(m) at q with g = group(m): if m ∉ msgReceived then msgReceived := msgReceived ∪ {m}; if (q ≠ p) then B-multicast(g, m); R-deliver(m). The algorithm is correct but inefficient: each message is sent |g| times to each process. A sketch follows below.
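
    A minimal Python sketch of R_multicast layered on the B_multicast idea, assuming synchronous in-process "sends" (method calls) instead of a network; the class and the message-id scheme are illustrative.

        class RProcess:
            """A process that re-multicasts each new message before R-delivering it."""
            def __init__(self, name):
                self.name = name
                self.group = []          # set later: all processes, including self
                self.received = set()    # ids of messages already seen
                self.delivered = []

            def b_multicast(self, msg_id, m):
                for p in self.group:
                    p.b_deliver(self.name, msg_id, m)

            def r_multicast(self, msg_id, m):
                self.b_multicast(msg_id, m)

            def b_deliver(self, sender, msg_id, m):
                if msg_id not in self.received:
                    self.received.add(msg_id)
                    if sender != self.name:
                        self.b_multicast(msg_id, m)   # relay so others get it even if the sender crashed
                    self.delivered.append(m)          # R-deliver

        # Example: P0 reliably multicasts one message to the group.
        procs = [RProcess(f"P{i}") for i in range(3)]
        for p in procs:
            p.group = procs
        procs[0].r_multicast("m1", "hello")
        print([p.delivered for p in procs])   # [['hello'], ['hello'], ['hello']]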

  • Ordered Multicast: Ordering categories: FIFO ordering, total ordering, causal ordering; hybrid orderings: total-causal, total-FIFO.

  • FIFO Ordering (1): If a correct process issues multicast(g, m1) and then multicast(g, m2), then every correct process that delivers m2 will deliver m1 before m2.

  • FIFO Ordering (2): Primitives: FO_multicast, FO_deliver. Implementation: use of sequence numbers; each process p keeps a count of the messages it has sent to the group and, for every other process q, the sequence number of the latest message it has delivered from q. FIFO ordering is achieved only under the assumption that groups are non-overlapping. A sketch follows below.
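
    A minimal Python sketch of FIFO delivery with per-sender sequence numbers and a hold-back queue, assuming messages arrive tagged with (sender, seq) and that seq starts at 1 for each sender; the class and method names are illustrative.

        from collections import defaultdict

        class FifoDelivery:
            """Deliver each sender's messages in the order they were sent."""
            def __init__(self):
                self.next_seq = defaultdict(lambda: 1)   # next expected seq per sender
                self.holdback = defaultdict(dict)        # sender -> {seq: message}
                self.delivered = []

            def on_receive(self, sender, seq, m):
                self.holdback[sender][seq] = m
                # FO-deliver every message that is now in order for this sender.
                while self.next_seq[sender] in self.holdback[sender]:
                    s = self.next_seq[sender]
                    self.delivered.append(self.holdback[sender].pop(s))
                    self.next_seq[sender] = s + 1

        # Example: message 2 from P1 arrives before message 1 but is delivered after it.
        f = FifoDelivery()
        f.on_receive("P1", 2, "second")
        f.on_receive("P1", 1, "first")
        print(f.delivered)   # ['first', 'second']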

  • Total Ordering (1): If a correct process delivers message m2 before it delivers m1, then any correct process that delivers m1 will deliver m2 before m1.

    Primitives: TO_multicast, TO_deliver

  • Total Ordering (2): Implementation: assign totally ordered identifiers to multicast messages; each process makes the same ordering decision based on these identifiers. Methods for assigning identifiers to messages: (a) a sequencer process, or (b) the processes collectively agree on the assignment of sequence numbers to messages in a distributed fashion.

  • Total Ordering (3): Sequencer process for group g: maintains a group-specific sequence number Sg. Initialization (sequencer): Sg := 0. On B-deliver(<m, i>) with g = group(m): B-multicast(g, <"order", i, Sg>); Sg := Sg + 1. Algorithm for group member p ∈ g: Initialization: Rg := 0.

  • Total Ordering (4): TO-multicast(g, m) by p: B-multicast(g ∪ {Sequencer(g)}, <m, i>). On B-deliver(<m, i>) by p, with g = group(m): place <m, i> in the hold-back queue. On B-deliver(morder = <"order", i, S>) by p, with g = group(morder): wait until (<m, i> is in the hold-back queue and S = Rg); TO-deliver(m); Rg := S + 1. A sketch follows below.
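
    A minimal Python sketch of sequencer-based total ordering, assuming a single in-process "sequencer" and members that hold messages back until the matching order message arrives; all names are illustrative.

        class Sequencer:
            """Assigns consecutive sequence numbers to message identifiers."""
            def __init__(self):
                self.s = 0

            def order(self, msg_id):
                n, self.s = self.s, self.s + 1
                return (msg_id, n)                 # "order" message: (which message, its number)

        class Member:
            """Holds messages back until they can be TO-delivered in sequence order."""
            def __init__(self):
                self.r = 0                         # next expected sequence number
                self.holdback = {}                 # msg_id -> message body
                self.orders = {}                   # sequence number -> msg_id
                self.delivered = []

            def on_message(self, msg_id, m):
                self.holdback[msg_id] = m
                self._flush()

            def on_order(self, msg_id, n):
                self.orders[n] = msg_id
                self._flush()

            def _flush(self):
                while self.r in self.orders and self.orders[self.r] in self.holdback:
                    self.delivered.append(self.holdback.pop(self.orders.pop(self.r)))
                    self.r += 1

        # Example: the order message for "b" is applied only after "a" has been delivered.
        seq, m1 = Sequencer(), Member()
        m1.on_message("a", "first"); m1.on_message("b", "second")
        oa, ob = seq.order("a"), seq.order("b")
        m1.on_order(*ob); m1.on_order(*oa)
        print(m1.delivered)   # ['first', 'second']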

  • Total Ordering (5): Alternatively, the processes collectively agree on the assignment of sequence numbers to messages in a distributed fashion. Each process p keeps track of the largest sequence number it has proposed and the largest agreed sequence number it has observed so far.

  • Total Ordering (6): (Figure: the three phases among processes P1 to P5: 1) the message is transmitted to the group; 2) each member proposes a sequence number; 3) the sender assigns the agreed sequence number to the message.)

  • Causal Ordering (1): If multicast(g, m1) → multicast(g, m3), where → is the happened-before relation, then any correct process that delivers m3 will deliver m1 before m3.

  • Causal Ordering (2): Primitives: CO_multicast, CO_deliver. Algorithm for group member pi: based on vector timestamps; each process maintains a vector counting the messages it has delivered from every group member. Initialization: every entry of pi's vector timestamp is set to 0.

  • Causal Ordering (3): CO-multicast(g, m): pi increments its own entry of the vector timestamp and B-multicasts <m, timestamp> to g. On B-deliver at pj: the message is held back until all causally earlier messages have been delivered, then CO-deliver(m) is performed and pj updates its vector. A sketch follows below.
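
    A minimal Python sketch of causal delivery using the vector-timestamp scheme sketched above, assuming N processes indexed 0..N−1 and that each multicast carries (sender, vector, message); the class and method names are illustrative.

        class CausalMember:
            """Delivers multicasts only after all causally earlier multicasts."""
            def __init__(self, pid, n):
                self.pid = pid
                self.v = [0] * n              # v[j] = number of messages delivered from j
                self.holdback = []
                self.delivered = []

            def co_multicast(self, m):
                self.v[self.pid] += 1
                return (self.pid, list(self.v), m)    # message as it would be B-multicast

            def on_receive(self, msg):
                self.holdback.append(msg)
                self._flush()

            def _flush(self):
                progress = True
                while progress:
                    progress = False
                    for msg in list(self.holdback):
                        j, vj, m = msg
                        ready = vj[j] == self.v[j] + 1 and all(
                            vj[k] <= self.v[k] for k in range(len(self.v)) if k != j)
                        if ready:
                            self.delivered.append(m)
                            self.v[j] = vj[j]
                            self.holdback.remove(msg)
                            progress = True

        # Example: P2 receives P0's second message before its first but delivers them in causal order.
        p0, p2 = CausalMember(0, 3), CausalMember(2, 3)
        m1 = p0.co_multicast("m1")        # timestamp [1, 0, 0]
        m2 = p0.co_multicast("m2")        # timestamp [2, 0, 0], causally after m1
        p2.on_receive(m2); p2.on_receive(m1)
        print(p2.delivered)               # ['m1', 'm2']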

  • Outline: Introduction; Distributed Mutual Exclusion; Election Algorithms; Group Communication; Consensus and Related Problems

  • Consensus introduction: Reaching agreement in a distributed manner arises in many forms: mutual exclusion (who may enter the critical region), totally ordered multicast (the order of message delivery), the Byzantine generals (attack or retreat?). The consensus problem: agree on a value after one or more of the processes has proposed what the value should be.

  • Consensus (1): Objective: the processes must agree on a value after one or more of them has proposed what that value should be. Hypotheses: reliable communication, but processes may fail. Consensus problem: every process pi begins in the undecided state and proposes a value vi ∈ D (i = 1, …, N); the processes communicate with one another, exchanging values; each process then sets the value of a decision variable di.

  • Consensus (2): (Figure: P1 proposes v1 = proceed and P2 proposes v2 = proceed; P3 proposes v3 = abort but crashes; the consensus algorithm leads P1 and P2 to decide d1 = d2 = proceed.)

  • Consensus (3): Properties to satisfy: Termination: eventually each correct process sets its decision variable. Agreement: the decision value of all correct processes is the same: if pi and pj are correct then di = dj (i, j = 1, …, N). Integrity: if the correct processes all proposed the same value, then any correct process in the decided state has chosen that value.

  • Consensus (5): The interactive consistency problem is a variant of the consensus problem. Objective: the correct processes must agree on a vector of values, one for each process. Properties to satisfy: Termination: eventually each correct process sets its decision variable. Agreement: the decision vector of all correct processes is the same. Integrity: if pi is correct, then all correct processes decide on vi as the i-th component of their vector.

  • Consensus (6): The Byzantine generals problem is a variant of the consensus problem. Objective: a distinguished process (the commander) supplies a value that the others must agree upon. Properties to satisfy: Termination: eventually each correct process sets its decision variable. Agreement: the decision value of all correct processes is the same: if pi and pj are correct then di = dj (i, j = 1, …, N). Integrity: if the commander is correct, then all correct processes decide on the value that the commander proposed.

  • Consensus (7): Byzantine agreement in a synchronous system. Example: a system composed of three processes that must agree on a binary value (0 or 1).

  • Consensus (8): To tolerate m faulty processes, n ≥ 3m + 1 is required, where n denotes the total number of processes. Interactive Consistency Algorithm ICA(m), where m denotes the maximal number of processes that may fail simultaneously: the sender is the process upon whose value all nodes must agree; the receivers are all the other processes. If a process does not send a message, the receiving process uses a default value. The ICA algorithm requires m + 1 rounds to achieve consensus.

  • Consensus (9): Interactive Consistency Algorithm.

    Algorithm ICA(0): 1. The sender sends its value to all the other n−1 processes. 2. Each process uses the value received from the sender, or the default value if no message is received.

    Algorithm ICA(m), m > 0: 1. The sender sends its value to all the other n−1 processes. 2. Let vi be the value received by process i from the sender, or the default value if no message is received; process i then acts as the sender in ICA(m−1), sending vi to the n−2 other processes. 3. For each i, let vj be the value received by process i from process j (j ≠ i) in step 2; process i uses the value Choice(v1, …, vn−1), e.g., the majority value. A recursive sketch follows below.
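
    A minimal Python sketch simulating ICA(m) with oral messages, assuming faulty processes send arbitrary (alternating) values and that Choice is a majority vote; the process identifiers, the faulty set, and the tie-breaking default are illustrative.

        from collections import Counter

        def choice(values, default=0):
            """Majority value; fall back to the default on a tie."""
            (v1, c1), *rest = Counter(values).most_common()
            return default if rest and rest[0][1] == c1 else v1

        def ica(m, sender, receivers, value, faulty):
            """Return {receiver: value it uses for sender's value} after ICA(m)."""
            # Step 1: the sender sends its value; a faulty sender sends conflicting values.
            received = {p: (i % 2 if sender in faulty else value)
                        for i, p in enumerate(receivers)}
            if m == 0:
                return received                       # step 2 of ICA(0): use what was received
            # Step 2 of ICA(m): each receiver relays its value via ICA(m-1).
            relayed = {p: ica(m - 1, p, [q for q in receivers if q != p], received[p], faulty)
                       for p in receivers}
            # Step 3: each receiver applies Choice to its own value and the relayed ones.
            return {p: choice([received[p]] + [relayed[q][p] for q in receivers if q != p])
                    for p in receivers}

        # Example: n = 4, m = 1 faulty receiver; the correct receivers agree on the sender's value 1.
        print(ica(1, "S", ["A", "B", "C"], value=1, faulty={"C"}))   # {'A': 1, 'B': 1, 'C': 1}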

  • References

    Mourad Elhadef, PhD: lecture presentation.

    Coulouris G. et al.: Distributed Systems: Concepts and Design, Pearson, 2001.

    Other presentations.

    Wikipedia: www.wikipedia.com

  • Speaker notes

    Key terms: mutual exclusion; consensus; prevent interference; ensure consistency. Critical section: the region of shared resources.

    Properties: Safety: at any moment at most one process runs inside the critical section. Liveness (no indefinite waiting): requests to enter and leave the critical section eventually succeed. Ordering: requests are handled in happened-before order.

    Central server: requests to enter and leave the critical section are sent to a lock server; a process may enter the critical section once it receives the token; on leaving the critical section the token is returned to the server.

    Ring: the processes are arranged in a ring structure and a token is passed around the ring; before entering the critical section a process waits for the token to arrive and holds it until it leaves the critical section. Properties: average delay of N/2 hops; the token consumes bandwidth; the ring breaks if a node or a channel fails.

    Multicast with logical clocks: a process is in one of three states: Released, Wanted, Held. A process that wants to enter the critical section multicasts a request and waits for replies from all other processes. If Released, it replies to every request to enter the critical section; if Held, it does not reply until it leaves the critical section; if Wanted, it compares timestamps and replies only if the incoming timestamp is smaller. Multicast leads to high communication cost and is prone to failure when faults occur.