Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such...

68
Fault Tolerance Amir H. Payberah [email protected] Amirkabir University of Technology (Tehran Polytechnic) Based on slides by Maarten Van Steen Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 1 / 46

Transcript of Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such...

Page 1: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Fault Tolerance

Amir H. [email protected]

Amirkabir University of Technology(Tehran Polytechnic)

Based on slides by Maarten Van Steen

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 1 / 46

Page 2: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

What is the problem?

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 2 / 46

Page 3: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Dependability

I A component provides a services to a clients.

I To provide services, a component may require the services fromother components.

I A component C depends on C ∗ if the correctness of C ’s behaviordepends on the correctness of C ∗’s behavior.

I Components are processes or channels.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 3 / 46

Page 4: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Dependability

I A component provides a services to a clients.

I To provide services, a component may require the services fromother components.

I A component C depends on C ∗ if the correctness of C ’s behaviordepends on the correctness of C ∗’s behavior.

I Components are processes or channels.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 3 / 46

Page 5: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Dependability

I A component provides a services to a clients.

I To provide services, a component may require the services fromother components.

I A component C depends on C ∗ if the correctness of C ’s behaviordepends on the correctness of C ∗’s behavior.

I Components are processes or channels.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 3 / 46

Page 6: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Dependability

I A component provides a services to a clients.

I To provide services, a component may require the services fromother components.

I A component C depends on C ∗ if the correctness of C ’s behaviordepends on the correctness of C ∗’s behavior.

I Components are processes or channels.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 3 / 46

Page 7: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Terminology - Subtle Differences

I Failure: when a component is not living up to its specifications, afailure occurs.

I Error: that part of a component’s state that can lead to a failure.

I Fault: the cause of an error.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 4 / 46

Page 8: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Terminology - What To Do About Faults

I Fault prevention: prevent the occurrence of a fault.

I Fault tolerance: build a component such that it can mask the pres-ence of faults.

I Fault removal: reduce presence, number, seriousness of faults.

I Fault forecasting: estimate present number, future incidence, andconsequences of faults.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 5 / 46

Page 9: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Terminology - Failure Models

I Crash failures: component halts, but behaves correctly before halt-ing.

I Omission failures: component fails to respond.

I Timing failures: output is correct, but lies outside a specified real-time interval.

I Response failures: output is incorrect, e.g., wrong value is produced.

I Arbitrary failures: component produces arbitrary output and be sub-ject to arbitrary timing failures.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 6 / 46

Page 10: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Crash Failures (1/2)

I Clients cannot distinguish between a crashed component and onethat is just a bit slow.

I Consider a server from which a client is expecting output:• Is the server perhaps exhibiting timing or omission failures?• Is the channel between client and server faulty?

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 7 / 46

Page 11: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Crash Failures (2/2)

I Assumptions we can make:

• Fail-silent: the component exhibits omission or crash failures; clientscannot tell what went wrong.

• Fail-stop: the component exhibits crash failures, but its failure canbe detected.

• Fail-safe: the component exhibits arbitrary, but they can’t do anyharm.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 8 / 46

Page 12: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Process Resilience

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 9 / 46

Page 13: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Process Resilience (1/2)

I Protect yourself against faulty processes by replicating and distribut-ing computations in a group. implement.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 10 / 46

Page 14: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Process Resilience (2/2)

I Flat groups: good for fault tolerance as information exchange imme-diately occurs with all group members; however, may impose moreoverhead as control is completely distributed.

I Hierarchical groups: all communication through a single coordina-tor ⇒ not really fault tolerant and scalable, but relatively easy toimplement.

(a) (b)

Flat group Hierarchical group Coordinator

Worker

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 11 / 46

Page 15: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Process Resilience (2/2)

I Flat groups: good for fault tolerance as information exchange imme-diately occurs with all group members; however, may impose moreoverhead as control is completely distributed.

I Hierarchical groups: all communication through a single coordina-tor ⇒ not really fault tolerant and scalable, but relatively easy toimplement.

(a) (b)

Flat group Hierarchical group Coordinator

Worker

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 11 / 46

Page 16: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Groups and Failure Masking (1/4)

I K-fault tolerant group: when a group can mask any k concurrentmember failures (k is called degree of fault tolerance).

I Assumption: all members are identical, and process all input in thesame order.

I How large does a k-fault tolerant group need to be?

• In crash failure semantics ⇒ a total of k + 1 members are neededto survive k member failures.

• What about in arbitrary failure semantics? the group output definedby voting.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 12 / 46

Page 17: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Groups and Failure Masking (1/4)

I K-fault tolerant group: when a group can mask any k concurrentmember failures (k is called degree of fault tolerance).

I Assumption: all members are identical, and process all input in thesame order.

I How large does a k-fault tolerant group need to be?

• In crash failure semantics ⇒ a total of k + 1 members are neededto survive k member failures.

• What about in arbitrary failure semantics? the group output definedby voting.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 12 / 46

Page 18: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Groups and Failure Masking (1/4)

I K-fault tolerant group: when a group can mask any k concurrentmember failures (k is called degree of fault tolerance).

I Assumption: all members are identical, and process all input in thesame order.

I How large does a k-fault tolerant group need to be?

• In crash failure semantics ⇒ a total of k + 1 members are neededto survive k member failures.

• What about in arbitrary failure semantics? the group output definedby voting.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 12 / 46

Page 19: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Groups and Failure Masking (1/4)

I K-fault tolerant group: when a group can mask any k concurrentmember failures (k is called degree of fault tolerance).

I Assumption: all members are identical, and process all input in thesame order.

I How large does a k-fault tolerant group need to be?• In crash failure semantics ⇒ a total of k + 1 members are needed

to survive k member failures.

• What about in arbitrary failure semantics? the group output definedby voting.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 12 / 46

Page 20: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Groups and Failure Masking (1/4)

I K-fault tolerant group: when a group can mask any k concurrentmember failures (k is called degree of fault tolerance).

I Assumption: all members are identical, and process all input in thesame order.

I How large does a k-fault tolerant group need to be?• In crash failure semantics ⇒ a total of k + 1 members are needed

to survive k member failures.• What about in arbitrary failure semantics? the group output defined

by voting.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 12 / 46

Page 21: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Groups and Failure Masking (2/4)

I (a) What they send to each other.

I (b) What each one got from the other.

I (c) What each one got in the second step.

1

23

121

x

y

2

123

Got(Got(Got(

1, 2, x1, 2, y1, 2, 3

)))

1 Got 2 Got( (( (1, 1,a, d,

2, 2,b, e,

y xc f

) )) )

(a)

(b) (c)

Faulty process

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 13 / 46

Page 22: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Groups and Failure Masking (3/4)

I (a) What they send to each other.

I (b) What each one got from the other.

I (c) What each one got in the second step.

1 2

3 4

1

2

2 4

z

41 x

1

4

y

2

1234

Got(Got(Got(Got(

1, 2, x, 41, 2, y, 41, 2, 3, 41, 2, z, 4

))))

1 Got 2 Got 4 Got( ( (( ( (( ( (

1, 1, 1,a, e, 1,1, 1, i,

2, 2, 2,b, f, 2,2, 2, j,

y, x, x,c, g, y,z, z, k,

4 4 4d h 44 4 l

) ) )) ) )) ) )

(a)

(b) (c)

Faulty process

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 14 / 46

Page 23: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Groups and Failure Masking (4/4)

I In a system with K faulty processes, agreement can be achievedonly if 2K + 1 correctly functioning processes are present.

I Agreement is possible only if more than two-thirds of the processesare working properly: to achieve a majority vote among a group ofnonfaulty processes.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 15 / 46

Page 24: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Failure Detection

I We detect failures through timeout mechanisms.

I Setting timeouts properly is very difficult:• You cannot distinguish process failures from network failures.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 16 / 46

Page 25: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Reliable Communication

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 17 / 46

Page 26: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Reliable Communication

I Client-Server communication

I Group communication

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 18 / 46

Page 27: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Client-Server Communication

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 19 / 46

Page 28: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Reliable Communication

I Concentrated on process resilience (by means of process groups).

I What about reliable communication channels?

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 20 / 46

Page 29: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Reliable RPC (1/6)

I RPC communication - what can go wrong?1 Client cannot locate server2 Client request is lost3 Server crashes4 Server response is lost5 Client crashes

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 21 / 46

Page 30: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Reliable RPC (2/6)

I Problem: client cannot locate server.

I Solution: report back to client.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 22 / 46

Page 31: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Reliable RPC (3/6)

I Problem: client request is lost.

I Solution: resend message.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 23 / 46

Page 32: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Reliable RPC (4/6)

I Problem: server crashes.

I It is hard as you don’t know what it had already done.

Receive Receive ReceiveExecute Execute CrashReply Crash

REQ REQ REQ

REP No REP No REP

ServerServerServer

(a) (b) (c)

I We need to decide on what we expect from the server:• At-least-once-semantics: the server guarantees it will carry out an

operation at least once, no matter what.• At-most-once-semantics: the server guarantees it will carry out an

operation at most once.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 24 / 46

Page 33: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Reliable RPC (4/6)

I Problem: server crashes.

I It is hard as you don’t know what it had already done.

Receive Receive ReceiveExecute Execute CrashReply Crash

REQ REQ REQ

REP No REP No REP

ServerServerServer

(a) (b) (c)

I We need to decide on what we expect from the server:• At-least-once-semantics: the server guarantees it will carry out an

operation at least once, no matter what.• At-most-once-semantics: the server guarantees it will carry out an

operation at most once.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 24 / 46

Page 34: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Reliable RPC (5/6)

I Problem: server response is lost.

I Detecting lost replies can be hard, because it can also be that theserver had crashed. You don’t know whether the server has carriedout the operation.

I Solution: none, except that you can try to make your operationsidempotent: repeatable without any harm done if it happened to becarried out before.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 25 / 46

Page 35: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Reliable RPC (6/6)

I Problem: client crashes.

I The server is doing work and holding resources for nothing (calleddoing an orphan computation).

I Solution:• Orphan is killed (or rolled back) by client when it reboots.• Broadcast new epoch number when recovering ⇒ servers kill

orphans• Require computations to complete in a T time units. Old ones are

simply removed.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 26 / 46

Page 36: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Group Communication

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 27 / 46

Page 37: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Reliable Multicasting (1/2)

I We have a multicast channel c with two groups:• SND(c): the sender group of processes that submit messages to

channel c .• RCV (c): the receiver group of processes that can receive messages

from channel c .

I Simple reliability: if process P ∈ RCV (c) at the time message mwas submitted to c , and P does not leave RCV (c), m should bedelivered to P.

I Atomic multicast: how can we ensure that a message m submitted tochannel c is delivered to process P ∈ RCV (c) only if m is deliveredto all members of RCV (c).

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 28 / 46

Page 38: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Reliable Multicasting (1/2)

I We have a multicast channel c with two groups:• SND(c): the sender group of processes that submit messages to

channel c .• RCV (c): the receiver group of processes that can receive messages

from channel c .

I Simple reliability: if process P ∈ RCV (c) at the time message mwas submitted to c , and P does not leave RCV (c), m should bedelivered to P.

I Atomic multicast: how can we ensure that a message m submitted tochannel c is delivered to process P ∈ RCV (c) only if m is deliveredto all members of RCV (c).

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 28 / 46

Page 39: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Reliable Multicasting (1/2)

I We have a multicast channel c with two groups:• SND(c): the sender group of processes that submit messages to

channel c .• RCV (c): the receiver group of processes that can receive messages

from channel c .

I Simple reliability: if process P ∈ RCV (c) at the time message mwas submitted to c , and P does not leave RCV (c), m should bedelivered to P.

I Atomic multicast: how can we ensure that a message m submitted tochannel c is delivered to process P ∈ RCV (c) only if m is deliveredto all members of RCV (c).

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 28 / 46

Page 40: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Reliable Multicasting (2/2)

I Let the sender log messages submitted to channel c:• If P sends message m, m is stored in a history buffer.• Each receiver acknowledges the receipt of m, or requests retransmis-

sion at P when noticing message lost.• Sender P removes m from history buffer when everyone has acknowl-

edged receipt.

I Why doesn’t this scale?• If RCV (c) is large, P will be swamped with feedback (ACKs and

NACKs).• Sender P has to know all members of RCV (c).

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 29 / 46

Page 41: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Reliable Multicasting (2/2)

I Let the sender log messages submitted to channel c:• If P sends message m, m is stored in a history buffer.• Each receiver acknowledges the receipt of m, or requests retransmis-

sion at P when noticing message lost.• Sender P removes m from history buffer when everyone has acknowl-

edged receipt.

I Why doesn’t this scale?• If RCV (c) is large, P will be swamped with feedback (ACKs and

NACKs).• Sender P has to know all members of RCV (c).

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 29 / 46

Page 42: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Scalable Reliable Multicasting

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 30 / 46

Page 43: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Scalable Reliable Multicasting

I Feedback suppression

I Hierarchical solutions

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 31 / 46

Page 44: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Feedback Suppression (1/2)

I Basic idea: let a process P suppress its own feedback when it noticesanother process Q is already asking for a retransmission.

I Assumptions:• All receivers listen to a common feedback channel to which

feedback messages are submitted.• Process P schedules its own feedback message randomly, and

suppresses it when observing another feedback message.

NACK

NACK

NACK NACK NACKT=3 T=4 T=1 T=2

Sender Receiver Receiver Receiver Receiver

Network

Receivers suppress their feedbackSender receivesonly one NACK

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 32 / 46

Page 45: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Feedback Suppression (1/2)

I Basic idea: let a process P suppress its own feedback when it noticesanother process Q is already asking for a retransmission.

I Assumptions:• All receivers listen to a common feedback channel to which

feedback messages are submitted.• Process P schedules its own feedback message randomly, and

suppresses it when observing another feedback message.

NACK

NACK

NACK NACK NACKT=3 T=4 T=1 T=2

Sender Receiver Receiver Receiver Receiver

Network

Receivers suppress their feedbackSender receivesonly one NACK

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 32 / 46

Page 46: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Feedback Suppression (2/2)

I Why is the random schedule so important? random schedule neededto ensure that only one feedback message is eventually sent.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 33 / 46

Page 47: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Hierarchical Solutions (1/2)

I Basic idea: construct a hierarchical feedback channel in which allsubmitted messages are sent only to the root.

I Intermediate nodes aggregate feedback messages before passingthem on.

I Intermediate nodes can easily be used for retransmission purposes.

CC

S

(Long-haul) connectionSender

Coordinator

RootR

Receiver

Local-area network

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 34 / 46

Page 48: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Hierarchical Solutions (2/2)

I What’s the main problem with this solution? dynamically construct-ing the hierarchical feedback channel is the main problem.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 35 / 46

Page 49: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Atomic Multicast

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 36 / 46

Page 50: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Receiving vs. Delivering

I The logical organization of a distributed system to distinguish be-tween message receipt and message delivery.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 37 / 46

Page 51: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Atomic Multicast

I A message is delivered only to the nonfaulty members of the currentgroup.

I All members should agree on the current group membership: virtu-ally synchronous multicast.

I We consider views V ⊆ RCV (c) ∪ SND(c).

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 38 / 46

Page 52: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Virtual Synchrony

I Suppose the message m is multicast at the time its sender has groupview G .

I Assume that while the multicast is taking place, another processjoins or leaves the group.

• The group membership change is announced to all processes in G :by multicasting a message vc .

I We now have two multicast messages simultaneously in transit: mand vc .

I We need to guarantee is that m is either delivered to all processesin G before each one of them is delivered message vc , or m is notdelivered at all.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 39 / 46

Page 53: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Virtual Synchrony

I Suppose the message m is multicast at the time its sender has groupview G .

I Assume that while the multicast is taking place, another processjoins or leaves the group.

• The group membership change is announced to all processes in G :by multicasting a message vc .

I We now have two multicast messages simultaneously in transit: mand vc .

I We need to guarantee is that m is either delivered to all processesin G before each one of them is delivered message vc , or m is notdelivered at all.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 39 / 46

Page 54: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Virtual Synchrony

I Suppose the message m is multicast at the time its sender has groupview G .

I Assume that while the multicast is taking place, another processjoins or leaves the group.

• The group membership change is announced to all processes in G :by multicasting a message vc .

I We now have two multicast messages simultaneously in transit: mand vc .

I We need to guarantee is that m is either delivered to all processesin G before each one of them is delivered message vc , or m is notdelivered at all.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 39 / 46

Page 55: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Virtual Synchrony

I Suppose the message m is multicast at the time its sender has groupview G .

I Assume that while the multicast is taking place, another processjoins or leaves the group.

• The group membership change is announced to all processes in G :by multicasting a message vc .

I We now have two multicast messages simultaneously in transit: mand vc .

I We need to guarantee is that m is either delivered to all processesin G before each one of them is delivered message vc , or m is notdelivered at all.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 39 / 46

Page 56: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Virtual Synchrony

I Suppose the message m is multicast at the time its sender has groupview G .

I Assume that while the multicast is taking place, another processjoins or leaves the group.

• The group membership change is announced to all processes in G :by multicasting a message vc .

I We now have two multicast messages simultaneously in transit: mand vc .

I We need to guarantee is that m is either delivered to all processesin G before each one of them is delivered message vc , or m is notdelivered at all.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 39 / 46

Page 57: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Virtual Synchrony - Implementation (1/3)

I How to guarantee that all messages sent to view G are deliveredto all nonfaulty processes in G before the next group membershipchange takes place.

I Make sure that each process in G has received all messages thatwere sent to G .

I Because the sender of a message m to G may have failed beforecompleting its multicast, there may be processes in G that will neverreceive m.

• Because the sender has crashed, these processes should get m fromsomewhere else.

P1 joins the group P3 crashes P3 rejoins

G = {P1,P2,P3,P4} G = {P1,P2,P4} G = {P1,P2,P3,P4}

Partial multicastfrom P3 is discarded

P1

P2

P3

P4

Time

Reliable multicast by multiplepoint-to-point messages

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 40 / 46

Page 58: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Virtual Synchrony - Implementation (1/3)

I How to guarantee that all messages sent to view G are deliveredto all nonfaulty processes in G before the next group membershipchange takes place.

I Make sure that each process in G has received all messages thatwere sent to G .

I Because the sender of a message m to G may have failed beforecompleting its multicast, there may be processes in G that will neverreceive m.

• Because the sender has crashed, these processes should get m fromsomewhere else.

P1 joins the group P3 crashes P3 rejoins

G = {P1,P2,P3,P4} G = {P1,P2,P4} G = {P1,P2,P3,P4}

Partial multicastfrom P3 is discarded

P1

P2

P3

P4

Time

Reliable multicast by multiplepoint-to-point messages

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 40 / 46

Page 59: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Virtual Synchrony - Implementation (1/3)

I How to guarantee that all messages sent to view G are deliveredto all nonfaulty processes in G before the next group membershipchange takes place.

I Make sure that each process in G has received all messages thatwere sent to G .

I Because the sender of a message m to G may have failed beforecompleting its multicast, there may be processes in G that will neverreceive m.

• Because the sender has crashed, these processes should get m fromsomewhere else.

P1 joins the group P3 crashes P3 rejoins

G = {P1,P2,P3,P4} G = {P1,P2,P4} G = {P1,P2,P3,P4}

Partial multicastfrom P3 is discarded

P1

P2

P3

P4

Time

Reliable multicast by multiplepoint-to-point messages

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 40 / 46

Page 60: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Virtual Synchrony - Implementation (1/3)

I How to guarantee that all messages sent to view G are deliveredto all nonfaulty processes in G before the next group membershipchange takes place.

I Make sure that each process in G has received all messages thatwere sent to G .

I Because the sender of a message m to G may have failed beforecompleting its multicast, there may be processes in G that will neverreceive m.

• Because the sender has crashed, these processes should get m fromsomewhere else.

P1 joins the group P3 crashes P3 rejoins

G = {P1,P2,P3,P4} G = {P1,P2,P4} G = {P1,P2,P3,P4}

Partial multicastfrom P3 is discarded

P1

P2

P3

P4

Time

Reliable multicast by multiplepoint-to-point messages

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 40 / 46

Page 61: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Virtual Synchrony - Implementation (2/3)

I Solution: let every process in G keep m until it knows for sure thatall members in G have received it.

I If m has been received by all members in G , m is said to be stable.

I Only stable messages are allowed to be delivered.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 41 / 46

Page 62: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Virtual Synchrony - Implementation (3/3)

I (a) 4 notices that 7 has crashed and sends a view change.

I (b) 6 sends out all its unstable messages, followed by a flush mes-sage.

I (c) 6 installs the new view when it has received a flush messagefrom everyone else.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 42 / 46

Page 63: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Virtual Synchrony - Implementation (3/3)

I (a) 4 notices that 7 has crashed and sends a view change.

I (b) 6 sends out all its unstable messages, followed by a flush mes-sage.

I (c) 6 installs the new view when it has received a flush messagefrom everyone else.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 42 / 46

Page 64: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Virtual Synchrony - Implementation (3/3)

I (a) 4 notices that 7 has crashed and sends a view change.

I (b) 6 sends out all its unstable messages, followed by a flush mes-sage.

I (c) 6 installs the new view when it has received a flush messagefrom everyone else.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 42 / 46

Page 65: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Summary

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 43 / 46

Page 66: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Summary

I Failure

I Failure models: crash, omission, timing, response, arbitrary

I Crash failure: fail-silent, fail-stop, fail-safe

I Process resilience: flat group, hierarchical group

I K-fault tolerant group: more than two-thirds of the processes workproperly

I Reliable communication: client-server, group

I Scalable reliable multicast: feedback suppression, hierarchical

I Atomic broadcast: virtual synchrony

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 44 / 46

Page 67: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Reading

I Chapter 9 of the Distributed Systems: Principles and Paradigms.

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 45 / 46

Page 68: Fault Tolerance - Amir H. Payberah · 2021. 3. 15. · I Fault tolerance:build a component such that it canmaskthe pres-ence offaults. I Fault removal:reducepresence, number, seriousness

Questions?

Amir H. Payberah (Tehran Polytechnic) Fault Tolerance 1394/2/1 46 / 46