Distributed Algorithms Time, clocks and the ordering of...

Distributed AlgorithmsTime, clocks and the ordering of events

Alberto Montresor

University of Trento, Italy

2017/05/19

This work is licensed under a Creative CommonsAttribution-ShareAlike 4.0 International License.

references

O. Babaoglu and K. Marzullo.Consistent global states of distributed systems: Fundamentalconcepts and mechanisms.In S. Mullender, editor, Distributed Systems (2nd ed.).Addison-Wesley, 1993.http://www.disi.unitn.it/~montreso/ds/papers/

ConsistentGlobalStates.pdf.

Contents

1 Modeling DistributedExecutions

HistoriesHappen-beforeGlobal states and cuts

2 Global predicate evaluationProblem definitionExample: deadlock detection

3 Real clocks vs logical clocksLogical clocksScalar logical clocksVector logical clocks

4 Passive monitoringPassive monitoring, v.1Passive monitoring, v.2Passive monitoring, v.3

5 Proactive monitoringSnapshot protocol, v.1Snapshot protocol, v.2Snapshot protocol, v.3

6 Stable predicatesDeadlock detectionDeadlock detection throughactive monitoringDeadlock detection throughpassive monitoring

7 Non-stable predicates

Contents

Modeling Distributed Executions

Distributed Execution

Definition (Distributed algorithm)

A distributed algorithm is a collection of distributed automata, one perprocess

Definition (Distributed execution)

The execution of a distributed algorithm is a sequence of events executedby the processes

Partial execution: a finite sequence of eventsInfinite execution: a infinite sequence of events

Possible events

send(m, p): sends a message m to process preceive(m): receives a message mlocal events that change the local state

Alberto Montresor (UniTN) DS - Time,clocks,events 2017/05/19 1 / 96

Modeling Distributed Executions Histories

Histories

Definition (Local history)

The local history of process pi is a (possibly infinite) sequence of eventshi = e0i e

2i . . . e

mii (canonical enumeration)

Definition (Partial history)

The partial history up to event eki is denoted hki and is given by theprefix of the first k events of hi

Modeling Distributed Executions Histories

Histories

Local histories do not specify any relative timing between eventsbelonging to different processes.

We need a notion of ordering between events, that could help us indeciding whether:

I one event occurs before anotherI they are actually concurrent

Modeling Distributed Executions Happen-before

Happen-Before

Definition (Happen-before)

We say that an event e happens-before an event e′, and write e→ e′, ifone of the following three cases is true:

1 ∃pi ∈ Π : e = eri , e′ = esi , r < s(e and e′ are executed by the same process, e before e′)

2 e = send(m, ∗) ∧ e′ = receive(m)(e is the send event of a message m and e′ is the correspondingreceive event)

3 ∃e′′ : e→ e′′ → e′)(in other words, → is transitive)

Space-Time Diagram of a Distributed Computation

e1 1 e2

1 e3 1 e4

1 e5 1 e6

e1 2 e2

2 e3 2

e1 3 e2

3 e3 3 e4

3 e5 3 e6

Meaning of Happen-Before

If e→ e′, this means that we can find a series of events e1e2e3 . . . en,where e1 = e and en = e′, such that for each pair of consecutive eventsei and ei+1:

1 ei and ei+1 are executed on the same process, in this order

2 ei = send(m, ∗) and ei+1 = receive(m)

Notes:

happen-before captures the concept of potential causal ordering

happen-before captures a flow of data between two events.

Two events e, e′ that are not related by the happen-before relation(e 6→ e′ ∧ e′ 6→ e) are concurrent, and we write e||e′.

Homework

Prove or disprove that the || relation is transitive.

Reality check

Forse non tutti sanno che...

The multi-threaded memory model of Java is based on thehappen-before relation, where communication between threads is basedon the acquisition and release of locks.

http://download.oracle.com/javase/6/docs/api/java/util/

concurrent/package-summary.html

Modeling Distributed Executions Global states and cuts

Global States

Definition (Local state)

The local state of process pi after the execution of event eki isdenoted σkiThe local state contains all data items accessible by that process

Local state is completely private to the process

σ0i is the initial state of process pi

Definition (Global state)

The global state of a distributed computation is an n-tuple of localstates Σ = (σ1, . . . , σn), one for each process.

Definition (Cut)

A cut of a distributed computation is the union of n partial histories,one for each process:

C = hc11 ∪ h

c22 ∪ . . . ∪ h

A cut may be described by a tuple (c1, c2, . . . , cn), identifying thefrontier of the cut, i.e. the set of last events, one per process.

Each cut (c1, . . . , cn) has a corresponding global state(σc1

1 , σc22 , . . . , σ

cnn ).

e1 1 e2

1 e3 1 e4

1 e5 1 e6

e1 2 e2

2 e3 2

e1 3 e2

3 e3 3 e4

3 e5 3 e6

3C C’

Consistent cut

Consider cuts C ′ and C in the previous figure.

Is it possible that cut C correspond to a “real” state in theexecution of a distributed algorithm?

Is it possible that cut C ′ correspond to a “real” state in theexecution of a distributed algorithm?

Consistent cut

Definition (Consistent cut)

A cut C is consistent, if for all events e and e′,

(e ∈ C) ∧ (e′ → e)⇒ e′ ∈ C

Definition (Consistent global state)

A global state is consistent if the corresponding cut is consistent.

In other words:

A consistent cut is left-closed w.r.t. the happen-before relation

All messages that have been received must have been sent before

Consistent cut

In the previous figures, C is consistent and C ′ is not.

In the space-time diagram, a cut C is consistent if all the arrowsstart on the left of the cut and finish on the right of the cut.

Consistent cuts represent the concept of scalar time in distributedcomputation: it is possible to distinguish between a “before” andan “after”.

Predicates can be evaluated in consistent cuts, because theycorrespond to potential global states that could have taken placeduring an execution.

Contents

Global predicate evaluation Problem definition

Introduction

Definition (Global Predicate Evaluation)

The problem of detecting whether the global state of a distributedsystem satisfies some predicate Φ.

Motivation

Many problems in distributed systems require a reaction when theglobal state of the system satisfies some condition.

I Monitoring: Notify an administrator in case of failuresI Debugging: Verify whether an invariant is respected or notI Deadlock detection: can the computation continue?I Garbage collection: like Java, but distributed

Thus, the ability to construct a global state and evaluate apredicate over it is a core problem in distributed computing.

Examples

Garbagecollection

Deadlock

wait-for

Message

objectreference

Why GPE is difficult

A global state obtained through remote observations could be

obsolete: represent an old state of the system.Solution: build the global state more frequently

inconsistent: capture a global state that could never have occurredin realitySolution: build only consistent global states

incomplete: not “capture” every moment of the systemSolution: build all possible consistent global states

Global predicate evaluation Example: deadlock detection

Space-Time Diagram of a Distributed Computation

e1 1 e2

1 e3 1 e4

1 e5 1 e6

e1 2 e2

2 e3 2

e1 3 e2

3 e3 3 e4

3 e5 3 e6

reqreq

respreq

Example – Deadlock detection on a multi-tier system

Processes in the previous figures use RPCs:

Client sends a request for method execution; blocks.

Server receives request.

Server executes method; may invoke methods on other servers,acting as a client.

Server sends reply to client

Clients receives reply; unblocks.

Such a system can deadlock, as RPCs are blocking. It is important tobe able to detect when a deadlock occurs.

Runs and consistent runs

Definition (Run)

A run of global computation is a total ordering R that includes all theevents in the local histories and that is consistent with each of them.

In other words, the events of pi appear in R in the same order inwhich they appear in hi.

A run corresponds to the notion that events in a distributedcomputation actually occur in a total order

A distributed computation may correspond to many runs

Definition (Consistent run)

A run R is said to be consistent if for all events e and e′, e→ e′ impliesthat e appears before e′ in R.

Runs and consistent runs

e11 e21 e

e11 e12 e

e1 1 e2

1 e3 1 e4

1 e5 1 e6

e1 2 e2

2 e3 2

e1 3 e2

3 e3 3 e4

3 e5 3 e6

reqreq

respreq

Monitoring Distributed Computations

Assumptions:I There is a single process p0 called monitor which is responsible for

evaluating ΦI We assume that the monitor p0 is distinct from the observed

processes p1 . . . pnI Events executed on behalf of monitoring do not alter canonical

enumeration of “real” events

In general, observed processes send notifications about local eventsto the monitor, which builds an observation.

Observations

Definition (Observation)

The sequence of events corresponding to the order in which notificationmessages arrive at the monitor is called an observation.

Given the asynchronous nature of our distributed system, anypermutation of a run R is a possible observation of it.

Definition (Consistent observation)

An observation is consistent if it corresponds to a consistent run.

Contents

Real clocks vs logical clocks

How to obtain consistent observations

The happen-before relation captures the concept of potentialcausality

In the “day-to-day” life, causality/concurrency are tracked usingphysical time

I We use loosely synchronized watches;I Example: I have withdrawn money from an ATM in Trento at 13.00

on 17th May 2006, so I can prove that I’ve not withdrawn money onthe same day at 13.20 in Paris

In distributed computing systems:I the rate of occurrence of events is several magnitudes higherI event execution time is several magnitudes smaller

If physical clocks are not precisely synchronized, thecausality/concurrency relations between events may not beaccurately captured

Real clocks vs logical clocks Logical clocks

Logical clocks

Instead of using physical clocks, which are impossible to synchronize,we use logical clocks.

Every process has a logical clock that is advanced using a set ofrules

Its value is not required to have any particular relationship to anyphysical clock.

Every event is assigned a timestamp, taken from the logical clock

The causality relation between events can be generally inferredfrom their timestamps

Real clocks vs logical clocks Logical clocks

Logical clocks

Definition (Logical clock)

A logical clock LC is a function that maps an event e from the historyH of a distributed system execution to an element of a time domain T :

LC : H → T

Definition (Clock Condition)

e→ e′ ⇒ LC (e) < LC (e′)

Definition (Strong Clock Condition)

e→ e′ ⇔ LC (e) < LC (e′)

Real clocks vs logical clocks Scalar logical clocks

Scalar logical clocks

Definition (Scalar logical clocks)

Lamport’s scalar logical clock is a monotonically increasingsoftware counter

Each process pi keeps its own logical clock LC i

The timestamp of event e executed by process pi is denoted LCi(e)

Messages carry the timestamp of their send event

Logical clocks are initialized to 0

Update rule

Whenever an event e is executed by process pi, its local logical clock isupdated as follows:

LC i =

{LCi + 1 If ei is an internal or send eventmax{LCi,TS (m)}+ 1 If ei = receive(m)

1 2 4 5 6 7

1 2 43 5 7

Properties

Theorem

Scalar logical clocks satisfy Clock condition, i.e.

e→ e′ ⇒ LC(e) < LC(e′)

Proof.

This immediately follows from the update rules of the clock.

Properties

Theorem

Scalar logical clocks satisfy Clock condition, i.e.

e→ e′ ⇒ LC(e) < LC(e′)

Proof.

This immediately follows from the update rules of the clock.

2018-1

DS - Time,clocks,events

Properties

Total ordering: It is also possible to provide a total order for events, bysorting based on the process identifier for events with the same timestamp.

Theorem

Scalar logical clocks do not satisfy Strong clock condition, i.e.

LC(e) < LC(e′) 6⇒ e→ e′

1 2 4 5 6 7

1 2 43 5 7

Theorem

Scalar logical clocks do not satisfy Strong clock condition, i.e.

LC(e) < LC(e′) 6⇒ e→ e′

1 2 4 5 6 7

1 2 43 5 7

2018-1

LC(p12) < LC(p2

3), yet p12 6→ p2

Real clocks vs logical clocks Vector logical clocks

Causal histories clocks

Definition (Causal History)

The causal history of an event e is the set of events that happen-beforee, plus e itself.

θ(e) = {e′ ∈ H | e′ → e} ∪ {e}

Theorem

Causal histories satisfy Strong clock condition

Proof.

∀e 6= e′ : LC(e) < LC(e′)def⇔ θ(e) ⊂ θ(e′)⇔ e ∈ θ(e′)⇔ e→ e′

Example

Problem:Causal histories tend to grow too much; they cannot be used as“timestamps” for messages.

e1 1 e2

1 e3 1 e4

1 e5 1 e6

e1 2 e2

2 e3 2

e1 3 e2

3 e3 3 e4

3 e5 3 e6

reqreq

respreq

Vector clocks

Causal history projection: θi(e) = θ(e) ∩ hi = hciiθ(e) = θ1(e) ∪ θ2(e) ∪ . . . ∪ θn(e) = hc1

1 ∪ hc22 ∪ . . . ∪ hcnn

In other words, θ(e) is a cut, which happens to be consistent.

Cuts can be represented by their frontiers: θ(e) = (c1, c2, . . . , cn)

Definition

The vector clock associated to event e is a n-dimensional vector VC (e)such that

VC (e)[i] = ci where θi(e) = hcii

Vector clocks: implementation

Each process pi maintains a vector clock VC i, initially all zeroes;

When event ei is executed, its vector clock is updated and assumesthe value of VC (ei);

If ei = send(m, ∗), the timestamp of m is TS(m) = VC (ei);

Update rule

When event ei is executed by process pi, VC i is updated as follows:

If ei is an internal or send event:

VC i[i] = V Ci[i] + 1

If ei = receive(m):

VC i[j] = max{V Ci[j], TS(m)[j]} ∀j 6= iVC i[i] = V Ci[i] + 1

Example

(1,0,0) (2,1,0) (3,1,3) (4,1,3) (5,1,3) (6,1,3)

(0,1,0) (1,2,4)

(4,3,4)

(0,0,1) (1,0,2) (1,0,3)(1,0,4) (1,0,5) (5,1,6)

Properties of Vector clocks

“Less than” relation for Vector clocks

V < V ′ ⇔ (V 6= V ′) ∧ (∀k : 1 ≤ k ≤ n : V [k] ≤ V ′[k])

Strong Clock Condition

e→ e′ ⇔ VC (e) < V C(e′)⇔ θ(e) ⊂ θ(e′)

Simple Strong Clock Condition

ei → ej ⇔ VC (ei)[i] ≤ V C(ej)[i]

Properties of Vector clocks

Definition (Concurrent events)

Events ei and ej are concurrent (i.e. ei||ej) if and only if:

(V C(ei)[i] > V C(ej)[i]) ∧ (V C(ej)[j] > V C(ei)[j])

In other words, event ei does not happen-before ej , and ej does nothappen before ei.

Example: ?

Concurrent events

(1,0,0) (2,1,0) (3,1,3) (4,1,3) (5,1,3) (6,1,3)

(0,1,0) (1,2,4)

(4,3,4)

(0,0,1) (1,0,2) (1,0,3)(1,0,4) (1,0,5) (5,1,6)

Concurrent events

(1,0,0) (2,1,0) (3,1,3) (4,1,3) (5,1,3) (6,1,3)

(0,1,0) (1,2,4)

(4,3,4)

(0,0,1) (1,0,2) (1,0,3)(1,0,4) (1,0,5) (5,1,6)

p32018-1

Vector logical clocks

Concurrent events

Example of Concurrent Events: Example: (3,1,3) and (1,2,4)

Properties of vector clocks

Definition (Pairwise Inconsistent)

Events ei and ej with i 6= j are pairwise inconsistent if and only if

(V C(ei)[i] < V C(ej)[i]) ∨ (V C(ej)[j] < V C(ei)[j])

In other words, two events are pairwise inconsistent if they cannotbelong to the frontier of the same consistent cut. The formulacharacterize the fact that the cut include a receive event withoutincluding a send event.

Example: ?

Pairwise inconsistent

(1,0,0) (2,1,0) (3,1,3) (4,1,3) (5,1,3) (6,1,3)

(0,1,0) (1,2,4)

(4,3,4)

(0,0,1) (1,0,2) (1,0,3)(1,0,4) (1,0,5) (5,1,6)

(1,0,0) (2,1,0) (3,1,3) (4,1,3) (5,1,3) (6,1,3)

(0,1,0) (1,2,4)

(4,3,4)

(0,0,1) (1,0,2) (1,0,3)(1,0,4) (1,0,5) (5,1,6)

p32018-1

Vector logical clocks

Example of Pairwise Inconsistent Events: Example: (3,1,3) and (1,0,2)

Properties of Vector Clocks

Definition (Consistent Cut)

A cut defined by (c1, . . . , cn) is consistent if and only if:

∀i, j ∈ [1 . . . n] : VC (ecii )[i] ≥ VC (ecjj )[i])

In other words, a cut is consistent if its frontier does not contain anypairwise inconsistent pair of events.

Contents

Passive monitoring

A Passive Approach to GPE

How it works

At each (relevant) event, each process sends a notification to themonitor describing it local state

The monitor collects the sequence of notifications, i.e. anobservation of the distributed run.

An observation taken in this way can correspond to:

A consistent run

A run which is not consistent

No run at all

Can you find example of the three cases?Can you explain why this happen?

Passive monitoring

Observations which are not runs

Problem

Observations may not correspond to a run because the notificationssent by a single process to the monitor may be delayed arbitrarily andthus arrive in any possible order

Solution

To adopt communication channels between the processes and themonitor that guarantee that notifications are never re-ordered

Passive monitoring

Notification ordering

Definition (FIFO Order)

Two notifications sent by pi to p0 must be delivered in the same orderin which they were sent:

∀m,m′ : send i(m, p0)→ send i(m′, p0)⇒ deliver0(m)→ deliver0(m

Uh? What is “deliver”?

Passive monitoring

Delivery Rules

How to order notifications?

To be ordered, each notification m carries a timestamp TS (m)containing “ordering” information

The act of providing the monitor with a notification in the desiredorder is called delivery; the event deliver(m) is thus distinct fromreceive(m).

The rule describing which notifications can be delivered amongthose received is called delivery rule

Passive monitoring

FIFO Order - Implementation

Each process maintains a local sequence number incremented ateach notification sent

The timestamp of a notification corresponds to the local sequencenumber of the sender at the time of sending

Definition (FIFO Delivery Rule)

If the last notification delivered by p0 from pj has timestamp s, p0 maydeliver “any” notification m received from pj with TS (m) = s+ 1.

Passive monitoring

Observations which are not consistent runs

Problem

If we use FIFO order between all processes and p0, all the observationstaken by p0 will be runs; but there is no guarantee that they areconsistent runs.

Solution

To adopt communication channels that guarantee that notifications aredelivered in an order that respects the happen-before relation.

Passive monitoring

Notification ordering

Definition (Causal Order)

Two notifications sent by pi and pj to p0 must be delivered followingthe happen-before relation:

∀m,m′ : send i(m, p0)→ send j(m′, p0)⇒ deliver0(m)→ deliver0(m

Question

FIFO order among all channels...Is it sufficient to obtain Causal delivery?

Passive monitoring

Example

2018-1

Passive monitoring

Example

• Given that each communication channels is associated with just onemessage, this distributed system satisfies FIFO order.

• However, it does not satisfy Causal order: the deliver of message m′

should be delivered after message m, because the send event of m ispotentially caused by m′.

• Silly example: p1 sends an invitation for a party to p2 and p3 by SMS;p2 sends a message to p3 saying “Are you going to participate to theparty of p1?” p3 gets offended because it has not been invited.

Passive monitoring

Causal delivery and consistent observations

Theorem

If p0 uses a delivery rule satisfying Causal Order, then all of itsobservations will be consistent.

Proof.

Definition of Causal Order ≡ definition of a consistent observation

Three implementations of the causal delivery rule:

Real-time clocks

Logical (scalar) clocks

Vector clocks

Theorem

If p0 uses a delivery rule satisfying Causal Order, then all of itsobservations will be consistent.

Proof.

Definition of Causal Order ≡ definition of a consistent observation

Three implementations of the causal delivery rule:

Real-time clocks

Logical (scalar) clocks

Vector clocks

2018-1

Passive monitoring

• Suppose, by contradiction, that the monitor obtains an observationwhich is not consistent.

• This means that an event e′ appears before e in the observation, bute→ e′.

• This means that the notification of e′ has been delivered before thenotification of e

• But this is impossible, because send(me, p0)→ send(me′ , p0) and bythe Casual Delivery rule deliver(me)→ deliver(me′), a contradiction.

Passive monitoring Passive monitoring, v.1

Passive monitoring with real-time

Initial assumptions

All processes have access to a real-time clock RC

Let RC (e) be the real-time at which e is executed

All notifications are delivered within a time δ

The timestamp of notification m corresponding to an event e isTS (m) = RC (e).

Definition (DR1: Real-time delivery rule)

At any time, deliver all received notifications in increasing timestamporder.

Theorem

Observation O constructed by p0 using DR1 is guaranteed to beconsistent (?)

Passive monitoring with real-time

Initial assumptions

All processes have access to a real-time clock RC

Let RC (e) be the real-time at which e is executed

All notifications are delivered within a time δ

The timestamp of notification m corresponding to an event e isTS (m) = RC (e).

Definition (DR1: Real-time delivery rule)

At any time, deliver all received notifications in increasing timestamporder.

Theorem

Observation O constructed by p0 using DR1 is guaranteed to beconsistent

Stability of messages

Definition (Stability)

A notification m received by p0 is stable at p0 if no notification m′ withTS (m′) < TS (m) can be received in the future by p0

Definition (DR1: Delivery rule for RC )

Deliver all received notifications that are stable at p0, in increasingtimestamp order

Theorem

Observation O constructed by p0 using DR1 is guaranteed to beconsistent

Safety: Clock Condition for RC

e→ e′ ⇒ RC (e) < RC (e′)

Note that RC (e) < RC (e′) 6⇒ e→ e′, but this rule is sufficient toobtain consistent observations, as two notifications e→ e′ are neverdelivered in the incorrect order.

Liveness: Stability

At time t, any message sent by time t− δ is stable.

Note that real-time clocks do not support stability; it is the maximumdelay of messages that enables it.

Passive monitoring with logical clocks

Initial assumptions

All processes have access to a logical clock LC ;let LC (e) be the logical clock at which e is executed

The timestamp of notification m corresponding to an event e isTS (m) = LC (e)

Definition (DR2: Deliver Rule for LC )

Deliver all received notifications that are stable at p0 in increasingtimestamp order.

Safety: Clock Condition for LC

e→ e′ ⇒ LC (e) < LC (e′)

Note that LC (e) < LC (e′) 6⇒ e→ e′, but this rule is sufficient toobtain consistent observations, as two events e→ e′ are never orderedincorrectly

Liveness: Stability

We need a way to reproduce the concept of δ in an asynchronoussystem, otherwise no notification will be ever delivered.

Solution

Each process communicates with p0 using FIFO delivery

When p0 receives a notification from pi describing an event e withtimestamp TS (e), it is sure that it will never receive a messagefrom pi describing an event e′ with TS(e′) ≤ TS(e)

Stability of notification m at p0 can be guaranteed when p0 hasreceived at least one notification from all other processes with atimestamp greater or equal than TS (m)

Problems of Logical Clocks

They add unnecessary delays to observations

They require a constant flux of notifications from all processes

1 2 3 6

2 3 4p1

p01 2 3 5

Stable messages

Passive Monitoring with Vector Clocks

Variables maintained at p0

M : the set of notifications received but not yet delivered by p0

D: an array, initialized to 0’s, such that D[k] contains TS(m)[k]where m is the last notification delivered by p0 from process pk.

When a notification is deliverable by p0?A notification m ∈M from process pj is deliverable as soon as p0 canverify that there is no other notification m′ (neither in M nor in thechannels) such that send(m′, p0)→ send(m, p0).

Implementing Causal Delivery with Vector Clocks

m ∈M : a notification sent by pj to p0

m′: the last notification delivered from process pk, k 6= j

Definition (Weak Gap Detection)

If TS (m′)[k] < TS (m)[k] for some k 6= j, then there exists eventsendk(m′′) such that

sendk(m′, p0)→ sendk(m′′, p0)→ send j(m, p0)

Implementing Causal Delivery with Vector Clocks

Two conditions to be verified to check if m can be delivered:

There is no earlier message from pj that has not been delivered yet.

Causal Delivery Rule, Part 1 D[j] == TS (m)[j]− 1

∀k 6= j, let m′ be the last message from pk delivered by p0(D[k] = TS (m′)[k]); we must be sure that no message m′′ from pkexists such that: sendk(m′, p0)→ sendk(m′′, p0)→ send j(m, p0)

Causal Delivery Rule, Part 2: ∀k 6= j : D[k] ≥ TS (m)[k]It follows from Weak Gap Detection

Example

(D[j] == TS (m)[j]− 1

)∧(∀k 6= j : D[k] ≥ TS (m)[k]

VC1 [0,0] [1,0]

VC2 [0,0] [1,1]

D [0,0] [1,0] [1,1]

Final Comments

We have seen how to implement Causal Delivery “many-to-one”

The same rules apply if we implement a mechanism forimplementing “one-to-many” (reliable broadcast)

Contents

Proactive monitoring

Snapshot Protocol

Problem

Goal: To build the global state “on demand” of the monitor.

How: By taking “pictures” (snapshots) of the local state wheninstructed

Challenge: To build a consistent global state.

Applications

Failure recovery: a global state (checkpoint) is periodically saved;to recover from a failure, the system is restored to the last savedcheckpoint.

Distributed garbage collection

Monitoring distributed events (e.g., industrial process control)

Chandy-Lamport Snapshot Algorithm

This particular protocol enables to reason about “channel states”

GPE can be delayed with respect to passive monitoring

Assumption: processes communicate through FIFO channels

Snapshot Protocol

Channel State

For each channel between pi and pjI xi,j contains the messages sent by pi but not received yet by pjI xj,i contains the messages sent by pj but not received yet by pi

Note: channel state can be obtained by storing appropriateinformation in the local state, but it is complicated.

Recorded information

Each process pi will record its local state σi and the content of itsincoming channels xj,i.

Snapshot Protocol

Assumptions:

Communication channels satisfy FIFO order

Assumptions to be relaxed later:

Access to a real time clock RC ;

Message delays are bounded by some known value δ;

Relative process speeds are bounded;

Message m is tagged with a timestamp TS (m) = RC (e), wheree = send(m).

Proactive monitoring Snapshot protocol, v.1

Snapshot Protocol, v.1

1 p0 chooses a time ts far enough in the future that a messagecontaining ts sent by p0 will be received by all processes by ts:

ts = now() + δ

2 p0 sends a message “take a snapshot at ts” to all processes in Π

3 When RC i = ts do:1 pi records its local state2 pi sends an empty message to all processes in Π3 pi starts to record all messages received over each incoming channel

4 When pi receives a message m from j such that TS (m) ≥ ts, itstops recording messages for incoming channel j

5 Each pi sends its recorded local state and channel states to p0

Liveness The empty messages at 3.2 guarantee liveness.Safety This protocol constructs a consistent global state; actually, thisglobal state did in fact occur. Formally:Let Cs be the cut associated to the constructed global state;

(e ∈ Cs) ∧ (e′ → e) ⇒(e ∈ Cs) ∧ (RC (e′) < RC (e)) ⇒ C.C. : e′ → e⇒ RC (e′) < RC (e)

(e′ ∈ Cs) Cs Def. : e ∈ Cs ⇔ RC (e) < ts

Can we use logical clocks instead of a real time clock?

From real time clocks to logical clocks:

The construct “ when LC = ts do statement” makes no senseI LC s are not continuous (e.g., they can jump from ts − 1 to ts + 1)I When LC = ts, the event that caused the clock update is already

Solution: before pi executes an event e:I If e is an internal or send event, and LC = ts − 2:

F pi executes eF pi executes statement

I If e = receive(m) ∧ TS (m) ≥ ts and LC < ts − 1:F pi puts LC = ts − 1F pi executes statementF pi executes e

From real time clocks to logical clocks:

We assume that we can found a logical time ts large enough that amessage containing it sent by p0 will be received by all otherprocesses before ts

Impossible in an asynchronous distributed systems

We will relax this assumption later

1 p0 chooses a logical time ts large enough

2 p0 sends a message “take a snapshot at ts” to all processes in Π

3 p0 sets its logical clock to ts;

4 When LC i = ts do:1 pi records its local state;2 pi sends an empty message to all processes in Π;3 pi starts to record all messages received over each incoming channel.

5 When pi receives a message m from j such that TS (m) ≥ ts, itstops recording messages for incoming channel j

6 Each pi sends its recorded local state and channel states to p0.

We now remove the need for ts:

A process may receive an ”empty message” from a node before the“take snapshot at ts” is actually received

In other words, it may be already aware of the snapshot protocol

We remove the “at ts” and we use snapshot messages instead ofempty ones

We can now remove logical clocks completely, as messages are nottimestamped any more

1 p0 sends a message snapshot to itself ;

2 when pi receives snapshot for the first time:I let pj be the sender of this messageI pi records its local state σi;I sends snapshot to all processes in Π;I xk,i ← ∅ ∀k 6= i;

3 when pi receives message m 6= snapshot from pk, k 6= jI xk,i ← xk,i ∪ {m}

4 when pi receives a snapshot from pk, k 6= j beyond the first time:I pi stops recording messages in xk,i;

5 when pi has received a snapshot from all processesI pi then sends its recorded local state and the channel states to p0.

Example

e1 1 e2

1 e3 1 e4

1 e5 1

e1 2 e2

2 e3 2 e4

2 e5 2

SnapshotsRecorded states

Normal messages State p1

State p2

Incoming channel p1

Proof of correctness

Theorem

A global state built using the Chandy-Lamport snapshot algorithm isconsistent.

Theorem

A global state built using the Chandy-Lamport snapshot algorithm isconsistent.

2018-1

Snapshot protocol, v.3

In order for the global state to be inconsistent, there should be a receiveevent included in the global state for which the corresponding send event isnot included. Let ei = receivei(m) and let ej = send j(m).In order for ej to not belong to the global state, process pj must have re-ceived the first snapshot message before ej . Thus, it has sent a messageto pi containing a snapshot message to pi before sending m. For the FIFOproperty of the channel, pi must have received the snapshot message beforeei. So, it must have stored the local state before receiving m, a contradiction.

Contents

Stable predicates

Stable Predicates

Problem

Let Σ be a global state built by one of the presented methods

It represents a state of the past, potentially with no bearing to thepresent

Does it make sense to evaluate predicate Φ on it?

A special case: stable predicates

Many systems properties have the characteristic that once they becometrue, they remain true.

Deadlock

Garbage collection

Termination

Stable predicates

A distributed computation may have many runs

Definition (Leads-to relation)

A consistent run R = e1e2 . . . results in a sequence of consistentglobal states Σ0Σ1Σ2 . . ., where Σ0 denotes the initial global state.

We say that a global state Σ leads to to a global state Σ′, denotedΣ R Σ′ in a consistent run R if:

I R results in a sequence of global states Σ0Σ1Σ2 . . .;I Σ = Σi,Σ′ = Σj , i < j.

We write Σ Σ′ if there is a run R such that Σ R Σ′.

Stable predicates

A distributed computation may have many runs

Definition (Lattice)

The set of all consistent global states of a computation along withthe leads-to relation defines a lattice;

n orthogonal axis, one per process;

Σk1...kn shorthand for the global state (σk11 , . . . , σ

knn );

The level of Σk1...kn is equal to k1 + · · ·+ kn.

A path in the lattice is a sequence of global states of increasinglevels that corresponds to a consistent run.

Stable predicates

6 Consistency

132231

455463

Figure 3. A Distributed Computation and the Lattice of its Global States

UBLCS-93-1 9

Stable predicates

Stable Predicates

Consider a global state construction protocol:

Let Σa be the global state in which the protocol is initiated;

Let Σf be the global state in which the protocol terminates;

Let Σs be the global state constructed by the protocol

Since Σa Σs Σf , if Φ is stable, then:

Φ(Σs) = true ⇒ Φ(Σf ) = true

Φ(Σs) = false ⇒ Φ(Σa) = false

Stable predicates Deadlock detection

Deadlock Detection

Server code

Server code, modified for snapshot protocol

Monitor code for snapshot protocol

Server code, modified for passive protocol

Monitor code for passive protocol

No need to store channel state in this case

Stable predicates Deadlock detection

Server code

Process pi

Queue pending← new Queue()boolean working← falsewhile true do

while working or pending.size() = 0 doreceive 〈m, pj〉if m.type = request then

pending.enqueue(〈m, pj〉)else if m.type = response then〈m′, pk〉 ← nextState(m, pj)working← (m′.type = request)send m′ to pk

while not working and pending.size() > 0 do〈m, pj〉 ← pending.dequeue()〈m′, pk〉 ← nextState(m, pj)working← (m′.type = request)send m′ to pk

Stable predicates Deadlock detection through active monitoring

Deadlock Detection through Snapshot

Approach

All channels are based on FIFO delivery

Add code to deal with Snapshot messages

Pros and Cons

Generates overhead only when deadlock is suspected

Introduces a delay between deadlock and detection

Server code, modified for active monitoring (1)

Process pi

Queue pending← new Queue()boolean working← falseboolean[] blocking← {false, ..., false}while true do

blocking[j]← truepending.enqueue(〈m, pj〉)

else if m.type = response then〈m′, pk〉 ← nextState(m, pj)working← (m′.type = request)send m′ to pkif m′.type = response then

blocking[k]← false

Server code, modified for active monitoring (2)

Process pi

else if m.type = snapshot thenif s = 0 then

send 〈snapshot, blocking〉 to p0send 〈snapshot〉 to Π− {pi}

s← (s+ 1) mod n

while not working and pending.size() > 0 do〈m, pj〉 ← pending.dequeue()〈m′, pk〉 ← nextState(m, pj)working← (m′.type = request)send m′ to pkif m′.type = response then

blocking[k]← false

Monitor code, for active monitoring

Process p0

boolean[ ][ ] wfg← new boolean[1 . . . n][1 . . . n]while true do{ Wait until deadlock is suspected }send 〈snapshot〉 to Πfor k ← 1 to n do

receive 〈m, j〉wfg[j]← m.data

if there is a cycle in wfg then{ the system is deadlocked }

Stable predicates Deadlock detection through passive monitoring

Deadlock detection through passive monitoring

Approach

Sends a message to p0 for each relevant event

Communication with p0 based on causal delivery

Pros and Cons

Simpler approach, but complexity is hidden by the causal deliverymechanism

Latency limited to message delays

Higher overhead

Server code, modified for passive monitoring (1)

Process pi

Queue pending← new Queue()boolean working← falsewhile true do

send 〈requested, j, i〉 to p0pending.enqueue(〈m, pj〉)

else if m.type = response then〈m′, pk〉 ← nextState(m, pj)working← (m′.type = request)send m′ to pkif m′.type = response then

send 〈responded, i, k〉 to p0

Server code, modified for passive monitoring (2)

Process pi

while not working and pending.size() > 0 do〈m, pj〉 ← pending.dequeue()〈m′, pk〉 ← nextState(m, pj)working← (m′.type = request)send m′ to pkif m′.type = response then

send 〈responded, i, k〉 to p0

Monitor code, for passive monitoring

Process p0

boolean[ ][ ] wfg← new boolean[1 . . . n][1 . . . n]while true do

receive 〈m, pj〉if m.type = responded then

wfg[m.from,m.to]← falseelse

wfg[m.from,m.to]← true

if there is a cycle in wfg then{ the system is deadlocked }

Contents

Non-stable predicates

Problems of non-stable predicates

The condition encoded by the predicate may not persist longenough for it to be true when the predicate is evaluated

If a predicate Φ is found to be true by the monitor, we do notknow whether Φ ever held during the actual run.

Conclusions

Evaluating a non-stable predicate over a single computation makesno sense

The evaluation must be extended to the entire lattice of thecomputation

Predicates over entire computations

It is possible to evaluate a predicate over an entire computation usingan observation obtained by a passive monitor.

Possibly(Φ): There exists a consistent observation O of thecomputation such that Φ holds in a global state of O.

Definitely(Φ): For every consistent observation O of thecomputation, there exists a global state of O in which Φ holds.

Example (Debugging)

If Possibly(Φ) is true, and Φ identifies some erroneous state of thecomputation, than there is a bug, even if it is not observed during anactual run.

Example: Possibly((y − x) = 2), Definitely(x = y)

14 Properties of Global Predicates

: 6 : 4

: 3 : 4 : 5

132231

Figure 16. Global States Satisfying Predicates and 2

UBLCS-93-1 33

Predicates over entire computations

Theorem

Possibly and Definitely are not duals:

¬Possibly(Φ) 6⇔ Definitely(¬Φ)

¬Definitely(Φ) 6⇔ Possibly(¬Φ)

Example

Possibly(x 6= y), Definitely(x = y)

Algorithms for detecting Possibly and Definitely

We use the passive approach in which processes send notificationsof events relevant to Φ to the monitor p0;

Events are tagged with vector clocks;

The monitor collects all the events and builds the lattice of globalstates.How?

To detect Possibly(Φ): if there exists one global state in which Φis true, then return true, otherwise false.

To detect Definitely(Φ): mark nodes where Φ is true with a value1, the other nodes with value 0. If the cost of the shortest pathbetween the initial state and the final state is larger than 0, returntrue, otherwise false.

Algorithms for detecting Possibly and Definitely

Problems

The number of states grows exponentially with the number oftotal events.

Techniques can be used to reduce the number of eventsI Only those relevant to ΦI Forcing periodic synchronizationI Reducing the complexity of predicates (conjunction of local

predicates)

Reading Material

O. Babaoglu and K. Marzullo. Consistent global states of distributed systems:Fundamental concepts and mechanisms.

In S. Mullender, editor, Distributed Systems (2nd ed.). Addison-Wesley, 1993.

//www.disi.unitn.it/~montreso/ds/papers/ConsistentGlobalStates.pdf

Reality Check: Interesting links

Clocks are bad, or welcome to distributed systemsWhy vector clocks are easyWhy vector clocks are hardWhy Cassandra doesn’t need vector clocks

Distributed Algorithms Time, clocks and the ordering of...

Documents

Transcript of Distributed Algorithms Time, clocks and the ordering of...

Clocks 2013.2.22

B. Prabhakaran 1 Theoretical Aspects Logical Clocks Causal Ordering Global State Recording Termination Detection.

Stavelet Clocks

Clocks Arrangement

Ansonia History - Antique Clocks Guy Antique Clocks and Mechani

Evergreen clocks

Add Fault Tolerance – order & time Time, Clocks, and the Ordering of Events in a Distributed System Leslie Lamport Optimal Clock Synchronization T.K. Srikanth.

Rapport Clocks

Molecular Clocks

Introduction Clocks,events and process states Synchronizing physical clocks

Clocks - University of Washington · Lamport Clocks Framework for reasoning about event ordering - notion of logical time vs. physical time - causal ordering and vector clocks (e.g.,

Vector Clocks

wall clocks cuckoo clocks home furniture lighting

Desk Clocks

digital clocks and clocks

Logical Clocks and Causal Orderingpallab/dist_sys/Lec-05-Logical-Clocks.pdfLogical Clocks and Causal Ordering CS60002: ... Limitation of Lamport’s Clock a ... Problem of Vector Clock

Lamport Clocks - University of Washington · Lamport Clocks Framework for reasoning about event ordering - notion of logical time vs. physical time - leads to causal ordering, vector

Time clocks and the ordering of events

Wall Clocks

Distributed Systems Dinesh Bhat - Advanced Systems (Some slides from 2009 class) CS 6410 – Fall 2010 Time Clocks and Ordering of events Distributed Snapshots.