Ch13 Checkpointing and Recovery

Outline

Introduction: What? Why? Where?
Problems in rollback
Incarnation numbers
Taxonomy of solution techniques
Uncoordinated checkpoint
Coordinated checkpoint
Synchronous logging
Asynchronous logging
Adaptive logging

Checkpointing and Recovery

Introduction

During a computation, a node might fail and later be repaired. After a failed processor has been repaired, how do we bring the system back to a consistent global state?

If every processor periodically
records its local state on stable storage, and
records the messages it receives on stable storage,
then the system can be brought to a consistent global state by rolling it back to a previously recorded global state.

Terminology
checkpointing: recording the local state on stable storage
logging received messages: recording received messages on stable storage


Recovery line

A set C of local checkpoints forms a consistent state (also called a recovery line) if the following conditions are satisfied:
1) there are no lost messages in C
2) there are no orphan messages in C
3) C contains exactly one checkpoint for each processor
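The three conditions can be checked mechanically. The sketch below is only an illustration under stated assumptions: each message is described by hypothetical local event times at its sender and receiver, a checkpoint is represented by the local time at which it was taken, a message is "lost" when its send is recorded but its receipt is not, and "orphan" when its receipt is recorded but its send is not. The names `Message`, `classify` and `is_recovery_line` are invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    name: str
    sender: str
    send_time: int   # local event time at the sender (hypothetical encoding)
    receiver: str
    recv_time: int   # local event time at the receiver


def classify(messages, cut):
    """Return (lost, orphan) message names for a set of checkpoints.

    cut maps each processor to the local time of its chosen checkpoint.
    """
    lost, orphan = [], []
    for m in messages:
        send_recorded = m.send_time <= cut[m.sender]
        recv_recorded = m.recv_time <= cut[m.receiver]
        if send_recorded and not recv_recorded:
            lost.append(m.name)      # sent before the cut, received after it
        elif recv_recorded and not send_recorded:
            orphan.append(m.name)    # received before the cut, sent after it
    return lost, orphan


def is_recovery_line(messages, cut, processors):
    """Conditions 1) to 3): no lost, no orphan, one checkpoint per processor."""
    lost, orphan = classify(messages, cut)
    return set(cut) == set(processors) and not lost and not orphan


msgs = [Message("a", "p", 1, "q", 5), Message("b", "q", 4, "p", 2)]
print(classify(msgs, {"p": 3, "q": 3}))
print(is_recovery_line(msgs, {"p": 0, "q": 0}, ["p", "q"]))
```

With both checkpoints at local time 3, message a is lost and message b is an orphan; moving both checkpoints back to time 0 removes both problems.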


Problems in rollback

The goal of rollback is to bring the system back to a consistent state.

Some precautions have to be taken for this to work properly.

For simplicity, we do not consider channel state for the rollback.

To see the problem, assume:
1) processors checkpoint from time to time
2) checkpoints are established independently, without any coordination between processors


[Figure: time lines of p1, p2, p3 with checkpoints c1, c2, c3 and messages m1, m2, m3.]

The global state formed by c1, c2, c3 is inconsistent. It contains:
lost messages: m2, m3
orphan message: m1

Problems in rollback: cascading rollbacks

[Figure: time lines of processors p, q, r with checkpoints p1 to p3, q1 to q4, r1 to r4 and messages m1 to m5.]

Because of message m1, “p rolls back to p3” requires that “r rolls back to r4”.

...

{p2,q3,r3} is a recovery line

A rollback by one processor can cause an avalanche of rollbacks.

How to avoid this ?

Problems in rollback: I/O stuttering

[Figure: processors p, q, r; p performs an I/O event after its checkpoint pi.]

Rolling back processor p to pi requires that the I/O event be re-executed: this is I/O stuttering. How can we avoid this?

Log inputs: avoids input stuttering
Output commit: avoids output stuttering

Problems in rollback: message duplication

[Figure: p sends message m to q after its checkpoint pi, and q receives it (event r(m)). p then rolls back to pi and, after recovering, sends m again.]

Processor p rolls back to pi. There is no need for q to roll back.

After recovery, processor p sends m again. Processor q should recognize that this message m is a duplicate.

Incarnation numbers: handling duplicate messages

Every processor:
maintains an incarnation number on stable storage
stores a guess of the incarnation number of every other processor

On every recovery from failure or rollback, the incarnation number is incremented.

Each message carries the incarnation number of its sender.

The evolution of a processor is organized into periods. Incarnation numbers serve to identify these periods.

[Figure: time line of one processor. Period 0 ends with a recovery from failure, period 1 ends with a rollback, and period 2 follows; the incarnation numbers are 0, 1, 2.]

When processor p receives a message m from processor q, processor p behaves as follows:
if m.incarnation < incarnation[q]: m is a duplicate; discard it
if m.incarnation = incarnation[q]: deliver m
if m.incarnation > incarnation[q]: m belongs to an incarnation that p does not know yet; block the delivery of m until m.incarnation = incarnation[q]
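The receive rule can be sketched as a small filter. This is a minimal illustration, not a definitive implementation: the class `Receiver` and the method `learn_incarnation` are hypothetical names, and the way a processor learns another processor's new incarnation number (here an explicit call) is an assumption, since the slides do not specify it.

```python
from collections import defaultdict


class Receiver:
    """Filters incoming messages by the sender's incarnation number."""

    def __init__(self):
        self.incarnation = defaultdict(int)  # guess of each sender's incarnation
        self.blocked = defaultdict(list)     # messages from a not-yet-known incarnation
        self.delivered = []

    def on_message(self, sender, msg_incarnation, payload):
        known = self.incarnation[sender]
        if msg_incarnation < known:
            return "discard"                 # duplicate from an old incarnation
        if msg_incarnation == known:
            self.delivered.append(payload)
            return "deliver"
        self.blocked[sender].append((msg_incarnation, payload))
        return "block"                       # wait until the guess catches up

    def learn_incarnation(self, sender, new_incarnation):
        # Assumption: the new incarnation number reaches us somehow,
        # for example through a recovery announcement.
        current = max(self.incarnation[sender], new_incarnation)
        self.incarnation[sender] = current
        still_blocked = []
        for inc, payload in self.blocked[sender]:
            if inc == current:
                self.delivered.append(payload)
            elif inc > current:
                still_blocked.append((inc, payload))
            # inc < current: stale message, dropped
        self.blocked[sender] = still_blocked


r = Receiver()
print(r.on_message("q", 0, "a"))
print(r.on_message("q", 1, "b"))
r.learn_incarnation("q", 1)
print(r.delivered)
print(r.on_message("q", 0, "a"))
```

In this run the message from incarnation 1 is held back until the receiver learns that q has moved to incarnation 1, after which a re-sent message from incarnation 0 is discarded as a duplicate.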

Choices to be made to implement a recovery scheme

To log or not to log messages?

Log messages:
+ : increases flexibility at recovery time
- : expensive (space)
- : processes must be deterministic (which is often not the case)

Choices to be made to implement a recovery scheme

To coordinate or not to coordinate the recording of state?

Uncoordinated checkpoints
Sufficient information (we’ll see later what this is) must be kept for rollback.
+ : keeps the cost of establishing checkpoints low
- : the amount of rollback may be unbounded

Coordinated checkpoints
The set of checkpoints together forms a recovery line.
+ : limits the amount of rollback
- : increases the cost of establishing checkpoints

Uncoordinated checkpointing

Assumptions
1. Processors checkpoint asynchronously from time to time
2. No coordination between processors for the establishment of checkpoints
3. No log of messages

Goal
Find a maximal recovery line (the latest recovery line), i.e. the one that happens after every other possible recovery line.

Uncoordinated checkpointing
Checkpoint interval algorithm (progressive rollback)

Notations
Ci,j : the jth checkpoint at processor pi
Ii,j : the interval ] Ci,j ; Ci,j+1 [, i.e. the processing interval of pi between Ci,j and Ci,j+1

Definition
Ik,l depends on Ii,j iff there is a message m sent in Ii,j and received in Ik,l.

[Figure: message m sent by pi between Ci,j and Ci,j+1 and received by pk between Ck,l and Ck,l+1.]

Uncoordinated checkpointing
Checkpoint interval algorithm (progressive rollback)

Idea of the algorithm
When a processor pi fails and then is repaired:
1. Processor pi initiates recovery by restoring its last checkpoint, say Ci,j
2. Every processor pk with an interval Ik,l that depends on Ii,j rolls back (but to which checkpoint? We’ll see later)
3. This process continues recursively (transitively) until a recovery line is determined

To support recovery, the information about interval dependence must be recorded. (This is the “sufficient information”!)

Uncoordinated checkpointing
Interval dependence graph: capturing rollback requirements

GI is a directed graph in which
VI: the vertices are the checkpoint intervals that exist when recovery starts
EI: the directed edges are such that
1) for every processor pi, (Ii,j , Ii,j+1) is in EI
2) if Ik,l depends on Ii,j, then (Ii,j , Ik,l) is in EI

Uncoordinated checkpointing
Intuition behind the interval dependence graph

If processor pi rolls back to Ci,j and Ik,l depends on Ii,j, then processor pk must roll back to Ck,l. This is needed to avoid orphan messages: the message m sent in Ii,j is undone by the rollback of pi, so its receipt in Ik,l must be undone as well.

[Figure: message m sent by pi after Ci,j and received by pk after Ck,l.]

Uncoordinated checkpointing
Interval dependence graph illustrated

[Figure, left: message passing and checkpointing for p1, p2, p3, with intervals I1,1 to I1,4, I2,1 to I2,3, I3,1 to I3,4 and messages m1 to m5. Figure, right: the corresponding interval dependence graph.]

Uncoordinated checkpointing
The checkpoint interval algorithm (progressive rollback)

When a processor pi fails and then is repaired, pi performs:

Step 1. Compute GI.

Step 2. Mark the node of GI corresponding to its last checkpoint interval; let Ii,j be that node. Mark all the nodes of GI that are reachable from Ii,j.

Step 3. For each processor k, define the “best checkpoint” of k w.r.t. the recovery of pi to be Ck,l such that l = min {j | Ik,j is marked}. Every processor rolls back to its “best checkpoint”.
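The three steps amount to a reachability computation on GI. The following is a minimal sketch under stated assumptions: intervals are encoded as pairs (processor, index) with indices starting at 1, the dependence relation is given explicitly as a list of pairs, and a processor with no marked interval is simply left out of the result, which this sketch interprets as "no rollback needed". The function name `recovery_line` and the example data are hypothetical and are not taken from the figure.

```python
from collections import deque


def recovery_line(last_interval, depends, failed):
    """Return {processor: l} where C(processor, l) is its best checkpoint.

    last_interval: {processor: index of its last checkpoint interval}
    depends: pairs ((i, j), (k, l)) meaning I(k,l) depends on I(i,j)
    failed: the processor that failed and was repaired
    """
    # Step 1: build GI.
    edges = {}
    for p, last in last_interval.items():
        for j in range(1, last):
            edges.setdefault((p, j), []).append((p, j + 1))   # rule 1)
    for src, dst in depends:
        edges.setdefault(src, []).append(dst)                 # rule 2)

    # Step 2: mark everything reachable from the last interval of `failed`.
    start = (failed, last_interval[failed])
    marked = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in marked:
                marked.add(nxt)
                queue.append(nxt)

    # Step 3: best checkpoint = smallest marked interval index per processor.
    best = {}
    for p, j in marked:
        best[p] = min(j, best.get(p, j))
    return best


# Hypothetical run: I(3,1) depends on I(2,2), and I(1,2) depends on I(3,2).
print(recovery_line({1: 2, 2: 2, 3: 2},
                    [((2, 2), (3, 1)), ((3, 2), (1, 2))],
                    failed=2))
```

In this hypothetical run the failure of processor 2 forces processor 3 back to C3,1, and because of the edge (I3,1 , I3,2) the dependency of I1,2 on I3,2 then forces processor 1 back to C1,2.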

Uncoordinated checkpointing
The algorithm illustrated: assume that p2 fails and then is repaired

Step 1. p2 computes GI.

Step 2. p2 marks all the nodes of GI reachable from its last checkpoint interval.

Step 3. Each processor rolls back to its “best checkpoint” w.r.t. the recovery of p2.

Recall: for each processor k, the “best checkpoint” of k w.r.t. the recovery of p2 is Ck,l such that l = min {j | Ik,j is marked}.

[Figure: the interval dependence graph with the marked nodes, and the message-passing diagram of p1, p2, p3 (messages m1 to m5) showing the recovery line determined.]

Uncoordinated checkpointing
Some comments about the checkpoint interval algorithm

Rollback can take the system all the way back to the initial state.

The algorithm presented is centralized: it can be implemented on a recovery manager that directs all the participants to restart, each from its “best checkpoint”. For a distributed version, recovery control messages must be used to communicate parts of GI.

Coordinated checkpointing

Idea: processors coordinate the checkpointing of their local states to ensure that the checkpoints taken by the different processors form a recovery line. This avoids cascading rollbacks.

Method used: similar to that used for computing a “global snapshot”. However, there are some differences.

Coordinated checkpointing
Subtleties:
1. Only processor states are recorded (saves space)
2. Failures during checkpointing are handled
3. The minimum number of checkpoints is stored (saves space)
4. Lost messages are handled by the communication protocol (a consistent set of checkpoints may now contain lost messages)
5. There are no orphan messages in the computed set of checkpoints
6. Only a minimum number of processors must checkpoint. Idea: old checkpoints, together with new checkpoints of some processors, may form a “consistent set” of checkpoints

Coordinated checkpointing
Koo & Toueg 87 (the original algorithm)

Uses a two-phase protocol to ensure that either all processors checkpoint or none do.

Two types of checkpoints are used for that:

“tentative checkpoint”: established while global state recording is ongoing

“permanent checkpoint”: if the recorded state is consistent, tentative checkpoints become permanent checkpoints

Coordinated checkpointing: Koo & Toueg 87 (the original algorithm)

Basic idea

Phase 1
Initiator q:
1. q takes a tentative checkpoint;
2. q requests all other processors to take tentative checkpoints
Non-initiator p, on receiving this request:
1. p decides whether or not to establish the tentative checkpoint;
2. p sends its decision to the initiator;
3. p waits for the final decision from q (i.e. it refrains from any communication with any other processor until the second phase is over)

Basic idea (cont.)

Phase 2
Initiator q:
1. q collects the decisions from all other processors
2. If all other processors have taken tentative checkpoints, then q makes its tentative checkpoint permanent; else q undoes its tentative checkpoint;
3. q requests all others to apply the same final decision
Non-initiator p, on receiving this final decision:
p executes the order
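The all-or-nothing behaviour of the two phases can be illustrated with a small in-memory simulation. This is only a sketch under simplifying assumptions: there is no real message passing, no timeouts and no blocking of application communication, a processor's state is a single integer, and the names `Processor`, `can_checkpoint` and `coordinated_checkpoint` are hypothetical.

```python
class Processor:
    def __init__(self, name, can_checkpoint=True):
        self.name = name
        self.can_checkpoint = can_checkpoint  # hypothetical switch to model a refusal
        self.state = 0
        self.tentative = None
        self.permanent = 0

    def take_tentative(self):
        """Phase 1: record a tentative checkpoint and report the decision."""
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        return True

    def finish(self, commit):
        """Phase 2: make the tentative checkpoint permanent, or undo it."""
        if commit and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None


def coordinated_checkpoint(initiator, others):
    # Phase 1: the initiator checkpoints, then asks everybody else.
    decisions = [initiator.take_tentative()]
    decisions += [p.take_tentative() for p in others]
    # Phase 2: commit only if every processor took its tentative checkpoint.
    commit = all(decisions)
    for p in [initiator, *others]:
        p.finish(commit)
    return commit


q, p = Processor("q"), Processor("p")
q.state, p.state = 5, 7
print(coordinated_checkpoint(q, [p]), q.permanent, p.permanent)

r = Processor("r", can_checkpoint=False)
q.state = 9
print(coordinated_checkpoint(q, [p, r]), q.permanent)
```

In the second round r refuses, so q and p undo their tentative checkpoints and keep their previous permanent ones.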


The basic idea ensures that there are no orphan messages. Why?

Answer: no communication is allowed until the second phase is over.


It is not necessary that all processors record their state during checkpointing. Why?

[Figure: time lines of p1, p2, p3 with checkpoints C1,1, C1,2, C2,1, C2,2, C3,1, C3,2; the checkpoint requests sent by p1 are drawn in red.]

p1 initiates checkpointing by establishing c1,2; then p1 contacts p2 and p3 by sending them checkpoint requests (the red messages).

Assume that everything goes well and that p2 and p3 establish c2,2 and c3,2 respectively as new checkpoints.

{c1,2 , c2,2 , c3,2} forms a consistent set of checkpoints.

However, {c1,2 , c2,1 , c3,2} also forms a consistent set of checkpoints (i.e. it contains no orphan messages). Hence, processor p2 need not take a new checkpoint.


Ensuring a minimum number of checkpoints

Every processor assigns monotonically increasing sequence numbers to the messages it sends.

Each processor p uses:

p.last_rec[1..M], an array of sequence numbers
p.last_rec[i] = sequence number of the last message that processor p received from processor pi since p’s last checkpoint

p.first_sent[1..M], an array of sequence numbers
p.first_sent[i] = sequence number of the first message that processor p sent to processor pi since p’s last checkpoint


Ensuring a minimum number of checkpoints (cont.)

When an initiator processor q requests a processor p to take a tentative checkpoint, q appends q.last_rec[p] to its request.

On receiving this request from q, processor p takes the tentative checkpoint only if (p.first_sent[q] <= q.last_rec[p]), that is, only if q has received a message that p sent after p’s last checkpoint. p takes a new checkpoint only in this case, in order to avoid orphan messages.

[Figure: p sends q a message after the last checkpoint of p; q receives it between its last checkpoint and its current checkpoint.]
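The test that p applies can be written as a one-line predicate. This sketch assumes that the comparison in the condition is "less than or equal" (as in Koo and Toueg's formulation), and it uses `None` as a hypothetical marker for "no message sent" or "no message received" since the last checkpoint; the function name `must_checkpoint` is invented for the example.

```python
def must_checkpoint(p_first_sent_to_q, q_last_rec_from_p):
    """True if p must take a tentative checkpoint when q requests one.

    p_first_sent_to_q: sequence number of the first message p sent to q
        since p's last checkpoint, or None if p sent nothing.
    q_last_rec_from_p: sequence number of the last message q received
        from p since q's last checkpoint, or None if q received nothing.
    """
    if p_first_sent_to_q is None or q_last_rec_from_p is None:
        return False  # nothing sent or nothing received: no orphan possible
    # q's checkpoint records the receipt of a message whose send is newer
    # than p's last checkpoint, so p must checkpoint too.
    return p_first_sent_to_q <= q_last_rec_from_p


print(must_checkpoint(3, 5))     # q received a message sent after p's checkpoint
print(must_checkpoint(6, 5))     # everything q received predates p's checkpoint
print(must_checkpoint(None, 5))  # p has sent nothing to q since its checkpoint
```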


Ensuring a minimum number of checkpoints (cont.)

Only processors that have sent messages to the initiator processor q since q’s last checkpoint need to consider establishing the new checkpoint requested by q.

An initiator processor q should therefore send requests only to those processors p from which q has received a message since q’s last checkpoint.

[Figure: p sends q a message that q receives after the last checkpoint of q.]


Ensuring a minimum number of checkpoints (cont.)

Every processor q maintains:
q.checkpoint_cohort : a set that contains those processors from which q has received some messages since q’s last checkpoint

[Figure: p sends q a message that q receives after the last checkpoint of q; such a processor p belongs to q.checkpoint_cohort.]
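The bookkeeping behind last_rec, first_sent and checkpoint_cohort can be sketched as follows. This is an illustration under assumptions: dictionaries keyed by processor name stand in for the arrays, a single counter per processor supplies the sequence numbers, and all three structures are reset when a checkpoint is taken, which is how this sketch reads "since the last checkpoint". The class name `Bookkeeping` and its method names are hypothetical.

```python
class Bookkeeping:
    """Per-processor bookkeeping used to decide who must checkpoint."""

    def __init__(self, name):
        self.name = name
        self.next_seq = 1
        self.last_rec = {}              # sender -> last sequence number received
        self.first_sent = {}            # destination -> first sequence number sent
        self.checkpoint_cohort = set()  # senders heard from since last checkpoint

    def on_send(self, dest):
        seq = self.next_seq
        self.next_seq += 1
        self.first_sent.setdefault(dest, seq)  # keep only the first one
        return seq

    def on_receive(self, sender, seq):
        self.last_rec[sender] = seq            # keep only the last one
        self.checkpoint_cohort.add(sender)

    def on_checkpoint(self):
        # Everything is relative to the last checkpoint, so start afresh.
        self.last_rec.clear()
        self.first_sent.clear()
        self.checkpoint_cohort.clear()


p, q = Bookkeeping("p"), Bookkeeping("q")
for _ in range(2):
    q.on_receive("p", p.on_send("q"))
print(p.first_sent, q.last_rec, q.checkpoint_cohort)
```

After two messages from p to q, p remembers sequence number 1 as the first one sent to q, q remembers 2 as the last one received from p, and p is in q's cohort.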


The algorithm

Phase 1, initiator processor q:
  1. take tentative checkpoint;
  2. for every processor p in q.checkpoint_cohort do
       send (Request_tentative_chkp; q.last_rec[p]) to p;


The algorithm

Phase 1, non-initiator processor p:
  On receiving (Request_tentative_chkp; q.last_rec[p]) from q:
  if (ready to perform tentative checkpoint) and (p.first_sent[q] <= q.last_rec[p]) then
    take tentative checkpoint;
    for every processor r in p.checkpoint_cohort do
      send (Request_tentative_chkp; p.last_rec[r]) to r;
    p.replies := empty;  /* set of replies */
    for every processor r in p.checkpoint_cohort do
      wait until r sends “OK” or “KO”, Timeout = T;
      on “OK” : add r to p.replies;
    if p.replies ≠ p.checkpoint_cohort then send “KO” to q
    else send “OK” to q


The algorithm

Phase 2, initiator processor q:
  1. q.replies := empty;  /* set of replies */
  2. for every processor p in q.checkpoint_cohort do
       wait until p sends “OK” or “KO”, Timeout = T;
       on “OK” : add p to q.replies;
     if q.replies ≠ q.checkpoint_cohort then
       undo tentative checkpoint;
       send “undo tentative checkpoint” to every processor in q.checkpoint_cohort
     else
       permanent := tentative;
       send “make tentative checkpoint permanent” to every processor in q.checkpoint_cohort


The algorithm

Phase 2, non-initiator processor p:
  wait until q sends “undo …” or “make … permanent”, Timeout = T;
  on “undo …” do undo tentative checkpoint end
  on “make … permanent” do checkpoint := tentative_checkpoint end
  if no timeout then
    m := message received;
    for every processor r in p.checkpoint_cohort do
      send m to r;


Handling failures

Idea:

Failures are detected by timeouts.

On recovery, if the recovering processor was the initiator, it undoes its tentative checkpoint and sends this decision to the other processors; otherwise the recovered processor consults the initiator or some other processor to find out the final decision.

Logging

Idea: processors record incoming messages.

Purpose:
avoid the need for “resending”
reduce the amount of rollback (idea of a virtual checkpoint)

[Figure: a checkpoint followed by logged messages yields a virtual checkpoint.]

+ : flexibility
- : expensive

Synchronous logging

Idea
Each message must be logged before it can be delivered.
During recovery, logged messages are replayed until the recovering processor is up to date (replay is guaranteed after all sends that can cause a subsequent rollback).

Problem: expensive
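A minimal sketch of synchronous logging, under stated assumptions: a local file stands in for stable storage, messages are integers, and the message handler is a deterministic toy (it adds the message to a counter). The class name `SyncLoggedProcess` and its methods are hypothetical.

```python
import json
import os


class SyncLoggedProcess:
    def __init__(self, log_path):
        self.log_path = log_path   # file standing in for stable storage
        self.state = 0

    def deliver(self, msg):
        self.state += msg          # deterministic toy handler

    def receive(self, msg):
        # Log first, and force the write to disk, before delivering.
        with open(self.log_path, "a") as f:
            f.write(json.dumps(msg) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.deliver(msg)

    def recover(self, checkpoint_state=0):
        # Restore the checkpoint, then replay every logged message in order.
        self.state = checkpoint_state
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                for line in f:
                    self.deliver(json.loads(line))
```

The forced write on every message is what makes this scheme expensive: no message is delivered until the storage device has acknowledged it. In exchange, a fresh process pointed at the same log can call recover() and rebuild the state that the failed one had reached.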

Asynchronous logging

Idea
Each message must be logged, but not necessarily before it can be delivered.
Messages can first be saved in main memory.
Idle periods are exploited to log messages.
Several messages can be packed together and then logged simultaneously (efficient use of I/O devices).

Problem
Some messages may be lost, so it is not always possible to replay.
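The trade-off can be seen in a small sketch. It makes several simplifying assumptions: a Python list stands in for stable storage, the flush is triggered by a batch size rather than by idle periods, and the names `AsyncLogger`, `batch_size` and `crash` are hypothetical.

```python
class AsyncLogger:
    def __init__(self, batch_size=3):
        self.buffer = []            # main memory: lost if the processor crashes
        self.stable = []            # stands in for stable storage
        self.batch_size = batch_size

    def receive(self, msg):
        # The message is considered delivered here, before it is logged.
        self.buffer.append(msg)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Several messages are written to stable storage in one operation.
        self.stable.extend(self.buffer)
        self.buffer.clear()

    def crash(self):
        # Whatever was still in main memory cannot be replayed.
        lost = list(self.buffer)
        self.buffer.clear()
        return lost


log = AsyncLogger(batch_size=3)
for m in "abcd":
    log.receive(m)
print(log.stable, log.crash())
```

Here the first three messages reach stable storage in one batch, while the fourth is still in memory at the time of the crash and is therefore lost.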
