
A Fault Tolerant Protocol for Massively Parallel Machines

Sayantan Chakravorty

Laxmikant Kale

University of Illinois, Urbana-Champaign


Outline

• Motivation
• Background
• Design
• Protocols
• Results
• Summary
• Future Work


Motivation

• As machines grow in size, MTBF decreases; applications have to tolerate faults
• Checkpoint/rollback doesn't scale:
  • All nodes are rolled back just because one crashed
  • Even nodes independent of the crashed node are restarted
  • Restart cost is similar to the checkpoint period


Requirements

• Fast and scalable checkpoints
• Fast restart:
  • Only the crashed processor is restarted
  • Minimal effect on fault-free processors
  • Restart cost less than the checkpoint period
• Low fault-free runtime overhead
• Transparent to the user


Background

• Checkpoint-based methods
  • Coordinated – blocking [Tamir84], non-blocking [Chandy85]
    • Co-Check, Starfish, Clip – fault tolerant MPI
  • Uncoordinated – suffers from rollback propagation
  • Communication-induced – [Briatico84], doesn't scale well
• Log-based methods
  • Pessimistic – MPICH-V1 and V2, SBML [Johnson87]
  • Optimistic – [Strom85], unbounded rollback, complicated recovery
  • Causal logging – [Elnozahy93] Manetho, complicated causality tracking and recovery


Design

• Message logging: sender-side message logging
• Asynchronous checkpoints:
  • Each processor has a buddy processor
  • It stores its checkpoint in the buddy's memory
• Processor virtualization: speeds up restart


Processor Virtualization

[Figure: user view of many interacting objects vs. the system implementation mapping them onto physical processors]

• Charm++
  • Parallel C++ with data-driven objects, called chares
  • Runtime maps objects to physical processors
  • Asynchronous method invocation
• Adaptive MPI
  • Implemented on Charm++
  • Multiple virtual processors on a physical processor
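
To make the data-driven model concrete, here is a minimal C++ sketch of the idea; it is not actual Charm++ code, and Chare and Scheduler are invented names. It shows asynchronous method invocation through a message queue that multiplexes many virtual processors onto one physical processor; the real runtime also handles mapping, migration, and communication.

    #include <functional>
    #include <iostream>
    #include <queue>
    #include <vector>

    // A chare-like object: its methods are driven by incoming messages.
    struct Chare {
        int id;
        explicit Chare(int i) : id(i) {}
        void recvBoundary(double value) {
            std::cout << "chare " << id << " got " << value << "\n";
        }
    };

    // One "physical processor": a scheduler draining a message queue.
    class Scheduler {
        std::queue<std::function<void()>> msgQueue; // pending invocations
    public:
        // Asynchronous method invocation: enqueue and return immediately.
        void send(std::function<void()> invocation) {
            msgQueue.push(std::move(invocation));
        }
        // Data-driven execution: process messages as they become available.
        void run() {
            while (!msgQueue.empty()) {
                msgQueue.front()();
                msgQueue.pop();
            }
        }
    };

    int main() {
        Scheduler sched;                                          // one PE
        std::vector<Chare> chares{Chare(0), Chare(1), Chare(2)};  // many VPs
        for (auto& c : chares)
            sched.send([&c] { c.recvBoundary(3.14); });
        sched.run();
    }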


Benefits of Virtualization

• Latency tolerant: adaptive overlap of communication and computation
• Supports migration of virtual processors


Message Logging Protocol

Correctness: Messages should be processed in the same order before and after the crash

Problem: without a fixed order, messages from chares A, B, and C may be processed in a different order after the crash than before it.

[Diagram: chares A, B, and C send messages to a chare; the processing order before the crash differs from the order after the crash]


Message Logging (continued)

Solution: fix an order the first time and always follow it
• The receiver gives each message a ticket number
• Messages are processed in order of ticket number

Each message contains:
• Sender ID – who sent it
• Receiver ID – to whom it was sent
• Sequence Number (SN) – together with the sender and receiver IDs, identifies a message
• Ticket Number (TN) – decides the order of processing
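
As a minimal sketch, the envelope these fields imply might be declared as follows; the field names are assumptions for illustration, not the actual Charm++ structures:

    #include <cstdint>

    // Hypothetical envelope carried by every logged message.
    // <senderID, SN> uniquely identifies the message; TN fixes the
    // order in which the receiver processes it, before and after a crash.
    struct MsgEnvelope {
        uint32_t senderID;    // which chare sent it
        uint32_t receiverID;  // which chare it is addressed to
        uint64_t SN;          // sequence number, per sender-receiver pair
        uint64_t TN;          // ticket number assigned by the receiver
    };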


Message to Remote Chares

[Diagram: sender chare P sends the ticket request <sender, SN> to receiver chare Q; Q replies with <SN, TN, receiver>; P then sends the full message as <SN, TN, message>]

• If <sender, SN> has been seen earlier, the stored TN is returned
• Otherwise, a new TN is created and <sender, SN, TN> is stored
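
A sketch of that receiver-side logic, under the assumption that it is a simple map from <sender, SN> to the issued ticket (the class and method names are hypothetical):

    #include <cstdint>
    #include <map>
    #include <utility>

    // Receiver-side ticket assignment. If a ticket was already issued
    // for <sender, SN> (e.g., the request is a post-crash resend), the
    // same TN is returned; otherwise the next TN is issued and stored.
    class TicketTable {
        std::map<std::pair<uint32_t, uint64_t>, uint64_t> issued; // <sender,SN> -> TN
        uint64_t nextTN = 0;
    public:
        uint64_t requestTicket(uint32_t sender, uint64_t sn) {
            auto key = std::make_pair(sender, sn);
            auto it = issued.find(key);
            if (it != issued.end())
                return it->second;    // seen earlier: reuse the stored TN
            issued[key] = nextTN;     // new message: create and store TN
            return nextTN++;
        }
    };

    int main() {
        TicketTable t;
        // A resend after a crash gets the same ticket as the original.
        return t.requestTicket(1, 5) == t.requestTicket(1, 5) ? 0 : 1;
    }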


Message to Local Chares

• Multiple chares on one processor mean some messages are local
• If the processor crashes, all trace of a local message is lost
• After restart, the message should get the same TN
• So <sender, receiver, SN, TN> is stored on the buddy

[Diagram: on processor R, chare P sends <sender, SN> to local chare Q; Q assigns <SN, TN, receiver> and sends the tuple <sender, receiver, SN, TN> to the buddy of processor R; only after the buddy's ack is <SN, TN, message> processed]
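
A sketch of that local-message path, with the runtime calls replaced by stand-ins; sendToBuddy, waitForAck, and deliver are assumptions for illustration, not the real API:

    #include <cstdint>
    #include <iostream>

    struct TicketRecord { uint32_t sender, receiver; uint64_t SN, TN; };

    // Stand-ins for the actual runtime calls (assumptions, not Charm++ API):
    static void sendToBuddy(const TicketRecord& r) {
        std::cout << "buddy stores <" << r.sender << "," << r.receiver
                  << "," << r.SN << "," << r.TN << ">\n";
    }
    static void waitForAck() { /* in reality: block on the buddy's ack */ }
    static void deliver(uint64_t SN, uint64_t TN) {
        std::cout << "processing message SN=" << SN << " TN=" << TN << "\n";
    }

    // A local message leaves no trace on another processor, so its
    // ticket tuple is saved on the buddy before the message is consumed.
    static void deliverLocalMessage(uint32_t sender, uint32_t receiver,
                                    uint64_t SN, uint64_t TN) {
        sendToBuddy({sender, receiver, SN, TN});
        waitForAck();   // only after the buddy acks is it safe to process
        deliver(SN, TN);
    }

    int main() { deliverLocalMessage(1, 2, 7, 42); }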


Checkpoint Protocol

• A processor asynchronously decides to checkpoint
• It packs up the state of all its chares and sends it to the buddy
  • Message logs are part of a chare's state
• Message logs on the senders can then be garbage collected
• Deciding when to checkpoint is an interesting problem
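
A rough sketch of the checkpoint step; the packing format and the hooks sendToBuddy and notifySendersToGC are assumptions for illustration:

    #include <cstdint>
    #include <vector>

    using Buffer = std::vector<uint8_t>;

    // Each chare's state includes its message log.
    struct ChareState { Buffer data; Buffer messageLog; };

    // Pack the state of all chares into a single checkpoint buffer.
    static Buffer packCheckpoint(const std::vector<ChareState>& chares) {
        Buffer ckpt;
        for (const auto& c : chares) {
            ckpt.insert(ckpt.end(), c.data.begin(), c.data.end());
            ckpt.insert(ckpt.end(), c.messageLog.begin(), c.messageLog.end());
        }
        return ckpt;
    }

    static void sendToBuddy(const Buffer&) { /* network send in reality */ }
    static void notifySendersToGC()        { /* control message in reality */ }

    static void checkpoint(const std::vector<ChareState>& chares) {
        sendToBuddy(packCheckpoint(chares));
        // Messages processed before this checkpoint never need replaying,
        // so their senders can garbage-collect those log entries.
        notifySendersToGC();
    }

    int main() { checkpoint({}); }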


Reliability

• Only one scenario makes our protocol fail:
  • Processor X (the buddy of Y) crashes and restarts, so the checkpoint of Y is lost
  • Y then crashes before saving a new checkpoint
• This is the price of not assuming reliable nodes for storing checkpoints
• Even so, the protocol increases reliability by orders of magnitude
• The probability can be minimized by having Y checkpoint as soon as X crashes and restarts


Basic Restart Protocol

• After a crash, a Charm++ process is restarted on a new processor
• It gets the checkpoint and the local message log from the buddy
• Chares are restored, and the other processors are informed
• Logged messages for chares on the restarted processor are resent, along with the highest TN seen from each crashed chare
• Messages are reprocessed by the restored chares; local messages first check the restored local message log (outlined below)
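
The sequence, as a hypothetical outline in C++; every function here is an illustrative stand-in, not the actual runtime API:

    #include <iostream>

    static void fetchFromBuddy()         { std::cout << "got checkpoint + local ticket log\n"; }
    static void restoreChares()          { std::cout << "chares rebuilt from checkpoint\n"; }
    static void broadcastRestartNotice() { std::cout << "peers told where the chares now live\n"; }
    static void reprocessInTicketOrder() { std::cout << "replayed messages processed in TN order\n"; }

    static void restartAfterCrash() {
        fetchFromBuddy();            // checkpoint and stored local tickets
        restoreChares();             // restore chare state
        broadcastRestartNotice();    // peers resend logged messages, plus the
                                     // highest TN they saw from each crashed chare
        reprocessInTicketOrder();    // local messages consult the restored
                                     // ticket log first, so old TNs are reused
    }

    int main() { restartAfterCrash(); }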


Parallel Restart

• Message logging allows fault-free processors to continue their execution
• However, sooner or later some processors start waiting for the crashed processor
• Virtualization lets us move work from the restarted processor to the waiting processors
• Chares are restarted in parallel, so the restart cost can be reduced (see the sketch below)
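
A minimal sketch of the placement idea, assuming a simple round-robin policy over the idle processors; the actual policy is not specified here:

    #include <cstddef>
    #include <vector>

    // Instead of rebuilding every chare on the one replacement processor,
    // spread them over processors already idle waiting for the crashed one.
    static std::vector<int> placeRestartedChares(std::size_t numChares,
                                                 const std::vector<int>& waitingPEs) {
        std::vector<int> placement(numChares);
        for (std::size_t c = 0; c < numChares; ++c)
            placement[c] = waitingPEs[c % waitingPEs.size()];  // round-robin
        return placement;
    }

    int main() {
        // e.g., 8 recovered chares spread over the replacement PE (3)
        // and two idle PEs (5, 6) instead of all landing on PE 3.
        auto placement = placeRestartedChares(8, {3, 5, 6});
        (void)placement;
    }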


Present Status

• Most of Charm++ has been ported
  • Support for migration has not yet been implemented in the fault tolerant protocol
• Simple AMPI programs work
  • Barriers remain to be done
• Parallel restart is not yet implemented


Experimental Evaluation

• NAS benchmarks could not be used
• Used a 5-point stencil computation with a 1-D decomposition
• Hardware: a cluster of 8 quad 500 MHz Pentium III nodes with 500 MB of RAM each, connected by Ethernet


Overhead

Measurement of overhead for an application with a low communication-to-computation ratio:

[Chart: normalized performance (%) vs. number of processors (0 to 35) for normal Charm++, FT without checkpoint, and the full FT protocol]


Measurement of overhead for an application with a high communication-to-computation ratio:

[Chart: normalized performance (%) vs. number of processors (0 to 35) for normal Charm++, FT without checkpoint, and the full FT protocol]


Recovery Performance

Execution time with an increasing number of faults on 8 processors (checkpoint period 30 s):

[Chart: execution time (s, 0 to 800) vs. number of faults (0 to 7)]


Summary

• Designed a fault tolerant protocol that:
  • Performs fast checkpoints
  • Performs fast parallel restarts
  • Doesn't depend on any completely reliable node
  • Supports multiple faults
  • Minimizes the effect of a crash on fault-free processors
• The protocol is partially implemented


Future Work

• Include support for migration in the protocol
• Parallel restart
• Extend to AMPI
• Test with the NAS benchmarks
• Study the tradeoffs involved in choosing the checkpoint period