Transcript of: Increasing Fault Resiliency in a Message-Passing Environment (32 slides)

Page 1: Increasing Fault Resiliency in a Message-Passing Environment

Rolf Riesen, Kurt Ferreira, Ron Oldfield, Jon Stearley, James Laros, Kevin Pedretti, Ron Brightwell
[email protected]
Sandia National Laboratories

Todd Kordenbrock
Hewlett-Packard Company

October 14, 2009

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

Page 2: Motivation

Outline: Motivation, Design, Evaluation, Analysis, Implications, Summary

Page 3: Motivation

■ Checkpoint/restart is the common way to deal with faults
■ What can we do to increase the checkpoint interval?
■ Explore redundant computation for MPI applications
  ◆ Can it be done at user level?
  ◆ What are the requirements for the RAS system and runtime?
  ◆ What is the overhead?
  ◆ Cost versus benefit?
■ Write the rMPI library at the MPI profiling layer to learn what the issues are

Page 4: Design

Page 5: Redundant Computation

■ Active and redundant nodes do the same computation
■ One node continues when the other fails
■ MPI_Comm_rank() returns the same value on both nodes (see the sketch below)
■ Not every node needs to have a redundant partner
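As a minimal sketch of how a profiling-layer library could present the same logical rank to an active node and its redundant partner, the wrappers below assume the Forward layout ABCD|A'B'C'D' used later in the plots, i.e., physical rank r is paired with physical rank r + N. The mapping and the wrapper bodies are illustrative assumptions, not the actual rMPI code.

```c
/* Illustrative sketch only: map 2N physical ranks onto N logical ranks so
 * that an active node and its redundant partner (r and r + N in the assumed
 * Forward layout) both report logical rank r to the application. */
#include <mpi.h>

int MPI_Comm_rank(MPI_Comm comm, int *rank)
{
    int phys_rank, phys_size;

    PMPI_Comm_rank(comm, &phys_rank);    /* real rank from the transport layer */
    PMPI_Comm_size(comm, &phys_size);
    *rank = phys_rank % (phys_size / 2); /* r and r + N map to the same value  */
    return MPI_SUCCESS;
}

int MPI_Comm_size(MPI_Comm comm, int *size)
{
    int phys_size;

    PMPI_Comm_size(comm, &phys_size);
    *size = phys_size / 2;               /* the application sees N ranks */
    return MPI_SUCCESS;
}
```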

Page 6: Basic Operation

■ Each message gets sent twice (four times, if we count the redundant nodes)
■ Message and redundant copy are received into the same buffer
■ Redundant messages have an unused tag bit set (see the sketch below)
■ A protocol is needed to coordinate receives and other MPI operations
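A minimal sketch of the send side under the same assumed Forward layout: every send goes to the active destination and, marked with an otherwise unused tag bit, to that destination's redundant partner. REDUNDANT_TAG_BIT and the r + N partner mapping are illustrative assumptions, not the rMPI protocol itself.

```c
/* Illustrative sketch only: duplicate each outgoing message, marking the
 * copy for the redundant partner with a spare tag bit.  The bit must stay
 * below MPI_TAG_UB and be unused by the application. */
#include <mpi.h>

#define REDUNDANT_TAG_BIT (1 << 14)   /* assumed free; MPI guarantees tags up to 32767 */

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int phys_size, n;

    PMPI_Comm_size(comm, &phys_size);
    n = phys_size / 2;                /* number of logical ranks */

    /* Copy for the active destination. */
    PMPI_Send(buf, count, type, dest, tag, comm);
    /* Copy for the destination's redundant partner, marked with the tag bit. */
    PMPI_Send(buf, count, type, dest + n, tag | REDUNDANT_TAG_BIT, comm);
    return MPI_SUCCESS;
}
```

Because this wrapper runs on the active sender and on its redundant partner, each logical message results in four physical copies, matching the count on the slide.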

Page 7: Message Order

■ Active and redundant nodes must receive messages in the same order
■ MPI_ANY_SOURCE and MPI_ANY_TAG are problematic
■ The redundant node maintains a queue of posted receives if MPI_ANY_SOURCE is used
■ It coordinates with the active node to post a specific receive (see the sketch below)
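A minimal sketch of that coordination, under the assumption that the active node completes the wildcard receive first and then tells its partner which source actually matched. SOURCE_INFO_TAG, is_active_node(), and partner_of() are hypothetical helpers, and the redundant-tag-bit handling from the previous slide is omitted for brevity.

```c
/* Illustrative sketch only: resolve MPI_ANY_SOURCE on the active node and
 * forward the matched source so the redundant node can post a specific
 * receive, keeping message order identical on both nodes. */
#include <mpi.h>

#define SOURCE_INFO_TAG 99                        /* assumed control tag */

extern int is_active_node(MPI_Comm comm);         /* hypothetical helper */
extern int partner_of(int rank, MPI_Comm comm);   /* hypothetical helper */

int MPI_Recv(void *buf, int count, MPI_Datatype type, int source,
             int tag, MPI_Comm comm, MPI_Status *status)
{
    int my_rank;

    PMPI_Comm_rank(comm, &my_rank);

    if (source != MPI_ANY_SOURCE)                 /* specific receives need no help */
        return PMPI_Recv(buf, count, type, source, tag, comm, status);

    if (is_active_node(comm)) {
        MPI_Status st;

        /* Active node: complete the wildcard receive ... */
        PMPI_Recv(buf, count, type, MPI_ANY_SOURCE, tag, comm, &st);
        /* ... then tell the redundant partner which source matched. */
        PMPI_Send(&st.MPI_SOURCE, 1, MPI_INT,
                  partner_of(my_rank, comm), SOURCE_INFO_TAG, comm);
        if (status != MPI_STATUS_IGNORE)
            *status = st;
    } else {
        int matched_source;

        /* Redundant node: learn the source, then post the specific receive. */
        PMPI_Recv(&matched_source, 1, MPI_INT,
                  partner_of(my_rank, comm), SOURCE_INFO_TAG, comm,
                  MPI_STATUS_IGNORE);
        PMPI_Recv(buf, count, type, matched_source, tag, comm, status);
    }
    return MPI_SUCCESS;
}
```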

Page 8: Other Functions

■ Redundant node must return the same information for probe, test, and time functions
■ The active node does the operation and sends the result to the redundant node (see the MPI_Wtime sketch below)
■ Collectives use redundant point-to-point
■ rMPI re-implements almost all of MPI
  ◆ rMPI uses MPICH (mostly) as a transport layer
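For example, a minimal sketch of forwarding a result so both nodes observe the same value, shown here for MPI_Wtime(); probe and test outcomes could be forwarded the same way. TIME_INFO_TAG, rmpi_comm, is_active_node(), and partner_of() are hypothetical names, not part of rMPI.

```c
/* Illustrative sketch only: the active node reads the real clock and sends
 * the value to its redundant partner, so MPI_Wtime() agrees on both nodes. */
#include <mpi.h>

#define TIME_INFO_TAG 98                          /* assumed control tag */

extern MPI_Comm rmpi_comm;                        /* assumed duplicated communicator */
extern int is_active_node(MPI_Comm comm);         /* hypothetical helper */
extern int partner_of(int rank, MPI_Comm comm);   /* hypothetical helper */

double MPI_Wtime(void)
{
    int my_rank;
    double t;

    PMPI_Comm_rank(rmpi_comm, &my_rank);
    if (is_active_node(rmpi_comm)) {
        t = PMPI_Wtime();                         /* read the real clock ...      */
        PMPI_Send(&t, 1, MPI_DOUBLE,
                  partner_of(my_rank, rmpi_comm), TIME_INFO_TAG, rmpi_comm);
    } else {
        PMPI_Recv(&t, 1, MPI_DOUBLE,              /* ... and reuse it on the twin */
                  partner_of(my_rank, rmpi_comm), TIME_INFO_TAG, rmpi_comm,
                  MPI_STATUS_IGNORE);
    }
    return t;
}
```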

Page 9: Status

■ Can't handle MPI_ANY_TAG and MPI_ANY_SOURCE simultaneously
■ Most major functions of MPI-2 are implemented
■ Considering transactions to limit data volume
■ Plan to open-source it
■ Few RAS features are needed:
  ◆ Notification of node availability
  ◆ No Byzantine behavior (error-correction protocol on the network)
  ◆ Messages to/from dead nodes must be consumed (no deadlock or livelock)

Page 10: Evaluation (Bandwidth, Latency, Allreduce, CTH, SAGE, LAMMPS, HPCCG)

Page 11: Bandwidth

[Figure: Bandwidth (0 B/s to 2.0 GB/s) and difference to native (0% to 100%) vs. message size (1 B to 10 MB). Legend: Native = benchmark without rMPI; Base = rMPI, no redundancy; Forward = ABCD|A'B'C'D'.]

Page 12: Latency

[Figure: Latency (4 µs to 20 µs) and difference to native (0% to 100%) vs. message size (1 B to 10 kB). Legend: Native = benchmark without rMPI; Base = rMPI, no redundancy; Forward = ABCD|A'B'C'D'.]

Page 13: Latency with MPI_ANY_SOURCE

[Figure: Latency (0 to 30 µs) and difference to native (0% to 160%) vs. message size (1 B to 10 kB). Legend: Native = benchmark without rMPI; Base = rMPI, no redundancy; Forward = ABCD|A'B'C'D'.]

Page 14: Allreduce

[Figure: Allreduce performance on 32 nodes; time per operation (0 to 6 ms) and difference to native (0% to 100%) vs. message size (1 B to 1 MB). Legend: Native = benchmark without rMPI; Base = rMPI, no redundancy; Forward = ABCD|A'B'C'D'; Reverse = ABCD|D'C'B'A'; Shuffle = e.g., ABCD|C'B'D'A'.]

Page 15: CTH

[Figure: Execution time (240 s to 440 s) and difference to native (0% to 100%) vs. number of nodes (4 to 2,048). Legend: Native = benchmark without rMPI; Base = rMPI, no redundancy; Forward = ABCD|A'B'C'D'; Reverse = ABCD|D'C'B'A'; Shuffle = e.g., ABCD|C'B'D'A'.]

Page 16: SAGE

[Figure: Execution time (300 s to 1,100 s) and difference to native (0% to 100%) vs. number of nodes (4 to 2,048). Legend as on the CTH slide.]

Page 17: LAMMPS

[Figure: Execution time (315 s to 350 s) and difference to native (0% to 100%) vs. number of nodes (4 to 2,048). Legend as on the CTH slide.]

Page 18: HPCCG

[Figure: Execution time (0 to 35 s) and difference to native (0% to 100%) vs. number of nodes (4 to 1,024). Legend as on the CTH slide.]

Page 19: Analysis (rMPI Model, Interrupts, Lifetime, Simulation, Behavior, Validation)

Page 20: rMPI Model

■ Acts as a filter (see the sketch below)
  ◆ Input is a series of faults
  ◆ Output is the application interrupts that cause a restart
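A minimal sketch of that filter under simplifying assumptions: with full redundancy in the Forward layout, a fault turns into an application interrupt only if it hits a node whose partner has already failed, and a restart is assumed to repair all nodes. The fault stream and bookkeeping are illustrative, not the actual model.

```c
/* Illustrative sketch only: feed random node faults through the redundancy
 * "filter" and count how many become application interrupts. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000                        /* logical ranks; 2*N physical nodes */

int main(void)
{
    static bool failed[2 * N];        /* which physical nodes are down     */
    int faults = 100000, interrupts = 0;

    srand(1);
    for (int i = 0; i < faults; i++) {
        int node = rand() % (2 * N);          /* fault hits a random node  */
        int partner = (node + N) % (2 * N);   /* Forward layout: r <-> r+N */

        failed[node] = true;
        if (failed[partner]) {                /* both replicas gone        */
            interrupts++;
            for (int j = 0; j < 2 * N; j++)   /* restart repairs all nodes */
                failed[j] = false;
        }
    }
    printf("%d faults -> %d interrupts (%.4f interrupts/fault)\n",
           faults, interrupts, (double)interrupts / faults);
    return 0;
}
```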

Page 21: Application Interrupts

[Figure: Probability of an application interrupt (interrupts/faults, 0.001 to 1.0) vs. number of nodes (100 to 100,000), for no redundancy and for N redundant nodes.]

Page 22: Application Lifetime

■ Application needs to finish a set amount of work
■ Do checkpoints and restart when necessary

Page 23: Application Simulation

■ Combine the rMPI model with a state machine (see the sketch below)
■ Finish a set amount of work
■ Do checkpoints and restart when necessary
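A minimal sketch of such a state machine, assuming exponentially distributed application interrupts (the deck also lists gamma and Weibull), ignoring failures during checkpoints and restarts, and splitting elapsed time into the same four buckets used on the following slides (work, checkpoint, rework, restart). The interrupt MTBF and checkpoint interval are arbitrary illustration values; the 168 h work, 5 min checkpoint, and 10 min restart follow the example in the deck.

```c
/* Illustrative sketch only: simulate an application that must finish a fixed
 * amount of work, checkpointing every tau hours and restarting after each
 * interrupt, and account for work, checkpoint, rework, and restart time. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static double exp_rand(double mean)   /* exponential time to next interrupt */
{
    return -mean * log(1.0 - (double)rand() / ((double)RAND_MAX + 1.0));
}

int main(void)
{
    const double work_total = 168.0;      /* hours of useful work needed             */
    const double tau        = 1.0;        /* checkpoint interval (assumed)           */
    const double ckpt       = 5.0 / 60;   /* checkpoint cost                         */
    const double restart    = 10.0 / 60;  /* restart cost                            */
    const double theta      = 50.0;       /* mean time between interrupts (assumed)  */

    double work = 0, t_work = 0, t_ckpt = 0, t_rework = 0, t_restart = 0;

    srand(1);
    while (work < work_total) {
        double seg = fmin(tau, work_total - work);  /* work until next checkpoint */
        double ttf = exp_rand(theta);               /* time to next interrupt     */

        if (ttf < seg) {                  /* interrupted: progress since the last   */
            t_rework  += ttf;             /* checkpoint is lost and must be redone  */
            t_restart += restart;
        } else {                          /* segment completed                      */
            t_work += seg;
            work   += seg;
            if (work < work_total)        /* checkpoint before continuing           */
                t_ckpt += ckpt;
        }
        /* Failures during checkpoints and restarts are ignored in this sketch. */
    }
    printf("work %.1f h, checkpoint %.1f h, rework %.1f h, restart %.1f h, total %.1f h\n",
           t_work, t_ckpt, t_rework, t_restart,
           t_work + t_ckpt + t_rework + t_restart);
    return 0;
}
```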

Page 24: Application Behavior, no Redundancy

[Figure: Normalized elapsed time (0% to 100%) spent in work, checkpoint, rework, and restart vs. number of nodes (100 to 100,000). 168 h work, 5 min checkpoint, 10 min restart, 5-year node MTBF.]

Page 25: Application Behavior, no Redundancy

[Figure: Total elapsed time in hours (0 to 700) spent in work, checkpoint, rework, and restart vs. number of nodes (100 to 100,000). 168 h work, 5 min checkpoint, 10 min restart, 5-year node MTBF.]

Page 26: Application Behavior with Redundancy

[Figure: Elapsed time in hours (0 to 250) spent in work, checkpoint, rework, and restart vs. number of nodes (100 to 100,000). 168 h work, 5 min checkpoint, 10 min restart, 5-year node MTBF.]

Page 27: Validation

Daly's equation, from "A higher order estimate of the optimum checkpoint interval for restart dumps":

T_w(τ) = Θ e^{R/Θ} (e^{(τ+δ)/Θ} − 1) T_s/τ,   for δ ≪ T_s
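For reference, a direct evaluation of that expression; a sketch for cross-checking the simulator against the model. Units must be consistent (hours here), the mean time to interrupt Θ is assumed to be the node MTBF divided by the node count, and the first-order optimum √(2δΘ) is only used as a convenient starting value for τ; all numbers are illustrative, not the values behind the plot.

```c
/* Illustrative sketch only: evaluate Daly's estimate T_w(tau) for a given
 * checkpoint interval tau, checkpoint time delta, restart time R, solve
 * time Ts, and mean time to interrupt Theta (all in hours). */
#include <math.h>
#include <stdio.h>

static double daly_tw(double Ts, double tau, double delta,
                      double R, double Theta)
{
    return Theta * exp(R / Theta)
                 * (exp((tau + delta) / Theta) - 1.0) * Ts / tau;
}

int main(void)
{
    double Ts    = 168.0;                     /* solve (work) time              */
    double delta = 5.0 / 60;                  /* checkpoint time                */
    double R     = 10.0 / 60;                 /* restart time                   */
    double nodes = 10000.0;                   /* example node count             */
    double Theta = 5.0 * 365 * 24 / nodes;    /* 5-year node MTBF, whole system */
    double tau   = sqrt(2.0 * delta * Theta); /* first-order optimum as a guess */

    printf("Theta = %.2f h, tau = %.2f h, T_w = %.1f h\n",
           Theta, tau, daly_tw(Ts, tau, delta, R, Theta));
    return 0;
}
```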

[Figure: Execution time in hours (600 to 2,000) vs. number of nodes (100 to 50,000), comparing Daly's model with the application simulation.]

Page 28: Implications and Trade-Offs

Page 29: Implications

R = (T_w + T_o(none)) / (T_w + T_o(full) + T_rMPI)    (1)

■ T_w: amount of work
■ T_o(none): checkpoint, restart, and rework overhead without redundant nodes
■ T_o(full): the same overhead with redundant nodes
■ T_rMPI: run-time overhead added by rMPI
■ If R > 2, then using redundant nodes makes sense (see the sketch below)
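A minimal sketch of that check, with placeholder numbers rather than measured values; the R > 2 threshold presumably reflects that full redundancy uses twice the nodes, so the run-time ratio must exceed two before the node-hours balance out.

```c
/* Illustrative sketch only: compute the ratio R from equation (1) and test
 * the R > 2 rule of thumb.  All inputs are placeholders, not measurements. */
#include <stdio.h>

static double redundancy_ratio(double Tw, double To_none,
                               double To_full, double TrMPI)
{
    return (Tw + To_none) / (Tw + To_full + TrMPI);
}

int main(void)
{
    double Tw      = 168.0;   /* hours of work                                     */
    double To_none = 500.0;   /* checkpoint/restart/rework overhead, no redundancy */
    double To_full = 30.0;    /* same overhead with redundant nodes                */
    double TrMPI   = 8.0;     /* run-time overhead added by rMPI                   */

    double R = redundancy_ratio(Tw, To_none, To_full, TrMPI);
    printf("R = %.2f -> redundant nodes %s\n",
           R, R > 2.0 ? "make sense" : "do not pay off");
    return 0;
}
```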

Page 30: Implications

[Figure: Number of application interrupts (1 to 10 M) vs. number of nodes (100 to 100 k) and level of redundancy (none, 1/4, 1/2, 3/4, full). 5,000 h work, 5 min checkpoint, 10 min restart, 1-year node MTBF.]

Page 31: Summary

Page 32: Summary

■ A redundant MPI library can be done at user level
■ The overhead is not significant for most applications
■ The application restart simulator allows modeling of
  ◆ varying node counts
  ◆ node MTBF
  ◆ level of redundancy
  ◆ failure distribution function (exponential, gamma, Weibull)
  ◆ amount of work, checkpoint and restart times
■ The model helps determine when redundancy pays off