Dealing with the scale problem


Page 1: Dealing with the scale problem

Dealing with the scale problem

Innovative Computing Laboratory

MPI Team

Page 2: Dealing with the scale problem
Page 3: Dealing with the scale problem

Runtime Scalability
Collective Communications
Fault Tolerance

Page 4: Dealing with the scale problem

Runtime Scalability

Binomial Graph (BMG)

[Figure: a ring of numbered nodes showing the links of a single node]

• Undirected graph G := (V, E), |V| = n (any size)
• Node i ∈ {0, 1, 2, …, n-1} has links to a set of nodes U
• U = { i±1, i±2, i±4, …, i±2^k | 2^k ≤ n } in a circular space
• i.e., U = { (i+1) mod n, (i+2) mod n, …, (i+2^k) mod n | 2^k ≤ n } ∪ { (n+i-1) mod n, (n+i-2) mod n, …, (n+i-2^k) mod n | 2^k ≤ n }
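As a concrete illustration (not taken from the slides), the short plain-C sketch below enumerates the neighbor set U of a node under the definition above. The helper name and the n ≤ 1024 bound are arbitrary choices for the example; distances that wrap onto an already listed neighbor are skipped so that U stays a set.

```c
/* Minimal sketch, assuming the definition above: enumerate the neighbor set U
 * of node i in an n-node binomial graph (links at distances 1, 2, 4, ..., 2^k).
 * The helper name and the n <= 1024 bound are arbitrary for this example.     */
#include <stdio.h>

static void bmg_neighbors(int i, int n)
{
    int seen[1024] = {0};                    /* marks neighbors already listed */
    printf("neighbors of node %d (n = %d):", i, n);
    for (int d = 1; d < n; d *= 2) {         /* d = 1, 2, 4, ...; d = n would be a self-loop */
        int cw  = (i + d) % n;               /* clockwise link: (i + 2^k) mod n            */
        int ccw = (n + i - d) % n;           /* counter-clockwise link: (n + i - 2^k) mod n */
        if (!seen[cw])  { seen[cw]  = 1; printf(" %d", cw);  }
        if (!seen[ccw]) { seen[ccw] = 1; printf(" %d", ccw); }
    }
    printf("\n");
}

int main(void)
{
    bmg_neighbors(0, 12);   /* e.g. node 0 in a 12-node BMG: 1 11 2 10 4 8 */
    return 0;
}
```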

[Figure: the merged binomial graph over the same ring of nodes]

Merging the links of all nodes creates a binomial graph rooted at each node of the graph. Broadcast from any node therefore completes in log2(n) steps.
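A small simulation, not part of the original slides, makes the log2(n) claim concrete: starting from an arbitrary root, every node that already holds the message forwards it along its +2^s link at step s, and all n nodes are covered in ⌈log2(n)⌉ steps. The node count and root below are arbitrary.

```c
/* Plain-C simulation of the forwarding pattern described above (no MPI).
 * Nodes that hold the message forward it along their +2^s link at step s. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    const int n = 12, root = 5;        /* arbitrary example size and root */
    int has[64] = {0}, next[64];       /* assumes n <= 64 for the demo    */
    has[root] = 1;

    int steps = 0, covered = 1;
    for (int s = 1; covered < n; s *= 2, steps++) {
        memcpy(next, has, sizeof(has));
        for (int i = 0; i < n; i++)
            if (has[i] && !next[(i + s) % n]) {   /* forward along the +2^s link */
                next[(i + s) % n] = 1;
                covered++;
            }
        memcpy(has, next, sizeof(has));
    }
    printf("broadcast from node %d reached all %d nodes in %d steps\n",
           root, n, steps);            /* prints 4 steps = ceil(log2(12)) */
    return 0;
}
```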

Page 5: Dealing with the scale problem

Runtime Scalability

BMG properties
• Degree = number of neighbors = number of connections = resource consumption
• Node-connectivity (κ), link-connectivity (λ): optimally connected, κ = λ = δmin
• Diameter: O(⌈log2(n)/2⌉)
• Average distance ≈ log2(n)/3

Page 6: Dealing with the scale problem

Routing Cost
• Because of the counter-clockwise links, the routing problem is NP-complete
• Good approximations exist, with a maximum overhead of 1 hop
  o Broadcast: optimal in number of steps, log2(n) (binomial tree from each node)
  o Allgather: Bruck algorithm, log2(n) steps
  o At step s:
    o node i sends data to node i-2^s
    o node i receives data from node i+2^s
  (the communication pattern is sketched below)
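The following hedged sketch, assuming a power-of-two number of processes and one double per process, spells out that step pattern in MPI: at step s each rank sends its accumulated blocks to rank - 2^s and receives the same amount from rank + 2^s, then finishes with the local rotation the Bruck algorithm requires. It is an illustration of the pattern, not Open MPI's tuned implementation.

```c
/* Sketch of the Bruck-style allgather step pattern described above.
 * Assumes a power-of-two number of ranks and one double per rank.  */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, n;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &n);

    if (n & (n - 1)) {                        /* sketch only handles power-of-two sizes */
        if (rank == 0) fprintf(stderr, "run with a power-of-two number of ranks\n");
        MPI_Finalize();
        return 1;
    }

    double *buf = malloc(n * sizeof(double)); /* accumulated blocks, own block first */
    double *out = malloc(n * sizeof(double));
    buf[0] = (double)rank;

    int have = 1;                             /* blocks held so far */
    for (int s = 1; s < n; s *= 2) {          /* s = 2^step */
        int dst = (rank - s + n) % n;         /* send to rank - 2^s      */
        int src = (rank + s) % n;             /* receive from rank + 2^s */
        MPI_Sendrecv(buf, have, MPI_DOUBLE, dst, 0,
                     buf + have, have, MPI_DOUBLE, src, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        have += s;                            /* data held doubles every step */
    }

    /* buf now holds blocks rank, rank+1, ..., rank+n-1 (mod n); rotate locally
     * so block j lands at position j, as a real allgather would return it.   */
    for (int j = 0; j < n; j++)
        out[(rank + j) % n] = buf[j];

    printf("rank %d: out[0]=%g ... out[%d]=%g\n", rank, out[0], n - 1, out[n - 1]);
    free(buf); free(out);
    MPI_Finalize();
    return 0;
}
```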

Runtime Scalability

Page 7: Dealing with the scale problem

Dynamic Environments


Self-Healing BMG

[Chart: number of new connections as a function of the number of nodes and the number of added nodes]

Runtime Scalability

Page 8: Dealing with the scale problem

Runtime Scalability
Collective Communications
Fault Tolerance

Page 9: Dealing with the scale problem

Collective Communications

Optimization process

• Run-time selection process

• We use performance models, graphical encoding, and statistical learning techniques to build platform-specific, efficient and fast runtime decision functions
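As a purely hypothetical illustration of what such a generated decision function can look like (the thresholds, algorithm names and function name below are invented, not Open MPI's tuned values), the outcome of the offline modeling and learning step can be flattened into a few threshold tests that are cheap to evaluate at run time:

```c
/* Hypothetical generated decision function: thresholds and algorithm names
 * are invented for illustration and are not Open MPI's actual tuned values. */
#include <stddef.h>
#include <stdio.h>

enum bcast_alg { BCAST_BINOMIAL, BCAST_SPLIT_BINARY, BCAST_PIPELINE };

static enum bcast_alg choose_bcast(int comm_size, size_t msg_size)
{
    if (comm_size <= 8 || msg_size < 8192)
        return BCAST_BINOMIAL;        /* latency-bound: small groups or short messages */
    if (msg_size < 262144)
        return BCAST_SPLIT_BINARY;    /* medium messages */
    return BCAST_PIPELINE;            /* bandwidth-bound: long messages */
}

int main(void)
{
    printf("64 ranks, 1 MB broadcast -> algorithm %d\n",
           choose_bcast(64, (size_t)1 << 20));
    return 0;
}
```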

Page 10: Dealing with the scale problem

Model prediction

Collective Communications

Page 11: Dealing with the scale problem

Model prediction

Collective Communications

Page 12: Dealing with the scale problem

Tuning broadcast

64 Opterons with 1 Gb/s TCP

Collective Communications

Page 13: Dealing with the scale problem

Collective Communications

Tuning broadcast

Page 14: Dealing with the scale problem

Application Tuning
• Parallel Ocean Program (POP) on a Cray XT4
• Dominated by MPI_Allreduce of 3 doubles
• By default, Open MPI selects recursive doubling
• Similar to Cray MPI (based on MPICH)
• Cray MPI has better latency
• i.e., POP using Open MPI is 10% slower on 256 processes
• Profile the system for this specific collective and determine that “reduce + bcast” is faster (a minimal sketch of the two forms follows after this list)
• Replace the decision function
• The new POP performance is about 5% faster than with Cray MPI
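For concreteness, here is a minimal sketch of the two equivalent ways to compute that 3-double reduction: a single MPI_Allreduce versus the reduce + bcast combination that profiling showed to be faster for POP on this platform. The function names, the MPI_SUM operation and root 0 are assumptions made for the example.

```c
/* Sketch of the substitution described above: the same 3-double global sum
 * written as one MPI_Allreduce or as reduce + bcast. Function names, the
 * MPI_SUM operation and root 0 are assumptions made for the example.       */
#include <mpi.h>

static void allreduce_3doubles(double local[3], double global[3], MPI_Comm comm)
{
    MPI_Allreduce(local, global, 3, MPI_DOUBLE, MPI_SUM, comm);
}

static void reduce_then_bcast_3doubles(double local[3], double global[3], MPI_Comm comm)
{
    /* Reduce onto rank 0, then broadcast the result back to everyone. */
    MPI_Reduce(local, global, 3, MPI_DOUBLE, MPI_SUM, 0, comm);
    MPI_Bcast(global, 3, MPI_DOUBLE, 0, comm);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    double local[3] = {1.0, 2.0, 3.0}, global[3];
    allreduce_3doubles(local, global, MPI_COMM_WORLD);
    reduce_then_bcast_3doubles(local, global, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```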

Collective Communications

Page 15: Dealing with the scale problem

Runtime Scalability
Collective Communications
Fault Tolerance
• Fault Tolerant Linear Algebra
• Uncoordinated checkpoint
• Sequential debugging of parallel applications

Page 16: Dealing with the scale problem

Diskless Checkpointing

4 available processors: P1, P2, P3, P4

Fault Tolerance

Page 17: Dealing with the scale problem

Diskless Checkpointing

4 available processors: P1, P2, P3, P4

Add a fifth processor (P5) and perform a checkpoint (Allreduce): P1 + P2 + P3 + P4 = P5

Fault Tolerance

Page 18: Dealing with the scale problem

Diskless Checkpointing

4 available processors: P1, P2, P3, P4

Add a fifth processor (P5) and perform a checkpoint (Allreduce): P1 + P2 + P3 + P4 = P5

Ready to continue: P1, P2, P3, P4, with the checkpoint held on P5

Fault Tolerance

Page 19: Dealing with the scale problem

Diskless Checkpointing

4 available processors: P1, P2, P3, P4

Add a fifth processor (P5) and perform a checkpoint (Allreduce): P1 + P2 + P3 + P4 = P5

Ready to continue: P1, P2, P3, P4, with the checkpoint held on P5

....

Failure: one of the processors is lost

Fault Tolerance

Page 20: Dealing with the scale problem

Fault Tolerance

Diskless Checkpointing

4 available processors: P1, P2, P3, P4

Add a fifth processor and perform a checkpoint (Allreduce): P1 + P2 + P3 + P4 = Pc

Ready to continue: P1, P2, P3, P4, with the checkpoint Pc

....

Failure: P2 is lost

Ready for recovery: P1, P3, P4 and the checkpoint Pc survive

Recover the processor/data: P2 = Pc - P1 - P3 - P4

Page 21: Dealing with the scale problem

Diskless Checkpointing
• How to checkpoint?
• Either floating-point arithmetic or binary arithmetic will work
• If checkpoints are performed in floating-point arithmetic, we can exploit the linearity of the mathematical relations on the object to maintain the checksums (a minimal sketch follows after this list)
• How to support multiple failures?
• Reed-Solomon algorithm
• Supporting p failures requires p additional processors (resources)
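The sketch below is only an illustration of the checksum arithmetic, not the actual fault-tolerant implementation (real recovery needs a fault-tolerant runtime to replace the failed process). It performs the floating-point checkpoint as an MPI_Allreduce sum and recovers one rank's vector by subtraction. The vector length, the data values and the choice of rank 1 as the victim are assumptions; run it with at least 3 ranks.

```c
/* Checksum arithmetic of diskless checkpointing, as a demo only. */
#include <mpi.h>
#include <stdio.h>

#define N 4                      /* length of each process's data vector */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* The last rank plays the extra checkpoint processor and contributes
     * zeros; every other rank owns some live data (made-up values).      */
    double data[N], checksum[N];
    for (int j = 0; j < N; j++)
        data[j] = (rank == size - 1) ? 0.0 : rank + 0.1 * j;

    /* Checkpoint: the floating-point sum of all vectors (the Allreduce on
     * the slides); afterwards every rank holds the checksum.             */
    MPI_Allreduce(data, checksum, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* Simulated recovery of rank 1's vector: sum the survivors' data
     * (rank 1 contributes zeros) and subtract it from the checksum.  */
    double contrib[N], partial[N], recovered[N];
    for (int j = 0; j < N; j++)
        contrib[j] = (rank == 1) ? 0.0 : data[j];
    MPI_Allreduce(contrib, partial, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    for (int j = 0; j < N; j++)
        recovered[j] = checksum[j] - partial[j];

    if (rank == 0)
        printf("recovered rank 1 data: %.1f %.1f %.1f %.1f (expected 1.0 1.1 1.2 1.3)\n",
               recovered[0], recovered[1], recovered[2], recovered[3]);

    MPI_Finalize();
    return 0;
}
```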

Fault Tolerance

Page 22: Dealing with the scale problem

PCG
• Fault-tolerant CG
• 64x2 AMD64 connected using GigE
• For checkpointing, we generate one checkpoint every 2000 iterations

[Chart: performance of PCG with different MPI libraries]

Fault Tolerance

Page 23: Dealing with the scale problem

PCG

[Chart: checkpoint overhead in seconds]

Fault Tolerance

Page 24: Dealing with the scale problem

Fault Tolerance: Uncoordinated checkpoint

Fault Tolerance

Page 25: Dealing with the scale problem

Detailing event types to avoid intrusiveness

• Promiscuous receptions (i.e., MPI_ANY_SOURCE)
• Non-blocking delivery tests (i.e., Waitsome, Test, Testany, Iprobe, …) (a short example follows below)
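To make the two event classes concrete, here is a small hedged example (the tag, payload and ranks are arbitrary) in which both an MPI_Iprobe and an MPI_ANY_SOURCE receive depend on message arrival order; this is exactly the kind of outcome a pessimistic message-logging protocol has to record.

```c
/* Both non-deterministic event types listed above: a non-blocking delivery
 * test (MPI_Iprobe) and a promiscuous (MPI_ANY_SOURCE) receive. Which sender
 * is matched first depends on arrival order and can change from run to run. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 0; }   /* needs at least one sender */

    if (rank == 0) {
        int flag, value;
        MPI_Status st;
        /* Non-deterministic: whether a message has already arrived, and from whom. */
        MPI_Iprobe(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &flag, &st);

        /* Non-deterministic: which sender gets matched first. */
        MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
        printf("first message came from rank %d\n", st.MPI_SOURCE);

        for (int i = 2; i < size; i++)            /* drain the remaining messages */
            MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
    } else {
        int value = rank;
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```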


Fault Tolerance

Page 26: Dealing with the scale problem

Interposition in Open MPI

• We want to avoid tainting the base code with #ifdef FT_BLALA
• The Vampire PML loads a new class of MCA components
• Vprotocols provide the entire FT protocol (only pessimistic for now)
• You can yourself use the ability to define subframeworks in your components! :)
• Keeps using the optimized low-level and zero-copy devices (BTL) for communication
• Unchanged message scheduling logic

Fault Tolerance

Page 27: Dealing with the scale problem

Performance Overhead

• Myrinet 2G (MX 1.1.5), dual Opteron 146, 2 GB RAM, Linux 2.6.18, gcc/gfortran 4.1.2, NPB 3.2, NetPIPE
• Only two application kernels show non-deterministic events (MG, LU)

[Chart: Performance Overhead of Improved Pessimistic Message Logging, NetPIPE on Myrinet 2000; performance relative to plain Open MPI (75% to 105%) versus message size in bytes, for Open MPI-V without and with sender-based logging]

Fault Tolerance

Page 28: Dealing with the scale problem

Fault Tolerance: Sequential debugging of parallel applications

Fault Tolerance

Page 29: Dealing with the scale problem

Debugging Applications
• The usual scenario involves:
• The programmer designs a testing suite
• The testing suite shows a bug
• The programmer runs the application in a debugger (such as gdb) to understand and fix the bug
• The programmer runs the testing suite again

• Cyclic debugging / interactive debugging

[Diagram: the development cycle: design and code, testing, interactive debugging, release]

Fault Tolerance

Page 30: Dealing with the scale problem

Fault Tolerance

Interposition in Open MPI
• Events are stored (asynchronously on disk) during the initial run
• Keeps using the optimized low-level and zero-copy devices (BTL) for communication
• Unchanged message scheduling logic
• We expect a low impact on application behavior


Page 31: Dealing with the scale problem

Fault Tolerance

Performance Overhead
• Myrinet 2G (MX 1.0.3), dual Opteron 146, 2 GB RAM, Linux 2.6.18, gcc/gfortran 4.1.2, NPB 3.2, NetPIPE
• 2% overhead on bare latency, no overhead on bandwidth
• Only two application kernels show non-deterministic events
• Receiver-based logging has more overhead; moreover, it incurs a large amount of logged data (350 MB on a simple ping-pong), so it is beneficial that it is only enabled on demand


Page 32: Dealing with the scale problem

Log Size on NAS Parallel Benchmarks
o Among the 7 NAS kernels, only 2 generate non-deterministic events
o Log size does not correlate with the number of processes
o The more scalable the application, the more scalable the log mechanism
o Only 287 KB of log per process for LU.C.1024 (200 MB memory footprint per process)


[Chart: log size per process on NAS Parallel Benchmarks (kB)]

Fault Tolerance