Dealing with the scale problem
Transcript of Dealing with the scale problem
Dealing with the scale problem
Innovative Computing Laboratory
MPI Team
Runtime Scalability
Collective Communications
Fault Tolerance
Runtime Scalability
Binomial Graph (BMG)
[Figure: example binomial graph on 11 nodes, labeled 0–10]
Undirected graph G := (V, E), |V| = n (any size). Node i ∈ {0, 1, 2, …, n−1} has links to a set of nodes U = {i±1, i±2, i±4, …, i±2^k | 2^k ≤ n} in a circular space, i.e.
U = { (i+1) mod n, (i+2) mod n, …, (i+2^k) mod n | 2^k ≤ n } ∪ { (n+i−1) mod n, (n+i−2) mod n, …, (n+i−2^k) mod n | 2^k ≤ n }
[Figure: the same graph with the clockwise and counter-clockwise links merged]
Merging all links creates a binomial graph rooted at each node of the graph. Broadcast from any node completes in log2(n) steps.
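The neighbor set above can be sketched in a few lines; a hypothetical helper (for illustration only, not part of any MPI runtime) that enumerates the clockwise and counter-clockwise links of node i:

```python
def bmg_neighbors(i, n):
    """Neighbor set U of node i in an n-node binomial graph (BMG).

    Follows the definition above: links to (i +/- 2^k) mod n for every
    power of two 2^k <= n. A sketch for illustration only.
    """
    neighbors = set()
    d = 1
    while d <= n:
        neighbors.add((i + d) % n)  # clockwise link
        neighbors.add((i - d) % n)  # counter-clockwise link
        d *= 2
    neighbors.discard(i)            # drop self-links (e.g. when d == n)
    return neighbors
```

For n = 10, node 0 links to {1, 2, 4, 6, 8, 9}; because the link distances are powers of two in both directions, any node reaches any other in O(log2 n) hops.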
BMG properties
• Degree δ = number of neighbors = number of connections = resource consumption
• Node-connectivity (κ) and link-connectivity (λ): optimally connected, κ = λ = δmin
• Diameter: O(⌈log2(n)/2⌉)
• Average distance: ≈ log2(n)/3
Routing Cost
• Because of the counter-clockwise links, the routing problem is NP-complete
• Good approximations exist, with a maximum overhead of 1 hop
o Broadcast: optimal in number of steps, log2(n) (a binomial tree from each node)
o Allgather: Bruck algorithm, log2(n) steps
o At step s:
o node i sends data to node i − 2^s
o node i receives data from node i + 2^s
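The step rule above can be simulated directly; a toy model (sets of block ids instead of real buffers, no actual messaging) showing that every node holds all n blocks after ⌈log2(n)⌉ steps:

```python
def bruck_allgather(n):
    """Toy simulation of Bruck's allgather on n nodes.

    Node i starts with only its own block i. At step s (distance 2^s),
    node i sends everything it currently holds to node (i - 2^s) mod n
    and receives from node (i + 2^s) mod n.
    """
    data = [{i} for i in range(n)]  # block sets held by each node
    steps = 0
    dist = 1
    while dist < n:
        received = [data[(i + dist) % n] for i in range(n)]
        data = [data[i] | received[i] for i in range(n)]
        dist *= 2
        steps += 1
    return data, steps
```

With n = 5 the loop runs 3 = ⌈log2 5⌉ steps and every node ends up holding all five blocks.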
Dynamic Environments
Self-Healing BMG
[Figure: example BMG; chart of the number of new connections versus the number of nodes and the number of added nodes]
Collective Communications
Optimization process
• Run-time selection process
• We use performance models, graphical encoding, and statistical learning techniques to build platform-specific, efficient and fast runtime decision functions
Model prediction
Tuning broadcast
64 Opterons with 1Gb/s TCP
Application Tuning
• Parallel Ocean Program (POP) on a Cray XT4
• Dominated by an MPI_Allreduce of 3 doubles
• Default Open MPI selects recursive doubling
• similar to Cray MPI (based on MPICH)
• Cray MPI has better latency
• i.e., POP using Open MPI is 10% slower on 256 processes
• Profile the system for this specific collective and determine that "reduce + bcast" is faster
• Replace the decision function
• New POP performance is about 5% faster than Cray MPI
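The replacement step amounts to swapping one small decision function; a hypothetical sketch (the thresholds and return values are invented for illustration and are not Open MPI's actual tuning tables) of what the profiled decision might look like:

```python
def allreduce_decision(comm_size, msg_bytes):
    """Hypothetical allreduce algorithm selector (illustration only).

    Mirrors the tuning above: for tiny messages (e.g. 3 doubles =
    24 bytes) on large communicators, "reduce + bcast" was measured
    to be faster than the default recursive doubling.
    """
    if msg_bytes <= 64 and comm_size >= 128:
        return "reduce_bcast"      # profiled fast path for POP
    return "recursive_doubling"    # default choice otherwise
```

The point is that the selection logic is isolated in one function, so a platform-specific measurement can replace it without touching the collective implementations themselves.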
Runtime Scalability
Collective Communications
Fault Tolerance
• Fault-Tolerant Linear Algebra
• Uncoordinated checkpoint
• Sequential debugging of parallel applications
Fault Tolerance
Diskless Checkpointing
• 4 available processors: P1, P2, P3, P4
• Add a fifth processor and perform a checkpoint (Allreduce): P1 + P2 + P3 + P4 = Pc
• Ready to continue
• …
• Failure: P2 is lost
• Ready for recovery
• Recover the processor/data: Pc − P1 − P3 − P4 = P2
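The additive checkpoint and recovery above reduce to elementwise sums and subtractions; a minimal sketch with plain Python lists standing in for processor memory (function names are illustrative):

```python
def checkpoint(processors):
    """Pc holds the elementwise sum of all processors' data."""
    return [sum(vals) for vals in zip(*processors)]

def recover(pc, survivors):
    """Rebuild the lost data: P_lost = Pc - sum of the survivors."""
    return [c - sum(vals) for c, vals in zip(pc, zip(*survivors))]
```

If P2 fails, subtracting P1, P3, and P4 from Pc restores P2; for binary checkpoints, + and − are both replaced by bitwise XOR.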
Diskless Checkpointing
• How to checkpoint?
• Either floating-point arithmetic or binary (bitwise XOR) arithmetic will work
• If checkpoints are performed in floating-point arithmetic, we can exploit the linearity of the mathematical relations on the object to maintain the checksums
• How to support multiple failures?
• Reed-Solomon algorithm
• Supporting p failures requires p additional processors (resources)
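The multiple-failure idea can be illustrated over the reals (real Reed-Solomon codes work in a finite field and are better conditioned; this simplified weighted-checksum sketch only shows why p checksums can cover p failures): two checksum processors hold C1 = Σ Pi and C2 = Σ (i+1)·Pi, so two lost values leave a solvable 2x2 linear system.

```python
def weighted_checkpoints(P):
    """Two checksums: C1 = sum(P[i]), C2 = sum((i + 1) * P[i])."""
    c1 = sum(P)
    c2 = sum((i + 1) * x for i, x in enumerate(P))
    return c1, c2

def recover_two(survivors, lost_a, lost_b, c1, c2):
    """Recover P[lost_a] and P[lost_b] from the survivors dict
    {index: value} by solving:
        x + y = r1
        (lost_a + 1) * x + (lost_b + 1) * y = r2
    """
    r1 = c1 - sum(survivors.values())
    r2 = c2 - sum((i + 1) * v for i, v in survivors.items())
    a, b = lost_a + 1, lost_b + 1
    y = (r2 - a * r1) / (b - a)  # nonsingular since a != b
    return r1 - y, y
```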
PCG
• Fault-tolerant CG
• 64x2 AMD64 processors connected using GigE
Performance of PCG with different MPI libraries; for the checkpointing runs, one checkpoint is generated every 2000 iterations.
PCG: checkpoint overhead in seconds
Fault Tolerance
Uncoordinated checkpoint
Detailing event types to avoid intrusiveness
• Promiscuous receptions (i.e., MPI_ANY_SOURCE)
• Non-blocking delivery tests (i.e., MPI_Waitsome, MPI_Test, MPI_Testany, MPI_Iprobe, …)
Interposition in Open MPI
• We want to avoid tainting the base code with #ifdef FT_BLALA
• The Vampire PML loads a new class of MCA components
• Vprotocols provide the entire FT protocol (only pessimistic for now)
• You can use the ability to define subframeworks in your own components! :)
• Keeps using the optimized low-level and zero-copy devices (BTL) for communication
• Unchanged message-scheduling logic
Performance Overhead
• Myrinet 2G (MX 1.1.5), dual Opteron 146, 2GB RAM, Linux 2.6.18, gcc/gfortran 4.1.2, NPB 3.2, NetPIPE
• Only two application kernels show non-deterministic events (MG, LU)
[Figure: performance overhead of improved pessimistic message logging, NetPIPE on Myrinet 2000; % of Open MPI performance vs. message size, for Open MPI-V with and without sender-based logging]
Fault Tolerance
Sequential debugging of parallel applications
Debugging Applications
• The usual scenario involves:
• Programmer designs a testing suite
• The testing suite exposes a bug
• Programmer runs the application in a debugger (such as gdb) until the bug is understood and fixed
• Programmer runs the testing suite again
• Cyclic debugging: design and code → testing → interactive debugging → release
Interposition in Open MPI
• Events are stored (asynchronously, on disk) during the initial run
• Keeps using the optimized low-level and zero-copy devices (BTL) for communication
• Unchanged message-scheduling logic
• We expect low impact on application behavior
Performance Overhead
• Myrinet 2G (MX 1.0.3), dual Opteron 146, 2GB RAM, Linux 2.6.18, gcc/gfortran 4.1.2, NPB 3.2, NetPIPE
• 2% overhead on bare latency, no overhead on bandwidth
• Only two application kernels show non-deterministic events
• Receiver-based logging has more overhead; moreover, it incurs a large amount of logged data (350MB on a simple ping-pong), which is why it is only enabled on demand
Log Size on NAS Parallel Benchmarks
o Among the 7 NAS kernels, only 2 generate non-deterministic events
o Log size does not correlate with the number of processes
o The more scalable the application, the more scalable the log mechanism
o Only 287KB of log per process for LU.C.1024 (200MB memory footprint per process)
[Figure: log size per process on the NAS Parallel Benchmarks (kB)]