A new Approach to MPI Collective Communication Implementations

T. Hoefler¹,², J. Squyres³, G. Fagg⁴, G. Bosilca⁴, W. Rehm², A. Lumsdaine¹

¹Open Systems Lab, Indiana University  ²Computer Architecture Group, Technical University of Chemnitz
³Cisco Systems, San Jose  ⁴Dept. of Computer Science, University of Tennessee

2nd Austrian Grid Symposium - DAPSYS'06, Innsbruck, Austria, 22nd September 2006
Outline
1 Introduction: Known Problems · State of the Art · Open MPI · Design Goals
2 Framework Architecture: Software Architecture · Initialization · Runtime Selection
3 Conclusions
Known Problems
- huge number of different collective algorithms and implementations
- hardware-dependent collective implementations
- no framework that offers run-time selection exists
- selecting the optimal algorithm is not trivial, because:
  - it depends on the MPI parameters (size, comm)
  - the decision lies in the critical path
  - different implementations only work for certain parameters
  - every process has to choose the same algorithm (runtime decision)
Predictive Performance Models
Prediction is Possible
- the LogP family (LogGP) predicts accurately:
  - L: hardware latency
  - o: host overhead (can be divided into o_r and o_s)
  - g: gap between consecutive messages (bandwidth-limiting)
  - G: gap between each byte of a message
  - P: number of processes

Collective Operations
- all collective operations based on point-to-point messages can be predicted with Log(G)P!
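To make the model concrete, here is a small sketch (Python for brevity; not from the slides) of a Log(G)P prediction for a binomial-tree broadcast. All parameter values in the example are illustrative.

```python
from math import ceil, log2

def loggp_bcast_time(P, m, L, o, G):
    """Predicted time of a binomial-tree broadcast of an m-byte message
    among P processes under the LogGP model: each of the ceil(log2(P))
    rounds costs one point-to-point message (send overhead + per-byte
    gap + latency + receive overhead).  The gap g is omitted here since
    no process sends two messages back-to-back within a round."""
    rounds = ceil(log2(P))
    per_message = o + (m - 1) * G + L + o
    return rounds * per_message

# illustrative parameters (microseconds): L=5, o=1, G=0.01
t = loggp_bcast_time(P=64, m=1024, L=5.0, o=1.0, G=0.01)
```

Any collective built from point-to-point messages can be decomposed into such per-round terms in the same way.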
Further Problems
- LogP models vs. hardware-optimized implementations ⇒ a common denominator is needed: seconds, to assess running time
- hardware implementations have to offer predictive models
- bypassing must be possible (for optimized implementations)
State of the Art
- most implementations use suboptimal, hard-coded switching points (MPICH(2), MVAPICH, LAM/MPI, Open MPI)
- the "tuned" Open MPI component experiments with dynamic selection from a fixed set of algorithms (no hardware optimization)
- Open MPI allows coarse-grained third-party coll modules
⇒ no flexible selection framework is available yet
Open MPI
⇒ merged FT-MPI, LA-MPI, LAM/MPI, and PACX-MPI
- implements MPI-2
- supports different networks (TCP, GM, MX, MVAPI, OpenIB, Portals, SM)
- modular framework architecture:
  - some frameworks: PML, BTL, COLL, ...
  - easy addition of new ideas
  - clearly defined interfaces
  - binary (vendor) modules
Goals of our Design
⇒ redesign of the coll v1 framework in Open MPI 1.0/1.1
- enable fine-grained selection
- efficient run-time decision
- bypassing/fast-pathing
- modular approach with third-party (binary) modules
- automatic usage of the best available module
Terms
- component: functionality without resources, provided by the implementer
- module: communicator-specific instance of an implementation
- query: request to a component to return communicator-specific modules
- implementation: implementation of a collective operation
- opaque functions: non-visible functions in collective modules
Software Architecture
[Figure: software architecture. Component A provides a Broadcast module (broadcast_fn_1, broadcast_fn_2, broadcast_eval_fn), a Barrier module (barrier_fn, barrier_eval_fn), an Alltoall module (alltoall_fn_1, alltoall_fn_2, alltoall_eval_fn), and more. Component B provides its own Broadcast module (broadcast_fn, broadcast_eval_fn), a Gather module (gather_fn_1, gather_fn_2, gather_eval_fn), and more; other components provide their own Broadcast modules analogously.]
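The structure in the figure can be sketched as follows. This is a hypothetical Python rendering for illustration only; the actual Open MPI framework is written in C, and every name here is either taken from the figure or invented.

```python
class Module:
    """Communicator-specific bundle: one or more implementations of a
    single collective operation plus the evaluation function that
    chooses among them (cf. *broadcast_fn_1, *broadcast_eval_fn)."""
    def __init__(self, op, impls, eval_fn):
        self.op = op              # e.g. "broadcast"
        self.impls = impls        # e.g. [broadcast_fn_1, broadcast_fn_2]
        self.eval_fn = eval_fn    # e.g. broadcast_eval_fn

class Component:
    """Implementer-provided functionality; holds no resources until it
    is queried with a concrete communicator."""
    def __init__(self, name, factories):
        self.name = name
        self.factories = factories   # callables comm -> Module
    def query(self, comm):
        """Return this component's communicator-specific modules."""
        return [make(comm) for make in self.factories]
```

A component can export modules for any subset of the collective operations, and two components may both export, say, a broadcast module.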
Actions During MPI_INIT
[Flowchart: Did the user force a selection? If yes, load the selected components; if no, load all available components. Open each component and ask whether it wants to run; if it does not, close it and unload it. Return the resulting component list.]
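The flow above could look roughly like this; the helper names (`wants_to_run`, etc.) are hypothetical and the real logic lives in Open MPI's C code.

```python
def mpi_init_component_list(available, forced, wants_to_run):
    """Sketch of the MPI_INIT flow: honor a user-forced selection if
    present, otherwise consider all available components; open each
    one and keep it only if it wants to run on this system."""
    if forced:
        loaded = [c for c in available if c in forced]   # load selected
    else:
        loaded = list(available)                         # load all available
    kept = []
    for comp in loaded:
        if wants_to_run(comp):     # component opened and wants to run
            kept.append(comp)
        # else: close it and unload it (nothing retained)
    return kept                    # the returned component list
```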
Actions During Communicator Construction
[Flowchart: While components are left, query each component with the communicator and add the returned modules to the avail_<op> array. Then, for each collective operation, unify the module array and construct the function list. If there is only one module, put it at the communicator as directly callable; otherwise, put the decision function at the communicator.]
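The construction steps above can be sketched as follows (hypothetical names, Python for brevity; modules are plain dicts with an `op`, an `impl`, and an `eval` function returning `(implementation, estimated time)`).

```python
def setup_comm_colls(components, comm):
    """Query every component with the communicator, gather the returned
    modules per operation (the avail_<op> arrays), then attach either
    the single module directly or a decision closure over all of them."""
    avail = {}                                    # op -> list of modules
    for component in components:
        for module in component.query(comm):      # assumed query() interface
            avail.setdefault(module["op"], []).append(module)
    colls = {}
    for op, modules in avail.items():
        if len(modules) == 1:
            colls[op] = modules[0]["impl"]        # directly callable, no decision
        else:
            def decide(*args, _mods=modules):
                # ask each module's eval function, run the fastest
                impl, _t = min((m["eval"](*args) for m in _mods),
                               key=lambda r: r[1])
                return impl(*args)
            colls[op] = decide
    return colls

# hypothetical demo: two components offer broadcast, one offers barrier
class _Comp:
    def __init__(self, mods): self._mods = mods
    def query(self, comm): return self._mods

fast = lambda *a: "fast"
slow = lambda *a: "slow"
comp_a = _Comp([{"op": "bcast", "impl": slow, "eval": lambda *a: (slow, 9.0)},
                {"op": "barrier", "impl": fast, "eval": None}])
comp_b = _Comp([{"op": "bcast", "impl": fast, "eval": lambda *a: (fast, 1.0)}])
colls = setup_comm_colls([comp_a, comp_b], comm="comm0")
```

The single-module case is the fast path: no decision logic remains in the critical path at all.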
Architecture
- all returned modules are attached to the communicator
- each module offers an evaluation function
- the evaluation function returns a pointer to the fastest function and an estimated time
- the estimation is up to the implementer (model, previous benchmarks, ...)
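An evaluation function of this kind might look as follows; the two implementations and the LogGP-style cost formulas are purely illustrative, and an implementer could just as well return previously benchmarked numbers.

```python
from math import ceil, log2

def make_bcast_eval(P, L, o, G):
    """Hypothetical broadcast evaluation function: given the call
    arguments (here only the message size m) it returns a pair of
    (fastest implementation, estimated time)."""
    def linear_bcast(m):   return "linear"    # placeholder implementations
    def binomial_bcast(m): return "binomial"
    per_msg = lambda m: 2 * o + (m - 1) * G + L
    def eval_fn(m):
        t_lin = (P - 1) * per_msg(m)          # root sends to everyone
        t_bin = ceil(log2(P)) * per_msg(m)    # binomial tree
        if t_bin <= t_lin:
            return binomial_bcast, t_bin
        return linear_bcast, t_lin
    return eval_fn
```

Because every estimate is expressed in seconds, model-based and benchmark-based modules can compete on equal footing.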
Invocation
[Flowchart: Are the MPI arguments in the cache? If yes, call the fastest implementation directly. If no, query each untested module for its estimated running time until none is left, put the fastest into the cache, clean up all modules but the winner, and call the fastest.]
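The invocation flow can be sketched as a memoizing dispatcher (hypothetical names; `eval_fns` are the modules' evaluation functions, each returning `(implementation, estimated time)`).

```python
def make_invoker(eval_fns):
    """Hash table maps an MPI argument signature to the previously
    found winner, so repeated calls skip the decision entirely."""
    cache = {}
    def invoke(*args):
        if args in cache:                    # cache hit
            return cache[args](*args)        # call fastest directly
        best_impl, best_t = None, float("inf")
        for eval_fn in eval_fns:             # query each untested module
            impl, t = eval_fn(*args)         # estimated running time
            if t < best_t:
                best_impl, best_t = impl, t
        cache[args] = best_impl              # put fastest into the cache
        # (the real framework would also clean up the losing modules here)
        return best_impl(*args)
    return invoke

# hypothetical demo with two competing broadcast modules
slow_impl = lambda *a: "slow"
fast_impl = lambda *a: "fast"
bcast = make_invoker([lambda *a: (slow_impl, 9.0),
                      lambda *a: (fast_impl, 2.0)])
```

The full evaluation therefore runs only on the first call with a given argument signature; every later call with the same signature is a single table lookup.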
Decision Overhead
Cache Hit
- one access in a hash table

Cache Miss
- cost depends on the number of modules
- query each module; each returns a model- or benchmark-based prediction

Cache Friendliness (total calls / cache misses, resulting hit rate)
- ABINIT/Band: 295/16 (94.6%)
- ABINIT/CG: 53887/75 (99.9%)
- CPMD: 15428/85 (99.4%)
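Reading each pair as total collective calls versus decision-cache misses reproduces the quoted hit rates:

```python
def hit_rate(calls, misses):
    """Decision-cache hit rate in percent."""
    return 100.0 * (calls - misses) / calls

# the application traces from the slide
assert round(hit_rate(295, 16), 1) == 94.6     # ABINIT/Band
assert round(hit_rate(53887, 75), 1) == 99.9   # ABINIT/CG
assert round(hit_rate(15428, 85), 1) == 99.4   # CPMD
```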
Conclusions and Future Work
Conclusions
- easy, flexible, and reliable scheme
- optimized for the common case
- exploits "argument locality"

Future Work
- implement the system in Open MPI
- analyze more applications