Managed by UT-Battelle for the Department of Energy
MPI for MultiCore and ManyCore
Galen Shipman, Oak Ridge National Laboratory
June 4, 2008
Ripe Areas of Improvement for MultiCore
– MPI Implementations
– The MPI Standard
– Resource Managers
– Improving MPI as a Low Level Substrate
Open MPI Component Architecture
Intra-node optimizations are primarily isolated to the collectives and point-to-point interfaces within Open MPI
MPI Implementation Improvements
Extending intra-node optimizations beyond "shared memory as a transport mechanism"
– Process synchronization primitives
– Hierarchical collectives: reduces network contention; exploits on-node memory hierarchies if they exist
– "Offload" some MPI library tasks to dedicated cores: at extremely large scale, the additional overhead of this offload may be insignificant in contrast to the ability to schedule operations effectively; requires applications to be optimized for overlap
Shared Memory Optimizations
MPI_Bcast on 16 cores (quad-socket dual-core Opteron)
“MPI Support for Multi-Core Architectures: Optimized Shared Memory Collectives”, R. L. Graham and G. M. Shipman, To appear in the proceedings of EuroPVM/MPI 2008
Shared Memory Optimizations
MPI_Reduce on 16 cores (quad-socket dual-core Opteron)
“MPI Support for Multi-Core Architectures: Optimized Shared Memory Collectives”, R. L. Graham and G. M. Shipman, To appear in the proceedings of EuroPVM/MPI 2008
Shared Memory Optimizations
[Figures: additional shared-memory collective performance results; the images did not survive the transcript]
MPI Implementation Improvements
Reduced memory footprint
– MPI has gotten used to 1 GB/core (no, we don't use all of it)
– Careful memory usage is needed even at "modest scale": ~100K cores on Baker may require data structure changes. What happens at 1M cores when I only have 256 MB/core?

Can we improve performance through reduced generality of MPI?
– What if I don't want datatypes (other than MPI_BYTE)?
– What if I don't use MPI_ANY_SOURCE?
– Can relaxed ordering semantics help for some use cases?
– Additional crazy ideas

And don't forget about I/O...
– Hierarchies in the I/O infrastructure may need to be explicitly managed to achieve reasonable performance
– Applications will have to change how they do I/O
MPI Standards Improvements
MPI RMA (Remote Memory Access) – not MPI-2 one-sided
– Need to decouple: RMA initialization, RMA ordering, remote completion, and process synchronization
– Intertwining these semantics reduces performance (see MPI_WIN_FENCE)
– Need RMW (read-modify-write) operations; not MPI_ACCUMULATE
– Relax window access restrictions

Explicit support for process hierarchies
– Are all processes created equal?
– Should some process groups have "Divine Rights"?
MPI Standards Improvements
Can threads be first-class (or even second-class) citizens in an MPI world?
– Work has been done in this area long ago (see TMPI and TOMPI)
RMA Example - SHMEM on Cray X1
#include <mpp/shmem.h>
#include <stdlib.h>   /* atoi */
#define SIZE 16

int main(int argc, char* argv[])
{
    int buffer[SIZE];
    int num_pe = atoi(argv[1]);

    start_pes(num_pe);
    buffer[SIZE-1] = 0;

    if (_my_pe() == 0) {
        buffer[SIZE-1] = 1;
        shmem_int_put(buffer, buffer, SIZE, 1);   /* push buffer to PE 1 */
        shmem_int_wait(&buffer[SIZE-1], 1);       /* wait for PE 1's reply */
    } else if (_my_pe() == 1) {
        shmem_int_wait(&buffer[SIZE-1], 0);       /* wait for PE 0's put */
        buffer[SIZE-1] = 0;
        shmem_int_put(buffer, buffer, SIZE, 0);   /* push buffer back to PE 0 */
    }
    shmem_barrier_all(); /* sync before exiting */
    return 0;
}
RMA Example - MPI on My Laptop
#include <mpi.h>
#define SIZE 16

int main(int argc, char* argv[])
{
    int buffer[SIZE];
    int proc, nproc;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &proc);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Win_create(buffer, SIZE*sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (proc == 0) {
        MPI_Win_fence(0, win);                                   /* exposure epoch */
        MPI_Put(buffer, SIZE, MPI_INT, 1, 0, SIZE, MPI_INT, win);
        MPI_Win_fence(0, win);
        MPI_Win_fence(0, win);                                   /* exposure epoch */
        MPI_Win_fence(0, win);                                   /* data has landed */
    } else if (proc == 1) {
        MPI_Win_fence(0, win);                                   /* exposure epoch */
        MPI_Win_fence(0, win);                                   /* data has landed */
        MPI_Win_fence(0, win);
        MPI_Put(buffer, SIZE, MPI_INT, 0, 0, SIZE, MPI_INT, win);
        MPI_Win_fence(0, win);
    }
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
Improved Resource Managers
Express resource requirements for ranks and groups (stencils) of ranks
– Network and memory bandwidth requirements
– Memory per process
– Latency requirements
– I/O requirements

We can do this in MPI, but it doesn't belong there
– An application may not even want to be scheduled if certain resource requirements cannot be met
– We will need improved integration between MPI and resource managers
– MPI can use information provided by the resource manager to allocate internal resources depending on the resource requirements specified for the given communicating peers
MPI as a Low Level Parallel Computing Substrate
MPI is not enough
– Nor should it be; we need domain-specific packages to layer on top of MPI
– As such, MPI had better provide the low-level communication primitives that these packages (reasonably) need

MPI should be improved
– MPI RMA can allow PGAS, SHMEM, Global Arrays, and others to effectively use MPI as a low-level communication substrate
– Composing MPI features for a particular MPI consumer and operating environment (Opteron or Cell SPE) can remove the barriers to MPI adoption in many MultiCore and hybrid environments
– CML (Cell Messaging Layer) by Scott Pakin is a good example of a special-purpose "lightweight" MPI
Questions?