Introduction to MPI Programming (Part III) Michael Griffiths, Deniz Savas & Alan Real January 2006.
Overview
• Review blocking and non-blocking communications
• Collective Communication
  • Broadcast, Scatter & Gather of data
  • Reduction Operations
  • Barrier Synchronisation
• Processor topologies
• Patterns for Parallel Programming
• Exercises
Blocking operations
• Relate to when the operation has completed
• Only return from the subroutine call when the operation has completed
Non-blocking communication
Separate communication into three phases:
• Initiate non-blocking communication
• Do some work, perhaps involving other communications
• Wait for non-blocking communication to complete
Collective Communications (one for all, all for one!!!)
Collective communication is defined as communication that involves all the processes in a group. Collective communication routines can be divided into the following broad categories:
• Barrier synchronisation
• Broadcast from one to all
• Scatter from one to all
• Gather from all to one
• Scatter/Gather from all to all
• Global reduction (distribute elementary operations)
IMPORTANT NOTE: Collective communication operations and the point-to-point operations we have seen earlier are invisible to each other and hence do not interfere with each other. This is important for avoiding deadlocks due to interference.
Timers
Double precision MPI functions:
• Fortran: DOUBLE PRECISION t1
  t1 = MPI_WTIME()
• C: double t1;
  t1 = MPI_Wtime();
• C++: double t1;
  t1 = MPI::Wtime();
Time is measured in seconds.
The time to perform a task is measured by consulting the timer before and after.
Practice Session 4: diffusion example
• Arrange processes to communicate round a ring.
• Each process stores a copy of its rank in an integer variable.
• Each process communicates this value to its right neighbour and receives a value from its left neighbour.
• Each process computes the sum of all the values received.
• Repeat for the number of processes involved and print out the sum stored at each process.
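The data movement in this exercise can be checked in a single process before any MPI calls are written. The sketch below is only a serial imitation of the ring, not the exercise solution: the function name `ring_sums`, the limit `MAX_PROCS`, and the choice to forward the value just received on the next step are all assumptions made for illustration. In a real MPI program each iteration of the inner loops would be one send to the right neighbour and one receive from the left neighbour.

```c
#include <assert.h>

/* Illustrative upper limit on the number of simulated ranks (assumption). */
#define MAX_PROCS 64

/* Serial imitation of the ring exercise: no MPI calls are made.
 * On return, sums[r] holds the total accumulated by simulated rank r. */
void ring_sums(int nproc, int sums[])
{
    int in_flight[MAX_PROCS];  /* value each rank is about to send right */
    int received[MAX_PROCS];   /* value each rank has just received      */

    assert(nproc > 0 && nproc <= MAX_PROCS);

    for (int r = 0; r < nproc; r++) {
        in_flight[r] = r;      /* each rank starts with a copy of its rank */
        sums[r] = 0;
    }

    /* Repeat once per process so that every value visits every rank. */
    for (int step = 0; step < nproc; step++) {
        for (int r = 0; r < nproc; r++) {
            int left = (r - 1 + nproc) % nproc;
            received[r] = in_flight[left];  /* receive from left neighbour */
        }
        for (int r = 0; r < nproc; r++) {
            sums[r] += received[r];         /* accumulate what arrived      */
            in_flight[r] = received[r];     /* pass it on in the next step  */
        }
    }
}
```

With this forwarding scheme every rank ends with the same total, 0 + 1 + ... + (nproc - 1), which gives a simple check for the parallel version.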
Generating Cartesian Topologies
• MPI_Cart_create: makes a new communicator to which topology information has been attached
• MPI_Cart_coords: determines process coords in cartesian topology given rank in group
• MPI_Cart_shift: returns the shifted source and destination ranks, given a shift direction and amount
MPI_Cart_create syntax
• Fortran:
  INTEGER comm_old, ndims, dims(*), comm_cart, ierror
  LOGICAL periods(*), reorder
  CALL MPI_CART_CREATE(comm_old, ndims, dims, periods, reorder, comm_cart, ierror)
• C:
  int MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims, int *periods, int reorder, MPI_Comm *comm_cart);
• C++:
  MPI::Cartcomm MPI::Intracomm::Create_cart(int ndims, const int dims[], const bool periods[], bool reorder) const;
MPI_Comm_rank syntax
MPI_Comm_rank determines the rank of the calling process in the communicator.
• C: int MPI_Comm_rank(MPI_Comm comm, int *rank)
• Fortran: MPI_COMM_RANK(COMM, RANK, IERROR)
• C++: int MPI::Comm::Get_rank() const
Transform Rank to Coordinates
MPI_Cart_coords syntax
• Fortran:
  INTEGER COMM, RANK, MAXDIMS, COORDS(*), IERROR
  CALL MPI_CART_COORDS(COMM, RANK, MAXDIMS, COORDS, IERROR)
• C:
  int MPI_Cart_coords(MPI_Comm comm, int rank, int maxdims, int *coords);
• C++:
  void MPI::Cartcomm::Get_coords(int rank, int maxdims, int coords[]) const;
Transform Coordinates to Rank
MPI_Cart_rank syntax
• Fortran:
  INTEGER COMM, COORDS(*), RANK, IERROR
  CALL MPI_CART_RANK(COMM, COORDS, RANK, IERROR)
• C:
  int MPI_Cart_rank(MPI_Comm comm, int *coords, int *rank);
• C++:
  int MPI::Cartcomm::Get_cart_rank(const int coords[]) const;
MPI_Cart_shift syntax
• Fortran:
  INTEGER COMM, DIRECTION, DISP, RANK_SOURCE, RANK_DEST, IERROR
  CALL MPI_CART_SHIFT(COMM, DIRECTION, DISP, RANK_SOURCE, RANK_DEST, IERROR)
• C:
  int MPI_Cart_shift(MPI_Comm comm, int direction, int disp, int *rank_source, int *rank_dest);
• C++:
  void MPI::Cartcomm::Shift(int direction, int disp, int &rank_source, int &rank_dest) const;
Mapping 4x4 Cartesian Topology Onto Processor Ranks
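MPI numbers the processes of a Cartesian topology in row-major order, so the 4x4 mapping can be reproduced with plain integer arithmetic. The sketch below mirrors what MPI_Cart_coords, MPI_Cart_rank and MPI_Cart_shift return for such a grid, assuming both dimensions are periodic. The helper names `rank_to_coords`, `coords_to_rank` and `shift_periodic` are hypothetical and are not MPI routines; in a real program the MPI calls should be used on the communicator returned by MPI_Cart_create.

```c
#include <assert.h>

#define NROWS 4  /* grid size taken from the slide title */
#define NCOLS 4

/* Row-major mapping from rank to (row, column), as MPI_Cart_coords would give. */
void rank_to_coords(int rank, int coords[2])
{
    assert(rank >= 0 && rank < NROWS * NCOLS);
    coords[0] = rank / NCOLS;
    coords[1] = rank % NCOLS;
}

/* Inverse mapping from (row, column) to rank, as MPI_Cart_rank would give. */
int coords_to_rank(const int coords[2])
{
    assert(coords[0] >= 0 && coords[0] < NROWS);
    assert(coords[1] >= 0 && coords[1] < NCOLS);
    return coords[0] * NCOLS + coords[1];
}

/* Source and destination ranks for a shift of disp along direction (0 or 1).
 * Assumes both dimensions are periodic, so the grid wraps around. */
void shift_periodic(int rank, int direction, int disp,
                    int *rank_source, int *rank_dest)
{
    const int dims[2] = { NROWS, NCOLS };
    int here[2], src[2], dst[2];

    assert(direction == 0 || direction == 1);
    rank_to_coords(rank, here);
    src[0] = dst[0] = here[0];
    src[1] = dst[1] = here[1];

    int n = dims[direction];
    src[direction] = ((here[direction] - disp) % n + n) % n;  /* receive from */
    dst[direction] = ((here[direction] + disp) % n + n) % n;  /* send to      */

    *rank_source = coords_to_rank(src);
    *rank_dest = coords_to_rank(dst);
}
```

For example, rank 6 sits at coordinates (1, 2), and a shift of 1 along direction 1 from rank 3 wraps round to destination rank 0.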
Topologies: Examples
• See Diffusion example
• See cartesian example
Examples for Parallel Programming
• Master slave
  • E.g. share work example
  • Example ising model
• Communicating Sequential Elements Pattern
  • Poisson equation
• Highly coupled processes
  • Systolic loop algorithm
  • E.g. md example
Poisson Solver Using Jacobi Iteration
• Communicating Sequential Elements Pattern
• Operations in each component depend on partial results in neighbour components.
[Figure: slave threads linked to their neighbours by data exchange]
Layered Decomposition of 2d Array
• Distribute 2d array across processors
  • Processors store all columns
  • Rows allocated amongst processors
• Each proc has left proc and right proc
• Each proc has max and min vertex that it stores
• $U_{i,j}^{new} = (U_{i+1,j} + U_{i-1,j} + U_{i,j+1} + U_{i,j-1})/4$
• Each proc has a “ghost” layer
  • Used in calculation of update (see above)
  • Obtained from neighbouring left and right processors
  • Pass top and bottom layers to neighbouring processors
    • Become neighbours' ghost layers
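The update formula and the role of the ghost layers can be illustrated with one local sweep. The sketch below assumes that each processor holds its own rows in a flat row-major array with one ghost row above and one below, and that the first and last columns are fixed boundary values. The function name `jacobi_sweep` and this storage layout are assumptions for illustration, not the course code. Filling the ghost rows from the neighbouring processors is left to the message-passing step and is not shown.

```c
#include <assert.h>

/* One Jacobi sweep over a processor's local block.
 * u and unew each hold (nrows + 2) * ncols values in row-major order:
 *   row 0           ghost layer received from one neighbour
 *   rows 1..nrows   rows owned by this processor
 *   row nrows + 1   ghost layer received from the other neighbour
 * Columns 0 and ncols - 1 are treated as fixed boundary values. */
void jacobi_sweep(int nrows, int ncols, const double *u, double *unew)
{
    assert(nrows > 0 && ncols > 2);

    for (int i = 1; i <= nrows; i++) {
        for (int j = 1; j < ncols - 1; j++) {
            /* Average of the four neighbours; rows i - 1 and i + 1 may be ghosts. */
            unew[i * ncols + j] = 0.25 * (u[(i + 1) * ncols + j] +
                                          u[(i - 1) * ncols + j] +
                                          u[i * ncols + j + 1] +
                                          u[i * ncols + j - 1]);
        }
    }
}
```

Because the sweep reads only u and writes only unew, the ghost rows need to be refreshed once per iteration, after the two arrays are swapped.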
Distribute rows over processors
• N/nproc rows per proc
• Every processor stores all N columns
[Figure: array divided among Processor 1 to Processor 4, with axis labels 1, N and N+1 and band limits p1min, p2max, p2min and p3max; arrows marked Send top layer, Send bottom layer, Receive top layer and Receive bottom layer]
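The per-processor limits in the figure can be computed directly from N and nproc. The sketch below is one possible scheme, under two assumptions that the slides do not state: processors are numbered from 0 like MPI ranks (the figure counts from 1), and when N is not a multiple of nproc the leftover rows go to the lowest-numbered processors. The function name `row_range` is hypothetical.

```c
#include <assert.h>

/* Inclusive range of rows [*pmin, *pmax] owned by processor p.
 * Rows are numbered 1..n and processors 0..nproc-1 (assumption).
 * Leftover rows are handed to the lowest-numbered processors (assumption). */
void row_range(int n, int nproc, int p, int *pmin, int *pmax)
{
    assert(n > 0 && nproc > 0 && p >= 0 && p < nproc);

    int base = n / nproc;                      /* rows every processor gets     */
    int extra = n % nproc;                     /* rows left over                */
    int count = base + (p < extra ? 1 : 0);    /* rows owned by processor p     */
    int start = p * base + (p < extra ? p : extra);  /* rows owned by 0..p-1    */

    *pmin = start + 1;
    *pmax = start + count;
}
```

With N = 16 and four processors, processor 2 owns rows 9 to 12; its ghost layers would then hold copies of rows 8 and 13.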
Master Slave
• A computation is required where independent computations are performed, perhaps repeatedly, on all elements of some ordered data.
• Example: image processing, where computation is performed on different sets of pixels within an image
[Figure: a master thread exchanging data with several slave threads]
Highly Coupled Efficient Element Exchange
• Highly coupled efficient element exchange using systolic loop techniques
• Extreme example of Communicating Sequential Elements Pattern
Systolic Loop
• Distribute elements over processors
• Three buffers
  • Local elements
  • Travelling elements (local elements at start)
  • Send buffer
• Loop over number of processors
  • Transfer travelling elements
    • Interleave send/receive to prevent deadlock
    • Send contents of send buffer to next proc
    • Receive buffer from previous proc into travelling elements
    • Point travelling elements to send buffer
  • Allow local elements to interact with travelling elements
• Accumulate reduced computations over processors
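The loop structure above can be traced in a single process. The sketch below is a serial imitation with no MPI calls, using fixed sizes `NPROC` and `NLOCAL`, a hypothetical function name `systolic_loop`, and a plain product as a stand-in for the interaction between elements; all of these are assumptions for illustration. In a parallel version the copy from the send buffer of the previous processor would be a send to the next proc paired with a receive from the previous proc.

```c
#include <assert.h>
#include <string.h>

#define NPROC 4   /* number of simulated processors (assumption)      */
#define NLOCAL 2  /* elements held by each processor (assumption)     */

/* Serial imitation of a systolic loop.
 * result[p] accumulates local[p][i] * travelling[j] over every element
 * that passes through processor p. */
void systolic_loop(double local[NPROC][NLOCAL], double result[NPROC])
{
    double travelling[NPROC][NLOCAL];
    double sendbuf[NPROC][NLOCAL];

    /* Travelling elements start as a copy of the local elements. */
    memcpy(travelling, local, sizeof travelling);
    for (int p = 0; p < NPROC; p++)
        result[p] = 0.0;

    for (int step = 0; step < NPROC; step++) {
        /* Transfer: fill the send buffers, then take the previous proc's buffer. */
        memcpy(sendbuf, travelling, sizeof sendbuf);
        for (int p = 0; p < NPROC; p++) {
            int prev = (p - 1 + NPROC) % NPROC;
            memcpy(travelling[p], sendbuf[prev], sizeof travelling[p]);
        }

        /* Interaction: local elements meet the travelling elements now present. */
        for (int p = 0; p < NPROC; p++)
            for (int i = 0; i < NLOCAL; i++)
                for (int j = 0; j < NLOCAL; j++)
                    result[p] += local[p][i] * travelling[p][j];
    }
}
```

After NPROC transfers the travelling elements are back where they started, so every processor has interacted with every element exactly once, including its own on the final step.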
Systolic Loop Element Pump
[Figure: first cycle of 3 for a 4 processor systolic loop]
• Proc 1: local elements, moving elements (from 4)
• Proc 2: local elements, moving elements (from 1)
• Proc 3: local elements, moving elements (from 2)
• Proc 4: local elements, moving elements (from 3)
Practice Sessions 5 and 6
• Defining and Using Processor Topologies
• Patterns for parallel computing
Further Information
• All MPI routines have a UNIX man page:
  • Use C-style definition for Fortran/C/C++
  • E.g. “man MPI_Finalize” will give correct syntax and information for Fortran, C and C++ calls.
• Designing and building parallel programs (Ian Foster): http://www-unix.mcs.anl.gov/dbpp/
• Standard documents: http://www.mpi-forum.org/
• Many books and information on web.
• EPCC documents.