Introduction to MPI Programming (Part III) Michael Griffiths, Deniz Savas & Alan Real January 2006.


Overview

• Review blocking and non-blocking communications
• Collective Communication
  • Broadcast, Scatter & Gather of data
  • Reduction Operations
  • Barrier Synchronisation
• Processor topologies
• Patterns for Parallel Programming
• Exercises

Blocking operations

• Relate to when the operation has completed
• Only return from the subroutine call when the operation has completed

Non-blocking communication

Separate communication into three phases:
• Initiate non-blocking communication
• Do some work, perhaps involving other communications
• Wait for non-blocking communication to complete

Collective Communications (one for all, all for one!!!)

Collective communication is defined as that which involves all the processes in a group. Collective communication routines can be divided into the following broad categories:

• Barrier synchronisation
• Broadcast from one to all
• Scatter from one to all
• Gather from all to one
• Scatter/Gather from all to all
• Global reduction (distribute elementary operations)

IMPORTANT NOTE: Collective communication operations and the point-to-point operations we have seen earlier are invisible to each other and hence do not interfere with each other. This is important for avoiding deadlocks due to interference.

Timers

• Double precision MPI functions:
  • Fortran: DOUBLE PRECISION t1
      t1 = MPI_WTIME()
  • C: double t1;
      t1 = MPI_Wtime();
  • C++: double t1;
      t1 = MPI::Wtime();
• Time is measured in seconds.
• Time to perform a task is measured by consulting the timer before and after.

Practice Session 4: diffusion example

• Arrange processes to communicate round a ring.
• Each process stores a copy of its rank in an integer variable.
• Each process communicates this value to its right neighbour and receives a value from its left neighbour.
• Each process computes the sum of all the values received.
• Repeat for the number of processes involved and print out the sum stored at each process.

Generating Cartesian Topologies

• MPI_Cart_create: makes a new communicator to which topology information has been attached
• MPI_Cart_coords: determines process coords in Cartesian topology given rank in group
• MPI_Cart_shift: returns the shifted source and destination ranks, given a shift direction and amount

MPI_Cart_create syntax

Fortran:
    INTEGER comm_old, ndims, dims(*), comm_cart, ierror
    LOGICAL periods(*), reorder
    CALL MPI_CART_CREATE(comm_old, ndims, dims, periods, reorder, comm_cart, ierror)

C:
    int MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims, int *periods, int reorder, MPI_Comm *comm_cart);

C++:
    MPI::Cartcomm MPI::Intracomm::Create_cart(int ndims, const int dims[], const bool periods[], bool reorder) const;

MPI_Comm_rank Syntax

MPI_Comm_rank - Determines the rank of the calling process in the communicator.

• C: int MPI_Comm_rank(MPI_Comm comm, int *rank)
• Fortran: MPI_COMM_RANK(COMM, RANK, IERROR)
• C++: int MPI::Comm::Get_rank() const

Transform Rank to Coordinates

MPI_Cart_coords syntax

Fortran:
    INTEGER comm, rank, maxdims, coords(*), ierror
    CALL MPI_CART_COORDS(comm, rank, maxdims, coords, ierror)

C:
    int MPI_Cart_coords(MPI_Comm comm, int rank, int maxdims, int *coords);

C++:
    void MPI::Cartcomm::Get_coords(int rank, int maxdims, int coords[]) const;

Transform Coordinates to Rank

MPI_Cart_rank syntax

Fortran:
    INTEGER comm, coords(*), rank, ierror
    CALL MPI_CART_RANK(comm, coords, rank, ierror)

C:
    int MPI_Cart_rank(MPI_Comm comm, int *coords, int *rank);

C++:
    int MPI::Cartcomm::Get_cart_rank(const int coords[]) const;

MPI_Cart_shift syntax

Fortran:
    INTEGER comm, direction, disp, rank_source, rank_dest, ierror
    CALL MPI_CART_SHIFT(comm, direction, disp, rank_source, rank_dest, ierror)

C:
    int MPI_Cart_shift(MPI_Comm comm, int direction, int disp, int *rank_source, int *rank_dest);

C++:
    void MPI::Cartcomm::Shift(int direction, int disp, int &rank_source, int &rank_dest) const;

Mapping 4x4 Cartesian Topology Onto Processor Ranks

Topologies: Examples

• See diffusion example
• See cartesian example

Examples for Parallel Programming

• Master slave
  • E.g. share work example
  • Example Ising model
• Communicating Sequential Elements Pattern
  • Poisson equation
• Highly coupled processes
  • Systolic loop algorithm
  • E.g. md example

Poisson Solver Using Jacobi Iteration

• Communicating Sequential Elements Pattern
• Operations in each component depend on partial results in neighbour components.

[Figure: six slave threads connected by data exchange.]

Layered Decomposition of 2d Array

• Distribute 2d array across processors
  • Processors store all columns
  • Rows allocated amongst processors
• Each proc has a left proc and a right proc
• Each proc has a max and min vertex that it stores
• Update: $U_{i,j}^{\mathrm{new}} = (U_{i+1,j} + U_{i-1,j} + U_{i,j+1} + U_{i,j-1})/4$
• Each proc has a "ghost" layer
  • Used in calculation of the update (see above)
  • Obtained from neighbouring left and right processors
• Pass top and bottom layers to neighbouring processors
  • These become the neighbours' ghost layers
• Distribute rows over processors: N/nproc rows per proc
• Every processor stores all N columns

[Figure: grid rows divided among Processors 1 to 4, with boundary rows labelled p1min, p2min, p2max and p3max. Neighbouring processors send their top and bottom layers and receive top and bottom layers in return.]

Master Slave

A computation is required where independent computations are performed, perhaps repeatedly, on all elements of some ordered data.

Example: image processing, where a computation is performed on different sets of pixels within an image.

[Figure: a master exchanging data with three slave threads.]

Highly Coupled Efficient Element Exchange

• Highly coupled efficient element exchange using systolic loop techniques
• Extreme example of Communicating Sequential Elements Pattern

Systolic Loop

• Distribute elements over processors
• Three buffers
  • Local elements
  • Travelling elements (local elements at start)
  • Send buffer
• Loop over number of processors
  • Transfer travelling elements
    • Interleave send/receive to prevent deadlock
    • Send contents of send buffer to next proc
    • Receive buffer from previous proc to travelling elements
    • Point travelling elements to send buffer
  • Allow local elements to interact with travelling elements
• Accumulate reduced computations over processors

Systolic Loop Element Pump

[Figure: Procs 1 to 4, each holding its local elements together with moving elements received from the previous proc (Proc 1 holds moving elements from Proc 4). First cycle of 3 for a 4-processor systolic loop.]

Practice Sessions 5 and 6

• Defining and Using Processor Topologies
• Patterns for parallel computing

Further Information

• All MPI routines have a UNIX man page:
  • Use C-style definition for Fortran/C/C++
  • E.g. "man MPI_Finalize" will give correct syntax and information for Fortran, C and C++ calls.
• Designing and Building Parallel Programs (Ian Foster): http://www-unix.mcs.anl.gov/dbpp/
• Standard documents: http://www.mpi-forum.org/
• Many books and information on web.
• EPCC documents.
