High Performance Parallel Programming
Transcript of High Performance Parallel Programming
Dirk van der Knijff, Advanced Research Computing
Information Division
• Lecture 4: Message Passing Interface 3
So far...
• Messages
  – source, dest, data, tag, communicator
• Communicators
  – MPI_COMM_WORLD
• Point-to-point communications
  – different modes - standard, synchronous, buffered, ready
  – blocking vs non-blocking
• Derived datatypes
  – construct then commit
Ping-pong exercise: program

/**********************************************************************
* This file has been written as a sample solution to an exercise in a
* course given at the Edinburgh Parallel Computing Centre. It is made
* freely available with the understanding that every copy of this file
* must include this header and that EPCC takes no responsibility for
* the use of the enclosed teaching material.
*
* Authors: Joel Malard, Alan Simpson
*
* Contact: [email protected]
*
* Purpose: A program to experiment with point-to-point
* communications.
*
* Contents: C source code.
*
********************************************************************/
#include <stdio.h>
#include <mpi.h>
#define proc_A 0
#define proc_B 1
#define ping 101
#define pong 101
float buffer[100000];
long float_size;
void processor_A (void), processor_B (void);
int main ( int argc, char *argv[] )
{
int ierror, rank, size;
extern long float_size;
MPI_Init(&argc, &argv);
MPI_Type_extent(MPI_FLOAT, &float_size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == proc_A)
processor_A();
else if (rank == proc_B)
processor_B();
MPI_Finalize();
return 0;
}
void processor_A( void )
{
int i, length, ierror;
MPI_Status status;
double start, finish, time;
extern float buffer[100000];
extern long float_size;
printf("Length\tTotal Time\tTransfer Rate\n");
for (length = 1; length <= 100000; length += 1000){
start = MPI_Wtime();
for (i = 1; i <= 100; i++){
MPI_Ssend(buffer, length, MPI_FLOAT, proc_B, ping,
MPI_COMM_WORLD);
MPI_Recv(buffer, length, MPI_FLOAT, proc_B, pong,
MPI_COMM_WORLD, &status);
}
finish = MPI_Wtime();
time = finish - start;
printf("%d\t%f\t%f\n", length, time/200.,
(float)(2 * float_size * 100 * length)/time);
}
}
void processor_B( void )
{
int i, length, ierror;
MPI_Status status;
extern float buffer[100000];
for (length = 1; length <= 100000; length += 1000) {
for (i = 1; i <= 100; i++) {
MPI_Recv(buffer, length, MPI_FLOAT, proc_A, ping,
MPI_COMM_WORLD, &status);
MPI_Ssend(buffer, length, MPI_FLOAT, proc_A, pong,
MPI_COMM_WORLD);
}
}
}
Ping-pong exercise: results

[Figure: ping-pong performance for short messages - total time per message (0 to 0.0045 seconds) and transfer rate (0 to 9 MBytes/sec) plotted against message length]
Ping-pong exercise: results 2

[Figure: ping-pong performance over the full range of message lengths (1 to 90001 floats) - total time per message (0 to 0.06 seconds) and transfer rate (0 to 12 MBytes/sec) plotted against message length]
Running ping-pong
compile:
mpicc ping_pong.c -o ping_pong
submit:
qsub ping_pong.sh
where ping_pong.sh is:
#PBS -q exclusive
#PBS -l nodes=2
cd <your sub_directory>
mpirun ping_pong
Collective communication
• Communications involving a group of processes
• Called by all processes in a communicator
  – for sub-groups need to form a new communicator
• Examples
  – Barrier synchronisation
  – Broadcast, Scatter, Gather
  – Global sum, Global maximum, etc.
Characteristics
• Collective action over a communicator
• All processes must communicate
• Synchronisation may or may not occur
• All collective operations are blocking
• No tags
• Receive buffers must be exactly the right size
• Collective communications and point-to-point communications cannot interfere
MPI_Barrier
• Blocks each calling process until all other members have also called it.
• Generally used to synchronise between phases of a program
• Only one argument - no data is exchanged
MPI_Barrier(comm)
Broadcast
• Copies data from a specified root process to all other processes in the communicator
  – all processes must specify the same root
  – other arguments same as for point-to-point
  – datatypes and sizes must match
MPI_Bcast(buffer, count, datatype, root, comm)
• Note: MPI does not support a multicast function
Scatter, Gather
• Scatter and Gather are inverse operations
• Note that all processes partake - even root
Scatter:
before: the root holds a b c d e
after: processes 0..4 hold a, b, c, d and e respectively (the root keeps its send buffer intact)
Gather:
before: processes 0..4 hold a, b, c, d and e respectively
after: the root holds a b c d e
MPI_Scatter, MPI_Gather

MPI_Scatter(sendbuf, sendcount, sendtype,
            recvbuf, recvcount, recvtype, root, comm)
MPI_Gather(sendbuf, sendcount, sendtype,
           recvbuf, recvcount, recvtype, root, comm)

• sendcount in scatter and recvcount in gather refer to the size of each individual message
  (sendtype = recvtype => sendcount = recvcount)
• total type signatures must match
Example

MPI_Comm comm;
int gsize, sendarray[100];
int root, myrank, *rbuf;
MPI_Datatype rtype;
...
MPI_Comm_rank(comm, &myrank);
MPI_Comm_size(comm, &gsize);
MPI_Type_contiguous(100, MPI_INT, &rtype);
MPI_Type_commit(&rtype);
if (myrank == root) {
    rbuf = (int *)malloc(gsize*100*sizeof(int));
}
MPI_Gather(sendarray, 100, MPI_INT, rbuf, 1, rtype, root, comm);
More routines

MPI_Allgather(sendbuf, sendcount, sendtype,
              recvbuf, recvcount, recvtype, comm)
MPI_Alltoall(sendbuf, sendcount, sendtype,
             recvbuf, recvcount, recvtype, comm)

Allgather: before, processes 0..4 hold a, b, c, d and e respectively; after, every process holds a b c d e.
Alltoall: before, process i holds row i of a 5x5 array (a b c d e / f g h i j / k l m n o / p q r s t / u v w x y); after, process i holds column i - the array is transposed across processes, e.g. process 0 ends with a f k p u.
Vector routines

MPI_Scatterv(sendbuf, sendcounts, displs, sendtype,
             recvbuf, recvcount, recvtype, root, comm)
MPI_Gatherv(sendbuf, sendcount, sendtype,
            recvbuf, recvcounts, displs, recvtype, root, comm)
MPI_Allgatherv(sendbuf, sendcount, sendtype,
               recvbuf, recvcounts, displs, recvtype, comm)
MPI_Alltoallv(sendbuf, sendcounts, sdispls, sendtype,
              recvbuf, recvcounts, rdispls, recvtype, comm)

• Allow send/recv to be from/to non-contiguous locations in an array
• Useful if sending different counts at different times
Global reduction routines
• Used to compute a result which depends on data distributed over a number of processes
• Examples:
  – global sum or product
  – global maximum or minimum
  – global user-defined operation
• Operation should be associative
  – aside: remember floating-point operations technically aren't associative but we usually don't care - it can affect results in parallel programs though
Global reduction (cont.)

MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm)

• combines count elements from each sendbuf using op and leaves results in recvbuf on process root
• e.g.
MPI_Reduce(&s, &r, 2, MPI_INT, MPI_SUM, 1, comm)
[Figure: five processes with s = (2,3), (1,1), (3,2), (1,1), (1,2); after the call, r on root process 1 holds the element-wise sums (8, 9)]
Reduction operators

MPI_MAX     Maximum
MPI_MIN     Minimum
MPI_SUM     Sum
MPI_PROD    Product
MPI_LAND    Logical AND
MPI_BAND    Bitwise AND
MPI_LOR     Logical OR
MPI_BOR     Bitwise OR
MPI_LXOR    Logical XOR
MPI_BXOR    Bitwise XOR
MPI_MAXLOC  Max value and location
MPI_MINLOC  Min value and location
User-defined operators

In C the operator is defined as a function of type

typedef void MPI_User_function(void *invec, void *inoutvec,
                               int *len, MPI_Datatype *datatype);

In Fortran one must write a function

function <user_function>(invec(*), inoutvec(*), len, type)

where the function has the following schema:

for (i = 1 to len)
    inoutvec(i) = inoutvec(i) op invec(i)

Then

MPI_Op_create(user_function, commute, op)

returns a handle op of type MPI_Op
Variants

MPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm)

• All processes involved receive identical results

MPI_Reduce_scatter(sendbuf, recvbuf, recvcounts, datatype, op, comm)

• Acts as if a reduce was performed and then each process receives recvcounts(myrank) elements of the result.
Reduce-scatter

int *s, *r;
int rc[5] = {1, 2, 0, 1, 1};
int rank, gsize;
...
MPI_Reduce_scatter(s, r, rc, MPI_INT, MPI_SUM, comm);

[Figure: five processes each contribute five ints; the element-wise sums are (7, 9, 6, 9, 9), and with recvcounts (1, 2, 0, 1, 1) process 0 receives 7, process 1 receives 9 and 6, process 2 receives nothing, process 3 receives 9 and process 4 receives 9]
Scan

MPI_Scan(sendbuf, recvbuf, count, datatype, op, comm)

• Performs a prefix reduction on data across the group:
  recvbuf(myrank) = op(sendbuf(i), i = 0..myrank)

MPI_Scan(&s, &r, 5, MPI_INT, MPI_SUM, comm);

[Figure: five processes send (1,1,2,1,3), (1,2,1,2,2), (1,3,1,1,2), (2,1,1,2,1) and (2,2,1,3,1); after the scan they receive the running element-wise sums (1,1,2,1,3), (2,3,3,3,5), (3,6,4,4,7), (5,7,5,6,8) and (7,9,6,9,9)]
Further topics
• Error-handling
  – Errors are handled by an error handler
  – MPI_ERRORS_ARE_FATAL - default for MPI_COMM_WORLD
  – MPI_ERRORS_RETURN - MPI state is undefined
  – MPI_Error_string(errorcode, string, resultlen)
• Message probing
  – Messages can be probed
  – Note - wildcard receives may receive a different message
  – blocking and non-blocking
• Persistent communications
Assignment 2
• Write a general procedure to multiply 2 matrices.
• Start with
  – http://www.hpc.unimelb.edu.au/cs/assignment2/
• This is a harness for last year's assignment
  – Last year I asked them to optimise first
  – This year just parallelize
• Next Tuesday I will discuss strategies
  – That doesn't mean don't start now…
  – Ideas available in various places…
Tomorrow - matrix multiplication