High Performance Parallel Programming


Page 1: High Performance Parallel Programming

High Performance Parallel Programming

Dirk van der Knijff
Advanced Research Computing

Information Division

Page 2: High Performance Parallel Programming


• Lecture 4: Message Passing Interface 3

Page 3: High Performance Parallel Programming

So Far...

• Messages
  – source, dest, data, tag, communicator
• Communicators
  – MPI_COMM_WORLD
• Point-to-point communications
  – different modes: standard, synchronous, buffered, ready
  – blocking vs non-blocking
• Derived datatypes
  – construct then commit

Page 4: High Performance Parallel Programming

Ping-pong exercise: program

/**********************************************************************
 * This file has been written as a sample solution to an exercise in a
 * course given at the Edinburgh Parallel Computing Centre. It is made
 * freely available with the understanding that every copy of this file
 * must include this header and that EPCC takes no responsibility for
 * the use of the enclosed teaching material.
 *
 * Authors: Joel Malard, Alan Simpson
 *
 * Contact: [email protected]
 *
 * Purpose: A program to experiment with point-to-point
 *          communications.
 *
 * Contents: C source code.
 *
 ********************************************************************/

Page 5: High Performance Parallel Programming

#include <stdio.h>
#include <mpi.h>

#define proc_A 0
#define proc_B 1
#define ping 101
#define pong 101

float buffer[100000];
MPI_Aint float_size;

void processor_A (void), processor_B (void);

int main( int argc, char *argv[] )
{
    int rank;
    extern MPI_Aint float_size;

    MPI_Init(&argc, &argv);
    MPI_Type_extent(MPI_FLOAT, &float_size);   /* size of one MPI_FLOAT in bytes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == proc_A)
        processor_A();
    else if (rank == proc_B)
        processor_B();

    MPI_Finalize();
    return 0;
}

Page 6: High Performance Parallel Programming

void processor_A( void )
{
    int i, length;
    double start, finish, time;
    MPI_Status status;
    extern float buffer[100000];
    extern MPI_Aint float_size;

    printf("Length\tTotal Time\tTransfer Rate\n");

    for (length = 1; length <= 100000; length += 1000) {
        start = MPI_Wtime();
        for (i = 1; i <= 100; i++) {
            MPI_Ssend(buffer, length, MPI_FLOAT, proc_B, ping,
                      MPI_COMM_WORLD);
            MPI_Recv(buffer, length, MPI_FLOAT, proc_B, pong,
                     MPI_COMM_WORLD, &status);
        }
        finish = MPI_Wtime();
        time = finish - start;
        /* time/200 = average per-message time; rate = total bytes / total time */
        printf("%d\t%f\t%f\n", length, time/200.,
               (float)(2 * float_size * 100 * length)/time);
    }
}

Page 7: High Performance Parallel Programming

void processor_B( void )
{
    int i, length;
    MPI_Status status;
    extern float buffer[100000];

    for (length = 1; length <= 100000; length += 1000) {
        for (i = 1; i <= 100; i++) {
            MPI_Recv(buffer, length, MPI_FLOAT, proc_A, ping,
                     MPI_COMM_WORLD, &status);
            MPI_Ssend(buffer, length, MPI_FLOAT, proc_A, pong,
                      MPI_COMM_WORLD);
        }
    }
}

Page 8: High Performance Parallel Programming

Ping-pong exercise: results

[Chart: ping-pong performance - total time (seconds, left axis) and transfer rate (MBytes/sec, right axis) versus message length.]

Page 9: High Performance Parallel Programming

Ping-pong exercise: results 2

[Chart: ping-pong performance for message lengths 1 to 90001 - total time (seconds) and transfer rate (MBytes/sec) versus message length.]

Page 10: High Performance Parallel Programming

Running ping-pong

compile:
mpicc ping_pong.c -o ping_pong

submit:
qsub ping_pong.sh

where ping_pong.sh is:
#PBS -q exclusive
#PBS -l nodes=2
cd <your sub_directory>
mpirun ping_pong

Page 11: High Performance Parallel Programming

Collective communication

• Communications involving a group of processes
• Called by all processes in a communicator
  – for sub-groups, form a new communicator
• Examples
  – Barrier synchronisation
  – Broadcast, Scatter, Gather
  – Global sum, global maximum, etc.

Page 12: High Performance Parallel Programming

Characteristics

• Collective action over a communicator
• All processes must communicate
• Synchronisation may or may not occur
• All collective operations are blocking
• No tags
• Receive buffers must be exactly the right size
• Collective communications and point-to-point communications cannot interfere

Page 13: High Performance Parallel Programming

MPI_Barrier

• Blocks each calling process until all other members have also called it
• Generally used to synchronise between phases of a program
• Only one argument - no data is exchanged

MPI_Barrier(comm)
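
As a hedged illustration of the "synchronise between phases" use, a minimal C fragment (compute_phase() is a hypothetical placeholder, not part of the lecture code):

int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

MPI_Barrier(MPI_COMM_WORLD);            /* all ranks enter the phase together */
double t0 = MPI_Wtime();
compute_phase();                        /* hypothetical work for this phase   */
MPI_Barrier(MPI_COMM_WORLD);            /* all ranks have finished the phase  */
if (rank == 0)
    printf("phase took %f seconds\n", MPI_Wtime() - t0);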

Page 14: High Performance Parallel Programming

Broadcast

• Copies data from a specified root process to all other processes in the communicator
  – all processes must specify the same root
  – other arguments same as for point-to-point
  – datatypes and sizes must match

MPI_Bcast(buffer, count, datatype, root, comm)

• Note: MPI does not support a multicast function
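
For example, a minimal sketch (variable names are illustrative; rank 0 is chosen as the root):

int params[4];
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
    /* only the root fills the buffer... */
    params[0] = 100; params[1] = 200; params[2] = 10; params[3] = 1;
}
/* ...but every rank makes the identical call, and afterwards
   every rank holds a copy of the root's data */
MPI_Bcast(params, 4, MPI_INT, 0, MPI_COMM_WORLD);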

Page 15: High Performance Parallel Programming

Scatter, Gather

• Scatter and Gather are inverse operations
• Note that all processes partake - even the root

Scatter:
[Diagram: before - the root holds a b c d e, the other processes hold nothing; after - the five processes hold a, b, c, d, e respectively (the root keeps its original copy).]

Page 16: High Performance Parallel Programming

Gather

Gather:
[Diagram: before - the five processes hold a, b, c, d, e respectively; after - the root holds a b c d e (each process keeps its original element).]

Page 17: High Performance Parallel Programming

MPI_Scatter, MPI_Gather

MPI_Scatter(sendbuf, sendcount, sendtype,
            recvbuf, recvcount, recvtype, root, comm)

MPI_Gather(sendbuf, sendcount, sendtype,
           recvbuf, recvcount, recvtype, root, comm)

• sendcount in scatter and recvcount in gather refer to the size of each individual message
  (sendtype = recvtype => sendcount = recvcount)
• total type signatures must match

Page 18: High Performance Parallel Programming

Example

MPI_Comm comm;
int gsize, sendarray[100];
int root, myrank, *rbuf;
MPI_Datatype rtype;
...
MPI_Comm_rank(comm, &myrank);
MPI_Comm_size(comm, &gsize);
MPI_Type_contiguous(100, MPI_INT, &rtype);
MPI_Type_commit(&rtype);
if (myrank == root) {
    rbuf = (int *)malloc(gsize*100*sizeof(int));
}
MPI_Gather(sendarray, 100, MPI_INT, rbuf, 1, rtype, root, comm);
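
For comparison, a hedged sketch of the inverse distribution with MPI_Scatter, reusing gsize, myrank, root and comm from the example above (sendbuf and recvrow are illustrative names; assumes <stdlib.h> for malloc):

int recvrow[100];
int *sendbuf = NULL;
if (myrank == root) {
    /* root holds gsize blocks of 100 ints, one block per process */
    sendbuf = (int *)malloc(gsize*100*sizeof(int));
}
MPI_Scatter(sendbuf, 100, MPI_INT,
            recvrow, 100, MPI_INT, root, comm);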

Page 19: High Performance Parallel Programming

More routines

MPI_Allgather(sendbuf, sendcount, sendtype,
              recvbuf, recvcount, recvtype, comm)

MPI_Alltoall(sendbuf, sendcount, sendtype,
             recvbuf, recvcount, recvtype, comm)

[Diagram: Allgather - each of the five processes contributes one item (a, b, c, d, e) and every process ends up with the full sequence a b c d e. Alltoall - process i sends its j-th item to process j, so the 5x5 data set a..y is effectively transposed across the processes: process 0 ends up with a f k p u, process 1 with b g l q v, and so on.]
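
A hedged sketch of the Allgather pattern, where each rank contributes one int and every rank ends up with the whole vector (myval and all are illustrative names; assumes <stdlib.h> for malloc):

int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);

int myval = rank * rank;                      /* arbitrary local value */
int *all = (int *)malloc(size * sizeof(int));
MPI_Allgather(&myval, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);
/* on every rank, all[i] now holds rank i's contribution */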

Page 20: High Performance Parallel Programming

Vector routines

MPI_Scatterv(sendbuf, sendcounts, displs, sendtype,
             recvbuf, recvcount, recvtype, root, comm)

MPI_Gatherv(sendbuf, sendcount, sendtype,
            recvbuf, recvcounts, displs, recvtype, root, comm)

MPI_Allgatherv(sendbuf, sendcount, sendtype,
               recvbuf, recvcounts, displs, recvtype, comm)

MPI_Alltoallv(sendbuf, sendcounts, sdispls, sendtype,
              recvbuf, recvcounts, rdispls, recvtype, comm)

• Allow send/recv to be from/to non-contiguous locations in an array
• Useful if sending different counts at different times
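
A hedged sketch of MPI_Gatherv where rank i contributes i+1 ints and the root packs them back to back (all variable names are illustrative; assumes <stdlib.h> for malloc):

int rank, size, i;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);

int mycount = rank + 1;                 /* a different count per rank */
int localbuf[64];                       /* assumes mycount <= 64      */
for (i = 0; i < mycount; i++)
    localbuf[i] = rank;

int *recvcounts = NULL, *displs = NULL, *gathered = NULL;
if (rank == 0) {
    recvcounts = (int *)malloc(size * sizeof(int));
    displs     = (int *)malloc(size * sizeof(int));
    for (i = 0; i < size; i++) {
        recvcounts[i] = i + 1;
        displs[i] = (i == 0) ? 0 : displs[i-1] + recvcounts[i-1];
    }
    gathered = (int *)malloc((displs[size-1] + recvcounts[size-1]) * sizeof(int));
}
MPI_Gatherv(localbuf, mycount, MPI_INT,
            gathered, recvcounts, displs, MPI_INT, 0, MPI_COMM_WORLD);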

Page 21: High Performance Parallel Programming

Global reduction routines

• Used to compute a result which depends on data distributed over a number of processes
• Examples:
  – global sum or product
  – global maximum or minimum
  – global user-defined operation
• Operation should be associative
  – aside: remember that floating-point operations are technically not associative; we usually don't care, but it can affect results in parallel programs

Page 22: High Performance Parallel Programming

Global reduction (cont.)

MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm)

• combines count elements from each sendbuf using op and leaves the results in recvbuf on process root
• e.g.

MPI_Reduce(&s, &r, 2, MPI_INT, MPI_SUM, 1, comm)

[Diagram: each of five processes holds s = two ints - first elements 2, 1, 3, 1, 1 and second elements 3, 1, 2, 1, 2; after the call, r on process 1 holds the element-wise sums (8, 9).]
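
A minimal sketch matching the call above, with rank 1 as the root (local values are made up; MPI_COMM_WORLD stands in for comm):

int s[2], r[2], rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
s[0] = rank + 1;                        /* some illustrative local data */
s[1] = 2 * rank;
MPI_Reduce(s, r, 2, MPI_INT, MPI_SUM, 1, MPI_COMM_WORLD);
/* only rank 1's r[] holds the element-wise sums after the call */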

Page 23: High Performance Parallel Programming

Reduction operators

MPI_MAX      Maximum
MPI_MIN      Minimum
MPI_SUM      Sum
MPI_PROD     Product
MPI_LAND     Logical AND
MPI_BAND     Bitwise AND
MPI_LOR      Logical OR
MPI_BOR      Bitwise OR
MPI_LXOR     Logical XOR
MPI_BXOR     Bitwise XOR
MPI_MAXLOC   Max value and location
MPI_MINLOC   Min value and location

Page 24: High Performance Parallel Programming

User-defined operators

In C the operator is defined as a function of type

typedef void MPI_User_function(void *invec, void *inoutvec,
                               int *len, MPI_Datatype *datatype);

In Fortran you must write a function as

function <user_function>(invec(*), inoutvec(*), len, type)

where the function has the following schema:

for (i = 1 to len)
    inoutvec(i) = inoutvec(i) op invec(i)

Then

MPI_Op_create(user_function, commute, op)

returns a handle op of type MPI_Op
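
A hedged C sketch of the whole sequence - define the function, create the operator, use it in a reduce. The element-wise "maximum absolute value" operation and all names are illustrative, not from the slides; s, r and count are assumed as on the earlier reduction slides:

#include <stdlib.h>                     /* abs() */

/* matches the MPI_User_function prototype above:
   inoutvec[i] = op(invec[i], inoutvec[i]) for MPI_INT data */
void maxabs(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype)
{
    int i;
    int *in = (int *)invec, *inout = (int *)inoutvec;
    for (i = 0; i < *len; i++)
        if (abs(in[i]) > abs(inout[i]))
            inout[i] = in[i];
}
...
MPI_Op myop;
MPI_Op_create(maxabs, 1, &myop);        /* 1 => operation is commutative */
MPI_Reduce(s, r, count, MPI_INT, myop, 0, MPI_COMM_WORLD);
MPI_Op_free(&myop);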

Page 25: High Performance Parallel Programming

Variants

MPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm)

• All processes involved receive identical results

MPI_Reduce_scatter(sendbuf, recvbuf, recvcounts, datatype, op, comm)

• Acts as if a reduce were performed and then each process receives recvcounts(myrank) elements of the result
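
As a usage sketch of the first variant, every rank contributes a partial sum and every rank receives the total (names are illustrative):

double local_sum = 0.0, global_sum;
/* ... accumulate local_sum from locally held data ... */
MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
/* every rank now holds the same global_sum */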

Page 26: High Performance Parallel Programming

Reduce-scatter

int *s, *r;
int rc[5] = { 1, 2, 0, 1, 1 };
int rank, gsize;
...
MPI_Reduce_scatter(s, r, rc, MPI_INT, MPI_SUM, comm);

[Diagram: five processes each hold five ints:
  process 0: 1 1 2 1 3
  process 1: 1 2 1 2 2
  process 2: 1 3 1 1 2
  process 3: 2 1 1 2 1
  process 4: 2 2 1 3 1
The element-wise sums are 7 9 6 9 9; with recvcounts = (1, 2, 0, 1, 1), process 0 receives 7, process 1 receives 9 6, process 2 receives nothing, process 3 receives 9, and process 4 receives 9.]

Page 27: High Performance Parallel Programming

Scan

MPI_Scan(sendbuf, recvbuf, count, datatype, op, comm)

• Performs a prefix reduction on data across the group
  – recvbuf on process myrank = op over sendbuf on processes 0..myrank

MPI_Scan(&s, &r, 5, MPI_INT, MPI_SUM, comm);

[Diagram: with the same data as on the previous slide
  process 0: 1 1 2 1 3
  process 1: 1 2 1 2 2
  process 2: 1 3 1 1 2
  process 3: 2 1 1 2 1
  process 4: 2 2 1 3 1
the scan leaves running element-wise sums in recvbuf:
  process 0: 1 1 2 1 3
  process 1: 2 3 3 3 5
  process 2: 3 6 4 4 7
  process 3: 5 7 5 6 8
  process 4: 7 9 6 9 9 ]
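
A common practical use of MPI_Scan is turning per-rank element counts into global offsets; a hedged sketch (nlocal, prefix and offset are illustrative names):

int nlocal = 1000;                      /* locally owned elements (example value) */
int prefix, offset;
MPI_Scan(&nlocal, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
offset = prefix - nlocal;               /* exclusive prefix: this rank's global start index */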

Page 28: High Performance Parallel Programming

Further topics

• Error-handling
  – Errors are handled by an error handler
  – MPI_ERRORS_ARE_FATAL - the default for MPI_COMM_WORLD
  – MPI_ERRORS_RETURN - MPI state is undefined after an error
  – MPI_Error_string(errorcode, string, resultlen)
• Message probing (see the sketch after this list)
  – Messages can be probed
  – Note - a wildcard receive may pick up a different message from the one probed
  – blocking and non-blocking
• Persistent communications
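
As a hedged sketch of blocking message probing, MPI_Probe plus MPI_Get_count can be used to size the receive buffer before the actual receive (buffer name is illustrative; assumes <stdlib.h> for malloc):

MPI_Status status;
int count;
MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
MPI_Get_count(&status, MPI_INT, &count);
int *buf = (int *)malloc(count * sizeof(int));
/* receive from the probed source and tag so the same message is matched */
MPI_Recv(buf, count, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
         MPI_COMM_WORLD, &status);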

Page 29: High Performance Parallel Programming

Assignment 2

• Write a general procedure to multiply two matrices
• Start with
  – http://www.hpc.unimelb.edu.au/cs/assignment2/
• This is a harness for last year's assignment
  – Last year I asked them to optimise first
  – This year just parallelise
• Next Tuesday I will discuss strategies
  – That doesn't mean don't start now…
  – Ideas are available in various places…

Page 30: High Performance Parallel Programming


Tomorrow - matrix multiplication