Parallel Processing 1
Parallel Processing (CS 676)
Lecture 7: Message Passing using MPI*
Jeremy R. Johnson
*Parts of this lecture were derived from Chapters 3-5 and 11 in Pacheco
Introduction
• Objective: To introduce distributed memory parallel programming using message passing, and the MPI standard for message passing.
• Topics
  – Introduction to MPI
    • hello.c
    • hello.f
  – Example problem (numeric integration)
  – Collective communication
  – Performance model
MPI
• Message Passing Interface
• Distributed Memory Model
  – Single Program Multiple Data (SPMD)
  – Communication using message passing
    • Send/Recv
  – Collective Communication
    • Broadcast
    • Reduce (AllReduce)
    • Gather (AllGather)
    • Scatter (AllScatter)
    • Alltoall
Benefits/Disadvantages
• No new language is required
• Portable
• Good performance
• Explicitly forces the programmer to deal with local/global access
• Harder to program than shared memory – requires larger program/algorithm changes
Further Information
• http://www-unix.mcs.anl.gov/mpi/
• en.wikipedia.org/wiki/Message_Passing_Interface
• www.mpi-forum.org
• www.open-mpi.org
• www.mcs.anl.gov/research/projects/mpich2
• Textbook
  – Peter S. Pacheco, Parallel Programming with MPI, Morgan Kaufmann, 1997.
Basic MPI Functions
• int MPI_Init(int* argc /* in/out */, char** argv /* in/out */)
• int MPI_Finalize(void)
• int MPI_Comm_size(MPI_Comm communicator /* in */, int* number_of_processors /* out */)
• int MPI_Comm_rank(MPI_Comm communicator /* in */, int* my_rank /* out */)
Send
• A message must be packaged in an envelope containing the destination, the size, an identifying tag, and the communicator (the set of processes participating in the communication).
• int MPI_Send(
    void* message /* in */,
    int count /* in */,
    MPI_Datatype datatype /* in */,
    int dest /* in */,
    int tag /* in */,
    MPI_Comm communicator /* in */)
Receive
• int MPI_Recv(
    void* message /* out */,
    int count /* in */,
    MPI_Datatype datatype /* in */,
    int source /* in */,
    int tag /* in */,
    MPI_Comm communicator /* in */,
    MPI_Status* status /* out */)
Status
• status->MPI_SOURCE
• status->MPI_TAG
• status->MPI_ERROR

• int MPI_Get_count(
    MPI_Status* status /* in */,
    MPI_Datatype datatype /* in */,
    int* count_ptr /* out */)
hello.c

#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
    int my_rank;        /* rank of process */
    int p;              /* number of processes */
    int source;         /* rank of sender */
    int dest;           /* rank of receiver */
    int tag = 0;        /* tag for messages */
    char message[100];  /* storage for message */
    MPI_Status status;  /* return status for receive */

    /* Start up MPI */
    MPI_Init(&argc, &argv);

    /* Find out process rank */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Find out number of processes */
    MPI_Comm_size(MPI_COMM_WORLD, &p);
hello.c

    if (my_rank != 0) {
        /* create message */
        sprintf(message, "Greetings from process %d!\n", my_rank);
        dest = 0;
        /* use strlen + 1 so that '\0' gets transmitted */
        MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    } else {
        for (source = 1; source < p; source++) {
            MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
            printf("%s\n", message);
        }
    }

    /* Shut down MPI */
    MPI_Finalize();
    return 0;
}
AnySource

if (my_rank != 0)
{
/* create message */
sprintf(message, "Greetings from process %d!\n",my_rank);
dest = 0;
/* use strlen + 1 so that '\0' gets transmitted */
MPI_Send(message, strlen(message)+1,MPI_CHAR,
dest, tag, MPI_COMM_WORLD);
}
else
{
for (source = 1; source < p; source++) {
MPI_Recv(message,100, MPI_CHAR, MPI_ANY_SOURCE, tag,
MPI_COMM_WORLD,&status);
printf("%s\n",message);
}
}
Ring Communication
Oct. 30, 2002 Parallel Processing 13
[Figure: 8 processes, 0 through 7, arranged in a ring; each process sends to its neighbor.]
First Attempt
sprintf(message, "Greetings from process %d!\n",my_rank);
dest = (my_rank + 1) % p;
MPI_Send(message, strlen(message)+1,MPI_CHAR,
dest, tag, MPI_COMM_WORLD);
source = (my_rank - 1 + p) % p;  /* add p so the result is non-negative in C */
MPI_Recv(message,100, MPI_CHAR, source, tag,
MPI_COMM_WORLD,&status);
printf("PE %d received: %s\n",my_rank,message);
Deadlock
sprintf(message, "Greetings from process %d!\n",my_rank);
dest = (my_rank + 1) % p;
source = (my_rank - 1 + p) % p;  /* add p so the result is non-negative in C */
MPI_Recv(message,100, MPI_CHAR, source, tag,
MPI_COMM_WORLD,&status);
MPI_Send(message, strlen(message)+1,MPI_CHAR,
dest, tag, MPI_COMM_WORLD);
printf("PE %d received: %s\n",my_rank,message);
Buffering Assumption
• Previous code is not safe since it depends on sufficient system buffers being available so that deadlock does not occur.
• SendRecv can be used to guarantee that deadlock does not occur.
SendRecv
• int MPI_Sendrecv(
    void* send_buf /* in */,
    int send_count /* in */,
    MPI_Datatype send_type /* in */,
    int dest /* in */,
    int send_tag /* in */,
    void* recv_buf /* out */,
    int recv_count /* in */,
    MPI_Datatype recv_type /* in */,
    int source /* in */,
    int recv_tag /* in */,
    MPI_Comm communicator /* in */,
    MPI_Status* status /* out */)
Correct Version with Sendrecv
sprintf(omessage, "Greetings from process %d!\n",my_rank);
dest = (my_rank + 1) % p;
source = (my_rank - 1 + p) % p;  /* add p so the result is non-negative in C */
MPI_Sendrecv(omessage,strlen(omessage)+1,MPI_CHAR,dest,tag,
imessage,100,MPI_CHAR,source,tag,MPI_COMM_WORLD,&status);
printf("PE %d received: %s\n",my_rank,imessage);
Lower Level Implementation
sprintf(smessage, "Greetings from process %d!\n",my_rank);
dest = (my_rank + 1) % p;
source = (my_rank - 1 + p) % p;  /* add p so the result is non-negative in C */
if (my_rank % 2 == 0) {
MPI_Send(smessage, strlen(smessage)+1,MPI_CHAR,
dest, tag, MPI_COMM_WORLD);
MPI_Recv(dmessage,100, MPI_CHAR, source, tag,
MPI_COMM_WORLD,&status);
} else {
MPI_Recv(dmessage,100, MPI_CHAR, source, tag,
MPI_COMM_WORLD,&status);
MPI_Send(smessage, strlen(smessage)+1,MPI_CHAR,
dest, tag, MPI_COMM_WORLD);
}
printf("PE %d received: %s\n",my_rank,dmessage);
Compiling and Executing MPI Programs with OpenMPI
• To compile a C program with MPI calls
  – mpicc hello.c -o hello
• To run an MPI program
  – mpirun -np PROCS hello
  – You can provide a hostfile with -hostfile NAME (see man page for details)
dot.c
#include <stdio.h>
float Serial_dot(float x[] /* in */, float y[] /* in */, int n /* in */)
{
int i; float sum = 0.0;
for (i=0; i< n; i++)
sum = sum + x[i]*y[i];
return sum;
}
Parallel Dot
float Parallel_dot(float local_x[] /* in */, float local_y[] /* in */, int n_bar /* in */)
{
    float local_dot, dot;
    local_dot = Serial_dot(local_x, local_y, n_bar);
    MPI_Reduce(&local_dot, &dot, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    return dot;  /* only meaningful on process 0, the root */
}
Parallel All Dot
float Parallel_dot(float local_x[] /* in */, float local_y[] /* in */, int n_bar /* in */)
{
    float local_dot, dot;
    local_dot = Serial_dot(local_x, local_y, n_bar);
    /* Allreduce takes no root argument; every process receives the result */
    MPI_Allreduce(&local_dot, &dot, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    return dot;  /* valid on every process */
}
Reduce
• int MPI_Reduce(
    void* operand /* in */,
    void* result /* out */,
    int count /* in */,
    MPI_Datatype datatype /* in */,
    MPI_Op operator /* in */,
    int root /* in */,
    MPI_Comm communicator /* in */)
• Operators
  – MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD, MPI_LAND, MPI_BAND, MPI_LOR, MPI_BOR, MPI_LXOR, MPI_BXOR, MPI_MAXLOC, MPI_MINLOC
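To make the operator semantics concrete, here is a serial sketch (not from the slides) of what a few of these reductions compute over one array; the helper names `reduce_serial`, `op_sum`, `op_max`, and `op_prod` are hypothetical stand-ins, not MPI functions.

```c
#include <assert.h>

/* Hypothetical serial stand-ins for MPI_SUM, MPI_MAX, MPI_PROD. */
typedef float (*reduce_op)(float, float);

static float op_sum(float a, float b)  { return a + b; }
static float op_max(float a, float b)  { return a > b ? a : b; }
static float op_prod(float a, float b) { return a * b; }

/* Fold the array with the given operator, starting from the first
 * element, mirroring what MPI_Reduce computes across ranks. */
static float reduce_serial(const float *x, int n, reduce_op op) {
    float acc = x[0];
    for (int i = 1; i < n; i++)
        acc = op(acc, x[i]);
    return acc;
}
```

With operands 1, 2, 3, 4, the three operators yield the sum, maximum, and product of the contributions, one result per call, just as a single MPI_Reduce yields one combined value per element of the operand buffer.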
Result at the root: x0+x1+x2+x3+x4+x5+x6+x7
Reduce

[Figure: tree-structured reduction of x0,…,x7 across 8 processes. After step 1, processes 0–3 hold x0+x4, x1+x5, x2+x6, x3+x7; after step 2, process 0 holds x0+x4+x2+x6 and process 1 holds x1+x5+x3+x7; after step 3, process 0 holds the full sum.]
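The tree pattern in the figure can be simulated serially, assuming p is a power of two; `tree_reduce_sum` is an illustrative helper, not an MPI call, with the array index playing the role of the rank.

```c
#include <assert.h>

/* Serial simulation of the tree-structured reduce: p simulated ranks
 * each hold one value; at each step the upper half of the remaining
 * ranks "sends" to the lower half, halving the active set. */
static float tree_reduce_sum(float *val, int p) {
    for (int half = p / 2; half >= 1; half /= 2)
        for (int r = 0; r < half; r++)
            val[r] += val[r + half];  /* rank r receives from rank r+half */
    return val[0];                    /* rank 0 ends with the full sum */
}
```

For 8 ranks this takes log2(8) = 3 steps, matching the three levels of the figure.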
AllReduce
• int MPI_Allreduce(
    void* operand /* in */,
    void* result /* out */,
    int count /* in */,
    MPI_Datatype datatype /* in */,
    MPI_Op operator /* in */,
    MPI_Comm communicator /* in */)
  – Note: unlike MPI_Reduce, there is no root argument; every process receives the result.
• Operators
  – MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD, MPI_LAND, MPI_BAND, MPI_LOR, MPI_BOR, MPI_LXOR, MPI_BXOR, MPI_MAXLOC, MPI_MINLOC
AllReduce

[Figure: butterfly exchange among processes 0–7; after log2(8) = 3 stages, every process holds the full reduction result.]
Broadcast
• int MPI_Bcast(
    void* message /* in on root, out elsewhere */,
    int count /* in */,
    MPI_Datatype datatype /* in */,
    int root /* in */,
    MPI_Comm communicator /* in */)
Broadcast

[Figure: tree broadcast from process 0; the set of processes holding the message doubles each stage: {0}, {0,1}, {0–3}, {0–7}.]
Gather

• int MPI_Gather(
    void* send_data /* in */,
    int send_count /* in */,
    MPI_Datatype send_type /* in */,
    void* recv_data /* out */,
    int recv_count /* in */,
    MPI_Datatype recv_type /* in */,
    int root /* in */,
    MPI_Comm communicator /* in */)
[Figure: each process i of processes 0–3 holds x_i; MPI_Gather collects x0, x1, x2, x3 into one array at the root.]
Scatter

• int MPI_Scatter(
    void* send_data /* in */,
    int send_count /* in */,
    MPI_Datatype send_type /* in */,
    void* recv_data /* out */,
    int recv_count /* in */,
    MPI_Datatype recv_type /* in */,
    int root /* in */,
    MPI_Comm communicator /* in */)
[Figure: the root holds x0 x1 x2 x3; MPI_Scatter sends x_i to process i.]
AllGather

• int MPI_Allgather(
    void* send_data /* in */,
    int send_count /* in */,
    MPI_Datatype send_type /* in */,
    void* recv_data /* out */,
    int recv_count /* in */,
    MPI_Datatype recv_type /* in */,
    MPI_Comm communicator /* in */)
[Figure: each process i of processes 0–3 holds x_i; after MPI_Allgather every process holds x0 x1 x2 x3.]
Matrix-Vector Product (block cyclic storage)

• y = Ax, where y_i = Σ_{0 ≤ j < n} A_ij x_j for 0 ≤ i < m
  – Store blocks of A, x, y in local memory
  – Gather local blocks of x in each process
  – Compute chunks of y in parallel
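The scheme can be sketched serially: each simulated "process" owns a block of m/p rows of A and computes its block of y once all of x has been gathered. The helper `block_matvec` and its flat row-major layout are illustrative assumptions, not the lecture's LOCAL_MATRIX_T code.

```c
#include <assert.h>

/* A is m x n, stored row-major in a flat array; each of p simulated
 * processes computes rows [rank*local_m, (rank+1)*local_m), assuming
 * p divides m. The loop over rank stands in for the parallel step. */
static void block_matvec(int p, int m, int n,
                         const float *A, const float *x, float *y) {
    int local_m = m / p;
    for (int rank = 0; rank < p; rank++)           /* each "process" in turn */
        for (int i = rank * local_m; i < (rank + 1) * local_m; i++) {
            y[i] = 0.0f;
            for (int j = 0; j < n; j++)
                y[i] += A[i * n + j] * x[j];       /* uses the gathered x */
        }
}
```

In the real MPI version, the gather of x is the MPI_Allgather call and each rank runs only its own iteration of the outer loop.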
[Figure: A, x, and y partitioned by blocks of rows across processes 0–3; y = Ax.]
Parallel Matrix-Vector Product
void Parallel_matrix_vector_product(LOCAL_MATRIX_T local_A, int m, int n,
        float local_x[], float global_x[], float local_y[],
        int local_m, int local_n)
{
    /* local_m = m/p, local_n = n/p */
    int i, j;
    MPI_Allgather(local_x, local_n, MPI_FLOAT, global_x, local_n, MPI_FLOAT,
                  MPI_COMM_WORLD);
    for (i = 0; i < local_m; i++) {
        local_y[i] = 0.0;
        for (j = 0; j < n; j++)
            local_y[i] = local_y[i] + local_A[i][j] * global_x[j];
    }
}
Embarrassingly Parallel Example
• Numerical integration (trapezoid rule)
∫_a^b f(t) dt ≈ (h/2) [f(x_0) + 2 f(x_1) + … + 2 f(x_{n-1}) + f(x_n)], where h = (b − a)/n and x_i = a + i·h
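A serial implementation of the rule above is a few lines of C; the names `trapezoid` and the example integrand `f(t) = t²` are illustrative choices, not from the slides.

```c
#include <assert.h>
#include <math.h>

/* Example integrand (assumed for illustration): f(t) = t*t. */
static double f(double t) { return t * t; }

/* Trapezoid rule on [a, b] with n subintervals:
 * h/2 * [f(x0) + 2 f(x1) + ... + 2 f(x_{n-1}) + f(xn)]. */
static double trapezoid(double a, double b, int n) {
    double h = (b - a) / n;
    double sum = (f(a) + f(b)) / 2.0;  /* endpoint terms, weight 1/2 */
    for (int i = 1; i < n; i++)
        sum += f(a + i * h);           /* interior points, weight 1 */
    return h * sum;
}
```

This is "embarrassingly parallel" because each process can sum a disjoint range of interior points independently, with a single reduce at the end.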