Chapter 8
Matrix-Vector Multiplication
Sequential Algorithm
Matrix-Vector Multiplication:
Input:  a[0..m-1, 0..n-1] – matrix with dimension m × n
        b[0..n-1] – vector with dimension n × 1
Output: c[0..m-1] – vector with dimension m × 1

for i ← 0 to m-1
  c[i] ← 0
  for j ← 0 to n-1
    c[i] ← c[i] + a[i,j] × b[j]
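The pseudocode translates directly into C. A minimal sketch, assuming the matrix is stored in row-major order:

/* Sequential matrix-vector multiplication c = a × b.
   a is an m × n matrix stored in row-major order, b has n elements, c has m. */
void matvec(int m, int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < m; i++) {
        c[i] = 0.0;
        for (int j = 0; j < n; j++)
            c[i] = c[i] + a[i * n + j] * b[j];
    }
}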
MPI_Scatter
Divides an array on the specified root process into equal parts and sends one part to each process in the same communicator.
int Sdata[], Rdata[], Send_cnt, Recv_cnt, src, err;
MPI_Comm COMM; MPI_Datatype Stype, Rtype;
err = MPI_Scatter(Sdata, Send_cnt, Stype, Rdata, Recv_cnt, Rtype, src, COMM);
Sdata: send array.
Send_cnt: the number of elements sent to each process.
Stype: send data type.
Rdata: receive data; if Recv_cnt > 1, Rdata is an array.
Recv_cnt: the number of elements received from the source process.
Rtype: receive data type.
COMM: communicator.
src: the process id that is the source of the data.
MPI_Scatter
int Sdata[8] = {1,2,3,4,5,6,7,8}, Rdata[2];
int Send_cnt = 2, Recv_cnt = 2, src = 0;
MPI_Scatter(Sdata, Send_cnt, MPI_INT, Rdata, Recv_cnt, MPI_INT, src, MPI_COMM_WORLD);
Before the call, CPU0 holds Sdata = [1,2,3,4,5,6,7,8].
After MPI_Scatter: CPU0 Rdata = [1,2], CPU1 Rdata = [3,4], CPU2 Rdata = [5,6], CPU3 Rdata = [7,8].
MPI_Scatter
In general, Send_cnt and Recv_cnt must be equal, and Stype and Rtype must be the same; otherwise the call may fail or produce incorrect results.
If there are N processes in the communicator, the size of Sdata must be at least Send_cnt × N.
MPI_Scatterv
A scatter operation in which different processes may end up with different numbers of elements.
Function MPI_Scatterv
MPI_Scatterv (void *send_buffer, int* send_cnt, int* send_disp, MPI_Datatype send_type, void *rec_buffer, int recv_cnt, MPI_Datatype recv_type, int root, MPI_Comm communicator)
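As a sketch of how the per-process count and displacement arrays are filled in, the illustrative program below (the names Sdata, Rdata, send_cnt, send_disp, and MAX_PROCS are our own) lets root 0 distribute ten integers using a block decomposition, so the pieces may differ in size:

#include <mpi.h>
#include <stdio.h>

#define N 10
#define MAX_PROCS 64                        /* assumed upper bound on the number of processes */

int main(int argc, char *argv[])
{
    int p, id;
    int Sdata[N] = {1,2,3,4,5,6,7,8,9,10};  /* significant on the root only */
    int Rdata[N];
    int send_cnt[MAX_PROCS], send_disp[MAX_PROCS];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);

    /* Block decomposition: process i gets elements [i*N/p .. (i+1)*N/p - 1]. */
    for (int i = 0; i < p; i++) {
        int low = i * N / p, high = (i + 1) * N / p;
        send_cnt[i]  = high - low;
        send_disp[i] = low;
    }

    MPI_Scatterv(Sdata, send_cnt, send_disp, MPI_INT,
                 Rdata, send_cnt[id], MPI_INT, 0, MPI_COMM_WORLD);

    printf("Process %d received %d elements\n", id, send_cnt[id]);
    MPI_Finalize();
    return 0;
}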
MPI_Gather
Collects data from every process in the same communicator and puts the result on the specified root process.
int Sdata[], Rdata[], Send_cnt, Recv_cnt, dest, err;
MPI_Comm COMM; MPI_Datatype Stype, Rtype;
err = MPI_Gather( Sdata, Send_cnt, Stype, Rdata, Recv_cnt, Rtype,
dest, COMM);
Sdata: send data; if Send_cnt > 1, Sdata is an array.
Send_cnt: the number of elements sent from each process.
Stype: send data type.
Rdata: receive array.
Recv_cnt: the number of elements received from each sending process.
Rtype: receive data type.
COMM: communicator.
dest: the process id that collects the data from the other processes.
MPI_Gather
int Send_cnt = 2, Recv_cnt = 2, dest = 0;
MPI_Gather(Sdata, Send_cnt, MPI_INT, Rdata, Recv_cnt, MPI_INT, dest, MPI_COMM_WORLD);
Before the call: CPU0 Sdata = [1,2], CPU1 Sdata = [3,4], CPU2 Sdata = [5,6], CPU3 Sdata = [7,8].
After MPI_Gather: CPU0 Rdata = [1,2,3,4,5,6,7,8].
MPI_Gather
In general, Send_cnt and Recv_cnt must be equal, and Stype and Rtype must be the same; otherwise the call may fail or produce incorrect results.
If there are N processes in the communicator, the size of Rdata must be at least Send_cnt × N.
MPI_Gatherv
A gather operation in which the number of elements collected from different processes may vary.
Function MPI_Gatherv
MPI_Gatherv (void* send_buffer, int send_cnt, MPI_Datatype send_type, void *recv_buffer, int* recv_cnt, int* recv_disp, MPI_Datatype recv_type, int root, MPI_Comm communicator)
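Continuing the illustrative MPI_Scatterv sketch above, the reverse operation reassembles the blocks on the root; the same counts and displacements that described the scatter also describe the gather:

/* Each process returns its send_cnt[id] elements; root 0 rebuilds the full array. */
MPI_Gatherv(Rdata, send_cnt[id], MPI_INT,
            Sdata, send_cnt, send_disp, MPI_INT, 0, MPI_COMM_WORLD);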
MPI_Allgather
Like MPI_Gather, but MPI_Allgather delivers the gathered result to every process in the communicator.
int Sdata[], Rdata[], Send_cnt, Recv_cnt, err;
MPI_Comm Comm; MPI_Datatype Stype, Rtype;
err = MPI_Allgather( Sdata, Send_cnt, Stype, Rdata, Recv_cnt, Rtype,
Comm);
Sdata: send data; if Send_cnt > 1, Sdata is an array.
Send_cnt: the number of elements sent from each process.
Stype: send data type.
Rdata: receive array.
Recv_cnt: the number of elements received from each process.
Rtype: receive data type.
Comm: communicator.
MPI_Allgather
int Send_cnt = 2, Recv_cnt = 2;
MPI_Allgather(Sdata, Send_cnt, MPI_INT, Rdata, Recv_cnt, MPI_INT, MPI_COMM_WORLD);
Note that Recv_cnt is the number of elements received from each process, so it is 2 even though Rdata holds 8 elements in total.
Before the call: CPU0 Sdata = [1,2], CPU1 Sdata = [3,4], CPU2 Sdata = [5,6], CPU3 Sdata = [7,8].
After MPI_Allgather, every CPU has Rdata = [1,2,3,4,5,6,7,8].
MPI_Allgatherv
An all-gather operation in which different processes may contribute different numbers of elements.
Function MPI_Allgatherv
int MPI_Allgatherv (void* send_buffer, int send_cnt, MPI_Datatype send_type, void* receive_buffer, int* receive_cnt, int* receive_disp, MPI_Datatype receive_type, MPI_Comm communicator)
MPI_Alltoall
An all-to-all exchange of data elements among processes
Function MPI_Alltoall
int MPI_Alltoall (void *send_buffer, int send_count, MPI_Datatype send_type, void *recv_buffer, int recv_count, MPI_Datatype recv_type, MPI_Comm communicator)
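A minimal sketch of the all-to-all pattern, reusing p, id and MAX_PROCS from the MPI_Scatterv sketch above; each process sends one integer to every process and receives one integer from every process:

int Sdata[MAX_PROCS], Rdata[MAX_PROCS];
for (int i = 0; i < p; i++)
    Sdata[i] = 100 * id + i;              /* element destined for process i */
MPI_Alltoall(Sdata, 1, MPI_INT, Rdata, 1, MPI_INT, MPI_COMM_WORLD);
/* Rdata[i] now holds 100*i + id, the value that process i sent to this process. */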
Data Decomposition Options
Rowwise block-striped decomposition
Columnwise block-striped decomposition
Checkerboard block decomposition
Creating a Communicator
There are four collective communication operations:
1. The processes in the first column of the virtual process grid participate in the communication that gathers vector b when p is not square.
2. The processes in the first row of the virtual process grid participate in the communication that scatters vector b when p is not square.
3. Each first-row process broadcasts its block of b to the other processes in the same column of the process grid.
4. Each row of processes in the grid performs an independent sum-reduction, yielding vector c in the first column of processes.
int MPI_Dims_create
int MPI_Dims_create (int nodes, int dims, int *size)
nodes: an input parameter, the number of processes in the grid.
dims: an input parameter, the number of dimensions in the desired grid.
size: an input/output parameter, the size of each grid dimension.
int MPI_Cart_create
int MPI_Cart_create (MPI_Comm old_comm, int dims, int *size, int *periodic, int reorder, MPI_Comm *cart_comm)
old_comm: the old communicator. All processes in the old communicator must collectively call the function.
dims: the number of grid dimensions.
*size: an array of size dims. Element size[j] is the number of processes in dimension j.
*periodic: an array of size dims. Element periodic[j] should be 1 if dimension j is periodic (communications wrap around the edges of the grid) and 0 otherwise.
reorder: a flag indicating if process ranks can be reordered. If reorder is 0, the rank of each process in the new communicator is the same as its rank in old_comm.
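A minimal sketch of building a two-dimensional virtual process grid with these two functions (variable names are illustrative):

int p, id, size[2] = {0, 0}, periodic[2] = {0, 0};
MPI_Comm grid_comm;

MPI_Comm_size(MPI_COMM_WORLD, &p);
MPI_Dims_create(p, 2, size);          /* e.g. p = 6 gives size = {3, 2} */
MPI_Cart_create(MPI_COMM_WORLD, 2, size, periodic, 1, &grid_comm);
MPI_Comm_rank(grid_comm, &id);        /* rank may differ from the old rank because reorder = 1 */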
Reading a Checkerboard Matrix
int MPI_Cart_rank
int MPI_Cart_rank (MPI_Comm comm, int *coords, int *rank)
comm: an input parameter whose value is the Cartesian communicator in which the communication is occurring.
coords : an input parameter: an integer array containing the coordinates of a process in the virtual grid.
rank : the rank of the process in comm with the specified coordinates.
int MPI_Cart_coords
int MPI_Cart_coords (MPI_Comm comm, int rank, int dims, int *coords)
comm: the Cartesian communicator being examined.
rank: the rank of the process whose coordinates we seek.
dims: the number of dimensions in the process grid.
The function returns through the last parameter the coordinates of the specified process in the grid.
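Continuing the grid_comm sketch above, the two functions translate between grid coordinates and ranks; here a process finds the rank of the first process in its own column:

int coords[2], col_coords[2], first_in_col;

MPI_Cart_coords(grid_comm, id, 2, coords);       /* this process's (row, column) */
col_coords[0] = 0;                               /* row 0 ...                    */
col_coords[1] = coords[1];                       /* ... in this process's column */
MPI_Cart_rank(grid_comm, col_coords, &first_in_col);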
int MPI_Comm_split
int MPI_Comm_split (MPI_Comm old_comm, int partition, int new_rank, MPI_Comm *new_comm)
old_comm: the existing communicator to which these processes belong.
partition: the partition number.
new_rank: rank order of the process within the new communicator.
The function returns through new_comm a pointer to the new communicator to which this process belongs.
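A minimal sketch, continuing the grid example above: splitting the grid into one communicator per row, so that each row can later perform its own sum-reduction:

MPI_Comm row_comm;
/* coords[] is the (row, column) position from the MPI_Cart_coords sketch above.
   All processes with the same row number (the partition value) join the same
   new communicator; the column number orders their ranks within it. */
MPI_Comm_split(grid_comm, coords[0], coords[1], &row_comm);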
Benchmarking
High Performance Computing
Program Parallelism
Algorithm Level
Program Level
Instruction Level
Cache Memory
To improve the average memory access time, modern computer systems use a high-speed cache memory.
◦ Temporal locality: a recently fetched word is likely to be fetched again in the near future.
◦ Spatial locality: the cache keeps nearby words, which are likely to be accessed soon.
A Matrix Multiplication
Simple matrix multiplication C = A × B:
for i=1 to n do
  for j=1 to n do
    for k=1 to n do
      C[i,j] = C[i,j] + A[i,k] * B[k,j]
Improving Spatial Locality
Reordering the loops into ikj form improves spatial locality:
for i=1 to n do
  for k=1 to n do
    for j=1 to n do
      C[i,j] = C[i,j] + A[i,k] * B[k,j]
Improving Temporal Locality (1/2)
We divide the matrices into rectangular sub-matrices, as shown below. We have chosen a sub-matrix size s = n ÷ 3. The first sub-matrix C11 can be computed by sub-matrix multiplication:
  C = [ C11 C12 C13 ]   A = [ A11 A12 A13 ]   B = [ B11 B12 B13 ]
      [ C21 C22 C23 ]       [ A21 A22 A23 ]       [ B21 B22 B23 ]
      [ C31 C32 C33 ]       [ A31 A32 A33 ]       [ B31 B32 B33 ]

  C11 = A11 × B11 + A12 × B21 + A13 × B31
Improving Temporal Locality (2/2)
The program for the reformulated algorithm:
for it=1 to n by s do
  for kt=1 to n by s do
    for jt=1 to n by s do
      for i=it to min(it+s-1,n) do
        for k=kt to min(kt+s-1,n) do
          for j=jt to min(jt+s-1,n) do
            C[i,j] = C[i,j] + A[i,k] * B[k,j]
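A C sketch of the same blocked algorithm, assuming square n × n matrices stored in row-major order and C initialized to zero:

#define MIN(a, b) ((a) < (b) ? (a) : (b))

void blocked_matmul(int n, int s, const double *A, const double *B, double *C)
{
    for (int it = 0; it < n; it += s)
        for (int kt = 0; kt < n; kt += s)
            for (int jt = 0; jt < n; jt += s)
                /* multiply the s × s tile of A at (it, kt) by the tile of B at (kt, jt) */
                for (int i = it; i < MIN(it + s, n); i++)
                    for (int k = kt; k < MIN(kt + s, n); k++)
                        for (int j = jt; j < MIN(jt + s, n); j++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}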
Storage Order
Row-major and column-major storage order for a 3 × 4 array.
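For reference, the index of element (i, j) of a 3 × 4 array under the two storage orders (a minimal sketch):

int rows = 3, cols = 4, i = 1, j = 2;
int row_major_index    = i * cols + j;   /* = 6: elements of a row are contiguous in memory    */
int column_major_index = j * rows + i;   /* = 7: elements of a column are contiguous in memory */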