Chapter 8
Matrix-Vector Multiplication
Sequential Algorithm
Matrix-Vector Multiplication:
Input:  a[0..m-1, 0..n-1] – matrix with dimension m × n
        b[0..n-1] – vector with dimension n × 1
Output: c[0..m-1] – vector with dimension m × 1

for i ← 0 to m-1
  c[i] ← 0
  for j ← 0 to n-1
    c[i] ← c[i] + a[i,j] × b[j]
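The pseudocode translates directly into C. A minimal sketch, assuming the matrix is stored in row-major order:

/* Sequential matrix-vector multiplication c = a × b.
   a is an m × n matrix stored in row-major order, b has n elements, c has m. */
void matvec(int m, int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < m; i++) {
        c[i] = 0.0;
        for (int j = 0; j < n; j++)
            c[i] = c[i] + a[i * n + j] * b[j];
    }
}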
MPI_Scatter
Divides an array on the specified root process into equal parts and sends one part to each process in the same communicator.
int Sdata[], Rdata[], Send_cnt, Recv_cnt, src, err;
MPI_Comm COMM; MPI_Datatype Stype, Rtype;
err = MPI_Scatter(Sdata, Send_cnt, Stype, Rdata, Recv_cnt, Rtype, src, COMM);
Sdata: send array.
Send_cnt: the number of elements sent to each process.
Stype: send data type.
Rdata: receive data; if Recv_cnt > 1, Rdata is an array.
Recv_cnt: the number of elements received from the source process.
Rtype: receive data type.
COMM: communicator.
src: the process id that is the source of the data.
MPI_Scatter
int Sdata[8] = {1,2,3,4,5,6,7,8}, Rdata[2];
int Send_cnt = 2, Recv_cnt = 2, src = 0;
MPI_Scatter(Sdata, Send_cnt, MPI_INT, Rdata, Recv_cnt, MPI_INT, src, MPI_COMM_WORLD);
Before the call, CPU0 holds Sdata = [1,2,3,4,5,6,7,8].
After MPI_Scatter: CPU0 Rdata = [1,2], CPU1 Rdata = [3,4], CPU2 Rdata = [5,6], CPU3 Rdata = [7,8].
MPI_Scatter
In general, Send_cnt and Recv_cnt must be equal, and Stype and Rtype must be the same; otherwise the call may fail or produce incorrect results.
If there are N processes in the communicator, the size of Sdata must be at least Send_cnt × N.
MPI_Scatterv
A scatter operation in which different processes may end up with different numbers of elements.
Function MPI_Scatterv
MPI_Scatterv (void *send_buffer, int* send_cnt, int* send_disp, MPI_Datatype send_type, void *rec_buffer, int recv_cnt, MPI_Datatype recv_type, int root, MPI_Comm communicator)
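As a sketch of how the per-process count and displacement arrays are filled in, the illustrative program below (the names Sdata, Rdata, send_cnt, send_disp, and MAX_PROCS are our own) lets root 0 distribute ten integers using a block decomposition, so the pieces may differ in size:

#include <mpi.h>
#include <stdio.h>

#define N 10
#define MAX_PROCS 64                        /* assumed upper bound on the number of processes */

int main(int argc, char *argv[])
{
    int p, id;
    int Sdata[N] = {1,2,3,4,5,6,7,8,9,10};  /* significant on the root only */
    int Rdata[N];
    int send_cnt[MAX_PROCS], send_disp[MAX_PROCS];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);

    /* Block decomposition: process i gets elements [i*N/p .. (i+1)*N/p - 1]. */
    for (int i = 0; i < p; i++) {
        int low = i * N / p, high = (i + 1) * N / p;
        send_cnt[i]  = high - low;
        send_disp[i] = low;
    }

    MPI_Scatterv(Sdata, send_cnt, send_disp, MPI_INT,
                 Rdata, send_cnt[id], MPI_INT, 0, MPI_COMM_WORLD);

    printf("Process %d received %d elements\n", id, send_cnt[id]);
    MPI_Finalize();
    return 0;
}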
MPI_Gather
Collects data from every process in the same communicator and puts the result on the specified root process.
int Sdata[], Rdata[], Send_cnt, Recv_cnt, dest, err;
MPI_Comm COMM; MPI_Datatype Stype, Rtype;
err = MPI_Gather( Sdata, Send_cnt, Stype, Rdata, Recv_cnt, Rtype,
dest, COMM);
Sdata: send data; if Send_cnt > 1, Sdata is an array.
Send_cnt: the number of elements sent from each process.
Stype: send data type.
Rdata: receive array.
Recv_cnt: the number of elements received from each sending process.
Rtype: receive data type.
COMM: communicator.
dest: the process id that collects the data from the other processes.
MPI_Gather
int Send_cnt = 2, Recv_cnt = 2, dest = 0;
MPI_Gather(Sdata, Send_cnt, MPI_INT, Rdata, Recv_cnt, MPI_INT, dest, MPI_COMM_WORLD);
Before the call: CPU0 Sdata = [1,2], CPU1 Sdata = [3,4], CPU2 Sdata = [5,6], CPU3 Sdata = [7,8].
After MPI_Gather: CPU0 Rdata = [1,2,3,4,5,6,7,8].
MPI_Gather
In general, Send_cnt and Recv_cnt must be equal, and Stype and Rtype must be the same; otherwise the call may fail or produce incorrect results.
If there are N processes in the communicator, the size of Rdata must be at least Send_cnt × N.
MPI_Gatherv
A gather operation in which the number of elements collected from different processes may vary.
Function MPI_Gatherv
MPI_Gatherv (void* send_buffer, int send_cnt, MPI_Datatype send_type, void *recv_buffer, int* recv_cnt, int* recv_disp, MPI_Datatype recv_type, int root, MPI_Comm communicator)
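Continuing the illustrative MPI_Scatterv sketch above, the reverse operation reassembles the blocks on the root; the same counts and displacements that described the scatter also describe the gather:

/* Each process returns its send_cnt[id] elements; root 0 rebuilds the full array. */
MPI_Gatherv(Rdata, send_cnt[id], MPI_INT,
            Sdata, send_cnt, send_disp, MPI_INT, 0, MPI_COMM_WORLD);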
MPI_Allgather
Like MPI_Gather, but MPI_Allgather delivers the gathered result to every process in the communicator.
int Sdata[], Rdata[], Send_cnt, Recv_cnt, err;
MPI_Comm Comm; MPI_Datatype Stype, Rtype;
err = MPI_Allgather( Sdata, Send_cnt, Stype, Rdata, Recv_cnt, Rtype,
Comm);
Sdata: send data; if Send_cnt > 1, Sdata is an array.
Send_cnt: the number of elements sent from each process.
Stype: send data type.
Rdata: receive array.
Recv_cnt: the number of elements received from each process.
Rtype: receive data type.
Comm: communicator.
MPI_Allgather
int Send_cnt = 2, Recv_cnt = 2;
MPI_Allgather(Sdata, Send_cnt, MPI_INT, Rdata, Recv_cnt, MPI_INT, MPI_COMM_WORLD);
Note that Recv_cnt is the number of elements received from each process, so it is 2 even though Rdata holds 8 elements in total.
Before the call: CPU0 Sdata = [1,2], CPU1 Sdata = [3,4], CPU2 Sdata = [5,6], CPU3 Sdata = [7,8].
After MPI_Allgather, every CPU has Rdata = [1,2,3,4,5,6,7,8].
MPI_Allgatherv
An all-gather operation in which different processes may contribute different numbers of elements.
Function MPI_Allgatherv
int MPI_Allgatherv (void* send_buffer, int send_cnt, MPI_Datatype send_type, void* receive_buffer, int* receive_cnt, int* receive_disp, MPI_Datatype receive_type, MPI_Comm communicator)
MPI_Alltoall
An all-to-all exchange of data elements among processes
Function MPI_Alltoall
int MPI_Alltoall (void *send_buffer, int send_count, MPI_Datatype send_type, void *recv_buffer, int recv_count, MPI_Datatype recv_type, MPI_Comm communicator)
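A minimal sketch of the all-to-all pattern, reusing p, id and MAX_PROCS from the MPI_Scatterv sketch above; each process sends one integer to every process and receives one integer from every process:

int Sdata[MAX_PROCS], Rdata[MAX_PROCS];
for (int i = 0; i < p; i++)
    Sdata[i] = 100 * id + i;              /* element destined for process i */
MPI_Alltoall(Sdata, 1, MPI_INT, Rdata, 1, MPI_INT, MPI_COMM_WORLD);
/* Rdata[i] now holds 100*i + id, the value that process i sent to this process. */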
Data Decomposition Options
Rowwise block-striped decomposition
Columnwise block-striped decomposition
Checkerboard block decomposition
Creating a Communicator
There are four collective communication operations:
1. The processes in the first column of the virtual process grid participate in the communication that gathers vector b when p is not square.
2. The processes in the first row of the virtual process grid participate in the communication that scatters vector b when p is not square.
3. Each first-row process broadcasts its block of b to the other processes in the same column of the process grid.
4. Each row of processes in the grid performs an independent sum-reduction, yielding vector c in the first column of processes.
int MPI_Dims_create
int MPI_Dims_create (int nodes, int dims, int *size)
nodes: an input parameter, the number of processes in the grid.
dims: an input parameter, the number of dimensions in the desired grid.
size: an input/output parameter, the size of each grid dimension.
int MPI_Cart_create
int MPI_Cart_create (MPI_Comm old_comm, int dims, int *size, int *periodic, int reorder, MPI_Comm *cart_comm)
old_comm: the old communicator. All processes in the old communicator must collectively call the function.
dims: the number of grid dimensions.
*size: an array of size dims. Element size[j] is the number of processes in dimension j.
*periodic: an array of size dims. Element periodic[j] should be 1 if dimension j is periodic (communications wrap around the edges of the grid) and 0 otherwise.
reorder: a flag indicating if process ranks can be reordered. If reorder is 0, the rank of each process in the new communicator is the same as its rank in old_comm.
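A minimal sketch of building a two-dimensional virtual process grid with these two functions (variable names are illustrative):

int p, id, size[2] = {0, 0}, periodic[2] = {0, 0};
MPI_Comm grid_comm;

MPI_Comm_size(MPI_COMM_WORLD, &p);
MPI_Dims_create(p, 2, size);          /* e.g. p = 6 gives size = {3, 2} */
MPI_Cart_create(MPI_COMM_WORLD, 2, size, periodic, 1, &grid_comm);
MPI_Comm_rank(grid_comm, &id);        /* rank may differ from the old rank because reorder = 1 */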
Reading a Checkerboard Matrix
int MPI_Cart_rank
int MPI_Cart_rank (MPI_Comm comm, int *coords, int *rank)
comm: an input parameter whose value is the Cartesian communicator in which the communication is occurring.
coords : an input parameter: an integer array containing the coordinates of a process in the virtual grid.
rank : the rank of the process in comm with the specified coordinates.
int MPI_Cart_coords
int MPI_Cart_coords (MPI_Comm comm, int rank, int dims, int *coords)
comm: the Cartesian communicator being examined.
rank: the rank of the process whose coordinates we seek.
dims: the number of dimensions in the process grid.
The function returns through the last parameter the coordinates of the specified process in the grid.
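Continuing the grid_comm sketch above, the two functions translate between grid coordinates and ranks; here a process finds the rank of the first process in its own column:

int coords[2], col_coords[2], first_in_col;

MPI_Cart_coords(grid_comm, id, 2, coords);       /* this process's (row, column) */
col_coords[0] = 0;                               /* row 0 ...                    */
col_coords[1] = coords[1];                       /* ... in this process's column */
MPI_Cart_rank(grid_comm, col_coords, &first_in_col);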
int MPI_Comm_split
int MPI_Comm_split (MPI_Comm old_comm, int partition, int new_rank, MPI_Comm *new_comm)
old_comm: the existing communicator to which these processes belong.
partition: the partition number.
new_rank: rank order of the process within the new communicator.
The function returns through new_comm a pointer to the new communicator to which this process belongs.
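A minimal sketch, continuing the grid example above: splitting the grid into one communicator per row, so that each row can later perform its own sum-reduction:

MPI_Comm row_comm;
/* coords[] is the (row, column) position from the MPI_Cart_coords sketch above.
   All processes with the same row number (the partition value) join the same
   new communicator; the column number orders their ranks within it. */
MPI_Comm_split(grid_comm, coords[0], coords[1], &row_comm);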
Benchmarking
High Performance Computing
Program Parallelism
Algorithm Level
Program Level
Instruction Level
Cache Memory
To improve the average memory access time, modern computer systems use a high-speed cache memory.
◦ Temporal locality: a recently fetched word is likely to be fetched again in the near future.
◦ Spatial locality: the cache keeps nearby words, which are likely to be accessed soon.
A Matrix Multiplication
Simple matrix multiplication C = A × B:
for i=1 to n do
  for j=1 to n do
    for k=1 to n do
      C[i,j] = C[i,j] + A[i,k] * B[k,j]
Improving Spatial Locality
Reordering the loops into ikj form improves spatial locality:
for i=1 to n do
  for k=1 to n do
    for j=1 to n do
      C[i,j] = C[i,j] + A[i,k] * B[k,j]
Improving Temporal Locality (1/2)
We divide the matrices into rectangular sub-matrices, as shown below. We have chosen a sub-matrix size s = n ÷ 3. The first sub-matrix C11 can be computed by sub-matrix multiplication:
  C = [ C11 C12 C13 ]   A = [ A11 A12 A13 ]   B = [ B11 B12 B13 ]
      [ C21 C22 C23 ]       [ A21 A22 A23 ]       [ B21 B22 B23 ]
      [ C31 C32 C33 ]       [ A31 A32 A33 ]       [ B31 B32 B33 ]

  C11 = A11 × B11 + A12 × B21 + A13 × B31
Improving Temporal Locality (2/2)
The program for the reformulated algorithm:
for it=1 to n by s do
  for kt=1 to n by s do
    for jt=1 to n by s do
      for i=it to min(it+s-1,n) do
        for k=kt to min(kt+s-1,n) do
          for j=jt to min(jt+s-1,n) do
            C[i,j] = C[i,j] + A[i,k] * B[k,j]
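A C sketch of the same blocked algorithm, assuming square n × n matrices stored in row-major order and C initialized to zero:

#define MIN(a, b) ((a) < (b) ? (a) : (b))

void blocked_matmul(int n, int s, const double *A, const double *B, double *C)
{
    for (int it = 0; it < n; it += s)
        for (int kt = 0; kt < n; kt += s)
            for (int jt = 0; jt < n; jt += s)
                /* multiply the s × s tile of A at (it, kt) by the tile of B at (kt, jt) */
                for (int i = it; i < MIN(it + s, n); i++)
                    for (int k = kt; k < MIN(kt + s, n); k++)
                        for (int j = jt; j < MIN(jt + s, n); j++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}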
Storage Order
Row-major and column-major storage order for a 3 × 4 array.
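For reference, the index of element (i, j) of a 3 × 4 array under the two storage orders (a minimal sketch):

int rows = 3, cols = 4, i = 1, j = 2;
int row_major_index    = i * cols + j;   /* = 6: elements of a row are contiguous in memory    */
int column_major_index = j * rows + i;   /* = 7: elements of a column are contiguous in memory */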