Parallel and Distributed Processing CSE 8380

Post on 12-Jan-2016

55 views 0 download

description

Parallel and Distributed Processing CSE 8380. February 8, 2005 Session 8. Contents. Computing sum on EREW PRAM Computing all partial sums on EREW PRAM Matrix Multiplication on CREW Other Algorithms. Recall (PRAM Model). Control. Private Memory. P 1. - PowerPoint PPT Presentation

Transcript of Parallel and Distributed Processing CSE 8380

Computer Science and Engineering

Parallel and Distributed Processing

CSE 8380

February 8, 2005February 8, 2005

Session 8Session 8

Computer Science and Engineering

Contents

Computing sum on EREW PRAM

Computing all partial sums on EREW PRAM

Matrix Multiplication on CREW

Other Algorithms

Computer Science and Engineering

Recall (PRAM Model)

Synchronized Read Compute Write Cycle

EREW ERCW CREW CRCW Complexity:

T(n), P(n), C(n)

Control

PrivateMemory

P1

PrivateMemory

P2

PrivateMemory

Pp

Global

Memory

Computer Science and Engineering

Sum on EREW PRAM

Compute the sum of an array A[1..n]

We use n/2 processors

Summation will end up in location A[n]

For simplicity, we assume n is an integral power of 2

Work is done in log n iterations. In the first iteration, all processors are active. In the second iteration, only half the processors will be active, and so on.

Computer Science and Engineering

ExampleSum of an array of numbers on the EREW model

Example of algorithm Sum_EREW when n=8

5 2 10 1 8 12 7 3

5 7 10 11 8 20 7 10

5 7 10 18 8 20 7 30

5 7 10 18 8 20 7 48

Active processors

P1, P2, P3, P4

P2, P4

P4

A[1] A[2] A[3] A[4] A[5] A[6] A[7] A[8]

Computer Science and Engineering

Group Work

1- Discuss the algorithm with your neighbor

2- Design the main loops

3- Discuss the Complexity

Computer Science and Engineering

Algorithm sum_EREW

for i =1 to log n do

forall Pj, where 1 < j < n/2 do in parallel

if (2j mod 2i) = 0 then

A[2j] A[2j] + A[j – 2i-1]

endif

endfor

endfor

Computer Science and Engineering

Complexity

Run time: T(n) = O(log n)

Number of processors: P(n) = n/2

Cost: c(n) = O(n log n)

Is it cost optimal?

Computer Science and Engineering

All partial sums - EREW PRAM

Compute all partial sums of an array A[1..n]

These are A[1], A[1]+A[2], A[1]+A[2]+A[3], …, A[1]+A[2]+… + A[n].

At first glance you might think it is inherently sequential because one must add up the first k elements before adding in element k+1

We’ll see that it can be parallelized

Let’s extend sum_EREW to do that

Computer Science and Engineering

All partial sums (cont.)

We noticed that in sum_EREW most processors are idle most of the time

By exploiting these idle processors, we should be able to compute all partial sums in the same amount of time it takes to compute the single sum

Computer Science and Engineering

All partial sums (cont.)

Compute all partial sums of A[1..n]

We use n-1 processors (P2, P3, …, Pn)

A[k] will be replaced by the sum of all elements preceding and including A[k]

In algorithm sum_EREW, at iteration i, only n/2i processors were active, while in allsums_EREW, nearly all processors will be in use.

Computer Science and Engineering

ExampleAll partial sums on EREW PRAM

Example of algorithm allsums_EREW when n=8

5 2 10 1 8 12 7 3

5 7 12 11 9 20 19 10

5 7 17 18 21 31 28 30

5 7 17 18 26 38 45 48

Active processors

P2, P3, …, P8

P3, P4, …, P8

P5, P6, P7, P8

A[1] A[2] A[3] A[4] A[5] A[6] A[7] A[8]

Computer Science and Engineering

Group Work

1- Discuss the algorithm with your neighbor

2- Design the main loops

3- Discuss the Complexity

Computer Science and Engineering

Algorithm allsums_EREW

for i =1 to log n do

forall Pj, where 2i-1 + 1 < j < n do in parallel

a[j] A[j] + A[j – 2i-1]

endfor

endfor

Computer Science and Engineering

Complexity

Run time: T(n) = O(log n)

Number of processors: P(n) = n-1

Cost: c(n) = O(n log n)

Computer Science and Engineering

Matrix Multiplication

Two n X n matrices For clarity, we assume n is power of 2

We use CREW to allow concurrent read Two matrices in the shared memory A[1..n,1..n],

B[1..n,1..n].

We will use n3 processors We will also show how to reduce the number of

processors

Computer Science and Engineering

Matrix Multiplication (cont)

The n3 processors are arranged in a three dimensional array. Processor Pi,j,k is the one with index (i,j,k)

We will use the 3 dimensional array C[1..n,1..n,1..n] in the shared memory as working space.

The resulting matrix will be stored in locations C[i,j,n], where 1<= i,j <= n

Computer Science and Engineering

Two steps

1. All n3 processors operate in parallel to compute n3 multiplications. (For each of the n2 cells in the output matrix, n products are computed)

2. The n products are summed to produce the final value of each cell

Computer Science and Engineering

Matrix multiplicationUsing n3 processors

Two steps of the Algorithms

1. Each processors Pi,j,k computes the product of A[i,k].B[k,j] and store it in C[i,j,k].

2. The idea of Algorithm Sum_EREW is applied along the k dimension n2 times in parallel to compute C[i,j,n], where 1<i, j<n. Each processors Pi,j,k computes the product of A[i,k].B[k,j] and store it in C[i,j,k].

Computer Science and Engineering

Algorithm MatMult_CREW

/* step 1 */

forall Pi,j,k, where 1 < i, j, k<n do in parallelC[i,j,k] A[i,k] * B[k,j]

Endfor

/* step 2 */for i=1 to log n do

forall Pi,j,k, where 1 < i, j<n & 1<k<n/2 do in parallelif (2k mod 2l) = 0 then C[i,j,2k] C[i,j,2k] + C[i,j, 2k-2l-1]endif

endfor

/* the output matrix is stored in locations C[i,j,n], where l<i, j<n */

endfor

Computer Science and Engineering

Complexity

Run time: T(n) = O(log n)

Number of processors: P(n) = n3

Cost: c(n) = O(n3 log n)

Is it cost optimal?

Computer Science and Engineering

Example

Multiplying two 2 x 2 matrices using Algorithm MatMult_CREW

C[1,1,1] A[1,1]B[1,1] C[1,2,1] A[1,1]B[1,2]

C[2,1,1] A[2,1]B[1,1] C[2,2,1] A[2,1]B[1,2]

C[1,1,2] A[1,2]B[2,1] C[1,2,2] A[1,2]B[2,2]

C[2,1,2] A[2,2]B[2,1] C[2,2,2] A[2,2]B[2,2]

i

j

ij

P1,1,1 K = 1 P1,2,1

P1,1,2 P1,2,2K = 2

After step 1

P2,1,1 P2,2,1

P2,1,2 P2,2,2

Computer Science and Engineering

Example (cont.)

C[1,1,2] C[1,1,2]+C[1,1,1] C[1,2,2] C[1,2,2]+C[1,2,1]

C[2,1,2] C[2,1,2]+C[2,1,1] C[2,2,2] C[2,2,2]+C[2,2,1]

ij

P1,1,2 P1,2,2K = 2

After step 2

P2,1,2 P2,2,2

Multiplying two 2 x 2 matrices using Algorithm MatMult_CREW

Computer Science and Engineering

Matrix multiplicationreducing the number of processors to n3/log n

Processors are arranged in n X n X n/(log n) 3-dimensional array

1. Each processors Pi,j,k, where 1 <k < n/log n, computes the sum of (log n) product. This step will produce (n3/log n) partial sums.

2. The sum of products produced in step 1 are added to produce the resulting matrix as discussed previously.

Complexity analysis Run time, T(n) = O(log n) Number of processors, P(n) = n3/log n Cost, c(n) = O(n3)

Computer Science and Engineering

Searching

Given A = a1, a2, …, ai, …, an & x

Determine whether x = ai for some i Sequential Binary Search O(log n) Simple idea

Divide the list among the processors and let each processor conduct its own binary search

EREW PRAM O(log n/p) + O(log p) = O(log n) CREW O(log n/p)

Computer Science and Engineering

Parallel Binary Search

Split A into p+1 segments of almost equal length

Compare x with p elements at the boundary between successive segments

Either x = ai or search is restricted to only one of the p+1 segments

Repeat until x is found or length of the list is <= p