Post on 12-Jan-2016
description
Computer Science and Engineering
Parallel and Distributed Processing
CSE 8380
February 8, 2005February 8, 2005
Session 8Session 8
Computer Science and Engineering
Contents
Computing sum on EREW PRAM
Computing all partial sums on EREW PRAM
Matrix Multiplication on CREW
Other Algorithms
Computer Science and Engineering
Recall (PRAM Model)
Synchronized Read Compute Write Cycle
EREW ERCW CREW CRCW Complexity:
T(n), P(n), C(n)
Control
PrivateMemory
P1
PrivateMemory
P2
PrivateMemory
Pp
Global
Memory
Computer Science and Engineering
Sum on EREW PRAM
Compute the sum of an array A[1..n]
We use n/2 processors
Summation will end up in location A[n]
For simplicity, we assume n is an integral power of 2
Work is done in log n iterations. In the first iteration, all processors are active. In the second iteration, only half the processors will be active, and so on.
Computer Science and Engineering
ExampleSum of an array of numbers on the EREW model
Example of algorithm Sum_EREW when n=8
5 2 10 1 8 12 7 3
5 7 10 11 8 20 7 10
5 7 10 18 8 20 7 30
5 7 10 18 8 20 7 48
Active processors
P1, P2, P3, P4
P2, P4
P4
A[1] A[2] A[3] A[4] A[5] A[6] A[7] A[8]
Computer Science and Engineering
Group Work
1- Discuss the algorithm with your neighbor
2- Design the main loops
3- Discuss the Complexity
Computer Science and Engineering
Algorithm sum_EREW
for i =1 to log n do
forall Pj, where 1 < j < n/2 do in parallel
if (2j mod 2i) = 0 then
A[2j] A[2j] + A[j – 2i-1]
endif
endfor
endfor
Computer Science and Engineering
Complexity
Run time: T(n) = O(log n)
Number of processors: P(n) = n/2
Cost: c(n) = O(n log n)
Is it cost optimal?
Computer Science and Engineering
All partial sums - EREW PRAM
Compute all partial sums of an array A[1..n]
These are A[1], A[1]+A[2], A[1]+A[2]+A[3], …, A[1]+A[2]+… + A[n].
At first glance you might think it is inherently sequential because one must add up the first k elements before adding in element k+1
We’ll see that it can be parallelized
Let’s extend sum_EREW to do that
Computer Science and Engineering
All partial sums (cont.)
We noticed that in sum_EREW most processors are idle most of the time
By exploiting these idle processors, we should be able to compute all partial sums in the same amount of time it takes to compute the single sum
Computer Science and Engineering
All partial sums (cont.)
Compute all partial sums of A[1..n]
We use n-1 processors (P2, P3, …, Pn)
A[k] will be replaced by the sum of all elements preceding and including A[k]
In algorithm sum_EREW, at iteration i, only n/2i processors were active, while in allsums_EREW, nearly all processors will be in use.
Computer Science and Engineering
ExampleAll partial sums on EREW PRAM
Example of algorithm allsums_EREW when n=8
5 2 10 1 8 12 7 3
5 7 12 11 9 20 19 10
5 7 17 18 21 31 28 30
5 7 17 18 26 38 45 48
Active processors
P2, P3, …, P8
P3, P4, …, P8
P5, P6, P7, P8
A[1] A[2] A[3] A[4] A[5] A[6] A[7] A[8]
Computer Science and Engineering
Group Work
1- Discuss the algorithm with your neighbor
2- Design the main loops
3- Discuss the Complexity
Computer Science and Engineering
Algorithm allsums_EREW
for i =1 to log n do
forall Pj, where 2i-1 + 1 < j < n do in parallel
a[j] A[j] + A[j – 2i-1]
endfor
endfor
Computer Science and Engineering
Complexity
Run time: T(n) = O(log n)
Number of processors: P(n) = n-1
Cost: c(n) = O(n log n)
Computer Science and Engineering
Matrix Multiplication
Two n X n matrices For clarity, we assume n is power of 2
We use CREW to allow concurrent read Two matrices in the shared memory A[1..n,1..n],
B[1..n,1..n].
We will use n3 processors We will also show how to reduce the number of
processors
Computer Science and Engineering
Matrix Multiplication (cont)
The n3 processors are arranged in a three dimensional array. Processor Pi,j,k is the one with index (i,j,k)
We will use the 3 dimensional array C[1..n,1..n,1..n] in the shared memory as working space.
The resulting matrix will be stored in locations C[i,j,n], where 1<= i,j <= n
Computer Science and Engineering
Two steps
1. All n3 processors operate in parallel to compute n3 multiplications. (For each of the n2 cells in the output matrix, n products are computed)
2. The n products are summed to produce the final value of each cell
Computer Science and Engineering
Matrix multiplicationUsing n3 processors
Two steps of the Algorithms
1. Each processors Pi,j,k computes the product of A[i,k].B[k,j] and store it in C[i,j,k].
2. The idea of Algorithm Sum_EREW is applied along the k dimension n2 times in parallel to compute C[i,j,n], where 1<i, j<n. Each processors Pi,j,k computes the product of A[i,k].B[k,j] and store it in C[i,j,k].
Computer Science and Engineering
Algorithm MatMult_CREW
/* step 1 */
forall Pi,j,k, where 1 < i, j, k<n do in parallelC[i,j,k] A[i,k] * B[k,j]
Endfor
/* step 2 */for i=1 to log n do
forall Pi,j,k, where 1 < i, j<n & 1<k<n/2 do in parallelif (2k mod 2l) = 0 then C[i,j,2k] C[i,j,2k] + C[i,j, 2k-2l-1]endif
endfor
/* the output matrix is stored in locations C[i,j,n], where l<i, j<n */
endfor
Computer Science and Engineering
Complexity
Run time: T(n) = O(log n)
Number of processors: P(n) = n3
Cost: c(n) = O(n3 log n)
Is it cost optimal?
Computer Science and Engineering
Example
Multiplying two 2 x 2 matrices using Algorithm MatMult_CREW
C[1,1,1] A[1,1]B[1,1] C[1,2,1] A[1,1]B[1,2]
C[2,1,1] A[2,1]B[1,1] C[2,2,1] A[2,1]B[1,2]
C[1,1,2] A[1,2]B[2,1] C[1,2,2] A[1,2]B[2,2]
C[2,1,2] A[2,2]B[2,1] C[2,2,2] A[2,2]B[2,2]
i
j
ij
P1,1,1 K = 1 P1,2,1
P1,1,2 P1,2,2K = 2
After step 1
P2,1,1 P2,2,1
P2,1,2 P2,2,2
Computer Science and Engineering
Example (cont.)
C[1,1,2] C[1,1,2]+C[1,1,1] C[1,2,2] C[1,2,2]+C[1,2,1]
C[2,1,2] C[2,1,2]+C[2,1,1] C[2,2,2] C[2,2,2]+C[2,2,1]
ij
P1,1,2 P1,2,2K = 2
After step 2
P2,1,2 P2,2,2
Multiplying two 2 x 2 matrices using Algorithm MatMult_CREW
Computer Science and Engineering
Matrix multiplicationreducing the number of processors to n3/log n
Processors are arranged in n X n X n/(log n) 3-dimensional array
1. Each processors Pi,j,k, where 1 <k < n/log n, computes the sum of (log n) product. This step will produce (n3/log n) partial sums.
2. The sum of products produced in step 1 are added to produce the resulting matrix as discussed previously.
Complexity analysis Run time, T(n) = O(log n) Number of processors, P(n) = n3/log n Cost, c(n) = O(n3)
Computer Science and Engineering
Searching
Given A = a1, a2, …, ai, …, an & x
Determine whether x = ai for some i Sequential Binary Search O(log n) Simple idea
Divide the list among the processors and let each processor conduct its own binary search
EREW PRAM O(log n/p) + O(log p) = O(log n) CREW O(log n/p)
Computer Science and Engineering
Parallel Binary Search
Split A into p+1 segments of almost equal length
Compare x with p elements at the boundary between successive segments
Either x = ai or search is restricted to only one of the p+1 segments
Repeat until x is found or length of the list is <= p