Parallel and Distributed Processing CSE 8380

Computer Science and Engineering

February 8, 2005

Session 8

Contents

Computing sum on EREW PRAM

Computing all partial sums on EREW PRAM

Matrix Multiplication on CREW

Other Algorithms

Recall (PRAM Model)

Synchronized Read Compute Write Cycle

Memory access variants: EREW, ERCW, CREW, CRCW

Complexity: T(n), P(n), C(n)

[Figure: PRAM model showing a control unit, processors with private memory, and a global shared memory]

Sum on EREW PRAM

Compute the sum of an array A[1..n]

We use n/2 processors

Summation will end up in location A[n]

For simplicity, we assume n is an integral power of 2

Work is done in log n iterations. In the first iteration, all processors are active. In the second iteration, only half the processors will be active, and so on.

Example: Sum of an array of numbers on the EREW model

Example of algorithm Sum_EREW when n = 8

                    A[1] A[2] A[3] A[4] A[5] A[6] A[7] A[8]   Active processors
Initial               5    2   10    1    8   12    7    3
After iteration 1     5    7   10   11    8   20    7   10    P1, P2, P3, P4
After iteration 2     5    7   10   18    8   20    7   30    P2, P4
After iteration 3     5    7   10   18    8   20    7   48    P4

Group Work

1- Discuss the algorithm with your neighbor

2- Design the main loops

3- Discuss the Complexity

Algorithm sum_EREW

for i = 1 to log n do
  forall Pj, where 1 <= j <= n/2 do in parallel
    if (2j mod 2^i) = 0 then
      A[2j] <- A[2j] + A[2j - 2^(i-1)]
    endif
  endfor
endfor
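The loop structure of sum_EREW can be checked with a small sequential simulation. This is a minimal sketch, not part of the original slides: it reads the update as A[2j] <- A[2j] + A[2j - 2^(i-1)], which is what the n = 8 example implies, and it imitates the synchronized read-compute-write cycle by collecting all results of an iteration before writing them back. The function name `sum_erew` is an illustrative choice.

```python
def sum_erew(values):
    """Sequentially simulate sum_EREW; len(values) is assumed to be a power of 2."""
    n = len(values)
    a = [None] + list(values)            # 1-based indexing, as on the slides
    log_n = n.bit_length() - 1           # exact log2(n) for a power of 2
    for i in range(1, log_n + 1):
        step = 2 ** i
        # Read/compute phase: each active processor P_j computes its new value.
        updates = {}
        for j in range(1, n // 2 + 1):
            if (2 * j) % step == 0:
                updates[2 * j] = a[2 * j] + a[2 * j - step // 2]
        # Write phase: all results of this iteration are committed together.
        for index, value in updates.items():
            a[index] = value
    return a[1:]                         # the total is in the last position


print(sum_erew([5, 2, 10, 1, 8, 12, 7, 3]))   # [5, 7, 10, 18, 8, 20, 7, 48]
```

The printed array matches the final row of the n = 8 example, with the total in A[8].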

Complexity

Run time: T(n) = O(log n)

Number of processors: P(n) = n/2

Cost: c(n) = O(n log n)

Is it cost optimal?
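One way to approach the question is to compare the cost P(n) * T(n) with the n - 1 additions a sequential loop needs. The sketch below is only an illustration under that assumption (constant factors ignored, n a power of 2); the helper name `sum_erew_cost` is hypothetical.

```python
def sum_erew_cost(n):
    """Return (parallel cost, sequential additions) for summing n numbers."""
    steps = n.bit_length() - 1       # log2(n) iterations for a power of 2
    processors = n // 2              # P(n) = n/2
    cost = processors * steps        # c(n) = P(n) * T(n)
    sequential = n - 1               # additions in a plain sequential loop
    return cost, sequential


for n in (8, 1024, 2 ** 20):
    cost, sequential = sum_erew_cost(n)
    print(n, cost, sequential, round(cost / sequential, 2))
```

The ratio in the last column keeps growing with n (roughly as (log n)/2), which suggests that sum_EREW with n/2 processors is not cost optimal.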

All partial sums - EREW PRAM

Compute all partial sums of an array A[1..n]

These are A[1], A[1]+A[2], A[1]+A[2]+A[3], …, A[1]+A[2]+… + A[n].

At first glance you might think it is inherently sequential because one must add up the first k elements before adding in element k+1

We’ll see that it can be parallelized

Let’s extend sum_EREW to do that

All partial sums (cont.)

We noticed that in sum_EREW most processors are idle most of the time

By exploiting these idle processors, we should be able to compute all partial sums in the same amount of time it takes to compute the single sum

All partial sums (cont.)

Compute all partial sums of A[1..n]

We use n-1 processors (P2, P3, …, Pn)

A[k] will be replaced by the sum of all elements preceding and including A[k]

In algorithm sum_EREW, at iteration i, only n/2^i processors were active, while in allsums_EREW, nearly all processors will be in use.

Example: All partial sums on EREW PRAM

Example of algorithm allsums_EREW when n = 8

                    A[1] A[2] A[3] A[4] A[5] A[6] A[7] A[8]   Active processors
Initial               5    2   10    1    8   12    7    3
After iteration 1     5    7   12   11    9   20   19   10    P2, P3, …, P8
After iteration 2     5    7   17   18   21   31   28   30    P3, P4, …, P8
After iteration 3     5    7   17   18   26   38   45   48    P5, P6, P7, P8

Group Work

1- Discuss the algorithm with your neighbor

2- Design the main loops

3- Discuss the Complexity

Algorithm allsums_EREW

for i = 1 to log n do
  forall Pj, where 2^(i-1) + 1 <= j <= n do in parallel
    A[j] <- A[j] + A[j - 2^(i-1)]
  endfor
endfor
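A sequential simulation makes the doubling pattern of allsums_EREW easier to follow. This is a minimal sketch, not from the slides: it assumes the update A[j] <- A[j] + A[j - 2^(i-1)] for 2^(i-1) + 1 <= j <= n, and it takes a snapshot of the array at the start of each iteration so that every processor reads the values from the previous iteration, as the synchronized PRAM cycle requires. The name `allsums_erew` is illustrative.

```python
def allsums_erew(values):
    """Sequentially simulate allsums_EREW; len(values) is assumed to be a power of 2."""
    n = len(values)
    a = [None] + list(values)            # 1-based indexing, as on the slides
    log_n = n.bit_length() - 1
    for i in range(1, log_n + 1):
        distance = 2 ** (i - 1)
        old = a[:]                       # values as they were before this iteration
        for j in range(distance + 1, n + 1):
            a[j] = old[j] + old[j - distance]
    return a[1:]


print(allsums_erew([5, 2, 10, 1, 8, 12, 7, 3]))   # [5, 7, 17, 18, 26, 38, 45, 48]
```

Without the snapshot, a sequential loop would read values already updated in the same iteration and produce wrong sums; on the PRAM the lock-step cycle provides that separation.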

Complexity

Run time: T(n) = O(log n)

Number of processors: P(n) = n-1

Cost: c(n) = O(n log n)

Matrix Multiplication

Two n x n matrices

For clarity, we assume n is a power of 2

We use CREW to allow concurrent reads

Two matrices in the shared memory: A[1..n,1..n], B[1..n,1..n]

We will use n^3 processors

We will also show how to reduce the number of processors

Matrix Multiplication (cont)

The n^3 processors are arranged in a three-dimensional array. Processor P(i,j,k) is the one with index (i,j,k)

We will use the three-dimensional array C[1..n,1..n,1..n] in the shared memory as working space.

The resulting matrix will be stored in locations C[i,j,n], where 1 <= i,j <= n

Two steps

1. All n^3 processors operate in parallel to compute n^3 multiplications. (For each of the n^2 cells in the output matrix, n products are computed)

2. The n products are summed to produce the final value of each cell

Matrix multiplication using n^3 processors

Two steps of the algorithm

1. Each processor P(i,j,k) computes the product A[i,k] * B[k,j] and stores it in C[i,j,k].

2. The idea of Algorithm Sum_EREW is applied along the k dimension n^2 times in parallel to compute C[i,j,n], where 1 <= i, j <= n.

Algorithm MatMult_CREW

/* step 1 */
forall P(i,j,k), where 1 <= i, j, k <= n do in parallel
  C[i,j,k] <- A[i,k] * B[k,j]
endfor

/* step 2 */
for l = 1 to log n do
  forall P(i,j,k), where 1 <= i, j <= n and 1 <= k <= n/2 do in parallel
    if (2k mod 2^l) = 0 then
      C[i,j,2k] <- C[i,j,2k] + C[i,j,2k - 2^(l-1)]
    endif
  endfor
endfor

/* the output matrix is stored in locations C[i,j,n], where 1 <= i, j <= n */
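The two steps of MatMult_CREW can be traced with a sequential simulation. This is a sketch under stated assumptions, not the original code: the step-2 loop variable is called l to keep it apart from the row index i, the update is read as C[i,j,2k] <- C[i,j,2k] + C[i,j,2k - 2^(l-1)], and n is a power of 2. The function name `matmult_crew` is illustrative.

```python
def matmult_crew(A, B):
    """Sequentially simulate MatMult_CREW for n x n lists of lists (n a power of 2)."""
    n = len(A)
    # Working space C[i][j][k] with 1-based indices; index 0 is unused.
    C = [[[0] * (n + 1) for _ in range(n + 1)] for _ in range(n + 1)]

    # Step 1: processor P(i,j,k) computes one product.
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            for k in range(1, n + 1):
                C[i][j][k] = A[i - 1][k - 1] * B[k - 1][j - 1]

    # Step 2: the sum_EREW pattern along the k dimension for every (i, j).
    for l in range(1, n.bit_length()):           # l = 1 .. log2(n)
        step = 2 ** l
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                updates = {}
                for k in range(1, n // 2 + 1):
                    if (2 * k) % step == 0:
                        updates[2 * k] = C[i][j][2 * k] + C[i][j][2 * k - step // 2]
                for index, value in updates.items():
                    C[i][j][index] = value

    # The result matrix sits in the plane k = n.
    return [[C[i][j][n] for j in range(1, n + 1)] for i in range(1, n + 1)]


print(matmult_crew([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]
```

In step 1 each entry A[i,k] is read by the n processors P(i,1,k), …, P(i,n,k), which is why the concurrent-read (CREW) model is used.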

Complexity

Run time: T(n) = O(log n)

Number of processors: P(n) = n^3

Cost: c(n) = O(n^3 log n)

Is it cost optimal?

Example

Multiplying two 2 x 2 matrices using Algorithm MatMult_CREW

After step 1

k = 1:
P(1,1,1): C[1,1,1] <- A[1,1]B[1,1]    P(1,2,1): C[1,2,1] <- A[1,1]B[1,2]
P(2,1,1): C[2,1,1] <- A[2,1]B[1,1]    P(2,2,1): C[2,2,1] <- A[2,1]B[1,2]

k = 2:
P(1,1,2): C[1,1,2] <- A[1,2]B[2,1]    P(1,2,2): C[1,2,2] <- A[1,2]B[2,2]
P(2,1,2): C[2,1,2] <- A[2,2]B[2,1]    P(2,2,2): C[2,2,2] <- A[2,2]B[2,2]

Example (cont.)

Multiplying two 2 x 2 matrices using Algorithm MatMult_CREW

After step 2

k = 2:
P(1,1,2): C[1,1,2] <- C[1,1,2] + C[1,1,1]    P(1,2,2): C[1,2,2] <- C[1,2,2] + C[1,2,1]
P(2,1,2): C[2,1,2] <- C[2,1,2] + C[2,1,1]    P(2,2,2): C[2,2,2] <- C[2,2,2] + C[2,2,1]

Matrix multiplication: reducing the number of processors to n^3/log n

Processors are arranged in an n x n x (n/log n) three-dimensional array

1. Each processor P(i,j,k), where 1 <= k <= n/log n, computes the sum of log n products. This step produces n^3/log n partial sums.

2. The partial sums produced in step 1 are added to produce the resulting matrix as discussed previously.

Complexity analysis

Run time: T(n) = O(log n)

Number of processors: P(n) = n^3/log n

Cost: c(n) = O(n^3)
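The reduced-processor scheme can be sketched as follows. This is an illustration under assumptions that go beyond the slides: each simulated processor handles a block of about log2(n) consecutive values of k, the last block may be shorter when log2(n) does not divide n, and the partial sums are combined with a generic pairwise reduction rather than the exact sum_EREW indexing. The names `tree_sum` and `matmult_reduced` are hypothetical.

```python
def tree_sum(values):
    """Pairwise reduction; each pass of the while loop is one parallel round."""
    s = list(values)
    distance = 1
    while distance < len(s):
        # Within a round, the written and read positions never overlap.
        for index in range(0, len(s) - distance, 2 * distance):
            s[index] += s[index + distance]
        distance *= 2
    return s[0]


def matmult_reduced(A, B):
    """Multiply n x n matrices using about n/log n partial sums per output cell."""
    n = len(A)
    block = max(1, n.bit_length() - 1)       # about log2(n) products per processor
    C = []
    for i in range(n):
        row = []
        for j in range(n):
            # Step 1: each processor sums its own block of products sequentially.
            partial = [
                sum(A[i][k] * B[k][j] for k in range(start, min(start + block, n)))
                for start in range(0, n, block)
            ]
            # Step 2: the partial sums are combined in a logarithmic number of rounds.
            row.append(tree_sum(partial))
        C.append(row)
    return C


print(matmult_reduced([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]
```

Step 1 takes O(log n) sequential additions per processor and step 2 takes O(log(n/log n)) rounds, so the run time stays O(log n) while the processor count drops by a factor of log n.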

Searching

Given a sorted list A = a1, a2, …, ai, …, an and a value x

Determine whether x = ai for some i

Sequential binary search: O(log n)

Simple idea: divide the list among the p processors and let each processor conduct its own binary search

EREW PRAM: O(log(n/p)) + O(log p) = O(log n), where the O(log p) term is the time to broadcast x to all p processors

CREW PRAM: O(log(n/p))
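The simple idea can be sketched in a few lines. This is a sequential illustration only: the loop over `proc` stands in for p processors working at the same time, the list is assumed to be sorted, and the function name `partitioned_search` is hypothetical. It uses `bisect_left` from the Python standard library for the per-processor binary search.

```python
from bisect import bisect_left


def partitioned_search(a, x, p):
    """Return an index of x in the sorted list a, or -1 if x is absent."""
    n = len(a)
    size = -(-n // p)                        # ceil(n / p) elements per processor
    for proc in range(p):                    # each iteration models one processor
        lo, hi = proc * size, min((proc + 1) * size, n)
        if lo >= hi:
            continue                         # this processor received no elements
        pos = bisect_left(a, x, lo, hi)      # binary search within a[lo:hi]
        if pos < hi and a[pos] == x:
            return pos
    return -1


data = [1, 3, 5, 7, 9, 11, 13, 15]
print(partitioned_search(data, 11, 4))       # 5
print(partitioned_search(data, 4, 4))        # -1
```

Each processor searches a segment of about n/p elements, which gives the O(log(n/p)) term.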

Parallel Binary Search

Split A into p+1 segments of almost equal length

Compare x with p elements at the boundary between successive segments

Either x = ai or search is restricted to only one of the p+1 segments

Repeat until x is found or length of the list is <= p
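The four steps above can be sketched as a sequential simulation. This is a minimal illustration under assumptions not stated on the slide: the list is sorted, the p boundary positions are chosen by integer division, and the inner loop over the probes stands in for p comparisons performed in one parallel step. The function name `parallel_binary_search` is hypothetical.

```python
def parallel_binary_search(a, x, p):
    """Return an index of x in the sorted list a, or -1 if x is absent."""
    lo, hi = 0, len(a)                       # current window is a[lo:hi]
    while hi - lo > p:
        length = hi - lo
        # p boundary positions that split the window into p + 1 segments.
        probes = [lo + (t * length) // (p + 1) for t in range(1, p + 1)]
        new_lo, new_hi = lo, hi
        for q in probes:                     # one comparison per processor
            if a[q] == x:
                return q
            if a[q] < x:
                new_lo = q + 1               # x lies to the right of this boundary
            else:
                new_hi = min(new_hi, q)      # x lies to the left of this boundary
        lo, hi = new_lo, new_hi
    # At most p elements remain: one comparison per processor finishes the search.
    for q in range(lo, hi):
        if a[q] == x:
            return q
    return -1


data = list(range(0, 200, 2))
print(parallel_binary_search(data, 198, 3))  # 99
print(parallel_binary_search(data, 7, 3))    # -1
```

Each round shrinks the window by roughly a factor of p + 1, so fewer rounds are needed than in sequential binary search, at the price of all processors reading x concurrently.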
