Transcript of CS4230 Parallel Programming, Lecture 11: Breaking Dependences and Task Parallel Algorithms (Mary Hall, October 2, 2012)

Page 1:

CS4230 Parallel Programming
Lecture 11: Breaking Dependences and Task Parallel Algorithms
Mary Hall
October 2, 2012

Page 2:

Administrative

• Assignment deadline extended to Friday night.

• C++ validation code has been added to the website, in case Octave/Matlab is too hard to use (you must update the file names accordingly)

• Have you figured out how to validate? How to time?


Page 3:

How to Increase Parallelism Granularity when there are Dependences?

• Your assignment has a number of dependences that prevent outer loop parallelization

• Parallelizing the inner “k” loops leads to poor performance

• How do we break dependences to parallelize the outer loops?


Page 4:

Two Transformations for Breaking Dependences

• Scalar expansion:
- A scalar is replaced by an array so that the value computed in each iteration of a loop (nest) is preserved. This transformation may make it possible to perform loop distribution (below) and to parallelize the resulting loops differently. Array expansion is analogous, increasing the dimensionality of an array.

• Loop distribution (fission):
- The loop header is replicated and distributed across the statements in the loop, so that subsets of the statements are executed in separate loops. The resulting loops can be parallelized differently.


Page 5:

Scalar Expansion Example

for (i=0; i<m; i++) {
  a[i] = temp;
  temp = b[i] + c[i];
}


• Dependence?
• Can you parallelize some of the computation? How?
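One possible rewrite (a sketch, not necessarily the intended solution for the assignment): expand temp into an array, here called tempx with m+1 elements, and then distribute the loop. Both resulting loops are free of loop-carried dependences and can be parallelized, for example with OpenMP:

/* Scalar expansion of temp (tempx is an illustrative name), then loop distribution */
tempx[0] = temp;                  /* temp's value on entry to the loop            */
#pragma omp parallel for
for (i=0; i<m; i++)
  tempx[i+1] = b[i] + c[i];       /* each iteration writes its own element        */
#pragma omp parallel for
for (i=0; i<m; i++)
  a[i] = tempx[i];                /* reads values the first loop already produced */
temp = tempx[m];                  /* restore temp's final value, if needed later  */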

Page 6:

Loop Distribution Example

for (j=0; j<100; j++) {
  a[j] = b[j] + c[j];
  d[j] = a[j-1] * 2.0;
}


• Dependence?
• Can you parallelize some of the computation? How?
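A sketch of the distributed version (the OpenMP pragmas are one possible way to parallelize the resulting loops; they are not on the original slide). Once the statements are split into separate loops, every a[j-1] value the second loop reads has already been computed, so neither loop carries a dependence:

#pragma omp parallel for
for (j=0; j<100; j++)
  a[j] = b[j] + c[j];        /* no loop-carried dependence      */

#pragma omp parallel for
for (j=0; j<100; j++)
  d[j] = a[j-1] * 2.0;       /* all needed a[] values are ready */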

Page 7:

Back to the assignment

• Where are the dependences?

• Can we identify where scalar expansion and/or loop distribution might be beneficial?


Page 8:

Summary of stencil discussion

• A stencil examines the input data in the neighborhood of an output point to compute a result (a short code sketch follows the list below).

• Common in scientific codes; many image and signal processing algorithms have the same structure.

• Core performance issues in parallel stencils:
- Multi-dimensional stencils walk all over memory, which can lead to poor performance
- May benefit from tiling
- The amount of data reuse, and therefore the benefit of locality optimization, depends on how much of the neighborhood is examined (higher-order stencils look at more of the neighborhood)
- May cause very expensive TLB misses for large data sets in 3 or more dimensions (if accesses cross page boundaries)
- “Ghost zones” for replicating input/output, as in the red-blue example
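A minimal 5-point stencil sketch in C (illustrative only; not the assignment code). It shows why a 2-D stencil touches three rows of the input for every row of output, which is what makes tiling and other locality optimizations attractive:

/* One Jacobi-style sweep over an n x n grid; boundary rows/columns are left untouched */
void stencil_sweep(int n, double in[n][n], double out[n][n]) {
  int i, j;
  for (i = 1; i < n-1; i++)
    for (j = 1; j < n-1; j++)
      out[i][j] = 0.25 * (in[i-1][j] + in[i+1][j] +   /* north, south */
                          in[i][j-1] + in[i][j+1]);   /* west, east   */
}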


Page 9:

General: Task Parallelism

• Recall definition:
- A task parallel computation is one in which parallelism is applied by performing distinct computations, or tasks, at the same time. Since the number of tasks is fixed, the parallelism is not scalable.

• OpenMP support for task parallelism
- Parallel sections: different threads execute different code
- Tasks (NEW): tasks are created and executed at separate times

• Common use of task parallelism = Producer/consumer

- A producer creates work that is then processed by the consumer

- You can think of this as a form of pipelining, like in an assembly line

- The “producer” writes data to a FIFO queue (or similar) to be accessed by the “consumer”


Page 10:

Simple: OpenMP sections directive

#pragma omp parallel
{
  #pragma omp sections
  {
    #pragma omp section
    { a=...; b=...; }
    #pragma omp section
    { c=...; d=...; }
    #pragma omp section
    { e=...; f=...; }
    #pragma omp section
    { g=...; h=...; }
  } /* omp end sections */
} /* omp end parallel */


Page 11:

Parallel Sections, Example


#pragma omp parallel shared(n,a,b,c,d) private(i)
{
  #pragma omp sections nowait   /* nowait: no implied barrier at the end of the sections */
  {
    #pragma omp section
    for (i=0; i<n; i++)
      d[i] = 1.0/c[i];
    #pragma omp section
    for (i=0; i<n-1; i++)
      b[i] = (a[i] + a[i+1])/2;
  } /*-- End of sections --*/
} /*-- End of parallel region --*/

Page 12:

Simple Producer-Consumer Example

// PRODUCER: initialize A with random data
void fill_rand(int nval, double *A) {
  int i;
  for (i=0; i<nval; i++)
    A[i] = (double) rand()/1111111111;
}

// CONSUMER: sum the data in A
double Sum_array(int nval, double *A) {
  int i;
  double sum = 0.0;
  for (i=0; i<nval; i++)
    sum = sum + A[i];
  return sum;
}


Page 13:

Key Issues in Producer-Consumer Parallelism

• Producer needs to tell the consumer that the data is ready

• Consumer needs to wait until the data is ready

• Producer and consumer need a way to communicate data
- output of the producer is input to the consumer

• Producer and consumer often communicate through a first-in-first-out (FIFO) queue


Page 14:

One Solution to Read/Write a FIFO

• The FIFO is in global memory and is shared between the parallel threads

• How do you make sure the data is updated?

• Need a construct to guarantee a consistent view of memory
- Flush: make sure data is written all the way back to global memory


Example:
double A;
A = compute();
#pragma omp flush(A)   /* make the new value of A visible to other threads */
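A minimal sketch (not the lecture's code) pairing the fill_rand and Sum_array functions from the earlier slide: the producer flushes A before raising a shared flag, and the consumer spins on the flag before reading A. The names N and flag are illustrative, and a production version would also protect the flag with atomics:

#include <stdio.h>

void fill_rand(int nval, double *A);      /* producer from the earlier slide */
double Sum_array(int nval, double *A);    /* consumer from the earlier slide */

#define N 1000

int main(void) {
  double A[N], sum = 0.0;
  int flag = 0;

  #pragma omp parallel sections shared(A, flag, sum)
  {
    #pragma omp section                   /* producer */
    {
      fill_rand(N, A);
      #pragma omp flush                   /* push A out before raising the flag */
      flag = 1;
      #pragma omp flush(flag)
    }
    #pragma omp section                   /* consumer */
    {
      while (1) {
        #pragma omp flush(flag)           /* re-read the flag from memory */
        if (flag) break;
      }
      #pragma omp flush                   /* make sure we see the produced A */
      sum = Sum_array(N, A);
    }
  }
  printf("sum = %f\n", sum);
  return 0;
}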

Page 15:

Another Example from the Textbook

• Implement Message-Passing on a Shared-Memory System

• A FIFO queue holds messages

• A thread has explicit functions to Send and Receive
- Send a message by enqueuing it on a queue in shared memory
- Receive a message by grabbing it from the queue
- Ensure safe access


Page 16:

Message-Passing

(Figure from the textbook is not included in this transcript. Copyright © 2010, Elsevier Inc. All rights reserved.)

Page 17:

Sending Messages

(Figure from the textbook is not included in this transcript. Copyright © 2010, Elsevier Inc. All rights reserved.)

• Use synchronization mechanisms to update the FIFO
• “Flush” happens implicitly (e.g., on entry to and exit from a critical section)
• What is the implementation of Enqueue?
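One possible implementation of Enqueue (a sketch only, assuming a simple linked-list FIFO with front/tail pointers and enqueued/dequeued counters; the node and queue types are illustrative, not the textbook's exact code):

#include <stdlib.h>

struct node  { int src, mesg; struct node *next; };
struct queue { struct node *front, *tail; int enqueued, dequeued; };

void Enqueue(struct queue *q, int src, int mesg) {
  struct node *n = malloc(sizeof *n);
  n->src = src;  n->mesg = mesg;  n->next = NULL;

  #pragma omp critical              /* sender and receiver can both touch the tail */
  {
    if (q->tail == NULL) {          /* queue was empty */
      q->front = n;
      q->tail  = n;
    } else {
      q->tail->next = n;
      q->tail = n;
    }
    q->enqueued++;
  }                                 /* entering/leaving the critical section implies a flush */
}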

Page 18:

Receiving Messages

(Figure from the textbook is not included in this transcript. Copyright © 2010, Elsevier Inc. All rights reserved.)

• This thread is the only one to dequeue its messages.
• Other threads may only add more messages.
• Messages are added at the end and removed from the front.
• Therefore, synchronization is needed only when we are on the last entry.
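A matching Dequeue sketch (again illustrative, reusing the struct node and struct queue from the Enqueue sketch above). Because only the owning thread dequeues, the critical section is needed only when a single entry remains and a sender might be updating the tail at the same time:

int Dequeue(struct queue *q, int *src, int *mesg) {
  struct node *old;
  int size = q->enqueued - q->dequeued;   /* messages currently in this queue */
  if (size == 0)                          /* nothing to receive */
    return 0;

  old = q->front;
  if (size == 1) {                        /* last entry: a sender may be updating the tail */
    #pragma omp critical
    {
      q->front = old->next;
      if (q->front == NULL) q->tail = NULL;
    }
  } else {                                /* more entries: only this thread touches the front */
    q->front = old->next;
  }

  *src  = old->src;
  *mesg = old->mesg;
  q->dequeued++;
  free(old);
  return 1;
}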

Page 19:

Termination Detection

(Figure from the textbook is not included in this transcript. Copyright © 2010, Elsevier Inc. All rights reserved.)

• Each thread increments “done_sending” after completing its for loop
• More synchronization is needed on “done_sending”
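A sketch of how the counter might be used (illustrative names; not the textbook's exact code, and it reuses the struct queue from the Enqueue sketch above). Each thread bumps done_sending once its send loop finishes, and a receiver is done when its own queue is empty and every thread has finished sending:

int done_sending = 0;                 /* shared counter, incremented once per thread */

void Finish_sending(void) {           /* call after a thread's send loop completes */
  #pragma omp atomic                  /* the extra synchronization noted above */
  done_sending++;
}

int Done(struct queue *q, int thread_count) {
  /* finished when this thread's queue is empty and all threads are done sending */
  return (q->enqueued == q->dequeued) && (done_sending == thread_count);
}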

Page 20:

What’s coming?


• Next time we will finish task parallelism in OpenMP and look at actual executing code

• Homework, to be used as midterm review, is due before class on Thursday, October 18

• In-class midterm on October 18