
1

Mohsan Jameel
Department of Computing
NUST School of Electrical Engineering and Computer Science

2

Outline

I. Introduction to OpenMP

II. OpenMP Programming Model

III. OpenMP Directives

IV. OpenMP Clauses

V. Run-Time Library Routines

VI. Environment Variables

VII. Summary

3

What is OpenMP

An application program interface (API) that is used to explicitly direct multi-threaded, shared-memory parallelism.

Consists of:

• Compiler directives
• Run-time library routines
• Environment variables

• Specification maintained by the OpenMP Architecture Review Board (http://www.openmp.org)

• Version 3.0 was released in May 2008

4

What OpenMP is Not

Not automatic parallelization: the user explicitly specifies parallel execution, and the compiler does not ignore user directives even if they are wrong

Not just loop-level parallelism: it provides functionality to enable coarse-grained parallelism

Not meant for distributed-memory parallel systems

Not necessarily implemented identically by all vendors

Not guaranteed to make the most efficient use of shared memory

5

History of OpenMP

In the early 1990s, vendors of shared-memory machines supplied similar, directive-based Fortran programming extensions: the user would augment a serial Fortran program with directives specifying which loops were to be parallelized.

The first attempt at a standard was the draft for ANSI X3H5 in 1994. It was never adopted, largely due to waning interest as distributed-memory machines became popular.

The OpenMP standard specification started in the spring of 1997, taking over where ANSI X3H5 had left off, as newer shared-memory machine architectures started to become prevalent.

6

Goals of OpenMP

Standardization: provide a standard among a variety of shared-memory architectures/platforms

Lean and mean: establish a simple and limited set of directives for programming shared-memory machines

Ease of use: provide the capability to incrementally parallelize a serial program, and to implement both coarse-grain and fine-grain parallelism

Portability: support Fortran (77, 90, and 95), C, and C++

7

Outline

I. Introduction to OpenMP

II. OpenMP Programming Model

III. OpenMP Directives

IV. OpenMP Clauses

V. Run-Time Library Routines

VI. Environment Variables

VII. Summary

8

OpenMP Programming Model

Thread-Based Parallelism

Explicit Parallelism

Compiler Directive Based

Dynamic Threads

Nested Parallelism Support

Task parallelism support (OpenMP specification 3.0)

9

Shared Memory Model

10

Execution Model

[Figure: fork-join execution model. The master thread (ID=0) forks a team of worker threads (ID=1,2,3,…,N-1) at a parallel region and joins them at its end.]

11

Terminology

OpenMP team = master + workers

A parallel region is a block of code executed by all threads simultaneously.

The master thread always has thread ID=0.

Thread adjustment is done before entering a parallel region.

An "if" clause can be used with the parallel construct; in case the condition evaluates to FALSE, the parallel region is executed serially by a single thread (see the sketch below).

A work-sharing construct is responsible for dividing work among the threads in a parallel region.
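A minimal sketch of the "if" clause, assuming a hypothetical problem size n and threshold:

#include <stdio.h>
#include <omp.h>

int main() {
    int n = 100;   /* hypothetical problem size */

    /* Parallel only when the condition holds; otherwise the
       block is executed serially by a team of one thread. */
    #pragma omp parallel if (n > 50)
    {
        printf("Executed by thread %d\n", omp_get_thread_num());
    }
    return 0;
}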

12

Example OpenMP Code Structure
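The code image on this slide is not reproduced in the transcript; the following is a minimal sketch of the typical C/OpenMP code structure it likely illustrated (var1 and var2 are illustrative names):

#include <stdio.h>
#include <omp.h>

int main() {
    int var1 = 0, var2 = 42;   /* illustrative variables */

    /* serial code runs on the master thread */

    /* fork a team of threads; var1 is private per thread, var2 is shared */
    #pragma omp parallel private(var1) shared(var2)
    {
        var1 = omp_get_thread_num();   /* each thread has its own var1 */
        printf("thread %d sees var2=%d\n", var1, var2);
    }   /* implicit barrier: the team joins back into the master thread */

    /* serial code resumes */
    return 0;
}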

13

Components of OpenMP

14

I. Introduction to OpenMP

II. OpenMP Programming Model

III. OpenMP Directives

IV. OpenMP Clauses

V. Run-Time Library Routines

VI. Environment Variables

VII. Summary

15

Go to helloworld.c
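The helloworld.c file itself is not included in the transcript; a minimal sketch consistent with the output shown on the next slide:

/* helloworld.c : each thread prints its ID; one thread reports the team size */
#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        printf("Hello world from thread = %d\n", tid);
        if (tid == 0)
            printf("Number of threads = %d\n", omp_get_num_threads());
    }
    return 0;
}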

16

C/C++ Parallel Region Example

For comparison, the Fortran form of a parallel region is:

!$OMP PARALLEL
    write (*,*) "Hello"
!$OMP END PARALLEL

Sample output of helloworld.c with three threads (the interleaving of lines varies between runs):

Hello world from thread = 0
Number of threads = 3
Hello world from thread = 1
Hello world from thread = 2

17

OpenMP Directives

18

OpenMP Scoping

Static extent: the code textually enclosed between the beginning and end of a structured block. The static extent does not span other routines.

Orphaned directive: an OpenMP directive that appears independently, outside the lexical extent of a parallel region.

Dynamic extent: includes both the static extent and orphaned directives.

19

OpenMP Parallel Regions

A parallel region is a block of code that will be executed by multiple threads.

Properties:

• Fork-join model
• The number of threads won't change inside a parallel region
• SPMD execution within the region
• The enclosed block of code must be structured: no branching into or out of the block

Format:

#pragma omp parallel clause1 clause2 …

20

OpenMP Threads

How many threads? Determined by:

1. Use of the omp_set_num_threads() library function
2. Setting of the OMP_NUM_THREADS environment variable
3. The implementation default

Dynamic threads: by default, the same number of threads is used to execute each parallel region. Two methods for enabling dynamic thread adjustment (see the sketch below):

1. Use of the omp_set_dynamic() library function
2. Setting of the OMP_DYNAMIC environment variable
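A minimal sketch of controlling the thread count from code (the count of 4 is an illustrative assumption):

#include <stdio.h>
#include <omp.h>

int main() {
    omp_set_dynamic(0);       /* disable dynamic adjustment of team size */
    omp_set_num_threads(4);   /* request 4 threads for subsequent regions */

    #pragma omp parallel
    {
        #pragma omp single    /* one thread reports the actual team size */
        printf("team size = %d\n", omp_get_num_threads());
    }
    return 0;
}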

21

OpenMP Work-sharing Constructs

• "for loop": data parallelism
• "sections": functional parallelism
• "single": serialize a section

22

Example: Count3s in an array

Let's assume we have an array of N integers, and we want to find how many 3s are in the array.

We need a for loop, an if statement, and a count variable.

Let's look at its serial and parallel versions.

23

Serial: Count3s in an array

int count = 0, i, n = 100;
int array[n];  // initialize array

for (i = 0; i < n; i++) {
    if (array[i] == 3)
        count++;
}

24

Work-sharing construct: “for loop”

The "for loop" work-sharing construct can be thought of as a data-parallelism construct.

25

Parallelize 1st attempt: Count3s in an array

int count = 0, i, n = 100;
int array[n];  // initialize array

#pragma omp parallel for default(none) shared(n,array,count) private(i)
for (i = 0; i < n; i++) {
    if (array[i] == 3)
        count++;   // data race: multiple threads update the shared count
}

26

Work-sharing construct: Example of "for loop"

#pragma omp parallel for default(none) shared(n,a,b,c) private(i)
for (i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
}

27

Work-sharing construct: “section”

The "sections" work-sharing construct can be thought of as a functional-parallelism construct.

28

Parallelize 2nd attempt: Count3s in an array

• Say we also want to count 4s in the same array.
• Now we have two different functions, i.e. count 3s and count 4s.

int count3 = 0, count4 = 0, i, n = 100;
int array[n];  // initialize array

#pragma omp parallel sections default(none) shared(n,array,count3,count4) private(i)
{
    #pragma omp section
    for (i = 0; i < n; i++) {
        if (array[i] == 3)
            count3++;
    }

    #pragma omp section
    for (i = 0; i < n; i++) {
        if (array[i] == 4)
            count4++;
    }
}

No data race condition in this example. WHY?

29

Work-sharing construct: Example 1 of "sections"

#pragma omp parallel sections default(none) shared(a,b,c,d,e,n) private(i)
{
    #pragma omp section
    {
        printf("Thread %d executes 1st loop\n", omp_get_thread_num());
        for (i = 0; i < n; i++)
            a[i] = 3 * b[i];
    }

    #pragma omp section
    {
        printf("Thread %d executes 2nd loop\n", omp_get_thread_num());
        for (i = 0; i < n; i++)
            e[i] = 2 * c[i] + d[i];
    }
}
final_sum = sum(a, n) + sum(e, n);
printf("FINAL_SUM is %d\n", final_sum);

30

Work-sharing construct: Example 2 of "sections" 1/2

31

Work-sharing construct: Example 2 of "sections" 2/2

32

Work-sharing construct: Example of "single"

Within a parallel region, a "single" block specifies that the enclosed block is executed by only one thread in the team.

Let's look at an example; a sketch is given below.
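The slide's example is not reproduced in the transcript; a minimal sketch of "single" (messages are illustrative):

#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        /* executed by every thread in the team */
        printf("thread %d enters the region\n", omp_get_thread_num());

        #pragma omp single
        {
            /* executed by exactly one thread; the others wait at the
               implicit barrier at the end of single unless nowait is given */
            printf("thread %d does the single work\n", omp_get_thread_num());
        }
    }
    return 0;
}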

33

I. Introduction to OpenMP

II. OpenMP Programming Model

III. OpenMP Directives

IV. OpenMP Clauses

V. Run-Time Library Routines

VI. Environment Variables

VII. Summary

34

OpenMP Clauses: Data sharing 1/2

shared(list)
The shared clause is used to specify which data are shared among threads. All threads can read and write these shared variables. By default, all variables are shared.

private(list)
Private variables are local to each thread. A typical example of a private variable is a loop counter, since each thread has its own loop counter, initialized at the entry point.

35

OpenMP Clauses: Data sharing 2/2

A private variable is defined between the entry and exit points of a parallel region, and has no scope outside of it.

The firstprivate and lastprivate clauses are used to extend the scope of a variable beyond the parallel region:

firstprivate: all variables in the list are initialized with the value the original object had before entering the parallel region.

lastprivate: the thread that executes the last iteration or section updates the value of the object in the list.

36

Example: firstprivate and lastprivate

int main() {
    int i, n = 10, C, B, A = 10;   // n and its value are assumed for illustration

    /*--- Start of parallel region ---*/
    #pragma omp parallel for default(none) shared(n) firstprivate(A) \
            lastprivate(B) private(i)
    for (i = 0; i < n; i++) {
        …
        B = i + A;   // every thread starts with its own copy of A == 10
        …
    }
    /*--- End of parallel region ---*/

    C = B;   // B holds the value from the last iteration (i == n-1)
}

37

OpenMP Clause: nowait

The nowait clause is used to remove the implicit barrier synchronization at the end of a work-sharing construct, as sketched below.
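A minimal sketch of nowait (the arrays and loop bodies are illustrative assumptions): threads finishing the first loop can start the second without waiting for the rest, which is safe here because the two loops touch independent arrays.

#include <omp.h>

void work(float *a, float *b, int n) {
    #pragma omp parallel
    {
        #pragma omp for nowait   /* no barrier after this loop */
        for (int i = 0; i < n; i++)
            a[i] = a[i] * 2.0f;

        #pragma omp for          /* implicit barrier after this loop */
        for (int i = 0; i < n; i++)
            b[i] = b[i] + 1.0f;
    }
}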

38

OpenMP Clause: schedule

The schedule clause is supported on the loop construct only. It is used to control the manner in which loop iterations are distributed over the threads (a sketch follows the list).

Syntax: schedule(kind[,chunk_size])

Kinds:

static[,chunk]: distribute iterations in blocks of size "chunk" over the threads in round-robin fashion

dynamic[,chunk]: fixed portions of work whose size is controlled by the value of chunk; when a thread finishes its portion, it starts on the next one

guided[,chunk]: same as "dynamic", but the size of the portion of work decreases exponentially

runtime: the iteration scheduling scheme is set at run time through the environment variable OMP_SCHEDULE
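A minimal sketch comparing two schedule kinds (the chunk sizes and loop are illustrative assumptions):

#include <stdio.h>
#include <omp.h>

int main() {
    int i, n = 16;

    /* round-robin blocks of 4 iterations per thread */
    #pragma omp parallel for schedule(static, 4)
    for (i = 0; i < n; i++)
        printf("static : iter %2d on thread %d\n", i, omp_get_thread_num());

    /* threads grab 2 iterations at a time as they become free */
    #pragma omp parallel for schedule(dynamic, 2)
    for (i = 0; i < n; i++)
        printf("dynamic: iter %2d on thread %d\n", i, omp_get_thread_num());

    return 0;
}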

39

The Experiment with schedule clause

40

OpenMP Critical Construct

Example: summation of a vector (this version has a race condition on sum)

int main() {
    int i, sum = 0, n = 5;
    int a[5] = {1, 2, 3, 4, 5};

    /*--- Start of parallel region ---*/
    #pragma omp parallel for default(none) shared(sum,a,n) private(i)
    for (i = 0; i < n; i++) {
        sum += a[i];   // race condition: concurrent updates to shared sum
    }
    /*--- End of parallel region ---*/

    printf("sum of vector a = %d", sum);
}

41

OpenMP Critical Construct

int main() {
    int i, sum = 0, local_sum, n = 5;
    int a[5] = {1, 2, 3, 4, 5};

    /*--- Start of parallel region ---*/
    #pragma omp parallel default(none) shared(sum,a,n) private(local_sum,i)
    {
        local_sum = 0;   // each thread accumulates into its own copy

        #pragma omp for
        for (i = 0; i < n; i++) {
            local_sum += a[i];
        }

        #pragma omp critical
        {
            sum += local_sum;   // one thread at a time updates the shared sum
        }
    }
    /*--- End of parallel region ---*/

    printf("sum of vector a = %d", sum);
}

42

Parallelize 3rd attempt: Count3s in an array

int count = 0, local_count, i, n = 100;
int array[n];  // initialize array

#pragma omp parallel default(none) shared(n,array,count) private(i,local_count)
{
    local_count = 0;

    #pragma omp for
    for (i = 0; i < n; i++) {
        if (array[i] == 3)
            local_count++;
    }

    #pragma omp critical
    {
        count += local_count;
    }
} /*--- End of parallel region ---*/

43

OpenMP Clause: reduction

int main() {
    int i, sum = 0, n = 5;
    int a[5] = {1, 2, 3, 4, 5};

    /*--- Start of parallel region ---*/
    #pragma omp parallel for default(none) shared(a,n) private(i) \
            reduction(+:sum)
    for (i = 0; i < n; i++) {
        sum += a[i];
    }
    /*--- End of parallel region ---*/

    printf("sum of vector a = %d", sum);
}

• OpenMP provides a reduction clause, which is used with the for loop and sections directives.
• The reduction variable must be shared among threads.
• The race condition is avoided implicitly.

44

Parallelize 4th attempt: Count3s in an array

int count = 0, i, n = 100;
int array[n];  // initialize array

#pragma omp parallel for default(none) shared(n,array) private(i) \
        reduction(+:count)
for (i = 0; i < n; i++) {
    if (array[i] == 3)
        count++;
} /*--- End of parallel region ---*/

45

Tasking in OpenMP

In OpenMP 3.0 the concept of tasks was added to the OpenMP execution model.

The task model is useful in cases where the number of parallel pieces and the work involved in each piece vary and/or are unknown.

Before the inclusion of the task model, OpenMP was not well suited to unstructured problems.

Tasks are often set up within a "single" construct in a manager-worker model.

46

Task Parallelism Approach 1/2

Threads line up as workers, go through the queue of work to be done, and each does a task.

Threads do not wait, as in loop parallelism; rather, they go back to the queue and do more tasks.

Each task is executed serially by the worker thread that encounters it in the queue.

Load balancing occurs as short and long tasks are done as threads become available.

47

48

Task Parallelism Approach 2/2

49

Example: Task parallelism
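The example code on this slide is an image and is not in the transcript; a minimal sketch of OpenMP 3.0 tasks in the manager-worker style described above (the linked-list traversal is an illustrative assumption):

#include <stdio.h>
#include <omp.h>

typedef struct node { int value; struct node *next; } node;

void process(node *p) {
    printf("thread %d processes %d\n", omp_get_thread_num(), p->value);
}

void traverse(node *head) {
    #pragma omp parallel
    {
        #pragma omp single   /* one thread creates the tasks ... */
        {
            for (node *p = head; p != NULL; p = p->next) {
                #pragma omp task firstprivate(p)   /* ... workers execute them */
                process(p);
            }
        }   /* implicit barrier: all tasks finish before the team disbands */
    }
}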

50

Best Practices

• Optimize barrier use
• Avoid the ordered construct
• Avoid large critical regions
• Maximize parallel regions (see the sketch below)
• Avoid multiple uses of parallel regions
• Address poor load balance
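A minimal sketch of the "maximize parallel regions" advice (the loops are illustrative assumptions): fusing two adjacent parallel loops into one region pays the fork-join overhead once instead of twice.

#include <omp.h>

void compute(double *a, double *b, int n) {
    /* One parallel region containing two work-shared loops:
       the thread team is forked once instead of once per loop. */
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; i++)
            a[i] = 2.0 * i;

        #pragma omp for
        for (int i = 0; i < n; i++)
            b[i] = 3.0 * i;
    }
    /* Writing each loop as its own "#pragma omp parallel for"
       would fork and join the team twice. */
}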

51

I. Introduction to OpenMP

II. OpenMP Programming Model

III. OpenMP Directives

IV. OpenMP Clauses

V. Run-Time Library Routines

VI. Environment Variables

VII. Summary

52

List of Run-Time Library Routines

Run-time library routines are declared in the omp.h header file:

void omp_set_num_threads(int num);
int omp_get_num_threads();
int omp_get_max_threads();
int omp_get_thread_num();
int omp_get_thread_limit();
int omp_get_num_procs();
double omp_get_wtime();
int omp_in_parallel();  // returns 0 for false, non-zero for true

... and a few more
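A minimal sketch exercising a few of these routines (the dummy workload is an illustrative assumption):

#include <stdio.h>
#include <omp.h>

int main() {
    printf("procs = %d, max threads = %d\n",
           omp_get_num_procs(), omp_get_max_threads());

    double t0 = omp_get_wtime();   /* wall-clock timer */

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 10000000; i++)
        sum += i * 0.5;            /* dummy workload */

    printf("sum = %g, elapsed = %f s, in_parallel = %d\n",
           sum, omp_get_wtime() - t0, omp_in_parallel());
    return 0;
}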

53

More Run-Time Library Routines

These routines are new with OpenMP 3.0.

54

I. Introduction to OpenMP

II. OpenMP Programming Model

III. OpenMP Directives

IV. OpenMP Clauses

V. Run-Time Library Routines

VI. Environment Variables

VII. Summary

55

Environment Variables

OMP_NUM_THREADS

OMP_DYNAMIC

OMP_THREAD_LIMIT

OMP_STACKSIZE

56

I. Introduction to OpenMP

II. OpenMP Programming Model

III. OpenMP Directives

IV. OpenMP Clauses

V. Run-Time Library Routines

VI. Environment Variables

VII. Summary

57

Summary

OpenMP provides a small yet powerful programming model.

Compilers with OpenMP support are widely available.

OpenMP is a directive-based shared-memory programming model.

The OpenMP API is a general-purpose parallel programming API with an emphasis on the ability to parallelize existing programs.

Scalable parallel programs can be written by using parallel regions.

Work-sharing constructs enable efficient parallelization of computationally intensive portions of a program.

58

Thank You
and
Exercise Session