OpenMP - Technische Universität München (Gerndt), Parallel Programming
1
OpenMP
2
Shared Memory Architektur
Processor
BUS
Memory
Processor Processor Processor
3
OpenMP
• Portable programming of shared memory systems.
• It is a quasi-standard.
• OpenMP-Forum
• Started in 1997
• Current standard OpenMP 4.0 from July 2013
• API for Fortran and C/C++
• directives
• runtime routines
• environment variables
• www.openmp.org
4
Example

Program:
#include <stdio.h>
#include <omp.h>
main(){
  #pragma omp parallel
  {
    printf("Hello world\n");
  }
}

Compilation:
> icc -O3 -openmp openmp.c

Execution:
> export OMP_NUM_THREADS=3
> a.out
Hello world
Hello world
Hello world
> export OMP_NUM_THREADS=2
> a.out
Hello world
Hello world
5
Execution Model

#pragma omp parallel
{
  printf("Hello world %d\n", omp_get_thread_num());
}

Diagram: the single thread T0 creates a team (T0, T1, T2) when it hits PARALLEL; each thread executes the print; the team is destroyed at the end of the region and T0 continues alone.
6
Fork/Join Execution Model
1. An OpenMP program starts as a single thread (the master thread).
2. Additional threads (the team) are created when the master hits a parallel region.
3. When all threads have finished the parallel region, the extra threads are given back to the runtime or operating system.
• A team consists of a fixed set of threads executing
the parallel region redundantly.
• All threads in the team are synchronized at the end
of a parallel region via a barrier.
• The master continues after the parallel region.
7
Work Sharing in a Parallel Region
main (){
  int a[100];
  #pragma omp parallel
  {
    #pragma omp for
    for (int i = 0; i < 100; i++)
      a[i] = i;
    ...
  }
}
8
Shared and Private Data
• Shared data are accessible by all threads. A reference
a[5] to a shared array accesses the same address in
all threads.
• Private data are accessible only by a single thread.
Each thread has its own copy.
• The default is shared.
9
Private clause for parallel loop
main (){
  int a[100], t;
  #pragma omp parallel
  {
    #pragma omp for private(t)
    for (int i = 0; i < 100; i++){
      t = f(i);
      a[i] = t;
    }
  }
}
10
Example: Private Data
i = 3;
#pragma omp parallel private(i)
{
  i = 17;
}
printf("Value of i = %d\n", i);

Diagram: the shared i is 3 before the region; inside, each thread writes its own private copy (i1 = i2 = i3 = 17), which is discarded at the end, so the print shows 3 (strictly, the standard leaves the shared copy's value undefined after the region). Without the private clause, the threads would all write the shared i, leaving i = 17.
11
Example
main (){
  int iam, nthreads;
  #pragma omp parallel private(iam, nthreads)
  {
    iam = omp_get_thread_num();
    nthreads = omp_get_num_threads();
    printf("ThreadID %d, out of %d threads\n", iam, nthreads);
    if (iam == 0) // Different control flow
      printf("Here is the Master Thread.\n");
    else
      printf("Here is another thread.\n");
  }
}
12
Private Data
• A new copy is created for each thread.
• One thread may reuse the global shared copy.
• The private copies are destroyed after the parallel
region.
• The value of the shared copy is undefined.
13
Example: Shared Data
i = 77;
#pragma omp parallel shared(i)
{
  i = omp_get_thread_num();
}
printf("Value of i = %d\n", i);

In the parallel region all threads write the same shared i, so the final value is a race: with three threads the print shows 0, 1, or 2, depending on which thread wrote last.
14
int main() {
  #pragma omp parallel default(shared)
  {
    printf("hello world\n");
  }
}

!$OMP PARALLEL DEFAULT(SHARED)
write(*,*) 'Hello world'
!$OMP END PARALLEL
Syntax of Directives and Pragmas
Fortran
!$OMP directive name [parameters]
C / C++
#pragma omp directive name [parameters]
15
Directives
Directives can have continuation lines.
• Fortran
!$OMP directive name first_part &
!$OMP continuation_part
• C
#pragma omp parallel private(i) \
    private(j)
16
#pragma omp parallel [parameters]
{
parallel region
}
Parallel Region
• The statements enclosed lexically within a region
define the lexical extent of the region.
• The dynamic extent further includes the routines
called from within the construct.
17
Lexical and Dynamic Extent
main (){
int a[100];
#pragma omp parallel
{
…
}
}
sub(int a[], int n)
{
  #pragma omp for
  for (int i = 0; i < n; i++)
    a[i] = i;
}
• Local variables of a subroutine called in a parallel region are
by default private.
18
Work-Sharing Constructs
• Work-sharing constructs distribute the specified work
to all threads within the current team.
• Types
• Parallel loop
• Parallel section
• Master region
• Single region
• General work-sharing construct (only Fortran)
19
#pragma omp for [parameters]
for ...
Parallel Loop
• The iterations of the loop are distributed to the threads.
• The scheduling of loop iterations is determined by one of the
scheduling strategies static, dynamic, guided, and runtime.
• There is no synchronization at the beginning.
• All threads of the team synchronize at an implicit barrier if the
parameter nowait is not specified.
• The loop variable is by default private. It must not be modified in
the loop body.
• The expressions in the for-statement are very restricted.
20
Scheduling Strategies
• Schedule clause
schedule (type [,size])
• Scheduling types:
• static: Chunks of the specified size are assigned in a round-
robin fashion to the threads.
• dynamic: The iterations are broken into chunks of the
specified size. When a thread finishes the execution of a
chunk, the next chunk is assigned to that thread.
• guided: Similar to dynamic, but the size of the chunks is
exponentially decreasing. The size parameter specifies the
smallest chunk. The initial chunk is implementation
dependent.
• runtime: The scheduling type and the chunk size is
determined via environment variables.
21
Example: Dynamic Scheduling
main(){
int a[1000];
#pragma omp parallel
{
#pragma omp for schedule(dynamic, 4)
for (int i=0; i<1000;i++)
a[i] = omp_get_thread_num();
#pragma omp for schedule(guided)
for (int i=0; i<1000;i++)
a[i] = omp_get_thread_num();
}
}
22
Reductions
reduction(operator: list)
• This clause performs a reduction on the variables that
appear in list, with the operator operator.
• Variables must be shared scalars.
• operator is one of the following: +, *, -, &, ^, |, &&, ||
• A reduction variable may only appear in statements of the
following forms:
• x = x operator expr
• x binop= expr
• x++, ++x, x--, --x
23
Example: Reduction
#pragma omp parallel for reduction(+: a)
for (i=0; i<n; i++) {
a = a + b[i];
}
24
Classification of Variables
• private(var-list)
• Variables in var-list are private.
• shared(var-list)
• Variables in var-list are shared.
• default(private | shared | none)
• Sets the default for all variables in this region.
• firstprivate(var-list)
• Variables are private and are initialized with the value of the
shared copy before the region.
• lastprivate(var-list)
• Variables are private and the value of the thread executing the
last iteration of a parallel loop in sequential order is copied to
the variable outside of the region.
25
Scoping Variables with Private Clause
• The values of the shared copies of i and j are undefined on exit
from the parallel region.
• The private copies of j are initialized in the parallel region to 2.
int i, j;
i = 1;
j = 2;
#pragma omp parallel private(i) firstprivate(j)
{
i = 3;
j = j + 2;
printf("%d %d\n", i, j);
}
26
Parallel Section
• Each section of a parallel section is executed once by
one thread of the team.
• Threads that have finished their section wait at the implicit
barrier at the end of the sections construct.
#pragma omp sections [parameters]
{
[#pragma omp section]
block
[#pragma omp section
block ]
}
27
Example: Parallel Section
main(){
int i, a[1000], b[1000];
#pragma omp parallel private(i)
{
#pragma omp sections
{
#pragma omp section
for (int i=0; i<1000; i++)
a[i] = 100;
#pragma omp section
for (int i=0; i<1000; i++)
b[i] = 200;
}
}
}
28
OMP Workshare (Fortran only)
• The WORKSHARE directive divides the work of
executing the enclosed code into separate units of
work and distributes the units amongst the threads.
• An implementation of the WORKSHARE directive
must insert any synchronization that is required to
maintain standard Fortran semantics.
• There is an implicit barrier at the end of the workshare
region.
!$OMP WORKSHARE [parameters]
block
!$OMP END WORKSHARE [NOWAIT]
29
Sharing Work in a Fortran 90 Array Statement
A(1:N)=B(2:N+1)+C(1:N)
• Each evaluation of an array expression for an
individual index is a unit of work.
• The assignment to an individual array element is also
a unit of work.
30
Master / Single Region
• A master or single region enforces that only a single thread executes the enclosed code within a parallel region.
• Common
• No synchronization at the beginning of the region.
• Different
• The master region is executed by the master thread, while the single region can be executed by any thread.
• The master region is skipped by the other threads, while all threads are synchronized at the end of a single region.
#pragma omp master
block
#pragma omp single [parameters]
block
31
Combined Work-Sharing and Parallel Constructs
• #pragma omp parallel for
• #pragma omp parallel sections
• !$OMP PARALLEL WORKSHARE
32
#pragma omp barrier
Barrier
• The barrier synchronizes all the threads in a team.
• When encountered, each thread waits until all of the other threads in that
team have reached this point.
33
#pragma omp critical [(Name)]
{ ... }
Critical Section
• Mutual exclusion
• A critical section is a block of code that can be executed by only one
thread at a time.
• Critical section name
• A thread waits at the beginning of a critical section until no other
thread is executing a critical section with the same name.
• All unnamed critical directives map to the same name.
• Critical section names are global entities of the program. If a name
conflicts with any other entity, the behavior of the program is
unspecified.
• Avoid long critical sections
34
Example: Critical Section

main(){
  int a[N], b[N];
  int ia = 0;
  int ib = 0;
  int itotal = 0;
  for (int i = 0; i < N; i++) {
    a[i] = i;
    b[i] = N - i;
  }
  #pragma omp parallel
  {
    #pragma omp sections
    {
      #pragma omp section
      {
        for (int i = 0; i < N; i++)
          ia = ia + a[i];
        #pragma omp critical (c1)
        {
          itotal = itotal + ia;
        }
      }
      #pragma omp section
      {
        for (int i = 0; i < N; i++)
          ib = ib + b[i];
        #pragma omp critical (c1)
        {
          itotal = itotal + ib;
        }
      }
    }
  }
}
35
#pragma omp atomic
expression-stmt

Atomic Statements
• The atomic directive ensures that a specific memory
location is updated atomically.
• The statement has to have one of the following forms:
• x binop= expr
• x++ or ++x
• x-- or --x
• where x is an lvalue expression with scalar type and expr
does not reference the object designated by x.
• All parallel assignments to the location must be
protected with the atomic directive.
36
Translation of Atomic
#pragma omp atomic
x += expr;

can be rewritten as

xtmp = expr;
#pragma omp critical (name)
{
  x = x + xtmp;
}

• Only the load and store of x are protected; expr is evaluated outside the critical section.
37
Simple Locks
• Locks can be held by only one thread at a time.
• A lock is represented by a lock variable of type
omp_lock_t.
• The thread that obtained a simple lock cannot set it
again.
• Operations
• omp_init_lock(&lockvar): initialize a lock
• omp_destroy_lock(&lockvar): destroy a lock
• omp_set_lock(&lockvar): set lock
• omp_unset_lock(&lockvar): free lock
• logicalvar = omp_test_lock(&lockvar): check lock and possibly
set lock, returns true if lock was set by the executing thread.
38
Example: Simple Lock
#include <omp.h>
int id;
omp_lock_t lock;
omp_init_lock(&lock);
#pragma omp parallel shared(lock) private(id)
{
  id = omp_get_thread_num();
  omp_set_lock(&lock); // Only a single thread writes at a time
  printf("My thread num is: %d\n", id);
  omp_unset_lock(&lock);
  while (!omp_test_lock(&lock))
    other_work(id);      // Lock not obtained
  real_work(id);         // Lock obtained
  omp_unset_lock(&lock); // Lock freed
}
omp_destroy_lock(&lock);
39
Nestable Locks
• Unlike simple locks, nestable locks can be set multiple
times by a single thread.
• Each set operation increments a lock counter.
• Each unset operation decrements the lock counter.
• If the lock counter is 0 after an unset operation, the
lock can be set by another thread.
40
Ordered Construct
• Construct must be within the dynamic extent of an
omp for construct with an ordered clause.
• Ordered constructs are executed strictly in the order in
which they would be executed in a sequential
execution of the loop.
#pragma omp for ordered
for (...)
{ ...
#pragma omp ordered
{ ... }
...
}
41
Example with ordered clause
#pragma omp for ordered
for (...)
{ S1
  #pragma omp ordered
  { S2 }
  S3
}

Diagram: across iterations i = 1, 2, 3, ..., N, the statements S1 and S3 run in parallel, while the ordered block S2 is executed strictly in iteration order; a barrier follows the loop.
42
Flush
• The flush directive synchronizes copies in register or cache of the
executing thread with main memory.
• It synchronizes the variables in the given list or, if no list is
specified, all shared variables accessible in the region.
• It does not update implicit copies at other threads.
• Load/stores executed before the flush in program order have to
be finished.
• Load/stores following the flush in program order are not allowed
to be executed before the flush.
• A flush is executed implicitly for some constructs, e.g. begin and
end of a parallel region, end of work-sharing constructs ...
#pragma omp flush [(list)]
43
Example: Flush
#define MAXTHREAD 100
int iam, neigh, isync[MAXTHREAD+1];
isync[0] = 1;
for (int i = 1; i <= MAXTHREAD; i++)
  isync[i] = 0;
#pragma omp parallel private(iam, neigh)
{
  iam = omp_get_thread_num() + 1;
  neigh = iam - 1;
  // Wait for neighbor
  while (isync[neigh] == 0) {
    #pragma omp flush(isync)
  }
  // Do my work
  work();
  isync[iam] = 1; // I am done
  #pragma omp flush(isync)
}
44
Lastprivate example
k = 0;
#pragma omp parallel
{
  #pragma omp for lastprivate(k)
  for (i = 0; i < 100; i++) {
    a[i] = b[i] + b[i+1];
    k = 2 * i;
  }
}
// The value of k is 198
45
Copyprivate Example
• Copyprivate
• Clause only for single region.
• Variables must be private in enclosing parallel region.
• Value of executing thread is copied to all other threads.
#pragma omp parallel private(x)
{
#pragma omp single copyprivate(x)
{
getValue(x);
}
useValue(x);
}
46
Other Copyprivate Example
float read_next( ) {
float * tmp;
float return_val;
#pragma omp single copyprivate(tmp)
{
tmp = (float *) malloc(sizeof(float));
}
#pragma omp master
{
get_float( tmp );
}
#pragma omp barrier
return_val = *tmp;
#pragma omp barrier
#pragma omp single
{
free(tmp);
}
return return_val;
}
47
Runtime Routines for Threads (1)
• Determine the number of threads for parallel regions
• omp_set_num_threads(count)
• Query the maximum number of threads for team
creation
• numthreads = omp_get_max_threads()
• Query number of threads in the current team
• numthreads = omp_get_num_threads()
• Query own thread number (0..n-1)
• iam = omp_get_thread_num()
• Query number of processors
• numprocs = omp_get_num_procs()
48
Runtime Routines for Threads (2)
• Query state
logicalvar = omp_in_parallel()
• Allow runtime system to determine the number of
threads for team creation
omp_set_dynamic(logicalexpr)
• Query whether runtime system can determine the
number of threads
logicalvar= omp_get_dynamic()
• Allow nesting of parallel regions
omp_set_nested(logicalexpr)
• Query nesting of parallel regions
logicalvar= omp_get_nested()
49
Environment Variables
• OMP_NUM_THREADS=4
• Number of threads in a team of a parallel region
• OMP_SCHEDULE=”dynamic”
OMP_SCHEDULE=”GUIDED,4“
• Selects scheduling strategy to be applied at runtime
• OMP_DYNAMIC=TRUE
• Allow runtime system to determine the number of threads.
• OMP_NESTED=TRUE
• Allow nesting of parallel regions.
50
OpenMP 3.0
• Introduced May 2008
• OpenMP 3.1, July 2011
51
Explicit Tasking
• Explicit creation of tasks

#pragma omp parallel
{
  #pragma omp single
  {
    for (elem = l->first; elem; elem = elem->next)
      #pragma omp task
      process(elem);
  }
  // all tasks are complete by this point
}
• Task scheduling
• Tasks can be executed by any thread in the team
• Barrier
• All tasks created in the parallel region have to be finished.
52
#pragma omp task [clause list]
{ ... }

Tasks

Clauses
• if (scalar-expression)
• FALSE: execution starts immediately by the creating thread.
• The suspended task may not be resumed until the new task is finished.
• untied
• The task is not tied to the thread starting its execution. It might be rescheduled to another thread.
• default(shared|none), private, firstprivate, shared
• If no default clause is present, the implicit data-sharing attribute is firstprivate.
Binding
• The binding thread set of the task region is the current team.
• A task region binds to the innermost enclosing parallel region.
53
Example: Tree Traversal
struct node {
struct node *left;
struct node *right;
};
void traverse( struct node *p ) {
if (p->left)
#pragma omp task // p is firstprivate by default
traverse(p->left);
if (p->right)
#pragma omp task // p is firstprivate by default
traverse(p->right);
process(p);
}
54
#pragma omp taskwait
Task Wait
• Waits for completion of immediate child tasks
• Child tasks: Tasks generated since the beginning of the current task.
55
OpenMP 4
• Task dependencies via the new depend clause
• depend(dependence-type: list)
• where dependence-type = in | out | inout
• Dependencies to previously generated sibling tasks.
• in: the generated task will be a dependent task of all
previously generated sibling tasks that reference at least one
of the list items in an out or inout dependence-type list.
• out & inout: the generated task will be a dependent task
of all previously generated sibling tasks that reference at least
one of the list items in an in, out, or inout dependence-type
list.
56
#pragma omp taskyield
Taskyield
• The taskyield construct specifies that the current task can be
suspended in favor of execution of a different task.
• Explicit task scheduling point
• Implicit task scheduling points
• Task creation
• End of a task
• Taskwait
• Barrier synchronization
57
Switch task while waiting
void foo ( omp_lock_t * lock, int n )
{
int i;
for ( i = 0; i < n; i++ )
#pragma omp task
{
something_useful();
while ( !omp_test_lock(lock) ) {
#pragma omp taskyield
}
something_critical();
omp_unset_lock(lock);
}
}
65
Terms
• tied task A task that, when its task region is
suspended, can be resumed only by the same thread
that suspended it. That is, the task is tied to that
thread.
• untied task (untied clause) A task that, when its task
region is suspended, can be resumed by any thread in
the team. That is, the task is not tied to any thread.
• undeferred task (if clause is false) A task for which
execution is not deferred with respect to its generating
task region. That is, its generating task region is
suspended until execution of the undeferred task is
completed.
66
Terms
• included task A task for which execution is
sequentially included in the generating task region.
That is, it is undeferred and executed immediately by
the encountering thread. It has its own data
environment.
• merged task (mergeable clause) A task whose data
environment is the same as that of its generating task
region.
• final task (final clause) A task that forces all of its
child tasks to become final and included tasks.
67
Mergeable tasks
#include <stdio.h>
void foo ( )
{
  int x = 2;
  #pragma omp task mergeable
  {
    x++;
  }
  #pragma omp taskwait
  printf("%d\n", x); // prints 2 or 3
}
68
Mergeable tasks
#include <stdio.h>
void foo ( )
{
int x = 2;
#pragma omp task shared(x) mergeable
{
x++;
}
#pragma omp taskwait
printf("%d\n", x); // prints 3
}
69
Synchronization in Tasks – Potential Deadlock
void work()
{
  #pragma omp task
  { // Task 1
    #pragma omp task
    { // Task 2
      #pragma omp critical // Critical region 1
      { /* do work here */ }
    }
    #pragma omp critical // Critical region 2
    { // Capture data for the following task
      #pragma omp task
      { /* do work here */ } // Task 3
    }
  }
}
70
Collapsing of loops
• Handles multi-dimensional perfectly nested loops
• Larger iteration space ordered according to sequential
execution.
• Schedule clause applies to new iteration space
#pragma omp parallel for collapse(2)
for (i=0; i<n; i++)
for (j=0; j<n; j++)
for (k=0; k<n; k++)
{
.....
}
71
Guaranteed Scheduling
• Same work distribution if
• Same number of iterations, schedule static with same
chunksize
• Both regions bind to same parallel region
!$omp do schedule(static)
do i=1,n
a(i) = ....
end do
!$omp end do nowait
!$omp do schedule(static)
do i=1,n
.... = a(i)
end do
72
Scheduling strategy auto for parallel loops
• New scheduling strategy auto
• It is up to the compiler to determine the scheduling.
73
Nested Parallelism
• Currently only a single copy of the control variable
specifying the number of threads in a team.
• omp_set_num_threads()
• Can be called only outside of parallel regions.
• This also applies to nested parallelism:
• all teams have the same size,
• unless a num_threads clause is given on the parallel region.
• OpenMP 3.0 supports individual copies
• There is one copy per task.
• Teams might have different sizes.
74
OpenMP 4
• SIMD support
• Directive for loops: guarantees that loop can be executed in a
SIMD fashion
• Directive for omp loops: iterations are parallelized and those
assigned to a thread are executed with SIMD instructions
• Target construct for accelerators
• User-defined reductions
• Cancellation of a parallel region
• Affinity
• Places: Thread, core, socket
• Affinity policies: spread, close, master
76
Summary
• OpenMP is quasi-standard for shared memory
programming
• Based on Fork-Join Model
• Parallel region and work sharing constructs
• Declaration of private or shared variables
• Reduction variables
• Scheduling strategies
• Synchronization via Barrier, Critical section, Atomic,
locks, nestable locks
• Task concept
• SIMD and accelerator support.