UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science
Parallel & Concurrent Programming: OpenMP
Emery Berger
CMPSCI 691W
Spring 2006
Outline
Last time(s): MPI – point-to-point & collective communication (library calls)
Today: OpenMP – parallel directives (language extensions to Fortran/C/C++)
Motivation
Take vectors a & b (100 ints). Distribute across all processors. Each processor: compute the sum of its a[i] * b[i]; then print the overall sum.
MPI: use MPI_Scatter, MPI_Gather, or MPI_Reduce:
MPI_Scatter/Gather(sendbuf, cnt, type, recvbuf, recvcnt, type, root, comm)
MPI_Reduce(sendbuf, recvbuf, cnt, type, op, root, comm)
MPI Solution

MPI_Init (&argc, &argv);
MPI_Comm_rank (MPI_COMM_WORLD, &rank);
MPI_Comm_size (MPI_COMM_WORLD, &size);

// Distribute a and b (100/size elements to each process)
MPI_Scatter (a, 100 / size, MPI_INT, a1, 100 / size, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Scatter (b, 100 / size, MPI_INT, b1, 100 / size, MPI_INT, 0, MPI_COMM_WORLD);

// Multiply each chunk
for (int i = 0; i < 100 / size; i++) {
  z += a1[i] * b1[i];
}

// Reduce by summing (separate receive buffer at the root)
int zsum = 0;
MPI_Reduce (&z, &zsum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

// Output result
if (rank == 0) {
  cout << zsum << endl;
}
Ideal Solution
int z = 0;
parallel for (i = 0; i < nProcessors; i++) {
  z += a[i] * b[i];
}
cout << z << endl;
OpenMP Solution
int z = 0;
#pragma omp parallel for
for (int i = 0; i < 100; i++) {
  z += a[i] * b[i];
}
cout << z << endl;

OpenMP pragma directives:
Omit them = sequential program
More declarative style
Add more pragmas for more efficiency
OpenMP Concepts
Fork-join model:
One thread executes the sequential code
Upon reaching a parallel directive: start a new team of work-sharing threads, then wait until all are done (usually a barrier)
Parallel regions can be nested!
Apparent global shared memory, but with a relaxed consistency model
Consistency
Consistency = the ordering of reads & writes, both within one thread and across threads.
The most "intuitive" consistency model is sequential consistency (Lamport): the program behaves like some sequential interleaving of the threads.
BUT: it seriously limits parallelism, because the implementation must synchronize frequently.
OpenMP Consistency
OpenMP defines consistency in terms of flushes. A flush writes a set of variables back to memory. If two flushes have intersecting variable sets, all threads must see those flushes in the same sequential order.
Parallel Execution
#pragma omp parallel executes the next chunk of code across all threads, or across some given number of them via the num_threads(n) clause.
Only the "master thread" continues after the parallel section completes.
Dynamic Threads
Parallel + nowait
There is an implicit barrier at the end of each work-sharing construct unless nowait is specified. A barrier also implies a flush operation.
Parallel + Memory
Memory model:
Heap objects are shared
Stack objects are private (this includes loop iterators)
...unless indicated otherwise
Parallel Example
Data-Sharing Attributes
shared
private: each thread gets its own copy, with an undefined initial value
firstprivate: copies in the original value
lastprivate: copies out the private value
Lastprivate Example
Threadprivate Example
Can also declare variables as always thread-private, using the threadprivate directive.
Reduce
reduction:
private value per thread, initialized "appropriately"
uses predefined operators
copies out to the original variable
reduction(+:a) initializes a = 0
reduction(*:a) initializes a = 1
OpenMP Solution
int z = 0;
#pragma omp parallel for reduction(+:z)
for (int i = 0; i < 100; i++) {
  z += a[i] * b[i];
}
cout << z << endl;
All Together
But Still Races...
Master & Synchronization
master: always run by the master thread
critical: declares a critical section (one thread at a time); can add names for greater concurrency
barrier
atomic: the update is performed atomically (a++, a--, etc.)
ordered: executes the loop body sequentially
Atomic Example
The End
Single Example
Ordered For
Copyin Example
Copyprivate Example
The End
Next time:OpenMP