Transcript of Lecture 04-06: Programming with OpenMP (GitHub Pages)
Lecture 04-06: Programming with OpenMP
Concurrent and Multicore Programming, CSE536
Department of Computer Science and Engineering, Yonghong Yan
[email protected]/~yan
Topics (Part 1)

• Introduction
• Programming on shared memory system (Chapter 7)
  – OpenMP
• Principles of parallel algorithm design (Chapter 3)
• Programming on shared memory system (Chapter 7)
  – Cilk/Cilkplus (?)
  – PThread, mutual exclusion, locks, synchronizations
• Analysis of parallel program executions (Chapter 5)
  – Performance Metrics for Parallel Systems
    • Execution Time, Overhead, Speedup, Efficiency, Cost
  – Scalability of Parallel Systems
  – Use of performance tools
Outline

• OpenMP Introduction
• Parallel Programming with OpenMP
  – OpenMP parallel region, and worksharing
  – OpenMP data environment, tasking and synchronization
• OpenMP Performance and Best Practices
• More Case Studies and Examples
• Reference Materials
What is OpenMP

• Standard API to write shared memory parallel applications in C, C++, and Fortran
  – Compiler directives, Runtime routines, Environment variables
• OpenMP Architecture Review Board (ARB)
  – Maintains the OpenMP specification
  – Permanent members: AMD, Cray, Fujitsu, HP, IBM, Intel, NEC, PGI, Oracle, Microsoft, Texas Instruments, NVIDIA, Convey
  – Auxiliary members: ANL, ASC/LLNL, cOMPunity, EPCC, LANL, NASA, TACC, RWTH Aachen University, UH
  – http://www.openmp.org
• Latest Version 4.5 released Nov 2015
My role with OpenMP
"Hello World" Example/1

  #include <stdlib.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
      printf("Hello World\n");
      return 0;
  }
"Hello World" - An Example/2

  #include <stdlib.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
      #pragma omp parallel
      {
          printf("Hello World\n");
      } // End of parallel region
      return 0;
  }
"Hello World" - An Example/3

  $ gcc -fopenmp hello.c
  $ export OMP_NUM_THREADS=2
  $ ./a.out
  Hello World
  Hello World
  $ export OMP_NUM_THREADS=4
  $ ./a.out
  Hello World
  Hello World
  Hello World
  Hello World
  $
OpenMP Components

Directives:
• Parallel region
• Worksharing constructs
• Tasking
• Offloading
• Affinity
• Error Handling
• SIMD
• Synchronization
• Data-sharing attributes

Runtime Environment:
• Number of threads
• Thread ID
• Dynamic thread adjustment
• Nested parallelism
• Schedule
• Active levels
• Thread limit
• Nesting level
• Ancestor thread
• Team size
• Locking
• Wallclock timer

Environment Variables:
• Number of threads
• Scheduling type
• Dynamic thread adjustment
• Nested parallelism
• Stack size
• Idle threads
• Active levels
• Thread limit
"Hello World" - An Example/3

  #include <stdlib.h>
  #include <stdio.h>
  #include <omp.h>

  int main(int argc, char *argv[])
  {
      #pragma omp parallel
      {
          int thread_id = omp_get_thread_num();
          int num_threads = omp_get_num_threads();
          printf("Hello World from thread %d of %d\n", thread_id, num_threads);
      }
      return 0;
  }

(#pragma omp parallel is a directive; omp_get_thread_num() and omp_get_num_threads() are runtime environment routines.)
"Hello World" - An Example/4

Environment Variable: similar to program arguments, an environment variable changes the configuration of the execution without recompiling the program.

NOTE: the order of print across threads is not deterministic.
The Design Principle Behind

• Each printf is a task
• A parallel region is to claim a set of cores for computation
  – Cores are presented as multiple threads
• Each thread executes a single task
  – Task id is the same as thread id: omp_get_thread_num()
  – Num_tasks is the same as the total number of threads: omp_get_num_threads()
• 1:1 mapping between task and thread
  – Every task/core does similar work in this simple example

  #pragma omp parallel
  {
      int thread_id = omp_get_thread_num();
      int num_threads = omp_get_num_threads();
      printf("Hello World from thread %d of %d\n", thread_id, num_threads);
  }
OpenMP Parallel Computing Solution Stack

• User layer: End User → Application
• Prog. layer (OpenMP API): Directives/Compiler, OpenMP library, Environment variables
• System layer: Runtime library → OS/system
OpenMP Syntax

• Most OpenMP constructs are compiler directives using pragmas.
  – For C and C++, the pragmas take the form: #pragma …
• pragma vs language
  – A pragma is not part of the language and should not express logic
  – It provides the compiler/preprocessor additional information on how to process directive-annotated code
  – Similar to #include, #define
OpenMP Syntax

• For C and C++, the pragmas take the form:
    #pragma omp construct [clause [clause]…]
• For Fortran, the directives take one of the forms:
  – Fixed form
      *$OMP construct [clause [clause]…]
      C$OMP construct [clause [clause]…]
  – Free form (but works for fixed form too)
      !$OMP construct [clause [clause]…]
• Include file and the OpenMP lib module
    #include <omp.h>
    use omp_lib
OpenMP Compiler

• OpenMP: thread programming at a "high level".
  – The user does not need to specify the details
    • Program decomposition, assignment of work to threads
    • Mapping tasks to hardware threads
• User makes strategic decisions
• Compiler figures out details
  – Compiler flags enable OpenMP (e.g. -openmp, -xopenmp, -fopenmp, -mp)
OpenMP Memory Model

• OpenMP assumes a shared memory
• Threads communicate by sharing variables.
• Synchronization protects data conflicts.
  – Synchronization is expensive.
    • Change how data is accessed to minimize the need for synchronization.
OpenMP Fork-Join Execution Model

• Master thread spawns multiple worker threads as needed; together they form a team
• Parallel region is a block of code executed by all threads in a team simultaneously

[Figure: the master thread forks worker threads at each parallel region and joins them at the region's end; a nested parallel region forks its own team.]
OpenMP Parallel Regions

• In C/C++: a block is a single statement or a group of statements between { }
• In Fortran: a block is a single statement or a group of statements between directive/end-directive pairs.

  C$OMP PARALLEL
  10  wrk(id) = garbage(id)
      res(id) = wrk(id)**2
      if(.not.conv(res(id))) goto 10
  C$OMP END PARALLEL

  C$OMP PARALLEL DO
      do i=1,N
        res(i) = bigComp(i)
      end do
  C$OMP END PARALLEL DO

  #pragma omp parallel
  {
      id = omp_get_thread_num();
      res[id] = lots_of_work(id);
  }

  #pragma omp parallel for
  for(i=0;i<N;i++) {
      res[i] = big_calc(i);
      A[i] = B[i] + res[i];
  }
Scope of OpenMP Region

A parallel region can span multiple source files.

bar.f (the lexical extent of the parallel region; its dynamic extent includes the lexical extent):

  C$OMP PARALLEL
        call whoami
  C$OMP END PARALLEL

foo.f (orphaned directives can appear outside a parallel construct):

  subroutine whoami
  external omp_get_thread_num
  integer iam, omp_get_thread_num
  iam = omp_get_thread_num()
  C$OMP CRITICAL
  print*,'Hello from ', iam
  C$OMP END CRITICAL
  return
  end
SPMD Program Models

• SPMD (Single Program, Multiple Data) for parallel regions
  – All threads of the parallel region execute the same code
  – Each thread has a unique ID
• Use the thread ID to diverge the execution of the threads
  – Different threads can follow different paths through the same code
• SPMD is by far the most commonly used pattern for structuring parallel programs
  – MPI, OpenMP, CUDA, etc.

  if (my_id == x) { } else { }
Modify the Hello World Program so…

• Only one thread prints the total number of threads
    gcc -fopenmp hello.c -o hello
• Only one thread reads the total number of threads and all threads print that info

  #pragma omp parallel
  {
      int thread_id = omp_get_thread_num();
      int num_threads = omp_get_num_threads();
      if (thread_id == 0)
          printf("Hello World from thread %d of %d\n", thread_id, num_threads);
      else
          printf("Hello World from thread %d\n", thread_id);
  }

  int num_threads = 99999;
  #pragma omp parallel
  {
      int thread_id = omp_get_thread_num();
      if (thread_id == 0)
          num_threads = omp_get_num_threads();
      #pragma omp barrier
      printf("Hello World from thread %d of %d\n", thread_id, num_threads);
  }
Barrier

  #pragma omp barrier
OpenMP Master

• Denotes a structured block executed by the master thread
• The other threads just skip it
  – no synchronization is implied

  #pragma omp parallel private (tmp)
  {
      do_many_things_together();
      #pragma omp master
      {
          exchange_boundaries_by_master_only();
      }
      #pragma omp barrier
      do_many_other_things_together();
  }
OpenMP Single

• Denotes a block of code that is executed by only one thread.
  – Could be the master
• A barrier is implied at the end of the single block.

  #pragma omp parallel private (tmp)
  {
      do_many_things_together();
      #pragma omp single
      {
          exchange_boundaries_by_one();
      }
      do_many_other_things_together();
  }
Using omp master/single to modify the Hello World Program so…

• Only one thread prints the total number of threads
• Only one thread reads the total number of threads and all threads print that info

  #pragma omp parallel
  {
      int thread_id = omp_get_thread_num();
      int num_threads = omp_get_num_threads();
      printf("Hello World from thread %d of %d\n", thread_id, num_threads);
  }
Distributing Work Based on Thread ID

Sequential code:

  for(i=0;i<N;i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region:

  #pragma omp parallel shared (a, b)
  {
      int id, i, Nthrds, istart, iend;
      id = omp_get_thread_num();
      Nthrds = omp_get_num_threads();
      istart = id * N / Nthrds;
      iend = (id+1) * N / Nthrds;
      for(i=istart;i<iend;i++) { a[i] = a[i] + b[i]; }
  }

  cat /proc/cpuinfo
Implementing axpy using OpenMP parallel
OpenMP Worksharing Constructs

• Divides the execution of the enclosed code region among the members of the team
• The "for" worksharing construct splits up loop iterations among threads in a team
  – Each thread gets one or more "chunks" -> loop chunking

  #pragma omp parallel
  #pragma omp for
  for(i=0;i<N;i++) {
      work(i);
  }

By default, there is a barrier at the end of the "omp for". Use the "nowait" clause to turn off the barrier:

  #pragma omp for nowait

"nowait" is useful between two consecutive, independent omp for loops.
Worksharing Constructs

Sequential code:

  for(i=0;i<N;i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region:

  #pragma omp parallel shared (a, b)
  {
      int id, i, Nthrds, istart, iend;
      id = omp_get_thread_num();
      Nthrds = omp_get_num_threads();
      istart = id * N / Nthrds;
      iend = (id+1) * N / Nthrds;
      for(i=istart;i<iend;i++) { a[i] = a[i] + b[i]; }
  }

OpenMP parallel region and a worksharing for construct:

  #pragma omp parallel shared (a, b) private (i)
  #pragma omp for schedule(static)
  for(i=0;i<N;i++) { a[i] = a[i] + b[i]; }
OpenMP schedule Clause

• schedule(static|dynamic|guided [, chunk])
• schedule(auto|runtime)

static: Distribute iterations in blocks of size "chunk" over the threads in a round-robin fashion
dynamic: Fixed portions of work; the size is controlled by the value of chunk; when a thread finishes, it starts on the next portion of work
guided: Same dynamic behavior as "dynamic", but the size of the portion of work decreases exponentially
auto: The compiler (or runtime system) decides what is best to use; the choice could be implementation dependent
runtime: Iteration scheduling scheme is set at runtime through the environment variable OMP_SCHEDULE
OpenMP Sections

• Worksharing construct
• Gives a different structured block to each thread

  #pragma omp parallel
  #pragma omp sections
  {
      #pragma omp section
      x_calculation();
      #pragma omp section
      y_calculation();
      #pragma omp section
      z_calculation();
  }

By default, there is a barrier at the end of the "omp sections". Use the "nowait" clause to turn off the barrier.
Loop Collapse

• Allows parallelization of perfectly nested loops without using nested parallelism
• The collapse clause on a for/do loop indicates how many loops should be collapsed

  !$omp parallel do collapse(2) ...
  do i = il, iu, is
    do j = jl, ju, js
      do k = kl, ku, ks
        .....
      end do
    end do
  end do
  !$omp end parallel do
Exercise: OpenMP Matrix Multiplication

• Parallel version
• Parallel for version
  – Experiment with different schedule policies and chunk sizes
    • #pragma omp parallel for
  – Experiment with collapse(2)

  #pragma omp parallel shared (a, b) private (i)
  #pragma omp for schedule(static)
  for(i=0;i<N;i++) { a[i] = a[i] + b[i]; }

  #pragma omp parallel for schedule(static) private (i) num_threads(num_ths)
  for(i=0;i<N;i++) { a[i] = a[i] + b[i]; }

  gcc -fopenmp mm.c -o mm
Barrier

• Barrier: Each thread waits until all threads arrive.

  #pragma omp parallel shared (A, B, C) private(id)
  {
      id = omp_get_thread_num();
      A[id] = big_calc1(id);
  #pragma omp barrier
  #pragma omp for
      for(i=0;i<N;i++){ C[i] = big_calc3(i, A); }  // implicit barrier at the end of a for worksharing construct
  #pragma omp for nowait
      for(i=0;i<N;i++){ B[i] = big_calc2(C, i); }  // no implicit barrier due to nowait
      A[id] = big_calc3(id);
  }  // implicit barrier at the end of a parallel region
Data Environment

• Most variables are shared by default
• Global variables are SHARED among threads
  – Fortran: COMMON blocks, SAVE variables, MODULE variables
  – C: File scope variables, static
• But not everything is shared...
  – Stack variables in sub-programs called from parallel regions are PRIVATE
  – Automatic variables defined inside the parallel region are PRIVATE.
OpenMP Data Environment

  double a[size][size], b = 4;
  #pragma omp parallel private (b)
  {
      ....
  }

shared data: a[size][size] is visible to all threads T0, T1, T2, T3
private data: each thread gets its own copy b' = ? (uninitialized)
b becomes undefined after the region
OpenMP Data Environment

  program sort
  common /input/ A(10)
  integer index(10)
  C$OMP PARALLEL
        call work (index)
  C$OMP END PARALLEL
  print*, index(1)

  subroutine work (index)
  common /input/ A(10)
  integer index(*)
  real temp(10)
  integer count
  save count
  …………

A, index and count are shared by all threads.
temp is local to each thread.
Data Environment: Changing storage attributes

• Selectively change storage attributes of constructs using the following clauses
  – SHARED
  – PRIVATE
  – FIRSTPRIVATE
  – THREADPRIVATE
• The value of a private inside a parallel loop and the global value outside the loop can be exchanged with
  – FIRSTPRIVATE, and LASTPRIVATE
• The default status can be modified with:
  – DEFAULT (PRIVATE | SHARED | NONE)
OpenMP Private Clause

• private(var) creates a local copy of var for each thread.
  – The value is uninitialized
  – The private copy is not storage-associated with the original
  – The original is undefined at the end

  IS = 0
  C$OMP PARALLEL DO PRIVATE(IS)
        DO J=1,1000
          IS = IS + J
        END DO
  C$OMP END PARALLEL DO
  print *, IS
OpenMPPrivateClause
• private(var)createsalocalcopyofvarforeachthread.– Thevalueisunini+alized– Privatecopyisnotstorage-associatedwiththeoriginal– Theoriginalisundefinedattheend
41
IS = 0 C$OMP PARALLEL DO PRIVATE(IS) DO J=1,1000
IS = IS + J END DO C$OMP END PARALLEL DO print *, IS
IS was not initialized
IS is undefined here
✗✗
Firstprivate Clause

• firstprivate is a special case of private.
  – Initializes each private copy with the corresponding value from the master thread.

  IS = 0
  C$OMP PARALLEL DO FIRSTPRIVATE(IS)
        DO 20 J=1,1000
          IS = IS + J
  20  CONTINUE
  C$OMP END PARALLEL DO
  print *, IS
Firstprivate Clause

• firstprivate is a special case of private.
  – Initializes each private copy with the corresponding value from the master thread.

  IS = 0                    ! ✔ each thread gets its own IS with an initial value of 0
  C$OMP PARALLEL DO FIRSTPRIVATE(IS)
        DO 20 J=1,1000
          IS = IS + J
  20  CONTINUE
  C$OMP END PARALLEL DO
  print *, IS               ! ✗ regardless of initialization, IS is undefined at this point
Lastprivate Clause

• Lastprivate passes the value of a private from the last iteration to the variable of the master thread

  IS = 0                    ! ✔ each thread gets its own IS with an initial value of 0
  C$OMP PARALLEL DO FIRSTPRIVATE(IS)
  C$OMP& LASTPRIVATE(IS)
        DO 20 J=1,1000
          IS = IS + J
  20  CONTINUE
  C$OMP END PARALLEL DO
  print *, IS               ! IS is defined as its value at the last iteration (i.e. for J=1000)

Is this code meaningful?
OpenMP Reduction

• Here is the correct way to parallelize this code.

  IS = 0
  C$OMP PARALLEL DO REDUCTION(+:IS)
        DO 20 J=1,1000
          IS = IS + J
  20  CONTINUE
  print *, IS

Reduction does not imply firstprivate — so where does the initial 0 come from? (Each private copy starts from the operator's identity value.)
Reduction operands/initial-values

• Associative operands can be used with reduction
• Initial values are the ones that make sense mathematically

  Operand   Initial value
  +         0
  *         1
  -         0
  .AND.     all 1's
  .OR.      0
  MAX       most negative number
  MIN       largest positive number
  //        all 0's
Exercise: OpenMP Sum.c

• Two versions
  – Parallel for with reduction
  – Parallel version, not using "omp for" or the "reduction" clause
OpenMP Threadprivate

• Makes global data private to a thread, thus crossing the parallel region boundary
  – Fortran: COMMON blocks
  – C: File scope and static variables
• Different from making them PRIVATE
  – With PRIVATE, global variables are masked.
  – THREADPRIVATE preserves global scope within each thread
• Threadprivate variables can be initialized using COPYIN or by using DATA statements.
Threadprivate/copyin

• You initialize threadprivate data using a copyin clause.

  parameter (N=1000)
  common/buf/A(N)
  C$OMP THREADPRIVATE(/buf/)

  C Initialize the A array
  call init_data(N,A)

  C$OMP PARALLEL COPYIN(A)
  … Now each thread sees threadprivate array A initialized
  … to the global value set in the subroutine init_data()
  C$OMP END PARALLEL
  ....
  C$OMP PARALLEL
  ... Values of threadprivate are persistent across parallel regions
  C$OMP END PARALLEL
OpenMP Synchronization

• High level synchronization:
  – critical section
  – atomic
  – barrier
  – ordered
• Low level synchronization
  – flush
  – locks (both simple and nested)
Critical section

• Only one thread at a time can enter a critical section.

  C$OMP PARALLEL DO PRIVATE(B)
  C$OMP& SHARED(RES)
        DO 100 I=1,NITERS
          B = DOIT(I)
  C$OMP CRITICAL
          CALL CONSUME (B, RES)
  C$OMP END CRITICAL
  100 CONTINUE
  C$OMP END PARALLEL DO
Atomic

• Atomic is a special case of a critical section that can be used for certain simple statements
• It applies only to the update of a memory location

  C$OMP PARALLEL PRIVATE(B)
        B = DOIT(I)
        tmp = big_ugly()
  C$OMP ATOMIC
        X = X + tmp
  C$OMP END PARALLEL
OpenMP Tasks

Define a task:
– C/C++: #pragma omp task
– Fortran: !$omp task

• A task is generated when a thread encounters a task construct
  – Contains a task region and its data environment
  – Tasks can be nested
• A task region is a region consisting of all code encountered during the execution of a task.
• The data environment consists of all the variables associated with the execution of a given task.
  – It is constructed when the task is generated.
Task completion and synchronization

• Task completion occurs when the task reaches the end of the task region code
• Multiple tasks can be joined to complete through the use of task synchronization constructs
  – taskwait
  – barrier construct
• taskwait constructs:
  – #pragma omp taskwait
  – !$omp taskwait

  int fib(int n)
  {
      int x, y;
      if (n < 2) return n;
      else {
          #pragma omp task shared(x)
          x = fib(n-1);
          #pragma omp task shared(y)
          y = fib(n-2);
          #pragma omp taskwait
          return x + y;
      }
  }
Example: A Linked List

(from "An Overview of OpenMP", RvdP/V1 Tutorial, IWOMP 2010 – CCS Un. of Tsukuba, June 14, 2010)

  ........
  while(my_pointer) {
      (void) do_independent_work (my_pointer);
      my_pointer = my_pointer->next ;
  } // End of while loop
  ........
Example: A Linked List With Tasking

  my_pointer = listhead;
  #pragma omp parallel
  {
      #pragma omp single nowait
      {
          while(my_pointer) {
              #pragma omp task firstprivate(my_pointer)
              {
                  (void) do_independent_work (my_pointer);
              }
              my_pointer = my_pointer->next ;
          }
      } // End of single - no implied barrier (nowait)
  } // End of parallel region - implied barrier

The OpenMP task is specified here (executed in parallel).
Ordered

• The ordered construct enforces the sequential order for a block.

  #pragma omp parallel private (tmp)
  #pragma omp for ordered
  for (i=0;i<N;i++){
      tmp = NEAT_STUFF_IN_PARALLEL(i);
      #pragma omp ordered
      res += consum(tmp);
  }
![Page 58: Lecture 04-06: Programming with OpenMP - GitHub Pages · Lecture 04-06: Programming with OpenMP Concurrent and Mul](https://reader034.fdocuments.net/reader034/viewer/2022050116/5f86f2b9fe009c4d6047e04b/html5/thumbnails/58.jpg)
OpenMP Synchronization

• The flush construct denotes a sequence point where a thread tries to create a consistent view of memory.
  – All memory operations (both reads and writes) defined prior to the sequence point must complete.
  – All memory operations (both reads and writes) defined after the sequence point must follow the flush.
  – Variables in registers or write buffers must be updated in memory.
• Arguments to flush specify which variables are flushed. No arguments specifies that all thread-visible variables are flushed.

58
A flush example

59

• Pair-wise synchronization.
      integer ISYNC(NUM_THREADS)
C$OMP PARALLEL DEFAULT(PRIVATE) SHARED(ISYNC)
      IAM = OMP_GET_THREAD_NUM()
      ISYNC(IAM) = 0
C$OMP BARRIER
      CALL WORK()
      ISYNC(IAM) = 1   ! I'm all done; signal this to other threads
C$OMP FLUSH(ISYNC)
      DO WHILE (ISYNC(NEIGH) .EQ. 0)
C$OMP FLUSH(ISYNC)
      END DO
C$OMP END PARALLEL

The first flush makes sure other threads can see my write. The flush inside the loop makes sure the read picks up a good copy from memory.

Note: flush is analogous to a fence in other shared memory APIs.
OpenMP Lock routines

• Simple lock routines: a simple lock is available if it is unset.
  – omp_init_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock(), omp_destroy_lock()
• Nested locks: available if it is unset, or if it is set but owned by the thread executing the nested lock function.
  – omp_init_nest_lock(), omp_set_nest_lock(), omp_unset_nest_lock(), omp_test_nest_lock(), omp_destroy_nest_lock()

60
OpenMP Locks

• Protect resources with locks.

61

omp_lock_t lck;
omp_init_lock(&lck);
#pragma omp parallel private(tmp, id)
{
  id = omp_get_thread_num();
  tmp = do_lots_of_work(id);
  omp_set_lock(&lck);       // Wait here for your turn.
  printf("%d %d", id, tmp);
  omp_unset_lock(&lck);     // Release the lock so the next thread gets a turn.
}
omp_destroy_lock(&lck);     // Free up storage when done.
OpenMP Library Routines

• Modify/check the number of threads
  – omp_set_num_threads(), omp_get_num_threads(), omp_get_thread_num(), omp_get_max_threads()
• Are we in a parallel region?
  – omp_in_parallel()
• How many processors in the system?
  – omp_get_num_procs()

62
OpenMP Environment Variables

• Set the default number of threads to use.
  – OMP_NUM_THREADS int_literal
• Control how "omp for schedule(RUNTIME)" loop iterations are scheduled.
  – OMP_SCHEDULE "schedule[,chunk_size]"

63
Outline

• OpenMP Introduction
• Parallel Programming with OpenMP
  – Worksharing, tasks, data environment, synchronization
• OpenMP Performance and Best Practices
• Case Studies and Examples
• Reference Materials

64
OpenMP Performance

• The relative ease of using OpenMP is a mixed blessing.
• We can quickly write a correct OpenMP program, but without the desired level of performance.
• There are certain "best practices" to avoid common performance problems.
• Extra work is needed to program with large thread counts.

65
Typical OpenMP Performance Issues

66

• Overheads of OpenMP constructs and thread management. E.g.:
  – dynamic loop schedules have much higher overheads than static schedules
  – synchronization is expensive; use NOWAIT if possible
  – large parallel regions help reduce overheads and enable better cache usage and standard optimizations
• Overheads of runtime library routines
  – some are called frequently
• Load balance
• Cache utilization and false sharing
Overheads of OpenMP Directives

[Chart: OpenMP overheads, EPCC microbenchmarks, SGI Altix 3600. Overhead in cycles (up to ~1,400,000) for 1 to 256 threads, for the PARALLEL, FOR, PARALLEL FOR, BARRIER, SINGLE, CRITICAL, LOCK/UNLOCK, ORDERED, ATOMIC, and REDUCTION constructs.]

67
OpenMP Best Practices

• Reduce usage of barriers with the nowait clause.

#pragma omp parallel
{
  #pragma omp for
  for (i = 0; i < n; i++)
    ....
  #pragma omp for nowait
  for (i = 0; i < n; i++)
    ....
}

68
OpenMP Best Practices

#pragma omp parallel private(i)
{
  #pragma omp for nowait
  for (i = 0; i < n; i++)
    a[i] += b[i];
  #pragma omp for nowait
  for (i = 0; i < n; i++)
    c[i] += d[i];
  #pragma omp barrier
  #pragma omp for nowait reduction(+:sum)
  for (i = 0; i < n; i++)
    sum += a[i] + c[i];
}

69
OpenMP Best Practices

• Avoid large ordered constructs.
• Avoid large critical regions.

#pragma omp parallel shared(a, b) private(c, d)
{
  ....
  #pragma omp critical
  {
    a += 2 * c;
    c = d * d;   // Move this statement out of the critical region.
  }
}

70
OpenMP Best Practices

• Maximize parallel regions.

Multiple parallel regions with sequential code in between:

#pragma omp parallel
{
  #pragma omp for
  for (...) { /* Work-sharing loop 1 */ }
}
opt = opt + N;  // sequential
#pragma omp parallel
{
  #pragma omp for
  for (...) { /* Work-sharing loop 2 */ }
  #pragma omp for
  for (...) { /* Work-sharing loop N */ }
}

One combined parallel region:

#pragma omp parallel
{
  #pragma omp for
  for (...) { /* Work-sharing loop 1 */ }
  #pragma omp single nowait
  opt = opt + N;  // sequential
  #pragma omp for
  for (...) { /* Work-sharing loop 2 */ }
  #pragma omp for
  for (...) { /* Work-sharing loop N */ }
}

71
OpenMP Best Practices

• Use a single parallel region enclosing all work-sharing loops.

for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    #pragma omp parallel for private(k)
    for (k = 0; k < n; k++) { ...... }

#pragma omp parallel private(i, j, k)
{
  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      #pragma omp for
      for (k = 0; k < n; k++) { ....... }
}

72
OpenMP Best Practices

73

• Address load imbalances.
• Use dynamic schedules and different chunk sizes for parallel loops.

Smith-Waterman Sequence Alignment Algorithm
OpenMP Best Practices

• Smith-Waterman Algorithm
  – The default schedule is static with even chunks → load imbalance.

74

#pragma omp for
for (...)
  for (...)
    for (...)
      for (...) { /* compute alignments */ }
#pragma omp critical
{
  . /* compute scores */
}
OpenMP Best Practices

Smith-Waterman Sequence Alignment Algorithm

[Charts: speedup vs. threads (2-128) for problem sizes 100, 600, 1000, and the ideal line; left with the default "#pragma omp for", right with "#pragma omp for schedule(dynamic, 1)".]

With the dynamic schedule, 128 threads run at 80% efficiency.

75
OpenMP Best Practices

[Charts: overheads (in cycles) of OpenMP for on an SGI Altix 3600, for 1 to 256 threads; static scheduling with the default chunk and chunk sizes 2, 8, 32, 128 (up to ~80,000 cycles), and dynamic scheduling with chunk sizes 1, 4, 16, 64 (up to ~14,000,000 cycles).]

• Address load imbalances by selecting the best schedule and chunk size.
• Avoid selecting a small chunk size when the work in a chunk is small.

76
OpenMP Best Practices

• Pipeline processing to overlap I/O and computation.

for (i = 0; i < N; i++) {
  ReadFromFile(i, ...);
  for (j = 0; j < ProcessingNum; j++)
    ProcessData(i, j);
  WriteResultsToFile(i);
}

77
OpenMP Best Practices

• Pipeline processing
• Pre-fetches I/O
• Threads reading or writing files join the computation

#pragma omp parallel
{
  #pragma omp single
  { ReadFromFile(0, ...); }
  for (i = 0; i < N; i++) {
    #pragma omp single nowait
    { if (i < N - 1) ReadFromFile(i + 1, ...); }   // Guard for dealing with the last file
    #pragma omp for schedule(dynamic)
    for (j = 0; j < ProcessingNum; j++)
      ProcessChunkOfData(i, j);
    // The implicit barrier here is very important: 1) file i is finished,
    // so we can write it out; 2) file i+1 has been read in, so we can
    // process it in the next loop iteration.
    #pragma omp single nowait
    { WriteResultsToFile(i); }
  }
}

78
OpenMP Best Practices

• single vs. master work-sharing
  – master is more efficient but requires thread 0 to be available
  – single is more efficient if the master thread is not available
  – single has an implicit barrier

79
Cache Coherence

• Real-world shared memory systems have caches between memory and CPU.
• Copies of a single data item can exist in multiple caches.
• Modification of a shared data item by one CPU leads to outdated copies in the cache of another CPU.

[Figure: memory holds the original data item; CPU 0 and CPU 1 each hold a copy in their caches.]
OpenMP Best Practices

• False sharing
  – Occurs when at least one thread writes to a cache line while others access it.
    • Thread 0: = A[1] (read)
    • Thread 1: A[0] = ... (write)
• Solution: use array padding

int a[max_threads];
#pragma omp parallel for schedule(static,1)
for (int i = 0; i < max_threads; i++)
  a[i] += i;

int a[max_threads][cache_line_size];
#pragma omp parallel for schedule(static,1)
for (int i = 0; i < max_threads; i++)
  a[i][0] += i;
Getting OpenMP Up To Speed
RvdP/V1 Tutorial IWOMP 2010 – CCS Un. of Tsukuba, June 14, 2010

False Sharing

[Figure: CPUs, caches, and memory; threads T0 and T1 access the same cache line holding A. A store into a shared cache line invalidates the other copies of that line: the system is not able to distinguish between changes within one individual line.]

81
Exercise: Feel the false sharing with axpy-papi.c

82
OpenMP Best Practices

• Data placement policy on NUMA architectures
• First Touch Policy
  – The process that first touches a page of memory causes that page to be allocated in the node on which the process is running.

[Figure: a generic cc-NUMA architecture.]

83
NUMA First-touch placement/1

84

int a[100];   // Only reserves the VM address range

for (i = 0; i < 100; i++)
  a[i] = 0;

First touch: all array elements (a[0] .. a[99]) are in the memory of the processor executing this thread.
NUMA First-touch placement/2

85

#pragma omp parallel for num_threads(2)
for (i = 0; i < 100; i++)
  a[i] = 0;

First touch: both memories each have "their half" of the array (a[0] .. a[49] and a[50] .. a[99]).
OpenMP Best Practices

• First-touch in practice
  – Initialize data consistently with the computations.

#pragma omp parallel for
for (i = 0; i < N; i++) {
  a[i] = 0.0; b[i] = 0.0; c[i] = 0.0;
}
readfile(a, b, c);
#pragma omp parallel for
for (i = 0; i < N; i++) {
  a[i] = b[i] + c[i];
}

86
OpenMP Best Practices

• Privatize variables as much as possible.
  – Private variables are stored in the stack local to the thread.
  – Private data stays close in the cache.

double a[MaxThreads][N][N];
#pragma omp parallel for
for (i = 0; i < MaxThreads; i++) {
  for (int j ...)
    for (int k ...)
      a[i][j][k] = ...
}

double a[N][N];
#pragma omp parallel private(a)
{
  for (int j ...)
    for (int k ...)
      a[j][k] = ...
}

87
OpenMP Best Practices

• CFD application pseudo-code
  – Shared arrays are initialized incorrectly (first touch policy).
  – Delays in remote memory accesses are probably caused by saturation of the interconnect.

procedure diff_coeff() {
  array allocation by master thread
  initialization of shared arrays
  PARALLEL REGION {
    loop lower_bn[id], upper_bn[id]
    computation on shared arrays
    .....
  }
}

88
OpenMP Best Practices

• Array privatization
  – Improved the performance of the whole program by 30%.
  – Speedup of 10 for the procedure; it now takes only 5% of total time.
• Processor stalls are reduced significantly.

[Chart: stall cycle breakdown for non-privatized (NP) and privatized (P) versions of diff_coeff; D-cache stalls, branch misprediction, instruction miss stalls, FLP units, and front-end flushes, in cycles up to 5.00E+10.]

89
OpenMP Best Practices

• Avoid thread migration
  – It affects data locality.
• Bind threads to cores.
• Linux:
  – numactl --cpubind=0 foobar
  – taskset -c 0,1 foobar
• SGI Altix:
  – dplace -x2 foobar

90
OpenMP Sources of Errors

• Incorrect use of synchronization constructs
  – Less likely if users stick to directives
  – Erroneous use of NOWAIT
• Race conditions (true sharing)
  – Can be very hard to find
• Wrong "spelling" of the sentinel
• Use tools to check for data races.

91
Outline

• OpenMP Introduction
• Parallel Programming with OpenMP
  – Worksharing, tasks, data environment, synchronization
• OpenMP Performance and Best Practices
• Hybrid MPI/OpenMP
• Case Studies and Examples
• Reference Materials

92
Matrix vector multiplication

93

The Sequential Source

[Figure: result vector = matrix * vector, indexed by i (rows) and j (columns).]

The OpenMP Source

[Figure: the same i, j decomposition, with rows distributed over threads.]
Performance - 2-socket Nehalem

94

[Chart: matrix-vector performance on a 2-socket Nehalem system.]

A Two Socket Nehalem System

95

[Figure: block diagram of the two-socket Nehalem system.]
Data initialization

96

[Figure: result = matrix * vector, indexed by i and j.]

Initialization will cause the allocation of memory according to the first touch policy.
Exploit First Touch

97

[Chart: performance after initializing the data in parallel to exploit first touch.]
A 3D matrix update

98

Observation: there is no data dependency on 'I'. Therefore we can split the 3D matrix into larger blocks and process these in parallel.

[Figure: 3D matrix with dimensions I, J, K.]
The idea

99

[Figure: the K-J plane is swept sequentially; the I dimension is split into per-thread chunks bounded by IS and IE.]

We need to distribute the M iterations over the number of processors. We do this by controlling the start (IS) and end (IE) value of the inner loop. Each thread will calculate these values for its portion of the work.
A 3D matrix update

100

The loops are correctly nested for serial performance. Due to a data dependency on J and K, only the inner loop can be parallelized. This will cause the barrier to be executed (N-1)^2 times.

[Figure: data dependency graph; element (j, k) depends on (j-1, k) and (j, k-1), while the I dimension is independent.]
The performance

101

Dimensions: M=7,500, N=20. Footprint: ~24 MByte.

[Chart: performance (Mflop/s) vs. number of threads. The inner loop over I has been parallelized; scaling is very poor, as is to be expected.]
Performance analyzer data

102

[Profiles using 10 and 20 threads: some routines scale somewhat, others do not scale at all.]

Question: why is __mt_WaitForWork so high in the profile?
False sharing at work

103

[Chart: performance for P=1, 2, 4, 8; with no sharing the code scales, but false sharing increases as we increase the number of threads. This is false sharing at work!]
Performance compared

104

[Chart: performance (Mflop/s) vs. number of threads for M=7,500 and M=75,000. For a higher value of M, the program scales better.]
The first implementation

105

[Code figure: the first, manually chunked implementation.]
OpenMP version

106

Another idea: use OpenMP!

[Code figure: the OpenMP parallel region and work-sharing version.]
How this works

107

[Figure: on 2 threads, parallel regions and work-sharing constructs alternate; this splits the operation in a way that is similar to our manual implementation.]
Performance

108

We have set M=7,500, N=20. This problem size does not scale at all when we explicitly parallelize the inner loop over 'I'. We have tested 4 versions of this program:
• Inner Loop Over 'I' - our first OpenMP version
• AutoPar - the automatically parallelized version of 'kernel'
• OMP_Chunks - the manually parallelized version with our explicit calculation of the chunks
• OMP_DO - the version with the OpenMP parallel region and work-sharing DO
Performance

109

The performance (M=7,500). Dimensions: M=7,500, N=20. Footprint: ~24 MByte.

[Chart: performance (Mflop/s) vs. number of threads for OMP DO, Inner loop, and OMP Chunks. The auto-parallelizing compiler does really well!]
Reference Material on OpenMP

110

• OpenMP Homepage www.openmp.org:
  – The primary source of information about OpenMP and its development.
• OpenMP User's Group (cOMPunity) Homepage
  – www.compunity.org
• Books:
  – Using OpenMP, Barbara Chapman, Gabriele Jost, Ruud van der Pas, Cambridge, MA: The MIT Press, 2007, ISBN: 978-0-262-53302-7
  – Parallel Programming in OpenMP, Rohit Chandra, San Francisco, Calif.: Morgan Kaufmann; London: Harcourt, 2000, ISBN: 1558606718
Standard OpenMP Implementation

• Directives are implemented via code modification and insertion of runtime library calls.
  – The basic step is outlining of the code in a parallel region.
• The runtime library is responsible for managing threads.
  – Scheduling loops
  – Scheduling tasks
  – Implementing synchronization
• The implementation effort is reasonable.

OpenMP Code Translation

int main(void) {
  int a, b, c;
  #pragma omp parallel private(c)
  do_sth(a, b, c);
  return 0;
}

_INT32 main() {
  int a, b, c;
  /* microtask */
  void __ompregion_main1() {
    _INT32 __mplocal_c;
    /* shared variables are kept intact; accesses to the private
       variable are substituted */
    do_sth(a, b, __mplocal_c);
  }
  ...
  /* OpenMP runtime calls */
  __ompc_fork(&__ompregion_main1);
  ...
}

Each compiler has custom runtime support. The quality of the runtime system has a major impact on performance.