Parallel Computing with OpenMP
Parallel Computing with OpenMP
Yutaka Masuda
University of Georgia
Summer Course May 18, 2018
Computing cores
• Why don’t you use multiple cores for your computations?
= This is what OpenMP will do.
Multiple computers
• Why don’t you use multiple computers working together?
= This is a cluster: outside the scope of OpenMP; this is the domain of MPI.
https://en.wikipedia.org/wiki/File:MEGWARE.CLIC.jpg
Methods in parallel computing
• OpenMP
  • Shared memory (single computer)
  • A set of directives
  • Focuses on parallelizing loops = limited purpose
  • Automatic management by the runtime = easier to program
• MPI (Message Passing Interface)
  • Distributed memory (clusters)
  • A collection of subroutines
  • Supports any kind of parallel computing = flexible
  • Manual control of data flow & management = complicated
From www.comsol.com
Sum of 1 to 100
• OpenMP version:

```fortran
program sumomp
  implicit none
  integer :: i
  real :: s = 0.0
  !$omp parallel private(i) &
  !$omp reduction(+:s)
  !$omp do
  do i=1,100
     s = s + i
  end do
  !$omp end do
  !$omp end parallel
  print *,s
end program sumomp
```

• MPI version:

```fortran
program summpi
  use mpi
  implicit none
  integer :: i,myid,nprocs,io
  real :: s,local_s
  call mpi_init(io)
  call mpi_comm_rank(MPI_COMM_WORLD,myid,io)
  call mpi_comm_size(MPI_COMM_WORLD,nprocs,io)
  local_s = 0.0
  do i=myid+1,100,nprocs
     local_s = local_s + i
  end do
  call mpi_reduce(local_s,s,1,MPI_REAL, &
                  MPI_SUM,0,MPI_COMM_WORLD,io)
  if(myid==0) then
     print *,s
  end if
  call mpi_finalize(io)
end program summpi
```
Cores and threads
• Cores: physical hardware within CPU (central processing unit)
• Threads: tasks defined in your software
• The operating system assigns tasks to cores by “time slicing”.
  • The system switches the task running on a core within a short period of time.
• The number of threads can be greater than the number of cores.
https://www.intel.com/content/www/us/en/products/processors/core/i5-processors/i5-8400.html
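The core/thread counts can be queried from a program. A minimal sketch (not from the slides) using the standard omp_lib routines omp_get_num_procs and omp_get_max_threads, assuming the code is compiled with an OpenMP option:

```fortran
program cores_and_threads
   !$ use omp_lib
   implicit none
   integer :: nprocs, nthreads
   nprocs = 1      ! fallbacks used when compiled without OpenMP
   nthreads = 1
   !$ nprocs = omp_get_num_procs()       ! logical processors (cores) available
   !$ nthreads = omp_get_max_threads()   ! threads OpenMP would use by default
   print *, 'processors:', nprocs, '  default threads:', nthreads
end program cores_and_threads
```

On most systems the two numbers match unless OMP_NUM_THREADS (or a call in the program) limits the default.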
OpenMP creates threads
From Wikipedia
Regular (sequential) program:
Parallel program:
Fork-Join model
From Wikipedia
(Figure: Fork → Parallel Region 1 → Join → Fork → Parallel Region 2 → Join → Fork → Parallel Region 3 → Join)
Program structure with OpenMP
Sequential:

```fortran
do i=1,10
   x(i)=sqrt(x(i))
end do
```

With OpenMP:

```fortran
!$omp parallel
!$omp do
do i=1,10
   x(i)=sqrt(x(i))
end do
!$omp end do
!$omp end parallel
```
(Figure: sequentially, iterations i=1 to i=10 run one after another; using 3 threads, the ten iterations are divided among the threads and run concurrently.)
OpenMP directives
• A directive must begin with the keyword !$omp.
• The directives take effect only if you supply a compiler option.
• Otherwise, the directives are ignored (because they look like comments).
• An OpenMP region must be enclosed between !$omp … and !$omp end … directives.
```fortran
!$omp parallel
!$omp do
do i=1,10
   x(i)=sqrt(x(i))
end do
!$omp end do
!$omp end parallel
```
OpenMP directives (cont’d)
• Each directive can take optional clauses: variable types, number of threads, conditional execution, etc.
• A statement starting with !$ will be compiled only when OpenMP is enabled (conditional compilation).
```fortran
!$omp parallel private(i) shared(x)
!$omp do
do i=1,10
   x(i)=sqrt(x(i))
end do
!$omp end do
!$omp end parallel

!$ print *,'OpenMP is active!'
```
Compiler options
• Depends on the compiler
  • Intel Fortran Compiler (ifort): -qopenmp (or -fopenmp)
• GFortran: -fopenmp
• Absoft: -openmp
• NAG: -openmp
• PGI: -mp
• Examples
  • ifort -qopenmp prog.f90
• gfortran -fopenmp prog.f90
Runtime options (more on this later)
• Specifying the number of threads (on Linux):

```shell
OMP_NUM_THREADS=2 ./a.out    # runs with 2 threads
```

• Default behavior:
  • If OMP_NUM_THREADS is set, it is followed.
  • If a limit is set within the program, it is followed.
  • Otherwise, as many threads as available on your computer are used.
Directive: parallel
```fortran
!$omp parallel
print *,'Hi!'
!$omp end parallel
```
• Defines a parallel region and assigns the task to each thread.
  • The region is executed by multiple threads.
  • The number of threads can be controlled by an optional clause, supplemental functions, or an environment variable.
Each thread executes print *,'Hi!', so the output with 3 threads is:

Hi!
Hi!
Hi!
Directive: do
• Performs the do-loop with multiple threads.
  • The !$omp do directive must be placed just before a do-loop.
  • The directive must be enclosed in a parallel region.
  • The loop counter is not necessarily incremented in order.
  • The counter i is treated as a separate variable in each thread (a private variable).
```fortran
!$omp parallel
!$omp do
do i=1,10
   x(i)=sqrt(x(i))
end do
!$omp end do
!$omp end parallel
```
(Figure: iterations i=1 to i=10 distributed across three threads, not in sequential order.)
Shared variable by default
```fortran
! compute parent average (PA)
!$omp parallel
!$omp do
do i=1,n
   s=sire(i)
   d=dam(i)
   pa(i)=(ebv(s)+ebv(d))/2.0
end do
!$omp end do
!$omp end parallel
```
(Figure: with two threads, iterations i=1 and i=2 run concurrently; n, sire(:), dam(:), pa(:), and ebv(:) are shared variables, the counter i is private, but s and d are also shared.)
• All threads share the variables s and d.
• One thread rewrites these variables while another thread is still reading them!
Private and shared variables

```fortran
! compute parent average (PA)
!$omp parallel private(i,s,d) &
!$omp shared(n,sire,dam,ebv,pa)
!$omp do
do i=1,n
   s=sire(i)
   d=dam(i)
   pa(i)=(ebv(s)+ebv(d))/2.0
end do
!$omp end do
!$omp end parallel
```

• Each thread now has its own variables s and d, so there is no competition any more.
(Figure: the same two threads, but each thread now has its own private copies of i, s, and d; the arrays remain shared.)
Clause: shared and private
• Define variable types.
  • Use the private() and shared() clauses on the parallel directive.
  • A private variable is created separately for each thread.
  • A shared variable is shared (and may be rewritten) by all threads.
  • Variables are shared by default, except loop counters.
  • Always declare the variable types to avoid bugs.
```fortran
! compute parent average (PA)
!$omp parallel private(i,s,d) &
!$omp shared(n,sire,dam,ebv,pa)
!$omp do
do i=1,n
   s=sire(i)
   d=dam(i)
   pa(i)=(ebv(s)+ebv(d))/2.0
end do
!$omp end do
!$omp end parallel
```
Clause: reduction
```fortran
known=0
!$omp parallel private(i,s,d) &
!$omp shared(n,sire,dam,ebv,pa) &
!$omp reduction(+:known)
!$omp do
do i=1,n
   s=sire(i)
   d=dam(i)
   pa(i)=(ebv(s)+ebv(d))/2.0
   if(s/=0 .and. d/=0) known=known+1
end do
!$omp end do
!$omp end parallel
```
• Specifies variables for “reduction” operations.
  • The variable known is treated as private within each thread.
  • At the end of the loop, each thread adds its private known to the global known.
  • Operations other than + are available: *, max, min, etc.
Clause: if
```fortran
known=0
!$omp parallel private(i,s,d) &
!$omp shared(n,sire,dam,ebv,pa) &
!$omp reduction(+:known) &
!$omp if(n>100000)
!$omp do
do i=1,n
   s=sire(i)
   d=dam(i)
   pa(i)=(ebv(s)+ebv(d))/2.0
   if(s/=0 .and. d/=0) known=known+1
end do
!$omp end do
!$omp end parallel
```
• Conditional use of OpenMP
  • If the condition is true, OpenMP is used in the parallel region.
  • If not, the OpenMP directives in this region are ignored (i.e., single-threaded execution).
Built-in functions/subroutines
• Built-in functions/subroutines for OpenMP are defined in the module omp_lib.
  • Recommendation: always reference this module as !$ use omp_lib, because the module is available only when the compiler option is given.
  • See the textbook or openmp.org for details.
use omp_lib
or
!$ use omp_lib
Built-in function: timing
• The OpenMP function omp_get_wtime() returns the wall-clock time.

```fortran
!$ use omp_lib
integer,parameter :: r8=selected_real_kind(15,300)
real(r8) :: tic,toc
...
!$ tic=omp_get_wtime()
!$omp parallel
!$omp do
do
   ...
end do
!$omp end do
!$omp end parallel
!$ toc=omp_get_wtime()
!$ print *,'running time=',toc-tic
```
The number of threads
• The default number of threads is the maximum number available on your system.
• A parallel program will be slow if …
  • you separately run another parallel program and each program tries to use the maximum number of threads.
• Three different ways to change the number of threads:
  1. Region-specific configuration (a clause on the parallel directive)
  2. Program-specific configuration (a built-in subroutine)
  3. Run-time configuration (an environment variable)
Approach 1
```fortran
integer :: n
n = 2
!$omp parallel num_threads(n)
!$omp do
do
   ...
end do
!$omp end do
!$omp end parallel
```

• Use of the num_threads clause.
  • This is a region-specific configuration.
Approach 2
```fortran
!$ use omp_lib
integer :: n
n = 2
!$ call omp_set_num_threads(n)
!$omp parallel
!$omp do
do
   ...
end do
!$omp end do
!$omp end parallel
```

• Use of the built-in subroutine omp_set_num_threads.
  • It changes the default number of threads for the program.
  • It affects all subsequent parallel regions that have no num_threads clause (see the OpenMP specifications).
Approach 3
Linux and Mac OS X:

```shell
export OMP_NUM_THREADS=4
```

or

```shell
OMP_NUM_THREADS=4 ./a.out
```

Windows (Command Prompt):

```shell
set OMP_NUM_THREADS=4
```

• Use of the environment variable OMP_NUM_THREADS.
  • You don’t have to change the program; you just change the system variable.
  • On Linux and Mac OS X, the variable is effective only in the current session; write it in your Bash profile to make it permanent.
  • On Windows, you can also set the variable in the computer’s system properties.
OpenMP is not perfect.
• A task should be split into several independent jobs.
  • Not directly applicable if there are data dependencies:

```fortran
do i=3,n
   x(i)=x(i-1)+x(i-2)
end do
```

• Hard to parallelize: Gauss-Seidel iterations, MCMC, and so on.
  • Modify the algorithm to create independent tasks.
  • Is it really worth doing?
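One common way to create independent tasks, sketched here purely as an illustration (not from the slides): replace a Gauss-Seidel-style update, where each x(i) depends on the freshly updated x(i-1), with a Jacobi-style update that reads only values from the previous sweep. The iterations become independent, typically at the cost of slower convergence.

```fortran
program jacobi_sketch
   implicit none
   integer, parameter :: n = 1000   ! illustrative problem size
   integer :: i
   real :: xold(n), xnew(n), b(n)
   xold = 0.0;  xnew = 0.0;  b = 1.0
   ! Jacobi-style sweep: each iteration reads only xold,
   ! so the iterations are independent and can run in parallel.
   !$omp parallel do private(i) shared(xold,xnew,b)
   do i=2,n-1
      xnew(i) = (b(i) + xold(i-1) + xold(i+1)) / 2.0
   end do
   !$omp end parallel do
   print *, xnew(2)
end program jacobi_sketch
```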
Why is my OpenMP program not so fast?
• Thread management is automatic but costly.
  • Possibility: management overhead ≥ advantage of threading.
• OpenMP is useful only when the overhead can be ignored, e.g., heavy computations repeated many times.

Light loop (0.20 sec with 4 threads vs 0.50 sec without):

```fortran
!$omp parallel private(i) &
!$omp reduction(+:s)
!$omp do
do i=1,1000000000
   s = s + 1/real(i)
end do
!$omp end do
!$omp end parallel
```

Heavy loop (11.9 sec with 4 threads vs 46.7 sec without):

```fortran
!$omp parallel private(i) &
!$omp reduction(+:s)
!$omp do
do i=1,1000000000
   s = s + log(abs(sin((sqrt(1/dble(i)))**0.6)))
end do
!$omp end do
!$omp end parallel
```
Why is my OpenMP program so slow?
• Many other jobs use the computing cores.
  • Possibility: too many multi-threaded programs in the background, or too many threads consumed by a few other programs.
  • Best practice: limit the number of threads for each program.
(Figure: bar chart of wall-clock time (sec) in one round of AI REML with YAMS, for 1, 2, 4, 8, and 16 threads; vertical axis 0 to 400.)
BLUPF90 programs and parallelization
• Relies on parallel libraries and modules
  • The genomic module depends on Intel MKL, i.e., optimized BLAS & LAPACK subroutines (MKL is parallelized with OpenMP).
  • The module also uses OpenMP directives.
  • YAMS (a sparse matrix library) calls MKL as well.
  • BLUPF90IOD2 (licensed software) supports OpenMP.
• Check how many threads you are actually using before running the parallel programs.
BLAS and LAPACK
• BLAS: Basic Linear Algebra Subprograms
  • Very basic matrix/vector computations: mainly multiplications
  • Originally written in Fortran
  • Optimized by vendors
  • MKL (Math Kernel Library) by Intel: a lot of optimization + OpenMP
• LAPACK: Linear Algebra Package
  • High-level matrix computations:
    • Cholesky, eigenvalue, and singular value decompositions
    • Linear equations, etc.
  • Performance depends on BLAS
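As a minimal sketch of what a BLAS call looks like (illustrative, not from the slides): DGEMM computes C := alpha*A*B + beta*C. The matrix sizes here are made up for the example, and B is an identity matrix, so C should equal A. Link against a BLAS library (e.g., -lblas, or MKL).

```fortran
program dgemm_sketch
   implicit none
   integer, parameter :: m=2, n=2, k=2
   double precision :: A(m,k), B(k,n), C(m,n)
   A = reshape([1d0, 2d0, 3d0, 4d0], [m,k])   ! column-major 2x2
   B = reshape([1d0, 0d0, 0d0, 1d0], [k,n])   ! identity matrix
   ! C := 1.0 * A * B + 0.0 * C
   call dgemm('N', 'N', m, n, k, 1.0d0, A, m, B, k, 0.0d0, C, m)
   print *, C    ! the entries of A, since B is the identity
end program dgemm_sketch
```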
Genomic module (next week)
• G-matrix: 𝐆 = 𝐙𝐙′/𝑘
  • DGEMM from BLAS (Aguilar et al., 2011)
  • Some quality control on markers
• A22-matrix and related computations
  • Subset of the numerator relationship matrix (Aguilar et al., 2011)
  • Diagonals of the A22-inverse (Masuda et al., 2017)
  • Indirect computations with the A22-inverse (Masuda et al., 2016)
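Because 𝐆 = 𝐙𝐙′/𝑘 is symmetric, it can also be formed with a single symmetric rank-k update (DSYRK), which computes only one triangle. A hedged sketch; the array names and sizes are illustrative and not taken from the BLUPF90 source:

```fortran
program g_matrix_sketch
   implicit none
   integer, parameter :: n = 4, m = 10   ! n animals, m markers (made up)
   integer :: i, j
   double precision :: Z(n,m), G(n,n), k
   call random_number(Z)
   k = dble(m)
   ! G := (1/k) * Z * Z'  (only the lower triangle is computed)
   call dsyrk('L', 'N', n, m, 1.0d0/k, Z, n, 0.0d0, G, n)
   ! mirror the lower triangle into the upper triangle for a full matrix
   do j = 1, n
      do i = j+1, n
         G(j,i) = G(i,j)
      end do
   end do
   print *, G(1,1)
end program g_matrix_sketch
```

DSYRK does roughly half the work of DGEMM for this product, which is why a symmetric update is preferred for the G-matrix.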
Aguilar et al., 2011
Computing time for 𝐆 = 𝐙𝐙′/𝑘 with 50K SNP markers using DSYRK from MKL (Intel Xeon E5-2650 2.2 GHz × 24 cores):

| Animals | Time (s) | Memory (GB) |
|--------:|---------:|------------:|
| 1,000   | 0.18     | 0.4         |
| 2,000   | 0.47     | 0.8         |
| 5,000   | 1.87     | 2.1         |
| 10,000  | 8.40     | 4.5         |
| 20,000  | 28.4     | 10.4        |
| 50,000  | 162.7    | 37.3        |
| 100,000 | 664.2    | 111.8       |
| 200,000 | 2796.0   | 372.5       |
YAMS (next week)
• Yet Another Mixed-model Solver (Masuda et al., 2014, 2015)
• Supernodal approaches
  • Uses dense blocks within a sparse matrix, i.e., BLAS
  • Both in factorization and inversion
• Replacement for FSPAK
  • OPTION use_yams
• Variance component estimation with REML
• S. E. (or accuracy) of breeding value by inversion
• Indirect computation in A22-inverse
• Unavailable in Gibbs sampling