Code Tuning and Parallelization on Boston University’s Scientific Computing Facility
Transcript of “Code Tuning and Parallelization on Boston University’s Scientific Computing Facility”

Doug Sondak
Boston University, Scientific Computing and Visualization
Outline
• Introduction
• Timing
• Profiling
• Cache
• Tuning
• Timing/profiling exercise
• Parallelization
Introduction
• Tuning
  – Where is most time being used?
  – How to speed it up
• Often as much art as science
• Parallelization
  – After serial tuning, try parallel processing
  – MPI
  – OpenMP
Timing
• When tuning/parallelizing a code, need to assess effectiveness of your efforts
• Can time whole code and/or specific sections
• Some types of timers
  – Unix time command
  – function/subroutine calls
  – profiler
CPU or Wall-Clock Time?
• both are useful
• for parallel runs, really want wall-clock time, since CPU time will be about the same or even increase as number of procs. is increased
• CPU time doesn’t account for wait time
• wall-clock time may not be accurate if sharing processors
  – wall-clock timings should always be performed in batch mode
Unix Time Command
• easiest way to time code
• simply type time before your run command
• output differs between C-type shells (csh, tcsh) and Bourne-type shells (sh, bash, ksh)
Unix time Command (cont’d)

• tcsh results

    twister:~ % time mycode
    1.570u 0.010s 0:01.77 89.2% 75+1450k 0+0io 64pf+0w

• fields, left to right:
  – user CPU time (s)
  – system CPU time (s)
  – wall-clock time (s)
  – (u+s)/wc
  – avg. shared + unshared text space
  – input + output operations
  – page faults + no. times proc. was swapped
Unix Time Command (3)
• bsh results

    $ time mycode
    Real   1.62
    User   1.57
    System 0.03

  – Real: wall-clock time (s)
  – User: user CPU time (s)
  – System: system CPU time (s)
Function/Subroutine Calls
• often need to time part of code
• timers can be inserted in source code
• language-dependent
cpu_time
• intrinsic subroutine in Fortran
• returns user CPU time (in seconds)
  – no system time is included
• 0.01 sec. resolution on p-series

    real :: t1, t2
    call cpu_time(t1)
    ... do stuff to be timed ...
    call cpu_time(t2)
    print*, 'CPU time = ', t2-t1, ' sec.'
system_clock
• intrinsic subroutine in Fortran
• good for measuring wall-clock time
• on p-series:
  – resolution is 0.01 sec.
  – max. time is 24 hr.
system_clock (cont’d)
• t1 and t2 are tic counts
• count_rate is optional argument containing tics/sec.

    integer :: t1, t2, count_rate
    call system_clock(t1, count_rate)
    ... do stuff to be timed ...
    call system_clock(t2)
    print*, 'wall-clock time = ', &
        real(t2-t1)/real(count_rate), ' sec'
times
• can be called from C to obtain CPU time
• 0.01 sec. resolution on p-series
• can also get system time with tms_stime
    #include <sys/times.h>
    #include <unistd.h>
    int main(){
        int tics_per_sec;
        float tic1, tic2;
        struct tms timedat;
        tics_per_sec = sysconf(_SC_CLK_TCK);
        times(&timedat);
        tic1 = timedat.tms_utime;
        ... do stuff to be timed ...
        times(&timedat);
        tic2 = timedat.tms_utime;
        printf("CPU time = %5.2f\n",
               (float)(tic2-tic1)/(float)tics_per_sec);
    }
gettimeofday
• can be called from C to obtain wall-clock time
• μsec resolution on p-series

    #include <sys/time.h>
    int main(){
        struct timeval t;
        double t1, t2;
        gettimeofday(&t, NULL);
        t1 = t.tv_sec + 1.0e-6*t.tv_usec;
        ... do stuff to be timed ...
        gettimeofday(&t, NULL);
        t2 = t.tv_sec + 1.0e-6*t.tv_usec;
        printf("wall-clock time = %5.3f\n", t2-t1);
    }
MPI_Wtime
• convenient wall-clock timer for MPI codes
• μsec resolution on p-series
MPI_Wtime (cont’d)
• Fortran

    double precision t1, t2
    t1 = mpi_wtime()
    ... do stuff to be timed ...
    t2 = mpi_wtime()
    print*, 'wall-clock time = ', t2-t1

• C

    double t1, t2;
    t1 = MPI_Wtime();
    ... do stuff to be timed ...
    t2 = MPI_Wtime();
    printf("wall-clock time = %5.3f\n", t2-t1);
omp_get_wtime
• convenient wall-clock timer for OpenMP codes
• resolution available by calling omp_get_wtick()
• 0.01 sec. resolution on p-series
omp_get_wtime (cont’d)
• Fortran

    double precision t1, t2, omp_get_wtime
    t1 = omp_get_wtime()
    ... do stuff to be timed ...
    t2 = omp_get_wtime()
    print*, 'wall-clock time = ', t2-t1

• C

    double t1, t2;
    t1 = omp_get_wtime();
    ... do stuff to be timed ...
    t2 = omp_get_wtime();
    printf("wall-clock time = %5.3f\n", t2-t1);
Timer Summary
             CPU        Wall
    Fortran  cpu_time   system_clock
    C        times      gettimeofday
    MPI                 MPI_Wtime
    OpenMP              omp_get_wtime
Profiling
Profilers
• profile tells you how much time is spent in each routine
• various profilers available, e.g.
  – gprof (GNU)
  – pgprof (Portland Group)
  – Xprofiler (AIX)
gprof
• compile with -pg
• file gmon.out will be created when you run
• gprof executable > myprof
• for multiple procs. (MPI), copy or link gmon.out.n to gmon.out, then run gprof
gprof (cont’d)

• call-graph profile:

    granularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

                                      called/total      parents
    index  %time    self descendents  called+self   name        index
                                      called/total      children

                    0.00   340.50       1/1         .__start [2]
    [1]     78.3    0.00   340.50       1           .main [1]
                    2.12   319.50      10/10        .contrl [3]
                    0.04     7.30      10/10        .force [34]
                    0.00     5.27       1/1         .initia [40]
                    0.56     3.43       1/1         .plot3da [49]
                    0.00     1.27       1/1         .data [73]
gprof (3)
    granularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

      %   cumulative    self                self     total
    time    seconds   seconds      calls  ms/call  ms/call  name
    20.5      89.17     89.17         10  8917.00 10918.00  .conduct [5]
     7.6     122.34     33.17        323   102.69   102.69  .getxyz [8]
     7.5     154.77     32.43                               .__mcount [9]
     7.2     186.16     31.39     189880     0.17     0.17  .btri [10]
     7.2     217.33     31.17                               .kickpipes [12]
     5.1     239.58     22.25  309895200     0.00     0.00  .rmnmod [16]
     2.3     249.67     10.09        269    37.51    37.51  .getq [24]
pgprof
• compile with Portland Group compiler
  – pgf95 (pgf90, etc.)
  – pgcc
  – add -Mprof=func
    • similar to -pg
• run code
• pgprof -exe executable
  – pops up window with flat profile
pgprof (cont’d)

• (screenshot of the pgprof flat-profile window)
pgprof (3)
• line-level profiling
  – -Mprof=line
• optimizer will re-order lines
  – profiler will lump lines in some loops or other constructs
  – may want to compile without optimization, may not
• in flat profile, double-click on function
pgprof (4)

• (screenshot of the pgprof line-level display)
xprofiler

• AIX (twister) has a graphical interface to gprof
• compile with -g -pg -Ox
  – Ox represents whatever level of optimization you’re using (e.g., O5)
• run code
  – produces gmon.out file
• type xprofiler mycode
  – mycode is your code run command
xprofiler (cont’d)

• (screenshot of the xprofiler window)
xprofiler (3)
• filled boxes represent functions or subroutines
• “fences” represent libraries
• left-click a box to get function name and timing information
• right-click on box to get source code or other information
xprofiler (4)
• can also get same profiles as from gprof by using menus
  – report → flat profile
  – report → call graph profile
Cache
![Page 35: Code Tuning and Parallelization on Boston University’s Scientific Computing Facility](https://reader036.fdocuments.net/reader036/viewer/2022062305/56815a72550346895dc7d3ed/html5/thumbnails/35.jpg)
Cache
• Cache is a small chunk of fast memory between the main memory and the registers
• memory hierarchy (fastest to slowest): registers → primary cache → secondary cache → main memory
Cache (cont’d)
• Variables are moved from main memory to cache in lines
  – L1 cache line sizes on our machines
    • Opteron (katana cluster): 64 bytes
    • Power4 (p-series): 128 bytes
    • PPC440 (Blue Gene): 32 bytes
    • Pentium III (linux cluster): 32 bytes
• If variables are used repeatedly, code will run faster, since cache memory is much faster than main memory
Cache (cont’d)
• Why not just make the main memory out of the same stuff as cache?
  – Expensive
  – Runs hot
  – This was actually done in Cray computers
    • Liquid cooling system
Cache (cont’d)
• Cache hit
  – Required variable is in cache
• Cache miss
  – Required variable not in cache
  – If cache is full, something else must be thrown out (sent back to main memory) to make room
  – Want to minimize number of cache misses
Cache example
• main memory contains: ... x[0] x[1] x[2] x[3] x[4] x[5] x[6] x[7] x[8] x[9] a b ...
• “mini” cache holds 2 lines, 4 words each

    for(i=0; i<10; i++)
        x[i] = i;
Cache example (cont’d)
    for(i=0; i<10; i++)
        x[i] = i;

• We will ignore i for simplicity
• need x[0], not in cache → cache miss
• load line from memory into cache (x[0] x[1] x[2] x[3])
• next 3 loop indices result in cache hits
Cache example (cont’d)
    for(i=0; i<10; i++)
        x[i] = i;

• need x[4], not in cache → cache miss
• load line from memory into cache (x[4] x[5] x[6] x[7])
• next 3 loop indices result in cache hits
Cache example (cont’d)
    for(i=0; i<10; i++)
        x[i] = i;

• need x[8], not in cache → cache miss
• load line from memory into cache
• no room in cache!
• replace old line (x[0] x[1] x[2] x[3]) with (x[8] x[9] a b)
Cache (cont’d)
• Contiguous access is important
• In C, a multidimensional array is stored in memory as

    a[0][0] a[0][1] a[0][2] ...
Cache (cont’d)
• In Fortran and Matlab, a multidimensional array is stored the opposite way:

    a(1,1) a(2,1) a(3,1) ...
Cache (cont’d)
• Rule: Always order your loops appropriately
  – will usually be taken care of by optimizer
  – suggestion: don’t rely on optimizer!

  C:

    for(i=0; i<N; i++){
        for(j=0; j<N; j++){
            a[i][j] = 1.0;
        }
    }

  Fortran:

    do j = 1, n
        do i = 1, n
            a(i,j) = 1.0
        enddo
    enddo
Tuning Tips
• Some of these tips will be taken care of by compiler optimization
  – It’s best to do them yourself, since compilers vary
Tuning Tips (cont’d)
• Access arrays in contiguous order
  – For multi-dimensional arrays, rightmost index varies fastest for C and C++, leftmost for Fortran and Matlab

  Bad:

    for(j=0; j<N; j++){
        for(i=0; i<N; i++){
            a[i][j] = 1.0;
        }
    }

  Good:

    for(i=0; i<N; i++){
        for(j=0; j<N; j++){
            a[i][j] = 1.0;
        }
    }
Tuning Tips (3)
• Eliminate redundant operations in loops
• Bad:

    for(i=0; i<N; i++){
        x = 10;
        ...
    }

• Good:

    x = 10;
    for(i=0; i<N; i++){
        ...
    }
Tuning Tips (4)
• Eliminate if statements within loops
• They may inhibit pipelining
    for(i=0; i<N; i++){
        if(i==0)
            perform i=0 calculations
        else
            perform i>0 calculations
    }
Tuning Tips (5)
• Better way
    perform i=0 calculations
    for(i=1; i<N; i++){
        perform i>0 calculations
    }
Tuning Tips (6)
• Divides cost far more than multiplies or adds
  – Often order of magnitude difference!
• Bad:

    for(i=0; i<N; i++)
        x[i] = y[i]/scalarval;

• Good:

    qs = 1.0/scalarval;
    for(i=0; i<N; i++)
        x[i] = y[i]*qs;
Tuning Tips (7)
• There is overhead associated with a function call
• Bad:

    for(i=0; i<N; i++)
        myfunc(i);

• Good:

    myfunc();

    void myfunc(){
        for(int i=0; i<N; i++){
            do stuff
        }
    }
Tuning Tips (8)
• Minimize calls to math functions
• Bad:

    for(i=0; i<N; i++)
        z[i] = log(x[i]) + log(y[i]);

• Good:

    for(i=0; i<N; i++)
        z[i] = log(x[i] * y[i]);
Tuning Tips (9)
• Recasting may be costlier than you think
• Bad:

    sum = 0.0;
    for(i=0; i<N; i++)
        sum += (float) i;

• Good:

    isum = 0;
    for(i=0; i<N; i++)
        isum += i;
    sum = (float) isum;
Parallelization
• Introduction
• MPI & OpenMP
• Performance metrics
• Amdahl’s Law
Introduction

• Divide and conquer!
  – divide operations among many processors
  – perform operations simultaneously
  – if serial run takes 10 hours and we hit the problem with 5000 processors, it should take about 7 seconds to complete, right?
    • not so easy, of course
Introduction (cont’d)
• problem: some calculations depend upon previous calculations
  – can’t be performed simultaneously
  – sometimes tied to the physics of the problem, e.g., time evolution of a system
• want to maximize amount of parallel code
  – occasionally easy
  – usually requires some work
Introduction (3)
• method used for parallelization may depend on hardware
  – distributed memory: each processor (proc0 … proc3) has its own memory (mem0 … mem3)
  – shared memory: all processors (proc0 … proc3) share a single memory (mem)
  – mixed memory: pairs of processors each share a memory (proc0, proc1 → mem0; proc2, proc3 → mem1)
Introduction (4)
• distributed memory
  – e.g., katana, Blue Gene
  – each processor has own address space
  – if one processor needs data from another processor, must be explicitly passed
• shared memory
  – e.g., p-series IBM machines
  – common address space
  – no message passing required
Introduction (5)
• MPI
  – for both distributed and shared memory
  – portable
  – freely downloadable
• OpenMP
  – shared memory only
  – must be supported by compiler (most do)
  – usually easier than MPI
  – can be implemented incrementally
MPI
• Computational domain is typically decomposed into regions
  – One region assigned to each processor
• Separate copy of program runs on each processor
MPI (cont’d)
• Discretized domain to solve flow over airfoil
• System of coupled PDEs solved at each point
MPI (3)
• Decomposed domain for 4 processors
MPI (4)
• Since points depend on adjacent points, must transfer information after each iteration
• This is done with explicit calls in the source code
MPI (5)
• Diminishing returns
  – Sending messages can get expensive
  – Want to maximize ratio of computation to communication
OpenMP
• Usually loop-level parallelization
• An OpenMP directive is placed in the source code before the loop
  – Assigns subset of loop indices to each processor
  – No message passing since each processor can “see” the whole domain

    for(i=0; i<N; i++){
        do lots of stuff
    }
OpenMP (cont’d)
• Can’t guarantee order of operations
• Example of how to do it wrong: parallelize this loop on 2 processors

    for(i = 0; i < 7; i++)
        a[i] = 1;
    for(i = 1; i < 7; i++)
        a[i] = 2*a[i-1];

    i   a[i] (serial)   a[i] (parallel)
    0        1                1
    1        2                2    ← Proc. 0
    2        4                4    ← Proc. 0
    3        8                8    ← Proc. 0
    4       16                2    ← Proc. 1
    5       32                4    ← Proc. 1
    6       64                8    ← Proc. 1
Quantify performance
• Two common methods
  – parallel speedup
  – parallel efficiency
Parallel Speedup
• Sn = parallel speedup
• n = number of processors
• T1 = time on 1 processor
• Tn = time on n processors

    Sn = T1 / Tn

Parallel Speedup (2)

• (plot of parallel speedup)
Parallel Efficiency
• ηn = parallel efficiency
• T1 = time on 1 processor
• Tn = time on n processors
• n = number of processors

    ηn = T1 / (n * Tn) = Sn / n

Parallel Efficiency (2)

• (plot of parallel efficiency)
Parallel Efficiency (3)
• What is a “reasonable” level of parallel efficiency?
• depends on
  – how much CPU time you have available
  – when the paper is due
• can think of (1-η) as “wasted” CPU time
• my personal rule of thumb: ~60%
Parallel Efficiency (4)
• Superlinear speedup
  – parallel efficiency > 1.0
  – sometimes quoted in the literature
  – generally attributed to cache issues
    • subdomains fit entirely in cache, entire domain does not
• this is very problem-dependent
• be suspicious!
Amdahl’s Law
• let fraction of code that can execute in parallel be denoted p
• let fraction of code that must execute serially be denoted s
• let T = time, n = number of processors

    Tn = T1 (s + p/n)
Amdahl’s Law (2)
• Noting that p = (1-s), parallel speedup is (don’t confuse Sn with s)

    Sn = T1 / Tn = 1 / (s + (1-s)/n)        Amdahl’s Law
Amdahl’s Law (3)
• can also be expressed as parallel efficiency by dividing by n

    ηn = 1 / (s*n + (1-s))        Amdahl’s Law
Amdahl’s Law (4)

• suppose s = 0 => linear speedup:

    Sn = 1 / (s + (1-s)/n) = n
    ηn = 1 / (s*n + (1-s)) = 1
Amdahl’s Law (5)
• suppose s = 1 => no speedup:

    Sn = 1 / (s + (1-s)/n) = 1
    ηn = 1 / (s*n + (1-s)) = 1/n
Amdahl’s Law (6)

• (plot of Amdahl speedup vs. number of processors)
Amdahl’s Law (7)
• Should we despair?
  – No!
  – bigger machines → bigger computations → smaller value of s
• if you want to run on a large number of processors, try to minimize s
Recommendations
• Add timers to your code
  – As you make changes and/or run new cases, they may give you an indication of a problem
• Profile your code
  – Sometimes results are surprising
  – Review “tuning tips”
  – See if you can speed up functions that are consuming the most time
• Try highest levels of compiler optimization
Recommendations (cont’d)• Once you’re comfortable that you’re getting
reasonable serial performance, parallelize• If portability is an issue, MPI is a good
choice• If you’ll always be running on a shared-
memory machine (e.g., multicore PC), consider OpenMP
• For parallel code, plot parallel efficiency vs. number of processors– Choose appropriate number of processors