Molecular Models, Threads and You
Jiahao Chen
Martínez Group, Dept. of Chemistry, CATMS, MRL and Beckman
CS 498 MG presentation: 2007-12-07
Optimizing the TINKER classical molecular dynamics code while maintaining code readability
Molecular models/force fields
Typical energy function:
E = (covalent bond effects) + (noncovalent interactions)
Molecular models/force fields
Typical energy function:
$$
E = \sum_{b \in \text{bonds}} k_b (r_b - r_{\text{eq},b})^2
  + \sum_{a \in \text{angles}} \kappa_a (\theta_a - \theta_{\text{eq},a})^2
  + \sum_{d \in \text{dihedrals}} \sum_n l_{nd} \cos(n\phi)
  + \sum_{i<j \in \text{atoms}} \frac{q_i q_j}{r_{ij}}
  + \sum_{i<j \in \text{atoms}} \epsilon_{ij} \left[ \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{12} - \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{6} \right]
$$
The five sums are, in order: bond stretch, angle torsion and dihedrals (covalent bond effects), then electrostatics and dispersion (noncovalent interactions).
Computation cost = O(N²), dominated by the two sums over atom pairs
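The O(N²) cost comes from the two sums over atom pairs i < j. The following is a minimal Python sketch of those two noncovalent terms, not TINKER's Fortran implementation. The function names are hypothetical, a single `epsilon` and `sigma` are assumed for every pair (the slide uses per-pair values), and physical constants are omitted as on the slide.

```python
import math


def nonbonded_energy(positions, charges, epsilon, sigma):
    """Sum the electrostatic and dispersion terms over all pairs i < j.

    Assumption: one epsilon/sigma for every pair, purely for illustration.
    """
    n = len(positions)
    energy = 0.0
    for i in range(n):
        for j in range(i + 1, n):          # each pair is visited once
            r = math.dist(positions[i], positions[j])
            energy += charges[i] * charges[j] / r      # q_i q_j / r_ij
            sr6 = (sigma / r) ** 6
            energy += epsilon * (sr6 * sr6 - sr6)      # eps [(s/r)^12 - (s/r)^6]
    return energy


def pair_count(n):
    """Number of pairs the double loop visits: n(n-1)/2, i.e. O(n^2)."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    for n in (6, 1000):
        print(n, "atoms ->", pair_count(n), "pairs")
```

Going from N = 6 to N = 1000 raises the pair count from 15 to 499500, which is why the pairwise terms dominate for large systems.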
Problem description
• The state of the system is given by the position and momentum of every atom (of mass $m_i$): $(x_1, p_1, x_2, p_2, \cdots, x_N, p_N) \in \mathbb{R}^{3 \times 2 \times N}$
• Solve the system of coupled first-order differential equations (Hamilton's equations of motion)
$$
\frac{\partial x_i}{\partial t} = \frac{p_i}{m_i}, \qquad \frac{\partial p_i}{\partial t} = -\frac{\partial E}{\partial x_i}, \qquad i = 1, \cdots, N
$$
• with user-specified initial conditions (e.g. with constant temperature and pressure)
• Subject to (user-specified) constraints, e.g. fixed bond angles
Many parallel and serial implementations
Package name Threads MPI Global Arrays
NAMD CHARM++
GROMACS ✓ ✓
TINKER
AMBER partly ✓ ✓
CHARMM ✓
LAMMPS ✓
NWChem ✓ ✓
Things I tried
• Compiler flag optimization
• Cache miss reduction
• Lookup tables
• Parallelization with OpenMP
Compiler flag optimization
| flags | gfortran 4.1.2 time | gfortran 4.1.2 speedup | ifort 10.0.023 time | ifort 10.0.023 speedup |
| --- | --- | --- | --- | --- |
| -O0 | 29.95(2) s | - | 36.30(2) s | - |
| -Os | 29.92(3) s | +0.77(3) % | 32.59(4) s | +10.22(2) % |
| -O1 | 30.22(1) s | -0.90(4) % | 32.12(3) s | +11.51(1) % |
| -O2 | 29.66(3) s | +0.96(1) % | 30.30(2) s | +16.54(2) % |
| -O3 | 29.84(2) s | +0.38(2) % | 30.83(2) s | +15.06(2) % |
| CE search | 28.77(2) s | +3.62(3) % (1) | 28.96(2) s | +20.22(1) % (2) |
1. FFLAGS="-falign-functions -falign-jumps -falign-labels -falign-loops -fvpt -fcse-skip-blocks -fdelete-null-pointer-checks -ffast-math -fforce-addr -fgcse -fgcse-lm -fgcse-sm -floop-optimize -fkeep-static-consts -fmerge-constants -fno-defer-pop -fno-guess-branch-probability -fno-math-errno -funsafe-math-optimizations -fno-trapping-math -foptimize-register-move -fregmove -freorder-blocks -freorder-functions -frerun-cse-after-loop -fno-sched-spec -fsched-spec-load -fsched-stalled-insns -fsignaling-nans -fsingle-precision-constant -fstrength-reduce -fthread-jumps -funroll-all-loops"
2. FFLAGS="-xN -no-prec-div -static -inline-level=1 -ip -fno-alias -fno-fnalias -fno-omit-frame-pointer -fkeep-static-consts -nolib-inline -heap-arrays 1 -pad -O3 -scalar-rep -funroll-loops -complex-limited-range"
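The slide does not say how the speedup percentages are defined. A short Python sketch, assuming the speedup is the fractional reduction in run time relative to the -O0 build, reproduces the ifort column; `speedup_percent` is a hypothetical helper name.

```python
def speedup_percent(t_ref, t_new):
    """Percentage reduction in run time relative to the reference time.

    Assumption: this is the definition behind the slide's percentages.
    """
    return 100.0 * (t_ref - t_new) / t_ref


if __name__ == "__main__":
    ifort_o0 = 36.30  # seconds, ifort 10.0.023 with -O0
    ifort_times = [("-Os", 32.59), ("-O1", 32.12), ("-O2", 30.30),
                   ("-O3", 30.83), ("CE search", 28.96)]
    for flags, seconds in ifort_times:
        print(f"{flags:10s} {speedup_percent(ifort_o0, seconds):+.2f} %")
```

Under this assumption the ifort values agree with the table to within the rounding of the reported times.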
Algorithm and time profile (N = 6, gfortran 4.1.2)
• Initialize model and parameters
• For each time step (>98% of execution time), move one time step:
  • Enforce temp. & pressure
  • Flush I/O
  • Update state by Δt/2
  • Calculate potential energy and forces (>59%)
    • Calculate charge interactions, O(N²)
    • Calculate dispersion interactions, O(N²)
    • Calculate bond interactions, O(N)
    • Calculate angle interactions, O(N)
    • Calculate dihedral interactions, O(N)
    • ...
    • Add up all components
  • Calculate & record kinetic energy and properties (<31%)
  • Update state by Δt/2
  • Enforce temp. & pressure
  • Remove unphysical motions
• End
Percentages shown for the energy and force components: 37%, 12%, 8%, 9%, 26%
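The flowchart brackets the energy and force calculation with two half-step state updates. One integrator with that shape is velocity Verlet. The Python sketch below is a generic textbook version under that assumption, not TINKER's integrator; `velocity_verlet_step` and `harmonic_force` are hypothetical names, and the toy force stands in for the full force field.

```python
def velocity_verlet_step(x, p, m, force, dt):
    """Advance positions x and momenta p by one time step dt.

    force(x) must return -dE/dx for every coordinate.
    """
    f = force(x)
    # first half-step update of the momenta
    p_half = [pi + 0.5 * dt * fi for pi, fi in zip(p, f)]
    # full position update: dx/dt = p / m
    x_new = [xi + dt * ph / mi for xi, ph, mi in zip(x, p_half, m)]
    # recompute forces at the new positions (the expensive part)
    f_new = force(x_new)
    # second half-step update of the momenta: dp/dt = -dE/dx
    p_new = [ph + 0.5 * dt * fi for ph, fi in zip(p_half, f_new)]
    return x_new, p_new


def harmonic_force(x):
    """Toy force for E = x^2 / 2 per coordinate (illustration only)."""
    return [-xi for xi in x]


if __name__ == "__main__":
    x, p = [1.0], [0.0]
    for _ in range(1000):
        x, p = velocity_verlet_step(x, p, [1.0], harmonic_force, 0.01)
    print("position:", x[0], "energy:", 0.5 * p[0] ** 2 + 0.5 * x[0] ** 2)
```

For simplicity the sketch evaluates the force twice per step; a production integrator would reuse the forces from the previous step, so the force routine runs once per time step.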
An unexpected cost
[Figure: same flowchart as in "Algorithm and time profile" (N = 6), with "Add up all components" singled out]
Q: Why is 15% of total execution time spent adding numbers!?
A: many L2 cache misses
```fortran
c     zero out each of the first derivative components
      do i = 1, n                          ! 7 cache misses
         do j = 1, 3
            deb(j,i) = 0.0d0               ! 42 cache misses
            ...
         end do
      end do
      ...
c     sum up to get the total energy and first derivatives
      energy = eb + ...
      do i = 1, n
         do j = 1, 3
            desum(j,i) = deb(j,i) + ...    ! 19 cache misses
            derivs(j,i) = desum(j,i)       ! 2 cache misses
         end do
      end do
```
70 of 91 cache misses per time step (n = 6) shown. The elided "..." in the zeroing loop and in the sum each stand for 22 other terms.
A simple solution
```fortran
c     zero out each of the first derivative components
      do i = 1, n                          ! 7 cache misses
         do j = 1, 3
            deb(j,i) = 0.0d0               ! 26 cache misses (was 42)
            ...
         end do
      end do
      ...
c     sum up to get the total energy and first derivatives
      energy = eb + ...
      do i = 1, n
         do j = 1, 3
            temp = deb(j,i) + ...          ! 6 cache misses
            desum(j,i) = temp              ! 1 cache miss (was 19)
            derivs(j,i) = temp             ! 1 cache miss (was 2)
         end do
      end do
```
Reduced cache misses from 92 to 41 per time step
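The change above is a scalar replacement: the sum is held in a scalar and stored to both arrays, so `desum(j,i)` is not read back immediately after being written. The Python sketch below only illustrates that the transformation leaves the results unchanged; Python does not expose cache behaviour, so it says nothing about the miss counts. The function names and the `terms[t][i][j]` layout are assumptions made for this example.

```python
def sum_derivatives_original(terms):
    """Sum all derivative terms, then copy by re-reading desum."""
    n = len(terms[0])
    desum = [[0.0] * 3 for _ in range(n)]
    derivs = [[0.0] * 3 for _ in range(n)]
    for i in range(n):
        for j in range(3):
            desum[i][j] = sum(t[i][j] for t in terms)
            derivs[i][j] = desum[i][j]      # reads the element just written
    return desum, derivs


def sum_derivatives_scalar(terms):
    """Same result, but the sum is kept in a scalar before both stores."""
    n = len(terms[0])
    desum = [[0.0] * 3 for _ in range(n)]
    derivs = [[0.0] * 3 for _ in range(n)]
    for i in range(n):
        for j in range(3):
            temp = sum(t[i][j] for t in terms)
            desum[i][j] = temp
            derivs[i][j] = temp             # no read-back from desum
    return desum, derivs


if __name__ == "__main__":
    toy_terms = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
                 [[0.5, 0.5, 0.5], [1.5, 1.5, 1.5]]]
    print(sum_derivatives_original(toy_terms) == sum_derivatives_scalar(toy_terms))
```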
Speedup from reducing L2 cache misses
| | gfortran 4.1.2 | ifort 10.0.023 |
| --- | --- | --- |
| original | 29.95(2) s | 28.96(2) s |
| with scalar replacement | 27.43(3) s | 28.95(1) s |
| speedup | +8.44(1) % | +0.03(2) % |
ifort was already called with its scalar replacement flag (-scalar-rep)
Lookup tables (LUTs)
• Calculations of sqrt() and exp() take up 23.8% of execution time
• Idea: pre-compute values of sqrt() and exp() in an array and recall them from memory when needed
• Caution: LUT should not displace too much data from L2 cache
sqrt() with LUT
[Figure: direct LUT compared with LUT with linear interpolation]
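The two sqrt() variants named on this slide can be sketched in Python as follows. This is a minimal illustration, not the code used in the talk: the class name, the uniform grid, and the range and table size in the usage example are all assumptions.

```python
import math


class SqrtTable:
    """Precomputed sqrt() values on a uniform grid over [x_min, x_max]."""

    def __init__(self, x_min, x_max, size):
        self.x_min = x_min
        self.step = (x_max - x_min) / (size - 1)
        self.values = [math.sqrt(x_min + k * self.step) for k in range(size)]

    def direct(self, x):
        """Direct LUT: return the entry at the nearest grid point."""
        k = round((x - self.x_min) / self.step)
        k = max(0, min(k, len(self.values) - 1))
        return self.values[k]

    def interpolated(self, x):
        """LUT with linear interpolation between the two bracketing entries."""
        t = (x - self.x_min) / self.step
        k = max(0, min(int(t), len(self.values) - 2))
        frac = t - k
        return self.values[k] + frac * (self.values[k + 1] - self.values[k])


if __name__ == "__main__":
    table = SqrtTable(1.0, 4.0, 1001)   # hypothetical range and size
    x = 2.2222
    print("direct error:      ", abs(table.direct(x) - math.sqrt(x)))
    print("interpolated error:", abs(table.interpolated(x) - math.sqrt(x)))
```

Interpolation costs two table reads and a multiply-add per call, in exchange for an error that shrinks with the square of the grid spacing rather than linearly.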
exp() with LUT
[Figure: direct LUT compared with LUT with first-order Taylor series refinement*]
* $e^{x} = e^{x_0} + (x - x_0) e^{x_0} + O\left((x - x_0)^2\right)$
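The first-order Taylor refinement reuses the stored value e^{x0} as its own derivative, so it needs only one table read. A minimal Python sketch, with a hypothetical class name and an assumed uniform grid, range and table size:

```python
import math


class ExpTable:
    """Precomputed exp() values on a uniform grid over [x_min, x_max]."""

    def __init__(self, x_min, x_max, size):
        self.x_min = x_min
        self.step = (x_max - x_min) / (size - 1)
        self.values = [math.exp(x_min + k * self.step) for k in range(size)]

    def _nearest(self, x):
        """Index of the grid point x0 closest to x."""
        k = round((x - self.x_min) / self.step)
        return max(0, min(k, len(self.values) - 1))

    def direct(self, x):
        """Direct LUT: exp(x) is approximated by exp(x0)."""
        return self.values[self._nearest(x)]

    def refined(self, x):
        """First-order Taylor refinement: exp(x0) + (x - x0) * exp(x0)."""
        k = self._nearest(x)
        x0 = self.x_min + k * self.step
        e0 = self.values[k]
        return e0 + (x - x0) * e0


if __name__ == "__main__":
    table = ExpTable(-1.0, 0.0, 1001)   # hypothetical range and size
    x = -0.3337
    print("direct error: ", abs(table.direct(x) - math.exp(x)))
    print("refined error:", abs(table.refined(x) - math.exp(x)))
```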
Choice of implementation
| function | desired precision | table size (doubles) | refinement | expected speedup |
| --- | --- | --- | --- | --- |
| sqrt() | 10⁻⁴ | 10,764 | none | +118% |
| exp() | 10⁻⁸ | 6,836 | Taylor | +151% |
LUT aligned to 128 bits
L2 cache = 4 MB = 512K doubles
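The caution about not displacing too much data from the L2 cache can be checked with a few lines of arithmetic. The Python sketch below assumes 8-byte doubles and uses the table sizes and cache size quoted on the slide; the variable names are illustrative.

```python
BYTES_PER_DOUBLE = 8                      # assumption: IEEE 754 double precision
l2_bytes = 4 * 1024 * 1024                # 4 MB L2 cache, as on the slide
l2_doubles = l2_bytes // BYTES_PER_DOUBLE

table_sizes = {"sqrt()": 10764, "exp()": 6836}   # entries, from the slide
fractions = {name: size / l2_doubles for name, size in table_sizes.items()}

if __name__ == "__main__":
    print("L2 capacity in doubles:", l2_doubles)
    for name, fraction in fractions.items():
        print(f"{name} table uses {100 * fraction:.2f} % of L2")
```

Each table takes only a few percent of the cache, which leaves most of the L2 for the simulation data.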
Speedup from LUT use
| | gfortran 4.1.2 | ifort 10.0.023 |
| --- | --- | --- |
| original | 29.95(2) s | 28.96(2) s |
| with lookup tables | 26.89(1) s | 25.87(2) s |
| speedup | +10.23(2) % | +7.22(3) % |
Summary of serial improvements
| Improvement | gfortran 4.1.2 | ifort 10.0.023 |
| --- | --- | --- |
| Best compiler flags | +3.62(3) % | +20.22(1) % |
| L2 cache miss reduction | +8.44(2) % | +0.03(1) % |
| Lookup tables | +10.23(1) % | +7.22(2) % |
| Total | 23.91(3) s, +20.17(4) % | 26.86(2) s, +26.00(2) % |
Parallelization targets
[Figure: same flowchart as in "Algorithm and time profile" (N = 6)]
Parallelization strategy
[Figure: flowchart in which "Calculate potential energy and forces" expands into the charge, dispersion, bond, angle, dihedral, ... interaction calculations, followed by "Add up all components"]
OpenMP annotations on the flowchart: omp sections containing two omp section blocks; omp parallel do (five occurrences)
Percentages shown on the flowchart: 12%, 16%, 11%, 50%, 2%, 50%, 50%, 100%
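The omp sections annotation expresses task-level parallelism: independent energy terms run concurrently and their results are added afterwards. The Python sketch below is only a structural analogue using the standard library's `concurrent.futures`, not OpenMP and not the talk's code. The three toy energy terms and all function names are hypothetical. In standard CPython, threads typically give no speedup for CPU-bound pure-Python work, so this shows the decomposition rather than the performance.

```python
from concurrent.futures import ThreadPoolExecutor


def bond_term(coords):
    """Toy stand-in for a bonded term (illustration only)."""
    return sum((x - 1.0) ** 2 for x in coords)


def angle_term(coords):
    """Toy stand-in for a second independent term (illustration only)."""
    return sum(0.5 * x * x for x in coords)


def charge_term(coords):
    """Toy pairwise term over all i < j (illustration only)."""
    n = len(coords)
    return sum(1.0 / abs(coords[i] - coords[j])
               for i in range(n) for j in range(i + 1, n))


def total_energy_serial(terms, coords):
    """Evaluate each term in turn and add up all components."""
    return sum(term(coords) for term in terms)


def total_energy_parallel(terms, coords, workers=2):
    """Submit each independent term as its own task, then add the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(term, coords) for term in terms]
        return sum(future.result() for future in futures)


if __name__ == "__main__":
    coords = [0.0, 1.5, 3.0, 4.5]
    terms = [bond_term, angle_term, charge_term]
    print(total_energy_serial(terms, coords), total_energy_parallel(terms, coords))
```

The terms can run concurrently because each one only reads the shared coordinates and returns its own partial result; the final sum is the only step that combines them.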
Parallelization results
[Figure: execution time (s) versus number of cores for N = 6 and N = 1000, compared with ideal scaling; gfortran 4.1.2]
Summary
• Free software can sometimes be better than non-free software: here gfortran's best total time, 23.91(3) s, beat ifort's 26.86(2) s
• L2 cache misses can significantly degrade performance
• Lookup tables are an effective way to trade memory and precision for speed
• Simple OpenMP parallelization is effective for small numbers of processors