The Fast Multipole Method in molecular dynamics
bioexcel.eu
Partners Funding
Berk Hess, KTH Royal Institute of Technology, Stockholm, Sweden
ADAC6 workshop, Zurich, 20-06-2018
BioExcel
Molecular Dynamics of biomolecules

Powerful because:
• structure and dynamics in atomistic detail

Challenging because:
• fixed system size: strong scaling needed
• always more simulation needed than possible
• force-field (model) accuracy can be limiting
• results can be difficult to analyse
Molecular Dynamics

Newton's equation of motion:

$$ m_i \frac{d^2 x_i}{dt^2} = F_i = -\nabla_{r_i} V(x), \quad i = 1, \ldots, N $$

Simple (symplectic) leapfrog integrator:

$$ v_i\!\left(t + \tfrac{\Delta t}{2}\right) = v_i\!\left(t - \tfrac{\Delta t}{2}\right) + a_i(t)\,\Delta t $$

$$ x_i(t + \Delta t) = x_i(t) + v_i\!\left(t + \tfrac{\Delta t}{2}\right)\Delta t $$

Fundamental issue: MD is a sequential problem. Δt ~ 1 fs, timescales of interest μs - s ⇨ need 10^9 - 10^15 steps!
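The update equations above can be sketched directly in code. This is an illustrative NumPy version (not the GROMACS integrator); `forces` stands for any callable returning the force on each particle.

```python
import numpy as np

def leapfrog_step(x, v_half, forces, masses, dt):
    """One leapfrog step, matching the update equations above.

    v_half holds v(t - dt/2); we return x(t + dt) and v(t + dt/2).
    x, v_half: (N, 3) arrays; masses: (N,) array.
    """
    a = forces(x) / masses[:, None]   # a_i(t) = F_i(t) / m_i
    v_new = v_half + a * dt           # v(t + dt/2)
    x_new = x + v_new * dt            # x(t + dt)
    return x_new, v_new
```

The half-step offset between positions and velocities is what makes the scheme symplectic and time-reversible, which in turn is what keeps the long-run energy drift bounded.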
Computational cost mostly in computing the forces

$$ U(r) = \sum_{\mathrm{bonds}} U_{\mathrm{bond}}(r) + \sum_{\mathrm{angles}} U_{\mathrm{angle}}(r) + \sum_{\mathrm{dihs}} U_{\mathrm{dih}}(r) + \sum_i \sum_{j>i} \left[ \frac{q_i q_j}{4\pi\varepsilon_0 r_{ij}} + \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} \right] $$

We need the force on each particle i:

$$ F_i = -\frac{dU}{dr_i} $$

The pair sum is the largest cost: use a cut-off (and a long-range electrostatics algorithm).
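A minimal sketch of that pair sum with a cut-off, assuming for simplicity that the Lennard-Jones parameters `A` and `B` are the same scalars for all pairs (the real force field has per-pair `A_ij`, `B_ij`), and with `ke` standing in for 1/(4πε₀) in the chosen units:

```python
import numpy as np

def nonbonded_forces(x, q, A, B, r_cut, ke=1.0):
    """Pairwise Coulomb + Lennard-Jones forces with a cut-off.

    Direct double loop over pairs i < j, skipping pairs beyond r_cut:
    this is the pair sum identified above as the dominant cost.
    Illustrative sketch only, not an MD-code kernel.
    """
    n = len(x)
    f = np.zeros_like(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            r2 = dx @ dx
            if r2 > r_cut * r_cut:
                continue                  # cut-off: skip distant pairs
            inv_r2 = 1.0 / r2
            inv_r = np.sqrt(inv_r2)
            # (-dU/dr)/r for Coulomb plus r^-12 / r^-6 Lennard-Jones
            f_over_r = (ke * q[i] * q[j] * inv_r * inv_r2
                        + 12 * A * inv_r2**7 - 6 * B * inv_r2**4)
            f[i] += f_over_r * dx
            f[j] -= f_over_r * dx         # Newton's third law
    return f
```

Even this naive loop shows why the pair interactions dominate: the bonded terms are O(N), while the pair sum touches every neighbor of every atom.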
The MD strong scaling challenge

To reach 10^9 - 10^15 steps we need to minimise the wall time per step.

The best MD codes in the world are at 100 microseconds per step (on a single node):
• at ~ 200 atoms per CPU core ⇨ we run in L1/L2 cache
• at ~ 3000 atoms per GPU
• but systems of interest are usually > 100000 atoms

Extreme demands on hardware and parallelisation libraries (MPI, OpenMP):
• ideally overheads should be < 1 microsecond
GROMACS

• Started as a hardware+software project in Groningen (NL)
• Now core team in Stockholm and many contributors world-wide
• Focus on high performance: efficient algorithms, SIMD & GPUs
• Good parallel scaling
• Compiles and runs efficiently on nearly any hardware
• Open source, LGPL
• Incorporates important algorithms for flexibly manipulating systems
• C++, MPI, OpenMP, CUDA, OpenCL
• C++ and Python API in the works
Highly optimised vectorization

[Figure: pair-interaction layouts — classical 1x1 neighborlist on 4-way SIMD, 4x4 setup on 4-way SIMD, 4x4 setup on 8-way SIMD, and 4x4 setup on SIMT-8]

Small C++ SIMD layer with highly optimised math functions for: SSE2/4, AVX2-128, AVX(2)-256, AVX-512, ARM, VSX, HPC-ACE

CPU kernels reach 50% of peak.

CUDA and OpenCL need separate kernels.
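The idea behind the 4x4 cluster setup can be sketched with NumPy arrays standing in for SIMD lanes: all 4x4 = 16 pair interactions between two atom clusters are evaluated with vector operations instead of one pair at a time. This is an illustrative sketch (Coulomb only), not the actual GROMACS kernel.

```python
import numpy as np

def cluster_pair_forces(xi, xj, qi, qj, ke=1.0):
    """Coulomb forces between two non-overlapping clusters of 4 atoms.

    xi, xj: (4, 3) positions; qi, qj: (4,) charges.
    Every arithmetic step below operates on all 16 pairs at once,
    mirroring how a SIMD kernel fills its vector lanes.
    """
    dx = xi[:, None, :] - xj[None, :, :]   # pair displacements, (4, 4, 3)
    r2 = np.sum(dx * dx, axis=-1)          # squared distances, (4, 4)
    inv_r = 1.0 / np.sqrt(r2)
    coef = ke * qi[:, None] * qj[None, :] * inv_r**3
    fij = coef[:, :, None] * dx            # force on i from j, (4, 4, 3)
    fi = fij.sum(axis=1)                   # total force on each i atom
    fj = -fij.sum(axis=0)                  # Newton's third law
    return fi, fj
```

The payoff of clustering is regularity: a fixed 4x4 (or 4x8, 8x8) tile of interactions maps directly onto SIMD lanes or GPU threads, at the cost of computing some pairs that a 1x1 neighborlist would have skipped.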
Atom clustering (regularization) and pair list buffering

We can optimize the size of this buffer: a larger buffer means more calculations, but we can update the neighbor list less frequently.
[Figure: performance (ns/day) vs pair search frequency (MD steps), 56 ranks with and without dynamic pruning]

New in GROMACS-2018: dual pair list:
• reduces overhead
• less sensitive to parameters
Intra-rank parallelisation: OpenMP

GROMACS has efficient parallelization of all algorithms using MPI + OpenMP.

OpenMP is (performance) portable, but limited:
• No way to run parallel tasks next to each other
• No binding of threads to cores (cache locality)

Requirements for a better threading model:
• Extremely low overhead barriers (all-all, all-1, 1-all)
• Binding of threads to cores
• Portable
Spatial decomposition

[Figure: 8th-shell domain decomposition with cut-off radius rc and dynamic load-balancing boundaries rb]

8th-shell domain decomposition:
• minimizes communicated volume
• minimizes communication calls
• implemented using aggregation of data along dimensions
• 1 MPI rank per domain
• dynamic load balancing

Question: is this still the best approach for networks that support many messages?
Hess, Kutzner, Van der Spoel, Lindahl; J. Chem. Theory Comput. 4, 435 (2008)
CPU-GPU overlap within an MD step

[Figure: timeline of one MD step (100-500 μs) — CPU (OpenMP threads): DD & pair search (every 80-150 steps), H2D transfer of pair list and x,q, bonded forces, PME (spread, 3D-FFT, solve, 3D-FFT, gather), wait, force reduction, integration & constraints, pull/FE/etc.; GPU (CUDA): clear force buffer, non-bonded forces, pair-list pruning (rolling pruning), D2H transfer of forces and energies]

Average CPU-GPU overlap: 75-90%

New in GROMACS-2018: PME on GPU!
GROMACS strong scaling

Lignocellulose benchmark:
• 3.3 million atoms
• No long-range electrostatics!
• machine: Piz Daint at CSCS (Cray XC50, NVIDIA P100 GPUs)

[Figure: performance (ns/day) vs #cores / #GPUs, for 4, 6 and 12 ranks per node]
Strong scaling issues

At 100s of microseconds per step:
• The 3D-FFT in PME electrostatics
• MPI overhead: we need MPI_Put_notify()
• OpenMP barriers take significant time
• Load imbalance
• CUDA API overhead can be 50% of the CPU time
• Too many GROMACS options to tweak manually

[Figure: performance (ns/day) vs number of cores or #GPUs×8 for a 140000 atom ion channel on Piz Daint XC]
Long-range electrostatics

All examples up till now were without long-range electrostatics, but it is needed in most cases: we need all-vs-all Coulomb interactions.

All MD codes use Particle-Mesh Ewald (PME) for electrostatics. PME uses a 3D-FFT with 64^3 - 128^3 grid points. Issues:
• The 3D-FFT requires two grid transposes that require All-to-All communication in slabs
• Pencil decomposition clashes with 3D decomposition
• 3D-FFT scaling is the bottleneck for MD

GROMACS' Multiple-Program Multiple-Data parallelisation improves scaling by a factor 4: instead of 8 combined PP/PME ranks, e.g. 6 PP ranks plus 2 dedicated PME ranks confine the 3D-FFT communication to the PME ranks.
Long-range electrostatics methods for molecular dynamics

| method | in use in MD | computational cost | pre-factor | communication cost |
|---|---|---|---|---|
| Ewald summation | since 70s | O(N^3/2) | small (FFT) | high |
| PPPM / Particle Mesh Ewald | since 90s | O(N log N) | small (FFT) | high |
| Fast Multipole Method | not yet* | O(N) | larger | low |
| Multigrid Electrostatics | 2010 | O(N) | larger | low |

*we recently solved the energy conservation issue
FMM scheme

[Figure: FMM scheme — Rio Yokota, Tokyo Tech]
The accuracy of FMM is controlled by:
• the order of the expansion(s)
• the theta parameter (opening angle)

Main issue for FMM in MD:
• We want low accuracy (or better: high speed)
• We want energy conservation

[Figure 3: extended cells, with cell size R, distance r and opening angle α]
[Figure 4: the number of near neighbors increases as θ decreases (θ = 0.50, 0.35, 0.25, 0.20)]
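The role of θ can be shown with the standard opening-angle (multipole acceptance) test used by treecodes and FMM; the exact criterion varies between codes, so this is a generic sketch rather than the specific implementation discussed here.

```python
def well_separated(cell_radius, distance, theta):
    """Opening-angle test: a cell's multipole expansion may be used for
    a target when the cell looks small enough from there, i.e. when
    cell_radius < theta * distance. Smaller theta is stricter: more
    pairs are handled directly, giving higher accuracy at higher cost.
    """
    return cell_radius < theta * distance
```

This is why decreasing θ in Figure 4 grows the set of "near neighbors": more cells fail the test and must be computed directly.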
FMM for molecular dynamics (work by Rio Yokota)

For the potential φ = 1/r, near interactions are computed directly and far interactions approximately; p is the order of expansion and θ the opening angle. Different (p, θ) combinations can yield the same accuracy.

But even at the same accuracy, the combinations where θ is small have less energy drift, because the jump at the direct/approximate boundary is further away.

What happens if we smooth the jump?
Main physics issue with FMM in MD:
• All atoms have partial charges, but molecules are mostly locally neutral
• The sharp boundaries of FMM cells introduce charges at the boundaries

Example: a neutral water molecule (qO = +0.82, qH = -0.41, qH = -0.41) crossing two FMM cells gives cell charges qcell = +0.41 and qcell = -0.41, even though at long distance water is a dipole.
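The boundary effect is easy to reproduce numerically. The coordinates below are hypothetical (the oxygen and one hydrogen in the left cell, the other hydrogen just across the boundary); the charges are the ones quoted above.

```python
# Hypothetical 1D positions with the cell boundary at x = 0.5.
charges = {"O": +0.82, "H1": -0.41, "H2": -0.41}
x = {"O": 0.45, "H1": 0.40, "H2": 0.55}

def cell_monopole(cell_min, cell_max):
    """Sum the charges of atoms in [cell_min, cell_max): the monopole
    term each FMM cell would carry for this molecule."""
    return sum(q for name, q in charges.items()
               if cell_min <= x[name] < cell_max)

left = cell_monopole(0.0, 0.5)    # O and H1
right = cell_monopole(0.5, 1.0)   # H2 alone
```

The molecule is neutral, yet each cell sees a net monopole of ±0.41: a sharp cell boundary converts a short-range dipole into two spurious long-range monopoles, which jump discontinuously as atoms cross the boundary.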
FMM regularization: add smooth overlap between cells

[Figure: neutral cells blended by weight functions s1 and s2; nearby interactions are charge-charge (q-q), distant ones charge-multipole (q-multipole)]

$$ V = s_1(x)V_1(x) + s_2(x)V_2(x) $$

$$ F = -\nabla V = s_1(x)F_1(x) + s_2(x)F_2(x) - s_1'(x)V_1(x) - s_2'(x)V_2(x) $$
orig. idea: Chartier, Darrigrand, Faou; BIT Numer. Math. 50, 23 (2010)
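A 1D sketch of that blended force, using a generic C¹ polynomial weight (the actual regularization function in this FMM work may differ). The key point is the extra −s′(x)V(x) terms: they come from differentiating the blended potential, and they are what makes the force the exact gradient of a smooth energy.

```python
import numpy as np

def s_smooth(u):
    """C^1 blending weight: 0 for u <= 0, 1 for u >= 1, smooth between.
    A common generic choice, assumed here for illustration."""
    u = np.clip(u, 0.0, 1.0)
    return 3 * u**2 - 2 * u**3

def blended_force(x, V1, F1, V2, F2, x0, w):
    """Blend two (potential, force) descriptions over an overlap of
    width w centred on x0, following
    F = s1 F1 + s2 F2 - s1' V1 - s2' V2."""
    u = (x - x0) / w + 0.5
    s2 = s_smooth(u)
    s1 = 1.0 - s2
    uc = np.clip(u, 0.0, 1.0)
    ds2 = (6 * uc - 6 * uc**2) / w   # derivative of s_smooth
    ds1 = -ds2
    return s1 * F1 + s2 * F2 - ds1 * V1 - ds2 * V2
```

Dropping the −s′V terms would leave a force that is not a gradient, which is precisely the energy-conservation failure the regularization is designed to avoid.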
• Davoud Saffar Shamshirgar has analyzed the accuracy of regularization
• Currently: 2D FMM with full regularization
• Results for charge pairs with θ=0.35 (= second nearest neighbors), 8 atoms per cell

[Figure: potential error vs regularization width / cell size for expansion orders p=4 and p=6; the error improves by a factor 5.7]
[Figure: absolute RMS of multipole and local expansion coefficients (log10) vs expansion order P, for dumbbell and random charge configurations, without (bf=0) and with (bf=0.03) regularization]

[Figure: RMS monopole, dipole and quadrupole coefficients vs half regularization width (nm), SPC/E water, 0.473 nm cell size]

Real water shows the same decrease of coefficients.
Computational cost of regularization

• Half regularization width = 1/4 cell
• Second nearest neighbors computed directly
• Direct pair cost factor: (2.25/2)^3 ≈ 1.42
• At 0.473 nm cell size (10.6 atoms per cell), the extra direct work comes for free when combined with the Lennard-Jones cut-off
• Lowest multipole level: factor 1.5^3 ≈ 3.4
• But we get a factor 5 more accuracy with P=4
• And we get energy conservation!
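The two cost factors quoted above are just cubes of linear enlargement factors, since the work scales with the interaction volume:

```python
# Extending the direct range from 2 to 2.25 cells, and enlarging the
# lowest multipole level by 1.5x, both scale the work with the cube of
# the linear factor.
direct_factor = (2.25 / 2.0) ** 3   # ≈ 1.42
multipole_factor = 1.5 ** 3         # 3.375, quoted above as ≈ 3.4
```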
Conclusions
• PME 3D-FFT is limiting scaling
• FMM has better communication characteristics
• FMM is a good fit for SIMD/GPU acceleration
• FMM regularization gives energy conservation
• FMM regularization improves accuracy
• FMM fits well with GROMACS pair computation
• We are implementing an optimized, regularised FMM
Is there a better threading/tasking model than OpenMP on the horizon?
Acknowledgements
Erik Lindahl
Mark Abraham
Szilard Pall
Aleksei Iupinov
Rio Yokota, Tokyo Tech
John Eblen, ORNL
Roland Schulz, Intel
Mark Berger, NVIDIA
and the rest of the GROMACS team world-wide