LAMMPS: Code Transformations in Preparing for Titan - A Hybrid Multi-Core Architecture
Presenter: Arnold Tharrington
Prepared by Arnold Tharrington and William Brown
Scientific Computing Group, OLCF
2
Outline
1. Convey the ongoing code modifications needed to prepare LAMMPS for Titan
• Overview of Molecular Dynamics and LAMMPS
• Runtime profile of LAMMPS
• Accelerating LAMMPS: the nonbonded interactions and the electrostatic interactions
2. Summary
3
1. Science problem
4
Science Problem: Molecular Level Insight into Membrane Fusion from Molecular Dynamics
• Gain molecular-level understanding of membrane fusion mechanisms
• A major means for materials to enter and exit cells
• Role of lipids and membrane proteins
To date, the majority of simulation work on membranes has focused on model systems containing typically one or two lipid types. Biological membranes, however, are complex mixtures, and modeling these systems without finite-size effects requires simulations of significantly larger systems than are currently possible.
Computer image of membrane vesicles undergoing fusion.
5
Science Problem: Molecular Level Insight into Membrane Fusion from Molecular Dynamics (cont.)
Computer image of the 2 petaflop target science problem. The membrane bilayer is represented by a coarse-grain bead model.
Some requirements for realistic simulations:
o Proper solvation
o Elimination of modeling artifacts due to interactions between periodic images
o Capture of long-wavelength modes
o Proper treatment of coulombic (electrostatic) interactions
The smallest experimentally accessible vesicles to date have diameters of approximately 500 Å.
We need simulations of at least 750,000,000 coarse-grain particles to model experimentally produced vesicles.
Acceleration of the simulation is needed to reduce the time to solution!
6
Science Problem: Molecular Level Insight into Membrane Fusion from Molecular Dynamics (cont.)
Coarse-grain model of dipalmitoylphosphatidylcholine (DPPC) stacked bilayers.
2 PF problem: a charged system of 834,403 coarse-grained beads; system dimensions ~1000 Å x 1000 Å x 33 Å.
20 PF problem: a charged system of ~850,000,000 coarse-grained beads; system dimensions ~32,000 Å x 32,000 Å x 33 Å.
Computer image of the 2 PF science problem. The membrane bilayer is represented by coarse-grain beads.
7
2. LAMMPS and MD Overview
8
MD Simulations and Algorithm, F=ma
$$U = U(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N), \qquad \mathbf{F}_i = -\frac{\partial}{\partial \mathbf{r}_i} U(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N)$$

$$m_1 \frac{d^2\mathbf{r}_1}{dt^2} = \mathbf{F}_1 = -\nabla_1 U(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N)$$
$$m_2 \frac{d^2\mathbf{r}_2}{dt^2} = \mathbf{F}_2 = -\nabla_2 U(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N)$$
$$\vdots$$
$$m_N \frac{d^2\mathbf{r}_N}{dt^2} = \mathbf{F}_N = -\nabla_N U(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N)$$
Calculate the trajectory of a system of particles interacting with potential U.
mi is the mass of particle i ri is the position of particle i vi is the velocity of particle i
Solve with the Velocity Verlet algorithm; a minimal sketch follows.
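As a concrete illustration, here is a minimal C++ sketch of one velocity Verlet update (half-kick, drift, force recomputation, half-kick). The compute_forces callback is a hypothetical stand-in for the force pipeline, not LAMMPS's actual integrator code:

```cpp
#include <vector>

struct Particle { double x[3], v[3], f[3], m; };

// One velocity Verlet step of size dt. compute_forces is assumed to
// fill each particle's f from the current positions.
void velocity_verlet_step(std::vector<Particle>& atoms, double dt,
                          void (*compute_forces)(std::vector<Particle>&)) {
    for (auto& p : atoms)
        for (int d = 0; d < 3; ++d) {
            p.v[d] += 0.5 * dt * p.f[d] / p.m;  // v(t + dt/2)
            p.x[d] += dt * p.v[d];              // x(t + dt)
        }
    compute_forces(atoms);                      // f(t + dt)
    for (auto& p : atoms)
        for (int d = 0; d < 3; ++d)
            p.v[d] += 0.5 * dt * p.f[d] / p.m;  // v(t + dt)
}
```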
9
Force field details
[Figure: two water molecules with atoms labeled (1)-(6), bond lengths l1-l4, and bend angles θ1 and θ2. Charges: O = -q, H = +q/2.]

$$U_{system} = U_{bond} + U_{angle} + U_{electrostatic} + U_{nonbonded}$$

$$U_{bond} = \tfrac{1}{2} k_b^{OH}(l_1 - l_{OH})^2 + \tfrac{1}{2} k_b^{OH}(l_2 - l_{OH})^2 + \tfrac{1}{2} k_b^{OH}(l_3 - l_{OH})^2 + \tfrac{1}{2} k_b^{OH}(l_4 - l_{OH})^2$$

$$U_{angle} = \tfrac{1}{2} k_\Theta^{HOH}(\Theta_1 - \Theta_{HOH})^2 + \tfrac{1}{2} k_\Theta^{HOH}(\Theta_2 - \Theta_{HOH})^2$$

$$U_{electrostatic} = \sum_{i=1}^{N} \sum_{j<i} \frac{1}{4\pi\varepsilon_o} \frac{q_i q_j}{r_{i,j}}$$

$$U_{nonbonded} = \sum_{i=1}^{N} \sum_{\substack{j<i \\ r_{i,j} < r_{cutoff}}} \left( \frac{A_{i,j}}{r_{i,j}^{12}} - \frac{B_{i,j}}{r_{i,j}^{6}} \right)$$
10
Periodic Boundary Conditions
[Figure: a central simulation cell of water molecules surrounded by its periodic images in every direction.]
11
Spatial Decomposition
[Figure: the simulation cell divided into four spatial subdomains, labeled 0-3, each holding a block of water molecules.]

$$U_{nonbonded} = \sum_{i=1}^{N} \sum_{\substack{j<i \\ r_{i,j} < r_{cutoff}}} \left( \frac{A_{i,j}}{r_{i,j}^{12}} - \frac{B_{i,j}}{r_{i,j}^{6}} \right)$$
12
Ghost Atom Region
[Figure: subdomain 0 with its bordering ghost-atom region, replicated from the neighboring subdomains.]
13
Force Calculation Implementation Within LAMMPS
Within time step N: compute neighbor list → compute nonbonded forces → compute bond forces → compute angle forces → compute dihedral forces → compute improper forces → compute electrostatic forces → update coordinates. Time step N+1 repeats the previous steps. (A structural sketch of this pipeline follows.)
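The per-step pipeline above can be read as a fixed sequence of force kernels followed by an integration update. A minimal structural sketch in C++, with illustrative stub functions standing in for the kernels (in LAMMPS these are separate neighbor, pair, bond, and kspace style classes, not free functions):

```cpp
// Illustrative stand-ins for the kernels named in the diagram.
void build_neighbor_list() {}
void compute_nonbonded_forces() {}
void compute_bond_forces() {}
void compute_angle_forces() {}
void compute_dihedral_forces() {}
void compute_improper_forces() {}
void compute_electrostatic_forces() {}
void update_coordinates() {}

// One run: every step recomputes all force terms from the current
// coordinates, then integrates to produce the next coordinates.
void run(int nsteps) {
    for (int step = 0; step < nsteps; ++step) {
        build_neighbor_list();   // in practice rebuilt only periodically
        compute_nonbonded_forces();
        compute_bond_forces();
        compute_angle_forces();
        compute_dihedral_forces();
        compute_improper_forces();
        compute_electrostatic_forces();
        update_coordinates();
    }
}
```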
14
[Diagram: MPI tasks 0-3 each independently execute, per time step: compute neighbor list → compute electrostatic forces → compute nonbond/bond forces → update coordinates.]
15
• Approximately 80% of the total computational work is in computing the forces
The force terms:

$$\mathbf{F}_i = m_i \mathbf{a}_i = -\frac{\partial U_{system}}{\partial \mathbf{r}_i}$$

$$U_{system} = \sum_{bonds} k_b (b - b_o)^2 + \sum_{angles} k_\Theta (\Theta - \Theta_o)^2 + \sum_{dihedrals} k_\phi (\phi - \phi_o)^2 + \sum_{impropers} k_\omega (\omega - \omega_o)^2 + \sum_i \sum_{j<i} \varepsilon_{ij} \left[ \left( \frac{R_{min,ij}}{r_{ij}} \right)^{12} - \left( \frac{R_{min,ij}}{r_{ij}} \right)^{6} \right] + \sum_i \sum_{j<i} \frac{1}{4\pi\varepsilon_o} \frac{q_i q_j}{r_{ij}}$$

These two terms, the nonbonded and electrostatic interactions respectively, dominate the computational work.
16
DPPC Stacked Bilayer Profile: 2 PF Problem
[Chart: DPPC stacked bilayer run-time profile for 100 steps, cumulative time (0-3000 s) vs. job size, broken into Pair, Bond, Kspace, Neigh, Comm, Output, and Other time.]

The Kspace component is part of the electrostatic interactions:
$$U_{electrostatic} = \sum_i \sum_{j \neq i} \frac{1}{4\pi\varepsilon_o} \frac{q_i q_j}{r_{i,j}}$$

The pair component is the nonbonded interactions:
$$U_{nonbonded} = \sum_i \sum_{j \neq i} \varepsilon_{ij} \left[ \left( \frac{R_{min,ij}}{r_{ij}} \right)^{12} - \left( \frac{R_{min,ij}}{r_{ij}} \right)^{6} \right]$$

These two terms, nonbonded and electrostatic respectively, dominate the computational work.
17
DPPC Stacked Bilayer Profile: 2 PF Problem

[Chart: normalized cumulative run-time distribution (0%-100%) vs. job size for the DPPC stacked bilayer, without I/O, broken into Pair, Bond, Kspace, Neigh, Comm, and Other time.]

The Kspace component is part of the electrostatic interactions; the pair component is the nonbonded interactions. These two terms dominate the computational work.
18
DPPC Stacked Bilayer Profile: 2 PF Problem
[Chart: Kspace cumulative time for 100 steps (0-20 s) vs. job size, from 1 CPU to 1024 CPUs.]

The Kspace component is part of the electrostatic interactions:
$$U_{electrostatic} = \sum_i \sum_{j \neq i} \frac{1}{4\pi\varepsilon_o} \frac{q_i q_j}{r_{i,j}}$$
19
[Diagram: MPI tasks 0-3 each execute, per time step: compute neighbor list → compute electrostatic forces → compute non-bonded forces → update coordinates.]
20
3. Accelerating LAMMPS
Need to accelerate the nonbonded forces and the Kspace component of the electrostatic forces.

The Kspace component is part of the electrostatic interactions:
$$U_{electrostatic} = \sum_i \sum_{j \neq i} \frac{1}{4\pi\varepsilon_o} \frac{q_i q_j}{r_{i,j}}$$

The pair component is the nonbonded interactions:
$$U_{nonbonded} = \sum_i \sum_{j \neq i} \varepsilon_{ij} \left[ \left( \frac{R_{min,ij}}{r_{ij}} \right)^{12} - \left( \frac{R_{min,ij}}{r_{ij}} \right)^{6} \right]$$
21
Accelerating the Nonbonded Interactions within LAMMPS
The pair component is the nonbonded interactions:
$$U_{nonbonded} = \sum_i \sum_{j \neq i} \varepsilon_{ij} \left[ \left( \frac{R_{min,ij}}{r_{ij}} \right)^{12} - \left( \frac{R_{min,ij}}{r_{ij}} \right)^{6} \right]$$

[Chart: DPPC stacked bilayer normalized cumulative run-time distribution (0%-100%) vs. job size, without I/O, broken into Pair, Bond, Kspace, Neigh, Comm, and Other time.]
22
Calculation of Nonbonded Interactions (Current CPU Only Implementation)
Each spatial region is assigned to an MPI task.

Done on the host CPU:

do i = 1, N
  do j = 1, M
    compute energy of particles i, j
    compute force on i due to j
  end do
end do

(A self-contained C++ version follows.)
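To make the CPU-only double loop concrete, here is a minimal C++ version for the Lennard-Jones 12-6 form U = A/r^12 - B/r^6 used earlier. It is an illustrative O(N^2) scan with a cutoff, not LAMMPS's pair-style code, which walks a precomputed neighbor list instead:

```cpp
#include <cstddef>
#include <vector>

struct Atom { double x, y, z, fx, fy, fz; };

// Accumulate pair forces for U = A/r^12 - B/r^6 within a cutoff.
void lj_forces(std::vector<Atom>& a, double A, double B, double rcut) {
    const double rcut2 = rcut * rcut;
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = i + 1; j < a.size(); ++j) {
            double dx = a[i].x - a[j].x, dy = a[i].y - a[j].y,
                   dz = a[i].z - a[j].z;
            double r2 = dx * dx + dy * dy + dz * dz;
            if (r2 >= rcut2) continue;
            double inv2 = 1.0 / r2, inv6 = inv2 * inv2 * inv2;
            // f = -dU/dr along r_ij, written per unit displacement vector
            double fmag = (12.0 * A * inv6 * inv6 - 6.0 * B * inv6) * inv2;
            a[i].fx += fmag * dx; a[i].fy += fmag * dy; a[i].fz += fmag * dz;
            a[j].fx -= fmag * dx; a[j].fy -= fmag * dy; a[j].fz -= fmag * dz;
        }
}
```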
23
Calculation of Non-bonded Interactions Accelerated Implementation (Planned Work)
• The spatial region maps to a node.
• The spatial region hosts an MPI task.
• Particles in the region will be evolved by the CPUs and GPUs attached to this node.
• A fraction X of the N particles will be assigned to the CPUs; the remaining fraction (1 - X) will be assigned to the accelerator (see the sketch after the loops).

Asynchronously done on the GPUs:

do i = N_cpu + 1, N
  do j = 1, Neighbors_i
    compute interaction of particles i, j
  end do
end do

Done on the host CPUs:

do i = 1, N_cpu
  do j = 1, Neighbors_i
    compute interaction of particles i, j
  end do
end do
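A hedged sketch of the planned split, using std::async to stand in for an asynchronous kernel launch; gpu_compute_range() and cpu_compute_range() are hypothetical placeholders, not LAMMPS or CUDA APIs:

```cpp
#include <cstddef>
#include <future>

// Hypothetical placeholders for the host and device pair loops.
void cpu_compute_range(std::size_t lo, std::size_t hi) { (void)lo; (void)hi; }
void gpu_compute_range(std::size_t lo, std::size_t hi) { (void)lo; (void)hi; }

// Particles [0, n_cpu) stay on the host; [n_cpu, N) go to the
// accelerator, overlapping with the host work.
void compute_forces_split(std::size_t N, double X) {
    std::size_t n_cpu = static_cast<std::size_t>(X * N);
    auto gpu_done = std::async(std::launch::async,
                               gpu_compute_range, n_cpu, N);
    cpu_compute_range(0, n_cpu);  // host fraction X runs concurrently
    gpu_done.wait();              // synchronize before combining results
}
```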
24
Load Balancing of Nonbonded Interactions
Lennard-Jones kernel (864K particles), cutoff = 2.5 × radius (~78 neighbors): low arithmetic intensity.
Gay-Berne kernel (125K particles), cutoff = 7 × short radius: high arithmetic intensity.
• Multiple cores can be assigned to each GPU.
• Dynamic load balancing assigns particles to the CPU and GPU for asynchronous force calculation (a simple rebalancing rule is sketched below).
– This is not the only benefit: splitting per-GPU data into smaller subdomains improves memory latencies and spreads "other" time (integration, etc.) across more cores.
Dual Hex-Core AMD Istanbul 2.6 GHz, Dual Fermi C2050, Infiniband
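One plausible reading of the dynamic load balancing is a proportional update of the particle fraction X from the measured per-side times; the damping factor below is an assumption for illustration, not a LAMMPS setting:

```cpp
// Shift X toward the split that equalizes CPU and GPU times, based on
// measured throughputs from the last step.
double rebalance(double X, double t_cpu, double t_gpu) {
    double rate_cpu = X / t_cpu;          // fraction of work per second, CPU
    double rate_gpu = (1.0 - X) / t_gpu;  // fraction of work per second, GPU
    double X_ideal = rate_cpu / (rate_cpu + rate_gpu);  // equal-time split
    return X + 0.5 * (X_ideal - X);       // damped update toward the ideal
}
```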
25
Acceleration of the Electrostatic Interactions by Means of Multilevel Summation Method (MSM)
The Kspace component is part of the electrostatic interactions:
$$U_{electrostatic} = \sum_i \sum_{j \neq i} \frac{1}{4\pi\varepsilon_o} \frac{q_i q_j}{r_{i,j}}$$

[Charts: the DPPC stacked bilayer normalized run-time profile without I/O (Pair, Bond, Kspace, Neigh, Comm, Other vs. job size), and the Kspace cumulative time for 100 steps (0-20 s).]
26
Multilevel Summation Method (MSM) vs. Particle Mesh Ewald (PME)
• LAMMPS currently uses the PME algorithm to solve the electrostatic interactions.
• MSM is more scalable than PME:
– MSM is a grid-based method whose communication pattern is nearest-neighbor, while PME involves two ALL-TO-ALLs over reciprocal lattice points to compute two 3D FFTs (Hardy, 2009).
– MSM scales as O(N), whereas PME scales as O(N ln N).
– For typical systems, the crossover between MSM and PME is approximately 100,000 atoms (a back-of-the-envelope reading follows below).
MSM is essential for the scalability of large problems like the ones proposed for OLCF3
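A back-of-the-envelope reading of the crossover claim, assuming simple cost models $c_M N$ for MSM and $c_P N \ln N$ for PME (the constants are illustrative, not from the slides):

$$c_M N^\ast = c_P N^\ast \ln N^\ast \;\Rightarrow\; \frac{c_M}{c_P} = \ln N^\ast \approx \ln(10^5) \approx 11.5$$

So a crossover near 100,000 atoms corresponds to MSM carrying roughly an order of magnitude larger per-atom constant than PME; above $N^\ast$, the $N \ln N$ growth (and the ALL-TO-ALL FFT communication) makes PME the more expensive method.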
27
Multilevel Summation Method Overview

Split the bare Coulomb kernel into an MSM short-range part plus a slowly varying long-range part:

$$\frac{1}{|\mathbf{r}_i - \mathbf{r}_j|} = \underbrace{\left\{ \frac{1}{|\mathbf{r}_i - \mathbf{r}_j|} - g_a(\mathbf{r}_i, \mathbf{r}_j) \right\}}_{\text{MSM short-range part}} + \underbrace{g_a(\mathbf{r}_i, \mathbf{r}_j)}_{\text{long-range part}}$$

1. Separation of the electrostatic interaction potential into an MSM short-range part plus an MSM long-range, slowly varying part.
2. Approximation of the slowly varying MSM long-range part on a grid.
3. Recursive application of the first two ideas on a hierarchy of coarser grids (Hardy, 2006, p. 12).
28
MSM Algorithmic Steps
Anterpolation: $q^0 = \varphi^0 q$
Restrictions: $q^1 = \varphi^1 q^0$, …, $q^{l-1} = \varphi^{l-1} q^{l-2}$, $q^l = \varphi^l q^{l-1}$
MSM short-range part: $e_{short} = G_{Short}\, q$
Interpolation: $e = G q \cong G_{Short}\, q + \varphi\, e^0$
Level potentials (grid lattice spacing and lattice cutoff double at each level):
$e^0 = G^0 q^0 \cong G^0_{Short}\, q^0 + \varphi^1 e^1$  (spacing $h$, cutoff $a$)
$e^1 = G^1 q^1 \cong G^1_{Short}\, q^1 + \varphi^2 e^2$  (spacing $2h$, cutoff $2a$)
…
$e^{l-1} = G^{l-1} q^{l-1} \cong G^{l-1}_{Short}\, q^{l-1} + \varphi^l e^l$  (spacing $2^{l-1}h$, cutoff $2^{l-1}a$)
$e^l = G^l q^l$  (top level; spacing $2^l h$, cutoff $2^l a$)
(A driver skeleton for this traversal follows.)
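The level traversal can be summarized as a small driver. The following C++ skeleton is a structural sketch only, with a placeholder grid type and trivial stub kernels standing in for the $\varphi$ and $G_{Short}$ operators; it is not the LAMMPS MSM implementation:

```cpp
#include <vector>
using Grid = std::vector<double>;  // charges/potentials on one level

// Trivial stubs standing in for the MSM operators.
Grid anterpolate(const Grid& q) { return q; }                 // particles -> level 0
Grid restrict_grid(const Grid& f) { return Grid(f.size() / 8); }  // level n -> n+1
Grid direct_sum(const Grid& q) { return Grid(q.size()); }     // G_short on one level
void add_prolonged(Grid& e, const Grid& ec) { (void)e; (void)ec; } // phi^T e^{n+1}
void interpolate(const Grid& e0, Grid& e) { e = e0; }         // level 0 -> particles

// Anterpolate, restrict down, run the direct part on every level,
// prolong back up, interpolate to the particles.
void msm_long_range(const Grid& q, Grid& e, int levels) {
    std::vector<Grid> qg(levels), eg(levels);
    qg[0] = anterpolate(q);
    for (int n = 1; n < levels; ++n) qg[n] = restrict_grid(qg[n - 1]);
    for (int n = 0; n < levels; ++n) eg[n] = direct_sum(qg[n]);
    for (int n = levels - 2; n >= 0; --n) add_prolonged(eg[n], eg[n + 1]);
    interpolate(eg[0], e);
}
```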
29
MSM Direct
for all grid indices $(i_c, j_c, k_c) \in \Omega^n$:
  $e^n_{short}(i_c, j_c, k_c) = 0$
  for $k = -\lfloor 2a/h_z \rfloor, \ldots, \lfloor 2a/h_z \rfloor$:
    for $j = -\lfloor 2a/h_y \rfloor, \ldots, \lfloor 2a/h_y \rfloor$:
      for $i = -\lfloor 2a/h_x \rfloor, \ldots, \lfloor 2a/h_x \rfloor$:
        $e^n_{short}(i_c, j_c, k_c) \mathrel{+}= \big[ g_{2^n a}(h_n(i, j, k)) - g_{2^{n+1} a}(h_n(i, j, k)) \big]\, q^n(i_c + i,\ j_c + j,\ k_c + k)$

(A C++ sketch of this stencil sweep follows.)
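Read as code, the loop nest is a stencil sweep per level. A self-contained C++ sketch, with a flattened periodic grid and a caller-supplied g_diff standing in for the softened-kernel difference $g_{2^n a} - g_{2^{n+1} a}$ (both names are illustrative, not LAMMPS internals):

```cpp
#include <vector>

struct Level {
    int nx, ny, nz;            // grid dimensions at this level
    std::vector<double> q, e;  // charge in, potential out (size nx*ny*nz)
    double at(int i, int j, int k) const {  // periodic charge lookup
        auto w = [](int i, int n) { return ((i % n) + n) % n; };
        return q[(w(k, nz) * ny + w(j, ny)) * nx + w(i, nx)];
    }
};

// For each grid point, accumulate the kernel difference times nearby
// charges within the lattice cutoff (here 'cut' grid cells per axis).
void direct_part(Level& L, int cut, double (*g_diff)(int, int, int)) {
    for (int kc = 0; kc < L.nz; ++kc)
        for (int jc = 0; jc < L.ny; ++jc)
            for (int ic = 0; ic < L.nx; ++ic) {
                double sum = 0.0;
                for (int k = -cut; k <= cut; ++k)
                    for (int j = -cut; j <= cut; ++j)
                        for (int i = -cut; i <= cut; ++i)
                            sum += g_diff(i, j, k) * L.at(ic + i, jc + j, kc + k);
                L.e[(kc * L.ny + jc) * L.nx + ic] = sum;
            }
}
```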
30
[Diagram: MPI tasks 0-3, each with an attached accelerator (acc), execute per time step: compute neighbor list → MSM → compute non-bonded forces → update coordinates.]
31
• Need non-trivial knowledge of what the code is doing
– Algorithmic details
• Use a debugger to step through the code
– Problem/data decomposition
– Interprocessor data communication patterns
• Profile the code to discover bottlenecks
– VampirTrace, CrayPat, homemade timers, etc.
• "There is no free lunch" for porting to GPUs
– (Except for those who registered for this workshop)
– There is a significant communication cost with the accelerator; some algorithms may not be computationally intensive enough to be worth this cost.
32
Summary
• We have identified the major challenges and have a technologically reasonable plan for accelerating LAMMPS
– Nonbonded interactions
• Capturing various levels of parallelism on the CPU and GPU
• Load balancing between the CPU and GPU
– Electrostatic interactions
• Developing the MSM algorithm
– Capturing various levels of parallelism on the CPU and GPU
– Load balancing between the CPU and GPU
– Regardless of the accelerator, we are forced to restructure some parts of LAMMPS to effectively capture the inter- and intra-node parallelism
33
Acknowledgements
• Bobby Phillips and Richard Mills – Multi-grid discussions
• Jim Schwarzmeier and Sarah Anderson – Algorithmic discussions, benchmarking, ….
• Scott Hampton and Pratul Agarwal – CUDA and algorithmic discussions
34
References
• Hardy, David. "Multilevel Summation for the Fast Evaluation of Forces for the Simulation of Biomolecules." Ph.D. diss., University of Illinois at Urbana-Champaign, 2006.
• Hardy, David, John Stone, and Klaus Schulten. "Multilevel Summation of Electrostatic Potentials Using Graphics Processing Units." Parallel Computing 35 (2009): 164-177.
35
• Team Members and Expertise
– Oak Ridge National Laboratory
• Arnold Tharrington - computational biophysicist, primary programmer for code modifications
• Michael Brown - computational biophysicist, primary programmer for code modifications
• Bobby Philips - mathematician and multigrid expert
– Sandia National Laboratories
• Steve Plimpton - computational scientist, originator of LAMMPS
• Paul Crozier - computational scientist, LAMMPS developer
– NVIDIA
• Peng Wang - computational scientist and CUDA expert
– Cray Inc.
• James Schwarzmeier - computational scientist; benchmarking, code profiling and optimization expert, threading, etc.
• Sarah Anderson - computational scientist; benchmarking, code profiling and optimization expert, threading, etc.
– Institute for Computational Molecular Science at Temple University (Director: Michael Klein)
• Axel Kohlmeyer - computational biophysicist, science lead for the 20 PF target problem
36
Supplementary Slides
37
Energy and Forces
$$U(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N) = \tfrac{1}{2}\, q^T G q \cong \tfrac{1}{2}\, q^T \big( G_{Short}\, q + \varphi^T e^0 \big)$$

$$\mathbf{F}_\alpha = -\frac{\partial U}{\partial \mathbf{r}_\alpha} \cong -\frac{\partial}{\partial \mathbf{r}_\alpha} \left[ \tfrac{1}{2}\, q^T \big( G_{Short}\, q + \varphi^T e^0 \big) \right]$$

where $e^0 = G^0 q^0 \cong G^0_{Short}\, q^0 + \varphi^1 e^1$, …, $e^{l-1} = G^{l-1} q^{l-1} \cong G^{l-1}_{Short}\, q^{l-1} + \varphi^l e^l$.
38
Matrix Vector Decomposition
$q^0 = \varphi^0 q$;  $e = G q \cong G_{Short}\, q + (\varphi^0)^T e^0$
$q^1 = \varphi^1 q^0$;  $e^0 = G^0 q^0 \cong G^0_{Short}\, q^0 + (\varphi^1)^T e^1$
$e^1 = G^1 q^1 \cong G^1_{Short}\, q^1 + (\varphi^2)^T e^2$
…
$q^l = \varphi^l q^{l-1}$;  $e^{l-1} = G^{l-1} q^{l-1} \cong G^{l-1}_{Short}\, q^{l-1} + (\varphi^l)^T e^l$
$e^l = G^l q^l$

In general, $q^{n+1} = \varphi^{n+1} q^n$ and $e^n = G^n q^n \cong G^n_{Short}\, q^n + (\varphi^{n+1})^T e^{n+1}$.
The top-level product $G^l q^l$ is zero for charge-neutral systems.
39
Calculation of MSM Short Range Part
Asynchronously done on the GPUs:

do i = N_cpu + 1, N
  do j = 1, Neighbors_i
    compute interaction of particles i, j
  end do
end do

Done on the host CPUs:

do i = 1, N_cpu
  do j = 1, Neighbors_i
    compute interaction of particles i, j
  end do
end do

• A fraction X of the N particles will be assigned to the CPUs; the remaining fraction (1 - X) will be assigned to the accelerator.
• The spatial region maps to a node, hosts an MPI task, and its particles are evolved by the CPUs and GPUs attached to that node.
40
MSM Grid Computations
K_n total grid points for level n. A fraction X of the grid points will be assigned to the CPUs; the remaining fraction (1 - X) will be assigned to the accelerator.

Done on the host CPUs:
do i over the CPU grid points
  compute on grid point i
end do

Done on the GPUs:
do i over the accelerator grid points
  compute on grid point i
end do
41
Algorithmic Steps
[Diagram: the MSM V-cycle. From the particle level, anterpolation maps the charges to the finest grid; restrictions map them to successively coarser grids; a direct part is computed at every grid level, including the top level; prolongations carry the potentials back down; interpolation returns them to the particle level, where the short-range part is also computed.]
42
Operational Counts
• Short-range part ~ (a/h)³ N
• Anterpolation ~ (p³ + 7p² + 9p + 6) N
• Direct part ~ (a/h)³ K_n
• Interpolation ~ (5p³ + 24p² + 27p + 11) N + K₀
• Restriction and prolongation ~ (2p + 2)(1 − 1/8^l) K₀
(Worked numbers follow.)

Hardy, D. 2006. Multilevel Summation for the Fast Evaluation of Forces for the Simulation of Biomolecules. Ph.D. thesis, University of Illinois at Urbana-Champaign.
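For concreteness, evaluating these counts for the cubic basis ($p = 3$) and a lattice-cutoff ratio $a/h = 4$ (illustrative values, not figures from the slides):

$$\text{short range} \approx (a/h)^3 N = 64N, \qquad \text{anterpolation} \approx (27 + 63 + 27 + 6)N = 123N,$$
$$\text{interpolation} \approx (135 + 216 + 81 + 11)N + K_0 = 443N + K_0 .$$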
43
Splitting Operation
$$\frac{1}{|\mathbf{r}_i - \mathbf{r}_j|} = \underbrace{\left\{ \frac{1}{|\mathbf{r}_i - \mathbf{r}_j|} - g_a(\mathbf{r}_i, \mathbf{r}_j) \right\}}_{\text{short range}} + \underbrace{g_a(\mathbf{r}_i, \mathbf{r}_j)}_{\text{long range}}$$
44
Splitting Operation (cont.)
$$\left( G_{Short} \right)_{ij} = \begin{cases} \dfrac{1}{|\mathbf{r}_i - \mathbf{r}_j|} - g_a(\mathbf{r}_i, \mathbf{r}_j), & i \neq j \text{ and } r_{ij} < a \\ 0, & \text{otherwise} \end{cases}$$

$$\left( G_{Long} \right)_{ij} = \begin{cases} g_a(\mathbf{r}_i, \mathbf{r}_j), & i \neq j \\ 0, & i = j \end{cases}$$

where
$$g_{2^n a}(\mathbf{r}_i, \mathbf{r}_j) \equiv \frac{1}{2^n a}\, \gamma\!\left( \frac{|\mathbf{r}_i - \mathbf{r}_j|}{2^n a} \right), \qquad \gamma(\rho) = \frac{1}{\rho} \text{ for } \rho \geq 1 .$$
45
Splitting Operation (cont.)
$$\frac{1}{|\mathbf{r}_i - \mathbf{r}_j|} = \left\{ \frac{1}{|\mathbf{r}_i - \mathbf{r}_j|} - g_a(\mathbf{r}_i, \mathbf{r}_j) \right\} + g_a(\mathbf{r}_i, \mathbf{r}_j)$$

Convolve the charges to a grid of length $a$ and split the long-range part:
$$g_a(\mathbf{r}_i, \mathbf{r}_j) = \left\{ g_a(\mathbf{r}_i, \mathbf{r}_j) - g_{2a}(\mathbf{r}_i, \mathbf{r}_j) \right\} + g_{2a}(\mathbf{r}_i, \mathbf{r}_j)$$

Convolve the charges to a grid of length $2a$ and split the long-range part:
$$g_{2a}(\mathbf{r}_i, \mathbf{r}_j) = \left\{ g_{2a}(\mathbf{r}_i, \mathbf{r}_j) - g_{4a}(\mathbf{r}_i, \mathbf{r}_j) \right\} + g_{4a}(\mathbf{r}_i, \mathbf{r}_j)$$
46
MPI Internode Communication
[Figure: two adjacent MPI task domains (task 0 and task 1); the cutoff sphere of radius r around an atom extends across the domain boundary.]
47
MPI Internode Communication (cont.)
[Diagram: during time step N, MPI tasks 0 and 1 each build a neighbor list, compute forces, then integrate and update the atom coordinates, exchanging boundary data with each other.]
• Many small messages between neighbors
• Typical MPI message packet < 100 Mbytes
48
MSM Grid Structure
Level 0: lattice spacing h, 16×16 grid
Level 1: lattice spacing 2h, 8×8 grid
Level 2: lattice spacing 4h, 4×4 grid
49
Anterpolation – Convolution of Charges
[Figure: particle-level charges are convolved onto the level-0 grid; the basis functions provide local support about each grid point.]
50
Anterpolation (cont.)
Basis functions provide local support about each grid point.
[Figure: the level-0 grid divided among MPI tasks 0-3.]
51
Nodal Basis Functions
[Plot: the nodal basis function $\varphi_m(\xi)$ as a function of $\xi$.]
52
Anterpolation Communication: Nearest Neighbor
[Figure: MPI tasks 0-3 exchange grid-point contributions only with their nearest neighbors.]
53
Restriction of Grid
[Animation: charges are restricted from the level-0 grid to the level-1 grid, then from level 1 to level 2; each grid is divided among MPI tasks 0-3.]
54
MSM Short Range
$$U_{Short} = q_i q_j \left[ \frac{1}{r_{ij}} - \frac{1}{a}\, \gamma\!\left( \frac{r_{ij}}{a} \right) \right]$$

$$\mathbf{f}_{ij} = -q_i q_j \left[ -\frac{1}{r_{ij}^2} - \frac{1}{a^2}\, \gamma'\!\left( \frac{r_{ij}}{a} \right) \right] \hat{\mathbf{r}}_{ij}$$

MSM short-range cutoff of distance a.
[Figure: the domain divided among MPI tasks 0-3; the short-range interaction sphere of radius a around an atom.]
55
MSM Direct
[Animation: the direct calculation for MPI task 0 at grid point $(i_c, j_c, k_c)$; the grid is divided among MPI tasks 0-3.]

$$e^n_{short}(i_c, j_c, k_c) \mathrel{+}= \big[ g_{2^n a}(h_n(i, j, k)) - g_{2^{n+1} a}(h_n(i, j, k)) \big]\, q^n(i_c + i,\ j_c + j,\ k_c + k)$$
56
MSM Prolongation
[Animation: the prolongation calculation for MPI task 0. The coarse-grid potential $e^{n+1}$ on $\Omega^{n+1}$ is carried back to $e^n$ on the finer grid $\Omega^n$; both grids are divided among MPI tasks 0-3.]

$$q^{n+1} = \varphi^{n+1} q^n, \qquad e^n = G^n q^n \cong G^n_{Short}\, q^n + (\varphi^{n+1})^T e^{n+1}$$
57
Initial Results of NAMD Lite with MSM
(GPU: NVIDIA GTX-285, using CUDA 3.0; CPU: 2.4 GHz Intel Core 2 Q6600 quad core)
Box of 21,950 flexible waters, 12 Å cutoff, 1 ps:

                     CPU only                     With GPU                  Speedup vs. NAMD/CPU
NAMD with PME        1199.8 s                     210.5 s                   5.7x
NAMD-Lite with MSM   5183.3 s                     176.6 s                   6.8x
                     (4598.6 short, 572.2 long)   (93.9 short, 63.1 long)   (19% over NAMD/GPU)

Hardy, D. 2010. "Multiscale Molecular Modeling," Edinburgh, UK, June 30 - July 3, 2010.
58
Data Transfer Between Host and Device
Number of MPI tasks: 18,000. Unit cell dimensions: 32,000 Å x 32,000 Å x 33 Å. For the finest grid only:

Maximum number   X grid        Y grid        Z grid        Total number     Grid points
of levels        length        length        length        of grid points   per MPI task
8                3.90625       3.90625       0.12890625    1.68E+07         9.32E+02
9                1.953125      1.953125      0.064453125   1.34E+08         7.46E+03
10               0.9765625     0.9765625     0.032226563   1.07E+09         5.97E+04
11               0.48828125    0.48828125    0.016113281   8.59E+09         4.77E+05
12               0.244140625   0.244140625   0.008056641   6.87E+10         3.82E+06
59
Data Transfer Between Host and Device (cont.)

Bytes per MPI task (data in / data out) for the finest four levels:

Maximum number   Level 0               Level 1               Level 2               Level 3
of levels        in        out         in        out         in        out         in        out
8                29.8E+3   7.5E+3      3.7E+3    932.1E+0    466.0E+0  116.5E+0    58.3E+0   14.6E+0
9                238.6E+3  59.7E+3     29.8E+3   7.5E+3      3.7E+3    932.1E+0    466.0E+0  116.5E+0
10               1.9E+6    477.2E+3    238.6E+3  59.7E+3     29.8E+3   7.5E+3      3.7E+3    932.1E+0
11               15.3E+6   3.8E+6      1.9E+6    477.2E+3    238.6E+3  59.7E+3     29.8E+3   7.5E+3
12               122.2E+6  30.5E+6     15.3E+6   3.8E+6      1.9E+6    477.2E+3    238.6E+3  59.7E+3
60
MSM Short Range
[The matrix-vector recursion diagram from the "Matrix Vector Decomposition" slide (page 38) is repeated here.]
61
MSM Short Range
$$\mathbf{r}_{ij} = [x_i - x_j,\ y_i - y_j,\ z_i - z_j]$$
$$r_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}$$

if $r_{ij} < a$ then
$$U_{Short} \mathrel{+}= q_i q_j \left[ \frac{1}{r_{ij}} - \frac{1}{a}\, \gamma\!\left( \frac{r_{ij}}{a} \right) \right]$$
$$\mathbf{f}_{ij} \mathrel{-}= q_i q_j \left[ -\frac{1}{r_{ij}^2} - \frac{1}{a^2}\, \gamma'\!\left( \frac{r_{ij}}{a} \right) \right] \hat{\mathbf{r}}_{ij}$$

(A C++ sketch of this kernel follows.)
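A hedged C++ sketch of this short-range pair kernel. The softening $\gamma(\rho)$ below is the two-term even-powered Taylor softening from Hardy's construction; the slides do not pin down which $\gamma$ the LAMMPS MSM uses, so treat that choice as an assumption:

```cpp
// gamma(rho) = 1/rho for rho >= 1; smooth even polynomial inside.
double gamma_soft(double rho) {
    if (rho >= 1.0) return 1.0 / rho;
    double s = rho * rho;  // even powers only, so gamma is smooth at 0
    return 15.0 / 8.0 - 5.0 / 4.0 * s + 3.0 / 8.0 * s * s;
}

// MSM short-range pair energy; vanishes at the cutoff a by design,
// since gamma(rho) = 1/rho there.
double u_short_pair(double qi, double qj, double r, double a) {
    if (r >= a) return 0.0;
    return qi * qj * (1.0 / r - (1.0 / a) * gamma_soft(r / a));
}
```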
62
MSM Direct
[The matrix-vector recursion diagram from the "Matrix Vector Decomposition" slide (page 38) is repeated here.]
63
MSM Direct
[The MSM Direct loop nest from the "MSM Direct" slide (page 29) is repeated here.]
64
MSM Anterpolation
[The matrix-vector recursion diagram from the "Matrix Vector Decomposition" slide (page 38) is repeated here.]
65
Anterpolation
for n = 1, …, N (loop over particles):
  find the lowest grid index of the surrounding stencil:
    $i = \lfloor (x_n - x_0)/h \rfloor - (p - 1)/2$
    $j = \lfloor (y_n - y_0)/h \rfloor - (p - 1)/2$
    $k = \lfloor (z_n - z_0)/h \rfloor - (p - 1)/2$
  for $\nu = 0, \ldots, p$ (stencil weights in each dimension):
    $\lambda_x[\nu] = \Phi\!\left( \dfrac{x_n - x[i + \nu]}{h} \right)$, $\lambda_y[\nu] = \Phi\!\left( \dfrac{y_n - y[j + \nu]}{h} \right)$, $\lambda_z[\nu] = \Phi\!\left( \dfrac{z_n - z[k + \nu]}{h} \right)$
  for $\nu_z = 0, \ldots, p$; $\nu_y = 0, \ldots, p$; $\nu_x = 0, \ldots, p$:
    $q^0[i + \nu_x,\ j + \nu_y,\ k + \nu_z] \mathrel{+}= \lambda_x[\nu_x]\, \lambda_y[\nu_y]\, \lambda_z[\nu_z]\, q_n$

Nodal basis function of degree p = 3:

$$\Phi_{cubic}(\xi) = \begin{cases} (1 - |\xi|)\left( 1 + |\xi| - \tfrac{3}{2}\xi^2 \right), & |\xi| \leq 1 \\ -\tfrac{1}{2}(|\xi| - 1)(2 - |\xi|)^2, & 1 \leq |\xi| \leq 2 \\ 0, & \text{otherwise} \end{cases}$$

(A code version of $\Phi_{cubic}$ follows.)
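The degree-3 nodal basis function translates directly into code; a small C++ version of $\Phi_{cubic}$ as written above:

```cpp
#include <cmath>

// Cubic nodal basis function (p = 3): interpolates 1 at xi = 0,
// 0 at the other integers, with support |xi| <= 2.
double phi_cubic(double xi) {
    double t = std::fabs(xi);
    if (t <= 1.0) return (1.0 - t) * (1.0 + t - 1.5 * t * t);
    if (t <= 2.0) return -0.5 * (t - 1.0) * (2.0 - t) * (2.0 - t);
    return 0.0;
}
```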
66
Anterpolation
[Figure: at the particle level (k = 0), with grid spacing h, a particle's charge increment dQ is accumulated onto the surrounding stencil of grid nodes about node i: for i = -3,3; j = -3,3; k = -3,3: Q(m) = Q(m) + dQ(ijk).]
67
Restriction
[The matrix-vector recursion diagram from the "Matrix Vector Decomposition" slide (page 38) is repeated here.]
68
MSM Restriction
Given grid $\Omega^n$ with charge $q^n$: the restriction to the coarser grid $\Omega^{n+1}$ is separable and is applied one dimension at a time (z, then y, then x). Each sweep accumulates stencil-weighted charges from the finer grid:

for all coarse indices along the sweep dimension (the other indices held fixed):
  for $\nu = -p, \ldots, p$:
    $q^{n+1}[\,\cdot\,] \mathrel{+}= \varphi[\nu]\, q^n[\,2 \cdot (\text{coarse index}) + \nu\,]$

with the stencil weights taken from the nodal basis function,
$$\varphi[\nu] = \Phi\!\left( \frac{\nu}{2} \right), \qquad \nu = -p, \ldots, p .$$
69
MSM Prolongation
[The matrix-vector recursion diagram from the "Matrix Vector Decomposition" slide (page 38) is repeated here.]
70
Prolongation
Given grid $\Omega^{n+1}$ with potential $e^{n+1}$: prolongation is the transpose of restriction. Reset $E^n$ to 0; then, sweeping one dimension at a time (x, then y, then z), each coarse-grid value is spread onto the finer grid with the same stencil weights:

for all coarse indices along the sweep dimension (the other indices held fixed):
  for $\nu = -p, \ldots, p$:
    $E^n[\,2 \cdot (\text{coarse index}) + \nu\,] \mathrel{+}= \varphi[\nu]\, e^{n+1}[\,\cdot\,]$

with
$$\varphi[\nu] = \Phi\!\left( \frac{\nu}{2} \right), \qquad \nu = -p, \ldots, p .$$