LAMMPS: Code Transformations in Preparing for Titan - A Hybrid Multi-Core Architecture
Presenter: Arnold Tharrington
Prepared by Arnold Tharrington and William Brown
Scientific Computing Group, OLCF
2
Outline
1. Convey the ongoing code modifications needed to prepare LAMMPS for Titan
• Overview of Molecular Dynamics and LAMMPS
• Runtime profile of LAMMPS
• Accelerating LAMMPS: the nonbonded interactions and the electrostatic interactions
2. Summary
3
1. Science problem
4
Science Problem: Molecular Level Insight into Membrane Fusion from Molecular Dynamics
• Gain molecular-level understanding of membrane fusion mechanisms
• A major means for materials to enter and exit cells
• Role of lipids and membrane proteins
To date, the majority of simulation work on membranes has focused on model systems containing typically one or two lipid types. Biological membranes, however, are complex mixtures, and modeling these systems without finite-size effects requires simulations of significantly larger systems than are currently possible.
Computer image of membrane vesicles undergoing fusion.
5
Science Problem: Molecular Level Insight into Membrane Fusion from Molecular Dynamics (cont.)
Computer image of the 2 petaflop target science problem. The membrane bilayer is represented by a coarse-grain bead model.
Some requirements for realistic simulations:
o Proper solvation
o Elimination of modeling artifacts due to interactions between periodic images
o Capture of long-wavelength modes
o Proper treatment of coulombic (electrostatic) interactions
The smallest experimentally accessible vesicles to date have diameters of approximately 500 Å.
We need simulations of at least 750,000,000 coarse-grain particles to model experimentally produced vesicles.
Acceleration of the simulation is needed to reduce the time to solution!
6
Science Problem: Molecular Level Insight into Membrane Fusion from Molecular Dynamics (cont.)
Coarse-grain model of dipalmitoylphosphatidylcholine (DPPC) stacked bilayers.
2 PF problem: a charged system of 834,403 coarse-grained beads; system dimensions ~1000 Å x 1000 Å x 33 Å.
20 PF problem: a charged system of ~850,000,000 coarse-grained beads; system dimensions ~32,000 Å x 32,000 Å x 33 Å.
Computer image of the 2 PF science problem. The membrane bilayer is represented by coarse-grain beads.
7
2. LAMMPS and MD Overview
8
MD Simulations and Algorithm, F=ma
$$U = U(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N), \qquad \mathbf{F}_i = -\frac{\partial}{\partial \mathbf{r}_i} U(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N)$$

$$m_1 \frac{d^2\mathbf{r}_1}{dt^2} = \mathbf{F}_1 = -\nabla_1 U(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N)$$
$$m_2 \frac{d^2\mathbf{r}_2}{dt^2} = \mathbf{F}_2 = -\nabla_2 U(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N)$$
$$\vdots$$
$$m_N \frac{d^2\mathbf{r}_N}{dt^2} = \mathbf{F}_N = -\nabla_N U(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N)$$
Calculate the trajectory of a system of particles interacting with potential U.
mi is the mass of particle i ri is the position of particle i vi is the velocity of particle i
Solve with the Velocity Verlet algorithm; a minimal sketch follows.
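As a concrete illustration, here is a minimal C++ sketch of one velocity Verlet update (half-kick, drift, force recomputation, half-kick). The compute_forces callback is a hypothetical stand-in for the force pipeline, not LAMMPS's actual integrator code:

```cpp
#include <vector>

struct Particle { double x[3], v[3], f[3], m; };

// One velocity Verlet step of size dt. compute_forces is assumed to
// fill each particle's f from the current positions.
void velocity_verlet_step(std::vector<Particle>& atoms, double dt,
                          void (*compute_forces)(std::vector<Particle>&)) {
    for (auto& p : atoms)
        for (int d = 0; d < 3; ++d) {
            p.v[d] += 0.5 * dt * p.f[d] / p.m;  // v(t + dt/2)
            p.x[d] += dt * p.v[d];              // x(t + dt)
        }
    compute_forces(atoms);                      // f(t + dt)
    for (auto& p : atoms)
        for (int d = 0; d < 3; ++d)
            p.v[d] += 0.5 * dt * p.f[d] / p.m;  // v(t + dt)
}
```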
9
Force field details
[Figure: two water molecules with atoms labeled (1)-(6), bond lengths l1-l4, and bend angles θ1 and θ2. Charges: O = -q, H = +q/2.]

$$U_{system} = U_{bond} + U_{angle} + U_{electrostatic} + U_{nonbonded}$$

$$U_{bond} = \tfrac{1}{2} k_b^{OH}(l_1 - l_{OH})^2 + \tfrac{1}{2} k_b^{OH}(l_2 - l_{OH})^2 + \tfrac{1}{2} k_b^{OH}(l_3 - l_{OH})^2 + \tfrac{1}{2} k_b^{OH}(l_4 - l_{OH})^2$$

$$U_{angle} = \tfrac{1}{2} k_\Theta^{HOH}(\Theta_1 - \Theta_{HOH})^2 + \tfrac{1}{2} k_\Theta^{HOH}(\Theta_2 - \Theta_{HOH})^2$$

$$U_{electrostatic} = \sum_{i=1}^{N} \sum_{j<i} \frac{1}{4\pi\varepsilon_o} \frac{q_i q_j}{r_{i,j}}$$

$$U_{nonbonded} = \sum_{i=1}^{N} \sum_{\substack{j<i \\ r_{i,j} < r_{cutoff}}} \left( \frac{A_{i,j}}{r_{i,j}^{12}} - \frac{B_{i,j}}{r_{i,j}^{6}} \right)$$
10
Periodic Boundary Conditions
[Figure: a central simulation cell of water molecules surrounded by its periodic images in every direction.]
11
Spatial Decomposition
[Figure: the simulation cell divided into four spatial subdomains, labeled 0-3, each holding a block of water molecules.]

$$U_{nonbonded} = \sum_{i=1}^{N} \sum_{\substack{j<i \\ r_{i,j} < r_{cutoff}}} \left( \frac{A_{i,j}}{r_{i,j}^{12}} - \frac{B_{i,j}}{r_{i,j}^{6}} \right)$$
12
Ghost Atom Region
[Figure: subdomain 0 with its bordering ghost-atom region, replicated from the neighboring subdomains.]
13
Force Calculation Implementation Within LAMMPS
Within time step N: compute neighbor list → compute nonbonded forces → compute bond forces → compute angle forces → compute dihedral forces → compute improper forces → compute electrostatic forces → update coordinates. Time step N+1 repeats the previous steps. (A structural sketch of this pipeline follows.)
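The per-step pipeline above can be read as a fixed sequence of force kernels followed by an integration update. A minimal structural sketch in C++, with illustrative stub functions standing in for the kernels (in LAMMPS these are separate neighbor, pair, bond, and kspace style classes, not free functions):

```cpp
// Illustrative stand-ins for the kernels named in the diagram.
void build_neighbor_list() {}
void compute_nonbonded_forces() {}
void compute_bond_forces() {}
void compute_angle_forces() {}
void compute_dihedral_forces() {}
void compute_improper_forces() {}
void compute_electrostatic_forces() {}
void update_coordinates() {}

// One run: every step recomputes all force terms from the current
// coordinates, then integrates to produce the next coordinates.
void run(int nsteps) {
    for (int step = 0; step < nsteps; ++step) {
        build_neighbor_list();   // in practice rebuilt only periodically
        compute_nonbonded_forces();
        compute_bond_forces();
        compute_angle_forces();
        compute_dihedral_forces();
        compute_improper_forces();
        compute_electrostatic_forces();
        update_coordinates();
    }
}
```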
14
[Diagram: MPI tasks 0-3 each independently execute, per time step: compute neighbor list → compute electrostatic forces → compute nonbond/bond forces → update coordinates.]
15
• Approximately 80% of the total computational work is in computing the forces
The force terms:

$$\mathbf{F}_i = m_i \mathbf{a}_i = -\frac{\partial U_{system}}{\partial \mathbf{r}_i}$$

$$U_{system} = \sum_{bonds} k_b (b - b_o)^2 + \sum_{angles} k_\Theta (\Theta - \Theta_o)^2 + \sum_{dihedrals} k_\phi (\phi - \phi_o)^2 + \sum_{impropers} k_\omega (\omega - \omega_o)^2 + \sum_i \sum_{j<i} \varepsilon_{ij} \left[ \left( \frac{R_{min,ij}}{r_{ij}} \right)^{12} - \left( \frac{R_{min,ij}}{r_{ij}} \right)^{6} \right] + \sum_i \sum_{j<i} \frac{1}{4\pi\varepsilon_o} \frac{q_i q_j}{r_{ij}}$$

These two terms, the nonbonded and electrostatic interactions respectively, dominate the computational work.
16
DPPC Stacked Bilayer Profile: 2 PF Problem
[Chart: DPPC stacked bilayer run-time profile for 100 steps, cumulative time (0-3000 s) vs. job size, broken into Pair, Bond, Kspace, Neigh, Comm, Output, and Other time.]

The Kspace component is part of the electrostatic interactions:
$$U_{electrostatic} = \sum_i \sum_{j \neq i} \frac{1}{4\pi\varepsilon_o} \frac{q_i q_j}{r_{i,j}}$$

The pair component is the nonbonded interactions:
$$U_{nonbonded} = \sum_i \sum_{j \neq i} \varepsilon_{ij} \left[ \left( \frac{R_{min,ij}}{r_{ij}} \right)^{12} - \left( \frac{R_{min,ij}}{r_{ij}} \right)^{6} \right]$$

These two terms, nonbonded and electrostatic respectively, dominate the computational work.
17
DPPC Stacked Bilayer Profile: 2 PF Problem

[Chart: normalized cumulative run-time distribution (0%-100%) vs. job size for the DPPC stacked bilayer, without I/O, broken into Pair, Bond, Kspace, Neigh, Comm, and Other time.]

The Kspace component is part of the electrostatic interactions; the pair component is the nonbonded interactions. These two terms dominate the computational work.
18
DPPC Stacked Bilayer Profile: 2 PF Problem
[Chart: Kspace cumulative time for 100 steps (0-20 s) vs. job size, from 1 CPU to 1024 CPUs.]

The Kspace component is part of the electrostatic interactions:
$$U_{electrostatic} = \sum_i \sum_{j \neq i} \frac{1}{4\pi\varepsilon_o} \frac{q_i q_j}{r_{i,j}}$$
19
[Diagram: MPI tasks 0-3 each execute, per time step: compute neighbor list → compute electrostatic forces → compute non-bonded forces → update coordinates.]
20
3. Accelerating LAMMPS
Need to accelerate the nonbonded forces and the Kspace component of the electrostatic forces.

The Kspace component is part of the electrostatic interactions:
$$U_{electrostatic} = \sum_i \sum_{j \neq i} \frac{1}{4\pi\varepsilon_o} \frac{q_i q_j}{r_{i,j}}$$

The pair component is the nonbonded interactions:
$$U_{nonbonded} = \sum_i \sum_{j \neq i} \varepsilon_{ij} \left[ \left( \frac{R_{min,ij}}{r_{ij}} \right)^{12} - \left( \frac{R_{min,ij}}{r_{ij}} \right)^{6} \right]$$
21
Accelerating the Nonbonded Interactions within LAMMPS
The pair component is the nonbonded interactions:
$$U_{nonbonded} = \sum_i \sum_{j \neq i} \varepsilon_{ij} \left[ \left( \frac{R_{min,ij}}{r_{ij}} \right)^{12} - \left( \frac{R_{min,ij}}{r_{ij}} \right)^{6} \right]$$

[Chart: DPPC stacked bilayer normalized cumulative run-time distribution (0%-100%) vs. job size, without I/O, broken into Pair, Bond, Kspace, Neigh, Comm, and Other time.]
22
Calculation of Nonbonded Interactions (Current CPU Only Implementation)
Each spatial region is assigned to an MPI task.

Done on the host CPU:

do i = 1, N
  do j = 1, M
    compute energy of particles i, j
    compute force on i due to j
  end do
end do

(A self-contained C++ version follows.)
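To make the CPU-only double loop concrete, here is a minimal C++ version for the Lennard-Jones 12-6 form U = A/r^12 - B/r^6 used earlier. It is an illustrative O(N^2) scan with a cutoff, not LAMMPS's pair-style code, which walks a precomputed neighbor list instead:

```cpp
#include <cstddef>
#include <vector>

struct Atom { double x, y, z, fx, fy, fz; };

// Accumulate pair forces for U = A/r^12 - B/r^6 within a cutoff.
void lj_forces(std::vector<Atom>& a, double A, double B, double rcut) {
    const double rcut2 = rcut * rcut;
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = i + 1; j < a.size(); ++j) {
            double dx = a[i].x - a[j].x, dy = a[i].y - a[j].y,
                   dz = a[i].z - a[j].z;
            double r2 = dx * dx + dy * dy + dz * dz;
            if (r2 >= rcut2) continue;
            double inv2 = 1.0 / r2, inv6 = inv2 * inv2 * inv2;
            // f = -dU/dr along r_ij, written per unit displacement vector
            double fmag = (12.0 * A * inv6 * inv6 - 6.0 * B * inv6) * inv2;
            a[i].fx += fmag * dx; a[i].fy += fmag * dy; a[i].fz += fmag * dz;
            a[j].fx -= fmag * dx; a[j].fy -= fmag * dy; a[j].fz -= fmag * dz;
        }
}
```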
23
Calculation of Non-bonded Interactions Accelerated Implementation (Planned Work)
• The spatial region maps to a node.
• The spatial region hosts an MPI task.
• Particles in the region will be evolved by the CPUs and GPUs attached to this node.
• A fraction X of the N particles will be assigned to the CPUs; the remaining fraction (1 - X) will be assigned to the accelerator (see the sketch after the loops).

Asynchronously done on the GPUs:

do i = N_cpu + 1, N
  do j = 1, Neighbors_i
    compute interaction of particles i, j
  end do
end do

Done on the host CPUs:

do i = 1, N_cpu
  do j = 1, Neighbors_i
    compute interaction of particles i, j
  end do
end do
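A hedged sketch of the planned split, using std::async to stand in for an asynchronous kernel launch; gpu_compute_range() and cpu_compute_range() are hypothetical placeholders, not LAMMPS or CUDA APIs:

```cpp
#include <cstddef>
#include <future>

// Hypothetical placeholders for the host and device pair loops.
void cpu_compute_range(std::size_t lo, std::size_t hi) { (void)lo; (void)hi; }
void gpu_compute_range(std::size_t lo, std::size_t hi) { (void)lo; (void)hi; }

// Particles [0, n_cpu) stay on the host; [n_cpu, N) go to the
// accelerator, overlapping with the host work.
void compute_forces_split(std::size_t N, double X) {
    std::size_t n_cpu = static_cast<std::size_t>(X * N);
    auto gpu_done = std::async(std::launch::async,
                               gpu_compute_range, n_cpu, N);
    cpu_compute_range(0, n_cpu);  // host fraction X runs concurrently
    gpu_done.wait();              // synchronize before combining results
}
```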
24
Load Balancing of Nonbonded Interactions
Lennard-Jones kernel (864K particles), cutoff = 2.5 × radius (~78 neighbors): low arithmetic intensity.
Gay-Berne kernel (125K particles), cutoff = 7 × short radius: high arithmetic intensity.
• Multiple cores can be assigned to each GPU.
• Dynamic load balancing assigns particles to the CPU and GPU for asynchronous force calculation (a simple rebalancing rule is sketched below).
– This is not the only benefit: splitting per-GPU data into smaller subdomains improves memory latencies and spreads "other" time (integration, etc.) across more cores.
Dual Hex-Core AMD Istanbul 2.6 GHz, Dual Fermi C2050, Infiniband
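One plausible reading of the dynamic load balancing is a proportional update of the particle fraction X from the measured per-side times; the damping factor below is an assumption for illustration, not a LAMMPS setting:

```cpp
// Shift X toward the split that equalizes CPU and GPU times, based on
// measured throughputs from the last step.
double rebalance(double X, double t_cpu, double t_gpu) {
    double rate_cpu = X / t_cpu;          // fraction of work per second, CPU
    double rate_gpu = (1.0 - X) / t_gpu;  // fraction of work per second, GPU
    double X_ideal = rate_cpu / (rate_cpu + rate_gpu);  // equal-time split
    return X + 0.5 * (X_ideal - X);       // damped update toward the ideal
}
```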
25
Acceleration of the Electrostatic Interactions by Means of Multilevel Summation Method (MSM)
The Kspace component is part of the electrostatic interactions:
$$U_{electrostatic} = \sum_i \sum_{j \neq i} \frac{1}{4\pi\varepsilon_o} \frac{q_i q_j}{r_{i,j}}$$

[Charts: the DPPC stacked bilayer normalized run-time profile without I/O (Pair, Bond, Kspace, Neigh, Comm, Other vs. job size), and the Kspace cumulative time for 100 steps (0-20 s).]
26
Multilevel Summation Method (MSM) vs. Particle Mesh Ewald (PME)
• LAMMPS currently uses the PME algorithm to solve the electrostatic interactions.
• MSM is more scalable than PME:
– MSM is a grid-based method whose communication pattern is nearest-neighbor, while PME involves two ALL-TO-ALLs over reciprocal lattice points to compute two 3D FFTs (Hardy, 2009).
– MSM scales as O(N), whereas PME scales as O(N ln N).
– For typical systems, the crossover between MSM and PME is approximately 100,000 atoms (a back-of-the-envelope reading follows below).
MSM is essential for the scalability of large problems like the ones proposed for OLCF3
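A back-of-the-envelope reading of the crossover claim, assuming simple cost models $c_M N$ for MSM and $c_P N \ln N$ for PME (the constants are illustrative, not from the slides):

$$c_M N^\ast = c_P N^\ast \ln N^\ast \;\Rightarrow\; \frac{c_M}{c_P} = \ln N^\ast \approx \ln(10^5) \approx 11.5$$

So a crossover near 100,000 atoms corresponds to MSM carrying roughly an order of magnitude larger per-atom constant than PME; above $N^\ast$, the $N \ln N$ growth (and the ALL-TO-ALL FFT communication) makes PME the more expensive method.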
27
Multilevel Summation Method Overview

Split the bare Coulomb kernel into an MSM short-range part plus a slowly varying long-range part:

$$\frac{1}{|\mathbf{r}_i - \mathbf{r}_j|} = \underbrace{\left\{ \frac{1}{|\mathbf{r}_i - \mathbf{r}_j|} - g_a(\mathbf{r}_i, \mathbf{r}_j) \right\}}_{\text{MSM short-range part}} + \underbrace{g_a(\mathbf{r}_i, \mathbf{r}_j)}_{\text{long-range part}}$$

1. Separation of the electrostatic interaction potential into an MSM short-range part plus an MSM long-range, slowly varying part.
2. Approximation of the slowly varying MSM long-range part on a grid.
3. Recursive application of the first two ideas on a hierarchy of coarser grids (Hardy, 2006, p. 12).
28
MSM Algorithmic Steps
Anterpolation: $q^0 = \varphi^0 q$
Restrictions: $q^1 = \varphi^1 q^0$, …, $q^{l-1} = \varphi^{l-1} q^{l-2}$, $q^l = \varphi^l q^{l-1}$
MSM short-range part: $e_{short} = G_{Short}\, q$
Interpolation: $e = G q \cong G_{Short}\, q + \varphi\, e^0$
Level potentials (grid lattice spacing and lattice cutoff double at each level):
$e^0 = G^0 q^0 \cong G^0_{Short}\, q^0 + \varphi^1 e^1$  (spacing $h$, cutoff $a$)
$e^1 = G^1 q^1 \cong G^1_{Short}\, q^1 + \varphi^2 e^2$  (spacing $2h$, cutoff $2a$)
…
$e^{l-1} = G^{l-1} q^{l-1} \cong G^{l-1}_{Short}\, q^{l-1} + \varphi^l e^l$  (spacing $2^{l-1}h$, cutoff $2^{l-1}a$)
$e^l = G^l q^l$  (top level; spacing $2^l h$, cutoff $2^l a$)
(A driver skeleton for this traversal follows.)
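The level traversal can be summarized as a small driver. The following C++ skeleton is a structural sketch only, with a placeholder grid type and trivial stub kernels standing in for the $\varphi$ and $G_{Short}$ operators; it is not the LAMMPS MSM implementation:

```cpp
#include <vector>
using Grid = std::vector<double>;  // charges/potentials on one level

// Trivial stubs standing in for the MSM operators.
Grid anterpolate(const Grid& q) { return q; }                 // particles -> level 0
Grid restrict_grid(const Grid& f) { return Grid(f.size() / 8); }  // level n -> n+1
Grid direct_sum(const Grid& q) { return Grid(q.size()); }     // G_short on one level
void add_prolonged(Grid& e, const Grid& ec) { (void)e; (void)ec; } // phi^T e^{n+1}
void interpolate(const Grid& e0, Grid& e) { e = e0; }         // level 0 -> particles

// Anterpolate, restrict down, run the direct part on every level,
// prolong back up, interpolate to the particles.
void msm_long_range(const Grid& q, Grid& e, int levels) {
    std::vector<Grid> qg(levels), eg(levels);
    qg[0] = anterpolate(q);
    for (int n = 1; n < levels; ++n) qg[n] = restrict_grid(qg[n - 1]);
    for (int n = 0; n < levels; ++n) eg[n] = direct_sum(qg[n]);
    for (int n = levels - 2; n >= 0; --n) add_prolonged(eg[n], eg[n + 1]);
    interpolate(eg[0], e);
}
```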
29
MSM Direct
for all grid indices $(i_c, j_c, k_c) \in \Omega^n$:
  $e^n_{short}(i_c, j_c, k_c) = 0$
  for $k = -\lfloor 2a/h_z \rfloor, \ldots, \lfloor 2a/h_z \rfloor$:
    for $j = -\lfloor 2a/h_y \rfloor, \ldots, \lfloor 2a/h_y \rfloor$:
      for $i = -\lfloor 2a/h_x \rfloor, \ldots, \lfloor 2a/h_x \rfloor$:
        $e^n_{short}(i_c, j_c, k_c) \mathrel{+}= \big[ g_{2^n a}(h_n(i, j, k)) - g_{2^{n+1} a}(h_n(i, j, k)) \big]\, q^n(i_c + i,\ j_c + j,\ k_c + k)$

(A C++ sketch of this stencil sweep follows.)
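Read as code, the loop nest is a stencil sweep per level. A self-contained C++ sketch, with a flattened periodic grid and a caller-supplied g_diff standing in for the softened-kernel difference $g_{2^n a} - g_{2^{n+1} a}$ (both names are illustrative, not LAMMPS internals):

```cpp
#include <vector>

struct Level {
    int nx, ny, nz;            // grid dimensions at this level
    std::vector<double> q, e;  // charge in, potential out (size nx*ny*nz)
    double at(int i, int j, int k) const {  // periodic charge lookup
        auto w = [](int i, int n) { return ((i % n) + n) % n; };
        return q[(w(k, nz) * ny + w(j, ny)) * nx + w(i, nx)];
    }
};

// For each grid point, accumulate the kernel difference times nearby
// charges within the lattice cutoff (here 'cut' grid cells per axis).
void direct_part(Level& L, int cut, double (*g_diff)(int, int, int)) {
    for (int kc = 0; kc < L.nz; ++kc)
        for (int jc = 0; jc < L.ny; ++jc)
            for (int ic = 0; ic < L.nx; ++ic) {
                double sum = 0.0;
                for (int k = -cut; k <= cut; ++k)
                    for (int j = -cut; j <= cut; ++j)
                        for (int i = -cut; i <= cut; ++i)
                            sum += g_diff(i, j, k) * L.at(ic + i, jc + j, kc + k);
                L.e[(kc * L.ny + jc) * L.nx + ic] = sum;
            }
}
```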
30
[Diagram: MPI tasks 0-3, each with an attached accelerator (acc), execute per time step: compute neighbor list → MSM → compute non-bonded forces → update coordinates.]
31
• Need non-trivial knowledge of what the code is doing
– Algorithmic details
• Use a debugger to step through the code
– Problem/data decomposition
– Interprocessor data communication patterns
• Profile the code to discover bottlenecks
– VampirTrace, CrayPat, homemade timers, etc.
• "There is no free lunch" for porting to GPUs
– (Except for those who registered for this workshop)
– There is a significant communication cost with the accelerator; some algorithms may not be computationally intensive enough to be worth this cost.
32
Summary
• We have identified the major challenges and have a technologically reasonable plan for accelerating LAMMPS
– Nonbonded interactions
• Capturing various levels of parallelism on the CPU and GPU
• Load balancing between the CPU and GPU
– Electrostatic interactions
• Developing the MSM algorithm
– Capturing various levels of parallelism on the CPU and GPU
– Load balancing between the CPU and GPU
– Regardless of the accelerator, we are forced to restructure some parts of LAMMPS to effectively capture the inter- and intra-node parallelism
33
Acknowledgements
• Bobby Phillips and Richard Mills – Multi-grid discussions
• Jim Schwarzmeier and Sarah Anderson – Algorithmic discussions, benchmarking, ….
• Scott Hampton and Pratul Agarwal – CUDA and algorithmic discussions
34
References
• Hardy, David. "Multilevel Summation for the Fast Evaluation of Forces for the Simulation of Biomolecules." Ph.D. diss., University of Illinois at Urbana-Champaign, 2006.
• Hardy, David, John Stone, and Klaus Schulten. "Multilevel Summation of Electrostatic Potentials Using Graphics Processing Units." Parallel Computing 35 (2009): 164-177.
35
• Team Members and Expertise
– Oak Ridge National Laboratory
• Arnold Tharrington - computational biophysicist, primary programmer for code modifications
• Michael Brown - computational biophysicist, primary programmer for code modifications
• Bobby Philips - mathematician and multigrid expert
– Sandia National Laboratories
• Steve Plimpton - computational scientist, originator of LAMMPS
• Paul Crozier - computational scientist, LAMMPS developer
– NVIDIA
• Peng Wang - computational scientist and CUDA expert
– Cray Inc.
• James Schwarzmeier - computational scientist; benchmarking, code profiling and optimization expert, threading, etc.
• Sarah Anderson - computational scientist; benchmarking, code profiling and optimization expert, threading, etc.
– Institute for Computational Molecular Science at Temple University (Director: Michael Klein)
• Axel Kohlmeyer - computational biophysicist, science lead for the 20 PF target problem
36
Supplementary Slides
37
Energy and Forces
$$U(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N) = \tfrac{1}{2}\, q^T G q \cong \tfrac{1}{2}\, q^T \big( G_{Short}\, q + \varphi^T e^0 \big)$$

$$\mathbf{F}_\alpha = -\frac{\partial U}{\partial \mathbf{r}_\alpha} \cong -\frac{\partial}{\partial \mathbf{r}_\alpha} \left[ \tfrac{1}{2}\, q^T \big( G_{Short}\, q + \varphi^T e^0 \big) \right]$$

where $e^0 = G^0 q^0 \cong G^0_{Short}\, q^0 + \varphi^1 e^1$, …, $e^{l-1} = G^{l-1} q^{l-1} \cong G^{l-1}_{Short}\, q^{l-1} + \varphi^l e^l$.
38
Matrix Vector Decomposition
$q^0 = \varphi^0 q$;  $e = G q \cong G_{Short}\, q + (\varphi^0)^T e^0$
$q^1 = \varphi^1 q^0$;  $e^0 = G^0 q^0 \cong G^0_{Short}\, q^0 + (\varphi^1)^T e^1$
$e^1 = G^1 q^1 \cong G^1_{Short}\, q^1 + (\varphi^2)^T e^2$
…
$q^l = \varphi^l q^{l-1}$;  $e^{l-1} = G^{l-1} q^{l-1} \cong G^{l-1}_{Short}\, q^{l-1} + (\varphi^l)^T e^l$
$e^l = G^l q^l$

In general, $q^{n+1} = \varphi^{n+1} q^n$ and $e^n = G^n q^n \cong G^n_{Short}\, q^n + (\varphi^{n+1})^T e^{n+1}$.
The top-level product $G^l q^l$ is zero for charge-neutral systems.
39
Calculation of MSM Short Range Part
Asynchronously done on the GPUs:

do i = N_cpu + 1, N
  do j = 1, Neighbors_i
    compute interaction of particles i, j
  end do
end do

Done on the host CPUs:

do i = 1, N_cpu
  do j = 1, Neighbors_i
    compute interaction of particles i, j
  end do
end do

• A fraction X of the N particles will be assigned to the CPUs; the remaining fraction (1 - X) will be assigned to the accelerator.
• The spatial region maps to a node, hosts an MPI task, and its particles are evolved by the CPUs and GPUs attached to that node.
40
MSM Grid Computations
K_n total grid points for level n. A fraction X of the grid points will be assigned to the CPUs; the remaining fraction (1 - X) will be assigned to the accelerator.

Done on the host CPUs:
do i over the CPU grid points
  compute on grid point i
end do

Done on the GPUs:
do i over the accelerator grid points
  compute on grid point i
end do
41
Algorithmic Steps
[Diagram: the MSM V-cycle. From the particle level, anterpolation maps the charges to the finest grid; restrictions map them to successively coarser grids; a direct part is computed at every grid level, including the top level; prolongations carry the potentials back down; interpolation returns them to the particle level, where the short-range part is also computed.]
42
Operational Counts
• Short-range part ~ (a/h)³ N
• Anterpolation ~ (p³ + 7p² + 9p + 6) N
• Direct part ~ (a/h)³ K_n
• Interpolation ~ (5p³ + 24p² + 27p + 11) N + K₀
• Restriction and prolongation ~ (2p + 2)(1 − 1/8^l) K₀
(Worked numbers follow.)

Hardy, D. 2006. Multilevel Summation for the Fast Evaluation of Forces for the Simulation of Biomolecules. Ph.D. thesis, University of Illinois at Urbana-Champaign.
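For concreteness, evaluating these counts for the cubic basis ($p = 3$) and a lattice-cutoff ratio $a/h = 4$ (illustrative values, not figures from the slides):

$$\text{short range} \approx (a/h)^3 N = 64N, \qquad \text{anterpolation} \approx (27 + 63 + 27 + 6)N = 123N,$$
$$\text{interpolation} \approx (135 + 216 + 81 + 11)N + K_0 = 443N + K_0 .$$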
43
Splitting Operation
$$\frac{1}{|\mathbf{r}_i - \mathbf{r}_j|} = \underbrace{\left\{ \frac{1}{|\mathbf{r}_i - \mathbf{r}_j|} - g_a(\mathbf{r}_i, \mathbf{r}_j) \right\}}_{\text{short range}} + \underbrace{g_a(\mathbf{r}_i, \mathbf{r}_j)}_{\text{long range}}$$
44
Splitting Operation (cont.)
$$\left( G_{Short} \right)_{ij} = \begin{cases} \dfrac{1}{|\mathbf{r}_i - \mathbf{r}_j|} - g_a(\mathbf{r}_i, \mathbf{r}_j), & i \neq j \text{ and } r_{ij} < a \\ 0, & \text{otherwise} \end{cases}$$

$$\left( G_{Long} \right)_{ij} = \begin{cases} g_a(\mathbf{r}_i, \mathbf{r}_j), & i \neq j \\ 0, & i = j \end{cases}$$

where
$$g_{2^n a}(\mathbf{r}_i, \mathbf{r}_j) \equiv \frac{1}{2^n a}\, \gamma\!\left( \frac{|\mathbf{r}_i - \mathbf{r}_j|}{2^n a} \right), \qquad \gamma(\rho) = \frac{1}{\rho} \text{ for } \rho \geq 1 .$$
45
Splitting Operation (cont.)
$$\frac{1}{|\mathbf{r}_i - \mathbf{r}_j|} = \left\{ \frac{1}{|\mathbf{r}_i - \mathbf{r}_j|} - g_a(\mathbf{r}_i, \mathbf{r}_j) \right\} + g_a(\mathbf{r}_i, \mathbf{r}_j)$$

Convolve the charges to a grid of length $a$ and split the long-range part:
$$g_a(\mathbf{r}_i, \mathbf{r}_j) = \left\{ g_a(\mathbf{r}_i, \mathbf{r}_j) - g_{2a}(\mathbf{r}_i, \mathbf{r}_j) \right\} + g_{2a}(\mathbf{r}_i, \mathbf{r}_j)$$

Convolve the charges to a grid of length $2a$ and split the long-range part:
$$g_{2a}(\mathbf{r}_i, \mathbf{r}_j) = \left\{ g_{2a}(\mathbf{r}_i, \mathbf{r}_j) - g_{4a}(\mathbf{r}_i, \mathbf{r}_j) \right\} + g_{4a}(\mathbf{r}_i, \mathbf{r}_j)$$
46
MPI Internode Communication
[Figure: two adjacent MPI task domains (task 0 and task 1); the cutoff sphere of radius r around an atom extends across the domain boundary.]
47
MPI Internode Communication (cont.)
[Diagram: during time step N, MPI tasks 0 and 1 each build a neighbor list, compute forces, then integrate and update the atom coordinates, exchanging boundary data with each other.]
• Many small messages between neighbors
• Typical MPI message packet < 100 Mbytes
48
MSM Grid Structure
Level 0: lattice spacing h, 16×16 grid
Level 1: lattice spacing 2h, 8×8 grid
Level 2: lattice spacing 4h, 4×4 grid
49
Anterpolation – Convolution of Charges
[Figure: particle-level charges are convolved onto the level-0 grid; the basis functions provide local support about each grid point.]
50
Anterpolation (cont.)
Basis functions provide local support about each grid point.
[Figure: the level-0 grid divided among MPI tasks 0-3.]
51
Nodal Basis Functions
[Plot: the nodal basis function $\varphi_m(\xi)$ as a function of $\xi$.]
52
Anterpolation Communication: Nearest Neighbor
[Figure: MPI tasks 0-3 exchange grid-point contributions only with their nearest neighbors.]
53
Restriction of Grid
[Animation: charges are restricted from the level-0 grid to the level-1 grid, then from level 1 to level 2; each grid is divided among MPI tasks 0-3.]
54
MSM Short Range
$$U_{Short} = q_i q_j \left[ \frac{1}{r_{ij}} - \frac{1}{a}\, \gamma\!\left( \frac{r_{ij}}{a} \right) \right]$$

$$\mathbf{f}_{ij} = -q_i q_j \left[ -\frac{1}{r_{ij}^2} - \frac{1}{a^2}\, \gamma'\!\left( \frac{r_{ij}}{a} \right) \right] \hat{\mathbf{r}}_{ij}$$

MSM short-range cutoff of distance a.
[Figure: the domain divided among MPI tasks 0-3; the short-range interaction sphere of radius a around an atom.]
55
MSM Direct
[Animation: the direct calculation for MPI task 0 at grid point $(i_c, j_c, k_c)$; the grid is divided among MPI tasks 0-3.]

$$e^n_{short}(i_c, j_c, k_c) \mathrel{+}= \big[ g_{2^n a}(h_n(i, j, k)) - g_{2^{n+1} a}(h_n(i, j, k)) \big]\, q^n(i_c + i,\ j_c + j,\ k_c + k)$$
56
MSM Prolongation
[Animation: the prolongation calculation for MPI task 0. The coarse-grid potential $e^{n+1}$ on $\Omega^{n+1}$ is carried back to $e^n$ on the finer grid $\Omega^n$; both grids are divided among MPI tasks 0-3.]

$$q^{n+1} = \varphi^{n+1} q^n, \qquad e^n = G^n q^n \cong G^n_{Short}\, q^n + (\varphi^{n+1})^T e^{n+1}$$
57
Initial Results of NAMD Lite with MSM
(GPU: NVIDIA GTX-285, using CUDA 3.0; CPU: 2.4 GHz Intel Core 2 Q6600 quad core)
Box of 21,950 flexible waters, 12 Å cutoff, 1 ps:

                     CPU only                     With GPU                  Speedup vs. NAMD/CPU
NAMD with PME        1199.8 s                     210.5 s                   5.7x
NAMD-Lite with MSM   5183.3 s                     176.6 s                   6.8x
                     (4598.6 short, 572.2 long)   (93.9 short, 63.1 long)   (19% over NAMD/GPU)

Hardy, D. 2010. "Multiscale Molecular Modeling," Edinburgh, UK, June 30 - July 3, 2010.
58
Data Transfer Between Host and Device
Number of MPI tasks: 18,000. Unit cell dimensions: 32,000 Å x 32,000 Å x 33 Å. For the finest grid only:

Maximum number   X grid        Y grid        Z grid        Total number     Grid points
of levels        length        length        length        of grid points   per MPI task
8                3.90625       3.90625       0.12890625    1.68E+07         9.32E+02
9                1.953125      1.953125      0.064453125   1.34E+08         7.46E+03
10               0.9765625     0.9765625     0.032226563   1.07E+09         5.97E+04
11               0.48828125    0.48828125    0.016113281   8.59E+09         4.77E+05
12               0.244140625   0.244140625   0.008056641   6.87E+10         3.82E+06
59
Data Transfer Between Host and Device (cont.)

Bytes per MPI task (data in / data out) for the finest four levels:

Maximum number   Level 0               Level 1               Level 2               Level 3
of levels        in        out         in        out         in        out         in        out
8                29.8E+3   7.5E+3      3.7E+3    932.1E+0    466.0E+0  116.5E+0    58.3E+0   14.6E+0
9                238.6E+3  59.7E+3     29.8E+3   7.5E+3      3.7E+3    932.1E+0    466.0E+0  116.5E+0
10               1.9E+6    477.2E+3    238.6E+3  59.7E+3     29.8E+3   7.5E+3      3.7E+3    932.1E+0
11               15.3E+6   3.8E+6      1.9E+6    477.2E+3    238.6E+3  59.7E+3     29.8E+3   7.5E+3
12               122.2E+6  30.5E+6     15.3E+6   3.8E+6      1.9E+6    477.2E+3    238.6E+3  59.7E+3
60
MSM Short Range
[The matrix-vector recursion diagram from the "Matrix Vector Decomposition" slide (page 38) is repeated here.]
61
MSM Short Range
$$\mathbf{r}_{ij} = [x_i - x_j,\ y_i - y_j,\ z_i - z_j]$$
$$r_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}$$

if $r_{ij} < a$ then
$$U_{Short} \mathrel{+}= q_i q_j \left[ \frac{1}{r_{ij}} - \frac{1}{a}\, \gamma\!\left( \frac{r_{ij}}{a} \right) \right]$$
$$\mathbf{f}_{ij} \mathrel{-}= q_i q_j \left[ -\frac{1}{r_{ij}^2} - \frac{1}{a^2}\, \gamma'\!\left( \frac{r_{ij}}{a} \right) \right] \hat{\mathbf{r}}_{ij}$$

(A C++ sketch of this kernel follows.)
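A hedged C++ sketch of this short-range pair kernel. The softening $\gamma(\rho)$ below is the two-term even-powered Taylor softening from Hardy's construction; the slides do not pin down which $\gamma$ the LAMMPS MSM uses, so treat that choice as an assumption:

```cpp
// gamma(rho) = 1/rho for rho >= 1; smooth even polynomial inside.
double gamma_soft(double rho) {
    if (rho >= 1.0) return 1.0 / rho;
    double s = rho * rho;  // even powers only, so gamma is smooth at 0
    return 15.0 / 8.0 - 5.0 / 4.0 * s + 3.0 / 8.0 * s * s;
}

// MSM short-range pair energy; vanishes at the cutoff a by design,
// since gamma(rho) = 1/rho there.
double u_short_pair(double qi, double qj, double r, double a) {
    if (r >= a) return 0.0;
    return qi * qj * (1.0 / r - (1.0 / a) * gamma_soft(r / a));
}
```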
62
MSM Direct
[The matrix-vector recursion diagram from the "Matrix Vector Decomposition" slide (page 38) is repeated here.]
63
MSM Direct
[The MSM Direct loop nest from the "MSM Direct" slide (page 29) is repeated here.]
64
MSM Anterpolation
[The matrix-vector recursion diagram from the "Matrix Vector Decomposition" slide (page 38) is repeated here.]
65
Anterpolation
for n = 1, …, N (loop over particles):
  find the lowest grid index of the surrounding stencil:
    $i = \lfloor (x_n - x_0)/h \rfloor - (p - 1)/2$
    $j = \lfloor (y_n - y_0)/h \rfloor - (p - 1)/2$
    $k = \lfloor (z_n - z_0)/h \rfloor - (p - 1)/2$
  for $\nu = 0, \ldots, p$ (stencil weights in each dimension):
    $\lambda_x[\nu] = \Phi\!\left( \dfrac{x_n - x[i + \nu]}{h} \right)$, $\lambda_y[\nu] = \Phi\!\left( \dfrac{y_n - y[j + \nu]}{h} \right)$, $\lambda_z[\nu] = \Phi\!\left( \dfrac{z_n - z[k + \nu]}{h} \right)$
  for $\nu_z = 0, \ldots, p$; $\nu_y = 0, \ldots, p$; $\nu_x = 0, \ldots, p$:
    $q^0[i + \nu_x,\ j + \nu_y,\ k + \nu_z] \mathrel{+}= \lambda_x[\nu_x]\, \lambda_y[\nu_y]\, \lambda_z[\nu_z]\, q_n$

Nodal basis function of degree p = 3:

$$\Phi_{cubic}(\xi) = \begin{cases} (1 - |\xi|)\left( 1 + |\xi| - \tfrac{3}{2}\xi^2 \right), & |\xi| \leq 1 \\ -\tfrac{1}{2}(|\xi| - 1)(2 - |\xi|)^2, & 1 \leq |\xi| \leq 2 \\ 0, & \text{otherwise} \end{cases}$$

(A code version of $\Phi_{cubic}$ follows.)
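The degree-3 nodal basis function translates directly into code; a small C++ version of $\Phi_{cubic}$ as written above:

```cpp
#include <cmath>

// Cubic nodal basis function (p = 3): interpolates 1 at xi = 0,
// 0 at the other integers, with support |xi| <= 2.
double phi_cubic(double xi) {
    double t = std::fabs(xi);
    if (t <= 1.0) return (1.0 - t) * (1.0 + t - 1.5 * t * t);
    if (t <= 2.0) return -0.5 * (t - 1.0) * (2.0 - t) * (2.0 - t);
    return 0.0;
}
```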
66
Anterpolation
[Figure: at the particle level (k = 0), with grid spacing h, a particle's charge increment dQ is accumulated onto the surrounding stencil of grid nodes about node i: for i = -3,3; j = -3,3; k = -3,3: Q(m) = Q(m) + dQ(ijk).]
67
Restriction
[The matrix-vector recursion diagram from the "Matrix Vector Decomposition" slide (page 38) is repeated here.]
68
MSM Restriction
Given grid $\Omega^n$ with charge $q^n$: the restriction to the coarser grid $\Omega^{n+1}$ is separable and is applied one dimension at a time (z, then y, then x). Each sweep accumulates stencil-weighted charges from the finer grid:

for all coarse indices along the sweep dimension (the other indices held fixed):
  for $\nu = -p, \ldots, p$:
    $q^{n+1}[\,\cdot\,] \mathrel{+}= \varphi[\nu]\, q^n[\,2 \cdot (\text{coarse index}) + \nu\,]$

with the stencil weights taken from the nodal basis function,
$$\varphi[\nu] = \Phi\!\left( \frac{\nu}{2} \right), \qquad \nu = -p, \ldots, p .$$
69
MSM Prolongation
[The matrix-vector recursion diagram from the "Matrix Vector Decomposition" slide (page 38) is repeated here.]
70
Prolongation
Given grid $\Omega^{n+1}$ with potential $e^{n+1}$: prolongation is the transpose of restriction. Reset $E^n$ to 0; then, sweeping one dimension at a time (x, then y, then z), each coarse-grid value is spread onto the finer grid with the same stencil weights:

for all coarse indices along the sweep dimension (the other indices held fixed):
  for $\nu = -p, \ldots, p$:
    $E^n[\,2 \cdot (\text{coarse index}) + \nu\,] \mathrel{+}= \varphi[\nu]\, e^{n+1}[\,\cdot\,]$

with
$$\varphi[\nu] = \Phi\!\left( \frac{\nu}{2} \right), \qquad \nu = -p, \ldots, p .$$