IPCC @ RWTH Aachen University - umu.se

IPCC @ RWTH Aachen University: Optimization of multibody and long-range solvers in LAMMPS. Paolo Bientinesi, Rodrigo Canales, Markus Höhnerbach, Ahmed E. Ismail. Key results – First year. IPCC EMEA meeting, Ostrava.


Page 1: IPCC @ RWTH Aachen University - umu.se

IPCC @ RWTH Aachen University

Optimization of multibody and long-range solvers in LAMMPS

Paolo Bientinesi, Rodrigo Canales, Markus Höhnerbach, Ahmed E. Ismail

Key results – First year

IPCC EMEA meeting Ostrava

Page 2: IPCC @ RWTH Aachen University - umu.se

Team

RWTH

Prof. Paolo Bientinesi, Rodrigo Canales, Markus Höhnerbach, Prof. Ahmed Ismail

Intel

Georg Zitzlsberger, Klaus-Dieter Ortel, Michael W. Brown

Page 3: IPCC @ RWTH Aachen University - umu.se

LAMMPS

Large-scale Atomic/Molecular Massively Parallel Simulator

- Sandia National Labs, http://lammps.sandia.gov
- Wide collection of potentials
- Open source

Page 4: IPCC @ RWTH Aachen University - umu.se

Parallel Packages for LAMMPS

- LAMMPS parallelization: MPI
- Additional acceleration: OpenMP, GPUs, Intel

Intel Package

- Developed by Michael Brown (Intel)
- Targets Intel's hardware
- Offloading, vectorization, precision
- For several potentials

Figure: Xeon Phi coprocessors

Page 5: IPCC @ RWTH Aachen University - umu.se

Goals

- Optimize core kernels within LAMMPS
- Multi-threading and vectorization
- Intel Xeon Phi
- Buckingham potential, PPPM solver
- Tersoff potential, AIREBO potential

Page 6: IPCC @ RWTH Aachen University - umu.se

Buckingham Potential Optimization

Rodrigo Canales


Page 7: IPCC @ RWTH Aachen University - umu.se

Pair Potentials

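Lennard-Jones (Intel package)

$$\Phi_{lj} = 4\epsilon \left[ \left(\frac{\sigma}{r}\right)^{12} - \left(\frac{\sigma}{r}\right)^{6} \right]$$

Buckingham

$$\Phi_{buck} = A e^{-r/\rho} - \frac{C}{r^{6}}$$

$$\Phi_{buck/coul} = \Phi_{buck} + \frac{C q_i q_j}{\epsilon\, r_{ij}}$$

$$\Phi_{buck/coul/long} = \Phi_{buck} + \frac{C q_i q_j}{\epsilon\, r_{ij}} \operatorname{erfc}(\alpha r_{ij})$$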
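As an illustration of the formulas on this slide, a minimal scalar sketch in C++ follows. It is not the USER-INTEL implementation: the struct `BuckParams`, the function names, and the single prefactor `pref` (standing for the constant factor C/ε of the Coulomb term, which the slide writes with the same letter C as the dispersion coefficient) are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical parameter bundle; field names mirror the symbols on the slide.
struct BuckParams {
    double A;    // repulsive prefactor
    double rho;  // repulsive length scale
    double C;    // dispersion coefficient
};

// Phi_buck(r) = A * exp(-r / rho) - C / r^6
double buck_energy(const BuckParams& p, double r) {
    assert(r > 0.0 && p.rho > 0.0);
    const double r2 = r * r;
    const double r6 = r2 * r2 * r2;
    return p.A * std::exp(-r / p.rho) - p.C / r6;
}

// Radial force F(r) = -dPhi_buck/dr = (A / rho) * exp(-r / rho) - 6 * C / r^7
double buck_force(const BuckParams& p, double r) {
    assert(r > 0.0 && p.rho > 0.0);
    const double r2 = r * r;
    const double r7 = r2 * r2 * r2 * r;
    return (p.A / p.rho) * std::exp(-r / p.rho) - 6.0 * p.C / r7;
}

// Coulomb term of buck/coul/long: pref * qi * qj * erfc(alpha * r) / r.
// With alpha = 0 this reduces to the plain buck/coul term.
double coul_long_energy(double pref, double qi, double qj, double alpha, double r) {
    assert(r > 0.0);
    return pref * qi * qj * std::erfc(alpha * r) / r;
}
```

The force follows from differentiating the energy term by term, which is what a production kernel would evaluate per neighbor pair.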


Page 8: IPCC @ RWTH Aachen University - umu.se

Buck Potential Optimization

- USER-INTEL package as base of the development
- Data packing for parameters
- Alignment of force and position arrays
- Multiple precision support
- Enable Xeon Phi offloading
- Vectorization: pragma SIMD
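A sketch of how parameter packing, alignment and a SIMD pragma can fit together is given below. It is an assumption-laden illustration, not the package's code: the record `PackedBuck`, the function `sum_force_over_r`, and the choice of the OpenMP `#pragma omp simd` directive (in place of the Intel-specific pragma named on the slide) are all made up for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical packed per-type-pair parameters: everything the inner loop
// needs sits in one aligned 32-byte record, with derived values precomputed.
struct alignas(32) PackedBuck {
    double a_over_rho;  // A / rho
    double inv_rho;     // 1 / rho
    double six_c;       // 6 * C
    double cutsq;       // squared cutoff
};

// Sums F(r)/r over n squared distances rsq[0..n), all assumed positive.
// Pairs beyond the cutoff contribute zero through a select instead of a
// branch, which keeps the loop body straight-line for the vectorizer.
double sum_force_over_r(const PackedBuck& p, const double* rsq, std::size_t n) {
    assert(rsq != nullptr || n == 0);
    double total = 0.0;
    #pragma omp simd reduction(+ : total)
    for (std::size_t i = 0; i < n; ++i) {
        const double r2 = rsq[i];
        const double r = std::sqrt(r2);
        const double r8 = r2 * r2 * r2 * r2;
        // F(r)/r = (A / rho) * exp(-r / rho) / r - 6 * C / r^8
        const double f = p.a_over_rho * std::exp(-r * p.inv_rho) / r - p.six_c / r8;
        total += (r2 < p.cutsq) ? f : 0.0;
    }
    return total;
}
```

Compilers without OpenMP support ignore the pragma, so the function stays correct either way; only the generated code differs.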


Page 9: IPCC @ RWTH Aachen University - umu.se

Speedup Xeon (single-threaded)

Figure: Speedup on the Xeon E5-2650 (Sandy Bridge)

Page 10: IPCC @ RWTH Aachen University - umu.se

Speedup Xeon Phi (single-threaded, native)

Figure: Speedup on the Xeon Phi 5110P

Page 11: IPCC @ RWTH Aachen University - umu.se

KNC Intrinsics Vectorization

- Gather operations in neighbor loading
- Replace by in-register transpose, 4x8 or 4x16

Templating intrinsics: 760 lines

Implementation pair/buck: 330 lines
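The data movement behind "transpose instead of gather" can be sketched in portable C++. This is only a scalar model of the idea, assuming positions are stored as padded four-component records (`Pos4`, a hypothetical name) and eight double-precision lanes per register; the actual work uses KNC intrinsics, which are not reproduced here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kLanes = 8;  // 8 doubles per 512-bit register

struct Pos4 { double x, y, z, w; };  // hypothetical padded position record

// One row per component, one column per lane: the 4x8 layout after transposing.
struct Soa4x8 {
    std::array<double, kLanes> x, y, z, w;
};

// For each lane, read the whole 4-element record of one neighbor with a
// contiguous access and place its components into per-component rows.
// With intrinsics, the same movement would be contiguous loads followed by
// in-register shuffles, instead of one gather instruction per component.
Soa4x8 load_neighbors_transposed(const Pos4* pos, const int* neigh) {
    assert(pos != nullptr && neigh != nullptr);
    Soa4x8 out{};
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        const Pos4& p = pos[neigh[lane]];  // one contiguous 32-byte read
        out.x[lane] = p.x;
        out.y[lane] = p.y;
        out.z[lane] = p.z;
        out.w[lane] = p.w;
    }
    return out;
}
```

In single precision the same scheme would use 16 lanes, which corresponds to the 4x16 case on the slide.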

Figure: Speedup comparison on the Xeon Phi (double)

Figure: Speedup comparison on the Xeon Phi (single)

Page 12: IPCC @ RWTH Aachen University - umu.se

Runtime on Full System

576,000 atoms, double precision

Configuration            Threads     Base     SIMD
Xeon Phi native          240         253 s    202 s
Xeon (×2) + Xeon Phi     32 + 240    120 s    77 s
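Read as speedups, the runtimes on this slide work out as follows (rounded to two decimals): SIMD over base on the Xeon Phi alone, SIMD over base on the full system, and full-system SIMD over the native base run.

$$
\frac{253\,\mathrm{s}}{202\,\mathrm{s}} \approx 1.25, \qquad
\frac{120\,\mathrm{s}}{77\,\mathrm{s}} \approx 1.56, \qquad
\frac{253\,\mathrm{s}}{77\,\mathrm{s}} \approx 3.29
$$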


Page 13: IPCC @ RWTH Aachen University - umu.se

Multibody potentials

Modernization of the Tersoff potential

Markus Höhnerbach

Page 14: IPCC @ RWTH Aachen University - umu.se


Page 15: IPCC @ RWTH Aachen University - umu.se

The Tersoff potential

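$$V = \sum_{i} \sum_{j \in N_i} \overbrace{f_C(r_{ij}) \left[ f_R(r_{ij}) + b_{ij} f_A(r_{ij}) \right]}^{V(i,j,\zeta_{ij})} \tag{1}$$

$$b_{ij} = \left(1 + \beta^{\eta} \zeta_{ij}^{\eta}\right)^{-\frac{1}{2\eta}} \tag{2}$$

$$\zeta_{ij} = \sum_{k \in N_i \setminus \{j\}} \underbrace{f_C(r_{ik})\, g(\theta_{ijk}) \exp\left(\lambda_3 (r_{ij} - r_{ik})\right)}_{\zeta(i,j,k)} \tag{3}$$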

Page 16: IPCC @ RWTH Aachen University - umu.se

Popularity

Figure: #Citations per year (vertical axis 0 to 200), 1990 to 2015

- Tersoff potential: widely used, fairly simple (~700 LOC)
- Previous work for GPU: EAM [a], Stillinger-Weber [b] and Tersoff [c]

Page 17: IPCC @ RWTH Aachen University - umu.se

The Tersoff Algorithm

for i in local atoms of the current thread do
    for j in atoms neighboring i do
        ζij ← 0;
        for k in atoms neighboring i do
            ζij ← ζij + ζ(i, j, k);
        E ← E + V(i, j, ζij);
        Fi ← Fi − ∂xi V(i, j, ζij);
        Fj ← Fj − ∂xj V(i, j, ζij);
        δζ ← ∂ζ V(i, j, ζij);
        for k in atoms neighboring i do
            Fi ← Fi − δζ · ∂xi ζ(i, j, k);
            Fj ← Fj − δζ · ∂xj ζ(i, j, k);
            Fk ← Fk − δζ · ∂xk ζ(i, j, k)
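The energy half of this loop nest can be sketched in C++ as below; the force updates are left out to keep the example short. The slides do not spell out f_C, f_R, f_A or g, so the commonly used Tersoff forms are assumed here, and `TersoffParams`, `Vec3` and all function names are hypothetical. As in equation (3), the inner loop skips k = j.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

// Hypothetical single-species parameter set; values are supplied by the caller.
struct TersoffParams {
    double A, B, lambda1, lambda2, lambda3;  // f_R, f_A and exponential term
    double beta, eta;                        // bond order b_ij
    double gamma, c, d, h;                   // angular term g(theta)
    double R, D;                             // cutoff centre and half-width
};

static Vec3 sub(const Vec3& a, const Vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static double dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Assumed cutoff f_C: 1 below R - D, 0 above R + D, sine taper in between.
double f_c(const TersoffParams& p, double r) {
    if (r < p.R - p.D) return 1.0;
    if (r > p.R + p.D) return 0.0;
    const double half_pi = std::acos(0.0);
    return 0.5 * (1.0 - std::sin(half_pi * (r - p.R) / p.D));
}

// Assumed angular term g(theta), written in terms of cos(theta).
double g_theta(const TersoffParams& p, double cos_t) {
    const double c2 = p.c * p.c, d2 = p.d * p.d, hc = p.h - cos_t;
    return p.gamma * (1.0 + c2 / d2 - c2 / (d2 + hc * hc));
}

// V(i, j, zeta_ij) = f_C(r) [f_R(r) + b_ij f_A(r)], equations (1) and (2).
double pair_term(const TersoffParams& p, double r, double zeta) {
    const double f_r = p.A * std::exp(-p.lambda1 * r);
    const double f_a = -p.B * std::exp(-p.lambda2 * r);
    const double b = std::pow(1.0 + std::pow(p.beta * zeta, p.eta), -0.5 / p.eta);
    return f_c(p, r) * (f_r + b * f_a);
}

// Energy part of the loop nest: i over atoms, j and k over neighbors of i.
double tersoff_energy(const TersoffParams& p, const std::vector<Vec3>& x,
                      const std::vector<std::vector<int>>& neigh) {
    assert(x.size() == neigh.size());
    double energy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        for (int j : neigh[i]) {
            const Vec3 dij = sub(x[j], x[i]);
            const double rij = std::sqrt(dot(dij, dij));
            double zeta = 0.0;
            for (int k : neigh[i]) {
                if (k == j) continue;  // k ranges over N_i \ {j}
                const Vec3 dik = sub(x[k], x[i]);
                const double rik = std::sqrt(dot(dik, dik));
                const double cos_t = dot(dij, dik) / (rij * rik);
                zeta += f_c(p, rik) * g_theta(p, cos_t) * std::exp(p.lambda3 * (rij - rik));
            }
            energy += pair_term(p, rij, zeta);
        }
    }
    return energy;
}
```

For an isolated dimer every ζij is zero, so b_ij = 1 and the result reduces to a plain pair potential; a third atom bonded to i lowers b_ij and weakens the attractive part.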


Page 18: IPCC @ RWTH Aachen University - umu.se

Close-Up

(a) (b)

Figure 6: Snapshot of (a) an undeformed silicon nanowire and (b) an undeformed carbon nanotube.

[...] the crystalline starting configuration used in the MD simulations. Due to thermal fluctuations, the atomistic reference configuration A0 typically deviates from this perfectly straight shape, leading to different effective lengths L. Three cross sections were chosen as circular (001) surfaces with atoms located within different radii of Rg = 2.5a, Rg = 3.5a and Rg = 4.5a, which are also merely geometrical. Physically, though, the atoms are not point particles, but have a finite extent; for example, the van der Waals radius of Si is Rvdw = 2.1 Å. Therefore, the effective cross-sectional radius is Rcs = Rg + Rvdw. The Lennard-Jones parameters of the wall interaction were set to ε = 600 Å³ bar and σ = 3.5 Å, and the wall was tilted against the z-axis, $\nu = [0,\ 0.3,\ -\sqrt{0.91}]^T$.

The identified mass-dependent properties for systems of different radius and length are summarized in table 1. The temperature was always T = 300 K. We see that the inertia is distributed symmetrically, as expected. The only exception is the very slender nanowire (Rg = 2.5a, Lg = 150a), where the thermal vibrations lead to significant deviations in the atomic mean positions from a canonical reference configuration even without any deforming boundary conditions applied, resulting in a seemingly asymmetric mass distribution. Averaging the atom positions over a longer time span may attenuate this issue. The material parameters determined for the constitutive law (58) are presented in table 2.

Dividing the axial stiffness EA by the cross-sectional area $A = \pi R_{cs}^2$ gives the axial Young's modulus of the beam. For example, for beams of length Lg = 150a and radii Rg = 2.5a, 3.5a, and 4.5a we find E = 49.8 GPa, 69.1 GPa, and 71.3 GPa, respectively. These values are significantly smaller than Young's modulus for bulk Si, which is 151.4 GPa for the Stillinger-Weber potential. This reveals a well-known size effect in the mechanical properties of nanowires, caused by non-negligible surface effects as [...]

Page 19: IPCC @ RWTH Aachen University - umu.se

Challenges

Figure: Graphene

- Few neighbors
- Fewer interactions

Page 20: IPCC @ RWTH Aachen University - umu.se

Vectorization

“J” algorithm

for i do
    for j ∈ Ni do
        skip cutoff; ...;
        for k ∈ Ni \ {j} do
            skip cutoff; ...;
        ...;
        for k ∈ Ni \ {j} do
            skip cutoff; ...;

“I” algorithm

for i do
    for j ∈ Ni do
        skip cutoff; ...;
        for k ∈ Ni \ {j} do
            skip cutoff; ...;
        ...;
        for k ∈ Ni \ {j} do
            skip cutoff; ...;
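One common way to keep a "skip cutoff" test inside a vectorized loop is to turn the branch into a per-lane mask. The sketch below models that in plain C++ with a hypothetical vector width of four lanes; it does not claim to reproduce either scheme from the slide, and the pair contribution `1 / r^2` is a placeholder.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kWidth = 4;  // hypothetical vector width in lanes

// Sums a per-pair contribution over all j with rsq[j] < cutsq, processing
// kWidth neighbors at a time. The scalar "skip cutoff" branch becomes a
// per-lane mask, and the trailing partial chunk is masked the same way.
// All squared distances and cutsq are assumed to be positive.
double masked_sum(const double* rsq, std::size_t n, double cutsq) {
    assert((rsq != nullptr || n == 0) && cutsq > 0.0);
    double total = 0.0;
    for (std::size_t base = 0; base < n; base += kWidth) {
        std::array<bool, kWidth> mask{};
        std::array<double, kWidth> value{};
        for (std::size_t lane = 0; lane < kWidth; ++lane) {
            const std::size_t j = base + lane;
            const bool in_range = j < n;
            const double r2 = in_range ? rsq[j] : cutsq;  // safe dummy for padding lanes
            mask[lane] = in_range && r2 < cutsq;
            value[lane] = 1.0 / r2;  // placeholder pair contribution, computed in every lane
        }
        for (std::size_t lane = 0; lane < kWidth; ++lane) {
            total += mask[lane] ? value[lane] : 0.0;  // masked accumulate
        }
    }
    return total;
}
```

Every lane does the arithmetic whether or not it is inside the cutoff, so lanes that fail the test are wasted work; with few neighbors per atom that waste becomes a significant share of the loop.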


Page 21: IPCC @ RWTH Aachen University - umu.se

K Loop

Figure (two panels): axes #Lane (0 to 15) and Time

Page 22: IPCC @ RWTH Aachen University - umu.se

Abstraction

typedef vector_routines<double, double, AVX> v;

typedef v::fvec fvec;

fvec a(1);

fvec b(2);

fvec c = v::recip(a + b);

Features

- Supports single, double and mixed precision
- Supports scalar, SSE4.2, AVX, AVX2, IMCI, AVX-512, array notation (Cilk)

Advantages

- Maintainability
- Testing (through AN)
- Portability
- Thin wrapper
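To make the "thin wrapper" idea concrete, here is a hypothetical one-lane backend with the same shape as the snippet on this slide. Only the names `vector_routines`, `fvec` and `recip` come from the slide; the tag `SCALAR`, the meaning assigned to the first two template parameters, and the members `VL` and `first` are assumptions for this sketch.

```cpp
#include <cassert>

// Hypothetical tag selecting the instruction set; only a scalar fallback is sketched.
struct SCALAR {};

// Primary template, to be specialized per instruction set. The parameters
// are assumed to be: compute type, accumulator type, instruction-set tag.
template <typename flt_t, typename acc_t, typename isa>
struct vector_routines;

template <typename flt_t, typename acc_t>
struct vector_routines<flt_t, acc_t, SCALAR> {
    // One-lane "vector"; a SIMD backend would wrap a register type instead.
    struct fvec {
        flt_t v;
        fvec() : v(0) {}
        explicit fvec(flt_t x) : v(x) {}
        friend fvec operator+(fvec a, fvec b) { return fvec(a.v + b.v); }
        friend fvec operator*(fvec a, fvec b) { return fvec(a.v * b.v); }
    };
    static constexpr int VL = 1;  // number of lanes in this backend

    // Lane-wise reciprocal, the operation used in the slide's example.
    static fvec recip(fvec a) {
        assert(a.v != flt_t(0));
        return fvec(flt_t(1) / a.v);
    }
    // Extracts the first lane, for testing.
    static flt_t first(fvec a) { return a.v; }
};
```

With this backend the slide's lines compile unchanged once `AVX` is replaced by `SCALAR`, which is the point of the abstraction: kernel code is written once against `v::fvec` and the instruction set is chosen by a template argument.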


Page 23: IPCC @ RWTH Aachen University - umu.se

KNL Readiness

- Intrinsics abstraction already supports AVX-512
- Compilation possible for -xMIC-AVX512
- Running under Intel SDE: sde -knl -- ...
- Has been tested on KNL prototypes by Intel employees
- We already have benchmarks prepared for the point when performance data can be shared

Page 24: IPCC @ RWTH Aachen University - umu.se

Portable Optimization (single-threaded, native)

Figure: bar chart (vertical axis 0 to 10) for Westmere (SSE4.2), Sandy Bridge (AVX), Haswell (AVX2) and Xeon Phi KNC (IMCI); series: original, double, single, mixed

Page 25: IPCC @ RWTH Aachen University - umu.se

Impact on a Realistic Simulation (multi-threaded)

Figure: bar chart of atom-timesteps/second (vertical axis 0 to 1·10^7) for Sandy Bridge, Xeon Phi, Haswell and KNL; series: original, double, single

Configuration

Arch.           Model                  Year    Cores
Haswell         2x Xeon E5-2680 v3     2014    24
Sandy Bridge    2x Xeon E5-2450        2012    16
                1x Xeon Phi 5110P      2012    60

Page 26: IPCC @ RWTH Aachen University - umu.se

Dissemination

Oct'15   Code dungeon ✓
         EMEA IPCC meeting, Munich

Nov'15   github.com/HPAC
         Code, tests and benchmarks in progress

Nov'15   Talk + paper at SC'15 Workshop ✓

Dec'15   Code integrated into LAMMPS ✓

Dec'15   IXPUG: Vectorization WG in progress

Page 27: IPCC @ RWTH Aachen University - umu.se

Future Work

PPPM: long-range solver, used in almost any simulation

- Special focus on vectorization
- Particle-to-grid and grid-to-particle steps
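For orientation, the two steps named above can be sketched in one dimension with linear weights. This is a deliberately reduced model with hypothetical function names; a production PPPM solver works in 3D, typically with wider stencils, and this is not the LAMMPS code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Spreads point charges onto a periodic 1D grid with linear weights: each
// charge is shared between its two nearest grid points.
// Every pos[i] is assumed to lie in [0, n_grid * h).
std::vector<double> particle_to_grid(const std::vector<double>& pos,
                                     const std::vector<double>& q,
                                     std::size_t n_grid, double h) {
    assert(pos.size() == q.size() && n_grid > 0 && h > 0.0);
    std::vector<double> rho(n_grid, 0.0);
    for (std::size_t i = 0; i < pos.size(); ++i) {
        const double s = pos[i] / h;   // position in grid units
        const double cell = std::floor(s);
        const double frac = s - cell;  // offset within the cell, in [0, 1)
        const std::size_t left = static_cast<std::size_t>(cell) % n_grid;
        const std::size_t right = (left + 1) % n_grid;
        rho[left] += q[i] * (1.0 - frac);
        rho[right] += q[i] * frac;     // indirect writes; particles may share a cell
    }
    return rho;
}

// Interpolates a grid field back to one particle position with the same weights.
double grid_to_particle(const std::vector<double>& field, double x, double h) {
    assert(!field.empty() && h > 0.0 && x >= 0.0);
    const double s = x / h;
    const double cell = std::floor(s);
    const double frac = s - cell;
    const std::size_t left = static_cast<std::size_t>(cell) % field.size();
    const std::size_t right = (left + 1) % field.size();
    return field[left] * (1.0 - frac) + field[right] * frac;
}
```

The indirect, possibly colliding writes in the spreading loop are one reason this step is harder to vectorize than the read-only interpolation back to the particles.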

AIREBO: complex potential for simulation of hydrocarbons

- Some reuse from Tersoff optimization
- Challenging vectorization: searches
- Challenging vectorization: data-dependent, unlikely branches
- Challenging vectorization: deep loop nests with low trip counts

Page 28: IPCC @ RWTH Aachen University - umu.se