Molecular Dynamics (MD) on GPUs -...

Post on 21-Aug-2020

Feb. 2, 2017

Molecular Dynamics (MD) on GPUs

Accelerating Discoveries

Using a supercomputer powered by the Tesla Platform with over 3,000 Tesla accelerators, University of Illinois scientists performed the first all-atom simulation of the HIV virus and discovered the chemical structure of its capsid, “the perfect target for fighting the infection.”

Without GPUs, the supercomputer would need to be 5x larger for similar performance.

Overview of Life & Material Accelerated Apps

MD: All key codes are GPU-accelerated

Great multi-GPU performance

Focus on dense (up to 16) GPU nodes and/or a large number of GPU nodes

ACEMD*, AMBER (PMEMD)*, BAND, CHARMM, DESMOND, ESPResSo, Folding@Home, GPUgrid.net, GROMACS, HALMD, HOOMD-Blue*, LAMMPS, Lattice Microbes*, mdcore, MELD, miniMD, NAMD, OpenMM, PolyFTS, SOP-GPU* & more

QC: All key codes are ported or optimizing

Focus on using GPU-accelerated math libraries, OpenACC directives

GPU-accelerated and available today:

ABINIT, ACES III, ADF, BigDFT, CP2K, GAMESS, GAMESS-UK, GPAW, LATTE, LSDalton, LSMS, MOLCAS, MOPAC2012, NWChem, OCTOPUS*, PEtot, QUICK, Q-Chem, QMCPack, Quantum Espresso/PWscf, TeraChem*

Active GPU acceleration projects:

CASTEP, GAMESS, Gaussian, ONETEP, Quantum Supercharger Library*, VASP & more

* (green on the slide) = application where >90% of the workload is on the GPU

MD vs. QC on GPUs

“Classical” Molecular Dynamics | Quantum Chemistry (MO, PW, DFT, Semi-Emp)
Simulates positions of atoms over time; chemical-biological or chemical-material behaviors | Calculates electronic properties; ground state, excited states, spectral properties, making/breaking bonds, physical properties
Forces calculated from simple empirical formulas (bond rearrangement generally forbidden) | Forces derived from electron wave function (bond rearrangement OK, e.g., bond energies)
Up to millions of atoms | Up to a few thousand atoms
Solvent included without difficulty | Generally in a vacuum but if needed, solvent treated classically (QM/MM) or using implicit methods
Single precision dominated | Double precision is important
Uses cuBLAS, cuFFT, CUDA | Uses cuBLAS, cuFFT, OpenACC
GeForce (Workstations), Tesla (Servers) | Tesla recommended
ECC off | ECC on
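To make "forces calculated from simple empirical formulas" concrete, here is a minimal Python sketch of one such formula, the Lennard-Jones pair potential. It is an illustration only: the choice of Lennard-Jones, the function name `lennard_jones`, and the reduced units (`epsilon = sigma = 1`) are assumptions for the example, not something taken from any of the MD codes listed in this deck.

```python
def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Return (energy, force) for two atoms separated by distance r.

    U(r) = 4*epsilon*((sigma/r)**12 - (sigma/r)**6)
    F(r) = -dU/dr = 24*epsilon*(2*(sigma/r)**12 - (sigma/r)**6) / r
    A positive force is repulsive, a negative force is attractive.
    """
    sr6 = (sigma / r) ** 6
    sr12 = sr6 * sr6
    energy = 4.0 * epsilon * (sr12 - sr6)
    force = 24.0 * epsilon * (2.0 * sr12 - sr6) / r
    return energy, force


# At r = 2**(1/6) * sigma the potential is at its minimum:
# the energy is -epsilon and the force vanishes.
r_min = 2.0 ** (1.0 / 6.0)
energy, force = lennard_jones(r_min)
print(f"U(r_min) = {energy:.6f}, F(r_min) = {force:.6f}")
```

An MD code evaluates terms like this for a very large number of atom pairs at every time step, which is the kind of regular, data-parallel arithmetic that maps well onto a GPU.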

GPU-Accelerated Molecular Dynamics Apps

ACEMD
AMBER
CHARMM
DESMOND
ESPResSo
Folding@Home
GPUGrid.net
GROMACS
HALMD
HOOMD-Blue
LAMMPS
mdcore
MELD
NAMD
OpenMM
PolyFTS

Green Lettering Indicates Performance Slides Included

GPU Perf compared against dual multi-core x86 CPU socket.

Benefits of MD GPU-Accelerated Computing

• 3x-8x faster on average than CPU-only systems across all tests

• Most major compute-intensive aspects of classical MD ported

• Large performance boost with marginal price increase

• Energy usage cut by more than half

• GPUs scale well within a node and/or over multiple nodes

• K80 GPU is our fastest and lowest-power high-performance GPU yet

Try GPU-accelerated MD apps for free: www.nvidia.com/GPUTestDrive

Why wouldn’t you want to turbocharge your research?

ACEMD

www.acellera.com

470 ns/day on 1 GPU for L-Iduronic acid (1362 atoms)

116 ns/day on 1 GPU for DHFR (23K atoms)

M. Harvey, G. Giupponi and G. De Fabritiis, ACEMD: Accelerated molecular dynamics simulations in the microseconds timescale, J. Chem. Theory Comput. 5, 1632 (2009)

NVT, NPT, PME, TCL, PLUMED, CAMSHIFT [1]

[1] M. J. Harvey and G. De Fabritiis, An implementation of the smooth particle-mesh Ewald (PME) method on GPU hardware, J. Chem. Theory Comput., 5, 2371–2377 (2009)

[2] For a list of selected references see http://www.acellera.com/acemd/publications
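Throughput in ns/day converts directly into the wall-clock time needed for a target simulation length. The sketch below applies that arithmetic to the two ACEMD figures quoted above; the helper name `days_to_simulate` and the 1 µs target are illustrative choices, not part of ACEMD.

```python
def days_to_simulate(target_ns, ns_per_day):
    """Wall-clock days needed to simulate target_ns at a steady ns/day rate."""
    return target_ns / ns_per_day


ONE_MICROSECOND_NS = 1000.0

# Figures quoted for ACEMD on 1 GPU.
dhfr_days = days_to_simulate(ONE_MICROSECOND_NS, 116.0)      # DHFR, 23K atoms
iduronic_days = days_to_simulate(ONE_MICROSECOND_NS, 470.0)  # L-Iduronic acid

print(f"DHFR: {dhfr_days:.1f} days per microsecond")
print(f"L-Iduronic acid: {iduronic_days:.1f} days per microsecond")
```

At 116 ns/day, one microsecond of DHFR takes a little under nine days on a single GPU, which is why the cited paper speaks of simulations "in the microseconds timescale".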

February 2017

AMBER 16

PME-Cellulose_NPT on K80s

Running AMBER version 16.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

PME-Cellulose_NPT, ns/day (speedup over the CPU-only node):

1 Broadwell node: 2.35
1 node + 1x K80 per node: 11.36 (4.8X)
1 node + 2x K80 per node: 15.43 (6.6X)
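The speedup labels on these charts are the ratio of each GPU configuration's ns/day to the CPU-only node's ns/day. A minimal Python sketch of that arithmetic, using the PME-Cellulose_NPT numbers for K80s above (the function name `speedups` and the dictionary keys are illustrative, not part of AMBER):

```python
def speedups(ns_per_day, baseline):
    """Return throughput ratios relative to the baseline entry, rounded to 0.1."""
    base = ns_per_day[baseline]
    return {
        name: round(value / base, 1)
        for name, value in ns_per_day.items()
        if name != baseline
    }


cellulose_npt_k80 = {
    "1 Broadwell node": 2.35,
    "1 node + 1x K80": 11.36,
    "1 node + 2x K80": 15.43,
}

result = speedups(cellulose_npt_k80, "1 Broadwell node")
print(result)
```

The same calculation applies to every benchmark slide that follows; only the throughput values change.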

PME-Cellulose_NPT on P100s PCIe

Running AMBER version 16.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

PME-Cellulose_NPT, ns/day (speedup over the CPU-only node):

1 Broadwell node: 2.35
1 node + 1x P100 PCIe per node: 21.85 (9.3X)
1 node + 2x P100 PCIe per node: 30.00 (12.8X)

PME-Cellulose_NPT on P100s SXM2

Running AMBER version 16.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

PME-Cellulose_NPT, ns/day (speedup over the CPU-only node):

1 Broadwell node: 2.35
1 node + 1x P100 SXM2 per node: 23.37 (9.9X)
1 node + 2x P100 SXM2 per node: 32.22 (13.7X)
1 node + 4x P100 SXM2 per node: 36.65 (15.6X)

PME-Cellulose_NVE on K80s

Running AMBER version 16.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

PME-Cellulose_NVE, ns/day (speedup over the CPU-only node):

1 Broadwell node: 2.47
1 node + 1x K80 per node: 11.85 (4.8X)
1 node + 2x K80 per node: 16.53 (6.7X)

PME-Cellulose_NVE on P100s PCIe

Running AMBER version 16.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

PME-Cellulose_NVE, ns/day (speedup over the CPU-only node):

1 Broadwell node: 2.47
1 node + 1x P100 PCIe per node: 23.34 (9.4X)
1 node + 2x P100 PCIe per node: 32.55 (13.2X)

PME-Cellulose_NVE on P100s SXM2

Running AMBER version 16.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

PME-Cellulose_NVE, ns/day (speedup over the CPU-only node):

1 Broadwell node: 2.47
1 node + 1x P100 SXM2 per node: 24.94 (10.1X)
1 node + 2x P100 SXM2 per node: 35.16 (14.2X)
1 node + 4x P100 SXM2 per node: 40.88 (16.6X)

PME-FactorIX_NPT on K80s

Running AMBER version 16.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

PME-FactorIX_NPT, ns/day (speedup over the CPU-only node):

1 Broadwell node: 11.43
1 node + 1x K80 per node: 48.54 (4.2X)
1 node + 2x K80 per node: 66.68 (5.8X)

PME-FactorIX_NPT on P100s PCIe

Running AMBER version 16.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

PME-FactorIX_NPT, ns/day (speedup over the CPU-only node):

1 Broadwell node: 11.43
1 node + 1x P100 PCIe per node: 98.77 (8.6X)
1 node + 2x P100 PCIe per node: 132.86 (11.6X)

PME-FactorIX_NPT on P100s SXM2

Running AMBER version 16.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

PME-FactorIX_NPT, ns/day (speedup over the CPU-only node):

1 Broadwell node: 11.43
1 node + 1x P100 SXM2 per node: 106.25 (9.3X)
1 node + 2x P100 SXM2 per node: 144.11 (12.6X)
1 node + 4x P100 SXM2 per node: 159.80 (14.0X)

PME-FactorIX_NVE on K80s

Running AMBER version 16.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

PME-FactorIX_NVE, ns/day (speedup over the CPU-only node):

1 Broadwell node: 11.98
1 node + 1x K80 per node: 51.14 (4.3X)
1 node + 2x K80 per node: 71.49 (6.0X)

PME-FactorIX_NVE on P100s PCIe

Running AMBER version 16.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

PME-FactorIX_NVE, ns/day (speedup over the CPU-only node):

1 Broadwell node: 11.98
1 node + 1x P100 PCIe per node: 105.86 (8.8X)
1 node + 2x P100 PCIe per node: 145.83 (12.2X)

PME-FactorIX_NVE on P100s SXM2

Running AMBER version 16.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

PME-FactorIX_NVE, ns/day (speedup over the CPU-only node):

1 Broadwell node: 11.98
1 node + 1x P100 SXM2 per node: 114.88 (9.6X)
1 node + 2x P100 SXM2 per node: 159.24 (13.3X)
1 node + 4x P100 SXM2 per node: 178.02 (14.9X)

PME-JAC_NPT on K80s

Running AMBER version 16.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

PME-JAC_NPT, ns/day (speedup over the CPU-only node):

1 Broadwell node: 45.89
1 node + 1x K80 per node: 162.09 (3.5X)
1 node + 2x K80 per node: 216.78 (4.7X)

PME-JAC_NPT on P100s PCIe

Running AMBER version 16.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

PME-JAC_NPT, ns/day (speedup over the CPU-only node):

1 Broadwell node: 45.89
1 node + 1x P100 PCIe per node: 283.60 (6.2X)
1 node + 2x P100 PCIe per node: 327.69 (7.1X)

PME-JAC_NPT on P100s SXM2

Running AMBER version 16.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

PME-JAC_NPT, ns/day (speedup over the CPU-only node):

1 Broadwell node: 45.89
1 node + 1x P100 SXM2 per node: 310.52 (6.8X)
1 node + 2x P100 SXM2 per node: 360.64 (7.9X)
1 node + 4x P100 SXM2 per node: 423.09 (9.2X)

PME-JAC_NVE on K80s

Running AMBER version 16.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

PME-JAC_NVE, ns/day (speedup over the CPU-only node):

1 Broadwell node: 47.90
1 node + 1x K80 per node: 173.20 (3.6X)
1 node + 2x K80 per node: 234.99 (4.9X)

PME-JAC_NVE on P100s PCIe

Running AMBER version 16.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

PME-JAC_NVE, ns/day (speedup over the CPU-only node):

1 Broadwell node: 47.90
1 node + 1x P100 PCIe per node: 308.46 (6.4X)
1 node + 2x P100 PCIe per node: 363.79 (7.6X)

PME-JAC_NVE on P100s SXM2

Running AMBER version 16.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

PME-JAC_NVE, ns/day (speedup over the CPU-only node):

1 Broadwell node: 47.90
1 node + 1x P100 SXM2 per node: 339.81 (7.1X)
1 node + 2x P100 SXM2 per node: 402.18 (8.4X)
1 node + 4x P100 SXM2 per node: 473.10 (9.9X)

GB-Myoglobin on K80s

Running AMBER version 16.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

GB-Myoglobin, ns/day (speedup over the CPU-only node):

1 Broadwell node: 28.86
1 node + 1x K80 per node: 288.47 (10.0X)
1 node + 2x K80 per node: 339.45 (11.8X)

GB-Myoglobin on P100s PCIe

Running AMBER version 16.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

GB-Myoglobin, ns/day (speedup over the CPU-only node):

1 Broadwell node: 28.86
1 node + 1x P100 PCIe per node: 483.37 (16.7X)
1 node + 4x P100 PCIe per node: 561.94 (19.5X)

GB-Myoglobin on P100s SXM2

Running AMBER version 16.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

GB-Myoglobin, ns/day (speedup over the CPU-only node):

1 Broadwell node: 28.86
1 node + 1x P100 SXM2 per node: 534.28 (18.5X)
1 node + 4x P100 SXM2 per node: 639.37 (22.2X)

GB-Nucleosome on K80s

Running AMBER version 16.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

GB-Nucleosome, ns/day (speedup over the CPU-only node):

1 Broadwell node: 0.40
1 node + 1x K80 per node: 5.84 (14.6X)
1 node + 2x K80 per node: 11.31 (28.3X)
1 node + 4x K80 per node: 20.55 (51.4X)

GB-Nucleosome on P100s PCIe

Running AMBER version 16.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

GB-Nucleosome, ns/day (speedup over the CPU-only node):

1 Broadwell node: 0.40
1 node + 1x P100 PCIe per node: 11.91 (29.8X)
1 node + 2x P100 PCIe per node: 22.77 (56.9X)
1 node + 4x P100 PCIe per node: 39.91 (99.8X)
1 node + 8x P100 PCIe per node: 45.92 (114.8X)

GB-Nucleosome on P100s SXM2

Running AMBER version 16.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

GB-Nucleosome, ns/day (speedup over the CPU-only node):

1 Broadwell node: 0.40
1 node + 1x P100 SXM2 per node: 13.36 (33.4X)
1 node + 2x P100 SXM2 per node: 25.53 (63.8X)
1 node + 4x P100 SXM2 per node: 46.29 (115.7X)
1 node + 8x P100 SXM2 per node: 48.29 (120.7X)
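Speedup over the CPU node does not show how well a run scales from one GPU to several. One common way to look at that is parallel efficiency: throughput on n GPUs divided by n times the single-GPU throughput. The Python sketch below applies it to the GB-Nucleosome P100 SXM2 numbers above; the function name `scaling_efficiency` is illustrative, and this metric is an addition for the example, not something reported on the slides.

```python
def scaling_efficiency(ns_per_day_by_gpus):
    """Map GPU count -> fraction of ideal (linear) scaling from the 1-GPU result."""
    one_gpu = ns_per_day_by_gpus[1]
    return {n: value / (n * one_gpu) for n, value in ns_per_day_by_gpus.items()}


# GB-Nucleosome on P100 SXM2: GPU count -> ns/day.
nucleosome_sxm2 = {1: 13.36, 2: 25.53, 4: 46.29, 8: 48.29}

efficiency = scaling_efficiency(nucleosome_sxm2)
for n, e in sorted(efficiency.items()):
    print(f"{n} GPU(s): {e:.0%} of ideal")
```

For this benchmark the efficiency stays above 85% up to 4 GPUs and drops below 50% at 8 GPUs, so most of the benefit is already reached with 4 GPUs per node.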

Rubisco-75K on K80s

Running AMBER version 16.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Rubisco-75K, ns/day (speedup over the CPU-only node):

1 Broadwell node: 0.01
1 node + 1x K80 per node: 0.35 (35.0X)
1 node + 2x K80 per node: 0.69 (69.0X)
1 node + 4x K80 per node: 1.34 (134.0X)

Rubisco-75K on P100s PCIe

Running AMBER version 16.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Rubisco-75K, ns/day (speedup over the CPU-only node):

1 Broadwell node: 0.01
1 node + 1x P100 PCIe per node: 0.71 (71.0X)
1 node + 2x P100 PCIe per node: 1.40 (140.0X)
1 node + 4x P100 PCIe per node: 2.69 (269.0X)
1 node + 8x P100 PCIe per node: 4.20 (420.0X)

Rubisco-75K on P100s SXM2

Running AMBER version 16.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Rubisco-75K, ns/day (speedup over the CPU-only node):

1 Broadwell node: 0.01
1 node + 1x P100 SXM2 per node: 0.80 (80.0X)
1 node + 2x P100 SXM2 per node: 1.57 (157.0X)
1 node + 4x P100 SXM2 per node: 3.06 (306.0X)
1 node + 8x P100 SXM2 per node: 4.46 (446.0X)

AMBER 14

AMBER 14 vs. AMBER 12

Courtesy of Scott Le Grand, from GTC 2014 presentation

AMBER 14: large P2P and small Boost Clocks impacts

AMBER 14 (ns/day) on 4x K40; P2P and Boost Clocks Impact

DHFR NVE PME, 2fs Benchmark (CUDA 6.0, ECC off)

2 x Xeon E5-2690 v2@3.00GHz + 4 x Tesla K40@745MHz (No Boost, No P2P): 125.77
2 x Xeon E5-2690 v2@3.00GHz + 4 x Tesla K40@875MHz (Boost, No P2P): 132.97
2 x Xeon E5-2690 v2@3.00GHz + 4 x Tesla K40@745MHz (No Boost, P2P): 196.68
2 x Xeon E5-2690 v2@3.00GHz + 4 x Tesla K40@875MHz (Boost, P2P): 215.18
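The slide's headline (large P2P impact, small boost-clock impact) can be checked with a few lines of arithmetic on the four ns/day values above. The variable names below are illustrative; the numbers are the ones reported for the DHFR NVE benchmark on 4x K40.

```python
# ns/day on 4x Tesla K40, DHFR NVE PME benchmark.
baseline = 125.77    # 745 MHz, no P2P
boost_only = 132.97  # 875 MHz, no P2P
p2p_only = 196.68    # 745 MHz, P2P
both = 215.18        # 875 MHz, P2P

# Relative gain of each setting over the baseline.
boost_gain = boost_only / baseline - 1.0
p2p_gain = p2p_only / baseline - 1.0
combined_gain = both / baseline - 1.0

print(f"Boost clocks only: +{boost_gain:.1%}")
print(f"P2P only:          +{p2p_gain:.1%}")
print(f"Boost + P2P:       +{combined_gain:.1%}")
```

Enabling peer-to-peer transfers alone raises throughput by roughly 56%, while raising the clock from 745 MHz to 875 MHz alone adds under 6%.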

AMBER Performance Over Time

Courtesy of Scott Le Grand, from GTC 2014 presentation

Cellulose on K40s, K80s and M6000s

Running AMBER version 14

The blue node contains Dual Intel E5-2698 v3@2.3GHz, 3.6GHz Turbo CPUs

The green nodes contain Dual Intel E5-2698 v3@2.3GHz, 3.6GHz Turbo CPUs + either NVIDIA Tesla K40@875MHz, Tesla K80@562MHz (autoboost), or Quadro M6000@987MHz GPUs

PME-Cellulose_NVE, simulated time in ns/day (speedup over the CPU-only node):

1 Haswell node: 1.93
1 CPU node + 1x K40: 8.96 (4.6X)
1 CPU node + 0.5x K80: 7.87 (4.1X)
1 CPU node + 1x K80: 11.76 (6.1X)
1 CPU node + 1x M6000: 10.49 (5.4X)
1 CPU node + 2x K40: 13.67 (7.1X)
1 CPU node + 2x K80: 15.38 (8.0X)
1 CPU node + 2x M6000: 14.90 (7.7X)

Factor IX on K40s, K80s and M6000s

Running AMBER version 14

The blue node contains Dual Intel E5-2698 v3@2.3GHz, 3.6GHz Turbo CPUs

The green nodes contain Dual Intel E5-2698 v3@2.3GHz, 3.6GHz Turbo CPUs + either NVIDIA Tesla K40@875MHz, Tesla K80@562MHz (autoboost), or Quadro M6000@987MHz GPUs

PME-FactorIX_NVE, simulated time in ns/day (speedup over the CPU-only node):

1 Haswell node: 9.68
1 CPU node + 1x K40: 40.48 (4.2X)
1 CPU node + 0.5x K80: 33.59 (3.5X)
1 CPU node + 1x K80: 50.70 (5.2X)
1 CPU node + 1x M6000: 47.80 (5.0X)
1 CPU node + 2x K40: 61.18 (6.4X)
1 CPU node + 2x K80: 60.93 (6.3X)
1 CPU node + 2x M6000: 66.89 (7.0X)

JAC on K40s, K80s and M6000s

Running AMBER version 14

The blue node contains Dual Intel E5-2698 v3@2.3GHz, 3.6GHz Turbo CPUs

The green nodes contain Dual Intel E5-2698 v3@2.3GHz, 3.6GHz Turbo CPUs + either NVIDIA Tesla K40@875MHz, Tesla K80@562MHz (autoboost), or Quadro M6000@987MHz GPUs

PME-JAC_NVE, simulated time in ns/day (speedup over the CPU-only node):

1 Haswell node: 37.38
1 CPU node + 1x K40: 134.82 (3.6X)
1 CPU node + 0.5x K80: 121.30 (3.2X)
1 CPU node + 1x K80: 174.34 (4.7X)
1 CPU node + 1x M6000: 161.53 (4.3X)
1 CPU node + 2x K40: 200.34 (5.4X)
1 CPU node + 2x K80: 225.34 (6.0X)
1 CPU node + 2x M6000: 219.83 (5.9X)

Cellulose on M40s

Running AMBER version 14

The blue node contains Single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPUs

The green nodes contain Single Intel Xeon E5-2697 v2@2.70GHz (IvyBridge) CPUs + Tesla M40 (autoboost) GPUs

PME - Cellulose_NPT, simulated time in ns/day (speedup over the CPU-only node):

1 Node: 1.07
1 Node + 1x M40 per node: 10.12 (9.5X)
1 Node + 2x M40 per node: 14.40 (13.5X)
1 Node + 4x M40 per node: 15.90 (14.9X)

Cellulose on M40s

Running AMBER version 14

The blue node contains Single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPUs

The green nodes contain Single Intel Xeon E5-2697 v2@2.70GHz (IvyBridge) CPUs + Tesla M40 (autoboost) GPUs

PME - Cellulose_NVE, simulated time in ns/day (speedup over the CPU-only node):

1 Node: 1.07
1 Node + 1x M40 per node: 10.50 (9.8X)
1 Node + 2x M40 per node: 15.41 (14.4X)
1 Node + 4x M40 per node: 17.13 (16.0X)

FactorIX on M40s

Running AMBER version 14

The blue node contains Single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPUs

The green nodes contain Single Intel Xeon E5-2697 v2@2.70GHz (IvyBridge) CPUs + Tesla M40 (autoboost) GPUs

PME - FactorIX_NPT, simulated time in ns/day (speedup over the CPU-only node):

1 Node: 5.38
1 Node + 1x M40 per node: 46.90 (8.7X)
1 Node + 2x M40 per node: 67.37 (12.5X)
1 Node + 4x M40 per node: 72.96 (13.6X)

FactorIX on M40s

Running AMBER version 14

The blue node contains Single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPUs

The green nodes contain Single Intel Xeon E5-2697 v2@2.70GHz (IvyBridge) CPUs + Tesla M40 (autoboost) GPUs

PME - FactorIX_NVE, simulated time in ns/day (speedup over the CPU-only node):

1 Node: 5.47
1 Node + 1x M40 per node: 49.33 (9.0X)
1 Node + 2x M40 per node: 73.00 (13.3X)
1 Node + 4x M40 per node: 80.04 (14.6X)

JAC on M40s

Running AMBER version 14

The blue node contains Single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPUs

The green nodes contain Single Intel Xeon E5-2697 v2@2.70GHz (IvyBridge) CPUs + Tesla M40 (autoboost) GPUs

PME - JAC_NPT, simulated time in ns/day (speedup over the CPU-only node):

1 Node: 20.88
1 Node + 1x M40 per node: 149.40 (7.2X)
1 Node + 2x M40 per node: 211.97 (10.2X)
1 Node + 4x M40 per node: 226.63 (10.9X)

JAC on M40s

Running AMBER version 14

The blue node contains Single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPUs

The green nodes contain Single Intel Xeon E5-2697 v2@2.70GHz (IvyBridge) CPUs + Tesla M40 (autoboost) GPUs

PME - JAC_NVE, simulated time in ns/day (speedup over the CPU-only node):

1 Node: 21.11
1 Node + 1x M40 per node: 157.68 (7.5X)
1 Node + 2x M40 per node: 230.18 (10.9X)
1 Node + 4x M40 per node: 246.15 (11.7X)

Myoglobin on M40s

Running AMBER version 14

The blue node contains Single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPUs

The green nodes contain Single Intel Xeon E5-2697 v2@2.70GHz (IvyBridge) CPUs + Tesla M40 (autoboost) GPUs

GB - Myoglobin, simulated time in ns/day (speedup over the CPU-only node):

1 Node: 9.83
1 Node + 1x M40 per node: 232.20 (23.6X)
1 Node + 2x M40 per node: 300.86 (30.6X)
1 Node + 4x M40 per node: 322.09 (32.8X)

Nucleosome on M40s

Running AMBER version 14

The blue node contains Single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPUs

The green nodes contain Single Intel Xeon E5-2697 v2@2.70GHz (IvyBridge) CPUs + Tesla M40 (autoboost) GPUs

GB - Nucleosome, simulated time in ns/day (speedup over the CPU-only node):

1 Node: 0.13
1 Node + 1x M40 per node: 4.67 (35.9X)
1 Node + 2x M40 per node: 9.05 (69.6X)
1 Node + 4x M40 per node: 16.11 (123.9X)

TrpCage on M40s

Running AMBER version 14

The blue node contains Single Intel Xeon E5-2698 v3@2.30GHz (Haswell) CPUs

The green nodes contain Single Intel Xeon E5-2697 v2@2.70GHz (IvyBridge) CPUs + Tesla M40 (autoboost) GPUs

GB - TrpCage, simulated time in ns/day (speedup over the CPU-only node):

1 Node: 408.88
1 Node + 1x M40 per node: 831.91 (2.03X)
1 Node + 2x M40 per node: 551.36 (1.3X)
1 Node + 4x M40 per node: 464.63 (1.1X)
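TrpCage is the one benchmark in this set where adding GPUs lowers throughput, so the GPU count is worth choosing from measured data rather than assuming more is better. A small Python sketch of that selection, using the TrpCage M40 numbers above (the variable names are illustrative):

```python
# GB - TrpCage on M40s: GPUs per node -> ns/day (0 = CPU-only node).
trpcage_m40 = {0: 408.88, 1: 831.91, 2: 551.36, 4: 464.63}

# Pick the configuration with the highest measured throughput.
best_gpus = max(trpcage_m40, key=trpcage_m40.get)
best_rate = trpcage_m40[best_gpus]

print(f"Best configuration: {best_gpus} GPU(s) at {best_rate} ns/day")
```

For this system a single M40 gives the highest throughput, and the 2-GPU and 4-GPU runs are both slower than it.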

Recommended GPU Node Configuration for AMBER Computational Chemistry

Workstation or Single Node Configuration

# of CPU sockets: 2
Cores per CPU socket: 6+ (1 CPU core drives 1 GPU)
CPU speed (GHz): 2.66+
System memory per node (GB): 16
GPUs: Kepler K20, K40, K80; Pascal P100
# of GPUs per CPU socket: 1-4
GPU memory preference (GB): 6
GPU to CPU connection: PCIe 3.0 16x or higher
Server storage: 2 TB
Network configuration: Infiniband QDR or better

Scale to multiple nodes with same single node configuration

July 2016

CHARMM DOMDEC-GUI

CHARMM DOMDEC-GUI 465 K System Benchmark

Running CHARMM version c40a1

The blue node contains Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs

Benchmarks were run with the STANDARD CHARMM c40a1 version by the Yang group (FSU), which is responsible for any benchmarking error.

465 K System (Her1_HER1_membrane), ns/day (speedup over the CPU-only node; higher is better):

1 Haswell node: 0.36
1 node + 1x K80 per node: 2.15 (6.0X)

CHARMM DOMDEC-GUI 534 K System Benchmark

Running CHARMM version c40a1

The blue node contains Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs

Benchmarks were run with the STANDARD CHARMM c40a1 version by the Yang group (FSU), which is responsible for any benchmarking error.

534 K System (POPC_PSPC_CHL1mixture), ns/day (speedup over the CPU-only node; higher is better):

1 Haswell node: 0.18
1 node + 1x K80 per node: 1.43 (8.0X)

CHARMM DOMDEC-GUI 20 K System Benchmark

Running CHARMM version c40a1

The blue node contains Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs + Tesla M40 GPUs

Benchmarks were run with the STANDARD CHARMM c40a1 version by the Yang group (FSU), which is responsible for any benchmarking error.

20 K System (Crambin), ns/day (speedup over the CPU-only node; higher is better):

1 Haswell node: 16.00
1 node + 1x M40 per node: 59.68 (3.7X)

CHARMM DOMDEC-GUI 61 K System Benchmark

Running CHARMM version c40a1

The blue node contains Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs + Tesla M40 GPUs

Benchmarks were run with the STANDARD CHARMM c40a1 version by the Yang group (FSU), which is responsible for any benchmarking error.

61 K System (GlnBP), ns/day (speedup over the CPU-only node; higher is better):

1 Haswell node: 3.90
1 node + 1x M40 per node: 25.08 (6.4X)

CHARMM DOMDEC-GUI 465 K System Benchmark

Running CHARMM version c40a1

The blue node contains Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v3@2.30 GHz (Haswell) CPUs + Tesla M40 GPUs

Benchmarks were run with the STANDARD CHARMM c40a1 version by the Yang group (FSU), which is responsible for any benchmarking error.

465 K System (Her1_HER1_membrane), ns/day (speedup over the CPU-only node; higher is better):

1 Haswell node: 0.36
1 node + 1x M40 per node: 2.27 (6.3X)

October 2016

GROMACS 2016

Erik Lindahl (GROMACS developer) video

Water 1.5M on K80s

Running GROMACS version 2016

The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.

Water 1.5M, ns/day (speedup over the CPU-only node):

1 Broadwell node: 2.79
1 node + 2x K80 per node: 5.22 (1.9X)
1 node + 4x K80 per node: 6.14 (2.2X)

Water 3M on K80s

Running GROMACS version 2016

The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

Water 3M, ns/day (speedup over the CPU-only node):

1 Broadwell node: 1.32
1 node + 2x K80 per node: 2.66 (2.0X)
1 node + 4x K80 per node: 3.05 (2.3X)

Water 1.5M on M40s

Running GROMACS version 2016

The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla M40 (autoboost) GPUs

Water 1.5M, ns/day (speedup over the CPU-only node):

1 Broadwell node: 2.79
1 node + 2x M40 per node: 6.15 (2.2X)
1 node + 4x M40 per node: 7.60 (2.7X)

Water 3M on M40s

Running GROMACS version 2016

The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla M40 (autoboost) GPUs

Water 3M, ns/day (speedup over the CPU-only node):

1 Broadwell node: 1.32
1 node + 2x M40 per node: 2.97 (2.3X)
1 node + 4x M40 per node: 3.94 (3.0X)

Water 1.5M on P40s

Running GROMACS version 2016

The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P40 GPUs

Water 1.5M, ns/day (speedup over the CPU-only node):

1 Broadwell node: 2.79
1 node + 2x P40 per node: 6.60 (2.4X)
1 node + 4x P40 per node: 8.07 (2.9X)

Water 3M on P40s

Running GROMACS version 2016

The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P40 GPUs

Water 3M, ns/day (speedup over the CPU-only node):

1 Broadwell node: 1.32
1 node + 2x P40 per node: 3.36 (2.5X)
1 node + 4x P40 per node: 4.19 (3.2X)

Water 1.5M on P100 PCIes

Running GROMACS version 2016

The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

Water 1.5M, ns/day (speedup over the CPU-only node):

1 Broadwell node: 2.79
1 node + 2x P100 PCIe per node: 6.34 (2.3X)
1 node + 4x P100 PCIe per node: 7.11 (2.5X)

70

Water 3M on P100 PCIes

Running GROMACS version 2016

The blue node contains Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

Water 3M, ns/day (speedup vs. 1 Broadwell node):
1 Broadwell node: 1.32
1 node + 2x P100 PCIe per node: 3.16 (2.4X)
1 node + 4x P100 PCIe per node: 3.43 (2.6X)

February 2017

GROMACS 5.1.2

72

Water 1.5M on K80s

Running GROMACS version 5.1.2

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Water 1.5M, ns/day (speedup vs. 1 Broadwell node):
1 Broadwell node: 3.04
1 node + 1x K80 per node: 3.49 (1.1X)
1 node + 2x K80 per node: 5.75 (1.9X)

73

Water 1.5M on P100s PCIe

Running GROMACS version 5.1.2

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Water 1.5M, ns/day (speedup vs. 1 Broadwell node):
1 Broadwell node: 3.04
1 node + 1x P100 PCIe per node: 4.39 (1.4X)
1 node + 2x P100 PCIe per node: 6.96 (2.3X)
1 node + 4x P100 PCIe per node: 7.21 (2.4X)

74

Water 1.5M on P100s SXM2

Running GROMACS version 5.1.2

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Water 1.5M, ns/day (speedup vs. 1 Broadwell node):
1 Broadwell node: 3.04
1 node + 1x P100 SXM2 per node: 4.11 (1.4X)
1 node + 2x P100 SXM2 per node: 6.70 (2.2X)
1 node + 4x P100 SXM2 per node: 7.18 (2.4X)
1 node + 8x P100 SXM2 per node: 7.88 (2.6X)

75

Water 3M on K80s

Running GROMACS version 5.1.2

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Water 3M, ns/day (speedup vs. 1 Broadwell node):
1 Broadwell node: 1.38
1 node + 1x K80 per node: 1.59 (1.2X)
1 node + 2x K80 per node: 2.98 (2.2X)

76

Water 3M on P100s PCIe

Running GROMACS version 5.1.2

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Water 3M, ns/day (speedup vs. 1 Broadwell node):
1 Broadwell node: 1.38
1 node + 1x P100 PCIe per node: 1.96 (1.4X)
1 node + 2x P100 PCIe per node: 3.43 (2.5X)
1 node + 4x P100 PCIe per node: 3.80 (2.8X)

77

Water 3M on P100s SXM2

Running GROMACS version 5.1.2

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Water 3M, ns/day (speedup vs. 1 Broadwell node):
1 Broadwell node: 1.38
1 node + 1x P100 SXM2 per node: 1.84 (1.3X)
1 node + 2x P100 SXM2 per node: 3.50 (2.5X)
1 node + 4x P100 SXM2 per node: 3.82 (2.8X)

78

Recommended GPU Node Configuration for GROMACS Computational Chemistry

Workstation or Single Node Configuration

# of CPU sockets: 2
Cores per CPU socket: 6+
CPU speed (GHz): 2.66+
System memory per socket (GB): 32
GPUs: Kepler K20, K40, K80
# of GPUs per CPU socket: 1x (Kepler GPUs need fast Sandy Bridge or Ivy Bridge, or high-end AMD Opterons)
GPU memory preference (GB): 6
GPU to CPU connection: PCIe 3.0 or higher
Server storage: 500 GB or higher
Network configuration: Gemini, InfiniBand
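One way to use the recommended configuration is as a checklist against a candidate node. The sketch below is a hedged illustration only: `NodeSpec` and `unmet_recommendations` are hypothetical names, the example node is invented, and treating every row (including "2" sockets and "32" GB) as a minimum is an assumption the slide does not state.

```python
from dataclasses import dataclass


@dataclass
class NodeSpec:
    """Hypothetical description of a single node (not a GROMACS API)."""
    cpu_sockets: int
    cores_per_socket: int
    cpu_ghz: float
    memory_gb_per_socket: int
    gpus_per_socket: int
    gpu_memory_gb: int


def unmet_recommendations(node):
    """Return the table rows that the node falls short of.

    Assumption: each recommended value is read as a lower bound.
    """
    checks = [
        ("# of CPU sockets", node.cpu_sockets >= 2),
        ("Cores per CPU socket", node.cores_per_socket >= 6),
        ("CPU speed (GHz)", node.cpu_ghz >= 2.66),
        ("System memory per socket (GB)", node.memory_gb_per_socket >= 32),
        ("# of GPUs per CPU socket", node.gpus_per_socket >= 1),
        ("GPU memory preference (GB)", node.gpu_memory_gb >= 6),
    ]
    return [row for row, ok in checks if not ok]


# Invented example: a quad-core workstation with one 12 GB GPU per socket.
example = NodeSpec(cpu_sockets=2, cores_per_socket=4, cpu_ghz=3.0,
                   memory_gb_per_socket=32, gpus_per_socket=1,
                   gpu_memory_gb=12)
print(unmet_recommendations(example))
```

The GPU model, interconnect and storage rows are left out because they are qualitative and would need site-specific rules.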

February 2017

HOOMD-Blue 1.3.3

80

lj-liquid on K80s

Running HOOMD-Blue version 1.3.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

lj-liquid, avg timesteps/sec (speedup vs. 1 Broadwell node):
1 Broadwell node: 326.52
1 node + 1x K80 per node: 1324.84 (4.1X)
1 node + 2x K80 per node: 1594.37 (4.9X)
1 node + 4x K80 per node: 1942.12 (5.9X)

81

lj-liquid on P100s PCIe

Running HOOMD-Blue version 1.3.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

lj-liquid, avg timesteps/sec (speedup vs. 1 Broadwell node):
1 Broadwell node: 326.52
1 node + 1x P100 PCIe per node: 2912.66 (8.9X)
1 node + 8x P100 PCIe per node: 3217.68 (9.9X)

82

lj-liquid on P100s SXM2

Running HOOMD-Blue version 1.3.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

lj-liquid, avg timesteps/sec (speedup vs. 1 Broadwell node):
1 Broadwell node: 326.52
1 node + 1x P100 SXM2 per node: 3129.11 (9.6X)
1 node + 8x P100 SXM2 per node: 3397.74 (10.4X)

83

lj_liquid_512k on K80s

Running HOOMD-Blue version 1.3.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

lj_liquid_512k, avg timesteps/sec (speedup vs. 1 Broadwell node):
1 Broadwell node: 43.43
1 node + 1x K80 per node: 220.10 (5.1X)
1 node + 2x K80 per node: 334.59 (7.7X)
1 node + 4x K80 per node: 526.47 (12.1X)

84

lj_liquid_512k on P100s PCIe

Running HOOMD-Blue version 1.3.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

lj_liquid_512k, avg timesteps/sec (speedup vs. 1 Broadwell node):
1 Broadwell node: 43.43
1 node + 1x P100 PCIe per node: 398.12 (9.2X)
1 node + 2x P100 PCIe per node: 534.54 (12.3X)
1 node + 4x P100 PCIe per node: 770.18 (17.7X)
1 node + 8x P100 PCIe per node: 1045.50 (24.1X)

85

lj_liquid_512k on P100s SXM2

Running HOOMD-Blue version 1.3.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

lj_liquid_512k, avg timesteps/sec (speedup vs. 1 Broadwell node):
1 Broadwell node: 43.43
1 node + 1x P100 SXM2 per node: 443.74 (10.2X)
1 node + 2x P100 SXM2 per node: 568.51 (13.1X)
1 node + 4x P100 SXM2 per node: 793.36 (18.3X)
1 node + 8x P100 SXM2 per node: 1119.76 (25.8X)
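The multi-GPU bars also show how far each benchmark is from linear scaling. As a rough sketch, the snippet below takes the lj_liquid_512k timesteps/sec for 1, 2, 4 and 8 P100 SXM2 GPUs (443.74, 568.51, 793.36, 1119.76) and computes parallel efficiency relative to one GPU. The efficiency definition T(n) / (n * T(1)) is a common convention assumed here; it is not a metric reported on the slides, and `parallel_efficiency` is an illustrative name.

```python
# avg timesteps/sec for lj_liquid_512k on P100 SXM2, keyed by GPUs per node.
steps_per_sec = {1: 443.74, 2: 568.51, 4: 793.36, 8: 1119.76}


def parallel_efficiency(throughput):
    """Efficiency = T(n) / (n * T(1)); 1.0 would be perfect linear scaling."""
    single = throughput[1]
    return {n: t / (n * single) for n, t in throughput.items()}


# Report efficiency as a percentage of ideal scaling for each GPU count.
for gpus, eff in parallel_efficiency(steps_per_sec).items():
    print(f"{gpus} GPU(s): {eff:.0%} of linear scaling")
```

Under this definition, throughput keeps rising with GPU count while per-GPU efficiency falls, which is the usual pattern for a fixed-size (strong-scaling) benchmark.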

86

lj_liquid_1m on K80s

Running HOOMD-Blue version 1.3.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

lj_liquid_1m, avg timesteps/sec (speedup vs. 1 Broadwell node):
1 Broadwell node: 22.07
1 node + 1x K80 per node: 109.54 (5.0X)
1 node + 2x K80 per node: 181.42 (8.2X)
1 node + 4x K80 per node: 303.00 (13.7X)

87

lj_liquid_1m on P100s PCIe

Running HOOMD-Blue version 1.3.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

lj_liquid_1m, avg timesteps/sec (speedup vs. 1 Broadwell node):
1 Broadwell node: 22.07
1 node + 1x P100 PCIe per node: 204.67 (9.3X)
1 node + 2x P100 PCIe per node: 294.88 (13.4X)
1 node + 4x P100 PCIe per node: 465.58 (21.1X)
1 node + 8x P100 PCIe per node: 672.46 (30.5X)

88

lj_liquid_1m on P100s SXM2

Running HOOMD-Blue version 1.3.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

lj_liquid_1m, avg timesteps/sec (speedup vs. 1 Broadwell node):
1 Broadwell node: 22.07
1 node + 1x P100 SXM2 per node: 221.02 (10.0X)
1 node + 2x P100 SXM2 per node: 315.07 (14.3X)
1 node + 4x P100 SXM2 per node: 488.04 (22.1X)
1 node + 8x P100 SXM2 per node: 707.73 (32.1X)

89

Microsphere on K80s

Running HOOMD-Blue version 1.3.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

microsphere, avg timesteps/sec (speedup vs. 1 Broadwell node):
1 Broadwell node: 17.53
1 node + 1x K80 per node: 64.87 (3.7X)
1 node + 2x K80 per node: 98.43 (5.6X)
1 node + 4x K80 per node: 166.74 (9.5X)

90

Microsphere on P100s PCIe

Running HOOMD-Blue version 1.3.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

microsphere, avg timesteps/sec (speedup vs. 1 Broadwell node):
1 Broadwell node: 17.53
1 node + 1x P100 PCIe per node: 145.71 (8.3X)
1 node + 2x P100 PCIe per node: 179.54 (10.2X)
1 node + 4x P100 PCIe per node: 257.58 (14.7X)
1 node + 8x P100 PCIe per node: 371.24 (21.2X)

91

Microsphere on P100s SXM2

Running HOOMD-Blue version 1.3.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

microsphere, avg timesteps/sec (speedup vs. 1 Broadwell node):
1 Broadwell node: 17.53
1 node + 1x P100 SXM2 per node: 151.51 (8.6X)
1 node + 2x P100 SXM2 per node: 186.01 (10.6X)
1 node + 4x P100 SXM2 per node: 271.21 (15.5X)
1 node + 8x P100 SXM2 per node: 384.72 (21.9X)

92

Polymer on K80s

Running HOOMD-Blue version 1.3.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

polymer, avg timesteps/sec (speedup vs. 1 Broadwell node):
1 Broadwell node: 362.19
1 node + 1x K80 per node: 975.14 (2.7X)
1 node + 2x K80 per node: 1209.45 (3.3X)
1 node + 4x K80 per node: 1518.99 (4.2X)

93

Polymer on P100s PCIe

Running HOOMD-Blue version 1.3.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

polymer, avg timesteps/sec (speedup vs. 1 Broadwell node):
1 Broadwell node: 362.19
1 node + 1x P100 PCIe per node: 1999.64 (5.5X)
1 node + 4x P100 PCIe per node: 2143.15 (5.9X)
1 node + 8x P100 PCIe per node: 2480.70 (6.8X)

94

Polymer on P100s SXM2

Running HOOMD-Blue version 1.3.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

polymer, avg timesteps/sec (speedup vs. 1 Broadwell node):
1 Broadwell node: 362.19
1 node + 1x P100 SXM2 per node: 2111.99 (5.8X)
1 node + 4x P100 SXM2 per node: 2272.27 (6.3X)
1 node + 8x P100 SXM2 per node: 2651.56 (7.3X)

95

Quasicrystal on K80s

Running HOOMD-Blue version 1.3.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

quasicrystal, avg timesteps/sec (speedup vs. 1 Broadwell node):
1 Broadwell node: 78.32
1 node + 1x K80 per node: 502.53 (6.4X)
1 node + 2x K80 per node: 767.90 (9.8X)
1 node + 4x K80 per node: 1280.44 (16.3X)

96

Quasicrystal on P100s PCIe

Running HOOMD-Blue version 1.3.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

quasicrystal, avg timesteps/sec (speedup vs. 1 Broadwell node):
1 Broadwell node: 78.32
1 node + 1x P100 PCIe per node: 851.29 (10.9X)
1 node + 2x P100 PCIe per node: 1199.64 (15.3X)
1 node + 4x P100 PCIe per node: 1791.41 (22.9X)
1 node + 8x P100 PCIe per node: 2261.72 (28.9X)

97

Quasicrystal on P100s SXM2

Running HOOMD-Blue version 1.3.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

quasicrystal, avg timesteps/sec (speedup vs. 1 Broadwell node):
1 Broadwell node: 78.32
1 node + 1x P100 SXM2 per node: 939.53 (12.0X)
1 node + 2x P100 SXM2 per node: 1249.90 (16.0X)
1 node + 4x P100 SXM2 per node: 1940.29 (24.8X)
1 node + 8x P100 SXM2 per node: 2429.68 (31.0X)

98

Triblock-copolymer on K80s

Running HOOMD-Blue version 1.3.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

triblock-copolymer, avg timesteps/sec (speedup vs. 1 Broadwell node):
1 Broadwell node: 361.42
1 node + 1x K80 per node: 953.01 (2.6X)
1 node + 2x K80 per node: 1170.47 (3.2X)
1 node + 4x K80 per node: 1492.01 (4.1X)

99

Triblock-copolymer on P100s PCIe

Running HOOMD-Blue version 1.3.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

triblock-copolymer, avg timesteps/sec (speedup vs. 1 Broadwell node):
1 Broadwell node: 361.42
1 node + 1x P100 PCIe per node: 1999.14 (5.5X)
1 node + 4x P100 PCIe per node: 2155.27 (6.0X)
1 node + 8x P100 PCIe per node: 2456.09 (6.8X)

100

Triblock-copolymer on P100s SXM2

Running HOOMD-Blue version 1.3.3

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

triblock-copolymer, avg timesteps/sec (speedup vs. 1 Broadwell node):
1 Broadwell node: 361.42
1 node + 1x P100 SXM2 per node: 2132.92 (5.9X)
1 node + 4x P100 SXM2 per node: 2253.83 (6.2X)
1 node + 8x P100 SXM2 per node: 2587.91 (7.2X)

February 2017

LAMMPS 2016

102

Atomic-Fluid Lennard-Jones 2.5 Cutoff on K80s

Running LAMMPS version 2016

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

Atomic-Fluid Lennard-Jones 2.5 Cutoff, 1/seconds (speedup vs. 1 Broadwell node):
1 Broadwell node: 0.37
1 node + 2x K80 per node: 0.57 (1.5X)

103

Atomic-Fluid Lennard-Jones 2.5 Cutoff on P100s PCIe

Running LAMMPS version 2016

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe (autoboost) GPUs

Atomic-Fluid Lennard-Jones 2.5 Cutoff, 1/seconds (speedup vs. 1 Broadwell node):
1 Broadwell node: 0.37
1 node + 2x P100 PCIe per node: 0.62 (1.7X)

104

Atomic-Fluid Lennard-Jones 2.5 Cutoff on P100s SXM2

Running LAMMPS version 2016

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 (autoboost) GPUs

Atomic-Fluid Lennard-Jones 2.5 Cutoff, 1/seconds (speedup vs. 1 Broadwell node):
1 Broadwell node: 0.37
1 node + 2x P100 SXM2 per node: 0.64 (1.7X)

105

Atomic-Fluid Lennard-Jones 5.0 Cutoff on K80s

Running LAMMPS version 2016

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Atomic-Fluid Lennard-Jones 5.0 Cutoff, 1/seconds (speedup vs. 1 Broadwell node):
1 Broadwell node: 0.10
1 node + 1x K80 per node: 0.14 (1.4X)
1 node + 2x K80 per node: 0.26 (2.6X)
1 node + 4x K80 per node: 0.36 (3.6X)

106

Atomic-Fluid Lennard-Jones 5.0 Cutoff on P100s PCIe

Running LAMMPS version 2016

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Atomic-Fluid Lennard-Jones 5.0 Cutoff, 1/seconds (speedup vs. 1 Broadwell node):
1 Broadwell node: 0.10
1 node + 1x P100 PCIe per node: 0.22 (2.2X)
1 node + 2x P100 PCIe per node: 0.35 (3.5X)
1 node + 4x P100 PCIe per node: 0.37 (3.7X)
1 node + 8x P100 PCIe per node: 0.38 (3.8X)

107

Atomic-Fluid Lennard-Jones 5.0 Cutoff on P100s SXM2

Running LAMMPS version 2016

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Atomic-Fluid Lennard-Jones 5.0 Cutoff, 1/seconds (speedup vs. 1 Broadwell node):
1 Broadwell node: 0.10
1 node + 1x P100 SXM2 per node: 0.22 (2.2X)
1 node + 2x P100 SXM2 per node: 0.36 (3.6X)
1 node + 4x P100 SXM2 per node: 0.41 (4.1X)

108

Coarse-grain Water on K80s

Running LAMMPS version 2016

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

Coarse-grain Water, 1/seconds (speedup vs. 1 Broadwell node):
1 Broadwell node: 0.00437
1 node + 4x K80 per node: 0.00444 (1.0X)

109

Coarse-grain Water on P100s PCIe

Running LAMMPS version 2016

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

Coarse-grain Water, 1/seconds (speedup vs. 1 Broadwell node):
1 Broadwell node: 0.0044
1 node + 4x P100 PCIe per node: 0.0061 (1.4X)
1 node + 8x P100 PCIe per node: 0.0093 (2.1X)

110

Coarse-grain Water on P100s SXM2

Running LAMMPS version 2016

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

Coarse-grain Water, 1/seconds (speedup vs. 1 Broadwell node):
1 Broadwell node: 0.0044
1 node + 4x P100 SXM2 per node: 0.0069 (1.6X)
1 node + 8x P100 SXM2 per node: 0.0110 (2.5X)

111

EAM on K80s

Running LAMMPS version 2016

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

EAM, 1/seconds (speedup vs. 1 Broadwell node):
1 Broadwell node: 0.01
1 node + 1x K80 per node: 0.02 (2.0X)
1 node + 2x K80 per node: 0.04 (4.0X)
1 node + 4x K80 per node: 0.07 (7.0X)

112

EAM on P100s PCIe

Running LAMMPS version 2016

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

EAM, 1/seconds (speedup vs. 1 Broadwell node):
1 Broadwell node: 0.01
1 node + 1x P100 PCIe per node: 0.03 (3.0X)
1 node + 2x P100 PCIe per node: 0.05 (5.0X)
1 node + 4x P100 PCIe per node: 0.08 (8.0X)
1 node + 8x P100 PCIe per node: 0.13 (13.0X)

113

EAM on P100s SXM2

Running LAMMPS version 2016

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

EAM, 1/seconds (speedup vs. 1 Broadwell node):
1 Broadwell node: 0.01
1 node + 1x P100 SXM2 per node: 0.03 (3.0X)
1 node + 2x P100 SXM2 per node: 0.05 (5.0X)
1 node + 4x P100 SXM2 per node: 0.08 (8.0X)
1 node + 8x P100 SXM2 per node: 0.13 (13.0X)

114

Gay-Berne on K80s

Running LAMMPS version 2016

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Gay-Berne, 1/seconds (speedup vs. 1 Broadwell node):
1 Broadwell node: 0.01
1 node + 1x K80 per node: 0.02 (2.0X)
1 node + 2x K80 per node: 0.03 (3.0X)
1 node + 4x K80 per node: 0.04 (4.0X)

115

Gay-Berne on P100s PCIe

Running LAMMPS version 2016

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Gay-Berne, 1/seconds (speedup vs. 1 Broadwell node):
1 Broadwell node: 0.01
1 node + 1x P100 PCIe per node: 0.02 (2.0X)
1 node + 2x P100 PCIe per node: 0.04 (4.0X)
1 node + 4x P100 PCIe per node: 0.05 (5.0X)

116

Gay-Berne on P100s SXM2

Running LAMMPS version 2016

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Gay-Berne, 1/seconds (speedup vs. 1 Broadwell node):
1 Broadwell node: 0.01
1 node + 1x P100 SXM2 per node: 0.02 (2.0X)
1 node + 2x P100 SXM2 per node: 0.04 (4.0X)
1 node + 4x P100 SXM2 per node: 0.05 (5.0X)

117

Rhodopsin on K80s

Running LAMMPS version 2016

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Rhodopsin, 1/seconds (speedup vs. 1 Broadwell node):
1 Broadwell node: 0.22
1 node + 1x K80 per node: 0.22
1 node + 2x K80 per node: 0.31 (1.4X)
1 node + 4x K80 per node: 0.38 (1.7X)

118

Rhodopsin on P100s PCIe

Running LAMMPS version 2016

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Rhodopsin, 1/seconds (speedup vs. 1 Broadwell node):
1 Broadwell node: 0.22
1 node + 1x P100 PCIe per node: 0.29 (1.3X)
1 node + 2x P100 PCIe per node: 0.33 (1.5X)
1 node + 4x P100 PCIe per node: 0.48 (2.2X)
1 node + 8x P100 PCIe per node: 0.52 (2.4X)

119

Rhodopsin on P100s SXM2

Running LAMMPS version 2016

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 SXM2 GPUs

1x P100 SXM2 is paired with Single Intel Xeon E5-2698 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Rhodopsin, 1/seconds (speedup vs. 1 Broadwell node):
1 Broadwell node: 0.22
1 node + 1x P100 SXM2 per node: 0.30 (1.4X)
1 node + 2x P100 SXM2 per node: 0.38 (1.7X)
1 node + 4x P100 SXM2 per node: 0.49 (2.2X)
1 node + 8x P100 SXM2 per node: 0.50 (2.3X)

120

Recommended GPU Node Configuration for LAMMPS Computational Chemistry

Workstation or Single Node Configuration

# of CPU sockets: 2
Cores per CPU socket: 6+
CPU speed (GHz): 2.66+
System memory per socket (GB): 32
GPUs: GTX Titan X, Kepler K20, K40, K80, M40
# of GPUs per CPU socket: 1-2
GPU memory preference (GB): 6+
GPU to CPU connection: PCIe 3.0 or higher
Server storage: 500 GB or higher
Network configuration: Gemini, InfiniBand

Scale to thousands of nodes with the same single node configuration

NAMD 2.11 – Up to 2X Faster

122

New GPU features in NAMD 2.11

• GPU-accelerated simulations up to twice as fast as NAMD 2.10

• Pressure calculation with fixed atoms on GPU works as on CPU

• Improved scaling for GPU-accelerated particle-mesh Ewald calculation

• CPU-side operations overlap better and are parallelized across cores.

• Improved scaling for GPU-accelerated simulations

• Nonbonded force calculation results are streamed from the GPU for better overlap.

• NVIDIA CUDA GPU-acceleration binaries for Mac OS X

Selected Text from the NAMD website

123

NAMD 2.11 is up to 2x faster

Chart: APoA1 (92,224 atoms), simulated time in ns/day on 1, 2, and 4 nodes. Speedup labels: 1.2X, 1.6X, 2.0X.

Both NAMD 2.10 and NAMD 2.11 were run on nodes containing Dual Intel E5-2697 v2@2.7GHz (IvyBridge) CPUs + 2 Tesla K80 (autoboost) GPUs

124

NAMD 2.11 APoA1 on 1 and 2 nodes

Running NAMD version 2.11

The blue nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs

The green nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs

APoA1 (92,224 atoms), simulated time in ns/day (speedup vs. CPU-only run on the same number of nodes):
1 Node: 2.77
1 Node + 1x K80: 11.67 (4.2X)
1 Node + 2x K80: 16.99 (6.1X)
2 Nodes: 5.22
2 Nodes + 1x K80: 19.73 (3.8X)
2 Nodes + 2x K80: 24.31 (4.7X)
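Throughput in ns/day converts directly into time to solution. As a hedged illustration, the snippet below takes the single-node APoA1 rates from this slide (2.77, 11.67 and 16.99 ns/day) and estimates the wall-clock days needed for a simulation of a given length. The 100 ns target is a hypothetical value chosen for the example, and `days_to_simulate` is an illustrative helper, not a NAMD function.

```python
# APoA1 throughput (ns/day) on one node, read off the NAMD 2.11 chart.
apoa1_ns_per_day = {
    "1 Node": 2.77,
    "1 Node + 1x K80": 11.67,
    "1 Node + 2x K80": 16.99,
}

TARGET_NS = 100.0  # hypothetical simulation length, not from the slides


def days_to_simulate(target_ns, ns_per_day):
    """Wall-clock days needed to reach target_ns at a sustained ns/day rate."""
    if ns_per_day <= 0:
        raise ValueError("ns_per_day must be positive")
    return target_ns / ns_per_day


# Compare time to solution across the three single-node configurations.
for config, rate in apoa1_ns_per_day.items():
    days = days_to_simulate(TARGET_NS, rate)
    print(f"{config}: {days:.1f} days for {TARGET_NS:.0f} ns")
```

The estimate assumes the benchmark rate is sustained for the whole run, which ignores I/O, checkpointing and queue time.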

125

NAMD 2.11 APoA1 on 4 and 8 nodes

Running NAMD version 2.11

The blue nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs

The green nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs

APoA1 (92,224 atoms), simulated time in ns/day (speedup vs. CPU-only run on the same number of nodes):
4 Nodes: 10.27
4 Nodes + 1x K80: 20.64 (2.0X)
4 Nodes + 2x K80: 23.52 (2.3X)
8 Nodes: 16.85
8 Nodes + 1x K80: 27.83 (1.7X)
8 Nodes + 2x K80: 27.74 (1.6X)

126

NAMD 2.11 is up to 1.8x faster

Chart: F1-ATPase (327,506 atoms), simulated time in ns/day on 1, 2, and 4 nodes. Speedup labels: 1.1X, 1.8X, 1.4X.

Both NAMD 2.10 and NAMD 2.11 were run on nodes containing Dual Intel E5-2697 v2@2.7GHz (IvyBridge) CPUs + 2 Tesla K80 (autoboost) GPUs

127

NAMD 2.11 F1-ATPase on 1 and 2 nodes

Running NAMD version 2.11

The blue nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs

The green nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs

F1-ATPase (327,506 atoms), simulated time in ns/day (speedup vs. CPU-only run on the same number of nodes):
1 Node: 0.94
1 Node + 1x K80: 3.87 (4.1X)
1 Node + 2x K80: 6.11 (6.5X)
2 Nodes: 1.86
2 Nodes + 1x K80: 7.23 (3.9X)
2 Nodes + 2x K80: 10.58 (5.7X)

128

NAMD 2.11 F1-ATPase on 4 and 8 nodes

Running NAMD version 2.11

The blue nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs

The green nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs

F1-ATPase (327,506 atoms), simulated time in ns/day (speedup vs. CPU-only run on the same number of nodes):
4 Nodes: 3.63
4 Nodes + 1x K80: 11.66 (3.2X)
4 Nodes + 2x K80: 12.62 (3.5X)
8 Nodes: 6.88
8 Nodes + 1x K80: 14.22 (2.1X)
8 Nodes + 2x K80: 15.74 (2.3X)

129

NAMD 2.11 is up to 1.5x faster

Chart: STMV (1,066,628 atoms), simulated time in ns/day on 1, 2, and 4 nodes. Speedup labels: 1.5X, 1.1X, 1.5X.

Both NAMD 2.10 and NAMD 2.11 were run on nodes containing Dual Intel E5-2697 v2@2.7GHz (IvyBridge) CPUs + 2 Tesla K80 (autoboost) GPUs

130

NAMD 2.11 STMV on 1 and 2 nodes

Running NAMD version 2.11

The blue nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs

The green nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs

STMV (1,066,628 atoms), simulated time in ns/day (speedup vs. CPU-only run on the same number of nodes):
1 Node: 0.23
1 Node + 1x K80: 1.03 (4.5X)
1 Node + 2x K80: 1.75 (7.6X)
2 Nodes: 0.46
2 Nodes + 1x K80: 1.98 (4.3X)
2 Nodes + 2x K80: 3.27 (7.1X)

131

NAMD 2.11 STMV on 4 and 8 nodes

Running NAMD version 2.11

The blue nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs

The green nodes contain Dual Intel E5-2698 v3@2.3GHz (Haswell) CPUs + Tesla K80 (autoboost) GPUs

STMV (1,066,628 atoms), simulated time in ns/day (speedup vs. CPU-only run on the same number of nodes):
4 Nodes: 0.90
4 Nodes + 1x K80: 3.61 (4.0X)
4 Nodes + 2x K80: 4.54 (5.0X)
8 Nodes: 1.74
8 Nodes + 1x K80: 5.86 (3.4X)
8 Nodes + 2x K80: 6.24 (3.6X)

November 2016

NAMD 2.11

133

APOA1 on K80s

Running NAMD version 2.11

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

APOA1, ns/day (speedup vs. 1 Broadwell node):
1 Broadwell node: 4.09
1 node + 1x K80 per node: 11.51 (2.8X)
1 node + 2x K80 per node: 14.86 (3.6X)
1 node + 4x K80 per node: 16.39 (4.0X)
1 node + 8x K80 per node: 20.59 (5.0X)

134

APOA1 on P100s PCIe

Running NAMD version 2.11

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Chart: APOA1, ns/day

1 Broadwell node: 4.09
1 node + 1x P100 PCIe per node: 13.96 (3.4X)
1 node + 2x P100 PCIe per node: 19.62 (4.8X)

F1ATPASE on K80s

Running NAMD version 2.11

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Chart: F1ATPASE, ns/day

1 Broadwell node: 1.47
1 node + 1x K80 per node: 3.77 (2.6X)
1 node + 2x K80 per node: 5.95 (4.0X)
1 node + 4x K80 per node: 6.04 (4.1X)
1 node + 8x K80 per node: 7.55 (5.1X)

F1ATPASE on P100s PCIe

Running NAMD version 2.11

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Chart: F1ATPASE, ns/day

1 Broadwell node: 1.47
1 node + 1x P100 PCIe per node: 4.52 (3.1X)
1 node + 2x P100 PCIe per node: 6.03 (4.1X)
1 node + 4x P100 PCIe per node: 6.12 (4.2X)

STMV on K80s

Running NAMD version 2.11

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla K80 (autoboost) GPUs

1x K80 is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Chart: STMV, ns/day

1 Broadwell node: 0.36
1 node + 1x K80 per node: 1.02 (2.8X)
1 node + 2x K80 per node: 1.71 (4.8X)
1 node + 4x K80 per node: 1.90 (5.3X)
1 node + 8x K80 per node: 2.17 (6.0X)

STMV on P100s PCIe

Running NAMD version 2.11

The blue node contains Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs

The green nodes contain Dual Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell) CPUs + Tesla P100 PCIe GPUs

1x P100 PCIe is paired with Single Intel Xeon E5-2699 v4@2.2GHz [3.6GHz Turbo] (Broadwell)

Chart: STMV, ns/day

1 Broadwell node: 0.36
1 node + 1x P100 PCIe per node: 1.32 (3.7X)
1 node + 2x P100 PCIe per node: 1.88 (5.2X)
1 node + 8x P100 PCIe per node: 1.89 (5.3X)
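The K80 and P100 PCIe slides share the same Broadwell baseline, so the two GPU generations can be compared at equal card counts. The sketch below takes the 1x and 2x ns/day values from those slides and computes the P100-to-K80 throughput ratio per benchmark. Treat it as rough arithmetic: it assumes the runs differ only in the GPU model, and the data layout and function name are illustrative.

```python
# ns/day from the single-node Broadwell slides (NAMD 2.11),
# keyed by benchmark and by number of GPU cards per node.
K80 = {
    "APOA1": {1: 11.51, 2: 14.86},
    "F1ATPASE": {1: 3.77, 2: 5.95},
    "STMV": {1: 1.02, 2: 1.71},
}
P100_PCIE = {
    "APOA1": {1: 13.96, 2: 19.62},
    "F1ATPASE": {1: 4.52, 2: 6.03},
    "STMV": {1: 1.32, 2: 1.88},
}


def p100_over_k80(k80: dict, p100: dict) -> dict:
    """Return {(benchmark, cards): P100 ns/day divided by K80 ns/day}."""
    ratios = {}
    for bench, runs in k80.items():
        for cards, k80_value in runs.items():
            if cards in p100.get(bench, {}):
                ratios[(bench, cards)] = p100[bench][cards] / k80_value
    return ratios


if __name__ == "__main__":
    for (bench, cards), r in p100_over_k80(K80, P100_PCIE).items():
        print(f"{bench}, {cards}x card(s): P100 PCIe is {r:.2f}x the K80 throughput")
```

Note that each K80 card carries two GPUs, so this is a per-card comparison rather than a per-GPU one.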

Recommended GPU Node Configuration for NAMD Computational Chemistry

Workstation or Single Node Configuration

# of CPU sockets: 2
Cores per CPU socket: 6+
CPU speed (GHz): 2.66+
System memory per socket (GB): 32
GPUs: Kepler K20, K40, K80
# of GPUs per CPU socket: 1-2
GPU memory preference (GB): 6-12
GPU to CPU connection: PCIe 3.0 or higher
Server storage: 500 GB or higher
Network configuration: Gemini, InfiniBand

Scale to thousands of nodes with same single node configuration

Benefits of MD GPU-Accelerated Computing

• 3x-8x faster than CPU-only systems in all tests (on average)

• Most major compute-intensive aspects of classical MD have been ported

• Large performance boost with marginal price increase

• Energy usage cut by more than half

• GPUs scale well within a node and/or over multiple nodes

• K80 GPU is our fastest and lowest-power high-performance GPU yet

Try GPU-accelerated MD apps for free – www.nvidia.com/GPUTestDrive

Why wouldn’t you want to turbocharge your research?

Dec. 19, 2016

Molecular Dynamics (MD) on GPUs