NVIDIA Molecular Dynamics App Catalog
May 8th, 2012 Higher Ed & Research
GPU Test Drive Experience GPU Acceleration
Free & Easy – Sign up, Log in and See Results
Preconfigured with Molecular Dynamics Apps
www.nvidia.com/GpuTestDrive
Remotely Hosted GPU Servers
For Computational Chemistry Researchers, Biophysicists
Molecular Dynamics Applications Overview
Sections Included:
! AMBER
! NAMD
! GROMACS
! LAMMPS
Application: AMBER (PMEMD)
  Features Supported: Explicit solvent & GB implicit solvent
  GPU Perf: 89.44 ns/day, JAC NVE on 16X 2090s
  Release Status: Released (AMBER 12); multi-GPU, multi-node
  Notes/Benchmarks: http://ambermd.org/gpus/benchmarks.htm#Benchmarks

Application: NAMD
  Features Supported: Full electrostatics with PME and most simulation features
  GPU Perf: 6.44 ns/day, STMV on 585X 2050s
  Release Status: Released (NAMD 2.8; 2.9 version April 2012); 100M-atom capable; multi-GPU, multi-node
  Notes/Benchmarks: http://biowulf.nih.gov/apps/namd/namd_bench.html

Application: GROMACS
  Features Supported: Implicit (5x) and explicit (2x) solvent via OpenMM
  GPU Perf: 165 ns/day, DHFR on 4X C2075s
  Release Status: 4.5 single-GPU released; 4.6 multi-GPU released
  Notes/Benchmarks: http://biowulf.nih.gov/apps/gromacs-gpu.html

Application: LAMMPS
  Features Supported: Lennard-Jones, Gay-Berne, Tersoff
  GPU Perf: 3.5-15x
  Release Status: Released; multi-GPU, multi-node
  Notes/Benchmarks: 1 billion atoms on Lincoln: http://lammps.sandia.gov/bench.html#machine

GPU Perf compared against a multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
Molecular Dynamics (MD) Applications
Application: Abalone
  Features Supported: Simulations (TBD)
  GPU Perf: 4-29X (on 1060 GPU)
  Release Status: Released
  Notes: Single GPU. Agile Molecule, Inc.

Application: ACEMD
  Features Supported: Written for use on GPUs
  GPU Perf: 160 ns/day
  Release Status: Released
  Notes: Production bio-molecular dynamics (MD) software specially optimized to run on single and multi-GPUs

Application: DL_POLY
  Features Supported: Two-body forces, link-cell pairs, Ewald SPME forces, Shake VV
  GPU Perf: 4x
  Release Status: V 4.0, source only
  Notes: Results published; multi-GPU, multi-node supported

Application: HOOMD-Blue
  Features Supported: Written for use on GPUs
  GPU Perf: 2X (32 CPU cores vs. 2 10XX GPUs)
  Release Status: Released, version 0.9.2
  Notes: Single and multi-GPU

New/Additional MD Applications Ramping

GPU Perf compared against a multi-core x86 CPU socket. GPU Perf benchmarked on GPU-supported features and may be a kernel-to-kernel perf comparison.
GPU Value to Molecular Dynamics
What:
! Study disease & discover drugs
! Predict drug and protein interactions
Why:
! Speed of simulations is critical
! Enables study of longer timeframes, larger systems, and more simulations
How:
! GPUs increase throughput & accelerate simulations
AMBER 11: 4.6x performance increase with 2 GPUs at only a 54% added cost*
* AMBER 11 Cellulose NPT on 2x E5670 CPUs + 2x Tesla C2090s (per node) vs. 2x E5670 CPUs (per node). Cost of a CPU node is assumed to be $9,333; the cost of adding two 2090s to a single node is assumed to be $5,333.
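The cost/benefit claim above can be sanity-checked with the slide's own dollar figures. The sketch below is a rough back-of-the-envelope calculation using only the numbers stated on the slide; note that $5,333/$9,333 works out to roughly 57%, which the slide quotes as about 54%.

```python
# Back-of-the-envelope check of the slide's AMBER 11 cost/benefit claim.
# All figures are the slide's stated assumptions, not measured data.
cpu_node_cost = 9333.0    # assumed cost of a dual-E5670 CPU node
gpu_addon_cost = 5333.0   # assumed cost of adding two Tesla 2090s
speedup = 4.6             # AMBER 11 Cellulose NPT, GPU node vs. CPU node

relative_cost = (cpu_node_cost + gpu_addon_cost) / cpu_node_cost
perf_per_dollar_gain = speedup / relative_cost

print(f"Added cost: {relative_cost - 1:.0%}")               # ~57%; the slide quotes ~54%
print(f"Throughput-per-dollar gain: {perf_per_dollar_gain:.1f}x")
```

The point of the arithmetic: even though the GPU node costs about 1.5x as much, it delivers roughly 3x more simulation throughput per dollar spent.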
GPU Test Drive Pre-configured Applications
AMBER 11 NAMD 2.8
GPU Ready Applications
Abalone ACEMD AMBER
DL_POLY GAMESS
GROMACS LAMMPS NAMD
All Key MD Codes are GPU Ready
! AMBER, NAMD, GROMACS, LAMMPS
! Life and Material Sciences
! Great multi-GPU performance
! Additional MD GPU codes: Abalone, ACEMD, HOOMD-Blue
! Focus: scaling to large numbers of GPUs
Outstanding AMBER Results with GPUs
Run AMBER Faster Up to 5x Speed Up With GPUs
DHFR (NVE), 23,558 atoms: 58.28 ns/day (GPU+CPU) vs. 14.16 ns/day (CPU)
“…with two GPUs we can run a single simulation as fast as on 128 CPUs of a Cray XT3 or on 1024 CPUs of an IBM BlueGene/L machine. We can try things that were undoable before. It still blows my mind.” Axel Kohlmeyer Temple University
[Chart: Cluster Performance Scaling, AMBER 11 JAC simulation time in ns/day.
CPU only: 2.44 (1 node), 4.55 (2 nodes), 8.22 (4 nodes), 13.49 (8 nodes).
CPU + 1x C2050 per node: 21.29 (1 node), 31.57 (2 nodes), 45.58 (4 nodes), 56.45 (8 nodes).
CPU supercomputer, NICS Kraken (Athena): 46.01.]
AMBER: Make Research More Productive with GPUs
Adding two 2090 GPUs to a node yields a > 4x performance increase.
[Chart: Cost of 1 Node vs. Performance Speed-up, relative cost and performance benefit scale (0.0-5.0).
AMBER 11 on 2X E5670 CPUs (per node): baseline, no GPU.
AMBER 11 on 2X E5670 CPUs + 2X Tesla M2090s (per node): 318% higher performance for 54% additional expense.]
Base node configuration: dual Xeon X5670s; the GPU configuration adds dual Tesla M2090 GPUs per node.
Run NAMD Faster: Up to 7x Speed Up With GPUs
ApoA-1, 92,224 atoms: 2.94 ns/day (GPU+CPU) vs. 0.51 ns/day (CPU)
STMV, 1,066,628 atoms
[Chart: NAMD 2.8 benchmark, ns/day (0-3.5) vs. number of compute nodes (1, 2, 4, 8, 12, 16), GPU+CPU vs. CPU only.]
Test Platform: 1 node, dual Tesla M2090 GPUs (6 GB), dual Intel 4-core Xeon (2.4 GHz), NAMD 2.8, CUDA 4.0, ECC on. Visit www.nvidia.com/simcluster for more information on speed-up results, configuration, and test models.
STMV benchmark run on NAMD 2.8b1 + unreleased patch. A node is dual-socket, quad-core X5650 with 2 Tesla M2070 GPUs; performance numbers are 2 M2070s + 8 cores (GPU+CPU) vs. 8 cores (CPU).
NAMD: Make Research More Productive with GPUs
Get up to a 250% performance increase (STMV, 1,066,628 atoms).
[Chart: Cost of 1 Node vs. Performance Speed-up, relative cost and performance benefit scale (0.0-4.0).
NAMD 2.8 on 2X E5670 CPUs (per node): baseline, no GPU.
NAMD 2.8 on 2X E5670 CPUs + 2X Tesla C2070s (per node): 250% higher performance for 54% additional expense.]
GROMACS Partnership Overview
! Erik Lindahl, David van der Spoel, Berk Hess are head authors and project leaders. Szilárd Páll is a key GPU developer.
! 2010: single GPU support (OpenMM library in GROMACS 4.5)
! NVIDIA Dev Tech resources allocated to GROMACS code
! 2012: GROMACS 4.6 will support multi-GPU nodes as well as GPU clusters
GROMACS 4.6 Release Features
! Multi-GPU support: GPU acceleration is one of the main focus areas; the majority of features will be accelerated in 4.6 in a transparent fashion
! PME simulations get special attention, and most of the effort will go into making these algorithms well load-balanced
! Reaction-Field and Cut-Off simulations also run accelerated
! The list of features without GPU acceleration will be quite short
GROMACS Multi-GPU Expected in April 2012
GROMACS 4.6 Alpha Release Absolute Performance
! Absolute performance of GROMACS running CUDA- and SSE-accelerated non-bonded kernels with PME on 3-12 CPU cores and 1-4 GPUs. Simulations with cubic and truncated dodecahedron cells, pressure coupling, and virtual interaction sites enabling 5 fs time steps are shown
! Benchmark systems: RNAse in water, with 24,040 atoms in a cubic box and 16,816 atoms in a truncated dodecahedron box
! Settings: electrostatics cut-off auto-tuned >0.9 nm, LJ cut-off 0.9 nm, 2 fs and 5 fs (with vsites) time steps
! Hardware: workstation with 2x Intel Xeon X5650 (6C), 4x NVIDIA Tesla C2075
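The benchmark settings above map onto a handful of run-parameter options. The fragment below is an illustrative sketch of a GROMACS .mdp file matching those settings, not the actual benchmark input; option names follow GROMACS 4.6 conventions, and the fixed 0.9 nm cut-offs stand in for the run-time auto-tuning described above.

```
; Illustrative GROMACS .mdp fragment approximating the benchmark settings
; (a sketch for illustration, not the actual benchmark input files)
integrator    = md
dt            = 0.002      ; 2 fs time step (the 5 fs runs use virtual sites)
cutoff-scheme = Verlet     ; scheme used by the native GPU kernels in 4.6
coulombtype   = PME
rcoulomb      = 0.9        ; electrostatics cut-off (auto-tuned >0.9 nm at run time)
rvdw          = 0.9        ; LJ cut-off
```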
GROMACS 4.6 Alpha Release Strong Scaling
Strong scaling of GPU-accelerated GROMACS with PME and reaction-field on:
! Up to 40 cluster nodes with 80 GPUs
! Benchmark system: water box with 1.5M particles
! Settings: electrostatics cut-off auto-tuned >0.9 nm for PME and 0.9 nm for reaction-field, LJ cut-off 0.9 nm, 2 fs time steps
! Hardware: Bullx cluster nodes with 2x Intel Xeon E5649 (6C), 2x NVIDIA Tesla M2090, 2x QDR InfiniBand 40 Gb/s
GROMACS 4.6 Alpha Release PME Weak Scaling
Weak scaling of the GPU-accelerated GROMACS with reaction-field and PME on 3-12 CPU cores and 1-4 GPUs. The gradient background indicates the range of system sizes which fall beyond the typical single-node production size.
! Benchmark systems: water boxes ranging in size from 1.5k to 3M particles
! Settings: electrostatics cut-off auto-tuned >0.9 nm for PME and 0.9 nm for reaction-field, LJ cut-off 0.9 nm, 2 fs time steps
! Hardware: workstation with 2x Intel Xeon X5650 (6C), 4x NVIDIA Tesla C2075
GROMACS 4.6 Alpha Release Rxn-Field Weak Scaling
Weak scaling of the GPU-accelerated GROMACS with reaction-field and PME on 3-12 CPU cores and 1-4 GPUs. The gradient background indicates the range of system sizes which fall beyond the typical single-node production size.
! Benchmark systems: water boxes ranging in size from 1.5k to 3M particles
! Settings: electrostatics cut-off auto-tuned >0.9 nm for PME and 0.9 nm for reaction-field, LJ cut-off 0.9 nm, 2 fs time steps
! Hardware: workstation with 2x Intel Xeon X5650 (6C), 4x NVIDIA Tesla C2075
GROMACS 4.6 Alpha Release Weak Scaling
! Weak scaling of the CUDA non-bonded force kernel on GeForce and Tesla GPUs. Perfect weak scaling, challenges for strong scaling
! Benchmark systems: water boxes ranging in size from 1.5k to 3M particles
! Settings: electrostatics & LJ cut-off 1.0 nm, 2 fs time steps
! Hardware: workstation with 2x Intel Xeon X5650 (6C) CPUs, 4x NVIDIA Tesla C2075
LAMMPS Released GPU Features and Future Plans
* Courtesy of Michael Brown at ORNL and Paul Crozier at Sandia Labs
LAMMPS, August 2009
! First GPU-accelerated support

LAMMPS, Aug. 22, 2011
! Selected accelerated non-bonded short-range potentials (SP, MP, DP support):
! Lennard-Jones (several variants, with & without coulombic interactions)
! Morse
! Buckingham
! CHARMM
! Tabulated
! Coarse grain SDK
! Anisotropic Gay-Berne
! RE-squared
! "Hybrid" combinations (GPU accel & no GPU accel)
! Particle-Particle Particle-Mesh (SP or DP)
! Neighbor list builds

Longer Term*
! Improve performance on smaller particle counts (the neighbor list is the problem)
! Improve long-range performance (the MPI/Poisson solve is the problem)
! Additional pair potential support (including expensive advanced force fields); see the "Tremendous Opportunity for GPUs" slide*
! Performance improvements focused on specific science problems

W.M. Brown, "GPU Acceleration in LAMMPS", 2011 LAMMPS Workshop
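In an input script, the accelerated pair styles listed above are selected through the GPU package. The fragment below is a hedged sketch of a minimal GPU-accelerated Lennard-Jones melt in circa-2011 LAMMPS input syntax; the exact `package gpu` arguments varied between versions, so treat it as illustrative rather than a known-good input deck.

```
# Illustrative LAMMPS input fragment: Lennard-Jones melt via the GPU package
# (a sketch of circa-2011 syntax; check the docs for your LAMMPS version)
package      gpu force/neigh 0 0 1.0   # forces + neighbor builds on GPU 0
units        lj
atom_style   atomic
lattice      fcc 0.8442
region       box block 0 20 0 20 0 20
create_box   1 box
create_atoms 1 box
mass         1 1.0
velocity     all create 1.44 87287
pair_style   lj/cut/gpu 2.5            # GPU-accelerated variant of lj/cut
pair_coeff   1 1 1.0 1.0 2.5
fix          1 all nve
run          100
```

Switching between CPU and GPU runs is mostly a matter of swapping the pair style (e.g. `lj/cut` vs. `lj/cut/gpu`), which is what enables the "hybrid" accelerated/non-accelerated combinations noted above.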
LAMMPS 8.6x Speed-up with GPUs
LAMMPS 4x Faster on Billion Atoms
Test Platform: NCSA Lincoln Cluster with S1070 1U GPU servers attached; CPU-only cluster: Cray XT5
Billion Atom Lennard-Jones Benchmark: 29 seconds on 288 GPUs + CPUs vs. 103 seconds on 1920 x86 CPUs
! 4X-15X speedups: Gay-Berne, RE-squared
! From the August 2011 LAMMPS Workshop
! Courtesy of W. Michael Brown, ORNL
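The "4x faster" headline above can be checked against the slide's own timings; the sketch below uses only the 29-second and 103-second figures stated on the slide.

```python
# Sanity check of the billion-atom Lennard-Jones comparison above:
# 288 GPUs + CPUs finish in 29 s where 1920 x86 CPUs take 103 s.
gpu_time_s = 29.0
cpu_time_s = 103.0

speedup = cpu_time_s / gpu_time_s
print(f"Wall-clock speedup: {speedup:.1f}x")  # ~3.6x, which the slide rounds to "4x faster"
```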
LAMMPS Conclusions
! Runs both on individual multi-GPU nodes and on GPU clusters
! Outstanding raw performance: 3x-40X higher than equivalent CPU code
! Impressive linear strong scaling
! Good weak scaling; scales to a billion particles
! Tremendous opportunity to GPU accelerate other force fields