SAN DIEGO SUPERCOMPUTER CENTER

Speed Without Compromise: Rethinking Precision in MD Calculations in the Era of Vanishing Double Precision FloPs

Ross Walker, Associate Professor and NVIDIA CUDA Fellow
San Diego Supercomputer Center
UC San Diego Department of Chemistry & Biochemistry

Walker Molecular Dynamics Lab

http://www.wmd-lab.org/

Researchers / Postdocs: Age Skjevik, Andreas Goetz, Perri Needham
Graduate Students: Ben Madej, Longhua Yang, Maria Rosaria-ferrero, Charles Lin, Daniel Mermelstein
Undergraduate Researchers: Robin Betz

•  GPU Acceleration
•  Lipid Force Fields
•  Enzyme Activation
•  QM/MM MD Automated Refinement

Molecular Dynamics for the 99%

•  Develop a GPU-accelerated version of AMBER's PMEMD.

San Diego Supercomputer Center: Ross C. Walker
NVIDIA: Scott Le Grand

Partly funded under the NSF SI2-SSE Program

Taking MD to 11

Project Info

•  AMBER Website: http://ambermd.org/gpus/

Publications

1.  Salomon-Ferrer, R.; Goetz, A.W.; Poole, D.; Le Grand, S.; Walker, R.C.* "Routine microsecond molecular dynamics simulations with AMBER - Part II: Particle Mesh Ewald", J. Chem. Theory Comput., 2013, 9 (9), pp 3878-3888, DOI: 10.1021/ct400314y
2.  Goetz, A.W.; Williamson, M.J.; Xu, D.; Poole, D.; Le Grand, S.; Walker, R.C. "Routine microsecond molecular dynamics simulations with AMBER - Part I: Generalized Born", J. Chem. Theory Comput., 2012, 8 (5), pp 1542-1555, DOI: 10.1021/ct200909j
3.  Pierce, L.C.T.; Salomon-Ferrer, R.; de Oliveira, C.A.F.; McCammon, J.A.; Walker, R.C. "Routine access to millisecond timescale events with accelerated molecular dynamics", J. Chem. Theory Comput., 2012, 8 (9), pp 2997-3002, DOI: 10.1021/ct300284c
4.  Salomon-Ferrer, R.; Case, D.A.; Walker, R.C. "An overview of the Amber biomolecular simulation package", WIREs Comput. Mol. Sci., 2012, in press, DOI: 10.1002/wcms.1121
5.  Le Grand, S.; Goetz, A.W.; Walker, R.C. "SPFP: Speed without compromise - a mixed precision model for GPU accelerated molecular dynamics simulations", Comp. Phys. Comm., 2013, 184, pp 374-380, DOI: 10.1016/j.cpc.2012.09.022

Design Goals

Overriding Design Goal: Sampling for the 99%

•  Focus on ~< 4 million atoms.
•  Maximize single workstation performance.
•  Focus on minimizing costs.
•  Be able to use very cheap nodes.
•  Both gaming and Tesla cards.
•  Ease of use (same input, same output).

[Figure labels: The <0.0001%, The 1.0%, The 99.0%]

Map problem onto GPU hardware

Example: Nonbonded forces

•  Subdivide force matrix into 3 classes of independent tiles: off-diagonal, on-diagonal, and redundant.
•  Map non-redundant tiles to warps.
•  SMs consume tiles.
•  Avoid race conditions by dividing the calculation in both space (tiles) and time (warps).

[Figure: atom i vs. atom j force matrix divided into tiles; tiles assigned to Warp 0 ... Warp n on SM 0 ... SM m, using registers and shared memory]

Patent: US 8473948 B1
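The tile subdivision can be illustrated with a small host-side sketch. This is not the pmemd.cuda implementation: the tile edge of 32 atoms (one warp width), the choice of the lower triangle as the redundant half (by the symmetry of pairwise forces), and the names classify_tile and non_redundant_tiles are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// ASSUMPTION: one tile edge covers 32 atoms (one warp width).
constexpr std::size_t TILE_SIZE = 32;

enum class TileClass { OnDiagonal, OffDiagonal, Redundant };

struct Tile {
    std::size_t row;  // tile index along atom i
    std::size_t col;  // tile index along atom j
};

// ASSUMPTION: pairwise forces are symmetric, so the lower triangle
// (col < row) repeats work already covered by the upper triangle.
TileClass classify_tile(std::size_t row, std::size_t col) {
    if (row == col) return TileClass::OnDiagonal;
    return (col > row) ? TileClass::OffDiagonal : TileClass::Redundant;
}

// Build the list of tiles that would actually be handed out to warps.
std::vector<Tile> non_redundant_tiles(std::size_t n_atoms) {
    const std::size_t n_tiles = (n_atoms + TILE_SIZE - 1) / TILE_SIZE;
    std::vector<Tile> work;
    for (std::size_t row = 0; row < n_tiles; ++row) {
        for (std::size_t col = row; col < n_tiles; ++col) {
            work.push_back({row, col});
        }
    }
    return work;
}
```

Under these assumptions, a system of 64 atoms gives a 2 x 2 grid of tiles, of which two on-diagonal tiles and one off-diagonal tile are kept.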

Version History

•  AMBER 10 – Released Apr 2008
    •  Implicit Solvent GB GPU support released as patch Sept 2009.
•  AMBER 11 – Released Apr 2010
    •  Implicit and Explicit solvent supported internally on single GPU.
    •  Oct 2010 – Bugfix.9 doubled performance on single GPU, added multi-GPU support.
•  AMBER 12 – Released Apr 2012
    •  Added Umbrella Sampling Support, REMD, Simulated Annealing, aMD, IPS and Extra Points.
    •  Aug 2012 – Bugfix.9 new SPFP precision model, support for Kepler I, GPU-accelerated NMR restraints, improved performance.
    •  Jan 2013 – Bugfix.14 support for CUDA 5.0, Jarzynski on GPU, GBSA, Kepler II support.

New in AMBER 14 (GPU), Apr 2014

•  ~20-30% performance improvement for single GPU runs.
•  Peer to peer support for multi-GPU runs providing enhanced multi-GPU scaling.
•  Hybrid bitwise reproducible fixed point precision model as standard (SPFP).
•  Support for Extra Points in Multi-GPU runs.
•  Jarzynski Sampling.
•  GBSA support.
•  Support for off-diagonal modifications to VDW parameters.
•  Multi-dimensional Replica Exchange (Temperature and Hamiltonian).
•  Support for CUDA 5.0, 5.5 and 6.0.
•  Support for latest generation GPUs.
•  Monte Carlo barostat support providing NPT performance equivalent to NVT.
•  ScaledMD support.
•  Improved accelerated MD (aMD) support.
•  Explicit solvent constant pH support.
•  NMR restraint support on multiple GPUs.
•  Improved error messages and checking.
•  Hydrogen mass repartitioning support (4fs time steps).

A Question of Dynamic Range

32-bit floating point has approximately 7 significant figures.

  1.456702
+ 0.3046714
-----------
  1.761373
- 1.456702
-----------
  0.3046710     Lost a sig fig.

  1456702.0000000
+       0.3046714
-----------------
  1456702.0000000
- 1456702.0000000
-----------------
        0.0000000     Lost everything.

When it happens: PBC, SHAKE, and Force Accumulation.
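The same add-then-subtract round trip can be tried in binary floating point. The sketch below is illustrative only: the function names are made up here, and because IEEE 754 single precision rounds in binary, the small value is badly damaged rather than zeroed exactly as in the 7-digit decimal picture.

```cpp
// Add `small` to `large`, then subtract `large` again, in single precision.
// The volatile store forces the intermediate sum to be rounded to 32 bits.
float roundtrip_float(float large, float small) {
    volatile float sum = large + small;
    return sum - large;
}

// The same round trip carried out in double precision for comparison.
double roundtrip_double(double large, double small) {
    volatile double sum = large + small;
    return sum - large;
}
```

Calling roundtrip_float(1456702.0f, 0.3046714f) returns a value that differs from 0.3046714 already in the second significant figure, while roundtrip_double with the same inputs recovers it to better than nine decimal places.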

Precision Models

SPSP - Use single precision for the entire calculation, with the exception of SHAKE, which is always done in double precision.

SPDP - Use a combination of single precision for calculation and double precision for accumulation (default < AMBER 12.9).

DPDP - Use double precision for the entire calculation.
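The difference between SPSP-style and SPDP-style accumulation can be sketched on the host as follows. This is a simplified illustration, not AMBER code: the function names are hypothetical, and the "terms" stand in for individual single-precision force or energy contributions.

```cpp
#include <vector>

// SPSP-style: single-precision terms summed into a single-precision total.
float accumulate_spsp(const std::vector<float>& terms) {
    float total = 0.0f;
    for (float t : terms) total += t;
    return total;
}

// SPDP-style: the same single-precision terms summed into a double-precision total.
double accumulate_spdp(const std::vector<float>& terms) {
    double total = 0.0;
    for (float t : terms) total += static_cast<double>(t);
    return total;
}
```

With one million terms of 0.001f, the double accumulator stays within a small fraction of 1000, whereas the float accumulator drifts visibly because each addition is rounded to the spacing of the growing total.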

Validation and Precision Testing

•  Measure a combination of elements that depend on both static energies / forces and ensemble averages.
    •  Energy conservation.
    •  Optimized structures.
    •  Free energy surfaces.
    •  Order parameters.
    •  RMSF.
    •  Radial distribution functions, etc.
•  2 aims
    •  Is our implementation valid/correct?
    •  What level of approximation with precision is acceptable?

Force Accuracy

Energy Conservation

Free Energy Surfaces
[Figure panels: CPU (DP), GPU (SPDP)]

Order Parameters

Explicit Solvent Performance (JAC DHFR Production Benchmark)

But then…

GTX680 and K10 Ruined the Party.

DP performance REALLY sucked.

4 month delay in usefulness while we developed and tested a new precision model.

SPFP

•  Single / Double / Fixed precision hybrid. Designed for optimum performance on Kepler I. Uses fire-and-forget atomic ops. Fully deterministic, faster and more precise than SPDP, minimal memory overhead (default >= AMBER 12.9).
•  Q24.40 for Forces, Q34.30 for Energies / Virials.
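Why fixed-point accumulation is deterministic can be sketched on the CPU. The code below assumes that Q24.40 means a 64-bit integer with 40 fractional bits, and all names are hypothetical; the GPU kernels themselves are not shown on the slides. Integer addition is associative, so the total does not depend on the order in which contributions arrive, which is what makes unordered atomic adds safe.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// ASSUMPTION: Q24.40 = 24 integer bits and 40 fractional bits in 64 bits.
constexpr double FORCE_SCALE = 1099511627776.0;  // 2^40

// Convert one single-precision force contribution to fixed point.
inline std::int64_t to_fixed(float f) {
    return static_cast<std::int64_t>(std::llrint(static_cast<double>(f) * FORCE_SCALE));
}

// Convert an accumulated fixed-point value back to double.
inline double from_fixed(std::int64_t q) {
    return static_cast<double>(q) / FORCE_SCALE;
}

// Sum contributions in an unsigned 64-bit accumulator (two's complement
// wraparound is well defined), standing in for a 64-bit atomic add.
std::int64_t accumulate_fixed(const std::vector<float>& contributions) {
    std::uint64_t acc = 0;
    for (float f : contributions) {
        acc += static_cast<std::uint64_t>(to_fixed(f));
    }
    return static_cast<std::int64_t>(acc);
}
```

Summing the same contributions in forward and reverse order gives bit-identical accumulators, which is not generally true of floating-point sums.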

Worked Great Until Maxwell

DHFR (NVE) HMR 4fs, 23,558 Atoms

Configuration                         Performance (ns/day)
2x E5-2660v2 CPU (16 Cores)            30.21
1X C2075                               81.26
2X C2075                              129.79
1X GTX 780                            251.43
1X GTX 980                            262.39
1X GTX Titan Black                    280.54
2X GTX Titan Black                    383.32
GTX-Titan-Z (1 GPU, 1/2 board)        261.82
GTX-Titan-Z (2 GPU, full board)       356.48
1X K8                                 116.09
1X K20                                196.99
2X K20                                263.85
1X K40                                266.07
2X K40                                364.67
4X K40                                489.68
1/2x K80 board (1 GPU)                229.29
1x K80 board (2 GPUs)                 334.05
2x K80 boards (4 GPUs)                423.69

Titan-X Helps
(But only through brute force)

Yet another solution needed

SPXP

Use 2 x 32 bits (~48-bit FP)

•  Extended-Precision Floating-Point Numbers for GPU Computation - Andrew Thall, Alma College, http://andrewthall.org/papers/df64_qf128.pdf
•  High-Performance Quasi Double-Precision Method Using Single-Precision Hardware for Molecular Dynamics on GPUs - Tetsuo Narumi et al., HPC Asia and JAPAN 2009

Narumi Summation

Represented as a float and an int

const int NARUMI_LARGE_SHIFT = 21;
const float NARUMI_LARGE = (float)(1 << (NARUMI_LARGE_SHIFT - 1));

struct Accumulator {
    float hs;   // high part, offset by NARUMI_LARGE
    int li;     // low part, stored as a scaled integer
    Accumulator() : hs(NARUMI_LARGE), li(0) {}
};

Addition

// NARUMI_LOWER_FACTOR and NARUMI_LOWER_FACTOR_1_D are not defined on the slides.
void add_narumi(Accumulator& a, float ys)
{
    float hs, ls, ws;

    // Knuth and Dekker addition
    hs = a.hs + ys;
    ws = hs - a.hs;
    ls = ys - ws;

    // Inner Narumi correction
    a.hs = hs;
    a.li += (int)(ls * NARUMI_LOWER_FACTOR);
}

Conversion to double

double upcast_narumi(Accumulator& a)
{
    double d = (double)(a.hs - NARUMI_LARGE);
    d += a.li * NARUMI_LOWER_FACTOR_1_D;
    return d;
}
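To try the three slide fragments together, the sketch below fills in the two constants the slides leave undefined. The value 2^26 for NARUMI_LOWER_FACTOR (and its reciprocal for NARUMI_LOWER_FACTOR_1_D) is purely an assumption chosen for illustration, as are the helper functions sum_narumi and sum_plain_float; the factor actually used in AMBER may differ. With this choice the integer part can overflow after a few hundred worst-case additions, so the sketch only suits short inner-loop sums.

```cpp
#include <cstddef>

const int NARUMI_LARGE_SHIFT = 21;
const float NARUMI_LARGE = (float)(1 << (NARUMI_LARGE_SHIFT - 1));  // 2^20

// ASSUMPTION: not given on the slides; 2^26 is a hypothetical scale factor.
const int NARUMI_LOWER_SHIFT = 26;
const float NARUMI_LOWER_FACTOR = (float)(1 << NARUMI_LOWER_SHIFT);
const double NARUMI_LOWER_FACTOR_1_D = 1.0 / (double)(1 << NARUMI_LOWER_SHIFT);

struct Accumulator {
    float hs;  // high part, offset by NARUMI_LARGE so its rounding step is fixed
    int li;    // low part: rounding errors of hs, stored as a scaled integer
    Accumulator() : hs(NARUMI_LARGE), li(0) {}
};

void add_narumi(Accumulator& a, float ys) {
    float hs = a.hs + ys;  // rounded sum
    float ws = hs - a.hs;  // what was actually added to the high part
    float ls = ys - ws;    // what the rounding discarded
    a.hs = hs;
    a.li += (int)(ls * NARUMI_LOWER_FACTOR);  // keep the discarded part as an integer
}

double upcast_narumi(const Accumulator& a) {
    double d = (double)(a.hs - NARUMI_LARGE);
    d += a.li * NARUMI_LOWER_FACTOR_1_D;
    return d;
}

// Hypothetical helper: sum an array with the Narumi accumulator.
double sum_narumi(const float* values, std::size_t n) {
    Accumulator acc;
    for (std::size_t i = 0; i < n; ++i) add_narumi(acc, values[i]);
    return upcast_narumi(acc);
}

// Hypothetical helper: plain single-precision sum for comparison.
float sum_plain_float(const float* values, std::size_t n) {
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) total += values[i];
    return total;
}
```

Summing 0.1f one thousand times is a convenient check: under the assumed scale factor the Narumi result lands much closer to the double-precision reference than the plain float sum does.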


Something for Everyone

•  DPFP 64-bit everything

•  SPFP 32-bit forces, U64 force summation, 64-bit state

•  SPXP 32-bit forces, Narumi force summation for inner loops, U64 summation, 64-bit state

Side by Side

DP:   22.855216396810960
DPFP: 22.855216396810960
SPFP: 22.855216396810xxx
SPXP: 22.8552163xxxxxxxx
SP:   22.855xxxxxxxxxxxx

SPXP – Unlocking true Maxwell Performance

Meanwhile…
(After having dealt with NVIDIA's enforced distraction)

New Features: Automated Workflows

AMBER GPU MD Workbench

Collaboration with Altintas, Amaro, & Walker

•  Facilitates publication-quality MD-based research & training.
•  Automated multistep minimization, heating, equilibration protocols.
•  Automated multi-copy job submission on GPU clusters.
•  Addresses reproducibility, comparison of results, rapid parameter testing / exploration.

[Figure: workflow with Minimization Actor and Equilibration Actor]

New Features: Asymmetric Boundary Conditions

Asymmetric Periodic Boundary Conditions

Reducing computational costs by reducing the need for multiple lipid bilayers in ionic gradient simulations.

•  Mapping:
    •  Even: Same as PBC, corr = corr - corr/length
    •  Odd:
        •  x' = x - 1/2 Lx - x/Lx
        •  y' = y - 1/2 Ly - y/Ly
        •  z' = (Lz - z) - (z - z/Lz)

[Figure: periodic images labeled 1 (odd), 0 (even), -1 (odd)]

APBC Considerations

•  Ewald Forces (WIP)
    •  Even
    •  Odd
•  Other considerations
    •  van der Waals (not included in Ewald)
    •  Visualization / post imaging?

Investigating ligand permeability across membranes using COM distance restraints and Lipid14

[Figure: glycerol ligand]

Improved performance of COM distance restraints using pmemd.cuda

System: Lipid bilayer + glycerol ligands

Hardware                              ns/day    Speed-up
Intel i7-5930K CPU (6 cores)            4.92       1.0
Intel i7-5930K CPU (2 x 6 cores)        5.42       1.1
1 x K40                                42.23       8.6
2 x K40                                61.15      12.4

New Features: Constant pH MD

•  Solution pH often has a dramatic impact on biomolecular systems.
    •  Affects the charge distribution within the biomolecule.
    •  Affects the fundamental structure and function of biomolecules.
    •  Some proteins' native states are stable only in a narrow pH range.
•  Traditional approach = constant protonation state
    •  Fixed protonation states of titratable residues at the start of each simulation.
    •  Does not sample changes in protonation state.
    •  Only samples conformational space of fixed protonation state(s).
    •  pH effects accounted for qualitatively.
•  Constant pH approach
    •  Samples conformational space AND samples protonation states.
    •  Constant chemical potential of hydronium ions.
    •  Periodic changes in discrete protonation states of titratable residues using Monte Carlo.
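The Monte Carlo step in the constant pH approach can be pictured as a standard Metropolis test. The sketch below is generic rather than AMBER's implementation: the function name is hypothetical, the transition free energy delta_g_kcal is assumed to be computed elsewhere (including any pH-dependent term), and the random number is passed in so the behaviour is reproducible.

```cpp
#include <cmath>

// Gas constant in kcal/(mol K), used as Boltzmann's constant per mole.
constexpr double KB_KCAL_PER_MOL_K = 0.0019872041;

// Metropolis criterion for a proposed protonation state change.
// delta_g_kcal:   ASSUMED to be the free energy change of the proposed move.
// temperature_k:  simulation temperature in kelvin.
// uniform_sample: a random number drawn uniformly from [0, 1).
bool accept_protonation_change(double delta_g_kcal,
                               double temperature_k,
                               double uniform_sample) {
    if (delta_g_kcal <= 0.0) return true;  // downhill moves are always accepted
    const double kT = KB_KCAL_PER_MOL_K * temperature_k;
    return uniform_sample < std::exp(-delta_g_kcal / kT);
}
```

A move that lowers the free energy is always taken, while an uphill move of 100 kcal/mol at 300 K is rejected for any realistic random draw.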

Constant pH REMD in Implicit Solvent

These projects were impossible to pursue before GPUs allowed routine access to 100 ns long pH-REMD simulations in one day.

•  Nitrophorin 4: well-characterized pH-dependent NO delivery into blood vessels.
•  Talin: pH-dependent binding to actin, implicated in cancer metastasis.
•  Human folate receptor: pH-dependent intracellular delivery; potential target for drug transport.

Constant pH MD in Explicit Solvent

[Flowchart: Input parameters and initial protonation states (n); Explicit solvent MD with protonation states (n); Strip solvent; Implicit (GB) solvent MD with random protonation state change (n); Accept protonation state change? (NO / YES); Restore solvent; Run solvent relaxation MD]

Current implementation only makes use of the GPU for part of the algorithm.

Predicted Performance of GPU-CpH-Ex

System: 3LZT crystal structure of the hen egg-white lysozyme (HEWL)

[Figure: two bar charts, performance (ns/day) and speed-up, comparing AMD FX-8150 CPU (8 core), AMD FX-8150 CPU + GeForce GTX TITAN Z GPU, and GeForce GTX TITAN Z GPU]

Summary

•  GPUs are awesome but continual 'internal' performance changes are crippling development.
•  Lots of new things in the pipeline – would be more if we didn't have to keep rewriting the guts.

Acknowledgements

San Diego Supercomputer Center
University of California San Diego
National Science Foundation: NSF SI2-SSE Program
NVIDIA Corporation: Hardware + People

People

Perri Needham, Romelia Salomon, Scott Le Grand, Robin Betz, Ben Madej, Simon Layton, Duncan Poole, Mark Berger, Sarah Tariq, Andreas Goetz, Age Skjevik