NSF CI days @ U Kentucky, February 2010 Insight into biomolecular structure from simulation:...

NSF CI days @ U Kentucky, February 2010 Insight into biomolecular structure from simulation: Promise, peril and barriers to moving beyond the terascale… Thomas E. Cheatham III Associate Professor [email protected] Departments of Medicinal Chemistry and of Pharmaceutics and Pharmaceutical Chemistry, College of Pharmacy, University of Utah NSF TeraGrid Science Advisory Board NSF LRAC/MRAC allocations panel (~2002-2008), chair NSF LRAC award since ~2001; ||-computing since 1987; ~17 M hours this year on local and NSF machines U Utah CI Council; Information Technology Council; CHPC


Page 1:

NSF CI days @ U Kentucky, February 2010

Insight into biomolecular structure from simulation: Promise, peril and barriers to moving beyond the terascale…

Thomas E. Cheatham III
Associate Professor
[email protected]

Departments of Medicinal Chemistry and of Pharmaceutics and Pharmaceutical Chemistry,
College of Pharmacy, University of Utah

NSF TeraGrid Science Advisory Board
NSF LRAC/MRAC allocations panel (~2002-2008), chair
NSF LRAC award since ~2001; ||-computing since 1987; ~17 M hours this year on local and NSF machines
U Utah CI Council; Information Technology Council; CHPC

Page 2:

eScience = cyberinfrastructure (???)

“The term ‘e-Science’ denotes the systematic development of research methods that exploit advanced computational thinking”

Professor Malcolm Atkinson, e-Science Envoy.

“Cyberinfrastructure” consists of computing systems, data storage systems, data repositories and advanced instruments, visualization environments, and people, all linked together by software and advanced networks to improve scholarly productivity and enable breakthroughs not otherwise possible.

EDUCAUSE, Campus Cyberinfrastructure workgroup

Page 3:

“If you’re a scientist, talk to a computer scientist about your challenges, and vice versa.”

i.e. clustering, data handling, …

Page 4:

How do drugs bind and influence structure (and dynamics)?

the tool: biomolecular simulation

Page 5:

The tool: Biomolecular simulation

energy vs. sampling

Page 6:

What is bio-molecular simulation?

• “physics” based atomic potential (the force field), tuned for proteins, nucleic acids and their surroundings (solvent, ions, drugs, …)

[Figure: force-field terms: bonds, angles, dihedrals, electrostatics (δ+ δ-), van der Waals]

There are many force fields, each with distinct performance characteristics…
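For concreteness, the terms named on this slide (bonds, angles, dihedrals, van der Waals, electrostatics) are commonly combined into an additive potential of roughly the following shape. This is a generic textbook-style form written out as an illustration, not a formula taken from the talk; the parameter symbols ($K_b$, $K_\theta$, $V_n$, $\gamma$, $A_{ij}$, $B_{ij}$, $q_i$, $\varepsilon$) are the usual names and are assumptions of this sketch. Individual force fields differ in the exact terms and in how the parameters are fit.

```latex
E_{\text{total}} =
    \sum_{\text{bonds}} K_b \,(b - b_0)^2
  + \sum_{\text{angles}} K_\theta \,(\theta - \theta_0)^2
  + \sum_{\text{dihedrals}} \frac{V_n}{2}\,\bigl[\,1 + \cos(n\phi - \gamma)\,\bigr]
  + \sum_{i<j} \left[ \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} \right]
  + \sum_{i<j} \frac{q_i\, q_j}{\varepsilon\, r_{ij}}
```

The first three sums run over bonded terms; the last two run over non-bonded atom pairs and dominate the computational cost.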

Page 7:

What is bio-molecular simulation?

• “physics” based atomic potential (the force field), tuned for proteins, nucleic acids and their surroundings (solvent, ions, drugs, …)

• codes and methods developed over the past ~40 yrs by various teams including centers, labs and industry
  – 80’s: vectorization + early parallel architectures
  – 90’s: shared memory and distributed memory parallelized
  – 00’s: special purpose hardware and optimized codes
  – AMBER, CHARMM, Encad, NAMD, Desmond, GROMOS, Gromacs, LAMMPS, …

Page 8:

• de novo protein folding, structure prediction
• computer aided drug design
• design of novel materials / properties
• multi-scale modeling
• simulation of time scales that are approaching relevant time scales

CAPTOPRIL: ACE inhibitor (antihypertensive)
VIRACEPT: HIV protease inhibitor (AIDS therapy). Agouron.
CRIXIVAN: HIV protease inhibitor (AIDS therapy). Merck.
VIAGRA: cGMP PDE type 5 inhibitor (impotence). Pfizer.
ZOMIG: tryptamine receptor agonist (migraine). Zeneca.
TEVETEN: angiotensin II receptor antagonist (hypertension)
TRUSOPT: carbonic anhydrase inhibitor (glaucoma)
ARICEPT: AChE inhibitor (Alzheimer's/dementia)
COZAAR: angiotensin II receptor antagonist (hypertension)
NOROXIN: inhibits bacterial DNA synthesis (antibacterial)

Page 9:

- Do the simulations model reality?

- How can we assess & validate the results?

- Can simulations provide predictive insight?

- How can we improve the applied methods?

general goals of bio-molecular simulation research

Page 10:

- Do the simulations model reality?

- How can we assess & validate the results?

- Can simulations provide predictive insight?

- How can we improve the applied methods?

- How can we facilitate the simulation experiment?

- How can we better disseminate the data?

- How can we use the emerging machines?

general goals of bio-molecular simulation research

computational science (???)

Page 11:

Experience required?

• structural biology

• statistical mechanics

• biophysics / computational chemistry

• pharmacy / organic chemistry

• UNIX / system administration

• coding ability (Fortran90, scripting, …)

• parallel computing

• data handling, analysis, viz

B.A. Chemistry
B.A. Mathematics & Comp. Sci.
PhD Pharm Chem
Programmer / Analyst, 2 yrs
+ NSF centers

Still learning…

interdisciplinary teams?

Page 12:

The power of the TeraGrid (aka metacenter, PACI, xD: NSF centers)

• education / training
  – CM-2a, CM-5, MasPar training, ~1989
  – Summer Institute in Supercomputing at PSC, 1992
  – Scientific Computing Institute at Los Alamos, 1992
      • vectorization, basic concepts of shared vs. distributed memory
  – Heterogeneous computing at PSC, 1994
      • shared memory + MPI (+ PVM, TCGMSG, …)
  – Shared memory and MPI parallelized AMBER released (PSC, SGI)
  – AMBER workshops (as teacher), 1996 & 1998

• outreach
  – center brochures, literature, WWW pages, joint publications
  – Computerworld Smithsonian Awards Finalist (with PSC, UCSF, NIEHS)

• cycles!!!
  – friendly user status, consultants, helpline, porting guides

Allocations: ~100K in 1995, ~1M in 2002, ~10M in 2009, ~14M in 2010, …

Page 13:

$$$:

1R01-GM081411-01: Biomolecular simulation for the end-stage refinement of nucleic acid structure

1R01-GM079383-01: “AMBER force field consortium”

Research funding focuses on NIH mission (basic science + health relevance)

Curious trends, barriers and limitations in the field…

- Funding is results driven, little reward for software optimization
- NIH does not really fund (or support) supercomputing / CI
- …yet NIH funds the bulk of biomolecular simulation research (?)

Page 14:

$$$:

1R01-GM081411-01: Biomolecular simulation for the end-stage refinement of nucleic acid structure

1R01-GM079383-01: “AMBER force field consortium”

Research funding focuses on NIH mission (basic science + health relevance)

Curious trends, barriers and limitations in the field…

- Funding is results driven, little reward for software optimization
- NIH does not really fund (or support) supercomputing / CI
- …yet NIH funds the bulk of biomolecular simulation research (?)
- Student PhDs tend to be in “chemistry” (no expertise in computational science)
- Codes are complex, legacy, and evolving…

Page 15:

Curious trends, barriers and limitations in the field…

• NSF does not directly fund most biomolecular simulation research
• few agencies or companies support biosimulation code development
• bulk of cycles in field from NSF centers, then DOE.
  – 10% cap on NIH research vs. inter-agency cooperation?

[Pie chart: PHY 18%, AST 14%, CHE 11%, DMR 8%, DMS 0%, BIO 30%, ENG 10%, CIS 3%, GEO 2%, SBE 0%, IND 4%]

Page 16:

Curious trends, barriers and limitations in the field…

• NSF does not directly fund most biomolecular simulation research
• few agencies or companies support biosimulation code development
• bulk of cycles in field from NSF centers, then DOE.
  – 10% cap on NIH research vs. inter-agency cooperation?

Threats:
- Without NSF cycles and the TeraGrid/xD the field of biomolecular simulation would stagnate.

[Pie chart: PHY 18%, AST 14%, CHE 11%, DMR 8%, DMS 0%, BIO 30%, ENG 10%, CIS 3%, GEO 2%, SBE 0%, IND 4%]

Page 17:

Curious trends, barriers and limitations in the field…

• NSF does not directly fund most biomolecular simulation research
• few agencies or companies support biosimulation code development
• bulk of cycles in field from NSF centers, then DOE.
  – 10% cap on NIH research vs. inter-agency cooperation?

Threats:
- Without NSF cycles and the TeraGrid/xD the field of biomolecular simulation would stagnate.
- …we are spending more and more of our time running simulations, managing workflow, transferring data, i.e. doing computational science

[Pie chart: PHY 18%, AST 14%, CHE 11%, DMR 8%, DMS 0%, BIO 30%, ENG 10%, CIS 3%, GEO 2%, SBE 0%, IND 4%]

Page 18:

[Plot: all-atom RMSd (Å, axis 0-6) vs. time (ps, axis 0-3000); series: RMS to A-DNA, RMS to B-DNA]

- simulations run for ~6 months, 16-32-way parallel, batch
- < 100 GB data, run remotely, stored and analyzed locally
- analysis is standard (key values vs. time)
- required advances (completed):
    - methods improvement (PME electrostatics)
    - optimized codes for shared memory, MPI, …
    - development of general purpose analysis utilities: “ptraj”

MD simulations, ~500 ps – 3 ns, ~1994-1997

bio-molecular simulation at the meta-scale
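The "key values vs. time" analysis on this slide can be illustrated with a small sketch of an RMSd time series. This is only a minimal illustration under stated assumptions: it removes translation by centering but does not perform the rotational best fit that a real analysis tool would normally apply, and the coordinates, function names and frame data are all made up for the example.

```python
import math

def centroid(coords):
    """Geometric center of a list of (x, y, z) tuples."""
    n = len(coords)
    return tuple(sum(c[k] for c in coords) / n for k in range(3))

def centered(coords):
    """Translate coordinates so their centroid sits at the origin."""
    cx, cy, cz = centroid(coords)
    return [(x - cx, y - cy, z - cz) for x, y, z in coords]

def rmsd_no_fit(frame, reference):
    """RMSd after removing translation only (no rotational fitting)."""
    if len(frame) != len(reference):
        raise ValueError("frame and reference must have the same atom count")
    a, b = centered(frame), centered(reference)
    sq = sum((p[k] - q[k]) ** 2 for p, q in zip(a, b) for k in range(3))
    return math.sqrt(sq / len(a))

def rmsd_series(frames, reference):
    """One RMSd value per trajectory frame: a 'key value vs. time' series."""
    return [rmsd_no_fit(f, reference) for f in frames]

# Hypothetical three-atom reference and two made-up frames.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
frames = [
    [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0), (5.0, 6.0, 5.0)],  # pure translation
    [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 1.0, 0.0)],  # one atom displaced
]
series = rmsd_series(frames, reference)
```

Computing the same series against two references (for example an A-form and a B-form model) gives the two curves shown in the plot.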

Page 19:

bio-molecular simulation at the tera+ -scale

tetraloop receptor: 5 simulations @ ~200 ns
cyp-P450 2B4: 8 simulations @ ~150 ns
DNA minor groove binders: 7 drugs, 2 binding modes, 4 sequences @ ~50 ns

- simulations run for ~6 months, 16-1K-way parallel, batch
- ~1-5 TB per set, run remotely, stored and analyzed locally
- analysis has become rate limiting; data too large/slow…

Page 20:

Data is complex: How to simplify?
(don’t throw out baby with bathwater)

vast time/size scales; granularity scales

Page 21:

…if we know what we want to see, analyzing and visualizing is easy…

…and tools are available

Page 22:

force fields vs. sampling

we (likely) have systematic problems with structure or converge to incorrect structure

we (likely) get trapped in meta-stable conformations

[Figure: energy vs. reaction coordinate]

Computer power?

the good / the bad

Page 23:

David E. Shaw: DESRES

16 microseconds / day !!!

Page 24:

Funny things can and do happen…

&

we’re experiencing serious data overload…

500 nanosecond simulation of a DNA duplex using generalized Born implicit solvation

Page 25:

Some problems (~2000-2008)

K+, Cl-, Mg2+ crystal?

Phased A-tract

burrowing Mg2+ ion?

Page 26:

Joung / Cheatham, JPCB 113, 13279 (2009)

Page 27:

How about long DNA simulation?

> 500 ns on DAPI bound DNA duplexes
Cornell et al. force field.

site   Ecomplex   EDNA+20w   DAPI     ΔG      ΔΔG*
ATTG   -4085.0    -3915.6    -149.7   -19.7   -2.4
AATT   -4086.4    -3917.9    -149.7   -18.8
ATTG   -4085.7    -3916.4    -149.7   -19.6   +1.0
AATT   -4087.5    -3917.2    -149.7   -20.6
ATTG   -4087.2    -3918.7    -149.7   -18.8   +1.4
AATT   -4092.8    -3922.9    -149.7   -20.2

* Includes entropic differences

J. Amer. Chem. Soc. (2003) Špackova et al. (Cheatham, Sponer)
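The energies tabulated on this slide appear to be related by ΔG = Ecomplex - EDNA+20w - EDAPI. That relation is inferred here from the numbers themselves and is not stated on the slide, so the sketch below should be read as a consistency check under that assumption. The row values are copied from the slide; the function and variable names are made up for illustration.

```python
# Rows copied from the slide: (site, E_complex, E_DNA+20w, E_DAPI, tabulated dG).
rows = [
    ("ATTG", -4085.0, -3915.6, -149.7, -19.7),
    ("AATT", -4086.4, -3917.9, -149.7, -18.8),
    ("ATTG", -4085.7, -3916.4, -149.7, -19.6),
    ("AATT", -4087.5, -3917.2, -149.7, -20.6),
    ("ATTG", -4087.2, -3918.7, -149.7, -18.8),
    ("AATT", -4092.8, -3922.9, -149.7, -20.2),
]

def binding_energy(e_complex, e_dna, e_ligand):
    """Assumed relation: complex energy minus the two separated components."""
    return e_complex - e_dna - e_ligand

# Recompute dG for each row, rounded to the one decimal shown on the slide.
recomputed = [round(binding_energy(c, d, l), 1) for _, c, d, l, _ in rows]

# Rows where the recomputed value disagrees with the tabulated one.
mismatches = [
    (row[0], calc, row[4])
    for row, calc in zip(rows, recomputed)
    if abs(calc - row[4]) > 0.05
]

# Difference dG(ATTG) - dG(AATT) for each consecutive pair of rows.
pair_differences = [
    round(rows[i][4] - rows[i + 1][4], 1) for i in range(0, len(rows), 2)
]
```

Under this assumption every tabulated ΔG is reproduced. The pairwise differences match the tabulated ΔΔG for the second and third pairs (+1.0, +1.4); the first pair gives -0.9 rather than -2.4, which may reflect the entropic contribution mentioned in the footnote, though the slide does not say so explicitly.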

Page 28:

(CCAATTGG)2GG at ~350 ns (two separate simulations)

…DNA duplex structure goes away and never comes back…

Page 29:

• dynamics / flexibility
• > 1 conformation
• structure is (very) sensitive to the surroundings
• un-validated force fields
• very few drug bound structures…

RNA is more difficult…
…but also much more interesting!

Page 30:

[Figure: RNA structure with residues 7, 8, 9 and 10 labeled]

STATISTICS d109
  DISTANCE between atoms :9@H5 & :7@H1'
  AVERAGE: 6.8887 (2.7204 stddev)
  INITIAL: 4.2624
  FINAL: 6.5966
  NOE SERIES: S < 2.9, M < 3.5, w < 5.0, blank otherwise.
  |SMMMMWMMWMWW W |
  NOE < 4.30 for 21.86% of the time
  NOE < 4.80 for 24.83% of the time

             < 2.5   2.5-3.5   3.5-4.5   4.5-5.5   5.5-6.5   > 6.5
  -----------------------------------------------------------------
  %occupied |  0.7  |  13.1   |   9.2   |   6.2   |  10.0   |  60.7  |

“Long” MD (~20-100 ns): restraints progressively violated…

[Figure: sequence labels: U8 U U A7 C A A Ψ G C A U C G U A U]
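The occupancy table and the "NOE < cutoff" percentages on this slide are simple statistics over a distance time series. The following is a rough sketch of how such numbers can be computed; it is not the analysis program's actual implementation. The bin edges mirror the table, while the distance values and function names are invented for the example.

```python
def occupancy(distances, edges=(2.5, 3.5, 4.5, 5.5, 6.5)):
    """Percent of frames falling in each bin: < first edge, between edges, >= last edge."""
    counts = [0] * (len(edges) + 1)
    for d in distances:
        # Index of the bin = number of edges the distance has reached or passed.
        idx = sum(1 for e in edges if d >= e)
        counts[idx] += 1
    n = len(distances)
    return [100.0 * c / n for c in counts]

def percent_below(distances, cutoff):
    """Percent of frames in which the distance satisfies a restraint cutoff."""
    return 100.0 * sum(1 for d in distances if d < cutoff) / len(distances)

# Made-up distance series (Angstrom), one value per trajectory frame.
distances = [2.4, 3.0, 3.2, 4.0, 5.0, 6.0, 7.0, 7.5, 8.0, 9.0]

occ = occupancy(distances)
satisfied = percent_below(distances, 4.30)
```

A restraint that is satisfied only a small fraction of the time, as in the slide's 21.86% figure, signals that the simulation has drifted away from the experimentally supported contact.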

Page 31:

peta- to exa-scale worries…

Page 32:

MD codes scale to ~16-256 processors @ > 70% efficiency

► getting to 1000 is do-able (Bob Duke, UNC; Schulten, UIUC; DE Shaw; E. Lindahl)
► getting to 10,000 is hard (PetaApps).
► getting to 100,000: ??? (ensemble methods)

not easily with embarrassingly NOT parallel MD

Petascale science: scaling

It is hard to ||-ize time
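One way to see why going from hundreds to tens of thousands of processors is hard is a simple Amdahl's-law estimate. This model is an assumption introduced here for illustration; the talk does not name it, and real MD scaling is also limited by communication costs that this sketch ignores. The 70% efficiency at 256 processors is taken from the slide; everything else, including the function names, is hypothetical.

```python
def amdahl_efficiency(processors, serial_fraction):
    """Parallel efficiency (speedup / processors) under Amdahl's law."""
    speedup = 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)
    return speedup / processors

def serial_fraction_for(efficiency, processors):
    """Serial fraction that would yield the given efficiency at this processor count."""
    return (1.0 / efficiency - 1.0) / (processors - 1)

# Calibrate to the slide's figure: 70% efficiency at 256 processors.
s = serial_fraction_for(0.70, 256)

# Extrapolate the same serial fraction to larger machines.
table = {p: amdahl_efficiency(p, s) for p in (16, 256, 1000, 10000, 100000)}

for p, eff in table.items():
    print(f"{p:>7d} processors: {100.0 * eff:5.1f}% efficiency")
```

Even a serial fraction well under 1% drives efficiency down sharply beyond a few thousand processors in this model, which is consistent with the slide's turn toward ensemble methods for the largest machines.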

Page 33:

trajin traj.1.gz
trajin traj.2.gz
trajin traj.3.gz
trajout traj.strip
center :1-10 mass origin
image origin center
center :1-20 mass origin
image origin center
rms first mass out rms.dat :1-20
distance d1 out d1.dat :1 :10
grid wat.xplor 100 0.5 100 0.5 100 0.5 :3-8,13-18
strip :WAT
average pdb average.pdb :1-20

…the standard means of analysis is breaking down…

data management & simulation workflow are limiting…

Page 34:

trajin traj.1.gz
trajin traj.2.gz
trajin traj.3.gz
trajout traj.strip
center :1-10 mass origin
image origin center
center :1-20 mass origin
image origin center
rms first mass out rms.dat :1-20
distance d1 out d1.dat :1 :10
grid wat.xplor 100 0.5 100 0.5 100 0.5 :3-8,13-18
strip :WAT
average pdb average.pdb :1-20

…the standard means of analysis is breaking down…

data management & simulation workflow are limiting…

New modes of operation: ENSEMBLES
- replica-exchange
- path integral
- EVB
- ΔG simulations
- NEB / path sampling
- meta-dynamics

More data, more complicated workflow, …

Essentially a set of loosely coupled 16-1K processor jobs

Examples:
two ΔG states, 20 windows = 40 * 20 temperatures = 800 instances
256 frames on a reaction path * 16 beads per particle * … =
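The first example on this slide is a straightforward product, and the arithmetic can be spelled out as below. The instance count follows the slide; the aggregate processor range is an extrapolation made here by combining it with the slide's "16-1K processor jobs", reading "1K" as 1024, which is an assumption. The second example is left out because the slide elides its remaining factors.

```python
def ensemble_instances(states, windows, temperatures):
    """Number of independent simulation instances in the ensemble."""
    return states * windows * temperatures

# From the slide: two states, 20 windows each, 20 temperatures.
instances = ensemble_instances(states=2, windows=20, temperatures=20)

# Assumption: each instance is itself a 16- to 1024-processor job.
processors_low = instances * 16
processors_high = instances * 1024

print(f"{instances} instances -> {processors_low} to {processors_high} processors")
```

Seen this way, an ensemble is a natural fit for very large machines, but it also multiplies the number of trajectories, restart files and job states that the workflow has to track.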

Page 35:

tetraloop receptor: 5 simulations @ 200 ns

> 1 TB of data

Petascale science: the problem will only get worse!

What if we can run 1000x longer?
…or 10x bigger for 100x longer?

Page 36:

tetraloop receptor: 5 simulations @ 200 ns

> 1 TB of data

What if we can run 1000x longer?
…or 10x bigger for 100x longer?

> 1000 TB of data

…factor of 10: OK
…factor of 100: hard
…factor of 1000: ???

…more and more time is spent moving data / managing simulations; less time spent doing science…

Petascale science: the problem will only get worse!
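A back-of-the-envelope estimate makes the data-movement concern on this slide concrete. The 1 TB starting size and the growth factors come from the slide; the sustained 1 Gb/s transfer rate is purely a hypothetical figure chosen for illustration, and the function name is made up.

```python
def transfer_days(terabytes, gigabits_per_second):
    """Days needed to move a data set at a sustained network rate."""
    bits = terabytes * 1e12 * 8            # decimal terabytes to bits
    seconds = bits / (gigabits_per_second * 1e9)
    return seconds / 86400.0

# Hypothetical sustained rate of 1 Gb/s; growth factors from the slide.
scenarios = {factor: transfer_days(1.0 * factor, 1.0) for factor in (1, 10, 100, 1000)}

for factor, days in scenarios.items():
    print(f"{factor:>5d} TB: {days:8.2f} days at 1 Gb/s")
```

Under that assumed rate, 1 TB moves in a couple of hours while 1000 TB takes roughly three months, which is why the slide treats a factor of 1000 as an open question.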

Page 37:

- Do not move the data (?)
- Tiered resources
- Persistent storage
- Re-running the simulations

Solutions? Analysis “on the fly…” [ & more coarse-grained sampling ]

+ workflow tools for ensembles

Petascale science: the problem will only get worse!

…what will we miss? Can we only get low hanging fruit?
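As one illustration of what analysis "on the fly" could look like, the sketch below accumulates a mean and standard deviation one value at a time using Welford's online algorithm, so that no frames need to be stored. The choice of algorithm, the class name and the sample values are all assumptions of this example; the talk does not prescribe a specific method.

```python
import math

class RunningStats:
    """Streaming mean and sample standard deviation (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def stddev(self):
        """Sample standard deviation; 0.0 until at least two values are seen."""
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0

# Hypothetical per-frame values (e.g. a distance) arriving during a run.
stats = RunningStats()
for value in (4.0, 5.0, 6.0, 7.0, 8.0):
    stats.update(value)
```

The trade-off raised by the slide still applies: a quantity that was not accumulated during the run cannot be recovered later without re-running the simulation or keeping the raw data.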

Page 38:

Hindrances:

Codes have become “simpler” and will need to be restructured.
intra-core vs. intra-node vs. inter-node vs. cpu type

We want to retain high precision / accuracy.

We want to be able to enable new methods (with ease).

( Force fields are not yet up to the challenge!!! )

Petascale science: Worries as we move forward…

Page 39:

What we need (data/workflow-centric) is:

…a means to speed up & enable science…

…a means to interact with our simulations: “steer”, inspect, share, search, understand, expose (hidden correlations, meaning, data)

…a means to manage large simulation workflows…

disseminate, enable re-use

How do we make TB’s of raw data available?

- remote references to data
- partial analysis, on the fly analysis
- history, memory, or provenance
- standards (?)
- annotation
- automation – workflow!

Page 40:

- Educated people / teams (multidisciplinary, experts)

- Software / middleware (workflow, provenance, data handling)

- Software – code optimization / parallelization / extensions

- Ease of use

- Means to analyze data, distribute data, preserve/archive data…

- More cycles, more disk space, …

- More science, less computational science

My world reinforces Seidel’s CI crises. We need:

Page 41:

Page 42:

Hepatitis C virus IRES

IRES = internal ribosome entry site (translation initiation in middle of mRNA)

Page 43:

Page 44:

Why is failure important to learn about?

These methods are in wide use worldwide:

- CADD
- Structure Prediction
- Mechanisms
- Molecular association

Most people do not have 15M hour allocations

Data from failure can be reused!

~500 active NIH grants with “molecular dynamics” in abstract!

Page 45:

[Figure labels: office, CHPC, home]

NIH R01-GM081411-01A1
NIH R01-GM079383-01A1
NSF TG-MCA01S027