NSF CI days @ U Kentucky, February 2010 Insight into biomolecular structure from simulation:...

Post on 17-Dec-2015


NSF CI days @ U Kentucky, February 2010

Insight into biomolecular structure from simulation: Promise, peril and barriers to moving beyond the terascale…

Thomas E. Cheatham III
Associate Professor
tec3@utah.edu

Departments of Medicinal Chemistry and of Pharmaceutics and Pharmaceutical Chemistry, College of Pharmacy, University of Utah

NSF TeraGrid Science Advisory Board
NSF LRAC/MRAC allocations panel (~2002-2008), chair
NSF LRAC award since ~2001; ||-computing since 1987; ~17 M hours this year on local and NSF machines
U Utah CI Council; Information Technology Council; CHPC

eScience = cyberinfrastructure (???)

“The term ‘e-Science’ denotes the systematic development of research methods that exploit advanced computational thinking”

Professor Malcolm Atkinson, e-Science Envoy.

“Cyberinfrastructure” consists of computing systems, data storage systems, data repositories and advanced instruments, visualization environments, and people, all linked together by software and advanced networks to improve scholarly productivity and enable breakthroughs not otherwise possible.

EDUCAUSE, Campus Cyberinfrastructure workgroup

“If you’re a scientist, talk to a computer scientist about your challenges, and vice versa.”

i.e. clustering, data handling, …

How do drugs bind and influence structure (and dynamics)?

The tool: biomolecular simulation

energy vs. sampling

What is bio-molecular simulation?

• “physics” based atomic potential (the force field) tuned for proteins, nucleic acids and their surroundings (solvent, ions, drugs, …)

[Figure: force-field terms: bonds, angles, dihedrals, electrostatics (δ+ δ-), van der Waals]

There are many force fields, each with distinct performance characteristics…
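The five terms named in the figure can be sketched as simple pairwise or per-term functions. This is a minimal sketch assuming the common textbook functional forms (harmonic bonds and angles, a cosine dihedral, Coulomb and 12-6 Lennard-Jones); the slides do not give the actual functional form or parameters, so the function names, parameter names and the unit conversion constant below are illustrative assumptions, not the author's force field.

```python
import math

# Assumed unit convention: kcal/mol, Angstrom, elementary charges.
COULOMB_CONSTANT = 332.0636  # kcal*Angstrom/(mol*e^2), a commonly used conversion factor


def bond_energy(r, r0, k):
    """Harmonic bond stretch around the equilibrium length r0."""
    return k * (r - r0) ** 2


def angle_energy(theta, theta0, k):
    """Harmonic angle bend around the equilibrium angle theta0 (radians)."""
    return k * (theta - theta0) ** 2


def dihedral_energy(phi, barrier, periodicity, phase):
    """Periodic torsion term: (barrier / 2) * [1 + cos(n * phi - phase)]."""
    return 0.5 * barrier * (1.0 + math.cos(periodicity * phi - phase))


def coulomb_energy(q_i, q_j, r):
    """Electrostatic interaction between two point charges at distance r."""
    return COULOMB_CONSTANT * q_i * q_j / r


def lennard_jones_energy(r, epsilon, r_min):
    """12-6 van der Waals term with well depth epsilon at distance r_min."""
    ratio6 = (r_min / r) ** 6
    return epsilon * (ratio6 ** 2 - 2.0 * ratio6)
```

A total energy would sum these terms over all bonds, angles, dihedrals and non-bonded atom pairs; how each force field chooses the parameters is what gives it its distinct performance characteristics.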

• codes and methods developed over the past ~40 yrs by various teams including centers, labs and industry
– 80’s: vectorization + early parallel architectures
– 90’s: shared memory and distributed memory parallelization
– 00’s: special purpose hardware and optimized codes
– AMBER, CHARMM, Encad, NAMD, Desmond, GROMOS, Gromacs, LAMMPS, …

• de novo protein folding, structure prediction
• computer aided drug design
• design of novel materials / properties
• multi-scale modeling
• simulation of time scales that are approaching relevant time scales

CAPTOPRIL: ACE inhibitor (antihypertensive)
VIRACEPT: HIV protease inhibitor (AIDS therapy). Agouron.
CRIXIVAN: HIV protease inhibitor (AIDS therapy). Merck.
VIAGRA: cGMP PDE type 5 inhibitor (impotence). Pfizer.
ZOMIG: tryptamine receptor antagonist (migraine). Zeneca.
TEVETEN: angiotensin II receptor antagonist (hypertension)
TRUSOPT: carbonic anhydrase inhibitor (glaucoma)
ARICEPT: AChE inhibitor (Alzheimer’s/dementia)
COZAAR: angiotensin II receptor antagonist (hypertension)
NOROXIN: inhibits bacterial DNA synthesis (antibacterial)


general goals of bio-molecular simulation research

- Do the simulations model reality?
- How can we assess & validate the results?
- Can simulations provide predictive insight?
- How can we improve the applied methods?
- How can we facilitate the simulation experiment?
- How can we better disseminate the data?
- How can we use the emerging machines?

computational science (???)

Experience required?

• structural biology

• statistical mechanics

• biophysics / computational chemistry

• pharmacy / organic chemistry

• UNIX / system administration

• coding ability (Fortran90, scripting, …)

• parallel computing

• data handling, analysis, viz

B.A. Chemistry
B.A. Mathematics & Comp. Sci.
PhD Pharm Chem

Programmer / Analyst, 2 yrs

+ NSF centers

Still learning…

interdisciplinary teams?

The power of the TeraGrid (aka metacenter, PACI, xD: NSF centers)

• education / training
– CM-2a, CM-5, MasPar training, ~1989
– Summer Institute in Supercomputing at PSC, 1992
– Scientific Computing Institute at Los Alamos, 1992
    • vectorization, basic concepts of shared vs. distributed memory
– Heterogeneous computing at PSC, 1994
    • shared memory + MPI (+ PVM, TCGMSG, …)
– Shared memory and MPI parallelized AMBER released (PSC, SGI)
– AMBER workshops (as teacher), 1996 & 1998

• outreach
– center brochures, literature, WWW pages, joint publications
– Computerworld Smithsonian Awards Finalist (with PSC, UCSF, NIEHS)

• cycles!!!
– friendly user status, consultants, helpline, porting guides

Allocations: ~100K in 1995, ~1M in 2002, ~10M in 2009, ~14M in 2010, …

$$$:

1R01-GM081411-01: Biomolecular simulation for the end-stage refinement of nucleic acid structure

1R01-GM079383-01: “AMBER force field consortium”

Research funding focuses on NIH mission (basic science + health relevance)

Curious trends, barriers and limitations in the field…

- Funding is results driven, little reward for software optimization
- NIH does not really fund (or support) supercomputing / CI
- …yet NIH funds the bulk of biomolecular simulation research (?)


- Student PhDs tend to be in “chemistry” (no expertise in computational science)

- Codes are complex, legacy, and evolving…

Curious trends, barriers and limitations in the field…

• NSF does not directly fund most biomolecular simulation research
• few agencies or companies support biosimulation code development
• bulk of cycles in field from NSF centers, then DOE.
– 10% cap on NIH research vs. inter-agency cooperation?

[Pie chart: PHY 18%, AST 14%, CHE 11%, DMR 8%, DMS 0%, BIO 30%, ENG 10%, CIS 3%, GEO 2%, SBE 0%, IND 4%]


Threats:
- Without NSF cycles and the TeraGrid/xD the field of biomolecular simulation would stagnate.
- …we are spending more and more of our time running simulations, managing workflow, transferring data, i.e. doing computational science

bio-molecular simulation at the meta-scale

MD simulations, ~500 ps – 3 ns, ~1994-1997

[Figure: all-atom RMSd (Å) vs. time (ps), 0-3000 ps; series: RMS to A-DNA, RMS to B-DNA]

- simulations run for ~6 months, 16-32-way parallel, batch
- < 100 GB data, run remotely, stored and analyzed locally
- analysis is standard (key values vs. time)
- required advances (completed):
    - methods improvement (PME electrostatics)
    - optimized codes for shared memory, MPI, …
    - development of general purpose analysis utilities (“ptraj”)
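The “key values vs. time” analysis in the figure is a root-mean-square deviation of each frame from a reference structure (A-DNA or B-DNA). A minimal sketch of that quantity, assuming coordinates are already aligned to the reference; it deliberately omits the best-fit superposition that trajectory tools normally perform first, and the function and argument names are illustrative only.

```python
import math


def rmsd(coords, reference, masses=None):
    """RMS deviation (optionally mass-weighted) between two coordinate sets.

    coords and reference are equal-length sequences of (x, y, z) tuples.
    No superposition is done here: the structures are assumed pre-aligned.
    """
    if len(coords) != len(reference):
        raise ValueError("coordinate sets must have the same number of atoms")
    if masses is None:
        masses = [1.0] * len(coords)
    total_mass = sum(masses)
    weighted_sq = 0.0
    for mass, atom, ref_atom in zip(masses, coords, reference):
        weighted_sq += mass * sum((a - b) ** 2 for a, b in zip(atom, ref_atom))
    return math.sqrt(weighted_sq / total_mass)


# One value per frame gives the curve plotted against time.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(3.0, 4.0, 0.0), (4.0, 4.0, 0.0)],
]
series = [rmsd(frame, reference) for frame in frames]
```

Computing one such series against each reference structure yields the two curves in the plot.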

bio-molecular simulation at the tera+ -scale

tetraloop receptor: 5 simulations @ ~200 ns
cyp-P450 2B4: 8 simulations @ ~150 ns
DNA minor groove binders: 7 drugs, 2 binding modes, 4 sequences @ ~50 ns

- simulations run for ~6 months, 16-1K-way parallel, batch
- ~1-5 TB per set, run remotely, stored and analyzed locally
- analysis has become rate limiting; data too large/slow…

Data is complex: How to simplify? (don’t throw out baby with bathwater)

vast time/size scales; granularity scales

…if we know what we want to see, analyzing and visualizing is easy…

…and tools are available

force fields vs. sampling

we (likely) have systematic problems with structure or converge to incorrect structure

we (likely) get trapped in meta-stable conformations

[Figure: energy vs. reaction coordinate]

Computer power?

the good the bad

David E. Shaw: DESRES

16 microseconds / day !!!

Funny things can and do happen…

&

we’re experiencing serious data overload…

500 nanosecond simulation of a DNA duplex using generalized Born implicit solvation

Some problems (~2000-2008)

K+, Cl-, Mg2+crystal?

Phased A-tract

burrowing Mg2+ ion?

Joung / Cheatham, JPCB 113, 13279 (2009)

How about long DNA simulation?

> 500 ns on DAPI bound DNA duplexes

Cornell et al. force field.

site   E_complex   E_DNA+20w   DAPI     ΔG      ΔΔG*
ATTG   -4085.0     -3915.6     -149.7   -19.7   -2.4
AATT   -4086.4     -3917.9     -149.7   -18.8
ATTG   -4085.7     -3916.4     -149.7   -19.6   +1.0
AATT   -4087.5     -3917.2     -149.7   -20.6
ATTG   -4087.2     -3918.7     -149.7   -18.8   +1.4
AATT   -4092.8     -3922.9     -149.7   -20.2

* Includes entropic differences

J. Amer. Chem. Soc. (2003) Špackova et al. (Cheatham, Sponer)
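The tabulated binding energies are numerically consistent with a simple difference, ΔG = E_complex − E_DNA − E_DAPI, and the second and third ΔΔG values with ΔG(ATTG) − ΔG(AATT) for each pair of rows. The sketch below checks that arithmetic; the relation is inferred from the numbers rather than stated on the slide, and the units are not given in the source.

```python
# Rows as listed in the table: (site, E_complex, E_DNA+20w, E_DAPI).
ROWS = [
    ("ATTG", -4085.0, -3915.6, -149.7),
    ("AATT", -4086.4, -3917.9, -149.7),
    ("ATTG", -4085.7, -3916.4, -149.7),
    ("AATT", -4087.5, -3917.2, -149.7),
    ("ATTG", -4087.2, -3918.7, -149.7),
    ("AATT", -4092.8, -3922.9, -149.7),
]


def binding_energy(e_complex, e_dna, e_dapi):
    """Energy of the complex minus the energies of its separated parts."""
    return e_complex - e_dna - e_dapi


# One binding energy per row, rounded to the table's precision.
delta_g = [round(binding_energy(*row[1:]), 1) for row in ROWS]

# Rows come in (ATTG, AATT) pairs; take the difference within each pair.
delta_delta_g = [
    round(binding_energy(*ROWS[i][1:]) - binding_energy(*ROWS[i + 1][1:]), 1)
    for i in range(0, len(ROWS), 2)
]
```

The first pair gives a plain difference of about -0.9, whereas the table lists -2.4; the footnote says the tabulated ΔΔG includes entropic differences, which this sketch does not model.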

(CCAATTGG)2GG at ~350 ns (two separate simulations)

…DNA duplex structure goes away and never comes back…

• dynamics / flexibility
• > 1 conformation
• structure is (very) sensitive to the surroundings
• un-validated force fields
• very few drug bound structures…

RNA is more difficult… …but also much more interesting!

[Figure: residues 7, 8, 9 and 10 labeled]

STATISTICS d109
    DISTANCE between atoms :9@H5 & :7@H1'
    AVERAGE: 6.8887 (2.7204 stddev)
    INITIAL: 4.2624 FINAL: 6.5966
    NOE SERIES: S < 2.9, M < 3.5, w < 5.0, blank otherwise.
    |SMMMMWMMWMWW W |
    NOE < 4.30 for 21.86% of the time
    NOE < 4.80 for 24.83% of the time

                < 2.5   2.5-3.5   3.5-4.5   4.5-5.5   5.5-6.5   > 6.5
    %occupied     0.7     13.1       9.2       6.2      10.0     60.7

“Long” MD (~20-100 ns): restraints progressively violated…
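The statistics above summarize one inter-proton distance over the trajectory: an NOE strength class per sample, the percentage of time under a cutoff, and a binned occupancy. A minimal sketch of those three summaries, using the class boundaries and bin edges shown in the output; the input distances are made-up illustrative values, and the per-frame classification is an assumption (the real tool may average over windows).

```python
def noe_class(distance):
    """Map a distance (Angstrom) to the S/M/W classes used in the output."""
    if distance < 2.9:
        return "S"
    if distance < 3.5:
        return "M"
    if distance < 5.0:
        return "W"
    return " "


def percent_below(distances, cutoff):
    """Percentage of samples with a distance under the cutoff."""
    if not distances:
        raise ValueError("need at least one distance")
    return 100.0 * sum(1 for d in distances if d < cutoff) / len(distances)


def occupancy(distances, edges=(2.5, 3.5, 4.5, 5.5, 6.5)):
    """Percentage of samples in each bin: below edges[0], between, above edges[-1]."""
    counts = [0] * (len(edges) + 1)
    for d in distances:
        counts[sum(1 for edge in edges if d >= edge)] += 1
    return [100.0 * c / len(distances) for c in counts]


# Hypothetical distance series, one value per saved frame.
distances = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 7.5, 8.0]
series = "".join(noe_class(d) for d in distances)
```

A restraint counts as progressively violated when the fraction of time below its cutoff keeps falling as the simulation is extended.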

[Figure: RNA secondary-structure diagram; residue labels as extracted: U8 U U A7 C A A Ψ G C A U C G U A U]

peta- to exa-scale worries…

Petascale science: scaling

It is hard to ||-ize time

MD codes scale to ~16-256 processors @ > 70% efficiency

► getting to 1000 is do-able (Bob Duke, UNC; Schulten, UIUC; DE Shaw; E. Lindahl)
► getting to 10,000 is hard (PetaApps).
► getting to 100,000: ??? (ensemble methods)

not easily with embarrassingly NOT parallel MD
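One way to see why efficiency collapses as processor counts grow is Amdahl's law, where a fixed serial fraction caps the achievable speedup. The sketch below is only an illustrative model: the serial fraction of 0.001 is a hypothetical value chosen to roughly reproduce the “> 70% at 256 processors” figure, not a measurement of any MD code, and real codes are also limited by communication costs that this model ignores.

```python
def amdahl_speedup(n_processors, serial_fraction):
    """Speedup predicted by Amdahl's law for a fixed problem size."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)


def parallel_efficiency(n_processors, serial_fraction):
    """Efficiency = speedup divided by the number of processors."""
    return amdahl_speedup(n_processors, serial_fraction) / n_processors


# Hypothetical serial fraction (an assumption, see the lead-in above).
SERIAL_FRACTION = 0.001

efficiencies = {
    n: parallel_efficiency(n, SERIAL_FRACTION)
    for n in (16, 256, 1000, 10_000, 100_000)
}
```

Under this assumed serial fraction the efficiency is about 80% at 256 processors, about 50% at 1000 and under 10% at 10,000, which is the qualitative pattern the bullets describe.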

trajin traj.1.gz
trajin traj.2.gz
trajin traj.3.gz
trajout traj.strip
center :1-10 mass origin
image origin center
center :1-20 mass origin
image origin center
rms first mass out rms.dat :1-20
distance d1 out d1.dat :1 :10
grid wat.xplor 100 0.5 100 0.5 100 0.5 :3-8,13-18
strip :WAT
average pdb average.pdb :1-20

..the standard means of analysis is breaking down…

data management & simulation workflow are limiting…


New modes of operation: ENSEMBLES
- replica-exchange
- path integral
- EVB
- ΔG simulations
- NEB / path sampling
- meta-dynamics
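As an example of why ensemble methods are loosely coupled, temperature replica-exchange only needs the replicas to communicate when they attempt a swap. A minimal sketch of the standard Metropolis swap test for two replicas, assuming energies in kcal/mol and temperatures in kelvin; the function names and the example numbers are illustrative, not taken from any particular MD code.

```python
import math
import random

BOLTZMANN_KCAL = 0.0019872041  # kcal/(mol*K)


def swap_probability(temp_i, temp_j, energy_i, energy_j):
    """Metropolis acceptance for exchanging the configurations of two replicas.

    p = min(1, exp[(beta_i - beta_j) * (E_i - E_j)]), with beta = 1 / (k_B * T).
    """
    beta_i = 1.0 / (BOLTZMANN_KCAL * temp_i)
    beta_j = 1.0 / (BOLTZMANN_KCAL * temp_j)
    exponent = (beta_i - beta_j) * (energy_i - energy_j)
    return 1.0 if exponent >= 0.0 else math.exp(exponent)


def attempt_swap(temp_i, temp_j, energy_i, energy_j, rng=random):
    """Return True if the swap is accepted for this random draw."""
    return rng.random() < swap_probability(temp_i, temp_j, energy_i, energy_j)


# Cold replica holds the higher energy: swapping is always accepted.
p_favorable = swap_probability(300.0, 320.0, -100.0, -110.0)
# Cold replica already holds the lower energy: swap accepted only sometimes.
p_unfavorable = swap_probability(300.0, 320.0, -110.0, -100.0)
```

Between swap attempts each replica runs as an independent MD job, which is what makes the workflow, rather than the individual simulation, the thing that has to scale.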

More data, more complicated workflow, …

Essentially a set of loosely coupled 16-1K processor jobs

Examples: two ΔG states, 20 windows = 40 * 20 temperatures = 800 instances

256 frames on a reaction path * 16 beads per particle * … =
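The first example's count can be reproduced by enumerating the ensemble explicitly. A minimal sketch, assuming the instances are the full cross product of states, windows and temperatures, and reading “1K” as 1024 processors; the labels are placeholders.

```python
from itertools import product


def enumerate_instances(n_states, n_windows, n_temperatures):
    """List every (state, window, temperature) combination in the ensemble."""
    return list(product(range(n_states), range(n_windows), range(n_temperatures)))


def total_processors(n_instances, processors_per_job):
    """Processors needed if every instance runs concurrently."""
    return n_instances * processors_per_job


instances = enumerate_instances(n_states=2, n_windows=20, n_temperatures=20)
smallest_footprint = total_processors(len(instances), 16)
largest_footprint = total_processors(len(instances), 1024)
```

With 800 instances, running everything at once would need between 12,800 and 819,200 processors, so each instance also becomes a separate job to schedule, monitor and collect output from.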

Petascale science: the problem will only get worse!

tetraloop receptor: 5 simulations @ 200 ns

> 1 TB of data

What if we can run 1000x longer? …or 10x bigger for 100x longer?


> 1000 TB of data

…factor of 10: OK
…factor of 100: hard
…factor of 1000: ???

…more and more time is spent moving data / managing simulations; less time spent doing science…
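The jump from 1 TB to 1000 TB follows from simple scaling, and a rough transfer-time estimate shows why moving that much data is the bottleneck. A minimal sketch, assuming trajectory size grows linearly with both system size and simulation length at a fixed save interval; the 1 Gbit/s sustained network rate is a hypothetical figure chosen only for illustration.

```python
SECONDS_PER_DAY = 86_400
BITS_PER_TERABYTE = 8 * 10**12  # decimal terabytes


def scaled_volume_tb(base_tb, size_factor, length_factor):
    """Data volume after scaling system size and simulation length."""
    return base_tb * size_factor * length_factor


def transfer_days(volume_tb, gigabits_per_second):
    """Days needed to move the data at a sustained network rate."""
    seconds = volume_tb * BITS_PER_TERABYTE / (gigabits_per_second * 10**9)
    return seconds / SECONDS_PER_DAY


longer = scaled_volume_tb(1.0, size_factor=1, length_factor=1000)
bigger_and_longer = scaled_volume_tb(1.0, size_factor=10, length_factor=100)

# Hypothetical sustained rate of 1 Gbit/s (an assumption, not a measured value).
days_now = transfer_days(1.0, 1.0)
days_then = transfer_days(longer, 1.0)
```

At the assumed rate, 1 TB moves in a couple of hours, while 1000 TB would take roughly three months of continuous transfer.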

Petascale science: the problem will only get worse!

- Do not move the data (?)
- Tiered resources
- Persistent storage
- Re-running the simulations

Solutions? Analysis “on the fly…” [ & more coarse-grained sampling ]

+ workflow tools for ensembles
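Analysis “on the fly” means updating summary statistics as each frame is produced, so the raw frames never have to be stored or moved. A minimal sketch using Welford's streaming algorithm for the mean and (population) standard deviation of one monitored value; the class name and the sample values are illustrative, and a real workflow would hook this into the simulation's output loop.

```python
import math


class RunningStats:
    """Streaming mean and standard deviation (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._sum_sq_dev = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Fold one new sample into the running statistics."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._sum_sq_dev += delta * (value - self.mean)

    @property
    def stddev(self):
        """Population standard deviation of the samples seen so far."""
        if self.count == 0:
            return 0.0
        return math.sqrt(self._sum_sq_dev / self.count)


# Feed values one at a time, as they would arrive from a running simulation.
stats = RunningStats()
for distance in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):
    stats.update(distance)
```

Only three numbers are kept per monitored quantity, regardless of how long the simulation runs; the trade-off is that anything not chosen in advance cannot be recovered without re-running.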

Petascale science: the problem will only get worse!

…what will we miss? Can we only get low hanging fruit?

Hindrances:

Codes have become “simpler” and will need to be restructured.

intra-core vs. intra-node vs. inter-node vs. cpu type

We want to retain high precision / accuracy.

We want to be able to enable new methods (with ease).

( Force fields are not yet up to the challenge!!! )

Petascale science: Worries as we move forward…

What we need (data/workflow-centric) is:

…a means to speed up & enable science…

…a means to interact with our simulations: “steer”, inspect, share, search, understand, expose (hidden correlations, meaning, data)

…a means to manage large simulation workflows…

disseminate, enable re-use

How do we make TB’s of raw data available?

- remote references to data
- partial analysis, on the fly analysis
- history, memory, or provenance
- standards (?)
- annotation
- automation – workflow!
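Several items in this list (remote references, provenance, annotation) amount to keeping a small metadata record next to each large data set instead of shipping the data itself. The sketch below shows one possible shape for such a record; the field names, the example location and the example command are all hypothetical, since the slides do not propose a specific schema or standard.

```python
import hashlib
import json


def provenance_record(data, location, command, annotations):
    """Build a JSON record describing a data set without embedding the data.

    data is the raw bytes (used only to compute a checksum and size),
    location is a remote reference, command records how it was produced.
    All field names here are illustrative assumptions.
    """
    record = {
        "sha256": hashlib.sha256(data).hexdigest(),
        "n_bytes": len(data),
        "location": location,
        "command": command,
        "annotations": annotations,
    }
    return json.dumps(record, sort_keys=True, indent=2)


# Hypothetical example values, for illustration only.
example = provenance_record(
    data=b"example trajectory bytes",
    location="archive.example.org:/project/traj.1.gz",
    command="md run, 200 ns, saved every 1 ps",
    annotations={"system": "tetraloop receptor", "status": "raw"},
)
```

A record like this is small enough to search, share and version, while the checksum lets a later user confirm that a remote copy is the data the record describes.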

My world reinforces Seidel’s CI crises. We need:

- Educated people / teams (multidisciplinary, experts)
- Software / middleware (workflow, provenance, data handling)
- Software – code optimization / parallelization / extensions
- Ease of use
- Means to analyze data, distribute data, preserve/archive data…
- More cycles, more disk space, …
- More science, less computational science

Hepatitis C virus IRES

IRES = internal ribosome entry site (translation initiation in middle of mRNA)

Why is failure important to learn about?

These methods are in wide use worldwide:

- CADD
- Structure Prediction
- Mechanisms
- Molecular association

Most people do not have 15M hour allocations

Data from failure can be reused!

~500 active NIH grants with “molecular dynamics” in abstract!

[Figure labels: office, CHPC, home]

NIH R01-GM081411-01A1
NIH R01-GM079383-01A1
NSF TG-MCA01S027