In collaboration with BioPilot: Data-Intensive Computing for Complex Biological Systems Pacific...
-
Upload
betty-gardner -
Category
Documents
-
view
222 -
download
0
Transcript of In collaboration with BioPilot: Data-Intensive Computing for Complex Biological Systems Pacific...
In collaboration with
BioPilot: Data-Intensive Computingfor Complex Biological Systems
Pacific Northwest National LaboratoryOak Ridge National Laboratory
Sponsored bythe Department of Energy
Office of Science
Program Managers:Drs. Gary Johnson and Anil Dean
2 Samatova_BioPilot_0611
Goals for enabling data intensivecomputing in biology
Biological research is becoming a high-throughput, data-intensive science, requiring development and evaluation of new methods and algorithms for
large-scale computational analysis, modeling, and simulation
-80 -60 -40 -20 0 20 40 60 800
0.5
1
1.5
2
2.5
3
3.5
4
4.5
5
Figure 1
Rel
ativ
e F
req
uen
cy o
f A
ppe
ara
nce
Fragment Ion Mass Offset
-80 -60 -40 -20 0 20 40 60 800
0.5
1
1.5
2
2.5
3
3.5
4
4.5
5
Figure 1
Rel
ativ
e F
req
uen
cy o
f A
ppe
ara
nce
Fragment Ion Mass Offset
Computational algorithms for proteomics Improve the efficiency and reliability of protein identification and
quantification. Peptide identification using MS/MS using pre-computed spectral
databases Combinatorial peptide search with mutations, post- translational
modifications and cross-linked constructs.
Inference, modeling, and simulation of biological networks
Reconstruction of cellular network topologies using high-throughput biological data
Stochastic and differential equation-based simulations of the dynamics of cellular networks.
Integration of bioinformatics and biomolecular modeling and simulation
Structure, function, and dynamics of complex biomolecular system in appropriate environments.
Comparative analysis of large-scale molecular simulation trajectories Event recognition in biomolecular simulations Multi-level modeling of protein models and their conformational space.
3 Samatova_BioPilot_0611
Comparative molecular trajectoryanalysis with BioSimGrid
Routine protein simulations generate TB trajectories
Routine analysis tools (ED) require multiple passes
Correlation analyses have random access patterns
Comparative analyses require multiple/distributed trajectory access
Comparative Trajectory Analysis Tools
ENERGY
GRADIENT
OPTIMIZE
DYNAMICS
THERMODYNAMICS
QMD
QM/MM
ET
QHOP
INPUT
PROPERTY
PREPARE
ANALYZE
ESP
VIB
Classical Force Field
DFT
SCF: RHF UHF ROHF
MP2: RHF UHF
MP3: RHF UHF
MP4: RHF UHF
RI-MP2
CCSD(T): RHF
CASSCF/GVB
MCSCF
MR-CI-PT
CI: Columbus Full Selected
Integral API
Geometry
Basis Sets
PEigS
pFFT
LAPACK
BLAS
MA
Global Arrays
ecce
ChemIO
NWCHM
The analysis of molecular dynamicsTrajectories presents a large,data-intensive analysis problem:
The Scientific Challenge:
Integrated molecular simulationand comparative moleculartrajectory analysis:
Enzymatic reaction mechanisms Molecular machines Molecular basis of signal and
material transport
T.P.Straatsma, Computational Biology and Bioinformatics
4 Samatova_BioPilot_0611
d
Building accurate protein structures Homology models can be built from related protein structures, but even
with proper alignment, usually have about 4A RMS error–too large to permit meaningful computational simulation/molecular dynamics.Goal to reduce the errors in initial models
We multi-align the group of neighbors using sequence and structure information to find the most stable parts of the domain. Extracted stable core structure is 2030% closer in terms of RMSD to the target than to any of the original templates
More flexible parts of target are modeled locally by choosing most sequentially similar loops from the library of local segments found in all the homologues
A genetic algorithm conformation search strategy using Cray XT3 uses backbone and sidechain angles as parameters.Each GA step is evaluated by minimization
A computed template compared to the target, improved from 3.4A initially to 2.3A
E. Uberbacher, Biological and Environmental Sciences Division
5 Samatova_BioPilot_0611
High-resolution protein docking High-throughput protein interaction data generation
provides many interacting proteins. However, the best docking programs (Glide, GOLD) recognize proper docking site and position between known docking partners a small fraction of the time
We take advantage of a reduced search space for contacting protein surfaces but sample all relative conformations on a grid
Surface normals are anti-aligned with those on the other protein surface in the docking procedure. Complexity of the search is r * N1 * N2 with steric clashes filtered
Bayesian scoring evaluates each configuration based on specific residue pair distances at each interface: Score=sum of Log ratios for interface/decoys contact frequencies over all contacts in the patch
Scoring scheme is effective at separating false configurations from true. It recognizes 1/3 of correct configurations as the top choice and a second third
within the top three choices.
Surface normals for human Ras-related protein Rap-1A [PDB:1C1Y, shown in CPK (orange) and RAF kinase (cyan)]
Blue curve decoys, yellow curve true
A. Gorin, Computer Science and Mathematics Division
6 Samatova_BioPilot_0611
Predictive proteome analysis using advanced computation and physical modelsScientific Challenge:Data sizes: 30,000 to 200,000 spectra from a single experiment to becompared with as many as millions of theoretical spectra
Computational tools identify only 1020%of spectra from proteomics analysis:
Fragmentation patterns are highly sequence-specific
Generic patterns result in low identification rate
Physical models not used in current analyses becauseof computational expense
W.R.Cannon, Computational Biology and Bioinformatics
NH3-HC-CO- NH-HC-CO- NH-HC-CO- NH-HC-COOHR R R R
7 Samatova_BioPilot_0611
In-depth analysis of MS proteomicsdata flowsMass-spectrometry-based proteomics is one of the richest sources of the biological information, but the data flows are enormous (100,000s of samples each consisting of 100,000s of spectra)
Computationally, both advanced graph algorithms and memory-based indexes (> 1 TB are required forin-depth analysis of the spectra)
Two principally new capabilities demonstrated • Quantity of identified peptides is 100% increased • Among highly reliable identifications, increase is many fold
A. Gorin, Computer Science and Mathematics Division
8 Samatova_BioPilot_0611
ScalaBLAST
Standalone BLAST application would take >1 year for IMG (>1.6 million proteins) vs IMG and >3 years for IMG vs nr
“PERL script” approach breaks even modest clusters becauseof poor memory management
Scientific Challenge:
Good scaling on both shared and distributed memory architectures
C.S.Oehmen, Computational Biology and Bioinformatics
0 150
sca
lin
g f
acto
r
number of processors10050
0
40
80
120
MPP2
ALTIX
HTCBLAST
0 1000
Qu
erie
s p
er (
pro
c*m
inu
te)
number of processors10050
0
3
4
6
10000
5
2
1work factor
ideal
9 Samatova_BioPilot_0611
To model microbial communities, the smallest realistic model would include > 109 cells x 100 reactions ~ 1011 reactions
Currently can handle ~ 104106 reactions on a single-processor machine
Scientific Challenge:
Stochastic simulations of biological networks
Speedups ~40-200
J. Phys. Chem. B 105, 11026, 2001
Speedup over exact Gillespie SSA ~25
Probability-weighted dynamic Monte Carlo method
Multinomial Tau-leaping method
H. Resat, Computational Biology and Bioinformatics
1 4
number of phosphorylation binding site
32 5 876 9 1211100
250
500
750
1000
1250
1500
ave
rag
e ru
nn
ing
ti m
e ( s
eco
nd
s)
10 Samatova_BioPilot_0611
Uncovering mechanismsof transcriptional controlStochastic fluctuations in molecular populations have consequences for gene regulation. Our theoretical and simulation studies predicted that noise frequency content is determined by underlying gene circuits, leading to mapping between gene circuit structure and noise frequency range. Predictions are validated experimentally
Model of shift of frequency range distribution shape due to negative feedback. Red bars (e) represent unregulated circuit distribution;blue bars represent distributionfor circuit with negative auto-regulation
Dashed box and arrow show shiftof the low-frequency trajectoriesto center of distribution whilehigher-frequency trajectoriesare unaffected
2006 Feb 2; 439(7076):608
N. Samatova, Computer Science and Mathematics Division
11 Samatova_BioPilot_0611
Contacts
Dr. Gary JohnsonActing Program Manager(301) 903-5800e-mail: [email protected]
11 Presenter_date
Dr. Anil Dean Program Manager (301) 903-1465
e-mail: [email protected]