Data Marts Integrate the Proteome
Jay Vyas
The Information Content of the Proteome
1) cdc2+, cyclinB+, Mitosis
2) cdc2-, Arrest
3) cdc2 Binds Importin alpha/beta.…
Knowledge
Information
Data
Evolution of a Relational Proteome
Timeline figure (1965 to 2005): Insulin, Atlas, Needleman-Wunsch, PDB, Smith-Waterman; NEWAT, PDGF-VSIS…, SWISSPROT, NCBI, SCOP, HGP, REFSEQ, Protein Domains
http://bytesizebio.net/
http://www.dna.affrc.go.jp/growth/images/P-grwth-entrs.gif
PLoS Comput Biol. 2006 Aug 25;2(8):e114. Epub 2006 Jul 14.
Genome Res. 2008 March; 18(3): 449–461. doi: 10.1101/gr.6943508.
http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=fold-scop
Data vs. Knowledge
Structures/Functions
Sequences
Data > Information
An Integrated Framework for building Molecular Biological Data Marts
Putting the model to use …
Data Marts: Targeted Integration
Flat Data Repositories
structure
function
sequence
taxonomy
A Family of Data Driven Molecular Biology Tools
Integration of structure calculation via NMR: hybrid methods, iterative processing, reproducibility.
spectra, sequence, chemical shifts -> structure
Automated detection of signaling/binding motifs in a candidate protein.
protein sequence -> biological activity
Filtration of “passenger” residues from specificity/functional residues on surfaces of protein structures.
sequence + structure -> function
“Multidimensional” Sequence Comparison
sequence + taxonomy -> evolution
Sequence + Spectrum -> Structure
CONNJUR WB integrates format conversion, data inspection, and integrative processing . . . .
RNMRTK
NMRPipe
CONNJUR WB
J Bio. NMR, 2011
Detection of functional subunits in proteins
TREMBL-SwissProt
SwissProt vs Uniprot vs TREMBL
Machine Learning
Spearmint (+)
(Nuclear) Bacterial Proteins
Xanthippe (-)
Snake proteins (can’t bind ATP)
Domain databases
Bioinformatics. 2004 Aug 4;20 Suppl 1:i342-7.
http://pir.georgetown.edu/pirwww/about/doc/tutorials/uniprot_struc.gif
Bioinformatics (2001) 17 (10): 920-926
• How can we increase the size of our functional motif database without increasing the number of false positives predicted?
MinimotifMiner – a tool for predicting protein function via Short Sequence Peptide Motifs
3000+ estimated motif publications / yr…
Variant Pubmed searches ‘x’
(("amino acid motif"[TIAB]) OR ("protein motif"[TIAB])) AND ("<x>"[PDAT] : "<x>"[PDAT])
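The per-year search can be sketched as a small query builder. This is only an illustration of how the template above might be filled in for each year: the function names `motif_query` and `yearly_queries` are hypothetical, the explicit grouping of the two OR terms is an assumption about the intended precedence, and no request is sent to PubMed.

```python
def motif_query(year: int) -> str:
    """Fill the <x> placeholder of the query template with a single year."""
    # Title/abstract terms, grouped so that the date filter applies to both.
    terms = '("amino acid motif"[TIAB]) OR ("protein motif"[TIAB])'
    # Publication-date range restricted to one year.
    date_range = f'("{year}"[PDAT] : "{year}"[PDAT])'
    return f"({terms}) AND {date_range}"


def yearly_queries(start: int, end: int) -> dict[int, str]:
    """Return one query string per year in the inclusive range start..end."""
    return {year: motif_query(year) for year in range(start, end + 1)}


if __name__ == "__main__":
    for year, query in yearly_queries(2008, 2010).items():
        print(year, query)
```

Each string could then be submitted by hand or through whatever search client is preferred; counting the hits per year would give the publication curve.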
Plot: estimated motif publications per year (y axis, 0 to 4500) against publication year (x axis, 1975 to 2015).
Relational Model of Functional Data - A Precise Model of Protein Functional Semantics.
BMC Genomics, 2009
RMSD = .9
NCBI_FEDERATED + Mimosa
BMC Genomics , 2009
A Peptide Annotation Pipeline
BMC Bioinformatics 2010, 11:328
Further (GO) integration controls for the degenerate nature of motif searches
PLOS One, 2010
~400
~400
~900
Short sequences are degenerate
… Can they be merged with structural and evolutionary information?
Chemistry & Biology, January 2000
BMC Genomics, 2009
Venn : An Integrated Application For Database Driven Homology Threading of Protein Structures ….
Nucleic Acids Research, 2009
Trends in Plant Science, 2010
VENN : "Twilight Zone" Sequence Homology Threading
NAR, 2009
Left to right …
1AZG (Human FYN) PRPLPVAP LYYGDWIPSNY
1AVZ (Human FYN) TPQVPL YD … GDWPSNY
1PRL (Chicken FYN) APPLPR YD ... WPNY (not shown)
1H3H (Mouse GRB2) SRSTK ENPSWWTLPANY
VENN-InterfaceMiner : How do different SH3 binding peptides functionally relate to one another ?
Standard BLAST Searches
SSPEs reside in the “Twilight Zone”
J. Bacteriology 2011
What happens when a sequence is inherently noisy ?
max: 100-250
e-value: 10E-3 ...
word size: 3-5
score matrix: 80, 62, 30
gap?: 0, 4
Q/N?
manskysktdvqqvkrqnqqsasgqgqygtefgsetdaqqvrkqnqsaeqnkqqns
Sequence mining in 2D
Use a hypersensitive sequence search (+), and expand results into a 2nd dimension (-).
Combine with taxonomical information to pinpoint a first estimate of the gene’s appearance.
J. Bacteriology 2011
R3: A prototypical method for improved structure calculation.
R3: Convergence is generally improved by reseeding
Availability
Sequence , Structure
Sequence , Function
Structure Sequence Taxonomy Function , Specificity
Sequence Taxonomy , Evolution
www.connjur.org
mnm.engr.uconn.edu
venn.vcell.uchc.edu
www.bio-toolkit.com
RMSD = .9
NCBI_FEDERATED + EXPERT SYSTEM
BMC Genomics , 2009
VENN : Fine grained analysis.
Nuc. Acids Research, 2009
Residue enrichment profiles.
NCBI_FEDERATED : Taxonomy, Domain, Homologene & Refseq.
VENN: Fine-grained analysis of SH3-bound peptides reveals a similar interface for divergent sequences. Are the peptides similar to ?
Left to right …
1AZG (Human FYN) PRPLPVAP LYYGDWIPSNY
1AVZ (Human FYN) TPQVPL YD … GDWPSNY
1PRL (Chicken FYN) APPLPR YD ... WPNY
1H3H (Mouse GRB2) SRSTK ENPSWWTLPANY
Solution: Use a hypersensitive sequence search, and expand results into a 2nd dimension.
Combining this with taxonomical information pinpoints a first estimate of the gene’s appearance.
Gene Duplication, Domain Reuse, Functional Motifs, and Variance of Structural Specificity
- "Twilight Zone" homologies
- Structural Interfaces
- Binding Specificity
- Short Functional Motifs
Vertebrates appear to have arranged pre-existing components into a richer collection of domain architectures. Nature 2001
Doolittle
* Functional Protein Bioinformatics - CDD, MnM, Modular evolution of Proteins
* Database Normalization - "Archival" -> low S/N ; unrepresentative
* Protein-centric sequence searching - Rous Sarcoma Discovery (DNA, lost in translation)
***** All done before modern computing/database theory.
The Modern Age
GenBank - archival
NCBI / EBI - sequence data curation
PDB/BMRB - structural data curation, deposition
GO - functional annotations
What is data modelling ?
- Ambiguity vs. Vagueness
- "Text" vs "Syntax"
- Biological Data : No clear "reference object". Solution : CONTEXT
Integration Strategies
Architectures: Database Federation, Data Warehousing, Data Marts
When To Federate ?
* New Genomes... Draft sequences.
* Reproducibility is less important than insight.
Stark et al.
Control of the G2/M Transition 2006
Problem: Hundreds of native peptides contain subsequences that are predicted to bind SH3 domains. For example, [KR]..[KR] and P..P are known to interact with SH3 domains. But there is no method for comparing the structural binding mechanisms behind these variant peptides. Such a method is needed because the human genome encodes hundreds of SH3 domains, and the Protein Data Bank holds several diverse structures of them, which cannot be collectively analyzed by eye.
Results
Left to right …
1AZG (Human FYN) PRPLPVAP binds LYYGDWIPSNY
1AVZ (Human FYN) TPQVPL binds YD … GDWPSNY
1PRL (Chicken FYN) APPLPR binds YD ... WPNY
1H3H (Mouse GRB2) SRSTK binds ENPSWWTLPANY
Solution: Use the VENN program for homology titration to extract molecular interfaces from SH3 bound peptides.
1) For each atom “a1” in each peptide chain of a structure:
     For each atom “a2” in a DIFFERENT chain of the same structure:
       Is “a1” close to “a2”?
       If yes, store a1, a2.
       If no, keep going.
2) Now create a “synthetic structure” that contains only the residues associated with atoms stored in step (1) and ignores covalent peptide bonds entirely. This structure represents a molecular interface; all non-interacting residues are treated as “extraneous noise”.
3) To test the biological relevance of the molecular interface, apply it to varying species: is the same signature generated from different structures?
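Steps (1) and (2) can be sketched in a few lines of Python. This is an illustrative reading of the procedure, not VENN's actual code: the `Atom` record, the `interface_residues` function, and the 4.0 Å cutoff used for "close" are all assumptions, and the sketch handles one peptide chain at a time.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Atom:
    chain: str                       # chain identifier, e.g. "A"
    residue: str                     # residue name, e.g. "TRP"
    res_num: int                     # residue number within the chain
    xyz: tuple[float, float, float]  # coordinates in angstroms


def interface_residues(atoms, peptide_chain, cutoff=4.0):
    """Return the sorted residues that take part in a peptide/partner contact."""
    # Step 1: store every pair (a1, a2) where a1 is in the peptide chain,
    # a2 is in a different chain, and the two atoms lie within the cutoff.
    contacts = []
    for a1 in atoms:
        if a1.chain != peptide_chain:
            continue
        for a2 in atoms:
            if a2.chain == a1.chain:
                continue
            if dist(a1.xyz, a2.xyz) <= cutoff:
                contacts.append((a1, a2))
    # Step 2: the "synthetic structure" is the set of residues owning a stored
    # atom; covalent connectivity is ignored entirely.
    residues = {(a.chain, a.res_num, a.residue) for pair in contacts for a in pair}
    return sorted(residues)


if __name__ == "__main__":
    atoms = [
        Atom("P", "PRO", 1, (0.0, 0.0, 0.0)),
        Atom("P", "LEU", 2, (30.0, 0.0, 0.0)),
        Atom("A", "TRP", 10, (3.0, 0.0, 0.0)),
        Atom("A", "GLY", 50, (20.0, 0.0, 0.0)),
    ]
    print(interface_residues(atoms, peptide_chain="P"))
```

For step (3), the residue names returned for the domain side of each structure could be compared across species to see whether the same signature recurs. The double loop is quadratic in the number of atoms, which is acceptable for a single complex but would call for a spatial index on larger sets.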
Conclusion: Although the W/P/N/Y residues in SH3 domains are far apart and variably spaced in sequence, they may have evolved to share a common feature: conformance to a highly specific molecular interface.
Mouse GRB2 / Human FYN are completely different domains, in different species, which bind different peptides …. Yet surprisingly, their binding sites conform to the same interface.
Venn is available at http://sbtools.uchc.edu/venn.
Orthologous Homology Threading : Coarse-Grained Function . . .
Do canonical binding motifs in proteins exhibit structural specificity even when unbound?
8000 distinct pdb chains (out of 35000 total structures).
• SH3 Bound non PXXPs
  o 1AZG PLPV 137
  o 1AZG PRPL 107
  o 1PRL PPLP 150
  o 1PRL PLPR 154
• Non SH3 complexed PXXPs
  o 2DJY PPPP 89
  o 1WA7 PGMP 111
• Non SH3 bound, non PXXP
  o 2ORU PATG 817
History
Human Genome - 2001
SCOP - 1994
SwissProt/NCBI - 1986/1988
NEWAT - 1981
PDGF ~ v sis - 1983
Smith-Waterman - 1981
PDB - 1973
Needleman-Wunsch - 1970
ATLAS - 1965
Insulin Sequence - 1955
Double Helix - 1953