Post on 22-Dec-2015
The STRING Database
What it does and how it interfaces to other resources
Christian von Mering, University of Zurich & SIB
bigDATA Workshop
- viewers for all types of evidence
- focus on usability and speed
- integrated scoring scheme
- information transfer between species
Genomic Neighborhood
Genes/Species Co-occurrence
Gene Fusions
Database Imports
Exp. Interaction Data
Co-expression
Literature co-occurrence
STRING http://string-db.org/
Numbers:
• 630 organisms
• 2.6 million proteins
• 88 million interactions
• server footprint: 320 GB
networks
Phylogenetic Profiles
Conserved Neighborhood
Gene-Fusions
quantify …
integrate …
Interaction prediction from genome information
“genomic context”
Other Interaction Sources
Interaction Databases
Pathway Databases
Reactome
Automated Textmining
Interolog Transfer
final interaction score: protein A – protein B: 0.856
between 0 and 1, pseudoprobability, “likelihood of functional association”
final score = 1 – (1 – nscore) * (1 – fscore) * (1 – pscore) * (1 – cscore) * (1 – escore) * (1 – tscore)
where nscore = neighborhood, fscore = fusion, pscore = co-occurrence, cscore = co-expression, escore = experimental, tscore = textmining
evidence transfer between species
nscore = 1 – (1 – nscore_{query species}) * (1 – nscore_{transf.})
information transfer between species either via orthologs (COG database) or via homology
analogous for cscore, escore, tscore, ...
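The combination rule on this slide can be sketched in a few lines of Python. This is only an illustration of the formula as written: the function names (`combine`, `channel_score`, `final_score`) and the example values are hypothetical, and any additional corrections the real pipeline may apply are not modelled.

```python
def combine(scores):
    """Combine scores as 1 - product of (1 - s), as in the slide's formula."""
    remaining = 1.0
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of range [0, 1]: {s}")
        remaining *= 1.0 - s
    return 1.0 - remaining


def channel_score(query_species_score, transferred_score):
    """Per-channel score: evidence in the query species plus transferred evidence."""
    return combine([query_species_score, transferred_score])


def final_score(channels):
    """Final interaction score from a dict of per-channel scores."""
    return combine(channels.values())


# Hypothetical example values, chosen only for illustration.
example = {
    "neighborhood": channel_score(0.30, 0.20),
    "fusion": 0.0,
    "cooccurrence": 0.10,
    "coexpression": 0.25,
    "experimental": 0.40,
    "textmining": 0.50,
}
print(round(final_score(example), 3))
```

A channel with score 0 leaves the result unchanged, and a single channel close to 1 dominates the final score, which is the behaviour the pseudoprobability reading suggests.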
benchmarking
[Figure: benchmark plot; x-axis: raw score; y-axis: KEGG performance (fraction on same map)]
Example - Neighborhood raw score:
each predictor has its own raw-score regime
gene A gene B
100 bp 6 bp 20 bp
raw score: sum of intergenic distances
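One reading of the diagram above is that the intergenic gaps along the run of genes from gene A to gene B (100 bp, 6 bp and 20 bp) are added up. A minimal sketch under that assumption follows; the coordinate convention (1-based, inclusive) and the function name `neighborhood_raw_score` are hypothetical and not taken from the slides.

```python
def neighborhood_raw_score(genes):
    """Sum the intergenic distances along a run of genes.

    `genes` is a list of (start, end) coordinates on one contig,
    1-based and inclusive (an assumed convention). Overlapping
    neighbours contribute a distance of 0.
    """
    ordered = sorted(genes)
    total = 0
    for (_, end_prev), (start_next, _) in zip(ordered, ordered[1:]):
        gap = start_next - end_prev - 1
        total += max(gap, 0)
    return total


# Four genes separated by gaps of 100 bp, 6 bp and 20 bp, as in the diagram.
run = [(1, 1000), (1101, 2000), (2007, 3000), (3021, 4000)]
print(neighborhood_raw_score(run))
```

Under this reading a smaller raw score means a tighter gene cluster, so the benchmark has to map low raw scores to high confidence.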
The scoring system
The raw score regimes
Neighborhood
gene A gene B
100 bp 6 bp 20 bp
raw score: sum of intergenic distances
Phylogenetic profiles
• “similarity profiles”
• singular value decomposition
raw score: Euclidean distance
filter: downweight scores for homologous pairs
Fusion
raw score: constant (0.99)
Experimental interactions
• two-hybrid, TAP, annotated complexes, …
• topology-based analysis: who with whom, how many other partners?
raw score: various (usually ‘uniqueness’ of interaction).
Co-expression
• download all microarray datasets for a given species
• data normalization (spatial correction)
raw score: pairwise Pearson correlation coefficient
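The co-expression raw score can be illustrated with a plain Pearson correlation over expression profiles. This is a sketch only: the download and normalization steps are not shown, and the gene names and values below are made up for the example.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


def pairwise_coexpression(profiles):
    """Raw co-expression score for every unordered gene pair."""
    return {
        (g1, g2): pearson(profiles[g1], profiles[g2])
        for g1, g2 in combinations(sorted(profiles), 2)
    }


# Hypothetical normalized expression values across four arrays.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
scores = pairwise_coexpression(profiles)
```

Here `geneA` and `geneB` are perfectly correlated and `geneA` and `geneC` perfectly anti-correlated; for a whole genome the all-against-all loop would normally be replaced by a matrix operation.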
Textmining
• download all PubMed abstracts
• identify proteins in the abstracts
• search for co-mentioned pairs
raw score: log-odds score
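The slide does not spell out the log-odds formula. A common form compares the observed number of co-mentions with the number expected if the two proteins were mentioned independently; the sketch below assumes that form, and the function name, the pseudocount and the counts are all hypothetical.

```python
import math


def comention_log_odds(n_a, n_b, n_ab, n_total, pseudocount=0.5):
    """Log-odds of observed vs. expected co-mentions (assumed form).

    n_a, n_b : abstracts mentioning protein A / protein B
    n_ab     : abstracts mentioning both
    n_total  : all abstracts searched
    The pseudocount keeps the score finite when n_ab is 0.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    expected = n_a * n_b / n_total
    return math.log((n_ab + pseudocount) / (expected + pseudocount))


# Hypothetical counts: 20 co-mentions where about 0.5 would be expected by chance.
score = comention_log_odds(n_a=100, n_b=50, n_ab=20, n_total=10_000)
```

A score near 0 means the pair is co-mentioned about as often as chance predicts; positive values indicate enrichment.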
User-Experience: Aiming to be Visual and Intuitive
1’000 visits / day
800 users / day
9’000 pageviews / day
> 10’000 DB-queries / day
Citations
2000 NAR Snel et al.: 80 citations
2003 NAR von Mering et al.: 215 citations
2005 NAR von Mering et al.: 183 citations
2007 NAR von Mering et al.: 189 citations
2009 NAR Jensen et al.: 47 citations
total: 714 citations
Cross-links
SMART: protein domain information
GENECARDS: info and products on human genes
SWISS-MODEL-REPOSITORY: homology models
CYTOSCAPE: access via plug-in architecture
SWISSPROT / UNIPROT: expert protein annotation
Cross-link example
launchSwissModel
Reciprocal View
popup: launchSTRING
Example #1
A missing chaperone for Cytochrome C oxidase
Question: who inserts the copper atom into CcO?
Initial observation:
Example #1
The missing chaperone for Cytochrome C oxidase
• gene expressed
• structure solved
• it binds copper!
• likely function: copper delivery
Example #2
Simplify discovery in genome-wide association screens ?
Christian von Mering – UZH MolBio – SIB
In-House Use of STRING
a) download data in relational database schema
b) download data as compact flat-files
c) in-house installation of webserver
d) cross-link to server (version controlled, to network, protein, link, ...)
e) PSI-MI export
f) [ SOAP / webservices ]
Core organisms:
• include all model organisms (annotated knowledge)
• non-redundant, each genus is covered
• include organisms with functional genomics data
Irrelevant Organisms
[future category]
Version 9.0 – exceeding 1000 genomes
More details & new features
“Payload Display” - Your Own STRING Server
=> “branding” STRING via remote-control: a call-back API
Acknowledgements
The STRING team:
Samuel Chaffron
Manuel Weiss
Michael Kuhn
Lars Juhl Jensen
Sean Hooper
Berend Snel
Martijn Huynen
Peer Bork
The STRING institutions:
SIB – Swiss Institute of Bioinformatics
University of Zurich
TU-Dresden
University of Copenhagen
European Molecular Biology Laboratory
“MySTRING”
users can register / login
using OpenID or similar for authentication
persistence of search results (“history”)
store lists / items of interest (“bag of genes”)
users can customize the interface
generate revenue (?)
Feature #2 (Finding Relevant Texts)
Example #2
The missing enzymes for uric acid degradation
Question: why can’t humans degrade uric acid?
initial observation:
Example #2
The missing enzymes for uric acid degradation
• genes cloned, expressed
• enzymatic activity demonstrated
• candidate short-term therapeutics!