The STRING database
Michael Kuhn
EMBL Heidelberg
protein interactions
example
Tryptophan synthase beta chain
E. coli K12
curated knowledge
experimental evidence
373 genomes
(only completely sequenced genomes)
1.5 million genes
(not proteins)
model organism databases
genomic context methods
gene neighborhood
phylogenetic profiles
Cell
Cellulosomes
Cellulose
automatic inference of interactions
correct interactions
wrong associations
gene fusion
score: sequence similarity
gene neighborhood
score: sum of intergenic distances
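A minimal sketch of the gene neighborhood raw score, assuming it is the sum of the gaps between consecutive genes lying between the two genes of interest. The `Gene` type, the function name, and the toy coordinates are hypothetical, and strand information is ignored here.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int  # first base of the gene
    end: int    # last base of the gene


def intergenic_distance_sum(genes, name_a, name_b):
    """Sum the gaps between consecutive genes from name_a to name_b."""
    ordered = sorted(genes, key=lambda g: g.start)
    index = {g.name: i for i, g in enumerate(ordered)}
    i, j = sorted((index[name_a], index[name_b]))
    total = 0
    for left, right in zip(ordered[i:j], ordered[i + 1:j + 1]):
        # Overlapping genes contribute a gap of zero.
        total += max(0, right.start - left.end - 1)
    return total


# Hypothetical toy coordinates, not taken from a real genome.
toy_genome = [
    Gene("geneA", 100, 900),
    Gene("geneB", 910, 2100),
    Gene("geneC", 2400, 3700),
]

print(intergenic_distance_sum(toy_genome, "geneA", "geneB"))
print(intergenic_distance_sum(toy_genome, "geneA", "geneC"))
```

A small sum means the genes sit close together, which is the signal this method looks for.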
phylogenetic profiles
SVD
singular value decomposition (removes redundancy)
score: Euclidean distance
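A minimal sketch of the Euclidean distance between phylogenetic profiles, assuming each profile is a presence/absence vector over a handful of genomes. The SVD reduction step is left out to keep the example short, and the gene names and profiles are hypothetical.

```python
import math

# One entry per genome: 1 = gene present, 0 = gene absent (toy values).
profiles = {
    "geneA": [1, 1, 0, 1, 0],
    "geneB": [1, 1, 0, 1, 0],
    "geneC": [0, 1, 1, 0, 1],
}


def profile_distance(name_a, name_b):
    """Euclidean distance between two presence/absence profiles."""
    return math.dist(profiles[name_a], profiles[name_b])


print(profile_distance("geneA", "geneB"))  # identical profiles
print(profile_distance("geneA", "geneC"))  # profiles differing in four genomes
```

Genes that occur in the same set of genomes get a small distance and are candidates for a functional association.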
all scores are “raw scores”
not comparable
sequence similarity
sum of intergenic distances
Euclidean distance
benchmarking
calibrate against “gold standard” (KEGG)
probabilistic scores
e.g. “70% chance for an association”
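One way to picture the calibration step, as a rough sketch: bin the raw scores and, for each bin, take the fraction of scored pairs that also appear in the gold standard. The binning approach, the function name, and the toy pairs are assumptions for illustration; the actual STRING procedure may differ, for example by fitting a smooth curve.

```python
def calibrate(scored_pairs, gold_standard, bin_edges):
    """Per raw-score bin, return the fraction of pairs found in the gold standard."""
    n_bins = len(bin_edges) - 1
    hits = [0] * n_bins
    totals = [0] * n_bins
    for pair, raw in scored_pairs:
        for k in range(n_bins):
            # Bins are half-open: lower edge included, upper edge excluded.
            if bin_edges[k] <= raw < bin_edges[k + 1]:
                totals[k] += 1
                hits[k] += frozenset(pair) in gold_standard
                break
    return [h / t if t else None for h, t in zip(hits, totals)]


# Hypothetical raw scores and gold-standard pairs.
scored_pairs = [
    (("a", "b"), 0.1),
    (("a", "c"), 0.2),
    (("b", "c"), 0.8),
    (("c", "d"), 0.9),
    (("a", "d"), 0.7),
]
gold_standard = {frozenset(("a", "b")), frozenset(("b", "c")), frozenset(("c", "d"))}

print(calibrate(scored_pairs, gold_standard, [0.0, 0.5, 1.0]))
```

The per-bin fractions can then be read as probabilities, which makes scores from different methods comparable.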
curated knowledge
KEGG
Kyoto Encyclopedia of Genes and Genomes
primary experimental data
BIND
Biomolecular Interaction Network Database
GRID
General Repository for Interaction Datasets
HPRD
Human Protein Reference Database
co-expression
microarray data
GEO
Gene Expression Omnibus
correlation coefficient
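A minimal sketch of a co-expression score, assuming the correlation coefficient is Pearson's and that each gene has one expression value per microarray experiment. The expression values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Toy expression levels across four experiments.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # rises and falls together with gene_a
gene_c = [8.0, 6.0, 4.0, 2.0]   # moves in the opposite direction

print(pearson(gene_a, gene_b))
print(pearson(gene_a, gene_c))
```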
literature mining
different gene identifiers
SGD
Saccharomyces Genome Database
The Interactive Fly
OMIM
Online Mendelian Inheritance in Man
NLP
Natural Language Processing
Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
The expression of the cytochrome genes CYC1 and CYC7 is controlled by HAP1
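A toy sketch of how a name dictionary and a verb phrase could turn the example sentence into relations. This is far simpler than a real NLP pipeline; the dictionary, the cue phrase table, and the function name are hypothetical.

```python
import re

# Hypothetical name dictionary and relation cue phrases.
GENE_NAMES = {"CYC1", "CYC7", "HAP1"}
RELATION_CUES = {"is controlled by": "controls"}


def extract_relations(sentence):
    """Return (regulator, relation, target) triples found via cue phrases."""
    relations = []
    for cue, relation in RELATION_CUES.items():
        if cue not in sentence:
            continue
        before, after = sentence.split(cue, 1)
        # Passive voice: targets come before the cue, regulators after it.
        targets = [t for t in re.findall(r"[A-Z]+[0-9]+", before) if t in GENE_NAMES]
        regulators = [t for t in re.findall(r"[A-Z]+[0-9]+", after) if t in GENE_NAMES]
        relations += [(r, relation, t) for r in regulators for t in targets]
    return relations


sentence = "The expression of the cytochrome genes CYC1 and CYC7 is controlled by HAP1"
print(extract_relations(sentence))
```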
calibrate against gold standard
combine all evidence
Bayesian scoring scheme
e.g.: two scores of 0.7
combined probability: ?
combined probability: 0.91
1 - (1 - 0.7)^2 = 0.91
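The same arithmetic for any number of evidence channels, as a minimal sketch that assumes the channels are independent. The function name is hypothetical, and any further corrections STRING may apply are not modeled here.

```python
import math


def combine_scores(scores):
    """Probability that at least one independent evidence channel is right."""
    return 1.0 - math.prod(1.0 - s for s in scores)


print(round(combine_scores([0.7, 0.7]), 2))  # the two-score example
print(round(combine_scores([0.7, 0.7, 0.5]), 3))
```

Each additional piece of evidence can only raise the combined score, never lower it.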
evidence transfer
evidence spread over many species
transfer by orthology
(or “fuzzy orthology”)
von Mering et al., Nucleic Acids Research, 2005
higher coverage, lower specificity
includes all available evidence
some orthologous groups are too large to be meaningful
von Mering et al., Nucleic Acids Research, 2005
maximum specificity, lower coverage
information will be relevant for selected species
take home message
STRING integrates information and predicts interactions
You can always go to the sources
Proteins mode: specific species
COG mode: more coverage, especially for prokaryotic genes
Acknowledgements
The STRING team
Lars Jensen
Peer Bork
Christian von Mering & group in Zurich
Berend Snel
Martijn Huynen
Thank you for your attention
Exercises: tinyurl.com/36twzq
(or via course wiki)
Alternative server: xi.embl.de
Bork et al., Current Opinion in Structural Biology, 2004