The STRING database Michael Kuhn EMBL Heidelberg.

The STRING database

Michael Kuhn, EMBL Heidelberg

protein interactions

example

Tryptophan synthase beta chain, E. coli K12

many sources

genomic context

curated knowledge

experimental evidence

literature

373 genomes

(only completely sequenced genomes)

1.5 million genes

(not proteins)

Genome Reviews

RefSeq

Ensembl

model organism databases

data integration

genomic context methods

gene fusion

gene neighborhood

phylogenetic profiles

Cell

Cellulosomes

Cellulose

automatic inference of interactions

correct interactions

wrong associations

gene fusion

score: sequence similarity

gene neighborhood

score: sum of intergenic distances
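
A minimal sketch of the neighborhood raw score, assuming each genome supplies (start, end) coordinates for both genes on the same replicon. The function names and data layout are hypothetical, and the coordinates are made up; the slide only states that the score is a sum of intergenic distances, so the exact STRING formula may differ.

```python
def intergenic_distance(gene_a, gene_b):
    """Gap in base pairs between two genes given as (start, end) tuples.

    Overlapping or nested genes get a distance of 0.
    """
    (_, end_first), (start_second, _) = sorted([gene_a, gene_b])
    return max(0, start_second - end_first)


def neighborhood_raw_score(positions_by_genome):
    """Sum the intergenic distances of one gene pair over all genomes.

    positions_by_genome maps a genome name to a pair of (start, end) tuples.
    A smaller sum means the two genes stay close together across genomes.
    """
    return sum(intergenic_distance(a, b) for a, b in positions_by_genome.values())


# Made-up coordinates, for illustration only.
example = {
    "genome_1": ((100, 900), (950, 1800)),
    "genome_2": ((5000, 5800), (4200, 4990)),
}
print(neighborhood_raw_score(example))
```

Because a smaller sum indicates a stronger signal, this raw score runs in the opposite direction to a similarity score.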

phylogenetic profiles

SVD

singular value decomposition (removes redundancy)

score: Euclidean distance
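
A minimal sketch of the distance step only, assuming each gene is represented by a presence/absence vector over the same ordered list of genomes. The SVD step is deliberately skipped here, so the distance is computed on the raw profiles; the function name and the data are hypothetical.

```python
import math


def profile_distance(profile_a, profile_b):
    """Euclidean distance between two equally long phylogenetic profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))


# 1 = gene present in that genome, 0 = absent (made-up data).
gene_x = [1, 1, 0, 0, 1]
gene_y = [1, 1, 0, 0, 1]
gene_z = [0, 1, 1, 0, 0]

print(profile_distance(gene_x, gene_y))  # identical profiles
print(profile_distance(gene_x, gene_z))  # profiles differ in three genomes
```

Genes with similar profiles end up at a small distance, which is read as a hint that they are functionally associated.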

all scores are “raw scores”

not comparable

sequence similarity

sum of intergenic distances

Euclidean distance

benchmarking

calibrate against “gold standard” (KEGG)

raw scores

probabilistic scores

e.g. “70% chance for an association”
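
One simple way to picture the calibration is to bin the raw scores and ask, per bin, what fraction of gene pairs the gold standard confirms. The sketch below assumes that approach; the function name, the bin edges and the benchmark pairs are all hypothetical, and the real procedure may fit a smooth curve rather than use bins.

```python
def calibrate(scored_pairs, bin_edges):
    """Map raw-score bins to the fraction of gold-standard positives.

    scored_pairs: list of (raw_score, is_positive) tuples, where is_positive
        says whether the gold standard places both genes in the same pathway.
    bin_edges: ascending bin boundaries, e.g. [0, 10, 20].
    Returns one probability per bin (None for empty bins).
    """
    n_bins = len(bin_edges) - 1
    totals = [0] * n_bins
    positives = [0] * n_bins
    for raw, is_positive in scored_pairs:
        for i in range(n_bins):
            if bin_edges[i] <= raw < bin_edges[i + 1]:
                totals[i] += 1
                positives[i] += int(is_positive)
                break
    return [p / t if t else None for p, t in zip(positives, totals)]


# Made-up benchmark: (raw score, confirmed by gold standard?)
pairs = [(1, False), (2, False), (3, True),
         (12, True), (15, True), (18, False), (19, True)]
print(calibrate(pairs, [0, 10, 20]))
```

After this step, scores from different methods live on the same probability scale and can be compared.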

curated knowledge

KEGG

Kyoto Encyclopedia of Genes and Genomes

Reactome

GO

Gene Ontology

primary experimental data

many sources

many parsers

BIND

Biomolecular Interaction Network Database

GRID

General Repository for Interaction Datasets

HPRD

Human Protein Reference Database

co-expression

microarray data

GEO

Gene Expression Omnibus

correlation coefficient
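
A minimal sketch of a co-expression raw score, assuming the Pearson correlation coefficient is meant (the slide does not say which coefficient is used). The function name and the expression values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two expression profiles of equal length."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two profiles of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# Made-up expression levels across four microarray experiments.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # rises together with gene_a
gene_c = [4.0, 3.0, 2.0, 1.0]   # falls while gene_a rises

print(pearson(gene_a, gene_b))
print(pearson(gene_a, gene_c))
```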

literature mining

different gene identifiers

synonyms list

Medline

SGD

Saccharomyces Genome Database

The Interactive Fly

OMIM

Online Mendelian Inheritance in Man

simple scheme

co-mentioning
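
A minimal sketch of co-mention counting, assuming a synonyms list that maps every name or alias to one canonical identifier. The function name, the toy synonym table and the example abstracts are illustrative and not taken from Medline.

```python
import re
from collections import Counter
from itertools import combinations


def comention_counts(abstracts, synonyms):
    """Count, for every gene pair, the abstracts that mention both genes.

    synonyms maps each lowercase name or alias to a canonical identifier.
    """
    counts = Counter()
    for text in abstracts:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        genes = {synonyms[t] for t in tokens if t in synonyms}
        for pair in combinations(sorted(genes), 2):
            counts[pair] += 1
    return counts


# Toy synonym table and abstracts, for illustration only.
synonyms = {"cyc1": "CYC1", "cyc7": "CYC7", "hap1": "HAP1", "cyp1": "HAP1"}
abstracts = [
    "The expression of CYC1 and CYC7 is controlled by HAP1.",
    "CYP1 mutants show reduced CYC1 levels.",
    "CYC7 is induced under hypoxia.",
]
counts = comention_counts(abstracts, synonyms)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

The raw count says nothing about the kind of relation between the two genes, which is what the more advanced NLP approach tries to recover.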

more advanced

NLP

Natural Language Processing

Gene and protein names

Cue words for entity recognition

Verbs for relation extraction

The expression of the cytochrome genes CYC1 and CYC7 is controlled by HAP1
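
A toy sketch of relation extraction on the example sentence, assuming a single hand-picked cue phrase and a regular expression for yeast-style gene names. Both patterns and the function name are hypothetical; a real NLP pipeline would use proper entity recognition and parsing rather than two regular expressions.

```python
import re

GENE = r"\b[A-Z]{3}\d+\b"     # assumption: names shaped like CYC1 or HAP1
CUE = r"is controlled by"     # one hand-picked relation phrase (passive voice)


def extract_regulation(sentence):
    """Return (regulator, target) pairs from 'X ... is controlled by Y'."""
    match = re.search(CUE, sentence)
    if match is None:
        return []
    # Passive voice: targets come before the cue phrase, regulators after it.
    targets = re.findall(GENE, sentence[:match.start()])
    regulators = re.findall(GENE, sentence[match.end():])
    return [(r, t) for r in regulators for t in targets]


sentence = ("The expression of the cytochrome genes CYC1 and CYC7 "
            "is controlled by HAP1")
print(extract_regulation(sentence))
```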

calibrate against gold standard

combine all evidence

Bayesian scoring scheme

e.g.: two scores of 0.7

combined probability: ?

combined probability: 0.91

1 - (1 - 0.7)^2 = 0.91
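
The same calculation generalizes to any number of evidence channels: the association is missed only if every channel misses it. The sketch below assumes the channels are independent and uses a hypothetical function name; it reproduces the slide's arithmetic and leaves out any further corrections the full scoring scheme may apply.

```python
def combine_scores(scores):
    """Combine independent probabilistic scores as 1 - prod(1 - s_i)."""
    p_all_miss = 1.0
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError("scores must be probabilities in [0, 1]")
        p_all_miss *= 1.0 - s
    return 1.0 - p_all_miss


print(round(combine_scores([0.7, 0.7]), 2))
```

Adding a channel can only raise the combined score, so several moderate pieces of evidence can add up to a confident prediction.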

evidence transfer

evidence spread over many species

transfer by orthology

(or “fuzzy orthology”)

von Mering et al., Nucleic Acids Research, 2005
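
A toy sketch of evidence transfer, assuming a one-to-one ortholog table between a source and a target species. The function name, the gene names and the `penalty` factor are hypothetical and not taken from the slides; the real scheme, especially with "fuzzy orthology", is more involved.

```python
def transfer_evidence(source_scores, orthologs, penalty=0.8):
    """Carry association scores from a source species to a target species.

    source_scores: {(gene_a, gene_b): score} in the source species.
    orthologs: {source_gene: target_gene}, assumed one-to-one here.
    penalty: hypothetical down-weighting of transferred evidence.
    """
    transferred = {}
    for (a, b), score in source_scores.items():
        if a in orthologs and b in orthologs:
            pair = tuple(sorted((orthologs[a], orthologs[b])))
            if pair[0] != pair[1]:
                # Keep the strongest transferred score for each target pair.
                best = transferred.get(pair, 0.0)
                transferred[pair] = max(best, score * penalty)
    return transferred


# Made-up genes: yA, yB, yC in the source species; hA, hB in the target.
source_scores = {("yA", "yB"): 0.9, ("yA", "yC"): 0.5}
orthologs = {"yA": "hA", "yB": "hB"}
print(transfer_evidence(source_scores, orthologs))
```

Pairs with a missing ortholog (here yC) are simply dropped, which is one reason coverage differs between species.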

two modes

COG mode

higher coverage, lower specificity

includes all available evidence

some orthologous groups are too large to be meaningful

proteins mode

maximum specificity, lower coverage

information will be relevant for selected species

outlook

take home message

STRING integrates information and predicts interactions

You can always go to the sources

Proteins mode: specific species

COG mode: more coverage, especially for prokaryotic genes

Acknowledgements

The STRING team

Lars Jensen, Peer Bork

Christian von Mering & group in Zurich

Berend Snel, Martijn Huynen

Thank you for your attention

Exercises: tinyurl.com/36twzq

(or via course wiki)

Alternative server: xi.embl.de

Bork et al., Current Opinion in Structural Biology, 2004
