Integration of heterogeneous data

Lars Juhl Jensen Integration of heterogeneous data

9th Course in Bioinformatics for Molecular Biologists, Bertinoro, Italy, March 22-26, 2009

Transcript of Integration of heterogeneous data

Page 1: Integration of heterogeneous data

Lars Juhl Jensen

Integration of heterogeneous data

Page 4: Integration of heterogeneous data
Page 5: Integration of heterogeneous data
Page 6: Integration of heterogeneous data

what went wrong?

Page 7: Integration of heterogeneous data

a good question

Page 8: Integration of heterogeneous data

signaling networks

Page 9: Integration of heterogeneous data

Oda & Kitano, Molecular Systems Biology, 2006

Page 10: Integration of heterogeneous data

long way to go

Page 11: Integration of heterogeneous data

mass spectrometry

Page 12: Integration of heterogeneous data

Linding, Jensen, Ostheimer et al., Cell, 2007

Page 13: Integration of heterogeneous data

phosphorylation sites

Page 14: Integration of heterogeneous data

in vivo

Page 15: Integration of heterogeneous data

kinases are unknown

Page 16: Integration of heterogeneous data

peptide assays

Page 17: Integration of heterogeneous data

Miller, Jensen et al., Science Signaling, 2008

Page 18: Integration of heterogeneous data

sequence specificity

Page 19: Integration of heterogeneous data

kinase-specific

Page 20: Integration of heterogeneous data

in vitro

Page 21: Integration of heterogeneous data

no context

Page 22: Integration of heterogeneous data

what a kinase could do

Page 23: Integration of heterogeneous data

not what it actually does

Page 24: Integration of heterogeneous data

computational methods

Page 25: Integration of heterogeneous data

sequence specificity

Page 26: Integration of heterogeneous data

Miller, Jensen et al., Science Signaling, 2008

Page 27: Integration of heterogeneous data

kinase-specific

Page 28: Integration of heterogeneous data

no context

Page 29: Integration of heterogeneous data

what a kinase could do

Page 30: Integration of heterogeneous data

not what it actually does

Page 31: Integration of heterogeneous data

in vitro

Page 32: Integration of heterogeneous data

in vivo

Page 33: Integration of heterogeneous data

context

Page 34: Integration of heterogeneous data

co-activators

Page 35: Integration of heterogeneous data

scaffolders

Page 36: Integration of heterogeneous data

expression

Page 37: Integration of heterogeneous data

association networks

Page 38: Integration of heterogeneous data

Linding, Jensen, Ostheimer et al., Cell, 2007

Page 39: Integration of heterogeneous data

a good idea

Page 40: Integration of heterogeneous data

Linding, Jensen, Ostheimer et al., Cell, 2007

Page 41: Integration of heterogeneous data

Part I: sequence motifs

Page 42: Integration of heterogeneous data

curated motifs

Page 43: Integration of heterogeneous data

PROSITE

Page 44: Integration of heterogeneous data

ELM

Page 45: Integration of heterogeneous data

HPRD

Page 46: Integration of heterogeneous data

regular expressions

Page 47: Integration of heterogeneous data

[ST]P.[KR]

Page 48: Integration of heterogeneous data

no score
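
A minimal sketch of what scanning with such a curated motif looks like, using the pattern from the slide. The sequence is a made-up toy string, not a real protein, and the function name is hypothetical. Note that each hit is simply present or absent: the regular expression gives no score to rank candidate sites.

```python
import re

# Pattern from the slide; the lookahead lets overlapping hits be reported too.
MOTIF = re.compile(r"(?=([ST]P.[KR]))")

def scan_motif(sequence):
    """Return (1-based start position, matched residues) for every hit."""
    return [(m.start() + 1, m.group(1)) for m in MOTIF.finditer(sequence)]

# Hypothetical toy sequence used only for illustration.
hits = scan_motif("MASPAKLLTPGRQ")
print(hits)
```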

Page 49: Integration of heterogeneous data

Miller, Jensen et al., Science Signaling, 2008

Page 50: Integration of heterogeneous data

insufficient

Page 51: Integration of heterogeneous data

machine learning

Page 52: Integration of heterogeneous data

NetPhosK

Page 53: Integration of heterogeneous data

PredPhospho

Page 54: Integration of heterogeneous data

PHOSITE

Page 55: Integration of heterogeneous data

GPS

Page 56: Integration of heterogeneous data

KinasePhos

Page 57: Integration of heterogeneous data

PPSP

Page 58: Integration of heterogeneous data

GANNPhos

Page 59: Integration of heterogeneous data

PhoScan

Page 60: Integration of heterogeneous data

no regular updates

Page 61: Integration of heterogeneous data

NetPhorest

Page 62: Integration of heterogeneous data

Miller, Jensen et al., Science Signaling, 2008

Page 63: Integration of heterogeneous data

data sources

Page 64: Integration of heterogeneous data

Phospho.ELM

Page 65: Integration of heterogeneous data

Diella et al., Nucleic Acids Res., 2008

Page 66: Integration of heterogeneous data

Diella et al., Nucleic Acids Res., 2008

Page 67: Integration of heterogeneous data

Scansite

Page 68: Integration of heterogeneous data

Obenauer et al., Nucleic Acids Res., 2003

Page 69: Integration of heterogeneous data

Miller, Jensen et al., Science Signaling, 2008

Page 70: Integration of heterogeneous data

common basis

Page 71: Integration of heterogeneous data

Miller, Jensen et al., Science Signaling, 2008

Page 72: Integration of heterogeneous data

automated pipeline

Page 73: Integration of heterogeneous data

compilation of datasets

Page 74: Integration of heterogeneous data

classification vs. prediction

Page 75: Integration of heterogeneous data

Miller, Jensen et al., Science Signaling, 2008

Page 76: Integration of heterogeneous data

homology reduction

Page 77: Integration of heterogeneous data

Miller, Jensen et al., Science Signaling, 2008

Page 78: Integration of heterogeneous data

training and evaluation

Page 79: Integration of heterogeneous data

cross-validation

Page 80: Integration of heterogeneous data

Miller, Jensen et al., Science Signaling, 2008

Page 81: Integration of heterogeneous data

classifier selection

Page 82: Integration of heterogeneous data

Miller, Jensen et al., Science Signaling, 2008

Page 83: Integration of heterogeneous data

motif atlas

Page 84: Integration of heterogeneous data
Page 85: Integration of heterogeneous data

179 kinases

Page 86: Integration of heterogeneous data

93 SH2 domains

Page 87: Integration of heterogeneous data

8 PTB domains

Page 88: Integration of heterogeneous data

BRCT domains

Page 89: Integration of heterogeneous data

WW domains

Page 90: Integration of heterogeneous data

14-3-3 proteins

Page 91: Integration of heterogeneous data

phosphatases

Page 92: Integration of heterogeneous data

model organisms

Page 93: Integration of heterogeneous data

S. cerevisiae

Page 94: Integration of heterogeneous data

D. melanogaster

Page 95: Integration of heterogeneous data

C. elegans

Page 96: Integration of heterogeneous data

biological insights

Page 97: Integration of heterogeneous data

docking domains

Page 98: Integration of heterogeneous data

Miller, Jensen et al., Science Signaling, 2008

Page 99: Integration of heterogeneous data

disease-related kinases

Page 100: Integration of heterogeneous data

Miller, Jensen et al., Science Signaling, 2008

Page 101: Integration of heterogeneous data

predictive power

Page 102: Integration of heterogeneous data

ROC curves
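
As a rough illustration of how a ROC curve is summarized, the area under it equals the probability that a randomly chosen positive site scores higher than a randomly chosen negative one. The sketch below computes that directly; the scores and labels are invented toy values, not results from the paper.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    # Each positive/negative pair counts 1 if ranked correctly, 0.5 if tied.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores and true labels (1 = known site).
auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
print(auc)
```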

Page 103: Integration of heterogeneous data

Miller, Jensen et al., Science Signaling, 2008

Page 104: Integration of heterogeneous data

comparison

Page 105: Integration of heterogeneous data

Miller, Jensen et al., Science Signaling, 2008

Page 106: Integration of heterogeneous data

conclusions

Page 107: Integration of heterogeneous data

data collection

Page 108: Integration of heterogeneous data

automation

Page 109: Integration of heterogeneous data

benchmarking

Page 110: Integration of heterogeneous data

homology reduction!

Page 111: Integration of heterogeneous data

Part II: association networks

Page 112: Integration of heterogeneous data

STRING

Page 113: Integration of heterogeneous data

Jensen, Kuhn et al., Nucleic Acids Research, 2009

Page 114: Integration of heterogeneous data

functional associations

Page 115: Integration of heterogeneous data

data integration

Page 116: Integration of heterogeneous data

common basis

Page 117: Integration of heterogeneous data

630 genomes

Page 118: Integration of heterogeneous data

model organism databases

Page 119: Integration of heterogeneous data

Ensembl

Page 120: Integration of heterogeneous data

RefSeq

Page 121: Integration of heterogeneous data

genomic context methods

Page 122: Integration of heterogeneous data

gene fusion

Page 123: Integration of heterogeneous data

Korbel et al., Nature Biotechnology, 2004

Page 124: Integration of heterogeneous data

conserved neighborhood

Page 125: Integration of heterogeneous data

operons

Page 126: Integration of heterogeneous data

Korbel et al., Nature Biotechnology, 2004

Page 127: Integration of heterogeneous data

bidirectional promoters

Page 128: Integration of heterogeneous data

Korbel et al., Nature Biotechnology, 2004

Page 129: Integration of heterogeneous data

phylogenetic profiles
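
A minimal sketch of the idea: each gene gets a presence/absence vector across genomes, and genes with similar vectors are candidate functional partners. The Jaccard measure and the toy profiles are assumptions made for illustration; the slides do not state which similarity measure is used.

```python
def profile_similarity(a, b):
    """Jaccard similarity between two presence/absence profiles."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

# Hypothetical profiles: one column per genome, 1 = gene present.
profiles = {
    "geneA": [1, 1, 0, 1, 0, 1],
    "geneB": [1, 1, 0, 1, 0, 0],
    "geneC": [0, 0, 1, 0, 1, 0],
}
sim_ab = profile_similarity(profiles["geneA"], profiles["geneB"])
sim_ac = profile_similarity(profiles["geneA"], profiles["geneC"])
print(sim_ab, sim_ac)
```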

Page 130: Integration of heterogeneous data

Korbel et al., Nature Biotechnology, 2004

Page 131: Integration of heterogeneous data

primary experimental data

Page 132: Integration of heterogeneous data

protein interactions

Page 133: Integration of heterogeneous data

yeast two-hybrid

Page 134: Integration of heterogeneous data

affinity purification

Page 135: Integration of heterogeneous data

fragment complementation

Page 136: Integration of heterogeneous data

Jensen & Bork, Science, 2008

Page 137: Integration of heterogeneous data

genetic interactions

Page 138: Integration of heterogeneous data

Beyer et al., Nature Reviews Genetics, 2007

Page 139: Integration of heterogeneous data

BIND: Biomolecular Interaction Network Database

Page 140: Integration of heterogeneous data

BioGRID: General Repository for Interaction Datasets

Page 141: Integration of heterogeneous data

DIP: Database of Interacting Proteins

Page 142: Integration of heterogeneous data

IntAct

Page 143: Integration of heterogeneous data

MINT: Molecular Interactions Database

Page 144: Integration of heterogeneous data

HPRD: Human Protein Reference Database

Page 145: Integration of heterogeneous data

PDB: Protein Data Bank

Page 146: Integration of heterogeneous data

inferred associations

Page 147: Integration of heterogeneous data

gene coexpression

Page 148: Integration of heterogeneous data
Page 149: Integration of heterogeneous data

GEO: Gene Expression Omnibus

Page 150: Integration of heterogeneous data

expression compendia

Page 151: Integration of heterogeneous data

curated knowledge

Page 152: Integration of heterogeneous data

complexes

Page 153: Integration of heterogeneous data

MIPS: Munich Information Center for Protein Sequences

Page 154: Integration of heterogeneous data

Gene Ontology

Page 155: Integration of heterogeneous data

pathways

Page 156: Integration of heterogeneous data

Letunic & Bork, Trends in Biochemical Sciences, 2008

Page 157: Integration of heterogeneous data

KEGG: Kyoto Encyclopedia of Genes and Genomes

Page 158: Integration of heterogeneous data

MetaCyc

Page 159: Integration of heterogeneous data

Reactome

Page 160: Integration of heterogeneous data

PID: NCI-Nature Pathway Interaction Database

Page 161: Integration of heterogeneous data

literature mining

Page 162: Integration of heterogeneous data

MEDLINE

Page 163: Integration of heterogeneous data

SGD: Saccharomyces Genome Database

Page 164: Integration of heterogeneous data

The Interactive Fly

Page 165: Integration of heterogeneous data

OMIM: Online Mendelian Inheritance in Man

Page 166: Integration of heterogeneous data

co-mentioning

Page 167: Integration of heterogeneous data

statistical methods
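
One simple statistical treatment of co-mentioning is to compare how often two names appear in the same abstract with how often that would happen by chance. The sketch below computes an observed/expected ratio under an independence assumption. The function name and the toy documents are hypothetical, and this is not presented as the scoring scheme actually used in STRING.

```python
from collections import Counter
from itertools import combinations

def comention_scores(documents):
    """Observed/expected co-mention ratio for every pair of names seen together."""
    n = len(documents)
    single = Counter()
    pair = Counter()
    for names in documents:
        unique = sorted(set(names))          # count each name once per document
        single.update(unique)
        pair.update(combinations(unique, 2))
    return {
        (a, b): count * n / (single[a] * single[b])
        for (a, b), count in pair.items()
    }

# Hypothetical abstracts, already reduced to the names recognized in each.
docs = [
    ["CYC1", "HAP1"],
    ["CYC1", "HAP1", "GAL4"],
    ["GAL4"],
    ["CYC7", "HAP1"],
]
scores = comention_scores(docs)
print(scores)
```

A ratio above 1 means the pair is mentioned together more often than expected if the names were scattered independently.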

Page 168: Integration of heterogeneous data

NLP: Natural Language Processing

Page 169: Integration of heterogeneous data

Gene and protein names
Cue words for entity recognition
Verbs for relation extraction

[nxgene The GAL4 gene]

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

Page 170: Integration of heterogeneous data
Page 171: Integration of heterogeneous data

easy in theory …

Page 172: Integration of heterogeneous data

… but not in practice

Page 173: Integration of heterogeneous data

different formats

Page 174: Integration of heterogeneous data

parsers

Page 175: Integration of heterogeneous data

different identifiers

Page 176: Integration of heterogeneous data

thesaurus
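
A minimal sketch of such a thesaurus: every synonym is mapped to a canonical identifier, and names that point to more than one protein are flagged as ambiguous rather than guessed. The identifiers `PROT_A` and `PROT_B` and the synonym assignments are hypothetical placeholders.

```python
def build_thesaurus(entries):
    """Map every name (case-insensitive) to the set of canonical identifiers using it."""
    lookup = {}
    for canonical, synonyms in entries.items():
        for name in [canonical, *synonyms]:
            lookup.setdefault(name.lower(), set()).add(canonical)
    return lookup

def resolve(name, lookup):
    """Return the canonical identifier, or None if the name is unknown or ambiguous."""
    matches = lookup.get(name.lower(), set())
    return next(iter(matches)) if len(matches) == 1 else None

# Hypothetical entries: canonical identifier -> synonyms from different sources.
thesaurus = build_thesaurus({
    "PROT_A": ["Cdk1", "CDC2"],
    "PROT_B": ["CDC2"],
})
print(resolve("CDK1", thesaurus), resolve("cdc2", thesaurus))
```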

Page 177: Integration of heterogeneous data

redundant sources

Page 178: Integration of heterogeneous data

bookkeeping

Page 179: Integration of heterogeneous data

variable quality

Page 180: Integration of heterogeneous data

raw quality scores

Page 181: Integration of heterogeneous data

reproducibility

Page 182: Integration of heterogeneous data

von Mering et al., Nucleic Acids Research, 2005

Page 183: Integration of heterogeneous data

benchmarking
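
A rough sketch of how raw quality scores from one source might be benchmarked: bin the scored pairs and measure, per bin, the fraction found in a gold standard. That fraction can then serve as a comparable confidence value across sources. The bin edges, pair names, and gold standard below are invented for illustration.

```python
def calibrate(scored_pairs, gold_positive, bin_edges):
    """Fraction of gold-standard pairs among all pairs in each raw-score bin."""
    n_bins = len(bin_edges) - 1
    hits = [0] * n_bins
    totals = [0] * n_bins
    for pair, raw in scored_pairs:
        for i in range(n_bins):
            if bin_edges[i] <= raw < bin_edges[i + 1]:
                totals[i] += 1
                hits[i] += pair in gold_positive
                break
    # None marks bins with no data, where no fraction can be estimated.
    return [h / t if t else None for h, t in zip(hits, totals)]

# Hypothetical raw scores and a hypothetical gold standard.
scored = [(("a", "b"), 0.9), (("a", "c"), 0.8), (("b", "c"), 0.2), (("c", "d"), 0.1)]
gold = {("a", "b")}
fractions = calibrate(scored, gold, [0.0, 0.5, 1.01])
print(fractions)
```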

Page 184: Integration of heterogeneous data

von Mering et al., Nucleic Acids Research, 2005

Page 185: Integration of heterogeneous data

spread over 630 genomes

Page 186: Integration of heterogeneous data

transfer by orthology

Page 187: Integration of heterogeneous data

von Mering et al., Nucleic Acids Research, 2005

Page 188: Integration of heterogeneous data

two modes

Page 189: Integration of heterogeneous data

COG mode

Page 190: Integration of heterogeneous data

von Mering et al., Nucleic Acids Research, 2005

Page 191: Integration of heterogeneous data

protein mode

Page 192: Integration of heterogeneous data

von Mering et al., Nucleic Acids Research, 2005

Page 193: Integration of heterogeneous data

combine all evidence
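
One common way to combine calibrated scores from several evidence channels is a noisy-OR: the combined confidence is the probability that at least one channel is right, assuming the channels are independent. The sketch below shows only that core formula. Independence is an assumption, and any correction for prior probability is left out.

```python
from math import prod

def combine_scores(scores):
    """Noisy-OR combination of per-channel probabilities (independence assumed)."""
    return 1.0 - prod(1.0 - s for s in scores)

# Hypothetical per-channel confidence scores for one protein pair.
combined = combine_scores([0.5, 0.5])
print(combined)
```

With this rule, several weak but independent lines of evidence add up to a stronger combined score, and no single channel can lower it.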

Page 194: Integration of heterogeneous data

visualize

Page 195: Integration of heterogeneous data

Frishman et al., Modern Genome Annotation, 2009

Page 196: Integration of heterogeneous data

STITCH

Page 197: Integration of heterogeneous data
Page 198: Integration of heterogeneous data

metabolite–enzyme links

Page 199: Integration of heterogeneous data

pathway databases

Page 200: Integration of heterogeneous data

Letunic & Bork, Trends in Biochemical Sciences, 2008

Page 201: Integration of heterogeneous data

drug–target links

Page 202: Integration of heterogeneous data

DrugBank

Page 203: Integration of heterogeneous data

PDSP Ki

Page 204: Integration of heterogeneous data

MATADOR

Page 205: Integration of heterogeneous data

Campillos & Kuhn et al., Science, 2008

Page 206: Integration of heterogeneous data

chemical–chemical links

Page 207: Integration of heterogeneous data

shared targets

Page 208: Integration of heterogeneous data

fingerprint similarity
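
A minimal sketch of fingerprint similarity, assuming each chemical is represented by the set of bit positions switched on in its structural fingerprint and compared with the Tanimoto coefficient. The choice of Tanimoto and the toy fingerprints are assumptions; the slide does not name the measure.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical fingerprints: indices of the bits that are set.
similarity = tanimoto({1, 2, 3, 4}, {3, 4, 5, 6})
print(similarity)
```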

Page 209: Integration of heterogeneous data

chemical–protein network

Page 210: Integration of heterogeneous data
Page 211: Integration of heterogeneous data

conclusions

Page 212: Integration of heterogeneous data

more data is better

Page 213: Integration of heterogeneous data

quality scores

Page 214: Integration of heterogeneous data

benchmarking

Page 215: Integration of heterogeneous data

cross-species integration

Page 216: Integration of heterogeneous data

Part III: putting it all together

Page 217: Integration of heterogeneous data

Linding, Jensen, Ostheimer et al., Cell, 2007

Page 218: Integration of heterogeneous data

NetworKIN

Page 219: Integration of heterogeneous data
Page 220: Integration of heterogeneous data

benchmarking

Page 221: Integration of heterogeneous data

Linding, Jensen, Ostheimer et al., Cell, 2007

Page 222: Integration of heterogeneous data

2.5-fold better accuracy

Page 223: Integration of heterogeneous data

context is crucial

Page 224: Integration of heterogeneous data

localization

Page 225: Integration of heterogeneous data

Linding, Jensen, Ostheimer et al., Cell, 2007

Page 226: Integration of heterogeneous data

DNA damage response

Page 227: Integration of heterogeneous data

Linding, Jensen, Ostheimer et al., Cell, 2007

Page 228: Integration of heterogeneous data

Linding, Jensen, Ostheimer et al., Cell, 2007

Page 229: Integration of heterogeneous data

small-scale validation

Page 230: Integration of heterogeneous data

ATM phosphorylates Rad50

Page 231: Integration of heterogeneous data

Linding, Jensen, Ostheimer et al., Cell, 2007

Page 232: Integration of heterogeneous data

Cdk1 phosphorylates 53BP1

Page 233: Integration of heterogeneous data

Linding, Jensen, Ostheimer et al., Cell, 2007

Page 234: Integration of heterogeneous data

high-throughput validation

Page 235: Integration of heterogeneous data

multiple reaction monitoring

Page 236: Integration of heterogeneous data

Linding, Jensen, Ostheimer et al., Cell, 2007

Page 237: Integration of heterogeneous data

systematic validation

Page 238: Integration of heterogeneous data

kinase inhibitor matrix

Page 239: Integration of heterogeneous data

Fedorov et al., PNAS, 2007

Page 240: Integration of heterogeneous data

design optimal experiments

Page 241: Integration of heterogeneous data

integration with literature

Page 242: Integration of heterogeneous data

Reflect

Page 243: Integration of heterogeneous data
Page 244: Integration of heterogeneous data
Page 245: Integration of heterogeneous data
Page 246: Integration of heterogeneous data

conclusions

Page 247: Integration of heterogeneous data

complementary data

Page 248: Integration of heterogeneous data

visualization

Page 249: Integration of heterogeneous data

a good question

Page 250: Integration of heterogeneous data
Page 251: Integration of heterogeneous data

Acknowledgments

NetworKIN.info: Rune Linding, Gerard Ostheimer, Francesca Diella, Karen Colwill, Jing Jin, Pavel Metalnikov, Vivian Nguyen, Adrian Pasculescu, Jin Gyoon Park, Leona D. Samson, Rob Russell, Peer Bork, Michael Yaffe, Tony Pawson

STITCH.embl.de: Michael Kuhn, Christian von Mering, Monica Campillos, Peer Bork

NetPhorest.info: Martin Lee Miller, Francesca Diella, Claus Jørgensen, Michele Tinti, Lei Li, Marilyn Hsiung, Sirlester A. Parker, Jennifer Bordeaux, Thomas Sicheritz-Pontén, Marina Olhovsky, Adrian Pasculescu, Jes Alexander, Stefan Knapp, Nikolaj Blom, Peer Bork, Shawn Li, Gianni Cesareni, Tony Pawson, Benjamin E. Turk, Michael B. Yaffe, Søren Brunak

STRING.embl.de: Christian von Mering, Michael Kuhn, Manuel Stark, Samuel Chaffron, Philippe Julien, Tobias Doerks, Jan Korbel, Berend Snel, Martijn Huynen, Peer Bork

Reflect.ws: Sean O’Donoghue, Evangelos Pafilis, Heiko Horn, Michael Kuhn, Nigel Brown, Reinhardt Schneider

Page 252: Integration of heterogeneous data