KernelMethods,Paern Analysisand Computaonal& … · 2015-12-18 · KEPACO Mission: Predicting...
Transcript of KernelMethods,Paern Analysisand Computaonal& … · 2015-12-18 · KEPACO Mission: Predicting...
Associate professor Juho Rousu
Kernel Methods, Pa0ern Analysis and Computa8onal Metabolomics (KEPACO)
KEPACO in a Nutshell We develop kernel-‐based machine learning methods for predic8ng of structured, non-‐tabular data arising in biomedicine and digital health, especially applica9ons on small molecules such as drugs and metabolites. Examples of research ques9ons: • How to iden9fy a
metabolite from its mass spectrum?
• How to find mul9variate associa9ons between SNPs and metabolites?
• How to classify a enzyme according to its metabolomic role?
Flowchart for a typical study
Contact: Juho Rousu
1. Structured Big Data
1 1 11 11 10 0 0 0 0 0 0 111 1 1 1 10 0 0 0 0 0 0
0.6 0.11 0.30.06 0.020.1 0.020 0 0 0 0 0 0 0.20.30.1 0.2 0.11 0.4 0.010 0 0 0 0 0 0Nodes Intensity (NI)
NB
NI
Nodes Binary (NB)
0 10 20 30 40 50
010
2030
4050
Number of active cell lines
Num
ber o
f act
ive c
ell l
ines
0.6
0.7
0.8
0.9
1.02. Efficient Kernel (non-‐linear similarity metric) computa9on
3. Defini9on as a Regularized learning problem (max-‐margin and/or sparsity)
4. Efficient training using convex op9miza9on
Saddle point
Conditional gradient
5 10 15 20
Metlin − PubChem
TOP k
perc
enta
ge o
f cor
rect
ly id
entif
ied
2D s
truct
ures
0%
20%
40%
60%
80%
100%
5 10 15 20
our methodCFM−IDMIDASMetFragRandom
5. Predic9on and interpreta9on
KEPACO Mission: Predicting Structured data
Sequences Hierarchies Networks
Many data have logical structure, how can we make use of that in making models faster and more accurate? In particular, we are interested of predicting structured output
Example: Predicting Network Response
Problem: • To predict the spread of
activation in a complex network, given an action/stimulus
Example: • Input: post in a social network • Output: subnetwork of friends
who share the post
a
i b
h
g
f
e c
d
Alice
David
Bob
JacobBen
Dutch get the championship!
Dutch get the championship!
Dutch get the championship!Dutch get the
championship! Soccer is not my cup of tea.
Su, Hongyu, Aristides Gionis, and Juho Rousu. "Structured prediction of network response." Proceedings of the 31st International Conference on Machine Learning (ICML-14). 2014.
Example: Metabolite identification through machine learning
• From a set of MS/MS spectra (x = structured input), learn a model to predict molecular fingerprints (y = multilabel output)
• With predicted fingerprints retrieve candidate molecules from a large molecular database
• Our CSI:Fingerid (http://www.csi-fingerid.org) tool is the most accurate metabolite identification tool to date
Dührkop, K., Shen, H., Meusel, M., Rousu, J., & Böcker, S. (2015). Searching molecular structure databases with tandem mass spectra using CSI: FingerID. Proceedings of the National Academy of Sciences, 112(41), 12580-12585.
Huibin ShenMay 15, 2014
8/27
New fingerprint prediction framework
5 10 15 20
perc
enta
ge o
f cor
rect
ly id
entif
ied
stru
ctur
es
0%
20%
40%
60%
80%
100%
5 10 15 20top k
our methodFingerIDCFM−IDMAGMa
MidasMetFragRandom
Example: Drug Bioactivity Classification
• Data: x = Molecule, y = labeling of a graph of cell lines • Compatibility score F(x,y): how well subset of cell lines go together
with the drug • Co-occurring features: molecular fingerprints vs. subsets of cell lines • Prediction: finding the best scoring labeling of the graph
MCF7
MDA_MB_231
HS578T
BT_549
T47D
SF_268
SF_295
SF_539
SNB_19
SNB_75U251
COLO205
HCC_2998
HCT_116
HCT_15
HT29
KM12
SW_620
CCRF_CEM
HL_60
K_562
MOLT_4
RPMI_8226
SR
LOXIMVI
MALME_3M
M14
SK_MEL_2
SK_MEL_28
SK_MEL_5
UACC_257UACC_62
MDA_MB_435
MDA_N
A549
EKVX
HOP_62
HOP_92
NCI_H226
NCI_H23
NCI_H322M
NCI_H460
NCI_H522IGROV1
OVCAR_3
OVCAR_4
OVCAR_5
OVCAR_8SK_OV_3
NCI_ADR_RES
PC_3DU_145
786_0
A498
ACHN
CAKI_1
RXF_393
TK_10UO_31
Su, Hongyu, and Juho Rousu. "Multilabel classification through random graph ensembles." Machine Learning 99.2 (2013): 231-256.
Example: Multivariate association meta-analysis • Problem: find associations
btw. groups of SNPs (=genotype) and groups of metabolites (=phenotype)
• Meta-analysis setting • Several studies analyzed
together • No original data available, only
summary statistics of indivudual studies
• Our metaCCA tool can recover multivariate associations nearly as accurately as if original data was available
2.10.2015
Cichonska A, Rousu J, Marttinen P, Kangas AJ, Soininen P, Lehtimäki T, Raitakari OT, Järvelin MR, Salomaa V, Ala-Korpela M, Ripatti S. metaCCA: Summary statistics-based multivariate meta-analysis of genome-wide association studies using canonical correlation analysis. bioRxiv. 2015 Jan 1:022665.
Example: Biological network reconstruction Input: genomic measurements Output: biological network
Pitkänen E, Jouhten P, Hou J, Syed MF, Blomberg P, Kludas J, Oja M, Holm L, Penttilä M, Rousu J, Arvas M. Comparative genome-scale reconstruction of gapless metabolic networks for present and ancestral species. PLoS Comput Biol. 2014 Feb 1;10(2):1003465.
Example: Synthetic biology
• Goal: energy and carbon efficient industrial processes through synthetic biology
• Consortium: VTT, Aalto, U. Turku Our focus • In silico design, modelling and
optimization of synthetic biosystems • Systems Modelling + Machine
Learning
LiF
Fermentation experiments
Genome-wide measurements
In silico experimental
design
Strain/Process modification
Tekes Strategic Opening “LiF – Living Factories”, 2014-
Kernel Methods, Pattern Analysis & Computational Metabolomics (KEPACO)
KEPACO research group Dr. Sandor Szedmak Dr. Sahely Bhadra (w. S.Kaski) Dr. Markus Heinonen (w. H. Lähdesmäki) Dr. Celine Brouard Dr. Elena Czeizler Dr. Hongyu Su MSc Huibin Shen MSc Anna Cichonska (HIIT&FIMM) MSc Viivi Uurtio + students
Funding • Academy of Finland • EU FP7 • Tekes
KEPACO group summer 2015