Karsten Borgwardt: Data Mining in Bioinformatics, Page 1
Data Mining in Bioinformatics
Day 9: Graph Mining in Chemoinformatics
Chloé-Agathe Azencott & Karsten Borgwardt
February 10 to February 21, 2014
Machine Learning & Computational Biology Research Group
Max Planck Institutes Tübingen and Eberhard Karls Universität Tübingen

Drug discovery
Modern therapeutic research: from serendipity to rationalized drug design
Ancient Greeks treat infections with mould
[Figure: 2D chemical structure diagram of a drug molecule]
[Figure: Biapenem in PBP-1A]

Drug discovery process
1. Find a target: protein that we want to inhibit so as to interfere with a biological process
2. Identify hits: compounds likely to bind to the target
3. Hit-to-lead: characterize hits. Can they be drugs? (ADME-Tox)
4. Lead optimization and synthesis: bioactivity, pharmacokinetics, synthetic pathway
5. Assay: in vitro, in vivo, clinical

Drug discovery process
[Figure: the five-step pipeline (find a target, identify hits, hit-to-lead, lead optimization and synthesis, assay), annotated with durations of 52 months and 90 months and a cost of $500,000,000 to $2,000,000,000]

Chemoinformatics
How can computer science help? → Chemoinformatics!
“...the mixing of information resources to transform data into information, and information into knowledge, for the intended purpose of making better decisions faster in the arena of drug lead identification and optimisation.” (F. K. Brown)
“... the application of informatics methods to solve chemical problems.” (J. Gasteiger and T. Engel)

Chemoinformatics
[Figure: the five-step drug discovery pipeline, labeled "Chemoinformatics"]

Chemoinformatics
The chemical space
$10^{60}$ possible small organic molecules
$10^{22}$ stars in the observable universe
(Slide courtesy of Matthew A. Kayala)

Drug discovery process
[Figure: the five-step drug discovery pipeline, annotated with QSAR / QSPR]
QSAR: Quantitative Structure-Activity Relationship; treated here as classification
QSPR: Quantitative Structure-Property Relationship; treated here as regression

Representing chemicals in silico
Expert knowledge: molecular descriptors → hard, potentially incomplete
Molecules are...
[Figure: 2D chemical structure diagram]

Representing chemicals in silico
Similar Property Principle: molecules having similar structures should exhibit similar activities.
→ Structure-based representations: compare molecules by comparing substructures

Molecular graph
[Figure: molecular graph; vertices labeled by atom type (C, N, O, S), edges labeled by bond type (d marks the double bonds)]
Undirected labeled graph

Fingerprints
Define feature vectors that record the presence/absence (or number of occurrences) of particular patterns in a given molecular graph
$$\phi(A) = (\phi_s(A))_{s \text{ substructure}}$$
where
$$\phi_s(A) = \begin{cases} 1 & \text{if } s \text{ occurs in } A \\ 0 & \text{otherwise} \end{cases}$$
Extension of traditional chemical fingerprints

Fingerprints
Learning from fingerprints
Classical machine learning and data mining techniques can be applied to these vectorial feature representations.
- Any distance / kernel can be used
- Classification
- Feature selection
- Clustering

Fingerprints
Fingerprints compression
Systematic enumeration → long, sparse vectors
e.g. 50,000 random compounds from ChemDB → 300,000 paths of length up to 8 → 300 non-zeros on average
“Naive” compression: list the positions of the 1s
$2^{19} = 524{,}288 \geq 300{,}000$, so 19 bits are enough to encode one position
average encoding: $300 \times 19 = 5{,}700$ bits
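The arithmetic behind the naive encoding can be checked with a few lines of Python. This is only a sketch: the function name `position_list_bits` is made up for illustration, and it assumes every position is stored with the same fixed number of bits.

```python
import math


def position_list_bits(n_features: int, n_nonzero: int) -> int:
    """Bits needed to store a sparse binary fingerprint as a list of 1-positions.

    Assumes a fixed-width index: ceil(log2(n_features)) bits per position.
    """
    bits_per_index = math.ceil(math.log2(n_features))
    return n_nonzero * bits_per_index


# Numbers from the slide: 300,000 possible paths, 300 non-zeros on average.
print(position_list_bits(300_000, 300))  # 5700
```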

Fingerprints
Fingerprints compression
Modulo compression (lossy)
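The slide only names the technique. A common reading, assumed here, is that a long binary fingerprint is folded to a fixed length by mapping each set position `i` to `i mod folded_length` and OR-ing colliding bits. The function name `fold_fingerprint` and the example positions are hypothetical.

```python
def fold_fingerprint(on_bits, folded_length):
    """Lossy modulo folding (assumed scheme): position i maps to i % folded_length.

    Colliding positions are merged, which is why the compression is lossy.
    """
    return sorted({i % folded_length for i in on_bits})


# Three of the four set bits collide at position 3 after folding to 1024 bits.
print(fold_fingerprint([3, 1027, 2051, 70000], 1024))  # [3, 368]
```

The collision in the example shows what is lost: after folding, the three original positions can no longer be told apart.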

Frequent patterns fingerprints
MOLFEA [Helma et al., 2004]
P = positive (mutagenic) compounds
N = negative compounds
features: fragments (= patterns) $f$ such that both $\mathrm{freq}(f, P) \geq t$ and $\mathrm{freq}(f, N) \geq t$
Limited to frequent linear patterns
ML algorithm: SVM with linear or quadratic kernel

Frequent patterns fingerprints
MOLFEA [Helma et al., 2004]
CPDB: Carcinogenic Potency DataBase
684 compounds classified in 341 mutagens and 343 non-mutagens according to Ames test on Salmonella
[Figure: Mutagenicity prediction [Helma et al., 2004]; cross-validated sensitivity (axis from 50 to 100) against frequency threshold (1%, 3%, 5%, 10%), for the linear and the quadratic kernel]

Spectrum kernels
$$\phi(A) = (\phi_s(A))_{s \in S}$$
$$K_{\text{spectrum}}(A, A') = k(\phi(A), \phi(A'))$$
$k : \mathbb{R}^{|S|} \times \mathbb{R}^{|S|} \to \mathbb{R}$ can be
- Dot product (linear kernel)
- RBF kernel
- Tanimoto kernel: $k(A, B) = \frac{|A \cap B|}{|A \cup B|}$
- MinMax kernel: $k(A, B) = \frac{\sum_{i=1}^{N} \min(A_i, B_i)}{\sum_{i=1}^{N} \max(A_i, B_i)}$
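The two similarity kernels are short enough to write out directly. In this sketch a binary fingerprint is a Python set of pattern names and a count fingerprint is a dict from pattern name to count. The pattern names in the example are made up, and returning 0 for two empty fingerprints is a convention chosen here, not something the slides specify.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto kernel on binary fingerprints given as sets of 'on' patterns."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0  # empty-vs-empty convention (assumption)


def minmax(a: dict, b: dict) -> float:
    """MinMax kernel on count fingerprints given as {pattern: count} dicts."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return num / den if den else 0.0


counts_a = {"CsC": 2, "CsN": 1}
counts_b = {"CsC": 1, "CdO": 1}
print(tanimoto(set(counts_a), set(counts_b)))  # 1 shared pattern out of 3
print(minmax(counts_a, counts_b))              # 0.25
```

On binary fingerprints the two functions agree, since min and max of 0/1 values reduce to intersection and union.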

Spectrum kernels
Tanimoto and MinMax
Both Tanimoto and MinMax are kernels.
Proof for Tanimoto: J.C. Gower, A general coefficient of similarity and some of its properties. Biometrics, 1971.
Proof for MinMax:
$$\mathrm{MinMax}(x, y) = \frac{\langle \phi(x), \phi(y) \rangle}{\langle \phi(x), \phi(x) \rangle + \langle \phi(y), \phi(y) \rangle - \langle \phi(x), \phi(y) \rangle}$$
with $\phi(x)$ of length (# patterns) × (max count), and $\phi(x)_i = 1$ iff the pattern indexed by $\lfloor i/q \rfloor$ appears more than $i \bmod q$ times in $x$, where $q$ is the max count. MinMax on counts is therefore Tanimoto on this binary expansion.
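The binary expansion used in the MinMax argument can be checked numerically. This sketch takes count vectors as plain lists and `q` as the maximum count; the helper names are illustrative only.

```python
def unary_expand(counts, q):
    """Binary vector of length len(counts) * q.

    Entry i is 1 iff the pattern indexed by i // q occurs more than i % q times.
    """
    return [1 if counts[i // q] > i % q else 0 for i in range(len(counts) * q)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def minmax_via_expansion(x, y, q):
    """Tanimoto of the binary expansions, as in the right-hand side of the proof."""
    px, py = unary_expand(x, q), unary_expand(y, q)
    xy = dot(px, py)
    return xy / (dot(px, px) + dot(py, py) - xy)


def minmax_direct(x, y):
    return sum(map(min, x, y)) / sum(map(max, x, y))


x, y = [2, 0, 3], [1, 1, 3]
print(minmax_direct(x, y), minmax_via_expansion(x, y, q=3))  # both equal 4/6
```

The check works because the dot product of two expansions counts, for each pattern, the smaller of the two counts, and the dot product of an expansion with itself is the total count.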

All patterns fingerprints
Paths fingerprints
Labeled sub-paths (walks)
[Figure: molecular graph with two highlighted sub-paths, NsCsCsS and CsCsCdO]
Some sub-paths of length 3
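A minimal sketch of how labeled paths such as NsCsCsS could be enumerated by depth-first search. It makes two assumptions that the slides do not settle: only simple paths (no repeated atom) are generated, and a path and its reverse are kept as two separate features. The toy molecule and the function name `path_features` are made up.

```python
def path_features(atoms, bonds, max_bonds):
    """Enumerate labeled simple paths with at most max_bonds bonds.

    atoms: {atom_id: element label}
    bonds: {(u, v): bond label}, e.g. "s" for single and "d" for double
    """
    adj = {a: [] for a in atoms}
    for (u, v), label in bonds.items():
        adj[u].append((v, label))
        adj[v].append((u, label))

    features = set()

    def extend(node, visited, text, n_bonds):
        features.add(text)
        if n_bonds == max_bonds:
            return
        for nxt, label in adj[node]:
            if nxt not in visited:  # simple paths only (assumption)
                extend(nxt, visited | {nxt}, text + label + atoms[nxt], n_bonds + 1)

    for start in atoms:
        extend(start, {start}, atoms[start], 0)
    return features


# Toy chain N-C-C-S with single bonds only.
atoms = {0: "N", 1: "C", 2: "C", 3: "S"}
bonds = {(0, 1): "s", (1, 2): "s", (2, 3): "s"}
print(sorted(path_features(atoms, bonds, max_bonds=3)))
```

A production fingerprint would typically canonicalize each path (for example by keeping the lexicographically smaller direction) and hash it to a vector index.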

All patterns fingerprints
Circular fingerprints
Labeled sub-trees: Extended-Connectivity (or Circular) features
[Figure: molecular graph with a highlighted circular substructure, C{sC{sN|sC}|sN{sC}|sS{sC}}]
Example of a circular substructure of depth 2

All patterns fingerprints
2D spectrum kernels [Azencott et al., 2007]
Systematically extract paths / circular fingerprints, for various maximal depths
SVM with Tanimoto / MinMax

All patterns fingerprints
2D spectrum kernels [Azencott et al., 2007]
Mutagenicity (Mutag): 188 compounds
Benzodiazepine receptor affinity (BZR): 181 + 125 compounds
Cyclooxygenase-2 inhibitors (COX2): 178 + 125 compounds
Estrogen receptor affinity (ER): 166 + 180 compounds

| Data | SVM | Previous best |
|---|---|---|
| Mutag | 90.4% | 85.2% (gBoost) |
| BZR | 79.8% | 76.4% |
| COX2 | 70.1% | 73.6% |
| ER | 82.1% | 79.8% |

Weisfeiler-Lehman kernel
[Shervashidze et al., 2011]
Goal: scalability
Compute a sequence that captures topological and label information of graphs in a runtime linear in the number of edges
→ sub-tree kernel
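The relabeling idea behind the Weisfeiler-Lehman subtree kernel can be sketched as follows. This is a simplified illustration, not the implementation from [Shervashidze et al., 2011]: labels are kept as growing strings instead of being compressed to short identifiers, so the sketch does not reach the linear runtime mentioned above. The graph encoding (a label dict plus an adjacency dict) and the function names are assumptions.

```python
from collections import Counter


def wl_features(labels, adj, h):
    """Count node labels over h rounds of Weisfeiler-Lehman relabeling.

    labels: {node: initial label string}
    adj:    {node: list of neighbor nodes}
    """
    current = dict(labels)
    feats = Counter(current.values())
    for _ in range(h):
        # New label = own label plus the sorted multiset of neighbor labels.
        current = {
            v: current[v] + "(" + ",".join(sorted(current[u] for u in adj[v])) + ")"
            for v in current
        }
        feats.update(current.values())
    return feats


def wl_kernel(g1, g2, h):
    """Dot product of the label-count vectors of two graphs."""
    f1, f2 = wl_features(*g1, h), wl_features(*g2, h)
    return sum(f1[k] * f2[k] for k in f1.keys() & f2.keys())


# Toy graphs: C-C-O and C-O.
g1 = ({0: "C", 1: "C", 2: "O"}, {0: [1], 1: [0, 2], 2: [1]})
g2 = ({0: "C", 1: "O"}, {0: [1], 1: [0]})
print(wl_kernel(g1, g2, h=1))  # 4
```

After one round the two graphs share the labels C, O and O(C), which gives 2·1 + 1·1 + 1·1 = 4.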

Convolution kernels
a.k.a. decomposition kernels
$(x_1, \dots, x_D)$ is a tuple of parts of $x$, with $x_d \in X_d$ for each part $d = 1, \dots, D$
$k_d : X_d \times X_d \to \mathbb{R}$: a Mercer kernel
$$K_{\text{decomposition}}(x, x') = \sum_{x_1 x_2 \dots x_D = x} \; \sum_{x'_1 x'_2 \dots x'_D = x'} k_1(x_1, x'_1) \, k_2(x_2, x'_2) \dots k_D(x_D, x'_D)$$
Spectrum kernels are a particular case of convolution kernels

Convolution kernels
Weighted Decomposition Kernel [Menchetti et al., 2005]
Match atoms and weigh them according to a kernel between sub-graphs that include these atoms
$$K_{\text{WDK}}(x, x') = \sum_{(a, \sigma) \in D_r(x)} \; \sum_{(a', \sigma') \in D_r(x')} \delta(a, a') \, K_c(\sigma, \sigma')$$
$r \in \mathbb{N}$, $r > 0$
$D_r(x)$: decompositions of the molecular graph of $x$ into an atom $a$ and a subgraph $\sigma$ of $x$ including $a$ and of depth at most $r$

Convolution kernels
Weighted Decomposition Kernel [Menchetti et al., 2005]
$K_c$: contextual kernel, here: histogram intersection kernel
$$K_c(\sigma, \sigma') = \sum_{l \in L} \min(f_\sigma(l), f_{\sigma'}(l))$$
$L$: possible labels for edges and vertices
$f_\sigma(l)$: frequency of label $l$ in subgraph $\sigma$.
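The weighted decomposition kernel with a histogram intersection context kernel reduces to a few lines once the decomposition is given. In this sketch each part is a pair of an atom label and a `Counter` of the vertex and edge labels in its context subgraph. How the contexts are extracted from a molecule is left out, and the toy label counts are invented.

```python
from collections import Counter


def histogram_intersection(f, g):
    """K_c: sum over labels of the smaller of the two label frequencies."""
    return sum(min(f[label], g[label]) for label in f.keys() & g.keys())


def wdk(parts_x, parts_y):
    """K_WDK: sum K_c over all pairs of parts whose atom labels match (Dirac delta)."""
    return sum(
        histogram_intersection(ctx_x, ctx_y)
        for atom_x, ctx_x in parts_x
        for atom_y, ctx_y in parts_y
        if atom_x == atom_y
    )


# Hypothetical decompositions: (atom label, label counts of its context subgraph).
parts_x = [("C", Counter({"C": 2, "O": 1, "s": 2})), ("O", Counter({"C": 1, "O": 1, "d": 1}))]
parts_y = [("C", Counter({"C": 1, "N": 1, "s": 1}))]
print(wdk(parts_x, parts_y))  # 2
```

Only the C-centered parts match, and their contexts share one C and one single bond, hence the value 2.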

Introducing spatial information
3D Histograms [Azencott et al., 2007]
Groups of k atoms
Associated size:
- Pairwise distances (k = 2)
- Diameter of the smallest sphere that contains all k atoms

Introducing spatial information
3D Histograms [Azencott et al., 2007]
One histogram per class of k-tuple (e.g. C-C-C, C-C-O)
[Figure: molecular graph annotated with N-O distances, next to the resulting histogram; x-axis: N-O distance (Å), from 0 to 10; y-axis: frequency of N-O, from 0 to 4]
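For k = 2, building one distance histogram per atom-pair class could look like the sketch below. The bin width of 1 Å, the coordinates, and the function name are assumptions made for illustration. The slides do not say how the histograms are binned or compared.

```python
import math
from collections import Counter
from itertools import combinations


def pair_distance_histograms(atoms, bin_width=1.0):
    """One histogram of inter-atomic distances per unordered pair class.

    atoms: list of (element label, (x, y, z)) with coordinates in angstroms.
    Returns {pair class such as "N-O": Counter{bin index: count}}.
    """
    hists = {}
    for (label_a, pos_a), (label_b, pos_b) in combinations(atoms, 2):
        pair_class = "-".join(sorted((label_a, label_b)))
        bin_index = int(math.dist(pos_a, pos_b) // bin_width)
        hists.setdefault(pair_class, Counter())[bin_index] += 1
    return hists


# Hypothetical coordinates, chosen so the N-O distances are 2.2 and 4.6.
atoms = [("N", (0, 0, 0)), ("O", (2.2, 0, 0)), ("O", (0, 4.6, 0)), ("C", (0, 0, 1.5))]
print(pair_distance_histograms(atoms)["N-O"])
```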

Introducing spatial information
3D Histograms: performance [Azencott et al., 2007]

| Data | 2D kernel | Hist3D kernel |
|---|---|---|
| Mutag | 90.4% | 88.8% |
| BZR (loo) | 82.0% | 79.4% |
| ER (loo) | 87.0% | 86.1% |
| COX2 | 76.9% | 78.6% |

Introducing spatial information
3D Decomposition Kernels [Ceroni et al., 2007]
Remember:
$$K_{\text{WDK}}(x, x') = \sum_{(a, \sigma) \in D_r(x)} \; \sum_{(a', \sigma') \in D_r(x')} \delta(a, a') \, K_c(\sigma, \sigma')$$
$$K_{\text{3DDK}}(x, x') = \sum_{\sigma \in S_r(x)} \; \sum_{\sigma' \in S_r(x')} K_s(\sigma, \sigma')$$
$S_r(x)$: subgraphs of $x$ composed of $r$ distinct vertices
$$K_s(\sigma, \sigma') = \prod_{i=1}^{r(r-1)/2} \delta(e_i, e'_i) \, e^{-\gamma (l_i - l'_i)}$$
$l_i$ = length of edge $e_i$ in $x$ ($e_1, e_2, \dots, e_{r(r-1)/2}$ lexicographically ordered); $\gamma \in \mathbb{R}$

Introducing spatial information
3DDK: Performance [Ceroni et al., 2007]

| Data | 2D kernel | Hist3D kernel | 3DDK | Circ3DDK |
|---|---|---|---|---|
| Mutag | 90.4% | 88.8% | 86.7% | 83.5% |
| BZR (loo) | 82.0% | 79.4% | 78.4% | 81.4% |
| ER (loo) | 87.0% | 86.1% | 82.3% | 82.1% |
| COX2 | 76.9% | 78.6% | 75.6% | 75.2% |

Introducing spatial information
The pharmacophore kernel [Mahé et al., 2006]
pharmacophore $p \in P(x)$: $p = [(x_1, l_1), (x_2, l_2), (x_3, l_3)]$
$x_i$ = 3D coordinates of atom $i$ of $x$; $l_i$ = label of atom $i$
$$K(x, x') = \sum_{p \in P(x)} \; \sum_{p' \in P(x')} K_P(p, p')$$
$$K_P(p, p') = K_{\text{dist}}(d_1, d'_1) \, K_{\text{dist}}(d_2, d'_2) \, K_{\text{dist}}(d_3, d'_3) \, K_{\text{feat}}(l_1, l'_1) \, K_{\text{feat}}(l_2, l'_2) \, K_{\text{feat}}(l_3, l'_3)$$
$K_{\text{dist}}$: Gaussian RBF, $K_{\text{dist}}(d, d') = \exp\left(-\frac{\|d - d'\|^2}{2\sigma^2}\right)$
$K_{\text{feat}}$: Dirac
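A brute-force sketch of the pharmacophore kernel. Two points are assumptions, since the slide leaves them open: $d_i$ is taken to be the distance between atoms $i$ and $i+1$ of the triplet (cyclically), and $P(x)$ is taken to be all ordered triplets of distinct atoms. The coordinates and function names are invented, and the enumeration is cubic in the number of atoms per molecule, so it is only meant for toy inputs.

```python
import math
from itertools import permutations


def k_dist(d, e, sigma=1.0):
    """Gaussian RBF kernel between two scalar distances."""
    return math.exp(-((d - e) ** 2) / (2 * sigma ** 2))


def k_pharm(p, q, sigma=1.0):
    """K_P between two triplets of (coordinates, label) pairs."""
    if any(lp != lq for (_, lp), (_, lq) in zip(p, q)):
        return 0.0  # Dirac kernel on the atom labels
    value = 1.0
    for i in range(3):
        # Assumed definition: d_i is the distance between atoms i and i+1 (cyclic).
        d_p = math.dist(p[i][0], p[(i + 1) % 3][0])
        d_q = math.dist(q[i][0], q[(i + 1) % 3][0])
        value *= k_dist(d_p, d_q, sigma)
    return value


def pharmacophore_kernel(x, y, sigma=1.0):
    """Sum K_P over all ordered triplets of distinct atoms (assumed P(x))."""
    return sum(
        k_pharm(p, q, sigma)
        for p in permutations(x, 3)
        for q in permutations(y, 3)
    )


mol = [((0, 0, 0), "N"), ((1.5, 0, 0), "O"), ((0, 2, 0), "C")]
print(pharmacophore_kernel(mol, mol))  # 6.0: each of the 6 orderings matches only itself
```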

Introducing spatial information
3D LAP kernel [Hinselmann et al., 2010]
M: pairwise intramolecular matrix of inter-atomic geometric distances

Introducing spatial information
Conclusion
- How relevant is 3D information?
- How good is 3D information?

Drug discovery process
[Figure: the five-step drug discovery pipeline, annotated with Docking and Virtual High-Throughput Screening]

High-throughput screening
Assay a large library of potential drugs against their target
Very costly
→ docking
→ virtual high-throughput screening (vHTS)

Measuring performance
Imbalanced data
Typically, most compounds are inactive ⇒ many more negative than positive examples
E.g. DHFR data set: 99,995 chemicals screened for activity against dihydrofolate reductase; < 0.2% active compounds
Accuracy is not appropriate: predicting all compounds negative ⇒ accuracy = 99.8%
sensitivity = # True Positives / # Positives
specificity = # True Negatives / # Negatives
For many methods, the output is continuous ⇒ accuracy, sensitivity and specificity depend on a threshold θ
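The effect of class imbalance on accuracy can be reproduced with a small helper. The count of 199 actives is a hypothetical value chosen only to respect "< 0.2% of 99,995". The constant scores stand for a model that predicts every compound negative.

```python
def evaluate(labels, scores, theta):
    """Accuracy, sensitivity and specificity at threshold theta.

    labels: list of booleans (True = active); scores: list of floats.
    Assumes both classes are present in labels.
    """
    preds = [s >= theta for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    tn = sum((not p) and (not y) for p, y in zip(preds, labels))
    pos = sum(labels)
    neg = len(labels) - pos
    return {
        "accuracy": (tp + tn) / len(labels),
        "sensitivity": tp / pos,
        "specificity": tn / neg,
    }


n_total, n_active = 99_995, 199          # 199 actives is a hypothetical count
labels = [True] * n_active + [False] * (n_total - n_active)
scores = [0.0] * n_total                 # every compound scored as inactive
print(evaluate(labels, scores, theta=0.5))
```

Accuracy comes out at about 99.8% while sensitivity is 0, which is why accuracy alone says little on such data.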

Measuring performance
Receiver Operating Characteristic (ROC) curves
For all possible values of θ, report sensitivity and 1 − specificity
AUROC (Area under the ROC Curve) is a numerical measure of performance
AUROC(random) = 0.5 and AUROC(optimal) = 1
[Figure: ROC curves (true positive rate against false positive rate) for a random, a perfect and the real classifier; the points of the real curve are annotated with the thresholds Inf, 0.95, 0.94, 0.9, 0.81, 0.73, 0.52, 0.2, 0.17, 0.12, 0.09]

| label | prediction |
|---|---|
| + | 0.95 |
| - | 0.94 |
| + | 0.90 |
| + | 0.81 |
| - | 0.73 |
| - | 0.52 |
| - | 0.20 |
| + | 0.17 |
| - | 0.12 |
| - | 0.09 |
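The ROC curve of the ten-compound example (4 positives, 6 negatives) can be traced by lowering the threshold one prediction at a time. The sketch below assumes there are no tied scores and integrates with the trapezoidal rule; the function names are illustrative.

```python
def roc_points(labels, scores):
    """(false positive rate, true positive rate) points as the threshold decreases.

    labels: 1 for positive, 0 for negative. Assumes no tied scores.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold = +infinity: nothing predicted positive
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points


def auroc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


labels = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
scores = [0.95, 0.94, 0.90, 0.81, 0.73, 0.52, 0.20, 0.17, 0.12, 0.09]
print(auroc(roc_points(labels, scores)))  # approximately 0.75
```

The value 0.75 is also the fraction of (positive, negative) pairs in which the positive is ranked higher: 18 of 24.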

Measuring performance
Inhibition of DHFR: ROC Curves [Azencott et al., 2007]

| method | AUC |
|---|---|
| IRV | 0.71 |
| SVM | 0.59 |
| kNN | 0.59 |
| MAX-SIM | 0.54 |

[Figure: ROC curves (TPR against FPR, both from 0.0 to 1.0) for RANDOM, IRV, SVM and MAXSIM]

Measuring performance
Precision-recall curves
Precision = # True Positives / # Predicted Positives
Recall = sensitivity
[Figure: precision-recall curves (precision against recall) for a perfect and the real classifier; the points of the real curve are annotated with the thresholds 0.95, 0.94, 0.9, 0.81, 0.73, 0.52, 0.2, 0.17, 0.12, 0.09]
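Precision and recall at each threshold can be computed by walking down the ranked predictions. The labels and scores below are the ten-compound example used for the ROC curve (4 positives, 6 negatives); the function name is illustrative and ties between scores are assumed not to occur.

```python
def precision_recall_points(labels, scores):
    """(recall, precision) after each prediction, in decreasing score order.

    labels: 1 for positive, 0 for negative. Assumes no tied scores.
    """
    pos = sum(labels)
    tp = 0
    points = []
    for k, (_, y) in enumerate(sorted(zip(scores, labels), reverse=True), start=1):
        tp += y
        points.append((tp / pos, tp / k))  # k compounds are predicted positive so far
    return points


labels = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
scores = [0.95, 0.94, 0.90, 0.81, 0.73, 0.52, 0.20, 0.17, 0.12, 0.09]
for recall, precision in precision_recall_points(labels, scores):
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

At the lowest threshold every compound is predicted positive, so precision falls to the share of positives in the data, here 4/10.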

Other applications
Other applications of graph mining in chemoinformatics
- Database indexing and search
- Prediction of 3D structures of small compounds and proteins
- Reaction prediction

References and further reading
[Azencott et al., 2007] Azencott, C.-A., Ksikes, A., Swamidass, S. J., Chen, J. H., Ralaivola, L. and Baldi, P. (2007). One- to four-dimensional kernels for virtual screening and the prediction of physical, chemical, and biological properties. Journal of Chemical Information and Modeling 47, 965–974.
[Baldi et al., 2007] Baldi, P., Benz, R. W., Hirschberg, D. S. and Swamidass, S. J. (2007). Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval. Journal of Chemical Information and Modeling 47, 2098–2109.
[Ceroni et al., 2007] Ceroni, A., Costa, F. and Frasconi, P. (2007). Classification of small molecules by two- and three-dimensional decomposition kernels. Bioinformatics 23, 2038–2045.
[Helma et al., 2004] Helma, C., Cramer, T., Kramer, S. and De Raedt, L. (2004). Data mining and machine learning techniques for the identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds. Journal of Chemical Information and Computer Sciences 44, 1402–1411.
[Hinselmann et al., 2010] Hinselmann, G., Fechner, N., Jahn, A., Eckert, M. and Zell, A. (2010). Graph kernels for chemical compounds using topological and three-dimensional local atom pair environments. Neurocomputing 74, 219–229.
[Mahé et al., 2006] Mahé, P., Ralaivola, L., Stoven, V. and Vert, J.-P. (2006). The pharmacophore kernel for virtual screening with support vector machines. Journal of Chemical Information and Modeling 46, 2003–2014.
[Menchetti et al., 2005] Menchetti, S., Costa, F. and Frasconi, P. (2005). Weighted Decomposition Kernels. In Proceedings of the 22nd International Conference on Machine Learning, pp. 585–592, ACM, Bonn, Germany.
[Saigo et al., 2009] Saigo, H., Nowozin, S., Kadowaki, T., Kudo, T. and Tsuda, K. (2009). gBoost: a mathematical programming approach to graph classification and regression. Machine Learning 75, 69–89.
[Shervashidze et al., 2011] Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K. and Borgwardt, K. M. (2011). Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research 12, 2539–2561.