Robust Machine Learning QSPR Models for Recognizing High ...
1
Tobias Kind FiehnLab at UC Davis Genome Center
November 2006
Statistics evaluation and graphics with ChemAxon tools and Statistica and WEKA
towards QSPR and QSAR development
Free Academic Licenses for JChem and Instant JChem provided by ChemAxon
Academic License for Statistica Dataminer provided by Statsoft
GNU general public license for WEKA provided by WEKA Machine Learning Project
2
Metabolomics - The science of the small molecules
Compound classes:
• sugars
• amino acids
• steroids
• fatty acids
• lipids
• phospholipids
• organic acids ...
Molecules under investigation
3D model of a molecule with surface plot
Visit us! www.fiehnlab.ucdavis.edu
3
Techniques and tools
• Analytical techniques (LC-MS, GC-MS, FT-MS, NMR, IR)
• Bioinformatics, Cheminformatics
Liquid Chromatography (LC-MS)
Gas Chromatography (GC-MS)
Bioinformatics and Cheminformatics: Statistics (Statistica Dataminer), Open Source Tools
4
ChemAxon JChem now has PCA and PLS
1) Create a new library with the JCHEM Manager GUI (test case here: fingerprints)
2) Extract the fingerprints and do dimension reduction with principal component analysis (PCA) using the command line tool PCA.bat or pca.sh
5
ChemAxon JChem Principal Component Analysis (PCA)
PCA -d "sun.jdbc.odbc.JdbcOdbcDriver" -u "jdbc:odbc:jchem-z" -l test -p test -P JChemProperties -q "SELECT cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6, cd_fp7, cd_fp8, cd_fp9, cd_fp10, cd_fp11, cd_fp12, cd_fp13, cd_fp14, cd_fp15, cd_fp16 FROM nci99 WHERE cd_id <= 250000" -o PCA-scores.txt -t PCA-Eigenvalues.txt
TimeThis : Command Line : run-pca.bat
TimeThis : Start Time : Mon Nov 27 17:02:02 2006
TimeThis : End Time : Mon Nov 27 17:19:52 2006
TimeThis : Elapsed Time : 00:17:49.812
Start PCA by getting the data from the DB (here Access, but it can be Oracle, Derby, MySQL). Test case: 250,000 chemicals from the NCI DB.
PCA can be done on any descriptor: chemical fingerprints, BCUT etc. This is just a simple example made from the 16 standard fingerprints. Be sure to select only the descriptors you want (and not the molecule ID).
Problems here:
A) JDBC extraction is not tuned: DB extraction of the values takes nearly 2 minutes.
B) PCA calculation time is too long: 15 minutes for a 250,000 x 16 matrix.
The current PCA algorithm needs to be changed; it is very inefficient (faster matrix routines exist for Java).
Database extraction time with Statistica: 8 seconds. The same PCA with Statistica is finished in 1 second (no joke, that is a factor of 1:900).
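To illustrate why a 250,000 x 16 PCA need not be expensive, here is a minimal Python sketch (not the JChem or Statistica implementation). It assumes the common approach of reducing the data to a small p x p correlation matrix first and then solving the eigenproblem on that small matrix only. The function names, the synthetic data and the use of power iteration for the leading component are all assumptions made for illustration.

```python
import math
import random


def correlation_matrix(rows):
    """Build the p x p correlation matrix from n rows of p descriptors."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in rows) / (n - 1))
            for j in range(p)]
    corr = [[0.0] * p for _ in range(p)]
    for row in rows:
        # Standardize the row, then accumulate its outer product (upper triangle).
        z = [(row[j] - means[j]) / stds[j] for j in range(p)]
        for a in range(p):
            for b in range(a, p):
                corr[a][b] += z[a] * z[b] / (n - 1)
    for a in range(p):
        for b in range(a):
            corr[a][b] = corr[b][a]  # mirror into the lower triangle
    return corr


def leading_eigenpair(matrix, iterations=500):
    """Power iteration: largest eigenvalue and its unit-length eigenvector."""
    p = len(matrix)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(v[i] * mv[i] for i in range(p))  # Rayleigh quotient
    return eigenvalue, v


# Synthetic stand-in data: two correlated columns and two noise columns.
random.seed(0)
rows = []
for _ in range(500):
    shared = random.gauss(0, 1)
    rows.append([shared + random.gauss(0, 0.5), shared + random.gauss(0, 0.5),
                 random.gauss(0, 1), random.gauss(0, 1)])

corr = correlation_matrix(rows)
eigenvalue, loadings = leading_eigenpair(corr)
print("first eigenvalue:", round(eigenvalue, 3))
print("explained variance %:", round(100 * eigenvalue / len(corr), 2))
```

The number of molecules only enters the pass that builds the correlation matrix; the eigenproblem itself is 16 x 16 in the test case above, whatever the number of rows. Pure Python is of course slow; an optimized matrix library would do both steps far faster.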
6
JChem PCA output
Eigenvalues, % and cumulated variance (in rows) =
1.77 1.623 1.518 1.326 1.106 1.028 0.999 0.94 0.919 0.849 0.824 0.788 0.742 0.71 0.674 0.602 0.582
10.409 9.547 8.93 7.798 6.505 6.048 5.879 5.527 5.407 4.994 4.847 4.638 4.362 4.177 3.965 3.543 3.424
10.409 19.957 28.886 36.684 43.189 49.236 55.115 60.643 66.05 71.043 75.891 80.528 84.89 89.068 93.033 96.576 100
Loadings (in rows) =
0.191 0 -0.159 -0.17 0.306 0.617 -0.419 0.105 0.338 0.307 -0.304 -0.263 -0.324 0.348 -0.31 0.101 0.563
0.076 0.085 0.577 -0.117 0 0.128 0.084 -0.123 -0.255 -0.146 0.084 -0.682 0.335 -0.374 -0.63 0.11 0.063
0.182 -0.167 -0.049 0.181 -0.553 0.233 0.126 -0.016 -0.286 0.344 -0.535 -0.055 0.469 -0.035 0.235 -0.442 0.29
0.141 -0.572 0.077 -0.073 ...
PCA scores
-0.873 0.597 1.843 -0.131 0.204 -1.141 1.016 0.806 -0.263 0.221 0.208 1.54 1.704 -1.382 0.705 1.397
0.622 0.668 0.233 -0.175 0.748 0.801 1.087 1.366 -0.91 -1.369 0.192 1.919 -2.231 -0.218 1.043 1.13
-0.672 0.723 -1.015 -0.089 0.477 1.877 0.381 0.766 -0.59 -0.082 0.877 0.466 0.2 -0.397 1.189 1.308
-0.102 0.304 0.81 0.896 -1.853 0.435 0.551 0.32 -1.083 -0.439 0.346 1.081 0.557 -0.624 -0.042 -2.87
-0.835 -1.519 -0.705 1.147 -0.62 0.198 0.492 -0.34 -0.526 0.484 0.011 -0.456 -0.299 0.509 -1.294 -0.801
-0.947 0.455 -0.595 -0.673 -2.836 0.796 -0.631 0.353 -1.157 -1.519 0.957 0.966 1.113 0.919 0.368 -1.399
-0.215 -1.106 -1.638 -0.673 -1.918 -0.477 1.168 1.835 -0.755 -2.252 -0.962 -0.515 -0.722 1.023 2.844 1.349
0.353 -2.345 -0.737 0.808 1.778 0.343 -0.197 -0.221 -0.529 3.189 -1.481 -1.754 0.152 -0.881 -2.449 -0.649
0.622 0.301 0.928 -1.174 0.526 -0.322 -0.17 -0.589 0.233 -0.149 -0.783 -0.704 1.524 -1.547 -1.642 -1.085
0.981 0 ...
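The "%" and "cumulated variance" rows of the text output can be recomputed from the eigenvalue row alone. The short Python sketch below does this for the eigenvalues printed above; the variable names are illustrative. Small differences in the last digit are expected because the printed eigenvalues are already rounded.

```python
from itertools import accumulate

# Eigenvalue row as printed in the JChem PCA text output above.
eigenvalues = [1.77, 1.623, 1.518, 1.326, 1.106, 1.028, 0.999, 0.94, 0.919,
               0.849, 0.824, 0.788, 0.742, 0.71, 0.674, 0.602, 0.582]

total = sum(eigenvalues)
# Share of the total variance carried by each component, in percent.
percent = [100 * value / total for value in eigenvalues]
# Running sum of the percentages (the "cumulated variance" row).
cumulative = list(accumulate(percent))

for number, (value, pct, cum) in enumerate(zip(eigenvalues, percent, cumulative), 1):
    print(f"{number:2d}  {value:6.3f}  {pct:6.2f}  {cum:7.2f}")
```

Each percentage is simply the eigenvalue divided by the sum of all eigenvalues, so the cumulative row always ends at 100.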
Compared with the Statistica output, the JChem PCA results matrix is inverted and the values are multiplied by (-1).
Problem: currently no graphics, but multivariate statistics lives from graphics. The following simple graphic examples were made with Statistica or WEKA via a DB query.
7
The following slides show "what could be" in the future,
or "what can be done" right now.
Check the comprehensive statistics textbook at http://www.statsoft.com/textbook/stathome.html
8
PLS
Machine Learning (KNN)
Feature selection
Tree model
Neural Network
Cluster Analysis
Response curves
Machine learning and statistics tools
We use Statistica Dataminer as a comprehensive statistics work tool. WEKA and YALE are free but not (yet :-) as powerful as the Statistica Dataminer.
9
Connection of a JCHEM molecule DB via JDBC with Statistica
Time for query + copy of 4,000,000 values (250k molecules x 16 fingerprints) = 8 seconds.
Test system: JChem 3.2 with MS Access and Statistica Dataminer 7.1, Dual Opteron 2.8 GHz
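The extraction step can be sketched in Python as follows. This is only an illustration of the query pattern, not the JDBC setup used on the slide: an in-memory SQLite database stands in for the MS Access/JChem database, the table is filled with made-up numbers, and the helper name load_fingerprints is hypothetical. Only the table name nci99 and the column names cd_id and cd_fp1 to cd_fp16 are taken from the PCA query shown earlier.

```python
import sqlite3

# The 16 standard fingerprint columns from the PCA query (cd_fp1 ... cd_fp16).
FP_COLUMNS = [f"cd_fp{i}" for i in range(1, 17)]


def load_fingerprints(connection, table, max_id):
    """Return one list of 16 fingerprint values per molecule (no molecule ID)."""
    query = f"SELECT {', '.join(FP_COLUMNS)} FROM {table} WHERE cd_id <= ?"
    return [list(row) for row in connection.execute(query, (max_id,))]


# Stand-in database with a handful of fake rows (assumption for the demo).
connection = sqlite3.connect(":memory:")
column_defs = ", ".join(f"{name} INTEGER" for name in FP_COLUMNS)
connection.execute(f"CREATE TABLE nci99 (cd_id INTEGER PRIMARY KEY, {column_defs})")
for cd_id in range(1, 6):
    values = [cd_id] + [cd_id * 10 + i for i in range(16)]
    placeholders = ", ".join("?" for _ in values)
    connection.execute(f"INSERT INTO nci99 VALUES ({placeholders})", values)

matrix = load_fingerprints(connection, "nci99", max_id=3)
print(len(matrix), "molecules x", len(matrix[0]), "fingerprints")
```

Selecting the descriptor columns explicitly keeps cd_id out of the matrix, so the molecule ID cannot end up as a variable in the statistics. For the test case, 250,000 rows x 16 columns gives the 4,000,000 values quoted above.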
10
Statistica with JChem data
11
[Figure: Scree plot, eigenvalues of correlation matrix. x-axis: eigenvalue number; y-axis: eigenvalue (0.4 to 2.0). Explained variance per eigenvalue: 10.91%, 10.15%, 9.41%, 8.29%, 6.89%, 6.37%, 6.04%, 5.79%, 5.32%, 5.15%, 4.94%, 4.63%, 4.44%, 4.25%, 3.77%, 3.66%]
PCA Scree plot – determine optimal factors to retain
Visible Step
Four factors can be retained. The 16-dimensional space can be compressed into a 4-dimensional space. (The scree plot is not optimal here.)
Statistica Dataminer 7.1
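The "visible step" can also be located numerically. The Python sketch below uses one possible heuristic, assumed here for illustration: retain all factors before the largest drop between consecutive explained-variance values. The percentages are the ones shown in the Statistica scree plot; the variable names are illustrative.

```python
# Explained variance (%) per factor, as labelled in the Statistica scree plot.
percent = [10.91, 10.15, 9.41, 8.29, 6.89, 6.37, 6.04, 5.79,
           5.32, 5.15, 4.94, 4.63, 4.44, 4.25, 3.77, 3.66]

# Drop in explained variance from each factor to the next one.
drops = [percent[i] - percent[i + 1] for i in range(len(percent) - 1)]

# The largest drop marks the step; keep every factor before it.
step_index = max(range(len(drops)), key=drops.__getitem__)
factors_to_retain = step_index + 1
retained_variance = sum(percent[:factors_to_retain])

print("factors to retain:", factors_to_retain)
print("variance retained (%):", round(retained_variance, 2))
```

With these values the largest drop lies between factor 4 and factor 5, which agrees with the four factors read off the plot. Those four factors carry well under half of the total variance, which fits the remark that the scree plot is not optimal here.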
12
[Figure: Projection of the variables on the factor-plane (1 x 2). Active variables: cd_fp1 to cd_fp16. x-axis: Factor 1 (10.91%); y-axis: Factor 2 (10.15%); both axes from -1.0 to 1.0]
Which of the 16 fingerprints are similar? Those that "cluster" together are similar (fp_11 and fp_14). The variables fp_5 and fp_16 influence factor 1 in the same way. Variables inside or near the center (0,0) have no discrimination power. Remember: PCA is not a cluster analysis!
PCA Loadings plot – which variables are influential?
Statistica Dataminer 7.1
13
PCA Scores plot – picture of the reduced dimensionality.
The 16 fingerprints are compressed into 2D. We can use other high-dimensional descriptors for enhanced examples. Cases (molecules) which "cluster" together may have the same properties or functional groups (depending on input). Here we see that the KOW molecule set covers the whole NCI dataset, based on the 16 fingerprints.
Statistica Dataminer 7.1
14
The 16 fingerprints are compressed into 3D. The KOWWIN test set does not cover the whole molecular space of important silicon-containing molecules. You can also do an Overlap Analysis (compare two databases) within the all-new Instant-JChem.
PCA Scores 3D plot – KOWWIN versus silicon compound test set
[Figure: 3D scatterplot (Score spreadsheet (Spreadsheet215) 7v*22435c). Silicon compounds (red), KOWWIN set (blue)]
Statistica Dataminer 7.1
15
Statistica – Random Forest Machine learning
1024-DIM FC descriptor space
[Figure: Histogram of logP (Obs.). x-axis: logP (Obs.), -7 to 13; y-axis: number of observations, 0 to 2800]
[Figure: Importance plot, dependent variable: logP. Variables shown: Var418, Var494, Var569, Var739, Var188, Var856, Var561, Var704, Var900, Var585, Var468, Var314, Var787, Var78, Var678. y-axis: importance (F-value), 0 to 1600]
Chemical fingerprint descriptors generated with JCHEM GenerateMD.
GenerateMD performance: 1800 molecules/second for 1024-dimensional fingerprints on a Dual Opteron 2.8 GHz (only one core used).
Statistica generates all graphical output + SQL code
16
CART tree method for QSPR and QSAR
Classification trees, boosting trees, random forests, regression trees, honest trees and adaptive trees: lots of wood and forests. Have you heard about them?
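Each node of a regression tree such as the one on this slide reports the number of cases (N), the mean of the response (Mu) and its variance (Var), and each split is a threshold on one descriptor. The Python sketch below shows the basic idea of a single CART-style split under simplifying assumptions: one descriptor, made-up data, population variance, and a threshold placed midway between neighbouring values. It is not the Statistica implementation, and the function names are hypothetical.

```python
from statistics import mean, pvariance


def node_summary(y):
    """N, Mu and Var for the response values that fall into one node."""
    return {"N": len(y), "Mu": mean(y), "Var": pvariance(y)}


def best_split(x, y):
    """Find the threshold on x that minimizes the summed within-node variance."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two identical descriptor values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        # Weighted variance = sum of squared deviations inside both child nodes.
        cost = len(left) * pvariance(left) + len(right) * pvariance(right)
        if best is None or cost < best["cost"]:
            best = {"cost": cost, "threshold": threshold,
                    "left": left, "right": right}
    return best


# Made-up descriptor values and responses with one obvious step.
descriptor = [1, 2, 3, 10, 11, 12]
response = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]

split = best_split(descriptor, response)
print("parent:", node_summary(response))
print("split at <=", split["threshold"])
print("left:  ", node_summary(split["left"]))
print("right: ", node_summary(split["right"]))
```

A full tree repeats this search over all descriptors at every node and stops according to some pruning or minimum-size rule; forests and boosted trees build many such trees.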
[Figure: Tree graph for MTP, tree number 1. Number of non-terminal nodes: 33; number of terminal nodes: 34. Each node lists ID, N, Mu and Var; the root node is ID=1, N=2095, Mu=166.185442, Var=4271.107556. Split descriptors shown in the tree: AM1_dipole, PEOE_PC+, PEOE_VSA+2, ASA-, PEOE_VSA-0, PC+, E_vdw, AM1_HOMO, logP(o/w), zagreb, CASA-, PM3_HF, a_heavy, E, ASA_P, vsa_don, b_rotR, PEOE_VSA_PNEG, KierA3, PEOE_PC-, PEOE_VSA_FPNEG, PM3_E, KierFlex, KierA2, std_dim3, PM3_HOMO, b_1rotN, weinerPath, PC-, SMR_VSA4, SMR]
17
[Figure: Distance to the model Y. Number of components is 2; the normalized distance to the model is 0.742. x-axis: case (1 to 2451); y-axis: distance (0.0 to 4.0)]
[Figure: X loading scatter plot (p1 vs. p2). Both axes from -0.15 to 0.15]
Model class                        Specific model                                                        #
Generalized Linear Models (GLM)    General Discriminant Analysis                                         1
                                   Binary logit (logistic) regression                                    2
                                   Binary probit regression                                              3
Nonlinear model                    Multivariate adaptive regression splines (MARS)                       4
Tree models                        Standard Classification Trees (CART)                                  5
                                   Standard General Chi-square Automatic Interaction Detector (CHAID)    6
                                   Exhaustive CHAID                                                      7
                                   Boosting classification trees                                         8
Neural Networks                    Multilayer Perceptron neural network (MLP)                            9
                                   Radial Basis Function neural network (RBF)                            10
Machine Learning                   Support Vector Machines (SVM)                                         11
                                   Naive Bayes classifier                                                12
                                   k-Nearest Neighbors (KNN)                                             13
Other machine learning techniques from Statistica Dataminer that we use. Most of them work for classification and regression.
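As a small example of one technique from the table, k-nearest neighbors (KNN) can be used for regression on fingerprints. The Python sketch below is illustrative only: it assumes fingerprints stored as integers whose bits are the fingerprint bits, uses Tanimoto similarity as the (assumed) similarity measure, and works on made-up training values. It does not reproduce the Statistica or WEKA implementations.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two bit fingerprints stored as integers."""
    common = bin(a & b).count("1")
    total = bin(a | b).count("1")
    return common / total if total else 1.0


def knn_predict(query, training, k=3):
    """Average the property values of the k most similar training molecules."""
    ranked = sorted(training, key=lambda item: tanimoto(query, item[0]),
                    reverse=True)
    nearest = ranked[:k]
    return sum(value for _, value in nearest) / len(nearest)


# Made-up training set: (fingerprint bits, property value).
training = [
    (0b11110000, 1.0),
    (0b11100000, 1.2),
    (0b00001111, 4.0),
    (0b00011111, 4.4),
]

prediction = knn_predict(0b11111000, training, k=2)
print("predicted property:", round(prediction, 2))
```

The same function works as a classifier if the average is replaced by a majority vote over the neighbours' class labels.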
[Figure: Normal probability plot of residuals. Dependent variable: MTP; test set sample; number of trees: 100. x-axis: residual (-200 to 200); y-axis: normal quantile (-4 to 4)]
18
Now with the open source data mining tool WEKA
URL
SQL
Data
Easy: enter DB URL, enter SQL statement, import data. Try free AquaStudio for SQL!
Yellow =OK
19
WEKA - Machine learning algorithms in Java
20
WEKA – fingerprint visualization
21
Conclusions regarding statistics:
JChem PCA and PLS output (eigenvalues, scores, loadings) is provided only as a text file. More univariate and multivariate tools are needed.
1) JChem PCA and PLS results must have graphical output. (They must)
2) JChem PCA must be made faster (factor 600-1000) by using faster matrix routines.
3) Integration into Instant-JChem would be good; alternatively, ChemAxon could provide enhanced bundled statistics tools.
4) Currently, a JDBC query from JChem to other statistical packages such as WEKA, Statistica, R, MATLAB or YALE works perfectly. Each package works best in the field it was designed for.