1 Tobias Kind FiehnLab at UC Davis Genome Center November 2006 Statistics evaluation and graphics...

1

Tobias Kind FiehnLab at UC Davis Genome Center

November 2006

Statistics evaluation and graphics with ChemAxon tools, Statistica and WEKA: towards QSPR and QSAR development

Free Academic Licenses for JChem and Instant JChem provided by ChemAxon

Academic License for Statistica Dataminer provided by Statsoft

GNU general public license for WEKA provided by WEKA Machine Learning Project

2

Metabolomics - The science of the small molecules

Compound Classes:
• sugars
• amino acids
• steroids
• fatty acids
• lipids
• phospholipids
• organic acids ...

Molecules under investigation

3D model of a molecule with surface plot

Visit us! http://www.fiehnlab.ucdavis.edu/

3

Techniques and tools

• Analytical techniques (LC-MS, GC-MS, FT-MS, NMR, IR)
• Bioinformatics, Cheminformatics

Liquid Chromatography (LC-MS)

Gas Chromatography (GC-MS)

Bioinformatics and Cheminformatics: Statistics (Statistica Dataminer), Open Source Tools

4

ChemAxon JChem now has PCA and PLS

1) Create a new library with the JCHEM Manager GUI (test case here: fingerprints)

2) Extract fingerprints and do dimension reduction with principal component analysis (PCA) using the command line tool PCA.bat or pca.sh

5

ChemAxon JChem Principal Component Analysis (PCA)

PCA -d "sun.jdbc.odbc.JdbcOdbcDriver" -u "jdbc:odbc:jchem-z" -l test -p test -P JChemProperties -q "SELECT cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6, cd_fp7, cd_fp8, cd_fp9, cd_fp10, cd_fp11, cd_fp12, cd_fp13, cd_fp14, cd_fp15, cd_fp16 FROM nci99 WHERE cd_id <= 250000" -o PCA-scores.txt -t PCA-Eigenvalues.txt

TimeThis : Command Line : run-pca.bat
TimeThis : Start Time : Mon Nov 27 17:02:02 2006
TimeThis : End Time : Mon Nov 27 17:19:52 2006
TimeThis : Elapsed Time : 00:17:49.812

Start PCA by getting information from the DB (here Access, but it can be Oracle, Derby or MySQL). Test case: 250,000 chemicals from the NCI DB.

PCA can be done from any descriptor: chemical fingerprints, BCUT etc. This is just a simple example made from the 16 standard fingerprints. Be sure to select only the descriptors you want (and not the molecule ID).
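The column list in the query above can be generated rather than typed by hand. A minimal sketch in Python, assuming the fingerprint columns are named cd_fp1 ... cd_fpN as in the command above; the helper name build_fingerprint_query is hypothetical and not part of JChem:

```python
def build_fingerprint_query(table, n_fingerprints, max_id):
    """Build a SELECT that returns only the fingerprint columns (cd_id is not selected)."""
    columns = ", ".join(f"cd_fp{i}" for i in range(1, n_fingerprints + 1))
    return f"SELECT {columns} FROM {table} WHERE cd_id <= {int(max_id)}"

# Same table, column count and ID limit as in the PCA command line above.
query = build_fingerprint_query("nci99", 16, 250000)
print(query)
```

The molecule ID appears only in the WHERE clause, so it never enters the PCA matrix as a descriptor.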

Problems here:
A) JDBC extraction is not tuned: DB extraction of the values takes nearly 2 minutes.
B) PCA calculation time is too long: 15 minutes for a 250,000 x 16 matrix.

The current PCA algorithm needs to be changed; it is very inefficient (faster matrix routines exist for Java).

Database extraction time with Statistica: 8 seconds. The same PCA with Statistica is finished in 1 second (no joke, that's a factor of 1:900).
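For a tall, narrow matrix such as 250,000 x 16, most of the work in PCA can be one pass over the data to accumulate the 16 x 16 correlation matrix; the eigen-decomposition then runs on that small matrix only. A minimal sketch in pure Python, which is neither the JChem nor the Statistica implementation: the function names are illustrative, the data are random stand-ins, and cyclic Jacobi rotations are swapped in as one simple way to diagonalize a small symmetric matrix.

```python
import math
import random

def correlation_matrix(rows):
    """Accumulate sums in one pass over the data, then build the p x p correlation matrix."""
    n, p = len(rows), len(rows[0])
    sums = [0.0] * p
    prods = [[0.0] * p for _ in range(p)]
    for row in rows:
        for i in range(p):
            sums[i] += row[i]
            for j in range(i, p):
                prods[i][j] += row[i] * row[j]
    means = [s / n for s in sums]
    cov = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i, p):
            cov[i][j] = cov[j][i] = prods[i][j] / n - means[i] * means[j]
    sd = [math.sqrt(cov[i][i]) for i in range(p)]  # assumes no constant column
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(p)] for i in range(p)]

def jacobi_eigen(matrix, max_sweeps=50, tol=1e-20):
    """Eigenvalues and eigenvectors of a small symmetric matrix (cyclic Jacobi rotations)."""
    p = len(matrix)
    a = [row[:] for row in matrix]
    v = [[float(i == j) for j in range(p)] for i in range(p)]
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(p) for j in range(i + 1, p))
        if off < tol:
            break
        for i in range(p):
            for j in range(i + 1, p):
                if a[i][j] == 0.0:
                    continue
                if a[i][i] == a[j][j]:
                    theta = math.copysign(math.pi / 4, a[i][j])
                else:
                    theta = 0.5 * math.atan(2 * a[i][j] / (a[i][i] - a[j][j]))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(p):  # A <- A J (columns i and j)
                    x, y = a[k][i], a[k][j]
                    a[k][i], a[k][j] = c * x + s * y, -s * x + c * y
                for k in range(p):  # A <- J^T A (rows i and j)
                    x, y = a[i][k], a[j][k]
                    a[i][k], a[j][k] = c * x + s * y, -s * x + c * y
                for k in range(p):  # V <- V J accumulates the eigenvectors
                    x, y = v[k][i], v[k][j]
                    v[k][i], v[k][j] = c * x + s * y, -s * x + c * y
    order = sorted(range(p), key=lambda k: a[k][k], reverse=True)
    eigenvalues = [a[k][k] for k in order]
    eigenvectors = [[v[r][k] for r in range(p)] for k in order]  # one vector per component
    return eigenvalues, eigenvectors

random.seed(0)
rows = [[random.random() for _ in range(16)] for _ in range(2000)]  # synthetic stand-in data
corr = correlation_matrix(rows)
eigenvalues, eigenvectors = jacobi_eigen(corr)
print([round(ev, 3) for ev in eigenvalues])
```

The eigenvalues of a correlation matrix sum to the number of variables, which gives a quick sanity check on any PCA output. Computing the scores would need one further pass over the data.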

6

JChem PCA output

Eigenvalues, % and cumulated variance (in rows) =
1.77 1.623 1.518 1.326 1.106 1.028 0.999 0.94 0.919 0.849 0.824 0.788 0.742 0.71 0.674 0.602 0.582
10.409 9.547 8.93 7.798 6.505 6.048 5.879 5.527 5.407 4.994 4.847 4.638 4.362 4.177 3.965 3.543 3.424
10.409 19.957 28.886 36.684 43.189 49.236 55.115 60.643 66.05 71.043 75.891 80.528 84.89 89.068 93.033 96.576 100

Loadings (in rows) =
0.191 0 -0.159 -0.17 0.306 0.617 -0.419 0.105 0.338 0.307 -0.304 -0.263 -0.324 0.348 -0.31 0.101 0.563
0.076 0.085 0.577 -0.117 0 0.128 0.084 -0.123 -0.255 -0.146 0.084 -0.682 0.335 -0.374 -0.63 0.11 0.063
0.182 -0.167 -0.049 0.181 -0.553 0.233 0.126 -0.016 -0.286 0.344 -0.535 -0.055 0.469 -0.035 0.235 -0.442 0.29
0.141 -0.572 0.077 -0.073

PCA scores
-0.873 0.597 1.843 -0.131 0.204 -1.141 1.016 0.806 -0.263 0.221 0.208 1.54 1.704 -1.382 0.705 1.397
0.622 0.668 0.233 -0.175 0.748 0.801 1.087 1.366 -0.91 -1.369 0.192 1.919 -2.231 -0.218 1.043 1.13
-0.672 0.723 -1.015 -0.089 0.477 1.877 0.381 0.766 -0.59 -0.082 0.877 0.466 0.2 -0.397 1.189 1.308
-0.102 0.304 0.81 0.896 -1.853 0.435 0.551 0.32 -1.083 -0.439 0.346 1.081 0.557 -0.624 -0.042 -2.87
-0.835 -1.519 -0.705 1.147 -0.62 0.198 0.492 -0.34 -0.526 0.484 0.011 -0.456 -0.299 0.509 -1.294 -0.801
-0.947 0.455 -0.595 -0.673 -2.836 0.796 -0.631 0.353 -1.157 -1.519 0.957 0.966 1.113 0.919 0.368 -1.399
-0.215 -1.106 -1.638 -0.673 -1.918 -0.477 1.168 1.835 -0.755 -2.252 -0.962 -0.515 -0.722 1.023 2.844 1.349
0.353 -2.345 -0.737 0.808 1.778 0.343 -0.197 -0.221 -0.529 3.189 -1.481 -1.754 0.152 -0.881 -2.449 -0.649
0.622 0.301 0.928 -1.174 0.526 -0.322 -0.17 -0.589 0.233 -0.149 -0.783 -0.704 1.524 -1.547 -1.642 -1.085
0.981 0

Compared with Statistica, the JChem PCA results matrix is inverted and the values are multiplied by -1.
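The percentage and cumulated-variance rows of the JChem output follow directly from the eigenvalue row: each eigenvalue is divided by the sum of all eigenvalues, and the percentages are then summed up. A small sketch in Python using the eigenvalues printed above (the listing contains 17 values). Because they are rounded to three decimals, the recomputed numbers agree with the JChem rows only to about two decimals.

```python
# Eigenvalues copied from the first row of the JChem PCA output above.
eigenvalues = [1.77, 1.623, 1.518, 1.326, 1.106, 1.028, 0.999, 0.94, 0.919,
               0.849, 0.824, 0.788, 0.742, 0.71, 0.674, 0.602, 0.582]

total = sum(eigenvalues)
percent = [100.0 * ev / total for ev in eigenvalues]

# Running sum of the percentages gives the cumulated variance.
cumulative = []
running = 0.0
for pct in percent:
    running += pct
    cumulative.append(running)

for ev, pct, cum in zip(eigenvalues, percent, cumulative):
    print(f"{ev:6.3f} {pct:7.3f} {cum:8.3f}")
```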

Problem: Currently no graphics. But multivariate statistics lives on graphics. The following simple graphic examples were made with Statistica or WEKA via DB query.

7

The following slides show “what could be” in the future, or “what can be done” right now.

Check the pretty comprehensive statistics link: http://www.statsoft.com/textbook/stathome.html

8

PLS

Machine Learning (KNN)

Feature selection

Tree model

Neural Network

Cluster Analysis

Response curves

Machine learning and statistics tools

We use Statistica Dataminer as a comprehensive statistics work tool. WEKA and YALE are free, but not (yet :-) as powerful as Statistica Dataminer.

9

Connection of a JCHEM molecule DB via JDBC with Statistica

Time for query + copy of 4,000,000 values (250k molecules x 16 fingerprints) = 8 seconds.

Test system: JChem 3.2 with MS Access and Statistica Dataminer 7.1 on a dual Opteron 2.8 GHz.

10

Statistica with JChem data

11

[Figure: Scree plot, eigenvalues of the correlation matrix. x-axis: eigenvalue number; y-axis: eigenvalue. Variance explained per factor: 10.91%, 10.15%, 9.41%, 8.29%, 6.89%, 6.37%, 6.04%, 5.79%, 5.32%, 5.15%, 4.94%, 4.63%, 4.44%, 4.25%, 3.77%, 3.66%]

PCA Scree plot – determine optimal factors to retain

Visible Step

Four factors can be retained. The 16-dimensional space can be compressed into a 4-dimensional space. (The scree plot is not optimal here.)

Statistica Dataminer 7.1
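The "visible step" in the scree plot can also be located numerically. A minimal sketch in Python, assuming a simple largest-drop heuristic in place of visual inspection (this is not necessarily how Statistica or any other package chooses the number of factors); the percentages are the per-factor values from the scree plot on this slide.

```python
# Variance explained per factor, as labeled in the Statistica scree plot.
percent = [10.91, 10.15, 9.41, 8.29, 6.89, 6.37, 6.04, 5.79,
           5.32, 5.15, 4.94, 4.63, 4.44, 4.25, 3.77, 3.66]

# Drop in explained variance between each factor and the next one.
drops = [percent[i] - percent[i + 1] for i in range(len(percent) - 1)]

# Heuristic: keep all factors before the largest single drop.
step = max(range(len(drops)), key=lambda i: drops[i])
n_retained = step + 1
explained = sum(percent[:n_retained])

print(f"largest drop after factor {n_retained}: {drops[step]:.2f} percentage points")
print(f"variance explained by the first {n_retained} factors: {explained:.2f}%")
```

With these values the largest drop falls between factors 4 and 5, which matches the four retained factors, but those four factors together carry well under half of the total variance.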

12

[Figure: Projection of the variables on the factor plane (1 x 2). Active variables: cd_fp1 to cd_fp16. x-axis: Factor 1 (10.91%); y-axis: Factor 2 (10.15%); both axes from -1.0 to 1.0]

Which of the 16 fingerprints are similar? Those that “cluster” together are similar (fp_11 and fp_14). The variables fp_5 and fp_16 influence factor 1 in the same way. Variables inside or near the center (0,0) have no discrimination power. Remember: PCA is not a cluster analysis!

PCA Loadings plot – which variables are influential?


13

PCA Scores plot – picture of the reduced dimensionality.

The 16 fingerprints are compressed into 2D. We can use other high-dimensional descriptors for enhanced examples. Cases (molecules) which “cluster” together may have the same properties or functional groups (depending on the input). Here we see that the KOW molecule set covers the whole NCI dataset based on the 16 fps.


14

The 16 fingerprints are compressed into 3D. The KOWWIN test set does not cover the whole molecule space of important silicon-containing molecules. You can also do an Overlap Analysis (compare two databases) within the all-new Instant-JChem.

PCA Scores 3D plot – KOWWIN versus silicon compound test set

[Figure: 3D scatterplot (score spreadsheet, Spreadsheet215, 7v*22435c). SILICON compounds (red), KOWWIN set (blue)]


15

Statistica – Random Forest Machine learning

1024-DIM FC descriptor space

[Figure: Histogram of logP (Obs.). x-axis: logP (Obs.), -7 to 13; y-axis: number of observations, 0 to 2800]

[Figure: Importance plot, dependent variable: logP. Variables shown: Var418, Var494, Var569, Var739, Var188, Var856, Var561, Var704, Var900, Var585, Var468, Var314, Var787, Var78, Var678. y-axis: importance (F-value), 0 to 1600]

Chemical fingerprint descriptors generated with JCHEM GenerateMD. GenerateMD performance: 1800 molecules/second for 1024-dimensional fingerprints on a dual Opteron 2.8 GHz (only one core used).

Statistica generates all graphical output + SQL code

16

CART tree method for QSPR and QSAR

Classification trees, boosting trees, random forests, regression trees, honest trees and adaptive trees: lots of wood and forests. Have you heard of them?
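Regression trees of this kind summarize each node by its case count (N) and by the mean (Mu) and variance (Var) of the target, and they split on one descriptor threshold at a time. A minimal sketch in Python of how a single split could be chosen, assuming the common CART criterion of minimizing the size-weighted variance of the two child nodes. It runs on made-up toy values, is not Statistica's implementation, and best_split is a hypothetical helper.

```python
from statistics import fmean, pvariance

def best_split(x, y):
    """Return (threshold, weighted child variance) of the best binary split on x."""
    pairs = sorted(zip(x, y))
    best = (None, pvariance(y))  # start from the unsplit node
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold fits between equal descriptor values
        left = [t for _, t in pairs[:k]]
        right = [t for _, t in pairs[k:]]
        score = (len(left) * pvariance(left) + len(right) * pvariance(right)) / len(pairs)
        if score < best[1]:
            threshold = (pairs[k - 1][0] + pairs[k][0]) / 2  # midpoint between neighbors
            best = (threshold, score)
    return best

# Toy data: one descriptor and a target with an obvious jump (values are illustrative only).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [90.0, 100.0, 95.0, 200.0, 210.0, 205.0]

threshold, score = best_split(x, y)
left = [t for v, t in zip(x, y) if v <= threshold]
right = [t for v, t in zip(x, y) if v > threshold]
print(f"split at <= {threshold}: N={len(left)} Mu={fmean(left):.1f} | "
      f"N={len(right)} Mu={fmean(right):.1f}")
```

A full tree repeats this search over all descriptors and then recursively inside each child node until a stopping rule is met.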

[Figure: Tree graph for MTP, tree number 1. Number of non-terminal nodes: 33; number of terminal nodes: 34. Each node is labeled with ID, N, Mu and Var; root node: ID=1, N=2095, Mu=166.185442, Var=4271.107556. Split variables shown: AM1_dipole, PEOE_PC+, PEOE_VSA+2, ASA-, PEOE_VSA-0, PC+, E_vdw, AM1_HOMO, logP(o/w), zagreb, CASA-, PM3_HF, a_heavy, E, ASA_P, vsa_don, b_rotR, PEOE_VSA_PNEG, KierA3, PEOE_PC-, PEOE_VSA_FPNEG, PM3_E, KierFlex, KierA2, std_dim3, PM3_HOMO, b_1rotN, weinerPath, PC-, SMR_VSA4, SMR]

17

[Figure: Distance to the model Y. Number of components is 2; the normalized distance to the model is 0.742. x-axis: case; y-axis: distance, 0.0 to 4.0]

[Figure: X loading scatter plot (p1 vs. p2). Both axes from -0.15 to 0.15]

Model class                      Specific model                                                       #
Generalized Linear Models (GLM)  General Discriminant Analysis                                        1
                                 Binary logit (logistic) regression                                   2
                                 Binary probit regression                                             3
Nonlinear model                  Multivariate adaptive regression splines (MARS)                      4
Tree models                      Standard Classification Trees (CART)                                 5
                                 Standard General Chi-square Automatic Interaction Detector (CHAID)   6
                                 Exhaustive CHAID                                                     7
                                 Boosting classification trees                                        8
Neural Networks                  Multilayer Perceptron neural network (MLP)                           9
                                 Radial Basis Function neural network (RBF)                           10
Machine Learning                 Support Vector Machines (SVM)                                        11
                                 Naive Bayes classifier                                               12
                                 k-Nearest Neighbors (KNN)                                            13

Other machine learning techniques from Statistica Dataminer that we use. Most of them work for classification and regression.

[Figure: Normal probability plot of residuals. Dependent variable: MTP; test set sample; number of trees: 100. x-axis: residual, -200 to 200; y-axis: normal quantile, -4 to 4 (probabilities 0.01 to 0.99)]

18

Now with the open source data mining tool WEKA

URL

SQL

Data

Easy: enter DB URL, enter SQL statement, import data. Try free AquaStudio for SQL!
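The same three steps (connection target, SQL statement, data) apply to any database client. WEKA itself connects through JDBC; the sketch below uses Python's built-in sqlite3 module purely as a stand-in so that it runs without a JChem database. The table layout mimics the nci99 table from the PCA example, and the inserted values are invented for illustration.

```python
import sqlite3

# Step 1, "URL": open a connection (an in-memory SQLite database stands in for the JChem DB).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nci99 (cd_id INTEGER PRIMARY KEY, cd_fp1 INTEGER, cd_fp2 INTEGER)")
conn.executemany("INSERT INTO nci99 VALUES (?, ?, ?)",
                 [(1, 5, -3), (2, 17, 42), (3, 0, 8)])  # made-up fingerprint values

# Step 2, "SQL": select only the descriptor columns, not the molecule ID.
sql = "SELECT cd_fp1, cd_fp2 FROM nci99 WHERE cd_id <= 2 ORDER BY cd_id"

# Step 3, "Data": fetch the rows for the statistics package.
rows = conn.execute(sql).fetchall()
print(rows)
conn.close()
```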

Yellow = OK

http://www.aquafold.com/index.html

19

WEKA - Machine learning algorithms in Java

20

WEKA – fingerprint visualization

21

Conclusions regarding statistics:

JChem PCA and PLS output (eigenvalues, scores, loadings) is provided only as a text file. More univariate and multivariate tools are needed.

1) JChem PCA and PLS results must have graphical output. (They must.)

2) JChem PCA must be made faster (by a factor of 600-1000) by using faster math routines.

3) Integration into Instant-JChem would be good; alternatively, ChemAxon could provide enhanced bundled statistics tools.

4) Currently, JDBC queries from JChem to other statistical packages such as WEKA, Statistica, R, MATLAB or YALE work perfectly. Each package works best in the field it was designed for.
