Can we Predict Anything Useful from 2-D Molecular Structure?

Dr John Mitchell
Unilever Centre for Molecular Science Informatics
Department of Chemistry
University of Cambridge, U.K.


We look at data, analyse data, use data to find correlations ...

... to develop models ...

... and to make (hopefully) useful predictions.

Let’s look at some data ...


New York Times, 4th October 2005.


Happiness ≈ (GNP/$5000) - 1

Poor fit to linear model


(GNP/$5000) -2

Outliers?

Happiness


Fitting with a curve: reduce RMSE


Outliers?

Different linear models for different regimes


Only one obvious (to me) conclusion

This area is empty: no country is both rich and unhappy. All other combinations are observed.

Happiness (GNP/$5000) -2


... but this is nothing to do with 2-D molecular structure


QSPR

• Quantitative Structure Property Relationship

• Physical property related to more than one other variable

• First examples from Hansch et al. in the 1960s

• General form (for non-linear relationships):

y = f (descriptors)


QSPR

            Y           X1  X2  X3  X4  X5  X6
Molecule 1  Property 1  –   –   –   –   –   –
Molecule 2  Property 2  –   –   –   –   –   –
Molecule 3  Property 3  –   –   –   –   –   –
Molecule 4  Property 4  –   –   –   –   –   –
Molecule 5  Property 5  –   –   –   –   –   –
Molecule 6  Property 6  –   –   –   –   –   –
Molecule 7  Property 7  –   –   –   –   –   –

Y = f (X1, X2, ... , XN)

• Optimisation of Y = f(X1, X2, ... , XN) is called regression.
• Model is optimised upon N “training molecules” and then tested upon M “test” molecules.


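• Quality of the model is judged by three parameters:

$$\mathrm{Bias} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}}\right)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}}\right)^2}$$

$$r^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}}\right)^2}{\sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - y^{\mathrm{average}}\right)^2}$$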
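The three quality measures on this slide (Bias, RMSE and r2) can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the talk: it assumes the bias is taken as observed minus predicted, and the function names and the four example values are invented for the sketch.

```python
import math


def bias(obs, pred):
    """Mean signed error, assuming the convention observed minus predicted."""
    return sum(o - p for o, p in zip(obs, pred)) / len(obs)


def rmse(obs, pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def r_squared(obs, pred):
    """1 minus (residual sum of squares / total sum of squares about the mean)."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


# Made-up values, purely to exercise the functions.
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.1, 1.9, 3.2, 3.8]
print(bias(obs, pred), rmse(obs, pred), r_squared(obs, pred))
```

A bias near zero with a large RMSE indicates errors that cancel on average but are individually large, which is why the three numbers are reported together.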


• Different methods for carrying out regression:

• LINEAR - Multi-linear Regression (MLR), Partial Least Squares (PLS), Principal Component Regression (PCR), etc.

• NON-LINEAR - Random Forest, Support Vector Machines (SVM), Artificial Neural Networks (ANN), etc.


• However, this does not guarantee a good predictive model….


• Problems with experimental error.
• A QSPR equation is only as accurate as the data it is trained upon.
• Therefore, we are making experimental measurements of solubility (Dr Antonio Llinàs).


• Problems with “chemical space”.
• “Sample” molecules must be representative of “Population”.
• Prediction results will be most accurate for molecules similar to training set.
• Global or Local models?


Solubility is an important issue in drug discovery and a major source of attrition

This is expensive for the industry

A good model for predicting the solubility of druglike molecules would be very valuable.


Drug Disc. Today, 10 (4), 289 (2005)

Cohesive interactions in the lattice reduce solubility

Predicting lattice (or almost equivalently sublimation) energy should help predict solubility


Relationship of Chemical Structure With Lattice Energy

Can we predict lattice energy from molecular structure?

Dr Carole Ouvrard & Dr John Mitchell
Unilever Centre for Molecular Informatics
University of Cambridge

C Ouvrard & JBO Mitchell, Acta Cryst. B 59, 676-685 (2003)


Why Do We Need a Predictive Model?

A predictive model for sublimation energies will allow us to estimate accurately the cohesive energies of crystalline materials

From 2-D molecular structure only

Without knowing the crystal packing

Without expensive theoretical calculations

Should help predict solubility.


Why Do We Think it Will Work?

Accurately calculated lattice energies are usually very similar for many different possible crystal packings of a molecule.

Many molecules have a plurality of different experimentally observable polymorphs.

We hypothesise that, to a good approximation, cohesive energy depends only on 2-D structure.


[Figure: lattice energy (kJ/mol, about -92.0 to -98.0) against density (g/cc, 1.40 to 1.60) for candidate crystal packings, with symbols distinguishing space groups P-1, P21/c, P212121 and P21. The calculated lowest energy structure and the experimental crystal structure are marked.]


Expression for the Lattice Energy

U crystal = U molecule + U lattice

Theoretical lattice energy

– Crystal binding = Cohesive energy

Experimental lattice energy is related to -ΔH sublimation

ΔH sublimation = -U lattice - 2RT (Gavezzotti & Filippini)
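As a worked illustration of the relation on this slide, the sketch below rearranges it to U lattice = -ΔH sublimation - 2RT. The function name, the default temperature of 298.15 K and the example enthalpy of 100 kJ/mol are assumptions made for the example, not values from the talk.

```python
R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ mol^-1 K^-1


def lattice_energy_from_hsub(h_sub_kj_mol, temperature_k=298.15):
    """Rearranged slide relation: U_lattice = -H_sub - 2RT (all in kJ/mol)."""
    return -h_sub_kj_mol - 2.0 * R_KJ_PER_MOL_K * temperature_k


# Hypothetical sublimation enthalpy of 100 kJ/mol at room temperature.
print(lattice_energy_from_hsub(100.0))
```

At room temperature the 2RT term is roughly 5 kJ/mol, so it is small but not negligible next to the 7 to 10 kJ/mol standard errors quoted for the regression models.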


Partitioning of the Lattice Energy

U crystal = U molecule + U lattice

ΔH sublimation = -U lattice - 2RT

Partitioning the lattice energy in terms of structural contributions

Choice of the significant parameters

– number of atoms of each type?

– Number of rings, aromatics?

– Number of bonds of each type?

– Symmetry?

– Hydrogen bond donors and acceptors? Intramolecular?

We choose counts of atom type occurrences.


Analysis of the Sublimation Energy Data

Experimental data: ΔHsublimation

– NIST (National Institute of Standards and Technology, USA)

– Scientific literature

Atom Types

– SATIS codes: 10-digit connectivity code + bond types

– Each 2-digit code = atomic number

HN   01 07 99 99 99
HO   01 08 99 99 99
O=C  08 06 99 99 99
-O-  08 06 06 99 99

Typically, several similar SATIS codes are grouped to define an atom type.

Statistical analysis

– Multi-Linear Regression Analysis

– ΔHsub vs. # atoms of each type
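The four example codes on this slide suggest how the 10-digit connectivity part is laid out: the atom's own atomic number, then the atomic numbers of its bonded neighbours, padded with 99. The sketch below reproduces that layout only. The function name is hypothetical, the ascending sort of neighbours is an assumption (the slide's examples do not settle the ordering), and the "bond types" part mentioned on the slide is not modelled.

```python
def satis_connectivity_code(atomic_number, neighbour_atomic_numbers,
                            slots=4, filler=99):
    """Build a 5-field, 10-digit connectivity code for one atom.

    Field 1 is the atom itself; the remaining fields hold its neighbours
    (sorted ascending here, which is an assumption) padded with `filler`.
    """
    neighbours = sorted(neighbour_atomic_numbers)
    if len(neighbours) > slots:
        raise ValueError("too many neighbours for the available fields")
    fields = [atomic_number] + neighbours + [filler] * (slots - len(neighbours))
    return " ".join(f"{z:02d}" for z in fields)


print(satis_connectivity_code(1, [7]))     # hydrogen bonded to nitrogen (HN)
print(satis_connectivity_code(8, [6, 6]))  # ether oxygen (-O-)
```

Counting how often each code (or group of similar codes) occurs in a molecule then gives the atom-type counts used as regression descriptors.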


Training Dataset of Model Molecules

226 organic compounds (cumulative totals in brackets):

19 linear alkanes (19)

14 branched alkanes (33)

17 aromatics (50)

106 other non-H-bonders (156)

70 H-bond formers (226)

Non-specific interacting

– Hydrocarbons

– Nitrogen compounds

– Nitro-, CN, halogens,

– S, Se substituents

– Pyridine

Potential hydrogen bonding interactions

– Amides

– Carboxylic acids

– Amino acids…

[Figure: ΔHsublimation (experimental) / kJ mol-1 (0 to 200) against no. of C, N, O atoms (0 to 25), with series for amides, diamides, acids, diacids, amino acids and alkanes. Valine is highlighted with its structure.]


Study of Non-specific Interactions: Linear Alkanes

19 compounds: CH4 to C20H42

Limit for van der Waals interactions

ΔHsub = 7.955 C - 2.714

r2 = 0.977, s = 7.096 kJ/mol

[Figure: boiling point / °C (BPt, 0 to 750) and ΔHsub / kJ mol-1 (0 to 180) against no. of carbon atoms (0 to 20).]

Note odd-even variation in ΔHsub for this series.

Enthalpy of sublimation correlates with molecular size. Since linear alkanes interact non-specifically and without significant steric effects, this establishes a baseline for the analysis of more complex systems.


Include Branched Alkanes

Add 14 branched alkanes to dataset. The graph below highlights the reduction of sublimation enthalpy due to bulky substituents.

[Figure: ΔHsub / kJ mol-1 (0 to 200) against no. of carbon atoms (0 to 25), with labelled points C(CH3)4 and (C(CH3)3)3CH.]

33 compounds: CH4 to C20H42

ΔHsub = 7.724 Cnonbranched + 3.703

r2 = 0.959, s = 8.117 kJ/mol

If we also include the parameters for branched carbons, C3 & C4, the model doesn’t improve.


All Hydrocarbons: Include Aromatics

Add 17 aromatics to the dataset (note: we have no alkenes or alkynes).

50 compounds

ΔHsub = 7.680 Cnonbranched (aliphatic) + 6.185 Caromatic + 4.162

r2 = 0.958, s = 7.478 kJ/mol

As before, if we also include the parameters for branched carbons, C3 & C4, the model doesn’t improve.

[Figure: predicted value / kJ mol-1 against experimental value / kJ mol-1, both 0 to 200.]


All Non-Hydrogen-Bonded Molecules:

Add 106 non-hydrocarbons to the dataset.

Include elements H, C, N, O, F, S, Cl, Br & I.

156 compounds

ΔHsub predicted by 16 parameter model

r2 = 0.896, s = 9.976 kJ/mol

[Figure: predicted value / kJ mol-1 against experimental value / kJ mol-1, both 0 to 250.]

Parameters in model are counts of atom type occurrences.


General Predictive Model

Add 70 hydrogen bond forming molecules to the dataset.

226 compounds

ΔHsub predicted by 19 parameter model

r2 = 0.925, s = 9.579 kJ/mol

Parameters in model are counts of atom type occurrences.

[Figure: predicted value / kJ mol-1 against experimental value / kJ mol-1, both 0 to 250.]


Predictive Model Determined by MLRA

ΔHsublimation (kJ mol-1) = 6.942 + 20.141 HN + 30.172 HO + 3.127 F + 10.456 Cl + 12.926 Br + 19.763 I + 3.297 C3 – 3.305 C4 + 5.970 Caromatic + 7.631 Cnonbranched (aliphatic) + 7.341 CO + 19.676 CS + 11.415 Nnitrile + 8.953 Nnonnitrile + 8.466 NO + 18.249 Oether + 20.585 SO + 12.840 Sthioether

All these parameters are significantly larger than their standard errors.
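The regression equation on this slide is a weighted sum of atom-type counts, so it can be evaluated directly. The sketch below copies the intercept and the 18 coefficients from the slide; the function name and dictionary keys are hypothetical, and the example assumes (without confirmation from the talk) that all six carbons of n-hexane are typed as Cnonbranched and all six carbons of benzene as Caromatic.

```python
INTERCEPT = 6.942  # kJ/mol, constant term from the slide

# Coefficients in kJ/mol per atom-type occurrence, copied from the slide.
COEFFICIENTS = {
    "HN": 20.141, "HO": 30.172, "F": 3.127, "Cl": 10.456, "Br": 12.926,
    "I": 19.763, "C3": 3.297, "C4": -3.305, "Caromatic": 5.970,
    "Cnonbranched": 7.631, "CO": 7.341, "CS": 19.676, "Nnitrile": 11.415,
    "Nnonnitrile": 8.953, "NO": 8.466, "Oether": 18.249, "SO": 20.585,
    "Sthioether": 12.840,
}


def predict_hsub(atom_type_counts):
    """Estimate the sublimation enthalpy (kJ/mol) from atom-type counts."""
    unknown = set(atom_type_counts) - set(COEFFICIENTS)
    if unknown:
        raise KeyError(f"unknown atom types: {sorted(unknown)}")
    return INTERCEPT + sum(
        COEFFICIENTS[name] * count for name, count in atom_type_counts.items()
    )


# Assumed typings, for illustration only.
print(predict_hsub({"Cnonbranched": 6}))  # n-hexane under the assumed typing
print(predict_hsub({"Caromatic": 6}))     # benzene under the assumed typing
```

The only molecule-specific input is the table of counts, which is what makes the model usable from 2-D structure alone.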


Distribution of Residuals

The distribution of the residuals between calculated and experimental data follows an approximately normal distribution, as expected.

[Figure: histogram of no. of observations (0 to 60) against residuals (-30 to 30).]


Validation on an Independent Test Set

35 diverse compounds

r2 = 0.928

s = 7.420 kJ/mol

[Figure: ΔHsub (predicted) / kJ mol-1 against ΔHsub (experimental) / kJ mol-1, both 0 to 200. An outlying point is labelled with a structure carrying CH3 and NO2 groups.]

Nitro-compounds are often outliers

Very encouraging result: accurate prediction possible.


Conclusions

We have determined a general equation allowing us to estimate the sublimation enthalpy for a large range of organic compounds with an estimated error of 9 kJ/mol.

A very simple model (counts of atom types) gives a good prediction of lattice & sublimation energies.

Lattice energy can be predicted from 2D structure, without knowing the details of the crystal packing.

Avoids need for expensive calculations.

May help predict solubility.

Model gives good chemical insight.


A Chemoinformatics Approach To Predicting the Aqueous Solubility of Pharmaceutical Molecules

David Palmer & Dr John Mitchell

Unilever Centre for Molecular Science Informatics
Department of Chemistry
University of Cambridge, U.K.


Pfizer Project: P13
Novel Methods for Predicting Solubility

• David Palmer
• Dr Antonio Llinàs
• Pfizer Institute for Pharmaceutical Materials Science
• http://www.msm.cam.ac.uk/pfizer


Datasets

• Compiled from Huuskonen dataset and AquaSol database
• All molecules solid at R.T.
• n = 1000 molecules

• Aqueous solubility – the thermodynamic solubility in unbuffered water (at 25 °C)


Diversity-Conserving Partitioning

• MACCS Structural Key fingerprints

• Tanimoto coefficient

• MaxMin Algorithm

Full dataset n = 1000 molecules

Training n = 670 molecules

Test n = 330 molecules
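The slide names the ingredients of the partitioning (fingerprints, the Tanimoto coefficient and the MaxMin algorithm) without giving details. The sketch below shows one common form of MaxMin selection over fingerprints represented as sets of on-bit indices. It is an illustration under assumptions: the function names, the choice of the first molecule as seed, and the toy fingerprints are all invented here, and the slide does not say whether the diverse picks formed the training set or the test set.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def maxmin_select(fingerprints, n_select, seed_index=0):
    """Greedy MaxMin: repeatedly pick the molecule whose nearest already
    selected neighbour is furthest away (distance = 1 - Tanimoto)."""
    if not 1 <= n_select <= len(fingerprints):
        raise ValueError("n_select must be between 1 and the dataset size")
    selected = [seed_index]
    # Distance from every molecule to its nearest selected molecule so far.
    min_dist = [1.0 - tanimoto(fp, fingerprints[seed_index])
                for fp in fingerprints]
    while len(selected) < n_select:
        candidates = [i for i in range(len(fingerprints)) if i not in selected]
        best = max(candidates, key=lambda i: min_dist[i])
        selected.append(best)
        for i, fp in enumerate(fingerprints):
            d = 1.0 - tanimoto(fp, fingerprints[best])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy fingerprints standing in for MACCS keys.
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}]
print(maxmin_select(fps, 3))
```

Because each pick maximises the distance to the molecules already chosen, the selected subset spans the chemical space of the full dataset rather than clustering in its densest region.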


Structures & Descriptors

3D structures from Concord
Minimised with MMFF94
MOE descriptors 2D/3D

Separate analysis of 2D and 3D descriptors
QuaSAR Contingency Module (MOE)
52 descriptors selected


Multi-Linear Regression

Log S = 0.07 nHDon (+/-0.018)
      - 0.21 TPSA (+/-0.033)
      + 0.11 MAXDP (+/-0.022)
      - 0.22 n.Ct (+/-0.019)
      - 0.29 KierFlex (+/-0.032)
      - 0.59 SLOGP (+/-0.036)
      - 0.26 ATS2m (+/-0.026)
      + 0.25 RBN (+/-0.033)

             R2     RMSE   Bias
10-fold CV   0.85   0.79    0.00
Train        0.87   0.78    0.00
Test         0.85   0.82   -0.01

Descriptor  Meaning                                        Property
SLOGP       Partition coefficient                          Lipophilicity
TPSA        Polar Surface Area                             Molecular Charge
MAXDP       Maximal Electrotopological positive variation  Molecular Charge
n.Ct        Number of Tertiary Carbons                     Molecular Size
ATS2m       "Broto-Moreau Autocorrelation"                 Molecular Size/Polarizability
KierFlex    Kier Flexibility Index                         Molecular Flexibility
RBN         Number of Rotatable Bonds                      Molecular Flexibility
nHDon       Number of Hydrogen Bond Donors

We can do better than this with other methods ...


Two More Methods of Prediction

(1) Random Forest handles both selection and regression.

(2a) Ant Colony Optimisation algorithm selection was used for Support Vector Machine regression.

(2b) Support Vector Machine regression was repeated with “Intelligent trial and error” selection.


Random Forest: Introduction

• Introduced by Breiman and Cutler (2001)
• Development of Decision Trees (Recursive Partitioning):

• Dataset is partitioned into consecutively smaller subsets (of similar solubility)

• Each partition is based upon the value of one descriptor

• The descriptor used at each split is selected so as to minimise the MSE
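The split rule described in these bullets can be sketched for a single node: try each descriptor and each threshold, and keep the split that gives the lowest size-weighted MSE in the two child subsets. This is a minimal illustration with invented names and toy data. It searches all descriptors exhaustively, whereas a Random Forest tries only a random subset at each split.

```python
def mse(values):
    """Mean squared deviation of the values from their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(rows, targets):
    """Return (descriptor_index, threshold, weighted_mse) of the best split.

    rows: one list of descriptor values per molecule.
    targets: the property (e.g. log S) of each molecule.
    """
    n = len(rows)
    best = None
    for j in range(len(rows[0])):
        values = sorted(set(row[j] for row in rows))
        # Candidate thresholds lie midway between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2.0
            left = [t for row, t in zip(rows, targets) if row[j] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[j] > threshold]
            score = (len(left) * mse(left) + len(right) * mse(right)) / n
            if best is None or score < best[2]:
                best = (j, threshold, score)
    return best


# Toy data: descriptor 0 separates the two solubility groups cleanly.
rows = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
targets = [-1.0, -1.2, -3.0, -3.1]
print(best_split(rows, targets))
```

Applying the same search recursively to each child subset grows the full decision tree.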


Random Forest: Method

• Random Forest is a collection of Decision Trees grown with the CART algorithm.

• Standard Parameters:
• 500 decision trees
• No pruning back: Minimum node size > 5
• “mtry” descriptors tried at each split

Important features:
• Incorporates descriptor selection
• Incorporates “Out-of-bag” validation
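“Out-of-bag” validation relies on the fact that each tree is grown on a bootstrap sample, so some molecules are left out of that tree's training data and can be used to test it. The sketch below shows only that sampling step; the function name is hypothetical, and the sample size of 670 simply echoes the training-set size quoted earlier in the talk.

```python
import random


def bootstrap_with_oob(n_samples, rng):
    """Draw a bootstrap sample (with replacement) and list the indices that
    were never drawn, i.e. the out-of-bag molecules for this tree."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed so the sketch is repeatable
in_bag, out_of_bag = bootstrap_with_oob(670, rng)
print(len(set(in_bag)), len(out_of_bag))
```

On average a little over a third of the molecules are out of bag for any one tree, so averaging each molecule's predictions over the trees that did not see it gives an internal error estimate without a separate validation set.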


Random Forest: Results

                   RMSE   r2     Bias
Test (te)          0.69   0.89   -0.04
Training (tr)      0.27   0.98    0.005
Out-of-bag (oob)   0.68   0.90    0.01


Support Vector Machines

[1] V. Vapnik, Estimation of Dependences Based on Empirical Data, Nauka, 1979 [in Russian]
[2] V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, 1995.

"In SVM regression, the input is first mapped onto a m-dimensional feature space using a fixed (non-linear) mapping, and then a linear model is constructed in this feature space. The linear model (in the feature space) is given by:"

$$f(\mathbf{x}, \omega) = \sum_{j=1}^{m} \omega_j \, g_j(\mathbf{x}) + b$$

• Kernel Function
• ε - "Over-fitting"
• "Support Vectors"
• C - cost - "Outliers"
• γ - Kernel parameter
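Two of the quantities listed on this slide can be made concrete in a short sketch. The ε-insensitive loss is the standard loss in SVM regression: errors smaller than ε cost nothing, which is how ε controls the tolerance of the fit. The kernel shown is the radial basis function, chosen here only as an assumption because it has a single parameter γ; the slide does not state which kernel was used. Function names are illustrative.

```python
import math


def epsilon_insensitive_loss(y_obs, y_pred, epsilon):
    """Zero inside the epsilon tube, linear in the excess error outside it."""
    return max(0.0, abs(y_obs - y_pred) - epsilon)


def rbf_kernel(x, z, gamma):
    """Radial basis function kernel exp(-gamma * ||x - z||^2) (assumed form)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


print(epsilon_insensitive_loss(1.0, 1.05, 0.1))  # inside the tube
print(epsilon_insensitive_loss(1.0, 1.5, 0.1))   # outside the tube
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], 0.5))
```

The cost C then weights the total loss against model flatness, so a large C penalises points outside the tube (including outliers) more heavily.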


SVM: Descriptor Selection

Descriptor        RMSE(CV)
SlogP
SMR               0.82
(KierFlex         0.82)
(PEOE_VSA_HYD     0.82)
(PEOE_VSA_NEG     0.81)
TPSA              0.785
(a_don            0.78)
a_acc             0.755
b_rotN            0.71

• Stepwise selection of descriptors: “intelligent trial & error”

Gives five descriptor model with RMSE (test set) = 0.71

Ant colony descriptor selection algorithm gives 20 descriptors and RMSE (test set) = 0.70


Support Vector Machines: Results

        RMSE   r2     Bias
CV      0.71   0.88   -0.001
Test    0.71   0.88    0.02


2D or 3D Molecular Descriptors?

• No improvement from models containing 3D descriptors

[Figure: plot panels annotated R = 0.88, R = 1.00 (2 d.p.), R = 0.95 and R = 0.88.]


Conclusions

• Two methods so far have produced good models:

a. Random Forest

b. Support Vector Machines

• Accurate experimental data necessary to improve models

• Random Forest valuable for QSPR modelling


Other work

• Linking Enthalpy of Sublimation (Carole) and Solubility (David) studies.

• Prediction of Melting Point.

• Chemoinformatics of prohibited substances in sport.

• Scoring functions for virtual screening.

• Repertoire of enzyme-catalysed reactions (MACiE).


People

Pfizer
Dr Hua Gao
Dr Tony Auffret

University of Cambridge
Prof. Robert Glen
Dr Jonathan Goodman
Dr Antonio Llinàs
Dr Noel O’Boyle

Acknowledgements

Funding

Centre: Unilever

David Palmer: Pfizer

Carole Ouvrard: University of Nantes, France.


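Ant Colony Optimisation Algorithm

Variable selection based on probability:

$$p_i^k = \frac{\tau_{i1}}{\tau_{i0} + \tau_{i1}}$$

$\tau_{i0}$: Level of Inhibitory Pheromone

$\tau_{i1}$: Level of Activator Pheromone

Updating rules:

$$\tau_{i0}(\mathrm{new}) = \rho \, \tau_{i0}(\mathrm{old}) + \sum_{k=1}^{m} \Delta\tau_{i0}^{k}$$

$$\tau_{i1}(\mathrm{new}) = \rho \, \tau_{i1}(\mathrm{old}) + \sum_{k=1}^{m} \Delta\tau_{i1}^{k}$$

where $\sum_{k=1}^{m} \Delta\tau_{i1}^{k}$ is the increment of pheromone left on each descriptor in given cycle.

Extra slide 1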


Ant Colony Optimisation Algorithm

if kth ant selected variable i both in current iteration and global best solution

if kth ant selected variable i only in current iteration

if variable i was not selected in either current iteration or global best solution

if kth ant did not select variable i in either the current iteration or its global best solution

if kth ant did not select variable i in the current iteration

Hi FF 1

Fi 1

Hi F 1

Hi FF 0

Fi 1

Hi F 1 if kth ant did not select variable i in its global best solution

Extra slide 2


Correlation diagram

         SlogP   SMR    TPSA   a_acc  b_rotN
SlogP     1      0.61  -0.58  -0.27  -0.06
SMR       0.61   1      0.04   0.31   0.45
TPSA     -0.58   0.04   1      0.65   0.5
a_acc    -0.27   0.31   0.65   1      0.47
b_rotN   -0.06   0.45   0.5    0.47   1

Extra slide 3


Distributions in dataset

Extra slide 4


MLR


Descriptor  Coefficient  Std. error  t value  p (t)      F value  p (F)
nHDon        0.07        0.018         3.7    2.52E-04    261.8   < 2.2e-16
TPSA        -0.21        0.033        -6.3    5.65E-10    425.2   < 2.2e-16
MAXDP        0.11        0.022         5.2    3.12E-07     32.8   1.66E-08
n.Ct        -0.22        0.019       -11.8    0.00000     890.7   < 2.2e-16
KierFlex    -0.29        0.032        -9.2    0.00000     847.8   < 2.2e-16
SLOGP       -0.59        0.036       -16.4    0.00000    1202.7   < 2.2e-16
ATS2m       -0.26        0.026       -10.1    0.00000     142.0   < 2.2e-16
RBN          0.25        0.033         7.4    3.56E-13     55.4   3.56E-13

Extra slide 5