
Can we Predict Anything Useful from 2-D Molecular Structure?

Dr John Mitchell
Unilever Centre for Molecular Science Informatics
Department of Chemistry
University of Cambridge, U.K.

We look at data, analyse data, use data to find correlations ...

... to develop models ...

... and to make (hopefully) useful predictions.

Let’s look at some data ...

New York Times, 4th October 2005.

Happiness ≈ (GNP/$5000) − 1

Poor fit to linear model

(GNP/$5000) -2

Outliers?

Happiness

Fitting with a curve: reduce RMSE

Outliers?

Different linear models for different regimes

Only one obvious (to me) conclusion

This area is empty: no country is both rich and unhappy. All other combinations are observed.

Happiness (GNP/$5000) -2

... but this is nothing to do with 2-D molecular structure

• Quantitative Structure Property Relationship

• Physical property related to more than one other variable

• First example from Hansch et al., 1960s

• General form (for non-linear relationships):

y = f (descriptors)

QSPR

              Y            X1   X2   X3   X4   X5   X6
Molecule 1    Property 1   –    –    –    –    –    –
Molecule 2    Property 2   –    –    –    –    –    –
Molecule 3    Property 3   –    –    –    –    –    –
Molecule 4    Property 4   –    –    –    –    –    –
Molecule 5    Property 5   –    –    –    –    –    –
Molecule 6    Property 6   –    –    –    –    –    –
Molecule 7    Property 7   –    –    –    –    –    –

Y = f (X1, X2, ... , XN )

• Optimisation of Y = f(X1, X2, ... , XN) is called regression.

• Model is optimised upon N “training molecules” and then tested upon M “test” molecules.

              Y            X1   X2   X3   X4   X5   X6
Molecule 1    Property 1   –    –    –    –    –    –
Molecule 2    Property 2   –    –    –    –    –    –
Molecule 3    Property 3   –    –    –    –    –    –
Molecule 4    Property 4   –    –    –    –    –    –
Molecule 5    Property 5   –    –    –    –    –    –
Molecule 6    Property 6   –    –    –    –    –    –
Molecule 7    Property 7   –    –    –    –    –    –
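As a minimal sketch of the train-then-test workflow described above, the following fits a one-descriptor linear model on a few training molecules and checks it on held-out test molecules. All numbers are made up for illustration and are not data from the talk; the names `fit_line`, `train_x` and so on are hypothetical.

```python
# Illustrative only: the numbers below are invented, not data from the talk.

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with a single descriptor x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Hypothetical training molecules: one descriptor X1 and a property Y.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [3.1, 4.9, 7.2, 8.8, 11.0]

# Hypothetical test molecules, never seen during fitting.
test_x = [6.0, 7.0]
test_y = [13.1, 14.8]

slope, intercept = fit_line(train_x, train_y)
test_pred = [slope * x + intercept for x in test_x]
test_rmse = (sum((p - y) ** 2 for p, y in zip(test_pred, test_y)) / len(test_y)) ** 0.5
print(f"Y = {slope:.2f} * X1 + {intercept:.2f}; test RMSE = {test_rmse:.2f}")
```

The point of the split is that the test error is computed only on molecules that played no part in choosing the coefficients.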

• Quality of the model is judged by three parameters:

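$r^2 = 1 - \sum_i (y_i - y_i^{obs})^2 / \sum_i (y_i^{obs} - y^{average})^2$

$\mathrm{RMSE} = \sqrt{\sum_i (y_i - y_i^{obs})^2 / n}$

$\mathrm{Bias} = \sum_i (y_i - y_i^{obs}) / n$

Here $y_i$ is the predicted value for molecule i, $y_i^{obs}$ the observed value, $y^{average}$ the mean observed value and n the number of molecules.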
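The three quality measures (reported later in the talk as r2, RMSE and Bias) could be computed along these lines. This is a sketch only: the sign convention for the bias (predicted minus observed) is an assumption, and the example values are invented.

```python
import math

def r_squared(pred, obs):
    """1 - (residual sum of squares) / (total sum of squares about the mean)."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((p - o) ** 2 for p, o in zip(pred, obs))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

def rmse(pred, obs):
    """Root mean squared error."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

def bias(pred, obs):
    """Mean signed error; predicted minus observed is an assumed convention."""
    return sum(p - o for p, o in zip(pred, obs)) / len(obs)

# Invented example: every prediction is 0.5 too high.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
print(r_squared(predicted, observed), rmse(predicted, observed), bias(predicted, observed))
```

A model can have a small bias and still a large RMSE, which is why the measures are reported together.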

• Different methods for carrying out regression:

• LINEAR - Multi-linear Regression (MLR), Partial Least Squares (PLS), Principal Component Regression (PCR), etc.

• NON-LINEAR - Random Forest, Support Vector Machines (SVM), Artificial Neural Networks (ANN), etc.

• However, this does not guarantee a good predictive model….

• Problems with experimental error.

• A QSPR equation is only as accurate as the data it is trained upon.

• Therefore, we are making experimental measurements of solubility (Dr Antonio Llinàs).

• Problems with “chemical space”.

• “Sample” molecules must be representative of “Population”.

• Prediction results will be most accurate for molecules similar to training set.

• Global or Local models?

Solubility is an important issue in drug discovery and a major source of attrition

This is expensive for the industry

A good model for predicting the solubility of druglike molecules would be very valuable.

Drug Disc. Today, 10 (4), 289 (2005)

Cohesive interactions in the lattice reduce solubility

Predicting lattice (or almost equivalently sublimation) energy should help predict solubility

Relationship of Chemical Structure With Lattice Energy

Can we predict lattice energy from molecular structure?

Dr Carole Ouvrard & Dr John Mitchell
Unilever Centre for Molecular Informatics
University of Cambridge

C Ouvrard & JBO Mitchell, Acta Cryst. B 59, 676-685 (2003)

Why Do We Need a Predictive Model?

A predictive model for sublimation energies will allow us to estimate accurately the cohesive energies of crystalline materials

From 2-D molecular structure only

Without knowing the crystal packing

Without expensive theoretical calculations

Should help predict solubility.

Why Do We Think it Will Work?

Accurately calculated lattice energies are usually very similar for many different possible crystal packings of a molecule.

Many molecules have several different experimentally observable polymorphs.

We hypothesise that, to a good approximation, cohesive energy depends only on 2-D structure.

[Figure: calculated lattice energy (kJ/mol) against density (g/cc, 1.40 to 1.60) for possible crystal packings in space groups P-1, P21/c, P212121 and P21; the calculated lowest energy structure and the experimental crystal structure are marked.]

Expression for the Lattice Energy

U crystal = U molecule + U lattice

Theoretical lattice energy

– Crystal binding = Cohesive energy

Experimental lattice energy is related to −ΔH sublimation

ΔH sublimation = −U lattice − 2RT (Gavezzotti & Filippini)
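To give a feel for the size of the 2RT term in the relation above, here is a small sketch that converts a lattice energy into a sublimation enthalpy. The lattice energy of -100 kJ/mol and the temperature of 298.15 K are hypothetical inputs chosen for illustration; the function name is likewise an assumption.

```python
# Gas constant in kJ mol^-1 K^-1.
R = 8.314462618e-3

def sublimation_enthalpy(u_lattice_kj_mol, temperature_k=298.15):
    """Sublimation enthalpy = -U_lattice - 2RT, all in kJ/mol."""
    return -u_lattice_kj_mol - 2.0 * R * temperature_k

two_rt = 2.0 * R * 298.15                 # about 4.96 kJ/mol at room temperature
example = sublimation_enthalpy(-100.0)    # hypothetical lattice energy
print(f"2RT = {two_rt:.2f} kJ/mol; sublimation enthalpy = {example:.2f} kJ/mol")
```

At room temperature the correction is about 5 kJ/mol, smaller than the standard errors of the regression models quoted later, which is consistent with the talk treating lattice and sublimation energies as almost equivalent.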

Partitioning of the Lattice Energy

U crystal = U molecule + U lattice

ΔH sublimation = −U lattice − 2RT

Partitioning the lattice energy in terms of structural contributions

Choice of the significant parameters

– Number of atoms of each type?

– Number of rings, aromatics?

– Number of bonds of each type?

– Symmetry?

– Hydrogen bond donors and acceptors? Intramolecular?

We choose counts of atom type occurrences.

Analysis of the Sublimation Energy Data

Experimental data: ΔHsublimation

Atom Types

– SATIS codes: 10-digit connectivity code + bond types

– Each 2-digit code = atomic number

Atom type   SATIS code
HN          01 07 99 99 99
HO          01 08 99 99 99
O=C         08 06 99 99 99
-O-         08 06 06 99 99
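The four example codes above are consistent with a simple scheme: the atom's own atomic number, then the atomic numbers of up to four bonded neighbours, padded with 99. A sketch of that scheme follows. The function name and the ascending ordering of neighbours are assumptions (the examples shown do not reveal the ordering rule), and the bond-type information mentioned on the slide is not modelled here.

```python
def satis_code(atomic_number, neighbour_atomic_numbers):
    """Build a 10-digit connectivity code as five 2-digit fields.

    Field 1 is the atom itself; fields 2-5 are its bonded neighbours,
    padded with 99. Sorting neighbours in ascending order is an assumption.
    """
    if len(neighbour_atomic_numbers) > 4:
        raise ValueError("this sketch handles at most four neighbours")
    padding = [99] * (4 - len(neighbour_atomic_numbers))
    fields = [atomic_number] + sorted(neighbour_atomic_numbers) + padding
    return " ".join(f"{z:02d}" for z in fields)

print(satis_code(1, [7]))      # hydrogen bonded to nitrogen (HN)
print(satis_code(1, [8]))      # hydrogen bonded to oxygen (HO)
print(satis_code(8, [6]))      # oxygen with one carbon neighbour (O=C)
print(satis_code(8, [6, 6]))   # oxygen with two carbon neighbours (-O-)
```

Because the connectivity code alone cannot distinguish, say, a carbonyl oxygen from any other oxygen with a single carbon neighbour, the bond types carry the remaining information.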

Statistical analysis

Multi-Linear Regression Analysis

Hsub # atoms of each type

Typically, several similar SATIS codes are grouped to define an atom type.

NIST (National Institute of Standards and Technology, USA); scientific literature

Training Dataset of Model Molecules: 226 organic compounds

19 linear alkanes (19)

14 branched alkanes (33)

17 aromatics (50)

106 other non-H-bonders (156)

70 H-bond formers (226)

Non-specific interacting

– Hydrocarbons

– Nitrogen compounds

– Nitro-, CN, halogens,

– S, Se substituents

– Pyridine

Potential hydrogen bonding interactions

– Amides

– Carboxylic acids

– Amino acids…

[Figure: plot against no. of C, N, O atoms (0 to 25), with series for alkanes, amides, diamides, diacids and amino acids; valine is labelled.]

Study of Non-specific Interactions: Linear Alkanes

19 compounds: CH4 to C20H42

Limit for van der Waals interactions

Hsub 7.955C-

r2= 0.977

s = 7.096 kJ/mol

[Figure: ΔHsub / kJ mol-1 against no. of carbon atoms (0 to 20) for the linear alkanes.]

Note odd-even variation in Hsub for this series.

Enthalpy of sublimation correlates with molecular size. Since linear alkanes interact non-specifically and without significant steric effects, this establishes a baseline for the analysis of more complex systems.

Include Branched Alkanes

Add 14 branched alkanes to the dataset. The graph highlights the reduction of sublimation enthalpy due to bulky substituents.

[Figure: ΔHsub against no. of carbon atoms (0 to 25); labelled points include C(CH3)4 and (C(CH3)3)3CH.]

33 compounds: CH4 to C20H42

ΔHsub = 7.724 Cnonbranched + 3.703

r2 = 0.959, s = 8.117 kJ/mol

If we also include the parameters for branched carbons, C3 & C4, the model doesn’t improve.

All Hydrocarbons: Include Aromatics

Add 17 aromatics to the dataset (note: we have no alkenes or alkynes).

50 compounds

ΔHsub = 7.680 Cnonbranched + 6.185 Caromatic + 4.162

r2 = 0.958, s = 7.478 kJ/mol

As before, if we also include the parameters for branched carbons, C3 & C4, the model doesn’t improve.

aliphatic

[Figure: model values plotted against experimental value / kJ mol-1 (0 to 200).]

All Non-Hydrogen-Bonded Molecules:

Add 106 non-hydrocarbons to the dataset.

Include elements H, C, N, O, F, S, Cl, Br & I.

156 compounds

ΔHsub predicted by 16-parameter model

r2 = 0.896, s = 9.976 kJ/mol

[Figure: model values plotted against experimental value / kJ mol-1 (0 to 250).]

Parameters in model are counts of atom type occurrences.

General Predictive Model

Add 70 hydrogen bond forming molecules to the dataset.

226 compounds

ΔHsub predicted by 19-parameter model

r2 = 0.925, s = 9.579 kJ/mol

Parameters in model are counts of atom type occurrences.

[Figure: model values plotted against experimental value / kJ mol-1 (0 to 250).]

ΔHsublimation (kJ mol-1) = 6.942 + 20.141 HN + 30.172 HO + 3.127 F + 10.456 Cl + 12.926 Br + 19.763 I + 3.297 C3 − 3.305 C4 + 5.970 Caromatic + 7.631 Cnonbranched + 7.341 CO + 19.676 CS + 11.415 Nnitrile + 8.953 Nnonnitrile + 8.466 NO + 18.249 Oether + 20.585 SO + 12.840 Sthioether
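Because the model is a plain weighted sum of atom-type counts, it can be evaluated in a few lines. The sketch below uses the coefficients from the equation above; the function name and the dictionary layout are assumptions. The two example count sets (ten non-branched carbons, six aromatic carbons) are illustrative and assume every carbon in those molecules falls into the single type named; assigning atom types to a real molecule requires the SATIS grouping, which the slides do not spell out.

```python
INTERCEPT = 6.942  # kJ/mol

# Coefficients (kJ/mol per atom of each type) from the 19-parameter model.
COEFFICIENTS = {
    "HN": 20.141, "HO": 30.172, "F": 3.127, "Cl": 10.456, "Br": 12.926,
    "I": 19.763, "C3": 3.297, "C4": -3.305, "Caromatic": 5.970,
    "Cnonbranched": 7.631, "CO": 7.341, "CS": 19.676, "Nnitrile": 11.415,
    "Nnonnitrile": 8.953, "NO": 8.466, "Oether": 18.249, "SO": 20.585,
    "Sthioether": 12.840,
}

def predict_hsub(counts):
    """Predicted sublimation enthalpy (kJ/mol) from atom-type counts."""
    unknown = set(counts) - set(COEFFICIENTS)
    if unknown:
        raise KeyError(f"unknown atom types: {sorted(unknown)}")
    return INTERCEPT + sum(COEFFICIENTS[t] * n for t, n in counts.items())

# Illustrative count sets (assumed atom-type assignments).
ten_nonbranched_carbons = predict_hsub({"Cnonbranched": 10})
six_aromatic_carbons = predict_hsub({"Caromatic": 6})
print(f"{ten_nonbranched_carbons:.3f} {six_aromatic_carbons:.3f}")
```

Note that C4 is the only negative coefficient, in line with the earlier observation that bulky substituents reduce the sublimation enthalpy.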

Predictive Model Determined by

aliphatic

All these parameters are significantly larger than their standard errors

Distribution of Residuals

The distribution of the residuals between calculated and experimental data follows an approximately normal distribution, as expected.

[Figure: distribution of residuals, axis from -30 to 30.]

Validation on an Independent Test Set

35 diverse compounds

r2 = 0.928

s = 7.420 kJ/mol

[Figure: model values plotted against ΔHsub (experimental) / kJ mol-1 (0 to 200).]

Nitro-compounds are often outliers

Very encouraging result: accurate prediction possible.

Conclusions

We have determined a general equation allowing us to estimate the sublimation enthalpy for a large range of organic compounds with an estimated error of 9 kJ/mol.

A very simple model (counts of atom types) gives a good prediction of lattice & sublimation energies.

Lattice energy can be predicted from 2D structure, without knowing the details of the crystal packing.

Avoids need for expensive calculations.

May help predict solubility.

Model gives good chemical insight.

A Chemoinformatics Approach To Predicting the Aqueous Solubility of Pharmaceutical Molecules

David Palmer & Dr John Mitchell

Unilever Centre for Molecular Science Informatics
Department of Chemistry
University of Cambridge, U.K.

Pfizer Project: P13

Novel Methods for Predicting Solubility

• David Palmer

• Dr Antonio Llinàs

• Pfizer Institute for Pharmaceutical Materials Science

• http://www.msm.cam.ac.uk/pfizer

Datasets

• Compiled from Huuskonen dataset and AquaSol database

• All molecules solid at R.T.

• n = 1000 molecules

• Aqueous solubility – the thermodynamic solubility in unbuffered water (at 25 °C)

Diversity-Conserving Partitioning

• MACCS Structural Key fingerprints

• Tanimoto coefficient

• MaxMin Algorithm
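The three ingredients above can be combined as follows: fingerprints are compared with the Tanimoto coefficient, and MaxMin repeatedly picks the molecule that is furthest from everything already picked. This is a sketch under stated assumptions: fingerprints are represented as Python sets of on-bit positions, the toy fingerprints are invented rather than real MACCS keys, and the slides do not say which side of the split (training or test) the MaxMin picks went to.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def maxmin_pick(fingerprints, n_pick, seed_index=0):
    """Pick n_pick diverse items: each new pick maximises its minimum
    distance (1 - Tanimoto) to the items already selected."""
    if n_pick > len(fingerprints):
        raise ValueError("cannot pick more items than are available")
    selected = [seed_index]
    min_dist = [1.0 - tanimoto(fp, fingerprints[seed_index]) for fp in fingerprints]
    while len(selected) < n_pick:
        candidates = [i for i in range(len(fingerprints)) if i not in selected]
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, fp in enumerate(fingerprints):
            min_dist[i] = min(min_dist[i], 1.0 - tanimoto(fp, fingerprints[nxt]))
    return selected

# Invented toy fingerprints (sets of on-bit positions), not real MACCS keys.
toy_fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}, {7, 8}]
picked = maxmin_pick(toy_fps, 3)
print(picked)
```

Starting from the first molecule, the picks jump to the most dissimilar fingerprint before filling in the space between, which is what keeps both partitions spread across chemical space.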

Full dataset n = 1000 molecules

Training n = 670 molecules

Test n = 330 molecules

Structures & Descriptors

• 3D structures from Concord

• Minimised with MMFF94

• MOE descriptors 2D/3D

• Separate analysis of 2D and 3D descriptors

• QuaSAR Contingency Module (MOE)

• 52 descriptors selected

Multi-Linear Regression

Log S = 0.07 nHDon (+/-0.018) - 0.21 TPSA (+/-0.033) + 0.11 MAXDP (+/-0.022) - 0.22 n.Ct (+/-0.019) - 0.29 KierFlex (+/-0.032) - 0.59 SLOGP (+/-0.036) - 0.26 ATS2m (+/-0.026) + 0.25 RBN (+/-0.033)

              R2     RMSE   Bias
10-fold CV    0.85   0.79    0.00
Train         0.87   0.78    0.00
Test          0.85   0.82   -0.01

Descriptor   Meaning                                          Property
SLOGP        Partition coefficient                            Lipophilicity
TPSA         Polar Surface Area                               Molecular Charge
MAXDP        Maximal Electrotopological positive variation    Molecular Charge
n.Ct         Number of Tertiary Carbons                       Molecular Size
ATS2m        "Broto-Moreau Autocorrelation"                   Molecular Size/Polarizability
KierFlex     Kier Flexibility Index                           Molecular Flexibility
RBN          Number of Rotatable Bonds                        Molecular Flexibility
nHDon        Number of Hydrogen Bond Donors

We can do better than this with other methods ...

Two More Methods of Prediction

(1) Random Forest handles both selection and regression.

(2a) Ant Colony Optimisation algorithm selection was used for Support Vector Machine regression.

(2b) Support Vector Machine regression was repeated with “Intelligent trial and error” selection.

Random Forest: Introduction

• Introduced by Breiman and Cutler (2001)

• Development of Decision Trees (Recursive Partitioning):

• Dataset is partitioned into consecutively smaller subsets (of similar solubility)

• Each partition is based upon the value of one descriptor

• The descriptor used at each split is selected so as to minimise the MSE
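The split rule in the bullets above can be sketched directly: try every descriptor and every candidate threshold, and keep the split whose two subsets have the lowest size-weighted MSE. The function names and the toy data (two descriptors, four molecules) are hypothetical, and using midpoints between sorted values as thresholds is one common choice rather than something the slides specify.

```python
def mse(values):
    """Mean squared deviation of values from their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(rows, targets):
    """Return (descriptor_index, threshold, weighted_mse) of the best split."""
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        values = sorted(set(row[j] for row in rows))
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            left = [t for row, t in zip(rows, targets) if row[j] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[j] > threshold]
            score = (len(left) * mse(left) + len(right) * mse(right)) / n
            if best is None or score < best[2]:
                best = (j, threshold, score)
    return best

# Invented toy data: descriptor 0 separates high and low solubility cleanly.
toy_rows = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
toy_log_s = [-1.0, -1.2, -3.0, -3.1]
split = best_split(toy_rows, toy_log_s)
print(split)
```

A full tree applies the same search recursively to each resulting subset until the nodes become too small to split.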

Random Forest: Method

• Random Forest is a collection of Decision Trees grown with the CART algorithm.

• Standard Parameters:

• 500 decision trees

• No pruning back: Minimum node size > 5

• “mtry” descriptors tried at each split

Important features:

• Incorporates descriptor selection

• Incorporates “Out-of-bag” validation
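Two of the ideas above, out-of-bag validation and trying only “mtry” descriptors per split, come down to simple random sampling. The sketch below shows the sampling step only, not a full forest. The training-set size of 670 matches the talk; the descriptor count and the value of `mtry` are hypothetical, as are the function and variable names.

```python
import random

def bootstrap_split(n_samples, rng):
    """Draw a bootstrap sample (with replacement) for one tree and return
    the in-bag indices plus the out-of-bag indices that were never drawn."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag

rng = random.Random(0)          # fixed seed so the sketch is repeatable
in_bag, oob = bootstrap_split(670, rng)

# At each split only a random subset of descriptors is considered.
n_descriptors = 100             # hypothetical descriptor count
mtry = 33                       # hypothetical value
tried = rng.sample(range(n_descriptors), mtry)

print(f"{len(oob)} of 670 molecules are out-of-bag for this tree")
```

For large samples roughly a fraction 1/e (about 37%) of molecules are left out of each bootstrap sample, so every molecule can be predicted by the trees that never saw it, giving a validation estimate without a separate hold-out set.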

Random Forest: Results

                    RMSE   r2     Bias
Test (te)           0.69   0.89   -0.04
Training (tr)       0.27   0.98    0.005
Out-of-bag (oob)    0.68   0.90    0.01

Support Vector Machines

[1] V. Vapnik, Estimation of Dependences Based on Empirical Data, Nauka, 1979 [in Russian]
[2] V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, 1995.

"In SVM regression, the input is first mapped onto a m-dimensional feature space using a fixed (non-linear) mapping, and then a linear model is constructed in this feature space. The linear model (in the feature space) is given by:"

$f(x, \omega) = \sum_{j=1}^{m} \omega_j g_j(x) + b$

• Kernel Function

• ε - "Over-fitting"

• "Support Vectors"

• C - cost - "Outliers"

• γ - Kernel parameter
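To make the roles of ε and γ concrete, here is a sketch of two building blocks of SVM regression. The slides do not name the kernel used; a radial basis function (RBF) kernel is assumed here because γ is its usual parameter. The function names and example values are illustrative.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel exp(-gamma * ||x - z||^2); the kernel type is an assumption."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def epsilon_insensitive_loss(y_obs, y_pred, epsilon):
    """Errors smaller than epsilon cost nothing; larger ones grow linearly."""
    return max(0.0, abs(y_obs - y_pred) - epsilon)

print(rbf_kernel([0.0, 0.0], [0.0, 0.0], gamma=0.5))        # identical points
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5))        # more distant points
print(epsilon_insensitive_loss(1.0, 1.05, epsilon=0.1))     # inside the tube
print(epsilon_insensitive_loss(1.0, 1.5, epsilon=0.1))      # outside the tube
```

Training points that fall outside the ε-tube are the ones that end up as support vectors, and the cost C sets how heavily their loss is penalised relative to model flatness.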

SVM: Descriptor Selection

• Stepwise selection of descriptors: “intelligent trial & error”

Descriptor        RMSE(CV)
SlogP
SMR               0.82
(KierFlex         0.82)
(PEOE_VSA_HYD     0.82)
(PEOE_VSA_NEG     0.81)
TPSA              0.785
(a_don            0.78)
a_acc             0.755
b_rotN            0.71

Gives five descriptor model with RMSE (test set) = 0.71

Ant colony descriptor selection algorithm gives 20 descriptors and RMSE (test set) = 0.70

Support Vector Machines: Results

         RMSE   r2     Bias
CV       0.71   0.88   -0.001
Test     0.71   0.88    0.02

2D or 3D Molecular Descriptors?

[Figure: correlation plots with R = 0.88, R = 1.00 (2 d.p.), R = 0.95 and R = 0.88.]

• No improvement from models containing 3D descriptors

Conclusions

• Two methods so far have produced good models:

a. Random Forest

b. Support Vector Machines

• Accurate experimental data necessary to improve models

• Random Forest valuable for QSPR modelling

Other work

• Linking Enthalpy of Sublimation (Carole) and Solubility (David) studies.

• Prediction of Melting Point.

• Chemoinformatics of prohibited substances in sport.

• Scoring functions for virtual screening.

• Repertoire of enzyme-catalysed reactions (MACiE).

Acknowledgements

People

Pfizer: Dr Hua Gao, Dr Tony Auffret

University of Cambridge: Prof. Robert Glen, Dr Jonathan Goodman, Dr Antonio Llinàs, Dr Noel O’Boyle

Funding

Centre: Unilever

David Palmer: Pfizer

Carole Ouvrard: University of Nantes, France.

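Ant Colony Optimisation Algorithm

Variable selection based on probability:

$\tau_{i0}$: Level of Inhibitory Pheromone

$\tau_{i1}$: Level of Activator Pheromone

Updating rules:

$\tau_{i0}(\mathrm{new}) = (1 - \rho)\,\tau_{i0}(\mathrm{old}) + \sum_k \Delta\tau_{i0}^{k}$

$\tau_{i1}(\mathrm{new}) = (1 - \rho)\,\tau_{i1}(\mathrm{old}) + \sum_k \Delta\tau_{i1}^{k}$

$\Delta\tau$ is the increment of pheromone left on each descriptor in given cycle.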

Extra slide 1

Ant Colony Optimisation Algorithm

if kth ant selected variable i both in current iteration and global best solution

if kth ant selected variable i only in current iteration

if variable i was not selected in either current iteration or global best solution

if kth ant did not select variable i in either the current iteration or its global best solution

if kth ant did not select variable i in the current iteration

Hi FF 1

Hi F 1

Hi FF 0

Hi F 1 if kth ant did not select variable i in its global best solution

Extra slide 2

Correlation diagram

          SlogP    SMR     TPSA    a_acc   b_rotN
SlogP      1       0.61   -0.58   -0.27   -0.06
SMR        0.61    1       0.04    0.31    0.45
TPSA      -0.58    0.04    1       0.65    0.5
a_acc     -0.27    0.31    0.65    1       0.47
b_rotN    -0.06    0.45    0.5     0.47    1
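One way to read the correlation diagram is to list the descriptor pairs whose absolute correlation exceeds some cut-off. The sketch below does this for the matrix above; the cut-off of 0.6 and the function name are hypothetical choices for illustration, not values from the talk.

```python
NAMES = ["SlogP", "SMR", "TPSA", "a_acc", "b_rotN"]

# Correlation matrix as given on the slide.
CORR = [
    [1.00, 0.61, -0.58, -0.27, -0.06],
    [0.61, 1.00, 0.04, 0.31, 0.45],
    [-0.58, 0.04, 1.00, 0.65, 0.50],
    [-0.27, 0.31, 0.65, 1.00, 0.47],
    [-0.06, 0.45, 0.50, 0.47, 1.00],
]

def correlated_pairs(names, corr, threshold):
    """Return (name_a, name_b, r) for pairs with |r| >= threshold, strongest first."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i][j]) >= threshold:
                pairs.append((names[i], names[j], corr[i][j]))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)

strong = correlated_pairs(NAMES, CORR, threshold=0.6)  # hypothetical cut-off
print(strong)
```

No pair in the five-descriptor model reaches an absolute correlation above 0.65, so each descriptor contributes at least some independent information.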

Extra slide 3

Distributions in dataset

Extra slide 4


Descriptor   Coefficient   Std. error   t value   Pr(>|t|)    F value   Pr(>F)
nHDon         0.07         0.018          3.7     2.52E-04     261.8    < 2.2e-16
TPSA         -0.21         0.033         -6.3     5.65E-10     425.2    < 2.2e-16
MAXDP         0.11         0.022          5.2     3.12E-07      32.8    1.66E-08
n.Ct         -0.22         0.019        -11.8     0.00000      890.7    < 2.2e-16
KierFlex     -0.29         0.032         -9.2     0.00000      847.8    < 2.2e-16
SLOGP        -0.59         0.036        -16.4     0.00000     1202.7    < 2.2e-16
ATS2m        -0.26         0.026        -10.1     0.00000      142.0    < 2.2e-16
RBN           0.25         0.033          7.4     3.56E-13      55.4    3.56E-13

Extra slide 5
