Molecular Modeling: Statistical Analysis of Complex Data C372 Dr. Kelsey Forsythe.


Terminology

• SAR (Structure-Activity Relationships)
  – Circa 19th century?
• QSPR (Quantitative Structure-Property Relationships)
  – Relate structure to any physical-chemical property of a molecule
• QSAR (Quantitative Structure-Activity Relationships)
  – Specific to some biological/pharmaceutical function of a molecule (Absorption, Distribution/Digestion, Metabolism, Excretion)
  – Crum Brown and Fraser (1868-9)
    • ‘constitution’ related to biological response
  – LogP

Statistical Models

• Simple
  – Mean, median and variation
  – Regression
• Advanced
  – Validation methods
  – Principal components, co-variance
  – Multiple Regression (QSAR, QSPR)

Modern QSAR

  – Hansch et al. (1963)
    • Activity related to ‘travel through body’, i.e. partitioning between varied solvents
  – C (minimum dosage required)
  – π (hydrophobicity)
  – σ (electronic)
  – Es (steric)

$$\log(1/C) = a\pi + b\pi^2 + c\sigma + dE_s + \text{const.}$$

Choosing Descriptors

• Buffon’s Problem
  – Needle Length?
  – Needle Color?
  – Needle Composition?
  – Needle Sheen?
  – Needle Orientation?

Choosing Descriptors

• Constitutional
  – MW, Natoms
• Topological
  – Connectivity, Wiener index
• Electrostatic
  – Polarity, polarizability, partial charges
• Geometrical Descriptors
  – Length, width, molecular volume
• Quantum Chemical
  – HOMO and LUMO energies
  – Vibrational frequencies
  – Bond orders
  – Energy total

Choosing Descriptors

• Constitutional
  – MW, Natoms of element
• Topological
  – Connectivity, Wiener index (sums of bond distances)
  – 2D Fingerprints (bit-strings)
  – 3D topographical indices, pharmacophore keys
• Electrostatic
  – Polarity, polarizability, partial charges
• Geometrical Descriptors
  – Length, width, molecular volume
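As an illustration of a topological descriptor, the Wiener index can be sketched as the sum of shortest-path bond counts over all pairs of atoms in the hydrogen-suppressed graph. The adjacency-dictionary representation and the atom numbering below are assumptions made for this example only.

```python
from collections import deque

def wiener_index(adjacency):
    """Sum of shortest-path bond counts over all unordered atom pairs."""
    total = 0
    for start in adjacency:
        # Breadth-first search gives the bond distance from `start` to every atom.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for neighbor in adjacency[atom]:
                if neighbor not in dist:
                    dist[neighbor] = dist[atom] + 1
                    queue.append(neighbor)
        total += sum(dist.values())
    return total // 2  # every pair was counted once from each end

# Hydrogen-suppressed carbon skeletons (hypothetical atom numbering).
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}

print("n-butane:", wiener_index(n_butane))
print("isobutane:", wiener_index(isobutane))
```

The branched isomer gives a smaller index than the linear one, which is why this kind of descriptor can separate molecules that share the same MW and atom counts.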

Choosing Descriptors

• Chemical
  – Hydrophobicity (LogP)
  – HOMO and LUMO energies
  – Vibrational frequencies
  – Bond orders
  – Energy total
  – GSH

Statistical Methods

• 1-D analysis
• Large dimension sets require decomposition techniques
  – Multiple Regression
  – PCA
  – PLS
• Connecting a descriptor with a structural element so as to interpolate and extrapolate data

Simple Error Analysis (1-D)

• Given N data points
  – Mean
  – Variance
  – Regression

[Figure: regression plot, axes labeled y_calc, y_obs, x_calc, x_obs]

$$R = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Std}(X)\,\mathrm{Std}(Y)}$$


• Given N data points
  – Regression

$$\Delta y_i^{\mathrm{residual}} = y_i^{\mathrm{obs}} - y_i^{\mathrm{calc}}$$

[Figure: calculated vs. observed values, axes x and y]


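• Given N data points
  – (Poor) $0 < R^2 < 1$ (Good)

$$R = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Std}(X)\,\mathrm{Std}(Y)}, \qquad R^2 = \frac{\mathrm{SSR}}{N\,\mathrm{Std}(Y)^2}, \qquad \mathrm{SSR} = \sum_{i}^{N} \left(y_{\mathrm{calc},i} - \bar{y}\right)^2$$

Correlation between fluctuations:

$$\mathrm{Cov}(X,Y) = \frac{1}{N}\sum_{i=1}^{N} \left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)$$

$$\mathrm{Std}(Y) = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left(Y_i - \bar{Y}\right)^2}$$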
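The 1-D quantities above (mean, standard deviation, covariance and the correlation coefficient R) can be sketched in a few lines. This is a minimal illustration that assumes the 1/N normalization for Cov and Std; the data values are made up for the example.

```python
import math

def mean(values):
    return sum(values) / len(values)

def cov(x, y):
    """Covariance with 1/N normalization (an assumption; 1/(N-1) is also common)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def std(values):
    # Std(X) is the square root of Cov(X, X).
    return math.sqrt(cov(values, values))

def pearson_r(x, y):
    """R = Cov(X, Y) / (Std(X) * Std(Y))."""
    return cov(x, y) / (std(x) * std(y))

# Hypothetical observations.
x_obs = [1.0, 2.0, 3.0, 4.0, 5.0]
y_obs = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x_obs, y_obs)
print(f"mean(y) = {mean(y_obs):.3f}, Std(y) = {std(y_obs):.3f}")
print(f"R = {r:.4f}, R^2 = {r * r:.4f}")
```

R itself does not depend on the normalization choice, since the 1/N factors cancel between numerator and denominator.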

Correlation vs. Dependence?

• Correlation
  – Two or more variables/descriptors may correlate to the same property of a system
• Dependence
  – When the correlation can be shown to arise from one variable changing in response to a change in another
• Ex. Elephant's head and legs
  – Correlation exists between size of head and legs
  – The size of one does not depend on the size of the other

Quantitative Structure Activity/Property Relationships (QSAR, QSPR)

• Discern relationships between multiple variables (descriptors)
• Identify connections between structural traits (type of substituents, bond angles, substituent locale) and descriptor values (e.g. activity, LogP, % denaturation)

Pre-Qualifications

• Size
  – Minimum of FIVE samples per descriptor
• Verification
  – Variance
  – Scaling
  – Correlations

QSAR/QSPR Pre-Qualifications

• Variance
  – Coefficient of Variation

$$\text{"Spread"} = \frac{\sigma_x}{\bar{x}} = \frac{\text{Standard Deviation}}{\text{Mean}}$$


• Scaling
  – Standardizing or normalizing descriptors to ensure they have equal weight (in terms of magnitude) in subsequent analysis


• Scaling
  – Unit Variance (Auto Scaling): $x_i' = x_i / \sigma$, so that $\sigma' = 1$
    • Ensures equal statistical weights (initially)
  – Mean Centering: $x_i' = x_i - \bar{x}$, so that $\bar{x}' = 0$
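The variance check and the two scaling steps can be sketched as follows. This is a minimal example under stated assumptions: the function names are hypothetical, Std uses 1/N normalization, and the descriptor values are invented. Combining mean centering with unit-variance scaling puts descriptors of very different magnitude (here MW and LogP) on the same footing.

```python
import math

def mean(values):
    return sum(values) / len(values)

def std(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def coefficient_of_variation(values):
    """'Spread' = standard deviation / mean."""
    return std(values) / mean(values)

def mean_center(values):
    """x_i' = x_i - mean, so the new mean is 0."""
    m = mean(values)
    return [v - m for v in values]

def unit_variance(values):
    """x_i' = x_i / sigma, so the new standard deviation is 1."""
    s = std(values)
    return [v / s for v in values]

def autoscale(values):
    # Mean centering followed by unit-variance scaling.
    return unit_variance(mean_center(values))

# Hypothetical descriptor columns with very different magnitudes.
mol_weight = [100.0, 200.0, 300.0]
log_p = [1.0, 2.0, 3.0]

print("CV(MW) =", round(coefficient_of_variation(mol_weight), 4))
print("autoscaled MW  :", [round(v, 4) for v in autoscale(mol_weight)])
print("autoscaled LogP:", [round(v, 4) for v in autoscale(log_p)])
```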


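• Correlations
  – Remove correlated descriptors
  – Keep correlated descriptors so as to reduce data set size
  – Apply math operation to remove correlation (PCR)

OVERREPRESENTATION:

$$r_{ij} = \begin{cases} 1 & \text{(100\% positive correlation)} \\ -1 & \text{(100\% negative correlation)} \end{cases}, \qquad -1 \le r_{ij} \le 1$$

Correlation between the i-th and j-th descriptor:

$$R_{i,j} = \frac{\mathrm{Cov}(X_i, X_j)}{\mathrm{Std}(X_i)\,\mathrm{Std}(X_j)} = \frac{\sum_{k}^{M} \left(X_{i,k} - \bar{X}_i\right)\left(X_{j,k} - \bar{X}_j\right)}{\sqrt{\sum_{k}^{M} \left(X_{i,k} - \bar{X}_i\right)^2 \sum_{k}^{M} \left(X_{j,k} - \bar{X}_j\right)^2}}$$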
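A minimal sketch of the correlation pre-qualification: compute the correlation between every pair of descriptors and flag pairs whose |r| exceeds a cutoff, so that one member of each pair can be dropped. The 0.95 threshold, the function names and the descriptor values are assumptions chosen for illustration, not values from the lecture.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Correlation between two descriptor columns over M molecules."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlated_pairs(descriptors, threshold=0.95):
    """Return descriptor name pairs with |r| above the (assumed) threshold."""
    flagged = []
    for name_i, name_j in combinations(descriptors, 2):
        r = pearson_r(descriptors[name_i], descriptors[name_j])
        if abs(r) > threshold:
            flagged.append((name_i, name_j))
    return flagged

# Hypothetical descriptor table: one list per descriptor, one entry per molecule.
descriptors = {
    "MW": [16.0, 30.0, 44.0, 58.0, 72.0],
    "n_heavy_atoms": [1.0, 2.0, 3.0, 4.0, 5.0],
    "dipole": [0.0, 1.2, 0.3, 1.5, 0.1],
}

print(correlated_pairs(descriptors))
```

In this invented table MW is an exact linear function of the heavy-atom count, so that pair is flagged as overrepresented, while the dipole column is kept.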


• Correlations

QSAR/QSPR Scheme

• Goal
  – Predict what happens next (extrapolate)!
  – Predict what happens between data points (interpolate)!

QSAR/QSPR Scheme

• Types of Variable
  – Continuous
    • Concentration, occupied volume, partition coefficient, hydrophobicity
  – Discrete
    • Structural (1: meta substituted, 0: no meta substitution)

QSAR/QSPR-Principal Components Analysis

• Reduces dimensionality of descriptors
• Principal components are a set of vectors representing the variance in the original data


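• Geometric Analogy (3-D to 2-D PCA)

[Figure: x, y, z coordinate axes]

Find $\tilde{K}$ s.t.

$$\tilde{K}\,\tilde{O} = \tilde{O}', \qquad \tilde{O} = \begin{pmatrix} x_1 & x_2 & \cdots & x_N \\ y_1 & y_2 & \cdots & y_N \\ z_1 & z_2 & \cdots & z_N \end{pmatrix}, \qquad \tilde{O}' = \begin{pmatrix} x'_1 & x'_2 & \cdots & x'_N \\ y'_1 & y'_2 & \cdots & y'_N \\ 0 & 0 & \cdots & 0 \end{pmatrix}$$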


• Formulate matrix
• Diagonalize matrix
• Eigenvectors are the principal components
  – These principal components (new descriptors) are a linear combination of the original descriptors
• Eigenvalues represent variance
  – Largest accounts for greatest % of data variance
  – Next corresponds to second greatest, and so on


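• Formulate matrix (several types)
  – Correlation or covariance (N x P)
    • N is number of molecules
    • P is number of descriptors
  – Variance-Covariance matrix (N x N)
• Diagonalize (rotate) matrix

$$\tilde{A} = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1p} \\ r_{21} & r_{22} & \cdots & r_{2p} \\ \vdots & \vdots & & \vdots \\ r_{n1} & r_{n2} & \cdots & r_{np} \end{pmatrix}, \qquad \tilde{A}_{vc} = \tilde{A}\,\tilde{A}^{T}$$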


• Eigenvectors (Loadings)
  – Represent contribution from each original descriptor to PC (new descriptor)
    • # columns = # of descriptors
    • # rows = # of descriptors OR # of molecules
• Eigenvalues
  – Indicate which PC is most important (representative of original descriptors)
    • Benzene has 2 non-zero and 1 zero eigenvalue (planar)
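The benzene remark can be checked numerically. The sketch below builds the 3 x 3 covariance matrix of six ring-carbon coordinates and diagonalizes it with a small cyclic Jacobi rotation routine. The ring radius (1.39 Å), the 30° tilt of the molecular plane and the choice of the Jacobi method are assumptions made for this illustration; any symmetric eigensolver would do.

```python
import math

def covariance_matrix(points):
    """Covariance of the coordinate columns (1/N normalization)."""
    n, dim = len(points), len(points[0])
    means = [sum(p[a] for p in points) / n for a in range(dim)]
    return [[sum((p[a] - means[a]) * (p[b] - means[b]) for p in points) / n
             for b in range(dim)] for a in range(dim)]

def jacobi_eigenvalues(matrix, max_sweeps=50, tol=1e-20):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle that zeroes the (p, q) element.
                theta = math.pi / 4 if diff == 0.0 else 0.5 * math.atan(2.0 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # update columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # update rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

# Six ring carbons on a regular hexagon, plane tilted 30 degrees about the x axis.
radius, tilt = 1.39, math.radians(30.0)
benzene = []
for k in range(6):
    x = radius * math.cos(k * math.pi / 3)
    y = radius * math.sin(k * math.pi / 3)
    benzene.append((x, y * math.cos(tilt), y * math.sin(tilt)))

eigenvalues = jacobi_eigenvalues(covariance_matrix(benzene))
print([round(v, 6) for v in eigenvalues])
```

Even though all three Cartesian coordinates vary in the tilted frame, only two eigenvalues are non-zero, so two principal components describe the planar ring completely.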


• Scores
  – Graphing each object/molecule in space of 2 or more PCs
    • # rows = # of objects/molecules
    • # columns = # of descriptors OR # of molecules
  – For benzene, corresponds to a graph in the PC1 (x’) and PC2 (y’) system

QSAR-PCA

[Figures: screenshots from SYBYL (Tripos Inc.)]

• 10D → 3D
• Eigenvalues: explanation of variance in data
• Loadings plot: each point corresponds to a column of the original data (# points = # descriptors)
  – Proximity ⇒ correlation
• Scores plot: each point corresponds to a row of the original data (i.e. # points = # molecules), or a graph of molecules in PC space
  – Labeled points: He, Naphthalene, H2O
  – Annotations: Molecular Size; Small acting Big
  – Proximity ⇒ similarity
• Outlier

QSAR/QSPR-Regression Types

• Principal Component Analysis

Non-Linear Mappings

• Calculate “distance” between points in N-d descriptor/parameter space
  – Euclidean
  – City-block distances
• Randomly assign compounds in set to points on a 2-D or 3-D space
• Minimize Difference (Optimal N-d → 2D plot)
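The two distance measures, and the "difference" that a non-linear mapping tries to minimize, can be sketched as below. The error function used here (sum of squared differences between N-d and 2-D pair distances) is an assumption for illustration; published mapping schemes typically add weighting factors. The point coordinates are invented.

```python
import math
from itertools import combinations

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def city_block(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def mapping_error(high_dim, low_dim, distance=euclidean):
    """Sum over all pairs of (N-d distance - distance in the 2-D map)^2."""
    error = 0.0
    for i, j in combinations(range(len(high_dim)), 2):
        d_high = distance(high_dim[i], high_dim[j])
        d_low = euclidean(low_dim[i], low_dim[j])
        error += (d_high - d_low) ** 2
    return error

# Three hypothetical compounds in a 3-descriptor space.
compounds = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 0.0, 0.0)]
good_map = [(0.0, 0.0), (3.0, 4.0), (3.0, 0.0)]   # preserves all pair distances
poor_map = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # arbitrary starting guess

print("Euclidean :", euclidean(compounds[0], compounds[1]))
print("City-block:", city_block(compounds[0], compounds[1]))
print("error(good map):", mapping_error(compounds, good_map))
print("error(poor map):", mapping_error(compounds, poor_map))
```

An optimizer would move the 2-D points, starting from the random assignment, until this error stops decreasing.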

Non-Linear Mappings

• Advantages
  – Non-linear
  – No assumptions!
  – Chance groupings unlikely (a 2D group is likely an N-D group)
• Disadvantages
  – Dependence on initial guess (use PCA scores to improve)


• Multiple Regression
• PCR
• PLS


• Linear Regression
  – Minimize difference between calculated and observed values (residuals)

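Multiple Regression

$$y = mx + b$$

$$m = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}, \qquad b = \bar{y} - m\bar{x}$$

$$y = \sum_{i=1}^{N} m_i x_i + B$$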
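The one-descriptor least-squares formulas for the slope m and intercept b translate directly into code. This is a minimal sketch; the function name and the data points are hypothetical.

```python
def fit_line(x, y):
    """Least-squares slope m and intercept b for y = m*x + b."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # m = sum((x_i - x_mean)(y_i - y_mean)) / sum((x_i - x_mean)^2)
    m = (sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
         / sum((a - x_mean) ** 2 for a in x))
    # b = y_mean - m * x_mean
    b = y_mean - m * x_mean
    return m, b

# Hypothetical descriptor values and observed property.
x_obs = [0.0, 1.0, 2.0, 3.0]
y_obs = [1.0, 3.0, 5.0, 7.0]

m, b = fit_line(x_obs, y_obs)
print(f"y = {m:.3f} * x + {b:.3f}")
```

With several descriptors the same idea generalizes to one coefficient per descriptor plus a constant, which is the multiple-regression form.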


• Principal Component Regression
  – Regression, but with Principal Components substituted for original descriptors/variables


• Partial Least Squares
  – Cross-validation determines number of descriptors/components to use
  – Derive equation
  – Use bootstrapping and t-test to test coefficients in QSAR regression


• Partial Least Squares (a.k.a. Projection to Latent Structures)
  – Regression of a Regression
    • Provides insight into variation in x’s ($b_{ij}$’s, as in PCA) AND y’s ($a_i$’s)
  – The $t_i$’s are orthogonal
  – M = (# of variables/descriptors OR # of observations/molecules, whichever is smaller)

$$y = \sum_{i}^{N} a_i\, t_i, \qquad t_i = \sum_{j}^{M} b_{ij}\, x_j$$


• PLS is NOT MR or PCR in practice
  – PLS is MR w/ cross-validation
  – PLS is faster
    • Couples the target representation (QSAR generation) and component generation, while PCA and PCR are separate
• PLS is well suited to multi-variate problems

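QSAR/QSPR Post-Qualifications

• Confidence in Regression
  – TSS: Total Sum of Squares
  – ESS: Explained Sum of Squares
  – RSS: Residual Sum of Squares

$$\mathrm{TSS} = \mathrm{ESS} + \mathrm{RSS}$$

$$R^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} = \begin{cases} 1 & \text{(100\% explanation of data)} \\ 0 & \text{(no explanation of data)} \end{cases}$$

$$\mathrm{TSS} = \sum_{i}^{N} \left(y_i - \bar{y}\right)^2, \qquad \mathrm{ESS} = \sum_{i}^{N} \left(y_{\mathrm{calc},i} - \bar{y}\right)^2, \qquad \mathrm{RSS} = \sum_{i}^{N} \left(y_i - y_{\mathrm{calc},i}\right)^2$$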
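The decomposition TSS = ESS + RSS can be verified numerically for a least-squares line. The sketch below assumes a one-descriptor fit with an intercept (the case in which the identity holds exactly); the helper name and the data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = m*x + b."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    m = (sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
         / sum((a - x_mean) ** 2 for a in x))
    return m, y_mean - m * x_mean

# Hypothetical observations.
x_obs = [1.0, 2.0, 3.0, 4.0, 5.0]
y_obs = [2.0, 4.1, 5.9, 8.2, 9.8]

m, b = fit_line(x_obs, y_obs)
y_calc = [m * x + b for x in x_obs]
y_mean = sum(y_obs) / len(y_obs)

tss = sum((y - y_mean) ** 2 for y in y_obs)                  # total
ess = sum((yc - y_mean) ** 2 for yc in y_calc)               # explained
rss = sum((y - yc) ** 2 for y, yc in zip(y_obs, y_calc))     # residual
r_squared = ess / tss

print(f"TSS = {tss:.4f}, ESS = {ess:.4f}, RSS = {rss:.4f}")
print(f"R^2 = {r_squared:.4f}")
```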

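QSAR/QSPR Post-Qualifications

• Confidence in Prediction (Predictive Error Sum of Squares)

$$Q^2 = 1 - \frac{\mathrm{PRESS}}{\sum_{i=1}^{N} \left(y_i - \bar{y}\right)^2}, \qquad \mathrm{PRESS} = \sum_{i=1}^{N} \left(y_i - y_{\mathrm{calc},i}\right)^2$$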
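One common way to obtain PRESS is leave-one-out cross-validation: each point is predicted from a model fitted without it. The sketch below assumes that reading of y_calc,i (a prediction for an omitted point) and a one-descriptor linear model; the function names and data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = m*x + b."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    m = (sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
         / sum((a - x_mean) ** 2 for a in x))
    return m, y_mean - m * x_mean

def q_squared(x, y):
    """Q^2 = 1 - PRESS / sum((y_i - y_mean)^2), PRESS from leave-one-out fits."""
    n = len(x)
    press = 0.0
    for i in range(n):
        # Refit the model without point i, then predict the omitted value.
        m, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (m * x[i] + b)) ** 2
    y_mean = sum(y) / n
    return 1.0 - press / sum((v - y_mean) ** 2 for v in y)

# Hypothetical observations.
x_obs = [1.0, 2.0, 3.0, 4.0, 5.0]
y_obs = [2.0, 4.1, 5.9, 8.2, 9.8]

q2 = q_squared(x_obs, y_obs)
print(f"Q^2 = {q2:.4f}")
```

Q² is never larger than the R² of the full fit, because each prediction is made without the point being predicted.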

QSAR/QSPR Post-Qualification

• Bias?
  – Bootstrapping
• Choosing best model?
  – Cross Validation


• Bootstrapping
  – ASSUME calculated data is experimental/observed data
  – Randomly choose N data (allowing for multiple picks of the same data)
  – Regenerate parameters/regression
  – Repeat M times
  – Average over M bootstraps
  – Compare (calculate residual)
    • If close to zero then no bias
    • If large then bias exists

M is typically 50-100
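The bootstrap loop can be sketched as follows. This is a simplified case-resampling variant, an assumption on my part: it resamples (x, y) pairs with replacement, refits the slope each time, and compares the bootstrap average with the full-data slope. The function names, the fixed random seed and the data are hypothetical.

```python
import random

def fit_line(x, y):
    """Least-squares slope and intercept for y = m*x + b."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    m = (sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
         / sum((a - x_mean) ** 2 for a in x))
    return m, y_mean - m * x_mean

def bootstrap_slope_bias(x, y, n_boot=100, seed=0):
    """Return (full-data slope, mean bootstrap slope, their difference)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(x)
    m_full, _ = fit_line(x, y)
    slopes = []
    while len(slopes) < n_boot:
        picks = rng.choices(range(n), k=n)  # N picks, repeats allowed
        xs = [x[i] for i in picks]
        if len(set(xs)) < 2:
            continue  # slope undefined if every pick has the same x
        ys = [y[i] for i in picks]
        slopes.append(fit_line(xs, ys)[0])
    m_boot = sum(slopes) / n_boot
    return m_full, m_boot, m_boot - m_full

# Hypothetical observations.
x_obs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y_obs = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.8, 16.1]

m_full, m_boot, bias = bootstrap_slope_bias(x_obs, y_obs, n_boot=100)
print(f"slope = {m_full:.4f}, bootstrap mean = {m_boot:.4f}, bias = {bias:.4f}")
```

A difference close to zero suggests the fitted parameter is not biased by particular data points; a large difference points to bias.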


• Cross-Validation (used in PLS)
  – Remove one or more pieces of input data
  – Rederive QSAR equation
  – Calculate omitted data
  – Compute root-mean-square error to evaluate efficacy of model
    • Typically 20% of data is removed for each iteration
    • The model with the lowest RMS error has the optimal number of components/descriptors

QSPR Example

• Relation between musk odourant properties and benzenoid structure
  – Training set of 148 compounds (81 non-musk and 67 musk)
  – 47 chemical descriptors initially
  – Pre-qualifications
    • Correlations (47-12=35)
  – Post-qualifications
    • Bootstrapping
    • Test-set
      – 6/6 musks, 8/9 non-musks

Narvaez, J. N., Lavine, B. K. and Jurs, P. C. Chemical Senses, 11, 145-156 (1986)

Practical Issues

• 10 times as many compounds as parameters fit
• 3-5 compounds per descriptor
• Traditional QSAR
  – Good for activity prediction
  – Not good for determining whether activity is due to binding or transport

Advanced Methods

• Neural Networks
• Genetic/Evolutionary Algorithms
• Monte Carlo
• Alternate descriptors
  – Reduced graphs
  – Molecular connectivity indices
  – Indicator variables (0 or 1)
• Combinatorics (e.g. multiple substituent sites)

Tools Available

• Sybyl (Tripos Inc.)
• Insight II (Accelrys Inc.)
• Pole Bio-Informatique Lyonnais
  – http://pbil.univ-lyon1.fr/
• Molecular Biology
  – http://www.infobiogen.fr/services/deambulum/english/logiciels.html

Summary

• QSAR/QSPR
  – Statistics connect structure/behavior w/ observables
  – Interpolate/Extrapolate
• Multi-Variate Analysis
  – Pre-Qualification
  – Regression
    • PCA
    • PLS
    • MLS
  – Post-Qualification
