Molecular Modeling: Statistical Analysis of Complex Data C372 Dr. Kelsey Forsythe.
Terminology
• SAR (Structure-Activity Relationships)
– Circa 19th century?
• QSPR (Quantitative Structure-Property Relationships)
– Relate structure to any physical-chemical property of a molecule
• QSAR (Quantitative Structure-Activity Relationships)
– Specific to some biological/pharmaceutical function of a molecule (Absorption, Distribution/Digestion, Metabolism, Excretion)
– Brown and Fraser (1868-9)
• ‘constitution’ related to biological response
– LogP
Statistical Models
• Simple
– Mean, median and variation
– Regression
• Advanced
– Validation methods
– Principal components, covariance
– Multiple Regression
QSAR, QSPR
Modern QSAR
– Hansch et al. (1963)
• Activity related to ‘travel through body’, i.e. partitioning between varied solvents
– C (minimum dosage required)
– π (hydrophobicity)
– σ (electronic)
– Es (steric)
$\log(1/C) = a\pi + b\pi^2 + c\sigma + dE_s + \text{const.}$
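A minimal sketch of how the coefficients of a Hansch-type equation could be obtained by least squares, assuming the usual form log(1/C) = a·π + b·π² + c·σ + d·Es + const. The function name `fit_hansch` and all numbers are hypothetical; the data are synthetic and noise-free, not measured values.

```python
import numpy as np

def fit_hansch(pi, sigma, es, log_inv_c):
    """Least-squares fit of log(1/C) = a*pi + b*pi^2 + c*sigma + d*Es + const."""
    pi, sigma, es = (np.asarray(v, dtype=float) for v in (pi, sigma, es))
    # One column per term of the assumed equation; the last column is the constant.
    design = np.column_stack([pi, pi**2, sigma, es, np.ones_like(pi)])
    coeffs, *_ = np.linalg.lstsq(design, np.asarray(log_inv_c, dtype=float), rcond=None)
    return coeffs  # ordered as a, b, c, d, const

# Synthetic descriptors and made-up "true" coefficients (illustration only).
rng = np.random.default_rng(0)
pi = rng.uniform(-1.0, 2.0, 30)
sigma = rng.uniform(-0.5, 0.8, 30)
es = rng.uniform(-2.0, 0.0, 30)
true_coeffs = np.array([1.2, -0.4, 0.7, 0.3, 2.0])
log_inv_c = (true_coeffs[0] * pi + true_coeffs[1] * pi**2
             + true_coeffs[2] * sigma + true_coeffs[3] * es + true_coeffs[4])

fitted = fit_hansch(pi, sigma, es, log_inv_c)
```

With real data the fit would not be exact, and the quality checks discussed later in the deck (R², cross-validation) would apply.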
Choosing Descriptors
• Buffon’s Problem
– Needle Length?
– Needle Color?
– Needle Composition?
– Needle Sheen?
– Needle Orientation?
Choosing Descriptors
• Constitutional
– MW, Natoms
• Topological
– Connectivity, Wiener index
• Electrostatic
– Polarity, polarizability, partial charges
• Geometrical Descriptors
– Length, width, molecular volume
• Quantum Chemical
– HOMO and LUMO energies
– Vibrational frequencies
– Bond orders
– Total energy
Choosing Descriptors
• Constitutional
– MW, Natoms of element
• Topological
– Connectivity, Wiener index (sums of bond distances)
– 2D Fingerprints (bit-strings)
– 3D topographical indices, pharmacophore keys
• Electrostatic
– Polarity, polarizability, partial charges
• Geometrical Descriptors
– Length, width, molecular volume
Choosing Descriptors
• Chemical
– Hydrophobicity (LogP)
– HOMO and LUMO energies
– Vibrational frequencies
– Bond orders
– Total energy
– GSH
Statistical Methods
• 1-D analysis
• Large dimension sets require decomposition techniques
– Multiple Regression
– PCA
– PLS
• Connecting a descriptor with a structural element so as to interpolate and extrapolate data
Simple Error Analysis (1-D)
• Given N data points
– Mean
– Variance
– Regression
$R = \dfrac{\mathrm{Cov}(X,Y)}{\mathrm{Std}(X)\,\mathrm{Std}(Y)}$
[Figure: regression plot; labels x_obs, x_calc, y_obs, y_calc]
Simple Error Analysis (1-D)
• Given N data points
– Regression
$y_i^{\text{residual}} = y_i^{\text{obs}} - y_i^{\text{calc}}$
[Figure: regression plot comparing x_obs, x_calc and y_obs, y_calc]
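Simple Error Analysis (1-D)
• Given N data points
– (Poor) 0 < R² < 1 (Good)
$SSR = \sum_{i}^{N} (y_i^{\text{calc}} - \bar{y})^2$
$R = \dfrac{\mathrm{Cov}(X,Y)}{\mathrm{Std}(X)\,\mathrm{Std}(Y)}, \quad R^2 = \dfrac{SSR}{N\,\mathrm{Std}(Y)^2}$
• Correlation between fluctuations
$\mathrm{Cov}(X,Y) = \dfrac{1}{N}\sum_{i=1}^{N}(X_i-\bar{X})(Y_i-\bar{Y})$
$\mathrm{Std}(Y) = \sqrt{\dfrac{1}{N}\sum_{i=1}^{N}(Y_i-\bar{Y})^2}$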
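The 1-D quantities on these slides can be sketched in a few lines of NumPy. This assumes the 1/N (population) normalization for both Cov and Std; the function names and the five data points are made up for illustration.

```python
import numpy as np

def std(values):
    """Population standard deviation: sqrt of the mean squared deviation (1/N)."""
    values = np.asarray(values, dtype=float)
    return np.sqrt(np.mean((values - values.mean()) ** 2))

def cov(x, y):
    """Covariance: mean product of the fluctuations of x and y about their means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((x - x.mean()) * (y - y.mean()))

def correlation(x, y):
    """Correlation coefficient R = Cov(X, Y) / (Std(X) * Std(Y))."""
    return cov(x, y) / (std(x) * std(y))

# Hypothetical observations (not taken from the slides).
x_obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_obs = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
r = correlation(x_obs, y_obs)
```

Because the same normalization appears in the numerator and denominator, R is unchanged if 1/(N-1) is used throughout instead.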
Correlation vs. Dependence?
• Correlation
– Two or more variables/descriptors may correlate to the same property of a system
• Dependence
– When the correlation can be shown to arise from one variable changing in response to a change in another
• Ex. Elephant's head and legs
– Correlation exists between size of head and legs
– The size of one does not depend on the size of the other
Quantitative Structure Activity/Property Relationships (QSAR, QSPR)
• Discern relationships between multiple variables (descriptors)
• Identify connections between structural traits (type of substituents, bond angles, substituent locale) and descriptor values (e.g. activity, LogP, % denaturation)
Pre-Qualifications
• Size
– Minimum of FIVE samples per descriptor
• Verification
– Variance
– Scaling
– Correlations
QSAR/QSPR Pre-Qualifications
• Variance
– Coefficient of Variation
$\text{"Spread"} = \dfrac{\text{Standard Deviation}}{\text{Mean}} = \dfrac{\sigma_x}{\bar{x}}$
QSAR/QSPR Pre-Qualifications
• Scaling
– Standardizing or normalizing descriptors to ensure they have equal weight (in terms of magnitude) in subsequent analysis
QSAR/QSPR Pre-Qualifications
• Scaling
– Unit Variance (Auto Scaling)
– Ensures equal statistical weights (initially)
$x_i' = \dfrac{x_i}{\sigma}, \quad \sigma' = 1$
– Mean Centering
$x_i' = x_i - \bar{x}, \quad \bar{x}' = 0$
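A small sketch of the two scaling operations, assuming descriptors are stored as columns and molecules as rows. The function names and descriptor values are hypothetical; `autoscale` here combines mean centering with unit-variance scaling, which is one common reading of "auto scaling".

```python
import numpy as np

def mean_center(X):
    """Subtract each descriptor's mean so every column averages to zero."""
    return X - X.mean(axis=0)

def unit_variance(X):
    """Divide each descriptor by its standard deviation so every column has Std = 1."""
    return X / X.std(axis=0)

def autoscale(X):
    """Mean-center, then scale to unit variance."""
    return unit_variance(mean_center(X))

# Hypothetical descriptor table: rows are molecules, columns are descriptors
# with very different magnitudes (e.g. a mass-like, a LogP-like and a charge-like value).
descriptors = np.array([
    [120.0, 1.2, 0.01],
    [180.0, 2.5, 0.03],
    [250.0, 0.7, 0.02],
    [310.0, 3.1, 0.05],
])
centered = mean_center(descriptors)
scaled = autoscale(descriptors)
```

A descriptor with zero variance would cause a division by zero here; such columns carry no information and would normally be dropped during the variance check.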
QSAR/QSPR Pre-Qualifications
• Correlations
– Remove correlated descriptors
– Keep correlated descriptors so as to reduce data set size
– Apply math operation to remove correlation (PCR)
$-1 \le r_{ij} \le 1$
$r_{ij} = 1$ (100% positive correlation), $r_{ij} = -1$ (100% negative correlation): OVERREPRESENTATION
$R_{i,j} = \dfrac{\mathrm{Cov}(X_i,X_j)}{\mathrm{Std}(X_i)\,\mathrm{Std}(X_j)} = \dfrac{\sum_{k}^{M}(X_{i,k}-\bar{X}_i)(X_{j,k}-\bar{X}_j)}{\sqrt{\sum_{k}^{M}(X_{i,k}-\bar{X}_i)^2\,\sum_{k}^{M}(X_{j,k}-\bar{X}_j)^2}}$
(correlation between the i-th and j-th descriptors)
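One way the "remove correlated descriptors" option could look in practice is sketched below: build the descriptor-descriptor correlation matrix and keep a descriptor only if it is not strongly correlated with one already kept. The 0.95 threshold, the function names and the synthetic data are assumptions for illustration, not values from the slides.

```python
import numpy as np

def correlation_matrix(X):
    """R[i, j] = Cov(X_i, X_j) / (Std(X_i) * Std(X_j)); descriptors are columns."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / X.shape[0]
    std = X.std(axis=0)
    return cov / np.outer(std, std)

def drop_correlated(X, threshold=0.95):
    """Return indices of descriptors kept after a greedy correlation filter."""
    R = correlation_matrix(X)
    kept = []
    for j in range(X.shape[1]):
        # Keep descriptor j only if |r| with every kept descriptor is below the threshold.
        if all(abs(R[j, k]) < threshold for k in kept):
            kept.append(j)
    return kept

# Synthetic example: descriptor 1 is an exact linear function of descriptor 0.
rng = np.random.default_rng(1)
base = rng.normal(size=50)
X = np.column_stack([base, 2.0 * base + 1.0, rng.normal(size=50)])
R = correlation_matrix(X)
kept = drop_correlated(X)
```

The greedy order matters: whichever member of a correlated pair comes first is the one retained, so descriptors are often sorted by interpretability or variance beforehand.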
QSAR/QSPR Scheme
• Goal
– Predict what happens next (extrapolate)!
– Predict what happens between data points (interpolate)!
QSAR/QSPR Scheme
• Types of Variable
– Continuous
• Concentration, occupied volume, partition coefficient, hydrophobicity
– Discrete
• Structural (1 = meta substituted, 0 = no meta substitution)
QSAR/QSPR-Principal Components Analysis
• Reduces dimensionality of descriptors
• Principal components are a set of vectors representing the variance in the original data
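QSAR/QSPR-Principal Components Analysis
• Geometric Analogy (3-D to 2-D PCA)
[Figure: x, y, z coordinate axes]
$\tilde{O} = \begin{pmatrix} x_1 & x_2 & \cdots & x_N \\ y_1 & y_2 & \cdots & y_N \\ z_1 & z_2 & \cdots & z_N \end{pmatrix}$
$\exists\,\tilde{K}$ s.t. $\tilde{K}\,\tilde{O} = \begin{pmatrix} x'_1 & x'_2 & \cdots & x'_N \\ y'_1 & y'_2 & \cdots & y'_N \\ 0 & 0 & \cdots & 0 \end{pmatrix}$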
QSAR/QSPR-Principal Components Analysis
• Formulate matrix
• Diagonalize matrix
• Eigenvectors are the principal components
– These principal components (new descriptors) are a linear combination of the original descriptors
• Eigenvalues represent variance
– Largest accounts for greatest % of data variance
– Next corresponds to second greatest and so on
QSAR/QSPR-Principal Components Analysis
• Formulate matrix (Several types)
– Correlation or covariance (N x P)
• N is number of molecules
• P is number of descriptors
– Variance-Covariance matrix (N x N)
• Diagonalize (Rotate) matrix
$\tilde{A} = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1p} \\ r_{21} & r_{22} & \cdots & r_{2p} \\ \vdots & \vdots & & \vdots \\ r_{n1} & r_{n2} & \cdots & r_{np} \end{pmatrix}$
$A_{vc} = A A^{T}$
QSAR/QSPR-Principal Components Analysis
• Eigenvectors (Loadings)
– Represents contribution from each original descriptor to PC (new descriptor)
• # columns = # of descriptors
• # rows = # of descriptors OR # of molecules
• Eigenvalues
– Indicate which PC most important (representative of original descriptors)
• Benzene has 2 non-zero and 1 zero eigenvalue (planar)
QSAR/QSPR-Principal Components Analysis
• Scores
– Graphing each object/molecule in space of 2 or more PCs
• # rows = # of objects/molecules
• # columns = # of descriptors OR # of molecules
For benzene corresponds to graph in PC1 (x’) and PC2 (y’) system
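The benzene remark can be checked with a short PCA sketch. It treats the six carbon positions as the data (x, y, z as three "descriptors"), assumes an idealized ring radius of 1.39 Å, and tilts the ring arbitrarily so the planarity is not obvious from the raw coordinates. It diagonalizes the 3 x 3 covariance matrix of the columns rather than the N x N form; the two share their non-zero eigenvalues. All names are hypothetical.

```python
import numpy as np

def pca(X):
    """Return eigenvalues (descending), loadings (columns) and scores for X."""
    Xc = X - X.mean(axis=0)                  # mean-center each column
    cov = Xc.T @ Xc / X.shape[0]             # P x P covariance matrix (1/N)
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric matrix, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # re-sort so PC1 has the largest variance
    eigvals, loadings = eigvals[order], eigvecs[:, order]
    scores = Xc @ loadings                   # coordinates of each atom in PC space
    return eigvals, loadings, scores

# Idealized benzene carbons: regular hexagon in the xy-plane (assumed radius 1.39 A).
angles = np.arange(6) * np.pi / 3.0
ring = np.column_stack([1.39 * np.cos(angles), 1.39 * np.sin(angles), np.zeros(6)])

# Rotate the ring out of the xy-plane by arbitrary angles.
c, s = np.cos(np.radians(35.0)), np.sin(np.radians(35.0))
rot_x = np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])
c, s = np.cos(np.radians(20.0)), np.sin(np.radians(20.0))
rot_z = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
carbons = ring @ rot_x.T @ rot_z.T

eigvals, loadings, scores = pca(carbons)
```

The first two columns of `scores` give the ring in the PC1/PC2 (x', y') system, and the third column is numerically zero because the molecule is planar.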
QSAR-PCA: SYBYL (Tripos Inc.)
[Screenshots: SYBYL (Tripos Inc.) PCA session]
• 10-D to 3-D
• Eigenvalues: explanation of variance in data
• Each point corresponds to column (# points = # descriptors) in original data
– Proximity implies correlation
• Each point corresponds to row of original data (i.e. # points = # molecules) or graph of molecules in PC space
– Labeled points: He, Naphthalene, H2O
– Molecular Size; Small acting Big
– Proximity implies similarity
• Outlier
QSAR/QSPR-Regression Types
• Principal Component Analysis
Non-Linear Mappings
• Calculate “distance” between points in N-d descriptor/parameter space
– Euclidean
– City-block distances
• Randomly assign compounds in set to points on a 2-D or 3-D space
• Minimize Difference (optimal N-d to 2-D plot)
Non-Linear Mappings
• Advantages
– Non-linear
– No assumptions!
– Chance groupings unlikely (2-D group likely an N-D group)
• Disadvantages
– Dependence on initial guess (Use PCA scores to improve)
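The three steps above can be sketched as a simple stress minimization. This is an illustrative gradient descent with step halving on the raw squared difference between N-d and 2-D Euclidean distances; it is not necessarily the algorithm or error function used in any particular package, and every name and parameter value is an assumption.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance between every pair of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(target, layout):
    """Sum over pairs of (low-d distance - N-d distance)^2."""
    return ((pairwise_distances(layout) - target) ** 2).sum() / 2.0

def nonlinear_map(X, n_dims=2, n_iter=300, step=0.01, seed=0):
    rng = np.random.default_rng(seed)
    target = pairwise_distances(X)                     # distances in N-d descriptor space
    layout = rng.normal(size=(X.shape[0], n_dims))     # random initial assignment
    initial = current = stress(target, layout)
    for _ in range(n_iter):
        diff = layout[:, None, :] - layout[None, :, :]
        d = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(d, 1.0)                       # avoid 0/0; diagonal diff is zero
        grad = 2.0 * (((d - target) / d)[:, :, None] * diff).sum(axis=1)
        trial = layout - step * grad
        trial_stress = stress(target, trial)
        if trial_stress < current:                     # accept only improving moves
            layout, current = trial, trial_stress
        else:
            step *= 0.5                                # otherwise shrink the step
    return layout, initial, current

# Synthetic 5-descriptor data for 20 hypothetical compounds.
X = np.random.default_rng(42).normal(size=(20, 5))
layout, initial_stress, final_stress = nonlinear_map(X)
```

Replacing the random start with the first two PCA score columns is the improvement suggested on the slide for the initial-guess dependence.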
QSAR/QSPR-Regression Types
• Multiple Regression
• PCR
• PLS
QSAR/QSPR-Regression Types
• Linear Regression
– Minimize difference between calculated and observed values (residuals)
$y = mx + b$
$m = \dfrac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{N}(x_i-\bar{x})^2}$
$b = \bar{y} - m\bar{x}$
Multiple Regression
$y = \sum_{i=1}^{N} m_i x_i + B$
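A minimal sketch of both regressions, assuming ordinary least squares: the closed-form slope and intercept for one descriptor, and a least-squares solve for several descriptors with a constant term B. Function names and data are made up for illustration.

```python
import numpy as np

def simple_fit(x, y):
    """Slope m and intercept b minimizing the squared residuals of y = m*x + b."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    b = y.mean() - m * x.mean()
    return m, b

def multiple_fit(X, y):
    """Coefficients m_i and constant B for y = sum_i m_i * x_i + B."""
    design = np.column_stack([X, np.ones(X.shape[0])])   # last column carries B
    solution, *_ = np.linalg.lstsq(design, y, rcond=None)
    return solution[:-1], solution[-1]

# One-descriptor example with exact line y = 2x + 1.
m, b = simple_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])

# Two-descriptor example generated from y = 3*x1 - 2*x2 + 0.5 (synthetic).
rng = np.random.default_rng(3)
X = rng.normal(size=(25, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5
coeffs, B = multiple_fit(X, y)
```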
QSAR/QSPR-Regression Types
• Principal Component Regression
– Regression but with Principal Components substituted for original descriptors/variables
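As a rough illustration of principal component regression, the sketch below computes PCA scores from the mean-centered descriptors, regresses y on the leading scores, and maps the result back to coefficients on the original descriptors. The function name `pcr_fit` and the synthetic data are assumptions.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Regress y on the first n_components principal component scores of X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / X.shape[0])
    order = np.argsort(eigvals)[::-1][:n_components]     # largest-variance PCs first
    loadings = eigvecs[:, order]
    scores = Xc @ loadings                               # new, uncorrelated descriptors
    gamma, *_ = np.linalg.lstsq(scores, y - y_mean, rcond=None)
    coef = loadings @ gamma                              # back to original descriptors
    intercept = y_mean - x_mean @ coef
    return coef, intercept

# Synthetic data: y depends linearly on three descriptors.
rng = np.random.default_rng(5)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0

coef_full, intercept_full = pcr_fit(X, y, n_components=3)   # all PCs kept
coef_two, intercept_two = pcr_fit(X, y, n_components=2)     # one PC discarded
```

When every component is kept, PCR reproduces ordinary multiple regression; the benefit comes from discarding low-variance components when descriptors are correlated.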
QSAR/QSPR-Regression Types
• Partial Least Squares
– Cross-validation determines number of descriptors/components to use
– Derive equation
– Use bootstrapping and t-test to test coefficients in QSAR regression
QSAR/QSPR-Regression Types
• Partial Least Squares (a.k.a. Projection to Latent Structures)
– Regression of a Regression
• Provides insight into variation in x’s ($b_{ij}$’s as in PCA) AND y’s ($a_i$’s)
– The $t_i$’s are orthogonal
– M = (# of variables/descriptors OR # observations/molecules, whichever smaller)
$y = \sum_{i}^{N} a_i t_i$
$t_i = \sum_{j}^{M} b_{ij} x_j$
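The structure y = Σ a_i t_i with orthogonal t_i can be illustrated with a NIPALS-style PLS1 loop for a single response. This is a generic textbook sketch, not the implementation of any specific package; note that each weight vector here acts on the deflated descriptors, so it is not identical to the b_ij of the slide. Names and data are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Return score vectors t_i (columns of T) and regression coefficients a_i."""
    Xk = X - X.mean(axis=0)          # work on mean-centered descriptors
    yk = y - y.mean()                # and a mean-centered response
    T, A = [], []
    for _ in range(n_components):
        w = Xk.T @ yk                # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                   # latent variable (score vector)
        tt = t @ t
        p = Xk.T @ t / tt            # descriptor loadings for this component
        a = (yk @ t) / tt            # regression of y on this score
        Xk = Xk - np.outer(t, p)     # deflate X: remove what t explains
        yk = yk - a * t              # deflate y likewise
        T.append(t)
        A.append(a)
    return np.column_stack(T), np.array(A)

# Synthetic data: four descriptors, noisy linear response.
rng = np.random.default_rng(7)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.5, -1.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=30)

T, A = pls1_fit(X, y, n_components=2)
y_centered = y - y.mean()
residual = y_centered - T @ A        # part of y not explained by two components
```

Unlike PCR, the components are chosen with y in view, which is the coupling of component generation and QSAR generation mentioned in the deck.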
QSAR/QSPR-Regression Types
• PLS is NOT MR or PCR in practice
– PLS is MR w/cross-validation
– PLS Faster
• couples the target representation (QSAR generation) and component generation while PCA and PCR are separate
• PLS well applied to multi-variate problems
QSAR/QSPR Post-Qualifications
• Confidence in Regression
– TSS-Total Sum of Squares
– ESS-Explained Sum of Squares
– RSS-Residual Sum of Squares
$TSS = ESS + RSS$
$R^2 = \dfrac{ESS}{TSS}$: 1 (100% explanation of data), 0 (no explanation of data)
$TSS = \sum_{i}^{N} (y_i - \bar{y})^2$
$ESS = \sum_{i}^{N} (y_{calc,i} - \bar{y})^2$
$RSS = \sum_{i}^{N} (y_i - y_{calc,i})^2$
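A short numerical check of the decomposition, assuming the calculated values come from a least-squares line with an intercept (the case in which TSS = ESS + RSS holds exactly). The data points and names are made up; `np.polyfit` is used only as a convenient line fitter.

```python
import numpy as np

def sums_of_squares(y_obs, y_calc):
    """Return (TSS, ESS, RSS) for observed and calculated values."""
    y_bar = y_obs.mean()
    tss = ((y_obs - y_bar) ** 2).sum()      # total variation in the observations
    ess = ((y_calc - y_bar) ** 2).sum()     # variation explained by the model
    rss = ((y_obs - y_calc) ** 2).sum()     # variation left in the residuals
    return tss, ess, rss

# Hypothetical observations and a least-squares straight-line model.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_obs = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
slope, intercept = np.polyfit(x, y_obs, 1)
y_calc = slope * x + intercept

tss, ess, rss = sums_of_squares(y_obs, y_calc)
r_squared = ess / tss
```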
QSAR/QSPR Post-Qualifications
• Confidence in Prediction (Predictive Error Sum of Squares)
$Q^2 = 1 - \dfrac{PRESS}{\sum_{i=1}^{N}(y_i-\bar{y})^2}, \quad PRESS = \sum_{i=1}^{N}(y_i - y_{calc,i})^2$
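A sketch of Q² under the assumption that each y_calc,i in PRESS is predicted by a model fitted without point i (leave-one-out); the slides do not state the omission scheme, so this is one common reading. The straight-line model and the data are illustrative only.

```python
import numpy as np

def q_squared_loo(x, y):
    """Q^2 = 1 - PRESS / sum((y_i - y_bar)^2) with leave-one-out predictions."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i                        # omit point i from the fit
        slope, intercept = np.polyfit(x[keep], y[keep], 1)
        y_pred = slope * x[i] + intercept               # predict the omitted point
        press += (y[i] - y_pred) ** 2
    return 1.0 - press / ((y - y.mean()) ** 2).sum()

# Hypothetical data (same style as a simple 1-D regression example).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
q2 = q_squared_loo(x, y)
```

Because each prediction is made for a point the model has not seen, Q² cannot exceed the fitted R² for a least-squares model and can even be negative for a model with no predictive value.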
QSAR/QSPR Post-Qualification
• Bias?
– Bootstrapping
• Choosing best model?
– Cross Validation
QSAR/QSPR Post-Qualification
• Bootstrapping
– ASSUME calculated data is experimental/observed data
– Randomly choose N data (allowing for multiple picks of the same data)
– Regenerate parameters/regression
– Repeat M times
– Average over M bootstraps
– Compare (calculate residual)
• If close to zero then no bias
• If large then bias exists
M is typically 50-100
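The bootstrap loop above can be sketched as follows, here resampling (x, y) pairs and tracking the slope of a straight-line fit as the regenerated parameter. M = 100 follows the typical range on the slide; the data, noise level and function name are made up for illustration.

```python
import numpy as np

def bootstrap_slopes(x, y, n_boot=100, seed=0):
    """Refit the line on n_boot resamples drawn with replacement; return the slopes."""
    rng = np.random.default_rng(seed)
    n = len(x)
    slopes = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # N picks, repeats allowed
        slopes[b] = np.polyfit(x[idx], y[idx], 1)[0]
    return slopes

# Synthetic data: a line with slope 2 plus Gaussian noise.
rng = np.random.default_rng(11)
x = np.linspace(0.0, 10.0, 20)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=20)

original_slope = np.polyfit(x, y, 1)[0]
slopes = bootstrap_slopes(x, y)
bias = slopes.mean() - original_slope             # close to zero suggests no bias
spread = slopes.std()                             # doubles as an uncertainty estimate
```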
QSAR/QSPR Post-Qualification
• Cross-Validation (used in PLS)
– Remove one or more pieces of input data
– Rederive QSAR equation
– Calculate omitted data
– Compute root-mean-square error to evaluate efficacy of model
• Typically 20% of data is removed for each iteration
• The model with the lowest RMS error has the optimal number of components/descriptors
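A minimal sketch of the procedure, assuming five folds (20% of the data omitted per iteration) and, for simplicity, ordinary multiple regression on the first k descriptors in place of PLS components. The synthetic response depends on the first two descriptors only; all names are hypothetical.

```python
import numpy as np

def cv_rmse(X, y, n_folds=5, seed=0):
    """Root-mean-square error of predictions for omitted data over all folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    squared_error = 0.0
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False                                   # remove ~20% of the data
        design = np.column_stack([X[train], np.ones(train.sum())])
        coef, *_ = np.linalg.lstsq(design, y[train], rcond=None)  # rederive the equation
        test_design = np.column_stack([X[test_idx], np.ones(len(test_idx))])
        squared_error += ((y[test_idx] - test_design @ coef) ** 2).sum()
    return np.sqrt(squared_error / len(y))

# Synthetic data: five descriptors, only the first two matter.
rng = np.random.default_rng(13)
X = rng.normal(size=(50, 5))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

rmse_by_k = {k: cv_rmse(X[:, :k], y) for k in range(1, 6)}
best_k = min(rmse_by_k, key=rmse_by_k.get)       # model with the lowest RMS error
```

Using the same fold assignment (same seed) for every candidate model keeps the comparison between descriptor counts fair.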
QSPR Example
• Relation between musk odourant properties and benzenoid structure
– Training set of 148 compounds (81 non-musk and 67 musk)
– 47 chemical descriptors initially
– Pre-qualifications
• Correlations (47-12=35)
– Post-qualifications
• Bootstrapping
• Test-set
– 6/6 musks, 8/9 non-musks
Narvaez, J. N., Lavine, B. K. and Jurs, P. C. Chemical Senses, 11, 145-156 (1986)
Practical Issues
• 10 times as many compounds as parameters fit
• 3-5 compounds per descriptor
• Traditional QSAR
– Good for activity prediction
– Not good for determining whether activity is due to binding or transport
Advanced Methods
• Neural Networks
• Genetic/Evolutionary Algorithms
• Monte Carlo
• Alternate descriptors
– Reduced graphs
– Molecular connectivity indices
– Indicator variables (0 or 1)
• Combinatorics (e.g. multiple substituent sites)
Tools Available
• Sybyl (Tripos Inc.)
• Insight II (Accelrys Inc.)
• Pole Bio-Informatique Lyonnais
– http://pbil.univ-lyon1.fr/
• Molecular Biology
– http://www.infobiogen.fr/services/deambulum/english/logiciels.html
Summary
• QSAR/QSPR
– Statistics connect structure/behavior w/ observables
– Interpolate/Extrapolate
• Multi-Variate Analysis
– Pre-Qualification
– Regression
• PCA
• PLS
• MLS
– Post-Qualification