SAR/QSAR Modeling: State of the Art · 2013-08-31 · QSAR Modeling Workflow: the importance of...
SAR/QSAR Modeling: State of the Art
Alexander Tropsha
Laboratory for Molecular Modeling
UNC Eshelman School of Pharmacy
UNC-Chapel Hill
The expectations of our field are
high
The newly-appointed President-Elect of the Royal Society of
Chemistry today forecast the impact of advances in modelling and
computational informatics on chemistry
The growth of QSAR modeling is
coincidental with the growth of chemical
data
“Chemical databases” vs “QSAR Modeling”
OUTLINE
Conclusions: models are tools for testable hypothesis generation
focus on accurate, experimentally confirmed predictions
Predictive QSAR Modeling Workflow
-Chemical and biological data curation
-Applicability domains
-I’ve got the data: should I model it?
(modelability index)
METHODS
APPLICATIONS
-Cheminformatics approaches to structure-
based virtual screening
-Chemocentric Informatics: integration of
chemical and biological data streams and
diverse data sources
-Novel approaches to model building and
interpretation: Chemical-Biological Read Across
[Figure: CHEMICAL STRUCTURES, CHEMICAL DESCRIPTORS and PROPERTY/ACTIVITY feed “QSAR MAGIC”, giving PREDICTIVE QSAR MODELS; these drive VIRTUAL SCREENING of a CHEMICAL DATABASE (~10^6 – 10^9 molecules), yielding HITS (confirmed actives) and INACTIVES (confirmed inactives)]
The chief utility of QSAR models: identification
of novel hits in external libraries
Quantitative
Structure
Activity
Relationships
[Figure: data table with COMPOUNDS as rows (structure drawings), DESCRIPTORS as columns, and an ACTIVITY column: 0.613, 0.380, -0.222, 0.708, 1.146, 0.491, 0.301, 0.141, 0.956, 0.256, 0.799, 1.195, 1.005]
Thousands of molecular descriptors are available for organic compounds: constitutional, topological, structural,
quantum mechanics based, fragmental,
steric, pharmacophoric, geometrical,
thermodynamical, conformational, etc.
- Building of models using
machine learning methods
(NN, SVM etc.);
- Validation of models according to numerous
statistical procedures, and
their applicability domains.
Credit: Denis Fourches
Quantitative Structure Activity Relationships
Pattern matrix: n molecules x p descriptors
ACTIVITY (e.g. toxicity, binding affinity)
Data dependency and data quality
are critical issues in QSAR modeling
Florian Prinz, Thomas Schlange and Khusru Asadullah. Nature Rev. Drug
Disc. Sep 2011
[Figure: the same compound x DESCRIPTORS table, with activity column: 0.613, 0.380, -0.222, 7.08, 1.146, 0.491, 0.301, 0.141, 0.956, 0.256, 0.799, 1.195, 1.005]
QSAR modeling with non-curated datasets
INITIAL LIST OF SMILES/STRUCTURES (2D representation): difficult cases
-Presence of SALTS
-Presence of MIXTURES
-Presence of ERRONEOUS AND/OR WRONG STRUCTURES
-Presence of DUPLICATES
-Presence of MISPRINTS AND WRONG NAMES
-Etc.
ERRORS in the calculation of DESCRIPTORS
Fourches, Muratov, Tropsha. Trust but verify. JCIM, 2010, 50:1189-204.
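A minimal sketch of how the first curation checks listed on this slide (salts/mixtures, missing structures, duplicates) might be automated. It is illustrative only: the function name `curate`, the record layout, and the example SMILES are assumptions, and plain string identity is a crude stand-in for structure identity. Real curation would standardize and canonicalize structures with a cheminformatics toolkit first.

```python
from collections import defaultdict


def curate(records):
    """Split (compound_id, smiles, activity) records into kept and flagged.

    Assumption: a '.' in a SMILES string marks disconnected components,
    i.e. a salt or a mixture that needs manual review.
    """
    flagged = []
    by_structure = defaultdict(list)
    for cid, smiles, activity in records:
        smiles = smiles.strip()
        if not smiles:
            flagged.append((cid, "missing structure"))
        elif "." in smiles:
            flagged.append((cid, "salt or mixture"))
        else:
            by_structure[smiles].append((cid, activity))

    kept = []
    for smiles, entries in by_structure.items():
        # Keep the first record of each structure; flag the rest as duplicates.
        first_id, first_activity = entries[0]
        kept.append((first_id, smiles, first_activity))
        flagged.extend((cid, "duplicate") for cid, _ in entries[1:])
    return kept, flagged


# Hypothetical records, for illustration only.
records = [
    ("c1", "CCO", 0.61),
    ("c2", "CCO", 0.63),
    ("c3", "CC(=O)[O-].[Na+]", 0.38),
    ("c4", "", 0.10),
]
kept, flagged = curate(records)
```

Flagged duplicates still deserve a look: if their activities disagree, that is a data-quality signal rather than something to discard silently.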
QSAR modeling of nitro-aromatic
toxicants
-Case Study 1: 28 compounds tested in rats, log(LD50), mmol/kg.
-Case Study 2: 95 compounds tested against Tetrahymena pyriformis, log(IGC50), mmol/ml.
-Five different representations of nitro groups.
-Case Study 1: after the normalization of nitro groups, R2ext increased from ~0.45 to ~0.9.
-Case Study 2: after the normalization of nitro groups, R2ext increased from ~0 to ~0.5.
Artemenko, Muratov et al. J. SAR QSAR 2011, 22 (5-6), 1-27.
Even small differences in structure representation can
lead to significant errors in prediction accuracy of
models
Data curation affects the accuracy
(up or down!) of QSAR models
Experimental data quality: comparison of the ToxCast (Phase I) in vitro assay results for duplicates

Compounds                         Total   ACEA   ATG    BSK    Cellumen   NVS    CellzDirect
                                  500     7      81     87     33         239    48
3-Iodo-2-propynylbutylcarbamate   0.71    0.73   0.18   0.53   0.49       0.89   0.15
Bensulide                         0.64    0.09   0.71   0.4    0.69       0.95   0.04
Chlorsulfuron                     0.24    N/A    N/A    0.4    N/A        N/A    -0.1
Dibutyl phthalate                 0.55    N/A    0.62   0.51   0.7        0.81   -0.1
Diclofop-methyl                   0.36    1      0.89   0.15   N/A        -0     -0.1
EPTC                              0.13    N/A    N/A    -0.1   N/A        N/A    0.33
Fenoxaprop-ethyl                  0.47    N/A    0.56   0.59   0.31       0.35   0.01
Prosulfuron                       0.55    N/A    0.68   0.08   N/A        1      0.4
ChEMBL Statistics
• Used ChEMBL 14 (released 18 July 2012)
– 1,384,479 compound records
– 1,213,242 distinct compounds
– 644,734 assays
– 10,129,256 bioactivities
– 9,003 targets
– 46,133 documents
• Primarily covers MedChem literature
• Adds annotations for target data
• Successor to the StARlite commercial database
Manual Curation of ChEMBL (following several automated steps)
• Input: 190,068 compound-target measures in pairs of papers
– Used values as published in ChEMBL
– Converted to standardized pKi values
– Semi-automated (based on units and type of value reported)
• 23,956 failed to be automatically converted
– Mostly Log Ki or –Log Ki values, but others as well
– Manually examined papers representing ~70% and hand converted affinity values, except when data were being recycled/recited
• Final: 178,317 total replicate pairs of values
Frequency distribution plot for differences in pKi values (>1%) for duplicates
Note the peaks at integral pKi differences
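One plausible reading of the peaks at integral pKi differences is that they come from unit-prefix mix-ups, since each factor of 1000 in Ki shifts pKi by exactly 3. The sketch below illustrates only that arithmetic; the helper name `pki`, the unit table, and the example value are illustrative assumptions, not the curation pipeline used for ChEMBL.

```python
import math

# Conversion factors from common concentration units to mol/L.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def pki(value, unit):
    """Return pKi = -log10(Ki in mol/L) for a Ki reported in the given unit."""
    return -math.log10(value * UNIT_TO_MOLAR[unit])


# The same number reported once as nM and once (wrongly) as uM.
correct = pki(6.2, "nM")
mislabeled = pki(6.2, "uM")
shift = correct - mislabeled  # an integral difference of 3 log units
```

A replicate pair whose pKi values differ by almost exactly 3 or 6 is therefore a natural candidate for a unit check before the pair is treated as genuine experimental variability.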
A Recurrent Pattern
Non-standard Units Used
Non-Ki measures given as Ki
These numbers made it into ChEMBL, too.
Ignorance of Biological Complexity
α2a? α2b? α2c?
Target Doc_ID Src_Key Assay_ID Activity_ID Std_Type Std_Value
α2a 10218 8b 32635 359172 pKi 9.45
α2b 10218 8b 32635 359172 pKi 9.45
α2c 10218 8b 32635 359172 pKi 9.45
No Units at All
No Citation For Data Sources
Summary of published data quality analysis
• A lot of the replicates in the literature aren’t actually independent determinations
• Many errors come from careless specification or interconversion of units
• 91% of the data are single reported measurements
• Modeling studies often are not explicitly identified as such
• ChEMBL 15 and going forward have started to address these issues
• These observations suggest new challenges to employ cheminformatics approaches for biological data curation
Cheminformatics analysis of qHTS data: over 17,000 compounds screened against five major CYP isozymes using
an in vitro bioluminescent qHTS assay
J. Chem. Inf. Model., 2011,
Nature Biotechnology, 2009,
Duplicate analysis
• Carried out by ISIDA/Duplicates program
• 1,280 duplicate couples were found
– 406 had a complete matching profile
– 874 had profile differences
– A total of 1,535 discrepancies were found in the 874 duplicate couples
CYP annotation:       CYP2C19   CYP2D6   CYP3A4   CYP1A2   CYP2C9
# of discrepancies:   170       422      426      363      154
PROBLEM: CYP bioprofiles for some duplicates are dramatically different. Need biological curation!
Neighborhood analysis helps to choose the correct value
Case Study: structural duplicates found in NCGC CYP450 qHTS data (Cytochrome P450)

Duplicates:
2C19   2D6    3A4    1A2    2C9    Supplier        SID        Tocris-0740
-4.5   -6.2   -4.6   -4.4   -4.6   Tocris          11113673   CID_6603937
-5     -5.6   -8     INA    -4.4   Sigma Aldrich   11111504   CID_6603937

5 nearest neighbors:
2C19   2D6    3A4    1A2    2C9    Supplier        SID        Tanimoto similarity   CID
-5.5   INA    -4.5   INA    INA    Tocris          11114071   0.98                  6604862
INA    INA    -5.1   INA    INA    Sigma Aldrich   11112029   0.98                  6604106
INA    INA    INA    INA    INA    Tocris          11114012   0.98                  6604846
INA    -5.9   -4.8   INA    INA    Sigma Aldrich   11112054   0.95                  6604136
INA    -4.5   -4.7   -4.4   INA    Tocris          11113764   0.95                  6604137

FALSE-POSITIVE
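The neighborhood idea can be sketched as follows: when two records of the same compound disagree, prefer the value closest to a similarity-weighted consensus of the nearest structural neighbors. This is a simplified illustration, not the procedure used in the case study. The function names are hypothetical, and skipping inactive (INA) neighbors is an assumption. The numbers are the CYP3A4 entries from the slide.

```python
def neighbor_consensus(neighbors):
    """Similarity-weighted mean over neighbors with a measured value.

    neighbors: list of (tanimoto_similarity, value); value is None for INA.
    Assumption: inactive (INA) neighbors are simply skipped.
    """
    measured = [(s, v) for s, v in neighbors if v is not None]
    if not measured:
        return None
    return sum(s * v for s, v in measured) / sum(s for s, _ in measured)


def pick_value(candidates, neighbors):
    """Return the candidate value closest to the neighbor consensus."""
    reference = neighbor_consensus(neighbors)
    if reference is None:
        return None
    return min(candidates, key=lambda v: abs(v - reference))


# CYP3A4 values of the two duplicate records (Tocris, Sigma Aldrich).
duplicate_values = [-4.6, -8.0]
# CYP3A4 values of the five nearest neighbors with their Tanimoto similarity.
cyp3a4_neighbors = [(0.98, -4.5), (0.98, -5.1), (0.98, None), (0.95, -4.8), (0.95, -4.7)]

chosen = pick_value(duplicate_values, cyp3a4_neighbors)
```

With these numbers the consensus is about -4.8, so the -4.6 record is retained and the -8 record is the one to question.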
Published guidance on model development and validation: The OECD Principles
To facilitate the consideration of a QSAR model for regulatory purposes, it should be associated with the
following information:
• a defined endpoint;
• an unambiguous algorithm;
• a defined domain of applicability;
• appropriate measures of goodness-of-fit, robustness and predictivity;
• a mechanistic interpretation, if possible.
Should be added: data used for modeling should be carefully curated
Dearden JC et al., 2009, SAR and QSAR in Environmental Research, Vol. 20, Nos. 3–4, April–June 2009, 241–266.
Published guidance on model development and validation: J. Dearden’s 21 “how not to do QSAR” principles
DATA AND METHODS:
• Genetic Algorithm driven descriptor selection + MLR for the entire dataset
• Two methods of dataset splitting:
– SOM to split a dataset into DIVERSE Training (64) and Test (21) sets
– Activity-based to split into Training (64) and Test (21) sets
• Modeling for the SOM-training set using MLR with Genetic Algorithm
STATISTICAL CLAIMS:
For SOM-generated split:
• R2 (train) = 0.836, Q2 LOO = 0.793
• R2 (predict) = 0.730
For activity-based split:
• R2 (train) = 0.813, Q2 LOO = 0.766
• R2 (predict) = 0.808
Compliance with OECD principles: “In addition, the proposed models took into full account fundamental points required by the OECD principles for QSAR models…”
Beware of poor published models: Bioorganic & Medicinal Chemistry (2007)
True external prediction: Bioorganic & Medicinal Chemistry (2007)
Our attempt to predict activity for new compounds reported in ChEMBL after 2007 using the published model (MLR: all selected descriptors and weights were reported)
[Scatter plot: predicted pIC50 vs. experimental pIC50]
Problems with the model: Bioorganic & Medicinal Chemistry (2007)
• Training and test sets are selected by SOM: both are diverse and highly similar to one another
• The same descriptors are used to develop the model for the training set using the activity-based split
• The model:
1. built only for the training set of the SOM-based split
2. validated using the corresponding prediction set
• Model re-development using activity-based split for training and test sets:
1. the same model and compounds were used
2. the compounds were just rearranged between training and test sets
• The external model validation: missing
• Conclusion: despite claims in the paper, external predictive power of the model was not established; model is clearly overfitted.
Bioorganic & Medicinal Chemistry (2007)
Why Model Redevelopment and Application Failed?
In Vitro Assays
Chemical Structure
Dataset Modelability: does it make sense to model any SAR data?
Example: Poor structure – in vivo or in vitro-in vivo correlations for Toxcast data*
Toxicol Sci. 2012 Aug;128(2):398-417.
The Concept of Modelability • We often fail to build a predictive QSAR model.
However, it may be possible to evaluate
modelability of the dataset prior to QSAR
study.
• MODI-index: Balanced accuracy (BA) of a kNN
model with K=1 (the activity class of each
compound is predicted to be the same as that of
its nearest chemical neighbor)
CONFUSION MATRIX
PREDICTED   OBSERVED CLASS 0   OBSERVED CLASS 1   TOTAL
CLASS 0     N00                N10                N.0
CLASS 1     N01                N11                N.1
TOTAL       N0. = N0           N1. = N1           N.. = N
SE = N00/N0
SP = N11/N1
BA = ½ (SE + SP)
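The MODI definition above (balanced accuracy of a kNN model with K=1) can be sketched in a few lines. This is a minimal illustration under stated assumptions: Euclidean distance in descriptor space, leave-one-out nearest-neighbor search, and hypothetical toy data. The function name `modi` is not taken from the talk.

```python
import math


def modi(descriptors, classes):
    """Balanced accuracy of leave-one-out 1-nearest-neighbor class prediction.

    For each compound, find its nearest neighbor (excluding itself) and check
    whether the neighbor has the same activity class; then average the
    per-class fractions of agreement.
    """
    correct, total = {}, {}
    n = len(descriptors)
    for i in range(n):
        nearest = min(
            (k for k in range(n) if k != i),
            key=lambda k: math.dist(descriptors[i], descriptors[k]),
        )
        c = classes[i]
        total[c] = total.get(c, 0) + 1
        correct[c] = correct.get(c, 0) + (classes[nearest] == c)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical one-descriptor dataset with two well separated clusters.
X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
clean_labels = [0, 0, 0, 1, 1, 1]   # activity follows structure
noisy_labels = [0, 1, 0, 1, 0, 1]   # activity unrelated to structure

modi_clean = modi(X, clean_labels)
modi_noisy = modi(X, noisy_labels)
```

A value near 1 suggests that similar compounds share activity classes and the dataset is worth modeling. A value near or below 0.5 suggests many activity cliffs, in which case QSAR modeling is unlikely to succeed.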
Prediction of Dataset
Modelability
44 diverse datasets
60 ToxCast datasets
QSAR Modeling Workflow: the importance of rigorous validation
courtesy of L. Zhang
[Workflow figure. Panel labels: Datasets; 5-fold External Validation (folds 1 to 5; Modeling set, External set); Combi-QSAR modeling; Internal validation; Model selection; An ensemble of QSAR Models; Evaluation of external performance]
Descriptors: Dragon, MOE
Modeling methods: K-Nearest Neighbors (kNN), Random Forest (RF), Support Vector Machines (SVM)
Tropsha, A. Best Practices for QSAR Model Development, Validation,
and Exploitation Mol. Inf., 2010, 29, 476 – 488
Fully implemented on CHEMBENCH.MML.UNC.EDU
Virtual screening (with AD threshold)
Experimental confirmation
Data curation
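The 5-fold external validation step of the workflow can be sketched as below. This is a simplified illustration: a random shuffle with a fixed seed is assumed, the function name is hypothetical, and model building is left out. The point is only that each external fold is set aside before any modeling and is never used for model selection.

```python
import random


def five_fold_external(compound_ids, n_folds=5, seed=0):
    """Yield (modeling_set, external_set) pairs for n-fold external validation.

    Each compound appears in exactly one external set. The modeling set is
    what remains and would be split further for internal validation.
    """
    ids = list(compound_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        external = folds[k]
        modeling = [c for j, fold in enumerate(folds) if j != k for c in fold]
        yield modeling, external


# Hypothetical dataset of 20 compound identifiers.
splits = list(five_fold_external(range(20)))
```

External performance is then reported over the five external sets, so every compound is predicted exactly once by models that never saw it.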
Recent examples of theoretically and
experimentally validated QSAR-based
predictions: Experiment-Assisted Computational
Drug Discovery!!
• Anticonvulsants: Shen, M. et al, J. Med. Chem. 2004, 47, 2356-2364.
• HIV-1 reverse transcriptase inhibitors: Medina-Franco, J., et al, J. Comput. Aided. Mol. Des., 2005, 19, 229–242
• D1 receptor antagonists: Oloff et al, J. Med. Chem., 2005, 48, 7322-32
• AmpC inhibitors: Hsieh, J.-H.. et al, J. Comp. Aid. Molec. Des., 2008, 22(9):593-609
• HDAC inhibitors: Wang, S. et al, (JCIM, 2009, 49, 461-76)
• GGT-I inhibitors: Wang, Peterson, et al (JMC, 2009, 52(14):4210-20; provisional patent)
• 5HT6 binders: Hajjo et al, JMC, 2012, 55(12):5704-19
• Antimalarial: Zhang et al. J Chem Inf Model. 2013, 53(2):475-92
• Polypharmacological compounds: Besnard et al, Nature. 2012;492(7428):215-20
Bridging ligand- and structure-based drug
Design: CSAR 2012 (Carlson et al, JCIM, ASAP)
Any public data and any
computational method can be used.
1. Pose Prediction
2. Ligand Ranking
Urokinase (UK)
Checkpoint kinase
(CHK1)
16 ligands
47 ligands
Pseudomonas aeruginosa
deacetylase (LpxC) 16 ligands
39 ligands
Extracellular signal-regulated
kinase (ERK2)
Reference Structure Target Name Number of ligands
3i5z Ki = 2nM
1owe Ki = 631nM
2e9n Ki = 6.2 nM
3p3e
CONSENSUS
PREDICTIONS
Study Design
STRUCTURE-BASED LIGAND-BASED
Medusa docking
(“traditional” MM/MD
based)
QSAR models built with
available data outside
of CSAR
LIGAND RANKING
CSAR Ligands
TARGET
Ranking Active Ligands (Spearman ρ, with the number of ligands in parentheses)
Our Group                   UK           CHK1         ERK2
D-1 QSAR no_AD              0.558 (15)   0.319 (29)   0.575 (38)
D-2 QSAR AD                 0.576 (15)   0.320 (29)   0.584 (38)
D-3 QSAR AD+SIM             0.529 (15)   0.330 (29)   0.589 (38)
D-4 QSAR no_AD + MEDUSA     0.673 (15)   0.182 (29)   0.492 (38)
D-5 QSAR AD + MEDUSA        0.615 (15)   0.171 (29)   0.490 (38)
D-6 QSAR AD+SIM + MEDUSA    0.579 (15)   0.194 (29)   0.477 (38)

Summary statistics for all groups (Spearman (ρ))
                            UK           CHK1         ERK2
Min                         -0.282       -0.334       -0.077
25%tile                     0.396        -0.141       0.183
Median                      0.522        0.008        0.397
75%tile                     0.576        0.116        0.491
Max                         0.759        0.356        0.678
QSAR model accuracy in comparison with
docking/scoring based models (all other groups)
QSAR model accuracy in comparison with
docking/scoring based models (all other groups)
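Ligand ranking is scored here with Spearman's ρ, i.e. the Pearson correlation of the ranks of predicted and experimental values. A self-contained sketch follows; the function names and the five example affinities are hypothetical and are not CSAR data.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical predicted and experimental affinities for five ligands.
predicted = [7.1, 6.3, 8.0, 5.2, 6.9]
experimental = [6.8, 6.5, 7.9, 5.0, 7.2]
rho = spearman(predicted, experimental)
```

Because only the ordering matters, ρ rewards a model that ranks ligands correctly even if its absolute affinity estimates are off.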
Identifying Inactive Ligands* (AUC)
                            UK                             CHK1
Our Group                   AUC     Actives   Inactives    AUC     Actives   Inactives
D-1 QSAR no_AD              0.933   15        4            0.756   30        9
D-2 QSAR AD                 0.950   15        4            0.756   30        9
D-3 QSAR AD+SIM             0.950   15        4            0.733   30        9
D-4 QSAR no_AD + MEDUSA     0.967   15        4            0.696   30        9
D-5 QSAR AD + MEDUSA        0.967   15        4            0.696   30        9
D-6 QSAR AD+SIM + MEDUSA    0.967   15        4            0.685   30        9

Summary statistics for all groups (AUC)
                            UK      CHK1
Min                         0.250   0.378
25%tile                     0.714   0.496
Median                      0.828   0.563
75%tile                     0.950   0.689
Max                         1.000   0.756
*No inactive ligands were available for ERK2
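The AUC values reported for separating actives from inactives can be read as the probability that a randomly chosen active is scored above a randomly chosen inactive. A minimal sketch of that pairwise definition, with hypothetical scores and function name:

```python
def roc_auc(active_scores, inactive_scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    wins = 0.0
    for a in active_scores:
        for b in inactive_scores:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(active_scores) * len(inactive_scores))


# Hypothetical model scores for three actives and two inactives.
auc = roc_auc([0.9, 0.8, 0.4], [0.5, 0.3])
```

With only 4 inactives against 15 actives, as for UK, there are just 60 active/inactive pairs, so a single misranked pair already moves the AUC by about 0.017.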
Integrated workflow incorporating cheminformatics and molecular simulation approaches for pose classification into native-like and non-native
Knowledge based pose scoring filter
Multiple poses created by docking and ranked by a MM scoring function (Medusa)
Native-like poses
Non-native poses
[Figure: interface tetrahedron with vertices Cp, Op, Ol, Nl, an MCT value at each vertex, and protein-ligand distances d1, d2, d3]
Each tetrahedron is described by a) receptor/ligand atom and b) chemical atom type. In total, there are 554 theoretical descriptor types (m).
D1: CpOpOlNl
D2: CpOlOlNl
…
Structure Based QSAR: Chemical descriptors of protein-ligand interface (PL/MCT-tess)
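Each descriptor’s value is the SUM of protein (p)-ligand (l) pairwise potentials for the same tetrahedral type at the interface (n)
[Equation: PL/MCT_m, summed over the n interface tetrahedra of type m (k = 1 to n) and over their protein atoms (p = 1~3) and ligand atoms (l = 1~3), of a pairwise term in MCT_p, MCT_l and the distance d_pl]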
*Parr RG, Liu S J. Am. Chem. Soc. 1999, 121, 1922
Count descriptors:
D1    D2    D3    …    D554
2     0     5     …    0
→ Incorporate the maximal charge transfer (MCT)* value derived from conceptual DFT (Hsieh et al, 2012)
PL-MCT-tess descriptors:
D1    D2    D3    …    D554
2.8   0     5.3   …    0
Tessellated protein-ligand interface
Incorporate Pauling electronegativity (EN): ENTess descriptors (Zhang, 2006)
[Histogram: number of poses vs. RMSD (Å) from the X-ray complex]
Native pose: RMSD = 0
Pose 1: RMSD = 0.5 Å
Pose 2: RMSD = 2 Å
Pose 3: RMSD = 10 Å
…
RMSD classification threshold = 2.5 Å (NL: native-like, NN: non-native)

          D1    D2    D3     …    D554   NL/NN
Pose 1    2.8   0     5.3    …    0      1
Pose 2    3.5   2.1   13.4   …    0      1
…
Pose n    5.2   0     3.7    …    0      0
Building a pose filter using binary QSAR and PL/MCT-tess chemical descriptors
CCR = 0.96
                          Predicted 0 (non native)   Predicted 1 (native like)
Observed 0 (non native)   2894                       127
Observed 1 (native like)  126                        2801
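As a check on the numbers on this slide, the sketch below labels poses with the 2.5 Å RMSD threshold and recomputes the correct classification rate (CCR) from the four confusion-matrix counts. The function names are illustrative, and treating a pose exactly at the threshold as native-like is an assumption.

```python
def label_pose(rmsd, threshold=2.5):
    """Return 1 (native-like) if RMSD to the native pose is within threshold, else 0."""
    return 1 if rmsd <= threshold else 0


def ccr(tn, fp, fn, tp):
    """Correct classification rate: mean of the per-class accuracies."""
    return 0.5 * (tn / (tn + fp) + tp / (tp + fn))


# Example poses from the slide: 0.5, 2 and 10 angstrom RMSD.
labels = [label_pose(r) for r in (0.5, 2.0, 10.0)]

# Counts from the confusion matrix (rows observed, columns predicted).
pose_filter_ccr = ccr(tn=2894, fp=127, fn=126, tp=2801)
```

The two per-class accuracies are about 0.958 and 0.957, which rounds to the reported CCR of 0.96.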
Use of knowledge based filter to filter out pose decoys improves the accuracy of virtual screening*
Build filter
Apply to DUD dataset of
true binders and decoys
Native-like?
*Jui-Hua Hsieh; Shuangye Yin; Xiang S. Wang; Shubin Liu; Nikolay V. Dokholyan; Alexander Tropsha; J. Chem. Inf. Model. 2012,
52, 16-28.
Use of target-specific pose filter combined with a physical force-field based
scoring (MedusaScore) leads to significant improvement of hit rates in VS.
Knowledge based target specific approach succeeded in filtering out “bad” poses generated by MedusaDock
Chemocentric Informatics: Integration of
diverse data streams and data sources
[Figure: network of data sources linking chemicals, genes, metabolites and diseases: PubChem, DBpedia, ChEMBL, PDB, DrugBank, PubMed, STITCH, PDSP, PDD, CTD, Diseasome, CMAP, KEGG, GeneID, OMIM, Entrez Gene, Pfam, UniProt, CAS, ERGO, BioCyc, MetaCyc, miR2Disease, miRBase, TarBase, PuTmiR, HMDB, METLIN, UCbase, miRfunc, HMPDb]
Disease gene
signatures
Disease related
genes or proteins
Text/database mining
Network mining
PubMed/ Chemotext
CTD
HMDB
Disease related
proteins
cmap ChemoText
New hypothesis about connectivity between chemicals and diseases
Binding data
Target related ligands
Functional data
QSAR
Predictive models
Database mining
Structural hypothesis “putative drug candidates”
Accept common
hits only
New testable hypothesis with higher confidence
Disease-Target Association
Hajjo et al, Chemocentric Informatics Approach
to Drug Discovery:…J Med Chem.
2012, 55(12):5704-19
5-HT6 receptor QSAR models & QSAR-based VS
Dataset (source: PDSP Ki-DB): 196 cps. (102 actives, Ki < 10 µM; 94 inactives, Ki ≥ 10 µM)
Virtual screening: 59 K cps. screened with the 5-HT6 predictor; 300 VS hits (“actives”)
Model statistics
[Bar chart: CCRevs (0.0 to 1.0) by model: kNN-Dragon Model, kNN-Dragon Random, CBA-SG Model, CBA-SG Random]
The connectivity map
Input: signature (biological state 1 vs. control)
Step 1: upload signature
Step 2: query the cmap database
Step 3: list of correlated compounds
Output: high correlation, null, low correlation
Lamb, J. et al. Science, 313, 1929-1935 (2006)
Lamb, J. Nat. Rev. Cancer 7, 54-60 (2007)
Querying the cmap
[Figure: upload signature, query the cmap, obtain a list of compounds ordered by cmap SCORE from 1.00 through 0.00 to -1.00]
(S1) (S2)
S1: Hata, R. et al., Biochem. Biophys. Res. Commun 284, 310 (2001). S2: Ricciarelli, R. et al., IUBMB Life 56, 349 (2004).
Alzheimer’s disease gene signatures
Chemocentric Informatics: CONSENSUS HYPOTHESES
QSAR FILTER: WDI DATABASE (59 K compounds); 300 5-HT6 active HITS
cmap FILTER: cmap DATABASE (6.1 K individual instances); 881 instances with S1; 861 instances with S2
97 COMMON HITS with S1
106 COMMON HITS with S2
73 COMMON HITS with S1 & S2
Further selection: 34 higher confidence hits
Antipsychotics Antidepressants Calcium Channel Blockers Selective Estrogen Receptor
Modulators (SERMs)
SERMs predicted as 5-HT6 receptor ligands
Raloxifene identified as a 5-HT6 receptor ligand and potential preventative for Alzheimer’s disease
Raloxifene binds to 5-HT6
receptor with a Ki= 750 nM.
Raloxifene given at a dose of
120 mg/day led to reduced risk
of cognitive impairment in post-
menopausal women.
Yaffe, K. et al., Am J Psychiatry, 162,
683–690 (2005).
A study recently funded by NIH
is ongoing to evaluate
raloxifene’s effects in AD
patients.
Competition binding at 5-HT6 receptors for
raloxifene (yellow triangle) and chlorpromazine
(square) versus [3H] LSD. Tested by our
collaborators at PDSP.
http://www.nia.nih.gov/alzheimers/publications/adprevented/
Raloxifene
Chlorpromazine
Human Effects
Cheminformatics
Exploration and exploitation of diverse
data streams
Bioinformatics Multiple
biological assays
Inherent chemical properties
Integrate cheminformatics and short term assay
data to improve predictive power and interpretability
QSAR Table – qHTS descriptors
ID     Name                    Structure   3T3 9.2 µM   3T3 21 µM   …   SHSY 92 µM
1      Acrolein                            0            0           …   -92
2      2-Amino-4-nitrophenol               0            -22         …   0
...    ...                     …           …            …           …   …
369    Tebuconazole                        -21          -24         …   -18
Descriptor #:                              1            2           …   182
Modeling Workflow
Hybrid QSAR models have higher
predictive power than those built
with either chemical or biological
descriptors
              TOPKAT   Chemical descriptors only   Hybrid descriptors (Original)   Hybrid descriptors (THR=15%)
                       kNN     RF                  kNN     RF                      kNN     RF
Sensitivity   0.45     0.73    0.73                0.55    0.82                    0.91    0.91
Specificity   0.93     0.78    0.80                0.85    0.78                    0.85    0.83
BA            0.69*    0.75    0.77                0.70    0.80                    0.88    0.87
Results are shown for 52 compounds in our external validation sets, which were also absent in the TOPKAT training set. *TOPKAT model was significantly different (p < 0.05) from all other models by the permutation test (10,000 times).
1. Integrative chem-bio modeling (existing method)
2. Integrative chem-bio modeling (new method)
3. Integrate cheminformatics prediction and pharmacoepidemiology validation
Data source: Hepatotoxicity (28 day), 127 drugs
Chemical descriptors: QSAR models, 55-61% BAcc
Toxicogenomics expression (24h): 2,923 genes, ranked by differential expression (top 400, top 100, top 30, top 4 genes): Toxicogenomics models, 69-78% BAcc
Hybrid models: 68-75% BAcc
QSAR models < Hybrid models < Toxicogenomics models
4 classification methods (RF, SVM, kNN, DWD)
How predictivity varied with number of genes and number of chemical descriptors
[Heat map: balanced accuracy (good to bad) as a function of number of genes (up to 3000) and number of chemical descriptors (up to 300); regions labeled QSAR models, Hybrid models and Toxicogenomics models, with the optimal toxicogenomics model marked]
Low et al. (2011) Chem. Res. Toxicol. 24, 1251-1262
Problem: Conflicting predictions by QSAR and toxicogenomics models
[Figure: biological space, PC1 vs. PC2, with caffeine and carbamazepine marked; ● toxic drug, ● non-toxic drug]
[Figure: chemical space, PC1 vs. PC2, with caffeine and carbamazepine marked; ● toxic drug, ● non-toxic drug]
Carbamazepine: distant biological neighbors, close chemical neighbors => QSAR works better
Caffeine: close biological neighbors, distant chemical neighbors => TGx works better
Solution: Learn from both sets of neighbors
(20%) (70%) (4%) (10%)
Chemical Read-Across: Learning from similar compounds
Traditional read-across predicts toxicity from chemically similar neighbors
ToxMatch, EU
QSAR Toolbox, OECD
Biological neighbors Chemical neighbors
Chemical-biological read-across (CBRA) learns from both sets of neighbors
[Radial plot: CARBAMAZEPINE at the center, neighbors placed by similarity (rings at 0.9, 0.8, 0.7 and 0.6)]
Neighbor                 Class       Similarity
Bendazac                 Toxic       0.790
Phenytoin                Non-toxic   0.813
Flutamide                Toxic       0.783
Pemoline                 Non-toxic   0.766
Chloramphenicol          Toxic       0.776
Phenylbutazone           Non-toxic   0.737
Disulfiram               Toxic       0.770
Phenobarbital            Non-toxic   0.721
Phenylanthranilic acid   Non-toxic   0.767
CARBAMAZEPINE: Non-toxic; predicted as Non-toxic (predicted toxicity = -0.099)
Legend: correctly predicted as nontoxic; incorrectly predicted as toxic
Low et al. (2013) Chem. Res. Toxicol. In press.
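Predicted toxicity = similarity-weighted average of toxicity values:
$$\text{Predicted toxicity} = \frac{\sum_{i=1}^{k_{bio}} S_i \cdot A_i + \sum_{j=1}^{k_{chem}} S_j \cdot A_j}{\sum_{i=1}^{k_{bio}} S_i + \sum_{j=1}^{k_{chem}} S_j}$$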
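The CBRA prediction is a similarity-weighted average of neighbor activities, pooled over the k_bio biological and k_chem chemical neighbors. The sketch below applies it to the carbamazepine neighbors and similarities shown on the slide. Two things are assumptions: coding toxic as +1 and non-toxic as -1, and the split of the nine neighbors into biological and chemical lists. The slide does not give the split, and because the formula pools both lists, the choice does not change the result.

```python
TOXIC, NON_TOXIC = 1.0, -1.0  # assumed activity coding


def cbra_predict(bio_neighbors, chem_neighbors):
    """Similarity-weighted average of neighbor activities.

    Each neighbor is a (similarity, activity) pair; biological and chemical
    neighbors are pooled into one weighted average.
    """
    pooled = list(bio_neighbors) + list(chem_neighbors)
    weight_sum = sum(s for s, _ in pooled)
    return sum(s * a for s, a in pooled) / weight_sum


# Neighbors of carbamazepine as listed on the slide (arbitrary bio/chem split).
bio = [
    (0.790, TOXIC),      # Bendazac
    (0.783, TOXIC),      # Flutamide
    (0.776, TOXIC),      # Chloramphenicol
    (0.770, TOXIC),      # Disulfiram
    (0.767, NON_TOXIC),  # Phenylanthranilic acid
]
chem = [
    (0.813, NON_TOXIC),  # Phenytoin
    (0.766, NON_TOXIC),  # Pemoline
    (0.737, NON_TOXIC),  # Phenylbutazone
    (0.721, NON_TOXIC),  # Phenobarbital
]

predicted = cbra_predict(bio, chem)
predicted_class = "Toxic" if predicted > 0 else "Non-toxic"
```

Under this coding the result is about -0.099, matching the value shown for carbamazepine; the sign gives the class and the magnitude indicates how evenly the neighbors are split.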
conflicting predictions
CBRA allows visual comparison of multiple compounds
Low et al. (2013) Chem. Res. Toxicol. In press.
Types of chemical-biological models
Hybrid model (from pooling data): chemical structures + biological assays, pooled data matrix, hybrid model
Ensemble model (from pooling models): chemical structures give a QSAR model, biological assays give a biological model; the two are combined into an ensemble model
Chemical-biological read-across (CBRA): chemical structures + biological assays, CBRA
Conclusions and Outlook
• Methodology
– data curation is critical (NB: QSAR models could be used to spot and correct erroneous data!)
– Rigorous external model validation is mandatory and should precede any mechanistic interpretation
– Consensus (collaborative!) prediction using all acceptable models affords the highest accuracy and chemical space coverage
– Novel chemical descriptors are needed for (so far) uncommon substances (mixtures, materials, nanomaterials)
– outcome: decision support tools for prioritizing compounds for experimental screening and/or regulatory decision making
Conclusions and Outlook
• Emerging trends in applications of QSAR modeling
– Rapid accumulation of large biomolecular datasets (especially, in public domain)
– Non-traditional sources of datasets (text mining of biomedical literature, patents, EMRs, …)
– Extension of QSAR modeling beyond organic molecules (mixtures, materials, nanomaterials, …)
– Integration of inherent chemical properties with short term biological profiles (biodescriptors) in the context of structure – in vitro – in vivo extrapolation
– Interpretation of significant chemical and biological descriptors emerging from externally validated models to inform the selection or design of effective and safe chemicals
http://chembench.mml.unc.edu
The Laboratory for Molecular Modeling
Principal Investigator: Alexander Tropsha
Research Professors: Alexander Golbraikh, Denis Fourches, Eugene Muratov
Graduate Research Assistants: Andrew Fant, Stephen Bush, Yen Low, Mary La (Pharmacy student)
Postdoctoral Fellows: Aleck Sedykh, Regina Politi
Adjunct Members: Weifan Zheng, Shubin Liu
Former members: Jui-Hua Hsieh, Rima Hajjo
Collaborators: Bryan Roth, Ivan Rusyn, Nikolay Dokholyan
MAJOR FUNDING
NIH: R01-GM66940, R01-GM068665
NSF: ABI 9179-1165
EPA (STAR awards): RD832720, RD833825, RD834999
ONR