Structure-Activity Relationships and Networks: A Generalized Approach to Exploring Structure-Activity...
Structure-Activity Relationships and Networks: A Generalized Approach to Exploring Structure-Activity Landscapes
Rajarshi Guha
NIH Chemical Genomics Center / NIH Center for Translational Therapeutics
March 29, 2011
NIH Chemical Genomics Center
• Founded 2004 as part of the NIH Roadmap Molecular Libraries Initiative
– NCGC staffed with 90+ scientists: biologists, chemists, informaticians, engineers
– Post-doc program
• Mission
– MLPCN (screening & chemical synthesis; compound repository; PubChem database; funding for assay, library and technology development)
  • Complements individual investigator-initiated research programs
  • Enables "pharma-level" HTS and early chemical optimization
– Develop new chemical probes for basic research and leads for therapeutic development, particularly for rare/neglected diseases
– New paradigms & applications of HTS for chemical biology / chemical genomics
• All NCGC projects are collaborations with a target or disease expert; currently >200 collaborations with investigators worldwide
– 75% NIH extramural, 10% NIH intramural, 15% Foundations/Research Consortia/Pharma/Biotech
NCGC Project Diversity
[Figure: (A) disease areas, (B) target types, (C) detection methods]
qHTS: High Throughput Dose Response
• 1536-well plates, inter-plate dilution series
• Assay volumes 2–5 μL
• Assay concentration ranges over 4 logs (high: ~100 μM)
• Informatics pipeline: automated curve fitting and classification, 300K samples
• Automated concentration-response data collection, ~1 CRC/sec
[Figure: panels A, B and C]
Background
• Cheminformatics methods
– QSAR, diversity analysis, virtual screening, fragments, polypharmacology, networks
• More recently
– RNAi screening, high content imaging
• Extensive use of machine learning
• All tied together with software development
– User-facing GUI tools
– Low level programmatic libraries
• Believer & practitioner of Open Source
Outline
• Structure-activity relationships
• Characterizing activity cliffs
• Working with the structure-activity landscape
Structure Activity Relationships
• Similar molecules will have similar activities
• Small changes in structure will lead to small changes in activity
• One implication is that SARs are additive
• This is the basis for QSAR modeling
Martin, Y.C. et al., J. Med. Chem., 2002, 45, 4350–4358
Exceptions Are Easy to Find
[Figure: four closely related analogs with Ki = 39.0 nM, 1.8 nM, 10.0 nM and 1.0 nM]
Tran, J.A. et al., Bioorg. Med. Chem. Lett., 2007, 15, 5166–5176
Structure Activity Landscapes
• Rugged gorges or rolling hills?
– Small structural changes associated with large activity changes represent steep slopes in the landscape
– But traditionally, QSAR assumes gentle slopes
– Machine learning is not very good for special cases
Maggiora, G.M., J. Chem. Inf. Model., 2006, 46, 1535–1535
Structure Activity Landscapes
[Figure: structure-activity landscape]
Characterizing the Landscape
• A cliff can be numerically characterized
• Structure Activity Landscape Index (SALI)
$$\mathrm{SALI}_{i,j} = \frac{|A_i - A_j|}{1 - \mathrm{sim}(i,j)}$$
• Cliffs are characterized by elements of the matrix with very large values
Guha, R.; Van Drie, J.H., J. Chem. Inf. Model., 2008, 48, 646–658
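The SALI definition can be sketched in a few lines of Python. This is only an illustration: the function names `sali` and `sali_matrix` are hypothetical, activities are assumed to be on a log scale such as pIC50, and the handling of pairs with similarity 1.0 (infinite for different activities, zero for equal ones) is an assumption made here so that the code never divides by zero.

```python
import math
from itertools import combinations

def sali(a_i, a_j, sim_ij):
    """SALI for one pair of molecules: |A_i - A_j| / (1 - sim(i, j))."""
    diff = abs(a_i - a_j)
    if sim_ij >= 1.0:
        # Assumption: identical representations with different activities
        # are treated as an infinitely high cliff, equal activities as 0.
        return math.inf if diff > 0 else 0.0
    return diff / (1.0 - sim_ij)

def sali_matrix(activities, sim):
    """Symmetric SALI matrix from a list of activities and a similarity matrix."""
    n = len(activities)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = sali(activities[i], activities[j], sim[i][j])
    return matrix

# Toy data (made up for illustration): three compounds with pIC50-like values.
activities = [9.0, 6.0, 6.5]
similarity = [
    [1.0, 0.75, 0.5],
    [0.75, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]
print(sali_matrix(activities, similarity))
```

The pair (0, 1) combines a large activity difference with a high similarity, so it gets the largest value in the matrix, which is what marks it as a cliff.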
Visualizing the SALI Matrix
Fingerprints
• Lots of types of fingerprints
• Each bit indicates the presence or absence of a structural feature
• Length can vary from 166 to 4096 bits or more
• Fingerprints are usually compared using the Tanimoto metric
1 0 1 1 0 0 0 1 0
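As a small worked example of the Tanimoto comparison, the sketch below treats fingerprints as plain lists of 0/1 values. The first fingerprint is the bit string shown on the slide; the second is made up for illustration, and returning 1.0 when both fingerprints are empty is a convention assumed here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: bits set in both / bits set in either."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Assumption: two all-zero fingerprints are treated as identical.
    return both / either if either else 1.0

fp_slide = [1, 0, 1, 1, 0, 0, 0, 1, 0]   # bit string from the slide
fp_other = [1, 0, 1, 0, 0, 0, 0, 1, 1]   # hypothetical second molecule
print(tanimoto(fp_slide, fp_other))       # 3 shared bits / 5 bits set in either = 0.6
```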
Varying Fingerprint Methods
• Shorter fingerprints will lead to more "similar" pairs
• Requires a higher cutoff to focus on significant cliffs
[Figure: density of pairwise Tanimoto similarity for three fingerprints: BCI 1052 bit, MACCS 166 bit and CDK 1024 bit]
Varying the Similarity Metric
Different Activity Representations
• Using the Hill parameters from a dose-response curve represents richer data than a single IC50
[Figure: dose-response curve (activity versus concentration) annotated with S0, SInf, the 50% level and AC50]
$$P = \{S_0, S_{\mathrm{inf}}, \mathrm{AC}_{50}, H\}$$
$$\mathrm{SALI}_{i,j} = \frac{d(P_i, P_j)}{1 - \mathrm{sim}(i,j)}$$
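A minimal sketch of the curve-based variant follows. Two things are assumptions and not taken from the slides: the four-parameter Hill form used in `hill`, and the choice of Euclidean distance for d(P_i, P_j). In practice the parameters would probably need scaling (for example, log-transforming AC50) before a distance is meaningful. All function names are hypothetical.

```python
import math

def hill(conc, s0, s_inf, ac50, h):
    """Four-parameter Hill curve (assumed form): response at concentration conc."""
    return s0 + (s_inf - s0) / (1.0 + (ac50 / conc) ** h)

def param_distance(p_i, p_j):
    """Assumed d(P_i, P_j): Euclidean distance between parameter vectors."""
    return math.dist(p_i, p_j)

def sali_from_curves(p_i, p_j, sim_ij):
    """SALI with the activity difference replaced by a curve-parameter distance."""
    d = param_distance(p_i, p_j)
    if sim_ij >= 1.0:
        # Assumption: identical representations give an infinite cliff
        # unless the curves are identical too.
        return math.inf if d > 0 else 0.0
    return d / (1.0 - sim_ij)

# At conc == AC50 the response is halfway between S0 and Sinf.
print(hill(1e-6, 0.0, -100.0, 1e-6, 1.0))

# Parameter vectors as (S0, Sinf, log10 AC50, H); values are made up.
p1 = (0.0, -100.0, -6.0, 1.0)
p2 = (0.0, -97.0, -10.0, 1.0)
print(sali_from_curves(p1, p2, 0.5))
```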
Visualizing SALI Values
• Alternatives?
– A heatmap is an easy to understand visualization
– Coupled with brushing, can be a handy tool
– A more flexible approach is to consider a network view of the matrix
• The SALI graph
– Compounds are nodes
– Nodes i,j are connected if SALI(i,j) > X
– Only display connected nodes
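The graph construction described above can be sketched as follows. The slides give the rule SALI(i,j) > X but do not say how X is scaled; expressing the cutoff as a fraction of the largest finite SALI value is an assumption made for this example, and `sali_graph` is a hypothetical name.

```python
from itertools import combinations

def sali_graph(sali, cutoff_fraction):
    """Return (nodes, edges) of the SALI graph for a symmetric SALI matrix.

    An edge joins i and j when SALI(i, j) exceeds the threshold; only nodes
    that take part in at least one edge are returned.
    """
    n = len(sali)
    pairs = list(combinations(range(n), 2))
    finite = [sali[i][j] for i, j in pairs if sali[i][j] != float("inf")]
    # Assumption: the cutoff is a fraction of the largest finite SALI value.
    threshold = cutoff_fraction * max(finite, default=0.0)
    edges = [(i, j) for i, j in pairs if sali[i][j] > threshold]
    nodes = sorted({k for edge in edges for k in edge})
    return nodes, edges

# Toy SALI matrix for three compounds (made-up values).
matrix = [
    [0.0, 6.0, 1.0],
    [6.0, 0.0, 2.0],
    [1.0, 2.0, 0.0],
]
print(sali_graph(matrix, 0.5))   # only the steepest pair survives
print(sali_graph(matrix, 0.2))   # a lower cutoff adds more edges
```

Raising the cutoff prunes edges, and with them any node that no longer has a neighbor, which is why higher cutoffs give simpler graphs.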
Visualizing SALI Values
[Figure: example SALI graph with numbered compounds as nodes]
Varying the Cutoff
• The cutoff controls the complexity of the graph
• Higher cutoffs will highlight the most significant activity cliffs
[Figure: SALI graphs of the same dataset at cutoffs of 90%, 50% and 20%; the number of displayed nodes grows as the cutoff is lowered]
Better Visualization - SALIViewer
http://sali.rguha.net
What Can We Do With SALIs?
• SALI characterizes cliffs & non-cliffs
• For a given molecular representation, SALIs give us an idea of the smoothness of the SAR landscape
• Models try to encode this landscape
• Use the landscape to guide descriptor or model selection
Descriptor Space Smoothness
• Edge count of the SALI graph for varying cutoffs
• Measures smoothness of the descriptor space
• Can reduce this to a single number (AUC)
[Figure: number of edges in the SALI graph (0 to 400) versus SALI cutoff (0.0 to 1.0), with example SALI graphs at three cutoffs whose nodes are drugs such as astemizole, cisapride, E-4031, sertindole, pimozide and dofetilide; at the highest cutoff only astemizole, grepafloxacin, sildenafil, moxifloxacin and gatifloxacin remain]
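The edge-count curve and its reduction to a single number can be sketched as below. The cutoff is again assumed to be a fraction of the largest finite SALI value, and the AUC is computed with the trapezoidal rule; both choices, and the function names, are assumptions for illustration.

```python
def edge_count_curve(sali_values, n_steps=10):
    """Number of SALI graph edges at evenly spaced cutoffs between 0 and 1.

    sali_values holds one SALI value per compound pair.
    """
    finite = [v for v in sali_values if v != float("inf")]
    top = max(finite, default=0.0)
    cutoffs = [k / n_steps for k in range(n_steps + 1)]
    counts = [sum(1 for v in sali_values if v > c * top) for c in cutoffs]
    return cutoffs, counts

def auc(xs, ys):
    """Area under a piecewise-linear curve (trapezoidal rule)."""
    return sum((xs[k + 1] - xs[k]) * (ys[k] + ys[k + 1]) / 2.0
               for k in range(len(xs) - 1))

# Four pairwise SALI values (made up) sampled at five cutoffs.
cutoffs, counts = edge_count_curve([1.0, 2.0, 3.0, 4.0], n_steps=4)
print(cutoffs, counts)
print(auc(cutoffs, counts))
```

A curve that collapses quickly as the cutoff rises gives a small area, while a curve that keeps many edges at high cutoffs gives a large one.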
Other Examples
• Instead of fingerprints, we use molecular descriptors
• SALI denominator now uses Euclidean distance
• 2D & 3D random descriptor sets
– None are really good
– Too rough, or
– Too flat
[Figure: number of edges in the SALI graph (0 to 400) versus SALI cutoff (0.0 to 1.0) for the 2D and 3D descriptor sets]
Feature Selection Using SALI
• Surprisingly, exhaustive search of 66,000 4-descriptor combinations did not yield semi-smoothly decreasing curves
• Not entirely clear what type of curve is desirable
SALI Graphs & Predictive Models
• The graph view allows us to view SARs and identify trends easily
• The aim of a QSAR model is to encode SARs
• Traditionally, we consider the quality of a model in terms of RMSE or R2
• But in general, we're not as interested in RMSEs as we are in whether the model predicted something as more active than something else
– What we want to have is the correct ordering
– We assume the model is statistically significant
Measuring Model Quality
• A QSAR model should easily encode the "rolling hills"
• A good model captures the most significant cliffs
• Can be formalized as: how many of the edge orderings of a SALI graph does the model predict correctly?
• Define S(X), representing the number of edges correctly predicted for a SALI network at a threshold X
• Repeat for varying X and obtain the SALI curve
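One way to make S(X) concrete is sketched below. An edge is counted as correct when the model ranks the two compounds in the same order as the observed activities. The normalization (correct minus incorrect, divided by the number of edges) is an assumption chosen so that the result lies between -1 and 1; the slides only describe S(X) as the number of correctly predicted edges. The name `s_of_x` is hypothetical.

```python
def s_of_x(observed, predicted, sali_pairs, threshold):
    """Score how well predicted activities reproduce SALI edge orderings.

    sali_pairs maps a pair (i, j) to its SALI value; only pairs with a
    value above the threshold contribute.
    """
    correct = incorrect = 0
    for (i, j), value in sali_pairs.items():
        if value <= threshold:
            continue
        obs_diff = observed[i] - observed[j]
        pred_diff = predicted[i] - predicted[j]
        if obs_diff * pred_diff > 0:
            correct += 1          # same ordering as observed
        else:
            incorrect += 1        # wrong ordering, or a predicted tie
    total = correct + incorrect
    # Assumption: an empty graph scores 0.
    return (correct - incorrect) / total if total else 0.0

observed = [9.0, 6.0, 6.5]
predicted = [8.0, 7.0, 6.5]       # made-up model output
pairs = {(0, 1): 6.0, (0, 2): 5.0, (1, 2): 1.0}
curve = [(x, s_of_x(observed, predicted, pairs, x)) for x in (0.0, 2.0, 5.5)]
print(curve)
```

In this toy case the model orders the two steep pairs correctly but gets the shallow pair (1, 2) wrong, so the score rises once the threshold removes that pair.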
SALI Curves
[Figure: SALI curves, S(X) (-1.0 to 1.0) versus X (0.0 to 1.0), for 3-descriptor, 5-descriptor and scrambled 3-descriptor models; a second panel shows a curve with SCI = 0.12]
Model Search Using the SCI
• We've used the SALI to retrospectively analyze models
• Can we use SALI to develop models?
– Identify a model that captures the cliffs
• Tricky
– Cliffs are fundamentally outliers
– Optimizing for good SALI values implies overfitting
– Need to trade off between SALI & generalizability
The Objective Function
• S0 is a measure of the model's ability to summarize the dataset (analogous to RMSE)
• S100 measures the model's ability to capture cliffs
• Ideally, the curve starts high and stays high
[Figure: S(X) (0.6 to 1.0) versus SALI cutoff (0.0 to 1.0), with S0 and S100 marked on the curve]
$$F = \frac{1}{S_{100}}$$
$$F = \frac{1}{S_0} + \frac{S_{100} - S_0}{2}$$
$$F = \frac{1}{\mathrm{SCI}}$$
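The three candidate objective functions can be written out as a short sketch. It assumes SCI denotes the area under the S(X) curve, computed here with the trapezoidal rule; that reading and the function names are assumptions. Each function divides by a score, so a zero score raises ZeroDivisionError and would need special handling in a real search.

```python
def objective_s100(s100):
    """F = 1 / S100."""
    return 1.0 / s100

def objective_combined(s0, s100):
    """F = 1 / S0 + (S100 - S0) / 2."""
    return 1.0 / s0 + (s100 - s0) / 2.0

def objective_sci(cutoffs, s_values):
    """F = 1 / SCI, with SCI assumed to be the area under the S(X) curve."""
    sci = sum((cutoffs[k + 1] - cutoffs[k]) * (s_values[k] + s_values[k + 1]) / 2.0
              for k in range(len(cutoffs) - 1))
    return 1.0 / sci

# Made-up curve values for illustration.
print(objective_s100(0.5))
print(objective_combined(0.5, 1.0))
print(objective_sci([0.0, 0.5, 1.0], [0.5, 0.5, 0.5]))
```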
SALI Based Model Selection
• Considered the BZR dataset from Sutherland et al
• Identified "best" models using a GA to select from a pool of 2D descriptors
• While SALI based optimization can lead to a "better" curve, it doesn't give the best model
[Figure: S(X) versus SALI cutoff for models selected using RMSE, SCI and S(100), over the full cutoff range (0.0 to 1.0) and zoomed in to 0.00 to 0.08]
Sutherland, J et al, J. Chem. Inf. Comput. Sci., 2003, 43, 1906-1915
SALI Based Model Selection
• 107 aryl azoles as ER-β agonists
• Used a GA and 2D descriptors to identify models
• In this case, a SALI based objective function was able to identify the best model
• Interestingly, SCI does not seem to perform very well
[Figure: S(X) versus SALI cutoff for models selected using RMSE, SCI and S(0) + D/2, over the full cutoff range (0.0 to 1.0) and zoomed in to 0.00 to 0.08]
Malamas, M.S. et al, J Med Chem, 2004, 47, 5021-5040
SALI Based Model Selection
• The size of the solution space explored depends on the SALI objective function
[Figure: RMSE of the models found with each objective function. BZR: RMSE, S(100), SCI (RMSE axis 0.90 to 1.15). ER-β: 1/S(0) + D/2, RMSE, SCI (RMSE axis 0.55 to 0.65)]
Predicting the Landscape
• Rather than predicting activity directly, we can try to predict the SAR landscape
• Implies that we attempt to directly predict cliffs
– Observations are now pairs of molecules
• A more complex problem
– Choice of features is trickier
– Still face the problem of cliffs as outliers
– Somewhat similar to predicting activity differences
Scheiber et al, Statistical Analysis and Data Mining, 2009, 2, 115-122
Predicting Cliffs
• Dependent variables are pairwise SALI values, calculated using fingerprints
• Independent variables are molecular descriptors, but considered pairwise
– Absolute difference of descriptor pairs, or
– Geometric mean of descriptor pairs
– …
• Develop a model to correlate pairwise descriptors to pairwise SALI values
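The two aggregation schemes named above can be sketched as follows. Function names are hypothetical, and the geometric mean is taken as sqrt(a * b), which assumes the two descriptor values do not have opposite signs; descriptors that can be negative would need shifting or a different aggregation.

```python
import math
from itertools import combinations

def abs_diff(x, y):
    """Absolute difference of each descriptor for a pair of molecules."""
    return [abs(a - b) for a, b in zip(x, y)]

def geo_mean(x, y):
    """Geometric mean of each descriptor for a pair of molecules."""
    return [math.sqrt(a * b) for a, b in zip(x, y)]

def pairwise_rows(descriptors, aggregate):
    """One feature row per unordered pair of molecules."""
    n = len(descriptors)
    return {(i, j): aggregate(descriptors[i], descriptors[j])
            for i, j in combinations(range(n), 2)}

# Three molecules with two made-up descriptors each.
descriptors = [[1.0, 4.0], [4.0, 9.0], [9.0, 1.0]]
print(pairwise_rows(descriptors, abs_diff))
print(pairwise_rows(descriptors, geo_mean))
```

Each row would then be matched with the SALI value of the same pair to form the training set; N molecules give N(N-1)/2 rows, so 30 molecules yield 435 observations.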
A Test Case
• We first consider the Cavalli CoMFA dataset of 30 molecules with pIC50's
• Evaluate topological and physicochemical descriptors
• Developed random forest models
– On the original observed values (30 observations)
– On the SALI values (435 observations, one per pair of molecules: 30 × 29 / 2)
Cavalli, A. et al, J Med Chem, 2002, 45, 3844-3853
Double Counting Structures?
• The dependent and independent variables both encode structure
• But pretty low correlations between individual pairwise descriptors and the SALI values
[Figure: histograms of R2 (percent of total; R2 from 0.00 to 0.15) between individual pairwise descriptors and SALI, for the AbsDiff and GeoMean aggregations]
Model Summaries
• All models explain a similar % of variance of their respective datasets
• Using the geometric mean as the descriptor aggregation function seems to perform best
• SALI models are more robust due to the larger size of the dataset
[Figure: predicted versus observed scatter plots. Original pIC50: RMSE = 0.97. SALI, AbsDiff: RMSE = 1.10. SALI, GeoMean: RMSE = 1.04]
Test Case 2
• Considered the Holloway docking dataset, 32 molecules with pIC50's and Einter
• Similar strategy as before
• Need to transform SALI values
• Descriptors show minimal correlation
[Figure: histograms (percent of total) of SALI (0 to 120) and of log10(SALI) (-1 to 2)]
Holloway, M.K. et al, J Med Chem, 1995, 38, 305-317
Model Summaries
• The SALI models perform much worse in terms of % of variance explained
• Descriptor aggregation method does not seem to have much effect
• The SALI models appear to perform decently on the cliffs, but miss the most significant ones
[Figure: predicted versus observed scatter plots. Original pIC50: RMSE = 1.05. log10(SALI), AbsDiff: RMSE = 0.48. log10(SALI), GeoMean: RMSE = 0.48]
Model Summaries
• With untransformed SALI values, models perform similarly in terms of % of variance explained
• The most significant cliffs correspond to stereoisomers
[Figure: predicted versus observed scatter plots. Original pIC50: RMSE = 1.05. SALI, AbsDiff: RMSE = 9.76. SALI, GeoMean: RMSE = 10.01]
Model Caveats
• Models based on SALI values are dependent on there being an SAR in the original activity data
• Scrambling results for these models are poorer than the original models but aren't as random as expected
[Figure: predicted versus observed SALI (both axes 0 to 6)]
SALI in Bulk
• Much of this material is exploratory
• So we're interested in trends across many assays
• ChEMBL is an excellent source for activity cliffs
• Assay selection
– Human target, binding assay
– High confidence (score = 9)
– Number of compounds between 75 & 300
– Only consider non-NA activity values
– Censored data is considered the same as exact data
– 31 assays
• We identify datasets with activity cliffs by the skewness of the dependent variable
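The skewness screen mentioned in the last bullet could look like the sketch below. The slides do not say which skewness estimator or threshold was used; the moment-based estimator here (third central moment divided by the 1.5 power of the second) and the function name are assumptions.

```python
def skewness(values):
    """Moment-based sample skewness: m3 / m2 ** 1.5."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # all values identical: no spread, no skew
    return m3 / m2 ** 1.5

# Made-up pIC50 values: one assay is symmetric, the other has a few
# unusually potent compounds that pull the distribution to the right.
symmetric_assay = [5.0, 6.0, 7.0]
skewed_assay = [5.0, 5.0, 5.0, 5.0, 9.0]
print(skewness(symmetric_assay), skewness(skewed_assay))
```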
SALI in Bulk
• Used pIC50's and CDK hashed fingerprints
• Lots of material to explore here
SALI in Bulk
• But fingerprint quality is important
• Assay 379744 has 83 cliffs of infinite height since 83 pairs of molecules have Tc = 1.0
• Probably should simply ignore such "identical" molecules
Conclusions
• SALI is the first step in characterizing the SAR landscape
• Allows us to directly analyze the landscape, as opposed to individual molecules
• Being able to predict the landscape could serve as a useful way to extend an SAR landscape
Acknowledgements
• John Van Drie
• Gerry Maggiora
• Mic Lajiness
• Jurgen Bajorath
• Ajit Jadhav
• Trung Nguyen
ER-β Dataset
• 107 molecules, censored data taken as exact
• A few big cliffs
• The best linear model performs decently
[Figure: histogram of pIC50 (frequency; pIC50 from -4.0 to -0.5) and predicted versus observed pIC50 (-4 to -1)]
SALI Curves from DRCs
• No difference in major cliffs
• Some of the minor cliffs are highlighted using the DRC instead of IC50
Clustering in the Holloway Dataset
[Figure: dendrogram of the Holloway compounds from hierarchical clustering, hclust (*, "complete"); height axis from 0.5 to 1.0]