Methods for Virtual Screening and Scaffold Hopping for Chemical Compounds
Nikil Wale1, George Karypis1, Ian A. Watson2
1Department of Computer Science & Engineering, University of Minnesota. [email protected], [email protected]
2Eli Lilly Inc. Lilly Research Labs, Indianapolis. [email protected]
Drug Development Pipeline
Choose a drug target
Drug screening/ lead development/lead optimization
Small scale production
Laboratory and animal testing
Production for clinical trials
File IND
10^6 → 1
Experimental: high-throughput screening, target-specific assays, ...
In silico: clustering, ranked-retrieval, classification, docking, ...
Drug Screening
Database of Compounds
Bioactivity Ranking
[Figure: chemical structures of a query compound and of the database compounds ranked 1, 2, 3]
Given a query compound, rank the compounds in the database based on how similar they are to the query in terms of their bioactivity.
Ranked-Retrieval Problem
Key Principle: Structure Activity Relationship
The properties of a chemical compound largely depend on its structure (structure-activity relationship or SAR).
Exploit structural similarity
Capturing structure → structural descriptors (descriptor-space)
Knowledge / Information Retrieval / Data-mining Based Approaches
Too much emphasis on structural similarity.
The query, and subsequently the hits (top-ranked compounds), may have bad ADME properties.
Structures may be toxic or promiscuous.
To avoid these drawbacks, retrieve compounds that are bioactive yet structurally diverse and different from the query.
Drawbacks of Structural Descriptor-Space based Ranked-Retrieval
Database of Compounds with a descriptor-space representation
Diverse structure but same bioactivity
Scaffold-Hopping Problem
[Figure: chemical structures of a query compound and of structurally diverse database compounds ranked 1, 2, 3]
Given a query compound, rank the compounds in the database based on how similar they are to the query in terms of their bioactivity but as dissimilar as possible in terms of their structure to that of the query.
Hard Problem
Runs counter to SAR.
For a query, it is hard to distinguish a genuinely structurally different but active compound from an inactive compound.
The definition of a scaffold-hop for a query is not clear or objective in many cases.
Examples
COX2 (cyclooxygenase-2) inhibitors: Celebrex, Vioxx, Bextra
PDE5 (phosphodiesterase type 5) inhibitors: Viagra, Levitra, Cialis
Methods for Scaffold-Hopping
Based on indirect measures of similarity.
Includes information beyond structural similarity → allows structural diversity.
Automatic relevance-feedback based methods.
Graph analysis based methods.
[Figure: chemical structures illustrating compound pairs with "High structural (direct) similarity" and a pair with "Low structural similarity but similarity by association"]
Based on indirect similarity derived from an automatic relevance-feedback mechanism.
[Figure: the query; the set of the three compounds most similar to the query in terms of structure; and a compound with low structural similarity to the query but high similarity to that set of most similar compounds]
Automatic Relevance Feedback based Methods
TopKAvg
Ranked Database (by sim(q, c)): c39, c2, c13, ..., cj
top-k: A = {c39, c2, c13}
Re-ranked Database (by simA(q, c), using q + A): c9, c13, c40, ..., cj
simA(q,c) = α sim(q,c) + (1-α) simavg(c,A)
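The TopKAvg re-ranking can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: compounds are represented as sets of descriptor ids, `tanimoto` stands in for sim, simavg(c, A) is assumed to be the plain average of sim(c, a) over the members a of A, and the names `rank_direct` and `topk_avg` and the defaults `k=3`, `alpha=0.5` are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sets of descriptor ids."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def rank_direct(query, database):
    """Compound ids sorted by decreasing direct similarity sim(q, c)."""
    return sorted(database,
                  key=lambda cid: tanimoto(query, database[cid]),
                  reverse=True)


def topk_avg(query, database, k=3, alpha=0.5):
    """Re-rank with simA(q,c) = alpha*sim(q,c) + (1-alpha)*simavg(c,A).

    A is the set of the k compounds most similar to the query.
    Assumption: simavg(c, A) is the mean of sim(c, a) over a in A.
    """
    top = rank_direct(query, database)[:k]  # the set A

    def sim_a(cid):
        c = database[cid]
        sim_avg = sum(tanimoto(c, database[a]) for a in top) / len(top)
        return alpha * tanimoto(query, c) + (1 - alpha) * sim_avg

    return sorted(database, key=sim_a, reverse=True)
```

With `alpha` close to 1 the result approaches the direct ranking; lowering it lets compounds that resemble the members of A, rather than the query itself, move up the list.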
Automatic Relevance Feedback based Methods
ClustWt
Ranked Database (by sim(q, c)): c39, c2, c13, ..., cj
Cluster into m clusters (the figure shows clusters 1 and 2 containing c8, c13, c40, c10, c1)
Clusters ranked with respect to their similarity to the query
top-k: A = {c8, c10, c13}
Re-ranked Database (by simA(q, c), using q + A): c8, c10, c1, ..., cj
simA(q,c) = α sim(q,c) + (1-α) simavg(c,A)
Automatic Relevance Feedback based Methods
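BestSumDescSim / BestMaxDescSim
Database (D): c1, c2, c3, ..., ci
A = {}
A = A U {cnext}
D = D – {cnext}
BestSumDescSim:
$$c_{next} = \arg\max_{c_i \in D} \Big\{ \sum_{c_j \in A \cup \{q\}} \mathrm{sim}(c_i, c_j) \Big\}$$
BestMaxDescSim:
$$c_{next} = \arg\max_{c_i \in D} \Big\{ \max_{c_j \in A \cup \{q\}} \mathrm{sim}(c_i, c_j) \Big\}$$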
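The greedy selection on this slide can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes the loop runs until D is empty, that the order in which compounds are selected is the final ranking, that compounds are sets of descriptor ids, and that sim is the Tanimoto coefficient. The name `greedy_rank` and its `mode` parameter are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sets of descriptor ids."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def greedy_rank(query, database, mode="sum"):
    """Repeatedly move the best-scoring compound from D into A.

    mode="sum": score(c_i) = sum of sim(c_i, c_j) over c_j in A U {q}
    mode="max": score(c_i) = max of sim(c_i, c_j) over c_j in A U {q}
    Assumption: the selection order is returned as the ranking.
    """
    combine = sum if mode == "sum" else max
    remaining = dict(database)  # D
    selected = []               # A, in selection order
    while remaining:
        refs = [query] + [database[cid] for cid in selected]
        c_next = max(
            remaining,
            key=lambda cid: combine(tanimoto(remaining[cid], r) for r in refs),
        )
        selected.append(c_next)   # A = A U {c_next}
        del remaining[c_next]     # D = D - {c_next}
    return selected
```

The first selection always equals the top hit of standard retrieval, because A is empty and only the query is used as a reference at that point.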
Methods for Scaffold-Hopping
Based on Indirect measure of similarity derived from a nearest-neighbor graph for compounds.
[Figure: nearest-neighbor graph over the compounds c1 to c8]
c2 and c4 are strongly related by the metric of the number of paths of size 2 that connect them, even though they do not have a direct relation (path of size 1).
Graph formed using information on the neighborhood of each chemical compound
Network Analysis based Methods
[Figure: for each of c1, c2, ..., cj and q, the remaining compounds (D U {q} – {c1}, D U {q} – {c2}, D U {q} – {cj}, D U {q} – {q}) are ranked against it and the top-k are kept]
Rank every compound in the database and the query with respect to every other compound
Nearest Neighbor graph (NG)
There exists an edge between two nodes (compounds) ci and cj if ci occurs in the list of top-k neighbors of cj or vice versa.
Network Analysis based Methods
Mutual Nearest Neighbor graph (MG)
There exists an edge between two nodes (compounds) ci and cj if ci occurs in the list of top-k neighbors of cj and vice versa.
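The two edge rules translate directly into code. The sketch below assumes the top-k neighbor lists have already been computed (with any direct similarity) and are given as a dict from compound id to a set of neighbor ids; the function name `build_graphs` and this input layout are illustrative choices, not part of the slides.

```python
def build_graphs(neighbors):
    """Build NG and MG adjacency sets from top-k neighbor lists.

    neighbors maps each compound id to the set of its top-k neighbors.
    NG: edge (ci, cj) if ci is in cj's list OR cj is in ci's list.
    MG: edge (ci, cj) if ci is in cj's list AND cj is in ci's list.
    Both graphs are returned as undirected adjacency dicts.
    """
    ng = {c: set() for c in neighbors}
    mg = {c: set() for c in neighbors}
    for ci, nbrs in neighbors.items():
        for cj in nbrs:
            # One-sided membership is enough for NG.
            ng[ci].add(cj)
            ng[cj].add(ci)
            # MG additionally requires the reverse membership.
            if ci in neighbors[cj]:
                mg[ci].add(cj)
                mg[cj].add(ci)
    return ng, mg
```

By construction every MG edge is also an NG edge, so MG is the sparser of the two graphs.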
Example – NG and MG
[Figure: top-3 neighbor lists for c1, c2, cj and q, and the resulting NG and MG graphs]
NG: adjG(q) = {c44, c5, c1}, adjG(c2) = {cj, q, c1}
MG: adjG(q) = {c1}, adjG(c2) = {cj, c1}
The graph-based similarity is given by
$$\mathrm{sim}_G(c_i, c_j) = \frac{|\mathrm{adj}_G(c_i) \cap \mathrm{adj}_G(c_j)|}{|\mathrm{adj}_G(c_i) \cup \mathrm{adj}_G(c_j)|}$$
This similarity can be used in conjunction with the Sum or Max search schemes.
Four methods derived from graph based similarity – BestSumNG, BestMaxNG, BestSumMG, BestMaxMG.
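A small sketch of the graph-based similarity, under the assumption that it is the number of shared neighbors divided by the size of the union of the two adjacency lists. The adjacency dict in the usage note is hypothetical toy data, and `graph_similarity` is an illustrative name.

```python
def graph_similarity(adj, ci, cj):
    """Shared-neighbor similarity between two nodes of NG or MG.

    adj maps each compound id to the set of its adjacent compounds.
    Assumption: similarity = |adj(ci) & adj(cj)| / |adj(ci) | adj(cj)|,
    defined as 0.0 when both adjacency sets are empty.
    """
    a, b = adj[ci], adj[cj]
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical graph: c2 and c4 share no edge but have the same neighbors.
adj = {
    "c1": {"c2", "c4"},
    "c3": {"c2", "c4"},
    "c5": {"c2", "c4"},
    "c2": {"c1", "c3", "c5"},
    "c4": {"c1", "c3", "c5"},
}
```

Here `graph_similarity(adj, "c2", "c4")` is high even though c2 and c4 are not directly connected, which is the kind of indirect relation the graph methods are meant to capture.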
Network Analysis based Methods
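Sum search scheme:
$$c_{next} = \arg\max_{c_i \in D} \Big\{ \sum_{c_j \in A \cup \{q\}} \mathrm{sim}_G(c_i, c_j) \Big\}$$
Max search scheme:
$$c_{next} = \arg\max_{c_i \in D} \Big\{ \max_{c_j \in A \cup \{q\}} \mathrm{sim}_G(c_i, c_j) \Big\}$$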
Experimental Methodology
A combination of 6 target-specific datasets and 3 descriptor-spaces (GF, ECFP, ErG) was used, resulting in a total of 18 problems.
Tanimoto similarity used to measure all direct similarities.
Standard Retrieval used as baseline for comparison.
Two related schemes, Turbo Max/Sum, are also compared.
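For reference, the Tanimoto similarity between two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below stores a fingerprint as a Python integer bitmask, which is only one possible encoding and not necessarily how the GF, ECFP, or ErG descriptors were stored in the experiments; `tanimoto_bits` is a hypothetical name.

```python
def tanimoto_bits(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints stored as int bitmasks.

    Assumption: two all-zero fingerprints are given similarity 0.0.
    """
    either = bin(fp_a | fp_b).count("1")  # bits set in at least one
    both = bin(fp_a & fp_b).count("1")    # bits set in both
    return both / either if either else 0.0
```

For example, the fingerprints `0b1110` and `0b0111` share two of the four bits that are set in either one.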
Standard Retrieval
Ranked Database (by sim(q, c)): c39, c2, c13, ..., cj
Related Schemes
Turbo SumFusion/MaxFusion
Ranked Database (by sim(q, c)): c39, c2, c13, ..., cj
top-k: A = {c39, c2, c13}
Re-ranked Database (by simA(q, c), using q + A): c9, c13, c40, ..., cj
TurboMaxFusion: simA(q,c) = max{sim(c,c39), sim(c,c2), sim(c,c13)}
TurboSumFusion: simA(q,c) = sim(c,c39) + sim(c,c2) + sim(c,c13)
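A minimal sketch of the two Turbo fusion baselines, under the reading that each database compound c is scored by fusing its similarities to the k compounds most similar to the query. Whether the query itself is also included as a reference is not stated on the slide and is left out here. Compounds are sets of descriptor ids, and `turbo_rank` with its `fusion` parameter is a hypothetical name.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sets of descriptor ids."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def turbo_rank(query, database, k=3, fusion="max"):
    """Re-rank by fused similarity to the query's top-k neighbors A.

    fusion="max": simA(q, c) = max of sim(c, a) over a in A
    fusion="sum": simA(q, c) = sum of sim(c, a) over a in A
    Assumption: the query is used only to pick A, not as a reference.
    """
    direct = sorted(database,
                    key=lambda cid: tanimoto(query, database[cid]),
                    reverse=True)
    top = direct[:k]  # the set A
    fuse = max if fusion == "max" else sum

    def sim_a(cid):
        return fuse(tanimoto(database[cid], database[a]) for a in top)

    return sorted(database, key=sim_a, reverse=True)
```

Unlike TopKAvg, this score has no α-weighted direct term, so the query influences the result only through the choice of A.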
Definition of Scaffold-Hops
[Figure: the other actives (e.g. a10, ak, a6, a4) ranked against the query q = ai and split into the top 50% and the bottom 50% of total actives]
For every active (ai) in a dataset scaffold-hops are defined.
For every active, all other actives are ranked against it using path-based fingerprints. The lowest 50% of the actives in this list form the scaffold-hops for ai.
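The scaffold-hop definition can be sketched as below. The sketch assumes each active is given as a set of path-based fingerprint features and that Tanimoto similarity is used for the ranking; how an odd number of actives is split is not specified on the slide, so the rounding here (the smaller half becomes the scaffold-hops) is an assumption. `scaffold_hops` is a hypothetical name.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sets of fingerprint features."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def scaffold_hops(active_id, actives):
    """Ids of the actives that count as scaffold-hops for active_id.

    All other actives are ranked by decreasing similarity to active_id;
    the lowest-ranked half of that list is returned.
    Assumption: with an odd count, the middle active is not a hop.
    """
    ref = actives[active_id]
    others = [a for a in actives if a != active_id]
    others.sort(key=lambda a: tanimoto(ref, actives[a]), reverse=True)
    return others[(len(others) + 1) // 2:]
```

Because the split is relative to each query, the same active can be a scaffold-hop for one query and a structurally similar active for another.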
Performance Evaluation
For each problem, every active in it was used as query exactly once.
The ranked-retrieval and scaffold-hopping performance is measured using uninterpolated precision in the top 50.
Two methods are compared using the average of the log2 ratios of their uninterpolated precision over the 18 problems.
Example for a query q:
rank   id    bioactivity   precision
1      c5    active        1/1
2      c18   active        2/2
3      c1    inactive
...
11     c44   inactive
12     ci    active        3/12
...
50     ck    inactive
Uninterpolated precision = (1/1 + 2/2 + 3/12) / 50 = 0.045
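The evaluation measure can be reproduced with a short sketch. It assumes the ranked list is given as a list of booleans (True for an active), that precision is accumulated only at the ranks where an active occurs, and that the sum is divided by the cutoff of 50, as in the worked example with actives at ranks 1, 2, and 12. The second function averages log2 ratios over problems; the function names are hypothetical.

```python
import math


def uninterpolated_precision(is_active, top_n=50):
    """Sum of precision values at each active rank, divided by top_n."""
    hits = 0
    total = 0.0
    for rank, active in enumerate(is_active[:top_n], start=1):
        if active:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / top_n


def avg_log2_ratio(scores_a, scores_b):
    """Average of log2(a / b) over paired per-problem scores.

    Assumption: all scores are strictly positive.
    """
    ratios = [math.log2(a / b) for a, b in zip(scores_a, scores_b)]
    return sum(ratios) / len(ratios)
```

A positive average log2 ratio means the first method scores higher on average, and a value of 1 corresponds to a twofold improvement.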
Results – Scaffold-Hopping Performance
[Figure: pairwise comparison matrix (rows and columns) of the methods StdRet, TurboSum, TurboMax, TopkAvg, BestMaxDS, BestSumNG, BestMaxNG, BestSumDS, BestSumMG, BestMaxMG, ClustWt. Legend: row method statistically better; row method statistically worse; row and column methods statistically same.]
Results – Ranked-Retrieval Performance
[Figure: pairwise comparison matrix (rows and columns) of the methods StdRet, TurboSum, TurboMax, TopkAvg, BestMaxDS, BestSumNG, BestMaxNG, BestSumDS, BestSumMG, BestMaxMG, ClustWt. Legend: row method statistically better; row method statistically worse; row and column methods statistically same.]
Results - Problem Specific
Conclusions & Future Work
Methods inspired by automatic relevance feedback and based on indirect similarity improve scaffold-hopping performance.
Indirect similarity based methods are more powerful than direct similarity based measures and show significant improvements over the state of the art.
Problem of selecting the right value for the parameter ‘k’.
Selecting the right descriptor space.
Questions?
THANK YOU!
Thanks to:
• Karypis Research
• Kevin DeRonne
• Christopher Kauffman
• Huzefa Rangwala
• Xia Ning
• Yevgeniy Podoylan
www.cs.umn.edu/~karypis