Methods for Virtual Screening and Scaffold Hopping for Chemical Compounds

Nikil Wale1, George Karypis1, Ian A. Watson2

1Department of Computer Science & Engineering, University of Minnesota. [email protected], [email protected]

2Eli Lilly Inc. Lilly Research Labs, Indianapolis. [email protected]

Drug Development Pipeline

Choose a drug target

Drug screening/ lead development/lead optimization

Small scale production

Laboratory and animal testing

Production for clinical trials

File IND

10^6 → 1

Experimental: high-throughput screening, target-specific assays, ...

In silico: clustering, ranked retrieval, classification, docking, ...

Drug Screening

[Figure: a query compound is compared against a database of compounds; the database compounds are ranked 1, 2, 3, ... by bioactivity.]

Given a query compound, rank the compounds in the database based on how similar they are to the query in terms of their bioactivity.

Ranked-Retrieval Problem

Key Principle: Structure-Activity Relationship

The properties of a chemical compound largely depend on its structure (structure-activity relationship or SAR).

Exploit structural similarity

Capturing structure → structural descriptors (descriptor-space)

Knowledge / Information Retrieval / Data-mining Based Approaches

Too much emphasis on structural similarity.

The query, and subsequently the hits (top-ranked compounds), may have bad ADME properties.

Structures may be toxic or promiscuous.

To avoid these drawbacks, retrieve compounds that are structurally diverse and different from the query, yet bioactive.

Drawbacks of Structural Descriptor-Space based Ranked-Retrieval

[Figure: a query against a database of compounds with a descriptor-space representation; the retrieved compounds 1, 2, 3 have diverse structure but the same bioactivity.]

Scaffold-Hopping Problem


Given a query compound, rank the compounds in the database based on how similar they are to the query in terms of their bioactivity but as dissimilar as possible in terms of their structure to that of the query.

Hard Problem

Runs counter to SAR.

For a query, it is hard to distinguish a genuinely structurally different but active compound from an inactive compound.

The definition of a scaffold-hop for a query is not clear or objective in many cases.

Examples

COX2 (cyclooxygenase-2) inhibitors: Celebrex, Vioxx, Bextra

PDE5 (phosphodiesterase type 5) inhibitors: Viagra, Levitra, Cialis

Methods for Scaffold-Hopping

Based on indirect measures of similarity.

Includes information beyond structural similarity → allows structural diversity.

Automatic relevance-feedback based methods.

Graph analysis based methods.

[Figure: compounds linked by high structural (direct) similarity, and a pair with low structural similarity but similarity by association.]

Based on indirect similarity derived from an automatic relevance feedback mechanism.

[Figure: a query; the set of three most similar compounds to the query in terms of structure; and a compound with low structural similarity to the query but high similarity to the set of most similar compounds to the query.]

Automatic Relevance Feedback based Methods


TopKAvg

[Figure: the database is ranked against the query q by sim(q, c); the top-k compounds form the set A (here A = {c39, c2, c13}); the database is then re-ranked against q + A using simA.]

$\mathrm{sim}_A(q, c) = \alpha\,\mathrm{sim}(q, c) + (1 - \alpha)\,\mathrm{sim}_{avg}(c, A)$
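The TopKAvg re-ranking can be sketched in a few lines. This is only an illustration under stated assumptions: compounds are represented as sets of integer feature ids, `tanimoto` stands in for sim, simavg(c, A) is taken as the plain mean of sim(c, a) over a in A, and the names `top_k_avg`, `db`, `k` and `alpha` are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def top_k_avg(query, db, k=3, alpha=0.5, sim=tanimoto):
    """Return (A, re-ranked ids) for a dict `db` of id -> feature set."""
    # Step 1: direct similarity of every compound to the query.
    direct = {cid: sim(query, fp) for cid, fp in db.items()}
    ranked = sorted(db, key=direct.get, reverse=True)
    # Step 2: the top-k hits form the feedback set A.
    A = ranked[:k]

    # Step 3: sim_A(q, c) = alpha * sim(q, c) + (1 - alpha) * sim_avg(c, A).
    def sim_a(cid):
        avg = sum(sim(db[cid], db[a]) for a in A) / len(A)
        return alpha * direct[cid] + (1 - alpha) * avg

    return A, sorted(db, key=sim_a, reverse=True)


# Hypothetical toy data: c3 and c4 both have zero direct similarity to the
# query, but c3 shares features with the top hits and moves ahead of c4.
query = {1, 2, 3, 4}
db = {"c1": {1, 2, 3, 5}, "c2": {1, 2, 5, 6}, "c4": {8, 9}, "c3": {5, 6, 7}}
A, reranked = top_k_avg(query, db, k=2)
print(A, reranked)
```

In the toy data, c3 is ranked behind c4 by direct similarity alone (both score zero) but ahead of it after re-ranking, which is the indirect effect the method is after.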


ClustWt

[Figure: the database is ranked against the query q by sim(q, c); compounds are clustered into m clusters and the clusters are ranked with respect to their similarity to the query; the set A (here A = {c8, c10, c13}) is formed and the database is re-ranked against q + A using simA.]

$\mathrm{sim}_A(q, c) = \alpha\,\mathrm{sim}(q, c) + (1 - \alpha)\,\mathrm{sim}_{avg}(c, A)$


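BestSumDescSim and BestMaxDescSim

[Figure: starting from A = {}, the next compound c_next is repeatedly selected from the database D (c1, c2, c3, ..., ci), added to A, and removed from D.]

$A = \{\}$, then repeat: $A = A \cup \{c_{next}\}$, $D = D - \{c_{next}\}$

BestSumDescSim: $c_{next} = \arg\max_{c_i \in D} \Big\{ \sum_{c_j \in A \cup \{q\}} \mathrm{sim}(c_i, c_j) \Big\}$

BestMaxDescSim: $c_{next} = \arg\max_{c_i \in D} \Big\{ \max_{c_j \in A \cup \{q\}} \mathrm{sim}(c_i, c_j) \Big\}$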
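The greedy selection loop behind the Sum and Max schemes can be sketched as below. This assumes the selection rule is an arg max over the remaining database of the summed (or maximum) similarity to A ∪ {q}; the feature-set representation, the `tanimoto` helper, and the names `best_first`, `agg` and `n` are illustrative assumptions, not the authors' implementation.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def best_first(query, db, n, agg=sum, sim=tanimoto):
    """Greedily pick n compounds; agg=sum gives the Sum scheme, agg=max the Max scheme."""
    remaining = dict(db)   # D
    selected = []          # A, in selection order
    refs = [query]         # feature sets of A ∪ {q}
    while remaining and len(selected) < n:
        # c_next maximizes the aggregated similarity to everything in A ∪ {q}.
        c_next = max(
            remaining,
            key=lambda cid: agg(sim(remaining[cid], r) for r in refs),
        )
        selected.append(c_next)
        refs.append(remaining.pop(c_next))  # A = A ∪ {c_next}, D = D - {c_next}
    return selected


# Hypothetical toy data: b shares nothing with the query but overlaps with a.
query = {1, 2, 3}
db = {"a": {1, 2, 3, 4, 5}, "b": {4, 5, 6, 7}, "c": {3, 8, 9, 10}}
print(best_first(query, db, n=3, agg=sum))
print(best_first(query, db, n=3, agg=max))
```

In this toy case the Max scheme picks b second, because its best match is the already selected a, while the Sum scheme prefers c, which has some similarity to both the query and a.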

Methods for Scaffold-Hopping

Based on Indirect measure of similarity derived from a nearest-neighbor graph for compounds.

[Figure: graph over the compounds c1 to c8.]

c2 and c4 are strongly related by the metric of the number of paths of size 2 that connect them, even though they do not have a direct relation (path of size 1).

Graph formed using information on the neighborhood of each chemical compound

Network Analysis based Methods

[Figure: each compound c1, c2, ..., cj and the query q is used in turn to rank all the others (D ∪ {q} – {c1}, D ∪ {q} – {c2}, ..., D ∪ {q} – {cj}, D ∪ {q} – {q}), and its top-k neighbors are kept.]

Rank every compound in the database and the query with respect to every other compound

Nearest Neighbor graph (NG)

There exists an edge between two nodes (compounds) ci and cj if ci occurs in the list of top-k neighbors of cj or vice versa.

Mutual Nearest Neighbor graph (MG)

There exists an edge between two nodes (compounds) ci and cj if ci occurs in the list of top-k neighbors of cj and vice versa.
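The two edge rules differ only in "or" versus "and", which a small sketch makes concrete. It assumes the top-k neighbor lists have already been computed and are given as a dict; the function name `neighbor_graphs` and the toy lists are hypothetical.

```python
def neighbor_graphs(topk):
    """Build NG and MG adjacency sets from a dict node -> set of its top-k neighbors."""
    ng = {node: set() for node in topk}
    mg = {node: set() for node in topk}
    for ci, neighbors in topk.items():
        for cj in neighbors:
            # NG: an edge if either node lists the other.
            ng[ci].add(cj)
            ng[cj].add(ci)
            # MG: an edge only if both nodes list each other.
            if ci in topk[cj]:
                mg[ci].add(cj)
                mg[cj].add(ci)
    return ng, mg


# Hypothetical top-2 lists over the query q and three compounds.
topk = {
    "q": {"c1", "c5"},
    "c1": {"q", "c2"},
    "c2": {"c1", "c5"},
    "c5": {"c2", "c1"},
}
ng, mg = neighbor_graphs(topk)
print(ng["q"], mg["q"])
```

Here q and c5 are connected in NG (q lists c5) but not in MG (c5 does not list q), so MG is the sparser, stricter graph.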

Example – NG and MG

[Figure: example with top-3 neighbor lists for c1, c2, cj and q, and the resulting NG and MG graphs over q, c1, c2, c5, c18, c44 and cj.]

NG: adjG(q) = {c44, c5, c1}, adjG(c2) = {cj, q, c1}

MG: adjG(q) = {c1}, adjG(c2) = {cj, c1}

The graph-based similarity is given by

$\mathrm{sim}_G(c_i, c_j) = \dfrac{|\mathrm{adj}_G(c_i) \cap \mathrm{adj}_G(c_j)|}{|\mathrm{adj}_G(c_i) \cup \mathrm{adj}_G(c_j)|}$

This similarity can be used in conjunction with the Sum or Max search schemes.
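As a worked instance, the graph-based similarity can be evaluated on the adjacency lists from the NG and MG example. The sketch assumes the similarity is the size of the intersection of the two adjacency lists divided by the size of their union (the set operators are not legible in the extracted slide); `graph_sim` is a hypothetical name.

```python
def graph_sim(adj, ci, cj):
    """Overlap of two adjacency lists: |adj(ci) ∩ adj(cj)| / |adj(ci) ∪ adj(cj)|."""
    union = adj[ci] | adj[cj]
    return len(adj[ci] & adj[cj]) / len(union) if union else 0.0


# Adjacency lists for q and c2 as given in the NG and MG example.
adj_ng = {"q": {"c44", "c5", "c1"}, "c2": {"cj", "q", "c1"}}
adj_mg = {"q": {"c1"}, "c2": {"cj", "c1"}}

print(graph_sim(adj_ng, "q", "c2"))  # one shared neighbor out of five
print(graph_sim(adj_mg, "q", "c2"))  # one shared neighbor out of two
```

Under this assumption q and c2 share only c1, giving 1/5 on NG and 1/2 on MG.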

Four methods derived from graph based similarity – BestSumNG, BestMaxNG, BestSumMG, BestMaxMG.


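BestSumNG / BestSumMG: $c_{next} = \arg\max_{c_i \in D} \Big\{ \sum_{c_j \in A \cup \{q\}} \mathrm{sim}_G(c_i, c_j) \Big\}$

BestMaxNG / BestMaxMG: $c_{next} = \arg\max_{c_i \in D} \Big\{ \max_{c_j \in A \cup \{q\}} \mathrm{sim}_G(c_i, c_j) \Big\}$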

Experimental Methodology

Six target-specific datasets combined with three descriptor spaces (GF, ECFP, ErG) give a total of 18 problems.

Tanimoto similarity is used to measure all direct similarities.

Standard Retrieval is used as the baseline for comparison.

Two related schemes, Turbo Max/Sum, are also compared.
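For reference, the Tanimoto similarity of two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below assumes fingerprints are stored as Python integers used as bit vectors; this is an illustrative encoding and not necessarily how the GF, ECFP or ErG descriptors were represented in the experiments.

```python
def tanimoto_bits(a: int, b: int) -> float:
    """Tanimoto similarity of two binary fingerprints stored as integers."""
    both = bin(a & b).count("1")    # bits set in both fingerprints
    either = bin(a | b).count("1")  # bits set in at least one fingerprint
    return both / either if either else 0.0


# Two hypothetical 4-bit fingerprints sharing 2 of the 4 bits set overall.
print(tanimoto_bits(0b1110, 0b0111))
```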

Standard Retrieval

[Figure: the database is ranked against the query q by sim(q, c), giving a ranked database c39, c2, c13, ..., cj.]

Related Schemes

Turbo SumFusion/MaxFusion

[Figure: the database is ranked against the query q by sim(q, c); the top-k compounds form A = {c39, c2, c13}; the database is re-ranked against q + A using simA.]


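TurboMaxFusion: $\mathrm{sim}_A(q, c) = \max\{\mathrm{sim}(c, c_{39}), \mathrm{sim}(c, c_2), \mathrm{sim}(c, c_{13})\}$

TurboSumFusion: $\mathrm{sim}_A(q, c) = \mathrm{sim}(c, c_{39}) + \mathrm{sim}(c, c_2) + \mathrm{sim}(c, c_{13})$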
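A minimal sketch of the two fusion rules follows. It assumes each candidate c is scored by its similarity to every member of the top-k set A (the score has to depend on c to produce a ranking), fused with max or sum. The feature-set representation, the `tanimoto` helper and the name `turbo_rerank` are illustrative assumptions.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def turbo_rerank(query, db, k=3, fusion=max, sim=tanimoto):
    """Re-rank `db` (id -> feature set); fusion=max or fusion=sum."""
    # Standard retrieval: rank by direct similarity to the query.
    ranked = sorted(db, key=lambda cid: sim(query, db[cid]), reverse=True)
    A = ranked[:k]  # top-k nearest neighbors of the query

    # Fuse the similarities of each candidate to the members of A.
    def score(cid):
        return fusion(sim(db[cid], db[a]) for a in A)

    return sorted(db, key=score, reverse=True)


# Hypothetical toy data.
query = {1, 2, 3, 4}
db = {"c1": {1, 2, 3, 5}, "c2": {1, 2, 5, 6}, "c4": {8, 9}, "c3": {5, 6, 7}}
print(turbo_rerank(query, db, k=2, fusion=max))
print(turbo_rerank(query, db, k=2, fusion=sum))
```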

Definition of Scaffold-Hops

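[Figure: for a query active q = ai, the other actives (a10, ..., ak, a6, ..., a4) are ranked against it; the list is split into the top 50% and the bottom 50% of total actives.]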

For every active (ai) in a dataset, scaffold-hops are defined.

For every active, all other actives are ranked against it using path-based fingerprints. The lowest 50% of the actives in this list form the scaffold-hops for ai.
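The scaffold-hop definition amounts to sorting the other actives by similarity and keeping the bottom half. In the sketch below, small feature sets stand in for path-based fingerprints, and rounding toward the smaller top half when the count is odd is an assumption; the names `scaffold_hops` and `actives` are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def scaffold_hops(active_id, actives, sim=tanimoto):
    """Return the least similar half of the other actives for one query active."""
    others = [a for a in actives if a != active_id]
    # Rank the other actives from most to least similar to the query active.
    ranked = sorted(
        others,
        key=lambda a: sim(actives[active_id], actives[a]),
        reverse=True,
    )
    # The bottom 50% of this list are the scaffold-hops.
    return ranked[len(ranked) // 2:]


# Hypothetical stand-ins for path-based fingerprints of five actives.
actives = {
    "a1": {1, 2, 3, 4},
    "a2": {1, 2, 3, 5},
    "a3": {1, 2, 6, 7},
    "a4": {1, 8, 9},
    "a5": {10, 11},
}
print(scaffold_hops("a1", actives))
```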

Performance Evaluation

For each problem, every active in it was used as query exactly once.

The ranked-retrieval and scaffold-hopping performance is measured using uninterpolated precision in the top 50.

Two methods compared using the average of log2 ratios of their uninterpolated precision for the 18 problems.
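The comparison statistic can be written out as follows: for two methods, take the log2 of the ratio of their uninterpolated precisions on each problem and average over the problems. The sketch assumes all precisions are strictly positive; the function name and the sample values are hypothetical, not results from the study.

```python
import math


def mean_log2_ratio(prec_a, prec_b):
    """Average of log2(prec_a[i] / prec_b[i]) over all problems."""
    if len(prec_a) != len(prec_b):
        raise ValueError("both methods need one precision value per problem")
    ratios = [math.log2(a / b) for a, b in zip(prec_a, prec_b)]
    return sum(ratios) / len(ratios)


# Hypothetical precisions on two problems: method A is twice as good on the
# first problem and equal on the second.
print(mean_log2_ratio([0.5, 0.25], [0.25, 0.25]))
```

A positive value means the first method is better on average, a negative value means the second is, and swapping the two methods flips the sign.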

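[Figure: worked example of uninterpolated precision for a query q over the top 50 ranked compounds.]

rank   id    bioactivity   precision
1      c5    active        1/1
2      c18   active        2/2
3      c1    inactive
...
11     c44   inactive
12     ci    active        3/12
...
50     ck    inactive

Uninterpolated precision = (1/1 + 2/2 + 3/12) / 50 = 0.045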
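The worked example can be reproduced with a short function. This sketch follows the arithmetic shown: the precision at each rank where an active occurs is summed and the total is divided by the cutoff (50). The function name and the boolean-list input are assumptions made for illustration.

```python
def uninterpolated_precision(is_active, n=50):
    """Sum of precision at each active's rank within the top n, divided by n."""
    hits = 0
    total = 0.0
    for rank, active in enumerate(is_active[:n], start=1):
        if active:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / n


# Actives at ranks 1, 2 and 12; every other compound in the top 50 is inactive.
ranking = [rank in (1, 2, 12) for rank in range(1, 51)]
print(uninterpolated_precision(ranking))  # (1/1 + 2/2 + 3/12) / 50
```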

Results – Scaffold-Hopping Performance

[Figure: pairwise comparison matrix whose rows and columns are the methods StdRet, TurboSum, TurboMax, TopkAvg, BestMaxDS, BestSumNG, BestMaxNG, BestSumDS, BestSumMG, BestMaxMG and ClustWt. Legend: row method statistically better; row method statistically worse; row and column methods statistically the same.]

Results – Ranked-Retrieval Performance

[Figure: pairwise comparison matrix whose rows and columns are the methods StdRet, TurboSum, TurboMax, TopkAvg, BestMaxDS, BestSumNG, BestMaxNG, BestSumDS, BestSumMG, BestMaxMG and ClustWt. Legend: row method statistically better; row method statistically worse; row and column methods statistically the same.]

Results - Problem Specific

Conclusions & Future Work

Indirect similarity, inspired by automatic relevance feedback mechanisms, improves scaffold-hopping performance.

Indirect similarity based methods are more powerful than direct similarity based measures and show significant improvements over the state of the art.

Problem of selecting the right value for the parameter ‘k’.

Selecting the right descriptor space.

Questions?

THANK YOU!

Thanks to:

• Karypis Research
• Kevin DeRonne
• Christopher Kauffman
• Huzefa Rangwala
• Xia Ning
• Yevgeniy Podoylan

www.cs.umn.edu/~karypis