Page 1:

ECCR @ NCSU

This work was funded by the National Institutes of Health through the NIH Roadmap for Medical Research, Grant 1 P20 HG003900-01. Information on the Molecular Libraries Roadmap Initiative can be obtained from http://nihroadmap.nih.gov/molecularlibraries/

Jacqueline M. Hughes-Oliver

Department of Statistics

North Carolina State University

[email protected]

*joint with Ke Zhang, GSK and Stan Young, NISS

Analysis of High-Dimensional Structure-Activity Screening Datasets Using the Optimal Bit String Tree

Page 2:

Blackwell-Tapia - November 2008

Outline
• Background
• Recursive partitioning
• OBSTree
• Simulation study
• Screening for monoamine oxidase inhibitors
• Summary

Page 3:

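Background

Estimate a function $f$ such that

$$f : \{0,1\}^p \to \{0,1,\ldots,M\}$$

based on

$$(Y_i, X_i), \quad Y_i \in \{0,1,\ldots,M\}, \quad X_i \in \{0,1\}^p, \quad i = 1,\ldots,n,$$

where the two misclassifications involving class 0, $\hat f = 0 \mid Y \neq 0$ and $\hat f \neq 0 \mid Y = 0$, carry unequal costs (one costs more than the other).

Preferably,

$$f = (f_1, \ldots, f_l, \ldots, f_L), \qquad f_l : \{0,1\}^{s_l} \to \{0,1,\ldots,M\},$$

so that each $f_l$ uses only $s_l$ of the $p$ descriptors.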

Page 4:

http://pubchem.ncbi.nlm.nih.gov/

http://eccr.stat.ncsu.edu/

http://www.niss.org/PowerMV/

Page 5:

Background – Structure-Activity Relationship (SAR)

[Figure: chemical structure labeled "AD: Tricyclic: Amitriptyline"]

• Willett, Barnard, Downs (1998 JCICS)
• Molecular descriptors: Carhart atom pairs
– Atom type, distance, atom type, e.g., C(2,1)-04-C(3,1)
– Binary descriptors, few turned on

Page 6:

Recursive Partitioning

[Figure: binary tree with splits X3=1 and X27=1, each with True and False branches]

• Splitting variable chosen to optimize “purity measure”
• Search space: size p

Need definitions for:
• search space
• purity measure, splitting criterion
• stopping criterion
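One step of standard recursive partitioning can be sketched as follows. This is a minimal illustration, assuming Shannon entropy as the purity measure and a search over all p single binary descriptors; the function names and the toy data are hypothetical and not taken from the talk.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of the class labels in a node (0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def best_single_split(X, y):
    """Search all p binary descriptors; return (index, weighted child entropy)."""
    n, p = len(X), len(X[0])
    best_j, best_score = None, float("inf")
    for j in range(p):
        left = [y[i] for i in range(n) if X[i][j] == 1]
        right = [y[i] for i in range(n) if X[i][j] == 0]
        if not left or not right:  # descriptor is constant in this node
            continue
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if score < best_score:
            best_j, best_score = j, score
    return best_j, best_score


# Toy node: 4 compounds, 3 binary descriptors, classes 0 and 3.
X = [[1, 0, 1], [1, 1, 1], [0, 1, 0], [0, 0, 0]]
y = [0, 0, 3, 3]
split_index, split_score = best_single_split(X, y)
```

The search visits each descriptor once, which is why the search space for a single-variable split has size p.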

Page 7:

Recursive Partitioning: Rules are complex

[Figure: standard recursive partitioning tree with many single-descriptor splits (node labels include 12, 3, 17, 18, 9, 19, 6, 15, 1, 16, 11, 2, 5, 8, 4, 13, 7); most terminal nodes are class 0]

• Are all splits necessary for the activity mechanism?
• Does an early split impede identification of other mechanisms?

Page 8:

Recursive Partitioning: Focus of Study

Need definitions for:
• Search space
• Purity measure, splitting criterion
• Stopping rule

Binary Formal Inference-Based Recursive Modeling (BFIRM)
• Cho, Shen, Hermsmeier (2000, JCICS)
• Rank predictors according to F-test
• Combine important predictors to form splitting variable
• Result is better QSAR rules

Recursive Partitioning/Simulated Annealing (RP/SA)
• Blower et al. (2002, JCICS)
• Best single predictor not necessarily best in combination

Tree Harvesting
• Yuan, Chipman, Welch (2006 tech report)
• “Trim” bits off each terminal node

Page 9:

Recursive Partitioning: RP/SA
• Splitting variables are based on a combination of K predictors
• Features are always present: $X_{j_1} = 1, X_{j_2} = 1, \ldots, X_{j_K} = 1$
• Search space of size $\binom{p}{K}$, e.g., $\binom{500}{10} \approx 2 \times 10^{20}$
• Uses simulated annealing (stochastic optimization)
• K is held fixed for all splits, and is assumed known

Page 10:

OBSTree
• Splitting variables are based on a combination of K predictors
• Combine approaches of BFIRM and RP/SA
• Features can be present or absent (chromosome selection): $X_{j_1} = 1, X_{j_2} = 0, X_{j_3} = 0, X_{j_4} = 1, \ldots, X_{j_K} = 1$
• Search space of size $2^K \binom{p}{K}$, e.g., $2^{10} \binom{500}{10} \approx 2 \times 10^{23}$
• Uses simulated annealing + weighted sampling + trimming
• “K” can change for all splits, and is assumed unknown
• Uses a penalty entropy splitting criterion
• Usual stopping criteria applied, including cross validation
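A chromosome-based split and the resulting search-space sizes can be sketched as below. This is an illustration only: `matches` is a hypothetical helper name, descriptors are indexed from 1 as on the slides, and p = 500, K = 10 are the example values quoted for the search space.

```python
import math


def matches(x, descriptors, chromosome):
    """True if compound x has the required bit at every selected descriptor.

    `descriptors` holds 1-based descriptor indices; `chromosome` holds the
    required value (1 = feature present, 0 = feature absent) for each one.
    """
    return all(x[j - 1] == bit for j, bit in zip(descriptors, chromosome))


p, K = 500, 10
# RP/SA: choose K descriptors, all of which must be present.
rpsa_space = math.comb(p, K)
# Present-or-absent chromosomes: each chosen descriptor takes one of 2 values.
obstree_space = 2 ** K * math.comb(p, K)

x = [1, 0, 0, 1, 1]
is_match = matches(x, [1, 2, 3, 4], [1, 0, 0, 1])
```

Allowing absence multiplies the number of candidate splits by 2^K, roughly a thousandfold for K = 10.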

Page 11:

OBSTree: Flowchart

Pre-OBSTree Setup
• Remove unary descriptors
• Determine Singly Important group
• Specify parameters

[Flowchart: Descriptor Pool, passed through RP, is divided into Singly Important Descriptors and General Descriptors]

Page 12:

OBSTree: Flowchart

• Pre-OBSTree Setup
– Remove unary descriptors
– Determine Singly Important group
– Specify parameters
• Initialize split at next depth: depth = depth + 1
– Select a set of K descriptors (X0) using WSS
– Determine best chromosome x0 of initial X0
• SA to determine “optimal” (XA, xA) for split using WSS
• Trim
– Check 2^K - 1 subsets of current (XA, xA)
– Report best trimmed version as (X*, x*)
• X* = x*?
– Yes: form terminal node
– No: continue with the remaining compounds
• depth = d or node size < 2·min or Ymax = 0 or Ybar > M-1?
– Yes: form last terminal node. STOP
– No: initialize the next split
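The trimming step can be sketched as an exhaustive check of the 2^K - 1 non-empty subsets of the current descriptor/bit pairs. This is a minimal illustration under stated assumptions: `score` is a hypothetical callable where lower is better, and the toy scoring function below stands in for the real splitting criterion.

```python
from itertools import combinations


def trim(descriptors, chromosome, score):
    """Check every non-empty subset of the (descriptor, bit) pairs.

    Returns the best-scoring subset and the number of subsets checked.
    Ties are resolved in favor of the subset found first (smaller subsets
    are visited before larger ones).
    """
    pairs = list(zip(descriptors, chromosome))
    best, best_score, checked = None, float("inf"), 0
    for size in range(1, len(pairs) + 1):
        for subset in combinations(pairs, size):
            checked += 1
            s = score(subset)
            if s < best_score:
                best, best_score = subset, s
    return best, checked


# Toy criterion: distance from a hypothetical "true" rule {X3 = 1, X7 = 0}.
true_rule = {(3, 1), (7, 0)}


def toy_score(subset):
    return len(set(subset) ^ true_rule)


trimmed, n_checked = trim([3, 5, 7], [1, 1, 0], toy_score)
```

Here the superfluous pair (5, 1) is trimmed away after 2^3 - 1 = 7 subsets are checked.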

Page 13:

OBSTree: Splitting Criterion

• Node has N compounds
• Class i has proportion $p_i$ in the node, with a total of $n_i$ in the node
• Entropy (node impurity):

$$-\sum_{i=0}^{M} p_i \log p_i$$

Problem: Entropy=0 (perfect) when a class of junk compounds is identified

• Penalty Entropy (penalize unwanted category):

$$-\sum_{i=1}^{M} p_i \log p_i \;-\; \frac{n_0}{N} \log \frac{1}{N} \qquad \text{(penalize junk compounds)}$$

$$-\sum_{i \in W} p_i \log p_i \;-\; \sum_{j \in U} \frac{n_j}{N} \log \frac{1}{N} \qquad \text{(general form)}$$
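The contrast between ordinary entropy and a penalty entropy can be sketched as follows. This is an illustration that assumes the penalty treats each compound of an unwanted class as its own class, contributing -(1/N) log(1/N) per compound; the function names and the default choice of class 0 as the unwanted class are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Ordinary Shannon entropy of a node."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def penalty_entropy(labels, unwanted=(0,)):
    """Entropy in which every compound of an unwanted class counts as its own class."""
    n = len(labels)
    counts = Counter(labels)
    # Wanted classes contribute the usual -p_i log p_i terms.
    wanted = -sum((c / n) * math.log(c / n)
                  for cls, c in counts.items() if cls not in unwanted)
    # Each unwanted compound contributes -(1/N) log(1/N).
    n_unwanted = sum(c for cls, c in counts.items() if cls in unwanted)
    penalty = -(n_unwanted / n) * math.log(1 / n)
    return wanted + penalty


junk_node = [0] * 10     # pure node of inactive compounds
active_node = [3] * 10   # pure node of the most active class
```

Under this form, a pure junk node scores log N instead of 0, while a pure active node still scores 0, so purity is rewarded only for the wanted classes.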

Page 14:

OBSTree: Stopping Criteria

• Maximum depth d
• The most active compound is junk
• The node size is less than 2j (j is the minimum node size)
• 5-fold cross-validation, e.g., choose depth d if
– # correct classifications levels off at depth d
– Accept H0: $\delta_{d+1} = 0$, where $\delta_{d+1}$ is the change in sensitivity between depths d and d+1
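The first three stopping rules can be sketched as a single predicate. This is a simplified illustration: the function and parameter names are hypothetical, class 0 is assumed to be the junk class, and the cross-validation rule is deliberately left out.

```python
def should_stop(depth, labels, max_depth, min_node_size):
    """Return True if the node should not be split further.

    Stops when the maximum depth is reached, when the node holds fewer than
    twice the minimum node size, or when its most active compound is junk.
    """
    return (depth >= max_depth
            or len(labels) < 2 * min_node_size
            or max(labels, default=0) == 0)
```

The size rule uses twice the minimum node size because a split must leave at least the minimum in each child.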

Page 15:

Simulation Study

• 1000 compounds, 500 binary descriptors
• Four active groups (20 compounds per group) – 8% active

Activity Mechanism   Potency   Descriptor Set       Chromosome
I                    3         1, 2, 3, 4, 5        1, 0, 1, 0, 1
II                   3         5, 6, 7, 8, 9        0, 1, 1, 1, 1
III                  2         3, 11, 12, 13, 17    1, 1, 1, 1, 1
IV                   1         15, 16, 17, 18, 19   1, 1, 0, 1, 1
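A dataset with this design can be sketched as below. The four mechanisms are taken from the slide; everything else is an assumption made for illustration, in particular the independent background bits, their density of 0.1, the placement of each active group in its own block of rows, and the rule that a compound matching several mechanisms takes the highest potency.

```python
import random

# (potency, descriptor set, chromosome); descriptors are 1-based.
MECHANISMS = [
    (3, [1, 2, 3, 4, 5], [1, 0, 1, 0, 1]),
    (3, [5, 6, 7, 8, 9], [0, 1, 1, 1, 1]),
    (2, [3, 11, 12, 13, 17], [1, 1, 1, 1, 1]),
    (1, [15, 16, 17, 18, 19], [1, 1, 0, 1, 1]),
]


def simulate(n=1000, p=500, per_group=20, density=0.1, seed=0):
    rng = random.Random(seed)
    # Background bits: sparse and independent (an assumption).
    X = [[1 if rng.random() < density else 0 for _ in range(p)]
         for _ in range(n)]
    # Plant each mechanism's bit pattern in its own block of compounds.
    for g, (_, descriptors, chromosome) in enumerate(MECHANISMS):
        for i in range(g * per_group, (g + 1) * per_group):
            for j, bit in zip(descriptors, chromosome):
                X[i][j - 1] = bit
    # Label every compound by the most potent mechanism it satisfies.
    y = []
    for x in X:
        hits = [pot for pot, d, c in MECHANISMS
                if all(x[j - 1] == b for j, b in zip(d, c))]
        y.append(max(hits, default=0))
    return X, y


X, y = simulate()
```

Because background compounds can match a pattern by chance, the number of actives in such a sketch may slightly exceed the 80 planted ones.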

Page 16:

Simulation Study: Standard RP Tree

[Figure: standard RP tree with many single-descriptor splits; two terminal nodes are annotated "5 compounds of 3 + 5 compounds of 0" and "7 compounds of 3"]

Page 17:

Simulation Study: Sample OBSTree

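[Figure: sample OBSTree; each split is written as descriptor set/chromosome, and terminal nodes are labeled with activity classes]

• 1,2,3,4,5/1,0,1,0,1: terminal node of class 3
• 15,16,17,18,19/1,1,0,1,1: terminal node of class 1
• 5,6,7,8,9/0,1,1,1,1: terminal node of class 3
• 3,11,12,13,17/1,1,1,1,1: terminal node of class 2
• Remaining compounds: terminal node of class 0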
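Prediction from a tree of this kind can be sketched as a list of rules tried in turn. The four rules and their classes are read off the sample tree, with classes matched to rules by the potencies of the simulated mechanisms; the order in which the rules are tried and the function name `predict` are assumptions.

```python
# (descriptor set, chromosome, predicted class); descriptors are 1-based.
RULES = [
    ([1, 2, 3, 4, 5], [1, 0, 1, 0, 1], 3),
    ([15, 16, 17, 18, 19], [1, 1, 0, 1, 1], 1),
    ([5, 6, 7, 8, 9], [0, 1, 1, 1, 1], 3),
    ([3, 11, 12, 13, 17], [1, 1, 1, 1, 1], 2),
]


def predict(x, rules=RULES, default=0):
    """Return the class of the first rule whose chromosome x satisfies."""
    for descriptors, chromosome, label in rules:
        if all(x[j - 1] == b for j, b in zip(descriptors, chromosome)):
            return label
    return default


# A compound with descriptors 3, 11, 12, 13, 17 turned on and nothing else.
compound = [0] * 20
for j in (3, 11, 12, 13, 17):
    compound[j - 1] = 1
predicted = predict(compound)
```

Each rule reads directly as a candidate activity mechanism, which is the simplification over a deep tree of single-descriptor splits.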

Page 18:

Simulation Study: 5-fold Cross-validation

OBSTree
               Actual 0   Actual 1   Actual 2   Actual 3   Accuracy
Prediction 0   918        0          0          5          99.5%
Prediction 1   0          20         0          0          100%
Prediction 2   0          0          20         0          100%
Prediction 3   2          0          0          35         94.6%
Hit            99.7%      100%       100%       87.5%
Overall Accuracy: 99.3% (993/1000)

RP
               Actual 0   Actual 1   Actual 2   Actual 3   Accuracy
Prediction 0   910        1          0          34         96.3%
Prediction 1   3          19         0          0          86.4%
Prediction 2   0          0          20         0          100%
Prediction 3   7          0          0          6          46.2%
Hit            98.9%      95%        100%       15%
Overall Accuracy: 95.5% (955/1000)
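The summary rates can be recomputed from the cell counts on this slide. The sketch below assumes rows are predictions and columns are actual classes, so the row-wise rate is the "Accuracy" column and the column-wise rate is the "Hit" row; the function name `summarize` is hypothetical.

```python
def summarize(cm):
    """cm[i][j] = number of compounds predicted as class i with actual class j."""
    k = len(cm)
    total = sum(map(sum, cm))
    # Share of each predicted class that is correct (row-wise).
    row_accuracy = [cm[i][i] / sum(cm[i]) for i in range(k)]
    # Share of each actual class that is recovered (column-wise).
    hit_rate = [cm[j][j] / sum(cm[i][j] for i in range(k)) for j in range(k)]
    overall = sum(cm[i][i] for i in range(k)) / total
    return row_accuracy, hit_rate, overall


OBSTREE = [[918, 0, 0, 5], [0, 20, 0, 0], [0, 0, 20, 0], [2, 0, 0, 35]]
RP = [[910, 1, 0, 34], [3, 19, 0, 0], [0, 0, 20, 0], [7, 0, 0, 6]]

obstree_rows, obstree_hits, obstree_overall = summarize(OBSTREE)
rp_rows, rp_hits, rp_overall = summarize(RP)
```

The diagonals sum to 993 of 1000 for OBSTree and 955 of 1000 for RP; the largest gap is in class 3, where the hit rate is 35/40 for OBSTree and 6/40 for RP.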

Page 19:

Simulation Study: Sensitivity Analysis
• K, descriptor set size
– K > 7 perfectly found all mechanisms
– K = 7 perfectly found all but one mechanism
• Basic tree parameters
– Min node size is 5
• SA parameters
– Initial temperature
– Minimum temperature
– Temperature reduction rate
– # transitions at a given temperature
– # failures to accept new point before increasing transition counter
– Sampling weights in WSS
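The role of the SA parameters can be seen in a generic simulated annealing loop. This is a textbook-style sketch, not the OBSTree implementation: it assumes geometric cooling and a Metropolis acceptance rule, omits the failure counter and the weighted sampling, and uses hypothetical names and default values throughout.

```python
import math
import random


def anneal(objective, start, neighbor, t0=1.0, t_min=0.01, rate=0.9,
           n_trans=50, seed=0):
    """Minimize `objective` by simulated annealing with geometric cooling.

    t0 is the initial temperature, t_min the minimum temperature, rate the
    temperature reduction rate, n_trans the transitions per temperature.
    """
    rng = random.Random(seed)
    current, f_cur = start, objective(start)
    best, f_best = current, f_cur
    t = t0
    while t > t_min:
        for _ in range(n_trans):
            cand = neighbor(current, rng)
            f_cand = objective(cand)
            # Accept improvements always; accept worse points with
            # probability exp(-(f_cand - f_cur) / t).
            if f_cand <= f_cur or rng.random() < math.exp((f_cur - f_cand) / t):
                current, f_cur = cand, f_cand
                if f_cur < f_best:
                    best, f_best = current, f_cur
        t *= rate
    return best, f_best


# Toy use: search for a hidden 5-bit chromosome by Hamming distance.
target = (1, 0, 1, 0, 1)


def hamming(c):
    return sum(a != b for a, b in zip(c, target))


def flip_one(c, rng):
    i = rng.randrange(len(c))
    return c[:i] + (1 - c[i],) + c[i + 1:]


start = (0, 0, 0, 0, 0)
best, f_best = anneal(hamming, start, flip_one)
```

Higher initial temperatures and slower cooling explore more of the search space at greater cost, which is why these settings are candidates for a sensitivity analysis.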

Page 20:

Screening to Identify MAO Inhibitors
• Neuronal MAO deactivates neurotransmitters
• Pargyline, an MAO inhibitor, was used to treat depression
• MAO inhibitors no longer used due to toxicity & interactions
• Abbott Laboratories dataset of MAO inhibitors
– Brown & Martin (1996 JCICS), 1646 chemically diverse compounds
– 1380 binary 2D atom-pair descriptors
– Response variable: 0, 1, 2, 3 (ordered data) [1358/114/86/88]
– Category 3 has 2 well-known mechanisms, Rusinko et al. (1999 JCICS)

Page 21:

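[Figure: trees fitted to the MAO data by OBSTree, RP/SA, and RP; nodes are labeled with class counts in the form 0/1/2/3]

• OBSTree splits (descriptor set/chromosome): 81,177,579,183/1,1,1,0 and 1,579,1184,809/1,1,1,0
• RP/SA splits (descriptor sets): 184,721,879 and 32,572,844
• RP splits (single descriptors): 704, 1184, 65, 81, 183
• Node counts shown: 0/1/0/6, 1/0/1/26, 0/0/0/33, 0/0/0/15, 2/0/0/32, 2/0/5/24, 9/2/1/2, 959/85/55/18, 99/1/0/0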

Page 22:

MAO: Activity Mechanism I
• “Irreversible binding to flavin cofactor of MAO”
• Pargyline-like compounds
• Typical features of pargyline-like compounds
– A triple bond
– A tertiary nitrogen
– An aromatic ring
• 1st terminal node of OBSTree: descriptors 81, 183, 177, 579 with chromosome 1, 0, 1, 1; counts 0/0/0/33
• Highest active terminal node of RP: descriptors 81, 704 with values 1, 1; counts 2/0/0/32
• 1st terminal node of RP/SA: descriptors 184, 721, 879 with values 1, 1, 1; counts 0/1/0/6

Page 23:

MAO: Activity Mechanism I

• Compound 1: Pargyline, y=3, has 579 & 81 & 177 but not 183
• Compound 2: y=0, has feature 183 so violates OBSTree
• Compound 3: y=0, falls in active node from RP
• Compound 4: y=0, falls in active node from RP and RP/SA

Page 24:

MAO: Activity Mechanism II
• “Binding to active site”
• –N-N-C(=O)- is a hydrazine feature that can be hydrolyzed to bind protein (MAO) as a nonselective, irreversible inhibitor

[Figure: example structure (atoms shown include HO, O, N, N) annotated with the descriptors below]

• 1: C(1,0)-3-C(1,0)
• 579: C(2,1)-3-C(3,1)
• 1184: N(2,0)-2-N(2,0)

Page 25:

Absent Descriptor (809: C(3,1)-4-Br)

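[Figure: two bromine-containing structures (atoms shown include O, N, N, Br), one labeled Activity=3 and the other Activity=2, with the C(3,1)-4-Br atom pair marked]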

Page 26:

Summary
• OBSTree: new RP algorithm for obtaining simplified output
– Model presence and absence of molecular features
– Combination size is data-driven, varies over splits
– Penalty entropy splitting criterion for one-sided purity
– Weighted sampling during optimization allows prior information
• Simpler verification of QSAR
• Standard RP and RP/SA are special cases of OBSTree
• Output is not deterministic
• As with any RP output, care should be taken when interpreting the results
– Can miss highly correlated but important predictors
– Different trees provide similar partitions of the data
– Because of hard thresholding, predictions are highly variable
• Computationally intensive!

Page 27:

Acknowledgements

• Atina Brooks, North Carolina State University
• Jiajun Liu, Merck
• Haojun Ouyang, North Carolina State University
• Abbott Laboratories
• Jack Liu, OmicSoft
• Jun Feng, NIH
• GoldenHelix