Classifying the WADA 2005 Prohibited List Using the CDK Fingerprint Ed Cannon & John Mitchell ...

Page 1: Classifying the WADA 2005 Prohibited List Using the CDK Fingerprint Ed Cannon & John Mitchell  jbom1@cam.ac.uk.

Classifying the WADA 2005 Prohibited List Using the CDK Fingerprint

Ed Cannon & John Mitchell

www-mitchell.ch.cam.ac.uk/

jbom1@cam.ac.uk

Page 2:

Classifying the WADA Prohibited List

• Aims & Background.

• Methods.

• Data.

• Results.

• Conclusions.

Page 3:

Aims & Background

Page 4:

Aims & Background

• Much drug abuse in sport involves novel compounds such as the “designer steroid” THG.

tetrahydrogestrinone (THG)

Page 5:

Aims & Background

• Hence the World Anti-Doping Agency (WADA) prohibits classes of bioactivity as well as specific molecules.

• Analogues are prohibited using the “similar chemical structure or similar biological effect(s)” criterion.

Page 6:

WADA Prohibited Classes

• Anabolic Agents (S1)

• Hormones and Related Substances (S2)

• Beta-2-agonists (S3)

• Anti-estrogenic Agents (S4)

• Diuretics and Masking Agents (S5)

• Stimulants (S6)

• Narcotics (S7)

• Cannabinoids (S8)

• Glucocorticoids (S9)

• Alcohol (P1)

• Beta Blockers (P2)

Page 7:

Predicting Bioactivities

• We seek to predict whether a molecule exhibits one of these bioactivities.

• Such a classifier would be powerful as an in silico pre-filter for experimental methods such as assays.

Page 8:

Methods

Page 9:

Chemical Space

• Use descriptor-based fingerprints to locate molecules in chemical space.

• Similar Property Principle suggests molecules close together in chemical space often share common bioactivity.

Page 10:

Machine Learning

• Use Machine Learning classification algorithms to predict bioactivity from location of molecules in chemical space.

• Random Forest.

• k-Nearest Neighbours.

Page 11:

Fingerprints

• CDK (Chemistry Development Kit) fingerprint.

• Unity 2D.

• MACCS key.

• MOE 2D (2004).

• Typed Atom Distance.

• Typed Graph Distance.

Page 12:

CDK Fingerprint

• CDK fingerprint resembles Daylight.

• All bond paths up to a length of 6 are generated.

• A hashing function is used to map these paths onto a fingerprint of 1024 bits.
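The path-hashing idea can be illustrated with a minimal sketch. This is not the CDK implementation: the molecule is a hypothetical toy graph (atom symbols plus a bond list), bond orders and hydrogens are ignored, single atoms are counted as zero-length paths, and `zlib.crc32` stands in for whatever hash function CDK actually uses.

```python
import zlib

def enumerate_paths(atoms, bonds, max_bonds=6):
    """Return the set of linear atom paths containing at most max_bonds bonds."""
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    paths = set()

    def walk(path):
        labels = [atoms[i] for i in path]
        # A path and its reverse describe the same fragment; keep one form only.
        paths.add(min("-".join(labels), "-".join(reversed(labels))))
        if len(path) - 1 == max_bonds:
            return
        for nxt in neighbours[path[-1]]:
            if nxt not in path:  # simple paths only: no atom is revisited
                walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return paths

def path_fingerprint(atoms, bonds, n_bits=1024, max_bonds=6):
    """Hash every path onto one of n_bits positions; return the set of on-bits."""
    # crc32 is an assumption made for this sketch, chosen because it is deterministic.
    return {zlib.crc32(p.encode()) % n_bits
            for p in enumerate_paths(atoms, bonds, max_bonds)}

# Toy example: the heavy atoms of ethanol, C-C-O.
ethanol_atoms = ["C", "C", "O"]
ethanol_bonds = [(0, 1), (1, 2)]
ethanol_paths = enumerate_paths(ethanol_atoms, ethanol_bonds)
ethanol_bits = path_fingerprint(ethanol_atoms, ethanol_bonds)
```

Because several paths can hash to the same position, the number of on-bits may be smaller than the number of distinct paths.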

Page 13:

Unity 2D Fingerprint

• Unity is similar to CDK, but based on sub-structures rather than just paths.

• Substructures present in the molecule are enumerated.

• A hashing function is used to map these substructures onto a fingerprint of 992 bits.

Page 14:

Classification Algorithms

• Random Forest (RF).

• k-Nearest Neighbours (k-NN).

Page 15:

Random Forest

• Decision-tree-based learner.

• Each tree is based on a bootstrap sample of the data.

• Number of trees in forest (ntree).

• Number of descriptors tried at each node (mtry).

• Each tree predicts label of molecule.

• Majority vote = class label of molecule.
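The roles of the bootstrap sample, ntree, mtry and the majority vote can be sketched in a few lines. This is a deliberately simplified illustration, not the Random Forest used in the study: each "tree" is a single-split stump over binary fingerprint bits, and the function names and toy data are hypothetical.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, mtry, rng):
    """One-split stand-in for a tree: try mtry random descriptors, keep the best."""
    default = majority(labels)
    best = None
    for feature in rng.sample(range(len(rows[0])), mtry):
        on = [y for x, y in zip(rows, labels) if x[feature]]
        off = [y for x, y in zip(rows, labels) if not x[feature]]
        on_label = majority(on) if on else default
        off_label = majority(off) if off else default
        errors = sum(y != on_label for y in on) + sum(y != off_label for y in off)
        if best is None or errors < best[0]:
            best = (errors, feature, on_label, off_label)
    return best[1:]

def fit_forest(rows, labels, ntree=100, mtry=2, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(ntree):
        # Bootstrap sample: draw len(rows) molecules with replacement.
        picks = [rng.randrange(len(rows)) for _ in rows]
        forest.append(fit_stump([rows[i] for i in picks],
                                [labels[i] for i in picks], mtry, rng))
    return forest

def predict(forest, x):
    """Each stump votes; the majority vote is the predicted class label."""
    votes = [on_label if x[feature] else off_label
             for feature, on_label, off_label in forest]
    return majority(votes)

# Toy data: bits 0 and 1 track the class, bit 2 is noise.
rows = [[1, 1, 0], [1, 1, 1], [1, 1, 0], [1, 1, 1],
        [0, 0, 0], [0, 0, 1], [0, 0, 0], [0, 0, 1]]
labels = [True] * 4 + [False] * 4
forest = fit_forest(rows, labels, ntree=25, mtry=2)
```

A real Random Forest grows full decision trees and re-draws the mtry candidate descriptors at every node; the stump keeps the sketch short while preserving the bagging and voting steps.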

Page 16:

Random Forest

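[Figure: example decision tree. The root node splits on A (A > x1 or A < x1); the next level splits on B (B > x2 or B < x2) and on C (C > x3 or C < x3). Decisions at the four leaves, left to right: Yes, No, No, Yes.]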

A Random Forest contains many such trees.

Page 18:

k-Nearest Neighbours

• Instance based learner.

• Take a query point x and find the closest k points from the training set to it using Euclidean distance in descriptor space.

• k is a variable describing the number of neighbours to be considered.

• Class of x determined by majority vote of class labels of k neighbours.

• Ties broken randomly (only occurs for even k).
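The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: descriptors are plain numeric tuples, the function name `knn_predict` and the toy points are hypothetical, and ties are broken with Python's `random` module.

```python
import math
import random
from collections import Counter

def knn_predict(train_x, train_y, query, k=1, rng=random):
    """Majority vote over the k training points nearest to query (Euclidean)."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    top = max(votes.values())
    tied = [label for label, count in votes.items() if count == top]
    # Random tie-break; with two classes this only matters for even k.
    return rng.choice(tied)

# Toy two-dimensional "descriptor space".
train_x = [(0, 0), (0, 1), (5, 5), (6, 5)]
train_y = ["active", "active", "inactive", "inactive"]

near_actives_k1 = knn_predict(train_x, train_y, (0.2, 0.1), k=1)
near_actives_k3 = knn_predict(train_x, train_y, (0.2, 0.1), k=3)
between_k2 = knn_predict(train_x, train_y, (3, 3), k=2)  # one vote each: a tie
```

The last call shows why even k is awkward: the two nearest neighbours disagree, so the predicted label depends on the random tie-break.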

Page 19:

k-Nearest Neighbours

[Figure legend: Active, Inactive, ? (query point)]

Page 21:

k-Nearest Neighbours

• Local method.

• Uses only a very small number of near neighbours to make its prediction.

• Suitable for predicting activity classes with multiple clusters in chemical space.

• Therefore good for WADA classes with multiple receptors.

Page 22:

Performance Measure

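• Matthews Correlation Coefficient:

$$\mathrm{MCC} = \frac{t_p t_n - f_p f_n}{\sqrt{(t_p + f_p)(t_p + f_n)(t_n + f_p)(t_n + f_n)}}$$

where t_p, t_n, f_p and f_n are the numbers of true positives, true negatives, false positives and false negatives.

• Range: -1 ≤ MCC ≤ 1.

• Balance between predicting positives & negatives.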
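A minimal sketch of the Matthews Correlation Coefficient computed from the four confusion-matrix counts. The function name is hypothetical, and returning 0 when the denominator vanishes is a common convention assumed here, not something stated on the slide.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator

perfect = mcc(tp=50, tn=50, fp=0, fn=0)        # every prediction correct
inverted = mcc(tp=0, tn=0, fp=50, fn=50)       # every prediction wrong
uninformative = mcc(tp=25, tn=25, fp=25, fn=25)  # no better than chance
```

The three calls hit the two ends and the midpoint of the range: 1 for perfect prediction, -1 for fully inverted prediction, and 0 for a chance-level classifier.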

Page 23:

Data

Page 24:

The Dataset

• 5245 molecules (5235 for CDK).

• Molecules taken from the WADA banned list and from the corresponding activity classes in MDDR, plus 367 explicitly allowed substances.

Page 25:

Data by Class

WADA Class    Number of Molecules
S1            47
S2            272
S3            367
S4            928
S5            1000
S6            804
S7            195
S8            1000
S9            26
P2            239
Allowed       367

Page 26:

Fivefold Cross-validation

• We test for membership of each prohibited class separately.

• All calculations use 5-fold cross-validation (cv): the molecules are split {80% training set; 20% test set}, and this is repeated 5 times so that each molecule is in exactly 1 test set.
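The fivefold split can be sketched as below. This is an illustrative partitioning scheme only: the slides do not say how molecules were assigned to folds, so the seeded shuffle and the function name `five_fold_splits` are assumptions.

```python
import random

def five_fold_splits(n_molecules, n_folds=5, seed=0):
    """Yield (train, test) index lists; each index lands in exactly one test set."""
    indices = list(range(n_molecules))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds roughly equal folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for test in folds:
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

# 5245 molecules, as in the dataset described above.
splits = list(five_fold_splits(5245))
```

With 5245 molecules each test set holds 1049 molecules (20%) and each training set the remaining 4196 (80%). Since each prohibited class is tested separately, the labels for one run would be a yes/no membership flag for that class.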

Page 27:

False Positives

• False Positives arise in two ways:

• (1) A molecule predicted positive for an incorrect activity class;

• (2) An explicitly allowed molecule predicted positive.

Page 28:

Results

Page 29:

Results: Random Forest

Aggregated over 10 classes

Page 30:

Unity ≈ CDK > MACCS > others.

[Figure: MCC for RF for Six Fingerprints. x-axis: Rank out of 6 Fingerprints; y-axis: MCC. Data labels: Unity 0.8214, CDK 0.8136, MACCS 0.7823, TGD 0.7283, MOE 0.7172, TAD 0.5902.]

Page 31:

100 trees sufficient; little improvement with more.

[Figure: MCC as a Function of ntree in RF models for Unity. x-axis: ntree (0 to 1000); y-axis: MCC.]

Page 32:

Results: k-Nearest Neighbours

Aggregated over 10 classes

Page 33:

[Figure: MCC as a Function of k in k-NN Models for Six Fingerprints. x-axis: k (0 to 20); y-axis: MCC. Series: Unity, MACCS, MOE, TAD, TGD, CDK.]

Page 34:

Unity ≈ CDK > MACCS > others.

[Figure: MCC for k = 1 for Six Fingerprints. x-axis: Rank out of 6 Fingerprints; y-axis: MCC. Data labels: Unity 0.8363, CDK 0.8297, MACCS 0.8045, TGD 0.7404, MOE 0.6814, TAD 0.6152.]

Page 35:

k = 1 best; poor performance at k = 2 due to ties. MCC falls off with increasing k.

[Figure: MCC as a Function of k in k-NN Models for Unity. x-axis: k (0 to 20); y-axis: MCC.]

Page 36:

k = 1 best; poor performance at k = 2 due to ties. MCC falls off with increasing k. Unity ≈ CDK.

[Figure: MCC as a Function of k in k-NN Models for CDK. x-axis: k (0 to 20); y-axis: MCC.]

Page 37:

Results: Comparison

Recall v Precision

Aggregated over 10 classes

Recall Precision
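The formulas for these two measures did not survive in the transcript. The sketch below uses the standard definitions, which is an assumption about what the slide showed: recall is the fraction of true actives that are found, and precision is the fraction of predicted actives that are correct.

```python
def recall(tp, fn):
    """Fraction of actual positives that were predicted positive."""
    return tp / (tp + fn) if tp + fn else 0.0

def precision(tp, fp):
    """Fraction of predicted positives that are actual positives."""
    return tp / (tp + fp) if tp + fp else 0.0

# Hypothetical counts: 80 actives found, 20 missed, 10 false alarms.
example_recall = recall(tp=80, fn=20)
example_precision = precision(tp=80, fp=10)
```

Both functions return fractions between 0 and 1; multiply by 100 to compare with percentage axes.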

Page 38:

[Figure: Recall v Precision for Positives. x-axis: Recall (30.00 to 90.00); y-axis: Precision (60.00 to 100.00). Legend: Unity, MACCS, CDK; RF, k-NN.]

RF gives higher precision, k-NN higher recall.

Page 39:

Results: Comparison

Analysed by class

Page 40:

Classes vary in difficulty of prediction; independent of classification algorithm.

[Figure: MCC by Class for Random Forest Default and k-NN (k = 1) Models. x-axis: Class (S1, S2, S3, S4, S5, S6, S7, S8, S9, P2); y-axis: MCC (0 to 1). Series: RF, k-NN.]

Page 41:

Conclusions

Page 42:

Conclusions

• Can successfully predict active molecules (MCC ≈ 0.83).

• Unity ≈ CDK > MACCS > others.

• RF & k-NN give similar MCC.

• k-NN higher recall.

• RF higher precision; RF less likely to produce false positives.

Page 43:

Conclusions

• RF results vary little with ntree.

• k-NN results best for k = 1.

• Performance decreases at higher k.

• Odd k avoids problems with ties (k = 2 is worse than k = 3).

• Activity classes show consistent prediction difficulty pattern.

Page 44:

Acknowledgements

• Andreas Bender (Novartis Institutes).

• David Palmer (Unilever Centre).

• Unilever.

Page 45:

www-mitchell.ch.cam.ac.uk/

jbom1@cam.ac.uk

Page 46:

[Figure: MCC as a Function of Class Size. x-axis: Class Size (0 to 1200); y-axis: MCC (0 to 1). Series: Unity, MACCS, MOE, TAD, TGD.]

No significant correlation overall; though smallest class S9 is hardest to predict.

Page 47:

[Figure: MCC as a Function of Intra Class Mean Tanimoto Score. x-axis: Intra Class Mean Tanimoto Score (0.00 to 0.60); y-axis: MCC (0 to 1). Series: Unity, MACCS, MOE, TAD, TGD.]
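A minimal sketch of how an intra-class mean Tanimoto score could be computed from binary fingerprints. The slides do not state the exact procedure, so averaging over all distinct pairs within a class is an assumption, as are the function names and the toy on-bit sets.

```python
from itertools import combinations

def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def mean_intra_class_tanimoto(fingerprints):
    """Mean Tanimoto coefficient over all distinct pairs within one class."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)

pair_score = tanimoto({1, 2, 3}, {2, 3, 4})  # 2 shared bits out of 4 set overall
class_score = mean_intra_class_tanimoto([{1, 2}, {1, 2}, {3}])
```

A low intra-class mean indicates a structurally diverse class, which is the quantity plotted against MCC in the figure above.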

Page 48:

tetrahydrogestrinone (THG)

gestrinone

trenbolone