
Classifying the WADA 2005 Prohibited List Using the CDK Fingerprint

Ed Cannon & John Mitchell

www-mitchell.ch.cam.ac.uk/


Classifying the WADA Prohibited List

• Aims & Background.

• Methods.

• Data.

• Results.

• Conclusions.

Aims & Background


• Much drug abuse in sport involves novel compounds such as the “designer steroid” THG.

[Figure: structure of tetrahydrogestrinone (THG)]


• Hence the World Anti-Doping Agency (WADA) prohibits classes of bioactivity as well as specific molecules.

• Analogues are prohibited using the “similar chemical structure or similar biological effect(s)” criterion.

WADA Prohibited Classes

• Anabolic Agents (S1)

• Hormones and Related Substances (S2)

• Beta-2-agonists (S3)

• Anti-estrogenic Agents (S4)

• Diuretics and Masking Agents (S5)

• Stimulants (S6)

• Narcotics (S7)

• Cannabinoids (S8)

• Glucocorticoids (S9)

• Alcohol (P1)

• Beta Blockers (P2)

Predicting Bioactivities

• We seek to predict whether a molecule exhibits one of these bioactivities.

• Such a classifier would be powerful as an in silico pre-filter for experimental methods such as assays.

Methods

Chemical Space

• Use descriptor-based fingerprints to locate molecules in chemical space.

• Similar Property Principle suggests molecules close together in chemical space often share common bioactivity.

Machine Learning

• Use Machine Learning classification algorithms to predict bioactivity from location of molecules in chemical space.

• Random Forest.

• k-Nearest Neighbours.

Fingerprints

• CDK (Chemistry Development Kit) fingerprint.

• Unity 2D.

• MACCS key.

• MOE 2D (2004).

• Typed Atom Distance.

• Typed Graph Distance.

CDK Fingerprint

• CDK fingerprint resembles Daylight.

• All bond paths up to a length of 6 are generated.

• A hashing function is used to map these paths onto a fingerprint of 1024 bits.
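The path-hashing idea can be sketched in a few lines of Python. This is an illustration only and not CDK's implementation: the toy molecule representation (a list of atom symbols plus a list of bonded index pairs), the helper names `enumerate_paths` and `path_fingerprint`, and the choice of CRC32 as the hashing function are all assumptions made for the example, and bond orders are ignored.

```python
import zlib

def enumerate_paths(atoms, bonds, max_bonds=6):
    """Collect every simple path of 0..max_bonds bonds as a string of atom symbols."""
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    paths = set()

    def walk(path):
        forward = "-".join(atoms[i] for i in path)
        backward = "-".join(atoms[i] for i in reversed(path))
        # A path and its reverse are the same fragment; keep one canonical form.
        paths.add(min(forward, backward))
        if len(path) - 1 == max_bonds:
            return
        for nxt in neighbours[path[-1]]:
            if nxt not in path:
                walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return paths

def path_fingerprint(atoms, bonds, n_bits=1024, max_bonds=6):
    """Hash each path to one of n_bits positions and set that bit."""
    bits = [0] * n_bits
    for path in enumerate_paths(atoms, bonds, max_bonds):
        bits[zlib.crc32(path.encode()) % n_bits] = 1
    return bits

# Heavy atoms of ethanol, C-C-O (hypothetical toy input).
ethanol = path_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
```

Because several paths can hash to the same position, the number of set bits is at most the number of distinct paths; the fingerprint does not depend on the order in which the atoms are listed.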

Unity 2D Fingerprint

• Unity is similar to CDK, but based on sub-structures rather than just paths.

• Substructures present in the molecule are enumerated.

• A hashing function is used to map these substructures onto a fingerprint of 992 bits.

Classification Algorithms

• Random Forest (RF).

• k-Nearest Neighbours (k-NN).

Random Forest

• Decision-tree-based learner.

• Each tree is built on a bootstrap sample of the data.

• Number of trees in forest (ntree).

• Number of descriptors tried at each node (mtry).

• Each tree predicts label of molecule.

• Majority vote = class label of molecule.

Random Forest

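[Figure: example decision tree. The root node splits on A (A > x1 or A < x1); the second level splits on B (B > x2 or B < x2) and on C (C > x3 or C < x3); the four leaves give the decisions Yes, No, No, Yes.]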

A Random Forest contains many such trees.
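A minimal pure-Python sketch of the procedure, assuming binary fingerprint bits as descriptors and Gini impurity as the split criterion (the slides do not state the criterion or the software used). The function names and the toy data are hypothetical; `ntree` and `mtry` play the roles named on the slide.

```python
import random
from collections import Counter
from itertools import product

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(rows, labels, mtry, rng):
    """Grow one unpruned tree on binary descriptors; a leaf is a bare class label."""
    if len(set(labels)) == 1:
        return labels[0]
    best = None
    # Only mtry randomly chosen descriptors are tried at this node.
    for bit in rng.sample(range(len(rows[0])), min(mtry, len(rows[0]))):
        on = [i for i, row in enumerate(rows) if row[bit]]
        off = [i for i, row in enumerate(rows) if not row[bit]]
        if not on or not off:
            continue  # this descriptor does not split the node
        score = sum(len(side) * gini([labels[i] for i in side]) for side in (on, off))
        if best is None or score < best[0]:
            best = (score, bit, on, off)
    if best is None:
        return Counter(labels).most_common(1)[0][0]
    _, bit, on, off = best
    return (bit,
            grow_tree([rows[i] for i in on], [labels[i] for i in on], mtry, rng),
            grow_tree([rows[i] for i in off], [labels[i] for i in off], mtry, rng))

def predict_tree(tree, row):
    """Follow the splits until a leaf label is reached."""
    while isinstance(tree, tuple):
        bit, on_branch, off_branch = tree
        tree = on_branch if row[bit] else off_branch
    return tree

def grow_forest(rows, labels, ntree=100, mtry=2, seed=0):
    """Each tree is grown on its own bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(ntree):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(grow_tree([rows[i] for i in idx], [labels[i] for i in idx], mtry, rng))
    return forest

def predict_forest(forest, row):
    """Class label = majority vote over the trees."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]

# Hypothetical toy data: all 4-bit fingerprints, "active" (1) when the first bit is set.
rows = list(product([0, 1], repeat=4))
labels = [row[0] for row in rows]
forest = grow_forest(rows, labels, ntree=25, mtry=2, seed=1)
```

Individual trees can be wrong, because each sees only a bootstrap sample and a random subset of descriptors at every node; the vote across trees is what makes the prediction stable.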


k-Nearest Neighbours

• Instance based learner.

• Take a query point x and find the closest k points from the training set to it using Euclidean distance in descriptor space.

• k is a variable describing the number of neighbours to be considered.

• Class of x determined by majority vote of class labels of k neighbours.

• Ties are broken randomly (with two classes, ties can only occur for even k).


[Figure: k-Nearest Neighbours illustration. Legend: Active, Inactive, ? (query point).]



• Local method.

• Uses only a very small number of near neighbours to make its prediction.

• Suitable for predicting activity classes with multiple clusters in chemical space.

• Therefore good for WADA classes with multiple receptors.
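The k-NN rule described in the bullets can be sketched as follows. The function name `knn_predict`, the seeded random generator used for tie-breaking, and the two-dimensional toy points are assumptions for illustration; real inputs would be fingerprint vectors.

```python
import math
import random
from collections import Counter

def knn_predict(train, query, k=1, rng=None):
    """Predict the class of `query` from its k nearest training points.

    `train` is a list of (descriptor_vector, label) pairs.
    """
    rng = rng or random.Random(0)
    # Sort training points by Euclidean distance to the query and keep the closest k.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    top = max(votes.values())
    # Labels sharing the top vote count are tied; pick one of them at random.
    winners = sorted(label for label, count in votes.items() if count == top)
    return rng.choice(winners)

# Hypothetical toy training set with two clusters.
train = [((0, 0), "inactive"), ((0, 1), "inactive"),
         ((5, 5), "active"), ((5, 6), "active"), ((6, 5), "active")]
prediction = knn_predict(train, (5, 4), k=3)
```

With k = 2 and one neighbour from each class, the vote is tied and the answer is effectively a coin flip, which is consistent with the remark that ties only arise for even k.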

Performance Measure

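• Matthews Correlation Coefficient:

$$\mathrm{MCC} = \frac{t_p t_n - f_p f_n}{\sqrt{(t_p + f_p)(t_p + f_n)(t_n + f_p)(t_n + f_n)}}$$

where $t_p$, $t_n$, $f_p$ and $f_n$ are the numbers of true positives, true negatives, false positives and false negatives.

• Range: −1 ≤ MCC ≤ 1.

• Balance between predicting positives & negatives.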
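A minimal sketch of the Matthews Correlation Coefficient computed from confusion-matrix counts. The function name `mcc` and the convention of returning 0 when the denominator vanishes are assumptions; the slides do not say how that case was handled.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from true/false positive/negative counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: undefined MCC (an empty row or column) is reported as 0.
        return 0.0
    return (tp * tn - fp * fn) / denominator

perfect = mcc(tp=5, tn=5, fp=0, fn=0)
inverted = mcc(tp=0, tn=0, fp=5, fn=5)
chance = mcc(tp=5, tn=5, fp=5, fn=5)
```

The three calls give the two ends and the midpoint of the range: a perfect classifier, a perfectly inverted one, and one no better than chance.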

The Dataset

• 5245 molecules (5235 for CDK).

• Molecules taken from the WADA banned list and from the corresponding activity classes in MDDR.

• 367 explicitly allowed substances are also included.

Data by Class

WADA Class    Number of Molecules
S1            47
S2            272
S3            367
S4            928
S5            1000
S6            804
S7            195
S8            1000
S9            26
P2            239
Allowed       367

Fivefold Cross-validation

• We test for membership of each prohibited class separately.

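• All calculations use 5-fold cross-validation: the molecules are split into five parts, and each part in turn serves as the test set (20% of molecules) while the other four parts (80%) form the training set, so that each molecule is in exactly one test set.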
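The splitting scheme can be sketched as below. The function name `five_fold_splits`, the random shuffle and the seed are assumptions; the slides do not say how molecules were assigned to folds.

```python
import random

def five_fold_splits(n_molecules, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs; every index is tested exactly once."""
    indices = list(range(n_molecules))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds roughly equal parts.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 5245 is the dataset size quoted on the slides.
splits = list(five_fold_splits(5245))
```

For 5245 molecules each test set holds 1049 molecules and each training set the remaining 4196.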

False Positives

• False Positives arise in two ways:

• (1) A molecule predicted positive for an incorrect activity class;

• (2) An explicitly allowed molecule predicted positive.

Results

Results: Random Forest

Aggregated over 10 classes

Unity ≈ CDK > MACCS > others.

[Figure: MCC for RF for Six Fingerprints. x-axis: Rank out of 6 Fingerprints; y-axis: MCC.]

Fingerprint   MCC
Unity         0.8214
CDK           0.8136
MACCS         0.7823
TGD           0.7283
MOE           0.7172
TAD           0.5902

100 trees sufficient; little improvement with more.

[Figure: MCC as a Function of ntree in RF models for Unity. x-axis: ntree (0 to 1000); y-axis: MCC.]

Results: k-Nearest Neighbours

Aggregated over 10 classes

[Figure: MCC as a Function of k in k-NN Models for Six Fingerprints. x-axis: k (0 to 20); y-axis: MCC; series: Unity, MACCS, MOE, TAD, TGD, CDK.]

Unity ≈ CDK > MACCS > others.

[Figure: MCC for k = 1 for Six Fingerprints. x-axis: Rank out of 6 Fingerprints; y-axis: MCC.]

Fingerprint   MCC
Unity         0.8363
CDK           0.8297
MACCS         0.8045
TGD           0.7404
MOE           0.6814
TAD           0.6152

k = 1 best; poor performance at k = 2 due to ties. MCC falls off with increasing k.

[Figure: MCC as a Function of k in k-NN Models for Unity. x-axis: k (0 to 20); y-axis: MCC.]

k = 1 best; poor performance at k = 2 due to ties. MCC falls off with increasing k. Unity ≈ CDK.

[Figure: MCC as a Function of k in k-NN Models for CDK. x-axis: k (0 to 20); y-axis: MCC.]

Results: Comparison

Recall v Precision

Aggregated over 10 classes

[Figure: Recall v Precision for Positives. x-axis: Recall; y-axis: Precision; legend: Unity, MACCS, CDK, RF, k-NN.]

RF gives higher precision, k-NN higher recall.
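The slides do not reproduce the formulas, so the sketch below assumes the standard definitions: recall is the fraction of true actives that are found, and precision is the fraction of predicted actives that are correct. The function names and the counts in the example are hypothetical and are not results from the study.

```python
def recall(tp, fn):
    """Fraction of actual positives that were predicted positive."""
    return tp / (tp + fn) if tp + fn else 0.0

def precision(tp, fp):
    """Fraction of predicted positives that are actually positive."""
    return tp / (tp + fp) if tp + fp else 0.0

# Hypothetical counts for two classifiers on the same 100 actives.
cautious = {"tp": 70, "fp": 5, "fn": 30}    # few false positives, more misses
generous = {"tp": 85, "fp": 25, "fn": 15}   # finds more actives, more false alarms

cautious_scores = (recall(cautious["tp"], cautious["fn"]),
                   precision(cautious["tp"], cautious["fp"]))
generous_scores = (recall(generous["tp"], generous["fn"]),
                   precision(generous["tp"], generous["fp"]))
```

The cautious classifier has the higher precision and the generous one the higher recall, which is the kind of trade-off the slide reports between RF and k-NN.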

Results: Comparison

Analysed by class

Classes vary in difficulty of prediction; independent of classification algorithm.

[Figure: MCC by Class for Random Forest Default and k-NN (k = 1) Models. x-axis: Class (S1, S2, S3, S4, S5, S6, S7, S8, S9, P2); y-axis: MCC (0 to 1); series: RF, k-NN.]

Conclusions

• Can successfully predict active molecules (MCC ≈ 0.83).

• Unity ≈ CDK > MACCS > others.

• RF & k-NN give similar MCC.

• k-NN higher recall.

• RF higher precision; RF less likely to produce false positives.

• RF results vary little with ntree.

• k-NN results best for k = 1.

• Performance decreases at higher k.

• Odd k avoids problems with ties (k = 2 is worse than k = 3).

• Activity classes show consistent prediction difficulty pattern.

Acknowledgements

• Andreas Bender (Novartis Institutes).

• David Palmer (Unilever Centre).

• Unilever.


[Figure: MCC as a Function of Class Size. x-axis: Class Size (0 to 1200); y-axis: MCC (0 to 1); series: Unity, MACCS, MOE, TAD, TGD.]

No significant correlation overall; though smallest class S9 is hardest to predict.

[Figure: MCC as a Function of Intra Class Mean Tanimoto Score. x-axis: Intra Class Mean Tanimoto Score (0.00 to 0.60); y-axis: MCC (0 to 1); series: Unity, MACCS, MOE, TAD, TGD.]
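The slides do not say how the intra-class mean Tanimoto score was computed. One plausible reading, sketched here as an assumption, is the Tanimoto coefficient on binary fingerprints averaged over all distinct pairs of molecules in a class; the function names and the toy fingerprints are hypothetical.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length binary fingerprints."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

def mean_intra_class_tanimoto(fingerprints):
    """Average Tanimoto coefficient over all distinct pairs within one class."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)

# Hypothetical class of three molecules: two identical, one sharing no bits.
toy_class = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
score = mean_intra_class_tanimoto(toy_class)
```

A low mean score indicates a structurally diverse class, for example one whose members cluster in several separate regions of chemical space.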

[Figure: structures of tetrahydrogestrinone (THG), gestrinone and trenbolone.]
