Classifying the WADA 2005 Prohibited List Using the CDK
Fingerprint
Ed Cannon & John Mitchell
www-mitchell.ch.cam.ac.uk/
Classifying the WADA Prohibited List
• Aims & Background.
• Methods.
• Data.
• Results.
• Conclusions.
Aims & Background
Aims & Background
• Much drug abuse in sport involves novel compounds such as the “designer steroid” THG.
tetrahydrogestrinone (THG)
Aims & Background
• Hence the World Anti-Doping Agency (WADA) prohibits classes of bioactivity as well as specific molecules.
• Analogues are prohibited using the “similar chemical structure or similar biological effect(s)” criterion.
WADA Prohibited Classes
• Anabolic Agents (S1)
• Hormones and Related Substances (S2)
• Beta-2-agonists (S3)
• Anti-estrogenic Agents (S4)
• Diuretics and Masking Agents (S5)
• Stimulants (S6)
• Narcotics (S7)
• Cannabinoids (S8)
• Glucocorticoids (S9)
• Alcohol (P1)
• Beta Blockers (P2)
Predicting Bioactivities
• We seek to predict whether a molecule exhibits one of these bioactivities.
• Such a classifier would be powerful as an in silico pre-filter for experimental methods such as assays.
Methods
Chemical Space
• Use descriptor-based fingerprints to locate molecules in chemical space.
• The Similar Property Principle suggests that molecules close together in chemical space often share a common bioactivity.
Machine Learning
• Use Machine Learning classification algorithms to predict bioactivity from location of molecules in chemical space.
• Random Forest.
• k-Nearest Neighbours.
Fingerprints
• CDK (Chemistry Development Kit) fingerprint.
• Unity 2D.
• MACCS key.
• MOE 2D (2004).
• Typed Atom Distance.
• Typed Graph Distance.
CDK Fingerprint
• The CDK fingerprint resembles the Daylight fingerprint.
• All bond paths up to a length of 6 are generated.
• A hashing function is used to map these paths onto a fingerprint of 1024 bits.
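The path-and-hash idea above can be illustrated with a minimal Python sketch. This is not the CDK implementation: the function name `path_fingerprint`, the use of CRC32 as the hash, the encoding of a path by atom symbols only (bond orders ignored), and the toy molecule representation are all assumptions made for illustration.

```python
import zlib

def path_fingerprint(atoms, bonds, max_bonds=6, n_bits=1024):
    """Return the set of bit positions switched on by paths of up to max_bonds bonds.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) index pairs (bond orders are ignored in this sketch)
    """
    # Build an adjacency list for the molecular graph.
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    bits = set()

    def extend(path):
        # A path and its reverse describe the same fragment, so use one canonical key.
        symbols = [atoms[i] for i in path]
        key = min("-".join(symbols), "-".join(reversed(symbols)))
        # Hash the path onto one of n_bits positions (CRC32 is an illustrative choice).
        bits.add(zlib.crc32(key.encode()) % n_bits)
        if len(path) - 1 < max_bonds:
            for nxt in neighbours[path[-1]]:
                if nxt not in path:  # simple paths only, no revisited atoms
                    extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return bits

# Toy example: ethanol heavy atoms, C-C-O.
ethanol = path_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
```

Because different paths can hash to the same position, the number of set bits is at most the number of distinct paths.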
Unity 2D Fingerprint
• Unity is similar to CDK, but based on sub-structures rather than just paths.
• Substructures present in the molecule are enumerated.
• A hashing function is used to map these substructures onto a fingerprint of 992 bits.
Classification Algorithms
• Random Forest (RF).
• k-Nearest Neighbours (k-NN).
Random Forest
• Decision tree based learner.
• Each tree is based on a bootstrap sample of the data.
• Number of trees in forest (ntree).
• Number of descriptors tried at each node (mtry).
• Each tree predicts the label of a molecule.
• Majority vote = class label of the molecule.
Random Forest
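[Figure: example decision tree. The root node splits on A > x1 versus A < x1; the next level splits on B > x2 versus B < x2 and on C > x3 versus C < x3; the four leaves carry the decisions Yes, No, No, Yes.]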
A Random Forest contains many such trees.
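The ingredients listed for Random Forest (bootstrap samples, ntree, mtry, majority vote) can be put together in a small pure-Python sketch. It is an illustration under simplifying assumptions, not the software used for the results: descriptors are taken to be binary fingerprint bits, splits are chosen by Gini impurity, and all function names are hypothetical.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label (first seen wins if counts are equal)."""
    return Counter(labels).most_common(1)[0][0]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(X, y, mtry, rng):
    """Grow an unpruned tree on binary descriptors, trying mtry descriptors per node."""
    if len(set(y)) == 1:
        return y[0]  # pure node becomes a leaf
    best = None
    for j in rng.sample(range(len(X[0])), mtry):
        left = [i for i, row in enumerate(X) if row[j] == 0]
        right = [i for i, row in enumerate(X) if row[j] == 1]
        if not left or not right:
            continue  # this descriptor does not split the node
        score = (len(left) * gini([y[i] for i in left])
                 + len(right) * gini([y[i] for i in right]))
        if best is None or score < best[0]:
            best = (score, j, left, right)
    if best is None:
        return majority(y)  # none of the sampled descriptors splits the node
    _, j, left, right = best
    return (j,
            grow_tree([X[i] for i in left], [y[i] for i in left], mtry, rng),
            grow_tree([X[i] for i in right], [y[i] for i in right], mtry, rng))

def predict_tree(tree, row):
    # Internal nodes are (descriptor index, branch for 0, branch for 1) tuples.
    while isinstance(tree, tuple):
        j, zero_branch, one_branch = tree
        tree = one_branch if row[j] else zero_branch
    return tree

def train_forest(X, y, ntree=100, mtry=None, seed=0):
    rng = random.Random(seed)
    mtry = mtry or max(1, int(len(X[0]) ** 0.5))  # sqrt default is an assumption
    forest = []
    for _ in range(ntree):
        idx = [rng.randrange(len(X)) for _ in X]  # bootstrap: sample with replacement
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], mtry, rng))
    return forest

def predict_forest(forest, row):
    """Each tree votes; the majority vote is the predicted class label."""
    return majority([predict_tree(tree, row) for tree in forest])
```

Labels must not themselves be tuples, since tuples are used here to mark internal nodes.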
k-Nearest Neighbours
• Instance based learner.
• Take a query point x and find the closest k points from the training set to it using Euclidean distance in descriptor space.
• k is a variable describing the number of neighbours to be considered.
• Class of x determined by majority vote of class labels of k neighbours.
• Ties broken randomly (only occurs for even k).
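The k-NN procedure described in these bullets can be sketched in a few lines of Python. The function name `knn_predict` and the toy data are hypothetical; the distance is Euclidean as stated above, and tied votes are broken at random.

```python
import math
import random
from collections import Counter

def knn_predict(query, train_X, train_y, k=1, rng=random):
    """Predict the class of `query` by majority vote among its k nearest neighbours."""
    # Rank training points by Euclidean distance to the query in descriptor space.
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(query, train_X[i]))
    votes = Counter(train_y[i] for i in order[:k])
    top = max(votes.values())
    winners = sorted(label for label, count in votes.items() if count == top)
    # A single winner is returned directly; a tied vote is broken randomly.
    return winners[0] if len(winners) == 1 else rng.choice(winners)

# Toy two-descriptor example with hypothetical coordinates.
train_X = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_y = ["inactive", "inactive", "active", "active", "active"]
label = knn_predict((5.2, 5.1), train_X, train_y, k=3)
```

With two classes a tied vote can only arise for even k, which is why odd values of k are often preferred.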
k-Nearest Neighbours
[Figure: a query molecule (?) plotted among Active and Inactive training molecules.]
k-Nearest Neighbours
• Local method.
• Uses only a very small number of near neighbours to make its prediction.
• Suitable for predicting activity classes with multiple clusters in chemical space.
• Therefore good for WADA classes with multiple receptors.
Performance Measure
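• Matthews Correlation Coefficient:
$$\mathrm{MCC} = \frac{t_p t_n - f_p f_n}{\sqrt{(t_p + f_p)(t_p + f_n)(t_n + f_p)(t_n + f_n)}}$$
• Here t_p, t_n, f_p and f_n are the numbers of true positives, true negatives, false positives and false negatives.
• Range: -1 ≤ MCC ≤ 1.
• Balance between predicting positives & negatives.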
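As a worked illustration of the Matthews Correlation Coefficient, the sketch below counts true/false positives and negatives from label lists and evaluates the standard formula. The function names are hypothetical, and returning 0 when the denominator vanishes is a common convention assumed here, not something stated in the slides.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count (tp, tn, fp, fn) for binary labels, where 1 means active."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from the four confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: define MCC as 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy example: 3 actives and 5 inactives, with one miss and one false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
score = mcc(*confusion_counts(y_true, y_pred))  # (2*4 - 1*1) / sqrt(3*3*5*5) = 7/15
```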
Data
The Dataset
• 5245 molecules (5235 for CDK).
• Molecules taken from the WADA banned list and from the corresponding activity classes in MDDR, plus 367 explicitly allowed substances.
Data by Class
WADA Class Number of Molecules
S1 47
S2 272
S3 367
S4 928
S5 1000
S6 804
S7 195
S8 1000
S9 26
P2 239
Allowed 367
Fivefold Cross-validation
• We test for membership of each prohibited class separately.
• All calculations use 5-fold cross-validation: 80% of the molecules form the training set and 20% the test set, repeated 5 times so that each molecule is in exactly one test set.
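The fivefold scheme can be sketched as follows. The helper `k_fold_indices` is hypothetical, and it produces a plain random partition; the slides do not say whether the folds were stratified by class, so that is left out.

```python
import random

def k_fold_indices(n_items, n_folds=5, seed=0):
    """Yield (train, test) index lists; every item lands in exactly one test set."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into n_folds nearly equal parts.
    folds = [idx[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        test = folds[f]
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield train, test

# With the 5245 molecules of the dataset, each test set holds 1049 molecules (20%).
splits = list(k_fold_indices(5245))
```

A model is then trained on each `train` list and evaluated on the matching `test` list, and the predictions from the five test sets are pooled.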
False Positives
• False Positives arise in two ways:
• (1) A molecule predicted positive on an incorrect activity class;
• (2) An explicitly allowed molecule predicted positive.
Results
Results: Random Forest
Aggregated over 10 classes
Unity ≈ CDK > MACCS > others.
[Figure: MCC for RF for Six Fingerprints. x-axis: rank out of 6 fingerprints; y-axis: MCC.]
Fingerprint MCC
Unity 0.8214
CDK 0.8136
MACCS 0.7823
TGD 0.7283
MOE 0.7172
TAD 0.5902
100 trees sufficient; little improvement with more.
[Figure: MCC as a Function of ntree in RF Models for Unity. x-axis: ntree; y-axis: MCC.]
Results: k-Nearest Neighbours
Aggregated over 10 classes
[Figure: MCC as a Function of k in k-NN Models for Six Fingerprints (Unity, MACCS, MOE, TAD, TGD, CDK). x-axis: k; y-axis: MCC.]
Unity ≈ CDK > MACCS > others.
[Figure: MCC for k = 1 for Six Fingerprints. x-axis: rank out of 6 fingerprints; y-axis: MCC.]
Fingerprint MCC
Unity 0.8363
CDK 0.8297
MACCS 0.8045
TGD 0.7404
MOE 0.6814
TAD 0.6152
k = 1 best; poor performance at k = 2 due to ties. MCC falls off with increasing k.
[Figure: MCC as a Function of k in k-NN Models for Unity. x-axis: k; y-axis: MCC.]
k = 1 best; poor performance at k = 2 due to ties. MCC falls off with increasing k. Unity ≈ CDK.
[Figure: MCC as a Function of k in k-NN Models for CDK. x-axis: k; y-axis: MCC.]
Results: Comparison
Recall v Precision
Aggregated over 10 classes
[Figure: Recall v Precision for Positives, for Unity, MACCS and CDK with RF and k-NN. x-axis: Recall; y-axis: Precision.]
RF gives higher precision, k-NN higher recall.
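For reference, recall and precision for the positive class have the standard definitions below, written in terms of the numbers of true positives (t_p), false positives (f_p) and false negatives (f_n). The slides do not spell these out, so this is the textbook form rather than a quotation.

```latex
\[
\text{Recall} = \frac{t_p}{t_p + f_n},
\qquad
\text{Precision} = \frac{t_p}{t_p + f_p}
\]
```

Higher precision therefore means fewer false positives among the molecules predicted active, while higher recall means fewer active molecules missed.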
Results: Comparison
Analysed by class
Classes vary in difficulty of prediction;
independent of classification algorithm.
[Figure: MCC by Class for Random Forest Default and k-NN (k = 1) Models. x-axis: class (S1, S2, S3, S4, S5, S6, S7, S8, S9, P2); y-axis: MCC; series: RF, k-NN.]
Conclusions
Conclusions
• Can successfully predict active molecules (MCC ≈ 0.83).
• Unity ≈ CDK > MACCS > others.
• RF & k-NN give similar MCC.
• k-NN higher recall.
• RF higher precision; RF less likely to find false positives.
Conclusions
• RF results vary little with ntree.
• k-NN results best for k = 1.
• Performance decreases at higher k.
• Odd k avoids problems with ties (k = 2 is worse than k = 3).
• Activity classes show consistent prediction difficulty pattern.
Acknowledgements
• Andreas Bender (Novartis Institutes).
• David Palmer (Unilever Centre).
• Unilever.
[Figure: MCC as a Function of Class Size, for Unity, MACCS, MOE, TAD and TGD. x-axis: class size; y-axis: MCC.]
No significant correlation overall; though smallest class S9 is hardest to predict.
[Figure: MCC as a Function of Intra Class Mean Tanimoto Score, for Unity, MACCS, MOE, TAD and TGD. x-axis: intra class mean Tanimoto score; y-axis: MCC.]
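An intra class mean Tanimoto score can be computed along the following lines. This is a sketch under assumptions: fingerprints are represented as sets of on-bit positions, the mean is taken over all unordered pairs within a class (the slides do not state how the average was formed), and the function names are hypothetical.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    a, b = set(a), set(b)
    union = len(a | b)
    # Assumed convention: two empty fingerprints are treated as identical.
    return len(a & b) / union if union else 1.0

def mean_pairwise_tanimoto(fingerprints):
    """Mean Tanimoto coefficient over all unordered pairs of fingerprints."""
    pairs = list(combinations(fingerprints, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)

# Toy class of three fingerprints: pairwise scores are 1.0, 0.0 and 0.0.
toy_class = [{1, 2}, {1, 2}, {3}]
mean_score = mean_pairwise_tanimoto(toy_class)
```

A class needs at least two members for the mean to be defined.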
[Figure: structures of tetrahydrogestrinone (THG), gestrinone and trenbolone.]