Download - Machine Learning techniques for biomarker discovery in proteomic pattern data

Transcript
Page 1: Machine Learning techniques for biomarker discovery in proteomic pattern data

Machine Learning techniques for biomarker discovery in proteomic pattern data

Elena Marchiori

Department of Computer Science

Vrije Universiteit Amsterdam

Page 2

Overview

• Proteomic pattern data

• How to use the data

• Approaches

• Methodology

• Case study

• Conclusion

Page 3

SELDI-TOF MS: Surface-enhanced laser desorption/ionization time-of-flight mass spectrometry

• Method for profiling a population of proteins in a sample according to the size and net charge of individual proteins.

• The readout is a spectrum of peaks. The position of a protein in the spectrum corresponds to its “time of flight”, because lighter proteins fly faster than heavier ones.

1. Serum on protein binding plate

2. Insert plate in vacuum chamber

3. Irradiate plate with laser

4. This “launches” the proteins / peptides

5. Measure “time of flight” (TOF) of ions, which corresponds to molecular weights of proteins

Page 4

Example

[Figure: example spectrum, abundance plotted against time of flight]

• Heavier peptides move slower, so time of flight corresponds to weight

• Weight corresponds to peptides

• Measuring relative abundance of detected proteins in serum
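The link between flight time and weight can be made concrete with the idealized equation for a linear TOF analyzer, t = L * sqrt(m / (2 z e U)). The sketch below is only an illustration: the drift length (1 m) and acceleration voltage (20 kV) are assumed values, not taken from the slides, and real instruments need calibration.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge in coulombs
DALTON = 1.66053906660e-27   # one dalton in kilograms

def time_of_flight(mass_da, charge=1, drift_length_m=1.0, voltage_v=20_000.0):
    """Idealized flight time (seconds) of an ion in a linear TOF analyzer.

    Energy balance z*e*U = 0.5*m*v^2 gives v = sqrt(2*z*e*U/m), and t = L / v.
    drift_length_m and voltage_v are illustrative assumptions.
    """
    mass_kg = mass_da * DALTON
    velocity = math.sqrt(2 * charge * E_CHARGE * voltage_v / mass_kg)
    return drift_length_m / velocity

def mass_from_time(t_s, charge=1, drift_length_m=1.0, voltage_v=20_000.0):
    """Invert the relation above: recover the mass in daltons from a flight time."""
    mass_kg = 2 * charge * E_CHARGE * voltage_v * (t_s / drift_length_m) ** 2
    return mass_kg / DALTON
```

Under this model a peptide four times heavier takes twice as long to reach the detector, which is why the time axis can be read as a mass axis.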

Page 5

How to use the data?

• Diagnostic tool:

– design a classifier for discriminating healthy from disease samples

• Biomarkers identification:

– Feature selection (FS): select features (peptides / proteins) that best discriminate the two classes (potential biomarkers)

Page 6

Classification / FS

• diagnostic tool => classifier

– train a classifier that separates the two classes of diseased and healthy examples

• biomarkers => feature subset selection

– for a given type of classifier (e.g. KNN, SVM) find a small set of features that optimizes the performance of the classifier when restricted to the selected features

– for a given clustering algorithm find a small set of features that maximizes the coherence of class labels of examples in the clusters (Petricoin et al., The Lancet 2002)

Page 7

Approaches: Commercial

• Proteome Quest (Correlogic): GA+clustering, no pre-selection (Petricoin et al., The Lancet 2002)

• Propeak (3Z Informatics): separability analysis + bootstrap

• Biomarker AMplification Filter BAMF (Eclipse Diagnostics): ?

Page 8

Approaches: Non-commercial

• Pre-processing + ranking + kNN (Zhu et al., PNAS 2003)

• Pre-selection + boosted decision trees (Qu et al., Clin. Chem. 2002)

• Filter FS + classifier (Liu et al., Genome Informatics 2002)

• GA + SVM (Jong et al., EvoBIO 2004)

• Many others: any ML method for classification/FS (see, e.g., special issue on FS, JMLR 2003)

Page 9

SVM-based methods

– Linear Support Vector Machine

Page 10

GA-SVM

• Training set T = T_1 ∪ T_2.

• A genetic algorithm evolves a number of populations. Each population consists of sets of features of a given size. The fitness of an individual of the population is based on the performance of an SVM. The SVM is trained on T_1 using only the features of the individual. The fitness is the SVM error over T_2.

• At each generation new individuals are created and inserted into the population by selecting fit parents which are mutated and recombined.

• Individuals may migrate to neighbor populations.
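The loop described on this slide can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: a nearest-centroid classifier stands in for the SVM so the example needs only the standard library, migration between populations is left out, and the selection scheme (keep the fitter half, replace the worst individual) and all parameter values are assumptions.

```python
import random
from statistics import mean

def centroid_error(features, X1, y1, X2, y2):
    """Fitness: train a nearest-centroid classifier on T_1 using only
    `features`, return its error on T_2 (stand-in for the SVM error)."""
    cents = {}
    for label in set(y1):
        rows = [x for x, l in zip(X1, y1) if l == label]
        cents[label] = [mean(r[j] for r in rows) for j in features]

    def predict(x):
        return min(cents, key=lambda c: sum((x[j] - m) ** 2
                                            for j, m in zip(features, cents[c])))

    return sum(predict(x) != l for x, l in zip(X2, y2)) / len(X2)

def mutate(ind, n_features, rng):
    """Swap one selected feature for a randomly drawn one (size is preserved)."""
    ind = set(ind)
    ind.remove(rng.choice(sorted(ind)))
    ind.add(rng.choice([j for j in range(n_features) if j not in ind]))
    return frozenset(ind)

def recombine(a, b, size, rng):
    """Child draws `size` features from the union of both parents."""
    return frozenset(rng.sample(sorted(a | b), size))

def ga_feature_selection(X1, y1, X2, y2, sizes=(2, 3), pop_size=10,
                         generations=15, seed=0):
    rng = random.Random(seed)
    n = len(X1[0])

    def fitness(ind):
        return centroid_error(sorted(ind), X1, y1, X2, y2)

    # One population per feature-set size, as on the slide.
    pops = [[frozenset(rng.sample(range(n), s)) for _ in range(pop_size)]
            for s in sizes]
    for _ in range(generations):
        for pop, size in zip(pops, sizes):
            pop.sort(key=fitness)                       # fittest (lowest error) first
            a, b = rng.sample(pop[: pop_size // 2], 2)  # parents from the fitter half
            pop[-1] = mutate(recombine(a, b, size, rng), n, rng)  # replace the worst
    best = min((ind for pop in pops for ind in pop), key=fitness)
    return sorted(best), fitness(best)
```

Because the fitness is measured on T_2, the returned error is optimistic; an honest estimate still needs a test set that the search never sees.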

Page 11

Ensemble SVM-RFE

SVM-RFE(a cutoff, a training set T = T_1 ∪ T_2)

1. Train a linear soft-SVM(C, class label penalties) on T_1

2. Order features using the weights of the resulting classifier

3. Eliminate features with weight smaller than cutoff

4. Repeat the process with T_1 restricted to the remaining features

This algorithm generates a chain of feature sets

F_1 ⊃ F_2 ⊃ … ⊃ F_k

SVM-RFE selects from {F_1, …,F_k} the set F* that minimizes the error over T_2 of the classifier restricted to the feature set, plus a term for penalizing large feature sets.

We proposed a variant of this FS algorithm that uses ensembles of results of SVM-RFE over different cutoff values.
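A minimal sketch of the single-cutoff SVM-RFE procedure above, under several assumptions that the slides do not fix: the SVM is trained by plain stochastic subgradient descent rather than a QP solver, class label penalties are ignored, the cutoff is applied to |w_j| relative to the largest weight, one feature is dropped when nothing falls below the cutoff, and the size penalty is linear in the number of features. The ensemble variant is not shown, since the slides do not say how the results for different cutoffs are combined.

```python
import random

def train_linear_svm(X, y, features, C=1.0, epochs=30, lr=0.01, seed=0):
    """Soft-margin linear SVM restricted to `features`, labels in {+1, -1}.
    Minimizes 0.5*||w||^2 / C + sum of hinge losses by subgradient steps."""
    rng = random.Random(seed)
    w = {j: 0.0 for j in features}
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (sum(w[j] * X[i][j] for j in features) + b)
            for j in features:
                grad = w[j] / (C * len(X))     # per-sample share of the regularizer
                if margin < 1:
                    grad -= y[i] * X[i][j]     # hinge loss is active
                w[j] -= lr * grad
            if margin < 1:
                b += lr * y[i]
    return w, b

def svm_error(w, b, X, y):
    """Fraction of examples misclassified by the linear classifier (w, b)."""
    wrong = sum(1 for xi, yi in zip(X, y)
                if yi * (sum(wj * xi[j] for j, wj in w.items()) + b) <= 0)
    return wrong / len(X)

def svm_rfe(X1, y1, X2, y2, cutoff=0.2, size_penalty=0.1):
    """Return (selected feature set F*, chain of nested feature sets).
    The chain starts with the full feature set."""
    n_features = len(X1[0])
    features = list(range(n_features))
    chain = []
    while features:
        w, b = train_linear_svm(X1, y1, features)            # step 1: train on T_1
        score = svm_error(w, b, X2, y2) + size_penalty * len(features) / n_features
        chain.append((score, list(features)))
        if len(features) == 1:
            break
        top = max(abs(v) for v in w.values()) or 1.0
        kept = [j for j in features if abs(w[j]) / top >= cutoff]   # steps 2-3
        if len(kept) == len(features):       # nothing below cutoff: drop the weakest
            kept = sorted(sorted(features, key=lambda j: abs(w[j]))[1:])
        features = kept                      # step 4: repeat on the remaining features
    _, best = min(chain, key=lambda item: item[0])   # T_2 error plus size penalty
    return best, [f for _, f in chain]
```

A larger cutoff removes more features per round and yields a shorter chain, which is one reason results can differ across cutoff values.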

Page 12

Methodology

• Cross Validation

– split data randomly into train and test sets

– apply the classification/FS method to the training set

– use the test set only to assess the performance of the method

– repeat the process a number of times to analyze bias induced by the data splitting
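The protocol above might look like the following sketch. It uses a simple filter (difference in class means) and a nearest-centroid classifier purely as placeholders for whatever classification/FS method is being assessed; the stratified split, the function names and the default parameters are assumptions made for the example. The key point is that feature selection runs inside the loop, on the training part only.

```python
import random
from statistics import mean

def rank_features(X, y, k):
    """Indices of the k features with the largest |difference in class means|."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = [abs(mean(r[j] for r in pos) - mean(r[j] for r in neg))
              for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]

def nearest_centroid(X, y, features):
    """Fit class centroids on the selected features; return a predict function."""
    cents = {}
    for label in (0, 1):
        rows = [row for row, l in zip(X, y) if l == label]
        cents[label] = [mean(r[j] for r in rows) for j in features]

    def predict(row):
        return min(cents, key=lambda c: sum((row[j] - m) ** 2
                                            for j, m in zip(features, cents[c])))
    return predict

def repeated_holdout(X, y, k=5, n_repeats=20, test_fraction=0.3, seed=0):
    """Test error for each of n_repeats random train/test splits (labels 0/1)."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_repeats):
        test = []
        for label in (0, 1):                     # random split, stratified by class
            idx = [i for i, l in enumerate(y) if l == label]
            rng.shuffle(idx)
            test += idx[: max(1, int(test_fraction * len(idx)))]
        held_out = set(test)
        train = [i for i in range(len(y)) if i not in held_out]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        features = rank_features(X_tr, y_tr, k)  # FS sees the training set only
        predict = nearest_centroid(X_tr, y_tr, features)
        wrong = sum(predict(X[i]) != y[i] for i in test)  # test set only for assessment
        errors.append(wrong / len(test))
    return errors
```

The spread of the returned errors shows how much the estimate depends on the particular split. Ranking features on the full data set before calling this function would leak test information into the selection.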

Page 13

About Methodology

• Examples of recent papers that do NOT use a correct methodology:

– Qu et al. (Clin. Chem. 2002): perform feature pre-selection before application of CV

– Villanueva et al. (Anal. Chem. 2004): use the entire dataset for feature ranking

– Petricoin et al. (The Lancet 2002): consider one data split into train/test set

• Papers addressing methodology pitfalls:

– Simon et al., J. Natl. Cancer Inst. 2003

– Ambroise and McLachlan, PNAS 2002

Page 14

Case Study: Data

• Used in Petricoin et al. papers

– Commercial analysis software (Proteome Quest): http://www.correlogic.com/

– Data sets: http://ncifdaproteomics.com/ppatterns.php

• Ovarian data set:

– 162 Positive (Cancer), 92 Negative (Healthy)

– 15154 Variables (Peptides / Proteins)

• Prostate data set:

– 69 Positive, 253 Negative

– 15154 Variables

• number of variables >> number of examples

Page 15

Preliminary analysis

• Few visible differences in means between healthy/cancer groups

• But many very low p-values (in particular ovarian -> easy)

[Figures: difference in means and histogram of p-values, for the prostate data and for the ovarian data]

Page 16

The Methods

• Diagnostic tool:

– Support Vector Machine with linear and polynomial kernel

• Biomarkers Detection and Diagnostic:

– Feature subset selection, using Genetic Algorithms and Support Vector Machine

Page 17

Diagnostics: Results

• Support Vector Machine (SVM) on all features

– Linear and quadratic kernel

• Evaluation measures:

– Error: (fp + fn) / total

– Sensitivity: tp / (tp + fn)

– Specificity: tn / (fp + tn)

– Positive Predictive Value: tp / (tp + fp)

Results seem consistent with preliminary analysis: ovarian easier than prostate
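The four evaluation measures follow directly from the confusion-matrix counts. A small helper, with a hypothetical function name and made-up counts in the usage note, could look like this:

```python
def evaluation_measures(tp, fp, tn, fn):
    """Compute error, sensitivity, specificity and positive predictive value
    from confusion-matrix counts. A ratio with a zero denominator is NaN."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    total = tp + fp + tn + fn
    return {
        "error": ratio(fp + fn, total),
        "sensitivity": ratio(tp, tp + fn),   # fraction of diseased samples detected
        "specificity": ratio(tn, fp + tn),   # fraction of healthy samples cleared
        "ppv": ratio(tp, tp + fp),           # fraction of positive calls that are right
    }
```

For example, `evaluation_measures(tp=8, fp=2, tn=18, fn=2)` gives sensitivity 0.8, specificity 0.9 and positive predictive value 0.8. With unbalanced classes, as in the prostate data, the error alone can hide a poor sensitivity, which is why all four are reported.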

Page 18

Biomarker Detection: Results

Linear SVM, Prostate data set

Quadratic SVM, Prostate data set

Bigger error than SVM on all features (+/- 0.06)

Page 19

Results of Experiments

• Results of experiments with GA-SVM indicate that there is variability both due to the data splitting and the algorithm.

• Different sets of features are obtained at each run; however, a group of about 50 features occurs more often across all the runs.
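One simple way to find features that recur across runs is to count how often each one is selected. The sketch below assumes each run returns a collection of feature indices; the function names and the 0.5 frequency threshold are hypothetical choices, not values from the study.

```python
from collections import Counter

def feature_frequencies(selected_sets):
    """Map each feature to the fraction of runs in which it was selected."""
    counts = Counter(j for s in selected_sets for j in set(s))
    n_runs = len(selected_sets)
    return {j: c / n_runs for j, c in counts.most_common()}

def stable_features(selected_sets, min_fraction=0.5):
    """Features selected in at least `min_fraction` of the runs, sorted by index."""
    freqs = feature_frequencies(selected_sets)
    return sorted(j for j, f in freqs.items() if f >= min_fraction)
```

Features that survive a high threshold across many data splits and random seeds are more plausible biomarker candidates than those picked in a single run.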

Page 20

Results of Experiments

• Ensemble-RFE-SVM achieves perfect classification on the ovarian dataset, while on the prostate dataset it achieves a sensitivity of 0.97 (0.04) and a specificity of 0.89 (0.06).

• Ensemble-RFE-SVM outperforms both GA-SVM and the commercial software of Petricoin et al. However, it finds feature sets of larger sizes.

• The features provided on the Petricoin et al. website yield poor performance when an SVM is used, showing that performance depends on the type of classifier used…

Page 21

Diagnostic tool Design

• Effective FS algorithms, like ensemble SVM-RFE, have to be enhanced with a user-friendly interface and visualization features in order to become operative in research laboratories and hospitals.

• The resulting tools can be used by biologists and pathologists to analyze their data without needing direct support from CS people.

Page 22

Conclusion

• Many machine learning techniques can be used for the analysis of pattern proteomic data. SVM based approaches are effective.

• Computational analysis of pattern proteomic data has to use a correct methodology that considers biases induced by the selection and classification algorithms and by the data splitting.

• Collaboration:

– Connie Jimenez

– Gus Smit

– Kees Jong

– Aad van der Vaart