Machine Learning techniques for biomarker discovery in proteomic pattern data

Elena Marchiori

Department of Computer Science

Vrije Universiteit Amsterdam

Overview

• Proteomic pattern data

• How to use the data

• Approaches

• Methodology

• Case study

• Conclusion

SELDI-TOF MS

Surface-enhanced laser desorption/ionization time-of-flight mass spectrometry

• Method for profiling a population of proteins in a sample according to the size and net charge of individual proteins.

• The readout is a spectrum of peaks. The position of a protein in the spectrum corresponds to its “time of flight”, because lighter proteins fly faster than heavier ones.

1. Serum on protein binding plate

2. Insert plate in vacuum chamber

3. Irradiate plate with laser

4. This “launches” the proteins / peptides

5. Measure “time of flight” (TOF) of ions, which corresponds to the molecular weights of the proteins

Example

[Figure: example spectrum; x-axis: time of flight, y-axis: abundance]

• Heavier peptides move slower, so time of flight corresponds to weight

• Weight corresponds to peptides

• Measuring relative abundance of detected proteins in serum
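The link between weight and flight time can be made concrete with a small sketch. It assumes an idealized linear TOF tube in which every ion gains the same energy z·e·U before drifting a fixed distance; the voltage and tube length below are illustrative values, not settings of any instrument discussed in these slides.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge in coulombs
DALTON = 1.66053906660e-27   # one dalton in kilograms

def time_of_flight(mass_da, charge=1, voltage=20_000.0, tube_length=1.0):
    """Flight time (s) of an ion in an idealized linear TOF tube.

    voltage (V) and tube_length (m) are assumed, illustrative values.
    """
    mass_kg = mass_da * DALTON
    # Energy gained in the accelerating field: z * e * U = 1/2 * m * v^2
    velocity = math.sqrt(2.0 * charge * E_CHARGE * voltage / mass_kg)
    return tube_length / velocity

def mass_to_charge(t, voltage=20_000.0, tube_length=1.0):
    """Invert the relation: m/z (daltons per unit charge) from a flight time."""
    return 2.0 * E_CHARGE * voltage * (t / tube_length) ** 2 / DALTON

light = time_of_flight(2_000)   # a 2 kDa peptide
heavy = time_of_flight(8_000)   # an 8 kDa peptide
print(heavy / light)            # flight time grows with the square root of the mass
```

Under these assumptions a fourfold heavier ion takes twice as long to reach the detector, which is why the time axis of a spectrum can be read as a mass axis.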

How to use the data?

• Diagnostic tool:

– design a classifier for discriminating healthy from diseased samples

• Biomarker identification:

– Feature selection (FS): select features (peptides / proteins) that best discriminate the two classes (potential biomarkers)

Classification / FS

• diagnostic tool => classifier

– train a classifier that separates the two classes of diseased and healthy examples

• biomarkers => feature subset selection

– for a given type of classifier (e.g. KNN, SVM) find a small set of features that optimizes the performance of the classifier when restricted to the selected features

– for a given clustering algorithm find a small set of features that maximizes the coherence of class labels of examples in the clusters (Petricoin et al., The Lancet 2002)

Approaches: Commercial

• Proteome Quest (Correlogic): GA+clustering, no pre-selection (Petricoin et al., The Lancet 2002)

• Propeak (3Z Informatics): separability analysis + bootstrap

• Biomarker AMplification Filter BAMF (Eclipse Diagnostics): ?

Approaches: Non-commercial

• Pre-processing + ranking + kNN (Zhu et al., PNAS 2003)

• Pre-selection + boosted decision trees (Qu et al., Clin. Chem. 2002)

• Filter FS + classifier (Liu et al., Genome Informatics 2002)

• GA + SVM (Jong et al., EvoBIO 2004)

• Many others: any ML method for classification/FS (see, e.g., special issue on FS, JMLR 2003)

SVM-based methods

– Linear Support Vector Machine

GA-SVM

• Training set T = T_1 ∪ T_2.

• A genetic algorithm evolves a number of populations. Each population consists of sets of features of a given size. The fitness of an individual is based on the performance of an SVM: the SVM is trained on T_1 using only the features of the individual, and the fitness is the SVM error over T_2.

• At each generation new individuals are created and inserted into the population by selecting fit parents, which are mutated and recombined.

• Individuals may migrate to neighboring populations.
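The GA loop described above can be sketched as follows. This is a minimal illustration, not the implementation of Jong et al.: the operators (swap mutation, union-based recombination, truncation selection, ring migration), all parameter values, and the names `evolve`, `toy_error` and `INFORMATIVE` are assumptions. The error function is pluggable; in the setting of the slides it would train an SVM on T_1 with the individual's features and return its error over T_2, whereas the toy function here only counts missed "informative" features.

```python
import random

def mutate(individual, n_features, rng):
    """Swap one selected feature for a feature outside the reduced subset."""
    child = set(individual)
    child.remove(rng.choice(sorted(child)))
    child.add(rng.choice([f for f in range(n_features) if f not in child]))
    return frozenset(child)

def recombine(a, b, size, rng):
    """The child draws `size` features from the union of both parents."""
    return frozenset(rng.sample(sorted(a | b), size))

def evolve(error_fn, n_features, size, n_islands=3, pop_size=10,
           generations=30, migrate_every=5, seed=0):
    rng = random.Random(seed)
    # Several populations of fixed-size feature subsets
    islands = [[frozenset(rng.sample(range(n_features), size))
                for _ in range(pop_size)] for _ in range(n_islands)]
    for gen in range(1, generations + 1):
        for pop in islands:
            pop.sort(key=error_fn)                  # lowest error first
            a, b = rng.sample(pop[:pop_size // 2], 2)   # pick two fit parents
            child = mutate(recombine(a, b, size, rng), n_features, rng)
            pop[-1] = child                         # child replaces the worst individual
        if gen % migrate_every == 0:
            # Ring migration: each island receives the best individual of its neighbor
            best = [min(pop, key=error_fn) for pop in islands]
            for i, pop in enumerate(islands):
                worst = max(range(pop_size), key=lambda j: error_fn(pop[j]))
                pop[worst] = best[i - 1]
    return min((ind for pop in islands for ind in pop), key=error_fn)

INFORMATIVE = frozenset({3, 17, 42, 77})   # hypothetical markers for the toy demo

def toy_error(features):
    """Stand-in for the SVM error over T_2: fraction of informative features missed."""
    return 1.0 - len(features & INFORMATIVE) / len(INFORMATIVE)

best = evolve(toy_error, n_features=100, size=4)
print(sorted(best), toy_error(best))
```

Because selection is stochastic, different seeds generally return different feature sets, which is the kind of run-to-run variability a GA-based selector has to be evaluated for.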

Ensemble SVM-RFE

SVM-RFE(a cutoff, a training set T = T_1 ∪ T_2)

1. Train a linear soft-margin SVM (C, class label penalties) on T_1

2. Order features using the weights of the resulting classifier

3. Eliminate features with weight smaller than the cutoff

4. Repeat the process with T_1 restricted to the remaining features

This algorithm generates a chain of feature sets

F_1 ⊃ F_2 ⊃ … ⊃ F_k

SVM-RFE selects from {F_1, …, F_k} the set F* that minimizes the error over T_2 of the classifier restricted to the feature set, plus a term penalizing large feature sets.

We proposed a variant of this FS algorithm that uses ensembles of results of SVM-RFE over different cutoff values.
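A single-cutoff SVM-RFE run can be sketched in plain Python as below. Several choices are assumptions made for the sketch: the linear SVM is trained by Pegasos-style stochastic sub-gradient descent without a bias term (so features are assumed centered) and without per-class penalties, the cutoff is read as a fraction of the largest absolute weight, and the size penalty is a fixed cost per feature. How the ensemble variant combines runs over different cutoffs is not specified on the slides, so it is not sketched.

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Pegasos-style training of a linear soft-margin SVM; y in {-1, +1}."""
    rng = random.Random(seed)
    w, t = [0.0] * len(X[0]), 0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [wj * (1.0 - eta * lam) for wj in w]   # shrink step from the L2 term
            if margin < 1.0:                            # hinge loss is active
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

def error_rate(w, X, y, feats):
    wrong = sum(1 for xi, yi in zip(X, y)
                if yi * sum(wj * xi[f] for wj, f in zip(w, feats)) <= 0)
    return wrong / len(X)

def svm_rfe(X1, y1, X2, y2, cutoff=0.2, size_penalty=0.001):
    feats = list(range(len(X1[0])))
    scored = []                                  # (feature set, penalized error on T_2)
    while True:
        w = train_linear_svm([[x[f] for f in feats] for x in X1], y1)
        scored.append((feats, error_rate(w, X2, y2, feats) + size_penalty * len(feats)))
        top = max(abs(wj) for wj in w)
        kept = [f for f, wj in zip(feats, w) if abs(wj) >= cutoff * top]
        if len(kept) == len(feats):              # nothing eliminated: stop
            break
        feats = kept
    best_feats, _ = min(scored, key=lambda item: item[1])
    return best_feats, [f for f, _ in scored]

def toy_data(n, rng, d=10):
    """Toy samples: features 0 and 1 carry the class signal, the rest are noise."""
    X, y = [], []
    for i in range(n):
        label = 1 if i % 2 == 0 else -1
        row = [rng.gauss(0.0, 1.0) for _ in range(d)]
        row[0] = 2.0 * label + rng.gauss(0.0, 0.5)
        row[1] = 2.0 * label + rng.gauss(0.0, 0.5)
        X.append(row)
        y.append(label)
    return X, y

rng = random.Random(1)
X1, y1 = toy_data(40, rng)    # plays the role of T_1
X2, y2 = toy_data(40, rng)    # plays the role of T_2
selected, chain = svm_rfe(X1, y1, X2, y2)
print(selected, [len(f) for f in chain])
```

The returned `chain` is the nested sequence F_1 ⊃ F_2 ⊃ … ⊃ F_k, and `selected` is the member with the lowest penalized error over T_2.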

Methodology

• Cross Validation

– split data randomly into a training set and a test set

– apply the classification/FS method to the training set

– use the test set only to assess the performance of the method

– repeat the process a number of times to analyze bias induced by the data splitting
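The procedure above can be sketched as a repeated random split in which feature selection sees the training part only. The ranking criterion (absolute difference of class means) and the nearest-centroid classifier are simple stand-ins chosen to keep the sketch dependency-free; they are assumptions, not the methods used in the case study. The toy data are pure noise with arbitrary labels.

```python
import random
from statistics import mean, stdev

def rank_features(X, y, k):
    """Top-k features by absolute difference of class means (labels 0/1)."""
    def gap(f):
        pos = [x[f] for x, lab in zip(X, y) if lab == 1]
        neg = [x[f] for x, lab in zip(X, y) if lab == 0]
        return abs(mean(pos) - mean(neg))
    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]

def centroid_classifier(X, y, feats):
    """Nearest-centroid classifier restricted to the selected features."""
    cents = {}
    for lab in (0, 1):
        rows = [x for x, xl in zip(X, y) if xl == lab]
        cents[lab] = [mean(r[f] for r in rows) for f in feats]
    def predict(x):
        dist = {lab: sum((x[f] - c) ** 2 for f, c in zip(feats, cen))
                for lab, cen in cents.items()}
        return min(dist, key=dist.get)
    return predict

def repeated_holdout(X, y, k=5, repeats=20, test_fraction=0.3, seed=0):
    rng = random.Random(seed)
    idx = list(range(len(X)))
    n_test = int(len(X) * test_fraction)
    errors = []
    for _ in range(repeats):
        rng.shuffle(idx)                               # new random split
        test, train = idx[:n_test], idx[n_test:]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        feats = rank_features(Xtr, ytr, k)             # selection uses training data only
        predict = centroid_classifier(Xtr, ytr, feats)
        wrong = sum(predict(X[i]) != y[i] for i in test)   # test set only assesses
        errors.append(wrong / n_test)
    return mean(errors), stdev(errors)

rng = random.Random(42)
X = [[rng.gauss(0.0, 1.0) for _ in range(200)] for _ in range(60)]   # noise features
y = [i % 2 for i in range(60)]                                       # arbitrary labels
avg_err, sd_err = repeated_holdout(X, y)
print(avg_err, sd_err)
```

On noise data like this, an honest estimate should hover around chance level. Ranking the features on the full data set before splitting would let the test samples influence the selection and would tend to produce an optimistic error.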

About Methodology

• Examples of recent papers that do NOT use a correct methodology:

– Qu et al. (Clin. Chem. 2002): perform feature pre-selection before application of CV

– Villanueva et al. (Anal. Chem. 2004): use the entire dataset for feature ranking

– Petricoin et al. (The Lancet 2002): consider one data split into train/test set

• Papers addressing methodology pitfalls:

– Simon et al., J. Natl. Cancer Inst. 2003

– Ambroise and McLachlan, PNAS 2002

Case Study: Data

• Used in Petricoin et al. papers

– Commercial analysis software (Proteome Quest): http://www.correlogic.com/

– Data sets: http://ncifdaproteomics.com/ppatterns.php

• Ovarian data set:

– 162 Positive (Cancer), 92 Negative (Healthy)

– 15154 Variables (Peptides / Proteins)

• Prostate data set:

– 69 Positive, 253 Negative

– 15154 Variables

• number of variables >> number of examples

Preliminary analysis

• Few visible differences in means between healthy/cancer groups

• But many very low p-values (in particular ovarian -> easy)

[Figures: difference in means and histogram of p-values, for the prostate data and for the ovarian data]

The Methods

• Diagnostic tool:

– Support Vector Machine with linear and polynomial kernel

• Biomarker Detection and Diagnostics:

– Feature subset selection, using Genetic Algorithms and Support Vector Machine

Diagnostics: Results

• Support Vector Machine (SVM) on all features

– Linear and quadratic kernel

• Evaluation measures:

– Error: (fp + fn) / total

– Sensitivity: tp / (tp + fn)

– Specificity: tn / (fp + tn)

– Positive Predictive Value: tp / (tp + fp)
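The four measures follow directly from the confusion counts. A minimal sketch, where the function name `evaluate` and the example labels are made up for illustration:

```python
def evaluate(y_true, y_pred, positive=1):
    """Error, sensitivity, specificity and PPV from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "error": (fp + fn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),    # fraction of diseased samples detected
        "specificity": tn / (fp + tn),    # fraction of healthy samples cleared
        "ppv": tp / (tp + fp),            # fraction of positive calls that are correct
    }

# Made-up example: 4 diseased (1) and 6 healthy (0) samples
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
scores = evaluate(y_true, y_pred)
print(scores)   # error 0.3, sensitivity 0.75, specificity 4/6, ppv 0.6
```

The sketch does not guard against empty denominators, for example a classifier that never predicts the positive class leaves PPV undefined. With unbalanced classes such as the prostate data, error alone can hide poor sensitivity, which is why all four measures are reported.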

Results seem consistent with preliminary analysis: ovarian easier than prostate

Biomarker Detection: Results

[Results shown for: Linear SVM, Prostate data set; Quadratic SVM, Prostate data set]

Bigger error than SVM on all features (+/- 0.06)

Results of Experiments

• Results of experiments with GA-SVM indicate that there is variability both due to the data splitting and the algorithm.

• Different sets of features are obtained at each run; however, there is a group of about 50 features that occur more often across all the runs.

Results of Experiments

• Ensemble-RFE-SVM achieves perfect classification on the ovarian dataset, while on the prostate dataset it achieves a sensitivity of 0.97 (0.04) and a specificity of 0.89 (0.06).

• Ensemble-RFE-SVM outperforms both GA-SVM and the commercial software of Petricoin et al. However, it finds feature sets of larger sizes.

• The features provided on the Petricoin et al. website yield poor performance when an SVM is used, showing that performance depends on the type of classifier used.

Diagnostic tool Design

• Effective FS algorithms, like ensemble SVM-RFE, have to be enhanced with a user-friendly interface and visualization features in order to become operative in research laboratories and hospitals.

• The resulting tools can be used by biologists and pathologists for analyzing their data without need of direct support from CS people.

Conclusion

• Many machine learning techniques can be used for the analysis of proteomic pattern data. SVM-based approaches are effective.

• Computational analysis of proteomic pattern data has to use a correct methodology that considers biases induced by the selection and classification algorithms and by the data splitting.

• Collaboration:

– Connie Jimenez

– Gus Smit

– Kees Jong

– Aad van der Vaart
