Biomarker and Classifier Selection in Diverse Genetic Datasets

James Lindsay (1), Ed Hemphill (2), Chih Lee (1), Ion Mandoiu (1), Craig Nelson (2). University of Connecticut: (1) Department of Computer Science and Engineering, (2) Department of Molecular and Cell Biology


Page 1: Biomarker and Classifier Selection in Diverse Genetic Datasets

Biomarker and Classifier Selection in Diverse Genetic Datasets

James Lindsay (1), Ed Hemphill (2), Chih Lee (1), Ion Mandoiu (1), Craig Nelson (2)

University of Connecticut
(1) Department of Computer Science and Engineering
(2) Department of Molecular and Cell Biology

Page 2: Biomarker and Classifier Selection in Diverse Genetic Datasets

Motivation 1: Cell-type Identification

• The Question: What is the smallest number of genes needed to identify each cluster?
  • B: Bone
  • C: Myeloid
  • D: Endothelial

• Available Data: Literature-annotated present/absent calls: 50 cell types, 600 genes in the mesoderm lineage.

In collaboration with: Dr. Hector Leonardo Aguila, UCHC
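One way to picture the question on this slide is as a search over the present/absent matrix for a small set of genes whose combined pattern differs between every pair of cell types. The sketch below is a simple greedy heuristic offered only as an illustration; it is not the method used in the talk, it assumes a call is available for every gene in every cell type, and it does not guarantee the smallest possible panel. The gene names `g1` to `g3` and the calls are made up.

```python
from itertools import combinations

def greedy_marker_panel(calls):
    """calls: dict mapping cell type -> dict mapping gene -> True/False (present/absent).
    Returns genes whose combined pattern separates every separable pair of cell types."""
    cell_types = sorted(calls)
    genes = sorted({g for row in calls.values() for g in row})
    # Pairs of cell types not yet distinguished by the chosen genes.
    unresolved = set(combinations(cell_types, 2))
    panel = []
    while unresolved:
        def gain(gene):
            # Number of unresolved pairs this gene would split.
            return sum(calls[a][gene] != calls[b][gene] for a, b in unresolved)
        best = max((g for g in genes if g not in panel), key=gain, default=None)
        if best is None or gain(best) == 0:
            break  # the remaining pairs cannot be told apart with these genes
        panel.append(best)
        unresolved = {(a, b) for a, b in unresolved
                      if calls[a][best] == calls[b][best]}
    return panel

# Hypothetical toy calls for the three clusters named on the slide.
calls = {
    "Bone":        {"g1": True,  "g2": False, "g3": True},
    "Myeloid":     {"g1": True,  "g2": True,  "g3": False},
    "Endothelial": {"g1": False, "g2": True,  "g3": False},
}
panel = greedy_marker_panel(calls)
```

In this toy case no single gene separates all three clusters, so the panel needs two genes.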

Page 3: Biomarker and Classifier Selection in Diverse Genetic Datasets

Motivation 2: Clinical Diagnostics

• Validation Study of Existing Gene Expression Signatures for Anti-TNF Treatment in Patients with Rheumatoid Arthritis, PLoS One 2012

Study         # genes   Sensitivity (%)   Specificity (%)
Lequerre      20        71                61
Stuhlmuller   11        79                56
Stuhlmuller   82        67                56
Lequerre      8         71                28
Sekiguchi     18        71                28
Julia         8         92                17
Stuhlmuller   3         71                17
Tanio         8         67                33
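For readers less familiar with the two columns above: sensitivity is the fraction of true responders that a signature flags as responders, and specificity is the fraction of true non-responders it flags as non-responders. A minimal sketch with made-up labels (the label strings and the toy predictions are assumptions, not data from the cited study):

```python
def sensitivity_specificity(y_true, y_pred, positive="responder"):
    """Return (sensitivity, specificity) for binary labels.
    Assumes both classes occur in y_true; otherwise a division by zero occurs."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical example: 4 responders and 4 non-responders.
y_true = ["responder"] * 4 + ["non-responder"] * 4
y_pred = ["responder", "responder", "responder", "non-responder",
          "non-responder", "non-responder", "responder", "responder"]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

Here 3 of 4 responders and 2 of 4 non-responders are called correctly, which is why a signature can look strong on one column and weak on the other.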

Page 4: Biomarker and Classifier Selection in Diverse Genetic Datasets

Multi-class Classification Problem

Multi-class Classification
• There are 2 or more classes
• Supervised learning

Key Problems:
1. Feature Selection: What are the most predictive biomarkers?
2. Classification: What is the best classification algorithm?
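Later slides compare models by AUC, which is defined for two classes. With more than two classes, one common convention is to compute a one-vs-rest AUC per class and average them; the slides do not say which convention was used, so the sketch below is an assumption. The class labels and scores are made up.

```python
def binary_auc(scores, is_positive):
    """Probability that a random positive outscores a random negative (ties count 0.5).
    Assumes at least one positive and one negative sample."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_ovr_auc(y_true, class_scores):
    """class_scores: dict mapping class -> list of per-sample scores for that class.
    Returns the unweighted mean of the one-vs-rest AUCs."""
    aucs = [binary_auc(class_scores[c], [y == c for y in y_true])
            for c in sorted(class_scores)]
    return sum(aucs) / len(aucs)

# Hypothetical scores for three samples, one per class.
y_true = ["B", "C", "D"]
class_scores = {
    "B": [0.90, 0.10, 0.20],
    "C": [0.05, 0.80, 0.30],
    "D": [0.05, 0.10, 0.50],
}
auc = macro_ovr_auc(y_true, class_scores)
```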

Page 5: Biomarker and Classifier Selection in Diverse Genetic Datasets

Challenges

• Different types of data
  • Gene expression
  • Epigenetic data
    • Methylation
    • Histone modification
  • Proteomics
  • Metabolomics
  • Phenotypes

• Different Platforms
  • Microarray
  • Sequencing
  • In-situ hybridization

• Different Resolutions
  • Discrete vs Continuous
  • Sparse vs Complete

Page 6: Biomarker and Classifier Selection in Diverse Genetic Datasets

Minimal Unique Marker Panel Selection (Mumps) Pipeline

• Input: # of biomarkers
• Feature Selection and Classification: parameterize each combination of feature selection and classification algorithms
• Nested Cross Validation
  • Inner Cross-validation
  • Rank Models by AUC
  • Outer Cross-validation
• Output: the best features and classifier
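The nested cross-validation on this slide can be sketched as two loops: the inner loop ranks candidate models using training samples only, and the outer loop scores the winner on samples it never saw. The code below is a minimal stdlib sketch under stated assumptions, not the Mumps implementation: each "model" stands in for one feature-selection plus classifier combination, the two toy candidates and the data are made up, and accuracy is used as the score to keep the example short, whereas the slide ranks models by AUC.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, k, seed=0):
    """Split positions 0..len(labels)-1 into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for j, i in enumerate(members):
            folds[j % k].append(i)
    return folds

def accuracy(truth, predicted):
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)

def cv_score(fit, X, y, idx, k, score):
    """Mean score of one candidate over k stratified folds of the samples in idx."""
    total = 0.0
    for fold in stratified_folds([y[i] for i in idx], k):
        test = [idx[j] for j in fold]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        total += score([y[i] for i in test], [predict(X[i]) for i in test])
    return total / k

def nested_cv(models, X, y, k_outer=3, k_inner=3, score=accuracy):
    """Return one (best model name, outer-fold score) pair per outer fold."""
    results = []
    for fold in stratified_folds(y, k_outer):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        # Inner loop: choose the best candidate without touching the outer test fold.
        best = max(models, key=lambda name: cv_score(models[name], X, y, train, k_inner, score))
        predict = models[best]([X[i] for i in train], [y[i] for i in train])
        results.append((best, score([y[i] for i in fold], [predict(X[i]) for i in fold])))
    return results

# Two hypothetical candidates; each fit function returns a predict function.
def fit_majority(X, y):
    label = Counter(y).most_common(1)[0][0]
    return lambda x: label

def fit_nearest_centroid(X, y):
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    centroids = {c: [sum(col) / len(rows) for col in zip(*rows)] for c, rows in groups.items()}
    return lambda x: min(centroids, key=lambda c: sum((a - b) ** 2 for a, b in zip(centroids[c], x)))

# Made-up, well-separated data: 3 classes, 6 samples each, 2 features.
centers = {"B": (0.0, 0.0), "C": (5.0, 5.0), "D": (10.0, 0.0)}
offsets = (-0.3, -0.2, -0.1, 0.1, 0.2, 0.3)
X = [[cx + d, cy - d] for cx, cy in centers.values() for d in offsets]
y = [label for label in centers for _ in offsets]
results = nested_cv({"majority": fit_majority, "nearest_centroid": fit_nearest_centroid}, X, y)
```

Keeping model selection inside the inner loop is what lets the outer score serve as an estimate for the whole selection procedure rather than for one hand-picked model.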

Page 7: Biomarker and Classifier Selection in Diverse Genetic Datasets

Algorithms

Feature Selection
• Support Vector Machine (SVM) recursive feature elimination (RFE)
• ANOVA F-value
• Random Forests
• Extra Trees

Classification
• Correlation
• Cosine
• K-Nearest Neighbors (KNN)
• Support Vector Machine (SVM)
• Decision Tree
• Random Forests
• Extra Trees
• Gradient Boosting
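The slide lists "Correlation" and "Cosine" as classifiers without defining them. One plausible reading, assumed here, is a nearest-centroid rule: average the training profiles of each class, then assign a new sample to the class whose centroid is most similar under Pearson correlation or cosine similarity. The sketch below illustrates that reading with made-up expression values; it is not confirmed to be the authors' definition.

```python
import math

def cosine(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pearson(u, v):
    """Pearson correlation is cosine similarity of the mean-centered vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mu for a in u], [b - mv for b in v])

def fit_centroids(X, y):
    """Average the training profiles of each class, feature by feature."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}

def predict(centroids, x, similarity=cosine):
    """Assign x to the class whose centroid is most similar."""
    return max(centroids, key=lambda label: similarity(centroids[label], x))

# Hypothetical profiles over three marker genes, two samples per class.
X = [[9, 1, 1], [8, 2, 1], [1, 9, 2], [2, 8, 1], [1, 2, 9], [2, 1, 8]]
y = ["B", "B", "C", "C", "D", "D"]
centroids = fit_centroids(X, y)
by_cosine = predict(centroids, [7, 1, 2], similarity=cosine)
by_pearson = predict(centroids, [7, 1, 2], similarity=pearson)
```

Both similarities ignore overall signal magnitude, which may be part of why such simple rules can hold up when training and test data come from different platforms.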

Page 8: Biomarker and Classifier Selection in Diverse Genetic Datasets

Datasets

• From Broad Institute
  • Affymetrix Gene expression microarray
  • 15 hematopoietic cell types
  • 82 samples
  • 4-7 samples per cell type

• Multiple Sources
  • 70 samples
  • Approximately 3-7 samples per cell type
  • Affymetrix & Illumina Bead Array
  • Different labs

Page 9: Biomarker and Classifier Selection in Diverse Genetic Datasets

Experiments

• Complete
  • Complete gene expression profile from microarray datasets.

• Simulated Sparse
  • 70% and 50% missing data
  • Coverage of a marker followed a Beta distribution.
    • Coverage: the fraction of cell types having known expression statuses for a marker.
  • Fifteen simulations

• Cross-validation
  • 3-fold, stratified
  • # features: 2, 8, 16, 32, 64, 96, 128, 256, and 384
  • Best set of features and classifier for each # features

• External validation
  • Use Broad data as training
  • Test against external datasets
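The simulated-sparse setting above can be sketched as follows: draw a coverage value for each marker from a Beta distribution, then keep that fraction of cell types as "known" and blank out the rest. The slides do not give the Beta parameters, so the sketch assumes a Beta whose mean equals one minus the target missing fraction, with a hypothetical `concentration` parameter controlling the spread; the matrix size is also made up.

```python
import random

def simulate_sparse(matrix, missing_fraction, concentration=4.0, seed=0):
    """matrix: list of rows (cell types) by columns (markers).
    Returns a copy in which unknown entries are replaced by None.
    missing_fraction must lie strictly between 0 and 1."""
    rng = random.Random(seed)
    n_types, n_markers = len(matrix), len(matrix[0])
    # Beta(alpha, beta) has mean alpha / (alpha + beta) = 1 - missing_fraction.
    alpha = concentration * (1.0 - missing_fraction)
    beta = concentration * missing_fraction
    sparse = [list(row) for row in matrix]
    for j in range(n_markers):
        coverage = rng.betavariate(alpha, beta)   # fraction of cell types with a known status
        n_known = round(coverage * n_types)
        known = set(rng.sample(range(n_types), n_known))
        for i in range(n_types):
            if i not in known:
                sparse[i][j] = None
    return sparse

# Hypothetical dense matrix: 15 cell types by 2000 markers, all entries known.
dense = [[1] * 2000 for _ in range(15)]
sparse = simulate_sparse(dense, missing_fraction=0.7)
observed_missing = sum(v is None for row in sparse for v in row) / (15 * 2000)
```

Because coverage varies by marker, some markers stay almost fully annotated while others are nearly empty, which differs from removing entries uniformly at random.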

Page 10: Biomarker and Classifier Selection in Diverse Genetic Datasets

Performance: Complete Data

[Figure: AUC (y-axis, 0.45 to 0.95) versus Number of Markers (x-axis: 2, 8, 16, 32, 64, 96, 128, 256, 384). Series: Continuous CV, Discrete CV, Continuous EV, Discrete EV.]

Page 11: Biomarker and Classifier Selection in Diverse Genetic Datasets

By Algorithm: Complete Data

[Figure: AUC (y-axis, 0.45 to 0.95) versus Number of Markers (x-axis: 2, 8, 16, 32, 64, 96, 128, 256, 384). Series: RFE - Extra Trees, RFE - Random Forest, RFE - Correlation, RFE - Cosine, RFE - Decision Tree, RFE - Gradient Boosting, RFE - KNN, RFE - SVM.]

Page 12: Biomarker and Classifier Selection in Diverse Genetic Datasets

Performance: 70% Missing

[Figure: AUC (y-axis, 0.45 to 0.95) versus Number of Markers (x-axis: 2, 8, 16, 32, 64, 96, 128, 256, 384). Series: Discrete CV, Continuous CV, Discrete EV, Continuous EV.]

Page 13: Biomarker and Classifier Selection in Diverse Genetic Datasets

Summary: Best Algorithms

               Complete             70% missing
# of markers   FS    CL             FS    CL
2              RFE   KNN            RFE   Extra Trees
8              RFE   Cosine         RFE   Cosine
16             RFE   Cosine         RFE   Cosine
32             RFE   Cosine         RFE   Cosine
64             RFE   Cosine         RFE   Cosine
96             RFE   Cosine         RFE   Correlation
128            RFE   Cosine         RFE   Correlation
256            RFE   Cosine         RFE   Correlation
384            RFE   Correlation    RFE   Correlation

Page 14: Biomarker and Classifier Selection in Diverse Genetic Datasets

Why the Big Gap?

• Cross-platform normalization
• Similarities in cell-types
• Over-fitting

[Figure: Correlation: Broad vs External]
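To make the first candidate explanation concrete: two platforms can report the same biology on different scales, so a model trained on one sees shifted inputs on the other. One simple mitigation, shown here only as an illustration and not as what the authors did, is to standardize each gene within each dataset before combining them. The values below are made up, with the second platform being a rescaled copy of the first.

```python
import statistics

def zscore_per_gene(samples):
    """samples: list of expression profiles (one list of values per sample)
    from a single platform or dataset. Standardizes each gene (column)
    to mean 0 and standard deviation 1 within that dataset."""
    columns = list(zip(*samples))
    means = [statistics.mean(c) for c in columns]
    # Guard against constant genes, whose standard deviation is zero.
    sds = [statistics.pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in samples]

# Hypothetical data: platform B reports the same pattern as A on another scale.
platform_a = [[1, 10], [3, 30]]
platform_b = [[105, 1005], [305, 3005]]
norm_a = zscore_per_gene(platform_a)
norm_b = zscore_per_gene(platform_b)
```

After standardization the two toy datasets coincide. Real cross-platform differences are rarely a pure rescaling, and per-dataset standardization also assumes a similar mix of cell types in each dataset.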

Page 15: Biomarker and Classifier Selection in Diverse Genetic Datasets

Motivation / Results

Mesoderm Cell-type Identification

# genes   AUC
8         73 %
16        74 %
32        76 %
64        78 %
96        87 %
128       91 %
256       91 %
384       92 %

Anti-TNF Responsiveness

Study         # genes   Sensitivity (%)   Specificity (%)
Lequerre      20        71                61
Stuhlmuller   11        79                56
Stuhlmuller   82        67                56
Lequerre      8         71                28
Sekiguchi     18        71                28
Julia         8         92                17
Stuhlmuller   3         71                17
Tanio         8         67                33
UCONN         8         83                83
UCONN         2048      94                96

Page 16: Biomarker and Classifier Selection in Diverse Genetic Datasets

Future Work

• Broader Data-types
  • NCI-60
    • microarray mRNA
    • microarray microRNA
    • copy number variation
    • protein array
    • SNPs
    • …

• Minimizing over-fitting

• Cross-platform
  • Normalization

• Different Data types
  • Integrate multiple data types simultaneously

Page 17: Biomarker and Classifier Selection in Diverse Genetic Datasets

Conclusion and Thanks

• Thanks to:
  • Ed Hemphill
  • Chih Lee
  • Ion Mandoiu
  • Craig Nelson

Smpl Bio
A commercial service coming in late 2013

Page 18: Biomarker and Classifier Selection in Diverse Genetic Datasets

DON’T GO BEYOND, TIS A SILLY PLACE

Extra Slides

Page 19: Biomarker and Classifier Selection in Diverse Genetic Datasets

Experiment Overview

Broad Data
• Input: # of biomarkers
• Feature Selection and Classification: parameterize each combination of feature selection and classification algorithms
• Nested Cross Validation
  • Inner Cross-validation
  • Rank Models by AUC
  • Outer Cross-validation
• Output the best features and classifier

External Testing
• Test Best Model
• Output: AUC of best features / classifier

Page 20: Biomarker and Classifier Selection in Diverse Genetic Datasets

Performance: 50% Missing

[Figure: AUC (y-axis, 0.45 to 0.95) versus Number of Markers (x-axis tick labels 1 to 9 in the source). Series: Continuous CV, Continuous EV.]