Biomarker and Classifier Selection in Diverse Genetic Datasets

James Lindsay (1), Ed Hemphill (2), Chih Lee (1), Ion Mandoiu (1), Craig Nelson (2)

University of Connecticut
(1) Department of Computer Science and Engineering
(2) Department of Molecular and Cell Biology

Motivation 1: Cell-type Identification

• The Question: What is the smallest # of genes needed to identify each cluster?
  • B: Bone
  • C: Myeloid
  • D: Endothelial

• Available Data: Literature-annotated present/absent calls for 600 genes across 50 cell types in the mesoderm lineage.

In collaboration with: Dr. Hector Leonardo Aguila, UCHC

Motivation 2: Clinical Diagnostics

• Validation Study of Existing Gene Expression Signatures for Anti-TNF Treatment in Patients with Rheumatoid Arthritis, PLoS One 2012

Study        # genes  Sensitivity (%)  Specificity (%)
Lequerre     20       71               61
Stuhlmuller  11       79               56
Stuhlmuller  82       67               56
Lequerre     8        71               28
Sekiguchi    18       71               28
Julia        8        92               17
Stuhlmuller  3        71               17
Tanio        8        67               33

Multi-class Classification Problem

Multi-class Classification
• There are 2 or more classes
• Supervised learning

Key Problems:
1. Feature Selection: What are the most predictive biomarkers?
2. Classification: What is the best classification algorithm?

Challenges
• Different types of data
  • Gene expression
  • Epigenetic data
    • Methylation
    • Histone modification
  • Proteomics
  • Metabolomics
  • Phenotypes
• Different Platforms
  • Microarray
  • Sequencing
  • In-situ hybridization
• Different Resolutions
  • Discrete vs Continuous
  • Sparse vs Complete

Minimal Unique Marker Panel Selection (Mumps) Pipeline

Input: # of biomarkers

Parameterize each combination of feature selection and classification algorithms

Nested Cross Validation:
• Feature Selection
• Classification
• Inner Cross-validation
• Rank Models by AUC
• Outer Cross-validation

Output: the best features and classifier
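The nested cross-validation idea on this slide can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the Mumps implementation: the feature selector (spread of class means), the nearest-centroid classifier, the two distance metrics, the use of accuracy in place of AUC, and the toy data are all stand-ins chosen to keep the example short. The point it shows is structural: the inner loop picks a model using only the outer-training samples, and the outer loop scores that choice on samples the selection never saw.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    return folds

def class_means(X, y, rows, cols):
    """Mean profile per class over the chosen rows and feature columns."""
    sums, counts = {}, defaultdict(int)
    for i in rows:
        acc = sums.setdefault(y[i], [0.0] * len(cols))
        for j, c in enumerate(cols):
            acc[j] += X[i][c]
        counts[y[i]] += 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

def select_features(X, y, rows, n):
    """Toy selector (assumption): keep the n features with the widest class-mean spread."""
    all_cols = list(range(len(X[0])))
    means = class_means(X, y, rows, all_cols)
    spread = [max(m[c] for m in means.values()) - min(m[c] for m in means.values())
              for c in all_cols]
    return sorted(all_cols, key=lambda c: -spread[c])[:n]

def accuracy(X, y, train, test, n, metric):
    """Select features and fit centroids on train only, then score on test."""
    cols = select_features(X, y, train, n)
    means = class_means(X, y, train, cols)
    def dist(a, b):
        if metric == "manhattan":
            return sum(abs(p - q) for p, q in zip(a, b))
        return sum((p - q) ** 2 for p, q in zip(a, b))
    hits = 0
    for i in test:
        row = [X[i][c] for c in cols]
        pred = min(means, key=lambda lab: dist(row, means[lab]))
        hits += pred == y[i]
    return hits / len(test)

def nested_cv(X, y, n_features, metrics=("euclidean", "manhattan"), k=3):
    """Outer loop estimates performance; inner loop chooses the model."""
    outer = stratified_folds(y, k, seed=1)
    outer_scores = []
    for f, test in enumerate(outer):
        train = [i for g, fold in enumerate(outer) if g != f for i in fold]
        # Inner folds are built from the outer-training samples only.
        sub_y = [y[i] for i in train]
        inner = [[train[p] for p in fold] for fold in stratified_folds(sub_y, k, seed=2)]
        def inner_score(metric):
            total = 0.0
            for h, val in enumerate(inner):
                fit = [i for g, fold in enumerate(inner) if g != h for i in fold]
                total += accuracy(X, y, fit, val, n_features, metric)
            return total / k
        best = max(metrics, key=inner_score)
        outer_scores.append(accuracy(X, y, train, test, n_features, best))
    return sum(outer_scores) / len(outer_scores)

# Toy data (made up): 3 classes x 6 samples, 2 informative features + 3 noise features.
rng = random.Random(0)
y = [c for c in ("B", "C", "D") for _ in range(6)]
X = [[10 * "BCD".index(lab) + rng.gauss(0, 0.1),
      -10 * "BCD".index(lab) + rng.gauss(0, 0.1),
      rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)] for lab in y]
score = nested_cv(X, y, n_features=2)
```

Feature selection is repeated inside every fold on purpose: selecting markers on the full dataset before splitting would leak test information into the estimate.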

Algorithms

Feature Selection
• Support Vector Machine (SVM) recursive feature elimination (RFE)
• ANOVA F-value
• Random Forests
• Extra Trees
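As an illustration of the simplest selector in this list, the ANOVA F-value scores each gene by the ratio of between-class to within-class variance and keeps the top-scoring genes. The sketch below is a plain-Python version under stated assumptions: the function names `anova_f` and `top_features` and the tiny data matrix are made up for the example, and at least two classes are assumed.

```python
from collections import defaultdict

def anova_f(values, labels):
    """One-way ANOVA F statistic for one feature (assumes >= 2 classes)."""
    groups = defaultdict(list)
    for v, lab in zip(values, labels):
        groups[lab].append(v)
    n, k = len(values), len(groups)
    grand = sum(values) / n
    # Between-class sum of squares: how far class means sit from the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values())
    # Within-class sum of squares: scatter of samples around their own class mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def top_features(X, y, n):
    """Indices of the n features with the largest F statistic."""
    scores = [anova_f([row[c] for row in X], y) for c in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda c: -scores[c])[:n]

# Made-up matrix: rows are samples, columns are genes.
X = [[1, 5, 0], [2, 5, 1], [3, 6, 0], [4, 5, 1], [5, 6, 0], [6, 5, 1]]
y = ["A", "A", "A", "B", "B", "B"]
best_two = top_features(X, y, 2)
```

Unlike RFE, this is a univariate filter: each gene is scored on its own, so redundant genes can all rank highly.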

Classification
• Correlation
• Cosine
• K-Nearest Neighbors (KNN)
• Support Vector Machine (SVM)
• Decision Tree
• Random Forests
• Extra Trees
• Gradient Boosting
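The slides do not spell out how the Correlation and Cosine classifiers work. One plausible reading, sketched here as an assumption, is a nearest-profile rule: a sample is assigned to the cell type whose reference marker profile it resembles most, measured by Pearson correlation or cosine similarity. The reference profiles and the sample below are invented for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb) if na and nb else 0.0

def pearson(a, b):
    """Pearson correlation, computed as cosine similarity of mean-centered vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return cosine([p - ma for p in a], [q - mb for q in b])

def classify(sample, references, similarity):
    """Return the label of the reference profile most similar to the sample."""
    return max(references, key=lambda lab: similarity(sample, references[lab]))

# Hypothetical present/absent reference profiles over four markers.
references = {
    "Bone": [1, 0, 1, 0],
    "Myeloid": [0, 1, 0, 1],
    "Endothelial": [1, 1, 0, 0],
}
sample = [0.9, 0.1, 0.8, 0.2]
by_cosine = classify(sample, references, cosine)
by_pearson = classify(sample, references, pearson)
```

A rule like this needs no training beyond the reference profiles, which may be why it copes well with discrete present/absent data.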

Datasets

• From Broad Institute
  • Affymetrix gene expression microarray
  • 15 hematopoietic cell types
  • 82 samples
  • 4-7 samples per cell type

• Multiple Sources
  • 70 samples
  • Approximately 3-7 samples per cell type
  • Affymetrix & Illumina BeadArray
  • Different labs

Experiments

• Complete
  • Complete gene expression profile from microarray datasets.

• Simulated Sparse
  • 70% and 50% missing data
  • Coverage of a marker followed a Beta distribution.
    • Coverage: the fraction of cell types having known expression statuses for a marker.
  • Fifteen simulations

• Cross-validation
  • 3-fold, stratified
  • # features: 2, 8, 16, 32, 64, 96, 128, 256, and 384
  • Best set of features and classifier for each # features

• External validation
  • Use Broad data as training
  • Test against external datasets
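The simulated-sparse setup can be sketched as follows. The slides state only that per-marker coverage followed a Beta distribution; the Beta parameters here are an assumption, chosen so that the mean coverage equals 1 minus the target missing fraction, and `concentration` is a hypothetical knob for how tightly coverage clusters around that mean.

```python
import random

def simulate_sparse(matrix, missing=0.7, concentration=10.0, seed=0):
    """Mask entries so each marker's coverage is drawn from a Beta distribution.

    matrix[cell_type][marker]; masked entries become None. The input is not modified.
    """
    rng = random.Random(seed)
    n_types, n_markers = len(matrix), len(matrix[0])
    alpha = (1.0 - missing) * concentration  # mean coverage = 1 - missing
    beta = missing * concentration
    sparse = [list(row) for row in matrix]
    for m in range(n_markers):
        coverage = rng.betavariate(alpha, beta)
        n_known = round(coverage * n_types)
        known = set(rng.sample(range(n_types), n_known))
        for t in range(n_types):
            if t not in known:
                sparse[t][m] = None
    return sparse

# Toy matrix: 15 cell types x 200 markers, all "present".
toy = [[1] * 200 for _ in range(15)]
sparse = simulate_sparse(toy, missing=0.7)
observed_missing = sum(v is None for row in sparse for v in row) / (15 * 200)
```

Running this with several seeds would mirror the repeated simulations on the slide; some markers end up known in most cell types and others in almost none.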

Performance: Complete Data

[Figure: AUC (y-axis, 0.45 to 0.95) vs. Number of Markers (x-axis: 2, 8, 16, 32, 64, 96, 128, 256, 384). Series: Continuous CV, Discrete CV, Continuous EV, Discrete EV (CV: cross-validation; EV: external validation).]
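The slides report AUC for a multi-class problem without saying how it was aggregated. A common convention, shown here as an assumption rather than the authors' method, is to compute a one-vs-rest AUC per class and average them. The scores and labels below are invented.

```python
def binary_auc(scores, positives):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, p in zip(scores, positives) if p]
    neg = [s for s, p in zip(scores, positives) if not p]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def macro_ovr_auc(score_rows, labels):
    """Unweighted mean of one-vs-rest AUCs; score_rows[i][c] scores sample i for class c."""
    aucs = []
    for c in sorted(set(labels)):
        scores = [row[c] for row in score_rows]
        aucs.append(binary_auc(scores, [lab == c for lab in labels]))
    return sum(aucs) / len(aucs)

# Made-up classifier scores for four samples over three cell-type classes.
labels = ["B", "B", "C", "D"]
score_rows = [
    {"B": 0.8, "C": 0.1, "D": 0.1},
    {"B": 0.6, "C": 0.3, "D": 0.1},
    {"B": 0.2, "C": 0.7, "D": 0.1},
    {"B": 0.3, "C": 0.2, "D": 0.5},
]
macro_auc = macro_ovr_auc(score_rows, labels)
```

The pairwise loop is quadratic in the number of samples, which is fine at the sample sizes on these slides (tens of samples).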

By Algorithm: Complete Data

[Figure: AUC (y-axis, 0.45 to 0.95) vs. Number of Markers (x-axis: 2, 8, 16, 32, 64, 96, 128, 256, 384). Series: RFE - Extra Trees, RFE - Random Forest, RFE - Correlation, RFE - Cosine, RFE - Decision Tree, RFE - Gradient Boosting, RFE - KNN, RFE - SVM.]

Performance: 70% Missing

[Figure: AUC (y-axis, 0.45 to 0.95) vs. Number of Markers (x-axis: 2, 8, 16, 32, 64, 96, 128, 256, 384). Series: Discrete CV, Continuous CV, Discrete EV, Continuous EV.]

Summary: Best Algorithms

               Complete             70% missing
# of markers   FS    CL             FS    CL
2              RFE   KNN            RFE   Extra Trees
8              RFE   Cosine         RFE   Cosine
96             RFE   Cosine         RFE   Correlation
384            RFE   Correlation    RFE   Correlation

(FS: feature selection; CL: classifier)

Why the Big Gap?
• Cross-platform normalization
• Similarities in cell-types
• Over-fitting

Correlation: Broad vs External

Motivation Results

Mesoderm Cell-type Identification

# genes  AUC
8        73%
16       74%
32       76%
64       78%
96       87%
128      91%
256      91%
384      92%

Anti-TNF Responsiveness

Study        # genes  Sensitivity (%)  Specificity (%)
Lequerre     20       71               61
Stuhlmuller  11       79               56
Stuhlmuller  82       67               56
Lequerre     8        71               28
Sekiguchi    18       71               28
Julia        8        92               17
Stuhlmuller  3        71               17
Tanio        8        67               33
UCONN        8        83               83
UCONN        2048     94               96
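For readers comparing the rows above, sensitivity is the fraction of true responders the signature calls responders, and specificity is the fraction of true non-responders it calls non-responders. A minimal sketch of the calculation follows; the labels are toy values, not data from any of the listed studies.

```python
def sensitivity_specificity(truth, predicted):
    """Return (sensitivity, specificity) for binary labels (1 = responder, 0 = non-responder)."""
    tp = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(truth, predicted) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Toy example: 4 responders and 4 non-responders.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(truth, predicted)
```

A signature with high sensitivity but low specificity, like several rows in the table, flags most responders but also mislabels many non-responders as responders.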

Future Work

• Broader Data-types
  • NCI-60
    • microarray mRNA
    • microarray microRNA
    • copy number variation
    • protein array
    • SNPs
    • …

• Minimizing overfitting
  • Cross-platform normalization

• Different Data types
  • Integrate multiple data types simultaneously

Conclusion and Thanks

• Thanks to:
  • Ed Hemphill
  • Chih Lee
  • Ion Mandoiu
  • Craig Nelson

Smpl Bio: A commercial service coming in late 2013

DON’T GO BEYOND, TIS A SILLY PLACE

Extra Slides

Experiment Overview

Input: # of biomarkers

Parameterize each combination of feature selection and classification algorithms

Broad Data: Nested Cross Validation
• Feature Selection
• Classification
• Inner Cross-validation
• Rank Models by AUC
• Outer Cross-validation

Output the best features and classifier

External Testing
• Test Best Model

Output: AUC of best features / classifier

Performance: 50% Missing

[Figure: AUC (y-axis, 0.45 to 0.95) vs. Number of Markers (x-axis ticks 1 to 9). Series: Continuous CV, Continuous EV.]
