Biomarker and Classifier Selection in Diverse Genetic Datasets
JAMES LINDSAY (1)
ED HEMPHILL (2)
CHIH LEE (1)
ION MANDOIU (1)
CRAIG NELSON (2)
UNIVERSITY OF CONNECTICUT
(1) DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
(2) DEPARTMENT OF MOLECULAR AND CELL BIOLOGY
Motivation 1: Cell-type Identification
• The Question: What is the smallest # of genes needed to identify each cluster?
  • B: Bone
  • C: Myeloid
  • D: Endothelial
• Available Data: Literature-annotated present/absent calls for 600 genes across 50 cell types in the mesoderm lineage.
In collaboration with: Dr. Hector Leonardo Aguila, UCHC
Motivation 2: Clinical Diagnostics
• Validation Study of Existing Gene Expression Signatures for Anti-TNF Treatment in Patients with Rheumatoid Arthritis, PLoS One 2012
Study         # genes   Sensitivity (%)   Specificity (%)
Lequerre      20        71                61
Stuhlmuller   11        79                56
Stuhlmuller   82        67                56
Lequerre      8         71                28
Sekiguchi     18        71                28
Julia         8         92                17
Stuhlmuller   3         71                17
Tanio         8         67                33
Multi-class Classification Problem
Multi-class Classification
• There are 2 or more classes
• Supervised learning
Key Problems:
1. Feature Selection: What are the most predictive biomarkers?
2. Classification: What is the best classification algorithm?
Challenges
• Different types of data
  • Gene expression
  • Epigenetic data
    • Methylation
    • Histone modification
  • Proteomics
  • Metabolomics
  • Phenotypes
• Different Platforms
  • Microarray
  • Sequencing
  • In-situ hybridization
• Different Resolutions
  • Discrete vs Continuous
  • Sparse vs Complete
Minimal Unique Marker Panel Selection (Mumps) Pipeline
Feature Selection
Classification
Parameterize each combination of feature selection and classification algorithms
Inner Cross-validation
Rank Models by AUC
Outer Cross-validation
Output: the best features and classifier
Input: # of biomarkers
Nested Cross Validation
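The nested cross-validation idea in the pipeline can be sketched as follows. This is a minimal, dependency-free illustration and not the Mumps implementation: the helper names (stratified_folds, cv_score, nested_cv) and the two toy candidates are hypothetical, and plain accuracy stands in for AUC to keep the example short. In the pipeline, each candidate would be one parameterized combination of a feature selection method and a classifier.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, k, seed=0):
    """Split sample positions into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for pos, label in enumerate(labels):
        by_class[label].append(pos)
    folds = [[] for _ in range(k)]
    for positions in by_class.values():
        rng.shuffle(positions)
        for j, pos in enumerate(positions):
            folds[j % k].append(pos)
    return folds

def accuracy(y_true, y_pred):
    """Fraction of correct predictions (a stand-in for AUC in this sketch)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cv_score(X, y, fit, k, seed):
    """Mean held-out accuracy of one candidate over k stratified folds."""
    scores = []
    for fold in stratified_folds(y, k, seed):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(accuracy([y[i] for i in fold],
                               [predict(X[i]) for i in fold]))
    return sum(scores) / len(scores)

def nested_cv(X, y, candidates, k_outer=3, k_inner=3):
    """Return (candidate chosen per outer fold, mean outer-fold accuracy)."""
    chosen, outer_scores = [], []
    for fold in stratified_folds(y, k_outer, seed=1):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        # Inner loop: rank candidates using the outer-training samples only.
        inner = {name: cv_score(X_tr, y_tr, fit, k_inner, seed=2)
                 for name, fit in candidates.items()}
        best = max(inner, key=inner.get)
        # Outer loop: refit the winner and score it on the untouched fold.
        predict = candidates[best](X_tr, y_tr)
        outer_scores.append(accuracy([y[i] for i in fold],
                                     [predict(X[i]) for i in fold]))
        chosen.append(best)
    return chosen, sum(outer_scores) / len(outer_scores)

# Two toy candidates; each fit function returns a predict function.
def fit_majority(X, y):
    label = Counter(y).most_common(1)[0][0]
    return lambda x: label

def fit_1nn(X, y):
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in X]
        return y[dists.index(min(dists))]
    return predict

# Toy data: one marker, two well-separated cell types.
X = [[v] for v in (0, 1, 2, 3, 4, 5, 20, 21, 22, 23, 24, 25)]
y = ["Bone"] * 6 + ["Myeloid"] * 6
chosen, score = nested_cv(X, y, {"majority": fit_majority, "1-NN": fit_1nn})
print(chosen, score)
```

The point of the nesting is that the outer folds never influence which candidate is selected, so the outer score is a less optimistic estimate than the inner ranking score.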
Algorithms
Feature Selection
• Support Vector Machine (SVM) recursive feature elimination (RFE)
• ANOVA F-value
• Random Forests
• Extra Trees
Classification
• Correlation
• Cosine
• K-Nearest Neighbors (KNN)
• Support Vector Machine (SVM)
• Decision Tree
• Random Forests
• Extra Trees
• Gradient Boosting
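The slides do not say how the Correlation and Cosine classifiers are defined. One plausible reading, shown here purely as an assumption, is a nearest-centroid rule: a sample is assigned to the cell type whose mean marker profile is most similar to it. The function names and the toy profiles below are hypothetical.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def pearson(u, v):
    """Pearson correlation, computed as the cosine of mean-centered vectors."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])

def class_centroids(X, y):
    """Mean marker profile for each class label."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}

def classify(x, centroids, similarity=cosine):
    """Assign x to the class whose centroid is most similar to it."""
    return max(centroids, key=lambda label: similarity(x, centroids[label]))

# Toy expression profiles over three markers (values are made up).
X = [[9, 1, 0], [8, 2, 1], [1, 9, 8], [0, 8, 9]]
y = ["Bone", "Bone", "Myeloid", "Myeloid"]
centroids = class_centroids(X, y)
print(classify([7, 1, 1], centroids))
print(classify([1, 7, 6], centroids, similarity=pearson))
```

Both similarities ignore overall signal magnitude, and correlation additionally ignores a constant offset, which may be part of why such simple rules can hold up across platforms.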
Datasets
• From Broad Institute
  • Affymetrix gene expression microarray
  • 15 hematopoietic cell types
  • 82 samples
  • 4-7 samples per cell type
• Multiple Sources
  • 70 samples
  • Approximately 3-7 samples per cell type
  • Affymetrix & Illumina BeadArray
  • Different labs
Experiments
• Complete
  • Complete gene expression profile from microarray datasets.
• Simulated Sparse
  • 70% and 50% missing data
  • Coverage of a marker followed a Beta distribution.
    • Coverage: the fraction of cell types having known expression statuses for a marker.
  • Fifteen simulations
• Cross-validation
  • 3-fold, stratified
  • # features: 2, 8, 16, 32, 64, 96, 128, 256, and 384
  • Best set of features and classifier for each # features
• External validation
  • Use Broad data as training
  • Test against external datasets
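The simulated sparse setting can be sketched as below: for each marker, a coverage value is drawn from a Beta distribution, and the expression status is kept for only that fraction of cell types. The shape parameters are not given in the slides; Beta(3, 7) is an assumption chosen here only because its mean coverage of 0.3 corresponds to roughly 70% missing data. The function name simulate_sparse is hypothetical.

```python
import random

def simulate_sparse(matrix, alpha, beta, seed=0):
    """Mask entries of a (cell types x markers) matrix with None.

    For each marker (column), coverage ~ Beta(alpha, beta) sets the
    fraction of cell types whose expression status stays known.
    """
    rng = random.Random(seed)
    n_types, n_markers = len(matrix), len(matrix[0])
    sparse = [list(row) for row in matrix]  # copy; the input is left untouched
    for j in range(n_markers):
        coverage = rng.betavariate(alpha, beta)
        n_known = round(coverage * n_types)
        known = set(rng.sample(range(n_types), n_known))
        for i in range(n_types):
            if i not in known:
                sparse[i][j] = None
    return sparse

# 15 cell types x 200 markers of made-up present/absent calls.
rng = random.Random(42)
complete = [[rng.randint(0, 1) for _ in range(200)] for _ in range(15)]
sparse = simulate_sparse(complete, alpha=3, beta=7)
missing = sum(v is None for row in sparse for v in row) / (15 * 200)
print(f"fraction missing: {missing:.2f}")
```

Repeating the call with different seeds would give independent simulations, in the spirit of the fifteen simulations listed above.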
Performance: Complete Data
[Figure: AUC (0.45 to 0.95) vs. Number of Markers (2, 8, 16, 32, 64, 96, 128, 256, 384). Series: Continuous CV, Discrete CV, Continuous EV, Discrete EV.]
By Algorithm: Complete Data
[Figure: AUC (0.45 to 0.95) vs. Number of Markers (2, 8, 16, 32, 64, 96, 128, 256, 384). Series: RFE - Extra Trees, RFE - Random Forest, RFE - Correlation, RFE - Cosine, RFE - Decision Tree, RFE - Gradient Boosting, RFE - KNN, RFE - SVM.]
Performance: 70% Missing
[Figure: AUC (0.45 to 0.95) vs. Number of Markers (2, 8, 16, 32, 64, 96, 128, 256, 384). Series: Discrete CV, Continuous CV, Discrete EV, Continuous EV.]
Summary: Best Algorithms
FS: feature selection; CL: classifier
               Complete              70% missing
# of markers   FS    CL              FS    CL
2              RFE   KNN             RFE   Extra Trees
8              RFE   Cosine          RFE   Cosine
16             RFE   Cosine          RFE   Cosine
32             RFE   Cosine          RFE   Cosine
64             RFE   Cosine          RFE   Cosine
96             RFE   Cosine          RFE   Correlation
128            RFE   Cosine          RFE   Correlation
256            RFE   Cosine          RFE   Correlation
384            RFE   Correlation     RFE   Correlation
Why the Big Gap?
• Cross-platform normalization
• Similarities in cell-types
• Over-fitting
Correlation: Broad vs External
Motivation Results
Mesoderm Cell-type Identification
# genes   AUC (%)
8         73
16        74
32        76
64        78
96        87
128       91
256       91
384       92
Anti-TNF Responsiveness
Study         # genes   Sensitivity (%)   Specificity (%)
Lequerre      20        71                61
Stuhlmuller   11        79                56
Stuhlmuller   82        67                56
Lequerre      8         71                28
Sekiguchi     18        71                28
Julia         8         92                17
Stuhlmuller   3         71                17
Tanio         8         67                33
UCONN         8         83                83
UCONN         2048      94                96
Future Work
• Broader Data-types
  • NCI-60
    • microarray mRNA
    • microarray microRNA
    • copy number variation
    • protein array
    • SNPs
    • …
• Minimizing over-fitting
  • Cross-platform
  • normalization
• Different Data types
  • Integrate multiple data types simultaneously
Conclusion and Thanks
• Thanks to:
  • Ed Hemphill
  • Chih Lee
  • Ion Mandoiu
  • Craig Nelson
Smpl Bio: A commercial service coming in late 2013
DON’T GO BEYOND, TIS A SILLY PLACE
Extra Slides
Experiment Overview
Parameterize each combination of feature selection and classification algorithms
Output the best features and classifier
Feature Selection
Classification
Inner Cross-validation
Rank Models by AUC
Outer Cross-validation
Input: # of biomarkers
Nested Cross Validation
Test Best Model
Output: AUC of best features / classifier
Broad Data
External Testing
Performance: 50% Missing
[Figure: AUC (0.45 to 0.95) vs. Number of Markers (x-axis ticks labeled 1 to 9). Series: Continuous CV, Continuous EV.]