Introduction to Machine Learning Sackler Yin Aphinyanaphongs MD/ PhD 12/11/2014.
Who Am I
Yin Aphinyanaphongs (yinformatics.com)
MD, PhD from Vanderbilt University in Nashville, TN.
Assistant Professor in the Center for Health Informatics and Bioinformatics.
Primary Expertise
• Machine Learning: Predictive Modeling, Text Classification
• Data Mining: Social Media, Large Medical Datasets
Secondary Expertise
• Search Engine Design/Information Retrieval
• Natural Language Processing
What I Teach
Introduction to Biomedical Informatics.
Introduction to Medicine for Computer Scientists.
Data Analytics in R for physicians.
Machine Learning Examples
Given an email, classify it as spam or not spam.
Given a handwritten digit, assign it the right number.
Given descriptions of passengers on the Titanic, predict who will or will not survive.
Given a gene expression microarray of a cancer, predict whether the cancer will or will not metastasize.
Email Spam Text Classification
http://blog.cyren.com/uploads/blog/google-docs-spam-sample.jpg
Digit Classification
http://nonbiri-tereka.hatenablog.com/entry/2014/09/18/100439
Predicting Titanic Survival
Passenger class
Name
Sex
Age
Number of siblings/ spouses aboard
Number of parents/ children aboard
Ticket number
Passenger fare
Cabin
Port of Embarkation
https://www.kaggle.com/c/titanic-gettingStarted
Molecular Signatures
A molecular signature is a computational or mathematical model that links high-dimensional molecular information to a phenotype or other response variable of interest.
Golub et al. (1999) heatmap
Machine Learning
Goal
Construct algorithms that learn from data so that a model built from training data generalizes to unseen data.
General Framework
Obtain Seq
Sample Seq (Optional)
Label Seq
Clean Seq
Encode Seq
Build a Model
Performance Evaluation (Internal)
Model Application and Validation (External)
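As a rough illustration of how the framework steps line up in code, here is a minimal Python sketch. The toy one-feature dataset, the function names, and the choice of a nearest-centroid classifier are all assumptions made for this example; they are not taken from the slides.

```python
# Toy walk-through of the framework steps; all data and names are hypothetical.

# Obtain + label: (raw feature value, label) pairs; None marks a missing value.
raw = [(1.0, "A"), (1.2, "A"), (None, "A"), (0.8, "A"),
       (3.0, "B"), (3.3, "B"), (2.9, "B"), (None, "B")]

# Clean: drop examples with missing values.
clean = [(x, y) for x, y in raw if x is not None]

# Encode: the feature is already numeric, so just wrap it in a feature vector.
encoded = [([x], y) for x, y in clean]

# Hold out part of the labeled data for internal performance evaluation.
train = [encoded[0], encoded[1], encoded[3], encoded[4]]
test = [encoded[2], encoded[5]]


def build_model(examples):
    """Nearest-centroid model: one mean feature vector per class."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model, features):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: dist(model[label]))


# Build a model, then evaluate it on the held-out examples.
model = build_model(train)
accuracy = sum(predict(model, f) == y for f, y in test) / len(test)
print(accuracy)

# Model application: label a previously unseen example.
print(predict(model, [1.1]))
```

External validation would repeat the evaluation step on data collected independently of the training set.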
Basic Framework
[Figure: labeled examples (ALL, AML) are fed to a classification algorithm; the resulting model labels unseen examples as ALL or AML]
Classification Algorithm
• Random Forests
• Regularized Logistic Regression
• Support Vector Machines, etc.
Key Concept – Supervised Learning
From the book “A Gentle Introduction to Support Vector Machines in Biomedicine” by Statnikov, Aliferis, Hardin, Guyon
Principles and geometric representation for supervised learning (1/7)
• Want to classify objects as boats and houses.
Principles and geometric representation for supervised learning (2/7)
• All objects before the coast line are boats and all objects after the coast line are houses.
• Coast line serves as a decision surface that separates two classes.
Principles and geometric representation for supervised learning (3/7)
These boats will be misclassified as houses.
This house will be misclassified as a boat.
Principles and geometric representation for supervised learning (4/7)
[Figure: axes Longitude and Latitude; legend Boat, House]
• The methods that build classification models (i.e., “classification algorithms”) operate very similarly to the previous example.
• First all objects are represented geometrically.
Principles and geometric representation for supervised learning (5/7)
[Figure: axes Longitude and Latitude; legend Boat, House]
• Then the algorithm seeks to find a decision surface that separates classes of objects.
Principles and geometric representation for supervised learning (6/7)
[Figure: axes Longitude and Latitude; unseen objects marked “?”]
These objects are classified as boats
These objects are classified as houses
Unseen (new) objects are classified as “boats” if they fall below the decision surface and as “houses” if they fall above it.
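The classification rule can be sketched in a few lines of Python. The straight-line decision surface and its coefficients below are made up for illustration; a real classification algorithm would learn the surface from labeled objects.

```python
# Hypothetical linear decision surface ("coast line"):
# latitude = SLOPE * longitude + INTERCEPT. The coefficients are assumptions.
SLOPE = 0.5
INTERCEPT = 1.0


def classify(longitude, latitude):
    """Label an object by which side of the decision surface it falls on."""
    surface_latitude = SLOPE * longitude + INTERCEPT
    return "boat" if latitude < surface_latitude else "house"


# Unseen objects given as (longitude, latitude) pairs.
unseen = [(2.0, 1.0), (2.0, 3.0), (4.0, 2.5)]
print([classify(lon, lat) for lon, lat in unseen])
```

Objects lying exactly on the surface are assigned to “house” here; that tie-breaking choice is arbitrary.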
Principles and geometric representation for supervised learning (7/7)
[Figure: axes Longitude and Latitude; points Object #1, Object #2, Object #3]
Key Concept – Overfitting, Underfitting
From the book “A Gentle Introduction to Support Vector Machines in Biomedicine” by Statnikov, Aliferis, Hardin, Guyon
Two problems: Over-fitting & Under-fitting
Over-fitting (a model to your data) = building a model that performs well on the original data but fails to generalize to new/unseen data
Under-fitting (a model to your data) = building a model that performs poorly on both the original data and new/unseen data
Over/under-fitting are related to complexity of the decision surface and how well the training data is fit
[Figure: Outcome of Interest Y plotted against Predictor X]
Scenario 1
[Figure: Outcome of Interest Y against Predictor X, showing training data and future data, with two fitted lines labeled “This line is good!” and “This line overfits!”]
Scenario 2
[Figure: Outcome of Interest Y against Predictor X, showing training data and future data, with two fitted lines labeled “This line is good!” and “This line underfits!”]
Very important concept…
Successful data analysis methods balance training data fit with complexity.
• A signature that is too complex (in order to fit the training data well) leads to overfitting (i.e., the signature does not generalize).
• A signature that is too simplistic (in order to avoid overfitting) leads to underfitting (it will generalize, but the fit to both the training and future data will be poor and predictive performance low).
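The trade-off can be made concrete with a small Python sketch. The data are invented (an underlying trend of y = 2x with noisy training outcomes), and the three models are stand-ins chosen for this example: a constant predictor as the overly simple model, a least-squares line as the balanced one, and a nearest-point memorizer as the overly complex one.

```python
# Hypothetical data: the underlying trend is y = 2x; training outcomes carry noise.
train_x = [0, 1, 2, 3, 4, 5]
train_y = [1, 1, 5, 5, 9, 9]          # 2x plus alternating +1 / -1 noise
future_x = [0.25, 1.25, 2.25, 3.25, 4.25]
future_y = [0.5, 2.5, 4.5, 6.5, 8.5]  # noise-free 2x


def fit_mean(xs, ys):
    """Too simple: ignore x and always predict the mean outcome."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Moderate complexity: ordinary least-squares straight line."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)


def fit_memorizer(xs, ys):
    """Too complex: return the outcome of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of a model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


for name, fit in [("mean", fit_mean), ("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train_x, train_y)
    print(name, round(mse(model, train_x, train_y), 3),
          round(mse(model, future_x, future_y), 3))
```

On this toy data the memorizer fits the training points perfectly yet does worse than the line on the future data, while the constant predictor is poor on both.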
Key Concept – Performance Estimation
From the book “A Gentle Introduction to Support Vector Machines in Biomedicine” by Statnikov, Aliferis, Hardin, Guyon
On estimation of classifier accuracy
[Figure: the data split once into train and test portions, and split repeatedly into rotating train and test folds]
Large sample case: use hold-out validation
Small sample case: use N-fold cross-validation
Other versions of this general notion…
Leave one out cross validation
Leave pair out cross validation
Bootstrap
Single Holdout
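A minimal sketch of how N-fold cross-validation partitions the data, written in plain Python. The function name and the interleaved fold assignment are choices made for this example; in practice the samples are usually shuffled (and often stratified by class) before splitting.

```python
def k_fold_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices) so each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Fold f takes every n_folds-th sample starting at position f.
    folds = [indices[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        test = folds[f]
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield train, test


for train, test in k_fold_splits(n_samples=10, n_folds=5):
    print("train:", train, "test:", test)
```

Setting `n_folds` equal to `n_samples` gives leave-one-out cross-validation; a single hold-out corresponds to using only one of the splits.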
Key Concept – The Support Vector Machine
From the book “A Gentle Introduction to Support Vector Machines in Biomedicine” by Statnikov, Aliferis, Hardin, Guyon
The Support Vector Machine (SVM) approach for building molecular signatures
The support vector machine (SVM) is a binary classification algorithm.
SVMs are important because of (a) theoretical reasons:
- Robust to a very large number of variables and small samples
- Can learn both simple and highly complex classification models
- Employ sophisticated mathematical principles to avoid overfitting
and (b) superior empirical results.
Main ideas of SVMs (1/3)
[Figure: cancer patients and normal patients plotted by gene X and gene Y]
• Consider an example dataset described by 2 genes, gene X and gene Y
• Represent patients geometrically (by “vectors”)
Main ideas of SVMs (2/3)
• Find a linear decision surface (“hyperplane”) that can separate patient classes and has the largest distance (i.e., largest “gap” or “margin”) between border-line patients (i.e., “support vectors”);
[Figure: cancer patients and normal patients plotted by gene X and gene Y, with the gap marked]
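To make the “largest gap” idea concrete, the sketch below measures the gap that two candidate separating lines leave to the nearest patient. The patient coordinates and both candidate lines are invented for this example, and the code only compares given lines; it does not search for the best one as an SVM solver would.

```python
import math

# Hypothetical patients as (gene X, gene Y) expression values.
cancer = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
normal = [(4.0, 4.0), (5.0, 4.0), (4.0, 5.0)]


def margin(w, b, positives, negatives):
    """Smallest signed distance of any point to the line w.x + b = 0.

    Points in `positives` should lie on the side where w.x + b > 0.
    A negative result means the line fails to separate the two classes.
    """
    norm = math.hypot(w[0], w[1])
    distances = [(w[0] * x + w[1] * y + b) / norm for x, y in positives]
    distances += [-(w[0] * x + w[1] * y + b) / norm for x, y in negatives]
    return min(distances)


# Two candidate lines; both classify every patient correctly.
diagonal = margin((1.0, 1.0), -5.5, normal, cancer)  # line x + y = 5.5
vertical = margin((1.0, 0.0), -3.0, normal, cancer)  # line x = 3
print(diagonal, vertical)
```

Both lines separate the classes, but the diagonal one leaves a wider gap to the closest patients, so it is the one the maximum-margin criterion prefers among these two.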
Main ideas of SVMs (3/3)
• If such a linear decision surface does not exist, the data is mapped into a much higher dimensional space (“feature space”) where the separating decision surface is found;
• The feature space is constructed via a very clever mathematical projection (“kernel trick”).
[Figure: cancer and normal patients plotted by gene X and gene Y, mapped by a kernel into a feature space where a decision surface separates them]
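The kernel trick can be checked numerically on a small case. The sketch below uses the degree-2 polynomial kernel as one standard example (the slides do not say which kernel is meant) and shows that evaluating the kernel on the original two-dimensional points gives the same number as an inner product in the explicit three-dimensional feature space.

```python
import math


def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel in two dimensions."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def dot(a, b):
    """Ordinary inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def poly_kernel(a, b):
    """K(a, b) = (a . b)^2, computed without ever building phi(a) or phi(b)."""
    return dot(a, b) ** 2


a, b = (1.0, 2.0), (3.0, 0.5)
print(poly_kernel(a, b), dot(phi(a), phi(b)))
```

The two printed values agree up to floating-point rounding. The practical benefit is that the kernel never materializes the feature space, which matters when that space is very high dimensional.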
Key Concept - Curse of Dimensionality
Thanks to Dr. Gutierrez-Osuna - http://courses.cs.tamu.edu/rgutier/cs790_w02/l5.pdf
Curse of Dimensionality (1/3)
Curse of Dimensionality (2/3)
Curse of Dimensionality (3/3)
The range of features in higher dimensional data includes:
10,000-50,000 (gene expression microarrays, aCGH, and early SNP arrays)
>500,000 (exon arrays/tiled microarrays/SNP arrays)
10,000-300,000 (MS proteomics)
>10,000,000 (LC-MS proteomics)
>100,000,000 (next-generation sequencing)
High Dimensionality in Small Samples Causes
Some methods do not run at all (classical regression)
Some methods give bad results (KNN, Decision trees)
Very slow analysis
Very expensive/cumbersome clinical application
Tends to “overfit”
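One way to see why a small sample becomes sparse in high dimensions is to count grid cells. The sketch below assumes each feature is discretized into 3 bins (an arbitrary choice for illustration) and uses the 72 samples of the leukemia case study in these slides as the fixed sample size.

```python
N_SAMPLES = 72        # sample count from the leukemia case study in these slides
BINS_PER_FEATURE = 3  # assumed coarse discretization of each feature


def cells(n_features, bins=BINS_PER_FEATURE):
    """Number of grid cells when every feature is split into `bins` intervals."""
    return bins ** n_features


# Average number of samples available per cell as features are added.
for d in (1, 2, 3, 5, 10):
    print(d, cells(d), N_SAMPLES / cells(d))
```

With only ten features there are already far more cells than samples, so almost every cell is empty; with thousands of features the space is emptier still.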
Cancer Classification Case Study
From Golub et al. (1999)
Case Study
Classify the values of a gene microarray according to leukemia type (AML or ALL).
Task meta-data
72 samples (47 ALL, 25 AML)
5,327 genes
Labeled Microarrays
Treatment
AML 25
ALL 47
Encode Microarray
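Within each training fold, rescale the values of each column to lie between 0 and 1.
Notice that we don’t normalize the entire dataset and then run our classification algorithms: doing so would let information from the test folds leak into training and would make the performance estimate overly optimistic (a form of overfitting).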
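A minimal sketch of fold-wise normalization in plain Python: the scaling bounds are learned from the training fold only and then reused, unchanged, on the test fold. The function names and the expression values are hypothetical.

```python
def fit_min_max(train_rows):
    """Learn per-column minimum and maximum from the training fold only."""
    columns = list(zip(*train_rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, mins, maxs):
    """Rescale rows with previously learned bounds; constant columns map to 0."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


# Hypothetical expression values: rows are samples, columns are genes.
train_fold = [[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]]
test_fold = [[5.0, 40.0]]

mins, maxs = fit_min_max(train_fold)
print(apply_min_max(train_fold, mins, maxs))
print(apply_min_max(test_fold, mins, maxs))
```

Test-fold values can land outside the 0 to 1 range, as the second gene does here; that is expected, because the test fold played no part in choosing the bounds.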
Build a Model - Support Vector Machine
[Figure: two-dimensional scatter plot of examples]
This example illustrates a two-dimensional space. The x and y axes each represent one word. A full text categorization example could contain upwards of 50,000 words and thus 50,000 dimensions.
Build a Model – K nearest neighbors
http://mines.humanoriented.com/classes/2010/fall/csci568/portfolio_exports/lguo/knn.html
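For reference, a K nearest neighbors classifier fits in a few lines of plain Python. This is a bare-bones sketch with invented two-gene profiles and k = 3; it is not the implementation used to produce the results in this case study.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority label among the k training points closest to `query`."""
    def sq_dist(point):
        return sum((a - b) ** 2 for a, b in zip(point, query))
    nearest = sorted(train, key=lambda item: sq_dist(item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-gene expression profiles labeled by leukemia type.
train = [((1.0, 1.0), "ALL"), ((1.2, 0.8), "ALL"), ((0.9, 1.3), "ALL"),
         ((3.0, 3.1), "AML"), ((3.2, 2.9), "AML"), ((2.8, 3.3), "AML")]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (3.0, 3.0)))
```

Using an odd k with two classes avoids tied votes. Because the method relies on distances, it tends to degrade when the number of features is very large relative to the number of samples.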
Build a Model – Neural Network
http://en.wikipedia.org/wiki/Artificial_neural_network
Estimate Performance
[Figure: the data split repeatedly into rotating train and test folds]
Small sample case: use N-fold cross-validation
Results
Proportion of Correct Classifications
Baseline (All in one class) 65.0%
Support Vector Machine 91.7%
K Nearest Neighbors 87.9%
Neural Network 84.7%
http://bib.oxfordjournals.org/content/7/1/86.full.pdf+html
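The baseline row corresponds to always predicting the most frequent class. As a quick check using only the class counts stated earlier in this case study (47 ALL, 25 AML), the sketch below computes that proportion over all 72 samples. It comes out near 65.3%, close to the 65.0% reported above; the slides do not say how the reported figure was computed.

```python
# Class counts from the case study: 47 ALL and 25 AML samples.
counts = {"ALL": 47, "AML": 25}

# A majority-class baseline always predicts the most frequent label.
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / sum(counts.values())
print(majority_class, round(100 * baseline_accuracy, 1))
```

Any classifier worth using should beat this number, since the baseline ignores the gene expression values entirely.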
Conclusions
Machine Learning Examples
Key Concepts: Supervised Learning, Overfitting/Underfitting, Support Vector Machines, Cross Validation, Curse of Dimensionality
Case Study – Cancer Classification
Thanks.
Dr. Gutierrez-Osuna
Dr. Alexander Statnikov
Molecular Signatures
Slides from Dr. Alexander Statnikov, PhD.
Definition of a molecular signature
A molecular signature is a computational or mathematical model that links high-dimensional molecular information to a phenotype or other response variable of interest.
Example of a molecular signature
[Figure: a biopsy from a patient with lung cancer yields a gene expression profile; the molecular signature classifies it as primary lung cancer or metastatic lung cancer]
Main uses of molecular signatures
1. Direct benefits: Models of disease phenotype/clinical outcome
• Diagnosis
• Prognosis, long-term disease management
• Personalized treatment (drug selection, titration)
2. Ancillary benefits 1: Biomarkers for diagnosis, or outcome prediction
• Make the above tasks resource efficient, and easy to use in clinical practice
• Helps next-generation molecular imaging
• Leads for potential new drug candidates
3. Ancillary benefits 2: Discovery of structure & mechanisms (regulatory/interaction networks, pathways, sub-types)
• Leads for potential new drug candidates
Recent molecular signatures available for patient care
OvaSure, Agendia, Clarient, Prediction Sciences, Veridex, LabCorp, University Genomics, Genomic Health, BioTheranostics, Applied Genomics, Power3, Correlogic Systems
Prostate cancer signatures in the market
MammaPrint
• Developed by Agendia (www.agendia.com)
• 70-gene signature to stratify women with breast cancer that hasn’t spread into “low risk” and “high risk” for recurrence of the disease
• Independently validated in >1,000 patients
• So far performed >10,000 tests
• Cost of the test is ~$3,000
• In February, 2007 the FDA cleared the MammaPrint test for marketing in the U.S. for node negative women under 61 years of age with tumors of less than 5 cm.
• TIME Magazine’s 2007 “medical invention of the year”.
Oncotype DX Breast Cancer Assay (Launched in 2004)
Developed by Genomic Health (www.genomichealth.com)
21-gene signature to predict whether a woman with localized, ER+ breast cancer is at risk of relapse
Independently validated in thousands of patients
So far performed >200,000 tests
Price of the test is $4,175
Not FDA approved but covered by most insurances including Medicare
Its sales in 2012 reached $199M.
Economic validity
In a 2005 economic analysis of the Recurrence Score result in LN-, ER+ patients receiving tamoxifen, Hornberger et al. performed a cost-utility analysis using a decision analytic model. Using this model, the Recurrence Score result was predicted on average to increase quality-adjusted survival by 16.3 years and reduce overall costs by $155,128.
Instead of using the model, economic benefits can now be assessed from the published clinical utility of the test and actual health plan costs for adjuvant chemotherapy. For example, in a 2 million member plan, approximately 773 women are eligible for the test. If half receive the test, given the high and increasing cost of adjuvant chemotherapy, supportive care and management of adverse events, the use of the Oncotype DX assay is estimated to save approximately $1,930 per woman tested (given an aggregate 34% reduction in chemotherapy use).
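Multiplying the figures quoted above gives a rough plan-level total. The total itself is not stated in the slides; the sketch below is only a back-of-the-envelope product of the three quoted numbers.

```python
# Figures quoted in the paragraph above.
eligible_women = 773       # eligible for the test in a 2 million member plan
share_tested = 0.5         # "if half receive the test"
savings_per_woman = 1930   # estimated savings in dollars per woman tested

# Rough plan-level total (derived here, not stated in the source).
women_tested = eligible_women * share_tested
plan_savings = women_tested * savings_per_woman
print(women_tested, plan_savings)
```

Under these figures the estimate works out to roughly $746,000 for the whole plan.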
References about health benefits and cost-effectiveness:
“Economic Analysis of Targeting Chemotherapy Using a 21-Gene RT-PCR Assay in Lymph Node-Negative, Estrogen Receptor-Positive, Early-Stage Breast Cancer” Am J Manag Care. 2005; 11(5):313-324.
“Impact of a 21-Gene RT-PCR Assay on Treatment Decisions in Early-Stage Breast Cancer, An Economic Analysis Based on Prognostic and Predictive Validation Studies” Cancer. 2007; 109(6):1011-1018.
Oncotype DX Colon Cancer Assay (Launched in 2010)
Developed by Genomic Health (www.genomichealth.com)
Multigene signature to predict risk of recurrence in patients with stage II colon cancer
Independently validated in thousands of patients
Price of the test is $3,280
Not FDA approved but covered by most insurances including Medicare
Oncotype DX Prostate Cancer Assay (Launched in 2013)
Developed by Genomic Health (www.genomichealth.com)
Multigene signature to distinguish aggressive prostate cancer from less threatening disease
Independently validated
Price of the test is $3,820
Not FDA approved but covered by most insurances including Medicare
Oncotype DX Business Metrics
Data from http://investor.genomichealth.com/
Conclusions
Machine Learning Examples
Key Concepts: Supervised Learning, Overfitting/Underfitting, Support Vector Machines, Cross Validation, Curse of Dimensionality
Case Study – Cancer Classification
Case Study – Molecular Signatures
Thanks.
Dr. Gutierrez-Osuna
Dr. Alexander Statnikov