Introduction to Machine Learning Sackler Yin Aphinyanaphongs MD/ PhD 12/11/2014.

Who Am I

Yin Aphinyanaphongs (yinformatics.com)

MD, PhD from Vanderbilt University in Nashville, TN.

Assistant Professor in the Center for Health Informatics and Bioinformatics.

Primary Expertise: Machine Learning, Predictive Modeling, Text Classification, Data Mining, Social Media, Large Medical Datasets

Secondary Expertise: Search Engine Design/Information Retrieval, Natural Language Processing

What I Teach

Introduction to Biomedical Informatics.

Introduction to Medicine for Computer Scientists.

Data Analytics in R for physicians.

Machine Learning Examples

Given an email, classify it as spam or not spam.

Given a handwritten digit, assign it the right number.

Given descriptions of passengers on the Titanic, predict who will or will not survive.

Given a gene expression microarray of a cancer, predict whether the cancer will or will not metastasize.

Email Spam Text Classification

http://blog.cyren.com/uploads/blog/google-docs-spam-sample.jpg

Digit Classification

http://nonbiri-tereka.hatenablog.com/entry/2014/09/18/100439

Predicting Titanic Survival

Passenger class

Name

Sex

Age

Number of siblings/ spouses aboard

Number of parents/ children aboard

Ticket number

Passenger fare

Cabin

Port of Embarkation

https://www.kaggle.com/c/titanic-gettingStarted
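
Before a classification algorithm can use passenger descriptions like these, each passenger has to be turned into numbers. The following is a minimal sketch of one possible encoding. The field names, the port codes, and the example passenger are assumptions made for illustration, and free-text fields (name, ticket number, cabin) are simply left out.

```python
# Hypothetical field names and a made-up passenger, for illustration only.
EMBARKED_PORTS = ["C", "Q", "S"]  # assumed port-of-embarkation codes


def encode_passenger(passenger):
    """Turn one passenger record (a dict) into a flat list of numbers."""
    numeric = [
        float(passenger["pclass"]),
        1.0 if passenger["sex"] == "female" else 0.0,  # binary indicator
        float(passenger["age"]),
        float(passenger["siblings_spouses"]),
        float(passenger["parents_children"]),
        float(passenger["fare"]),
    ]
    # One indicator per port, so no artificial ordering is imposed on ports.
    one_hot_port = [
        1.0 if passenger["embarked"] == port else 0.0 for port in EMBARKED_PORTS
    ]
    return numeric + one_hot_port


example = {
    "pclass": 3,
    "sex": "male",
    "age": 22,
    "siblings_spouses": 1,
    "parents_children": 0,
    "fare": 7.25,
    "embarked": "S",
}
vector = encode_passenger(example)
print(vector)
```

Each passenger becomes one point in a 9-dimensional space, which is the geometric view used in the supervised learning slides.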

Molecular Signatures

A molecular signature is a computational or mathematical model that links high-dimensional molecular information to a phenotype or other response variable of interest.

Figure: heatmap from Golub et al. (1999)

Machine Learning

Goal

Construct algorithms that learn from data, so that a model built from training data generalizes to unseen data.

General Framework

Obtain Seq

Sample Seq (Optional)

Label Seq

Clean Seq

Encode Seq

Build a Model

Performance Evaluation (Internal)

Model Application and Validation (External)

Basic Framework

Diagram: labeled examples (ALL, AML) train a classification algorithm; the resulting model labels unseen examples as ALL or AML.

Classification Algorithm

• Random Forests

• Regularized Logistic Regression

• Support Vector Machines

• etc.

Key Concept – Supervised Learning

From the book “A Gentle Introduction to Support Vector Machines in Biomedicine” by Statnikov, Aliferis, Hardin, Guyon

Principles and geometric representation for supervised learning (1/7)

• We want to classify objects as boats or houses.

Principles and geometric representation for supervised learning (2/7)

• All objects before the coastline are boats and all objects after the coastline are houses.

• The coastline serves as a decision surface that separates the two classes.

Principles and geometric representation for supervised learning (3/7)

These boats will be misclassified as houses

This house will be misclassified as a boat

Principles and geometric representation for supervised learning (4/7)

Figure: boats and houses plotted by longitude and latitude.

• The methods that build classification models (i.e., “classification algorithms”) operate very similarly to the previous example.

• First, all objects are represented geometrically.

Principles and geometric representation for supervised learning (5/7)

Figure: boats and houses plotted by longitude and latitude, with a decision surface between them.

• Then the algorithm seeks a decision surface that separates the classes of objects.

Principles and geometric representation for supervised learning (6/7)

Figure: unseen objects (marked “?”) plotted by longitude and latitude on both sides of the decision surface; the objects on one side are classified as boats and those on the other side as houses.

• Unseen (new) objects are classified as “boats” if they fall below the decision surface and as “houses” if they fall above it.
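
The rule above can be sketched in a few lines of code. This is only an illustration: the coastline is assumed to be the straight line latitude = 5, and the coordinates of the unseen objects are made up.

```python
def classify(longitude, latitude, w_lon, w_lat, bias):
    """Label a point by the side of the line w_lon*lon + w_lat*lat + bias = 0 it falls on."""
    score = w_lon * longitude + w_lat * latitude + bias
    # Positive score: above the decision surface (house); otherwise below (boat).
    return "house" if score > 0 else "boat"


# Hypothetical decision surface: the horizontal line latitude = 5.
W_LON, W_LAT, BIAS = 0.0, 1.0, -5.0

# Made-up unseen objects given as (longitude, latitude).
unseen = [(2.0, 1.5), (7.0, 8.0), (4.0, 4.9)]
labels = [classify(lon, lat, W_LON, W_LAT, BIAS) for lon, lat in unseen]
print(labels)  # ['boat', 'house', 'boat']
```

A classification algorithm differs from this sketch in one respect: it chooses the weights and the bias itself from the labeled examples instead of having them written in by hand.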

Principles and geometric representation for supervised learning (7/7)

Figure: Object #1, Object #2, and Object #3 plotted by longitude and latitude.

Key Concept – Overfitting, Underfitting

From the book “A Gentle Introduction to Support Vector Machines in Biomedicine” by Statnikov, Aliferis, Hardin, Guyon

Two problems: Over-fitting & Under-fitting

Over-fitting (a model to your data) = building a model that is good in original data but fails to generalize well to new/unseen data

Under-fitting (a model to your data) = building a model that is poor in both original data and new/unseen data
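
Both problems can be seen on a small made-up dataset. The sketch below is an illustration under stated assumptions, not part of the original slides: the outcome is roughly 2 × predictor plus noise, a constant model stands in for a too-simple model, a least-squares line for a reasonable one, and a "memorizer" (it returns the outcome of the nearest training point) for a too-complex one.

```python
def mse(model, xs, ys):
    """Mean squared error of a model (a function of x) on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_constant(xs, ys):
    """Too simple: always predict the mean training outcome."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Ordinary least-squares straight line."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(xs, ys):
    """Too complex: reproduce the outcome of the nearest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


# Made-up data: training outcomes are 2*x plus alternating noise of +1/-1.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [1.0, 1.0, 5.0, 5.0, 9.0, 9.0]
future_x = [0.4, 1.4, 2.4, 3.4, 4.4]
future_y = [2 * x for x in future_x]

results = {}
for name, fit in [("constant", fit_constant), ("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train_x, train_y)
    results[name] = (mse(model, train_x, train_y), mse(model, future_x, future_y))
    print(name, "train error:", round(results[name][0], 3), "future error:", round(results[name][1], 3))
```

The memorizer has zero error on the training data but a larger error than the line on the future data (over-fitting), while the constant model has a large error on both (under-fitting).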

Over/under-fitting are related to the complexity of the decision surface and how well the training data is fit

Figure: Outcome of Interest Y plotted against Predictor X.

Scenario 1

Figure: Outcome of Interest Y plotted against Predictor X, showing training data and future data with two fitted lines. One line is annotated “This line is good!” and the other “This line overfits!”

Scenario 2

Figure: Outcome of Interest Y plotted against Predictor X, showing training data and future data with two fitted lines. One line is annotated “This line is good!” and the other “This line underfits!”

Very important concept…

Successful data analysis methods balance training data fit with complexity.

• Too complex a signature (to fit the training data well) leads to overfitting (i.e., the signature does not generalize).

• Too simplistic a signature (to avoid overfitting) leads to underfitting (it will generalize, but the fit to both the training and future data will be poor and predictive performance low).

Key Concept – Performance Estimation

From the book “A Gentle Introduction to Support Vector Machines in Biomedicine” by Statnikov, Aliferis, Hardin, Guyon

On estimation of classifier accuracy

Large sample case: use hold-out validation

Figure: the data is split once into a train portion and a test portion.

Small sample case: use N-fold cross-validation

Figure: the data is split into N parts; each part serves as the test set once while the remaining parts are used for training.
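
The two splitting schemes can be sketched as index bookkeeping. The function names and the 30% test fraction below are assumptions for illustration; in practice the sample order would normally be shuffled first.

```python
def holdout_split(n_samples, test_fraction=0.3):
    """Split sample indices once into a train part and a test part."""
    n_test = int(round(n_samples * test_fraction))
    indices = list(range(n_samples))
    return indices[n_test:], indices[:n_test]  # (train, test)


def n_fold_splits(n_samples, n_folds):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test


train, test = holdout_split(10)
print("hold-out:", train, test)

for fold_train, fold_test in n_fold_splits(10, 5):
    print("fold:", fold_train, fold_test)
```

With N-fold cross-validation the model is built N times, and the N test-fold accuracies are typically averaged into one performance estimate.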

Other versions of this general notion…

Leave one out cross validation

Leave pair out cross validation

Bootstrap

Single Holdout
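
Two of these variants are sketched below as index bookkeeping. This is a hedged illustration: the function names are made up, and testing on the samples that were never drawn (the "out-of-bag" samples) is one common way to use the bootstrap, not necessarily the variant the slides had in mind.

```python
import random


def leave_one_out_splits(n_samples):
    """N-fold cross-validation with N equal to the number of samples."""
    for i in range(n_samples):
        yield [j for j in range(n_samples) if j != i], [i]


def bootstrap_split(n_samples, rng):
    """Draw a training set with replacement; test on the samples never drawn."""
    train = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(train)
    test = [i for i in range(n_samples) if i not in drawn]
    return train, test


for train, test in leave_one_out_splits(4):
    print("leave-one-out:", train, test)

boot_train, boot_test = bootstrap_split(20, random.Random(0))
print("bootstrap train size:", len(boot_train), "out-of-bag size:", len(boot_test))
```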

Key Concept – The Support Vector Machine

From the book “A Gentle Introduction to Support Vector Machines in Biomedicine” by Statnikov, Aliferis, Hardin, Guyon

The Support Vector Machine (SVM) approach for building molecular signatures

The support vector machine (SVM) is a binary classification algorithm.

SVMs are important because of (a) theoretical reasons:

- Robust to very large number of variables and small samples

- Can learn both simple and highly complex classification models

- Employ sophisticated mathematical principles to avoid overfitting

and (b) superior empirical results.

Main ideas of SVMs (1/3)

Figure: cancer patients and normal patients plotted by Gene X and Gene Y.

• Consider an example dataset described by 2 genes, gene X and gene Y

• Represent patients geometrically (by “vectors”)

Main ideas of SVMs (2/3)

• Find a linear decision surface (“hyperplane”) that can separate the patient classes and has the largest distance (i.e., largest “gap” or “margin”) between border-line patients (i.e., “support vectors”);

Figure: cancer patients and normal patients plotted by Gene X and Gene Y, separated by a line with the gap marked.
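
For readers who want the largest-gap idea in symbols, the standard textbook formulation for separable data is given below. The notation is an assumption and does not appear on the slides: each patient i is a vector x_i (its gene X and gene Y values) with label y_i = +1 or -1, and the hyperplane is w · x + b = 0.

```latex
\min_{\mathbf{w},\, b} \; \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \left( \mathbf{w} \cdot \mathbf{x}_i + b \right) \ge 1,
\qquad i = 1, \dots, N

\text{gap width} \;=\; \frac{2}{\lVert \mathbf{w} \rVert}
```

Minimizing the length of w is the same as maximizing the gap, and the patients for which the constraint holds with equality are the support vectors.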

Main ideas of SVMs (3/3)

• If such a linear decision surface does not exist, the data is mapped into a much higher dimensional space (“feature space”) where the separating decision surface is found;

• The feature space is constructed via a very clever mathematical projection (“kernel trick”).

Figure: cancer and normal patients plotted by Gene X and Gene Y, mapped by a kernel into a space where a decision surface separates them.
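
A small sketch may make the mapping concrete. Everything in it is an assumption for illustration: the patients are made-up points, the feature map is the one belonging to the degree-2 polynomial kernel, and the separating weights are written in by hand instead of being learned by an SVM.

```python
import math


def phi(p):
    """Explicit feature map of the degree-2 polynomial kernel for 2 genes."""
    x, y = p
    return (x * x, math.sqrt(2) * x * y, y * y)


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def poly_kernel(p, q):
    """Same inner product as dot(phi(p), phi(q)), computed without building phi."""
    return dot(p, q) ** 2


# Made-up (gene X, gene Y) values: normal patients lie inside the triangle
# formed by the cancer patients, so no straight line separates the classes.
normal = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.4)]
cancer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]

# In feature space a flat decision surface is enough: x^2 + y^2 = 1.
W, THRESHOLD = (1.0, 0.0, 1.0), 1.0


def classify(p):
    return "cancer" if dot(W, phi(p)) > THRESHOLD else "normal"


print([classify(p) for p in normal])
print([classify(p) for p in cancer])
print(poly_kernel(cancer[0], cancer[2]), dot(phi(cancer[0]), phi(cancer[2])))
```

The last line shows the trick itself: the kernel gives the feature-space inner product directly from the original gene values, so the high-dimensional vectors never have to be constructed.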

Key Concept – Curse of Dimensionality

Thanks to Dr. Gutierrez-Osuna - http://courses.cs.tamu.edu/rgutier/cs790_w02/l5.pdf

Curse of Dimensionality (1/3)

Curse of Dimensionality (2/3)

Curse of Dimensionality (3/3)

The number of features in high-dimensional data ranges across:

• 10,000-50,000 (gene expression microarrays, aCGH, and early SNP arrays)

• >500,000 (exon arrays/tiled microarrays/SNP arrays)

• 10,000-300,000 (MS proteomics)

• >10,000,000 (LC-MS proteomics)

• >100,000,000 (next-generation sequencing)

High Dimensionality in Small Samples Causes

• Some methods do not run at all (classical regression)

• Some methods give bad results (KNN, Decision trees)

• Very slow analysis

• Very expensive/cumbersome clinical application

• A tendency to “overfit”

Cancer Classification Case Study

From Golub et al. (1999)

Case Study

Classify the values of a gene microarray according to leukemia type (AML or ALL).

Task meta-data

• 72 samples: 47 ALL, 25 AML

• 5,327 genes

Labeled Microarrays

Treatment

AML 25

ALL 47

Encode Microarray

Within each train fold, normalize the values of each column between 0 and 1.

Notice that we don’t normalize the entire dataset and then run our classification algorithms (this would result in overfitting).
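
The point about normalizing within each train fold can be sketched as follows. The helper names and the tiny two-gene dataset are assumptions for illustration: the column ranges are learned from the training rows only and then reused, unchanged, on the test rows.

```python
def fit_min_max(train_rows):
    """Learn (min, max) for each column from the training fold only."""
    columns = list(zip(*train_rows))
    return [(min(column), max(column)) for column in columns]


def apply_min_max(rows, ranges):
    """Scale rows with previously learned ranges; constant columns map to 0."""
    scaled = []
    for row in rows:
        scaled.append(
            [
                (value - low) / (high - low) if high > low else 0.0
                for value, (low, high) in zip(row, ranges)
            ]
        )
    return scaled


# Made-up expression values for two genes.
train_fold = [[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]]
test_fold = [[4.0, 40.0]]

ranges = fit_min_max(train_fold)  # the test fold is never looked at here
train_scaled = apply_min_max(train_fold, ranges)
test_scaled = apply_min_max(test_fold, ranges)
print(train_scaled)  # [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
print(test_scaled)   # [[0.75, 1.5]]
```

A test value can land outside the 0 to 1 range, as the second gene does here. That is expected: the test fold plays the role of unseen data, so it must not influence the ranges.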

Build a Model - Support Vector Machine

Figure: scatter plot of examples (*) in a two-dimensional space.

This example illustrates a 2-dimensional space. The x and y axes represent one word each. A full text categorization example could contain upwards of 50,000 words and thus 50,000 dimensions.

Build a Model – K nearest neighbors

http://mines.humanoriented.com/classes/2010/fall/csci568/portfolio_exports/lguo/knn.html
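
As a rough sketch of the idea behind K nearest neighbors: an unseen sample receives the most common label among its k closest labeled samples. The points, the labels attached to them, and the choice k = 3 below are made up for illustration and are not taken from the case study data.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up two-gene expression values.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.0), (3.2, 2.9), (2.8, 3.1)]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

print(knn_predict(points, labels, (1.1, 1.0)))  # ALL
print(knn_predict(points, labels, (3.0, 3.1)))  # AML
```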

Build a Model – Neural Network

http://en.wikipedia.org/wiki/Artificial_neural_network

Estimate Performance

Small sample case: use N-fold cross-validation

Figure: the data is split into N parts; each part serves as the test set once while the remaining parts are used for training.

Results

Classifier                     Proportion of Correct Classifications
Baseline (All in one class)    65.0%
Support Vector Machine         91.7%
K Nearest Neighbors            87.9%
Neural Network                 84.7%
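
The baseline row can be checked from the class counts given in the case study (47 ALL, 25 AML). The sketch below assumes the baseline means "always predict the larger class"; it comes out at about 65.3%, close to the 65.0% reported above (the slides do not explain the small difference).

```python
from collections import Counter

# Class counts from the case study: 47 ALL and 25 AML samples.
labels = ["ALL"] * 47 + ["AML"] * 25

# Always predicting the majority class is right for every majority sample.
majority_label, majority_count = Counter(labels).most_common(1)[0]
baseline_accuracy = majority_count / len(labels)
print(f"{majority_label}: {baseline_accuracy:.1%}")  # ALL: 65.3%
```

Any classifier worth using has to beat this number, which is why the table lists it first.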

http://bib.oxfordjournals.org/content/7/1/86.full.pdf+html

Conclusions

Machine Learning Examples

Key Concepts: Supervised Learning, Overfitting/Underfitting, Support Vector Machines, Cross Validation, Curse of Dimensionality

Case Study – Cancer Classification

Thanks.

Dr. Gutierrez-Osuna

Dr. Alexander Statnikov

Molecular Signatures

Slides from Dr. Alexander Statnikov, PhD.

Definition of a molecular signature

A molecular signature is a computational or mathematical model that links high-dimensional molecular information to a phenotype or other response variable of interest.

Example of a molecular signature

Figure: a biopsy from a patient with lung cancer yields a gene expression profile; the molecular signature classifies it as Primary Lung Cancer or Metastatic Lung Cancer.

Main uses of molecular signatures

1. Direct benefits: Models of disease phenotype/clinical outcome

• Diagnosis

• Prognosis, long-term disease management

• Personalized treatment (drug selection, titration)

2. Ancillary benefits 1: Biomarkers for diagnosis, or outcome prediction

• Make the above tasks resource efficient, and easy to use in clinical practice

• Helps next-generation molecular imaging

• Leads for potential new drug candidates

3. Ancillary benefits 2: Discovery of structure & mechanisms (regulatory/interaction networks, pathways, sub-types)

• Leads for potential new drug candidates

Recent molecular signatures available for patient care

• OvaSure

• Agendia

• Clarient

• Prediction Sciences

• Veridex

• LabCorp

• University Genomics

• Genomic Health

• BioTheranostics

• Applied Genomics

• Power3

• Correlogic Systems

Prostate cancer signatures in the market

MammaPrint

• Developed by Agendia (www.agendia.com)

• 70-gene signature to stratify women with breast cancer that hasn’t spread into “low risk” and “high risk” for recurrence of the disease

• Independently validated in >1,000 patients

• So far performed >10,000 tests

• Cost of the test is ~$3,000

• In February 2007, the FDA cleared the MammaPrint test for marketing in the U.S. for node-negative women under 61 years of age with tumors of less than 5 cm.

• TIME Magazine’s 2007 “medical invention of the year”.

Oncotype DX Breast Cancer Assay (Launched in 2004)

Developed by Genomic Health (www.genomichealth.com)

21-gene signature to predict whether a woman with localized, ER+ breast cancer is at risk of relapse

Independently validated in thousands of patients

So far performed >200,000 tests

Price of the test is $4,175

Not FDA approved but covered by most insurers, including Medicare

Its sales in 2012 reached $199M.

Economic validity

In a 2005 economic analysis of the Recurrence Score result in LN-, ER+ patients receiving tamoxifen, Hornberger et al. performed a cost-utility analysis using a decision analytic model. Using the model, the Recurrence Score result was predicted on average to increase quality-adjusted survival by 16.3 years and reduce overall costs by $155,128.

Instead of using the model, economic benefits can now be assessed from the published clinical utility of the test and actual health plan costs for adjuvant chemotherapy. For example, in a 2 million member plan, approximately 773 women are eligible for the test. If half receive the test, given the high and increasing cost of adjuvant chemotherapy, supportive care and management of adverse events, the use of the Oncotype DX assay is estimated to save approximately $1,930 per woman tested (given an aggregate 34% reduction in chemotherapy use).

References about health benefits and cost-effectiveness:

“Economic Analysis of Targeting Chemotherapy Using a 21-Gene RT-PCR Assay in Lymph Node-Negative, Estrogen Receptor-Positive, Early-Stage Breast Cancer” Am J Manag Care. 2005; 11(5):313-324.

“Impact of a 21-Gene RT-PCR Assay on Treatment Decisions in Early-Stage Breast Cancer, An Economic Analysis Based on Prognostic and Predictive Validation Studies” Cancer. 2007; 109(6):1011-1018.

Oncotype DX Colon Cancer Assay (Launched in 2010)

Developed by Genomic Health (www.genomichealth.com)

Multigene signature to predict risk of recurrence in patients with stage II colon cancer

Independently validated in thousands of patients

Price of the test is $3,280

Not FDA approved but covered by most insurers, including Medicare

Oncotype DX Prostate Cancer Assay (Launched in 2013)

Developed by Genomic Health (www.genomichealth.com)

Multigene signature to distinguish aggressive prostate cancer from less threatening disease

Independently validated

Price of the test is $3,820

Not FDA approved but covered by most insurers, including Medicare

Oncotype DX Business Metrics

Data from http://investor.genomichealth.com/

Conclusions

Machine Learning Examples

Key Concepts: Supervised Learning, Overfitting/Underfitting, Support Vector Machines, Cross Validation, Curse of Dimensionality

Case Study – Cancer Classification

Case Study – Molecular Signatures

Thanks.

Dr. Gutierrez-Osuna

Dr. Alexander Statnikov