Introduction to Machine Learning Sackler Yin Aphinyanaphongs MD/ PhD 12/11/2014.


Who Am I

Yin Aphinyanaphongs (yinformatics.com)

MD, PhD from Vanderbilt University in Nashville, TN.

Assistant Professor in the Center for Health Informatics and Bioinformatics.

Primary Expertise

• Machine Learning

• Predictive Modeling

• Text Classification

• Data Mining

• Social Media

• Large Medical Datasets

Secondary Expertise

• Search Engine Design/Information Retrieval

• Natural Language Processing

What I Teach

Introduction to Biomedical Informatics.

Introduction to Medicine for Computer Scientists.

Data Analytics in R for physicians.

Machine Learning Examples

Given an email, classify it as spam or not spam.

Given a handwritten digit, assign it the right number.

Given descriptions of passengers on the Titanic, predict who will or will not survive.

Given a gene expression microarray of a cancer, predict whether the cancer will or will not metastasize.

Email Spam Text Classification

http://blog.cyren.com/uploads/blog/google-docs-spam-sample.jpg

Digit Classification

http://nonbiri-tereka.hatenablog.com/entry/2014/09/18/100439

Predicting Titanic Survival

Passenger class

Name

Sex

Age

Number of siblings/ spouses aboard

Number of parents/ children aboard

Ticket number

Passenger fare

Cabin

Port of Embarkation

https://www.kaggle.com/c/titanic-gettingStarted

Molecular Signatures

A molecular signature is a computational or mathematical model that links high-dimensional molecular information to a phenotype or other response variable of interest.

Golub et al. (1999) heatmap


Machine Learning

Goal

Construct algorithms that learn from data so that a model built from training data generalizes to unseen data.

General Framework

Obtain Seq

Sample Seq (Optional)

Label Seq

Clean Seq

Encode Seq

Build a Model

Performance Evaluation (Internal)

Model Application and Validation (External)

Basic Framework

Labeled Examples

Classification Algorithm

• Random Forests

• Regularized Logistic Regression

• Support Vector Machines, etc.

Unseen Examples

Labeled

Class labels: ALL, AML
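The flow on this slide (labeled examples in, a model out, unseen examples labeled) can be sketched in a few lines. The nearest-centroid rule below is a deliberately simple stand-in and is not one of the algorithms listed on the slide; the two-feature expression values and the names `fit_centroids` and `predict` are invented for illustration.

```python
from math import dist

def fit_centroids(examples, labels):
    """Average the feature vectors of each class (the 'training' step)."""
    grouped = {}
    for features, label in zip(examples, labels):
        grouped.setdefault(label, []).append(features)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }

def predict(centroids, features):
    """Label an unseen example with the class of the closest centroid."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

# Hypothetical labeled examples: two expression values per sample.
labeled_examples = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
labels = ["ALL", "ALL", "AML", "AML"]

model = fit_centroids(labeled_examples, labels)

# Unseen examples are labeled by the model, not by a person.
unseen_examples = [[0.7, 0.3], [0.3, 0.6]]
predictions = [predict(model, x) for x in unseen_examples]
print(predictions)
```

Any of the listed algorithms could take the place of the centroid rule; the fit-then-predict shape stays the same.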

Key Concept – Supervised Learning

From the book “A Gentle Introduction to Support Vector Machines in Biomedicine” (Statnikov, Aliferis, Hardin, Guyon)

Principles and geometric representation for supervised learning (1/7)

• Want to classify objects as boats and houses.

Principles and geometric representation for supervised learning (2/7)

• All objects before the coast line are boats and all objects after the coast line are houses.

• Coast line serves as a decision surface that separates two classes.

Principles and geometric representation for supervised learning (3/7)

These boats will be misclassified as houses

This house will be misclassified as a boat

[Figure: boats and houses plotted by Longitude and Latitude.]

• The methods that build classification models (i.e., “classification algorithms”) operate very similarly to the previous example.

• First all objects are represented geometrically.


[Figure: boats and houses plotted by Longitude and Latitude.]

Then the algorithm seeks to find a decision surface that separates classes of objects


[Figure: axes Longitude and Latitude; unseen objects marked “?”.]

These objects are classified as boats

These objects are classified as houses

Unseen (new) objects are classified as “boats” if they fall below the decision surface and as “houses” if they fall above it.

[Figure: axes Longitude and Latitude; unseen objects labeled Object #1, Object #2, and Object #3.]
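As a rough illustration of how an algorithm might find such a decision surface, here is a minimal perceptron sketch. The slides do not name a specific algorithm at this point, so the perceptron is an assumption chosen for brevity, and the coordinates are made up.

```python
def train_perceptron(points, labels, max_epochs=1000):
    """Find weights (w_lon, w_lat, bias) of a separating line, if one exists."""
    w_lon, w_lat, bias = 0.0, 0.0, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (lon, lat), label in zip(points, labels):
            score = w_lon * lon + w_lat * lat + bias
            if label * score <= 0:  # on the wrong side of (or on) the surface
                w_lon += label * lon
                w_lat += label * lat
                bias += label
                mistakes += 1
        if mistakes == 0:  # every training object is on its correct side
            break
    return w_lon, w_lat, bias

def classify(weights, lon, lat):
    """Objects with a positive score are houses; all others are boats."""
    w_lon, w_lat, bias = weights
    return "house" if w_lon * lon + w_lat * lat + bias > 0 else "boat"

# Made-up (longitude, latitude) pairs: boats = -1, houses = +1.
points = [(1, 1), (2, 1), (3, 2), (2, 0), (1, 4), (2, 5), (3, 6), (4, 5)]
labels = [-1, -1, -1, -1, 1, 1, 1, 1]

weights = train_perceptron(points, labels)
training_predictions = [classify(weights, lon, lat) for lon, lat in points]

# An unseen object is classified by which side of the surface it falls on.
print(weights, classify(weights, 2.5, 0.5))
```

The perceptron stops at the first line that separates the training objects; it makes no attempt to pick the best of the many lines that would.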

Key Concept – Overfitting, Underfitting

From the book “A Gentle Introduction to Support Vector Machines in Biomedicine” (Statnikov, Aliferis, Hardin, Guyon)

Two problems: Over-fitting & Under-fitting

Over-fitting (a model to your data) = building a model that fits the original data well but fails to generalize to new/unseen data.

Under-fitting (a model to your data) = building a model that fits poorly on both the original data and new/unseen data.

Over/under-fitting are related to the complexity of the decision surface and how well the training data is fit.

[Figure series: Outcome of Interest Y plotted against Predictor X, showing Training Data and Future Data.]

Scenario 1: one candidate line is good; the other overfits.

Scenario 2: one candidate line is good; the other underfits.

Very important concept…

Successful data analysis methods balance training data fit with complexity.

• Too complex a signature (built to fit the training data well) leads to overfitting (i.e., the signature does not generalize).

• Too simplistic a signature (built to avoid overfitting) leads to underfitting (it will generalize, but the fit to both the training and future data will be low and predictive performance poor).
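The balance between fit and complexity can be seen on a toy problem. The sketch below compares three models of increasing complexity on made-up data (an underlying trend of Y = 2X with noise of +1 or -1); the data, the model choices, and all function names are illustrative assumptions, not taken from the slides.

```python
def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_constant(xs, ys):
    """Too simple: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Least-squares straight line."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return lambda x: mean_y + slope * (x - mean_x)

def fit_memorizer(xs, ys):
    """Too complex: reproduce the outcome of the nearest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda pair: abs(pair[0] - x))[1]

# Made-up data: Y = 2X plus noise of +1 or -1.
train_x, train_y = [0, 1, 2, 3, 4, 5], [1, 1, 5, 5, 9, 9]
future_x, future_y = [0.4, 1.4, 2.4, 3.4, 4.4], [-0.2, 3.8, 3.8, 7.8, 7.8]

errors = {}
for name, fit in [("constant", fit_constant),
                  ("line", fit_line),
                  ("memorizer", fit_memorizer)]:
    model = fit(train_x, train_y)
    errors[name] = (mean_squared_error(model, train_x, train_y),
                    mean_squared_error(model, future_x, future_y))
    print(name, "train:", errors[name][0], "future:", errors[name][1])
```

On this data the memorizer has zero training error but a worse future error than the line (overfitting), while the constant model is poor on both (underfitting).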

Key Concept – Performance Estimation

From the book “A Gentle Introduction to Support Vector Machines in Biomedicine” (Statnikov, Aliferis, Hardin, Guyon)

On estimation of classifier accuracy

[Figure: the data divided into train and test portions, shown for a single split and for repeated splits.]

Large sample case: use hold-out validation

Small sample case: use N-fold cross-validation

Other versions of this general notion…

Leave one out cross validation

Leave pair out cross validation

Bootstrap

Single Holdout
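A minimal sketch of how hold-out and N-fold splits might be generated, using only index lists. The 25% test fraction, the fixed shuffle seed, and the function names are assumptions made for illustration.

```python
import random

def hold_out_split(n_samples, test_fraction=0.25, seed=0):
    """Single hold-out: one shuffled split into train and test indices."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]

def n_fold_splits(n_samples, n_folds, seed=0):
    """N-fold cross-validation: every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        splits.append((train_idx, test_idx))
    return splits

splits = n_fold_splits(n_samples=72, n_folds=5)
for train_idx, test_idx in splits:
    print(len(train_idx), "train /", len(test_idx), "test")

# Leave-one-out is the special case with as many folds as samples.
leave_one_out = n_fold_splits(n_samples=72, n_folds=72)
```

The reported performance is then the average of the test-fold scores, so that no sample is ever scored by a model that saw it during training.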

Key Concept – The Support Vector Machine

From the book “A Gentle Introduction to Support Vector Machines in Biomedicine” (Statnikov, Aliferis, Hardin, Guyon)

The Support Vector Machine (SVM) approach for building molecular signatures

The support vector machine (SVM) is a binary classification algorithm.

SVMs are important because of (a) theoretical reasons:

- Robust to a very large number of variables and small samples

- Can learn both simple and highly complex classification models

- Employ sophisticated mathematical principles to avoid overfitting

and (b) superior empirical results.

Main ideas of SVMs (1/3)

[Figure: cancer patients and normal patients plotted by Gene X and Gene Y.]

• Consider an example dataset described by 2 genes, gene X and gene Y.

• Represent patients geometrically (by “vectors”).

• Find a linear decision surface (“hyperplane”) that separates the patient classes and has the largest distance (i.e., largest “gap” or “margin”) between borderline patients (i.e., “support vectors”).

[Figure: cancer patients and normal patients plotted by Gene X and Gene Y, with the gap between the classes marked.]
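To make the “gap” concrete, the sketch below measures the margin of two candidate hyperplanes that both separate a made-up two-gene dataset. It only compares given candidates; an actual SVM solves an optimization problem to find the hyperplane with the largest margin. The data and the two hyperplanes are illustrative assumptions.

```python
from math import hypot

def margin(weights, bias, points, labels):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    A positive value means every point lies on its correct side.
    """
    norm = hypot(*weights)
    return min(
        label * (sum(w * x for w, x in zip(weights, point)) + bias) / norm
        for point, label in zip(points, labels)
    )

# Made-up (gene X, gene Y) expression values: cancer = +1, normal = -1.
patients = [(3, 3), (4, 3), (4, 4), (1, 1), (1, 2), (2, 1)]
labels = [1, 1, 1, -1, -1, -1]

# Two hypothetical hyperplanes that both separate the classes.
diagonal = margin((1, 1), -4.5, patients, labels)  # gene X + gene Y = 4.5
vertical = margin((1, 0), -2.5, patients, labels)  # gene X = 2.5
print(diagonal, vertical)
```

Both candidates classify every patient correctly, but the diagonal one leaves more room between the classes; the patients closest to it play the role of support vectors.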



• If no such linear decision surface exists, the data are mapped into a much higher-dimensional space (“feature space”) where a separating decision surface is found.

• The feature space is constructed via a very clever mathematical projection (“kernel trick”).

[Figure: cancer and normal patients in the Gene X / Gene Y space, mapped by a kernel to a space where a decision surface separates them.]
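A small sketch of the idea behind the kernel trick, using the degree-2 polynomial kernel as an assumed example (the slides do not specify a kernel). It checks that the kernel value computed in the original space equals a dot product in an explicit feature space, and that made-up data with no separating line become separable after the mapping.

```python
from math import sqrt, isclose

def feature_map(point):
    """Explicit map of a 2-D point into a 3-D feature space."""
    x, y = point
    return (x * x, sqrt(2) * x * y, y * y)

def polynomial_kernel(a, b):
    """Degree-2 polynomial kernel (a . b)^2, computed in the original space."""
    return (a[0] * b[0] + a[1] * b[1]) ** 2

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

a, b = (1.0, 2.0), (3.0, -1.0)
explicit = dot(feature_map(a), feature_map(b))  # computed in feature space
implicit = polynomial_kernel(a, b)              # never builds the features
print(explicit, implicit)

# Made-up patients: "normal" near the origin, "cancer" farther out.
# One normal point lies inside the triangle of cancer points, so no
# straight line separates the classes in (gene X, gene Y).
normal = [(0.5, 0.0), (0.0, -0.5), (-0.4, 0.3)]
cancer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]

def squared_radius(point):
    """x^2 + y^2, which is a linear function of the mapped features."""
    features = feature_map(point)
    return features[0] + features[2]

# The plane features[0] + features[2] = 1 separates them in feature space.
separable = (all(squared_radius(p) < 1 for p in normal)
             and all(squared_radius(p) > 1 for p in cancer))
print(separable)
```

The point of the trick is that an SVM only ever needs dot products between examples, so it can work with `polynomial_kernel` directly and never construct the higher-dimensional vectors.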


Key Concept – Curse of Dimensionality

Thanks to Dr. Gutierrez-Osuna: http://courses.cs.tamu.edu/rgutier/cs790_w02/l5.pdf

Curse of Dimensionality (1/3)

Curse of Dimensionality (2/3)

Curse of Dimensionality (3/3)

The number of features in high-dimensional data ranges widely:

• 10,000-50,000 (gene expression microarrays, aCGH, and early SNP arrays)

• >500,000 (exon arrays/tiled microarrays/SNP arrays)

• 10,000-300,000 (MS proteomics)

• >10,000,000 (LC-MS proteomics)

• >100,000,000 (next-generation sequencing)

High Dimensionality in Small Samples Causes

• Some methods do not run at all (classical regression)

• Some methods give bad results (KNN, Decision trees)

• Very slow analysis

• Very expensive/cumbersome clinical application

• Tends to “overfit”
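One way to see why small samples struggle in high dimensions is to count how thinly a fixed number of samples is spread. The sketch below assumes each feature is divided into 3 equal bins (an arbitrary choice for illustration) and uses 72 samples, the size of the leukemia case study in these slides.

```python
def grid_cells(bins_per_feature, n_features):
    """Number of cells when each feature is divided into equal bins."""
    return bins_per_feature ** n_features

def samples_per_cell(n_samples, bins_per_feature, n_features):
    """Average number of samples available to describe each cell."""
    return n_samples / grid_cells(bins_per_feature, n_features)

n_samples = 72  # sample count of the leukemia case study
for n_features in (1, 2, 3, 10):
    print(n_features,
          grid_cells(3, n_features),
          samples_per_cell(n_samples, 3, n_features))
```

The number of cells grows exponentially with the number of features, so with thousands of genes almost every region of the space contains no training samples at all.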

Cancer Classification Case Study

From Golub et al. (1999)

Case Study

Classify gene microarray samples according to leukemia type (AML or ALL).

Task meta-data: 72 samples (47 ALL, 25 AML); 5,327 genes.

Labeled Microarrays

Treatment

AML 25

ALL 47

Encode Microarray

Within each train fold, normalize the values of each column between 0 and 1.

Notice that we don’t normalize the entire dataset and then run our classification algorithms: that would let information from the test data leak into training and bias the performance estimate, a form of overfitting.
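A minimal sketch of per-fold normalization: the column minima and maxima are learned from the training fold only and then applied unchanged to the test fold. The function names and the tiny two-column dataset are hypothetical.

```python
def fit_min_max(train_rows):
    """Learn (min, max) for each column from the training fold only."""
    return [(min(column), max(column)) for column in zip(*train_rows)]

def apply_min_max(rows, ranges):
    """Scale rows with previously learned ranges; constant columns map to 0."""
    return [
        [(value - low) / (high - low) if high > low else 0.0
         for value, (low, high) in zip(row, ranges)]
        for row in rows
    ]

# Made-up expression values: rows are samples, columns are genes.
train_fold = [[10, 200], [20, 400], [30, 300]]
test_fold = [[25, 500]]

ranges = fit_min_max(train_fold)            # uses training data only
train_scaled = apply_min_max(train_fold, ranges)
test_scaled = apply_min_max(test_fold, ranges)
print(train_scaled, test_scaled)
```

Test values can fall outside the 0 to 1 range, as the second column does here; that is expected, because the test fold played no part in choosing the ranges.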

Build a Model - Support Vector Machine

[Figure: scatter plot of points in a two-dimensional space.]

This example illustrates a two-dimensional space. The x and y axes each represent one word. A full text categorization example could contain upwards of 50,000 words and thus 50,000 dimensions.

Build a Model – K nearest neighbors

http://mines.humanoriented.com/classes/2010/fall/csci568/portfolio_exports/lguo/knn.html

Build a Model – Neural Network

http://en.wikipedia.org/wiki/Artificial_neural_network

Estimate Performance

[Figure: the data divided repeatedly into train and test portions.]

Small sample case: use N-fold cross-validation

Results

Method                         Proportion of Correct Classifications
Baseline (All in one class)    65.0%
Support Vector Machine         91.7%
K Nearest Neighbors            87.9%
Neural Network                 84.7%

http://bib.oxfordjournals.org/content/7/1/86.full.pdf+html

Conclusions


Key Concepts

• Supervised Learning

• Overfitting/Underfitting

• Support Vector Machines

• Cross Validation

• Curse of Dimensionality

Case Study – Cancer Classification

Thanks.

Dr. Gutierrez-Osuna

Dr. Alexander Statnikov


Molecular Signatures

Slides from Dr. Alexander Statnikov, PhD.

Definition of a molecular signature

A molecular signature is a computational or mathematical model that links high-dimensional molecular information to a phenotype or other response variable of interest.

Example of a molecular signature

[Figure: patient with lung cancer, biopsy, gene expression profile, molecular signature; outcomes: Primary Lung Cancer, Metastatic Lung Cancer.]

Main uses of molecular signatures

1. Direct benefits: Models of disease phenotype/clinical outcome

• Diagnosis

• Prognosis, long-term disease management

• Personalized treatment (drug selection, titration)

2. Ancillary benefits 1: Biomarkers for diagnosis or outcome prediction

• Make the above tasks resource efficient and easy to use in clinical practice

• Helps next-generation molecular imaging

• Leads for potential new drug candidates

3. Ancillary benefits 2: Discovery of structure & mechanisms (regulatory/interaction networks, pathways, sub-types)

• Leads for potential new drug candidates

Recent molecular signatures available for patient care

• OvaSure

• Agendia

• Clarient

• Prediction Sciences

• Veridex

• LabCorp

• University Genomics

• Genomic Health

• BioTheranostics

• Applied Genomics

• Power3

• Correlogic Systems

Prostate cancer signatures in the market

MammaPrint

• Developed by Agendia (www.agendia.com)

• 70-gene signature to stratify women with breast cancer that hasn’t spread into “low risk” and “high risk” for recurrence of the disease

• Independently validated in >1,000 patients

• So far performed >10,000 tests

• Cost of the test is ~$3,000

• In February 2007 the FDA cleared the MammaPrint test for marketing in the U.S. for node-negative women under 61 years of age with tumors of less than 5 cm.

• TIME Magazine’s 2007 “medical invention of the year”.

http://www.agendia.com/

Oncotype DX Breast Cancer Assay (Launched in 2004)

Developed by Genomic Health (www.genomichealth.com)

21-gene signature to predict whether a woman with localized, ER+ breast cancer is at risk of relapse

Independently validated in thousands of patients

So far performed >200,000 tests

Price of the test is $4,175

Not FDA approved but covered by most insurers, including Medicare

Its sales in 2012 reached $199M.


http://www.genomichealth.com/

Economic validity

In a 2005 economic analysis of the Recurrence Score result in LN-, ER+ patients receiving tamoxifen, Hornberger et al. performed a cost-utility analysis using a decision analytic model. Using the model, the Recurrence Score result was predicted on average to increase quality-adjusted survival by 16.3 years and reduce overall costs by $155,128.

Instead of using the model, economic benefits can now be assessed from the published clinical utility of the test and actual health plan costs for adjuvant chemotherapy. For example, in a 2 million member plan, approximately 773 women are eligible for the test. If half receive the test, given the high and increasing cost of adjuvant chemotherapy, supportive care and management of adverse events, the use of the Oncotype DX assay is estimated to save approximately $1,930 per woman tested (given an aggregate 34% reduction in chemotherapy use).
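The plan-level figures in the paragraph above imply a total saving that is not stated on the slide. The short calculation below derives it from the quoted numbers; the totals are derived here for illustration and are not reported in the source.

```python
eligible_women = 773        # per 2 million plan members (from the text)
fraction_tested = 0.5       # "if half receive the test"
savings_per_test = 1930     # dollars saved per woman tested (from the text)
plan_members = 2_000_000

women_tested = eligible_women * fraction_tested
total_savings = women_tested * savings_per_test      # derived, not in the source
savings_per_member = total_savings / plan_members    # derived, not in the source

print(f"{women_tested:.1f} women tested")
print(f"${total_savings:,.0f} saved across the plan")
print(f"${savings_per_member:.2f} saved per plan member")
```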

References about health benefits and cost-effectiveness:

“Economic Analysis of Targeting Chemotherapy Using a 21-Gene RT-PCR Assay in Lymph Node-Negative, Estrogen Receptor-Positive, Early-Stage Breast Cancer” Am J Manag Care. 2005; 11(5):313-324.

“Impact of a 21-Gene RT-PCR Assay on Treatment Decisions in Early-Stage Breast Cancer, An Economic Analysis Based on Prognostic and Predictive Validation Studies” Cancer. 2007; 109(6):1011-1018.


Oncotype DX Colon Cancer Assay (Launched in 2010)

Multigene signature to predict risk of recurrence in patients with stage II colon cancer

Independently validated in thousands of patients

Oncotype DX Prostate Cancer Assay (Launched in 2013)

Multigene signature to distinguish aggressive prostate cancer from less threatening disease

Independently validated

Oncotype DX Business Metrics

Data from http://investor.genomichealth.com/

Conclusions


Key Concepts

• Supervised Learning

• Overfitting/Underfitting

• Support Vector Machines

• Cross Validation

• Curse of Dimensionality

Case Study – Cancer Classification

Case Study – Molecular Signatures

Thanks.

Dr. Gutierrez-Osuna

Dr. Alexander Statnikov
