Applied Machine Learning in Biomedicine: Lecture 1
Course details
Tue-Thu 14.30-16.00 Room 318
May 24th through June 7th
Contact [email protected]
Exam: project assignment
Cancer detection
Face detection
How would you detect a face?
How does album software tag your friends?
What do we do?
Speech recognition
Brain-computer interface
Recommender systems
Amazon, Netflix, Spotify tell you what you might like
The Netflix Prize was an open competition: predict user ratings for films from previous ratings alone, with no other information about the users or the films.
The US$1,000,000 grand prize went to the BellKor's Pragmatic Chaos team, which bested Netflix's own rating-prediction algorithm by 10.06%.
Prediction systems
Just ahead of the kickoff of Season 6 of the television series "Game of Thrones," computer science students at the Technical University of Munich implemented a project that answers questions preoccupying fans of the series: Has Jon Snow survived Season 5? Who is going to die next?
The students used an array of machine learning algorithms to answer these questions.
The algorithm, which accurately predicted 74 percent of character deaths in the show and books, has many surprises in store, placing a number of characters thought to be relatively safe in grave danger.
The age of big data
"Every day, people create the equivalent of 2.5 quintillion bytes of data from sensors, mobile devices, online transactions, and social networks; so much that 90 percent of the world's data has been generated in the past two years."
The Huffington Post: Arnal Dayaratna: IBM Releases Big Data
CERN collider: 320×10^12 bytes/s
Personal connectome: 10^18 bytes/person
10^9 messages/day
30×10^6 messages/day
Mining brain networks
The role of machine learning
Design and analyze algorithms that
- improve their performance
- at some task
- with experience

Data (experience) → Learning algorithm → Knowledge (performance on task)
ImageNet challenge
Kaggle challenge
$100,000 prize
35,000 retinal images, 4 diabetic retinopathy (DR) classes
Challenges
• Quantitative modelling
• Noise or information?
• Simple or complex models?
• Test learning before deploying to the real world
• Acquire reliable domain knowledge
Machine learning in biomedicine
Usually extreme conditions:
Very few samples (with respect to the problem)
Very large number of descriptors per sample
Very large amount of noise/uncertainty
Usually critical consequences:
Results might guide decision making
Results might shape the understanding of phenomena
Categories
– Supervised learning
classification, regression
– Unsupervised learning
Density estimation, clustering, dimensionality reduction
– Semisupervised learning
– Active learning
– Reinforcement learning
– …
Supervised learning
Feature space $\mathcal{X}$, target space $\mathcal{Y}$
Given $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$, predict $\hat{Y} = f(X)$
Supervised learning: examples ($y = f(x)$, from feature space $\mathcal{X}$ to target space $\mathcal{Y}$)
- Classification: gene expression → discrete labels (normal, metaplastic, benign neoplastic, malignant neoplastic)
- Regression: demographic and clinical data → continuous labels (CHD risk score)
Typical machine learning workflow
Training data (past) + domain knowledge + expert knowledge → Training → Model (learned knowledge); the model is then applied to unknown data (future) to produce the output.
Bayesian framework: MLE vs MAP
• Maximum likelihood:
choose the value that maximizes the probability of the observed data

$\theta_{MLE} = \arg\max_\theta P(D \mid \theta)$

• Maximum a posteriori:
choose the value that is most probable given the observed data and the prior belief

$\theta_{MAP} = \arg\max_\theta P(\theta \mid D) = \arg\max_\theta P(D \mid \theta)\, P(\theta)$
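To make the distinction concrete, here is a minimal MATLAB sketch (not from the slides) estimating the bias of a coin; the flip counts and the Beta prior pseudo-counts are illustrative assumptions.

% MLE vs MAP for a coin's head probability theta (illustrative numbers).
n = 10; h = 7;                               % assumed data: 7 heads in 10 flips
a = 2; b = 2;                                % assumed Beta(a,b) prior on theta
theta_mle = h / n;                           % arg max P(D|theta) = 0.700
theta_map = (h + a - 1) / (n + a + b - 2);   % arg max P(D|theta)P(theta) = 0.667
fprintf('MLE %.3f, MAP %.3f\n', theta_mle, theta_map);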
Decision theory
You have a chest X-ray of a patient; you must decide whether the distribution of intensities is compatible with malignant lung nodules or not.
Suppose you can summarize the intensity distribution with a small number (possibly one) of measures.
Decision theory
Formalization:
$X$: intensity measures from the image
$C_1$: the normal X-ray class
$C_2$: the class with lung nodules present
Joint distribution: $p(X, C)$
Decision theory
Is it really necessary to determine $p(X, C)$?
For a given $x$, determine the optimal class $C_k$ (cancer or no cancer):

$p(C_k \mid x) = \dfrac{p(x \mid C_k)\, p(C_k)}{p(x)}$
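A hedged sketch of this computation; the Gaussian class-conditionals and the priors below are invented for illustration, not the lecture's data.

% Posterior from class-conditionals and priors via Bayes' formula.
x = 2.0;                                     % one intensity measure
prior = [0.7, 0.3];                          % assumed p(C1), p(C2)
gauss = @(x, mu, s) exp(-(x - mu).^2 ./ (2*s.^2)) ./ (s * sqrt(2*pi));
lik = [gauss(x, 0, 1), gauss(x, 3, 1)];      % assumed p(x|C1), p(x|C2)
post = lik .* prior / sum(lik .* prior);     % p(Ck|x), sums to 1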
Minimum misclassification rate

$p(x, C_k) = p(C_k \mid x)\, p(x)$

Optimal decision: assign $x$ to the class for which $p(C_k \mid x)$ is largest
Penalizing prediction errors
Expected prediction error
$p(\text{mistake}) \sim EPE(f) = E\big[(Y - f(X))^2\big] = \int (y - f(x))^2\, p(x, y)\, dx\, dy$

$= E_X \big[ E_{Y|X} \big[ (Y - f(X))^2 \mid X \big] \big]$

using $p(x, y) = p(y \mid x)\, p(x)$
Penalizing prediction errors
Expected prediction error
$p(\text{mistake}) \sim EPE(f) = E\big[\mathbb{1}(C \neq f(X))\big] = \sum_k \int \mathbb{1}\big(f(x) \neq \mathcal{C}_k\big)\, p(x, \mathcal{C}_k)\, dx$

$= E_X \big[ E_{C|X} \big[ \mathbb{1}(C \neq f(X)) \mid X \big] \big]$

using $p(x, \mathcal{C}_k) = p(\mathcal{C}_k \mid x)\, p(x)$
Weighting the loss
Confusion matrix (rows: estimated diagnosis, columns: true diagnosis)

              True: Cancer   True: Normal
Est. Cancer        60             40
Est. Normal         5             95

              True: Cancer   True: Normal
Est. Cancer        TP             FP
Est. Normal        FN             TN
Weighting the loss
(rows: estimated diagnosis, columns: true diagnosis)

               True: Positive   True: Negative
Est. Positive        TP               FP
Est. Negative        FN               TN

$\text{Accuracy} = \dfrac{TP + TN}{TP + FN + FP + TN}$

$\text{Sensitivity} = \dfrac{TP}{TP + FN}$

$\text{Specificity} = \dfrac{TN}{TN + FP}$
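A small sketch computing these metrics from the confusion matrix above, assuming the layout rows = estimated, columns = true.

CM = [60 40; 5 95];                             % [TP FP; FN TN] from the slide
TP = CM(1,1); FP = CM(1,2); FN = CM(2,1); TN = CM(2,2);
accuracy    = (TP + TN) / (TP + FN + FP + TN);  % 155/200 = 0.775
sensitivity = TP / (TP + FN);                   % 60/65  ~ 0.923
specificity = TN / (TN + FP);                   % 95/135 ~ 0.704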
Weighting the loss
Confusion matrix (rows: estimated diagnosis, columns: true diagnosis)

              True: Cancer   True: Normal
Est. Cancer        60             40
Est. Normal         5             95

Loss matrix (same layout)

              True: Cancer   True: Normal
Est. Cancer         0            1000
Est. Normal         1              0

$L = \begin{pmatrix} L_{11} & L_{21} \\ L_{12} & L_{22} \end{pmatrix}$
Expected Loss
Expected prediction error
$p(\text{mistake}) \sim EPE(\hat{C}) = E\big[L(C, \hat{C}(X))\big] = E_X \sum_{k=1}^{K} L\big(\mathcal{C}_k, \hat{C}(X)\big)\, P(\mathcal{C}_k \mid X)$

$\hat{C}(x) = \arg\min_{c \in \mathcal{C}} \sum_{k=1}^{K} L(\mathcal{C}_k, c)\, P(\mathcal{C}_k \mid X = x)$
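A minimal sketch of this decision rule, assuming rows of the loss matrix index the true class and columns the decision; the posteriors are hypothetical.

L = [0 1000; 1 0];             % assumed: L(k,j) = loss of deciding j when truth is k
post = [0.8, 0.2];             % hypothetical P(Ck|x), k = cancer, normal
exp_loss = post * L;           % expected loss of each decision: [0.2, 800]
[~, c_hat] = min(exp_loss);    % c_hat = 1: declaring cancer is far safer here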
Bayes classifier (0-1 loss)
$\hat{C}(x) = \arg\min_{c \in \mathcal{C}} \sum_{k=1}^{K} L(\mathcal{C}_k, c)\, P(\mathcal{C}_k \mid X)$

With 0-1 loss this reduces to

$\hat{C}(x) = \mathcal{C}_k$ if $P(\mathcal{C}_k \mid X = x) = \max_{c \in \mathcal{C}} P(c \mid X = x)$

[Figure: the posteriors $P(\mathcal{C}_1 \mid X = x)$ and $P(\mathcal{C}_2 \mid X = x)$ as functions of $x$.]
Bayes classifier (0-1 loss): effect of the decision threshold on $EPE(\hat{C})$
- $\hat{C}(x) = \mathcal{C}_2$ if $x > 1$: $EPE(\hat{C}) = 0.893$
- $\hat{C}(x) = \mathcal{C}_2$ if $x > 3$: $EPE(\hat{C}) = 0.437$
- $\hat{C}(x) = \mathcal{C}_2$ if $x > 4$: $EPE(\hat{C}) = 0.892$
- $\hat{C}(x) = \mathcal{C}_2$ if $x > 2.5$: $EPE(\hat{C}) = 0.363$
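The EPE values above come from the lecture's figures; the sketch below reproduces the logic of the experiment with two hypothetical Gaussian classes (means, spreads and priors are my assumptions, so the numbers will differ).

xg = linspace(-5, 10, 5001);
p1 = 0.5 * exp(-(xg - 0).^2 / 2) / sqrt(2*pi);   % joint p(x, C1)
p2 = 0.5 * exp(-(xg - 5).^2 / 2) / sqrt(2*pi);   % joint p(x, C2)
for t = [1, 2.5, 3, 4]
    % EPE = C2 mass left of the threshold + C1 mass right of it
    epe = trapz(xg(xg <= t), p2(xg <= t)) + trapz(xg(xg > t), p1(xg > t));
    fprintf('threshold %.1f -> EPE %.3f\n', t, epe);
end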
Bayes classifier (generic loss)
$\min_R E[L] = \min_R \sum_k \sum_j \int_{R_j} L_{kj}\, p(x, C_k)\, dx$

$\Rightarrow$ for each $x$, choose the region $R_j$ that minimizes $\sum_k L_{kj}\, p(x, C_k)$,

equivalently $\min_j \sum_k L_{kj}\, p(C_k \mid x)\, p(x) \;\propto\; \min_j \sum_k L_{kj}\, p(C_k \mid x)$
Reject option: when the largest posterior $\max_k p(\mathcal{C}_k \mid x)$ falls below a confidence threshold, refuse to classify and defer the decision (e.g., to a human expert).
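A minimal sketch of the rule; the threshold value and the posteriors are illustrative.

theta = 0.9;                     % assumed rejection threshold
post = [0.55, 0.45];             % hypothetical posteriors P(Ck|x)
[pmax, c_hat] = max(post);
if pmax < theta
    c_hat = 0;                   % 0 = reject: defer the decision
end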
Classification approaches (1)
Generative models
1) Inference step:
use training data to learn a model for $p(X \mid C_k)$
use data to infer the priors $p(C_k)$
use Bayes' formula to find the posteriors $p(C_k \mid X)$
1b) Inference step:
model $p(X, C_k)$ directly and obtain the posteriors
2) Decision step:
use the posterior to make optimal assignments
Classification approaches (2)
Discriminative models
1) Inference step:
use training data to learn a model for 𝑝(𝐶𝑘|𝑋)
2) Decision step:
use the posterior to make optimal assignments
Classification approaches (3)
Discriminative function
Use training data to learn a discriminative function 𝑓(𝑥) directly mapping the input onto a class label
Oranges and Lemons
A two dimensional space
Stars and galaxies
Minor elliptical axis (y) against major elliptical axis (x) for stars (red) and galaxies (blue)
Coronary Heart Disease
Patients with (red) and without (blue) coronary heart disease in South Africa (Rousseauw et al., 1983)
Parametric model
A mapping from a set of inputs $x$ to a set of outputs $y$ (labels or values) is parametric if it depends on a set of (fixed) parameters $\boldsymbol{w}$:
1) The number of parameters is finite
2) The number of parameters is independent of the number of data points
Linear classifier
Straight cut (hyperplane) dividing input space 𝑿 into two
A linear classifier assumes
$y = f(\boldsymbol{X}) = \boldsymbol{w}^T \boldsymbol{X} + b$
is a linear function of $\boldsymbol{X}$, e.g.
$\boldsymbol{X} = \begin{pmatrix} \text{height} \\ \text{width} \end{pmatrix}, \qquad \boldsymbol{w} = \begin{pmatrix} w_1 \\ w_2 \end{pmatrix}$
The weight vector
Define the positive class region by $\boldsymbol{w}^T \boldsymbol{x}_i + b > 0$:

$w_1 x_{i1} + w_2 x_{i2} + \dots + b > 0, \qquad \sum_{d=1}^{D} w_d x_{id} + b > 0$

Setting $b = 0$, the decision boundary is $\boldsymbol{w}^T \boldsymbol{X} = 0$
Geometric meaning
[Figure: decision regions $\mathcal{R}_1$ (where $y > 0$) and $\mathcal{R}_2$ (where $y < 0$), separated by the boundary $y = 0$.]
Geometric meaning
$L = \{\boldsymbol{x} \mid f(\boldsymbol{x}) = \boldsymbol{w}^T \boldsymbol{x} + b = 0\}$

1) For any $\boldsymbol{x}_1$ and $\boldsymbol{x}_2$ in $L$, $\boldsymbol{w}^T(\boldsymbol{x}_1 - \boldsymbol{x}_2) = 0 \Rightarrow \boldsymbol{w}^* = \dfrac{\boldsymbol{w}}{\lVert\boldsymbol{w}\rVert}$ is the normal to $L$

2) For any $\boldsymbol{x}_0$ in $L$, $\boldsymbol{w}^T \boldsymbol{x}_0 = -b$

3) The signed distance of a point $\boldsymbol{x}$ to $L$ is:

$\boldsymbol{w}^{*T}(\boldsymbol{x} - \boldsymbol{x}_0) = \dfrac{1}{\lVert\boldsymbol{w}\rVert}\,(\boldsymbol{w}^T \boldsymbol{x} + b) = \dfrac{f(\boldsymbol{x})}{\lVert f'(\boldsymbol{x})\rVert}$
The weight vector
% ww = Dx1 weights
% Xstar = NxD test cases
y_pred = sign(Xstar*ww); % Nx1 predicted labels in {-1,+1}; bias omitted
Learning the weights
Rosenblatt’s Perceptron Learning
Perceptron criterion (summing over the set $\mathcal{M}$ of misclassified points):

$D(\boldsymbol{w}, b) = -\sum_{i \in \mathcal{M}} y_i\, (\boldsymbol{w}^T \boldsymbol{x}_i + b) = -\sum_{i \in \mathcal{M}} D_i(\boldsymbol{w}, b)$

$\dfrac{\partial D_i(\boldsymbol{w}, b)}{\partial \boldsymbol{w}} = -y_i \boldsymbol{x}_i, \qquad \dfrac{\partial D_i(\boldsymbol{w}, b)}{\partial b} = -y_i$

Stochastic gradient descent:

$\begin{pmatrix} \boldsymbol{w} \\ b \end{pmatrix}^{\tau+1} = \begin{pmatrix} \boldsymbol{w} \\ b \end{pmatrix}^{\tau} + \eta \begin{pmatrix} y_i \boldsymbol{x}_i \\ y_i \end{pmatrix}$
Learning the weights
% Rosenblatt's perceptron: cycle through the data, updating the
% weights on misclassified points until nothing changes.
% xx = NxD training cases, yy = Nx1 targets (-1,+1), ww = Dx1 weights
[N, D] = size(xx);
old_ww = [];
ww = zeros(D, 1);
while ~isequal(ww, old_ww)
    old_ww = ww;
    for ct = 1:N
        pred = sign(xx(ct,:)*ww);
        ww = ww + (yy(ct) - pred)*xx(ct,:)'; % no change if pred == yy(ct)
    end
end
$\hat{y} = \mathrm{sgn}(\boldsymbol{w}^T \boldsymbol{x}), \qquad \boldsymbol{w}^{\tau+1} = \boldsymbol{w}^{\tau} + (y - \hat{y})\, \boldsymbol{x}$
Learning the weights
Implementing the bias
What about $b$? A standard trick is to append a constant 1 to every input vector so that the bias becomes an extra weight $w_0$; the output of the perceptron keeps the same form $\hat{y} = \mathrm{sgn}(\boldsymbol{w}^T \boldsymbol{x})$.
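A sketch of this trick with toy numbers (variable names are mine); it is consistent with the leading 1 in the $\boldsymbol{x}_i$ vectors used later for regression.

xx = [1 2; -3 0.5; 0 -1];        % NxD example inputs (toy values)
ww = [0.8; -0.4]; b = 0.1;       % Dx1 weights and bias
Xa = [ones(size(xx,1),1), xx];   % append a constant feature of ones
wa = [b; ww];                    % the bias becomes weight w_0
y_pred = sign(Xa * wa);          % identical to sign(xx*ww + b)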
Linear classifier revisited
If the data are not linearly separable, one must
- extend the model
- add features
Nonlinear basis function
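One minimal sketch of the idea, assuming hand-picked quadratic basis functions: a linear classifier trained on the expanded features gives a nonlinear boundary in the original space.

xx = randn(20, 2);                       % NxD original inputs (synthetic)
phi = [xx, xx.^2, xx(:,1).*xx(:,2)];     % [x1 x2 x1^2 x2^2 x1*x2]
% a linear boundary in phi-space is a nonlinear (conic) boundary in x-space;
% train any linear classifier (perceptron, least squares) on phi as before.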
From model to no model
Faith in previous knowledge: strong assumptions on
- the data structure
- the shape of the separating boundary

Faith in the data: no assumption on the underlying structure; the data tell me everything I need
K-nearest neighbours classifier
Fix and Hodges, 1951
Decision boundaries
[Figure: decision boundaries of linear classification, the 1-nearest-neighbour classifier, and the 15-nearest-neighbour classifier.]
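A hedged sketch of the k-nearest-neighbour rule (Euclidean distance, majority vote); the data and variable names are illustrative.

k = 15;
xtrain = randn(100, 2);                          % NxD training inputs (synthetic)
ytrain = sign(xtrain(:,1) + xtrain(:,2));        % Nx1 labels in {-1,+1}
xstar = [0.3, -0.1];                             % one test point
d2 = sum(bsxfun(@minus, xtrain, xstar).^2, 2);   % squared distances to xstar
[~, idx] = sort(d2);                             % neighbours by increasing distance
y_pred = sign(sum(ytrain(idx(1:k))));            % majority vote of the k closest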
Brain MRI application
MICCAI MS lesion challenge 2008: http://www.ia.unc.edu/MSseg/index.html
LANDSAT application
Identification via gait analysis
Nowlan 2009; Choi 2014
Characterize each person by the way they move:
the gait signature
Parametric vs non-parametric
• Parametric: start by assuming the decision boundary is a plane
• Non-parametric KNN has no fixed assumption: boundaries get more complex with more data
• Non-parametric methods may need more data and can be computationally intensive
Batch supervised learning
Given: example inputs and targets (training set)
Task: predict the targets for new inputs (test set)
Examples:
- classification (binary or multi-class)
- regression
- ordinal regression
- Poisson regression
- ranking
- …
Batch supervised learning
• Many ways of mapping inputs to outputs
• How do we choose what to do?
• How do we know if we are doing well?
Algorithm’s objective cost
Formal objective for algorithms:
- minimize a cost function
- maximize an objective function

Proving convergence:
- does the objective monotonically improve?

Considering alternatives:
- does another algorithm score better?
Loss function
We want to specify the objective of an algorithm
One idea: consider a loss function $L(\hat{y}(\boldsymbol{x}^*); y^*)$
We would like to minimize the loss at test time.
Minimizing the empirical loss might be a reasonable proxy:

$\sum_i L(\hat{y}(\boldsymbol{x}_i); y_i)$
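For instance, with 0-1 loss the empirical loss is just the fraction of training mistakes (toy vectors below):

y_true = [1 -1 1 1 -1];            % illustrative targets
y_hat  = [1 1 1 -1 -1];            % illustrative predictions
emp_loss = mean(y_hat ~= y_true);  % fraction of mistakes: 2/5 = 0.4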
Choosing a loss function
• Motivated by the application– 0-1 error, achieving a tolerance, business cost
• Computational convenience:– Differentiability, convexity
• Beware of loss dominated by artifacts:– Outliers
– Unbalanced classes
A step into linear regression
Find a linear function
$\hat{y} = \boldsymbol{w}^T \boldsymbol{x}$
that approximates the mapping
$\boldsymbol{x} \to y$
A step into linear regression
Find a linear function
$\hat{y} = \boldsymbol{w}^T \boldsymbol{x}$
that minimizes the sum of squared residuals from $y$:

$RSS(\boldsymbol{w}) = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{N} \Big( y_i - b - \sum_{j=1}^{p} x_{ij} w_j \Big)^2$
Vector form for RSS
$\boldsymbol{x}_i = \begin{pmatrix} 1 \\ x_{i1} \\ x_{i2} \\ \vdots \\ x_{iD} \end{pmatrix}$, e.g. $\boldsymbol{x}_i = \begin{pmatrix} 1 \\ \text{age} \\ \text{BMI} \\ \vdots \\ \text{glycemia} \end{pmatrix}$

$\boldsymbol{X} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1D} \\ 1 & x_{21} & \cdots & x_{2D} \\ 1 & x_{31} & \cdots & x_{3D} \\ \vdots & & & \vdots \\ 1 & x_{N1} & \cdots & x_{ND} \end{pmatrix} = \begin{pmatrix} \boldsymbol{x}_1^T \\ \boldsymbol{x}_2^T \\ \boldsymbol{x}_3^T \\ \vdots \\ \boldsymbol{x}_N^T \end{pmatrix}, \qquad \boldsymbol{w} = \begin{pmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_D \end{pmatrix}$
Least squares estimation
$RSS(\boldsymbol{w}) = (\boldsymbol{y} - \boldsymbol{X}\boldsymbol{w})^T (\boldsymbol{y} - \boldsymbol{X}\boldsymbol{w})$

$\dfrac{\partial RSS}{\partial \boldsymbol{w}} = -2\, \boldsymbol{X}^T (\boldsymbol{y} - \boldsymbol{X}\boldsymbol{w})$

$\dfrac{\partial^2 RSS}{\partial \boldsymbol{w}\, \partial \boldsymbol{w}^T} = 2\, \boldsymbol{X}^T \boldsymbol{X}$

Setting $-2\boldsymbol{X}^T(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{w}) = 0$ gives $\hat{\boldsymbol{w}} = (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{y}$
Geometry of least squares
$\hat{y} = \hat{\boldsymbol{w}}^T \boldsymbol{x} = \boldsymbol{x}^T (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{y}$

The $D+1$ columns of $\boldsymbol{X}$ span a subspace of $\mathbb{R}^N$.
The closest point to $\boldsymbol{y}$ in this subspace is its orthogonal projection.
The orthogonal projection is given by

$\boldsymbol{y} \approx \hat{\boldsymbol{y}} = \boldsymbol{X} (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{y}$
Least squares estimation

% ww = Dx1 weights
% X = NxD design matrix (training cases)
% Y = Nx1 targets
ww = X\Y; % the backslash operator solves the least-squares problem min ||X*ww - Y||^2

$\hat{\boldsymbol{Y}} = \boldsymbol{X}\boldsymbol{w}$
Least squares estimation (2)
If we want to minimize the RSS

$L(\boldsymbol{w}^T \boldsymbol{x}; y) = (\boldsymbol{X}\boldsymbol{w} - \boldsymbol{Y})^T (\boldsymbol{X}\boldsymbol{w} - \boldsymbol{Y})$

we can use an iterative scheme with the gradient-descent update:

$\boldsymbol{w}^{\tau+1} = \boldsymbol{w}^{\tau} - \eta\, \nabla L(\boldsymbol{w}^T \boldsymbol{x}; y)$

$\nabla L(\boldsymbol{w}^T \boldsymbol{x}; y) = \sum_{i=1}^{N} (\boldsymbol{w}^T \boldsymbol{x}_i - y_i)\, \boldsymbol{x}_i = \boldsymbol{X}^T \boldsymbol{X} \boldsymbol{w} - \boldsymbol{X}^T \boldsymbol{Y}$
Least squares estimation (2)

$\boldsymbol{w}^{\tau+1} = \boldsymbol{w}^{\tau} - \eta\, \boldsymbol{X}^T (\boldsymbol{X}\boldsymbol{w}^{\tau} - \boldsymbol{Y}) = \boldsymbol{w}^{\tau} - \eta\, \boldsymbol{X}^T (\hat{\boldsymbol{Y}} - \boldsymbol{Y})$

1. Initialize $\boldsymbol{w}^0$
2. Update
3. Check a termination condition:
a) $\boldsymbol{w}^{\tau+1} = \boldsymbol{w}^{\tau}$
b) $|\boldsymbol{w}^{\tau+1} - \boldsymbol{w}^{\tau}| < \varepsilon$
c) $\tau > T$
d) $\max |\nabla L| < \varepsilon$
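A runnable sketch of this scheme on synthetic data; the step size $\eta$, the tolerance and the iteration cap are illustrative choices, not the lecture's.

X = [ones(50,1), randn(50,2)];           % Nx(D+1) design matrix, bias column first
Y = X * [1; 2; -0.5] + 0.1*randn(50,1);  % synthetic targets from known weights
eta = 1e-3; epsilon = 1e-8; T = 1e5;
ww = zeros(size(X,2), 1);                % 1. initialize w^0
for tau = 1:T
    grad = X' * (X*ww - Y);              % gradient of the RSS
    ww = ww - eta * grad;                % 2. update step
    if max(abs(grad)) < epsilon          % 3. termination condition (d)
        break;
    end
end
% ww now approximates the closed-form solution X\Y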
The importance of the step
[Figure: descent on the loss surface from $L(\boldsymbol{w}^0)$ toward $L(\hat{\boldsymbol{w}})$; the step size $\eta$ determines whether the iterates converge smoothly, oscillate, or diverge.]
Least squares classifier
Why not use linear least squares to fit a regressor on binary targets?

% fit yy = xx*ww
% ww = Dx1 weights
% xx = NxD training cases
% yy = Nx1 binary targets (-1,+1)
ww = xx\yy;

[Figure: the fitted least-squares boundary separating the regions $\hat{y} > 0$ and $\hat{y} < 0$.]