Applied Machine Learning in Biomedicine: Lecture 1
Course details
Tue-Thu 14.30-16.00 Room 318
May 24th through June 7th
Contact [email protected]
Exam: project assignment
Cancer detection
Face detection
How would you detect a face?
How does album software tag your friends?
What do we do?
Speech recognition
Brain-computer interface
Recommender systems
Amazon, Netflix, Spotify tell you what you might like
The Netflix Prize was an open competition: predict user ratings for films from previous ratings alone, with no other information about the users or the films.
The US$1,000,000 grand prize went to the BellKor's Pragmatic Chaos team, which bested Netflix's own rating-prediction algorithm by 10.06%.
Prediction systems
Just ahead of the kickoff of Season 6 of the television series "Game of Thrones," computer science students at the Technical University of Munich implemented a project that answers questions preoccupying fans of the series: Has Jon Snow survived Season 5? Who is going to die next?
The students used an array of machine learning algorithms to answer these questions.
The algorithm, which accurately predicted 74 percent of character deaths in the show and books, has many surprises in store, placing a number of characters thought to be relatively safe in grave danger.
The age of big data
"Every day, people create the equivalent of 2.5 quintillion bytes of data from sensors, mobile devices, online transactions, and social networks; so much that 90 percent of the world's data has been generated in the past two years."
The Huffington Post: Arnal Dayaratna: IBM Releases Big Data
CERN collider: 320×10^12 bytes/s
Personal connectome: 10^18 bytes/person
10^9 messages/day
30×10^6 messages/day
Mining brain networks
The role of machine learning
Design and analyze algorithms that
- improve their performance
- at some task
- with experience

Data (experience) → Learning algorithm → Knowledge (performance on task)
ImageNet challenge
Kaggle challenge
$100,000 prize
35,000 retinal images, 4 diabetic retinopathy (DR) classes
Challenges
• Quantitative modelling
• Noise or information?
• Simple or complex models?
• Test learning before deploying to the real world
• Acquire reliable domain knowledge
Machine learning in biomedicine
Usually extreme conditions:
Very few samples (with respect to the problem)
Very large number of descriptors per sample
Very large amount of noise/uncertainty
Usually critical consequences:
Results might guide decision making
Results might shape the understanding of phenomena
Categories
– Supervised learning
classification, regression
– Unsupervised learning
Density estimation, clustering, dimensionality reduction
– Semisupervised learning
– Active learning
– Reinforcement learning
– …
Supervised learning
Feature space $\mathcal{X}$, target space $\mathcal{Y}$
Given $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$, predict $\hat{Y} = f(X)$
Supervised learning: examples ($y = f(x)$, from feature space $\mathcal{X}$ to target space $\mathcal{Y}$)
- Classification: gene expression → discrete labels (normal, metaplastic, benign neoplastic, malignant neoplastic)
- Regression: demographic and clinical data → continuous labels (CHD risk score)
Typical machine learning workflow
Training data (past) + domain knowledge + expert knowledge → Training → Model (learned knowledge); the model is then applied to unknown data (future) to produce the output.
Bayesian framework: MLE vs MAP
• Maximum likelihood:
choose the value that maximizes the probability of the observed data

$\theta_{MLE} = \arg\max_\theta P(D \mid \theta)$

• Maximum a posteriori:
choose the value that is most probable given the observed data and the prior belief

$\theta_{MAP} = \arg\max_\theta P(\theta \mid D) = \arg\max_\theta P(D \mid \theta)\, P(\theta)$
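To make the distinction concrete, here is a minimal MATLAB sketch (not from the slides) estimating the bias of a coin; the flip counts and the Beta prior pseudo-counts are illustrative assumptions.

% MLE vs MAP for a coin's head probability theta (illustrative numbers).
n = 10; h = 7;                               % assumed data: 7 heads in 10 flips
a = 2; b = 2;                                % assumed Beta(a,b) prior on theta
theta_mle = h / n;                           % arg max P(D|theta) = 0.700
theta_map = (h + a - 1) / (n + a + b - 2);   % arg max P(D|theta)P(theta) = 0.667
fprintf('MLE %.3f, MAP %.3f\n', theta_mle, theta_map);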
Decision theory
You have a chest X-ray of a patient; you must decide whether the distribution of intensities is compatible with malignant lung nodules or not.
Suppose you can summarize the intensity distribution with a small number (possibly one) of measures.
Decision theory
Formalization:
$X$: intensity measures from the image
$C_1$: the normal X-ray class
$C_2$: the class with lung nodules present
Joint distribution: $p(X, C)$
Decision theory
Is it really necessary to determine $p(X, C)$?
For a given $x$, determine the optimal class $C_k$ (cancer or no cancer):

$p(C_k \mid x) = \dfrac{p(x \mid C_k)\, p(C_k)}{p(x)}$
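A hedged sketch of this computation; the Gaussian class-conditionals and the priors below are invented for illustration, not the lecture's data.

% Posterior from class-conditionals and priors via Bayes' formula.
x = 2.0;                                     % one intensity measure
prior = [0.7, 0.3];                          % assumed p(C1), p(C2)
gauss = @(x, mu, s) exp(-(x - mu).^2 ./ (2*s.^2)) ./ (s * sqrt(2*pi));
lik = [gauss(x, 0, 1), gauss(x, 3, 1)];      % assumed p(x|C1), p(x|C2)
post = lik .* prior / sum(lik .* prior);     % p(Ck|x), sums to 1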
Minimum misclassification rate

$p(x, C_k) = p(C_k \mid x)\, p(x)$

Optimal decision: assign $x$ to the class for which $p(C_k \mid x)$ is largest
Penalizing prediction errors
Expected prediction error
$p(\text{mistake}) \sim EPE(f) = E\big[(Y - f(X))^2\big] = \int (y - f(x))^2\, p(x, y)\, dx\, dy$

$= E_X \big[ E_{Y|X} \big[ (Y - f(X))^2 \mid X \big] \big]$

using $p(x, y) = p(y \mid x)\, p(x)$
Penalizing prediction errors
Expected prediction error
$p(\text{mistake}) \sim EPE(f) = E\big[\mathbb{1}(C \neq f(X))\big] = \sum_k \int \mathbb{1}\big(f(x) \neq \mathcal{C}_k\big)\, p(x, \mathcal{C}_k)\, dx$

$= E_X \big[ E_{C|X} \big[ \mathbb{1}(C \neq f(X)) \mid X \big] \big]$

using $p(x, \mathcal{C}_k) = p(\mathcal{C}_k \mid x)\, p(x)$
Weighting the loss
Confusion matrix (rows: estimated diagnosis, columns: true diagnosis)

              True: Cancer   True: Normal
Est. Cancer        60             40
Est. Normal         5             95

              True: Cancer   True: Normal
Est. Cancer        TP             FP
Est. Normal        FN             TN
Weighting the loss
(rows: estimated diagnosis, columns: true diagnosis)

               True: Positive   True: Negative
Est. Positive        TP               FP
Est. Negative        FN               TN

$\text{Accuracy} = \dfrac{TP + TN}{TP + FN + FP + TN}$

$\text{Sensitivity} = \dfrac{TP}{TP + FN}$

$\text{Specificity} = \dfrac{TN}{TN + FP}$
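A small sketch computing these metrics from the confusion matrix above, assuming the layout rows = estimated, columns = true.

CM = [60 40; 5 95];                             % [TP FP; FN TN] from the slide
TP = CM(1,1); FP = CM(1,2); FN = CM(2,1); TN = CM(2,2);
accuracy    = (TP + TN) / (TP + FN + FP + TN);  % 155/200 = 0.775
sensitivity = TP / (TP + FN);                   % 60/65  ~ 0.923
specificity = TN / (TN + FP);                   % 95/135 ~ 0.704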
Weighting the loss
Confusion matrix (rows: estimated diagnosis, columns: true diagnosis)

              True: Cancer   True: Normal
Est. Cancer        60             40
Est. Normal         5             95

Loss matrix (same layout)

              True: Cancer   True: Normal
Est. Cancer         0            1000
Est. Normal         1              0

$L = \begin{pmatrix} L_{11} & L_{21} \\ L_{12} & L_{22} \end{pmatrix}$
Expected Loss
Expected prediction error
$p(\text{mistake}) \sim EPE(\hat{C}) = E\big[L(C, \hat{C}(X))\big] = E_X \sum_{k=1}^{K} L\big(\mathcal{C}_k, \hat{C}(X)\big)\, P(\mathcal{C}_k \mid X)$

$\hat{C}(x) = \arg\min_{c \in \mathcal{C}} \sum_{k=1}^{K} L(\mathcal{C}_k, c)\, P(\mathcal{C}_k \mid X = x)$
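A minimal sketch of this decision rule, assuming rows of the loss matrix index the true class and columns the decision; the posteriors are hypothetical.

L = [0 1000; 1 0];             % assumed: L(k,j) = loss of deciding j when truth is k
post = [0.8, 0.2];             % hypothetical P(Ck|x), k = cancer, normal
exp_loss = post * L;           % expected loss of each decision: [0.2, 800]
[~, c_hat] = min(exp_loss);    % c_hat = 1: declaring cancer is far safer here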
Bayes classifier (0-1 loss)
$\hat{C}(x) = \arg\min_{c \in \mathcal{C}} \sum_{k=1}^{K} L(\mathcal{C}_k, c)\, P(\mathcal{C}_k \mid X)$

With 0-1 loss this reduces to

$\hat{C}(x) = \mathcal{C}_k$ if $P(\mathcal{C}_k \mid X = x) = \max_{c \in \mathcal{C}} P(c \mid X = x)$

[Figure: the posteriors $P(\mathcal{C}_1 \mid X = x)$ and $P(\mathcal{C}_2 \mid X = x)$ as functions of $x$.]
Bayes classifier (0-1 loss): effect of the decision threshold on $EPE(\hat{C})$
- $\hat{C}(x) = \mathcal{C}_2$ if $x > 1$: $EPE(\hat{C}) = 0.893$
- $\hat{C}(x) = \mathcal{C}_2$ if $x > 3$: $EPE(\hat{C}) = 0.437$
- $\hat{C}(x) = \mathcal{C}_2$ if $x > 4$: $EPE(\hat{C}) = 0.892$
- $\hat{C}(x) = \mathcal{C}_2$ if $x > 2.5$: $EPE(\hat{C}) = 0.363$
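The EPE values above come from the lecture's figures; the sketch below reproduces the logic of the experiment with two hypothetical Gaussian classes (means, spreads and priors are my assumptions, so the numbers will differ).

xg = linspace(-5, 10, 5001);
p1 = 0.5 * exp(-(xg - 0).^2 / 2) / sqrt(2*pi);   % joint p(x, C1)
p2 = 0.5 * exp(-(xg - 5).^2 / 2) / sqrt(2*pi);   % joint p(x, C2)
for t = [1, 2.5, 3, 4]
    % EPE = C2 mass left of the threshold + C1 mass right of it
    epe = trapz(xg(xg <= t), p2(xg <= t)) + trapz(xg(xg > t), p1(xg > t));
    fprintf('threshold %.1f -> EPE %.3f\n', t, epe);
end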
Bayes classifier (generic loss)
$\min_R E[L] = \min_R \sum_k \sum_j \int_{R_j} L_{kj}\, p(x, C_k)\, dx$

$\Rightarrow$ for each $x$, choose the region $R_j$ that minimizes $\sum_k L_{kj}\, p(x, C_k)$,

equivalently $\min_j \sum_k L_{kj}\, p(C_k \mid x)\, p(x) \;\propto\; \min_j \sum_k L_{kj}\, p(C_k \mid x)$
Reject option: when the largest posterior $\max_k p(\mathcal{C}_k \mid x)$ falls below a confidence threshold, refuse to classify and defer the decision (e.g., to a human expert).
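A minimal sketch of the rule; the threshold value and the posteriors are illustrative.

theta = 0.9;                     % assumed rejection threshold
post = [0.55, 0.45];             % hypothetical posteriors P(Ck|x)
[pmax, c_hat] = max(post);
if pmax < theta
    c_hat = 0;                   % 0 = reject: defer the decision
end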
Classification approaches (1)
Generative models
1) Inference step:
use training data to learn a model for $p(X \mid C_k)$
use data to infer the priors $p(C_k)$
use Bayes' formula to find the posteriors $p(C_k \mid X)$
1b) Inference step:
model $p(X, C_k)$ directly and obtain the posteriors
2) Decision step:
use the posterior to make optimal assignments
Classification approaches (2)
Discriminative models
1) Inference step:
use training data to learn a model for 𝑝(𝐶𝑘|𝑋)
2) Decision step:
use the posterior to make optimal assignments
Classification approaches (3)
Discriminative function
Use training data to learn a discriminative function 𝑓(𝑥) directly mapping the input onto a class label
Oranges and Lemons
A two dimensional space
Stars and galaxies
Minor elliptical axis (y) against major elliptical axis (x) for stars (red) and galaxies (blue)
Coronary Heart Disease
Patients with (red) and without (blue) coronary heart disease in South Africa (Rousseauw et al., 1983)
Parametric model
A mapping from a set of inputs $x$ to a set of outputs $y$ (labels or values) is parametric if it depends on a set of (fixed) parameters $\boldsymbol{w}$:
1) The number of parameters is finite
2) The number of parameters is independent of the number of data points
Linear classifier
Straight cut (hyperplane) dividing input space 𝑿 into two
A linear classifier assumes
$y = f(\boldsymbol{X}) = \boldsymbol{w}^T \boldsymbol{X} + b$
is a linear function of $\boldsymbol{X}$, e.g.
$\boldsymbol{X} = \begin{pmatrix} \text{height} \\ \text{width} \end{pmatrix}, \qquad \boldsymbol{w} = \begin{pmatrix} w_1 \\ w_2 \end{pmatrix}$
The weight vector
Define the positive class region by $\boldsymbol{w}^T \boldsymbol{x}_i + b > 0$:

$w_1 x_{i1} + w_2 x_{i2} + \dots + b > 0, \qquad \sum_{d=1}^{D} w_d x_{id} + b > 0$

Setting $b = 0$, the decision boundary is $\boldsymbol{w}^T \boldsymbol{X} = 0$
Geometric meaning
[Figure: decision regions $\mathcal{R}_1$ (where $y > 0$) and $\mathcal{R}_2$ (where $y < 0$), separated by the boundary $y = 0$.]
Geometric meaning
$L = \{\boldsymbol{x} \mid f(\boldsymbol{x}) = \boldsymbol{w}^T \boldsymbol{x} + b = 0\}$

1) For any $\boldsymbol{x}_1$ and $\boldsymbol{x}_2$ in $L$, $\boldsymbol{w}^T(\boldsymbol{x}_1 - \boldsymbol{x}_2) = 0 \Rightarrow \boldsymbol{w}^* = \dfrac{\boldsymbol{w}}{\lVert\boldsymbol{w}\rVert}$ is the normal to $L$

2) For any $\boldsymbol{x}_0$ in $L$, $\boldsymbol{w}^T \boldsymbol{x}_0 = -b$

3) The signed distance of a point $\boldsymbol{x}$ to $L$ is:

$\boldsymbol{w}^{*T}(\boldsymbol{x} - \boldsymbol{x}_0) = \dfrac{1}{\lVert\boldsymbol{w}\rVert}\,(\boldsymbol{w}^T \boldsymbol{x} + b) = \dfrac{f(\boldsymbol{x})}{\lVert f'(\boldsymbol{x})\rVert}$
The weight vector
% ww = Dx1 weights
% Xstar = NxD test cases
y_pred = sign(Xstar*ww); % Nx1 predicted labels in {-1,+1}; bias omitted
Learning the weights
Rosenblatt’s Perceptron Learning
Perceptron criterion (summing over the set $\mathcal{M}$ of misclassified points):

$D(\boldsymbol{w}, b) = -\sum_{i \in \mathcal{M}} y_i\, (\boldsymbol{w}^T \boldsymbol{x}_i + b) = -\sum_{i \in \mathcal{M}} D_i(\boldsymbol{w}, b)$

$\dfrac{\partial D_i(\boldsymbol{w}, b)}{\partial \boldsymbol{w}} = -y_i \boldsymbol{x}_i, \qquad \dfrac{\partial D_i(\boldsymbol{w}, b)}{\partial b} = -y_i$

Stochastic gradient descent:

$\begin{pmatrix} \boldsymbol{w} \\ b \end{pmatrix}^{\tau+1} = \begin{pmatrix} \boldsymbol{w} \\ b \end{pmatrix}^{\tau} + \eta \begin{pmatrix} y_i \boldsymbol{x}_i \\ y_i \end{pmatrix}$
Learning the weights
% Rosenblatt's perceptron: cycle through the data, updating the
% weights on misclassified points until nothing changes.
% xx = NxD training cases, yy = Nx1 targets (-1,+1), ww = Dx1 weights
[N, D] = size(xx);
old_ww = [];
ww = zeros(D, 1);
while ~isequal(ww, old_ww)
    old_ww = ww;
    for ct = 1:N
        pred = sign(xx(ct,:)*ww);
        ww = ww + (yy(ct) - pred)*xx(ct,:)'; % no change if pred == yy(ct)
    end
end
$\hat{y} = \mathrm{sgn}(\boldsymbol{w}^T \boldsymbol{x}), \qquad \boldsymbol{w}^{\tau+1} = \boldsymbol{w}^{\tau} + (y - \hat{y})\, \boldsymbol{x}$
Learning the weights
Implementing the bias
What about $b$? A standard trick is to append a constant 1 to every input vector so that the bias becomes an extra weight $w_0$; the output of the perceptron keeps the same form $\hat{y} = \mathrm{sgn}(\boldsymbol{w}^T \boldsymbol{x})$.
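A sketch of this trick with toy numbers (variable names are mine); it is consistent with the leading 1 in the $\boldsymbol{x}_i$ vectors used later for regression.

xx = [1 2; -3 0.5; 0 -1];        % NxD example inputs (toy values)
ww = [0.8; -0.4]; b = 0.1;       % Dx1 weights and bias
Xa = [ones(size(xx,1),1), xx];   % append a constant feature of ones
wa = [b; ww];                    % the bias becomes weight w_0
y_pred = sign(Xa * wa);          % identical to sign(xx*ww + b)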
Linear classifier revisited
If the data are not linearly separable, one must
- extend the model
- add features
Nonlinear basis function
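One minimal sketch of the idea, assuming hand-picked quadratic basis functions: a linear classifier trained on the expanded features gives a nonlinear boundary in the original space.

xx = randn(20, 2);                       % NxD original inputs (synthetic)
phi = [xx, xx.^2, xx(:,1).*xx(:,2)];     % [x1 x2 x1^2 x2^2 x1*x2]
% a linear boundary in phi-space is a nonlinear (conic) boundary in x-space;
% train any linear classifier (perceptron, least squares) on phi as before.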
From model to no model
Faith in previous knowledge: strong assumptions on
- the data structure
- the shape of the separating boundary

Faith in the data: no assumption on the underlying structure; the data tell me everything I need
K-nearest neighbours classifier
Fix and Hodges, 1951
Decision boundaries
[Figure: decision boundaries of linear classification, the 1-nearest-neighbour classifier, and the 15-nearest-neighbour classifier.]
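A hedged sketch of the k-nearest-neighbour rule (Euclidean distance, majority vote); the data and variable names are illustrative.

k = 15;
xtrain = randn(100, 2);                          % NxD training inputs (synthetic)
ytrain = sign(xtrain(:,1) + xtrain(:,2));        % Nx1 labels in {-1,+1}
xstar = [0.3, -0.1];                             % one test point
d2 = sum(bsxfun(@minus, xtrain, xstar).^2, 2);   % squared distances to xstar
[~, idx] = sort(d2);                             % neighbours by increasing distance
y_pred = sign(sum(ytrain(idx(1:k))));            % majority vote of the k closest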
Brain MRI application
MICCAI MS lesion challenge 2008: http://www.ia.unc.edu/MSseg/index.html
LANDSAT application
Identification via gait analysis
Nowlan 2009; Choi 2014
Characterize each person by the way they move:
the gait signature
Parametric vs non-parametric
• Parametric: start by assuming the decision boundary is a plane
• Non-parametric KNN has no fixed assumption: boundaries get more complex with more data
• Non-parametric methods may need more data and can be computationally intensive
Batch supervised learning
Given: example inputs and targets (training set)
Task: predict the targets for new inputs (test set)
Examples:
- classification (binary or multi-class)
- regression
- ordinal regression
- Poisson regression
- ranking
- …
Batch supervised learning
• Many ways of mapping inputs to outputs
• How do we choose what to do?
• How do we know if we are doing well?
Algorithm’s objective cost
Formal objective for algorithms:
- minimize a cost function
- maximize an objective function

Proving convergence:
- does the objective monotonically improve?

Considering alternatives:
- does another algorithm score better?
Loss function
We want to specify the objective of an algorithm
One idea: consider a loss function $L(\hat{y}(\boldsymbol{x}^*); y^*)$
We would like to minimize the loss at test time.
Minimizing the empirical loss might be a reasonable proxy:

$\sum_i L(\hat{y}(\boldsymbol{x}_i); y_i)$
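For instance, with 0-1 loss the empirical loss is just the fraction of training mistakes (toy vectors below):

y_true = [1 -1 1 1 -1];            % illustrative targets
y_hat  = [1 1 1 -1 -1];            % illustrative predictions
emp_loss = mean(y_hat ~= y_true);  % fraction of mistakes: 2/5 = 0.4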
Choosing a loss function
• Motivated by the application– 0-1 error, achieving a tolerance, business cost
• Computational convenience:– Differentiability, convexity
• Beware of loss dominated by artifacts:– Outliers
– Unbalanced classes
A step into linear regression
Find a linear function
$\hat{y} = \boldsymbol{w}^T \boldsymbol{x}$
that approximates the mapping
$\boldsymbol{x} \to y$
A step into linear regression
Find a linear function
$\hat{y} = \boldsymbol{w}^T \boldsymbol{x}$
that minimizes the sum of squared residuals from $y$:

$RSS(\boldsymbol{w}) = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{N} \Big( y_i - b - \sum_{j=1}^{p} x_{ij} w_j \Big)^2$
Vector form for RSS
$\boldsymbol{x}_i = \begin{pmatrix} 1 \\ x_{i1} \\ x_{i2} \\ \vdots \\ x_{iD} \end{pmatrix}$, e.g. $\boldsymbol{x}_i = \begin{pmatrix} 1 \\ \text{age} \\ \text{BMI} \\ \vdots \\ \text{glycemia} \end{pmatrix}$

$\boldsymbol{X} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1D} \\ 1 & x_{21} & \cdots & x_{2D} \\ 1 & x_{31} & \cdots & x_{3D} \\ \vdots & & & \vdots \\ 1 & x_{N1} & \cdots & x_{ND} \end{pmatrix} = \begin{pmatrix} \boldsymbol{x}_1^T \\ \boldsymbol{x}_2^T \\ \boldsymbol{x}_3^T \\ \vdots \\ \boldsymbol{x}_N^T \end{pmatrix}, \qquad \boldsymbol{w} = \begin{pmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_D \end{pmatrix}$
Least squares estimation
$RSS(\boldsymbol{w}) = (\boldsymbol{y} - \boldsymbol{X}\boldsymbol{w})^T (\boldsymbol{y} - \boldsymbol{X}\boldsymbol{w})$

$\dfrac{\partial RSS}{\partial \boldsymbol{w}} = -2\, \boldsymbol{X}^T (\boldsymbol{y} - \boldsymbol{X}\boldsymbol{w})$

$\dfrac{\partial^2 RSS}{\partial \boldsymbol{w}\, \partial \boldsymbol{w}^T} = 2\, \boldsymbol{X}^T \boldsymbol{X}$

Setting $-2\boldsymbol{X}^T(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{w}) = 0$ gives $\hat{\boldsymbol{w}} = (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{y}$
Geometry of least squares
$\hat{y} = \hat{\boldsymbol{w}}^T \boldsymbol{x} = \boldsymbol{x}^T (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{y}$

The $D+1$ columns of $\boldsymbol{X}$ span a subspace of $\mathbb{R}^N$.
The closest point to $\boldsymbol{y}$ in this subspace is its orthogonal projection.
The orthogonal projection is given by

$\boldsymbol{y} \approx \hat{\boldsymbol{y}} = \boldsymbol{X} (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{y}$
Least squares estimation

% ww = Dx1 weights
% X = NxD design matrix (training cases)
% Y = Nx1 targets
ww = X\Y; % the backslash operator solves the least-squares problem min ||X*ww - Y||^2

$\hat{\boldsymbol{Y}} = \boldsymbol{X}\boldsymbol{w}$
Least squares estimation (2)
If we want to minimize the RSS

$L(\boldsymbol{w}^T \boldsymbol{x}; y) = (\boldsymbol{X}\boldsymbol{w} - \boldsymbol{Y})^T (\boldsymbol{X}\boldsymbol{w} - \boldsymbol{Y})$

we can use an iterative scheme with the gradient-descent update:

$\boldsymbol{w}^{\tau+1} = \boldsymbol{w}^{\tau} - \eta\, \nabla L(\boldsymbol{w}^T \boldsymbol{x}; y)$

$\nabla L(\boldsymbol{w}^T \boldsymbol{x}; y) = \sum_{i=1}^{N} (\boldsymbol{w}^T \boldsymbol{x}_i - y_i)\, \boldsymbol{x}_i = \boldsymbol{X}^T \boldsymbol{X} \boldsymbol{w} - \boldsymbol{X}^T \boldsymbol{Y}$
Least squares estimation (2)

$\boldsymbol{w}^{\tau+1} = \boldsymbol{w}^{\tau} - \eta\, \boldsymbol{X}^T (\boldsymbol{X}\boldsymbol{w}^{\tau} - \boldsymbol{Y}) = \boldsymbol{w}^{\tau} - \eta\, \boldsymbol{X}^T (\hat{\boldsymbol{Y}} - \boldsymbol{Y})$

1. Initialize $\boldsymbol{w}^0$
2. Update
3. Check a termination condition:
a) $\boldsymbol{w}^{\tau+1} = \boldsymbol{w}^{\tau}$
b) $|\boldsymbol{w}^{\tau+1} - \boldsymbol{w}^{\tau}| < \varepsilon$
c) $\tau > T$
d) $\max |\nabla L| < \varepsilon$
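A runnable sketch of this scheme on synthetic data; the step size $\eta$, the tolerance and the iteration cap are illustrative choices, not the lecture's.

X = [ones(50,1), randn(50,2)];           % Nx(D+1) design matrix, bias column first
Y = X * [1; 2; -0.5] + 0.1*randn(50,1);  % synthetic targets from known weights
eta = 1e-3; epsilon = 1e-8; T = 1e5;
ww = zeros(size(X,2), 1);                % 1. initialize w^0
for tau = 1:T
    grad = X' * (X*ww - Y);              % gradient of the RSS
    ww = ww - eta * grad;                % 2. update step
    if max(abs(grad)) < epsilon          % 3. termination condition (d)
        break;
    end
end
% ww now approximates the closed-form solution X\Y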
The importance of the step
[Figure: descent on the loss surface from $L(\boldsymbol{w}^0)$ toward $L(\hat{\boldsymbol{w}})$; the step size $\eta$ determines whether the iterates converge smoothly, oscillate, or diverge.]
Least squares classifier
Why not use linear least squares to fit a regressor on binary targets?

% fit yy = xx*ww
% ww = Dx1 weights
% xx = NxD training cases
% yy = Nx1 binary targets (-1,+1)
ww = xx\yy;

[Figure: the fitted least-squares boundary separating the regions $\hat{y} > 0$ and $\hat{y} < 0$.]