Transcript of Machine Learning for Biomedical Engineering › ~enrigri › Public › AMLB_2016 › Slides › AMLB_Lecture_4.pdf

Page 1: Some tricks of the trade • Preprocessing • Transformations • Features

Machine Learning for Biomedical Engineering

Enrico Grisan

[email protected]

Page 2:

Curse of dimensionality

• Why are more features bad?

– Redundant features (useless or confounding)

– Hard to interpret and visualize

– Hard to store and process

– Complexity of decision rules (boundaries) increases with the number of features

Page 3:

Occam’s Razor

William of Ockham (1285-1349)

Principle of Parsimony

«One should not increase, beyond what is necessary, the number of entities required to explain anything»

Seek the simplest solution!

Page 4:

Dimensionality reduction

• Features selection

• Latent features

Page 5:

Best subset selection

• 𝐷 features

• Find the subset of size 𝑘 ∈ {1, …, 𝐷} with the smallest residual sum of squares

• Exhaustive search must examine

∑_{k=1}^{D} (D choose k) = ∑_{k=1}^{D} D! / (k! (D − k)!) = 2^D − 1 subsets

Efficient methods exist only for 𝐷 < 40
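Since the number of subsets grows as 2^D − 1, exhaustive search is only feasible for small 𝐷. A minimal numpy sketch for a fixed subset size 𝑘 (function names are illustrative, not from the slides):

```python
# Exhaustive best-subset selection for a fixed subset size k.
import itertools
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ w
    return float(r @ r)

def best_subset(X, y, k):
    """Try all (D choose k) subsets of k features; return the one with smallest RSS."""
    D = X.shape[1]
    return min(itertools.combinations(range(D), k),
               key=lambda S: rss(X[:, list(S)], y))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
y = 2.0 * X[:, 1] - 3.0 * X[:, 4] + 0.01 * rng.normal(size=50)
print(best_subset(X, y, 2))   # features 1 and 4 generate y
```

With 𝐷 = 6 and 𝑘 = 2 this checks only 15 subsets; at 𝐷 = 40 the full search over all 𝑘 would already need about 10^12 fits.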

Page 6:

One-step feature selection

Estimate the training or cross-validation accuracy of each single-feature classifier:

𝑓_k: 𝑿_k → 𝑦,  𝑒_k = 𝐸[ |𝑓_k(𝑿_k) − 𝑦| ]

Choose the set of 𝐾 features with the smallest errors 𝑒_k
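This filter can be sketched with single-feature least-squares fits scored by training error (a stand-in for the accuracy estimate on the slide; names are illustrative):

```python
# One-step feature selection: score each feature alone, keep the K best.
import numpy as np

def one_step_select(X, y, K):
    """Indices of the K features whose single-feature fit has smallest error e_k."""
    errors = []
    for k in range(X.shape[1]):
        Xk = X[:, [k]]
        w, *_ = np.linalg.lstsq(Xk, y, rcond=None)
        r = y - Xk @ w
        errors.append(float(r @ r))        # e_k: error of the classifier f_k
    return np.argsort(errors)[:K]          # K smallest errors

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
y = 4.0 * X[:, 2] + 0.1 * rng.normal(size=100)
print(one_step_select(X, y, 1))
```

Note that each feature is scored in isolation, so features that are only informative jointly will be missed.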

Page 7:

Forward stepwise selection

Incrementally add one feature at a time to the currently selected set

Greedy algorithm:

1. Given a subset 𝑆 of 𝐾 features

2. Evaluate the RSS of the subset 𝑆 augmented with one feature (among the remaining 𝐷 − 𝐾)

3. Add the feature providing the largest reduction in RSS

4. Repeat 1-3
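The greedy loop above can be sketched in a few lines of numpy (names are illustrative):

```python
# Forward stepwise selection: greedily grow the feature set by RSS reduction.
import numpy as np

def rss(X, y):
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ w
    return float(r @ r)

def forward_stepwise(X, y, K):
    """Add, one at a time, the feature that most reduces the RSS."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(K):
        best = min(remaining, key=lambda j: rss(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 5] + 0.05 * rng.normal(size=80)
print(forward_stepwise(X, y, 2))
```

Each step refits the model with one candidate feature added, so the cost per step is 𝐷 − 𝐾 least-squares fits rather than the exponential cost of best-subset search.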

Page 8:

Backward stepwise selection

Incrementally remove one feature at a time from the currently selected set

1. Start with the regression/classifier estimate on the full set of D features

2. Evaluate the feature with the least impact (z-score, or RSS with that feature removed)

3. Remove the feature from the set of active features

4. Repeat steps 2-3
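A sketch of the same loop in reverse, using the RSS criterion (names are illustrative):

```python
# Backward stepwise selection: start from all features, drop the least useful.
import numpy as np

def rss(X, y, cols):
    w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    r = y - X[:, cols] @ w
    return float(r @ r)

def backward_stepwise(X, y, K):
    """Repeatedly drop the feature whose removal increases the RSS the least."""
    active = list(range(X.shape[1]))
    while len(active) > K:
        drop = min(active, key=lambda j: rss(X, y, [a for a in active if a != j]))
        active.remove(drop)
    return active

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 6))
y = 3.0 * X[:, 1] + 2.0 * X[:, 4] + 0.05 * rng.normal(size=80)
print(backward_stepwise(X, y, 2))
```

Unlike the forward pass, this requires the initial fit on all 𝐷 features, so it needs 𝑁 > 𝐷.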

Page 9:

Forward stagewise

Incrementally add the feature most correlated with the residuals

1. Center and standardize all features

2. Start with the intercept equal to 𝑦̄

3. Find the feature 𝑿_k most correlated with the residuals 𝒓

4. Compute the weight 𝑤′_k of the linear regression of 𝑿_k on 𝒓

5. Update 𝑤_k = 𝑤_k + 𝑤′_k

6. Update the least squares fit and the residuals 𝒓 = 𝒚 − 𝒚̂

7. Repeat steps 3-6 until no feature is correlated with the residuals
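The steps above can be sketched as follows (a minimal numpy version; the iteration count and data are illustrative):

```python
# Forward stagewise: repeatedly regress the residual on its most-correlated feature.
import numpy as np

def forward_stagewise(X, y, n_iter=200):
    X = (X - X.mean(0)) / X.std(0)             # 1. center and standardize
    w = np.zeros(X.shape[1])
    r = y - y.mean()                           # 2. start from the intercept-only fit
    for _ in range(n_iter):
        corr = X.T @ r
        k = np.argmax(np.abs(corr))            # 3. most correlated feature
        dw = (X[:, k] @ r) / (X[:, k] @ X[:, k])   # 4. regression of X_k on r
        w[k] += dw                             # 5. update the weight
        r -= dw * X[:, k]                      # 6. update the residuals
    return w

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + 0.1 * rng.normal(size=100)
w = forward_stagewise(X, y)
print(np.round(w, 2))
```

Since each update fully minimizes along one coordinate, the weights converge to the ordinary least squares solution on the standardized features.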

Page 10:

Least angle regression (LAR)

1. Center and standardize all features

2. Start with 𝒓 = 𝒚 − 𝑦̄ and 𝑤_i = 0, 𝑖 = 1, …, 𝐷

3. Find the feature 𝑿_k most correlated with 𝒓 and add it to the active set 𝒜_0

4. At each iteration 𝜏 evaluate the least squares direction:

𝜹_𝜏 = (𝑿_{𝒜_𝜏}^T 𝑿_{𝒜_𝜏})^{−1} 𝑿_{𝒜_𝜏}^T 𝒓_𝜏

5. Update the weights of the features in the active set: 𝒘_{𝒜_{𝜏+1}} = 𝒘_{𝒜_𝜏} + 𝜂𝜹_𝜏

6. Evaluate the least squares fit of 𝑿_{𝒜_𝜏} and update the residuals 𝒓_𝜏

7. Repeat 4-6 until some other variable 𝑿_j is as correlated with 𝒓_𝜏 as the features in 𝑿_{𝒜_𝜏}

8. Add 𝑿_j to 𝑿_{𝒜_𝜏}

9. Repeat 4-8

Page 11:

Incremental Forward Stagewise regression

1. Center and standardize all features

2. Start with 𝒓 = 𝒚 and 𝑤𝑖 = 0, 𝑖 = 1,… , 𝐷

3. Find the feature 𝑿𝑘 most correlated with 𝒓

4. Evaluate the change 𝛿 = 𝜀 ∙ sign(⟨𝑿_k, 𝒓⟩)

5. Update the weights of the features in the active set: 𝑤_k = 𝑤_k + 𝛿

6. Update the residuals: 𝒓 = 𝒓 − 𝛿𝑿_k

7. Repeat 3-6 until the residuals are uncorrelated with the features
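The 𝜀-step version translates almost line by line into numpy (step size, stopping rule, and data are illustrative choices):

```python
# Incremental forward stagewise: tiny steps of size eps toward the
# feature most correlated with the residual.
import numpy as np

def incremental_stagewise(X, y, eps=0.01, n_iter=10000):
    X = (X - X.mean(0)) / X.std(0)          # 1. center and standardize
    w = np.zeros(X.shape[1])
    r = y - y.mean()                        # 2. start from centered y
    for _ in range(n_iter):
        corr = X.T @ r
        k = np.argmax(np.abs(corr))         # 3. most correlated feature
        # stop when even the best feature is nearly uncorrelated with r
        if np.abs(corr[k]) < eps * (X[:, k] @ X[:, k]):
            break
        delta = eps * np.sign(corr[k])      # 4. δ = ε · sign(⟨X_k, r⟩)
        w[k] += delta                       # 5. update the weight
        r -= delta * X[:, k]                # 6. update the residuals
    return w

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 1] - 1.0 * X[:, 2] + 0.1 * rng.normal(size=100)
w = incremental_stagewise(X, y)
print(np.round(w, 2))
```

The small fixed step is what distinguishes this from plain forward stagewise: weights grow slowly and, for small 𝜀, the coefficient paths resemble those of LAR and the LASSO.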

Page 12:

Comparison

Page 13:

fMRI experiment

Test whether a classifier could distinguish the activation resulting from seeing words that were either kinds of tools or kinds of buildings. The subject was shown one word per trial and performed the following task: think about the item and its properties while the word was displayed (3 s), then try to clear her mind afterwards (8 s of blank screen). (Pereira, Mitchell, Botvinick, Neuroimage, 2009)

For each patient and each task (recognizing a word), fMRI data provide a signal correlated to metabolism at each voxel of the acquired 3D brain volume:

16000 features per example; 42 examples for the «building» class; 42 examples for the «tools» class

Page 14:

fMRI feature selection

classifier training & feature selection

The goal is to:
- reduce the ratio of features to examples,
- decrease the chance of overfitting,
- get rid of uninformative features,
- let the classifier focus on informative ones.

Page 15:

fMRI feature selection

16000 features!!!

Page 16:

Regularization

• We do not need a validation set to know some fits are silly

• Discourage solutions we don’t like

• Formalizing the cost of solutions we do not like

Page 17:

Shrinkage

Minimizing the objective function:

𝒘 = argmin_𝒘 𝐿(𝑦(𝒙∗); 𝑦∗) + 𝜆‖𝒘‖_p

Or constraining the estimates:

𝒘 = argmin_𝒘 𝐿(𝑦(𝒙∗); 𝑦∗)  subject to ‖𝒘‖_p ≤ 𝑡

Page 18:

Ridge regression

p = 2:  𝒘_ridge = argmin_𝒘 𝐿(𝑦(𝒙∗); 𝑦∗) + 𝜆‖𝒘‖₂²

𝒘_ridge = argmin_𝒘 ∑_{i=1}^{N} (𝑦_i − 𝑦̂_i)² + 𝜆 ∑_{j=1}^{D} 𝑤_j²

RSS(𝜆) = (𝒚 − 𝑿𝒘)^T (𝒚 − 𝑿𝒘) + 𝜆𝒘^T𝒘

𝒘_ridge = (𝑿^T𝑿 + 𝜆𝑰)^{−1} 𝑿^T𝒚
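The closed-form solution can be checked numerically in a few lines (data and λ values are illustrative):

```python
# Ridge regression via its closed-form solution w = (XᵀX + λI)⁻¹ Xᵀy.
import numpy as np

def ridge(X, y, lam):
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -2.0, 0.0, 3.0]) + 0.1 * rng.normal(size=50)

w0 = ridge(X, y, 0.0)      # λ = 0 recovers ordinary least squares
w_big = ridge(X, y, 1e6)   # large λ shrinks all weights toward zero
print(np.round(w0, 1), np.round(w_big, 4))
```

Note that 𝑿^T𝑿 + 𝜆𝑰 is always invertible for 𝜆 > 0, so ridge remains well-posed even when 𝐷 > 𝑁 and 𝑿^T𝑿 is singular.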

Page 19:

LASSO

Least Absolute Shrinkage and Selection Operator

p = 1:  𝒘_lasso = argmin_𝒘 𝐿(𝑦(𝒙∗); 𝑦∗) + 𝜆‖𝒘‖₁

𝒘_lasso = argmin_𝒘 ∑_{i=1}^{N} (𝑦_i − 𝑦̂_i)² + 𝜆 ∑_{j=1}^{D} |𝑤_j|
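Unlike ridge, the LASSO has no closed form. A standard solver (not from the slides) is coordinate descent with soft thresholding; a minimal sketch for the objective above:

```python
# LASSO via coordinate descent with soft thresholding.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Minimize sum (y_i - yhat_i)^2 + lam * sum |w_j| by cycling over coordinates."""
    D = X.shape[1]
    w = np.zeros(D)
    for _ in range(n_iter):
        for j in range(D):
            r_j = y - X @ w + X[:, j] * w[j]   # residual ignoring feature j
            w[j] = soft_threshold(X[:, j] @ r_j, lam / 2) / (X[:, j] @ X[:, j])
    return w

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=100)
w = lasso_cd(X, y, lam=50.0)
print(np.round(w, 2))
```

The soft threshold sets small coordinates exactly to zero, which is why the LASSO both shrinks and selects features.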

Page 20:

Geometry of shrinkage

[Figure: contour lines of the loss 𝐿(𝑦(𝒙∗); 𝑦∗) meeting the constraint regions of the penalties 𝜆‖𝒘‖₁ (diamond) and 𝜆‖𝒘‖₂ (disc); the corners of the ℓ₁ ball produce sparse solutions]

Page 21:

Other shrinkage norms

[Figure: unit balls of ‖𝒘‖_p for 𝑝 = 4, 2, 1.2, 1, 0.5, 0.2, and the elastic-net ball for 𝛼 = 0.2]

Elastic net: 𝛼‖𝒘‖₂² + (1 − 𝛼)‖𝒘‖₁

Page 22:

Regularization constants

How do we pick 𝝀 or 𝒕?

1) Based on validation

2) Based on bounds on generalization error

3) Based on empirical Bayes

4) Reinterpreting 𝜆

5) Going fully Bayesian

Page 23:

Preprocessing

Centering
– Might have all features at 500 ± 10
– Hard to predict the ballpark of the bias
– Subtract the mean from all input features

Rescaling
– Heights can be measured in cm or m
– Rescale inputs to have unit variance… or unit interquartile range

Care at test time: apply the same scaling to all inputs, and reverse the scaling on predictions
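The test-time caveat is worth spelling out: the mean and variance are fitted on the training inputs and then reused unchanged on the test inputs (data values are illustrative):

```python
# Standardization: fit mu and sigma on the TRAINING set only, reuse at test time.
import numpy as np

rng = np.random.default_rng(8)
X_train = 500.0 + 10.0 * rng.normal(size=(200, 3))   # features around 500 ± 10
X_test = 500.0 + 10.0 * rng.normal(size=(20, 3))

mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

Z_train = (X_train - mu) / sigma    # zero mean, unit variance by construction
Z_test = (X_test - mu) / sigma      # SAME mu, sigma: never refit on test data

print(Z_train.mean(axis=0).round(6), Z_train.std(axis=0).round(6))
```

Refitting the statistics on the test set would both leak information and make train and test features incomparable.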

Page 24:

Some tricks of the trade

• Preprocessing

• Transformations

• Features

Page 25:

Log transform inputs

Positive quantities are often highly skewed. The log-domain is often much more natural.
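A quick numerical illustration, using lognormal samples as a stand-in for skewed positive data (the data and the sample-skewness helper are illustrative):

```python
# Log-transforming a right-skewed positive variable makes it near-symmetric.
import numpy as np

rng = np.random.default_rng(9)
x = np.exp(rng.normal(size=10000))   # heavily right-skewed positive data
log_x = np.log(x)                    # back to a symmetric distribution

def skewness(v):
    """Sample skewness: third standardized moment."""
    return float(np.mean(((v - v.mean()) / v.std()) ** 3))

print(round(skewness(x), 1), round(skewness(log_x), 2))
```

For strictly positive counts with zeros, np.log1p is a common variant of the same idea.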

Page 26:

Creating extra data

Dirty trick:

Create more training ‘data’ by corrupting examples in the real training set

Changes could respect invariances that would be difficult or burdensome to measure directly
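One simple instance of the trick is additive noise (the noise level and function names here are hypothetical choices; real augmentations should match the invariances of the problem):

```python
# Data augmentation by corrupting real examples with small perturbations.
import numpy as np

def augment(X, y, n_copies=4, noise=0.05, rng=None):
    """Return the original data plus n_copies noisy versions, labels unchanged."""
    rng = np.random.default_rng(0) if rng is None else rng
    X_aug = [X] + [X + noise * rng.normal(size=X.shape) for _ in range(n_copies)]
    y_aug = [y] * (n_copies + 1)
    return np.vstack(X_aug), np.concatenate(y_aug)

X = np.arange(6.0).reshape(3, 2)
y = np.array([0, 1, 0])
Xa, ya = augment(X, y)
print(Xa.shape, ya.shape)   # (15, 2) (15,)
```

The labels are copied unchanged, encoding the assumption that small perturbations do not change the class.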

Page 27:

Encoding attributes

• Categorical variables
– A study has three individuals
– Three different colours
– Possible encoding: 100, 010, 001

• Ordinal variables
– Movie rating, stars
– Tissue anomaly rating, expert scores 1-3
– Possible encoding: 00, 10, 11
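Both encodings are a couple of lines of numpy (function names are illustrative; the cumulative ordinal code is often called a thermometer encoding):

```python
# One-hot encoding for categorical values, cumulative encoding for ordinal ones.
import numpy as np

def one_hot(values, n_levels):
    """Categorical: one indicator per level (100, 010, 001)."""
    return np.eye(n_levels, dtype=int)[values]

def thermometer(values, n_levels):
    """Ordinal: cumulative bits (00, 10, 11 for scores 1-3) preserve the order."""
    v = np.asarray(values)[:, None]
    return (np.arange(1, n_levels)[None, :] < v).astype(int)

colours = [0, 2, 1]          # three unordered colours
print(one_hot(colours, 3))
stars = [1, 3, 2]            # ordered scores 1-3
print(thermometer(stars, 3))
```

The key distinction: one-hot treats levels as unordered, while the cumulative code lets a linear model exploit the ordering with monotone weights.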

Page 28:

Basis function features

In the regression and classification examples we used polynomials 𝑥, 𝑥², 𝑥³, …

Often a bad choice

Polynomials of sparse binary features may make sense: 𝑥₁𝑥₂, 𝑥₁𝑥₃, …, 𝑥₁𝑥₂𝑥₃

Other options:

- radial basis functions: exp(−(𝑥 − 𝜇)²/ℎ²)

- sigmoids: 1/(1 + exp(−𝒗^T𝒙))

- Fourier, wavelets, …