
Machine Learning 2
7. Sparse Linear Models – Further Methods

Lars Schmidt-Thieme

Information Systems and Machine Learning Lab (ISMLL), Institute for Computer Science

University of Hildesheim, Germany



Outline

1. Automatic Relevance Determination (ARD)


Syllabus

A. Advanced Supervised Learning

- Tue. 9.12. (1) A.1 Generalized Linear Models
- Wed. 10.12. (2) A.2 Gaussian Processes
- Tue. 16.12. (3) A.3 Advanced Support Vector Machines
- Wed. 17.12. (4) A.4 Neural Networks
- Tue. 6.1. (5) A.5 Ensembles
- Wed. 7.1. (6) A.5b Ensembles (ctd.)
- Tue. 13.1. (7) A.6 Sparse Linear Models – L1 regularization
- Wed. 14.1. (8) A.6b Sparse Linear Models – L1 regularization (ctd.)
- Tue. 20.1. (9) A.7 Sparse Linear Models – Further Methods
- Wed. 21.1. (10)
- Tue. 27.1. (11)
- Wed. 28.1. (12)
- Tue. 3.2. (13)
- Wed. 4.2. (14)


1. Automatic Relevance Determination (ARD)

Linear Regression plus ARD Regularization

Linear Regression plus L2 regularization (Ridge Regression):

$p(y_n \mid x_n, \beta, \sigma_y^2) := \mathcal{N}(y_n \mid \beta^T x_n, \sigma_y^2)$

$p(\beta) := \mathcal{N}(\beta \mid 0, \Sigma_\beta := \sigma_\beta^2 I)$

Linear Regression plus ARD Regularization:

$p(y_n \mid x_n, \beta, \sigma_y^2) := \mathcal{N}(y_n \mid \beta^T x_n, \sigma_y^2)$

$p(\beta) := \mathcal{N}(\beta \mid 0, \Sigma_\beta := \mathrm{diag}(\sigma_{\beta_1}^2, \ldots, \sigma_{\beta_M}^2))$

$p(\sigma_y^2) := \mathrm{IG}(\sigma_y^2 \mid c, d)$

$p(\sigma_{\beta_m}^2) := \mathrm{IG}(\sigma_{\beta_m}^2 \mid a, b), \quad m = 1, \ldots, M$

Idea:

- use a different regularization weight for each predictor $x_m$,
- but $M$ hyperparameters are too many to learn by grid search,
- hence put a hyperprior on top.

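To make the two-stage prior concrete, here is a minimal NumPy sketch (not from the lecture; all names and hyperparameter values are illustrative) that samples a synthetic dataset from the ARD generative model:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 10
a, b, c, d = 2.0, 1.0, 2.0, 1.0       # illustrative hyperparameter values

# Inverse-gamma draws: if g ~ Gamma(shape=a, rate=b), then 1/g ~ IG(a, b).
sigma2_beta = 1.0 / rng.gamma(shape=a, scale=1.0 / b, size=M)  # sigma^2_{beta_m} ~ IG(a, b)
sigma2_y = 1.0 / rng.gamma(shape=c, scale=1.0 / d)             # sigma^2_y ~ IG(c, d)

# The two Gaussian stages of the model.
X = rng.normal(size=(N, M))
beta = rng.normal(scale=np.sqrt(sigma2_beta))                  # beta ~ N(0, diag(sigma^2_beta))
y = X @ beta + rng.normal(scale=np.sqrt(sigma2_y), size=N)     # y_n ~ N(beta^T x_n, sigma^2_y)
```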


Empirical Bayes

Maximum Likelihood (ML):
$\hat\theta := \arg\max_\theta\; p(D \mid \theta)$

Maximum A Posteriori (MAP):
$\hat\theta := \arg\max_\theta\; p(D \mid \theta)\, p(\theta \mid \eta)$

ML-II (Empirical Bayes):
$\hat\eta := \arg\max_\eta\; p(D \mid \eta) = \arg\max_\eta \int p(D \mid \theta)\, p(\theta \mid \eta)\, d\theta$
$\hat\theta \sim p(\theta \mid D, \hat\eta) \propto p(D \mid \theta)\, p(\theta \mid \hat\eta)$

MAP-II:
$\hat\eta := \arg\max_\eta\; p(D \mid \eta)\, p(\eta) = \arg\max_\eta \int p(D \mid \theta)\, p(\theta \mid \eta)\, p(\eta)\, d\theta$
$\hat\theta \sim p(\theta \mid D, \hat\eta) \propto p(D \mid \theta)\, p(\theta \mid \hat\eta)$

Full Bayes:
$(\hat\theta, \hat\eta) \sim p(\theta, \eta \mid D) \propto p(D \mid \theta)\, p(\theta \mid \eta)\, p(\eta)$



Marginal Likelihood

Without hyperpriors:

$p(y \mid X, \sigma_y^2, \Sigma_\beta) = \int \mathcal{N}(y \mid X\beta, \sigma_y^2 I)\, \mathcal{N}(\beta \mid 0, \Sigma_\beta)\, d\beta = \mathcal{N}\big(y \mid 0, \underbrace{\sigma_y^2 I + X \Sigma_\beta X^T}_{=:\, C_y}\big)$

$\ell(\sigma_y^2, \Sigma_\beta) := \log p(y \mid X, \sigma_y^2, \Sigma_\beta) \propto \log|C_y| + y^T C_y^{-1} y$

With hyperpriors:

$\ell(\sigma_y^2, \Sigma_\beta) := \log p(y \mid X, \sigma_y^2, \Sigma_\beta, a, b, c, d) \propto \log|C_y| + y^T C_y^{-1} y + \sum_{m=1}^M \big({-a}\log\sigma_{\beta_m}^2 - b/\sigma_{\beta_m}^2\big) - c\log\sigma_y^2 - d/\sigma_y^2$

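A minimal NumPy sketch (not part of the slides) that evaluates the marginal likelihood above; it returns the exact $\log \mathcal{N}(y \mid 0, C_y)$, whereas the slide's $\ell$ keeps only the terms depending on $(\sigma_y^2, \Sigma_\beta)$:

```python
import numpy as np

def log_marginal_likelihood(X, y, sigma2_y, sigma2_beta):
    """log N(y | 0, C_y) with C_y = sigma2_y * I + X diag(sigma2_beta) X^T."""
    N = len(y)
    C_y = sigma2_y * np.eye(N) + (X * sigma2_beta) @ X.T  # X Sigma_beta X^T via column scaling
    sign, logdet = np.linalg.slogdet(C_y)                 # numerically stable log |C_y|
    quad = y @ np.linalg.solve(C_y, y)                    # y^T C_y^{-1} y without forming the inverse
    return -0.5 * (logdet + quad + N * np.log(2.0 * np.pi))
```

ML-II then maximizes this quantity (plus the log-hyperprior terms) over $\sigma_y^2$ and $\sigma_{\beta_1}^2, \ldots, \sigma_{\beta_M}^2$.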


Inferring Parameters β

$p(\beta \mid X, y, \sigma_y^2, \Sigma_\beta) = \frac{1}{Z}\, \mathcal{N}(\beta \mid 0, \Sigma_\beta)\, \mathcal{N}(y \mid X\beta, \sigma_y^2 I) = \mathcal{N}\Big(\beta \,\Big|\, \mu_\beta := \frac{1}{\sigma_y^2}\, C_\beta X^T y,\; C_\beta := \big(\tfrac{1}{\sigma_y^2} X^T X + \Sigma_\beta^{-1}\big)^{-1}\Big)$

For $\Sigma_\beta = \infty I$ (unregularized estimates):

$= \mathcal{N}\big(\beta \mid (X^T X)^{-1} X^T y,\; \sigma_y^2 (X^T X)^{-1}\big)$

For $\sigma_y^2 = \infty$ (overregularized estimates):

$= \mathcal{N}(\beta \mid 0, \Sigma_\beta)$

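The posterior moments have the closed form above; a small NumPy helper (illustrative, not the lecture's code):

```python
import numpy as np

def posterior_beta(X, y, sigma2_y, sigma2_beta):
    """Posterior N(beta | mu_beta, C_beta) for fixed noise and weight variances."""
    # C_beta = (X^T X / sigma2_y + Sigma_beta^{-1})^{-1}
    C_beta = np.linalg.inv(X.T @ X / sigma2_y + np.diag(1.0 / sigma2_beta))
    # mu_beta = C_beta X^T y / sigma2_y
    mu_beta = C_beta @ X.T @ y / sigma2_y
    return mu_beta, C_beta
```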

Equivalent MAP Estimation Problem

Profiling out the weight variances turns ML-II into a MAP-style problem in $\beta$ with a non-separable regularizer:

$\ell(\beta) = \frac{1}{\sigma_y^2}\,\|y - X\beta\|_2^2 + \min_{\sigma_{\beta_1}^2, \ldots, \sigma_{\beta_M}^2 \ge 0}\; \log\big|\sigma_y^2 I + X\,\mathrm{diag}(\sigma_\beta^2)\,X^T\big| + \sum_{m=1}^M \frac{\beta_m^2}{\sigma_{\beta_m}^2}$


Learning ARD I: via EM

The expected complete-data log joint, with the expectation under the current posterior $\mathcal{N}(\mu_\beta, C_\beta)$:

$\ell(\sigma_y^2, \Sigma_\beta, \mu_\beta, C_\beta) := E_{\beta \sim \mathcal{N}(\mu_\beta, C_\beta)}\big(\log p(y, \beta, \sigma_y^2, \Sigma_\beta \mid X)\big)$

$= E_{\beta \sim \mathcal{N}(\mu_\beta, C_\beta)}\big(\log \mathcal{N}(y \mid X\beta, \sigma_y^2) + \log \mathcal{N}(\beta \mid 0, \Sigma_\beta)\big) + \log \mathrm{IG}(\sigma_y^2 \mid c, d) + \sum_{m=1}^M \log \mathrm{IG}(\sigma_{\beta_m}^2 \mid a, b)$

$= E_{\beta \sim \mathcal{N}(\mu_\beta, C_\beta)}\Big({-\tfrac{N}{2}}\log\sigma_y^2 - \tfrac{1}{2\sigma_y^2}\|y - X\beta\|^2 - \tfrac{1}{2}\sum_m \log\sigma_{\beta_m}^2 - \tfrac{1}{2}\,\mathrm{tr}\,\Sigma_\beta^{-1}\beta\beta^T\Big) + \sum_{m=1}^M\big({-a}\log\sigma_{\beta_m}^2 - b/\sigma_{\beta_m}^2\big) - c\log\sigma_y^2 - d/\sigma_y^2$

$= -\tfrac{N}{2}\log\sigma_y^2 - \tfrac{1}{2\sigma_y^2}\big(\|y - X\mu_\beta\|^2 + \mathrm{tr}(X^T X C_\beta)\big) - \tfrac{1}{2}\sum_m \log\sigma_{\beta_m}^2 - \tfrac{1}{2}\,\mathrm{tr}\,\Sigma_\beta^{-1}\big(\mu_\beta\mu_\beta^T + C_\beta\big) + \sum_{m=1}^M\big({-a}\log\sigma_{\beta_m}^2 - b/\sigma_{\beta_m}^2\big) - c\log\sigma_y^2 - d/\sigma_y^2$


Learning ARD I: via EM (ctd.)

Multiplying by 2 and dropping terms that do not depend on the variances:

$\ell(\ldots) \propto -(2c + N)\log\sigma_y^2 - \big(2d + \|y - X\mu_\beta\|^2 + \mathrm{tr}(X^T X C_\beta)\big)\frac{1}{\sigma_y^2} - \sum_m \big((2a + 1)\log\sigma_{\beta_m}^2 + 2b/\sigma_{\beta_m}^2\big) - \mathrm{tr}\,\Sigma_\beta^{-1}\big(\mu_\beta\mu_\beta^T + C_\beta\big)$


Learning ARD I: via EM (ctd.)

Setting the derivatives to zero gives the M-step updates:

$0 \overset{!}{=} \frac{\partial\ell}{\partial\sigma_{\beta_m}^2} = -(2a + 1)\frac{1}{\sigma_{\beta_m}^2} + \big(2b + (\mu_\beta)_m^2 + (C_\beta)_{m,m}\big)\frac{1}{(\sigma_{\beta_m}^2)^2} \;\Longrightarrow\; \sigma_{\beta_m}^2 = \frac{2b + (\mu_\beta)_m^2 + (C_\beta)_{m,m}}{2a + 1}$

$0 \overset{!}{=} \frac{\partial\ell}{\partial\sigma_y^2} \;\Longrightarrow\; \sigma_y^2 = \frac{2d + \|y - X\mu_\beta\|^2 + \mathrm{tr}(X^T X C_\beta)}{2c + N}$

which can be accelerated using

$C_\beta X^T X = \sigma_y^{2,\mathrm{old}}\big(I - C_\beta \Sigma_\beta^{-1}\big), \qquad \mathrm{tr}(X^T X C_\beta) = \sigma_y^{2,\mathrm{old}} \sum_m \Big(1 - \frac{(C_\beta)_{m,m}}{\sigma_{\beta_m}^2}\Big)$

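Putting the E-step (posterior of β) and the M-step updates together gives a compact loop; a NumPy sketch under illustrative defaults (nearly flat hyperpriors), not the lecture's reference implementation:

```python
import numpy as np

def ard_em(X, y, a=1e-6, b=1e-6, c=1e-6, d=1e-6, iters=100):
    """EM for ARD linear regression with the closed-form M-step updates above."""
    N, M = X.shape
    sigma2_beta = np.ones(M)
    sigma2_y = 1.0
    for _ in range(iters):
        # E-step: posterior moments of beta given the current variances.
        C_beta = np.linalg.inv(X.T @ X / sigma2_y + np.diag(1.0 / sigma2_beta))
        mu_beta = C_beta @ X.T @ y / sigma2_y
        # Accelerated trace: tr(X^T X C_beta) = sigma2_y_old * sum_m gamma_m.
        gamma = 1.0 - np.diag(C_beta) / sigma2_beta
        # M-step: the stationary points derived above.
        sigma2_beta = (2 * b + mu_beta**2 + np.diag(C_beta)) / (2 * a + 1)
        resid = y - X @ mu_beta
        sigma2_y = (2 * d + resid @ resid + sigma2_y * gamma.sum()) / (2 * c + N)
    return mu_beta, sigma2_beta, sigma2_y
```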

Learning ARD II: Fixed Point Algorithm

Iterate until convergence:

$\sigma_{\beta_m}^2 := \frac{2b + (\mu_\beta)_m^2}{2a + \gamma_m}$

$\sigma_y^2 := \frac{2d + \|y - X\mu_\beta\|^2}{2c + N - \sum_m \gamma_m}$

$C_\beta := \Big(\frac{1}{\sigma_y^2}\, X^T X + \Sigma_\beta^{-1}\Big)^{-1}$

$\mu_\beta := \frac{1}{\sigma_y^2}\, C_\beta X^T y$

$\gamma_m := 1 - \frac{(C_\beta)_{m,m}}{\sigma_{\beta_m}^2}, \qquad m = 1, \ldots, M$

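The fixed-point variant reuses the same skeleton as the EM sketch; only the variance updates change (again a sketch under illustrative, nearly flat hyperprior defaults):

```python
import numpy as np

def ard_fixed_point(X, y, a=1e-6, b=1e-6, c=1e-6, d=1e-6, iters=100):
    """MacKay-style fixed-point updates; gamma_m acts as a 'well-determinedness' factor."""
    N, M = X.shape
    sigma2_beta = np.ones(M)
    sigma2_y = 1.0
    for _ in range(iters):
        C_beta = np.linalg.inv(X.T @ X / sigma2_y + np.diag(1.0 / sigma2_beta))
        mu_beta = C_beta @ X.T @ y / sigma2_y
        gamma = 1.0 - np.diag(C_beta) / sigma2_beta
        sigma2_beta = (2 * b + mu_beta**2) / (2 * a + gamma)
        resid = y - X @ mu_beta
        sigma2_y = (2 * d + resid @ resid) / (2 * c + N - gamma.sum())
    return mu_beta, sigma2_beta, sigma2_y
```

In practice these updates typically converge in fewer iterations than EM, though neither is guaranteed to find the global optimum of the non-convex marginal likelihood.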

Learning ARD III: Iteratively Reweighted L1

The ARD regularization term

$R(\sigma_\beta^2) := \log\big|C_y(\sigma_\beta^2)\big| = \log\big|\sigma_y^2 I + X \Sigma_\beta X^T\big|, \qquad \Sigma_\beta := \mathrm{diag}(\sigma_\beta^2)$

is concave in $\sigma_\beta^2$ and thus can be written as

$R(\sigma_\beta^2) = \min_\lambda\; \lambda^T \sigma_\beta^2 - R^*(\lambda), \qquad R^*(\lambda) = \min_{\tilde\sigma_\beta^2}\; \lambda^T \tilde\sigma_\beta^2 - \log\big|C_y(\tilde\sigma_\beta^2)\big|$

The relaxed function

$R(\sigma_\beta^2, \lambda) := \lambda^T \sigma_\beta^2 - R^*(\lambda)$

for fixed $\sigma_\beta^2$ is minimized by

$\lambda = \nabla_{\sigma_\beta^2} \log\big|C_y(\sigma_\beta^2)\big|$


Learning ARD III: Iteratively Reweighted L1 (ctd.)

Instead of $\sigma_\beta^2$, Wipf/Nagarajan 2008 substitute

$\sigma_{\beta_m}^2 \;\to\; \lambda_m^{-\frac{1}{2}}\,|\beta_m|$

finally yielding the iterative procedure

$\beta^{(t+1)} := \arg\min_\beta\; \ell(\beta) + \sum_{m=1}^M \lambda_m^{(t)}\,|\beta_m|$

with weights

$\lambda_m^{(0)} := 1$

$\lambda_m^{(t+1)} := \Big(X_{\cdot,m}^T\,\big(\sigma_y^2 I + X\,\mathrm{diag}\big(1/\lambda_1^{(t)}, \ldots, 1/\lambda_M^{(t)}\big)\,\mathrm{diag}\big(|\beta_1^{(t)}|, \ldots, |\beta_M^{(t)}|\big)\,X^T\big)^{-1} X_{\cdot,m}\Big)^{\frac{1}{2}}$

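A sketch of the resulting loop (not the lecture's code), taking $\ell(\beta) = \frac{1}{\sigma_y^2}\|y - X\beta\|_2^2$ and solving the weighted-L1 subproblem with a plain ISTA inner loop; any L1 solver could be substituted:

```python
import numpy as np

def weighted_lasso_ista(X, y, lam, sigma2_y, iters=500):
    """Minimize (1/sigma2_y) * ||y - X beta||^2 + sum_m lam_m * |beta_m| via ISTA."""
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / sigma2_y      # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        z = beta - 2.0 * X.T @ (X @ beta - y) / (sigma2_y * L)   # gradient step
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-thresholding
    return beta

def ard_reweighted_l1(X, y, sigma2_y=1.0, outer_iters=10):
    """Reweighted-L1 view of ARD (after Wipf/Nagarajan 2008); a sketch."""
    N, M = X.shape
    lam = np.ones(M)                                    # lambda^(0) = 1
    for _ in range(outer_iters):
        beta = weighted_lasso_ista(X, y, lam, sigma2_y)
        # lambda_m^(t+1) = (X_m^T C_y^{-1} X_m)^{1/2},
        # with C_y = sigma2_y I + X diag(1/lam) diag(|beta|) X^T.
        C_y = sigma2_y * np.eye(N) + (X * (np.abs(beta) / lam)) @ X.T
        lam = np.sqrt(np.einsum('nm,nm->m', X, np.linalg.solve(C_y, X)))
        lam = np.maximum(lam, 1e-12)                    # numerical guard
    return beta
```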

ARD for Classification

- So far, everything was developed for linear regression.
- For logistic regression, the E-step of EM cannot be done analytically:
  - possibly use a variational approximation,
  - or use a Gaussian approximation (Laplace approximation), as sketched below.
- The iteratively reweighted learning algorithm still works.

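A sketch of the Laplace-approximated E-step for logistic regression (illustrative names, labels $y_n \in \{0, 1\}$): Newton's method finds the posterior mode, and the inverse Hessian of the negative log-posterior at the mode serves as $C_\beta$.

```python
import numpy as np

def laplace_posterior_logistic(X, y, sigma2_beta, iters=25):
    """Laplace (Gaussian) approximation N(mu_beta, C_beta) to the posterior of beta
    for logistic regression with prior N(0, diag(sigma2_beta))."""
    M = X.shape[1]
    prior_prec = np.diag(1.0 / sigma2_beta)
    beta = np.zeros(M)
    for _ in range(iters):                      # Newton's method to the MAP estimate
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # sigmoid
        grad = X.T @ (p - y) + prior_prec @ beta
        hess = X.T @ (X * (p * (1.0 - p))[:, None]) + prior_prec
        beta = beta - np.linalg.solve(hess, grad)
    # Curvature at the mode gives the covariance of the Gaussian approximation.
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    hess = X.T @ (X * (p * (1.0 - p))[:, None]) + prior_prec
    return beta, np.linalg.inv(hess)            # mu_beta, C_beta
```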

Remarks

- ARD is a good example of an (arguably simple) hierarchical Bayesian model.
- ARD has to be diligently evaluated against simple baselines, such as a vanilla L1/L2-regularized model on normalized data.



Further Readings

- L1 regularization: [Mur12, chapters 13.3–5], [HTFF05, chapters 3.4, 3.8, 4.4.4], [Bis06, chapter 3.1.4].
- LAR, LARS: [HTFF05, chapter 3.4.4], [Mur12, chapter 13.4.2].
- Non-convex regularizers: [Mur12, chapter 13.6].
- Automatic Relevance Determination (ARD): [Mur12, chapter 13.7], [HTFF05, chapter 11.9.1], [Bis06, chapter 7.2.2].
- Sparse Coding: [Mur12, chapter 13.8].



References

[Bis06] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.

[HTFF05] Trevor Hastie, Robert Tibshirani, Jerome Friedman, and James Franklin. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2005.

[Mur12] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
