Explaining the idea behind automatic relevance determination and Bayesian interpolation


Dr. Florian Wilhelm, March 13th 2016, PyData Amsterdam

1

2

What's the best model to describe our data?

And what does "best" actually mean?

3

4

5

Simple model: "Generality"

Complex model: "Best Fit"

Occam's Razor: "It is vain to do with more what can be done with fewer"

6

Simple model ℋ1 and complex model ℋ2 over the space of all possible datasets 𝐷

7

ℋ1 (the simple model) fits only a small subset of 𝐷 well

8

ℋ2 (the complex model) can fit large parts of 𝐷 well

9

Prefer the model with the higher evidence for the given dataset

Source: D. J. C. MacKay. Bayesian Interpolation. 1992

1. Model fitting: Assume ℋ𝑖 is the right model and fit its parameters 𝒘 with Bayes' rule:

𝑃(𝒘|𝐷,ℋ𝑖) = 𝑃(𝐷|𝒘,ℋ𝑖) 𝑃(𝒘|ℋ𝑖) / 𝑃(𝐷|ℋ𝑖)

“Business as usual”

2. Model comparison: Compare different models with the help of their evidence 𝑃(𝐷|ℋ𝑖) and the model prior 𝑃(ℋ𝑖):

𝑃(ℋ𝑖|𝐷) ∝ 𝑃(𝐷|ℋ𝑖) 𝑃(ℋ𝑖)

"Occam's razor at work"

10

Marginalize & approximate:

𝑃(𝐷|ℋ𝑖) = ∫ 𝑃(𝐷|𝒘,ℋ𝑖) 𝑃(𝒘|ℋ𝑖) d𝒘

𝑃(𝐷|ℋ𝑖) ≅ 𝑃(𝐷|𝒘𝑀𝑃,ℋ𝑖) 𝑃(𝒘𝑀𝑃|ℋ𝑖) ∆𝒘

evidence ≅ best-fit likelihood × Occam factor

11

Occam factor: ∆𝒘 / ∆0𝒘, i.e. the ratio of the posterior width ∆𝒘 of the weights to their prior range ∆0𝒘

Source: D. J. C. MacKay. Bayesian Interpolation. 1992

12

Given:

Dataset 𝐷 = {(𝑥𝑛, 𝑡𝑛)} with 𝑛 = 1…𝑁

Set of (non-linear) functions Φ = {𝜙ℎ: 𝑥 ⟼ 𝜙ℎ(𝑥)} with ℎ = 1…𝑀

Assumption:

𝑦(𝒙; 𝒘) = ∑ℎ=1…𝑀 𝑤ℎ𝜙ℎ(𝒙),

𝑡𝑛 = 𝑦(𝒙𝑛; 𝒘) + 𝜐𝑛,

where 𝜐𝑛 is additive noise distributed as 𝒩(0, 𝛼⁻¹)

Task: Find min𝒘 ‖Φ𝒘 − 𝒕‖² (Ordinary Least Squares)
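As a point of reference, here is a minimal numpy sketch of this least-squares task; the data and the basis functions are made up for illustration and are not from the talk:

import numpy as np

# Illustrative data: design matrix Phi built from basis functions phi_h(x) = x^h
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
Phi = np.column_stack([x**h for h in range(1, 6)])    # M = 5 basis functions
t = np.sin(3 * x) + rng.normal(0, 0.1, size=x.shape)  # targets with additive noise

# min_w ||Phi w - t||^2 via the least-squares solver
w_ols, *_ = np.linalg.lstsq(Phi, t, rcond=None)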

13

14

Problem:

Having too many features leads to overfitting!

Regularization

Assumption: "Weights are small"

𝑝(𝒘; 𝜆) ~ 𝒩(0, 𝜆⁻¹𝕀)

Task: Given 𝛼, 𝜆 find

min𝒘 𝛼‖Φ𝒘 − 𝒕‖² + 𝜆‖𝒘‖²
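For fixed 𝛼 and 𝜆 this penalized problem has the closed-form solution 𝒘 = (𝛼Φ𝑇Φ + 𝜆𝕀)⁻¹ 𝛼Φ𝑇𝒕. A small sketch, reusing Phi and t from the OLS snippet above and assuming made-up precision values:

import numpy as np

alpha, lam = 100.0, 1.0                               # assumed noise / weight precisions
A = alpha * Phi.T @ Phi + lam * np.eye(Phi.shape[1])
w_ridge = np.linalg.solve(A, alpha * Phi.T @ t)       # minimizer of alpha*||Phi w - t||^2 + lam*||w||^2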

15

Consider each pair (𝛼𝑖, 𝜆𝑖) as defining a model ℋ𝑖(𝛼, 𝜆).

Yes! That means we can use our Bayesian interpolation to find 𝒘, 𝜶, 𝝀 with the highest evidence!

This is the idea behind BayesianRidge as found in sklearn.linear_model
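A quick usage sketch, assuming a design matrix Phi and targets t as in the snippets above; the estimator and attribute names are scikit-learn's:

from sklearn.linear_model import BayesianRidge

# BayesianRidge estimates w together with the precisions alpha (noise) and
# lambda (weights) by maximizing the evidence.
br = BayesianRidge(compute_score=True)  # also record the log marginal likelihood per iteration
br.fit(Phi, t)
print(br.coef_, br.alpha_, br.lambda_)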

Consider that each weight has an individual variance, so that 𝑝(𝒘|𝝀) ~ 𝒩(0, Λ⁻¹), where Λ = diag(𝜆1, …, 𝜆𝑀) and 𝜆ℎ ∈ ℝ+.

Now, our minimization problem is: min𝒘 𝛼‖Φ𝒘 − 𝒕‖² + 𝒘𝑇Λ𝒘

16

Pruning: If the precision 𝜆ℎ of feature ℎ is high, its weight 𝑤ℎ is very likely to be close to zero and is therefore pruned.

This is called Sparse Bayesian Learning or Automatic Relevance Determination. Found as ARDRegression under sklearn.linear_model.
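A usage sketch, again assuming Phi and t from the earlier snippets:

from sklearn.linear_model import ARDRegression

# One precision lambda_i per weight; weights whose precision grows beyond
# threshold_lambda are pruned, i.e. their coefficient is set to zero.
ard = ARDRegression()
ard.fit(Phi, t)
print(ard.coef_)     # sparse weight vector
print(ard.lambda_)   # individual weight precisions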

Cross-validation can also be used to estimate the hyperparameters, but it suffers from the curse of dimensionality (inappropriate in low-statistics settings).

17

Source: Peter Ellerton, http://pactiss.org/2011/11/02/bayesian-inference-homo-bayesianis/

• Random 100 × 100 design matrix Φ with 100 samples and 100 features

• Weights 𝑤𝑖, 𝑖 ∈ 𝐼 = {1, …, 100}, a random subset 𝐽 ⊂ 𝐼 with |𝐽| = 10, and

  𝑤𝑖 = 0 for 𝑖 ∈ 𝐼\𝐽,

  𝑤𝑖 ∼ 𝒩(0, 1/4) for 𝑖 ∈ 𝐽

• Target 𝒕 = Φ𝒘 + 𝝂 with random noise 𝜈𝑖 ∼ 𝒩(0, 1/50)

Task: Reconstruct the weights, especially the 10 non-zero weights!

Source: http://scikit-learn.org/stable/auto_examples/linear_model/plot_ard.html#example-linear-model-plot-ard-py
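A sketch along the lines of the linked scikit-learn example; the seed and the exact sampling calls are illustrative, not the example's verbatim code:

import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(42)
n_samples, n_features = 100, 100
Phi = rng.standard_normal((n_samples, n_features))          # random design matrix

w_true = np.zeros(n_features)
relevant = rng.choice(n_features, size=10, replace=False)   # the subset J
w_true[relevant] = rng.normal(0, 0.5, size=10)              # std 0.5, i.e. variance 1/4

noise = rng.normal(0, 1 / np.sqrt(50), size=n_samples)      # variance 1/50
t = Phi @ w_true + noise

ard = ARDRegression().fit(Phi, t)
recovered = np.flatnonzero(np.abs(ard.coef_) > 1e-3)        # indices of non-pruned weights
print(sorted(relevant), recovered)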

18

19

20

21

22

We have to determine the parameters 𝒘, 𝝀, 𝛼 for

𝑃(𝒘, 𝝀, 𝛼|𝒕) = 𝑃(𝒘|𝒕, 𝝀, 𝛼) 𝑃(𝝀, 𝛼|𝒕)

1) Model fitting:

For the first factor, we have 𝑃(𝒘|𝒕, 𝝀, 𝛼) ~ 𝒩(𝝁, Σ) with

Σ = (Λ + 𝛼Φ𝑇Φ)⁻¹,

𝝁 = 𝛼ΣΦ𝑇𝒕.
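For fixed precisions these are just a couple of linear-algebra lines; a sketch with Phi and t as before and made-up precision values:

import numpy as np

alpha = 50.0                      # assumed noise precision
lambdas = np.ones(Phi.shape[1])   # assumed weight precisions (diagonal of Lambda)
Sigma = np.linalg.inv(np.diag(lambdas) + alpha * Phi.T @ Phi)
mu = alpha * Sigma @ Phi.T @ t    # posterior mean of the weights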

23

2) Model comparison:

For the second factor, we have

𝑃(𝝀, 𝛼|𝒕) ∝ 𝑃(𝒕|𝝀, 𝛼) 𝑃(𝝀) 𝑃(𝛼),

where 𝑃(𝝀) and 𝑃(𝛼) are hyperpriors which we assume to be uniform.

Using marginalization, we have

𝑃(𝒕|𝝀, 𝛼) = ∫ 𝑃(𝒕|𝒘, 𝛼) 𝑃(𝒘|𝝀) d𝒘,

i.e. the marginal likelihood or the "evidence for the hyperparameters".

24

Differentiating the log marginal likelihood with respect to 𝜆𝑖 and 𝛼 and setting the derivatives to zero, we get

𝜆𝑖 = 𝛾𝑖 / 𝜇𝑖²,

𝛼 = (𝑁 − ∑𝑖 𝛾𝑖) / ‖𝒕 − Φ𝝁‖²,

with 𝛾𝑖 = 1 − 𝜆𝑖Σ𝑖𝑖.

These formulae are used to find the maximum points 𝝀𝑀𝑃 and 𝛼𝑀𝑃.

25

1. Starting values 𝛼 = 𝜎⁻²(𝒕), 𝝀 = 𝟏

2. Calculate Σ = (Λ + 𝛼Φ𝑇Φ)⁻¹ and 𝒘 = 𝝁 = 𝛼ΣΦ𝑇𝒕

3. Update 𝜆𝑖 = 𝛾𝑖 / 𝜇𝑖² and 𝛼 = (𝑁 − ∑𝑖 𝛾𝑖) / ‖𝒕 − Φ𝝁‖², where 𝛾𝑖 = 1 − 𝜆𝑖Σ𝑖𝑖

4. Prune 𝜆𝑖 and 𝜙𝑖 if 𝜆𝑖 > 𝜆threshold

5. If not converged, go to 2.
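Putting the steps together, a compact sketch of this loop; it follows the slide, not scikit-learn's exact implementation, and numerical safeguards are minimal:

import numpy as np

def ard_fit(Phi, t, n_iter=300, tol=1e-3, lambda_threshold=1e4):
    N, M = Phi.shape
    alpha = 1.0 / np.var(t)            # 1. starting values: alpha = 1/var(t), lambda_i = 1
    lambdas = np.ones(M)
    keep = np.arange(M)                # indices of basis functions not yet pruned
    w = np.zeros(M)
    for _ in range(n_iter):
        Phi_k, lam_k = Phi[:, keep], lambdas[keep]
        # 2. posterior covariance and mean of the surviving weights
        Sigma = np.linalg.inv(np.diag(lam_k) + alpha * Phi_k.T @ Phi_k)
        mu = alpha * Sigma @ Phi_k.T @ t
        w[:] = 0.0
        w[keep] = mu
        # 3. re-estimate the hyperparameters
        gamma = 1.0 - lam_k * np.diag(Sigma)
        new_lam = gamma / (mu ** 2 + 1e-12)
        new_alpha = (N - gamma.sum()) / np.sum((t - Phi_k @ mu) ** 2)
        converged = np.max(np.abs(new_lam - lam_k)) < tol
        lambdas[keep], alpha = new_lam, new_alpha
        # 4. prune basis functions whose weight precision exceeds the threshold
        keep = keep[lambdas[keep] < lambda_threshold]
        # 5. stop once the lambdas no longer change, otherwise iterate
        if converged or keep.size == 0:
            break
    return w, lambdas, alpha

# usage: w_hat, lambdas_hat, alpha_hat = ard_fit(Phi, t)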

Sklearn implementation:

The parameters 𝛼1, 𝛼2 as well as 𝜆1, 𝜆2 are the hyperprior parameters for 𝛼 and 𝝀 with

𝑃(𝛼) ∼ Γ(𝛼1, 𝛼2⁻¹), 𝑃(𝜆𝑖) ∼ Γ(𝜆1, 𝜆2⁻¹).

𝐸[Γ(𝛼, 𝛽)] = 𝛼/𝛽 and 𝑉[Γ(𝛼, 𝛽)] = 𝛼/𝛽².
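These correspond to the constructor arguments alpha_1, alpha_2, lambda_1, lambda_2 of ARDRegression (and BayesianRidge). The tiny default values make the Gamma hyperpriors essentially flat, which roughly matches the uniform-hyperprior assumption above; a sketch with the defaults spelled out:

from sklearn.linear_model import ARDRegression

# Gamma hyperprior parameters: alpha_1/alpha_2 for the noise precision alpha,
# lambda_1/lambda_2 for the weight precisions lambda_i.
ard = ARDRegression(alpha_1=1e-6, alpha_2=1e-6,
                    lambda_1=1e-6, lambda_2=1e-6,
                    threshold_lambda=1e4)   # pruning threshold from step 4 above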

26

Given some new data 𝑥∗, a prediction for 𝑡∗ is made by

𝑃(𝑡∗|𝒕, 𝝀𝑀𝑃, 𝛼𝑀𝑃) = ∫ 𝑃(𝑡∗|𝒘, 𝛼𝑀𝑃) 𝑃(𝒘|𝒕, 𝝀𝑀𝑃, 𝛼𝑀𝑃) d𝒘

= 𝒩(𝝁𝑇𝜙(𝑥∗), 𝛼𝑀𝑃⁻¹ + 𝜙(𝑥∗)𝑇Σ𝜙(𝑥∗)).

This is a good approximation of the predictive distribution

𝑃(𝑡∗|𝒕) = ∫ 𝑃(𝑡∗|𝒘, 𝝀, 𝛼) 𝑃(𝒘, 𝝀, 𝛼|𝒕) d𝒘 d𝝀 d𝛼.
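In scikit-learn this predictive mean and standard deviation are exposed via predict(..., return_std=True); a sketch assuming the fitted ard estimator from above and a hypothetical Phi_new holding the basis-expanded new inputs:

# Predictive mean and std for new inputs; the std combines the noise level
# 1/alpha_MP and the weight uncertainty phi(x*)^T Sigma phi(x*).
t_mean, t_std = ard.predict(Phi_new, return_std=True)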

27

1. D. J. C. MacKay. Bayesian Interpolation. 1992 (… to understand the overall idea)

2. M. E. Tipping. Sparse Bayesian Learning and the Relevance Vector Machine. June 2001 (… to understand the ARD algorithm)

3. T. Fletcher. Relevance Vector Machines Explained. October 2010 (… to understand the ARD algorithm in detail)

4. D. Wipf. A New View of Automatic Relevance Determination. 2008 (… not as good as the ones above)

Graphs from slides 7 and 9 were taken from [1] and the awesome tutorials of Scikit-Learn were consulted many times.

28

29