
Linear Regression

Marc Deisenroth, Centre for Artificial Intelligence, Department of Computer Science, University College London

@mpd37 · m.deisenroth@ucl.ac.uk

https://deisenroth.cc

AIMS Rwanda and AIMS Ghana

March/April 2020

Reference

MATHEMATICS FOR MACHINE LEARNING

Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong


The fundamental mathematical tools needed to understand machine learning include linear algebra, analytic geometry, matrix decompositions, vector calculus, optimization, probability and statistics. These topics are traditionally taught in disparate courses, making it hard for data science or computer science students, or professionals, to efficiently learn the mathematics. This self-contained textbook bridges the gap between mathematical and machine learning texts, introducing the mathematical concepts with a minimum of prerequisites. It uses these concepts to derive four central machine learning methods: linear regression, principal component analysis, Gaussian mixture models and support vector machines. For students and others with a mathematical background, these derivations provide a starting point to machine learning texts. For those learning the mathematics for the first time, the methods help build intuition and practical experience with applying mathematical concepts. Every chapter includes worked examples and exercises to test understanding. Programming tutorials are offered on the book's web site.

MARC PETER DEISENROTH is Senior Lecturer in Statistical Machine Learning at the Department of Computing, Imperial College London.

A. ALDO FAISAL leads the Brain & Behaviour Lab at Imperial College London, where he is also Reader in Neurotechnology at the Department of Bioengineering and the Department of Computing.

CHENG SOON ONG is Principal Research Scientist at the Machine Learning Research Group, Data61, CSIRO. He is also Adjunct Associate Professor at Australian National University.


https://mml-book.com

Chapter 9


Regression Problems

Regression (curve fitting)

Given inputs $x \in \mathbb{R}^D$ and corresponding observations $y \in \mathbb{R}$, find a function $f$ that models the relationship between $x$ and $y$.

Figure: example regression dataset (inputs $x$, observations $y$)

Typically parametrize the function f with parameters θ

Linear regression: Consider functions $f$ that are linear in the parameters.

Linear Regression Functions

Straight lines:

$y = f(x, \theta) = \theta_0 + \theta_1 x = \begin{bmatrix}\theta_0 & \theta_1\end{bmatrix}\begin{bmatrix}1 \\ x\end{bmatrix}$

Polynomials:

$y = f(x, \theta) = \sum_{m=0}^{M} \theta_m x^m = \begin{bmatrix}\theta_0 & \cdots & \theta_M\end{bmatrix}\begin{bmatrix}1 \\ \vdots \\ x^M\end{bmatrix}$

Radial basis function networks:

$y = f(x, \theta) = \sum_{m=1}^{M} \theta_m \exp\bigl(-\tfrac{1}{2}(x - \mu_m)^2\bigr)$

Figures: example straight lines, polynomials, and radial basis function networks $f(x)$
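To make the common structure explicit, here is a minimal NumPy sketch of the three feature maps; the parameter values and centers are illustrative assumptions, and in each case the model is linear in $\theta$:

```python
import numpy as np

def line_features(x):
    """phi(x) = [1, x] for the straight-line model."""
    return np.stack([np.ones_like(x), x], axis=-1)

def poly_features(x, M):
    """phi(x) = [1, x, ..., x^M] for the polynomial model."""
    return np.stack([x**m for m in range(M + 1)], axis=-1)

def rbf_features(x, centers):
    """phi_m(x) = exp(-0.5 * (x - mu_m)^2) for the RBF network."""
    return np.exp(-0.5 * (x[:, None] - centers[None, :])**2)

# All three models have the form f(x) = Phi(x) @ theta, i.e. linear in theta
x = np.linspace(-4, 4, 5)
theta_line = np.array([0.5, 1.0])          # illustrative parameters
print(line_features(x) @ theta_line)
```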


Linear Regression Model and Setting

$y = x^\top \theta + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$

Given a training set $(x_1, y_1), \ldots, (x_N, y_N)$, we seek optimal parameters $\theta^*$:

Maximum likelihood estimation
Maximum a posteriori (MAP) estimation


Maximum Likelihood

Define $X = [x_1, \ldots, x_N]^\top \in \mathbb{R}^{N \times D}$ and $\mathbf{y} = [y_1, \ldots, y_N]^\top \in \mathbb{R}^N$.

Find parameters $\theta^*$ that maximize the likelihood

$p(y_1, \ldots, y_N \mid x_1, \ldots, x_N, \theta) = p(\mathbf{y} \mid X, \theta) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid x_n^\top \theta, \sigma^2)$

Log-transformation: maximize the log-likelihood

$\log p(\mathbf{y} \mid X, \theta) = \sum_{n=1}^{N} \log \mathcal{N}(y_n \mid x_n^\top \theta, \sigma^2), \qquad \log \mathcal{N}(y_n \mid x_n^\top \theta, \sigma^2) = -\frac{1}{2\sigma^2}(y_n - x_n^\top \theta)^2 + \text{const}$


Maximum Likelihood (2)

With

$\log \mathcal{N}(y_n \mid x_n^\top \theta, \sigma^2) = -\frac{1}{2\sigma^2}(y_n - x_n^\top \theta)^2 + \text{const}$

we get

$\log p(\mathbf{y} \mid X, \theta) = \sum_{n=1}^{N} \log \mathcal{N}(y_n \mid x_n^\top \theta, \sigma^2) = -\frac{1}{2\sigma^2}\sum_{n=1}^{N}(y_n - x_n^\top \theta)^2 + \text{const} = -\frac{1}{2\sigma^2}(\mathbf{y} - X\theta)^\top(\mathbf{y} - X\theta) + \text{const} = -\frac{1}{2\sigma^2}\|\mathbf{y} - X\theta\|^2 + \text{const}$

Computing the gradient with respect to $\theta$ and setting it to $0$ gives the maximum likelihood estimator (least-squares estimator)

$\theta_{\mathrm{ML}} = (X^\top X)^{-1} X^\top \mathbf{y}$
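As a small illustration, here is a minimal NumPy sketch of the maximum likelihood estimator on synthetic data (the data-generating parameters and noise level are assumptions for the example); `np.linalg.lstsq` solves the least-squares problem without forming $(X^\top X)^{-1}$ explicitly, which is numerically preferable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = x^T theta_true + noise (illustrative, not the slide data)
N, D, sigma = 50, 3, 0.5
X = rng.normal(size=(N, D))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + sigma * rng.normal(size=N)

# Maximum likelihood / least-squares estimator theta_ML = (X^T X)^{-1} X^T y
theta_ml, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_ml)
```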


Making Predictions

$y = x^\top \theta + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$

Given an arbitrary input $x_*$, we can predict the corresponding observation $y_*$ using the maximum likelihood parameter:

$p(y_* \mid x_*, \theta_{\mathrm{ML}}) = \mathcal{N}(y_* \mid x_*^\top \theta_{\mathrm{ML}}, \sigma^2)$

Measurement noise variance $\sigma^2$ assumed known

In the absence of noise ($\sigma^2 = 0$), the prediction will be deterministic

Example 1: Linear Functions

Figure: training data with the maximum likelihood (MLE) fit

$y = \theta_0 + \theta_1 x + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$

At any query point $x_*$ we obtain the mean prediction as

$\mathbb{E}[y_* \mid \theta_{\mathrm{ML}}, x_*] = \theta_0^{\mathrm{ML}} + \theta_1^{\mathrm{ML}} x_*$

Nonlinear Functions

$y = \phi(x)^\top \theta + \varepsilon = \sum_{m=0}^{M} \theta_m x^m + \varepsilon$

Polynomial regression with features

$\phi(x) = [1, x, x^2, \ldots, x^M]^\top$

Maximum likelihood estimator:

$\theta_{\mathrm{ML}} = (\Phi^\top \Phi)^{-1} \Phi^\top \mathbf{y}$

where $\Phi$ is the feature (design) matrix with rows $\phi(x_n)^\top$.
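A brief sketch of polynomial regression via the feature matrix; the dataset, noise level, and degree $M$ are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 1-D dataset (not the data from the slides)
N, sigma = 30, 0.3
x = rng.uniform(-4, 4, size=N)
y = np.sin(x) + sigma * rng.normal(size=N)

# Polynomial features phi(x) = [1, x, ..., x^M], stacked into Phi (N x (M+1))
M = 4
Phi = np.vander(x, M + 1, increasing=True)

# Maximum likelihood estimator theta_ML = (Phi^T Phi)^{-1} Phi^T y
theta_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Mean predictions at query points
x_star = np.linspace(-4, 4, 100)
y_star = np.vander(x_star, M + 1, increasing=True) @ theta_ml
```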


Example 2: Polynomial Regression

Figures: training data, and maximum likelihood (MLE) fits with polynomials of order 2, 4, 6, 8, and 10

Overfitting

Figure: training and test RMSE as a function of the polynomial degree (0 to 10)

Training error decreases with higher flexibility of the model.

We are not so much interested in the training error, but in the generalization error: How well does the model perform when we predict at previously unseen input locations?

Maximum likelihood often runs into overfitting problems, i.e., we exploit the flexibility of the model to fit to the noise in the data.


MAP Estimation

Empirical observation: Parametric models that overfit tend to have some extreme (large-amplitude) parameter values.

Mitigate the effect of overfitting by placing a prior distribution $p(\theta)$ on the parameters.

Penalize extreme values that are implausible under that prior.

Choose $\theta^*$ as the parameter that maximizes the (log) parameter posterior

$\log p(\theta \mid X, \mathbf{y}) = \underbrace{\log p(\mathbf{y} \mid X, \theta)}_{\text{log-likelihood}} + \underbrace{\log p(\theta)}_{\text{log-prior}} + \text{const}$

The log-prior induces a direct penalty on the parameters.

Maximum a posteriori (MAP) estimate (regularized least squares)


MAP Estimation (2)

Gaussian parameter prior $p(\theta) = \mathcal{N}(0, \alpha^2 I)$

Log-posterior distribution:

$\log p(\theta \mid X, \mathbf{y}) = -\frac{1}{2\sigma^2}(\mathbf{y} - X\theta)^\top(\mathbf{y} - X\theta) - \frac{1}{2\alpha^2}\theta^\top\theta + \text{const} = -\frac{1}{2\sigma^2}\|\mathbf{y} - X\theta\|^2 - \frac{1}{2\alpha^2}\|\theta\|^2 + \text{const}$

Compute the gradient with respect to $\theta$, set it to $0$.

Maximum a posteriori estimate:

$\theta_{\mathrm{MAP}} = \bigl(X^\top X + \tfrac{\sigma^2}{\alpha^2} I\bigr)^{-1} X^\top \mathbf{y}$
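A minimal sketch of the MAP (regularized least-squares) estimate alongside the ML estimate; the dataset, polynomial degree, and the values of $\sigma$ and $\alpha$ are assumptions for illustration:

```python
import numpy as np

def map_estimate(Phi, y, sigma=0.3, alpha=1.0):
    """theta_MAP = (Phi^T Phi + (sigma^2 / alpha^2) I)^{-1} Phi^T y."""
    D = Phi.shape[1]
    reg = (sigma**2 / alpha**2) * np.eye(D)
    return np.linalg.solve(Phi.T @ Phi + reg, Phi.T @ y)

# Hypothetical high-order polynomial example
rng = np.random.default_rng(2)
x = rng.uniform(-4, 4, size=20)
y = np.sin(x) + 0.3 * rng.normal(size=20)
Phi = np.vander(x, 11, increasing=True)  # 10th-order polynomial features

theta_map = map_estimate(Phi, y)
theta_ml = np.linalg.lstsq(Phi, y, rcond=None)[0]
# The prior typically shrinks the extreme parameter values that the ML fit produces
print(np.abs(theta_map).max(), np.abs(theta_ml).max())
```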


Example: Polynomial Regression

Figures: training data, and MLE vs. MAP fits with polynomials of order 2, 4, 6, 8, and 10

Mean prediction:

$\mathbb{E}[y_* \mid x_*, \theta_{\mathrm{MAP}}] = \phi(x_*)^\top \theta_{\mathrm{MAP}}$

Generalization Error

Figure: test-set RMSE as a function of the polynomial degree, for maximum likelihood and maximum a posteriori estimation

MAP estimation "delays" the problem of overfitting.

It does not provide a general solution; we need a more principled solution.

Bayesian Linear Regression

$y = \phi(x)^\top \theta + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$

Avoid overfitting by not fitting any parameters: integrate the parameters out instead of optimizing them.

Use a full parameter distribution $p(\theta)$ (and not a single point estimate $\theta^*$) when making predictions:

$p(y_* \mid x_*) = \int p(y_* \mid x_*, \theta)\, p(\theta)\, d\theta$

The prediction no longer depends on $\theta$.

The predictive distribution reflects the uncertainty about the "correct" parameter setting.


Example

Figures: left, Bayesian linear regression (BLR) predictive distribution compared with the MLE and MAP fits on the training data; right, plausible functions under the parameter distribution

Light gray: uncertainty due to noise (same as in MLE/MAP)

Dark gray: uncertainty due to parameter uncertainty

Right: plausible functions under the parameter distribution (every single parameter setting describes one function)


Model for Bayesian Linear Regression

Prior: $p(\theta) = \mathcal{N}(m_0, S_0)$
Likelihood: $p(y \mid x, \theta) = \mathcal{N}(y \mid \phi(x)^\top \theta, \sigma^2)$

The parameter $\theta$ becomes a latent (random) variable.

The prior distribution induces a distribution over plausible functions.

Choose a conjugate Gaussian prior: closed-form computations, Gaussian posterior.

Parameter Posterior and Predictions

Prior $p(\theta) = \mathcal{N}(m_0, S_0)$ is Gaussian, so the posterior is Gaussian:

$p(\theta \mid X, \mathbf{y}) = \mathcal{N}(m_N, S_N)$
$S_N = (S_0^{-1} + \sigma^{-2}\Phi^\top\Phi)^{-1}$
$m_N = S_N(S_0^{-1} m_0 + \sigma^{-2}\Phi^\top\mathbf{y})$

The mean $m_N$ is identical to the MAP estimate.

Assume a Gaussian distribution $p(\theta) = \mathcal{N}(m_N, S_N)$. Then

$p(y_* \mid x_*) = \mathcal{N}\bigl(y_* \mid \phi(x_*)^\top m_N,\ \phi(x_*)^\top S_N \phi(x_*) + \sigma^2\bigr)$

$\phi(x_*)^\top S_N \phi(x_*)$: accounts for parameter uncertainty in the predictive variance.

More details: https://mml-book.com, Chapter 9
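A compact sketch of the posterior and predictive computations, reusing a hypothetical polynomial setup; the prior mean, prior covariance, and noise level are illustrative assumptions:

```python
import numpy as np

def blr_posterior(Phi, y, sigma, m0, S0):
    """Gaussian posterior N(m_N, S_N) for Bayesian linear regression."""
    S0_inv = np.linalg.inv(S0)
    S_N = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma**2)
    m_N = S_N @ (S0_inv @ m0 + Phi.T @ y / sigma**2)
    return m_N, S_N

def blr_predict(phi_star, m_N, S_N, sigma):
    """Predictive mean and variance at a single feature vector phi(x_*)."""
    mean = phi_star @ m_N
    var = phi_star @ S_N @ phi_star + sigma**2
    return mean, var

# Hypothetical example with 4th-order polynomial features
rng = np.random.default_rng(3)
x = rng.uniform(-4, 4, size=25)
y = np.sin(x) + 0.3 * rng.normal(size=25)
Phi = np.vander(x, 5, increasing=True)
m0, S0 = np.zeros(5), np.eye(5)          # prior N(0, I), an assumption
m_N, S_N = blr_posterior(Phi, y, sigma=0.3, m0=m0, S0=S0)
mean, var = blr_predict(np.vander([1.5], 5, increasing=True)[0], m_N, S_N, sigma=0.3)
```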


Marginal Likelihood

The marginal likelihood can be computed analytically. With $p(\theta) = \mathcal{N}(\mu, \Sigma)$,

$p(\mathbf{y} \mid X) = \int p(\mathbf{y} \mid X, \theta)\, p(\theta)\, d\theta = \mathcal{N}(\mathbf{y} \mid \Phi\mu,\ \Phi\Sigma\Phi^\top + \sigma^2 I)$

Derivation via completing the squares (see Section 9.3.5 of the MML book).
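A minimal sketch evaluating the log marginal likelihood with SciPy's multivariate normal; the feature matrix and prior are whatever the caller supplies, so this can be used, for example, to compare different feature sets or polynomial degrees:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_marginal_likelihood(Phi, y, sigma, mu, Sigma):
    """log p(y | X) = log N(y | Phi mu, Phi Sigma Phi^T + sigma^2 I)."""
    mean = Phi @ mu
    cov = Phi @ Sigma @ Phi.T + sigma**2 * np.eye(len(y))
    return multivariate_normal(mean=mean, cov=cov).logpdf(y)
```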

Distribution over Functions

Consider a linear regression setting

$y = f(x) + \varepsilon = a + bx + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2), \qquad p(a, b) = \mathcal{N}(0, I)$

Figure: contours of the prior distribution $p(a, b)$ in the $(a, b)$ plane

Sampling from the Prior over Functions

Consider a linear regression setting

$y = f(x) + \varepsilon = a + bx + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2), \qquad p(a, b) = \mathcal{N}(0, I)$

$f_i(x) = a_i + b_i x, \quad [a_i, b_i] \sim p(a, b)$
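A short sketch that draws a few straight lines from the prior $p(a, b) = \mathcal{N}(0, I)$; the number of samples and the plotting range are assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
x = np.linspace(-5, 5, 100)

# Each prior sample (a_i, b_i) ~ N(0, I) defines one function f_i(x) = a_i + b_i * x
samples = rng.multivariate_normal(mean=np.zeros(2), cov=np.eye(2), size=10)
for a_i, b_i in samples:
    plt.plot(x, a_i + b_i * x, alpha=0.6)
plt.xlabel("x")
plt.ylabel("f(x)")
plt.show()
```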

Sampling from the Posterior over Functions

Consider a linear regression setting

$y = f(x) + \varepsilon = a + bx + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2), \qquad p(a, b) = \mathcal{N}(0, I)$

Training inputs/targets: $X = [x_1, \ldots, x_N]$, $\mathbf{y} = [y_1, \ldots, y_N]$

Figure: training data in the $(x, y)$ plane

Posterior: $p(a, b \mid X, \mathbf{y}) = \mathcal{N}(m_N, S_N)$

Figure: contours of the posterior $p(a, b \mid X, \mathbf{y})$ in the $(a, b)$ plane

Sampled functions: $[a_i, b_i] \sim p(a, b \mid X, \mathbf{y}), \quad f_i(x) = a_i + b_i x$
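Continuing the sketch above, posterior samples can be drawn using the closed-form $m_N$ and $S_N$ for the feature map $\phi(x) = [1, x]^\top$; the training data and noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical training data for y = a + b*x + noise
x_train = rng.uniform(-10, 10, size=15)
y_train = 1.0 + 0.8 * x_train + rng.normal(scale=1.0, size=15)

sigma_n = 1.0
Phi = np.column_stack([np.ones_like(x_train), x_train])   # phi(x) = [1, x]

# Posterior N(m_N, S_N) with prior p(a, b) = N(0, I)
S_N = np.linalg.inv(np.eye(2) + Phi.T @ Phi / sigma_n**2)
m_N = S_N @ (Phi.T @ y_train) / sigma_n**2

# Each posterior sample (a_i, b_i) defines one plausible function f_i(x) = a_i + b_i * x
samples = rng.multivariate_normal(m_N, S_N, size=10)
```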

Fitting Nonlinear Functions

Fit nonlinear functions using (Bayesian) linear regression: a linear combination of nonlinear features.

Example: radial-basis-function (RBF) network

$f(x) = \sum_{i=1}^{n} \theta_i \phi_i(x), \quad \theta_i \sim \mathcal{N}(0, \sigma_p^2)$

where

$\phi_i(x) = \exp\bigl(-\tfrac{1}{2}(x - \mu_i)^\top(x - \mu_i)\bigr)$

for given "centers" $\mu_i$.


Illustration: Fitting a Radial Basis Function Network

$\phi_i(x) = \exp\bigl(-\tfrac{1}{2}(x - \mu_i)^\top(x - \mu_i)\bigr)$

Figure: the 25 Gaussian-shaped basis functions $\phi_i(x)$

Place Gaussian-shaped basis functions $\phi_i$ at 25 input locations $\mu_i$, linearly spaced in the interval $[-5, 3]$.
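A small sketch building the RBF feature matrix for 25 centers linearly spaced in $[-5, 3]$ and computing the Bayesian linear regression posterior from before; the training data, noise level, and unit prior are assumptions for illustration:

```python
import numpy as np

def rbf_features(x, centers):
    """phi_i(x) = exp(-0.5 * (x - mu_i)^2) for scalar inputs; returns an (N, n_centers) matrix."""
    return np.exp(-0.5 * (x[:, None] - centers[None, :])**2)

rng = np.random.default_rng(6)
centers = np.linspace(-5, 3, 25)

# Hypothetical training data
x_train = rng.uniform(-5, 3, size=40)
y_train = np.sin(x_train) + 0.2 * rng.normal(size=40)

sigma = 0.2
Phi = rbf_features(x_train, centers)
S_N = np.linalg.inv(np.eye(25) + Phi.T @ Phi / sigma**2)   # prior p(theta) = N(0, I)
m_N = S_N @ Phi.T @ y_train / sigma**2

# Posterior mean prediction on a grid
x_grid = np.linspace(-5, 5, 200)
f_mean = rbf_features(x_grid, centers) @ m_N
```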

Samples from the RBF Prior

$f(x) = \sum_{i=1}^{n} \theta_i \phi_i(x), \quad p(\theta) = \mathcal{N}(0, I)$

Figure: functions $f(x)$ sampled from the RBF prior

Samples from the RBF Posterior

$f(x) = \sum_{i=1}^{n} \theta_i \phi_i(x), \quad p(\theta \mid X, \mathbf{y}) = \mathcal{N}(m_N, S_N)$

Figure: functions $f(x)$ sampled from the RBF posterior

RBF Posterior

Figure: RBF posterior fit $f(x)$

Limitations

Figure: RBF posterior fit $f(x)$

Feature engineering (what basis functions to use?)

Finite number of features:
Above: without basis functions on the right, we cannot express any variability of the function.
Ideally: add more (infinitely many) basis functions.

Approach

Instead of sampling parameters, which induce a distribution over functions, sample functions directly.

Place a prior on functions: make assumptions on the distribution of functions.

Intuition: a function is an infinitely long vector of function values; make assumptions on the distribution of function values.

Gaussian process


Summary

Figures: test RMSE for maximum likelihood vs. maximum a posteriori estimation; MLE/MAP/BLR fits to the training data; posterior over the parameters $(a, b)$; functions sampled from the posterior

Regression = curve fitting
Linear regression = linear in the parameters
Parameter estimation via maximum likelihood and MAP estimation can lead to overfitting
Bayesian linear regression addresses this issue, but may not be analytically tractable
Predictive uncertainty in Bayesian linear regression explicitly accounts for parameter uncertainty
Distribution over parameters induces a distribution over functions

Appendix


Joint Gaussian Distribution

Joint Gaussian distribution

$p(x, y) = \mathcal{N}\left(\begin{bmatrix}\mu_x \\ \mu_y\end{bmatrix}, \begin{bmatrix}\Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy}\end{bmatrix}\right)$

Marginal:

$p(x) = \int p(x, y)\, dy = \mathcal{N}(\mu_x, \Sigma_{xx})$

Conditional:

$p(x \mid y) = \mathcal{N}(\mu_{x|y}, \Sigma_{x|y})$
$\mu_{x|y} = \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y)$
$\Sigma_{x|y} = \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$
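A direct translation of the conditioning formulas into NumPy; this is a sketch, and the block structure of the joint covariance (with $\Sigma_{yx} = \Sigma_{xy}^\top$) is whatever the caller supplies:

```python
import numpy as np

def gaussian_condition(mu_x, mu_y, Sxx, Sxy, Syy, y_obs):
    """Return mean and covariance of p(x | y = y_obs) for a joint Gaussian."""
    Syy_inv = np.linalg.inv(Syy)
    mu_cond = mu_x + Sxy @ Syy_inv @ (y_obs - mu_y)
    Sigma_cond = Sxx - Sxy @ Syy_inv @ Sxy.T   # Sigma_yx = Sigma_xy^T for a valid joint covariance
    return mu_cond, Sigma_cond
```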


Linear Transformation of Gaussian Random Variables

If $x \sim \mathcal{N}(x \mid \mu, \Sigma)$ and $z = Ax + b$, then

$p(z) = \mathcal{N}(z \mid A\mu + b,\ A\Sigma A^\top)$

Product of Two Gaussians

Let $x \in \mathbb{R}^D$. Then:

$\mathcal{N}(x \mid a, A)\,\mathcal{N}(x \mid b, B) = Z\,\mathcal{N}(x \mid c, C)$
$C = (A^{-1} + B^{-1})^{-1}$
$c = C(A^{-1}a + B^{-1}b)$
$Z = (2\pi)^{-D/2}\,|A + B|^{-1/2}\exp\bigl(-\tfrac{1}{2}(a - b)^\top(A + B)^{-1}(a - b)\bigr)$

The product of two Gaussians is an unnormalized Gaussian.

The "un-normalizer" $Z$ has a Gaussian functional form:

$Z = \mathcal{N}(a \mid b, A + B) = \mathcal{N}(b \mid a, A + B)$

Note: This is not a distribution (no random variables).
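For reference, a sketch computing $c$, $C$, and $Z$ numerically, using the Gaussian-form identity for $Z$:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_product(a, A, b, B):
    """N(x|a,A) * N(x|b,B) = Z * N(x|c,C); returns c, C, Z."""
    A_inv, B_inv = np.linalg.inv(A), np.linalg.inv(B)
    C = np.linalg.inv(A_inv + B_inv)
    c = C @ (A_inv @ a + B_inv @ b)
    Z = multivariate_normal(mean=b, cov=A + B).pdf(a)   # Z = N(a | b, A + B)
    return c, C, Z
```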


Example: Marginalization of a Product

$p_1(x) = \mathcal{N}(x \mid a, A), \qquad p_2(x) = \mathcal{N}(x \mid b, B)$

Then

$\int p_1(x)\, p_2(x)\, dx = Z = \mathcal{N}(a \mid b, A + B) \in \mathbb{R}$

Note: In this context, $\mathcal{N}$ is used to describe the functional relationship between $a$ and $b$. Do not treat $a$ or $b$ as random variables; they are both deterministic quantities.
