
Linear Regression

Marc Deisenroth, Centre for Artificial Intelligence, Department of Computer Science, University College London

@mpd37 · m.deisenroth@ucl.ac.uk

https://deisenroth.cc

AIMS Rwanda and AIMS Ghana

March/April 2020

Reference

MATHEMATICS FOR MACHINE LEARNING

Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong


The fundamental mathematical tools needed to understand machine learning include linear algebra, analytic geometry, matrix decompositions, vector calculus, optimization, probability and statistics. These topics are traditionally taught in disparate courses, making it hard for data science or computer science students, or professionals, to efficiently learn the mathematics. This self-contained textbook bridges the gap between mathematical and machine learning texts, introducing the mathematical concepts with a minimum of prerequisites. It uses these concepts to derive four central machine learning methods: linear regression, principal component analysis, Gaussian mixture models and support vector machines. For students and others with a mathematical background, these derivations provide a starting point to machine learning texts. For those learning the mathematics for the first time, the methods help build intuition and practical experience with applying mathematical concepts. Every chapter includes worked examples and exercises to test understanding. Programming tutorials are offered on the book's web site.

MARC PETER DEISENROTH is Senior Lecturer in Statistical Machine Learning at the Department of Computing, Imperial College London.

A. ALDO FAISAL leads the Brain & Behaviour Lab at Imperial College London, where he is also Reader in Neurotechnology at the Department of Bioengineering and the Department of Computing.

CHENG SOON ONG is Principal Research Scientist at the Machine Learning Research Group, Data61, CSIRO. He is also Adjunct Associate Professor at Australian National University.


https://mml-book.com

Chapter 9


Regression Problems

Regression (curve fitting)

Given inputs $x \in \mathbb{R}^D$ and corresponding observations $y \in \mathbb{R}$, find a function $f$ that models the relationship between $x$ and $y$.

Figure: example regression dataset (inputs $x$, observations $y$)

Typically parametrize the function f with parameters θ

Linear regression: Consider functions $f$ that are linear in the parameters.

Linear Regression Functions

Straight lines:

$y = f(x, \theta) = \theta_0 + \theta_1 x = \begin{bmatrix}\theta_0 & \theta_1\end{bmatrix}\begin{bmatrix}1 \\ x\end{bmatrix}$

Polynomials:

$y = f(x, \theta) = \sum_{m=0}^{M} \theta_m x^m = \begin{bmatrix}\theta_0 & \cdots & \theta_M\end{bmatrix}\begin{bmatrix}1 \\ \vdots \\ x^M\end{bmatrix}$

Radial basis function networks:

$y = f(x, \theta) = \sum_{m=1}^{M} \theta_m \exp\bigl(-\tfrac{1}{2}(x - \mu_m)^2\bigr)$

Figures: example straight lines, polynomials, and radial basis function networks $f(x)$
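To make the common structure explicit, here is a minimal NumPy sketch of the three feature maps; the parameter values and centers are illustrative assumptions, and in each case the model is linear in $\theta$:

```python
import numpy as np

def line_features(x):
    """phi(x) = [1, x] for the straight-line model."""
    return np.stack([np.ones_like(x), x], axis=-1)

def poly_features(x, M):
    """phi(x) = [1, x, ..., x^M] for the polynomial model."""
    return np.stack([x**m for m in range(M + 1)], axis=-1)

def rbf_features(x, centers):
    """phi_m(x) = exp(-0.5 * (x - mu_m)^2) for the RBF network."""
    return np.exp(-0.5 * (x[:, None] - centers[None, :])**2)

# All three models have the form f(x) = Phi(x) @ theta, i.e. linear in theta
x = np.linspace(-4, 4, 5)
theta_line = np.array([0.5, 1.0])          # illustrative parameters
print(line_features(x) @ theta_line)
```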


Linear Regression Model and Setting

$y = x^\top \theta + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$

Given a training set $(x_1, y_1), \ldots, (x_N, y_N)$, we seek optimal parameters $\theta^*$:

Maximum likelihood estimation
Maximum a posteriori (MAP) estimation


Maximum Likelihood

Define $X = [x_1, \ldots, x_N]^\top \in \mathbb{R}^{N \times D}$ and $\mathbf{y} = [y_1, \ldots, y_N]^\top \in \mathbb{R}^N$.

Find parameters $\theta^*$ that maximize the likelihood

$p(y_1, \ldots, y_N \mid x_1, \ldots, x_N, \theta) = p(\mathbf{y} \mid X, \theta) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid x_n^\top \theta, \sigma^2)$

Log-transformation: maximize the log-likelihood

$\log p(\mathbf{y} \mid X, \theta) = \sum_{n=1}^{N} \log \mathcal{N}(y_n \mid x_n^\top \theta, \sigma^2), \qquad \log \mathcal{N}(y_n \mid x_n^\top \theta, \sigma^2) = -\frac{1}{2\sigma^2}(y_n - x_n^\top \theta)^2 + \text{const}$


Maximum Likelihood (2)

With

$\log \mathcal{N}(y_n \mid x_n^\top \theta, \sigma^2) = -\frac{1}{2\sigma^2}(y_n - x_n^\top \theta)^2 + \text{const}$

we get

$\log p(\mathbf{y} \mid X, \theta) = \sum_{n=1}^{N} \log \mathcal{N}(y_n \mid x_n^\top \theta, \sigma^2) = -\frac{1}{2\sigma^2}\sum_{n=1}^{N}(y_n - x_n^\top \theta)^2 + \text{const} = -\frac{1}{2\sigma^2}(\mathbf{y} - X\theta)^\top(\mathbf{y} - X\theta) + \text{const} = -\frac{1}{2\sigma^2}\|\mathbf{y} - X\theta\|^2 + \text{const}$

Computing the gradient with respect to $\theta$ and setting it to $0$ gives the maximum likelihood estimator (least-squares estimator)

$\theta_{\mathrm{ML}} = (X^\top X)^{-1} X^\top \mathbf{y}$
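As a small illustration, here is a minimal NumPy sketch of the maximum likelihood estimator on synthetic data (the data-generating parameters and noise level are assumptions for the example); `np.linalg.lstsq` solves the least-squares problem without forming $(X^\top X)^{-1}$ explicitly, which is numerically preferable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = x^T theta_true + noise (illustrative, not the slide data)
N, D, sigma = 50, 3, 0.5
X = rng.normal(size=(N, D))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + sigma * rng.normal(size=N)

# Maximum likelihood / least-squares estimator theta_ML = (X^T X)^{-1} X^T y
theta_ml, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_ml)
```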


Making Predictions

$y = x^\top \theta + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$

Given an arbitrary input $x_*$, we can predict the corresponding observation $y_*$ using the maximum likelihood parameter:

$p(y_* \mid x_*, \theta_{\mathrm{ML}}) = \mathcal{N}(y_* \mid x_*^\top \theta_{\mathrm{ML}}, \sigma^2)$

Measurement noise variance $\sigma^2$ assumed known

In the absence of noise ($\sigma^2 = 0$), the prediction will be deterministic

Example 1: Linear Functions

Figure: training data with the maximum likelihood (MLE) fit

$y = \theta_0 + \theta_1 x + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$

At any query point $x_*$ we obtain the mean prediction as

$\mathbb{E}[y_* \mid \theta_{\mathrm{ML}}, x_*] = \theta_0^{\mathrm{ML}} + \theta_1^{\mathrm{ML}} x_*$

Nonlinear Functions

$y = \phi(x)^\top \theta + \varepsilon = \sum_{m=0}^{M} \theta_m x^m + \varepsilon$

Polynomial regression with features

$\phi(x) = [1, x, x^2, \ldots, x^M]^\top$

Maximum likelihood estimator:

$\theta_{\mathrm{ML}} = (\Phi^\top \Phi)^{-1} \Phi^\top \mathbf{y}$

where $\Phi$ is the feature (design) matrix with rows $\phi(x_n)^\top$.
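A brief sketch of polynomial regression via the feature matrix; the dataset, noise level, and degree $M$ are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 1-D dataset (not the data from the slides)
N, sigma = 30, 0.3
x = rng.uniform(-4, 4, size=N)
y = np.sin(x) + sigma * rng.normal(size=N)

# Polynomial features phi(x) = [1, x, ..., x^M], stacked into Phi (N x (M+1))
M = 4
Phi = np.vander(x, M + 1, increasing=True)

# Maximum likelihood estimator theta_ML = (Phi^T Phi)^{-1} Phi^T y
theta_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Mean predictions at query points
x_star = np.linspace(-4, 4, 100)
y_star = np.vander(x_star, M + 1, increasing=True) @ theta_ml
```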


Example 2: Polynomial Regression

Figures: training data, and maximum likelihood (MLE) fits with polynomials of order 2, 4, 6, 8, and 10

Overfitting

Figure: training and test RMSE as a function of the polynomial degree (0 to 10)

Training error decreases with higher flexibility of the model.

We are not so much interested in the training error, but in the generalization error: How well does the model perform when we predict at previously unseen input locations?

Maximum likelihood often runs into overfitting problems, i.e., we exploit the flexibility of the model to fit to the noise in the data.


MAP Estimation

Empirical observation: Parametric models that overfit tend to have some extreme (large-amplitude) parameter values.

Mitigate the effect of overfitting by placing a prior distribution $p(\theta)$ on the parameters.

Penalize extreme values that are implausible under that prior.

Choose $\theta^*$ as the parameter that maximizes the (log) parameter posterior

$\log p(\theta \mid X, \mathbf{y}) = \underbrace{\log p(\mathbf{y} \mid X, \theta)}_{\text{log-likelihood}} + \underbrace{\log p(\theta)}_{\text{log-prior}} + \text{const}$

The log-prior induces a direct penalty on the parameters.

Maximum a posteriori (MAP) estimate (regularized least squares)


MAP Estimation (2)

Gaussian parameter prior $p(\theta) = \mathcal{N}(0, \alpha^2 I)$

Log-posterior distribution:

$\log p(\theta \mid X, \mathbf{y}) = -\frac{1}{2\sigma^2}(\mathbf{y} - X\theta)^\top(\mathbf{y} - X\theta) - \frac{1}{2\alpha^2}\theta^\top\theta + \text{const} = -\frac{1}{2\sigma^2}\|\mathbf{y} - X\theta\|^2 - \frac{1}{2\alpha^2}\|\theta\|^2 + \text{const}$

Compute the gradient with respect to $\theta$, set it to $0$.

Maximum a posteriori estimate:

$\theta_{\mathrm{MAP}} = \bigl(X^\top X + \tfrac{\sigma^2}{\alpha^2} I\bigr)^{-1} X^\top \mathbf{y}$
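A minimal sketch of the MAP (regularized least-squares) estimate alongside the ML estimate; the dataset, polynomial degree, and the values of $\sigma$ and $\alpha$ are assumptions for illustration:

```python
import numpy as np

def map_estimate(Phi, y, sigma=0.3, alpha=1.0):
    """theta_MAP = (Phi^T Phi + (sigma^2 / alpha^2) I)^{-1} Phi^T y."""
    D = Phi.shape[1]
    reg = (sigma**2 / alpha**2) * np.eye(D)
    return np.linalg.solve(Phi.T @ Phi + reg, Phi.T @ y)

# Hypothetical high-order polynomial example
rng = np.random.default_rng(2)
x = rng.uniform(-4, 4, size=20)
y = np.sin(x) + 0.3 * rng.normal(size=20)
Phi = np.vander(x, 11, increasing=True)  # 10th-order polynomial features

theta_map = map_estimate(Phi, y)
theta_ml = np.linalg.lstsq(Phi, y, rcond=None)[0]
# The prior typically shrinks the extreme parameter values that the ML fit produces
print(np.abs(theta_map).max(), np.abs(theta_ml).max())
```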


Example: Polynomial Regression

Figures: training data, and MLE vs. MAP fits with polynomials of order 2, 4, 6, 8, and 10

Mean prediction:

$\mathbb{E}[y_* \mid x_*, \theta_{\mathrm{MAP}}] = \phi(x_*)^\top \theta_{\mathrm{MAP}}$

Generalization Error

Figure: test-set RMSE as a function of the polynomial degree, for maximum likelihood and maximum a posteriori estimation

MAP estimation "delays" the problem of overfitting.

It does not provide a general solution; we need a more principled solution.

Bayesian Linear Regression

$y = \phi(x)^\top \theta + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)$

Avoid overfitting by not fitting any parameters: integrate the parameters out instead of optimizing them.

Use a full parameter distribution $p(\theta)$ (and not a single point estimate $\theta^*$) when making predictions:

$p(y_* \mid x_*) = \int p(y_* \mid x_*, \theta)\, p(\theta)\, d\theta$

The prediction no longer depends on $\theta$.

The predictive distribution reflects the uncertainty about the "correct" parameter setting.


Example

Figures: left, Bayesian linear regression (BLR) predictive distribution compared with the MLE and MAP fits on the training data; right, plausible functions under the parameter distribution

Light gray: uncertainty due to noise (same as in MLE/MAP)

Dark gray: uncertainty due to parameter uncertainty

Right: plausible functions under the parameter distribution (every single parameter setting describes one function)


Model for Bayesian Linear Regression

Prior: $p(\theta) = \mathcal{N}(m_0, S_0)$
Likelihood: $p(y \mid x, \theta) = \mathcal{N}(y \mid \phi(x)^\top \theta, \sigma^2)$

The parameter $\theta$ becomes a latent (random) variable.

The prior distribution induces a distribution over plausible functions.

Choose a conjugate Gaussian prior: closed-form computations, Gaussian posterior.

Parameter Posterior and Predictions

Prior $p(\theta) = \mathcal{N}(m_0, S_0)$ is Gaussian, so the posterior is Gaussian:

$p(\theta \mid X, \mathbf{y}) = \mathcal{N}(m_N, S_N)$
$S_N = (S_0^{-1} + \sigma^{-2}\Phi^\top\Phi)^{-1}$
$m_N = S_N(S_0^{-1} m_0 + \sigma^{-2}\Phi^\top\mathbf{y})$

The mean $m_N$ is identical to the MAP estimate.

Assume a Gaussian distribution $p(\theta) = \mathcal{N}(m_N, S_N)$. Then

$p(y_* \mid x_*) = \mathcal{N}\bigl(y_* \mid \phi(x_*)^\top m_N,\ \phi(x_*)^\top S_N \phi(x_*) + \sigma^2\bigr)$

$\phi(x_*)^\top S_N \phi(x_*)$: accounts for parameter uncertainty in the predictive variance.

More details: https://mml-book.com, Chapter 9
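A compact sketch of the posterior and predictive computations, reusing a hypothetical polynomial setup; the prior mean, prior covariance, and noise level are illustrative assumptions:

```python
import numpy as np

def blr_posterior(Phi, y, sigma, m0, S0):
    """Gaussian posterior N(m_N, S_N) for Bayesian linear regression."""
    S0_inv = np.linalg.inv(S0)
    S_N = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma**2)
    m_N = S_N @ (S0_inv @ m0 + Phi.T @ y / sigma**2)
    return m_N, S_N

def blr_predict(phi_star, m_N, S_N, sigma):
    """Predictive mean and variance at a single feature vector phi(x_*)."""
    mean = phi_star @ m_N
    var = phi_star @ S_N @ phi_star + sigma**2
    return mean, var

# Hypothetical example with 4th-order polynomial features
rng = np.random.default_rng(3)
x = rng.uniform(-4, 4, size=25)
y = np.sin(x) + 0.3 * rng.normal(size=25)
Phi = np.vander(x, 5, increasing=True)
m0, S0 = np.zeros(5), np.eye(5)          # prior N(0, I), an assumption
m_N, S_N = blr_posterior(Phi, y, sigma=0.3, m0=m0, S0=S0)
mean, var = blr_predict(np.vander([1.5], 5, increasing=True)[0], m_N, S_N, sigma=0.3)
```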


Marginal Likelihood

The marginal likelihood can be computed analytically. With $p(\theta) = \mathcal{N}(\mu, \Sigma)$,

$p(\mathbf{y} \mid X) = \int p(\mathbf{y} \mid X, \theta)\, p(\theta)\, d\theta = \mathcal{N}(\mathbf{y} \mid \Phi\mu,\ \Phi\Sigma\Phi^\top + \sigma^2 I)$

Derivation via completing the squares (see Section 9.3.5 of the MML book).
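A minimal sketch evaluating the log marginal likelihood with SciPy's multivariate normal; the feature matrix and prior are whatever the caller supplies, so this can be used, for example, to compare different feature sets or polynomial degrees:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_marginal_likelihood(Phi, y, sigma, mu, Sigma):
    """log p(y | X) = log N(y | Phi mu, Phi Sigma Phi^T + sigma^2 I)."""
    mean = Phi @ mu
    cov = Phi @ Sigma @ Phi.T + sigma**2 * np.eye(len(y))
    return multivariate_normal(mean=mean, cov=cov).logpdf(y)
```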

Distribution over Functions

Consider a linear regression setting

$y = f(x) + \varepsilon = a + bx + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2), \qquad p(a, b) = \mathcal{N}(0, I)$

Figure: contours of the prior distribution $p(a, b)$ in the $(a, b)$ plane

Sampling from the Prior over Functions

Consider a linear regression setting

$y = f(x) + \varepsilon = a + bx + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2), \qquad p(a, b) = \mathcal{N}(0, I)$

$f_i(x) = a_i + b_i x, \quad [a_i, b_i] \sim p(a, b)$
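A short sketch that draws a few straight lines from the prior $p(a, b) = \mathcal{N}(0, I)$; the number of samples and the plotting range are assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
x = np.linspace(-5, 5, 100)

# Each prior sample (a_i, b_i) ~ N(0, I) defines one function f_i(x) = a_i + b_i * x
samples = rng.multivariate_normal(mean=np.zeros(2), cov=np.eye(2), size=10)
for a_i, b_i in samples:
    plt.plot(x, a_i + b_i * x, alpha=0.6)
plt.xlabel("x")
plt.ylabel("f(x)")
plt.show()
```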

Sampling from the Posterior over Functions

Consider a linear regression setting

$y = f(x) + \varepsilon = a + bx + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2), \qquad p(a, b) = \mathcal{N}(0, I)$

Training inputs/targets: $X = [x_1, \ldots, x_N]$, $\mathbf{y} = [y_1, \ldots, y_N]$

Figure: training data in the $(x, y)$ plane

Posterior: $p(a, b \mid X, \mathbf{y}) = \mathcal{N}(m_N, S_N)$

Figure: contours of the posterior $p(a, b \mid X, \mathbf{y})$ in the $(a, b)$ plane

Sampled functions: $[a_i, b_i] \sim p(a, b \mid X, \mathbf{y}), \quad f_i(x) = a_i + b_i x$
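Continuing the sketch above, posterior samples can be drawn using the closed-form $m_N$ and $S_N$ for the feature map $\phi(x) = [1, x]^\top$; the training data and noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical training data for y = a + b*x + noise
x_train = rng.uniform(-10, 10, size=15)
y_train = 1.0 + 0.8 * x_train + rng.normal(scale=1.0, size=15)

sigma_n = 1.0
Phi = np.column_stack([np.ones_like(x_train), x_train])   # phi(x) = [1, x]

# Posterior N(m_N, S_N) with prior p(a, b) = N(0, I)
S_N = np.linalg.inv(np.eye(2) + Phi.T @ Phi / sigma_n**2)
m_N = S_N @ (Phi.T @ y_train) / sigma_n**2

# Each posterior sample (a_i, b_i) defines one plausible function f_i(x) = a_i + b_i * x
samples = rng.multivariate_normal(m_N, S_N, size=10)
```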

Fitting Nonlinear Functions

Fit nonlinear functions using (Bayesian) linear regression: a linear combination of nonlinear features.

Example: radial-basis-function (RBF) network

$f(x) = \sum_{i=1}^{n} \theta_i \phi_i(x), \quad \theta_i \sim \mathcal{N}(0, \sigma_p^2)$

where

$\phi_i(x) = \exp\bigl(-\tfrac{1}{2}(x - \mu_i)^\top(x - \mu_i)\bigr)$

for given "centers" $\mu_i$.


Illustration: Fitting a Radial Basis Function Network

$\phi_i(x) = \exp\bigl(-\tfrac{1}{2}(x - \mu_i)^\top(x - \mu_i)\bigr)$

Figure: the 25 Gaussian-shaped basis functions $\phi_i(x)$

Place Gaussian-shaped basis functions $\phi_i$ at 25 input locations $\mu_i$, linearly spaced in the interval $[-5, 3]$.
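A small sketch building the RBF feature matrix for 25 centers linearly spaced in $[-5, 3]$ and computing the Bayesian linear regression posterior from before; the training data, noise level, and unit prior are assumptions for illustration:

```python
import numpy as np

def rbf_features(x, centers):
    """phi_i(x) = exp(-0.5 * (x - mu_i)^2) for scalar inputs; returns an (N, n_centers) matrix."""
    return np.exp(-0.5 * (x[:, None] - centers[None, :])**2)

rng = np.random.default_rng(6)
centers = np.linspace(-5, 3, 25)

# Hypothetical training data
x_train = rng.uniform(-5, 3, size=40)
y_train = np.sin(x_train) + 0.2 * rng.normal(size=40)

sigma = 0.2
Phi = rbf_features(x_train, centers)
S_N = np.linalg.inv(np.eye(25) + Phi.T @ Phi / sigma**2)   # prior p(theta) = N(0, I)
m_N = S_N @ Phi.T @ y_train / sigma**2

# Posterior mean prediction on a grid
x_grid = np.linspace(-5, 5, 200)
f_mean = rbf_features(x_grid, centers) @ m_N
```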

Samples from the RBF Prior

$f(x) = \sum_{i=1}^{n} \theta_i \phi_i(x), \quad p(\theta) = \mathcal{N}(0, I)$

Figure: functions $f(x)$ sampled from the RBF prior

Samples from the RBF Posterior

$f(x) = \sum_{i=1}^{n} \theta_i \phi_i(x), \quad p(\theta \mid X, \mathbf{y}) = \mathcal{N}(m_N, S_N)$

Figure: functions $f(x)$ sampled from the RBF posterior

RBF Posterior

Figure: RBF posterior fit $f(x)$

Limitations

Figure: RBF posterior fit $f(x)$

Feature engineering (what basis functions to use?)

Finite number of features:
Above: without basis functions on the right, we cannot express any variability of the function.
Ideally: add more (infinitely many) basis functions.

Approach

Instead of sampling parameters, which induce a distribution over functions, sample functions directly.

Place a prior on functions: make assumptions on the distribution of functions.

Intuition: a function is an infinitely long vector of function values; make assumptions on the distribution of function values.

Gaussian process


Summary

Figures: test RMSE for maximum likelihood vs. maximum a posteriori estimation; MLE/MAP/BLR fits to the training data; posterior over the parameters $(a, b)$; functions sampled from the posterior

Regression = curve fitting
Linear regression = linear in the parameters
Parameter estimation via maximum likelihood and MAP estimation can lead to overfitting
Bayesian linear regression addresses this issue, but may not be analytically tractable
Predictive uncertainty in Bayesian linear regression explicitly accounts for parameter uncertainty
Distribution over parameters induces a distribution over functions

Appendix


Joint Gaussian Distribution

Joint Gaussian distribution

$p(x, y) = \mathcal{N}\left(\begin{bmatrix}\mu_x \\ \mu_y\end{bmatrix}, \begin{bmatrix}\Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy}\end{bmatrix}\right)$

Marginal:

$p(x) = \int p(x, y)\, dy = \mathcal{N}(\mu_x, \Sigma_{xx})$

Conditional:

$p(x \mid y) = \mathcal{N}(\mu_{x|y}, \Sigma_{x|y})$
$\mu_{x|y} = \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y)$
$\Sigma_{x|y} = \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$
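A direct translation of the conditioning formulas into NumPy; this is a sketch, and the block structure of the joint covariance (with $\Sigma_{yx} = \Sigma_{xy}^\top$) is whatever the caller supplies:

```python
import numpy as np

def gaussian_condition(mu_x, mu_y, Sxx, Sxy, Syy, y_obs):
    """Return mean and covariance of p(x | y = y_obs) for a joint Gaussian."""
    Syy_inv = np.linalg.inv(Syy)
    mu_cond = mu_x + Sxy @ Syy_inv @ (y_obs - mu_y)
    Sigma_cond = Sxx - Sxy @ Syy_inv @ Sxy.T   # Sigma_yx = Sigma_xy^T for a valid joint covariance
    return mu_cond, Sigma_cond
```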


Linear Transformation of Gaussian Random Variables

If $x \sim \mathcal{N}(x \mid \mu, \Sigma)$ and $z = Ax + b$, then

$p(z) = \mathcal{N}(z \mid A\mu + b,\ A\Sigma A^\top)$

Product of Two Gaussians

Let $x \in \mathbb{R}^D$. Then:

$\mathcal{N}(x \mid a, A)\,\mathcal{N}(x \mid b, B) = Z\,\mathcal{N}(x \mid c, C)$
$C = (A^{-1} + B^{-1})^{-1}$
$c = C(A^{-1}a + B^{-1}b)$
$Z = (2\pi)^{-D/2}\,|A + B|^{-1/2}\exp\bigl(-\tfrac{1}{2}(a - b)^\top(A + B)^{-1}(a - b)\bigr)$

The product of two Gaussians is an unnormalized Gaussian.

The "un-normalizer" $Z$ has a Gaussian functional form:

$Z = \mathcal{N}(a \mid b, A + B) = \mathcal{N}(b \mid a, A + B)$

Note: This is not a distribution (no random variables).
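For reference, a sketch computing $c$, $C$, and $Z$ numerically, using the Gaussian-form identity for $Z$:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_product(a, A, b, B):
    """N(x|a,A) * N(x|b,B) = Z * N(x|c,C); returns c, C, Z."""
    A_inv, B_inv = np.linalg.inv(A), np.linalg.inv(B)
    C = np.linalg.inv(A_inv + B_inv)
    c = C @ (A_inv @ a + B_inv @ b)
    Z = multivariate_normal(mean=b, cov=A + B).pdf(a)   # Z = N(a | b, A + B)
    return c, C, Z
```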


Example: Marginalization of a Product

$p_1(x) = \mathcal{N}(x \mid a, A), \qquad p_2(x) = \mathcal{N}(x \mid b, B)$

Then

$\int p_1(x)\, p_2(x)\, dx = Z = \mathcal{N}(a \mid b, A + B) \in \mathbb{R}$

Note: In this context, $\mathcal{N}$ is used to describe the functional relationship between $a$ and $b$. Do not treat $a$ or $b$ as random variables; they are both deterministic quantities.
