
Transcript of Introduction to Machine Learning (10-315, Fall '19), Lecture 24: Nonlinear and Kernel Regression

Page 1:

Teacher: Gianni A. Di Caro

Lecture 24: Nonlinear and Kernel regression

Introduction to Machine Learning, 10-315, Fall '19

Disclaimer: These slides can include material from different sources. I'll be happy to explicitly acknowledge a source if required. Contact me for requests.

Page 2: Least Squares Linear regression: summary and terminology

o Linear regression for univariate models $f(x)$, $f:\mathbb{R} \to \mathbb{R}$ (also known as simple regression). Hypothesized model: $y = f(x) = w_0 + w_1 x = \mathbf{w}^T\mathbf{x}$

o Linear regression for multivariate models $f(\mathbf{x})$, $f:\mathbb{R}^d \to \mathbb{R}$ (also known as multiple regression). Hypothesized model: $y = f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d = \mathbf{w}^T\mathbf{x}$

o General case: linear multivariate regression for multivariate models $f(\mathbf{x})$, $f:\mathbb{R}^d \to \mathbb{R}^k$. Hypothesized model: $\mathbf{y} = (f_1(\mathbf{x}), f_2(\mathbf{x}), \cdots, f_k(\mathbf{x}))^T$, with each component $f_j(\mathbf{x}) = \mathbf{w}_j^T\mathbf{x}$ linear as above

Page 3: Least Squares Linear regression: summary and terminology

OLS Optimization problem and solution:

$$\min_{\mathbf{w}}\;(\mathbf{X}\mathbf{w} - \mathbf{Y})^T(\mathbf{X}\mathbf{w} - \mathbf{Y}) \qquad\Longrightarrow\qquad \hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$$

Design matrix $\mathbf{X}$ (number of samples $m$, number of features $d$) and output vector $\mathbf{Y}$:

$$\mathbf{X} = \begin{pmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(m)} & \cdots & x_d^{(m)} \end{pmatrix}, \qquad \mathbf{Y} = \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{pmatrix}$$

v Prediction as a dot product between the learned feature weights and the input feature vector:

$$\hat{y}(\mathbf{x}) = \hat{\mathbf{w}}^T\mathbf{x} = \langle \hat{\mathbf{w}}, \mathbf{x} \rangle$$

v Prediction as a linear blend of the outputs from the training set:

$$\hat{y}(\mathbf{x}) = \mathbf{x}^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} = \sum_{i=1}^{m} b_i(\mathbf{x}, \mathbf{X})\, y^{(i)}$$
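As an illustration of the two prediction views above, a minimal sketch in Python (NumPy assumed; the toy data and all variable names are ours, not from the slides):

```python
import numpy as np

# Toy data: m = 20 samples, d = 3 features (prepend a column of ones if w_0 is wanted)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                 # design matrix, m x d
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true + 0.1 * rng.normal(size=20)

# OLS solution: w_hat = (X^T X)^{-1} X^T Y  (np.linalg.lstsq would be the numerically safer route)
w_hat = np.linalg.solve(X.T @ X, X.T @ Y)

x_new = np.array([0.3, -1.0, 2.0])

# View 1: prediction as a dot product with the learned weights
y_dot = w_hat @ x_new

# View 2: prediction as a linear blend of the training outputs,
# with blend weights b_i(x, X) = [x^T (X^T X)^{-1} X^T]_i
b = x_new @ np.linalg.inv(X.T @ X) @ X.T     # length-m vector of blend weights
y_blend = b @ Y

print(np.allclose(y_dot, y_blend))           # True: the two views give the same prediction
```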

Page 4: Nonlinear regression

v What about situations where the functional relationship between the predictor variables and the outputs is expressed by nonlinear relations?

o In general, if the function cannot be expressed as a linear combination of the coefficients, there is no closed-form solution as in the case of linear regression and least squares

o Numerical methods need to be applied, and the optimization problems become non-convex → many local minima + non-trivial risk of finding a biased solution

Example (nonlinear in the parameters): $f(x; \mathbf{w}) = \dfrac{w_1 x}{w_2 + x}$
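For a model like this, which is nonlinear in the parameter $w_2$, the fit is typically done with an iterative numerical solver. A hedged sketch using SciPy's generic curve fitter (the data, noise level, and starting point are made up for illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

def f(x, w1, w2):
    # Model nonlinear in the parameters: f(x; w) = w1 * x / (w2 + x)
    return w1 * x / (w2 + x)

# Synthetic data generated from w1 = 2.0, w2 = 0.5, plus noise
rng = np.random.default_rng(1)
x = np.linspace(0.1, 5.0, 50)
y = f(x, 2.0, 0.5) + 0.05 * rng.normal(size=x.size)

# Iterative least-squares fit; p0 is the starting guess. The problem is non-convex,
# so a poor starting point can land in a different local minimum.
w_hat, _ = curve_fit(f, x, y, p0=[1.0, 1.0])
print(w_hat)   # should be close to [2.0, 0.5]
```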

Page 5: Nonlinear regression → Approximation, Linearization

v General idea (to reuse results and methods from linear LS): Approximate the model function by linearizing it with respect to the estimation parameters 𝒘

o Taylor series of order $n$, for a univariate model:

$$f(x) \approx f(x_0) + \frac{f'(x_0)}{1!}(x - x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2 + \cdots + \frac{f^{(n)}(x_0)}{n!}(x - x_0)^n = \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!}(x - x_0)^k$$

$$f(x) \approx \sum_{k=0}^{n} w_k (x - x_0)^k \qquad\qquad f(x) \approx \sum_{k=0}^{n} w_k\, x^k \quad (\text{for } x_0 = 0)$$

o Taylor series of order 𝑛, for a multivariate model:

$$f(\mathbf{x}) \approx \sum_{|\alpha| \ge 0} \frac{(\mathbf{x} - \mathbf{x}_0)^{\alpha}}{\alpha!}\, \partial^{\alpha} f(\mathbf{x}_0) \qquad\qquad f(\mathbf{x}) \approx \sum_{|\alpha| \ge 0} \frac{\mathbf{x}^{\alpha}}{\alpha!}\, \partial^{\alpha} f(\mathbf{x}_0) \quad (\text{for } \mathbf{x}_0 = \mathbf{0})$$

(includes cross-product terms and partial derivatives among the variables)
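For concreteness (our own illustration, not from the slides), the order-2 expansion of a two-variable function around $\mathbf{x}_0 = \mathbf{0}$ spells out the cross-product term explicitly:

```latex
f(x_1, x_2) \;\approx\; f(\mathbf{0})
  + \frac{\partial f}{\partial x_1}(\mathbf{0})\, x_1
  + \frac{\partial f}{\partial x_2}(\mathbf{0})\, x_2
  + \tfrac{1}{2}\,\frac{\partial^2 f}{\partial x_1^2}(\mathbf{0})\, x_1^2
  + \frac{\partial^2 f}{\partial x_1 \partial x_2}(\mathbf{0})\, x_1 x_2
  + \tfrac{1}{2}\,\frac{\partial^2 f}{\partial x_2^2}(\mathbf{0})\, x_2^2
```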

Page 6: Nonlinear regression → Approximation, Linearization

o Any continuous function can be approximated to the desired precision using polynomials (on a bounded input domain)

o Polynomials are an example of basis functions: they form a basis in a function space, such that any function in that space can be represented using the basis

o For a given input predictor 𝒙, from the estimation point of view, the above expressions are linear in the parameters to be estimated, 𝒘

$$f(x) \approx \sum_{k=0}^{n} w_k\, x^k$$

✓ Approximate the function by linearizing: truncate the Taylor series at the first term, get a linear approximation, and work with that to estimate the parameters using OLS

How can we exploit these expressions to find $f$ → find the $w_k$ using the data?

$$f(\mathbf{x}) \approx \sum_{|\alpha| \ge 0} \frac{\mathbf{x}^{\alpha}}{\alpha!}\, \partial^{\alpha} f(\mathbf{x}_0)$$
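One concrete answer, under the assumption that we truncate at order $n$ with $x_0 = 0$: each power $x^k$ becomes a feature column and the $w_k$ are obtained by ordinary least squares. A minimal sketch (NumPy assumed; the data are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 30)
y = np.sin(2 * x) + 0.05 * rng.normal(size=x.size)   # some unknown nonlinear target

n = 3                                                # truncation order
A = np.vander(x, N=n + 1, increasing=True)           # feature columns 1, x, x^2, ..., x^n
w_hat, *_ = np.linalg.lstsq(A, y, rcond=None)        # OLS estimate of w_0, ..., w_n

y_fit = A @ w_hat                                    # fitted values on the training inputs
```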

Page 7: Non-linear, additive regression models

§ Main idea to model nonlinearities: replace the inputs to the linear units with $b$ feature (basis) functions $\phi_j(\mathbf{x})$, $j = 1, \cdots, b$, where $\phi_j(\mathbf{x})$ is an arbitrary function of $\mathbf{x}$

$$y = f(\mathbf{x}; \mathbf{w}) = w_0 + w_1\phi_1(\mathbf{x}) + w_2\phi_2(\mathbf{x}) + \cdots + w_b\phi_b(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x})$$

(Diagram: original feature input $\mathbf{x}$ → new input $\boldsymbol{\phi}(\mathbf{x})$ → linear model → $f$)

o Fitting data to a nonlinear model

o Linear as a statistical estimation problem

o The regression function $E[y \mid \mathbf{x}]$ is linear in the unknown parameters $\mathbf{w}$ that are estimated from the data.

v If we add enough basis functions, we are guaranteed to approximate the model function very well!

Page 8: Examples of Feature Functions: Polynomial regression

§ Higher order polynomial with one-dimensional input, 𝒙 = (𝑥)

§ $\phi_1(\mathbf{x}) = x,\; \phi_2(\mathbf{x}) = x^2,\; \phi_3(\mathbf{x}) = x^3,\; \cdots$

v The feature space is transformed and enlarged

Ø E.g., if our hypothesis model is defined by a polynomial of degree 4 over a two-feature space $(x_1, x_2)$, the hypothesis function has 15 parameters:

$$f_4 = w_1 x_1^4 + w_2 x_2^4 + w_3 x_1^3 x_2 + w_4 x_1 x_2^3 + w_5 x_1^2 x_2^2 + w_6 x_1^3 + w_7 x_2^3 + w_8 x_1^2 x_2 + w_9 x_1 x_2^2 + w_{10} x_1^2 + w_{11} x_2^2 + w_{12} x_1 x_2 + w_{13} x_1 + w_{14} x_2 + w_{15}$$

§ Quadratic polynomial with two-dimensional inputs, $\mathbf{x} = (x_1, x_2)$

§ $\phi_1(\mathbf{x}) = x_1,\; \phi_2(\mathbf{x}) = x_1^2,\; \phi_3(\mathbf{x}) = x_2,\; \phi_4(\mathbf{x}) = x_2^2,\; \phi_5(\mathbf{x}) = x_1 x_2$
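A quick way to enumerate (and count) such transformed features, here for the degree-4, two-feature example above (plain Python; the helper name is ours):

```python
from itertools import combinations_with_replacement

def monomial_exponents(num_features: int, degree: int):
    """All exponent tuples for monomials of total degree <= degree."""
    exps = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(range(num_features), d):
            e = [0] * num_features
            for idx in combo:
                e[idx] += 1
            exps.append(tuple(e))
    return exps

feats = monomial_exponents(num_features=2, degree=4)
print(len(feats))   # 15 monomials, i.e. 15 parameters including the constant term
print(feats[:6])    # [(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)]
```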

Page 9: Solution using Feature Functions

§ The same techniques used for the plain linear case (analytical gradient + system of equations, or gradient descent) apply, with MSE as the loss function

$$\boldsymbol{\phi}(\mathbf{x}^{(i)}) = (1, \phi_1(\mathbf{x}^{(i)}), \phi_2(\mathbf{x}^{(i)}), \cdots, \phi_b(\mathbf{x}^{(i)})) \qquad\qquad \ell = \frac{1}{m}\sum_{i=1}^{m} \left( y^{(i)} - f(\mathbf{x}^{(i)}) \right)^2$$

o To find $\min_{\mathbf{w}} \ell$ we have to look for where $\nabla_{\mathbf{w}}\, \ell = \mathbf{0}$

$$\nabla_{\mathbf{w}}\, \ell = -\frac{2}{m}\sum_{i=1}^{m} \left( y^{(i)} - f(\mathbf{x}^{(i)}) \right) \boldsymbol{\phi}(\mathbf{x}^{(i)}) = \mathbf{0}$$

$$f(\mathbf{x}^{(i)}; \mathbf{w}) = w_0 + w_1\phi_1(\mathbf{x}^{(i)}) + w_2\phi_2(\mathbf{x}^{(i)}) + \cdots + w_b\phi_b(\mathbf{x}^{(i)}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}^{(i)})$$
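A hedged sketch of the gradient-descent option mentioned above (NumPy assumed; the basis functions, step size, and data are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 40)
y = np.exp(x) + 0.05 * rng.normal(size=x.size)

# Feature map phi(x) = (1, phi_1(x), ..., phi_b(x)); simple powers used here as an example
def phi(x):
    return np.stack([np.ones_like(x), x, x**2], axis=1)   # m x (b+1)

Phi = phi(x)
w = np.zeros(Phi.shape[1])
lr = 0.5
for _ in range(5000):
    residual = y - Phi @ w                      # y^(i) - f(x^(i); w)
    grad = -(2 / len(y)) * Phi.T @ residual     # gradient of the MSE loss wrt w
    w -= lr * grad

print(w)   # approximately the least-squares weights
```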

Page 10: Solution using Feature Functions

$$\boldsymbol{\phi}(\mathbf{x}^{(i)}) = (1, \phi_1(\mathbf{x}^{(i)}), \phi_2(\mathbf{x}^{(i)}), \cdots, \phi_b(\mathbf{x}^{(i)})) = (\phi_0, \phi_1(\mathbf{x}^{(i)}), \phi_2(\mathbf{x}^{(i)}), \cdots, \phi_b(\mathbf{x}^{(i)})), \quad \text{with } \phi_0 \equiv 1$$

$$-\frac{2}{m}\sum_{i=1}^{m} \left( y^{(i)} - f(\mathbf{x}^{(i)}) \right) \boldsymbol{\phi}(\mathbf{x}^{(i)}) = \mathbf{0}$$

$$f(\mathbf{x}^{(i)}; \mathbf{w}) = w_0 + w_1\phi_1(\mathbf{x}^{(i)}) + w_2\phi_2(\mathbf{x}^{(i)}) + \cdots + w_b\phi_b(\mathbf{x}^{(i)}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}^{(i)})$$

Writing out the $j$-th component of this system of equations:

$$w_0 \sum_{i=1}^{m} 1\cdot\phi_j(\mathbf{x}^{(i)}) + w_1 \sum_{i=1}^{m} \phi_1(\mathbf{x}^{(i)})\,\phi_j(\mathbf{x}^{(i)}) + \cdots + w_k \sum_{i=1}^{m} \phi_k(\mathbf{x}^{(i)})\,\phi_j(\mathbf{x}^{(i)}) + \cdots + w_b \sum_{i=1}^{m} \phi_b(\mathbf{x}^{(i)})\,\phi_j(\mathbf{x}^{(i)}) = \sum_{i=1}^{m} y^{(i)}\,\phi_j(\mathbf{x}^{(i)}) \qquad \forall j = 0, 1, \cdots, b$$

Page 11: Solution using Feature Functions

$$w_0 \sum_{i=1}^{m} 1\cdot\phi_j(\mathbf{x}^{(i)}) + w_1 \sum_{i=1}^{m} \phi_1(\mathbf{x}^{(i)})\,\phi_j(\mathbf{x}^{(i)}) + \cdots + w_k \sum_{i=1}^{m} \phi_k(\mathbf{x}^{(i)})\,\phi_j(\mathbf{x}^{(i)}) + \cdots + w_b \sum_{i=1}^{m} \phi_b(\mathbf{x}^{(i)})\,\phi_j(\mathbf{x}^{(i)}) = \sum_{i=1}^{m} y^{(i)}\,\phi_j(\mathbf{x}^{(i)}) \qquad \forall j = 0, 1, \cdots, b$$

In matrix form: $\mathbf{A}^T\mathbf{A}\,\mathbf{w} = \mathbf{A}^T\mathbf{Y}$, with

$$\mathbf{A} = \begin{pmatrix} \phi_0(\mathbf{x}^{(1)}) & \cdots & \phi_b(\mathbf{x}^{(1)}) \\ \vdots & \ddots & \vdots \\ \phi_0(\mathbf{x}^{(m)}) & \cdots & \phi_b(\mathbf{x}^{(m)}) \end{pmatrix}$$

$\mathbf{w} = (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{Y}$   (unregularized LS)

$\mathbf{w} = (\mathbf{A}^T\mathbf{A} + \lambda\mathbf{I}_{b+1})^{-1}\mathbf{A}^T\mathbf{Y}$   (ridge-regularized LS)

Ø Prediction for a new input: $\hat{f}(\mathbf{x}) = \langle \boldsymbol{\phi}(\mathbf{x}), \mathbf{w} \rangle$, with $\boldsymbol{\phi}(\mathbf{x}) = (\phi_0(\mathbf{x}), \phi_1(\mathbf{x}), \cdots, \phi_b(\mathbf{x}))$

Ø Prediction as a linear blend of the training outputs: $\hat{f}(\mathbf{x}) = \langle \mathbf{b}(\boldsymbol{\phi}(\mathbf{x}), \mathbf{A}), \mathbf{Y} \rangle$, with $\mathbf{b} = \boldsymbol{\phi}^T(\mathbf{x})\,(\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T$
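A minimal end-to-end sketch of the matrix form above, including the ridge-regularized variant (NumPy assumed; the polynomial basis and the value of λ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(-1, 1, 25)
y = np.cos(3 * x) + 0.1 * rng.normal(size=x.size)

# Rows of A are phi(x^(i)) = (phi_0, phi_1(x^(i)), ..., phi_b(x^(i))), with phi_0 = 1
def phi(x):
    return np.stack([x**k for k in range(6)], axis=1)      # powers 0..5 as the basis

A = phi(x)                                                  # m x (b+1) design matrix

# Unregularized LS: solve (A^T A) w = A^T Y
w_ls = np.linalg.solve(A.T @ A, A.T @ y)

# Ridge-regularized LS: solve (A^T A + lambda * I) w = A^T Y
lam = 0.1
w_ridge = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)

# Prediction for a new input: f_hat(x) = <phi(x), w>
x_new = np.array([0.2])
print(phi(x_new) @ w_ls, phi(x_new) @ w_ridge)
```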

Page 12: Example of polynomial regression

Page 13: Other basis functions? Correlation and locality issues

Page 14: Electricity example: linear vs. nonlinear data

New data: it doesn’t look linear anymore

Page 15: New hypothesis: which model complexity?

The complexity of the model grows: one parameter for each feature transformed according to a polynomial of order 2 (at least 3 parameters vs. 2 of original hypothesis)

Page 16: New hypothesis: which model complexity?

At least 5 parameters (if we had multiple predictor features, all their order-$d$ products should be considered, resulting in a number of additional parameters)

Page 17: New hypothesis: which model complexity?

The number of parameters is now larger than the number of data points, so the polynomial can fit the data almost exactly → Overfitting

Page 18: Another example: Polynomial model

Page 19: Another example: Polynomial model

Regularized solution for different levels of regularization

Least Squares solution for different polynomials

Page 20: Another example: Gaussian basis functions

Page 21: Another example: Gaussian basis functions
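The plots from the slides are not reproduced here; as a rough stand-in, a common concrete choice (the centers and width below are our own illustrative picks) is a set of Gaussian bumps $\phi_j(x) = \exp(-(x - \mu_j)^2 / (2 s^2))$:

```python
import numpy as np

def gaussian_basis(x, centers, width):
    # phi_j(x) = exp(-(x - mu_j)^2 / (2 * width^2)), one column per center mu_j
    return np.exp(-(x[:, None] - centers[None, :])**2 / (2 * width**2))

x = np.linspace(0, 1, 50)
centers = np.linspace(0, 1, 9)           # 9 local bumps spread over the input range
Phi = np.hstack([np.ones((x.size, 1)),   # constant feature phi_0 = 1
                 gaussian_basis(x, centers, width=0.1)])
# Phi can now be used exactly like the design matrix A on the previous slides
```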

Page 22: Right Bias-Variance tradeoff?

Page 23: Selecting Model Complexity

§ Dataset with 10 points, 1D features: which hypothesis function should we use?

§ Linear regression: $y = f(x; \mathbf{w}) = w_0 + w_1 x$

§ Polynomial regression, cubic: $y = f(x; \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + w_3 x^3$

§ MSE as the loss function

§ Which model would give the smaller error in terms of MSE / least squares fit?

Page 24: Selecting Model Complexity

§ Cubic regression provides a better fit to the data, and a smaller MSE

§ Should we stick with the hypothesis $f(x; \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + w_3 x^3$?

§ Since a higher order polynomial seems to provide a better fit, why don’t we use a polynomial of order higher than 3?

§ What is the highest order that makes sense for the given problem?

Page 25: Selecting Model Complexity

§ For 10 data points, a degree 9 polynomial gives a perfect fit (Lagrange interpolation): 0 error

§ Is it always good to minimize (even reduce to zero) the training error?

§ Related (and more important) question: how will we perform on new, unseen data?
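This is easy to reproduce numerically (NumPy assumed; the data are made up): with 10 points, the training MSE shrinks as the degree grows and is numerically zero at degree 9.

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 1, 10)                        # 10 data points
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=10)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, deg=degree)        # least-squares polynomial fit
    mse = np.mean((np.polyval(coeffs, x) - y)**2)
    print(degree, mse)                           # training MSE; ~0 for degree 9
```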

Page 26: Overfitting

§ The degree-9 polynomial model totally fails the prediction for the new point!

§ Overfitting: a situation in which the training error is low and the generalization error is high. Causes of the phenomenon:

Ø Highly complex hypothesis model, with a large number of parameters (degrees of freedom)

Ø Small data size (as compared to the complexity of the model)

§ The learned function has enough degrees of freedom to (over)fit all data perfectly

Page 27: Training and Validation Loss

Page 28: Splitting dataset in two

Page 29: Performance on Validation set

Page 30: Performance on Validation set

Page 31: Increasing model complexity

In this case, the small size of the dataset makes it easy to overfit by increasing the degree of the polynomial (i.e., the hypothesis complexity). For a large, multi-dimensional dataset this effect is less strong / less evident

Page 32: Training vs. Validation Loss

Page 33: Model Selection and Evaluation Process

1. Break all available data into training and testing sets (e.g., 70% / 30%)

2. Break training set into training and validation sets (e.g., 70% / 30%)

3. Loop:

i. Set a hyperparameter value (e.g., degree of polynomial → model complexity)

ii. Train the model using training sets

iii. Validate the model using validation sets

iv. Exit loop if (validation errors keep growing && training errors go to zero)

4. Choose hyperparameters using validation set results: hyperparameter values corresponding to lowest validation errors

5. (Optional) With the selected hyperparameters, retrain the model using all training data sets

6. Evaluate (generalization) performance on the testing set
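A compact sketch of this procedure for the polynomial-degree hyperparameter (NumPy only; the split ratios, candidate degrees, and synthetic data are illustrative, and the early-exit test of step 3.iv is replaced by simply scanning all candidates):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.size)

# Steps 1-2: 70% / 30% split into training and testing, then 70% / 30% inside training
idx = rng.permutation(x.size)
test, trainval = idx[:18], idx[18:]
val, train = trainval[:13], trainval[13:]

def mse(degree, fit_idx, eval_idx):
    c = np.polyfit(x[fit_idx], y[fit_idx], deg=degree)
    return np.mean((np.polyval(c, x[eval_idx]) - y[eval_idx])**2)

# Steps 3-4: train on the internal training set, pick the degree with lowest validation error
degrees = range(1, 10)
best = min(degrees, key=lambda d: mse(d, train, val))

# Steps 5-6: retrain on all training data, then report generalization error on the test set
print("best degree:", best, "test MSE:", mse(best, trainval, test))
```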

Page 34: Model Selection and Evaluation Process

(Diagram: the Dataset is split into a Training set and a Testing set; the Training set is further split into an Internal training set and a Validation set. Models 1, 2, ⋯, n are each learned on the internal training set ("Learn i") and validated on the validation set ("Validate i"); the best model ∗ is selected and then re-learned.)

Page 35: Can we use kernels?

Linear in the samples

Page 36: Can we use kernels?

Prediction costs go as 𝑂(𝑁)
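To make "linear in the samples" and the O(N) prediction cost concrete, a minimal kernel ridge regression sketch (the RBF kernel and λ are our own illustrative choices; this is one standard way to kernelize regression, not necessarily the exact formulation on the slides):

```python
import numpy as np

def rbf_kernel(a, b, gamma=10.0):
    # k(a, b) = exp(-gamma * (a - b)^2), computed pairwise for 1-D inputs
    return np.exp(-gamma * (a[:, None] - b[None, :])**2)

rng = np.random.default_rng(7)
x = rng.uniform(0, 1, 30)                   # N = 30 training samples
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)

lam = 1e-3
K = rbf_kernel(x, x)                        # N x N Gram matrix
alpha = np.linalg.solve(K + lam * np.eye(x.size), y)

# Prediction for a new point is a weighted sum over all N training samples -> O(N) per query
x_new = np.array([0.42])
y_new = rbf_kernel(x_new, x) @ alpha
print(y_new)
```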

Page 37: Choosing kernels