
Transcript of Introduction to Machine Learning (10-315, Fall '19), Lecture 24: Nonlinear and Kernel Regression

Page 1:

Teacher: Gianni A. Di Caro

Lecture 24: Nonlinear and Kernel regression

Introduction to Machine Learning, 10-315, Fall '19

Disclaimer: These slides can include material from different sources. I'll be happy to explicitly acknowledge a source if required. Contact me for requests.

Page 2: Least Squares Linear regression: summary and terminology

o Linear regression for univariate models $f(x)$, $f:\mathbb{R} \to \mathbb{R}$ (also known as simple regression). Hypothesized model: $y = f(x) = w_0 + w_1 x = \mathbf{w}^T\mathbf{x}$

o Linear regression for multivariate models $f(\mathbf{x})$, $f:\mathbb{R}^d \to \mathbb{R}$ (also known as multiple regression). Hypothesized model: $y = f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d = \mathbf{w}^T\mathbf{x}$

o General case: linear multivariate regression for multivariate models $f(\mathbf{x})$, $f:\mathbb{R}^d \to \mathbb{R}^k$. Hypothesized model: $\mathbf{y} = (f_1(\mathbf{x}), f_2(\mathbf{x}), \cdots, f_k(\mathbf{x}))^T$, with each component $f_j(\mathbf{x}) = \mathbf{w}_j^T\mathbf{x}$ linear as above

Page 3: Least Squares Linear regression: summary and terminology

OLS Optimization problem and solution:

$$\min_{\mathbf{w}}\;(\mathbf{X}\mathbf{w} - \mathbf{Y})^T(\mathbf{X}\mathbf{w} - \mathbf{Y}) \qquad\Longrightarrow\qquad \hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$$

Design matrix $\mathbf{X}$ (number of samples $m$, number of features $d$) and output vector $\mathbf{Y}$:

$$\mathbf{X} = \begin{pmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(m)} & \cdots & x_d^{(m)} \end{pmatrix}, \qquad \mathbf{Y} = \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{pmatrix}$$

v Prediction as a dot product between the learned feature weights and the input feature vector:

$$\hat{y}(\mathbf{x}) = \hat{\mathbf{w}}^T\mathbf{x} = \langle \hat{\mathbf{w}}, \mathbf{x} \rangle$$

v Prediction as a linear blend of the outputs from the training set:

$$\hat{y}(\mathbf{x}) = \mathbf{x}^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} = \sum_{i=1}^{m} b_i(\mathbf{x}, \mathbf{X})\, y^{(i)}$$
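As an illustration of the two prediction views above, a minimal sketch in Python (NumPy assumed; the toy data and all variable names are ours, not from the slides):

```python
import numpy as np

# Toy data: m = 20 samples, d = 3 features (prepend a column of ones if w_0 is wanted)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                 # design matrix, m x d
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true + 0.1 * rng.normal(size=20)

# OLS solution: w_hat = (X^T X)^{-1} X^T Y  (np.linalg.lstsq would be the numerically safer route)
w_hat = np.linalg.solve(X.T @ X, X.T @ Y)

x_new = np.array([0.3, -1.0, 2.0])

# View 1: prediction as a dot product with the learned weights
y_dot = w_hat @ x_new

# View 2: prediction as a linear blend of the training outputs,
# with blend weights b_i(x, X) = [x^T (X^T X)^{-1} X^T]_i
b = x_new @ np.linalg.inv(X.T @ X) @ X.T     # length-m vector of blend weights
y_blend = b @ Y

print(np.allclose(y_dot, y_blend))           # True: the two views give the same prediction
```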

Page 4: Nonlinear regression

v What about situations where the functional relationship between the predictor variables and the outputs is expressed by nonlinear relations?

o In general, if the function cannot be expressed as a linear combination of the coefficients, there is no closed-form solution as in the case of linear regression and least squares

o Numerical methods need to be applied, and the optimization problems become non-convex → many local minima + non-trivial risk of finding a biased solution

Example (nonlinear in the parameters): $f(x; \mathbf{w}) = \dfrac{w_1 x}{w_2 + x}$
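For a model like this, which is nonlinear in the parameter $w_2$, the fit is typically done with an iterative numerical solver. A hedged sketch using SciPy's generic curve fitter (the data, noise level, and starting point are made up for illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

def f(x, w1, w2):
    # Model nonlinear in the parameters: f(x; w) = w1 * x / (w2 + x)
    return w1 * x / (w2 + x)

# Synthetic data generated from w1 = 2.0, w2 = 0.5, plus noise
rng = np.random.default_rng(1)
x = np.linspace(0.1, 5.0, 50)
y = f(x, 2.0, 0.5) + 0.05 * rng.normal(size=x.size)

# Iterative least-squares fit; p0 is the starting guess. The problem is non-convex,
# so a poor starting point can land in a different local minimum.
w_hat, _ = curve_fit(f, x, y, p0=[1.0, 1.0])
print(w_hat)   # should be close to [2.0, 0.5]
```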

Page 5: Nonlinear regression → Approximation, Linearization

v General idea (to reuse results and methods from linear LS): Approximate the model function by linearizing it with respect to the estimation parameters 𝒘

o Taylor series of order $n$, for a univariate model:

$$f(x) \approx f(x_0) + \frac{f'(x_0)}{1!}(x - x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2 + \cdots + \frac{f^{(n)}(x_0)}{n!}(x - x_0)^n = \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!}(x - x_0)^k$$

$$f(x) \approx \sum_{k=0}^{n} w_k (x - x_0)^k \qquad\qquad f(x) \approx \sum_{k=0}^{n} w_k\, x^k \quad (\text{for } x_0 = 0)$$

o Taylor series of order 𝑛, for a multivariate model:

$$f(\mathbf{x}) \approx \sum_{|\alpha| \ge 0} \frac{(\mathbf{x} - \mathbf{x}_0)^{\alpha}}{\alpha!}\, \partial^{\alpha} f(\mathbf{x}_0) \qquad\qquad f(\mathbf{x}) \approx \sum_{|\alpha| \ge 0} \frac{\mathbf{x}^{\alpha}}{\alpha!}\, \partial^{\alpha} f(\mathbf{x}_0) \quad (\text{for } \mathbf{x}_0 = \mathbf{0})$$

(includes cross-product terms and partial derivatives among the variables)
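For concreteness (our own illustration, not from the slides), the order-2 expansion of a two-variable function around $\mathbf{x}_0 = \mathbf{0}$ spells out the cross-product term explicitly:

```latex
f(x_1, x_2) \;\approx\; f(\mathbf{0})
  + \frac{\partial f}{\partial x_1}(\mathbf{0})\, x_1
  + \frac{\partial f}{\partial x_2}(\mathbf{0})\, x_2
  + \tfrac{1}{2}\,\frac{\partial^2 f}{\partial x_1^2}(\mathbf{0})\, x_1^2
  + \frac{\partial^2 f}{\partial x_1 \partial x_2}(\mathbf{0})\, x_1 x_2
  + \tfrac{1}{2}\,\frac{\partial^2 f}{\partial x_2^2}(\mathbf{0})\, x_2^2
```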

Page 6: Nonlinear regression → Approximation, Linearization

o Any continuous function can be approximated to the desired precision using polynomials (on a bounded input domain)

o Polynomials are an example of basis functions: they form a basis in a function space, such that any function in that space can be represented using the basis

o For a given input predictor 𝒙, from the estimation point of view, the above expressions are linear in the parameters to be estimated, 𝒘

$$f(x) \approx \sum_{k=0}^{n} w_k\, x^k$$

✓ Approximate the function by linearizing: truncate the Taylor series at the first term, get a linear approximation, and work with that to estimate the parameters using OLS

How can we exploit these expressions to find $f$ → find the $w_k$ using the data?

$$f(\mathbf{x}) \approx \sum_{|\alpha| \ge 0} \frac{\mathbf{x}^{\alpha}}{\alpha!}\, \partial^{\alpha} f(\mathbf{x}_0)$$
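One concrete answer, under the assumption that we truncate at order $n$ with $x_0 = 0$: each power $x^k$ becomes a feature column and the $w_k$ are obtained by ordinary least squares. A minimal sketch (NumPy assumed; the data are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 30)
y = np.sin(2 * x) + 0.05 * rng.normal(size=x.size)   # some unknown nonlinear target

n = 3                                                # truncation order
A = np.vander(x, N=n + 1, increasing=True)           # feature columns 1, x, x^2, ..., x^n
w_hat, *_ = np.linalg.lstsq(A, y, rcond=None)        # OLS estimate of w_0, ..., w_n

y_fit = A @ w_hat                                    # fitted values on the training inputs
```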

Page 7: Non-linear, additive regression models

§ Main idea to model nonlinearities: replace the inputs to the linear units with $b$ feature (basis) functions $\phi_j(\mathbf{x})$, $j = 1, \cdots, b$, where $\phi_j(\mathbf{x})$ is an arbitrary function of $\mathbf{x}$

$$y = f(\mathbf{x}; \mathbf{w}) = w_0 + w_1\phi_1(\mathbf{x}) + w_2\phi_2(\mathbf{x}) + \cdots + w_b\phi_b(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x})$$

(Diagram: original feature input $\mathbf{x}$ → new input $\boldsymbol{\phi}(\mathbf{x})$ → linear model → $f$)

o Fitting data to a nonlinear model

o Linear as a statistical estimation problem

o The regression function $E[y \mid \mathbf{x}]$ is linear in the unknown parameters $\mathbf{w}$ that are estimated from the data.

v If we add enough basis functions, we are guaranteed to approximate the model function very well!

Page 8: Examples of Feature Functions: Polynomial regression

§ Higher order polynomial with one-dimensional input, 𝒙 = (𝑥)

§ $\phi_1(\mathbf{x}) = x,\; \phi_2(\mathbf{x}) = x^2,\; \phi_3(\mathbf{x}) = x^3,\; \cdots$

v The feature space is transformed and enlarged

Ø E.g., if our hypothesis model is defined by a polynomial of degree 4 over a two-feature space $(x_1, x_2)$, the hypothesis function has 15 parameters:

$$f_4 = w_1 x_1^4 + w_2 x_2^4 + w_3 x_1^3 x_2 + w_4 x_1 x_2^3 + w_5 x_1^2 x_2^2 + w_6 x_1^3 + w_7 x_2^3 + w_8 x_1^2 x_2 + w_9 x_1 x_2^2 + w_{10} x_1^2 + w_{11} x_2^2 + w_{12} x_1 x_2 + w_{13} x_1 + w_{14} x_2 + w_{15}$$

§ Quadratic polynomial with two-dimensional inputs, $\mathbf{x} = (x_1, x_2)$

§ $\phi_1(\mathbf{x}) = x_1,\; \phi_2(\mathbf{x}) = x_1^2,\; \phi_3(\mathbf{x}) = x_2,\; \phi_4(\mathbf{x}) = x_2^2,\; \phi_5(\mathbf{x}) = x_1 x_2$
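A quick way to enumerate (and count) such transformed features, here for the degree-4, two-feature example above (plain Python; the helper name is ours):

```python
from itertools import combinations_with_replacement

def monomial_exponents(num_features: int, degree: int):
    """All exponent tuples for monomials of total degree <= degree."""
    exps = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(range(num_features), d):
            e = [0] * num_features
            for idx in combo:
                e[idx] += 1
            exps.append(tuple(e))
    return exps

feats = monomial_exponents(num_features=2, degree=4)
print(len(feats))   # 15 monomials, i.e. 15 parameters including the constant term
print(feats[:6])    # [(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)]
```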

Page 9: Solution using Feature Functions

§ The same techniques used for the plain linear case (analytical gradient + system of equations, or gradient descent) apply, with MSE as the loss function

$$\boldsymbol{\phi}(\mathbf{x}^{(i)}) = (1, \phi_1(\mathbf{x}^{(i)}), \phi_2(\mathbf{x}^{(i)}), \cdots, \phi_b(\mathbf{x}^{(i)})) \qquad\qquad \ell = \frac{1}{m}\sum_{i=1}^{m} \left( y^{(i)} - f(\mathbf{x}^{(i)}) \right)^2$$

o To find $\min_{\mathbf{w}} \ell$ we have to look for where $\nabla_{\mathbf{w}}\, \ell = \mathbf{0}$

$$\nabla_{\mathbf{w}}\, \ell = -\frac{2}{m}\sum_{i=1}^{m} \left( y^{(i)} - f(\mathbf{x}^{(i)}) \right) \boldsymbol{\phi}(\mathbf{x}^{(i)}) = \mathbf{0}$$

$$f(\mathbf{x}^{(i)}; \mathbf{w}) = w_0 + w_1\phi_1(\mathbf{x}^{(i)}) + w_2\phi_2(\mathbf{x}^{(i)}) + \cdots + w_b\phi_b(\mathbf{x}^{(i)}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}^{(i)})$$
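A hedged sketch of the gradient-descent option mentioned above (NumPy assumed; the basis functions, step size, and data are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 40)
y = np.exp(x) + 0.05 * rng.normal(size=x.size)

# Feature map phi(x) = (1, phi_1(x), ..., phi_b(x)); simple powers used here as an example
def phi(x):
    return np.stack([np.ones_like(x), x, x**2], axis=1)   # m x (b+1)

Phi = phi(x)
w = np.zeros(Phi.shape[1])
lr = 0.5
for _ in range(5000):
    residual = y - Phi @ w                      # y^(i) - f(x^(i); w)
    grad = -(2 / len(y)) * Phi.T @ residual     # gradient of the MSE loss wrt w
    w -= lr * grad

print(w)   # approximately the least-squares weights
```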

Page 10: Solution using Feature Functions

$$\boldsymbol{\phi}(\mathbf{x}^{(i)}) = (1, \phi_1(\mathbf{x}^{(i)}), \phi_2(\mathbf{x}^{(i)}), \cdots, \phi_b(\mathbf{x}^{(i)})) = (\phi_0, \phi_1(\mathbf{x}^{(i)}), \phi_2(\mathbf{x}^{(i)}), \cdots, \phi_b(\mathbf{x}^{(i)})), \quad \text{with } \phi_0 \equiv 1$$

$$-\frac{2}{m}\sum_{i=1}^{m} \left( y^{(i)} - f(\mathbf{x}^{(i)}) \right) \boldsymbol{\phi}(\mathbf{x}^{(i)}) = \mathbf{0}$$

$$f(\mathbf{x}^{(i)}; \mathbf{w}) = w_0 + w_1\phi_1(\mathbf{x}^{(i)}) + w_2\phi_2(\mathbf{x}^{(i)}) + \cdots + w_b\phi_b(\mathbf{x}^{(i)}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}^{(i)})$$

Writing out the $j$-th component of this system of equations:

$$w_0 \sum_{i=1}^{m} 1\cdot\phi_j(\mathbf{x}^{(i)}) + w_1 \sum_{i=1}^{m} \phi_1(\mathbf{x}^{(i)})\,\phi_j(\mathbf{x}^{(i)}) + \cdots + w_k \sum_{i=1}^{m} \phi_k(\mathbf{x}^{(i)})\,\phi_j(\mathbf{x}^{(i)}) + \cdots + w_b \sum_{i=1}^{m} \phi_b(\mathbf{x}^{(i)})\,\phi_j(\mathbf{x}^{(i)}) = \sum_{i=1}^{m} y^{(i)}\,\phi_j(\mathbf{x}^{(i)}) \qquad \forall j = 0, 1, \cdots, b$$

Page 11: Solution using Feature Functions

$$w_0 \sum_{i=1}^{m} 1\cdot\phi_j(\mathbf{x}^{(i)}) + w_1 \sum_{i=1}^{m} \phi_1(\mathbf{x}^{(i)})\,\phi_j(\mathbf{x}^{(i)}) + \cdots + w_k \sum_{i=1}^{m} \phi_k(\mathbf{x}^{(i)})\,\phi_j(\mathbf{x}^{(i)}) + \cdots + w_b \sum_{i=1}^{m} \phi_b(\mathbf{x}^{(i)})\,\phi_j(\mathbf{x}^{(i)}) = \sum_{i=1}^{m} y^{(i)}\,\phi_j(\mathbf{x}^{(i)}) \qquad \forall j = 0, 1, \cdots, b$$

In matrix form: $\mathbf{A}^T\mathbf{A}\,\mathbf{w} = \mathbf{A}^T\mathbf{Y}$, with

$$\mathbf{A} = \begin{pmatrix} \phi_0(\mathbf{x}^{(1)}) & \cdots & \phi_b(\mathbf{x}^{(1)}) \\ \vdots & \ddots & \vdots \\ \phi_0(\mathbf{x}^{(m)}) & \cdots & \phi_b(\mathbf{x}^{(m)}) \end{pmatrix}$$

$\mathbf{w} = (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{Y}$   (unregularized LS)

$\mathbf{w} = (\mathbf{A}^T\mathbf{A} + \lambda\mathbf{I}_{b+1})^{-1}\mathbf{A}^T\mathbf{Y}$   (ridge-regularized LS)

Ø Prediction for a new input: $\hat{f}(\mathbf{x}) = \langle \boldsymbol{\phi}(\mathbf{x}), \mathbf{w} \rangle$, with $\boldsymbol{\phi}(\mathbf{x}) = (\phi_0(\mathbf{x}), \phi_1(\mathbf{x}), \cdots, \phi_b(\mathbf{x}))$

Ø Prediction as a linear blend of the training outputs: $\hat{f}(\mathbf{x}) = \langle \mathbf{b}(\boldsymbol{\phi}(\mathbf{x}), \mathbf{A}), \mathbf{Y} \rangle$, with $\mathbf{b} = \boldsymbol{\phi}^T(\mathbf{x})\,(\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T$
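A minimal end-to-end sketch of the matrix form above, including the ridge-regularized variant (NumPy assumed; the polynomial basis and the value of λ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(-1, 1, 25)
y = np.cos(3 * x) + 0.1 * rng.normal(size=x.size)

# Rows of A are phi(x^(i)) = (phi_0, phi_1(x^(i)), ..., phi_b(x^(i))), with phi_0 = 1
def phi(x):
    return np.stack([x**k for k in range(6)], axis=1)      # powers 0..5 as the basis

A = phi(x)                                                  # m x (b+1) design matrix

# Unregularized LS: solve (A^T A) w = A^T Y
w_ls = np.linalg.solve(A.T @ A, A.T @ y)

# Ridge-regularized LS: solve (A^T A + lambda * I) w = A^T Y
lam = 0.1
w_ridge = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)

# Prediction for a new input: f_hat(x) = <phi(x), w>
x_new = np.array([0.2])
print(phi(x_new) @ w_ls, phi(x_new) @ w_ridge)
```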

Page 12: Example of polynomial regression

Page 13: Other basis functions? Correlation and locality issues

Page 14: Electricity example: linear vs. nonlinear data

New data: it doesn’t look linear anymore

Page 15: New hypothesis: which model complexity?

The complexity of the model grows: one parameter for each feature transformed according to a polynomial of order 2 (at least 3 parameters vs. 2 of original hypothesis)

Page 16: New hypothesis: which model complexity?

At least 5 parameters (if we had multiple predictor features, all their order-$d$ products should be considered, resulting in a number of additional parameters)

Page 17: New hypothesis: which model complexity?

The number of parameters is now larger than the number of data points, so the polynomial can fit the data almost exactly → Overfitting

Page 18: Another example: Polynomial model

Page 19: Another example: Polynomial model

Regularized solution for different levels of regularization

Least Squares solution for different polynomials

Page 20: Another example: Gaussian basis functions

Page 21: Another example: Gaussian basis functions
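The plots from the slides are not reproduced here; as a rough stand-in, a common concrete choice (the centers and width below are our own illustrative picks) is a set of Gaussian bumps $\phi_j(x) = \exp(-(x - \mu_j)^2 / (2 s^2))$:

```python
import numpy as np

def gaussian_basis(x, centers, width):
    # phi_j(x) = exp(-(x - mu_j)^2 / (2 * width^2)), one column per center mu_j
    return np.exp(-(x[:, None] - centers[None, :])**2 / (2 * width**2))

x = np.linspace(0, 1, 50)
centers = np.linspace(0, 1, 9)           # 9 local bumps spread over the input range
Phi = np.hstack([np.ones((x.size, 1)),   # constant feature phi_0 = 1
                 gaussian_basis(x, centers, width=0.1)])
# Phi can now be used exactly like the design matrix A on the previous slides
```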

Page 22: Right Bias-Variance tradeoff?

Page 23: Selecting Model Complexity

§ Dataset with 10 points, 1D features: which hypothesis function should we use?

§ Linear regression: $y = f(x; \mathbf{w}) = w_0 + w_1 x$

§ Polynomial regression, cubic: $y = f(x; \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + w_3 x^3$

§ MSE as the loss function

§ Which model would give the smaller error in terms of MSE / least squares fit?

Page 24: Selecting Model Complexity

§ Cubic regression provides a better fit to the data, and a smaller MSE

§ Should we stick with the hypothesis $f(x; \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + w_3 x^3$?

§ Since a higher order polynomial seems to provide a better fit, why don’t we use a polynomial of order higher than 3?

§ What is the highest order that makes sense for the given problem?

Page 25: Selecting Model Complexity

§ For 10 data points, a degree 9 polynomial gives a perfect fit (Lagrange interpolation): 0 error

§ Is it always good to minimize (even reduce to zero) the training error?

§ Related (and more important) question: how will we perform on new, unseen data?
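This is easy to reproduce numerically (NumPy assumed; the data are made up): with 10 points, the training MSE shrinks as the degree grows and is numerically zero at degree 9.

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 1, 10)                        # 10 data points
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=10)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, deg=degree)        # least-squares polynomial fit
    mse = np.mean((np.polyval(coeffs, x) - y)**2)
    print(degree, mse)                           # training MSE; ~0 for degree 9
```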

Page 26: Overfitting

§ The degree-9 polynomial model totally fails the prediction for the new point!

§ Overfitting: a situation in which the training error is low and the generalization error is high. Causes of the phenomenon:

Ø Highly complex hypothesis model, with a large number of parameters (degrees of freedom)

Ø Small data size (as compared to the complexity of the model)

§ The learned function has enough degrees of freedom to (over)fit all data perfectly

Page 27: Training and Validation Loss

Page 28: Splitting dataset in two

Page 29: Performance on Validation set

Page 30: Performance on Validation set

Page 31: Increasing model complexity

In this case, the small size of the dataset makes it easy to overfit by increasing the degree of the polynomial (i.e., the hypothesis complexity). For a large, multi-dimensional dataset this effect is less strong / less evident

Page 32: Training vs. Validation Loss

Page 33: Model Selection and Evaluation Process

1. Break all available data into training and testing sets (e.g., 70% / 30%)

2. Break training set into training and validation sets (e.g., 70% / 30%)

3. Loop:

i. Set a hyperparameter value (e.g., degree of polynomial → model complexity)

ii. Train the model using training sets

iii. Validate the model using validation sets

iv. Exit loop if (validation errors keep growing && training errors go to zero)

4. Choose hyperparameters using validation set results: hyperparameter values corresponding to lowest validation errors

5. (Optional) With the selected hyperparameters, retrain the model using all training data sets

6. Evaluate (generalization) performance on the testing set
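A compact sketch of this procedure for the polynomial-degree hyperparameter (NumPy only; the split ratios, candidate degrees, and synthetic data are illustrative, and the early-exit test of step 3.iv is replaced by simply scanning all candidates):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.size)

# Steps 1-2: 70% / 30% split into training and testing, then 70% / 30% inside training
idx = rng.permutation(x.size)
test, trainval = idx[:18], idx[18:]
val, train = trainval[:13], trainval[13:]

def mse(degree, fit_idx, eval_idx):
    c = np.polyfit(x[fit_idx], y[fit_idx], deg=degree)
    return np.mean((np.polyval(c, x[eval_idx]) - y[eval_idx])**2)

# Steps 3-4: train on the internal training set, pick the degree with lowest validation error
degrees = range(1, 10)
best = min(degrees, key=lambda d: mse(d, train, val))

# Steps 5-6: retrain on all training data, then report generalization error on the test set
print("best degree:", best, "test MSE:", mse(best, trainval, test))
```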

Page 34: Model Selection and Evaluation Process

(Diagram: the Dataset is split into a Training set and a Testing set; the Training set is further split into an Internal training set and a Validation set. Models 1, 2, ⋯, n are each learned on the internal training set ("Learn i") and validated on the validation set ("Validate i"); the best model ∗ is selected and then re-learned.)

Page 35: Can we use kernels?

Linear in the samples

Page 36: Can we use kernels?

Prediction costs go as 𝑂(𝑁)
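To make "linear in the samples" and the O(N) prediction cost concrete, a minimal kernel ridge regression sketch (the RBF kernel and λ are our own illustrative choices; this is one standard way to kernelize regression, not necessarily the exact formulation on the slides):

```python
import numpy as np

def rbf_kernel(a, b, gamma=10.0):
    # k(a, b) = exp(-gamma * (a - b)^2), computed pairwise for 1-D inputs
    return np.exp(-gamma * (a[:, None] - b[None, :])**2)

rng = np.random.default_rng(7)
x = rng.uniform(0, 1, 30)                   # N = 30 training samples
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)

lam = 1e-3
K = rbf_kernel(x, x)                        # N x N Gram matrix
alpha = np.linalg.solve(K + lam * np.eye(x.size), y)

# Prediction for a new point is a weighted sum over all N training samples -> O(N) per query
x_new = np.array([0.42])
y_new = rbf_kernel(x_new, x) @ alpha
print(y_new)
```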

Page 37: Choosing kernels