Linear Regression (continued)
Professor Ameet Talwalkar
Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 1 / 39
Outline
1 Administration
2 Review of last lecture
3 Linear regression
4 Nonlinear basis functions
Announcements
HW2 will be returned in section on Friday
HW3 due in class next Monday
Midterm is next Wednesday (will review in more detail next class)
Outline
1 Administration
2 Review of last lecture: Perceptron; linear regression introduction
3 Linear regression
4 Nonlinear basis functions
Perceptron: Main idea
Consider a linear model for binary classification
wTx
We use this model to distinguish between two classes {−1,+1}.
One goal

ε = ∑n I[yn ≠ sign(wTxn)]

i.e., to minimize errors on the training dataset.
Hard, but easy if we have only one training example
How can we change w such that yn = sign(wTxn)?
Two cases
If yn = sign(wTxn), do nothing.
If yn ≠ sign(wTxn), wnew ← wold + ynxn
Guaranteed to make progress, i.e., to get us closer to yn(wTxn) > 0
What does the update do?

yn[(w + ynxn)Txn] = ynwTxn + yn²xnTxn

We are adding a positive number (yn² xnTxn = ‖xn‖² > 0), so it's possible that yn(wnewTxn) > 0
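A minimal numerical sketch of this update, using made-up numbers for w, xn, and yn: after one update the margin yn(wTxn) increases by exactly ‖xn‖², since yn² = 1.

```python
import numpy as np

# One perceptron update on a single misclassified example.
# The weight vector, point, and label below are illustrative, not from the slides.
w = np.array([0.5, -1.0])
x_n = np.array([1.0, 2.0])
y_n = 1.0  # true label in {-1, +1}

margin_before = y_n * w.dot(x_n)       # negative => misclassified
w_new = w + y_n * x_n                  # the perceptron update
margin_after = y_n * w_new.dot(x_n)

# The margin increases by exactly ||x_n||^2, because y_n^2 = 1.
assert np.isclose(margin_after - margin_before, x_n.dot(x_n))
```

Note that one update need not flip the sign of the margin; it only moves it up by ‖xn‖², which is why the slide says "possible".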
Perceptron algorithm
Iteratively solving one case at a time
REPEAT
Pick a data point xn (can be a fixed order of the training instances)
Make a prediction y = sign(wTxn) using the current w
If y = yn, do nothing. Else,
w ← w + ynxn
UNTIL converged.
Properties
This is an online algorithm.
If the training data is linearly separable, the algorithm stops in a finite number of steps (we proved this).
The parameter vector is always a linear combination of training instances (requires initialization of w0 = 0).
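The pseudocode above can be sketched as follows; the toy dataset is invented for illustration and is linearly separable, so the loop terminates.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron with w initialized to zero, cycling through the
    training instances in a fixed order, as in the pseudocode."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_n, y_n in zip(X, y):
            if np.sign(w.dot(x_n)) != y_n:  # wrong prediction (sign(0) counts as wrong)
                w += y_n * x_n              # update toward the correct label
                mistakes += 1
        if mistakes == 0:                   # converged: every point classified correctly
            break
    return w

# Separable toy data, labels in {-1, +1}; each x is augmented with a leading 1 for the bias.
X = np.array([[1.0, 2.0, 1.0], [1.0, 3.0, 2.0], [1.0, -1.0, -1.5], [1.0, -2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
assert all(np.sign(w.dot(x_n)) == y_n for x_n, y_n in zip(X, y))
```

Because w starts at zero and every update adds ynxn, the final w is a linear combination of training instances, matching the property above.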
Regression
Predicting a continuous outcome variable
Predicting shoe size from height, weight and gender
Predicting song year from audio features
Key difference from classification
We can measure 'closeness' of prediction and labels
  Predicting shoe size: better to be off by one size than by 5 sizes
  Predicting song year: better to be off by one year than by 20 years
As opposed to 0-1 classification error, we will focus on squared difference, i.e., (y − ŷ)²
1D example: predicting the sale price of a house
Sale price ≈ price per sqft × square footage + fixed expense
Minimize squared errors
Our model
Sale price = price per sqft × square footage + fixed expense + unexplainable stuff
Training data
sqft   sale price   prediction   error    squared error
2000   810K         720K         90K      90² = 8100
2100   907K         800K         107K     107² = 11449
1100   312K         350K         −38K     38² = 1444
5500   2,600K       2,600K       0        0
· · ·
Total: 8100 + 11449 + 1444 + 0 + · · ·
Aim
Adjust price per sqft and fixed expense such that the sum of the squared errors is minimized, i.e., unexplainable stuff is minimized.
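The error column above can be recomputed directly (all amounts in thousands of dollars, taken from the table):

```python
# Recomputing the squared-error table from the slide.
sale_price = [810, 907, 312, 2600]   # in K
prediction = [720, 800, 350, 2600]   # in K

errors = [s - p for s, p in zip(sale_price, prediction)]
squared_errors = [e ** 2 for e in errors]
total = sum(squared_errors)

assert errors == [90, 107, -38, 0]
assert squared_errors == [8100, 11449, 1444, 0]
```

Squaring makes every term nonnegative, so overpredicting by 38K and underpredicting by 90K both add to the total rather than cancelling.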
Linear regression
Setup
Input: x ∈ RD (covariates, predictors, features, etc)
Output: y ∈ R (responses, targets, outcomes, outputs, etc)
Model: f : x → y, with f(x) = w0 + ∑d wd xd = w0 + wTx
  w = [w1 w2 · · · wD]T: weights, parameters, or parameter vector
  w0 is called the bias
  We also sometimes call w = [w0 w1 w2 · · · wD]T parameters too
Training data: D = {(xn, yn), n = 1, 2, . . . , N}
Least Mean Squares (LMS) Objective: Minimize squared difference on training data (or residual sum of squares)

RSS(w) = ∑n [yn − f(xn)]² = ∑n [yn − (w0 + ∑d wd xnd)]²

1D Solution: Identify stationary points by taking derivatives with respect to the parameters and setting them to zero, yielding the 'normal equations'
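In the 1D case, setting ∂RSS/∂w0 = 0 and ∂RSS/∂w1 = 0 gives the familiar closed form w1 = cov(x, y)/var(x) and w0 = ȳ − w1 x̄. A sketch on synthetic data (the generating parameters 3.0 and 2.0 are invented for this example):

```python
import numpy as np

# 1D least squares via the normal equations, on synthetic data
# generated as y = 3.0 + 2.0 x + noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 2.0 * x + rng.normal(0, 0.5, size=50)

x_bar, y_bar = x.mean(), y.mean()
w1 = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()  # slope
w0 = y_bar - w1 * x_bar                                            # intercept

# The fit should recover the generating parameters up to noise.
assert abs(w1 - 2.0) < 0.2 and abs(w0 - 3.0) < 1.0
```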
Probabilistic interpretation
Noisy observation model
Y = w0 + w1X + η
where η ∼ N(0, σ²) is a Gaussian random variable
Likelihood of one training sample (xn, yn)
p(yn|xn) = N(w0 + w1xn, σ²) = (1 / (√(2π) σ)) e^(−[yn − (w0 + w1xn)]² / (2σ²))
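Evaluating this density for one hypothetical sample (the parameter values below are made up) shows that the likelihood is largest when the observation yn coincides with the model's prediction w0 + w1xn:

```python
import math

# Gaussian likelihood of one training sample under the noisy observation model.
# These parameter and data values are illustrative only.
w0, w1, sigma = 1.0, 2.0, 0.5
x_n, y_n = 3.0, 7.2

mean = w0 + w1 * x_n                     # model prediction: 7.0
residual = y_n - mean
p = (1.0 / (math.sqrt(2 * math.pi) * sigma)) * math.exp(-residual**2 / (2 * sigma**2))

# The density is maximized at the mean, so p is bounded by the value there.
p_at_mean = 1.0 / (math.sqrt(2 * math.pi) * sigma)
assert 0 < p < p_at_mean
```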
Maximum likelihood estimation
Maximize over w0 and w1
max log P(D) ⇔ min ∑n [yn − (w0 + w1xn)]²   ← that is RSS(w)!

Maximize over s = σ²

∂ log P(D) / ∂s = −(1/2) { −(1/s²) ∑n [yn − (w0 + w1xn)]² + N(1/s) } = 0

→ σ*² = s* = (1/N) ∑n [yn − (w0 + w1xn)]²
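The closed form σ*² = (1/N) ∑n [yn − (w0 + w1xn)]² is just the mean squared residual. A quick numerical check on synthetic data drawn from the model itself (true parameters chosen arbitrarily here):

```python
import numpy as np

# MLE of the noise variance: the mean squared residual should be close
# to the true variance sigma^2 when residuals are computed with the
# true parameters. Data is synthetic.
rng = np.random.default_rng(1)
w0, w1, sigma = 1.0, 2.0, 0.5
x = rng.uniform(0, 10, size=10_000)
y = w0 + w1 * x + rng.normal(0, sigma, size=10_000)

residuals = y - (w0 + w1 * x)
s_star = np.mean(residuals ** 2)   # sigma*^2 = (1/N) sum of squared residuals

assert abs(s_star - sigma**2) < 0.02   # close to the true variance 0.25
```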
How does this probabilistic interpretation help us?
It gives a solid footing to our intuition: minimizing RSS(w) is a sensible thing to do based on reasonable modeling assumptions
Estimating σ* tells us how much noise there could be in our predictions. For example, it allows us to place confidence intervals around our predictions.
Outline
1 Administration
2 Review of last lecture
3 Linear regression: multivariate solution in matrix form; computational and numerical optimization; ridge regression
4 Nonlinear basis functions
LMS when x is D-dimensional
RSS(w) in matrix form
RSS(w) = ∑n [yn − (w0 + ∑d wd xnd)]² = ∑n [yn − wTxn]²

where we have redefined some variables (by augmenting)

x ← [1 x1 x2 . . . xD]T,   w ← [w0 w1 w2 . . . wD]T

which leads to

RSS(w) = ∑n (yn − wTxn)(yn − xnTw)
       = ∑n {wTxnxnTw − 2 ynxnTw} + const.
       = {wT(∑n xnxnT)w − 2(∑n ynxnT)w} + const.
Matrix Multiplication via Inner Products
[ 9 3 5 ]   [ 1  2 ]   [ 28 18 ]
[ 4 1 2 ] × [ 3 −5 ] = [ 11  9 ]
            [ 2  3 ]

For example, the (1, 1) entry: 9 × 1 + 3 × 3 + 5 × 2 = 28

Each entry of the output matrix is the result of an inner product between a row of the first input matrix and a column of the second.
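The inner-product view can be checked directly: computing each output entry as row·column reproduces the full matrix product.

```python
import numpy as np

# Reproducing the slide's example: each entry of C = A @ B is an inner
# product of a row of A with a column of B.
A = np.array([[9, 3, 5], [4, 1, 2]])
B = np.array([[1, 2], [3, -5], [2, 3]])

C = np.array([[A[i].dot(B[:, j]) for j in range(B.shape[1])]
              for i in range(A.shape[0])])

assert np.array_equal(C, A @ B)
assert np.array_equal(C, np.array([[28, 18], [11, 9]]))
```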
Matrix Multiplication via Outer Products
[ 9 3 5 ]   [ 1  2 ]   [ 28 18 ]
[ 4 1 2 ] × [ 3 −5 ] = [ 11  9 ]
            [ 2  3 ]

  [ 9 18 ]   [ 9 −15 ]   [ 10 15 ]
= [ 4  8 ] + [ 3  −5 ] + [  4  6 ]

The output matrix is the sum of outer products between corresponding columns of the first input matrix and rows of the second.
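The outer-product view of the same example: summing the outer product of column k of A with row k of B over all k gives the same result as the inner-product view.

```python
import numpy as np

# The slide's product as a sum of outer products: column k of A times row k of B.
A = np.array([[9, 3, 5], [4, 1, 2]])
B = np.array([[1, 2], [3, -5], [2, 3]])

C = sum(np.outer(A[:, k], B[k, :]) for k in range(A.shape[1]))

# The first outer product is the [9 18; 4 8] term shown above.
assert np.array_equal(np.outer(A[:, 0], B[0, :]), np.array([[9, 18], [4, 8]]))
assert np.array_equal(C, A @ B)
```

This decomposition is exactly what makes ∑n xnxnT in the RSS derivation an outer-product form of XTX.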
RSS(w) in new notations

From the previous slide:

$$\mathrm{RSS}(w) = \Big\{ w^T \Big(\sum_n x_n x_n^T\Big) w - 2 \Big(\sum_n y_n x_n^T\Big) w \Big\} + \text{const.}$$

Design matrix and target vector:

$$X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix} \in \mathbb{R}^{N \times (D+1)}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$$

Compact expression:

$$\mathrm{RSS}(w) = \|Xw - y\|_2^2 = \big\{ w^T X^T X w - 2 (X^T y)^T w \big\} + \text{const}$$

Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 17 / 39
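The compact expression can be sanity-checked numerically. A minimal NumPy sketch (illustrative, with made-up data; here the constant is $y^T y$):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 20, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])  # bias column prepended
y = rng.normal(size=N)
w = rng.normal(size=D + 1)

rss_direct = np.sum((X @ w - y) ** 2)
# Expanded form from the slide; the constant term is y^T y
rss_expanded = w @ X.T @ X @ w - 2 * (X.T @ y) @ w + y @ y
assert np.isclose(rss_direct, rss_expanded)
```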
Solution in matrix form

Compact expression:

$$\mathrm{RSS}(w) = \|Xw - y\|_2^2 = \big\{ w^T X^T X w - 2 (X^T y)^T w \big\} + \text{const}$$

Gradients of linear and quadratic functions:

$$\nabla_x\, b^T x = b, \qquad \nabla_x\, x^T A x = 2Ax \quad (\text{for symmetric } A)$$

Normal equation:

$$\nabla_w \mathrm{RSS}(w) \propto X^T X w - X^T y = 0$$

This leads to the least-mean-squares (LMS) solution:

$$w_{\mathrm{LMS}} = (X^T X)^{-1} X^T y$$

Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 18 / 39
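In code, the LMS solution is usually obtained by solving the normal equation rather than forming the inverse explicitly. A hedged NumPy sketch with synthetic data (the weights and noise level are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 4
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])  # design matrix with bias
w_true = np.array([2.0, -1.0, 0.5, 3.0, -2.0])             # hypothetical true weights
y = X @ w_true + 0.01 * rng.normal(size=N)                 # small Gaussian noise

# Solve X^T X w = X^T y directly; numerically preferable to computing the inverse
w_lms = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(w_lms, w_true, atol=0.1)
```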
Mini-Summary

- Linear regression is a linear combination of features: $f: x \to y$, with $f(x) = w_0 + \sum_d w_d x_d = w_0 + w^T x$
- If we minimize the residual sum of squares as our learning objective, we get a closed-form solution for the parameters
- Probabilistic interpretation: maximum likelihood, assuming the residuals are Gaussian distributed

Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 19 / 39
Computational complexity

Bottleneck of computing the solution

$$w = (X^T X)^{-1} X^T y$$

- Matrix multiplication to form $X^T X \in \mathbb{R}^{(D+1) \times (D+1)}$
- Inverting the matrix $X^T X$

How many operations do we need?

- $O(ND^2)$ for the matrix multiplication
- $O(D^3)$ (e.g., using Gauss-Jordan elimination) or $O(D^{2.373})$ (recent theoretical advances) for the matrix inversion
- Impractical for very large $D$ or $N$

Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 20 / 39
Alternative method: an example of using numerical optimization

(Batch) gradient descent

- Initialize $w$ to $w^{(0)}$ (e.g., randomly); set $t = 0$; choose $\eta > 0$
- Loop until convergence:
  1. Compute the gradient: $\nabla \mathrm{RSS}(w) = X^T X w^{(t)} - X^T y$
  2. Update the parameters: $w^{(t+1)} = w^{(t)} - \eta \nabla \mathrm{RSS}(w)$
  3. $t \leftarrow t + 1$

What is the complexity of each iteration?

Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 21 / 39
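The loop above can be sketched as follows (an illustrative NumPy implementation on noise-free synthetic data; the step size is chosen from the largest eigenvalue of $X^T X$, which is one common heuristic, not something prescribed by the slides):

```python
import numpy as np

def gradient_descent(X, y, eta, iters=2000):
    """Minimize RSS(w) = ||Xw - y||_2^2 by batch gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ X @ w - X.T @ y   # exact gradient, O(ND) per iteration
        w = w - eta * grad
    return w

rng = np.random.default_rng(0)
X = np.hstack([np.ones((30, 1)), rng.normal(size=(30, 2))])  # bias + 2 features
y = X @ np.array([1.0, -2.0, 0.5])                           # noise-free targets

eta = 1.0 / np.linalg.eigvalsh(X.T @ X).max()  # keeps eta below 2 / lambda_max
w_gd = gradient_descent(X, y, eta)
w_exact = np.linalg.solve(X.T @ X, X.T @ y)    # same answer as matrix inversion
assert np.allclose(w_gd, w_exact, atol=1e-4)
```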
Why would this work?

If gradient descent converges, it will converge to the same solution as using matrix inversion, because $\mathrm{RSS}(w)$ is a convex function of its parameters $w$.

Hessian of RSS:

$$\mathrm{RSS}(w) = w^T X^T X w - 2 (X^T y)^T w + \text{const} \;\Rightarrow\; \frac{\partial^2 \mathrm{RSS}(w)}{\partial w\, \partial w^T} = 2 X^T X$$

$X^T X$ is positive semidefinite, because for any $v$

$$v^T X^T X v = \|X v\|_2^2 \geq 0$$

Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 22 / 39
Stochastic gradient descent

Widrow-Hoff rule: update the parameters using one example at a time

- Initialize $w$ to some $w^{(0)}$; set $t = 0$; choose $\eta > 0$
- Loop until convergence:
  1. Randomly choose a training sample $x_t$
  2. Compute its contribution to the gradient: $g_t = (x_t^T w^{(t)} - y_t)\, x_t$
  3. Update the parameters: $w^{(t+1)} = w^{(t)} - \eta g_t$
  4. $t \leftarrow t + 1$

How does the complexity per iteration compare with gradient descent? $O(ND)$ for gradient descent versus $O(D)$ for SGD.

Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 23 / 39
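A minimal NumPy sketch of the Widrow-Hoff update (illustrative; the data is synthetic and noise-free so that SGD can be checked against the exact solution, which is not guaranteed in general):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])
y = X @ np.array([0.5, 1.0, -1.0, 2.0])    # noise-free targets, made-up weights

w = np.zeros(D + 1)
eta = 0.01
for t in range(20000):                     # roughly 100 passes over the data
    i = rng.integers(N)                    # randomly choose one training sample
    g = (X[i] @ w - y[i]) * X[i]           # O(D) gradient contribution
    w -= eta * g

w_exact = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(w, w_exact, atol=1e-2)
```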
Mini-summary

- Batch gradient descent computes the exact gradient.
- Stochastic gradient descent approximates the gradient with a single data point; its expectation equals the true gradient.
- Mini-batch variant: a trade-off between the accuracy of the gradient estimate and the computational cost.
- Similar ideas extend to other ML optimization problems; for large-scale problems, stochastic gradient descent often works well.

Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 24 / 39
What if $X^T X$ is not invertible?

Why might this happen?

- Answer 1: $N < D$. Intuitively, there is not enough data to estimate all the parameters.
- Answer 2: The columns of $X$ are not linearly independent, e.g., some features are perfectly correlated. In this case, the solution is not unique.

Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 25 / 39
Ridge regression

Intuition: what does a non-invertible $X^T X$ mean? Consider the SVD of this matrix:

$$X^T X = U \begin{pmatrix} \lambda_1 & 0 & \cdots & \cdots & 0 \\ 0 & \lambda_2 & & & \vdots \\ \vdots & & \ddots & & \vdots \\ \vdots & & & \lambda_r & 0 \\ 0 & \cdots & \cdots & 0 & 0 \end{pmatrix} U^T$$

where $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_r > 0$ and $r < D$.

Fix the problem by ensuring all singular values are non-zero:

$$X^T X + \lambda I = U\, \mathrm{diag}(\lambda_1 + \lambda, \ldots, \lambda_r + \lambda, \lambda, \ldots, \lambda)\, U^T$$

where $\lambda > 0$ and $I$ is the identity matrix.

Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 26 / 39
Regularized least squares (ridge regression)

Solution:

$$w = (X^T X + \lambda I)^{-1} X^T y$$

This is equivalent to adding an extra term to $\mathrm{RSS}(w)$:

$$\underbrace{\frac{1}{2} \big\{ w^T X^T X w - 2 (X^T y)^T w \big\}}_{\mathrm{RSS}(w)} + \underbrace{\frac{1}{2} \lambda \|w\|_2^2}_{\text{regularization}}$$

Benefits:

- Numerically more stable: the matrix is invertible
- Prevents overfitting (more on this later)

Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 27 / 39
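A short NumPy sketch showing how the ridge solution helps when $N < D$ (illustrative; the rank-deficient case is constructed deliberately with random data):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 10, 20                    # N < D, so X^T X (20 x 20) has rank at most 10
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

# The plain normal equation would fail: X^T X is singular here
assert np.linalg.matrix_rank(X.T @ X) < D

lam = 0.1                        # hypothetical regularization strength
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)  # always solvable
assert np.isfinite(w_ridge).all()
```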
How to choose λ?

- $\lambda$ is referred to as a hyperparameter
- In contrast, $w$ is the parameter vector
- Use a validation set or cross-validation to find a good choice of $\lambda$

Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 28 / 39
Outline
1 Administration
2 Review of last lecture
3 Linear regression
4 Nonlinear basis functions
Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 29 / 39
Is a linear modeling assumption always a good idea?

Example of nonlinear classification: [figure; two classes plotted over $x_1, x_2 \in [-1.5, 1.5]$]

Example of nonlinear regression: [figure; $t$ versus $x$, with $x \in [0, 1]$ and $t \in [-1, 1]$]

Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 30 / 39
Nonlinear basis for classification

Transform the input/feature:

$$\phi(x): x \in \mathbb{R}^2 \to z = x_1 \cdot x_2$$

Transformed training data: linearly separable!

[Figures: the original data in the $(x_1, x_2)$ plane, and the transformed data along $z$]

Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 31 / 39
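A tiny NumPy sketch of this transform on XOR-style data (illustrative; these four points are made up, not the slide's dataset):

```python
import numpy as np

# Four XOR-style points: opposite quadrants share a label, so no straight
# line separates the two classes in the original 2D space.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])

z = X[:, 0] * X[:, 1]          # phi(x) = x1 * x2, a single new feature
# In the transformed 1D space, the threshold z > 0 separates the classes
assert np.array_equal(np.sign(z), y)
```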
Another example

[Figure: two classes in 2D, plotted over $x_1, x_2 \in [-4, 4]$, not linearly separable]

How to transform the input/feature?

$$\phi(x): x \in \mathbb{R}^2 \to z = \begin{pmatrix} x_1^2 \\ x_1 \cdot x_2 \\ x_2^2 \end{pmatrix} \in \mathbb{R}^3$$

Transformed training data: linearly separable

[Figure: the transformed data in 3D]

Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 32 / 39
General nonlinear basis functions
We can use a nonlinear mapping

φ(x) : x ∈ R^D → z ∈ R^M

where M is the dimensionality of the new feature/input z (or φ(x)). M could be greater than, less than, or equal to D.

With the new features, we can apply our learning techniques to minimize our errors on the transformed training data:

linear methods: prediction is based on w^T φ(x)
other methods: nearest neighbors, decision trees, etc.
Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 33 / 39
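A minimal sketch of such mappings (the particular φ's here are illustrative choices of mine, not from the lecture), showing both M > D and M < D, with a linear prediction w^T φ(x):

```python
import numpy as np

# A mapping with M > D: augment a 2D input with a bias and a product term.
def phi(x):                      # R^2 -> R^4
    return np.array([1.0, x[0], x[1], x[0] * x[1]])

# A mapping with M < D: summarize a 3D input by its norm.
def phi_small(x):                # R^3 -> R^1
    return np.array([np.linalg.norm(x)])

# Linear methods then predict with w^T phi(x), exactly as before.
w = np.array([0.5, 1.0, -1.0, 2.0])
x = np.array([2.0, 3.0])
print(w @ phi(x))   # 0.5 + 2 - 3 + 12 = 11.5
```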
![Page 97: Linear Regression (continued)atalwalk/teaching/winter17/... · 1D example: predicting the sale price of a house Sale price ˇprice per sqft square footage + xed expense Professor](https://reader033.fdocuments.net/reader033/viewer/2022042404/5f192392db4c442f7438b18c/html5/thumbnails/97.jpg)
Regression with nonlinear basis
Residual sum of squares:

RSS(w) = Σ_n [w^T φ(x_n) − y_n]^2

where w ∈ R^M, the same dimensionality as the transformed features φ(x).

The LMS solution can be formulated with the new design matrix:

Φ = [φ(x_1)^T; φ(x_2)^T; … ; φ(x_N)^T] ∈ R^{N×M},   w_LMS = (Φ^T Φ)^{−1} Φ^T y
Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 34 / 39
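The LMS formula above can be sketched in a few lines (a toy example with a polynomial feature map; function names are illustrative, and `lstsq` is used in place of the explicit inverse for numerical stability):

```python
import numpy as np

def poly_features(x, M):
    """phi(x) = (1, x, x^2, ..., x^M) for scalar x."""
    return np.array([x**m for m in range(M + 1)])

def fit_lms(X, y, M):
    """w_LMS = (Phi^T Phi)^{-1} Phi^T y, computed via least squares."""
    Phi = np.array([poly_features(x, M) for x in X])   # N x (M+1) design matrix
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

# Recover an exact quadratic from noiseless samples.
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 - 3.0 * X + 0.5 * X**2
w = fit_lms(X, y, M=2)
print(np.round(w, 3))   # approximately [ 2.  -3.   0.5]
```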
![Page 99: Linear Regression (continued)atalwalk/teaching/winter17/... · 1D example: predicting the sale price of a house Sale price ˇprice per sqft square footage + xed expense Professor](https://reader033.fdocuments.net/reader033/viewer/2022042404/5f192392db4c442f7438b18c/html5/thumbnails/99.jpg)
Example with regression: polynomial basis functions

φ(x) = (1, x, x^2, …, x^M)^T ⇒ f(x) = w_0 + Σ_{m=1}^{M} w_m x^m

Fitting samples from a sine function: underfitting, as f(x) is too simple

[Figure: fits with M = 0 and M = 1 to data sampled from a sine curve on [0, 1]; both are too simple to capture the curve.]
Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 35 / 39
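A sketch of this experiment (synthetic sine data of my own; `np.vander` builds the polynomial design matrix). Because the models are nested, training RSS can only shrink as M grows, even though small M underfits:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)

def fit_poly(x, t, M):
    """Least-squares polynomial fit of degree M."""
    Phi = np.vander(x, M + 1, increasing=True)   # rows (1, x, ..., x^M)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w, Phi

rss_by_M = {}
for M in (0, 1, 3, 9):
    w, Phi = fit_poly(x, t, M)
    rss_by_M[M] = np.sum((Phi @ w - t) ** 2)
    print(f"M={M}: training RSS = {rss_by_M[M]:.4f}")
```

With 10 points, the degree-9 fit interpolates the training data, driving training RSS to (numerically) zero.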
![Page 101: Linear Regression (continued)atalwalk/teaching/winter17/... · 1D example: predicting the sale price of a house Sale price ˇprice per sqft square footage + xed expense Professor](https://reader033.fdocuments.net/reader033/viewer/2022042404/5f192392db4c442f7438b18c/html5/thumbnails/101.jpg)
Adding high-order terms
M = 3

[Figure: degree-3 polynomial fit to the sine data.]

M = 9: overfitting

[Figure: degree-9 polynomial fit to the sine data.]

More complex features lead to better results on the training data, but potentially worse results on new data, e.g., test data!
Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 36 / 39
![Page 103: Linear Regression (continued)atalwalk/teaching/winter17/... · 1D example: predicting the sale price of a house Sale price ˇprice per sqft square footage + xed expense Professor](https://reader033.fdocuments.net/reader033/viewer/2022042404/5f192392db4c442f7438b18c/html5/thumbnails/103.jpg)
Overfitting

Parameters for higher-order polynomials are very large:

        M = 0    M = 1    M = 3        M = 9
w0       0.19     0.82     0.31         0.35
w1               -1.27     7.99       232.37
w2                       -25.43     -5321.83
w3                        17.37     48568.31
w4                                -231639.30
w5                                 640042.26
w6                               -1061800.52
w7                                1042400.18
w8                                -557682.99
w9                                 125201.43
Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 37 / 39
![Page 104: Linear Regression (continued)atalwalk/teaching/winter17/... · 1D example: predicting the sale price of a house Sale price ˇprice per sqft square footage + xed expense Professor](https://reader033.fdocuments.net/reader033/viewer/2022042404/5f192392db4c442f7438b18c/html5/thumbnails/104.jpg)
Overfitting can be quite disastrous
Fitting the housing price data with large M
Predicted price goes to zero (and ultimately becomes negative) if you buy a big enough house!
This is called poor generalization/overfitting.
Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 38 / 39
![Page 105: Linear Regression (continued)atalwalk/teaching/winter17/... · 1D example: predicting the sale price of a house Sale price ˇprice per sqft square footage + xed expense Professor](https://reader033.fdocuments.net/reader033/viewer/2022042404/5f192392db4c442f7438b18c/html5/thumbnails/105.jpg)
Detecting overfitting
Plot model complexity versus objective function:

X axis: model complexity, e.g., M
Y axis: error, e.g., RSS, RMS (square root of RSS), 0-1 loss

[Figure: E_RMS versus M (0 to 9) for training and test data.]

As a model increases in complexity:

Training error keeps improving
Test error may first improve but eventually will deteriorate
Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 39 / 39
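The train/test comparison can be sketched as follows (synthetic data and degrees chosen for illustration). Training RMS error falls monotonically with M, while test RMS error need not:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    """Noisy samples from a sine curve on [0, 1]."""
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(n)

x_tr, t_tr = make_data(10)    # small training set, as on the slide
x_te, t_te = make_data(100)   # held-out test set

def rms(x, t, w):
    """Root-mean-square prediction error of polynomial weights w."""
    Phi = np.vander(x, len(w), increasing=True)
    return np.sqrt(np.mean((Phi @ w - t) ** 2))

train_rms, test_rms = {}, {}
for M in (0, 1, 3, 6, 9):
    Phi = np.vander(x_tr, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t_tr, rcond=None)
    train_rms[M], test_rms[M] = rms(x_tr, t_tr, w), rms(x_te, t_te, w)
    print(f"M={M}: train RMS={train_rms[M]:.3f}  test RMS={test_rms[M]:.3f}")
```

Plotting `train_rms` and `test_rms` against M reproduces the qualitative shape of the curve on the slide.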