lecture 10: ℓ2 regularization
STAT 598z: Introduction to computing for statistics
Vinayak Rao, Department of Statistics, Purdue University
February 14, 2019
Ordinary least squares
Consider linear regression:
y = x⊤w + ϵ
In vector notation:
y = X⊤w + ϵ,  y ∈ ℜⁿ, w ∈ ℜᵖ, X ∈ ℜ^{p×n}
ŵ = arg min_w ∥y − X⊤w∥² = arg min_w ∑_{i=1}^n (y_i − x_i⊤w)²
Ordinary least squares
Problem: ŵ = arg min_w ∥y − X⊤w∥² = arg min_w ∑_{i=1}^n (y_i − x_i⊤w)²
Solution: ŵ = (XX⊤)⁻¹Xy (correlation in 1-d)
How to do this in R (without using lm)?
• Do not invert with solve and then multiply!
• Directly solve the linear system (XX⊤)w = Xy
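A minimal sketch in R of the two approaches on simulated data (X is stored as p × n with observations as columns, matching the slides' convention; all values are illustrative):

  set.seed(1)
  n <- 100; p <- 5
  X <- matrix(rnorm(p * n), nrow = p)           # p x n design, columns = observations
  w_true <- rnorm(p)
  y <- drop(t(X) %*% w_true) + rnorm(n, sd = 0.1)

  A <- X %*% t(X)                               # XX^T, a p x p matrix
  b <- X %*% y                                  # Xy
  w_bad <- solve(A) %*% b                       # avoid: explicit inverse, then multiply
  w_hat <- solve(A, b)                          # better: solve (XX^T) w = Xy directly

The second call is cheaper and numerically more stable, since it never forms the inverse explicitly.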
Prediction error
ŵ is an unbiased estimate of the true w
For a test vector x_test we predict ŵ⊤x_test.
(Squared) prediction error: PE² = (1/k) ∑_{i=1}^k (y_test,i − ŵ⊤x_test,i)²
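A minimal sketch in R of estimating PE² on a held-out test set (simulated data; the 80/20 split is purely illustrative):

  set.seed(2)
  n <- 100; p <- 5
  X <- matrix(rnorm(p * n), nrow = p)           # p x n, columns = observations
  y <- drop(t(X) %*% rnorm(p)) + rnorm(n, sd = 0.1)

  idx   <- sample(n, 80)                        # 80 training, 20 test observations
  w_hat <- solve(X[, idx] %*% t(X[, idx]), X[, idx] %*% y[idx])   # OLS on the training part
  PE2   <- mean((y[-idx] - drop(t(X[, -idx]) %*% w_hat))^2)       # squared prediction error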
Can show:
• PE has mean 0
• its variance grows with the number of features (p)
What if p > n?
• XX⊤ is singular
Regularization
p > n:
• Cannot invert XX⊤
• We can invert if we add a small λ to the diagonal
wλ = (XX⊤ + λI)⁻¹Xy (I is the identity matrix)
Introducing λ makes the problem well-posed, but introduces bias
• λ = 0 recovers OLS
• Larger λ causes larger bias
• λ = ∞? No variance!
λ trades off bias and variance
Maybe a nonzero λ is actually good?
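A minimal sketch in R of the p > n situation (dimensions are made up for illustration): XX⊤ is rank-deficient, but adding λ to the diagonal makes the system solvable.

  set.seed(3)
  n <- 20; p <- 50                              # more features than observations
  X <- matrix(rnorm(p * n), nrow = p)
  y <- rnorm(n)

  A <- X %*% t(X)                               # p x p, rank at most n < p: singular
  # solve(A, X %*% y)                           # would fail: computationally singular
  lambda <- 1e-2
  w_lambda <- solve(A + lambda * diag(p), X %*% y)   # well-posed for any lambda > 0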
Ridge regression (a.k.a. Tikhonov regularization)
Recall ŵ = (XX⊤)⁻¹Xy solves ŵ = arg min_w ∥y − X⊤w∥²
wλ = (XX⊤ + λI)⁻¹Xy solves
wλ = arg min_w Lλ(w), where Lλ(w) := ∑_{i=1}^n (y_i − x_i⊤w)² + λ∥w∥₂²
∥w∥₂² = ∑_{i=1}^p w_i² is the squared ℓ2-norm
λ∥w∥₂² is the shrinkage penalty.
Favours w’s with smaller components
λ trades off small training error against ‘simple’ solutions
ℓ2/ridge/Tikhonov regularization
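A minimal sketch in R checking that the closed form agrees with a numerical minimizer of Lλ(w) (simulated data; optim with BFGS is just one convenient choice):

  set.seed(4)
  n <- 50; p <- 4; lambda <- 2
  X <- matrix(rnorm(p * n), nrow = p)
  y <- drop(t(X) %*% rnorm(p)) + rnorm(n)

  ridge_loss <- function(w) sum((y - drop(t(X) %*% w))^2) + lambda * sum(w^2)

  w_closed  <- drop(solve(X %*% t(X) + lambda * diag(p), X %*% y))  # (XX^T + lambda I)^{-1} Xy
  w_numeric <- optim(rep(0, p), ridge_loss, method = "BFGS")$par    # direct minimization of L_lambda
  max(abs(w_closed - w_numeric))                # ~0, up to optimizer tolerance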
Ridge regression (solution)
Simple modification of the least-squares solution:
wλ = (XX⊤ + λI)⁻¹Xy
In the 1-dimensional case (p = 1, so X⊤ is a single column x ∈ ℜⁿ),
wλ = (x⊤x + λ)⁻¹ x⊤y = [x⊤x / (x⊤x + λ)] · (x⊤y / x⊤x) = c ŵ, where c = x⊤x / (x⊤x + λ) < 1
Shrinks the least-squares solution towards zero.
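A minimal sketch in R of the 1-d shrinkage factor (simulated scalar feature; λ = 10 is arbitrary):

  set.seed(5)
  n <- 100; lambda <- 10
  x <- rnorm(n)
  y <- 2 * x + rnorm(n)

  w_ols    <- sum(x * y) / sum(x * x)                 # least-squares slope x'y / x'x
  c_shrink <- sum(x * x) / (sum(x * x) + lambda)      # shrinkage factor, always < 1
  w_ridge  <- c_shrink * w_ols
  all.equal(w_ridge, sum(x * y) / (sum(x * x) + lambda))   # TRUE: same as the closed form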
Ridge regression
Credit data set (average credit card debt)
[Figure: standardized coefficients for Income, Limit, Rating and Student on the Credit data, plotted against λ (left) and ∥βRλ∥₂/∥β∥₂ (right). From James, Witten, Hastie and Tibshirani.]
How do we choose λ?
Cross-validation:
• Pick a set of λ's
• For the kth fold of cross-validation:
  • For each λ:
    • Solve the regularized least-squares problem on the training data.
    • Evaluate the estimated w on the held-out data (call this PEλ,k).
• Pick λ = arg min_λ mean(PEλ) (or arg min_λ (mean(PEλ) + stderr(PEλ)))
• Having chosen λ, solve the regularized least-squares problem on all the data (sketched below)
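A minimal sketch in R of this procedure (simulated data; the λ grid, the 5 folds, and the helper ridge_fit are all illustrative choices):

  set.seed(6)
  n <- 100; p <- 10; K <- 5
  X <- matrix(rnorm(p * n), nrow = p)                 # p x n, columns = observations
  y <- drop(t(X) %*% rnorm(p)) + rnorm(n)

  ridge_fit <- function(X, y, lambda)                 # w_lambda = (XX^T + lambda I)^{-1} Xy
    solve(X %*% t(X) + lambda * diag(nrow(X)), X %*% y)

  lambdas <- 10^seq(-3, 3, length.out = 20)           # candidate lambdas
  folds   <- sample(rep(1:K, length.out = n))         # random fold assignment
  PE <- matrix(NA, K, length(lambdas))                # PE[k, j]: error on fold k for lambda j

  for (k in 1:K) {
    tr <- folds != k                                  # training observations for this fold
    for (j in seq_along(lambdas)) {
      w <- ridge_fit(X[, tr], y[tr], lambdas[j])
      PE[k, j] <- mean((y[!tr] - drop(t(X[, !tr]) %*% w))^2)
    }
  }

  best    <- lambdas[which.min(colMeans(PE))]         # lambda minimizing mean CV error
  w_final <- ridge_fit(X, y, best)                    # refit on all the data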
Does this work?
[Figure: mean squared error as a function of λ (left) and ∥βRλ∥₂/∥β∥₂ (right).]
Ridge regression improves performance by reducing variance
• does not perform feature selection
• just shrinks components of w towards 0
For feature selection: the Lasso
Regularization as constrained optimization
arg min_w ∥y − X⊤w∥² + λ∥w∥₂² is equivalent to
arg min_w ∥y − X⊤w∥² s.t. ∥w∥₂² ≤ γ
(Note: γ will depend on the data)
First problem: regularized optimization
Second problem: constrained optimization
(The two are linked through the Lagrangian of the constrained problem: λ plays the role of the Lagrange multiplier for the constraint ∥w∥₂² ≤ γ.)