
Page 1: Ridge regression

A Note on Ridge Regression

Ananda Swarup Das

October 16, 2016


Page 2: Ridge regression

Linear Regression

1 Linear Regression is a simple approach for Supervised Learning and is used for quantitative predictions.

2 Assuming X to be a quantitative predictor and y to be a quantitative response, and the relationship between the predictor and the response to be linear, the linear relationship can be written as

y ≈ β0 + β1X (1)

3 The relationship is represented as an approximate one as it is assumed that y = β0 + β1X + ε, where ε is an irreducible error that might have crept in while recording the data.


Page 3: Ridge regression

Linear Regression Continued

1 In Equation 1, β0, β1 are two unknown constants, also known as parameters.

2 Our objective is to use training data to estimate the values of β̂0, β̂1.

3 So far, we have discussed the case of simple linear regression. In the case of multiple linear regression, our linear regression model takes the form

y = β0 + β1x1 + β2x2 + ... + βpxp + ε (2)

4 A commonly used technique to find the estimates of the coefficients (parameters) is the least squares method [1].
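
To make this concrete, here is a minimal scikit-learn sketch of fitting a multiple linear regression by ordinary least squares; the data values are made up for illustration and are not from the slides.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: 6 observations, 2 predictors (illustrative values only).
X = np.array([[1.0, 0.5], [2.0, 1.0], [3.0, 1.5], [4.0, 1.0], [5.0, 2.0], [6.0, 2.5]])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.0])

# LinearRegression computes the least squares estimates of beta_0, beta_1, ..., beta_p.
ols = LinearRegression()
ols.fit(X, y)
print("estimated beta_0:", ols.intercept_)
print("estimated beta_1..beta_p:", ols.coef_)
```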


Page 4: Ridge regression

How good is our Estimation of the parameters

1 In the regression setting, a technique to measure the fit is the mean squared error (MSE), which is given as

MSE = (1/n) ∑_{i=1}^{n} (yi − f̂(xi))²  (3)

Here, n is the number of observations, yi is the true response, and f̂(xi) is the response predicted by our model, defined by the coefficients estimated from the training data.
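
As a quick illustration of Equation 3, the sketch below computes the MSE both directly and via scikit-learn; the response and prediction values are made up for the example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true responses y_i and model predictions f_hat(x_i).
y_true = np.array([3.0, 2.5, 4.1, 5.0])
y_pred = np.array([2.8, 2.7, 3.9, 5.4])

# Equation 3: the average of the squared residuals (y_i - f_hat(x_i))^2.
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)
print(mse_manual, mse_sklearn)  # the two values agree
```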


Page 5: Ridge regression

The Bias-Variance Trade Off

As stated in [1], the expected squared residual error E(yi − f̂(xi))² is given by

E(yi − f̂(xi))² = Var(f̂(xi)) + [Bias(f̂(xi))]² + Var(ε) (4)

1 In the above equation, the first term on the right-hand side denotes the variance of the model, that is, the amount by which f̂ would change if the parameters β1, ..., βp were estimated using different training data.

2 The second term denotes the error introduced by approximating a possibly complicated real-life model with a simpler model.


Page 6: Ridge regression

The Bias-Variance Trade Off Continued

As also shown in [1], the expected squared residual error E(yi − f̂(xi))² can be expressed as

E(yi − f̂(xi))² = E(f(xi) + ε − f̂(xi))² = [f(xi) − f̂(xi)]² + Var(ε) (5)

Notice that we have replaced yi with f(xi) + ε. The first part, [f(xi) − f̂(xi)]², is reducible, and we want our parameter estimates to be such that f̂(xi) is as close as possible to f(xi). However, Var(ε) is irreducible.
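
The step from the middle expression to the right-hand side of Equation 5 is implicit in the slides; expanding the square and using E(ε) = 0 and E(ε²) = Var(ε), with f̂(xi) treated as fixed, gives:

```latex
E\bigl(y_i-\hat f(x_i)\bigr)^2
 = E\bigl(f(x_i)+\varepsilon-\hat f(x_i)\bigr)^2
 = \bigl[f(x_i)-\hat f(x_i)\bigr]^2
   + 2\bigl[f(x_i)-\hat f(x_i)\bigr]E(\varepsilon)
   + E(\varepsilon^2)
 = \bigl[f(x_i)-\hat f(x_i)\bigr]^2 + \mathrm{Var}(\varepsilon)
```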


Page 7: Ridge regression

What do we reduce

1 Reconsider Equation 4, E(yi − f̂(xi))² = Var(f̂(xi)) + [Bias(f̂(xi))]² + Var(ε): the expected value of the MSE cannot be less than Var(ε).

2 Thus, we have to try to reduce the variance and the bias of the model f̂.


Page 8: Ridge regression

Certain Situations

Provided the true relationship between the predictor and the response is linear, the least squares method will have low bias.

1 If the size of the training data n is very large compared to the number of predictors, that is n >> p, the least squares estimates tend to have low variance.

2 If the size of the training data n is only slightly larger than p, then the least squares estimates may have high variance.

3 If n < p, the least squares method should not be applied without using dimension-reducing techniques.


Page 9: Ridge regression

Ridge Regression

1 In this presentation, we deal with the second situation, where n is only slightly greater than p, using Ridge Regression, which has been found to be significantly helpful in reducing variance.

2 In the least squares method, the coefficients β1, ..., βp are estimated by minimizing the Residual Sum of Squares (RSS): RSS = ∑_{i=1}^{n} (yi − β0 − ∑_{j=1}^{p} βj xi,j)². Notice that β̂0 = ȳ, the mean of all the responses (this holds when the predictors are centered).

3 In the case of Ridge Regression, the minimization objective changes to ∑_{i=1}^{n} (yi − β0 − ∑_{j=1}^{p} βj xi,j)² + λ ∑_{j=1}^{p} βj². Here λ is a tuning parameter which constrains the choice of the coefficients but decreases the variance. To minimize the objective function, both additive terms have to be kept small; a small numerical sketch of this objective is given below.
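
The following sketch evaluates the ridge objective directly for a given coefficient vector, to show how the penalty term λ ∑ βj² enters; the design matrix X, response y, coefficients, and λ values are made-up illustrations, not the author's code.

```python
import numpy as np

def ridge_objective(beta0, beta, X, y, lam):
    """RSS plus ridge penalty: sum_i (y_i - beta0 - x_i . beta)^2 + lam * sum_j beta_j^2."""
    residuals = y - (beta0 + X @ beta)
    return np.sum(residuals ** 2) + lam * np.sum(beta ** 2)

# Hypothetical data: 5 observations, 2 predictors.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 0.5], [4.0, 2.5], [5.0, 1.5]])
y = np.array([2.0, 3.0, 3.5, 6.0, 7.0])
beta0, beta = 1.0, np.array([0.5, 0.2])

# With lam = 0 the objective is plain RSS; larger lam penalizes large coefficients more.
for lam in [0.0, 1.0, 10.0]:
    print(lam, ridge_objective(beta0, beta, X, y, lam))
```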


Page 10: Ridge regression

The Significance of the choice of λ

1 As stated in [1], for every value of λ there exists a constant s such that the problem of ridge regression coefficient estimation boils down to the constrained problem below (restated in symbols after this list):

minimize ∑_{i=1}^{n} (yi − β0 − ∑_{j=1}^{p} βj xi,j)²  (6)

s.t. ∑_{j=1}^{p} βj² ≤ s

2 Notice that if p = 2, under the constraint ∑_{j=1}^{p} βj² ≤ s, ridge regression coefficient estimation is equivalent to finding the coefficients lying within a circle (in general, a sphere) centered at the origin with radius √s, such that Equation 6 is minimized.
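
In symbols, the statement in item 1 is the standard equivalence between the penalized and constrained forms of ridge regression; it is asserted rather than derived in [1], and the correspondence between λ and s below is the usual one (not spelled out in the slides):

```latex
\min_{\beta}\ \sum_{i=1}^{n}\Bigl(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^{2}
             +\lambda\sum_{j=1}^{p}\beta_j^{2}
\quad\Longleftrightarrow\quad
\min_{\beta}\ \sum_{i=1}^{n}\Bigl(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^{2}
\ \text{s.t.}\ \sum_{j=1}^{p}\beta_j^{2}\le s ,
\qquad s=\sum_{j=1}^{p}\hat\beta_j(\lambda)^{2}.
```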


Page 11: Ridge regression

Ridge Regression Coefficient Estimation

Figure (axes: β1, β2): The Residual Sum of Squares (RSS), ∑_{i=1}^{n} (yi − β0 − ∑_{j=1}^{p} βj xi,j)², is a convex function, and when p = 2 the contours look like a set of concentric ellipses. The least squares solution is denoted by the innermost maroon dot. The ellipses centered at that dot have constant RSS, that is, all points on a given ellipse share the common value of the RSS. As the ellipses expand away from the least squares estimate, the RSS increases.


Page 12: Ridge regression

Ridge Regression Coefficient Estimation

Figure (axes: β1, β2): In general, the ridge regression coefficient estimates are given by the first point at which an ellipse contacts the constraint circle, the green point in the figure.


Page 13: Ridge regression

A Small Experiment

1 I am using Python scikit-learn for the purpose of the experiment, and in this context it must be mentioned that the book by Sebastian Raschka, Python Machine Learning, PACKT Publishing, is a good book for understanding how to use scikit-learn effectively.

2 The data set used for the experiment can be found at https://archive.ics.uci.edu/ml/datasets/Housing.

3 The data set comprises 506 samples and 14 attributes. I have used 11 attributes as predictors (column numbers 1, 2, 3, 5, 6, 8, 9, 10, 11, 12, 13) and column number 14 as the response.

4 Since 506 >> 11, and we are trying Ridge regression in the setting where n is only slightly larger than p, I have randomly selected 20 observations from the data set, of which 14 have been used for training and 6 for testing; a sketch of this setup is given below.
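
Below is a minimal sketch of how this setup could be reproduced; the file name housing.data (downloaded from the UCI page above), the random seed, and the grid of λ values are my assumptions, not the author's actual code, and the 1-based column numbers from the slide are converted to 0-based indices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Assumes the raw housing data file has been downloaded from the UCI page and
# saved locally as 'housing.data' (whitespace-separated, no header row).
df = pd.read_csv('housing.data', sep=r'\s+', header=None)

# Predictors: columns 1,2,3,5,6,8,9,10,11,12,13 (1-based) -> 0-based indices below.
predictor_cols = [0, 1, 2, 4, 5, 7, 8, 9, 10, 11, 12]
X = df.iloc[:, predictor_cols].values
y = df.iloc[:, 13].values  # column 14 (1-based) is the response

# Randomly pick 20 observations: 14 for training, 6 for testing (seed is arbitrary).
rng = np.random.RandomState(0)
idx = rng.choice(len(df), size=20, replace=False)
train_idx, test_idx = idx[:14], idx[14:]

for lam in [0, 1, 2, 4, 8]:  # lambda is called 'alpha' in scikit-learn
    model = Ridge(alpha=lam)
    model.fit(X[train_idx], y[train_idx])
    train_mse = mean_squared_error(y[train_idx], model.predict(X[train_idx]))
    test_mse = mean_squared_error(y[test_idx], model.predict(X[test_idx]))
    print(f"lambda={lam}: train MSE={train_mse:.2f}, test MSE={test_mse:.2f}")
```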


Page 14: Ridge regression

A Small Experiment

Figure: Training and test mean squared error (MSE) plotted against values of λ.


Page 15: Ridge regression

A Small Experiment

1 Notice that when λ = 0, the minimization objective, minimize(∑_{i=1}^{n} (yi − β0 − ∑_{j=1}^{p} βj xi,j)² + λ ∑_{j=1}^{p} βj²), reduces to minimize(∑_{i=1}^{n} (yi − β0 − ∑_{j=1}^{p} βj xi,j)²), the case of least squares estimation. Notice the difference between the MSEs of the test data and the training data: a sharp/large difference denotes significant variance in our model. Notice the difference between the test and training MSE at λ = 0. As the value of λ increases, the variance decreases, up to λ = 4.

2 In general, the choice of λ can be made through a grid search using the built-in estimator linear_model.RidgeCV from scikit-learn, as sketched below.
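
A small sketch of such a grid search; the alpha grid is arbitrary, and X, y are placeholders standing in for the housing predictors and responses used above.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Placeholder data standing in for the housing predictors/response from the experiment.
X = np.random.rand(20, 11)
y = np.random.rand(20)

# RidgeCV selects the best lambda (called alpha) from the supplied grid by cross-validation.
model = RidgeCV(alphas=np.array([0.01, 0.1, 1.0, 2.0, 4.0, 8.0]))
model.fit(X, y)
print("selected lambda (alpha):", model.alpha_)
```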


Page 16: Ridge regression

Citations

[1] G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer Texts in Statistics. Springer New York, 2014.
