Ch. 7 Model Selection
• 7.1 Introduction: Model Misspecification
• 7.2 Choosing the Number of Predictor Variables
• 7.3 Variable Selection Methods
7.1 Model Misspecification
• Consider linear regression of a response y on k predictor variables.
• Under the full model (all predictors)
    β̂ = (X^T X)^{-1} X^T y

and

    MSE = y^T (I − H) y / (n − k − 1),

where H = X (X^T X)^{-1} X^T is the hat matrix.
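As a quick numerical check of these formulas, here is a small sketch (in Python/NumPy rather than this course's R, purely for illustration) that computes β̂ and the MSE directly from the matrix expressions, using made-up noise-free data so the answers are easy to verify:

```python
import numpy as np

# Made-up data: y is exactly 1 + 2x, so beta-hat should be (1, 2)
# and the MSE should be 0 (k = 1 predictor, n = 5 observations).
xvals = np.array([0., 1., 2., 3., 4.])
y = 1.0 + 2.0 * xvals
X = np.column_stack([np.ones_like(xvals), xvals])  # design matrix with intercept
n, k = len(y), 1

# beta-hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# MSE = y^T (I - H) y / (n - k - 1), with H = X (X^T X)^{-1} X^T
H = X @ np.linalg.inv(X.T @ X) @ X.T
mse = y @ (np.eye(n) - H) @ y / (n - k - 1)
```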
7.1 Model Misspecification (Cont’d)
• Under the reduced model with q parameters (q < k + 1):

    β̂_q = (X_q^T X_q)^{-1} X_q^T y

and

    σ̂²_q = y^T (I − H_q) y / (n − q),

where

    H_q = X_q (X_q^T X_q)^{-1} X_q^T.

• Suppose the reduced model is correct, and we use the full model. Then we have over-parametrized the problem.
7.1 Model Misspecification (Cont’d): Wrongly using the Full Model
• We have:

  1. E[β̂] = β (some of the parameters are 0) ⇒ unbiased

  2. V(β̂) = σ²(X^T X)^{-1}

• By including unnecessary variables, we risk introducing unnecessary multicollinearity ⇒ inflated standard errors ⇒ imprecise estimates.

• Prediction is affected too: it remains unbiased, but is less precise than under the correct model.
Model Misspecification (Cont’d): Wrongly using the Reduced Model
• Suppose the full model is correct, and we use the reduced model. Then

    E[β̂_q] = β_q + (X_q^T X_q)^{-1} X_q^T X_r β_r ≠ β_q

  ⇒ biased.

• A = (X_q^T X_q)^{-1} X_q^T X_r is the alias matrix.

•   V(β̂_q) = σ²(X_q^T X_q)^{-1}
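A tiny numerical illustration of the alias matrix (a sketch in Python/NumPy with made-up matrices, not part of the course code): take the reduced design to be an intercept only, and omit a single linear term whose true coefficient is 2.

```python
import numpy as np

# Reduced design X_q: intercept only (q = 1), n = 4 observations
Xq = np.ones((4, 1))
# Omitted column X_r and its (nonzero) coefficient beta_r
Xr = np.array([[1.], [2.], [3.], [4.]])
beta_r = np.array([2.0])

# Alias matrix A = (X_q^T X_q)^{-1} X_q^T X_r
A = np.linalg.inv(Xq.T @ Xq) @ Xq.T @ Xr   # here A is the mean of X_r, 2.5

# Bias of beta-hat_q: E[beta-hat_q] - beta_q = A beta_r
bias = A @ beta_r                          # 2.5 * 2 = 5
```

The fitted intercept is pulled up by 5: it absorbs the average effect of the omitted variable, exactly as the formula predicts.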
Model Misspecification (Cont’d): Wrongly using the Reduced Model
• Because of bias, we need to use the matrix mean squared error:

    MSE = E[(β̂_q − β_q)(β̂_q − β_q)^T]
        = V(β̂_q) + Bias(β̂_q) Bias(β̂_q)^T
        = σ²(X_q^T X_q)^{-1} + (A β_r)(A β_r)^T

• Also,

    E[σ̂²_q] = σ² + (1/(n − q)) β_r^T X_r^T (I − H_q) X_r β_r

  ⇒ σ̂²_q over-estimates σ².
Summary
• Loss of precision occurs when too many variables are included in a model.
• Bias results when too few variables are included in a model.
7.2 How Many Predictor Variables Should be Used?
1. R² = SSR(p)/SST = 1 − SSE(p)/SST:
   the proportion of variation in the response explained by the (p-variable) regression.
   Problem: this always increases with p.

2. Adjusted R² or residual mean square:

    R²_adj = 1 − ((n − 1)/(n − p))(1 − R²_p)

   This does not always increase with p, and maximizing it is equivalent to minimizing

    MSE(p) = SSE(p)/(n − p).

   This often decreases as variables are added, but sometimes begins to increase when too many are added. One might want to choose p where this first starts to level off. Better yet, one might use ...
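To see the contrast numerically, here is a sketch (Python, with made-up SSE values): adding a predictor that lowers SSE only slightly always raises R², but lowers R²_adj.

```python
def r2(sse, sst):
    # R^2 = 1 - SSE/SST
    return 1 - sse / sst

def r2_adj(sse, sst, n, p):
    # R^2_adj = 1 - ((n - 1)/(n - p)) (1 - R^2_p)
    return 1 - ((n - 1) / (n - p)) * (1 - r2(sse, sst))

n, sst = 30, 100.0
# p = 2 model: SSE = 40;  p = 3 model: SSE = 39 (a tiny improvement)
r2_small, r2_big = r2(40.0, sst), r2(39.0, sst)
adj_small, adj_big = r2_adj(40.0, sst, n, 2), r2_adj(39.0, sst, n, 3)
# R^2 necessarily goes up, but the adjusted version goes down
```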
Mallows’ Cp
• Consider

    Γ̂_p = SSE(p)/σ² − n + 2p

  (SSE(p) is the residual sum of squares for the p-parameter model, i.e., the reduced model.)

•   E[SSE(p)] = σ²(n − p) + β_r^T X_r^T (I − H_p) X_r β_r

  so

    Γ_p = E[Γ̂_p] = p + (1/σ²) β_r^T X_r^T (I − H_p) X_r β_r.

• ⇒ Choose p so that all components of β_p are nonzero and β_r = 0.
Mallows’ Cp (Cont’d)
• Take p to be the smallest value for which Γ_p = p.

• In practice, take p to be the smallest value for which

    Γ̂_p ≈ p,

  since we have just shown that Γ̂_p is an unbiased estimator for Γ_p.
Mallows’ Cp (Cont’d)
• We need an estimator for σ².

• If the full k-variable model contains all important regressors (plus, possibly, some extras), then

    E[MSE(k + 1)] = σ²,

  so we can use σ̂² = MSE(k + 1).

• ⇒ We estimate Γ̂_p by

    C_p = SSE(p)/MSE(k + 1) − n + 2p.
Mallows’ Cp (Cont’d)
• Example: An experiment involving 30 measurements on 5 regressors yields an MSE of 20.

  1. The MSE for the best 2-parameter model is 40. Find C_2 for this model. (30)
  2. The MSE for the best 3-parameter model is 30. Find C_3 for this model. (16.5)
  3. The MSE for the best 4-parameter model is 25. Find C_4 for this model. (10.5)
  4. The MSE for the best 5-parameter model is 19. Find C_5 for this model. (3.75)
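The answers above follow directly from C_p = SSE(p)/σ̂² − n + 2p with SSE(p) = MSE(p)·(n − p) and σ̂² = 20. A quick arithmetic check (a Python sketch, just to verify the numbers):

```python
def mallows_cp(mse_p, sigma2_hat, n, p):
    # C_p = SSE(p)/sigma2_hat - n + 2p, with SSE(p) = MSE(p) * (n - p)
    sse_p = mse_p * (n - p)
    return sse_p / sigma2_hat - n + 2 * p

n, sigma2_hat = 30, 20.0   # 30 measurements; full-model MSE estimates sigma^2
answers = [mallows_cp(mse, sigma2_hat, n, p)
           for p, mse in [(2, 40.0), (3, 30.0), (4, 25.0), (5, 19.0)]]
# answers == [30.0, 16.5, 10.5, 3.75]
```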
Example: cement data
> library(MPV)      # this contains the cement data set
> library(leaps)    # this contains the variable selection routines
> data(cement)
> x <- cement[,-1]  # this removes the y variable from
                    # the cement data set
> y <- cement[,1]   # this is the y variable
> cement.leaps <- leaps(x,y)
> attach(cement.leaps)  # this allows us to access
                        # the variables in cement.leaps
                        # directly
                        # e.g., Mallows Cp is one of the
                        # variables calculated
Example (Cont’d)
> plot(size, Cp)      # size = no. of parameters
> abline(0,1)         # reference line to see where Cp = p
> identify(size, Cp)  # which models are close to Cp = p?
                      # Click on the plotted points near the
                      # reference line.
# [1] 6 5 14 15
Example (Cont’d)
• Cp plot:

[Figure: Cp versus size (number of parameters), with the reference line Cp = p. Models 5, 6, 14, and 15 lie closest to the line; the remaining models have Cp values ranging up to roughly 300.]
Example (Cont’d)
• Identifying variables in each model:
> which[6,]  # which variables are included in model 6?
#     1     2     3     4
#  TRUE FALSE FALSE  TRUE
# Variables 1 and 4 are in the model.
> cement.6 <- lm(y ~ as.matrix(x[,which[6,]]))
> summary(cement.6)
Example (Cont’d)
Coefficients:
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                   103.09738    2.12398   48.54 3.32e-13
as.matrix(x[, which[6, ]])x1    1.43996    0.13842   10.40 1.11e-06
as.matrix(x[, which[6, ]])x4   -0.61395    0.04864  -12.62 1.81e-07
---
Residual standard error: 2.734 on 10 degrees of freedom
Multiple R-Squared: 0.9725, Adjusted R-squared: 0.967
F-statistic: 176.6 on 2 and 10 DF, p-value: 1.581e-08

> PRESS(cement.6)
[1] 121.2244
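PRESS (computed here by the MPV package) is the sum of squared leave-one-out prediction errors; a convenient identity computes it from a single fit as Σ (e_i/(1 − h_ii))², where the h_ii are the diagonal entries of the hat matrix. A sketch verifying the identity on made-up data (in Python, not the course's R):

```python
import numpy as np

# Made-up data: simple linear regression with an intercept
xvals = np.array([1., 2., 3., 4., 5., 6.])
y = np.array([2., 1., 4., 3., 6., 5.])
X = np.column_stack([np.ones_like(xvals), xvals])

# PRESS from one fit: sum of (e_i / (1 - h_ii))^2
H = X @ np.linalg.inv(X.T @ X) @ X.T
e = y - H @ y
press = np.sum((e / (1 - np.diag(H))) ** 2)

# The same quantity by literally refitting without each point in turn
loo = 0.0
for i in range(len(y)):
    keep = np.arange(len(y)) != i
    b = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])
    loo += (y[i] - X[i] @ b) ** 2
```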
Example (Cont’d)
What about model 14?
> cement.14 <- lm(y ~ as.matrix(x[,which[14,]]))
> summary(cement.14)
Coefficients:
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                    203.6420    20.6478   9.863 4.01e-06
as.matrix(x[, which[14, ]])x2   -0.9234     0.2619  -3.525 0.006462
as.matrix(x[, which[14, ]])x3   -1.4480     0.1471  -9.846 4.07e-06
as.matrix(x[, which[14, ]])x4   -1.5570     0.2413  -6.454 0.000118
---
Residual standard error: 2.864 on 9 degrees of freedom
Multiple R-Squared: 0.9728, Adjusted R-squared: 0.9638
F-statistic: 107.4 on 3 and 9 DF, p-value: 2.302e-07

> PRESS(cement.14)
[1] 146.8527  # higher than Model 6, so choose Model 6
Example (Cont’d)
What about Model 15? (This is the full model.)
> cement.15 <- lm(y ~ as.matrix(x[,which[15,]]))
> summary(cement.15)
Coefficients:
                              Estimate Std. Error t value Pr(>|t|)
(Intercept)                    62.4054    70.0710   0.891   0.3991
as.matrix(x[, which[15, ]])x1   1.5511     0.7448   2.083   0.0708
as.matrix(x[, which[15, ]])x2   0.5102     0.7238   0.705   0.5009
as.matrix(x[, which[15, ]])x3   0.1019     0.7547   0.135   0.8959
as.matrix(x[, which[15, ]])x4  -0.1441     0.7091  -0.203   0.8441
---
Residual standard error: 2.446 on 8 degrees of freedom
Multiple R-Squared: 0.9824, Adjusted R-squared: 0.9736
F-statistic: 111.5 on 4 and 8 DF, p-value: 4.756e-07

> PRESS(cement.15)
[1] 110.3466       # this is lower than for Model 6.
                   # Is this the best model?
> plot(cement.15)  # check the residual plots
                   # to ensure that everything is OK
Example (Cont’d): Checking Model 15
[Figure: the four residual diagnostic plots for Model 15: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage. Observations 6, 8, and 13 are flagged in the first three panels; observations 3, 8, and 11 are flagged in the leverage plot, with Cook's distance contours at 0.5 and 1.]
Are the coefficient standard errors inflated?
• Check variance inflation factors:

> vif(cement.15)
as.matrix(x[, which[15, ]])x1  as.matrix(x[, which[15, ]])x2
                       38.496                        254.420
as.matrix(x[, which[15, ]])x3  as.matrix(x[, which[15, ]])x4
                       46.868                        282.510

• Fairly severe multicollinearity is inflating the standard errors. This might explain why the individual significance tests are giving large p-values.
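Recall that VIF_j = 1/(1 − R_j²), where R_j² comes from regressing the j-th predictor on all the others. A small sketch of this computation (Python, with made-up predictors) showing that orthogonal predictors give VIF = 1 while nearly collinear ones inflate it:

```python
import numpy as np

def vif(x, j):
    # Regress column j on the remaining columns (plus an intercept)
    # and return 1 / (1 - R_j^2).
    n = x.shape[0]
    others = np.delete(x, j, axis=1)
    X = np.column_stack([np.ones(n), others])
    b, *_ = np.linalg.lstsq(X, x[:, j], rcond=None)
    resid = x[:, j] - X @ b
    sst = np.sum((x[:, j] - x[:, j].mean()) ** 2)
    r2 = 1 - resid @ resid / sst
    return 1 / (1 - r2)

# Orthogonal predictors: no inflation, VIF = 1
x_orth = np.array([[1., 1.], [-1., 1.], [1., -1.], [-1., -1.]])
# Nearly collinear predictors: VIF blows up
x_coll = np.array([[1., 1.1], [2., 2.1], [3., 2.9], [4., 4.0]])
```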
7.3 Variable Selection Methods
• Forward Selection

  * Fit models with each X variable. Choose the variable which gives the highest |t| statistic (if the p-value < 0.5; otherwise, stop). Suppose the variable Xi was chosen.

  * Fit models which include Xi and each of the other X variables. Add the X variable with the largest |t| statistic (if the p-value < 0.5; otherwise, stop).

  * Continue adding variables in this manner, until the first time there are no p-values less than 0.5. Then stop.
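The steps above can be sketched as follows (in Python/NumPy rather than R, and with an |t| entry threshold of 2.0 standing in for the p-value rule, purely to keep the sketch self-contained):

```python
import numpy as np

def slope_t(X, y):
    """OLS t statistic for the last column of design matrix X."""
    n, p = X.shape
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta[-1] / se[-1]

def forward_select(x, y, t_enter=2.0):
    """Greedy forward selection: at each step, add the candidate with
    the largest |t|; stop when no candidate reaches t_enter."""
    n, k = x.shape
    selected, remaining = [], list(range(k))
    while remaining:
        ts = {j: abs(slope_t(np.column_stack([np.ones(n)]
                             + [x[:, c] for c in selected + [j]]), y))
              for j in remaining}
        best = max(ts, key=ts.get)
        if ts[best] < t_enter:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Made-up orthogonal design: y depends on columns 0 and 2 only
x0 = np.array([1., 1, 1, 1, -1, -1, -1, -1])
x1 = np.array([1., 1, -1, -1, 1, 1, -1, -1])
x2 = np.array([1., -1, 1, -1, 1, -1, 1, -1])
x3 = x0 * x1 * x2
e = 0.1 * np.array([1., 1, -1, -1, -1, -1, 1, 1])  # orthogonal to all columns
x = np.column_stack([x0, x1, x2, x3])
y = 3 * x0 + 2 * x2 + e
# forward_select(x, y) picks column 0 first (largest |t|), then 2, then stops
```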
7.3 Variable Selection Methods (Cont’d)
• Backward Selection

  * Begin with all X variables.

  * Eliminate the variable with the largest p-value.

  * Continue eliminating variables until all remaining variables have p-values < 0.1.

• Stepwise

  * Proceed as in Forward Selection, but at each step, remove any variable whose p-value is larger than 0.15.

• All Possible Regressions (Exhaustive Search)
Forward Selection Example
> library(leaps)  # has the regsubsets function
> seal.fwd <- regsubsets(age ~ ., data=cfseal1, method="forward")
> summary(seal.fwd)$cp
[1] 4.304680 2.614062 3.226201 4.421788 6.000000

> # try 2nd model:
> subset <- summary(seal.fwd)$which[2,]
> predictors <- cfseal1[,subset]
> colnames(predictors) <- names(cfseal1)[subset]
> seal.lm <- lm(age ~ ., data=predictors)
> coef(summary(seal.lm))
               Estimate  Std. Error    t value     Pr(>|t|)
(Intercept)  1.95540114 5.802631893  0.3369852 0.7387335563
liver       -0.01829706 0.009455956 -1.9349773 0.0635329824
kidney       0.21497898 0.056810674  3.7841301 0.0007812301
Forward Selection Example (Cont’d)
> library(MPV)
> PRESS(seal.lm)
[1] 8307.55
> # eliminate intercept:
> seal.lm0 <- update(seal.lm, ~.-1)
> coef(summary(seal.lm0))
           Estimate  Std. Error   t value     Pr(>|t|)
liver   -0.02040852 0.006969063 -2.928445 6.699975e-03
kidney   0.22929965 0.037101251  6.180375 1.127961e-06
> PRESS(seal.lm0)
[1] 7731.016
Backward Selection Example
> seal.bwd <- regsubsets(age ~ ., data = cfseal1, method = "backward")
> summary(seal.bwd)$cp
[1] 4.304680 2.614062 3.226201 4.421788 6.000000
# same as forward selection

> # Is the 3rd model better?
> subset <- summary(seal.fwd)$which[3,]
> predictors <- cfseal1[,subset]
> colnames(predictors) <- names(cfseal1)[subset]
> seal.lm3 <- lm(age ~ ., data=predictors)
> coef(summary(seal.lm3))
               Estimate Std. Error     t value    Pr(>|t|)
(Intercept)  0.50803199 5.88274533  0.08635968 0.931842141
liver       -0.02466118 0.01078548 -2.28651698 0.030613633
stomach      0.03190020 0.02667225  1.19600683 0.242487709
kidney       0.19359125 0.05913202  3.27388193 0.002997971
> PRESS(seal.lm3)
[1] 9083.876  # 2nd model seems better
• Exhaustive search example:

> seal.ex <- regsubsets(age ~ ., data = cfseal1, method = "exhaustive")