Ch. 7 Model Selection
• 7.1 Introduction: Model Misspecification
• 7.2 Choosing the Number of Predictor Variables
• 7.3 Variable Selection Methods
7.1 Model Misspecification
• Consider linear regression of a response y on k predictor variables.
• Under the full model (all predictors)
    β̂ = (X^T X)^{-1} X^T y

and

    MSE = y^T (I − H) y / (n − k − 1),

where H = X (X^T X)^{-1} X^T is the hat matrix.
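As a quick numerical check of these formulas, here is a small sketch (in Python/NumPy rather than this course's R, purely for illustration) that computes β̂ and the MSE directly from the matrix expressions, using made-up noise-free data so the answers are easy to verify:

```python
import numpy as np

# Made-up data: y is exactly 1 + 2x, so beta-hat should be (1, 2)
# and the MSE should be 0 (k = 1 predictor, n = 5 observations).
xvals = np.array([0., 1., 2., 3., 4.])
y = 1.0 + 2.0 * xvals
X = np.column_stack([np.ones_like(xvals), xvals])  # design matrix with intercept
n, k = len(y), 1

# beta-hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# MSE = y^T (I - H) y / (n - k - 1), with H = X (X^T X)^{-1} X^T
H = X @ np.linalg.inv(X.T @ X) @ X.T
mse = y @ (np.eye(n) - H) @ y / (n - k - 1)
```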
7.1 Model Misspecification (Cont’d)
• Under the reduced model with q parameters (q < k + 1):

    β̂_q = (X_q^T X_q)^{-1} X_q^T y

and

    σ̂²_q = y^T (I − H_q) y / (n − q),

where

    H_q = X_q (X_q^T X_q)^{-1} X_q^T.

• Suppose the reduced model is correct, and we use the full model. Then we have over-parametrized the problem.
7.1 Model Misspecification (Cont’d): Wrongly using the Full Model
• We have:

  1. E[β̂] = β (some of the parameters are 0) ⇒ unbiased

  2. V(β̂) = σ²(X^T X)^{-1}

• By including unnecessary variables, we risk introducing unnecessary multicollinearity ⇒ inflated standard errors ⇒ imprecise estimates.

• Prediction is affected too: it remains unbiased, but is less precise than under the correct model.
Model Misspecification (Cont’d): Wrongly using the Reduced Model
• Suppose the full model is correct, and we use the reduced model. Then

    E[β̂_q] = β_q + (X_q^T X_q)^{-1} X_q^T X_r β_r ≠ β_q

  ⇒ biased.

• A = (X_q^T X_q)^{-1} X_q^T X_r is the alias matrix.

•   V(β̂_q) = σ²(X_q^T X_q)^{-1}
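A tiny numerical illustration of the alias matrix (a sketch in Python/NumPy with made-up matrices, not part of the course code): take the reduced design to be an intercept only, and omit a single linear term whose true coefficient is 2.

```python
import numpy as np

# Reduced design X_q: intercept only (q = 1), n = 4 observations
Xq = np.ones((4, 1))
# Omitted column X_r and its (nonzero) coefficient beta_r
Xr = np.array([[1.], [2.], [3.], [4.]])
beta_r = np.array([2.0])

# Alias matrix A = (X_q^T X_q)^{-1} X_q^T X_r
A = np.linalg.inv(Xq.T @ Xq) @ Xq.T @ Xr   # here A is the mean of X_r, 2.5

# Bias of beta-hat_q: E[beta-hat_q] - beta_q = A beta_r
bias = A @ beta_r                          # 2.5 * 2 = 5
```

The fitted intercept is pulled up by 5: it absorbs the average effect of the omitted variable, exactly as the formula predicts.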
Model Misspecification (Cont’d): Wrongly using the Reduced Model
• Because of bias, we need to use the matrix mean squared error:

    MSE = E[(β̂_q − β_q)(β̂_q − β_q)^T]
        = V(β̂_q) + Bias(β̂_q) Bias(β̂_q)^T
        = σ²(X_q^T X_q)^{-1} + (A β_r)(A β_r)^T

• Also,

    E[σ̂²_q] = σ² + (1/(n − q)) β_r^T X_r^T (I − H_q) X_r β_r

  ⇒ σ̂²_q over-estimates σ².
Summary
• Loss of precision occurs when too many variables are included in a model.
• Bias results when too few variables are included in a model.
7.2 How Many Predictor Variables Should be Used?
1. R² = SSR(p)/SST = 1 − SSE(p)/SST:
   the proportion of variation in the response explained by the (p-variable) regression.
   Problem: this always increases with p.

2. Adjusted R² or residual mean square:

    R²_adj = 1 − ((n − 1)/(n − p))(1 − R²_p)

   This does not always increase with p, and maximizing it is equivalent to minimizing

    MSE(p) = SSE(p)/(n − p).

   This often decreases as variables are added, but sometimes begins to increase when too many are added. One might want to choose p where this first starts to level off. Better yet, one might use ...
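To see the contrast numerically, here is a sketch (Python, with made-up SSE values): adding a predictor that lowers SSE only slightly always raises R², but lowers R²_adj.

```python
def r2(sse, sst):
    # R^2 = 1 - SSE/SST
    return 1 - sse / sst

def r2_adj(sse, sst, n, p):
    # R^2_adj = 1 - ((n - 1)/(n - p)) (1 - R^2_p)
    return 1 - ((n - 1) / (n - p)) * (1 - r2(sse, sst))

n, sst = 30, 100.0
# p = 2 model: SSE = 40;  p = 3 model: SSE = 39 (a tiny improvement)
r2_small, r2_big = r2(40.0, sst), r2(39.0, sst)
adj_small, adj_big = r2_adj(40.0, sst, n, 2), r2_adj(39.0, sst, n, 3)
# R^2 necessarily goes up, but the adjusted version goes down
```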
Mallows’ Cp
• Consider

    Γ̂_p = SSE(p)/σ² − n + 2p

  (SSE(p) is the residual sum of squares for the p-parameter model, i.e., the reduced model.)

•   E[SSE(p)] = σ²(n − p) + β_r^T X_r^T (I − H_p) X_r β_r

  so

    Γ_p = E[Γ̂_p] = p + (1/σ²) β_r^T X_r^T (I − H_p) X_r β_r.

• ⇒ Choose p so that all components of β_p are nonzero and β_r = 0.
Mallows’ Cp (Cont’d)
• Take p to be the smallest value for which Γ_p = p.

• In practice, take p to be the smallest value for which

    Γ̂_p ≈ p,

  since we have just shown that Γ̂_p is an unbiased estimator for Γ_p.
Mallows’ Cp (Cont’d)
• We need an estimator for σ².

• If the full k-variable model contains all important regressors (plus, possibly, some extras), then

    E[MSE(k + 1)] = σ²,

  so we can use σ̂² = MSE(k + 1).

• ⇒ We estimate Γ̂_p by

    C_p = SSE(p)/MSE(k + 1) − n + 2p.
Mallows’ Cp (Cont’d)
• Example: An experiment involving 30 measurements on 5 regressors yields an MSE of 20.

  1. The MSE for the best 2-parameter model is 40. Find C_2 for this model. (30)
  2. The MSE for the best 3-parameter model is 30. Find C_3 for this model. (16.5)
  3. The MSE for the best 4-parameter model is 25. Find C_4 for this model. (10.5)
  4. The MSE for the best 5-parameter model is 19. Find C_5 for this model. (3.75)
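The answers above follow directly from C_p = SSE(p)/σ̂² − n + 2p with SSE(p) = MSE(p)·(n − p) and σ̂² = 20. A quick arithmetic check (a Python sketch, just to verify the numbers):

```python
def mallows_cp(mse_p, sigma2_hat, n, p):
    # C_p = SSE(p)/sigma2_hat - n + 2p, with SSE(p) = MSE(p) * (n - p)
    sse_p = mse_p * (n - p)
    return sse_p / sigma2_hat - n + 2 * p

n, sigma2_hat = 30, 20.0   # 30 measurements; full-model MSE estimates sigma^2
answers = [mallows_cp(mse, sigma2_hat, n, p)
           for p, mse in [(2, 40.0), (3, 30.0), (4, 25.0), (5, 19.0)]]
# answers == [30.0, 16.5, 10.5, 3.75]
```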
Example: cement data
> library(MPV)      # this contains the cement data set
> library(leaps)    # this contains the variable selection routines
> data(cement)
> x <- cement[,-1]  # this removes the y variable from
                    # the cement data set
> y <- cement[,1]   # this is the y variable
> cement.leaps <- leaps(x,y)
> attach(cement.leaps)  # this allows us to access
                        # the variables in cement.leaps
                        # directly
                        # e.g., Mallows Cp is one of the
                        # variables calculated
Example (Cont’d)
> plot(size, Cp)      # size = no. of parameters
> abline(0,1)         # reference line to see where Cp = p
> identify(size, Cp)  # which models are close to Cp = p?
                      # Click on the plotted points near the
                      # reference line.
# [1] 6 5 14 15
Example (Cont’d)
• Cp plot:

[Figure: Cp versus size (number of parameters), with the reference line Cp = p. Models 5, 6, 14, and 15 lie closest to the line; the remaining models have Cp values ranging up to roughly 300.]
Example (Cont’d)
• Identifying variables in each model:
> which[6,]  # which variables are included in model 6?
#     1     2     3     4
#  TRUE FALSE FALSE  TRUE
# Variables 1 and 4 are in the model.
> cement.6 <- lm(y ~ as.matrix(x[,which[6,]]))
> summary(cement.6)
Example (Cont’d)
Coefficients:
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                   103.09738    2.12398   48.54 3.32e-13
as.matrix(x[, which[6, ]])x1    1.43996    0.13842   10.40 1.11e-06
as.matrix(x[, which[6, ]])x4   -0.61395    0.04864  -12.62 1.81e-07
---
Residual standard error: 2.734 on 10 degrees of freedom
Multiple R-Squared: 0.9725, Adjusted R-squared: 0.967
F-statistic: 176.6 on 2 and 10 DF, p-value: 1.581e-08

> PRESS(cement.6)
[1] 121.2244
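PRESS (computed here by the MPV package) is the sum of squared leave-one-out prediction errors; a convenient identity computes it from a single fit as Σ (e_i/(1 − h_ii))², where the h_ii are the diagonal entries of the hat matrix. A sketch verifying the identity on made-up data (in Python, not the course's R):

```python
import numpy as np

# Made-up data: simple linear regression with an intercept
xvals = np.array([1., 2., 3., 4., 5., 6.])
y = np.array([2., 1., 4., 3., 6., 5.])
X = np.column_stack([np.ones_like(xvals), xvals])

# PRESS from one fit: sum of (e_i / (1 - h_ii))^2
H = X @ np.linalg.inv(X.T @ X) @ X.T
e = y - H @ y
press = np.sum((e / (1 - np.diag(H))) ** 2)

# The same quantity by literally refitting without each point in turn
loo = 0.0
for i in range(len(y)):
    keep = np.arange(len(y)) != i
    b = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])
    loo += (y[i] - X[i] @ b) ** 2
```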
Example (Cont’d)
What about model 14?
> cement.14 <- lm(y ~ as.matrix(x[,which[14,]]))
> summary(cement.14)
Coefficients:
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                    203.6420    20.6478   9.863 4.01e-06
as.matrix(x[, which[14, ]])x2   -0.9234     0.2619  -3.525 0.006462
as.matrix(x[, which[14, ]])x3   -1.4480     0.1471  -9.846 4.07e-06
as.matrix(x[, which[14, ]])x4   -1.5570     0.2413  -6.454 0.000118
---
Residual standard error: 2.864 on 9 degrees of freedom
Multiple R-Squared: 0.9728, Adjusted R-squared: 0.9638
F-statistic: 107.4 on 3 and 9 DF, p-value: 2.302e-07

> PRESS(cement.14)
[1] 146.8527  # higher than Model 6, so choose Model 6
Example (Cont’d)
What about Model 15? (This is the full model.)
> cement.15 <- lm(y ~ as.matrix(x[,which[15,]]))
> summary(cement.15)
Coefficients:
                              Estimate Std. Error t value Pr(>|t|)
(Intercept)                    62.4054    70.0710   0.891   0.3991
as.matrix(x[, which[15, ]])x1   1.5511     0.7448   2.083   0.0708
as.matrix(x[, which[15, ]])x2   0.5102     0.7238   0.705   0.5009
as.matrix(x[, which[15, ]])x3   0.1019     0.7547   0.135   0.8959
as.matrix(x[, which[15, ]])x4  -0.1441     0.7091  -0.203   0.8441
---
Residual standard error: 2.446 on 8 degrees of freedom
Multiple R-Squared: 0.9824, Adjusted R-squared: 0.9736
F-statistic: 111.5 on 4 and 8 DF, p-value: 4.756e-07

> PRESS(cement.15)
[1] 110.3466       # this is lower than for Model 6.
                   # Is this the best model?
> plot(cement.15)  # check the residual plots
                   # to ensure that everything is OK
Example (Cont’d): Checking Model 15
[Figure: the four residual diagnostic plots for Model 15: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage. Observations 6, 8, and 13 are flagged in the first three panels; observations 3, 8, and 11 are flagged in the leverage plot, with Cook's distance contours at 0.5 and 1.]
Are the coefficient standard errors inflated?
• Check variance inflation factors:

> vif(cement.15)
as.matrix(x[, which[15, ]])x1  as.matrix(x[, which[15, ]])x2
                       38.496                        254.420
as.matrix(x[, which[15, ]])x3  as.matrix(x[, which[15, ]])x4
                       46.868                        282.510

• Fairly severe multicollinearity is inflating the standard errors. This might explain why the individual significance tests are giving large p-values.
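Recall that VIF_j = 1/(1 − R_j²), where R_j² comes from regressing the j-th predictor on all the others. A small sketch of this computation (Python, with made-up predictors) showing that orthogonal predictors give VIF = 1 while nearly collinear ones inflate it:

```python
import numpy as np

def vif(x, j):
    # Regress column j on the remaining columns (plus an intercept)
    # and return 1 / (1 - R_j^2).
    n = x.shape[0]
    others = np.delete(x, j, axis=1)
    X = np.column_stack([np.ones(n), others])
    b, *_ = np.linalg.lstsq(X, x[:, j], rcond=None)
    resid = x[:, j] - X @ b
    sst = np.sum((x[:, j] - x[:, j].mean()) ** 2)
    r2 = 1 - resid @ resid / sst
    return 1 / (1 - r2)

# Orthogonal predictors: no inflation, VIF = 1
x_orth = np.array([[1., 1.], [-1., 1.], [1., -1.], [-1., -1.]])
# Nearly collinear predictors: VIF blows up
x_coll = np.array([[1., 1.1], [2., 2.1], [3., 2.9], [4., 4.0]])
```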
7.3 Variable Selection Methods
• Forward Selection

  * Fit models with each X variable. Choose the variable which gives the highest |t| statistic (if the p-value < 0.5; otherwise, stop). Suppose the variable Xi was chosen.

  * Fit models which include Xi and each of the other X variables. Add the X variable with the largest |t| statistic (if the p-value < 0.5; otherwise, stop).

  * Continue adding variables in this manner, until the first time there are no p-values less than 0.5. Then stop.
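The steps above can be sketched as follows (in Python/NumPy rather than R, and with an |t| entry threshold of 2.0 standing in for the p-value rule, purely to keep the sketch self-contained):

```python
import numpy as np

def slope_t(X, y):
    """OLS t statistic for the last column of design matrix X."""
    n, p = X.shape
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta[-1] / se[-1]

def forward_select(x, y, t_enter=2.0):
    """Greedy forward selection: at each step, add the candidate with
    the largest |t|; stop when no candidate reaches t_enter."""
    n, k = x.shape
    selected, remaining = [], list(range(k))
    while remaining:
        ts = {j: abs(slope_t(np.column_stack([np.ones(n)]
                             + [x[:, c] for c in selected + [j]]), y))
              for j in remaining}
        best = max(ts, key=ts.get)
        if ts[best] < t_enter:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Made-up orthogonal design: y depends on columns 0 and 2 only
x0 = np.array([1., 1, 1, 1, -1, -1, -1, -1])
x1 = np.array([1., 1, -1, -1, 1, 1, -1, -1])
x2 = np.array([1., -1, 1, -1, 1, -1, 1, -1])
x3 = x0 * x1 * x2
e = 0.1 * np.array([1., 1, -1, -1, -1, -1, 1, 1])  # orthogonal to all columns
x = np.column_stack([x0, x1, x2, x3])
y = 3 * x0 + 2 * x2 + e
# forward_select(x, y) picks column 0 first (largest |t|), then 2, then stops
```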
7.3 Variable Selection Methods (Cont’d)
• Backward Selection

  * Begin with all X variables.

  * Eliminate the variable with the largest p-value.

  * Continue eliminating variables until all remaining variables have p-values < 0.1.

• Stepwise

  * Proceed as in Forward Selection, but at each step, remove any variable whose p-value is larger than 0.15.

• All Possible Regressions (Exhaustive Search)
Forward Selection Example
> library(leaps)  # has the regsubsets function
> seal.fwd <- regsubsets(age ~ ., data=cfseal1, method="forward")
> summary(seal.fwd)$cp
[1] 4.304680 2.614062 3.226201 4.421788 6.000000

> # try 2nd model:
> subset <- summary(seal.fwd)$which[2,]
> predictors <- cfseal1[,subset]
> colnames(predictors) <- names(cfseal1)[subset]
> seal.lm <- lm(age ~ ., data=predictors)
> coef(summary(seal.lm))
               Estimate  Std. Error    t value     Pr(>|t|)
(Intercept)  1.95540114 5.802631893  0.3369852 0.7387335563
liver       -0.01829706 0.009455956 -1.9349773 0.0635329824
kidney       0.21497898 0.056810674  3.7841301 0.0007812301
Forward Selection Example (Cont’d)
> library(MPV)
> PRESS(seal.lm)
[1] 8307.55
> # eliminate intercept:
> seal.lm0 <- update(seal.lm, ~.-1)
> coef(summary(seal.lm0))
           Estimate  Std. Error   t value     Pr(>|t|)
liver   -0.02040852 0.006969063 -2.928445 6.699975e-03
kidney   0.22929965 0.037101251  6.180375 1.127961e-06
> PRESS(seal.lm0)
[1] 7731.016
Backward Selection Example
> seal.bwd <- regsubsets(age ~ ., data = cfseal1, method = "backward")
> summary(seal.bwd)$cp
[1] 4.304680 2.614062 3.226201 4.421788 6.000000
# same as forward selection

> # Is the 3rd model better?
> subset <- summary(seal.fwd)$which[3,]
> predictors <- cfseal1[,subset]
> colnames(predictors) <- names(cfseal1)[subset]
> seal.lm3 <- lm(age ~ ., data=predictors)
> coef(summary(seal.lm3))
               Estimate Std. Error     t value    Pr(>|t|)
(Intercept)  0.50803199 5.88274533  0.08635968 0.931842141
liver       -0.02466118 0.01078548 -2.28651698 0.030613633
stomach      0.03190020 0.02667225  1.19600683 0.242487709
kidney       0.19359125 0.05913202  3.27388193 0.002997971
> PRESS(seal.lm3)
[1] 9083.876  # 2nd model seems better
• Exhaustive search example:

> seal.ex <- regsubsets(age ~ ., data = cfseal1, method = "exhaustive")