Review of Univariate Linear Regression BMTRY 726 3/4/14.
Review of Univariate Linear Regression
BMTRY 726, 3/4/14
Regression Analysis
We are interested in predicting values of one or more responses from a set of predictors.
Regression analysis is an extension of what we discussed with ANOVA and MANOVA.
We now allow for the inclusion of continuous predictors in place of (or in addition to) the treatment indicators used in MANOVA.
Univariate Regression Analysis
A univariate regression model states that the response Y is composed of a mean that depends on a set of independent predictors z_i plus random error \varepsilon_i:

Y_i = \beta_0 + \beta_1 z_{i1} + \cdots + \beta_r z_{ir} + \varepsilon_i, for i = 1, \ldots, n

or, in matrix form, Y = Z\beta + \varepsilon, where Z is the n \times (r + 1) design matrix with rows (1, z_{i1}, \ldots, z_{ir}).

Model assumptions:
1. E(\varepsilon) = 0
2. Var(\varepsilon) = \sigma^2 I
Least Squares Estimation
We use the method of least squares to estimate our model parameters. The estimated coefficients and errors are:

\hat{\beta} = (Z'Z)^{-1} Z'y
\hat{\varepsilon} = y - \hat{y} = (I - Z(Z'Z)^{-1}Z')y = (I - H)y, where H = Z(Z'Z)^{-1}Z'

The predicted value of the outcome is:

\hat{y} = Z\hat{\beta} = Hy
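The least-squares formulas above can be sketched numerically. This is a minimal illustration on simulated data; the design, coefficients, and noise level are my own choices, not from the slides.

```python
import numpy as np

# Simulate a toy design: intercept column plus r continuous predictors.
rng = np.random.default_rng(0)
n, r = 50, 2
Z = np.column_stack([np.ones(n), rng.normal(size=(n, r))])
beta_true = np.array([1.0, 2.0, -0.5])
y = Z @ beta_true + rng.normal(scale=0.1, size=n)

beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)  # (Z'Z)^{-1} Z'y
H = Z @ np.linalg.solve(Z.T @ Z, Z.T)         # hat matrix Z(Z'Z)^{-1}Z'
y_hat = H @ y                                 # fitted values, equals Z beta_hat
resid = y - y_hat                             # residuals (I - H)y
```

Solving the normal equations directly is fine for a sketch; in practice a QR- or SVD-based solver (e.g. `np.linalg.lstsq`) is numerically safer.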
Least Squares Estimation
We estimate the variance using the residuals:

\hat{\varepsilon} = (I - H)y
\hat{\varepsilon}'\hat{\varepsilon} = y'(I - H)'(I - H)y = y'(I - H)y

since I - H is symmetric and idempotent. Note that (I - H)Z = Z - Z(Z'Z)^{-1}Z'Z = 0, so \hat{\varepsilon} = (I - H)(Z\beta + \varepsilon) = (I - H)\varepsilon is orthogonal to the columns of Z. Then

E(\hat{\varepsilon}'\hat{\varepsilon}) = \sigma^2 tr(I - H) = \sigma^2 (n - rank(Z)) = \sigma^2 (n - r - 1)

so s^2 = \hat{\varepsilon}'\hat{\varepsilon} / (n - r - 1) is an unbiased estimate of \sigma^2.
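The facts used in this derivation can be checked numerically on toy data of my own choosing: (I - H)Z = 0, tr(I - H) = n - r - 1, and the variance estimate built from the residuals.

```python
import numpy as np

# Toy design with an intercept and r predictors.
rng = np.random.default_rng(1)
n, r = 40, 3
Z = np.column_stack([np.ones(n), rng.normal(size=(n, r))])
y = Z @ np.array([1.0, 0.5, -1.0, 2.0]) + rng.normal(size=n)

H = Z @ np.linalg.solve(Z.T @ Z, Z.T)
M = np.eye(n) - H                    # I - H: symmetric, idempotent, annihilates Z
resid = M @ y
s2 = resid @ resid / (n - r - 1)     # unbiased estimate of sigma^2
```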
Least Squares Estimation
Properties of \hat{\beta}:
1. It is unbiased: E(\hat{\beta}) = (Z'Z)^{-1}Z'Z\beta = \beta
2. The estimated variance is: Cov(\hat{\beta}) = (Z'Z)^{-1}Z' (\sigma^2 I) Z(Z'Z)^{-1} = \sigma^2 (Z'Z)^{-1}
3. The distribution (from 1 and 2) is: \hat{\beta} ~ N(\beta, \sigma^2 (Z'Z)^{-1})
4. It is also the BLUE (best linear unbiased estimator).
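Property 1 can be illustrated by a small Monte Carlo sketch: with a fixed toy design (my own simulated numbers), the average of \hat{\beta} over repeated error draws should be close to the true \beta.

```python
import numpy as np

# Fixed toy design: intercept plus one predictor.
rng = np.random.default_rng(2)
n = 30
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([2.0, -1.0])
A = np.linalg.solve(Z.T @ Z, Z.T)   # (Z'Z)^{-1} Z', reused across draws

# Repeatedly redraw the errors and re-estimate beta.
est = np.array([A @ (Z @ beta + rng.normal(size=n)) for _ in range(5000)])
mean_est = est.mean(axis=0)         # approaches beta as the number of draws grows
```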
Geometry of Least Squares

\hat{y} = Z\hat{\beta} = Hy, where H = Z(Z'Z)^{-1}Z'

H projects y onto the column space of Z, and the residual

\hat{\varepsilon} = y - \hat{y} = (I - H)y

is orthogonal to that space.
LRT for individual \beta_i's
First we may test whether any subset of predictors affects the response:

H_0: \beta_{q+1} = \beta_{q+2} = \cdots = \beta_r = 0, or equivalently H_0: \beta_{(2)} = 0

Partition Z = [Z_1  Z_2] and \beta = (\beta_{(1)}', \beta_{(2)}')', so the full model is

Y = Z\beta + \varepsilon = Z_1\beta_{(1)} + Z_2\beta_{(2)} + \varepsilon

and under the null Y = Z_1\beta_{(1)} + \varepsilon.
The LRT is based on the difference in sums of squares between the full and null models: we compare the fit Z_1\hat{\beta}_{(1)} vs. Z\hat{\beta}, with residuals \hat{\varepsilon}_1 and \hat{\varepsilon}.
LRT for individual \beta_i's
Difference in SS between the full and null models (the "extra" sum of squares):

SS_{res,reduced} - SS_{res,full} = (y - Z_1\hat{\beta}_{(1)})'(y - Z_1\hat{\beta}_{(1)}) - (y - Z\hat{\beta})'(y - Z\hat{\beta})

We know SS_{res,full}/\sigma^2 ~ \chi^2_{n-r-1} and (SS_{res,reduced} - SS_{res,full})/\sigma^2 ~ \chi^2_{r-q}. But \sigma^2 is unknown, so we can't use these directly. We could estimate \sigma^2 with s^2 = SS_{res,full}/(n - r - 1). If we instead consider the ratio of the two quantities, each divided by its degrees of freedom, the unknown \sigma^2 cancels:

F = [(SS_{res,reduced} - SS_{res,full})/(r - q)] / [SS_{res,full}/(n - r - 1)] ~ F_{r-q, n-r-1} under the null.
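The F statistic above can be written as a small helper; the function name and argument names are my own, with r and q the numbers of predictors in the full and reduced models and n the sample size.

```python
def nested_f(ss_reduced, ss_full, r, q, n):
    """F = [(SS_reduced - SS_full)/(r - q)] / [SS_full/(n - r - 1)]."""
    return ((ss_reduced - ss_full) / (r - q)) / (ss_full / (n - r - 1))
```

Under H_0 this is compared against an F_{r-q, n-r-1} critical value.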
Model Building
If we have a large number of predictors, we want to identify the "best" subset.
There are many methods of selecting the "best":
- Examine all possible subsets of predictors
- Forward stepwise selection
- Backward stepwise selection

Model Building
Though we can keep only the predictors that are significant, this may not yield the "best" subset (several models may yield similar results).
The "best" choice is made by examining some criterion:
- R^2
- adjusted R^2
- Mallow's Cp
- AIC
Since R^2 increases as predictors are added, Mallow's Cp and AIC are better choices for selecting the "best" predictor subset.
Model Building
Mallow's Cp:

C_p = SS_{res}(p)/s^2 - (n - 2p), for a subset with p parameters + intercept,

where s^2 is the residual variance for the full model.
- Plot the pairs (p, C_p).
- Good models have coordinates near the 45° line.

Akaike's Information Criterion:

AIC = n ln(SS_{res}(p)/n) + 2p, for a subset with p parameters + intercept.

- Smaller is better.
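The two criteria above are simple enough to compute directly. The function names are my own; ss_res_p is the residual SS for the candidate subset, s2_full the full-model residual variance estimate, and p the candidate's parameter count.

```python
import math

def mallows_cp(ss_res_p, s2_full, n, p):
    # Cp = SS_res(p)/s^2_full - (n - 2p); near p for a well-fitting subset.
    return ss_res_p / s2_full - (n - 2 * p)

def aic(ss_res_p, n, p):
    # AIC = n ln(SS_res(p)/n) + 2p; smaller is better.
    return n * math.log(ss_res_p / n) + 2 * p
```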
Model Checking
It is always good to check whether the model is "correct" before using it to make decisions. Information about fit is contained in the residuals: if the model fits well, the estimated error terms should mimic N(0, \sigma^2). So how can we check?

Assumption: \varepsilon_i ~ NID(0, \sigma^2)

\hat{\varepsilon} = (I - H)y
Var(\hat{\varepsilon}) = \sigma^2 (I - H)
Model Checking
1. Studentized residuals plot
2. Plot residuals versus predicted values
   - Ideally points should be scattered (i.e., no pattern)
   - If a pattern exists, it can reveal something about the problem

The studentized residuals are

\hat{\varepsilon}_i^* = \hat{\varepsilon}_i / (\hat{\sigma} \sqrt{1 - h_{ii}}), where h_{ii} is the i-th diagonal entry of H.

A plot of these should look like independent N(0, 1) draws.
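Internally studentized residuals per the formula above, computed on simulated toy data (\hat{\sigma} is taken to be s from the fitted model):

```python
import numpy as np

# Toy simple-regression data.
rng = np.random.default_rng(3)
n = 25
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
y = Z @ np.array([1.0, 2.0]) + rng.normal(size=n)

H = Z @ np.linalg.solve(Z.T @ Z, Z.T)
resid = (np.eye(n) - H) @ y
s = np.sqrt(resid @ resid / (n - 2))     # n - r - 1 with r = 1 predictor
h = np.diag(H)                           # leverages h_ii
r_star = resid / (s * np.sqrt(1 - h))    # studentized residuals
```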
[Figure: three panels of residuals plotted against predicted y.]
Model Checking
3. Plot residuals versus predictors
4. QQ plot of studentized residuals

[Figure: residuals vs. predictor values; QQ plot of studentized residuals against normal quantiles.]
Model Checking
While residual analysis is useful, it may miss outliers, i.e., observations that are very influential on predictions.
Leverage:
- How far is the jth observation from the others?
- How much pull does j exert on the fit?

With H = Z(Z'Z)^{-1}Z', the leverage in simple linear regression is

h_{jj} = 1/n + (z_j - \bar{z})^2 / \sum_{i=1}^n (z_i - \bar{z})^2

and the fitted value decomposes as \hat{y}_j = h_{jj} y_j + \sum_{k \ne j} h_{jk} y_k, so h_{jj} measures the pull of y_j on its own fitted value.
Observations that affect inferences are influential.
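The simple-regression leverage formula above can be checked against the diagonal of H directly (toy predictor values of my own choosing):

```python
import numpy as np

# One predictor plus intercept.
rng = np.random.default_rng(4)
n = 20
z = rng.normal(size=n)
Z = np.column_stack([np.ones(n), z])
H = Z @ np.linalg.solve(Z.T @ Z, Z.T)

# Closed-form leverages: 1/n + (z_j - zbar)^2 / sum (z_i - zbar)^2.
h_formula = 1 / n + (z - z.mean()) ** 2 / np.sum((z - z.mean()) ** 2)
```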
Collinearity
If Z is not of full rank, some linear combination Za of the columns of Z equals 0. In that case the columns are collinear and the inverse of Z'Z does not exist.
It is rare that Za = 0 exactly, but if a combination exists that is nearly zero, (Z'Z)^{-1} is numerically unstable. This results in very large estimated variances of the model parameters, making it difficult to identify significant regression coefficients.

Collinearity
We can check the severity of multicollinearity using the variance inflation factor (VIF):
1. Regress X_i on all other X's in the model, calculate R_i^2, and set

VIF_i = 1 / (1 - R_i^2)

2. Examine VIF_i for all covariates in the model. VIF_i > 5 suggests high multicollinearity.
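The two-step VIF recipe above translates directly into code (the function name and helper logic are my own; this is a sketch, not the HH package's implementation):

```python
import numpy as np

def vif(X):
    """VIF_i = 1/(1 - R_i^2), R_i^2 from regressing column i on the others."""
    n, k = X.shape
    out = []
    for i in range(k):
        xi = X[:, i]
        # Regress X_i on an intercept plus all other columns.
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef = np.linalg.lstsq(others, xi, rcond=None)[0]
        r2 = 1 - np.sum((xi - others @ coef) ** 2) / np.sum((xi - xi.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)
```

Independent predictors give VIFs near 1; nearly collinear columns give very large VIFs.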
Misspecified Model
If important predictors are omitted, the vector of regression coefficients may be biased. Suppose the true model is

Y = Z_1\beta_{(1)} + Z_2\beta_{(2)} + \varepsilon,  E(\varepsilon) = 0,  Var(\varepsilon) = \sigma^2 I

but we fit a model with only Z_1:

\hat{\beta}_{(1)} = (Z_1'Z_1)^{-1} Z_1' Y

Then

E(\hat{\beta}_{(1)}) = (Z_1'Z_1)^{-1} Z_1' (Z_1\beta_{(1)} + Z_2\beta_{(2)}) = \beta_{(1)} + (Z_1'Z_1)^{-1} Z_1' Z_2 \beta_{(2)}

which is biased unless the columns of Z_1 and Z_2 are orthogonal (Z_1'Z_2 = 0) or \beta_{(2)} = 0.
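A quick simulation sketch of this bias, with all numbers my own illustrative choices: omit a predictor z2 that is correlated with z1, and the fitted slope on z1 absorbs part of z2's effect.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2000
z1 = rng.normal(size=n)
z2 = 0.8 * z1 + rng.normal(scale=0.6, size=n)   # correlated with z1
y = 1.0 + 2.0 * z1 + 3.0 * z2 + rng.normal(size=n)

# Fit the misspecified model with only z1.
Z1 = np.column_stack([np.ones(n), z1])
b = np.linalg.lstsq(Z1, y, rcond=None)[0]       # slope drifts toward 2 + 3*0.8 = 4.4
```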
Example
Develop a model to predict percent body fat (PBF) using: Age, Weight, Height, Chest, Abdomen, Hip, Arm, Wrist.
Our full model is

lm(formula = PBF ~ Age + Wt + Ht + Chest + Abd + Hip + Arm + Wrist, data = SSbodyfat)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -16.67443   15.64951  -1.065 0.287711
Age           0.03962    0.02983   1.328 0.185322
Wt           -0.07582    0.04704  -1.612 0.108274
Ht           -0.10940    0.09398  -1.164 0.245502
Chest        -0.06304    0.09758  -0.646 0.518838
Abd           0.94626    0.08556  11.059  < 2e-16 ***
Hip          -0.07802    0.13584  -0.574 0.566263
Arm           0.50335    0.18791   2.679 0.007896 **
Wrist        -1.78671    0.50204  -3.559 0.000448 ***

Residual standard error: 4.347 on 243 degrees of freedom
Multiple R-squared: 0.7388, Adjusted R-squared: 0.7303
F-statistic: 85.94 on 8 and 243 DF, p-value: < 2.2e-16
LRT
What if we want to test whether the 4 least significant predictors in the model can be removed?
Given:
SS_{res,reduced} = 4658.34
SS_{res,full} = 4590.78
What does our LRT tell us?
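Plugging the slide's numbers into the F statistic (r - q = 4 dropped predictors; n - r - 1 = 243 residual degrees of freedom, from the full-model output):

```python
ss_reduced, ss_full = 4658.34, 4590.78
f_stat = ((ss_reduced - ss_full) / 4) / (ss_full / 243)  # about 0.894
```

Since this is well below the 5% critical value of F_{4,243} (roughly 2.4), we fail to reject the null and the four predictors can be dropped.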
Model Subset Selection
First consider the plot of SS_res for all possible subsets of the eight predictors.

Model Subset Selection
What about Mallow's Cp and AIC?
Model
Say we choose the model with 4 parameters.

> summary(mod4)
Call: lm(formula = PBF ~ Wt + Abd + Arm + Wrist, data = SSbodyfat, x = T)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -34.85407    7.24500  -4.811 2.62e-06 ***
Wt           -0.13563    0.02475  -5.480 1.05e-07 ***
Abd           0.99575    0.05607  17.760  < 2e-16 ***
Arm           0.47293    0.18166   2.603 0.009790 **
Wrist        -1.50556    0.44267  -3.401 0.000783 ***

Residual standard error: 4.343 on 247 degrees of freedom
Multiple R-squared: 0.735, Adjusted R-squared: 0.7307
F-statistic: 171.3 on 4 and 247 DF, p-value: < 2.2e-16
Model Check
Say we choose the model with 4 parameters. We need to check our regression diagnostics.

Model Check
What about our parameters versus the residuals?

Model Check
What about influential points?

> SSbodyfat[c(39,175,159,206,36),]
Obs  PBF     Wt   Abd  Arm Wrist
 39 35.2 363.15 148.1 29.0  21.4
175 25.3 226.75 108.8 21.0  20.1
159 12.5 136.50  76.6 34.9  16.9
206 16.6 208.75  96.3 23.1  19.4
 36 40.1 191.75 113.1 29.8  17.0
Model Check
What about collinearity?

> round(cor(SSbodyfat[,c(2,3,6,8,9)]), digits=3)
         Wt   Abd   Arm Wrist
Wt    1.000 0.888 0.630 0.730
Abd   0.888 1.000 0.503 0.620
Arm   0.630 0.503 1.000 0.586
Wrist 0.730 0.620 0.586 1.000

> library(HH)
> vif(mod4)
      Wt      Abd      Arm    Wrist
7.040774 4.864380 1.793374 2.273047

What do all our model checks tell us about the validity of our model?
What if our investigator really felt all 13 predictors would give the best model?

> summary(mod13)
Call: lm(formula = PBF ~ Age + Wt + Ht + Neck + Chest + Abd + Hip + Thigh + Knee + Ankle + Bicep + Arm + Wrist, data = bodyfat, x = T)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -18.18849   17.34857  -1.048  0.29551
Age           0.06208    0.03235   1.919  0.05618 .
Wt           -0.08844    0.05353  -1.652  0.09978 .
Ht           -0.06959    0.09601  -0.725  0.46925
Neck         -0.47060    0.23247  -2.024  0.04405 *
Chest        -0.02386    0.09915  -0.241  0.81000
Abd           0.95477    0.08645  11.044  < 2e-16 ***
Hip          -0.20754    0.14591  -1.422  0.15622
Thigh         0.23610    0.14436   1.636  0.10326
Knee          0.01528    0.24198   0.063  0.94970
Ankle         0.17400    0.22147   0.786  0.43285
Bicep         0.18160    0.17113   1.061  0.28966
Arm           0.45202    0.19913   2.270  0.02410 *
Wrist        -1.62064    0.53495  -3.030  0.00272 **

Residual standard error: 4.305 on 238 degrees of freedom
Multiple R-squared: 0.749, Adjusted R-squared: 0.7353
F-statistic: 54.65 on 13 and 238 DF, p-value: < 2.2e-16

Is collinearity problematic?
> vif(mod13)
      Age        Wt        Ht      Neck     Chest       Abd       Hip
 2.250450 33.509320  1.674591  4.324463  9.460877 11.767073 14.796520
    Thigh      Knee     Ankle     Bicep       Arm     Wrist
 7.777865  4.612147  1.907961  3.619744  2.192492  3.377515