Review of Univariate Linear Regression BMTRY 726 3/4/14.

Page 1

Review of Univariate Linear Regression

BMTRY 726, 3/4/14

Page 2

Regression Analysis

We are interested in predicting values of one or more responses from a set of predictors.

Regression analysis is an extension of what we discussed with ANOVA and MANOVA.

We now allow for inclusion of continuous predictors in place of (or in addition to) treatment indicators in MANOVA.

Page 3

Univariate Regression Analysis

A univariate regression model states that the response Y is composed of a mean dependent on a set of independent predictors z_i and random error ε_i.

The model:

Y_j = β_0 + β_1 z_{j1} + ... + β_r z_{jr} + ε_j,  for j = 1, ..., n

or, in matrix form, Y = Zβ + ε.

Model assumptions:
1. E(ε) = 0
2. Var(ε) = σ²I

Page 4

Least Squares Estimation

We use the method of least squares to estimate our model parameters.

The estimated coefficients and residuals are:

β̂ = (Z'Z)⁻¹Z'y
ε̂ = y − ŷ = [I − Z(Z'Z)⁻¹Z']y = (I − H)y,  where H = Z(Z'Z)⁻¹Z'

The predicted value of the outcome is:

ŷ = Zβ̂ = Hy
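These estimators are easy to verify numerically; here is a minimal NumPy sketch with simulated data (the sample size, coefficients, and noise level are arbitrary illustrative choices, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 50, 2
# Design matrix Z: a column of ones plus r simulated predictors
Z = np.column_stack([np.ones(n), rng.normal(size=(n, r))])
beta = np.array([1.0, 2.0, -0.5])
y = Z @ beta + rng.normal(scale=0.3, size=n)

# beta_hat = (Z'Z)^{-1} Z'y, solved without forming the inverse explicitly
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)

# Hat matrix H = Z (Z'Z)^{-1} Z', fitted values, and residuals
H = Z @ np.linalg.solve(Z.T @ Z, Z.T)
y_hat = H @ y                    # y_hat = Z beta_hat = H y
resid = (np.eye(n) - H) @ y      # eps_hat = (I - H) y

assert np.allclose(y_hat, Z @ beta_hat)
assert np.allclose(H @ H, H)     # H is idempotent (a projection matrix)
```

Using `np.linalg.solve` on the normal equations avoids explicitly inverting Z'Z, which is numerically safer.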

Page 5

Least Squares Estimation

We estimate the variance using the residuals:

ε̂ = (I − H)y
ε̂'ε̂ = y'(I − H)'(I − H)y = y'(I − H)y

since (I − H) is symmetric and idempotent.

Note that (I − H)Z = Z − Z(Z'Z)⁻¹Z'Z = Z − Z = 0, so ε̂ is orthogonal to the columns of Z.

Substituting Y = Zβ + ε:

ε̂'ε̂ = (Zβ + ε)'(I − H)(Zβ + ε) = ε'(I − H)ε

E(ε̂'ε̂) = σ² tr(I − H) = σ²[n − rank(Z)] = σ²(n − r − 1)

So s² = ε̂'ε̂ / (n − r − 1) is an unbiased estimator of σ².
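The orthogonality and trace identities above can be checked numerically; a quick sketch with an arbitrary simulated design matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 40, 3
Z = np.column_stack([np.ones(n), rng.normal(size=(n, r))])  # full-rank design
H = Z @ np.linalg.solve(Z.T @ Z, Z.T)
M = np.eye(n) - H                                           # I - H

# (I - H)Z = 0: the residuals are orthogonal to the columns of Z
assert np.allclose(M @ Z, 0)
# tr(I - H) = n - rank(Z) = n - r - 1, the residual degrees of freedom
assert np.isclose(np.trace(M), n - r - 1)

y = Z @ np.ones(r + 1) + rng.normal(size=n)
resid = M @ y
s2 = resid @ resid / (n - r - 1)   # unbiased estimate of sigma^2
```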

Page 6

Least Squares Estimation

Properties of β̂:

1. It is unbiased: E(β̂) = (Z'Z)⁻¹Z'Zβ = β

2. The estimated variance is: Cov(β̂) = (Z'Z)⁻¹Z'(σ²I)Z(Z'Z)⁻¹ = σ²(Z'Z)⁻¹

3. The distribution is (from 1 and 2): β̂ ~ N_{r+1}(β, σ²(Z'Z)⁻¹)

4. It is also the BLUE (best linear unbiased estimator).

Page 7

Geometry of Least Squares

ŷ = Zβ̂ = Hy,  where H = Z(Z'Z)⁻¹Z'
ε̂ = y − ŷ = (I − H)y

[Figure: y decomposed into its projection ŷ = Hy onto the column space of Z and the orthogonal residual ε̂ = (I − H)y.]

Page 8

LRT for individual β_i's

First we may test whether any predictors affect the response:

H_0: β_{q+1} = β_{q+2} = ... = β_r = 0,  or equivalently H_0: β_(2) = 0

where the model is partitioned as

Y = Zβ + ε = [Z_1  Z_2] (β_(1), β_(2))' + ε = Z_1 β_(1) + Z_2 β_(2) + ε

Under the null: Y = Z_1 β_(1) + ε.

The LRT is based on the difference in sums of squares between the full and null models…

Page 9

[Figure: geometry of the nested-model comparison, showing y, the competing fits Z_1 β̂_(1) vs. Zβ̂, and the corresponding residuals ε̂_1 and ε̂.]

Page 10

LRT for individual β_i's

Difference in SS between the full and null models…

SS_res = (y − Zβ̂)'(y − Zβ̂)

Extra SS = SS_res(reduced) − SS_res(full)
         = (y − Z_1 β̂_(1))'(y − Z_1 β̂_(1)) − (y − Zβ̂)'(y − Zβ̂)

We know SS_res(full)/σ² ~ χ²_{n−r−1} and SS_res(reduced)/σ² ~ χ²_{n−q−1}

so [SS_res(reduced) − SS_res(full)]/σ² ~ χ²_{r−q}

But… σ² is unknown, so we can't use this directly.

We could estimate it: s² = SS_res(full)/(n − r − 1)

Or if we consider the following ratio instead…

F = [(SS_res(reduced) − SS_res(full))/(r − q)] / [SS_res(full)/(n − r − 1)] ~ F_{r−q, n−r−1} under the null.
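This F ratio comes straight from two least-squares fits; a sketch with simulated data in which the extra coefficients really are zero (dimensions and seed are arbitrary):

```python
import numpy as np

def ss_res(Z, y):
    """Residual sum of squares from an OLS fit of y on Z."""
    beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)
    e = y - Z @ beta_hat
    return e @ e

rng = np.random.default_rng(2)
n, q, r = 60, 2, 4
Z1 = np.column_stack([np.ones(n), rng.normal(size=(n, q))])  # reduced design
Z2 = rng.normal(size=(n, r - q))                             # extra predictors
Z = np.hstack([Z1, Z2])                                      # full design
y = Z1 @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)     # truth: beta_(2) = 0

ss_reduced, ss_full = ss_res(Z1, y), ss_res(Z, y)
F = ((ss_reduced - ss_full) / (r - q)) / (ss_full / (n - r - 1))
# Under H0, F ~ F_{r-q, n-r-1}; compare to the upper-alpha quantile.
```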

Page 11

Model Building

If we have a large number of predictors, we want to identify the "best" subset.

There are many methods of selecting the "best":
- Examine all possible subsets of predictors
- Forward stepwise selection
- Backwards stepwise selection

Page 12

Model Building

Though we can consider predictors that are significant, this may not yield the "best" subset (some models may yield similar results).

The "best" choice is made by examining some criterion:
- R²
- adjusted R²
- Mallow's C_p
- AIC

Since R² increases as predictors are added, Mallow's C_p and AIC are better choices for selecting the "best" predictor subset.

Page 13

Model Building

Mallow's C_p, for a subset with p parameters (including the intercept):

C_p = SS_res(p)/s²_full − (n − 2p)

where s²_full is the residual variance for the full model.
- Plot the pairs (p, C_p).
- Good models have coordinates near the 45° line.

Akaike's Information Criterion, for a subset with p parameters (including the intercept):

AIC = n ln[SS_res(p)/n] + 2p

- Smaller is better.
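Both criteria are simple functions of a subset's residual sum of squares; a small helper makes that concrete (the numeric inputs below are made up purely for illustration):

```python
import numpy as np

def mallows_cp(ss_res_p, p, n, s2_full):
    """Mallow's Cp for a subset model with p parameters (including intercept)."""
    return ss_res_p / s2_full - (n - 2 * p)

def aic(ss_res_p, p, n):
    """AIC = n ln(SS_res(p)/n) + 2p; smaller is better."""
    return n * np.log(ss_res_p / n) + 2 * p

# Hypothetical subsets: n = 100 observations, full-model residual variance 4.0
n, s2_full = 100, 4.0
for p, ss in [(3, 450.0), (5, 400.0), (8, 392.0)]:
    print(p, round(mallows_cp(ss, p, n, s2_full), 2), round(aic(ss, p, n), 2))
```

With these made-up numbers the 5-parameter subset scores best on both criteria, illustrating how the penalties trade fit against model size.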

Page 14

Model Checking

Always good to check if the model is "correct" before using it to make decisions… Information about the fit is contained in the residuals.

Assumptions: ε_i ~ NID(0, σ²)

If the model fits well, the estimated error terms should mimic N(0, σ²). So how can we check?

ε̂ = (I − H)y
Var(ε̂) = σ²(I − H)

Page 15

Model Checking

1. Studentized residuals plot

ε̂*_i = ε̂_i / √[s²(1 − h_ii)],  where h_ii is the i-th diagonal entry of H

A plot of these should look like independent N(0,1) draws.

2. Plot residuals versus predicted values
- Ideally points should be scattered (i.e., no pattern)
- If a pattern exists, it can show something about the problem

[Figure: three residual-vs-predicted-y plots, illustrating a well-behaved scatter and patterned residuals.]
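Studentized residuals are easy to compute from the hat-matrix diagonal; a sketch with simulated data (all values are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 30, 2
Z = np.column_stack([np.ones(n), rng.normal(size=(n, r))])
y = Z @ np.array([1.0, 1.0, -1.0]) + rng.normal(size=n)

H = Z @ np.linalg.solve(Z.T @ Z, Z.T)
resid = y - H @ y
s2 = resid @ resid / (n - r - 1)            # residual variance estimate
h = np.diag(H)                              # leverages h_ii
t_resid = resid / np.sqrt(s2 * (1 - h))     # studentized residuals
# If the model fits, t_resid should look like independent N(0, 1) draws.
```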

Page 16

Model Checking

3. Plot residuals versus predictors
4. QQ plot of studentized residuals

[Figures: residuals vs. predictor values, and studentized residuals vs. normal quantiles.]

Page 17

Model Checking

While residual analysis is useful, it may miss outliers, i.e., observations that are very influential on predictions.

Leverage: h_jj, the j-th diagonal entry of H = Z(Z'Z)⁻¹Z'
- How far is the j-th observation from the others?
- How much pull does observation j exert on the fit?

For a single predictor:

h_jj = 1/n + (z_j − z̄)² / Σᵢ(z_i − z̄)²

Since ŷ_j = h_jj y_j + Σ_{k≠j} h_jk y_k, a large h_jj means y_j largely determines its own fitted value.

Observations that affect inferences are influential.
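For a single predictor, the closed form for h_jj can be checked against the hat-matrix diagonal directly (simulated data, arbitrary seed):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 25
z = rng.normal(size=n)
Z = np.column_stack([np.ones(n), z])        # simple linear regression design
H = Z @ np.linalg.solve(Z.T @ Z, Z.T)

# h_jj = 1/n + (z_j - zbar)^2 / sum_i (z_i - zbar)^2
h_closed = 1 / n + (z - z.mean()) ** 2 / np.sum((z - z.mean()) ** 2)
assert np.allclose(np.diag(H), h_closed)
```

Note h_jj is smallest (1/n) at z_j = z̄ and grows as z_j moves away from the mean.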

Page 18

Collinearity

If Z is not of full rank, some linear combination Za of the columns of Z equals 0.

In such a case the columns are collinear, and the inverse of Z'Z does not exist.

It is rare that Za is exactly 0, but if a combination exists that is nearly zero, (Z'Z)⁻¹ is numerically unstable.

This results in very large estimated variances of the model parameters, making it difficult to identify significant regression coefficients.

Page 19

Collinearity

We can check the severity of multicollinearity using the variance inflation factor (VIF):

1. Regress X_i on all other X's from the model and calculate R_i²; then

VIF_i = 1 / (1 − R_i²)

2. Examine VIF_i for all covariates in the model. VIF_i > 5 means high multicollinearity.
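The two steps translate directly into code. Here is a plain-NumPy sketch of the VIF computation on simulated data (the slides use the HH package's vif() in R; this is an illustrative reimplementation, not that function):

```python
import numpy as np

def vif(X):
    """VIF for each column of X (predictors only, no intercept column).

    Step 1: regress X_i on the other columns (with an intercept) to get R_i^2.
    Step 2: VIF_i = 1 / (1 - R_i^2).
    """
    n, k = X.shape
    out = np.empty(k)
    for i in range(k):
        xi = X[:, i]
        Zi = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        b = np.linalg.lstsq(Zi, xi, rcond=None)[0]
        e = xi - Zi @ b
        r2 = 1 - (e @ e) / np.sum((xi - xi.mean()) ** 2)
        out[i] = 1 / (1 - r2)
    return out

rng = np.random.default_rng(6)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.3, size=100)   # strongly collinear with x1
x3 = rng.normal(size=100)                   # unrelated predictor
v = vif(np.column_stack([x1, x2, x3]))
print(np.round(v, 2))
```

The first two VIFs come out large (the x1/x2 pair is nearly collinear) while the third stays near 1.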

Page 20

Misspecified Model

If important predictors are omitted, the vector of regression coefficients may be biased.

True model:

Y = [Z_1  Z_2] (β_(1), β_(2))' + ε = Z_1 β_(1) + Z_2 β_(2) + ε,  with E(ε) = 0, Var(ε) = σ²I

Fit the model with only Z_1:

β̂_(1) = (Z_1'Z_1)⁻¹Z_1'Y

E(β̂_(1)) = E[(Z_1'Z_1)⁻¹Z_1'(Z_1 β_(1) + Z_2 β_(2) + ε)] = β_(1) + (Z_1'Z_1)⁻¹Z_1'Z_2 β_(2)

Biased unless the columns of Z_1 and Z_2 are orthogonal (Z_1'Z_2 = 0).
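The bias term (Z_1'Z_1)⁻¹Z_1'Z_2 β_(2) is easy to exhibit by simulation; in this made-up example the omitted predictor is correlated with the included one:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
z2 = rng.normal(size=n)                          # omitted predictor
z1 = 0.8 * z2 + rng.normal(scale=0.6, size=n)    # included, correlated with z2
y = 1.0 * z1 + 2.0 * z2 + rng.normal(scale=0.1, size=n)

Z1 = np.column_stack([np.ones(n), z1])           # fit with only z1
b = np.linalg.solve(Z1.T @ Z1, Z1.T @ y)
# True beta_1 is 1.0, but the slope estimate absorbs the omitted z2:
# roughly 1.0 + 2.0 * cov(z1, z2)/var(z1) = 2.6 with these settings.
print(round(b[1], 2))
```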

Page 21

Example

Develop a model to predict percent body fat (PBF) using:
- Age, Weight, Height, Chest, Abdomen, Hip, Arm, Wrist

Our full model is:

lm(formula = PBF ~ Age + Wt + Ht + Chest + Abd + Hip + Arm + Wrist, data = SSbodyfat)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -16.67443   15.64951  -1.065 0.287711
Age           0.03962    0.02983   1.328 0.185322
Wt           -0.07582    0.04704  -1.612 0.108274
Ht           -0.10940    0.09398  -1.164 0.245502
Chest        -0.06304    0.09758  -0.646 0.518838
Abd           0.94626    0.08556  11.059  < 2e-16 ***
Hip          -0.07802    0.13584  -0.574 0.566263
Arm           0.50335    0.18791   2.679 0.007896 **
Wrist        -1.78671    0.50204  -3.559 0.000448 ***

Residual standard error: 4.347 on 243 degrees of freedom
Multiple R-squared: 0.7388, Adjusted R-squared: 0.7303
F-statistic: 85.94 on 8 and 243 DF, p-value: < 2.2e-16

Page 22

LRT

What if we want to test whether or not the 4 least significant predictors in the model can be removed?

Given:

SS_res(reduced) = 4658.34
SS_res(full) = 4590.78

What does our LRT tell us?
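Plugging the slide's sums of squares into the F ratio from the LRT section (the full model has r = 8 predictors on n = 252 observations, so n − r − 1 = 243, and r − q = 4 terms are dropped):

```python
ss_reduced, ss_full = 4658.34, 4590.78   # from the slide
n, r, q = 252, 8, 4

F = ((ss_reduced - ss_full) / (r - q)) / (ss_full / (n - r - 1))
print(round(F, 3))   # about 0.89
```

An F statistic below 1 cannot exceed any usual F_{4,243} critical value, so we fail to reject H0: the four predictors can be dropped.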

Page 23

Model Subset Selection

First consider the plot of SS_res for all possible subsets of the eight predictors.

Page 24

Model Subset Selection

What about Mallow's C_p and AIC?

Page 25

Model

Say we choose the model with 4 parameters.

> summary(mod4)

Call:
lm(formula = PBF ~ Wt + Abd + Arm + Wrist, data = SSbodyfat, x = T)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -34.85407    7.24500  -4.811 2.62e-06 ***
Wt           -0.13563    0.02475  -5.480 1.05e-07 ***
Abd           0.99575    0.05607  17.760  < 2e-16 ***
Arm           0.47293    0.18166   2.603 0.009790 **
Wrist        -1.50556    0.44267  -3.401 0.000783 ***

Residual standard error: 4.343 on 247 degrees of freedom
Multiple R-squared: 0.735, Adjusted R-squared: 0.7307
F-statistic: 171.3 on 4 and 247 DF, p-value: < 2.2e-16

Page 26

Model Check

Say we choose the model with 4 parameters. We need to check our regression diagnostics.

Page 27

Model Check

What about our predictors versus the residuals?

Page 28

Model Check

What about influential points?

> SSbodyfat[c(39,175,159,206,36),]
Obs  PBF     Wt   Abd  Arm Wrist
 39 35.2 363.15 148.1 29.0  21.4
175 25.3 226.75 108.8 21.0  20.1
159 12.5 136.50  76.6 34.9  16.9
206 16.6 208.75  96.3 23.1  19.4
 36 40.1 191.75 113.1 29.8  17.0

Page 29

Model Check

What about collinearity?

> round(cor(SSbodyfat[,c(2,3,6,8,9)]), digits=3)
         Wt   Abd   Arm Wrist
Wt    1.000 0.888 0.630 0.730
Abd   0.888 1.000 0.503 0.620
Arm   0.630 0.503 1.000 0.586
Wrist 0.730 0.620 0.586 1.000

> library(HH)
> vif(mod4)
      Wt      Abd      Arm    Wrist
7.040774 4.864380 1.793374 2.273047

Page 30

What do all our model checks tell us about the validity of our model?

Page 31

What if our investigator really felt all 13 predictors would give the best model?

> summary(mod13)

Call:
lm(formula = PBF ~ Age + Wt + Ht + Neck + Chest + Abd + Hip + Thigh + Knee + Ankle + Bicep + Arm + Wrist, data = bodyfat, x = T)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -18.18849   17.34857  -1.048  0.29551
Age           0.06208    0.03235   1.919  0.05618 .
Wt           -0.08844    0.05353  -1.652  0.09978 .
Ht           -0.06959    0.09601  -0.725  0.46925
Neck         -0.47060    0.23247  -2.024  0.04405 *
Chest        -0.02386    0.09915  -0.241  0.81000
Abd           0.95477    0.08645  11.044  < 2e-16 ***
Hip          -0.20754    0.14591  -1.422  0.15622
Thigh         0.23610    0.14436   1.636  0.10326
Knee          0.01528    0.24198   0.063  0.94970
Ankle         0.17400    0.22147   0.786  0.43285
Bicep         0.18160    0.17113   1.061  0.28966
Arm           0.45202    0.19913   2.270  0.02410 *
Wrist        -1.62064    0.53495  -3.030  0.00272 **

Residual standard error: 4.305 on 238 degrees of freedom
Multiple R-squared: 0.749, Adjusted R-squared: 0.7353
F-statistic: 54.65 on 13 and 238 DF, p-value: < 2.2e-16

Page 32

Is collinearity problematic?

> vif(mod13)
      Age        Wt        Ht      Neck     Chest       Abd       Hip
 2.250450 33.509320  1.674591  4.324463  9.460877 11.767073 14.796520
    Thigh      Knee     Ankle     Bicep       Arm     Wrist
 7.777865  4.612147  1.907961  3.619744  2.192492  3.377515