Outline: Air Pollution Example; Regression with Multiple Predictors; Matrix Notation; Added Variable Plots
Multiple Regression
HH Chapter 9
October 31, 2005
Topics
- Regression with Two or More Predictors
- Matrix Version of Regression
- Hat Matrix & Leverage
- Added Variable Plots
- Interpretation
Air Pollution Data
- Data file: hh/datasets/usair.dat
- Response: SO2 measurements in 41 metropolitan areas
- Predictors:
  - temp
  - mgfirms
  - popn
  - wind
  - precip
  - raindays
Model?
Scatterplot Matrix
Original Variables
[Figure: scatterplot matrix of SO2, temp, mgfirms, popn, wind, precip, and raindays]
Correlations between Variables
          SO2   temp  firms   popn   wind  precip   rain
SO2      1.00  -0.43   0.64   0.49   0.09    0.05   0.37
temp    -0.43   1.00  -0.19  -0.06  -0.35    0.39  -0.43
firms    0.64  -0.19   1.00   0.96   0.24   -0.03   0.13
popn     0.49  -0.06   0.96   1.00   0.21   -0.03   0.04
wind     0.09  -0.35   0.24   0.21   1.00   -0.01   0.16
precip   0.05   0.39  -0.03  -0.03  -0.01    1.00   0.50
rain     0.37  -0.43   0.13   0.04   0.16    0.50   1.00
Which explanatory variable leads to the “best” simple linear regression?
What is its $R^2$?
Can we do “better” by including other variables? (Transformations?)
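One way to work through the first two questions, sketched in R: in a simple linear regression the $R^2$ equals the squared correlation between response and predictor, so the SO2 row of the correlation table is enough. The vector name `r_so2` is just an illustrative label; the values are copied from the table.

```r
# Correlations of SO2 with each predictor, copied from the correlation table
r_so2 <- c(temp = -0.43, firms = 0.64, popn = 0.49,
           wind = 0.09, precip = 0.05, rain = 0.37)

# In simple linear regression, R^2 is the squared correlation
r2 <- r_so2^2

# Predictor with the largest R^2
best <- names(which.max(r2))

print(round(sort(r2, decreasing = TRUE), 3))
cat("Best single predictor:", best, "with R^2 =", round(r2[[best]], 2), "\n")
```

On these rounded correlations, firms gives the largest $R^2$, about 0.41, which leaves room for other variables to help.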
Multiple Regression with p Predictors
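Model:
- Observe data $\{Y_i, x_{i1}, \ldots, x_{ip}\}$, $i = 1, \ldots, n$
- $E[Y_i \mid x_{i1}, \ldots, x_{ip}] = f(x_{i1}, \ldots, x_{ip})$
- First approximation (first-order Taylor series):
  $E[Y_i \mid x_{i1}, \ldots, x_{ip}] \equiv \mu_i = \beta_0 + x_{i1}\beta_1 + \ldots + x_{ip}\beta_p$
- Normal model:
  $Y_i \overset{ind}{\sim} N(\mu_i, \sigma^2) \iff Y_i = \beta_0 + x_{i1}\beta_1 + \ldots + x_{ip}\beta_p + \epsilon_i$, with $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$
- OLS (MLE): find $\beta_0, \ldots, \beta_p$ that minimize
  $\sum_i \left(Y_i - \{\beta_0 + x_{i1}\beta_1 + \ldots + x_{ip}\beta_p\}\right)^2 \equiv \sum_i e_i^2$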
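As a rough illustration of the least-squares criterion, the sketch below minimizes the sum of squared errors numerically and compares the result with `lm()`. The data are simulated and the names (`sse`, `x1`, `x2`) are made up for this example; nothing here uses the air pollution data.

```r
set.seed(1)                      # simulated data, purely illustrative
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.3)

# Sum of squared errors as a function of beta = (beta0, beta1, beta2)
sse <- function(beta) {
  sum((y - (beta[1] + beta[2] * x1 + beta[3] * x2))^2)
}

# Minimize the criterion numerically, starting from zero
opt <- optim(c(0, 0, 0), sse, method = "BFGS")

# lm() solves the same minimization in closed form
fit <- lm(y ~ x1 + x2)
print(rbind(optim = opt$par, lm = coef(fit)))
```

The two rows should agree to several decimal places, since both minimize the same quadratic function.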
Fitting Models in R
Choice of transformation of response and predictors?
The Box-Cox procedure can be used to find the “best” transformation of Y (for a given set of transformed predictors).
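# fit the model on the original scale
poll.lm = lm(SO2 ~ temp + firms + popn + wind + precip + rain,
             data = pollution)
# plot the four standard diagnostics in a 2 x 2 grid (R 2.2)
par(mfrow = c(2, 2))
plot(poll.lm, ask = FALSE)
# profile log-likelihood for the Box-Cox power of the response
library(MASS)
boxcox(poll.lm)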
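A minimal sketch of reading a power off the Box-Cox profile without looking at the plot, assuming the MASS package is installed. The data are simulated so that the log transformation is the right one; `lambda_hat` is an illustrative name, not something from the slides.

```r
library(MASS)                    # provides boxcox()

set.seed(2)                      # simulated data, purely illustrative
n <- 100
x <- runif(n, 0, 4)
y <- exp(1 + 0.5 * x + rnorm(n, sd = 0.3))   # log(y) is linear in x

# Keep the response in the fitted object so boxcox() can use it directly
fit <- lm(y ~ x, y = TRUE)

# Profile log-likelihood over a grid of powers, without drawing the plot
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)

# Grid value with the highest profile log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]
cat("Power with the highest profile likelihood:", lambda_hat, "\n")
```

A value near 0 points to the log transformation, which is the usual reading of the Box-Cox family at $\lambda = 0$.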
Scatterplot - log response
[Figure: scatterplot matrix of log(SO2), temp, firms, popn, wind, precip, and rain]
Residuals
$e_i = Y_i - \hat{Y}_i = Y_i - \{\hat\beta_0 + x_{i1}\hat\beta_1 + \ldots + x_{ip}\hat\beta_p\}$
[Figure: diagnostic plots for poll.lm in four panels: Residuals vs Fitted, Normal Q-Q, and Scale-Location (cases 31, 30, and 26 labeled); Residuals vs Leverage with Cook's distance contours at 0.5 and 1 (cases 31, 1, and 25 labeled)]
Matrix Notation
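$Y_1 = 1\beta_0 + x_{11}\beta_1 + \ldots + x_{1p}\beta_p + \epsilon_1$
$Y_2 = 1\beta_0 + x_{21}\beta_1 + \ldots + x_{2p}\beta_p + \epsilon_2$
$\vdots$
$Y_n = 1\beta_0 + x_{n1}\beta_1 + \ldots + x_{np}\beta_p + \epsilon_n$
$\iff$
$Y = 1_n\beta_0 + X_1\beta_1 + \ldots + X_p\beta_p + \epsilon$
$Y = X\beta + \epsilon$
where $X = [1_n \; X_1 \; \ldots \; X_p]$ is an $n \times (p + 1)$ matrix, $Y$ and $X_j$ are vectors of length $n$, and $\beta = (\beta_0, \ldots, \beta_p)^T$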
MLE’s in Matrix Notation
The MLE of $\beta$ minimizes
$Q(\beta) = (Y - X\beta)^T (Y - X\beta)$
(equivalently, it maximizes $-Q(\beta)$; under the normal model the MLE and the OLS solution coincide)
Solution: $\hat\beta = (X^T X)^{-1} X^T Y$
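The closed-form solution can be checked directly against `lm()`. This is a small sketch on simulated data (the variable names are made up for illustration); it solves the normal equations $(X^T X)\beta = X^T Y$ rather than forming the inverse explicitly, which is the usual numerical preference.

```r
set.seed(3)                      # simulated data, purely illustrative
n  <- 40
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 - x1 + 0.5 * x2 + rnorm(n)

# n x (p + 1) design matrix with a leading column of ones
X <- cbind(1, x1, x2)

# Solve (X^T X) beta = X^T Y
beta_hat <- solve(crossprod(X), crossprod(X, y))

fit <- lm(y ~ x1 + x2)
print(cbind(normal_eq = drop(beta_hat), lm = coef(fit)))
```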
Hat Matrix
$H \equiv X(X^T X)^{-1} X^T$ is an $n \times n$ projection matrix
- $H^T = H$ (symmetric)
- $HH = H^2 = H$ (idempotent)
- $HY = X(X^T X)^{-1} X^T Y = X\hat\beta = \hat{Y}$ (hence "hat matrix")
- $(I_n - H)$ is also a projection matrix ($I_n$ is the identity matrix)
- $(I_n - H)Y = Y - \hat{Y} = e$
$h_i$ is the leverage of case $i$ (the $i$th diagonal element of $H$).
It measures how far the $i$th set of predictors is from the rest of the data (more in Chapter 11).
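The listed properties can be verified numerically. The sketch below builds $H$ for a small simulated design (names and data are illustrative only) and compares it with what `lm()`, `fitted()`, `resid()`, and `hatvalues()` report.

```r
set.seed(4)                      # simulated data, purely illustrative
n  <- 30
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

X <- cbind(1, x1, x2)
H <- X %*% solve(crossprod(X)) %*% t(X)     # hat matrix

fit <- lm(y ~ x1 + x2)

# Each quantity below should be zero up to rounding error
sym_err  <- max(abs(H - t(H)))                                # symmetric
idem_err <- max(abs(H %*% H - H))                             # idempotent
fit_err  <- max(abs(drop(H %*% y) - fitted(fit)))             # HY = Y-hat
res_err  <- max(abs(drop((diag(n) - H) %*% y) - resid(fit)))  # (I - H)Y = e
lev_err  <- max(abs(diag(H) - hatvalues(fit)))                # h_i = diag(H)

print(c(sym_err, idem_err, fit_err, res_err, lev_err))
cat("Sum of leverages:", sum(diag(H)), "\n")   # equals p + 1 = 3 here
```

The leverages sum to $p + 1$, so their average is $(p + 1)/n$.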
Leverage
hatvalues(poll.lm)
[Figure: leverage plotted against case index; leverage axis from 0.1 to 0.7, case index from 0 to 40]
A case is a high leverage point if $h_i > 2(p + 1)/n$
Case 11?
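For the air pollution fit, $n = 41$ and $p = 6$, so the cutoff works out to $14/41 \approx 0.34$. The sketch below computes that and applies the same rule to a simulated fit with one deliberately extreme predictor value; `high_leverage()` is a hypothetical helper written for this example, not a function from R or the slides.

```r
# Cutoff for the air pollution model: 41 cities, 6 predictors
n_poll <- 41
p_poll <- 6
cutoff_poll <- 2 * (p_poll + 1) / n_poll
cat("Cutoff for poll.lm:", round(cutoff_poll, 3), "\n")

# Hypothetical helper: cases whose leverage exceeds 2(p + 1)/n
high_leverage <- function(fit) {
  h <- hatvalues(fit)
  cutoff <- 2 * length(coef(fit)) / length(h)   # length(coef) is p + 1
  h[h > cutoff]
}

# Simulated illustration: case 30 sits far from the other x values
set.seed(6)
x <- c(rnorm(29), 10)
y <- 1 + 0.5 * x + rnorm(30)
flagged <- high_leverage(lm(y ~ x))
print(flagged)
```

With the pollution data loaded, `high_leverage(poll.lm)` would list the cases above the cutoff.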
New Model
poll.lm3 = lm(log(SO2) ~ temp + log(firms) +
log(popn) + wind + precip +
rain, data=pollution)
[Figure: diagnostic plots for poll.lm3 in four panels: Residuals vs Fitted, Normal Q-Q, and Scale-Location (cases 37, 25, and 31 labeled); Residuals vs Leverage with Cook's distance contours at 0.5 and 1 (cases 25, 31, and 11 labeled)]
Added Variable Plots
What is the effect of adding $X_j$ to the model after all other $X$'s have been included?
- Regress $X_j$ on $X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_p$
- Find the residuals $X_j - \hat{X}_j \equiv X_{j|\cdot}$
- Regress $Y$ on $X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_p$
- Find the residuals $Y - \hat{Y}_{1, \ldots, j-1, j+1, \ldots, p} \equiv e_j$
- Plot $e_j$ versus $X_{j|\cdot}$
- Slope of the line is $\hat\beta_j$ in the regression on all $X$'s (adjusted)
- Look for need to transform, non-constant variance, outliers, etc.
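The steps above can be sketched by hand with two predictors. The data are simulated and the names (`x2_res`, `y_res`) are illustrative; the point is that the slope of the added-variable regression matches the coefficient from the full fit.

```r
set.seed(5)                      # simulated data, purely illustrative
n  <- 60
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)        # predictors are correlated
y  <- 1 + 2 * x1 - 1.5 * x2 + rnorm(n)

full <- lm(y ~ x1 + x2)

# Added-variable construction for x2
x2_res <- resid(lm(x2 ~ x1))     # x2 after removing the other predictor
y_res  <- resid(lm(y ~ x1))      # y after removing the other predictor

av_fit <- lm(y_res ~ x2_res)
slope  <- coef(av_fit)[["x2_res"]]

cat("Slope in added-variable regression:", slope, "\n")
cat("Coefficient of x2 in full model:   ", coef(full)[["x2"]], "\n")

# Draw the plot only in an interactive session
if (interactive()) {
  plot(x2_res, y_res, xlab = "x2 | others", ylab = "y | others")
  abline(av_fit)
}
```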
Added Variable Plots in R
# use poll.lm3
library(car)
# library for "Companion to Applied Regression"
# av.plots() is called avPlots() in later versions of car
help(avPlots)
avPlots(poll.lm3)
av.plots
[Figure: six added-variable plots of log(SO2) | others against temp | others, log(firms) | others, log(popn) | others, wind | others, precip | others, and rain | others]
Model Fitting
EDA used throughout:
- scatterplots
- Box-Cox or ladder of powers
- leverage plots
- residual plots
- added variable plots
Iterate model building until the “assumptions” of linearity and constant variance seem plausible.
summary(poll.lm3) (abbreviated)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.7142760  1.6475086   4.075 0.000261 ***
temp        -0.0649495  0.0227711  -2.852 0.007333 **
log(firms)   0.3698588  0.1934076   1.912 0.064289 .
log(popn)   -0.1771293  0.2335520  -0.758 0.453428
wind        -0.1738606  0.0656713  -2.647 0.012204 *
precip       0.0156032  0.0132718   1.176 0.247893
rain         0.0009153  0.0057335   0.160 0.874104
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1
Residual standard error: 0.5108 on 34 degrees of freedom
Multiple R-Squared: 0.5503, Adjusted R-squared: 0.471
F-statistic: 6.936 on 6 and 34 DF, p-value: 7.12e-05
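The summary quantities are tied together by simple arithmetic, which the sketch below reproduces from the printed numbers: the t value is the estimate over its standard error, and adjusted $R^2$ and the F statistic follow from $R^2$, $n = 41$, and $p = 6$. Small discrepancies are expected because the inputs are rounded.

```r
n <- 41                          # cities
p <- 6                           # predictors in poll.lm3
df_resid <- n - p - 1            # residual degrees of freedom

# temp row of the coefficient table
est <- -0.0649495
se  <-  0.0227711
t_temp <- est / se
p_temp <- 2 * pt(-abs(t_temp), df_resid)    # two-sided p-value

# R^2-based quantities
r2     <- 0.5503
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

cat("t value for temp:", round(t_temp, 3), "\n")
cat("p-value for temp:", signif(p_temp, 4), "\n")
cat("Adjusted R^2:    ", round(adj_r2, 3), "\n")
cat("F statistic:     ", round(f_stat, 2), "\n")
```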
Interpretation
- coefficients and their standard errors (in original units)
- t-statistics & p-values
- $R^2$ and adjusted $R^2$
- residual standard error
- F statistic and p-value
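Because the response in poll.lm3 is log(SO2), one common way to read the coefficients is as multiplicative effects on SO2 itself. The sketch below back-transforms two of the printed estimates under that reading, holding the other predictors fixed; it is an illustration of the arithmetic rather than a conclusion drawn in the slides.

```r
# Estimates copied from summary(poll.lm3)
b_temp  <- -0.0649495            # coefficient of temp
b_firms <-  0.3698588            # coefficient of log(firms)

# One more unit of temp multiplies the fitted SO2 by exp(b_temp)
temp_mult <- exp(b_temp)

# With a logged predictor, doubling firms multiplies the fitted SO2 by 2^b_firms
firms_double <- 2^b_firms

cat("Multiplier per unit of temp:   ", round(temp_mult, 3), "\n")
cat("Multiplier for doubling firms: ", round(firms_double, 3), "\n")
```

Under these estimates, each additional unit of temp goes with roughly a 6% lower fitted SO2, and doubling firms with roughly a 29% higher fitted SO2, other predictors held fixed.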