Multiple Regression HH Chapter 9 Air Pollution Example Regression with Multiple Predictors Matrix Notation Added Variable Plots Multiple Regression HH Chapter 9 October 31, 2005

Multiple Regression

HH Chapter 9

October 31, 2005

Topics

- Regression with Two or More Predictors
- Matrix Version of Regression
- Hat Matrix & Leverage
- Added Variable Plots
- Interpretation

Air Pollution Data

- hh/datasets/usair.dat
- Response: SO2 measurements in 41 metropolitan areas
- Predictors:
  - temp
  - mgfirms
  - popn
  - wind
  - precip
  - raindays

Model?

Scatterplot Matrix: Original Variables

[Figure: scatterplot matrix of SO2, temp, mgfirms, popn, wind, precip, and raindays]

Correlations between Variables

          SO2   temp  firms   popn   wind precip   rain
SO2      1.00  -0.43   0.64   0.49   0.09   0.05   0.37
temp    -0.43   1.00  -0.19  -0.06  -0.35   0.39  -0.43
firms    0.64  -0.19   1.00   0.96   0.24  -0.03   0.13
popn     0.49  -0.06   0.96   1.00   0.21  -0.03   0.04
wind     0.09  -0.35   0.24   0.21   1.00  -0.01   0.16
precip   0.05   0.39  -0.03  -0.03  -0.01   1.00   0.50
rain     0.37  -0.43   0.13   0.04   0.16   0.50   1.00

Which explanatory variable leads to the “best” simple linear regression? What is its R²? Can we do “better” by including other variables (or transformations)?
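For a simple linear regression with a single predictor, R² equals the squared correlation between the response and that predictor, so the first row of the table can be read directly: firms has the largest correlation with SO2 in absolute value (0.64), which corresponds to R² of about 0.41. The sketch below checks the identity on simulated stand-in data; the data and the names `y`, `x1`, `x2` are assumptions for illustration, not the usair measurements.

```r
# Simulated stand-in data; NOT the air pollution measurements.
set.seed(1)
n  <- 41
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 + 0.5 * x2 + rnorm(n)

# R^2 from each simple linear regression ...
r2_lm  <- c(x1 = summary(lm(y ~ x1))$r.squared,
            x2 = summary(lm(y ~ x2))$r.squared)

# ... equals the squared correlation with the response.
r2_cor <- c(x1 = cor(y, x1)^2,
            x2 = cor(y, x2)^2)

print(rbind(r2_lm, r2_cor))

# Arithmetic for the correlation reported in the table for firms:
r2_firms <- 0.64^2
print(r2_firms)
```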

Multiple Regression with p Predictors

Model:

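- Observe data $\{Y_i, x_{i1}, \ldots, x_{ip}\}$, $i = 1, \ldots, n$

- $E[Y_i \mid x_{i1}, \ldots, x_{ip}] = f(x_{i1}, \ldots, x_{ip})$

- First approximation (first-order Taylor series):

$$E[Y_i \mid x_{i1}, \ldots, x_{ip}] \equiv \mu_i = \beta_0 + x_{i1}\beta_1 + \ldots + x_{ip}\beta_p$$

- Normal model:

$$Y_i \overset{ind}{\sim} N(\mu_i, \sigma^2) \iff Y_i = \beta_0 + x_{i1}\beta_1 + \ldots + x_{ip}\beta_p + \varepsilon_i, \quad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$$

- OLS (MLE): find $\beta_0, \ldots, \beta_p$ that minimize

$$\sum_i \left(Y_i - \{\beta_0 + x_{i1}\beta_1 + \ldots + x_{ip}\beta_p\}\right)^2 \equiv \sum_i e_i^2$$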
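As a rough check of the least-squares criterion, the sum of squared errors can be minimized numerically and compared with the coefficients returned by `lm()`. This is only a sketch on simulated data (the data, the helper `rss`, and the starting values are assumptions); `lm()` itself does not use a general-purpose optimizer.

```r
# Simulated stand-in data; NOT the air pollution measurements.
set.seed(2)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n, sd = 0.5)

# Sum of squared errors as a function of b = (beta0, beta1, beta2)
rss <- function(b) sum((y - b[1] - x1 * b[2] - x2 * b[3])^2)

# Closed-form least-squares fit
fit  <- lm(y ~ x1 + x2)
b_lm <- unname(coef(fit))

# Numerical minimization of the same criterion
b_optim <- optim(c(0, 0, 0), rss, method = "BFGS")$par

print(rbind(b_lm, b_optim))
```

The two rows should agree to several decimal places, and `rss(b_lm)` should be no larger than the value reached by the optimizer.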

Fitting Models in R

How should the response and predictors be transformed? The Box-Cox procedure can be used to find the “best” transformation of Y (for a given set of transformed predictors).

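# fit the model on the original scale of the response
poll.lm = lm(SO2 ~ temp + firms + popn + wind + precip + rain,
             data = pollution)

# plot diagnostics (R 2.2)
par(mfrow = c(2, 2))
plot(poll.lm, ask = FALSE)

# Box-Cox profile likelihood for a power transformation of the response
library(MASS)
boxcox(poll.lm)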
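Besides drawing the profile likelihood, `boxcox()` returns the grid of λ values and the corresponding log-likelihoods, so the maximizing λ can be read off. A minimal sketch, assuming simulated data in which log(y) is linear in the predictors (so λ near 0, the log transformation, is expected); the data frame `sim` and the λ grid are illustrative choices.

```r
library(MASS)   # provides boxcox()

# Simulated stand-in data; NOT the air pollution measurements.
set.seed(3)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
# Positive response generated so that log(y) is linear in the predictors
y  <- exp(1 + 0.5 * x1 - 0.3 * x2 + rnorm(n, sd = 0.3))
sim <- data.frame(y, x1, x2)

fit <- lm(y ~ x1 + x2, data = sim)

# plotit = FALSE returns the lambda grid (x) and log-likelihoods (y)
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)

# lambda with the highest profile log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]
print(lambda_hat)
```

In practice a convenient value inside the plotted confidence interval (0 for log, 0.5 for square root, 1 for no transformation) is usually preferred to the exact maximizer.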

Scatterplot - log response

[Figure: scatterplot matrix of log(SO2), temp, firms, popn, wind, precip, and rain]

Residuals

$$e_i = Y_i - \hat{Y}_i = Y_i - \{\hat\beta_0 + x_{i1}\hat\beta_1 + \ldots + x_{ip}\hat\beta_p\}$$

[Figure: four diagnostic plots, with fitted values on the original SO2 scale (0 to 100). Residuals vs Fitted, Normal Q-Q, and Scale-Location label cases 31, 30, and 26; Residuals vs Leverage (with Cook's distance contours at 0.5 and 1) labels cases 31, 1, and 25.]

Matrix Notation

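$$
\begin{aligned}
Y_1 &= 1\,\beta_0 + x_{11}\beta_1 + \ldots + x_{1p}\beta_p + \varepsilon_1 \\
Y_2 &= 1\,\beta_0 + x_{21}\beta_1 + \ldots + x_{2p}\beta_p + \varepsilon_2 \\
\vdots\; &= \;\vdots \\
Y_n &= 1\,\beta_0 + x_{n1}\beta_1 + \ldots + x_{np}\beta_p + \varepsilon_n
\end{aligned}
$$

$$Y = 1_n\beta_0 + X_1\beta_1 + \ldots + X_p\beta_p + \varepsilon$$

$$Y = X\beta + \varepsilon$$

where $X = [1_n \; X_1 \ldots X_p]$ is an $n \times (p + 1)$ matrix, $Y$ and $X_j$ are vectors of length $n$, and $\beta = (\beta_0, \ldots, \beta_p)^T$.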

MLE’s in Matrix Notation

The MLE of $\beta$ minimizes

$$Q(\beta) = (Y - X\beta)^T (Y - X\beta)$$

(equivalently, it maximizes $-Q(\beta)$); under the normal model this is the same as the OLS solution.

Solution: $\hat\beta = (X^T X)^{-1} X^T Y$
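The closed-form solution can be checked against `lm()` by building the design matrix explicitly. A minimal sketch on simulated data (the data frame `sim` and its columns are assumptions for illustration):

```r
# Simulated stand-in data; NOT the air pollution measurements.
set.seed(4)
n   <- 41
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 3 - sim$x1 + 0.5 * sim$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3, data = sim)

X <- model.matrix(fit)   # n x (p + 1); the first column is the vector of ones
Y <- sim$y

# Solve the normal equations (X'X) b = X'Y instead of inverting X'X explicitly
beta_hat <- solve(crossprod(X), crossprod(X, Y))

print(cbind(normal_equations = as.vector(beta_hat), lm = coef(fit)))
```

Solving the linear system is numerically preferable to computing the inverse; `lm()` itself goes further and uses a QR decomposition of X.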

Hat Matrix

$H \equiv X (X^T X)^{-1} X^T$ is an $n \times n$ projection matrix

- $H^T = H$ (symmetric)
- $HH = H^2 = H$ (idempotent)
- $HY = X (X^T X)^{-1} X^T Y = X\hat\beta = \hat{Y}$ (hence the name "hat matrix")
- $(I_n - H)$ is also a projection matrix ($I_n$ is the identity matrix)
- $(I_n - H) Y = Y - \hat{Y} = e$

$h_i$ is the leverage of case $i$ (the $i$th diagonal element of $H$). It measures how far the $i$th set of predictors is from the rest of the data (more in Chapter 11).
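The properties listed above can be verified numerically for a small problem by forming H directly. This is a sketch on simulated data (an assumption for illustration); forming the full n × n matrix is fine here but is not how `hatvalues()` computes leverages.

```r
# Simulated stand-in data; NOT the air pollution measurements.
set.seed(5)
n   <- 30
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + sim$x1 - sim$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = sim)
X   <- model.matrix(fit)

# Hat matrix H = X (X'X)^{-1} X'
H <- X %*% solve(crossprod(X)) %*% t(X)

is_symmetric  <- isTRUE(all.equal(H, t(H)))
is_idempotent <- isTRUE(all.equal(H %*% H, H))

# HY gives the fitted values and (I - H)Y gives the residuals
y_hat <- as.vector(H %*% sim$y)
e     <- as.vector((diag(n) - H) %*% sim$y)

# Leverages are the diagonal elements of H
h <- unname(diag(H))

print(c(symmetric = is_symmetric, idempotent = is_idempotent))
print(sum(h))   # trace of H equals the number of coefficients, p + 1
```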

Leverage

hatvalues(poll.lm)

[Figure: index plot of leverage (roughly 0.1 to 0.7) against case index (0 to 40)]

high leverage point if $h_i > 2(p + 1)/n$

Case 11?
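The rule of thumb is easy to apply in code. For the model in these slides (p = 6 predictors, n = 41 cases) the cutoff is 2(6 + 1)/41, about 0.34. The sketch below uses simulated data in which one case is deliberately moved far from the others in predictor space; the data and the choice of case are assumptions for illustration and say nothing about the actual usair cases.

```r
# Simulated stand-in data; NOT the air pollution measurements.
set.seed(6)
n   <- 41
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$x1[11] <- 6    # push one case far from the rest (illustration only)
sim$y <- 1 + sim$x1 + sim$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = sim)

h      <- hatvalues(fit)
p      <- length(coef(fit)) - 1      # number of predictors
cutoff <- 2 * (p + 1) / n

# Cases whose leverage exceeds the rule-of-thumb cutoff
high <- which(h > cutoff)
print(high)

# Index plot of leverage with the cutoff marked
plot(h, type = "h", xlab = "Case Index", ylab = "Leverage")
abline(h = cutoff, lty = 2)

# Cutoff for the model in the slides: p = 6, n = 41
cutoff_slides <- 2 * (6 + 1) / 41
print(cutoff_slides)
```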

New Model

poll.lm3 = lm(log(SO2) ~ temp + log(firms) + log(popn) + wind + precip + rain,
              data = pollution)

[Figure: four diagnostic plots for the new model, with fitted values from about 2.5 to 4.0. Residuals vs Fitted, Normal Q-Q, and Scale-Location label cases 37, 25, and 31; Residuals vs Leverage (with Cook's distance contours at 0.5 and 1) labels cases 25, 31, and 11.]

Added Variable Plots

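What is the effect of adding $X_j$ to the model after all the other $X$'s have been included?

- Regress $X_j$ on $X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_p$
- Find the residuals $X_j - \hat{X}_j \equiv X_{j|\cdot}$
- Regress $Y$ on $X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_p$
- Find the residuals $Y - \hat{Y}_{1, \ldots, j-1, j+1, \ldots, p} \equiv e_j$
- Plot $e_j$ versus $X_{j|\cdot}$
- The slope of the line is $\hat\beta_j$ from the regression on all the $X$'s (adjusted)
- Look for a need to transform, non-constant variance, outliers, etc.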
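The steps above can be carried out by hand with two auxiliary regressions. The sketch below does this for one predictor on simulated data (the data frame `sim` and the choice of `x3` as the added variable are assumptions for illustration) and confirms that the slope in the added-variable plot matches the coefficient from the full regression.

```r
# Simulated stand-in data; NOT the air pollution measurements.
set.seed(7)
n   <- 60
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$x3 <- 0.6 * sim$x1 + rnorm(n)     # correlated with x1
sim$y  <- 1 + sim$x1 - 0.5 * sim$x2 + 0.8 * sim$x3 + rnorm(n)

# Regression on all predictors
full <- lm(y ~ x1 + x2 + x3, data = sim)

# Added-variable construction for X_j = x3
x_res <- resid(lm(x3 ~ x1 + x2, data = sim))   # X_j | others
y_res <- resid(lm(y  ~ x1 + x2, data = sim))   # e_j (Y | others)

# Regress one set of residuals on the other and draw the plot
av_fit <- lm(y_res ~ x_res)
plot(x_res, y_res, xlab = "x3 | others", ylab = "y | others",
     main = "Added-Variable Plot")
abline(av_fit)

print(c(av_slope = unname(coef(av_fit)["x_res"]),
        full_coef = unname(coef(full)["x3"])))
```

The residuals from `av_fit` also coincide with the residuals from the full model, which is why outliers and non-constant variance in the full fit show up in this plot.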

Added Variable Plots in R

# use poll.lm3
library(car)   # library for "Companion to Applied Regression"

# av.plots() was renamed avPlots() in later versions of car
help(avPlots)
avPlots(poll.lm3)

av.plots

[Figure: six added-variable plots for poll.lm3, each plotting log(SO2) | others against one predictor | others: temp, log(firms), log(popn), wind, precip, and rain]

Model Fitting

EDA is used throughout:

- scatterplots
- Box-Cox or ladder of powers
- leverage plots
- residual plots
- added variable plots

Iterate model building until the “assumptions” (linearity and constant variance) seem plausible.

summary(poll.lm3) (abbreviated)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.7142760  1.6475086   4.075 0.000261 ***
temp        -0.0649495  0.0227711  -2.852 0.007333 **
log(firms)   0.3698588  0.1934076   1.912 0.064289 .
log(popn)   -0.1771293  0.2335520  -0.758 0.453428
wind        -0.1738606  0.0656713  -2.647 0.012204 *
precip       0.0156032  0.0132718   1.176 0.247893
rain         0.0009153  0.0057335   0.160 0.874104
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1

Residual standard error: 0.5108 on 34 degrees of freedom
Multiple R-Squared: 0.5503, Adjusted R-squared: 0.471
F-statistic: 6.936 on 6 and 34 DF, p-value: 7.12e-05
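Several entries in this output are simple functions of the others, which gives a quick consistency check. The sketch below recomputes the t value and p-value for temp, the adjusted R², and the F statistic from the printed numbers; small discrepancies are expected because the inputs are rounded.

```r
# Values copied from the summary(poll.lm3) output above
est_temp <- -0.0649495
se_temp  <-  0.0227711
df_resid <- 34
n        <- 41       # 41 metropolitan areas
p        <- 6        # six predictors
r2       <- 0.5503

# t value = estimate / standard error
t_temp <- est_temp / se_temp

# two-sided p-value from the t distribution with n - p - 1 = 34 df
p_temp <- 2 * pt(-abs(t_temp), df = df_resid)

# adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# overall F statistic written in terms of R-squared
# (close to, not exactly, the reported value because r2 is rounded)
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

print(round(c(t = t_temp, p_value = p_temp, adj_r2 = adj_r2, F = f_stat), 4))
```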

Interpretation

- coefficients and their standard errors (in original units)
- t-statistics & p-values
- R² and adjusted R²
- residual standard error
- F statistic and p-value
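Because the response in poll.lm3 is log(SO2), a coefficient is easier to read on the original SO2 scale after exponentiating. A hedged sketch using the printed estimates, assuming natural logarithms (the default for `log()` in R): for an untransformed predictor such as temp, exp(β) is the multiplicative change in fitted SO2 per one-unit increase; for a logged predictor such as log(firms), 2^β is the multiplicative change when the predictor doubles. All effects are with the other predictors held fixed.

```r
# Estimates copied from the summary(poll.lm3) output
b_temp  <- -0.0649495   # coefficient of temp
se_temp <-  0.0227711   # its standard error
b_firms <-  0.3698588   # coefficient of log(firms)

# Multiplicative change in fitted SO2 for a one-unit increase in temp
mult_temp <- exp(b_temp)
pct_temp  <- 100 * (mult_temp - 1)    # the same effect as a percent change

# Multiplicative change in fitted SO2 when firms doubles
mult_firms_double <- 2^b_firms

# Approximate 95% interval for the temp effect on the multiplicative scale,
# using the residual degrees of freedom (34) from the output
ci_temp <- exp(b_temp + c(-1, 1) * qt(0.975, df = 34) * se_temp)

print(round(c(mult_temp = mult_temp, pct_temp = pct_temp,
              mult_firms_double = mult_firms_double), 3))
print(round(ci_temp, 3))
```

With these estimates, fitted SO2 falls by roughly 6% per unit of temp and rises by roughly 29% when the number of firms doubles, although the log(firms) coefficient is not significant at the 5% level.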