Transcript of Chapter 12: Multiple Linear Regression

Page 1:

Chapter 12: Multiple Linear Regression

An in-depth look at the science and art of multiple regression!

Chapter 12B: A study of the statistical relationship among variables.

Page 2:

Now, the rest of the story…

Page 3:

Ponder This…

The Gauss-Markov Theorem: least-squares estimates are unbiased and have minimum variance among all unbiased linear estimates.

That is, they are BLUE: the best linear unbiased estimator of any linear combination of the coefficients is its least-squares estimator.

Page 4:

Confidence Intervals on Individual Regression Coefficients

For $j = 0, 1, \ldots, k$, the statistic

$$T = \frac{\hat{\beta}_j - \beta_j}{\sqrt{\hat{\sigma}^2 C_{jj}}}$$

has a $t$ distribution with $n - p$ degrees of freedom, where $C_{jj}$ is the $jj$ element of $(X^T X)^{-1}$.

$P(-t_{\alpha/2,\,n-p} \le T \le t_{\alpha/2,\,n-p}) = 1 - \alpha$ implies that

$$\hat{\beta}_j - t_{\alpha/2,\,n-p}\sqrt{\hat{\sigma}^2 C_{jj}} \;\le\; \beta_j \;\le\; \hat{\beta}_j + t_{\alpha/2,\,n-p}\sqrt{\hat{\sigma}^2 C_{jj}}$$

In Minitab-speak, $\sqrt{\hat{\sigma}^2 C_{jj}}$ is the s.e. coef., $se(\hat{\beta}_j)$.
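A minimal numerical sketch of this interval, assuming numpy/scipy and a model matrix X whose first column is all ones (my illustration, not the textbook's code):

```python
import numpy as np
from scipy import stats

def coef_confidence_intervals(X, y, alpha=0.05):
    """Least-squares estimates and (1 - alpha) CIs for each beta_j."""
    n, p = X.shape                                 # p = k + 1 parameters
    XtX_inv = np.linalg.inv(X.T @ X)               # the C matrix
    beta_hat = XtX_inv @ X.T @ y                   # least-squares estimates
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)           # MSE = SSE / (n - p)
    se = np.sqrt(sigma2_hat * np.diag(XtX_inv))    # sqrt(sigma^2 * C_jj)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - p)  # t_{alpha/2, n-p}
    return beta_hat, beta_hat - t_crit * se, beta_hat + t_crit * se
```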

Page 5:

Confidence Interval on the Mean Response

The mean response at a point $x_0$ is estimated by $\hat{\mu}_{Y|x_0} = x_0^T \hat{\beta}$.

The variance of the estimated mean response is $\sigma^2\, x_0^T (X^T X)^{-1} x_0$, which gives the confidence interval

$$\hat{\mu}_{Y|x_0} - t_{\alpha/2,\,n-p}\sqrt{\hat{\sigma}^2\, x_0^T (X^T X)^{-1} x_0} \;\le\; \mu_{Y|x_0} \;\le\; \hat{\mu}_{Y|x_0} + t_{\alpha/2,\,n-p}\sqrt{\hat{\sigma}^2\, x_0^T (X^T X)^{-1} x_0}$$

Page 6:

12-4 Prediction of New Observations

A new observation at the point $x_0$ is predicted by $\hat{y}_0 = x_0^T \hat{\beta}$ (the prediction itself). The prediction interval is

$$\hat{y}_0 - t_{\alpha/2,\,n-p}\sqrt{\hat{\sigma}^2\left(1 + x_0^T (X^T X)^{-1} x_0\right)} \;\le\; Y_0 \;\le\; \hat{y}_0 + t_{\alpha/2,\,n-p}\sqrt{\hat{\sigma}^2\left(1 + x_0^T (X^T X)^{-1} x_0\right)}$$

The '1' accounts for the variability of the data itself.
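A companion sketch for both intervals at a new point $x_0$, under the same assumptions as the snippet on page 4 ($x_0$ includes the leading 1); the only difference between the CI and the PI is the '1 +' term:

```python
import numpy as np
from scipy import stats

def mean_ci_and_pi(X, y, x0, alpha=0.05):
    """(CI on the mean response, prediction interval) at x0."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)
    y0_hat = x0 @ beta_hat                  # estimated mean response / prediction
    h00 = x0 @ XtX_inv @ x0                 # x0' (X'X)^-1 x0
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - p)
    ci_half = t_crit * np.sqrt(sigma2_hat * h00)        # interval on the mean
    pi_half = t_crit * np.sqrt(sigma2_hat * (1 + h00))  # the '1' adds data noise
    return (y0_hat - ci_half, y0_hat + ci_half), (y0_hat - pi_half, y0_hat + pi_half)
```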

Page 7:

NFL: The Problem Still Continues (12-47)

(a) Find 95% confidence intervals on the regression coefficients

(b) What is the estimated standard error of $\hat{\mu}_{Y|x_0}$ when the percentage of completions is 60%, the percentage of TD's is 4%, and the percentage of interceptions is 3%?

(c) Find a 95% confidence interval on the mean rating when the percentage of completions is 60%, the percentage of TD's is 4%, and the percentage of interceptions is 3%.

Page 8:

CI on the Coefficients

$$\hat{\beta}_j - t_{\alpha/2,\,n-p}\sqrt{\hat{\sigma}^2 C_{jj}} \;\le\; \beta_j \;\le\; \hat{\beta}_j + t_{\alpha/2,\,n-p}\sqrt{\hat{\sigma}^2 C_{jj}}$$

Page 9:

CI on the Mean Response

X0 = (1, 60, 4, 3), with the percentages entered in the same units as the data.

Page 10:

Minitab and the NFL

Rating Pts = 3.22 + 1.22 Pct Comp + 4.42 Pct TD - 4.09 Pct Int

Predictor      Coef    StDev      T      P
Constant      3.220    6.519   0.49  0.626
Pct Comp     1.2243   0.1080  11.33  0.000
Pct TD       4.4231   0.2799  15.80  0.000
Pct Int     -4.0914   0.4953  -8.26  0.000

S = 1.921   R-Sq = 97.8%   R-Sq(adj) = 97.5%

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       3  4196.3  1398.8  379.07  0.000
Residual Error  26    95.9     3.7
Total           29  4292.2

Predicted Values
   Fit  StDev Fit          95.0% CI            95.0% PI
82.096      0.387  (81.300, 82.892)    (78.068, 86.124)

X0 = (1, 60, 4, 3), with the percentages entered in the same units as the data.

Page 11:

The Textbook Point Regarding Extrapolation

In the single variable case it was really easy to tell when you were moving beyond the range of the original data.

For the multi-variable case we have a region. It is easy to be within the bounds on every variable in a marginal sense, yet still be far outside the region of the data.

For example, being 6'2" tall may not be unusual, and weighing 120 lbs is not an outlier; but being 6'2" and 120 lbs together would put you outside the range of many data samples from the population.
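One standard way to make this concrete (a sketch, not from the slides) is to compare the leverage of a new point with the leverages of the training data: if $x_0^T (X^T X)^{-1} x_0$ exceeds the largest hat-matrix diagonal $h_{ii}$, predicting at $x_0$ involves hidden extrapolation even when each coordinate is individually in range.

```python
import numpy as np

def is_hidden_extrapolation(X, x0):
    """True if x0's leverage exceeds every training point's leverage h_ii."""
    XtX_inv = np.linalg.inv(X.T @ X)
    h_data = np.einsum('ij,jk,ik->i', X, XtX_inv, X)  # hat diagonals h_ii
    h00 = x0 @ XtX_inv @ x0                           # leverage of the new point
    return h00 > h_data.max()
```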

Page 12:

Extrapolation in multiple regression

Page 13:

12-5 Model Adequacy Checking

R2 and Residual Analysis

Figure 12-6 Normal probability plot of residuals

Page 14:

Residual Analysis: two varieties, standardized and studentized.

Standardized: divide the 'raw' residual by the square root of MSE. These are pretty simple to understand.

Studentized: divide the 'raw' residual by its estimated standard deviation, $\sqrt{\hat{\sigma}^2(1 - h_{ii})}$, which accounts for the variance of the fitted value at that point.

Don’t be concerned about this one.
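For the curious anyway, a minimal numpy sketch of both varieties (my illustration, using the hat matrix $H = X(X^TX)^{-1}X^T$):

```python
import numpy as np

def residual_diagnostics(X, y):
    """Standardized and studentized residuals for a least-squares fit."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
    resid = y - H @ y                          # raw residuals
    mse = resid @ resid / (n - p)
    standardized = resid / np.sqrt(mse)                    # divide by sqrt(MSE)
    studentized = resid / np.sqrt(mse * (1 - np.diag(H)))  # accounts for h_ii
    return standardized, studentized
```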

Page 15:

NFL: Yes, Continuing (12-53)

(a) What proportion of total variability is explained by this model?

(b) Plot the residuals versus y-hat and each regressor

(c) Construct a normal probability plot of the residuals

Page 16:

The Coefficient of Multiple Determination (yes, $R^2$)

$$R^2 = \frac{SS_R}{SS_T} = \frac{4196.3}{4292.2} = .9777$$

$$R^2_{\text{adj}} = 1 - \frac{SS_E/(n-p)}{SS_T/(n-1)} = 1 - \frac{3.7}{4292.2/29} = 1 - .025 = .975$$

Alternatively, read it off the output report: S = 1.921, R-Sq = 97.8%, R-Sq(adj) = 97.5%.
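A two-line check of this arithmetic, using the ANOVA numbers from the Minitab output (SSR = 4196.3, SSE = 95.9, SST = 4292.2, n = 30, p = 4):

```python
# R^2 and adjusted R^2 from the ANOVA table sums of squares.
ssr, sse, sst, n, p = 4196.3, 95.9, 4292.2, 30, 4
r2 = ssr / sst                                   # 0.9777
r2_adj = 1 - (sse / (n - p)) / (sst / (n - 1))   # 0.9751
print(round(r2, 4), round(r2_adj, 4))
```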

Page 17:

Normal Probability Plot

[Figure: Normal Probability Plot of the Residuals (response is Rating P); normal score versus standardized residual.]

Page 18:

Residual Plot 1

[Figure: Residuals Versus Pct Int (response is Rating P); standardized residual versus Pct Int.]

Page 19:

Residual Plot 2

[Figure: Residuals Versus Pct TD (response is Rating P); standardized residual versus Pct TD.]

Page 20:

Residual Plot 3

[Figure: Residuals Versus Pct Comp (response is Rating P); standardized residual versus Pct Comp.]

Page 21:

Residual Plot 4

[Figure: Residuals Versus Rating P, the fitted values (response is Rating P); standardized residual versus fitted Rating P.]

Page 22:

Modeling with Multiple Regression

A fascinating look at the variety of mathematical models that are accommodated with multiple regression.

Page 23:

Polynomial Regression Models
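The slide's figures did not survive extraction, but the key idea is that a polynomial model such as $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon$ is still linear in the coefficients, so it fits with the ordinary multiple regression machinery using columns $x$ and $x^2$. A tiny sketch (with made-up illustrative data, not Example 12-12's):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.8, 4.9, 10.1, 16.8, 26.2])       # made-up data for illustration
X = np.column_stack([np.ones_like(x), x, x**2])  # columns: 1, x, x^2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # beta0, beta1, beta2
```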

Page 24:

Example 12-12 – the data

Page 25:

Example 12-12 – the plot

Page 26:

Example 12-12 – the model

Page 27:

Example 12-12 – the solution

Page 28:

Categorical Regressors and Indicator Variables

Indicator (0,1) variables are normally used to represent the levels of a qualitative (categorical) variable.

It is not really a binary arithmetic type of representation. Rather, r levels are represented by (r-1) indicator variables.

A binary-arithmetic coding of r = 5 levels with only 3 variables:

x1 x2 x3
 0  0  0   level 1
 0  0  1   level 2
 0  1  0   level 3
 0  1  1   level 4
 1  0  0   level 5

No: the estimate for level 4 is mixed with those for levels 2 and 3.

The indicator coding of r = 5 levels with r - 1 = 4 variables:

x1 x2 x3 x4
 0  0  0  0   level 1
 1  0  0  0   level 2
 0  1  0  0   level 3
 0  0  1  0   level 4
 0  0  0  1   level 5

Yes: all estimates are independent. (A pandas illustration follows.)
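In Python (the slides use Minitab), pandas can generate the (r - 1)-indicator coding directly; a small illustration:

```python
import pandas as pd

tool = pd.Series(["level1", "level2", "level3", "level4", "level5"])
dummies = pd.get_dummies(tool, drop_first=True)  # r levels -> r - 1 columns
print(dummies)   # the level1 row is all zeros: it serves as the baseline
```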

Page 29:

Example 12-13

Page 30:

The Data

Page 31:

More of the example

Page 32:

The Data in Matrix-vector form

Page 33:

The Results

Page 34:

Why not two models?

Common slopes: pooling the data gives a better MSE estimate (more degrees of freedom).

When all the variables are qualitative, this approach leads to analysis of variance (ANOVA), next chapter.

Page 35:

Selection of Variables and Model Building

It is not always obvious which combinations of variables to include.

Overfitting a model results in higher variance of predictions, while creating the illusion of a better model in the limited context of the current sample. As new columns are added to the X matrix, the variance of the predictions will increase.

When a regression model is primarily an empirical one, selecting variables can be difficult. If there are theoretical reasons for including particular variables, the task may be simpler; those variables are a natural starting point.

Page 36:

More on selecting variables:

- Overall F-test.
- Look for significance of the t-statistic, $\hat{\beta}_j / se(\hat{\beta}_j)$.
- Examine $R^2$ and adjusted $R^2$: not necessarily to maximize them, but to find the point where adding more variables yields only a small increase.
- Look at the MSE, $SS_E/(n-p)$: want a decrease as variables are added.
- $C_p$ criterion: total mean square error (formula below).
- All possible regressions: evaluate by adjusted $R^2$.
- Stepwise regression: F-in and F-out; forward selection and backward elimination.
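For reference, the $C_p$ statistic mentioned above is commonly defined with the full-model MSE standing in for $\sigma^2$; a model with little bias should have $C_p$ close to $p$:

$$C_p = \frac{SS_E(p)}{\hat{\sigma}^2} - n + 2p$$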

Page 37:

All Possible Regressions: consider a regression problem with k candidate predictor variables. The number of regressions becomes a combinatorial problem.

There are $2^k - 1$ distinct subsets to be tested (counting the full set but excluding the empty set, which corresponds to the mean-only model).

For example, with 10 potential independent variables the number of subsets to be tested is $2^{10} - 1 = 1023$; with 20 potential independent variables it is $2^{20} - 1$, more than one million.

Each of $x_1, x_2, x_3, \ldots, x_k$ is either in or out: $2 \times 2 \times 2 \times \cdots \times 2 = 2^k$ choices.
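A quick way to enumerate (and count) the subsets, confirming the figures above:

```python
from itertools import combinations

def all_subsets(predictors):
    """Yield every non-empty subset of the candidate predictors."""
    for r in range(1, len(predictors) + 1):
        yield from combinations(predictors, r)

k = 10
subsets = list(all_subsets([f"x{j}" for j in range(1, k + 1)]))
print(len(subsets))   # 1023 = 2**10 - 1
```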

Page 38:

Stepwise Regression

1. Begin by picking the single predictor with the largest F-statistic (highest correlation to the response variable Y).

2. Examine the remaining variables to find the one with the highest partial F-statistic.

$$F_j = \frac{SS_R(\beta_j \mid \beta_1, \beta_0)}{MS_E(x_1, x_j)}$$

3. If Fj > Fin include variable j. Otherwise stop.

4. Calculate the partial F-statistics for the predictors currently in the model.

5. Let Fk be the lowest partial F-statistic. If Fk < Fout, remove variable k from the model. Go to step 2.
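Below is a bare-bones forward-selection sketch in Python (my illustration, not Minitab's routine; a full stepwise algorithm would also apply the F-out removal check of step 5). The f_in threshold of 4.0 is just an illustrative default.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def forward_select(X_cols, y, f_in=4.0):
    """X_cols: dict of name -> 1-D predictor array. Returns chosen names."""
    n = len(y)
    chosen, remaining = [], list(X_cols)
    while remaining:
        base = np.column_stack([np.ones(n)] + [X_cols[c] for c in chosen])
        sse_base = sse(base, y)
        best_f, best_var = -np.inf, None
        for c in remaining:
            Xc = np.column_stack([base, X_cols[c]])
            sse_c = sse(Xc, y)
            mse_c = sse_c / (n - Xc.shape[1])
            f = (sse_base - sse_c) / mse_c      # partial F for adding c
            if f > best_f:
                best_f, best_var = f, c
        if best_f < f_in:                        # nothing clears F-in: stop
            break
        chosen.append(best_var)
        remaining.remove(best_var)
    return chosen
```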

Page 39:

Stepwise Regression - comments

Fundamental technique, widely used, and very handy for exploring a set of potential predictor variables.

Minitab's implementation allows you to specify a set of predictors to include automatically. All other decisions begin after this set of predictors is included in the model.

Forward Selection is a variant of stepwise regression that continues to add variables until the F statistic falls below Fin.

Backward Elimination starts with the full set of predictors and works in the other direction – eliminating them until there is no partial F statistic below Fout.

If possible, use the all-possible-regressions method and evaluate by minimum MSE or maximum adjusted $R^2$.

Page 40:

Data, data, we need data: quarterback ratings for the 2004 NFL season.

Player                   Att  Comp  Pct    Yds   Yds/Att  TD  Pct TD  Long  Int  Pct Int  Rating
D.Culpepper, MIN         548   379  69.2  4717     8.61   39     7.1    82   11      2     110.9
D.McNabb, PHI            469   300  64    3875     8.26   31     6.6    80    8      1.7   104.7
B.Griese, TAM            336   233  69.3  2632     7.83   20     6      68   12      3.6    97.5
M.Bulger, STL            485   321  66.2  3964     8.17   21     4.3    56   14      2.9    93.7
B.Favre, GBP             540   346  64.1  4088     7.57   30     5.6    79   17      3.1    92.4
J.Delhomme, CAR          533   310  58.2  3886     7.29   29     5.4    63   15      2.8    87.3
K.Warner, NYG            277   174  62.8  2054     7.42    6     2.2    62    4      1.4    86.5
M.Hasselbeck, SEA        474   279  58.9  3382     7.14   22     4.6    60   15      3.2    83.1
A.Brooks, NOS            542   309  57    3810     7.03   21     3.9    57   16      3      79.5
T.Rattay, SFX            325   198  60.9  2169     6.67   10     3.1    65   10      3.1    78.1
M.Vick, ATL              321   181  56.4  2313     7.21   14     4.4    62   12      3.7    78.1
J.Harrington, DET        489   274  56    3047     6.23   19     3.9    62   12      2.5    77.5
V.Testaverde, DAL        495   297  60    3532     7.14   17     3.4    53   20      4      76.4
P.Ramsey, WAS            272   169  62.1  1665     6.12   10     3.7    51   11      4      74.8
J.McCown, ARI            408   233  57.1  2511     6.15   11     2.7    48   10      2.5    74.1
P.Manning, IND           497   336  67.6  4557     9.17   49     9.9    80   10      2     121.1
D.Brees, SDC             400   262  65.5  3159     7.9    27     6.8    79    7      1.8   104.8
B.Roethlisberger, PIT    295   196  66.4  2621     8.88   17     5.8    58   11      3.7    98.1
T.Green, KAN             556   369  66.4  4591     8.26   27     4.9    70   17      3.1    95.2
T.Brady, NEP             474   288  60.8  3692     7.79   28     5.9    50   14      3      92.6
C.Pennington, NYJ        370   242  65.4  2673     7.22   16     4.3    48    9      2.4    91
B.Volek, TEN             357   218  61.1  2486     6.96   18     5      48   10      2.8    87.1
J.Plummer, DEN           521   303  58.2  4089     7.85   27     5.2    85   20      3.8    84.5
D.Carr, HOU              466   285  61.2  3531     7.58   16     3.4    69   14      3      83.5
B.Leftwich, JAC          441   267  60.5  2941     6.67   15     3.4    65   10      2.3    82.2
C.Palmer, CIN            432   263  60.9  2897     6.71   18     4.2    76   18      4.2    77.3
J.Garcia, CLE            252   144  57.1  1731     6.87   10     4      99    9      3.6    76.7
D.Bledsoe, BUF           450   256  56.9  2932     6.52   20     4.4    69   16      3.6    76.6
K.Collins, OAK           513   289  56.3  3495     6.81   21     4.1    63   20      3.9    74.8
K.Boller, BAL            464   258  55.6  2559     5.52   13     2.8    57   11      2.4    70.9

Page 41:

NFL: Yes, Yes (12-81)

Use the football data to build regression models using

(a) all possible regressions
(b) stepwise regression
(c) forward selection
(d) backward elimination

off we go to Minitab

Page 42:

Danger … Danger: Multicollinearity

the existence of a high degree of linear correlation among two or more explanatory variables

In the presence of multicollinearity, it is difficult to assess the effect of the independent variables on the dependent variable

x1 and x2 are collinear if $x_1 = k x_2$.

In that case the X matrix has rank less than p, and the $X^T X$ matrix does not have an inverse (see the demonstration below).

We rarely face perfect multicollinearity in a data set.
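A toy demonstration of the rank problem (numpy; values chosen so that x2 = 2·x1 exactly):

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = 2 * x1                              # perfectly collinear with x1
X = np.column_stack([np.ones(4), x1, x2])

print(np.linalg.matrix_rank(X))          # 2, not 3: rank < p
try:
    np.linalg.inv(X.T @ X)               # raises for an exactly singular matrix
except np.linalg.LinAlgError as e:
    print("X'X is not invertible:", e)
```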

Page 43:

Detecting Multicollinearity

Insignificant regression coefficients for the affected variables in the multiple regression, but rejection of the hypothesis that those coefficients are jointly insignificant (using an F-test)

Large changes in the estimated regression coefficients when a predictor variable is added or deleted

Large changes in the estimated regression coefficients when an observation is added or deleted

Page 44:

The Problem of Multicollinearity

The usual interpretation of a regression coefficient is that it estimates the effect on the response of a one-unit change in an independent variable, holding the other variables fixed.

With multicollinearity, the estimate of one variable's impact on y tends to be less precise than if predictors were uncorrelated with one another.

If nominally "different" measures actually quantify the same phenomenon then they are redundant.

One of the features of multicollinearity is that the standard errors of the affected coefficients tend to be large.

the hypothesis test that the coefficient is equal to zero, against the alternative that it is not, leads to a failure to reject the null hypothesis.

if a simple linear regression of the dependent variable on this explanatory variable is estimated, the coefficient will be found to be significant

The best regression models are those in which the predictor variables each correlate highly with the dependent variable but correlate at most only minimally with each other.

Such a model is often called "low noise" and will be statistically robust (that is, it will predict reliably across numerous samples of variable sets drawn from the same statistical population).

Page 45:

Overfitting

A statistical model that has too many parameters. An absurd and false model may fit perfectly if the model has enough complexity by comparison to the amount of data available.

Overfitting is generally recognized to be a violation of Occam's razor.

When the degrees of freedom in parameter selection exceed the information content of the data, this leads to arbitrariness in the fitted model parameters, which reduces or destroys the ability of the model to generalize beyond the fitting data.

Page 46:

Occam's Razor: a principle attributed to the 14th-century English logician and Franciscan friar William of Ockham.

The principle states that the explanation of any phenomenon should make as few assumptions as possible, eliminating those that make no difference in the observable predictions of the explanatory hypothesis or theory.

The principle is often expressed as the "law of parsimony" or "law of succinctness".

This is often paraphrased as "All other things being equal, the simplest solution is the best."

When multiple competing theories are equal in other respects, the principle recommends selecting the theory that introduces the fewest assumptions and postulates the fewest entities.

Page 47:

Other Coefficient Estimates

Least squares is not the only criterion. One alternative minimizes the sum of the absolute errors:

$$\min_{\beta_j}\; \sum_{i=1}^{n} \left| y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij} \right|$$

Another minimizes the maximum absolute error:

$$\min_{\beta_j}\; \max_{i} \left| y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij} \right|$$
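A rough numerical sketch of both criteria, minimized directly with scipy (illustrative only; production least-absolute-deviations and minimax fits are usually posed as linear programs):

```python
import numpy as np
from scipy.optimize import minimize

def fit_alternative(X, y, criterion="lad"):
    """Minimize sum |e_i| ('lad') or max |e_i| ('minimax') over beta."""
    beta0, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares start
    if criterion == "lad":
        obj = lambda b: np.abs(y - X @ b).sum()     # min sum of |errors|
    else:
        obj = lambda b: np.abs(y - X @ b).max()     # min max |error|
    return minimize(obj, beta0, method="Nelder-Mead").x
```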

Page 48:

The End of Regression