Chapter 12: Multiple Linear Regression
An in-depth look at the science and art of multiple regression!
Chapter 12B
A study of the statistical relationship among variables.
Now, the rest of the story…
Ponder This…
The Gauss-Markov Theorem: among all linear unbiased estimators, the least-squares estimates have minimum variance.
That is, the least-squares estimator of any linear combination of the coefficients is the best linear unbiased estimator (BLUE).
Confidence Intervals on Individual Regression Coefficients
The statistic

    T = (β̂_j − β_j) / √(σ̂² C_jj),   j = 0, 1, …, k

has a t distribution with n − p degrees of freedom, where C_jj is the jj-th diagonal element of (X'X)^(-1).

P(−t_{α/2, n−p} ≤ T ≤ t_{α/2, n−p}) = 1 − α implies the 100(1 − α)% confidence interval

    β̂_j − t_{α/2, n−p} √(σ̂² C_jj)  ≤  β_j  ≤  β̂_j + t_{α/2, n−p} √(σ̂² C_jj)

In Minitab-speak, the standard error √(σ̂² C_jj) is the "s.e. coef.", se(β̂_j).
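The interval above can be checked numerically. The sketch below is a minimal illustration on synthetic data (the data set, seed, and coefficient values are made up, not from a textbook example):

```python
import numpy as np
from scipy import stats

# Synthetic data: n = 30 observations, k = 2 regressors plus an intercept.
rng = np.random.default_rng(0)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # n x p design matrix
y = X @ np.array([3.0, 1.5, -2.0]) + rng.normal(size=n)

p = X.shape[1]
C = np.linalg.inv(X.T @ X)            # C = (X'X)^(-1)
beta_hat = C @ X.T @ y                # least-squares estimates
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)  # MSE, the estimate of sigma^2

t_crit = stats.t.ppf(0.975, df=n - p)        # t_{alpha/2, n-p} for alpha = .05
se = np.sqrt(sigma2_hat * np.diag(C))        # se(beta_hat_j) = sqrt(sigma^2-hat * C_jj)
lower = beta_hat - t_crit * se
upper = beta_hat + t_crit * se
for j in range(p):
    print(f"beta_{j}: {beta_hat[j]:7.3f}   95% CI ({lower[j]:.3f}, {upper[j]:.3f})")
```

Each interval is the point estimate plus or minus t_{α/2, n−p} standard errors, exactly as in the formula.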
Confidence Interval on the Mean Response
The mean response at a point x0 is estimated by μ̂_{Y|x0} = x0' β̂.
The variance of the estimated mean response is σ² x0' (X'X)^(-1) x0.
A 100(1 − α)% confidence interval:

    μ̂_{Y|x0} − t_{α/2, n−p} √(σ̂² x0'(X'X)^(-1) x0)  ≤  μ_{Y|x0}  ≤  μ̂_{Y|x0} + t_{α/2, n−p} √(σ̂² x0'(X'X)^(-1) x0)
12-4 Prediction of New Observations
The prediction itself is ŷ0 = x0' β̂.
A 100(1 − α)% prediction interval on a new observation Y0:

    ŷ0 − t_{α/2, n−p} √(σ̂² (1 + x0'(X'X)^(-1) x0))  ≤  Y0  ≤  ŷ0 + t_{α/2, n−p} √(σ̂² (1 + x0'(X'X)^(-1) x0))

The '1' accounts for the variability of the new observation itself.
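Both intervals at a point x0 can be sketched the same way; everything below is synthetic and illustrative. The prediction interval differs only by the '1' added inside the square root:

```python
import numpy as np
from scipy import stats

# Synthetic fit (made-up data, not the textbook example).
rng = np.random.default_rng(1)
n = 30
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, 2))])
y = X @ np.array([2.0, 0.5, 1.0]) + rng.normal(size=n)

p = X.shape[1]
C = np.linalg.inv(X.T @ X)
beta_hat = C @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)
t_crit = stats.t.ppf(0.975, df=n - p)

x0 = np.array([1.0, 5.0, 5.0])   # point of interest (leading 1 for the intercept)
yhat0 = x0 @ beta_hat            # estimated mean response / point prediction
h0 = x0 @ C @ x0                 # x0'(X'X)^(-1) x0

ci = (yhat0 - t_crit * np.sqrt(sigma2_hat * h0),
      yhat0 + t_crit * np.sqrt(sigma2_hat * h0))        # CI on the mean response
pi = (yhat0 - t_crit * np.sqrt(sigma2_hat * (1 + h0)),
      yhat0 + t_crit * np.sqrt(sigma2_hat * (1 + h0)))  # PI on a new observation
print("CI:", ci)
print("PI:", pi)
```

The PI is always wider than the CI because of that extra '1'.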
NFL The Problem still continues (12-47)
(a) Find 95% confidence intervals on the regression coefficients
(b) What is the estimated standard error of μ̂_{Y|x0} when the percentage of completions is 60%, the percentage of TD's is 4%, and the percentage of interceptions is 3%?
(c) Find a 95% confidence interval on the mean rating when the percentage of completions is 60%, the percentage of TD’s is 4%, and the percentage of interceptions is 3%.
CI on the Coefficients

    β̂_j − t_{α/2, n−p} √(σ̂² C_jj)  ≤  β_j  ≤  β̂_j + t_{α/2, n−p} √(σ̂² C_jj)
CI on the Mean Response
X0 = (1, .6, .04, .03)
Minitab and the NFL
Rating Pts = 3.22 + 1.22 Pct Comp + 4.42 Pct TD - 4.09 Pct Int
Predictor    Coef      StDev    T       P
Constant     3.220     6.519    0.49    0.626
Pct Comp     1.2243    0.1080   11.33   0.000
Pct TD       4.4231    0.2799   15.80   0.000
Pct Int     -4.0914    0.4953   -8.26   0.000

S = 1.921   R-Sq = 97.8%   R-Sq(adj) = 97.5%

Analysis of Variance
Source           DF   SS       MS       F        P
Regression        3   4196.3   1398.8   379.07   0.000
Residual Error   26     95.9      3.7
Total            29   4292.2

Predicted Values
Fit      StDev Fit   95.0% CI             95.0% PI
82.096   0.387       (81.300, 82.892)     (78.068, 86.124)
X0 = (1, .6, .04, .03)
The Textbook Point Regarding Extrapolation
In the single-variable case it was really easy to tell when you were moving beyond the range of the original data.
For the multi-variable case we have a region. It is easy to be within the bounds on every variable in a marginal sense, yet be way outside the region of the data.
For example, being 6'2" tall may not be unusual, and weighing 120 lbs is not an outlier; but being both 6'2" and 120 lbs would put you outside the range of many data samples from the population.
Extrapolation in multiple regression
12-5 Model Adequacy Checking
R2 and Residual Analysis
Figure 12-6 Normal probability plot of residuals
Residual Analysis: two varieties – standardized and studentized.
Standardized: we divide the 'raw' residual by the square root of MSE. These are pretty simple to understand.
Studentized: we divide the 'raw' residual by an estimate of its own standard deviation, √(MSE(1 − h_ii)), where h_ii is the ith diagonal element of the hat matrix; this accounts for the variance of the fitted value at that point.
Don't be concerned about this one.
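Both residual types can be sketched on synthetic data (the data and seed below are illustrative; h_ii is the ith diagonal of the hat matrix H = X(X'X)^(-1)X'):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

p = X.shape[1]
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
e = y - H @ y                          # raw residuals
mse = e @ e / (n - p)

d = e / np.sqrt(mse)                   # standardized: raw residual / sqrt(MSE)
h = np.diag(H)                         # leverages h_ii
r = e / np.sqrt(mse * (1 - h))         # studentized: accounts for each point's leverage
```

Since 0 < h_ii < 1, each studentized residual is at least as large in magnitude as its standardized counterpart.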
NFL Yes, continuing (12-53)
(a) What proportion of total variability is explained by this model?
(b) Plot the residuals versus ŷ and each regressor.
(c) Construct a normal probability plot of the residuals.
The Coefficient of Multiple Determination, yes R2
    R² = SS_R / SS_T = 4196.3 / 4292.2 = 0.9777

    R²_adj = 1 − [SS_E / (n − p)] / [SS_T / (n − 1)] = 1 − 3.7 / (4292.2 / 29) = 1 − 0.025 = 0.975
alternately, read it off the output report:S = 1.921 R-Sq = 97.8% R-Sq(adj) = 97.5%
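The arithmetic can be reproduced directly from the ANOVA table above (n = 30 quarterbacks, p = 4 parameters including the intercept):

```python
ss_r, ss_e, ss_t = 4196.3, 95.9, 4292.2   # from the Minitab ANOVA table
n, p = 30, 4

r2 = ss_r / ss_t
r2_adj = 1 - (ss_e / (n - p)) / (ss_t / (n - 1))
print(round(r2, 4), round(r2_adj, 4))     # 0.9777 0.9751
```

These round to the 97.8% and 97.5% figures on the Minitab report.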
Normal Probability Plot
[Figure: Normal Probability Plot of the Residuals (response is Rating P) – Normal Score vs. Standardized Residual]
Residual Plot 1
[Figure: Residuals Versus Pct Int (response is Rating P) – Standardized Residual vs. Pct Int]
Residual Plot 2
[Figure: Residuals Versus Pct TD (response is Rating P) – Standardized Residual vs. Pct TD]
Residual Plot 3
[Figure: Residuals Versus Pct Comp (response is Rating P) – Standardized Residual vs. Pct Comp]
Residual Plot 4
[Figure: Residuals Versus Rating P (response is Rating P) – Standardized Residual vs. Rating P]
Modeling with Multiple Regression
A fascinating look at the variety of mathematical models that are accommodated with multiple regression.
Polynomial Regression Models
Example 12-12 – the data
Example 12-12 – the plot
Example 12-12 – the model
Example 12-12 – the solution
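Example 12-12 itself is not reproduced in this transcript, but the key idea can be sketched on made-up data: a polynomial model is still linear in the β's, so ordinary least squares applies once the design matrix carries powers of x.

```python
import numpy as np

# Made-up quadratic data (not the Example 12-12 values).
rng = np.random.default_rng(3)
x = np.linspace(0, 10, 25)
y = 1.0 + 0.5 * x - 0.2 * x**2 + rng.normal(scale=0.1, size=x.size)

# Design matrix [1, x, x^2]: y = b0 + b1*x + b2*x^2 is linear in the b's.
X = np.column_stack([np.ones_like(x), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # close to the true values [1.0, 0.5, -0.2]
```

Nothing about the least-squares machinery changes; only the columns of X do.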
Categorical Regressors and Indicator Variables
Indicator (0,1) variables are normally used to represent the levels of a qualitative (categorical) variable.
It is not really a binary arithmetic type of representation. Rather, r levels are represented by (r-1) indicator variables.
Binary-style coding (3 columns for 5 levels):

x1 x2 x3
0  0  0   level 1
0  0  1   level 2
0  1  0   level 3
0  1  1   level 4
1  0  0   level 5

No – the estimate for level 4 is mixed with the estimates for levels 2 and 3.

Indicator coding (4 columns for 5 levels):

x1 x2 x3 x4
0  0  0  0   level 1
1  0  0  0   level 2
0  1  0  0   level 3
0  0  1  0   level 4
0  0  0  1   level 5

Yes – all estimates are independent.
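Building the r − 1 indicator columns is mechanical; here is a small sketch (the level values are arbitrary, with level 1 serving as the all-zeros baseline):

```python
import numpy as np

levels = np.array([1, 2, 3, 4, 5, 2])   # hypothetical categorical observations
r = 5                                    # number of levels

# r - 1 = 4 indicator columns; level 1 is the baseline (all zeros).
D = np.zeros((levels.size, r - 1), dtype=int)
for i, lev in enumerate(levels):
    if lev > 1:
        D[i, lev - 2] = 1
print(D)
```

Each row of D is then appended to the design matrix alongside the quantitative regressors.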
Example 12-13
The Data
More of the example
The Data in Matrix-vector form
The Results
Why not two models?
Common slopes allow pooling the data – better MSE and more degrees of freedom.
When all variables are qualitative, this leads to analysis of variance (ANOVA) – next chapter.
Selection of Variables and Model Building
Not always obvious which combinations of variables to include.
Overfitting a model results in higher variance of predictions, while creating the illusion of a better model in the limited context of the current sample:
as new columns are added to the X matrix, the variance of the predicted values will increase.
When a regression model is primarily an empirical one, selecting variables can be difficult.
If there are theoretical reasons for including particular variables, the task may be simpler –
these variables may be a starting point.
More on Selecting Variables
Overall F-test.
Look for significance of the t-statistic: β̂_j / se(β̂_j).
Examine R² and adjusted R² – not necessarily to maximize, but to find the point where adding more variables yields only a small increase.
Look at the MSE = SS_E / (n − p) – want a decrease as variables are added.
Cp criterion – total mean square error.
All possible regressions – adjusted R².
Stepwise regression – F-in and F-out; forward selection and backward elimination.
All Possible Regressions
Consider a regression problem with k possible predictor variables. The number of regressions becomes a combinatorial problem: each of the k variables is either in or out of the model, giving 2 × 2 × ⋯ × 2 = 2^k subsets, hence 2^k − 1 distinct subsets to be tested (counting the full set but excluding the empty set, which corresponds to the mean-only model).
For example, with 10 potential independent variables, the number of subsets to be tested is 2^10 − 1 = 1023; with 20 potential independent variables, the number is 2^20 − 1, more than one million.
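The count is easy to verify by enumerating the non-empty subsets directly:

```python
from itertools import combinations

k = 10
# Every non-empty subset of the k candidate predictors.
subsets = [c for size in range(1, k + 1)
             for c in combinations(range(k), size)]
print(len(subsets))   # 1023 = 2**10 - 1
```

This also shows why all-possible-regressions becomes impractical as k grows.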
Stepwise Regression
1. Begin by picking the single predictor with the largest F-statistic (highest correlation to the response variable Y).
2. Examine the remaining variables to find the one with the highest partial F-statistic.
    F_j = SS_R(β_j | β_1, β_0) / MS_E(x_1, x_j)
3. If Fj > Fin include variable j. Otherwise stop.
4. Calculate the F-statistic for the predictors currently in the model.
5. Let Fk be the lowest partial F-statistic. If Fk < Fout, remove variable k from the model. Go to step 2.
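Steps 1–3 above (the forward part, without the removal step) can be sketched as follows. This is a minimal illustration: the function name, synthetic data, and the F-to-enter threshold of 4.0 are all assumptions, not Minitab's actual implementation.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares from a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def forward_select(X, y, f_in=4.0):
    """Forward-only sketch: repeatedly add the candidate predictor with the
    largest partial F until none exceeds f_in. X has no intercept column."""
    n, k = X.shape
    chosen = []
    ones = np.ones((n, 1))
    while True:
        Xc = np.hstack([ones] + [X[:, [j]] for j in chosen])
        sse_cur = sse(Xc, y)
        best_f, best_j = -np.inf, None
        for j in range(k):
            if j in chosen:
                continue
            Xj = np.hstack([Xc, X[:, [j]]])
            sse_new = sse(Xj, y)
            # Partial F: extra SS explained by x_j over the error mean square.
            f = (sse_cur - sse_new) / (sse_new / (n - Xj.shape[1]))
            if f > best_f:
                best_f, best_j = f, j
        if best_j is None or best_f < f_in:
            return chosen
        chosen.append(best_j)

# Synthetic check: y depends only on columns 0 and 2 of four candidates.
rng = np.random.default_rng(5)
X = rng.normal(size=(60, 4))
y = 5.0 * X[:, 0] + 3.0 * X[:, 2] + rng.normal(scale=0.5, size=60)
print(forward_select(X, y))
```

Full stepwise regression adds steps 4–5: after each entry, recompute partial F for the variables already in the model and drop any that falls below F-out.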
Stepwise Regression - comments
Fundamental technique, widely used, and very handy for exploring a set of potential predictor variables.
Minitab's implementation allows you to specify a set of predictors to automatically include. All other decisions begin after this set of predictors is included in the model.
Forward Selection is a variant of stepwise regression that only adds variables, continuing until no remaining variable's partial F statistic exceeds Fin.
Backward Elimination starts with the full set of predictors and works in the other direction – eliminating them until there is no partial F statistic below Fout.
If possible use all-possible regression method and evaluate by minimum MSE or adjusted R2.
Data, data, we need data: quarterback ratings for the 2004 NFL season

Player                  Att   Comp   Pct    Yds    Yds/Att   TD   Pct TD   Long   Int   Pct Int   Rating
D.Culpepper, MIN        548   379    69.2   4717   8.61      39   7.1      82     11    2         110.9
D.McNabb, PHI           469   300    64     3875   8.26      31   6.6      80     8     1.7       104.7
B.Griese, TAM           336   233    69.3   2632   7.83      20   6        68     12    3.6       97.5
M.Bulger, STL           485   321    66.2   3964   8.17      21   4.3      56     14    2.9       93.7
B.Favre, GBP            540   346    64.1   4088   7.57      30   5.6      79     17    3.1       92.4
J.Delhomme, CAR         533   310    58.2   3886   7.29      29   5.4      63     15    2.8       87.3
K.Warner, NYG           277   174    62.8   2054   7.42      6    2.2      62     4     1.4       86.5
M.Hasselbeck, SEA       474   279    58.9   3382   7.14      22   4.6      60     15    3.2       83.1
A.Brooks, NOS           542   309    57     3810   7.03      21   3.9      57     16    3         79.5
T.Rattay, SFX           325   198    60.9   2169   6.67      10   3.1      65     10    3.1       78.1
M.Vick, ATL             321   181    56.4   2313   7.21      14   4.4      62     12    3.7       78.1
J.Harrington, DET       489   274    56     3047   6.23      19   3.9      62     12    2.5       77.5
V.Testaverde, DAL       495   297    60     3532   7.14      17   3.4      53     20    4         76.4
P.Ramsey, WAS           272   169    62.1   1665   6.12      10   3.7      51     11    4         74.8
J.McCown, ARI           408   233    57.1   2511   6.15      11   2.7      48     10    2.5       74.1
P.Manning, IND          497   336    67.6   4557   9.17      49   9.9      80     10    2         121.1
D.Brees, SDC            400   262    65.5   3159   7.9       27   6.8      79     7     1.8       104.8
B.Roethlisberger, PIT   295   196    66.4   2621   8.88      17   5.8      58     11    3.7       98.1
T.Green, KAN            556   369    66.4   4591   8.26      27   4.9      70     17    3.1       95.2
T.Brady, NEP            474   288    60.8   3692   7.79      28   5.9      50     14    3         92.6
C.Pennington, NYJ       370   242    65.4   2673   7.22      16   4.3      48     9     2.4       91
B.Volek, TEN            357   218    61.1   2486   6.96      18   5        48     10    2.8       87.1
J.Plummer, DEN          521   303    58.2   4089   7.85      27   5.2      85     20    3.8       84.5
D.Carr, HOU             466   285    61.2   3531   7.58      16   3.4      69     14    3         83.5
B.Leftwich, JAC         441   267    60.5   2941   6.67      15   3.4      65     10    2.3       82.2
C.Palmer, CIN           432   263    60.9   2897   6.71      18   4.2      76     18    4.2       77.3
J.Garcia, CLE           252   144    57.1   1731   6.87      10   4        99     9     3.6       76.7
D.Bledsoe, BUF          450   256    56.9   2932   6.52      20   4.4      69     16    3.6       76.6
K.Collins, OAK          513   289    56.3   3495   6.81      21   4.1      63     20    3.9       74.8
K.Boller, BAL           464   258    55.6   2559   5.52      13   2.8      57     11    2.4       70.9
NFL Yes, Yes (12-81)
Use the football data to build regression models using
(a) all possible regressions(b) stepwise regression(c) forward selection(d) backward elimination
off we go to Minitab
Danger … Danger Multicollinearity
the existence of a high degree of linear correlation among two or more explanatory variables
In the presence of multicollinearity, it is difficult to assess the effect of the independent variables on the dependent variable
x1 and x2 are perfectly collinear if x1 = kx2 for some constant k.
In that case the X matrix has rank less than p and the X'X matrix has no inverse.
We rarely face perfect multicollinearity in a data set.
Detecting Multicollinearity
Insignificant regression coefficients for the affected variables in the multiple regression, but a rejection of the hypothesis that those coefficients are insignificant as a group (using an F-test)
Large changes in the estimated regression coefficients when a predictor variable is added or deleted
Large changes in the estimated regression coefficients when an observation is added or deleted
The Problem of Multicollinearity
The usual interpretation of a regression coefficient is that it estimates the effect of a one unit change in an independent variable
With multicollinearity, the estimate of one variable's impact on y tends to be less precise than if predictors were uncorrelated with one another.
If nominally "different" measures actually quantify the same phenomenon then they are redundant.
One of the features of multicollinearity is that the standard errors of the affected coefficients tend to be large.
As a result, the test of the hypothesis that such a coefficient is equal to zero, against the alternative that it is not, leads to a failure to reject the null hypothesis.
Yet if a simple linear regression of the dependent variable on that explanatory variable alone is estimated, the coefficient will be found to be significant.
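This inflation of standard errors is easy to demonstrate on synthetic data (everything below is made up for illustration): fit y twice, once with a near-copy of x1 in the model and once without.

```python
import numpy as np

def coef_se(X, y):
    """Standard errors of the least-squares coefficients, sqrt(MSE * C_jj)."""
    n, p = X.shape
    C = np.linalg.inv(X.T @ X)
    beta = C @ X.T @ y
    r = y - X @ beta
    mse = r @ r / (n - p)
    return np.sqrt(mse * np.diag(C))

rng = np.random.default_rng(7)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)                     # unrelated predictor
y = 2.0 * x1 + 1.0 * x3 + rng.normal(size=n)

ones = np.ones(n)
se_with = coef_se(np.column_stack([ones, x1, x2, x3]), y)   # x1 plus its near-copy
se_without = coef_se(np.column_stack([ones, x1, x3]), y)    # x1 alone
print(se_with[1], se_without[1])   # se of x1's coefficient explodes when x2 is present
```

The coefficient on x1 is estimated far less precisely once its near-duplicate enters the model, even though the fit to y barely changes.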
The best regression models are those in which the predictor variables each correlate highly with the dependent variable but correlate at most only minimally with each other.
Such a model is often called "low noise" and will be statistically robust (that is, it will predict reliably across numerous samples of variable sets drawn from the same statistical population).
Overfitting
A statistical model that has too many parameters. An absurd and false model may fit perfectly if the model has enough complexity by comparison to the amount of data available.
Overfitting is generally recognized to be a violation of Occam's razor.
When the degrees of freedom in parameter selection exceed the information content of the data, this leads to arbitrariness in the fitted model parameters which reduces or destroys the ability of the model to generalize beyond the fitting data.
Occam's Razor
A principle attributed to the 14th-century English logician and Franciscan friar William of Ockham.
The principle states that the explanation of any phenomenon should make as few assumptions as possible, eliminating those that make no difference in the observable predictions of the explanatory hypothesis or theory.
The principle is often expressed as the "law of parsimony" or "law of succinctness".
This is often paraphrased as "All other things being equal, the simplest solution is the best."
When multiple competing theories are equal in other respects, the principle recommends selecting the theory that introduces the fewest assumptions and postulates the fewest entities.
Other Coefficient Estimates
Least absolute deviations:

    Min over β of  Σ_{i=1..n} | y_i − β_0 − Σ_{j=1..k} β_j x_ij |

Minimax:

    Min over β of  Max over i of  | y_i − β_0 − Σ_{j=1..k} β_j x_ij |
The End of Regression