Chapter 12: Multiple Linear Regression
An in-depth look at the science and art of multiple regression!
Chapter 12B
A study of the statistical relationship among variables.
Now, the rest of the story…
Ponder This…
The Gauss-Markov Theorem: among all linear unbiased estimators, the least-squares estimates have minimum variance.
That is, the least-squares estimator of any linear combination of the coefficients is the best linear unbiased estimator (BLUE).
Confidence Intervals on Individual Regression Coefficients
The statistic

    T = (β̂_j − β_j) / √(σ̂² C_jj),   j = 0, 1, …, k

has a t distribution with n − p degrees of freedom, where C_jj is the jj-th diagonal element of (X'X)^(-1).

P(−t_{α/2, n−p} ≤ T ≤ t_{α/2, n−p}) = 1 − α implies the 100(1 − α)% confidence interval

    β̂_j − t_{α/2, n−p} √(σ̂² C_jj)  ≤  β_j  ≤  β̂_j + t_{α/2, n−p} √(σ̂² C_jj)

In Minitab-speak, the standard error √(σ̂² C_jj) is the "s.e. coef.", se(β̂_j).
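The interval above can be checked numerically. The sketch below is a minimal illustration on synthetic data (the data set, seed, and coefficient values are made up, not from a textbook example):

```python
import numpy as np
from scipy import stats

# Synthetic data: n = 30 observations, k = 2 regressors plus an intercept.
rng = np.random.default_rng(0)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # n x p design matrix
y = X @ np.array([3.0, 1.5, -2.0]) + rng.normal(size=n)

p = X.shape[1]
C = np.linalg.inv(X.T @ X)            # C = (X'X)^(-1)
beta_hat = C @ X.T @ y                # least-squares estimates
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)  # MSE, the estimate of sigma^2

t_crit = stats.t.ppf(0.975, df=n - p)        # t_{alpha/2, n-p} for alpha = .05
se = np.sqrt(sigma2_hat * np.diag(C))        # se(beta_hat_j) = sqrt(sigma^2-hat * C_jj)
lower = beta_hat - t_crit * se
upper = beta_hat + t_crit * se
for j in range(p):
    print(f"beta_{j}: {beta_hat[j]:7.3f}   95% CI ({lower[j]:.3f}, {upper[j]:.3f})")
```

Each interval is the point estimate plus or minus t_{α/2, n−p} standard errors, exactly as in the formula.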
Confidence Interval on the Mean Response
The mean response at a point x0 is estimated by μ̂_{Y|x0} = x0' β̂.
The variance of the estimated mean response is σ² x0' (X'X)^(-1) x0.
A 100(1 − α)% confidence interval:

    μ̂_{Y|x0} − t_{α/2, n−p} √(σ̂² x0'(X'X)^(-1) x0)  ≤  μ_{Y|x0}  ≤  μ̂_{Y|x0} + t_{α/2, n−p} √(σ̂² x0'(X'X)^(-1) x0)
12-4 Prediction of New Observations
The prediction itself is ŷ0 = x0' β̂.
A 100(1 − α)% prediction interval on a new observation Y0:

    ŷ0 − t_{α/2, n−p} √(σ̂² (1 + x0'(X'X)^(-1) x0))  ≤  Y0  ≤  ŷ0 + t_{α/2, n−p} √(σ̂² (1 + x0'(X'X)^(-1) x0))

The '1' accounts for the variability of the new observation itself.
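Both intervals at a point x0 can be sketched the same way; everything below is synthetic and illustrative. The prediction interval differs only by the '1' added inside the square root:

```python
import numpy as np
from scipy import stats

# Synthetic fit (made-up data, not the textbook example).
rng = np.random.default_rng(1)
n = 30
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, 2))])
y = X @ np.array([2.0, 0.5, 1.0]) + rng.normal(size=n)

p = X.shape[1]
C = np.linalg.inv(X.T @ X)
beta_hat = C @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)
t_crit = stats.t.ppf(0.975, df=n - p)

x0 = np.array([1.0, 5.0, 5.0])   # point of interest (leading 1 for the intercept)
yhat0 = x0 @ beta_hat            # estimated mean response / point prediction
h0 = x0 @ C @ x0                 # x0'(X'X)^(-1) x0

ci = (yhat0 - t_crit * np.sqrt(sigma2_hat * h0),
      yhat0 + t_crit * np.sqrt(sigma2_hat * h0))        # CI on the mean response
pi = (yhat0 - t_crit * np.sqrt(sigma2_hat * (1 + h0)),
      yhat0 + t_crit * np.sqrt(sigma2_hat * (1 + h0)))  # PI on a new observation
print("CI:", ci)
print("PI:", pi)
```

The PI is always wider than the CI because of that extra '1'.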
NFL The Problem still continues (12-47)
(a) Find 95% confidence intervals on the regression coefficients
(b) What is the estimated standard error of μ̂_{Y|x0} when the percentage of completions is 60%, the percentage of TD's is 4%, and the percentage of interceptions is 3%?
(c) Find a 95% confidence interval on the mean rating when the percentage of completions is 60%, the percentage of TD’s is 4%, and the percentage of interceptions is 3%.
CI on the Coefficients

    β̂_j − t_{α/2, n−p} √(σ̂² C_jj)  ≤  β_j  ≤  β̂_j + t_{α/2, n−p} √(σ̂² C_jj)
CI on the Mean Response
X0 = (1, .6, .04, .03)
Minitab and the NFL
Rating Pts = 3.22 + 1.22 Pct Comp + 4.42 Pct TD - 4.09 Pct Int
Predictor    Coef      StDev    T       P
Constant     3.220     6.519    0.49    0.626
Pct Comp     1.2243    0.1080   11.33   0.000
Pct TD       4.4231    0.2799   15.80   0.000
Pct Int     -4.0914    0.4953   -8.26   0.000

S = 1.921   R-Sq = 97.8%   R-Sq(adj) = 97.5%

Analysis of Variance
Source           DF   SS       MS       F        P
Regression        3   4196.3   1398.8   379.07   0.000
Residual Error   26     95.9      3.7
Total            29   4292.2

Predicted Values
Fit      StDev Fit   95.0% CI             95.0% PI
82.096   0.387       (81.300, 82.892)     (78.068, 86.124)
X0 = (1, .6, .04, .03)
The Textbook Point Regarding Extrapolation
In the single-variable case it was really easy to tell when you were moving beyond the range of the original data.
For the multi-variable case we have a region. It is easy to be within the bounds on every variable in a marginal sense, yet be way outside the region of the data.
For example, being 6'2" tall may not be unusual, and weighing 120 lbs is not an outlier; but being both 6'2" and 120 lbs would put you outside the range of many data samples from the population.
Extrapolation in multiple regression
12-5 Model Adequacy Checking
R2 and Residual Analysis
Figure 12-6 Normal probability plot of residuals
Residual Analysis: two varieties – standardized and studentized.
Standardized: we divide the 'raw' residual by the square root of MSE. These are pretty simple to understand.
Studentized: we divide the 'raw' residual by an estimate of its own standard deviation, √(MSE(1 − h_ii)), where h_ii is the ith diagonal element of the hat matrix; this accounts for the variance of the fitted value at that point.
Don't be concerned about this one.
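Both residual types can be sketched on synthetic data (the data and seed below are illustrative; h_ii is the ith diagonal of the hat matrix H = X(X'X)^(-1)X'):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

p = X.shape[1]
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
e = y - H @ y                          # raw residuals
mse = e @ e / (n - p)

d = e / np.sqrt(mse)                   # standardized: raw residual / sqrt(MSE)
h = np.diag(H)                         # leverages h_ii
r = e / np.sqrt(mse * (1 - h))         # studentized: accounts for each point's leverage
```

Since 0 < h_ii < 1, each studentized residual is at least as large in magnitude as its standardized counterpart.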
NFL Yes, continuing (12-53)
(a) What proportion of total variability is explained by this model?
(b) Plot the residuals versus ŷ and each regressor.
(c) Construct a normal probability plot of the residuals.
The Coefficient of Multiple Determination, yes R2
    R² = SS_R / SS_T = 4196.3 / 4292.2 = 0.9777

    R²_adj = 1 − [SS_E / (n − p)] / [SS_T / (n − 1)] = 1 − 3.7 / (4292.2 / 29) = 1 − 0.025 = 0.975
alternately, read it off the output report:S = 1.921 R-Sq = 97.8% R-Sq(adj) = 97.5%
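The arithmetic can be reproduced directly from the ANOVA table above (n = 30 quarterbacks, p = 4 parameters including the intercept):

```python
ss_r, ss_e, ss_t = 4196.3, 95.9, 4292.2   # from the Minitab ANOVA table
n, p = 30, 4

r2 = ss_r / ss_t
r2_adj = 1 - (ss_e / (n - p)) / (ss_t / (n - 1))
print(round(r2, 4), round(r2_adj, 4))     # 0.9777 0.9751
```

These round to the 97.8% and 97.5% figures on the Minitab report.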
Normal Probability Plot
[Figure: Normal Probability Plot of the Residuals (response is Rating P) – Normal Score vs. Standardized Residual]
Residual Plot 1
[Figure: Residuals Versus Pct Int (response is Rating P) – Standardized Residual vs. Pct Int]
Residual Plot 2
[Figure: Residuals Versus Pct TD (response is Rating P) – Standardized Residual vs. Pct TD]
Residual Plot 3
[Figure: Residuals Versus Pct Comp (response is Rating P) – Standardized Residual vs. Pct Comp]
Residual Plot 4
[Figure: Residuals Versus Rating P (response is Rating P) – Standardized Residual vs. Rating P]
Modeling with Multiple Regression
A fascinating look at the variety of mathematical models that are accommodated with multiple regression.
Polynomial Regression Models
Example 12-12 – the data
Example 12-12 – the plot
Example 12-12 – the model
Example 12-12 – the solution
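Example 12-12 itself is not reproduced in this transcript, but the key idea can be sketched on made-up data: a polynomial model is still linear in the β's, so ordinary least squares applies once the design matrix carries powers of x.

```python
import numpy as np

# Made-up quadratic data (not the Example 12-12 values).
rng = np.random.default_rng(3)
x = np.linspace(0, 10, 25)
y = 1.0 + 0.5 * x - 0.2 * x**2 + rng.normal(scale=0.1, size=x.size)

# Design matrix [1, x, x^2]: y = b0 + b1*x + b2*x^2 is linear in the b's.
X = np.column_stack([np.ones_like(x), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # close to the true values [1.0, 0.5, -0.2]
```

Nothing about the least-squares machinery changes; only the columns of X do.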
Categorical Regressors and Indicator Variables
Indicator (0,1) variables are normally used to represent the levels of a qualitative (categorical) variable.
It is not really a binary arithmetic type of representation. Rather, r levels are represented by (r-1) indicator variables.
Binary-style coding (3 columns for 5 levels):

x1 x2 x3
0  0  0   level 1
0  0  1   level 2
0  1  0   level 3
0  1  1   level 4
1  0  0   level 5

No – the estimate for level 4 is mixed with the estimates for levels 2 and 3.

Indicator coding (4 columns for 5 levels):

x1 x2 x3 x4
0  0  0  0   level 1
1  0  0  0   level 2
0  1  0  0   level 3
0  0  1  0   level 4
0  0  0  1   level 5

Yes – all estimates are independent.
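Building the r − 1 indicator columns is mechanical; here is a small sketch (the level values are arbitrary, with level 1 serving as the all-zeros baseline):

```python
import numpy as np

levels = np.array([1, 2, 3, 4, 5, 2])   # hypothetical categorical observations
r = 5                                    # number of levels

# r - 1 = 4 indicator columns; level 1 is the baseline (all zeros).
D = np.zeros((levels.size, r - 1), dtype=int)
for i, lev in enumerate(levels):
    if lev > 1:
        D[i, lev - 2] = 1
print(D)
```

Each row of D is then appended to the design matrix alongside the quantitative regressors.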
Example 12-13
The Data
More of the example
The Data in Matrix-vector form
The Results
Why not two models?
Common slopes allow pooling the data – better MSE and more degrees of freedom.
When all variables are qualitative, this leads to analysis of variance (ANOVA) – next chapter.
Selection of Variables and Model Building
Not always obvious which combinations of variables to include.
Overfitting a model results in higher variance of predictions, while creating the illusion of a better model in the limited context of the current sample:
as new columns are added to the X matrix, the variance of the predicted values will increase.
When a regression model is primarily an empirical one, selecting variables can be difficult.
If there are theoretical reasons for including particular variables, the task may be simpler –
these variables may be a starting point.
More on Selecting Variables
Overall F-test.
Look for significance of the t-statistic: β̂_j / se(β̂_j).
Examine R² and adjusted R² – not necessarily to maximize, but to find the point where adding more variables yields only a small increase.
Look at the MSE = SS_E / (n − p) – want a decrease as variables are added.
Cp criterion – total mean square error.
All possible regressions – adjusted R².
Stepwise regression – F-in and F-out; forward selection and backward elimination.
All Possible Regressions
Consider a regression problem with k possible predictor variables. The number of regressions becomes a combinatorial problem: each of the k variables is either in or out of the model, giving 2 × 2 × ⋯ × 2 = 2^k subsets, hence 2^k − 1 distinct subsets to be tested (counting the full set but excluding the empty set, which corresponds to the mean-only model).
For example, with 10 potential independent variables, the number of subsets to be tested is 2^10 − 1 = 1023; with 20 potential independent variables, the number is 2^20 − 1, more than one million.
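The count is easy to verify by enumerating the non-empty subsets directly:

```python
from itertools import combinations

k = 10
# Every non-empty subset of the k candidate predictors.
subsets = [c for size in range(1, k + 1)
             for c in combinations(range(k), size)]
print(len(subsets))   # 1023 = 2**10 - 1
```

This also shows why all-possible-regressions becomes impractical as k grows.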
Stepwise Regression
1. Begin by picking the single predictor with the largest F-statistic (highest correlation to the response variable Y).
2. Examine the remaining variables to find the one with the highest partial F-statistic.
    F_j = SS_R(β_j | β_1, β_0) / MS_E(x_1, x_j)
3. If Fj > Fin include variable j. Otherwise stop.
4. Calculate the F-statistic for the predictors currently in the model.
5. Let Fk be the lowest partial F-statistic. If Fk < Fout, remove variable k from the model. Go to step 2.
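Steps 1–3 above (the forward part, without the removal step) can be sketched as follows. This is a minimal illustration: the function name, synthetic data, and the F-to-enter threshold of 4.0 are all assumptions, not Minitab's actual implementation.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares from a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def forward_select(X, y, f_in=4.0):
    """Forward-only sketch: repeatedly add the candidate predictor with the
    largest partial F until none exceeds f_in. X has no intercept column."""
    n, k = X.shape
    chosen = []
    ones = np.ones((n, 1))
    while True:
        Xc = np.hstack([ones] + [X[:, [j]] for j in chosen])
        sse_cur = sse(Xc, y)
        best_f, best_j = -np.inf, None
        for j in range(k):
            if j in chosen:
                continue
            Xj = np.hstack([Xc, X[:, [j]]])
            sse_new = sse(Xj, y)
            # Partial F: extra SS explained by x_j over the error mean square.
            f = (sse_cur - sse_new) / (sse_new / (n - Xj.shape[1]))
            if f > best_f:
                best_f, best_j = f, j
        if best_j is None or best_f < f_in:
            return chosen
        chosen.append(best_j)

# Synthetic check: y depends only on columns 0 and 2 of four candidates.
rng = np.random.default_rng(5)
X = rng.normal(size=(60, 4))
y = 5.0 * X[:, 0] + 3.0 * X[:, 2] + rng.normal(scale=0.5, size=60)
print(forward_select(X, y))
```

Full stepwise regression adds steps 4–5: after each entry, recompute partial F for the variables already in the model and drop any that falls below F-out.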
Stepwise Regression - comments
Fundamental technique, widely used, and very handy for exploring a set of potential predictor variables.
Minitab's implementation allows you to specify a set of predictors to automatically include. All other decisions begin after this set of predictors is included in the model.
Forward Selection is a variant of stepwise regression that only adds variables, continuing until no remaining variable's partial F statistic exceeds Fin.
Backward Elimination starts with the full set of predictors and works in the other direction – eliminating them until there is no partial F statistic below Fout.
If possible use all-possible regression method and evaluate by minimum MSE or adjusted R2.
Data, data, we need data: quarterback ratings for the 2004 NFL season

Player                  Att   Comp   Pct    Yds    Yds/Att   TD   Pct TD   Long   Int   Pct Int   Rating
D.Culpepper, MIN        548   379    69.2   4717   8.61      39   7.1      82     11    2         110.9
D.McNabb, PHI           469   300    64     3875   8.26      31   6.6      80     8     1.7       104.7
B.Griese, TAM           336   233    69.3   2632   7.83      20   6        68     12    3.6       97.5
M.Bulger, STL           485   321    66.2   3964   8.17      21   4.3      56     14    2.9       93.7
B.Favre, GBP            540   346    64.1   4088   7.57      30   5.6      79     17    3.1       92.4
J.Delhomme, CAR         533   310    58.2   3886   7.29      29   5.4      63     15    2.8       87.3
K.Warner, NYG           277   174    62.8   2054   7.42      6    2.2      62     4     1.4       86.5
M.Hasselbeck, SEA       474   279    58.9   3382   7.14      22   4.6      60     15    3.2       83.1
A.Brooks, NOS           542   309    57     3810   7.03      21   3.9      57     16    3         79.5
T.Rattay, SFX           325   198    60.9   2169   6.67      10   3.1      65     10    3.1       78.1
M.Vick, ATL             321   181    56.4   2313   7.21      14   4.4      62     12    3.7       78.1
J.Harrington, DET       489   274    56     3047   6.23      19   3.9      62     12    2.5       77.5
V.Testaverde, DAL       495   297    60     3532   7.14      17   3.4      53     20    4         76.4
P.Ramsey, WAS           272   169    62.1   1665   6.12      10   3.7      51     11    4         74.8
J.McCown, ARI           408   233    57.1   2511   6.15      11   2.7      48     10    2.5       74.1
P.Manning, IND          497   336    67.6   4557   9.17      49   9.9      80     10    2         121.1
D.Brees, SDC            400   262    65.5   3159   7.9       27   6.8      79     7     1.8       104.8
B.Roethlisberger, PIT   295   196    66.4   2621   8.88      17   5.8      58     11    3.7       98.1
T.Green, KAN            556   369    66.4   4591   8.26      27   4.9      70     17    3.1       95.2
T.Brady, NEP            474   288    60.8   3692   7.79      28   5.9      50     14    3         92.6
C.Pennington, NYJ       370   242    65.4   2673   7.22      16   4.3      48     9     2.4       91
B.Volek, TEN            357   218    61.1   2486   6.96      18   5        48     10    2.8       87.1
J.Plummer, DEN          521   303    58.2   4089   7.85      27   5.2      85     20    3.8       84.5
D.Carr, HOU             466   285    61.2   3531   7.58      16   3.4      69     14    3         83.5
B.Leftwich, JAC         441   267    60.5   2941   6.67      15   3.4      65     10    2.3       82.2
C.Palmer, CIN           432   263    60.9   2897   6.71      18   4.2      76     18    4.2       77.3
J.Garcia, CLE           252   144    57.1   1731   6.87      10   4        99     9     3.6       76.7
D.Bledsoe, BUF          450   256    56.9   2932   6.52      20   4.4      69     16    3.6       76.6
K.Collins, OAK          513   289    56.3   3495   6.81      21   4.1      63     20    3.9       74.8
K.Boller, BAL           464   258    55.6   2559   5.52      13   2.8      57     11    2.4       70.9
NFL Yes, Yes (12-81)
Use the football data to build regression models using
(a) all possible regressions(b) stepwise regression(c) forward selection(d) backward elimination
off we go to Minitab
Danger … Danger Multicollinearity
the existence of a high degree of linear correlation among two or more explanatory variables
In the presence of multicollinearity, it is difficult to assess the effect of the independent variables on the dependent variable
x1 and x2 are perfectly collinear if x1 = kx2 for some constant k.
In that case the X matrix has rank less than p and the X'X matrix has no inverse.
We rarely face perfect multicollinearity in a data set.
Detecting Multicollinearity
Insignificant regression coefficients for the affected variables in the multiple regression, but a rejection of the hypothesis that those coefficients are insignificant as a group (using an F-test)
Large changes in the estimated regression coefficients when a predictor variable is added or deleted
Large changes in the estimated regression coefficients when an observation is added or deleted
The Problem of Multicollinearity
The usual interpretation of a regression coefficient is that it estimates the effect of a one unit change in an independent variable
With multicollinearity, the estimate of one variable's impact on y tends to be less precise than if predictors were uncorrelated with one another.
If nominally "different" measures actually quantify the same phenomenon then they are redundant.
One of the features of multicollinearity is that the standard errors of the affected coefficients tend to be large.
As a result, the test of the hypothesis that such a coefficient is equal to zero, against the alternative that it is not, leads to a failure to reject the null hypothesis.
Yet if a simple linear regression of the dependent variable on that explanatory variable alone is estimated, the coefficient will be found to be significant.
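This inflation of standard errors is easy to demonstrate on synthetic data (everything below is made up for illustration): fit y twice, once with a near-copy of x1 in the model and once without.

```python
import numpy as np

def coef_se(X, y):
    """Standard errors of the least-squares coefficients, sqrt(MSE * C_jj)."""
    n, p = X.shape
    C = np.linalg.inv(X.T @ X)
    beta = C @ X.T @ y
    r = y - X @ beta
    mse = r @ r / (n - p)
    return np.sqrt(mse * np.diag(C))

rng = np.random.default_rng(7)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)                     # unrelated predictor
y = 2.0 * x1 + 1.0 * x3 + rng.normal(size=n)

ones = np.ones(n)
se_with = coef_se(np.column_stack([ones, x1, x2, x3]), y)   # x1 plus its near-copy
se_without = coef_se(np.column_stack([ones, x1, x3]), y)    # x1 alone
print(se_with[1], se_without[1])   # se of x1's coefficient explodes when x2 is present
```

The coefficient on x1 is estimated far less precisely once its near-duplicate enters the model, even though the fit to y barely changes.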
The best regression models are those in which the predictor variables each correlate highly with the dependent variable but correlate at most only minimally with each other.
Such a model is often called "low noise" and will be statistically robust (that is, it will predict reliably across numerous samples of variable sets drawn from the same statistical population).
Overfitting
A statistical model that has too many parameters. An absurd and false model may fit perfectly if the model has enough complexity by comparison to the amount of data available.
Overfitting is generally recognized to be a violation of Occam's razor.
When the degrees of freedom in parameter selection exceed the information content of the data, this leads to arbitrariness in the fitted model parameters which reduces or destroys the ability of the model to generalize beyond the fitting data.
Occam's Razor
A principle attributed to the 14th-century English logician and Franciscan friar William of Ockham.
The principle states that the explanation of any phenomenon should make as few assumptions as possible, eliminating those that make no difference in the observable predictions of the explanatory hypothesis or theory.
The principle is often expressed as the "law of parsimony" or "law of succinctness".
This is often paraphrased as "All other things being equal, the simplest solution is the best."
When multiple competing theories are equal in other respects, the principle recommends selecting the theory that introduces the fewest assumptions and postulates the fewest entities.
Other Coefficient Estimates
Least absolute deviations:

    Min over β of  Σ_{i=1..n} | y_i − β_0 − Σ_{j=1..k} β_j x_ij |

Minimax:

    Min over β of  Max over i of  | y_i − β_0 − Σ_{j=1..k} β_j x_ij |
The End of Regression