Applied Regression Modeling, Third Edition
5. Model Building II: Sections 5.1–5.2

Iain Pardoe
Thompson Rivers University
The Pennsylvania State University
March 2020

Outline

5.1 Influential points
5.2 Regression pitfalls


Influential points

Beware model conclusions that are overly influenced by a small handful of data points. For example:

- overall results can be biased if a few unusual points differ dramatically from general patterns in the majority of the data values;
- it is misleading to conclude evidence of a strong association between variables if that evidence is based mainly on a few dominant points.

We focus on two measures of individual data point influence:

- outliers have unusual Y-values relative to their predicted Y-values from a model;
- high leverage points have unusual combinations of X-values relative to general dataset patterns.

Cook's distance is a composite measure of outlyingness and leverage.



Outliers

Outliers have unusual Y-values relative to their predicted Y-values from a model; in other words, they are observations with a large-magnitude residual, e_i = Y_i − Ŷ_i.
Statistical software can calculate studentized residuals to put the residuals on a common scale.
When the four regression assumptions (zero mean, constant variance, normality, and independence) are satisfied, studentized residuals follow an approximate N(0, 1) distribution.
If we identify an observation with a studentized residual outside (−3, 3), we have either witnessed a very unusual event (one with probability less than 0.002) or found an observation with a Y-value that does not fit the pattern in the rest of the dataset.
Formally, define a potential outlier as an observation with studentized residual < −3 or > 3.
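For reference, one standard definition of the (externally) studentized residual divides each raw residual by an estimate of its standard deviation computed with observation i excluded; the leverage h_ii is defined later in this section:

    r_i = \frac{e_i}{s_{(i)} \sqrt{1 - h_{ii}}}, \qquad e_i = Y_i - \hat{Y}_i,

where s_(i) is the regression standard error from the fit with observation i removed. This matches what most software reports as a studentized (or externally studentized) residual, though exact conventions vary by package.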


Dealing with outliers

If we find one or more outliers, investigate why:

- data input mistake (remedy: identify and correct the mistake(s) and reanalyze the data);
- important predictor omitted from the model (remedy: identify potentially useful predictors not included in the model and reanalyze the data);
- regression assumptions violated (remedy: reformulate the model, using transformations or interactions, say, to correct the problem);
- potential outliers differ substantively from the other sample observations (remedy: remove the outliers and reanalyze the remainder of the dataset separately).

To gauge outlier influence, exclude the observation with the largest-magnitude studentized residual, refit the model to the remaining observations, and see if the regression parameter estimates change substantially (a sketch of this check follows).
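A minimal sketch of this refit-and-compare check in Python with statsmodels; the data frame and model formula are placeholders for your own data:

    import numpy as np
    import statsmodels.formula.api as smf

    def drop_worst_outlier_and_compare(df, formula):
        """Refit without the largest-magnitude studentized residual."""
        fit_all = smf.ols(formula, data=df).fit()
        stud = fit_all.get_influence().resid_studentized_external
        worst = np.argmax(np.abs(stud))  # row with largest |studentized residual|
        fit_drop = smf.ols(formula, data=df.drop(df.index[worst])).fit()
        # Compare the two sets of estimates; substantial changes flag influence.
        return fit_all.params, fit_drop.params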



CARS5 data

Y = city gallons per hundred miles for 50 new U.S. passenger cars in 2011.
X1 = engine size (liters).
X2 = number of cylinders.
X3 = interior volume (hundreds of cubic feet).

Model 1: E(Y) = b0 + b1 X1 + b2 X2 + b3 X3.

Parameters (response variable: Y = Cgphm):

Model 1       Estimate   Std. Error   t-stat   Pr(>|t|)
(Intercept)   2.542      0.365        6.964    0.000
X1 = Eng      0.235      0.112        2.104    0.041
X2 = Cyl      0.294      0.076        3.855    0.000
X3 = Vol      0.476      0.279        1.709    0.094
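A sketch of fitting Model 1 in Python with statsmodels, assuming the CARS5 data sit in a CSV file with columns Cgphm, Eng, Cyl, and Vol (the file and column names are assumptions):

    import pandas as pd
    import statsmodels.formula.api as smf

    cars = pd.read_csv("CARS5.csv")  # assumed file name
    model1 = smf.ols("Cgphm ~ Eng + Cyl + Vol", data=cars).fit()
    print(model1.summary())  # estimates, standard errors, t-stats, p-values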


Model 1 studentized residuals

There is one outlier with studentized residual ≈ −8.

[Figure: Model 1 studentized residuals; the Lexus HS 250h is the outlier near −8.]


Remove outlier

Parameters (response variable: Y = Cgphm):

Model 2       Estimate   Std. Error   t-stat   Pr(>|t|)
(Intercept)   2.905      0.237        12.281   0.000
X1 = Eng      0.259      0.071        3.645    0.001
X2 = Cyl      0.228      0.049        4.634    0.000
X3 = Vol      0.492      0.177        2.774    0.008

The regression parameter estimates and p-values change dramatically.
The outlier was a hybrid and did not fit the pattern of the standard gasoline-powered cars.


Model 2 studentized residuals

No further outliers.

[Figure: Model 2 studentized residuals, all within about (−2, 2).]


Leverage

High leverage points have unusual combinations of X-values relative to general dataset patterns.
If a point is far from the majority of the sample, it can pull the fitted model toward its Y-value, potentially biasing the results.
Leverage measures the potential for an observation to have undue influence on a model (on a 0–1 scale, from low to high).

Rule of thumb:

- if leverage > 3(k + 1)/n, investigate further;
- if leverage > 2(k + 1)/n and the point is isolated, investigate further;
- otherwise, there is no evidence of undue influence.

To gauge influence, exclude the highest leverage point, refit the model to the remaining observations, and see if the regression parameter estimates change substantially (see the sketch below).
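A sketch of the leverage rule of thumb with statsmodels; here `fit` is any fitted OLS results object, and k counts the predictor terms:

    import numpy as np

    def high_leverage(fit):
        lev = fit.get_influence().hat_matrix_diag  # one leverage per observation
        n, k = int(fit.nobs), int(fit.df_model)
        investigate = np.where(lev > 3 * (k + 1) / n)[0]
        investigate_if_isolated = np.where(lev > 2 * (k + 1) / n)[0]
        return investigate, investigate_if_isolated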



Results with outlier removed

Parameters (response variable: Y = Cgphm):

Model 2       Estimate   Std. Error   t-stat   Pr(>|t|)
(Intercept)   2.905      0.237        12.281   0.000
X1 = Eng      0.259      0.071        3.645    0.001
X2 = Cyl      0.228      0.049        4.634    0.000
X3 = Vol      0.492      0.177        2.774    0.008

Leverage thresholds (k = 3 predictors, n = 49 cars): 3(k + 1)/n = 3(3 + 1)/49 = 0.24 and 2(k + 1)/n = 2(3 + 1)/49 = 0.16.


Model 2 leverages

Three high leverage points exceed the 3(k + 1)/n threshold.

[Figure: leverage versus ID number; the Aston Martin DB9, Dodge Challenger SRT8, and Rolls-Royce Ghost exceed the threshold.]


Results with highest leverage point removed

Parameters (response variable: Y = Cgphm):

Model 3       Estimate   Std. Error   t-stat   Pr(>|t|)
(Intercept)   2.949      0.247        11.931   0.000
X1 = Eng      0.282      0.080        3.547    0.001
X2 = Cyl      0.201      0.064        3.139    0.003
X3 = Vol      0.532      0.188        2.824    0.007

The regression parameter estimates and p-values do not change dramatically.
The highest leverage point had the potential to strongly influence the results, but in this case it did not do so.


Cook's distance

Cook's distance is a composite measure of outlyingness and leverage.

Rule of thumb:

- observations with a Cook's distance > 1 are often sufficiently influential that they should be removed from the main analysis; investigate further;
- observations with a Cook's distance > 0.5 are sometimes sufficiently influential that they should be removed from the main analysis; investigate further;
- otherwise, an observation is unlikely to have undue influence (although not always; see Problem 5.2).

To gauge influence, exclude the observation with the largest Cook's distance, refit the model to the remaining observations, and see if the regression parameter estimates change substantially (see the sketch below).
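A sketch of flagging influential observations by Cook's distance with statsmodels; `fit` is any fitted OLS results object:

    import numpy as np

    def flag_cooks_distance(fit):
        d = fit.get_influence().cooks_distance[0]  # distances; [1] holds p-values
        return np.where(d > 1)[0], np.where(d > 0.5)[0]  # often / sometimes influential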



Results for all observations

Parameters (response variable: Y = Cgphm):

Model 1       Estimate   Std. Error   t-stat   Pr(>|t|)
(Intercept)   2.542      0.365        6.964    0.000
X1 = Eng      0.235      0.112        2.104    0.041
X2 = Cyl      0.294      0.076        3.855    0.000
X3 = Vol      0.476      0.279        1.709    0.094


Model 1 Cook's distances

The highest Cook's distance exceeds the 0.5 threshold.

[Figure: Cook's distance versus ID number; the Lexus HS 250h exceeds the 0.5 threshold.]


Results with highest Cook's distance removed

Parameters (response variable: Y = Cgphm):

Model 2       Estimate   Std. Error   t-stat   Pr(>|t|)
(Intercept)   2.905      0.237        12.281   0.000
X1 = Eng      0.259      0.071        3.645    0.001
X2 = Cyl      0.228      0.049        4.634    0.000
X3 = Vol      0.492      0.177        2.774    0.008

The car with the highest Cook's distance was the outlier we found before.


Model 2 Cook's distances

The highest Cook's distance is less than the 0.5 threshold.

[Figure: Cook's distance versus ID number for Model 2; the largest values (Aston Martin DB9, Dodge Challenger SRT8, Rolls-Royce Ghost) all fall below 0.5.]


Regression pitfalls

Some of the pitfalls that can cause problems with a regression analysis:

- nonconstant variance: violation of the constant variance assumption;
- autocorrelation (serial correlation): failing to account for time trends in the model;
- multicollinearity: highly correlated predictors causing unstable model results;
- excluding important predictor variables: leading to possibly incorrect conclusions;
- overfitting (the sample data): leading to poor generalizability to the population;
- extrapolation: using model results for predictor values very different from those in the sample;
- missing data: leading to reduced sample sizes at best, misleading results at worst;
- power and sample size: small samples can lead to problems with power.


Nonconstant variance

Regression model residuals can violate the constant variance assumption if there is a fan or funnel pattern in a residual scatterplot.
Tests such as Breusch/Pagan (1979) and Cook/Weisberg (1983) can complement residual plot analysis.
Nonconstant variance creates technical difficulties with the model, such as inaccurate prediction intervals.
Remedies include response transformations, weighted least squares, generalized least squares, and generalized linear models.

Example: FERTILITY data file with Y = Fertility, X1 = loge(Income), and X2 = Education for 169 countries (see the sketch below).
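A sketch of testing for nonconstant variance and trying a log response transformation for this example, assuming a FERTILITY.csv file with columns Fertility, Income, and Education (the file and column names are assumptions):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.diagnostic import het_breuschpagan

    fert = pd.read_csv("FERTILITY.csv")

    # Model 1: Y = Fertility; a small Breusch-Pagan p-value signals
    # nonconstant variance.
    m1 = smf.ols("Fertility ~ np.log(Income) + Education", data=fert).fit()
    print("BP p-value, model 1:", het_breuschpagan(m1.resid, m1.model.exog)[1])

    # Model 2: log-transform the response, then re-test.
    m2 = smf.ols("np.log(Fertility) ~ np.log(Income) + Education", data=fert).fit()
    print("BP p-value, model 2:", het_breuschpagan(m2.resid, m2.model.exog)[1])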


Model 1 residuals

Model 1 uses Y as the response variable. The plot indicates increasing residual variance as the fitted values increase.

[Figure: studentized residuals versus fitted values for Model 1, fanning out as fitted values increase.]


Model 2 residuals

Model 2 uses loge(Y) as the response variable. The constant variance assumption is more reasonable now.

[Figure: studentized residuals versus fitted values for Model 2, showing roughly constant spread.]


Autocorrelation

Autocorrelation occurs when regression model residuals violate the independence assumption because they are highly dependent across time.
It can occur when regression data have been collected over time and the model fails to account for any strong time trends.
Dealing with this issue rigorously can require specialized time series and forecasting methods.
Sometimes, however, simple ideas can mitigate autocorrelation problems.

Example: UNEMP data file containing annual macroeconomic data from 1959 to 2010 (see the sketch below).
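A sketch of one simple mitigation, first differencing, assuming a UNEMP.csv file with a response y and predictors x1 through x4 (all names are assumptions; the slides do not list the variables):

    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.stattools import durbin_watson

    unemp = pd.read_csv("UNEMP.csv")

    # Model 1: levels. A Durbin-Watson statistic far from 2 suggests
    # autocorrelated residuals.
    m1 = smf.ols("y ~ x1 + x2 + x3 + x4", data=unemp).fit()
    print("DW, levels:", durbin_watson(m1.resid))

    # Model 2: year-over-year first differences of every variable.
    m2 = smf.ols("y ~ x1 + x2 + x3 + x4", data=unemp.diff().dropna()).fit()
    print("DW, differenced:", durbin_watson(m2.resid))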


Model 1 residuals

Model 1: multiple linear regression with 4 predictors. The residual plot shows clear evidence of autocorrelation.

[Figure: residuals versus time period for Model 1.]


Model 2 residuals

Model 2: use first-differenced data. The independent errors assumption is more reasonable now.

[Figure: residuals versus time period for Model 2.]


Multicollinearity

Multicollinearity occurs when excessive correlation between quantitative predictors leads to unstable models and inflated standard errors.
Identify it by looking at a scatterplot matrix, calculating bivariate correlations, and calculating variance inflation factors (potential problems if VIF > 4, likely problems if VIF > 10); a sketch follows below.

Potential remedies include:

- collect more uncorrelated data (if possible);
- create new combined predictor variables from the highly correlated predictors (if possible);
- remove one of the highly correlated predictors from the model.

Example: SALES3 data file with sales (Y = Sales), TV/newspaper advertising (X1 = Trad), and internet advertising (X2 = Int).
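A sketch of the VIF check for this example, assuming a SALES3.csv file with columns Sales, Trad, and Int (the file name is an assumption):

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    sales = pd.read_csv("SALES3.csv")
    X = sm.add_constant(sales[["Trad", "Int"]])

    # One VIF per predictor (index 0 is the intercept, so skip it);
    # VIF > 4 flags potential problems, VIF > 10 likely problems.
    for i, name in enumerate(X.columns[1:], start=1):
        print(name, variance_inflation_factor(X.values, i))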



Model 1 results

Model 1: E(Y) = b0 + b1 X1 + b2 X2.

Model summary (predictors: (Intercept), X1 = Trad, X2 = Int):

          Multiple R   R Squared   Adjusted R Squared   Regression Std. Error
Model 1   0.987        0.974       0.968                0.8916

Parameters (response variable: Y = Sales):

Model 1       Estimate   Std. Error   t-stat   Pr(>|t|)   VIF
(Intercept)   1.992      0.902        2.210    0.054
X1 = Trad     0.767      0.868        0.884    0.400      49.541
X2 = Int      1.275      0.737        1.730    0.118      49.541

R2 is 0.974, but neither X1 nor X2 is significant (given the presence of the other)!
VIF > 10 suggests there is a multicollinearity problem.


X1 and X2 highly correlated

Estimates are unstable when both X1 and X2 are in the model.

[Figure: scatterplot matrix of Sales, Trad, and Int, showing the strong correlation between Trad and Int.]


Model 2 results

Model 2: E(Y) = b0 + b1 (X1 + X2).

Model summary (predictors: (Intercept), X1 + X2):

          Multiple R   R Squared   Adjusted R Squared   Regression Std. Error
Model 2   0.987        0.974       0.971                0.8505

Parameters (response variable: Y = Sales):

Model 2       Estimate   Std. Error   t-stat   Pr(>|t|)
(Intercept)   1.776      0.562        3.160    0.010
X1 + X2       1.042      0.054        19.240   0.000

R2 is unchanged, and the combined predictor variable, X1 + X2, is significant.
Note this approach is only possible if it makes sense to create a combined predictor variable.
It is more common to drop one of the correlated predictors from the model.
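A sketch of fitting this combined-predictor model (same assumed SALES3.csv as above; patsy's I() builds the sum inside the formula):

    import pandas as pd
    import statsmodels.formula.api as smf

    sales = pd.read_csv("SALES3.csv")
    m2 = smf.ols("Sales ~ I(Trad + Int)", data=sales).fit()
    print(m2.summary())  # R-squared essentially unchanged; combined term significant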


Excluding important predictor variables

Excluding important predictors sometimes results in models that provide incorrect, biased conclusions about the included predictors.
Strive to include all potentially important predictors, and remove a predictor only if there are compelling reasons to do so (e.g., it is causing multicollinearity problems and has a high individual p-value).

Example: PARADOX data file with n = 27 high-precision computer components, with component quality (Y) potentially depending on two controllable machine factors, speed (X1) and angle (X2).


Model 1 results

Model 1: E(Y) = b0 + b1X1.

Parameters a

Model            Estimate   Std. Error   t-stat   Pr(>|t|)
1  (Intercept)   2.847      1.011        2.817    0.009
   X1 = Speed    0.430      0.188        2.288    0.031

a Response variable: Y = Quality.

Results suggest a positive association between quality and speed.

In other words, increase the speed of the machine to improve quality.

However, this ignores process information relating to angle.



Model 2 results

Model 2: E(Y) = b0 + b1X1 + b2X2.

Parameters a

Model            Estimate   Std. Error   t-stat    Pr(>|t|)
2  (Intercept)   1.638      0.217        7.551     0.000
   X1 = Speed    −0.962     0.071        −13.539   0.000
   X2 = Angle    2.014      0.086        23.473    0.000

a Response variable: Y = Quality.

Results suggest a negative association between quality and speed (for a fixed angle),

and a positive association between quality and angle (for a fixed speed).

In other words, increase the angle but decrease the speed of the machine to improve quality.



Paradox explained

Positive association between Y and X1 ignoring X2.

Negative association between Y and X1 accounting for X2.

[Figure: scatterplot of Quality versus Speed, points marked by Angle; the overall trend is positive, but within each Angle value the trend is negative.]
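The reversal is easy to reproduce in simulation. A sketch with synthetic data built to mimic the PARADOX structure (speed tracking angle across settings, quality falling with speed at fixed angle); it does not use the actual data file:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    angle = rng.uniform(1, 7, 27)
    speed = angle + rng.normal(0, 0.5, 27)    # speed rises with angle
    quality = 1.6 - 1.0 * speed + 2.0 * angle + rng.normal(0, 0.3, 27)

    m1 = sm.OLS(quality, sm.add_constant(speed)).fit()
    m2 = sm.OLS(quality, sm.add_constant(np.column_stack([speed, angle]))).fit()

    print(m1.params[1])    # positive slope when angle is ignored
    print(m2.params[1:])   # negative speed slope once angle is accounted for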


Overfitting

Overfitting can occur if an overly complicated model tries to account for every possible pattern in the sample data but generalizes poorly to the underlying population.

Should always apply a "sanity check" to make sure the model makes sense from a subject-matter perspective and conclusions are supported by the data.

[Figure: two scatterplots of the same Y versus X data (X from −1 to 1), one with a simple fitted model and one with an overly complicated fit that chases individual points.]

Which model seems more reasonable?
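One data-driven version of that sanity check is to hold out part of the sample: an overfitted model wins on the data it was trained on but loses on fresh data. A minimal sketch with synthetic data (not from any textbook data file):

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.uniform(-1, 1, 40)
    y = 4 + 3 * x + 2 * x**2 + rng.normal(0, 1, 40)   # gently curved truth
    xtr, ytr, xte, yte = x[:20], y[:20], x[20:], y[20:]

    for degree in (2, 12):
        coefs = np.polyfit(xtr, ytr, deg=degree)
        mse = np.mean((np.polyval(coefs, xte) - yte) ** 2)
        print(f"degree {degree}: held-out MSE = {mse:.2f}")

    # The degree-12 fit hugs the training points but typically has the
    # larger held-out error, the signature of overfitting.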



Extrapolation

Extrapolation occurs when regression model results are used to estimate or predict a response value for an observation with predictor values that are very different from those in the sample.

This can be dangerous because it means making a decision about a situation where there are no data values to support our conclusions.

Example: if we observe an upward trend between two variables, should we assume the trend continues indefinitely at higher values?


Extrapolation example

The straight-line model overshoots the actual Y at the far right.

The quadratic model undershoots (i.e., neither model enables accurate prediction far from the sample data).

[Figure: scatterplot of Production = production in 100s versus Labor = labor hours in 100s, with straight-line and quadratic fits extended beyond the observed Labor range.]
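A sketch of the same phenomenon on synthetic data: here the assumed truth flattens out at high X, so a straight line fitted on Labor between 2 and 8 overshoots at Labor = 12, while a quadratic can undershoot. (This is a made-up curve, not the textbook's production data.)

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.uniform(2, 8, 30)                         # observed Labor range
    y = 25 * (1 - np.exp(-x / 4)) + rng.normal(0, 0.5, 30)

    line = np.polyfit(x, y, 1)
    quad = np.polyfit(x, y, 2)

    x_new = 12.0                                      # far outside the sample
    print("truth:", 25 * (1 - np.exp(-x_new / 4)))    # about 23.8
    print("line :", np.polyval(line, x_new))          # overshoots
    print("quad :", np.polyval(quad, x_new))          # undershoots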


Missing data

Missing data occurs when particular values in the dataset have not been recorded for particular variables and observations.

Dealing with the issue rigorously is beyond the scope of the book, but there are some simple ideas that can mitigate some of the major problems.

Example: MISSING data file with n = 30, Y, and X1–X4.

There are no missing values for Y, X1, or X4, but five missing values for X2, one of which is also missing X3 (see the sketch below):

any model including X2 will exclude five observations;
including X3 (but excluding X2) will exclude one observation;
excluding X2 and X3 will exclude no observations.
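These counts are exactly what listwise deletion (the default in most regression software) produces. A small pandas sketch that mimics the MISSING file's missingness pattern with made-up values:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(6)
    df = pd.DataFrame(rng.normal(size=(30, 5)),
                      columns=["Y", "X1", "X2", "X3", "X4"])
    df.loc[0:4, "X2"] = np.nan    # five missing X2 values
    df.loc[0, "X3"] = np.nan      # one of those rows is also missing X3

    for preds in (["X1", "X2", "X3", "X4"], ["X2", "X3", "X4"],
                  ["X1", "X3", "X4"], ["X1", "X4"]):
        n = len(df.dropna(subset=["Y"] + preds))
        print(preds, "-> usable n =", n)    # 25, 25, 29, 30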



Model results

Predictors        Sample size   R2      s
X1, X2, X3, X4    25            0.959   0.865
X2, X3, X4        25            0.958   0.849
X1, X3, X4        29            0.953   0.852
X1, X4            30            0.640   2.300

Ordinarily, we would probably favor the (X2, X3, X4)model.

However, the (X1, X3, X4) model applies to more of the sample.

Thus, in this case, we would probably favor the (X1, X3, X4) model, since R2 and s are roughly equivalent, but the usable sample size is larger.