Applied Regression Modeling, Third Edition
5. Model Building II: Sections 5.1–5.2

Iain Pardoe
Thompson Rivers University
The Pennsylvania State University
March 2020

Outline

5.1 Influential points
5.2 Regression pitfalls


Influential points

Beware model conclusions that are overly influenced by a small handful of data points. For example:

- overall results can be biased if a few unusual points differ dramatically from general patterns in the majority of the data values;
- it is misleading to conclude evidence of a strong association between variables if that evidence is based mainly on a few dominant points.

We focus on two measures of individual data point influence:

- outliers have unusual Y-values relative to their predicted Y-values from a model;
- high leverage points have unusual combinations of X-values relative to general dataset patterns.

Cook's distance is a composite measure of outlyingness and leverage.



Outliers

Outliers have unusual Y-values relative to their predicted Y-values from a model; in other words, they are observations with a large-magnitude residual, e_i = Y_i − Ŷ_i.
Statistical software can calculate studentized residuals to put the residuals on a common scale.
When the four regression assumptions (zero mean, constant variance, normality, and independence) are satisfied, studentized residuals follow an approximate N(0, 1) distribution.
If we identify an observation with a studentized residual outside (−3, 3), we have either witnessed a very unusual event (one with probability less than 0.002) or found an observation with a Y-value that does not fit the pattern in the rest of the dataset.
Formally, define a potential outlier as an observation with studentized residual < −3 or > 3.
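For reference, one standard definition of the (externally) studentized residual divides each raw residual by an estimate of its standard deviation computed with observation i excluded; the leverage h_ii is defined later in this section:

    r_i = \frac{e_i}{s_{(i)} \sqrt{1 - h_{ii}}}, \qquad e_i = Y_i - \hat{Y}_i,

where s_(i) is the regression standard error from the fit with observation i removed. This matches what most software reports as a studentized (or externally studentized) residual, though exact conventions vary by package.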


Dealing with outliers

If we find one or more outliers, investigate why:

- data input mistake (remedy: identify and correct the mistake(s) and reanalyze the data);
- important predictor omitted from the model (remedy: identify potentially useful predictors not included in the model and reanalyze the data);
- regression assumptions violated (remedy: reformulate the model, using transformations or interactions, say, to correct the problem);
- potential outliers differ substantively from the other sample observations (remedy: remove the outliers and reanalyze the remainder of the dataset separately).

To gauge outlier influence, exclude the observation with the largest-magnitude studentized residual, refit the model to the remaining observations, and see if the regression parameter estimates change substantially (a sketch of this check follows).
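A minimal sketch of this refit-and-compare check in Python with statsmodels; the data frame and model formula are placeholders for your own data:

    import numpy as np
    import statsmodels.formula.api as smf

    def drop_worst_outlier_and_compare(df, formula):
        """Refit without the largest-magnitude studentized residual."""
        fit_all = smf.ols(formula, data=df).fit()
        stud = fit_all.get_influence().resid_studentized_external
        worst = np.argmax(np.abs(stud))  # row with largest |studentized residual|
        fit_drop = smf.ols(formula, data=df.drop(df.index[worst])).fit()
        # Compare the two sets of estimates; substantial changes flag influence.
        return fit_all.params, fit_drop.params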



CARS5 data

Y = city gallons per hundred miles for 50 new U.S. passenger cars in 2011.
X1 = engine size (liters).
X2 = number of cylinders.
X3 = interior volume (hundreds of cubic feet).

Model 1: E(Y) = b0 + b1 X1 + b2 X2 + b3 X3.

Parameters (response variable: Y = Cgphm):

Model 1       Estimate   Std. Error   t-stat   Pr(>|t|)
(Intercept)   2.542      0.365        6.964    0.000
X1 = Eng      0.235      0.112        2.104    0.041
X2 = Cyl      0.294      0.076        3.855    0.000
X3 = Vol      0.476      0.279        1.709    0.094
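A sketch of fitting Model 1 in Python with statsmodels, assuming the CARS5 data sit in a CSV file with columns Cgphm, Eng, Cyl, and Vol (the file and column names are assumptions):

    import pandas as pd
    import statsmodels.formula.api as smf

    cars = pd.read_csv("CARS5.csv")  # assumed file name
    model1 = smf.ols("Cgphm ~ Eng + Cyl + Vol", data=cars).fit()
    print(model1.summary())  # estimates, standard errors, t-stats, p-values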


Model 1 studentized residuals

There is one outlier with studentized residual ≈ −8.

[Figure: Model 1 studentized residuals; the Lexus HS 250h is the outlier near −8.]


Remove outlier

Parameters (response variable: Y = Cgphm):

Model 2       Estimate   Std. Error   t-stat   Pr(>|t|)
(Intercept)   2.905      0.237        12.281   0.000
X1 = Eng      0.259      0.071        3.645    0.001
X2 = Cyl      0.228      0.049        4.634    0.000
X3 = Vol      0.492      0.177        2.774    0.008

The regression parameter estimates and p-values change dramatically.
The outlier was a hybrid and did not fit the pattern of the standard gasoline-powered cars.


Model 2 studentized residuals

No further outliers.

[Figure: Model 2 studentized residuals, all within about (−2, 2).]


Leverage

High leverage points have unusual combinations of X-values relative to general dataset patterns.
If a point is far from the majority of the sample, it can pull the fitted model toward its Y-value, potentially biasing the results.
Leverage measures the potential for an observation to have undue influence on a model (on a 0–1 scale, from low to high).

Rule of thumb:

- if leverage > 3(k + 1)/n, investigate further;
- if leverage > 2(k + 1)/n and the point is isolated, investigate further;
- otherwise, there is no evidence of undue influence.

To gauge influence, exclude the highest leverage point, refit the model to the remaining observations, and see if the regression parameter estimates change substantially (see the sketch below).
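A sketch of the leverage rule of thumb with statsmodels; here `fit` is any fitted OLS results object, and k counts the predictor terms:

    import numpy as np

    def high_leverage(fit):
        lev = fit.get_influence().hat_matrix_diag  # one leverage per observation
        n, k = int(fit.nobs), int(fit.df_model)
        investigate = np.where(lev > 3 * (k + 1) / n)[0]
        investigate_if_isolated = np.where(lev > 2 * (k + 1) / n)[0]
        return investigate, investigate_if_isolated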



Results with outlier removed

Parameters (response variable: Y = Cgphm):

Model 2       Estimate   Std. Error   t-stat   Pr(>|t|)
(Intercept)   2.905      0.237        12.281   0.000
X1 = Eng      0.259      0.071        3.645    0.001
X2 = Cyl      0.228      0.049        4.634    0.000
X3 = Vol      0.492      0.177        2.774    0.008

Leverage thresholds (k = 3 predictors, n = 49 cars): 3(k + 1)/n = 3(3 + 1)/49 = 0.24 and 2(k + 1)/n = 2(3 + 1)/49 = 0.16.


Model 2 leverages

Three high leverage points exceed the 3(k + 1)/n threshold.

[Figure: leverage versus ID number; the Aston Martin DB9, Dodge Challenger SRT8, and Rolls-Royce Ghost exceed the threshold.]


Results with highest leverage point removed

Parameters (response variable: Y = Cgphm):

Model 3       Estimate   Std. Error   t-stat   Pr(>|t|)
(Intercept)   2.949      0.247        11.931   0.000
X1 = Eng      0.282      0.080        3.547    0.001
X2 = Cyl      0.201      0.064        3.139    0.003
X3 = Vol      0.532      0.188        2.824    0.007

The regression parameter estimates and p-values do not change dramatically.
The highest leverage point had the potential to strongly influence the results, but in this case it did not do so.


Cook's distance

Cook's distance is a composite measure of outlyingness and leverage.

Rule of thumb:

- observations with a Cook's distance > 1 are often sufficiently influential that they should be removed from the main analysis; investigate further;
- observations with a Cook's distance > 0.5 are sometimes sufficiently influential that they should be removed from the main analysis; investigate further;
- otherwise, an observation is unlikely to have undue influence (although not always; see Problem 5.2).

To gauge influence, exclude the observation with the largest Cook's distance, refit the model to the remaining observations, and see if the regression parameter estimates change substantially (see the sketch below).
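A sketch of flagging influential observations by Cook's distance with statsmodels; `fit` is any fitted OLS results object:

    import numpy as np

    def flag_cooks_distance(fit):
        d = fit.get_influence().cooks_distance[0]  # distances; [1] holds p-values
        return np.where(d > 1)[0], np.where(d > 0.5)[0]  # often / sometimes influential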



Results for all observations

Parameters (response variable: Y = Cgphm):

Model 1       Estimate   Std. Error   t-stat   Pr(>|t|)
(Intercept)   2.542      0.365        6.964    0.000
X1 = Eng      0.235      0.112        2.104    0.041
X2 = Cyl      0.294      0.076        3.855    0.000
X3 = Vol      0.476      0.279        1.709    0.094


Model 1 Cook's distances

The highest Cook's distance exceeds the 0.5 threshold.

[Figure: Cook's distance versus ID number; the Lexus HS 250h exceeds the 0.5 threshold.]


Results with highest Cook's distance removed

Parameters (response variable: Y = Cgphm):

Model 2       Estimate   Std. Error   t-stat   Pr(>|t|)
(Intercept)   2.905      0.237        12.281   0.000
X1 = Eng      0.259      0.071        3.645    0.001
X2 = Cyl      0.228      0.049        4.634    0.000
X3 = Vol      0.492      0.177        2.774    0.008

The car with the highest Cook's distance was the outlier we found before.


Model 2 Cook's distances

The highest Cook's distance is less than the 0.5 threshold.

[Figure: Cook's distance versus ID number for Model 2; the largest values (Aston Martin DB9, Dodge Challenger SRT8, Rolls-Royce Ghost) all fall below 0.5.]


Regression pitfalls

Some of the pitfalls that can cause problems with a regression analysis:

- nonconstant variance: violation of the constant variance assumption;
- autocorrelation (serial correlation): failing to account for time trends in the model;
- multicollinearity: highly correlated predictors causing unstable model results;
- excluding important predictor variables: leading to possibly incorrect conclusions;
- overfitting (the sample data): leading to poor generalizability to the population;
- extrapolation: using model results for predictor values very different from those in the sample;
- missing data: leading to reduced sample sizes at best, misleading results at worst;
- power and sample size: small samples can lead to problems with power.


Nonconstant variance

Regression model residuals can violate the constant variance assumption if there is a fan or funnel pattern in a residual scatterplot.
Tests such as Breusch/Pagan (1979) and Cook/Weisberg (1983) can complement residual plot analysis.
Nonconstant variance creates technical difficulties with the model, such as inaccurate prediction intervals.
Remedies include response transformations, weighted least squares, generalized least squares, and generalized linear models.

Example: FERTILITY data file with Y = Fertility, X1 = loge(Income), and X2 = Education for 169 countries (see the sketch below).
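A sketch of testing for nonconstant variance and trying a log response transformation for this example, assuming a FERTILITY.csv file with columns Fertility, Income, and Education (the file and column names are assumptions):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.diagnostic import het_breuschpagan

    fert = pd.read_csv("FERTILITY.csv")

    # Model 1: Y = Fertility; a small Breusch-Pagan p-value signals
    # nonconstant variance.
    m1 = smf.ols("Fertility ~ np.log(Income) + Education", data=fert).fit()
    print("BP p-value, model 1:", het_breuschpagan(m1.resid, m1.model.exog)[1])

    # Model 2: log-transform the response, then re-test.
    m2 = smf.ols("np.log(Fertility) ~ np.log(Income) + Education", data=fert).fit()
    print("BP p-value, model 2:", het_breuschpagan(m2.resid, m2.model.exog)[1])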


Model 1 residuals

Model 1 uses Y as the response variable. The plot indicates increasing residual variance as the fitted values increase.

[Figure: studentized residuals versus fitted values for Model 1, fanning out as fitted values increase.]


Model 2 residuals

Model 2 uses loge(Y) as the response variable. The constant variance assumption is more reasonable now.

[Figure: studentized residuals versus fitted values for Model 2, showing roughly constant spread.]


Autocorrelation

Autocorrelation occurs when regression model residuals violate the independence assumption because they are highly dependent across time.
It can occur when regression data have been collected over time and the model fails to account for any strong time trends.
Dealing with this issue rigorously can require specialized time series and forecasting methods.
Sometimes, however, simple ideas can mitigate autocorrelation problems.

Example: UNEMP data file containing annual macroeconomic data from 1959 to 2010 (see the sketch below).
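A sketch of one simple mitigation, first differencing, assuming a UNEMP.csv file with a response y and predictors x1 through x4 (all names are assumptions; the slides do not list the variables):

    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.stattools import durbin_watson

    unemp = pd.read_csv("UNEMP.csv")

    # Model 1: levels. A Durbin-Watson statistic far from 2 suggests
    # autocorrelated residuals.
    m1 = smf.ols("y ~ x1 + x2 + x3 + x4", data=unemp).fit()
    print("DW, levels:", durbin_watson(m1.resid))

    # Model 2: year-over-year first differences of every variable.
    m2 = smf.ols("y ~ x1 + x2 + x3 + x4", data=unemp.diff().dropna()).fit()
    print("DW, differenced:", durbin_watson(m2.resid))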


Model 1 residuals

Model 1: multiple linear regression with 4 predictors. The residual plot shows clear evidence of autocorrelation.

[Figure: residuals versus time period for Model 1.]


Model 2 residuals

Model 2: use first-differenced data. The independent errors assumption is more reasonable now.

[Figure: residuals versus time period for Model 2.]


Multicollinearity

Multicollinearity occurs when excessive correlation between quantitative predictors leads to unstable models and inflated standard errors.
Identify it by looking at a scatterplot matrix, calculating bivariate correlations, and calculating variance inflation factors (potential problems if VIF > 4, likely problems if VIF > 10); a sketch follows below.

Potential remedies include:

- collect more uncorrelated data (if possible);
- create new combined predictor variables from the highly correlated predictors (if possible);
- remove one of the highly correlated predictors from the model.

Example: SALES3 data file with sales (Y = Sales), TV/newspaper advertising (X1 = Trad), and internet advertising (X2 = Int).
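A sketch of the VIF check for this example, assuming a SALES3.csv file with columns Sales, Trad, and Int (the file name is an assumption):

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    sales = pd.read_csv("SALES3.csv")
    X = sm.add_constant(sales[["Trad", "Int"]])

    # One VIF per predictor (index 0 is the intercept, so skip it);
    # VIF > 4 flags potential problems, VIF > 10 likely problems.
    for i, name in enumerate(X.columns[1:], start=1):
        print(name, variance_inflation_factor(X.values, i))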



Model 1 results

Model 1: E(Y) = b0 + b1 X1 + b2 X2.

Model summary (predictors: (Intercept), X1 = Trad, X2 = Int):

          Multiple R   R Squared   Adjusted R Squared   Regression Std. Error
Model 1   0.987        0.974       0.968                0.8916

Parameters (response variable: Y = Sales):

Model 1       Estimate   Std. Error   t-stat   Pr(>|t|)   VIF
(Intercept)   1.992      0.902        2.210    0.054
X1 = Trad     0.767      0.868        0.884    0.400      49.541
X2 = Int      1.275      0.737        1.730    0.118      49.541

R2 is 0.974, but neither X1 nor X2 is significant (given the presence of the other)!
VIF > 10 suggests there is a multicollinearity problem.


X1 and X2 highly correlated

Estimates are unstable when both X1 and X2 are in the model.

[Figure: scatterplot matrix of Sales, Trad, and Int, showing the strong correlation between Trad and Int.]


Model 2 results

Model 2: E(Y) = b0 + b1 (X1 + X2).

Model summary (predictors: (Intercept), X1 + X2):

          Multiple R   R Squared   Adjusted R Squared   Regression Std. Error
Model 2   0.987        0.974       0.971                0.8505

Parameters (response variable: Y = Sales):

Model 2       Estimate   Std. Error   t-stat   Pr(>|t|)
(Intercept)   1.776      0.562        3.160    0.010
X1 + X2       1.042      0.054        19.240   0.000

R2 is unchanged, and the combined predictor variable, X1 + X2, is significant.
Note this approach is only possible if it makes sense to create a combined predictor variable.
It is more common to drop one of the correlated predictors from the model.
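A sketch of fitting this combined-predictor model (same assumed SALES3.csv as above; patsy's I() builds the sum inside the formula):

    import pandas as pd
    import statsmodels.formula.api as smf

    sales = pd.read_csv("SALES3.csv")
    m2 = smf.ols("Sales ~ I(Trad + Int)", data=sales).fit()
    print(m2.summary())  # R-squared essentially unchanged; combined term significant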


Excluding important predictor variables

Excluding important predictors sometimes results in models that provide incorrect, biased conclusions about the included predictors.
Strive to include all potentially important predictors, and remove a predictor only if there are compelling reasons to do so (e.g., it is causing multicollinearity problems and has a high individual p-value).

Example: PARADOX data file with n = 27 high-precision computer components, with component quality (Y) potentially depending on two controllable machine factors, speed (X1) and angle (X2).


Model 1 results

Model 1: E(Y) = b0 + b1X1.

Parameters a

Model            Estimate   Std. Error   t-stat   Pr(>|t|)
1  (Intercept)   2.847      1.011        2.817    0.009
   X1 = Speed    0.430      0.188        2.288    0.031

a Response variable: Y = Quality.

Results suggest a positive association between quality and speed.

In other words, increase the speed of the machine to improve quality.

However, this ignores process information relating to angle.



Model 2 results

Model 2: E(Y) = b0 + b1X1 + b2X2.

Parameters a

Model            Estimate   Std. Error   t-stat    Pr(>|t|)
2  (Intercept)   1.638      0.217        7.551     0.000
   X1 = Speed    −0.962     0.071        −13.539   0.000
   X2 = Angle    2.014      0.086        23.473    0.000

a Response variable: Y = Quality.

Results suggest a negative association between quality and speed (for a fixed angle),

and a positive association between quality and angle (for a fixed speed).

In other words, increase the angle but decrease the speed of the machine to improve quality.



Paradox explained

Positive association between Y and X1 ignoring X2.

Negative association between Y and X1 accounting for X2.

[Figure: scatterplot of Quality versus Speed, points marked by Angle; the overall trend is positive, but within each Angle value the trend is negative.]
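The reversal is easy to reproduce in simulation. A sketch with synthetic data built to mimic the PARADOX structure (speed tracking angle across settings, quality falling with speed at fixed angle); it does not use the actual data file:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    angle = rng.uniform(1, 7, 27)
    speed = angle + rng.normal(0, 0.5, 27)    # speed rises with angle
    quality = 1.6 - 1.0 * speed + 2.0 * angle + rng.normal(0, 0.3, 27)

    m1 = sm.OLS(quality, sm.add_constant(speed)).fit()
    m2 = sm.OLS(quality, sm.add_constant(np.column_stack([speed, angle]))).fit()

    print(m1.params[1])    # positive slope when angle is ignored
    print(m2.params[1:])   # negative speed slope once angle is accounted for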


Overfitting

Overfitting can occur if an overly complicated model tries to account for every possible pattern in the sample data but generalizes poorly to the underlying population.

Should always apply a "sanity check" to make sure the model makes sense from a subject-matter perspective and conclusions are supported by the data.

[Figure: two scatterplots of the same Y versus X data (X from −1 to 1), one with a simple fitted model and one with an overly complicated fit that chases individual points.]

Which model seems more reasonable?
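One data-driven version of that sanity check is to hold out part of the sample: an overfitted model wins on the data it was trained on but loses on fresh data. A minimal sketch with synthetic data (not from any textbook data file):

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.uniform(-1, 1, 40)
    y = 4 + 3 * x + 2 * x**2 + rng.normal(0, 1, 40)   # gently curved truth
    xtr, ytr, xte, yte = x[:20], y[:20], x[20:], y[20:]

    for degree in (2, 12):
        coefs = np.polyfit(xtr, ytr, deg=degree)
        mse = np.mean((np.polyval(coefs, xte) - yte) ** 2)
        print(f"degree {degree}: held-out MSE = {mse:.2f}")

    # The degree-12 fit hugs the training points but typically has the
    # larger held-out error, the signature of overfitting.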



Extrapolation

Extrapolation occurs when regression model results are used to estimate or predict a response value for an observation with predictor values that are very different from those in the sample.

This can be dangerous because it means making a decision about a situation where there are no data values to support our conclusions.

Example: if we observe an upward trend between two variables, should we assume the trend continues indefinitely at higher values?


Extrapolation example

The straight-line model overshoots the actual Y at the far right.

The quadratic model undershoots (i.e., neither model enables accurate prediction far from the sample data).

[Figure: scatterplot of Production = production in 100s versus Labor = labor hours in 100s, with straight-line and quadratic fits extended beyond the observed Labor range.]
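A sketch of the same phenomenon on synthetic data: here the assumed truth flattens out at high X, so a straight line fitted on Labor between 2 and 8 overshoots at Labor = 12, while a quadratic can undershoot. (This is a made-up curve, not the textbook's production data.)

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.uniform(2, 8, 30)                         # observed Labor range
    y = 25 * (1 - np.exp(-x / 4)) + rng.normal(0, 0.5, 30)

    line = np.polyfit(x, y, 1)
    quad = np.polyfit(x, y, 2)

    x_new = 12.0                                      # far outside the sample
    print("truth:", 25 * (1 - np.exp(-x_new / 4)))    # about 23.8
    print("line :", np.polyval(line, x_new))          # overshoots
    print("quad :", np.polyval(quad, x_new))          # undershoots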


Missing data

Missing data occurs when particular values in the dataset have not been recorded for particular variables and observations.

Dealing with the issue rigorously is beyond the scope of the book, but there are some simple ideas that can mitigate some of the major problems.

Example: MISSING data file with n = 30, Y, and X1–X4.

There are no missing values for Y, X1, or X4, but five missing values for X2, one of which is also missing X3 (see the sketch below):

any model including X2 will exclude five observations;
including X3 (but excluding X2) will exclude one observation;
excluding X2 and X3 will exclude no observations.
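These counts are exactly what listwise deletion (the default in most regression software) produces. A small pandas sketch that mimics the MISSING file's missingness pattern with made-up values:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(6)
    df = pd.DataFrame(rng.normal(size=(30, 5)),
                      columns=["Y", "X1", "X2", "X3", "X4"])
    df.loc[0:4, "X2"] = np.nan    # five missing X2 values
    df.loc[0, "X3"] = np.nan      # one of those rows is also missing X3

    for preds in (["X1", "X2", "X3", "X4"], ["X2", "X3", "X4"],
                  ["X1", "X3", "X4"], ["X1", "X4"]):
        n = len(df.dropna(subset=["Y"] + preds))
        print(preds, "-> usable n =", n)    # 25, 25, 29, 30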



Model results

Predictors        Sample size   R2      s
X1, X2, X3, X4    25            0.959   0.865
X2, X3, X4        25            0.958   0.849
X1, X3, X4        29            0.953   0.852
X1, X4            30            0.640   2.300

Ordinarily, we would probably favor the (X2, X3, X4)model.

However, the (X1, X3, X4) model applies to more of the sample.

Thus, in this case, we would probably favor the (X1, X3, X4) model, since R2 and s are roughly equivalent, but the usable sample size is larger.