Stat 112 Notes 15 Today: –Outliers and influential points. Homework 4 due on Thursday.


Stat 112 Notes 15

• Today:– Outliers and influential points.

• Homework 4 due on Thursday.


Outliers and Influential Observations in Simple Regression

• Outlier: Any really unusual observation.

• Outlier in the X direction (called high leverage point): Has the potential to influence the regression line.

• Outlier in the direction of the scatterplot (outlier in residuals): An observation that deviates from the overall pattern of relationship between Y and X. Residual is large in absolute value.

• Influential observation: Point that if it is removed would markedly change the statistical analysis. For simple linear regression, points that are outliers in the X direction are often influential.


Housing Prices and Crime Rates

• A community in the Philadelphia area is interested in how crime rates are associated with property values. If low crime rates increase property values, the community might be able to cover the costs of increased police protection by gains in tax revenues from higher property values.

• The town council looked at a recent issue of Philadelphia Magazine (April 1996) and found data for itself and 109 other communities in Pennsylvania near Philadelphia. Data is in philacrimerate.JMP. House price = Average house price for sales during most recent year, Crime Rate = Rate of crimes per 1000 population.


Housing Price-Crime Rate Data

[Figure: Bivariate Fit of HousePrice By CrimeRate. Scatterplot of HousePrice (0 to 500000) against CrimeRate (0 to 400). Labeled points: Gladwyne, Haverford, Phila, N and Phila, CC.]

Gladwyne and Haverford are outliers in the direction of the scatterplot (their house price is considerably higher than one would expect given their crime rate).


Outliers in Direction of Scatterplot

• Residual: $\hat{e}_i = y_i - \hat{y}_i$, where $\hat{y}_i = b_0 + b_1 x_{i1} + \cdots + b_K x_{iK}$.

• Standardized Residual: $\hat{e}_i / \mathrm{RMSE}$.

• Under the multiple regression model, about 5% of the points should have standardized residuals greater in absolute value than 2, and fewer than 1% (about 0.3% under normality) should have standardized residuals greater in absolute value than 3. Any point with standardized residual greater in absolute value than 3 should be examined.

• To compute standardized residuals in JMP, right click in a new column, click Formula and create a formula with the residual divided by the RMSE.
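The same calculation can be sketched outside JMP. The Python example below is only an illustration under stated assumptions: the data are made up (they are not the Philadelphia data), the helper names `fit_line` and `standardized_residuals` are hypothetical, and the RMSE is taken as the square root of SSE/(n - 2), the usual definition for simple linear regression.

```python
import math

def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

def standardized_residuals(x, y):
    """Residuals divided by the RMSE (simple regression, so n - 2 degrees of freedom)."""
    b0, b1 = fit_line(x, y)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    rmse = math.sqrt(sum(e ** 2 for e in resid) / (len(x) - 2))
    return [e / rmse for e in resid]

# Made-up data: roughly y = 2x, with one unusually high y at x = 8.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 25.0, 18.1, 19.9]

std_res = standardized_residuals(x, y)
worst = max(range(len(x)), key=lambda i: abs(std_res[i]))
print("largest |standardized residual| at index", worst, "value", round(std_res[worst], 2))
```

With only 10 points a single unusual observation inflates the RMSE, so its standardized residual may stay below 3 even when it is visibly out of line; the cutoff of 3 is more informative in larger data sets.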


Outliers in Residuals for Philadelphia Crime Rate Data

Largest Positive Standardized Residuals

1. Gladwyne, 3.74
2. Villanova, 3.23
3. Haverford, 3.19
4. Horsham, 2.64
5. Upper Makefield, 2.23
6. Lower Merion, 1.65

Largest in Magnitude Negative Standardized Residuals

1. North Philadelphia, -1.45
2. Darby Borough, -1.30

The strong outliers in residuals (standardized residual greater than 3 in absolute value) are Gladwyne, Villanova and Haverford.


Influential Points and Leverage Points

• Influential observation: Point that if it is removed would markedly change the statistical analysis. For simple linear regression, points that are outliers in the X direction are often influential.

• Leverage point: Point that is an outlier in the X direction that has the potential to be influential. It will be influential if its residual is of moderately large magnitude.


Which Observations Are Influential?

[Figure: Bivariate Fit of HousePrice By CrimeRate. Scatterplot of HousePrice (0 to 500000) against CrimeRate (0 to 400) with three linear fits overlaid. Labeled points: Gladwyne, Haverford, Phila, N and Phila, CC.]

All observations, Linear Fit: HousePrice = 176629.41 - 576.90813 CrimeRate

Without Center City Philadelphia, Linear Fit: HousePrice = 225233.55 - 2288.6894 CrimeRate

Without Gladwyne, Linear Fit: HousePrice = 173116.43 - 567.74508 CrimeRate

Center City Philadelphia is influential; Gladwyne is not. In general, points that have high leverage are more likely to be influential.
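Checking influence by refitting without a point can be sketched in a few lines of Python. This is an illustration only: the data are made up to mimic the pattern in the notes (one point far out in X, one point with a large residual at an ordinary X), and the helper names `fit_line` and `fit_without` are hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

def fit_without(x, y, drop):
    """Refit the line after excluding the observation at index `drop`."""
    xs = [xi for i, xi in enumerate(x) if i != drop]
    ys = [yi for i, yi in enumerate(y) if i != drop]
    return fit_line(xs, ys)

# Made-up data: index 10 is far out in the X direction (high leverage),
# index 4 has an unusually high y at an ordinary x (outlier in residuals).
x = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 100]
y = [170, 164, 158, 152, 186, 140, 134, 128, 122, 116, 150]

_, b1_all = fit_line(x, y)
_, b1_no_leverage = fit_without(x, y, 10)
_, b1_no_outlier = fit_without(x, y, 4)

print("slope, all observations:         ", round(b1_all, 3))
print("slope, without high leverage pt: ", round(b1_no_leverage, 3))
print("slope, without residual outlier: ", round(b1_no_outlier, 3))
```

In this made-up example the slope moves much more when the high leverage point is dropped than when the residual outlier is dropped, which is the same qualitative pattern the notes report for Center City Philadelphia and Gladwyne.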


Excluding Observations from Analysis in JMP

• To exclude an observation from the regression analysis in JMP, go to the row of the observation, click Rows and then click Exclude/Unexclude. A red circle with a diagonal line through it should appear next to the observation.

• To put the observation back into the analysis, go to the row of the observation, click Rows and then click Exclude/Unexclude. The red circle should no longer appear next to the observation.


Formal measures of leverage and influence

• Leverage: “Hat values” (JMP calls them Hats).

• Influence: Cook’s Distance (JMP calls them Cook’s D Influence).

• To obtain them in JMP, click Analyze, Fit Model, put Y variable in Y and X variable in Model Effects box. Click Run Model box. After model is fit, click red triangle next to Response. Click Save Columns and then click Hats for leverages and click Cook’s D Influence for Cook’s Distances.

• To sort observations in terms of Cook’s Distance or Leverage, click Tables, Sort and then put variable you want to sort by in By box.


[Figure: Distributions of Cook's D Influence HousePrice (axis 0 to 30; labeled points: Haverford, Gladwyne, Phila, CC) and of h HousePrice (hat values, axis 0 to 0.9; labeled point: Phila, CC).]

Center City Philadelphia has both high influence (Cook’s Distance much greater than 1) and high leverage (hat value > 3*2/99 = 0.06). No other observations have high influence or high leverage.


Rules of Thumb for High Leverage and High Influence

• High Leverage: Any observation with a leverage (hat value) > (3 * # of coefficients in regression model)/n has high leverage, where # of coefficients in regression model = 2 for simple linear regression and n = number of observations.

• High Influence: Any observation with a Cook’s Distance greater than 1 has high influence.
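The two rules of thumb can be checked numerically. The Python sketch below uses the standard textbook formulas for simple linear regression, hat value h_i = 1/n + (x_i - xbar)^2 / Sxx and Cook's Distance D_i = e_i^2 h_i / (p * MSE * (1 - h_i)^2) with p = 2 coefficients; it is an assumption here that JMP's saved Hats and Cook's D Influence columns correspond to these definitions. The data and the function name `leverage_and_cooks` are made up for illustration.

```python
def leverage_and_cooks(x, y):
    """Hat values and Cook's Distances for the simple regression of y on x."""
    n = len(x)
    p = 2  # number of coefficients: intercept and slope
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in resid) / (n - p)
    hats = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    cooks = [e ** 2 * h / (p * mse * (1 - h) ** 2) for e, h in zip(resid, hats)]
    return hats, cooks

# Made-up data: the last point is far out in X and does not follow the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
y = [1.2, 1.9, 3.1, 4.0, 5.2, 5.9, 7.1, 8.0, 9.1, 12.0]

hats, cooks = leverage_and_cooks(x, y)
leverage_cutoff = 3 * 2 / len(x)  # (3 * # of coefficients) / n
high_leverage = [i for i, h in enumerate(hats) if h > leverage_cutoff]
high_influence = [i for i, d in enumerate(cooks) if d > 1]

print("high leverage indices: ", high_leverage)
print("high influence indices:", high_influence)
```

A useful sanity check on any such calculation is that the hat values add up to the number of coefficients (2 here).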


What to Do About Suspected Influential Observations?

Does removing the observation change the substantive conclusions?

• If not, can say something like “Observation x has high influence relative to all other observations but we tried refitting the regression without Observation x and our main conclusions didn’t change.”


• If removing the observation does change substantive conclusions, is there any reason to believe the observation belongs to a population other than the one under investigation?

– If yes, omit the observation and proceed.

– If no, does the observation have high leverage (outlier in explanatory variable)?

• If yes, omit the observation and proceed. Report that conclusions only apply to a limited range of the explanatory variable.

• If no, not much can be said. More data (or clarification of the influential observation) are needed to resolve the questions.
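The sequence of questions above can be written down as a small decision function. This Python sketch only restates the flow of the notes; the function name, argument names and the wording of the returned messages are hypothetical.

```python
def handle_influential_point(changes_conclusions, different_population, high_leverage):
    """Follow the questions in the notes, in order, and return a recommendation."""
    if not changes_conclusions:
        return "keep: report that refitting without the point did not change the main conclusions"
    if different_population:
        return "omit: the point belongs to a different population; report the omission"
    if high_leverage:
        return "omit: report that conclusions apply only to a limited range of the explanatory variable"
    return "unresolved: more data or clarification of the observation is needed"

# Example: removal changes the conclusions, same population, high leverage.
print(handle_influential_point(True, False, True))
```

The order of the checks matters: leverage is only consulted when removal changes the conclusions and there is no reason to think the point comes from a different population.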


General Principles for Dealing with Influential Observations

• General principle: Delete observations from the analysis sparingly – only when there is good cause (observation does not belong to population being investigated or is a point with high leverage). If you do delete observations from the analysis, you should state clearly which observations were deleted and why.


Influential Points, High Leverage Points, Outliers in Multiple Regression

• As in simple linear regression, we identify high leverage and high influence points by checking the leverages and Cook’s distances (Use save columns to save Cook’s D Influence and Hats).

• High influence points: Cook’s distance > 1.

• High leverage points: A point with hat value greater than (3*(# of explanatory variables + 1))/n is a point with high leverage. These are points for which the explanatory variables are an outlier in a multidimensional sense.

• Use same guidelines for dealing with influential observations as in simple linear regression.

• Point that has unusual Y given its explanatory variables: point with a residual that is more than 3 RMSEs away from zero (standardized residual greater than 3 in absolute value)


Multiple regression, modeling and outliers, leverage and influential points

Pollution Example

• Data set pollution2.JMP provides information about the relationship between pollution and mortality for 60 cities between 1959 and 1961.

• The variables are:

• y (MORT) = total age-adjusted mortality in deaths per 100,000 population;

• PRECIP = mean annual precipitation (in inches);

• EDUC = median number of school years completed for persons 25 and older;

• NONWHITE = percentage of 1960 population that is nonwhite;

• NOX = relative pollution potential of NOx (related to the tons of NOx emitted per day per square kilometer);

• SO2 = log of relative pollution potential of SO2.


Multiple Regression: Steps in Analysis

1. Preliminaries: Define the question of interest. Review the design of the study. Correct errors in the data.

2. Explore the data. Use graphical tools, e.g., scatterplot matrix; consider transformations of explanatory variables; fit a tentative model; check for outliers and influential points.

3. Formulate an inferential model. Word the questions of interest in terms of model parameters.


Multiple Regression: Steps in Analysis Continued

4. Check the Model. (a) Check the model assumptions of linearity, constant variance, normality. (b) If needed, return to step 2 and make changes to the model (such as transformations or adding terms for interaction and curvature).

5. Infer the answers to the questions of interest using appropriate inferential tools (e.g., confidence intervals, hypothesis tests, prediction intervals).

6. Presentation: Communicate the results to the intended audience.


Scatterplot Matrix

• Before fitting a multiple linear regression model, it is a good idea to make scatterplots of the response variable versus each explanatory variable. These can suggest transformations of the explanatory variables that need to be done, as well as potential outliers and influential points.

• Scatterplot matrix in JMP: Click Analyze, Multivariate Methods and Multivariate, and then put the response variable first in the Y, columns box and then the explanatory variables in the Y, columns box.


Scatterplot Matrix

[Figure: Scatterplot matrix of MORT, PRECIP, EDUC, NONWHITE, NOX and SO2.]


Crunched Variables

• When an X variable is “crunched”, meaning that most of its values are crunched together and a few are far apart, there will be influential points. To reduce the effects of crunching, it is a good idea to transform the variable to log of the variable.
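The effect of the log transformation on a crunched variable can be illustrated by comparing the largest hat value before and after transforming. The Python sketch below uses made-up values (not the NOX data) and the simple regression hat value formula h_i = 1/n + (x_i - xbar)^2 / Sxx; the function name `hat_values` is hypothetical.

```python
import math

def hat_values(x):
    """Hat values for a simple regression on the single explanatory variable x."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

# Made-up crunched variable: most values are small, one is far away.
x = [1, 2, 2, 3, 3, 4, 5, 6, 8, 300]

max_hat_raw = max(hat_values(x))
max_hat_log = max(hat_values([math.log(xi) for xi in x]))

print("largest hat value, raw x: ", round(max_hat_raw, 3))
print("largest hat value, log x: ", round(max_hat_log, 3))
```

In this example the log transformation lowers the leverage of the extreme point but does not necessarily bring it below the rule-of-thumb cutoff, so leverage and Cook's Distance should still be checked after transforming.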


2. a) From the scatterplot of MORT vs. NOX we see that the NOX values are crunched very tightly. A log transformation of NOX is needed.

b) There seems to be an approximately linear relationship between MORT and the other variables.

Scatterplot Matrix

[Figure: Scatterplot matrix of MORT, PRECIP, EDUC, NONWHITE, NOX, log NOX and SO2.]


Response MORT

Summary of Fit
RSquare                  0.688278
RSquare Adj              0.659415
Root Mean Square Error   36.30065

Parameter Estimates
Term        Estimate     Std Error   t Ratio   Prob>|t|
Intercept   940.6541     94.05424    10.00     <.0001
PRECIP      1.9467286    0.700696    2.78      0.0075
EDUC        -14.66406    6.937846    -2.11     0.0392
NONWHITE    3.028953     0.668519    4.53      <.0001
log NOX     6.7159712    7.39895     0.91      0.3681
SO2         11.35814     5.295487    2.14      0.0365

[Figure: Residual by Predicted Plot. MORT Residual (-100 to 100) against MORT Predicted (750 to 1100). Labeled point: New Orleans, LA.]

[Figure: Cook’s Distances (axis 0 to 2). Labeled point: New Orleans, LA.]

New Orleans has Cook’s Distance greater than 1 – New Orleans may be influential.

3 RMSEs = 3 × 36.30 ≈ 108.9. No points are outliers in residuals.


Labeling Observations

• To have points identified by a certain column, go to the column, click Columns and click Label (click Unlabel to unlabel).

• To label a row, go to the row, click Rows and click Label.


Multiple Regression with New Orleans

Summary of Fit
RSquare                      0.688278
RSquare Adj                  0.659415
Root Mean Square Error       36.30065
Mean of Response             940.3568
Observations (or Sum Wgts)   60

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio
Model      5    157115.28        31423.1       23.8462
Error      54   71157.80         1317.7        Prob > F
C. Total   59   228273.08                      <.0001

Parameter Estimates
Term        Estimate     Std Error   t Ratio   Prob>|t|
Intercept   940.6541     94.05424    10.00     <.0001
PRECIP      1.9467286    0.700696    2.78      0.0075
EDUC        -14.66406    6.937846    -2.11     0.0392
NONWHITE    3.028953     0.668519    4.53      <.0001
Log NOX     6.7159712    7.39895     0.91      0.3681
SO2         11.35814     5.295487    2.14      0.0365

Multiple Regression without New Orleans

Summary of Fit
RSquare                      0.724661
RSquare Adj                  0.698686
Root Mean Square Error       32.06752
Mean of Response             937.4297
Observations (or Sum Wgts)   59

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio
Model      5    143441.28        28688.3       27.8980
Error      53   54501.26         1028.3        Prob > F
C. Total   58   197942.54                      <.0001

Parameter Estimates
Term        Estimate     Std Error   t Ratio   Prob>|t|
Intercept   852.3761     85.9328     9.92      <.0001
PRECIP      1.3633298    0.635732    2.14      0.0366
EDUC        -5.666948    6.52378     -0.87     0.3889
NONWHITE    3.0396794    0.590566    5.15      <.0001
Log NOX     -9.898442    7.730645    -1.28     0.2060
SO2         26.032584    5.931083    4.39      <.0001

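Removing New Orleans has a large impact on the coefficients of log NOX and SO2 (which is log SO2). In particular, it reverses the sign of the log NOX coefficient (from 6.72 to -9.90) and more than doubles the SO2 coefficient (from 11.36 to 26.03).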


Dealing with New Orleans

• New Orleans is influential.

• New Orleans also has high leverage, hat = 0.45 > (3*6/60) = 0.3.

• Thus, it is reasonable to exclude New Orleans from the analysis, report that we excluded New Orleans, and note that our model does not apply to cities with explanatory variables in the range of New Orleans’.
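The leverage cutoff for this model follows directly from the rule of thumb (3*(# of explanatory variables + 1))/n. The short Python check below uses the values reported in the notes (5 explanatory variables, n = 60, hat value 0.45 for New Orleans); the function name `high_leverage_cutoff` is hypothetical.

```python
def high_leverage_cutoff(num_explanatory, n):
    """Rule-of-thumb cutoff: 3 * (number of explanatory variables + 1) / n."""
    return 3 * (num_explanatory + 1) / n

# Values from the pollution example in these notes.
cutoff = high_leverage_cutoff(5, 60)  # PRECIP, EDUC, NONWHITE, log NOX, SO2
new_orleans_hat = 0.45
is_high_leverage = new_orleans_hat > cutoff

print("cutoff:", cutoff, "high leverage:", is_high_leverage)
```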