Transcript of 22s:152 Applied Linear Regression, Chapter 11
(homepage.stat.uiowa.edu/~rdecook/stat3200/notes/ch11.pdf)

22s:152 Applied Linear Regression

Chapter 11 Diagnostics: Unusual and Influential Data: Outliers, Leverage, and Influence

————————————————————

We’ve seen that there are certain assumptions made when we model the data.

These assumptions allow us to calculate a test statistic, and know the distribution of the test statistic under a null hypothesis.

Besides the specified distributional assumptions, there are other things to check for in regression.

In this section, we look for characteristics in the data that can cause problems with our fitted models.


In multiple regression, we often investigate with a scatterplot matrix first:

– collinearity (highly correlated x’s)
– outliers
– marginal pairwise linearity

These scatterplots can be useful, but they only show pairwise relationships.

After fitting the model, we check for:
– nonlinearity (with residual plot)
– non-constant variance
– non-normality
– multicollinearity (VIFs, yet to come in Ch. 13)
– outliers and influential observations


Outliers, Leverage, and Influence

• In general, an outlier is any unusual data point. It helps to be more specific...

[Figure: scatterplot of repwt vs. weight]

• In the above plot there is a strong outlier in the bottom right.

This is an outlier with respect to the X-distribution (independent variable).

It is not an outlier with respect to the Y-distribution (dependent variable).


• A strong outlier with respect to the independent variable(s) is said to have high leverage.

• Such a point could have a big impact on the fitted model.

• Here is the fitted line for the full data set:

[Figure: scatterplot of repwt vs. weight with fitted line, full data set]

• Here is the fitted line after removing the high-leverage point:

[Figure: scatterplot of repwt vs. weight with fitted line, high-leverage point removed]

• This one point had a lot of influence on the fitted line. The estimates of β0 and β1 changed greatly in the re-fit.

• The second picture shows a much better fit to the bulk of the data. But one must be careful about removing outliers. In this case it was a value that was reported incorrectly, and removal was justified.


• A point with high leverage doesn’t HAVE to have a large impact on the fitted line.

• Recall the cigarette data:

[Figure: scatterplot of cig.data$Nic vs. cig.data$Tar with fitted line, including the high-leverage point]

• Above is the fitted line with the inclusion of the point of high leverage (the top right data point is far outside the distribution of x-values).


• Here is the fitted line after removing the outlier:

[Figure: scatterplot of cig.data$Nic vs. cig.data$Tar with fitted line, leverage point removed]

• The before and after in this case are quite similar. This leverage point did NOT have a large influence on the fitted model.

• A point with high leverage has the potential to greatly affect the fitted model.


• The blue point below is an outlier, but not with respect to the independent variable (not high leverage).

[Figure: scatterplot of y vs. x with fitted line, outlier included]

• Here is the fitted line after removal of the outlier; it did not have a big influence on the fitted line.

[Figure: scatterplot of y vs. x with fitted line, outlier removed]


• What about higher dimensions (more predictors)?

• A point with high leverage will be outside of the ‘cloud’ of observed x-values.

• With 2 predictors X1 and X2 (no response shown):

[Figure: scatterplot of x2 vs. x1]

The point at the top left has high leverage because it is outside the cloud of (X1, X2) independent values.


Assessing Leverage: Hat Values

• Beyond graphics, we have a quantity called the hat value which is a measure of leverage.

• Leverage measures “potential” influence.

• In regression, the hat value hi (or hii) is a common measure of leverage.

• A high hat value hi equates to high leverage.

• What is hi? How is it calculated?

– First, why is it called a hat value?

– In matrix notation for OLS, we have:

Ŷ = Xβ̂ = X(X′X)⁻¹X′ Y

where X(X′X)⁻¹X′ is an n × n matrix called the hat matrix, H. Thus,

Ŷ = HY
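To make this concrete, here is a small pure-Python sketch (illustrative only, not from the notes; the x-values are made up) that builds H for a simple-linear-regression design and checks the hat-value properties discussed on the following pages:

```python
# Illustrative sketch (not from the notes): build the hat matrix
# H = X(X'X)^{-1}X' for a tiny SLR design and inspect its diagonal.
# The x-values below are made up; the last one is far from the rest.

x = [1.0, 2.0, 3.0, 4.0, 10.0]
n = len(x)
X = [[1.0, xi] for xi in x]          # design matrix: intercept + one predictor

# X'X is 2x2 here, so invert it by hand: [[n, sum x], [sum x, sum x^2]]
sx, sxx = sum(x), sum(xi * xi for xi in x)
det = n * sxx - sx * sx
XtX_inv = [[sxx / det, -sx / det], [-sx / det, n / det]]

def h_elem(i, j):
    """Element h_ij of H = X (X'X)^{-1} X'."""
    return sum(X[i][a] * XtX_inv[a][b] * X[j][b]
               for a in range(2) for b in range(2))

h = [h_elem(i, i) for i in range(n)]  # hat values: the diagonal of H

assert all(1.0 / n - 1e-12 <= hi <= 1.0 + 1e-12 for hi in h)  # 1/n <= h_i <= 1
assert abs(sum(h) - 2.0) < 1e-9       # sum of h_i = k + 1, with k = 1 here
assert max(h) == h[-1]                # the outlying x has the highest leverage
# H is symmetric and idempotent, so h_i = sum_j h_ij^2:
assert abs(h[0] - sum(h_elem(0, j) ** 2 for j in range(n))) < 1e-9
```

The same diagonal is what R’s hatvalues() returns.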


– Thus, Ŷj is the jth row of H times Y. But H is symmetric, so the book shows it as...

Ŷj = h1jY1 + h2jY2 + · · · + hnjYn = ∑_{i=1}^{n} hijYi

– The value hij captures the contribution of observation Yi to the fitted value of the jth observation, Ŷj (consider hij a weight).

– If hij is large, Yi CAN have a substantial impact on the jth fitted value.

– It can be shown that...

hi = hii = ∑_{j=1}^{n} hij²

and hi summarizes the potential influence of Yi on the fit of ALL other observations.


• The hat value for obs i is the ith element in the diagonal of H (thus, hii).

• How do we determine if hi is large?

* 1/n ≤ hi ≤ 1

* ∑_{i=1}^{n} hi = k + 1 (where k is the number of predictors)

* h̄ = (k + 1)/n

• We can compare hi to the average. If it is much larger, then the point has high leverage.

• Often use 2(k + 1)/n or 3(k + 1)/n as a guide for determining high leverage.


• In SLR, hi just measures the distance from the mean X̄:

hi = 1/n + (Xi − X̄)² / ∑_{j=1}^{n} (Xj − X̄)²

• In multiple regression, hi measures distance from the centroid (center of the cloud of data points in the X-space).

• The dependent-variable values are not involved in determining leverage. Leverage is a statement about the X-space.
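A quick sketch of the SLR formula (made-up numbers, in Python for illustration); note that no y-values appear anywhere, so leverage cannot depend on the response:

```python
# Illustrative sketch: SLR hat values from the closed form
# h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2.
# Leverage depends only on the x's; the response never enters.

def hat_values(x):
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xj - xbar) ** 2 for xj in x)
    return [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]

x = [1.0, 2.0, 3.0, 4.0, 10.0]   # made-up data; x = 10 is far from the rest
h = hat_values(x)

assert abs(sum(h) - 2.0) < 1e-9  # sum h_i = k + 1 with k = 1
assert max(h) == h[-1]           # farthest point from xbar: highest leverage
assert h[-1] > 2 * 2 / len(x)    # exceeds the 2(k+1)/n rule-of-thumb threshold
```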


• Example: Duncan data

Response: Prestige (percent of raters in an opinion study rating the occupation as excellent or good in prestige).

Predictors:
Income (percent of males in occupation earning $3500 or more in 1950).
Education (percent of males in occupation in 1950 who were high school graduates).

> library(car)   ## the Duncan data set ships with the car package
> attach(Duncan)

> head(Duncan)

type income education prestige

accountant prof 62 86 82

pilot prof 72 76 83

architect prof 75 92 90

author prof 55 90 76

chemist prof 64 86 90

minister prof 21 84 87

> nrow(Duncan)

[1] 45


Two Predictors:

> plot(education,income,pch=16,cex=2)

## The next line produces an interactive option for the

## user to identify points on a plot with the mouse.

> identify(education,income,row.names(Duncan))

[Figure: scatterplot of income vs. education, with ‘minister’, ‘conductor’, and ‘RR.engineer’ labeled]

From this bivariate plot of the two predictors, it looks like ‘RR.engineer’, ‘conductor’, and ‘minister’ may have high leverage.


Fit the model, get the hat values:

> lm.out=lm(prestige ~ education + income)

> hatvalues(lm.out)

1 2 3 4 ...

0.05092832 0.05732001 0.06963699 0.06489441 ...

Plot the hat values against the indices 1 to 45, and include thresholds for 2 ∗ (k + 1)/n and 3 ∗ (k + 1)/n:

> plot(hatvalues(lm.out),pch=16,cex=2)

> abline(h=2*3/45,lty=2)

> abline(h=3*3/45,lty=2)

> identify(1:45,hatvalues(lm.out),row.names(Duncan))


[Figure: index plot of hatvalues(lm.out) with the 2(k + 1)/n and 3(k + 1)/n threshold lines; ‘minister’, ‘conductor’, and ‘RR.engineer’ stand out]

These three points have high leverage (potential to greatly influence the fitted model) using the 2 times and 3 times the average hat value criterion.

To measure influence, we’ll look at another statistic, but first... consider studentized residuals.


Studentized Residuals

• In this section, we’re looking for outliers in the Y-direction for a given combination of independent variables (x-vector).

• Such an observation would have an unusually large residual, and we call it a regression outlier.

• If you have two predictors, the mean structure is a plane, and you’d be looking for observations that fall exceptionally far from the plane compared to the other observations.

• First, we start with standardized residuals.

• Though we assume the errors in our model have constant variance [εi ∼ N(0, σ²)], the estimated errors, or the sample residuals ei, DO NOT have equal variance...


• Observations with high leverage tend to have smaller residuals. This is an artifact of our model fitting. These points can pull the fitted model close to them, giving them a tendency toward smaller residuals.

• Var(ei) = σ²(1 − hi)

• We know... 1/n ≤ hi ≤ 1.

• With this variance, we can form standardized residuals e′i, which all have equal variance:

e′i = ei / ( σ √(1 − hi) ) = ei / ( SE √(1 − hi) )

but the distribution of e′i isn’t a t because the numerator and denominator are not independent (part of the theory of a t statistic).


We can get around the problem by estimating σ² with a sum of squares that does not include the ith residual.

We will use subscript (−i) to indicate quantities calculated without case i...

– Delete the ith observation, re-fit the model based on n − 1 observations, and get

σ̂²(−i) = RSS(−i) / (n − 1 − k − 1)   or   SE(−i) = √( RSS(−i) / (n − 1 − k − 1) )

• This gives us the studentized residual

e∗i = ei / ( SE(−i) √(1 − hi) )

• Now, the estimate in the denominator, SE(−i), is not correlated with the numerator, and

e∗i ∼ t_{n−1−k−1}
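A small pure-Python sketch (made-up data, not the notes’ example) that computes e∗i the long way, by deleting case i and re-fitting, and also checks the standard shortcut RSS(−i) = RSS − e²i/(1 − hi), which is why no actual re-fit is ever needed:

```python
# Illustrative sketch: externally studentized residuals for SLR,
# computed by literally deleting case i and re-fitting (made-up data).
import math

def slr(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    b1 = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / sxx
    return yb - b1 * xb, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.9, 9.0, 6.1]   # case i = 4 is well off the trend
n, k = len(x), 1
b0, b1 = slr(x, y)
e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
rss = sum(ei ** 2 for ei in e)
xb = sum(x) / n
sxx = sum((xi - xb) ** 2 for xi in x)
h = [1 / n + (xi - xb) ** 2 / sxx for xi in x]

estar = []
for i in range(n):
    xd, yd = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
    a0, a1 = slr(xd, yd)
    rss_d = sum((yj - (a0 + a1 * xj)) ** 2 for xj, yj in zip(xd, yd))
    # known shortcut: the deleted-fit RSS follows from full-fit quantities
    assert abs(rss_d - (rss - e[i] ** 2 / (1 - h[i]))) < 1e-9
    se_d = math.sqrt(rss_d / (n - 1 - k - 1))       # SE(-i)
    estar.append(e[i] / (se_d * math.sqrt(1 - h[i])))

assert max(range(n), key=lambda i: abs(estar[i])) == 4  # off-trend case stands out
```

In R the same quantities come straight from rstudent().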


e∗i ∼ t_{n−1−k−1}

• We can now use the t-distribution to judge what is a large studentized residual, or how likely we are to get a studentized residual as far away from 0 as the ones we get.

• As a note, if the ith observation has a large residual and we left it in the computation of SE, it may greatly inflate SE, deflating the standardized residual e′i and making it hard to notice that it was large.

• The standardized residual is also referred to as the internally studentized residual.

• The studentized residual listed here is also referred to as the externally studentized residual.


• Test for outliers by comparing e∗i to a t distribution with n − k − 2 df and applying a Bonferroni correction (multiply the p-values by the number of residuals).

• Really, we only have to be concerned about the largest studentized residuals. Perhaps take a closer look at those with |e∗i| > 2.


• Example: Returning to the Duncan model:

> nrow(Duncan)

[1] 45

> lm.out=lm(prestige ~ education + income)

> sort(rstudent(lm.out))

-2.3970223990 -1.9309187757 -1.7604905300 ...

Can look at adjusted p-values from t41, but there is a built-in function in the car library to help with this...

Get the adjusted p-value for the largest |e∗i|:

> outlierTest(lm.out,row.names(Duncan))

max|rstudent| = 3.134519, degrees of freedom = 41,

unadjusted p = 0.003177202, Bonferroni p = 0.1429741

Observation: minister

Since p = 0.1429741 is larger than 0.05, we would conclude that this model doesn’t have any extreme residuals.
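The Bonferroni adjustment here is just multiplication by n (capped at 1), so the numbers in the output above can be reproduced directly (sketch in Python for illustration):

```python
# The Bonferroni-adjusted p-value reported by car::outlierTest is simply
# n times the unadjusted two-sided p-value, capped at 1.
n = 45
p_unadjusted = 0.003177202            # from the output above (t with 41 df)
p_bonferroni = min(1.0, n * p_unadjusted)
assert abs(p_bonferroni - 0.1429741) < 1e-6
```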


• An outlier may indicate a sample peculiarity or may indicate a data entry error. It may suggest an observation belongs to another ‘population’.

• Fitting a model with and without an observation gives you a feel for the sensitivity of the fitted model to the observation.

• If an observation has a big impact on the fitted model and it’s not justified to remove it, one option is to report both models (with and without), commenting on the differences.

• An analysis called Robust Regression will allow you to leave the observation in, while reducing its impact on the fitted model.

This method essentially ‘weights’ the observations differently when computing the least squares estimates.


Measuring influence

• We’ve described a point with high leverage as having the potential to greatly influence the fitted model.

• To greatly influence the fitted model, a point has high leverage and its Y-value is not ‘in-line’ with the general trend of the data (has high leverage and looks ‘odd’).

• High leverage + large studentized residual = High influence

• To check influence, we can delete an observation, and see how much the fitted regression coefficients change. A large change suggests high influence.

Difference = β̂j − β̂j(−i)


• Fortunately this can be done analytically (and we don’t have to re-fit the model n times).

We will use subscript (−i) to indicate quantities calculated without case i...

• DFBETAS - effect of Yi on a single estimated coefficient

DFBETASj,i = ( β̂j − β̂j(−i) ) / SE( β̂j(−i) )

|DFBETAS| > 1 is considered large in a small or medium sized sample

|DFBETAS| > 2/√n is considered large in a big sample
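A sketch (made-up data, not the notes’ example) computing DFBETAS for the slope of an SLR by brute-force re-fitting, exactly as in the definition above; SE(β̂1(−i)) is the usual SLR slope standard error from the deleted fit:

```python
# Illustrative sketch: DFBETAS for the SLR slope, by deleting each case
# and re-fitting. In SLR, SE(b1) = S_E / sqrt(Sxx).
import math

def slr(x, y):
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    b1 = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / sxx
    return yb - b1 * xb, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.9, 9.0, 6.1]   # made-up data; case 4 is far off the trend
n = len(x)
_, b1 = slr(x, y)

dfbetas = []
for i in range(n):
    xd, yd = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
    a0, a1 = slr(xd, yd)
    m = len(xd)
    xb = sum(xd) / m
    sxx = sum((xi - xb) ** 2 for xi in xd)
    rss = sum((yj - (a0 + a1 * xj)) ** 2 for xj, yj in zip(xd, yd))
    se_b1 = math.sqrt(rss / (m - 2)) / math.sqrt(sxx)  # SE(b1) from deleted fit
    dfbetas.append((b1 - a1) / se_b1)

assert abs(dfbetas[4]) > 1.0                           # only case 4 is flagged
assert all(abs(d) <= 1.0 for i, d in enumerate(dfbetas) if i != 4)
```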


• DFFITS - effect of the ith case on the fitted value for Yi

DFFITSi = ( Ŷi − Ŷi(−i) ) / ( SE(−i) √hi ) = e∗i √( hi / (1 − hi) )

|DFFITS| > 1 is considered large in a small or medium sized sample

|DFFITS| > 2√( (k + 1)/n ) is considered large in a big sample

Look at the DFFITS values relative to each other; in other words, look for large values.


• COOKSD - effect on all fitted values

Based on:
– how far the x-values are from the mean of the x’s
– how far Yi is from the regression line

COOKSDi = ∑_{j=1}^{n} ( Ŷj − Ŷj(−i) )² / ( (k + 1) S²E )

= e²i hi / ( S²E (k + 1)(1 − hi)² )

= ( e′i² / (k + 1) ) ( hi / (1 − hi) )

So, COOKSD can be high if hi is very large (close to 1) and the squared residual is moderate, or if the squared residual is very large and hi is moderate, or if they’re both extreme.
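A sketch (made-up data, not the notes’ example) checking that the deletion definition of COOKSD and the closed form agree exactly for an SLR fit:

```python
# Illustrative sketch: Cook's distance two ways for SLR -- from the shift
# in ALL n fitted values when case i is deleted, and from the closed form
# e_i^2 h_i / (S_E^2 (k+1) (1-h_i)^2).

def slr(x, y):
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    b1 = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / sxx
    return yb - b1 * xb, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.9, 9.0, 6.1]     # made-up data with one off-trend case
n, k = len(x), 1
b0, b1 = slr(x, y)
fitted = [b0 + b1 * xi for xi in x]
e = [yi - fi for yi, fi in zip(y, fitted)]
s2 = sum(ei ** 2 for ei in e) / (n - k - 1)      # S_E^2
xb = sum(x) / n
sxx = sum((xi - xb) ** 2 for xi in x)
h = [1 / n + (xi - xb) ** 2 / sxx for xi in x]

for i in range(n):
    a0, a1 = slr(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    d_def = sum((fitted[j] - (a0 + a1 * x[j])) ** 2
                for j in range(n)) / ((k + 1) * s2)
    d_closed = e[i] ** 2 * h[i] / (s2 * (k + 1) * (1 - h[i]) ** 2)
    assert abs(d_def - d_closed) < 1e-9          # the two forms agree
```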


COOKSD > 4/(n − k − 1) suggests an unusually influential case.

Look at the COOKSD values relative to each other.

• Example: Returning to the Duncan model:

> influence.measures(lm.out)

Influence measures of

lm(formula = prestige ~ education + income) :

dfb.1_ dfb.edct dfb.incm dffit cov.r cook.d hat inf

1 -2.25e-02 0.035944 6.66e-04 0.070398 1.125 1.69e-03 0.0509

2 -2.54e-02 -0.008118 5.09e-02 0.084067 1.131 2.41e-03 0.0573

3 -9.19e-03 0.005619 6.48e-03 0.019768 1.155 1.33e-04 0.0696

4 -4.72e-05 0.000140 -6.02e-05 0.000187 1.150 1.20e-08 0.0649

5 -6.58e-02 0.086777 1.70e-02 0.192261 1.078 1.24e-02 0.0513

...

You get a DFBETA for each regression coefficient and each observation.

We’re not really interested in the intercept DFBETA though. That said, a bivariate plot of the other DFBETAS may be useful...


> plot( dfbetas(lm.out)[,c(2,3)],pch=16)

> identify(dfbetas(lm.out)[,2],dfbetas(lm.out)[,3],

row.names(Duncan))

[Figure: scatterplot of income DFBETAS vs. education DFBETAS, with ‘minister’, ‘conductor’, and ‘RR.engineer’ labeled]

The ‘minister’ observation may have a large impact on both regression coefficients (using the > 1 threshold).


Cook’s distance is perhaps the most commonly looked-at measure (effect on all fitted values).

> plot(cooks.distance(lm.out),pch=16,cex=1.5)

> abline(h=4/(45-2-1),lty=2)

> identify(1:45,cooks.distance(lm.out),row.names(Duncan))

[Figure: index plot of cooks.distance(lm.out) with the 4/(n − k − 1) threshold line; ‘minister’ and ‘conductor’ labeled]


We can bring together leverage, studentized residuals, and Cook’s distance in the following “bubble plot”:

## Plot Residuals vs. Leverage:

> plot(hatvalues(lm.out),rstudent(lm.out),type="n")

> cook=sqrt(cooks.distance(lm.out))

> points(hatvalues(lm.out),rstudent(lm.out),

cex=10*cook/max(cook))

## Include diagnostic thresholds:

> abline(h=c(-2,0,2),lty=2)

> abline(v=c(2,3)*3/45,lty=2)

## Overlay names:

> identify(hatvalues(lm.out),rstudent(lm.out),

row.names(Duncan))


[Figure: bubble plot of rstudent(lm.out) vs. hatvalues(lm.out), circle size proportional to Cook’s distance; ‘minister’, ‘reporter’, ‘conductor’, and ‘RR.engineer’ labeled]

• High Leverage: conductor, minister, RR.engineer

• Large regression residuals: minister, reporter

• ‘minister’ has the highest influence on the fitted model


It turns out, R will do some work for us in our diagnostic plotting...

> par(mfrow=c(2,2))

> plot(lm.out)

[Figure: 2 × 2 panel of default lm diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook’s distance contours); observations 6, 9, 16, and 17 flagged]


1. Residuals vs. Fitted (checking constant variance)

2. Normal Q-Q plot (checking normality)

3. Scale-Location plot (similar to 1., but uses √|e∗i| vs. Ŷi)

4. Residuals vs. Leverage (checking influence, plus it gives you a sweet contour plot of COOKSD)


Jointly influential data points

• The previous section considered the impact that individual data points can have on the fitted regression coefficients.

• Their impact can be seen with leave-one-out deletion diagnostics. See...

> influence.measures(lm.out)

• But data points can have a joint influence on the fitted model.

• Graphical methods can help identify such subsets.

• For this purpose, we will use the Partial Regression Plot that we saw earlier (in the multiple regression introduction). Also known as an Added Variable Plot.


Recall an Added Variable Plot for Xi

For Xi,

1. get the residuals from the regression of Y on all predictors except Xi, and plot on the Y axis (what’s left for Xi to explain).

2. get the residuals from the regression of Xi on all other predictors, and plot on the X axis (the non-redundant part of Xi in the model).

3. The plot provides information on adding the Xi variable to the model.

The ‘special’ simple linear regression related to the plot above (i.e. of Residuals(Y ∼ X(−Xi)) on Residuals(Xi ∼ X(−Xi))) gives the multiple regression coefficient for Xi.

After we fit the regression to this plot, the residuals are the same as the full model multiple regression residuals.
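Both facts can be checked numerically. Below is an illustrative Python sketch (the helper functions are ad hoc, not from the notes) using the six Duncan rows printed earlier: the slope of the residual-on-residual regression matches the full model’s education coefficient, and the residuals match too.

```python
# Illustrative sketch: the added-variable-plot slope equals the multiple
# regression coefficient (and the residuals agree with the full model),
# using the six Duncan rows shown earlier in these notes.

def solve(A, b):
    """Gauss-Jordan solve for tiny linear systems."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[r][c] != 0.0:
                f = M[r][c] / M[c][c]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(cols, y):
    """OLS coefficients of y on the given columns (pass a 1s column too)."""
    XtX = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    Xty = [sum(a * b for a, b in zip(ci, y)) for ci in cols]
    return solve(XtX, Xty)

def resid(cols, v):
    c = ols(cols, v)
    return [vi - sum(cj * col[i] for cj, col in zip(c, cols))
            for i, vi in enumerate(v)]

education = [86.0, 76.0, 92.0, 90.0, 86.0, 84.0]  # first six Duncan rows
income    = [62.0, 72.0, 75.0, 55.0, 64.0, 21.0]
prestige  = [82.0, 83.0, 90.0, 76.0, 90.0, 87.0]
one = [1.0] * 6

b = ols([one, education, income], prestige)   # full model coefficients

ry = resid([one, income], prestige)    # what's left for education to explain
rx = resid([one, income], education)   # non-redundant part of education

slope = ols([one, rx], ry)[1]          # the AV-plot regression
assert abs(slope - b[1]) < 1e-6        # equals the education coefficient

e_full = resid([one, education, income], prestige)
e_av = [ri - slope * xi for ri, xi in zip(ry, rx)]  # AV intercept is 0
assert all(abs(a - c) < 1e-6 for a, c in zip(e_full, e_av))
```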


As this ‘special’ plot is used to fit the multiple regression coefficient, the relationship we see should be linear.

————————————————————–

• In the Duncan data set, the data point ‘minister’ had the most influence when considering leave-one-out diagnostics.

MINISTER DATA POINT:
This data point had a large positive studentized residual... the prestige for the ‘minister’ was higher than expected for the given income and education (seems reasonable for the occupation though).


This data point also had an odd combination of income/education, with a low income for the given amount of education (high leverage, but the combination again seems reasonable).

This led to ‘minister’ being highly influential.

Let’s fit the model with and without this data point...

And let’s see if it is jointly influential by considering added variable plots...


• The fitted regression coefficients with the ‘minister’ data point:

> summary(lm.out)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -6.06466 4.27194 -1.420 0.163

education 0.54583 0.09825 5.555 1.73e-06 ***

income 0.59873 0.11967 5.003 1.05e-05 ***

• The fitted regression coefficients without the ‘minister’ data point:

> summary(lm(prestige[-6]~education[-6]+income[-6]))

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -6.62751 3.88753 -1.705 0.0958 .

education[-6] 0.43303 0.09629 4.497 5.56e-05 ***

income[-6] 0.73155 0.11674 6.266 1.81e-07 ***

The education coefficient is lower without ‘minister’.

The income coefficient is higher without ‘minister’.

(These changes are understandable when we look at the added variable plots.)


• Example: Returning to Duncan model:

> avPlots(lm.out,"education",labels=row.names(Duncan),

id.method=cooks.distance(lm.out),id.n=2)

[Figure: added-variable plot of prestige | others vs. education | others, with ‘minister’ and ‘conductor’ labeled]

This is the plot used to fit the education multiple regression coefficient.

Here we see how the multiple regression coefficient for education is higher when ‘minister’ is included (influential point).


The ‘conductor’ data point influences the fitted line in the same direction (removal of both at once may show a large change in the education regression coefficient).

> avPlots(lm.out,"income",labels=row.names(Duncan),

id.method=cooks.distance(lm.out),id.n=2)

[Figure: added-variable plot of prestige | others vs. income | others, with ‘minister’ and ‘conductor’ labeled]

Here we see that ‘minister’ and ‘conductor’ work together to flatten the income multiple regression coefficient. Removal of both at once may show a large change in the income regression coefficient.


Removing outliers (careful)

• Removal of an outlier should only be done after careful investigation, and perhaps discussion.

• It shouldn’t be removed if it’s not an error, or if it can’t be shown that the data point belongs to a different ‘population’.

• Duncan Data set:
When reporting the analysis, one could report both analyses (with and without ‘minister’). Otherwise, if all involved agree that ‘minister’ is not representative of the population at hand, it could be removed.
