Influential Points and Outliers

Debbi Amanti


Transcript of Influential Points and Outliers

Page 1: Influential Points and  Outliers

Influential Points and Outliers

Debbi Amanti

Page 2: Influential Points and  Outliers

OUTLIERS:

Data points that lie more than two or three standard deviations from the mean of the data.

Observations that differ significantly from the pattern of the REST OF THE DATA.

Observations that lie outside the overall pattern of the other observations.

Page 3: Influential Points and  Outliers

OUTLIERS IN TERMS OF REGRESSION:

Observations with large (in absolute value) residuals.

Observations falling far from the regression line while not following the pattern of the relationship apparent in the others.

Residual = actual - fitted
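As a minimal sketch of the residual definition above, the following Python fits a least-squares line by hand and reports the observation with the largest absolute residual. The data values and the helper names `fit_line` and `residuals` are hypothetical, chosen only for illustration; they are not from the presentation.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def residuals(xs, ys):
    """Residual = actual - fitted, one value per observation."""
    slope, intercept = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical data: roughly y = 2x, except the point at x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 15, 10, 12]

res = residuals(xs, ys)
# Index of the observation with the largest residual in absolute value.
worst = max(range(len(res)), key=lambda i: abs(res[i]))
print(xs[worst], ys[worst], round(res[worst], 2))
```

The point flagged this way is a candidate regression outlier. How large a residual has to be before it counts as "large" is a judgment call that this sketch does not settle.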

Page 4: Influential Points and  Outliers

To identify outliers mathematically in a univariate set of data: find the interquartile range, a.k.a. IQR (Q3 - Q1), and multiply this value by 1.5. An outlier for a data set is any point:

Greater than Q3+1.5*(IQR)

Less than Q1-1.5*(IQR)
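The rule above can be sketched in Python as follows. Textbooks and software compute quartiles in slightly different ways; this sketch assumes the "median of each half" convention (middle value excluded when the count is odd). The sample values and the function names `quartiles` and `iqr_outliers` are hypothetical.

```python
from statistics import median


def quartiles(data):
    """Q1 and Q3 as medians of the lower and upper halves.

    Assumes at least two values; the middle value is excluded when n is odd.
    """
    s = sorted(data)
    half = len(s) // 2
    return median(s[:half]), median(s[-half:])


def iqr_outliers(data):
    """Return the values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR."""
    q1, q3 = quartiles(data)
    step = 1.5 * (q3 - q1)
    low_fence, high_fence = q1 - step, q3 + step
    return [v for v in data if v < low_fence or v > high_fence]


sample = [2, 3, 4, 5, 6, 7, 8, 9, 30]  # hypothetical values
print(quartiles(sample))
print(iqr_outliers(sample))
```

A different quartile convention can shift the fences slightly, so a borderline point may be flagged under one convention and not under another.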

Page 5: Influential Points and  Outliers

INFLUENTIAL POINTS ARE:

Points whose removal would greatly affect the association of two variables

Points whose removal would significantly change the slope of an LSR (least-squares regression) line

Points with a large moment (i.e., they are far away from the rest of the data)

Usually outliers in the x direction.
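One common way to put a number on "far away from the rest of the data in the x direction" is the leverage of each point in simple linear regression, h = 1/n + (x - mean)^2 / Sxx. Treating leverage as the measure of the "moment" mentioned above is an assumption of this sketch, not something the presentation states, and the x values are hypothetical.

```python
def leverages(xs):
    """Leverage of each x value in a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1 / n + (x - x_bar) ** 2 / sxx for x in xs]


xs = [1, 2, 3, 4, 5, 15]  # hypothetical x values; 15 sits far to the right
h = leverages(xs)
for x, hi in zip(xs, h):
    print(x, round(hi, 3))
```

Leverages in simple regression always sum to 2, so a single value close to 1 means that one point is pulling very hard on the fitted line.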

Page 6: Influential Points and  Outliers

The two graphs below show the same data, the one on the right with the green data point removed. As you can see, the removal of this point significantly affects the slope of the regression line. This is an influential point!
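The graphs' underlying data are not reproduced in this transcript, so the sketch below uses hypothetical values to show the same effect: it compares the least-squares slope with and without one point. The function names `slope` and `slope_without` are illustrative.

```python
def slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


def slope_without(xs, ys, i):
    """Slope after dropping the observation at index i."""
    return slope(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])


# Hypothetical data: five points trending upward, plus one far to the right.
xs = [1, 2, 3, 4, 5, 15]
ys = [2, 3, 5, 6, 8, 3]

print(round(slope(xs, ys), 3))             # slope using every point
print(round(slope_without(xs, ys, 5), 3))  # slope with the last point removed
```

In this made-up data set the slope changes sign when the far-right point is dropped, which is the behavior the slide describes as influential.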

Page 7: Influential Points and  Outliers

Using the same data as shown on the previous slide, let's compare the x and y data sets for the presence of outliers:

X DATA

IQR = 5

Q1 = 3, Q3 = 8

MAX = 15.5, MIN = 1

An outlier is any point:

> Q3 + 1.5*IQR = 15.5

or

< Q1 - 1.5*IQR = -4.5

The maximum, 15.5, is equal to the upper cutoff but not greater than it, so it does not count as an outlier.

THERE ARE NO OUTLIERS IN THIS DATA SET!!!

Y DATA

IQR = 5

Q1 = 4, Q3 = 9

MAX = 10, MIN = 2

An outlier is any point:

> Q3 + 1.5*IQR = 16.5

or

< Q1 - 1.5*IQR = -3.5

THERE ARE NO OUTLIERS IN THIS DATA SET!!!
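The two checks above use only summary values (MIN, Q1, Q3, MAX), so they can be reproduced without the raw data. The sketch below is one way to do that; the function names `fences` and `has_outlier` are hypothetical. It uses strict inequalities, matching the "greater than" and "less than" wording of the rule.

```python
def fences(q1, q3):
    """Lower and upper cutoffs: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    step = 1.5 * (q3 - q1)
    return q1 - step, q3 + step


def has_outlier(minimum, q1, q3, maximum):
    """True if the minimum or maximum falls strictly beyond a cutoff."""
    low, high = fences(q1, q3)
    return minimum < low or maximum > high


# Summary values taken from the x and y data on this slide.
print(fences(3, 8), has_outlier(1, 3, 8, 15.5))
print(fences(4, 9), has_outlier(2, 4, 9, 10))
```

Checking the extremes is enough to decide whether any outlier exists, because if the minimum and maximum are inside the cutoffs, every other value is too.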

Page 8: Influential Points and  Outliers

!!!REMEMBER!!!

An observation does NOT have to be an Outlier to be an Influential Point!!

Nor does an observation need to be an Influential Point in order to be an Outlier!!

Page 9: Influential Points and  Outliers

Get your calculator handy...

Page 10: Influential Points and  Outliers

Given the five-number summary {8 21 35 43 77}, which of the following is correct?

A. There are no outliers
B. There are at least two outliers
C. There is not enough data to make any conclusion
D. There is exactly one outlier
E. There is at least one outlier

Page 11: Influential Points and  Outliers

The correct answer is E.

The five-number summary gives you {Min Q1 Median Q3 Max}.

The IQR is calculated by Q3 - Q1, so the IQR for the given data is 43 - 21 = 22.

An outlier for this data would be:

> Q3 + 1.5*IQR or < Q1 - 1.5*IQR

> 43 + (22*1.5) = 76 or < 21 - (22*1.5) = -12

Since the max is 77, there must be at least one outlier in this data set, but we cannot conclude how many outliers without more data.

Page 12: Influential Points and  Outliers

Given the following scatterplot and residual plot, which of the following is true about the yellow data point?

[Scatterplot and residual plot; x-axis ticks at 0, 5, 10, 15]

I. It is an influential point
II. It is an outlier with respect to the regression model
III. It appears to be an outlier in the x direction

A. I only
B. I and II
C. I and III
D. None of the above
E. All of the above

Page 13: Influential Points and  Outliers

The correct answer is C.

I. Because this point has a large moment and is far from the rest of the data, it is an influential point. If this point were removed, the slope of the line would markedly change.

II. This point is not an outlier with respect to the model because, as you can see in the residual plot, it does not have a large residual (it follows the regression pattern of the data).

III. By looking at both the scatterplot and the residual plot, you can see that the yellow point is an outlier in the x direction (far right of the rest of the data).

Page 14: Influential Points and  Outliers

Resources used in this presentation include:

Workshop Statistics by Allan Rossman

The Basic Practice of Statistics by David S. Moore

AMSCO's AP Statistics by James Bohan

Any further questions, email me at: [email protected]