Strategies for Identifying Outliers and Managing Missing Data

R. Michael Haynes, PhD

Tarleton State University

A PRIORI (MARCH 1, 2012): Assistant Vice President for Student Life Studies

POST HOC (FEBRUARY 29, 2012): Executive Director of Institutional Research

Assistant Professor, Educational Leadership and Policy Studies

A little background...

Outlier analysis in multiple regression class

Data inspection (missing data) was a key aspect of my dissertation

I try to incorporate at the very least “a nod” to data inspection in any assessment/project completed

Why is it important to evaluate your data set?

Can help in...

Identifying input errors

Identifying spurious data points (an answer of “6” on a 1-5 Likert scale)

Makes your findings more sound

Good practice as recommended by the American Psychological Association (Wilkinson & Task Force on Statistical Inference, 1999)

Desired Outcomes

Knowledge of various data inspection methods: visual, range of data set

Methods for managing missing data: listwise deletion, pairwise deletion, mean replacement, linear trend at point

Criteria for identifying outliers/spurious data points: standardized residuals/predicted values, standard deviation diagnostics, Cook’s D values

Data inspection methods

Visual

Can alert you to missing cases

Most beneficial with smaller datasets where review of individual cases is possible

SPSS minimum/maximum values function

Quick method of inspecting range of larger data sets

Descriptive Statistics

Variable                                                          N    Minimum  Maximum  Mean  Std. Deviation
Learning Community                                                884  0        1        .14   .343
You are taking this survey:                                       874  1        3        2.01  .139
Recoded response to high school graduation year variable HGRADYR  883  1        4        3.94  .367
Valid N (listwise)                                                873
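The same minimum/maximum inspection can be sketched outside SPSS. The following is a minimal illustration in Python, not the SPSS procedure itself; the variable names, the sample values, and the valid ranges are all made up for the example.

```python
# Hypothetical survey records; None marks a missing response.
responses = {
    "learning_community": [0, 1, 0, None, 1],
    "satisfaction_1_to_5": [4, 6, 2, 5, None],  # the 6 is out of range
}

# Assumed valid range for each variable (illustrative values only).
valid_ranges = {
    "learning_community": (0, 1),
    "satisfaction_1_to_5": (1, 5),
}

def inspect_ranges(data, ranges):
    """Return per-variable N, min, max, and the positions of out-of-range cases."""
    report = {}
    for name, values in data.items():
        present = [v for v in values if v is not None]
        low, high = ranges[name]
        bad = [i for i, v in enumerate(values)
               if v is not None and not (low <= v <= high)]
        report[name] = {
            "n": len(present),
            "min": min(present),
            "max": max(present),
            "out_of_range_cases": bad,
        }
    return report

report = inspect_ranges(responses, valid_ranges)
for name, summary in report.items():
    print(name, summary)
```

A maximum above the top of the scale, or an N lower than expected, is the cue to go back and review the individual cases.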

What to do about missing values? SPSS options

Exclude cases listwise: Only cases with valid values for all variables are included in the analyses.

Exclude cases pairwise: Cases with complete data for the pair of variables being correlated are used to compute the correlation coefficient on which the regression analysis is based. Degrees of freedom are based on the minimum pairwise N.

Replace with mean: All cases are used for computations, with the mean of the variable substituted for missing observations.

(SPSS Inc., 233 S. Wacker Drive, Chicago, IL, 60606)

Problems with these options…

Listwise excludes every value for a case that is missing even one variable value... throws the baby out with the bath water!

Pairwise uses only the cases for which both values in a variable pair are present

Can lead to distortion of findings through selection bias (King, Honaker, Joseph, & Scheve, 1998)
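To make the difference between the three options concrete, here is a small Python sketch. It is only an illustration of the definitions above, not SPSS output; the cases and variable names (gpa, hours, age) are hypothetical.

```python
from itertools import combinations

# Hypothetical cases; None marks a missing value.
cases = [
    {"gpa": 3.1, "hours": 12, "age": 19},
    {"gpa": None, "hours": 15, "age": 20},
    {"gpa": 2.8, "hours": None, "age": 22},
    {"gpa": 3.6, "hours": 9, "age": None},
    {"gpa": 3.0, "hours": 14, "age": 21},
]
variables = ["gpa", "hours", "age"]

# Listwise: keep only cases with valid values on every variable.
listwise = [c for c in cases if all(c[v] is not None for v in variables)]

# Pairwise: for each pair of variables, count cases valid on both.
pairwise_n = {
    (a, b): sum(1 for c in cases if c[a] is not None and c[b] is not None)
    for a, b in combinations(variables, 2)
}

# Mean replacement: substitute each variable's mean for its missing values.
def mean_replace(rows, names):
    means = {}
    for v in names:
        present = [r[v] for r in rows if r[v] is not None]
        means[v] = sum(present) / len(present)
    return [{v: (r[v] if r[v] is not None else means[v]) for v in names}
            for r in rows]

filled = mean_replace(cases, variables)

print("listwise N:", len(listwise))
print("pairwise N:", pairwise_n)
print("mean-replaced N:", len(filled))
```

In this made-up data set each case is missing at most one value, yet listwise deletion keeps only two of the five cases, while each pairwise comparison keeps three.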

More preferred options…

Choose “Transform” -> “Replace Missing Values”

Enter variables with missing values into “New Variable” box

Under “Name and Method”, select one of the following:

Series Mean

Mean of Nearby Points

Median of Nearby Points

Linear Interpolation

Linear Trend at Point

I prefer the last option, Linear Trend at Point

Linear Trend at Point

Uses the theory of regression to calculate coefficients based upon existing values

Generates a replacement value for each case on each variable

More robust than simply replacing with mean
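The idea can be sketched in a few lines of Python. This is a rough illustration under an assumption about how the method works: the observed values are regressed on their position in the series (1 to n), and each missing value is replaced with the value the fitted line predicts at that position. The function name and the sample series are hypothetical.

```python
def linear_trend_at_point(series):
    """Fill None entries with the value predicted by a least-squares line
    fitted to the observed entries against their position (1..n)."""
    points = [(i + 1, y) for i, y in enumerate(series) if y is not None]
    if len(points) < 2:
        raise ValueError("need at least two observed values to fit a trend")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Observed values are kept; only the gaps receive the predicted value.
    return [y if y is not None else intercept + slope * (i + 1)
            for i, y in enumerate(series)]

observed = [2.0, 4.0, 6.0, 8.0, None]
filled = linear_trend_at_point(observed)
print(filled)
```

With this made-up series, mean replacement would fill the gap with 5.0 (the mean of the four observed values), while the trend-based fill follows the upward pattern in the data.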

Identifying outliers… what is an outlier?

An unusual score in a distribution that is considered extreme and may warrant special consideration (Hinkle, Wiersma, & Jurs, 2003)

...a data point distinct or deviant from the rest of the data (Pedhazur, 1997)

Why is it important to identify potential outliers?

Can skew findings which in turn can skew conclusions/decisions/programming

Can help identify a case in dire need of additional programming/resources... finding that lost raft at sea!

As mentioned earlier, can assist in identifying data entry errors

Strategies for identifying outliers in your dataset

Standardized predicted and residual scores

Residuals 3 standard deviations away from the mean

Rule of thumb: “99% of your dataset should fall within + or – 3 standard deviations from the mean”

Casewise Diagnostics (a)

Case Number  Std. Residual  Percent Hispanic Enrollment  Predicted Value  Residual
75           -4.091         .180                         .54883           -.368829
88           -3.195         .020                         .30811           -.288109
175          -4.068         .060                         .42682           -.366818

a. Dependent Variable: Percent Hispanic Enrollment
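A casewise check of this kind can be sketched in Python for the single-predictor case. The sketch assumes a standardized residual is the raw residual divided by the standard error of the estimate, sqrt(SSE / (n - 2)); the data are made up, with one case deliberately pushed off the trend. Under a normal distribution roughly 99.7% of values fall within 3 standard deviations, which is why the cutoff of 3 flags so few cases.

```python
import math

def standardized_residuals(x, y):
    """Fit y = a + b*x by least squares; return residuals divided by the
    standard error of the estimate, sqrt(SSE / (n - 2))."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    std_error = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    return [e / std_error for e in residuals]

# Made-up data: a linear trend with a small wobble, plus one spiked case.
x = list(range(1, 21))
y = [2 * xi + (0.5 if xi % 2 else -0.5) for xi in x]
y[10] += 15  # case number 11 is deliberately extreme

z = standardized_residuals(x, y)
# Report case numbers (1-based) whose standardized residual exceeds 3 in size.
flagged = [case + 1 for case, zi in enumerate(z) if abs(zi) > 3]
print(flagged)
```

Note that with very small samples no residual can reach 3 standard errors, so the rule of thumb is only informative once the data set has a reasonable number of cases.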


Cook’s D values

Considers each case’s influence on the regression results, that is, how much the estimates change when the case is removed (Pedhazur, 1997)

Cook’s D values greater than 1 could be suspect

Values can be saved to the dataset
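For readers who want to see the calculation, here is a minimal Python sketch for a regression with one predictor. It uses the standard textbook form D = (e² / (p · MSE)) · h / (1 − h)², where e is the residual, h is the case's leverage, and p = 2 estimated parameters (intercept and slope). The data are hypothetical: nine cases on a clean trend and one case far out on x that sits well off that trend.

```python
def cooks_distance(x, y):
    """Cook's D for each case in a one-predictor least-squares regression."""
    n = len(x)
    p = 2  # estimated parameters: intercept and slope
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    distances = []
    for xi, e in zip(x, residuals):
        # Leverage: how far the case sits from the mean of the predictor.
        leverage = 1 / n + (xi - mean_x) ** 2 / sxx
        distances.append((e ** 2 / (p * mse)) * leverage / (1 - leverage) ** 2)
    return distances

# Made-up data: the last case is far out on x and well off the trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 5]

d = cooks_distance(x, y)
# Report case numbers (1-based) with Cook's D above the cutoff of 1.
flagged = [case + 1 for case, di in enumerate(d) if di > 1]
print(flagged)
```

Only the last case exceeds the cutoff of 1 here, because it combines a large residual with high leverage; the other cases have noticeable residuals but little pull on the fitted line.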

OK, so what if some of your cases don’t pass this three-pronged approach and it’s not a data entry error?

Discard the case?

Rejects the notion that the data “is what it is…”

“Tightens up” the model to be more representative of the norm

Keep it in?

Distorts the whole for a special circumstance

Depending upon your research question, could bring attention to a group needing special consideration

Either way, can be addressed in limitations/conclusions/need for further research

References

Hinkle, D. E., Wiersma, W., & Jurs, S. G. (2003). Applied statistics for the behavioral sciences (5th ed.). Boston, MA: Houghton Mifflin Company.

King, G., Honaker, J., Joseph, A., & Scheve, K. (1998). Listwise deletion is evil: What to do about missing data in political science [Electronic version]. Society for Political Methodology: American Political Science Association, Washington University in St. Louis, St. Louis, MO. Retrieved February 2, 2009, from http://polmeth.wustl.edu/workingpapers.php?order=dateasc&title=1998&startdate=1998-01-01&enddate=1998-12-31

Pedhazur, E. J. (1997). Multiple regression in behavioral research (3rd ed.). South Melbourne, Australia: Wadsworth.

Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanation. American Psychologist, 54, 594-604.
