How to Handle Missing Values in Multivariate Data

By Jeff McNeal & Marlen Roberts

The Missing Data Problem

• Problems with Statistical Inference

• Sample Size & Power

• Biased Results

Little, R., & Rubin, D. (2002). Introduction. In Statistical Analysis with Missing Data (2nd ed., pp. 1-2). Hoboken, New Jersey: John Wiley & Sons.

Real World Examples

• Respondents in a household survey refuse to report income

• Missing results of manufacturing experiment due to equipment failure

• Voters’ inability to express preference for a political candidate in an opinion poll

Little, R., & Rubin, D. (2002). Introduction. In Statistical Analysis with Missing Data (2nd ed., pp. 1-2). Hoboken, New Jersey: John Wiley & Sons.

Outline

• Common Assumptions and Missing Data Patterns

• Taxonomy of Methods for Handling Missing Values

• Multiple Imputation

• Maximum Likelihood

• Simulation

Missing Data Patterns

• Not all missing data are created equal

• Missing due to a random process

• Missing due to a non-random process
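
The difference between the two processes can be sketched with a small simulation. This is only an illustration under stated assumptions: the incomes are hypothetical draws from a normal distribution, and the function names, cutoff, and missingness probabilities are made up for the example rather than taken from the slides.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical incomes in thousands of dollars (not data from the slides).
incomes = [rng.gauss(50, 10) for _ in range(5000)]


def drop_at_random(values, p_missing):
    """Random process: every value has the same chance of going missing."""
    return [v for v in values if rng.random() >= p_missing]


def drop_high_values(values, cutoff, p_high, p_low):
    """Non-random process: values above `cutoff` go missing more often."""
    kept = []
    for v in values:
        p_missing = p_high if v > cutoff else p_low
        if rng.random() >= p_missing:
            kept.append(v)
    return kept


full_mean = mean(incomes)
random_mean = mean(drop_at_random(incomes, 0.45))
nonrandom_mean = mean(drop_high_values(incomes, 50, 0.8, 0.1))

print(f"mean, all respondents:            {full_mean:.2f}")
print(f"mean, missing at random:          {random_mean:.2f}")
print(f"mean, high incomes under-reported: {nonrandom_mean:.2f}")
```

When high earners are the ones who decline to answer, the observed mean falls well below the true mean, while purely random loss mostly costs sample size.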

A Simple Example: Income Survey

Westfall, P., & Henning, K. (2013). Understanding Advanced Statistical Methods (1st ed.). Boca Raton, Florida: CRC Press, Taylor & Francis Group.

Univariate Missing Data Process: MCAR

P.H. Westfall

Multivariate Missing Data Processes: MCAR and MAR

http://support.sas.com/resources/papers/proceedings12/312-2012.pdf

Missing Data Processes: MNAR

http://www.stat.columbia.edu/~gelman/arm/missing.pdf

Taxonomy of Missing-Data Methods

• Complete Case Analysis (Listwise Deletion)

• Available Case Analysis (Pairwise Deletion)

• Least Squares on Imputed Data

• Multiple Imputation

• Maximum Likelihood (and Bayes)

Little, R., & Rubin, D. (2002). Introduction. In Statistical Analysis with Missing Data (2nd ed., pp. 19-20). Hoboken, New Jersey: John Wiley & Sons.

Complete Case Analysis (Listwise Deletion)

• Easy to implement

• Works well when MCAR assumption is met

• Wastes a lot of information

http://statistics.ucla.edu/system/resources/BAhbBlsHOgZmSSI7MjAxMi8wNS8yOS8xNF80OF8wOV83M19SZWdyZXNzaW9uX3dpdGhfTWlzc2luZ19YX3MucGRmBjoGRVQ/Regression%20with%20Missing%20X's.pdf
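
A minimal sketch of listwise deletion, showing how much information it can discard. The records and field names below are hypothetical and serve only as an illustration.

```python
# Hypothetical survey records; None marks a missing answer.
rows = [
    {"age": 34, "income": 52.0, "education": 16},
    {"age": 45, "income": None, "education": 12},
    {"age": 29, "income": 41.5, "education": None},
    {"age": None, "income": 63.0, "education": 18},
    {"age": 51, "income": 75.2, "education": 14},
    {"age": 38, "income": 48.9, "education": 12},
]


def complete_cases(records):
    """Keep only the records in which every field was observed."""
    return [r for r in records if all(v is not None for v in r.values())]


kept = complete_cases(rows)
total_cells = sum(len(r) for r in rows)
missing_cells = sum(v is None for r in rows for v in r.values())

print(f"missing cells: {missing_cells} of {total_cells}")
print(f"rows kept:     {len(kept)} of {len(rows)}")
```

Here only 3 of 18 cells are missing, yet half of the rows are thrown away because the gaps fall in different records.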

Available Case Analysis (Pairwise Deletion)

• Attempts to minimize the loss of data in listwise deletion

• Increases the power of your test

• Is usually outperformed by Maximum Likelihood

• Caveat: Can result in non-positive definite covariance matrices

http://statistics.ucla.edu/system/resources/BAhbBlsHOgZmSSI7MjAxMi8wNS8yOS8xNF80OF8wOV83M19SZWdyZXNzaW9uX3dpdGhfTWlzc2luZ19YX3MucGRmBjoGRVQ/Regression%20with%20Missing%20X's.pdf
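
The caveat about non-positive definite matrices can be illustrated with a deliberately extreme, hypothetical data set. Each correlation below is computed from a different subset of cases, and the three results cannot all hold at once. The helper names are assumptions made for this sketch.

```python
# Hypothetical records (x, y, z); None marks a missing value.
rows = [
    (1, 1, None), (2, 2, None), (3, 3, None),   # z missing: x and y move together
    (1, None, 1), (2, None, 2), (3, None, 3),   # y missing: x and z move together
    (None, 1, 3), (None, 2, 2), (None, 3, 1),   # x missing: y and z move oppositely
]


def pairwise_corr(records, i, j):
    """Pearson correlation using every record where both variables are observed."""
    pairs = [(r[i], r[j]) for r in records
             if r[i] is not None and r[j] is not None]
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    s_ab = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    s_aa = sum((a - mean_a) ** 2 for a, _ in pairs)
    s_bb = sum((b - mean_b) ** 2 for _, b in pairs)
    return s_ab / (s_aa * s_bb) ** 0.5


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


corr = [[pairwise_corr(rows, i, j) for j in range(3)] for i in range(3)]
det = det3(corr)

for row in corr:
    print([round(value, 2) for value in row])
# A valid correlation matrix is positive semidefinite, so its determinant
# cannot be negative. A negative value signals an impossible matrix.
print("determinant:", round(det, 2))
```

Listwise deletion would leave no usable rows in this example, so pairwise deletion does recover information, but the resulting matrix is not one that any complete data set could produce.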

Least Squares Imputation Methods

• Unconditional Mean Substitution

• Conditional Mean Imputation based on X

• Conditional Mean Imputation based on X and Y

http://statistics.ucla.edu/system/resources/BAhbBlsHOgZmSSI7MjAxMi8wNS8yOS8xNF80OF8wOV83M19SZWdyZXNzaW9uX3dpdGhfTWlzc2luZ19YX3MucGRmBjoGRVQ/Regression%20with%20Missing%20X's.pdf

Unconditional Mean Substitution

• Just take the sample mean of the observed data and use it for the missing values

• Heavily biases the covariance matrix

• Bias can be corrected, but the inferences (confidence intervals, tests, etc.) are distorted and overly precise

http://statistics.ucla.edu/system/resources/BAhbBlsHOgZmSSI7MjAxMi8wNS8yOS8xNF80OF8wOV83M19SZWdyZXNzaW9uX3dpdGhfTWlzc2luZ19YX3MucGRmBjoGRVQ/Regression%20with%20Missing%20X's.pdf
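
A small sketch of why mean substitution biases the covariance matrix. The numbers are hypothetical, and the `covariance` helper is written out for the example. The imputed points sit exactly at the mean, so they add nothing to the sums of squares and cross-products while still inflating the sample size.

```python
from statistics import mean, variance

# Hypothetical data: x has two missing values, y is fully observed.
x = [2, 4, None, 6, 8, None]
y = [1, 2, 3, 4, 5, 6]


def covariance(a, b):
    """Sample covariance of two equal-length lists."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (len(a) - 1)


observed_x = [v for v in x if v is not None]
x_bar = mean(observed_x)

# Unconditional mean substitution: every gap gets the observed mean.
x_filled = [v if v is not None else x_bar for v in x]

# Complete cases, for comparison.
cc_x = [a for a, b in zip(x, y) if a is not None]
cc_y = [b for a, b in zip(x, y) if a is not None]

var_observed = variance(observed_x)
var_filled = variance(x_filled)
cov_complete = covariance(cc_x, cc_y)
cov_filled = covariance(x_filled, y)

print(f"variance of x:   observed {var_observed:.3f}, after substitution {var_filled:.3f}")
print(f"covariance(x,y): complete cases {cov_complete:.3f}, after substitution {cov_filled:.3f}")
```

The mean of x is unchanged, but both the variance and the covariance shrink, which is what makes the resulting inferences look more precise than they are.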

Conditional Mean Imputation

http://statistics.ucla.edu/system/resources/BAhbBlsHOgZmSSI7MjAxMi8wNS8yOS8xNF80OF8wOV83M19SZWdyZXNzaW9uX3dpdGhfTWlzc2luZ19YX3MucGRmBjoGRVQ/Regression%20with%20Missing%20X's.pdf
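
One way to sketch conditional mean imputation is to regress the incomplete variable on an observed one using the complete cases, then fill each gap with the fitted value. The variable names `x1` and `x2` and the data are hypothetical, and a single predictor is assumed for brevity.

```python
from statistics import mean

# Hypothetical data: x1 is fully observed, x2 has two missing values.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, None, 8.2, 9.8, None]

# Fit x2 = intercept + slope * x1 by least squares on the complete cases.
pairs = [(a, b) for a, b in zip(x1, x2) if b is not None]
mean_1 = mean(a for a, _ in pairs)
mean_2 = mean(b for _, b in pairs)
slope = (sum((a - mean_1) * (b - mean_2) for a, b in pairs)
         / sum((a - mean_1) ** 2 for a, _ in pairs))
intercept = mean_2 - slope * mean_1

# Conditional mean imputation: each gap gets the predicted mean of x2 given x1.
x2_filled = [b if b is not None else intercept + slope * a
             for a, b in zip(x1, x2)]

print(f"fitted line: x2 = {intercept:.2f} + {slope:.2f} * x1")
print("completed x2:", [round(v, 2) for v in x2_filled])
```

Every imputed value lies exactly on the fitted line, so the completed data understate the scatter of x2 around x1.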

Multiple Imputation

Little, R., & Rubin, D. (2002). Introduction. In Statistical Analysis with Missing Data (2nd ed., pp. 19-20). Hoboken, New Jersey: John Wiley & Sons.

Steps Involved in Multiple Imputation

• Introduce random variation into the process of imputing missing values

• Generate several data sets, each with different imputed values

• Perform an analysis on each data set

• Combine the results into a single set of parameter estimates, standard errors, and test statistics

http://support.sas.com/resources/papers/proceedings12/312-2012.pdf
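
The four steps can be sketched end to end in a few lines. This is a simplified illustration, not the procedure from the cited paper: the data are simulated, the imputation model is a single regression of y on x, and the regression coefficients are held fixed across imputations, whereas a full implementation would also draw them at random. All names are assumptions made for the example.

```python
import random
from statistics import mean, variance

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical data: x fully observed, y missing for roughly 30% of cases.
n = 200
x = [rng.gauss(0, 1) for _ in range(n)]
y_true = [2.0 + 1.5 * xi + rng.gauss(0, 1) for xi in x]
y = [yi if rng.random() > 0.3 else None for yi in y_true]


def fit_line(xs, ys):
    """Least-squares intercept, slope, and residual standard deviation."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxx = sum((a - mean_x) ** 2 for a in xs)
    sxy = sum((a - mean_x) * (c - mean_y) for a, c in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((c - intercept - slope * a) ** 2 for a, c in zip(xs, ys))
    return intercept, slope, (sse / (len(xs) - 2)) ** 0.5


observed = [(a, c) for a, c in zip(x, y) if c is not None]
intercept, slope, sigma = fit_line([a for a, _ in observed],
                                   [c for _, c in observed])

m = 20  # number of imputed data sets
estimates, squared_ses = [], []
for _ in range(m):
    # Steps 1-2: fill each gap with a prediction plus a random residual.
    completed = [c if c is not None
                 else intercept + slope * a + rng.gauss(0, sigma)
                 for a, c in zip(x, y)]
    # Step 3: analyze the completed data (here, the mean of y and its variance).
    estimates.append(mean(completed))
    squared_ses.append(variance(completed) / n)

# Step 4: combine the m results into one estimate and one standard error.
pooled_estimate = mean(estimates)
within_var = mean(squared_ses)
between_var = variance(estimates)
total_var = within_var + (1 + 1 / m) * between_var

print(f"pooled mean of y: {pooled_estimate:.3f}")
print(f"pooled std error: {total_var ** 0.5:.3f}")
```

The between-imputation term makes the pooled standard error larger than the one from any single completed data set, which is how the uncertainty about the missing values enters the final inference.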

Introducing Randomness into an MI Model

http://support.sas.com/resources/papers/proceedings12/312-2012.pdf

Adding Variability to the Imputed Values

http://support.sas.com/resources/papers/proceedings12/312-2012.pdf

Why Do We Want to Add Variability?

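• Adding variability is the whole point of multiple imputation: the spread across imputed data sets carries the uncertainty about the missing values into the standard errors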

http://www.stat.columbia.edu/~gelman/arm/missing.pdf

Combining Inferences from Imputed Data

http://support.sas.com/resources/papers/proceedings12/312-2012.pdf
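
The slide's formulas did not survive in this transcript. The standard combining rules (Rubin's rules) are written out here for reference. The notation is an assumption: m is the number of imputed data sets, \hat{Q}_i the estimate from data set i, and U_i its squared standard error.

```latex
\bar{Q} = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i
\qquad
\bar{U} = \frac{1}{m}\sum_{i=1}^{m} U_i
\qquad
B = \frac{1}{m-1}\sum_{i=1}^{m}\left(\hat{Q}_i-\bar{Q}\right)^{2}

T = \bar{U} + \left(1+\frac{1}{m}\right)B,
\qquad
\mathrm{SE}(\bar{Q}) = \sqrt{T}
```

\bar{U} is the within-imputation variance and B the between-imputation variance. The factor (1 + 1/m) accounts for using a finite number of imputations.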

Simplified Form using a Regression Example

http://www.stat.columbia.edu/~gelman/arm/missing.pdf

Likelihood-Based Inference

https://www.amstat.org/sections/srms/webinarfiles/ModernMethodWebinarMay2012.pdf

ML with Ignorable Missing Data

https://www.amstat.org/sections/srms/webinarfiles/ModernMethodWebinarMay2012.pdf
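
The slide content is not in this transcript. As a reference sketch, this is the usual argument for why the missingness mechanism can be ignored. The notation is an assumption: Y_obs and Y_mis are the observed and missing parts of the data, R the missingness indicators, \theta the parameters of the data model, and \phi the parameters of the missingness mechanism.

```latex
f(Y_{obs}, R \mid \theta, \phi)
  = \int f(Y_{obs}, Y_{mis} \mid \theta)\,
         f(R \mid Y_{obs}, Y_{mis}, \phi)\, dY_{mis}

\text{Under MAR, } f(R \mid Y_{obs}, Y_{mis}, \phi) = f(R \mid Y_{obs}, \phi), \text{ so}

f(Y_{obs}, R \mid \theta, \phi)
  = f(R \mid Y_{obs}, \phi) \int f(Y_{obs}, Y_{mis} \mid \theta)\, dY_{mis}

L(\theta \mid Y_{obs}) \propto \int f(Y_{obs}, Y_{mis} \mid \theta)\, dY_{mis}
```

If \theta and \phi are also distinct, the first factor does not involve \theta, so maximizing the observed-data likelihood alone gives the ML estimate.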

ML with Ignorable Missing Data

https://www.amstat.org/sections/srms/webinarfiles/ModernMethodWebinarMay2012.pdf

Comparison of Methods

Listwise

• Easiest to implement

• Has minimal effect if data are MCAR, or MAR for large sample sizes

• Has a tendency to bias results

Pairwise

• Uses more information than listwise

• Increases statistical power

• Also easy to implement

Multiple Imputation

• Requires no special software once the imputed datasets are generated

• Requires specification of a model

• Requires more assumptions

Maximum Likelihood

• Requires specification of a model for each variable

• Most asymptotically efficient

• Most complex

• You get model comparison statistics (AIC, BIC, etc.)