Integrated Data Editing and Imputation
Ton de Waal
Department of Methodology, Statistics Netherlands, Voorburg
ICES III conference, Montréal, June 19, 2007
What is statistical data editing and imputation?
Observed data generally contain errors and missing values
Statistical Data Editing (SDE): process of checking observed data and, when necessary, correcting them
Imputation: process of estimating missing data and filling these values into the data set
What is integrated SDE and imputation?
Integration of error localization and imputation
Integration of several edit and imputation techniques to optimize edit and imputation process
Integration of statistical data editing into rest of statistical process
SDE and the survey process
We will focus on identifying and correcting errors
Other goals of SDE are:
identify error sources in order to provide feedback on entire survey process
provide information about the quality of incoming and outgoing data
Role of SDE is slowly shifting towards these goals: feedback on other survey phases can be used to improve those phases and reduce the amount of errors arising in them
Edits
Edit rules, or edits for short, often used to determine whether record is consistent or not
Inconsistent records are considered to contain errors
Consistent records that are also not suspicious otherwise, e.g. are not outlying with respect to the bulk of the data, are considered error-free
Example of edits (T turnover, P profit, C costs):
T = P + C (balance edit)
T ≥ 0
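As a minimal sketch (in Python, not from the talk), such edits can be encoded as simple checks per record; all function and variable names are illustrative:

    # Sketch: checking the two example edits for a record (T turnover, P profit, C costs).
    def check_edits(record):
        """Return the list of violated edits for one record."""
        T, P, C = record["T"], record["P"], record["C"]
        violated = []
        if T != P + C:          # balance edit: T = P + C
            violated.append("T = P + C")
        if T < 0:               # non-negativity edit: T >= 0
            violated.append("T >= 0")
        return violated

    # A record violating the balance edit:
    print(check_edits({"T": 100, "P": 40, "C": 50}))  # -> ['T = P + C']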
SDE and imputation
Three related problems:
Error localization: determine which values are erroneous
Imputation: impute missing and erroneous data in best possible way
Consistency: adjust imputed values such that all edits become satisfied
Correction is often done by means of imputation
Most SDE techniques focus on error localization
SDE in the “old” days
Use of computers in SDE started many years ago
In early years role of computers was restricted to checking which edits were violated
Subject-matter specialists retrieved paper questionnaires that did not pass all edits and corrected them
After correction, data were entered into computer again and checked again whether all edits were satisfied
Major problem: during manual correction process records were not checked for consistency
Modern SDE techniques
Interactive editing
Selective editing
Automatic editing
Macro-editing
Interactive editing
During interactive editing a modern survey processing system (e.g. BLAISE) is used
Such a system allows one to check and, if necessary, correct records in a single step
Advantages:
number of variables, edits and records may be high
quality of interactively edited data is generally high
Disadvantages:
all records have to be edited: costly in terms of budget and time
not transparent
Selective editing
Umbrella term for several methods to identify influential errors
Aim is to split data into two streams:
critical stream: records that are the most likely ones to contain influential errors
non-critical stream: records that are unlikely to contain influential errors
Records in critical stream are edited interactively
Records in non-critical stream are either not edited or are edited automatically
Selective editing
Many selective editing methods are based on common sense
Most often applied basic idea is to use a score function
Two important components:
influence: measures relative influence of record on publication figure
risk: measures deviation of observed values from “anticipated” values (e.g. medians or values from previous years)
Selective editing
Local score for single variable within record usually defined as distance between observed and anticipated values, taking influence of record into account
Example: W × |Y – Y*|
W raising weight, Y observed value, Y* anticipated value
influence component: W × Y*
risk component: |Y – Y*| / Y*
Local scores combined into global score for entire record by
sum of local scores
maximum of local scores
Records with global score above certain cut-off value are edited interactively
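A minimal sketch of such a score function, assuming anticipated values per variable are available; the records, weights and cut-off below are illustrative:

    # Sketch of selective editing with the local score W * |Y - Y*| from the slide.
    def local_score(w, y, y_star):
        # equals influence (w * y_star) times risk (|y - y_star| / y_star)
        return w * abs(y - y_star)

    def global_score(record, anticipated, weight):
        # Combine local scores; the slide mentions both sum and maximum.
        scores = [local_score(weight, record[v], anticipated[v]) for v in anticipated]
        return sum(scores)              # alternatively: max(scores)

    records = [{"T": 1000, "P": 200}, {"T": 9800, "P": 150}]
    anticipated = {"T": 1000, "P": 180}
    cutoff = 5000
    critical = [r for r in records if global_score(r, anticipated, weight=1.0) > cutoff]
    non_critical = [r for r in records if global_score(r, anticipated, weight=1.0) <= cutoff]
    print(len(critical), "record(s) for interactive editing")  # -> 1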
Selective editing: (dis)advantages
Advantage:
selective editing improves efficiency in terms of budget and time
Disadvantage:
no good techniques for combining local scores into global score are available if there are many variables
Selective editing has gradually become a popular method to edit business data
Automatic editing
Two kinds of errors: systematic ones and random ones
Systematic error: error reported consistently among (some) responding units
gross values reported instead of net values
values reported in units instead of requested thousands of units (so-called “thousand-errors”)
Random error: error caused by accident
e.g. observed value where respondent by mistake typed in a digit too many
Automatic editing of systematic errors
Can often be detected by
comparing respondent’s present values with those from previous years
comparing responses to questionnaire variables with values of register variables
using subject-matter knowledge
Once detected, systematic error is often simple to correct
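For instance, a thousand-error could be screened for by comparing the reported value with a reference value such as last year’s; the acceptance interval around 1000 in this sketch is an illustrative assumption, not from the talk:

    # Sketch: detecting and correcting a "thousand-error" by comparing the reported
    # value with a reference value (e.g. from the previous year).
    def fix_thousand_error(reported, reference):
        if reference > 0 and reported > 0:
            ratio = reported / reference
            if 300 <= ratio <= 3000:     # value reported in units instead of thousands
                return reported / 1000
        return reported

    print(fix_thousand_error(reported=2_500_000, reference=2_400))  # -> 2500.0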
Automatic editing of random errors
Three classes of methods:
methods based on statistical models (e.g. outlier detection techniques and neural networks)
methods based on deterministic checking rules
methods based on solving a mathematical optimization problem
Deterministic checking rules
State which values are considered erroneous when record violates edits
Example: if component variables do not sum up to total, total variable is considered to be erroneous
Advantages:
drastically improves efficiency in terms of budget and time
transparency and simplicity
Disadvantages:
many rules have to be specified, maintained and checked for validity
bias may be introduced as one aims to detect random errors in a systematic manner
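A sketch of the example rule above, with illustrative variable names:

    # Sketch of the deterministic checking rule from the slide: if the components
    # do not sum to the total, the total is declared erroneous and replaced.
    def apply_rule(record, total_var, component_vars):
        component_sum = sum(record[v] for v in component_vars)
        if record[total_var] != component_sum:
            record[total_var] = component_sum    # total considered erroneous
        return record

    print(apply_rule({"T": 90, "P": 40, "C": 50}, "T", ["P", "C"]))
    # -> {'T': 100, 'P': 40, 'C': 50}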
Error localization as mathematical optimization problem
Guiding principle is needed
Freund and Hartley (1967): minimize sum of the distance between observed and “corrected” data and a measure for violation of edits
Casado Valera et al. (1990s): minimize quadratic function measuring distance between observed and “corrected” data such that “corrected” data satisfy all edits
Bankier (1990s): impute missing data and potentially erroneous values by means of donor imputation, and select imputed record that satisfies all edits and is “closest” to original record
Fellegi-Holt paradigm (1976)
Data should be made to satisfy all edits by changing the values of as few variables as possible
Generalization: data should be made to satisfy all edits by changing values of variables with smallest possible sum of reliability weights
reliability weight expresses how reliable one considers values of a variable to be
high reliability weight corresponds to variable whose values are considered trustworthy
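A brute-force sketch of the generalized paradigm for the two example edits, enumerating subsets of variables in order of increasing weight sum; the feasibility check is hand-coded for just these edits, and the weights are illustrative (production systems use dedicated algorithms instead):

    from itertools import combinations

    def feasible(fixed):
        # Can the variables missing from `fixed` be chosen so that the edits
        # T = P + C and T >= 0 hold?
        free = {"T", "P", "C"} - fixed.keys()
        if not free:
            return fixed["T"] == fixed["P"] + fixed["C"] and fixed["T"] >= 0
        if free == {"T"}:
            return fixed["P"] + fixed["C"] >= 0   # T must equal P + C and be >= 0
        if "T" in fixed:
            return fixed["T"] >= 0                # a free component absorbs the balance
        return True                               # T and a component free: always satisfiable

    def error_localization(record, weights):
        variables = list(record)
        subsets = [s for k in range(len(variables) + 1)
                     for s in combinations(variables, k)]
        subsets.sort(key=lambda s: sum(weights[v] for v in s))
        for s in subsets:
            fixed = {v: record[v] for v in variables if v not in s}
            if feasible(fixed):
                return set(s)    # minimal-weight set of variables to change

    # T violates the balance edit; with these weights the paradigm blames T:
    print(error_localization({"T": 100, "P": 40, "C": 50},
                             {"T": 1.0, "P": 2.0, "C": 2.0}))  # -> {'T'}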
Fellegi-Holt paradigm: (dis)advantages
Advantages:
drastically improves efficiency in terms of budget and time
in comparison to deterministic checking rules fewer, and less detailed, rules have to be specified
Disadvantages:
class of errors that can safely be treated is limited to random errors
class of edits that can be handled is restricted to so-called hard (or logical) edits, which hold true for all correctly observed records
risky to treat influential errors by means of automatic editing
Macro-editing
Macro-editing techniques often examine potential impact on survey estimates to identify suspicious data in individual records
Two forms of macro-editing:
aggregation method
distribution method
Macro-editing: aggregation method
Verification whether figures to be published seem plausible
Compare quantities in publication tables with
same quantities in previous publications
quantities based on register data
related quantities from other sources
Macro-editing: distribution method
Available data used to characterize distribution of variables
Individual values compared with this distribution
Records containing values that are considered uncommon given the distribution are candidates for further inspection and possibly for editing
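A sketch of the distribution method using a robust characterization of the distribution (median and median absolute deviation); the threshold of 3 is a common but illustrative choice:

    # Sketch: flag values far from the bulk of the data, using median and
    # median absolute deviation (MAD) to characterize the distribution.
    from statistics import median

    def suspicious(values, threshold=3.0):
        m = median(values)
        mad = median(abs(v - m) for v in values)
        if mad == 0:
            return []
        return [v for v in values if abs(v - m) / mad > threshold]

    print(suspicious([10, 12, 11, 13, 9, 250]))  # -> [250]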
Macro-editing: graphical techniques
Exploratory Data Analysis techniques can be applied:
box plots
scatter plots
(outlier robust) fitting
Other often used techniques in software applications:
anomaly plots: graphical overviews of important estimates, where unusual estimates are highlighted
time series analysis
outlier detection methods
Once suspicious data have been detected on macro-level, one can drill down to sub-populations and individual units
Macro-editing: (dis)advantages
Advantages:
directly related to publication figures or distribution
efficient in terms of budget and time
Disadvantages:
records that are considered non-suspicious may still contain influential errors
publication of unexpected (but true) changes in trend may be prevented
for data sets with many important variables graphical macro-editing is not the most suitable SDE method: most persons cannot interpret 10 scatter plots at the same time
Integrating SDE techniques
We advocate an SDE approach that consists of the following phases:
correction of “evident” systematic errors
application of selective editing to split records into critical stream and non-critical stream
editing of data:
records in critical stream edited interactively
records in non-critical stream edited automatically
validation of publication figures by means of (graphical) macro-editing
Imputation
Expert guess
Deductive imputation
Multivariate regression imputation
Nearest neighbor hot-deck imputation
Ratio hot-deck imputation
Deductive imputation
Sometimes missing values can be determined unambiguously from edits
Examples:
single missing value involved in balance edit
for non-negative variables: if a total variable has zero value, all missing subtotal (component) variables are zero
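A sketch of the first example: one missing value in a balance edit follows unambiguously (variable layout is illustrative):

    # Sketch of deductive imputation for a balance edit T = sum(components):
    # with exactly one value missing (None), that value follows unambiguously.
    def deduce(record, total_var, component_vars):
        missing = [v for v in component_vars + [total_var] if record[v] is None]
        if len(missing) != 1:
            return record                     # nothing to deduce
        v = missing[0]
        if v == total_var:
            record[v] = sum(record[c] for c in component_vars)
        else:
            record[v] = record[total_var] - sum(record[c] for c in component_vars
                                                if record[c] is not None)
        return record

    print(deduce({"T": 100, "P": None, "C": 75}, "T", ["P", "C"]))  # -> P = 25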
Regression imputation
Regression model per variable to be imputed:
Y = A + B X + e
Imputations for missing data can be obtained from
Y = A_est + B_est X
or from
Y = A_est + B_est X + e*
where e* is drawn from appropriate distribution
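A sketch of both variants, using ordinary least squares for the estimates and an assumed normal distribution for e*; the data are illustrative:

    # Sketch of univariate regression imputation Y = A + B*X.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.1, 3.9, 6.2, 8.1])          # observed pairs (illustrative data)

    B_est, A_est = np.polyfit(x, y, deg=1)      # slope, intercept
    resid_sd = np.std(y - (A_est + B_est * x))

    x_missing = 5.0
    y_deterministic = A_est + B_est * x_missing               # Y = A_est + B_est X
    y_stochastic = y_deterministic + rng.normal(0, resid_sd)  # ... + e*
    print(round(y_deterministic, 2))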
Regression imputation
Imputation can also be based on multivariate regression model that relates each missing value to all observed values:
Y_mis = Mean_mis + B (Y_obs – Mean_obs) + e
Estimates of model parameters can be obtained by using EM algorithm
Imputations for missing data can be obtained from
Y_mis = Mean_est,mis + B_est (Y_obs – Mean_est,obs)
or from
Y_mis = Mean_est,mis + B_est (Y_obs – Mean_est,obs) + e*
where e* is drawn from appropriate distribution
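A sketch of the deterministic variant, assuming the mean vector and covariance matrix have already been estimated (e.g. by the EM algorithm); B_est then consists of regression coefficients derived from the estimated covariances, and all numbers are illustrative:

    # Sketch of the conditional-mean imputation step:
    # B_est = Cov(mis, obs) @ inv(Cov(obs, obs)).
    import numpy as np

    mean_est = np.array([10.0, 20.0, 30.0])     # illustrative estimates
    cov_est = np.array([[4.0, 2.0, 1.0],
                        [2.0, 5.0, 2.0],
                        [1.0, 2.0, 6.0]])

    obs_idx, mis_idx = [0, 1], [2]              # variables 0,1 observed, 2 missing
    y_obs = np.array([12.0, 19.0])

    B_est = cov_est[np.ix_(mis_idx, obs_idx)] @ np.linalg.inv(
        cov_est[np.ix_(obs_idx, obs_idx)])
    y_mis = mean_est[mis_idx] + B_est @ (y_obs - mean_est[obs_idx])
    print(y_mis)    # Y_mis = Mean_est,mis + B_est (Y_obs - Mean_est,obs)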
Nearest neighbor hot deck imputation
For each receptor record with missing values on some (target) variables a donor record is selected that
has no missing values on auxiliary and target variables
has smallest distance to receptor
Replace missing values by values from donor
Often used distance measure is minimax distance
Z_si: value of scaled auxiliary variable i in record s
distance between records s and t: D(s,t) = max_i |Z_si – Z_ti|
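A sketch of donor selection under the minimax distance, assuming the auxiliary variables have already been scaled; records and names are illustrative:

    # Sketch of nearest neighbor hot deck with D(s,t) = max_i |Z_si - Z_ti|.
    def minimax_distance(z_s, z_t):
        return max(abs(a - b) for a, b in zip(z_s, z_t))

    def impute_from_donor(receptor, donors, aux_vars, target_vars):
        complete = [d for d in donors
                    if all(d[v] is not None for v in aux_vars + target_vars)]
        donor = min(complete, key=lambda d: minimax_distance(
            [receptor[v] for v in aux_vars], [d[v] for v in aux_vars]))
        for v in target_vars:
            if receptor[v] is None:
                receptor[v] = donor[v]    # replace missing values by donor values
        return receptor

    donors = [{"x": 0.2, "y": 5.0}, {"x": 0.9, "y": 9.0}]
    print(impute_from_donor({"x": 0.25, "y": None}, donors, ["x"], ["y"]))
    # -> y imputed as 5.0 (first donor is nearest)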
Ratio hot deck imputation
Modified version of nearest neighbor hot-deck for variables that are part of a balance edit
Calculate difference between total variable and sum of observed components; this difference equals the sum of the missing components
Sum of missing components is distributed over missing components using ratios (of missing components to sum of missing components) from donor record
level of imputed components is determined by total variable, but their ratios are determined by donor
imputed and observed components add up to total
Example of ratio hot deck
P + C = T
Record to be imputed: T = 400, P = ?, C = ?
Donor record: T = 100, P = 25, C = 75
Imputed record: T = 400, P = 100, C = 300
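A sketch reproducing this example: the amount left after the observed components (here the full 400) is split using the donor’s ratios 25/100 and 75/100:

    # Sketch of ratio hot deck for a balance edit T = sum(components).
    def ratio_hot_deck(record, donor, total_var, component_vars):
        missing = [v for v in component_vars if record[v] is None]
        observed_sum = sum(record[v] for v in component_vars
                           if record[v] is not None)
        remainder = record[total_var] - observed_sum   # sum of missing components
        donor_sum = sum(donor[v] for v in missing)
        for v in missing:
            record[v] = remainder * donor[v] / donor_sum
        return record

    print(ratio_hot_deck({"T": 400, "P": None, "C": None},
                         {"T": 100, "P": 25, "C": 75}, "T", ["P", "C"]))
    # -> {'T': 400, 'P': 100.0, 'C': 300.0}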
Consistency
If imputed values violate edits, adjust them slightly
Observed values are not adjusted
Minimize Σ_i w_i |Y_i,final – Y_i,imp| subject to restriction that Y_i,final in combination with observed values satisfy all edits
Y_i,imp: imputed values (possibly failing edits)
Y_i,final: final values
w_i: user-specified weights
As numerical edits are generally linear (in)equalities, resulting problem is a linear programming problem
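A sketch of this linear program for the balance edit T = P + C with T observed (fixed at 100) and imputed values P = 30, C = 80; each |Y_i,final – Y_i,imp| is linearized with an auxiliary variable u_i ≥ |Y_i – Y_i,imp|, and data and weights are illustrative:

    from scipy.optimize import linprog

    y_imp = [30.0, 80.0]                 # imputed P, C (violate P + C = 100)
    w = [1.0, 1.0]
    # Decision vector: [P, C, u_P, u_C]; minimize w_P*u_P + w_C*u_C
    c = [0.0, 0.0] + w
    A_ub = [[ 1,  0, -1,  0],            #  P - u_P <=  P_imp
            [-1,  0, -1,  0],            # -P - u_P <= -P_imp
            [ 0,  1,  0, -1],            #  C - u_C <=  C_imp
            [ 0, -1,  0, -1]]            # -C - u_C <= -C_imp
    b_ub = [y_imp[0], -y_imp[0], y_imp[1], -y_imp[1]]
    A_eq = [[1, 1, 0, 0]]                # balance edit P + C = T
    b_eq = [100.0]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * 4)
    print(res.x[:2], res.fun)            # adjusted P, C; total weighted change 10.0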
Consistency
Prerequisite: it should be possible to find values Y_i,final such that all edits become satisfied
this is the case if Fellegi-Holt paradigm has been applied to identify errors
Instead of first imputing and then adjusting values, a better (but more complicated) approach is to impute under the restriction that edits become satisfied
see doctorate thesis by Caren Tempelman (Statistics Netherlands, www.cbs.nl)
Conclusion
All editing and imputation methods have their own (dis)advantages
Integrated use of editing techniques (selective editing, interactive editing, automatic editing and macro-editing) as well as various imputation techniques can improve efficiency of the SDE and imputation process, while at the same time maintaining or even enhancing the statistical quality of the produced data