Integrated Data Editing and Imputation
Ton de Waal
Department of Methodology, Statistics Netherlands, Voorburg
ICES III conference, Montréal, June 19, 2007
What is statistical data editing and imputation?
Observed data generally contain errors and missing values
Statistical Data Editing (SDE): process of checking observed data and, when necessary, correcting them
Imputation: process of estimating missing data and filling these values into the data set
What is integrated SDE and imputation?
Integration of error localization and imputation
Integration of several edit and imputation techniques to optimize edit and imputation process
Integration of statistical data editing into rest of statistical process
SDE and the survey process
We will focus on identifying and correcting errors
Other goals of SDE are:
identify error sources in order to provide feedback on entire survey process
provide information about the quality of incoming and outgoing data
Role of SDE is slowly shifting towards these goals: feedback on other survey phases can be used to improve those phases and reduce the amount of errors arising in them
Edits
Edit rules, or edits for short, often used to determine whether record is consistent or not
Inconsistent records are considered to contain errors
Consistent records that are also not suspicious otherwise, e.g. are not outlying with respect to the bulk of the data, are considered error-free
Example of edits (T turnover, P profit, C costs):
T = P + C (balance edit)
T ≥ 0
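As a minimal sketch (in Python, not from the talk), such edits can be encoded as simple checks per record; all function and variable names are illustrative:

    # Sketch: checking the two example edits for a record (T turnover, P profit, C costs).
    def check_edits(record):
        """Return the list of violated edits for one record."""
        T, P, C = record["T"], record["P"], record["C"]
        violated = []
        if T != P + C:          # balance edit: T = P + C
            violated.append("T = P + C")
        if T < 0:               # non-negativity edit: T >= 0
            violated.append("T >= 0")
        return violated

    # A record violating the balance edit:
    print(check_edits({"T": 100, "P": 40, "C": 50}))  # -> ['T = P + C']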
SDE and imputation
Three related problems:
Error localization: determine which values are erroneous
Imputation: impute missing and erroneous data in best possible way
Consistency: adjust imputed values such that all edits become satisfied
Correction is often done by means of imputation
Most SDE techniques focus on error localization
SDE in the “old” days
Use of computers in SDE started many years ago
In early years role of computers was restricted to checking which edits were violated
Subject-matter specialists retrieved paper questionnaires that did not pass all edits and corrected them
After correction, data were entered into computer again and checked again whether all edits were satisfied
Major problem: during manual correction process records were not checked for consistency
Modern SDE techniques
Interactive editing
Selective editing
Automatic editing
Macro-editing
Interactive editing
During interactive editing a modern survey processing system (e.g. BLAISE) is used
Such a system allows one to check and, if necessary, correct records in a single step
Advantages:
number of variables, edits and records may be high
quality of interactively edited data is generally high
Disadvantages:
all records have to be edited: costly in terms of budget and time
not transparent
Selective editing
Umbrella term for several methods to identify influential errors
Aim is to split data into two streams:
critical stream: records that are the most likely ones to contain influential errors
non-critical stream: records that are unlikely to contain influential errors
Records in critical stream are edited interactively
Records in non-critical stream are either not edited or are edited automatically
Selective editing
Many selective editing methods are based on common sense
Most often applied basic idea is to use a score function
Two important components:
influence: measures relative influence of record on publication figure
risk: measures deviation of observed values from “anticipated” values (e.g. medians or values from previous years)
Selective editing
Local score for single variable within record usually defined as distance between observed and anticipated values, taking influence of record into account
Example: W × |Y – Y*|
W raising weight, Y observed value, Y* anticipated value
influence component: W × Y*
risk component: |Y – Y*| / Y*
Local scores combined into global score for entire record by
sum of local scores
maximum of local scores
Records with global score above certain cut-off value are edited interactively
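A minimal sketch of such a score function, assuming anticipated values per variable are available; the records, weights and cut-off below are illustrative:

    # Sketch of selective editing with the local score W * |Y - Y*| from the slide.
    def local_score(w, y, y_star):
        # equals influence (w * y_star) times risk (|y - y_star| / y_star)
        return w * abs(y - y_star)

    def global_score(record, anticipated, weight):
        # Combine local scores; the slide mentions both sum and maximum.
        scores = [local_score(weight, record[v], anticipated[v]) for v in anticipated]
        return sum(scores)              # alternatively: max(scores)

    records = [{"T": 1000, "P": 200}, {"T": 9800, "P": 150}]
    anticipated = {"T": 1000, "P": 180}
    cutoff = 5000
    critical = [r for r in records if global_score(r, anticipated, weight=1.0) > cutoff]
    non_critical = [r for r in records if global_score(r, anticipated, weight=1.0) <= cutoff]
    print(len(critical), "record(s) for interactive editing")  # -> 1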
Selective editing: (dis)advantages
Advantage:
selective editing improves efficiency in terms of budget and time
Disadvantage:
no good techniques for combining local scores into global score are available if there are many variables
Selective editing has gradually become a popular method to edit business data
Automatic editing
Two kinds of errors: systematic ones and random ones
Systematic error: error reported consistently among (some) responding units
gross values reported instead of net values
values reported in units instead of requested thousands of units (so-called “thousand-errors”)
Random error: error caused by accident
e.g. observed value where respondent by mistake typed in a digit too many
Automatic editing of systematic errors
Can often be detected by
comparing respondent’s present values with those from previous years
comparing responses to questionnaire variables with values of register variables
using subject-matter knowledge
Once detected, systematic error is often simple to correct
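For instance, a thousand-error could be screened for by comparing the reported value with a reference value such as last year’s; the acceptance interval around 1000 in this sketch is an illustrative assumption, not from the talk:

    # Sketch: detecting and correcting a "thousand-error" by comparing the reported
    # value with a reference value (e.g. from the previous year).
    def fix_thousand_error(reported, reference):
        if reference > 0 and reported > 0:
            ratio = reported / reference
            if 300 <= ratio <= 3000:     # value reported in units instead of thousands
                return reported / 1000
        return reported

    print(fix_thousand_error(reported=2_500_000, reference=2_400))  # -> 2500.0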
Automatic editing of random errors
Three classes of methods:
methods based on statistical models (e.g. outlier detection techniques and neural networks)
methods based on deterministic checking rules
methods based on solving a mathematical optimization problem
Deterministic checking rules
State which values are considered erroneous when record violates edits
Example: if component variables do not sum up to total, total variable is considered to be erroneous
Advantages:
drastically improves efficiency in terms of budget and time
transparency and simplicity
Disadvantages:
many rules have to be specified, maintained and checked for validity
bias may be introduced as one aims to detect random errors in a systematic manner
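A sketch of the example rule above, with illustrative variable names:

    # Sketch of the deterministic checking rule from the slide: if the components
    # do not sum to the total, the total is declared erroneous and replaced.
    def apply_rule(record, total_var, component_vars):
        component_sum = sum(record[v] for v in component_vars)
        if record[total_var] != component_sum:
            record[total_var] = component_sum    # total considered erroneous
        return record

    print(apply_rule({"T": 90, "P": 40, "C": 50}, "T", ["P", "C"]))
    # -> {'T': 100, 'P': 40, 'C': 50}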
Error localization as mathematical optimization problem
Guiding principle is needed
Freund and Hartley (1967): minimize sum of the distance between observed and “corrected” data and a measure for violation of edits
Casado Valera et al. (1990s): minimize quadratic function measuring distance between observed and “corrected” data such that “corrected” data satisfy all edits
Bankier (1990s): impute missing data and potentially erroneous values by means of donor imputation, and select imputed record that satisfies all edits and is “closest” to original record
Fellegi-Holt paradigm (1976)
Data should be made to satisfy all edits by changing the values of as few variables as possible
Generalization: data should be made to satisfy all edits by changing values of variables with smallest possible sum of reliability weights
reliability weight expresses how reliable one considers values of a variable to be
high reliability weight corresponds to variable whose values are considered trustworthy
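A brute-force sketch of the generalized paradigm for the two example edits, enumerating subsets of variables in order of increasing weight sum; the feasibility check is hand-coded for just these edits, and the weights are illustrative (production systems use dedicated algorithms instead):

    from itertools import combinations

    def feasible(fixed):
        # Can the variables missing from `fixed` be chosen so that the edits
        # T = P + C and T >= 0 hold?
        free = {"T", "P", "C"} - fixed.keys()
        if not free:
            return fixed["T"] == fixed["P"] + fixed["C"] and fixed["T"] >= 0
        if free == {"T"}:
            return fixed["P"] + fixed["C"] >= 0   # T must equal P + C and be >= 0
        if "T" in fixed:
            return fixed["T"] >= 0                # a free component absorbs the balance
        return True                               # T and a component free: always satisfiable

    def error_localization(record, weights):
        variables = list(record)
        subsets = [s for k in range(len(variables) + 1)
                     for s in combinations(variables, k)]
        subsets.sort(key=lambda s: sum(weights[v] for v in s))
        for s in subsets:
            fixed = {v: record[v] for v in variables if v not in s}
            if feasible(fixed):
                return set(s)    # minimal-weight set of variables to change

    # T violates the balance edit; with these weights the paradigm blames T:
    print(error_localization({"T": 100, "P": 40, "C": 50},
                             {"T": 1.0, "P": 2.0, "C": 2.0}))  # -> {'T'}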
Fellegi-Holt paradigm: (dis)advantages
Advantages:
drastically improves efficiency in terms of budget and time
in comparison to deterministic checking rules fewer, and less detailed, rules have to be specified
Disadvantages:
class of errors that can safely be treated is limited to random errors
class of edits that can be handled is restricted to so-called hard (or logical) edits, which hold true for all correctly observed records
risky to treat influential errors by means of automatic editing
Macro-editing
Macro-editing techniques often examine potential impact on survey estimates to identify suspicious data in individual records
Two forms of macro-editing:
aggregation method
distribution method
Macro-editing: aggregation method
Verification whether figures to be published seem plausible
Compare quantities in publication tables with
same quantities in previous publications
quantities based on register data
related quantities from other sources
Macro-editing: distribution method
Available data used to characterize distribution of variables
Individual values compared with this distribution
Records containing values that are considered uncommon given the distribution are candidates for further inspection and possibly for editing
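A sketch of the distribution method using a robust characterization of the distribution (median and median absolute deviation); the threshold of 3 is a common but illustrative choice:

    # Sketch: flag values far from the bulk of the data, using median and
    # median absolute deviation (MAD) to characterize the distribution.
    from statistics import median

    def suspicious(values, threshold=3.0):
        m = median(values)
        mad = median(abs(v - m) for v in values)
        if mad == 0:
            return []
        return [v for v in values if abs(v - m) / mad > threshold]

    print(suspicious([10, 12, 11, 13, 9, 250]))  # -> [250]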
Macro-editing: graphical techniques
Exploratory Data Analysis techniques can be applied:
box plots
scatter plots
(outlier robust) fitting
Other often used techniques in software applications:
anomaly plots: graphical overviews of important estimates, where unusual estimates are highlighted
time series analysis
outlier detection methods
Once suspicious data have been detected on macro-level, one can drill down to sub-populations and individual units
Macro-editing: (dis)advantages
Advantages:
directly related to publication figures or distribution
efficient in terms of budget and time
Disadvantages:
records that are considered non-suspicious may still contain influential errors
publication of unexpected (but true) changes in trend may be prevented
for data sets with many important variables graphical macro-editing is not the most suitable SDE method: most persons cannot interpret 10 scatter plots at the same time
Integrating SDE techniques
We advocate an SDE approach that consists of the following phases:
correction of “evident” systematic errors
application of selective editing to split records into critical stream and non-critical stream
editing of data:
records in critical stream edited interactively
records in non-critical stream edited automatically
validation of publication figures by means of (graphical) macro-editing
Imputation
Expert guess
Deductive imputation
Multivariate regression imputation
Nearest neighbor hot-deck imputation
Ratio hot-deck imputation
Deductive imputation
Sometimes missing values can be determined unambiguously from edits
Examples:
single missing value involved in balance edit
for non-negative variables: if a total variable has zero value, all missing subtotal (component) variables are zero
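A sketch of the first example: one missing value in a balance edit follows unambiguously (variable layout is illustrative):

    # Sketch of deductive imputation for a balance edit T = sum(components):
    # with exactly one value missing (None), that value follows unambiguously.
    def deduce(record, total_var, component_vars):
        missing = [v for v in component_vars + [total_var] if record[v] is None]
        if len(missing) != 1:
            return record                     # nothing to deduce
        v = missing[0]
        if v == total_var:
            record[v] = sum(record[c] for c in component_vars)
        else:
            record[v] = record[total_var] - sum(record[c] for c in component_vars
                                                if record[c] is not None)
        return record

    print(deduce({"T": 100, "P": None, "C": 75}, "T", ["P", "C"]))  # -> P = 25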
Regression imputation
Regression model per variable to be imputed:
Y = A + B X + e
Imputations for missing data can be obtained from
Y = A_est + B_est X
or from
Y = A_est + B_est X + e*
where e* is drawn from appropriate distribution
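A sketch of both variants, using ordinary least squares for the estimates and an assumed normal distribution for e*; the data are illustrative:

    # Sketch of univariate regression imputation Y = A + B*X.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.1, 3.9, 6.2, 8.1])          # observed pairs (illustrative data)

    B_est, A_est = np.polyfit(x, y, deg=1)      # slope, intercept
    resid_sd = np.std(y - (A_est + B_est * x))

    x_missing = 5.0
    y_deterministic = A_est + B_est * x_missing               # Y = A_est + B_est X
    y_stochastic = y_deterministic + rng.normal(0, resid_sd)  # ... + e*
    print(round(y_deterministic, 2))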
Regression imputation
Imputation can also be based on multivariate regression model that relates each missing value to all observed values:
Y_mis = Mean_mis + B (Y_obs – Mean_obs) + e
Estimates of model parameters can be obtained by using EM algorithm
Imputations for missing data can be obtained from
Y_mis = Mean_est,mis + B_est (Y_obs – Mean_est,obs)
or from
Y_mis = Mean_est,mis + B_est (Y_obs – Mean_est,obs) + e*
where e* is drawn from appropriate distribution
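A sketch of the deterministic variant, assuming the mean vector and covariance matrix have already been estimated (e.g. by the EM algorithm); B_est then consists of regression coefficients derived from the estimated covariances, and all numbers are illustrative:

    # Sketch of the conditional-mean imputation step:
    # B_est = Cov(mis, obs) @ inv(Cov(obs, obs)).
    import numpy as np

    mean_est = np.array([10.0, 20.0, 30.0])     # illustrative estimates
    cov_est = np.array([[4.0, 2.0, 1.0],
                        [2.0, 5.0, 2.0],
                        [1.0, 2.0, 6.0]])

    obs_idx, mis_idx = [0, 1], [2]              # variables 0,1 observed, 2 missing
    y_obs = np.array([12.0, 19.0])

    B_est = cov_est[np.ix_(mis_idx, obs_idx)] @ np.linalg.inv(
        cov_est[np.ix_(obs_idx, obs_idx)])
    y_mis = mean_est[mis_idx] + B_est @ (y_obs - mean_est[obs_idx])
    print(y_mis)    # Y_mis = Mean_est,mis + B_est (Y_obs - Mean_est,obs)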
Nearest neighbor hot deck imputation
For each receptor record with missing values on some (target) variables a donor record is selected that
has no missing values on auxiliary and target variables
has smallest distance to receptor
Replace missing values by values from donor
Often used distance measure is minimax distance
Z_si: value of scaled auxiliary variable i in record s
distance between records s and t: D(s,t) = max_i |Z_si – Z_ti|
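A sketch of donor selection under the minimax distance, assuming the auxiliary variables have already been scaled; records and names are illustrative:

    # Sketch of nearest neighbor hot deck with D(s,t) = max_i |Z_si - Z_ti|.
    def minimax_distance(z_s, z_t):
        return max(abs(a - b) for a, b in zip(z_s, z_t))

    def impute_from_donor(receptor, donors, aux_vars, target_vars):
        complete = [d for d in donors
                    if all(d[v] is not None for v in aux_vars + target_vars)]
        donor = min(complete, key=lambda d: minimax_distance(
            [receptor[v] for v in aux_vars], [d[v] for v in aux_vars]))
        for v in target_vars:
            if receptor[v] is None:
                receptor[v] = donor[v]    # replace missing values by donor values
        return receptor

    donors = [{"x": 0.2, "y": 5.0}, {"x": 0.9, "y": 9.0}]
    print(impute_from_donor({"x": 0.25, "y": None}, donors, ["x"], ["y"]))
    # -> y imputed as 5.0 (first donor is nearest)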
Ratio hot deck imputation
Modified version of nearest neighbor hot-deck for variables that are part of a balance edit
Calculate difference between total variable and sum of observed components; this difference equals the sum of the missing components
Sum of missing components is distributed over missing components using ratios (of missing components to sum of missing components) from donor record
level of imputed components is determined by total variable, but their ratios are determined by donor
imputed and observed components add up to total
Example of ratio hot deck
P + C = T
Record to be imputed: T = 400, P = ?, C = ?
Donor record: T = 100, P = 25, C = 75
Imputed record: T = 400, P = 100, C = 300
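A sketch reproducing this example: the amount left after the observed components (here the full 400) is split using the donor’s ratios 25/100 and 75/100:

    # Sketch of ratio hot deck for a balance edit T = sum(components).
    def ratio_hot_deck(record, donor, total_var, component_vars):
        missing = [v for v in component_vars if record[v] is None]
        observed_sum = sum(record[v] for v in component_vars
                           if record[v] is not None)
        remainder = record[total_var] - observed_sum   # sum of missing components
        donor_sum = sum(donor[v] for v in missing)
        for v in missing:
            record[v] = remainder * donor[v] / donor_sum
        return record

    print(ratio_hot_deck({"T": 400, "P": None, "C": None},
                         {"T": 100, "P": 25, "C": 75}, "T", ["P", "C"]))
    # -> {'T': 400, 'P': 100.0, 'C': 300.0}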
Consistency
If imputed values violate edits, adjust them slightly
Observed values are not adjusted
Minimize Σ_i w_i |Y_i,final – Y_i,imp| subject to restriction that Y_i,final in combination with observed values satisfy all edits
Y_i,imp: imputed values (possibly failing edits)
Y_i,final: final values
w_i: user-specified weights
As numerical edits are generally linear (in)equalities, resulting problem is a linear programming problem
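A sketch of this linear program for the balance edit T = P + C with T observed (fixed at 100) and imputed values P = 30, C = 80; each |Y_i,final – Y_i,imp| is linearized with an auxiliary variable u_i ≥ |Y_i – Y_i,imp|, and data and weights are illustrative:

    from scipy.optimize import linprog

    y_imp = [30.0, 80.0]                 # imputed P, C (violate P + C = 100)
    w = [1.0, 1.0]
    # Decision vector: [P, C, u_P, u_C]; minimize w_P*u_P + w_C*u_C
    c = [0.0, 0.0] + w
    A_ub = [[ 1,  0, -1,  0],            #  P - u_P <=  P_imp
            [-1,  0, -1,  0],            # -P - u_P <= -P_imp
            [ 0,  1,  0, -1],            #  C - u_C <=  C_imp
            [ 0, -1,  0, -1]]            # -C - u_C <= -C_imp
    b_ub = [y_imp[0], -y_imp[0], y_imp[1], -y_imp[1]]
    A_eq = [[1, 1, 0, 0]]                # balance edit P + C = T
    b_eq = [100.0]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * 4)
    print(res.x[:2], res.fun)            # adjusted P, C; total weighted change 10.0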
Consistency
Prerequisite: it should be possible to find values Y_i,final such that all edits become satisfied
this is the case if Fellegi-Holt paradigm has been applied to identify errors
Instead of first imputing and then adjusting values, a better (but more complicated) approach is to impute under the restriction that edits become satisfied
see doctorate thesis by Caren Tempelman (Statistics Netherlands, www.cbs.nl)
Conclusion
All editing and imputation methods have their own (dis)advantages
Integrated use of editing techniques (selective editing, interactive editing, automatic editing and macro-editing) as well as various imputation techniques can improve efficiency of the SDE and imputation process, while at the same time maintaining or even enhancing the statistical quality of the produced data