THE MAIN INNOVATIONS OF DATA EDITING AND IMPUTATION FOR THE 2010 ITALIAN AGRICULTURAL CENSUS

21
THE MAIN INNOVATIONS OF DATA EDITING AND IMPUTATION FOR THE 2010 ITALIAN AGRICULTURAL CENSUS G. Bianchi, R. M. Lipsi, P. Francescangeli, G. Ruocco, A. M. Salvatore, F. Scalfati ([email protected]) Work Session on Statistical Data Editing Ljubljana, Slovenia, 9-11 May 2011 UNECE - CONFERENCE OF EUROPEAN STATISTICIANS

description

UNECE - CONFERENCE OF EUROPEAN STATISTICIANS. Work Session on Statistical Data Editing Ljubljana, Slovenia, 9-11 May 2011. THE MAIN INNOVATIONS OF DATA EDITING AND IMPUTATION FOR THE 2010 ITALIAN AGRICULTURAL CENSUS G. Bianchi, R. M. Lipsi, P. Francescangeli, G. Ruocco, - PowerPoint PPT Presentation

Transcript of THE MAIN INNOVATIONS OF DATA EDITING AND IMPUTATION FOR THE 2010 ITALIAN AGRICULTURAL CENSUS

Page 1: THE MAIN INNOVATIONS OF DATA EDITING AND IMPUTATION FOR THE 2010 ITALIAN AGRICULTURAL CENSUS

THE MAIN INNOVATIONS OF DATA EDITING AND IMPUTATION FOR THE 2010 ITALIAN

AGRICULTURAL CENSUS

G. Bianchi, R. M. Lipsi, P. Francescangeli, G. Ruocco, A. M. Salvatore, F. Scalfati

([email protected])

Work Session on Statistical Data EditingLjubljana, Slovenia, 9-11 May 2011

UNECE - CONFERENCE OF EUROPEAN STATISTICIANS

Page 2: THE MAIN INNOVATIONS OF DATA EDITING AND IMPUTATION FOR THE 2010 ITALIAN AGRICULTURAL CENSUS

Outline

• Introduction

• E&I strategy guidelines

strategy and census stages

during data collection

stages after data capturing

• E&IS tools and innovations

• Conclusions

• References

Page 3: THE MAIN INNOVATIONS OF DATA EDITING AND IMPUTATION FOR THE 2010 ITALIAN AGRICULTURAL CENSUS

IntroductionFor the 6th Italian Agricultural Census, a new Editing and Imputation System (E&IS) has been implemented in order to reduce the total census error

The main purpose of the E&IS is to identify and treat the non sampling errors, in order to provide a complete and consistent set of data

Page 4: THE MAIN INNOVATIONS OF DATA EDITING AND IMPUTATION FOR THE 2010 ITALIAN AGRICULTURAL CENSUS

E&I strategy guidelines (1) Quality oriented approach by performing the

E&I process from data collection to the final figures

Data editing and detection of outliers and influential errors (selective editing) during data collection

After data capturing, scheduling of two main

correction stages, centrally managed by Istat

Page 5: THE MAIN INNOVATIONS OF DATA EDITING AND IMPUTATION FOR THE 2010 ITALIAN AGRICULTURAL CENSUS

E&I strategy guidelines (2)

Use of techniques that minimize the number of changes especially for the treatment of not influential random errors

Quality indicators to monitor the main steps of E&I

Ad-hoc documentation to evaluate the outcome of the procedures, paying particular attention to changes due to the E&I process

Page 6: THE MAIN INNOVATIONS OF DATA EDITING AND IMPUTATION FOR THE 2010 ITALIAN AGRICULTURAL CENSUS

E&I strategy and census stages (1) According to the E&I strategy, all variables

are separated into different related subsets to identify the most appropriate treatment for each of them

The E&I process will feature three main stages:

1. E&I during data collection

2. Provisional figures dissemination (primary variables)

3. Final results dissemination

Page 7: THE MAIN INNOVATIONS OF DATA EDITING AND IMPUTATION FOR THE 2010 ITALIAN AGRICULTURAL CENSUS

E&I strategy and census stages (2)

Page 8: THE MAIN INNOVATIONS OF DATA EDITING AND IMPUTATION FOR THE 2010 ITALIAN AGRICULTURAL CENSUS

E&I during data collection (1)In order to prevent and correct fatal errors and missing values during data capturing

Questionnaire editingHoldings/enumerators

A subset of 220 checking rules (fatal and query) hasbeen implemented in the web based data entry System

Automatic checkData collection staff

Before the final release of data to the census DB, to localize potential errors slipped during data gathering

Census Data Collection System

Page 9: THE MAIN INNOVATIONS OF DATA EDITING AND IMPUTATION FOR THE 2010 ITALIAN AGRICULTURAL CENSUS

E&I during data collection (2)Before the end of field enumeration operations, and while data collection network is still in force, two distinct procedures have been implemented and launched by Istat to detect influential errors and outlier values

Outliers detection

-Forward Search Technique-manual review of anomalous values by data collection staff

Micro-editing check

Underlines inconsistent data by analyzing at unit level the coherence between the answers referring to related topics

E&I SYSTEM

Page 10: THE MAIN INNOVATIONS OF DATA EDITING AND IMPUTATION FOR THE 2010 ITALIAN AGRICULTURAL CENSUS

E&I during data collection (3)Forward Search Technique: outliers detection among strata, defined according to the crop type and the farm size

Census

Administrative Register

-Regression lineY=aX+b

-Parameters estimation a and b, with and without outliers

-Statistical significance and goodness of fit of the regression model (R2)

Page 11: THE MAIN INNOVATIONS OF DATA EDITING AND IMPUTATION FOR THE 2010 ITALIAN AGRICULTURAL CENSUS

E&I stages after data capturing (1) In order to achieve maximum coherence

between provisional and final data at regional level, the strategy adopted is firstly to correct all the primary variables and then the secondary ones

After data collection, two main correction stages are scheduled. In the first stage, all the variables for the dissemination of provisional figures (primary variables) are corrected

In each E&I stage, the following steps are repeated: automatic error detection and treatment of errors

Page 12: THE MAIN INNOVATIONS OF DATA EDITING AND IMPUTATION FOR THE 2010 ITALIAN AGRICULTURAL CENSUS

E&I stages after data capturing (2)

First step of each E&I stage

Automatic error detection

Macro level editing

- Uses all (or large part) of data to identify errors- Enables to evaluate the accuracy of preliminary estimates such as totals (or subgroups main figures)- Outliers detection

Micro level editing

Erroneous values in individual records are automatically identified by means of edit rules

Page 13: THE MAIN INNOVATIONS OF DATA EDITING AND IMPUTATION FOR THE 2010 ITALIAN AGRICULTURAL CENSUS

E&I stages after data capturing (3)

Second step of each E&I stage

Treatment of errors

Selective editing

Treatment of the outliers and influential errors, having substantial impact on data dissemination is based on manual review

Random errors

Treatment of not influential random errors is based on minimum change approaches

Imputation

Model based techniques or nearest neighbour donor will be used for the imputation of not influential random errors

Page 14: THE MAIN INNOVATIONS OF DATA EDITING AND IMPUTATION FOR THE 2010 ITALIAN AGRICULTURAL CENSUS

E&IS tools and innovations (1) Inclusion of a subset of edit rules in the data

capture stage Use of Forward Search methods for the outliers

detection Use of administrative sources for micro and

macro data checks Use of score functions to prioritize records to

be manually reviewed Use of minimum change based model or

nearest neighbour approach for localizing residual random errors

Mix of different imputation methods as nearest neighbour approach or model based imputation

Page 15: THE MAIN INNOVATIONS OF DATA EDITING AND IMPUTATION FOR THE 2010 ITALIAN AGRICULTURAL CENSUS

E&IS tools and innovations (2) The core of E&IS is the software DIESIS (Data

Imputation Editing System – Italian Software), used for dealing with non influential errors in quantitative variables

DIESIS was developed in 2001 by ISTAT and academic researchers of the University of Rome “ Sapienza”

In DIESIS, optimization techniques were implemented for the simultaneous treatment of qualitative and quantitative variables

Joint use of data driven and minimum change approaches

DIESIS localization performance has been compared with the performance of the Canadian software BANFF

Page 16: THE MAIN INNOVATIONS OF DATA EDITING AND IMPUTATION FOR THE 2010 ITALIAN AGRICULTURAL CENSUS

E&IS tools and innovations (3) The scheduling and the monitoring of all

procedures and the interactive corrections will be managed by CONCERT, a Java web application

To test the E&IS while implementing the scheduled procedures, an Oracle database was implemented

The whole process of E&I will be documented by a set of quality indicators both, on the data collected and on the results of the different editing steps

Page 17: THE MAIN INNOVATIONS OF DATA EDITING AND IMPUTATION FOR THE 2010 ITALIAN AGRICULTURAL CENSUS

E&IS tools and innovations (4)

Page 18: THE MAIN INNOVATIONS OF DATA EDITING AND IMPUTATION FOR THE 2010 ITALIAN AGRICULTURAL CENSUS

E&IS tools and innovations (5)Some simulation studies have been carried out for:

identifying for each section of the questionnaire, the most appropriate correction approach

evaluating the imputation of missing non linearly dependent data through conditional Copula functions (developed by ISTAT and the University of Bologna)

assessing the use of Forward Search techniques (robust statistical methods) for outliers detection (developed by ISTAT and the University of Parma)

Page 19: THE MAIN INNOVATIONS OF DATA EDITING AND IMPUTATION FOR THE 2010 ITALIAN AGRICULTURAL CENSUS

ConclusionsThe innovative E&I strategy will reduce the efforts of coping with timeliness constraints and will increase data consistency and accuracy

The results of the procedures implemented in the E&IS are very encouraging and allow to trust in a good improvement of census data quality

Page 20: THE MAIN INNOVATIONS OF DATA EDITING AND IMPUTATION FOR THE 2010 ITALIAN AGRICULTURAL CENSUS

Thank you!!! Thank you!!!

Thank you!!!

Page 21: THE MAIN INNOVATIONS OF DATA EDITING AND IMPUTATION FOR THE 2010 ITALIAN AGRICULTURAL CENSUS

References • Bianchi G., Di Lascio F. M. L., Giannerini S., Manzari A., Reale A., Ruocco

G. (2009-a) Exploring copulas for the imputation of missing nonlinearly dependent data, Seventh Scientific Meeting of the CLAssification and Data Analysis Group of the Italian Statistical Society Università di Catania (Italy). September 9-11, 2009.

• Bianchi G., Francescangeli P., Manzari A., Reale A., Ruocco G., Salvi S. (2009-b) An overview of Editing and Imputation System of 2010 Italian Agriculture Census. Round. Roundtable Meeting on Programme for the 2010 Round Census of Agriculture . Budapest 23-27 november 2009.

• Bianchi G., Manzari A., Reale A., Salvi S. (2009-c) Valutazione dell’idoneità del software DIESIS all’individuazione dei valori errati in variabili quantitative. Istat - Collana Contributi Istat – n. 1 – 2009.

• Cotton C. (1991) Functional description of the generalized edit and imputation system. Business Survey Methods Division - July 25 Statistics Canada.

• Kovar J.G., MacMillian J.H., and Whitridge P. (1988) Overview and strategy for the generalized edit and imputation system. Report, Methodology Branch - April 1988 (updated February 1991) Statistics Canada.

• Luzi et al. (2007). EDIMBUS. Recommended Practices for Editing and Imputation in Cross-Sectional Business Surveys, August 2007.

• Riani M., Atkinson A. C. (2000). Robust Diagnostic Data Analysis: Trasformations in Regression. TECHNOMETRICS. vol. 42, pp. 384-394 ISSN: 0040-1706. With discussion.