INSTITUTO NACIONAL DE ESTADÍSTICA

Using Selective Editing Combined with an Automatic System in the FSS of Spain

Dolores Lorca, National Statistical Institute of Spain

Summary

• An integrated editing process that combines selective editing and the generalized edit and imputation system Banff

• We use Banff to detect the suspicious units and a score function for selective editing

• Spanish FSS: the different types of data (crop, livestock, employment) contribute to the complexity of this process

• Some results obtained from the traditional microediting approach and from selective editing are compared

Traditional microediting approach

• The subject matter expert specifies the edits

• The processing department writes tailor-made programs for each survey to detect the edit failures

• The edit failures are manually reviewed


New integrated edit and imputation process

1) Initial editing prior to selective editing

2) Selective editing procedure

3) Automatic system process (BANFF)

1) Initial editing prior to selective editing

Consistency checks are established in the data collection phase, which is carried out by interviewers

2) Selective editing procedure

Score functions are built to identify and prioritize the suspicious survey units to be reviewed manually because of their significant weight in the final estimates


3) Automatic system process (BANFF).

The automatic system process is carried out using the generalized system Banff, developed by Statistics Canada

Case study: Farm Structure Survey (FSS)

The Spanish FSS collects different types of data such as:

• Utilised agricultural land

• Cultivated Land by kind of crop

• Types of livestock

• The structure and the amount of farm employment

• Machinery and equipment

The main characteristics of the FSS

• The FSS is carried out every 2 years

• It consists of a farm panel drawn from the last Agrarian Census

• The sample design is a single-stage design with stratification of the farms according to geographical area, type of farming (TF) and size

• Data collection is carried out by interviewers

FSS estimators: total estimate of the jth variable in stratum h

$$\hat{X}_{hj} = F_h \sum_{i=1}^{n_h} X_{hji} \qquad (1)$$

$F_h$ is the sample weight for stratum h

$n_h$ is the sample size in stratum h

$X_{hji}$ denotes the value of the jth variable for the sampled unit i in stratum h
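As a small illustration of the stratum total estimator, the sketch below multiplies the stratum weight by the sum of the sampled values. The function name `stratum_total` and the numbers are hypothetical and are not taken from the FSS.

```python
def stratum_total(weight_h, values):
    """Estimate the total of one variable in one stratum: F_h * sum_i X_hji."""
    return weight_h * sum(values)


# Hypothetical stratum: weight F_h = 12.5 and three sampled farms (hectares).
sample_values = [10.0, 4.0, 6.0]
total = stratum_total(12.5, sample_values)
print(total)  # 250.0
```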

Initial editing prior to selective editing

• Initial editing is carried out by interviewers in the NSI’s provincial offices

• In this phase, all fatal errors are corrected. Most of these fatal errors come from balance edits

Selective editing procedure

• The goal: to select the survey units with suspicious values that may have a significant effect on survey estimates

• Key variables chosen: Utilized Agricultural Land (UAL), Cultivated Land (CL), Woody Crops (WCs), Olive Grove (OG), Vineyard (VY), Animal Units (AU), Annual Labour Units (ALU)


Selective editing: crop variables

• Relative stability over time

• Anomalous variations, from the previous year to the current one, can be a sign of data errors

• We determine the units with anomalous and significant variations of the selected crop variables

Steps of selective editing procedure: crop variables

1) In each stratum, we identify the units whose analyzed variables show anomalous variations with respect to the previous period, using the Hidiroglou-Berthelot (1986) method of outlier detection (PROC OUTLIER of the Banff system)

2) Among the outliers identified in the previous step, the units with a significant weight in the population total estimates are selected for manual editing using a score function

Step (1): Hidiroglou-Berthelot method

PROC OUTLIER of the Banff system

$$r_{hji} = \frac{X_{hji}^{t}}{X_{hji}^{t-1}}$$

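$$s_{hji} = \begin{cases} 1 - \dfrac{r_{hjm}}{r_{hji}} & \text{if } 0 < r_{hji} < r_{hjm} \\[1ex] \dfrac{r_{hji}}{r_{hjm}} - 1 & \text{if } r_{hji} \ge r_{hjm} \end{cases}$$

where $r_{hjm}$ is the median of the ratios $r_{hji}$ in stratum h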

Effect $e_{hji}$ for each unit i:

$$e_{hji} = s_{hji}\,\big(\max(F_h^{t-1} X_{hji}^{t},\; F_h^{t-1} X_{hji}^{t-1})\big)^{exp}, \qquad exp = 1$$

$M$, $Q_1$, $Q_3$: the median, the first quartile and the third quartile of the transformed $e_{hji}$ values of the variable being processed

$$d_{Q1} = \max(M - Q_1,\ |A \cdot M|)$$

$$d_{Q3} = \max(Q_3 - M,\ |A \cdot M|)$$

Units whose effect $e_{hji}$ falls outside the following interval are identified as outliers:

$$(M - C\,d_{Q1},\; M + C\,d_{Q3}), \qquad C = 5$$
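The Hidiroglou-Berthelot step can be sketched in plain Python as follows. This is only an illustration, not the Banff PROC OUTLIER implementation: the function name `hb_outliers`, the default `A=0.05` (the slides do not give a value for A) and the quartile convention of `statistics.quantiles` are assumptions, and all values are assumed to be strictly positive.

```python
from statistics import median, quantiles


def hb_outliers(prev, curr, weight, A=0.05, C=5.0, exp=1.0):
    """Flag units whose period-to-period change is anomalous within one stratum.

    prev, curr: values of one variable in t-1 and t (all assumed > 0).
    weight: stratum weight from the previous period (F_h^{t-1}).
    A is an assumed default; C=5 and exp=1 follow the slides.
    """
    ratios = [c / p for p, c in zip(prev, curr)]
    r_med = median(ratios)
    # Symmetric transformation of the ratios around their median.
    s = [1 - r_med / r if r < r_med else r / r_med - 1 for r in ratios]
    # Effect: transformed ratio scaled by the size of the unit.
    e = [si * max(weight * c, weight * p) ** exp
         for si, p, c in zip(s, prev, curr)]
    q1, m, q3 = quantiles(e, n=4, method="inclusive")
    d_q1 = max(m - q1, abs(A * m))
    d_q3 = max(q3 - m, abs(A * m))
    lower, upper = m - C * d_q1, m + C * d_q3
    return [not (lower <= ei <= upper) for ei in e]


# Hypothetical stratum of seven farms; the last one grows tenfold.
flags = hb_outliers([10, 10, 10, 10, 10, 10, 10],
                    [10, 11, 9, 10, 12, 8, 100], weight=1.0)
```

Scaling the transformed ratio by the unit size means that a large relative change in a small farm counts for less than the same relative change in a large one.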

Step (2): scaled local score function (Latouche and Berthelot 1992):

$$\Delta_{hji} = \frac{F_h^{t-1}\,\lvert X_{hji}^{t} - X_{hji}^{t-1}\rvert}{\hat{X}_{hj}^{t-1}} \qquad (3)$$

Setting the threshold value (Lawrence and McKenzie 2000):

$$a_{hj} = \sqrt{\frac{3k}{n_h}}\; SE(\hat{X}_{hj}) \qquad (4)$$

$a_{hj}$ is the threshold value of the jth variable in stratum h, $SE(\hat{X}_{hj})$ is the standard error of the estimated total of the jth variable in stratum h, $n_h$ is the sample size in stratum h and k is a value such that:

$$E\big(\mathrm{bias}(\hat{X}_{hj})\big)^2 \le k\,\mathrm{Var}(\hat{X}_{hj})$$

Using the Lawrence and McKenzie formula ensures that the squared bias due to not editing some of the survey units is less than a fraction k of the variance of the estimate. The value of k is set to 10%

• Within each stratum, the values $\Delta_{hji}$ are sorted in descending order

• Then, the outliers with score $\Delta_{hji} > a_{hj}$ are selected for manual editing
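The score, threshold and selection steps for crop variables can be sketched as below. The function names are hypothetical, and the example assumes that the standard error passed to the threshold is expressed on the same scale as the score (here, relative to the previous total); the slides do not state this.

```python
from math import sqrt


def local_scores(prev, curr, weight_prev, total_prev):
    """Scaled local score: F_h^{t-1} * |X^t - X^{t-1}| / estimated total in t-1."""
    return [weight_prev * abs(c - p) / total_prev for p, c in zip(prev, curr)]


def lm_threshold(se_total, n_h, k=0.10):
    """Threshold a_hj = sqrt(3k / n_h) * SE, with k = 10% as in the slides."""
    return sqrt(3 * k / n_h) * se_total


def select_for_manual_editing(scores, is_outlier, threshold):
    """Indices of outliers whose score exceeds the threshold, largest score first."""
    flagged = [i for i, (d, out) in enumerate(zip(scores, is_outlier))
               if out and d > threshold]
    return sorted(flagged, key=lambda i: scores[i], reverse=True)


# Hypothetical stratum: four farms, weight 10, previous total 10 * 500 = 5000.
scores = local_scores([100.0, 50.0, 200.0, 150.0],
                      [110.0, 50.0, 400.0, 150.0], 10.0, 5000.0)
a_hj = lm_threshold(se_total=0.05, n_h=4)  # assumed relative standard error
selected = select_for_manual_editing(scores, [True, False, True, False], a_hj)
```

Only units that were already flagged as outliers are compared with the threshold, which mirrors the two-step order described in the slides.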

Selective editing: employment variable

• ALU variable: one ALU is equivalent to the work carried out by one person on a full-time basis over one year

• Auxiliary information is used to estimate the expected amended value: the ratio between the number of persons employed in agriculture in t and in t-2, obtained from the Labour Force Survey (LFS)

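Selective editing: score function

$$\Delta_{hji} = \frac{F_h^{t-1}\,\lvert X_{hji}^{t} - R\,X_{hji}^{t-1}\rvert}{R\,\hat{X}_{hj}^{t-1}} \qquad (5)$$

where R is the ratio of agricultural employment between t and the previous period, obtained from the Labour Force Survey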
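A minimal sketch of the employment score, in which the previous ALU value is first projected with the Labour Force Survey ratio R and the current value is compared with that projection. The function name `employment_scores` and all figures are hypothetical.

```python
def employment_scores(prev_alu, curr_alu, weight_prev, total_prev, lfs_ratio):
    """Score each unit by its weighted distance from the R-projected previous value."""
    scores = []
    for p, c in zip(prev_alu, curr_alu):
        expected = lfs_ratio * p  # expected amended value for period t
        scores.append(weight_prev * abs(c - expected) / (lfs_ratio * total_prev))
    return scores


# Hypothetical stratum: two farms, weight 20, previous total 20 * 3 = 60 ALU,
# and an assumed LFS ratio of 0.95 (agricultural employment fell by 5%).
alu_scores = employment_scores([2.0, 1.0], [1.9, 3.0], 20.0, 60.0, 0.95)
```

The first farm follows the sector-wide trend and scores zero, while the second farm triples its labour input and receives a high score.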

Selective editing: livestock variables

– The FSS collects the livestock existing on the farm on the day of the interview

– A farm can show a strong variation in livestock depending on the interview date

The selective editing procedure for livestock therefore differs from that used for the other variables

Animal Units (AU)

Livestock data are expressed in AUs, which are obtained by applying a coefficient to each species and type in order to group different species in one common unit


Steps of Selective Editing procedure: livestock

1) Units that fail some of the edits, which are specified in the traditional microediting approach, are selected as suspicious units

2) For each suspicious unit or edit failure, an estimate of the expected amended response of AU variable is calculated

3) We determine, among the suspicious units detected at the previous step, those units with a significant weight on the total estimate of the AU variable

Edits specified in the traditional microediting approach

$y_{hji} < c_{hj}$

$y_{hji}$ is the jth variable (type of livestock) for unit i in stratum h

$c_{hj}$ is a constant determined from the historical empirical distributions

• Estimate of the expected amended response of the AU variable: $c_{hj}$ expressed in AU, i.e. $x'_{hji}$

• Magnitude of failure for the suspicious unit i: $e_{hji} = x_{hji} - x'_{hji}$

Selective editing: score function

$$\Delta_{hji} = \frac{F_h^{t-1}\, e_{hji}}{\hat{X}_{hj}^{t-1}} \qquad (6)$$

• The threshold is calculated using the Lawrence and McKenzie formula, as in the previous cases

• Within each stratum, the values $\Delta_{hji}$ are sorted in descending order

• Then, the edit failures with score $\Delta_{hji} > a_{hj}$ are selected for manual editing
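The livestock procedure can be sketched for a single type of livestock as follows. The function name `livestock_selection`, the AU coefficient of 0.8 and all other figures are hypothetical; the sketch assumes that a unit fails the edit when its head count is not below the limit.

```python
def livestock_selection(heads, au_coef, c_limit, weight_prev, total_au_prev,
                        threshold):
    """Return (unit index, score) for edit failures whose score exceeds the threshold."""
    selected = []
    for i, y in enumerate(heads):
        if y < c_limit:
            continue  # the edit y < c is satisfied, so the unit is not suspicious
        x = au_coef * y                 # reported value expressed in AU
        x_expected = au_coef * c_limit  # expected amended response in AU
        e = x - x_expected              # magnitude of the failure
        score = weight_prev * e / total_au_prev
        if score > threshold:
            selected.append((i, score))
    # Largest scores first, as in the sorted review list.
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


# Hypothetical stratum: three farms, limit of 100 heads, weight 5,
# previous AU total of 2000 and a threshold of 0.1.
to_review = livestock_selection([30, 120, 500], 0.8, 100, 5.0, 2000.0, 0.1)
```

In this example the second farm fails the edit but has too little effect on the total to be reviewed, so only the third farm is selected.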

Global score function

$$G_{hi} = \max_{j}(\Delta_{hji})$$
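The global score simply keeps, for each unit, the largest of its local scores across the key variables. A minimal sketch, with a hypothetical function name and made-up scores:

```python
def global_scores(local_scores_by_variable):
    """For each unit, take the maximum local score over all variables."""
    per_unit = zip(*local_scores_by_variable.values())
    return [max(unit_scores) for unit_scores in per_unit]


# Hypothetical local scores for three units and three key variables.
local = {
    "UAL": [0.02, 0.40, 0.00],
    "AU":  [0.30, 0.10, 0.00],
    "ALU": [0.05, 0.05, 0.70],
}
g = global_scores(local)
print(g)  # [0.3, 0.4, 0.7]
```

Taking the maximum means that a unit is prioritized as soon as it is influential for at least one variable.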

Macroediting and selective editing approach

• First, the strata with the largest variation of the analysed variables with respect to the previous period are selected

• Then, the steps of the selective editing procedure are applied only to the farms in the selected strata

Macro-editing approach:

$$\delta_{hj} = \frac{\lvert \hat{X}_{hj}^{t} - \hat{X}_{hj}^{t-1} \rvert}{\sum_{h=1}^{p} \lvert \hat{X}_{hj}^{t} - \hat{X}_{hj}^{t-1} \rvert}$$

Threshold value for the strata

• In each region, the $\delta_{hj}$ values are sorted in descending order

• We determine a threshold value $\delta_{j}^{*}$, and the strata with $\delta_{hj} > \delta_{j}^{*}$ are selected

This threshold value is set to 3%.
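The stratum selection can be sketched as below, assuming that the stratum indicator is each stratum's share of the region's total absolute change between the two periods. That reading of the formula, the function name `select_strata` and the figures are all assumptions.

```python
def select_strata(totals_prev, totals_curr, threshold=0.03):
    """Return the strata whose share of the region's total change exceeds the threshold."""
    changes = {h: abs(totals_curr[h] - totals_prev[h]) for h in totals_prev}
    overall = sum(changes.values())
    if overall == 0:
        return []  # nothing changed in the region, so no stratum is selected
    shares = {h: change / overall for h, change in changes.items()}
    ranked = sorted(shares, key=shares.get, reverse=True)
    return [h for h in ranked if shares[h] > threshold]


# Hypothetical region with three strata and estimated totals for t-1 and t.
prev_totals = {"s1": 1000.0, "s2": 500.0, "s3": 2000.0}
curr_totals = {"s1": 1001.0, "s2": 560.0, "s3": 1960.0}
chosen = select_strata(prev_totals, curr_totals)
```

Here stratum s1 accounts for about 1% of the total change and is skipped, so the selective editing steps would run only on the farms in s2 and s3.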

Results

– Number of farms: 3690

– We compare the results obtained with the following editing procedures:

• (A) Traditional microediting approach

• (B) Selective editing procedure

• (C) Macroediting and selective editing approach

Table 1

Procedure                      A      B      C
Rate of edited units (%)       21.5   9.0    4.8
Rate of corrected units (%)    3.9    7.2    9.1

Table 2: Change rates of the total estimate for the CL variable (%)

Change rate of (B) over (A)    Change rate of (C) over (A)
0.8                            1.1

Table 3

95% confidence interval    CL (B)      CL (C)
(72657.2; 86031.5)         78770.72    78471.56

Further research

– Banff will be applied to the remaining units, which have not been edited in the selective editing procedure

– Different methods of imputation will be tested

Final remarks

• Integrating PROC OUTLIER of Banff to detect suspicious units with a score function to select units for manual editing has been useful in the Spanish FSS

• This approach would reduce cost and processing time

• Response burden is reduced because fewer recontacts are carried out