Introduction to data integration

ENP 2018

Georgia


Outline

Combining multiple data sources (various statistical tasks and methods)

Two main stages of data integration

Steady/major states of data; input and output quality

Statistical validity and equivalence of proxy data

Basic data configurations

Statistical tasks as building blocks


Micro dataset

I. Data editing and imputation

II. Creation of joint micro dataset

II.a. Data linkage

II.b. Micro-level statistical matching

III. Alignment of statistical micro data

III.a. Units

III.b. Measurements

Aggregated level

IV. Multisource estimation

IV.a. Population size estimation

IV.b. Univalent estimation

IV.c. Coherent estimation

(Di Zio et al. 2017)

Example: Norwegian integrated employment statistics

Production of integrated statistical micro data

• Two main stages:

1. data linkage across sources

• negligible error if unique identifiers available

• error-prone if probabilistic/deterministic record linkage

2. micro integration

• in concept and value

• deals with units and variables

• can involve micro- and macro-level constraints
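The contrast between the two linkage situations in stage 1 can be sketched in code. This is a minimal illustration, not the method of any particular statistical office: the field names, the (m, u) agreement probabilities and the threshold are all hypothetical values chosen for the example. Records that carry a unique identifier are linked exactly; the rest are scored with log-likelihood-ratio agreement weights in the style of probabilistic record linkage, which is where linkage error can enter.

```python
from math import log2

# Hypothetical (m, u) probabilities per field, chosen only for illustration:
# m = P(field agrees | same unit), u = P(field agrees | different units).
FIELDS = {
    "surname": (0.95, 0.01),
    "birth_year": (0.98, 0.05),
    "postcode": (0.90, 0.02),
}


def match_weight(a, b):
    """Sum of log-likelihood-ratio agreement weights over the comparison fields."""
    weight = 0.0
    for field, (m, u) in FIELDS.items():
        if a[field] == b[field]:
            weight += log2(m / u)              # agreement adds evidence for a match
        else:
            weight += log2((1 - m) / (1 - u))  # disagreement subtracts evidence
    return weight


def link(register, survey, threshold=5.0):
    """Return (survey id, register id, method) triples for the linked records."""
    by_id = {r["pid"]: r for r in register}
    links = []
    for s in survey:
        # Stage 1a: exact linkage on the unique identifier, when it is present.
        if s.get("pid") in by_id:
            links.append((s["sid"], s["pid"], "identifier"))
            continue
        # Stage 1b: error-prone fallback, best-scoring register record above the threshold.
        best = max(register, key=lambda r: match_weight(r, s))
        if match_weight(best, s) >= threshold:
            links.append((s["sid"], best["pid"], "probabilistic"))
    return links


register = [
    {"pid": 1, "surname": "Andersen", "birth_year": 1980, "postcode": "0150"},
    {"pid": 2, "surname": "Johnson", "birth_year": 1975, "postcode": "0455"},
]
survey = [
    {"sid": "s1", "pid": 1, "surname": "Andersen", "birth_year": 1980, "postcode": "0150"},
    {"sid": "s2", "pid": None, "surname": "Johnson", "birth_year": 1975, "postcode": "0460"},
    {"sid": "s3", "pid": None, "surname": "Petersen", "birth_year": 1990, "postcode": "5003"},
]
links = link(register, survey)
```

In this toy data, s1 is linked through its identifier, s2 is linked on the strength of the agreeing surname and birth year despite a differing postcode, and s3 stays unlinked because no register record scores above the threshold.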

ESTP 2017

Generic Statistical Business Process Model:


Exercise

1. Identify and discuss where in the GSBPM you would do tasks that are connected to combining and integrating data.

2. At what point in the current production process do you work on harmonisation of variables and classifications today?

3. Which point in the production process do you think would be optimal for achieving the greatest harmonisation?

https://statswiki.unece.org/display/GSBPM/Clickable+GSBPM

GSBPM 5.0


Data integration: An overview

Two main production stages within process 5. Process

- up to 5.1 Integrate data: includes linkage across sources

- up to 5.8 Finalise data files

Major states of data

input data → micro data → statistical data → disseminated data

Input and output quality

input quality at end of process 4. Collect

output accuracy at end of process 5. Process


Major states of data

[Different numbers (≥ 4) and names; e.g. “steady states” at CBS]

Data           GSBPM process        Defining characteristic
Input          up to process 4.4    acceptable
Micro          up to process 5.1    linkable across datasets
Statistical    up to process 5.8    statistical accuracy
Disseminated   up to process 7.2    releasable

Illustration: Labour Market Account related statistics

[Figure: sources (Survey A, Register I, Survey B, Register II, Survey C, ...) → WageStat, SickLeaveStat, EmployStat → Labour Market Account]

NB. stovepipes without defined major states that allow integration


[Figure: sources (Survey A, Register I, Survey B, Register II, Survey C, ...) → Activity Register → WageStat, SickLeaveStat, EmployStat, ... → LMA]

NB. Activity Register as major-state micro or statistical data?



Activity Register (AR): Unit = Person × Activity × Duration

NB. Envisaged e.g. in Wallgren & Wallgren (2006). However, it is unlikely for the AR to be a base register like the CPR, BR or IR.

Person   Activity                    Duration                       Outcome   ...
         Business   Type             Start     Finish    Amount
A        UoS        PhD              09/2009   -         100%       Upgrade   ...
A        UoS        Teach. Assist.   11/2010   11/2010   15 hrs     x         ...
...
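The unit definition Person × Activity × Duration can be sketched as a record type. This is only an illustration of the idea: the class name, field names and the reading of "-" in the Finish column as "still ongoing" are assumptions, not part of the slides.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

YearMonth = Tuple[int, int]  # (year, month); hypothetical representation of MM/YYYY


@dataclass(frozen=True)
class ActivitySpell:
    """One Activity Register unit: a person, an activity and its duration."""
    person: str
    business: str
    activity_type: str
    start: YearMonth
    finish: Optional[YearMonth]  # None is ASSUMED to mean "still ongoing"
    amount: str
    outcome: str

    def active_in(self, month: YearMonth) -> bool:
        # Tuples compare element by element, so (2010, 11) <= (2010, 12) holds.
        return self.start <= month and (self.finish is None or month <= self.finish)


spells = [
    ActivitySpell("A", "UoS", "PhD", (2009, 9), None, "100%", "Upgrade"),
    ActivitySpell("A", "UoS", "Teach. Assist.", (2010, 11), (2010, 11), "15 hrs", "x"),
]


def activities(spells, person, month):
    """Activity types of one person that are active in the given month."""
    return [s.activity_type for s in spells if s.person == person and s.active_in(month)]


november = activities(spells, "A", (2010, 11))
december = activities(spells, "A", (2010, 12))
```

Because each row is a spell rather than a person, the same person can contribute several units that overlap in time, which is what allows different statistics to be derived from one register.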

Input quality

Some relevant EU projects

BLUE-ETS, WP4,

ESSnet Admin data for Business Statistics, WP6,

ESSnet Quality of Multisource Statistics, WP1


Input data quality dimensions/indicators (BLUE-ETS)

1. Technical:

• accessibility, file declaration, convertibility, etc.

2. Accuracy:

• authentic, inconsistent, dubious objects

• measurement error, inconsistent, dubious values

3. Completeness:

• under-/over-coverage, selectivity of objects

• missing or imputed values

4. Time-related:

• timeliness, punctuality, overall time lag, delay

• dynamics of objects

• stability of variables/measurements

5. Integrability:

• comparability, alignment of objects

• linking (key) variables

• comparability/proximity of variables

Output accuracy: Validity and equivalence (Zhang, 2012)

Illustration of statistical equivalence

Binary data

                          True   Proxy
Andersen                  0      1     1     0     0     0     1     1
Johnson                   1      0     1     0     0     1     0     1
Petersen                  1      1     0     0     1     0     0     1
Measurement error                2     2     2     1     1     3     1
Statistical equivalence          Yes   Yes   No    No    No    No    No

Relative equivalence ∼ ∼
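The first two summary rows of the illustration can be recomputed from the binary data. The sketch below assumes that "measurement error" counts the persons whose proxy value differs from the true value, and that "statistical equivalence" means the proxy reproduces the true total; both readings are inferred from the table's pattern rather than stated on the slide. The relative-equivalence row is not reproduced.

```python
# True values and the seven proxy columns, in the order Andersen, Johnson, Petersen.
true = (0, 1, 1)
proxies = [
    (1, 0, 1),
    (1, 1, 0),
    (0, 0, 0),
    (0, 0, 1),
    (0, 1, 0),
    (1, 0, 0),
    (1, 1, 1),
]


def measurement_error(true, proxy):
    """Number of units whose proxy value differs from the true value."""
    return sum(t != p for t, p in zip(true, proxy))


def statistically_equivalent(true, proxy):
    """ASSUMED criterion: the proxy yields the same population total as the true data."""
    return sum(true) == sum(proxy)


errors = [measurement_error(true, p) for p in proxies]
equivalent = [statistically_equivalent(true, p) for p in proxies]
```

The first two proxies each misclassify two of the three persons, yet they give the correct total. So a proxy can be invalid at the unit level and still be adequate for the statistic of interest.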

Some basic data configurations (ESSnet, 2016)

Relevant aspects to breakdown of data configurations:

• Aggregation level: micro, macro, or mixed

• Unit: with or without overlap between datasets

• Variables: with or without overlap (or proxy)

• Coverage: presence of over-/under-coverage

• Time: cross-sectional vs. longitudinal

• Population: known from available frame or not

• Design: census, probability sampling or observational

Relevant types of statistical output:

• population registers

• statistics (macro-level)

• micro datasets

• metadata

Relevant distinction for statistics:

• descriptive: such as totals or means, etc.

• analytic: price indices, regression coefficients, etc.


Configuration 1: Ideal complementary datasets


Example: BR, survey of largest units, register data of the rest

Configuration 2: Overlapping instead of complementary

Example: short register history and overlapping sample survey data

Configuration 2S (“traditional” survey sampling?)


Example: complete register and complementary sample survey data

Configuration 3 (“traditional” multi-frame sampling?)

Example: population census and post-enumeration survey
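For a census combined with a post-enumeration survey (PES), a common way to estimate the population size is the dual-system (Lincoln-Petersen) estimator. The slides do not name an estimator, so the sketch below is one standard choice rather than the method intended here. It assumes that the two enumerations are independent, that records are matched without error, and that the population is closed between the two. The counts are invented for illustration.

```python
def dual_system_estimate(n_census, n_pes, n_matched):
    """Lincoln-Petersen estimate of population size from two overlapping lists.

    n_census  -- persons counted in the census
    n_pes     -- persons counted in the post-enumeration survey
    n_matched -- persons found in both (after linkage)
    """
    if n_matched <= 0:
        raise ValueError("at least one matched person is needed")
    return n_census * n_pes / n_matched


# Invented counts: 900 in the census, 100 in the PES area sample, 90 found in both.
n_hat = dual_system_estimate(900, 100, 90)

# Implied census under-coverage rate under the stated assumptions.
undercoverage = 1 - 900 / n_hat
```

The idea is that the share of PES persons also found in the census (90 of 100) estimates the census coverage rate, so the census count is scaled up accordingly. Linkage errors between the two lists bias this estimate directly, which ties it back to the data linkage stage.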

Configuration 4 (e.g. repeated weighting)

Configuration 5 (Statistical matching)


Configuration 6 (Time series)


Example: register and survey time series of different frequencies

Exercise

Study the different configurations suggested by the ESSnet project «Quality of Multisource Statistics».

Try to interpret the described situations and assess them from your professional experience: Do you find the configurations exhaustive?

Group discussion

References

[1] ESSnet (2016). Measuring the Quality of the Output of Multisource Statistics: A Breakdown of Basic Data Configurations. WP3 working document.

[2] BLUE-ETS (2011). Deliverable 4.1. List of quality groups and indicators identified for administrative data sources.

[3] De Waal, T. (2016). Obtaining numerically consistent estimates from a mix of administrative data and surveys. Statistical Journal of the IAOS, vol. 32, pp. 231-243.

[4] Di Zio, M., Zhang, L.-C. and De Waal, T. (2017). Statistical methods for combining multiple sources of administrative and survey data. The Survey Statistician, vol. 76, pp. 17-26.

[5] Fienberg, S.E., Makov, U.E. and Steele, R.J. (1998). Disclosure limitation control using perturbation and related methods for categorical data. Journal of Official Statistics, vol. 14, pp. 485-502.

[6] Groves, R.M., Fowler Jr., F.J., Couper, M., Lepkowski, J.M., Singer, E. and Tourangeau, R. (2004). Survey Methodology. New York: Wiley.

[7] Rubin, D. (1993). Discussion, statistical disclosure limitation. Journal of Official Statistics, vol. 9, pp. 461-468.

[8] Zhang, L.-C. (2012). Topics of statistical theory for register-based statistics and data integration. Statistica Neerlandica, vol. 66, pp. 41-63.

[9] Wallgren, A. and Wallgren, B. (2014). Register-based Statistics - Administrative Data for Statistical Purposes. 2nd ed. John Wiley & Sons, Ltd.
