Introduction to data integration


Transcript of Introduction to data integration

Page 1: Introduction to data integration

ENP 2018

Georgia

Introduction to data integration

Page 2: Introduction to data integration

Outline

• Combining multiple data sources (various statistical tasks and methods)
• Two main stages of data integration
• Steady/major states of data, input and output quality
• Statistical validity and equivalence of proxy data
• Basic data configurations

Page 3: Introduction to data integration

Statistical tasks as building blocks


Micro dataset:

I. Data editing and imputation
II. Creation of joint micro dataset
    II.a. Data linkage
    II.b. Micro-level statistical matching
III. Alignment of statistical micro data
    III.a. Units
    III.b. Measurements

Aggregated level:

IV. Multisource estimation
    IV.a. Population size estimation
    IV.b. Univalent estimation
    IV.c. Coherent estimation

(Di Zio et al. 2017)

Page 4: Introduction to data integration

Example: Norwegian integrated employment statistics

Page 5: Introduction to data integration

Production of integrated statistical micro data

• Two main stages:

  1. data linkage across sources
     • negligible error if unique identifiers are available
     • error-prone if probabilistic/deterministic record linkage is used

  2. micro integration
     • in concept and value
     • deals with units and variables
     • can involve micro- and macro-level constraints
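
The contrast between the two linkage situations in stage 1 can be sketched in Python. This is a minimal illustration under stated assumptions, not a production method: the records, the field names (`id`, `name`, `birth_year`), both helper functions and the 0.85 threshold are hypothetical. A plain string-similarity threshold from the standard library stands in for a proper probabilistic linkage model, which would estimate match weights rather than use a fixed cut-off.

```python
from difflib import SequenceMatcher

# Hypothetical records from two sources (illustration only).
register = [
    {"id": "001", "name": "Nino Beridze", "birth_year": 1980},
    {"id": "002", "name": "Giorgi Kapanadze", "birth_year": 1975},
]
survey = [
    {"id": "001", "name": "Nino Beridse", "birth_year": 1980},
    {"id": "003", "name": "Tamar Lomidze", "birth_year": 1975},
]


def link_by_id(left, right, key="id"):
    """Exact linkage on a unique identifier shared by both sources."""
    index = {rec[key]: rec for rec in right}
    return [(rec, index[rec[key]]) for rec in left if rec[key] in index]


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_by_similarity(left, right, threshold=0.85):
    """Linkage without identifiers: block on birth year, compare names.

    The threshold is an assumption; a value set too low creates false
    links, a value set too high creates missed links.
    """
    pairs = []
    for rec in left:
        candidates = [r for r in right if r["birth_year"] == rec["birth_year"]]
        if not candidates:
            continue
        best = max(candidates, key=lambda r: similarity(rec["name"], r["name"]))
        if similarity(rec["name"], best["name"]) >= threshold:
            pairs.append((rec, best))
    return pairs


exact_links = link_by_id(register, survey)
fuzzy_links = link_by_similarity(register, survey)
print(len(exact_links), len(fuzzy_links))
```

With identifiers the link is a simple lookup. Without them, every accepted pair depends on the chosen comparison rule and threshold, which is where linkage error enters.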


Page 6: Introduction to data integration

Generic Statistical Business Process Model:


Page 7: Introduction to data integration

Exercise

1. Identify and discuss where in the GSBPM you would carry out tasks connected to combining and integrating data.

2. Where in your current production process do you work on harmonisation of variables and classifications?

3. At which point in the production process do you think harmonisation would be most effective?

https://statswiki.unece.org/display/GSBPM/Clickable+GSBPM

GSBPM 5.0


Page 8: Introduction to data integration

Data integration: An overview

Two main production stages within process 5. Process:

- up to 5.1 Integrate data, which includes linkage across sources
- up to 5.8 Finalise data files

Major states of data

input data → micro data → statistical data → disseminated data

Input and output quality

input quality at end of process 4. Collect

output accuracy at end of process 5. Process


Page 9: Introduction to data integration

Major states of data

[Different numbers (≥ 4) and names; e.g. “steady states” at CBS]

Data          GSBPM process       Defining characteristic
Input         up to process 4.4   acceptable
Micro         up to process 5.1   linkable across datasets
Statistical   up to process 5.8   statistical accuracy
Disseminated  up to process 7.2   releasable

Page 10: Introduction to data integration

Illustration: Labour Market Account related statistics

[Diagram: input sources Survey A, Register I, Survey B, Register II, Survey C, ...; statistics WageStat, SickLeaveStat, EmployStat; Labour Market Account]

NB. Stovepipes without defined major states that allow integration

Page 11: Introduction to data integration

Illustration: Labour Market Account related statistics

[Diagram: input sources Survey A, Register I, Survey B, Register II, Survey C, ...; Activity Register; statistics WageStat, SickLeaveStat, EmployStat, ...; Labour Market Account (LMA)]

NB. Activity Register as major-state micro or statistical data?

Page 12: Introduction to data integration

Illustration: Labour Market Account related statistics

Activity Register (AR): Unit = Person × Activity × Duration

NB. Envisaged e.g. in Wallgren & Wallgren (2006). However, the AR is unlikely to be a base register in the way that the CPR, BR or IR are.

Person  Activity                  Duration                   Outcome  ...
        Business  Type            Start    Finish   Amount
...
A       UoS       PhD             09/2009  -        100%     Upgrade  ...
A       UoS       Teach. Assist.  11/2010  11/2010  15 hrs   x        ...
...

Page 13: Introduction to data integration

Input quality

Some relevant EU projects:

• BLUE-ETS, WP4
• ESSnet Admin data for Business Statistics, WP6
• ESSnet Quality of Multisource Statistics, WP1

Page 14: Introduction to data integration

Input data quality dimensions/indicators (BLUE-ETS)

1. Technical:
   • accessibility, file declaration, convertibility, etc.

2. Accuracy:
   • authenticity; inconsistent or dubious objects
   • measurement error; inconsistent or dubious values

3. Completeness:
   • under-/over-coverage, selectivity of objects
   • missing or imputed values
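
Two of the completeness indicators can be computed with a few lines of Python. This is a minimal sketch: the function names, the example records and the `turnover` variable are hypothetical, and the rates use simple textbook definitions (share of frame units missing from the source, share of source units outside the frame, share of missing values), which may differ from the exact BLUE-ETS indicator definitions. Both inputs are assumed to be non-empty.

```python
def coverage_rates(source_ids, frame_ids):
    """Under- and over-coverage of a source relative to a reference frame."""
    source, frame = set(source_ids), set(frame_ids)
    return {
        # frame units that the source fails to cover
        "under_coverage": len(frame - source) / len(frame),
        # source units that do not belong to the frame
        "over_coverage": len(source - frame) / len(source),
    }


def missing_rate(records, variable):
    """Share of records in which the variable is missing (None)."""
    missing = sum(1 for rec in records if rec.get(variable) is None)
    return missing / len(records)


# Hypothetical frame and administrative source (illustration only).
frame_ids = [1, 2, 3, 4]
source_records = [
    {"id": 2, "turnover": 120.0},
    {"id": 3, "turnover": None},
    {"id": 4, "turnover": 75.5},
    {"id": 5, "turnover": None},
    {"id": 6, "turnover": 40.0},
]

rates = coverage_rates([rec["id"] for rec in source_records], frame_ids)
turnover_missing = missing_rate(source_records, "turnover")
print(rates, turnover_missing)
```

In this example one of four frame units is absent from the source, two of five source units fall outside the frame, and two of five turnover values are missing.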

Page 15: Introduction to data integration

Input data quality dimensions/indicators (BLUE-ETS)

4. Time-related:
   • timeliness, punctuality, overall time lag, delay
   • dynamics of objects
   • stability of variables/measurements

5. Integrability:
   • comparability, alignment of objects
   • linking (key) variables
   • comparability/proximity of variables

Page 16: Introduction to data integration

Output accuracy: Validity and equivalence (Zhang, 2012)

Page 17: Introduction to data integration

Illustration of statistical equivalence

Binary data

                          True   Proxy
Andersen                   0      1    1    0    0    0    1    1
Johnson                    1      0    1    0    0    1    0    1
Petersen                   1      1    0    0    1    0    0    1
Measurement error                 2    2    2    1    1    3    1
Statistical equivalence          Yes  Yes  No   No   No   No   No

Relative equivalence ∼ ∼
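
The two computed rows of the illustration can be reproduced with a short Python sketch. It rests on an assumed reading of the table: measurement error is the number of persons whose proxy value differs from the true value, and a proxy is statistically equivalent when it yields the same total as the true data. See Zhang (2012) for the formal definitions. The variable and function names are hypothetical.

```python
# True values and the seven proxy columns from the illustration.
true_values = {"Andersen": 0, "Johnson": 1, "Petersen": 1}
proxies = [
    {"Andersen": 1, "Johnson": 0, "Petersen": 1},
    {"Andersen": 1, "Johnson": 1, "Petersen": 0},
    {"Andersen": 0, "Johnson": 0, "Petersen": 0},
    {"Andersen": 0, "Johnson": 0, "Petersen": 1},
    {"Andersen": 0, "Johnson": 1, "Petersen": 0},
    {"Andersen": 1, "Johnson": 0, "Petersen": 0},
    {"Andersen": 1, "Johnson": 1, "Petersen": 1},
]


def measurement_error(true, proxy):
    """Number of units whose proxy value differs from the true value."""
    return sum(true[unit] != proxy[unit] for unit in true)


def statistically_equivalent(true, proxy):
    """Assumed reading: the proxy reproduces the true total."""
    return sum(proxy.values()) == sum(true.values())


errors = [measurement_error(true_values, p) for p in proxies]
equivalent = [statistically_equivalent(true_values, p) for p in proxies]
print(errors)
print(equivalent)
```

The first two proxies show that unit-level measurement error and aggregate-level equivalence are separate properties: both misclassify two of the three persons, yet both reproduce the true total.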

Page 18: Introduction to data integration

Some basic data configurations (ESSnet, 2016)

Relevant aspects for the breakdown of data configurations:

• Aggregation level: micro, macro, or mixed
• Unit: with or without overlap between datasets
• Variables: with or without overlap (or proxy)
• Coverage: presence of over-/under-coverage
• Time: cross-sectional vs. longitudinal
• Population: known from available frame or not
• Design: census, probability sampling or observational

Page 19: Introduction to data integration

Some basic data configurations (ESSnet, 2016)

Relevant types of statistical output:

• population registers
• statistics (macro-level)
• micro datasets
• metadata

Relevant distinction for statistics:

• descriptive: such as totals or means, etc.
• analytic: price indices, regression coefficients, etc.

Page 20: Introduction to data integration

Configuration 1: Ideal complementary datasets


Example: BR, survey of largest units, register data of the rest

Page 21: Introduction to data integration

Configuration 2: Overlapping instead of complementary

Example: short register history and overlapping sample survey data

Page 22: Introduction to data integration

Configuration 2S (“traditional” survey sampling?)


Example: complete register and complementary sample survey data

Page 23: Introduction to data integration

Configuration 3 (“traditional” multi-frame sampling?)

Example: population census and post-enumeration survey
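
For the census and post-enumeration survey (PES) example, one common textbook approach to population size estimation is the dual-system (capture-recapture) estimator. The slide does not prescribe it, so the sketch below is an assumption for illustration. The function name and the counts are hypothetical, and the estimator itself assumes independent enumerations, error-free matching and no over-coverage in either list.

```python
def dual_system_estimate(census_count, pes_count, matched_count):
    """Dual-system estimate of population size.

    census_count: persons enumerated in the census (within the PES areas)
    pes_count: persons enumerated in the post-enumeration survey
    matched_count: persons found in both enumerations
    """
    if matched_count <= 0:
        raise ValueError("at least one matched person is required")
    if matched_count > min(census_count, pes_count):
        raise ValueError("matches cannot exceed either enumeration")
    return census_count * pes_count / matched_count


# Hypothetical counts (illustration only).
estimate = dual_system_estimate(census_count=900, pes_count=100, matched_count=90)
census_coverage = 900 / estimate
print(estimate, census_coverage)
```

Here 90 of the 100 PES persons were also found in the census, so the census coverage is estimated at 90% and the population at 1000. Linkage errors between the two lists feed directly into `matched_count`, which is why linkage quality matters for this configuration.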

Page 24: Introduction to data integration

Configuration 4 (e.g. repeated weighting)

Page 25: Introduction to data integration

Configuration 5 (Statistical matching)


Page 26: Introduction to data integration

Configuration 6 (Time series)


Example: register and survey time series of different frequencies

Page 27: Introduction to data integration

Exercise

Study the different configurations suggested by the ESSnet project «Quality of Multisource Statistics».

Try to interpret the described situations and assess them from your professional experience: do you find the configurations exhaustive?

Group discussion

Page 28: Introduction to data integration

References

[1] ESSnet (2016). Measuring the Quality of the Output of Multisource Statistics: A Breakdown of Basic Data Configurations. WP3 working document.

[2] BLUE-ETS (2011). Deliverable 4.1. List of quality groups and indicators identified for administrative data sources.

[3] De Waal, T. (2016). Obtaining numerically consistent estimates from a mix of administrative data and surveys. Statistical Journal of the IAOS, 32, 231-243.

[4] Di Zio, M., Zhang, L.-C. and De Waal, T. (2017). Statistical methods for combining multiple sources of administrative and survey data. The Survey Statistician, 76, 17-26.

[5] Fienberg, S.E., Makov, U.E. and Steele, R.J. (1998). Disclosure limitation control using perturbation and related methods for categorical data. Journal of Official Statistics, 14, 485-502.

[6] Groves, R.M., Fowler Jr., F.J., Couper, M., Lepkowski, J.M., Singer, E. and Tourangeau, R. (2004). Survey Methodology. New York: Wiley.

[7] Rubin, D. (1993). Discussion, statistical disclosure limitation. Journal of Official Statistics, 9, 461-468.

[8] Zhang, L.-C. (2012). Topics of statistical theory for register-based statistics and data integration. Statistica Neerlandica, 66, 41-63.

[9] Wallgren, A. and Wallgren, B. (2014). Register-based Statistics - Administrative Data for Statistical Purposes. 2nd ed. John Wiley & Sons, Ltd.