Introduction to data integration
ENP 2018
Georgia
Outline
• Combining multiple data sources (various statistical tasks and methods)
• Two main stages of data integration
• Steady/major states of data, input and output quality
• Statistical validity and equivalence of proxy data
• Basic data configurations
Statistical tasks as ‘building blocks’
Micro dataset:
I. Data editing and imputation
II. Creation of joint micro dataset
  II.a. Data linkage
  II.b. Micro-level statistical matching
III. Alignment of statistical micro data
  III.a. Units
  III.b. Measurements
Aggregated level:
IV. Multisource estimation
  IV.a. Population size estimation
  IV.b. Univalent estimation
  IV.c. Coherent estimation
(Di Zio et al. 2017)
Example: Norwegian integrated employment statistics
Production of integrated statistical micro data
• Two main stages:
1. data linkage across sources
• negligible error if unique identifiers available
• error-prone if probabilistic/deterministic record linkage
2. micro integration
• in concept and value
• deals with units and variables
• can involve micro- and macro-level constraints
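The two stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the identifiers, the variables (`employed`, `hours`) and the precedence rule (register value preferred over survey value) are hypothetical and not taken from any production system.

```python
# Hypothetical toy sources, both keyed by a unique person identifier.
register = {
    "p1": {"employed": 1},
    "p2": {"employed": 0},
    "p3": {"employed": 1},
}
survey = {
    "p2": {"employed": 1, "hours": 20},
    "p3": {"employed": 1, "hours": 37},
    "p4": {"employed": 0, "hours": 0},
}


def link(left, right):
    """Stage 1: deterministic linkage on a unique identifier."""
    linked = {}
    for key in sorted(left.keys() | right.keys()):
        linked[key] = {"left": left.get(key), "right": right.get(key)}
    return linked


def integrate(pair):
    """Stage 2: micro integration of one linked unit.

    Assumed rule: the register value of `employed` takes precedence; the
    survey value is used only when the unit is absent from the register.
    A micro-level constraint is then enforced: non-employed units get 0 hours.
    """
    reg, svy = pair["left"], pair["right"]
    employed = reg["employed"] if reg is not None else svy["employed"]
    hours = svy["hours"] if svy is not None else None
    if employed == 0:
        hours = 0
    return {"employed": employed, "hours": hours}


linked = link(register, survey)
micro = {key: integrate(pair) for key, pair in linked.items()}
```

With a unique identifier the linkage step is a plain key match; the conflict for unit `p2` (register says not employed, survey reports 20 hours) is resolved in the integration step, which is where the units and variables are reconciled.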
5 ESTP 2017
Generic Statistical Business Process Model:
Exercise
1. Identify and discuss where in the GSBPM you would carry out tasks connected to combining and integrating data.
2. At which point in your current production process do you work on harmonising variables and classifications?
3. At which point in the production process do you think harmonisation would be most effective?
https://statswiki.unece.org/display/GSBPM/Clickable+GSBPM
GSBPM 5.0
Data integration: An overview
Two main production stages within process 5. Process
- up to 5.1 Integrate data: includes linkage across sources
- up to 5.8 Finalise data files
Major states of data
input data → micro data → statistical data → disseminated data
Input and output quality
input quality at end of process 4. Collect
output accuracy at end of process 5. Process
Major states of data
[Different numbers (≥ 4) and names; e.g. “steady states” at CBS]
Data          GSBPM process       Defining characteristic
Input         up to process 4.4   acceptable
Micro         up to process 5.1   linkable across datasets
Statistical   up to process 5.8   statistical accuracy
Disseminated  up to process 7.2   releasable
Illustration: Labour Market Account related statistics
[Diagram: input sources (..., Survey A, Register I, Survey B, Register II, Survey C, ...) feed separate statistics (WageStat, SickLeaveStat, EmployStat), which feed the Labour Market Account]
NB. stovepipes without defined major states that allow integration
Illustration: Labour Market Account related statistics
[Diagram: input sources (..., Survey A, Register I, Survey B, Register II, Survey C, ...) feed an Activity Register, from which WageStat, SickLeaveStat, EmployStat and the Labour Market Account (LMA) are derived]
NB. Activity Register as major-state micro or statistical data?
Illustration: Labour Market Account related statistics
Activity Register (AR): Unit = Person × Activity × Duration
NB. Envisaged e.g. in Wallgren & Wallgren (2006). However, the AR is unlikely to be a base register in the way that the CPR, BR or IR are.
Person  Activity                   Duration                    Outcome  ...
        Business  Type             Start    Finish   Amount
A       UoS       PhD              09/2009  -        100%      Upgrade  ...
A       UoS       Teach. Assist.   11/2010  11/2010  15 hrs    x        ...
...
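As a data structure, an AR unit could look roughly like the sketch below. The class and field names are assumptions chosen to mirror the column headings on the slide, and reading a finish of "-" as "still ongoing" is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ActivityUnit:
    """One AR unit: a person in an activity over a duration (names assumed)."""
    person: str
    business: str
    activity_type: str
    start: str               # "MM/YYYY", as written on the slide
    finish: Optional[str]    # None is assumed to mean the activity is ongoing
    amount: str
    outcome: Optional[str] = None


# The two example rows from the slide.
activity_register = [
    ActivityUnit("A", "UoS", "PhD", "09/2009", None, "100%", "Upgrade"),
    ActivityUnit("A", "UoS", "Teach. Assist.", "11/2010", "11/2010", "15 hrs"),
]

# The same person appears once per activity spell, so person-level
# statistics require grouping the units by person first.
units_per_person = {}
for unit in activity_register:
    units_per_person.setdefault(unit.person, []).append(unit)
```

Because the unit is the combination of person, activity and duration, the register holds two records for person A rather than one.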
Input quality
Some relevant EU projects
BLUE-ETS, WP4,
ESSnet Admin data for Business Statistics, WP6,
ESSnet Quality of Multisource Statistics, WP1
Input data quality dimensions/indicators (BLUE-ETS)
1. Technical:
• accessibility, file declaration, convertibility, etc.
2. Accuracy:
• objects: authenticity, inconsistent objects, dubious objects
• values: measurement error, inconsistent values, dubious values
3. Completeness:
• under-/over-coverage, selectivity of objects
• missing or imputed values
4. Time-related:
• timeliness, punctuality, overall time lag, delay
• dynamics of objects
• stability of variables/measurements
5. Integrability:
• comparability, alignment of objects
• linking (key) variables
• comparability/proximity of variables
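Some of the completeness indicators can be computed directly once a reference frame is available. The sketch below uses deliberately simple definitions (shares of units and of records); these are assumptions for illustration and not the formal BLUE-ETS indicator definitions, and all identifiers and values are made up.

```python
def coverage_indicators(source_ids, frame_ids):
    """Under- and over-coverage of a source relative to a reference frame."""
    source, frame = set(source_ids), set(frame_ids)
    return {
        # frame units that the source fails to cover
        "undercoverage": len(frame - source) / len(frame),
        # source units that do not belong to the frame
        "overcoverage": len(source - frame) / len(source),
    }


def missing_rate(records, variable):
    """Share of records with no value for `variable`."""
    missing = sum(1 for rec in records if rec.get(variable) is None)
    return missing / len(records)


# Hypothetical frame and administrative source.
frame_ids = ["u1", "u2", "u3", "u4"]
records = [
    {"id": "u1", "turnover": 10.0},
    {"id": "u2", "turnover": None},
    {"id": "u5", "turnover": 7.5},
    {"id": "u3"},
]

cov = coverage_indicators([rec["id"] for rec in records], frame_ids)
miss = missing_rate(records, "turnover")
```

Here `u4` is in the frame but not in the source (under-coverage), `u5` is in the source but not in the frame (over-coverage), and two of the four records carry no turnover value.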
Output accuracy: Validity and equivalence (Zhang, 2012)
Illustration of statistical equivalence
Binary data              True   Proxy
Andersen                 0      1    1    0    0    0    1    1
Johnson                  1      0    1    0    0    1    0    1
Petersen                 1      1    0    0    1    0    0    1
Measurement error               2    2    2    1    1    3    1
Statistical equivalence         Yes  Yes  No   No   No   No   No
Relative equivalence  ∼ ∼
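The two computed rows of the table can be reproduced as follows. The sketch assumes that "measurement error" counts the units whose proxy value differs from the true value, and that a proxy is "statistically equivalent" when it reproduces the true total; both readings are consistent with the figures in the table. The relative-equivalence row is not covered here.

```python
# True values and the seven proxy columns, row by row, as in the table.
true = {"Andersen": 0, "Johnson": 1, "Petersen": 1}
proxies = {
    "Andersen": [1, 1, 0, 0, 0, 1, 1],
    "Johnson":  [0, 1, 0, 0, 1, 0, 1],
    "Petersen": [1, 0, 0, 1, 0, 0, 1],
}
n_proxies = 7

# Unit-level measurement error: number of persons whose proxy value is wrong.
errors = [
    sum(proxies[person][j] != true[person] for person in true)
    for j in range(n_proxies)
]

# Assumed reading of statistical equivalence: the proxy total equals the true total.
true_total = sum(true.values())
equivalent = [
    sum(proxies[person][j] for person in true) == true_total
    for j in range(n_proxies)
]
```

The first two proxies each misclassify two of the three persons, yet they give the same total as the true data. Unit-level measurement error and validity of the aggregate are therefore separate questions.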
Some basic data configurations (ESSnet, 2016)
Relevant aspects for the breakdown of data configurations:
• Aggregation level: micro, macro, or mixed
• Unit: with or without overlap between datasets
• Variables: with or without overlap (or proxy)
• Coverage: presence of over-/under-coverage
• Time: cross-sectional vs. longitudinal
• Population: known from available frame or not
• Design: census, probability sampling or observational
Relevant types of statistical output:
• population registers
• statistics (macro-level)
• micro datasets
• metadata
Relevant distinction for statistics:
• descriptive: such as totals or means, etc.
• analytic: price indices, regression coefficients, etc.
Configuration 1: Ideal complementary datasets
Example: BR, survey of largest units, register data of the rest
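For a descriptive statistic such as a total, the complementary case is the simplest one: each frame unit is observed in exactly one source, so the sources can be added up. The sketch below uses made-up unit identifiers and turnover values, and it assumes both sources are error-free and measure the same concept.

```python
# Hypothetical business register (frame) with a size class per unit.
business_register = {"b1": "large", "b2": "large", "b3": "small", "b4": "small"}

survey_turnover = {"b1": 900.0, "b2": 650.0}    # survey of the largest units
register_turnover = {"b3": 40.0, "b4": 55.0}    # register data for the rest


def complementary_total(frame, *sources):
    """Total over sources that are assumed to partition the frame."""
    seen = set()
    total = 0.0
    for source in sources:
        overlap = seen & source.keys()
        if overlap:
            raise ValueError(f"sources overlap on units: {sorted(overlap)}")
        seen |= source.keys()
        total += sum(source.values())
    if seen != set(frame):
        raise ValueError("sources do not cover the frame exactly")
    return total


total = complementary_total(business_register, survey_turnover, register_turnover)
```

The two checks encode what makes the configuration "ideal": no overlap between the datasets and no under- or over-coverage relative to the frame. When either check fails, the situation belongs to one of the other configurations.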
Configuration 2: Overlapping instead of complementary
Example: short register history and overlapping sample survey data
Configuration 2S (“traditional” survey sampling?)
Example: complete register and complementary sample survey data
Configuration 3 (“traditional” multi-frame sampling?)
Example: population census and post-enumeration survey
Configuration 4 (e.g. repeated weighting)
Configuration 5 (Statistical matching)
Configuration 6 (Time series)
Example: register and survey time series of different frequencies
Exercise
Study the different configurations suggested by the ESSnet project «Quality of Multisource Statistics».
Try to interpret the situations described and assess them against your professional experience: do you find the configurations exhaustive?
Group discussion
References
[1] ESSnet (2016). Measuring the Quality of the Output of Multisource Statistics: A Breakdown of Basic Data Configurations. WP3 working document.
[2] BLUE-ETS (2011). Deliverable 4.1. List of quality groups and indicators identified for administrative data sources.
[3] De Waal, T. (2016). Obtaining numerically consistent estimates from a mix of administrative data and surveys. Statistical Journal of the IAOS, 32, 231-243.
[4] Di Zio, M., Zhang, L.-C. and De Waal, T. (2017). Statistical methods for combining multiple sources of administrative and survey data. The Survey Statistician, 76, 17-26.
[5] Fienberg, S.E., Makov, U.E. and Steele, R.J. (1998). Disclosure limitation control using perturbation and related methods for categorical data. Journal of Official Statistics, vol. 14, pp. 485-502.
[6] Groves, R.M., Fowler Jr., F.J., Couper, M., Lepkowski, J.M., Singer, E. and Tourangeau, R. (2004). Survey Methodology. New York: Wiley.
[7] Rubin, D. (1993). Discussion, statistical disclosure limitation. Journal of Official Statistics, vol. 9, pp. 461-468.
[8] Zhang, L.-C. (2012). Topics of statistical theory for register-based statistics and data integration. Statistica Neerlandica, vol. 66, pp. 41-63.
[9] Wallgren, A. and Wallgren, B. (2014). Register-based Statistics - Administrative Data for Statistical Purposes. 2nd ed. John Wiley & Sons, Ltd.