Model of transformation administrative data to statistical data
Data used in Population and Housing Census 2011
– examples
Janusz Dygaszewicz and Paweł Murawski
Central Statistical Office
POLAND
Outline
1. Purpose of the work on administrative sources
2. Data quality
3. Extract data
4. Transform data
5. Summary
Registers - data acquisition
Data Owners:
• Ministry of Finance,
• Ministry of Interior and Administration,
• Ministry of Justice,
• Agricultural Social Insurance Fund,
• National Health Fund,
• Agency for Restructuring and Modernisation of Agriculture,
• Agricultural and Food Quality Inspection,
• Agency for Geodesy and Cartography,
• State Fund for Rehabilitation of Disabled Persons,
• County Offices,
• Commune Offices,
• Regional Offices,
• Telcoms,
• Energy Suppliers,
• Office For Foreigners,
• Social Insurance Institution,
• Housing Managers.
Purpose of the work on administrative data
Obtaining a sufficiently complete data set (subjective and objective completeness) that corresponds to classification standards, definitions and basic categories, and thus allows the effective use of administrative data
Data quality - measures -
1. Measuring the quality of administrative registers
– timeliness of data
– methodological compatibility
– completeness
– identification standards used in the registry
– usefulness
– compatibility of data in administrative sources with data obtained in the study/survey
2. Measuring the quality in processing of data registers
– excessive coverage error rate
– incomplete coverage error rate
– subjective indicator of completeness
– objective indicator of completeness
– imputation rate
– data correction index
– index of integration of data from various sources
Extract data
consolidation of data from various source systems with different data formats,
extraction of data into the production environment based on the SAS software,
conversion of data into one format that is suitable for processing – SAS tables,
validation of the imported data structure is an integral part of this process.
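The extraction steps above can be sketched roughly as follows. The original work used SAS; this is only an illustrative Python stand-in. The column names in `EXPECTED_COLUMNS`, the `;` delimiter and the `<record>` XML layout are assumptions made for the example, not details taken from the registers.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical target structure: every source must deliver these columns.
EXPECTED_COLUMNS = ["pesel", "name", "surname"]

def read_txt(text, delimiter=";"):
    """Read a delimited text extract into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

def read_xml(text):
    """Read an XML extract in which each <record> element holds one row."""
    root = ET.fromstring(text)
    return [{field.tag: (field.text or "") for field in record}
            for record in root.iter("record")]

def validate_structure(rows, expected=EXPECTED_COLUMNS):
    """Return the indexes of rows whose columns differ from the expected set."""
    return [i for i, row in enumerate(rows) if set(row) != set(expected)]

# Two sources in different formats, converted to one common row format.
txt_source = "pesel;name;surname\n00000000001;JAN;KOWALSKI\n"
xml_source = ("<register><record><pesel>00000000002</pesel>"
              "<name>ANNA</name><surname>MALINOWSKA</surname></record></register>")

rows = read_txt(txt_source) + read_xml(xml_source)
bad_rows = validate_structure(rows)
```

Keeping the structure check next to the import means that a source delivering unexpected columns is caught before any transformation starts.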
Extract data - examples -
Register/System | Name | Central Y/N | Relational Y/N | Data format
PESEL | General electronic system of Population Register | Y | N | TXT
KEP | Register of the National Taxpayers | Y | N | TXT
GZM | Community Registers of Residence | N | Y | SQL Server
PIT | Personal income tax register | Y | N | TXT
SI MS | Ministry of Justice system | Y | N | XLS
POBYT | Foreigners evidence system | Y | Y | SQL Server
ZUS CRPS | Central Register of Contribution Payers | Y | N | TXT
ZUS CRU | Central Register of Insured Persons | Y | N | TXT
ZUS SER | Pension insurance system | Y | N | TXT
KRUS | Agricultural Social Insurance Fund System | Y | N | XML
CWU NFZ | Central Register of Insured | Y | N | TXT
PFRON | State Fund for Rehabilitation of Disabled Persons | Y | N | TXT
ARiMR | Agency for Restructuring and Modernisation of Agriculture system | Y | Y | XLS
EPN | Property Tax Records | N | Y | SQL Server
Transform data
Data processing in the production environment consisting of:
• profiling – create a report on the data quality,
• unification/standardization of data,
• parsing (separation) or combining variables,
• standardization with schemes,
• conversion,
• validation,
• deduplication,
• data integration.
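As an illustration of the first step in the list above, profiling can be as simple as counting, for every variable, how many values are present, missing and distinct. The sketch below is a minimal assumption of what such a report might contain; the function name `profile` and the sample rows are hypothetical.

```python
def profile(rows):
    """Count total, missing and distinct values for every variable."""
    report = {}
    for row in rows:
        for variable, value in row.items():
            stats = report.setdefault(
                variable, {"total": 0, "missing": 0, "values": set()})
            stats["total"] += 1
            if value is None or str(value).strip() == "":
                stats["missing"] += 1
            else:
                stats["values"].add(value)
    # Replace the collected value sets with their sizes.
    return {variable: {"total": s["total"], "missing": s["missing"],
                       "distinct": len(s["values"])}
            for variable, s in report.items()}

# Hypothetical sample rows.
sample = [
    {"city": "WARSZAWA", "street": "DŁUGA"},
    {"city": "WARSZAWA", "street": ""},
    {"city": "", "street": "ANDERSA"},
]
report = profile(sample)
```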
Transform data - standardization and parsing examples -
Incorrect data format | Format after standardization
1985-02-21 | 19850221
1985.02.21 | 19850221
1985 02 21 | 19850221
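The date standardization shown in these rows can be sketched as below. The function name `standardize_date` is hypothetical, and the approach (drop every non-digit character, then check that the result is a real calendar date) is an assumption about how such a rule could be written, not the procedure used in the census.

```python
import re
from datetime import datetime

def standardize_date(raw):
    """Return the date as YYYYMMDD, or None if it cannot be interpreted."""
    # Drop every character that is not a digit (hyphens, dots, spaces, ...).
    digits = re.sub(r"[^0-9]", "", raw)
    if len(digits) != 8:
        return None
    try:
        # Reject impossible dates such as 30 February.
        datetime.strptime(digits, "%Y%m%d")
    except ValueError:
        return None
    return digits

standardized = [standardize_date(d)
                for d in ("1985-02-21", "1985.02.21", "1985 02 21")]
```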
Before standardization and parsing:
Voivodeship | City | Street | Place of birth
MAZOWQIECKIE | WARZSWA | ul. DŁUGA | LONDYN - ANGLIA
MAZPWOECKIE | WARS-AWA | Ulica DŁUUGA | LONDYN – WLK BRYTANIA
ZAZOWIEVCKIE | AWRSZAWA | DLUGAA | LONDYN/CHELSEA
MZAOWIECIE | WARSZAAAWA | DŁUGA (ul.) | LONDYN BRIDGE
After standardization and parsing:
Voivodeship | City | Prefix | Street | Place of birth
MAZOWIECKIE | WARSZAWA | UL | DŁUGA | LONDYN
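One way to approach this kind of cleaning is to split off the street prefix with a pattern and to replace misspelled names with the closest entry of a reference dictionary. The sketch below uses Python's `difflib` for the similarity search; this is a swapped-in technique for illustration, since the source does not say how the matching was done. The dictionaries `VOIVODESHIPS` and `STREETS`, the 0.6 cutoff and the function names are assumptions.

```python
import difflib
import re

# Hypothetical reference dictionaries; a real run would use official lists.
VOIVODESHIPS = ["MAZOWIECKIE", "MAŁOPOLSKIE", "POMORSKIE"]
STREETS = ["DŁUGA", "PUŁAWSKA", "FILTROWA"]

# Street prefix written as "ul", "ul." or "ulica", optionally in parentheses.
PREFIX_PATTERN = re.compile(r"\(?\b(ulica|ul)\b\.?\)?", re.IGNORECASE)

def closest(value, reference, cutoff=0.6):
    """Return the most similar reference entry, or None below the cutoff."""
    matches = difflib.get_close_matches(value.upper(), reference,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None

def parse_street(raw):
    """Split a raw street field into (prefix, standardized street name)."""
    prefix = "UL" if PREFIX_PATTERN.search(raw) else None
    name = PREFIX_PATTERN.sub("", raw).strip()
    return prefix, closest(name, STREETS)

voivodeship = closest("MAZOWQIECKIE", VOIVODESHIPS)
street = parse_street("Ulica DŁUUGA")
```

Values that fall below the cutoff come back as `None`, so they can be counted as incorrect rather than silently replaced.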
Transform data - examples: data cleaning report -
Group of variables | Variable | Before cleaning: total | incorrect | incorrect in % | After cleaning: total | incorrect | incorrect in %
Address of permanent residence | COMMUNITY | 4320724 | 428469 | 9,92% | 4316061 | 72797 | 1,69%
Address of permanent residence | CITY | 4353209 | 207399 | 4,77% | 4352983 | 43086 | 0,99%
Address of permanent residence | STREET | 3514154 | 573899 | 16,34% | 3440932 | 125392 | 3,65%
Address of permanent residence | PREFIX | 0 | 0 | - | 108551 | 0 | 0,00%
Address of residence | COMMUNITY | 739088 | 100282 | 13,57% | 738717 | 11666 | 1,58%
Address of residence | CITY | 742388 | 30644 | 4,13% | 742336 | 6344 | 0,86%
Address of residence | STREET | 607939 | 102725 | 16,90% | 593370 | 21012 | 3,55%
Address of residence | PREFIX | 0 | 0 | - | 18416 | 0 | 0,00%
Correspondence address | COMMUNITY | 2005 | 132 | 6,59% | 2005 | 30 | 1,50%
Correspondence address | CITY | 448791 | 21678 | 4,84% | 448704 | 4796 | 1,07%
Correspondence address | STREET | 377849 | 64871 | 17,17% | 374220 | 20575 | 5,50%
Correspondence address | PREFIX | 0 | 0 | - | 11192 | 0 | 0,00%
Personal Data | NAME | 4355764 | 7208 | 0,17% | 4355757 | 5534 | 0,13%
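The percentage columns of the report are the number of incorrect values divided by the total. A minimal sketch, with the hypothetical function name `incorrect_share`, applied to the COMMUNITY row of the permanent residence address. A few published figures differ from plain rounding in the last digit, so the exact rounding rule used in the report is not known.

```python
def incorrect_share(total, incorrect):
    """Return the share of incorrect values in percent, or None if total is 0."""
    if total == 0:
        # Matches the "-" shown for PREFIX before cleaning.
        return None
    return 100.0 * incorrect / total

before = incorrect_share(4320724, 428469)
after = incorrect_share(4316061, 72797)
summary = f"{before:.2f}% -> {after:.2f}%"
```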
Transform data - conversion: marital status variable -
Statistical standard
1 bachelor
2 unmarried woman
3 married (M)
4 married (F)
5 divorced (M)
6 divorced (F)
7 widower
8 widow
9 unidentified
10 separated (M)
11 separated (F)
12 non-formalized relationship (M)
13 non-formalized relationship (F)
14 single (M)
15 single (F)
Family benefits system (source code → statistical standard)
501 – unmarried woman → 2
502 – bachelor → 1
503 – married (M) → 3
504 – married (F) → 4
505 – divorced (M) → 5
506 – divorced (F) → 6
507 – widow → 8
508 – widower → 7
509 – non-formalized (M) → 12
510 – non-formalized (F) → 13
511 – separated → 10/11
Foreigners evidence system (source code → statistical standard)
$BD – no data → 9
KWR – bachelor → 1
MTA – married (F) → 4
PNA – unmarried woman → 2
RWA – divorced (F) → 6
RWY – divorced (M) → 5
WDA – widow → 8
WDC – widower → 7
WNA – single (F) → 15
WNY – single (M) → 14
ZNY – married (M) → 3
Community Registers of Residence (source code → statistical standard)
1 – bachelor → 1
2 – unmarried woman → 2
3 – married (M) → 3
4 – married (F) → 4
5 – divorced (M) → 5
6 – divorced (F) → 6
7 – widower → 7
8 – widow → 8
no data (empty field) → 9
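A conversion of this kind is essentially a lookup table per source system. The sketch below pairs each source code with the statistical code carrying the same label on the slide; it assumes that WNY denotes single (M), that an empty or unknown code maps to 9 (unidentified), and that the sex needed to split "separated" into 10 or 11 is passed in as "M" or "F". The names `to_standard`, `FAMILY_BENEFITS`, `FOREIGNERS` and `COMMUNITY_REGISTERS` are hypothetical.

```python
UNIDENTIFIED = 9

# Source code -> statistical standard, paired by matching labels.
FAMILY_BENEFITS = {
    "501": 2, "502": 1, "503": 3, "504": 4, "505": 5, "506": 6,
    "507": 8, "508": 7, "509": 12, "510": 13,
}
FOREIGNERS = {
    "$BD": 9, "KWR": 1, "MTA": 4, "PNA": 2, "RWA": 6, "RWY": 5,
    "WDA": 8, "WDC": 7, "WNA": 15, "WNY": 14, "ZNY": 3,
}
# Community registers already use codes 1-8 of the standard.
COMMUNITY_REGISTERS = {str(code): code for code in range(1, 9)}

def to_standard(mapping, code, sex=None):
    """Convert a source marital status code to the statistical standard."""
    code = (code or "").strip()
    if code == "":
        return UNIDENTIFIED
    if mapping is FAMILY_BENEFITS and code == "511":
        # "separated" has one source code but two standard codes (assumption:
        # the sex comes from another variable of the same record).
        return {"M": 10, "F": 11}.get(sex, UNIDENTIFIED)
    return mapping.get(code, UNIDENTIFIED)

converted = [
    to_standard(FAMILY_BENEFITS, "507"),
    to_standard(FOREIGNERS, "RWY"),
    to_standard(COMMUNITY_REGISTERS, ""),
]
```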
Transform data - validation -
checking the data and correcting abnormal values according to the algorithms prepared by methodologists,
where necessary, excluding from further processing those records that cannot be corrected.
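The source does not list the validation rules themselves. As a hedged illustration of the checking and exclusion steps, the sketch below applies two example rules: the publicly documented PESEL check digit (weights 1, 3, 7, 9 repeated over the first ten digits) and a calendar check on a YYYYMMDD date. The field names `pesel` and `date_from` and the function names are assumptions; correction rules are not shown.

```python
import re
from datetime import datetime

PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)

def pesel_is_valid(pesel):
    """Check the format and the check digit of a PESEL number."""
    if not re.fullmatch(r"[0-9]{11}", pesel):
        return False
    weighted = sum(int(d) * w for d, w in zip(pesel, PESEL_WEIGHTS))
    return (10 - weighted % 10) % 10 == int(pesel[10])

def date_is_valid(value):
    """Check that the value is an existing date written as YYYYMMDD."""
    if not re.fullmatch(r"[0-9]{8}", value):
        return False
    try:
        datetime.strptime(value, "%Y%m%d")
    except ValueError:
        return False
    return True

def split_valid(records):
    """Separate records that pass all checks from those to be excluded."""
    valid, rejected = [], []
    for record in records:
        ok = pesel_is_valid(record["pesel"]) and date_is_valid(record["date_from"])
        (valid if ok else rejected).append(record)
    return valid, rejected

valid, rejected = split_valid([
    {"pesel": "44051401359", "date_from": "20070303"},
    {"pesel": "44051401358", "date_from": "20070303"},
])
```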
Transform data - deduplication -
removal of repeated units,
requires detailed analysis, including analysis of legal acts, individual for each register,
result of deduplication – one record with all the possible and unique information.
Transform data - example of deduplication process -
Before deduplication:
imie1_GZM | nazwisko1_GZM | plec_GZM | pesel_GZM | adr_ulica_GZM | adr_nr_dom_GZM | adr_nr_lok_GZM | data_zam_od_GZM
JAN | KOWALSKI | M | 00000000001 | PUŁAWSKA | 4 | 37 | 20070303
ANNA | MALINOWSKA | K | 00000000002 | ANDERSA | 7 | | 20010205
ADAM | PIOTROWSKI | M | 00000000003 | FILTROWA | 2 | 45 | 20090101
JAN | KOWALSKI | M | 00000000001 | PUŁAWSKA | 4 | 37 | 19840712
After deduplication:
imie1_GZM | nazwisko1_GZM | plec_GZM | pesel_GZM | adr_ulica_GZM | adr_nr_dom_GZM | adr_nr_lok_GZM | data_zam_od_GZM
ANNA | MALINOWSKA | K | 00000000002 | ANDERSA | 7 | | 20010205
ADAM | PIOTROWSKI | M | 00000000003 | FILTROWA | 2 | 45 | 20090101
JAN | KOWALSKI | M | 00000000001 | PUŁAWSKA | 4 | 37 | 20070303
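The example above can be reproduced with a short sketch that keeps one record per `pesel_GZM`. It assumes that the record with the latest `data_zam_od_GZM` wins, which matches the result shown, and that empty fields of the kept record may be filled from older duplicates; the actual rules depend on each register and are not given in the source. The function name `deduplicate` is hypothetical.

```python
def deduplicate(records, key="pesel_GZM", date_field="data_zam_od_GZM"):
    """Keep one record per key: the most recent, completed from older ones."""
    merged = {}
    # YYYYMMDD strings sort chronologically, so newest records come first.
    for record in sorted(records, key=lambda r: r[date_field], reverse=True):
        base = merged.setdefault(record[key], dict(record))
        # Fill fields that are empty in the kept record from older duplicates.
        for field, value in record.items():
            if base.get(field) in (None, "") and value not in (None, ""):
                base[field] = value
    return list(merged.values())

records = [
    {"imie1_GZM": "JAN", "pesel_GZM": "00000000001", "data_zam_od_GZM": "20070303"},
    {"imie1_GZM": "ANNA", "pesel_GZM": "00000000002", "data_zam_od_GZM": "20010205"},
    {"imie1_GZM": "ADAM", "pesel_GZM": "00000000003", "data_zam_od_GZM": "20090101"},
    {"imie1_GZM": "JAN", "pesel_GZM": "00000000001", "data_zam_od_GZM": "19840712"},
]
unique = deduplicate(records)
```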
Transform data - data integration -
the process of selecting the best, most current and correct values from several to a dozen or so registers,
used to create a statistical register, which will be available for use by analysts.
Transform data - integration process: scheme -
A Register, B Register, C Register
DATA INTEGRATION
• LINKING: one ID, multiple identifiers, alternative linking keys
• SELECTING: algorithms, selecting the best values, data completeness
STATISTICAL REGISTER
REGISTER OF REFERENCE
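The linking step of the scheme can be illustrated by matching records on a single identifier first and falling back to an alternative key when the identifier is missing. In the sketch below the identifier field `pesel`, the alternative key (name, surname, birth date), the function names and all sample values are assumptions made for the example.

```python
def alt_key(record, fields):
    """Build an alternative linking key; None if any part is missing."""
    values = tuple(record.get(field) for field in fields)
    return values if all(values) else None

def link(register_a, register_b, id_field="pesel",
         alt_fields=("name", "surname", "birth_date")):
    """Pair each record of register_b with a record of register_a."""
    by_id = {r[id_field]: r for r in register_a if r.get(id_field)}
    by_alt = {}
    for r in register_a:
        key = alt_key(r, alt_fields)
        if key is not None:
            by_alt[key] = r
    linked, unlinked = [], []
    for record in register_b:
        # First try the single identifier, then the alternative key.
        match = by_id.get(record.get(id_field))
        if match is None:
            match = by_alt.get(alt_key(record, alt_fields))
        if match is None:
            unlinked.append(record)
        else:
            linked.append((match, record))
    return linked, unlinked

# Hypothetical sample registers.
register_a = [
    {"pesel": "00000000001", "name": "JAN", "surname": "KOWALSKI", "birth_date": "19600101"},
    {"pesel": "00000000002", "name": "ANNA", "surname": "MALINOWSKA", "birth_date": "19750505"},
]
register_b = [
    {"pesel": "00000000001", "name": "JAN", "surname": "KOWALSKI", "birth_date": "19600101"},
    {"pesel": "", "name": "ANNA", "surname": "MALINOWSKA", "birth_date": "19750505"},
    {"pesel": "", "name": "ADAM", "surname": "PIOTROWSKI", "birth_date": ""},
]
linked, unlinked = link(register_a, register_b)
```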
Transform data - data integration: example of algorithm -
Selecting Kraj_ur_kod:
1. kraj_ur_kod_KEP not null? TRUE: select kraj_ur_kod_KEP; FALSE: go to 2.
2. msce_ur_kod_POBYT not null? TRUE: select kraj_ur_kod_POBYT; FALSE: go to 3.
3. kraj_ur_kod_GZM not null? TRUE: select kraj_ur_kod_GZM.
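The selection algorithm can be read as a chain of "not null" tests, each followed by a select. The sketch below assumes the registers are consulted in the order KEP, POBYT, GZM, which is the order in which the conditions appear on the slide; the actual priority may differ. As on the slide, the second rule tests `msce_ur_kod_POBYT` but selects `kraj_ur_kod_POBYT`. The names `RULES` and `select_kraj_ur_kod` and the sample codes are hypothetical.

```python
# (field tested for "not null", field selected when the test is TRUE)
RULES = [
    ("kraj_ur_kod_KEP", "kraj_ur_kod_KEP"),
    ("msce_ur_kod_POBYT", "kraj_ur_kod_POBYT"),
    ("kraj_ur_kod_GZM", "kraj_ur_kod_GZM"),
]

def select_kraj_ur_kod(record, rules=RULES):
    """Return the value chosen by the first rule whose test field is filled."""
    for tested_field, selected_field in rules:
        if record.get(tested_field) not in (None, ""):
            return record.get(selected_field)
    # No register supplied a value.
    return None

# Hypothetical linked record: KEP is empty, so POBYT is used.
record = {
    "kraj_ur_kod_KEP": None,
    "msce_ur_kod_POBYT": "X1",
    "kraj_ur_kod_POBYT": "GB",
    "kraj_ur_kod_GZM": "PL",
}
chosen = select_kraj_ur_kod(record)
```

Keeping the rules in a list makes the priority order explicit and easy to change per variable.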
Summary
Common difficulties:
- poor quality data, missing values, duplicates,
- conflicting data,
- technical: size of the registers, time-consuming process.
Benefits:
- obtain relevant, useful, accurate data,
- improve the quality of the output data,
- selection of the best variables from multiple registers.