Model of transformation administrative data to statistical data Data used in Population and Housing Census 2011 – examples Janusz Dygaszewicz and Paweł Murawski Central Statistical Office POLAND

Outline

1. Purpose of the work on administrative sources
2. Data quality
3. Extract data
4. Transform data
5. Summary

Registers - data acquisition

Data Owners:
• Ministry of Finance,
• Ministry of Interior and Administration,
• Ministry of Justice,
• Agricultural Social Insurance Fund,
• National Health Fund,
• Agency for Restructuring and Modernisation of Agriculture,
• Agricultural and Food Quality Inspection,
• Agency for Geodesy and Cartography,
• State Fund for Rehabilitation of Disabled Persons,
• County Offices,
• Commune Offices,
• Regional Offices,
• Telcoms,
• Energy Suppliers,
• Office For Foreigners,
• Social Insurance Institution,
• Housing Managers.

Purpose of the work on administrative data

Obtaining a sufficiently complete data set: subjective and objective completeness that corresponds to classification standards, definitions and basic categories, so that administrative data can be used effectively.

Data quality - measures

1. Measuring the quality of administrative registers
– timeliness of data
– methodological compatibility
– completeness
– identification standards used in the registry
– usefulness
– compatibility of data in administrative sources with data obtained in the study/survey

2. Measuring the quality in processing of data registers
– excessive coverage error rate
– incomplete coverage error rate
– subjective indicator of completeness
– objective indicator of completeness
– imputation rate
– data correction index
– index of data integration from various sources

Extract data

consolidation of data from various source systems with different data formats,

extraction of data into the production environment based on SAS software,

conversion of data into one format suitable for processing (SAS tables),

validation of the imported data structure, which is an integral part of this process.
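The extraction steps above can be sketched roughly as follows. The slides describe a SAS-based environment; this Python fragment only illustrates the idea of bringing a TXT and an XML extract into one row format and checking the imported structure. The column names, the `osoba` row tag and the `;` delimiter are assumptions, not taken from the registers.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical target structure; the real column lists are not given in the slides.
EXPECTED_COLUMNS = ["pesel", "nazwisko"]


def read_txt(text, delimiter=";"):
    """Read a delimited text extract into a list of dict rows."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text), delimiter=delimiter)]


def read_xml(text, row_tag="osoba"):
    """Read an XML extract in which each <row_tag> element holds one record."""
    root = ET.fromstring(text)
    return [{child.tag: (child.text or "") for child in row} for row in root.iter(row_tag)]


def validate_structure(rows, expected=EXPECTED_COLUMNS):
    """Return a list of structural problems (empty list if the import looks fine)."""
    problems = []
    for i, row in enumerate(rows):
        missing = [col for col in expected if col not in row]
        if missing:
            problems.append(f"row {i}: missing columns {missing}")
    return problems


txt_rows = read_txt("pesel;nazwisko\n00000000001;KOWALSKI\n")
xml_rows = read_xml("<rows><osoba><pesel>00000000002</pesel></osoba></rows>")
```

Both readers return the same row shape, so `validate_structure` can be applied regardless of the source format.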

Extract data-examples-

Register/System Name CentralY/N

Relational Y/N Data format

PESEL General electronic system of Population Register Y N TXTKEP Register of the National Taxpayers Y N TXTGZM Community Registers of Residence N Y SQL ServerPIT Personal income tax register Y N TXTSI MS Ministry of Justice system Y N XLSPOBYT Foreigners evidence system Y Y SQL ServerZUS CRPS Central Register of Contribution Payers Y N TXTZUS CRU Central Register od Insured Persons Y N TXTZUS SER Pension insurance system Y N TXTKRUS Agricultural Social Insurance Fund System Y N XMLCWU NFZ Central Register of Insured Y N TXTPFRON State Fund for Rehabilitation of Disabled Persons Y N TXTARiMR Agency for Restructuring and Modernisation of Agriculture

systemY Y XLS

EPN Property Tax Records N Y SQL Server

Transform data

Data processing in the production environment consists of:
• profiling – creating a report on the data quality,
• unification/standardization of data,
• parsing (separation) or combining variables,
• standardization with schemes,
• conversion,
• validation,
• deduplication,
• data integration.

Transform data - profiling
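Profiling produces a report on data quality. A minimal sketch of such a report, assuming the data is held as a list of dict rows and that an empty string or `None` marks a missing value (both are assumptions; the measures shown are illustrative, not the ones used in the census):

```python
def profile(rows, columns):
    """Build a simple per-column quality report: totals, missing and distinct values."""
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        filled = [v for v in values if v not in (None, "")]
        report[col] = {
            "total": len(values),
            "missing": len(values) - len(filled),
            "distinct": len(set(filled)),
        }
    return report


sample = [
    {"city": "WARSZAWA"},
    {"city": ""},
    {"city": "WARSZAWA"},
    {"city": "WARZSWA"},
]
city_report = profile(sample, ["city"])["city"]
```

A high number of distinct values in a column that should follow a dictionary (such as city names) is a hint that standardization is needed.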

Transform data - standardization and parsing examples

Incorrect data format | Format after standardization
1985-02-21 | 19850221
1985.02.21 | 19850221
1985 02 21 | 19850221
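The date examples above can be reproduced with a short function. This is a sketch only: it assumes the year-month-day order seen in the examples and treats anything that is not a real calendar date as unusable.

```python
import re
from datetime import datetime


def standardize_date(raw):
    """Return the date as YYYYMMDD, or None if it cannot be interpreted."""
    digits = re.sub(r"[-. /]", "", raw.strip())  # drop the separators seen in the examples
    if not re.fullmatch(r"[0-9]{8}", digits):
        return None
    try:
        datetime.strptime(digits, "%Y%m%d")  # rejects impossible dates such as 30 February
    except ValueError:
        return None
    return digits
```

Returning `None` instead of guessing leaves the decision about unusable values to the later validation step.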

Before standardization:
Voivodeship | City | Street | Place of birth
MAZOWQIECKIE | WARZSWA | ul. DŁUGA | LONDYN - ANGLIA
MAZPWOECKIE | WARS-AWA | Ulica DŁUUGA | LONDYN – WLK BRYTANIA
ZAZOWIEVCKIE | AWRSZAWA | DLUGAA | LONDYN/CHELSEA
MZAOWIECIE | WARSZAAAWA | DŁUGA (ul.) | LONDYN BRIDGE

After standardization and parsing:
Voivodeship | City | Prefix | Street | Place of birth
MAZOWIECKIE | WARSZAWA | UL | DŁUGA | LONDYN
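Two parts of the address example can be sketched in a few lines: separating the street prefix from the street name, and matching a misspelled name against a reference dictionary. The dictionaries below are tiny hypothetical stand-ins, and the fuzzy matching via `difflib` with a 0.7 cutoff is an assumption; the slides do not say which matching method was used.

```python
import difflib

# Hypothetical reference dictionaries; real processing would presumably use official lists.
VOIVODESHIPS = ["MAZOWIECKIE", "MAŁOPOLSKIE", "LUBELSKIE"]
CITIES = ["WARSZAWA", "KRAKÓW", "LUBLIN"]
PREFIX_TOKENS = ("UL", "ULICA")


def split_street(raw):
    """Separate the street prefix from the street name, e.g. 'ul. DŁUGA' -> ('UL', 'DŁUGA')."""
    tokens = [t.strip("().,").upper() for t in raw.split()]
    prefix = "UL" if any(t in PREFIX_TOKENS for t in tokens) else ""
    name = " ".join(t for t in tokens if t and t not in PREFIX_TOKENS)
    return prefix, name


def closest(raw, dictionary, cutoff=0.7):
    """Return the dictionary entry most similar to raw, or None if nothing is close enough."""
    matches = difflib.get_close_matches(raw.upper(), dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

The cutoff is a trade-off: a lower value repairs more misspellings but raises the risk of matching to the wrong entry, so borderline cases are better left for manual review.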

Transform data - schemes examples

Transform data - examples: data cleaning report

Group of variables | Variable | Before cleaning: total | incorrect | in % | After cleaning: total | incorrect | in %
Address of permanent residence | COMMUNITY | 4320724 | 428469 | 9,92% | 4316061 | 72797 | 1,69%
Address of permanent residence | CITY | 4353209 | 207399 | 4,77% | 4352983 | 43086 | 0,99%
Address of permanent residence | STREET | 3514154 | 573899 | 16,34% | 3440932 | 125392 | 3,65%
Address of permanent residence | PREFIX | 0 | 0 | - | 108551 | 0 | 0,00%
Address of residence | COMMUNITY | 739088 | 100282 | 13,57% | 738717 | 11666 | 1,58%
Address of residence | CITY | 742388 | 30644 | 4,13% | 742336 | 6344 | 0,86%
Address of residence | STREET | 607939 | 102725 | 16,90% | 593370 | 21012 | 3,55%
Address of residence | PREFIX | 0 | 0 | - | 18416 | 0 | 0,00%
Corresponding address | COMMUNITY | 2005 | 132 | 6,59% | 2005 | 30 | 1,50%
Corresponding address | CITY | 448791 | 21678 | 4,84% | 448704 | 4796 | 1,07%
Corresponding address | STREET | 377849 | 64871 | 17,17% | 374220 | 20575 | 5,50%
Corresponding address | PREFIX | 0 | 0 | - | 11192 | 0 | 0,00%
Personal Data | NAME | 4355764 | 7208 | 0,17% | 4355757 | 5534 | 0,13%
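The "in %" columns of the report are the share of incorrect values in the total. A small sketch that reproduces the figures, including the comma decimal separator and the "-" shown when a variable has no values yet (the function name is hypothetical):

```python
def error_rate(total, incorrect):
    """Share of incorrect values formatted as in the report, e.g. '9,92%'."""
    if total == 0:
        return "-"  # PREFIX does not exist before parsing, so no rate can be given
    return f"{100 * incorrect / total:.2f}%".replace(".", ",")


before = error_rate(4320724, 428469)  # COMMUNITY, permanent residence, before cleaning
after = error_rate(4316061, 72797)    # the same variable after cleaning
```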

Transform data - conversion: gender variable

Different source representations of the same variable:
female, male
F, M
1, 2

Transform data - conversion: marital status variable

Statistical standard
1 bachelor
2 unmarried woman
3 married (M)
4 married (F)
5 divorced (M)
6 divorced (F)
7 widower
8 widow
9 unidentified
10 separated (M)
11 separated (F)
12 non-formalized relationship (M)
13 non-formalized relationship (F)
14 single (M)
15 single (F)

Family benefits system (source code -> statistical standard)
501 – unmarried woman -> 2
502 – bachelor -> 1
503 – married (M) -> 3
504 – married (F) -> 4
505 – divorced (M) -> 5
506 – divorced (F) -> 6
507 – widow -> 8
508 – widower -> 7
509 – non-formalized (M) -> 12
510 – non-formalized (F) -> 13
511 – separated -> 10/11

Foreigners evidence system (source code -> statistical standard)
$BD – no data -> 9
KWR – bachelor -> 1
MTA – married (F) -> 4
PNA – unmarried woman -> 2
RWA – divorced (F) -> 6
RWY – divorced (M) -> 5
WDA – widow -> 8
WDC – widower -> 7
WNA – single (F) -> 15
WNY – single (M) -> 14
ZNY – married (M) -> 3

Community Registers of Residence (source code -> statistical standard)
1 – bachelor -> 1
2 – unmarried woman -> 2
3 – married (M) -> 3
4 – married (F) -> 4
5 – divorced (M) -> 5
6 – divorced (F) -> 6
7 – widower -> 7
8 – widow -> 8
no data (empty field) -> 9
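The conversion of marital status codes to the statistical standard can be sketched as a set of lookup tables. The mappings follow the slide as far as they can be recovered from it. The source labels (`family_benefits`, `foreigners`, `gzm`), the use of a separate sex value to split code 511 into 10 or 11, and the fallback to 9 for unknown codes are assumptions made for this example.

```python
UNIDENTIFIED = 9  # "unidentified" in the statistical standard

FAMILY_BENEFITS = {
    "501": 2, "502": 1, "503": 3, "504": 4, "505": 5, "506": 6,
    "507": 8, "508": 7, "509": 12, "510": 13,
}
FOREIGNERS = {
    "$BD": 9, "KWR": 1, "MTA": 4, "PNA": 2, "RWA": 6, "RWY": 5,
    "WDA": 8, "WDC": 7, "WNA": 15, "WNY": 14, "ZNY": 3,
}
GZM_CODES = {"1", "2", "3", "4", "5", "6", "7", "8"}


def to_standard(source, code, sex=None):
    """Convert a source-specific marital status code to the statistical standard."""
    code = (code or "").strip()
    if source == "family_benefits":
        if code == "511":  # "separated" carries no sex, the standard needs 10 (M) or 11 (F)
            return {"M": 10, "F": 11}.get(sex, UNIDENTIFIED)
        return FAMILY_BENEFITS.get(code, UNIDENTIFIED)
    if source == "foreigners":
        return FOREIGNERS.get(code, UNIDENTIFIED)
    if source == "gzm":  # codes 1-8 already match the standard, empty field means no data
        return int(code) if code in GZM_CODES else UNIDENTIFIED
    raise ValueError(f"unknown source: {source}")
```

Keeping one table per register makes it easy to review each mapping against the register's own documentation.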

Transform data - validation

checking the data and correcting abnormal values according to the algorithms prepared by methodologists,

excluding from further processing any records that cannot be corrected.
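The two outcomes of validation, correction and exclusion, can be sketched as below. The slides do not list the actual rules, so both rules here are assumptions chosen for illustration: normalizing the sex field as a correction, and the standard PESEL check digit (weights 1, 3, 7, 9 repeated) as an exclusion criterion.

```python
import re

PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)


def pesel_is_valid(pesel):
    """Check the format and the check digit of a PESEL number."""
    if not re.fullmatch(r"[0-9]{11}", pesel):
        return False
    total = sum(int(digit) * weight for digit, weight in zip(pesel, PESEL_WEIGHTS))
    return (10 - total % 10) % 10 == int(pesel[10])


def validate(records):
    """Correct what can be corrected and split records into kept and excluded."""
    kept, excluded = [], []
    for rec in records:
        # Correction step (hypothetical rule): trim and upper-case the sex field.
        fixed = dict(rec, plec=(rec.get("plec") or "").strip().upper())
        if pesel_is_valid(fixed.get("pesel") or ""):
            kept.append(fixed)
        else:
            excluded.append(fixed)  # cannot be repaired, so it leaves further processing
    return kept, excluded


kept, excluded = validate([
    {"pesel": "44051401359", "plec": " m "},
    {"pesel": "123", "plec": "K"},
])
```

Excluded records are returned rather than dropped so that the exclusion counts can feed the quality measures.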

Transform data - deduplication

removal of repeated units,

requires detailed analysis, including analysis of the legal acts specific to each register,

result of deduplication: one record with all the possible and unique information.

Transform data - example of deduplication process

Input records:
imie1_GZM | nazwisko1_GZM | plec_GZM | pesel_GZM | adr_ulica_GZM | adr_nr_dom_GZM | adr_nr_lok_GZM | data_zam_od_GZM
JAN | KOWALSKI | M | 00000000001 | PUŁAWSKA | 4 | 37 | 20070303
ANNA | MALINOWSKA | K | 00000000002 | ANDERSA | 7 | | 20010205
ADAM | PIOTROWSKI | M | 00000000003 | FILTROWA | 2 | 45 | 20090101
JAN | KOWALSKI | M | 00000000001 | PUŁAWSKA | 4 | 37 | 19840712

Result after deduplication:
imie1_GZM | nazwisko1_GZM | plec_GZM | pesel_GZM | adr_ulica_GZM | adr_nr_dom_GZM | adr_nr_lok_GZM | data_zam_od_GZM
ANNA | MALINOWSKA | K | 00000000002 | ANDERSA | 7 | | 20010205
ADAM | PIOTROWSKI | M | 00000000003 | FILTROWA | 2 | 45 | 20090101
JAN | KOWALSKI | M | 00000000001 | PUŁAWSKA | 4 | 37 | 20070303
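A minimal sketch of the deduplication example, using the field names from the slide. It assumes that records are duplicates when they share `pesel_GZM` and that the record with the latest `data_zam_od_GZM` is kept, which matches the final table as far as it can be read. Filling empty fields from the older record is an added assumption, meant to reflect "one record with all the possible and unique information".

```python
def deduplicate(records, key="pesel_GZM", date_field="data_zam_od_GZM"):
    """Keep one record per key: the most recent one, completed from older duplicates."""
    best = {}
    for rec in records:
        current = best.get(rec[key])
        if current is None:
            best[rec[key]] = dict(rec)
            continue
        # YYYYMMDD strings sort chronologically, so plain comparison is enough.
        newer, older = (rec, current) if rec[date_field] > current[date_field] else (current, rec)
        merged = dict(newer)
        for field, value in older.items():
            if merged.get(field) in (None, ""):
                merged[field] = value  # keep information only the older record has
        best[rec[key]] = merged
    return list(best.values())


rows = [
    {"imie1_GZM": "JAN", "pesel_GZM": "00000000001", "adr_nr_lok_GZM": "37", "data_zam_od_GZM": "20070303"},
    {"imie1_GZM": "ANNA", "pesel_GZM": "00000000002", "adr_nr_lok_GZM": "", "data_zam_od_GZM": "20010205"},
    {"imie1_GZM": "ADAM", "pesel_GZM": "00000000003", "adr_nr_lok_GZM": "45", "data_zam_od_GZM": "20090101"},
    {"imie1_GZM": "JAN", "pesel_GZM": "00000000001", "adr_nr_lok_GZM": "37", "data_zam_od_GZM": "19840712"},
]
unique = deduplicate(rows)
```

Which record wins is register-specific in practice; the slides stress that it depends on an analysis of the legal acts behind each register.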

Transform data - data integration

process of selecting the best, most current and correct value from several or a dozen registers,

used to create a statistical register, which will be available for use by analysts.

Transform data - integration process – scheme

[Diagram labels: A Register, B Register, C Register; ONE ID; MULTIPLE IDENTIFIERS; ALTERNATIVE LINKING KEYS; DATA INTEGRATION; LINKING; SELECTING ALGORITHMS; SELECTING THE BEST VALUES; DATA COMPLETENESS; STATISTICAL REGISTER; REGISTER OF REFERENCE]

Transform data - data integration: example of algorithm

[Flowchart for the output variable kraj_ur_kod. Conditions (each with TRUE / FALSE branches): kraj_ur_kod_KEP not null; msce_ur_kod_POBYT not null; kraj_ur_kod_GZM not null. Actions: select kraj_ur_kod_GZM; select kraj_ur_kod_POBYT; select kraj_ur_kod_KEP]
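The selection algorithm on this slide amounts to taking the first non-empty value from a prioritized list of registers. In the sketch below the priority order GZM, then POBYT, then KEP is an assumption, since the order of the branches cannot be read reliably from the transcript. The code also tests the same field it selects, whereas the slide's POBYT condition names `msce_ur_kod_POBYT`.

```python
# Assumed priority order; the real order should be taken from the original flowchart.
DEFAULT_PRIORITY = ("kraj_ur_kod_GZM", "kraj_ur_kod_POBYT", "kraj_ur_kod_KEP")


def select_kraj_ur_kod(record, priority=DEFAULT_PRIORITY):
    """Return the first non-empty country-of-birth code according to register priority."""
    for field in priority:
        value = record.get(field)
        if value not in (None, ""):
            return value
    return None  # no register supplies the value


linked = {"kraj_ur_kod_GZM": None, "kraj_ur_kod_POBYT": "", "kraj_ur_kod_KEP": "PL"}
chosen = select_kraj_ur_kod(linked)
```

Passing the priority as a parameter lets each output variable use its own ranking of registers.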

Data integration - example of process

Summary

Common difficulties:
- poor quality data, missing values, duplicates,
- conflicting data,
- technical: size of the registers, time-consuming process.

Benefits:
- obtaining relevant, useful, accurate data,
- improving the quality of the output data,
- selection of the best variables from multiple registers.

Thank you for your attention

www.stat.gov.pl