Selective Editing with Categorical Variables Use of R in Official Statistics 2019 M.R. González-García, J. Poch, D. Salgado, T. Vázquez-Gutiérrez Statistics Spain (INE) INS, Bucharest, 20-21 May, 2019


  • Selective Editing with Categorical Variables

    Use of R in Official Statistics 2019

    M.R. González-García, J. Poch, D. Salgado, T. Vázquez-Gutiérrez

    Statistics Spain (INE)

    INS, Bucharest, 20-21 May, 2019

  • Overview

    1. Statement of the problem

    2. The methodological approach

    3. Application of random forests

    4. Results

    5. Implementation

    6. Preliminary conclusions

    Salgado et al (StatSpain (INE)) Selective Editing with Categorical Variables 20-21 May, 2019 2 / 13

  • Statement of the problem. I

    National/European Health Survey:

    • 3/2 types of questionnaire (Adult, Household, Underage).

    • Around 450 + 50 + 250 questionnaire items.

    • Estimators:

      $$\widehat{Y}^{\text{target}}_{d} = \sum_{k \in s} \omega_{ks}(x) \cdot \delta^{\text{Domain}}_{kU_d} \cdot \delta^{\text{Target}}_{k}$$

    • Focus on the domain variable $\delta^{\text{Domain}}_{kU_d}$ expressing OCCUPATION.

    • Traditional editing work:

    systematic interactive editing of each questionnaire, taking into account other core sociodemographic variables such as age, gender, professional situation, economic activity, income interval...
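A minimal sketch of the domain estimator above, assuming each sampled unit carries a sampling weight, an occupation code that defines the domain, and a 0/1 target indicator. The field names (`weight`, `occupation`, `target`) and the toy values are hypothetical; the point is only that a wrong occupation code moves a unit's whole weight from one domain total to another.

```python
def domain_total(units, domain):
    """Weighted total: sum over units of weight * 1[occupation == domain] * target."""
    return sum(
        u["weight"] * (u["occupation"] == domain) * u["target"]
        for u in units
    )


# Toy sample (hypothetical values, not survey data).
units = [
    {"weight": 120.0, "occupation": "A", "target": 1},
    {"weight": 80.0, "occupation": "B", "target": 1},
    {"weight": 100.0, "occupation": "A", "target": 0},
    {"weight": 50.0, "occupation": "A", "target": 1},
]

total_a = domain_total(units, "A")
total_b = domain_total(units, "B")
print(total_a, total_b)
```

If the second unit's occupation were miscoded as "A", its weight of 80 would be added to the wrong domain, which is why errors in the domain variable can be influential.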


  • Statement of the problem. II


  • The methodological approach

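    • Score as the expected weighted error given the cross-sectional information:

      $$s_k = \mathbb{E}\left[\,\omega_k \left|y_k - y_k^{(0)}\right| \,\Big|\, Z_k^{\text{cross}}\right]$$

    • Under $y_k = \hat{y}_k + \nu_k$ and $y_k = y_k^{(0)} + \sigma_k$:

      $$s_k = \sqrt{\frac{2}{\pi}} \cdot \omega_k \cdot \frac{M\left(-\frac{1}{2}, \frac{1}{2}, -\left(\frac{\sigma_k^2}{\sigma_k^2 + \nu_k^2}\right)^{2} \frac{(y_k - \hat{y}_k)^2}{2\nu_k^2}\right)}{1 + \frac{1-p}{p} \cdot \frac{\nu_k^2 + \sigma_k^2}{\nu_k^2}}$$

    • As $|y_k - \hat{y}_k| / \sigma_k \to 0$:

      $$s_k = \omega_k \cdot \left|y_k - \hat{y}_k\right|$$

    • With $P\left(y_k \neq y_k^{(0)}\right) = \sum_{m=1}^{M} w_m \cdot I_{R_m}\left(x_k; Z_k^{\text{cross}}\right)$:

      $$s_k = \omega_k \cdot P\left(y_k \neq y_k^{(0)}\right)$$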
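A small sketch of the categorical score, assuming the error probability is a weighted sum of indicator functions over regions of the predictor space, as a tree-based model would produce. The three regions, their weights and the predictor names (`proxy`, `age`) are hypothetical and chosen only to make the sum concrete.

```python
def error_probability(x, regions):
    """Sum of the weights w_m of the regions R_m that contain x."""
    return sum(w for w, contains in regions if contains(x))


def score(weight, x, regions):
    """Categorical score: sampling weight times estimated error probability."""
    return weight * error_probability(x, regions)


# Hypothetical partition of the predictor space (regions do not overlap).
regions = [
    (0.05, lambda x: not x["proxy"] and x["age"] < 65),
    (0.25, lambda x: x["proxy"] and x["age"] < 65),
    (0.40, lambda x: x["age"] >= 65),
]

unit = {"proxy": True, "age": 40}
print(error_probability(unit, regions), score(120.0, unit, regions))
```

Because the regions form a partition, exactly one indicator is non-zero for each unit, so the probability is simply the weight of the region the unit falls in.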


  • Application of Random Forests

    • Questionnaire for adults.

    • Data from year 2011: raw and (manually) edited.

    • Sample size: 18486 units.

    • Train data set (80%) and test data set (20%), drawn 50 times.

    • Target variable: occupation code measurement error (1-, 2-, 3-digit codes).

    • Predictors (raw values): Age, Sex, Stratum, Proxy, adultNACE, adultOccup, pensionNACE, currBusNACE, precBusNACE, pensionProfSit, lastProfSit, incomeInterval, studyDegree, samplingWeight.

    • model1
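The repeated 80%/20% partition can be sketched as follows. This is an illustration only: the function name, the seed and the use of simple random shuffling are assumptions, and the slides do not say how the 50 partitions were actually drawn.

```python
import random


def repeated_splits(n_units, train_frac=0.8, n_repeats=50, seed=2011):
    """Yield (train_indices, test_indices) for independent random partitions."""
    rng = random.Random(seed)
    n_train = round(train_frac * n_units)
    for _ in range(n_repeats):
        idx = list(range(n_units))
        rng.shuffle(idx)
        yield idx[:n_train], idx[n_train:]


splits = list(repeated_splits(18486))
train, test = splits[0]
print(len(splits), len(train), len(test))
```

With 18486 units this gives 14789 training units and 3697 test units per partition; a classifier would be fitted on each training set and evaluated on the matching test set.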

  • Results. I

    1 train + 1 test datasets

    [Figure: ROC curves (sensitivity against specificity) for classification with random forests, Spanish National Occupation Classification List, 1-digit codes, with and without sampling weight. AUC without: 0.8; AUC with: 0.77.]
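For readers unfamiliar with the AUC, the sketch below computes it as the probability that a randomly chosen erroneous unit receives a higher score than a randomly chosen error-free unit, with ties counted as one half. This rank-based definition is standard; the scores and labels are made-up toy values, not results from the survey.

```python
def auc(scores, labels):
    """AUC as P(score of an error > score of a non-error), ties counted 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: label 1 marks a unit whose occupation code is in error.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 0, 1, 0]
print(auc(scores, labels))
```

A value of 0.5 corresponds to random ranking and 1.0 to a ranking that places every error ahead of every non-error.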

  • Results. I

    1 train + 1 test datasets

    [Figure: ROC curves (sensitivity against specificity) for classification with random forests, Spanish National Occupation Classification List, 2-digit codes, with and without sampling weight. AUC without: 0.82; AUC with: 0.77.]

  • Results. I

    1 train + 1 test datasets

    [Figure: ROC curves (sensitivity against specificity) for classification with random forests, Spanish National Occupation Classification List, 3-digit codes, with and without sampling weight. AUC without: 0.82; AUC with: 0.76.]

  • Results. I

    50 train + 50 test datasets

    [Figure: AUC for classification with random forests, Spanish National Occupation Classification List. Horizontal axis: number of occupation code digits (1, 2, 3). Vertical axis: AUC (ticks from 0.775 to 0.850). Series: with and without sampling weights.]

  • Results. II

    1 train + 1 test datasets

    [Figure: mean decrease in node impurity, Spanish National Occupation Classification List, 1-digit codes. Horizontal axis: mean decrease in Gini index (0 to 200). Vertical axis: variables, in plot order: PROXY_0, F7_2_3, F7_2_1, F7_2_2, F9, SEXOa, F16m_2_3, F16m_2_2, F16a_2_1, F16m_2_1, F16a_2_3, F16a_2_2, D28, CNAE_AS_3, F18, CNAE_AS_1, CNAE_AS_2, CNO_AS_3, CNO_AS_2, A10_i, ESTRATO, CNO_AS_1, EDADa, FACTORADULTO.]
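As a reminder of what a decrease in Gini index measures, the sketch below computes the impurity reduction produced by one binary split on a two-class node. It is a generic illustration with made-up labels; random forest software typically accumulates such decreases over every split on a variable and averages over trees, and the exact weighting used for the plotted values is not stated on the slides.

```python
def gini(labels):
    """Gini impurity of a node with 0/1 labels: 2 * p * (1 - p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def gini_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    return (
        gini(parent)
        - (len(left) / n) * gini(left)
        - (len(right) / n) * gini(right)
    )


# Toy node: 2 errors among 8 units, split into 3 and 5 units.
parent = [1, 1, 0, 0, 0, 0, 0, 0]
left, right = [1, 1, 0], [0, 0, 0, 0, 0]
print(gini_decrease(parent, left, right))
```

A split that isolates the errors well leaves purer children and therefore yields a larger decrease.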

  • Results. II

    1 train + 1 test datasets

    [Figure: mean decrease in node impurity, Spanish National Occupation Classification List, 2-digit codes. Horizontal axis: mean decrease in Gini index (0 to 200). Vertical axis: variables, in plot order: PROXY_0, F7_2_3, F7_2_1, F7_2_2, F9, SEXOa, F16m_2_3, F16m_2_1, F16a_2_1, F16m_2_2, F16a_2_3, F16a_2_2, F18, D28, CNAE_AS_3, CNAE_AS_1, CNAE_AS_2, CNO_AS_3, CNO_AS_2, A10_i, ESTRATO, CNO_AS_1, EDADa, FACTORADULTO.]

  • Results. II

    1 train + 1 test datasets

    [Figure: mean decrease in node impurity, Spanish National Occupation Classification List, 3-digit codes. Horizontal axis: mean decrease in Gini index (0 to 300). Vertical axis: variables, in plot order: PROXY_0, F7_2_3, F7_2_1, F7_2_2, F9, SEXOa, F16m_2_3, F16a_2_1, F16m_2_1, F16a_2_3, F16m_2_2, F16a_2_2, F18, D28, CNAE_AS_3, CNAE_AS_1, CNAE_AS_2, CNO_AS_2, CNO_AS_3, A10_i, ESTRATO, CNO_AS_1, EDADa, FACTORADULTO.]

  • Results. II

    50 train + 50 test datasets

    [Figure: variable importance in the classification of influential errors in the Spanish National Occupation Classification List, 1-digit code. Horizontal axis: variable importance (mean decrease in Gini index, 0 to 200). Vertical axis: variables, in plot order: PROXY_0, F7_2_3, F7_2_1, F7_2_2, F9, SEXOa, F16m_2_3, F16a_2_1, F16m_2_1, F16m_2_2, F16a_2_3, F16a_2_2, D28, CNAE_AS_3, CNAE_AS_1, F18, CNAE_AS_2, CNO_AS_3, CNO_AS_2, A10_i, ESTRATO, CNO_AS_1, EDADa, FACTORADULTO.]

  • Results. II

    50 train + 50 test datasets

    [Figure: variable importance in the classification of influential errors in the Spanish National Occupation Classification List, 2-digit code. Horizontal axis: variable importance (mean decrease in Gini index, 0 to 260). Vertical axis: variables, in plot order: PROXY_0, F7_2_1, F7_2_3, F7_2_2, F9, SEXOa, F16m_2_3, F16a_2_1, F16a_2_3, F16m_2_1, F16m_2_2, F16a_2_2, D28, F18, CNAE_AS_3, CNAE_AS_1, CNAE_AS_2, CNO_AS_3, CNO_AS_2, A10_i, ESTRATO, CNO_AS_1, EDADa, FACTORADULTO.]

  • Results. II

    50 train + 50 test datasets

    [Figure: variable importance in the classification of influential errors in the Spanish National Occupation Classification List, 3-digit code. Horizontal axis: variable importance (mean decrease in Gini index, 0 to 300). Vertical axis: variables, in plot order: PROXY_0, F7_2_3, F7_2_1, F7_2_2, F9, SEXOa, F16m_2_3, F16a_2_1, F16a_2_3, F16m_2_1, F16m_2_2, F16a_2_2, D28, F18, CNAE_AS_3, CNAE_AS_1, CNAE_AS_2, CNO_AS_3, CNO_AS_2, A10_i, ESTRATO, CNO_AS_1, EDADa, FACTORADULTO.]

  • Results. III

    1 train + 1 test datasets

    [Figure: classification of cases in error detection (classifier + cutoff), Spanish National Occupation Classification List, 1-digit codes. Horizontal axis: fraction of edited questionnaires. Vertical axes: fraction of total cases and total number of cases. Series: Error, NonError, True Positive, False Negative, True Negative, False Positive, with and without weights.]
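The "classifier + cutoff" idea can be sketched as follows, assuming units are sorted by decreasing score and the top fraction is sent to editing. Flagged units that truly contain an error are true positives, and so on. The function name, the rounding rule and the toy data are assumptions made for illustration.

```python
def confusion_at_fraction(scores, is_error, fraction):
    """Count TP/FP/FN/TN when the top `fraction` of units by score is edited."""
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    flagged = set(order[: round(fraction * len(scores))])
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for k, err in enumerate(is_error):
        if k in flagged:
            counts["TP" if err else "FP"] += 1
        else:
            counts["FN" if err else "TN"] += 1
    return counts


# Toy example: scores from a classifier and the true error indicators.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
is_error = [1, 0, 1, 0, 1, 0]
print(confusion_at_fraction(scores, is_error, 0.5))
```

Sweeping `fraction` from 0 to 1 traces out curves of the kind the figure describes: true positives grow as more questionnaires are edited, at the cost of more false positives.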

  • Results. III

    1 train + 1 test datasets

    [Figure: classification of cases in error detection (classifier + cutoff), Spanish National Occupation Classification List, 2-digit codes. Horizontal axis: fraction of edited questionnaires. Vertical axes: fraction of total cases and total number of cases. Series: Error, NonError, True Positive, False Negative, True Negative, False Positive, with and without weights.]

  • Results. III

    1 train + 1 test datasets

    [Figure: classification of cases in error detection (classifier + cutoff), Spanish National Occupation Classification List, 3-digit codes. Horizontal axis: fraction of edited questionnaires. Vertical axes: fraction of total cases and total number of cases. Series: Error, NonError, True Positive, False Negative, True Negative, False Positive, with and without weights.]

  • Results. III

    50 train + 50 test datasets

    [Figure: classification of influential errors in the Spanish National Occupation Classification List. Panels: with sampling weights, without sampling weights. Horizontal axis: number of classification code digits (1, 2, 3). Vertical axis: sample percentage (0.00 to 1.00). Series: True Positive, False Negative, True Negative, False Positive.]

  • Results. IV

    1 train + 1 test datasets

    [Figure: error detection quality indicators (classifier + cutoff), Spanish National Occupation Classification List, 1-digit codes. Horizontal axis: fraction of sample of edited questionnaires. Vertical axis: 0.00 to 1.00. Series: Error, NonError, Accuracy, Precision, Recall, Specificity, F-measure, with and without weights.]
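The quality indicators named in the figure follow from the four confusion counts by their usual definitions, sketched below. The counts in the example are made up, and returning 0.0 for an undefined ratio is a convention chosen here, not something the slides specify.

```python
def quality_indicators(tp, fp, fn, tn):
    """Accuracy, precision, recall, specificity and F-measure from counts."""
    def ratio(num, den):
        # Convention for this sketch: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f_measure": ratio(2 * precision * recall, precision + recall),
    }


indicators = quality_indicators(tp=30, fp=10, fn=20, tn=140)
print(indicators)
```

With few errors in the sample, accuracy and specificity can stay high even when recall is modest, which is why the figure tracks all five indicators.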

  • Results. IV

    1 train + 1 test datasets

    [Figure: error detection quality indicators (classifier + cutoff), Spanish National Occupation Classification List, 2-digit codes. Horizontal axis: fraction of sample of edited questionnaires. Vertical axis: 0.00 to 1.00. Series: Error, NonError, Accuracy, Precision, Recall, Specificity, F-measure, with and without weights.]

  • Results. IV

    1 train + 1 test datasets

    [Figure: error detection quality indicators (classifier + cutoff), Spanish National Occupation Classification List, 3-digit codes. Horizontal axis: fraction of sample of edited questionnaires. Vertical axis: 0.00 to 1.00. Series: Error, NonError, Accuracy, Precision, Recall, Specificity, F-measure, with and without weights.]

  • Results. IV

    50 train + 50 test datasets

    [Figure: classification of influential errors in the Spanish National Occupation Classification List, 1-digit codes. Panel columns: Accuracy, Precision, Recall, Specificity, F-measure. Panel rows: with sampling weights, without sampling weights. Horizontal axis: sample percentage (0.0 to 1.0). Vertical axis: 0.0 to 1.0. Series: quantiles 0%, 25%, 50%, 75%, 100%.]

  • Results. IV

    50 train + 50 test datasets

    [Figure: classification of influential errors in the Spanish National Occupation Classification List, 2-digit codes. Panel columns: Accuracy, Precision, Recall, Specificity, F-measure. Panel rows: with sampling weights, without sampling weights. Horizontal axis: sample percentage (0.0 to 1.0). Vertical axis: 0.0 to 1.0. Series: quantiles 0%, 25%, 50%, 75%, 100%.]

  • Results. IV

    50 train + 50 test datasets

    [Figure: classification of influential errors in the Spanish National Occupation Classification List, 3-digit codes. Panel columns: Accuracy, Precision, Recall, Specificity, F-measure. Panel rows: with sampling weights, without sampling weights. Horizontal axis: sample percentage (0.0 to 1.0). Vertical axis: 0.0 to 1.0. Series: quantiles 0%, 25%, 50%, 75%, 100%.]

  • Results. V

    1 train + 1 test datasets

    [Figure: editing efficiency, Spanish National Occupation Classification List, 1-digit codes. Panels: *, 00, 000, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P. Horizontal axis: fraction of sample of edited questionnaires. Vertical axis: absolute relative pseudo-bias (scale differs by panel). Series: with and without weights.]
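The slides do not define the absolute relative pseudo-bias. The sketch below uses one common reading, stated here as an assumption: for a given domain, compare the weighted count obtained when only the top-scored fraction of units has been edited against the count obtained from the fully edited data, relative to the latter. All names and values are hypothetical.

```python
def domain_count(weights, codes, domain):
    """Weighted number of units whose occupation code equals `domain`."""
    return sum(w for w, c in zip(weights, codes) if c == domain)


def abs_rel_pseudobias(weights, raw, edited, scores, fraction, domain):
    """|estimate after partial editing - fully edited estimate| / fully edited."""
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    selected = set(order[: round(fraction * len(scores))])
    partly = [edited[k] if k in selected else raw[k] for k in range(len(raw))]
    reference = domain_count(weights, edited, domain)
    return abs(domain_count(weights, partly, domain) - reference) / reference


# Toy data: the second unit was miscoded as "B" in the raw data.
weights = [100, 50, 80, 70]
raw = ["A", "B", "A", "B"]
edited = ["A", "A", "A", "B"]
scores = [1.0, 20.0, 2.0, 3.0]

print(abs_rel_pseudobias(weights, raw, edited, scores, 0.0, "A"))
print(abs_rel_pseudobias(weights, raw, edited, scores, 0.25, "A"))
```

In this toy case the only erroneous unit has the highest score, so editing a quarter of the sample already removes the pseudo-bias for domain "A".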

  • Results. V

    50 train + 50 test datasets

    [Figure: classification of influential errors in the Spanish National Occupation Classification List, 1-digit codes. Panel columns: with sampling weights, without sampling weights. Panel rows: *, 00, 000, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P. Horizontal axis: sample percentage (0.0 to 1.0). Vertical axis: absolute relative pseudo-bias (scale differs by panel). Series: quantiles 0%, 25%, 50%, 75%, 100%.]

  • Implementation

    Continuous variables

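    [Diagram: workflow linking the objects UnitPrioritization, UnitPrioritizationParam, ErrorMoments, ErrorMomentParam, contObsPredModelParam, ObsErrSTDParam, ErrProbParam, PredParam and StQList through "takes" and "takes to update" arrows.]

    $$S_k = S\left(M^{(1)}, \ldots, M^{(Q)}\right)$$

    Observation-Prediction Model Parameters: $(\omega_k, \hat{p}_k, \hat{\sigma}_k, \hat{\nu}_k, \hat{y}_k)$

    $$M_{kk} = \sqrt{\frac{2}{\pi}} \cdot \omega_k \cdot \hat{\nu}_k \cdot {}_1F_1\left(-\frac{1}{2}; \frac{1}{2}; -\frac{(y_k - \hat{y}_k)^2}{2 \cdot \hat{\nu}_k^2}\right) \cdot \frac{1}{1 + \frac{1 - \hat{p}_k}{\hat{p}_k} \left(\frac{\hat{\sigma}_k^2}{\hat{\sigma}_k^2 + \hat{\nu}_k^2}\right)^{-1/2} \exp\left(-\frac{(y_k - \hat{y}_k)^2}{2 \cdot \hat{\nu}_k^2}\right)}$$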
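A sketch of how the continuous-variable moment might be evaluated with the standard library only. It assumes the slide's expression reads as sqrt(2/pi) * w * nu * 1F1(-1/2; 1/2; -z) / (1 + (1-p)/p * (sigma^2/(sigma^2+nu^2))^(-1/2) * exp(-z)) with z = (y - y_hat)^2 / (2 nu^2); that reading of the garbled formula is an assumption. The confluent hypergeometric term uses the identity 1F1(-1/2; 1/2; -z) = exp(-z) + sqrt(pi z) erf(sqrt(z)) for z >= 0. Function and argument names are hypothetical.

```python
import math


def kummer_half(z):
    """1F1(-1/2; 1/2; -z) for z >= 0, via exp(-z) + sqrt(pi*z) * erf(sqrt(z))."""
    return math.exp(-z) + math.sqrt(math.pi * z) * math.erf(math.sqrt(z))


def continuous_moment(w, y, y_hat, p, sigma, nu):
    """Error moment for a continuous variable (assumed reading of the slide).

    Requires 0 < p <= 1, sigma > 0 and nu > 0.
    """
    z = (y - y_hat) ** 2 / (2.0 * nu ** 2)
    ratio = (sigma ** 2 / (sigma ** 2 + nu ** 2)) ** -0.5
    error_factor = 1.0 / (1.0 + (1.0 - p) / p * ratio * math.exp(-z))
    return math.sqrt(2.0 / math.pi) * w * nu * kummer_half(z) * error_factor


# A value far from its prediction: the moment approaches w * |y - y_hat|.
far = continuous_moment(w=10.0, y=100.0, y_hat=0.0, p=0.5, sigma=5.0, nu=1.0)
# A value equal to its prediction: the moment is small but not zero.
near = continuous_moment(w=10.0, y=0.0, y_hat=0.0, p=0.5, sigma=5.0, nu=1.0)
print(far, near)
```

Under this reading, a large deviation from the prediction makes the moment approach the weighted absolute deviation, which matches the simple score of the form weight times |observed minus predicted|.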


  • Implementation

    Categorical variables

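    [Diagram: workflow linking the objects UnitPrioritization, UnitPrioritizationParam, ErrorMoments, ErrorMomentParam, categObsPredModelParam and ErrProbParam–RF through "takes" and "takes to update" arrows.]

    $$S_k = S\left(M^{(1)}, \ldots, M^{(Q)}\right)$$

    Observation-Prediction Model Parameters: $(\omega_k, \hat{p}_k)$

    $$M_{kk} = \omega_k \cdot \hat{p}_k$$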
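A minimal sketch of unit prioritization for categorical variables, assuming each unit has one sampling weight and one estimated error probability per variable. The moment is the product of the two, as on the slide. The global score function S is not specified there, so a plain sum over variables is used as a stand-in, and all names and values are hypothetical.

```python
def categorical_moment(weight, p_error):
    """Error moment for a categorical variable: weight times error probability."""
    return weight * p_error


def global_score(moments):
    """Stand-in for S(M^(1), ..., M^(Q)); a plain sum is an assumption."""
    return sum(moments)


def prioritize(units):
    """Return unit identifiers sorted by decreasing global score."""
    scored = {
        uid: global_score([categorical_moment(w, p) for p in probs])
        for uid, (w, probs) in units.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


# unit id -> (sampling weight, error probabilities for Q = 2 variables)
units = {
    "u1": (100.0, [0.10, 0.05]),
    "u2": (20.0, [0.90, 0.50]),
    "u3": (300.0, [0.02, 0.01]),
}
print(prioritize(units))
```

The unit with the largest weight is not necessarily first: a small-weight unit with very likely errors can outrank it, which is the trade-off the score is meant to capture.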


  • Preliminary Conclusions

    • Same underlying methodological approach as with continuous variables.

    • Effective prioritization of units with influential errors.

    • Details to be worked out for multivariate editing in a fully fledged E&I strategy.

    • Unbalanced learning: sub-/over-sampling, cost-sensitive learning, ensemble techniques...

    • More general threshold selection schemes.

    • Sampling weights to be introduced in the generation of the random forest?

    • If random forests, why not SVMs, boosting, neural networks...?

    • Also for continuous variables?
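As an illustration of one of the unbalanced-learning options listed above, the sketch below performs plain random over-sampling: minority-class rows are resampled with replacement until both classes have the same size. It is a generic technique shown with toy data, not something the slides report having applied, and it assumes both classes are present in the input.

```python
import random


def oversample_minority(rows, labels, seed=0):
    """Duplicate random minority-class rows until the two classes are balanced."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(labels))) + extra
    return [rows[i] for i in idx], [labels[i] for i in idx]


# Toy training set: one error among six units.
rows = ["a", "b", "c", "d", "e", "f"]
labels = [1, 0, 0, 0, 0, 0]
balanced_rows, balanced_labels = oversample_minority(rows, labels)
print(len(balanced_rows), sum(balanced_labels))
```

Over-sampling should be applied to the training set only, so that the test set keeps the error rate of the original sample.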
