Sampling as a way to reduce risk and create a Public Use File maintaining weighted totals
Joint UNECE-Eurostat worksession on confidentiality, 2011, Tarragona
Maria Cristina Casciano, Laura Corallo, Daniela Ichim
Outline
• Multiple releases: MFR and PUF
• Subsampling
– allocation: reduce the risk of disclosure
– selection: pre-defined quality standards
• Results
– Career of Doctorate Holders Survey
• Further work
Multiple countries, multiple surveys, multiple releases
[Diagram: for each country (MS1, MS2, ..., MS27) and each survey (SURVEY1, SURVEY2, ..., SURVEYX) there are multiple releases: TABLES, PUF, MFR, OTHER]
Comparability
• ESSnet on SDC harmonisation and common tools
– WP1: test the comparability concept
– Istat, Destatis, Statistics Austria
– multiple countries
HOW
1. Assessment of the effects of different practices on predefined statistics
2. Definition of a threshold to define when action is needed
3. Setting up a process for choosing acceptable practices
Multiple releases
SURVEY1 TABLES1 PUF1 MFR1 OTHER1
• A particular harmonisation dimension
• Hierarchical structure
– Utility
– Risk of disclosure
Multiple releases: hierarchical structure
MFR
+ Less aggregated information
- More restrictive license
PUF
+ Less restrictive license
- More aggregated information
UNIQUE PRODUCTION PROCESS!
PUF-MFR
• MFR
– definition of a disclosure scenario
– risk assessment R1
– risk limitation w.r.t.
  • adopted disclosure scenario
  • some data utility requirements
• PUF
– harmonized with the MFR (e.g. weighted totals)
– reduced risk of disclosure
– random sample
– internal consistency of records
– some (other) data utility requirements (CV and weighted totals – precision and accuracy)
Data description
Year t-5 Year t-3 Year t
Doctorate Holders CDH 2009 Survey
Estimates by PhD scientific area, by gender and by region
Focus on the characterisation of the occupational status of the PhD holders:
• labour market entry
• usefulness of the PhD for obtaining a job
• type of contract
• type of work
• earnings
• job satisfaction
Data description
18500 PhD holders (census)
• 72% respondents (12964 respondents)
• 28% non-respondents
Adjustment for non-response via calibration: weights obtained by constraining on known marginal distributions:
• Citizenship (2 categories)
• PhD Scientific Area (14 categories)
• Gender
• Region
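The slides do not say which calibration estimator was used. As an illustration only, the sketch below uses raking (iterative proportional fitting), one common way to make weighted totals match known marginal distributions. The function name `rake`, the toy records and the target totals are all hypothetical.

```python
def rake(records, weights, margins, n_iter=50):
    """Adjust weights so that weighted category totals match known margins.

    records: list of dicts, one per respondent (hypothetical toy data)
    weights: initial weights, one per respondent
    margins: {variable: {category: known population total}}
    """
    w = list(weights)
    for _ in range(n_iter):
        for var, targets in margins.items():
            # Current weighted total of each category of this variable.
            current = {}
            for rec, wk in zip(records, w):
                current[rec[var]] = current.get(rec[var], 0.0) + wk
            # Rescale every weight by target / current for its category.
            w = [wk * targets[rec[var]] / current[rec[var]]
                 for rec, wk in zip(records, w)]
    return w


def weighted_total(records, w, var, category):
    return sum(wk for rec, wk in zip(records, w) if rec[var] == category)


# Toy respondents and assumed known totals (both margins sum to 100).
records = [
    {"gender": "F", "citizenship": "national"},
    {"gender": "F", "citizenship": "national"},
    {"gender": "M", "citizenship": "national"},
    {"gender": "M", "citizenship": "foreign"},
    {"gender": "F", "citizenship": "foreign"},
    {"gender": "M", "citizenship": "national"},
]
margins = {
    "gender": {"F": 50.0, "M": 50.0},
    "citizenship": {"national": 80.0, "foreign": 20.0},
}
calibrated = rake(records, [1.0] * len(records), margins)
```

After raking, the weighted counts by gender and by citizenship reproduce the assumed totals, which is the property the slide refers to as constraining on known marginal distributions.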
PUF-subsampling
Simple random sampling
• Utility: weighted totals may always be preserved by calibration
• Risk: how many units at risk are sampled?
• Example (MFR-CDH): 12964 units, 24.7% of units at risk
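To make the risk question concrete: under simple random sampling without replacement, the number of at-risk units that end up in the subsample follows a hypergeometric distribution. The sketch below computes its mean and standard deviation for the MFR-CDH figures on the slide. The subsample size of 5000 is a hypothetical value chosen for illustration, not one taken from the presentation.

```python
import math


def at_risk_in_srs(n_units, share_at_risk, sample_size):
    """Mean and standard deviation of the number of at-risk units
    drawn by simple random sampling without replacement."""
    at_risk = round(n_units * share_at_risk)      # units at risk in the file
    p = at_risk / n_units
    mean = sample_size * p
    # Hypergeometric variance, including the finite population correction.
    var = sample_size * p * (1 - p) * (n_units - sample_size) / (n_units - 1)
    return at_risk, mean, math.sqrt(var)


# 12964 units and 24.7% at risk come from the slide; 5000 is an assumption.
at_risk, mean, sd = at_risk_in_srs(12964, 0.247, 5000)
```

The expected share of at-risk units in the subsample equals their share in the MFR, so simple random sampling does not by itself lower the proportion of risky records.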
Subsampling
[Keyword diagram: allocation, domains, utility, disclosure, sample size, stratification, dissemination, totals, scenario, calibration, key variables, quality, users, auxiliary]
PUF-subsampling: proposal
1. Optimal allocation of units to be sampled in each domain according to Bethel’s approach
(Risk minimization)
2. Selection of a fixed size balanced sample (CUBE method)
(Data utility maximization)
1. Bethel’s approach (1989)
● Cost function to minimize:
$C' = C_0 + \sum_{h=1}^{H} C_h n_h$
● Expected Coefficient of Variation (CV) of the estimates of the total of variable $p$ in domain $d_j$ equal to or lower than prefixed thresholds:
$CV_{d_j p} \le CV^{*}_{d_j p}$
$n_h$ and $C_h$ related to the risk to be reduced
Optimal allocation: $n_h^{*}$
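Bethel's algorithm handles several variables and domains at once and is iterative. The sketch below is not that algorithm: it is the classical closed-form solution for the simplest special case, with one target variable and one CV constraint, which minimizes $\sum_h C_h n_h$ subject to the CV of the estimated total being equal to a threshold. The strata sizes, standard deviations and per-unit costs are hypothetical; in the spirit of the slide, a higher $C_h$ could stand for a stratum with more units at risk.

```python
import math


def cost_optimal_allocation(N, S, C, total, cv_target):
    """Univariate cost-optimal stratified allocation (special case only).

    N, S, C: stratum sizes, stratum standard deviations, per-unit costs
    total: population total of the target variable
    cv_target: required CV of the estimated total
    Assumes the resulting n_h do not exceed N_h (no capping is done).
    """
    v_target = (cv_target * total) ** 2
    num = sum(Nh * Sh * math.sqrt(Ch) for Nh, Sh, Ch in zip(N, S, C))
    den = v_target + sum(Nh * Sh ** 2 for Nh, Sh in zip(N, S))
    scale = num / den
    # n_h is proportional to N_h * S_h / sqrt(C_h).
    return [scale * Nh * Sh / math.sqrt(Ch) for Nh, Sh, Ch in zip(N, S, C)]


def cv_of_total(N, S, n, total):
    """CV of the stratified estimator of the total under allocation n."""
    var = sum(Nh ** 2 * Sh ** 2 / nh - Nh * Sh ** 2
              for Nh, Sh, nh in zip(N, S, n))
    return math.sqrt(var) / total


# Hypothetical strata: the costlier (riskier) strata get relatively fewer units.
N = [1000, 500, 200]
S = [10.0, 20.0, 40.0]
C = [1.0, 2.0, 4.0]
n_opt = cost_optimal_allocation(N, S, C, total=50000.0, cv_target=0.05)
```

Raising the cost of a stratum lowers its share of the sample, which is how a risk measure can steer the allocation while the CV constraint protects precision.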
2. Balanced sampling
A sampling design s is said to be balanced on the auxiliary variables $x = (x_1, \ldots, x_j, \ldots, x_p)'$ if and only if the balancing equations given by
$\hat{X}_\pi = X$
are satisfied, where $X$ is the vector of known population totals and $\hat{X}_\pi$ is the H.-T. (Horvitz-Thompson) estimator.
→ exact estimates for pre-defined variables
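As a small check of what the balancing equations require, the sketch below computes the Horvitz-Thompson estimate of each auxiliary total for a given sample and compares it with the population total. The four-unit population, the auxiliary variables (a constant and a gender indicator) and the inclusion probabilities are hypothetical.

```python
def ht_totals(x, pi, sample):
    """Horvitz-Thompson estimates of the totals of each auxiliary variable.

    x: list of rows, one per population unit, each a list of p values
    pi: inclusion probability of each unit
    sample: indices of the selected units
    """
    p = len(x[0])
    return [sum(x[k][j] / pi[k] for k in sample) for j in range(p)]


def is_balanced(x, pi, sample, tol=1e-9):
    """True if the H.-T. estimates equal the population totals."""
    pop_totals = [sum(row[j] for row in x) for j in range(len(x[0]))]
    estimates = ht_totals(x, pi, sample)
    return all(abs(e - t) <= tol * max(1.0, abs(t))
               for e, t in zip(estimates, pop_totals))


# Toy population: columns are (constant 1, indicator of being female).
x = [[1, 1], [1, 1], [1, 0], [1, 0]]
pi = [0.5, 0.5, 0.5, 0.5]
balanced = is_balanced(x, pi, [0, 2])      # one woman and one man
unbalanced = is_balanced(x, pi, [0, 1])    # two women, no men
```

The first sample reproduces both the population size and the number of women exactly, while the second overestimates the number of women, so only the first satisfies the balancing equations.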
Balanced sampling: the CUBE method
Geometrically, each vertex of the N-cube is a sample: $s \in \{0,1\}^N$.
The balancing equations define a sub-space of $R^N$ named K. The problem is to choose a vertex (sample) of the N-cube that remains in the sub-space of constraints K.
[Figure: the cube for N = 3, with vertices labelled from (000) to (111), intersected by the constraint sub-space K]
Cube method (Deville & Tillé, 2004):
1. Flight phase: a random walk starting from the vector of inclusion probabilities π and moving in the intersection of the cube C and K. It stops at a vertex of the intersection of C and K, if such a vertex exists.
2. Landing phase: at the end of the flight phase, if a sample is not exactly determined in C∩K, a sample is selected as close as possible to the constraint space K.
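The flight phase can be illustrated in the simplest possible case: a single balancing variable equal to the inclusion probability, so that the only constraint is a fixed sample size. A direction that stays in K is then any vector that adds to one coordinate what it subtracts from another. The sketch below is a minimal illustration under that assumption, not the general cube algorithm, and the function name and probabilities are hypothetical.

```python
import random


def flight_fixed_size(pi, rng, eps=1e-9):
    """Random walk from pi to a 0/1 vector that keeps sum(pi) constant.

    Only the fixed-sample-size constraint is handled, and sum(pi) is
    assumed to be an integer so that no landing phase is needed.
    """
    p = list(pi)

    def undecided():
        return [k for k, v in enumerate(p) if eps < v < 1 - eps]

    idx = undecided()
    while len(idx) >= 2:
        i, j = rng.sample(idx, 2)
        # Direction u = e_i - e_j keeps the sum of the coordinates unchanged.
        lam1 = min(1 - p[i], p[j])    # largest step along +u inside the cube
        lam2 = min(p[i], 1 - p[j])    # largest step along -u inside the cube
        # The step probabilities make the walk a martingale: E[p] stays at pi.
        if rng.random() < lam2 / (lam1 + lam2):
            p[i] += lam1
            p[j] -= lam1
        else:
            p[i] -= lam2
            p[j] += lam2
        idx = undecided()
    return [round(v) for v in p]


pi = [0.2, 0.4, 0.6, 0.8, 0.5, 0.5]           # sums to 3
sample = flight_fixed_size(pi, random.Random(2011))
```

Each step pushes at least one coordinate to 0 or 1, so the walk ends at a vertex of the cube. With several balancing variables the direction has to be taken from the kernel of the full constraint matrix, and a vertex inside K may not exist, which is when the landing phase is needed.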
Implementation
1. Determination of the optimal strata sizes in terms of reduction of the overall risk (cost function), keeping the CV level of the estimates below a 5% threshold for three combinations of the allocation and domain variables
Allocation variables: Occup, JobS, Contract, Work, Income
Domain variables: Gender, Region, Scientific Area, Year of Completion
2. Six possible settings, corresponding to different choices of the parameters:
a. Risk R1 used as the minimization cost of the algorithm
b. Risk R1 used as a stratification variable
c. Include all units of the strata containing no units at risk
Allocations (CV* = 5%)
C.S | Risk.cost | Risk.strat | Cens.no.risk | # Strata | #Cens.strata | #Cens.units | Size Bethel | Size Prop. | Size Equal | Max.Bethel-Prop | Max.Bethel-Equal
1   | N | Y | N | 925 | 153 | 252  | 4933  | 5391  | 5550  | 459 | 618
2   | N | Y | Y | 925 | 214 | 704  | 5105  | 5547  | 5550  | 443 | 446
3   | Y | Y | N | 925 | 204 | 558  | 5239  | 5719  | 5550  | 480 | 311
4   | Y | Y | Y | 925 | 235 | 814  | 5330  | 5781  | 5550  | 451 | 220
5   | Y | N | N | 925 | 240 | 687  | 5555  | 5953  | 6475  | 399 | 921
6   | Y | N | Y | 925 | 269 | 983  | 5649  | 6094  | 6475  | 446 | 827
7   | N | Y | N | 925 | 306 | 1614 | 8725  | 9256  | 9250  | 530 | 524
8   | N | Y | Y | 925 | 352 | 1919 | 8827  | 9324  | 9250  | 498 | 424
9   | Y | Y | N | 925 | 416 | 3229 | 8955  | 9424  | 9250  | 468 | 294
10  | Y | Y | Y | 925 | 451 | 3398 | 9045  | 9511  | 9250  | 466 | 205
11  | Y | N | N | 925 | 426 | 3243 | 9151  | 9601  | 9250  | 451 | 100
12  | Y | N | Y | 925 | 457 | 3399 | 9222  | 9669  | 9250  | 446 | 84
13  | N | Y | N | 56  | 0   | 0    | 4745  | 4773  | 4760  | 138 | 132
14  | N | Y | Y | 56  | 28  | 9761 | 10320 | 10346 | 10360 | 166 | 630
15  | Y | Y | N | 56  | 21  | 5844 | 8812  | 8841  | 8848  | 189 | 389
16  | Y | Y | Y | 56  | 28  | 9761 | 10323 | 10349 | 10360 | 166 | 630
17  | Y | N | N | 28  | 0   | 0    | 4760  | 4774  | 4788  | 176 | 88
18  | Y | N | Y | 28  | 0   | 0    | 4759  | 4774  | 4788  | 176 | 88
Allocations
[Figure: Bethel sample size (y-axis, 0 to 12000) against CV (x-axis, 0 to 0.25), one curve for each of the settings 1 to 18]
Balanced sample
Selection of samples of fixed size from the CDH survey:
Utility constraints on:
• the population size N
• the optimal sample size n
• the marginal frequency distributions by Gender, Year of Doctorate Completion and Scientific Area
18 equations
CUBE algorithm:
I. The input vector is the optimal one determined by Bethel
II. The flight phase ends with no exact solution
III. The landing phase starts: selection of a sample which ensures a small deviation from the balancing equations, according to the distance between … and …
Results
Median of absolute relative errors
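The slide does not spell out which quantities are compared. Assuming the errors are those of PUF estimates relative to the corresponding reference (for example MFR) estimates, the summary measure can be computed as in the sketch below. The function name and the numbers are hypothetical.

```python
from statistics import median


def median_abs_rel_error(estimates, references):
    """Median of |estimate - reference| / |reference| over all estimates.

    Assumes every reference value is non-zero.
    """
    return median(abs(e - r) / abs(r) for e, r in zip(estimates, references))


# Hypothetical subsample estimates against hypothetical reference totals.
puf_estimates = [98.0, 205.0, 310.0]
reference_totals = [100.0, 200.0, 300.0]
mare = median_abs_rel_error(puf_estimates, reference_totals)
```

The median is less sensitive than the mean to a few small domains with large relative errors, which makes it a convenient single figure for comparing settings.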
Results
C.S | Risk.cost | Risk.strat | Cens.no.risk | Risk | Occup | JobS | Contract | Work | Income
1   | N | Y | N | 1366 | 0.88  | 0.97 | 0.97 | 0.99 | 0.99
2   | N | Y | Y | 1333 | 0.92  | 0.99 | 0.94 | 0.97 | 0.99
3   | Y | Y | N | 1335 | 0.92  | 0.98 | 0.95 | 0.99 | 0.99
4   | Y | Y | Y | 1354 | 0.87  | 0.99 | 0.95 | 0.97 | 0.99
5   | Y | N | N | 1490 | 0.86  | 0.98 | 0.97 | 0.98 | 0.98
6   | Y | N | Y | 1525 | 0.91  | 0.98 | 0.95 | 0.97 | 0.99
7   | N | Y | N | 2194 | 0.83  | 0.91 | 0.99 | 0.97 | 1.00
8   | N | Y | Y | 2177 | 0.56  | 0.81 | 0.99 | 0.94 | 0.99
9   | Y | Y | N | 2149 | 0.78  | 0.91 | 0.99 | 0.91 | 1.00
10  | Y | Y | Y | 2163 | 0.64  | 0.88 | 0.97 | 0.95 | 0.99
11  | Y | N | N | 2232 | 0.63  | 0.87 | 0.99 | 0.86 | 1.00
12  | Y | N | Y | 2233 | 0.55  | 0.78 | 0.96 | 0.94 | 0.99
13  | N | Y | N | 1272 | 0.96  | 0.99 | 0.92 | 0.96 | 0.98
14  | N | Y | Y | 559  | 0.52  | 0.79 | 0.41 | 0.83 | 0.98
15  | Y | Y | N | 564  | 0.77  | 0.94 | 0.93 | 0.97 | 0.99
16  | Y | Y | Y | 562  | 0.56* | 0.84 | 0.59 | 0.88 | 0.99
17  | Y | N | N | 1270 | 0.95  | 0.99 | 0.98 | 0.99 | 0.99
18  | Y | N | Y | 1247 | 0.91  | 0.99 | 0.98 | 0.99 | 0.98
Further work
1. the relationship between coefficients of variation and disclosure risk, together with different options for including the risk of disclosure in the sampling design;
2. the introduction of a utility-priority approach to dealing with the balancing equations;
3. the use of other data utility constraints.