INFO 7470/ECON 7400 Synthetic Data Creation and Use John M. Abowd and Lars Vilhuber with a big assist from Abigail Cooke, Javier Miranda, Martha Stinson, and Kelly Trageser April 29, 2013


INFO 7470/ECON 7400 Synthetic Data Creation and Use

John M. Abowd and Lars Vilhuber
with a big assist from Abigail Cooke, Javier Miranda, Martha Stinson, and Kelly Trageser

April 29, 2013

© John M. Abowd and Lars Vilhuber 2013, all rights reserved

Outline

• SIPP Synthetic Data
• LBD Synthetic Data


SURVEY OF INCOME AND PROGRAM PARTICIPATION (SIPP) SYNTHETIC DATA


Survey of Income and Program Participation (SIPP)

• Goal of SIPP: accurate info about income and program participation of individuals and households and its principal determinants

• Information:
  – Cash and noncash income on a sub-annual basis
  – Taxes, assets, liabilities
  – Participation in government transfer programs

http://www.census.gov/sipp/intro.html


Background

• In 2001, a new regulation authorized the Census Bureau and SSA to link SIPP and CPS data to SSA and IRS administrative data for research purposes

• Idea for a public use file was motivated by a desire to allow outside access to long administrative record histories of earnings and benefits linked to household demographic data

• These data allow detailed statistical and simulation study of retirement and disability programs

• Census Bureau, Social Security Administration, Internal Revenue Service, and Congressional Budget Office all participated in development


Genesis of the SSB

• A portion of the SIPP user community was primarily interested in national retirement and disability programs

• SIPP augmented with
  – earnings histories from the IRS data maintained at SSA (W-2)
  – benefit data from SSA’s master beneficiary records

• Feasibility assessment (confidentiality!) of adding SIPP variables to earnings/benefit data in a public-use file (PUF)
  – set of variables that could be added without compromising the confidentiality protection of the existing SIPP public use files was VERY limited

• Alternative methods explored


SSB Basic Methodology

• Experiment using “synthetic data”

• In fact: partially synthetic data with multiple imputation of missing items

• Partially synthetic data:
  – Some (at least one) variables are actual responses
  – Other variables are replaced by values sampled from the posterior predictive distribution for that record, conditional on all of the confidential data


History of the SSB

• 2003-2005: Creation, but not release, of three versions of the “SIPP/SSA/IRS-PUF” (SSB)

• 2006: Release to limited public access of SSB V4.2
  – Access to general public only at Cornell-hosted Virtual RDC (SSB server: restricted-access setup)
    • With promise of evaluation of Virtual RDC-run programs on internal Gold Standard
  – Ongoing SSA evaluation
  – Ongoing evaluation at Census (in RDC)

• 2010: Release of SSB V5 at Census and on the Virtual RDC (codebook: http://www.census.gov/sipp/SSB_Codebook.pdf)
  – Restructured to vastly improve analytical validity of SIPP variables

• 2013: Release of SSB V5.1 at Census and on the VirtualRDC (documentation in preparation)
  – First user-initiated variables


Basic Structure of the SSB V4

• SIPP
  – Core set of 125 SIPP variables in a standardized extract of SIPP panels 1990-1993 and 1996
  – All missing data items (except for structurally missing) are marked for imputation

• IRS
  – Maintained at SSA, but derived from IRS records
  – Master summary earnings records (SER)
  – Master detailed earnings records (DER)


Basic Structure of the SSB V4 (II)

• SSA
  – Master Beneficiary Record (MBR)

• Census
  – Numident: administrative birth and death dates

• All files combined using verified SSNs => “Gold Standard”


Basic Structure of SSB V5

• Panels: 1990, 1991, 1992, 1993, 1996, 2001, and 2004 (this variable is now in the SSB)

• Couple-level linkage: the first person to whom the SIPP respondent was married during the time period covered by the SIPP panel

• SIPP variables only appear in years appropriate for the panel indicated by the PANEL variable (biggest change from V4.2)

• Version 5.1: user-requested variables


Missing Values in the Gold Standard

• Values may be missing due to
  – [survey] Non-response
  – [survey] Question not being asked in a particular panel
  – [admin] Failure to link to administrative record (non-validated SSN)
  – [both] Structural missing (e.g., income of spouse if not married)

• All missing values except structural are part of the missing data imputation phase of SSB


Scope of the Synthesis

• Never missing and not synthesized
  – gender
  – marital status
  – spouse’s gender
  – initial type of Social Security benefits
  – type of Social Security benefits in 2000
  – spouse’s benefits type variables

• All other variables in the public use file were synthesized


Common Structure to Multiple Imputation and Synthesis

• Hierarchical tree of variable relationships (parent-child relationship, accounting for structure)

• At each node, independent SRMI is used
  – Statistical model is estimated for each of the variables at the same level (one of):
    • Bayesian bootstrap
    • Logistic regression (with automatic Bayesian variable selection)
    • Linear regression (with automatic Bayesian variable selection)
  – Statistical models are estimated separately for groups of individuals
  – Then, a proper posterior predictive distribution (PPD) is estimated
  – Given a PPD, each variable is imputed/synthesized, conditional on all values of all other variables for that record

• The next node is processed


MI and Synthesis

• Initial iterations for missing data imputation, keeping all observed values where available

• Final iteration is for data synthesis (replacing all observed values, see exceptions)


Latest Release of SSB

• 2010: Limited-public-access release of SSB V5.0

• 2013: Limited-public-access release of SSB V5.1

• Both versions accessed via the VirtualRDC Synthetic Data Server


SIPP Variables

• Codebook


Synthetic Data Creation

• Purpose of synthetic data is to create micro-data that can be used by researchers in the same manner as the original data while preserving the confidentiality of respondents’ identities

• Fundamental trade-off: usefulness and analytical validity of data versus protection from disclosure

• Goal: no one in the already released SIPP public use files can be re-identified, while regression results are still preserved


Multiple Imputation for Confidentiality Protection

• Denote confidential data by $Y$ and non-confidential data by $X$

• […] and […] has no missing data

• PPD: […]

• Complete data: […] from […]

• Synthetic data: […] from […]

• Major emphasis is to find a good estimate of the PPD


Testing Analytical Validity

• Run regressions on each synthetic implicate
  – Average coefficients
  – Combine standard errors using formulae that take account of average variance of estimates (within implicate variance) and differences in variance across estimates (between implicate variance)

• Run regressions on gold standard data

• Compare average synthetic coefficient and standard error to gold standard coefficient and standard error

• Data are analytically valid if coefficient is unbiased and the same inferences are drawn


Formulae: Completed Data Only

• Notation
  – Script […] is the index for missing data implicate
  – $m$ is the total number of missing data implicates

• Estimate from one completed implicate

• Average of statistic across implicates


Formulae: Total Variance and Between Variance

• Total variance of average statistic

• Variance of the statistic across implicates: between variance


Formula: Within Variance

• Variance of the statistic from each completed implicate

• Average variance of statistic: within variance
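The formulas on these slides did not survive in this transcript. As a hedged reconstruction, the standard multiple-imputation combining rules (Rubin) for completed data can be written as below. The symbols $q^{(\ell)}$ (estimate from completed implicate $\ell$), $u^{(\ell)}$ (its estimated variance), $\bar{q}_m$, $b_m$, $\bar{u}_m$, and $T_m$ are notation assumed here for illustration and may differ from the notation on the original slides.

```latex
\begin{align*}
\bar{q}_m &= \frac{1}{m}\sum_{\ell=1}^{m} q^{(\ell)}
  && \text{average of the statistic across implicates} \\
b_m &= \frac{1}{m-1}\sum_{\ell=1}^{m}\bigl(q^{(\ell)} - \bar{q}_m\bigr)^2
  && \text{between variance} \\
\bar{u}_m &= \frac{1}{m}\sum_{\ell=1}^{m} u^{(\ell)}
  && \text{within variance} \\
T_m &= \bar{u}_m + \Bigl(1 + \frac{1}{m}\Bigr) b_m
  && \text{total variance of } \bar{q}_m
\end{align*}
```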


Formulae: Synthetic and Completed

• Notation
  – script […] is the index for missing data implicate
  – script […] is the index for synthetic data implicate
  – $m$ is the total number of missing data implicates
  – $r$ is the total number of synthetic implicates per missing data implicate

• Estimate from one synthetic implicate

• Average of statistic across synthetic implicates


Formulae: Grand Mean and Overall Variance

• Average of statistic across all implicates

• Total variance of average statistic


Formulae: Between Variances

• Variance of the statistic across missing data implicates: between m implicate variance

• Variance of the statistic across synthetic data implicates: between r implicate variance


Formulae: Within Variances

• Variance of the statistic on each implicate

• Average variance of statistic: within variance

• Source: Reiter, Survey Methodology (2004): 235-42.
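The combination of between-m, between-r, and within variances described above can be sketched in code. This is a minimal illustration, assuming the nested combining rules of Reiter (2004) with total variance $(1 + 1/m)\,b - w/r + \bar{u}$. The function name `combine_nested` and the dictionary keys are hypothetical, and the 0-based list indices stand in for the implicate indices on the slides.

```python
from statistics import mean


def combine_nested(q, u):
    """Combine estimates from m completed implicates with r synthetic implicates each.

    q[l][k] is the point estimate and u[l][k] its estimated variance from
    synthetic implicate k of completed implicate l (0-based indices here).
    """
    m, r = len(q), len(q[0])
    q_bar_l = [mean(row) for row in q]        # mean within each completed implicate
    q_bar = mean(q_bar_l)                     # grand mean across all implicates
    # Between-m variance: spread of the per-completed-implicate means.
    b = sum((ql - q_bar) ** 2 for ql in q_bar_l) / (m - 1)
    # Between-r variance: spread of synthetic estimates around their own mean.
    w = sum((q[l][k] - q_bar_l[l]) ** 2
            for l in range(m) for k in range(r)) / (m * (r - 1))
    u_bar = mean(v for row in u for v in row)  # within variance
    total = (1 + 1 / m) * b - w / r + u_bar    # total variance of the grand mean
    return {"grand_mean": q_bar, "between_m": b, "between_r": w,
            "within": u_bar, "total": total}
```

With four completed implicates and four synthetic implicates per completed implicate, `q` and `u` would each be a 4 by 4 nested list, one entry per regression coefficient or mean being validated.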


Example: Average AIME/AMW

• Estimate average on each of the 16 synthetic implicates
  – AvgAIME(1,1), AvgAIME(1,2), AvgAIME(1,3), AvgAIME(1,4),
  – AvgAIME(2,1), AvgAIME(2,2), AvgAIME(2,3), AvgAIME(2,4),
  – AvgAIME(3,1), AvgAIME(3,2), AvgAIME(3,3), AvgAIME(3,4),
  – AvgAIME(4,1), AvgAIME(4,2), AvgAIME(4,3), AvgAIME(4,4)

• Estimate mean for each set of synthetic implicates that correspond to one completed implicate
  – AvgAIMEAVG(1), AvgAIMEAVG(2), AvgAIMEAVG(3), AvgAIMEAVG(4)

• Estimate grand mean of all implicates
  – AvgAIMEGRANDAVG


Example (cont.)

• Between m implicate variance

• Between r implicate variance


Example (cont.)

• Variance of mean from each implicate
  – VAR[AvgAIME(1,1)], VAR[AvgAIME(1,2)], VAR[AvgAIME(1,3)], VAR[AvgAIME(1,4)]
  – VAR[AvgAIME(2,1)], VAR[AvgAIME(2,2)], VAR[AvgAIME(2,3)], VAR[AvgAIME(2,4)]
  – VAR[AvgAIME(3,1)], VAR[AvgAIME(3,2)], VAR[AvgAIME(3,3)], VAR[AvgAIME(3,4)]
  – VAR[AvgAIME(4,1)], VAR[AvgAIME(4,2)], VAR[AvgAIME(4,3)], VAR[AvgAIME(4,4)]

• Within variance


Example (cont.)

• Total Variance

• Use AvgAIMEGRANDAVG and Total Variance to calculate confidence intervals and compare to estimate from completed data


SAS Programs

• Sample programs to calculate total variance and confidence intervals


Results: Average AIME

Average of AIME (Average Indexed Monthly Earnings)/AMW (Average Monthly Wage)

            AVG STAT  Total VAR  Betw. M Var  Betw. R Var  Betw. Var  Within Var  Confidence interval
synthetic     1094.2       91.8         59.3         13.3                   21.1  1074.5 to 1113.9
completed     1142.5       52.8                                 23.4        23.7  1129.3 to 1155.7

*All individuals with TOB_2000=1
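As a rough consistency check on these results, the reported total variances can be recomputed from the displayed components. This sketch assumes the Reiter (2004) rule for the synthetic row and the standard Rubin rule for the completed row, with $m = r = 4$ as in the 4 by 4 example above. The symbols $T_{syn}$ and $T_{comp}$ are labels introduced here for illustration.

```latex
\begin{align*}
T_{syn}  &\approx \Bigl(1 + \tfrac{1}{4}\Bigr)(59.3) - \tfrac{13.3}{4} + 21.1
          = 74.125 - 3.325 + 21.1 = 91.9 \\
T_{comp} &\approx 23.7 + \Bigl(1 + \tfrac{1}{4}\Bigr)(23.4)
          = 23.7 + 29.25 = 52.95
\end{align*}
```

Both values are close to the reported 91.8 and 52.8; the small gaps are about what rounding of the displayed components would produce.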


Public Use of the SIPP Synthetic Beta

• Full version (16 implicates) released to the Cornell VirtualRDC Synthetic Data Server (SDS)

• Any researcher may use these data

• During the testing phase, all analyses must be performed on the Virtual RDC

• Census Bureau research team will run the same analysis on the completed confidential data

• Results of the comparison will be released to the researcher, Census Bureau, SSA, and IRS (after traditional disclosure avoidance analysis of the runs on the confidential data)


Methods for Estimating the PPD

• Sequential Regression Multivariate Imputation (SRMI) is a parametric method where PPD is defined as

\[ p(\tilde{Y} \mid Y_{obs}, X_{obs}) = \int p(\tilde{Y} \mid Y_{obs}, X_{obs}, \theta)\, p(\theta \mid Y_{obs}, X_{obs})\, d\theta \]

• The Bayesian bootstrap (BB) is a non-parametric method of taking draws from the posterior predictive distribution of a group of variables that allows for uncertainty in the sample CDF

• We use BB for a few groups of variables with particularly complex relationships and use SRMI for all other variables
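A parametric draw from a posterior predictive distribution can be sketched for the simplest case. This is an illustration only, assuming a normal linear regression with one centered regressor and a flat (noninformative) prior; it is not the SSB production code. The function name `draw_ppd` and its arguments are hypothetical. Parameters are drawn from their posterior first, and the new values are then drawn given those parameters, which is what integrating over $\theta$ amounts to in simulation.

```python
import math
import random


def draw_ppd(x, y, x_new, rng):
    """Draw one value per element of x_new from the PPD of y = a + b*(x - x_bar) + e."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    xc = [xi - x_bar for xi in x]                       # centered regressor
    sxx = sum(v * v for v in xc)
    b_hat = sum(v * yi for v, yi in zip(xc, y)) / sxx   # least-squares slope
    sse = sum((yi - y_bar - b_hat * v) ** 2 for v, yi in zip(xc, y))
    # Posterior draw of the error variance: SSE divided by a chi-square(n - 2) draw.
    sigma2 = sse / rng.gammavariate((n - 2) / 2, 2.0)
    # With a centered regressor, intercept and slope are independent given sigma2.
    a = rng.gauss(y_bar, math.sqrt(sigma2 / n))
    b = rng.gauss(b_hat, math.sqrt(sigma2 / sxx))
    # Predictive draws: parameter uncertainty plus error-term variation.
    return [rng.gauss(a + b * (xn - x_bar), math.sqrt(sigma2)) for xn in x_new]
```

Calling `draw_ppd` several times with the same data yields different implicates, because both the parameters and the error terms are redrawn each time.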


SRMI Method Details

• Assume a joint density that defines parametric relationships between all observed variables.

• Approximate the joint density by a sequence of conditional densities defined by generalized linear models.

• Same process for completing and synthesizing data

• Synthetic values of some $y_k \in Y$ are draws from:

\[ \tilde{y}_k \sim p_k(\tilde{y}_k \mid Y^m, X^m) = \int p_k(\tilde{y}_k \mid Y^m, X^m, \theta)\, p_k(\theta \mid Y^m, X^m)\, d\theta \]

where $Y^m$, $X^m$ are completed data, and densities $p_k$ are defined by an appropriate generalized linear model and prior


SRMI Details: KDE Transforms

• The SRMI models for continuous variables assume that they are conditionally normal

• This assumption is relaxed by performing a KDE-based transform of groups of related variables

• All variables in the group are transformed to normality, then the PPD is estimated

• The sampled values from PPD are inverse transformed back to the original distribution using the inverse cumulative distribution
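The transform-and-invert idea can be sketched for a single variable. The slides describe transforming groups of related variables, so this univariate version is a simplification. It assumes a Gaussian-kernel KDE with a rule-of-thumb bandwidth, and all function names are hypothetical. A value is mapped to normality through the KDE-based CDF and mapped back by inverting that CDF numerically.

```python
from statistics import NormalDist, stdev

STD_NORMAL = NormalDist()


def rule_of_thumb_bandwidth(data):
    """Silverman-style bandwidth; an assumed choice, not taken from the slides."""
    return 1.06 * stdev(data) * len(data) ** (-1 / 5)


def kde_cdf(v, data, h):
    """CDF of a Gaussian-kernel density estimate evaluated at v."""
    return sum(STD_NORMAL.cdf((v - xi) / h) for xi in data) / len(data)


def to_normal(v, data, h):
    """Map v to the standard normal scale through the KDE-based CDF."""
    return STD_NORMAL.inv_cdf(kde_cdf(v, data, h))


def from_normal(z, data, h, tol=1e-9):
    """Invert to_normal by bisection on the monotone KDE-based CDF."""
    target = STD_NORMAL.cdf(z)
    lo, hi = min(data) - 10 * h, max(data) + 10 * h
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kde_cdf(mid, data, h) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

In a synthesis step, the model would be fitted on the transformed values, draws would be taken on the normal scale, and `from_normal` would return them to the original scale.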


SRMI Example: Synthesizing Date of Birth

• Divide individuals into homogeneous groups using stratification variables
  – example: male, black, age categories, education categories, marital status
  – example: decile of lifetime earnings distribution, decile of lifetime years worked distribution, worked previous year, worked current year

• For each group, estimate an independent linear regression of date of birth on other variables (not used for stratification) that are strongly related


SRMI Example: Synthesizing Date of Birth

• Synthetic date of birth is a random variable

• Before analysis, it is transformed to normal using the KDE-based procedure

• Distribution has two sources of variation:
  – variation in error term in regression model
  – variation in estimated parameters: $\beta$’s and $\sigma^2$

• Synthetic values are draws from this distribution

• Synthetic values are inverse transformed back to the original distribution using the inverse cumulative distribution


Bayesian Bootstrap Method Details

• Divide data into homogeneous groups using similar stratification variables as in SRMI

• Within groups do a Bayesian bootstrap of all variables to be synthesized at the same time.
  – n observations in a group, draw n−1 random variables from uniform (0,1) distribution
  – let $u_0, \dots, u_i, \dots, u_n$ define the ordering of the observations in the group (the sorted draws, with $u_0 = 0$ and $u_n = 1$)
  – $u_i - u_{i-1}$ is the probability of sampling observation i from the group to replace missing data or synthesize data in observation j
  – conventional bootstrap, probability of sampling is 1/n
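The Bayesian bootstrap weights described above can be sketched as follows. This is a minimal illustration for one homogeneous group; the function names are hypothetical and donor selection in the actual SSB processing may differ. The gaps between sorted uniform draws, padded with 0 and 1, serve as the sampling probabilities.

```python
import random


def bayesian_bootstrap_probs(n, rng):
    """Sampling probabilities for n observations: gaps between n-1 sorted uniforms."""
    cuts = sorted(rng.random() for _ in range(n - 1))
    u = [0.0] + cuts + [1.0]                  # u_0 = 0, u_n = 1
    return [u[i] - u[i - 1] for i in range(1, n + 1)]


def draw_donors(records, n_draws, rng):
    """Sample donor records from one group using Bayesian bootstrap probabilities."""
    probs = bayesian_bootstrap_probs(len(records), rng)
    return rng.choices(records, weights=probs, k=n_draws)
```

Replacing `probs` with equal weights of 1/n would give the conventional bootstrap mentioned in the last bullet.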


Creating Synthetic Data

• Begin with base data set that contains only non-missing values

• Use BB to complete missing administrative data – i.e. find donor SSN based on non-missing SIPP variables

• Use SRMI to complete missing SIPP data

• Iterate multiple times – input for iteration 2 is completed data set from iteration 1

• On last iteration, run 4 separate processes to create 4 separate data sets or implicates


Creating Synthetic Data, Cont.

• Synthesis is like one more iteration of data completion, except all observations are treated as missing

• Each completed implicate serves as a separate input file

• Run 16 separate processes to create 16 different synthetic data sets or implicates

• The separate processes to create implicates have different stratification variables

• Need enough implicates to produce enough variation to ensure that averages across the implicates will be close to truth
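The nesting of completed and synthetic implicates can be sketched as a loop. This is a toy illustration, assuming 4 completed implicates with 4 synthetic implicates each, as in the AvgAIME(1,1) through AvgAIME(4,4) example. The functions `complete` and `synthesize` are hypothetical stand-ins that resample observed values; they are not the BB/SRMI models themselves.

```python
import random


def complete(data, rng):
    """Toy stand-in for completion: fill each None with a random observed value."""
    observed = [v for v in data if v is not None]
    return [v if v is not None else rng.choice(observed) for v in data]


def synthesize(completed, rng):
    """Toy stand-in for synthesis: every value is treated as missing and redrawn."""
    return [rng.choice(completed) for _ in completed]


def build_implicates(data, m=4, r=4, seed=0):
    """Return a dict keyed by (completed index, synthetic index), 1-based."""
    rng = random.Random(seed)
    implicates = {}
    for l in range(1, m + 1):
        completed = complete(data, rng)                      # one completed implicate
        for k in range(1, r + 1):
            implicates[(l, k)] = synthesize(completed, rng)  # r synthetic implicates each
    return implicates
```

The keys of the returned dictionary match the (completed, synthetic) index pairs used when combining estimates across implicates.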


Features of Synthesizing Routines

• Parent-child relationships
  – foreign-born and decade of arrival in US
  – welfare participation and welfare amount
  – presence of earnings, amount of earnings

• Restrictions on draws from PPD
  – Some draws must be within a pre-specified range from the original value: example MBA is +/- $50 of original value.
  – impose maximum and minimum values on some variables
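One way to impose such restrictions is to redraw until the value falls in the allowed range. The slides do not say how the restrictions are enforced, so the rejection approach, the clamping fallback, and the name `restricted_draw` are all assumptions made for illustration.

```python
import math
import random


def restricted_draw(draw, original=None, max_dev=None,
                    lower=-math.inf, upper=math.inf, max_tries=1000):
    """Call draw() until the value satisfies the range restrictions.

    original/max_dev restrict the draw to original +/- max_dev;
    lower/upper impose absolute minimum and maximum values.
    """
    lo, hi = lower, upper
    if original is not None and max_dev is not None:
        lo, hi = max(lo, original - max_dev), min(hi, original + max_dev)
    value = draw()
    for _ in range(max_tries):
        if lo <= value <= hi:
            return value
        value = draw()
    return min(max(value, lo), hi)   # fallback: clamp the last draw into range
```

For example, a draw for a benefit amount with an original value of 1000 and `max_dev=50` would always land between 950 and 1050.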


External Researcher Validation

• Version 4.0
  – 12 projects
  – 1 was submitted for validation

• Version 5.0
  – 31 projects
  – 6 were submitted for validation


Validation Details

• Henriques, Alice (2012) “How does Social Security claiming respond to incentives? Considering husbands’ and wives’ benefits separately”

• Armour, Philip (2012) “The role of information in disability insurance take-up: An analysis of the Social Security statement phase-in”

• Bertrand, Marianne, Emir Kamenica and Jessica Pan, “Gender identity and relative income within households”


From Bertrand et al.

Timeline: SDS application November 2012, gold standard results January 2013


SYNTHETIC LONGITUDINAL BUSINESS DATABASE


The Synthetic Longitudinal Business Database

Based on presentations by Kinney/Reiter/Jarmin/Miranda/Reznek/Abowd on July 31, 2009 at the Census-NSF-IRS Synthetic Data Workshop

Kinney/Reiter/Jarmin/Miranda/Reznek/Abowd (2011) “Towards Unrestricted Public Use Microdata: The Synthetic Longitudinal Business Database.”, CES-WP-11-04

Work on the Synthetic LBD was supported by NSF Grant ITR-0427889, and ongoing work is supported by the Census Bureau. A portion of this work was conducted by Special Sworn Status researchers of the U.S. Census Bureau at the Triangle Census Research Data Center. Research results and conclusions expressed are those of the authors and do not necessarily reflect the views of the Census Bureau. Results have been screened to ensure that no confidential data are revealed.


Overview

• LBD background

• Synthetic data generation

• Analytic validity

• Confidentiality protection

• Future plans


Elements

(Economic Surveys and Censuses)
Issue: (item) non-response
Solution: LBD

(Business Register)
Issue: inexact link records
Solution: LBD

Match-merged and completed complex integrated data
Issue: too much detail leads to disclosure issue
Solution: Synthetic LBD

Public-use data with novel detail

Novel analysis using public-use data with novel detail
Issue: are the results right
Solution: Early release/SDS


The Real LBD

• Economic census covering nearly all private non-farm business establishments with paid employees
  – Contains: Annual payroll and Mar 12 employment (1976-2005), SIC/NAICS, Geography (down to county), Entry year, Exit year, Firm structure

• Used for looking at business dynamics, job flows, market volatility, international comparisons…


Longitudinal Business Database (LBD)

• Detailed description in Jarmin and Miranda

• Developed as a research dataset by the U.S. Census Bureau Center for Economic Studies

• Constructed by linking annual snapshot of the Census Bureau’s Business Register (see Lecture 4)


Longitudinal Business Database II

• CES constructed

• Longitudinal linkages (using probabilistic record linking, see Lecture 10)

• Re-timed multi-unit births and

• Edits and imputations for missing data


Access to the LBD

• Different levels of access

• Public use tabulations – Business Dynamics Statistics http://www.ces.census.gov/index.php/bds

• “Gold Standard” confidential micro-data available through the Census Research Data Center (RDC) Network
  – Most used dataset in the RDCs


Bridge between the Two

• Synthetic data set
  – Available outside the Census RDC
  – Providing as much analytical validity as possible
  – Reduce the number of requests for special tabulations
  – Aid users requiring RDC access

• Experiment in public use business micro-data


Why Synthetic Data?

• Concerns about confidentiality protection for census of establishments
  – LBD is a test case for business data

• Criteria given for public release:
  – No actual values of confidential variables could be released
  – Should provide valid inferences while protecting confidentiality


Generic Structure

• Gold standard: given by internal LBD (already completed)

• Partially synthetic:
– Unsynthesized:
    • County (but not released!) [x1]
    • SIC [x2]
– Synthesized:
    • Birth [y1] and death [y2] year
    • Multi-unit status [y3]
    • Employment (March 12) [y4]
    • Payroll [y5]


Synthesis: General Approach

• Y=[y1|y2|y3|y4|y5]
• X=[x1|x2]
• Generate joint distribution of Y|X by sampling from conditionals
– f(y1,y2,y3|X) = f(y1|X)·f(y2|y1,X)·f(y3|y1,y2,X)

• Use SIC as “by group”
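The factorization on this slide is spelled out for the first three variables only. As a sketch, the same chain rule written for all five synthesized variables would read as follows. This is the general identity, not a statement of the exact conditioning sets used in production, which may condition on a subset of the earlier variables.

```latex
f(Y \mid X) \;=\; \prod_{k=1}^{5} f\left(y_k \mid y_1, \ldots, y_{k-1}, X\right)
```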


General Approach to Synthesis

• Drawing from f(yk|X,y1,...,yk-1)
– Fit model using observed data
– Draw new values of parameters from posterior distributions
– Use new parameters to predict yk from X and synthetic values of y1,...,yk-1
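The three steps above can be sketched for a toy case with one predictor per model. This is a minimal illustration, not the SynLBD code: it assumes a normal linear model with the standard noninformative prior p(a, b, sigma^2) proportional to 1/sigma^2, and the function names (`fit_centered_ols`, `draw_parameters`, `synthesize`) and the simulated data are hypothetical.

```python
import random
import statistics


def fit_centered_ols(x, y):
    """Fit y = a + b * (x - mean(x)) by least squares; return summary statistics."""
    n = len(x)
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    sse = sum((yi - y_bar - b_hat * (xi - x_bar)) ** 2 for xi, yi in zip(x, y))
    return {"n": n, "x_bar": x_bar, "a_hat": y_bar, "b_hat": b_hat,
            "sxx": sxx, "sse": sse}


def draw_parameters(fit, rng):
    """Draw (a, b, sigma) once from the posterior under the assumed prior."""
    df = fit["n"] - 2
    # sigma^2 | data = SSE / chi-square(df); chi-square(df) is Gamma(df/2, scale 2)
    sigma2 = fit["sse"] / rng.gammavariate(df / 2.0, 2.0)
    # With a centered predictor, a and b are independent given sigma^2
    a = rng.gauss(fit["a_hat"], (sigma2 / fit["n"]) ** 0.5)
    b = rng.gauss(fit["b_hat"], (sigma2 / fit["sxx"]) ** 0.5)
    return a, b, sigma2 ** 0.5


def synthesize(fit, predictor, rng):
    """Predict new values from freshly drawn parameters plus residual noise."""
    a, b, sigma = draw_parameters(fit, rng)
    return [rng.gauss(a + b * (p - fit["x_bar"]), sigma) for p in predictor]


# Toy data (hypothetical): x is unsynthesized, y1 depends on x, y2 depends on y1.
rng = random.Random(0)
x = [float(i % 10) for i in range(200)]
y1 = [2.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]
y2 = [1.0 + 1.5 * v + rng.gauss(0, 1) for v in y1]

fit1 = fit_centered_ols(x, y1)    # model for y1 | X, fit on observed data
fit2 = fit_centered_ols(y1, y2)   # model for y2 | y1, fit on observed data

syn_y1 = synthesize(fit1, x, rng)        # predict y1 from X
syn_y2 = synthesize(fit2, syn_y1, rng)   # predict y2 from *synthetic* y1
```

Note that each model is fit on observed values but evaluated on the synthetic values of earlier variables, which is what keeps the synthetic record internally consistent.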


The Sequential Regression Multivariate Imputation (SRMI) Approach

• Calendar:
– Step 1: Impute y1 | X
– Step 2: Impute y2 | [y1| f(X)]
    • Where f(X) uses state [x1’] instead of county [x1]
• Type of firm
– Step 3: Impute y3 | [y1|y2|X]
• Characteristics
– Step 4: Impute y4(t)|[y1|y2|y3|y4(t-1)|x2]
– Step 5: Impute y5(t)|[y1|y2|y3|y4(t)|y5(t-1)|x2]


First Year

• Impute y1 (Firstyear) | SIC, County using variant of Dirichlet-Multinomial
– Prior information is obtained by collapsing categories
– Synthetic values obtained from sampling from multinomial distribution
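A minimal sketch of this idea, under stated assumptions: the prior pseudo-counts are taken to be the year shares in a coarser (collapsed) cell scaled by a hypothetical `prior_weight`, and the detailed cell is assumed to be a subset of the collapsed cell. The slide only says a "variant" of the Dirichlet-multinomial is used, so the exact prior construction here is illustrative.

```python
import random


def draw_dirichlet(alphas, rng):
    """Draw one probability vector from Dirichlet(alphas) via normalized gammas."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]


def synthesize_first_year(cell_counts, collapsed_counts, n_draws, rng,
                          prior_weight=1.0):
    """Sample synthetic first years for one detailed cell (e.g. SIC by county).

    cell_counts:      year -> count observed in the detailed cell
    collapsed_counts: year -> count in the coarser cell that supplies the prior
    """
    # Only years seen in the collapsed cell are treated as feasible.
    years = sorted(y for y, c in collapsed_counts.items() if c > 0)
    collapsed_total = sum(collapsed_counts[y] for y in years)
    # Posterior Dirichlet parameter = prior pseudo-count + observed count.
    alphas = [prior_weight * collapsed_counts[y] / collapsed_total
              + cell_counts.get(y, 0) for y in years]
    probs = draw_dirichlet(alphas, rng)
    return rng.choices(years, weights=probs, k=n_draws)


# Hypothetical counts, for illustration only.
rng = random.Random(1)
cell = {1980: 3, 1985: 1}
collapsed = {1978: 40, 1980: 120, 1985: 60, 1990: 30}
synthetic_years = synthesize_first_year(cell, collapsed, n_draws=4, rng=rng)
```

Because the prior gives every feasible year a positive parameter, a year that never appears in the detailed cell (1978 or 1990 here) can still be drawn.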


Last Year

• Impute y2 (Last Year)| First Year, State, SIC
• Simple multinomial approach
– Dirichlet-multinomial with flat prior
– Sample from multinomial probabilities obtained from matching categories in observed data


Multi-unit Status

• Impute in two stages:
– Categorical response: Always MU, sometimes MU, never MU
– Imputed using simple multinomial approach

• Given change in status occurs, impute when change occurred (future)


Employment and Payroll

• Highly skewed longitudinal continuous variables
• Imputed using a set of normal linear models with KDE transformation of response (Abowd and Woodcock, 2004)
• Impute year by year, employment and then payroll, based on groups
– (3-digit SIC)
– by (multiunit status)
– by (continuer status)
– by (top 5% status)
• If model too sparse, use 2-digit SIC as prior
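One way to read "KDE transformation of response" is: estimate the response's distribution function with a kernel density estimate, map each value to the standard normal scale, fit the normal linear model there, and map synthetic draws back. The sketch below shows only the forward and inverse transformation. It assumes a Gaussian kernel and a rule-of-thumb bandwidth; the helper names and the simulated sample are hypothetical, and the details in Abowd and Woodcock (2004) may differ.

```python
import random
from statistics import NormalDist, stdev

STD_NORMAL = NormalDist()


def rule_of_thumb_bandwidth(sample):
    """Normal-reference bandwidth (an assumption, not taken from the slides)."""
    return 1.06 * stdev(sample) * len(sample) ** (-0.2)


def kde_cdf(y, sample, h):
    """Distribution function implied by a Gaussian-kernel density estimate."""
    return sum(STD_NORMAL.cdf((y - s) / h) for s in sample) / len(sample)


def to_normal_scale(y, sample, h):
    """Map y to z so that z is roughly standard normal."""
    p = min(max(kde_cdf(y, sample, h), 1e-10), 1.0 - 1e-10)
    return STD_NORMAL.inv_cdf(p)


def from_normal_scale(z, sample, h, iterations=80):
    """Invert the transformation by bisection on the KDE distribution function."""
    target = STD_NORMAL.cdf(z)
    lo, hi = min(sample) - 10.0 * h, max(sample) + 10.0 * h
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if kde_cdf(mid, sample, h) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical skewed sample standing in for establishment employment.
rng = random.Random(2)
sample = [rng.lognormvariate(3.0, 1.0) for _ in range(200)]
h = rule_of_thumb_bandwidth(sample)

z0 = to_normal_scale(20.0, sample, h)       # forward transformation
y_back = from_normal_scale(z0, sample, h)   # recovers a value close to 20.0
```

A normal linear model would then be fit to the transformed values within each group, and synthetic draws on the z scale passed through `from_normal_scale` to return to the original units.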


Analytical Validity Tests

• Compare observed data and synthetic data for whole LBD

• Job creation and destruction
• Employment volatility
• Gross employment levels
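For readers who want to reproduce the job-flow comparisons, a minimal sketch of the rate calculation follows. The slides do not define the rates, so this assumes the common Davis-Haltiwanger-Schuh convention, in which flows are divided by the average of employment in the two years. The function name and the example data are hypothetical.

```python
def job_flow_rates(emp_prev, emp_curr):
    """Job creation, destruction, and net rates (percent) between two years.

    emp_prev, emp_curr: dict mapping establishment id -> employment.
    An establishment missing from a year is treated as having zero employment,
    so births count as creation and deaths as destruction.
    """
    created = 0.0
    destroyed = 0.0
    denominator = 0.0
    for est in set(emp_prev) | set(emp_curr):
        e0 = emp_prev.get(est, 0)
        e1 = emp_curr.get(est, 0)
        change = e1 - e0
        if change > 0:
            created += change
        else:
            destroyed += -change
        denominator += 0.5 * (e0 + e1)   # average employment (assumed convention)
    if denominator == 0:
        raise ValueError("no employment in either year")
    return {
        "creation": 100.0 * created / denominator,
        "destruction": 100.0 * destroyed / denominator,
        "net": 100.0 * (created - destroyed) / denominator,
    }


# Hypothetical establishments: A expands, B contracts, C dies, D is born.
previous_year = {"A": 10, "B": 20, "C": 5}
current_year = {"A": 15, "B": 10, "D": 5}
rates = job_flow_rates(previous_year, current_year)
```

The same function can be applied to the confidential LBD and to each implicate separately, and the implicate results averaged to obtain an implicate mean for comparison.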


Job Destruction Rates: LBD and Implicates by Year

[Figure: horizontal axis Year (1977 to 2000); vertical axis 0 to 50; series: LBD, Implicate 1, Implicate 2, Implicate (Mean)]

Job Creation Rates: LBD and Implicates by Year

[Figure: horizontal axis Year (1977 to 2000); vertical axis 0 to 50; series: LBD, Implicate 1, Implicate 2, Implicate (Mean)]


Job Creation from Births: LBD and Implicates by Year

[Figure: horizontal axis Year (1977 to 2000); vertical axis 0 to 10,000 (Thousands); series: LBD, Implicate 1, Implicate 2, Implicate (Mean)]


Job Creation from Births and Expansions: LBD and Implicates by Year

[Figure: horizontal axis Year (1977 to 2000); vertical axis 0 to 40,000 (Thousands); series: LBD, Implicate 1, Implicate 2, Implicate (Mean)]


Net Job Creation Rates: LBD v Implicates

[Figure: horizontal axis 1977 to 2000; vertical axis -10 to 15; series: Net Job Creation LBD, Net Job Creation Implicate 1, Net Job Creation Implicate 2]


Employment Volatility: Establishment by Year, weighted

[Figure: horizontal axis Year (1977 to 2000); vertical axis 0.2 to 1; series: Volatility (LBD, Weighted), Volatility (Imp 1, Weighted), Volatility (Imp 2, Weighted), Volatility (Imp-Mean, Weighted)]


Employment: LBD and Implicates by Year

[Figure: horizontal axis Year (1977 to 1999); vertical axis Count, 0 to 500000; series: LBD, Synthetic]


Confidentiality Protection

• Unavailable in SynLBD V2 (current on SDS)
– Firm structure
– Firm linkages (across time, across implicates)
– Geography
• Basic protection
– Replacing sensitive values with draws from probability distributions


Disclosure Avoidance Review

• High probability that an individual establishment’s synthetic birth/death year is different from its actual birth/death year

• Synthetic maxima not necessarily near actual
• High between-imputation variability at establishment level


Synthesizing Firstyear (Birth) and Lastyear (Death)

• Positive probability exists of producing any feasible birth year, and substantial probability exists that synthesized firstyear is not the actual firstyear

• Table on next slide shows this: prob(actual birth year = synthetic birth year | synthetic birth year) is low
• Similar results hold for deaths
• Conclusions: establishment lifetimes are random, so users can’t accurately attach establishment identifications to them
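The conditional probability quoted above can be estimated, for someone holding both the confidential and the synthetic file, by tabulating matches within each synthetic birth year. The sketch below assumes the two lists are aligned record by record; the function name and the example values are hypothetical.

```python
from collections import defaultdict


def match_rate_by_synthetic_year(actual_years, synthetic_years):
    """Estimate prob(actual == synthetic | synthetic year) for each synthetic year."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for actual, synthetic in zip(actual_years, synthetic_years):
        totals[synthetic] += 1
        if actual == synthetic:
            matches[synthetic] += 1
    return {year: matches[year] / totals[year] for year in sorted(totals)}


# Hypothetical paired records (actual year, synthetic year).
actual = [1980, 1980, 1985, 1990, 1990, 1990]
synthetic = [1980, 1985, 1985, 1980, 1990, 1985]
rates = match_rate_by_synthetic_year(actual, synthetic)
```

A low value for a given synthetic year means that seeing that year in the released file says little about the establishment's true birth year.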


Example: Year of birth


Confidentiality Protection: Breaking Firm Links

• Firm characteristics not synthesized
• Firm characteristics more skewed than establishment characteristics
• Cannot link multi-unit establishments to their firms


Confidentiality Protection: Breaking Links Across Implicates

• Synthetic observations with the same LBDnum across implicates are not generated from the same LBD establishment

• Can’t group (across implicates within year) observations generated from same establishment


Confidentiality Protection: Synthesizing Employment and Payroll

• Synthesis models are essentially regressions with transformed variables

• Synthesis captures low-dimensional relationships and sacrifices higher-dimensional ones

• Synthesized employment and payroll vary substantially around regression lines

• Synthesized employment and payroll vary significantly from observed values


Example: Correlations Among Actual and Synthetic Data

• SIC 573 - year 2000

Pearson Correlation Coefficients, SIC 573, Year: 2000
(value in parentheses: the count reported beneath each coefficient)

                      Employment      Synthetic Employment   Payroll         Synthetic Payroll
Employment            1 (41000)
Synthetic Employment  0.003 (21100)   1 (41000)
Payroll               0.712 (41000)   -0.012 (21100)         1 (41000)
Synthetic Payroll     0.007 (21100)   0.444 (41000)          0.004 (21100)   1 (41000)
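The table reports a count beneath each coefficient, and the counts differ between cells, which suggests each coefficient is computed on the observations available for that pair of variables. A minimal sketch of such a pairwise-complete Pearson correlation follows. Missing values are represented by `None`, which is an assumption made for illustration, as are the function name and the example data.

```python
import math


def pearson_pairwise(u, v):
    """Pearson correlation using only records where both values are present.

    Returns (correlation, number of pairs used).
    """
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_u = sum(a for a, _ in pairs) / n
    mean_v = sum(b for _, b in pairs) / n
    suu = sum((a - mean_u) ** 2 for a, _ in pairs)
    svv = sum((b - mean_v) ** 2 for _, b in pairs)
    suv = sum((a - mean_u) * (b - mean_v) for a, b in pairs)
    if suu == 0 or svv == 0:
        raise ValueError("correlation undefined for a constant variable")
    return suv / math.sqrt(suu * svv), n


# Hypothetical values; None marks a record where the variable is unavailable.
employment = [1, 2, 3, None, 5]
payroll = [2, 4, 6, 8, None]
r, n_used = pearson_pairwise(employment, payroll)
```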


Conclusions

• Analytical validity supported for broad analyses
– Issues with some details
– Obtain user feedback to inform future refinements
• Sufficient confidentiality protection
– Basic metrics show strong protection
– Differential privacy protection not yet verified


Ongoing Work at Census

– Include NAICS, geography, changes in multiunit status, firm age and size
– Multiple imputations for release
– Address bias in job creation/destruction
– Extend time series


External Validation Exercises

• 41 approved projects (includes provisional approvals)

• 3 have submitted results for validation (one of these did two rounds of validation)

• Moscarini timeline: application approved March 2011, validation results released September 2011
