Page 1

Statistical Analysis Using Combined Data Sources: Discussion

2011 JPSM Distinguished Lecture, University of Maryland

Michael Elliott¹

¹University of Michigan School of Public Health

April 2011

Michael Elliott Statistical Analysis Using Combined Data Sources: Discussion

Page 2

“Complete” (Ideal) vs. Observed (Messy) Data

“Complete” data $d_U$ vs. observed data $d_s$:

Data with measurement error ($X$ vs. $X^*$): last month vs. this month employee count.

Partial $Z$: “large” area IDs vs. small area IDs.

Non-overlapping covariates: health outcomes $X$ vs. behavioral risk factors $W$.

Multiple surveys.

Page 3

Missing Information Principle

Would like to make inference about $Q(Y_1, \ldots, Y_N, X_1, \ldots, X_N) = Q(X, Y)$ using $d_U$.

Stuck with making inference about $Q(X, Y)$ using $d_s$.

Chambers: Use “missing information principle” to obtain likelihood inference about $\theta \equiv \theta_N \equiv Q(X, Y)$ using $d_s$ by replacing sufficient statistics $S(d_U)$ for $\theta_N$ by $E(S(d_U) \mid d_s)$.

EM algorithm.

Replace likelihood inference with pseudo-likelihood inferences to accommodate unequal probability sample designs.
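As a rough illustration of replacing sufficient statistics by their conditional expectations, the sketch below runs EM in a deliberately simple setting that is not taken from the talk: an auxiliary variable x is assumed known for every population unit, y is observed only on the sample, and y given x is assumed normal with a linear mean. The function name `em_missing_y` and the starting values are hypothetical.

```python
import random


def em_missing_y(x, y, n_iter=300):
    """EM for Y | X ~ N(a + b*x, s2) when x is known for all N units
    and y is None for units outside the sample (assumed toy setup)."""
    N = len(x)
    sx = sum(x)
    sxx = sum(v * v for v in x)
    a, b, s2 = 0.0, 0.0, 1.0  # crude starting values
    for _ in range(n_iter):
        # E-step: expected complete-data sufficient statistics given d_s
        t_y = t_xy = t_yy = 0.0
        for xi, yi in zip(x, y):
            if yi is None:
                ey = a + b * xi        # E(Y_i | x_i, theta)
                eyy = ey * ey + s2     # E(Y_i^2 | x_i, theta)
            else:
                ey, eyy = yi, yi * yi  # observed values are their own expectation
            t_y += ey
            t_xy += xi * ey
            t_yy += eyy
        # M-step: complete-data MLE with S(d_U) replaced by E(S(d_U) | d_s)
        b = (t_xy - sx * t_y / N) / (sxx - sx * sx / N)
        a = (t_y - b * sx) / N
        s2 = (t_yy - a * t_y - b * t_xy) / N
    return a, b, s2


if __name__ == "__main__":
    rng = random.Random(2011)
    x = [rng.gauss(0, 1) for _ in range(500)]
    # y is observed for the first 100 units only
    y = [1.0 + 2.0 * xi + rng.gauss(0, 1) if i < 100 else None
         for i, xi in enumerate(x)]
    a, b, s2 = em_missing_y(x, y)
    # predicted population mean of Y: observed values plus fitted values elsewhere
    q = sum(yi if yi is not None else a + b * xi for xi, yi in zip(x, y)) / len(x)
    print(a, b, s2, q)
```

In this toy case the unsampled units carry no information about the regression parameters, so the iterations settle on the complete-case least squares fit; the gain from the auxiliary x shows up in the predicted population mean rather than in the coefficients.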

Page 4

Missing Information Principle

Alternative to calibration: calibration data becomes part of $d_s$, and replace data in MLE/PMLE score equations with expected value conditional on calibration totals.

Chambers shows that this can improve efficiency over uncalibrated estimators, and avoid bias due to failure of the implied calibration model.

Probabilistic linkage: allows for bias correction in estimating equations using probabilistic linked datasets.

Combining data from multiple surveys using MIP extension of missing data algorithms.

Page 5

Bayesian Survey Inference

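Focus on inference about $Q(X, Y)$ using posterior predictive distribution based on $p(Y_{nobs}, X_{nobs} \mid y, x)$:

$$
\begin{aligned}
p(Y_{nobs}, X_{nobs} \mid y, x) &= \frac{p(Y, X)}{p(y, x)} = \frac{\int p(Y, X \mid \theta)\, p(\theta)\, d\theta}{p(y, x)} \\
&= \frac{\int p(Y_{nobs}, X_{nobs} \mid y, x, \theta)\, p(y, x \mid \theta)\, p(\theta)\, d\theta}{p(y, x)} \\
&= \int p(Y_{nobs}, X_{nobs} \mid y, x, \theta)\, p(\theta \mid y, x)\, d\theta
\end{aligned}
$$

(Ericson 1969; Scott 1977; Rubin 1987).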
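The last integral suggests a two-stage simulation: draw $\theta$ from $p(\theta \mid y, x)$, then draw the non-observed values given $\theta$. The sketch below does this for the finite population mean under assumptions that are not from the slides: no covariates, a normal model for y, simple random sampling, and the noninformative prior $p(\mu, \sigma^2) \propto 1/\sigma^2$. The function name `draw_population_mean` is hypothetical.

```python
import math
import random
from statistics import mean, variance


def draw_population_mean(y, N, rng):
    """One posterior predictive draw of the finite population mean Q
    from a sample y out of N units (assumed normal model, flat prior)."""
    n = len(y)
    ybar, s2 = mean(y), variance(y)
    # Stage 1, theta ~ p(theta | y): (n-1)s2/sigma2 is chi-square with n-1 df,
    # and a chi-square with k df is a gamma with shape k/2 and scale 2
    sigma2 = (n - 1) * s2 / rng.gammavariate((n - 1) / 2, 2.0)
    mu = rng.gauss(ybar, math.sqrt(sigma2 / n))
    # Stage 2, Y_nobs ~ p(Y_nobs | y, theta): only the mean of the
    # N - n non-observed values is needed for Q
    ybar_nobs = rng.gauss(mu, math.sqrt(sigma2 / (N - n)))
    return (n * ybar + (N - n) * ybar_nobs) / N


if __name__ == "__main__":
    rng = random.Random(42)
    y = [rng.gauss(10, 2) for _ in range(50)]
    draws = sorted(draw_population_mean(y, 1000, rng) for _ in range(2000))
    # central 95% posterior predictive interval for Q
    print(draws[49], draws[1949])
```

The sorted draws give the full posterior predictive distribution of Q directly, so intervals can be read off as quantiles instead of relying on a normal approximation.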

Page 6

Bayesian Survey Inference vs. MIP

Both Bayesian survey inference and the missing information principle focus on prediction of missing elements in the population conditional on observed data.

Bayesian approach obtains full posterior predictive distribution of $Q(X, Y)$, rather than point estimate and asymptotic normality assumptions.

EM vs. MCMC (“stochastic EM”).

Bayesian survey inference should yield similar inference to MIP, at least in large samples.

Page 7

Applications of Bayesian Survey Inference

(Item level) missing data (Rubin 1987, Little and Rubin 2002).

Weight trimming (Elliott and Little 2000, Elliott 2007, 2008, 2009).

Disclosure risk (Raghunathan et al. 2003, Reiter 2005).

Combining data from multiple surveys (Raghunathan et al. 2007, Davis et al. 2010).

Page 8

Combining Data from Multiple Complex Sample Design Surveys (Dong 2011)

Motivating example: obtain inference about $Q(Y, X, W)$ when only two of the three variables are contained in any one survey.

Surveys use different designs and data collection methods → different sampling and nonsampling error properties.

Cannot simply pool data for analysis.

Page 9

Combining Data from Multiple Complex Sample Design Surveys (Dong 2011)

Borrow from disclosure risk literature to generate synthetic populations using data from each survey.

Each generated population “uncomplexes” sample design to create what is effectively a simple random sample.

Page 10

Combining Data from Multiple Complex Sample Design Surveys (Dong 2011)

Pool data and use standard imputation approaches to fill in missing variables for the data from each survey.

Page 11

Combining Estimates from Synthetic Populations Generated by a Single Survey (Raghunathan et al. 2003)

Suppose we have synthetic populations $P_l$, $l = 1, \ldots, L$ generated from $P(Y_{nobs}, X_{nobs} \mid y, x)$.

Estimate $Q = Q(X, Y)$ with $Q_L = L^{-1} \sum_l Q_l$, where $Q_l$ is an (asymptotically) unbiased estimator of $Q$ generated from $P_l$.

$\mathrm{var}(Q_L) = T_L = (1 + L^{-1}) B_L - U_L$, where $B_L = (L - 1)^{-1} \sum_l (Q_l - Q_L)(Q_l - Q_L)^T$ is between-imputation variance and $U_L = L^{-1} \sum_l U_l$ is the average of the within-imputation variances $U_l$ of $Q_l$.

$T_L \approx B_L$ if generated sample large enough that $U_L$ can be ignored and number of synthetics $L$ large enough that $1/L$ can be ignored.

Need a bit more care if synthetic populations are generated from different surveys, since posterior predictive distribution models from different surveys may not be the same.
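The single-survey combining rule above can be sketched for a scalar $Q$ as follows. The function name `combine_synthetic` and the input layout (one estimate and one within-imputation variance per synthetic population) are assumptions made for illustration.

```python
from statistics import mean, variance


def combine_synthetic(q, u):
    """Combine scalar estimates q[l] with within-imputation variances u[l]
    from L synthetic populations: T_L = (1 + 1/L) * B_L - U_L."""
    L = len(q)
    q_bar = mean(q)      # Q_L, average of the L estimates
    b = variance(q)      # B_L, between-imputation variance (divisor L - 1)
    u_bar = mean(u)      # U_L, average within-imputation variance
    t = (1 + 1 / L) * b - u_bar
    return q_bar, b, u_bar, t


if __name__ == "__main__":
    # made-up numbers, for illustration only
    print(combine_synthetic([15.1, 15.6, 15.3, 15.4], [0.004, 0.005, 0.004, 0.005]))
```

Because $U_L$ is subtracted, nothing in the formula prevents a negative value of $T_L$ when $L$ is small; generating large synthetic samples, so that $U_L$ is negligible and $T_L \approx B_L$, sidesteps this.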

Page 12

Combining Estimates from Synthetic Populations Generated by Multiple Surveys (Dong et al. 2011)

Generate synthetic populations $P^s_l$, $l = 1, \ldots, L$ for surveys $s = 1, \ldots, S$ from $P(Y^s_{nobs}, X^s_{nobs} \mid y^s, x^s)$.

Account for complex sample design features; regard synthetic data as SRS from population.

No need to actually generate entire population, just sample large enough that between-imputation variance swamps within-imputation variance.

Page 13

Combining Estimates from Synthetic Populations Generated by Multiple Surveys (Dong et al. 2011)

For each survey, obtain $Q^s_L$ as the synthetic population point estimator of $Q$ for survey $s$ and $B^s_L$ as its variance. If

$$ \hat{Q}^s \sim N(Q, U^s) $$

$$ Q^s_l \mid y^s, x^s \sim N(\hat{Q}^s, B^s), $$

then as $L \to \infty$, posterior predictive distribution of $Q$ approximately $N(Q_\infty, B_\infty)$, with

$$ Q_\infty = \frac{\sum_s Q^s_\infty / B^s_\infty}{\sum_s 1 / B^s_\infty}, \qquad B_\infty = \frac{1}{\sum_s 1 / B^s_\infty}. $$

t approximation available for small L.
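The large-$L$ combining rule is a precision-weighted average, which a short sketch makes concrete. The inputs below are the “None” (uninsured) point estimates and standard errors for NHIS, BRFSS and MEPS reported in the insurance coverage application later in the talk; treating the squared standard errors as stand-ins for $B^s_\infty$ is an assumption of this sketch, and the function name `combine_surveys` is hypothetical.

```python
import math


def combine_surveys(q, b):
    """Precision-weighted combination of survey-level estimates q[s]
    with variances b[s]; returns (Q_inf, B_inf)."""
    precisions = [1.0 / v for v in b]
    total = sum(precisions)
    q_comb = sum(p * est for p, est in zip(precisions, q)) / total
    return q_comb, 1.0 / total


# "None" row of the application table: NHIS, BRFSS, MEPS
q_none = [17.8, 15.4, 13.2]
se_none = [0.43, 0.18, 0.38]  # assumption: SE squared used in place of B^s_inf
q_comb, b_comb = combine_surveys(q_none, [se ** 2 for se in se_none])
print(round(q_comb, 1), round(math.sqrt(b_comb), 2))  # prints 15.3 0.15
```

This is close to, though not identical with, the Combined column reported for that row (15.3, SE .16), which is what one would expect if the reported figures come from the full synthetic population procedure rather than from this shortcut.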

Page 14

Combining Estimates from Synthetic Populations Generated by Multiple Surveys: Extension to Missing Data (Dong et al. 2011)

Synthesize, then impute.

Imputation for missing components in each survey obtained by stacking $P^s_l$, $s = 1, \ldots, S$ and treating as SRS from population.

Multiply impute $m = 1, \ldots, M$ complete datasets for each of the $L$ synthetic populations.

Reseparate into the surveys and obtain $Q^s_{ml}$ for each of the multiply imputed synthetic datasets.

$$ Q^s_L = L^{-1} \sum_l Q^s_{M,l} \quad \text{for} \quad Q^s_{M,l} = M^{-1} \sum_m Q^s_{ml}. $$

$$ B^s_L = \frac{1 + L^{-1}}{L - 1} \sum_l \left(Q^s_{M,l} - Q^s_L\right)^2 + \frac{1 + M^{-1}}{L} \sum_l (M - 1)^{-1} \sum_m \left(Q^s_{ml} - Q^s_{M,l}\right)^2 $$

Combine survey-level predictive distributions as in the complete data case on previous slide.
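For a single survey $s$, the point estimate and variance for the synthesize-then-impute case can be computed from an $L \times M$ grid of estimates. The sketch below assumes a scalar $Q$ and a nested-list layout `q[l][m]`; both the layout and the function name `combine_synthetic_mi` are hypothetical.

```python
from statistics import mean, variance


def combine_synthetic_mi(q):
    """q[l][m] is the estimate from imputation m of synthetic population l
    for one survey. Returns (Q_L, B_L) for that survey."""
    L, M = len(q), len(q[0])
    q_ml = [mean(row) for row in q]   # Q_{M,l}: average over imputations
    q_l = mean(q_ml)                  # Q_L: average over synthetic populations
    # first term: (1 + 1/L)/(L - 1) * sum_l (Q_{M,l} - Q_L)^2
    between_synthetic = (1 + 1 / L) * variance(q_ml)
    # second term: (1 + 1/M)/L * sum_l (M - 1)^{-1} sum_m (Q_ml - Q_{M,l})^2
    between_imputation = (1 + 1 / M) * mean([variance(row) for row in q])
    return q_l, between_synthetic + between_imputation


if __name__ == "__main__":
    # made-up grid with L = 3 synthetic populations and M = 2 imputations
    print(combine_synthetic_mi([[15.2, 15.5], [15.0, 15.4], [15.6, 15.3]]))
```

The returned pair plays the role of $(Q^s_L, B^s_L)$ and can be fed into the same precision-weighted combination used in the complete data case.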

Page 15

Generating Synthetic Populations from Posterior Predictive Distribution

Derivation of predictive distribution ignores sampling indicator $I$; this requires (Rubin 1987)

Unconfounded sampling

$$ P(I \mid Y, X) = P(I \mid y, x) $$

Independence of $I$ and $(Y_{nobs}, X_{nobs})$ given $y$, $x$, and $\theta$

$$ P(Y_{nobs}, X_{nobs} \mid y, x, I, \theta) = P(Y_{nobs}, X_{nobs} \mid y, x, \theta) $$

Maintaining these assumptions requires:

Probability sample

Model $p(Y, X)$ attentive to design features and robust enough to sufficiently capture all aspects of the distribution of $Y, X$ relevant to $Q(Y, X)$.

Page 16

Non-parametric Posterior Predictive Distribution

Generate the $l$th synthetic population as follows:

Account for stratification and clustering by drawing a Bayesian bootstrap sample of the clusters within each stratum.

For stratum $h$ with $C_h$ clusters, draw $C_h - 1$ random variables from $U(0, 1)$ and order $a_1, \ldots, a_{C_h - 1}$; sample $C_h$ clusters with replacement with probability $a_c - a_{c-1}$, where $a_0 = 0$ and $a_{C_h} = 1$.

Use finite population Bayesian bootstrap Polya urn scheme (Lo 1988) extended to account for selection weights (Cohen 1997) to generate unobserved elements of the population within each cluster $c$ in stratum $h$:

Draw a sample of size $N_{ch} - n_{ch}$ by drawing $(y_k, x_k)$ from the $i$th unit among the $n_{ch}$ sampled elements with probability

$$ \frac{w_i - 1 + l_{i,k-1} (N_{ch} - n_{ch})}{N_{ch} - n_{ch} + (k - 1)(N_{ch} - n_{ch})} $$

where $l_{i,k-1}$ is the number of bootstrap draws of the $i$th unit among the previous $k - 1$ bootstrap selections.

Repeat $F$ times for each bootstrapped cluster.
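The two resampling steps can be sketched as below. This is a minimal illustration rather than the implementation used in the application: the function names are hypothetical, the weights within a cluster are assumed to sum to $N_{ch}$ with each $w_i \ge 1$, and the per-draw urn increment is exposed as a parameter (defaulting to $N_{ch} - n_{ch}$, as in the selection probability stated on the slide) so that other increments can be tried.

```python
import random


def bayesian_bootstrap_clusters(clusters, rng):
    """Resample C clusters with replacement, with selection probabilities
    given by the gaps between C - 1 ordered U(0, 1) draws."""
    C = len(clusters)
    cuts = [0.0] + sorted(rng.random() for _ in range(C - 1)) + [1.0]
    probs = [cuts[c] - cuts[c - 1] for c in range(1, C + 1)]
    return rng.choices(clusters, weights=probs, k=C)


def weighted_polya_draws(weights, rng, increment=None):
    """Indices of the N - n unobserved units generated from the n sampled
    units of one cluster, where N is taken to be the sum of the weights."""
    n = len(weights)
    n_new = round(sum(weights)) - n          # N_ch - n_ch
    if increment is None:
        increment = n_new                    # urn increment per earlier selection
    counts = [0] * n                         # l_{i,k-1}
    draws = []
    for _ in range(n_new):
        # numerators of the selection probabilities; rng.choices normalises
        # them, which supplies the denominator automatically
        urn = [weights[i] - 1 + counts[i] * increment for i in range(n)]
        i = rng.choices(range(n), weights=urn, k=1)[0]
        counts[i] += 1
        draws.append(i)
    return draws


if __name__ == "__main__":
    rng = random.Random(7)
    print(bayesian_bootstrap_clusters(["A", "B", "C", "D"], rng))
    print(weighted_polya_draws([3.0, 2.0, 5.0], rng))
```

A unit with weight 1 represents only itself, so its urn mass starts at zero and it is never selected to fill in unobserved units, while heavily weighted units are drawn more often and become still more likely after each selection.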

Page 17

Application: Distribution of Insurance Coverage

Estimating 2006 insurance coverage using Behavioral Risk Factor Surveillance System (BRFSS), National Health Interview Survey (NHIS), Medical Expenditure Panel Survey (MEPS).

NHIS and MEPS split insured into private and public: impute BRFSS using gender, race, region, education, age, and income.

Generate synthetic populations using non-parametric Bayesian bootstrap method.

Page 18

Application: Distribution of Insurance Coverage

Type of coverage           NHIS      BRFSS      MEPS      Combined
                           (n=76K)   (n=356K)   (n=34K)
Point Est. (%)   Private   74.6      -          73.4      75.9
                 Public     7.5      -          13.3       8.6
                 None      17.8      15.4       13.2      15.3
SE               Private    .50      -           .53       .23
                 Public     .25      -           .38       .16
                 None       .43       .18        .38       .16

Page 19

The Role of the Model

Bayesian bootstrap is very robust: in most settings it doesn’t provide efficiency gains over design-based methods, but “combining data” situations are likely exceptions.

In application, some issues arise:

BRFSS estimator may be biased (low response rate, telephone-only sampling frame).

MEPS is a subsample of previous year’s NHIS.

More traditional modeling approaches needed?

MIP uses $E\{\partial_\theta \log f(d_U) \mid d_s\}$.

Target quantity of interest if model is misspecified? Solution to population score equation? Can we think of $\theta \equiv \theta_N \equiv Q(X, Y)$?

Use estimating equation methodology to obtain consistent estimators of $\theta$.

Page 20

Thanks

Qi Dong

Trivellore Raghunathan

NCI grant R01-CA129101

Page 21

References

Cohen, M.P. (1997). The Bayesian bootstrap and multiple imputation for unequal probability sample designs. Proceedings of the Survey Research Methods Section, American Statistical Association, 635-638.

Davis, W.W., Parsons, V.L., Xie, D., Schenker, N., Town, M., Raghunathan, T.E., and Feuer, E.J. (2010). State-based estimates of mammography screening rates based on information from two health surveys. Public Health Reports, 125, 567-578.

Dong, Q. (2011). A principled method to combine information from multiple complex surveys. Unpublished Ph.D. thesis, University of Michigan.

Elliott, M.R., and Little, R.J.A. (2000). Model-based alternatives to trimming survey weights. Journal of Official Statistics, 16, 191-209.

Elliott, M.R. (2007). Bayesian weight trimming for generalized linear regression models. Survey Methodology, 33, 23-34.

Elliott, M.R. (2008). Model averaging methods for weight trimming. Journal of Official Statistics, 24, 517-540.

Elliott, M.R. (2009). Model averaging methods for weight trimming in generalized linear regression models. Journal of Official Statistics, 25, 1-20.

Ericson, W.A. (1969). Subjective Bayesian models in sampling finite populations. Journal of the Royal Statistical Society, B31, 195-234.

Little, R.J.A., and Rubin, D.B. (2002). Statistical Analysis with Missing Data, 2nd Ed. New York: Wiley.

Lo, A.Y. (1987). A large sample study of the Bayesian bootstrap. Annals of Statistics, 15, 360-375.

Raghunathan, T.E., Reiter, J.P., and Rubin, D.B. (2003). Multiple imputation for statistical disclosure limitation. Journal of Official Statistics, 19, 1-16.

Raghunathan, T.E., Xie, D., Schenker, N., Parsons, V.L., Davis, W.W., Dodd, K.W., and Feuer, E.J. (2007). Combining information from two surveys to estimate county-level prevalence rates of cancer risk factors and screening. Journal of the American Statistical Association, 102, 474-486.

Reiter, J.P. (2005). Releasing multiply imputed, synthetic public use microdata: An illustration and empirical study. Journal of the Royal Statistical Society, Series A: Statistics in Society, 168, 185-205.

Rubin, D.B. (1987). Multiple Imputation for Non-response in Surveys. New York: Wiley.

Scott, A.J. (1977). Large sample posterior distributions in finite populations. The Annals of Mathematical Statistics, 42, 1113-1117.
