Nonparametric Bayesian Data Analysis for Causal Inference
Part I: Nonparametric Models
Steve MacEachern
The Ohio State University
with thanks to many
Bayesian Causal Inference Workshop
Mathematical Biosciences Institute
The Ohio State University
June 2019
1
An overview of the talk
• Motivation for np Bayesian methods – Full support
• Density estimation
– birthweights: kde and Bayes
• The Dirichlet process
– a distribution on countably discrete distributions
• Dependent Dirichlet processes
– from one distribution to a collection of distributions
• Modelling with nonparametric processes
– the structure of the problem
2
Bayesian methods
• Bayesian methods are supported by theoretical development
– Subjective probability: de Finetti, Savage, Lindley, . . . (SB)
∗ the axioms of rational behavior (along with some regularity
conditions/devices) lead inexorably to the conclusion that (i)
each of us has a prior distribution, and (ii) optimal actions
(choices, decisions) must be driven by Bayes’ Theorem.
– Decision theory (DTB)
∗ (nearly) all agree on the concept of admissibility
∗ inadmissible decision rules should not be used, unless accounting
for the gap between data and formal decision problem, for
computational implementation, or because we can’t find a
dominating rule
∗ the entire set of admissible decision rules consists of Bayes
rules and near-Bayes rules
• Both perspectives tell us that we should be Bayesian
• Neither perspective tells us how to be Bayesian
3
Motivation for nonparametric Bayesian methods
• The problem:
– estimate the probability of obtaining Heads when a coin is
flipped once. Data on n independent flips of the coin
• The model:
p ∼ F
X|p ∼ Binom(n, p)
• The prior:
– F = δ0.5 or F ∼ Unif(0.4, 0.6)
• If the closure of the support of F is a proper subset of [0, 1], we
may have trouble
• At best, we get the “closest” point in the support to the true p
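A small numerical sketch of this trouble (hypothetical data, not from the talk: 70 heads in 100 flips, so p̂ = 0.7) compares a grid posterior under a full-support uniform prior with one under the Unif(0.4, 0.6) prior, which piles up near the support boundary:

```python
import math

def grid_posterior_mean(lo, hi, n, heads, m=2001):
    # grid approximation to the posterior mean of p under a
    # Uniform(lo, hi) prior and Binomial(n, p) likelihood
    grid = [lo + (hi - lo) * k / (m - 1) for k in range(m)]
    loglik = [heads * math.log(p) + (n - heads) * math.log(1.0 - p) for p in grid]
    mx = max(loglik)                       # stabilize before exponentiating
    w = [math.exp(l - mx) for l in loglik]
    tot = sum(w)
    return sum(p * wi / tot for p, wi in zip(grid, w))

# hypothetical data: 70 heads in 100 flips
full = grid_posterior_mean(1e-6, 1 - 1e-6, 100, 70)   # full-support prior
narrow = grid_posterior_mean(0.4, 0.6, 100, 70)       # Unif(0.4, 0.6) prior
```

Under the narrow prior the posterior concentrates just inside 0.6, the "closest" point of the support to the truth, no matter how much data arrive.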
4
Priors to posteriors
[Figure: four panels of prior and posterior densities for p on [0, 1]]
• Sample size 10 in top plots; 100 in bottom plots. All have p̂ = 0.7
• Left plots: Uniform(0, 1); Uniform(0.4, 0.6)
• Right plots: Uniform, arcsine, Beta(10, 1). Priors in black; posteriors in blue
5
Motivation for nonparametric Bayesian methods
• The rule resulting from a prior with small support seems silly (SB)
and does not dominate sensible rules such as the MLE (DTB)
• This leads to the following principle for an adequate Bayesian
model
– We must use a prior distribution with full support
• Asymptotics are a key step in justifying the principle
– viewed by some as anathema to Bayes
– a lack of consistency suggests the inconsistent estimator will
be bettered by a consistent one at some point
– rates of convergence parallel consistency. A bad rate implies a
dispersed posterior leading to poor large-sample behavior
• Jeffreys favored a prior of beta form with point masses added at
0 and 1. After Heads and Tails are both observed, only the beta
prior is relevant
6
Further motivation for np Bayes
• False consistency
– suppose that X1, X2, . . . ∼ F
– rather than observing X directly, we observe a categorized
version of X
– Yi = [Xi]
– if our model takes F to be a normal distribution, our posterior
eventually concentrates on the best fitting normal for the Y s
– the distribution of Xi|Yi degenerates
– although we have never seen an X, we claim to have perfect
knowledge of its distribution
• A prior with full support will lead to degeneracy where we observe
data and will not lead to degeneracy where we don’t
• Particularly relevant for causal inference. The posterior based on
a massive amount of observational data need not (perhaps should
not) concentrate in all aspects.
7
Density estimation
• Birthweight of single, live births (kernel density estimates)
• Boys slightly larger than girls
[Figure: kernel density estimates of birthweight for boys and girls; x-axis Birthweight, y-axis Density]
• Distribution at birth is skewed left
• Note bumps in left tail
8
A major use for np Bayes
• Kernel density estimate
– smoothed version of histogram
∗ Scott’s development
∗ from histogram to average shifted histogram
(uniform kernel) to kde
– average of kernels – here normal densities
– one term in average per data point
• Bayesian analog?
– hierarchical model with discrete latent structure
– mixture of densities – here, focus on normal densities
– unbounded number of components
• Latent structure allows us to make (weak) use of models
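The kernel density estimate itself is just an average of kernels, one per data point. A minimal sketch with a normal kernel and a few hypothetical birthweights (illustrative numbers, not the talk's data):

```python
import math

def kde(x, data, h):
    # kernel density estimate: average of normal densities,
    # one centered at each data point, common bandwidth h
    const = 1.0 / (h * math.sqrt(2.0 * math.pi))
    return sum(const * math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data) / len(data)

# toy birthweights in grams (hypothetical)
data = [2900.0, 3100.0, 3300.0, 3400.0, 3600.0]
d = kde(3300.0, data, h=200.0)

# an average of densities is a density: it integrates to (about) 1
total = sum(kde(g, data, 200.0) for g in range(2000, 4601, 10)) * 10
```

The Bayesian analog replaces this fixed average with a mixture whose components and weights are random, with an unbounded number of components.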
9
Results of the fit
• The birthweight data, fitting location-scale models for boys / girls
[Figure: four panels of predictive densities of birthweight; girls (left) and boys (right), two model fits (top and bottom rows)]
• We get a great fit, but what have we gained by using Bayes?
• Gain is a formal inferential framework, ability to examine spe-
cific assumptions, ability to inject some structure into estimate,
“optimality” of Bayes procedures, etc.
10
A prior on distribution functions
• Ferguson laid out two basic properties that a NP Bayesian model
should satisfy
– full support (in some relevant space, say the real line)
– ability to perform the updates (at the time, conjugacy)
• We all know how this works for our starter problem of the coin
• The likelihood is a product of iid Bernoullis (alternatively, a bi-
nomial)
f(Y1, . . . , Yn | p) = ∏_{i=1}^{n} p^{Yi} (1 − p)^{1−Yi}
• The conjugate prior has a form which matches the likelihood
– we recognize the matching form as that of a beta random variable
• For a more refined notion of conjugacy, see Diaconis & Ylvisaker
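The conjugate beta-binomial update amounts to adding the head and tail counts to the beta parameters. A short sketch with a toy flip sequence (assumed for illustration):

```python
def beta_binomial_update(a, b, flips):
    # conjugate update: Beta(a, b) prior, iid Bernoulli data;
    # posterior is Beta(a + #heads, b + #tails)
    heads = sum(flips)
    return a + heads, b + len(flips) - heads

# Beta(1, 1) (uniform) prior, toy data: 7 heads in 10 flips
a, b = beta_binomial_update(1.0, 1.0, [1, 0, 1, 1, 0, 1, 1, 1, 1, 0])
post_mean = a / (a + b)   # (1 + 7) / (2 + 10) = 2/3
```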
11
Prior predictive under the Dirichlet process
• The Dirichlet process model for data (here θi)
F ∼ Dir(α)
θi|F ∼ F
• For any (measurable) set A, partition R into A and AC
• A Dirichlet-multinomial model (Beta-binomial) on A and AC
• Predictive probability of θi ∈ A is (with sloppy notation)
P(θi ∈ A) = E[F(A)] = α(A)/α(R)
• Thus, the predictive distribution for an observation from this model
is driven by the rescaled base measure
– typical to replace α with M · F0, writing F ∼ Dir(MF0)
– base measure is α, base distribution is F0
– prior predictive distribution is F0
12
Posterior predictive under the Dirichlet process
• The Dirichlet process model for data (here θi)
F ∼ Dir(α)
θi|F ∼ F
• The posterior distribution is a Dirichlet process with an updated
base measure
F | θ1, . . . , θn ∼ Dir(α + ∑i δθi)
θj|F ∼ F, j > n
• The same problem with a new base measure. Answer is the same
– new α is MF0 + ∑i δθi = MF0 + nF̂n, where F̂n is the empirical cdf
– posterior predictive distribution is
(M/(M + n)) F0 + (n/(M + n)) F̂n
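This mixture form of the posterior predictive can be simulated directly: with probability M/(M + n) draw from F0, otherwise resample one of the observed θi. A sketch with toy values (M = 2, three observed θ, F0 = N(0, 1) assumed):

```python
import random

def dp_posterior_predictive_draw(M, base_draw, observed, rng):
    # with prob M/(M+n): a fresh draw from the base distribution F0;
    # otherwise: resample an observed theta_i (the empirical-cdf part)
    n = len(observed)
    if rng.random() < M / (M + n):
        return base_draw(rng)
    return rng.choice(observed)

rng = random.Random(1)
obs = [1.0, 1.0, 2.5]                      # toy observed theta values
draws = [dp_posterior_predictive_draw(2.0, lambda r: r.gauss(0, 1), obs, rng)
         for _ in range(20000)]
# fraction of draws that repeat an old value should be near n/(M+n) = 3/5
frac_old = sum(d in (1.0, 2.5) for d in draws) / len(draws)
```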
13
Sethuraman’s stick-breaking
• The Dirichlet process has many representations
– limit of Dirichlet-multinomial distributions
– rescaled gamma process
– Polya urn scheme (aka, Chinese restaurant process)
• Blackwell and MacQueen’s predictive view leads to the stick-
breaking version of the process
θ∗i iid∼ F0
Vi iid∼ Beta(1, M),   pi = Vi ∏_{j<i} (1 − Vj)
F = ∑i pi δθ∗i
• See Sethuraman & Tiwari and Sethuraman
• Formally, a discrete distribution with a countable number of atoms
– base cdf, F0, gives distribution for locations
– mass parameter, M , leads to weights
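Sethuraman's representation suggests a simple truncated simulation: keep breaking the stick until the leftover mass falls below a tolerance. A sketch assuming F0 = N(0, 1):

```python
import random

def stick_breaking(M, base_draw, eps, rng):
    # truncation of Sethuraman's countable representation:
    # draw atoms from F0 and breaks from Beta(1, M) until the
    # remaining stick length drops below eps
    atoms, weights, remaining = [], [], 1.0
    while remaining > eps:
        v = rng.betavariate(1.0, M)
        atoms.append(base_draw(rng))
        weights.append(remaining * v)      # p_i = V_i * prod_{j<i} (1 - V_j)
        remaining *= (1.0 - v)
    return atoms, weights

rng = random.Random(7)
atoms, weights = stick_breaking(M=5.0, base_draw=lambda r: r.gauss(0, 1),
                                eps=1e-8, rng=rng)
total = sum(weights)   # all but at most eps of the mass is captured
```

Larger M yields many small weights (finer sticks); small M concentrates mass on a few atoms.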
14
The Mixture of Dirichlet Processes (MDP) model
• Discrete distributions are wonderful, but they are of limited use
– technical challenges – here, no σ-finite dominating measure
– usefulness challenges – far from traditional modelling
• Antoniak describes the mixture of Dirichlet processes model. E.g.
F ∼ Dir(MF0)
G = N(0, σ2)
X ∼ H = F ∗G
• This is the most popular version of a smoothed DP
– a countable mixture of normals
– DP provides latent structure
– fits seamlessly in a hierarchical model
• MDP or DPM? Antoniak named it first – hence MDP
15
Computation with the MDP model
• The MDP model relies on a latent structure involving the θi
• Rewrite the MDP model as
F ∼ Dir(MF0)
θi|F ∼ F
εi ∼ N(0, σ2)
Xi = θi + εi
• The identical model, but now we have θi to work with
– allows us to avoid infinite dimensional F
• Escobar developed MCMC computation for the MDP model in
his dissertation
– separate development of Gibbs sampler
• Driven by Polya urn scheme
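The Polya urn view also yields a forward simulation of this latent structure: θ1 ∼ F0, and each later θi either repeats an earlier value (joining its cluster) or is a fresh draw from F0 with probability M/(M + i − 1); then Xi = θi + εi. A minimal sketch with F0 = N(0, 1) assumed:

```python
import random

def simulate_mdp(n, M, sigma, rng):
    # Polya-urn forward simulation of the latent theta_i,
    # then Xi = theta_i + eps_i with eps_i ~ N(0, sigma^2)
    thetas = []
    for i in range(n):
        if rng.random() < M / (M + i):
            thetas.append(rng.gauss(0, 1))    # new cluster from F0 = N(0, 1)
        else:
            thetas.append(rng.choice(thetas)) # tie to an earlier theta
    xs = [t + rng.gauss(0, sigma) for t in thetas]
    return thetas, xs

rng = random.Random(3)
thetas, xs = simulate_mdp(n=500, M=1.0, sigma=0.1, rng=rng)
n_clusters = len(set(thetas))   # grows roughly like M * log(n)
```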
16
Additional computational issues
• Advances in computation followed rapidly
– drove innovations in MCMC that are used much more widely
– ideas flow from the many representations of the DP
• MCMC algorithms for the MDP model are early examples of (i)
transdimensional MCMC, as the number of clusters varies while the
algorithm runs, (ii) marginalization and the benefits that follow
from it, (iii) overparametrization followed by marginalization, (iv)
reparameterization to enhance mixing, (v) use of nonidentifiable
models to enhance mixing, and (vi) the role of identifiability in
MCMC diagnostics
• Additional refinements include split-merge samplers, parallel methods,
variational approximations, . . .
• Methods extend to complex models in which DP is small part
17
From one distribution to a collection of distributions
• The birthweights again
[Figure: the kernel density estimates of birthweight again; x-axis Birthweight, y-axis Density]
• A nonparametric model is needed
• Does the distribution’s shape remain the same through time (age)?
18
Growth curves
• In addition to birthweight, weight is tracked through time (months)
Age LSD Mean USD RSD
0 0.5 3.2 0.4 0.80
1 0.6 4.0 0.5 0.83
20 1.2 11.2 1.2 1.00
40 1.6 14.8 2.1 1.31
60 1.9 17.7 2.8 1.47
100 3.8 26.0 5.8 1.53
• LSD is lower standard deviation; USD is upper standard deviation;
RSD = USD/LSD
• Skewness changes through time, from left-skewed to right-skewed
• Bump in left tail disappears
• Changes in center, spread, skewness, shape
• Oddity with mean, lower sd, upper sd
19
Weighted least squares (Wang)
• Two motivations for weighted least squares
– scale family
– convolution family
• Both motivations imply the same model under normal theory
• Models differ under non-normal error distribution
• Getting the model right matters
– impact on posterior inference for regression coefficients
– impact on assessment of outlyingness
– impact on predictions via predictive distribution
• Nonparametric Bayes models can separate two motivations and
supply a bridge of models between them
20
Weisberg’s apple shoots data – a convolution family?
[Figure: three normal Q–Q plots of standardized residuals, one per level of aggregation; x-axis Theoretical Quantiles, y-axis Sample Quantiles]
• A set of normal probability plots (qq plots), broken down by the
level of aggregation
• The “data” are the standardized residuals from a decent regression
model
• These plots show movement toward normality–or do they?
21
Smoothed probability plots
• Avoid jumpiness from points entering/exiting plot
• Allow comparison across different sample sizes
• Focus on pattern, not individual cases
• Get a kde of the density; extract the plot
[Figure: three kernel-density-smoothed probability plots; x-axis theoretical quantiles, y-axis kernel density smoothed quantiles]
• We can write models which capture the convolution
• Control amount of averaging; from none to much
22
Dependent nonparametric processes
• Early attempts to induce dependence are either very restrictive
or very jumpy
• Static forms of dependence have limited support:
– regression with an arbitrary residual distribution
Yi = βTxi + εi,   εi ∼ F;   F ∼ smoothed DP(MF0)
∗ the residual distribution is the same for all βTxi and all xi
– Mixed model; match notion of random effect to focus on dist’n
whereas fixed effect is matched to notion of number
Yij = βTxi + γj + εij, γj ∼ F ; F ∼ DP (MF0)
∗ the γj are independent with common distribution
∗ the distribution cannot vary with the covariate
23
Early attempts
• Wild forms of dependence
– base measure of DP changes with x
– (conditionally) independent realizations of Fx
– this form of dependence breaks any tie beyond parameters of
the base measure
• Suppose base measure moves continuously (both base cdf and
mass)
– for x1 ≈ x2, realized Fx1 is independent of realized Fx2
– lose ability to pool information from nearby x
– lack of consistency for regression problems (one obs’n per value
of x)
• Takeaway – continuity of distributions in x is important
• Motivating case, growth of individuals through time – we have a
strong belief in continuity for individuals, hence for pop’n dist’n
24
A comprehensive picture in regression
• A nonparametric model is needed at each level of x
• The residual distribution must evolve smoothly as x changes
• The changes go beyond an additive effect
• So, why the success of traditional regression models?
– fitting procedure (least squares or robust version) excels at
picking out central trend of response, given covariates
– many traditional questions focus on central tendency
• There is growing interest in prediction
– focus is on distribution for individual case – given x
– no central limit theorem applies to the individual case
– case diagnostics are also about the individual case
– (for classical folks, growth in quantile regression is notable)
• Counterfactuals are all about prediction
25
An easy extension (MacEachern)
• Replace each component of the model with a grander object, in-
dexed by covariate, x ∈ X
– F → Fx
– θ∗i → θ∗i,x
– Vi → Vi,x
• Easy case is finite, discrete X
– locations are iid from multivariate F0
– breaks are iid from multivariate with Beta(1,Mx) marginals
– weights constructed from breaks
• For each x, a stickbreaking construction
θ∗i,x iid∼ F0,x
Vi,x iid∼ Beta(1, Mx),   pi,x = Vi,x ∏_{j<i} (1 − Vj,x)
26
Toward the canonical construction
• Need a standard way to build the needed multivariate components
– copula models, especially the Gaussian copula
• F0 a multivariate distribution with specified marginals
– begin with Z a multivariate normal with N(0, 1) marginals
– probability integral transform maps to U(0, 1) marginals
– inverse cdf transform maps to desired marginals
• Beta(1,M)? Same argument
– begin with Z a multivariate normal with N(0, 1) marginals
– probability integral transform maps to U(0, 1) marginals
– inverse beta cdf transform maps to desired marginals
• Computation is messier and slower than for simple MDP model,
but MCMC proceeds along well established lines
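The PIT / inverse-cdf recipe is easy to sketch for Beta(1, M) marginals, since that cdf inverts in closed form: F(v) = 1 − (1 − v)^M, so F^{−1}(u) = 1 − (1 − u)^{1/M}. A bivariate version with an assumed Gaussian-copula correlation of 0.9 (all numbers illustrative):

```python
import math
import random

def norm_cdf(z):
    # probability integral transform for a standard normal
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def beta1M_ppf(u, M):
    # inverse cdf of Beta(1, M): F(v) = 1 - (1 - v)^M
    return 1.0 - (1.0 - u) ** (1.0 / M)

def correlated_betas(rho, M, rng):
    # two N(0, 1) variates with correlation rho (bivariate normal),
    # mapped to dependent Beta(1, M) variates via PIT + inverse cdf
    z1 = rng.gauss(0, 1)
    z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0, 1)
    return beta1M_ppf(norm_cdf(z1), M), beta1M_ppf(norm_cdf(z2), M)

rng = random.Random(11)
pairs = [correlated_betas(0.9, 2.0, rng) for _ in range(20000)]
mean1 = sum(a for a, _ in pairs) / len(pairs)   # Beta(1, 2) mean is 1/3
mean2 = sum(b for _, b in pairs) / len(pairs)
cov = sum((a - mean1) * (b - mean2) for a, b in pairs) / len(pairs)
```

The marginals are exactly Beta(1, M) by construction; the dependence is inherited from the underlying normals, which is what the Gaussian process generalizes over a continuous index set X.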
27
Canonical construction
• The typical regression problem has continuous covariates,
hence X is neither finite nor discrete
– use identical strategy as in finite case
– Gaussian process (GP) replaces multivariate normal
– PIT / inverse cdf transformations use GP to construct needed
processes
• Formally, θ∗i and Vi become stochastic processes with index set X
• Maintain independence across i
• Create distribution at an x by plucking off countable collection of
θ∗i and Vi at that x
– pi,x = Vi,x ∏_{j<i} (1 − Vj,x)
– Fx = ∑i pi,x δθ∗i,x,   Fx(A) = ∑i pi,x I(θ∗i,x ∈ A)
• Result is a distribution-valued stochastic process, the
dependent Dirichlet process (DDP)
28
Paths of the θ∗i
[Figure, repeated across five slides: sample paths of five θ∗i processes over x ∈ (0, 10); x-axis x, y-axis theta]
• With smoothing, distribution at x = 0 is a mixture of normals,
two components centered near 2, three centered near −1
• Distribution at x = 5 has all five components centered between −1 and 0
33
Properties
• Following Ferguson, we lay out desirable properties for these
dependent nonparametric processes
– the prior should have full/large support
(no consistency without full support)
– the prior should lead to a model we can fit
(just about any model we can devise, nowadays)
– the marginal distribution of Fx should be of known form
(contributes to understanding of model)
– the distributions Fx should be continuous in x
(as a default, may opt for discontinuity)
– by varying parameters in the process, we should be able to
produce a variety of behaviors
• Dependent Dirichlet processes satisfy all of these properties
34
A special case – the single p DDP
• The single-p DDP is a special case of the DDP
• Take processes Vi,x that do not vary with x
– resulting pi,x do not vary with x
– random measure assigns mass to set A as
Fx(A) = ∑i pi I(θ∗i,x ∈ A)
• If cardinality of X is finite, the DDP model is a DP
– if not finite, is arguably a DP
• Modelling use differs from tradition of DP
– observation connected to DP only through θ∗i,x at some single x
– different cases use different parts of θ∗i
• Computations simpler than with general DDP model
35
A special case – the single θ DDP (Walker, Karabatsos)
• The single-θ DDP is a special case of the DDP
• Take processes θ∗i,x that do not vary with x
– the Vi,x, and hence the pi,x, do vary with x
– random measure assigns mass to set A as
Fx(A) = ∑i pi,x I(θ∗i ∈ A)
• Results are akin to “stair-step” model, with the θi playing the
role of the steps
• Evidence that it fits some types of data quite well
36
Dependent nonparametric processes (DNP)
• The Dirichlet process is easily generalized in a number of ways.
Similarly, dependent Dirichlet processes can be easily generalized
• The simplest generalizations are discrete distributions (which become
mixture models when smoothed)
• Require replacement of random variable with stochastic process
• Easy to build the required processes with target marginal distributions
– from a normal, to a Gaussian process
– from a Beta, to a process with Beta marginals
– several other strategies for building these stochastic processes.
Most come from elementary distribution theory
• Goal is to create prior for collection of distributions indexed by x
• From here on out, everything is modelling!
37
Structure of X
• Rp
– the old standby – regression, spatial models, etc.
• Graph
– a natural for ANOVA, often reduces to placing points in Rp
– flexible construction of graph via Voronoi tessellation, tilings
• Directed graph
– discrete time, time series models
• Surface
– spatial models on the sphere
• Tree
– species trees, coalescent models
38
Collection of distributions
• The set of distributions inherits many properties from the stochastic
processes used to construct them
• Need to capture driving ideas behind particular modelling effort,
choose processes that match those ideas
• Marginal distribution for Fx:
– fixing x, we have Fx ∼ Dir(MxF0,x)
– proof: fix x; the rule for constructing Fx is that of the DP
• Continuity of Fx in x
• Stationarity (isotropy) of Fx in x
• Stochastic ordering (monotonicity) of Fx in x
• Many are producing results on construction / properties of these processes:
Pruenster, Lijoi, Muliere, Petrone, James, Jara, Ghosal, Ghosh,
Tokdar, Ramamoorthi, Hjort, Shui, Griffin, . . .
39
Connection to data - independence
• The distributions Fx, x ∈ X must be connected to data
• Many ways to do so
• One extreme is independence
– each case has a covariate value, say x
– x may or may not have any repeats in the data set
– the associated Y values are independent draws from the Fx
– in the context of regression
Yi = βTxi + εi, xi becomes x
εi ∼ Fx, εi independent for i = 1, . . . , n
– model used with an extra level of smoothing, so that Fx is not
discrete
– appropriate as a replacement for ordinary least squares style
regression
40
Connection to data - extreme dependence
• A second extreme takes entire groups of observations from the
same surface
– the group of observations plays the role of a case on the previous
slide
– each case is an independent draw from Fx
Yij = µ(xij) + εij
∗ xij becomes x
∗ εij from same mixture component for j = 1, . . . , ni
– requires single-p model, where pi,x = pi for all x
• Appropriate when we conceive of our model as a mixture of surfaces
(Gelfand et al.)
41
Connection to data - selector functions
• Between the extremes lies the notion of a selector function
– process defines collection of distributions at all points, x, where
data are observed
– positive-integer valued stochastic process on X determines which
component of mixture case is associated with
Yi = µx + εi
– xi becomes x
– selector process Zx maps εi into association with θ∗Zx (MacE.)
• Properties of data-distribution connection determined by proper-
ties of selector
– facilitates edge detection or changepoint modelling
– natural hierarchical modelling strategy
– extremes of dependence/independence also captured through
selector functions
42
Connection to data - copulae and margins
• Collection of distributions from dependent nonparametric process
• Autoregressive process gives sequence of values in R
– values in R transformed into values in [0, 1]
– distributions Fx yield data:
Yi = Fx^{−1}(ux)
• With continuous Fx, yields continuous path
• Dynamics of dependence governed by autoregressive structure
– can replace autoregressive structure with other structures –
time series/spatial models/spatio-temporal models provide wide
array of choices
• Applies very generally, functional data, time series, spatial data,
etc. (Xu et al.)
43
Recap
• Elements of the problem
– an index set, X
– a model for the collection of distributions, Fx
– a model for the connection between distributions and data
– combined with other, traditional pieces of models
• The properties of the models are determined by these elements
• Great freedom in playing with the elements: spatial DP, linear
DP, functional DP, hierarchical DP, nested DP, . . ., dependent
species sampling models, kernel stick-breaking processes, probit
stick-breaking processes, . . .
• Focus on identifying key modelling issues and using a process that
captures those issues
• Analogy is the generalized linear model. Framework encompasses
many specific forms. Strong commonalities across forms
44