Nonparametric Bayesian Data Analysis for Causal Inference
Part I: Nonparametric Models
Steve MacEachern
The Ohio State University
with thanks to many
Bayesian Causal Inference Workshop
Mathematical Biosciences Institute
The Ohio State University
June 2019
1
An overview of the talk
• Motivation for np Bayesian methods – Full support
• Density estimation
– birthweights: kde and Bayes
• The Dirichlet process
– a distribution on countably discrete distributions
• Dependent Dirichlet processes
– from one distribution to a collection of distributions
• Modelling with nonparametric processes
– the structure of the problem
2
Bayesian methods
• Bayesian methods are supported by theoretical development
– Subjective probability: de Finetti, Savage, Lindley, . . . (SB)
∗ the axioms of rational behavior (along with some regularity
conditions/devices) lead inexorably to the conclusion that (i)
each of us has a prior distribution, and (ii) optimal actions
(choices, decisions) must be driven by Bayes’ Theorem.
– Decision theory (DTB)
∗ (nearly) all agree on the concept of admissibility
∗ inadmissible decision rules should not be used, unless accounting
for the gap between data and formal decision problem, for
computational implementation, or because we can’t find a
dominating rule
∗ the entire set of admissible decision rules consists of Bayes
rules and near-Bayes rules
• Both perspectives tell us that we should be Bayesian
• Neither perspective tells us how to be Bayesian
3
Motivation for nonparametric Bayesian methods
• The problem:
– estimate the probability of obtaining Heads when a coin is
flipped once. Data on n independent flips of the coin
• The model:
p ∼ F
X|p ∼ Binom(n, p)
• The prior:
– F = δ0.5 or F ∼ Unif(0.4, 0.6)
• If the closure of the support of F is a proper subset of [0, 1], we
may have trouble
• At best, we get the “closest” point in the support to the true p
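A small numerical sketch of this trouble (hypothetical data, not from the talk: 70 heads in 100 flips, so p̂ = 0.7) compares a grid posterior under a full-support uniform prior with one under the Unif(0.4, 0.6) prior, which piles up near the support boundary:

```python
import math

def grid_posterior_mean(lo, hi, n, heads, m=2001):
    # grid approximation to the posterior mean of p under a
    # Uniform(lo, hi) prior and Binomial(n, p) likelihood
    grid = [lo + (hi - lo) * k / (m - 1) for k in range(m)]
    loglik = [heads * math.log(p) + (n - heads) * math.log(1.0 - p) for p in grid]
    mx = max(loglik)                       # stabilize before exponentiating
    w = [math.exp(l - mx) for l in loglik]
    tot = sum(w)
    return sum(p * wi / tot for p, wi in zip(grid, w))

# hypothetical data: 70 heads in 100 flips
full = grid_posterior_mean(1e-6, 1 - 1e-6, 100, 70)   # full-support prior
narrow = grid_posterior_mean(0.4, 0.6, 100, 70)       # Unif(0.4, 0.6) prior
```

Under the narrow prior the posterior concentrates just inside 0.6, the "closest" point of the support to the truth, no matter how much data arrive.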
4
Priors to posteriors
[Figure: four panels of prior and posterior densities for p on [0, 1]]
• Sample size 10 in top plots; 100 in bottom plots. All have p̂ = 0.7
• Left plots: Uniform(0, 1); Uniform(0.4, 0.6)
• Right plots: Uniform, arcsine, Beta(10, 1). Priors in black; posteriors in blue
5
Motivation for nonparametric Bayesian methods
• The rule resulting from a prior with small support seems silly (SB)
and does not dominate sensible rules such as the MLE (DTB)
• This leads to the following principle for an adequate Bayesian
model
– We must use a prior distribution with full support
• Asymptotics are a key step in justifying the principle
– viewed by some as anathema to Bayes
– a lack of consistency suggests the inconsistent estimator will
be bettered by a consistent one at some point
– rates of convergence parallel consistency. A bad rate implies a
dispersed posterior leading to poor large-sample behavior
• Jeffreys favored a prior of beta form with point masses added at
0 and 1. After Heads and Tails are both observed, only the beta
prior is relevant
6
Further motivation for np Bayes
• False consistency
– suppose that X1, X2, . . . ∼ F
– rather than observing X directly, we observe a categorized
version of X
– Yi = [Xi]
– if our model takes F to be a normal distribution, our posterior
eventually concentrates on the best fitting normal for the Y s
– the distribution of Xi|Yi degenerates
– although we have never seen an X, we claim to have perfect
knowledge of its distribution
• A prior with full support will lead to degeneracy where we observe
data and will not lead to degeneracy where we don’t
• Particularly relevant for causal inference. The posterior based on
a massive amount of observational data need not (perhaps should
not) concentrate in all aspects.
7
Density estimation
• Birthweight of single, live births (kernel density estimates)
• Boys slightly larger than girls
[Figure: kernel density estimates of birthweight for boys and girls; x-axis Birthweight, y-axis Density]
• Distribution at birth is skewed left
• Note bumps in left tail
8
A major use for np Bayes
• Kernel density estimate
– smoothed version of histogram
∗ Scott’s development
∗ from histogram to average shifted histogram
(uniform kernel) to kde
– average of kernels – here normal densities
– one term in average per data point
• Bayesian analog?
– hierarchical model with discrete latent structure
– mixture of densities – here, focus on normal densities
– unbounded number of components
• Latent structure allows us to make (weak) use of models
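The kernel density estimate itself is just an average of kernels, one per data point. A minimal sketch with a normal kernel and a few hypothetical birthweights (illustrative numbers, not the talk's data):

```python
import math

def kde(x, data, h):
    # kernel density estimate: average of normal densities,
    # one centered at each data point, common bandwidth h
    const = 1.0 / (h * math.sqrt(2.0 * math.pi))
    return sum(const * math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data) / len(data)

# toy birthweights in grams (hypothetical)
data = [2900.0, 3100.0, 3300.0, 3400.0, 3600.0]
d = kde(3300.0, data, h=200.0)

# an average of densities is a density: it integrates to (about) 1
total = sum(kde(g, data, 200.0) for g in range(2000, 4601, 10)) * 10
```

The Bayesian analog replaces this fixed average with a mixture whose components and weights are random, with an unbounded number of components.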
9
Results of the fit
• The birthweight data, fitting location-scale models for boys / girls
[Figure: four panels of predictive densities of birthweight; girls (left) and boys (right), two model fits (top and bottom rows)]
• We get a great fit, but what have we gained by using Bayes?
• Gain is a formal inferential framework, ability to examine spe-
cific assumptions, ability to inject some structure into estimate,
“optimality” of Bayes procedures, etc.
10
A prior on distribution functions
• Ferguson laid out two basic properties that a NP Bayesian model
should satisfy
– full support (in some relevant space, say the real line)
– ability to perform the updates (at the time, conjugacy)
• We all know how this works for our starter problem of the coin
• The likelihood is a product of iid Bernoullis (alternatively, a bi-
nomial)
f(Y1, . . . , Yn | p) = ∏_{i=1}^{n} p^{Yi} (1 − p)^{1−Yi}
• The conjugate prior has a form which matches the likelihood
– we recognize the matching form as that of a beta random variable
• For a more refined notion of conjugacy, see Diaconis & Ylvisaker
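The conjugate beta-binomial update amounts to adding the head and tail counts to the beta parameters. A short sketch with a toy flip sequence (assumed for illustration):

```python
def beta_binomial_update(a, b, flips):
    # conjugate update: Beta(a, b) prior, iid Bernoulli data;
    # posterior is Beta(a + #heads, b + #tails)
    heads = sum(flips)
    return a + heads, b + len(flips) - heads

# Beta(1, 1) (uniform) prior, toy data: 7 heads in 10 flips
a, b = beta_binomial_update(1.0, 1.0, [1, 0, 1, 1, 0, 1, 1, 1, 1, 0])
post_mean = a / (a + b)   # (1 + 7) / (2 + 10) = 2/3
```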
11
Prior predictive under the Dirichlet process
• The Dirichlet process model for data (here θi)
F ∼ Dir(α)
θi|F ∼ F
• For any (measurable) set A, partition R into A and AC
• A Dirichlet-multinomial model (Beta-binomial) on A and AC
• Predictive probability of θi ∈ A is (with sloppy notation)
P(θi ∈ A) = E[F(A)] = α(A)/α(R)
• Thus, the predictive distribution for an observation from this model
is driven by the rescaled base measure
– typical to replace α with M · F0, writing F ∼ Dir(MF0)
– base measure is α, base distribution is F0
– prior predictive distribution is F0
12
Posterior predictive under the Dirichlet process
• The Dirichlet process model for data (here θi)
F ∼ Dir(α)
θi|F ∼ F
• The posterior distribution is a Dirichlet process with an updated
base measure
F | θ1, . . . , θn ∼ Dir(α + ∑i δθi)
θj|F ∼ F, j > n
• The same problem with a new base measure. Answer is the same
– new α is MF0 + ∑i δθi = MF0 + nF̂n, where F̂n is the empirical cdf
– posterior predictive distribution is
(M/(M + n)) F0 + (n/(M + n)) F̂n
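This mixture form of the posterior predictive can be simulated directly: with probability M/(M + n) draw from F0, otherwise resample one of the observed θi. A sketch with toy values (M = 2, three observed θ, F0 = N(0, 1) assumed):

```python
import random

def dp_posterior_predictive_draw(M, base_draw, observed, rng):
    # with prob M/(M+n): a fresh draw from the base distribution F0;
    # otherwise: resample an observed theta_i (the empirical-cdf part)
    n = len(observed)
    if rng.random() < M / (M + n):
        return base_draw(rng)
    return rng.choice(observed)

rng = random.Random(1)
obs = [1.0, 1.0, 2.5]                      # toy observed theta values
draws = [dp_posterior_predictive_draw(2.0, lambda r: r.gauss(0, 1), obs, rng)
         for _ in range(20000)]
# fraction of draws that repeat an old value should be near n/(M+n) = 3/5
frac_old = sum(d in (1.0, 2.5) for d in draws) / len(draws)
```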
13
Sethuraman’s stick-breaking
• The Dirichlet process has many representations
– limit of Dirichlet-multinomial distributions
– rescaled gamma process
– Polya urn scheme (aka, Chinese restaurant process)
• Blackwell and MacQueen’s predictive view leads to the stick-
breaking version of the process
θ∗i iid∼ F0
Vi iid∼ Beta(1, M),   pi = Vi ∏_{j<i} (1 − Vj)
F = ∑i pi δθ∗i
• See Sethuraman & Tiwari and Sethuraman
• Formally, a discrete distribution with a countable number of atoms
– base cdf, F0, gives distribution for locations
– mass parameter, M , leads to weights
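Sethuraman's representation suggests a simple truncated simulation: keep breaking the stick until the leftover mass falls below a tolerance. A sketch assuming F0 = N(0, 1):

```python
import random

def stick_breaking(M, base_draw, eps, rng):
    # truncation of Sethuraman's countable representation:
    # draw atoms from F0 and breaks from Beta(1, M) until the
    # remaining stick length drops below eps
    atoms, weights, remaining = [], [], 1.0
    while remaining > eps:
        v = rng.betavariate(1.0, M)
        atoms.append(base_draw(rng))
        weights.append(remaining * v)      # p_i = V_i * prod_{j<i} (1 - V_j)
        remaining *= (1.0 - v)
    return atoms, weights

rng = random.Random(7)
atoms, weights = stick_breaking(M=5.0, base_draw=lambda r: r.gauss(0, 1),
                                eps=1e-8, rng=rng)
total = sum(weights)   # all but at most eps of the mass is captured
```

Larger M yields many small weights (finer sticks); small M concentrates mass on a few atoms.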
14
The Mixture of Dirichlet Processes (MDP) model
• Discrete distributions are wonderful, but they are of limited use
– technical challenges – here, no σ-finite dominating measure
– usefulness challenges – far from traditional modelling
• Antoniak describes the mixture of Dirichlet processes model. E.g.
F ∼ Dir(MF0)
G = N(0, σ2)
X ∼ H = F ∗G
• This is the most popular version of a smoothed DP
– a countable mixture of normals
– DP provides latent structure
– fits seamlessly in a hierarchical model
• MDP or DPM? Antoniak named it first – hence MDP
15
Computation with the MDP model
• The MDP model relies on a latent structure involving the θi
• Rewrite the MDP model as
F ∼ Dir(MF0)
θi|F ∼ F
εi ∼ N(0, σ2)
Xi = θi + εi
• The identical model, but now we have θi to work with
– allows us to avoid infinite dimensional F
• Escobar developed MCMC computation for the MDP model in
his dissertation
– separate development of Gibbs sampler
• Driven by Polya urn scheme
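The Polya urn view also yields a forward simulation of this latent structure: θ1 ∼ F0, and each later θi either repeats an earlier value (joining its cluster) or is a fresh draw from F0 with probability M/(M + i − 1); then Xi = θi + εi. A minimal sketch with F0 = N(0, 1) assumed:

```python
import random

def simulate_mdp(n, M, sigma, rng):
    # Polya-urn forward simulation of the latent theta_i,
    # then Xi = theta_i + eps_i with eps_i ~ N(0, sigma^2)
    thetas = []
    for i in range(n):
        if rng.random() < M / (M + i):
            thetas.append(rng.gauss(0, 1))    # new cluster from F0 = N(0, 1)
        else:
            thetas.append(rng.choice(thetas)) # tie to an earlier theta
    xs = [t + rng.gauss(0, sigma) for t in thetas]
    return thetas, xs

rng = random.Random(3)
thetas, xs = simulate_mdp(n=500, M=1.0, sigma=0.1, rng=rng)
n_clusters = len(set(thetas))   # grows roughly like M * log(n)
```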
16
Additional computational issues
• Advances in computation followed rapidly
– drove innovations in MCMC that are used much more widely
– ideas flow from the many representations of the DP
• MCMC algorithms for the MDP model are early examples of (i)
transdimensional MCMC, as the number of clusters varies while the
algorithm runs, (ii) marginalization and the benefits that follow
from it, (iii) overparametrization followed by marginalization, (iv)
reparameterization to enhance mixing, (v) use of nonidentifiable
models to enhance mixing, and (vi) the role of identifiability in
MCMC diagnostics
• Additional refinements include split-merge samplers, parallel methods,
variational approximations, . . .
• Methods extend to complex models in which DP is small part
17
From one distribution to a collection of distributions
• The birthweights again
[Figure: the kernel density estimates of birthweight again; x-axis Birthweight, y-axis Density]
• A nonparametric model is needed
• Does the distribution’s shape remain the same through time (age)?
18
Growth curves
• In addition to birthweight, weight is tracked through time (months)
Age LSD Mean USD RSD
0 0.5 3.2 0.4 0.80
1 0.6 4.0 0.5 0.83
20 1.2 11.2 1.2 1.00
40 1.6 14.8 2.1 1.31
60 1.9 17.7 2.8 1.47
100 3.8 26.0 5.8 1.53
• LSD is lower standard deviation; USD is upper standard deviation;
RSD = USD/LSD
• Skewness changes through time, from left-skewed to right-skewed
• Bump in left tail disappears
• Changes in center, spread, skewness, shape
• Oddity with mean, lower sd, upper sd
19
Weighted least squares (Wang)
• Two motivations for weighted least squares
– scale family
– convolution family
• Both motivations imply the same model under normal theory
• Models differ under non-normal error distribution
• Getting the model right matters
– impact on posterior inference for regression coefficients
– impact on assessment of outlyingness
– impact on predictions via predictive distribution
• Nonparametric Bayes models can separate two motivations and
supply a bridge of models between them
20
Weisberg’s apple shoots data – a convolution family?
[Figure: three normal Q–Q plots of standardized residuals, one per level of aggregation; x-axis Theoretical Quantiles, y-axis Sample Quantiles]
• A set of normal probability plots (qq plots), broken down by the
level of aggregation
• The “data” are the standardized residuals from a decent regression
model
• These plots show movement toward normality–or do they?
21
Smoothed probability plots
• Avoid jumpiness from points entering/exiting plot
• Allow comparison across different sample sizes
• Focus on pattern, not individual cases
• Get a kde of the density; extract the plot
[Figure: three kernel-density-smoothed probability plots; x-axis theoretical quantiles, y-axis kernel density smoothed quantiles]
• We can write models which capture the convolution
• Control amount of averaging; from none to much
22
Dependent nonparametric processes
• Early attempts to induce dependence are either very restrictive
or very jumpy
• Static forms of dependence have limited support:
– regression with an arbitrary residual distribution
Yi = βTxi + εi,   εi ∼ F;   F ∼ smoothed DP(MF0)
∗ the residual distribution is the same for all βTxi and all xi
– Mixed model; match notion of random effect to focus on dist’n
whereas fixed effect is matched to notion of number
Yij = βTxi + γj + εij, γj ∼ F ; F ∼ DP (MF0)
∗ the γj are independent with common distribution
∗ the distribution cannot vary with the covariate
23
Early attempts
• Wild forms of dependence
– base measure of DP changes with x
– (conditionally) independent realizations of Fx
– this form of dependence breaks any tie beyond parameters of
the base measure
• Suppose base measure moves continuously (both base cdf and
mass)
– for x1 ≈ x2, realized Fx1 is independent of realized Fx2
– lose ability to pool information from nearby x
– lack of consistency for regression problems (one obs’n per value
of x)
• Takeaway – continuity of distributions in x is important
• Motivating case, growth of individuals through time – we have a
strong belief in continuity for individuals, hence for pop’n dist’n
24
A comprehensive picture in regression
• A nonparametric model is needed at each level of x
• The residual distribution must evolve smoothly as x changes
• The changes go beyond an additive effect
• So, why the success of traditional regression models?
– fitting procedure (least squares or robust version) excels at
picking out central trend of response, given covariates
– many traditional questions focus on central tendency
• There is growing interest in prediction
– focus is on distribution for individual case – given x
– no central limit theorem applies to the individual case
– case diagnostics are also about the individual case
– (for classical folks, growth in quantile regression is notable)
• Counterfactuals are all about prediction
25
An easy extension (MacEachern)
• Replace each component of the model with a grander object, in-
dexed by covariate, x ∈ X
– F → Fx
– θ∗i → θ∗i,x
– Vi → Vi,x
• Easy case is finite, discrete X
– locations are iid from multivariate F0
– breaks are iid from multivariate with Beta(1,Mx) marginals
– weights constructed from breaks
• For each x, a stickbreaking construction
θ∗i,x iid∼ F0,x
Vi,x iid∼ Beta(1, Mx),   pi,x = Vi,x ∏_{j<i} (1 − Vj,x)
26
Toward the canonical construction
• Need a standard way to build the needed multivariate components
– copula models, especially the Gaussian copula
• F0 a multivariate distribution with specified marginals
– begin with Z a multivariate normal with N(0, 1) marginals
– probability integral transform maps to U(0, 1) marginals
– inverse cdf transform maps to desired marginals
• Beta(1,M)? Same argument
– begin with Z a multivariate normal with N(0, 1) marginals
– probability integral transform maps to U(0, 1) marginals
– inverse beta cdf transform maps to desired marginals
• Computation is messier and slower than for simple MDP model,
but MCMC proceeds along well established lines
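The PIT / inverse-cdf recipe is easy to sketch for Beta(1, M) marginals, since that cdf inverts in closed form: F(v) = 1 − (1 − v)^M, so F^{−1}(u) = 1 − (1 − u)^{1/M}. A bivariate version with an assumed Gaussian-copula correlation of 0.9 (all numbers illustrative):

```python
import math
import random

def norm_cdf(z):
    # probability integral transform for a standard normal
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def beta1M_ppf(u, M):
    # inverse cdf of Beta(1, M): F(v) = 1 - (1 - v)^M
    return 1.0 - (1.0 - u) ** (1.0 / M)

def correlated_betas(rho, M, rng):
    # two N(0, 1) variates with correlation rho (bivariate normal),
    # mapped to dependent Beta(1, M) variates via PIT + inverse cdf
    z1 = rng.gauss(0, 1)
    z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0, 1)
    return beta1M_ppf(norm_cdf(z1), M), beta1M_ppf(norm_cdf(z2), M)

rng = random.Random(11)
pairs = [correlated_betas(0.9, 2.0, rng) for _ in range(20000)]
mean1 = sum(a for a, _ in pairs) / len(pairs)   # Beta(1, 2) mean is 1/3
mean2 = sum(b for _, b in pairs) / len(pairs)
cov = sum((a - mean1) * (b - mean2) for a, b in pairs) / len(pairs)
```

The marginals are exactly Beta(1, M) by construction; the dependence is inherited from the underlying normals, which is what the Gaussian process generalizes over a continuous index set X.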
27
Canonical construction
• The typical regression problem has continuous covariates,
hence X is neither finite nor discrete
– use identical strategy as in finite case
– Gaussian process (GP) replaces multivariate normal
– PIT / inverse cdf transformations use GP to construct needed
processes
• Formally, θ∗i and Vi become stochastic processes with index set X
• Maintain independence across i
• Create distribution at an x by plucking off countable collection of
θ∗i and Vi at that x
– pi,x = Vi,x ∏_{j<i} (1 − Vj,x)
– Fx = ∑i pi,x δθ∗i,x,   Fx(A) = ∑i pi,x I(θ∗i,x ∈ A)
• Result is a distribution-valued stochastic process, the
dependent Dirichlet process (DDP)
28
Paths of the θ∗i
[Figure, repeated across five slides: sample paths of five θ∗i processes over x ∈ (0, 10); x-axis x, y-axis theta]
• With smoothing, distribution at x = 0 is a mixture of normals,
two components centered near 2, three centered near −1
• Distribution at x = 5 has all five components centered between −1 and 0
33
Properties
• Following Ferguson, we lay out desirable properties for these
dependent nonparametric processes
– the prior should have full/large support
(no consistency without full support)
– the prior should lead to a model we can fit
(just about any model we can devise, nowadays)
– the marginal distribution of Fx should be of known form
(contributes to understanding of model)
– the distributions Fx should be continuous in x
(as a default, may opt for discontinuity)
– by varying parameters in the process, we should be able to
produce a variety of behaviors
• Dependent Dirichlet processes satisfy all of these properties
34
A special case – the single p DDP
• The single-p DDP is a special case of the DDP
• Take processes Vi,x that do not vary with x
– resulting pi,x do not vary with x
– random measure assigns mass to set A as
Fx(A) = ∑i pi I(θ∗i,x ∈ A)
• If cardinality of X is finite, the DDP model is a DP
– if not finite, is arguably a DP
• Modelling use differs from tradition of DP
– observation connected to DP only through θ∗i,x at some single x
– different cases use different parts of θ∗i
• Computations simpler than with general DDP model
35
A special case – the single θ DDP (Walker, Karabatsos)
• The single-θ DDP is a special case of the DDP
• Take processes θ∗i,x that do not vary with x
– the Vi,x, and hence the pi,x, do vary with x
– random measure assigns mass to set A as
Fx(A) = ∑i pi,x I(θ∗i ∈ A)
• Results are akin to “stair-step” model, with the θi playing the
role of the steps
• Evidence that it fits some types of data quite well
36
Dependent nonparametric processes (DNP)
• The Dirichlet process is easily generalized in a number of ways.
Similarly, dependent Dirichlet processes can be easily generalized
• The simplest generalizations are discrete distributions (which become
mixture models when smoothed)
• Require replacement of random variable with stochastic process
• Easy to build the required processes with target marginal distributions
– from a normal, to a Gaussian process
– from a Beta, to a process with Beta marginals
– several other strategies for building these stochastic processes.
Most come from elementary distribution theory
• Goal is to create prior for collection of distributions indexed by x
• From here on out, everything is modelling!
37
Structure of X
• Rp
– the old standby – regression, spatial models, etc.
• Graph
– a natural for ANOVA, often reduces to placing points in Rp
– flexible construction of graph via Voronoi tessellation, tilings
• Directed graph
– discrete time, time series models
• Surface
– spatial models on the sphere
• Tree
– species trees, coalescent models
38
Collection of distributions
• The set of distributions inherits many properties from the stochastic
processes used to construct them
• Need to capture driving ideas behind particular modelling effort,
choose processes that match those ideas
• Marginal distribution for Fx:
– fixing x, we have Fx ∼ Dir(MxF0,x)
– proof: fix x; the rule for constructing Fx is that of the DP
• Continuity of Fx in x
• Stationarity (isotropy) of Fx in x
• Stochastic ordering (monotonicity) of Fx in x
• Many are producing results on construction / properties of these processes:
Pruenster, Lijoi, Muliere, Petrone, James, Jara, Ghosal, Ghosh,
Tokdar, Ramamoorthi, Hjort, Shui, Griffin, . . .
39
Connection to data - independence
• The distributions Fx, x ∈ X must be connected to data
• Many ways to do so
• One extreme is independence
– each case has a covariate value, say x
– x may or may not have any repeats in the data set
– the associated Y values are independent draws from the Fx
– in the context of regression
Yi = βTxi + εi, xi becomes x
εi ∼ Fx, εi independent for i = 1, . . . , n
– model used with an extra level of smoothing, so that Fx is not
discrete
– appropriate as a replacement for ordinary least squares style
regression
40
Connection to data - extreme dependence
• A second extreme takes entire groups of observations from the
same surface
– the group of observations plays the role of a case on the previous
slide
– each case is an independent draw from Fx
Yij = µ(xij) + εij
∗ xij becomes x
∗ εij from same mixture component for j = 1, . . . , ni
– requires single-p model, where pi,x = pi for all x
• Appropriate when we conceive of our model as a mixture of surfaces
(Gelfand et al.)
41
Connection to data - selector functions
• Between the extremes lies the notion of a selector function
– process defines collection of distributions at all points, x, where
data are observed
– positive-integer valued stochastic process on X determines which
component of mixture case is associated with
Yi = µx + εi
– xi becomes x
– selector process Zx maps εi into association with θ∗Zx (MacE.)
• Properties of data-distribution connection determined by proper-
ties of selector
– facilitates edge detection or changepoint modelling
– natural hierarchical modelling strategy
– extremes of dependence/independence also captured through
selector functions
42
Connection to data - copulae and margins
• Collection of distributions from dependent nonparametric process
• Autoregressive process gives sequence of values in R
– values in R transformed into values in [0, 1]
– distributions Fx yield data:
Yi = Fx^{−1}(ux)
• With continuous Fx, yields continuous path
• Dynamics of dependence governed by autoregressive structure
– can replace autoregressive structure with other structures –
time series/spatial models/spatio-temporal models provide wide
array of choices
• Applies very generally, functional data, time series, spatial data,
etc. (Xu et al.)
43
Recap
• Elements of the problem
– an index set, X
– a model for the collection of distributions, Fx
– a model for the connection between distributions and data
– combined with other, traditional pieces of models
• The properties of the models are determined by these elements
• Great freedom in playing with the elements: spatial DP, linear
DP, functional DP, hierarchical DP, nested DP, . . ., dependent
species sampling models, kernel stick-breaking processes, probit
stick-breaking processes, . . .
• Focus on identifying key modelling issues and using a process that
captures those issues
• Analogy is the generalized linear model. Framework encompasses
many specific forms. Strong commonalities across forms
44