
On Bayesian Computation

Michael I. Jordan

with Elaine Angelino, Maxim Rabinovich, Martin Wainwright and Yun Yang


Previous Work: Information Constraints on Inference

- Minimize the minimax risk under constraints:
  - privacy constraint
  - communication constraint
  - memory constraint

- Yields tradeoffs linking statistical quantities (amount of data, complexity of model, dimension of parameter space, etc.) to "externalities" (privacy level, bandwidth, storage, etc.)


Ongoing Work: Computational Constraints on Inference

- Tradeoffs via convex relaxations

- Tradeoffs via concurrency control

- Bounds via optimization oracles

- Bounds via communication complexity

- Tradeoffs via subsampling

- All of this work has been frequentist; how about Bayesian inference?


Today’s talk: Bayesian Inference and Computation

- Integration rather than optimization

- Part I: Mixing times for MCMC

- Part II: Distributed MCMC


Part I: Mixing Times for MCMC


MCMC and Sparse Regression

- Statistical problem: predict/explain a response variable Y based on a very large number of features X₁, ..., X_d.

- Formally:

      Y = X₁β₁ + X₂β₂ + ··· + X_d β_d + noise,   or
      Y_{n×1} = X_{n×d} β_{d×1} + noise   (matrix form),

  with number of features d ≫ sample size n.

- Sparsity condition: number of non-zero β_j's ≪ n.

- Goal: variable selection, i.e., identify the influential features.

- Can this be done efficiently (in a computational sense)? (A data-generating sketch of this setup follows below.)
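As a concrete illustration of the setup (not part of the talk), here is a minimal NumPy sketch that generates data from the sparse linear model above, with d far larger than n and only a handful of non-zero coefficients; the dimensions, signal sizes, and noise level are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, s_true = 100, 2000, 5          # sample size, #features (d >> n), true sparsity
X = rng.standard_normal((n, d))      # design matrix, one row per observation
beta = np.zeros(d)
beta[:s_true] = rng.uniform(1.0, 2.0, size=s_true)   # a few influential features
noise = 0.5 * rng.standard_normal(n)
Y = X @ beta + noise                 # Y = X beta + noise, in matrix form

# Variable selection asks: given only (Y, X), which of the d columns carry signal?
print("nonzero coefficients:", np.flatnonzero(beta))
```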


Statistical Methodology

- Frequentist approach: penalized optimization of the objective function

      f(β) = (1/2n) ‖Y − Xβ‖₂²  +  P_λ(β),

  where the first term measures goodness of fit and P_λ(β) is the penalty term.

- Bayesian approach: Monte Carlo sampling from the posterior

      P(β | Y) ∝ P(Y | β) × P(β),

  i.e., posterior ∝ likelihood × prior.
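To make the two formulas concrete, the sketch below evaluates a penalized objective with an ℓ₁ penalty standing in for P_λ, and an unnormalized log posterior for a Gaussian likelihood with an isotropic Gaussian prior; the specific penalty and prior are illustrative choices of mine, not the talk's.

```python
import numpy as np

def penalized_objective(beta, X, Y, lam):
    """f(beta) = (1/2n)||Y - X beta||_2^2 + P_lambda(beta), here with an l1 penalty."""
    n = X.shape[0]
    fit = 0.5 / n * np.sum((Y - X @ beta) ** 2)   # goodness of fit
    penalty = lam * np.sum(np.abs(beta))          # P_lambda(beta)
    return fit + penalty

def log_unnormalized_posterior(beta, X, Y, noise_var=1.0, prior_var=10.0):
    """log P(beta | Y) up to an additive constant: log-likelihood + log-prior."""
    log_lik = -0.5 / noise_var * np.sum((Y - X @ beta) ** 2)   # Gaussian likelihood
    log_prior = -0.5 / prior_var * np.sum(beta ** 2)           # Gaussian prior (illustrative)
    return log_lik + log_prior
```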


Computation and Bayesian inference

Posterior probability:

      P(β | Y) = C(Y) × P(Y | β) × P(β),

where C(Y) is the normalizing constant, P(Y | β) the likelihood, and P(β) the prior.

- The most widely used tool for fitting Bayesian models is the sampling technique based on Markov chain Monte Carlo (MCMC).

- The theoretical analysis of the computational efficiency of MCMC algorithms lags that of optimization-based methods.


Computational Complexity

- Central object of interest: the mixing time t_mix of the Markov chain.

- Bound t_mix as a function of the problem parameters (n, d):
  - rapidly mixing: t_mix grows at most polynomially
  - slowly mixing: t_mix grows exponentially

- It has long been believed by many statisticians that MCMC for Bayesian variable selection is necessarily slowly mixing; efforts have been underway to find lower bounds, but such bounds have not yet been found.
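For reference, a standard definition that these statements implicitly rely on (not spelled out on the slide) is the ε-mixing time in total variation distance:

```latex
t_{\mathrm{mix}}(\varepsilon)
  \;=\; \min\bigl\{\, t \;:\; \sup_{x}\, \lVert P^{t}(x,\cdot) - \pi \rVert_{\mathrm{TV}} \le \varepsilon \,\bigr\},
```

so "rapidly mixing" means t_mix(ε) grows at most polynomially in the problem parameters, and "slowly mixing" means it grows exponentially.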


Model

M_γ: Linear model:      Y = X_γ β_γ + w,   w ∼ N(0, φ⁻¹ I_n)

     Precision prior:   π(φ) ∝ 1/φ

     Regression prior:  (β_γ | γ) ∼ N(0, g φ⁻¹ (X_γᵀ X_γ)⁻¹)

     Sparsity prior:    π(γ) ∝ (1/p)^{κ|γ|} · 1[|γ| ≤ s₀].
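Under these conjugate priors, β_γ and φ can be integrated out in closed form, which is what makes a Markov chain over models γ practical. The sketch below computes the resulting unnormalized log posterior over γ; the marginal-likelihood formula is my own derivation from the stated priors, so treat it as a hedged reconstruction rather than the talk's exact expression.

```python
import numpy as np

def log_post_gamma(gamma, X, Y, g, kappa, s0):
    """Unnormalized log pi(gamma | Y), integrating out beta_gamma and phi.

    Derivation sketch (mine, from the priors on the slide):
      Y | gamma, phi ~ N(0, phi^{-1}(I + g * Phi_gamma)),  det(I + g Phi_gamma) = (1+g)^{|gamma|},
      and pi(phi) ∝ 1/phi then gives
      pi(gamma | Y) ∝ (1/p)^{kappa |gamma|} (1+g)^{-|gamma|/2}
                      * (||Y||^2 - g/(1+g) ||Phi_gamma Y||^2)^{-n/2}.
    """
    gamma = np.asarray(gamma, dtype=bool)
    n, p = X.shape
    k = int(gamma.sum())
    if k > s0:
        return -np.inf                      # outside the sparsity prior's support
    if k == 0:
        proj = 0.0
    else:
        Xg = X[:, gamma]
        # ||Phi_gamma Y||^2 via least squares (projection of Y onto the span of X_gamma)
        coef, *_ = np.linalg.lstsq(Xg, Y, rcond=None)
        proj = float(Y @ (Xg @ coef))
    residual = float(Y @ Y) - g / (1.0 + g) * proj
    return (-kappa * k * np.log(p)
            - 0.5 * k * np.log(1.0 + g)
            - 0.5 * n * np.log(residual))
```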


Assumption A (Conditions on β∗)

The true regression vector has components β* = (β*_S, β*_{Sᶜ}) that satisfy the bounds

    Full β* condition:          ‖(1/√n) X β*‖₂² ≤ g σ₀² (log p)/n,

    Off-support Sᶜ condition:   ‖(1/√n) X_{Sᶜ} β*_{Sᶜ}‖₂² ≤ L σ₀² (log p)/n,

for some L ≥ 0.


Assumption B (Conditions on the design matrix)

The design matrix has been normalized so that ‖X_j‖₂² = n for all j = 1, ..., p; moreover, letting Z ∼ N(0, I_n), there exist constants ν > 0 and L̄ < ∞ such that L̄ν ≥ 4 and

    Lower restricted eigenvalue RE(s):    min_{|γ|≤s} λ_min( (1/n) X_γᵀ X_γ ) ≥ ν,

    Sparse projection condition SI(s):    E_Z[ max_{|γ|≤s} max_{k∈[p]∖γ} (1/√n) |⟨(I − Φ_γ) X_k, Z⟩| ] ≤ (1/2) √(L̄ ν log p),

where Φ_γ denotes projection onto the span of {X_j, j ∈ γ}.

This is a mild assumption, needed in the information-theoretic analysis of variable selection.


Assumption C (Choices of prior hyperparameters)

The prior scale hyperparameter g and the sparsity penalty hyperparameter κ > 0 are chosen such that

    g ≍ p^{2α}  for some α > 0,   and

    κ + α ≥ C₁(L + L̄) + 2  for some universal constant C₁ > 0.


Assumption D (Sparsity control)

For a constant C₀ > 8, one of the following two conditions holds:

Version D(s*): We set s₀ := p in the sparsity prior, and the true sparsity s* is bounded as

    max{1, s*} ≤ (1/(8 C₀ K)) { n/(log p) − 16 L σ₀² }

for some constant K ≥ 4 + α + cL, where c is a universal constant.

Version D(s₀): The sparsity parameter s₀ satisfies the sandwich relation

    max{ 1, (2ν⁻² ω(X) + 1) s* } ≤ s₀ ≤ (1/(8 C₀ K)) { n/(log p) − 16 L σ₀² },

where ω(X) := max_γ |||(X_γᵀ X_γ)⁻¹ X_γᵀ X_{γ*∖γ}|||²_op.


Our Results

Theorem (Posterior concentration)

Given Assumption A, Assumption B with s = Ks*, Assumption C, and Assumption D(s*), if C_β satisfies

    C_β² ≥ c₀ ν⁻² (L + L̄ + α + κ) σ₀² (log p)/n,

then π_n(γ* | Y) ≥ 1 − c₁ p⁻¹ with high probability.

Compare to ℓ₁-based approaches, which require an irrepresentable condition:

    max_{k∈[d], S: |S|=s*} ‖X_kᵀ X_S (X_Sᵀ X_S)⁻¹‖₁ < 1.


Our Results

Theorem (Rapid mixing)

Suppose that Assumption A, Assumption B with s = s₀, Assumption C, and Assumption D(s₀) all hold. Then there are universal constants c₁, c₂ such that, for any ε ∈ (0, 1), the ε-mixing time of the Metropolis-Hastings chain is upper bounded as

    τ_ε ≤ c₁ p s₀² ( c₂ α (n + s₀) log p + log(1/ε) + 2 )

with probability at least 1 − c₃ p^{−c₄}.


High Level Proof Idea

- A Metropolis-Hastings random walk on the d-dimensional hypercube {0, 1}^d (a minimal sketch follows below).

- Canonical path ensemble argument: for any model γ ≠ γ*, where γ* is the true model, find a path from γ to γ* along which the acceptance ratios are high.
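A minimal sketch of such a walk, taking the unnormalized log posterior over models as a callable (for instance the log_post_gamma sketch above, with the data bound in): it proposes single-coordinate flips of γ and accepts with the usual Metropolis-Hastings ratio. The chain analyzed in the theorem may use a richer proposal (e.g., including swaps), so this is illustrative only.

```python
import numpy as np

def mh_variable_selection(log_post, p, n_steps, seed=0):
    """Metropolis-Hastings random walk over models gamma in {0,1}^p with single-flip proposals.

    log_post: callable returning the unnormalized log posterior of a boolean inclusion
    vector gamma of length p (a placeholder supplied by the user).
    """
    rng = np.random.default_rng(seed)
    gamma = np.zeros(p, dtype=bool)                    # start from the empty model
    logp = log_post(gamma)
    samples = []
    for _ in range(n_steps):
        j = rng.integers(p)                            # choose a coordinate uniformly at random
        proposal = gamma.copy()
        proposal[j] = not proposal[j]                  # flip: add or remove feature j
        logp_prop = log_post(proposal)
        # Symmetric proposal, so acceptance probability is min(1, pi(proposal)/pi(gamma)).
        if np.log(rng.uniform()) < logp_prop - logp:
            gamma, logp = proposal, logp_prop
        samples.append(gamma.copy())
    return samples
```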


Part II: Distributed MCMC


Traditional MCMC

- Serial, iterative algorithm for generating samples

- Slow for two reasons:
  (1) A large number of iterations is required to converge
  (2) Each iteration depends on the entire dataset

- Most research on MCMC has targeted (1)

- Recent threads of work target (2)


Serial MCMC

[Figure: Data → single core → samples]


Data-parallel MCMC

[Figure: Data partitioned across parallel cores, each producing its own "samples"]


Aggregate samples from across partitions — but how?

[Figure: the per-partition "samples" from the parallel cores are passed to an aggregation step]


Factorization motivates a data-parallel approach

    π(θ | x) ∝ π(θ) π(x | θ) = ∏_{j=1}^{J} [ π(θ)^{1/J} π(x^{(j)} | θ) ],

i.e., posterior ∝ prior × likelihood, and each bracketed factor is a sub-posterior.

- Partition the data as x^{(1)}, ..., x^{(J)} across J cores

- The j-th core samples from a distribution proportional to the j-th sub-posterior (a 'piece' of the full posterior)

- Aggregate the sub-posterior samples to form approximate full posterior samples (a minimal sketch of the sub-posterior construction follows below)
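The sketch below shows the sub-posterior construction under assumed, user-supplied log_prior and log_lik callables (both hypothetical placeholders): each worker targets π(θ)^{1/J} π(x^{(j)} | θ), so that the product of the J sub-posteriors recovers the full posterior up to normalization.

```python
import numpy as np

def make_subposteriors(x, J, log_prior, log_lik):
    """Split the data into J shards and return one unnormalized log sub-posterior per shard.

    log_prior(theta) and log_lik(x_shard, theta) are user-supplied placeholders.
    The j-th sub-posterior is (1/J) * log_prior(theta) + log_lik(x_j, theta),
    so summing all J of them gives back the full unnormalized log posterior.
    """
    shards = np.array_split(x, J)

    def make_one(x_j):
        return lambda theta: log_prior(theta) / J + log_lik(x_j, theta)

    return [make_one(x_j) for x_j in shards]

# Each sub-posterior would then be handed to an ordinary MCMC sampler on its own core,
# and the resulting draws aggregated as discussed next.
```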


Aggregation strategies for sub-posterior samples

- Sub-posterior density estimation (Neiswanger et al., UAI 2014)

- Weierstrass samplers (Wang & Dunson, 2013)

- Weighted averaging of sub-posterior samples:
  - Consensus Monte Carlo (Scott et al., Bayes 250, 2013)
  - Variational Consensus Monte Carlo (Rabinovich et al., NIPS 2015)


Aggregate 'horizontally' across partitions

[Figure: sub-posterior "samples" from the parallel cores are combined draw-by-draw across partitions]


Naïve aggregation = Average

[Figure: each aggregated draw is formed by averaging one draw from each partition with equal weights (0.5 and 0.5 in the two-partition illustration)]


Less naïve aggregation = Weighted average

[Figure: the same combination, but with unequal weights (0.58 and 0.42 in the illustration)]


Consensus Monte Carlo (Scott et al, 2013)

[Figure: aggregated draws formed as matrix-weighted combinations of the per-partition draws]

- Weights are inverse covariance matrices (a minimal sketch follows below)

- Motivated by Gaussian assumptions

- Designed at Google for the MapReduce framework
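A minimal sketch of this Gaussian-motivated aggregation rule: each partition's draws are weighted by the inverse of that partition's sample covariance, and the weighted combination is normalized by the sum of the weights. This follows the description on the slide (weights = inverse covariance matrices); details such as regularizing the covariance before inversion are my own additions.

```python
import numpy as np

def consensus_aggregate(sub_samples, jitter=1e-8):
    """Consensus-style aggregation of sub-posterior draws.

    sub_samples: list of J arrays, each of shape (K, dim). The k-th draw from each
    partition is combined as theta_k = (sum_j W_j)^{-1} sum_j W_j theta_{j,k},
    with W_j the inverse sample covariance of partition j.
    """
    weights = []
    for s in sub_samples:
        cov = np.cov(s, rowvar=False) + jitter * np.eye(s.shape[1])  # stabilize the inversion
        weights.append(np.linalg.inv(cov))
    W_total_inv = np.linalg.inv(sum(weights))
    K = min(s.shape[0] for s in sub_samples)
    combined = np.zeros((K, sub_samples[0].shape[1]))
    for k in range(K):
        weighted_sum = sum(W @ s[k] for W, s in zip(weights, sub_samples))
        combined[k] = W_total_inv @ weighted_sum
    return combined
```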


Variational Consensus Monte Carlo

Goal: Choose the aggregation function to best approximate the target distribution

Method: Convex optimization via variational Bayes

    F   = aggregation function
    q_F = approximate distribution

    L(F) = E_{q_F}[log π(X, θ)] + H[q_F]
           (objective = expected log joint likelihood + entropy)

The entropy term is replaced by a relaxed entropy (see the theorem below), and no mean field assumption is made.


Variational Consensus Monte Carlo

[Figure: aggregated draws formed as matrix-weighted combinations of the per-partition draws, as in Consensus Monte Carlo]

- Optimize over the weight matrices (a sketch of this optimization follows below)

- Restrict to valid solutions when the parameter vectors are constrained
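The sketch below shows the shape of this optimization, not the paper's algorithm: the aggregation function is F(θ₁, ..., θ_J) = Σ_j W_j θ_j with diagonal W_j, the E_{q_F}[log π(X, θ)] term is estimated by Monte Carlo over the stored sub-posterior draws, and a simple log-sum-of-weights term stands in for the paper's entropy relaxation. The optimizer (plain gradient ascent with numerical gradients), the log_joint callable, and all tuning constants are placeholders.

```python
import numpy as np

def vcmc_objective(w, sub_samples, log_joint):
    """Monte Carlo estimate of E_qF[log pi(X, theta)] plus a crude entropy surrogate.

    w: array of shape (J, dim), the diagonal aggregation weights being optimized.
    sub_samples: list of J arrays of shape (K, dim) of sub-posterior draws.
    log_joint: callable giving log pi(X, theta) for a parameter vector theta (placeholder).
    """
    K = min(s.shape[0] for s in sub_samples)
    combined = sum(w[j] * s[:K] for j, s in enumerate(sub_samples))   # (K, dim) aggregated draws
    expected_log_joint = np.mean([log_joint(theta) for theta in combined])
    # Stand-in entropy term; NOT the paper's relaxation.
    entropy_surrogate = np.sum(np.log(np.abs(w).sum(axis=0)))
    return expected_log_joint + entropy_surrogate

def optimize_weights(sub_samples, log_joint, n_iters=200, lr=1e-3, eps=1e-4):
    """Numerical-gradient ascent over the diagonal weights, starting from uniform averaging."""
    J = len(sub_samples)
    dim = sub_samples[0].shape[1]
    w = np.full((J, dim), 1.0 / J)
    for _ in range(n_iters):
        base = vcmc_objective(w, sub_samples, log_joint)
        grad = np.zeros_like(w)
        for idx in np.ndindex(*w.shape):          # finite-difference gradient (slow, illustrative)
            w_pert = w.copy()
            w_pert[idx] += eps
            grad[idx] = (vcmc_objective(w_pert, sub_samples, log_joint) - base) / eps
        w += lr * grad
    return w
```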


Variational Consensus Monte Carlo

Theorem (Entropy relaxation)

Under mild structural assumptions, we can choose a relaxed entropy

    H̃[q_F] = c₀ + (1/K) ∑_{k=1}^{K} h_k(F),

with each h_k a concave function of F, such that

    H[q_F] ≥ H̃[q_F].

We therefore have L(F) ≥ L̃(F), where L̃ is the objective with H replaced by H̃.


Variational Consensus Monte Carlo

Theorem (Concavity of the variational Bayes objective)

Under mild structural assumptions, the relaxed variational Bayes objective

    L̃(F) = E_{q_F}[log π(X, θ)] + H̃[q_F]

is concave in F.


Empirical evaluation

- Compare three aggregation strategies:
  - Uniform average
  - Gaussian-motivated weighted average (CMC)
  - Optimized weighted average (VCMC)

- For each algorithm A, report the approximation error of an expectation E_π[f], relative to serial MCMC (a small sketch of this metric follows below):

      ε_A(f) = |E_A[f] − E_MCMC[f]| / |E_MCMC[f]|

- Preliminary speedup results
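The error metric on this slide is straightforward to estimate from two sets of draws; a small sketch (function and variable names are mine):

```python
import numpy as np

def relative_error(f, samples_algo, samples_serial):
    """epsilon_A(f) = |E_A[f] - E_MCMC[f]| / |E_MCMC[f]|, estimated from draws."""
    e_algo = np.mean([f(theta) for theta in samples_algo])
    e_serial = np.mean([f(theta) for theta in samples_serial])
    return abs(e_algo - e_serial) / abs(e_serial)
```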


Example 1: High-dimensional Bayesian probit regression

#data = 100,000, d = 300

First moment estimation error, relative to serial MCMC

(Error truncated at 2.0)


Example 2: High-dimensional covariance estimation

Normal-inverse Wishart model

#data = 100,000, #dim = 100 ⟹ 5,050 parameters

(Left) First moment estimation error   (Right) Eigenvalue estimation error


Example 3: Mixture of eight 8-dimensional Gaussians

Error relative to serial MCMC, for cluster comembershipprobabilities of pairs of test data points


VCMC reduces CMC error at the cost of speedup (∼2x)

VCMC speedup is approximately linear

[Figure: speedup curves for CMC and VCMC as the number of partitions grows]


Discussion

Contributions

- Convex optimization framework for Consensus Monte Carlo

- Structured aggregation accounting for constrained parameters

- Entropy relaxation

- Empirical evaluation