Firefly exact MCMC for Big Data
EXACT MCMC ON BIG DATA: THE TIP OF AN ICEBERG
University of Helsinki
Gianvito Siciliano
(2014 - Probabilistic Models for Big Data Seminar)
AGENDA
1. MCMC intro:
• Bayesian Inference
• Sampling methods (Gibbs, MH)
2. MCMC and Big Data
• Issues
• Approximate solutions (SGLD, SGFS, MH Test)
3. Firefly Monte Carlo
4. Conclusions
BAYESIAN MODELING
• To obtain quantities of interest from the posterior we usually need to evaluate an integral of this form:
E[f | X] = ∫ f(θ) P(θ | X) dθ
• The problem is that these integrals are usually impossible to evaluate analytically
• Bayes' rule allows us to express the posterior over parameters in terms of the prior and likelihood terms:
P(θ | X) ∝ P(θ) ∏_{i=1}^N P(x_i | θ)
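In practice the product of N likelihood terms underflows long before N gets large, so numerical work is done with the unnormalized log posterior instead. A minimal sketch (the function name and the toy model in the usage note are illustrative, not from the slides):

```python
import numpy as np

def log_posterior(theta, data, log_lik, log_prior):
    """Unnormalized log posterior: log P(theta) + sum_i log P(x_i | theta).
    Working in log space avoids underflow in the product of N terms."""
    return log_prior(theta) + np.sum([log_lik(x, theta) for x in data])
```

For a flat prior and unit-variance Gaussian likelihoods this reduces to a negative sum of squared residuals.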
MCMC
• Monte Carlo: simulation to draw quantities of interest from the distribution
• Markov Chain: stochastic process in which future states are independent of past states given the present state.
• Hence, MCMC is a class of methods in which we can simulate draws that are slightly dependent and are approximately from the posterior distribution.
HOW TO SAMPLE?
In Bayesian statistics, there are generally two algorithms you can use for pseudo-random sampling from a distribution: the Gibbs sampler and the Metropolis-Hastings algorithm.
Gibbs Sampler: used to sample from a joint distribution, if we know the full conditional distributions for each parameter:
JD = p(θ1, . . . , θk)
The full conditional distribution is the distribution of a parameter conditional on the known information and all the other parameters:
FCD = p(θj | θ−j, X)
Metropolis-Hastings: used when…
• the posterior doesn't look like any distribution we know (no conjugacy)
• the posterior consists of more than 2 parameters (grid approximations intractable)
• some (or all) of the full conditionals do not look like any distribution we know (no Gibbs sampling for those whose full conditionals we don't know)
Gibbs Sampler
1. Pick a vector of starting values θ(0).
2. Start with any θ (order does not matter). Draw a value θ1(1) from the full conditional p(θ1 | θ2(0), θ3(0), y).
3. Draw a value θ2(1) (again, order does not matter) from the full conditional p(θ2 | θ1(1), θ3(0), y). Note that we must use the updated value θ1(1).
4. Repeat (for all parameters) until we get M draws, with each draw being a vector θ(t).
5. Optional burn-in and/or thinning.
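The steps above can be sketched on a toy target where Gibbs applies cleanly: a bivariate normal with correlation ρ, whose full conditionals are the known univariate Gaussians N(ρθ2, 1 − ρ²) and N(ρθ1, 1 − ρ²). The target and all parameter values here are assumptions for illustration:

```python
import numpy as np

def gibbs_bivariate_normal(rho, M=5000, burn_in=500, seed=0):
    """Gibbs sampling from a bivariate normal with zero means, unit
    variances and correlation rho, using the known full conditionals."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)                 # step 1: starting values theta^(0)
    draws = []
    for t in range(M + burn_in):
        # steps 2-3: draw each parameter from its full conditional,
        # always conditioning on the most recently updated values
        theta[0] = rng.normal(rho * theta[1], np.sqrt(1 - rho**2))
        theta[1] = rng.normal(rho * theta[0], np.sqrt(1 - rho**2))
        draws.append(theta.copy())      # step 4: collect the draw theta^(t)
    return np.array(draws[burn_in:])    # step 5: discard burn-in
```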
MH Algorithm
1. Choose a starting value θ(0).
2. At iteration t, draw a candidate θ∗ from a jumping distribution Jt(θ∗ | θ(t−1)).
3. Compute the acceptance ratio:
r = [p(θ∗ | y) / Jt(θ∗ | θ(t−1))] / [p(θ(t−1) | y) / Jt(θ(t−1) | θ∗)]
4. Accept θ∗ as θ(t) with probability min(r, 1). If θ∗ is not accepted, then θ(t) = θ(t−1).
5. Repeat steps 2-4 M times to get M draws from p(θ | y), with optional burn-in and/or thinning.
Metropolis Algorithm
When the jumping distribution is symmetric, i.e. Jt(θ∗ | θ(t−1)) = Jt(θ(t−1) | θ∗), the correction terms cancel and the acceptance ratio in step 3 simplifies to:
r = p(θ∗ | y) / p(θ(t−1) | y)
All other steps are the same as in the MH algorithm above.
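A minimal sketch of the algorithm with a symmetric Gaussian jumping distribution, worked in log space for numerical stability (all names and the toy target in the usage note are illustrative):

```python
import numpy as np

def metropolis_hastings(log_post, theta0, step, M=5000, seed=0):
    """Random-walk Metropolis: the Gaussian jumping distribution is
    symmetric, so the acceptance ratio reduces to
    r = p(theta* | y) / p(theta^(t-1) | y), evaluated in log space."""
    rng = np.random.default_rng(seed)
    theta, lp = theta0, log_post(theta0)
    draws = []
    for t in range(M):
        cand = theta + rng.normal(0.0, step)        # step 2: candidate
        lp_cand = log_post(cand)
        if np.log(rng.uniform()) < lp_cand - lp:    # steps 3-4: min(r, 1)
            theta, lp = cand, lp_cand
        draws.append(theta)                         # on reject, keep old value
    return np.array(draws)
```

For example, with `log_post = lambda th: -0.5 * (th - 3.0)**2` (an unnormalized N(3, 1) target) the draws settle around mean 3.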
• The canonical MCMC algorithm proposes samples from a distribution Q and accepts/rejects the proposals with a rule that needs to examine the likelihood of all data items
• Since all the data are processed at each iteration, the run-time may be excessive!
MCMC and BIG DATA
Propose: θ′ ∼ Q(θ′ | θ)
Accept with probability:
α = min[ 1, ( Q(θ | θ′) P(θ′) ∏_{i=1}^N P(x_i | θ′) ) / ( Q(θ′ | θ) P(θ) ∏_{i=1}^N P(x_i | θ) ) ]
If accept = True: θ ← θ′
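Written out in log space, the rule above makes the cost explicit: every single proposal requires summing log-likelihoods over all N data points. A sketch (function and argument names are illustrative; `log_q(to, frm)` is the log density of proposing `to` from `frm`):

```python
import numpy as np

def mh_accept_prob(theta, theta_p, data, log_lik, log_prior, log_q):
    """Full-data MH acceptance probability, in log space.
    The sum over all N data points is the per-iteration cost that
    becomes prohibitive for big data sets."""
    def log_joint(th):
        return log_prior(th) + sum(log_lik(x, th) for x in data)  # O(N)
    log_r = (log_joint(theta_p) + log_q(theta, theta_p)
             - log_joint(theta) - log_q(theta_p, theta))
    return min(1.0, np.exp(log_r))
```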
IDEA
• Assume that you have T units of computation to achieve the lowest possible error.
• Your MCMC procedure has a knob to control the bias/variance tradeoff.
So, during the sampling phase…
Turn left => SLOW: small bias, high variance
Turn right => FAST: strong bias, low variance
MCMC APPROXIMATE SOLUTIONS FOR BIG DATA
SGLD & SGFS: knob = stepsize
Stochastic Gradient Langevin Dynamics
Langevin dynamics based on stochastic gradients [Welling & Teh, ICML 2011]
• The idea is to extend the Stochastic Gradient Descent optimization algorithm to include Gaussian noise via Langevin dynamics.
• One of the advantages of SGLD is that the entire data set never needs to be held in memory.
• Disadvantages:
• it has to read from external data at each iteration
• gradients are computationally expensive
• it needs a proper preconditioning matrix to decide the step size of the transition operator.
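One SGLD update can be sketched as follows: a mini-batch gradient of the log posterior, rescaled to be unbiased for the full data set, plus Gaussian noise with variance equal to the stepsize. This is a simplification (no preconditioning matrix, fixed stepsize); all names are illustrative:

```python
import numpy as np

def sgld_step(theta, data, grad_log_prior, grad_log_lik, eps, batch_size, rng):
    """One SGLD update: stochastic gradient step plus injected Gaussian noise."""
    N = len(data)
    batch = data[rng.choice(N, batch_size, replace=False)]
    # Unbiased mini-batch estimate of the full-data log-posterior gradient
    grad = grad_log_prior(theta) + (N / batch_size) * sum(
        grad_log_lik(theta, x) for x in batch)
    noise = rng.normal(0.0, np.sqrt(eps), size=np.shape(theta))
    return theta + 0.5 * eps * grad + noise
```

With a small stepsize the chain hovers around the posterior mode while the noise keeps it sampling rather than merely optimizing.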
Stochastic Gradient Fisher Scoring [Ahn et al., ICML 2012]
Built on SGLD, it tries to beat its predecessor by offering a three-phase procedure:
1. Burn-in: large stepsize.
2. Reached distribution: still a large stepsize, sampling from the asymptotic Gaussian approximation of the posterior.
3. Further annealing: smaller stepsize to generate increasingly accurate samples from the true posterior.
• With this approach the algorithm tries to reduce the bias in the burn-in phase and then starts sampling to reduce variance.
MH TEST: knob = confidence
CUTTING THE MH ALGORITHM BUDGET [Korattikara et al., ICML 2014]
…by conducting sequential hypothesis tests to decide whether to accept or reject a given sample, making the majority of these decisions based on a small fraction of the data.
• Works directly on the accept/reject step of the MH algorithm
• Accepts a proposal with a given confidence
• Applicable to problems where it is impossible to compute gradients
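A simplified sketch of the idea, not the authors' exact procedure: given the per-datum log-likelihood differences between the candidate and the current state, and a threshold μ0 derived from the uniform accept/reject draw, test on growing subsamples and stop as soon as the decision is confident at level 1 − ε (all names below are illustrative):

```python
import math
import numpy as np

def approx_mh_test(diffs, mu0, eps=0.05, batch=50, seed=0):
    """Sequential accept/reject decision on a growing subsample.
    diffs[i] = log p(x_i | theta*) - log p(x_i | theta) for all N points;
    the exact rule accepts iff mean(diffs) > mu0."""
    rng = np.random.default_rng(seed)
    N = len(diffs)
    perm = rng.permutation(N)
    n = 0
    while True:
        n = min(N, n + batch)
        sample = diffs[perm[:n]]
        mean, sd = sample.mean(), sample.std(ddof=1)
        if n == N:                       # saw everything: decide exactly
            return mean > mu0, n
        # standard error with finite-population correction
        se = sd / math.sqrt(n) * math.sqrt(1.0 - (n - 1) / (N - 1))
        t = (mean - mu0) / max(se, 1e-12)
        # probability the decision based on this subsample is wrong
        delta = 1.0 - 0.5 * (1.0 + math.erf(abs(t) / math.sqrt(2.0)))
        if delta < eps:                  # confident enough to decide early
            return mean > mu0, n
```

When the candidate is clearly better or clearly worse, the decision is taken after seeing only a small fraction of the data.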
FIREFLY EXACT SOLUTION
ISSUE 1: the prohibitive cost of evaluating every likelihood term at every iteration (for big data sets)
ISSUE 2: the previous procedures construct an approximate transition operator (using subsets of data)
GOAL: obtain an exact procedure that leaves the true full-data posterior distribution invariant!
HOW: by querying only the likelihoods of a potentially small subset of the data at each iteration, yet simulating from the exact posterior
IDEA: introduce a collection of Bernoulli variables that turn on (and off) the data points for which the likelihoods are calculated
FLYMC: HOW IT WORKS
Assuming we have:
1. A target distribution p(θ | x₁, …, x_N)
2. A likelihood function L_n(θ) = P(x_n | θ)
Computing all N likelihoods at every iteration is a bottleneck!
3. Assume that each product term L_n(θ) can be bounded by a cheaper lower bound: 0 < B_n(θ) ≤ L_n(θ)
4. Introduce a binary variable z_n ∈ {0, 1} for each data point
5. Each z_n has the following (conditional) Bernoulli distribution:
P(z_n | x_n, θ) = [(L_n(θ) − B_n(θ)) / L_n(θ)]^{z_n} · [B_n(θ) / L_n(θ)]^{1−z_n}
6. And augment the posterior with these N variables:
p(θ, z | X) ∝ P(θ) ∏_{n=1}^N P(x_n | θ) P(z_n | x_n, θ)
Why exact? Marginalizing the z_n out of the augmented joint distribution recovers exactly the correct posterior given in equation 1.
Why firefly? From this joint distribution we evaluate only those likelihood terms for which z_n = 1 (the light terms): data points flicker on and off across iterations.
FLYMC: THE REDUCED SPACE
• We simulate the Markov chain on the augmented (θ, z_n) space:
z_n = 0 => dark point (no likelihood computed)
z_n = 1 => light point (likelihood computed)
• If the bound B_n is tight, the Markov chain will tend to occupy z_n = 0, so most points stay dark and most likelihoods are never evaluated.
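The brightness update can be sketched as a Gibbs step on the z_n, since P(z_n = 1 | θ) = (L_n − B_n) / L_n. This is a simplification: real FlyMC resamples only a subset of the z_n per iteration and caches bound computations, so it does not touch every likelihood as this sketch does. All names are illustrative:

```python
import numpy as np

def resample_brightness(z, theta, data, lik, bound, rng):
    """Gibbs update of the auxiliary brightness variables z_n.
    Tight bounds make P(z_n = 1) small, so most points come out dark."""
    for n in range(len(data)):
        L = lik(data[n], theta)
        B = bound(data[n], theta)
        z[n] = rng.uniform() < (L - B) / L   # Bernoulli((L_n - B_n) / L_n)
    return z
```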
ALGORITHM IMPL.
FLYMC: LOWER BOUND
The lower bound B_n(θ) of each data point's likelihood L_n(θ) should satisfy 2 properties:
• Tightness, which determines the average number M of bright data points
• It must be easy to compute the product (e.g. using scaled exponential-family lower bounds)
With this setting we achieve a speedup of N/M over the O(ND) evaluation time of regular MCMC.
MAP-OPTIMISATION
…in order to find an approximate maximum a posteriori (MAP) value of θ and to construct B_n to be tight there.
The proposed algorithm versions (used in the experiments) are:
• Untuned FlyMC, with the choice ε = 1.5 for all data points.
• MAP-tuned FlyMC, which performs a gradient descent optimization to find an ε value for each data point. (This yields bounds that are tighter around the MAP value of θ.)
• Regular full-posterior MCMC (for comparison)
EXPERIMENTS
Expectation:
• slower in mixing
• faster in iterating
Results:
• FlyMC offers a speedup of at least one order of magnitude compared with regular MCMC
CONCLUSIONS
FlyMC is an exact procedure that has the true full-data posterior as its target.
The introduction of the binary latent variables is a simple and efficient idea.
The lower bound is a requirement, and it can be difficult to obtain for many problems.
Acknowledgements
Reviewers: Dr. Antti Honkela, Dr. Arto Klami