Firefly exact MCMC for Big Data
EXACT MCMC ON BIG DATA: THE TIP OF AN ICEBERG
University of Helsinki
Gianvito Siciliano
(2014 - Probabilistic Models for Big Data Seminar)
AGENDA
1. MCMC intro:
• Bayesian Inference
• Sampling methods (Gibbs, MH)
2. MCMC and Big Data
• Issues
• Approximate solutions (SGLD, SGFS, MH Test)
3. Firefly Monte Carlo
4. Conclusions
BAYESIAN MODELING
• To obtain quantities of interest from the posterior we usually need to evaluate an integral of this form:
E[f | X] = ∫ f(θ) P(θ | X) dθ
• The problem is that these integrals are usually impossible to evaluate analytically
• Bayes' rule allows us to express the posterior over parameters in terms of the prior and likelihood terms:
P(θ | X) ∝ P(θ) ∏_{i=1}^N P(x_i | θ)
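In practice the product of N likelihood terms underflows long before N gets large, so numerical work is done with the unnormalized log posterior instead. A minimal sketch (the function name and the toy model in the usage note are illustrative, not from the slides):

```python
import numpy as np

def log_posterior(theta, data, log_lik, log_prior):
    """Unnormalized log posterior: log P(theta) + sum_i log P(x_i | theta).
    Working in log space avoids underflow in the product of N terms."""
    return log_prior(theta) + np.sum([log_lik(x, theta) for x in data])
```

For a flat prior and unit-variance Gaussian likelihoods this reduces to a negative sum of squared residuals.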
MCMC
• Monte Carlo: simulation to draw quantities of interest from the distribution
• Markov Chain: stochastic process in which future states are independent of past states given the present state.
• Hence, MCMC is a class of methods in which we can simulate draws that are slightly dependent and are approximately from the posterior distribution.
HOW TO SAMPLE?
In Bayesian statistics, there are generally two algorithms you can use for pseudo-random sampling from a distribution: the Gibbs sampler and the Metropolis-Hastings algorithm.
Gibbs Sampler: used to sample from a joint distribution, if we know the full conditional distributions for each parameter:
JD = p(θ1, . . . , θk)
The full conditional distribution is the distribution of a parameter conditional on the known information and all the other parameters:
FCD = p(θj | θ−j, X)
Metropolis-Hastings: used when…
• the posterior doesn't look like any distribution we know (no conjugacy)
• the posterior consists of more than 2 parameters (grid approximations intractable)
• some (or all) of the full conditionals do not look like any distribution we know (no Gibbs sampling for those whose full conditionals we don't know)
Gibbs Sampler
1. Pick a vector of starting values θ(0).
2. Start with any θ (order does not matter). Draw a value θ1(1) from the full conditional p(θ1 | θ2(0), θ3(0), y).
3. Draw a value θ2(1) (again, order does not matter) from the full conditional p(θ2 | θ1(1), θ3(0), y). Note that we must use the updated value θ1(1).
4. Repeat (for all parameters) until we get M draws, with each draw being a vector θ(t).
5. Optional burn-in and/or thinning.
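The steps above can be sketched on a toy target where Gibbs applies cleanly: a bivariate normal with correlation ρ, whose full conditionals are the known univariate Gaussians N(ρθ2, 1 − ρ²) and N(ρθ1, 1 − ρ²). The target and all parameter values here are assumptions for illustration:

```python
import numpy as np

def gibbs_bivariate_normal(rho, M=5000, burn_in=500, seed=0):
    """Gibbs sampling from a bivariate normal with zero means, unit
    variances and correlation rho, using the known full conditionals."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)                 # step 1: starting values theta^(0)
    draws = []
    for t in range(M + burn_in):
        # steps 2-3: draw each parameter from its full conditional,
        # always conditioning on the most recently updated values
        theta[0] = rng.normal(rho * theta[1], np.sqrt(1 - rho**2))
        theta[1] = rng.normal(rho * theta[0], np.sqrt(1 - rho**2))
        draws.append(theta.copy())      # step 4: collect the draw theta^(t)
    return np.array(draws[burn_in:])    # step 5: discard burn-in
```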
MH Algorithm
1. Choose a starting value θ(0).
2. At iteration t, draw a candidate θ∗ from a jumping distribution Jt(θ∗ | θ(t−1)).
3. Compute the acceptance ratio:
r = [p(θ∗ | y) / Jt(θ∗ | θ(t−1))] / [p(θ(t−1) | y) / Jt(θ(t−1) | θ∗)]
4. Accept θ∗ as θ(t) with probability min(r, 1). If θ∗ is not accepted, then θ(t) = θ(t−1).
5. Repeat steps 2-4 M times to get M draws from p(θ | y), with optional burn-in and/or thinning.
Metropolis Algorithm
When the jumping distribution is symmetric, i.e. Jt(θ∗ | θ(t−1)) = Jt(θ(t−1) | θ∗), the correction terms cancel and the acceptance ratio in step 3 simplifies to:
r = p(θ∗ | y) / p(θ(t−1) | y)
All other steps are the same as in the MH algorithm above.
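A minimal sketch of the algorithm with a symmetric Gaussian jumping distribution, worked in log space for numerical stability (all names and the toy target in the usage note are illustrative):

```python
import numpy as np

def metropolis_hastings(log_post, theta0, step, M=5000, seed=0):
    """Random-walk Metropolis: the Gaussian jumping distribution is
    symmetric, so the acceptance ratio reduces to
    r = p(theta* | y) / p(theta^(t-1) | y), evaluated in log space."""
    rng = np.random.default_rng(seed)
    theta, lp = theta0, log_post(theta0)
    draws = []
    for t in range(M):
        cand = theta + rng.normal(0.0, step)        # step 2: candidate
        lp_cand = log_post(cand)
        if np.log(rng.uniform()) < lp_cand - lp:    # steps 3-4: min(r, 1)
            theta, lp = cand, lp_cand
        draws.append(theta)                         # on reject, keep old value
    return np.array(draws)
```

For example, with `log_post = lambda th: -0.5 * (th - 3.0)**2` (an unnormalized N(3, 1) target) the draws settle around mean 3.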
• The canonical MCMC algorithm proposes samples from a distribution Q and accepts/rejects the proposals with a rule that needs to examine the likelihood of all data items
• Since all the data are processed at each iteration, the run-time may be excessive!
MCMC and BIG DATA
Propose: θ′ ∼ Q(θ′ | θ)
Accept with probability:
α = min[ 1, ( Q(θ | θ′) P(θ′) ∏_{i=1}^N P(x_i | θ′) ) / ( Q(θ′ | θ) P(θ) ∏_{i=1}^N P(x_i | θ) ) ]
If accept = True: θ ← θ′
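Written out in log space, the rule above makes the cost explicit: every single proposal requires summing log-likelihoods over all N data points. A sketch (function and argument names are illustrative; `log_q(to, frm)` is the log density of proposing `to` from `frm`):

```python
import numpy as np

def mh_accept_prob(theta, theta_p, data, log_lik, log_prior, log_q):
    """Full-data MH acceptance probability, in log space.
    The sum over all N data points is the per-iteration cost that
    becomes prohibitive for big data sets."""
    def log_joint(th):
        return log_prior(th) + sum(log_lik(x, th) for x in data)  # O(N)
    log_r = (log_joint(theta_p) + log_q(theta, theta_p)
             - log_joint(theta) - log_q(theta_p, theta))
    return min(1.0, np.exp(log_r))
```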
IDEA
• Assume that you have T units of computation to achieve the lowest possible error.
• Your MCMC procedure has a knob to control the bias/variance tradeoff.
So, during the sampling phase…
Turn left => SLOW: small bias, high variance
Turn right => FAST: strong bias, low variance
MCMC APPROXIMATE SOLUTIONS FOR BIG DATA
SGLD & SGFS: knob = stepsize
Stochastic Gradient Langevin Dynamics
Langevin dynamics based on stochastic gradients [Welling & Teh, ICML 2011]
• The idea is to extend the Stochastic Gradient Descent optimization algorithm to include Gaussian noise via Langevin dynamics.
• One of the advantages of SGLD is that the entire data set never needs to be held in memory.
• Disadvantages:
• it has to read from external data at each iteration
• gradients are computationally expensive
• it needs a proper preconditioning matrix to decide the step size of the transition operator.
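One SGLD update can be sketched as follows: a mini-batch gradient of the log posterior, rescaled to be unbiased for the full data set, plus Gaussian noise with variance equal to the stepsize. This is a simplification (no preconditioning matrix, fixed stepsize); all names are illustrative:

```python
import numpy as np

def sgld_step(theta, data, grad_log_prior, grad_log_lik, eps, batch_size, rng):
    """One SGLD update: stochastic gradient step plus injected Gaussian noise."""
    N = len(data)
    batch = data[rng.choice(N, batch_size, replace=False)]
    # Unbiased mini-batch estimate of the full-data log-posterior gradient
    grad = grad_log_prior(theta) + (N / batch_size) * sum(
        grad_log_lik(theta, x) for x in batch)
    noise = rng.normal(0.0, np.sqrt(eps), size=np.shape(theta))
    return theta + 0.5 * eps * grad + noise
```

With a small stepsize the chain hovers around the posterior mode while the noise keeps it sampling rather than merely optimizing.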
Stochastic Gradient Fisher Scoring [Ahn et al., ICML 2012]
Built on SGLD, it tries to beat its predecessor by offering a three-phase procedure:
1. Burn-in: large stepsize.
2. Reached distribution: still a large stepsize, sampling from the asymptotic Gaussian approximation of the posterior.
3. Further annealing: smaller stepsize to generate increasingly accurate samples from the true posterior.
• With this approach the algorithm tries to reduce the bias in the burn-in phase and then starts sampling to reduce variance.
MH TEST: knob = confidence
CUTTING THE MH ALGORITHM BUDGET [Korattikara et al., ICML 2014]
…by conducting sequential hypothesis tests to decide whether to accept or reject a given sample, making the majority of these decisions based on a small fraction of the data.
• Works directly on the accept/reject step of the MH algorithm
• Accepts a proposal with a given confidence
• Applicable to problems where it is impossible to compute gradients
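A simplified sketch of the idea, not the authors' exact procedure: given the per-datum log-likelihood differences between the candidate and the current state, and a threshold μ0 derived from the uniform accept/reject draw, test on growing subsamples and stop as soon as the decision is confident at level 1 − ε (all names below are illustrative):

```python
import math
import numpy as np

def approx_mh_test(diffs, mu0, eps=0.05, batch=50, seed=0):
    """Sequential accept/reject decision on a growing subsample.
    diffs[i] = log p(x_i | theta*) - log p(x_i | theta) for all N points;
    the exact rule accepts iff mean(diffs) > mu0."""
    rng = np.random.default_rng(seed)
    N = len(diffs)
    perm = rng.permutation(N)
    n = 0
    while True:
        n = min(N, n + batch)
        sample = diffs[perm[:n]]
        mean, sd = sample.mean(), sample.std(ddof=1)
        if n == N:                       # saw everything: decide exactly
            return mean > mu0, n
        # standard error with finite-population correction
        se = sd / math.sqrt(n) * math.sqrt(1.0 - (n - 1) / (N - 1))
        t = (mean - mu0) / max(se, 1e-12)
        # probability the decision based on this subsample is wrong
        delta = 1.0 - 0.5 * (1.0 + math.erf(abs(t) / math.sqrt(2.0)))
        if delta < eps:                  # confident enough to decide early
            return mean > mu0, n
```

When the candidate is clearly better or clearly worse, the decision is taken after seeing only a small fraction of the data.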
FIREFLY EXACT SOLUTION
ISSUE 1: the prohibitive cost of evaluating every likelihood term at every iteration (for big data sets)
ISSUE 2: the previous procedures construct an approximate transition operator (using subsets of data)
GOAL: obtain an exact procedure that leaves the true full-data posterior distribution invariant!
HOW: by querying only the likelihoods of a potentially small subset of the data at each iteration, yet simulating from the exact posterior
IDEA: introduce a collection of Bernoulli variables that turn on (and off) the data points for which the likelihoods are calculated
FLYMC: HOW IT WORKS
Assuming we have:
1. A target distribution p(θ | x₁, …, x_N)
2. A likelihood function L_n(θ) = P(x_n | θ)
Computing all N likelihoods at every iteration is a bottleneck!
3. Assume that each product term L_n(θ) can be bounded by a cheaper lower bound: 0 < B_n(θ) ≤ L_n(θ)
4. Introduce a binary variable z_n ∈ {0, 1} for each data point
5. Each z_n has the following (conditional) Bernoulli distribution:
P(z_n | x_n, θ) = [(L_n(θ) − B_n(θ)) / L_n(θ)]^{z_n} · [B_n(θ) / L_n(θ)]^{1−z_n}
6. And augment the posterior with these N variables:
p(θ, z | X) ∝ P(θ) ∏_{n=1}^N P(x_n | θ) P(z_n | x_n, θ)
Why exact? Marginalizing the z_n out of the augmented joint distribution recovers exactly the correct posterior given in equation 1.
Why firefly? From this joint distribution we evaluate only those likelihood terms for which z_n = 1 (the light terms): data points flicker on and off across iterations.
FLYMC: THE REDUCED SPACE
• We simulate the Markov chain on the augmented (θ, z_n) space:
z_n = 0 => dark point (no likelihood computed)
z_n = 1 => light point (likelihood computed)
• If the bound B_n is tight, the Markov chain will tend to occupy z_n = 0, so most points stay dark and most likelihoods are never evaluated.
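The brightness update can be sketched as a Gibbs step on the z_n, since P(z_n = 1 | θ) = (L_n − B_n) / L_n. This is a simplification: real FlyMC resamples only a subset of the z_n per iteration and caches bound computations, so it does not touch every likelihood as this sketch does. All names are illustrative:

```python
import numpy as np

def resample_brightness(z, theta, data, lik, bound, rng):
    """Gibbs update of the auxiliary brightness variables z_n.
    Tight bounds make P(z_n = 1) small, so most points come out dark."""
    for n in range(len(data)):
        L = lik(data[n], theta)
        B = bound(data[n], theta)
        z[n] = rng.uniform() < (L - B) / L   # Bernoulli((L_n - B_n) / L_n)
    return z
```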
ALGORITHM IMPL.
FLYMC: LOWER BOUND
The lower bound B_n(θ) of each data point's likelihood L_n(θ) should satisfy 2 properties:
• Tightness, which determines the average number M of bright data points
• It must be easy to compute the product (e.g. using scaled exponential-family lower bounds)
With this setting we achieve a speedup of N/M over the O(ND) evaluation time of regular MCMC.
MAP-OPTIMISATION
…in order to find an approximate maximum a posteriori (MAP) value of θ and to construct B_n to be tight there.
The proposed algorithm versions (used in the experiments) are:
• Untuned FlyMC, with the choice ε = 1.5 for all data points.
• MAP-tuned FlyMC, which performs a gradient descent optimization to find an ε value for each data point. (This yields bounds that are tighter around the MAP value of θ.)
• Regular full-posterior MCMC (for comparison)
EXPERIMENTS
Expectation:
• slower in mixing
• faster in iterating
Results:
• FlyMC offers a speedup of at least one order of magnitude compared with regular MCMC
CONCLUSIONS
FlyMC is an exact procedure that has the true full-data posterior as its target.
The introduction of the binary latent variables is a simple and efficient idea.
The lower bound is a requirement, and it can be difficult to obtain for many problems.
Acknowledgements
Reviewers: Dr. Antti Honkela, Dr. Arto Klami