Transcript of: Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling
Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling
Christopher De Sa Kunle Olukotun Christopher Ré {cdesa,kunle,chrismre}@stanford.edu
Stanford
Overview
Asynchronous Gibbs sampling is a popular algorithm that’s used in practical ML systems.
Zhang et al., PVLDB 2014; Smola et al., PVLDB 2010
…etc.
Question: when and why does it work?
“Folklore” says that asynchronous Gibbs sampling basically works whenever standard (sequential) Gibbs sampling does
…but there’s no theoretical guarantee.
Our contributions
1. The “folklore” is not necessarily true.
2. …but it works under reasonable conditions.
![Page 8: Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs ... · 14 . x 1 x 2 x 3 x 4 x 5 x 7 x 6 Algorithm 1 Gibbs sampling Require: Variables x i for 1 i n, and distribution ⇡.](https://reader035.fdocuments.net/reader035/viewer/2022071504/6123d8bd2ea3c34a517f325e/html5/thumbnails/8.jpg)
Asynchronous Gibbs sampling is a popular algorithm that’s used in practical ML systems.
Question: when and why does it work?
“Folklore” says that asynchronous Gibbs sampling basically works whenever standard (sequential) Gibbs sampling does
…but there’s no theoretical guarantee.
Our contributions
1. The “folklore” is not necesarily true. 2. ...but it works under reasonable conditions.
![Page 9: Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs ... · 14 . x 1 x 2 x 3 x 4 x 5 x 7 x 6 Algorithm 1 Gibbs sampling Require: Variables x i for 1 i n, and distribution ⇡.](https://reader035.fdocuments.net/reader035/viewer/2022071504/6123d8bd2ea3c34a517f325e/html5/thumbnails/9.jpg)
Asynchronous Gibbs sampling is a popular algorithm that’s used in practical ML systems.
Question: when and why does it work?
“Folklore” says that asynchronous Gibbs sampling basically works whenever standard (sequential) Gibbs sampling does
…but there’s no theoretical guarantee.
Our contributions
1. The “folklore” is not necesarily true. 2. ...but it works under reasonable conditions.
![Page 10: Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs ... · 14 . x 1 x 2 x 3 x 4 x 5 x 7 x 6 Algorithm 1 Gibbs sampling Require: Variables x i for 1 i n, and distribution ⇡.](https://reader035.fdocuments.net/reader035/viewer/2022071504/6123d8bd2ea3c34a517f325e/html5/thumbnails/10.jpg)
10
Problem: given a probability distribution, produce samples from it.
• e.g. to do inference in a graphical model
Algorithm: Gibbs sampling
• de facto Markov chain Monte Carlo (MCMC) method for inference
• produces a series of approximate samples that approach the target distribution
What is Gibbs Sampling?
Algorithm 1 Gibbs sampling
Require: Variables x_i for 1 ≤ i ≤ n, and distribution π.
loop
    Choose s by sampling uniformly from {1, …, n}.
    Re-sample x_s from P_π(x_s | x_{1,…,n}\{s}).
    output x
end loop
Choose a variable to update at random.
Compute its conditional distribution given the other variables.
[Figure: conditional distribution of x5 given its neighbors x4, x6, x7: P(x5 = ●) = 0.7, P(x5 = ○) = 0.3]
Update the variable by sampling from its conditional distribution.
Output the current state as a sample.
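The loop above can be sketched in a few lines of Python. This is a minimal illustration of Algorithm 1 for binary variables, assuming the target is given as an unnormalized probability function `p`; the pairwise model below (two variables that prefer to agree) is a made-up example, not one from the talk.

```python
import random

def gibbs_step(x, p, rng):
    n = len(x)
    s = rng.randrange(n)               # choose a variable uniformly at random
    y0, y1 = list(x), list(x)
    y0[s], y1[s] = 0, 1
    w0, w1 = p(y0), p(y1)              # unnormalized conditional weights
    x[s] = 1 if rng.random() < w1 / (w0 + w1) else 0
    return x                           # output the current state as a sample

p = lambda x: 3.0 if x[0] == x[1] else 1.0   # hypothetical pairwise model
rng = random.Random(0)
x, agree = [0, 1], 0
for _ in range(20000):
    x = gibbs_step(x, p, rng)
    agree += (x[0] == x[1])
# under p, the stationary probability that x0 == x1 is 6/8 = 0.75
```

Because each update only touches one variable and its conditional weights, the per-step cost is small, which is the sparsity the next slides point to.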
Gibbs Sampling: A Practical Perspective
• Pros of Gibbs sampling
  – easy to implement
  – updates are sparse → fast on modern CPUs
• Cons of Gibbs sampling
  – sequential algorithm → can’t naively parallelize
e.g. on a 64-core machine, no parallelism leaves up to 98% of performance on the table!
Asynchronous Gibbs Sampling
• Run multiple threads in parallel without locks
  – also known as HOGWILD!
  – adapted from a popular technique for stochastic gradient descent (SGD)
• When we read a variable, it could be stale
  – while we re-sample a variable, its adjacent variables can be overwritten by other threads
  – semantics not equivalent to standard (sequential) Gibbs sampling
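The lock-free scheme above can be sketched as follows: several worker threads run Gibbs updates on one shared state with no synchronization, so each conditional may be computed from stale reads. The model `p` and the thread and step counts are illustrative assumptions, not values from the talk.

```python
import random
import threading

def worker(x, p, steps, rng):
    n = len(x)
    for _ in range(steps):
        s = rng.randrange(n)
        snap = list(x)                 # unsynchronized read: may be stale
        y0, y1 = list(snap), list(snap)
        y0[s], y1[s] = 0, 1
        w0, w1 = p(y0), p(y1)
        x[s] = 1 if rng.random() < w1 / (w0 + w1) else 0   # lock-free write

p = lambda x: 3.0 if x[0] == x[1] else 1.0   # hypothetical pairwise model
shared = [0, 1]                               # shared state, no locks
threads = [threading.Thread(target=worker,
                            args=(shared, p, 5000, random.Random(i)))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Note that nothing prevents a thread from computing its conditional on values another thread overwrites a moment later; that staleness is exactly what breaks the sequential semantics.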
Question
Does asynchronous Gibbs sampling work? …and what does it mean for it to work?
Two desiderata
• want to get accurate estimates ⇒ bound the bias
• want to be independent of initial conditions quickly ⇒ bound the mixing time
Previous Work
• “Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent” — Niu et al., NIPS 2011.
  – follow-up work: Liu and Wright, SIOPT 2015; Liu et al., JMLR 2015; De Sa et al., NIPS 2015; Mania et al., arXiv 2015
• “Analyzing Hogwild Parallel Gaussian Gibbs Sampling” — Johnson et al., NIPS 2013.
Bias
• How close are samples to target distribution?
  – standard measurement: total variation distance

    ‖μ − ν‖_TV = max_{A ⊂ Ω} |μ(A) − ν(A)|

• For sequential Gibbs, no asymptotic bias:

    ∀μ₀: lim_{t→∞} ‖P^(t) μ₀ − π‖_TV = 0

“Folklore”: asynchronous Gibbs is also unbiased. …but this is not necessarily true!
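For discrete distributions, the total variation distance defined above can be computed directly: the maximum over events A of |μ(A) − ν(A)| equals half the L1 distance between the probability vectors. A small sketch, with made-up example numbers (not measurements from the talk):

```python
def tv_distance(mu, nu):
    # mu, nu: dicts mapping states to probabilities
    states = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(s, 0.0) - nu.get(s, 0.0)) for s in states)

pi = {(0, 1): 1/3, (1, 0): 1/3, (1, 1): 1/3}                     # target
est = {(0, 0): 0.05, (0, 1): 0.31, (1, 0): 0.31, (1, 1): 0.33}   # hypothetical empirical estimate
d = tv_distance(pi, est)
```

The half-L1 form is convenient because it avoids enumerating all 2^|Ω| events while giving the same maximum.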
Simple Bias Example

Target distribution over two binary variables:

    p(0,1) = p(1,0) = p(1,1) = 1/3,   p(0,0) = 0.

[Figure: state-transition diagram of sequential Gibbs over the states (0,0), (0,1), (1,0), (1,1), with transition probabilities 1/4, 1/2, and 3/4.]

Two threads update starting from the same state. Running asynchronously, both can read the same stale values, and the chain can land in (0,0), which should have zero probability!
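One way to see the failure is to model a single asynchronous step in which both threads read the same stale state (1, 1) and then each re-samples its own variable from the conditional computed on that stale read. This is a sketch of that schedule, not code from the paper; sequential Gibbs can never enter (0,0), but this schedule can.

```python
import random

# target: p(0,1) = p(1,0) = p(1,1) = 1/3, p(0,0) = 0 (unnormalized)
p = lambda x: 0.0 if x == [0, 0] else 1.0

def resample(i, stale, rng):
    # conditional distribution of variable i given the stale other variable
    y0, y1 = list(stale), list(stale)
    y0[i], y1[i] = 0, 1
    w0, w1 = p(y0), p(y1)
    return 1 if rng.random() < w1 / (w0 + w1) else 0

rng = random.Random(0)
hits, trials = 0, 10000
for _ in range(trials):
    stale = [1, 1]                      # both threads read before either writes
    new_state = [resample(0, stale, rng), resample(1, stale, rng)]
    hits += (new_state == [0, 0])
# from (1,1) each conditional puts probability 1/2 on 0, so the forbidden
# state (0,0) appears in roughly a quarter of these concurrent updates
```

The stale read is the whole story: each thread's conditional is correct for the state it read, but neither read reflects the other's write.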
Nonzero Asymptotic Bias

[Figure: joint and marginal distributions of sequential vs. Hogwild! Gibbs over the four states; bias introduced by Hogwild!-Gibbs.]

Measured bias (total variation distance):
• sequential: < 0.1% (unbiased)
• asynchronous: 9.8% (biased)
Are we using the right metric?
• No, total variation distance is too conservative
  – depends on events that don’t matter for inference
  – usually only care about a small number of variables
• New metric: sparse variation distance

    ‖μ − ν‖_SV(ω) = max_{|A| ≤ ω} |μ(A) − ν(A)|

  where |A| is the number of variables on which event A depends

Simple Example: Bias of Asynchronous Gibbs
• Total variation: 9.8%
• Sparse variation (ω = 1): 0.4%
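For a small discrete space, the sparse variation distance can be computed by brute force: an event that depends on at most ω variables is determined by the marginal on some subset S with |S| ≤ ω, so the distance is the worst total variation distance between marginals. A sketch under that observation, with made-up example numbers:

```python
from itertools import combinations

def marginal(dist, idx):
    # marginal of a dict-distribution over the variables in idx
    out = {}
    for state, pr in dist.items():
        key = tuple(state[i] for i in idx)
        out[key] = out.get(key, 0.0) + pr
    return out

def sv_distance(mu, nu, omega):
    n = len(next(iter(mu)))
    best = 0.0
    for k in range(1, omega + 1):
        for idx in combinations(range(n), k):
            m, v = marginal(mu, idx), marginal(nu, idx)
            keys = set(m) | set(v)
            best = max(best,
                       0.5 * sum(abs(m.get(s, 0.0) - v.get(s, 0.0)) for s in keys))
    return best

pi = {(0, 1): 1/3, (1, 0): 1/3, (1, 1): 1/3}
est = {(0, 0): 0.05, (0, 1): 0.31, (1, 0): 0.31, (1, 1): 0.33}
d1 = sv_distance(pi, est, 1)   # events on a single variable only
d2 = sv_distance(pi, est, 2)   # with omega = n this recovers total variation
```

Taking ω = n recovers ordinary total variation, which is why the sparse metric is strictly weaker and can be much smaller.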
Total Influence Parameter
• Old condition used to study mixing times of spin systems:

    α = max_{i∈I} Σ_{j∈I} max_{(X,Y)∈B_j} ‖π_i(·|X_{I\{i}}) − π_i(·|Y_{I\{i}})‖_TV

  – (X, Y) ∈ B_j means X and Y are equal except at variable j.
  – π_i(·|X_{I\{i}}) is the conditional distribution of variable i given the values of all the other variables in state X.
  – Dobrushin’s condition holds when α < 1.
Asymptotic Result

• For any class of distributions with bounded total influence α = O(1)
  – big-O notation is over the number of variables n
• If O(n) timesteps of sequential Gibbs suffice to achieve arbitrarily small bias
  – measured by ω-sparse variation distance, for fixed ω
• …then asynchronous Gibbs requires only O(1) additional timesteps to achieve the same bias!

more details, explicit bounds, et cetera in the paper
Mixing Time
• How long do we need to run until the samples are independent of initial conditions?
• Mixing time of a Markov chain is the first time at which the distribution of the sample is close to the stationary distribution.
  – in terms of total variation distance
  – feasible to run MCMC if mixing time is small
“Folklore”: asynchronous Gibbs has the same mixing time as sequential Gibbs…also not necessarily true!
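A crude empirical way to see mixing is to run the same Gibbs chain from two different initial states and check that a long-run estimate forgets its initialization. The pairwise model `p` and the step counts here are made-up illustrations, not the talk's experiment.

```python
import random

p = lambda x: 3.0 if x[0] == x[1] else 1.0   # hypothetical pairwise model

def gibbs_step(x, rng):
    s = rng.randrange(len(x))
    y0, y1 = list(x), list(x)
    y0[s], y1[s] = 0, 1
    w0, w1 = p(y0), p(y1)
    x[s] = 1 if rng.random() < w1 / (w0 + w1) else 0

def estimate_agreement(x0, steps, seed):
    rng, x, agree = random.Random(seed), list(x0), 0
    for _ in range(steps):
        gibbs_step(x, rng)
        agree += (x[0] == x[1])
    return agree / steps

a = estimate_agreement([0, 0], 50000, seed=1)   # start agreeing
b = estimate_agreement([0, 1], 50000, seed=2)   # start disagreeing
# both estimates approach the stationary value P(x0 == x1) = 0.75
```

When a chain mixes quickly, the two estimates agree after a short burn-in; a slowly mixing chain would keep the fingerprint of its initialization much longer.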
Mixing Time: Example

Empirical estimate of P(1ᵀY > 0) for different maximum delays τ; it should be exactly 0.5.

[Figure: "Mixing of Sequential vs Hogwild! Gibbs" plots the estimate of P(1ᵀY > 0) against sample number (in thousands) for several maximum delays τ, alongside the sequential sampler and the true distribution.]

Here τ is the hardware-dependent read staleness parameter of HOGWILD!.
Sequential Gibbs achieves the correct marginal quickly: tmix = O(n log n).

Asynchronous Gibbs takes much longer: tmix = exp(Ω(n)).
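The qualitative gap above can be explored with a toy simulation. To be clear, this is not the model from the talk: the fully connected Ising model, its parameters, and the staleness mechanism below are all invented for illustration. Each Gibbs update reads a snapshot of the state that may be up to τ updates old, mimicking HOGWILD!-style stale reads in a single thread:

```python
import math
import random

random.seed(0)

n = 10      # number of spins (toy size; the talk uses much larger models)
beta = 0.3  # interaction strength (invented for this sketch)
tau = 5     # maximum read staleness, in number of updates

def gibbs_with_staleness(n_steps, tau):
    """Single-threaded simulation of HOGWILD!-style Gibbs sampling:
    each update reads a snapshot that is a random 0..tau updates old."""
    x = [1] * n
    history = [list(x)]  # recent states, so we can read stale snapshots
    for _ in range(n_steps):
        i = random.randrange(n)
        delay = random.randint(0, min(tau, len(history) - 1))
        snapshot = history[-1 - delay]
        # Fully connected Ising conditional: P(x_i = +1 | rest).
        field = beta * sum(snapshot[j] for j in range(n) if j != i)
        p_plus = 1.0 / (1.0 + math.exp(-2.0 * field))
        x[i] = 1 if random.random() < p_plus else -1
        history.append(list(x))
        if len(history) > tau + 1:
            history.pop(0)
    return x

final = gibbs_with_staleness(2000, tau)
```

With τ = 0 every read is fresh and this reduces to ordinary sequential Gibbs, so the same harness can be used to compare the two regimes.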
Bounding the Mixing Time
Suppose that our target distribution satisfies Dobrushin's condition (total influence α < 1).

• Mixing time of sequential Gibbs (known result):

  tmix-seq(ε) ≤ (n / (1 − α)) log(n/ε).

• Mixing time of asynchronous Gibbs:

  tmix-hog(ε) ≤ ((n + ατ) / (1 − α)) log(n/ε).

Here τ is the hardware-dependent read staleness parameter of HOGWILD!.
Takeaway message: we can compare the two mixing-time bounds:

  tmix-hog(ε) ≈ (1 + ατ/n) tmix-seq(ε)

…they differ by a negligible factor!
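The two bounds can be compared numerically; here is a minimal sketch, where the values of n, α, τ, and ε are arbitrary illustrations rather than numbers from the talk:

```python
import math

def t_mix_seq(n, alpha, eps):
    """Sequential Gibbs mixing-time bound: n/(1-alpha) * log(n/eps)."""
    return n / (1.0 - alpha) * math.log(n / eps)

def t_mix_hog(n, alpha, tau, eps):
    """Asynchronous (HOGWILD!) Gibbs bound: (n + alpha*tau)/(1-alpha) * log(n/eps)."""
    return (n + alpha * tau) / (1.0 - alpha) * math.log(n / eps)

# Illustrative values only: many variables, modest influence and staleness.
n, alpha, tau, eps = 10_000, 0.5, 16, 0.25
ratio = t_mix_hog(n, alpha, tau, eps) / t_mix_seq(n, alpha, eps)
# The log factors cancel, so the ratio equals exactly 1 + alpha*tau/n,
# which is negligible whenever tau << n.
```

The point of the sketch is that the overhead of asynchrony enters only through the additive ατ term, so for realistic staleness (τ far smaller than n) the two bounds are essentially the same.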
Theory Matches Experiment
[Figure 4. Comparison of estimated mixing time and theory-predicted (by Equation 2) mixing time as τ increases, for a synthetic Ising model graph (n = 1000, Δ = 3). The plot, "Estimated tmix of HOGWILD! Gibbs on Large Ising Model," shows mixing time (roughly 16500 to 19000) against the expected delay parameter τ* (0 to 200), with "estimated" and "theory" series.]
the (relatively small) dependence of the mixing time on τ proved to be computationally intractable.

Instead, we use a technique called coupling to the future. We initialize two chains, X and Y, by setting all the variables in X_0 to 1 and all the variables in Y_0 to −1. We proceed by simulating a coupling between the two chains, and return the coupling time T_c. Our estimate of the mixing time will then be t̂(ε), where P(T_c ≥ t̂(ε)) = ε.

Statement 2. This experimental estimate is an upper bound for the mixing time. That is, t̂(ε) ≥ tmix(ε).

To estimate t̂(ε), we ran 10000 instances of the coupling experiment, and returned the sample estimate of t̂(1/4). To compare across a range of τ*, we selected the τ̃_{i,t} to be independent and identically distributed according to the maximum-entropy distribution supported on {0, 1, …, 200} consistent with a particular assignment of τ*. The resulting estimates are plotted as the blue series in Figure 4. The red line represents the mixing time that would be predicted by naively applying Equation 2 using the estimate of the sequential mixing time as a starting point; we can see that it is a very good match for the experimental results. This experiment shows that, at least for one archetypal model, our theory accurately characterizes the behavior of HOGWILD! Gibbs sampling as the delay parameter τ* is changed, and that using HOGWILD!-Gibbs doesn't cause the model to catastrophically fail to mix.
Of course, in order for HOGWILD!-Gibbs to be useful, it must also speed up the execution of Gibbs sampling on some practical models. It is already known that this is the case, as these types of algorithms have been widely implemented in practice (Smola & Narayanamurthy, 2010; Smyth et al., 2009). To further test this, we ran HOGWILD!-Gibbs sampling on a real-world 11 GB Knowledge Base Population dataset (derived from the TAC-KBP challenge) using a machine with a single-socket, 18-core Xeon E7-8890 CPU and 1 TB RAM. As a comparison, we also ran a "multi-model" Gibbs sampler: this consists of multiple threads with a single execution of Gibbs sampling running independently in each thread. This sampler will produce the same number of samples as HOGWILD!-Gibbs, but will require more memory to store multiple copies of the model.

[Figure 5. Speedup of HOGWILD! and multi-model Gibbs sampling on the large KBP dataset (11 GB): speedup over single-threaded against number of threads (1 to 36).]

Figure 5 reports the speedup, in terms of wall-clock time, achieved by HOGWILD!-Gibbs on this dataset. On this machine, we get speedups of up to 2.8×, although the program becomes memory-bandwidth bound at around 8 threads, and we see no significant speedup beyond this. With any number of workers, the run time of HOGWILD!-Gibbs is close to that of multi-model Gibbs, which illustrates that the additional cache contention caused by the HOGWILD! updates has little effect on the algorithm's performance.
7. Conclusion

We analyzed HOGWILD!-Gibbs sampling, a heuristic for parallelized MCMC sampling, on discrete-valued graphical models. First, we constructed a statistical model for HOGWILD!-Gibbs by adapting a model already used for the analysis of asynchronous SGD. Next, we illustrated a major issue with HOGWILD!-Gibbs sampling: that it produces biased samples. To address this, we proved that if, for some class of models with bounded total influence, only O(n) sequential Gibbs samples are necessary to produce good marginal estimates, then HOGWILD!-Gibbs sampling produces equally good estimates after only O(1) additional steps. Additionally, for models that satisfy Dobrushin's condition (α < 1), we proved mixing time bounds for sequential and asynchronous Gibbs sampling that differ by only a factor of 1 + O(n⁻¹). Finally, we showed that our theory matches experimental results, and that HOGWILD!-Gibbs produces speedups of up to 2.8× on a real dataset.
(Here τ* is the expected staleness parameter.)
Conclusion

• Analyzed and modeled asynchronous Gibbs sampling, and identified two success metrics:
  – sample bias → how close are we to the target distribution?
  – mixing time → how long do we need to run?
• Showed that asynchronicity can cause problems.
• Proved bounds on the effect of asynchronicity:
  – using the new sparse variation distance, together with
  – the classical condition on total influence.
Thank you!
[email protected] stanford.edu/~cdesa