Transcript of: Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling
Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling
Christopher De Sa Kunle Olukotun Christopher Ré {cdesa,kunle,chrismre}@stanford.edu
Stanford
Overview
Asynchronous Gibbs sampling is a popular algorithm that’s used in practical ML systems.
Zhang et al., PVLDB 2014; Smola et al., PVLDB 2010
…etc.
Question: when and why does it work?
“Folklore” says that asynchronous Gibbs sampling basically works whenever standard (sequential) Gibbs sampling does
…but there’s no theoretical guarantee.
Our contributions
1. The “folklore” is not necessarily true.
2. …but it works under reasonable conditions.
![Page 8: Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs ... · 14 . x 1 x 2 x 3 x 4 x 5 x 7 x 6 Algorithm 1 Gibbs sampling Require: Variables x i for 1 i n, and distribution ⇡.](https://reader035.fdocuments.net/reader035/viewer/2022071504/6123d8bd2ea3c34a517f325e/html5/thumbnails/8.jpg)
Asynchronous Gibbs sampling is a popular algorithm that’s used in practical ML systems.
Question: when and why does it work?
“Folklore” says that asynchronous Gibbs sampling basically works whenever standard (sequential) Gibbs sampling does
…but there’s no theoretical guarantee.
Our contributions
1. The “folklore” is not necesarily true. 2. ...but it works under reasonable conditions.
![Page 9: Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs ... · 14 . x 1 x 2 x 3 x 4 x 5 x 7 x 6 Algorithm 1 Gibbs sampling Require: Variables x i for 1 i n, and distribution ⇡.](https://reader035.fdocuments.net/reader035/viewer/2022071504/6123d8bd2ea3c34a517f325e/html5/thumbnails/9.jpg)
Asynchronous Gibbs sampling is a popular algorithm that’s used in practical ML systems.
Question: when and why does it work?
“Folklore” says that asynchronous Gibbs sampling basically works whenever standard (sequential) Gibbs sampling does
…but there’s no theoretical guarantee.
Our contributions
1. The “folklore” is not necesarily true. 2. ...but it works under reasonable conditions.
![Page 10: Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs ... · 14 . x 1 x 2 x 3 x 4 x 5 x 7 x 6 Algorithm 1 Gibbs sampling Require: Variables x i for 1 i n, and distribution ⇡.](https://reader035.fdocuments.net/reader035/viewer/2022071504/6123d8bd2ea3c34a517f325e/html5/thumbnails/10.jpg)
10
Problem: given a probability distribution, produce samples from it.
• e.g. to do inference in a graphical model
Algorithm: Gibbs sampling
• de facto Markov chain Monte Carlo (MCMC) method for inference
• produces a series of approximate samples that approach the target distribution
What is Gibbs Sampling?
Algorithm 1 Gibbs sampling
Require: Variables x_i for 1 ≤ i ≤ n, and distribution π.
loop
    Choose s by sampling uniformly from {1, …, n}.
    Re-sample x_s from P_π(x_s | x_{1,…,n}\{s}).
    output x
end loop
Choose a variable to update at random.
Compute its conditional distribution given the other variables.
[Figure: conditional distribution of x5 given its neighbors x4, x6, x7: P(x5 = ●) = 0.7, P(x5 = ○) = 0.3]
Update the variable by sampling from its conditional distribution.
Output the current state as a sample.
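The loop above can be sketched in a few lines of Python. This is a minimal illustration of Algorithm 1 for binary variables, assuming the target is given as an unnormalized probability function `p`; the pairwise model below (two variables that prefer to agree) is a made-up example, not one from the talk.

```python
import random

def gibbs_step(x, p, rng):
    n = len(x)
    s = rng.randrange(n)               # choose a variable uniformly at random
    y0, y1 = list(x), list(x)
    y0[s], y1[s] = 0, 1
    w0, w1 = p(y0), p(y1)              # unnormalized conditional weights
    x[s] = 1 if rng.random() < w1 / (w0 + w1) else 0
    return x                           # output the current state as a sample

p = lambda x: 3.0 if x[0] == x[1] else 1.0   # hypothetical pairwise model
rng = random.Random(0)
x, agree = [0, 1], 0
for _ in range(20000):
    x = gibbs_step(x, p, rng)
    agree += (x[0] == x[1])
# under p, the stationary probability that x0 == x1 is 6/8 = 0.75
```

Because each update only touches one variable and its conditional weights, the per-step cost is small, which is the sparsity the next slides point to.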
Gibbs Sampling: A Practical Perspective
• Pros of Gibbs sampling
  – easy to implement
  – updates are sparse → fast on modern CPUs
• Cons of Gibbs sampling
  – sequential algorithm → can’t naively parallelize
e.g. on a 64-core machine, no parallelism leaves up to 98% of performance on the table!
Asynchronous Gibbs Sampling
• Run multiple threads in parallel without locks
  – also known as HOGWILD!
  – adapted from a popular technique for stochastic gradient descent (SGD)
• When we read a variable, it could be stale
  – while we re-sample a variable, its adjacent variables can be overwritten by other threads
  – semantics not equivalent to standard (sequential) Gibbs sampling
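The lock-free scheme above can be sketched as follows: several worker threads run Gibbs updates on one shared state with no synchronization, so each conditional may be computed from stale reads. The model `p` and the thread and step counts are illustrative assumptions, not values from the talk.

```python
import random
import threading

def worker(x, p, steps, rng):
    n = len(x)
    for _ in range(steps):
        s = rng.randrange(n)
        snap = list(x)                 # unsynchronized read: may be stale
        y0, y1 = list(snap), list(snap)
        y0[s], y1[s] = 0, 1
        w0, w1 = p(y0), p(y1)
        x[s] = 1 if rng.random() < w1 / (w0 + w1) else 0   # lock-free write

p = lambda x: 3.0 if x[0] == x[1] else 1.0   # hypothetical pairwise model
shared = [0, 1]                               # shared state, no locks
threads = [threading.Thread(target=worker,
                            args=(shared, p, 5000, random.Random(i)))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Note that nothing prevents a thread from computing its conditional on values another thread overwrites a moment later; that staleness is exactly what breaks the sequential semantics.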
Question
Does asynchronous Gibbs sampling work? …and what does it mean for it to work?
Two desiderata
• want to get accurate estimates ⇒ bound the bias
• want to be independent of initial conditions quickly ⇒ bound the mixing time
Previous Work
• “Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent” — Niu et al., NIPS 2011.
  – follow-up work: Liu and Wright, SIOPT 2015; Liu et al., JMLR 2015; De Sa et al., NIPS 2015; Mania et al., arXiv 2015
• “Analyzing Hogwild Parallel Gaussian Gibbs Sampling” — Johnson et al., NIPS 2013.
Bias
• How close are samples to target distribution?
  – standard measurement: total variation distance

    ‖μ − ν‖_TV = max_{A ⊂ Ω} |μ(A) − ν(A)|

• For sequential Gibbs, no asymptotic bias:

    ∀μ₀: lim_{t→∞} ‖P^(t) μ₀ − π‖_TV = 0

“Folklore”: asynchronous Gibbs is also unbiased. …but this is not necessarily true!
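For discrete distributions, the total variation distance defined above can be computed directly: the maximum over events A of |μ(A) − ν(A)| equals half the L1 distance between the probability vectors. A small sketch, with made-up example numbers (not measurements from the talk):

```python
def tv_distance(mu, nu):
    # mu, nu: dicts mapping states to probabilities
    states = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(s, 0.0) - nu.get(s, 0.0)) for s in states)

pi = {(0, 1): 1/3, (1, 0): 1/3, (1, 1): 1/3}                     # target
est = {(0, 0): 0.05, (0, 1): 0.31, (1, 0): 0.31, (1, 1): 0.33}   # hypothetical empirical estimate
d = tv_distance(pi, est)
```

The half-L1 form is convenient because it avoids enumerating all 2^|Ω| events while giving the same maximum.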
Simple Bias Example

Target distribution over two binary variables:

    p(0,1) = p(1,0) = p(1,1) = 1/3,   p(0,0) = 0.

[Figure: state-transition diagram of sequential Gibbs over the states (0,0), (0,1), (1,0), (1,1), with transition probabilities 1/4, 1/2, and 3/4.]

Two threads update starting from the same state. Running asynchronously, both can read the same stale values, and the chain can land in (0,0), which should have zero probability!
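One way to see the failure is to model a single asynchronous step in which both threads read the same stale state (1, 1) and then each re-samples its own variable from the conditional computed on that stale read. This is a sketch of that schedule, not code from the paper; sequential Gibbs can never enter (0,0), but this schedule can.

```python
import random

# target: p(0,1) = p(1,0) = p(1,1) = 1/3, p(0,0) = 0 (unnormalized)
p = lambda x: 0.0 if x == [0, 0] else 1.0

def resample(i, stale, rng):
    # conditional distribution of variable i given the stale other variable
    y0, y1 = list(stale), list(stale)
    y0[i], y1[i] = 0, 1
    w0, w1 = p(y0), p(y1)
    return 1 if rng.random() < w1 / (w0 + w1) else 0

rng = random.Random(0)
hits, trials = 0, 10000
for _ in range(trials):
    stale = [1, 1]                      # both threads read before either writes
    new_state = [resample(0, stale, rng), resample(1, stale, rng)]
    hits += (new_state == [0, 0])
# from (1,1) each conditional puts probability 1/2 on 0, so the forbidden
# state (0,0) appears in roughly a quarter of these concurrent updates
```

The stale read is the whole story: each thread's conditional is correct for the state it read, but neither read reflects the other's write.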
Nonzero Asymptotic Bias

[Figure: joint and marginal distributions of sequential vs. Hogwild! Gibbs over the four states; bias introduced by Hogwild!-Gibbs.]

Measured bias (total variation distance):
• sequential: < 0.1% (unbiased)
• asynchronous: 9.8% (biased)
Are we using the right metric?
• No, total variation distance is too conservative
  – depends on events that don’t matter for inference
  – usually only care about a small number of variables
• New metric: sparse variation distance

    ‖μ − ν‖_SV(ω) = max_{|A| ≤ ω} |μ(A) − ν(A)|

  where |A| is the number of variables on which event A depends

Simple Example: Bias of Asynchronous Gibbs
• Total variation: 9.8%
• Sparse variation (ω = 1): 0.4%
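For a small discrete space, the sparse variation distance can be computed by brute force: an event that depends on at most ω variables is determined by the marginal on some subset S with |S| ≤ ω, so the distance is the worst total variation distance between marginals. A sketch under that observation, with made-up example numbers:

```python
from itertools import combinations

def marginal(dist, idx):
    # marginal of a dict-distribution over the variables in idx
    out = {}
    for state, pr in dist.items():
        key = tuple(state[i] for i in idx)
        out[key] = out.get(key, 0.0) + pr
    return out

def sv_distance(mu, nu, omega):
    n = len(next(iter(mu)))
    best = 0.0
    for k in range(1, omega + 1):
        for idx in combinations(range(n), k):
            m, v = marginal(mu, idx), marginal(nu, idx)
            keys = set(m) | set(v)
            best = max(best,
                       0.5 * sum(abs(m.get(s, 0.0) - v.get(s, 0.0)) for s in keys))
    return best

pi = {(0, 1): 1/3, (1, 0): 1/3, (1, 1): 1/3}
est = {(0, 0): 0.05, (0, 1): 0.31, (1, 0): 0.31, (1, 1): 0.33}
d1 = sv_distance(pi, est, 1)   # events on a single variable only
d2 = sv_distance(pi, est, 2)   # with omega = n this recovers total variation
```

Taking ω = n recovers ordinary total variation, which is why the sparse metric is strictly weaker and can be much smaller.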
Total Influence Parameter
• Old condition used to study mixing times of spin systems:

    α = max_{i∈I} Σ_{j∈I} max_{(X,Y)∈B_j} ‖π_i(·|X_{I\{i}}) − π_i(·|Y_{I\{i}})‖_TV

  – (X, Y) ∈ B_j means X and Y are equal except at variable j.
  – π_i(·|X_{I\{i}}) is the conditional distribution of variable i given the values of all the other variables in state X.
  – Dobrushin’s condition holds when α < 1.
Asymptotic Result

• For any class of distributions with bounded total influence α = O(1)
  – big-O notation is over the number of variables n
• If O(n) timesteps of sequential Gibbs suffice to achieve arbitrarily small bias
  – measured by ω-sparse variation distance, for fixed ω
• …then asynchronous Gibbs requires only O(1) additional timesteps to achieve the same bias!

more details, explicit bounds, et cetera in the paper
Mixing Time
• How long do we need to run until the samples are independent of initial conditions?
• Mixing time of a Markov chain is the first time at which the distribution of the sample is close to the stationary distribution.
  – in terms of total variation distance
  – feasible to run MCMC if mixing time is small
“Folklore”: asynchronous Gibbs has the same mixing time as sequential Gibbs…also not necessarily true!
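A crude empirical way to see mixing is to run the same Gibbs chain from two different initial states and check that a long-run estimate forgets its initialization. The pairwise model `p` and the step counts here are made-up illustrations, not the talk's experiment.

```python
import random

p = lambda x: 3.0 if x[0] == x[1] else 1.0   # hypothetical pairwise model

def gibbs_step(x, rng):
    s = rng.randrange(len(x))
    y0, y1 = list(x), list(x)
    y0[s], y1[s] = 0, 1
    w0, w1 = p(y0), p(y1)
    x[s] = 1 if rng.random() < w1 / (w0 + w1) else 0

def estimate_agreement(x0, steps, seed):
    rng, x, agree = random.Random(seed), list(x0), 0
    for _ in range(steps):
        gibbs_step(x, rng)
        agree += (x[0] == x[1])
    return agree / steps

a = estimate_agreement([0, 0], 50000, seed=1)   # start agreeing
b = estimate_agreement([0, 1], 50000, seed=2)   # start disagreeing
# both estimates approach the stationary value P(x0 == x1) = 0.75
```

When a chain mixes quickly, the two estimates agree after a short burn-in; a slowly mixing chain would keep the fingerprint of its initialization much longer.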
Mixing Time: Example

Empirical estimate of P(1ᵀY > 0) for different maximum delays τ; it should be exactly 0.5.

[Figure: "Mixing of Sequential vs Hogwild! Gibbs" plots the estimate of P(1ᵀY > 0) against sample number (in thousands) for several maximum delays τ, alongside the sequential sampler and the true distribution.]

Here τ is the hardware-dependent read staleness parameter of HOGWILD!.
Sequential Gibbs achieves the correct marginal quickly: tmix = O(n log n).

Asynchronous Gibbs takes much longer: tmix = exp(Ω(n)).
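The qualitative gap above can be explored with a toy simulation. To be clear, this is not the model from the talk: the fully connected Ising model, its parameters, and the staleness mechanism below are all invented for illustration. Each Gibbs update reads a snapshot of the state that may be up to τ updates old, mimicking HOGWILD!-style stale reads in a single thread:

```python
import math
import random

random.seed(0)

n = 10      # number of spins (toy size; the talk uses much larger models)
beta = 0.3  # interaction strength (invented for this sketch)
tau = 5     # maximum read staleness, in number of updates

def gibbs_with_staleness(n_steps, tau):
    """Single-threaded simulation of HOGWILD!-style Gibbs sampling:
    each update reads a snapshot that is a random 0..tau updates old."""
    x = [1] * n
    history = [list(x)]  # recent states, so we can read stale snapshots
    for _ in range(n_steps):
        i = random.randrange(n)
        delay = random.randint(0, min(tau, len(history) - 1))
        snapshot = history[-1 - delay]
        # Fully connected Ising conditional: P(x_i = +1 | rest).
        field = beta * sum(snapshot[j] for j in range(n) if j != i)
        p_plus = 1.0 / (1.0 + math.exp(-2.0 * field))
        x[i] = 1 if random.random() < p_plus else -1
        history.append(list(x))
        if len(history) > tau + 1:
            history.pop(0)
    return x

final = gibbs_with_staleness(2000, tau)
```

With τ = 0 every read is fresh and this reduces to ordinary sequential Gibbs, so the same harness can be used to compare the two regimes.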
Bounding the Mixing Time
Suppose that our target distribution satisfies Dobrushin's condition (total influence α < 1).

• Mixing time of sequential Gibbs (known result):

  tmix-seq(ε) ≤ (n / (1 − α)) log(n/ε).

• Mixing time of asynchronous Gibbs:

  tmix-hog(ε) ≤ ((n + ατ) / (1 − α)) log(n/ε).

Here τ is the hardware-dependent read staleness parameter of HOGWILD!.
Takeaway message: we can compare the two mixing-time bounds:

  tmix-hog(ε) ≈ (1 + ατ/n) tmix-seq(ε)

…they differ by a negligible factor!
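The two bounds can be compared numerically; here is a minimal sketch, where the values of n, α, τ, and ε are arbitrary illustrations rather than numbers from the talk:

```python
import math

def t_mix_seq(n, alpha, eps):
    """Sequential Gibbs mixing-time bound: n/(1-alpha) * log(n/eps)."""
    return n / (1.0 - alpha) * math.log(n / eps)

def t_mix_hog(n, alpha, tau, eps):
    """Asynchronous (HOGWILD!) Gibbs bound: (n + alpha*tau)/(1-alpha) * log(n/eps)."""
    return (n + alpha * tau) / (1.0 - alpha) * math.log(n / eps)

# Illustrative values only: many variables, modest influence and staleness.
n, alpha, tau, eps = 10_000, 0.5, 16, 0.25
ratio = t_mix_hog(n, alpha, tau, eps) / t_mix_seq(n, alpha, eps)
# The log factors cancel, so the ratio equals exactly 1 + alpha*tau/n,
# which is negligible whenever tau << n.
```

The point of the sketch is that the overhead of asynchrony enters only through the additive ατ term, so for realistic staleness (τ far smaller than n) the two bounds are essentially the same.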
Theory Matches Experiment
[Figure 4. Comparison of estimated mixing time and theory-predicted (by Equation 2) mixing time as τ increases, for a synthetic Ising model graph (n = 1000, Δ = 3). The plot, "Estimated tmix of HOGWILD! Gibbs on Large Ising Model," shows mixing time (roughly 16500 to 19000) against the expected delay parameter τ* (0 to 200), with "estimated" and "theory" series.]
the (relatively small) dependence of the mixing time on τ proved to be computationally intractable.

Instead, we use a technique called coupling to the future. We initialize two chains, X and Y, by setting all the variables in X_0 to 1 and all the variables in Y_0 to −1. We proceed by simulating a coupling between the two chains, and return the coupling time T_c. Our estimate of the mixing time will then be t̂(ε), where P(T_c ≥ t̂(ε)) = ε.

Statement 2. This experimental estimate is an upper bound for the mixing time. That is, t̂(ε) ≥ tmix(ε).

To estimate t̂(ε), we ran 10000 instances of the coupling experiment, and returned the sample estimate of t̂(1/4). To compare across a range of τ*, we selected the τ̃_{i,t} to be independent and identically distributed according to the maximum-entropy distribution supported on {0, 1, …, 200} consistent with a particular assignment of τ*. The resulting estimates are plotted as the blue series in Figure 4. The red line represents the mixing time that would be predicted by naively applying Equation 2 using the estimate of the sequential mixing time as a starting point; we can see that it is a very good match for the experimental results. This experiment shows that, at least for one archetypal model, our theory accurately characterizes the behavior of HOGWILD! Gibbs sampling as the delay parameter τ* is changed, and that using HOGWILD!-Gibbs doesn't cause the model to catastrophically fail to mix.
Of course, in order for HOGWILD!-Gibbs to be useful, it must also speed up the execution of Gibbs sampling on some practical models. It is already known that this is the case, as these types of algorithms have been widely implemented in practice (Smola & Narayanamurthy, 2010; Smyth et al., 2009). To further test this, we ran HOGWILD!-Gibbs sampling on a real-world 11 GB Knowledge Base Population dataset (derived from the TAC-KBP challenge) using a machine with a single-socket, 18-core Xeon E7-8890 CPU and 1 TB RAM. As a comparison, we also ran a "multi-model" Gibbs sampler: this consists of multiple threads with a single execution of Gibbs sampling running independently in each thread. This sampler will produce the same number of samples as HOGWILD!-Gibbs, but will require more memory to store multiple copies of the model.

[Figure 5. Speedup of HOGWILD! and multi-model Gibbs sampling on the large KBP dataset (11 GB): speedup over single-threaded against number of threads (1 to 36).]

Figure 5 reports the speedup, in terms of wall-clock time, achieved by HOGWILD!-Gibbs on this dataset. On this machine, we get speedups of up to 2.8×, although the program becomes memory-bandwidth bound at around 8 threads, and we see no significant speedup beyond this. With any number of workers, the run time of HOGWILD!-Gibbs is close to that of multi-model Gibbs, which illustrates that the additional cache contention caused by the HOGWILD! updates has little effect on the algorithm's performance.
7. Conclusion

We analyzed HOGWILD!-Gibbs sampling, a heuristic for parallelized MCMC sampling, on discrete-valued graphical models. First, we constructed a statistical model for HOGWILD!-Gibbs by adapting a model already used for the analysis of asynchronous SGD. Next, we illustrated a major issue with HOGWILD!-Gibbs sampling: that it produces biased samples. To address this, we proved that if, for some class of models with bounded total influence, only O(n) sequential Gibbs samples are necessary to produce good marginal estimates, then HOGWILD!-Gibbs sampling produces equally good estimates after only O(1) additional steps. Additionally, for models that satisfy Dobrushin's condition (α < 1), we proved mixing time bounds for sequential and asynchronous Gibbs sampling that differ by only a factor of 1 + O(n⁻¹). Finally, we showed that our theory matches experimental results, and that HOGWILD!-Gibbs produces speedups of up to 2.8× on a real dataset.
(Here τ* is the expected staleness parameter.)
Conclusion

• Analyzed and modeled asynchronous Gibbs sampling, and identified two success metrics:
  – sample bias → how close are we to the target distribution?
  – mixing time → how long do we need to run?
• Showed that asynchronicity can cause problems.
• Proved bounds on the effect of asynchronicity:
  – using the new sparse variation distance, together with
  – the classical condition on total influence.
Thank you!
[email protected] stanford.edu/~cdesa