Bayesian Learning via Stochastic Gradient Langevin Dynamics
Max Welling and Yee Whye Teh
Presented by: Andriy Mnih and Levi Boyles
University of California, Irvine / University College London
June 2011 / ICML
1 / 21
Outline
Motivation
Method
Demonstrations
Discussion
Motivation
- Large scale datasets are becoming more commonly available across many fields.
- Learning complex models from these datasets will be the future.
- Current successes in scalable learning methods are optimization-based and non-Bayesian.
- Bayesian methods are currently not scalable, e.g. each iteration of MCMC sampling requires computations over the whole dataset.
- Aim: develop Bayesian methodologies applicable to large scale datasets.
- Best of both worlds: scalability, and Bayesian protection against overfitting.
Contribution
I A very simple twist to standard stochastic gradient ascent.
I Turns it into a Bayesian algorithm which samples from the full posteriordistribution rather than converges to a MAP mode.
I Resulting algorithm is related to Langevin dynamics—a classicalphysics method for sampling from a distribution.
I Applied to Bayesian mixture models, logistic regression, andindependent components analysis.
Setup
- Parameter vector θ.
- Large numbers of data items x_1, x_2, ..., x_N where N ≫ 1.
- Model distribution is:

    p(θ, X) = p(θ) ∏_{i=1}^N p(x_i | θ)

- Aim: obtain samples from the posterior distribution p(θ | X).
Stochastic Gradient Ascent
- Also known as: stochastic approximation, Robbins-Monro.
- At iteration t = 1, 2, ...:
  - Get a subset (minibatch) x_{t1}, ..., x_{tn} of data items, where n ≪ N.
  - Approximate the gradient of the log posterior using the subset:

      ∇ log p(θ_t | X) ≈ ∇ log p(θ_t) + (N/n) ∑_{i=1}^n ∇ log p(x_{ti} | θ_t)

  - Take a gradient step:

      θ_{t+1} = θ_t + (ε_t/2) ( ∇ log p(θ_t) + (N/n) ∑_{i=1}^n ∇ log p(x_{ti} | θ_t) )
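The update above can be sketched numerically. A minimal example (my own toy setup, not from the slides), assuming a conjugate Gaussian model, x_i ~ N(θ, 1) with prior θ ~ N(0, 1), so the log-posterior gradients are closed-form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (illustrative, not from the slides): prior theta ~ N(0, 1),
# likelihood x_i ~ N(theta, 1).
N = 10_000
X = rng.normal(2.0, 1.0, size=N)

n = 100                      # minibatch size, n << N
theta = 0.0
for t in range(1, 2_001):
    eps_t = 1.0 / (N * t)    # decreasing step size (illustrative schedule)
    batch = rng.choice(X, size=n, replace=False)
    # Minibatch estimate of the gradient of the log posterior:
    grad = -theta + (N / n) * np.sum(batch - theta)
    theta += (eps_t / 2) * grad

# theta approaches the MAP mode, which is close to the data mean here.
```

With this 1/t schedule the iterates behave like a running average of minibatch means and settle at the mode rather than sampling around it, which is the behaviour the next slides contrast with.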
Stochastic Gradient Ascent
- Major requirement for convergence on step sizes¹:

    ∑_{t=1}^∞ ε_t = ∞,   ∑_{t=1}^∞ ε_t² < ∞

- Intuition:
  - Step sizes cannot decrease too fast, otherwise the trajectory will not be able to traverse the parameter space.
  - Step sizes must decrease to zero, otherwise the parameter trajectory will not converge to a local MAP mode.
¹ In addition to other technical assumptions.
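A quick numerical illustration (my own, not from the slides): a polynomial decay ε_t = a(b + t)^(−γ) with 0.5 < γ ≤ 1 satisfies both conditions, in that the partial sums of ε_t keep growing while the partial sums of ε_t² level off. The constants a, b, γ below are illustrative only:

```python
import numpy as np

# Polynomial decay eps_t = a * (b + t)^(-gamma); any 0.5 < gamma <= 1
# satisfies sum eps_t = inf and sum eps_t^2 < inf.
a, b, gamma = 0.1, 10.0, 0.55

t = np.arange(1, 1_000_001)
eps = a * (b + t) ** (-gamma)

sum_eps = np.cumsum(eps)
sum_eps_sq = np.cumsum(eps ** 2)

# Growth over the last decade of iterations: the first partial sum keeps
# growing (divergence), the second barely moves (convergence).
growth_eps = sum_eps[-1] / sum_eps[len(t) // 10 - 1]
growth_eps_sq = sum_eps_sq[-1] / sum_eps_sq[len(t) // 10 - 1]
```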
Langevin Dynamics
- A stochastic differential equation describing dynamics which converge to the posterior p(θ | X):

    dθ(t) = ½ ∇ log p(θ(t) | X) dt + db(t)

  where b(t) is Brownian motion.
- Intuition:
  - The gradient term encourages the dynamics to spend more time in high probability areas.
  - The Brownian motion provides noise so that the dynamics will explore the whole parameter space.
Langevin Dynamics
- First order Euler discretization:

    θ_{t+1} = θ_t + (ε/2) ∇ log p(θ_t | X) + η_t,   η_t ~ N(0, ε)

- The amount of injected noise is balanced with the gradient step size.
- With a finite step size there will be discretization errors.
- The discretization error can be corrected by a Metropolis-Hastings accept/reject step.
- As ε → 0, the acceptance rate goes to 1.
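The corrected scheme (known elsewhere as the Metropolis-adjusted Langevin algorithm, MALA) can be sketched as follows; the standard-normal target and the step size here are illustrative stand-ins, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in target: a standard normal "posterior" (any differentiable
# log-density could replace log_p / grad_log_p).
def log_p(theta):
    return -0.5 * theta ** 2

def grad_log_p(theta):
    return -theta

def log_q(to, frm, eps):
    # Log-density of the Euler-discretized Langevin proposal frm -> to.
    mean = frm + (eps / 2) * grad_log_p(frm)
    return -0.5 * (to - mean) ** 2 / eps

def mala_step(theta, eps):
    """One Langevin proposal plus the MH correction for discretization error."""
    prop = theta + (eps / 2) * grad_log_p(theta) + rng.normal(0.0, np.sqrt(eps))
    log_alpha = (log_p(prop) + log_q(theta, prop, eps)
                 - log_p(theta) - log_q(prop, theta, eps))
    if np.log(rng.uniform()) < log_alpha:
        return prop, True
    return theta, False

eps = 0.2
theta, accepts, samples = 0.0, 0, []
for _ in range(20_000):
    theta, accepted = mala_step(theta, eps)
    accepts += accepted
    samples.append(theta)

accept_rate = accepts / 20_000
samples = np.array(samples[5_000:])  # discard burn-in
```

With this moderate step size the acceptance rate is already close to 1, matching the ε → 0 claim above.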
Stochastic Gradient Langevin Dynamics
- Idea: Langevin dynamics with stochastic gradients.

    θ_{t+1} = θ_t + (ε_t/2) ( ∇ log p(θ_t) + (N/n) ∑_{i=1}^n ∇ log p(x_{ti} | θ_t) ) + η_t,   η_t ~ N(0, ε_t)

- The update is just stochastic gradient ascent plus Gaussian noise.
- The noise variance is balanced with the gradient step sizes.
- ε_t decreases to 0 slowly (step-size requirement).
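A runnable sketch of this update (my own toy setup, not from the paper): a conjugate Gaussian model where the exact posterior is known, so the SGLD output can be sanity-checked against it:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy conjugate model (illustrative): prior theta ~ N(0, 1), likelihood
# x_i ~ N(theta, 1), so the exact posterior is Gaussian.
N = 1_000
X = rng.normal(1.5, 1.0, size=N)
post_var = 1.0 / (1.0 + N)                # closed-form posterior variance
post_mean = X.sum() * post_var            # closed-form posterior mean

n = 100                                   # minibatch size, n << N
theta, samples = 0.0, []
for t in range(1, 10_001):
    eps_t = 0.05 * (100.0 + t) ** (-0.6)  # illustrative decaying schedule
    batch = rng.choice(X, size=n, replace=False)
    grad = -theta + (N / n) * np.sum(batch - theta)  # stochastic gradient
    # Stochastic gradient ascent step plus Gaussian noise of variance eps_t:
    theta += (eps_t / 2) * grad + rng.normal(0.0, np.sqrt(eps_t))
    samples.append(theta)

samples = np.array(samples[2_000:])  # discard early iterates
```

With the noise term removed this is exactly the stochastic gradient ascent update from the earlier slide; with it, the iterates wander around the posterior instead of collapsing onto the mode.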
Stochastic Gradient Langevin Dynamics—Intuition
    θ_{t+1} = θ_t + (ε_t/2) ( ∇ log p(θ_t) + (N/n) ∑_{i=1}^n ∇ log p(x_{ti} | θ_t) ) + η_t,   η_t ~ N(0, ε_t)

- The only computationally expensive part of Langevin dynamics is the gradient computation. If the gradient can be well-approximated on small minibatches, the algorithm will work well.
- As ε_t → 0:
  - The variance of the gradient noise is O(ε_t²), while the variance of η_t is ε_t ≫ ε_t². The gradient noise is dominated by η_t, so it can be ignored. Result: Langevin dynamics updates with decreasing step sizes.
  - The MH acceptance probability approaches 1, so we can ignore the expensive MH accept/reject step.
  - ε_t approaches 0 slowly enough that the discretized Langevin dynamics are still able to explore the whole parameter space.
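The O(ε_t²) vs. ε_t claim is easy to check empirically. A small illustration (my own construction, not from the slides) using the toy model x_i ~ N(θ, 1): the variance contributed by the stochastic gradient term scales like ε², so its ratio to the injected noise variance ε vanishes as ε shrinks:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy setup (illustrative): x_i ~ N(1, 1), parameter theta held fixed.
N, n, theta = 10_000, 100, 0.0
X = rng.normal(1.0, 1.0, size=N)

def stoch_grad():
    batch = rng.choice(X, size=n, replace=False)
    return -theta + (N / n) * np.sum(batch - theta)

ratios = []
for eps in (1e-6, 1e-8):
    grads = np.array([stoch_grad() for _ in range(2_000)])
    step_noise_var = np.var((eps / 2) * grads)  # O(eps^2) gradient noise
    ratios.append(step_noise_var / eps)         # vs. injected variance eps
# The ratio shrinks linearly in eps, so the injected noise dominates.
```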
Mixture of Gaussians
[Figure: true posterior (left); posterior estimated from samples (right).]
Mixture of Gaussians
[Figure: noise variance vs. iteration for the ∇θ1 noise, ∇θ2 noise, and injected noise (left); average rejection rate vs. step size (right).]
Logistic Regression
[Figure: average log joint probability per datum vs. number of iterations through the whole dataset (left); accuracy on test data vs. number of iterations through the whole dataset, with accuracy after 10 iterations indicated (right).]
Independent Components Analysis
[Figure: Amari distance vs. iteration (top) and instability metric vs. sorted component ID (bottom), for corrected Langevin (left) and stochastic Langevin (right).]
Independent Components Analysis
[Figure: posterior PDFs of W(1,1) vs. W(1,2) (top) and of W(1,1) vs. W(2,1) (bottom), for corrected Langevin (left) and stochastic Langevin (right).]
Discussion
- This is the first baby step towards Bayesian learning for large scale datasets.
- Future work:
  - Theoretical convergence proof.
  - Better scalable MCMC techniques.
  - Methods that do not require decreasing step sizes.