Bayesian Learning via Stochastic Gradient Langevin Dynamics
Max Welling and Yee Whye Teh
Presented by: Andriy Mnih and Levi Boyles
University of California, Irvine / University College London
June 2011 / ICML
1 / 21
Outline
Motivation
Method
Demonstrations
Discussion
Motivation
- Large scale datasets are becoming more commonly available across many fields.
- Learning complex models from these datasets will be the future.
- Current successes in scalable learning methods are optimization-based and non-Bayesian.
- Bayesian methods are currently not scalable, e.g. each iteration of MCMC sampling requires computations over the whole dataset.
- Aim: develop Bayesian methodologies applicable to large scale datasets.
- Best of both worlds: scalability, and Bayesian protection against overfitting.
Contribution
I A very simple twist to standard stochastic gradient ascent.
I Turns it into a Bayesian algorithm which samples from the full posteriordistribution rather than converges to a MAP mode.
I Resulting algorithm is related to Langevin dynamics—a classicalphysics method for sampling from a distribution.
I Applied to Bayesian mixture models, logistic regression, andindependent components analysis.
Setup
- Parameter vector θ.
- Large numbers of data items x_1, x_2, ..., x_N where N ≫ 1.
- Model distribution is:

    p(θ, X) = p(θ) ∏_{i=1}^N p(x_i | θ)

- Aim: obtain samples from the posterior distribution p(θ | X).
Stochastic Gradient Ascent
- Also known as: stochastic approximation, Robbins-Monro.
- At iteration t = 1, 2, ...:
  - Get a subset (minibatch) x_{t1}, ..., x_{tn} of data items, where n ≪ N.
  - Approximate the gradient of the log posterior using the subset:

      ∇ log p(θ_t | X) ≈ ∇ log p(θ_t) + (N/n) ∑_{i=1}^n ∇ log p(x_{ti} | θ_t)

  - Take a gradient step:

      θ_{t+1} = θ_t + (ε_t/2) ( ∇ log p(θ_t) + (N/n) ∑_{i=1}^n ∇ log p(x_{ti} | θ_t) )
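The update above can be sketched numerically. A minimal example (my own toy setup, not from the slides), assuming a conjugate Gaussian model, x_i ~ N(θ, 1) with prior θ ~ N(0, 1), so the log-posterior gradients are closed-form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (illustrative, not from the slides): prior theta ~ N(0, 1),
# likelihood x_i ~ N(theta, 1).
N = 10_000
X = rng.normal(2.0, 1.0, size=N)

n = 100                      # minibatch size, n << N
theta = 0.0
for t in range(1, 2_001):
    eps_t = 1.0 / (N * t)    # decreasing step size (illustrative schedule)
    batch = rng.choice(X, size=n, replace=False)
    # Minibatch estimate of the gradient of the log posterior:
    grad = -theta + (N / n) * np.sum(batch - theta)
    theta += (eps_t / 2) * grad

# theta approaches the MAP mode, which is close to the data mean here.
```

With this 1/t schedule the iterates behave like a running average of minibatch means and settle at the mode rather than sampling around it, which is the behaviour the next slides contrast with.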
Stochastic Gradient Ascent
- Major requirement for convergence on step sizes¹:

    ∑_{t=1}^∞ ε_t = ∞,   ∑_{t=1}^∞ ε_t² < ∞

- Intuition:
  - Step sizes cannot decrease too fast, otherwise the trajectory will not be able to traverse the parameter space.
  - Step sizes must decrease to zero, otherwise the parameter trajectory will not converge to a local MAP mode.
¹ In addition to other technical assumptions.
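A quick numerical illustration (my own, not from the slides): a polynomial decay ε_t = a(b + t)^(−γ) with 0.5 < γ ≤ 1 satisfies both conditions, in that the partial sums of ε_t keep growing while the partial sums of ε_t² level off. The constants a, b, γ below are illustrative only:

```python
import numpy as np

# Polynomial decay eps_t = a * (b + t)^(-gamma); any 0.5 < gamma <= 1
# satisfies sum eps_t = inf and sum eps_t^2 < inf.
a, b, gamma = 0.1, 10.0, 0.55

t = np.arange(1, 1_000_001)
eps = a * (b + t) ** (-gamma)

sum_eps = np.cumsum(eps)
sum_eps_sq = np.cumsum(eps ** 2)

# Growth over the last decade of iterations: the first partial sum keeps
# growing (divergence), the second barely moves (convergence).
growth_eps = sum_eps[-1] / sum_eps[len(t) // 10 - 1]
growth_eps_sq = sum_eps_sq[-1] / sum_eps_sq[len(t) // 10 - 1]
```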
Langevin Dynamics
- A stochastic differential equation describing dynamics which converge to the posterior p(θ | X):

    dθ(t) = ½ ∇ log p(θ(t) | X) dt + db(t)

  where b(t) is Brownian motion.
- Intuition:
  - The gradient term encourages the dynamics to spend more time in high probability areas.
  - The Brownian motion provides noise so that the dynamics will explore the whole parameter space.
Langevin Dynamics
- First order Euler discretization:

    θ_{t+1} = θ_t + (ε/2) ∇ log p(θ_t | X) + η_t,   η_t ~ N(0, ε)

- The amount of injected noise is balanced with the gradient step size.
- With a finite step size there will be discretization errors.
- The discretization error can be corrected by a Metropolis-Hastings accept/reject step.
- As ε → 0, the acceptance rate goes to 1.
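The corrected scheme (known elsewhere as the Metropolis-adjusted Langevin algorithm, MALA) can be sketched as follows; the standard-normal target and the step size here are illustrative stand-ins, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in target: a standard normal "posterior" (any differentiable
# log-density could replace log_p / grad_log_p).
def log_p(theta):
    return -0.5 * theta ** 2

def grad_log_p(theta):
    return -theta

def log_q(to, frm, eps):
    # Log-density of the Euler-discretized Langevin proposal frm -> to.
    mean = frm + (eps / 2) * grad_log_p(frm)
    return -0.5 * (to - mean) ** 2 / eps

def mala_step(theta, eps):
    """One Langevin proposal plus the MH correction for discretization error."""
    prop = theta + (eps / 2) * grad_log_p(theta) + rng.normal(0.0, np.sqrt(eps))
    log_alpha = (log_p(prop) + log_q(theta, prop, eps)
                 - log_p(theta) - log_q(prop, theta, eps))
    if np.log(rng.uniform()) < log_alpha:
        return prop, True
    return theta, False

eps = 0.2
theta, accepts, samples = 0.0, 0, []
for _ in range(20_000):
    theta, accepted = mala_step(theta, eps)
    accepts += accepted
    samples.append(theta)

accept_rate = accepts / 20_000
samples = np.array(samples[5_000:])  # discard burn-in
```

With this moderate step size the acceptance rate is already close to 1, matching the ε → 0 claim above.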
Stochastic Gradient Langevin Dynamics
- Idea: Langevin dynamics with stochastic gradients.

    θ_{t+1} = θ_t + (ε_t/2) ( ∇ log p(θ_t) + (N/n) ∑_{i=1}^n ∇ log p(x_{ti} | θ_t) ) + η_t,   η_t ~ N(0, ε_t)

- The update is just stochastic gradient ascent plus Gaussian noise.
- The noise variance is balanced with the gradient step sizes.
- ε_t decreases to 0 slowly (step-size requirement).
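A runnable sketch of this update (my own toy setup, not from the paper): a conjugate Gaussian model where the exact posterior is known, so the SGLD output can be sanity-checked against it:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy conjugate model (illustrative): prior theta ~ N(0, 1), likelihood
# x_i ~ N(theta, 1), so the exact posterior is Gaussian.
N = 1_000
X = rng.normal(1.5, 1.0, size=N)
post_var = 1.0 / (1.0 + N)                # closed-form posterior variance
post_mean = X.sum() * post_var            # closed-form posterior mean

n = 100                                   # minibatch size, n << N
theta, samples = 0.0, []
for t in range(1, 10_001):
    eps_t = 0.05 * (100.0 + t) ** (-0.6)  # illustrative decaying schedule
    batch = rng.choice(X, size=n, replace=False)
    grad = -theta + (N / n) * np.sum(batch - theta)  # stochastic gradient
    # Stochastic gradient ascent step plus Gaussian noise of variance eps_t:
    theta += (eps_t / 2) * grad + rng.normal(0.0, np.sqrt(eps_t))
    samples.append(theta)

samples = np.array(samples[2_000:])  # discard early iterates
```

With the noise term removed this is exactly the stochastic gradient ascent update from the earlier slide; with it, the iterates wander around the posterior instead of collapsing onto the mode.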
Stochastic Gradient Langevin Dynamics—Intuition
    θ_{t+1} = θ_t + (ε_t/2) ( ∇ log p(θ_t) + (N/n) ∑_{i=1}^n ∇ log p(x_{ti} | θ_t) ) + η_t,   η_t ~ N(0, ε_t)

- The only computationally expensive part of Langevin dynamics is the gradient computation. If the gradient can be well-approximated on small minibatches, the algorithm will work well.
- As ε_t → 0:
  - The variance of the gradient noise is O(ε_t²), while the variance of η_t is ε_t ≫ ε_t². The gradient noise is dominated by η_t, so it can be ignored. Result: Langevin dynamics updates with decreasing step sizes.
  - The MH acceptance probability approaches 1, so we can ignore the expensive MH accept/reject step.
  - ε_t approaches 0 slowly enough that the discretized Langevin dynamics are still able to explore the whole parameter space.
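The O(ε_t²) vs. ε_t claim is easy to check empirically. A small illustration (my own construction, not from the slides) using the toy model x_i ~ N(θ, 1): the variance contributed by the stochastic gradient term scales like ε², so its ratio to the injected noise variance ε vanishes as ε shrinks:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy setup (illustrative): x_i ~ N(1, 1), parameter theta held fixed.
N, n, theta = 10_000, 100, 0.0
X = rng.normal(1.0, 1.0, size=N)

def stoch_grad():
    batch = rng.choice(X, size=n, replace=False)
    return -theta + (N / n) * np.sum(batch - theta)

ratios = []
for eps in (1e-6, 1e-8):
    grads = np.array([stoch_grad() for _ in range(2_000)])
    step_noise_var = np.var((eps / 2) * grads)  # O(eps^2) gradient noise
    ratios.append(step_noise_var / eps)         # vs. injected variance eps
# The ratio shrinks linearly in eps, so the injected noise dominates.
```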
Mixture of Gaussians
[Figure: true posterior (left); posterior estimated from samples (right).]
Mixture of Gaussians
[Figure: noise variance vs. iteration for the ∇θ1 noise, ∇θ2 noise, and injected noise (left); average rejection rate vs. step size (right).]
Logistic Regression
[Figure: average log joint probability per datum vs. number of iterations through the whole dataset (left); accuracy on test data vs. number of iterations through the whole dataset, with accuracy after 10 iterations indicated (right).]
Independent Components Analysis
[Figure: Amari distance vs. iteration (top) and instability metric vs. sorted component ID (bottom), for corrected Langevin (left) and stochastic Langevin (right).]
Independent Components Analysis
[Figure: posterior PDFs of W(1,1) vs. W(1,2) (top) and of W(1,1) vs. W(2,1) (bottom), for corrected Langevin (left) and stochastic Langevin (right).]
Discussion
- This is the first baby step towards Bayesian learning for large scale datasets.
- Future work:
  - Theoretical convergence proof.
  - Better scalable MCMC techniques.
  - Methods that do not require decreasing step sizes.