Towards Maximum Likelihood: Learning Undirected Graphical ...
Transcript of Towards Maximum Likelihood: Learning Undirected Graphical ...
Towards Maximum Likelihood: Learning UndirectedGraphical Models using Persistent Sequential Monte
Carlo
Hanchen Xiong
Institute of Computer ScienceUniversity of Innsbruck, Austria
November 26, 2014
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 1 / 22
Background
Undirected Graphical Models:
Markov Random Fields (or Markov Network)
Condtional Random Fields
Modeling:
p(x;θ) =exp (−E (x;θ))
Z(θ)(1)
Energy: E (x;θ) = −θ>φ(x) (2)
with random variables x = [x1, x2, . . . , xD ] ∈ XD where xd can take Nd
discrete values, φ(x) is a K -dimensional vector of sufficient statistics, andparameter θ ∈ RK .
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 2 / 22
Maximum Log-Likelihood Training
UGMs’ likelihood functions is concave w.r.t. θ[Koller and Friedman, 2009];
Given training data D = {x(m)}Mm=1, the derivative of average
log-likelihood L(θ|D) = 1M
∑Mm=1 log p(x(m);θ) as
∂L(θ|D)
∂θ= ED(φ(x))︸ ︷︷ ︸
ψ+
−Eθ(φ(x))︸ ︷︷ ︸ψ−
=1
M
M∑m=1
φ(x(m))−∑x′∈D
p(x′;θ)φ(x)′
(3)
interpretation: iteratively pulls down the energy of the data spaceoccupied by D (positive phase), but raises the energy over all dataspace XD (negative phase), until it reaches a balance (ψ+ = ψ−).
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 3 / 22
Maximum Log-Likelihood Training
UGMs’ likelihood functions is concave w.r.t. θ[Koller and Friedman, 2009];
Given training data D = {x(m)}Mm=1, the derivative of average
log-likelihood L(θ|D) = 1M
∑Mm=1 log p(x(m);θ) as
∂L(θ|D)
∂θ= ED(φ(x))︸ ︷︷ ︸
ψ+
−Eθ(φ(x))︸ ︷︷ ︸ψ−
=1
M
M∑m=1
φ(x(m))−∑x′∈D
p(x′;θ)φ(x)′
(3)
interpretation: iteratively pulls down the energy of the data spaceoccupied by D (positive phase), but raises the energy over all dataspace XD (negative phase), until it reaches a balance (ψ+ = ψ−).
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 3 / 22
Maximum Log-Likelihood Training
UGMs’ likelihood functions is concave w.r.t. θ[Koller and Friedman, 2009];
Given training data D = {x(m)}Mm=1, the derivative of average
log-likelihood L(θ|D) = 1M
∑Mm=1 log p(x(m);θ) as
∂L(θ|D)
∂θ= ED(φ(x))︸ ︷︷ ︸
ψ+
−Eθ(φ(x))︸ ︷︷ ︸ψ−
=1
M
M∑m=1
φ(x(m))−∑x′∈D
p(x′;θ)φ(x)′
(3)
interpretation: iteratively pulls down the energy of the data spaceoccupied by D (positive phase), but raises the energy over all dataspace XD (negative phase), until it reaches a balance (ψ+ = ψ−).
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 3 / 22
Existing Learning Methods
Approximate the second term of the gradient L:
∂L(θ|D)
∂θ= ED(φ(x))︸ ︷︷ ︸
ψ+
−Eθ(φ(x))︸ ︷︷ ︸ψ−
Markov Chain Monte Carlo Maximum Likelihood(MCMCML)[Geyer, 1991]
Contrastive Divergence (CD) [Hinton, 2002]
Persistent Contrastive Divergence (PCD) [Tieleman, 2008]
Tempered Transition (TT) [Salakhutdinov, 2010]
Parallel Tempering (PT) [Desjardins et al., 2010]
CD,PCD,TT and PT can be summarized as a Robbins-Monro’sstochastic approximation procedure (SAP).[Robbins and Monro, 1951]
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 4 / 22
Existing Learning Methods
Approximate the second term of the gradient L:
∂L(θ|D)
∂θ= ED(φ(x))︸ ︷︷ ︸
ψ+
−Eθ(φ(x))︸ ︷︷ ︸ψ−
Markov Chain Monte Carlo Maximum Likelihood(MCMCML)[Geyer, 1991]
Contrastive Divergence (CD) [Hinton, 2002]
Persistent Contrastive Divergence (PCD) [Tieleman, 2008]
Tempered Transition (TT) [Salakhutdinov, 2010]
Parallel Tempering (PT) [Desjardins et al., 2010]
CD,PCD,TT and PT can be summarized as a Robbins-Monro’sstochastic approximation procedure (SAP).[Robbins and Monro, 1951]
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 4 / 22
Existing Learning Methods
Approximate the second term of the gradient L:
∂L(θ|D)
∂θ= ED(φ(x))︸ ︷︷ ︸
ψ+
−Eθ(φ(x))︸ ︷︷ ︸ψ−
Markov Chain Monte Carlo Maximum Likelihood(MCMCML)[Geyer, 1991]
Contrastive Divergence (CD) [Hinton, 2002]
Persistent Contrastive Divergence (PCD) [Tieleman, 2008]
Tempered Transition (TT) [Salakhutdinov, 2010]
Parallel Tempering (PT) [Desjardins et al., 2010]
CD,PCD,TT and PT can be summarized as a Robbins-Monro’sstochastic approximation procedure (SAP).[Robbins and Monro, 1951]
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 4 / 22
Existing Learning Methods
Approximate the second term of the gradient L:
∂L(θ|D)
∂θ= ED(φ(x))︸ ︷︷ ︸
ψ+
−Eθ(φ(x))︸ ︷︷ ︸ψ−
Markov Chain Monte Carlo Maximum Likelihood(MCMCML)[Geyer, 1991]
Contrastive Divergence (CD) [Hinton, 2002]
Persistent Contrastive Divergence (PCD) [Tieleman, 2008]
Tempered Transition (TT) [Salakhutdinov, 2010]
Parallel Tempering (PT) [Desjardins et al., 2010]
CD,PCD,TT and PT can be summarized as a Robbins-Monro’sstochastic approximation procedure (SAP).[Robbins and Monro, 1951]
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 4 / 22
Existing Learning Methods
Approximate the second term of the gradient L:
∂L(θ|D)
∂θ= ED(φ(x))︸ ︷︷ ︸
ψ+
−Eθ(φ(x))︸ ︷︷ ︸ψ−
Markov Chain Monte Carlo Maximum Likelihood(MCMCML)[Geyer, 1991]
Contrastive Divergence (CD) [Hinton, 2002]
Persistent Contrastive Divergence (PCD) [Tieleman, 2008]
Tempered Transition (TT) [Salakhutdinov, 2010]
Parallel Tempering (PT) [Desjardins et al., 2010]
CD,PCD,TT and PT can be summarized as a Robbins-Monro’sstochastic approximation procedure (SAP).[Robbins and Monro, 1951]
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 4 / 22
Existing Learning Methods
Approximate the second term of the gradient L:
∂L(θ|D)
∂θ= ED(φ(x))︸ ︷︷ ︸
ψ+
−Eθ(φ(x))︸ ︷︷ ︸ψ−
Markov Chain Monte Carlo Maximum Likelihood(MCMCML)[Geyer, 1991]
Contrastive Divergence (CD) [Hinton, 2002]
Persistent Contrastive Divergence (PCD) [Tieleman, 2008]
Tempered Transition (TT) [Salakhutdinov, 2010]
Parallel Tempering (PT) [Desjardins et al., 2010]
CD,PCD,TT and PT can be summarized as a Robbins-Monro’sstochastic approximation procedure (SAP).[Robbins and Monro, 1951]
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 4 / 22
Existing Learning Methods
Approximate the second term of the gradient L:
∂L(θ|D)
∂θ= ED(φ(x))︸ ︷︷ ︸
ψ+
−Eθ(φ(x))︸ ︷︷ ︸ψ−
Markov Chain Monte Carlo Maximum Likelihood(MCMCML)[Geyer, 1991]
Contrastive Divergence (CD) [Hinton, 2002]
Persistent Contrastive Divergence (PCD) [Tieleman, 2008]
Tempered Transition (TT) [Salakhutdinov, 2010]
Parallel Tempering (PT) [Desjardins et al., 2010]
CD,PCD,TT and PT can be summarized as a Robbins-Monro’sstochastic approximation procedure (SAP).[Robbins and Monro, 1951]
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 4 / 22
Robbins-Monro’s SAP
1 Training data set D = {x(1), · · · , x(M)}. Randomly initialize modelparameters θ0 and N particles {s0,1, · · · , s0,N}.
2 for t = 0 : T do // T iterations
3 for n = 1 : N do // go through all N particles
4 Sample st+1,n from st,n using transition operator Hθt ;
5 end for
6 Update: θt+1 = θt + η[ 1
M
M∑m=1
φ(x(m))− 1
N
N∑n=1
φ(st+1,n)]
7 Decrease η.
8 end for
9 When using Gibbs sampler as Hθt , SAP becomes PCD, and similarly,TT and PT can be substituted as well.
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 5 / 22
Robbins-Monro’s SAP
1 Training data set D = {x(1), · · · , x(M)}. Randomly initialize modelparameters θ0 and N particles {s0,1, · · · , s0,N}.
2 for t = 0 : T do // T iterations
3 for n = 1 : N do // go through all N particles
4 Sample st+1,n from st,n using transition operator Hθt ;
5 end for
6 Update: θt+1 = θt + η[ 1
M
M∑m=1
φ(x(m))− 1
N
N∑n=1
φ(st+1,n)]
7 Decrease η.
8 end for
9 When using Gibbs sampler as Hθt , SAP becomes PCD, and similarly,TT and PT can be substituted as well.
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 5 / 22
Robbins-Monro’s SAP
1 Training data set D = {x(1), · · · , x(M)}. Randomly initialize modelparameters θ0 and N particles {s0,1, · · · , s0,N}.
2 for t = 0 : T do // T iterations
3 for n = 1 : N do // go through all N particles
4 Sample st+1,n from st,n using transition operator Hθt ;
5 end for
6 Update: θt+1 = θt + η[ 1
M
M∑m=1
φ(x(m))− 1
N
N∑n=1
φ(st+1,n)]
7 Decrease η.
8 end for
9 When using Gibbs sampler as Hθt , SAP becomes PCD, and similarly,TT and PT can be substituted as well.
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 5 / 22
Robbins-Monro’s SAP
1 Training data set D = {x(1), · · · , x(M)}. Randomly initialize modelparameters θ0 and N particles {s0,1, · · · , s0,N}.
2 for t = 0 : T do // T iterations
3 for n = 1 : N do // go through all N particles
4 Sample st+1,n from st,n using transition operator Hθt ;
5 end for
6 Update: θt+1 = θt + η[ 1
M
M∑m=1
φ(x(m))− 1
N
N∑n=1
φ(st+1,n)]
7 Decrease η.
8 end for
9 When using Gibbs sampler as Hθt , SAP becomes PCD, and similarly,TT and PT can be substituted as well.
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 5 / 22
Robbins-Monro’s SAP
1 Training data set D = {x(1), · · · , x(M)}. Randomly initialize modelparameters θ0 and N particles {s0,1, · · · , s0,N}.
2 for t = 0 : T do // T iterations
3 for n = 1 : N do // go through all N particles
4 Sample st+1,n from st,n using transition operator Hθt ;
5 end for
6 Update: θt+1 = θt + η[ 1
M
M∑m=1
φ(x(m))− 1
N
N∑n=1
φ(st+1,n)]
7 Decrease η.
8 end for
9 When using Gibbs sampler as Hθt , SAP becomes PCD, and similarly,TT and PT can be substituted as well.
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 5 / 22
Robbins-Monro’s SAP
1 Training data set D = {x(1), · · · , x(M)}. Randomly initialize modelparameters θ0 and N particles {s0,1, · · · , s0,N}.
2 for t = 0 : T do // T iterations
3 for n = 1 : N do // go through all N particles
4 Sample st+1,n from st,n using transition operator Hθt ;
5 end for
6 Update: θt+1 = θt + η[ 1
M
M∑m=1
φ(x(m))− 1
N
N∑n=1
φ(st+1,n)]
7 Decrease η.
8 end for
9 When using Gibbs sampler as Hθt , SAP becomes PCD, and similarly,TT and PT can be substituted as well.
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 5 / 22
Robbins-Monro’s SAP
1 Training data set D = {x(1), · · · , x(M)}. Randomly initialize modelparameters θ0 and N particles {s0,1, · · · , s0,N}.
2 for t = 0 : T do // T iterations
3 for n = 1 : N do // go through all N particles
4 Sample st+1,n from st,n using transition operator Hθt ;
5 end for
6 Update: θt+1 = θt + η[ 1
M
M∑m=1
φ(x(m))− 1
N
N∑n=1
φ(st+1,n)]
7 Decrease η.
8 end for
9 When using Gibbs sampler as Hθt , SAP becomes PCD, and similarly,TT and PT can be substituted as well.
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 5 / 22
MCMCML
In MCMCML, a proposal distribution p(x;θ0) is set up in the sameform as a UGM, and we have
Z(θ)Z(θ0) =
∑x exp(θ>φ(x))∑x exp(θ>0 φ(x))
=∑
x exp(θ>φ(x))
exp(θ>0 φ(x))× exp(θ>0 φ(x))∑
x exp(θ>0 φ(x))
=∑
xexp(θ>φ(x))
exp(θ>0 φ(x))× exp(θ>0 φ(x))∑
x exp(θ>0 φ(x))
=∑
x exp((θ − θ0)>φ(x)
)p(x;θ0)
≈ 1S
∑Ss=1 w
(s)
(4)
where w (s) isw (s) = exp
((θ − θ0)>φ(x̄(s))
), (5)
MCMCML is an importance sampling approximation of the gradient.
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 6 / 22
MCMCML
In MCMCML, a proposal distribution p(x;θ0) is set up in the sameform as a UGM, and we have
Z(θ)Z(θ0) =
∑x exp(θ>φ(x))∑x exp(θ>0 φ(x))
=∑
x exp(θ>φ(x))
exp(θ>0 φ(x))× exp(θ>0 φ(x))∑
x exp(θ>0 φ(x))
=∑
xexp(θ>φ(x))
exp(θ>0 φ(x))× exp(θ>0 φ(x))∑
x exp(θ>0 φ(x))
=∑
x exp((θ − θ0)>φ(x)
)p(x;θ0)
≈ 1S
∑Ss=1 w
(s)
(4)
where w (s) isw (s) = exp
((θ − θ0)>φ(x̄(s))
), (5)
MCMCML is an importance sampling approximation of the gradient.
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 6 / 22
MCMCML
1: t ← 0, initialize the proposal distribution p(x;θ0)2: Sample {x̄(s)} from p(x;θ0)3: while ! stop criterion do4: Calculate w (s) using (5)
5: Calculate gradient ∂L̃(θt |D)∂θt
using importance samplingapproximation.
6: update θt+1 = θt + η ∂L̃(θt |D)∂θt
7: t ← t + 18: end while
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 7 / 22
MCMCML
MCMCML’s performance highly depends on initial proposaldistribution;
at time t, it is helpful to update the proposal distribution as thep(x;θt−1);
this is analogous to sequential importance sampling with resamplingat every iteration, however, the construction of sequentialdistributions is by learning;
this also looks like SAP learning schemes.
a similar connection between PCD and Sequential Monte Carlo wasfound in [Asuncion et al., 2010]
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 8 / 22
MCMCML
MCMCML’s performance highly depends on initial proposaldistribution;
at time t, it is helpful to update the proposal distribution as thep(x;θt−1);
this is analogous to sequential importance sampling with resamplingat every iteration, however, the construction of sequentialdistributions is by learning;
this also looks like SAP learning schemes.
a similar connection between PCD and Sequential Monte Carlo wasfound in [Asuncion et al., 2010]
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 8 / 22
MCMCML
MCMCML’s performance highly depends on initial proposaldistribution;
at time t, it is helpful to update the proposal distribution as thep(x;θt−1);
this is analogous to sequential importance sampling with resamplingat every iteration, however, the construction of sequentialdistributions is by learning;
this also looks like SAP learning schemes.
a similar connection between PCD and Sequential Monte Carlo wasfound in [Asuncion et al., 2010]
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 8 / 22
MCMCML
MCMCML’s performance highly depends on initial proposaldistribution;
at time t, it is helpful to update the proposal distribution as thep(x;θt−1);
this is analogous to sequential importance sampling with resamplingat every iteration, however, the construction of sequentialdistributions is by learning;
this also looks like SAP learning schemes.
a similar connection between PCD and Sequential Monte Carlo wasfound in [Asuncion et al., 2010]
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 8 / 22
MCMCML
MCMCML’s performance highly depends on initial proposaldistribution;
at time t, it is helpful to update the proposal distribution as thep(x;θt−1);
this is analogous to sequential importance sampling with resamplingat every iteration, however, the construction of sequentialdistributions is by learning;
this also looks like SAP learning schemes.
a similar connection between PCD and Sequential Monte Carlo wasfound in [Asuncion et al., 2010]
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 8 / 22
SAP Learning as Sequential Monte Carlo
1: Initialize p(x;θ0), t ← 0
2: Sample particles {x̄(s)0 }Ss=1 ∼ p(x;θ0)
3: while ! stop criterion do4: Assign w (s) ← 1
S ,∀s ∈ S // importance reweighting
5: // resampling is ignored because it has no effect6: switch (algorithmic choice) // MCMC transition7: case CD:8: generate a brand new particle set {x̄(s)
t+1}Ss=1 with Gibbs sampling from D
9: case PCD:10: evolve particle set {x̄(s)
t }Ss=1 to {x̄(s)t+1}
Ss=1 with one step Gibbs sampling
11: case Tempered Transition:12: evolve particle set {x̄(s)
t }Ss=1 to {x̄(s)t+1}
Ss=1 with tempered transition
13: case Parallel Tempering:14: evolve particle set {x̄(s)
t }Ss=1 to {x̄(s)t+1}
Ss=1 with parallel tempering
15: end switch16: Compute the gradient ∆θt according to (4);17: θt+1 = θt + η∆θt , t ← t + 1;18: reduce η;
19: end whileHanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 9 / 22
SAP Learning as Sequential Monte Carlo
A sequential Monte Carlo (SMC) algorithms can work on thecondition that sequential, intermediate distributions are wellconstructed: two successive ones should be close.
the gap between successive distributions in SAP: η × D1 learning rate η;
2 the dimensionality of x: D.
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 10 / 22
SAP Learning as Sequential Monte Carlo
A sequential Monte Carlo (SMC) algorithms can work on thecondition that sequential, intermediate distributions are wellconstructed: two successive ones should be close.
the gap between successive distributions in SAP: η × D1 learning rate η;2 the dimensionality of x: D.
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 10 / 22
Persistent Sequential Monte Carlo
Every sequential, intermediate distributions is constructed by learning,so learning and sampling are interestingly entangled.
applying SMC philosophy future in sampling: Persistent SMC (SPMC)
{x̄(s)}S/2s=1 ∼ UD {x̄t−1,(s)}S/2s=1 ∼ pH(x;θt−1)
p0(x;θt) p1(x;θt) · · · pH(x;θt)
{x̄(s)}S/2s=1 ∼ UD {x̄t,(s)}S/2s=1 ∼ pH(x;θt)
p0(x;θt+1) p1(x;θt+1) · · · pH(x;θt+1)
{x̄t+1,(s)}S/2s=1 ∼ pH(x;θt+1)
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 11 / 22
Persistent Sequential Monte Carlo
Every sequential, intermediate distributions is constructed by learning,so learning and sampling are interestingly entangled.
applying SMC philosophy future in sampling: Persistent SMC (SPMC)
{x̄(s)}S/2s=1 ∼ UD {x̄t−1,(s)}S/2s=1 ∼ pH(x;θt−1)
p0(x;θt) p1(x;θt) · · · pH(x;θt)
{x̄(s)}S/2s=1 ∼ UD {x̄t,(s)}S/2s=1 ∼ pH(x;θt)
p0(x;θt+1) p1(x;θt+1) · · · pH(x;θt+1)
{x̄t+1,(s)}S/2s=1 ∼ pH(x;θt+1)
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 11 / 22
Persistent Sequential Monte Carlo
{x̄(s)}S/2s=1 ∼ UD {x̄t−1,(s)}S/2s=1 ∼ pH(x;θt−1)
p0(x;θt) p1(x;θt) · · · pH(x;θt)
{x̄(s)}S/2s=1 ∼ UD {x̄t,(s)}S/2s=1 ∼ pH(x;θt)
p0(x;θt+1) p1(x;θt+1) · · · pH(x;θt+1)
{x̄t+1,(s)}S/2s=1 ∼ pH(x;θt+1)
UD is a uniform distribu-tion on x, and interme-diate sequential distribu-tions are: ph(x;θt+1) ∝p(x;θt)
1−βhp(x;θt+1)βh
where 0 ≤ βH ≤ βH−1 ≤· · ·β0 = 1.
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 12 / 22
Number of Sub-Sequential Distributions
One issue arising in PSMC is the number of βh, i.e. H
By exploiting degeneration of particle set: the importance weightingfor each particle is
w (s) =ph(x̄(s);θt+1)
ph−1(x̄(s);θt+1)
= exp(E (x̄(s);θt)
)∆βhexp
(E (x̄(s);θt+1)
)−∆βh(6)
where ∆βh is the step length from βh−1 to βh, i.e. ∆βh = βh − βh−1.the ESS of a particle set as [Kong et al., 1994]
σ =(∑S
s=1 w(s))2
S∑S
s=1 w(s)2∈[
1
S, 1
](7)
ESS σ is actually a function of ∆βh.
Set a threshold on σ, at every h, and find the biggest gap by usingbidirectional search.
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 13 / 22
Number of Sub-Sequential Distributions
One issue arising in PSMC is the number of βh, i.e. H
By exploiting degeneration of particle set: the importance weightingfor each particle is
w (s) =ph(x̄(s);θt+1)
ph−1(x̄(s);θt+1)
= exp(E (x̄(s);θt)
)∆βhexp
(E (x̄(s);θt+1)
)−∆βh(6)
where ∆βh is the step length from βh−1 to βh, i.e. ∆βh = βh − βh−1.the ESS of a particle set as [Kong et al., 1994]
σ =(∑S
s=1 w(s))2
S∑S
s=1 w(s)2∈[
1
S, 1
](7)
ESS σ is actually a function of ∆βh.
Set a threshold on σ, at every h, and find the biggest gap by usingbidirectional search.
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 13 / 22
Number of Sub-Sequential Distributions
One issue arising in PSMC is the number of βh, i.e. H
By exploiting degeneration of particle set: the importance weightingfor each particle is
w (s) =ph(x̄(s);θt+1)
ph−1(x̄(s);θt+1)
= exp(E (x̄(s);θt)
)∆βhexp
(E (x̄(s);θt+1)
)−∆βh(6)
where ∆βh is the step length from βh−1 to βh, i.e. ∆βh = βh − βh−1.the ESS of a particle set as [Kong et al., 1994]
σ =(∑S
s=1 w(s))2
S∑S
s=1 w(s)2∈[
1
S, 1
](7)
ESS σ is actually a function of ∆βh.
Set a threshold on σ, at every h, and find the biggest gap by usingbidirectional search.
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 13 / 22
Number of Sub-Sequential Distributions
One issue arising in PSMC is the number of βh, i.e. H
By exploiting degeneration of particle set: the importance weightingfor each particle is
w (s) =ph(x̄(s);θt+1)
ph−1(x̄(s);θt+1)
= exp(E (x̄(s);θt)
)∆βhexp
(E (x̄(s);θt+1)
)−∆βh(6)
where ∆βh is the step length from βh−1 to βh, i.e. ∆βh = βh − βh−1.the ESS of a particle set as [Kong et al., 1994]
σ =(∑S
s=1 w(s))2
S∑S
s=1 w(s)2∈[
1
S, 1
](7)
ESS σ is actually a function of ∆βh.
Set a threshold on σ, at every h, and find the biggest gap by usingbidirectional search.
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 13 / 22
PSMC
Input: a training dataset D = {x(m)}Mm=1, learning rate η1: Initialize p(x;θ0), t ← 0
2: Sample particles {x̄(s)0 }Ss=1 ∼ p(x;θ0)
3: while ! stop criterion // root-SMC do4: h← 0, β0 ← 15: while βh < 1 // sub-SMC do6: assign importance weights {w (s)}Ss=1 to particles according to (6)7: resample particles based on {w (s)}Ss=1
8: find the step length ∆βh9: βh+1 = βh + δβ
10: h← h + 111: end while12: Compute the gradient ∆θt according to (4)13: θt+1 = θt + η∆θt
14: t ← t + 115: end while
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 14 / 22
Experiments
two experiments on two challenges:
big learning rates
high dimensional distributions
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 15 / 22
First Experiment: Small Learning Rates
a small-size Boltzmann Machine with only 10 variables is used to avoid theeffect of model complexitylearning rate ηt = 1
100+t
0 100 200 300 400 500−5
−4.5
−4
−3.5
−3
−2.5
−2
−1.5
number of epochs
log−
likelih
ood
ML
PCD−10
PCD−1
PT
TT
PSMC
SMC
(a)
0 100 200 300 400 5007
8
9
10
11
12
number of epochs
nu
mb
er
of
tem
pe
ratu
res
PSMC
(b)
0 100 200 300 400 5007
8
9
10
11
12
number of epochs
nu
mb
er
of
tem
pe
ratu
res
Standard SMC
(c)
Figure: The performance of algorithms with the first learning rate scheme. (a):log-likelihood vs. number of epochs; (b) and (c): the number of βs in PSMC andSMC at each iteration (blue) and their mean values (red).
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 16 / 22
First Experiment: Bigger Learning Rates
learning rate ηt = 120+0.5×t
0 20 40 60 80 100−5
−4.5
−4
−3.5
−3
−2.5
−2
−1.5
number of epochs
log−
likelih
ood
ML
PCD−10
PCD−1
PT
TT
PSMC
SMC
(a)
0 20 40 60 80 1007
8
9
10
11
12
number of epochs
nu
mb
er
of
tem
pe
ratu
res
PSMC
(b)
0 20 40 60 80 1007
8
9
10
11
12
number of epochs
nu
mb
er
of
tem
pe
ratu
res
Standard SMC
(c)
Figure: The performance of algorithms with the second learning rate scheme. (a):log-likelihood vs. number of epochs; (b) and (c): the number of βs in PSMC andSMC at each iteration (blue) and their mean values (red).
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 17 / 22
First Experiment: Large Learning Rates
learning rate ηt = 110+0.1×t
0 10 20 30 40−5
−4.5
−4
−3.5
−3
−2.5
−2
−1.5
number of epochs
log−
likelih
ood
ML
PCD−10
PT
TT
PSMC
SMC
(a)
0 10 20 30 407
8
9
10
11
12
number of epochs
nu
mb
er
of
tem
pe
ratu
res
PSMC
(b)
0 10 20 30 407
8
9
10
11
12
number of epochs
nu
mb
er
of
tem
pe
ratu
res
Standard SMC
(c)
Figure: The performance of algorithms with the third learning rate scheme. (a):log-likelihood vs. number of epochs; (b) and (c): the number of βs in PSMC andSMC at each iteration (blue) and their mean values (red).
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 18 / 22
Second Experiment: Small Dimensionality/Scale
we used the popular restricted Boltzmann machine to model handwrittendigit images (the MNIST database).10 hidden unites
200 400 600 800 1000−230
−225
−220
−215
−210
−205
−200
−195
number of epochs
testin
g−
da
ta lo
g−
like
liho
od
200 400 600 800 1000−230
−225
−220
−215
−210
−205
−200
−195
number of epochs
tra
inin
g−
da
ta lo
g−
like
liho
od
PCD−100 PCD−1 PT TT PSMC SMC
(a)
0 200 400 600 800 100060
80
100
120
140
160
180
200
220
number of epochs
num
ber
of te
mpera
ture
s
PSMC
(b)
0 200 400 600 800 100050
100
150
200
250
300
number of epochs
num
ber
of te
mpera
ture
s
Standard SMC
(c)Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 19 / 22
Second Experiment: Large Dimensionality/Scale
500 hidden unites
200 400 600 800 1000−200
−195
−190
−185
−180
−175
−170
−165
−160
number of epochs
testin
g−
da
ta lo
g−
like
liho
od
200 400 600 800 1000−200
−195
−190
−185
−180
−175
−170
−165
−160
number of epochs
tra
inin
g−
da
ta lo
g−
like
liho
od
PCD−200 PCD−1 PT TT PSMC
(d)
0 200 400 600 800 10000
50
100
150
200
250
number of epochs
num
ber
of te
mpera
ture
s
PSMC
(e)
0 200 400 600 800 10006
7
8
9
10
11
number of epochs
num
ber
of te
mpera
ture
sStandard SMC
(f)
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 20 / 22
Conclusion
a new interpretation of learning undirected graphical models:sequential Monte Carlo (SMC)
reveal two challenges in learning: large learning rate, highdimensionality;
deeper application of SMC in learning −→ Persistent SMC;
yield higher likelihood than state-of-the-art algorithms in challengingcases.
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 21 / 22
Conclusion
a new interpretation of learning undirected graphical models:sequential Monte Carlo (SMC)
reveal two challenges in learning: large learning rate, highdimensionality;
deeper application of SMC in learning −→ Persistent SMC;
yield higher likelihood than state-of-the-art algorithms in challengingcases.
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 21 / 22
Conclusion
a new interpretation of learning undirected graphical models:sequential Monte Carlo (SMC)
reveal two challenges in learning: large learning rate, highdimensionality;
deeper application of SMC in learning −→ Persistent SMC;
yield higher likelihood than state-of-the-art algorithms in challengingcases.
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 21 / 22
Conclusion
a new interpretation of learning undirected graphical models:sequential Monte Carlo (SMC)
reveal two challenges in learning: large learning rate, highdimensionality;
deeper application of SMC in learning −→ Persistent SMC;
yield higher likelihood than state-of-the-art algorithms in challengingcases.
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 21 / 22
END
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 22 / 22
Asuncion, A. U., Liu, Q., Ihler, A. T., and Smyth, P. (2010).Particle filtered MCMC-MLE with connections to contrastivedivergence.In ICML.
Desjardins, G., Courville, A., Bengio, Y., Vincent, P., and Delalleau,O. (2010).Tempered Markov Chain Monte Carlo for training of restrictedBoltzmann machines.In AISTATS.
Geyer, C. J. (1991).Markov Chain Monte Carlo Maximum Likelihood.Compuuting Science and Statistics:Proceedings of the 23rdSymposium on the Interface.
Hinton, G. E. (2002).Training products of experts by minimizing contrastive divergence.Neural Computation, 14(8):1771–1800.
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 22 / 22
Koller, D. and Friedman, N. (2009).Probabilistic Graphical Models: Principles and Techniques.MIT Press.
Kong, A., Liu, J. S., and Wong, W. H. (1994).Sequential Imputations and Bayesian Missing Data Problems.Journal of the American Statistical Association, 89(425):278–288.
Robbins, H. and Monro, S. (1951).A Stochastic Approximation Method.Ann.Math.Stat., 22:400–407.
Salakhutdinov, R. (2010).Learning in markov random fields using tempered transitions.In NIPS.
Tieleman, T. (2008).Training Restricted Boltzmann Machines using Approximations to theLikelihood Gradient.In ICML, pages 1064–1071.
Hanchen Xiong (UIBK) Learning UGMs with Persistent SMC November 26, 2014 22 / 22