Transcript of CSE 5526: Boltzmann Machines - mr-pc.org
Boltzmann machines are a stochastic extension of Hopfield networks
• A Boltzmann machine is a stochastic learning machine that consists of visible and hidden units and symmetric connections
• The network can be layered, and visible units can be either "input" or "output" (connections are not directed)
[Figure: network diagram showing input, hidden, and visible units]
Boltzmann machines have the same energy function as Hopfield networks
• Because of the symmetric connections, the energy of a neuron configuration \(\mathbf{x}\) is:
\[ E(\mathbf{x}) = -\frac{1}{2}\sum_{i,j} w_{ij}\, x_i x_j = -\frac{1}{2}\,\mathbf{x}^T\mathbf{W}\mathbf{x} \]
• Can we start there and derive a probability distribution over configurations?
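For concreteness, here is a minimal Python sketch of this energy function (the helper name and example weights are illustrative, not from the slides):

```python
import numpy as np

def energy(x, W):
    """Hopfield/Boltzmann energy E(x) = -1/2 x^T W x,
    for states x in {-1, +1}^N and symmetric weights W (zero diagonal)."""
    return -0.5 * x @ W @ x

# Two neurons joined by a positive weight prefer to agree:
W = np.array([[0.0, 1.0],
              [1.0, 0.0]])
print(energy(np.array([1.0, 1.0]), W))   # -1.0: agreeing states, low energy
print(energy(np.array([1.0, -1.0]), W))  #  1.0: disagreeing states, high energy
```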
The Boltzmann-Gibbs distribution defines probabilities from energies
• Consider a physical system with many states
• Let \(p_i\) denote the probability of occurrence of state \(i\)
• Let \(E_i\) denote the energy of state \(i\)
• From statistical mechanics, when the system is in thermal equilibrium, it satisfies
\[ p_i = \frac{1}{Z}\exp\left(-\frac{E_i}{T}\right), \quad \text{where } Z = \sum_i \exp\left(-\frac{E_i}{T}\right) \]
• \(Z\) is called the partition function, and \(T\) is called the temperature
• This is the Boltzmann-Gibbs distribution
Remarks
• Lower-energy states have a higher probability of occurrence
• As \(T\) decreases, the probability mass concentrates on a small subset of low-energy states
Boltzmann-Gibbs distribution example
• Consider \(\mathbf{E} = [1, 2, 3]\)
• At \(T = 10\):
\[ \tilde{p} = \exp(-\mathbf{E}/T) = [0.905, 0.819, 0.741] \]
\[ Z = \sum_i \tilde{p}_i = 0.905 + 0.819 + 0.741 = 2.464 \]
\[ p = \frac{1}{Z}\tilde{p} = [0.367, 0.332, 0.301] \]
Boltzmann-Gibbs distribution example
• Consider \(\mathbf{E} = [1, 2, 3]\)
• At \(T = 1\):
\[ \tilde{p} = \exp(-\mathbf{E}/T) = [0.368, 0.135, 0.050] \]
\[ Z = \sum_i \tilde{p}_i = 0.368 + 0.135 + 0.050 = 0.553 \]
\[ p = \frac{1}{Z}\tilde{p} = [0.665, 0.245, 0.090] \]
Boltzmann-Gibbs distribution example
• Consider \(\mathbf{E} = [1, 2, 3]\)
• At \(T = 0.1\):
\[ \tilde{p} = \exp(-\mathbf{E}/T) = [4.54\times10^{-5},\ 2.06\times10^{-9},\ 9.36\times10^{-14}] \]
\[ Z = \sum_i \tilde{p}_i = 4.54\times10^{-5} + 2.06\times10^{-9} + 9.36\times10^{-14} \approx 4.54\times10^{-5} \]
\[ p = \frac{1}{Z}\tilde{p} = [0.99995,\ 4.54\times10^{-5},\ 2.06\times10^{-9}] \]
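These three cases can be reproduced in a few lines of Python (an added illustration assuming numpy, not part of the slides):

```python
import numpy as np

# Boltzmann-Gibbs probabilities for energies E = [1, 2, 3]
# at the three temperatures from the examples above.
E = np.array([1.0, 2.0, 3.0])

for T in [10.0, 1.0, 0.1]:
    p_tilde = np.exp(-E / T)   # unnormalized probabilities
    Z = p_tilde.sum()          # partition function
    p = p_tilde / Z
    print(f"T = {T:4}: p = {np.round(p, 3)}")
# T = 10.0: p = [0.367 0.332 0.301]   (nearly uniform)
# T =  1.0: p = [0.665 0.245 0.09 ]
# T =  0.1: p = [1.    0.    0.   ]   (mass collapses onto the lowest energy)
```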
Boltzmann-Gibbs distribution applied to Hopfield network energy function
\[ P(\mathbf{x}) = \frac{1}{Z}\exp\left(-\frac{1}{T}E(\mathbf{x})\right) = \frac{1}{Z}\exp\left(\frac{1}{2T}\,\mathbf{x}^T\mathbf{W}\mathbf{x}\right) \]
• Partition function: for \(N\) neurons, involves \(2^N\) terms
\[ Z = \sum_{\mathbf{x}} \exp\left(-\frac{1}{T}E(\mathbf{x})\right) \]
• Marginal over \(H\) of the neurons: involves \(2^H\) terms
\[ P(\mathbf{x}_\alpha) = \sum_{\mathbf{x}_\beta} P(\mathbf{x}_\alpha, \mathbf{x}_\beta) \]
Boltzmann machines are unsupervised probability models
• The primary goal of Boltzmann machine learning is to produce a network that models the probability distribution of observed data (at visible neurons)
• Such a net can be used for pattern completion, as part of an associative memory, etc.
• What do we want to do with it?
  • Compute the probability of a new observation
  • Learn parameters of the model from data
  • Estimate likely values completing partial observations
Compute the probability of a new observation
• Divide the entire net into the subset \(\mathbf{x}_\alpha\) of visible units and \(\mathbf{x}_\beta\) of hidden units
• Given parameters \(\mathbf{W}\), compute the likelihood of observation \(\mathbf{x}_\alpha\):
\[ P(\mathbf{x}_\alpha) = \sum_{\mathbf{x}_\beta} P(\mathbf{x}) = \sum_{\mathbf{x}_\beta} \frac{1}{Z}\exp\left(-\frac{1}{T}E(\mathbf{x})\right) = \sum_{\mathbf{x}_\beta} \frac{1}{Z}\exp\left(\frac{1}{2T}\,\mathbf{x}^T\mathbf{W}\mathbf{x}\right) \]
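For a small net, this marginal can be evaluated by brute-force enumeration. A sketch (hypothetical helper names; it assumes the visible units come first in the state vector and that W is symmetric with zero diagonal):

```python
import numpy as np
from itertools import product

def all_states(n):
    """All 2^n configurations of n neurons, each in {-1, +1}."""
    return [np.array(s, dtype=float) for s in product([-1, 1], repeat=n)]

def likelihood(x_alpha, W, T=1.0):
    """Brute-force P(x_alpha): sum over all hidden configurations x_beta of
    (1/Z) * exp(x^T W x / (2T)), i.e. exp(-E(x)/T) with E = -1/2 x^T W x."""
    def p_tilde(x):
        return np.exp(0.5 * (x @ W @ x) / T)
    n_hidden = len(W) - len(x_alpha)
    Z = sum(p_tilde(x) for x in all_states(len(W)))          # 2^N terms
    return sum(p_tilde(np.concatenate([x_alpha, x_beta]))    # 2^H terms
               for x_beta in all_states(n_hidden)) / Z
```

Even at N = 20 neurons this means summing over about a million states, which motivates the sampling methods later in the lecture.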
Learning can be performed by gradient descent
• The objective of Boltzmann machine learning is to maximize the likelihood of the visible units taking on the training patterns by adjusting \(\mathbf{W}\)
• Assuming that the patterns of the training sample are statistically independent, the log probability of the training sample is:
\[ L(\mathbf{W}) = \log\prod_{\mathbf{x}_\alpha} P(\mathbf{x}_\alpha) = \sum_{\mathbf{x}_\alpha} \log P(\mathbf{x}_\alpha) \]
Log likelihood of data has two terms
\[ L(\mathbf{W}) = \log\prod_{\mathbf{x}_\alpha} P(\mathbf{x}_\alpha) \]
\[ = \sum_{\mathbf{x}_\alpha} \log P(\mathbf{x}_\alpha) \]
\[ = \sum_{\mathbf{x}_\alpha} \log \sum_{\mathbf{x}_\beta} \frac{1}{Z}\exp\left(-\frac{E(\mathbf{x})}{T}\right) \]
\[ = \sum_{\mathbf{x}_\alpha} \left[ \log \sum_{\mathbf{x}_\beta} \exp\left(-\frac{E(\mathbf{x})}{T}\right) - \log Z \right] \]
\[ = \sum_{\mathbf{x}_\alpha} \left[ \log \sum_{\mathbf{x}_\beta} \exp\left(-\frac{E(\mathbf{x})}{T}\right) - \log \sum_{\mathbf{x}} \exp\left(-\frac{E(\mathbf{x})}{T}\right) \right] \]
Gradient of log likelihood of data
\[ \frac{\partial L(\mathbf{W})}{\partial w_{ij}} = \sum_{\mathbf{x}_\alpha} \left[ \frac{\partial}{\partial w_{ij}} \log \sum_{\mathbf{x}_\beta} \exp\left(-\frac{E(\mathbf{x})}{T}\right) - \frac{\partial}{\partial w_{ij}} \log \sum_{\mathbf{x}} \exp\left(-\frac{E(\mathbf{x})}{T}\right) \right] \]
\[ = \sum_{\mathbf{x}_\alpha} \left[ -\frac{1}{T} \frac{\sum_{\mathbf{x}_\beta} \exp\left(-\frac{E(\mathbf{x})}{T}\right) \frac{\partial E(\mathbf{x})}{\partial w_{ij}}}{\sum_{\mathbf{x}_\beta} \exp\left(-\frac{E(\mathbf{x})}{T}\right)} + \frac{1}{T} \frac{\sum_{\mathbf{x}} \exp\left(-\frac{E(\mathbf{x})}{T}\right) \frac{\partial E(\mathbf{x})}{\partial w_{ij}}}{\sum_{\mathbf{x}} \exp\left(-\frac{E(\mathbf{x})}{T}\right)} \right] \]
Gradient of log likelihood of data
• Then, since \(\frac{\partial E(\mathbf{x})}{\partial w_{ij}} = -x_i x_j\):
\[ \frac{\partial L(\mathbf{W})}{\partial w_{ij}} = \frac{1}{T} \sum_{\mathbf{x}_\alpha} \left[ \frac{\sum_{\mathbf{x}_\beta} \exp\left(-\frac{E(\mathbf{x})}{T}\right) x_i x_j}{\sum_{\mathbf{x}_\beta} \exp\left(-\frac{E(\mathbf{x})}{T}\right)} - \frac{\sum_{\mathbf{x}} \exp\left(-\frac{E(\mathbf{x})}{T}\right) x_i x_j}{\sum_{\mathbf{x}} \exp\left(-\frac{E(\mathbf{x})}{T}\right)} \right] \]
\[ = \frac{1}{T} \left[ \sum_{\mathbf{x}_\alpha} \sum_{\mathbf{x}_\beta} P(\mathbf{x}_\beta \mid \mathbf{x}_\alpha)\, x_i x_j - \sum_{\mathbf{x}_\alpha} \sum_{\mathbf{x}} P(\mathbf{x})\, x_i x_j \right] \]
• The outer sum \(\sum_{\mathbf{x}_\alpha}\) runs over the training patterns; the second term is constant with respect to it
Gradient of log likelihood of data
\[ \frac{\partial L(\mathbf{W})}{\partial w_{ij}} = \frac{1}{T}\left(\rho_{ij}^+ - \rho_{ij}^-\right) \]
• Where \(\rho_{ij}^+ = \sum_{\mathbf{x}_\alpha} \sum_{\mathbf{x}_\beta} P(\mathbf{x}_\beta \mid \mathbf{x}_\alpha)\, x_i x_j = \sum_{\mathbf{x}_\alpha} E_{\mathbf{x}_\beta \mid \mathbf{x}_\alpha}\{x_i x_j\}\)
  • is the mean correlation between neurons \(i\) and \(j\) when the visible units are "clamped" to \(\mathbf{x}_\alpha\)
• And \(\rho_{ij}^- = K \sum_{\mathbf{x}} P(\mathbf{x})\, x_i x_j = K\, E_{\mathbf{x}}\{x_i x_j\}\), with \(K\) the number of training patterns
  • is the mean correlation between \(i\) and \(j\) when the machine operates without "clamping"
Maximization of \(L(\mathbf{W})\)
• To maximize \(L(\mathbf{W})\), we use gradient ascent:
\[ \Delta w_{ij} = \eta\, \frac{\partial L(\mathbf{W})}{\partial w_{ij}} = \frac{\eta}{T}\left(\rho_{ij}^+ - \rho_{ij}^-\right) = \epsilon\left(\rho_{ij}^+ - \rho_{ij}^-\right) \]
where the learning rate \(\epsilon = \eta / T\) incorporates the temperature \(T\)
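As a sketch, the update is one line once the two correlation matrices have been estimated (names are illustrative; how to estimate them is the subject of the rest of the lecture):

```python
import numpy as np

def boltzmann_update(W, rho_plus, rho_minus, eps=0.01):
    """One gradient-ascent step: delta_w_ij = eps * (rho+_ij - rho-_ij).
    rho_plus and rho_minus are NxN matrices of mean correlations <x_i x_j>
    from the clamped ("positive") and free-running ("negative") phases."""
    dW = eps * (rho_plus - rho_minus)
    np.fill_diagonal(dW, 0.0)  # keep zero self-connections
    return W + dW
```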
Positive and negative phases
• Thus there are two phases to the learning process:
  1. Positive phase: the net operates in the "clamped" condition, where visible units take on training patterns with the desired probability distribution
  2. Negative phase: the net operates freely, without the influence of external input
Remarks on Boltzmann machine learning
• The Boltzmann learning rule is a remarkably simple local rule, involving only the "presynaptic" and "postsynaptic" neurons
  • Such local rules are biologically plausible
• But computing \(\rho_{ij}^+\) directly requires summing over \(2^H\) terms for each observed data point
• Computing \(\rho_{ij}^-\) requires summing over \(2^N\) terms, once
• Can we approximate those sums?
Sampling methods can approximate expectations (difficult integrals/sums)
• Want to compute \(\rho_{ij}^- = \sum_{\mathbf{x}} P(\mathbf{x})\, x_i x_j = E_{\mathbf{x}}\{x_i x_j\}\)
• We can approximate that expectation as
\[ \rho_{ij}^- = E_{\mathbf{x}}\{x_i x_j\} \approx \frac{1}{N}\sum_{n=1}^{N} x_i^{(n)} x_j^{(n)} \]
where the \(\mathbf{x}^{(n)}\) are samples drawn from \(P(\mathbf{x})\)
• In general, \(E_{\mathbf{x}}\{g(\mathbf{x})\} \approx \frac{1}{N}\sum_{n=1}^{N} g(\mathbf{x}^{(n)})\)
• The sum converges to the true expectation as the number of samples goes to infinity
• This is called Monte Carlo integration / summation
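A quick numerical illustration (a toy example assuming numpy, not from the slides): estimating \(E\{x^2\} = 1\) for \(x \sim \mathcal{N}(0, 1)\):

```python
import numpy as np

# Monte Carlo estimate of E[g(x)] as (1/N) * sum_n g(x_n),
# here with g(x) = x^2 and x drawn from a standard normal,
# so the true expectation is 1.
rng = np.random.default_rng(0)
for N in [10, 1_000, 100_000]:
    samples = rng.standard_normal(N)
    print(f"N = {N:>7,}: estimate = {np.mean(samples**2):.4f}")
# Estimates approach 1.0 as N grows.
```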
Gibbs sampling can draw samples from intractable distributions
• Many high-dimensional probability distributions are difficult to compute or sample from
• But Gibbs sampling can draw samples from them, if you can compute the probability of some variables given others
  • This is typically the case in graphical models, including Boltzmann machines
• This kind of approach is called Markov chain Monte Carlo (MCMC)
Gibbs sampling can draw samples from intractable distributions
• Consider a K-dimensional random vector \(\mathbf{x} = (x_1, \ldots, x_K)^T\)
• Suppose we know the conditional distributions \(P(x_k \mid x_1, \ldots, x_{k-1}, x_{k+1}, \ldots, x_K) = P(x_k \mid \mathbf{x}_{\sim k})\)
• Then by sampling each \(x_k\) in turn (or randomly), the distribution of samples will eventually converge to \(P(\mathbf{x}) = P(x_1, \ldots, x_K)\)
• Note: it can be difficult to know how long to sample
Gibbs sampling algorithm
• For iteration \(n\):
  \(x_1^{(n)}\) is drawn from \(P(x_1 \mid x_2, \ldots, x_K)\)
  …
  \(x_k^{(n)}\) is drawn from \(P(x_k \mid x_1, \ldots, x_{k-1}, x_{k+1}, \ldots, x_K)\)
  …
  \(x_K^{(n)}\) is drawn from \(P(x_K \mid x_1, \ldots, x_{K-1})\)
• Each iteration samples each random variable once, in the natural order
• Newly sampled values are used immediately (i.e., asynchronous sampling)
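A generic skeleton of this loop (the `sample_conditional` callback is a hypothetical stand-in for whatever conditional the model provides):

```python
def gibbs_iteration(x, sample_conditional, rng):
    """One Gibbs sampling iteration: resample each variable in turn from
    its conditional given all the others. Newly sampled values sit in x
    immediately, so later draws in the same sweep already see them."""
    for k in range(len(x)):
        x[k] = sample_conditional(k, x, rng)  # draw from P(x_k | x_~k)
    return x
```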
Gibbs sampling in Boltzmann machines
• Need \(P(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_K) = P(x_i \mid \mathbf{x}_{\sim i})\)
• Can compute it from the Boltzmann-Gibbs distribution:
\[ P(x_i = 1 \mid \mathbf{x}_{\sim i}) = \frac{P(x_i = 1, \mathbf{x}_{\sim i})}{P(x_i = 1, \mathbf{x}_{\sim i}) + P(x_i = -1, \mathbf{x}_{\sim i})} \]
\[ = \frac{\frac{1}{Z}\exp\left(-\frac{E^+}{T}\right)}{\frac{1}{Z}\exp\left(-\frac{E^+}{T}\right) + \frac{1}{Z}\exp\left(-\frac{E^-}{T}\right)} \]
\[ = \frac{1}{1 + \exp\left(\frac{1}{T}\left(E^+ - E^-\right)\right)} \]
\[ = \varphi(\Delta E) \]
where \(E^\pm\) is the energy with \(x_i = \pm 1\), \(\Delta E = E^- - E^+\), and \(\varphi\) is the sigmoid defined below
So Boltzmann machines use stochastic neurons with sigmoid activation
• Neuron \(i\) is connected to all other neurons:
\[ v_i = \sum_j w_{ij}\, x_j \]
• It is then updated stochastically so that
\[ x_i = \begin{cases} +1 & \text{with prob. } \varphi(v_i) \\ -1 & \text{with prob. } 1 - \varphi(v_i) \end{cases} \]
• Where
\[ \varphi(v) = \frac{1}{1 + \exp(-v/T)} \]
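Combining this stochastic update with Gibbs sampling gives the basic simulation step for a Boltzmann machine. A sketch (assumes symmetric W with zero diagonal; names are illustrative):

```python
import numpy as np

def phi(v, T=1.0):
    """Sigmoid activation phi(v) = 1 / (1 + exp(-v/T))."""
    return 1.0 / (1.0 + np.exp(-v / T))

def gibbs_sweep_bm(x, W, T=1.0, rng=None):
    """One Gibbs sweep over a Boltzmann machine with states in {-1, +1}:
    each neuron is set to +1 with probability phi(v_i), where v_i is its
    net input, using the freshly updated states of the other neurons."""
    rng = rng or np.random.default_rng()
    for i in range(len(x)):
        v_i = W[i] @ x  # w_ii = 0, so x_i itself does not contribute
        x[i] = 1.0 if rng.random() < phi(v_i, T) else -1.0
    return x
```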
Prob. of flipping a single neuron
• Consider the probability of flipping a single neuron \(k\):
\[ P(x_k \to -x_k) = \frac{1}{1 + \exp\left(\Delta E_k / 2T\right)} \]
where \(\Delta E_k\) is the energy change due to the flip (the proof is a homework problem)
• So a change that decreases the energy is more likely than one that increases it
Simulated annealing
• As the temperature \(T\) decreases:
  • The average energy of a stochastic system decreases
  • It reaches the global minimum as \(T \to 0\)
  • So for optimization problems, we should favor low temperatures
• But convergence is slow at low temperatures, due to trapping at local minima
• Simulated annealing is a stochastic optimization technique that gradually decreases \(T\)
  • To get the best of both worlds
Simulated annealing (cont.)
• There is no guarantee of finding the global minimum, but a better chance of reaching lower local minima
• Boltzmann machines use simulated annealing to gradually lower \(T\)
Full Boltzmann machine training algorithm
• The entire algorithm consists of the following nested loops (see the sketch after this list):
  1. Loop over all training data points, accumulating the gradient of each weight
  2. For each data point, compute the expectation \(\langle x_i x_j \rangle\) with \(\mathbf{x}_\alpha\) clamped and free
  3. Compute the expectations using simulated annealing, gradually decreasing \(T\)
  4. For each \(T\), sample the state of the entire net a number of times using Gibbs sampling
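A condensed Python sketch of these nested loops (illustrative only: the annealing schedule, sweep counts, and single-sample correlation estimates are assumptions, not prescribed by the slides; visible units are assumed to come first):

```python
import numpy as np

def train_epoch(W, data, schedule=(4.0, 2.0, 1.0), sweeps=20, eps=0.01,
                rng=None):
    rng = rng or np.random.default_rng()
    N, n_vis = len(W), data.shape[1]
    rho_plus = np.zeros_like(W)
    rho_minus = np.zeros_like(W)
    for x_alpha in data:                     # 1. loop over training points
        for clamped in (True, False):        # 2. positive and negative phases
            x = rng.choice([-1.0, 1.0], size=N)
            if clamped:
                x[:n_vis] = x_alpha          # clamp visible units
            free = range(n_vis, N) if clamped else range(N)
            for T in schedule:               # 3. simulated annealing
                for _ in range(sweeps):      # 4. Gibbs sampling at each T
                    for i in free:           # only free units are resampled
                        v = W[i] @ x
                        x[i] = 1.0 if rng.random() < 1 / (1 + np.exp(-v / T)) else -1.0
            if clamped:
                rho_plus += np.outer(x, x)   # accumulate <x_i x_j> estimates
            else:
                rho_minus += np.outer(x, x)
    W = W + eps * (rho_plus - rho_minus)     # gradient ascent on L(W)
    np.fill_diagonal(W, 0.0)
    return W
```

In practice the correlations would be averaged over many samples drawn at the final temperature; this sketch keeps just one per phase for brevity.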
Remarks on Boltzmann machine training
• Boltzmann machines are extremely slow to train, but work well once they are trained
• Because of its computational complexity, the algorithm has only been applied to toy problems
• But: the restricted Boltzmann machine is much easier to train, stay tuned…