CSE 5526: Boltzmann Machines - mr-pc.org

Transcript of CSE 5526: Boltzmann Machines - mr-pc.org

1 CSE 5526: Boltzmann Machines

CSE 5526: Introduction to Neural Networks

Boltzmann Machines

2 CSE 5526: Boltzmann Machines

Boltzmann machines are a stochastic extension of Hopfield networks

• A Boltzmann machine is a stochastic learning machine that consists of visible and hidden units and symmetric connections

• The network can be layered, and visible units can be either "input" or "output" (connections are not directed)

[Figure: layered network diagram with units labeled "input", "hidden", and "visible"]

3 CSE 5526: Boltzmann Machines

In general, the network is fully connected

4 CSE 5526: Boltzmann Machines

Boltzmann machines have the same energy function as Hopfield networks

• Because of the symmetric connections, the energy function of neuron configuration $\mathbf{x}$ is:

$$E(\mathbf{x}) = -\frac{1}{2}\sum_{j,i} w_{ji}\, x_j x_i = -\frac{1}{2}\,\mathbf{x}^T W \mathbf{x}$$

• Can we start there and derive a probability distribution over configurations?

5 CSE 5526: Boltzmann Machines

The Boltzmann-Gibbs distribution defines probabilities from energies

• Consider a physical system with many states
• Let $p_i$ denote the probability of occurrence of state $i$
• Let $E_i$ denote the energy of state $i$

• From statistical mechanics, when the system is in thermal equilibrium, it satisfies

$$p_i = \frac{1}{Z}\exp\left(-\frac{E_i}{T}\right), \quad \text{where} \quad Z = \sum_j \exp\left(-\frac{E_j}{T}\right)$$

• $Z$ is called the partition function, and $T$ is called the temperature

• This is the Boltzmann-Gibbs distribution

6 CSE 5526: Boltzmann Machines

Remarks

• Lower energy states have a higher probability of occurrence

• As $T$ decreases, the probability is concentrated on a small subset of low-energy states

7 CSE 5526: Boltzmann Machines

Boltzmann-Gibbs distribution example

• Consider $\mathbf{E} = [1, 2, 3]$
• At $T = 10$:
  • $\tilde{\mathbf{p}} = \exp(-\mathbf{E}/T) = [0.905, 0.819, 0.741]$
  • $Z = \sum_i \tilde{p}_i = 0.905 + 0.819 + 0.741 = 2.464$
  • $\mathbf{p} = \frac{1}{Z}\tilde{\mathbf{p}} = [0.367, 0.332, 0.301]$

8 CSE 5526: Boltzmann Machines

Boltzmann-Gibbs distribution example

• Consider $\mathbf{E} = [1, 2, 3]$
• At $T = 1$:
  • $\tilde{\mathbf{p}} = \exp(-\mathbf{E}/T) = [0.368, 0.135, 0.050]$
  • $Z = \sum_i \tilde{p}_i = 0.368 + 0.135 + 0.050 = 0.553$
  • $\mathbf{p} = \frac{1}{Z}\tilde{\mathbf{p}} = [0.665, 0.245, 0.090]$

9 CSE 5526: Boltzmann Machines

Boltzmann-Gibbs distribution example

• Consider $\mathbf{E} = [1, 2, 3]$
• At $T = 0.1$:
  • $\tilde{\mathbf{p}} = \exp(-\mathbf{E}/T) = [4.54 \cdot 10^{-5},\ 2.06 \cdot 10^{-9},\ 9.36 \cdot 10^{-14}]$
  • $Z = \sum_i \tilde{p}_i = 4.54 \cdot 10^{-5} + 2.06 \cdot 10^{-9} + 9.36 \cdot 10^{-14} = 4.54 \cdot 10^{-5}$
  • $\mathbf{p} = \frac{1}{Z}\tilde{\mathbf{p}} = [0.99995,\ 4.54 \cdot 10^{-5},\ 2.06 \cdot 10^{-9}]$
  • (a code sketch reproducing these numbers follows below)
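
The following is a minimal sketch, not from the slides, of these three examples in Python/NumPy: the same energies $\mathbf{E} = [1, 2, 3]$ pushed through the Boltzmann-Gibbs distribution at $T = 10, 1, 0.1$ (the function name `boltzmann_gibbs` is illustrative).

```python
import numpy as np

def boltzmann_gibbs(energies, T):
    """Boltzmann-Gibbs distribution: p_i = exp(-E_i / T) / Z."""
    p_tilde = np.exp(-np.asarray(energies, dtype=float) / T)  # unnormalized probabilities
    Z = p_tilde.sum()                                          # partition function
    return p_tilde / Z

E = [1.0, 2.0, 3.0]
for T in (10.0, 1.0, 0.1):
    print(T, boltzmann_gibbs(E, T))
# T = 10  -> [0.367, 0.332, 0.301]   nearly uniform
# T = 1   -> [0.665, 0.245, 0.090]
# T = 0.1 -> [0.99995, 4.54e-05, 2.06e-09]   mass concentrates on the lowest-energy state
```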

10 CSE 5526: Boltzmann Machines

Boltzmann-Gibbs distribution applied to Hopfield network energy function

$$p(\mathbf{x}) = \frac{1}{Z}\exp\left(-\frac{1}{T}E(\mathbf{x})\right) = \frac{1}{Z}\exp\left(\frac{1}{2T}\,\mathbf{x}^T W \mathbf{x}\right)$$

• Partition function: for $N$ neurons, it involves $2^N$ terms

$$Z = \sum_{\mathbf{x}} \exp\left(-\frac{1}{T}E(\mathbf{x})\right)$$

• Marginal over $H$ of the neurons: involves $2^H$ terms

$$p(\mathbf{x}_\alpha) = \sum_{\mathbf{x}_\beta} p(\mathbf{x}_\alpha, \mathbf{x}_\beta)$$

11 CSE 5526: Boltzmann Machines

Boltzmann machines are unsupervised probability models

• The primary goal of Boltzmann machine learning is to produce a network that models the probability distribution of the observed data (at visible neurons)
• Such a net can be used for pattern completion, as part of an associative memory, etc.
• What do we want to do with it?
  • Compute the probability of a new observation
  • Learn parameters of the model from data
  • Estimate likely values completing partial observations

12 CSE 5526: Boltzmann Machines

Compute the probability of a new observation

• Divide the entire net into the subset $\mathbf{x}_\alpha$ of visible units and $\mathbf{x}_\beta$ of hidden units

• Given parameters $W$, compute the likelihood of observation $\mathbf{x}_\alpha$:

$$p(\mathbf{x}_\alpha) = \sum_{\mathbf{x}_\beta} p(\mathbf{x}) = \sum_{\mathbf{x}_\beta} \frac{1}{Z}\exp\left(-\frac{1}{T}E(\mathbf{x})\right) = \sum_{\mathbf{x}_\beta} \frac{1}{Z}\exp\left(\frac{1}{2T}\,\mathbf{x}^T W \mathbf{x}\right)$$
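
As a sanity check, here is a brute-force sketch, not from the slides, that computes this marginal for a tiny machine by enumerating every hidden configuration and every joint configuration (the network size, the ±1 state convention, and names such as `likelihood_of_visible` are illustrative assumptions); it is only feasible for very small nets, which motivates the approximations introduced later.

```python
import itertools
import numpy as np

def energy(x, W):
    """Boltzmann machine energy E(x) = -1/2 x^T W x."""
    return -0.5 * x @ W @ x

def likelihood_of_visible(x_alpha, W, T, n_hidden):
    """p(x_alpha) = sum over hidden configurations of (1/Z) exp(-E(x)/T), by enumeration."""
    states = [-1, 1]
    # Unnormalized marginal: sum over all 2^H hidden completions of x_alpha
    num = sum(np.exp(-energy(np.concatenate([x_alpha, np.array(xb)]), W) / T)
              for xb in itertools.product(states, repeat=n_hidden))
    # Partition function: sum over all 2^N joint configurations
    N = len(x_alpha) + n_hidden
    Z = sum(np.exp(-energy(np.array(x), W) / T)
            for x in itertools.product(states, repeat=N))
    return num / Z

# Toy example: 3 visible + 2 hidden units, random symmetric weights, zero diagonal
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
W = (A + A.T) / 2
np.fill_diagonal(W, 0.0)
print(likelihood_of_visible(np.array([1, -1, 1]), W, T=1.0, n_hidden=2))
```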

13 CSE 5526: Boltzmann Machines

Learning can be performed by gradient ascent

• The objective of Boltzmann machine learning is to maximize the likelihood of the visible units taking on the training patterns by adjusting $W$

• Assuming that each pattern of the training sample is statistically independent, the log probability of the training sample is:

$$L(\mathbf{w}) = \log \prod_{\mathbf{x}_\alpha} P(\mathbf{x}_\alpha) = \sum_{\mathbf{x}_\alpha} \log P(\mathbf{x}_\alpha)$$

14 CSE 5526: Boltzmann Machines

Log likelihood of data has two terms

$$\begin{aligned}
L(\mathbf{w}) &= \log \prod_{\mathbf{x}_\alpha} P(\mathbf{x}_\alpha) = \sum_{\mathbf{x}_\alpha} \log P(\mathbf{x}_\alpha) \\
&= \sum_{\mathbf{x}_\alpha} \log \sum_{\mathbf{x}_\beta} \frac{1}{Z}\exp\left(-\frac{E(\mathbf{x})}{T}\right) \\
&= \sum_{\mathbf{x}_\alpha} \left[ \log \sum_{\mathbf{x}_\beta} \exp\left(-\frac{E(\mathbf{x})}{T}\right) - \log Z \right] \\
&= \sum_{\mathbf{x}_\alpha} \left[ \log \sum_{\mathbf{x}_\beta} \exp\left(-\frac{E(\mathbf{x})}{T}\right) - \log \sum_{\mathbf{x}} \exp\left(-\frac{E(\mathbf{x})}{T}\right) \right]
\end{aligned}$$

15 CSE 5526: Boltzmann Machines

Gradient of log likelihood of data

πœ•πœ•πœ•πœ•(𝐰𝐰)πœ•πœ•π‘€π‘€π‘—π‘—π‘—π‘—

= οΏ½πœ•πœ•

πœ•πœ•π‘€π‘€π‘—π‘—π‘—π‘—logοΏ½ exp βˆ’

𝐸𝐸 𝐱𝐱𝑇𝑇

𝐱𝐱𝛽𝛽

βˆ’πœ•πœ•

πœ•πœ•π‘€π‘€π‘—π‘—π‘—π‘—logοΏ½ exp βˆ’

𝐸𝐸 𝐱𝐱𝑇𝑇

𝐱𝐱𝐱𝐱𝛼𝛼

= οΏ½βˆ’ 1π‘‡π‘‡βˆ‘ exp βˆ’πΈπΈ 𝐱𝐱

π‘‡π‘‡πœ•πœ•πΈπΈ π±π±πœ•πœ•π‘€π‘€π‘—π‘—π‘—π‘—π±π±π›½π›½

βˆ‘ exp βˆ’πΈπΈ 𝐱𝐱𝑇𝑇𝐱𝐱𝛽𝛽

+

1π‘‡π‘‡βˆ‘ exp βˆ’πΈπΈ 𝐱𝐱

π‘‡π‘‡πœ•πœ•πΈπΈ π±π±πœ•πœ•π‘€π‘€π‘—π‘—π‘—π‘—π±π±

βˆ‘ exp βˆ’πΈπΈ 𝐱𝐱𝑇𝑇𝐱𝐱𝐱𝐱𝛼𝛼

16 CSE 5526: Boltzmann Machines

Gradient of log likelihood of data

• Then, since $\dfrac{\partial E(\mathbf{x})}{\partial w_{ji}} = -x_j x_i$:

πœ•πœ•πΏπΏ(𝐰𝐰)πœ•πœ•π‘€π‘€π‘—π‘—π‘—π‘—

=1𝑇𝑇� οΏ½

exp βˆ’πΈπΈ 𝐱𝐱𝑇𝑇 π‘₯π‘₯𝑗𝑗π‘₯π‘₯𝑗𝑗

βˆ‘ exp βˆ’πΈπΈ 𝐱𝐱𝑇𝑇𝐱𝐱𝛽𝛽𝐱𝐱𝛽𝛽

βˆ’βˆ‘ exp βˆ’πΈπΈ 𝐱𝐱

𝑇𝑇 π‘₯π‘₯𝑗𝑗π‘₯π‘₯𝑗𝑗𝐱𝐱

𝑍𝑍𝐱𝐱𝛼𝛼

=1𝑇𝑇

��𝑃𝑃 𝐱𝐱𝛽𝛽 𝐱𝐱𝛼𝛼 π‘₯π‘₯𝑗𝑗π‘₯π‘₯𝑗𝑗𝐱𝐱𝛽𝛽𝐱𝐱𝛼𝛼

βˆ’οΏ½οΏ½π‘ƒπ‘ƒ(𝐱𝐱)π‘₯π‘₯𝑗𝑗π‘₯π‘₯𝑗𝑗𝐱𝐱𝐱𝐱𝛼𝛼

(The outer sum runs over the training patterns $\mathbf{x}_\alpha$; the second term is constant with respect to that sum.)

17 CSE 5526: Boltzmann Machines

Gradient of log likelihood of data

πœ•πœ•πΏπΏ(𝐰𝐰)πœ•πœ•π‘€π‘€π‘—π‘—π‘—π‘—

=1π‘‡π‘‡πœŒπœŒπ‘—π‘—π‘—π‘—+ βˆ’ πœŒπœŒπ‘—π‘—π‘—π‘—βˆ’

• where $\rho_{ji}^{+} = \sum_{\mathbf{x}_\alpha} \sum_{\mathbf{x}_\beta} P(\mathbf{x}_\beta \mid \mathbf{x}_\alpha)\, x_j x_i = \sum_{\mathbf{x}_\alpha} E_{\mathbf{x}_\beta \mid \mathbf{x}_\alpha}\{x_j x_i\}$
  • is the mean correlation between neurons $i$ and $j$ when the visible units are "clamped" to $\mathbf{x}_\alpha$

• and $\rho_{ji}^{-} = K \sum_{\mathbf{x}} P(\mathbf{x})\, x_j x_i = K\, E_{\mathbf{x}}\{x_j x_i\}$, where $K$ is the number of training patterns
  • is the mean correlation between $i$ and $j$ when the machine operates without "clamping"

18 CSE 5526: Boltzmann Machines

Maximization of $L(\mathbf{w})$

• To maximize $L(\mathbf{w})$, we use gradient ascent:

$$\Delta w_{ji} = \varepsilon \frac{\partial L(\mathbf{w})}{\partial w_{ji}} = \frac{\varepsilon}{T} \left( \rho_{ji}^{+} - \rho_{ji}^{-} \right) = \eta \left( \rho_{ji}^{+} - \rho_{ji}^{-} \right)$$

where the learning rate $\eta$ incorporates the temperature $T$ (a code sketch of this update follows below)

19 CSE 5526: Boltzmann Machines

Positive and negative phases

• Thus there are two phases to the learning process:
  1. Positive phase: the net operates in the "clamped" condition, where visible units take on training patterns with the desired probability distribution
  2. Negative phase: the net operates freely, without the influence of external input

20 CSE 5526: Boltzmann Machines

Remarks on Boltzmann machine learning

• The Boltzmann learning rule is a remarkably simple local rule, concerning only "presynaptic" and "postsynaptic" neurons
  • Such local rules are biologically plausible
• But computing $\rho_{ji}^{+}$ directly requires summing over $2^H$ terms for each observed data point
• Computing $\rho_{ji}^{-}$ requires summing $2^N$ terms once
• Can we approximate those sums?

21 CSE 5526: Boltzmann Machines

Sampling methods can approximate expectations (difficult integrals/sums)

• Want to compute $\rho_{ji}^{-} = \sum_{\mathbf{x}} P(\mathbf{x})\, x_j x_i = E_{\mathbf{x}}\{x_j x_i\}$
• We can approximate that expectation as

$$\rho_{ji}^{-} = E_{\mathbf{x}}\{x_j x_i\} \approx \frac{1}{N} \sum_{\mathbf{x}_n} x_j x_i$$

• where the $\mathbf{x}_n$ are samples drawn from $P(\mathbf{x})$, and $N$ is the number of samples
• In general, $E_{\mathbf{x}}\{f(\mathbf{x})\} \approx \frac{1}{N} \sum_{\mathbf{x}_n} f(\mathbf{x}_n)$
• The sum converges to the true expectation as the number of samples goes to infinity
• This is called Monte Carlo integration / summation (a small illustration follows below)
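
As a small illustration, not from the slides, here is Monte Carlo estimation applied to the earlier three-state Boltzmann-Gibbs example ($\mathbf{E} = [1, 2, 3]$, $T = 1$): the sample average of $f(x) = E(x)$ approaches the exact expectation as the number of samples grows. The sampling here uses NumPy's `choice` directly, since the exact distribution is known for this toy case.

```python
import numpy as np

rng = np.random.default_rng(1)

energies = np.array([1.0, 2.0, 3.0])
p = np.exp(-energies / 1.0)          # Boltzmann-Gibbs weights at T = 1
p /= p.sum()                         # p = [0.665, 0.245, 0.090]

exact = np.sum(p * energies)                        # exact E_x{ E(x) }
samples = rng.choice(energies, size=100_000, p=p)   # draws x_n ~ P(x)
estimate = samples.mean()                           # (1/N) sum_n f(x_n)
print(exact, estimate)               # the two values agree to a few decimal places
```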

22 CSE 5526: Boltzmann Machines

Gibbs sampling can draw samples from intractable distributions

• Many high-dimensional probability distributions are difficult to compute or sample from
• But Gibbs sampling can draw samples from them
  • if you can compute the probability of some variables given the others
  • This is typically the case in graphical models, including Boltzmann machines
• This kind of approach is called Markov chain Monte Carlo (MCMC)

23 CSE 5526: Boltzmann Machines

Gibbs sampling can draw samples from intractable distributions

• Consider a $K$-dimensional random vector $\mathbf{x} = (x_1, \ldots, x_K)^T$
• Suppose we know the conditional distributions $p(x_k \mid x_1, \ldots, x_{k-1}, x_{k+1}, \ldots, x_K) = p(x_k \mid \mathbf{x}_{\sim k})$
• Then by sampling each $x_k$ in turn (or randomly), the distribution of the samples will eventually converge to $p(\mathbf{x}) = p(x_1, \ldots, x_K)$
• Note: it can be difficult to know how long to sample

24 CSE 5526: Boltzmann Machines

Gibbs sampling algorithm

• For iteration $n$:
  • $x_1^{(n)}$ is drawn from $p(x_1 \mid x_2, \ldots, x_K)$
  • ...
  • $x_k^{(n)}$ is drawn from $p(x_k \mid x_1, \ldots, x_{k-1}, x_{k+1}, \ldots, x_K)$
  • ...
  • $x_K^{(n)}$ is drawn from $p(x_K \mid x_1, \ldots, x_{K-1})$
• Each iteration samples each random variable once, in the natural order
• Newly sampled values are used immediately (i.e., asynchronous sampling)

25 CSE 5526: Boltzmann Machines

Gibbs sampling in Boltzmann machines

• Need $p(x_k \mid x_1, \ldots, x_{k-1}, x_{k+1}, \ldots, x_K) = p(x_k \mid \mathbf{x}_{\sim k})$
• Can compute it from the Boltzmann-Gibbs distribution. Writing $E^{+}$ and $E^{-}$ for the energies of the configurations with $x_k = 1$ and $x_k = -1$, and $\Delta E = E^{-} - E^{+}$:

$$\begin{aligned}
p(x_k = 1 \mid \mathbf{x}_{\sim k}) &= \frac{p(x_k = 1, \mathbf{x}_{\sim k})}{p(x_k = 1, \mathbf{x}_{\sim k}) + p(x_k = -1, \mathbf{x}_{\sim k})} \\
&= \frac{\frac{1}{Z}\exp\left(-\frac{E^{+}}{T}\right)}{\frac{1}{Z}\exp\left(-\frac{E^{+}}{T}\right) + \frac{1}{Z}\exp\left(-\frac{E^{-}}{T}\right)} \\
&= \frac{1}{1 + \exp\left(\frac{1}{T}\left(E^{+} - E^{-}\right)\right)} \\
&= \sigma\!\left(\frac{\Delta E}{T}\right)
\end{aligned}$$

26 CSE 5526: Boltzmann Machines

So Boltzmann machines use stochastic neurons with sigmoid activation

• Neuron $k$ is connected to all other neurons:

$$v_k = \sum_j w_{kj} x_j$$

• It is then updated stochastically so that

$$x_k = \begin{cases} +1 & \text{with prob. } \varphi(v_k) \\ -1 & \text{with prob. } 1 - \varphi(v_k) \end{cases}$$

• where

$$\varphi(v) = \frac{1}{1 + \exp\left(-\frac{v}{T}\right)}$$
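
Below is a sketch, not from the slides, of one asynchronous Gibbs sweep using exactly this stochastic rule (±1 units, no bias terms; the names `phi` and `gibbs_sweep` and the optional clamping argument are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(2)

def phi(v, T):
    """Sigmoid activation phi(v) = 1 / (1 + exp(-v / T))."""
    return 1.0 / (1.0 + np.exp(-v / T))

def gibbs_sweep(x, W, T, clamped=()):
    """One Gibbs sweep: update each neuron once, in order, using new values
    immediately; indices in `clamped` (e.g. visible units) are left fixed."""
    for k in range(len(x)):
        if k in clamped:
            continue
        v_k = W[k] @ x                                  # net input from all other neurons
        x[k] = 1 if rng.random() < phi(v_k, T) else -1  # stochastic +1 / -1 update
    return x

# Free-running example on a 6-neuron machine with random symmetric weights
A = rng.normal(size=(6, 6))
W = (A + A.T) / 2
np.fill_diagonal(W, 0.0)
x = rng.choice([-1, 1], size=6)
for _ in range(100):
    x = gibbs_sweep(x, W, T=1.0)
print(x)
```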

27 CSE 5526: Boltzmann Machines

Prob. of flipping a single neuron

• Consider the probability of flipping a single neuron $k$:

$$P(x_k \to -x_k) = \frac{1}{1 + \exp\left(\frac{\Delta E_k}{2T}\right)}$$

where $\Delta E_k$ is the energy change due to the flip (the proof is a homework problem)

• So a change that decreases the energy is more likely than one that increases the energy

28 CSE 5526: Boltzmann Machines

Simulated annealing

• As the temperature $T$ decreases:
  • the average energy of a stochastic system decreases
  • it reaches the global minimum as $T \to 0$
  • so for optimization problems, low temperatures should be favored
• But convergence is slow at low temperatures
  • due to trapping at local minima
• Simulated annealing is a stochastic optimization technique that gradually decreases $T$
  • to get the best of both worlds

29 CSE 5526: Boltzmann Machines

Simulated annealing (cont.)

• There is no guarantee of finding the global minimum, but there is a higher chance of reaching lower local minima
• Boltzmann machines use simulated annealing to gradually lower $T$

30 CSE 5526: Boltzmann Machines

Full Boltzmann machine training algorithm

• The entire algorithm consists of the following nested loops (a code sketch follows below):
  1. Loop over all training data points, accumulating the gradient of each weight
  2. For each data point, compute the expectation of $x_j x_i$ with $\mathbf{x}_\alpha$ clamped and free
  3. Compute the expectations using simulated annealing, gradually decreasing $T$
  4. For each $T$, sample the state of the entire net a number of times using Gibbs sampling
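
Pulling the loops together, here is a schematic, runnable sketch, not from the slides: the schedule constants, sweep counts, initializations, and names (`gibbs_sweep`, `sample_correlation`, `train`) are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(3)

def gibbs_sweep(x, W, T, n_clamped=0):
    """One asynchronous Gibbs sweep; the first n_clamped units are held fixed."""
    for k in range(n_clamped, len(x)):
        v_k = W[k] @ x
        p_on = 1.0 / (1.0 + np.exp(-v_k / T))
        x[k] = 1 if rng.random() < p_on else -1
    return x

def sample_correlation(W, x0, n_clamped, schedule, sweeps_per_T, n_samples=20):
    """Estimate E{x x^T} via Gibbs sampling inside a simulated-annealing schedule."""
    x = x0.copy()
    for T in schedule:                        # gradually decrease the temperature
        for _ in range(sweeps_per_T):
            x = gibbs_sweep(x, W, T, n_clamped)
    corr = np.zeros_like(W)
    for _ in range(n_samples):                # Monte Carlo average at the final T
        x = gibbs_sweep(x, W, schedule[-1], n_clamped)
        corr += np.outer(x, x)
    return corr / n_samples

def train(patterns, n_hidden, epochs=20, eta=0.05,
          schedule=(10.0, 5.0, 2.0, 1.0), sweeps_per_T=5):
    n_vis = patterns.shape[1]
    N = n_vis + n_hidden
    W = np.zeros((N, N))
    for _ in range(epochs):
        dW = np.zeros_like(W)
        for v in patterns:                    # loop over training data points
            x = np.concatenate([v, rng.choice([-1, 1], size=n_hidden)])
            rho_plus = sample_correlation(W, x, n_vis, schedule, sweeps_per_T)  # clamped
            rho_minus = sample_correlation(W, x, 0, schedule, sweeps_per_T)     # free
            dW += eta * (rho_plus - rho_minus)
        np.fill_diagonal(dW, 0.0)             # keep zero self-connections
        W += dW
    return W

# Toy usage: 3 visible units, 2 hidden units, two training patterns
patterns = np.array([[1, 1, -1], [-1, -1, 1]])
W = train(patterns, n_hidden=2, epochs=10)
print(W)
```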

31 CSE 5526: Boltzmann Machines

Remarks on Boltzmann machine training

• Boltzmann machines are extremely slow to train
  • but they work well once they are trained
• Because of its computational complexity, the algorithm has only been applied to toy problems
• But the restricted Boltzmann machine is much easier to train, stay tuned...

32 CSE 5526: Boltzmann Machines

An example

• The encoder problem (see blackboard)

• Ackley, Hinton, & Sejnowski (1985)