
Associative Memories (II)
Stochastic Networks and Boltzmann Machines

Davide Bacciu

Dipartimento di Informatica, Università di Pisa
[email protected]

Applied Brain Science - Computational Neuroscience (CNS)


Deterministic Networks

Consider the Hopfield network model introduced so far:
- Neuron output is a deterministic function of its inputs
- Recurrent network of all visible units
- Learns to encode a set of fixed training patterns V = [v^1 ... v^P]

What models do we obtain if we relax the deterministic input-output mapping?


Stochastic Networks

A network of units whose activation is determined by a stochastic function:
- The state of a unit at a given timestep is sampled from a given probability distribution
- The network learns a probability distribution P(V) from the training patterns
- The network includes both visible units v and hidden units h
- Network activity is a sample from the posterior probability given the inputs (visible data)


Neural Sampling Hypothesis

The activity of each neuron can be sampled from the network distribution:
- No distinction between input and output
- Natural way to deal with incomplete stimuli
- The stochastic nature of activation is coherent with neural response variability
- Spontaneous neural activity can be explained in terms of prior/marginal distributions


Probability and Statistics Refresher

On the blackboard


Stochastic Binary Neurons

- Spiking point neuron with binary output s_j
- Typically a discrete-time model, with time divided into small ∆t intervals
- At each time interval (t+1 ≡ t+∆t), the neuron can emit a spike with probability p_j^(t)

s_j^(t) = 1 with probability p_j^(t), and s_j^(t) = 0 with probability 1 − p_j^(t)

The key is in the definition of the spiking probability (it needs to be a function of the local potential):

p_j^(t) ≈ σ(x_j^(t))
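As a concrete illustration, a minimal Python/NumPy sketch of this sampling rule (the helper name sample_spike is an illustrative choice, not from the slides):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sample_spike(x_j, rng=np.random.default_rng()):
        # Emit a spike (1) with probability sigma(x_j), otherwise stay silent (0)
        p_j = sigmoid(x_j)
        return int(rng.random() < p_j)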


General Sigmoidal Stochastic Binary Network

Network of N neurons with binary activations s_j:
- Weight matrix M = [M_ij], i,j ∈ {1,...,N}
- Bias vector b = [b_j], j ∈ {1,...,N}

The local neuron potential x_j is defined as usual:

x_j^(t+1) = Σ_{i=1}^N M_ij s_i^(t) + b_j

A chosen neuron fires with spiking probability

p_j^(t+1) = P(s_j^(t+1) = 1 | s^(t)) = σ(x_j^(t+1)) = 1 / (1 + e^(−x_j^(t+1)))

Formulation highlights Markovian dynamics
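As a sketch, one synchronous update of all units could look as follows in Python/NumPy, assuming M is the N×N weight matrix, b the bias vector and s the current binary state vector (names are illustrative):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def parallel_update(s, M, b, rng=np.random.default_rng()):
        # Local potentials x_j = sum_i M_ij s_i + b_j for all units at once
        x = M.T @ s + b
        # Sample every unit independently with spiking probability sigma(x_j)
        return (rng.random(len(s)) < sigmoid(x)).astype(int)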


Neurobiological Foundations

- Variability in transmitter release in synaptic vesicles ≈ Gaussian distribution (Katz 1954)
- Assuming independent distributions and a large number of synaptic connections
  ⇒ Central limit theorem
  ⇒ Local (membrane) potential ≈ Gaussian distribution
- Conditional probability of neuron spiking is a Gaussian CDF ≈ scaled sigmoid


Network Dynamics (I)

How does the network state (the activation of all neurons) evolve in time?

Assume neurons are updated in parallel every ∆t (parallel dynamics):

P(s^(t+1) | s^(t)) = Π_{j=1}^N P(s_j^(t+1) | s^(t)) = T(s^(t+1) | s^(t))

yielding a Markov process for the state update:

P(s^(t+1) = s′) = Σ_s T(s′ | s) P(s^(t) = s)


Network Dynamics (II)

- Parallel dynamics assumes a synchronization clock exists (biologically implausible)
- Alternatively, one neuron at random can be chosen for update at each step (Glauber dynamics)
- There are no fixed-point guarantees for s, but the network has a stationary distribution at equilibrium when its connectivity is symmetric

Given F_j as the state-flip operator for the j-th neuron, i.e. s^(t+1) = F_j s^(t),

T(s^(t+1) | s^(t)) = (1/N) P(s_j^(t+1) | s^(t))

while if s^(t+1) = s^(t),

T(s^(t+1) | s^(t)) = 1 − (1/N) Σ_j P(s_j^(t+1) | s^(t))
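A minimal sketch of a single Glauber-dynamics step, under the same assumptions as the parallel-update sketch above (function names are illustrative):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def glauber_step(s, M, b, rng=np.random.default_rng()):
        # Pick one neuron uniformly at random and resample only its state
        j = rng.integers(len(s))
        x_j = M[:, j] @ s + b[j]
        s = s.copy()
        s[j] = int(rng.random() < sigmoid(x_j))
        return s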


The Boltzmann-Gibbs Distribution

Symmetric connectivity enforces the detailed balance condition

P(s) T(s′|s) = P(s′) T(s|s′)

which ensures reversible transitions, guaranteeing the existence of an equilibrium (Boltzmann-Gibbs) distribution

P∞(s) = e^(−E(s)) / Z

where
- E(s) is the energy function
- Z = Σ_s e^(−E(s)) is the partition function


Boltzmann Machines

A stochastic recurrent network where the binary unit states are also random variables:
- visible units v ∈ {0,1}
- latent (hidden) units h ∈ {0,1}
- joint state s = [v, h]

Boltzmann-Gibbs distribution with an energy function that is linear in the parameters

E(s) = −(1/2) Σ_ij M_ij s_i s_j − Σ_j b_j s_j = −(1/2) s^T M s − b^T s

with symmetric connectivity and no self-recurrent connections.
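For a tiny network, the energy, partition function and Boltzmann-Gibbs distribution can be computed exactly by brute-force enumeration; a sketch (feasible only for very small N, since it visits all 2^N states):

    import numpy as np
    from itertools import product

    def energy(s, M, b):
        # E(s) = -1/2 s^T M s - b^T s
        return -0.5 * s @ M @ s - b @ s

    def boltzmann_gibbs(M, b):
        # Enumerate all binary states, compute exp(-E), normalize by the partition function Z
        N = len(b)
        states = [np.array(c) for c in product([0, 1], repeat=N)]
        unnorm = np.array([np.exp(-energy(s, M, b)) for s in states])
        Z = unnorm.sum()
        return states, unnorm / Z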


Learning

Ackley, Hinton and Sejnowski (1985): Boltzmann machines can be trained so that the equilibrium distribution tends towards any arbitrary distribution over binary vectors, given samples from that distribution.

A couple of simplifications to start with:
- Bias b absorbed into the weight matrix M
- Consider only visible units, s = v

Use probabilistic learning techniques to fit the parameters, i.e. maximize the log-likelihood

L(M) = (1/L) Σ_{l=1}^L log P(v^l | M)

given the L visible training patterns v^l.


Gradient Approach

First, the gradient for a single pattern:

∂ log P(v|M)/∂M_ij = −⟨v_i v_j⟩ + v_i v_j

with free expectations ⟨v_i v_j⟩ = Σ_v P(v) v_i v_j

Then, the log-likelihood gradient:

∂L/∂M_ij = −⟨v_i v_j⟩ + ⟨v_i v_j⟩_c

with clamped expectations ⟨v_i v_j⟩_c = (1/L) Σ_{l=1}^L v_i^l v_j^l
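A sketch of how the two expectations translate into a weight update (V is an L×N matrix of training patterns; the "free" samples S would come from running the network to equilibrium, e.g. with Glauber dynamics; names are illustrative):

    import numpy as np

    def clamped_expectation(V):
        # <v_i v_j>_c : empirical pairwise correlations over the training patterns
        return V.T @ V / V.shape[0]

    def free_expectation(S):
        # <v_i v_j> : pairwise correlations estimated from model samples (one per row)
        return S.T @ S / S.shape[0]

    def weight_update(M, V, S, lr=0.1):
        # Gradient ascent on the log-likelihood: clamped minus free correlations
        return M + lr * (clamped_expectation(V) - free_expectation(np.asarray(S)))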


Something we have already seen..

It is Hebbian learning again!

⟨v_i v_j⟩_c (wake)  −  ⟨v_i v_j⟩ (dream)

- The wake part is the usual Hebb rule applied to the empirical distribution of data that the machine sees coming in from the outside world
- The dream part is an anti-Hebbian term concerning the correlation between units when generated by the internal dynamics of the machine

Can only capture quadratic correlation!


Learning with Hidden Units

To efficiently capture higher-order correlations we need to introduce hidden units h.

Again by log-likelihood gradient ascent (s = [v, h]):

∂ log P(v|M)/∂M_ij = Σ_h s_i s_j P(h|v) − Σ_s s_i s_j P(s) = ⟨s_i s_j⟩_c − ⟨s_i s_j⟩

The expectations generally become intractable due to the partition function Z (exponential complexity).


Boltzmann Mean-field Approximation

Approximate the expectations using a simpler distribution than P(s):

Q(s) = Π_{i=1}^N m_i^(s_i) (1 − m_i)^(1 − s_i)

The new parameters m_i describe the probability of unit i being on, and are chosen to minimize

m* = arg min_m KL[Q(s) || P(s)]

The solution is obtained from the N-dimensional fixed-point system

m_i = σ(Σ_j M_ij m_j)

The m_i's allow us to approximate the intractable expectations:

⟨s_i s_j⟩ ≈ m_i m_j
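A naive fixed-point iteration for the mean-field equations, as a sketch (convergence is not guaranteed in general; damping or better initialization may be needed):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def mean_field(M, n_iters=200, tol=1e-6):
        # Iterate m_i = sigma(sum_j M_ij m_j) until the update becomes negligible
        m = np.full(M.shape[0], 0.5)
        for _ in range(n_iters):
            m_new = sigmoid(M @ m)
            if np.max(np.abs(m_new - m)) < tol:
                break
            m = m_new
        return m  # <s_i s_j> is then approximated by m[i] * m[j]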


Restricted Boltzmann Machines (RBM)

A special Boltzmann machine:
- Bipartite graph
- Connections only between hidden and visible units

Energy function, highlighting the bipartition into hidden (h) and visible (v) units:

E(v,h) = −v^T M h − b^T v − c^T h

Learning (and inference) becomes tractable thanks to the graph bipartition, which factorizes the distribution.


The RBM Catch

Hidden units are conditionally independent given the visible units, and vice versa:

P(h_j = 1 | v) = σ(Σ_i M_ij v_i + c_j)

P(v_i = 1 | h) = σ(Σ_j M_ij h_j + b_i)

They can all be updated in parallel (in batch)!
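Because of this factorization, one sampling sweep over a whole layer is a single matrix operation; a sketch with M of shape (num_visible, num_hidden), visible bias b and hidden bias c (names are illustrative):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sample_h_given_v(v, M, c, rng=np.random.default_rng()):
        # P(h_j = 1 | v) = sigma(sum_i M_ij v_i + c_j), sampled for all j in parallel
        p_h = sigmoid(v @ M + c)
        return (rng.random(p_h.shape) < p_h).astype(int), p_h

    def sample_v_given_h(h, M, b, rng=np.random.default_rng()):
        # P(v_i = 1 | h) = sigma(sum_j M_ij h_j + b_i), sampled for all i in parallel
        p_v = sigmoid(h @ M.T + b)
        return (rng.random(p_v.shape) < p_v).astype(int), p_v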


Training Restricted Boltzmann Machines

Again by likelihood maximization, which yields

∂L/∂M_ij = ⟨v_i h_j⟩_c (data) − ⟨v_i h_j⟩ (model)

A Gibbs sampling approach:

Wake
- Clamp data on v
- Sample v_i h_j for all pairs of connected units
- Repeat for all elements of the dataset

Dream
- Don't clamp units
- Let the network reach equilibrium
- Sample v_i h_j for all pairs of connected units
- Repeat many times to get a good estimate


Gibbs-Sampling RBM

∂L/∂M_ij = ⟨v_i h_j⟩_c (data) − ⟨v_i h_j⟩ (model)

It is difficult to obtain an unbiased sample of the second term


Gibbs-Sampling RBM: Plugging in Data

1. Start with a training vector on the visible units
2. Alternate between updating all the hidden units in parallel and updating all the visible units in parallel (iterate)

∂L/∂M_ij = ⟨v_i h_j⟩^0 (data) − ⟨v_i h_j⟩^∞ (model)


Contrastive-Divergence Learning

Gibbs sampling can be painfully slow to converge

1. Clamp a training vector v^l on the visible units
2. Update all hidden units in parallel
3. Update all the visible units in parallel to get a reconstruction
4. Update the hidden units again

⟨v_i h_j⟩^0 (data) − ⟨v_i h_j⟩^1 (reconstruction)
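Putting the four steps together, a CD-1 parameter update might look like the following sketch (one training vector v0, learning rate lr; using hidden probabilities rather than samples in the statistics is a common choice, not prescribed by the slides):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_update(M, b, c, v0, lr=0.1, rng=np.random.default_rng()):
        # Positive ("data") phase: hidden activations with the training vector clamped on v
        p_h0 = sigmoid(v0 @ M + c)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        # One reconstruction: resample the visibles, then recompute hidden probabilities
        p_v1 = sigmoid(h0 @ M.T + b)
        v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
        p_h1 = sigmoid(v1 @ M + c)
        # <v_i h_j>^0 - <v_i h_j>^1 drives the update
        M = M + lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
        b = b + lr * (v0 - v1)
        c = c + lr * (p_h0 - p_h1)
        return M, b, c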


What does Contrastive Divergence Learn?

- A very crude approximation of the gradient of the log-likelihood
- It does not even follow that gradient closely; it more closely approximates the gradient of an objective function called the Contrastive Divergence
- It ignores one tricky term in this objective function, so it is not even following that gradient
- Sutskever and Tieleman (2010) have shown that it is not following the gradient of any function


So Why Use It?

Because He says so!

It works well enough in many significant applications


Character Recognition

Learning good features for reconstructing images of handwritten 2s

Slide credit goes to G. Hinton


Weight Learning


Final Weights


Digit Reconstruction


Digit Reconstruction (II)

What would happen if we supply the RBM with a test digit that is not a 2?

It will try anyway to see a 2 in whatever we supply!


One Last Final Reason for Introducing RBM

Deep Belief Network

The fundamental building block for one of the most popular deep learning architectures

A network of stacked RBMs trained layer-wise by Contrastive Divergence, plus a supervised read-out layer


Take Home Messages

- Stochastic networks as a paradigm that can explain neural variability and spontaneous activity in terms of distributions
- Boltzmann Machines
  - Hidden neurons required to explain high-order correlations
  - Training is a mix of Hebbian and anti-Hebbian terms
  - Multifaceted nature (recurrent network, undirected graphical model and energy-based network)
- Restricted Boltzmann Machines
  - Tractable model thanks to bipartite connectivity
  - Trained by a very short Gibbs sampling (contrastive divergence)
  - Can be very powerful if stacked (deep learning)


Next Lecture

Part I - Lecture (1h)
- Incremental learning in competitive networks
- Stability-plasticity dilemma
- Adaptive Resonance Theory (ART)

Part II - Lab (2h)
- Restricted Boltzmann Machines
- I suggest you have a good look at reference [1] on the course wiki (pages 3-6)