2806 Neural Computation
Stochastic Machines
Lecture 10
2005 Ari Visa
Agenda
Some historical notes
Some theory
Metropolis Algorithm
Simulated Annealing
Gibbs Sampling
Boltzmann Machine
Conclusions
Some Historical Notes
Statistical mechanics encompasses the formal study of macroscopic equilibrium properties of large systems of elements.
The "canonical distribution" (= Gibbs distribution = Boltzmann distribution) (Willard Gibbs, 1902).
The use of statistical mechanics as a basis for the study of neural networks (Cragg and Temperley, 1954) (Cowan, 1968).
Some Historical Notes
The idea of introducing temperature and simulated annealing into combinatorial optimization problems (Kirkpatrick, Gelatt, and Vecchi, 1983).
The Boltzmann machine is among the first multilayer learning machines inspired by statistical mechanics (Hinton and Sejnowski, 1983, 1986) (Ackley et al., 1985).
Some Theory
Consider a physical system with many degrees of freedom that can reside in any one of a large number of possible states.
Let p_i denote the probability of occurrence of state i, with p_i >= 0 for all i and Σ_i p_i = 1.
Let E_i denote the energy of the system when it is in state i.
When the system is in thermal equilibrium with its surrounding environment, state i occurs with probability p_i = (1/Z) exp(-E_i / k_B T), where T is the absolute temperature in kelvins and k_B is Boltzmann's constant (= canonical or Gibbs distribution).
The normalizing quantity Z is called the sum over states or the partition function: Z = Σ_i exp(-E_i / k_B T).
Note two points:
1) States of low energy have a higher probability of occurrence than states of high energy.
2) As the temperature T is reduced, the probability is concentrated on a smaller subset of low-energy states.
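A minimal numeric sketch of these two points (my illustration, not the lecture's; it assumes k_B = 1 and a toy three-level system):

```python
import numpy as np

def gibbs_distribution(energies, T, kB=1.0):
    """Return the Gibbs probabilities p_i = exp(-E_i/(kB*T)) / Z."""
    weights = np.exp(-np.asarray(energies) / (kB * T))
    Z = weights.sum()                 # partition function (sum over states)
    return weights / Z

E = [0.0, 1.0, 2.0]                   # toy energy levels
print(gibbs_distribution(E, T=5.0))   # high T: nearly uniform
print(gibbs_distribution(E, T=0.2))   # low T: mass concentrates on E = 0
```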
Some Theory
The Helmholtz free energy of a physical system, F, is defined in terms of the partition function Z: F = -T log Z.
The average energy is <E> = Σ_i p_i E_i, and <E> - F = -T Σ_i p_i log p_i = T·H, where H is the entropy; equivalently F = <E> - T·H, with H - H' >= 0 for any other distribution with entropy H' and the same average energy.
The principle of minimal free energy: the minimum of the free energy of a stochastic system with respect to the variables of the system is achieved at thermal equilibrium, at which point the system is governed by the Gibbs distribution.
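A quick numeric check of the identity F = -T log Z = <E> - T·H (my illustration, reusing the toy energy levels above with k_B = 1):

```python
import numpy as np

E = np.array([0.0, 1.0, 2.0])        # toy energy levels
T = 1.5
p = np.exp(-E / T); p /= p.sum()     # Gibbs distribution at temperature T

F_direct  = -T * np.log(np.exp(-E / T).sum())   # F = -T log Z
avg_E     = (p * E).sum()                       # <E> = sum_i p_i E_i
entropy_H = -(p * np.log(p)).sum()              # H = -sum_i p_i log p_i
print(F_direct, avg_E - T * entropy_H)          # the two values agree
```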
Some Theory
Consider a system whose evolution is described by a stochastic process {X_n, n = 1, 2, ...}, consisting of a family of random variables.
The value x_n assumed by the random variable X_n at discrete time n is called the state of the system.
The space of all possible values that the random variables can assume is called the state space of the system.
Some Theory
If the structure of the stochastic process {X_n, n = 1, 2, ...} is such that the conditional probability distribution of X_{n+1} depends only on the value of X_n and is independent of all previous values, the process is a Markov chain.
Markov property: P(X_{n+1} = x_{n+1} | X_n = x_n, ..., X_1 = x_1) = P(X_{n+1} = x_{n+1} | X_n = x_n).
A sequence of random variables X_1, X_2, ..., X_n, X_{n+1} forms a Markov chain if the probability that the system is in state x_{n+1} at time n+1 depends exclusively on the state x_n of the system at time n.
Some Theory
In a Markov chain the transition from one state to another is probabilistic: p_ij = P(X_{n+1} = j | X_n = i) denotes the transition probability from state i at time n to state j at time n+1 (p_ij >= 0 for all (i, j), and Σ_j p_ij = 1 for all i).
A Markov chain is homogeneous in time if the transition probabilities are fixed and do not change with time.
Let p_ij^(m) denote the m-step transition probability from state i to state j: p_ij^(m) = P(X_{n+m} = x_j | X_n = x_i), m = 1, 2, ...
We may view it as a sum over all intermediate states k through which the system passes in its transition from state i to state j: p_ij^(m+1) = Σ_k p_ik^(m) p_kj, m = 1, 2, ..., with p_ij^(1) = p_ij.
More generally, p_ij^(m+n) = Σ_k p_ik^(m) p_kj^(n), m, n = 1, 2, ... (the Chapman-Kolmogorov identity).
When a state of the chain can only reoccur at time intervals that are multiples of d (where d is the largest such number), we say the state has period d.
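In matrix form the m-step transition probabilities are the entries of P^m, so the Chapman-Kolmogorov identity reads P^(m+n) = P^m P^n. A small check with an arbitrary toy matrix (my example, not from the slides):

```python
import numpy as np

# Toy 3-state transition matrix; each row sums to 1.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

m, n = 2, 3
lhs = np.linalg.matrix_power(P, m + n)                        # p_ij^(m+n)
rhs = np.linalg.matrix_power(P, m) @ np.linalg.matrix_power(P, n)
print(np.allclose(lhs, rhs))   # True: Chapman-Kolmogorov holds
```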
Some Theory
The state i is said to be a recurrent state if the Markov chain returns to state i with probability 1: f_i = P(ever returning to state i) = 1.
If the probability f_i is less than 1, the state i is said to be a transient state.
The state j of a Markov chain is said to be accessible from state i if there is a finite sequence of transitions from i to j with positive probability.
If the states i and j are accessible to each other, the states i and j of the Markov chain are said to communicate with each other.
If two states of a Markov chain communicate with each other, they belong to the same class.
If all the states consist of a single class, the Markov chain is said to be indecomposable or irreducible.
Some Theory
The mean recurrence time of state i is defined as the expectation of T_i(k) over the returns k, where T_i(k) denotes the time that elapses between the (k-1)th and kth returns to state i.
The steady-state probability of state i, denoted by π_i, is equal to the reciprocal of the mean recurrence time: π_i = 1 / E[T_i(k)].
If E[T_i(k)] < ∞, that is π_i > 0, the state i is said to be a positive recurrent state.
If E[T_i(k)] = ∞, that is π_i = 0, the state i is said to be a null recurrent state.
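A simulation sketch (my own illustration) that estimates the mean recurrence time E[T_i(k)] of one state by running the toy chain from above, and compares it with 1/π_i computed from the stationary distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# Stationary distribution: left eigenvector of P with eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))]); pi /= pi.sum()

# Run the chain and record the times between successive returns to state 0.
state, last_visit, gaps = 0, 0, []
for t in range(1, 200_000):
    state = rng.choice(3, p=P[state])
    if state == 0:
        gaps.append(t - last_visit); last_visit = t

print(np.mean(gaps), 1 / pi[0])   # estimated E[T_0] vs. 1/pi_0: close
```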
Some Theory
Ergodicity: we may substitute time averages for ensemble averages. In the context of a Markov chain, the long-term proportion of time spent by the chain in state i corresponds to the steady-state probability π_i.
The proportion of time spent in state i after k returns is v_i(k) = k / Σ_{l=1}^{k} T_i(l).
Consider an ergodic Markov chain characterized by a stochastic matrix P. Let the row vector π^(n-1) denote the state distribution vector of the chain at time n-1; the jth element of π^(n-1) is the probability that the chain is in state x_j at time n-1. Then π^(n) = π^(n-1) P.
The state distribution vector of the Markov chain at time n is the product of the initial state distribution vector π^(0) and the nth power of the stochastic matrix P: π^(n) = π^(0) P^n.
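Iterating π^(n) = π^(n-1) P drives any initial distribution of an ergodic chain toward its steady state; a short sketch (my example, same toy matrix as above):

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

pi = np.array([1.0, 0.0, 0.0])    # pi^(0): start in state 0 with certainty
for _ in range(100):              # pi^(n) = pi^(n-1) P
    pi = pi @ P
print(pi)        # steady-state distribution
print(pi @ P)    # unchanged: pi = pi P at equilibrium
```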
Some Theory
The ergodicity theorem: see Eqs. (11.2x)-(11.27) in the course text.
Some Theory
The principle of detailed balance states that at thermal equilibrium, the rate of occurrence of any transition equals the corresponding rate of occurrence of the inverse transition: π_i p_ij = π_j p_ji.
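A minimal numeric check of detailed balance (my illustration; a birth-death chain, which moves only between neighboring states, is reversible and therefore satisfies the principle):

```python
import numpy as np

# Birth-death chain: transitions only between neighboring states.
P = np.array([[0.7, 0.3, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])

w, v = np.linalg.eig(P.T)                    # stationary distribution pi
pi = np.real(v[:, np.argmax(np.real(w))]); pi /= pi.sum()

flows = pi[:, None] * P                      # flows[i, j] = pi_i * p_ij
print(np.allclose(flows, flows.T))           # True: pi_i p_ij == pi_j p_ji
```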
Metropolis Algorithm
The Metropolis algorithm (Metropolis et al., 1953) is a modified Monte Carlo method for stochastic simulation of a collection of atoms in equilibrium at a given temperature.
The random variable X_n representing an arbitrary Markov chain is in state x_i at time n. We randomly generate a new state x_j representing a realization of another random variable Y_n. The generation of this new state satisfies the symmetry condition: P(Y_n = x_j | X_n = x_i) = P(Y_n = x_i | X_n = x_j).
Let ΔE = E_j - E_i denote the energy difference resulting from the transition of the system from state X_n = x_i to state Y_n = x_j.
1) ΔE < 0: the transition is accepted. We find that π_j / π_i = exp(-ΔE/T) > 1, and detailed balance π_i p_ij = π_j p_ji is satisfied.
2) ΔE > 0: the transition is accepted only with probability exp(-ΔE/T). We find that π_j / π_i = exp(-ΔE/T) < 1, and detailed balance is again satisfied.
The a priori transition probabilities τ_ij are in fact the probabilistic model of the random step in the Metropolis algorithm.
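A compact sketch of the Metropolis rule (my illustration; the double-well energy function, the Gaussian proposal standing in for τ_ij, and k_B = 1 are all assumptions, not the lecture's choices):

```python
import numpy as np

rng = np.random.default_rng(42)

def energy(x):
    """Toy double-well energy; any energy function could be substituted."""
    return (x**2 - 1.0)**2

def metropolis(x0, T, n_steps, step=0.5):
    x, samples = x0, []
    for _ in range(n_steps):
        x_new = x + rng.normal(0.0, step)   # symmetric proposal (tau_ij)
        dE = energy(x_new) - energy(x)
        # Accept if dE < 0; otherwise accept with probability exp(-dE/T).
        if dE < 0 or rng.random() < np.exp(-dE / T):
            x = x_new
        samples.append(x)
    return np.array(samples)

samples = metropolis(x0=0.0, T=0.3, n_steps=20_000)
print(samples.mean(), samples.var())   # the chain visits both wells at this T
```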
Simulated Annealing
Simulated annealing combines two ingredients:
1) A schedule that determines the rate at which the temperature is lowered.
2) An algorithm that iteratively finds the equilibrium distribution at each new temperature in the schedule by using the final state of the system at the previous temperature as the starting point for the new temperature (Kirkpatrick et al., 1983).
The Metropolis algorithm is the basis for the simulated annealing process. The temperature T plays the role of a control parameter. The simulated annealing process will converge to a configuration of minimal energy provided that the temperature is decreased no faster than logarithmically; that is too slow to be of practical use → finite-time approximation (no longer guaranteed to find a global minimum with probability one).
Simulated Annealing
To implement a finite-time approximation of the simulated annealing algorithm, we must specify a set of parameters governing the convergence of the algorithm. These parameters are combined in a so-called annealing schedule or cooling schedule.
The annealing schedule specifies a finite sequence of values of the temperature and a finite number of transitions attempted at each value of the temperature; a sketch of such a schedule follows below.
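The sketch below (my illustration, not the lecture's code) reuses the metropolis() and energy() functions from the Metropolis section and applies a geometric cooling schedule T_k = alpha * T_{k-1}; the values of T0, alpha, and the number of moves attempted per temperature are assumptions:

```python
def simulated_annealing(x0, T0=2.0, alpha=0.9, n_temps=40, moves_per_T=500):
    """Finite-time annealing: run Metropolis at each temperature of a
    geometric schedule, reusing the final state as the next start."""
    x, T = x0, T0
    for _ in range(n_temps):
        x = metropolis(x0=x, T=T, n_steps=moves_per_T)[-1]
        T *= alpha                     # cool: T_k = alpha * T_{k-1}
    return x

print(simulated_annealing(x0=3.0))     # ends near a minimum of energy()
```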
Gibbs Sampling
The Gibbs sampler generates a Markov chain with the Gibbs distribution as its equilibrium distribution.
The transition probabilities associated with the Gibbs sampler are nonstationary.
1) Each component of the random vector X is visited in the natural order, with the result that a total of K new variates are generated on each iteration.
2) The new value of component X_{k-1} is used immediately when a new value of X_k is drawn, for k = 2, 3, ..., K.
The Gibbs sampler is an iterative adaptive scheme.
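A minimal sketch of the component-wise visiting scheme (my example, not the lecture's: a two-component zero-mean Gaussian with correlation rho, whose conditionals are Gaussian in closed form; note how the fresh value of x1 is used immediately when x2 is drawn):

```python
import numpy as np

rng = np.random.default_rng(7)
rho = 0.8                        # target correlation between X1 and X2

def gibbs_sampler(n_iters):
    x1, x2, samples = 0.0, 0.0, []
    for _ in range(n_iters):
        # Visit the components in natural order.
        x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))   # X1 | X2 = x2
        x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))   # X2 | the new x1
        samples.append((x1, x2))
    return np.array(samples)

s = gibbs_sampler(50_000)
print(np.corrcoef(s[:, 0], s[:, 1])[0, 1])   # approximately rho
```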
Gibbs Sampler
See Eqs. (11.45), (11.46), and (11.47) in the course text.
"olt#mann Machine
(he primary goal o% "olt#mannlearning is to produce a neuralnetor that correctly modelsinput patterns according to a"olt#mann distri!ution'
(he "olt#mann machineconsists o% stochastic neurons'A stochastic neuron resides inone o% to possi!le states *T 1,in a pro!a!ilistic manner'
(he use o% symmetric synaptic
connections !eteen neurons' (he stochastic neurons
partition into to %unctionalgroups: 9isi!le and hidden'
"olt#mann Machine
Uuring the training phase o% the netor . the9isi!le neurons are all clamped onto speci%ic statesdetermined !y the en9ironment'
(he hidden neurons alays operate %reely they areused to e=plain underlying constraints contained inthe en9ironmental input 9ectors'
(his is accomplished !y capturing higher>orderstatistical correlations in the clamping 9ectors'
(he netor can per%orm pattern completition pro9ided that it has learned the training distri!ution properly'
"olt#mann Machine
Let x denote the state o% the"olt#mann machine. ith itscomponent =i denoting the state o%neuron i' (he state x represents areali#ation o% the random 9ector X'(he synaptic connection %rom
neuron i to neuron is denoted !y 6i. ith 6i + i6 %or all *i., and ii + 0 %or all i'
*x, + > W iW 6 6i =i= 6 . iX I*X +x, + 1;< e=p*>*x,;(, + ϕ*=;( Wi 6i =i, here ϕ*', is a
sigmoid %unction o% its arguments' i!!s sampling and simulated
annealing are used'
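A sketch of one stochastic-neuron update under these equations (my illustration; φ is taken to be the logistic sigmoid and the states are ±1):

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(v):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-v))

def update_neuron(x, W, j, T):
    """Set x_j = +1 with probability phi((1/T) * sum_{i != j} w_ji x_i),
    and x_j = -1 otherwise. W is symmetric with zero diagonal."""
    local_field = W[j] @ x - W[j, j] * x[j]   # sum over i != j
    x[j] = 1.0 if rng.random() < phi(local_field / T) else -1.0
    return x

# Toy machine with 4 neurons and random symmetric weights.
W = rng.normal(0, 1, (4, 4)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
x = rng.choice([-1.0, 1.0], size=4)
for j in range(4):                            # one full sweep
    x = update_neuron(x, W, j, T=1.0)
print(x)
```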
"olt#mann Machine
(he goal o% "olt#mann learning is to ma=imi#e the lielihood or log>lielihood %unction inaccordance ith the ma=imum>lielihood principle'
&ositve phase' Hn this phase the netor operates in its clamped condition' 'egative phase% Hn this second phase. the netor is alloed to run %reely. and there%ore
ith no en9ironmental input' (he log>lielihood %unction L*w, + logY =S ∈(I*XS+ xS,
L*w, + WxS ∈( *log Wxβ e=p*>*x,;(, > log Wx e=p*>*x,;(, ,
Ui%%erentiating L*w, ith respect to 6i and introducing ρD 6i and ρ> 6i ' 6i + ε∂L*, ; ∂ 6i +η *ρD 6i > ρ> 6i , here η is a learning>rate parameter η + ε;(' @rom a learning point o% 9ie. the to terms that constitute the "olt#mann learning rule
ha9e opposite meaning: ρD 6i corresponding to the clamped condition o% the netor is a
$e!!ian learning rule ρ>
6i corresponding to the %ree>running condition o% the netor isunlearning *%orgetting, term' -e ha9e also a primiti9e %orm o% an attention mechanism' (he to phase approach and .speci%ically. the negati9e phase means also increased
computational time and sensiti9ity to statistical errors'
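A minimal sketch of the two-phase update (my illustration; the correlations ρ± are estimated as empirical averages of x_j x_i over sampled states, and the routines that produce the clamped and free-running samples are assumed rather than shown):

```python
import numpy as np

def correlation_matrix(states):
    """Estimate rho_ji as the average of x_j * x_i over sampled states.
    `states` is an (n_samples, n_neurons) array of +/-1 values."""
    S = np.asarray(states, dtype=float)
    return S.T @ S / len(S)

def boltzmann_update(W, clamped_states, free_states, eta=0.01):
    """Delta w_ji = eta * (rho+_ji - rho-_ji); the diagonal stays zero."""
    rho_plus  = correlation_matrix(clamped_states)   # positive phase
    rho_minus = correlation_matrix(free_states)      # negative phase
    W = W + eta * (rho_plus - rho_minus)
    np.fill_diagonal(W, 0.0)
    return W
```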
Sigmoid "elie% Netors
igmoid belief netorks or logistic belief nets *Neal 1//2,ere de9eloped to %ind a stochastic machine that ouldshare ith the "olt#mann machine the capacity to learnar!itarily pro!a!ility distri!utions o9er !inary 9ectors. !ut
ould not need the negati9e phase o% the "olt#mannmachine learning procedure' (his o!ecti9e as achie9ed !y replacing the symmetric
connections o% the "olt#mann machine ith directedconnections that form an acyclic graph'
A sigmoid !elie% netor consists o% a multilayerarchitecture ith !inary stochastic neurons' (he acyclicnature o% the machine maes it easy to per%orm pro!a!ilistic calculations'
Sigmoid "elie% Netors
Let the 9ector X. consisting o% to>9alued random 9aria!lesF1.F2.'''.F N. de%ine a sigmoid
!elie% netor composed o% Nstochastic neurons'
(he parents o% element F in X aredenoted !y pa*F , ⊆ EF1.F2.'''.F >1G
pa*F , is the smallest su!set o% random9ector X %or hich e ha9e I*F += JF1 + =1.'''.F >1+ = >1 , + I*F + = J
pa* F , + ϕ*= ;( Wi i =i,
Note that 1' i +0 %or all Fi not !elonging to pa*Fi, and
2' i +0 %or all i 7 '
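Because the connections form an acyclic graph, a complete state can be drawn in one ordered pass (ancestral sampling). A sketch under the conventions above (my illustration; φ is the logistic sigmoid, and W is strictly lower triangular so that w_ji = 0 for i ≥ j):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_sbn(W, T=1.0):
    """Draw one +/-1 state vector from a sigmoid belief network.
    W must be strictly lower triangular: neuron j depends only on i < j."""
    N = W.shape[0]
    x = np.zeros(N)
    for j in range(N):                         # natural (topological) order
        v = W[j, :j] @ x[:j]                   # weighted sum over parents
        p_plus = 1.0 / (1.0 + np.exp(-v / T))  # P(X_j = +1 | parents)
        x[j] = 1.0 if rng.random() < p_plus else -1.0
    return x

N = 5
W = np.tril(rng.normal(0, 1, (N, N)), k=-1)    # strictly lower triangular
print(sample_sbn(W))
```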
Sigmoid "elie% Netors
Learning:
Ht is assumed that each sample is to>9alued. representing certain attri!utes' Zepetition o%training e=amples is permitted. in proportion to ho commonly a particularcom!ination o% attri!utes is non to occur'
1' Some si#e %or a state 9ector. x. is decided %or the netor'
2' A su!set o% the state 9ector. say xS. is selected to represent the attri!utes in the
training cases that is xS represent the state 9ector o% the 9isi!le neurons'4' (he remaining part o% the state 9ector x. denoted !y xβ.de%ines the state 9ector o% the
hidden neurons'
Ui%%erent arrangements o% 9isi!le and hidden neurons may result in di%%erent con%iguration[ (he log>lielihood %unction L*w, + Y =S ∈( log I*XS+ xS, Ui%%erentiating L*w, ith respect to 6i 6i + ε∂L*, ; ∂ 6i +η ρ 6i here η is a learning>rate parameter η + ε;( and ρ 6i is Wxβ Wx I*X+ x JXS+ xS,
ϕ*>= 6;( WiA6 6i =i,= 6=i hich is an a9erage correlation !eteen the states o% neurons iand . eighted !y the %actorϕ*>= 6;( WiA6 6i =i, '
Sigmoid "elie% Netors
ta!le 11'2
Helmholtz Machine
The Helmholtz machine (Dayan et al., 1995; Hinton et al., 1995) uses two entirely different sets of synaptic connections.
The forward connections constitute the recognition model. The purpose of this model is to infer a probability distribution over the underlying causes of the input vector.
The backward connections constitute the generative model. The purpose of this second model is to reconstruct an approximation to the original input vector from the underlying representations captured by the hidden layers of the network, thereby enabling it to operate in a self-supervised manner.
Both the recognition and generative models operate in a strictly feedforward fashion, with no feedback; they interact with each other only via the learning procedure.
Mean-Field Theory
Mean-field theory serves as the mathematical basis for deriving deterministic approximations to the stochastic machines in order to speed up learning:
1. Correlations are replaced by their mean-field approximations.
2. An intractable model is replaced by a tractable model via a variational principle.
Summary
Some ideas rooted in statistical mechanics have been presented.
The Boltzmann machine uses hidden and visible neurons that are in the form of stochastic, binary-state units. It exploits the properties of the Gibbs distribution, thereby offering some appealing features:
Through training, the probability distribution exhibited by the neurons is matched to that of the environment.
The network offers a generalized approach that is applicable to the basic issues of search, representation, and learning.
The network is guaranteed to find the global minimum of the energy surface with respect to the states, provided that the annealing schedule in the learning process is performed slowly enough.