Restricted Boltzmann Machine (RBM) presentation of fundamental theory
Upload: seongwon-hwang
Category: Data & Analytics

Transcript of Restricted Boltzmann Machine (RBM) presentation of fundamental theory
M&S
Restricted Boltzmann Machine - Theory -
Seongwon Hwang
Energy Based Model
1. Scalar Function

Example: projectile motion with initial speed V_0 at angle θ.
V_x = V_0 cos θ
V_y = V_0 sin θ − gt
V = (V_0 cos θ) i + (V_0 sin θ − gt) j

E = mgh + (1/2)mv^2
*Total Energy = Potential + Kinetic Energy
2. Principle of Minimum Energy

Principle of Maximum Entropy: equilibrium at fixed internal energy.
Principle of Minimum Energy: equilibrium at fixed entropy.

[Figure: curve of E versus S, marking stable (equilibrium) and unstable points]
In Neural Networks

Supervised Model
E(W_ij, x_i, y_j)
x_i: input variables, y_j: output variables
Energy = − Correlation

Unsupervised Model
E(W_ij, x_i)
x_i: input variables
Energy with input variables = − Correlation
In Neural Networks

Unsupervised Model with Hidden Units
E(W_ij, v_i, h_j)
v_i: visible variables, h_j: hidden variables
Energy = − Correlation

Energy ↔ Correlation
In Neural Networks

Learning in Unsupervised Model
Before learning, the minimum of the energy E(W, x) over x need not lie at the data point x_data; learning updates the weights W → W′ so that min_x E(W′, x) is reached at x = x_data.
How do we define energy in a neural network?

Hopfield Neural Network
Two Constraints
1. Symmetric weights between neurons: W_ij = W_ji
2. Asynchronous updating is required to reach a stable state

[Figure: three-node network (x_1, x_2, x_3); in each step one randomly chosen node is activated and updated]
Energy Defined by Hopfield

E = − Σ_{i<j} x_i x_j w_ij,   x_i ∈ {0, 1}

[Figure: five-node network (x_1, ..., x_5) with weighted edges]
Example for Intuition

x_i ∈ {0, 1}

Configuration 1: x_1 = 1, x_2 = 1, x_3 = 1, x_4 = 0, x_5 = 0  →  E = −6
Configuration 2: x_1 = 0, x_2 = 1, x_3 = 1, x_4 = 0, x_5 = 1  →  E = −7
...

[Figure: the five-node network with its edge weights]
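The slide's energy bookkeeping can be sketched in a few lines of NumPy. This is a minimal illustration: the weight matrix below is a hypothetical stand-in, since the figure's actual edge weights are not recoverable from the transcript.

```python
import numpy as np

# Hypothetical symmetric weights for a 5-node network (the figure's actual
# edge weights are not recoverable from the transcript, so these are assumed).
W = np.array([
    [0, 2, 0, 0, 3],
    [2, 0, 1, 0, 0],
    [0, 1, 0, 4, 0],
    [0, 0, 4, 0, 2],
    [3, 0, 0, 2, 0],
], dtype=float)

def hopfield_energy(x, W):
    """E = -sum_{i<j} x_i x_j w_ij for binary states x_i in {0, 1}."""
    return -0.5 * x @ W @ x  # the quadratic form counts each pair twice

x1 = np.array([1., 1., 1., 0., 0.])  # configuration 1
x2 = np.array([0., 1., 1., 0., 1.])  # configuration 2
print(hopfield_energy(x1, W), hopfield_energy(x2, W))  # -3.0 -1.0
```

With the slide's real weights the two configurations would give the E values shown above; the point is only that different binary configurations land at different energies.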
Application - Data Storage

A data pattern, e.g. (x_1, ..., x_5) = (1, 1, 1, 0, 0) with x_i ∈ {0, 1}, is stored as a low-energy stable state of the network.

[Figure: the five-node network with the pattern (1, 1, 1, 0, 0) written on its nodes]
Learning in a Hopfield Network

E = − Σ_{i<j} x_i x_j w_ij,   x_i ∈ {0, 1}

Given several data patterns, perform a weight update w_ij ← w_ij + Δw_ij for each weight.

[Figure: five-node network with weights w_12, w_13, w_15, w_23, w_34, w_35, w_45]
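The weight update above can be sketched with the classic Hebbian storage rule. This is a minimal illustration, assuming the common ±1 state convention (the slide's {0, 1} states are mapped to ±1 here), not necessarily the presentation's exact procedure.

```python
import numpy as np

def hebbian_store(patterns):
    """Build Hopfield weights via the Hebbian rule w_ij += s_i s_j,
    summed over the dataset, using +/-1 states (an assumed convention)."""
    S = 2.0 * np.asarray(patterns, dtype=float) - 1.0  # {0,1} -> {-1,+1}
    W = S.T @ S                                        # sum of outer products
    np.fill_diagonal(W, 0.0)                           # no self-connections
    return W

def energy(x01, W):
    s = 2.0 * np.asarray(x01, dtype=float) - 1.0
    return -0.5 * s @ W @ s

W = hebbian_store([[1, 1, 1, 0, 0]])
print(energy([1, 1, 1, 0, 0], W))  # stored pattern: -10.0 (a stable minimum)
print(energy([1, 0, 1, 0, 1], W))  # unstored pattern: 2.0 (higher energy)
```

Storing a pattern carves an energy minimum at that pattern, which is exactly the "data store" behavior shown two slides earlier.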
Boltzmann Machine
Overview

Hopfield Neural Network: Energy ↔ Correlation
Boltzmann Machine: Probability ↔ Correlation
Overview

Boltzmann Machine: Probability ↔ Correlation

Boltzmann distribution over configurations v_i with energy E(v_i):
P(v_i) = e^(−E(v_i)) / Σ_j e^(−E(v_j))
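The Boltzmann distribution above is just a softmax of negative energies; a minimal sketch (the example energies are hypothetical):

```python
import numpy as np

def boltzmann(energies):
    """P(v_i) = exp(-E(v_i)) / sum_j exp(-E(v_j)).
    Shifting by the minimum energy keeps the exponentials numerically stable."""
    e = np.asarray(energies, dtype=float)
    w = np.exp(-(e - e.min()))
    return w / w.sum()

p = boltzmann([-7.0, -6.0, 0.0])  # hypothetical configuration energies
print(p)  # the lowest-energy configuration gets the highest probability
```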
Thermal Physics behind the Boltzmann Distribution
Macrostate vs. Microstate

Example: tossing three coins. The microstates are the individual outcomes HHH, HHT, HTH, HTT, THH, THT, TTH, TTT: 8 microstates in total. A microstate specifies every detail of the system (in physics: each particle's position, velocity, ...).

The macrostates are the numbers of heads, 0 through 3: 4 macrostates in total. For instance, the macrostate "1 head" contains 3 microstates (HTT, THT, TTH), while "3 heads" contains only 1 (HHH). A macrostate specifies only bulk quantities (in physics: temperature, pressure, ...).
Canonical Ensemble (NVT Ensemble)

An ensemble of microstates with N, V, T fixed:
(N, V, T, E_0), (N, V, T, E_1), ...
Boltzmann Distribution

P(E_i) = e^(−E_i / k_B T) / Σ_j e^(−E_j / k_B T)

With N particles in total and N_i particles in the i-th microstate, the number of cases is
W = N! / (N_0! N_1! N_2! ...),   S = k ln W

Maximum entropy! The Boltzmann distribution is the occupancy that maximizes the number of cases W.
Example occupancies (N_0, N_1, N_2, ...): (N, 0, 0, 0, ...), (N−2, 2, 0, 0, ...), (N−3, 2, 1, 0, ...), ...
Boltzmann Distribution

P(E_i) = e^(−E_i / k_B T) / Σ_j e^(−E_j / k_B T)

[Figure: plot of P(E_i) falling off with E_i, and grids of occupation numbers over energy levels 0 to 7, contrasting a uniform occupancy with a Boltzmann-like one]
Intuition for the Connection between Physics and Networks

As the energy changes: changes of molecular structure (Physics) ↔ changes of the configuration of the network (Network).

[Figure: grid of occupation numbers over energy levels 0 to 7]
Helmholtz Free Energy

F = −k_B T ln Z = −k_B T ln Σ_j e^(−βE_j)

= the free energy associated with the canonical ensemble
Overview

Probability:  P(v_i) = e^(−E(v_i)) / Σ_j e^(−E(v_j))

Configurations of N-dimensional binary data, e.g. v_1 = (0, 1, 0, 1, ...), v_2 = (1, 1, 0, 1, ...), ...
There are 2^N possible configurations.

[Figure: visible units v_1 ... v_7]
Overview

Probability:  P(v, h) = e^(−E(v, h)) / Σ_{k,l} e^(−E(v_k, h_l))

[Figure: visible units v_1 ... v_4 connected with hidden units h_1 ... h_3]
Restricted Boltzmann Machine
Restriction – NO connections within H and within V, respectively

Boltzmann Machine → Restricted Boltzmann Machine

[Figure: left, a Boltzmann machine with connections inside each layer; right, a restricted Boltzmann machine, a bipartite graph between v_1 ... v_4 and h_1 ... h_3]
Restriction – NO connections within H and within V, respectively

Restricted Boltzmann Machine → Conditionally Independent!

General form: P(A, B | C) = P(A | C) P(B | C)

P(h_1, h_2 | v) = P(h_1 | v) P(h_2 | v)
P(h | v) = Π_j P(h_j | v)
P(v | h) = Π_i P(v_i | h)

[Figure: bipartite graph between v_1 ... v_4 and h_1 ... h_3]
Energy from the Hopfield Network

E(v, h) = − Σ_i Σ_j w_ij v_i h_j − Σ_i b_i v_i − Σ_j c_j h_j

where b is the bias of v and c is the bias of h.

P(v, h) = e^(−E(v, h)) / Σ_{v,h} e^(−E(v, h))
P(v) = Σ_h e^(−E(v, h)) / Σ_{v,h} e^(−E(v, h))

[Figure: bipartite graph between v_1 ... v_4 and h_1 ... h_3]
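The energy function can be evaluated directly; a minimal sketch with small hypothetical parameters (the sizes match the 4-visible, 3-hidden figure):

```python
import numpy as np

def rbm_energy(v, h, W, b, c):
    """E(v, h) = -sum_ij w_ij v_i h_j - sum_i b_i v_i - sum_j c_j h_j."""
    return -(v @ W @ h) - b @ v - c @ h

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 3))  # hypothetical weights
b = np.zeros(4)                         # visible (v) bias
c = np.zeros(3)                         # hidden (h) bias

v = np.array([1., 0., 1., 1.])
h = np.array([1., 0., 1.])
print(rbm_energy(v, h, W, b, c))
```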
Two Important Conditional Probabilities! – First

P(h_j = 1 | v) = σ(Σ_i w_ij v_i + c_j),   where σ(x) = 1 / (1 + e^(−x))

[Figure: bipartite graph between v_1 ... v_4 and h_1 ... h_3]
Two Important Conditional Probabilities! – Second

P(v_i = 1 | h) = σ(Σ_j w_ij h_j + b_i)

[Figure: bipartite graph between v_1 ... v_4 and h_1 ... h_3]
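Thanks to conditional independence, both conditionals can be computed for all units at once. A minimal sketch; the zero parameters are placeholders, chosen so every unit comes out at probability 0.5:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, c):
    """P(h_j = 1 | v) = sigma(sum_i w_ij v_i + c_j), all hidden units at once."""
    return sigmoid(v @ W + c)

def p_v_given_h(h, W, b):
    """P(v_i = 1 | h) = sigma(sum_j w_ij h_j + b_i), all visible units at once."""
    return sigmoid(W @ h + b)

W = np.zeros((4, 3))            # placeholder parameters
b, c = np.zeros(4), np.zeros(3)
v = np.array([1., 0., 1., 1.])
print(p_h_given_v(v, W, c))     # [0.5 0.5 0.5]
```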
Generative vs. Discriminative Model

<Generative Model>  P(x | y) or P(x, y)
Ex) Gaussians, Sigmoid Belief Networks, Bayesian Networks; the RBM belongs here.

<Discriminative Model>  P(y | x)
Ex) Neural Network, Logistic Regression, Support Vector Machine
Maximum Likelihood Estimator

Maximize the likelihood of the observed samples in order to estimate the unobserved parameters of the population.

[Figure: population vs. sample]
Maximum Likelihood Estimator

Ex) What is the probability p that the coin lands heads, given the observations H, H, T?

L(θ) = P(x | θ) = p · p · (1 − p) = p^2 (1 − p)
dL/dp = 2p − 3p^2 = 0   →   p = 2/3
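The p = 2/3 result can be checked numerically by scanning the likelihood on a grid:

```python
import numpy as np

# Likelihood of observing H, H, T as a function of p = P(heads):
# L(p) = p^2 (1 - p); the maximum should sit at p = 2/3.
p = np.linspace(0.0, 1.0, 100001)
L = p**2 * (1.0 - p)
p_hat = p[L.argmax()]
print(p_hat)  # ~0.6667
```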
Learning in RBM

Cost = Negative Log-Likelihood (NLL)

NLL(θ | v) = −ln P(v | θ) = −ln Σ_h e^(−E(v, h)) + ln Σ_{v,h} e^(−E(v, h))

∂NLL(θ | v)/∂θ = ⟨∂E(v, h)/∂θ⟩_data − ⟨∂E(v, h)/∂θ⟩_model

Gradient descent for NLL; ⟨...⟩ denotes an expectation.
The data term is the positive phase and the model term is the negative phase; each is a free-energy term.
Learning in RBM

∂NLL(θ | v)/∂θ = ⟨∂E(v, h)/∂θ⟩_data − ⟨∂E(v, h)/∂θ⟩_model

The data term is easy to compute! With v clamped to the data, e.g. v = (1, 0, 1, 1), the hidden units h_j ∈ {0, 1} follow directly from the conditional probabilities.

[Figure: hidden units h_1 ... h_3 above the clamped data vector]
Learning in RBM

∂NLL(θ | v)/∂θ = ⟨∂E(v, h)/∂θ⟩_data − ⟨∂E(v, h)/∂θ⟩_model

The model term is hard to compute! With n visible units v_i ∈ {0, 1} and m hidden units h_j ∈ {0, 1}, the total number of possible configurations is 2^(n+m).
Markov Chain Monte Carlo (MCMC)

1. Markov Chain: in a first-order Markov chain the next state depends only on the immediately preceding one; in a second- or higher-order chain it depends on two or more preceding states.
Markov Chain Monte Carlo (MCMC)

2. Monte Carlo: compute a value statistically by using random numbers.

Ex) Estimating the circular constant π. Sample points uniformly in the unit square and evaluate whether x^2 + y^2 ≤ 1.

π/4 ≈ (number of samples in the circle) / (total number of samples)
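The π example can be run directly; a minimal sketch sampling the unit square:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000
x, y = rng.random(n), rng.random(n)            # uniform samples in the unit square
inside = np.count_nonzero(x**2 + y**2 <= 1.0)  # samples inside the quarter circle
pi_estimate = 4.0 * inside / n
print(pi_estimate)  # close to 3.14159
```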
Gibbs Sampling

- Algorithm -
1. Set up initial values randomly.
2. Sample each variable from its conditional distribution p(x_i | x_−i).
3. Repeat until the samples reach the stationary distribution.

Multi-dimensional variables x_1, x_2, x_3, ... with joint probability p(x_1, x_2, x_3, ...) or conditional probabilities.

r_0 = (1, 0, 0, 0, 1, 1, 0, ...) → r_1 = (0, 1, 0, 1, 1, 1, 0, ...) → r_2 = (1, 0, 1, 0, 1, 1, 1, ...) → ...
drawn via p(r_1 | r_0), p(r_2 | r_1), ...
k-step Contrastive Divergence (CD_k)

- Characteristics -
1. Use the real data as the initial values of the chain.
2. The k-th sample is taken as the expectation under the desired distribution.
3. k = 1 is enough for convergence, since the real data is used as the initial values.

data → r_1 → r_2 drawn via p(r_1 | r_0), p(r_2 | r_1)   (k = 2)
Learning in RBM (recap)

The model term ⟨∂E(v, h)/∂θ⟩_model is hard to compute directly: the total number of possible configurations is 2^(n+m).
Approximation in RBM

⟨∂E(v, h)/∂θ⟩_model ≈ (1/m) Σ_m f(x_m) ≈ f(x^(k))

MCMC (Gibbs sampling) with CD_k=1: a single short-chain sample x^(k) replaces the full model expectation.
Sampling Algorithm in RBM

1st Step: use real data as the initial value. Clamp the visible units to an input data vector, e.g. v = (1, 0, 1, 1), v_i ∈ {0, 1}.

2nd Step: sample each hidden unit h_j ∈ {0, 1} from its conditional probability, starting from the initial values:
P(h_j = 1 | v) = σ(Σ_i w_ij v_i + c_j)
e.g. sampling h_1, h_2, h_3 in turn yields h = (1, 0, 1).
Sampling Algorithm in RBM

3rd Step: sample each input unit v_i ∈ {0, 1} from its conditional probability, starting from the sampled hidden units:
P(v_i = 1 | h) = σ(Σ_j w_ij h_j + b_i)
e.g. sampling v_1, ..., v_4 in turn yields the reconstruction v = (0, 0, 1, 0).

Reconstruction! Generative Model! One full v → h → v pass is CD_k=1.
Sampling Algorithm in RBM

4th Step: perform the pass k times (CD_k):

(v, h) at t = 0 → (v, h) at t = 1 → ... → (v, h) at t = ∞ ≈ k

Data (t = 0) → Model (t → ∞)
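The four steps above can be sketched as one reusable Gibbs step; the parameters here are hypothetical placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b, c, rng):
    """One v -> h -> v pass (the chain behind CD_k):
    sample h from P(h|v), then reconstruct v from P(v|h)."""
    h = (rng.random(c.size) < sigmoid(v @ W + c)).astype(float)
    v_recon = (rng.random(b.size) < sigmoid(W @ h + b)).astype(float)
    return h, v_recon

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 3))  # hypothetical weights
b, c = np.zeros(4), np.zeros(3)

v0 = np.array([1., 0., 1., 1.])         # 1st step: data as the initial value
h0, v1 = gibbs_step(v0, W, b, c, rng)   # 2nd + 3rd steps: one reconstruction
print(h0, v1)                           # binary samples
```

Calling `gibbs_step` k times in a row gives the t = 0 → t = 1 → ... chain of the 4th step.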
Learning in RBM

∂NLL(θ | v)/∂θ = ⟨∂E(v, h)/∂θ⟩_data − ⟨∂E(v, h)/∂θ⟩_model,   θ = {w, b, c}

E(v, h) = − Σ_i Σ_j w_ij v_i h_j − Σ_i b_i v_i − Σ_j c_j h_j

∂E(v, h)/∂w_ij = −v_i h_j
∂E(v, h)/∂b_i = −v_i
∂E(v, h)/∂c_j = −h_j
Learning in RBM

Gradient descent for NLL, θ = {w, b, c}:

Δw_ij = η_w (⟨v_i h_j⟩_data − ⟨v_i h_j⟩_model)
Δb_i = η_b (⟨v_i⟩_data − ⟨v_i⟩_model)
Δc_j = η_c (⟨h_j⟩_data − ⟨h_j⟩_model)
Learning in RBM

The expectations follow from the conditional probabilities:
⟨v_i h_j⟩ = σ(Σ_i w_ij v_i + c_j) · v_i
⟨h_j⟩ = σ(Σ_i w_ij v_i + c_j)
⟨v_i⟩ = v_i
Learning in RBM

w_ij^(t+1) = w_ij^t + Δw_ij
b_i^(t+1) = b_i^t + Δb_i
c_j^(t+1) = c_j^t + Δc_j
Learning in RBM

With the data vector v (Data) and the k-th sample v^(k) (Model):

w_ij^(t+1) = w_ij^t + η_w (σ(Σ_i w_ij v_i + c_j) v_i − σ(Σ_i w_ij v_i^(k) + c_j) v_i^(k))
b_i^(t+1) = b_i^t + η_b (v_i − v_i^(k))
c_j^(t+1) = c_j^t + η_c (σ(Σ_i w_ij v_i + c_j) − σ(Σ_i w_ij v_i^(k) + c_j))
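Put together, the update equations give a compact CD_1 training step. This is a minimal sketch fitting a single hypothetical data vector, with placeholder sizes and learning rate:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_data, W, b, c, rng, lr=0.1):
    """One CD_1 update: positive statistics from the data,
    negative statistics from the one-step reconstruction v^(k)."""
    ph_data = sigmoid(v_data @ W + c)                   # P(h=1 | v_data)
    h = (rng.random(c.size) < ph_data).astype(float)    # sampled hidden units
    v_model = (rng.random(b.size)
               < sigmoid(W @ h + b)).astype(float)      # reconstruction v^(k)
    ph_model = sigmoid(v_model @ W + c)                 # P(h=1 | v^(k))
    W += lr * (np.outer(v_data, ph_data) - np.outer(v_model, ph_model))
    b += lr * (v_data - v_model)
    c += lr * (ph_data - ph_model)
    return W, b, c

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(4, 3))  # hypothetical initial weights
b, c = np.zeros(4), np.zeros(3)
v = np.array([1., 0., 1., 1.])           # single hypothetical data vector
for _ in range(100):
    W, b, c = cd1_update(v, W, b, c, rng)
print(b)  # visible biases drift toward the data pattern
```

Note the design choice of using the probabilities ph_data and ph_model (rather than sampled h values) in the weight statistics, a common variance-reduction choice consistent with the σ(...) terms in the update equations above.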
Intuition for RBM

Cost = Negative Log-Likelihood (NLL)
NLL(θ | v) = −ln Σ_h e^(−E(v, h)) + ln Σ_{v,h} e^(−E(v, h))

On the energy surface over global configurations, the data term −E(v, h) pushes the energy down at the datapoint + hidden(datapoint), while the model term +E(v, h) pushes it up at the reconstruction + hidden(reconstruction), obtained by sampling.
Intuition for RBM

[Figure: the energy surface over global configurations; sampling moves from a datapoint along the sampling direction toward the global minimum of the surface]
Intuition for RBM

Sampling direction: starting from a datapoint, successive samples move toward the global minimum of the energy surface.

P(v_i) = e^(−E(v_i)) / Σ_j e^(−E(v_j)) : the probability of the i-th configuration relative to the overall configurations.
Intuition for RBM

Sampling direction: alternating Gibbs steps, starting at t = 0 and moving toward the global minimum:

P(h_j = 1 | v) = σ(Σ_i w_ij v_i + c_j)
P(v_i = 1 | h) = σ(Σ_j w_ij h_j + b_i)

Boltzmann distribution:
P(v, h) = e^(−E(v, h)) / Σ_{k,l} e^(−E(v_k, h_l)),   so P(E_i) falls as the energy E_i rises.

[Figure: chain (v, h) at t = 0 → (v, h) at t = 1 descending the energy surface]
Intuition for RBM

Continuing the chain (v, h) at t = 0 → t = 1 → ... → t = ∞, sampling approaches the global minimum of the energy surface over global configurations.
Intuition for RBM

PCD vs. CD

Contrastive Divergence (CD): every update restarts the chain from the data, so the negative samples tend to stay near the datapoint instead of reaching the global minimum.

Persistent Contrastive Divergence (PCD): each update continues the chain from the previous sample point, so the negative samples can travel all the way to the global minimum.

The winner is PCD!
Practice

[Figure: input data and its reconstruction after the 1st epoch]

[Figure: reconstructions after the 11th and 61st epochs]
In Reality – Unsupervised Pretraining

[Figure: a deep network with visible layer (v_1, v_2, v_3, ...), two hidden layers (h_1, h_2, ...), and an output layer (y_1, y_2, ...); the RBM layers are trained first. Pretraining!]
Thank you!