Chapter 2 Fundamental Neurocomputing Concepts
1
Chapter 2 Fundamental Neurocomputing Concepts
National Yunlin University of Science and Technology, Graduate Institute of Computer Science and Information Engineering
Chuan-Yu Chang, Ph.D. Office: EB212 TEL: 05-5342601 ext. 4337 E-mail: [email protected] http://MIPL.yuntech.edu.tw
2
Basic Models of Artificial Neurons An artificial neuron can be referred to as a
processing element, node, or threshold logic unit. There are four basic components of a neuron:
A set of synapses with associated synaptic weights.
A summing device: each input is multiplied by its associated synaptic weight and then summed.
An activation function, which serves to limit the amplitude of the neuron output.
A threshold, externally applied, which lowers the cumulative input to the activation function.
3
Basic Models of Artificial neurons
4
Basic Models of Artificial neurons
The output of the linear combiner is

u_q = \sum_{j=1}^{n} w_{qj} x_j = w_q^T x    (2.2)

where w_q = [w_{q1}, w_{q2}, ..., w_{qn}]^T \in R^n.

The output of the activation function is

y_q = f(v_q) = f(u_q - \theta_q)    (2.3)

The output of the neuron is given by

y_q = f\left( \sum_{j=1}^{n} w_{qj} x_j - \theta_q \right)    (2.4)
5
Basic Models of Artificial neurons The threshold (or bias) is incorporated into the synaptic
weight vector wq for neuron q.
6
Basic Models of Artificial neurons
The effective internal activation potential is written as

v_q = \sum_{j=0}^{n} w_{qj} x_j = w_q^T x

where the fixed input x_0 lets the weight w_{q0} carry the threshold (or bias) term.

The output of neuron q is written as

y_q = f(v_q)
7
Basic Activation Functions
The activation function (also called the transfer function) can be linear or nonlinear.
Linear (identity) activation function:

y_q = f_lin(v_q) = v_q
8
Basic Activation Functions
Hard limiter: a binary (threshold) function with outputs (0, 1). The output of the binary hard limiter can be written as

y_q = f_hl(v_q) = \begin{cases} 0 & \text{if } v_q < 0 \\ 1 & \text{if } v_q \ge 0 \end{cases}

Hard limiter activation function
9
Basic Activation Functions
Bipolar, symmetric hard limiter with outputs (-1, 1); sometimes referred to as the signum (or sign) function. The output of the symmetric hard limiter can be written as

y_q = f_shl(v_q) = \begin{cases} -1 & \text{if } v_q < 0 \\ 0 & \text{if } v_q = 0 \\ 1 & \text{if } v_q > 0 \end{cases}

Symmetric hard limiter activation function
10
Basic Activation Functions
Saturating linear function (piecewise linear function). The output of the saturating linear function is given by

y_q = f_sl(v_q) = \begin{cases} 0 & \text{if } v_q \le -1/2 \\ v_q + 1/2 & \text{if } -1/2 < v_q < 1/2 \\ 1 & \text{if } v_q \ge 1/2 \end{cases}

Saturating linear activation function
11
Basic Activation Functions
Symmetric saturating linear function. The output of the symmetric saturating linear function is given by

y_q = f_ssl(v_q) = \begin{cases} -1 & \text{if } v_q \le -1 \\ v_q & \text{if } -1 < v_q < 1 \\ 1 & \text{if } v_q \ge 1 \end{cases}

Symmetric saturating linear activation function
12
Basic Activation Functions
Sigmoid function (S-shaped function): binary sigmoid function. The output of the binary sigmoid function is given by

y_q = f_bs(v_q) = \frac{1}{1 + e^{-\alpha v_q}}

where \alpha is the slope parameter of the binary sigmoid function.

Binary sigmoid function

Whereas the hard limiter has no derivative at the origin, the binary sigmoid is a continuous and differentiable function.
13
Basic Activation Functions
The derivative of the binary sigmoid function, shown for two different values of the slope parameter:

g(v_q) = \frac{d f_bs(v_q)}{d v_q} = \frac{\alpha e^{-\alpha v_q}}{(1 + e^{-\alpha v_q})^2} = \alpha f_bs(v_q) [1 - f_bs(v_q)]
14
Basic Activation Functions
Sigmoid function (S-shaped function): bipolar sigmoid function (hyperbolic tangent sigmoid). The output of the bipolar sigmoid function is given by

y_q = f_hts(v_q) = \tanh(\beta v_q) = \frac{e^{\beta v_q} - e^{-\beta v_q}}{e^{\beta v_q} + e^{-\beta v_q}} = \frac{1 - e^{-2\beta v_q}}{1 + e^{-2\beta v_q}}
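The activation functions of this section can be collected into a small sketch (the function names mirror the text's f_hl, f_shl, f_sl, f_bs, f_hts notation, but the code itself is mine, not from the text):

```python
import math

def f_hl(v):
    """Binary hard limiter: outputs 0 or 1."""
    return 1.0 if v >= 0 else 0.0

def f_shl(v):
    """Symmetric hard limiter (signum): outputs -1, 0, or 1."""
    if v > 0:
        return 1.0
    if v < 0:
        return -1.0
    return 0.0

def f_sl(v):
    """Saturating linear function: linear between -1/2 and 1/2."""
    if v >= 0.5:
        return 1.0
    if v <= -0.5:
        return 0.0
    return v + 0.5

def f_bs(v, alpha=1.0):
    """Binary sigmoid with slope parameter alpha."""
    return 1.0 / (1.0 + math.exp(-alpha * v))

def f_hts(v, beta=1.0):
    """Hyperbolic tangent (bipolar) sigmoid with slope parameter beta."""
    return math.tanh(beta * v)
```

At v = 0 the two sigmoids pass through 0.5 and 0 respectively, which is why they are called binary and bipolar.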
15
Basic Activation Functions
The effect of the threshold \theta_q and bias b_q can be illustrated by observing the binary sigmoid activation function. Three plots of the binary sigmoid function (\alpha = 1): threshold = 2 (\theta_q = 2), bias = 2 (b_q = 2), and the nominal case (\theta_q = b_q = 0).
Applying a threshold is analogous to delaying a time-domain signal; adding a bias is analogous to an advance of a signal.
16
The Hopfield Model of the Artificial Neuron Hopfield clearly described the relationship between neural-network theory and practice. The Hopfield neural network is an asynchronous, parallel-processing, fully interconnected network.
Content-addressable memory (associative memory):
Retrieving a stored pattern in response to the presentation of a noisy or incomplete version of that pattern.
Discrete-time model of the Hopfield neuron
17
The Hopfield Model of the Artificial Neuron
The unit delay z-1 delays the output of the activation function by one sample period to give yq(k).
The output of the neuron before the unit delay z^{-1} is written as

y_q(k+1) = f_shl(v_q(k+1))    (2.16)

where

v_q(k+1) = \sum_{j=1}^{n} w_{qj} x_j(k)    (2.17)

Using (2.16) and (2.17), the output of neuron q at time k+1 is

y_q(k+1) = f_shl\left( \sum_{j=1}^{n} w_{qj} x_j(k) \right)    (2.18)
18
The Hopfield Model of the Artificial Neuron Continuous-time model of the Hopfield artificial neuron: T_{cq} = R_q C_q is the integration time constant of the qth neuron, and \theta_q is an externally applied threshold. The integrator can be realized with an op-amp, capacitor C_q, and resistor R_q. \kappa_q > 0 is called the leakage factor of the integrator.

T_{cq} \frac{d v_q}{dt} = -\kappa_q v_q + \sum_{j=1}^{n} w_{qj} x_j + \theta_q

y_q = f_hts(v_q)
19
Adaline and Madaline
Least-Mean-Square (LMS) Algorithm (Widrow-Hoff learning rule, delta rule). The LMS is an adaptive algorithm that computes adjustments of the neuron's synaptic weights. The algorithm is based on the method of steepest descent. It adjusts the neuron weights to minimize the mean squared error between the inner product of the weight vector with the input vector and the desired output of the neuron.
Adaline (adaptive linear element): a single neuron whose synaptic weights are updated according to the LMS algorithm.
Madaline (multiple Adaline).
20
Simple adaptive linear combiner
v(k) = w^T(k) x(k) = x^T(k) w(k)

inputs x_0 = 1, w_0 = \theta (bias)
21
Simple adaptive linear combiner The difference between the desired response and
the network response is
The MSE criterion can be written as
Expanding Eq(2.23)
e(k) = d(k) - v(k) = d(k) - w^T x(k)    (2.22)

J(w) = \frac{1}{2} E[e^2(k)] = \frac{1}{2} E\left[ (d(k) - w^T x(k))^2 \right]    (2.23)

J(w) = \frac{1}{2} E[d^2(k)] - E[d(k) x^T(k)] w + \frac{1}{2} w^T E[x(k) x^T(k)] w    (2.24)

= \frac{1}{2} E[d^2(k)] - p^T w + \frac{1}{2} w^T C_x w    (2.25)
22
Simple adaptive linear combiner Cross correlation vector between the desired
response and the input patterns
Covariance matrix for the input pattern
p = E[d(k) x(k)]    (2.26)

C_x = E[x(k) x^T(k)]    (2.27)

The MSE surface of J(w) has a single minimum, so we solve for the weights at which the gradient is zero:

\nabla_w J(w) = -p + C_x w = 0

The optimal weight vector is therefore

w^* = C_x^{-1} p
23
The LMS Algorithm
The expression above has two limitations: computing the inverse of the covariance matrix is time-consuming, and it is unsuitable for real-time weight updating because, in most situations, the covariance matrix and the cross-correlation vector are not known in advance. To avoid these problems, Widrow and Hoff proposed the LMS algorithm:
To obtain the optimal values of the synaptic weights when J(w) is minimum, search the error surface using a gradient descent method to find the minimum value. We can reach the bottom of the error surface by changing the weights in the direction of the negative gradient of the surface.
24
The LMS Algorithm
Typical MSE surface of an adaptive linear combiner
25
The LMS Algorithm Because the gradient on the surface cannot be computed
without knowledge of the input covariance matrix and the cross-correlation vector, these must be estimated during an iterative procedure.
Estimation of the MSE gradient surface can be obtained by taking the gradient of the instantaneous error surface.
The gradient of J(w) is approximated as

\nabla_w J(w) \approx \nabla_w \left[ \frac{1}{2} e^2(k) \right] = e(k) \nabla_w e(k) = -e(k) x(k)    (2.28)

The learning rule for updating the weights using the steepest descent gradient method is

w(k+1) = w(k) - \mu \nabla_w J(w) = w(k) + \mu e(k) x(k)    (2.29)
The learning rate \mu specifies the magnitude of the update step for the weights in the negative gradient direction.
26
The LMS Algorithm
If the value of \mu is chosen to be too small, the learning algorithm will modify the weights slowly and a relatively large number of iterations will be required.
If the value of \mu is set too large, the learning rule can become numerically unstable, leading to weights that do not converge.
27
The LMS Algorithm
The scalar form of the LMS algorithm can be written from (2.22) and (2.29) as

e(k) = d(k) - \sum_{h=1}^{n} w_h(k) x_h(k)    (2.30)

w_i(k+1) = w_i(k) + \mu e(k) x_i(k)    (2.31)

From (2.29) and (2.31), an upper bound must be placed on the learning rate \mu to maintain network stability (Haykin, 1996):

0 < \mu < \frac{2}{\lambda_{max}}    (2.32)

where \lambda_{max} is the largest eigenvalue of the input covariance matrix C_x.
28
The LMS Algorithm
For minimally acceptable stability of LMS convergence, the learning rate can be bounded as

0 < \mu < \frac{2}{trace(C_x)}    (2.33)

Equation (2.33) is a reasonable approximation, because

trace(C_x) = \sum_{h=1}^{n} c_{hh} \ge \lambda_{max}    (2.34)
29
The LMS Algorithm
From (2.32) and (2.33), determining the learning rate requires at least computing the covariance matrix of the input patterns, which is difficult to achieve in practical applications. Even when it can be obtained, a fixed learning rate of this kind is problematic for the accuracy of the results. Robbins and Monro's root-finding algorithm therefore introduced a learning rate that varies with time (stochastic approximation):

\mu(k) = \frac{\alpha}{k}    (2.35)

where \alpha is a very small constant. Drawback: the learning rate decreases too quickly.
30
The LMS Algorithm
Ideally, the learning rate should be relatively large at the start of training and then decrease gradually during the learning process (schedule-type adjustment).
Darken and Moody's search-then-converge algorithm:

\mu(k) = \frac{\mu_0}{1 + k/\tau}    (2.36)

Search phase: \mu is relatively large and almost constant. Converge phase: \mu decreases exponentially toward zero.
\mu_0 > 0 and \tau >> 1, typically 100 <= \tau <= 500. These methods of adjusting the learning rate are commonly called learning rate schedules.
31
The LMS Algorithm
Adaptive normalization approach (non-schedule-type): \mu is adjusted according to the input data at every time step:

\mu(k) = \frac{\mu_0}{\|x(k)\|_2^2}    (2.37)

where \mu_0 is a fixed constant. Stability is guaranteed if 0 < \mu_0 < 2; the practical range is 0.1 <= \mu_0 <= 1.
32
The LMS Algorithm Comparison of two learning rate schedules: stochastic approximation
schedule and the search-then-converge schedule.
33
Summary of the LMS algorithm
Step 1: Set k = 1, initialize the synaptic weight vector w(k = 1), and select values for \mu_0 and \tau.
Step 2: Compute the learning rate parameter

\mu(k) = \frac{\mu_0}{1 + k/\tau}

Step 3: Compute the error

e(k) = d(k) - \sum_{h=1}^{n} w_h(k) x_h(k)

Step 4: Update the synaptic weights

w_i(k+1) = w_i(k) + \mu(k) e(k) x_i(k)

Step 5: If convergence is achieved, stop; else set k = k + 1 and go to step 2.
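The five steps above can be sketched as a small training loop (a minimal sketch with my own toy data and function name; the convergence check of step 5 is simplified to a fixed number of epochs):

```python
def lms_train(patterns, targets, mu0=0.2, tau=500.0, epochs=200):
    # Step 1: initialize the weight vector and the iteration counter
    w = [0.0] * len(patterns[0])
    k = 1
    for _ in range(epochs):
        for x, d in zip(patterns, targets):
            mu = mu0 / (1.0 + k / tau)                      # Step 2: mu(k)
            e = d - sum(wi * xi for wi, xi in zip(w, x))    # Step 3: error
            w = [wi + mu * e * xi for wi, xi in zip(w, x)]  # Step 4: update
            k += 1                                          # Step 5 (simplified)
    return w

# identify the linear model d = 2*x1 - x2 from input/output pairs
X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, -0.5)]
D = [2.0 * x1 - x2 for x1, x2 in X]
w = lms_train(X, D)
```

Since the target is exactly realizable by a linear model, the weights converge to the true coefficients [2, -1].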
34
Example 2.1: Parametric system identification Input data consist of 1000 zero-mean Gaussian random vectors
with three components. The bias is set to zero. The variance of the components of x are 5, 1, and 0.5. The assumed linear model is given by b=[1, 0.8, -1]T.
To generate the target values, the 1000 input vectors are used to form a matrix X=[x1x2…x1000], the desired outputs are computed according to d=bTX
The progress of the learning rate parameter as it is adjusted according to the search-then converge schedule.
d = b^T X

C_x = \frac{1}{1000} \sum_{h=1}^{1000} x_h x_h^T = \frac{1}{1000} X X^T

\mu_0 = 0.9 \left( \frac{2}{\lambda_{max}} \right) = 0.1936, \quad \tau = 200

The learning process was terminated when J = \frac{1}{2} e^2(k) \le 2 \times 10^{-8}.
35
Example 2.1 (cont.) Parametric system identification: estimating a parameter vector
associated with a dynamic model of a system given only input/output data from the system.
The root mean square (RMS) value of the performance measure.
36
Adaline and Madaline
Adaline It is an adaptive pattern classification network trained by
the LMS algorithm.
x_0(k) = 1, with an adjustable bias (weight w_0).
The Adaline produces a bipolar (+1, -1) output; with a different activation function, a (0, 1) output is possible.

e(k) = d(k) - v(k), \quad \tilde{e}(k) = d(k) - y(k)
37
Adaline
Linear error: the difference between the desired output and the output of the linear combiner,

e(k) = d(k) - v(k)

Quantizer error: the difference between the desired output and the output of the symmetric hard limiter,

\tilde{e}(k) = d(k) - y(k)
38
Adaline
The Adaline training procedure: the input vector x must be presented to the Adaline together with the corresponding desired output d. The synaptic weights w are adjusted dynamically according to the linear LMS algorithm. The activation function is not used during training (it is used only in the testing phase). Once the network weights have been properly adjusted, the response of the Adaline can be tested with patterns that were not used in training. If the Adaline responds with high accuracy to the test inputs, the network is said to have achieved generalization.
39
Adaline One common application of the Adaline is for the
realization of a small class of logic functions: Only those logic functions that are linearly separable can be
realized by the Adaline. AND
y = AND(x_1, x_2, ..., x_n) = \begin{cases} 1 & \text{if all } x_j = 1, \; j = 1, ..., n \\ -1 & \text{otherwise} \end{cases}

realized as y = sgn\left( \sum_{j=1}^{n} x_j + (1 - n) \right), with bias weight w_0(k).
40
Adaline OR
y = OR(x_1, x_2, ..., x_n) = \begin{cases} 1 & \text{if some } x_j = 1, \; j = 1, ..., n \\ -1 & \text{otherwise} \end{cases}

realized as y = sgn\left( \sum_{j=1}^{n} x_j + (n - 1) \right), with bias weight w_0(k).
41
Adaline Majority
y = MAJ(x_1, x_2, ..., x_n) = \begin{cases} 1 & \text{if the majority of } x_j = 1 \\ -1 & \text{otherwise} \end{cases}

realized as y = sgn\left( \sum_{j=1}^{n} x_j \right).
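The three realizations above amount to fixing all input weights to 1 and choosing only the bias; a hand-set sketch with bipolar inputs (no training involved; the helper names are mine):

```python
def sgn(v):
    """Signum with sgn(0) = 1, adequate for these logic realizations."""
    return 1 if v >= 0 else -1

def adaline_logic(x, bias):
    """Linear combiner with unit weights and a fixed bias, then a hard limiter."""
    return sgn(sum(x) + bias)

def AND(x):
    return adaline_logic(x, 1 - len(x))   # fires only if all x_j = +1

def OR(x):
    return adaline_logic(x, len(x) - 1)   # fires if some x_j = +1

def MAJ(x):
    return adaline_logic(x, 0)            # fires if the majority of x_j = +1
```

With n = 3 bipolar inputs, AND([1, 1, -1]) gives -1 while MAJ([1, 1, -1]) gives +1, matching the truth tables.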
42
Adaline Linear separability
The Adaline acts as a classifier which separates all possible input patterns into two categories.
The output of the linear combiner is given as
v(k) = w_1(k) x_1(k) + w_2(k) x_2(k) + w_0(k)

Let v(k) = 0; then

w_1(k) x_1(k) + w_2(k) x_2(k) + w_0(k) = 0

or

x_2(k) = -\frac{w_1(k)}{w_2(k)} x_1(k) - \frac{w_0(k)}{w_2(k)}
43
Adaline: linear separability of the Adaline
The Adaline can separate only linearly separable patterns.
44
Adaline: a nonlinear separation problem
Since the separating boundary is not a straight line, the Adaline cannot be used to accomplish this task.
45
Adaline Adaline with nonlinearly transformed inputs (polynomial
discriminant function) To solve the classification problem for patterns that are not
linearly separable, the inputs to the Adaline can be preprocessed with fixed nonlinearities. (polynomial discriminant function)
v(k) = w_0(k) + w_1(k) x_1 + w_2(k) x_2 + w_3(k) x_1 x_2 + w_4(k) x_1^2 + w_5(k) x_2^2    (2.45)
46
Adaline The critical thresholding condition for this Adaline with nonlinearly
transformed inputs occurs when v(k) in (2.45) is set to zero. Realizing a nonlinearly separable function (XNOR)
If the appropriate nonlinearities are chosen, the network can be trained to separate the input space into two subspaces which are not linearly separable.
47
Adaline (cont.) Linear error correction rules
There are two basic linear correction rules for dynamically adjusting the network weights (the change in the weights is related to the difference between the actual network output and the desired output).
\mu-LMS: the same as (2.22) and (2.29),

w(k+1) = w(k) - \mu \nabla_w J(w) = w(k) + \mu e(k) x(k)

\alpha-LMS: a self-normalizing version of the \mu-LMS learning rule,

w(k+1) = w(k) + \frac{\alpha e(k) x(k)}{\|x(k)\|_2^2}    (2.46)

The \alpha-LMS algorithm is based on the minimal-disturbance principle: when the weights are adjusted to accommodate a new pattern, the responses to previously learned patterns should be disturbed as little as possible.
\mu-LMS is based on minimizing the MSE surface; \alpha-LMS instead updates the weights so as to reduce the current error.
48
Adaline (cont.) Consider the change in the error for \alpha-LMS:

\Delta e(k) = \Delta \left[ d(k) - w^T(k) x(k) \right] = -\Delta w^T(k) x(k)    (2.47)

Substituting the \alpha-LMS update \Delta w(k) = \alpha e(k) x(k) / \|x(k)\|_2^2 into (2.47) gives

\Delta e(k) = -\frac{\alpha e(k) x^T(k) x(k)}{\|x(k)\|_2^2} = -\alpha e(k)    (2.48)

From (2.47), the choice of \alpha controls stability and speed of convergence; \alpha is typically set in the range 0.1 < \alpha < 1.
\alpha-LMS is called self-normalizing because the choice of \alpha is independent of the magnitude of the network inputs.
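The property \Delta e(k) = -\alpha e(k) of (2.47)-(2.48) can be checked directly with a one-step sketch of the \alpha-LMS update (2.46); the toy numbers and helper name are mine:

```python
def alpha_lms_step(w, x, d, alpha=0.5):
    """One alpha-LMS update (Eq. 2.46); returns new weights and the error before the update."""
    v = sum(wi * xi for wi, xi in zip(w, x))
    e = d - v
    norm2 = sum(xi * xi for xi in x)
    w_new = [wi + alpha * e * xi / norm2 for wi, xi in zip(w, x)]
    return w_new, e

w = [0.0, 0.0]
x, d = [1.0, 2.0], 3.0
w, e0 = alpha_lms_step(w, x, d, alpha=0.5)
# re-presenting the same pattern: the error has shrunk by the factor (1 - alpha)
_, e1 = alpha_lms_step(w, x, d, alpha=0.5)
```

Here e0 = 3 and, after one update with alpha = 0.5, the error on the same pattern is e1 = 1.5, exactly (1 - alpha) * e0.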
49
Adaline (cont.) Detailed comparison of the \alpha-LMS and \mu-LMS rules. From (2.46),

w(k+1) = w(k) + \frac{\alpha e(k) x(k)}{\|x(k)\|_2^2} = w(k) + \alpha \left( \frac{d(k)}{\|x(k)\|_2} - w^T(k) \frac{x(k)}{\|x(k)\|_2} \right) \frac{x(k)}{\|x(k)\|_2}    (2.49)

Define the normalized desired response and normalized training vector

\hat{d}(k) = \frac{d(k)}{\|x(k)\|_2}, \quad \hat{x}(k) = \frac{x(k)}{\|x(k)\|_2}    (2.50-51)

Equation (2.49) can then be rewritten as

w(k+1) = w(k) + \alpha \left( \hat{d}(k) - w^T(k) \hat{x}(k) \right) \hat{x}(k)    (2.52)

This has the same form as \mu-LMS,

w(k+1) = w(k) - \mu \nabla_w J(w) = w(k) + \mu e(k) x(k)

so \alpha-LMS is simply \mu-LMS applied to normalized input patterns.
50
Multiple Adaline (Madaline)
A single Adaline cannot solve problems whose decision regions are not linearly separable. Multiple Adalines can be used instead: a multiple-Adaline network is called a Madaline.
Madaline I: a single-layer network with a single output. Madaline II: a multi-layer network with multiple outputs.
51
Example of Madaline I network consisting of three Adalines
The output unit may be OR, AND, or MAJ.
52
Two-layer Madaline II architecture
53
Madaline I realization of an XNOR logic function
54
Madaline I realization of an XNOR logic function (cont.)
v_1(k) = w_{11}(k) x_1(k) + w_{21}(k) x_2(k) + w_{01}(k) x_0(k)

with x_0(k) = 1. Setting v_1(k) = 0, dividing through by w_{21}(k), and rearranging gives

w_{11}(k) x_1(k) + w_{21}(k) x_2(k) + w_{01}(k) = 0

or

x_2(k) = -\frac{w_{11}(k)}{w_{21}(k)} x_1(k) - \frac{w_{01}(k)}{w_{21}(k)}
55
Similarly,

v_2(k) = w_{12}(k) x_1(k) + w_{22}(k) x_2(k) + w_{02}(k) x_0(k)

with x_0(k) = 1. Setting v_2(k) = 0, dividing through by w_{22}(k), and rearranging gives

w_{12}(k) x_1(k) + w_{22}(k) x_2(k) + w_{02}(k) = 0

or

x_2(k) = -\frac{w_{12}(k)}{w_{22}(k)} x_1(k) - \frac{w_{02}(k)}{w_{22}(k)}

Madaline I realization of an XNOR logic function (cont.)
56
Madaline I separation properties for the XNOR problem
57
Madaline Learning Strategies There are two Madaline learning strategies:
Madaline rule I, MRI, for Madaline I The basic idea is to adjust those weights for the neuron
whose linear output vj(k) is closest to the desired response.
MRI follows the minimal-disturbance principle. Madaline rule II, MRII, for Madaline II
The weights are initially set to small random values. The training patterns are presented in a random fashion
with the objective of minimizing the average Hamming error over the training set. (backpropagation)
58
Simple Perceptron
Simple perceptron (single-layer perceptron): very similar to the Adaline; proposed by Frank Rosenblatt (1950s).
Minsky and Papert identified a serious limitation: the perceptron cannot solve the XOR problem.
With an appropriate processing layer, the XOR problem (or, more generally, the parity function) can be solved.
The simple perceptron is related to the maximum-likelihood Gaussian classifier of classical pattern classification; both can be viewed as linear classifiers.
Most perceptron training is supervised, but some is self-organizing.
59
Simple Perceptron (cont.)
In Rosenblatt’s early work The perceptron had three layers
Sensory surface (retina): Association area (A unit) Response unit (R unit)
It was not allowed to have more than one R unit on at a time. (Winner-take-all)
60
Simple Perceptron (cont.)
Original Rosenblatt perceptron: binary inputs, no bias.
Modified perceptron: bipolar inputs and a bias term; output y \in {-1, 1}.
61
Simple Perceptron (cont.) The quantizer error is used to adjust the synaptic weights of the neuron. The adaptive algorithm for adjusting the neuron weights (the perceptron learning rule) is given as

w(k+1) = w(k) + \frac{\alpha \tilde{e}(k)}{2} x(k)    (2.55)

where

\tilde{e}(k) = d(k) - y(k) = d(k) - sgn(w^T(k) x(k))    (2.56)

Rosenblatt normally set \alpha to unity. The choice of the learning rate \alpha does not affect the numerical stability of the perceptron learning rule, but it can affect the speed of convergence.

Compare with (2.46): w(k+1) = w(k) + \frac{\alpha e(k) x(k)}{\|x(k)\|_2^2}
62
Simple Perceptron (cont.)
The perceptron learning rule is considered a nonlinear algorithm.
The perceptron learning rule performs the update of the weights until all the input patterns are classified correctly. The quantizer error will be zero for all training pattern inputs, and
no weight adjustments will occur.
The weights are not guaranteed to be optimal.
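A minimal sketch of the rule (2.55)-(2.56) on a linearly separable problem (bipolar AND with a bias input x_0 = 1; the data and names are mine, not the text's):

```python
def sgn(v):
    return 1 if v >= 0 else -1

def perceptron_train(patterns, targets, alpha=1.0, epochs=20):
    """Perceptron learning rule: w <- w + (alpha/2) * quantizer_error * x."""
    w = [0.0] * len(patterns[0])
    for _ in range(epochs):
        for x, d in zip(patterns, targets):
            y = sgn(sum(wi * xi for wi, xi in zip(w, x)))
            e = d - y                    # quantizer error: 0 when classified correctly
            w = [wi + alpha * e * xi / 2 for wi, xi in zip(w, x)]
    return w

# bipolar AND, with a bias input x0 = 1 prepended to each pattern
X = [(1, -1, -1), (1, -1, 1), (1, 1, -1), (1, 1, 1)]
D = [-1, -1, -1, 1]
w = perceptron_train(X, D)
```

Once every pattern is classified correctly, the quantizer error is zero and the weights stop changing, as stated above.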
63
Simple Perceptron
Mays’s perceptron learning rules Mays 對標準的 perceptron learning rules 提出兩個修改: Increment adaptation algorithm
Modified relaxation algorithm
kvifkx
kxkdkw
kvifkx
kxkekw
kw )(
2
)(~
1
2
2
2
2
otherwise
kx
kxkekw
kekvifkw
kw2
2
)(
0)(~ and
1
(2.58)
(2.57)
64
Simple Perceptron with a Sigmoid Activation Function
The learning rule is based on the method of steepest descent and attempts to minimize an instantaneous performance function.
65
Simple Perceptron with a Sigmoid Activation Function (cont.)
The learning algorithm can be derived from the MSE criterion

J_q(w) = \frac{1}{2} E\left[ \tilde{e}_q^2 \right]    (2.59)

where \tilde{e}_q = d_q - y_q    (2.60)

The instantaneous performance function to be minimized is given as

\hat{J}_q(w) = \frac{1}{2} \tilde{e}_q^2(k) = \frac{1}{2} \left[ d_q^2(k) - 2 d_q(k) y_q(k) + y_q^2(k) \right]    (2.61)

where y_q(k) = f(v_q(k)) = f(x^T(k) w_q(k)).
66
Simple Perceptron with a Sigmoid Activation Function (cont.)
Assume the activation function is the hyperbolic tangent sigmoid; the neuron output can then be written as

y_q(k) = f_hts(v_q(k)) = \tanh(\beta v_q(k))    (2.62)

Following (2.15), the derivative of the hyperbolic tangent sigmoid is

g(v_q) = f'(v_q) = \beta \left[ 1 - f^2(v_q) \right]    (2.63)

Using steepest descent, the discrete-time learning rule is (cf. Eq. 2.29)

w_q(k+1) = w_q(k) - \mu \nabla_{w_q} J    (2.64)
67
Simple Perceptron with a Sigmoid Activation Function (cont.)
Computing the gradient in (2.64):

\nabla_{w_q} J = -\left[ d_q(k) - f(v_q(k)) \right] f'(v_q(k)) x(k) = -\tilde{e}_q(k) f'(v_q(k)) x(k)    (2.65)

Substituting (2.63) into (2.65):

\nabla_{w_q} J = -\tilde{e}_q(k) \beta \left[ 1 - y_q^2(k) \right] x(k)    (2.66)

Using the gradient in (2.66), the discrete-time learning rule for the simple perceptron can be written as

w_q(k+1) = w_q(k) + \mu \beta \tilde{e}_q(k) \left[ 1 - y_q^2(k) \right] x(k)    (2.67)
68
Simple Perceptron with a Sigmoid Activation Function (cont.)
Equation (2.67) can be rewritten in scalar form as

w_{qj}(k+1) = w_{qj}(k) + \mu \beta \tilde{e}_q(k) \left[ 1 - y_q^2(k) \right] x_j(k)    (2.68)

where

\tilde{e}_q(k) = d_q(k) - y_q(k)    (2.69)

y_q(k) = f(v_q(k)) = f\left( \sum_{j=1}^{n} x_j(k) w_{qj}(k) \right)    (2.70)

Equations (2.68), (2.69), and (2.70) are the standard form of the backpropagation training algorithm.
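The scalar update (2.68)-(2.70) can be sketched for a single fixed pattern with a tanh activation (beta = 1); the pattern, initial weights, and helper name here are my own illustration, not Example 2.2's actual data:

```python
import math

def train_sigmoid_neuron(x, d, mu=0.25, beta=1.0, goal=1e-8, max_iter=10000):
    """Train one tanh neuron on a single pattern until 0.5*e^2 <= goal."""
    w = [0.1] * len(x)
    y = 0.0
    for _ in range(max_iter):
        v = sum(wi * xi for wi, xi in zip(w, x))
        y = math.tanh(beta * v)
        e = d - y                                  # quantizer-free error (2.69)
        if 0.5 * e * e <= goal:
            break
        # update (2.68): w_j <- w_j + mu*beta*e*(1 - y^2)*x_j
        w = [wi + mu * beta * e * (1 - y * y) * xi for wi, xi in zip(w, x)]
    return w, y

w, y = train_sigmoid_neuron([1.0, -1.0, 0.5], 0.5)
```

The (1 - y^2) factor is the tanh derivative of (2.63); it shrinks the step as the output saturates.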
69
Example 2.2: The architecture of Figure 2.30 is applied to learn the character "E". The character image consists of a 5x5 array, i.e., 25 pixels (column major). The learning rule is Eq. (2.67), with \beta = 1 and \mu = 0.25. The desired neuron response is d = 0.5, with an error goal of 10^-8. The initial weights of the neuron were randomized. After 39 training iterations, the actual neuron output is y = 0.50009 (see Fig. 2.32).
70
Example 2.2 (cont.)
The single neuron cannot correct for a noisy input: for Fig. 2.31(b), y = 0.5204; for Fig. 2.31(c), y = 0.6805. To compensate for noisy inputs, use a multi-layer perceptron or a Hopfield associative memory.
71
Feedforward Multilayer Perceptron Multilayer perceptron (MLP)
The branches can only broadcast information in one direction. Synaptic weights can be adjusted according to a defined learning rule. An h-p-m feedforward MLP neural network: in general there can be any number of hidden layers in the architecture; however, from a practical perspective, only one or two hidden layers are used.
72
Feedforward Multilayer Perceptron (cont.)
73
Feedforward Multilayer Perceptron (cont.) The first layer has the weight matrix
The second layer has the weight matrix
The third layer has the weight matrix
Define a diagonal nonlinear operator matrix
W^{(1)} = [w_{ji}^{(1)}] \in R^{h \times n}

W^{(2)} = [w_{rj}^{(2)}] \in R^{p \times h}

W^{(3)} = [w_{sr}^{(3)}] \in R^{m \times p}

\Gamma[\cdot] = diag[f(\cdot), f(\cdot), ..., f(\cdot)]    (2.71)
74
Feedforward Multilayer Perceptron (cont.) The output of the first layer can be written as
The output of the second layer can be written as
The output of the third layer can be written as
x_{out}^{1} = \Gamma^{(1)}\left[ W^{(1)} x \right]    (2.72)

x_{out}^{2} = \Gamma^{(2)}\left[ W^{(2)} x_{out}^{1} \right]    (2.73)

x_{out}^{3} = \Gamma^{(3)}\left[ W^{(3)} x_{out}^{2} \right]    (2.74)

Substituting (2.72) into (2.73) and then into (2.74), the final output is

y = \Gamma^{(3)}\left[ W^{(3)} \Gamma^{(2)}\left[ W^{(2)} \Gamma^{(1)}\left[ W^{(1)} x \right] \right] \right]    (2.75)

The synaptic weights are fixed during operation; a training process must be carried out a priori to properly adjust the weights.
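Equation (2.75) is just three matrix-vector products interleaved with the nonlinear operators; a minimal sketch with tanh hidden layers and a linear output layer (the weight values here are arbitrary illustrative numbers):

```python
import math

def matvec(W, x):
    """Matrix-vector product for a weight matrix stored as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def forward(x, W1, W2, W3):
    """Layered forward pass of Eq. (2.75): Gamma3[W3 Gamma2[W2 Gamma1[W1 x]]]."""
    x1 = [math.tanh(v) for v in matvec(W1, x)]   # first hidden layer
    x2 = [math.tanh(v) for v in matvec(W2, x1)]  # second hidden layer
    return matvec(W3, x2)                        # linear output layer

W1 = [[0.5, -0.2], [0.1, 0.3]]
W2 = [[0.4, 0.7], [-0.6, 0.2]]
W3 = [[1.0, -1.0]]
y = forward([1.0, 2.0], W1, W2, W3)
```

Here the diagonal operator Gamma simply applies the same scalar activation to each component of the layer's activation-potential vector.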
75
Overview of Basic Learning Rules for a Single Neuron
Generalized LMS Learning Rule Define a performance function (energy function) to be minimized:

J(w) = \rho(e) + \frac{\gamma}{2} \|w\|_2^2    (2.76)

where \|w\|_2 is the Euclidean norm of the weight vector w, \rho(\cdot) is any differentiable function, and e is the linear error,

e = d - w^T x    (2.77)

(d is the desired output, w the weight vector, and x the input vector).
76
Generalized LMS Learning Rule (cont.) Using the steepest descent approach, the general LMS algorithm is obtained.
Continuous-time learning rule (interpreted as a vector differential equation):

\frac{dw}{dt} = -\mu \nabla_w J(w) = \mu \left[ g(e) x - \gamma w \right]    (2.78)

where g(e) = \frac{d\rho(e)}{de}    (2.79)

Discrete-time learning rule:

w(k+1) = w(k) + \mu(k) \left[ g(e(k)) x(k) - \gamma w(k) \right]    (2.81)

where \mu is the learning rate and \gamma the leakage factor. If \rho(t) = \frac{1}{2} t^2 and \rho'(t) = g(t) = t, then (2.81) is written as

w(k+1) = w(k) + \mu(k) \left[ e(k) x(k) - \gamma w(k) \right]    (2.82)

and the continuous-time rule becomes

\frac{dw}{dt} = \mu \left[ e x - \gamma w \right]    (2.83)
77
Generalized LMS Learning Rule (cont.)
Leaky LMS algorithm (0 \le \gamma < 1):

w(k+1) = \left[ 1 - \mu(k) \gamma \right] w(k) + \mu(k) e(k) x(k)    (2.84)

Standard LMS algorithm (\gamma = 0), the same as Eq. (2.29):

w(k+1) = w(k) + \mu(k) e(k) x(k)    (2.85)

The scalar form of the standard LMS algorithm:

w_j(k+1) = w_j(k) + \mu(k) e(k) x_j(k), \quad \text{for } j = 0, 1, 2, ..., n

e(k) = d(k) - \sum_{j=0}^{n} w_j(k) x_j(k)    (2.86)
78
Generalized LMS Learning Rule (cont.) There are three important variations of the standard LMS algorithm.
1. Momentum is designed to provide additional impetus to the weight-vector change in the direction of the average downhill force. It is defined through the difference between the current weight w(k) and the previous weight w(k-1):

\Delta w(k) = w(k) - w(k-1)    (2.87)

Equation (2.85) can then be rewritten as

w(k+1) = w(k) + \mu e(k) x(k) + \eta \left[ w(k) - w(k-1) \right]    (2.88)

where 0 < \eta < 1 is the momentum parameter. This is the standard LMS algorithm with momentum.
79
Generalized LMS Learning Rule (cont.)
2. Recursive weighted least-squares is a parametric system identification method based on an autoregressive moving average (ARMA) model. It comprises update expressions for the parameter vector, the gain vector, and the weighting matrix.
The update expression for the parameter vector is

w(k+1) = w(k) + L(k+1) e(k)    (2.89)

The update expression for the gain vector is

L(k+1) = P(k) x(k) \left[ \gamma + x^T(k) P(k) x(k) \right]^{-1}    (2.90)

The update expression for the weighting matrix is

P(k+1) = \frac{1}{\gamma} \left[ P(k) - L(k+1) x^T(k) P(k) \right]    (2.91)
80
Generalized LMS Learning Rule (cont.)
where the error term e(k) is given by

e(k) = d(k) - w^T(k) x(k)    (2.92)

Substituting (2.90) into (2.89) gives

w(k+1) = w(k) + \frac{P(k) x(k) e(k)}{\gamma + x^T(k) P(k) x(k)}    (2.93)

Therefore the modified synaptic weight vector can be expressed as

w(k+1) = w(k) + \mu(k) e(k) P(k) x(k)    (2.94)

where

\mu(k) = \frac{1}{\gamma + x^T(k) P(k) x(k)}    (2.95)
81
Generalized LMS Learning Rule (cont.)
Substituting (2.90) into (2.91) gives the weighting matrix

P(k+1) = \frac{1}{\gamma} \left[ P(k) - \frac{P(k) x(k) x^T(k) P(k)}{\gamma + x^T(k) P(k) x(k)} \right] = \frac{1}{\gamma} \left[ P(k) - \mu(k) P(k) x(k) x^T(k) P(k) \right]    (2.96)

The update expression for the weighting matrix is written as

P(k+1) = \frac{1}{\gamma} \left[ I - \mu(k) P(k) x(k) x^T(k) \right] P(k)    (2.97)
82
Generalized LMS Learning Rule (cont.)
3. The minimal-disturbance principle: modified normalized LMS. A positive constant \varepsilon is added to the denominator of (2.46) to ensure that the weight update cannot become unbounded:

w(k+1) = w(k) + \frac{\mu e(k) x(k)}{\varepsilon + \|x(k)\|_2^2}    (2.98)

where 0 < \mu < 2 and \varepsilon > 0; typically 0.1 <= \mu <= 1.
83
Example 2.3
The same setup as Example 2.1, but using different LMS algorithms. The same initial weight vector, initial learning rate, and termination criterion are used.
84
Overview of basic learning rules for a single neuron Hebbian Learning
[Donald Hebb] The strength of a synapse between cells A and B increased slightly for the situation when the firing in A was followed by firing in B with a very small time delay. For two neurons on either side of a synapse that are synchronously
activated, then the strength of the synapse is increased. [Stent] expanded Hebb’s original statement to include the case
when two neurons on either side of a synapse are asynchronously activated, leading to a weakened synapse.
[Rumelhart] Adjust the strength of the connection between units A and B in proportion to the product of their simultaneous activation. If the product of the activations is positive, the modification to the
synaptic connection is more excitatory. If the product of the activations is negative, the modification to the
synaptic connection is more inhibitory.
85
Overview of basic learning rules for a single neuron (cont.)
Hebbian synapse Uses a highly local, time-dependent, and strongly
interactive mechanism to increase synaptic efficiency as a function of the correlation between the presynaptic and postsynaptic activity levels.
86
Overview of basic learning rules for a single neuron (cont.)
Four key properties of a Hebbian synapse. Time-dependent mechanism:
the changes in a Hebbian synapse depend on the precise time of occurrence of the presynaptic and postsynaptic activity levels.
Local mechanism: within a synapse, ongoing activity levels in the presynaptic and
postsynaptic units are used by a Hebbian synapse to produce an input-dependent, local synaptic modification.
Interactive mechanism: any form of Hebbian learning depends on the interaction between
presynaptic and postsynaptic activities.
Conjunctional (correlational) mechanism: the "co-occurrence" of presynaptic and postsynaptic activities within a
relatively short time interval is sufficient to produce a synaptic modification.
87
Overview of basic learning rules for a single neuron (cont.)
Synaptic activities can be categorized as:
Hebbian: a Hebbian synapse increases its strength when the presynaptic and postsynaptic activities are positively correlated, and its strength is decreased when the activities are either uncorrelated or negatively correlated.
Anti-Hebbian: an anti-Hebbian synapse enhances negatively correlated presynaptic and postsynaptic activities and weakens positively correlated activities.
Non-Hebbian: a non-Hebbian synapse does not involve the strongly interactive, highly local, time-dependent mechanism.
88
Overview of basic learning rules for a single neuron (cont.) Standard Hebbian learning for a single
neuron
89
Overview of basic learning rules for a single neuron (cont.)
The standard Hebbian learning rule for a single neuron follows from an energy function defined as

J(w) = -\varphi(w^T x) + \frac{\gamma}{2} \|w\|_2^2    (2.99)

The output of the neuron is

y = \frac{d\varphi(v)}{dv} = f(v), \quad v = w^T x    (2.100)

Taking the steepest descent approach to derive the continuous-time learning rule:

\frac{dw}{dt} = -\mu \nabla_w J(w)    (2.101)
90
Overview of basic learning rules for a single neuron (cont.)
The gradient of (2.99) is given as

\nabla_w J(w) = -\frac{d\varphi(v)}{dv} \nabla_w v + \gamma w = -y x + \gamma w    (2.102)

Using (2.102) and (2.101), the continuous-time standard Hebbian learning rule for a single neuron is given as

\frac{dw}{dt} = \mu (y x - \gamma w)    (2.103)

The discrete-time standard Hebbian learning rule is

w(k+1) = w(k) + \mu \left[ y(k) x(k) - \gamma w(k) \right]    (2.104)

The scalar discrete-time form is

w_j(k+1) = w_j(k) + \mu \left[ y(k) x_j(k) - \gamma w_j(k) \right]    (2.105)
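One step of the discrete rule (2.104) on a toy pattern shows the two terms at work: the Hebbian term strengthens the weight component correlated with the active input, while the forgetting term -gamma*w decays the inactive one (the numbers and names here are mine):

```python
def hebb_step(w, x, mu=0.1, gamma=0.05):
    """One discrete-time Hebbian update with weight decay (Eq. 2.104)."""
    y = sum(wi * xi for wi, xi in zip(w, x))   # linear activation y = w^T x
    return [wi + mu * (y * xi - gamma * wi) for wi, xi in zip(w, x)]

w0 = [0.2, 0.1]
w1 = hebb_step(w0, [1.0, 0.0])   # only the first input component is active
```

After the step, the weight on the active input has grown (y * x_1 > gamma * w_1) while the weight on the silent input has only decayed.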
91
Overview of basic learning rules for a single neuron (cont.) Generalized Hebbian learning rule
This can be considered a gradient optimization process when an appropriate energy (Lyapunov) function is selected:

\frac{dw}{dt} = -\mu \nabla_w J(w)    (2.106)

The resulting generalized Hebbian learning rule is given as

\frac{dw}{dt} = \mu \left[ \phi(w, x, v, y, d) x - \gamma w \right]    (2.107)

where \phi(w, x, v, y, d) is a learning signal. The discrete-time form is

w(k+1) = w(k) + \mu \left[ \phi(k) x(k) - \gamma w(k) \right]    (2.108)
92
Overview of basic learning rules for a single neuron (cont.)
Assume that the learning signal is the output of the neuron:

\phi(w, x, v, y, d) = y = \frac{d\varphi(v)}{dv} = f(v)    (2.109)

Therefore, (2.107) becomes

\frac{dw}{dt} = \mu (y x - \gamma w)    (2.110)
93
Overview of basic learning rules for a single neuron (cont.) Oja’s Learning Rule
Oja’s learning rule 可由最小化下列能量函數推導:
兩個基本假設: 神經元的權重向量是正規化的 (normalized) , ||w||2=1 神經元採用線性的激發函數 (linear activation
function) , y=wTx 因此能量函數可改寫成
xxewhere
ew
ˆ
)(2
221
2
2
2
22
12
12
1
wywxywxx
wyxywx
wyxw
TTT
TT
(2.111)
(2.112)
(2.114)
94
Overview of basic learning rules for a single neuron (cont.)
Using the steepest descent approach, Oja's continuous-time learning rule is obtained from

\frac{dw}{dt} = -\mu \nabla_w J(w)    (2.115)

where \mu > 0 is the learning rate. The gradient is

\nabla_w J(w) = -(y x - y^2 w)    (2.116)

Substituting (2.116) into (2.115) gives Oja's continuous-time learning rule

\frac{dw}{dt} = \mu y (x - y w) = \mu (y x - y^2 w), \quad \text{where } y = w^T x    (2.117)

Here \mu y x is the Hebbian co-occurrence term and \mu y^2 w is the active decay term.
95
Overview of basic learning rules for a single neuron (cont.)
Equation (2.117) can be rewritten in discrete-time form:

w(k+1) = w(k) + \mu y(k) \left[ x(k) - y(k) w(k) \right]    (2.119)

The scalar form of (2.119) can be written as

w_j(k+1) = w_j(k) + \mu y(k) \left[ x_j(k) - y(k) w_j(k) \right]    (2.120)

Typical (simple) form of Hebbian learning:

w_j(k+1) = w_j(k) + \mu y(k) x_j(k)    (2.121)
96
Example 2.4: We want to analyze zero-mean random data: 5000 stochastic vectors whose 1st component has a variance of 10, while the 2nd and 3rd components have variances of 0.002. Oja's learning rule is used with \mu = 0.001 and convergence condition \|w(k) - w(k-1)\|_2 <= 10^-6.
Initial weight vector: w(0) = [0.5949, -0.5585, 0.4811]^T.
After 1182 iterations: w = [1.000, -0.0007, 0.0002]^T,
which is very close to the 1st principal eigenvector w = [1.000, -0.0001, 0.0001]^T.
97
Overview of basic learning rules for a single neuron (cont.) Potential Learning Rule
No desired signal is required; this is a form of unsupervised learning. The learning is performed on the activity level of the neuron. The potential learning rule can be derived by minimizing the energy function

J(w) = \zeta(v) + \frac{\gamma}{2} \|w\|_2^2, \quad v = w^T x    (2.122)

where \zeta(\cdot) \ge 0 is the loss function.
98
Overview of basic learning rules for a single neuron (cont.)
The gradient of (2.122) is given as

\nabla_w J(w) = \zeta'(v) x + \gamma w    (2.123)

The learning signal is

\phi(v) = -\zeta'(v) = -\frac{d\zeta(v)}{dv}    (2.124)

Using Amari's result in (2.106) and the gradient in (2.123), the continuous-time potential learning rule is

\frac{dw}{dt} = \mu \left[ \phi(v) x - \gamma w \right]

The discrete-time form is

w(k+1) = w(k) + \mu \left[ \phi(v) x(k) - \gamma w(k) \right]    (2.125)

The discrete-time scalar form is

w_j(k+1) = w_j(k) + \mu \left[ \phi(v) x_j(k) - \gamma w_j(k) \right]    (2.126)
99
Overview of basic learning rules for a single neuron (cont.) Correlation Learning Rule
Minimizing the energy function

J(w) = -d \, w^T x + \frac{\gamma}{2} \|w\|_2^2    (2.127)

The gradient with respect to the synaptic weight vector is

\nabla_w J(w) = -d x + \gamma w    (2.128)

Using Amari's result (2.106) and (2.128), the continuous-time correlation learning rule is

\frac{dw}{dt} = \mu (d x - \gamma w)    (2.129)

The discrete-time form can be written as

w(k+1) = w(k) + \mu \left[ d(k) x(k) - \gamma w(k) \right]    (2.130)

The discrete-time scalar form is

w_j(k+1) = w_j(k) + \mu \left[ d(k) x_j(k) - \gamma w_j(k) \right]    (2.131)

The correlation learning rule is usually applied for recording data in memory networks; replacing d in (2.129) with y yields the Hebbian learning rule.
100
Overview of basic learning rules for a single neuron (cont.) Standard Perceptron Learning Rule
This can be derived by minimizing the MSE criterion

J(w) = \frac{1}{2} e^2    (2.132)

where

e = d - y    (2.133)

and the neuron output is

y = f(w^T x) = f(v)    (2.134)

Using the steepest descent approach, the continuous-time learning rule is given by

\frac{dw}{dt} = -\mu \nabla_w J(w)
101
Overview of basic learning rules for a single neuron (cont.)
The gradient of (2.132) is then

\nabla_w J(w) = -e \frac{df(v)}{dv} x    (2.135)

Defining the learning signal

\delta = e \frac{df(v)}{dv} = e f'(v) = e g(v)    (2.136)

the gradient becomes \nabla_w J(w) = -\delta x.
102
Overview of basic learning rules for a single neuron (cont.)
Using (2.134), (2.135), and (2.136), the continuous-time standard perceptron learning rule for a single neuron is

\frac{dw}{dt} = \mu \delta x    (2.137)

Equation (2.137) can be rewritten in discrete-time form:

w(k+1) = w(k) + \mu \delta(k) x(k)    (2.138)

The scalar form of (2.138) can be written as

w_j(k+1) = w_j(k) + \mu \delta(k) x_j(k)    (2.139)
103
Overview of basic learning rules for a single neuron (cont.) Generalized Perceptron Learning Rule
104
Overview of basic learning rules for a single neuron (cont.) Generalized Perceptron Learning Rule
When the energy function is not defined to be the MSE criterion, we can define a general energy function as

J(w) = \rho(e) = \rho(d - y)    (2.140)

where \rho is a differentiable function. If \rho(e) = \frac{1}{2} e^2, this reduces to the standard perceptron learning rule. The gradient is obtained by the chain rule:

\nabla_w J(w) = \frac{\partial \rho(e)}{\partial e} \frac{\partial e}{\partial y} \frac{\partial y}{\partial v} \frac{\partial v}{\partial w}    (2.141)

where

\frac{d\rho(e)}{de} = \rho'(e)    (2.142)

y = f(w^T x) = f(v)    (2.143)
105
Overview of basic learning rules for a single neuron (cont.)
f(\cdot) is a differentiable function, and

\frac{dy}{dv} = \frac{df(v)}{dv} = f'(v) = g(v)    (2.144)

Equation (2.141) can then be written as

\nabla_w J(w) = -\rho'(e) g(v) x    (2.145)

The continuous-time general perceptron learning rule is given as

\frac{dw}{dt} = \mu \rho'(e) g(v) x    (2.146)

If we define the learning signal as

\delta = \rho'(e) g(v)    (2.147)

then (2.146) can be written as

\frac{dw}{dt} = \mu \delta x    (2.148)
106
Overview of basic learning rules for a single neuron (cont.) Discrete-time form:

w(k+1) = w(k) + \mu \delta(k) x(k)    (2.149)

Discrete-time scalar form:

w_j(k+1) = w_j(k) + \mu \delta(k) x_j(k)    (2.150)
107
Data Preprocessing
The performance of a neural network is strongly dependent on the preprocessing that is performed on the training data.
Scaling: the training data can be amplitude-scaled in two ways, so that the values of the patterns lie between -1 and 1, or between 0 and 1. This is referred to as min/max scaling. MATLAB: premnmx.
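Min/max scaling in the style of MATLAB's premnmx can be sketched as follows (the helper name and interval parameters are my own):

```python
def minmax_scale(values, lo=-1.0, hi=1.0):
    """Linearly map the values so that min -> lo and max -> hi."""
    vmin, vmax = min(values), max(values)
    span = vmax - vmin
    return [lo + (hi - lo) * (v - vmin) / span for v in values]

scaled = minmax_scale([2.0, 4.0, 6.0, 10.0])
```

Using lo = 0 and hi = 1 instead gives the (0, 1) variant mentioned above.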
108
Data Preprocessing (cont.)
Another scaling process:
Mean centering, used when the training data contain biases.
Variance scaling, used when the training data have different units.
Suppose the input vectors are arranged column-wise in a matrix A \in R^{n \times m} and the target vectors column-wise in a matrix C \in R^{p \times m}.
Mean centering: compute the mean value of each row of A and C, then subtract from every element of A and C its corresponding row mean.
Variance scaling: compute the standard deviation of each row of A and C, then divide every element of A and C by its corresponding row standard deviation.
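Mean centering followed by variance scaling, applied to one row of A or C, can be sketched as (the helper name is mine):

```python
import math

def center_and_scale_row(row):
    """Subtract the row mean, then divide by the row standard deviation."""
    n = len(row)
    mean = sum(row) / n
    centered = [v - mean for v in row]                      # mean centering
    std = math.sqrt(sum(c * c for c in centered) / n)
    return [c / std for c in centered]                      # variance scaling

row = center_and_scale_row([1.0, 2.0, 3.0, 4.0])
```

The resulting row has zero mean and unit variance, so rows measured in different units become comparable.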
109
Data Preprocessing (cont.)
Transformations: features extracted from certain "raw" signals, used as training inputs, can provide better results than the raw signals themselves. A front-end feature extractor can be used to discern salient or distinguishing characteristics of the data. Four transform methods:
Fourier Transform Principal-Component Analysis Partial Least-Squares Regression Wavelets and Wavelet Transforms
110
Data Preprocessing (cont.)
Fourier Transform: the FFT can be used to extract the important features of the data, and these dominant characteristic features can then be used to train the neural network.
111
Data Preprocessing (cont.) Three signals with the same waveform but different phases; each signal has 1024 samples.
112
Data Preprocessing (cont.)
The FFT magnitude responses of the three signals are identical, and only 16 magnitude samples are needed.
The three signals with the same waveform but different phases differ in their FFT phase responses.
113
Data Preprocessing (cont.)
Principal-Component Analysis PCA can be used to “compress” the input training
data set, reduce the dimension of the inputs. By determining the important features of the data
according to an assessment of the variance of the data.
In MATLAB, prepca is provided to perform PCA on the training data
114
Data Preprocessing (cont.) Given a set of training data A \in R^{n \times m}, where it is assumed that m >> n (n denotes the dimension of the input training patterns and m the number of training patterns), PCA determines an "optimal" orthogonal transformation matrix W_{pca} \in R^{h \times n}, where h << n (the degree of dimension reduction). The dimension of the input vectors can be reduced according to the transformation

A_r = W_{pca} A, \quad A_r \in R^{h \times m}    (2.151)

where A_r is the reduced-dimension set of training patterns; the columns of A_r are the principal components for each of the inputs from A.
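A minimal sketch of the reduction A_r = W_pca A for h = 1: the single leading eigenvector of the data covariance is found by power iteration and used to project 2-D patterns down to 1-D (the toy data and names are mine, and this is only the h = 1 special case of PCA):

```python
import math

# toy zero-mean training patterns, one column per pattern (a 2 x 4 matrix A)
A = [[2.0, -1.9, 1.5, -1.6],    # first component: large variance
     [0.1, -0.2, 0.2, -0.1]]   # second component: small variance

m = len(A[0])
# covariance matrix C = (1/m) * A * A^T for the zero-mean rows of A
C = [[sum(A[i][k] * A[j][k] for k in range(m)) / m for j in range(2)]
     for i in range(2)]

# power iteration toward the leading eigenvector of C
w = [1.0, 1.0]
for _ in range(100):
    w = [C[0][0] * w[0] + C[0][1] * w[1],
         C[1][0] * w[0] + C[1][1] * w[1]]
    norm = math.sqrt(w[0] ** 2 + w[1] ** 2)
    w = [wi / norm for wi in w]

# reduced one-dimensional patterns: A_r = w^T A  (the h = 1 case of Eq. 2.151)
A_r = [w[0] * A[0][k] + w[1] * A[1][k] for k in range(m)]
```

Because the first row of A carries almost all the variance, the leading eigenvector points almost entirely along that component.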
115
Data Preprocessing (cont.) Partial Least-Squares Regression
PLSR can be used to compress the input training data set. Restricted for use with supervised trained neural networks. Only scalar target values are allowed. The factor analysis in PLSR can determine the degree of
compression of the input data. After the optimal number of PLSR factor h has been
determined, the weight loading vectors can be used to transform the data similar to the PCA approach.
The optimal number of weight loading vectors can form an orthogonal transformation matrix W_{plsr} \in R^{n \times h} as the columns of the matrix. The dimension of the input vectors can be reduced according to the transformation

A_r = W_{plsr}^T A    (2.152)
116
Data Preprocessing (cont.)
PCA and PLSR orthogonal transformation vectors used for data compression
PLSR uses both the input data and the target data to generate the weight loading vectors of the orthogonal transformation W_plsr.
117
Data Preprocessing (cont.)
Wavelets and Wavelet Transforms A wave is an oscillating function of time. Fourier analysis is used for analyzing waves:
certain functions can be expanded in terms of sinusoidal waves, revealing how much of each frequency component is required to synthesize the signal.
It is very useful for periodic, time-invariant, stationary signal analysis. A wavelet can be considered a small wave whose energy is
concentrated. Wavelets are useful for analyzing signals that are time-varying, transient, and
nonstationary, allowing simultaneous time and frequency analysis; wavelets are local waves. The wavelet transform can provide a time-frequency description of
signals and can be used to compress data for training neural networks.