Chapter 2 Fundamental Neurocomputing Concepts
1
Chapter 2 Fundamental Neurocomputing Concepts
National Yunlin University of Science and Technology, Graduate Institute of Computer Science and Information Engineering
Chuan-Yu Chang, Ph.D. Office: EB212 TEL: 05-5342601 ext. 4337 E-mail: [email protected] http://MIPL.yuntech.edu.tw
2
Basic Models of Artificial Neurons An artificial neuron can be referred to as a
processing element, node, or threshold logic unit. There are four basic components of a neuron:
A set of synapses with associated synaptic weights.
A summing device: each input is multiplied by its associated synaptic weight and then summed.
An activation function, which serves to limit the amplitude of the neuron output.
A threshold, externally applied, which lowers the cumulative input to the activation function.
3
Basic Models of Artificial neurons
4
Basic Models of Artificial neurons
The output of the linear combiner is

u_q = \sum_{j=1}^{n} w_{qj} x_j = w_q^T x    (2.2)

where w_q = [w_{q1}, w_{q2}, ..., w_{qn}]^T \in R^n.

The output of the activation function is

y_q = f(v_q) = f(u_q - \theta_q)    (2.3)

The output of the neuron is given by

y_q = f\left( \sum_{j=1}^{n} w_{qj} x_j - \theta_q \right)    (2.4)
5
Basic Models of Artificial neurons The threshold (or bias) is incorporated into the synaptic
weight vector wq for neuron q.
6
Basic Models of Artificial neurons
The effective internal activation potential is written as

v_q = \sum_{j=0}^{n} w_{qj} x_j = w_q^T x

where the fixed input x_0 lets the weight w_{q0} carry the threshold (or bias) term.

The output of neuron q is written as

y_q = f(v_q)
7
Basic Activation Functions
The activation function (also called the transfer function) can be linear or nonlinear.
Linear (identity) activation function:

y_q = f_lin(v_q) = v_q
8
Basic Activation Functions
Hard limiter: a binary (threshold) function with outputs (0, 1). The output of the binary hard limiter can be written as

y_q = f_hl(v_q) = \begin{cases} 0 & \text{if } v_q < 0 \\ 1 & \text{if } v_q \ge 0 \end{cases}

Hard limiter activation function
9
Basic Activation Functions
Bipolar, symmetric hard limiter with outputs (-1, 1); sometimes referred to as the signum (or sign) function. The output of the symmetric hard limiter can be written as

y_q = f_shl(v_q) = \begin{cases} -1 & \text{if } v_q < 0 \\ 0 & \text{if } v_q = 0 \\ 1 & \text{if } v_q > 0 \end{cases}

Symmetric hard limiter activation function
10
Basic Activation Functions
Saturating linear function (piecewise linear function). The output of the saturating linear function is given by

y_q = f_sl(v_q) = \begin{cases} 0 & \text{if } v_q \le -1/2 \\ v_q + 1/2 & \text{if } -1/2 < v_q < 1/2 \\ 1 & \text{if } v_q \ge 1/2 \end{cases}

Saturating linear activation function
11
Basic Activation Functions
Symmetric saturating linear function. The output of the symmetric saturating linear function is given by

y_q = f_ssl(v_q) = \begin{cases} -1 & \text{if } v_q \le -1 \\ v_q & \text{if } -1 < v_q < 1 \\ 1 & \text{if } v_q \ge 1 \end{cases}

Symmetric saturating linear activation function
12
Basic Activation Functions
Sigmoid function (S-shaped function): binary sigmoid function. The output of the binary sigmoid function is given by

y_q = f_bs(v_q) = \frac{1}{1 + e^{-\alpha v_q}}

where \alpha is the slope parameter of the binary sigmoid function.

Binary sigmoid function

Whereas the hard limiter has no derivative at the origin, the binary sigmoid is a continuous and differentiable function.
13
Basic Activation Functions
The derivative of the binary sigmoid function, shown for two different values of the slope parameter:

g(v_q) = \frac{d f_bs(v_q)}{d v_q} = \frac{\alpha e^{-\alpha v_q}}{(1 + e^{-\alpha v_q})^2} = \alpha f_bs(v_q) [1 - f_bs(v_q)]
14
Basic Activation Functions
Sigmoid function (S-shaped function): bipolar sigmoid function (hyperbolic tangent sigmoid). The output of the bipolar sigmoid function is given by

y_q = f_hts(v_q) = \tanh(\beta v_q) = \frac{e^{\beta v_q} - e^{-\beta v_q}}{e^{\beta v_q} + e^{-\beta v_q}} = \frac{1 - e^{-2\beta v_q}}{1 + e^{-2\beta v_q}}
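The activation functions of this section can be collected into a small sketch (the function names mirror the text's f_hl, f_shl, f_sl, f_bs, f_hts notation, but the code itself is mine, not from the text):

```python
import math

def f_hl(v):
    """Binary hard limiter: outputs 0 or 1."""
    return 1.0 if v >= 0 else 0.0

def f_shl(v):
    """Symmetric hard limiter (signum): outputs -1, 0, or 1."""
    if v > 0:
        return 1.0
    if v < 0:
        return -1.0
    return 0.0

def f_sl(v):
    """Saturating linear function: linear between -1/2 and 1/2."""
    if v >= 0.5:
        return 1.0
    if v <= -0.5:
        return 0.0
    return v + 0.5

def f_bs(v, alpha=1.0):
    """Binary sigmoid with slope parameter alpha."""
    return 1.0 / (1.0 + math.exp(-alpha * v))

def f_hts(v, beta=1.0):
    """Hyperbolic tangent (bipolar) sigmoid with slope parameter beta."""
    return math.tanh(beta * v)
```

At v = 0 the two sigmoids pass through 0.5 and 0 respectively, which is why they are called binary and bipolar.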
15
Basic Activation Functions
The effect of the threshold \theta_q and bias b_q can be illustrated by observing the binary sigmoid activation function. Three plots of the binary sigmoid function (\alpha = 1): threshold = 2 (\theta_q = 2), bias = 2 (b_q = 2), and the nominal case (\theta_q = b_q = 0).
Applying a threshold is analogous to delaying a time-domain signal; adding a bias is analogous to an advance of a signal.
16
The Hopfield Model of the Artificial Neuron Hopfield clearly described the relationship between neural-network theory and practice. The Hopfield neural network is an asynchronous, parallel-processing, fully interconnected network.
Content-addressable memory (associative memory):
Retrieving a stored pattern in response to the presentation of a noisy or incomplete version of that pattern.
Discrete-time model of the Hopfield neuron
17
The Hopfield Model of the Artificial Neuron
The unit delay z-1 delays the output of the activation function by one sample period to give yq(k).
The output of the neuron before the unit delay z^{-1} is written as

y_q(k+1) = f_shl(v_q(k+1))    (2.16)

where

v_q(k+1) = \sum_{j=1}^{n} w_{qj} x_j(k)    (2.17)

Using (2.16) and (2.17), the output of neuron q at time k+1 is

y_q(k+1) = f_shl\left( \sum_{j=1}^{n} w_{qj} x_j(k) \right)    (2.18)
18
The Hopfield Model of the Artificial Neuron Continuous-time model of the Hopfield artificial neuron: T_{cq} = R_q C_q is the integration time constant of the qth neuron, and \theta_q is an externally applied threshold. The integrator can be realized with an op-amp, capacitor C_q, and resistor R_q. \kappa_q > 0 is called the leakage factor of the integrator.

T_{cq} \frac{d v_q}{dt} = -\kappa_q v_q + \sum_{j=1}^{n} w_{qj} x_j + \theta_q

y_q = f_hts(v_q)
19
Adaline and Madaline
Least-Mean-Square (LMS) Algorithm (Widrow-Hoff learning rule, delta rule). The LMS is an adaptive algorithm that computes adjustments of the neuron's synaptic weights. The algorithm is based on the method of steepest descent. It adjusts the neuron weights to minimize the mean squared error between the inner product of the weight vector with the input vector and the desired output of the neuron.
Adaline (adaptive linear element): a single neuron whose synaptic weights are updated according to the LMS algorithm.
Madaline (multiple Adaline).
20
Simple adaptive linear combiner
v(k) = w^T(k) x(k) = x^T(k) w(k)

inputs x_0 = 1, w_0 = \theta (bias)
21
Simple adaptive linear combiner The difference between the desired response and
the network response is
The MSE criterion can be written as
Expanding Eq(2.23)
e(k) = d(k) - v(k) = d(k) - w^T x(k)    (2.22)

J(w) = \frac{1}{2} E[e^2(k)] = \frac{1}{2} E\left[ (d(k) - w^T x(k))^2 \right]    (2.23)

J(w) = \frac{1}{2} E[d^2(k)] - E[d(k) x^T(k)] w + \frac{1}{2} w^T E[x(k) x^T(k)] w    (2.24)

= \frac{1}{2} E[d^2(k)] - p^T w + \frac{1}{2} w^T C_x w    (2.25)
22
Simple adaptive linear combiner Cross correlation vector between the desired
response and the input patterns
Covariance matrix for the input pattern
p = E[d(k) x(k)]    (2.26)

C_x = E[x(k) x^T(k)]    (2.27)

The MSE surface of J(w) has a single minimum, so we solve for the weights at which the gradient is zero:

\nabla_w J(w) = -p + C_x w = 0

The optimal weight vector is therefore

w^* = C_x^{-1} p
23
The LMS Algorithm
The expression above has two limitations: computing the inverse of the covariance matrix is time-consuming, and it is unsuitable for real-time weight updating because, in most situations, the covariance matrix and the cross-correlation vector are not known in advance. To avoid these problems, Widrow and Hoff proposed the LMS algorithm:
To obtain the optimal values of the synaptic weights when J(w) is minimum, search the error surface using a gradient descent method to find the minimum value. We can reach the bottom of the error surface by changing the weights in the direction of the negative gradient of the surface.
24
The LMS Algorithm
Typical MSE surface of an adaptive linear combiner
25
The LMS Algorithm Because the gradient on the surface cannot be computed
without knowledge of the input covariance matrix and the cross-correlation vector, these must be estimated during an iterative procedure.
Estimation of the MSE gradient surface can be obtained by taking the gradient of the instantaneous error surface.
The gradient of J(w) is approximated as

\nabla_w J(w) \approx \nabla_w \left[ \frac{1}{2} e^2(k) \right] = e(k) \nabla_w e(k) = -e(k) x(k)    (2.28)

The learning rule for updating the weights using the steepest descent gradient method is

w(k+1) = w(k) - \mu \nabla_w J(w) = w(k) + \mu e(k) x(k)    (2.29)
The learning rate \mu specifies the magnitude of the update step for the weights in the negative gradient direction.
26
The LMS Algorithm
If the value of \mu is chosen to be too small, the learning algorithm will modify the weights slowly and a relatively large number of iterations will be required.
If the value of \mu is set too large, the learning rule can become numerically unstable, leading to weights that do not converge.
27
The LMS Algorithm
The scalar form of the LMS algorithm can be written from (2.22) and (2.29) as

e(k) = d(k) - \sum_{h=1}^{n} w_h(k) x_h(k)    (2.30)

w_i(k+1) = w_i(k) + \mu e(k) x_i(k)    (2.31)

From (2.29) and (2.31), an upper bound must be placed on the learning rate \mu to maintain network stability (Haykin, 1996):

0 < \mu < \frac{2}{\lambda_{max}}    (2.32)

where \lambda_{max} is the largest eigenvalue of the input covariance matrix C_x.
28
The LMS Algorithm
For minimally acceptable stability of LMS convergence, the learning rate can be bounded as

0 < \mu < \frac{2}{trace(C_x)}    (2.33)

Equation (2.33) is a reasonable approximation, because

trace(C_x) = \sum_{h=1}^{n} c_{hh} \ge \lambda_{max}    (2.34)
29
The LMS Algorithm
From (2.32) and (2.33), determining the learning rate requires at least computing the covariance matrix of the input patterns, which is difficult to achieve in practical applications. Even when it can be obtained, a fixed learning rate of this kind is problematic for the accuracy of the results. Robbins and Monro's root-finding algorithm therefore introduced a learning rate that varies with time (stochastic approximation):

\mu(k) = \frac{\alpha}{k}    (2.35)

where \alpha is a very small constant. Drawback: the learning rate decreases too quickly.
30
The LMS Algorithm
Ideally, the learning rate should be relatively large at the start of training and then decrease gradually during the learning process (schedule-type adjustment).
Darken and Moody's search-then-converge algorithm:

\mu(k) = \frac{\mu_0}{1 + k/\tau}    (2.36)

Search phase: \mu is relatively large and almost constant. Converge phase: \mu decreases exponentially toward zero.
\mu_0 > 0 and \tau >> 1, typically 100 <= \tau <= 500. These methods of adjusting the learning rate are commonly called learning rate schedules.
31
The LMS Algorithm
Adaptive normalization approach (non-schedule-type): \mu is adjusted according to the input data at every time step:

\mu(k) = \frac{\mu_0}{\|x(k)\|_2^2}    (2.37)

where \mu_0 is a fixed constant. Stability is guaranteed if 0 < \mu_0 < 2; the practical range is 0.1 <= \mu_0 <= 1.
32
The LMS Algorithm Comparison of two learning rate schedules: stochastic approximation
schedule and the search-then-converge schedule.
33
Summary of the LMS algorithm
Step 1: Set k = 1, initialize the synaptic weight vector w(k = 1), and select values for \mu_0 and \tau.
Step 2: Compute the learning rate parameter

\mu(k) = \frac{\mu_0}{1 + k/\tau}

Step 3: Compute the error

e(k) = d(k) - \sum_{h=1}^{n} w_h(k) x_h(k)

Step 4: Update the synaptic weights

w_i(k+1) = w_i(k) + \mu(k) e(k) x_i(k)

Step 5: If convergence is achieved, stop; else set k = k + 1 and go to step 2.
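The five steps above can be sketched as a small training loop (a minimal sketch with my own toy data and function name; the convergence check of step 5 is simplified to a fixed number of epochs):

```python
def lms_train(patterns, targets, mu0=0.2, tau=500.0, epochs=200):
    # Step 1: initialize the weight vector and the iteration counter
    w = [0.0] * len(patterns[0])
    k = 1
    for _ in range(epochs):
        for x, d in zip(patterns, targets):
            mu = mu0 / (1.0 + k / tau)                      # Step 2: mu(k)
            e = d - sum(wi * xi for wi, xi in zip(w, x))    # Step 3: error
            w = [wi + mu * e * xi for wi, xi in zip(w, x)]  # Step 4: update
            k += 1                                          # Step 5 (simplified)
    return w

# identify the linear model d = 2*x1 - x2 from input/output pairs
X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, -0.5)]
D = [2.0 * x1 - x2 for x1, x2 in X]
w = lms_train(X, D)
```

Since the target is exactly realizable by a linear model, the weights converge to the true coefficients [2, -1].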
34
Example 2.1: Parametric system identification Input data consist of 1000 zero-mean Gaussian random vectors
with three components. The bias is set to zero. The variance of the components of x are 5, 1, and 0.5. The assumed linear model is given by b=[1, 0.8, -1]T.
To generate the target values, the 1000 input vectors are used to form a matrix X=[x1x2…x1000], the desired outputs are computed according to d=bTX
The progress of the learning rate parameter as it is adjusted according to the search-then converge schedule.
d = b^T X

C_x = \frac{1}{1000} \sum_{h=1}^{1000} x_h x_h^T = \frac{1}{1000} X X^T

\mu_0 = 0.9 \left( \frac{2}{\lambda_{max}} \right) = 0.1936, \quad \tau = 200

The learning process was terminated when J = \frac{1}{2} e^2(k) \le 2 \times 10^{-8}.
35
Example 2.1 (cont.) Parametric system identification: estimating a parameter vector
associated with a dynamic model of a system given only input/output data from the system.
The root mean square (RMS) value of the performance measure.
36
Adaline and Madaline
Adaline It is an adaptive pattern classification network trained by
the LMS algorithm.
x_0(k) = 1, with an adjustable bias (weight w_0).
The Adaline produces a bipolar (+1, -1) output; with a different activation function, a (0, 1) output is possible.

e(k) = d(k) - v(k), \quad \tilde{e}(k) = d(k) - y(k)
37
Adaline
Linear error: the difference between the desired output and the output of the linear combiner,

e(k) = d(k) - v(k)

Quantizer error: the difference between the desired output and the output of the symmetric hard limiter,

\tilde{e}(k) = d(k) - y(k)
38
Adaline
The Adaline training procedure: the input vector x must be presented to the Adaline together with the corresponding desired output d. The synaptic weights w are adjusted dynamically according to the linear LMS algorithm. The activation function is not used during training (it is used only in the testing phase). Once the network weights have been properly adjusted, the response of the Adaline can be tested with patterns that were not used in training. If the Adaline responds with high accuracy to the test inputs, the network is said to have achieved generalization.
39
Adaline One common application of the Adaline is for the
realization of a small class of logic functions: Only those logic functions that are linearly separable can be
realized by the Adaline. AND
y = AND(x_1, x_2, ..., x_n) = \begin{cases} 1 & \text{if all } x_j = 1, \; j = 1, ..., n \\ -1 & \text{otherwise} \end{cases}

realized as y = sgn\left( \sum_{j=1}^{n} x_j + (1 - n) \right), with bias weight w_0(k).
40
Adaline OR
y = OR(x_1, x_2, ..., x_n) = \begin{cases} 1 & \text{if some } x_j = 1, \; j = 1, ..., n \\ -1 & \text{otherwise} \end{cases}

realized as y = sgn\left( \sum_{j=1}^{n} x_j + (n - 1) \right), with bias weight w_0(k).
41
Adaline Majority
y = MAJ(x_1, x_2, ..., x_n) = \begin{cases} 1 & \text{if the majority of } x_j = 1 \\ -1 & \text{otherwise} \end{cases}

realized as y = sgn\left( \sum_{j=1}^{n} x_j \right).
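The three realizations above amount to fixing all input weights to 1 and choosing only the bias; a hand-set sketch with bipolar inputs (no training involved; the helper names are mine):

```python
def sgn(v):
    """Signum with sgn(0) = 1, adequate for these logic realizations."""
    return 1 if v >= 0 else -1

def adaline_logic(x, bias):
    """Linear combiner with unit weights and a fixed bias, then a hard limiter."""
    return sgn(sum(x) + bias)

def AND(x):
    return adaline_logic(x, 1 - len(x))   # fires only if all x_j = +1

def OR(x):
    return adaline_logic(x, len(x) - 1)   # fires if some x_j = +1

def MAJ(x):
    return adaline_logic(x, 0)            # fires if the majority of x_j = +1
```

With n = 3 bipolar inputs, AND([1, 1, -1]) gives -1 while MAJ([1, 1, -1]) gives +1, matching the truth tables.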
42
Adaline Linear separability
The Adaline acts as a classifier which separates all possible input patterns into two categories.
The output of the linear combiner is given as
v(k) = w_1(k) x_1(k) + w_2(k) x_2(k) + w_0(k)

Let v(k) = 0; then

w_1(k) x_1(k) + w_2(k) x_2(k) + w_0(k) = 0

or

x_2(k) = -\frac{w_1(k)}{w_2(k)} x_1(k) - \frac{w_0(k)}{w_2(k)}
43
Adaline: linear separability of the Adaline
The Adaline can separate only linearly separable patterns.
44
Adaline: a nonlinear separation problem
Since the separating boundary is not a straight line, the Adaline cannot be used to accomplish this task.
45
Adaline Adaline with nonlinearly transformed inputs (polynomial
discriminant function) To solve the classification problem for patterns that are not
linearly separable, the inputs to the Adaline can be preprocessed with fixed nonlinearities. (polynomial discriminant function)
v(k) = w_0(k) + w_1(k) x_1 + w_2(k) x_2 + w_3(k) x_1 x_2 + w_4(k) x_1^2 + w_5(k) x_2^2    (2.45)
46
Adaline The critical thresholding condition for this Adaline with nonlinearly
transformed inputs occurs when v(k) in (2.45) is set to zero. Realizing a nonlinearly separable function (XNOR)
If the appropriate nonlinearities are chosen, the network can be trained to separate the input space into two subspaces which are not linearly separable.
47
Adaline (cont.) Linear error correction rules
There are two basic linear correction rules for dynamically adjusting the network weights (the change in the weights is related to the difference between the actual network output and the desired output).
\mu-LMS: the same as (2.22) and (2.29),

w(k+1) = w(k) - \mu \nabla_w J(w) = w(k) + \mu e(k) x(k)

\alpha-LMS: a self-normalizing version of the \mu-LMS learning rule,

w(k+1) = w(k) + \frac{\alpha e(k) x(k)}{\|x(k)\|_2^2}    (2.46)

The \alpha-LMS algorithm is based on the minimal-disturbance principle: when the weights are adjusted to accommodate a new pattern, the responses to previously learned patterns should be disturbed as little as possible.
\mu-LMS is based on minimizing the MSE surface; \alpha-LMS instead updates the weights so as to reduce the current error.
48
Adaline (cont.) Consider the change in the error for \alpha-LMS:

\Delta e(k) = \Delta \left[ d(k) - w^T(k) x(k) \right] = -\Delta w^T(k) x(k)    (2.47)

Substituting the \alpha-LMS update \Delta w(k) = \alpha e(k) x(k) / \|x(k)\|_2^2 into (2.47) gives

\Delta e(k) = -\frac{\alpha e(k) x^T(k) x(k)}{\|x(k)\|_2^2} = -\alpha e(k)    (2.48)

From (2.47), the choice of \alpha controls stability and speed of convergence; \alpha is typically set in the range 0.1 < \alpha < 1.
\alpha-LMS is called self-normalizing because the choice of \alpha is independent of the magnitude of the network inputs.
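The property \Delta e(k) = -\alpha e(k) of (2.47)-(2.48) can be checked directly with a one-step sketch of the \alpha-LMS update (2.46); the toy numbers and helper name are mine:

```python
def alpha_lms_step(w, x, d, alpha=0.5):
    """One alpha-LMS update (Eq. 2.46); returns new weights and the error before the update."""
    v = sum(wi * xi for wi, xi in zip(w, x))
    e = d - v
    norm2 = sum(xi * xi for xi in x)
    w_new = [wi + alpha * e * xi / norm2 for wi, xi in zip(w, x)]
    return w_new, e

w = [0.0, 0.0]
x, d = [1.0, 2.0], 3.0
w, e0 = alpha_lms_step(w, x, d, alpha=0.5)
# re-presenting the same pattern: the error has shrunk by the factor (1 - alpha)
_, e1 = alpha_lms_step(w, x, d, alpha=0.5)
```

Here e0 = 3 and, after one update with alpha = 0.5, the error on the same pattern is e1 = 1.5, exactly (1 - alpha) * e0.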
49
Adaline (cont.) Detailed comparison of the \alpha-LMS and \mu-LMS rules. From (2.46),

w(k+1) = w(k) + \frac{\alpha e(k) x(k)}{\|x(k)\|_2^2} = w(k) + \alpha \left( \frac{d(k)}{\|x(k)\|_2} - w^T(k) \frac{x(k)}{\|x(k)\|_2} \right) \frac{x(k)}{\|x(k)\|_2}    (2.49)

Define the normalized desired response and normalized training vector

\hat{d}(k) = \frac{d(k)}{\|x(k)\|_2}, \quad \hat{x}(k) = \frac{x(k)}{\|x(k)\|_2}    (2.50-51)

Equation (2.49) can then be rewritten as

w(k+1) = w(k) + \alpha \left( \hat{d}(k) - w^T(k) \hat{x}(k) \right) \hat{x}(k)    (2.52)

This has the same form as \mu-LMS,

w(k+1) = w(k) - \mu \nabla_w J(w) = w(k) + \mu e(k) x(k)

so \alpha-LMS is simply \mu-LMS applied to normalized input patterns.
50
Multiple Adaline (Madaline)
A single Adaline cannot solve problems whose decision regions are not linearly separable. Multiple Adalines can be used instead: a multiple-Adaline network is called a Madaline.
Madaline I: a single-layer network with a single output. Madaline II: a multi-layer network with multiple outputs.
51
Example of Madaline I network consisting of three Adalines
The output unit may be OR, AND, or MAJ.
52
Two-layer Madaline II architecture
53
Madaline I realization of an XNOR logic function
54
Madaline I realization of an XNOR logic function (cont.)
v_1(k) = w_{11}(k) x_1(k) + w_{21}(k) x_2(k) + w_{01}(k) x_0(k)

with x_0(k) = 1. Setting v_1(k) = 0, dividing through by w_{21}(k), and rearranging gives

w_{11}(k) x_1(k) + w_{21}(k) x_2(k) + w_{01}(k) = 0

or

x_2(k) = -\frac{w_{11}(k)}{w_{21}(k)} x_1(k) - \frac{w_{01}(k)}{w_{21}(k)}
55
Similarly,

v_2(k) = w_{12}(k) x_1(k) + w_{22}(k) x_2(k) + w_{02}(k) x_0(k)

with x_0(k) = 1. Setting v_2(k) = 0, dividing through by w_{22}(k), and rearranging gives

w_{12}(k) x_1(k) + w_{22}(k) x_2(k) + w_{02}(k) = 0

or

x_2(k) = -\frac{w_{12}(k)}{w_{22}(k)} x_1(k) - \frac{w_{02}(k)}{w_{22}(k)}

Madaline I realization of an XNOR logic function (cont.)
56
Madaline I separation properties for the XNOR problem
57
Madaline Learning Strategies There are two Madaline learning strategies:
Madaline rule I, MRI, for Madaline I The basic idea is to adjust those weights for the neuron
whose linear output vj(k) is closest to the desired response.
MRI follows the minimal-disturbance principle. Madaline rule II, MRII, for Madaline II
The weights are initially set to small random values. The training patterns are presented in a random fashion
with the objective of minimizing the average Hamming error over the training set. (backpropagation)
58
Simple Perceptron
Simple perceptron (single-layer perceptron): very similar to the Adaline; proposed by Frank Rosenblatt (1950s).
Minsky and Papert identified a serious limitation: the perceptron cannot solve the XOR problem.
With an appropriate processing layer, the XOR problem (or, more generally, the parity function) can be solved.
The simple perceptron is related to the maximum-likelihood Gaussian classifier of classical pattern classification; both can be viewed as linear classifiers.
Most perceptron training is supervised, but some is self-organizing.
59
Simple Perceptron (cont.)
In Rosenblatt’s early work The perceptron had three layers
Sensory surface (retina): Association area (A unit) Response unit (R unit)
It was not allowed to have more than one R unit on at a time. (Winner-take-all)
60
Simple Perceptron (cont.)
Original Rosenblatt perceptron: binary inputs, no bias.
Modified perceptron: bipolar inputs and a bias term; output y \in {-1, 1}.
61
Simple Perceptron (cont.) The quantizer error is used to adjust the synaptic weights of the neuron. The adaptive algorithm for adjusting the neuron weights (the perceptron learning rule) is given as

w(k+1) = w(k) + \frac{\alpha \tilde{e}(k)}{2} x(k)    (2.55)

where

\tilde{e}(k) = d(k) - y(k) = d(k) - sgn(w^T(k) x(k))    (2.56)

Rosenblatt normally set \alpha to unity. The choice of the learning rate \alpha does not affect the numerical stability of the perceptron learning rule, but it can affect the speed of convergence.

Compare with (2.46): w(k+1) = w(k) + \frac{\alpha e(k) x(k)}{\|x(k)\|_2^2}
62
Simple Perceptron (cont.)
The perceptron learning rule is considered a nonlinear algorithm.
The perceptron learning rule performs the update of the weights until all the input patterns are classified correctly. The quantizer error will be zero for all training pattern inputs, and
no weight adjustments will occur.
The weights are not guaranteed to be optimal.
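A minimal sketch of the rule (2.55)-(2.56) on a linearly separable problem (bipolar AND with a bias input x_0 = 1; the data and names are mine, not the text's):

```python
def sgn(v):
    return 1 if v >= 0 else -1

def perceptron_train(patterns, targets, alpha=1.0, epochs=20):
    """Perceptron learning rule: w <- w + (alpha/2) * quantizer_error * x."""
    w = [0.0] * len(patterns[0])
    for _ in range(epochs):
        for x, d in zip(patterns, targets):
            y = sgn(sum(wi * xi for wi, xi in zip(w, x)))
            e = d - y                    # quantizer error: 0 when classified correctly
            w = [wi + alpha * e * xi / 2 for wi, xi in zip(w, x)]
    return w

# bipolar AND, with a bias input x0 = 1 prepended to each pattern
X = [(1, -1, -1), (1, -1, 1), (1, 1, -1), (1, 1, 1)]
D = [-1, -1, -1, 1]
w = perceptron_train(X, D)
```

Once every pattern is classified correctly, the quantizer error is zero and the weights stop changing, as stated above.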
63
Simple Perceptron
Mays’s perceptron learning rules Mays 對標準的 perceptron learning rules 提出兩個修改: Increment adaptation algorithm
Modified relaxation algorithm
kvifkx
kxkdkw
kvifkx
kxkekw
kw )(
2
)(~
1
2
2
2
2
otherwise
kx
kxkekw
kekvifkw
kw2
2
)(
0)(~ and
1
(2.58)
(2.57)
64
Simple Perceptron with a Sigmoid Activation Function
The learning rule is based on the method of steepest descent and attempts to minimize an instantaneous performance function.
65
Simple Perceptron with a Sigmoid Activation Function (cont.)
The learning algorithm can be derived from the MSE criterion

J_q(w) = \frac{1}{2} E\left[ \tilde{e}_q^2 \right]    (2.59)

where \tilde{e}_q = d_q - y_q    (2.60)

The instantaneous performance function to be minimized is given as

\hat{J}_q(w) = \frac{1}{2} \tilde{e}_q^2(k) = \frac{1}{2} \left[ d_q^2(k) - 2 d_q(k) y_q(k) + y_q^2(k) \right]    (2.61)

where y_q(k) = f(v_q(k)) = f(x^T(k) w_q(k)).
66
Simple Perceptron with a Sigmoid Activation Function (cont.)
Assume the activation function is the hyperbolic tangent sigmoid; the neuron output can then be written as

y_q(k) = f_hts(v_q(k)) = \tanh(\beta v_q(k))    (2.62)

Following (2.15), the derivative of the hyperbolic tangent sigmoid is

g(v_q) = f'(v_q) = \beta \left[ 1 - f^2(v_q) \right]    (2.63)

Using steepest descent, the discrete-time learning rule is (cf. Eq. 2.29)

w_q(k+1) = w_q(k) - \mu \nabla_{w_q} J    (2.64)
67
Simple Perceptron with a Sigmoid Activation Function (cont.)
Computing the gradient in (2.64):

\nabla_{w_q} J = -\left[ d_q(k) - f(v_q(k)) \right] f'(v_q(k)) x(k) = -\tilde{e}_q(k) f'(v_q(k)) x(k)    (2.65)

Substituting (2.63) into (2.65):

\nabla_{w_q} J = -\tilde{e}_q(k) \beta \left[ 1 - y_q^2(k) \right] x(k)    (2.66)

Using the gradient in (2.66), the discrete-time learning rule for the simple perceptron can be written as

w_q(k+1) = w_q(k) + \mu \beta \tilde{e}_q(k) \left[ 1 - y_q^2(k) \right] x(k)    (2.67)
68
Simple Perceptron with a Sigmoid Activation Function (cont.)
Equation (2.67) can be rewritten in scalar form as

w_{qj}(k+1) = w_{qj}(k) + \mu \beta \tilde{e}_q(k) \left[ 1 - y_q^2(k) \right] x_j(k)    (2.68)

where

\tilde{e}_q(k) = d_q(k) - y_q(k)    (2.69)

y_q(k) = f(v_q(k)) = f\left( \sum_{j=1}^{n} x_j(k) w_{qj}(k) \right)    (2.70)

Equations (2.68), (2.69), and (2.70) are the standard form of the backpropagation training algorithm.
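The scalar update (2.68)-(2.70) can be sketched for a single fixed pattern with a tanh activation (beta = 1); the pattern, initial weights, and helper name here are my own illustration, not Example 2.2's actual data:

```python
import math

def train_sigmoid_neuron(x, d, mu=0.25, beta=1.0, goal=1e-8, max_iter=10000):
    """Train one tanh neuron on a single pattern until 0.5*e^2 <= goal."""
    w = [0.1] * len(x)
    y = 0.0
    for _ in range(max_iter):
        v = sum(wi * xi for wi, xi in zip(w, x))
        y = math.tanh(beta * v)
        e = d - y                                  # quantizer-free error (2.69)
        if 0.5 * e * e <= goal:
            break
        # update (2.68): w_j <- w_j + mu*beta*e*(1 - y^2)*x_j
        w = [wi + mu * beta * e * (1 - y * y) * xi for wi, xi in zip(w, x)]
    return w, y

w, y = train_sigmoid_neuron([1.0, -1.0, 0.5], 0.5)
```

The (1 - y^2) factor is the tanh derivative of (2.63); it shrinks the step as the output saturates.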
69
Example 2.2: The architecture of Figure 2.30 is applied to learn the character "E". The character image consists of a 5x5 array, i.e., 25 pixels (column major). The learning rule is Eq. (2.67), with \beta = 1 and \mu = 0.25. The desired neuron response is d = 0.5, with an error goal of 10^-8. The initial weights of the neuron were randomized. After 39 training iterations, the actual neuron output is y = 0.50009 (see Fig. 2.32).
70
Example 2.2 (cont.)
The single neuron cannot correct for a noisy input: for Fig. 2.31(b), y = 0.5204; for Fig. 2.31(c), y = 0.6805. To compensate for noisy inputs, use a multi-layer perceptron or a Hopfield associative memory.
71
Feedforward Multilayer Perceptron Multilayer perceptron (MLP)
The branches can only broadcast information in one direction. Synaptic weights can be adjusted according to a defined learning rule. An h-p-m feedforward MLP neural network: in general there can be any number of hidden layers in the architecture; however, from a practical perspective, only one or two hidden layers are used.
72
Feedforward Multilayer Perceptron (cont.)
73
Feedforward Multilayer Perceptron (cont.) The first layer has the weight matrix
The second layer has the weight matrix
The third layer has the weight matrix
Define a diagonal nonlinear operator matrix
W^{(1)} = [w_{ji}^{(1)}] \in R^{h \times n}

W^{(2)} = [w_{rj}^{(2)}] \in R^{p \times h}

W^{(3)} = [w_{sr}^{(3)}] \in R^{m \times p}

\Gamma[\cdot] = diag[f(\cdot), f(\cdot), ..., f(\cdot)]    (2.71)
74
Feedforward Multilayer Perceptron (cont.) The output of the first layer can be written as
The output of the second layer can be written as
The output of the third layer can be written as
x_{out}^{1} = \Gamma^{(1)}\left[ W^{(1)} x \right]    (2.72)

x_{out}^{2} = \Gamma^{(2)}\left[ W^{(2)} x_{out}^{1} \right]    (2.73)

x_{out}^{3} = \Gamma^{(3)}\left[ W^{(3)} x_{out}^{2} \right]    (2.74)

Substituting (2.72) into (2.73) and then into (2.74), the final output is

y = \Gamma^{(3)}\left[ W^{(3)} \Gamma^{(2)}\left[ W^{(2)} \Gamma^{(1)}\left[ W^{(1)} x \right] \right] \right]    (2.75)

The synaptic weights are fixed during operation; a training process must be carried out a priori to properly adjust the weights.
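Equation (2.75) is just three matrix-vector products interleaved with the nonlinear operators; a minimal sketch with tanh hidden layers and a linear output layer (the weight values here are arbitrary illustrative numbers):

```python
import math

def matvec(W, x):
    """Matrix-vector product for a weight matrix stored as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def forward(x, W1, W2, W3):
    """Layered forward pass of Eq. (2.75): Gamma3[W3 Gamma2[W2 Gamma1[W1 x]]]."""
    x1 = [math.tanh(v) for v in matvec(W1, x)]   # first hidden layer
    x2 = [math.tanh(v) for v in matvec(W2, x1)]  # second hidden layer
    return matvec(W3, x2)                        # linear output layer

W1 = [[0.5, -0.2], [0.1, 0.3]]
W2 = [[0.4, 0.7], [-0.6, 0.2]]
W3 = [[1.0, -1.0]]
y = forward([1.0, 2.0], W1, W2, W3)
```

Here the diagonal operator Gamma simply applies the same scalar activation to each component of the layer's activation-potential vector.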
75
Overview of Basic Learning Rules for a Single Neuron
Generalized LMS Learning Rule Define a performance function (energy function) to be minimized:

J(w) = \rho(e) + \frac{\gamma}{2} \|w\|_2^2    (2.76)

where \|w\|_2 is the Euclidean norm of the weight vector w, \rho(\cdot) is any differentiable function, and e is the linear error,

e = d - w^T x    (2.77)

(d is the desired output, w the weight vector, and x the input vector).
76
Generalized LMS Learning Rule (cont.) Using the steepest descent approach, the general LMS algorithm is obtained.
Continuous-time learning rule (interpreted as a vector differential equation):

\frac{dw}{dt} = -\mu \nabla_w J(w) = \mu \left[ g(e) x - \gamma w \right]    (2.78)

where g(e) = \frac{d\rho(e)}{de}    (2.79)

Discrete-time learning rule:

w(k+1) = w(k) + \mu(k) \left[ g(e(k)) x(k) - \gamma w(k) \right]    (2.81)

where \mu is the learning rate and \gamma the leakage factor. If \rho(t) = \frac{1}{2} t^2 and \rho'(t) = g(t) = t, then (2.81) is written as

w(k+1) = w(k) + \mu(k) \left[ e(k) x(k) - \gamma w(k) \right]    (2.82)

and the continuous-time rule becomes

\frac{dw}{dt} = \mu \left[ e x - \gamma w \right]    (2.83)
77
Generalized LMS Learning Rule (cont.)
Leaky LMS algorithm (0 \le \gamma < 1):

w(k+1) = \left[ 1 - \mu(k) \gamma \right] w(k) + \mu(k) e(k) x(k)    (2.84)

Standard LMS algorithm (\gamma = 0), the same as Eq. (2.29):

w(k+1) = w(k) + \mu(k) e(k) x(k)    (2.85)

The scalar form of the standard LMS algorithm:

w_j(k+1) = w_j(k) + \mu(k) e(k) x_j(k), \quad \text{for } j = 0, 1, 2, ..., n

e(k) = d(k) - \sum_{j=0}^{n} w_j(k) x_j(k)    (2.86)
78
Generalized LMS Learning Rule (cont.) There are three important variations of the standard LMS algorithm.
1. Momentum is designed to provide additional impetus to the weight-vector change in the direction of the average downhill force. It is defined through the difference between the current weight w(k) and the previous weight w(k-1):

\Delta w(k) = w(k) - w(k-1)    (2.87)

Equation (2.85) can then be rewritten as

w(k+1) = w(k) + \mu e(k) x(k) + \eta \left[ w(k) - w(k-1) \right]    (2.88)

where 0 < \eta < 1 is the momentum parameter. This is the standard LMS algorithm with momentum.
79
Generalized LMS Learning Rule (cont.)
2. Recursive weighted least-squares is a parametric system identification method based on an autoregressive moving average (ARMA) model. It comprises update expressions for the parameter vector, the gain vector, and the weighting matrix.
The update expression for the parameter vector is

w(k+1) = w(k) + L(k+1) e(k)    (2.89)

The update expression for the gain vector is

L(k+1) = P(k) x(k) \left[ \gamma + x^T(k) P(k) x(k) \right]^{-1}    (2.90)

The update expression for the weighting matrix is

P(k+1) = \frac{1}{\gamma} \left[ P(k) - L(k+1) x^T(k) P(k) \right]    (2.91)
80
Generalized LMS Learning Rule (cont.)
where the error term e(k) is given by

e(k) = d(k) - w^T(k) x(k)    (2.92)

Substituting (2.90) into (2.89) gives

w(k+1) = w(k) + \frac{P(k) x(k) e(k)}{\gamma + x^T(k) P(k) x(k)}    (2.93)

Therefore the modified synaptic weight vector can be expressed as

w(k+1) = w(k) + \mu(k) e(k) P(k) x(k)    (2.94)

where

\mu(k) = \frac{1}{\gamma + x^T(k) P(k) x(k)}    (2.95)
81
Generalized LMS Learning Rule (cont.)
Substituting (2.90) into (2.91) gives the weighting matrix

P(k+1) = \frac{1}{\gamma} \left[ P(k) - \frac{P(k) x(k) x^T(k) P(k)}{\gamma + x^T(k) P(k) x(k)} \right] = \frac{1}{\gamma} \left[ P(k) - \mu(k) P(k) x(k) x^T(k) P(k) \right]    (2.96)

The update expression for the weighting matrix is written as

P(k+1) = \frac{1}{\gamma} \left[ I - \mu(k) P(k) x(k) x^T(k) \right] P(k)    (2.97)
82
Generalized LMS Learning Rule (cont.)
3. The minimal-disturbance principle: modified normalized LMS. A positive constant \varepsilon is added to the denominator of (2.46) to ensure that the weight update cannot become unbounded:

w(k+1) = w(k) + \frac{\mu e(k) x(k)}{\varepsilon + \|x(k)\|_2^2}    (2.98)

where 0 < \mu < 2 and \varepsilon > 0; typically 0.1 <= \mu <= 1.
83
Example 2.3
The same setup as Example 2.1, but using different LMS algorithms. The same initial weight vector, initial learning rate, and termination criterion are used.
84
Overview of basic learning rules for a single neuron Hebbian Learning
[Donald Hebb] The strength of a synapse between cells A and B increased slightly for the situation when the firing in A was followed by firing in B with a very small time delay. For two neurons on either side of a synapse that are synchronously
activated, then the strength of the synapse is increased. [Stent] expanded Hebb’s original statement to include the case
when two neurons on either side of a synapse are asynchronously activated, leading to a weakened synapse.
[Rumelhart] Adjust the strength of the connection between units A and B in proportion to the product of their simultaneous activation. If the product of the activations is positive, the modification to the
synaptic connection is more excitatory. If the product of the activations is negative, the modification to the
synaptic connection is more inhibitory.
85
Overview of basic learning rules for a single neuron (cont.)
Hebbian synapse Uses a highly local, time-dependent, and strongly
interactive mechanism to increase synaptic efficiency as a function of the correlation between the presynaptic and postsynaptic activity levels.
86
Overview of basic learning rules for a single neuron (cont.)
Four key properties of a Hebbian synapse. Time-dependent mechanism:
the changes in a Hebbian synapse depend on the precise time of occurrence of the presynaptic and postsynaptic activity levels.
Local mechanism: within a synapse, ongoing activity levels in the presynaptic and
postsynaptic units are used by a Hebbian synapse to produce an input-dependent, local synaptic modification.
Interactive mechanism: any form of Hebbian learning depends on the interaction between
presynaptic and postsynaptic activities.
Conjunctional (correlational) mechanism: the "co-occurrence" of presynaptic and postsynaptic activities within a
relatively short time interval is sufficient to produce a synaptic modification.
87
Overview of basic learning rules for a single neuron (cont.)
Synaptic activities can be categorized as:
Hebbian: a Hebbian synapse increases its strength when the presynaptic and postsynaptic activities are positively correlated, and its strength is decreased when the activities are either uncorrelated or negatively correlated.
Anti-Hebbian: an anti-Hebbian synapse enhances negatively correlated presynaptic and postsynaptic activities and weakens positively correlated activities.
Non-Hebbian: a non-Hebbian synapse does not involve the strongly interactive, highly local, time-dependent mechanism.
88
Overview of basic learning rules for a single neuron (cont.) Standard Hebbian learning for a single
neuron
89
Overview of basic learning rules for a single neuron (cont.)
The standard Hebbian learning rule for a single neuron follows from an energy function defined as

J(w) = -\varphi(w^T x) + \frac{\gamma}{2} \|w\|_2^2    (2.99)

The output of the neuron is

y = \frac{d\varphi(v)}{dv} = f(v), \quad v = w^T x    (2.100)

Taking the steepest descent approach to derive the continuous-time learning rule:

\frac{dw}{dt} = -\mu \nabla_w J(w)    (2.101)
90
Overview of basic learning rules for a single neuron (cont.)
The gradient of (2.99) is given as

\nabla_w J(w) = -\frac{d\varphi(v)}{dv} \nabla_w v + \gamma w = -y x + \gamma w    (2.102)

Using (2.102) and (2.101), the continuous-time standard Hebbian learning rule for a single neuron is given as

\frac{dw}{dt} = \mu (y x - \gamma w)    (2.103)

The discrete-time standard Hebbian learning rule is

w(k+1) = w(k) + \mu \left[ y(k) x(k) - \gamma w(k) \right]    (2.104)

The scalar discrete-time form is

w_j(k+1) = w_j(k) + \mu \left[ y(k) x_j(k) - \gamma w_j(k) \right]    (2.105)
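One step of the discrete rule (2.104) on a toy pattern shows the two terms at work: the Hebbian term strengthens the weight component correlated with the active input, while the forgetting term -gamma*w decays the inactive one (the numbers and names here are mine):

```python
def hebb_step(w, x, mu=0.1, gamma=0.05):
    """One discrete-time Hebbian update with weight decay (Eq. 2.104)."""
    y = sum(wi * xi for wi, xi in zip(w, x))   # linear activation y = w^T x
    return [wi + mu * (y * xi - gamma * wi) for wi, xi in zip(w, x)]

w0 = [0.2, 0.1]
w1 = hebb_step(w0, [1.0, 0.0])   # only the first input component is active
```

After the step, the weight on the active input has grown (y * x_1 > gamma * w_1) while the weight on the silent input has only decayed.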
91
Overview of basic learning rules for a single neuron (cont.) Generalized Hebbian learning rule
This can be considered a gradient optimization process when an appropriate energy (Lyapunov) function is selected:

\frac{dw}{dt} = -\mu \nabla_w J(w)    (2.106)

The resulting generalized Hebbian learning rule is given as

\frac{dw}{dt} = \mu \left[ \phi(w, x, v, y, d) x - \gamma w \right]    (2.107)

where \phi(w, x, v, y, d) is a learning signal. The discrete-time form is

w(k+1) = w(k) + \mu \left[ \phi(k) x(k) - \gamma w(k) \right]    (2.108)
92
Overview of basic learning rules for a single neuron (cont.)
Assume that the learning signal is the output of the neuron:

\phi(w, x, v, y, d) = y = \frac{d\varphi(v)}{dv} = f(v)    (2.109)

Therefore, (2.107) becomes

\frac{dw}{dt} = \mu (y x - \gamma w)    (2.110)
93
Overview of basic learning rules for a single neuron (cont.) Oja’s Learning Rule
Oja’s learning rule 可由最小化下列能量函數推導:
兩個基本假設: 神經元的權重向量是正規化的 (normalized) , ||w||2=1 神經元採用線性的激發函數 (linear activation
function) , y=wTx 因此能量函數可改寫成
xxewhere
ew
ˆ
)(2
221
2
2
2
22
12
12
1
wywxywxx
wyxywx
wyxw
TTT
TT
(2.111)
(2.112)
(2.114)
94
Overview of basic learning rules for a single neuron (cont.)
Using the steepest descent approach, Oja's continuous-time learning rule is obtained from

\frac{dw}{dt} = -\mu \nabla_w J(w)    (2.115)

where \mu > 0 is the learning rate. The gradient is

\nabla_w J(w) = -(y x - y^2 w)    (2.116)

Substituting (2.116) into (2.115) gives Oja's continuous-time learning rule

\frac{dw}{dt} = \mu y (x - y w) = \mu (y x - y^2 w), \quad \text{where } y = w^T x    (2.117)

Here \mu y x is the Hebbian co-occurrence term and \mu y^2 w is the active decay term.
95
Overview of basic learning rules for a single neuron (cont.)
Equation (2.117) can be rewritten in discrete-time form:

w(k+1) = w(k) + \mu y(k) \left[ x(k) - y(k) w(k) \right]    (2.119)

The scalar form of (2.119) can be written as

w_j(k+1) = w_j(k) + \mu y(k) \left[ x_j(k) - y(k) w_j(k) \right]    (2.120)

Typical (simple) form of Hebbian learning:

w_j(k+1) = w_j(k) + \mu y(k) x_j(k)    (2.121)
96
Example 2.4: We want to analyze zero-mean random data: 5000 stochastic vectors whose 1st component has a variance of 10, while the 2nd and 3rd components have variances of 0.002. Oja's learning rule is used with \mu = 0.001 and convergence condition \|w(k) - w(k-1)\|_2 <= 10^-6.
Initial weight vector: w(0) = [0.5949, -0.5585, 0.4811]^T.
After 1182 iterations: w = [1.000, -0.0007, 0.0002]^T,
which is very close to the 1st principal eigenvector w = [1.000, -0.0001, 0.0001]^T.
97
Overview of basic learning rules for a single neuron (cont.) Potential Learning Rule
No desired signal is required; this is a form of unsupervised learning. The learning is performed on the activity level of the neuron. The potential learning rule can be derived by minimizing the energy function

J(w) = \zeta(v) + \frac{\gamma}{2} \|w\|_2^2, \quad v = w^T x    (2.122)

where \zeta(\cdot) \ge 0 is the loss function.
98
Overview of basic learning rules for a single neuron (cont.)
The gradient of (2.122) is given as

\nabla_w J(w) = \zeta'(v) x + \gamma w    (2.123)

The learning signal is

\phi(v) = -\zeta'(v) = -\frac{d\zeta(v)}{dv}    (2.124)

Using Amari's result in (2.106) and the gradient in (2.123), the continuous-time potential learning rule is

\frac{dw}{dt} = \mu \left[ \phi(v) x - \gamma w \right]

The discrete-time form is

w(k+1) = w(k) + \mu \left[ \phi(v) x(k) - \gamma w(k) \right]    (2.125)

The discrete-time scalar form is

w_j(k+1) = w_j(k) + \mu \left[ \phi(v) x_j(k) - \gamma w_j(k) \right]    (2.126)
99
Overview of basic learning rules for a single neuron (cont.) Correlation Learning Rule
Minimizing the energy function

J(w) = -d \, w^T x + \frac{\gamma}{2} \|w\|_2^2    (2.127)

The gradient with respect to the synaptic weight vector is

\nabla_w J(w) = -d x + \gamma w    (2.128)

Using Amari's result (2.106) and (2.128), the continuous-time correlation learning rule is

\frac{dw}{dt} = \mu (d x - \gamma w)    (2.129)

The discrete-time form can be written as

w(k+1) = w(k) + \mu \left[ d(k) x(k) - \gamma w(k) \right]    (2.130)

The discrete-time scalar form is

w_j(k+1) = w_j(k) + \mu \left[ d(k) x_j(k) - \gamma w_j(k) \right]    (2.131)

The correlation learning rule is usually applied for recording data in memory networks; replacing d in (2.129) with y yields the Hebbian learning rule.
100
Overview of basic learning rules for a single neuron (cont.) Standard Perceptron Learning Rule
This can be derived by minimizing the MSE criterion

J(w) = \frac{1}{2} e^2    (2.132)

where

e = d - y    (2.133)

and the neuron output is

y = f(w^T x) = f(v)    (2.134)

Using the steepest descent approach, the continuous-time learning rule is given by

\frac{dw}{dt} = -\mu \nabla_w J(w)
101
Overview of basic learning rules for a single neuron (cont.)
The gradient of (2.132) is then

\nabla_w J(w) = -e \frac{df(v)}{dv} x    (2.135)

Defining the learning signal

\delta = e \frac{df(v)}{dv} = e f'(v) = e g(v)    (2.136)

the gradient becomes \nabla_w J(w) = -\delta x.
102
Overview of basic learning rules for a single neuron (cont.)
Using (2.134), (2.135), and (2.136), the continuous-time standard perceptron learning rule for a single neuron is

\frac{dw}{dt} = \mu \delta x    (2.137)

Equation (2.137) can be rewritten in discrete-time form:

w(k+1) = w(k) + \mu \delta(k) x(k)    (2.138)

The scalar form of (2.138) can be written as

w_j(k+1) = w_j(k) + \mu \delta(k) x_j(k)    (2.139)
103
Overview of basic learning rules for a single neuron (cont.) Generalized Perceptron Learning Rule
104
Overview of basic learning rules for a single neuron (cont.) Generalized Perceptron Learning Rule
When the energy function is not defined to be the MSE criterion, we can define a general energy function as

J(w) = \rho(e) = \rho(d - y)    (2.140)

where \rho is a differentiable function. If \rho(e) = \frac{1}{2} e^2, this reduces to the standard perceptron learning rule. The gradient is obtained by the chain rule:

\nabla_w J(w) = \frac{\partial \rho(e)}{\partial e} \frac{\partial e}{\partial y} \frac{\partial y}{\partial v} \frac{\partial v}{\partial w}    (2.141)

where

\frac{d\rho(e)}{de} = \rho'(e)    (2.142)

y = f(w^T x) = f(v)    (2.143)
105
Overview of basic learning rules for a single neuron (cont.)
f(\cdot) is a differentiable function, and

\frac{dy}{dv} = \frac{df(v)}{dv} = f'(v) = g(v)    (2.144)

Equation (2.141) can then be written as

\nabla_w J(w) = -\rho'(e) g(v) x    (2.145)

The continuous-time general perceptron learning rule is given as

\frac{dw}{dt} = \mu \rho'(e) g(v) x    (2.146)

If we define the learning signal as

\delta = \rho'(e) g(v)    (2.147)

then (2.146) can be written as

\frac{dw}{dt} = \mu \delta x    (2.148)
106
Overview of basic learning rules for a single neuron (cont.) Discrete-time form:

w(k+1) = w(k) + \mu \delta(k) x(k)    (2.149)

Discrete-time scalar form:

w_j(k+1) = w_j(k) + \mu \delta(k) x_j(k)    (2.150)
107
Data Preprocessing
The performance of a neural network is strongly dependent on the preprocessing that is performed on the training data.
Scaling: the training data can be amplitude-scaled in two ways, so that the values of the patterns lie between -1 and 1, or between 0 and 1. This is referred to as min/max scaling. MATLAB: premnmx.
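Min/max scaling in the style of MATLAB's premnmx can be sketched as follows (the helper name and interval parameters are my own):

```python
def minmax_scale(values, lo=-1.0, hi=1.0):
    """Linearly map the values so that min -> lo and max -> hi."""
    vmin, vmax = min(values), max(values)
    span = vmax - vmin
    return [lo + (hi - lo) * (v - vmin) / span for v in values]

scaled = minmax_scale([2.0, 4.0, 6.0, 10.0])
```

Using lo = 0 and hi = 1 instead gives the (0, 1) variant mentioned above.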
108
Data Preprocessing (cont.)
Another scaling process:
Mean centering, used when the training data contain biases.
Variance scaling, used when the training data have different units.
Suppose the input vectors are arranged column-wise in a matrix A \in R^{n \times m} and the target vectors column-wise in a matrix C \in R^{p \times m}.
Mean centering: compute the mean value of each row of A and C, then subtract from every element of A and C its corresponding row mean.
Variance scaling: compute the standard deviation of each row of A and C, then divide every element of A and C by its corresponding row standard deviation.
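Mean centering followed by variance scaling, applied to one row of A or C, can be sketched as (the helper name is mine):

```python
import math

def center_and_scale_row(row):
    """Subtract the row mean, then divide by the row standard deviation."""
    n = len(row)
    mean = sum(row) / n
    centered = [v - mean for v in row]                      # mean centering
    std = math.sqrt(sum(c * c for c in centered) / n)
    return [c / std for c in centered]                      # variance scaling

row = center_and_scale_row([1.0, 2.0, 3.0, 4.0])
```

The resulting row has zero mean and unit variance, so rows measured in different units become comparable.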
109
Data Preprocessing (cont.)
Transformations: features extracted from certain "raw" signals, used as training inputs, can provide better results than the raw signals themselves. A front-end feature extractor can be used to discern salient or distinguishing characteristics of the data. Four transform methods:
Fourier Transform Principal-Component Analysis Partial Least-Squares Regression Wavelets and Wavelet Transforms
110
Data Preprocessing (cont.)
Fourier Transform: the FFT can be used to extract the important features of the data, and these dominant characteristic features can then be used to train the neural network.
111
Data Preprocessing (cont.) Three signals with the same waveform but different phases; each signal has 1024 samples.
112
Data Preprocessing (cont.)
The FFT magnitude responses of the three signals are identical, and only 16 magnitude samples are needed.
The three signals with the same waveform but different phases differ in their FFT phase responses.
113
Data Preprocessing (cont.)
Principal-Component Analysis PCA can be used to “compress” the input training
data set, reduce the dimension of the inputs. By determining the important features of the data
according to an assessment of the variance of the data.
In MATLAB, prepca is provided to perform PCA on the training data
114
Data Preprocessing (cont.) Given a set of training data A \in R^{n \times m}, where it is assumed that m >> n (n denotes the dimension of the input training patterns and m the number of training patterns), PCA determines an "optimal" orthogonal transformation matrix W_{pca} \in R^{h \times n}, where h << n (the degree of dimension reduction). The dimension of the input vectors can be reduced according to the transformation

A_r = W_{pca} A, \quad A_r \in R^{h \times m}    (2.151)

where A_r is the reduced-dimension set of training patterns; the columns of A_r are the principal components for each of the inputs from A.
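A minimal sketch of the reduction A_r = W_pca A for h = 1: the single leading eigenvector of the data covariance is found by power iteration and used to project 2-D patterns down to 1-D (the toy data and names are mine, and this is only the h = 1 special case of PCA):

```python
import math

# toy zero-mean training patterns, one column per pattern (a 2 x 4 matrix A)
A = [[2.0, -1.9, 1.5, -1.6],    # first component: large variance
     [0.1, -0.2, 0.2, -0.1]]   # second component: small variance

m = len(A[0])
# covariance matrix C = (1/m) * A * A^T for the zero-mean rows of A
C = [[sum(A[i][k] * A[j][k] for k in range(m)) / m for j in range(2)]
     for i in range(2)]

# power iteration toward the leading eigenvector of C
w = [1.0, 1.0]
for _ in range(100):
    w = [C[0][0] * w[0] + C[0][1] * w[1],
         C[1][0] * w[0] + C[1][1] * w[1]]
    norm = math.sqrt(w[0] ** 2 + w[1] ** 2)
    w = [wi / norm for wi in w]

# reduced one-dimensional patterns: A_r = w^T A  (the h = 1 case of Eq. 2.151)
A_r = [w[0] * A[0][k] + w[1] * A[1][k] for k in range(m)]
```

Because the first row of A carries almost all the variance, the leading eigenvector points almost entirely along that component.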
115
Data Preprocessing (cont.) Partial Least-Squares Regression
PLSR can be used to compress the input training data set. Restricted for use with supervised trained neural networks. Only scalar target values are allowed. The factor analysis in PLSR can determine the degree of
compression of the input data. After the optimal number of PLSR factor h has been
determined, the weight loading vectors can be used to transform the data similar to the PCA approach.
The optimal number of weight loading vectors can form an orthogonal transformation matrix W_{plsr} \in R^{n \times h} as the columns of the matrix. The dimension of the input vectors can be reduced according to the transformation

A_r = W_{plsr}^T A    (2.152)
116
Data Preprocessing (cont.)
PCA and PLSR orthogonal transformation vectors used for data compression
PLSR uses both the input data and the target data to generate the weight loading vectors of the orthogonal transformation W_plsr.
117
Data Preprocessing (cont.)
Wavelets and Wavelet Transforms A wave is an oscillating function of time. Fourier analysis is used for analyzing waves:
certain functions can be expanded in terms of sinusoidal waves, revealing how much of each frequency component is required to synthesize the signal.
It is very useful for periodic, time-invariant, stationary signal analysis. A wavelet can be considered a small wave whose energy is
concentrated. Wavelets are useful for analyzing signals that are time-varying, transient, and
nonstationary, allowing simultaneous time and frequency analysis; wavelets are local waves. The wavelet transform can provide a time-frequency description of
signals and can be used to compress data for training neural networks.