Fundamental Neurocomputing Concepts


Transcript of Fundamental Neurocomputing Concepts

Page 1: Fundamental Neurocomputing Concepts

1

Fundamental Neurocomputing Concepts

National Yunlin University of Science and Technology, Graduate Institute of Computer Science and Information Engineering
Chuan-Yu Chang, Ph.D.
Office: ES 709  TEL: 05-5342601 ext. 4337  E-mail: [email protected]

Page 2: Fundamental Neurocomputing Concepts

2

Basic Models of Artificial Neurons

An artificial neuron can be referred to as a processing element, node, or threshold logic unit. There are four basic components of a neuron:

A set of synapses with associated synaptic weights.
A summing device: each input is multiplied by its associated synaptic weight and the products are summed.
An activation function, which serves to limit the amplitude of the neuron's output.
A threshold, externally applied, which lowers the cumulative input to the activation function.

Page 3: Fundamental Neurocomputing Concepts

3

Basic Models of Artificial neurons

Page 4: Fundamental Neurocomputing Concepts

4

Basic Models of Artificial Neurons

The output of the linear combiner is
$$u_q = \sum_{j=1}^{n} w_{qj} x_j = \mathbf{w}_q^T \mathbf{x}$$
where $\mathbf{w}_q = [w_{q1}, w_{q2}, \ldots, w_{qn}]^T \in \mathbb{R}^n$.

The output of the activation function is
$$y_q = f(v_q) = f(u_q - \theta_q)$$

The output of the neuron is given by
$$y_q = f\left(\sum_{j=1}^{n} w_{qj} x_j - \theta_q\right)$$

Page 5: Fundamental Neurocomputing Concepts

5

Basic Models of Artificial Neurons

The threshold (or bias) is incorporated into the synaptic weight vector $\mathbf{w}_q$ for neuron q.

Page 6: Fundamental Neurocomputing Concepts

6

Basic Models of Artificial Neurons

The effective internal activation potential is written as
$$v_q = \sum_{j=0}^{n} w_{qj} x_j$$

The output of the neuron is written as
$$y_q = f(v_q)$$

Page 7: Fundamental Neurocomputing Concepts

7

Basic Activation Functions

The activation function (also called the transfer function) can be linear or nonlinear.

Linear (identity) activation function:
$$y_q = f_{lin}(v_q) = v_q$$

Page 8: Fundamental Neurocomputing Concepts

8

Basic Activation Functions

Hard limiter (binary function, threshold function), with outputs in (0, 1). The output of the binary hard limiter can be written as
$$y_q = f_{hl}(v_q) = \begin{cases} 0 & \text{if } v_q < 0 \\ 1 & \text{if } v_q \geq 0 \end{cases}$$
(Figure: hard limiter activation function.)

Page 9: Fundamental Neurocomputing Concepts

9

Basic Activation Functions

Bipolar, symmetric hard limiter, with outputs in (-1, 1). The output of the symmetric hard limiter can be written as
$$y_q = f_{shl}(v_q) = \begin{cases} -1 & \text{if } v_q < 0 \\ 0 & \text{if } v_q = 0 \\ 1 & \text{if } v_q > 0 \end{cases}$$
Sometimes referred to as the signum (or sign) function.
(Figure: symmetric hard limiter activation function.)

Page 10: Fundamental Neurocomputing Concepts

10

Basic Activation Functions

Saturation linear function (piecewise linear function). The output of the saturation linear function is given by
$$y_q = f_{sl}(v_q) = \begin{cases} 0 & \text{if } v_q < -\tfrac{1}{2} \\ v_q + \tfrac{1}{2} & \text{if } -\tfrac{1}{2} \leq v_q \leq \tfrac{1}{2} \\ 1 & \text{if } v_q > \tfrac{1}{2} \end{cases}$$
(Figure: saturation linear activation function.)

Page 11: Fundamental Neurocomputing Concepts

11

Basic Activation Functions

Symmetric saturation linear function. The output of the symmetric saturation linear function is given by
$$y_q = f_{ssl}(v_q) = \begin{cases} -1 & \text{if } v_q < -1 \\ v_q & \text{if } -1 \leq v_q \leq 1 \\ 1 & \text{if } v_q > 1 \end{cases}$$
(Figure: symmetric saturation linear activation function.)

Page 12: Fundamental Neurocomputing Concepts

12

Basic Activation Functions

Sigmoid function (S-shaped function): binary sigmoid function. The output of the binary sigmoid function is given by
$$y_q = f_{bs}(v_q) = \frac{1}{1 + e^{-\alpha v_q}}$$
where $\alpha$ is the slope parameter of the binary sigmoid function.
(Figure: binary sigmoid function.)

The hard limiter has no derivative at the origin; the binary sigmoid is a continuous and differentiable function.

Page 13: Fundamental Neurocomputing Concepts

13

Basic Activation Functions

The derivative of the binary sigmoid function, shown for two different values of the slope parameter, is
$$g(v_q) = \frac{d f_{bs}(v_q)}{d v_q} = \frac{\alpha e^{-\alpha v_q}}{\left(1 + e^{-\alpha v_q}\right)^2} = \alpha f_{bs}(v_q)\left[1 - f_{bs}(v_q)\right]$$

Page 14: Fundamental Neurocomputing Concepts

14

Basic Activation Functions

Sigmoid function (S-shaped function): bipolar sigmoid function (hyperbolic tangent sigmoid). The output of the bipolar sigmoid function is given by
$$y_q = f_{hts}(v_q) = \tanh\left(\frac{\alpha v_q}{2}\right) = \frac{e^{\alpha v_q/2} - e^{-\alpha v_q/2}}{e^{\alpha v_q/2} + e^{-\alpha v_q/2}} = \frac{1 - e^{-\alpha v_q}}{1 + e^{-\alpha v_q}}$$
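The activation functions above are simple enough to express directly in code. A minimal NumPy sketch follows (the function names mirror the f_hl, f_shl, f_sl, f_ssl, f_bs, f_hts notation above; the default slope parameter alpha = 1 and the vectorized evaluation are implementation conveniences, not part of the slides):

import numpy as np

def f_hl(v):                      # binary hard limiter: outputs 0 or 1
    return np.where(v >= 0, 1.0, 0.0)

def f_shl(v):                     # symmetric hard limiter (signum): -1, 0, or 1
    return np.sign(v)

def f_sl(v):                      # saturation linear function
    return np.clip(v + 0.5, 0.0, 1.0)

def f_ssl(v):                     # symmetric saturation linear function
    return np.clip(v, -1.0, 1.0)

def f_bs(v, alpha=1.0):           # binary sigmoid with slope parameter alpha
    return 1.0 / (1.0 + np.exp(-alpha * v))

def f_hts(v, alpha=1.0):          # hyperbolic tangent (bipolar) sigmoid
    return np.tanh(alpha * v / 2.0)

v = np.linspace(-3.0, 3.0, 7)
print(f_bs(v, alpha=2.0))         # example evaluation over a small grid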

Page 15: Fundamental Neurocomputing Concepts

15

Basic Activation Functions

The effect of the threshold $\theta_q$ and the bias $\beta_q$ can be illustrated by observing the binary sigmoid activation function. Three plots of the binary sigmoid function are shown: threshold $\theta_q = 2$, bias $\beta_q = 2$, and the nominal case.

Applying a threshold is analogous to delaying a time-domain signal.
Adding a bias is analogous to an advance of a signal.

Page 16: Fundamental Neurocomputing Concepts

16

The Hopfield Model of the Artificial Neuron

The Hopfield neural network is asynchronous, parallel-processing, and fully interconnected.

Discrete-time model of the Hopfield neuron.

Page 17: Fundamental Neurocomputing Concepts

17

The Hopfield Model of the Artificial Neuron

The output of the neuron before the unit delay $z^{-1}$ is written as
$$y_q(k+1) = f_{shl}(v_q(k+1)) \quad (2.16)$$
where
$$v_q(k+1) = \sum_{j=1}^{n} w_{qj} x_j(k) \quad (2.17)$$
Using (2.16) and (2.17), the output of neuron q at time $(k+1)$ is
$$y_q(k+1) = f_{shl}\left(\sum_{j=1}^{n} w_{qj} x_j(k)\right) \quad (2.18)$$
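A minimal sketch of the discrete-time update (2.16)-(2.18) for a single neuron of this type, assuming the symmetric hard limiter defined earlier and illustrative weights and inputs:

import numpy as np

def f_shl(v):
    # symmetric hard limiter (signum)
    return np.sign(v)

def hopfield_neuron_step(w, x):
    # v(k+1) = sum_j w_qj * x_j(k); y(k+1) = f_shl(v(k+1))
    v_next = np.dot(w, x)
    return f_shl(v_next)

w = np.array([0.5, -0.3, 0.8])     # illustrative synaptic weights
x = np.array([1.0, -1.0, 1.0])     # inputs at time k
print(hopfield_neuron_step(w, x))  # output y(k+1)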

Page 18: Fundamental Neurocomputing Concepts

18

The Hopfield Model of the Artificial Neuron

Continuous-time model of the Hopfield artificial neuron

$$T_{cq}\frac{dv_q(t)}{dt} = -v_q(t) + \sum_{j=1}^{n} w_{qj} x_j(t) + \theta_q, \qquad y_q = f_{hts}(v_q)$$

Page 19: Fundamental Neurocomputing Concepts

19

Adaline and Madaline: Least-Mean-Square (LMS) Algorithm

Also known as the Widrow-Hoff learning rule or delta rule. The LMS is an adaptive algorithm that computes adjustments of the neuron's synaptic weights. The algorithm is based on the method of steepest descent. It adjusts the neuron weights to minimize the mean square error between the inner product of the weight vector with the input vector and the desired output of the neuron.

Adaline (adaptive linear element): a single neuron whose synaptic weights are updated according to the LMS algorithm.
Madaline (Multiple Adaline).

Page 20: Fundamental Neurocomputing Concepts

20

Simple adaptive linear combiner

$$v(k) = \mathbf{x}^T(k)\mathbf{w}(k) = \mathbf{w}^T(k)\mathbf{x}(k)$$
with inputs $x_0 = 1$ and $w_0 = \theta$ (bias).

Page 21: Fundamental Neurocomputing Concepts

21

Simple adaptive linear combiner

The difference between the desired response and the network response is
$$e(k) = d(k) - v(k) = d(k) - \mathbf{w}^T(k)\mathbf{x}(k) \quad (2.22)$$

The MSE criterion can be written as
$$J(\mathbf{w}) = \tfrac{1}{2}E\left[e^2(k)\right] = \tfrac{1}{2}E\left[\left(d(k) - \mathbf{w}^T(k)\mathbf{x}(k)\right)^2\right] \quad (2.23)$$

Expanding Eq. (2.23):
$$J(\mathbf{w}) = \tfrac{1}{2}E\left[d^2(k)\right] - E\left[d(k)\mathbf{x}^T(k)\right]\mathbf{w}(k) + \tfrac{1}{2}\mathbf{w}^T(k)E\left[\mathbf{x}(k)\mathbf{x}^T(k)\right]\mathbf{w}(k) \quad (2.24)$$
$$J(\mathbf{w}) = \tfrac{1}{2}E\left[d^2(k)\right] - \mathbf{p}^T\mathbf{w}(k) + \tfrac{1}{2}\mathbf{w}^T(k)C_x\mathbf{w}(k) \quad (2.25)$$

Page 22: Fundamental Neurocomputing Concepts

22

Simple adaptive linear combiner

Cross-correlation vector between the desired response and the input patterns:
$$\mathbf{p} = E\left[d(k)\mathbf{x}(k)\right] \quad (2.26)$$

Covariance matrix of the input patterns:
$$C_x = E\left[\mathbf{x}(k)\mathbf{x}^T(k)\right] \quad (2.27)$$

The MSE surface of J(w) has a single minimum, so we solve for the weights at which the gradient equals zero:
$$\nabla_w J(\mathbf{w}) = \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} = -\mathbf{p} + C_x\mathbf{w}(k) = 0$$

Therefore, the optimal weight vector is
$$\mathbf{w}^* = C_x^{-1}\mathbf{p}$$

Page 23: Fundamental Neurocomputing Concepts

23

Adaline and Madaline: Typical MSE surface of an adaptive linear combiner

Page 24: Fundamental Neurocomputing Concepts

24

The LMS Algorithm

The above expression has two limitations: computing the inverse of the covariance matrix is time-consuming, and it is not suitable for real-time weight updating because in most cases the covariance matrix and cross-correlation vector are not known in advance. To avoid these problems, Widrow and Hoff proposed the LMS algorithm.

The goal is to obtain the optimal values of the synaptic weights where J(w) is minimum. The error surface is searched using a gradient descent method to find the minimum value. We can reach the bottom of the error surface by changing the weights in the direction of the negative gradient of the surface.

Page 25: Fundamental Neurocomputing Concepts

25

The LMS Algorithm

Because the gradient on the surface cannot be computed without knowledge of the input covariance matrix and the cross-correlation vector, these must be estimated during an iterative procedure. An estimate of the MSE gradient can be obtained by taking the gradient of the instantaneous error surface.

The gradient of J(w) is approximated as
$$\nabla_w J(\mathbf{w}) \approx \nabla_w\left[\tfrac{1}{2}e^2(k)\right] = -e(k)\mathbf{x}(k) \quad (2.28)$$

The learning rule for updating the weights using the steepest descent gradient method is
$$\mathbf{w}(k+1) = \mathbf{w}(k) - \mu\nabla_w J(\mathbf{w}) = \mathbf{w}(k) + \mu e(k)\mathbf{x}(k) \quad (2.29)$$

The learning rate $\mu$ specifies the magnitude of the update step for the weights in the negative gradient direction.

Page 26: Fundamental Neurocomputing Concepts

26

The LMS Algorithm

If the value of $\mu$ is chosen to be too small, the learning algorithm will modify the weights slowly and a relatively large number of iterations will be required.

If the value of $\mu$ is set too large, the learning rule can become numerically unstable, leading to weights that do not converge.

Page 27: Fundamental Neurocomputing Concepts

27

The LMS Algorithm

The scalar form of the LMS algorithm can be written from (2.22) and (2.29) as
$$e(k) = d(k) - \sum_{h=1}^{n} w_h(k) x_h(k) \quad (2.30)$$
$$w_i(k+1) = w_i(k) + \mu e(k) x_i(k) \quad (2.31)$$

From (2.22) and (2.29), an upper bound must be placed on the learning rate $\mu$ to maintain stability:
$$0 < \mu < \frac{2}{\lambda_{max}} \quad (2.32)$$
where $\lambda_{max}$ is the largest eigenvalue of the input covariance matrix $C_x$.

Page 28: Fundamental Neurocomputing Concepts

28

The LMS Algorithm

For minimally acceptable stability of LMS convergence, the learning rate can be bounded as
$$0 < \mu < \frac{2}{\mathrm{trace}(C_x)} \quad (2.33)$$

Equation (2.33) is a reasonable approximation of (2.32), because
$$\mathrm{trace}(C_x) = \sum_{h=1}^{n} c_{hh} = \sum_{h=1}^{n} \lambda_h \geq \lambda_{max} \quad (2.34)$$

Page 29: Fundamental Neurocomputing Concepts

29

The LMS Algorithm

From (2.32) and (2.33), determining the learning rate requires at least computing the covariance matrix of the input samples, which is difficult to achieve in practical applications. Even when it can be obtained, a fixed learning rate of this kind limits the accuracy of the result. Therefore, Robbins and Monro's root-finding algorithm introduced a learning rate that varies with time (stochastic approximation):
$$\mu(k) = \frac{K}{k} \quad (2.35)$$
where K is a very small constant. Drawback: the learning rate decreases too quickly.

Page 30: Fundamental Neurocomputing Concepts

30

The LMS Algorithm

Ideally, during learning the learning rate should start with a relatively large value and then gradually decrease (schedule-type adjustment).

Darken and Moody's search-then-converge algorithm:
$$\mu(k) = \frac{\mu_0}{1 + k/\tau} \quad (2.36)$$
Search phase: $\mu$ is relatively large and almost constant.
Converge phase: $\mu$ decreases exponentially to zero.
$\mu_0 > 0$ and $\tau \gg 1$, typically $100 \leq \tau \leq 500$.
These methods of adjusting the learning rate are commonly called learning rate schedules.
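A small sketch of the two schedules, (2.35) and (2.36); the constants K, mu0, and tau below are illustrative values only:

def mu_stochastic(k, K=0.01):
    # stochastic approximation schedule (2.35): mu(k) = K / k
    return K / k

def mu_search_then_converge(k, mu0=0.1, tau=200.0):
    # Darken and Moody search-then-converge schedule (2.36)
    return mu0 / (1.0 + k / tau)

for k in (1, 10, 100, 1000):
    print(k, mu_stochastic(k), mu_search_then_converge(k))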

Page 31: Fundamental Neurocomputing Concepts

31

The LMS Algorithm

Adaptive normalization approach (non-schedule-type): $\mu$ is adjusted according to the input data at every time step:
$$\mu(k) = \frac{\mu_0}{\|\mathbf{x}(k)\|_2^2} \quad (2.37)$$
where $\mu_0$ is a fixed constant. Stability is guaranteed if $0 < \mu_0 < 2$; the practical range is $0.1 \leq \mu_0 \leq 1$.

Page 32: Fundamental Neurocomputing Concepts

32

The LMS Algorithm

Comparison of two learning rate schedules: the stochastic approximation schedule of Eq. (2.35) and the search-then-converge schedule of Eq. (2.36), together with a constant learning rate $\mu$ for reference.

Page 33: Fundamental Neurocomputing Concepts

33

Summary of the LMS algorithm

Step 1: Set k = 1, initialize the synaptic weight vector w(k = 1), and select values for $\mu_0$ and $\tau$.
Step 2: Compute the learning rate parameter
$$\mu(k) = \frac{\mu_0}{1 + k/\tau}$$
Step 3: Compute the error
$$e(k) = d(k) - \sum_{h=1}^{n} w_h(k) x_h(k)$$
Step 4: Update the synaptic weights
$$w_i(k+1) = w_i(k) + \mu(k) e(k) x_i(k)$$
Step 5: If convergence is achieved, stop; otherwise set k = k + 1 and go to Step 2.
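A minimal NumPy sketch of Steps 1-5, assuming the search-then-converge schedule and a simple mean-squared-error threshold as the convergence test; the data and constants below are illustrative:

import numpy as np

def lms_train(X, d, mu0=0.1, tau=200.0, err_goal=1e-8, max_epochs=100):
    # X: (m, n) array of input patterns; d: (m,) desired outputs
    m, n = X.shape
    w = np.zeros(n)                          # Step 1: initialize weights, k = 1
    k = 1
    for _ in range(max_epochs):
        sq_err = 0.0
        for x, target in zip(X, d):
            mu = mu0 / (1.0 + k / tau)       # Step 2: learning rate mu(k)
            e = target - np.dot(w, x)        # Step 3: error e(k)
            w = w + mu * e * x               # Step 4: weight update
            sq_err += 0.5 * e ** 2
            k += 1
        if sq_err / m < err_goal:            # Step 5: convergence check
            break
    return w

# Illustrative use: recover a linear model d = b^T x from noiseless data.
rng = np.random.default_rng(0)
b = np.array([1.0, 0.8, -1.0])
X = rng.normal(size=(1000, 3))
d = X @ b
print(lms_train(X, d))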

Page 34: Fundamental Neurocomputing Concepts

34

Example 2.1: Parametric system identification

The input data consist of 1000 zero-mean Gaussian random vectors with three components. The bias is set to zero. The variances of the components of x are 5, 1, and 0.5. The assumed linear model is given by b = [1, 0.8, -1]^T.

To generate the target values, the 1000 input vectors are used to form a matrix X = [x_1 x_2 ... x_1000], and the desired outputs are computed according to d = b^T X.

The progress of the learning rate parameter is shown as it is adjusted according to the search-then-converge schedule, with $\tau = 200$ and
$$C_x = \frac{1}{1000}\sum_{h=1}^{1000}\mathbf{x}_h\mathbf{x}_h^T = \frac{1}{1000}XX^T, \qquad \mu_0 = \frac{0.9}{\lambda_{max}} = 0.1936$$

The learning process was terminated when $J = \tfrac{1}{2}e^2(k) \leq 10^{-8}$.

Page 35: Fundamental Neurocomputing Concepts

35

Example 2.1 (cont.)

Parametric system identification: estimating a parameter vector associated with a dynamic model of a system, given only input/output data from the system.

The root mean square (RMS) value of the performance measure.

Page 36: Fundamental Neurocomputing Concepts

36

Adaline and Madaline

Adaline: an adaptive pattern classification network trained by the LMS algorithm.

x_0(k) = 1 is an adjustable bias (weight).

The output is bipolar (+1, -1); with a different activation function, a (0, 1) output is also possible.

$$e(k) = d(k) - v(k), \qquad \tilde{e}(k) = d(k) - y(k)$$

Page 37: Fundamental Neurocomputing Concepts

37

Adaline

Linear error: the difference between the desired output and the output of the linear combiner,
$$e(k) = d(k) - v(k)$$

Quantizer error: the difference between the desired output and the output of the symmetric hard limiter,
$$\tilde{e}(k) = d(k) - y(k)$$

Page 38: Fundamental Neurocomputing Concepts

38

Adaline

Adaline training process: The input vector x must be presented to the Adaline together with its corresponding desired output d. The synaptic weights w are adjusted dynamically according to the linear LMS algorithm. The activation function is not used during training (it is used only in the testing phase). Once the network weights have been properly adjusted, the Adaline's response can be tested with patterns that were not used for training. If the Adaline's outputs for the test inputs are highly accurate, the network is said to have generalized.

Page 39: Fundamental Neurocomputing Concepts

39

During training, the input vector x and the desired output are presented to the Adaline together. The weights are changed dynamically according to the linear LMS rule. The activation function is not used during training (it is used only in the testing phase). Once the weights have been properly adjusted, untrained patterns can be used to test the Adaline's response.

Page 40: Fundamental Neurocomputing Concepts

40

Adaline

One common application of the Adaline is the realization of a small class of logic functions, e.g. AND (with bipolar inputs):
$$y = \mathrm{AND}(x_1, x_2, \ldots, x_n) = \mathrm{sgn}\left(\sum_{j=1}^{n} x_j - (n-1)\right) = \begin{cases} 1 & \text{if all } x_j = 1 \\ -1 & \text{otherwise} \end{cases}$$

Page 41: Fundamental Neurocomputing Concepts

41

Adaline OR

$$y = \mathrm{OR}(x_1, x_2, \ldots, x_n) = \mathrm{sgn}\left(\sum_{j=1}^{n} x_j + (n-1)\right) = \begin{cases} 1 & \text{if some } x_j = 1 \\ -1 & \text{otherwise} \end{cases}$$

Page 42: Fundamental Neurocomputing Concepts

42

Adaline Majority

$$y = \mathrm{MAJ}(x_1, x_2, \ldots, x_n) = \mathrm{sgn}\left(\sum_{j=1}^{n} x_j\right) = \begin{cases} 1 & \text{if the majority of } x_j = 1 \\ -1 & \text{otherwise} \end{cases}$$
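A small sketch of these fixed-weight realizations for bipolar inputs; the bias terms follow the expressions above (np.sign returns 0 on ties, so MAJ is intended for an odd number of inputs):

import numpy as np

def adaline_and(x):
    # fires +1 only if every bipolar input x_j = +1
    n = len(x)
    return np.sign(np.sum(x) - (n - 1))

def adaline_or(x):
    # fires +1 if at least one bipolar input x_j = +1
    n = len(x)
    return np.sign(np.sum(x) + (n - 1))

def adaline_maj(x):
    # majority vote of bipolar inputs
    return np.sign(np.sum(x))

x = np.array([1, -1, 1])
print(adaline_and(x), adaline_or(x), adaline_maj(x))   # expected: -1, 1, 1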

Page 43: Fundamental Neurocomputing Concepts

43

Adaline Linear separability

The Adaline acts as a classifier which separates all possible input patterns into two categories.

The output of the linear combiner is given as

$$v(k) = w_1(k)x_1(k) + w_2(k)x_2(k) + w_0(k)$$
Setting $v(k) = 0$ gives the separating boundary
$$w_1(k)x_1(k) + w_2(k)x_2(k) + w_0(k) = 0$$
or
$$x_2(k) = -\frac{w_1(k)}{w_2(k)}x_1(k) - \frac{w_0(k)}{w_2(k)}$$

Page 44: Fundamental Neurocomputing Concepts

44

Adaline: Linear separability of the Adaline

The Adaline can only separate patterns that are linearly separable.

Page 45: Fundamental Neurocomputing Concepts

45

Adaline: Nonlinear separation problem

If the decision boundary is not a straight line, the Adaline cannot separate the patterns.

Page 46: Fundamental Neurocomputing Concepts

46

Adaline

Adaline with nonlinearly transformed inputs (polynomial discriminant function). To solve the classification problem for patterns that are not linearly separable, the inputs to the Adaline can be preprocessed with fixed nonlinearities (a polynomial discriminant function):
$$v(k) = w_0(k) + w_1(k)x_1 + w_2(k)x_1^2 + w_3(k)x_1x_2 + w_4(k)x_2^2 + w_5(k)x_2 \quad (2.45)$$

Page 47: Fundamental Neurocomputing Concepts

47

Adaline

The critical thresholding condition for this Adaline with nonlinearly transformed inputs occurs when v(k) in (2.45) is set to zero. Realizing a nonlinearly separable function (XNOR).

If the appropriate nonlinearities are chosen, the network can be trained to separate the input space into two subspaces which are not linearly separable.

Page 48: Fundamental Neurocomputing Concepts

48

Adaline (cont.): Linear error correction rules

There are two basic linear error correction rules that can be used to dynamically adjust the network weights (the change in the weights depends on the difference between the network's actual output and the desired output):
μ-LMS: the same as (2.22) and (2.29).
α-LMS: a self-normalizing version of the μ-LMS learning rule.

The α-LMS algorithm is based on the minimal-disturbance principle: when the weights are adjusted to accommodate a new pattern, the responses to previously learned patterns should be disturbed as little as possible.

μ-LMS is based on minimizing the MSE surface, whereas α-LMS updates the weights to reduce the current error:
$$\mathbf{w}(k+1) = \mathbf{w}(k) + \alpha\frac{e(k)\mathbf{x}(k)}{\|\mathbf{x}(k)\|_2^2} \quad (2.46)$$
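A one-step sketch of the α-LMS update (2.46); the value of alpha below is illustrative (its usual range is discussed on the next slide):

import numpy as np

def alpha_lms_step(w, x, d, alpha=0.5):
    # self-normalizing alpha-LMS update (2.46)
    e = d - np.dot(w, x)                        # linear error e(k)
    return w + alpha * e * x / np.dot(x, x)     # normalize by ||x(k)||^2

w = np.zeros(3)
w = alpha_lms_step(w, np.array([1.0, 2.0, -1.0]), d=1.0)
print(w)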

Page 49: Fundamental Neurocomputing Concepts

49

Adaline (cont.)

Consider the change in the error for α-LMS:
$$\Delta e(k) = \left[d(k) - \mathbf{w}^T(k+1)\mathbf{x}(k)\right] - \left[d(k) - \mathbf{w}^T(k)\mathbf{x}(k)\right] = -\Delta\mathbf{w}^T(k)\mathbf{x}(k) = -\alpha e(k)\frac{\mathbf{x}^T(k)\mathbf{x}(k)}{\|\mathbf{x}(k)\|_2^2} \quad (2.47)$$

From (2.47),
$$\Delta e(k) = -\alpha e(k) \quad (2.48)$$

The choice of α controls stability and speed of convergence; α is typically set in the range
$$0.1 < \alpha < 1$$

Page 50: Fundamental Neurocomputing Concepts

50

Adaline (cont.)

Detailed comparison of the α-LMS and μ-LMS rules. From (2.46),
$$\mathbf{w}(k+1) = \mathbf{w}(k) + \alpha\frac{e(k)\mathbf{x}(k)}{\|\mathbf{x}(k)\|_2^2} = \mathbf{w}(k) + \alpha\left[\frac{d(k)}{\|\mathbf{x}(k)\|_2} - \mathbf{w}^T(k)\frac{\mathbf{x}(k)}{\|\mathbf{x}(k)\|_2}\right]\frac{\mathbf{x}(k)}{\|\mathbf{x}(k)\|_2} \quad (2.49)$$

Define the normalized desired response and the normalized training vector
$$\hat{d}(k) = \frac{d(k)}{\|\mathbf{x}(k)\|_2}, \qquad \hat{\mathbf{x}}(k) = \frac{\mathbf{x}(k)}{\|\mathbf{x}(k)\|_2} \quad (2.50\text{-}51)$$

Eq. (2.49) can then be rewritten as
$$\mathbf{w}(k+1) = \mathbf{w}(k) + \alpha\left[\hat{d}(k) - \mathbf{w}^T(k)\hat{\mathbf{x}}(k)\right]\hat{\mathbf{x}}(k) \quad (2.52)$$

This has the same form as μ-LMS, so α-LMS can be viewed as μ-LMS applied to normalized input patterns.

Page 51: Fundamental Neurocomputing Concepts

51

Multiple Adaline (Madaline)

A single Adaline cannot solve problems that require a nonlinear separation of the input space. Multiple Adalines can be used instead: the Multiple Adaline (Madaline).

Madaline I: single-layer network with a single output.
Madaline II: multi-layer network with multiple outputs.

Page 52: Fundamental Neurocomputing Concepts

52

Example of Madaline I network consisting of three Adalines

The output unit may be OR, AND, or MAJ.

Page 53: Fundamental Neurocomputing Concepts

53

Two-layer Madaline II architecture

Page 54: Fundamental Neurocomputing Concepts

54

Madaline I realization of an XNOR logic function

Page 55: Fundamental Neurocomputing Concepts

55

Multiple Adaline (Madaline)

For the first Adaline,
$$v_1(k) = w_{11}(k)x_1(k) + w_{12}(k)x_2(k) + w_{10}(k)x_0(k), \qquad x_0(k) = 1$$
Setting $v_1(k) = 0$,
$$w_{11}(k)x_1(k) + w_{12}(k)x_2(k) + w_{10}(k) = 0$$
Dividing through by $w_{12}(k)$ and rearranging gives
$$x_2(k) = -\frac{w_{11}(k)}{w_{12}(k)}x_1(k) - \frac{w_{10}(k)}{w_{12}(k)}$$

Page 56: Fundamental Neurocomputing Concepts

56

Multiple Adaline (Madaline)

Similarly, for the second Adaline,
$$v_2(k) = w_{21}(k)x_1(k) + w_{22}(k)x_2(k) + w_{20}(k)x_0(k), \qquad x_0(k) = 1$$
Setting $v_2(k) = 0$ and dividing through by $w_{22}(k)$ gives
$$x_2(k) = -\frac{w_{21}(k)}{w_{22}(k)}x_1(k) - \frac{w_{20}(k)}{w_{22}(k)}$$

Page 57: Fundamental Neurocomputing Concepts

57

Madaline I separation properties for the XNOR problem

Page 58: Fundamental Neurocomputing Concepts

58

Madaline Learning Strategies

There are two learning strategies for the Madaline.

Madaline rule I (MRI), for Madaline I: the basic idea is to adjust the weights of the neuron whose linear output v_j(k) is closest to zero. MRI follows the minimal-disturbance principle.

Madaline rule II (MRII), for Madaline II: the weights are initially set to small random values, and the training patterns are presented in a random fashion with the objective of minimizing the average Hamming error over the training set.

Page 59: Fundamental Neurocomputing Concepts

59

Simple Perceptron

The simple perceptron (single-layer perceptron) is very similar to the Adaline; it was proposed by Frank Rosenblatt in the 1950s.

Minsky and Papert identified a serious limitation: the perceptron cannot solve the XOR problem. With an appropriate processing layer, the XOR problem, and more generally the parity function problem, can be solved.

The simple perceptron is related to the classical maximum-likelihood Gaussian pattern classifier; both can be viewed as linear classifiers.

Most perceptron training is supervised, but some is self-organizing.

Page 60: Fundamental Neurocomputing Concepts

60

Simple Perceptron (cont.)

In Rosenblatt's early work, the perceptron had three layers: a sensory surface (retina), an association area (A units), and a response unit (R unit).

It was not allowed to have more than one R unit on at a time (winner-take-all).

Page 61: Fundamental Neurocomputing Concepts

61

Simple Perceptron (cont.)

Rosenblatt's original perceptron: binary inputs, no bias.

Modified perceptron: bipolar inputs and a bias term; output y ∈ {-1, 1}.

Page 62: Fundamental Neurocomputing Concepts

62

Simple Perceptron (cont.)

The quantizer error is used to adjust the synaptic weights of the neuron. The adaptive algorithm for adjusting the neuron weights (the perceptron learning rule) is given as
$$\mathbf{w}(k+1) = \mathbf{w}(k) + \frac{\alpha}{2}\tilde{e}(k)\mathbf{x}(k) \quad (2.55)$$
where
$$\tilde{e}(k) = d(k) - y(k) = d(k) - \mathrm{sgn}\left(\mathbf{w}^T(k)\mathbf{x}(k)\right) \quad (2.56)$$

Rosenblatt normally set α to unity. The choice of the learning rate α does not affect the numerical stability of the perceptron learning rule, but it can affect the speed of convergence. Compare with (2.46).
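A minimal sketch of the rule (2.55)-(2.56), assuming bipolar targets d in {-1, +1} and a bias folded into the weight vector via a constant input x_0 = 1; the toy data are illustrative:

import numpy as np

def perceptron_step(w, x, d, alpha=1.0):
    # one update of the perceptron learning rule (2.55)-(2.56)
    y = np.sign(np.dot(w, x))            # quantized output
    e_tilde = d - y                      # quantizer error
    return w + (alpha / 2.0) * e_tilde * x

# One pass over a small linearly separable set (first column is the bias input x_0 = 1).
X = np.array([[1.0, 2.0, 1.0], [1.0, -1.0, -2.0], [1.0, 1.5, 0.5], [1.0, -2.0, -0.5]])
d = np.array([1.0, -1.0, 1.0, -1.0])
w = np.zeros(3)
for x, t in zip(X, d):
    w = perceptron_step(w, x, t)
print(w)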

Page 63: Fundamental Neurocomputing Concepts

63

Simple Perceptron (cont.)

The perceptron learning rule is considered a nonlinear algorithm.

The perceptron learning rule updates the weights until all the input patterns are classified correctly. The quantizer error is then zero for all training pattern inputs, and no further weight adjustments occur. The weights are not guaranteed to be optimal.

Page 64: Fundamental Neurocomputing Concepts

64

Simple Perceptron

Mays's perceptron learning rules. Mays proposed two modifications to the standard perceptron learning rule: the increment adaptation algorithm and the modified relaxation algorithm.

Increment adaptation algorithm:
$$\mathbf{w}(k+1) = \begin{cases} \mathbf{w}(k) + \alpha\dfrac{\tilde{e}(k)\mathbf{x}(k)}{2\|\mathbf{x}(k)\|_2^2} & \text{if } |v(k)| \geq \gamma \\[2mm] \mathbf{w}(k) + \alpha\dfrac{d(k)\mathbf{x}(k)}{\|\mathbf{x}(k)\|_2^2} & \text{if } |v(k)| < \gamma \end{cases} \quad (2.57)$$

Modified relaxation algorithm:
$$\mathbf{w}(k+1) = \begin{cases} \mathbf{w}(k) & \text{if } \tilde{e}(k) = 0 \text{ and } |v(k)| \geq \gamma \\[2mm] \mathbf{w}(k) + \alpha\dfrac{e(k)\mathbf{x}(k)}{\|\mathbf{x}(k)\|_2^2} & \text{otherwise} \end{cases} \quad (2.58)$$
where $\gamma$ is the dead-zone parameter.

Page 65: Fundamental Neurocomputing Concepts

65

Simple Perceptron with a Sigmoid Activation Function

The learning rule is based on the method of steepest descent and attempts to minimize an instantaneous performance function.

Page 66: Fundamental Neurocomputing Concepts

66

Simple Perceptron with a Sigmoid Activation Function (cont.)

The learning algorithm can be derived from the MSE criterion
$$J(\mathbf{w}_q) = \tfrac{1}{2}E\left[\tilde{e}_q^2(k)\right], \qquad \tilde{e}_q(k) = d_q(k) - y_q(k) \quad (2.59)$$

The instantaneous performance function to be minimized is given as
$$J(\mathbf{w}_q) = \tfrac{1}{2}\tilde{e}_q^2(k) = \tfrac{1}{2}\left[d_q(k) - y_q(k)\right]^2 = \tfrac{1}{2}\left[d_q^2(k) - 2d_q(k)y_q(k) + y_q^2(k)\right] \quad (2.60)$$
where
$$y_q(k) = f(v_q(k)) = f\left(\mathbf{x}^T(k)\mathbf{w}_q(k)\right) \quad (2.61)$$

Page 67: Fundamental Neurocomputing Concepts

67

Simple Perceptron with a Sigmoid Activation Function (cont.)

Assume the activation function is the hyperbolic tangent sigmoid; the neuron output can then be expressed as
$$y_q(k) = f_{hts}(v_q(k)) = \tanh(v_q(k)) \quad (2.62)$$

From (2.15), the derivative of the hyperbolic tangent sigmoid is
$$g(v_q(k)) = f'(v_q(k)) = 1 - f^2(v_q(k)) \quad (2.63)$$

Using steepest descent, the discrete-time learning rule is (cf. Eq. 2.29)
$$\mathbf{w}_q(k+1) = \mathbf{w}_q(k) - \mu\nabla_{w_q}J(\mathbf{w}_q) \quad (2.64)$$

Page 68: Fundamental Neurocomputing Concepts

68

Simple Perceptron with a Sigmoid Activation Function (cont.)

Computing the gradient in (2.64):
$$\nabla_{w_q}J(\mathbf{w}_q) = -\left[d_q(k) - f(v_q(k))\right]f'(v_q(k))\mathbf{x}(k) = -\tilde{e}_q(k)f'(v_q(k))\mathbf{x}(k) \quad (2.65)$$

Substituting (2.63) into (2.65):
$$\nabla_{w_q}J(\mathbf{w}_q) = -\tilde{e}_q(k)\left[1 - y_q^2(k)\right]\mathbf{x}(k) \quad (2.66)$$

Using the gradient in (2.66), the discrete-time learning rule for the simple perceptron can be written as
$$\mathbf{w}_q(k+1) = \mathbf{w}_q(k) + \mu\tilde{e}_q(k)\left[1 - y_q^2(k)\right]\mathbf{x}(k) \quad (2.67)$$

Page 69: Fundamental Neurocomputing Concepts

69

Simple Perceptron with a Sigmoid Activation Function (cont.)

Equation (2.67) can be rewritten in scalar form as
$$w_{qj}(k+1) = w_{qj}(k) + \mu\tilde{e}_q(k)\left[1 - y_q^2(k)\right]x_j(k) \quad (2.68)$$
where
$$\tilde{e}_q(k) = d_q(k) - y_q(k) \quad (2.69)$$
$$y_q(k) = f(v_q(k)) = f\left(\sum_{j=1}^{n} w_{qj}(k)x_j(k)\right) \quad (2.70)$$

Equations (2.68), (2.69), and (2.70) are the standard form of the backpropagation training algorithm.
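A minimal sketch of one training step of (2.68)-(2.70) for a single tanh neuron; the learning rate mu, the input vector, and the target value are illustrative:

import numpy as np

def sigmoid_perceptron_step(w, x, d, mu=0.25):
    # one step of (2.68)-(2.70) with a hyperbolic tangent activation
    y = np.tanh(np.dot(w, x))                     # (2.70) neuron output
    e_tilde = d - y                               # (2.69) error
    return w + mu * e_tilde * (1.0 - y ** 2) * x  # (2.68) weight update

w = np.zeros(4)
x = np.array([1.0, 0.2, -0.5, 0.7])               # illustrative input (x_0 = 1 bias)
for _ in range(50):
    w = sigmoid_perceptron_step(w, x, d=0.5)
print(np.tanh(w @ x))                             # output approaches the target 0.5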

Page 70: Fundamental Neurocomputing Concepts

70

Example 2.2

The structure of Figure 2.30 is used to learn the character "E". The character image is a 5x5 array of 25 pixels (stored in column-major order). The learning rule is (2.67), with slope parameter α = 1 and learning rate μ = 0.25. The desired neuron response is d = 0.5, with an error goal of 10^-8. The initial weights of the neuron were randomized. After 39 presentations of the training pattern, the actual neuron output is y = 0.50009 (see Fig. 2.32).

Page 71: Fundamental Neurocomputing Concepts

71

Example 2.2 (cont.)

The single neuron cannot correct for a noisy input: for Fig. 2.31(b), y = 0.5204; for Fig. 2.31(c), y = 0.6805.

To compensate for noisy inputs, use a multi-layer perceptron or a Hopfield associative memory.

Page 72: Fundamental Neurocomputing Concepts

72

Feedforward Multilayer Perceptron

Multilayer perceptron (MLP): the branches can only broadcast information in one direction. The synaptic weights can be adjusted according to a defined learning rule. An h-p-m feedforward MLP neural network. In general there can be any number of hidden layers in the architecture; however, from a practical perspective, only one or two hidden layers are used.

Page 73: Fundamental Neurocomputing Concepts

73

Feedforward Multilayer Perceptron (cont.)

Page 74: Fundamental Neurocomputing Concepts

74

Feedforward Multilayer Perceptron (cont.)

The first layer has the weight matrix $W^{(1)} = [w_{ji}^{(1)}] \in \mathbb{R}^{h \times n}$.
The second layer has the weight matrix $W^{(2)} = [w_{rj}^{(2)}] \in \mathbb{R}^{p \times h}$.
The third layer has the weight matrix $W^{(3)} = [w_{sr}^{(3)}] \in \mathbb{R}^{m \times p}$.

Define a diagonal nonlinear operator matrix
$$\Gamma[\cdot] = \mathrm{diag}\left[f(\cdot), f(\cdot), \ldots, f(\cdot)\right] \quad (2.71)$$

Page 75: Fundamental Neurocomputing Concepts

75

Feedforward Multilayer Perceptron (cont.)

The output of the first layer can be written as
$$\mathbf{x}_{out}^{(1)} = \Gamma^{(1)}\left[\mathbf{v}^{(1)}\right] = \Gamma^{(1)}\left[W^{(1)}\mathbf{x}\right] \quad (2.72)$$

The output of the second layer can be written as
$$\mathbf{x}_{out}^{(2)} = \Gamma^{(2)}\left[\mathbf{v}^{(2)}\right] = \Gamma^{(2)}\left[W^{(2)}\mathbf{x}_{out}^{(1)}\right] \quad (2.73)$$

The output of the third layer can be written as
$$\mathbf{x}_{out}^{(3)} = \Gamma^{(3)}\left[\mathbf{v}^{(3)}\right] = \Gamma^{(3)}\left[W^{(3)}\mathbf{x}_{out}^{(2)}\right] \quad (2.74)$$

Substituting (2.72) into (2.73) and then into (2.74), the final output is
$$\mathbf{y} = \Gamma^{(3)}\left[W^{(3)}\Gamma^{(2)}\left[W^{(2)}\Gamma^{(1)}\left[W^{(1)}\mathbf{x}\right]\right]\right] \quad (2.75)$$

Here the synaptic weights are fixed; a training process must be carried out a priori to properly adjust the weights.
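A minimal sketch of the forward pass (2.72)-(2.75) for an h-p-m network, assuming tanh nonlinearities inside each Γ operator and illustrative layer sizes with random weights:

import numpy as np

def mlp_forward(x, W1, W2, W3, f=np.tanh):
    # three-layer feedforward pass: y = Gamma3[W3 Gamma2[W2 Gamma1[W1 x]]]
    x1 = f(W1 @ x)       # (2.72) first-layer output
    x2 = f(W2 @ x1)      # (2.73) second-layer output
    return f(W3 @ x2)    # (2.74)-(2.75) network output

rng = np.random.default_rng(0)
n, h, p, m = 4, 5, 3, 2                    # input dimension and layer sizes
W1 = rng.normal(size=(h, n))
W2 = rng.normal(size=(p, h))
W3 = rng.normal(size=(m, p))
print(mlp_forward(rng.normal(size=n), W1, W2, W3))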

Page 76: Fundamental Neurocomputing Concepts

76

Overview of Basic Learning Rules for a Single Neuron: Generalized LMS Learning Rule

Define a performance function (energy function) to be minimized:
$$E(\mathbf{w}) = \phi(e) + \frac{\gamma}{2}\|\mathbf{w}\|_2^2 \quad (2.76)$$
where $\|\mathbf{w}\|_2$ is the Euclidean norm of the weight vector $\mathbf{w}$, $\phi(\cdot)$ is any differentiable function, and e is the linear error,
$$e = d - \mathbf{w}^T\mathbf{x} \quad (2.77)$$
with d the desired output, $\mathbf{w}$ the weight vector, and $\mathbf{x}$ the input vector.

Page 77: Fundamental Neurocomputing Concepts

77

Generalized LMS Learning Rule (cont.)

Using the steepest descent approach, the general LMS algorithm is obtained.

Continuous-time learning rule (a vector derivative):
$$\frac{d\mathbf{w}}{dt} = -\mu\nabla_w E(\mathbf{w}) \quad (2.78)$$
$$\frac{d\mathbf{w}}{dt} = \mu\left[g(e)\mathbf{x} - \gamma\mathbf{w}\right] \quad (2.79)$$

Discrete-time learning rule:
$$\mathbf{w}(k+1) = \mathbf{w}(k) - \mu\nabla_w E(\mathbf{w}) \quad (2.81)$$
$$\mathbf{w}(k+1) = \mathbf{w}(k) + \mu\left[g(e(k))\mathbf{x}(k) - \gamma\mathbf{w}(k)\right] \quad (2.82)$$
where $\mu$ is the learning rate and $\gamma$ is the leakage factor.

If $\phi(t) = \tfrac{1}{2}t^2$ and $\phi'(t) = g(t) = t$, then the learning rule can be written as
$$\frac{d\mathbf{w}}{dt} = \mu\left[e\mathbf{x} - \gamma\mathbf{w}\right] = \mu e\mathbf{x} - \mu\gamma\mathbf{w} \quad (2.83)$$

Page 78: Fundamental Neurocomputing Concepts

78

Generalized LMS Learning Rule (cont.)

Leaky LMS algorithm ($0 \leq \gamma < 1$):
$$\mathbf{w}(k+1) = (1 - \mu\gamma)\mathbf{w}(k) + \mu e(k)\mathbf{x}(k) \quad (2.84)$$

Standard LMS algorithm ($\gamma = 0$):
$$\mathbf{w}(k+1) = \mathbf{w}(k) + \mu e(k)\mathbf{x}(k) \quad (2.85)$$

The scalar form of the standard LMS algorithm:
$$w_j(k+1) = w_j(k) + \mu e(k)x_j(k), \qquad j = 0, 1, 2, \ldots, n \quad (2.86)$$
$$e(k) = d(k) - \sum_{j=1}^{n} w_j(k)x_j(k)$$

Page 79: Fundamental Neurocomputing Concepts

79

Generalized LMS Learning Rule (cont.)

There are three important variations of the standard LMS algorithm.

Momentum is designed to give the weight-vector change a certain inertia in the direction of the average downhill force. It can be defined through the difference between the current weight w(k) and the previous weight w(k-1):
$$\Delta\mathbf{w}(k) = \mathbf{w}(k) - \mathbf{w}(k-1) \quad (2.87)$$

Therefore (2.85) can be rewritten as
$$\mathbf{w}(k+1) = \mathbf{w}(k) + \mu e(k)\mathbf{x}(k) + \eta\left[\mathbf{w}(k) - \mathbf{w}(k-1)\right] \quad (2.88)$$
where $0 < \eta < 1$ is the momentum parameter. This is the standard LMS algorithm with momentum.

Page 80: Fundamental Neurocomputing Concepts

80

Generalized LMS Learning Rule (cont.)

Recursive weighted least-squares involves an update expression for the parameter vector, an update expression for the gain vector, and an update expression for the weighting matrix.

The update expression for the parameter vector $\mathbf{w} \in \mathbb{R}^{(n+1)\times 1}$ is
$$\mathbf{w}(k+1) = \mathbf{w}(k) + L(k+1)e(k) \quad (2.89)$$

The update expression for the gain vector $L \in \mathbb{R}^{(n+1)\times 1}$ is
$$L(k+1) = \frac{P(k)\mathbf{x}(k)}{1 + \mathbf{x}^T(k)P(k)\mathbf{x}(k)} \quad (2.90)$$

The update expression for the weighting matrix $P \in \mathbb{R}^{(n+1)\times(n+1)}$ is
$$P(k+1) = P(k) - L(k+1)\mathbf{x}^T(k)P(k) \quad (2.91)$$

Page 81: Fundamental Neurocomputing Concepts

81

Generalized LMS Learning Rule (cont.)

where the error term e(k) is given by
$$e(k) = d(k) - \mathbf{w}^T(k)\mathbf{x}(k) \quad (2.92)$$

Substituting (2.90) into (2.89) gives
$$\mathbf{w}(k+1) = \mathbf{w}(k) + \frac{e(k)P(k)\mathbf{x}(k)}{1 + \mathbf{x}^T(k)P(k)\mathbf{x}(k)} \quad (2.93)$$

Therefore the updated synaptic weight vector can be expressed as
$$\mathbf{w}(k+1) = \mathbf{w}(k) + \gamma(k)e(k)P(k)\mathbf{x}(k) \quad (2.94)$$
where
$$\gamma(k) = \frac{1}{1 + \mathbf{x}^T(k)P(k)\mathbf{x}(k)} \quad (2.95)$$

Page 82: Fundamental Neurocomputing Concepts

82

Generalized LMS Learning Rule (cont.)

Substituting (2.90) into (2.91) gives the weighting matrix
$$P(k+1) = P(k) - \frac{P(k)\mathbf{x}(k)\mathbf{x}^T(k)P(k)}{1 + \mathbf{x}^T(k)P(k)\mathbf{x}(k)} = P(k) - \gamma(k)P(k)\mathbf{x}(k)\mathbf{x}^T(k)P(k) \quad (2.96)$$

The update expression for the weighting matrix is written as
$$P(k+1) = \left[I - \gamma(k)P(k)\mathbf{x}(k)\mathbf{x}^T(k)\right]P(k) \quad (2.97)$$
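A minimal sketch of one recursive weighted least-squares update, following (2.92)-(2.97); the initialization of P as a large multiple of the identity is an assumption for illustration, not something stated on the slides:

import numpy as np

def rwls_step(w, P, x, d):
    # one RWLS update of the weight vector w and the weighting matrix P
    e = d - np.dot(w, x)                                        # (2.92) error
    gamma = 1.0 / (1.0 + x @ P @ x)                             # (2.95) scalar gain
    w_new = w + gamma * e * (P @ x)                             # (2.94) weight update
    P_new = (np.eye(len(x)) - gamma * np.outer(P @ x, x)) @ P   # (2.96)-(2.97)
    return w_new, P_new

n = 3
w, P = np.zeros(n), 1e3 * np.eye(n)       # assumed initialization
w, P = rwls_step(w, P, np.array([1.0, 0.5, -0.2]), d=0.7)
print(w)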

Page 83: Fundamental Neurocomputing Concepts

83

Generalized LMS Learning Rule (cont.)

Minimal disturbance principle: modified normalized LMS. A positive constant $\varepsilon$ is added to the denominator of (2.46) to ensure the weight update cannot become unbounded:
$$\mathbf{w}(k+1) = \mathbf{w}(k) + \alpha\frac{e(k)\mathbf{x}(k)}{\varepsilon + \|\mathbf{x}(k)\|_2^2} \quad (2.98)$$

Page 84: Fundamental Neurocomputing Concepts

84

Example 2.3

The same as Example 2.1, but using different LMS algorithms.

The same initial weight vector, initial learning rate, and termination criterion are used.

Page 85: Fundamental Neurocomputing Concepts

85

Overview of basic learning rules for a single neuron: Hebbian Learning

[Donald Hebb] The strength of a synapse between cells A and B is increased slightly when firing in A is followed by firing in B with a very small time delay. For two neurons on either side of a synapse that are synchronously activated, the strength of the synapse is increased.

[Stent] expanded Hebb's original statement to include the case in which two neurons on either side of a synapse are asynchronously activated, leading to a weakened synapse.

[Rumelhart] Adjust the strength of the connection between units A and B in proportion to the product of their simultaneous activation. If the product of the activations is positive, the modification to the synaptic connection is more excitatory; if the product is negative, the modification is more inhibitory.

Page 86: Fundamental Neurocomputing Concepts

86

Overview of basic learning rules for a single neuron (cont.)

Hebbian synapse: uses a highly local, time-dependent, and strongly interactive mechanism to increase synaptic efficiency as a function of the correlation between the presynaptic and postsynaptic activity levels.

Page 87: Fundamental Neurocomputing Concepts

87

Overview of basic learning rules for a single neuron (cont.)

Four key properties of a Hebbian synapse:

Time-dependent mechanism: the changes in a Hebbian synapse depend on the precise time of occurrence of the presynaptic and postsynaptic activity levels.

Local mechanism: within a synapse, the ongoing activity levels in the presynaptic and postsynaptic units are used by a Hebbian synapse to produce an input-dependent, local synaptic modification.

Interactive mechanism: any form of Hebbian learning depends on the interaction between presynaptic and postsynaptic activities.

Conjunctional (correlational) mechanism: the co-occurrence of presynaptic and postsynaptic activities within a relatively short time interval is sufficient to produce a synaptic modification.

Page 88: Fundamental Neurocomputing Concepts

88

Overview of basic learning rules for a single neuron (cont.)

Synaptic activities can be categorized as:

Hebbian: a Hebbian synapse increases its strength when the presynaptic and postsynaptic activities are positively correlated, and decreases its strength when the activities are either uncorrelated or negatively correlated.

Anti-Hebbian: an anti-Hebbian synapse strengthens negatively correlated presynaptic and postsynaptic activities and weakens positively correlated activities.

Non-Hebbian: a non-Hebbian synapse does not involve the strongly interactive, highly local, time-dependent mechanism.

Page 89: Fundamental Neurocomputing Concepts

89

Overview of basic learning rules for a single neuron (cont.)

Standard Hebbian learning for a single neuron.

Page 90: Fundamental Neurocomputing Concepts

90

Overview of basic learning rules for a single neuron (cont.)

The standard Hebbian learning rule for a single neuron can be derived from an energy function defined as
$$E(\mathbf{w}) = -\phi\left(\mathbf{w}^T\mathbf{x}\right) + \frac{\gamma}{2}\|\mathbf{w}\|_2^2 \quad (2.99)$$

The output of the neuron is
$$y = \frac{d\phi(v)}{dv} = f(v), \qquad v = \mathbf{w}^T\mathbf{x} \quad (2.100)$$

Taking the steepest descent approach, the continuous-time learning rule is
$$\frac{d\mathbf{w}}{dt} = -\mu\nabla_w E(\mathbf{w}) \quad (2.101)$$

Page 91: Fundamental Neurocomputing Concepts

91

Overview of basic learning rules for a single neuron (cont.)

The gradient of (2.99) is given as
$$\nabla_w E(\mathbf{w}) = -f(v)\mathbf{x} + \gamma\mathbf{w} = -y\mathbf{x} + \gamma\mathbf{w} \quad (2.102)$$

Using (2.102) and (2.101), the continuous-time standard Hebbian learning rule for a single neuron is given as
$$\frac{d\mathbf{w}}{dt} = \mu\left(y\mathbf{x} - \gamma\mathbf{w}\right) \quad (2.103)$$

The discrete-time standard Hebbian learning rule is
$$\mathbf{w}(k+1) = \mathbf{w}(k) + \mu\left[y(k)\mathbf{x}(k) - \gamma\mathbf{w}(k)\right] \quad (2.104)$$

The scalar discrete-time form is
$$w_j(k+1) = w_j(k) + \mu\left[y(k)x_j(k) - \gamma w_j(k)\right] \quad (2.105)$$
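A minimal sketch of the discrete-time rule (2.104); the learning rate mu, decay factor gamma, and the tanh output nonlinearity are illustrative assumptions:

import numpy as np

def hebbian_step(w, x, mu=0.01, gamma=0.1):
    # standard Hebbian update (2.104): w <- w + mu*(y*x - gamma*w)
    y = np.tanh(np.dot(w, x))            # neuron output y = f(v); tanh assumed here
    return w + mu * (y * x - gamma * w)

w = np.array([0.1, -0.2, 0.05])
for _ in range(100):
    w = hebbian_step(w, np.array([1.0, 0.5, -0.5]))
print(w)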

Page 92: Fundamental Neurocomputing Concepts

92

Overview of basic learning rules for a single neuron (cont.) Generalized Hebbian learning rule

It can be considered a gradient optimization process when an appropriate energy or Lyapunov function is selected:
$$\frac{d\mathbf{w}}{dt} = -\mu\frac{\partial E(\mathbf{w})}{\partial \mathbf{w}} \quad (2.106)$$

The resulting generalized Hebbian learning rule is given as
$$\frac{d\mathbf{w}}{dt} = \mu\left[\zeta(\cdot)\mathbf{x} - \gamma\mathbf{w}\right] \quad (2.107)$$
where $\zeta(\cdot)$ is the learning signal.

The discrete-time form is
$$\mathbf{w}(k+1) = \mathbf{w}(k) + \mu\left[\zeta(k)\mathbf{x}(k) - \gamma\mathbf{w}(k)\right] = (1 - \mu\gamma)\mathbf{w}(k) + \mu\zeta(k)\mathbf{x}(k) \quad (2.108)$$

Page 93: Fundamental Neurocomputing Concepts

93

Overview of basic learning rules for a single neuron (cont.)

Assume that the learning signal is the output of the neuron

Therefore, (2.107) becomes

$$\zeta = y = \frac{d\phi(v)}{dv} = f(v) \quad (2.109)$$
$$\frac{d\mathbf{w}}{dt} = \mu\left(y\mathbf{x} - \gamma\mathbf{w}\right) \quad (2.110)$$

Page 94: Fundamental Neurocomputing Concepts

94

Overview of basic learning rules for a single neuron (cont.): Oja's Learning Rule

Oja's learning rule can be derived by minimizing the following energy function:
$$E(\mathbf{w}) = \frac{1}{2}\|\hat{\mathbf{e}}\|_2^2, \qquad \hat{\mathbf{e}} = \mathbf{x} - \hat{\mathbf{x}} = \mathbf{x} - y\mathbf{w} \quad (2.111)$$

Two basic assumptions are made: the neuron's weight vector is normalized, $\|\mathbf{w}\|_2 = 1$, and the neuron uses a linear activation function, $y = \mathbf{w}^T\mathbf{x}$ (2.112).

Therefore the energy function can be rewritten as
$$E(\mathbf{w}) = \frac{1}{2}\|\mathbf{x} - y\mathbf{w}\|_2^2 = \frac{1}{2}\left(\mathbf{x}^T\mathbf{x} - 2y\mathbf{w}^T\mathbf{x} + y^2\mathbf{w}^T\mathbf{w}\right) = \frac{1}{2}\|\mathbf{x}\|_2^2 - y^2 + \frac{1}{2}y^2\|\mathbf{w}\|_2^2 \quad (2.114)$$

Page 95: Fundamental Neurocomputing Concepts

95

Overview of basic learning rules for a single neuron (cont.)

Using the steepest descent method, Oja's continuous-time learning rule can be written as
$$\frac{d\mathbf{w}}{dt} = -\mu\frac{\partial E(\mathbf{w})}{\partial \mathbf{w}} \quad (2.115)$$
$$\frac{\partial E(\mathbf{w})}{\partial \mathbf{w}} = -\left(y\mathbf{x} - y^2\mathbf{w}\right) \quad (2.116)$$

Substituting (2.116) into (2.115) gives Oja's continuous-time learning rule
$$\frac{d\mathbf{w}}{dt} = \mu\left(y\mathbf{x} - y^2\mathbf{w}\right), \qquad y = v = \mathbf{w}^T\mathbf{x} \quad (2.117)$$
where $\mu > 0$ is the learning rate. The term $y\mathbf{x}$ is the Hebbian co-occurrence term and $y^2\mathbf{w}$ is the active decay term.

Page 96: Fundamental Neurocomputing Concepts

96

Overview of basic learning rules for a single neuron (cont.)

Equation (2.117) can be rewritten in discrete-time form:
$$\mathbf{w}(k+1) = \mathbf{w}(k) + \mu y(k)\left[\mathbf{x}(k) - y(k)\mathbf{w}(k)\right] \quad (2.119)$$

The scalar form of (2.119) can be written as
$$w_j(k+1) = w_j(k) + \mu y(k)\left[x_j(k) - y(k)w_j(k)\right] \quad (2.120)$$

Typical (simple) form of Hebbian learning:
$$w_j(k+1) = w_j(k) + \mu y(k)x_j(k) \quad (2.121)$$
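A minimal sketch of the discrete-time Oja rule (2.119) with a linear output y = w^T x; the anisotropic random data are illustrative, and the final weight vector approaches the dominant principal direction of the inputs (a well-known property of Oja's rule, illustrated rather than proven here):

import numpy as np

def oja_step(w, x, mu=0.01):
    # Oja's rule (2.119): w <- w + mu*y*(x - y*w), with linear output y = w^T x
    y = np.dot(w, x)
    return w + mu * y * (x - y * w)

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3)) * np.array([3.0, 1.0, 0.5])   # largest variance on the first axis
w = rng.normal(size=3)
for x in X:
    w = oja_step(w, x)
print(w / np.linalg.norm(w))      # close to +/- [1, 0, 0]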

Page 97: Fundamental Neurocomputing Concepts

97

Example 2.4

Page 98: Fundamental Neurocomputing Concepts

98

Overview of basic learning rules for a single neuron (cont.) Potential Learning Rule

No desired signal is required, so this is a form of unsupervised learning. The learning is performed on the activity level of the neuron.

The potential learning rule can be derived by minimizing the following energy function:
$$E(\mathbf{w}) = -\mathcal{L}\left(\mathbf{w}^T\mathbf{x}\right) + \frac{\gamma}{2}\|\mathbf{w}\|_2^2 \quad (2.122)$$
where $v = \mathbf{w}^T\mathbf{x}$, $\mathcal{L}(\cdot)$ is the loss function, and $\gamma > 0$.

Page 99: Fundamental Neurocomputing Concepts

99

Overview of basic learning rules for a single neuron (cont.)

The gradient of (2.122) is given as
$$\nabla_w E(\mathbf{w}) = -\varphi(v)\mathbf{x} + \gamma\mathbf{w} \quad (2.123)$$

The learning signal is
$$\varphi(v) = \frac{d\mathcal{L}(v)}{dv} \quad (2.124)$$

Using Amari's result in (2.106) and the gradient in (2.123), the continuous-time potential learning rule is
$$\frac{d\mathbf{w}}{dt} = \mu\left[\varphi(v)\mathbf{x} - \gamma\mathbf{w}\right] \quad (2.125)$$

The discrete-time form is
$$\mathbf{w}(k+1) = \mathbf{w}(k) + \mu\left[\varphi(v(k))\mathbf{x}(k) - \gamma\mathbf{w}(k)\right]$$

The discrete-time scalar form is
$$w_j(k+1) = w_j(k) + \mu\left[\varphi(v(k))x_j(k) - \gamma w_j(k)\right] \quad (2.126)$$

Page 100: Fundamental Neurocomputing Concepts

100

Overview of basic learning rules for a single neuron (cont.) Correlation Learning Rule

The correlation learning rule is obtained by minimizing the energy function
$$E(\mathbf{w}) = -d\,\mathbf{x}^T\mathbf{w} + \frac{\gamma}{2}\|\mathbf{w}\|_2^2 \quad (2.127)$$

The gradient with respect to the synaptic weight vector is
$$\nabla_w E(\mathbf{w}) = -d\,\mathbf{x} + \gamma\mathbf{w} \quad (2.128)$$

Using Amari's result (2.106) and (2.128), the continuous-time correlation learning rule is
$$\frac{d\mathbf{w}}{dt} = \mu\left(d\,\mathbf{x} - \gamma\mathbf{w}\right) \quad (2.129)$$

The discrete-time form can be written as
$$\mathbf{w}(k+1) = \mathbf{w}(k) + \mu\left[d(k)\mathbf{x}(k) - \gamma\mathbf{w}(k)\right] \quad (2.130)$$

The discrete-time scalar form is
$$w_j(k+1) = w_j(k) + \mu\left[d(k)x_j(k) - \gamma w_j(k)\right] \quad (2.131)$$

The correlation learning rule is commonly used to record data in memory networks; if d in (2.129) is replaced by y, it becomes the Hebbian learning rule.

Page 101: Fundamental Neurocomputing Concepts

101

Overview of basic learning rules for a single neuron (cont.) Standard Perceptron Learning Rule

The standard perceptron learning rule can be derived by minimizing the MSE criterion
$$E(\mathbf{w}) = \frac{1}{2}e^2 \quad (2.132)$$
where
$$e = d - y \quad (2.133)$$
and the output of the neuron is
$$y = f\left(\mathbf{w}^T\mathbf{x}\right) = f(v)$$

Using the steepest descent approach, the continuous-time learning rule is given by
$$\frac{d\mathbf{w}}{dt} = -\mu\nabla_w E(\mathbf{w}) \quad (2.134)$$

Page 102: Fundamental Neurocomputing Concepts

102

Overview of basic learning rules for a single neuron (cont.)

The gradient of (2.132) is then
$$\nabla_w E(\mathbf{w}) = -\left[d - f(v)\right]\frac{df(v)}{dv}\mathbf{x} = -e\frac{df(v)}{dv}\mathbf{x} \quad (2.135)$$
where
$$\delta = e\frac{df(v)}{dv} = e\,f'(v) = e\,g(v) \quad (2.136)$$

Page 103: Fundamental Neurocomputing Concepts

103

Overview of basic learning rules for a single neuron (cont.)

Using (2.134), (2.135), and (2.136), the continuous-time standard perceptron learning rule for a single neuron is
$$\frac{d\mathbf{w}}{dt} = \mu\delta\mathbf{x} \quad (2.137)$$

Equation (2.137) can be rewritten in discrete-time form as
$$\mathbf{w}(k+1) = \mathbf{w}(k) + \mu\delta(k)\mathbf{x}(k) \quad (2.138)$$

The scalar form of (2.138) can be written as
$$w_j(k+1) = w_j(k) + \mu\delta(k)x_j(k) \quad (2.139)$$

Page 104: Fundamental Neurocomputing Concepts

104

Overview of basic learning rules for a single neuron (cont.) Generalized Perceptron Learning Rule

Page 105: Fundamental Neurocomputing Concepts

105

Overview of basic learning rules for a single neuron (cont.) Generalized Perceptron Learning Rule

When the energy function is not defined to be the MSE criterion, we can define a general energy function as
$$E(\mathbf{w}) = \phi(e) = \phi(d - y) \quad (2.140)$$
where $\phi(\cdot)$ is a differentiable function. If $\phi(e) = \tfrac{1}{2}e^2$, this reduces to the standard perceptron learning rule.

The gradient is
$$\nabla_w E(\mathbf{w}) = \frac{\partial\phi(e)}{\partial e}\frac{\partial e}{\partial y}\frac{\partial y}{\partial v}\frac{\partial v}{\partial \mathbf{w}} \quad (2.141)$$
where
$$\frac{\partial\phi(e)}{\partial e} = \phi'(e) = \psi(e) \quad (2.142)$$
$$y = f\left(\mathbf{w}^T\mathbf{x}\right) = f(v) \quad (2.143)$$

Page 106: Fundamental Neurocomputing Concepts

106

Overview of basic learning rules for a single neuron (cont.)

f(·) is a differentiable function, and
$$\frac{dy}{dv} = \frac{df(v)}{dv} = f'(v) = g(v) \quad (2.144)$$

Then (2.141) can be written as
$$\nabla_w E(\mathbf{w}) = -\psi(e)g(v)\mathbf{x} \quad (2.145)$$

The continuous-time general perceptron learning rule is given as
$$\frac{d\mathbf{w}}{dt} = \mu\psi(e)g(v)\mathbf{x} \quad (2.146)$$

If we define the learning signal as
$$\delta = \psi(e)g(v) \quad (2.147)$$
then (2.146) can be written as
$$\frac{d\mathbf{w}}{dt} = \mu\delta\mathbf{x} \quad (2.148)$$

Page 107: Fundamental Neurocomputing Concepts

107

Overview of basic learning rules for a single neuron (cont.)

Discrete-time form:
$$\mathbf{w}(k+1) = \mathbf{w}(k) + \mu\delta(k)\mathbf{x}(k) \quad (2.149)$$

Discrete scalar form:
$$w_j(k+1) = w_j(k) + \mu\delta(k)x_j(k) \quad (2.150)$$

Page 108: Fundamental Neurocomputing Concepts

108

Data Preprocessing

The performance of a neural network is strongly dependent on the preprocessing that is performed on the training data.

Scaling: the training data can be amplitude-scaled in two ways, so that the values of each pattern lie either between -1 and 1 or between 0 and 1. This is referred to as min/max scaling (MATLAB: premnmx).
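A minimal NumPy sketch of min/max scaling to [-1, 1], assuming one feature per row and one pattern per column (the same convention used below for the matrices A and C):

import numpy as np

def minmax_scale(A):
    # scale each row of A to the range [-1, 1]
    a_min = A.min(axis=1, keepdims=True)
    a_max = A.max(axis=1, keepdims=True)
    return 2.0 * (A - a_min) / (a_max - a_min) - 1.0

A = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 40.0]])
print(minmax_scale(A))       # each row now spans exactly [-1, 1]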

Page 109: Fundamental Neurocomputing Concepts

109

Data Preprocessing (cont.)

Another scaling process:

Mean centering: used when the training data contain biases (offsets).
Variance scaling: used when the training data are measured in different units.

Assume the input vectors are arranged as the columns of a matrix $A \in \mathbb{R}^{n \times m}$ and the target vectors as the columns of a matrix $C \in \mathbb{R}^{p \times m}$.

Mean centering: compute the mean value of each row of A and C, and subtract the corresponding mean from every element of that row.
Variance scaling: compute the standard deviation of each row of A and C, and divide every element of that row by the corresponding standard deviation.

Page 110: Fundamental Neurocomputing Concepts

110

Data Preprocessing (cont.)

Transformations: the features of certain "raw" signals, when used as training inputs, provide better results than the raw signals themselves. A front-end feature extractor can be used to discern salient or distinguishing characteristics of the data.

Four transform methods:
Fourier Transform
Principal-Component Analysis
Partial Least-Squares Regression
Wavelets and Wavelet Transforms

Page 111: Fundamental Neurocomputing Concepts

111

Data Preprocessing (cont.)

Fourier Transform: the FFT can be used to extract the important features of the data, and these dominant characteristic features can then be used to train the neural network.

Page 112: Fundamental Neurocomputing Concepts

112

Data Preprocessing (cont.)

Three signals with the same waveform but different phases; each signal has 1024 samples.

Page 113: Fundamental Neurocomputing Concepts

113

Data Preprocessing (cont.)

The three signals share the same FFT magnitude response, and only 16 magnitude samples are needed.

The three signals with the same waveform but different phases differ in their FFT phase responses.

Page 114: Fundamental Neurocomputing Concepts

114

Data Preprocessing (cont.)

Principal-Component Analysis: PCA can be used to "compress" the input training data set and reduce the dimension of the inputs by determining the important features of the data according to an assessment of their variance.

In MATLAB, prepca is provided to perform PCA on the training data.

Page 115: Fundamental Neurocomputing Concepts

115

Data Preprocessing (cont.)

Given a set of training data $A \in \mathbb{R}^{n \times m}$, where it is assumed that $m \gg n$; n denotes the dimension of the input training patterns and m denotes the number of training patterns.

Using PCA, an "optimal" orthogonal transformation matrix $W_{pca} \in \mathbb{R}^{h \times n}$ can be determined, where $h \ll n$ (the degree of dimension reduction).

The dimension of the input vectors can be reduced according to the transformation
$$A_r = W_{pca}A \quad (2.151)$$
where $A_r \in \mathbb{R}^{h \times m}$ is the reduced-dimension set of training patterns. The columns of $A_r$ are the principal components for each of the inputs from A.
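A minimal sketch of the transformation (2.151), assuming W_pca is formed from the top h eigenvectors of the sample covariance of the (column-wise) training patterns; the random data are illustrative:

import numpy as np

def pca_transform(A, h):
    # A: (n, m) matrix with m training patterns in columns; returns W_pca (h, n) and A_r (h, m)
    A_centered = A - A.mean(axis=1, keepdims=True)       # mean-center each input component
    C = (A_centered @ A_centered.T) / A.shape[1]         # (n, n) covariance estimate
    eigvals, eigvecs = np.linalg.eigh(C)                  # eigenvalues in ascending order
    W_pca = eigvecs[:, ::-1][:, :h].T                     # top-h eigenvectors as rows
    return W_pca, W_pca @ A_centered                      # (2.151) A_r = W_pca A

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 200))                             # 5-dimensional inputs, 200 patterns
W_pca, A_r = pca_transform(A, h=2)
print(A_r.shape)                                          # (2, 200)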

Page 116: Fundamental Neurocomputing Concepts

116

Data Preprocessing (cont.): Partial Least-Squares Regression

PLSR can be used to compress the input training data set. Its use is restricted to supervised trained neural networks, and only scalar target values are allowed. The factor analysis in PLSR determines the degree of compression of the input data.

After the optimal number of PLSR factors h has been determined, the weight loading vectors can be used to transform the data, similar to the PCA approach. The optimal set of weight loading vectors forms the columns of an orthogonal transformation matrix $W_{plsr} \in \mathbb{R}^{n \times h}$.

The dimension of the input vectors can be reduced according to the transformation
$$A_r = W_{plsr}^T A \quad (2.152)$$

Page 117: Fundamental Neurocomputing Concepts

117

Data Preprocessing (cont.)

PCA and PLSR orthogonal transformation vectors used for data compression.

PLSR uses both the input data and the target data to generate the weight loading vectors of the orthogonal transformation $W_{plsr}$.

Page 118: Fundamental Neurocomputing Concepts

118

Data Preprocessing (cont.)

Wavelets and Wavelet Transforms: A wave is an oscillating function of time. Fourier analysis is used for analyzing waves: certain functions can be expanded in terms of sinusoidal waves, showing how much of each frequency component is required to synthesize the signal. It is very useful for periodic, time-invariant, stationary signal analysis.

A wavelet can be considered a small wave whose energy is concentrated. Wavelets are useful for analyzing signals that are time-varying, transient, and nonstationary, allowing simultaneous time and frequency analysis. Wavelets are local waves. The wavelet transform can provide a time-frequency description of signals and can be used to compress data for training a neural network.