Chapter 3 Single-Layer Perceptrons


Page 1: Chapter 3 Single-Layer Perceptrons


Chapter 3: Single-Layer Perceptrons

Instructor: Chuan-Yu Chang, Ph.D. (張傳育) E-mail: [email protected] Tel: (05)5342601 ext. 4337 Office: ES709

Page 2: Chapter 3 Single-Layer Perceptrons


Adaptive Filter Problem
Consider a dynamic system whose mathematical characterization is unknown. All we know about the system is a set of labeled input-output data generated by it: when an m-dimensional input x(i) is applied to the system, the system produces the corresponding output d(i). The external behavior of the system can therefore be described by the data set

$\mathcal{T} : \{\mathbf{x}(i), d(i);\ i = 1, 2, \ldots, n, \ldots\}$   (3.1)

where $\mathbf{x}(i) = [x_1(i), x_2(i), \ldots, x_m(i)]^T$

[Figure: an unknown dynamic system driven by the inputs x1(i), x2(i), x3(i), ..., producing the output d(i).]

Page 3: Chapter 3 Single-Layer Perceptrons


Adaptive Filter Problem (cont.)
The problem: how do we design a multiple-input, single-output model of the unknown system?

The neural model operates under the influence of an algorithm that controls necessary adjustments to the synaptic weights of the neuron.

• The algorithm starts from an arbitrary setting of the neuron’s synaptic weights.

• Adjustments to the synaptic weights, in response to statistical variations in the system’s behavior, are made on a continuous basis.

• Computations of adjustments to the synaptic weights are completed inside a time interval that is one sampling period long.

The adaptive model consists of two continuous processes:

• Filtering process, which involves the computation of two signals: an output y(i) and an error signal e(i).

• Adaptive process, the automatic adjustment of the synaptic weights of the neuron in accordance with the error signal e(i).

Page 4: Chapter 3 Single-Layer Perceptrons


Adaptive Filter Problem (cont.)
The output y(i) is the same as the induced local field v(i):

$y(i) = v(i) = \sum_{k=1}^{m} w_k(i)\, x_k(i)$   (3.2)

Eq. (3.2) can be expressed as the inner product of two vectors:

$y(i) = \mathbf{x}^T(i)\, \mathbf{w}(i)$   (3.3)

where

$\mathbf{w}(i) = [w_1(i), w_2(i), \ldots, w_m(i)]^T$

The neuron's output y(i) is compared with the corresponding desired output d(i), yielding the error signal

$e(i) = d(i) - y(i)$   (3.4)

[Figure: signal-flow graph of the adaptive model. The inputs x1(i), x2(i), x3(i), weighted by w1(i), w2(i), w3(i), are summed to form v(i) = y(i); y(i) is then subtracted from d(i) (via a -1 branch) to produce e(i).]
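As a quick illustration of the filtering process in Eqs. (3.2)-(3.4), the following sketch (an addition to this transcript, not part of the original slides) computes the output and error signal for one time step with NumPy; the sample values of w, x, and d are hypothetical.

```python
import numpy as np

def filter_step(w, x, d):
    """One step of the filtering process: Eqs. (3.3) and (3.4)."""
    y = np.dot(x, w)   # output y(i) = x^T(i) w(i), Eq. (3.3)
    e = d - y          # error  e(i) = d(i) - y(i), Eq. (3.4)
    return y, e

# Example: a 3-input linear neuron at one time step i
w = np.array([0.5, -0.2, 0.1])   # synaptic weights w(i)
x = np.array([1.0, 2.0, -1.0])   # input vector x(i)
d = 0.3                          # desired response d(i)
y, e = filter_step(w, x, d)
print(y, e)  # 0.0 0.3
```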

Page 5: Chapter 3 Single-Layer Perceptrons


Unconstrained Optimization Techniques
Consider a cost function E(w) that is continuously differentiable with respect to the weight vector w. The goal of the adaptive filtering algorithm is to choose a weight vector w that has the minimum cost. If the optimal weight vector is w*, it must satisfy

$E(\mathbf{w}^*) \le E(\mathbf{w})$ for all $\mathbf{w}$   (3.5)

That is: minimize the cost function E(w) with respect to the weight vector w.

The necessary condition for optimality is

$\nabla E(\mathbf{w}^*) = \mathbf{0}$   (3.7)

where the gradient operator is

$\nabla = \left[\frac{\partial}{\partial w_1}, \frac{\partial}{\partial w_2}, \ldots, \frac{\partial}{\partial w_m}\right]^T$   (3.8)

and the gradient vector of the cost function is

$\nabla E(\mathbf{w}) = \left[\frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_m}\right]^T$   (3.9)

Page 6: Chapter 3 Single-Layer Perceptrons


Unconstrained Optimization Techniques

Local iterative descent: starting with an initial guess denoted by w(0), generate a sequence of weight vectors w(1), w(2), ..., such that the cost function E(w) is reduced at each iteration of the algorithm:

$E(\mathbf{w}(n+1)) < E(\mathbf{w}(n))$   (3.10)

where w(n) is the old value of the weight vector and w(n+1) is its updated value.

We hope that the algorithm will eventually converge onto the optimal solution w*.

Page 7: Chapter 3 Single-Layer Perceptrons


Method of steepest Descent

The successive adjustments applied to the weight vector w are in the direction of steepest descent, that is, in a direction opposite to the gradient vector

$\mathbf{g} = \nabla E(\mathbf{w})$   (3.11)

The steepest-descent algorithm is formally described by

$\mathbf{w}(n+1) = \mathbf{w}(n) - \eta\,\mathbf{g}(n)$   (3.12)

where η is a positive constant called the step size or learning-rate parameter. The correction applied by the algorithm is

$\Delta\mathbf{w}(n) = \mathbf{w}(n+1) - \mathbf{w}(n) = -\eta\,\mathbf{g}(n)$   (3.13)
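A minimal sketch of the steepest-descent iteration of Eqs. (3.12)-(3.13), assuming a user-supplied gradient function; the quadratic cost used in the example is hypothetical and not part of the original slides.

```python
import numpy as np

def steepest_descent(grad, w0, eta, n_iters):
    """Iterate w(n+1) = w(n) - eta * g(n), Eq. (3.12)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iters):
        w = w - eta * grad(w)   # correction -eta * g(n), Eq. (3.13)
    return w

# Example: E(w) = 0.5 * w^T A w - b^T w, so g(w) = A w - b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
w_star = steepest_descent(lambda w: A @ w - b, np.zeros(2), eta=0.1, n_iters=200)
print(w_star)  # approaches the optimum A^{-1} b = [0.2, 0.4]
```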

Page 8: Chapter 3 Single-Layer Perceptrons


Method of steepest Descent (cont.)

To show that the steepest-descent algorithm satisfies the condition of Eq. (3.10), we use a first-order Taylor series expansion around w(n) to approximate E(w(n+1)):

$E(\mathbf{w}(n+1)) \approx E(\mathbf{w}(n)) + \mathbf{g}^T(n)\,\Delta\mathbf{w}(n)$

Substituting Eq. (3.13) into this expansion gives

$E(\mathbf{w}(n+1)) \approx E(\mathbf{w}(n)) - \eta\,\mathbf{g}^T(n)\,\mathbf{g}(n) = E(\mathbf{w}(n)) - \eta\,\|\mathbf{g}(n)\|^2$

which shows that for a positive learning-rate parameter η, the cost function decreases at each iteration.

Page 9: Chapter 3 Single-Layer Perceptrons


Method of steepest Descent (cont.)
The method of steepest descent converges to the optimal solution w* slowly. The learning-rate parameter η has a serious influence on its convergence behavior:

When η is small, the transient response of the algorithm is overdamped: the trajectory traced by w(n) follows a smooth path in the w-plane.

When η is large, the transient response of the algorithm is underdamped: the trajectory of w(n) follows a zigzagging (oscillatory) path.

When η exceeds a certain critical value, the algorithm becomes unstable.

Page 10: Chapter 3 Single-Layer Perceptrons


Method of steepest Descent (cont.)

[Figure: trajectories of w(n) under steepest descent for small and large values of the learning-rate parameter η.]

Page 11: Chapter 3 Single-Layer Perceptrons


Here F is assumed to be defined on the plane, and its graph has a bowl shape.

The blue curves are the contour lines, that is, the regions on which the value of F is constant.

A red arrow originating at a point shows the direction of the negative gradient at that point. Note that the (negative) gradient at a point is perpendicular to the

contour line going through that point. We see that gradient descent leads us to the bottom of the bowl,

that is, to the point where the value of the function F is minimal.

Page 12: Chapter 3 Single-Layer Perceptrons


Method of steepest Descent (cont.)
Newton's method
The basic idea is to minimize the quadratic approximation of the cost function E(w) around the current point w(n); this minimization is performed at each iteration of the algorithm. Using a second-order Taylor series expansion of the cost function around the point w(n):

$\Delta E(\mathbf{w}(n)) = E(\mathbf{w}(n+1)) - E(\mathbf{w}(n)) \approx \mathbf{g}^T(n)\,\Delta\mathbf{w}(n) + \frac{1}{2}\,\Delta\mathbf{w}^T(n)\,\mathbf{H}(n)\,\Delta\mathbf{w}(n)$   (3.14)

g(n) is the m-by-1 gradient vector of the cost function E(w) evaluated at the point w(n).

The matrix H(n) is the m-by-m Hessian matrix of E(w).

Page 13: Chapter 3 Single-Layer Perceptrons


Method of steepest Descent (cont.) The Hessian of E (w) is defined by

$\mathbf{H} = \nabla^2 E(\mathbf{w}) = \begin{bmatrix} \frac{\partial^2 E}{\partial w_1^2} & \frac{\partial^2 E}{\partial w_1 \partial w_2} & \cdots & \frac{\partial^2 E}{\partial w_1 \partial w_m} \\ \frac{\partial^2 E}{\partial w_2 \partial w_1} & \frac{\partial^2 E}{\partial w_2^2} & \cdots & \frac{\partial^2 E}{\partial w_2 \partial w_m} \\ \vdots & \vdots & & \vdots \\ \frac{\partial^2 E}{\partial w_m \partial w_1} & \frac{\partial^2 E}{\partial w_m \partial w_2} & \cdots & \frac{\partial^2 E}{\partial w_m^2} \end{bmatrix}$   (3.15)

From Eq. (3.15) it follows that the cost function E(w) must be twice differentiable with respect to w. Differentiating Eq. (3.14) with respect to Δw, the change ΔE(w) is minimized when

$\mathbf{g}(n) + \mathbf{H}(n)\,\Delta\mathbf{w}(n) = \mathbf{0}$

Solving this equation for Δw(n) yields

$\Delta\mathbf{w}(n) = -\mathbf{H}^{-1}(n)\,\mathbf{g}(n)$

That is,

$\mathbf{w}(n+1) = \mathbf{w}(n) + \Delta\mathbf{w}(n) = \mathbf{w}(n) - \mathbf{H}^{-1}(n)\,\mathbf{g}(n)$   (3.16)

The Hessian H(n) has to be a positive definite matrix for all n. There is no guarantee that H(n) is positive definite at every iteration of the algorithm.
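A sketch of one Newton iteration, Eq. (3.16), assuming the gradient and Hessian are available as callables; np.linalg.solve is used rather than an explicit matrix inverse. The quadratic cost in the example is hypothetical.

```python
import numpy as np

def newton_step(w, grad, hess):
    """One Newton iteration: w(n+1) = w(n) - H^{-1}(n) g(n), Eq. (3.16)."""
    g = grad(w)
    H = hess(w)                       # must be positive definite at this point
    return w - np.linalg.solve(H, g)  # solve H * dw = g instead of inverting H

# Example: for a quadratic cost E(w) = 0.5 w^T A w - b^T w,
# g(w) = A w - b and H(w) = A, so a single step reaches the minimum exactly.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
w = newton_step(np.zeros(2), lambda w: A @ w - b, lambda w: A)
print(w)  # [0.2, 0.4], the exact minimizer A^{-1} b
```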

Page 14: Chapter 3 Single-Layer Perceptrons


Method of steepest Descent (cont.)
Gauss-Newton Method
Let the cost function be the sum of error squares:

$E(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^{n} e^2(i)$   (3.17)

The error signal e(i) is a function of the adjustable weight vector w. Given an operating point w(n), we linearize the dependence of e(i) on w by writing

$e'(i, \mathbf{w}) = e(i) + \left[\frac{\partial e(i)}{\partial \mathbf{w}}\right]^T (\mathbf{w} - \mathbf{w}(n)), \quad i = 1, 2, \ldots, n$   (3.18)

In matrix notation,

$\mathbf{e}'(n, \mathbf{w}) = \mathbf{e}(n) + \mathbf{J}(n)\,(\mathbf{w} - \mathbf{w}(n))$   (3.19)

where the error vector is

$\mathbf{e}(n) = [e(1), e(2), \ldots, e(n)]^T$

Page 15: Chapter 3 Single-Layer Perceptrons


Method of steepest Descent (cont.)
J(n) is the n-by-m Jacobian matrix of e(n):

$\mathbf{J}(n) = \begin{bmatrix} \partial e(1)/\partial w_1 & \partial e(1)/\partial w_2 & \cdots & \partial e(1)/\partial w_m \\ \partial e(2)/\partial w_1 & \partial e(2)/\partial w_2 & \cdots & \partial e(2)/\partial w_m \\ \vdots & \vdots & & \vdots \\ \partial e(n)/\partial w_1 & \partial e(n)/\partial w_2 & \cdots & \partial e(n)/\partial w_m \end{bmatrix}$   (3.20)

The Jacobian J(n) is the transpose of the m-by-n gradient matrix ∇e(n).

The updated weight vector w(n+1) is then defined by

$\mathbf{w}(n+1) = \arg\min_{\mathbf{w}} \left\{ \frac{1}{2}\,\|\mathbf{e}'(n, \mathbf{w})\|^2 \right\}$   (3.21)

Page 16: Chapter 3 Single-Layer Perceptrons


Method of steepest Descent (cont.)
Using Eq. (3.19) to evaluate the squared Euclidean norm of e'(n, w), we get

$\frac{1}{2}\|\mathbf{e}'(n, \mathbf{w})\|^2 = \frac{1}{2}\|\mathbf{e}(n)\|^2 + \mathbf{e}^T(n)\,\mathbf{J}(n)\,(\mathbf{w} - \mathbf{w}(n)) + \frac{1}{2}(\mathbf{w} - \mathbf{w}(n))^T\,\mathbf{J}^T(n)\,\mathbf{J}(n)\,(\mathbf{w} - \mathbf{w}(n))$

Differentiating this expression with respect to w and setting the result equal to zero, we obtain

$\mathbf{J}^T(n)\,\mathbf{e}(n) + \mathbf{J}^T(n)\,\mathbf{J}(n)\,(\mathbf{w} - \mathbf{w}(n)) = \mathbf{0}$

which can be solved for w to give

$\mathbf{w}(n+1) = \mathbf{w}(n) - \left(\mathbf{J}^T(n)\,\mathbf{J}(n)\right)^{-1}\mathbf{J}^T(n)\,\mathbf{e}(n)$   (3.22)

The Gauss-Newton method requires only the Jacobian matrix of the error vector e(n). However, the matrix J^T(n)J(n) must be nonsingular.

Page 17: Chapter 3 Single-Layer Perceptrons


Method of steepest Descent (cont.)
There is no guarantee that this condition will always hold.

• Add the diagonal matrix δI to the matrix J^T(n)J(n).

• The parameter δ is a small positive constant.

On this basis, the Gauss-Newton method is implemented in the slightly modified form

$\mathbf{w}(n+1) = \mathbf{w}(n) - \left(\mathbf{J}^T(n)\,\mathbf{J}(n) + \delta\,\mathbf{I}\right)^{-1}\mathbf{J}^T(n)\,\mathbf{e}(n)$   (3.23)

The effect of this modification is progressively reduced as the number of iterations, n, is increased. Eq. (3.23) is the solution of the following modified cost function:

$E(\mathbf{w}) = \frac{1}{2}\left\{ \delta\,\|\mathbf{w} - \mathbf{w}(0)\|^2 + \sum_{i=1}^{n} e^2(i) \right\}$   (3.24)

where w(0) is the initial value of the weight vector w(i).
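A sketch of the modified Gauss-Newton update of Eq. (3.23), assuming the error vector e(n) and Jacobian J(n) have already been evaluated at the current operating point; the regularization constant delta plays the role of the small positive parameter δ above, and the data are hypothetical.

```python
import numpy as np

def gauss_newton_step(w, e, J, delta=1e-6):
    """Modified Gauss-Newton update, Eq. (3.23):
    w(n+1) = w(n) - (J^T J + delta*I)^{-1} J^T e(n)."""
    m = w.shape[0]
    A = J.T @ J + delta * np.eye(m)   # nonsingular for any delta > 0
    return w - np.linalg.solve(A, J.T @ e)

# Hypothetical example with linear errors e = X w - d, for which J = X
X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
d = np.array([1.0, 2.0, 2.0])
w = np.zeros(2)
e = X @ w - d
print(gauss_newton_step(w, e, X))  # close to the least-squares solution [1, 1]
```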

Page 18: Chapter 3 Single-Layer Perceptrons


Linear Least-Squares Filter
The linear least-squares filter has two distinctive characteristics:

• The single neuron around which it is built is linear.

• The cost function E(w) used to design the filter consists of the sum of error squares.

Hence, using Eqs. (3.3) and (3.4), the error vector can be expressed as

$\mathbf{e}(n) = \mathbf{d}(n) - [\mathbf{x}(1), \mathbf{x}(2), \ldots, \mathbf{x}(n)]^T\,\mathbf{w}(n) = \mathbf{d}(n) - \mathbf{X}(n)\,\mathbf{w}(n)$   (3.25)

where d(n) is the n-by-1 desired response vector:

$\mathbf{d}(n) = [d(1), d(2), \ldots, d(n)]^T$

and X(n) is the n-by-m data matrix:

$\mathbf{X}(n) = [\mathbf{x}(1), \mathbf{x}(2), \ldots, \mathbf{x}(n)]^T$

[Figure: the same signal-flow graph of the linear adaptive model as on Page 4.]

Page 19: Chapter 3 Single-Layer Perceptrons


Linear Least-Squares Filter (cont.)
Differentiating Eq. (3.25) with respect to w(n) yields the gradient matrix

$\nabla \mathbf{e}(n) = -\mathbf{X}^T(n)$

Correspondingly, the Jacobian of e(n) is

$\mathbf{J}(n) = -\mathbf{X}(n)$   (3.26)

Substituting Eqs. (3.25) and (3.26) into (3.22) gives

$\mathbf{w}(n+1) = \mathbf{w}(n) + \left(\mathbf{X}^T(n)\,\mathbf{X}(n)\right)^{-1}\mathbf{X}^T(n)\,\left(\mathbf{d}(n) - \mathbf{X}(n)\,\mathbf{w}(n)\right) = \left(\mathbf{X}^T(n)\,\mathbf{X}(n)\right)^{-1}\mathbf{X}^T(n)\,\mathbf{d}(n)$   (3.27)

The matrix

$\mathbf{X}^+(n) = \left(\mathbf{X}^T(n)\,\mathbf{X}(n)\right)^{-1}\mathbf{X}^T(n)$   (3.28)

is the pseudoinverse of the data matrix X(n). Hence, Eq. (3.27) can be rewritten as

$\mathbf{w}(n+1) = \mathbf{X}^+(n)\,\mathbf{d}(n)$   (3.29)
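The closed-form solution of Eqs. (3.27)-(3.29) can be written directly with NumPy, where np.linalg.pinv computes the pseudoinverse X^+(n). This is a sketch with hypothetical data, not part of the original slides.

```python
import numpy as np

# Hypothetical data matrix X(n) (n-by-m) and desired response vector d(n)
X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
d = np.array([1.0, 2.0, 2.0])

# Linear least-squares filter, Eq. (3.29): w = X^+ d
w = np.linalg.pinv(X) @ d
print(w)  # [1.0, 1.0] for this data

# Equivalent normal-equation form of Eq. (3.27): w = (X^T X)^{-1} X^T d
w_alt = np.linalg.solve(X.T @ X, X.T @ d)
```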

Page 20: Chapter 3 Single-Layer Perceptrons


Linear Least-Squares Filter (cont.)
Wiener Filter:
The input vector x(i) and desired response d(i) are drawn from an ergodic environment. We may then substitute long-term sample averages (time averages) for expectations (ensemble averages). An ergodic environment can be described by its second-order statistics:

• The correlation matrix of the input vector x(i), denoted Rx.

• The cross-correlation vector between the input vector x(i) and the desired response d(i), denoted rxd.

With E denoting the statistical expectation operator,

$\mathbf{R}_x = E[\mathbf{x}(i)\,\mathbf{x}^T(i)] = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}(i)\,\mathbf{x}^T(i) = \lim_{n \to \infty} \frac{1}{n}\,\mathbf{X}^T(n)\,\mathbf{X}(n)$   (3.30)

$\mathbf{r}_{xd} = E[\mathbf{x}(i)\,d(i)] = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}(i)\,d(i) = \lim_{n \to \infty} \frac{1}{n}\,\mathbf{X}^T(n)\,\mathbf{d}(n)$   (3.31)

Page 21: Chapter 3 Single-Layer Perceptrons


Linear Least-Squares Filter (cont.)
Accordingly, we may reformulate the linear least-squares solution of Eq. (3.27) as

$\mathbf{w}_0 = \lim_{n \to \infty} \mathbf{w}(n+1) = \left(\lim_{n \to \infty} \frac{1}{n}\,\mathbf{X}^T(n)\,\mathbf{X}(n)\right)^{-1} \lim_{n \to \infty} \frac{1}{n}\,\mathbf{X}^T(n)\,\mathbf{d}(n) = \mathbf{R}_x^{-1}\,\mathbf{r}_{xd}$   (3.32)

The weight vector w0 is called the Wiener solution to the linear optimum filtering problem.

For an ergodic process, the linear least-squares filter asymptotically approaches the Wiener filter as the number of observations approaches infinity.

However, the second-order statistics are not available in many important situations encountered in practice.

Page 22: Chapter 3 Single-Layer Perceptrons


Least-Mean-Square Algorithm

The LMS algorithm is based on the use of instantaneous values for the cost function:

$E(\mathbf{w}) = \frac{1}{2}\,e^2(n)$   (3.33)

where e(n) is the error signal measured at time n. Differentiating E(w) with respect to the weight vector w yields

$\frac{\partial E(\mathbf{w})}{\partial \mathbf{w}} = e(n)\,\frac{\partial e(n)}{\partial \mathbf{w}}$   (3.34)

Page 23: Chapter 3 Single-Layer Perceptrons


Least-Mean-Square Algorithm (cont.)
Since

$e(n) = d(n) - \mathbf{x}^T(n)\,\mathbf{w}(n)$

we have

$\frac{\partial e(n)}{\partial \mathbf{w}(n)} = -\mathbf{x}(n)$

so Eq. (3.34) can be rewritten as

$\frac{\partial E(\mathbf{w})}{\partial \mathbf{w}(n)} = -\mathbf{x}(n)\,e(n)$   (3.35)

Using this result as an estimate of the gradient vector, we have

$\hat{\mathbf{g}}(n) = -\mathbf{x}(n)\,e(n)$   (3.36)

Substituting into the method of steepest descent of Eq. (3.12), the LMS algorithm can be written as

$\hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) + \eta\,\mathbf{x}(n)\,e(n)$   (3.37)

Page 24: Chapter 3 Single-Layer Perceptrons


Least-Mean-Square Algorithm (cont.)

Summary of the LMS Algorithm
Training sample:
  Input signal vector: x(n)
  Desired response: d(n)
User-selected parameter: η
Initialization: set ŵ(0) = 0.
Computation: for n = 1, 2, ..., compute

$e(n) = d(n) - \mathbf{x}^T(n)\,\hat{\mathbf{w}}(n)$

$\hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) + \eta\,\mathbf{x}(n)\,e(n)$
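A direct transcription of the LMS summary above into NumPy; the training data are hypothetical and the learning-rate parameter eta is user-selected as in the slide.

```python
import numpy as np

def lms(X, d, eta):
    """LMS algorithm: for n = 1, 2, ..., compute
    e(n) = d(n) - x^T(n) w(n);  w(n+1) = w(n) + eta * x(n) e(n)."""
    w = np.zeros(X.shape[1])          # initialization: w(0) = 0
    for x_n, d_n in zip(X, d):
        e = d_n - x_n @ w             # error signal
        w = w + eta * x_n * e         # weight update, Eq. (3.37)
    return w

# Hypothetical example: noisy samples of a linear system d = x^T w_true
rng = np.random.default_rng(0)
w_true = np.array([0.5, -0.3])
X = rng.standard_normal((500, 2))
d = X @ w_true + 0.01 * rng.standard_normal(500)
print(lms(X, d, eta=0.05))  # approaches w_true
```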

Page 25: Chapter 3 Single-Layer Perceptrons


Least-Mean-Square Algorithm (cont.)
Signal-flow graph representation of the LMS algorithm:
Combining Eqs. (3.35) and (3.37), the evolution of the weight vector in the LMS algorithm can be expressed as

$\hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) + \eta\,\mathbf{x}(n)\,[d(n) - \mathbf{x}^T(n)\,\hat{\mathbf{w}}(n)] = [\mathbf{I} - \eta\,\mathbf{x}(n)\,\mathbf{x}^T(n)]\,\hat{\mathbf{w}}(n) + \eta\,\mathbf{x}(n)\,d(n)$   (3.38)

where I is the identity matrix. Furthermore,

$\hat{\mathbf{w}}(n) = z^{-1}[\hat{\mathbf{w}}(n+1)]$

where z^{-1} is the unit-time delay operator.

[Figure: signal-flow graph of the LMS algorithm; a feedback loop around ŵ(n) formed by the branch -x(n)x^T(n) and the unit-delay branch z^{-1}I, with ηx(n)d(n) as the external input.]

Page 26: Chapter 3 Single-Layer Perceptrons


Convergence Considerations of the LMS Algorithm
From control theory, the stability of a feedback system is determined by the parameters of its feedback loop. From Fig. 3.3, the feedback loop of the LMS algorithm involves two parameters: the learning rate η and the input vector x(n). The convergence criterion for the LMS algorithm is

Convergence in the mean square:

$E[e^2(n)] \to \text{constant as } n \to \infty$   (3.41)

Assumptions (independence theory):

• The successive input vectors x(1), x(2), ... are statistically independent of each other.
• At time n, the input vector x(n) is statistically independent of all previous samples of the desired response, d(1), d(2), ..., d(n-1).
• At time n, the desired response d(n) depends on x(n).
• x(n) and d(n) are drawn from Gaussian distributions.

Page 27: Chapter 3 Single-Layer Perceptrons


Convergence Considerations of the LMS Algorithm

By invoking the elements of independence theory and assuming that the learning-rate parameter η is sufficiently small, it can be shown that the LMS algorithm is convergent in the mean square provided that η satisfies the condition

$0 < \eta < \frac{2}{\lambda_{\max}}$   (3.42)

where λmax is the largest eigenvalue of the correlation matrix Rx.

In practical applications of the LMS algorithm, however, knowledge of λmax is not available. To overcome this difficulty, the trace of Rx may be used as a conservative estimate of λmax, in which case Eq. (3.42) can be rewritten as

$0 < \eta < \frac{2}{\operatorname{tr}[\mathbf{R}_x]}$   (3.43)

Page 28: Chapter 3 Single-Layer Perceptrons


Convergence Considerations of the LMS Algorithm

By definition, the trace of a square matrix is equal to the sum of its diagonal elements, and each diagonal element of the correlation matrix Rx equals the mean-square value of the corresponding sensor input. Hence, Eq. (3.43) can be restated as

$0 < \eta < \frac{2}{\text{sum of the mean-square values of the sensor inputs}}$   (3.44)

Provided the learning rate η satisfies this condition, the LMS algorithm is guaranteed to converge in the mean square (which implies convergence in the mean).
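The bounds of Eqs. (3.42)-(3.44) can be checked numerically. The sketch below, an illustration with hypothetical input data, compares the eigenvalue bound with the more conservative trace bound.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 3))          # hypothetical sensor inputs x(i)

Rx = (X.T @ X) / len(X)                     # sample correlation matrix, Eq. (3.30)
lam_max = np.linalg.eigvalsh(Rx).max()      # largest eigenvalue of Rx

eta_eig = 2.0 / lam_max                     # bound of Eq. (3.42)
eta_trace = 2.0 / np.trace(Rx)              # bound of Eq. (3.43); tr(Rx) equals the
                                            # sum of mean-square inputs, Eq. (3.44)
print(eta_trace <= eta_eig)                 # True: the trace bound is more conservative
```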

Page 29: Chapter 3 Single-Layer Perceptrons


Virtues and Limitations of the LMS Algorithm
Virtues of the LMS algorithm:

• Simplicity.
• Robustness: small model uncertainty and small disturbances can only result in small estimation errors.

Limitations of the LMS algorithm:

• Slow rate of convergence: the algorithm typically requires a number of iterations equal to about ten times the dimensionality of the input space to reach a steady-state condition.
• Sensitivity to variations in the eigenstructure of the input: the LMS algorithm is sensitive to variations in the condition number (eigenvalue spread) defined by

$\chi(\mathbf{R}_x) = \frac{\lambda_{\max}}{\lambda_{\min}}$   (3.45)

When the condition number χ(Rx) is high, the sensitivity of the LMS algorithm becomes acute.

Page 30: Chapter 3 Single-Layer Perceptrons


Learning Curves
Learning curve: a plot of the mean-square value of the estimation error, Eav(n), versus the number of iterations, n.

Rate of convergence: defined as the number of iterations, n, required for Eav(n) to decrease to some arbitrarily chosen value, such as 10 percent of the initial value Eav(0).

Misadjustment: a measure of how close the adaptive filter is to optimality in the mean-square-error sense.

Page 31: Chapter 3 Single-Layer Perceptrons


Learning Curves (cont.)
Misadjustment is defined as

$M = \frac{E_\infty - E_{\min}}{E_{\min}} = \frac{E_\infty}{E_{\min}} - 1$   (3.46)

where Emin denotes the minimum mean-square error produced by the Wiener filter, designed on the basis of known values of the correlation matrix Rx and the cross-correlation vector rxd.

The misadjustment M of the LMS algorithm is directly proportional to the learning-rate parameter η, whereas the average time constant τav is inversely proportional to η. If the learning-rate parameter is reduced so as to reduce the misadjustment, the settling time of the LMS algorithm is increased. Careful attention must therefore be given to the choice of the learning-rate parameter η in the design of the LMS algorithm in order to produce a satisfactory overall performance.

Page 32: Chapter 3 Single-Layer Perceptrons


Learning-rate Annealing Schedules
The learning rate of the LMS algorithm can be scheduled in several ways:

• Constant:

$\eta(n) = \eta_0$ for all $n$

• Time-varying (Robbins, 1951):

$\eta(n) = \frac{c}{n}$

where c is a constant. When c is large, there is a danger of parameter blowup for small n.

• Search-then-converge schedule (Darken and Moody, 1992):

$\eta(n) = \frac{\eta_0}{1 + n/\tau}$
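The three schedules can be written as plain functions of the time step n; eta0, c, and tau stand for the user-chosen constants η0, c, and τ named on the slide.

```python
def eta_constant(n, eta0=0.1):
    """Constant schedule: eta(n) = eta0 for all n."""
    return eta0

def eta_robbins(n, c=1.0):
    """Time-varying schedule (Robbins, 1951): eta(n) = c / n."""
    return c / n

def eta_search_then_converge(n, eta0=0.1, tau=100.0):
    """Search-then-converge schedule (Darken and Moody, 1992):
    eta(n) = eta0 / (1 + n/tau); roughly constant for n << tau,
    decaying like 1/n for n >> tau."""
    return eta0 / (1.0 + n / tau)
```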

Page 33: Chapter 3 Single-Layer Perceptrons


Learning-rate Annealing Schedules (cont.)

[Figure: learning-rate annealing schedules, η(n) versus n.]

Page 34: Chapter 3 Single-Layer Perceptrons


Perceptron
The perceptron is built around the McCulloch-Pitts model of a neuron: a linear combiner followed by a hard limiter (signum function).

The summing node of the neuronal model computes a linear combination of the inputs applied to its synapses, and also incorporates an externally applied bias. The resulting sum is applied to a hard limiter. The neuron produces an output equal to +1 if the hard limiter input is positive, and -1 if it is negative.

Page 35: Chapter 3 Single-Layer Perceptrons


Perceptron (cont.)
The synaptic weights of the perceptron are denoted by w1, w2, ..., wm. The inputs applied to the perceptron are denoted by x1, x2, ..., xm. The externally applied bias is denoted by b. The induced local field of the neuron is

$v = \sum_{i=1}^{m} w_i x_i + b$   (3.50)

[Figure: signal-flow graph of the perceptron; the inputs x1, x2, ..., xm, weighted by w1, w2, ..., wm, and the bias b are summed to form v, which is passed through the hard limiter φ(v) to produce the output y.]

Page 36: Chapter 3 Single-Layer Perceptrons


Perceptron (cont.)
The goal of the perceptron is to correctly classify the set of externally applied stimuli (x1, x2, ..., xm) into one of two classes, C1 or C2. The decision rule for the classification is to assign the point represented by the inputs (x1, x2, ..., xm) to class C1 if the perceptron output y is +1, and to class C2 if it is -1.

In its simplest form, the perceptron has two decision regions separated by a hyperplane defined by

$\sum_{i=1}^{m} w_i x_i + b = 0$   (3.51)

The synaptic weights w1, w2, ..., wm of the perceptron can be adapted on an iteration-by-iteration basis.

[Figure: decision boundary in the (x1, x2)-plane, a straight line separating class C1 from class C2.]
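In code, the decision rule amounts to taking the sign of the induced local field of Eq. (3.50); the weights, bias, and inputs below are hypothetical.

```python
import numpy as np

def perceptron_classify(w, b, x):
    """Assign x to class C1 (+1) if sum_i w_i x_i + b > 0, else class C2 (-1)."""
    v = np.dot(w, x) + b          # induced local field, Eq. (3.50)
    return 1 if v > 0 else -1

w = np.array([1.0, -1.0])         # hypothetical weights
b = -0.5                          # hypothetical bias
print(perceptron_classify(w, b, np.array([2.0, 0.5])))   # +1 (class C1)
print(perceptron_classify(w, b, np.array([0.0, 1.0])))   # -1 (class C2)
```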

Page 37: Chapter 3 Single-Layer Perceptrons


Perceptron Convergence Theorem
Following Fig. 3.8 (in which the bias of Fig. 3.6 is treated as a fixed input), the (m+1)-by-1 input vector and weight vector can be written as

$\mathbf{x}(n) = [+1, x_1(n), x_2(n), \ldots, x_m(n)]^T$

$\mathbf{w}(n) = [b(n), w_1(n), w_2(n), \ldots, w_m(n)]^T$

where w0(n), corresponding to the fixed input x0 = +1, represents the bias b(n). Correspondingly, the induced local field of the neuron is defined as

$v(n) = \sum_{i=0}^{m} w_i(n)\,x_i(n) = \mathbf{w}^T(n)\,\mathbf{x}(n)$   (3.52)

For w^T x = 0, the coordinates (x1, x2, ..., xm) trace a hyperplane that separates the inputs into two classes.

[Figure: equivalent signal-flow graph of the perceptron with the bias treated as a synaptic weight w0 = b driven by the fixed input x0 = +1.]

Page 38: Chapter 3 Single-Layer Perceptrons


Perceptron Convergence Theorem (cont.)

[Figure: (a) a pair of linearly separable patterns, with a decision boundary between class C1 and class C2; (b) a pair of non-separable patterns.]

The patterns to be classified must be sufficiently separated from each other to ensure that a separating hyperplane exists.

Page 39: Chapter 3 Single-Layer Perceptrons


Perceptron Convergence Theorem (cont.)
Suppose the input variables of the perceptron originate from two linearly separable classes. Let X1 = {x1(1), x1(2), ...} be the subset of training vectors belonging to class C1, and X2 = {x2(1), x2(2), ...} the subset belonging to class C2; the union of X1 and X2 constitutes the complete training set X.

Training the classifier with X1 and X2 adjusts the weight vector w so that the two classes C1 and C2 are linearly separated. That is, there exists a weight vector w such that

$\mathbf{w}^T \mathbf{x} > 0$ for every input vector x belonging to class C1
$\mathbf{w}^T \mathbf{x} \le 0$ for every input vector x belonging to class C2   (3.53)

Page 40: Chapter 3 Single-Layer Perceptrons


Perceptron Convergence Theorem (cont.)
The algorithm for adapting the weight vector of the elementary perceptron is formulated as follows:

If the nth member of the training set, x(n), is correctly classified by the weight vector w(n), no correction is made to the weight vector of the perceptron:

$\mathbf{w}(n+1) = \mathbf{w}(n)$ if $\mathbf{w}^T(n)\,\mathbf{x}(n) > 0$ and x(n) belongs to class C1
$\mathbf{w}(n+1) = \mathbf{w}(n)$ if $\mathbf{w}^T(n)\,\mathbf{x}(n) \le 0$ and x(n) belongs to class C2   (3.54)

Otherwise, the weight vector of the perceptron is updated in accordance with the rule

$\mathbf{w}(n+1) = \mathbf{w}(n) - \eta(n)\,\mathbf{x}(n)$ if $\mathbf{w}^T(n)\,\mathbf{x}(n) > 0$ and x(n) belongs to class C2
$\mathbf{w}(n+1) = \mathbf{w}(n) + \eta(n)\,\mathbf{x}(n)$ if $\mathbf{w}^T(n)\,\mathbf{x}(n) \le 0$ and x(n) belongs to class C1   (3.55)

In the first case of Eq. (3.55), x(n) belongs to C2 but is misclassified as C1; in the second case, x(n) belongs to C1 but is misclassified as C2.

Page 41: Chapter 3 Single-Layer Perceptrons


Perceptron Convergence Theorem (cont.)

We prove the convergence of the fixed-increment adaptation rule for η = 1. Suppose the initial condition is w(0) = 0 and w^T(n)x(n) < 0 for n = 1, 2, ..., with every input vector x(n) belonging to the subset X1. (That is, the perceptron incorrectly classifies x(1), x(2), ... as belonging to the second class.) With η(n) = 1, the second line of Eq. (3.55) gives

$\mathbf{w}(n+1) = \mathbf{w}(n) + \mathbf{x}(n)$ for x(n) belonging to class C1   (3.56)

Given the initial condition w(0) = 0, w(n+1) is obtained by successively adding the inputs:

$\mathbf{w}(n+1) = \mathbf{x}(1) + \mathbf{x}(2) + \cdots + \mathbf{x}(n)$   (3.57)

Page 42: Chapter 3 Single-Layer Perceptrons


Perceptron Convergence Theorem (cont.)

Since the classes C1 and C2 are assumed to be linearly separable, there exists a solution w0 for which w0^T x(n) > 0 for all input vectors x(1), ..., x(n) belonging to the subset X1. We may therefore define a positive number

$\alpha = \min_{\mathbf{x}(n) \in X_1} \mathbf{w}_0^T\,\mathbf{x}(n)$   (3.58)

Multiplying both sides of Eq. (3.57) by w0^T gives the sum of n terms

$\mathbf{w}_0^T\,\mathbf{w}(n+1) = \mathbf{w}_0^T\,\mathbf{x}(1) + \mathbf{w}_0^T\,\mathbf{x}(2) + \cdots + \mathbf{w}_0^T\,\mathbf{x}(n)$

Hence, in light of the definition in Eq. (3.58), we have

$\mathbf{w}_0^T\,\mathbf{w}(n+1) \ge n\alpha$   (3.59)

Page 43: Chapter 3 Single-Layer Perceptrons


Perceptron Convergence Theorem (cont.)

By the Cauchy-Schwarz inequality,

$\|\mathbf{w}_0\|^2\,\|\mathbf{w}(n+1)\|^2 \ge \left[\mathbf{w}_0^T\,\mathbf{w}(n+1)\right]^2$   (3.60)

Substituting Eq. (3.59) into Eq. (3.60) gives

$\|\mathbf{w}_0\|^2\,\|\mathbf{w}(n+1)\|^2 \ge n^2\alpha^2$

and therefore

$\|\mathbf{w}(n+1)\|^2 \ge \frac{n^2\alpha^2}{\|\mathbf{w}_0\|^2}$   (3.61)

Page 44: Chapter 3 Single-Layer Perceptrons


Perceptron Convergence Theorem (cont.)
Next, Eq. (3.56) can be rewritten (with k replacing n) as

$\mathbf{w}(k+1) = \mathbf{w}(k) + \mathbf{x}(k)$, for $k = 1, \ldots, n$ and $\mathbf{x}(k) \in X_1$   (3.62)

Taking the squared Euclidean norm of both sides of Eq. (3.62) and expanding, we get

$\|\mathbf{w}(k+1)\|^2 = \|\mathbf{w}(k)\|^2 + \|\mathbf{x}(k)\|^2 + 2\,\mathbf{w}^T(k)\,\mathbf{x}(k)$   (3.63)

Since the perceptron, by assumption, misclassifies the vectors x(k) belonging to C1 as belonging to C2, we have w^T(k)x(k) < 0, so from Eq. (3.63) we deduce that

$\|\mathbf{w}(k+1)\|^2 \le \|\mathbf{w}(k)\|^2 + \|\mathbf{x}(k)\|^2$

Rearranging terms gives

$\|\mathbf{w}(k+1)\|^2 - \|\mathbf{w}(k)\|^2 \le \|\mathbf{x}(k)\|^2, \quad k = 1, \ldots, n$   (3.64)

Page 45: Chapter 3 Single-Layer Perceptrons


Perceptron Convergence Theorem (cont.)
Applying the initial condition w(0) = 0 and summing these inequalities for k = 1, ..., n, we get

$\|\mathbf{w}(n+1)\|^2 \le \sum_{k=1}^{n} \|\mathbf{x}(k)\|^2 \le n\beta$   (3.65)

where β is a positive number defined by

$\beta = \max_{\mathbf{x}(k) \in X_1} \|\mathbf{x}(k)\|^2$   (3.66)

Eqs. (3.65) and (3.61) cannot both hold for arbitrarily large n: n can grow at most to some value nmax at which the two are simultaneously satisfied with the equality sign. Thus

$\frac{n_{\max}^2\,\alpha^2}{\|\mathbf{w}_0\|^2} = n_{\max}\,\beta$

Solving for nmax gives

$n_{\max} = \frac{\beta\,\|\mathbf{w}_0\|^2}{\alpha^2}$   (3.67)

The perceptron must therefore stop adjusting its synaptic weights after at most nmax iterations.
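Given a known separating solution w0 and the subset X1, the quantities α (Eq. 3.58), β (Eq. 3.66), and the iteration bound nmax (Eq. 3.67) can be computed directly; the data below are hypothetical.

```python
import numpy as np

def perceptron_iteration_bound(w0, X1):
    """n_max = beta * ||w0||^2 / alpha^2, from Eqs. (3.58), (3.66), (3.67)."""
    alpha = min(w0 @ x for x in X1)          # alpha = min_{x in X1} w0^T x, Eq. (3.58)
    beta = max(np.sum(x**2) for x in X1)     # beta = max_{x in X1} ||x||^2, Eq. (3.66)
    return beta * np.sum(w0**2) / alpha**2

w0 = np.array([1.0, 1.0])                    # a known separating weight vector
X1 = [np.array([1.0, 0.5]), np.array([0.5, 1.0]), np.array([2.0, 2.0])]
print(perceptron_iteration_bound(w0, X1))    # upper bound on the number of updates
```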

Page 46: Chapter 3 Single-Layer Perceptrons


Perceptron Convergence Theorem (cont.)

Hence, with η(n) = 1 for all n and w(0) = 0, the perceptron adjusts its synaptic weights at most nmax times.

Fixed-increment convergence theorem of the perceptron:
Let the subsets of training vectors X1 and X2 be linearly separable. Let the inputs presented to the perceptron originate from these two subsets. The perceptron converges after some n0 iterations, in the sense that

$\mathbf{w}(n_0) = \mathbf{w}(n_0+1) = \mathbf{w}(n_0+2) = \cdots$

where w(n0) is a solution vector for n0 ≤ nmax.

Page 47: Chapter 3 Single-Layer Perceptrons


Perceptron Convergence Theorem (cont.)

Absolute error-correction procedure for adaptation of a single-layer perceptron: η(n) is chosen as the smallest integer for which

$\eta(n)\,\mathbf{x}^T(n)\,\mathbf{x}(n) > |\mathbf{w}^T(n)\,\mathbf{x}(n)|$

With this procedure, each pattern is in effect presented repeatedly to the perceptron until that pattern is classified correctly.

The use of an initial value w(0) other than the null condition merely results in a decrease or increase in the number of iterations required to converge, depending on how w(0) relates to the solution w0.

Page 48: Chapter 3 Single-Layer Perceptrons


Perceptron Convergence Theorem (cont.)
Summary of the Perceptron Convergence Theorem

Initialization. Set w(0) = 0. Then perform the following computations for time step n = 1, 2, ...

Activation. Activate the perceptron by applying the continuous-valued input vector x(n) and desired response d(n).

Computation of Actual Response. Compute the actual response of the perceptron:

$y(n) = \operatorname{sgn}[\mathbf{w}^T(n)\,\mathbf{x}(n)]$

Adaptation of Weight Vector. Update the weight vector of the perceptron:

$\mathbf{w}(n+1) = \mathbf{w}(n) + \eta\,[d(n) - y(n)]\,\mathbf{x}(n)$

where

$d(n) = +1$ if x(n) belongs to class C1; $d(n) = -1$ if x(n) belongs to class C2.

Continuation. Increment time step n by one and go back to step 2.
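The summary above translates almost line for line into NumPy; the training data here are hypothetical and assumed linearly separable, as the theorem requires (the first component of each input is the fixed input +1 carrying the bias).

```python
import numpy as np

def train_perceptron(X, d, eta=1.0, max_epochs=100):
    """Perceptron convergence algorithm:
    y(n) = sgn(w^T(n) x(n));  w(n+1) = w(n) + eta*(d(n) - y(n)) x(n)."""
    w = np.zeros(X.shape[1])                   # initialization: w(0) = 0
    for _ in range(max_epochs):
        errors = 0
        for x_n, d_n in zip(X, d):             # activation with x(n), d(n)
            y = 1.0 if x_n @ w > 0 else -1.0   # actual response
            if y != d_n:
                w = w + eta * (d_n - y) * x_n  # adaptation of weight vector
                errors += 1
        if errors == 0:                        # converged: all patterns correct
            break
    return w

# Hypothetical linearly separable set; first component is the fixed input +1
X = np.array([[1, 2.0, 1.0], [1, 1.5, 2.0], [1, -1.0, -1.5], [1, -2.0, -0.5]])
d = np.array([1.0, 1.0, -1.0, -1.0])
print(train_perceptron(X, d))
```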

Page 49: Chapter 3 Single-Layer Perceptrons


Relation between the perceptron and Bayes Classifier for a Gaussian Environment

Bayes Classifier
The goal is to minimize the average risk R. For a two-class problem, the average risk is defined as

$\mathcal{R} = c_{11}\,p_1 \int_{X_1} f_{\mathbf{x}}(\mathbf{x}|C_1)\,d\mathbf{x} + c_{22}\,p_2 \int_{X_2} f_{\mathbf{x}}(\mathbf{x}|C_2)\,d\mathbf{x} + c_{21}\,p_1 \int_{X_2} f_{\mathbf{x}}(\mathbf{x}|C_1)\,d\mathbf{x} + c_{12}\,p_2 \int_{X_1} f_{\mathbf{x}}(\mathbf{x}|C_2)\,d\mathbf{x}$   (3.72)

where

pi: a priori probability that the observation vector x is drawn from subspace Xi.

cij: cost of deciding in favor of class Ci (represented by subspace Xi) when class Cj is true.

fx(x|Ci): conditional probability density function of the random vector X.

The first two terms of Eq. (3.72) represent correct decisions; the last two terms represent incorrect decisions.

Page 50: Chapter 3 Single-Layer Perceptrons


Relation between the perceptron and Bayes Classifier for a Gaussian Environment (cont.)

Since each observation vector x must be assigned to either C1 or C2, we have

$X = X_1 + X_2$   (3.73)

Accordingly, Eq. (3.72) can be rewritten as

$\mathcal{R} = c_{11}\,p_1 \int_{X_1} f_{\mathbf{x}}(\mathbf{x}|C_1)\,d\mathbf{x} + c_{22}\,p_2 \int_{X - X_1} f_{\mathbf{x}}(\mathbf{x}|C_2)\,d\mathbf{x} + c_{21}\,p_1 \int_{X - X_1} f_{\mathbf{x}}(\mathbf{x}|C_1)\,d\mathbf{x} + c_{12}\,p_2 \int_{X_1} f_{\mathbf{x}}(\mathbf{x}|C_2)\,d\mathbf{x}$   (3.74)

where c11 < c21 and c22 < c12. We also observe the fact that

$\int_{X} f_{\mathbf{x}}(\mathbf{x}|C_1)\,d\mathbf{x} = \int_{X} f_{\mathbf{x}}(\mathbf{x}|C_2)\,d\mathbf{x} = 1$   (3.75)

Page 51: Chapter 3 Single-Layer Perceptrons


Relation between the perceptron and Bayes Classifier for a Gaussian Environment (cont.)

Hence, expanding Eq. (3.74) and simplifying, we obtain

$\mathcal{R} = c_{21}\,p_1 + c_{22}\,p_2 + \int_{X_1} \left[ p_2\,(c_{12} - c_{22})\, f_{\mathbf{x}}(\mathbf{x}|C_2) - p_1\,(c_{21} - c_{11})\, f_{\mathbf{x}}(\mathbf{x}|C_1) \right] d\mathbf{x}$   (3.76)

The first two terms of Eq. (3.76) represent a fixed cost. Since the requirement is to minimize the average risk R, the following strategy may be deduced from Eq. (3.76):

• If the integrand is negative for an observation vector x, then x should be assigned to subspace X1 (class C1).

• If the integrand is positive for an observation vector x, then x should be assigned to subspace X2 (class C2).

• If the integrand is zero for an observation vector x, it has no effect on the average risk R and x may be assigned to either class; here x is assigned to X2 (class C2).

Page 52: Chapter 3 Single-Layer Perceptrons


Relation between the perceptron and Bayes Classifier for a Gaussian Environment (cont.)

Based on the above, the Bayes classifier may be defined as follows: If the condition

$p_1\,(c_{21} - c_{11})\, f_{\mathbf{x}}(\mathbf{x}|C_1) > p_2\,(c_{12} - c_{22})\, f_{\mathbf{x}}(\mathbf{x}|C_2)$   (3.77)

holds, assign the observation vector x to subspace X1 (class C1). Otherwise assign x to X2 (class C2).

For convenience, rearranging terms, we define the likelihood ratio

$\Lambda(\mathbf{x}) = \frac{f_{\mathbf{x}}(\mathbf{x}|C_1)}{f_{\mathbf{x}}(\mathbf{x}|C_2)}$   (3.78)

and the threshold

$\xi = \frac{p_2\,(c_{12} - c_{22})}{p_1\,(c_{21} - c_{11})}$

Page 53: Chapter 3 Single-Layer Perceptrons


Relation between the perceptron and Bayes Classifier for a Gaussian Environment (cont.)

The Bayes classifier may now be restated as follows: For an observation vector x, if the likelihood ratio Λ(x) is greater than the threshold ξ, assign x to class C1. Otherwise, assign it to class C2.

[Figure: two equivalent implementations of the Bayes classifier. (a) A likelihood-ratio computer followed by a comparator: assign x to class C1 if Λ(x) > ξ, otherwise assign it to class C2. (b) A log-likelihood-ratio computer followed by a comparator: assign x to class C1 if log Λ(x) > log ξ, otherwise assign it to class C2.]

Page 54: Chapter 3 Single-Layer Perceptrons


Relation between the perceptron and Bayes Classifier for a Gaussian Environment (cont.)

Bayes Classifier for a Gaussian Distribution
Assume the two classes share a common covariance matrix C:

Class C1: $E[\mathbf{X}] = \boldsymbol{\mu}_1$, $E[(\mathbf{X} - \boldsymbol{\mu}_1)(\mathbf{X} - \boldsymbol{\mu}_1)^T] = \mathbf{C}$
Class C2: $E[\mathbf{X}] = \boldsymbol{\mu}_2$, $E[(\mathbf{X} - \boldsymbol{\mu}_2)(\mathbf{X} - \boldsymbol{\mu}_2)^T] = \mathbf{C}$

The conditional probability density function of X can then be expressed as

$f_{\mathbf{x}}(\mathbf{x}|C_i) = \frac{1}{(2\pi)^{m/2}\,(\det \mathbf{C})^{1/2}} \exp\!\left( -\frac{1}{2}\,(\mathbf{x} - \boldsymbol{\mu}_i)^T\,\mathbf{C}^{-1}\,(\mathbf{x} - \boldsymbol{\mu}_i) \right), \quad i = 1, 2$   (3.79)

Assume the two classes are equiprobable:

$p_1 = p_2 = \frac{1}{2}$   (3.80)

Assume misclassifications carry the same cost and correct classifications carry no cost:

$c_{21} = c_{12}$ and $c_{11} = c_{22} = 0$   (3.81)

Because the samples are correlated, the covariance matrix C is not a diagonal matrix; C is assumed to be nonsingular, so that its inverse C^{-1} exists.

Page 55: Chapter 3 Single-Layer Perceptrons


Relation between the perceptron and Bayes Classifier for a Gaussian Environment (cont.)

Substituting Eq. (3.79) into Eq. (3.78) and taking the logarithm gives

$\log \Lambda(\mathbf{x}) = -\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_1)^T \mathbf{C}^{-1} (\mathbf{x} - \boldsymbol{\mu}_1) + \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_2)^T \mathbf{C}^{-1} (\mathbf{x} - \boldsymbol{\mu}_2) = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T \mathbf{C}^{-1} \mathbf{x} + \frac{1}{2}\left(\boldsymbol{\mu}_2^T \mathbf{C}^{-1} \boldsymbol{\mu}_2 - \boldsymbol{\mu}_1^T \mathbf{C}^{-1} \boldsymbol{\mu}_1\right)$   (3.82)

Substituting Eqs. (3.80) and (3.81) into the definition of the threshold gives ξ = 1 and therefore

$\log \xi = 0$   (3.83)

Eqs. (3.82) and (3.83) show that the Bayes classifier reduces to the following linear classifier:

$y = \mathbf{w}^T \mathbf{x} + b$   (3.84)

where

$y = \log \Lambda(\mathbf{x})$   (3.85)

$\mathbf{w} = \mathbf{C}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$   (3.86)

$b = \frac{1}{2}\left(\boldsymbol{\mu}_2^T \mathbf{C}^{-1} \boldsymbol{\mu}_2 - \boldsymbol{\mu}_1^T \mathbf{C}^{-1} \boldsymbol{\mu}_1\right)$   (3.87)

Comparing Eq. (3.51) with Eq. (3.84), we see that the Bayes classifier for this Gaussian environment is a linear classifier of the same form as the perceptron.
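Under the stated Gaussian assumptions, the weights of the equivalent linear classifier follow directly from Eqs. (3.86)-(3.87); the means and covariance below are hypothetical.

```python
import numpy as np

def bayes_linear_classifier(mu1, mu2, C):
    """Bayes classifier for equiprobable Gaussian classes with common
    covariance C: y = w^T x + b, with w and b from Eqs. (3.86)-(3.87)."""
    C_inv = np.linalg.inv(C)                           # C assumed nonsingular
    w = C_inv @ (mu1 - mu2)                            # Eq. (3.86)
    b = 0.5 * (mu2 @ C_inv @ mu2 - mu1 @ C_inv @ mu1)  # Eq. (3.87)
    return w, b

mu1 = np.array([1.0, 1.0])
mu2 = np.array([-1.0, -1.0])
C = np.array([[1.0, 0.2], [0.2, 1.0]])
w, b = bayes_linear_classifier(mu1, mu2, C)
x = np.array([0.5, 0.8])
print("class C1" if w @ x + b > 0 else "class C2")
```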

Page 56: Chapter 3 Single-Layer Perceptrons


Relation between the perceptron and Bayes Classifier for a Gaussian Environment (cont.)

The classifier consists of a linear combiner with weight vector w and bias b. According to Eq. (3.84), the log-likelihood test for the two-class problem can be stated as follows: If the output y of the linear combiner is positive, assign the observation vector x to class C1; otherwise, assign it to class C2.

[Figure: signal-flow graph of the Gaussian classifier; the inputs x1, x2, ..., xm, weighted by w1, w2, ..., wm, are summed together with the bias b to produce the output y.]

Page 57: Chapter 3 Single-Layer Perceptrons


Relation between the perceptron and Bayes Classifier for a Gaussian Environment (cont.)
Perceptron vs. Bayes classifier for a Gaussian environment:

The perceptron operates on the premise that the patterns to be classified are linearly separable. The Gaussian distributions of the two patterns assumed in the derivation of the Bayes classifier certainly do overlap each other and are therefore not separable.

The Bayes classifier minimizes the probability of classification error. The Bayes classifier always positions the decision boundary at the point where the Gaussian distributions for the two classes C1 and C2 cross each other.

Nonparametric vs. parametric: The perceptron convergence algorithm is nonparametric, and it is both adaptive and simple to implement, whereas the Bayes classifier is parametric; its computation is more complex and requires more memory.

Page 58: Chapter 3 Single-Layer Perceptrons


[Figure: two overlapping, one-dimensional Gaussian distributions.]

Page 59: Chapter 3 Single-Layer Perceptrons


Ergodic process (From Wikipedia)
In probability theory, a stationary ergodic process is a stochastic process which exhibits both stationarity and ergodicity. In essence, this implies that the random process will not change its statistical properties with time.

Stationarity is the property of a random process which guarantees that its statistical properties, such as the mean value, its moments and variance, will not change over time. A stationary process is one whose probability distribution is the same at all times.

Several sub-types of stationarity are defined: first-order, second-order, nth-order, wide-sense and strict-sense.

An ergodic process is one which conforms to the ergodic theorem. The theorem allows the time average of a conforming process to equal the ensemble average. In practice this means that statistical sampling can be performed at one instant across a group of identical processes or sampled over time on a single process with no change in the measured result.

Page 60: Chapter 3 Single-Layer Perceptrons


Taylor series (From Wikipedia)

Taylor series is a representation of a function as an infinite sum of terms calculated from the values of its derivatives at a single point. It may be regarded as the limit of the Taylor polynomials.

The Taylor series of a real or complex function f that is infinitely differentiable in a neighborhood of a real or complex number a is the power series

$f(a) + \frac{f'(a)}{1!}(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \frac{f'''(a)}{3!}(x - a)^3 + \cdots$

which in a more compact form can be written as

$\sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}\,(x - a)^n$