CS434a/541a: Pattern Recognition, Prof. Olga Veksler, Lecture 10

Transcript of Lecture 10 (olga/Courses/CS434a_541a/Lecture10.pdf)

Page 1

CS434a/541a: Pattern Recognition
Prof. Olga Veksler

Lecture 10

Page 2

Today

- Continue with Linear Discriminant Functions
- Last lecture: Perceptron Rule for weight learning
- This lecture: Minimum Squared Error (MSE) rule
  - Pseudoinverse
  - Gradient descent (Widrow-Hoff procedure)
  - Ho-Kashyap procedure

Page 3

LDF: Perceptron Criterion Function

- The perceptron criterion function tries to find a weight vector a s.t. a^t y_i > 0 for all samples y_i:

  J_p(a) = \sum_{y \in Y_M} (-a^t y)

  where Y_M is the set of misclassified samples
- The perceptron criterion function only looks at the misclassified samples, and the procedure will converge in the linearly separable case
- Problem: it will not converge in the nonseparable case
- To ensure convergence, we can set a decreasing learning rate η^{(k)} = η^{(1)} / k
- However, we are then not guaranteed that we will stop at a good point

Page 4

LDF: Minimum Squared-Error Procedures

- MSE procedure:
  - Choose positive constants b_1, b_2, ..., b_n
  - Try to find a weight vector a s.t. a^t y_i = b_i for all samples y_i
- If we can find a weight vector a such that a^t y_i = b_i for all samples y_i, then a is a solution, because the b_i's are positive
- The MSE procedure considers all the samples (not just the misclassified ones)
- Idea: convert to an easier and better understood problem:
  - a^t y_i > 0 for all samples y_i: solve a system of linear inequalities
  - a^t y_i = b_i for all samples y_i: solve a system of linear equations

Page 5

LDF: MSE Margins

[Figure: samples y_i and y_k shown at distances a^t y_i / ||a|| and a^t y_k / ||a|| from the hyperplane g(y) = 0.]

- Since we want a^t y_i = b_i, we expect sample y_i to be at distance b_i from the separating hyperplane (normalized by ||a||)
- Thus b_1, b_2, ..., b_n give relative expected distances, or "margins", of the samples from the hyperplane
- Should make b_i small if sample i is expected to be near the separating hyperplane, and make b_i larger otherwise
- In the absence of any additional information, there are good reasons to set b_1 = b_2 = ... = b_n = 1

Page 6

LDF: MSE Matrix Notation

- Need to solve n equations:

  a^t y_1 = b_1, ..., a^t y_n = b_n

- Introduce matrix notation:

  Y = \begin{pmatrix} y_{10} & y_{11} & \cdots & y_{1d} \\ y_{20} & y_{21} & \cdots & y_{2d} \\ \vdots & \vdots & & \vdots \\ y_{n0} & y_{n1} & \cdots & y_{nd} \end{pmatrix}, \quad a = \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_d \end{pmatrix}, \quad b = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix}

- Thus we need to solve the linear system Ya = b
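As a concrete illustration of this notation, here is a minimal MATLAB/Octave sketch that builds Y and b for a two-class problem (the data values and variable names are made up for illustration): each sample gets the extra feature 1, and class 2 samples are negated ("normalized"), as in the examples later in the lecture.

  % Illustrative two-class data (rows are samples); the values are made up.
  X1 = [1 2; 2 3; 3 3];            % class 1 samples
  X2 = [6 7; 7 8];                 % class 2 samples
  Y  = [ ones(size(X1,1),1)  X1;   % augment with the extra feature 1
        -ones(size(X2,1),1) -X2];  % negate ("normalize") the class 2 rows
  b  = ones(size(Y,1),1);          % one margin b_i per sample
  % The n equations a'*y_i = b_i now read, in matrix form, Y*a = b.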

Page 7

LDF: Exact Solution is Rare

- Thus we need to solve a linear system Ya = b
- Y is an n by (d+1) matrix
- An exact solution can be found only if Y is nonsingular and square, in which case the inverse Y^{-1} exists:

  a = Y^{-1} b

- This requires (number of samples) = (number of features + 1), which almost never happens in practice
- In this case, we are guaranteed to find the separating hyperplane

Page 8

LDF: Approximate Solution

- Need Ya = b, but typically no exact solution exists, because the system is overdetermined (more equations than unknowns)
- Typically Y is overdetermined, that is, it has more rows (examples) than columns (features)
  - If it has more features than examples, we should reduce dimensionality first
- Find an approximate solution a, that is Ya ≈ b
- Note that an approximate solution a does not necessarily give the separating hyperplane in the separable case
- But the hyperplane corresponding to a may still be a good solution, especially if there is no separating hyperplane

Page 9

LDF: MSE Criterion Function

- Minimum squared error approach: find a which minimizes the length of the error vector e:

  e = Ya - b

[Figure: error vector e between Ya and b.]

- Thus minimize the minimum squared error criterion function:

  J_s(a) = ||Ya - b||^2 = \sum_{i=1}^n (a^t y_i - b_i)^2

- Unlike the perceptron criterion function, we can optimize the minimum squared error criterion function analytically by setting its gradient to 0

Page 10

LDF: Optimizing J_s(a)

- Let's compute the gradient of

  J_s(a) = ||Ya - b||^2 = \sum_{i=1}^n (a^t y_i - b_i)^2

- The gradient is

  \nabla J_s(a) = \begin{pmatrix} \partial J_s / \partial a_0 \\ \vdots \\ \partial J_s / \partial a_d \end{pmatrix} = \frac{d}{da} \sum_{i=1}^n (a^t y_i - b_i)^2 = \sum_{i=1}^n 2 (a^t y_i - b_i) \frac{d}{da} (a^t y_i - b_i) = \sum_{i=1}^n 2 (a^t y_i - b_i)\, y_i = 2 Y^t (Ya - b)

Page 11

LDF: Pseudoinverse Solution

- Setting the gradient to 0:

  \nabla J_s(a) = 2 Y^t (Ya - b) = 0  =>  Y^t Y a = Y^t b

- The matrix Y^t Y is square (it has d+1 rows and columns), and it is often nonsingular
- If Y^t Y is nonsingular, its inverse exists and we can solve for a uniquely:

  a = (Y^t Y)^{-1} Y^t b

  where (Y^t Y)^{-1} Y^t is the pseudoinverse of Y; note that ((Y^t Y)^{-1} Y^t) Y = (Y^t Y)^{-1} (Y^t Y) = I
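A minimal MATLAB/Octave sketch of this pseudoinverse solution (the data are the made-up samples from the earlier sketch; variable names are illustrative):

  % Pseudoinverse (MSE) solution on illustrative data.
  Y = [ 1  1  2;  1  2  3;  1  3  3; -1 -6 -7; -1 -7 -8];  % normalized augmented samples
  b = ones(5,1);                                           % margins b_1 = ... = b_n = 1
  a = (Y'*Y) \ (Y'*b);      % solves the normal equations Y'*Y*a = Y'*b
  % Equivalently: a = pinv(Y)*b;  or the least-squares solve a = Y\b used later in the lecture.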

Page 12

LDF: Minimum Squared-Error Procedures

- If b_1 = ... = b_n = 1, the MSE procedure is equivalent to finding a hyperplane of best fit through the samples y_1, ..., y_n:

  J_s(a) = ||Ya - 1_n||^2, where 1_n = (1, ..., 1)^t

- Then we shift this line to the origin; if this line was a good fit, all samples will be classified correctly

Page 13

LDF: Minimum Squared-Error Procedures

- We have Ya ≈ b, that is

  Ya = (b_1 + ε_1, ..., b_n + ε_n)^t

  where the ε_i may be negative
- We are only guaranteed a separating hyperplane if Ya > 0, that is, if all elements of the vector

  Ya = (a^t y_1, ..., a^t y_n)^t

  are positive
- If ε_1, ..., ε_n are small relative to b_1, ..., b_n, then each element of Ya is positive, and a gives a separating hyperplane
- If the approximation is not good, ε_i may be large and negative for some i; then b_i + ε_i will be negative, and a is not a separating hyperplane
- Thus, even in the linearly separable case, the least squares solution a does not necessarily give a separating hyperplane
- But it will give a "reasonable" hyperplane

Page 14

LDF: Minimum Squared-Error Procedures

- We are free to choose b. We may be tempted to make b large as a way to ensure Ya ≈ b > 0
- This does not work:
  - Let β be a positive scalar; let's try βb instead of b
  - If a* is a least squares solution to Ya = b, then for any scalar β, the least squares solution to Ya = βb is βa*:

    arg min_a ||Ya - βb||^2 = arg min_a β^2 ||Y(a/β) - b||^2 = β a*

  - Thus if the i-th element of Ya* is less than 0, that is y_i^t a* < 0, then also y_i^t (βa*) < 0
- The relative difference between the components of b matters, but not the size of each individual component

Page 15

LDF: How to Choose b in the MSE Procedure?

- So far we assumed that the constants b_1, b_2, ..., b_n are positive but otherwise arbitrary
- A good choice is b_1 = b_2 = ... = b_n = 1. In this case:
  1. the MSE solution is basically identical to Fisher's linear discriminant solution
  2. the MSE solution approaches the Bayes discriminant function

     g_B(x) = P(c_1 | x) - P(c_2 | x)

     as the number of samples goes to infinity

Page 16

LDF: Example

- Class 1: (6, 9), (5, 7)
- Class 2: (5, 9), (0, 4)
- Set vectors y_1, y_2, y_3, y_4 by adding the extra feature and "normalizing" (negating the class 2 samples):

  y_1 = (1, 6, 9)^t, y_2 = (1, 5, 7)^t, y_3 = (-1, -5, -9)^t, y_4 = (-1, 0, -4)^t

- Matrix Y is then

  Y = \begin{pmatrix} 1 & 6 & 9 \\ 1 & 5 & 7 \\ -1 & -5 & -9 \\ -1 & 0 & -4 \end{pmatrix}

Page 17

LDF: Example

- Choose b = (1, 1, 1, 1)^t
- In Matlab, a = Y\b solves the least squares problem:

  a ≈ (2.7, 1.0, -0.9)^t

- Note a is only an approximation to Ya = b, since no exact solution exists:

  Ya ≈ (0.4, 1.3, 0.6, 1.1)^t ≠ (1, 1, 1, 1)^t

- This solution does give a separating hyperplane, since Ya > 0
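The computation on this slide can be reproduced with a few lines of MATLAB/Octave (a sketch; the comments give the values to two decimals, which the slide rounds to one):

  % Reproduce the example: class 1 = {(6,9),(5,7)}, class 2 = {(5,9),(0,4)}.
  Y = [ 1  6  9;
        1  5  7;
       -1 -5 -9;
       -1  0 -4];
  b = ones(4,1);
  a  = Y\b;                 % least-squares solution, approx (2.66, 1.04, -0.94)'
  Ya = Y*a;                 % approx (0.44, 1.28, 0.61, 1.11)', all positive
  separates = all(Ya > 0)   % displays 1 (true): a gives a separating hyperplane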

Page 18

LDF: Example

- Class 1: (6, 9), (5, 7)
- Class 2: (5, 9), (0, 10)
- Now

  y_1 = (1, 6, 9)^t, y_2 = (1, 5, 7)^t, y_3 = (-1, -5, -9)^t, y_4 = (-1, 0, -10)^t

  Y = \begin{pmatrix} 1 & 6 & 9 \\ 1 & 5 & 7 \\ -1 & -5 & -9 \\ -1 & 0 & -10 \end{pmatrix}

- The last sample is very far from the separating hyperplane compared to the others

Page 19

LDF: Example

- Choose b = (1, 1, 1, 1)^t
- In Matlab, a = Y\b solves the least squares problem:

  a ≈ (3.2, 0.2, -0.4)^t

- Note a is an approximation to Ya = b, since no exact solution exists:

  Ya ≈ (0.2, 0.9, -0.04, 1.16)^t ≠ (1, 1, 1, 1)^t

- This solution does not give a separating hyperplane, since a^t y_3 < 0

Page 20

LDF: Example

- MSE pays too much attention to isolated "noisy" examples (such examples are called outliers)

[Figure: the outlier pulls the MSE solution away from the desired solution.]

- There are no problems with convergence though, and the solution it gives ranges from reasonable to good

Page 21

LDF: Example

- We know that the 4th point is far from the separating hyperplane, so

  b = (1, 1, 1, 10)^t

  is appropriate (in practice, we do not know this)
- In Matlab, solve a = Y\b:

  a ≈ (-1.1, 1.7, -0.9)^t

- Note a is an approximation to Ya = b:

  Ya ≈ (0.9, 1.0, 0.8, 10.0)^t ≠ (1, 1, 1, 10)^t

- This solution does give the separating hyperplane, since Ya > 0

[Figure: the new solution separates the classes, unlike the old solution.]
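The last three slides can be reproduced with a short MATLAB/Octave sketch (the comments give the values to about two decimals; the slides round them more coarsely):

  % Class 1 = {(6,9),(5,7)}, class 2 = {(5,9),(0,10)}; (0,10) is the outlier.
  Y = [ 1  6  9;
        1  5  7;
       -1 -5 -9;
       -1  0 -10];
  a1 = Y \ ones(4,1);        % approx ( 3.22, 0.15, -0.44)'
  min(Y*a1)                  % approx -0.04 < 0: a1 is not a separating hyperplane
  b  = [1; 1; 1; 10];        % large margin for the far-away 4th sample
  a2 = Y \ b;                % approx (-1.05, 1.66, -0.90)'
  min(Y*a2)                  % approx 0.8 > 0: a2 separates the classes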

Page 22

LDF: Gradient Descent for the MSE Solution

- We may wish to find the MSE solution for J_s(a) = ||Ya - b||^2 by gradient descent instead:
  1. Computing the inverse of Y^t Y may be too costly
  2. Y^t Y may be close to singular if samples are highly correlated (rows of Y are almost linear combinations of each other); in that case computing the inverse of Y^t Y is not numerically stable
- Earlier in the lecture we computed the gradient:

  \nabla J_s(a) = 2 Y^t (Ya - b)

Page 23

LDF: Widrow-Hoff Procedure

- Thus the update rule for gradient descent is

  a^{(k+1)} = a^{(k)} - η^{(k)} Y^t (Y a^{(k)} - b)

- If η^{(k)} = η^{(1)} / k, the weight vector a^{(k)} converges to the MSE solution a, that is, to a vector satisfying Y^t (Ya - b) = 0
- The Widrow-Hoff procedure reduces the storage requirements by considering single samples sequentially:

  a^{(k+1)} = a^{(k)} - η^{(k)} (a^{(k)t} y_i - b_i) y_i

  where y_i is the training sample considered at step k
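A minimal MATLAB/Octave sketch of this single-sample update (the data reuse the first example; the number of passes, the zero initialization of a, and η^{(1)} = 1 are illustrative assumptions):

  % Widrow-Hoff / single-sample MSE: cycle through samples, one update per sample.
  Y   = [ 1 6 9; 1 5 7; -1 -5 -9; -1 0 -4];   % normalized augmented samples
  b   = ones(4,1);
  a   = zeros(3,1);                            % initial weight vector (assumed)
  k   = 0;
  for pass = 1:50                              % number of passes over the data (assumed)
    for i = 1:size(Y,1)
      k   = k + 1;
      eta = 1/k;                               % eta(k) = eta(1)/k with eta(1) = 1
      yi  = Y(i,:)';                           % i-th sample as a column vector
      a   = a - eta * (a'*yi - b(i)) * yi;     % single-sample MSE update
    end
  end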

Page 24

LDF: Ho-Kashyap Procedure

- In the MSE procedure, if b is chosen arbitrarily, finding a separating hyperplane is not guaranteed
- Suppose the training samples are linearly separable. Then there is an a_s and a positive b_s s.t.

  Y a_s = b_s > 0

- If we knew b_s, we could apply the MSE procedure to find the separating hyperplane
- Idea: find both a_s and b_s
- Minimize the following criterion function, restricting b to be positive:

  J_HK(a, b) = ||Ya - b||^2

- Note that J_HK(a_s, b_s) = 0

Page 25

LDF: Ho-Kashyap Procedure

- As usual, take the partial derivatives of J_HK(a, b) = ||Ya - b||^2 w.r.t. a and b:

  \nabla_a J_HK = 2 Y^t (Ya - b)
  \nabla_b J_HK = -2 (Ya - b)

- Use a modified gradient descent procedure to find a minimum of J_HK(a, b)
- Alternate the two steps below until convergence:
  1. Fix b and minimize J_HK(a, b) with respect to a
  2. Fix a and minimize J_HK(a, b) with respect to b

Page 26

LDF: Ho-Kashyap Procedure

- Alternate the two steps below until convergence:
  1. Fix b and minimize J_HK(a, b) with respect to a
  2. Fix a and minimize J_HK(a, b) with respect to b
- Step (1) can be performed with the pseudoinverse: for fixed b, the minimum of J_HK(a, b) with respect to a is found by solving

  \nabla_a J_HK = 2 Y^t (Ya - b) = 0

- Thus

  a = (Y^t Y)^{-1} Y^t b

Page 27

LDF: Ho-Kashyap Procedure

- Step 2: fix a and minimize J_HK(a, b) with respect to b
- We cannot simply use b = Ya, because b has to be positive
- Solution: use modified gradient descent
- The regular gradient descent rule would be

  b^{(k+1)} = b^{(k)} - η^{(k)} \nabla_b J_HK(a^{(k)}, b^{(k)})

- If any components of \nabla_b J_HK are positive, b will decrease in those components and can possibly become negative, for example (with η = 2):

  b^{(k+1)} = (1, 1, 1)^t - 2 (2, -3, -2)^t = (-3, 7, 5)^t

Page 28

LDF: Ho-Kashyap Procedure

- Start with a positive b and follow the negative gradient, but refuse to decrease any component of b
- This can be achieved by setting all the positive components of \nabla_b J_HK to 0:

  b^{(k+1)} = b^{(k)} - (1/2) η^{(k)} [ \nabla_b J_HK(a^{(k)}, b^{(k)}) - |\nabla_b J_HK(a^{(k)}, b^{(k)})| ]

  where |v| denotes the vector we get after applying the absolute value to all elements of v
- We are not doing steepest descent anymore, but we are still doing descent, and we ensure that b stays positive
- Continuing the example above (η = 2):

  b^{(k+1)} = (1, 1, 1)^t - (1/2) · 2 · [ (2, -3, -2)^t - (2, 3, 2)^t ] = (1, 1, 1)^t - (0, -6, -4)^t = (1, 7, 5)^t

Page 29

LDF: Ho-Kashyap Procedure

- Since \nabla_b J_HK = -2 (Ya - b), let

  e^{(k)} = Y a^{(k)} - b^{(k)},  so that  -(1/2) \nabla_b J_HK(a^{(k)}, b^{(k)}) = e^{(k)}

- Then the update

  b^{(k+1)} = b^{(k)} - (1/2) η^{(k)} [ \nabla_b J_HK(a^{(k)}, b^{(k)}) - |\nabla_b J_HK(a^{(k)}, b^{(k)})| ]

  becomes

  b^{(k+1)} = b^{(k)} - (1/2) η^{(k)} [ -2 e^{(k)} - |2 e^{(k)}| ] = b^{(k)} + η^{(k)} ( e^{(k)} + |e^{(k)}| )

Page 30

LDF: Ho-Kashyap Procedure

- The final Ho-Kashyap procedure:
  0) Start with arbitrary a^{(1)} and b^{(1)} > 0, let k = 1, and repeat steps (1) through (4):
  1) e^{(k)} = Y a^{(k)} - b^{(k)}
  2) Solve for b^{(k+1)} using a^{(k)} and b^{(k)}:

     b^{(k+1)} = b^{(k)} + η ( e^{(k)} + |e^{(k)}| )

  3) Solve for a^{(k+1)} using b^{(k+1)}:

     a^{(k+1)} = (Y^t Y)^{-1} Y^t b^{(k+1)}

  4) k = k + 1

  until e^{(k)} >= 0 (componentwise), or k > k_max, or b^{(k+1)} = b^{(k)}
- For convergence, the learning rate should be fixed with 0 < η < 1
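A compact MATLAB/Octave sketch of this procedure (the function name, the choice of a^{(1)} as the least-squares solution for b^{(1)}, and the stopping bookkeeping are assumptions; the updates follow the steps above):

  function [a, b, separable] = ho_kashyap(Y, eta, kmax)
    % Ho-Kashyap: alternately update b (clipped gradient step) and a (pseudoinverse).
    n = size(Y,1);
    b = ones(n,1);                 % b(1) > 0
    a = (Y'*Y) \ (Y'*b);           % a(1): least-squares solution for the initial b (assumed choice)
    separable = false;
    for k = 1:kmax
      e = Y*a - b;                 % error vector e(k)
      if all(e >= 0)               % then Ya >= b > 0: separating hyperplane found
        separable = true;  return;
      end
      if all(e <= 0)               % no positive component: b stops changing, proof of nonseparability
        return;
      end
      b = b + eta*(e + abs(e));    % increase b only where e > 0
      a = (Y'*Y) \ (Y'*b);         % re-solve for a with the new b
    end
  end

The update b + eta*(e + abs(e)) adds 2*eta*e_i exactly where e_i > 0 and leaves the other components unchanged, which is the "refuse to decrease any component of b" rule above.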

Page 31

LDF: Ho-Kashyap Procedure

- What if e^{(k)} is negative in all components? Then in

  b^{(k+1)} = b^{(k)} + η ( e^{(k)} + |e^{(k)}| )

  we have e^{(k)} + |e^{(k)}| = 0, so b^{(k+1)} = b^{(k)} and the corrections stop
- Write e^{(k)} out, using a^{(k)} = (Y^t Y)^{-1} Y^t b^{(k)}:

  e^{(k)} = Y a^{(k)} - b^{(k)} = Y (Y^t Y)^{-1} Y^t b^{(k)} - b^{(k)}

- Multiply by Y^t:

  Y^t e^{(k)} = Y^t Y (Y^t Y)^{-1} Y^t b^{(k)} - Y^t b^{(k)} = Y^t b^{(k)} - Y^t b^{(k)} = 0

- Thus Y^t e^{(k)} = 0

Page 32

LDF: Ho-Kashyap Procedure

- Thus Y^t e^{(k)} = 0
- Suppose the training samples are linearly separable. Then there is an a_s and a positive b_s s.t.

  Y a_s = b_s > 0

- Multiply both sides of Y a_s = b_s by (e^{(k)})^t:

  0 = (e^{(k)})^t Y a_s = (e^{(k)})^t b_s

  (the left side is 0 because (e^{(k)})^t Y = (Y^t e^{(k)})^t = 0)
- Since all components of b_s are positive, this means that either e^{(k)} = 0 or at least one of its components is positive

Page 33

LDF: Ho-Kashyap Procedure

- In the linearly separable case, either:
  - e^{(k)} = 0: we have found a solution, stop
  - one of the components of e^{(k)} is positive: the algorithm continues
- In the non-separable case, e^{(k)} will eventually have only negative components, and thus we have found a proof of nonseparability
- There is no bound on how many iterations are needed for this proof of nonseparability

Page 34

LDF: Ho-Kashyap Procedure Example

- Class 1: (6, 9), (5, 7)
- Class 2: (5, 9), (0, 10)
- Matrix

  Y = \begin{pmatrix} 1 & 6 & 9 \\ 1 & 5 & 7 \\ -1 & -5 & -9 \\ -1 & 0 & -10 \end{pmatrix}

- Use a fixed learning rate η = 0.9
- Start with b^{(1)} = (1, 1, 1, 1)^t and a^{(1)} = (1, 1, 1)^t
- At the start,

  Y a^{(1)} = (16, 13, -15, -11)^t

Page 35

LDF: Ho-Kashyap Procedure Example

- Iteration 1:
- Solve for b^{(2)} using a^{(1)} and b^{(1)}:

  e^{(1)} = Y a^{(1)} - b^{(1)} = (16, 13, -15, -11)^t - (1, 1, 1, 1)^t = (15, 12, -16, -12)^t

  b^{(2)} = b^{(1)} + 0.9 ( e^{(1)} + |e^{(1)}| ) = (1, 1, 1, 1)^t + 0.9 (30, 24, 0, 0)^t = (28, 22.6, 1, 1)^t

- Solve for a^{(2)} using b^{(2)}:

  a^{(2)} = (Y^t Y)^{-1} Y^t b^{(2)} ≈ \begin{pmatrix} -2.6 & 4.7 & 1.6 & -0.5 \\ 0.16 & -0.1 & -0.1 & 0.2 \\ 0.26 & -0.5 & -0.2 & -0.1 \end{pmatrix} \begin{pmatrix} 28 \\ 22.6 \\ 1 \\ 1 \end{pmatrix} ≈ (34.6, 2.7, -3.8)^t

Page 36

LDF: Ho-Kashyap Procedure Example

- Continue the iterations until Ya > 0 (in practice, continue while the minimum component of Ya is less than some small threshold such as 0.01)
- After 104 iterations the procedure converged to a solution a and margin vector b with Ya > 0
- Thus a does give a separating hyperplane
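The first iteration of this example can be checked with a short MATLAB/Octave sketch (a sketch only; the expected values in the comments are the ones shown on the previous slide):

  Y   = [ 1 6 9; 1 5 7; -1 -5 -9; -1 0 -10];
  eta = 0.9;
  b = ones(4,1);  a = ones(3,1);               % b(1) and a(1)
  e = Y*a - b                                  % (15, 12, -16, -12)'
  b = b + eta*(e + abs(e))                     % (28, 22.6, 1, 1)'
  a = (Y'*Y) \ (Y'*b)                          % approx (34.6, 2.7, -3.8)'
  % Repeating these three updates until all(Y*a - b >= 0) reproduces the
  % convergence to a separating hyperplane reported on this slide.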

Page 37

LDF: MSE for Multiple Classes

- Suppose we have m classes
- Define m linear discriminant functions

  g_i(x) = w_i^t x + w_{i0},  i = 1, ..., m

- Given x, assign class c_i if

  g_i(x) >= g_j(x)  for all j ≠ i

- Such a classifier is called a linear machine
- A linear machine divides the feature space into m decision regions, with g_i(x) being the largest discriminant if x is in region R_i

Page 38

LDF: Many Classes

Page 39

LDF: MSE for Multiple Classes

- We still use the augmented feature vectors y_1, ..., y_n
- Define m linear discriminant functions

  g_i(y) = a_i^t y,  i = 1, ..., m

- Given y, assign class c_i if

  a_i^t y >= a_j^t y  for all j ≠ i

- For each class i, it makes sense to seek a weight vector a_i s.t.

  a_i^t y = 1  for all y in class i
  a_i^t y = 0  for all y not in class i

- If we find such a_1, ..., a_m, the training error will be 0

Page 40

LDF: MSE for Multiple Classes

- For each class i, find a weight vector a_i s.t.

  a_i^t y = 1  for all y in class i
  a_i^t y = 0  for all y not in class i

- We can solve for each a_i independently
- Let n_i be the number of samples in class i
- Let Y_i be the matrix whose rows are the samples from class i, so it has d+1 columns and n_i rows
- Let's pile all samples into an n by (d+1) matrix Y:

  Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_m \end{pmatrix}

  (the first rows are the samples from class 1, ..., the last rows are the samples from class m)

Page 41

LDF: MSE for Multiple Classes

- Let b_i be a column vector of length n which is 0 everywhere except in the rows corresponding to samples from class i, where it is 1:

  b_i = (0, ..., 0, 1, ..., 1, 0, ..., 0)^t

- We need to solve Y a_i = b_i, where the rows of Y are the samples from classes 1 through m and the 1's in b_i sit exactly in the rows of the class i samples

Page 42

LDF: MSE for Multiple Classes

- We need to solve Y a_i = b_i
- Usually there is no exact solution, since Y is overdetermined
- Use least squares to minimize the norm of the error vector || Y a_i - b_i ||
- The LSE solution with the pseudoinverse:

  a_i = (Y^t Y)^{-1} Y^t b_i

- Thus we need to solve m LSE problems, one for each class
- We can write these m LSE problems as one matrix equation

Page 43

LDF: MSE for Multiple Classes

- Let's pile all the b_i as columns into an n by m matrix B:

  B = [ b_1  ...  b_m ]

- Let's pile all the a_i as columns into a (d+1) by m matrix A:

  A = [ a_1  ...  a_m ]

- The m LSE problems can then be represented as YA = B: the rows of Y are the samples (grouped by class), column i of A holds the weights a_i for class c_i, and column i of B is the 0/1 indicator vector for class c_i

Page 44

LDF: MSE for Multiple Classes

- Our objective function is

  J(A) = \sum_{i=1}^m || Y a_i - b_i ||^2

- J(A) is minimized with the use of the pseudoinverse:

  A = (Y^t Y)^{-1} Y^t B
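A minimal MATLAB/Octave sketch of the multi-class MSE machine (the toy data, labels, and variable names are illustrative assumptions, not from the slides):

  % Multi-class MSE: rows of Y are augmented samples (no sign "normalization"
  % in the multi-class case), columns of B are the 0/1 class-indicator vectors b_i.
  X      = [1 2; 2 1; 6 7; 7 6; 1 8; 2 9];    % made-up samples, d = 2
  labels = [1 1 2 2 3 3]';                     % class of each sample, m = 3
  n = size(X,1);  m = max(labels);
  Y = [ones(n,1) X];                           % n x (d+1) augmented samples
  B = zeros(n,m);
  B(sub2ind([n m], (1:n)', labels)) = 1;       % B(i, labels(i)) = 1
  A = (Y'*Y) \ (Y'*B);                         % (d+1) x m weight matrix, one column a_i per class
  % Classify: pick the class whose discriminant g_i(y) = a_i'*y is largest.
  [~, predicted] = max(Y*A, [], 2);            % predicted class for each training sample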

Page 45

LDF: Summary

- Perceptron procedures
  - find a separating hyperplane in the linearly separable case
  - do not converge in the non-separable case
  - can be forced to converge by using a decreasing learning rate, but are then not guaranteed a reasonable stopping point
- MSE procedures
  - converge in both the separable and the non-separable case
  - may not find a separating hyperplane even if the classes are linearly separable
  - use the pseudoinverse if Y^t Y is not singular and not too large
  - use gradient descent (the Widrow-Hoff procedure) otherwise
- Ho-Kashyap procedures
  - always converge
  - find a separating hyperplane in the linearly separable case
  - are more costly