CptS 570 – Machine Learning
School of EECS
Washington State University
CptS 570 - Machine Learning 1
Or, support vector machine (SVM)
Discriminant-based method
◦ Learn class boundaries
The support vectors are the examples closest to the boundary
Kernel computes similarity between examples
◦ Maps the instance space to a higher-dimensional space where (hopefully) linear models suffice
Choosing the right kernel is crucial
Kernel machines are among the best-performing learners
CptS 570 - Machine Learning 2
Likely to underfit using only hyperplanes
But we can map the data to a nonlinear space and use hyperplanes there
◦ $\Phi: \mathbb{R}^d \rightarrow F$
◦ $\mathbf{x} \mapsto \Phi(\mathbf{x})$
CptS 570 - Machine Learning 3
[Figure: mapping $\Phi$ from the input space to the feature space]
Note we want $\geq +1$, not $\geq 0$
Want instances some distance from the hyperplane
CptS 570 - Machine Learning 4
$$X = \{\mathbf{x}^t, r^t\}, \quad \text{where } r^t = \begin{cases} +1 & \text{if } \mathbf{x}^t \in C_1 \\ -1 & \text{if } \mathbf{x}^t \in C_2 \end{cases}$$
Find $\mathbf{w}$ and $w_0$ such that
$$\mathbf{w}^T\mathbf{x}^t + w_0 \geq +1 \quad \text{for } r^t = +1$$
$$\mathbf{w}^T\mathbf{x}^t + w_0 \leq -1 \quad \text{for } r^t = -1$$
which can be rewritten as
$$r^t(\mathbf{w}^T\mathbf{x}^t + w_0) \geq +1$$
Distance from instance $\mathbf{x}^t$ to the hyperplane $\mathbf{w}^T\mathbf{x} + w_0 = 0$:
Distance from hyperplane to closest instances is the margin
CptS 570 - Machine Learning 5
$$\frac{|\mathbf{w}^T\mathbf{x}^t + w_0|}{\|\mathbf{w}\|} \quad \text{or, for correctly classified } \mathbf{x}^t, \quad \frac{r^t(\mathbf{w}^T\mathbf{x}^t + w_0)}{\|\mathbf{w}\|}$$
[Figure: separating hyperplane with the margin around it]
Optimal separating hyperplane is the one maximizing the margin
We want to choose w maximizing ρ such that
Infinite number of solutions by scaling $\mathbf{w}$; fix $\rho\|\mathbf{w}\| = 1$
Thus, maximizing the margin $\rho$ means choosing the solution minimizing $\|\mathbf{w}\|$
CptS 570 - Machine Learning 6
$$\frac{r^t(\mathbf{w}^T\mathbf{x}^t + w_0)}{\|\mathbf{w}\|} \geq \rho, \quad \forall t$$
$$\min \ \tfrac{1}{2}\|\mathbf{w}\|^2 \ \text{ subject to } \ r^t(\mathbf{w}^T\mathbf{x}^t + w_0) \geq +1, \ \forall t$$
Quadratic optimization problem with complexity polynomial in d (#features)
Kernel will eventually map d-dimensional space to higher-dimensional space
Prefer complexity that does not depend on the number of dimensions
CptS 570 - Machine Learning 7
$$\min \ \tfrac{1}{2}\|\mathbf{w}\|^2 \ \text{ subject to } \ r^t(\mathbf{w}^T\mathbf{x}^t + w_0) \geq +1, \ \forall t$$
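As a rough illustration of this primal problem, the sketch below (assuming NumPy and SciPy are available, with a tiny made-up 2-D dataset) minimizes $\tfrac{1}{2}\|\mathbf{w}\|^2$ under the constraints $r^t(\mathbf{w}^T\mathbf{x}^t + w_0) \geq 1$ using a general-purpose solver:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data: rows are x^t, labels r^t in {+1, -1} (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
r = np.array([1.0, 1.0, -1.0, -1.0])

# Pack the variables as v = [w1, w2, w0]; the objective is (1/2)||w||^2.
def objective(v):
    w = v[:2]
    return 0.5 * np.dot(w, w)

# One inequality constraint per example: r^t (w^T x^t + w0) - 1 >= 0.
constraints = [{"type": "ineq",
                "fun": lambda v, x=x, rt=rt: rt * (np.dot(v[:2], x) + v[2]) - 1.0}
               for x, rt in zip(X, r)]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, w0 = res.x[:2], res.x[2]
print("w =", w, "w0 =", w0)
print("margins r^t(w^T x^t + w0):", r * (X @ w + w0))  # all >= 1; tight for support vectors
```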
Convert the optimization problem to depend on the number of training examples N (not d)
◦ Still polynomial in N
But the solution will depend only on the closest examples (the support vectors)
◦ Typically ≪ N
CptS 570 - Machine Learning 8
Rewrite quadratic optimization problem using Lagrange multipliers αt, 1 ≤ t ≤ N
Minimize Lp
CptS 570 - Machine Learning 9
$$\min \ \tfrac{1}{2}\|\mathbf{w}\|^2 \ \text{ subject to } \ r^t(\mathbf{w}^T\mathbf{x}^t + w_0) \geq +1, \ \forall t$$
$$L_p = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{t=1}^{N} \alpha^t \left[ r^t(\mathbf{w}^T\mathbf{x}^t + w_0) - 1 \right]$$
$$\phantom{L_p} = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{t=1}^{N} \alpha^t r^t(\mathbf{w}^T\mathbf{x}^t + w_0) + \sum_{t=1}^{N} \alpha^t$$
Equivalently, we can maximize $L_p$ with respect to $\alpha^t$ subject to the constraints obtained by setting the derivatives with respect to $\mathbf{w}$ and $w_0$ to zero:
Plugging these into Lp …
CptS 570 - Machine Learning 10
$$\frac{\partial L_p}{\partial \mathbf{w}} = 0 \ \Rightarrow \ \mathbf{w} = \sum_{t=1}^{N} \alpha^t r^t \mathbf{x}^t$$
$$\frac{\partial L_p}{\partial w_0} = 0 \ \Rightarrow \ \sum_{t=1}^{N} \alpha^t r^t = 0$$
Maximize $L_d$ with respect to $\alpha^t$ only
Complexity is $O(N^3)$
CptS 570 - Machine Learning 11
$$L_d = \tfrac{1}{2}\mathbf{w}^T\mathbf{w} - \mathbf{w}^T\sum_t \alpha^t r^t \mathbf{x}^t - w_0\sum_t \alpha^t r^t + \sum_t \alpha^t$$
$$\phantom{L_d} = -\tfrac{1}{2}\mathbf{w}^T\mathbf{w} + \sum_t \alpha^t$$
$$\phantom{L_d} = -\tfrac{1}{2}\sum_t\sum_s \alpha^t \alpha^s r^t r^s (\mathbf{x}^t)^T\mathbf{x}^s + \sum_t \alpha^t$$
$$\text{subject to } \ \sum_t \alpha^t r^t = 0 \ \text{ and } \ \alpha^t \geq 0, \ \forall t$$
Most $\alpha^t = 0$
◦ I.e., $r^t(\mathbf{w}^T\mathbf{x}^t + w_0) > 1$ ($\mathbf{x}^t$ lies outside the margin)
Support vectors: $\mathbf{x}^t$ such that $\alpha^t > 0$
◦ I.e., $r^t(\mathbf{w}^T\mathbf{x}^t + w_0) = 1$ ($\mathbf{x}^t$ lies on the margin)
$\mathbf{w} = \sum_t \alpha^t r^t \mathbf{x}^t$
$w_0 = r^t - \mathbf{w}^T\mathbf{x}^t$ for any support vector $\mathbf{x}^t$
◦ Typically averaged over all support vectors
The resulting discriminant is called the support vector machine (SVM); a numeric sketch follows below
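A minimal numeric sketch of these steps, assuming NumPy and SciPy and reusing a small made-up dataset: it maximizes $L_d$ under $\sum_t \alpha^t r^t = 0$ and $\alpha^t \geq 0$, then recovers $\mathbf{w}$ and $w_0$ from the support vectors.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
r = np.array([1.0, 1.0, -1.0, -1.0])
N = len(r)

# Q_ts = r^t r^s (x^t . x^s), so L_d(alpha) = sum(alpha) - 0.5 * alpha^T Q alpha.
Q = np.outer(r, r) * (X @ X.T)

def neg_dual(alpha):                                    # maximize L_d = minimize -L_d
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

constraints = [{"type": "eq", "fun": lambda a: a @ r}]  # sum_t alpha^t r^t = 0
bounds = [(0.0, None)] * N                              # alpha^t >= 0

alpha = minimize(neg_dual, np.zeros(N), bounds=bounds, constraints=constraints).x

sv = alpha > 1e-6                        # support vectors: alpha^t > 0
w = (alpha[sv] * r[sv]) @ X[sv]          # w = sum_t alpha^t r^t x^t
w0 = np.mean(r[sv] - X[sv] @ w)          # average r^t - w^T x^t over support vectors
print("support vectors:", np.where(sv)[0], "w =", w, "w0 =", w0)
```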
CptS 570 - Machine Learning 12
CptS 570 - Machine Learning 13
[Figure: maximum-margin hyperplane; circled points (O) are the support vectors defining the margin]
Data not linearly separable
Find the hyperplane with least error
Define slack variables $\xi^t \geq 0$ storing the deviation from the margin
CptS 570 - Machine Learning 14
$$r^t(\mathbf{w}^T\mathbf{x}^t + w_0) \geq 1 - \xi^t$$
(a) Correctly classified example far from the margin ($\xi^t = 0$)
(b) Correctly classified example on the margin ($\xi^t = 0$)
(c) Correctly classified example, but inside the margin ($0 < \xi^t < 1$)
(d) Incorrectly classified example ($\xi^t \geq 1$)
Soft error $= \sum_t \xi^t$
CptS 570 - Machine Learning 15
CptS 570 - Machine Learning 16
[Figure: soft-margin hyperplane; circled points (O) are the support vectors, with the margin shown]
Lagrangian equation with slack variables
$C$ is the penalty factor
$\mu^t \geq 0$ are a new set of Lagrange multipliers
Want to minimize $L_p$
CptS 570 - Machine Learning 17
$$L_p = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_t \xi^t - \sum_t \alpha^t\left[ r^t(\mathbf{w}^T\mathbf{x}^t + w_0) - 1 + \xi^t \right] - \sum_t \mu^t \xi^t$$
Minimize Lp by setting derivatives to zero
Plugging these into $L_p$ yields the dual $L_d$
Maximize $L_d$ with respect to $\alpha^t$
CptS 570 - Machine Learning 18
$$\frac{\partial L_p}{\partial \mathbf{w}} = 0 \ \Rightarrow \ \mathbf{w} = \sum_{t=1}^{N} \alpha^t r^t \mathbf{x}^t$$
$$\frac{\partial L_p}{\partial w_0} = 0 \ \Rightarrow \ \sum_{t=1}^{N} \alpha^t r^t = 0$$
$$\frac{\partial L_p}{\partial \xi^t} = 0 \ \Rightarrow \ C - \alpha^t - \mu^t = 0$$
Quadratic optimization problem
Support vectors have $\alpha^t > 0$
◦ Examples on the margin: $\alpha^t < C$
◦ Examples inside the margin or misclassified: $\alpha^t = C$
CptS 570 - Machine Learning 19
$$L_d = -\tfrac{1}{2}\sum_t\sum_s \alpha^t \alpha^s r^t r^s (\mathbf{x}^t)^T\mathbf{x}^s + \sum_t \alpha^t$$
$$\text{subject to } \ \sum_t \alpha^t r^t = 0 \ \text{ and } \ 0 \leq \alpha^t \leq C, \ \forall t$$
$C$ is a regularization parameter
◦ High $C$: high penalty for non-separable examples (may overfit)
◦ Low $C$: less penalty (may underfit)
◦ Determine using a validation set ($C = 1$ is typical); see the sketch below
CptS 570 - Machine Learning 20
$$L_d = -\tfrac{1}{2}\sum_t\sum_s \alpha^t \alpha^s r^t r^s (\mathbf{x}^t)^T\mathbf{x}^s + \sum_t \alpha^t$$
$$\text{subject to } \ \sum_t \alpha^t r^t = 0 \ \text{ and } \ 0 \leq \alpha^t \leq C, \ \forall t$$
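To illustrate the role of $C$, a short sketch using scikit-learn's SVC (which solves this soft-margin dual internally) on made-up overlapping Gaussian blobs; the exact counts will vary with the random data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping 2-D classes (made up), so the data are not perfectly separable.
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)), rng.normal(1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Higher C penalizes slack more heavily: typically fewer support vectors, tighter fit.
    print(f"C={C:7.2f}  support vectors={clf.n_support_.sum():3d}  "
          f"train accuracy={clf.score(X, y):.2f}")
```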
To use previous approaches, data must be near linearly separable
If not, perhaps a transformation φ(x) will help
The functions $\phi_j(\mathbf{x})$ are basis functions
CptS 570 - Machine Learning 21
[Figure: basis functions $\phi$ mapping the $\mathbf{x}$-space to the $\mathbf{z}$-space]
Transform d-dimensional x space to k-dimensional z space using basis functions φ(x)
$\mathbf{z} = \boldsymbol{\phi}(\mathbf{x})$ where $z_j = \phi_j(\mathbf{x})$, $j = 1, \ldots, k$
Instead of $w_0$, assume $z_1 = \phi_1(\mathbf{x}) \equiv 1$
CptS 570 - Machine Learning 22
$$g(\mathbf{z}) = \mathbf{w}^T\mathbf{z}$$
$$g(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) = \sum_{j=1}^{k} w_j \phi_j(\mathbf{x})$$
Replace the inner product of basis functions $\boldsymbol{\phi}(\mathbf{x}^t)^T\boldsymbol{\phi}(\mathbf{x}^s)$ with a kernel function $K(\mathbf{x}^t, \mathbf{x}^s)$
CptS 570 - Machine Learning 23
$$L_p = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_t \xi^t - \sum_t \alpha^t\left[ r^t \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}^t) - 1 + \xi^t \right] - \sum_t \mu^t \xi^t$$
$$L_d = -\tfrac{1}{2}\sum_t\sum_s \alpha^t \alpha^s r^t r^s \boldsymbol{\phi}(\mathbf{x}^t)^T\boldsymbol{\phi}(\mathbf{x}^s) + \sum_t \alpha^t$$
$$\text{subject to } \ \sum_t \alpha^t r^t = 0 \ \text{ and } \ 0 \leq \alpha^t \leq C, \ \forall t$$
$$L_d = -\tfrac{1}{2}\sum_t\sum_s \alpha^t \alpha^s r^t r^s K(\mathbf{x}^t, \mathbf{x}^s) + \sum_t \alpha^t$$
Kernel $K(\mathbf{x}^t, \mathbf{x}^s)$ computes the $\mathbf{z}$-space inner product $\boldsymbol{\phi}(\mathbf{x}^t)^T\boldsymbol{\phi}(\mathbf{x}^s)$ in the original $\mathbf{x}$-space
The matrix of kernel values $\mathbf{K}$, where $K_{ts} = K(\mathbf{x}^t, \mathbf{x}^s)$, is called the Gram matrix
$\mathbf{K}$ should be symmetric and positive semidefinite (checked numerically in the sketch below)
CptS 570 - Machine Learning 24
$$\mathbf{w} = \sum_t \alpha^t r^t \mathbf{z}^t = \sum_t \alpha^t r^t \boldsymbol{\phi}(\mathbf{x}^t)$$
$$g(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) = \sum_t \alpha^t r^t \boldsymbol{\phi}(\mathbf{x}^t)^T\boldsymbol{\phi}(\mathbf{x})$$
$$g(\mathbf{x}) = \sum_t \alpha^t r^t K(\mathbf{x}^t, \mathbf{x})$$
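As a quick check of the Gram-matrix properties above, the sketch below (assuming NumPy, with a made-up linear kernel and random data) builds $K_{ts} = K(\mathbf{x}^t, \mathbf{x}^s)$ and verifies symmetry and non-negative eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))           # 5 made-up examples with 3 features each

def linear_kernel(x, y):              # K(x, y) = x^T y
    return float(x @ y)

# Gram matrix K_ts = K(x^t, x^s)
K = np.array([[linear_kernel(xt, xs) for xs in X] for xt in X])

print("symmetric:", np.allclose(K, K.T))
print("eigenvalues (should all be >= 0):", np.round(np.linalg.eigvalsh(K), 6))
```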
Polynomial kernel of degree q
If $q = 1$, we use the original features
For example, when $q = 2$ and $d = 2$:
CptS 570 - Machine Learning 25
$$K(\mathbf{x}^t, \mathbf{x}) = (\mathbf{x}^T\mathbf{x}^t + 1)^q$$
$$K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T\mathbf{y} + 1)^2 = (x_1 y_1 + x_2 y_2 + 1)^2$$
$$= 1 + 2x_1 y_1 + 2x_2 y_2 + 2x_1 x_2 y_1 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2$$
which corresponds to the basis functions
$$\boldsymbol{\phi}(\mathbf{x}) = \left[1, \ \sqrt{2}x_1, \ \sqrt{2}x_2, \ \sqrt{2}x_1 x_2, \ x_1^2, \ x_2^2\right]^T$$
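A small numerical check of this identity, assuming NumPy and using arbitrary 2-D vectors: the kernel computed directly in x-space matches the inner product of the explicit 6-dimensional expansion.

```python
import numpy as np

def phi(x):
    # Explicit basis expansion for q = 2, d = 2: [1, sqrt(2)x1, sqrt(2)x2, sqrt(2)x1x2, x1^2, x2^2]
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2, x1 ** 2, x2 ** 2])

def poly_kernel(x, y, q=2):
    return (np.dot(x, y) + 1.0) ** q

x, y = np.array([0.5, -1.2]), np.array([2.0, 0.3])
print(poly_kernel(x, y))   # kernel evaluated directly in the original x-space
print(phi(x) @ phi(y))     # same value via the explicit feature map
```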
Polynomial kernel of degree 2
CptS 570 - Machine Learning 26
[Figure: decision boundary of a degree-2 polynomial kernel; circled points (O) are the support vectors, with the margin shown]
Radial basis functions (Gaussian kernel)
$\mathbf{x}^t$ is the center
$s$ is the radius
Larger $s$ implies smoother boundaries
CptS 570 - Machine Learning 27
$$K(\mathbf{x}^t, \mathbf{x}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}^t\|^2}{2s^2}\right)$$
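A minimal sketch of the Gaussian kernel, assuming NumPy and a few made-up points; it shows that a larger radius $s$ pushes all pairwise similarities toward 1 (smoother boundaries).

```python
import numpy as np

def rbf_gram(X, s):
    # K(x^t, x^s) = exp(-||x^t - x^s||^2 / (2 s^2)) for all pairs of rows of X.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * s ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
for s in (0.5, 2.0, 10.0):
    print(f"s = {s}:")
    print(np.round(rbf_gram(X, s), 3))   # larger s -> entries closer to 1
```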
CptS 570 - Machine Learning 28
Sigmoidal functions
CptS 570 - Machine Learning 29
$$K(\mathbf{x}^t, \mathbf{x}) = \tanh(2\,\mathbf{x}^T\mathbf{x}^t + 1)$$
Kernel K(x,y) increases with similarity between x and y
Prior knowledge can be included in the kernel function
E.g., training examples are documents
◦ $K(D_1, D_2)$ = number of shared words
E.g., training examples are strings (e.g., DNA)
◦ $K(S_1, S_2)$ = 1 / edit distance between $S_1$ and $S_2$ (sketched below)
◦ Edit distance is the number of insertions, deletions, and/or substitutions needed to transform $S_1$ into $S_2$
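A rough sketch of the string-kernel idea above, in plain Python with no libraries; edit_distance and string_kernel are illustrative names, and giving identical strings a similarity of 1.0 is an arbitrary choice to avoid dividing by zero.

```python
def edit_distance(s1, s2):
    # Dynamic-programming Levenshtein distance: minimum number of
    # insertions, deletions, and substitutions turning s1 into s2.
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def string_kernel(s1, s2):
    dist = edit_distance(s1, s2)
    return 1.0 if dist == 0 else 1.0 / dist   # K(S1, S2) = 1 / edit distance

print(string_kernel("GATTACA", "GACTATA"))
```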
CptS 570 - Machine Learning 30
E.g., training examples are nodes in a graph (e.g., a social network)
◦ $K(N_1, N_2)$ = 1 / length of the shortest path connecting the nodes (sketched below)
◦ $K(N_1, N_2)$ = number of paths connecting the nodes
◦ Diffusion kernel
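A small sketch of the shortest-path version, assuming the networkx package is available and using its built-in karate-club social network; node_kernel is an illustrative name, and capping identical nodes at 1.0 is an arbitrary choice.

```python
import networkx as nx

G = nx.karate_club_graph()   # small social network bundled with networkx

def node_kernel(G, n1, n2):
    # K(N1, N2) = 1 / length of the shortest path connecting the nodes.
    length = nx.shortest_path_length(G, source=n1, target=n2)
    return 1.0 if length == 0 else 1.0 / length

print(node_kernel(G, 0, 33))   # farther apart -> smaller similarity
print(node_kernel(G, 0, 1))    # adjacent nodes -> similarity 1.0
```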
CptS 570 - Machine Learning 31
Training examples are graphs, not feature vectors
◦ E.g., carcinogenic vs. non-carcinogenic chemical structures
Compare substructures of graphs
◦ E.g., walks, paths, cycles, trees, subgraphs
K(G1,G2) = number of identical random walks in both graphs
K(G1,G2) = number of subgraphs shared by both graphs
CptS 570 - Machine Learning 32
Training data may come from multiple modalities (e.g., biometrics, social network, audio/visual)
Construct new kernels by combining simpler kernels
If $K_1(\mathbf{x}, \mathbf{y})$ and $K_2(\mathbf{x}, \mathbf{y})$ are valid kernels and $c > 0$ is a constant, then the following are valid kernels:
CptS 570 - Machine Learning 33
$$K(\mathbf{x}, \mathbf{y}) = c\,K_1(\mathbf{x}, \mathbf{y})$$
$$K(\mathbf{x}, \mathbf{y}) = K_1(\mathbf{x}, \mathbf{y}) + K_2(\mathbf{x}, \mathbf{y})$$
$$K(\mathbf{x}, \mathbf{y}) = K_1(\mathbf{x}, \mathbf{y})\,K_2(\mathbf{x}, \mathbf{y})$$
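A quick numerical illustration of these closure properties, assuming NumPy: Gram matrices for a polynomial and a Gaussian kernel are scaled, added, and multiplied elementwise, and each result stays positive semidefinite (minimum eigenvalue ≥ 0 up to rounding).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))   # made-up examples

def gram(kernel, X):
    return np.array([[kernel(x, y) for y in X] for x in X])

K1 = gram(lambda x, y: (x @ y + 1.0) ** 2, X)                    # polynomial kernel
K2 = gram(lambda x, y: np.exp(-np.sum((x - y) ** 2) / 2.0), X)   # Gaussian kernel

for name, K in (("c*K1", 3.0 * K1), ("K1+K2", K1 + K2), ("K1*K2", K1 * K2)):
    print(f"{name:6s} min eigenvalue = {np.linalg.eigvalsh(K).min():.6f}")
```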
Adaptive kernel combination
Learn both the $\alpha^t$'s and the $\eta_i$'s
CptS 570 - Machine Learning 34
$$K(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{m} \eta_i K_i(\mathbf{x}, \mathbf{y})$$
$$L_d = \sum_t \alpha^t - \tfrac{1}{2}\sum_t\sum_s \alpha^t \alpha^s r^t r^s \left(\sum_i \eta_i K_i(\mathbf{x}^t, \mathbf{x}^s)\right)$$
$$g(\mathbf{x}) = \sum_t \alpha^t r^t \sum_i \eta_i K_i(\mathbf{x}^t, \mathbf{x})$$
Learn $K$ different kernel machines $g_i(\mathbf{x})$
◦ Each uses one class as positive, the remaining classes as negative
◦ Choose class $i$ such that $i = \arg\max_j g_j(\mathbf{x})$
◦ Best approach in practice (see the sketch below)
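A minimal one-vs-all sketch using scikit-learn (OneVsRestClassifier wraps one SVC per class and predicts the argmax of the per-class scores), on the bundled iris data:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# One kernel machine per class: that class is positive, the remaining classes negative.
ova = OneVsRestClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)
print("training accuracy:", ova.score(X, y))
```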
CptS 570 - Machine Learning 35
Learn $K(K-1)/2$ kernel machines
◦ Each uses one class as positive and another class as negative
◦ Easier (faster) learning per kernel machine
CptS 570 - Machine Learning 36
Learn all margins at once
◦ $z^t$ is the class index of $\mathbf{x}^t$
$K \cdot N$ variables to optimize (expensive)
CptS 570 - Machine Learning 37
$$\min \ \tfrac{1}{2}\sum_{i=1}^{K}\|\mathbf{w}_i\|^2 + C\sum_i\sum_t \xi_i^t$$
$$\text{subject to } \ \mathbf{w}_{z^t}^T\mathbf{x}^t + w_{z^t 0} \geq \mathbf{w}_i^T\mathbf{x}^t + w_{i0} + 2 - \xi_i^t, \quad \forall t, \ i \neq z^t, \qquad \xi_i^t \geq 0$$
Normally, we would use squared error
For SVM, we use the $\epsilon$-sensitive loss
◦ Tolerate errors up to $\epsilon$
◦ Errors beyond $\epsilon$ have only a linear effect
CptS 570 - Machine Learning 38
$$e(r^t, f(\mathbf{x}^t)) = [r^t - f(\mathbf{x}^t)]^2, \qquad f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$$
$$e_\epsilon(r^t, f(\mathbf{x}^t)) = \begin{cases} 0 & \text{if } |r^t - f(\mathbf{x}^t)| < \epsilon \\ |r^t - f(\mathbf{x}^t)| - \epsilon & \text{otherwise} \end{cases}$$
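A one-function sketch of the $\epsilon$-sensitive loss, assuming NumPy; eps_sensitive_loss is an illustrative name and the numbers are made up.

```python
import numpy as np

def eps_sensitive_loss(r, f_x, eps=0.1):
    # 0 inside the epsilon tube; |r - f(x)| - eps (linear) outside it.
    err = np.abs(r - f_x)
    return np.where(err < eps, 0.0, err - eps)

r   = np.array([1.00, 1.05, 2.00])
f_x = np.array([1.00, 1.00, 1.00])
print(eps_sensitive_loss(r, f_x, eps=0.1))   # -> [0.0, 0.0, 0.9]
```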
Use slack variables to account for deviations beyond $\epsilon$
◦ $\xi_+^t$ for positive deviations
◦ $\xi_-^t$ for negative deviations
CptS 570 - Machine Learning 39
$$\min \ \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_t \left(\xi_+^t + \xi_-^t\right)$$
subject to
$$r^t - (\mathbf{w}^T\mathbf{x}^t + w_0) \leq \epsilon + \xi_+^t$$
$$(\mathbf{w}^T\mathbf{x}^t + w_0) - r^t \leq \epsilon + \xi_-^t$$
$$\xi_+^t, \ \xi_-^t \geq 0$$
Non-support vectors (inside the margin): $\alpha_+^t = \alpha_-^t = 0$
Support vectors
◦ ⊗ on the margin: $0 < \alpha_+^t < C$ or $0 < \alpha_-^t < C$
◦ ⊠ outside the margin (outlier): $\alpha_+^t = C$ or $\alpha_-^t = C$
CptS 570 - Machine Learning 40
$$L_d = -\tfrac{1}{2}\sum_t\sum_s (\alpha_+^t - \alpha_-^t)(\alpha_+^s - \alpha_-^s)(\mathbf{x}^t)^T\mathbf{x}^s - \epsilon\sum_t(\alpha_+^t + \alpha_-^t) + \sum_t r^t(\alpha_+^t - \alpha_-^t)$$
$$\text{subject to } \ 0 \leq \alpha_+^t \leq C, \quad 0 \leq \alpha_-^t \leq C, \quad \sum_t(\alpha_+^t - \alpha_-^t) = 0$$
CptS 570 - Machine Learning 41
Fitted line f(x) is weighted sum of support vectors
Average w0 over:
CptS 570 - Machine Learning 42
$$r^t = \mathbf{w}^T\mathbf{x}^t + w_0 + \epsilon \quad \text{if } 0 < \alpha_+^t < C$$
$$r^t = \mathbf{w}^T\mathbf{x}^t + w_0 - \epsilon \quad \text{if } 0 < \alpha_-^t < C$$
$$\mathbf{w} = \sum_t (\alpha_+^t - \alpha_-^t)\,\mathbf{x}^t$$
$$f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0 = \sum_t (\alpha_+^t - \alpha_-^t)(\mathbf{x}^t)^T\mathbf{x} + w_0$$
CptS 570 - Machine Learning 43
$$L_d = -\tfrac{1}{2}\sum_t\sum_s (\alpha_+^t - \alpha_-^t)(\alpha_+^s - \alpha_-^s)\,K(\mathbf{x}^t, \mathbf{x}^s) - \epsilon\sum_t(\alpha_+^t + \alpha_-^t) + \sum_t r^t(\alpha_+^t - \alpha_-^t)$$
$$\text{subject to } \ 0 \leq \alpha_+^t \leq C, \quad 0 \leq \alpha_-^t \leq C, \quad \sum_t(\alpha_+^t - \alpha_-^t) = 0$$
$$f(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) + w_0 = \sum_t (\alpha_+^t - \alpha_-^t)\,K(\mathbf{x}^t, \mathbf{x}) + w_0$$
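For a concrete run of kernelized support vector regression, a sketch using scikit-learn's SVR (which optimizes this $\epsilon$-insensitive dual internally) on made-up noisy sine data; here gamma plays the role of $1/(2s^2)$ for the Gaussian kernel.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 2.0 * np.pi, size=(80, 1)), axis=0)
r = np.sin(X).ravel() + rng.normal(0.0, 0.1, size=80)      # noisy targets

svr = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=1.0).fit(X, r)
print("support vectors:", len(svr.support_))
print("train R^2:", round(svr.score(X, r), 3))
```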
[Figures: kernel regression fits using a polynomial (quadratic) kernel and a Gaussian kernel]
CptS 570 - Machine Learning 44
Sequential Minimal Optimization (SMO)
◦ Classification: SMO
◦ Regression: SMOreg
Kernels
◦ Polynomial
◦ RBF
◦ String
CptS 570 - Machine Learning 45
Seek the optimal separating hyperplane
Support vector machine (SVM) finds the hyperplane using only the closest examples
Kernel function allows the SVM to operate in higher dimensions
Kernel regression
Choosing the correct kernel is crucial
Kernel machines are among the best-performing learners
CptS 570 - Machine Learning 46