
Two Important Parts of SMO (selection heuristics & stopping criterion)

A good selection of a pair of updating points will speed up convergence

The selection heuristic may depend on the stopping criterion

Stopping criterion: duality gap => it is natural to choose the pair of points with the largest violation of the KKT conditions (too expensive to check exhaustively)
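For concreteness, here is a minimal sketch of one standard realization of that heuristic, the "maximal violating pair" rule used by working-set SMO solvers. It assumes the usual SVM dual $\min_\alpha \tfrac{1}{2}\alpha^\top Q\alpha - e^\top\alpha$ s.t. $y^\top\alpha = 0$, $0 \le \alpha \le C$, with `grad` holding the current gradient $Q\alpha - e$; the function name and tolerance are illustrative, not from the slides.

```python
import numpy as np

def select_most_violating_pair(alpha, grad, y, C, eps=1e-3):
    """Pick the index pair (i, j) that violates the KKT conditions the most.

    alpha: current dual variables; grad: Q @ alpha - e; y: +/-1 labels.
    Returns None when the maximal violation falls below eps, which doubles
    as a duality-gap-style stopping test.
    """
    # indices whose alpha can still move "up" / "down" along the constraint y'alpha = 0
    up = ((y > 0) & (alpha < C)) | ((y < 0) & (alpha > 0))
    low = ((y < 0) & (alpha < C)) | ((y > 0) & (alpha > 0))
    score = -y * grad
    i = int(np.argmax(np.where(up, score, -np.inf)))
    j = int(np.argmin(np.where(low, score, np.inf)))
    if score[i] - score[j] <= eps:
        return None          # KKT satisfied to tolerance: stop
    return i, j
```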

How to Solve an Unconstrained MP

Get an initial point and iteratively decrease the objective function value

Newton's method is highly recommended: a locally and quadratically convergent algorithm

Stop once the stopping criterion is satisfied

Steepest descent might not be a good choice

Need to choose a good step size to guarantee global convergence

Steepest Descent with Exact Line Search

Start with any $x^0 \in \mathbb{R}^n$. Having $x^i$, stop if $\nabla f(x^i) = 0$.

Else compute $x^{i+1}$ as follows:

(i) Steepest descent direction: $d^i = -\nabla f(x^i)$

(ii) Exact line search: choose a stepsize $\lambda \in \mathbb{R}$ such that

$\frac{d f(x^i + \lambda d^i)}{d\lambda} = \nabla f(x^i + \lambda d^i)^\top d^i = 0$

(iii) Updating: $x^{i+1} = x^i + \lambda d^i$
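As a sketch, the three steps above specialize nicely to a convex quadratic $f(x) = \tfrac{1}{2}x^\top Qx - c^\top x$, where the exact line search has the closed form $\lambda = (g^\top g)/(g^\top Qg)$; the names `Q`, `c`, and the function itself are illustrative assumptions, not from the slides.

```python
import numpy as np

def steepest_descent_quadratic(Q, c, x0, tol=1e-8, max_iter=1000):
    """Steepest descent with exact line search on f(x) = 0.5 x'Qx - c'x,
    Q symmetric positive definite."""
    x = x0.astype(float)
    for _ in range(max_iter):
        g = Q @ x - c                    # gradient of f
        if np.linalg.norm(g) < tol:      # stop if grad f(x_i) = 0 (numerically)
            break
        d = -g                           # (i) steepest descent direction
        lam = (g @ g) / (g @ (Q @ g))    # (ii) exact line search: df/dlam = 0
        x = x + lam * d                  # (iii) updating
    return x
```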

Newton’s Method

Start with any $x^0 \in \mathbb{R}^n$. Having $x^i$, stop if $\nabla f(x^i) = 0$.

Else compute $x^{i+1}$ as follows:

(i) Newton direction: $\nabla^2 f(x^i)\, d^i = -\nabla f(x^i)$

(ii) Updating: $x^{i+1} = x^i + d^i$

Have to solve a system of linear equations here!

Converges only when $x^0$ is close enough to $x^*$.

It may fail to converge to the optimal solution. For example, for

$f(x) = -\tfrac{1}{6}x^6 + \tfrac{1}{4}x^4 + 2x^2$,

each Newton step minimizes the local quadratic model

$g(x) = f(x^i) + f'(x^i)(x - x^i) + \tfrac{1}{2}f''(x^i)(x - x^i)^2$,

so from a poor starting point the iterates settle on a stationary point that is not the minimizer.
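A tiny sketch of Newton's method on this one-dimensional example, showing how the limit depends on the starting point; the function name is illustrative.

```python
def newton_1d(x, max_iter=50, tol=1e-10):
    """Newton's method for f(x) = -(1/6)x^6 + (1/4)x^4 + 2x^2."""
    for _ in range(max_iter):
        g = -x**5 + x**3 + 4*x        # f'(x)
        if abs(g) < tol:
            return x
        h = -5*x**4 + 3*x**2 + 4      # f''(x)
        x = x - g / h                 # Newton step solves f''(x) d = -f'(x)
    return x

print(newton_1d(0.5))   # converges to x = 0, the local minimizer
print(newton_1d(2.0))   # converges to x ~ 1.60, a local maximizer, not a minimizer
```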

SVM as an Unconstrained Minimization Problem

Start with the constrained formulation:

$\min_{w,b,\xi} \ \tfrac{C}{2}\|\xi\|_2^2 + \tfrac{1}{2}\left(\|w\|_2^2 + b^2\right)$
s.t. $D(Aw + eb) + \xi \ge e, \ \xi \ge 0$   (QP)

At the solution of (QP): $\xi = (e - D(Aw + eb))_+$, where $(\cdot)_+ = \max\{\cdot, 0\}$.

Hence (QP) is equivalent to the nonsmooth SVM:

$\min_{w,b} \ \tfrac{C}{2}\left\|(e - D(Aw + eb))_+\right\|_2^2 + \tfrac{1}{2}\left(\|w\|_2^2 + b^2\right)$

This changes (QP) into an unconstrained MP and reduces $(n+1+l)$ variables to $(n+1)$ variables.

Smooth the Plus Function: Integrate

Step function: $x_*$  ->  sigmoid function: $\left(1 + \varepsilon^{-\beta x}\right)^{-1}$

Plus function: $x_+$  ->  p-function: $p(x, \beta)$

Integrating the sigmoid gives

$p(x, \beta) := x + \tfrac{1}{\beta}\log\left(1 + \varepsilon^{-\beta x}\right)$

($\varepsilon$ denotes the base of the natural logarithm; the slide's plots use $\beta = 5$.)
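A quick sketch checking the approximation numerically, assuming $\beta = 5$; since $p(x,\beta) \ge x_+$ everywhere with maximal gap $\log(2)/\beta$ at $x = 0$, the printed value should be about 0.139.

```python
import numpy as np

def plus(x):
    return np.maximum(x, 0.0)          # the plus function x_+

def p(x, beta=5.0):
    # log(1 + exp(-beta*x)) via logaddexp for numerical stability
    return x + np.logaddexp(0.0, -beta * x) / beta

x = np.linspace(-2.0, 2.0, 9)
print(np.max(np.abs(p(x) - plus(x))))  # ~ log(2)/beta; shrinks as beta grows
```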

SSVM: Smooth Support Vector Machine

Replacing the plus function $(\cdot)_+$ in the nonsmooth SVM by the smooth $p(\cdot, \beta)$ gives our SSVM:

$\min_{(w,b) \in \mathbb{R}^{n+1}} \ \tfrac{C}{2}\left\|p\left(e - D(Aw + eb), \beta\right)\right\|_2^2 + \tfrac{1}{2}\left(\|w\|_2^2 + b^2\right)$

Here, $p(\cdot, \beta)$ is an accurate smooth approximation of $(\cdot)_+$, obtained by integrating the sigmoid function of neural networks. (sigmoid = smoothed step)

The solution of SSVM converges to the solution of the nonsmooth SVM as $\beta$ goes to infinity. (Typically, $\beta = 5$.)

Newton-Armijo Method: Quadratic Approximation of SSVM

The sequence $\{(w^i, b^i)\}$, generated by solving a quadratic approximation of SSVM, converges to the unique solution $(w^*, b^*)$ of SSVM at a quadratic rate.

At each iteration we solve a linear system of $n+1$ equations in $n+1$ variables, so the complexity depends on the dimension of the input space.

Converges in 6 to 8 iterations.

A stepsize might need to be selected (the Armijo rule below).

Newton-Armijo Algorithm

Start with any $(w^0, b^0) \in \mathbb{R}^{n+1}$. Having $(w^i, b^i)$, stop if $\nabla \Phi_\beta(w^i, b^i) = 0$; else:

(i) Newton direction:

$\nabla^2 \Phi_\beta(w^i, b^i)\, d^i = -\nabla \Phi_\beta(w^i, b^i)^\top$

(ii) Armijo stepsize: $(w^{i+1}, b^{i+1}) = (w^i, b^i) + \lambda_i d^i$, with $\lambda_i \in \{1, \tfrac{1}{2}, \tfrac{1}{4}, \ldots\}$ chosen such that Armijo's rule is satisfied.

This algorithm converges globally and quadratically to the unique solution in a finite number of steps.

Newton-Armijo Algorithm for SSVM:

$\min \ \Phi_\alpha(w, \gamma) := \tfrac{\nu}{2}\left\|p\left(e - D(Aw - e\gamma), \alpha\right)\right\|_2^2 + \tfrac{1}{2}\|(w, \gamma)\|_2^2$

(Here the smoothing parameter is denoted $\alpha$ and the bias $\gamma$; $\nu$ plays the role of $C$.)

Start with any $(w^0, \gamma^0) \in \mathbb{R}^{n+1}$. Having $(w^i, \gamma^i)$, stop if $\nabla \Phi_\alpha(w^i, \gamma^i) = 0$. Else compute $(w^{i+1}, \gamma^{i+1})$ as follows:

(i) Newton direction: determine $d^i \in \mathbb{R}^{n+1}$ by solving $n+1$ linear equations in $n+1$ variables:

$\nabla^2 \Phi_\alpha(w^i, \gamma^i)\, d^i = -\nabla \Phi_\alpha(w^i, \gamma^i)^\top$

(ii) Armijo stepsize: choose $\lambda_i = \max\{1, \tfrac{1}{2}, \tfrac{1}{4}, \ldots\}$ such that

$\Phi_\alpha(w^i, \gamma^i) - \Phi_\alpha\left((w^i, \gamma^i) + \lambda_i d^i\right) \ge -\delta\, \lambda_i\, \nabla \Phi_\alpha(w^i, \gamma^i)\, d^i$, where $\delta \in (0, \tfrac{1}{2})$

(iii) Updating: $(w^{i+1}, \gamma^{i+1}) = (w^i, \gamma^i) + \lambda_i d^i$
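A minimal NumPy sketch of the algorithm above, assuming $D = \mathrm{diag}(y)$ and dense data; the gradient and Hessian use $p'(x) = (1+\varepsilon^{-\alpha x})^{-1}$ and $p''(x) = \alpha\, p'(x)(1 - p'(x))$. The names `ssvm_fit`, `nu`, `delta`, and the safeguard on the stepsize are illustrative, not from the slides.

```python
import numpy as np

def p(x, a):
    # smooth plus function p(x, a) = x + (1/a) log(1 + exp(-a x))
    return x + np.logaddexp(0.0, -a * x) / a

def ssvm_fit(A, y, nu=1.0, a=5.0, delta=0.25, tol=1e-8, max_iter=100):
    """Newton-Armijo minimization of Phi(w, gamma) for the SSVM."""
    m, n = A.shape
    # stack z = (w, gamma) so that D(A w - e gamma) = M @ z
    M = np.hstack([y[:, None] * A, -y[:, None]])
    z = np.zeros(n + 1)

    def phi(z):
        return 0.5 * nu * np.sum(p(1.0 - M @ z, a) ** 2) + 0.5 * (z @ z)

    for _ in range(max_iter):
        r = 1.0 - M @ z
        pr = p(r, a)
        s = 0.5 * (1.0 + np.tanh(0.5 * a * r))        # p'(r), stable sigmoid
        g = -nu * (M.T @ (pr * s)) + z                # gradient of Phi
        if np.linalg.norm(g) < tol:
            break                                     # stop if grad Phi = 0
        h = s * s + pr * a * s * (1.0 - s)            # p'(r)^2 + p(r) p''(r)
        H = nu * (M.T @ (h[:, None] * M)) + np.eye(n + 1)
        d = np.linalg.solve(H, -g)                    # (i) Newton direction
        lam, f0 = 1.0, phi(z)                         # (ii) Armijo stepsize
        while f0 - phi(z + lam * d) < -delta * lam * (g @ d) and lam > 1e-6:
            lam *= 0.5                                # lam in {1, 1/2, 1/4, ...}
        z = z + lam * d                               # (iii) updating
    return z[:n], z[n]                                # (w, gamma)
```

On small dense datasets this sketch typically stops after a handful of Newton steps, consistent with the 6-to-8 iterations quoted above.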

Comparisons of SSVM with other SVMs

Tenfold test set correctness % and CPU time in seconds; SSVM solves a system of linear equations, SVM$\|\cdot\|_2^2$ solves a QP, and SVM$\|\cdot\|_1$ solves an LP. The best correctness (shown in red on the slide) is SSVM's in every row.

Dataset (size m x n)           SSVM          SVM||.||_2^2    SVM||.||_1
                               %      sec    %      sec      %      sec
Cleveland Heart (297 x 13)     86.13  1.63   84.55  18.71    72.12  67.55
BUPA Liver (345 x 6)           70.33  1.05   64.03  19.94    69.86  124.23
Ionosphere (351 x 34)          89.63  3.69   86.10  42.41    89.17  128.15
Pima Indians (768 x 8)         78.12  1.54   74.47  286.59   77.07  1138.0
WPBC (24 months) (155 x 32)    83.47  2.32   71.08  6.25     82.02  12.50
WPBC (60 months) (110 x 22)    68.18  1.03   66.23  3.72     61.83  4.91

Two-spiral Dataset (94 White Dots & 94 Red Dots) [figure]

The Perceptron Algorithm (Dual Form)

Given a linearly separable training set $S$, initialize $\alpha = 0$, $\alpha \in \mathbb{R}^l$, $b = 0$, and set $R = \max_{1 \le i \le l} \|x_i\|$. The weight vector is represented dually as $w = \sum_{i=1}^{l} \alpha_i y_i x_i$.

repeat
  for $i = 1$ to $l$
    if $y_i \left( \sum_{j=1}^{l} \alpha_j y_j \langle x_j \cdot x_i \rangle + b \right) \le 0$ then
      $\alpha_i \leftarrow \alpha_i + 1$
    end if
  end for
until no mistakes made within the for loop
return $(\alpha, b)$
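A runnable sketch of the dual-form perceptron above. Note the slide defines $R$ but the extracted text never uses it; the textbook version of this algorithm (Cristianini & Shawe-Taylor) updates $b \leftarrow b + y_i R^2$ on each mistake, and that assumed update is included here.

```python
import numpy as np

def dual_perceptron(X, y, max_epochs=100):
    """Dual perceptron: X is l x n training points, y holds +/-1 labels."""
    l = X.shape[0]
    alpha, b = np.zeros(l), 0.0              # alpha = 0, b = 0
    R = np.max(np.linalg.norm(X, axis=1))    # R = max_i ||x_i||
    G = X @ X.T                              # Gram matrix of <x_j . x_i>
    for _ in range(max_epochs):              # repeat ...
        mistakes = 0
        for i in range(l):
            if y[i] * (np.sum(alpha * y * G[:, i]) + b) <= 0:
                alpha[i] += 1.0              # alpha_i <- alpha_i + 1
                b += y[i] * R ** 2           # assumed textbook bias update
                mistakes += 1
        if mistakes == 0:                    # ... until no mistakes made
            break
    return alpha, b
```

The primal weights can be recovered afterwards as `w = X.T @ (alpha * y)`, matching $w = \sum_i \alpha_i y_i x_i$.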

Nonlinear SVM Motivation

Linear SVM (linear separating surface: $x'w = \gamma$):

$\min_{w, \gamma, y} \ \tfrac{\nu}{2}\|y\|_2^2 + \tfrac{1}{2}\|(w, \gamma)\|_2^2$
s.t. $D(Aw - e\gamma) + y \ge e, \ y \ge 0$   (QP)

By QP "duality", $w = A'Du$. Maximizing the margin in the "dual space" gives:

$\min_{u, \gamma, y} \ \tfrac{\nu}{2}\|y\|_2^2 + \tfrac{1}{2}\|(u, \gamma)\|_2^2$
s.t. $D(AA'Du - e\gamma) + y \ge e, \ y \ge 0$

Smoothing as before yields the dual SSVM, with separator $x'A'Du = \gamma$:

$\min_{u, \gamma} \ \tfrac{\nu}{2}\left\|p\left(e - D(AA'Du - e\gamma), \alpha\right)\right\|_2^2 + \tfrac{1}{2}\|(u, \gamma)\|_2^2$

Nonlinear Smooth SVM

Replace $AA'$ by a nonlinear kernel $K(A, A')$:

$\min_{u, \gamma} \ \tfrac{\nu}{2}\left\|p\left(e - D(K(A, A')Du - e\gamma), \alpha\right)\right\|_2^2 + \tfrac{1}{2}\|(u, \gamma)\|_2^2$

The kernel matrix $K(A, A') \in \mathbb{R}^{m \times m}$ is fully dense.

Use the Newton algorithm to solve the problem; each iteration solves $m+1$ linear equations in $m+1$ variables.

The nonlinear classifier depends on the entire dataset:

Nonlinear classifier: $K(x', A')Du = \gamma$
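A short sketch of the two pieces named above, assuming a Gaussian kernel (the slides do not fix the kernel choice); `mu` and the function names are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, mu=0.1):
    """K[i, j] = exp(-mu * ||A_i - B_j||^2); K(A, A') is m x m and fully dense."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * (A @ B.T)
    return np.exp(-mu * sq)

def nonlinear_classify(x_new, A, y, u, gamma, mu=0.1):
    """Sign of K(x', A') D u - gamma; note it needs the whole training set A."""
    k = gaussian_kernel(np.atleast_2d(x_new), A, mu)   # 1 x m kernel row
    return np.sign(k @ (y * u) - gamma)
```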

Difficulties with Nonlinear SVM for Large Problems

The nonlinear kernel $K(A, A') \in \mathbb{R}^{m \times m}$ is fully dense:

Computational complexity depends on $m$; the complexity of nonlinear SSVM is $\approx O((m+1)^3)$

Need to generate and store $O(m^2)$ entries

Runs out of memory while storing the kernel matrix

Long CPU time to compute the dense kernel matrix

The separating surface depends on almost the entire dataset:

Need to store the entire dataset even after solving the problem