8/8/2019 Lec9 SVM Nonlinear
Support Vector Machines:
Nonlinear Case
Jieping Ye, Department of Computer Science and Engineering
Arizona State University
http://www.public.asu.edu/~jye02
Source: Andrew's tutorials on SVM
Outline of lecture
Nonlinear SVM using basis functions
Nonlinear SVM using kernels
Extensions
SVM for multi-class classification
SVM path
SVM for unbalanced data
Support Vector Machine: Linear Case
Balances the trade-off between the margin and classification errors
$$\{w^*, b^*\} = \arg\min_{w,b}\ \|w\|^2 + c\sum_{i=1}^{N}\xi_i$$

subject to

$$y_i(w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, N$$

(Figure: the two classes, one marker denoting +1 and the other denoting -1, with the margin and three slack examples labeled 1, 2, 3.)
Support Vector Machine: Linear Case

Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}, \quad \text{where } Q_{kl} = y_k y_l\,(x_k \cdot x_l)$$

Subject to these constraints:
$$0 \le \alpha_k \le C, \qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$

Then define:
$$w = \sum_{k=1}^{R}\alpha_k y_k x_k$$

Then classify with: f(x, w, b) = sign(w . x - b)
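The dual above can be sketched numerically. This is a minimal illustration (assuming numpy), not the lecture's solver: projected gradient ascent stands in for a proper QP solver, and all names and the toy data are illustrative.

```python
import numpy as np

# Sketch: maximize  sum_k a_k - 1/2 sum_k sum_l a_k a_l Q_kl,
# with Q_kl = y_k y_l (x_k . x_l), subject to 0 <= a_k <= C and
# sum_k a_k y_k = 0, via projected gradient ascent (a is alpha).
def svm_dual_pga(X, y, C=1.0, lr=0.01, steps=5000):
    R = len(y)
    Q = (y[:, None] * y[None, :]) * (X @ X.T)
    a = np.zeros(R)
    for _ in range(steps):
        grad = 1.0 - Q @ a                   # gradient of the dual objective
        a = np.clip(a + lr * grad, 0.0, C)   # project onto the box [0, C]
        a -= y * (a @ y) / R                 # project onto sum_k a_k y_k = 0
        a = np.clip(a, 0.0, C)               # re-clip (approximate projection)
    w = (a * y) @ X                          # w = sum_k a_k y_k x_k
    sv = (a > 1e-6) & (a < C - 1e-6)         # on-margin support vectors
    b = np.mean(X[sv] @ w - y[sv]) if sv.any() else 0.0
    return w, b, a

# Toy linearly separable data; classify with f(x) = sign(w . x - b)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, a = svm_dual_pga(X, y)
```

The alternating box/hyperplane projection is only approximate, but for a small separable problem it tracks the KKT solution closely enough to recover a correct separator.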
Computing b: a linear programming problem!
Suppose we're in 1 dimension. What would SVMs do with this data?
(Figure: labeled points on the number line around x = 0.)
Suppose we're in 1 dimension. Not a big surprise:
(Figure: the maximum-margin split of the line, with the positive plane and negative plane on either side of x = 0.)
Harder 1-dimensional dataset. What can be done about this?
(Figure: the two classes interleaved on the line around x = 0, not linearly separable.)
Harder 1-dimensional dataset

Apply the following map: z_k = (x_k, x_k^2)
Harder 1-dimensional dataset

Apply the map z_k = (x_k, x_k^2): in the (x, x^2) plane the two classes become linearly separable.
Harder 1-dimensional dataset

z_k = (x_k, x_k^2)

(Figure: points at x = -4, -3, -1, 0, 1, 3, 4 on the line, mapped onto the parabola; e.g. x = +-3 maps to x^2 = 9 and x = +-4 to 16.)
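The slide's trick can be checked in a few lines. A small sketch: the point positions come from the figure, while the labels (inner cluster positive) are illustrative assumptions.

```python
import numpy as np

# 1-D points from the slide's figure; labels are illustrative: the inner
# cluster is +1, the outer points are -1, so no single threshold on the
# line separates them. After z_k = (x_k, x_k^2), a horizontal line in the
# (x, x^2) plane -- here x^2 = 2 -- splits the classes.
x = np.array([-4.0, -3.0, -1.0, 0.0, 1.0, 3.0, 4.0])
y = np.where(np.abs(x) <= 1.0, 1.0, -1.0)   # +1 inside, -1 outside

z = np.stack([x, x ** 2], axis=1)           # the basis-function map
separable = np.all((z[:, 1] < 2.0) == (y > 0))
print(separable)                            # True
```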
Harder 2-dimensional dataset

Apply the following map: z_k = (x_k, y_k, x_k^2, y_k^2, x_k y_k)
Common SVM basis functions

z_k = (polynomial terms of x_k of degree 1 to q)
z_k = (radial basis functions of x_k):
$$z_k[j] = \phi_j(x_k) = \exp\!\left(-\frac{\|x_k - c_j\|^2}{2\sigma^2}\right)$$
z_k = (sigmoid functions of x_k)
Basis Functions

$$\Phi(x) = \big(\,1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ \dots,\ \sqrt{2}\,x_m,\ x_1^2,\ x_2^2,\ \dots,\ x_m^2,\ \sqrt{2}\,x_1 x_2,\ \sqrt{2}\,x_1 x_3,\ \dots,\ \sqrt{2}\,x_{m-1} x_m\,\big)$$

Constant term; linear terms; pure quadratic terms; quadratic cross-terms.

Number of terms (assuming m input dimensions) = (m+2)-choose-2 = (m+2)(m+1)/2, which is (as near as) m^2/2.

You may be wondering what those $\sqrt{2}$'s are doing. You'll find out why they're there soon.
QP (old)

Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}, \quad \text{where } Q_{kl} = y_k y_l\,(x_k \cdot x_l)$$

Subject to these constraints:
$$0 \le \alpha_k \le C, \qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$

Then define:
$$w = \sum_{k=1}^{R}\alpha_k y_k x_k$$

Then classify with: f(x, w, b) = sign(w . x - b)
QP with basis functions
Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}, \quad \text{where } Q_{kl} = y_k y_l\,(\Phi(x_k)\cdot\Phi(x_l))$$

Subject to these constraints:
$$0 \le \alpha_k \le C, \qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$

Then define:
$$w = \sum_{k\,:\,\alpha_k > 0}\alpha_k y_k\,\Phi(x_k)$$

Then classify with: f(x, w, b) = sign(w . Phi(x) - b)

Most important changes: x -> Phi(x)
QP with basis functions

Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}, \quad \text{where } Q_{kl} = y_k y_l\,(\Phi(x_k)\cdot\Phi(x_l))$$

Subject to these constraints:
$$0 \le \alpha_k \le C, \qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$

Then define:
$$w = \sum_{k\,:\,\alpha_k > 0}\alpha_k y_k\,\Phi(x_k)$$

Then classify with: f(x, w, b) = sign(w . Phi(x) - b)

We must do R^2/2 dot products to get this matrix ready. Each dot product requires m^2/2 additions and multiplications. The whole thing costs R^2 m^2 / 4.
Dot Products

$$\Phi(a)\cdot\Phi(b) = 1 + 2\sum_{i=1}^{m}a_i b_i + \sum_{i=1}^{m}a_i^2 b_i^2 + 2\sum_{i=1}^{m}\sum_{j=i+1}^{m}a_i a_j b_i b_j$$

(Multiplying $\Phi(a)$ and $\Phi(b)$ term by term: the constant terms give 1, the linear terms give $2\sum a_i b_i$, the pure quadratic terms give $\sum a_i^2 b_i^2$, and the cross-terms give $2\sum\sum a_i a_j b_i b_j$.)
Dot Products

$$\Phi(a)\cdot\Phi(b) = 1 + 2\sum_{i=1}^{m}a_i b_i + \sum_{i=1}^{m}a_i^2 b_i^2 + 2\sum_{i=1}^{m}\sum_{j=i+1}^{m}a_i a_j b_i b_j$$

Just out of interest, let's look at another function of a and b:

$$(a\cdot b + 1)^2 = (a\cdot b)^2 + 2\,(a\cdot b) + 1$$
$$= \Big(\sum_{i=1}^{m}a_i b_i\Big)^2 + 2\sum_{i=1}^{m}a_i b_i + 1$$
$$= \sum_{i=1}^{m}\sum_{j=1}^{m}a_i b_i\,a_j b_j + 2\sum_{i=1}^{m}a_i b_i + 1$$
$$= \sum_{i=1}^{m}a_i^2 b_i^2 + 2\sum_{i=1}^{m}\sum_{j=i+1}^{m}a_i a_j b_i b_j + 2\sum_{i=1}^{m}a_i b_i + 1$$
Dot Products

$$\Phi(a)\cdot\Phi(b) = 1 + 2\sum_{i=1}^{m}a_i b_i + \sum_{i=1}^{m}a_i^2 b_i^2 + 2\sum_{i=1}^{m}\sum_{j=i+1}^{m}a_i a_j b_i b_j$$

$$(a\cdot b + 1)^2 = \sum_{i=1}^{m}a_i^2 b_i^2 + 2\sum_{i=1}^{m}\sum_{j=i+1}^{m}a_i a_j b_i b_j + 2\sum_{i=1}^{m}a_i b_i + 1$$

They're the same! And this is only O(m) to compute!
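The identity on this slide can be verified numerically. A small sketch (assuming numpy): `phi` is an explicit implementation of the sqrt(2)-scaled quadratic basis from the earlier slide; names and test vectors are illustrative.

```python
import numpy as np

# Check: with Phi(x) = (1, sqrt(2) x_i, x_i^2, sqrt(2) x_i x_j for i<j),
# the explicit feature dot product equals the kernel value (a.b + 1)^2.
def phi(x):
    m = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(m) for j in range(i + 1, m)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)

lhs = phi(a) @ phi(b)       # O(m^2) explicit feature dot product
rhs = (a @ b + 1.0) ** 2    # O(m) kernel evaluation
print(np.isclose(lhs, rhs)) # True
```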
QP with basis functions
Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}, \quad \text{where } Q_{kl} = y_k y_l\,(\Phi(x_k)\cdot\Phi(x_l))$$

Subject to these constraints:
$$0 \le \alpha_k \le C, \qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$

Then define:
$$w = \sum_{k\,:\,\alpha_k > 0}\alpha_k y_k\,\Phi(x_k)$$

Then classify with: f(x, w, b) = sign(w . Phi(x) - b)

We must do R^2/2 dot products to get this matrix ready. Each dot product now only requires m additions and multiplications.
Higher Order Polynomials

Polynomial | Phi(x)                          | Cost to build Q_kl traditionally | Cost if 100 inputs | Phi(a).Phi(b) | Cost to build Q_kl sneakily | Cost if 100 inputs
Quadratic  | all m^2/2 terms up to degree 2  | m^2 R^2 / 4                      | 2,500 R^2          | (a.b + 1)^2   | m R^2 / 2                   | 50 R^2
Cubic      | all m^3/6 terms up to degree 3  | m^3 R^2 / 12                     | 83,000 R^2         | (a.b + 1)^3   | m R^2 / 2                   | 50 R^2
Quartic    | all m^4/24 terms up to degree 4 | m^4 R^2 / 48                     | 1,960,000 R^2      | (a.b + 1)^4   | m R^2 / 2                   | 50 R^2
QP with Quintic basis functions
Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}, \quad \text{where } Q_{kl} = y_k y_l\,(\Phi(x_k)\cdot\Phi(x_l)) = y_k y_l\,(x_k \cdot x_l + 1)^5$$

Subject to these constraints:
$$0 \le \alpha_k \le C, \qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$

Then define:
$$w = \sum_{k\,:\,\alpha_k > 0}\alpha_k y_k\,\Phi(x_k)$$

so that
$$w \cdot \Phi(x) = \sum_{k\,:\,\alpha_k > 0}\alpha_k y_k\,(\Phi(x_k)\cdot\Phi(x)) = \sum_{k\,:\,\alpha_k > 0}\alpha_k y_k\,(x_k \cdot x + 1)^5$$

Only Sm operations (S = #support vectors)

Then classify with: f(x, w, b) = sign(w . Phi(x) - b)
QP with Kernel functions
Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}, \quad \text{where } Q_{kl} = y_k y_l\,K(x_k, x_l)$$

Subject to these constraints:
$$0 \le \alpha_k \le C, \qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$

Then define:
$$w = \sum_{k\,:\,\alpha_k > 0}\alpha_k y_k\,\Phi(x_k)$$

Then classify with: f(x, w, b) = sign(K(w, x) - b)

Most important change: x_k . x_l -> K(x_k, x_l)
SVM Kernel Functions

K(a, b) = (a . b + 1)^d is an example of an SVM Kernel Function.
Beyond polynomials there are other very high-dimensional basis functions that can be made practical by finding the right Kernel Function.
Radial-Basis-style Kernel Function:
$$K(a,b) = \exp\!\left(-\frac{(a-b)^2}{2\sigma^2}\right)$$
Sigmoidal function
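The two named kernels can be sketched directly. A minimal illustration (assuming numpy); `d`, `sigma`, and the test vectors are illustrative choices, not values fixed by the lecture.

```python
import numpy as np

# Polynomial kernel from the slide: K(a,b) = (a.b + 1)^d
def poly_kernel(a, b, d=2):
    return (a @ b + 1.0) ** d

# Radial-basis-style kernel: K(a,b) = exp(-||a-b||^2 / (2 sigma^2))
def rbf_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

a = np.array([1.0, 2.0])
b = np.array([0.5, -1.0])
print(poly_kernel(a, b))   # (a.b + 1)^2 with a.b = -1.5, i.e. 0.25
print(rbf_kernel(a, b))    # exp(-||a-b||^2 / 2) with ||a-b||^2 = 9.25
```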
Kernel Tricks

Replacing the dot product with a kernel function
Not all functions are kernel functions
Need to be decomposable: K(a, b) = Phi(a) . Phi(b)
Could K(a, b) = (a - b)^3 be a kernel function?
Could K(a, b) = (a - b)^4 - (a + b)^2 be a kernel function?
Kernel Tricks

Mercer's condition: to expand a kernel function K(x, y) into a dot product, i.e. K(x, y) = Phi(x) . Phi(y), K(x, y) has to be a positive semi-definite function, i.e., for any function f(x) whose $\int f(x)^2\,dx$ is finite, the following inequality holds:

$$\int\!\!\int f(x)\,K(x,y)\,f(y)\,dx\,dy \ge 0$$

Could $K(x,y) = \sum_{i=1}^{p}(x \cdot y)^i$ be a kernel function?
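Mercer's condition can be probed empirically, since a valid kernel must give a positive semi-definite Gram matrix on any finite point set. A hedged sketch (assuming numpy), not a proof: the "bad" function here is a symmetric (a-b)-style distance function chosen for illustration, standing in for the slide's (a - b)^3 example.

```python
import numpy as np

# Build the Gram matrix K_ij = kernel(x_i, x_j) on a point set.
def gram(kernel, X):
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))

K_good = gram(lambda a, b: (a @ b + 1.0) ** 2, X)            # polynomial kernel
K_bad = gram(lambda a, b: np.linalg.norm(a - b) ** 3, X)     # distance-cubed function

min_good = np.linalg.eigvalsh(K_good).min()   # >= 0 up to rounding: PSD
min_bad = np.linalg.eigvalsh(K_bad).min()     # negative: fails Mercer's condition
```

The distance-cubed Gram matrix has zero diagonal and positive off-diagonal entries, so its trace is zero and it must have a negative eigenvalue; no feature map Phi can reproduce it as a dot product.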
Kernel Tricks

Pro:
Introduces nonlinearity into the model
Computationally cheap

Con:
Still has potential overfitting problems
Nonlinear Kernel (I)
Nonlinear Kernel (II)
SVM Performance
Generalization theory
General methodology for many types of problems
Same Program + New Kernel = New method
No problems with local minima
Robust optimization methods.
Successful Applications
SVM Performance
Do SVMs scale to massive datasets?
How to choose C and the kernel?
What is the effect of attribute scaling?
How to handle categorical variables?
How to incorporate domain knowledge?
SVM for multi-class classification

SVMs can only handle two-class outputs.
What can be done?
Answer: with output arity N, learn N SVMs
SVM 1 learns Output==1 vs Output != 1
SVM 2 learns Output==2 vs Output != 2
...
SVM N learns Output==N vs Output != N
Then to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
Other approaches
Pair-wise SVM, Multi-category SVM
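The one-vs-rest recipe above can be sketched as follows. This is a hedged illustration (assuming numpy): a plain hinge-loss subgradient step stands in for a full SVM solver, and the dataset, names, and hyperparameters are all illustrative.

```python
import numpy as np

# Train one binary linear scorer via hinge-loss SGD (stand-in for an SVM).
def train_binary(X, y, epochs=200, lr=0.1, lam=0.01):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) < 1:                 # margin violated: hinge step
                w += lr * (yi * xi - lam * w)
            else:                                 # only the regularizer acts
                w -= lr * lam * w
    return w

# SVM c learns "Output == c" vs "Output != c".
def one_vs_rest(X, labels, n_classes):
    return [train_binary(X, np.where(labels == c, 1.0, -1.0))
            for c in range(n_classes)]

# Predict the class whose score is furthest into the positive region.
def predict(ws, x):
    return int(np.argmax([w @ x for w in ws]))

# Tiny dataset with three well-separated classes
X = np.array([[0., 2.], [0., 3.], [2., 0.], [3., 0.], [-2., -2.], [-3., -2.]])
labels = np.array([0, 0, 1, 1, 2, 2])
ws = one_vs_rest(X, labels, 3)
preds = [predict(ws, x) for x in X]
```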
SVM path: model selection

The Entire Regularization Path for the Support Vector Machine
(Hastie, Rosset, Tibshirani and Zhu)
http://www.jmlr.org/papers/volume5/hastie04a/hastie04a.pdf
An algorithm for computing the two-class SVM solution for all possible values of the regularization parameter C, at essentially the computational cost of a single SVM fit. Not only does this allow for efficient model selection, but it also exposes the role of regularization for SVMs.
Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}, \quad \text{where } Q_{kl} = y_k y_l\,K(x_k, x_l)$$

Subject to these constraints:
$$0 \le \alpha_k \le C, \qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$
SVM for unbalanced data

Original SVM formulation:
$$\min_{w,b}\ \|w\|^2 + c\sum_{j=1}^{N}\xi_j \qquad \text{s.t. } y_i(w \cdot x_i + b) \ge 1 - \xi_i,\ \text{for all } i$$

Weighted formulation:
$$\min_{w,b}\ \|w\|^2 + c_1\!\!\sum_{j\,:\,x_j \in C_1}\!\!\xi_j + c_2\!\!\sum_{j\,:\,x_j \in C_2}\!\!\xi_j \qquad \text{s.t. } y_i(w \cdot x_i + b) \ge 1 - \xi_i,\ \text{for all } i$$

If the first class has a much smaller size than the second class, apply different weights to the two classes: c_1 > c_2.
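The weighted objective can be written out directly. A minimal sketch (assuming numpy): the weights c1 > c2 and the tiny dataset are illustrative, with class +1 playing the minority class whose slacks cost more.

```python
import numpy as np

# Class-weighted soft-margin objective from the slide:
#   ||w||^2 + c1 * sum of minority-class slacks + c2 * sum of majority slacks
def weighted_svm_objective(w, b, X, y, c1=10.0, c2=1.0):
    margins = y * (X @ w + b)
    slack = np.maximum(0.0, 1.0 - margins)   # xi_i = hinge slack
    cost = np.where(y > 0, c1, c2)           # minority (+1) errors cost more
    return w @ w + np.sum(cost * slack)

X = np.array([[1.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0, -1.0])
obj = weighted_svm_objective(np.array([0.5, 0.5]), 0.0, X, y)
print(obj)   # all margins are exactly 1, so only ||w||^2 = 0.5 remains
```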
References
C.J.C. Burges. A tutorial on support vector machines for pattern recognition.
Kristin Bennett. Support Vector Machines: Hype or Hallelujah?
Vladimir Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.

Software
SVM-light, http://svmlight.joachims.org/, free download
Next class

Topics
Multi-class SVM
Semi-supervised clustering
Readings
In Defense of One-Vs-All Classification
Constrained K-means Clustering with Background Knowledge
Semi-supervised Clustering by Seeding
Distance metric learning, with application to clustering with side-information