Support Vector Machines
Lecturer: Yishay Mansour
Scribe: Itay Kirshenbaum
Lecture Overview
In this lecture we present in detail one of the most theoretically well-motivated and practically effective classification algorithms in modern machine learning: Support Vector Machines (SVMs).
Lecture Overview – Cont.
We begin by building the intuition behind SVMs, continue by defining the SVM as an optimization problem, and discuss how to solve it efficiently. We conclude with an analysis of the error rate of SVMs using two techniques: Leave-One-Out and VC-dimension.
Introduction
The Support Vector Machine is a supervised learning algorithm used to learn a hyperplane that solves the binary classification problem, one of the most extensively studied problems in machine learning.
Binary Classification Problem
Input space: $X \subseteq \mathbb{R}^n$
Output space: $Y = \{-1, 1\}$
Training data: $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ drawn i.i.d. from distribution $D$
Goal: select a hypothesis $h \in H$ that best predicts other points drawn i.i.d. from $D$
Binary Classification – Cont.
Consider the problem of predicting the success of a new drug based on a patient's height and weight. $m$ ill people are selected and treated, which generates $m$ two-dimensional vectors (height and weight). Each point is assigned +1 to indicate successful treatment, or -1 otherwise. This can be used as training data.
Binary Classification – Cont.
There are infinitely many ways to classify the data. Occam's razor suggests that simple classification rules provide better results, so we use a linear classifier, i.e., a hyperplane.
Our class of linear classifiers:
$H = \{ h(x) = \mathrm{sign}(w \cdot x + b) \mid w \in \mathbb{R}^n, b \in \mathbb{R} \}$
$h \in H$ maps $x$ to 1 if $w \cdot x + b \ge 0$, and to $-1$ otherwise.
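To make the classifier class concrete, here is a minimal Python sketch (not from the lecture; the helper name `linear_classifier` and the sample numbers are illustrative):

```python
import numpy as np

def linear_classifier(w, b):
    """Return h(x) = sign(w . x + b), mapping each point to {-1, +1}."""
    def h(x):
        return 1 if np.dot(w, x) + b >= 0 else -1
    return h

# Example: an arbitrary hyperplane over (height, weight) points.
h = linear_classifier(w=np.array([0.02, -0.01]), b=-2.0)
print(h(np.array([170.0, 70.0])))  # +1 or -1, depending on the side of the hyperplane
```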
Choosing a Good Hyperplane – Intuition
Consider two cases of positive classification:
$w \cdot x + b = 0.1$
$w \cdot x + b = 100$
We are more confident in the decision made in the latter case than in the former. Therefore, we choose a hyperplane with maximal margin.
Good Hyperplane – Cont.
Definition: The functional margin of a linear classifier $(w, b)$ with respect to $S$ is
$\hat{\gamma} = \min_{i \in \{1, \ldots, m\}} \hat{\gamma}_i$, with $\hat{\gamma}_i = y_i (w \cdot x_i + b)$
where $y_i (w \cdot x_i + b)$ is the classification of $x_i$ according to $(w, b)$.
Maximal Margin
$(w, b)$ can be scaled to increase the functional margin: $\mathrm{sign}(w \cdot x + b) = \mathrm{sign}(5w \cdot x + 5b)$ for all $x$, yet the functional margin of $(5w, 5b)$ is 5 times greater than that of $(w, b)$. We cope by adding an additional constraint: $\|w\| = 1$.
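A quick numerical check of this scaling argument (a sketch, not part of the lecture; the point and hyperplane are made up):

```python
import numpy as np

w, b = np.array([1.0, -2.0]), 0.5
x, y = np.array([3.0, 1.0]), 1

# Scaling (w, b) by 5 leaves the classifier unchanged...
assert np.sign(np.dot(w, x) + b) == np.sign(np.dot(5 * w, x) + 5 * b)
# ...but multiplies the functional margin y * (w . x + b) by 5.
print(y * (np.dot(5 * w, x) + 5 * b) / (y * (np.dot(w, x) + b)))  # 5.0
```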
Maximal Margin – Cont.
Consider the geometric distance between the hyperplane and the closest points.
Geometric Margin
Definition: The geometric margin of $(w, b)$ with respect to $S$ is
$\gamma = \min_{i \in \{1, \ldots, m\}} \gamma_i$, with $\gamma_i = y_i \left( \frac{w}{\|w\|} \cdot x_i + \frac{b}{\|w\|} \right)$
Relation to the functional margin: $\gamma_i = \hat{\gamma}_i / \|w\|$, so both are equal when $\|w\| = 1$.
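Both margins are straightforward to compute; a minimal sketch (helper names are illustrative):

```python
import numpy as np

def functional_margin(w, b, X, y):
    """gamma_hat = min_i y_i (w . x_i + b) over the training set (X, y)."""
    return np.min(y * (X @ w + b))

def geometric_margin(w, b, X, y):
    """gamma = gamma_hat / ||w||; coincides with gamma_hat when ||w|| = 1."""
    return functional_margin(w, b, X, y) / np.linalg.norm(w)
```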
The Algorithm
We saw two definitions of the margin and the intuition behind seeking a margin-maximizing hyperplane. Goal: write an optimization program that finds such a hyperplane. We always look for $(w, b)$ maximizing the margin.
The Algorithm – Take 1
First try:
$\max_{\hat{\gamma}, w, b} \ \hat{\gamma} \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge \hat{\gamma}, \ i = 1, \ldots, m, \quad \|w\| = 1$
Idea: maximize $\hat{\gamma}$ such that for each sample the functional margin is at least $\hat{\gamma}$. Since $\|w\| = 1$, the functional and geometric margins are the same, so this yields the largest possible geometric margin with respect to the training set.
The Algorithm – Take 2
The first try can't be solved by any off-the-shelf optimization software: the constraint $\|w\| = 1$ is non-linear, and in fact even non-convex. How can we discard the constraint? Use the geometric margin!
$\max_{\hat{\gamma}, w, b} \ \frac{\hat{\gamma}}{\|w\|} \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge \hat{\gamma}, \ i = 1, \ldots, m$
The Algorithm – Take 3
We now have a non-convex objective function $\hat{\gamma} / \|w\|$, so the problem remains. Remember that we can scale $(w, b)$ as we wish: force the functional margin to be 1. The objective function becomes
$\max \ \frac{1}{\|w\|}$
which is the same as
$\min \ \frac{1}{2} \|w\|^2$
The factor of $\frac{1}{2}$ and the power of 2 do not change the program; they just make things easier.
The Algorithm – Final Version
The final program:
$\min_{w, b} \ \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1, \ i = 1, \ldots, m$
The objective is convex (quadratic) and all constraints are linear, so we can solve it efficiently using standard quadratic programming (QP) software.
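As an illustration, a minimal sketch of this QP using the cvxpy modeling library (an assumption of this sketch, not something the lecture prescribes; the toy data is made up):

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, n = X.shape

w = cp.Variable(n)
b = cp.Variable()
# min (1/2)||w||^2  s.t.  y_i (w . x_i + b) >= 1 for all i
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                  [cp.multiply(y, X @ w + b) >= 1])
prob.solve()
print(w.value, b.value)
```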
Convex Optimization
We want to solve the optimization problem more efficiently than with generic QP. Solution: use convex optimization techniques.
Convex Optimization – Cont.
Definition: A function $f$ is convex if for all $x, y \in X$ and $\lambda \in [0, 1]$:
$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$
Theorem: Let $f$ be a differentiable convex function. Then for all $x, y \in X$:
$f(y) \ge f(x) + \nabla f(x) \cdot (y - x)$
Convex Optimization Problem
Let $f, g_1, \ldots, g_m : X \to \mathbb{R}$ be convex functions. We look for a value of $x \in X$ that minimizes $f(x)$ under the constraints $g_i(x) \le 0$, $i = 1, \ldots, m$:
$\min_{x \in X} f(x) \quad \text{s.t.} \quad g_i(x) \le 0, \ i = 1, \ldots, m$
Lagrange Multipliers
Lagrange multipliers are used to find a maximum or a minimum of a function subject to constraints; we use them to solve our optimization problem.
Definition: The Lagrangian of a function $f$ subject to the constraints $g_i(x) \le 0$, $i = 1, \ldots, m$, is
$L(x, \alpha) = f(x) + \sum_{i=1}^m \alpha_i g_i(x), \quad \alpha_i \ge 0, \ x \in X$
The $\alpha_i$ are called the Lagrange multipliers.
Primal Program
Plan: use the Lagrangian to write a program called the Primal Program, which is equal to $f(x)$ if all the constraints are met, and $\infty$ otherwise.
Definition – Primal Program:
$\theta_P(x) = \max_{\alpha \ge 0} L(x, \alpha)$
Primal Program – Cont.
The constraints are of the form $g_i(x) \le 0$. If they are all met, $\sum_{i=1}^m \alpha_i g_i(x)$ is maximized when all the $\alpha_i$ are 0, and the summation is 0, so $\theta_P(x) = f(x)$. Otherwise, some $g_i(x) > 0$ and $\sum_{i=1}^m \alpha_i g_i(x)$ is maximized by taking $\alpha_i \to \infty$, so $\theta_P(x) = \infty$.
Primal Program – Cont.
Our convex optimization problem is now:
$\min_{x \in X} \theta_P(x) = \min_{x \in X} \max_{\alpha \ge 0} L(x, \alpha)$
Define $p^* = \min_{x \in X} \theta_P(x)$ as the value of the primal program.
Dual Program
We define the Dual Program as:
$\theta_D(\alpha) = \min_{x \in X} L(x, \alpha)$
We'll look at
$\max_{\alpha \ge 0} \theta_D(\alpha) = \max_{\alpha \ge 0} \min_{x \in X} L(x, \alpha)$
This is the same as our primal program, except that the order of min/max is different. Define
$d^* = \max_{\alpha \ge 0} \min_{x \in X} L(x, \alpha)$
as the value of our Dual Program.
Dual Program – Cont.
We want to show $d^* = p^*$: if we find a solution to one problem, we find the solution to the second problem. Start with $d^* \le p^*$; "max min" is always less than or equal to "min max":
$d^* = \max_{\alpha \ge 0} \min_{x \in X} L(x, \alpha) \le \min_{x \in X} \max_{\alpha \ge 0} L(x, \alpha) = p^*$
Now on to $p^* \le d^*$.
Dual Program – Cont.
Claim: If there exist $x^*$ and $\alpha^* \ge 0$ which are a saddle point, i.e., for all $x \in X$ and $\alpha \ge 0$
$L(x^*, \alpha) \le L(x^*, \alpha^*) \le L(x, \alpha^*)$,
and $x^*$ is feasible, then $p^* = d^*$ and $x^*$ is a solution to $\theta_P(x)$.
Proof:
$p^* = \inf_{x \in X} \sup_{\alpha \ge 0} L(x, \alpha) \le \sup_{\alpha \ge 0} L(x^*, \alpha) = L(x^*, \alpha^*) = \inf_{x \in X} L(x, \alpha^*) \le \sup_{\alpha \ge 0} \inf_{x \in X} L(x, \alpha) = d^*$
Conclude: $p^* \le d^*$, and together with $d^* \le p^*$ we get $d^* = p^*$.
Karush-Kuhn-Tucker (KKT) Conditions
The KKT conditions derive a characterization of an optimal solution to a convex problem.
Theorem: Assume that $f$ and $g_i$, $i = 1, \ldots, m$, are differentiable and convex. Then $x^*$ is a solution to the optimization problem if and only if there exists $\alpha^* \ge 0$ s.t.:
1. $\nabla_x L(x^*, \alpha^*) = \nabla_x f(x^*) + \sum_{i=1}^m \alpha_i^* \nabla_x g_i(x^*) = 0$
2. $\nabla_\alpha L(x^*, \alpha^*) = g(x^*) \le 0$ (feasibility)
3. $\alpha_i^* g_i(x^*) = 0, \ i = 1, \ldots, m$ (complementary slackness)
KKT Conditions – Cont.
Proof: For every feasible $x$:
$f(x) - f(x^*) \ge \nabla_x f(x^*) \cdot (x - x^*)$ (convexity of $f$)
$= -\sum_{i=1}^m \alpha_i^* \nabla_x g_i(x^*) \cdot (x - x^*)$ (condition 1)
$\ge -\sum_{i=1}^m \alpha_i^* [g_i(x) - g_i(x^*)]$ (convexity of $g_i$)
$= -\sum_{i=1}^m \alpha_i^* g_i(x) \ge 0$ (conditions 3 and 2)
so $x^*$ is a minimum. The other direction holds as well.
KKT Conditions – Cont.
Example: Consider the following optimization problem:
$\min \ \frac{1}{2} x^2 \quad \text{s.t.} \quad x \ge 2$
We have $f(x) = \frac{1}{2} x^2$ and $g_1(x) = 2 - x$. The Lagrangian will be
$L(x, \alpha) = \frac{1}{2} x^2 + \alpha (2 - x)$
Setting the derivatives to zero:
$\nabla_x L(x, \alpha) = x - \alpha = 0 \Rightarrow x^* = \alpha^*$
$\nabla_\alpha L(x, \alpha) = 2 - x = 0 \Rightarrow x^* = 2$
so $x^* = \alpha^* = 2 \ge 0$, and the minimum value is $\frac{1}{2} (2)^2 = 2$.
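A tiny numeric check of the three KKT conditions at this solution (a sketch, not part of the lecture):

```python
# KKT check for min (1/2)x^2 s.t. x >= 2, at x* = alpha* = 2.
x_star, alpha_star = 2.0, 2.0
print(x_star - alpha_star)          # condition 1: dL/dx = x - alpha = 0
print(2.0 - x_star)                 # condition 2: g(x*) = 2 - x* <= 0 (tight here)
print(alpha_star * (2.0 - x_star))  # condition 3: alpha* g(x*) = 0
```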
Optimal Margin Classifier
Back to SVM. Rewrite our optimization program:
$\min_{w, b} \ \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1, \ i = 1, \ldots, m$
$g_i(w, b) = 1 - y_i (w \cdot x_i + b) \le 0$
Following the KKT conditions, $\alpha_i > 0$ only for points in the training set with a margin of exactly 1. These are the support vectors of the training set.
Optimal Margin – Cont.
[Figure: an optimal margin classifier and its support vectors]
Optimal Margin – Cont.
Construct the Lagrangian:
$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^m \alpha_i [y_i (w \cdot x_i + b) - 1]$
To find the dual form $L_D(w, b, \alpha)$, first minimize over $w$ and $b$ by setting the derivatives to zero:
$\nabla_w L(w, b, \alpha) = w - \sum_{i=1}^m \alpha_i y_i x_i = 0 \Rightarrow w^* = \sum_{i=1}^m \alpha_i y_i x_i$
Optimal Margin – Cont.
Take the derivative with respect to $b$:
$\frac{\partial}{\partial b} L(w, b, \alpha) = \sum_{i=1}^m \alpha_i y_i = 0$
Use $w^*$ in the Lagrangian:
$L(w^*, b, \alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m y_i y_j \alpha_i \alpha_j (x_i \cdot x_j) - b \sum_{i=1}^m \alpha_i y_i$
We saw the last term is zero, so
$L(w^*, b, \alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m y_i y_j \alpha_i \alpha_j (x_i \cdot x_j) \equiv W(\alpha)$
Optimal Margin – Cont.
The dual optimization problem:
$\max_\alpha \ W(\alpha) \quad \text{s.t.} \quad \alpha_i \ge 0, \ i = 1, \ldots, m, \quad \sum_{i=1}^m \alpha_i y_i = 0$
The KKT conditions hold, so we can solve the primal by finding $\alpha^*$ that maximizes $W(\alpha)$. Assuming we have $\alpha^*$, define
$w^* = \sum_{i=1}^m \alpha_i^* y_i x_i$
the solution to the primal problem.
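Continuing the cvxpy sketch from the primal slide, the dual can be solved the same way. Note that $W(\alpha) = \sum_i \alpha_i - \frac{1}{2}\|\sum_i \alpha_i y_i x_i\|^2$, which keeps the objective recognizably concave for the solver (toy data as before; cvxpy is an assumption of the sketch):

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

G = y[:, None] * X                  # row i is y_i x_i
alpha = cp.Variable(m)
# W(alpha) = sum_i alpha_i - (1/2) || sum_i alpha_i y_i x_i ||^2
W = cp.sum(alpha) - 0.5 * cp.sum_squares(G.T @ alpha)
prob = cp.Problem(cp.Maximize(W), [alpha >= 0, alpha @ y == 0])
prob.solve()

w_star = (alpha.value * y) @ X      # w* = sum_i alpha_i* y_i x_i
print(alpha.value, w_star)
```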
Optimal Margin – Cont.
We still need to find $b^*$. Assume $x_i$ is a support vector. We get:
$y_i (w^* \cdot x_i + b^*) = 1$
$w^* \cdot x_i + b^* = \frac{1}{y_i} = y_i$
$b^* = y_i - w^* \cdot x_i$
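Continuing the dual sketch above, $b^*$ can be recovered from any support vector; the tolerance `1e-6` is an arbitrary numerical threshold:

```python
# Pick a support vector: any point with alpha_i* > 0 (up to numerical tolerance).
sv = np.where(alpha.value > 1e-6)[0][0]
b_star = y[sv] - w_star @ X[sv]     # b* = y_i - w* . x_i
print(b_star)
```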
Error Analysis Using Leave-One-Out
The Leave-One-Out (LOO) method: remove one point at a time from the training set, calculate an SVM for the remaining points, and test the result using the removed point.
Definition: The indicator function $I(\text{exp})$ is 1 if exp is true, otherwise 0.
$\hat{R}_{LOO} = \frac{1}{m} \sum_{i=1}^m I\left(h_{S - \{x_i\}}(x_i) \ne y_i\right)$
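A minimal sketch of the LOO estimate (the `train_svm` helper is assumed, e.g. a wrapper around the QP above, returning a classifier $h$ with values in $\{-1, +1\}$):

```python
import numpy as np

def loo_error(train_svm, X, y):
    """R_hat_LOO = (1/m) sum_i I(h_{S - {x_i}}(x_i) != y_i)."""
    m = len(y)
    mistakes = 0
    for i in range(m):
        mask = np.arange(m) != i          # leave point i out
        h = train_svm(X[mask], y[mask])   # train on the remaining m - 1 points
        mistakes += int(h(X[i]) != y[i])  # test on the held-out point
    return mistakes / m
```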
LOO Error Analysis – Cont.
Expected error:
$E_{S \sim D^m}[\hat{R}_{LOO}] = \frac{1}{m} \sum_{i=1}^m E_{S}\left[I\left(h_{S - \{x_i\}}(x_i) \ne y_i\right)\right] = E_{S' \sim D^{m-1}, (x, y) \sim D}\left[I\left(h_{S'}(x) \ne y\right)\right] = E_{S' \sim D^{m-1}}[error(h_{S'})]$
It follows that the expected error of LOO for a training set of size $m$ is the same as the expected generalization error for a training set of size $m - 1$.
LOO Error Analysis – Cont.
Theorem:
$E_{S \sim D^m}[error(h_S)] \le E_{S' \sim D^{m+1}}\left[\frac{N_{SV}(S')}{m + 1}\right]$
where $N_{SV}(S)$ is the number of support vectors in $S$.
Proof: If LOO classifies a point incorrectly, that point must be a support vector (removing a non-support vector does not change the solution). Hence:
$\hat{R}_{LOO} \le \frac{N_{SV}(S)}{m}$
and the theorem follows from the expectation identity on the previous slide.
Generalization Bounds Using VC-dimension
Theorem: Let $S = \{x_1, \ldots, x_d\}$ with $\|x_i\| \le R$ for all $i$. Let $d$ be the VC-dimension of the set of hyperplanes with margin $\gamma$:
$\left\{ x \mapsto \mathrm{sign}(w \cdot x) \ : \ \min_{x \in S} |w \cdot x| \ge \gamma, \ \|w\| = 1 \right\}$
Then $d \le R^2 / \gamma^2$.
Proof: Assume that the set $x_1, \ldots, x_d$ is shattered. So for every $y \in \{-1, 1\}^d$ there is a $w$ with $\|w\| = 1$ such that $y_i (w \cdot x_i) \ge \gamma$ for $i = 1, \ldots, d$. Summing over $i = 1, \ldots, d$:
$\gamma d \le w \cdot \sum_{i=1}^d y_i x_i \le \|w\| \left\| \sum_{i=1}^d y_i x_i \right\| = \left\| \sum_{i=1}^d y_i x_i \right\|$
Generalization Bounds Using VC-dimension – Cont.
Proof – Cont.
Averaging over the $y$'s with uniform distribution:
$\gamma^2 d^2 \le E_y\left[\left\| \sum_{i=1}^d y_i x_i \right\|^2\right] = \sum_{i, j} E_y[y_i y_j] \, (x_i \cdot x_j)$
Since $E_y[y_i y_j] = 0$ when $i \ne j$ and $E_y[y_i y_j] = 1$ when $i = j$, we can conclude that:
$E_y\left[\left\| \sum_{i=1}^d y_i x_i \right\|^2\right] = \sum_{i=1}^d \|x_i\|^2 \le d R^2$
Therefore $\gamma^2 d^2 \le d R^2$, and so $d \le R^2 / \gamma^2$.