SVM Lecture 2
Transcript of SVM Lecture 2
Support Vector Machine (SVM)
http://www.svms.org/tutorials/Hearst-etal1998.pdf
http://www.cs.cornell.edu/courses/cs578/2003fa/slides_sigir03_tutorial-modified.v3.pdf
Which Hyperplane?
[Figure: positive (+) and negative (−) training points with a candidate separating line g(x)]
If wᵀ = (3, 4) and w₀ = −10, then g(x) = wᵀx + w₀.
[Figure: the points (2,1), (0,2.5), (3,1), (2,2), (1,1) around the line g(x) = 0]
If g(x) > 0 then f(x) = 1; if g(x) ≤ 0 then f(x) = −1.
g(x) = (3, 4)ᵀx − 10
g(0, 5/2) = (3, 4)·(0, 5/2) − 10 = 0
If wᵀ = (3, 4) and w₀ = −10, then g(x) = wᵀx + w₀.
[Figure: the same points with the hyperplane g(x) = (3, 4)ᵀx − 10 = 0]
g(2, 1) = (3, 4)·(2, 1) − 10 = 0
g(2, 2) = (3, 4)·(2, 2) − 10 = 4 > 0
g(3, 1) = (3, 4)·(3, 1) − 10 = 3 > 0
g(1, 1) = (3, 4)·(1, 1) − 10 = −3 ≤ 0
If g(x) > 0 then f(x) = 1; if g(x) ≤ 0 then f(x) = −1.
f(2, 2) = 1, f(3, 1) = 1, f(1, 1) = −1, so f(2, 1) = f(0, 5/2) = −1.
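As a quick check (my own sketch, not part of the lecture), the rule above can be evaluated in a few lines of Python; the vector w, the offset w₀, and the test points all come from the slides:

```python
import numpy as np

# Hyperplane from the slides: g(x) = w·x + w0 with w = (3, 4), w0 = -10
w, w0 = np.array([3.0, 4.0]), -10.0

def g(x):
    return w @ np.asarray(x, dtype=float) + w0

def f(x):
    # Decision rule: f(x) = 1 if g(x) > 0, else -1
    return 1 if g(x) > 0 else -1

for p in [(2, 1), (0, 2.5), (3, 1), (2, 2), (1, 1)]:
    print(p, g(p), f(p))
# g: 0, 0, 3, 4, -3  ->  f: -1, -1, 1, 1, -1
```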
[Figure: two different candidate lines g(x), each separating the same + and − points]
Which line (hyperplane) to choose?
[Figure: a separating hyperplane with the margin shown on both sides]
Maximal Margin Hyperplane
How do we compute the distance from a point x to the hyperplane?
Write x = xₚ + r(w/‖w‖), where xₚ is the normal projection of x onto the hyperplane and r is the signed distance.
Then g(x) = wᵀx + w₀ = wᵀ(xₚ + r(w/‖w‖)) + w₀ = r(wᵀw)/‖w‖ = r‖w‖,
observing that wᵀxₚ + w₀ = 0 and wᵀw = ‖w‖².
Thus r = g(x)/‖w‖.
[Figure: points x₁, x₂, x₃ around the hyperplane {x : g(x) = wᵀx + w₀ = 0}, with the normal vector w and the projection xₚ of a point x at distance r]
wᵀ = (3, 4), w₀ = −10, g(x) = wᵀx + w₀
[Figure: the points (3,1), (2,2), (1,1), and (1, 0.5) around the line g(x) = 0]
Distance formula: r = g(x)/‖w‖
g(2, 2)/‖(3, 4)‖ = ((3, 4)·(2, 2) − 10)/5 = 4/5
g(1, 0.5)/‖(3, 4)‖ = ((3, 4)·(1, 0.5) − 10)/5 = −1
g(3, 1)/‖(3, 4)‖ = ((3, 4)·(3, 1) − 10)/5 = 3/5
g(1, 1)/‖(3, 4)‖ = ((3, 4)·(1, 1) − 10)/5 = −3/5
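The same distances can be checked numerically (a sketch of mine, assuming only the w, w₀, and points given above):

```python
import numpy as np

w, w0 = np.array([3.0, 4.0]), -10.0   # ||w|| = 5

def signed_distance(x):
    # r = g(x) / ||w||
    return (w @ np.asarray(x, dtype=float) + w0) / np.linalg.norm(w)

for p in [(2, 2), (1, 0.5), (3, 1), (1, 1)]:
    print(p, signed_distance(p))
# 0.8, -1.0, 0.6, -0.6   i.e. 4/5, -1, 3/5, -3/5
```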
[Figure: the same points and the same line, now written as g′(x) = 0]
Distance formula: r = g′(x)/‖w′‖
g′(2, 2)/‖(1, 4/3)‖ = ((1, 4/3)·(2, 2) − 10/3)/(5/3) = 4/5
g′(1, 0.5)/‖(1, 4/3)‖ = ((1, 4/3)·(1, 0.5) − 10/3)/(5/3) = −1
g′(3, 1)/‖(1, 4/3)‖ = ((1, 4/3)·(3, 1) − 10/3)/(5/3) = 3/5
g′(1, 1)/‖(1, 4/3)‖ = ((1, 4/3)·(1, 1) − 10/3)/(5/3) = −3/5
g′(x) = w′ᵀx + w₀′ = (1/3)g(x), where w′ = (1/3)w = (1, 4/3)ᵀ and w₀′ = −10/3.
The geometric distances are unchanged: rescaling (w, w₀) rescales g(x) and ‖w‖ by the same factor.
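The point of the rescaled g′ is that geometric distances do not change; a small check (mine, not the lecture's):

```python
import numpy as np

# The same hyperplane written two ways: g with (w, w0) and g' = (1/3) g
w,  w0  = np.array([3.0, 4.0]), -10.0
wp, wp0 = np.array([1.0, 4/3]), -10/3

def r(x, w, w0):
    # Signed distance g(x) / ||w||
    return (w @ np.asarray(x, dtype=float) + w0) / np.linalg.norm(w)

for p in [(2, 2), (1, 0.5), (3, 1), (1, 1)]:
    print(r(p, w, w0), r(p, wp, wp0))   # the two columns are identical
```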
[Figure: + and − training points in the plane]
We want to classify points in space. Which hyperplane does SVM choose?
[Figure: several candidate separating lines g(x) for the same points]
Maximal Margin Hyperplane
The margin is the geometric distance from the hyperplane to the closest training examples, the support vectors.
[Figure: the maximal margin hyperplane with a support vector on each side and the margin marked]
For support vectors x₁ and x₂ on opposite sides: y₁g(x₁)/‖w‖ = y₂g(x₂)/‖w‖.
[Figure: the points (2,1), (0,2.5), (3,1), (2,2), (1,1) around the hyperplane g(x) = 0]
The hyperplane is defined by all the points which satisfy g(x) = (3, 4)·x − 10 = 0,
e.g. g(2, 1) = (3, 4)·(2, 1) − 10 = 0 and g(0, 2.5) = 0.
All the points above the line are positive: g(x) = (3, 4)·x − 10 > 0,
e.g. g(2, 2) = (3, 4)·(2, 2) − 10 = 4 and g(3, 1) = (3, 4)·(3, 1) − 10 = 3.
All the points below the line are negative: g(x) = (3, 4)·x − 10 < 0,
e.g. g(1, 1) = (3, 4)·(1, 1) − 10 = −3.
We use the hyperplane to classify a point x: f(x) = 1 if w·x + w₀ > 0; f(x) = −1 if w·x + w₀ ≤ 0.
Notice that for any hyperplane we have an infinite number of formulas that describe it!
If (3, 4)·x − 10 = 0, so does (1/3)((3, 4)·x − 10) = 0, so does 23((3, 4)·x − 10) = 0, so does 0.9876((3, 4)·x − 10) = 0.
The canonical hyperplane for a set of training examples is the scaling for which y⁽ⁱ⁾(w′ᵀx⁽ⁱ⁾ + w₀′) ≥ 1 for all i, with equality for the closest points (the functional margin is 1).
S = {⟨(1, 1), −1⟩, ⟨(2, 2), 1⟩, ⟨(1, 1/2), −1⟩, ⟨(3, 2), 1⟩, ...}
g′(x) = (1, 4/3)·x − 10/3
[Figure: the points of S around the canonical hyperplane g′(x) = 0, with the margin lines at functional margin ±1]
Checking the canonical constraints:
1·((1, 4/3)·(2, 2) − 10/3) ≥ 1
−1·((1, 4/3)·(1, 1/2) − 10/3) ≥ 1
g′(3, 1) = (1, 4/3)·(3, 1) − 10/3 = 1
g′(1, 1) = (1, 4/3)·(1, 1) − 10/3 = −1
For a canonical hyperplane w.r.t. a fixed set of training examples S, the margin is computed by
ρ = |g(x₊)|/‖w‖ = |g(x₋)|/‖w‖ = 1/‖w‖.
Remember the distance from a point x to the hyperplane is r = g(x)/‖w‖, where g(x) = wᵀx + w₀.
[Figure: the points (3,1), (2,2), (1,1), (1, 0.5), (3,2) around the canonical hyperplane, with margin 3/5 marked on each side]
g(x) = (1, 4/3)ᵀx − 10/3
ρ = 1/‖(1, 4/3)‖ = 1/√(1 + 16/9) = 3/5
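A short check of the canonical constraints and the margin (my sketch; the "..." examples in S are unknown, so only the four listed pairs are used):

```python
import numpy as np

# Canonical hyperplane from the slides: w = (1, 4/3), w0 = -10/3
w, w0 = np.array([1.0, 4/3]), -10/3

# Listed training examples <x, y> from S
S = [((1, 1), -1), ((2, 2), 1), ((1, 0.5), -1), ((3, 2), 1)]

# Canonical constraint: y * (w·x + w0) >= 1 for every training example
for x, y in S:
    print(x, y * (w @ np.asarray(x, dtype=float) + w0) >= 1)   # all True

# Margin of the canonical hyperplane: rho = 1 / ||w||
print(1 / np.linalg.norm(w))   # 0.6, i.e. 3/5
```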
Distance of x to the hyperplane is r = g(x)/‖w‖ = ±1/‖w‖, assuming a canonical hyperplane (and x a support vector).
For the canonical hyperplane the margin is 1/‖w‖.
To find the maximal canonical hyperplane, the goal is to minimize ‖w‖.
For a set of training examples, we can find the maximum margin hyperplane in polynomial time!
To do this, we reformulate the problem as an optimization problem.
There is an algorithm to solve an optimization problem if it has this form:
min: f(w)
Subject to: gᵢ(w) ≤ 0 ∀i, and hᵢ(w) = 0 ∀i
where f is convex, each gᵢ is convex, and each hᵢ is affine. We can use the standard techniques to find the optimum.
Finding the largest geometric margin, by finding the g(x) which achieves it:
max: ρ
Subject to: y⁽ⁱ⁾(wᵀx⁽ⁱ⁾ + w₀)/‖w‖ ≥ ρ ∀i
Finding the largest geometric margin, by finding the g(x) which achieves it:
min: ‖w‖
Subject to: y⁽ⁱ⁾(wᵀx⁽ⁱ⁾ + w₀) ≥ 1 ∀i
Finding the largest margin, by finding the g(x) which achieves it:
min: (1/2)‖w‖²
Subject to: y⁽ⁱ⁾(wᵀx⁽ⁱ⁾ + w₀) ≥ 1 ∀i
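This QP can be handed to any constrained solver. Below is a minimal sketch using scipy's general-purpose SLSQP routine (my addition, not the lecture's method; a dedicated QP solver would normally be used). It optimizes over the four listed points of S only, since the "..." examples are unknown, so the solution need not match the hyperplane used elsewhere in the slides:

```python
import numpy as np
from scipy.optimize import minimize

# min (1/2)||w||^2   s.t.   y_i (w·x_i + w0) >= 1 for all i
X = np.array([[1, 1], [2, 2], [1, 0.5], [3, 2]], dtype=float)
y = np.array([-1, 1, -1, 1], dtype=float)

def objective(theta):                  # theta = (w1, w2, w0)
    return 0.5 * theta[:2] @ theta[:2]

constraints = [{"type": "ineq",
                "fun": lambda th, i=i: y[i] * (X[i] @ th[:2] + th[2]) - 1}
               for i in range(len(X))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
print(res.x)   # approx (1, 1, -3); support vectors here are (1,1) and (2,2)
```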
Solving this constrained quadratic optimization requires that the Karush-Kuhn-Tucker (KKT) conditions are met.
The KKT conditions imply that vᵢ is non-zero only if x⁽ⁱ⁾ is a support vector!
The hypothesis for the set of training examples S:
S = {⟨(1, 1), −1⟩, ⟨(2, 2), 1⟩, ⟨(1, 1/2), −1⟩, ⟨(3, 2), 1⟩, ...}
[Figure: the hyperplane g(x) = wᵀx + b = 0 with margin lines g(x) = wᵀx + b = ±1; the support vectors (2, 7/4), (3, 1), and (1, 1) lie on the margin lines]
g(x) = v₁(2, 7/4)·x + v₂(3, 1)·x + v₃(1, 1)·x + w₀
Note that only the support vectors are in the hypothesis.
III. Non-Linearly Separable Data
[Figure: negative points such as (0,0), (0.5,0), (−0.25,0.25), (0.5,−0.5), (−0.75,−0.25) near the origin, surrounded by positive points such as (−1,−1), (1,1), (0,1.25), (−1.25,0), (−1,1), (1,−0.75); no single line g(x) = wᵀx + w₀ separates the two classes]
Linearly separable?
Transform the feature space: φ: x → φ(x), with φ(x) = [x₁², x₂²]
g(x) = wᵀφ(x) + w₀, with w = (1, 1) and w₀ = −1
[Figure: the same points and their images under φ: (0,0)→(0,0), (0.5,−0.5)→(0.25,0.25), (−0.75,−0.25)→(0.5625,0.0625) stay on the negative side; (1,−0.75)→(1,0.5625), (−1,1)→(1,1), (−1.25,0)→(1.5625,0), (0,1.25)→(0,1.5625) land on the positive side; in the transformed space a line separates the classes]
Linearly separable?
Transform the feature space: φ: x → φ(x), with φ(x) = [x₁², x₂²]
g(x) = wᵀφ(x) + w₀, with w = (1, 1) and w₀ = −1
g(0.5, −0.5) = (1, 1)·(0.25, 0.25) − 1 ≤ 0
g(1, 1) = (1, 1)·(1, 1) − 1 > 0
g(−1, −1) = (1, 1)·(1, 1) − 1 > 0
[Figure: the transformed points again; a line in the transformed space separates them]
Linearly separable?
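The transformed classifier is easy to test in code (a sketch of mine; φ, w, w₀, and the points come from the slides and the figure):

```python
import numpy as np

def phi(x):
    # Feature map from the slides: phi(x) = [x1^2, x2^2]
    x1, x2 = x
    return np.array([x1**2, x2**2])

w, w0 = np.array([1.0, 1.0]), -1.0

def f(x):
    return 1 if w @ phi(x) + w0 > 0 else -1

for p in [(0.5, -0.5), (0, 0), (-0.75, -0.25),       # inner points -> -1
          (1, 1), (-1, -1), (-1.25, 0), (0, 1.25)]:  # outer points -> +1
    print(p, f(p))
```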
[Figure: one-dimensional data on the axis marked 0, 1, 2, 3: + points between roughly 0 and 1, with − points on either side]
Linearly separable?
Yes, by transforming the feature space! φ(x) = [x, x²]
g(φ(x)) = (1, −3)·φ(x) + 2
f(x) = 1 if g(φ(x)) = (1, −3)·φ(x) + 2 > 0
f(x) = −1 if g(φ(x)) = (1, −3)·φ(x) + 2 < 0
Linearly separable? Yes, by transforming the feature space!
φ(x) = [x, x²], g(φ(x)) = (1, −3)·φ(x) + 2
g(φ(1/2)) = g(1/2, 1/4) = (1, −3)·(1/2, 1/4) + 2 = 7/4 > 0
g(φ(3/2)) = g(3/2, 9/4) = (1, −3)·(3/2, 9/4) + 2 = −13/4 < 0
g(φ(2)) = g(2, 4) = (1, −3)·(2, 4) + 2 = −8 < 0
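The same evaluations in code (my sketch; the map φ, the weights, and the test points are taken from the slides):

```python
import numpy as np

def phi(x):
    # 1-D feature map from the slides: phi(x) = [x, x^2]
    return np.array([x, x**2])

a, b = np.array([1.0, -3.0]), 2.0     # g(phi(x)) = (1, -3)·phi(x) + 2

for x in [0.5, 1.5, 2.0]:
    gx = a @ phi(x) + b
    print(x, gx, 1 if gx > 0 else -1)
# 0.5 -> 1.75 (+1),  1.5 -> -3.25 (-1),  2.0 -> -8.0 (-1)
```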
[Figure: the 1-D points mapped onto the parabola by φ(x) = [x, x²], e.g. (3/4, 9/16), (5/4, 25/16), (3/2, 9/4), (7/4, 49/16)]
These points become linearly separable by transforming the feature space using g(φ(x)) = (1, −3)·φ(x) + 2.
The points are now linearly separable.
Kernel Function
Transform the feature space: map x to φ(x).
K(x, z) = φ(x)·φ(z)
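The value of a kernel is that K(x, z) can often be computed without forming φ(x) at all. A standard illustration (my example, not from the slides): for φ(x) = [x₁², √2·x₁x₂, x₂²], the feature-space dot product equals the polynomial kernel K(x, z) = (x·z)²:

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def K(x, z):
    # Same quantity computed directly in the input space
    return float(np.dot(x, z)) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(z))   # 1.0
print(K(x, z))           # 1.0 -- identical, without ever computing phi
```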
IV. Non-Separable Data
What if the data is not linearly separable because of only a few points?
[Figure: two classes that a line would separate well except for a single stray point x deep inside the opposite class]
What if a small number of points prevents the margin from being large?
[Figure: two classes where one point x near the boundary forces a much smaller margin than the rest of the data would allow]
[Figure: two soft-margin solutions for the same data: one with λ large and one with λ small]
What if λ = ∞?
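A minimal soft-margin experiment (my sketch, using scikit-learn; the assumption here is that sklearn's penalty parameter C plays the role of the lecture's λ, with C → ∞ approaching the hard margin):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
X[0] = [2.0, 2.0]           # one mislabeled point deep inside the + class

for C in [0.01, 1.0, 1e6]:  # small, medium, and "almost infinite" penalty
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:g}  margin={1 / np.linalg.norm(w):.3f}  "
          f"support vectors={len(clf.support_)}")
```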