Support Vector Machine (SVM) and Kernel Methods
Transcript of Support Vector Machine (SVM) and Kernel Methods
CE-717: Machine Learning, Sharif University of Technology
Fall 2015
Soleymani
Outline
Margin concept
Hard-Margin SVM
Soft-Margin SVM
Dual Problems of Hard-Margin SVM and Soft-Margin SVM
Nonlinear SVM
Kernel trick
Kernel methods
Margin

Which line is better to select as the boundary to provide more generalization capability?
The margin for a hyperplane that separates samples of two linearly separable classes is:
the smallest distance between the decision boundary and any of the training samples.
A larger margin provides better generalization to unseen data.
Maximum margin

SVM finds the solution with maximum margin.
Solution: a hyperplane that is farthest from all training samples.
The hyperplane with the largest margin has equal distances to the nearest sample of both classes.
Hard-margin SVM: Optimization problem

max_{w, w0, b}  2b / ‖w‖
s.t.  w^T x^(i) + w0 ≥ b    ∀ x^(i) ∈ C1 (y^(i) = 1)
      w^T x^(i) + w0 ≤ −b   ∀ x^(i) ∈ C2 (y^(i) = −1)

Decision boundary: w^T x + w0 = 0; margin hyperplanes: w^T x + w0 = b and w^T x + w0 = −b.
Margin: 2b / ‖w‖
Hard-margin SVM: Optimization problem

max_{w, w0, b}  2b / ‖w‖
s.t.  y^(i) (w^T x^(i) + w0) ≥ b    i = 1, …, N

where b = min_{i=1,…,N} y^(i) (w^T x^(i) + w0).
Hard-margin SVM: Optimization problem

We can set w′ = w / b and w0′ = w0 / b:

max_{w′, w0′}  2 / ‖w′‖
s.t.  y^(i) (w′^T x^(i) + w0′) ≥ 1    i = 1, …, N

The boundary w′^T x + w0′ = 0 and the margin lines w′^T x + w0′ = ±1 do not change place under this rescaling.
Hard-margin SVM: Optimization problem

We can equivalently optimize:

min_{w, w0}  (1/2) ‖w‖²
s.t.  y^(i) (w^T x^(i) + w0) ≥ 1    i = 1, …, N

It is a convex Quadratic Programming (QP) problem:
There are computationally efficient packages to solve it.
It has a global minimum (if any).
When the training samples are not linearly separable, it has no solution.
How can we extend it to find a solution even when the classes are not linearly separable?
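Since the primal is a small convex QP, it can be handed to a general-purpose constrained solver. Below is a minimal sketch, assuming NumPy and SciPy are available and using a hypothetical toy data set (not from the lecture): it minimizes (1/2)‖w‖² subject to the N margin constraints with SLSQP.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative only)
X = np.array([[2.0, 2.0], [3.0, 1.0], [2.0, 3.0],
              [-2.0, -2.0], [-3.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

# Variables p = [w1, w2, w0]; objective (1/2)||w||^2
def objective(p):
    w = p[:2]
    return 0.5 * (w @ w)

# One inequality constraint per sample: y_i (w^T x_i + w0) - 1 >= 0
constraints = [{"type": "ineq",
                "fun": lambda p, i=i: y[i] * (X[i] @ p[:2] + p[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints=constraints)
w, w0 = res.x[:2], res.x[2]

# Every training sample should satisfy the margin constraint,
# and the smallest margin is ~1 at the optimum (active constraints)
margins = y * (X @ w + w0)
print(margins.min())
```

Any QP package (cvxopt, quadprog, or an SVM library) would do the same job; SLSQP is used here only because it handles generic inequality constraints out of the box.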
Beyond linear separability

How can we extend the hard-margin SVM to allow classification error?
Overlapping classes that can be approximately separated by a linear boundary
Noise in the linearly separable classes
Beyond linear separability: Soft-margin SVM

Minimizing the number of misclassified points? NP-complete.
Soft margin: maximize a margin while trying to minimize the distance between misclassified points and their correct margin plane.
Soft-margin SVM

SVM with slack variables: allows samples to fall within the margin, but penalizes them.

min_{w, w0, {ξ_i}}  (1/2) ‖w‖² + C Σ_{i=1}^{N} ξ_i
s.t.  y^(i) (w^T x^(i) + w0) ≥ 1 − ξ_i    i = 1, …, N
      ξ_i ≥ 0

ξ_i: slack variables
ξ_i > 1: x^(i) is misclassified
0 < ξ_i < 1: x^(i) is correctly classified but lies inside the margin
Soft-margin SVM

Linear penalty (hinge loss) for a sample if it is misclassified or lies inside the margin (ξ_i = 0 for samples that satisfy the margin constraint).
Tries to keep the ξ_i small while maximizing the margin.
Always finds a solution (as opposed to the hard-margin SVM).
More robust to outliers.
The soft-margin problem is still a convex QP.
Soft-margin SVM: Parameter C

C is a tradeoff parameter:
Small C allows margin constraints to be easily ignored: large margin.
Large C makes constraints hard to ignore: narrow margin.
C → ∞ enforces all constraints: hard margin.
C can be determined using a technique like cross-validation.
Soft-margin SVM: Cost function

min_{w, w0, {ξ_i}}  (1/2) ‖w‖² + C Σ_{i=1}^{N} ξ_i
s.t.  y^(i) (w^T x^(i) + w0) ≥ 1 − ξ_i    i = 1, …, N
      ξ_i ≥ 0

It is equivalent to the unconstrained optimization problem:

min_{w, w0}  (1/2) ‖w‖² + C Σ_{i=1}^{N} max(0, 1 − y^(i) (w^T x^(i) + w0))
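The unconstrained hinge-loss form lends itself to plain (sub)gradient descent. A minimal NumPy sketch on a hypothetical toy set (the step size and iteration count are arbitrary illustrative choices, not from the lecture):

```python
import numpy as np

# Toy linearly separable data (illustrative only)
X = np.array([[2.0, 2.0], [3.0, 1.0], [2.0, 3.0],
              [-2.0, -2.0], [-3.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

C, lr, n_iters = 1.0, 0.01, 2000
w = np.zeros(2)
w0 = 0.0

for _ in range(n_iters):
    margins = y * (X @ w + w0)
    viol = margins < 1  # samples with nonzero hinge loss
    # Subgradient of (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + w0))
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_w0 = -C * y[viol].sum()
    w -= lr * grad_w
    w0 -= lr * grad_w0

print((np.sign(X @ w + w0) == y).mean())  # training accuracy
```

This is essentially the update behind Pegasos-style SVM solvers: only the samples currently violating the margin contribute to the gradient.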
SVM loss function

Hinge loss vs. 0-1 loss, as a function of y (w^T x + w0):
0-1 loss: 1 if y (w^T x + w0) < 0, and 0 otherwise
Hinge loss: max(0, 1 − y (w^T x + w0))
Dual formulation of the SVM

We are going to introduce the dual SVM problem, which is equivalent to the original primal problem. The dual problem:
is often easier to solve
enables us to exploit the kernel trick
gives us further insight into the optimal hyperplane
Optimization: Lagrange multipliers

p* = min_x f(x)
s.t.  g_i(x) ≤ 0    i = 1, …, m
      h_j(x) = 0    j = 1, …, p

Lagrangian, with multipliers α = [α_1, …, α_m] and λ = [λ_1, …, λ_p]:

L(x, α, λ) = f(x) + Σ_{i=1}^{m} α_i g_i(x) + Σ_{j=1}^{p} λ_j h_j(x)

max_{α_i ≥ 0, λ} L(x, α, λ) = ∞ if any g_i(x) > 0 or any h_j(x) ≠ 0, and f(x) otherwise.

Therefore:  p* = min_x max_{α_i ≥ 0, λ} L(x, α, λ)
Optimization: Dual problem

In general, we have:  max_x min_y f(x, y) ≤ min_y max_x f(x, y)

Primal problem:  p* = min_x max_{α_i ≥ 0, λ} L(x, α, λ)
Dual problem:    d* = max_{α_i ≥ 0, λ} min_x L(x, α, λ)

Obtained by swapping the order of min and max:  d* ≤ p*
When the original problem is convex (f and g_i are convex functions and h is affine), we have strong duality: d* = p*.
Hard-margin SVM: Dual problem

min_{w, w0}  (1/2) ‖w‖²
s.t.  y^(i) (w^T x^(i) + w0) ≥ 1    i = 1, …, N

By incorporating the constraints through Lagrange multipliers, we will have:

min_{w, w0} max_{α_i ≥ 0}  (1/2) ‖w‖² + Σ_{i=1}^{N} α_i (1 − y^(i) (w^T x^(i) + w0))

Dual problem (changing the order of min and max in the above problem):

max_{α_i ≥ 0} min_{w, w0}  (1/2) ‖w‖² + Σ_{i=1}^{N} α_i (1 − y^(i) (w^T x^(i) + w0))
Hard-margin SVM: Dual problem

max_{α_i ≥ 0} min_{w, w0} L(w, w0, α)

L(w, w0, α) = (1/2) ‖w‖² + Σ_{i=1}^{N} α_i (1 − y^(i) (w^T x^(i) + w0))

∇_w L(w, w0, α) = 0  ⇒  w = Σ_{i=1}^{N} α_i y^(i) x^(i)
∂L(w, w0, α)/∂w0 = 0  ⇒  Σ_{i=1}^{N} α_i y^(i) = 0

w0 does not appear in the resulting problem; instead, a "global" constraint on α is created.
Hard-margin SVM: Dual problem

max_α  Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y^(i) y^(j) x^(i)T x^(j)
s.t.  Σ_{i=1}^{N} α_i y^(i) = 0
      α_i ≥ 0    i = 1, …, N

It is a convex QP.
By solving the above problem we first find α, and then w = Σ_{i=1}^{N} α_i y^(i) x^(i).
w0 can be set by making the margin equidistant to the classes (we discuss it more after defining support vectors).
Karush-Kuhn-Tucker (KKT) conditions

In general, the optimal x*, α* of a problem with Lagrangian L(x, α) satisfies the KKT conditions:
∇_x L(x, α) |_{x*, α*} = 0
α_i* ≥ 0    i = 1, …, m
g_i(x*) ≤ 0    i = 1, …, m
α_i* g_i(x*) = 0    i = 1, …, m

Necessary conditions for the SVM solution [w*, w0*, α*]:
∇_w L(w, w0, α) |_{w*, w0*, α*} = 0
∂L(w, w0, α)/∂w0 |_{w*, w0*, α*} = 0
α_i* ≥ 0    i = 1, …, N
y^(i) (w*^T x^(i) + w0*) ≥ 1    i = 1, …, N
α_i* (1 − y^(i) (w*^T x^(i) + w0*)) = 0    i = 1, …, N
Hard-margin SVM: Support vectors

Inactive constraint: y^(i) (w^T x^(i) + w0) > 1
⇒ α_i = 0, and thus x^(i) is not a support vector.
Active constraint: y^(i) (w^T x^(i) + w0) = 1
⇒ α_i can be greater than 0, and thus x^(i) can be a support vector.
Note: a sample with α_i = 0 can also lie on one of the margin hyperplanes.
Hard-margin SVM: Support vectors

Support Vectors (SVs) = {x^(i) | α_i > 0}
The direction of the hyperplane can be found based only on the support vectors:

w = Σ_{α_i > 0} α_i y^(i) x^(i)
Hard-margin SVM: Dual problem
Classifying new samples using only SVs

Classification of a new sample x:

y = sign(w0 + w^T x)
y = sign(w0 + Σ_{α_i > 0} α_i y^(i) x^(i)T x)

The classifier is based on an expansion in terms of dot products of x with the support vectors.
The support vectors are sufficient to predict the labels of new samples.
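As a tiny worked example (not from the lecture): for the 1-D set x^(1) = −1 with y^(1) = −1 and x^(2) = +1 with y^(2) = +1, the dual reduces to maximizing 2α − 2α² (using the equality constraint α_1 = α_2 = α), giving α_1 = α_2 = 0.5. The sketch below plugs these hand-derived values into the SV expansion:

```python
import numpy as np

# Hand-solved hard-margin dual for x = [-1, +1], y = [-1, +1]
X = np.array([-1.0, 1.0])
y = np.array([-1.0, 1.0])
alpha = np.array([0.5, 0.5])

# w from the SV expansion; w0 from an active constraint y_s (w x_s + w0) = 1
w = np.sum(alpha * y * X)   # = 1.0
w0 = y[1] - w * X[1]        # = 0.0

# Classify a new sample using only the support vectors
def predict(x_new):
    return np.sign(w0 + np.sum(alpha * y * X * x_new))

print(predict(0.3), predict(-2.0))  # 1.0 -1.0
```

Both training points are support vectors here (both constraints are active), and the boundary sits at x = 0, equidistant from them, as expected.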
How to find w0

At least one of the α's is strictly positive (α_s > 0 for a support vector x^(s)).
The KKT complementary slackness condition tells us that the constraint corresponding to this non-zero α_s holds with equality:

y^(s) (w0 + w^T x^(s)) = 1

We can therefore find w0 using any one of the support vectors:

w0 = y^(s) − w^T x^(s) = y^(s) − Σ_{α_i > 0} α_i y^(i) x^(i)T x^(s)
Hard-margin SVM: Dual problem

max_α  Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y^(i) y^(j) x^(i)T x^(j)
s.t.  Σ_{i=1}^{N} α_i y^(i) = 0
      α_i ≥ 0    i = 1, …, N

Only the dot product of each pair of training samples appears in the optimization problem.
This is an important property that helps extend SVM to the non-linear case (the cost function does not depend explicitly on the dimensionality of the feature space).
Soft-margin SVM: Dual problem

max_α  Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y^(i) y^(j) x^(i)T x^(j)
s.t.  Σ_{i=1}^{N} α_i y^(i) = 0
      0 ≤ α_i ≤ C    i = 1, …, N

By solving the above quadratic problem, we first find α, then w = Σ_{i=1}^{N} α_i y^(i) x^(i), and w0 is computed from the SVs.
For a test sample x (as before):

y = sign(w0 + w^T x) = sign(w0 + Σ_{α_i > 0} α_i y^(i) x^(i)T x)
Soft-margin SVM: Support vectors

Support vectors: α_i > 0
If 0 < α_i < C: the SV lies on the margin, ξ_i = 0.
If α_i = C: the SV lies on or over the margin.
Primal vs. dual hard-margin SVM problem

Primal problem of hard-margin SVM:
N inequality constraints
d + 1 variables

Dual problem of hard-margin SVM:
one equality constraint
N positivity constraints
N variables (Lagrange multipliers)
more complicated objective function

The dual problem is helpful and instrumental for using the kernel trick.
Primal vs. dual soft-margin SVM problem

Primal problem of soft-margin SVM:
N inequality constraints
N positivity constraints
d + N + 1 variables

Dual problem of soft-margin SVM:
one equality constraint
2N positivity constraints
N variables (Lagrange multipliers)
more complicated objective function

The dual problem is helpful and instrumental for using the kernel trick.
Not linearly separable data

Noisy data or overlapping classes (we already discussed this case: soft margin)
Near linearly separable
Non-linear decision surface
Transform to a new feature space
Nonlinear SVM

Assume a transformation φ: R^d → R^m on the feature space:

x → φ(x)

Find a hyperplane in the transformed feature space:

w^T φ(x) + w0 = 0

{φ_1(x), …, φ_m(x)}: a set of basis functions (or features), where φ_j(x): R^d → R
φ(x) = [φ_1(x), …, φ_m(x)]
Soft-margin SVM in a transformed space: Primal problem

min_{w, w0, {ξ_i}}  (1/2) ‖w‖² + C Σ_{i=1}^{N} ξ_i
s.t.  y^(i) (w^T φ(x^(i)) + w0) ≥ 1 − ξ_i    i = 1, …, N
      ξ_i ≥ 0

w ∈ R^m: the weights that must be found.
If m ≫ d (very high dimensional feature space), there are many more parameters to learn.
Soft-margin SVM in a transformed space: Dual problem

max_α  Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y^(i) y^(j) φ(x^(i))^T φ(x^(j))
s.t.  Σ_{i=1}^{N} α_i y^(i) = 0
      0 ≤ α_i ≤ C    i = 1, …, N

If we have the inner products φ(x^(i))^T φ(x^(j)), only α = [α_1, …, α_N] needs to be learned.
It is not necessary to learn m parameters, as opposed to the primal problem.
Kernelized soft-margin SVM

max_α  Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y^(i) y^(j) k(x^(i), x^(j))
s.t.  Σ_{i=1}^{N} α_i y^(i) = 0
      0 ≤ α_i ≤ C    i = 1, …, N

Classifying a new sample x:

y = sign(w0 + w^T φ(x)) = sign(w0 + Σ_{α_i > 0} α_i y^(i) φ(x^(i))^T φ(x)) = sign(w0 + Σ_{α_i > 0} α_i y^(i) k(x^(i), x))

where k(x, x′) = φ(x)^T φ(x′).
Kernel trick

Kernel: the dot product in the mapped feature space F:

k(x, x′) = φ(x)^T φ(x′),  where φ(x) = [φ_1(x), …, φ_m(x)]^T

k(x, x′) shows the similarity of x and x′; distances in the feature space can be found from it accordingly.
Kernel trick: extension of many well-known algorithms to kernel-based ones, by substituting the dot product with the kernel function.
Kernel trick: Idea

Idea: when the input vectors appear only in the form of dot products, we can use other choices of kernel function instead of the inner product.
This solves the problem without explicitly mapping the data:
Explicit mapping is expensive if φ(x) is very high dimensional.
Instead of using a mapping φ: X → F to represent x ∈ X by φ(x) ∈ F, a similarity function k: X × X → R is used.
The data set can then be represented by an N × N matrix K.
We specify only an inner product function between points in the transformed space (not their coordinates).
In many cases, the inner product in the embedding space can be computed efficiently.
Why kernel?

Kernel functions k can indeed be efficiently computed, with a cost proportional to d instead of m.
Example: consider the second-order polynomial transform (m = 1 + d + d²):

φ(x) = [1, x_1, …, x_d, x_1², x_1 x_2, …, x_d x_d]^T

φ(x)^T φ(x′) = 1 + Σ_{i=1}^{d} x_i x_i′ + Σ_{i=1}^{d} Σ_{j=1}^{d} x_i x_j x_i′ x_j′

Since Σ_i Σ_j x_i x_j x_i′ x_j′ = (Σ_{i=1}^{d} x_i x_i′) × (Σ_{j=1}^{d} x_j x_j′), we get:

φ(x)^T φ(x′) = 1 + x^T x′ + (x^T x′)²

which costs O(d) to compute, even though φ(x) has m = 1 + d + d² components.
Polynomial kernel: Efficient kernel function

An efficient kernel function relies on a carefully constructed transform that allows fast computation.
Example: for polynomial kernels, certain combinations of coefficients still make k easy to compute (γ > 0 and ζ > 0):

φ(x) = [ζ, √(2ζγ) x_1, …, √(2ζγ) x_d, γ x_1², γ x_1 x_2, …, γ x_d x_d]^T

which corresponds to k(x, x′) = (ζ + γ x^T x′)².
Polynomial kernel: Degree two

For a d-dimensional feature space x = [x_1, …, x_d]^T, we instead use k(x, x′) = (x^T x′ + 1)², which corresponds to:

φ(x) = [1, √2 x_1, …, √2 x_d, x_1², …, x_d², √2 x_1 x_2, …, √2 x_1 x_d, √2 x_2 x_3, …, √2 x_{d−1} x_d]^T

Equivalently, with all ordered cross terms:

φ(x) = [1, √2 x_1, …, √2 x_d, x_1², …, x_d², x_1 x_2, …, x_1 x_d, x_2 x_1, …, x_{d−1} x_d]^T
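The degree-two correspondence can be checked numerically. A small NumPy sketch for d = 2, using the √2-weighted map above, verifies that φ(x)^T φ(x′) equals (x^T x′ + 1)²:

```python
import numpy as np

def phi(x):
    # Explicit degree-2 map for d = 2:
    # [1, sqrt(2) x1, sqrt(2) x2, x1^2, x2^2, sqrt(2) x1 x2]
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

def k(x, xp):
    # Degree-2 polynomial kernel
    return (x @ xp + 1.0) ** 2

x = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])

# Both sides equal (x^T x' + 1)^2 = (1*3 + 2*(-1) + 1)^2 = 4
print(phi(x) @ phi(xp), k(x, xp))
```

The kernel evaluation touches each coordinate once (O(d)), while the explicit map has O(d²) components: the computational point of the slide.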
Polynomial kernel

In general, k(x, x′) = (x^T x′)^M contains all monomials of order M.
This can similarly be generalized to include all terms up to degree M: k(x, x′) = (x^T x′ + c)^M  (c > 0).
Example: SVM boundary for a polynomial kernel:

w0 + w^T φ(x) = 0
⇒ w0 + Σ_{α_i > 0} α_i y^(i) φ(x^(i))^T φ(x) = 0
⇒ w0 + Σ_{α_i > 0} α_i y^(i) k(x^(i), x) = 0
⇒ w0 + Σ_{α_i > 0} α_i y^(i) (x^(i)T x + c)^M = 0

The boundary is a polynomial of order M.
Some common kernel functions

A kernel k is a mathematical and computational shortcut that allows us to combine the transform and the inner product into a single, more efficient function.

Linear: k(x, x′) = x^T x′
Polynomial: k(x, x′) = (x^T x′ + c)^M,  c ≥ 0
Gaussian: k(x, x′) = exp(−‖x − x′‖² / 2σ²)
Sigmoid: k(x, x′) = tanh(a x^T x′ + b)
Gaussian kernel

Gaussian: k(x, x′) = exp(−‖x − x′‖² / 2σ²)
Infinite dimensional feature space.
Example: the SVM boundary for a Gaussian kernel places a Gaussian function around each support vector:

w0 + Σ_{α_i > 0} α_i y^(i) exp(−‖x − x^(i)‖² / 2σ²) = 0

SVM + Gaussian kernel can classify any arbitrary training set:
The training error is zero when σ → 0; all samples become support vectors (likely overfitting).
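The σ → 0 behavior can be illustrated directly from the decision function. A NumPy sketch with hypothetical points and illustrative coefficients (all α_i = 1 and w0 = 0, chosen purely for demonstration, not solved for): with a very narrow σ, each training point is dominated by its own Gaussian bump, so every training sample is classified by its own label even though the labels are not linearly separable.

```python
import numpy as np

# Illustrative training points with XOR-like labels (not linearly separable)
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])
y = np.array([1.0, -1.0, -1.0, 1.0])
alpha = np.ones(len(X))   # illustrative coefficients, not a solved dual
w0 = 0.0
sigma = 0.01              # very narrow Gaussians

def f(x):
    # Gaussian-kernel decision function: w0 + sum_i alpha_i y_i exp(-||x - x_i||^2 / 2 sigma^2)
    d2 = np.sum((X - x) ** 2, axis=1)
    return w0 + np.sum(alpha * y * np.exp(-d2 / (2 * sigma**2)))

train_preds = np.array([np.sign(f(x)) for x in X])
print((train_preds == y).all())  # True: zero training error as sigma -> 0
```

At sigma = 0.01 the cross terms are about exp(−50), so f(x^(i)) ≈ α_i y^(i): exactly the memorization/overfitting regime the slide warns about.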
SVM: Linear and Gaussian kernels example

Linear kernel: C = 0.1. Gaussian kernel: C = 1, σ = 0.25.
[Zisserman's slides]
Why do some SVs appear to be far from the boundary?
Hard margin example

For a narrow Gaussian (small σ), even the protection of a large margin cannot suppress overfitting.
Y.S. Abu-Mostafa, et al., "Learning From Data", 2012
SVM Gaussian kernel: Example
Source: Zisserman's slides

f(x) = w0 + Σ_{α_i > 0} α_i y^(i) exp(−‖x − x^(i)‖² / 2σ²)

[A sequence of slides showing the Gaussian-kernel SVM decision function and boundary on example data.]
Constructing kernels

We can construct kernel functions directly, ensuring that each is a valid kernel, i.e., corresponds to an inner product in some feature space.
Example: k(x, x′) = (x^T x′)², with corresponding mapping φ(x) = [x_1², √2 x_1 x_2, x_2²]^T for x = [x_1, x_2]^T.
We need a way to test whether a kernel is valid without having to construct φ(x).
Valid kernel: Necessary and sufficient conditions

Gram matrix K_{N×N}: K_ij = k(x^(i), x^(j)), restricting the kernel function to a set of points {x^(1), x^(2), …, x^(N)}:

K = | k(x^(1), x^(1))  ⋯  k(x^(1), x^(N)) |
    |        ⋮          ⋱         ⋮        |
    | k(x^(N), x^(1))  ⋯  k(x^(N), x^(N)) |

Mercer theorem: the kernel is valid iff the kernel matrix is symmetric positive semi-definite (for any choice of data points).
Any symmetric positive semi-definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some space.
[Shawe-Taylor & Cristianini 2004]
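Mercer's condition can be spot-checked numerically: for any sampled points, the Gram matrix of a valid kernel must be symmetric with non-negative eigenvalues. A minimal NumPy sketch for the Gaussian kernel (arbitrary random points, σ = 1 chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # arbitrary sample points
sigma = 1.0

# Gaussian-kernel Gram matrix: K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-d2 / (2 * sigma**2))

# Symmetric, so eigvalsh applies; all eigenvalues should be >= 0 (up to round-off)
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)  # True
```

A check like this cannot prove a kernel valid (it tests only one point set), but a single negative eigenvalue is enough to disprove validity.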
Constructing valid kernels

Given valid kernels k_1 and k_2, the following are also valid kernels [Bishop]:
k(x, x′) = c k_1(x, x′),  where c > 0
k(x, x′) = f(x) k_1(x, x′) f(x′),  where f(·) is any function
k(x, x′) = q(k_1(x, x′)),  where q(·) is a polynomial with coefficients ≥ 0
k(x, x′) = exp(k_1(x, x′))
k(x, x′) = k_1(x, x′) + k_2(x, x′)
k(x, x′) = k_1(x, x′) k_2(x, x′)
k(x, x′) = k_3(φ(x), φ(x′)),  where φ(x) is a function from x to R^M and k_3(·,·) is a valid kernel in R^M
k(x, x′) = x^T A x′,  where A is a symmetric positive semi-definite matrix
k(x, x′) = k_a(x_a, x_a′) + k_b(x_b, x_b′) and k(x, x′) = k_a(x_a, x_a′) k_b(x_b, x_b′),  where x_a and x_b are variables (not necessarily disjoint) with x = (x_a, x_b), and k_a and k_b are valid kernel functions over their respective spaces.
Extending linear methods to kernelized ones

Kernelized versions of linear methods:
Linear methods are popular: unique optimal solutions, fast learning algorithms, and easier analysis.
However, real-world problems often require nonlinear methods, so we can use kernel-based versions of these linear algorithms.
If a problem can be formulated such that feature vectors appear only in inner products, we can extend it to a kernelized one:
We only need to know the dot products of all pairs of data (i.e., the geometry of the data in the feature space).
By replacing inner products with kernels in linear algorithms, we obtain very flexible methods:
We can operate in the mapped space without ever computing the coordinates of the data in that space.
Which information can be obtained from the kernel?

Example: we know all pairwise distances:

d²(φ(x), φ(x′)) = ‖φ(x) − φ(x′)‖² = k(x, x) + k(x′, x′) − 2 k(x, x′)

Therefore, we also know the distance of points from the center of mass of a set.
Many dimensionality reduction, clustering, and classification methods can be described in terms of pairwise distances.
This allows us to introduce kernelized versions of them.
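The distance identity can be verified with an explicit map. Using the degree-2 homogeneous map φ(x) = [x_1², √2 x_1 x_2, x_2²] with k(x, x′) = (x^T x′)² (and arbitrary illustrative points), a NumPy sketch confirms ‖φ(x) − φ(x′)‖² = k(x, x) + k(x′, x′) − 2 k(x, x′):

```python
import numpy as np

def phi(x):
    # Explicit map for k(x, x') = (x^T x')^2 in 2-D
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

def k(x, xp):
    return (x @ xp) ** 2

x = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])

lhs = np.sum((phi(x) - phi(xp)) ** 2)     # squared distance in feature space
rhs = k(x, x) + k(xp, xp) - 2 * k(x, xp)  # computed from the kernel alone
print(np.isclose(lhs, rhs))  # True
```

The right-hand side never touches φ, which is what lets kernelized algorithms work with distances in feature spaces whose coordinates are never computed.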
Kernels for structured data

Kernels can also be defined on general types of data:
Kernel functions do not need to be defined over vectors; we just need a symmetric positive semi-definite kernel matrix.
Thus, many algorithms can work with general (non-vectorial) data.
Kernels exist to embed strings, trees, graphs, …
This may be even more important than nonlinearity: we can use kernel-based versions of the classical learning algorithms for recognition of structured data.
Kernel functions for objects

Sets: an example of a kernel function for sets:

k(A, B) = 2^|A ∩ B|

Strings: the inner product of the feature vectors for two strings can be defined, e.g., as a sum over all common subsequences, weighted according to their frequency of occurrence and lengths.
Example pair of strings: A E G A T E A G G and E G T E A G A E G A T G.
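The set kernel is easy to implement directly. A minimal Python sketch (one standard interpretation: the feature map indexes all possible subsets, so 2^|A ∩ B| counts the subsets contained in both A and B):

```python
def set_kernel(A, B):
    """k(A, B) = 2^|A intersect B|: the number of subsets contained in
    both A and B (feature map: binary indicator over all subsets)."""
    return 2 ** len(set(A) & set(B))

print(set_kernel({1, 2, 3}, {2, 3, 4}))  # |{2, 3}| = 2, so 2^2 = 4

# Applied to the character *sets* of the two example strings
# (a much cruder notion than the subsequence kernel described above):
print(set_kernel("AEGATEAGG", "EGTEAGAEGATG"))  # share {A, E, G, T}: 2^4 = 16
```

A real string kernel would compare weighted common subsequences rather than character sets; this sketch only shows that a kernel can be defined on non-vectorial objects.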
Kernel trick advantages: summary

We operate in the mapped space without ever computing the coordinates of the data in that space.
Besides vectors, we can introduce kernel functions for structured data (graphs, strings, etc.).
Much of the geometry of the data in the embedding space is contained in the pairwise dot products.
In many cases, the inner product in the embedding space can be computed efficiently.
SVM: Summary

Hard margin: maximizing the margin.
Soft margin: handling noisy data and overlapping classes via slack variables in the problem.
Dual problems of hard-margin and soft-margin SVM:
lead us easily to the non-linear SVM method by kernel substitution
also express the classifier decision in terms of support vectors
Kernel SVMs: learn a linear decision boundary in a high-dimensional space without explicitly working on the mapped data.