Support Vector Machine (SVM) and Kernel Methods

CE-717: Machine Learning, Sharif University of Technology, Fall 2015, Soleymani

Transcript of Support Vector Machine (SVM) and Kernel Methods

Page 1: Support Vector Machine (SVM) and Kernel Methods

CE-717: Machine Learning, Sharif University of Technology

Fall 2015

Soleymani

Support Vector Machine (SVM) and Kernel Methods

Page 2: Support Vector Machine (SVM) and Kernel Methods

Outline

Margin concept

Hard-Margin SVM

Soft-Margin SVM

Dual Problems of Hard-Margin SVM and Soft-Margin SVM

Nonlinear SVM

Kernel trick

Kernel methods


Page 3: Support Vector Machine (SVM) and Kernel Methods

Margin

Which line is better to select as the boundary, to provide more generalization capability?

The margin for a hyperplane that separates samples of two linearly separable classes is:

the smallest distance between the decision boundary and any of the training samples


A larger margin provides better generalization to unseen data.

Page 4: Support Vector Machine (SVM) and Kernel Methods

Maximum margin

SVM finds the solution with the maximum margin.

Solution: a hyperplane that is farthest from all training samples

The hyperplane with the largest margin has equal distances to the nearest samples of both classes.

[Figure: two separating boundaries in the $(x_1, x_2)$ plane; the second has a larger margin.]

Page 5: Support Vector Machine (SVM) and Kernel Methods

Hard-margin SVM: Optimization problem

$$\max_{M,\mathbf{w},w_0} \; \frac{2M}{\|\mathbf{w}\|}$$
$$\text{s.t.} \quad \mathbf{w}^T\mathbf{x}^{(i)} + w_0 \ge M \quad \forall \mathbf{x}^{(i)} \in C_1$$
$$\qquad\; \mathbf{w}^T\mathbf{x}^{(i)} + w_0 \le -M \quad \forall \mathbf{x}^{(i)} \in C_2$$

[Figure: decision boundary $\mathbf{w}^T\mathbf{x} + w_0 = 0$ with margin hyperplanes $\mathbf{w}^T\mathbf{x} + w_0 = M$ (class $y^{(i)} = 1$) and $\mathbf{w}^T\mathbf{x} + w_0 = -M$ (class $y^{(i)} = -1$); the distance from the boundary to each margin hyperplane is $M/\|\mathbf{w}\|$, so the margin is $2M/\|\mathbf{w}\|$.]

Page 6: Support Vector Machine (SVM) and Kernel Methods

Hard-margin SVM: Optimization problem

$$\max_{M,\mathbf{w},w_0} \; \frac{2M}{\|\mathbf{w}\|}$$
$$\text{s.t.} \quad y^{(i)}\bigl(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\bigr) \ge M, \quad i = 1,\dots,N$$

[Figure: decision boundary $\mathbf{w}^T\mathbf{x} + w_0 = 0$ with margin hyperplanes $\mathbf{w}^T\mathbf{x} + w_0 = \pm M$; distance $M/\|\mathbf{w}\|$ from the boundary to each margin hyperplane.]

$$M = \min_{i=1,\dots,N} \; y^{(i)}\bigl(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\bigr)$$

Page 7: Support Vector Machine (SVM) and Kernel Methods

Hard-margin SVM: Optimization problem

We can set $\mathbf{w}' = \frac{\mathbf{w}}{M}$ and $w_0' = \frac{w_0}{M}$:

$$\max_{\mathbf{w}',w_0'} \; \frac{2}{\|\mathbf{w}'\|}$$
$$\text{s.t.} \quad y^{(i)}\bigl(\mathbf{w}'^T\mathbf{x}^{(i)} + w_0'\bigr) \ge 1, \quad i = 1,\dots,N$$

[Figure: decision boundary $\mathbf{w}'^T\mathbf{x} + w_0' = 0$ with margin hyperplanes $\mathbf{w}'^T\mathbf{x} + w_0' = \pm 1$; the distance from the boundary to each margin hyperplane is $1/\|\mathbf{w}'\|$.]

The positions of the boundary and margin lines do not change.

Page 8: Support Vector Machine (SVM) and Kernel Methods

Hard-margin SVM: Optimization problem

We can equivalently optimize:

$$\min_{\mathbf{w},w_0} \; \frac{1}{2}\|\mathbf{w}\|^2$$
$$\text{s.t.} \quad y^{(i)}\bigl(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\bigr) \ge 1, \quad i = 1,\dots,N$$

It is a convex Quadratic Programming (QP) problem

There are computationally efficient packages to solve it.

It has a global minimum (if any feasible solution exists).

When the training samples are not linearly separable, it has no solution.

How can we extend it to find a solution even when the classes are not linearly separable?
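As a concrete illustration of solving the convex QP above with one of those packages, here is a minimal sketch, assuming the cvxpy package is installed and using a small, linearly separable toy set; the names X, y, w, w0 are illustrative only, not part of the original slides.

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data: rows of X are samples, y holds labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(X.shape[1])
w0 = cp.Variable()
# Hard-margin constraints: y_i (w^T x_i + w0) >= 1 for every sample.
constraints = [cp.multiply(y, X @ w + w0) >= 1]
# Objective: minimize (1/2) ||w||^2.
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
prob.solve()
print(w.value, w0.value)
```

If the classes are not separable, the solver reports the problem as infeasible, which is exactly the limitation motivating the soft margin below.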

Page 9: Support Vector Machine (SVM) and Kernel Methods

Beyond linear separability

How to extend the hard-margin SVM to allow classification error:

Overlapping classes that can be approximately separated by a linear boundary

Noise in the linearly separable classes


Page 10: Support Vector Machine (SVM) and Kernel Methods

Beyond linear separability: Soft-margin SVM

Minimizing the number of misclassified points?!

NP-complete

Soft margin:

Maximizing a margin while trying to minimize the distance between misclassified points and their correct margin plane

Page 11: Support Vector Machine (SVM) and Kernel Methods

Soft-margin SVM

SVM with slack variables: allows samples to fall within the margin, but penalizes them:

$$\min_{\mathbf{w},w_0,\{\xi_i\}_{i=1}^{N}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\xi_i$$
$$\text{s.t.} \quad y^{(i)}\bigl(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\bigr) \ge 1 - \xi_i, \quad i = 1,\dots,N$$
$$\qquad\; \xi_i \ge 0$$

$\xi_i$: slack variables

$\xi_i > 1$: $\mathbf{x}^{(i)}$ is misclassified

$0 < \xi_i < 1$: $\mathbf{x}^{(i)}$ is correctly classified but lies inside the margin

Page 12: Support Vector Machine (SVM) and Kernel Methods

Soft-margin SVM

Linear penalty (hinge loss) for a sample that is misclassified or lies inside the margin

Tries to keep the $\xi_i$ small while maximizing the margin

Always finds a solution (as opposed to the hard-margin SVM)

More robust to outliers

The soft-margin problem is still a convex QP.

Page 13: Support Vector Machine (SVM) and Kernel Methods

Soft-margin SVM: Parameter $C$

$C$ is a tradeoff parameter:

small $C$ allows margin constraints to be easily ignored → large margin

large $C$ makes constraints hard to ignore → narrow margin

$C \to \infty$ enforces all constraints: hard margin

$C$ can be determined using a technique like cross-validation.

Page 14: Support Vector Machine (SVM) and Kernel Methods

Soft-margin SVM: Cost function

$$\min_{\mathbf{w},w_0,\{\xi_i\}_{i=1}^{N}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\xi_i$$
$$\text{s.t.} \quad y^{(i)}\bigl(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\bigr) \ge 1 - \xi_i, \quad i = 1,\dots,N$$
$$\qquad\; \xi_i \ge 0$$

It is equivalent to the unconstrained optimization problem:

$$\min_{\mathbf{w},w_0} \; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\max\bigl(0,\, 1 - y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)} + w_0)\bigr)$$
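This unconstrained hinge-loss form can be minimized directly by subgradient descent. A minimal NumPy sketch, assuming labels in {-1, +1}; the function name, learning rate, and epoch count are illustrative choices, not part of the original slides.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=1e-3, epochs=200):
    """Subgradient descent on (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + w0))."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + w0)
        viol = margins < 1                                   # samples with non-zero hinge loss
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_w0 = -C * y[viol].sum()
        w -= lr * grad_w
        w0 -= lr * grad_w0
    return w, w0
```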

Page 15: Support Vector Machine (SVM) and Kernel Methods

SVM loss function

Hinge loss vs. 0-1 loss

[Figure: the 0-1 loss and the hinge loss $\max\bigl(0,\, 1 - y(\mathbf{w}^T\mathbf{x} + w_0)\bigr)$ plotted against $\mathbf{w}^T\mathbf{x} + w_0$ for $y = 1$.]

Page 16: Support Vector Machine (SVM) and Kernel Methods

Dual formulation of the SVM

We are going to introduce the dual SVM problem, which is equivalent to the original primal problem. The dual problem:

is often easier to solve

enables us to exploit the kernel trick

gives us further insight into the optimal hyperplane

Page 17: Support Vector Machine (SVM) and Kernel Methods

Optimization: Lagrange multipliers

$$p^* = \min_{\mathbf{x}} f(\mathbf{x})$$
$$\text{s.t.} \quad g_i(\mathbf{x}) \le 0, \quad i = 1,\dots,m$$
$$\qquad\; h_i(\mathbf{x}) = 0, \quad i = 1,\dots,p$$

Lagrangian:

$$\mathcal{L}(\mathbf{x},\boldsymbol{\alpha},\boldsymbol{\lambda}) = f(\mathbf{x}) + \sum_{i=1}^{m}\alpha_i g_i(\mathbf{x}) + \sum_{i=1}^{p}\lambda_i h_i(\mathbf{x})$$

$$\max_{\alpha_i \ge 0,\,\lambda_i} \mathcal{L}(\mathbf{x},\boldsymbol{\alpha},\boldsymbol{\lambda}) =
\begin{cases}
\infty & \text{if any } g_i(\mathbf{x}) > 0 \\
\infty & \text{if any } h_i(\mathbf{x}) \ne 0 \\
f(\mathbf{x}) & \text{otherwise}
\end{cases}$$

$$p^* = \min_{\mathbf{x}} \max_{\alpha_i \ge 0,\,\lambda_i} \mathcal{L}(\mathbf{x},\boldsymbol{\alpha},\boldsymbol{\lambda})$$

Lagrange multipliers: $\boldsymbol{\alpha} = [\alpha_1,\dots,\alpha_m]$, $\boldsymbol{\lambda} = [\lambda_1,\dots,\lambda_p]$

Page 18: Support Vector Machine (SVM) and Kernel Methods

Optimization: Dual problem

In general, we have:

$$\max_{x}\min_{y} f(x,y) \;\le\; \min_{y}\max_{x} f(x,y)$$

Primal problem: $p^* = \min_{\mathbf{x}} \max_{\alpha_i \ge 0,\,\lambda_i} \mathcal{L}(\mathbf{x},\boldsymbol{\alpha},\boldsymbol{\lambda})$

Dual problem: $d^* = \max_{\alpha_i \ge 0,\,\lambda_i} \min_{\mathbf{x}} \mathcal{L}(\mathbf{x},\boldsymbol{\alpha},\boldsymbol{\lambda})$

Obtained by swapping the order of min and max

$$d^* \le p^*$$

When the original problem is convex ($f$ and $g_i$ are convex functions and $h_i$ is affine), we have strong duality: $d^* = p^*$.

Page 19: Support Vector Machine (SVM) and Kernel Methods

Hard-margin SVM: Dual problem

$$\min_{\mathbf{w},w_0} \; \frac{1}{2}\|\mathbf{w}\|^2$$
$$\text{s.t.} \quad y^{(i)}\bigl(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\bigr) \ge 1, \quad i = 1,\dots,N$$

By incorporating the constraints through Lagrange multipliers, we will have:

$$\min_{\mathbf{w},w_0} \max_{\{\alpha_i \ge 0\}} \; \frac{1}{2}\|\mathbf{w}\|^2 + \sum_{i=1}^{N}\alpha_i\bigl(1 - y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)} + w_0)\bigr)$$

Dual problem (changing the order of min and max in the above problem):

$$\max_{\{\alpha_i \ge 0\}} \min_{\mathbf{w},w_0} \; \frac{1}{2}\|\mathbf{w}\|^2 + \sum_{i=1}^{N}\alpha_i\bigl(1 - y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)} + w_0)\bigr)$$

Page 20: Support Vector Machine (SVM) and Kernel Methods

Hard-margin SVM: Dual problem

$$\max_{\{\alpha_i \ge 0\}} \min_{\mathbf{w},w_0} \; \mathcal{L}(\mathbf{w},w_0,\boldsymbol{\alpha})$$

$$\mathcal{L}(\mathbf{w},w_0,\boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 + \sum_{i=1}^{N}\alpha_i\bigl(1 - y^{(i)}(\mathbf{w}^T\mathbf{x}^{(i)} + w_0)\bigr)$$

$$\nabla_{\mathbf{w}}\mathcal{L}(\mathbf{w},w_0,\boldsymbol{\alpha}) = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{N}\alpha_i y^{(i)}\mathbf{x}^{(i)}$$

$$\frac{\partial\mathcal{L}(\mathbf{w},w_0,\boldsymbol{\alpha})}{\partial w_0} = 0 \;\Rightarrow\; \sum_{i=1}^{N}\alpha_i y^{(i)} = 0$$

$w_0$ does not appear; instead, a "global" constraint on $\boldsymbol{\alpha}$ is created.

Page 21: Support Vector Machine (SVM) and Kernel Methods

Hard-margin SVM: Dual problem

$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y^{(i)}y^{(j)}\,\mathbf{x}^{(i)T}\mathbf{x}^{(j)}$$
$$\text{s.t.} \quad \sum_{i=1}^{N}\alpha_i y^{(i)} = 0$$
$$\qquad\; \alpha_i \ge 0, \quad i = 1,\dots,N$$

It is a convex QP

By solving the above problem we first find $\boldsymbol{\alpha}$, and then $\mathbf{w} = \sum_{i=1}^{N}\alpha_i y^{(i)}\mathbf{x}^{(i)}$.

$w_0$ can be set by making the margin equidistant to the classes (we discuss this more after defining support vectors).
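A minimal sketch of solving this dual with cvxpy (assumed installed) on a toy separable set and then recovering $\mathbf{w}$ from $\boldsymbol{\alpha}$. It uses the identity $\sum_{i,j}\alpha_i\alpha_j y^{(i)}y^{(j)}\mathbf{x}^{(i)T}\mathbf{x}^{(j)} = \|\sum_i \alpha_i y^{(i)}\mathbf{x}^{(i)}\|^2$ to express the quadratic term; all names are illustrative.

```python
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

A = y[:, None] * X                          # row i is y_i x_i
alpha = cp.Variable(N)
# Maximize sum(alpha) - 1/2 || sum_i alpha_i y_i x_i ||^2
obj = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(A.T @ alpha))
prob = cp.Problem(obj, [alpha >= 0, alpha @ y == 0])
prob.solve()

w = A.T @ alpha.value                       # w = sum_i alpha_i y_i x_i
print(alpha.value.round(4), w)
```

Samples with (numerically) zero $\alpha_i$ are the non-support vectors discussed on the next slides.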

Page 22: Support Vector Machine (SVM) and Kernel Methods

Karush-Kuhn-Tucker (KKT) conditions

Necessary conditions for the solution $[\mathbf{w}^*, w_0^*, \boldsymbol{\alpha}^*]$:

$$\nabla_{\mathbf{w}}\mathcal{L}(\mathbf{w},w_0,\boldsymbol{\alpha})\big|_{\mathbf{w}^*,w_0^*,\boldsymbol{\alpha}^*} = 0$$
$$\frac{\partial\mathcal{L}(\mathbf{w},w_0,\boldsymbol{\alpha})}{\partial w_0}\Big|_{\mathbf{w}^*,w_0^*,\boldsymbol{\alpha}^*} = 0$$
$$\alpha_i^* \ge 0, \quad i = 1,\dots,N$$
$$y^{(i)}\bigl(\mathbf{w}^{*T}\mathbf{x}^{(i)} + w_0^*\bigr) \ge 1, \quad i = 1,\dots,N$$
$$\alpha_i^*\bigl(1 - y^{(i)}(\mathbf{w}^{*T}\mathbf{x}^{(i)} + w_0^*)\bigr) = 0, \quad i = 1,\dots,N$$

In general, the optimal $(\mathbf{x}^*, \boldsymbol{\alpha}^*)$ satisfies the KKT conditions:

$$\nabla_{\mathbf{x}}\mathcal{L}(\mathbf{x},\boldsymbol{\alpha})\big|_{\mathbf{x}^*,\boldsymbol{\alpha}^*} = 0$$
$$\alpha_i^* \ge 0, \quad i = 1,\dots,m$$
$$g_i(\mathbf{x}^*) \le 0, \quad i = 1,\dots,m$$
$$\alpha_i^* g_i(\mathbf{x}^*) = 0, \quad i = 1,\dots,m$$

Page 23: Support Vector Machine (SVM) and Kernel Methods

Hard-margin SVM: Support vectors

Inactive constraint: $y^{(i)}\bigl(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\bigr) > 1$

$\Rightarrow \alpha_i = 0$, and thus $\mathbf{x}^{(i)}$ is not a support vector.

Active constraint: $y^{(i)}\bigl(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\bigr) = 1$

$\Rightarrow \alpha_i$ can be greater than 0, and thus $\mathbf{x}^{(i)}$ can be a support vector.

[Figure: samples lying on the margin hyperplanes have $\alpha > 0$; the distance from the boundary to each margin hyperplane is $1/\|\mathbf{w}\|$.]

Page 24: Support Vector Machine (SVM) and Kernel Methods

Hard-margin SVM: Support vectors

Inactive constraint: $y^{(i)}\bigl(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\bigr) > 1$

$\Rightarrow \alpha_i = 0$, and thus $\mathbf{x}^{(i)}$ is not a support vector.

Active constraint: $y^{(i)}\bigl(\mathbf{w}^T\mathbf{x}^{(i)} + w_0\bigr) = 1$

$\Rightarrow \alpha_i$ can be greater than 0, and thus $\mathbf{x}^{(i)}$ can be a support vector.

[Figure: samples on the margin hyperplanes with $\alpha > 0$; samples away from the margin with $\alpha = 0$.]

A sample with $\alpha_i = 0$ can also lie on one of the margin hyperplanes.

Page 25: Support Vector Machine (SVM) and Kernel Methods

Hard-margin SVM: Support vectors

Support vectors (SVs) $= \{\mathbf{x}^{(i)} \mid \alpha_i > 0\}$

The direction of the hyperplane can be found based only on the support vectors:

$$\mathbf{w} = \sum_{\alpha_i > 0} \alpha_i y^{(i)}\mathbf{x}^{(i)}$$

Page 26: Support Vector Machine (SVM) and Kernel Methods

Hard-margin SVM: Dual problem

Classifying new samples using only SVs

Classification of a new sample $\mathbf{x}$:

$$y = \text{sign}\bigl(w_0 + \mathbf{w}^T\mathbf{x}\bigr)$$
$$y = \text{sign}\Bigl(w_0 + \sum_{\alpha_i > 0}\alpha_i y^{(i)}\,\mathbf{x}^{(i)T}\mathbf{x}\Bigr)$$

The classifier is based on an expansion in terms of dot products of $\mathbf{x}$ with the support vectors.

Support vectors are sufficient to predict the labels of new samples.

Page 27: Support Vector Machine (SVM) and Kernel Methods

How to find $w_0$

At least one of the $\alpha_i$'s is strictly positive ($\alpha_s > 0$ for a support vector $\mathbf{x}^{(s)}$).

The KKT complementary slackness condition tells us that the constraint corresponding to this non-zero $\alpha_s$ holds with equality.

$w_0$ is found such that the following constraint is satisfied:

$$y^{(s)}\bigl(w_0 + \mathbf{w}^T\mathbf{x}^{(s)}\bigr) = 1$$

We can therefore find $w_0$ using one of the support vectors:

$$\Rightarrow\; w_0 = y^{(s)} - \mathbf{w}^T\mathbf{x}^{(s)} = y^{(s)} - \sum_{\alpha_n > 0}\alpha_n y^{(n)}\,\mathbf{x}^{(n)T}\mathbf{x}^{(s)}$$

Page 28: Support Vector Machine (SVM) and Kernel Methods

Hard-margin SVM: Dual problem

$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y^{(i)}y^{(j)}\,\mathbf{x}^{(i)T}\mathbf{x}^{(j)}$$
$$\text{s.t.} \quad \sum_{i=1}^{N}\alpha_i y^{(i)} = 0$$
$$\qquad\; \alpha_i \ge 0, \quad i = 1,\dots,N$$

Only the dot products of pairs of training samples appear in the optimization problem.

This is an important property that helps extend SVM to the non-linear case (the cost function does not depend explicitly on the dimensionality of the feature space).

Page 29: Support Vector Machine (SVM) and Kernel Methods

Soft-margin SVM: Dual problem

$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y^{(i)}y^{(j)}\,\mathbf{x}^{(i)T}\mathbf{x}^{(j)}$$
$$\text{s.t.} \quad \sum_{i=1}^{N}\alpha_i y^{(i)} = 0$$
$$\qquad\; 0 \le \alpha_i \le C, \quad i = 1,\dots,N$$

By solving the above quadratic problem we first find $\boldsymbol{\alpha}$, then $\mathbf{w} = \sum_{i=1}^{N}\alpha_i y^{(i)}\mathbf{x}^{(i)}$, and $w_0$ is computed from the SVs.

For a test sample $\mathbf{x}$ (as before):

$$y = \text{sign}\bigl(w_0 + \mathbf{w}^T\mathbf{x}\bigr) = \text{sign}\Bigl(w_0 + \sum_{\alpha_i > 0}\alpha_i y^{(i)}\,\mathbf{x}^{(i)T}\mathbf{x}\Bigr)$$

Page 30: Support Vector Machine (SVM) and Kernel Methods

Soft-margin SVM: Support vectors

Support vectors: $\alpha_i > 0$

If $0 < \alpha_i < C$: SVs on the margin, $\xi_i = 0$.

If $\alpha_i = C$: SVs on or over the margin.

Page 31: Support Vector Machine (SVM) and Kernel Methods

Primal vs. dual hard-margin SVM problem

Primal problem of hard-margin SVM:

$N$ inequality constraints

$d + 1$ variables

Dual problem of hard-margin SVM:

one equality constraint

$N$ positivity constraints

$N$ variables (Lagrange multipliers)

more complicated objective function

The dual problem is helpful and instrumental in using the kernel trick.

Page 32: Support Vector Machine (SVM) and Kernel Methods

Primal vs. dual soft-margin SVM problem

Primal problem of soft-margin SVM:

$N$ inequality constraints

$N$ positivity constraints

$N + d + 1$ variables

Dual problem of soft-margin SVM:

one equality constraint

$2N$ bound constraints ($0 \le \alpha_i \le C$)

$N$ variables (Lagrange multipliers)

more complicated objective function

The dual problem is helpful and instrumental in using the kernel trick.

Page 33: Support Vector Machine (SVM) and Kernel Methods

Not linearly separable data

Noisy data or overlapping classes (we discussed this: soft margin)

Nearly linearly separable

Non-linear decision surface

Transform to a new feature space


Page 34: Support Vector Machine (SVM) and Kernel Methods

Nonlinear SVM

Assume a transformation $\boldsymbol{\phi}: \mathbb{R}^d \to \mathbb{R}^m$ on the feature space:

$$\mathbf{x} \to \boldsymbol{\phi}(\mathbf{x})$$

Find a hyper-plane in the transformed feature space:

[Figure: $\boldsymbol{\phi}: \mathbf{x} \to \boldsymbol{\phi}(\mathbf{x})$ maps the data from the $(x_1, x_2)$ space to the $(\phi_1(\mathbf{x}), \phi_2(\mathbf{x}))$ space, where the classes are separated by the hyperplane $\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) + w_0 = 0$.]

$\{\phi_1(\mathbf{x}),\dots,\phi_m(\mathbf{x})\}$: a set of basis functions (or features), with $\phi_i(\mathbf{x}): \mathbb{R}^d \to \mathbb{R}$

$\boldsymbol{\phi}(\mathbf{x}) = [\phi_1(\mathbf{x}),\dots,\phi_m(\mathbf{x})]$

Page 35: Support Vector Machine (SVM) and Kernel Methods

Soft-margin SVM in a transformed space:

Primal problem

Primal problem:

$$\min_{\mathbf{w},w_0,\{\xi_i\}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\xi_i$$
$$\text{s.t.} \quad y^{(i)}\bigl(\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}^{(i)}) + w_0\bigr) \ge 1 - \xi_i, \quad i = 1,\dots,N$$
$$\qquad\; \xi_i \ge 0$$

$\mathbf{w} \in \mathbb{R}^m$: the weights that must be found

If $m \gg d$ (a very high-dimensional feature space), then there are many more parameters to learn.

Page 36: Support Vector Machine (SVM) and Kernel Methods

Soft-margin SVM in a transformed space:

Dual problem

Optimization problem:

$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y^{(i)}y^{(j)}\,\boldsymbol{\phi}(\mathbf{x}^{(i)})^T\boldsymbol{\phi}(\mathbf{x}^{(j)})$$
$$\text{s.t.} \quad \sum_{i=1}^{N}\alpha_i y^{(i)} = 0$$
$$\qquad\; 0 \le \alpha_i \le C, \quad i = 1,\dots,N$$

If we have the inner products $\boldsymbol{\phi}(\mathbf{x}^{(i)})^T\boldsymbol{\phi}(\mathbf{x}^{(j)})$, only $\boldsymbol{\alpha} = [\alpha_1,\dots,\alpha_N]$ needs to be learned.

It is not necessary to learn the $m$ parameters, as opposed to the primal problem.

Page 37: Support Vector Machine (SVM) and Kernel Methods

Kernelized soft-margin SVM

Optimization problem:

$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y^{(i)}y^{(j)}\,k\bigl(\mathbf{x}^{(i)},\mathbf{x}^{(j)}\bigr)$$
$$\text{s.t.} \quad \sum_{i=1}^{N}\alpha_i y^{(i)} = 0$$
$$\qquad\; 0 \le \alpha_i \le C, \quad i = 1,\dots,N$$

Classifying a new sample $\mathbf{x}$:

$$y = \text{sign}\bigl(w_0 + \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x})\bigr) = \text{sign}\Bigl(w_0 + \sum_{\alpha_i > 0}\alpha_i y^{(i)}\,k\bigl(\mathbf{x}^{(i)},\mathbf{x}\bigr)\Bigr)$$

where $k(\mathbf{x},\mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}')$.

Page 38: Support Vector Machine (SVM) and Kernel Methods

Kernel trick

Kernel: the dot product in the mapped feature space $\mathcal{F}$:

$$k(\mathbf{x},\mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}'), \qquad \boldsymbol{\phi}(\mathbf{x}) = [\phi_1(\mathbf{x}),\dots,\phi_m(\mathbf{x})]^T$$

$k(\mathbf{x},\mathbf{x}')$ shows the similarity of $\mathbf{x}$ and $\mathbf{x}'$.

The distance in the feature space can be found from the kernel accordingly.

Kernel trick → extension of many well-known algorithms to kernel-based ones

By substituting the dot product with the kernel function

Page 39: Support Vector Machine (SVM) and Kernel Methods

Kernel trick: Idea

Idea: when the input vectors appear only in the form of dot products, we can use other choices of kernel function instead of the inner product.

Solving the problem without explicitly mapping the data

Explicit mapping is expensive if $\boldsymbol{\phi}(\mathbf{x})$ is very high dimensional

Instead of using a mapping $\boldsymbol{\phi}: \mathcal{X} \to \mathcal{F}$ to represent $\mathbf{x} \in \mathcal{X}$ by $\boldsymbol{\phi}(\mathbf{x}) \in \mathcal{F}$, a similarity function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is used.

The data set can be represented by an $N \times N$ matrix $\mathbf{K}$.

We specify only an inner product function between points in the transformed space (not their coordinates).

In many cases, the inner product in the embedding space can be computed efficiently.

Page 40: Support Vector Machine (SVM) and Kernel Methods

Why kernel?

Kernel functions $K$ can indeed be efficiently computed, with a cost proportional to $d$ instead of $m$.

Example: consider the second-order polynomial transform:

$$\boldsymbol{\phi}(\mathbf{x}) = \bigl[1,\; x_1,\dots,x_d,\; x_1^2,\; x_1x_2,\dots,x_dx_d\bigr]^T \qquad (m = 1 + d + d^2)$$

$$\boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}') = 1 + \sum_{i=1}^{d}x_i x_i' + \sum_{i=1}^{d}\sum_{j=1}^{d}x_i x_j x_i' x_j' \qquad O(m)$$

Since $\sum_{i=1}^{d}\sum_{j=1}^{d}x_i x_j x_i' x_j' = \Bigl(\sum_{i=1}^{d}x_i x_i'\Bigr)\Bigl(\sum_{j=1}^{d}x_j x_j'\Bigr)$:

$$\boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}') = 1 + \mathbf{x}^T\mathbf{x}' + \bigl(\mathbf{x}^T\mathbf{x}'\bigr)^2 \qquad O(d)$$

Page 41: Support Vector Machine (SVM) and Kernel Methods

Polynomial kernel: Efficient kernel function

An efficient kernel function relies on a carefully constructed specific transform that allows fast computation.

Example: for polynomial kernels, certain combinations of coefficients still make $K$ easy to compute ($\gamma > 0$ and $c > 0$):

$$\boldsymbol{\phi}(\mathbf{x}) = \bigl[c,\; \sqrt{2c\gamma}\,x_1,\dots,\sqrt{2c\gamma}\,x_d,\; \gamma x_1^2,\; \gamma x_1x_2,\dots,\gamma x_dx_d\bigr]^T$$

which corresponds to the kernel $k(\mathbf{x},\mathbf{x}') = \bigl(\gamma\,\mathbf{x}^T\mathbf{x}' + c\bigr)^2$.

Page 42: Support Vector Machine (SVM) and Kernel Methods

Polynomial kernel: Degree two

For a $d$-dimensional feature space $\mathbf{x} = [x_1,\dots,x_d]^T$, we instead use $k(\mathbf{x},\mathbf{x}') = \bigl(\mathbf{x}^T\mathbf{x}' + 1\bigr)^2$, which corresponds to:

$$\boldsymbol{\phi}(\mathbf{x}) = \bigl[1,\; \sqrt{2}x_1,\dots,\sqrt{2}x_d,\; x_1^2,\dots,x_d^2,\; \sqrt{2}x_1x_2,\dots,\sqrt{2}x_1x_d,\; \sqrt{2}x_2x_3,\dots,\sqrt{2}x_{d-1}x_d\bigr]^T$$

or, equivalently (listing every ordered pair of cross terms instead):

$$\boldsymbol{\phi}(\mathbf{x}) = \bigl[1,\; \sqrt{2}x_1,\dots,\sqrt{2}x_d,\; x_1^2,\dots,x_d^2,\; x_1x_2,\dots,x_1x_d,\; x_2x_1,\dots,x_{d-1}x_d\bigr]^T$$

Page 43: Support Vector Machine (SVM) and Kernel Methods

Polynomial kernel

In general, $k(\mathbf{x},\mathbf{z}) = \bigl(\mathbf{x}^T\mathbf{z}\bigr)^M$ contains all monomials of order $M$.

This can similarly be generalized to include all terms up to degree $M$: $k(\mathbf{x},\mathbf{z}) = \bigl(\mathbf{x}^T\mathbf{z} + c\bigr)^M$ with $c > 0$.

Example: SVM boundary for a polynomial kernel

๐‘ค0 + ๐’˜๐‘‡๐“ ๐’™ = 0

โ‡’ ๐‘ค0 + ๐›ผ๐‘–>0 ๐›ผ๐‘–๐‘ฆ(๐‘–)๐“ ๐’™ ๐‘– ๐‘‡

๐“ ๐’™ = 0

โ‡’ ๐‘ค0 + ๐›ผ๐‘–>0 ๐›ผ๐‘–๐‘ฆ(๐‘–)๐‘˜(๐’™ ๐‘– , ๐’™) = 0

โ‡’ ๐‘ค0 + ๐›ผ๐‘–>0 ๐›ผ๐‘–๐‘ฆ(๐‘–) ๐’™(๐‘–)๐‘‡

๐’™ + ๐‘๐‘€

= 0Boundary is a

polynomial of order ๐‘€

Page 44: Support Vector Machine (SVM) and Kernel Methods

Some common kernel functions

$K$ is a mathematical and computational shortcut that allows us to combine the transform and the inner product into a single, more efficient function.

Linear: $k(\mathbf{x},\mathbf{x}') = \mathbf{x}^T\mathbf{x}'$

Polynomial: $k(\mathbf{x},\mathbf{x}') = \bigl(\mathbf{x}^T\mathbf{x}' + c\bigr)^M$, $c \ge 0$

Gaussian: $k(\mathbf{x},\mathbf{x}') = \exp\Bigl(-\dfrac{\|\mathbf{x}-\mathbf{x}'\|^2}{2\sigma^2}\Bigr)$

Sigmoid: $k(\mathbf{x},\mathbf{x}') = \tanh\bigl(a\,\mathbf{x}^T\mathbf{x}' + b\bigr)$
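The four kernels above written as plain NumPy functions for two vectors, as a minimal sketch; the hyperparameter defaults (c, M, sigma, a, b) are illustrative, not values from the slides.

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, c=1.0, M=2):
    return (x @ z + c) ** M

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, z, a=1.0, b=0.0):
    return np.tanh(a * (x @ z) + b)
```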

Page 45: Support Vector Machine (SVM) and Kernel Methods

Gaussian kernel

Gaussian: $k(\mathbf{x},\mathbf{x}') = \exp\Bigl(-\dfrac{\|\mathbf{x}-\mathbf{x}'\|^2}{2\sigma^2}\Bigr)$

Infinite-dimensional feature space

Example: SVM boundary for a Gaussian kernel

It places a Gaussian function around each data point:

$$w_0 + \sum_{\alpha_i > 0}\alpha_i y^{(i)}\exp\Bigl(-\frac{\|\mathbf{x}-\mathbf{x}^{(i)}\|^2}{2\sigma^2}\Bigr) = 0$$

SVM with a Gaussian kernel can classify any arbitrary training set.

Training error is zero when $\sigma \to 0$: all samples become support vectors (likely overfitting).
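A minimal scikit-learn sketch of this effect (scikit-learn assumed available, random toy data illustrative only); its RBF kernel is parameterized by gamma, which corresponds to $1/(2\sigma^2)$, so a tiny $\sigma$ typically drives training error to zero with nearly every sample a support vector.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = rng.integers(0, 2, size=40) * 2 - 1            # arbitrary labels in {-1, +1}

for sigma in (1.0, 0.05):
    clf = SVC(kernel="rbf", C=1.0, gamma=1.0 / (2 * sigma ** 2)).fit(X, y)
    # small sigma: training accuracy near 1.0 and most samples become support vectors
    print(sigma, clf.score(X, y), len(clf.support_))
```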

Page 46: Support Vector Machine (SVM) and Kernel Methods

SVM: Linear and Gaussian kernels example

[Figure: linear kernel ($C = 0.1$) vs. Gaussian kernel ($C = 1$, $\sigma = 0.25$); from Zisserman's slides.]

Some SVs appear to be far from the boundary?

Page 47: Support Vector Machine (SVM) and Kernel Methods

Hard-margin example

For a narrow Gaussian kernel (small $\sigma$), even the protection of a large margin cannot suppress overfitting.

Y. S. Abu-Mostafa et al., "Learning From Data", 2012

Page 48: Support Vector Machine (SVM) and Kernel Methods

SVM Gaussian kernel: Example

Source: Zisserman's slides

$$f(\mathbf{x}) = w_0 + \sum_{\alpha_i > 0}\alpha_i y^{(i)}\exp\Bigl(-\frac{\|\mathbf{x}-\mathbf{x}^{(i)}\|^2}{2\sigma^2}\Bigr)$$

Page 49: Support Vector Machine (SVM) and Kernel Methods

SVM Gaussian kernel: Example

Source: Zisserman's slides

Page 50: Support Vector Machine (SVM) and Kernel Methods

SVM Gaussian kernel: Example

Source: Zisserman's slides

Page 51: Support Vector Machine (SVM) and Kernel Methods

SVM Gaussian kernel: Example

Source: Zisserman's slides

Page 52: Support Vector Machine (SVM) and Kernel Methods

SVM Gaussian kernel: Example

Source: Zisserman's slides

Page 53: Support Vector Machine (SVM) and Kernel Methods

SVM Gaussian kernel: Example

Source: Zisserman's slides

Page 54: Support Vector Machine (SVM) and Kernel Methods

SVM Gaussian kernel: Example

Source: Zisserman's slides

Page 55: Support Vector Machine (SVM) and Kernel Methods

Constructing kernels

Construct kernel functions directly

Ensure that it is a valid kernel

Corresponds to an inner product in some feature space

Example: $k(\mathbf{x},\mathbf{x}') = \bigl(\mathbf{x}^T\mathbf{x}'\bigr)^2$

Corresponding mapping: $\boldsymbol{\phi}(\mathbf{x}) = \bigl[x_1^2,\; \sqrt{2}x_1x_2,\; x_2^2\bigr]^T$ for $\mathbf{x} = [x_1, x_2]^T$

We need a way to test whether a kernel is valid without having to construct $\boldsymbol{\phi}(\mathbf{x})$.

Page 56: Support Vector Machine (SVM) and Kernel Methods

Valid kernel: Necessary & sufficient conditions

Gram matrix $\mathbf{K}_{N\times N}$: $K_{ij} = k\bigl(\mathbf{x}^{(i)},\mathbf{x}^{(j)}\bigr)$

Restricting the kernel function to a set of points $\{\mathbf{x}^{(1)},\mathbf{x}^{(2)},\dots,\mathbf{x}^{(N)}\}$:

$$\mathbf{K} = \begin{bmatrix} k(\mathbf{x}^{(1)},\mathbf{x}^{(1)}) & \cdots & k(\mathbf{x}^{(1)},\mathbf{x}^{(N)}) \\ \vdots & \ddots & \vdots \\ k(\mathbf{x}^{(N)},\mathbf{x}^{(1)}) & \cdots & k(\mathbf{x}^{(N)},\mathbf{x}^{(N)}) \end{bmatrix}$$

Mercer theorem: the kernel matrix is symmetric positive semi-definite (for any choice of data points).

Any symmetric positive definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some space.

[Shawe-Taylor & Cristianini 2004]
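A minimal sketch of checking this condition on a finite sample: build the Gram matrix and test symmetry and positive semi-definiteness via its eigenvalues. The tolerance and the toy data are illustrative choices.

```python
import numpy as np

def is_valid_gram(K, tol=1e-10):
    """Mercer's condition on a finite sample: K must be symmetric positive semi-definite."""
    return np.allclose(K, K.T) and np.all(np.linalg.eigvalsh(K) >= -tol)

# Example: Gram matrix of the degree-2 polynomial kernel k(x, z) = (x^T z)^2 on random points.
X = np.random.default_rng(0).normal(size=(20, 3))
K = (X @ X.T) ** 2
print(is_valid_gram(K))   # True
```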

Page 57: Support Vector Machine (SVM) and Kernel Methods

Construct Valid Kernels

Given valid kernels $k_1(\mathbf{x},\mathbf{x}')$ and $k_2(\mathbf{x},\mathbf{x}')$, the following constructions also give valid kernels:

$k(\mathbf{x},\mathbf{x}') = c\,k_1(\mathbf{x},\mathbf{x}')$, where $c > 0$

$k(\mathbf{x},\mathbf{x}') = f(\mathbf{x})\,k_1(\mathbf{x},\mathbf{x}')\,f(\mathbf{x}')$, where $f(\cdot)$ is any function

$k(\mathbf{x},\mathbf{x}') = q\bigl(k_1(\mathbf{x},\mathbf{x}')\bigr)$, where $q(\cdot)$ is a polynomial with coefficients $\ge 0$

$k(\mathbf{x},\mathbf{x}') = \exp\bigl(k_1(\mathbf{x},\mathbf{x}')\bigr)$

$k(\mathbf{x},\mathbf{x}') = k_1(\mathbf{x},\mathbf{x}') + k_2(\mathbf{x},\mathbf{x}')$

$k(\mathbf{x},\mathbf{x}') = k_1(\mathbf{x},\mathbf{x}')\,k_2(\mathbf{x},\mathbf{x}')$

$k(\mathbf{x},\mathbf{x}') = k_3\bigl(\boldsymbol{\phi}(\mathbf{x}),\boldsymbol{\phi}(\mathbf{x}')\bigr)$, where $\boldsymbol{\phi}(\mathbf{x})$ is a function from $\mathbf{x}$ to $\mathbb{R}^M$ and $k_3(\cdot,\cdot)$ is a valid kernel in $\mathbb{R}^M$

$k(\mathbf{x},\mathbf{x}') = \mathbf{x}^T\mathbf{A}\mathbf{x}'$, where $\mathbf{A}$ is a symmetric positive semi-definite matrix

$k(\mathbf{x},\mathbf{x}') = k_a(\mathbf{x}_a,\mathbf{x}_a') + k_b(\mathbf{x}_b,\mathbf{x}_b')$ and $k(\mathbf{x},\mathbf{x}') = k_a(\mathbf{x}_a,\mathbf{x}_a')\,k_b(\mathbf{x}_b,\mathbf{x}_b')$, where $\mathbf{x}_a$ and $\mathbf{x}_b$ are variables (not necessarily disjoint) with $\mathbf{x} = (\mathbf{x}_a,\mathbf{x}_b)$, and $k_a$ and $k_b$ are valid kernel functions over their respective spaces

[Bishop]

Page 58: Support Vector Machine (SVM) and Kernel Methods

Extending linear methods to kernelized ones

Kernelized versions of linear methods

Linear methods are popular: unique optimal solutions, faster learning algorithms, and better analysis

However, we often require nonlinear methods in real-world problems, and so we can use kernel-based versions of these linear algorithms

If a problem can be formulated such that feature vectors appear only in inner products, we can extend it to a kernelized one

When we only need to know the dot products of all pairs of data points (i.e., the geometry of the data in the feature space)

By replacing inner products with kernels in linear algorithms, we obtain very flexible methods

We can operate in the mapped space without ever computing the coordinates of the data in that space

Page 59: Support Vector Machine (SVM) and Kernel Methods

What information can be obtained from the kernel?

Example: we know all pairwise distances:

$$d\bigl(\boldsymbol{\phi}(\mathbf{x}),\boldsymbol{\phi}(\mathbf{z})\bigr)^2 = \|\boldsymbol{\phi}(\mathbf{x}) - \boldsymbol{\phi}(\mathbf{z})\|^2 = k(\mathbf{x},\mathbf{x}) + k(\mathbf{z},\mathbf{z}) - 2k(\mathbf{x},\mathbf{z})$$

Therefore, we also know the distance of points from the center of mass of a set.

Many dimensionality reduction, clustering, and classification methods can be described in terms of pairwise distances.

This allows us to introduce kernelized versions of them.
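A minimal sketch of this distance computation using only kernel evaluations; the degree-2 polynomial kernel and the toy vectors are illustrative choices.

```python
import numpy as np

def kernel_distance(x, z, k):
    """Distance in the feature space, computed from the kernel alone:
    ||phi(x) - phi(z)||^2 = k(x,x) + k(z,z) - 2 k(x,z)."""
    return np.sqrt(k(x, x) + k(z, z) - 2 * k(x, z))

poly2 = lambda x, z: (x @ z + 1) ** 2
x, z = np.array([1.0, 0.0]), np.array([0.0, 2.0])
print(kernel_distance(x, z, poly2))
```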

Page 60: Support Vector Machine (SVM) and Kernel Methods

Kernels for structured data

Kernels can also be defined on general types of data

Kernel functions do not need to be defined over vectors: we just need a symmetric positive definite kernel matrix

Thus, many algorithms can work with general (non-vectorial) data

Kernels exist to embed strings, trees, graphs, …

This may be even more important than nonlinearity: we can use kernel-based versions of the classical learning algorithms for recognition of structured data

Page 61: Support Vector Machine (SVM) and Kernel Methods

Kernel function for objects

Sets: an example of a kernel function for sets (a minimal code sketch follows at the end of this page):

$$k(A,B) = 2^{|A \cap B|}$$

Strings: the inner product of the feature vectors for two strings can be defined, e.g., as a sum over all common subsequences, weighted according to their frequency of occurrence and lengths.

Example strings: A E G A T E A G G and E G T E A G A E G A T G
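For the set kernel mentioned above, a minimal sketch; the reading of $k(A,B) = 2^{|A \cap B|}$ as counting the common subsets of $A$ and $B$ is one standard interpretation of its feature map.

```python
def set_kernel(A, B):
    """k(A, B) = 2^|A ∩ B|: the number of subsets shared by A and B."""
    return 2 ** len(set(A) & set(B))

print(set_kernel({"a", "b", "c"}, {"b", "c", "d"}))   # |A ∩ B| = 2, so 2^2 = 4
```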

Page 62: Support Vector Machine (SVM) and Kernel Methods

Kernel trick advantages: summary

Operating in the mapped space without ever computing the coordinates of the data in that space

Besides vectors, we can introduce kernel functions for structured data (graphs, strings, etc.)

Much of the geometry of the data in the embedding space is contained in all pairwise dot products

In many cases, the inner product in the embedding space can be computed efficiently.

Page 63: Support Vector Machine (SVM) and Kernel Methods

SVM: Summary

Hard margin: maximizing the margin

Soft margin: handling noisy data and overlapping classes

Slack variables in the problem

Dual problems of hard-margin and soft-margin SVM

Lead us easily to the non-linear SVM method by kernel substitution

Also express the classifier decision in terms of support vectors

Kernel SVMs

Learn a linear decision boundary in a high-dimensional space without explicitly working on the mapped data