Support Vector Machine (SVM) and Kernel Methods


CE-717: Machine Learning, Sharif University of Technology

Fall 2015

Soleymani


Outline

Margin concept

Hard-Margin SVM

Soft-Margin SVM

Dual Problems of Hard-Margin SVM and Soft-Margin SVM

Nonlinear SVM

Kernel trick

Kernel methods


Margin


Which line is better to select as the boundary to provide more generalization capability?

Margin for a hyperplane that separates samples of two linearly separable classes is:

The smallest distance between the decision boundary and any of the training samples


Larger margin provides better generalization to unseen data

Maximum margin


SVM finds the solution with maximum margin

Solution: a hyperplane that is farthest from all training samples

The hyperplane with the largest margin has equal distances to the nearest sample of both classes

[Figure: two separating hyperplanes in the $(x_1, x_2)$ plane; the one with the larger margin is preferred]

Hard-margin SVM: Optimization problem

$\max_{M,\,\boldsymbol{w},\,w_0} \; \dfrac{2M}{\|\boldsymbol{w}\|}$

s.t. $\boldsymbol{w}^T\boldsymbol{x}^{(i)} + w_0 \ge M \quad \forall\, \boldsymbol{x}^{(i)} \in C_1$

  $\boldsymbol{w}^T\boldsymbol{x}^{(i)} + w_0 \le -M \quad \forall\, \boldsymbol{x}^{(i)} \in C_2$

[Figure: decision boundary $\boldsymbol{w}^T\boldsymbol{x} + w_0 = 0$ with margin hyperplanes $\boldsymbol{w}^T\boldsymbol{x} + w_0 = M$ (class $y^{(i)} = 1$) and $\boldsymbol{w}^T\boldsymbol{x} + w_0 = -M$ (class $y^{(i)} = -1$); each margin hyperplane lies at distance $M/\|\boldsymbol{w}\|$ from the boundary, so the margin is $2M/\|\boldsymbol{w}\|$]

Hard-margin SVM: Optimization problem

$\max_{M,\,\boldsymbol{w},\,w_0} \; \dfrac{2M}{\|\boldsymbol{w}\|}$

s.t. $y^{(i)}\left(\boldsymbol{w}^T\boldsymbol{x}^{(i)} + w_0\right) \ge M \quad i = 1, \dots, N$

[Figure: boundary $\boldsymbol{w}^T\boldsymbol{x} + w_0 = 0$ and margin hyperplanes $\boldsymbol{w}^T\boldsymbol{x} + w_0 = \pm M$, each at distance $M/\|\boldsymbol{w}\|$ from the boundary]

$M = \min_{i=1,\dots,N} \; y^{(i)}\left(\boldsymbol{w}^T\boldsymbol{x}^{(i)} + w_0\right)$

Hard-margin SVM: Optimization problem

We can set $\boldsymbol{w}' = \dfrac{\boldsymbol{w}}{M}$ and $w_0' = \dfrac{w_0}{M}$:

$\max_{\boldsymbol{w}',\,w_0'} \; \dfrac{2}{\|\boldsymbol{w}'\|}$

s.t. $y^{(i)}\left(\boldsymbol{w}'^T\boldsymbol{x}^{(i)} + w_0'\right) \ge 1 \quad i = 1, \dots, N$

[Figure: boundary $\boldsymbol{w}'^T\boldsymbol{x} + w_0' = 0$ and margin hyperplanes $\boldsymbol{w}'^T\boldsymbol{x} + w_0' = \pm 1$, each at distance $1/\|\boldsymbol{w}'\|$ from the boundary]

The positions of the boundary and the margin hyperplanes do not change.

Hard-margin SVM: Optimization problem


We can equivalently optimize:

min๐’˜,๐‘ค0

1

2๐’˜ 2

s. t. ๐‘ฆ ๐‘– ๐’˜๐‘‡๐’™ ๐‘– + ๐‘ค0 โ‰ฅ 1 ๐‘– = 1, โ€ฆ , ๐‘

It is a convex Quadratic Programming (QP) problem

There are computationally efficient packages to solve it.

It has a global minimum (if a feasible solution exists).

When training samples are not linearly separable, it has no solution.

How can we extend it to find a solution even when the classes are not linearly separable?
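
Before moving on, here is a minimal sketch that hands the hard-margin primal QP above to CVXPY as a generic convex solver; the function name, the data layout (rows of `X` as samples, labels `y` in {−1, +1}), and the toy data are illustrative assumptions of this note, not part of the slides, and the problem is infeasible if the classes are not linearly separable.

```python
# Hard-margin SVM primal QP: min 1/2 ||w||^2  s.t.  y_i (w^T x_i + w_0) >= 1 for all i.
import cvxpy as cp
import numpy as np

def hard_margin_svm(X: np.ndarray, y: np.ndarray):
    N, d = X.shape
    w = cp.Variable(d)
    w0 = cp.Variable()
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    # One margin constraint per training sample.
    constraints = [cp.multiply(y, X @ w + w0) >= 1]
    cp.Problem(objective, constraints).solve()
    return w.value, w0.value

# Toy usage on two separable clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
w, w0 = hard_margin_svm(X, y)
```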

Beyond linear separability


How can we extend the hard-margin SVM to allow classification errors?

Overlapping classes that can be approximately separated by a linear boundary

Noise in the linearly separable classes


Beyond linear separability: Soft-margin SVM


Minimizing the number of misclassified points?!

NP-complete

Soft margin:

Maximizing a margin while trying to minimize the distance between misclassified points and their correct margin plane

Soft-margin SVM


SVM with slack variables: allows samples to fall within the margin, but penalizes them

min๐’˜,๐‘ค0, ๐œ‰๐‘– ๐‘–=1

๐‘

12

๐’˜ 2 + ๐ถ

๐‘–=1

๐‘

๐œ‰๐‘–

s. t. ๐‘ฆ ๐‘– ๐’˜๐‘‡๐’™ ๐‘– + ๐‘ค0 โ‰ฅ 1 โˆ’ ๐œ‰๐‘– ๐‘– = 1, โ€ฆ , ๐‘

๐œ‰๐‘– โ‰ฅ 0

๐œ‰๐‘– : slack variables

๐œ‰๐‘– > 1: if ๐’™ ๐‘– misclassifed

0 < ๐œ‰๐‘– < 1 : if ๐’™ ๐‘– correctly

classified but inside margin


Soft-margin SVM


Linear penalty (hinge loss) for a sample if it is misclassified or lies inside the margin

Tries to keep $\xi_i$ small while maximizing the margin.

Always finds a solution (as opposed to the hard-margin SVM)

More robust to outliers

The soft-margin problem is still a convex QP


Soft-margin SVM: Parameter $C$

$C$ is a tradeoff parameter:

small $C$ allows margin constraints to be easily ignored: large margin

large $C$ makes constraints hard to ignore: narrow margin

$C \to \infty$ enforces all constraints: hard margin

$C$ can be determined using a technique like cross-validation
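
As an illustration of the last point, the hedged sketch below picks $C$ by cross-validation with scikit-learn; the candidate grid and the toy data are arbitrary choices for this note, not values prescribed by the slides.

```python
# Selecting the soft-margin tradeoff C by 5-fold cross-validation (illustrative values).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.0, (50, 2)), rng.normal(1, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

param_grid = {"C": [0.01, 0.1, 1, 10, 100]}        # candidate tradeoff values
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)                          # C chosen by cross-validation
```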

Soft-margin SVM: Cost function

$\min_{\boldsymbol{w},\,w_0,\,\{\xi_i\}_{i=1}^{N}} \; \dfrac{1}{2}\|\boldsymbol{w}\|^2 + C \sum_{i=1}^{N} \xi_i$

s.t. $y^{(i)}\left(\boldsymbol{w}^T\boldsymbol{x}^{(i)} + w_0\right) \ge 1 - \xi_i \quad i = 1, \dots, N$

  $\xi_i \ge 0$

It is equivalent to the unconstrained optimization problem:

$\min_{\boldsymbol{w},\,w_0} \; \dfrac{1}{2}\|\boldsymbol{w}\|^2 + C \sum_{i=1}^{N} \max\left(0,\, 1 - y^{(i)}\left(\boldsymbol{w}^T\boldsymbol{x}^{(i)} + w_0\right)\right)$
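
To make the unconstrained form concrete, the following sketch minimizes this hinge-loss objective with plain (sub)gradient descent in NumPy; the step size, iteration count, and function name are illustrative assumptions rather than anything prescribed by the slides.

```python
# Subgradient descent on 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + w_0)).
import numpy as np

def hinge_svm_gd(X, y, C=1.0, lr=1e-3, iters=2000):
    N, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + w0)
        active = margins < 1                              # samples with non-zero hinge loss
        grad_w = w - C * (y[active][:, None] * X[active]).sum(axis=0)
        grad_w0 = -C * y[active].sum()
        w -= lr * grad_w
        w0 -= lr * grad_w0
    return w, w0
```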

SVM loss function


Hinge loss vs. 0-1 loss

[Figure: the 0-1 loss and the hinge loss $\max\left(0,\, 1 - y\left(\boldsymbol{w}^T\boldsymbol{x} + w_0\right)\right)$ plotted against $\boldsymbol{w}^T\boldsymbol{x} + w_0$ for $y = 1$]

Dual formulation of the SVM

We are going to introduce the dual SVM problem, which is equivalent to the original primal problem. The dual problem:

is often easier to solve

enables us to exploit the kernel trick

gives us further insights into the optimal hyperplane

Optimization: Lagrangian multipliers

$p^* = \min_{\boldsymbol{x}} f(\boldsymbol{x})$

s.t. $g_i(\boldsymbol{x}) \le 0 \quad i = 1, \dots, m$

  $h_i(\boldsymbol{x}) = 0 \quad i = 1, \dots, p$

$\mathcal{L}(\boldsymbol{x}, \boldsymbol{\alpha}, \boldsymbol{\lambda}) = f(\boldsymbol{x}) + \sum_{i=1}^{m} \alpha_i g_i(\boldsymbol{x}) + \sum_{i=1}^{p} \lambda_i h_i(\boldsymbol{x})$

$\max_{\alpha_i \ge 0,\, \lambda_i} \mathcal{L}(\boldsymbol{x}, \boldsymbol{\alpha}, \boldsymbol{\lambda}) = \begin{cases} \infty & \text{if any } g_i(\boldsymbol{x}) > 0 \\ \infty & \text{if any } h_i(\boldsymbol{x}) \ne 0 \\ f(\boldsymbol{x}) & \text{otherwise} \end{cases}$

$p^* = \min_{\boldsymbol{x}} \max_{\alpha_i \ge 0,\, \lambda_i} \mathcal{L}(\boldsymbol{x}, \boldsymbol{\alpha}, \boldsymbol{\lambda})$

Lagrangian multipliers: $\boldsymbol{\alpha} = [\alpha_1, \dots, \alpha_m]$, $\boldsymbol{\lambda} = [\lambda_1, \dots, \lambda_p]$

Optimization: Dual problem

In general, we have:

$\max_{x} \min_{y} f(x, y) \le \min_{y} \max_{x} f(x, y)$

Primal problem: $p^* = \min_{\boldsymbol{x}} \max_{\alpha_i \ge 0,\, \lambda_i} \mathcal{L}(\boldsymbol{x}, \boldsymbol{\alpha}, \boldsymbol{\lambda})$

Dual problem: $d^* = \max_{\alpha_i \ge 0,\, \lambda_i} \min_{\boldsymbol{x}} \mathcal{L}(\boldsymbol{x}, \boldsymbol{\alpha}, \boldsymbol{\lambda})$

Obtained by swapping the order of min and max

$d^* \le p^*$

When the original problem is convex ($f$ and the $g_i$ are convex functions and the $h_i$ are affine), we have strong duality: $d^* = p^*$

Hard-margin SVM: Dual problem

$\min_{\boldsymbol{w},\,w_0} \; \dfrac{1}{2}\|\boldsymbol{w}\|^2$

s.t. $y^{(i)}\left(\boldsymbol{w}^T\boldsymbol{x}^{(i)} + w_0\right) \ge 1 \quad i = 1, \dots, N$

By incorporating the constraints through Lagrangian multipliers, we will have:

$\min_{\boldsymbol{w},\,w_0} \max_{\{\alpha_i \ge 0\}} \; \dfrac{1}{2}\|\boldsymbol{w}\|^2 + \sum_{i=1}^{N} \alpha_i \left(1 - y^{(i)}\left(\boldsymbol{w}^T\boldsymbol{x}^{(i)} + w_0\right)\right)$

Dual problem (changing the order of min and max in the above problem):

$\max_{\{\alpha_i \ge 0\}} \min_{\boldsymbol{w},\,w_0} \; \dfrac{1}{2}\|\boldsymbol{w}\|^2 + \sum_{i=1}^{N} \alpha_i \left(1 - y^{(i)}\left(\boldsymbol{w}^T\boldsymbol{x}^{(i)} + w_0\right)\right)$

Hard-margin SVM: Dual problem

$\max_{\{\alpha_i \ge 0\}} \min_{\boldsymbol{w},\,w_0} \; \mathcal{L}(\boldsymbol{w}, w_0, \boldsymbol{\alpha})$

$\mathcal{L}(\boldsymbol{w}, w_0, \boldsymbol{\alpha}) = \dfrac{1}{2}\|\boldsymbol{w}\|^2 + \sum_{i=1}^{N} \alpha_i \left(1 - y^{(i)}\left(\boldsymbol{w}^T\boldsymbol{x}^{(i)} + w_0\right)\right)$

$\nabla_{\boldsymbol{w}} \mathcal{L}(\boldsymbol{w}, w_0, \boldsymbol{\alpha}) = 0 \;\Rightarrow\; \boldsymbol{w} = \sum_{i=1}^{N} \alpha_i y^{(i)} \boldsymbol{x}^{(i)}$

$\dfrac{\partial \mathcal{L}(\boldsymbol{w}, w_0, \boldsymbol{\alpha})}{\partial w_0} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y^{(i)} = 0$

$w_0$ does not appear; instead, a "global" constraint on $\boldsymbol{\alpha}$ is created.

Hard-margin SVM: Dual problem

$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \dfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} \boldsymbol{x}^{(i)T} \boldsymbol{x}^{(j)}$

subject to $\sum_{i=1}^{N} \alpha_i y^{(i)} = 0$

  $\alpha_i \ge 0 \quad i = 1, \dots, N$

It is a convex QP.

By solving the above problem we first find $\boldsymbol{\alpha}$ and then $\boldsymbol{w} = \sum_{i=1}^{N} \alpha_i y^{(i)} \boldsymbol{x}^{(i)}$

$w_0$ can be set by making the margin equidistant to the classes (we discuss this more after defining support vectors).
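
As an illustrative companion (not the slides' own code), the sketch below solves this dual QP with CVXPY and recovers $\boldsymbol{w}$ from the optimal $\alpha_i$; the variable and function names are assumptions of this note.

```python
# Hard-margin SVM dual: maximize sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j <x_i, x_j>
# s.t. sum_i a_i y_i = 0 and a_i >= 0. Uses the identity
# sum_ij a_i a_j y_i y_j <x_i, x_j> = || sum_i a_i y_i x_i ||^2 to keep the objective concave.
import cvxpy as cp
import numpy as np

def hard_margin_dual(X, y):
    N = X.shape[0]
    Xy = y[:, None] * X                       # row i is y_i * x_i
    a = cp.Variable(N)
    objective = cp.Maximize(cp.sum(a) - 0.5 * cp.sum_squares(Xy.T @ a))
    constraints = [a >= 0, y @ a == 0]
    cp.Problem(objective, constraints).solve()
    alpha = a.value
    w = Xy.T @ alpha                           # w = sum_i alpha_i y_i x_i
    return alpha, w
```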

Karush-Kuhn-Tucker (KKT) conditions

Necessary conditions for the solution $[\boldsymbol{w}^*, w_0^*, \boldsymbol{\alpha}^*]$:

$\nabla_{\boldsymbol{w}} \mathcal{L}(\boldsymbol{w}, w_0, \boldsymbol{\alpha})\big|_{\boldsymbol{w}^*, w_0^*, \boldsymbol{\alpha}^*} = 0$

$\dfrac{\partial \mathcal{L}(\boldsymbol{w}, w_0, \boldsymbol{\alpha})}{\partial w_0}\Big|_{\boldsymbol{w}^*, w_0^*, \boldsymbol{\alpha}^*} = 0$

$\alpha_i^* \ge 0 \quad i = 1, \dots, N$

$y^{(i)}\left(\boldsymbol{w}^{*T}\boldsymbol{x}^{(i)} + w_0^*\right) \ge 1 \quad i = 1, \dots, N$

$\alpha_i^*\left(1 - y^{(i)}\left(\boldsymbol{w}^{*T}\boldsymbol{x}^{(i)} + w_0^*\right)\right) = 0 \quad i = 1, \dots, N$

In general, the optimal $(\boldsymbol{x}^*, \boldsymbol{\alpha}^*)$ satisfies the KKT conditions:

$\nabla_{\boldsymbol{x}} \mathcal{L}(\boldsymbol{x}, \boldsymbol{\alpha})\big|_{\boldsymbol{x}^*, \boldsymbol{\alpha}^*} = 0$

$\alpha_i^* \ge 0 \quad i = 1, \dots, m$

$g_i(\boldsymbol{x}^*) \le 0 \quad i = 1, \dots, m$

$\alpha_i^* g_i(\boldsymbol{x}^*) = 0 \quad i = 1, \dots, m$

Hard-margin SVM: Support vectors

Inactive constraint: $y^{(i)}\left(\boldsymbol{w}^T\boldsymbol{x}^{(i)} + w_0\right) > 1$
$\Rightarrow \alpha_i = 0$, and thus $\boldsymbol{x}^{(i)}$ is not a support vector.

Active constraint: $y^{(i)}\left(\boldsymbol{w}^T\boldsymbol{x}^{(i)} + w_0\right) = 1$
$\Rightarrow \alpha_i$ can be greater than 0, and thus $\boldsymbol{x}^{(i)}$ can be a support vector.

[Figure: samples lying on the margin hyperplanes (at distance $1/\|\boldsymbol{w}\|$ from the boundary) have $\alpha > 0$]

Hard-margin SVM: Support vectors


A sample with $\alpha_i = 0$ can also lie on one of the margin hyperplanes.

Hard-margin SVM: Support vectors

Support Vectors (SVs) $= \left\{\boldsymbol{x}^{(i)} \mid \alpha_i > 0\right\}$

The direction of the hyperplane can be found based only on the support vectors:

$\boldsymbol{w} = \sum_{\alpha_i > 0} \alpha_i y^{(i)} \boldsymbol{x}^{(i)}$

Hard-margin SVM: Dual problem

Classifying new samples using only SVs

Classification of a new sample $\boldsymbol{x}$:

$y = \mathrm{sign}\left(w_0 + \boldsymbol{w}^T\boldsymbol{x}\right)$

$y = \mathrm{sign}\Big(w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \boldsymbol{x}^{(i)T} \boldsymbol{x}\Big)$

The classifier is based on an expansion in terms of dot products of $\boldsymbol{x}$ with the support vectors.

Support vectors are sufficient to predict the labels of new samples.

How to find $w_0$

At least one of the $\alpha_i$'s is strictly positive ($\alpha_s > 0$ for a support vector).

The KKT complementary slackness condition tells us that the constraint corresponding to this non-zero $\alpha_s$ holds with equality.

$w_0$ is found such that the following constraint is satisfied:

$y^{(s)}\left(w_0 + \boldsymbol{w}^T\boldsymbol{x}^{(s)}\right) = 1$

We can find $w_0$ using one of the support vectors:

$\Rightarrow w_0 = y^{(s)} - \boldsymbol{w}^T\boldsymbol{x}^{(s)} = y^{(s)} - \sum_{\alpha_n > 0} \alpha_n y^{(n)} \boldsymbol{x}^{(n)T} \boldsymbol{x}^{(s)}$

Hard-margin SVM: Dual problem

$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \dfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} \boldsymbol{x}^{(i)T} \boldsymbol{x}^{(j)}$

subject to $\sum_{i=1}^{N} \alpha_i y^{(i)} = 0$

  $\alpha_i \ge 0 \quad i = 1, \dots, N$

Only the dot product of each pair of training samples appears in the optimization problem.

This is an important property that helps extend SVM to the non-linear case (the cost function does not depend explicitly on the dimensionality of the feature space).

Soft-margin SVM: Dual problem

$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \dfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} \boldsymbol{x}^{(i)T} \boldsymbol{x}^{(j)}$

subject to $\sum_{i=1}^{N} \alpha_i y^{(i)} = 0$

  $0 \le \alpha_i \le C \quad i = 1, \dots, N$

By solving the above quadratic problem we first find $\boldsymbol{\alpha}$, then find $\boldsymbol{w} = \sum_{i=1}^{N} \alpha_i y^{(i)} \boldsymbol{x}^{(i)}$, and $w_0$ is computed from the SVs.

For a test sample $\boldsymbol{x}$ (as before):

$y = \mathrm{sign}\left(w_0 + \boldsymbol{w}^T\boldsymbol{x}\right) = \mathrm{sign}\Big(w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \boldsymbol{x}^{(i)T} \boldsymbol{x}\Big)$

Soft-margin SVM: Support vectors

Support Vectors: $\alpha_i > 0$

If $0 < \alpha_i < C$: SVs on the margin, $\xi_i = 0$.

If $\alpha_i = C$: SVs on or over the margin.

Primal vs. dual hard-margin SVM problem

Primal problem of hard-margin SVM:

$N$ inequality constraints

$d + 1$ variables

Dual problem of hard-margin SVM:

one equality constraint

$N$ positivity constraints

$N$ variables (Lagrange multipliers)

a more complicated objective function

The dual problem is helpful and instrumental for using the kernel trick.

Primal vs. dual soft-margin SVM problem

Primal problem of soft-margin SVM:

$N$ inequality constraints

$N$ positivity constraints

$N + d + 1$ variables

Dual problem of soft-margin SVM:

one equality constraint

$2N$ positivity constraints

$N$ variables (Lagrange multipliers)

a more complicated objective function

The dual problem is helpful and instrumental for using the kernel trick.

Not linearly separable data

Noisy data or overlapping classes (we discussed this already: soft margin)

Near linearly separable

Non-linear decision surface

Transform to a new feature space


Nonlinear SVM

Assume a transformation $\boldsymbol{\phi}: \mathbb{R}^d \to \mathbb{R}^m$ on the feature space: $\boldsymbol{x} \to \boldsymbol{\phi}(\boldsymbol{x})$

Find a hyperplane in the transformed feature space:

$\boldsymbol{w}^T\boldsymbol{\phi}(\boldsymbol{x}) + w_0 = 0$

[Figure: a non-linear boundary in the original $(x_1, x_2)$ space corresponds to a linear boundary in the transformed $(\phi_1(\boldsymbol{x}), \phi_2(\boldsymbol{x}))$ space]

$\{\phi_1(\boldsymbol{x}), \dots, \phi_m(\boldsymbol{x})\}$: a set of basis functions (or features), with $\phi_i(\boldsymbol{x}): \mathbb{R}^d \to \mathbb{R}$

$\boldsymbol{\phi}(\boldsymbol{x}) = [\phi_1(\boldsymbol{x}), \dots, \phi_m(\boldsymbol{x})]$

Soft-margin SVM in a transformed space: Primal problem

Primal problem:

min๐’˜,๐‘ค0

1

2๐’˜ 2 + ๐ถ

๐‘–=1

๐‘

๐œ‰๐‘–

s. t. ๐‘ฆ ๐‘– ๐’˜๐‘‡๐“(๐’™ ๐‘– ) + ๐‘ค0 โ‰ฅ 1 โˆ’ ๐œ‰๐‘– ๐‘– = 1, โ€ฆ , ๐‘

๐œ‰๐‘– โ‰ฅ 0

๐’˜ โˆˆ โ„๐‘š: the weights that must be found

If ๐‘š โ‰ซ ๐‘‘ (very high dimensional feature space) then there are

many more parameters to learn

Soft-margin SVM in a transformed space: Dual problem

Optimization problem:

max๐œถ

๐‘–=1

๐‘

๐›ผ๐‘– โˆ’1

2

๐‘–=1

๐‘

๐‘—=1

๐‘

๐›ผ๐‘–๐›ผ๐‘—๐‘ฆ(๐‘–)๐‘ฆ(๐‘—)๐“ ๐’™(๐‘–) ๐‘‡

๐“ ๐’™(๐‘–)

Subject to ๐‘–=1๐‘ ๐›ผ๐‘–๐‘ฆ

(๐‘–) = 0

0 โ‰ค ๐›ผ๐‘– โ‰ค ๐ถ ๐‘– = 1, โ€ฆ , ๐‘›

If we have inner products ๐“ ๐’™(๐‘–) ๐‘‡๐“ ๐’™(๐‘—) , only ๐œถ

= [๐›ผ1, โ€ฆ , ๐›ผ๐‘] needs to be learnt.

It is not necessary to learn ๐‘š parameters as opposed to the primal

problem

Kernelized soft-margin SVM

Optimization problem:

$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \dfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} \underbrace{\boldsymbol{\phi}(\boldsymbol{x}^{(i)})^T \boldsymbol{\phi}(\boldsymbol{x}^{(j)})}_{k(\boldsymbol{x}^{(i)},\,\boldsymbol{x}^{(j)})}$

subject to $\sum_{i=1}^{N} \alpha_i y^{(i)} = 0$

  $0 \le \alpha_i \le C \quad i = 1, \dots, N$

Classifying a new sample $\boldsymbol{x}$:

$y = \mathrm{sign}\left(w_0 + \boldsymbol{w}^T\boldsymbol{\phi}(\boldsymbol{x})\right) = \mathrm{sign}\Big(w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \underbrace{\boldsymbol{\phi}(\boldsymbol{x}^{(i)})^T \boldsymbol{\phi}(\boldsymbol{x})}_{k(\boldsymbol{x}^{(i)},\,\boldsymbol{x})}\Big)$

$k(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{\phi}(\boldsymbol{x})^T \boldsymbol{\phi}(\boldsymbol{x}')$
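
The practical upshot is that the optimizer only ever needs a kernel (Gram) matrix. Here is a hedged sketch using scikit-learn's precomputed-kernel interface; the RBF kernel, the synthetic data, and all parameter values are illustrative choices of this note, not part of the slides.

```python
# Training an SVM from a precomputed kernel (Gram) matrix only.
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(A, B, sigma=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for all pairs of rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1)        # a non-linear (circular) concept

K = rbf_kernel(X, X)                                # N x N kernel matrix
clf = SVC(kernel="precomputed", C=1.0).fit(K, y)

X_test = rng.normal(size=(10, 2))
K_test = rbf_kernel(X_test, X)                      # kernel between test and training points
pred = clf.predict(K_test)                          # predictions from kernel values alone
```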

Kernel trick

Kernel: the dot product in the mapped feature space $\mathcal{F}$: $k(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{\phi}(\boldsymbol{x})^T \boldsymbol{\phi}(\boldsymbol{x}')$, where $\boldsymbol{\phi}(\boldsymbol{x}) = [\phi_1(\boldsymbol{x}), \dots, \phi_m(\boldsymbol{x})]^T$

$k(\boldsymbol{x}, \boldsymbol{x}')$ shows the similarity of $\boldsymbol{x}$ and $\boldsymbol{x}'$.

The distance in the feature space can be found from the kernel accordingly.

Kernel trick → extension of many well-known algorithms to kernel-based ones, by substituting the dot product with the kernel function

Kernel trick: Idea

Idea: when the input vectors appear only in the form of dot products, we can use other choices of kernel function instead of the inner product

Solving the problem without explicitly mapping the data

Explicit mapping is expensive if $\boldsymbol{\phi}(\boldsymbol{x})$ is very high dimensional

Instead of using a mapping $\boldsymbol{\phi}: \mathcal{X} \to \mathcal{F}$ to represent $\boldsymbol{x} \in \mathcal{X}$ by $\boldsymbol{\phi}(\boldsymbol{x}) \in \mathcal{F}$, a similarity function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is used.

The data set can be represented by the $N \times N$ matrix $\boldsymbol{K}$.

We specify only an inner product function between points in the transformed space (not their coordinates)

In many cases, the inner product in the embedding space can be computed efficiently.

Why kernel?

Kernel functions $K$ can indeed be efficiently computed, with a cost proportional to $d$ instead of $m$.

Example: consider the second-order polynomial transform:

$\boldsymbol{\phi}(\boldsymbol{x}) = \left[1,\, x_1, \dots, x_d,\, x_1^2,\, x_1x_2, \dots, x_dx_d\right]^T$

$\boldsymbol{\phi}(\boldsymbol{x})^T \boldsymbol{\phi}(\boldsymbol{x}') = 1 + \sum_{i=1}^{d} x_i x_i' + \sum_{i=1}^{d} \sum_{j=1}^{d} x_i x_j x_i' x_j'$

Since $\sum_{i=1}^{d} \sum_{j=1}^{d} x_i x_j x_i' x_j' = \left(\sum_{i=1}^{d} x_i x_i'\right)\left(\sum_{j=1}^{d} x_j x_j'\right)$, this simplifies to

$\boldsymbol{\phi}(\boldsymbol{x})^T \boldsymbol{\phi}(\boldsymbol{x}') = 1 + \boldsymbol{x}^T\boldsymbol{x}' + \left(\boldsymbol{x}^T\boldsymbol{x}'\right)^2$

The transformed space has dimension $m = 1 + d + d^2$, so computing the inner product explicitly costs $O(m)$, while evaluating the right-hand side costs only $O(d)$.

Polynomial kernel: Efficient kernel function

An efficient kernel function relies on a carefully constructed specific transform that allows fast computation

Example: for polynomial kernels, certain combinations of coefficients still make $K$ easy to compute ($\gamma > 0$ and $c > 0$):

$\boldsymbol{\phi}(\boldsymbol{x}) = \left[c \times 1,\; \sqrt{2c\gamma}\, x_1, \dots, \sqrt{2c\gamma}\, x_d,\; \gamma x_1^2,\; \gamma x_1 x_2, \dots\right]$

Polynomial kernel: Degree two

We instead use $k(\boldsymbol{x}, \boldsymbol{x}') = \left(\boldsymbol{x}^T\boldsymbol{x}' + 1\right)^2$, which corresponds to (for a $d$-dimensional input $\boldsymbol{x} = [x_1, \dots, x_d]^T$):

$\boldsymbol{\phi}(\boldsymbol{x}) = \left[1,\, \sqrt{2}x_1, \dots, \sqrt{2}x_d,\, x_1^2, \dots, x_d^2,\, \sqrt{2}x_1x_2, \dots, \sqrt{2}x_1x_d,\, \sqrt{2}x_2x_3, \dots, \sqrt{2}x_{d-1}x_d\right]^T$

or, equivalently,

$\boldsymbol{\phi}(\boldsymbol{x}) = \left[1,\, \sqrt{2}x_1, \dots, \sqrt{2}x_d,\, x_1^2, \dots, x_d^2,\, x_1x_2, \dots, x_1x_d,\, x_2x_1, \dots, x_{d-1}x_d\right]^T$

Polynomial kernel

In general, $k(\boldsymbol{x}, \boldsymbol{z}) = \left(\boldsymbol{x}^T\boldsymbol{z}\right)^M$ contains all monomials of order $M$.

This can similarly be generalized to include all terms up to degree $M$: $k(\boldsymbol{x}, \boldsymbol{z}) = \left(\boldsymbol{x}^T\boldsymbol{z} + c\right)^M$ ($c > 0$)

Example: the SVM boundary for a polynomial kernel

$w_0 + \boldsymbol{w}^T\boldsymbol{\phi}(\boldsymbol{x}) = 0$

$\Rightarrow w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \boldsymbol{\phi}(\boldsymbol{x}^{(i)})^T \boldsymbol{\phi}(\boldsymbol{x}) = 0$

$\Rightarrow w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} k(\boldsymbol{x}^{(i)}, \boldsymbol{x}) = 0$

$\Rightarrow w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \left(\boldsymbol{x}^{(i)T}\boldsymbol{x} + c\right)^M = 0$

The boundary is a polynomial of order $M$.

Some common kernel functions

$K$ is a mathematical and computational shortcut that allows us to combine the transform and the inner product into a single, more efficient function.

Linear: $k(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{x}^T\boldsymbol{x}'$

Polynomial: $k(\boldsymbol{x}, \boldsymbol{x}') = \left(\boldsymbol{x}^T\boldsymbol{x}' + c\right)^M$, $c \ge 0$

Gaussian: $k(\boldsymbol{x}, \boldsymbol{x}') = \exp\left(-\dfrac{\|\boldsymbol{x} - \boldsymbol{x}'\|^2}{2\sigma^2}\right)$

Sigmoid: $k(\boldsymbol{x}, \boldsymbol{x}') = \tanh\left(a\,\boldsymbol{x}^T\boldsymbol{x}' + b\right)$
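
For reference, these four kernels can be written as short vectorized NumPy functions operating on all pairs of rows of two data matrices (a sketch; the parameter defaults are arbitrary):

```python
# Common kernel functions, computed for all pairs of rows of A (n x d) and B (m x d).
import numpy as np

def linear_kernel(A, B):
    return A @ B.T

def polynomial_kernel(A, B, c=1.0, M=2):
    return (A @ B.T + c) ** M

def gaussian_kernel(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)   # squared Euclidean distances
    return np.exp(-sq / (2 * sigma ** 2))

def sigmoid_kernel(A, B, a=1.0, b=0.0):
    return np.tanh(a * (A @ B.T) + b)
```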

Gaussian kernel

Gaussian: $k(\boldsymbol{x}, \boldsymbol{x}') = \exp\left(-\dfrac{\|\boldsymbol{x} - \boldsymbol{x}'\|^2}{2\sigma^2}\right)$

Infinite dimensional feature space

Example: the SVM boundary for a Gaussian kernel

Considers a Gaussian function around each data point:

$w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \exp\left(-\dfrac{\|\boldsymbol{x} - \boldsymbol{x}^{(i)}\|^2}{2\sigma^2}\right) = 0$

SVM + Gaussian kernel can classify any arbitrary training set

Training error is zero when $\sigma \to 0$: all samples become support vectors (likely overfitting)
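
To experiment with this behaviour, the hedged sketch below trains RBF SVMs for several widths; note that scikit-learn parameterizes the Gaussian kernel as $\exp(-\gamma\|\boldsymbol{x}-\boldsymbol{x}'\|^2)$, so $\gamma = 1/(2\sigma^2)$ in the notation above. The data set and the particular values of $\sigma$ and $C$ are illustrative assumptions of this note.

```python
# Effect of the Gaussian kernel width: small sigma (large gamma) drives training error
# toward zero and turns most samples into support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)           # a non-linear (XOR-like) concept

for sigma in [2.0, 0.25, 0.05]:
    clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2), C=1.0).fit(X, y)
    print(f"sigma={sigma}: train acc={clf.score(X, y):.2f}, #SVs={clf.n_support_.sum()}")
```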

SVM: Linear and Gaussian kernels example

[Figure: decision boundaries for a linear kernel ($C = 0.1$) and a Gaussian kernel ($C = 1$, $\sigma = 0.25$)]

[Zisserman's slides]

Some SVs appear to be far from the boundary?

Hard margin Example

For a narrow Gaussian (small $\sigma$), even the protection of a large margin cannot suppress overfitting.

Y. S. Abu-Mostafa et al., "Learning From Data", 2012

SVM Gaussian kernel: Example

Source: Zisserman's slides

$f(\boldsymbol{x}) = w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \exp\left(-\dfrac{\|\boldsymbol{x} - \boldsymbol{x}^{(i)}\|^2}{2\sigma^2}\right)$

[Figures: a sequence of further Gaussian-kernel SVM examples showing the resulting decision boundaries; source: Zisserman's slides]

Constructing kernels

Construct kernel functions directly

Ensure that it is a valid kernel

Corresponds to an inner product in some feature space.

Example: $k(\boldsymbol{x}, \boldsymbol{x}') = \left(\boldsymbol{x}^T\boldsymbol{x}'\right)^2$

Corresponding mapping: $\boldsymbol{\phi}(\boldsymbol{x}) = \left[x_1^2,\, \sqrt{2}x_1x_2,\, x_2^2\right]^T$ for $\boldsymbol{x} = [x_1, x_2]^T$

We need a way to test whether a kernel is valid without having to construct $\boldsymbol{\phi}(\boldsymbol{x})$

Valid kernel: Necessary & sufficient conditions

Gram matrix $\boldsymbol{K}_{N \times N}$: $K_{ij} = k(\boldsymbol{x}^{(i)}, \boldsymbol{x}^{(j)})$

Restricting the kernel function to a set of points $\{\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \dots, \boldsymbol{x}^{(N)}\}$:

$K = \begin{bmatrix} k(\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(1)}) & \cdots & k(\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(N)}) \\ \vdots & \ddots & \vdots \\ k(\boldsymbol{x}^{(N)}, \boldsymbol{x}^{(1)}) & \cdots & k(\boldsymbol{x}^{(N)}, \boldsymbol{x}^{(N)}) \end{bmatrix}$

Mercer theorem: the kernel matrix is symmetric positive semi-definite (for any choice of data points)

Any symmetric positive definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some space

[Shawe-Taylor & Cristianini 2004]
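
A quick empirical check of this condition on a finite sample (a sketch; passing the check on one sample is necessary for validity but of course not a proof):

```python
# Check that a candidate kernel yields a symmetric PSD Gram matrix on a random sample.
import numpy as np

def gram_matrix(kernel, X):
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

def looks_valid(kernel, X, tol=1e-8):
    K = gram_matrix(kernel, X)
    symmetric = np.allclose(K, K.T)
    psd = np.all(np.linalg.eigvalsh(K) >= -tol)      # eigenvalues of a symmetric matrix
    return symmetric and psd

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
print(looks_valid(lambda a, b: (a @ b) ** 2, X))            # (x^T x')^2 is valid -> True
print(looks_valid(lambda a, b: -np.linalg.norm(a - b), X))  # trace 0, nonzero matrix -> not PSD
```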

Construct Valid Kernels

Given valid kernels $k_1(\boldsymbol{x}, \boldsymbol{x}')$ and $k_2(\boldsymbol{x}, \boldsymbol{x}')$, the following are also valid kernels [Bishop]:

$k(\boldsymbol{x}, \boldsymbol{x}') = c\,k_1(\boldsymbol{x}, \boldsymbol{x}')$, where $c > 0$

$k(\boldsymbol{x}, \boldsymbol{x}') = f(\boldsymbol{x})\,k_1(\boldsymbol{x}, \boldsymbol{x}')\,f(\boldsymbol{x}')$, where $f(\cdot)$ is any function

$k(\boldsymbol{x}, \boldsymbol{x}') = q\left(k_1(\boldsymbol{x}, \boldsymbol{x}')\right)$, where $q(\cdot)$ is a polynomial with coefficients $\ge 0$

$k(\boldsymbol{x}, \boldsymbol{x}') = \exp\left(k_1(\boldsymbol{x}, \boldsymbol{x}')\right)$

$k(\boldsymbol{x}, \boldsymbol{x}') = k_1(\boldsymbol{x}, \boldsymbol{x}') + k_2(\boldsymbol{x}, \boldsymbol{x}')$

$k(\boldsymbol{x}, \boldsymbol{x}') = k_1(\boldsymbol{x}, \boldsymbol{x}')\,k_2(\boldsymbol{x}, \boldsymbol{x}')$

$k(\boldsymbol{x}, \boldsymbol{x}') = k_3\left(\boldsymbol{\phi}(\boldsymbol{x}), \boldsymbol{\phi}(\boldsymbol{x}')\right)$, where $\boldsymbol{\phi}(\boldsymbol{x})$ is a function from $\boldsymbol{x}$ to $\mathbb{R}^M$ and $k_3(\cdot,\cdot)$ is a valid kernel in $\mathbb{R}^M$

$k(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{x}^T \boldsymbol{A}\, \boldsymbol{x}'$, where $\boldsymbol{A}$ is a symmetric positive semi-definite matrix

$k(\boldsymbol{x}, \boldsymbol{x}') = k_a(\boldsymbol{x}_a, \boldsymbol{x}_a') + k_b(\boldsymbol{x}_b, \boldsymbol{x}_b')$ and $k(\boldsymbol{x}, \boldsymbol{x}') = k_a(\boldsymbol{x}_a, \boldsymbol{x}_a')\,k_b(\boldsymbol{x}_b, \boldsymbol{x}_b')$, where $\boldsymbol{x}_a$ and $\boldsymbol{x}_b$ are variables (not necessarily disjoint) with $\boldsymbol{x} = (\boldsymbol{x}_a, \boldsymbol{x}_b)$, and $k_a$ and $k_b$ are valid kernel functions over their respective spaces

[Bishop]

Extending linear methods to kernelized ones

Kernelized versions of linear methods:

Linear methods are well studied: unique optimal solutions, faster learning algorithms, and better analysis

However, we often require nonlinear methods in real-world problems, so we can use kernel-based versions of these linear algorithms

If a problem can be formulated such that feature vectors appear only in inner products, we can extend it to a kernelized one

This applies whenever we only need the dot products of all pairs of data points (i.e., the geometry of the data in the feature space)

By replacing inner products with kernels in linear algorithms, we obtain very flexible methods

We can operate in the mapped space without ever computing the coordinates of the data in that space

Which information can be obtained from the kernel?

Example: we know all pairwise distances

$d\left(\boldsymbol{\phi}(\boldsymbol{x}), \boldsymbol{\phi}(\boldsymbol{z})\right)^2 = \|\boldsymbol{\phi}(\boldsymbol{x}) - \boldsymbol{\phi}(\boldsymbol{z})\|^2 = k(\boldsymbol{x}, \boldsymbol{x}) + k(\boldsymbol{z}, \boldsymbol{z}) - 2k(\boldsymbol{x}, \boldsymbol{z})$

Therefore, we also know the distance of points from the center of mass of a set

Many dimensionality reduction, clustering, and classification methods can be described in terms of pairwise distances.

This allows us to introduce kernelized versions of them.
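
A small sketch of that identity, computing all pairwise feature-space distances from a Gram matrix alone (the polynomial kernel here is just an example choice):

```python
# Pairwise distances in the feature space, using only the kernel matrix K.
import numpy as np

def feature_space_distances(K):
    """d(phi(x_i), phi(x_j))^2 = K_ii + K_jj - 2 K_ij."""
    diag = np.diag(K)
    return np.sqrt(np.maximum(diag[:, None] + diag[None, :] - 2 * K, 0.0))

X = np.random.default_rng(0).normal(size=(5, 3))
K = (X @ X.T + 1) ** 2                       # e.g. a degree-2 polynomial kernel
D = feature_space_distances(K)               # 5 x 5 matrix of feature-space distances
```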

Kernels for structured data

Kernels can also be defined on general types of data

Kernel functions do not need to be defined over vectors; we just need a symmetric positive definite matrix

Thus, many algorithms can work with general (non-vectorial) data

Kernels exist to embed strings, trees, graphs, …

This may be even more important than nonlinearity: we can use kernel-based versions of the classical learning algorithms for recognition of structured data

Kernel function for objects

Sets: an example of a kernel function for sets is $k(A, B) = 2^{|A \cap B|}$

Strings: the inner product of the feature vectors for two strings can be defined, e.g., as a sum over all common subsequences, weighted according to their frequency of occurrence and lengths

Example strings: A E G A T E A G G and E G T E A G A E G A T G
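
A tiny sketch of the set kernel above, with Python sets standing in for the abstract objects (nothing here beyond the formula itself comes from the slides):

```python
# Set kernel k(A, B) = 2^{|A intersect B|}: a valid kernel on finite sets.
def set_kernel(A: set, B: set) -> float:
    return float(2 ** len(A & B))

A = {"a", "c", "d", "g"}
B = {"b", "c", "g"}
print(set_kernel(A, B))   # |A ∩ B| = 2 ({'c', 'g'}), so the kernel value is 4.0
```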

Kernel trick advantages: summary

Operating in the mapped space without ever computing the coordinates of the data in that space

Besides vectors, we can introduce kernel functions for structured data (graphs, strings, etc.)

Much of the geometry of the data in the embedding space is contained in all pairwise dot products

In many cases, the inner product in the embedding space can be computed efficiently.

SVM: Summary

Hard margin: maximizing the margin

Soft margin: handling noisy data and overlapping classes

Slack variables in the problem

Dual problems of hard-margin and soft-margin SVM

Lead us to the non-linear SVM method easily by kernel substitution

Also express the classifier decision in terms of support vectors

Kernel SVMs

Learn a linear decision boundary in a high-dimensional space without explicitly working on the mapped data