Support Vector Machine (SVM) and Kernel Methods


CE-717: Machine Learning, Sharif University of Technology

Fall 2015

Soleymani


Outline

Margin concept

Hard-Margin SVM

Soft-Margin SVM

Dual Problems of Hard-Margin SVM and Soft-Margin SVM

Nonlinear SVM

Kernel trick

Kernel methods


Margin


Which line is better to select as the boundary to provide more generalization capability?

Margin for a hyperplane that separates samples of two linearly separable classes is:

The smallest distance between the decision boundary and any of the training samples


Larger margin provides better generalization to unseen data

Maximum margin


SVM finds the solution with maximum margin

Solution: a hyperplane that is farthest from all training samples

The hyperplane with the largest margin has equal distances to the nearest sample of both classes

[Figure: two separating hyperplanes in the $(x_1, x_2)$ plane; the one with the larger margin is preferred]

Hard-margin SVM: Optimization problem

$\max_{M,\,\boldsymbol{w},\,w_0} \; \dfrac{2M}{\|\boldsymbol{w}\|}$

s.t. $\boldsymbol{w}^T\boldsymbol{x}^{(i)} + w_0 \ge M \quad \forall\, \boldsymbol{x}^{(i)} \in C_1$

  $\boldsymbol{w}^T\boldsymbol{x}^{(i)} + w_0 \le -M \quad \forall\, \boldsymbol{x}^{(i)} \in C_2$

[Figure: decision boundary $\boldsymbol{w}^T\boldsymbol{x} + w_0 = 0$ with margin hyperplanes $\boldsymbol{w}^T\boldsymbol{x} + w_0 = M$ (class $y^{(i)} = 1$) and $\boldsymbol{w}^T\boldsymbol{x} + w_0 = -M$ (class $y^{(i)} = -1$); each margin hyperplane lies at distance $M/\|\boldsymbol{w}\|$ from the boundary, so the margin is $2M/\|\boldsymbol{w}\|$]

Hard-margin SVM: Optimization problem

$\max_{M,\,\boldsymbol{w},\,w_0} \; \dfrac{2M}{\|\boldsymbol{w}\|}$

s.t. $y^{(i)}\left(\boldsymbol{w}^T\boldsymbol{x}^{(i)} + w_0\right) \ge M \quad i = 1, \dots, N$

[Figure: boundary $\boldsymbol{w}^T\boldsymbol{x} + w_0 = 0$ and margin hyperplanes $\boldsymbol{w}^T\boldsymbol{x} + w_0 = \pm M$, each at distance $M/\|\boldsymbol{w}\|$ from the boundary]

$M = \min_{i=1,\dots,N} \; y^{(i)}\left(\boldsymbol{w}^T\boldsymbol{x}^{(i)} + w_0\right)$

Hard-margin SVM: Optimization problem

We can set $\boldsymbol{w}' = \dfrac{\boldsymbol{w}}{M}$ and $w_0' = \dfrac{w_0}{M}$:

$\max_{\boldsymbol{w}',\,w_0'} \; \dfrac{2}{\|\boldsymbol{w}'\|}$

s.t. $y^{(i)}\left(\boldsymbol{w}'^T\boldsymbol{x}^{(i)} + w_0'\right) \ge 1 \quad i = 1, \dots, N$

[Figure: boundary $\boldsymbol{w}'^T\boldsymbol{x} + w_0' = 0$ and margin hyperplanes $\boldsymbol{w}'^T\boldsymbol{x} + w_0' = \pm 1$, each at distance $1/\|\boldsymbol{w}'\|$ from the boundary]

The positions of the boundary and the margin hyperplanes do not change.

Hard-margin SVM: Optimization problem


We can equivalently optimize:

min๐’˜,๐‘ค0

1

2๐’˜ 2

s. t. ๐‘ฆ ๐‘– ๐’˜๐‘‡๐’™ ๐‘– + ๐‘ค0 โ‰ฅ 1 ๐‘– = 1, โ€ฆ , ๐‘

It is a convex Quadratic Programming (QP) problem

There are computationally efficient packages to solve it.

It has a global minimum (if a feasible solution exists).

When training samples are not linearly separable, it has no solution.

How can we extend it to find a solution even when the classes are not linearly separable?
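
Before moving on, here is a minimal sketch that hands the hard-margin primal QP above to CVXPY as a generic convex solver; the function name, the data layout (rows of `X` as samples, labels `y` in {−1, +1}), and the toy data are illustrative assumptions of this note, not part of the slides, and the problem is infeasible if the classes are not linearly separable.

```python
# Hard-margin SVM primal QP: min 1/2 ||w||^2  s.t.  y_i (w^T x_i + w_0) >= 1 for all i.
import cvxpy as cp
import numpy as np

def hard_margin_svm(X: np.ndarray, y: np.ndarray):
    N, d = X.shape
    w = cp.Variable(d)
    w0 = cp.Variable()
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    # One margin constraint per training sample.
    constraints = [cp.multiply(y, X @ w + w0) >= 1]
    cp.Problem(objective, constraints).solve()
    return w.value, w0.value

# Toy usage on two separable clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
w, w0 = hard_margin_svm(X, y)
```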

Beyond linear separability


How can we extend the hard-margin SVM to allow classification errors?

Overlapping classes that can be approximately separated by a linear boundary

Noise in the linearly separable classes


Beyond linear separability: Soft-margin SVM


Minimizing the number of misclassified points?!

NP-complete

Soft margin:

Maximizing a margin while trying to minimize the distance between misclassified points and their correct margin plane

Soft-margin SVM


SVM with slack variables: allows samples to fall within the margin, but penalizes them

min๐’˜,๐‘ค0, ๐œ‰๐‘– ๐‘–=1

๐‘

12

๐’˜ 2 + ๐ถ

๐‘–=1

๐‘

๐œ‰๐‘–

s. t. ๐‘ฆ ๐‘– ๐’˜๐‘‡๐’™ ๐‘– + ๐‘ค0 โ‰ฅ 1 โˆ’ ๐œ‰๐‘– ๐‘– = 1, โ€ฆ , ๐‘

๐œ‰๐‘– โ‰ฅ 0

๐œ‰๐‘– : slack variables

๐œ‰๐‘– > 1: if ๐’™ ๐‘– misclassifed

0 < ๐œ‰๐‘– < 1 : if ๐’™ ๐‘– correctly

classified but inside margin


Soft-margin SVM


Linear penalty (hinge loss) for a sample if it is misclassified or lies inside the margin

Tries to keep $\xi_i$ small while maximizing the margin.

Always finds a solution (as opposed to the hard-margin SVM)

More robust to outliers

The soft-margin problem is still a convex QP


Soft-margin SVM: Parameter $C$

$C$ is a tradeoff parameter:

small $C$ allows margin constraints to be easily ignored: large margin

large $C$ makes constraints hard to ignore: narrow margin

$C \to \infty$ enforces all constraints: hard margin

$C$ can be determined using a technique like cross-validation
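
As an illustration of the last point, the hedged sketch below picks $C$ by cross-validation with scikit-learn; the candidate grid and the toy data are arbitrary choices for this note, not values prescribed by the slides.

```python
# Selecting the soft-margin tradeoff C by 5-fold cross-validation (illustrative values).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.0, (50, 2)), rng.normal(1, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

param_grid = {"C": [0.01, 0.1, 1, 10, 100]}        # candidate tradeoff values
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)                          # C chosen by cross-validation
```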

Soft-margin SVM: Cost function

$\min_{\boldsymbol{w},\,w_0,\,\{\xi_i\}_{i=1}^{N}} \; \dfrac{1}{2}\|\boldsymbol{w}\|^2 + C \sum_{i=1}^{N} \xi_i$

s.t. $y^{(i)}\left(\boldsymbol{w}^T\boldsymbol{x}^{(i)} + w_0\right) \ge 1 - \xi_i \quad i = 1, \dots, N$

  $\xi_i \ge 0$

It is equivalent to the unconstrained optimization problem:

$\min_{\boldsymbol{w},\,w_0} \; \dfrac{1}{2}\|\boldsymbol{w}\|^2 + C \sum_{i=1}^{N} \max\left(0,\, 1 - y^{(i)}\left(\boldsymbol{w}^T\boldsymbol{x}^{(i)} + w_0\right)\right)$
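
To make the unconstrained form concrete, the following sketch minimizes this hinge-loss objective with plain (sub)gradient descent in NumPy; the step size, iteration count, and function name are illustrative assumptions rather than anything prescribed by the slides.

```python
# Subgradient descent on 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + w_0)).
import numpy as np

def hinge_svm_gd(X, y, C=1.0, lr=1e-3, iters=2000):
    N, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + w0)
        active = margins < 1                              # samples with non-zero hinge loss
        grad_w = w - C * (y[active][:, None] * X[active]).sum(axis=0)
        grad_w0 = -C * y[active].sum()
        w -= lr * grad_w
        w0 -= lr * grad_w0
    return w, w0
```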

SVM loss function


Hinge loss vs. 0-1 loss

[Figure: the 0-1 loss and the hinge loss $\max\left(0,\, 1 - y\left(\boldsymbol{w}^T\boldsymbol{x} + w_0\right)\right)$ plotted against $\boldsymbol{w}^T\boldsymbol{x} + w_0$ for $y = 1$]

Dual formulation of the SVM

We are going to introduce the dual SVM problem, which is equivalent to the original primal problem. The dual problem:

is often easier to solve

enables us to exploit the kernel trick

gives us further insights into the optimal hyperplane

Optimization: Lagrangian multipliers

$p^* = \min_{\boldsymbol{x}} f(\boldsymbol{x})$

s.t. $g_i(\boldsymbol{x}) \le 0 \quad i = 1, \dots, m$

  $h_i(\boldsymbol{x}) = 0 \quad i = 1, \dots, p$

$\mathcal{L}(\boldsymbol{x}, \boldsymbol{\alpha}, \boldsymbol{\lambda}) = f(\boldsymbol{x}) + \sum_{i=1}^{m} \alpha_i g_i(\boldsymbol{x}) + \sum_{i=1}^{p} \lambda_i h_i(\boldsymbol{x})$

$\max_{\alpha_i \ge 0,\, \lambda_i} \mathcal{L}(\boldsymbol{x}, \boldsymbol{\alpha}, \boldsymbol{\lambda}) = \begin{cases} \infty & \text{if any } g_i(\boldsymbol{x}) > 0 \\ \infty & \text{if any } h_i(\boldsymbol{x}) \ne 0 \\ f(\boldsymbol{x}) & \text{otherwise} \end{cases}$

$p^* = \min_{\boldsymbol{x}} \max_{\alpha_i \ge 0,\, \lambda_i} \mathcal{L}(\boldsymbol{x}, \boldsymbol{\alpha}, \boldsymbol{\lambda})$

Lagrangian multipliers: $\boldsymbol{\alpha} = [\alpha_1, \dots, \alpha_m]$, $\boldsymbol{\lambda} = [\lambda_1, \dots, \lambda_p]$

Optimization: Dual problem

In general, we have:

$\max_{x} \min_{y} f(x, y) \le \min_{y} \max_{x} f(x, y)$

Primal problem: $p^* = \min_{\boldsymbol{x}} \max_{\alpha_i \ge 0,\, \lambda_i} \mathcal{L}(\boldsymbol{x}, \boldsymbol{\alpha}, \boldsymbol{\lambda})$

Dual problem: $d^* = \max_{\alpha_i \ge 0,\, \lambda_i} \min_{\boldsymbol{x}} \mathcal{L}(\boldsymbol{x}, \boldsymbol{\alpha}, \boldsymbol{\lambda})$

Obtained by swapping the order of min and max

$d^* \le p^*$

When the original problem is convex ($f$ and the $g_i$ are convex functions and the $h_i$ are affine), we have strong duality: $d^* = p^*$

Hard-margin SVM: Dual problem

$\min_{\boldsymbol{w},\,w_0} \; \dfrac{1}{2}\|\boldsymbol{w}\|^2$

s.t. $y^{(i)}\left(\boldsymbol{w}^T\boldsymbol{x}^{(i)} + w_0\right) \ge 1 \quad i = 1, \dots, N$

By incorporating the constraints through Lagrangian multipliers, we will have:

$\min_{\boldsymbol{w},\,w_0} \max_{\{\alpha_i \ge 0\}} \; \dfrac{1}{2}\|\boldsymbol{w}\|^2 + \sum_{i=1}^{N} \alpha_i \left(1 - y^{(i)}\left(\boldsymbol{w}^T\boldsymbol{x}^{(i)} + w_0\right)\right)$

Dual problem (changing the order of min and max in the above problem):

$\max_{\{\alpha_i \ge 0\}} \min_{\boldsymbol{w},\,w_0} \; \dfrac{1}{2}\|\boldsymbol{w}\|^2 + \sum_{i=1}^{N} \alpha_i \left(1 - y^{(i)}\left(\boldsymbol{w}^T\boldsymbol{x}^{(i)} + w_0\right)\right)$

Hard-margin SVM: Dual problem

$\max_{\{\alpha_i \ge 0\}} \min_{\boldsymbol{w},\,w_0} \; \mathcal{L}(\boldsymbol{w}, w_0, \boldsymbol{\alpha})$

$\mathcal{L}(\boldsymbol{w}, w_0, \boldsymbol{\alpha}) = \dfrac{1}{2}\|\boldsymbol{w}\|^2 + \sum_{i=1}^{N} \alpha_i \left(1 - y^{(i)}\left(\boldsymbol{w}^T\boldsymbol{x}^{(i)} + w_0\right)\right)$

$\nabla_{\boldsymbol{w}} \mathcal{L}(\boldsymbol{w}, w_0, \boldsymbol{\alpha}) = 0 \;\Rightarrow\; \boldsymbol{w} = \sum_{i=1}^{N} \alpha_i y^{(i)} \boldsymbol{x}^{(i)}$

$\dfrac{\partial \mathcal{L}(\boldsymbol{w}, w_0, \boldsymbol{\alpha})}{\partial w_0} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y^{(i)} = 0$

$w_0$ does not appear; instead, a "global" constraint on $\boldsymbol{\alpha}$ is created.

Hard-margin SVM: Dual problem

$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \dfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} \boldsymbol{x}^{(i)T} \boldsymbol{x}^{(j)}$

subject to $\sum_{i=1}^{N} \alpha_i y^{(i)} = 0$

  $\alpha_i \ge 0 \quad i = 1, \dots, N$

It is a convex QP.

By solving the above problem we first find $\boldsymbol{\alpha}$ and then $\boldsymbol{w} = \sum_{i=1}^{N} \alpha_i y^{(i)} \boldsymbol{x}^{(i)}$

$w_0$ can be set by making the margin equidistant to the classes (we discuss this more after defining support vectors).
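
As an illustrative companion (not the slides' own code), the sketch below solves this dual QP with CVXPY and recovers $\boldsymbol{w}$ from the optimal $\alpha_i$; the variable and function names are assumptions of this note.

```python
# Hard-margin SVM dual: maximize sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j <x_i, x_j>
# s.t. sum_i a_i y_i = 0 and a_i >= 0. Uses the identity
# sum_ij a_i a_j y_i y_j <x_i, x_j> = || sum_i a_i y_i x_i ||^2 to keep the objective concave.
import cvxpy as cp
import numpy as np

def hard_margin_dual(X, y):
    N = X.shape[0]
    Xy = y[:, None] * X                       # row i is y_i * x_i
    a = cp.Variable(N)
    objective = cp.Maximize(cp.sum(a) - 0.5 * cp.sum_squares(Xy.T @ a))
    constraints = [a >= 0, y @ a == 0]
    cp.Problem(objective, constraints).solve()
    alpha = a.value
    w = Xy.T @ alpha                           # w = sum_i alpha_i y_i x_i
    return alpha, w
```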

Karush-Kuhn-Tucker (KKT) conditions

Necessary conditions for the solution $[\boldsymbol{w}^*, w_0^*, \boldsymbol{\alpha}^*]$:

$\nabla_{\boldsymbol{w}} \mathcal{L}(\boldsymbol{w}, w_0, \boldsymbol{\alpha})\big|_{\boldsymbol{w}^*, w_0^*, \boldsymbol{\alpha}^*} = 0$

$\dfrac{\partial \mathcal{L}(\boldsymbol{w}, w_0, \boldsymbol{\alpha})}{\partial w_0}\Big|_{\boldsymbol{w}^*, w_0^*, \boldsymbol{\alpha}^*} = 0$

$\alpha_i^* \ge 0 \quad i = 1, \dots, N$

$y^{(i)}\left(\boldsymbol{w}^{*T}\boldsymbol{x}^{(i)} + w_0^*\right) \ge 1 \quad i = 1, \dots, N$

$\alpha_i^*\left(1 - y^{(i)}\left(\boldsymbol{w}^{*T}\boldsymbol{x}^{(i)} + w_0^*\right)\right) = 0 \quad i = 1, \dots, N$

In general, the optimal $(\boldsymbol{x}^*, \boldsymbol{\alpha}^*)$ satisfies the KKT conditions:

$\nabla_{\boldsymbol{x}} \mathcal{L}(\boldsymbol{x}, \boldsymbol{\alpha})\big|_{\boldsymbol{x}^*, \boldsymbol{\alpha}^*} = 0$

$\alpha_i^* \ge 0 \quad i = 1, \dots, m$

$g_i(\boldsymbol{x}^*) \le 0 \quad i = 1, \dots, m$

$\alpha_i^* g_i(\boldsymbol{x}^*) = 0 \quad i = 1, \dots, m$

Hard-margin SVM: Support vectors

Inactive constraint: $y^{(i)}\left(\boldsymbol{w}^T\boldsymbol{x}^{(i)} + w_0\right) > 1$
$\Rightarrow \alpha_i = 0$, and thus $\boldsymbol{x}^{(i)}$ is not a support vector.

Active constraint: $y^{(i)}\left(\boldsymbol{w}^T\boldsymbol{x}^{(i)} + w_0\right) = 1$
$\Rightarrow \alpha_i$ can be greater than 0, and thus $\boldsymbol{x}^{(i)}$ can be a support vector.

[Figure: samples lying on the margin hyperplanes (at distance $1/\|\boldsymbol{w}\|$ from the boundary) have $\alpha > 0$]

Hard-margin SVM: Support vectors


A sample with $\alpha_i = 0$ can also lie on one of the margin hyperplanes.

Hard-margin SVM: Support vectors

Support Vectors (SVs) $= \left\{\boldsymbol{x}^{(i)} \mid \alpha_i > 0\right\}$

The direction of the hyperplane can be found based only on the support vectors:

$\boldsymbol{w} = \sum_{\alpha_i > 0} \alpha_i y^{(i)} \boldsymbol{x}^{(i)}$

Hard-margin SVM: Dual problem

Classifying new samples using only SVs

Classification of a new sample $\boldsymbol{x}$:

$y = \mathrm{sign}\left(w_0 + \boldsymbol{w}^T\boldsymbol{x}\right)$

$y = \mathrm{sign}\Big(w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \boldsymbol{x}^{(i)T} \boldsymbol{x}\Big)$

The classifier is based on an expansion in terms of dot products of $\boldsymbol{x}$ with the support vectors.

Support vectors are sufficient to predict the labels of new samples.

How to find $w_0$

At least one of the $\alpha_i$'s is strictly positive ($\alpha_s > 0$ for a support vector).

The KKT complementary slackness condition tells us that the constraint corresponding to this non-zero $\alpha_s$ holds with equality.

$w_0$ is found such that the following constraint is satisfied:

$y^{(s)}\left(w_0 + \boldsymbol{w}^T\boldsymbol{x}^{(s)}\right) = 1$

We can find $w_0$ using one of the support vectors:

$\Rightarrow w_0 = y^{(s)} - \boldsymbol{w}^T\boldsymbol{x}^{(s)} = y^{(s)} - \sum_{\alpha_n > 0} \alpha_n y^{(n)} \boldsymbol{x}^{(n)T} \boldsymbol{x}^{(s)}$

Hard-margin SVM: Dual problem

$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \dfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} \boldsymbol{x}^{(i)T} \boldsymbol{x}^{(j)}$

subject to $\sum_{i=1}^{N} \alpha_i y^{(i)} = 0$

  $\alpha_i \ge 0 \quad i = 1, \dots, N$

Only the dot product of each pair of training samples appears in the optimization problem.

This is an important property that helps extend SVM to the non-linear case (the cost function does not depend explicitly on the dimensionality of the feature space).

Soft-margin SVM: Dual problem

$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \dfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} \boldsymbol{x}^{(i)T} \boldsymbol{x}^{(j)}$

subject to $\sum_{i=1}^{N} \alpha_i y^{(i)} = 0$

  $0 \le \alpha_i \le C \quad i = 1, \dots, N$

By solving the above quadratic problem we first find $\boldsymbol{\alpha}$, then find $\boldsymbol{w} = \sum_{i=1}^{N} \alpha_i y^{(i)} \boldsymbol{x}^{(i)}$, and $w_0$ is computed from the SVs.

For a test sample $\boldsymbol{x}$ (as before):

$y = \mathrm{sign}\left(w_0 + \boldsymbol{w}^T\boldsymbol{x}\right) = \mathrm{sign}\Big(w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \boldsymbol{x}^{(i)T} \boldsymbol{x}\Big)$

Soft-margin SVM: Support vectors

Support Vectors: $\alpha_i > 0$

If $0 < \alpha_i < C$: SVs on the margin, $\xi_i = 0$.

If $\alpha_i = C$: SVs on or over the margin.

Primal vs. dual hard-margin SVM problem

Primal problem of hard-margin SVM:

$N$ inequality constraints

$d + 1$ variables

Dual problem of hard-margin SVM:

one equality constraint

$N$ positivity constraints

$N$ variables (Lagrange multipliers)

a more complicated objective function

The dual problem is helpful and instrumental for using the kernel trick.

Primal vs. dual soft-margin SVM problem

Primal problem of soft-margin SVM:

$N$ inequality constraints

$N$ positivity constraints

$N + d + 1$ variables

Dual problem of soft-margin SVM:

one equality constraint

$2N$ positivity constraints

$N$ variables (Lagrange multipliers)

a more complicated objective function

The dual problem is helpful and instrumental for using the kernel trick.

Not linearly separable data

Noisy data or overlapping classes (we discussed this already: soft margin)

Near linearly separable

Non-linear decision surface

Transform to a new feature space


Nonlinear SVM

Assume a transformation $\boldsymbol{\phi}: \mathbb{R}^d \to \mathbb{R}^m$ on the feature space: $\boldsymbol{x} \to \boldsymbol{\phi}(\boldsymbol{x})$

Find a hyperplane in the transformed feature space:

$\boldsymbol{w}^T\boldsymbol{\phi}(\boldsymbol{x}) + w_0 = 0$

[Figure: a non-linear boundary in the original $(x_1, x_2)$ space corresponds to a linear boundary in the transformed $(\phi_1(\boldsymbol{x}), \phi_2(\boldsymbol{x}))$ space]

$\{\phi_1(\boldsymbol{x}), \dots, \phi_m(\boldsymbol{x})\}$: a set of basis functions (or features), with $\phi_i(\boldsymbol{x}): \mathbb{R}^d \to \mathbb{R}$

$\boldsymbol{\phi}(\boldsymbol{x}) = [\phi_1(\boldsymbol{x}), \dots, \phi_m(\boldsymbol{x})]$

Soft-margin SVM in a transformed space: Primal problem

Primal problem:

min๐’˜,๐‘ค0

1

2๐’˜ 2 + ๐ถ

๐‘–=1

๐‘

๐œ‰๐‘–

s. t. ๐‘ฆ ๐‘– ๐’˜๐‘‡๐“(๐’™ ๐‘– ) + ๐‘ค0 โ‰ฅ 1 โˆ’ ๐œ‰๐‘– ๐‘– = 1, โ€ฆ , ๐‘

๐œ‰๐‘– โ‰ฅ 0

๐’˜ โˆˆ โ„๐‘š: the weights that must be found

If ๐‘š โ‰ซ ๐‘‘ (very high dimensional feature space) then there are

many more parameters to learn

Soft-margin SVM in a transformed space: Dual problem

Optimization problem:

max๐œถ

๐‘–=1

๐‘

๐›ผ๐‘– โˆ’1

2

๐‘–=1

๐‘

๐‘—=1

๐‘

๐›ผ๐‘–๐›ผ๐‘—๐‘ฆ(๐‘–)๐‘ฆ(๐‘—)๐“ ๐’™(๐‘–) ๐‘‡

๐“ ๐’™(๐‘–)

Subject to ๐‘–=1๐‘ ๐›ผ๐‘–๐‘ฆ

(๐‘–) = 0

0 โ‰ค ๐›ผ๐‘– โ‰ค ๐ถ ๐‘– = 1, โ€ฆ , ๐‘›

If we have inner products ๐“ ๐’™(๐‘–) ๐‘‡๐“ ๐’™(๐‘—) , only ๐œถ

= [๐›ผ1, โ€ฆ , ๐›ผ๐‘] needs to be learnt.

It is not necessary to learn ๐‘š parameters as opposed to the primal

problem

Kernelized soft-margin SVM

Optimization problem:

$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \dfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} \underbrace{\boldsymbol{\phi}(\boldsymbol{x}^{(i)})^T \boldsymbol{\phi}(\boldsymbol{x}^{(j)})}_{k(\boldsymbol{x}^{(i)},\,\boldsymbol{x}^{(j)})}$

subject to $\sum_{i=1}^{N} \alpha_i y^{(i)} = 0$

  $0 \le \alpha_i \le C \quad i = 1, \dots, N$

Classifying a new sample $\boldsymbol{x}$:

$y = \mathrm{sign}\left(w_0 + \boldsymbol{w}^T\boldsymbol{\phi}(\boldsymbol{x})\right) = \mathrm{sign}\Big(w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \underbrace{\boldsymbol{\phi}(\boldsymbol{x}^{(i)})^T \boldsymbol{\phi}(\boldsymbol{x})}_{k(\boldsymbol{x}^{(i)},\,\boldsymbol{x})}\Big)$

$k(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{\phi}(\boldsymbol{x})^T \boldsymbol{\phi}(\boldsymbol{x}')$
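
The practical upshot is that the optimizer only ever needs a kernel (Gram) matrix. Here is a hedged sketch using scikit-learn's precomputed-kernel interface; the RBF kernel, the synthetic data, and all parameter values are illustrative choices of this note, not part of the slides.

```python
# Training an SVM from a precomputed kernel (Gram) matrix only.
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(A, B, sigma=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for all pairs of rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1)        # a non-linear (circular) concept

K = rbf_kernel(X, X)                                # N x N kernel matrix
clf = SVC(kernel="precomputed", C=1.0).fit(K, y)

X_test = rng.normal(size=(10, 2))
K_test = rbf_kernel(X_test, X)                      # kernel between test and training points
pred = clf.predict(K_test)                          # predictions from kernel values alone
```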

Kernel trick

Kernel: the dot product in the mapped feature space $\mathcal{F}$: $k(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{\phi}(\boldsymbol{x})^T \boldsymbol{\phi}(\boldsymbol{x}')$, where $\boldsymbol{\phi}(\boldsymbol{x}) = [\phi_1(\boldsymbol{x}), \dots, \phi_m(\boldsymbol{x})]^T$

$k(\boldsymbol{x}, \boldsymbol{x}')$ shows the similarity of $\boldsymbol{x}$ and $\boldsymbol{x}'$.

The distance in the feature space can be found from the kernel accordingly.

Kernel trick → extension of many well-known algorithms to kernel-based ones, by substituting the dot product with the kernel function

Kernel trick: Idea

Idea: when the input vectors appear only in the form of dot products, we can use other choices of kernel function instead of the inner product

Solving the problem without explicitly mapping the data

Explicit mapping is expensive if $\boldsymbol{\phi}(\boldsymbol{x})$ is very high dimensional

Instead of using a mapping $\boldsymbol{\phi}: \mathcal{X} \to \mathcal{F}$ to represent $\boldsymbol{x} \in \mathcal{X}$ by $\boldsymbol{\phi}(\boldsymbol{x}) \in \mathcal{F}$, a similarity function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is used.

The data set can be represented by the $N \times N$ matrix $\boldsymbol{K}$.

We specify only an inner product function between points in the transformed space (not their coordinates)

In many cases, the inner product in the embedding space can be computed efficiently.

Why kernel?

Kernel functions $K$ can indeed be efficiently computed, with a cost proportional to $d$ instead of $m$.

Example: consider the second-order polynomial transform:

$\boldsymbol{\phi}(\boldsymbol{x}) = \left[1,\, x_1, \dots, x_d,\, x_1^2,\, x_1x_2, \dots, x_dx_d\right]^T$

$\boldsymbol{\phi}(\boldsymbol{x})^T \boldsymbol{\phi}(\boldsymbol{x}') = 1 + \sum_{i=1}^{d} x_i x_i' + \sum_{i=1}^{d} \sum_{j=1}^{d} x_i x_j x_i' x_j'$

Since $\sum_{i=1}^{d} \sum_{j=1}^{d} x_i x_j x_i' x_j' = \left(\sum_{i=1}^{d} x_i x_i'\right)\left(\sum_{j=1}^{d} x_j x_j'\right)$, this simplifies to

$\boldsymbol{\phi}(\boldsymbol{x})^T \boldsymbol{\phi}(\boldsymbol{x}') = 1 + \boldsymbol{x}^T\boldsymbol{x}' + \left(\boldsymbol{x}^T\boldsymbol{x}'\right)^2$

The transformed space has dimension $m = 1 + d + d^2$, so computing the inner product explicitly costs $O(m)$, while evaluating the right-hand side costs only $O(d)$.

Polynomial kernel: Efficient kernel function

An efficient kernel function relies on a carefully constructed specific transform that allows fast computation

Example: for polynomial kernels, certain combinations of coefficients still make $K$ easy to compute ($\gamma > 0$ and $c > 0$):

$\boldsymbol{\phi}(\boldsymbol{x}) = \left[c \times 1,\; \sqrt{2c\gamma}\, x_1, \dots, \sqrt{2c\gamma}\, x_d,\; \gamma x_1^2,\; \gamma x_1 x_2, \dots\right]$

Polynomial kernel: Degree two

We instead use $k(\boldsymbol{x}, \boldsymbol{x}') = \left(\boldsymbol{x}^T\boldsymbol{x}' + 1\right)^2$, which corresponds to (for a $d$-dimensional input $\boldsymbol{x} = [x_1, \dots, x_d]^T$):

$\boldsymbol{\phi}(\boldsymbol{x}) = \left[1,\, \sqrt{2}x_1, \dots, \sqrt{2}x_d,\, x_1^2, \dots, x_d^2,\, \sqrt{2}x_1x_2, \dots, \sqrt{2}x_1x_d,\, \sqrt{2}x_2x_3, \dots, \sqrt{2}x_{d-1}x_d\right]^T$

or, equivalently,

$\boldsymbol{\phi}(\boldsymbol{x}) = \left[1,\, \sqrt{2}x_1, \dots, \sqrt{2}x_d,\, x_1^2, \dots, x_d^2,\, x_1x_2, \dots, x_1x_d,\, x_2x_1, \dots, x_{d-1}x_d\right]^T$

Polynomial kernel

In general, $k(\boldsymbol{x}, \boldsymbol{z}) = \left(\boldsymbol{x}^T\boldsymbol{z}\right)^M$ contains all monomials of order $M$.

This can similarly be generalized to include all terms up to degree $M$: $k(\boldsymbol{x}, \boldsymbol{z}) = \left(\boldsymbol{x}^T\boldsymbol{z} + c\right)^M$ ($c > 0$)

Example: the SVM boundary for a polynomial kernel

$w_0 + \boldsymbol{w}^T\boldsymbol{\phi}(\boldsymbol{x}) = 0$

$\Rightarrow w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \boldsymbol{\phi}(\boldsymbol{x}^{(i)})^T \boldsymbol{\phi}(\boldsymbol{x}) = 0$

$\Rightarrow w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} k(\boldsymbol{x}^{(i)}, \boldsymbol{x}) = 0$

$\Rightarrow w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \left(\boldsymbol{x}^{(i)T}\boldsymbol{x} + c\right)^M = 0$

The boundary is a polynomial of order $M$.

Some common kernel functions

$K$ is a mathematical and computational shortcut that allows us to combine the transform and the inner product into a single, more efficient function.

Linear: $k(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{x}^T\boldsymbol{x}'$

Polynomial: $k(\boldsymbol{x}, \boldsymbol{x}') = \left(\boldsymbol{x}^T\boldsymbol{x}' + c\right)^M$, $c \ge 0$

Gaussian: $k(\boldsymbol{x}, \boldsymbol{x}') = \exp\left(-\dfrac{\|\boldsymbol{x} - \boldsymbol{x}'\|^2}{2\sigma^2}\right)$

Sigmoid: $k(\boldsymbol{x}, \boldsymbol{x}') = \tanh\left(a\,\boldsymbol{x}^T\boldsymbol{x}' + b\right)$
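
For reference, these four kernels can be written as short vectorized NumPy functions operating on all pairs of rows of two data matrices (a sketch; the parameter defaults are arbitrary):

```python
# Common kernel functions, computed for all pairs of rows of A (n x d) and B (m x d).
import numpy as np

def linear_kernel(A, B):
    return A @ B.T

def polynomial_kernel(A, B, c=1.0, M=2):
    return (A @ B.T + c) ** M

def gaussian_kernel(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)   # squared Euclidean distances
    return np.exp(-sq / (2 * sigma ** 2))

def sigmoid_kernel(A, B, a=1.0, b=0.0):
    return np.tanh(a * (A @ B.T) + b)
```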

Gaussian kernel

Gaussian: $k(\boldsymbol{x}, \boldsymbol{x}') = \exp\left(-\dfrac{\|\boldsymbol{x} - \boldsymbol{x}'\|^2}{2\sigma^2}\right)$

Infinite dimensional feature space

Example: the SVM boundary for a Gaussian kernel

Considers a Gaussian function around each data point:

$w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \exp\left(-\dfrac{\|\boldsymbol{x} - \boldsymbol{x}^{(i)}\|^2}{2\sigma^2}\right) = 0$

SVM + Gaussian kernel can classify any arbitrary training set

Training error is zero when $\sigma \to 0$: all samples become support vectors (likely overfitting)
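
To experiment with this behaviour, the hedged sketch below trains RBF SVMs for several widths; note that scikit-learn parameterizes the Gaussian kernel as $\exp(-\gamma\|\boldsymbol{x}-\boldsymbol{x}'\|^2)$, so $\gamma = 1/(2\sigma^2)$ in the notation above. The data set and the particular values of $\sigma$ and $C$ are illustrative assumptions of this note.

```python
# Effect of the Gaussian kernel width: small sigma (large gamma) drives training error
# toward zero and turns most samples into support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)           # a non-linear (XOR-like) concept

for sigma in [2.0, 0.25, 0.05]:
    clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2), C=1.0).fit(X, y)
    print(f"sigma={sigma}: train acc={clf.score(X, y):.2f}, #SVs={clf.n_support_.sum()}")
```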

SVM: Linear and Gaussian kernels example

[Figure: decision boundaries for a linear kernel ($C = 0.1$) and a Gaussian kernel ($C = 1$, $\sigma = 0.25$)]

[Zisserman's slides]

Some SVs appear to be far from the boundary?

Hard margin Example

For a narrow Gaussian (small $\sigma$), even the protection of a large margin cannot suppress overfitting.

Y. S. Abu-Mostafa et al., "Learning From Data", 2012

SVM Gaussian kernel: Example

Source: Zisserman's slides

$f(\boldsymbol{x}) = w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \exp\left(-\dfrac{\|\boldsymbol{x} - \boldsymbol{x}^{(i)}\|^2}{2\sigma^2}\right)$

[Figures: a sequence of further Gaussian-kernel SVM examples showing the resulting decision boundaries; source: Zisserman's slides]

Constructing kernels

Construct kernel functions directly

Ensure that it is a valid kernel

Corresponds to an inner product in some feature space.

Example: $k(\boldsymbol{x}, \boldsymbol{x}') = \left(\boldsymbol{x}^T\boldsymbol{x}'\right)^2$

Corresponding mapping: $\boldsymbol{\phi}(\boldsymbol{x}) = \left[x_1^2,\, \sqrt{2}x_1x_2,\, x_2^2\right]^T$ for $\boldsymbol{x} = [x_1, x_2]^T$

We need a way to test whether a kernel is valid without having to construct $\boldsymbol{\phi}(\boldsymbol{x})$

Valid kernel: Necessary & sufficient conditions

Gram matrix $\boldsymbol{K}_{N \times N}$: $K_{ij} = k(\boldsymbol{x}^{(i)}, \boldsymbol{x}^{(j)})$

Restricting the kernel function to a set of points $\{\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \dots, \boldsymbol{x}^{(N)}\}$:

$K = \begin{bmatrix} k(\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(1)}) & \cdots & k(\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(N)}) \\ \vdots & \ddots & \vdots \\ k(\boldsymbol{x}^{(N)}, \boldsymbol{x}^{(1)}) & \cdots & k(\boldsymbol{x}^{(N)}, \boldsymbol{x}^{(N)}) \end{bmatrix}$

Mercer theorem: the kernel matrix is symmetric positive semi-definite (for any choice of data points)

Any symmetric positive definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some space

[Shawe-Taylor & Cristianini 2004]
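
A quick empirical check of this condition on a finite sample (a sketch; passing the check on one sample is necessary for validity but of course not a proof):

```python
# Check that a candidate kernel yields a symmetric PSD Gram matrix on a random sample.
import numpy as np

def gram_matrix(kernel, X):
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

def looks_valid(kernel, X, tol=1e-8):
    K = gram_matrix(kernel, X)
    symmetric = np.allclose(K, K.T)
    psd = np.all(np.linalg.eigvalsh(K) >= -tol)      # eigenvalues of a symmetric matrix
    return symmetric and psd

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
print(looks_valid(lambda a, b: (a @ b) ** 2, X))            # (x^T x')^2 is valid -> True
print(looks_valid(lambda a, b: -np.linalg.norm(a - b), X))  # trace 0, nonzero matrix -> not PSD
```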

Construct Valid Kernels

Given valid kernels $k_1(\boldsymbol{x}, \boldsymbol{x}')$ and $k_2(\boldsymbol{x}, \boldsymbol{x}')$, the following are also valid kernels [Bishop]:

$k(\boldsymbol{x}, \boldsymbol{x}') = c\,k_1(\boldsymbol{x}, \boldsymbol{x}')$, where $c > 0$

$k(\boldsymbol{x}, \boldsymbol{x}') = f(\boldsymbol{x})\,k_1(\boldsymbol{x}, \boldsymbol{x}')\,f(\boldsymbol{x}')$, where $f(\cdot)$ is any function

$k(\boldsymbol{x}, \boldsymbol{x}') = q\left(k_1(\boldsymbol{x}, \boldsymbol{x}')\right)$, where $q(\cdot)$ is a polynomial with coefficients $\ge 0$

$k(\boldsymbol{x}, \boldsymbol{x}') = \exp\left(k_1(\boldsymbol{x}, \boldsymbol{x}')\right)$

$k(\boldsymbol{x}, \boldsymbol{x}') = k_1(\boldsymbol{x}, \boldsymbol{x}') + k_2(\boldsymbol{x}, \boldsymbol{x}')$

$k(\boldsymbol{x}, \boldsymbol{x}') = k_1(\boldsymbol{x}, \boldsymbol{x}')\,k_2(\boldsymbol{x}, \boldsymbol{x}')$

$k(\boldsymbol{x}, \boldsymbol{x}') = k_3\left(\boldsymbol{\phi}(\boldsymbol{x}), \boldsymbol{\phi}(\boldsymbol{x}')\right)$, where $\boldsymbol{\phi}(\boldsymbol{x})$ is a function from $\boldsymbol{x}$ to $\mathbb{R}^M$ and $k_3(\cdot,\cdot)$ is a valid kernel in $\mathbb{R}^M$

$k(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{x}^T \boldsymbol{A}\, \boldsymbol{x}'$, where $\boldsymbol{A}$ is a symmetric positive semi-definite matrix

$k(\boldsymbol{x}, \boldsymbol{x}') = k_a(\boldsymbol{x}_a, \boldsymbol{x}_a') + k_b(\boldsymbol{x}_b, \boldsymbol{x}_b')$ and $k(\boldsymbol{x}, \boldsymbol{x}') = k_a(\boldsymbol{x}_a, \boldsymbol{x}_a')\,k_b(\boldsymbol{x}_b, \boldsymbol{x}_b')$, where $\boldsymbol{x}_a$ and $\boldsymbol{x}_b$ are variables (not necessarily disjoint) with $\boldsymbol{x} = (\boldsymbol{x}_a, \boldsymbol{x}_b)$, and $k_a$ and $k_b$ are valid kernel functions over their respective spaces

[Bishop]

Extending linear methods to kernelized ones

Kernelized versions of linear methods:

Linear methods are well studied: unique optimal solutions, faster learning algorithms, and better analysis

However, we often require nonlinear methods in real-world problems, so we can use kernel-based versions of these linear algorithms

If a problem can be formulated such that feature vectors appear only in inner products, we can extend it to a kernelized one

This applies whenever we only need the dot products of all pairs of data points (i.e., the geometry of the data in the feature space)

By replacing inner products with kernels in linear algorithms, we obtain very flexible methods

We can operate in the mapped space without ever computing the coordinates of the data in that space

Which information can be obtained from the kernel?

Example: we know all pairwise distances

$d\left(\boldsymbol{\phi}(\boldsymbol{x}), \boldsymbol{\phi}(\boldsymbol{z})\right)^2 = \|\boldsymbol{\phi}(\boldsymbol{x}) - \boldsymbol{\phi}(\boldsymbol{z})\|^2 = k(\boldsymbol{x}, \boldsymbol{x}) + k(\boldsymbol{z}, \boldsymbol{z}) - 2k(\boldsymbol{x}, \boldsymbol{z})$

Therefore, we also know the distance of points from the center of mass of a set

Many dimensionality reduction, clustering, and classification methods can be described in terms of pairwise distances.

This allows us to introduce kernelized versions of them.
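
A small sketch of that identity, computing all pairwise feature-space distances from a Gram matrix alone (the polynomial kernel here is just an example choice):

```python
# Pairwise distances in the feature space, using only the kernel matrix K.
import numpy as np

def feature_space_distances(K):
    """d(phi(x_i), phi(x_j))^2 = K_ii + K_jj - 2 K_ij."""
    diag = np.diag(K)
    return np.sqrt(np.maximum(diag[:, None] + diag[None, :] - 2 * K, 0.0))

X = np.random.default_rng(0).normal(size=(5, 3))
K = (X @ X.T + 1) ** 2                       # e.g. a degree-2 polynomial kernel
D = feature_space_distances(K)               # 5 x 5 matrix of feature-space distances
```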

Kernels for structured data

Kernels can also be defined on general types of data

Kernel functions do not need to be defined over vectors; we just need a symmetric positive definite matrix

Thus, many algorithms can work with general (non-vectorial) data

Kernels exist to embed strings, trees, graphs, …

This may be even more important than nonlinearity: we can use kernel-based versions of the classical learning algorithms for recognition of structured data

Kernel function for objects

Sets: an example of a kernel function for sets is $k(A, B) = 2^{|A \cap B|}$

Strings: the inner product of the feature vectors for two strings can be defined, e.g., as a sum over all common subsequences, weighted according to their frequency of occurrence and lengths

Example strings: A E G A T E A G G and E G T E A G A E G A T G
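
A tiny sketch of the set kernel above, with Python sets standing in for the abstract objects (nothing here beyond the formula itself comes from the slides):

```python
# Set kernel k(A, B) = 2^{|A intersect B|}: a valid kernel on finite sets.
def set_kernel(A: set, B: set) -> float:
    return float(2 ** len(A & B))

A = {"a", "c", "d", "g"}
B = {"b", "c", "g"}
print(set_kernel(A, B))   # |A ∩ B| = 2 ({'c', 'g'}), so the kernel value is 4.0
```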

Kernel trick advantages: summary

Operating in the mapped space without ever computing the coordinates of the data in that space

Besides vectors, we can introduce kernel functions for structured data (graphs, strings, etc.)

Much of the geometry of the data in the embedding space is contained in all pairwise dot products

In many cases, the inner product in the embedding space can be computed efficiently.

SVM: Summary

Hard margin: maximizing the margin

Soft margin: handling noisy data and overlapping classes

Slack variables in the problem

Dual problems of hard-margin and soft-margin SVM

Lead us to the non-linear SVM method easily by kernel substitution

Also express the classifier decision in terms of support vectors

Kernel SVMs

Learn a linear decision boundary in a high-dimensional space without explicitly working on the mapped data