Support Vector Machines and Kernels

Page 1: Support Vector Machines and Kernels Adapted from slides by Tim Oates Cognition, Robotics, and Learning (CORAL) Lab University of Maryland Baltimore County.

Support Vector Machines and Kernels

Adapted from slides by Tim Oates
Cognition, Robotics, and Learning (CORAL) Lab
University of Maryland Baltimore County

Doing Really Well with Linear Decision Surfaces

Page 2:

Outline

Prediction: why might predictions be wrong?
Support vector machines: doing really well with linear models
Kernels: making the non-linear linear

Page 3:

Supervised ML = Prediction

Given training instances (x, y)
Learn a model f
Such that f(x) = y
Use f to predict y for new x
Many variations on this basic theme

Page 4:

Why might predictions be wrong?

True Non-Determinism
Flip a biased coin, p(heads) = θ
Estimate θ from the data
If the estimate θ̂ > 0.5, predict heads, else tails
Lots of ML research on problems like this
Learn a model
Do the best you can in expectation
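This estimate-then-threshold strategy can be sketched in a few lines of Python. The function names here are illustrative, not from the slides:

```python
import random

def estimate_theta(flips):
    """Estimate p(heads) as the fraction of heads (1s) observed."""
    return sum(flips) / len(flips)

def predict(theta_hat):
    """Predict the majority outcome given the estimate of p(heads)."""
    return "heads" if theta_hat > 0.5 else "tails"

random.seed(0)
flips = [1 if random.random() < 0.7 else 0 for _ in range(1000)]  # coin with theta = 0.7
print(predict(estimate_theta(flips)))  # "heads" with overwhelming probability
```

Even the best possible predictor here is wrong about 30% of the time; that error comes from the coin itself, not from the learner.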

Page 5:

Why might predictions be wrong?

Partial Observability
Something needed to predict y is missing from observation x
N-bit parity problem:
x contains N-1 bits (hard PO)
x contains N bits but the learner ignores some of them (soft PO)

Page 6:

Why might predictions be wrong?

True non-determinism
Partial observability (hard, soft)
Representational bias
Algorithmic bias
Bounded resources

Page 7:

Representational Bias

Having the right features (x) is crucial

[Figure: points on a line in the pattern X O O O O X X X; with this single feature, no one threshold separates the X's from the O's]

Page 8:

Support Vector Machines

Doing Really Well with Linear Decision Surfaces

Page 9:

Strengths of SVMs

Good generalization in theory
Good generalization in practice
Work well with few training instances
Find globally best model
Efficient algorithms
Amenable to the kernel trick

Page 10:

Linear Separators

Training instances: x ∈ ℝ^n, y ∈ {-1, 1}
Parameters: w ∈ ℝ^n, b ∈ ℝ
Hyperplane: <w, x> + b = 0
  i.e. w_1x_1 + w_2x_2 + … + w_nx_n + b = 0
Decision function: f(x) = sign(<w, x> + b)

Math review. Inner (dot) product:
<a, b> = a · b = ∑_i a_i b_i = a_1b_1 + a_2b_2 + … + a_nb_n
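As a concrete illustration of the decision function and the dot product it relies on, here is a minimal sketch (the helper names are illustrative):

```python
def dot(a, b):
    """Inner product <a, b> = sum_i a_i * b_i."""
    return sum(ai * bi for ai, bi in zip(a, b))

def f(w, b, x):
    """Decision function f(x) = sign(<w, x> + b), returning +1 or -1."""
    return 1 if dot(w, x) + b >= 0 else -1

w = [1.0, -2.0]
b = 0.5
print(f(w, b, [3.0, 1.0]))  # <w,x> + b = 3 - 2 + 0.5 = 1.5, so +1
print(f(w, b, [0.0, 2.0]))  # <w,x> + b = 0 - 4 + 0.5 = -3.5, so -1
```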

Page 11:

Intuitions

[Figure: X and O scatter with a candidate linear separator]

Page 12:

Intuitions

[Figure: the same X and O scatter with a different candidate separator]

Page 13:

Intuitions

[Figure: the same X and O scatter with another candidate separator]

Page 14:

Intuitions

[Figure: the same X and O scatter with yet another candidate separator]

Page 15:

A "Good" Separator

[Figure: X and O scatter with a separator that leaves a wide gap between the classes]

Page 16:

Noise in the Observations

[Figure: the same scatter with noisy observations near the boundary]

Page 17:

Ruling Out Some Separators

[Figure: the noisy scatter; separators that pass too close to the data are ruled out]

Page 18:

Lots of Noise

[Figure: the scatter with many noisy points crowding the boundary region]

Page 19:

Maximizing the Margin

[Figure: the scatter with the separator placed to maximize its distance to the nearest points]

Page 20:

"Fat" Separators

[Figure: a "fat" separator, a thick band between the classes rather than a thin line]

Page 21:

Why Maximize Margin?

Increasing margin reduces capacity
Must restrict capacity to generalize
m training instances, 2^m ways to label them
What if the function class can separate all of these labelings?
Then it shatters the training instances

VC dimension: the largest m such that the function class can shatter some set of m points

Page 22:

VC Dimension Example

[Figure: all eight X/O labelings of three points in the plane, each of which a line can separate, so linear separators shatter a set of three points]

Page 23:

Bounding Generalization Error

R[f] = risk (test error)
R_emp[f] = empirical risk (train error)
h = VC dimension
m = number of training instances
δ = probability that the bound does not hold

R[f] ≤ R_emp[f] + √( (h(ln(2m/h) + 1) + ln(4/δ)) / m )
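A small sketch of how the bound behaves numerically, assuming the standard form of the VC bound stated on this slide (the `vc_bound` helper is illustrative):

```python
import math

def vc_bound(emp_risk, h, m, delta):
    """R[f] <= R_emp[f] + sqrt((h * (ln(2m/h) + 1) + ln(4/delta)) / m)."""
    capacity = math.sqrt((h * (math.log(2 * m / h) + 1) + math.log(4 / delta)) / m)
    return emp_risk + capacity

# The capacity term grows with VC dimension h and shrinks as m grows,
# which is why restricting capacity (e.g. by maximizing margin) helps.
print(vc_bound(0.05, h=10, m=1_000, delta=0.05))
print(vc_bound(0.05, h=10, m=100_000, delta=0.05))  # tighter: more data
```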

Page 24:

Support Vectors

[Figure: X and O scatter with the margin boundaries drawn; the instances lying on them are the support vectors]

Page 25:

The Math

Training instances: x ∈ ℝ^n, y ∈ {-1, 1}
Decision function: f(x) = sign(<w, x> + b), with w ∈ ℝ^n, b ∈ ℝ
Find w and b that:
Perfectly classify training instances (assuming linear separability)
Maximize margin

Page 26:

The Math

For perfect classification, we want
y_i (<w, x_i> + b) ≥ 0 for all i
Why? The product is positive exactly when the sign of <w, x_i> + b matches the label y_i.

To maximize the margin, rescale w and b so the closest instances satisfy y_i (<w, x_i> + b) ≥ 1; then we want the w that minimizes |w|²

Page 27:

Dual Optimization Problem

Maximize over α:
W(α) = Σ_i α_i - 1/2 Σ_{i,j} α_i α_j y_i y_j <x_i, x_j>

Subject to:
α_i ≥ 0
Σ_i α_i y_i = 0

Decision function:
f(x) = sign(Σ_i α_i y_i <x, x_i> + b)
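To make the dual concrete, here is a sketch on a two-point toy problem small enough that the multipliers α can be derived by hand (all names are illustrative):

```python
def dot(a, b):
    """Inner product <a, b>."""
    return sum(ai * bi for ai, bi in zip(a, b))

# Toy problem: x1 = (1, 0) with y1 = +1, x2 = (-1, 0) with y2 = -1.
# The constraint sum_i alpha_i y_i = 0 forces alpha1 = alpha2 = alpha,
# and W(alpha) = 2*alpha - 2*alpha^2 is maximized at alpha = 1/2.
X = [(1.0, 0.0), (-1.0, 0.0)]
y = [1.0, -1.0]
alpha = [0.5, 0.5]
b = 0.0

def f(x):
    """Dual decision function: sign(sum_i alpha_i y_i <x, x_i> + b)."""
    s = sum(a_i * y_i * dot(x, x_i) for a_i, y_i, x_i in zip(alpha, y, X)) + b
    return 1 if s >= 0 else -1

# Recover w = sum_i alpha_i y_i x_i, the expected max-margin normal (1, 0).
w = [sum(a_i * y_i * x_i[k] for a_i, y_i, x_i in zip(alpha, y, X)) for k in range(2)]
print(w)               # [1.0, 0.0]
print(f((2.0, 3.0)))   # +1 (right half-plane)
print(f((-0.5, 1.0)))  # -1 (left half-plane)
```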

Page 28:

What if Data Are Not Perfectly Linearly Separable?

Cannot find w and b that satisfy
y_i (<w, x_i> + b) ≥ 1 for all i

Introduce slack variables ξ_i:
y_i (<w, x_i> + b) ≥ 1 - ξ_i for all i

Minimize:
|w|² + C Σ_i ξ_i
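A sketch of how the slack variables and the soft-margin objective come out for a fixed w and b (the helper name is illustrative):

```python
def dot(a, b):
    """Inner product <a, b>."""
    return sum(ai * bi for ai, bi in zip(a, b))

def soft_margin_objective(w, b, X, y, C):
    """|w|^2 + C * sum(xi_i), where xi_i = max(0, 1 - y_i * (<w, x_i> + b))
    is the smallest slack satisfying y_i * (<w, x_i> + b) >= 1 - xi_i."""
    slacks = [max(0.0, 1.0 - y_i * (dot(w, x_i) + b)) for x_i, y_i in zip(X, y)]
    return dot(w, w) + C * sum(slacks), slacks

w, b = [1.0, 0.0], 0.0
X = [(2.0, 0.0),   # functional margin 2: no slack needed
     (0.5, 1.0),   # functional margin 0.5: slack 0.5
     (-1.0, 0.0)]  # on the wrong side for label +1: slack 2
y = [1.0, 1.0, 1.0]
obj, slacks = soft_margin_objective(w, b, X, y, C=1.0)
print(slacks)  # [0.0, 0.5, 2.0]
print(obj)     # |w|^2 + C * 2.5 = 3.5
```

Larger C punishes slack more heavily, trading margin width for fewer violations.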

Page 29:

Strengths of SVMs

Good generalization in theory
Good generalization in practice
Work well with few training instances
Find globally best model
Efficient algorithms
Amenable to the kernel trick …

Page 30:

What if Surface is Non-Linear?

[Figure: a cluster of X points surrounded by O points; no straight line separates the two classes]

Image from http://www.atrandomresearch.com/iclass/

Page 31:

Kernel Methods

Making the Non-Linear Linear

Page 32:

When Linear Separators Fail

[Figure: points along the x1 axis in the pattern X O O O O X X X are not separable by a line in (x1, x2); re-plotting them against (x1, x1²) makes them linearly separable]

Page 33:

Mapping into a New Feature Space

Rather than run the SVM on x_i, run it on Φ(x_i)
A linear separator in the new space is a non-linear separator in the original input space
What if Φ(x_i) is really big?
Use kernels to compute it implicitly!

Φ: x → X = Φ(x)
Φ(x1, x2) = (x1, x2, x1², x2², x1x2)

Image from http://web.engr.oregonstate.edu/~afern/classes/cs534/
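The example map Φ(x1, x2) = (x1, x2, x1², x2², x1x2) is easy to write down explicitly (a sketch, not code from the slides):

```python
def phi(x1, x2):
    """The example feature map: (x1, x2, x1^2, x2^2, x1*x2)."""
    return (x1, x2, x1 * x1, x2 * x2, x1 * x2)

# A circle x1^2 + x2^2 = r^2 in the input space becomes the *linear*
# equation z3 + z4 = r^2 in the coordinates z = phi(x1, x2), so a class
# boundary that is a circle in 2-D is a hyperplane in the 5-D space.
print(phi(2.0, 3.0))  # (2.0, 3.0, 4.0, 9.0, 6.0)
```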

Page 34:

Kernels

Find a kernel K such that
K(x1, x2) = <Φ(x1), Φ(x2)>

Computing K(x1, x2) should be efficient, much more so than computing Φ(x1) and Φ(x2)

Use K(x1, x2) in the SVM algorithm rather than <x1, x2>

Remarkably, this is possible

Page 35:

The Polynomial Kernel

K(x1, x2) = <x1, x2>²

x1 = (x11, x12)
x2 = (x21, x22)

<x1, x2> = x11x21 + x12x22
<x1, x2>² = x11²x21² + x12²x22² + 2 x11x12x21x22

Φ(x1) = (x11², x12², √2 x11x12)
Φ(x2) = (x21², x22², √2 x21x22)

K(x1, x2) = <Φ(x1), Φ(x2)>
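The identity derived above can be checked numerically; this sketch compares the kernel computed in the original 2-D space against the explicit 3-D feature map (helper names are illustrative):

```python
import math

def dot(a, b):
    """Inner product <a, b>."""
    return sum(ai * bi for ai, bi in zip(a, b))

def k_poly2(a, b):
    """K(a, b) = <a, b>^2, computed entirely in the original 2-D space."""
    return dot(a, b) ** 2

def phi(x):
    """The explicit degree-2 feature map (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

a, b = (1.0, 2.0), (3.0, 4.0)
print(k_poly2(a, b))        # (1*3 + 2*4)^2 = 121.0
print(dot(phi(a), phi(b)))  # same value via the 3-D map (up to floating point)
```

The kernel does two multiplications and a square; the explicit map needs the 3-D vectors first. In high dimension that gap is what makes kernels worthwhile.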

Page 36:

The Polynomial Kernel

Φ(x) contains all monomials of degree d
Useful in visual pattern recognition
The number of monomials can be huge: a 16×16 pixel image has about 10¹⁰ monomials of degree 5
Never explicitly compute Φ(x)!
Variation: K(x1, x2) = (<x1, x2> + 1)²

Page 37:

A Few Good Kernels

Dot product kernel: K(x1, x2) = <x1, x2>
Polynomial kernel: K(x1, x2) = <x1, x2>^d (monomials of degree d)
  K(x1, x2) = (<x1, x2> + 1)^d (all monomials of degree 1, 2, …, d)
Gaussian kernel: K(x1, x2) = exp(-|x1 - x2|² / 2σ²) (radial basis functions)
Sigmoid kernel: K(x1, x2) = tanh(<x1, x2> + θ) (neural networks)

Establishing "kernel-hood" from first principles is non-trivial
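Sketches of these kernels as plain functions (the parameter values passed below are illustrative choices, not prescribed by the slides):

```python
import math

def dot(a, b):
    """Inner product <a, b> -- itself the simplest kernel."""
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(a, b, d):
    """(<a, b> + 1)^d: all monomials of degree 1..d."""
    return (dot(a, b) + 1) ** d

def gaussian_kernel(a, b, sigma):
    """exp(-|a - b|^2 / (2*sigma^2))."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def sigmoid_kernel(a, b, theta):
    """tanh(<a, b> + theta)."""
    return math.tanh(dot(a, b) + theta)

a, b = (1.0, 0.0), (0.0, 1.0)
print(poly_kernel(a, b, d=2))           # (0 + 1)^2 = 1
print(gaussian_kernel(a, a, sigma=1.0))  # distance 0 gives 1.0
print(sigmoid_kernel(a, b, theta=0.0))   # tanh(0) = 0.0
```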

Page 38:

The Kernel Trick

"Given an algorithm which is formulated in terms of a positive definite kernel K1, one can construct an alternative algorithm by replacing K1 with another positive definite kernel K2."

SVMs can use the kernel trick

Page 39:

Using a Different Kernel in the Dual Optimization Problem

For example, using the polynomial kernel with d = 4 (including lower-order terms).

Maximize over α:
W(α) = Σ_i α_i - 1/2 Σ_{i,j} α_i α_j y_i y_j <x_i, x_j>

Subject to:
α_i ≥ 0
Σ_i α_i y_i = 0

Decision function:
f(x) = sign(Σ_i α_i y_i <x, x_i> + b)

The terms <x_i, x_j> and <x, x_i> are kernels, so by the kernel trick we just replace them with (<x_i, x_j> + 1)⁴ and (<x, x_i> + 1)⁴.
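A sketch of the substitution: the same dual decision function, with the inner product swapped for the (<·, ·> + 1)⁴ kernel. The multipliers below are hand-picked for illustration, not the solution of the optimization:

```python
def dot(a, b):
    """Inner product <a, b>."""
    return sum(ai * bi for ai, bi in zip(a, b))

def poly4(a, b):
    """The replacement kernel from the slide: (<a, b> + 1)^4."""
    return (dot(a, b) + 1) ** 4

def kernel_decision(x, X, y, alpha, b, K):
    """f(x) = sign(sum_i alpha_i y_i K(x, x_i) + b): the dual decision
    function with every inner product replaced by the kernel K."""
    s = sum(a_i * y_i * K(x, x_i) for a_i, y_i, x_i in zip(alpha, y, X)) + b
    return 1 if s >= 0 else -1

# Illustrative toy data and multipliers:
X = [(1.0, 0.0), (-1.0, 0.0)]
y = [1.0, -1.0]
alpha = [0.5, 0.5]
b = 0.0
print(kernel_decision((2.0, 0.0), X, y, alpha, b, K=dot))    # linear kernel: +1
print(kernel_decision((2.0, 0.0), X, y, alpha, b, K=poly4))  # polynomial kernel: +1
```

Only the kernel changes; the optimization problem and the decision rule keep exactly the same shape.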

Page 40:

Exotic Kernels

Strings
Trees
Graphs

The hard part is establishing kernel-hood

Page 41:

Application: "Beautification Engine" (Leyvand et al., 2008)

Page 42:

Conclusion

SVMs find the optimal linear separator
The kernel trick makes SVMs non-linear learning algorithms