SVMs, Duality and the Kernel Trick
guestrin/Class/10701-S07/Slides/kernels.pdf
©2005-2007 Carlos Guestrin. SVMs, Duality and the Kernel Trick. Machine Learning – 10701/15781, Carlos Guestrin, Carnegie Mellon University, February 26th, 2007


SVMs, Duality and the Kernel Trick

Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University

February 26th, 2007


SVMs reminder
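
As a reminder, the hard-margin SVM can be written as the following primal problem. This is the standard textbook form, assuming labels y_j ∈ {−1, +1}; the factor of 1/2 is a common convention and may differ from the scaling used in lecture.

```latex
\min_{\mathbf{w},\, b} \quad \tfrac{1}{2}\, \mathbf{w} \cdot \mathbf{w}
\qquad \text{s.t.} \quad
y_j \left( \mathbf{w} \cdot \mathbf{x}_j + b \right) \ge 1 \quad \forall j
```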


Today’s lecture

Learn one of the most interesting and exciting recent advancements in machine learning
    The “kernel trick”
    High dimensional feature spaces at no extra cost!

But first, a detour
    Constrained optimization!


Constrained optimization


Lagrange multipliers – Dual variables

Moving the constraint to the objective function

Lagrangian:

Solve:


Lagrange multipliers – Dual variables

Solving:
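
A small worked example may help here. It assumes the toy problem of minimizing x² subject to x ≥ b, which is an illustrative choice and not necessarily the problem used on the slide. The constraint moves into the objective with a multiplier α ≥ 0, the inner minimization over x is solved in closed form, and the remaining one-dimensional problem in α is the dual.

```latex
% Toy problem (assumed for illustration): \min_x x^2 \ \text{s.t.}\ x \ge b
L(x, \alpha) = x^2 - \alpha\,(x - b), \qquad \alpha \ge 0

% Solve \min_x \max_{\alpha \ge 0} L(x, \alpha); swap the order and minimize over x first:
\frac{\partial L}{\partial x} = 2x - \alpha = 0 \;\Rightarrow\; x^* = \frac{\alpha}{2}

% Substitute back to get a function of the dual variable alone:
g(\alpha) = L\!\left(\tfrac{\alpha}{2}, \alpha\right) = -\frac{\alpha^2}{4} + \alpha b

% Maximize over \alpha \ge 0:
\alpha^* = \max(2b,\, 0), \qquad x^* = \max(b,\, 0)
```

When b > 0 the constraint is active (x* = b, α* > 0); when b ≤ 0 it is inactive (x* = 0, α* = 0).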


Dual SVM derivation (1) – the linearly separable case


Dual SVM derivation (2) – the linearly separable case
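
The derivation can be sketched as follows, assuming the standard hard-margin primal (minimize ½ w·w subject to y_j(w·x_j + b) ≥ 1) and one multiplier α_j ≥ 0 per training constraint. The scaling is an assumption and may differ from the lecture's.

```latex
L(\mathbf{w}, b, \alpha) = \tfrac{1}{2}\, \mathbf{w} \cdot \mathbf{w}
  - \sum_j \alpha_j \left[ y_j \left( \mathbf{w} \cdot \mathbf{x}_j + b \right) - 1 \right],
  \qquad \alpha_j \ge 0

% Stationarity in w:
\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_j \alpha_j y_j \mathbf{x}_j = 0
  \;\Rightarrow\; \mathbf{w} = \sum_j \alpha_j y_j \mathbf{x}_j

% Stationarity in b:
\frac{\partial L}{\partial b} = -\sum_j \alpha_j y_j = 0
  \;\Rightarrow\; \sum_j \alpha_j y_j = 0
```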


Dual SVM interpretation

[Figure: decision boundary w.x + b = 0]


Dual SVM formulation – the linearly separable case
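
For reference, the standard hard-margin dual is written out below, again assuming the ½ w·w scaling of the primal. Substituting w = Σ_j α_j y_j x_j and Σ_j α_j y_j = 0 into the Lagrangian leaves a problem in α alone.

```latex
\max_{\alpha} \quad \sum_i \alpha_i
  - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j\, y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j
\qquad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \quad \alpha_i \ge 0

% Recovering the primal solution:
\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i, \qquad
b = y_k - \mathbf{w} \cdot \mathbf{x}_k \quad \text{for any } k \text{ with } \alpha_k > 0
```

Only the support vectors have α_i > 0, so w is a weighted sum of a (typically small) subset of the training points.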


Dual SVM derivation – the non-separable case


Dual SVM formulation – the non-separable case
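
The standard soft-margin version is sketched below, assuming slack variables ξ_j and a penalty constant C in the primal (this notation is an assumption). The only change in the dual is an upper bound on each α_i.

```latex
% Primal (assumed form):
\min_{\mathbf{w},\, b,\, \xi} \quad \tfrac{1}{2}\, \mathbf{w} \cdot \mathbf{w} + C \sum_j \xi_j
\qquad \text{s.t.} \quad
y_j \left( \mathbf{w} \cdot \mathbf{x}_j + b \right) \ge 1 - \xi_j, \quad \xi_j \ge 0

% Dual:
\max_{\alpha} \quad \sum_i \alpha_i
  - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j\, y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j
\qquad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C

% Recovering b from a point exactly on the margin:
b = y_k - \mathbf{w} \cdot \mathbf{x}_k \quad \text{for any } k \text{ with } 0 < \alpha_k < C
```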


Announcements

Class projects out later this week


Why did we learn about the dual SVM?

There are some quadratic programming algorithms that can solve the dual faster than the primal

But, more importantly, the “kernel trick”!!!
    Another little detour…


Reminder from last time: What if the data is not linearly separable?

Use features of features of features of features….

Feature space can get really large really quickly!


Higher order polynomials

[Figure: number of monomial terms vs. number of input dimensions, with curves for d = 2, d = 3 and d = 4]

m – input features
d – degree of polynomial

Grows fast! d = 6, m = 100: about 1.6 billion terms
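
The figure of about 1.6 billion terms can be checked directly. The sketch below assumes the count refers to distinct monomials of degree exactly d in m variables, which is the binomial coefficient C(m + d − 1, d); the helper name `num_monomials` is made up for this example.

```python
from math import comb


def num_monomials(m: int, d: int) -> int:
    """Number of distinct monomials of degree exactly d in m input features."""
    return comb(m + d - 1, d)


for d in (2, 3, 4, 6):
    print(f"m=100, d={d}: {num_monomials(100, d):,} terms")
# The d=6 row prints 1,609,344,100 terms, i.e. about 1.6 billion.
```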


Dual formulation only depends on dot-products, not on w!


Dot-product of polynomials
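
A quick numerical check of the idea, under one assumption: the degree-d feature map Φ lists every ordered product of d input coordinates (so u1·u2 and u2·u1 are separate features, m^d in total). With that choice, Φ(u)·Φ(v) equals (u·v)^d exactly. The function names here are illustrative.

```python
from itertools import product
from math import prod


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def monomial_features(x, d):
    """All ordered degree-d products x[i1] * ... * x[id] (len(x)**d features)."""
    return [prod(x[i] for i in idx) for idx in product(range(len(x)), repeat=d)]


u = [1, 2, 3]
v = [4, -1, 2]
d = 3

# Explicit route: build 3**3 = 27 features per vector, then take the dot product.
explicit = dot(monomial_features(u, d), monomial_features(v, d))

# Kernel route: one dot product in the input space, raised to the d-th power.
kernel = dot(u, v) ** d

print(explicit, kernel)  # both are 512, since u.v = 8 and 8**3 = 512
```

The explicit route grows as m^d, while the kernel route needs only one m-dimensional dot product and a power.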


Finally: the “kernel trick”!

Never represent features explicitly
    Compute dot products in closed form

Constant-time high-dimensional dot-products for many classes of features

Very interesting theory – Reproducing Kernel Hilbert Spaces
    Not covered in detail in 10701/15781, more in 10702


Polynomial kernels

All monomials of degree d in O(d) operations:

How about all monomials of degree up to d?

Solution 0:

Better solution:
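
The standard constructions are written out below as a sketch. Reading "Solution 0" as a sum of one kernel per degree is an assumption about what the slide showed; the (u·v + 1)^d form is the usual polynomial kernel of degree up to d.

```latex
% All monomials of degree exactly d:
K(\mathbf{u}, \mathbf{v}) = (\mathbf{u} \cdot \mathbf{v})^d

% Degree up to d, by summing one kernel per degree (assumed reading of "Solution 0"):
K(\mathbf{u}, \mathbf{v}) = \sum_{i=0}^{d} (\mathbf{u} \cdot \mathbf{v})^i

% Degree up to d in a single expression:
K(\mathbf{u}, \mathbf{v}) = (\mathbf{u} \cdot \mathbf{v} + 1)^d
  = \sum_{i=0}^{d} \binom{d}{i} (\mathbf{u} \cdot \mathbf{v})^i
```

The binomial expansion shows that the single expression contains every degree from 0 to d, with each degree weighted by a binomial coefficient.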


Common kernels

Polynomials of degree d

Polynomials of degree up to d

Gaussian kernels

Sigmoid
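
The four kernels can be sketched in a few lines. The parameter names (`sigma`, `eta`, `nu`) and the 2σ² scaling of the Gaussian kernel follow common textbook conventions and are assumptions here, not taken from the slide.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(u, v, d):
    """Polynomials of degree exactly d: (u.v)^d."""
    return dot(u, v) ** d


def poly_up_to_kernel(u, v, d):
    """Polynomials of degree up to d: (u.v + 1)^d."""
    return (dot(u, v) + 1) ** d


def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||u - v||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def sigmoid_kernel(u, v, eta=1.0, nu=0.0):
    """Sigmoid kernel: tanh(eta * u.v + nu)."""
    return math.tanh(eta * dot(u, v) + nu)


u, v = [1, 2], [3, 4]
print(poly_kernel(u, v, 2))        # (1*3 + 2*4)^2 = 121
print(poly_up_to_kernel(u, v, 2))  # (11 + 1)^2 = 144
print(gaussian_kernel(u, u))       # 1.0 for identical inputs
```

One caveat: unlike the other three, the sigmoid kernel is not positive semi-definite for all parameter settings.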


Overfitting?

Huge feature space with kernels, what about overfitting???
    Maximizing margin leads to sparse set of support vectors
    Some interesting theory says that SVMs search for simple hypothesis with large margin
    Often robust to overfitting


What about at classification time

For a new input x, if we need to represent Φ(x), we are in trouble!

Recall classifier: sign(w.Φ(x)+b)

Using kernels we are cool!


SVMs with kernels

Choose a set of features and kernel function

Solve dual problem to obtain support vectors αi

At classification time, compute:

Classify as
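
Classification can be sketched as follows, assuming the usual form f(x) = Σ_i α_i y_i K(x_i, x) + b followed by sign(f(x)). The α values below were worked out by hand for a two-point toy problem with a linear kernel, and the function names are illustrative.

```python
def linear_kernel(u, v):
    """K(u, v) = u.v; any other kernel function could be passed instead."""
    return sum(a * b for a, b in zip(u, v))


def decision_value(x, support_x, support_y, alphas, b, kernel):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b; Phi(x) is never formed."""
    return sum(a * y * kernel(xi, x)
               for a, y, xi in zip(alphas, support_y, support_x)) + b


def classify(x, support_x, support_y, alphas, b, kernel):
    """Predicted label in {-1, +1}."""
    f = decision_value(x, support_x, support_y, alphas, b, kernel)
    return 1 if f >= 0 else -1


# Toy problem: x = +1 labelled +1 and x = -1 labelled -1.
# The hard-margin dual solution is alpha = (0.5, 0.5), giving w = 1 and b = 0.
support_x = [[1.0], [-1.0]]
support_y = [1, -1]
alphas = [0.5, 0.5]
b = 0.0

print(decision_value([2.0], support_x, support_y, alphas, b, linear_kernel))  # 2.0
print(classify([-3.0], support_x, support_y, alphas, b, linear_kernel))       # -1
```

The cost per prediction is one kernel evaluation per support vector, independent of the dimension of the feature space.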


What’s the difference between SVMs and Logistic Regression?

                                          SVMs          Logistic Regression
Loss function                             Hinge loss    Log-loss
High dimensional features with kernels    Yes!          No


Kernels in logistic regression

Define weights in terms of support vectors:

Derive simple gradient descent rule on αi
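
A minimal sketch of this idea, under several assumptions: weights are written as w = Σ_i α_i Φ(x_i), so that w·Φ(x) = Σ_i α_i K(x_i, x); labels are in {0, 1}; there is no regularization; and the kernel is a Gaussian with a made-up `gamma` parameter. The gradient of the log-likelihood with respect to α_j is then Σ_i (y_i − P(y=1|x_i)) K(x_j, x_i).

```python
import math


def rbf(u, v, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_kernel_logreg(X, y, kernel, lr=0.5, steps=500):
    """Gradient ascent on the log-likelihood over alpha and b."""
    n = len(X)
    K = [[kernel(X[i], X[j]) for j in range(n)] for i in range(n)]
    alpha, b = [0.0] * n, 0.0
    for _ in range(steps):
        # Residuals y_i - P(y = 1 | x_i) under the current parameters.
        resid = [y[i] - sigmoid(sum(alpha[j] * K[i][j] for j in range(n)) + b)
                 for i in range(n)]
        # d(log-likelihood)/d(alpha_j) = sum_i resid_i * K(x_i, x_j)
        grad = [sum(resid[i] * K[i][j] for i in range(n)) for j in range(n)]
        alpha = [alpha[j] + lr * grad[j] for j in range(n)]
        b += lr * sum(resid)
    return alpha, b


def predict_proba(x, X, alpha, b, kernel):
    """P(y = 1 | x) = sigmoid(sum_i alpha_i K(x_i, x) + b)."""
    return sigmoid(sum(a * kernel(xi, x) for a, xi in zip(alpha, X)) + b)


# XOR: not linearly separable in the input space.
X = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
y = [0, 0, 1, 1]
alpha, b = train_kernel_logreg(X, y, rbf)
preds = [int(predict_proba(x, X, alpha, b, rbf) > 0.5) for x in X]
print(preds)  # matches y
```

Every training point ends up with a nonzero α_i here, which illustrates why the kernelized logistic regression solution is generally not sparse.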


What’s the difference between SVMs and Logistic Regression? (Revisited)

                                          SVMs          Logistic Regression
Loss function                             Hinge loss    Log-loss
High dimensional features with kernels    Yes!          Yes!
Solution sparse                           Often yes!    Almost always no!
Semantics of output                       “Margin”      Real probabilities
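
The two loss functions can be compared numerically. The sketch below assumes labels y ∈ {−1, +1} and writes both losses in terms of the margin y·f(x), with the standard definitions max(0, 1 − margin) and ln(1 + exp(−margin)).

```python
import math


def hinge_loss(margin):
    """SVM loss: exactly zero once the margin y*f(x) reaches 1."""
    return max(0.0, 1.0 - margin)


def log_loss(margin):
    """Logistic regression loss: positive for every finite margin."""
    return math.log(1.0 + math.exp(-margin))


for m in (-1.0, 0.0, 1.0, 2.0):
    print(f"margin={m:+.1f}  hinge={hinge_loss(m):.4f}  log={log_loss(m):.4f}")
```

Points with margin above 1 contribute nothing to the hinge loss, which is what makes the SVM solution often sparse; the log-loss never reaches zero, so every point keeps some influence.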


What you need to know

Dual SVM formulation
    How it’s derived
The kernel trick
Derive polynomial kernel
Common kernels
Kernelized logistic regression
Differences between SVMs and logistic regression