Support Vector Machines
Aarti Singh
Machine Learning 10-601, Nov 22, 2011
SVMs reminder
min_{w,b,ξ}  w.w + C Σj ξj      (regularization + hinge loss)
s.t.  (w.xj + b) yj ≥ 1 − ξj   ∀j
      ξj ≥ 0   ∀j

Soft margin approach.
Essentially a constrained optimization problem!
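Before moving to the constrained view, here is a minimal training sketch, assuming plain NumPy and sub-gradient descent on the equivalent unconstrained form (at the optimum, ξj = max(0, 1 − yj(w.xj + b))); the function name, step size, and epoch count are illustrative choices, not from the slides.

import numpy as np

def train_soft_margin_svm(X, y, C=1.0, lr=1e-3, epochs=1000):
    """X: (n, d) features; y: (n,) labels in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)        # y_j (w.x_j + b)
        viol = margins < 1               # points violating the margin
        # sub-gradient of  w.w + C * sum_j max(0, 1 - y_j (w.x_j + b))
        grad_w = 2 * w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b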
Constrained Optimization

Toy example: minimize x² subject to x ≥ b.

Lagrangian (α is the Lagrange multiplier):
L(x, α) = x² − α(x − b)

Primal problem:
f* = min_x max_{α≥0} [ x² − α(x − b) ]

Dual problem:
d* = max_{α≥0} min_x [ x² − α(x − b) ]

In general, d* ≤ f*. When is d* = f*?
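A worked version of this toy problem, assuming the primal behind the Lagrangian above is min_x x² subject to x ≥ b:

\[
\begin{aligned}
d^* = \max_{\alpha \ge 0} \min_x \big[\, x^2 - \alpha(x-b) \,\big]
    &= \max_{\alpha \ge 0} \Big[ -\tfrac{\alpha^2}{4} + \alpha b \Big]
      \qquad \text{(inner minimum at } x = \alpha/2\text{)} \\
    &= \begin{cases} b^2 & \text{if } b > 0 \;\;(\alpha^* = 2b,\; x^* = b)\\[2pt]
                     0   & \text{if } b \le 0 \;\;(\alpha^* = 0,\; x^* = 0) \end{cases}
\end{aligned}
\]

In both cases d* equals the primal optimum f*, so the duality gap is zero for this convex problem; the next slide states the general condition (KKT).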
Constrained Optimization

Primal problem:  f* = min_x max_{α≥0} L(x, α)        x* = primal solution
Dual problem:    d* = max_{α≥0} min_x L(x, α)        α* = dual solution

d* = f* if f is convex and x*, α* satisfy the KKT conditions:
• ∇L(x*, α*) = 0          Zero gradient
• x* ≥ b                  Primal feasibility
• α* ≥ 0                  Dual feasibility
• α*(x* − b) = 0          Complementary slackness

⇒ If α* > 0, then x* = b, i.e. the constraint is effective.
Constrained Optimization

(Figure: the toy problem in two cases, one where the constraint is inactive and one where it binds.)
• α* = 0: constraint is ineffective
• α* = 1/2 > 0: constraint is effective
Complementary slackness: α*(x* − 1) = 0
Dual SVM – linearly separable case

• Primal problem:  min_{w,b}  w.w   s.t.  (w.xj + b) yj ≥ 1   ∀j
• Lagrangian:  L(w, b, α) = w.w − Σj αj [ (w.xj + b) yj − 1 ],   with αj ≥ 0

w – weights on features
α – weights on training points
αj > 0  ⇒  constraint is effective, (w.xj + b) yj = 1  ⇒  point j is a support vector!!
Dual SVM – linearly separable case

• Dual problem derivation:

If we can solve for the α's (dual problem), then we have a solution for w (primal problem).
Dual SVM – linearly separable case

• Dual problem derivation:
• Dual problem:
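A sketch of both the derivation and the resulting dual, under the common convention that the separable-case objective is ½ w.w (the slides' w.w scaling only rescales the α's by a constant factor):

\[
\begin{aligned}
L(w,b,\alpha) &= \tfrac{1}{2}\, w \cdot w \;-\; \sum_j \alpha_j \big[ (w \cdot x_j + b)\, y_j - 1 \big], \qquad \alpha_j \ge 0\\
\frac{\partial L}{\partial w} = 0 \;&\Rightarrow\; w = \sum_j \alpha_j y_j x_j,
\qquad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_j \alpha_j y_j = 0\\
\text{Dual:}\quad & \max_{\alpha}\; \sum_j \alpha_j \;-\; \tfrac{1}{2} \sum_{j,k} \alpha_j \alpha_k\, y_j y_k\, (x_j \cdot x_k)
\quad \text{s.t. } \alpha_j \ge 0\;\;\forall j,\;\; \sum_j \alpha_j y_j = 0
\end{aligned}
\]

Note that the dual depends on the data only through the dot products xj.xk – this is what the kernel trick later exploits.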
Dual SVM – linearly separable case

• Dual problem:
The dual problem is also a QP. Its solution gives the αj's.
Use a support vector k to compute b: since (w.xk + b) yk = 1 and yk ∈ {−1, +1}, we have w.xk + b = yk, so b = yk − w.xk.
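A quick sketch connecting this to software, assuming scikit-learn is available (not part of the slides); the tiny dataset and the large-C choice to mimic a hard margin are illustrative.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.2], [-2.0, -2.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6)       # very large C approximates the hard margin
clf.fit(X, y)

# dual_coef_ stores y_j * alpha_j for the support vectors only
sv = clf.support_vectors_
w = clf.dual_coef_.ravel() @ sv         # w = sum_j alpha_j y_j x_j
k = 0                                   # any support vector lies on the margin
b = y[clf.support_[k]] - w @ sv[k]      # b = y_k - w.x_k
print(w, b, clf.intercept_)             # b should agree with clf.intercept_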
Dual SVM Interpretation: Sparsity

Only a few αj's can be non-zero: those where the constraint is tight, (w.xj + b) yj = 1.
Support vectors – training points j whose αj's are non-zero.
(Figure: decision boundary w.x + b = 0; points on the margin are marked αj > 0 and are the support vectors, points away from the margin have αj = 0.)
So why solve the dual SVM?

• There are some quadratic programming algorithms that can solve the dual faster than the primal, especially in high dimensions, d ≫ n.

Recall: (w1, w2, …, wd, b) – d+1 primal variables; α1, α2, …, αn – n dual variables.
Dual SVM – non-separable case

• Primal problem:
• Dual problem:
(Lagrange multipliers)
Dual SVM – non-separable case

The dual problem is also a QP. Its solution gives the αj's.
The constraint αi ≤ C comes from the slack penalty. Intuition: earlier (separable case), if a constraint is violated, αi → ∞; now, if a constraint is violated, αi is capped at C.
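For reference, a sketch of the resulting soft-margin dual, under the same ½ w.w convention as the earlier sketch; the only change from the separable case is the upper bound C on each αj:

\[
\max_{\alpha}\; \sum_j \alpha_j \;-\; \tfrac{1}{2} \sum_{j,k} \alpha_j \alpha_k\, y_j y_k\, (x_j \cdot x_k)
\qquad \text{s.t. } 0 \le \alpha_j \le C \;\;\forall j, \quad \sum_j \alpha_j y_j = 0
\]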
So why solve the dual SVM?

• There are some quadratic programming algorithms that can solve the dual faster than the primal, especially in high dimensions, d ≫ n.

Recall: (w1, w2, …, wd, b) – d+1 primal variables; α1, α2, …, αn – n dual variables.

• But, more importantly, the "kernel trick"!!!
What if data is not linearly separable?

Using non-linear features to get linear separation.
(Figure: 2-D data in (x1, x2) becomes linearly separable in the transformed coordinates radius r = √(x1² + x2²) and angle θ.)

What if data is not linearly separable?

Using non-linear features to get linear separation.
(Figure: 1-D data x with labels y becomes linearly separable after adding the feature x².)
What if data is not linearly separable?

Use higher-order features, e.g. Φ(x) = (x1², x2², x1x2, …, exp(x1)) – features of features of features of features….
Feature space becomes really large very quickly!
Can we get by without having to write out the features explicitly?
Higher Order Polynomials

d – input features, m – degree of polynomial
Number of degree-m terms:  C(m+d−1, m) = (m+d−1)! / ( m! (d−1)! )  – on the order of d^m, grows fast!
m = 6, d = 100: about 1.6 billion terms.
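A quick sanity check of that count, assuming we are counting monomials of exact degree m in d variables:

from math import comb

d, m = 100, 6
print(comb(m + d - 1, m))   # 1609344100 -> about 1.6 billion terms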
Dual formulation only depends on dot-products, not on w!

Φ(x) – high-dimensional feature space, but we never need it explicitly as long as we can compute the dot product fast using some kernel K.
Dot Product of Polynomials

Let Φ(x) be the degree-m polynomial features of x = (x1, x2).
m = 1:  Φ(x) = (x1, x2),  so  Φ(x).Φ(z) = x1z1 + x2z2 = x.z
m = 2:  Φ(x) = (x1², x2², √2 x1x2),  so  Φ(x).Φ(z) = x1²z1² + x2²z2² + 2x1x2z1z2 = (x.z)²
General m:  Φ(x).Φ(z) = (x.z)^m = K(x, z)

Don't store high-dimensional features – only evaluate dot-products with kernels.
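A small numerical check of the m = 2 line above, assuming 2-D inputs and the feature map Φ(x) = (x1², x2², √2 x1x2):

import numpy as np

def phi2(x):
    # explicit degree-2 features with the sqrt(2) cross term
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, z = np.array([1.0, 3.0]), np.array([2.0, -1.0])
print(phi2(x) @ phi2(z))    # dot product in feature space
print((x @ z) ** 2)         # kernel value (x.z)^2 -- identical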
Finally: The Kernel Trick!

• Never represent features explicitly – compute dot products in closed form
• Constant-time high-dimensional dot-products for many classes of features
Common Kernels

• Polynomials of degree d
• Polynomials of degree up to d
• Gaussian/Radial kernels (polynomials of all orders – recall series expansion)
• Sigmoid
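Written out as code, a sketch using the standard textbook forms of these kernels; the slide's exact parameterization may differ.

import numpy as np

def poly_kernel(x, z, d):            # polynomials of degree exactly d
    return (x @ z) ** d

def poly_up_to_kernel(x, z, d):      # polynomials of degree up to d
    return (x @ z + 1) ** d

def gaussian_kernel(x, z, sigma):    # Gaussian / radial basis function
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, z, eta, nu):   # sigmoid
    return np.tanh(eta * (x @ z) + nu)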
Which Functions Can Be Kernels?

• Not all functions – for some definitions of K(x1, x2) there is no corresponding projection Φ(x)
• Nice theory on this, including how to construct new kernels from existing ones
• Initially kernels were defined over data points in Euclidean space, but more recently over strings, over trees, over graphs, …
Overfitting

• Huge feature space with kernels – what about overfitting???
  – Maximizing the margin leads to a sparse set of support vectors (decision boundary not too complicated)
  – Some interesting theory says that SVMs search for simple hypotheses with large margin
  – Often robust to overfitting
What about classification time?

• For a new input x, if we need to represent Φ(x), we are in trouble!
• Recall classifier: sign(w.Φ(x) + b)
• Using kernels we are cool!
SVMs with Kernels

• Choose a set of features and a kernel function
• Solve the dual problem to obtain the support vectors and their αi
• At classification time, compute f(x) = Σi αi yi K(xi, x) + b
Classify as sign(f(x)).
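A minimal sketch of that classification step, assuming the αi, b, support vectors X_sv, labels y_sv, and a kernel function (e.g. one of those above) are already available:

import numpy as np

def svm_predict(x, X_sv, y_sv, alphas, b, kernel):
    # f(x) = sum_i alpha_i y_i K(x_i, x) + b, summed over the support vectors only
    f = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alphas, y_sv, X_sv)) + b
    return np.sign(f)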
SVM Decision Surface using Gaussian Kernel

(Bishop Fig 7.2. Circled points are the support vectors: training examples with non-zero αj. Points are plotted in the original 2-D space; contour lines show constant values of the decision function.)

SVM Soft Margin Decision Surface using Gaussian Kernel

(Bishop Fig 7.4. Circled points are the support vectors: training examples with non-zero αj. Points are plotted in the original 2-D space; contour lines show constant values of the decision function.)
SVMs vs. Kernel Regression

(The slide compares the SVM prediction rule with the kernel regression prediction rule; both are written out after the list below.)

Differences:
• SVMs:
  – Learn weights αi (and bandwidth)
  – Often sparse solution
• KR:
  – Fixed "weights", learn bandwidth
  – Solution may not be sparse
  – Much simpler to implement
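For concreteness, the two prediction rules being contrasted, writing kernel regression in its usual Nadaraya-Watson form (an assumption, since the slide's formulas were not captured in the transcript):

\[
\text{SVM:}\;\; \hat{y}(x) = \operatorname{sign}\!\Big( \sum_i \alpha_i\, y_i\, K(x_i, x) + b \Big)
\qquad\quad
\text{KR:}\;\; \hat{y}(x) = \operatorname{sign}\!\Big( \tfrac{\sum_i y_i\, K(x_i, x)}{\sum_i K(x_i, x)} \Big)
\]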
SVMs vs. Logistic Regression

                                          SVMs          Logistic Regression
  Loss function                           Hinge loss    Log-loss
  High-dimensional features with kernels  Yes!          Yes!
  Solution sparse                         Often yes!    Almost always no!
  Semantics of output                     "Margin"      Real probabilities
Kernels in Logistic Regression

• Define weights in terms of features:
• Derive a simple gradient descent rule on the αi
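A sketch of what that rule can look like, assuming weights defined as w = Σi αi Φ(xi), labels yj ∈ {−1, +1}, and gradient ascent on the log-likelihood (i.e. descent on the log-loss); the learning rate and epoch count are illustrative and there is no regularization in this sketch.

import numpy as np

def kernel_logistic_regression(K, y, lr=0.1, epochs=200):
    """K: (n, n) kernel matrix with K[i, j] = K(x_i, x_j); y: labels in {-1, +1}."""
    alpha = np.zeros(len(y))
    for _ in range(epochs):
        margins = y * (K @ alpha)             # y_j * w.Phi(x_j)
        p = 1.0 / (1.0 + np.exp(margins))     # 1 - sigma(y_j f_j), weight of each example
        alpha += lr * (K @ (y * p))           # gradient of the log-likelihood w.r.t. alpha
    return alpha

# Predict a new point x via f(x) = sum_i alpha_i K(x_i, x); classify as sign(f(x)).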
SVMs vs. Logistic Regression

                                          SVMs          Logistic Regression
  Loss function                           Hinge loss    Log-loss
  High-dimensional features with kernels  Yes!          Yes!
  Solution sparse                         Often yes!    Almost always no!
  Semantics of output                     "Margin"      Real probabilities
Kernels: Key Points

• Many learning tasks are framed as optimization problems
• Primal and dual formulations of optimization problems
• Dual version framed in terms of dot products between x's
• Kernel functions K(x, y) allow calculating dot products <Φ(x), Φ(y)> without bothering to project x into Φ(x)
• Leads to major efficiencies, and the ability to use very high-dimensional (virtual) feature spaces
What you need to know…

• Dual SVM formulation – how it's derived
• The kernel trick
• Common kernels
• Differences between SVMs and kernel regression
• Differences between SVMs and logistic regression
• Kernelized logistic regression
• Also: kernel PCA, kernel ICA, …