Support Vector Machines
Aarti Singh
Machine Learning 10-601, Nov 22, 2011
SVMs reminder
min_{w,b,ξ}  w.w + C Σj ξj      (regularization + hinge loss)
s.t.  (w.xj + b) yj ≥ 1 − ξj   ∀j
      ξj ≥ 0   ∀j

Soft margin approach.
Essentially a constrained optimization problem!
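Before moving to the constrained view, here is a minimal training sketch, assuming plain NumPy and sub-gradient descent on the equivalent unconstrained form (at the optimum, ξj = max(0, 1 − yj(w.xj + b))); the function name, step size, and epoch count are illustrative choices, not from the slides.

import numpy as np

def train_soft_margin_svm(X, y, C=1.0, lr=1e-3, epochs=1000):
    """X: (n, d) features; y: (n,) labels in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)        # y_j (w.x_j + b)
        viol = margins < 1               # points violating the margin
        # sub-gradient of  w.w + C * sum_j max(0, 1 - y_j (w.x_j + b))
        grad_w = 2 * w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b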
Constrained Optimization

Toy example: minimize x² subject to x ≥ b.

Lagrangian (α is the Lagrange multiplier):
L(x, α) = x² − α(x − b)

Primal problem:
f* = min_x max_{α≥0} [ x² − α(x − b) ]

Dual problem:
d* = max_{α≥0} min_x [ x² − α(x − b) ]

In general, d* ≤ f*. When is d* = f*?
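A worked version of this toy problem, assuming the primal behind the Lagrangian above is min_x x² subject to x ≥ b:

\[
\begin{aligned}
d^* = \max_{\alpha \ge 0} \min_x \big[\, x^2 - \alpha(x-b) \,\big]
    &= \max_{\alpha \ge 0} \Big[ -\tfrac{\alpha^2}{4} + \alpha b \Big]
      \qquad \text{(inner minimum at } x = \alpha/2\text{)} \\
    &= \begin{cases} b^2 & \text{if } b > 0 \;\;(\alpha^* = 2b,\; x^* = b)\\[2pt]
                     0   & \text{if } b \le 0 \;\;(\alpha^* = 0,\; x^* = 0) \end{cases}
\end{aligned}
\]

In both cases d* equals the primal optimum f*, so the duality gap is zero for this convex problem; the next slide states the general condition (KKT).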
Constrained Optimization

Primal problem:  f* = min_x max_{α≥0} L(x, α)        x* = primal solution
Dual problem:    d* = max_{α≥0} min_x L(x, α)        α* = dual solution

d* = f* if f is convex and x*, α* satisfy the KKT conditions:
• ∇L(x*, α*) = 0          Zero gradient
• x* ≥ b                  Primal feasibility
• α* ≥ 0                  Dual feasibility
• α*(x* − b) = 0          Complementary slackness

⇒ If α* > 0, then x* = b, i.e. the constraint is effective.
Constrained Optimization

(Figure: the toy problem in two cases, one where the constraint is inactive and one where it binds.)
• α* = 0: constraint is ineffective
• α* = 1/2 > 0: constraint is effective
Complementary slackness: α*(x* − 1) = 0
Dual SVM – linearly separable case

• Primal problem:  min_{w,b}  w.w   s.t.  (w.xj + b) yj ≥ 1   ∀j
• Lagrangian:  L(w, b, α) = w.w − Σj αj [ (w.xj + b) yj − 1 ],   with αj ≥ 0

w – weights on features
α – weights on training points
αj > 0  ⇒  constraint is effective, (w.xj + b) yj = 1  ⇒  point j is a support vector!!
Dual SVM – linearly separable case

• Dual problem derivation:

If we can solve for the α's (dual problem), then we have a solution for w (primal problem).
Dual SVM – linearly separable case

• Dual problem derivation:
• Dual problem:
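A sketch of both the derivation and the resulting dual, under the common convention that the separable-case objective is ½ w.w (the slides' w.w scaling only rescales the α's by a constant factor):

\[
\begin{aligned}
L(w,b,\alpha) &= \tfrac{1}{2}\, w \cdot w \;-\; \sum_j \alpha_j \big[ (w \cdot x_j + b)\, y_j - 1 \big], \qquad \alpha_j \ge 0\\
\frac{\partial L}{\partial w} = 0 \;&\Rightarrow\; w = \sum_j \alpha_j y_j x_j,
\qquad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_j \alpha_j y_j = 0\\
\text{Dual:}\quad & \max_{\alpha}\; \sum_j \alpha_j \;-\; \tfrac{1}{2} \sum_{j,k} \alpha_j \alpha_k\, y_j y_k\, (x_j \cdot x_k)
\quad \text{s.t. } \alpha_j \ge 0\;\;\forall j,\;\; \sum_j \alpha_j y_j = 0
\end{aligned}
\]

Note that the dual depends on the data only through the dot products xj.xk – this is what the kernel trick later exploits.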
Dual SVM – linearly separable case

• Dual problem:
The dual problem is also a QP. Its solution gives the αj's.
Use a support vector k to compute b: since (w.xk + b) yk = 1 and yk ∈ {−1, +1}, we have w.xk + b = yk, so b = yk − w.xk.
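A quick sketch connecting this to software, assuming scikit-learn is available (not part of the slides); the tiny dataset and the large-C choice to mimic a hard margin are illustrative.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.2], [-2.0, -2.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6)       # very large C approximates the hard margin
clf.fit(X, y)

# dual_coef_ stores y_j * alpha_j for the support vectors only
sv = clf.support_vectors_
w = clf.dual_coef_.ravel() @ sv         # w = sum_j alpha_j y_j x_j
k = 0                                   # any support vector lies on the margin
b = y[clf.support_[k]] - w @ sv[k]      # b = y_k - w.x_k
print(w, b, clf.intercept_)             # b should agree with clf.intercept_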
Dual SVM Interpretation: Sparsity

Only a few αj's can be non-zero: those where the constraint is tight, (w.xj + b) yj = 1.
Support vectors – training points j whose αj's are non-zero.
(Figure: decision boundary w.x + b = 0; points on the margin are marked αj > 0 and are the support vectors, points away from the margin have αj = 0.)
So why solve the dual SVM?

• There are some quadratic programming algorithms that can solve the dual faster than the primal, especially in high dimensions, d ≫ n.

Recall: (w1, w2, …, wd, b) – d+1 primal variables; α1, α2, …, αn – n dual variables.
Dual SVM – non-separable case

• Primal problem:
• Dual problem:
(Lagrange multipliers)
Dual SVM – non-separable case

The dual problem is also a QP. Its solution gives the αj's.
The constraint αi ≤ C comes from the slack penalty. Intuition: earlier (separable case), if a constraint is violated, αi → ∞; now, if a constraint is violated, αi is capped at C.
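For reference, a sketch of the resulting soft-margin dual, under the same ½ w.w convention as the earlier sketch; the only change from the separable case is the upper bound C on each αj:

\[
\max_{\alpha}\; \sum_j \alpha_j \;-\; \tfrac{1}{2} \sum_{j,k} \alpha_j \alpha_k\, y_j y_k\, (x_j \cdot x_k)
\qquad \text{s.t. } 0 \le \alpha_j \le C \;\;\forall j, \quad \sum_j \alpha_j y_j = 0
\]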
So why solve the dual SVM?

• There are some quadratic programming algorithms that can solve the dual faster than the primal, especially in high dimensions, d ≫ n.

Recall: (w1, w2, …, wd, b) – d+1 primal variables; α1, α2, …, αn – n dual variables.

• But, more importantly, the "kernel trick"!!!
What if data is not linearly separable?

Using non-linear features to get linear separation.
(Figure: 2-D data in (x1, x2) becomes linearly separable in the transformed coordinates radius r = √(x1² + x2²) and angle θ.)

What if data is not linearly separable?

Using non-linear features to get linear separation.
(Figure: 1-D data x with labels y becomes linearly separable after adding the feature x².)
What if data is not linearly separable?

Use higher-order features, e.g. Φ(x) = (x1², x2², x1x2, …, exp(x1)) – features of features of features of features….
Feature space becomes really large very quickly!
Can we get by without having to write out the features explicitly?
Higher Order Polynomials

d – input features, m – degree of polynomial
Number of degree-m terms:  C(m+d−1, m) = (m+d−1)! / ( m! (d−1)! )  – on the order of d^m, grows fast!
m = 6, d = 100: about 1.6 billion terms.
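A quick sanity check of that count, assuming we are counting monomials of exact degree m in d variables:

from math import comb

d, m = 100, 6
print(comb(m + d - 1, m))   # 1609344100 -> about 1.6 billion terms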
Dual formulation only depends on dot-products, not on w!

Φ(x) – high-dimensional feature space, but we never need it explicitly as long as we can compute the dot product fast using some kernel K.
Dot Product of Polynomials

Let Φ(x) be the degree-m polynomial features of x = (x1, x2).
m = 1:  Φ(x) = (x1, x2),  so  Φ(x).Φ(z) = x1z1 + x2z2 = x.z
m = 2:  Φ(x) = (x1², x2², √2 x1x2),  so  Φ(x).Φ(z) = x1²z1² + x2²z2² + 2x1x2z1z2 = (x.z)²
General m:  Φ(x).Φ(z) = (x.z)^m = K(x, z)

Don't store high-dimensional features – only evaluate dot-products with kernels.
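A small numerical check of the m = 2 line above, assuming 2-D inputs and the feature map Φ(x) = (x1², x2², √2 x1x2):

import numpy as np

def phi2(x):
    # explicit degree-2 features with the sqrt(2) cross term
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, z = np.array([1.0, 3.0]), np.array([2.0, -1.0])
print(phi2(x) @ phi2(z))    # dot product in feature space
print((x @ z) ** 2)         # kernel value (x.z)^2 -- identical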
Finally: The Kernel Trick!

• Never represent features explicitly – compute dot products in closed form
• Constant-time high-dimensional dot-products for many classes of features
Common Kernels

• Polynomials of degree d
• Polynomials of degree up to d
• Gaussian/Radial kernels (polynomials of all orders – recall series expansion)
• Sigmoid
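Written out as code, a sketch using the standard textbook forms of these kernels; the slide's exact parameterization may differ.

import numpy as np

def poly_kernel(x, z, d):            # polynomials of degree exactly d
    return (x @ z) ** d

def poly_up_to_kernel(x, z, d):      # polynomials of degree up to d
    return (x @ z + 1) ** d

def gaussian_kernel(x, z, sigma):    # Gaussian / radial basis function
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, z, eta, nu):   # sigmoid
    return np.tanh(eta * (x @ z) + nu)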
Which Functions Can Be Kernels?

• Not all functions – for some definitions of K(x1, x2) there is no corresponding projection Φ(x)
• Nice theory on this, including how to construct new kernels from existing ones
• Initially kernels were defined over data points in Euclidean space, but more recently over strings, over trees, over graphs, …
Overfitting

• Huge feature space with kernels – what about overfitting???
  – Maximizing the margin leads to a sparse set of support vectors (decision boundary not too complicated)
  – Some interesting theory says that SVMs search for simple hypotheses with large margin
  – Often robust to overfitting
What about classification time?

• For a new input x, if we need to represent Φ(x), we are in trouble!
• Recall classifier: sign(w.Φ(x) + b)
• Using kernels we are cool!
SVMs with Kernels

• Choose a set of features and a kernel function
• Solve the dual problem to obtain the support vectors and their αi
• At classification time, compute f(x) = Σi αi yi K(xi, x) + b
Classify as sign(f(x)).
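A minimal sketch of that classification step, assuming the αi, b, support vectors X_sv, labels y_sv, and a kernel function (e.g. one of those above) are already available:

import numpy as np

def svm_predict(x, X_sv, y_sv, alphas, b, kernel):
    # f(x) = sum_i alpha_i y_i K(x_i, x) + b, summed over the support vectors only
    f = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alphas, y_sv, X_sv)) + b
    return np.sign(f)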
SVM Decision Surface using Gaussian Kernel

(Bishop Fig 7.2. Circled points are the support vectors: training examples with non-zero αj. Points are plotted in the original 2-D space; contour lines show constant values of the decision function.)

SVM Soft Margin Decision Surface using Gaussian Kernel

(Bishop Fig 7.4. Circled points are the support vectors: training examples with non-zero αj. Points are plotted in the original 2-D space; contour lines show constant values of the decision function.)
SVMs vs. Kernel Regression

(The slide compares the SVM prediction rule with the kernel regression prediction rule; both are written out after the list below.)

Differences:
• SVMs:
  – Learn weights αi (and bandwidth)
  – Often sparse solution
• KR:
  – Fixed "weights", learn bandwidth
  – Solution may not be sparse
  – Much simpler to implement
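For concreteness, the two prediction rules being contrasted, writing kernel regression in its usual Nadaraya-Watson form (an assumption, since the slide's formulas were not captured in the transcript):

\[
\text{SVM:}\;\; \hat{y}(x) = \operatorname{sign}\!\Big( \sum_i \alpha_i\, y_i\, K(x_i, x) + b \Big)
\qquad\quad
\text{KR:}\;\; \hat{y}(x) = \operatorname{sign}\!\Big( \tfrac{\sum_i y_i\, K(x_i, x)}{\sum_i K(x_i, x)} \Big)
\]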
SVMs vs. Logistic Regression

                                          SVMs          Logistic Regression
  Loss function                           Hinge loss    Log-loss
  High-dimensional features with kernels  Yes!          Yes!
  Solution sparse                         Often yes!    Almost always no!
  Semantics of output                     "Margin"      Real probabilities
Kernels in Logistic Regression

• Define weights in terms of features:
• Derive a simple gradient descent rule on the αi
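A sketch of what that rule can look like, assuming weights defined as w = Σi αi Φ(xi), labels yj ∈ {−1, +1}, and gradient ascent on the log-likelihood (i.e. descent on the log-loss); the learning rate and epoch count are illustrative and there is no regularization in this sketch.

import numpy as np

def kernel_logistic_regression(K, y, lr=0.1, epochs=200):
    """K: (n, n) kernel matrix with K[i, j] = K(x_i, x_j); y: labels in {-1, +1}."""
    alpha = np.zeros(len(y))
    for _ in range(epochs):
        margins = y * (K @ alpha)             # y_j * w.Phi(x_j)
        p = 1.0 / (1.0 + np.exp(margins))     # 1 - sigma(y_j f_j), weight of each example
        alpha += lr * (K @ (y * p))           # gradient of the log-likelihood w.r.t. alpha
    return alpha

# Predict a new point x via f(x) = sum_i alpha_i K(x_i, x); classify as sign(f(x)).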
SVMs vs. Logistic Regression

                                          SVMs          Logistic Regression
  Loss function                           Hinge loss    Log-loss
  High-dimensional features with kernels  Yes!          Yes!
  Solution sparse                         Often yes!    Almost always no!
  Semantics of output                     "Margin"      Real probabilities
Kernels: Key Points

• Many learning tasks are framed as optimization problems
• Primal and dual formulations of optimization problems
• Dual version framed in terms of dot products between x's
• Kernel functions K(x, y) allow calculating dot products <Φ(x), Φ(y)> without bothering to project x into Φ(x)
• Leads to major efficiencies, and the ability to use very high-dimensional (virtual) feature spaces
What you need to know…

• Dual SVM formulation – how it's derived
• The kernel trick
• Common kernels
• Differences between SVMs and kernel regression
• Differences between SVMs and logistic regression
• Kernelized logistic regression
• Also: kernel PCA, kernel ICA, …