Support Vector Machines (and Kernel Methods in general)
Support Vector Machines (and Kernel Methods in general)
Machine Learning
Last Time
• Multilayer Perceptron / Logistic Regression Networks
  – Neural Networks
  – Error Backpropagation
Today
• Support Vector Machines
• Note: we’ll rely on some math from optimization theory that we won’t derive.
Maximum Margin
• Perceptron (and other linear classifiers) can lead to many equally valid choices for the decision boundary
Are these really “equally valid”?
Max Margin
• How can we pick which is best?
• Maximize the size of the margin.
(Figure: two separating boundaries – one with a small margin, one with a large margin)
Support Vectors
• Support vectors are those input points (vectors) closest to the decision boundary
  1. They are vectors
  2. They “support” the decision hyperplane
Support Vectors
• Define this as a decision problem
• The decision hyperplane:

  w·x + b = 0

• No fancy math, just the equation of a hyperplane.
Support Vectors
• Aside: Why do some classifiers use labels {−1, 1} rather than {0, 1}?
  – Simplicity of the math and interpretation.
  – For probability density function estimation, {0, 1} has a clear correlate.
  – For classification, a decision boundary of 0 is more easily interpretable than 0.5.
Support Vectors
• Define this as a decision problem
• The decision hyperplane:

  w·x + b = 0

• Decision function:

  f(x) = sign(w·x + b)
Support Vectors
• Define this as a decision problem
• The decision hyperplane:

  w·x + b = 0

• Margin hyperplanes:

  w·x + b = 1  and  w·x + b = −1
Support Vectors
• The decision hyperplane: w·x + b = 0
• Scale invariance: for any c > 0, (cw)·x + cb = 0 defines the same hyperplane
Support Vectors
• The decision hyperplane: w·x + b = 0
• Scale invariance: rescale w and b so that the margin hyperplanes are w·x + b = ±1
This rescaling does not change the decision hyperplane or the support vector hyperplanes, but it eliminates a variable from the optimization.
What are we optimizing?
• We will represent the size of the margin in terms of w.
• This will allow us to simultaneously
  – Identify a decision boundary
  – Maximize the margin
How do we represent the size of the margin in terms of w?
1. There must be at least one point that lies on each support hyperplane.
Proof outline: if not, we could define a larger-margin support hyperplane that does touch the nearest point(s).
How do we represent the size of the margin in terms of w?
1. There must be at least one point that lies on each support hyperplane.
2. Thus: w·x1 + b = 1 for some point x1 on the positive margin hyperplane.
3. And: w·x2 + b = −1 for some point x2 on the negative margin hyperplane.
How do we represent the size of the margin in terms of w?
• The vector w is perpendicular to the decision hyperplane
  – If the dot product of two vectors equals zero, the two vectors are perpendicular.
How do we represent the size of the margin in terms of w?
• The margin is the projection of x1 – x2 onto w, the normal of the hyperplane.
Aside: Vector Projection
• The scalar projection of a onto b is (a·b) / ||b||.
Size of the margin: subtracting the two margin hyperplane equations gives w·(x1 – x2) = 2.
Projection: (x1 – x2)·w / ||w|| = 2 / ||w||
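A minimal numeric sketch of the projection argument, using assumed toy values for w and for two support vectors x1 and x2 on the ±1 margin hyperplanes: projecting x1 – x2 onto the unit normal recovers exactly 2/||w||.

```python
import numpy as np

# Toy values (assumed): hyperplane w.x + b = 0 with canonical scaling,
# so w.x1 + b = 1 and w.x2 + b = -1 at the support vectors.
w = np.array([0.5, 0.5])
b = 0.0
x1 = np.array([1.0, 1.0])    # on the +1 margin hyperplane
x2 = np.array([-1.0, -1.0])  # on the -1 margin hyperplane

# Scalar projection of (x1 - x2) onto the unit normal w/||w||
margin = (x1 - x2) @ w / np.linalg.norm(w)

# Subtracting the two margin equations gives w.(x1 - x2) = 2,
# so the margin equals 2/||w||
print(margin, 2 / np.linalg.norm(w))
```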
Maximizing the margin
• Goal: maximize the margin

  maximize 2/||w||  subject to  yi(w·xi + b) ≥ 1 for all i

The constraint enforces linear separability of the data by the decision boundary.
Max Margin Loss Function
• For constrained optimization, use Lagrange multipliers
• Optimize the “Primal”:

  L(w, b, α) = ½||w||² − Σi αi [yi(w·xi + b) − 1],  with αi ≥ 0
Max Margin Loss Function
• Optimize the “Primal”
Partial w.r.t. b:

  ∂L/∂b = −Σi αi yi = 0  ⟹  Σi αi yi = 0
Max Margin Loss Function
• Optimize the “Primal”
Partial w.r.t. w:

  ∂L/∂w = w − Σi αi yi xi = 0  ⟹  w = Σi αi yi xi
Max Margin Loss Function
Now we have to find the αi. Substitute w = Σi αi yi xi back into the loss function.
Max Margin Loss Function
• Construct the “dual”:

  W(α) = Σi αi − ½ Σi Σj αi αj yi yj (xi·xj)

  subject to αi ≥ 0 and Σi αi yi = 0
Dual formulation of the error
• Optimize this quadratic program to identify the Lagrange multipliers, and thus the weights
There exist (rather) fast approaches to quadratic optimization in C, C++, Python, Java, and R.
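A sketch of solving the dual on a two-point toy problem, assuming SciPy is available (this is one convenient general-purpose optimizer, not the specialized QP solvers real SVM packages use):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (assumed for illustration)
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])

# Dual objective, negated since we minimize:
# W(a) = sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j (x_i . x_j)
Q = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(a):
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(
    neg_dual,
    x0=np.zeros(len(y)),
    method="SLSQP",
    bounds=[(0.0, None)] * len(y),                         # a_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i a_i y_i = 0
)
alpha = res.x

# Recover the weights from stationarity: w = sum_i a_i y_i x_i,
# and b from any support vector: y_i (w . x_i + b) = 1
w = (alpha * y) @ X
b = y[0] - w @ X[0]
print(alpha, w, b)
```

On this toy problem both multipliers come out equal (both points are support vectors) and the recovered hyperplane bisects the two points.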
Quadratic Programming
A quadratic program minimizes f(x) = ½xᵀQx + cᵀx subject to linear constraints.
• If Q is positive semi-definite, then f(x) is convex.
• If f(x) is convex, then there is a single global optimum.
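The SVM dual is always such a convex QP: its quadratic term Qij = yi yj (xi·xj) is a Gram matrix, hence positive semi-definite. A small numeric check with assumed toy data:

```python
import numpy as np

# Q_ij = y_i y_j (x_i . x_j) = (diag(y) X)(diag(y) X)^T is a Gram matrix,
# so it is positive semi-definite and the dual QP is convex.
X = np.array([[2.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # toy points (assumed)
y = np.array([1.0, 1.0, -1.0])

Z = y[:, None] * X   # each row x_i scaled by its label y_i
Q = Z @ Z.T          # Gram matrix of the scaled rows

eigvals = np.linalg.eigvalsh(Q)
print(eigvals)       # all non-negative (up to round-off)
```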
Support Vector Expansion
• When αi is non-zero, xi is a support vector
• When αi is zero, xi is not a support vector
New decision function:

  f(x) = sign(Σi αi yi (xi·x) + b)

Independent of the dimension of x!
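A sketch of the support-vector expansion as code. The multipliers below are illustrative toy values, not the output of a solver; the point is that x enters only through dot products with the training points, never through an explicit w:

```python
import numpy as np

# Toy values (assumed): two support vectors with their labels and multipliers
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.25, 0.25])  # illustrative, not from a solver
b = 0.0

def decide(x):
    # f(x) = sign(sum_i a_i y_i (x_i . x) + b); only terms with a_i > 0
    # (the support vectors) contribute to the sum.
    return np.sign(sum(a * yi * (xi @ x) for a, yi, xi in zip(alpha, y, X)) + b)

print(decide(np.array([2.0, 3.0])), decide(np.array([-0.5, -2.0])))
```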
Kuhn-Tucker Conditions
• In constrained optimization, at the optimal solution: constraint × Lagrange multiplier = 0

  αi [yi(w·xi + b) − 1] = 0

Only points on the margin hyperplanes – the support vectors – contribute to the solution!
Visualization of Support Vectors
Interpretability of SVM parameters
• What else can we tell from the alphas?
  – If αi is large, the associated data point strongly constrains the solution: it is either an outlier or a particularly important point.
• But this only gives us the best solution for linearly separable data sets…
Basis of Kernel Methods
• The decision process doesn’t depend on the dimensionality of the data.
• We can map the data to a higher-dimensional space.
• Note: data points only appear within a dot product.
• The error is based on the dot product of data points – not the data points themselves.
Basis of Kernel Methods
• Since data points only appear within a dot product, we can map to another space through the replacement:

  xi·xj → φ(xi)·φ(xj)

• The error is based on the dot product of data points – not the data points themselves.
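A small sketch of why this replacement works, using a hypothetical 2-D polynomial kernel K(x, z) = (x·z + 1)² as an example: its explicit feature map φ can be written out, and φ(x)·φ(z) matches the kernel value without ever forming φ during training.

```python
import numpy as np

# Explicit feature map for the 2-D kernel K(x,z) = (x.z + 1)^2 (example kernel)
def phi(x):
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     1.0])

# The kernel computes the same dot product directly in the original space
def K(x, z):
    return (x @ z + 1.0) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(K(x, z), phi(x) @ phi(z))  # identical values
```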
Learning Theory bases of SVMs
• Theoretical bounds on testing error:
  – The upper bound doesn’t depend on the dimensionality of the space
  – The bound is minimized by maximizing the margin, γ, associated with the decision boundary
Why we like SVMs
• They work
  – Good generalization
• Easily interpreted
  – The decision boundary is based on the data, in the form of the support vectors
  – Not so in multilayer perceptron networks
• Principled bounds on testing error from Learning Theory (VC dimension)
SVM vs. MLP
• SVMs have many fewer parameters
  – SVM: maybe just a kernel parameter
  – MLP: number and arrangement of nodes, plus the learning rate η
• SVM: convex optimization task
  – MLP: the likelihood is non-convex – local minima
Soft margin classification
• There can be outliers on the wrong side of the decision boundary, or ones leading to a small margin.
• Solution: introduce slack variables ξi and a penalty term into the constraints:

  yi(w·xi + b) ≥ 1 − ξi,  ξi ≥ 0;  minimize ½||w||² + C Σi ξi
Soft Margin Dual

  W(α) = Σi αi − ½ Σi Σj αi αj yi yj (xi·xj),  subject to 0 ≤ αi ≤ C and Σi αi yi = 0

Still quadratic programming!
Soft margin example
• Points are allowed within the margin, but a cost is incurred.
• Hinge loss: max(0, 1 − yi(w·xi + b))
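A minimal sketch of the hinge loss at a few assumed margin scores, showing its three regimes:

```python
# Hinge loss: zero outside the margin, linear penalty inside it or on
# the wrong side of the boundary.
def hinge(y, score):
    return max(0.0, 1.0 - y * score)

# y * score > 1: outside the margin, no penalty
# 0 < y * score <= 1: inside the margin, small penalty
# y * score <= 0: misclassified, penalty grows linearly
print(hinge(1.0, 2.5), hinge(1.0, 0.5), hinge(-1.0, 0.5))
```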
Probabilities from SVMs
• Support Vector Machines are discriminant functions
  – Discriminant functions: f(x) = c
  – Discriminative models: f(x) = argmax_c p(c|x)
  – Generative models: f(x) = argmax_c p(x|c)p(c)/p(x)
• No (principled) probabilities from SVMs
• SVMs are not based on probability distribution functions of class instances.
Efficiency of SVMs
• Not especially fast
• Training – O(n³)
  – Quadratic programming efficiency
• Evaluation – O(n)
  – Need to evaluate against each support vector (potentially n of them)
Good Bye
• Next time:
  – The Kernel “Trick” → Kernel Methods
  – or: How can we use SVMs on data that are not linearly separable?