SVM for Overlapping Classes - University at Buffalo (srihari/CSE574/Chap7/7.2-SVM-Overlap.pdf)
Machine Learning Srihari
SVM Discussion Overview
1. Importance of SVMs 2. Overview of Mathematical Techniques
Employed 3. Margin Geometry 4. SVM Training Methodology 5. Overlapping Distributions 6. Dealing with Multiple Classes 7. SVM and Computational Learning Theory 8. Relevance Vector Machines
Overlapping Class Distributions
• We have assumed the training data are linearly separable in the mapped feature space ϕ(x)
• The resulting SVM gives exact separation in input space x, although the decision boundary will be nonlinear
• In practice the class-conditional distributions overlap
• Exact separation then leads to poor generalization
• Therefore we need to allow the SVM to misclassify some training points
Overlapping Distributions
Modifying SVM
• The implicit error function of the SVM is

  \sum_{n=1}^{N} E_\infty \left( y(x_n) t_n - 1 \right) + \lambda \|w\|^2

  where E_\infty(z) = 0 if z ≥ 0 and ∞ otherwise
• It gives infinite error if a point is misclassified
• Zero error if it is classified correctly
• We optimize the parameters to maximize the margin
• We now modify this approach
• Points are allowed to be on the wrong side of the margin
• But with a penalty that increases with the distance from the boundary
  – Convenient if the penalty is a linear function of the distance
Definition of Slack Variables ξn
• Introduce slack variables ξn ≥ 0, n = 1,…,N, with one slack variable for each training point
• These are defined by
  – ξn = 0 for data points that are on or inside the correct margin boundary
  – ξn = |tn − y(xn)| for all other points
Slack Variables
• A data point that is on the decision boundary y(xn) = 0 will have ξn = 1
• Points with ξn > 1 will be misclassified
• The exact classification constraints

  t_n \left( w^T \phi(x_n) + b \right) \geq 1, \quad n = 1, \ldots, N

  are then replaced with

  t_n y(x_n) \geq 1 - \xi_n, \quad n = 1, \ldots, N

• in which the slack variables are constrained to satisfy ξn ≥ 0
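The slack value implied by these constraints can be computed directly from the margin quantity tn·y(xn). A minimal sketch (the helper name and sample values are illustrative, not from the slides):

```python
def slack(t, y):
    """Slack variable xi_n for a point with label t in {-1, +1}
    and SVM output y = w^T phi(x) + b.
    xi = 0 when the point is on or inside the correct margin
    boundary (t*y >= 1); otherwise xi = 1 - t*y, which equals
    |t - y| since t is +1 or -1."""
    return max(0.0, 1.0 - t * y)

# On or beyond the correct margin boundary: xi = 0
print(slack(+1, 1.5))   # 0.0
# Inside the margin but correctly classified: 0 < xi <= 1
print(slack(+1, 0.4))   # 0.6
# On the decision boundary y = 0: xi = 1
print(slack(+1, 0.0))   # 1.0
# Wrong side of the decision boundary: xi > 1
print(slack(-1, 0.5))   # 1.5
```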
Slack Variables and Margins
[Figure: two panels contrasting a hard margin and a soft margin, each showing the contours y = 1, y = 0, y = −1 and the margin]
• Correctly classified points on the correct side of the margin have ξ = 0
• Correctly classified points inside the margin but on the correct side of the decision boundary have 0 < ξ ≤ 1
• Data points with circles are support vectors
• Data points on the wrong side of the decision boundary have ξ > 1
Soft Margin Classification
• Allows some of the data points to be misclassified
• While slack variables allow for overlapping class distributions, the framework is still sensitive to outliers
  – because the penalty for misclassification still increases linearly with ξ
Optimization with slack variables
• Our goal is to maximize the margin while softly penalizing points that lie on the wrong side of the margin boundary
• We therefore minimize

  C \sum_{n=1}^{N} \xi_n + \frac{1}{2} \|w\|^2

• The parameter C > 0 controls the trade-off between the slack variable penalty and the margin
• Subject to the constraints

  t_n y(x_n) \geq 1 - \xi_n, \quad \xi_n \geq 0, \quad n = 1, \ldots, N
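As a quick illustration, this objective can be evaluated for a candidate (w, b) on toy data. A sketch assuming a linear kernel (ϕ(x) = x); the function name, data, and numbers are invented for illustration:

```python
def soft_margin_objective(w, b, X, t, C):
    """Soft-margin SVM objective C * sum(xi_n) + 0.5 * ||w||^2,
    with the slacks xi_n = max(0, 1 - t_n * y(x_n)) implied by
    the constraints, and y(x) = w^T x + b (linear kernel)."""
    y = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in X]
    slacks = [max(0.0, 1.0 - tn * yn) for tn, yn in zip(t, y)]
    return C * sum(slacks) + 0.5 * sum(wi * wi for wi in w)

# Toy 2-D data: two well-separated points plus one point that
# lies on the wrong side of the decision boundary (xi = 1.5)
X = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
t = [+1, -1, -1]
w, b = [1.0, 0.0], 0.0
print(soft_margin_objective(w, b, X, t, C=1.0))  # 2.0
```

Increasing C makes margin violations more expensive relative to the ½‖w‖² term, so the solution tolerates fewer of them.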
Lagrangian with slack
• The corresponding Lagrangian is

  L(w, b, \xi, a, \mu) = \frac{1}{2} \|w\|^2 + C \sum_{n=1}^{N} \xi_n - \sum_{n=1}^{N} a_n \left\{ t_n y(x_n) - 1 + \xi_n \right\} - \sum_{n=1}^{N} \mu_n \xi_n

• where {an ≥ 0} and {µn ≥ 0} are Lagrange multipliers
• The corresponding KKT conditions are

  a_n \geq 0
  t_n y(x_n) - 1 + \xi_n \geq 0
  a_n \left\{ t_n y(x_n) - 1 + \xi_n \right\} = 0
  \mu_n \geq 0
  \xi_n \geq 0
  \mu_n \xi_n = 0

• We then optimize out w, b and {ξn} by setting the derivatives of L with respect to w, b and ξn to zero, which gives

  w = \sum_{n=1}^{N} a_n t_n \phi(x_n), \qquad \sum_{n=1}^{N} a_n t_n = 0, \qquad a_n = C - \mu_n
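Combining aₙ = C − µₙ with the complementarity conditions determines the role each training point plays at the optimum. A small sketch of that case analysis (the function name, labels, and sample values are invented; the threshold logic follows from the KKT conditions above):

```python
def point_role(a_n, C, eps=1e-9):
    """Categorize a training point from its optimal dual variable a_n.
    a_n = 0: the point does not contribute to predictions.
    0 < a_n < C: then mu_n = C - a_n > 0, so xi_n = 0 and the
    point lies exactly on the margin boundary.
    a_n = C: then mu_n = 0, xi_n may be positive, so the point can
    lie inside the margin or be misclassified."""
    if a_n < eps:
        return "not a support vector"
    if a_n < C - eps:
        return "support vector on the margin (xi = 0)"
    return "support vector inside the margin or misclassified"

C = 1.0
print(point_role(0.0, C))  # not a support vector
print(point_role(0.3, C))  # support vector on the margin (xi = 0)
print(point_role(1.0, C))  # support vector inside the margin or misclassified
```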
Dual Lagrangian
• The dual Lagrangian has the form

  \tilde{L}(a) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k(x_n, x_m)

• which is identical to the separable case, except that the constraints are somewhat different: the box constraints 0 ≤ an ≤ C replace an ≥ 0, while Σn an tn = 0 is unchanged
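The dual objective is straightforward to evaluate numerically; a minimal sketch assuming a linear kernel (all names and values are illustrative):

```python
def linear_kernel(x, z):
    return sum(xi * zi for xi, zi in zip(x, z))

def dual_objective(a, t, X, kernel=linear_kernel):
    """Dual Lagrangian: sum_n a_n
    - 0.5 * sum_n sum_m a_n a_m t_n t_m k(x_n, x_m)."""
    N = len(a)
    quad = sum(a[n] * a[m] * t[n] * t[m] * kernel(X[n], X[m])
               for n in range(N) for m in range(N))
    return sum(a) - 0.5 * quad

# Toy feasible point: 0 <= a_n <= C holds for C >= 0.5,
# and sum_n a_n t_n = 0.5 - 0.5 = 0
a = [0.5, 0.5]
t = [+1, -1]
X = [[1.0, 0.0], [-1.0, 0.0]]
print(dual_objective(a, t, X))  # 0.5
```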
ν-SVM
• An equivalent formulation of the SVM
• It involves maximizing

  \tilde{L}(a) = -\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k(x_n, x_m)

• subject to the constraints

  0 \leq a_n \leq 1/N
  \sum_{n=1}^{N} a_n t_n = 0
  \sum_{n=1}^{N} a_n \geq \nu

• ν-SVM has the advantage of using a parameter ν to control the number of support vectors
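A candidate dual vector can be checked against these constraints directly; a small sketch (the function name, tolerance, and sample values are invented for illustration):

```python
def nu_svm_feasible(a, t, nu, eps=1e-9):
    """Check the nu-SVM dual constraints:
    0 <= a_n <= 1/N,  sum_n a_n t_n = 0,  sum_n a_n >= nu."""
    N = len(a)
    box = all(-eps <= a_n <= 1.0 / N + eps for a_n in a)
    balance = abs(sum(a_n * t_n for a_n, t_n in zip(a, t))) <= eps
    mass = sum(a) >= nu - eps
    return box and balance and mass

t = [+1, +1, -1, -1]
# All constraints hold: a_n = 1/N, labels balance, sum = 1 >= nu
print(nu_svm_feasible([0.25, 0.25, 0.25, 0.25], t, nu=0.5))  # True
# Violates the box constraint: first component exceeds 1/N = 0.25
print(nu_svm_feasible([0.5, 0.0, 0.25, 0.25], t, nu=0.5))    # False
```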
ν-SVM applied to non-separable data
• Support vectors are indicated by circles
• Overlap is handled by introducing slack variables, with one slack variable per training data point
• The margin is maximized while points that lie on the wrong side of the margin boundary are softly penalized
• ν is an upper bound on the fraction of margin errors (points that lie on the wrong side of the margin boundary) and a lower bound on the fraction of support vectors