Transcript of Lecture 18: Multiclass Support Vector Machines (hzhang/math574m/2020Lect18_msvm.pdf)
Lecture 18: Multiclass Support Vector Machines
Hao Helen Zhang
Outline
Traditional methods for multiclass problems:
  One-vs-rest approaches
  Pairwise approaches
Recent developments for multiclass problems:
  Simultaneous classification
  Various loss functions
Extensions of SVM
Multiclass Classification Setup
Label: {−1, +1} → {1, 2, . . . , K}.
Classification decision rule:
    f : R^d → {1, 2, . . . , K}.
Classification accuracy is measured by:
Equal-cost case: the generalization error (GE)
    Err(f) = P(Y ≠ f(X)).
Unequal-cost case: the risk
    R(f) = E_{Y,X} C(Y, f(X)).
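To make the two criteria concrete, here is a minimal numpy sketch that estimates both from a sample; the labels, predictions, and cost matrix C below are made-up illustration values:

```python
import numpy as np

# Hypothetical illustration data: true and predicted labels, K = 3 classes.
y = np.array([1, 2, 3, 2, 1, 3, 3, 2])
y_hat = np.array([1, 2, 1, 2, 2, 3, 3, 3])

# Equal-cost case: empirical generalization error, Err(f) = P(Y != f(X)).
err = np.mean(y != y_hat)

# Unequal-cost case: empirical risk R(f) = E C(Y, f(X)), where C[j, k]
# is the cost of predicting class k+1 when the true class is j+1.
C = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [4.0, 1.0, 0.0]])
risk = np.mean(C[y - 1, y_hat - 1])

print(f"empirical GE = {err:.3f}, empirical risk = {risk:.3f}")
```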
Traditional Methods
Main ideas:
(i) Decompose the multiclass classification problem into multiple binary classification problems.
(ii) Use the majority voting principle (a combined decision from the committee) to predict the label.
Common approaches: simple but effective
One-vs-rest (one-vs-all) approaches
Pairwise (one-vs-one, all-vs-all) approaches
One-vs-rest Approach
One of the simplest multiclass classifiers; commonly used with SVMs; also known as the one-vs-all (OVA) approach.
(i) Solve K different binary problems: classify "class k" versus "the rest" for k = 1, · · · , K.
(ii) Assign a test sample to the class giving the largest (most positive) value f_k(x), where f_k(x) is the solution of the kth problem.
Properties:
Very simple to implement; performs well in practice.
Not asymptotically optimal: the decision rule is not Fisher consistent if there is no dominating class (i.e., max_k p_k(x) < 1/2).
Read: Rifkin and Klautau (2004), "In Defense of One-vs-All Classification".
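A minimal sketch of the OVA recipe above, using scikit-learn's LinearSVC as the binary base learner (the blob data are synthetic illustration data; any binary SVM solver could be substituted):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Synthetic illustration data with K = 3 classes.
X, y = make_blobs(n_samples=300, centers=3, random_state=0)
K = 3

# (i) Solve K binary problems: "class k" versus "the rest".
machines = [LinearSVC(C=1.0).fit(X, np.where(y == k, 1, -1)) for k in range(K)]

# (ii) Assign each sample to the class with the largest f_k(x).
scores = np.column_stack([m.decision_function(X) for m in machines])
y_hat = np.argmax(scores, axis=1)
print("training accuracy:", np.mean(y_hat == y))
```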
Pairwise Approach
Also known as the all-vs-all (AVA) approach.
(i) Solve K(K − 1)/2 different binary problems: classify "class k" versus "class j" for all j ≠ k. Call the resulting classifier g_{kj}.
(ii) For prediction at a point, each classifier is queried once and issues a vote. The class with the maximum number of (weighted) votes is the winner.
Properties:
Training is efficient, since each binary problem is small.
If K is large, there are many problems to solve; e.g., K = 10 already requires 45 binary classifiers.
Simple to implement; performs competitively in practice.
Read: Park and Furnkranz (2007), "Efficient Pairwise Classification".
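The pairwise recipe differs only in how the binary problems are formed and how the votes are tallied; a sketch under the same synthetic-data assumptions as above:

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Synthetic illustration data with K = 3 classes.
X, y = make_blobs(n_samples=300, centers=3, random_state=0)
K = 3

# (i) Solve K(K-1)/2 binary problems: "class k" versus "class j".
machines = {}
for k, j in combinations(range(K), 2):
    mask = (y == k) | (y == j)
    machines[(k, j)] = LinearSVC(C=1.0).fit(X[mask],
                                            np.where(y[mask] == k, 1, -1))

# (ii) Each classifier votes once; the class with the most votes wins.
votes = np.zeros((X.shape[0], K))
for (k, j), clf in machines.items():
    for_k = clf.decision_function(X) > 0     # True -> vote for class k
    votes[for_k, k] += 1
    votes[~for_k, j] += 1
y_hat = np.argmax(votes, axis=1)
print("training accuracy:", np.mean(y_hat == y))
```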
One Single SVM Approach: Simultaneous Classification
Label: {−1, +1} → {1, 2, . . . , K}.
Use one single SVM to construct a decision function vector
    f = (f_1, . . . , f_K).
Classifier (decision rule):
    f(x) = argmax_{k=1,··· ,K} f_k(x).
If K = 2, a single f_k suffices and the decision rule reduces to sign(f_k).
In some sense, multinomial logistic regression is a simultaneous classification procedure.
[Figure: a two-dimensional example in (x_1, x_2) showing the three decision regions: Class 1 where f_1 > max(f_2, f_3), Class 2 where f_2 > max(f_1, f_3), and Class 3 where f_3 > max(f_1, f_2).]
[Figure: the three decision regions are polyhedra, separated by the boundaries f_j − f_k = 0 and flanked by the margin boundaries f_j − f_k = 1 for each pair j ≠ k.]
SVM for Multiclass Problems
Multiclass SVM: solve one single regularization problem by imposing a penalty on the values of the differences f_y(x) − f_l(x).
Weston and Watkins (1999)
Crammer and Singer (2002)
Lee et al. (2004)
Liu and Shen (2006); multiclass ψ-learning: Shen et al. (2003)
Various Multiclass SVMs
Weston and Watkins (1999):
    L(y, f(x)) = Σ_{k≠y} [2 − (f_y(x) − f_k(x))]_+ .
A penalty is imposed only if f_y(x) < f_k(x) + 2 for some k ≠ y. Even if f_y(x) < 1, no penalty is paid as long as f_k(x) is sufficiently small for all k ≠ y; similarly, if f_k(x) > 1 for some k ≠ y, no penalty is paid if f_y(x) is sufficiently large.
Lee et al. (2004):
    L(y, f(x)) = Σ_{k≠y} [f_k(x) + 1]_+ .
Crammer and Singer (2002); Liu and Shen (2006):
    L(y, f(x)) = [1 − min_{k≠y} {f_y(x) − f_k(x)}]_+ .
To avoid redundancy, a sum-to-zero constraint Σ_{k=1}^K f_k = 0 is sometimes enforced. The three losses are implemented in the sketch below.
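A sketch of the three losses in numpy, written directly in terms of the decision vector f(x); the values below are made-up and satisfy the sum-to-zero constraint:

```python
import numpy as np

def ww_loss(f, y):
    """Weston and Watkins (1999): sum over k != y of [2 - (f_y - f_k)]_+ ."""
    return np.sum(np.maximum(0.0, 2.0 - (f[y] - np.delete(f, y))))

def lee_loss(f, y):
    """Lee et al. (2004): sum over k != y of [f_k + 1]_+ ."""
    return np.sum(np.maximum(0.0, np.delete(f, y) + 1.0))

def cs_loss(f, y):
    """Crammer-Singer / Liu-Shen: [1 - min over k != y of (f_y - f_k)]_+ ."""
    return max(0.0, 1.0 - np.min(f[y] - np.delete(f, y)))

f = np.array([1.2, -0.5, -0.7])   # K = 3 decision values, summing to zero
print(ww_loss(f, 0), lee_loss(f, 0), cs_loss(f, 0))
```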
Linear Multiclass SVMs
For linear classification problems, we have
    f_k(x) = β_k^T x + β_{0k},  k = 1, · · · , K.
The sum-to-zero constraint can be replaced by
    Σ_{k=1}^K β_{0k} = 0  and  Σ_{k=1}^K β_k = 0.
The optimization problem becomes
    min_f  Σ_{i=1}^n L(y_i, f(x_i)) + λ Σ_{k=1}^K ‖β_k‖²,
subject to the sum-to-zero constraint.
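As a concrete, if simplistic, illustration of this optimization, here is a projected-subgradient sketch for the linear MSVM with the Crammer-Singer loss; the data, step size, and iteration counts are arbitrary choices, and a serious implementation would solve the dual QP instead:

```python
import numpy as np
from sklearn.datasets import make_blobs

# Synthetic illustration data with K = 3 classes.
X, y = make_blobs(n_samples=300, centers=3, random_state=0)
n, d = X.shape
K, lam, eta = 3, 0.01, 0.01

W = np.zeros((K, d))   # row k holds beta_k
b = np.zeros(K)        # entry k holds beta_0k

for t in range(500):
    F = X @ W.T + b                        # n x K matrix of f_k(x_i)
    F_rival = F.copy()
    F_rival[np.arange(n), y] = -np.inf
    rival = np.argmax(F_rival, axis=1)     # strongest competing class
    viol = F[np.arange(n), y] - F[np.arange(n), rival] < 1.0

    gW, gb = np.zeros_like(W), np.zeros(K)  # subgradient of the hinge terms
    for i in np.where(viol)[0]:
        gW[rival[i]] += X[i]; gb[rival[i]] += 1.0
        gW[y[i]] -= X[i];     gb[y[i]] -= 1.0
    W -= eta * (2 * lam * W + gW / n)
    b -= eta * gb / n
    # Enforce the sum-to-zero constraints by projection (centering over k).
    W -= W.mean(axis=0); b -= b.mean()

y_hat = np.argmax(X @ W.T + b, axis=1)
print("training accuracy:", np.mean(y_hat == y))
```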
Nonlinear Multiclass SVMs
To achieve nonlinear classification, we assume
    f_k(x) = β_k^T φ(x) + β_{0k},  k = 1, · · · , K,
where φ(x) represents the basis functions of the feature space F.
Similar to binary classification, the nonlinear MSVM can be conveniently solved using a kernel function.
Regularization Problems for Nonlinear MSVMs
We can represent the MSVM as the solution to a regularization problem in an RKHS. Assume that
    f(x) = (f_1(x), · · · , f_K(x)) ∈ ∏_{k=1}^K ({1} + H_k)
under the sum-to-zero constraint. Then an MSVM classifier can be derived by solving
    min_f  Σ_{i=1}^n L(y_i, f(x_i)) + λ Σ_{k=1}^K ‖g_k‖²_{H_k},
where f_k(x) = g_k(x) + β_{0k}, g_k ∈ H_k, β_{0k} ∈ R.
Generalized Functional Margin
Given (x, y), a reasonable decision vector f(x) should
  encourage a large value for f_y(x);
  have small values for f_k(x), k ≠ y.
Define the (K − 1)-vector of relative differences
    g = (f_y(x) − f_1(x), · · · , f_y(x) − f_{y−1}(x), f_y(x) − f_{y+1}(x), · · · , f_y(x) − f_K(x)).
Liu et al. (2004) called the vector g the generalized functional margin of f.
g characterizes the correctness and strength of the classification of x by f:
f indicates a correct classification of (x, y) if g(f(x), y) > 0_{K−1}.
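In code the margin vector is a one-liner; a numpy sketch with made-up decision values:

```python
import numpy as np

def functional_margin(f, y):
    """g(f(x), y): the (K-1)-vector with entries f_y - f_k for all k != y."""
    return f[y] - np.delete(f, y)

f = np.array([1.2, -0.5, -0.7])            # K = 3 decision values
g = functional_margin(f, y=0)
print(g, "correct classification:", bool(np.all(g > 0)))
```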
0-1 Loss with Functional Margin
A point (x, y) is misclassified if y ≠ argmax_k f_k(x).
For u = (u_1, · · · , u_m), define the multivariate sign function
    sign(u) = 1 if u_min = min(u_1, · · · , u_m) > 0,
    sign(u) = −1 if u_min ≤ 0.
Using the functional margin:
The 0-1 loss becomes
    I(min g(f(x), y) ≤ 0) = (1/2)[1 − sign(g(f(x), y))].
The GE becomes
    R[f] = (1/2) E[1 − sign(g(f(X), Y))].
Generalized Loss Functions Using Functional Margin
A natural way to generalize the binary loss is
    Σ_{i=1}^n ℓ(min g(f(x_i), y_i)).
In particular, the loss function L(y, f(x)) can be expressed as V(g(f(x), y)) with
Weston and Watkins (1999): V(u) = Σ_{j=1}^{K−1} [2 − u_j]_+ .
Lee et al. (2004): V(u) = Σ_{j=1}^{K−1} [Σ_{c=1}^{K−1} u_c/K − u_j + 1]_+ .
Liu and Shen (2006): V(u) = [1 − min_j u_j]_+ .
All of these loss functions are upper bounds of the 0-1 loss.
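A numpy sketch of the three V(u) forms, with a random check (under the sum-to-zero constraint, which the Lee et al. form assumes) that each one upper-bounds the 0-1 loss:

```python
import numpy as np

def V_ww(u):    # Weston and Watkins (1999)
    return np.sum(np.maximum(0.0, 2.0 - u))

def V_lee(u):   # Lee et al. (2004); uses the sum-to-zero constraint
    K = u.size + 1
    return np.sum(np.maximum(0.0, np.sum(u) / K - u + 1.0))

def V_ls(u):    # Liu and Shen (2006)
    return max(0.0, 1.0 - np.min(u))

def zero_one(u):  # (1/2)[1 - sign(u)] with the multivariate sign function
    return 0.0 if np.min(u) > 0 else 1.0

rng = np.random.default_rng(0)
for _ in range(1000):
    f = rng.normal(size=4); f -= f.mean()   # sum-to-zero decision vector, K = 4
    u = f[0] - f[1:]                        # functional margin, taking y = 1
    assert min(V_ww(u), V_lee(u), V_ls(u)) >= zero_one(u)
print("each V(u) upper-bounds the 0-1 loss on 1000 random draws")
```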
Small Round Blue Cell Tumors of Childhood
Khan et al. (2001) in Nature Medicine.
Tumor types: neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (NHL), and the Ewing family of tumors (EWS).
Number of genes: 2308.
Class distribution of the data set:

    Data set       EWS   BL(NHL)   NB   RMS   Total
    Training set    23      8      12    20     63
    Test set         6      3       6     5     20
    Total           29     11      18    25     83
[Figure 3: Three principal components (PC1, PC2, PC3) of 100 gene expression levels (circles: training samples; squares: test samples, including non-SRBCT samples). The tumor types (EWS, BL, NB, RMS, NON) are distinguished by colors.]
Characteristics of Support Vector Machines
High accuracy, high flexibility
Naturally handles high-dimensional data
Sparse representation of the solutions (via support vectors): fast for making future predictions
No probability estimates (hard classifiers)
Other Active Problems in SVM
Variable/Feature Selection
Linear SVM: Bradley and Mangasarian (1998), Guyon et al. (2000), Rakotomamonjy (2003), Jebara and Jaakkola (2000)
Nonlinear SVM: Weston et al. (2002), Grandvalet (2003), basis pursuit (Zhang, 2003), COSSO selection (Lin and Zhang, 2003)
Proximal SVM: faster computation
Robust SVM: resistant to outliers
Choice of kernels
The L1 SVM
Replace the L2 penalty by the L1 penalty.
The L1 penalty tends to give sparse solutions.
For f(x) = h(x)^T β + β_0, the L1 SVM solves
    min_{β_0, β}  Σ_{i=1}^n [1 − y_i f(x_i)]_+ + λ Σ_{j=1}^d |β_j|.   (1)
The solution will have at most n nonzero coefficients β_j.
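For a feel of the sparsity, here is a sketch with scikit-learn's LinearSVC; note its L1-penalized mode is only available with the squared hinge loss, so this solves a close cousin of problem (1) rather than (1) itself:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic illustration data: 50 features, only 5 of them informative.
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)

# L1-penalized linear SVM (squared hinge); C plays the role of 1/lambda.
clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.1)
clf.fit(X, y)
print("nonzero coefficients:", int(np.sum(clf.coef_ != 0)), "of", clf.coef_.size)
```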
L1 Penalty versus L2 Penalty
[Figure: the L1 penalty J(w) = |w| versus the L2 penalty J(w) = w², plotted for w ∈ [−2, 2]; the L1 penalty is singular (non-differentiable) at w = 0.]
Robust Support Vector Machines
The hinge loss is unbounded, hence sensitive to outliers (e.g., wrong labels).
Support vectors: y_i f(x_i) ≤ 1.
Truncated hinge loss: T_s(u) = H_1(u) − H_s(u), where
    H_s(u) = [s − u]_+ .
Removes some “bad” SVs (Wu and Liu, 2006).
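The truncation itself is two lines of numpy; a minimal sketch:

```python
import numpy as np

def H(u, s):
    """Shifted hinge loss H_s(u) = [s - u]_+ ."""
    return np.maximum(0.0, s - u)

def T(u, s=-1.0):
    """Truncated hinge loss T_s(u) = H_1(u) - H_s(u); bounded by 1 - s."""
    return H(u, 1.0) - H(u, s)

u = np.linspace(-3, 3, 7)
print(T(u))   # flat at 1 - s = 2 for u <= s, unlike the unbounded hinge
```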
Decomposition: Difference of Convex Functions
Key: D.C. decomposition (difference of convex functions):
    T_s(u) = H_1(u) − H_s(u).

[Figure: three panels plotting H_1(z), H_s(z), and the truncated hinge T_s(z) against z; T_s is flat (equal to 1 − s) for z ≤ s and coincides with the hinge loss for z > s.]
D.C. Algorithm
D.C. Algorithm: the Difference Convex Algorithm for minimizing
    J(Θ) = J_vex(Θ) + J_cav(Θ).
1. Initialize Θ^0.
2. Repeat Θ^{t+1} = argmin_Θ ( J_vex(Θ) + ⟨J′_cav(Θ^t), Θ − Θ^t⟩ ) until convergence of Θ^t.
The algorithm converges in finitely many steps (Liu et al., 2005).
Choice of initial values: use the SVM solution.
RSVM: the set of SVs is only a SUBSET of the original one!
Nonlinear learning can be achieved by the kernel trick.
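To make the iteration concrete, here is a rough DCA sketch for a robust linear SVM with the truncated hinge loss (bias term omitted; the convex subproblem is solved approximately by subgradient descent; the data, label noise, s, and step size are all arbitrary illustration choices):

```python
import numpy as np
from sklearn.datasets import make_blobs

# Two-class illustration data with labels in {-1, +1} and some flipped labels.
X, y = make_blobs(n_samples=200, centers=2, random_state=0)
y = np.where(y == 0, -1, 1)
y[:10] *= -1                       # inject label noise (outliers)

n = X.shape[0]
lam, s, eta = 0.1, -1.0, 0.01
w = np.zeros(X.shape[1])           # initialize (the SVM solution is better)

for outer in range(10):            # DCA outer loop
    # Linearize J_cav(w) = -sum_i H_s(y_i w.x_i) at the current iterate:
    # its subgradient is the sum of y_i x_i over points with margin < s.
    A = y * (X @ w) < s
    g_cav = (y[A, None] * X[A]).sum(axis=0)

    for inner in range(200):       # approximately solve the convex subproblem
        margins = y * (X @ w)
        viol = margins < 1.0       # active hinge terms in J_vex
        grad = 2 * lam * w - (y[viol, None] * X[viol]).sum(axis=0) + g_cav
        w -= eta * grad / n

print("training accuracy:", np.mean(np.sign(X @ w) == y))
```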