Considering Cost Asymmetry in Learning Classifiers
Presented by Chunping Wang
Machine Learning Group, Duke University
May 21, 2007
Based on the paper by Bach, Heckerman, and Horvitz
Outline
• Introduction
• SVM with Asymmetric Cost
• SVM Regularization Path (Hastie et al., 2005)
• Path with Cost Asymmetry
• Results
• Conclusions
Introduction (1): Binary classification
A classifier can be defined as
  y = sign[f(x)]
based on a linear decision function
  f(x) = wᵀx − b
• real-valued predictors xᵢ ∈ Rᵈ
• binary response yᵢ ∈ {−1, +1}
• decision boundary f(x) = 0, with normal vector w and offset b/||w||
• parameters (w, b) ∈ R^{d+1}
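The decision rule above can be sketched in a few lines (a minimal illustration; the weight vector and offset below are made-up values, not fitted parameters):

```python
import numpy as np

def predict(X, w, b):
    """Classify rows of X with y = sign(f(x)), f(x) = w^T x - b."""
    f = X @ w - b          # linear decision function
    return np.where(f >= 0, 1, -1)

# Made-up parameters (w, b) for illustration only
w = np.array([1.0, -2.0])
b = 0.5
X = np.array([[2.0, 0.0],    # f = 1.5  -> predicted +1
              [0.0, 1.0]])   # f = -2.5 -> predicted -1
print(predict(X, w, b))      # [ 1 -1]
```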
Introduction (2): Two types of misclassification
• false negative: cost C₊
• false positive: cost C₋

Expected cost:
  R(C₊, C₋, w, b) = C₊ P(wᵀx − b ≤ 0, y = 1) + C₋ P(wᵀx − b ≥ 0, y = −1)

In terms of the 0-1 loss function φ₀₋₁(u) = 1_{u ≤ 0}:
  R(C₊, C₋, w, b) = C₊ E[1_{y=1} φ₀₋₁(wᵀx − b)] + C₋ E[1_{y=−1} φ₀₋₁(−(wᵀx − b))]

This is the real loss function, but it is non-convex and non-differentiable.
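The empirical version of this asymmetric cost can be sketched directly (a hedged illustration; the labels, scores, and costs below are made up):

```python
import numpy as np

def empirical_cost(y_true, f_vals, c_pos, c_neg):
    """Mean asymmetric 0-1 cost of sign(f) against labels in {-1, +1}."""
    y_pred = np.where(f_vals >= 0, 1, -1)
    fn = (y_true == 1) & (y_pred == -1)   # false negatives, charged C+
    fp = (y_true == -1) & (y_pred == 1)   # false positives, charged C-
    return (c_pos * fn.sum() + c_neg * fp.sum()) / len(y_true)

y = np.array([1, 1, -1, -1])
f = np.array([0.5, -0.2, -0.3, 0.1])               # one FN, one FP
print(empirical_cost(y, f, c_pos=2.0, c_neg=1.0))  # (2 + 1)/4 = 0.75
```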
Introduction (3): Convex loss functions as surrogates for the 0-1 loss function
(used for training purposes)
Introduction (4): Empirical cost and objective
Empirical cost given n labeled data points, with a surrogate loss φ:
  R̂(C₊, C₋, w, b) = (C₊/n) Σ_{i∈I₊} φ(yᵢ(wᵀxᵢ − b)) + (C₋/n) Σ_{i∈I₋} φ(yᵢ(wᵀxᵢ − b))

Objective function:
  Ĵ(C₊, C₋, w, b) = R̂(C₊, C₋, w, b) + (1/2)||w||²
The asymmetry enters through (C₊, C₋); the ||w||² term is the regularization.
Motivation: efficiently explore many training asymmetries even when the testing asymmetry (C₊, C₋) is given.
Since convex surrogates of the 0-1 loss function are used for training, the best cost asymmetry for training generally differs from the asymmetry used for testing.
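The objective Ĵ above can be evaluated directly; a small sketch with the hinge surrogate and made-up data (the points, weights, and costs are illustration values only):

```python
import numpy as np

def hinge(u):
    """Hinge surrogate for the 0-1 loss: max(1 - u, 0)."""
    return np.maximum(1.0 - u, 0.0)

def objective(w, b, X, y, c_pos, c_neg):
    """Regularized empirical objective J^hat with per-class costs C+/C-."""
    margins = y * (X @ w - b)
    c = np.where(y == 1, c_pos, c_neg)
    return (c * hinge(margins)).sum() / len(y) + 0.5 * (w @ w)

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1, -1])
w = np.array([1.0, 0.0])
print(objective(w, 0.0, X, y, c_pos=1.0, c_neg=2.0))  # 2/2 + 0.5 = 1.5
```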
SVM with Asymmetric Cost (1)
hinge loss: φ_hi(u) = max{1 − u, 0}

With the hinge surrogate, min_{w,b} Ĵ(C₊, C₋, w, b) is equivalent to the SVM with asymmetric cost:
  min_{w,b,ξ} (1/2)||w||² + Σᵢ Cᵢ ξᵢ
  s.t. ξᵢ ≥ 0, i = 1, …, n
       yᵢ(wᵀxᵢ − b) ≥ 1 − ξᵢ, i = 1, …, n
where Cᵢ = C₊ if yᵢ = 1; Cᵢ = C₋ if yᵢ = −1 (the 1/n factor of the empirical cost is absorbed into C₊, C₋).
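This is a quadratic program; as a rough illustration only (not the authors' path algorithm), the same primal objective, scaled by 1/n as in the empirical cost, can be minimized by plain subgradient descent, which shows how C₊ and C₋ weight the two classes. The data, step size, and epoch count are made-up choices:

```python
import numpy as np

def train_svm(X, y, c_pos, c_neg, lr=0.05, epochs=500):
    """Subgradient descent on 0.5*||w||^2 + (1/n) * sum_i C_i * hinge_i."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    c = np.where(y == 1, c_pos, c_neg)       # per-point asymmetric cost
    for _ in range(epochs):
        margins = y * (X @ w - b)
        a = margins < 1.0                    # points with nonzero hinge loss
        grad_w = w - ((c[a] * y[a]) @ X[a]) / n
        grad_b = (c[a] * y[a]).sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

X = np.array([[2., 2.], [3., 1.], [2., 3.], [-2., -2.], [-3., -1.], [-2., -3.]])
y = np.array([1, 1, 1, -1, -1, -1])
w, b = train_svm(X, y, c_pos=2.0, c_neg=1.0)
print((np.where(X @ w - b >= 0, 1, -1) == y).all())   # separable data: True
```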
SVM with Asymmetric Cost (2)
The Lagrangian, with dual variables α, μ ∈ Rⁿ (αᵢ, μᵢ ≥ 0):
  L(w, b, ξ, α, μ) = (1/2)||w||² + Σᵢ Cᵢξᵢ + Σᵢ αᵢ(1 − ξᵢ − yᵢ(wᵀxᵢ − b)) − Σᵢ μᵢξᵢ

Karush-Kuhn-Tucker (KKT) conditions:
  ∂L/∂w = 0  ⇒  w = Σᵢ αᵢ yᵢ xᵢ                                  (1)
  ∂L/∂b = 0  ⇒  Σᵢ αᵢ yᵢ = 0                                     (2)
  ∂L/∂ξᵢ = 0 ⇒  αᵢ + μᵢ = Cᵢ, hence 0 ≤ αᵢ ≤ Cᵢ, i = 1, …, n     (3)

Complementary slackness:
  αᵢ(1 − ξᵢ − yᵢ(wᵀxᵢ − b)) = 0,  μᵢξᵢ = 0,  i = 1, …, n
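Condition (1) says the primal weight vector is recovered from the dual variables; a tiny numeric check (the α values below are made up, chosen only to satisfy the equality constraint (2)):

```python
import numpy as np

# KKT sketch: w = sum_i alpha_i * y_i * x_i, with sum_i alpha_i * y_i = 0.
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y = np.array([1, 1, -1])
alpha = np.array([0.5, 0.5, 1.0])        # made-up feasible dual variables

assert abs(alpha @ y) < 1e-12            # KKT condition (2)
w = (alpha * y) @ X                      # KKT condition (1)
print(w)                                 # [1.5 1.5]
```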
SVM with Asymmetric Cost (3)
The dual problem:
  max_{α∈Rⁿ} 1ᵀα − (1/2) αᵀ K̃ α
  s.t. yᵀα = 0, 0 ≤ αᵢ ≤ Cᵢ, i = 1, …, n
where K̃ = diag(y) K diag(y) and Kᵢⱼ = K(xᵢ, xⱼ) = xᵢᵀxⱼ.

This is a quadratic optimization problem for a given cost structure; re-solving it over the whole space (C₊, C₋) would be computationally intractable.
Following the SVM regularization path algorithm (Hastie et al., 2005), the authors work with conditions (1)-(3) and the KKT complementary-slackness conditions instead of the dual problem.
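The label-modulated kernel matrix K̃ in the dual is cheap to form; a minimal sketch with a linear kernel and made-up points:

```python
import numpy as np

def k_tilde(X, y):
    """K~ = diag(y) K diag(y) with the linear kernel K_ij = x_i^T x_j."""
    K = X @ X.T                  # linear-kernel Gram matrix
    return np.outer(y, y) * K    # same as diag(y) @ K @ diag(y)

X = np.array([[1.0, 1.0], [1.0, 2.0]])
y = np.array([1, -1])
print(k_tilde(X, y))   # [[ 2. -3.]
                       #  [-3.  5.]]
```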
SVM Regularization Path (1)
SVM regularization path: C₊ = C₋ = C = 1/λ.
The cost is symmetric, so the search is along the λ axis.

Define active sets of data points:
• Margin:          M = {i : yᵢ(wᵀxᵢ − b) = 1}
• Left of margin:  L = {i : yᵢ(wᵀxᵢ − b) < 1}
• Right of margin: R = {i : yᵢ(wᵀxᵢ − b) > 1}

From the KKT conditions αᵢ(1 − ξᵢ − yᵢ(wᵀxᵢ − b)) = 0, μᵢξᵢ = 0, i = 1, …, n:
  αᵢ = Cᵢ for i ∈ L,  0 ≤ αᵢ ≤ Cᵢ for i ∈ M,  αᵢ = 0 for i ∈ R.
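Partitioning points into the three active sets is a one-liner per set; a small sketch (labels and function values are made up):

```python
import numpy as np

def active_sets(y, f_vals, tol=1e-9):
    """Split indices into margin (M), left of margin (L), right of margin (R)."""
    yf = y * f_vals
    M = np.where(np.abs(yf - 1.0) <= tol)[0]   # y_i f(x_i) = 1
    L = np.where(yf < 1.0 - tol)[0]            # y_i f(x_i) < 1
    R = np.where(yf > 1.0 + tol)[0]            # y_i f(x_i) > 1
    return M, L, R

y = np.array([1, 1, -1])
f = np.array([1.0, 0.5, -2.0])
M, L, R = active_sets(y, f)
print(M, L, R)   # [0] [1] [2]
```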
SVM Regularization Path (2): Initialization (n₊ = n₋)
Consider λ sufficiently large (C very small): all the points are in L,
  αᵢ = Cᵢ = 1/λ,  yᵢ f(xᵢ) < 1,  i = 1, …, n,
so that w = Σᵢ αᵢ yᵢ xᵢ = (1/λ) w*, with w* = Σᵢ yᵢ xᵢ.
Condition (2), Σᵢ αᵢ yᵢ = 0, holds since n₊ = n₋, and the αᵢ remain at Cᵢ as λ decreases, until one positive and one negative example hit the margin simultaneously:
  b + 1 = (1/λ) w*ᵀxᵢ, i ∈ I₊
  b − 1 = (1/λ) w*ᵀxᵢ, i ∈ I₋

SVM Regularization Path (3)
Define
  i₊ = argmax_{i∈I₊} w*ᵀxᵢ,  i₋ = argmin_{i∈I₋} w*ᵀxᵢ.
The critical condition for the first two points hitting the margin, f(x_{i₊}) = 1 and f(x_{i₋}) = −1, gives
  λ₀ = (w*ᵀx_{i₊} − w*ᵀx_{i₋}) / 2
  b₀ = (w*ᵀx_{i₊} + w*ᵀx_{i₋}) / (w*ᵀx_{i₊} − w*ᵀx_{i₋})
For n₊ ≠ n₋, this initial condition keeps the same form except for the definition of i₊, i₋.
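The balanced-case initialization is a simple closed form; a sketch on made-up 1-d-style data (this assumes n₊ = n₋, as in the slide):

```python
import numpy as np

def initialize_path(X, y):
    """Balanced-case path start: lambda0 = (w*^T x_{i+} - w*^T x_{i-}) / 2,
    b0 = (w*^T x_{i+} + w*^T x_{i-}) / (w*^T x_{i+} - w*^T x_{i-})."""
    w_star = (y[:, None] * X).sum(axis=0)   # w* = sum_i y_i x_i
    s = X @ w_star
    sp = s[y == 1].max()                    # i+ = argmax over positives
    sn = s[y == -1].min()                   # i- = argmin over negatives
    return (sp - sn) / 2.0, (sp + sn) / (sp - sn)

X = np.array([[2.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [-3.0, 0.0]])
y = np.array([1, 1, -1, -1])
print(initialize_path(X, y))   # (17.5, -0.2)
```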
SVM Regularization Path (4)
The path: as λ decreases, αᵢ changes only for i ∈ M, until one of the following events happens:
• A point from L or R enters M;
• A point in M leaves the set to join either R or L.
Between events λ^{l+1} < λ < λ^l, consider only the m points on the margin, M^l:
  αⱼ(λ) = αⱼ^l − (λ^l − λ) bⱼ,  j ∈ M^l ∪ {0}
where bⱼ is some function of {yⱼ, xⱼ}, j ∈ M^l (b₀ corresponds to the offset).
Therefore the αⱼ for points on the margin proceed linearly in λ, while the function f changes in a piecewise-inverse manner in λ:
  f(x) = (λ^l/λ)[f^l(x) − h^l(x)] + h^l(x),  h^l(x) = Σ_{j∈M^l} yⱼ bⱼ K(x, xⱼ) + b₀
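The two between-events update rules can be sketched directly (the bⱼ slope, h^l value, and λ values below are made-up scalars, taken as given):

```python
# Sketch of the between-events updates along the regularization path.
def update_alpha(alpha_l, b_j, lam_l, lam):
    """alpha_j moves linearly in lambda: alpha_j^l - (lam^l - lam) * b_j."""
    return alpha_l - (lam_l - lam) * b_j

def update_f(f_l, h_l, lam_l, lam):
    """f moves piecewise-inversely: (lam^l/lam) * (f^l - h^l) + h^l."""
    return (lam_l / lam) * (f_l - h_l) + h_l

print(update_alpha(0.8, 0.1, 4.0, 2.0))  # 0.8 - 2 * 0.1 = 0.6
print(update_f(2.0, 1.0, 4.0, 2.0))      # 2 * (2 - 1) + 1 = 3.0
```

At λ = λ^l both rules return the current values, as they should.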
SVM Regularization Path (5)
• Update regularization:
    λ^{l+1} = the largest λ < λ^l among all the possible events
• Update active sets and solutions:
    αⱼ^{l+1} = αⱼ^l − (λ^l − λ^{l+1}) bⱼ,  j ∈ M^l ∪ {0}
    f^{l+1}(x) = (λ^l/λ^{l+1})[f^l(x) − h^l(x)] + h^l(x)
• Stopping condition: in the separable case, terminate when L becomes empty; in the non-separable case, terminate when f^l(x) = h^l(x).
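Finding λ^{l+1} amounts to checking when each margin point's α hits a bound; a hedged sketch covering only the leave-the-margin events (points entering M from L or R are handled analogously via f, and the α, b, C values below are made up):

```python
def next_breakpoint(alpha_l, b, c, lam_l, eps=1e-12):
    """Largest lambda < lam^l at which some alpha_j(lam) = alpha_j^l -
    (lam^l - lam) * b_j reaches a bound (0 or C_j)."""
    candidates = []
    for a_j, b_j, c_j in zip(alpha_l, b, c):
        if b_j == 0.0:
            continue                          # this alpha does not move
        for bound in (0.0, c_j):
            lam = lam_l - (a_j - bound) / b_j
            if lam < lam_l - eps:             # event must lie below lam^l
                candidates.append(lam)
    return max(candidates) if candidates else 0.0

print(next_breakpoint([0.5], [1.0], [1.0], 2.0))   # alpha hits 0 at lam = 1.5
```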
Path with Cost Asymmetry (1): Exploration in the 2-d space (C₊, C₋)
Path initialization: start at situations when all points are in L:
  αᵢ = Cᵢ, i = 1, …, n, with yᵀα = 0, i.e. n₊C₊ = n₋C₋.
Follow the updating procedure of the 1-d case along a line in the (C₊, C₋) plane: the regularization changes while the cost asymmetry is fixed.
Among all the classifiers produced, find the best one given the user's cost function (C₊⁰, C₋⁰).
Paths start from points (C₊¹, C₋¹) on the initialization line and proceed in fixed directions: scaling (C₊, C₋) to (rC₊, rC₋) changes the regularization, while the direction of the line sets the asymmetry.
Path with Cost Asymmetry (2): Producing ROC curves
Collecting the classifiers along these lines, we can build ROC curves: each classifier contributes one (false-positive rate, true-positive rate) point.
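Turning a family of classifiers into an ROC curve is mechanical; a minimal sketch for the simplest family, one score function with a varying intercept (labels and scores below are made up):

```python
import numpy as np

def roc_points(y, scores):
    """One (FPR, TPR) point per intercept/threshold, largest threshold first."""
    pts = []
    for t in np.sort(np.unique(scores))[::-1]:
        pred = np.where(scores >= t, 1, -1)
        tpr = ((pred == 1) & (y == 1)).sum() / (y == 1).sum()
        fpr = ((pred == 1) & (y == -1)).sum() / (y == -1).sum()
        pts.append((fpr, tpr))
    return pts

y = np.array([1, 1, -1, -1])
s = np.array([0.9, 0.6, 0.4, 0.2])
print(roc_points(y, s))   # [(0.0, 0.5), (0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]
```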
Results (1)
For 1000 testing asymmetries, three methods are compared:
• "one" - take the testing asymmetry as the training cost asymmetry;
• "int" - vary the intercept of "one" and build an ROC curve, then select the optimal classifier;
• "all" - select the optimal classifier from the ROC curve obtained by varying both the training asymmetry and the intercept.
A nested cross-validation is used:
• the outer cross-validation produces overall accuracy estimates for the classifier;
• the inner cross-validation selects optimal classifier parameters (training asymmetry and/or intercept).
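The nested cross-validation can be sketched as a skeleton; the stand-in "model" here is just a threshold on a precomputed score (a hypothetical simplification, where the real inner loop would select the training asymmetry and/or intercept), and the data are made up:

```python
import numpy as np

def kfold(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k folds."""
    return np.array_split(np.random.default_rng(seed).permutation(n), k)

def accuracy(scores, y, idx, thr):
    pred = np.where(scores[idx] >= thr, 1, -1)
    return (pred == y[idx]).mean()

def nested_cv(scores, y, params, k_outer=3, k_inner=2):
    accs = []
    for i, test_idx in enumerate(kfold(len(y), k_outer)):
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        # inner CV: pick the parameter with the best mean fold accuracy
        inner = kfold(len(train_idx), k_inner, seed=i + 1)
        best = max(params, key=lambda p: np.mean(
            [accuracy(scores, y, train_idx[f], p) for f in inner]))
        # outer CV: estimate accuracy of that choice on held-out data
        accs.append(accuracy(scores, y, test_idx, best))
    return float(np.mean(accs))

y = np.array([1] * 6 + [-1] * 6)
scores = np.array([0.9, 0.8, 0.7, 0.95, 0.85, 0.75,
                   0.1, 0.2, 0.3, 0.15, 0.25, 0.05])
print(nested_cv(scores, y, params=(0.4, 0.5, 0.6)))   # separable scores: 1.0
```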
Results (2)
Conclusions
An efficient algorithm is presented to build ROC curves by varying the training cost asymmetries for SVMs.
The main contribution is generalizing the SVM regularization path (Hastie et al., 2005) from a 1-d axis to a 2-d plane.
Because a convex surrogate is used for training, using the testing asymmetry as the training asymmetry leads to a non-optimal classifier.
Results show advantages of considering more training asymmetries.