Considering Cost Asymmetry in Learning Classifiers
Presented by Chunping Wang
Machine Learning Group, Duke University
May 21, 2007
Based on the paper by Bach, Heckerman, and Horvitz
Outline
• Introduction
• SVM with Asymmetric Cost
• SVM Regularization Path (Hastie et al., 2005)
• Path with Cost Asymmetry
• Results
• Conclusions
Introduction (1): Binary classification
A classifier can be defined as
  y = sign[f(x)]
based on a linear decision function
  f(x) = wᵀx − b
• real-valued predictors xᵢ ∈ Rᵈ
• binary response yᵢ ∈ {−1, +1}
• decision boundary f(x) = 0, with normal vector w and offset b/||w||
• parameters (w, b) ∈ R^{d+1}
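The decision rule above can be sketched in a few lines (a minimal illustration; the weight vector and offset below are made-up values, not fitted parameters):

```python
import numpy as np

def predict(X, w, b):
    """Classify rows of X with y = sign(f(x)), f(x) = w^T x - b."""
    f = X @ w - b          # linear decision function
    return np.where(f >= 0, 1, -1)

# Made-up parameters (w, b) for illustration only
w = np.array([1.0, -2.0])
b = 0.5
X = np.array([[2.0, 0.0],    # f = 1.5  -> predicted +1
              [0.0, 1.0]])   # f = -2.5 -> predicted -1
print(predict(X, w, b))      # [ 1 -1]
```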
Introduction (2): Two types of misclassification
• false negative: cost C₊
• false positive: cost C₋

Expected cost:
  R(C₊, C₋, w, b) = C₊ P(wᵀx − b ≤ 0, y = 1) + C₋ P(wᵀx − b ≥ 0, y = −1)

In terms of the 0-1 loss function φ₀₋₁(u) = 1_{u ≤ 0}:
  R(C₊, C₋, w, b) = C₊ E[1_{y=1} φ₀₋₁(wᵀx − b)] + C₋ E[1_{y=−1} φ₀₋₁(−(wᵀx − b))]

This is the real loss function, but it is non-convex and non-differentiable.
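The empirical version of this asymmetric cost can be sketched directly (a hedged illustration; the labels, scores, and costs below are made up):

```python
import numpy as np

def empirical_cost(y_true, f_vals, c_pos, c_neg):
    """Mean asymmetric 0-1 cost of sign(f) against labels in {-1, +1}."""
    y_pred = np.where(f_vals >= 0, 1, -1)
    fn = (y_true == 1) & (y_pred == -1)   # false negatives, charged C+
    fp = (y_true == -1) & (y_pred == 1)   # false positives, charged C-
    return (c_pos * fn.sum() + c_neg * fp.sum()) / len(y_true)

y = np.array([1, 1, -1, -1])
f = np.array([0.5, -0.2, -0.3, 0.1])               # one FN, one FP
print(empirical_cost(y, f, c_pos=2.0, c_neg=1.0))  # (2 + 1)/4 = 0.75
```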
Introduction (3): Convex loss functions as surrogates for the 0-1 loss function
(used for training purposes)
Introduction (4): Empirical cost and objective
Empirical cost given n labeled data points, with a surrogate loss φ:
  R̂(C₊, C₋, w, b) = (C₊/n) Σ_{i∈I₊} φ(yᵢ(wᵀxᵢ − b)) + (C₋/n) Σ_{i∈I₋} φ(yᵢ(wᵀxᵢ − b))

Objective function:
  Ĵ(C₊, C₋, w, b) = R̂(C₊, C₋, w, b) + (1/2)||w||²
The asymmetry enters through (C₊, C₋); the ||w||² term is the regularization.
Motivation: efficiently explore many training asymmetries even when the testing asymmetry (C₊, C₋) is given.
Since convex surrogates of the 0-1 loss function are used for training, the best cost asymmetry for training generally differs from the asymmetry used for testing.
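The objective Ĵ above can be evaluated directly; a small sketch with the hinge surrogate and made-up data (the points, weights, and costs are illustration values only):

```python
import numpy as np

def hinge(u):
    """Hinge surrogate for the 0-1 loss: max(1 - u, 0)."""
    return np.maximum(1.0 - u, 0.0)

def objective(w, b, X, y, c_pos, c_neg):
    """Regularized empirical objective J^hat with per-class costs C+/C-."""
    margins = y * (X @ w - b)
    c = np.where(y == 1, c_pos, c_neg)
    return (c * hinge(margins)).sum() / len(y) + 0.5 * (w @ w)

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1, -1])
w = np.array([1.0, 0.0])
print(objective(w, 0.0, X, y, c_pos=1.0, c_neg=2.0))  # 2/2 + 0.5 = 1.5
```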
SVM with Asymmetric Cost (1)
hinge loss: φ_hi(u) = max{1 − u, 0}

With the hinge surrogate, min_{w,b} Ĵ(C₊, C₋, w, b) is equivalent to the SVM with asymmetric cost:
  min_{w,b,ξ} (1/2)||w||² + Σᵢ Cᵢ ξᵢ
  s.t. ξᵢ ≥ 0, i = 1, …, n
       yᵢ(wᵀxᵢ − b) ≥ 1 − ξᵢ, i = 1, …, n
where Cᵢ = C₊ if yᵢ = 1; Cᵢ = C₋ if yᵢ = −1 (the 1/n factor of the empirical cost is absorbed into C₊, C₋).
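This is a quadratic program; as a rough illustration only (not the authors' path algorithm), the same primal objective, scaled by 1/n as in the empirical cost, can be minimized by plain subgradient descent, which shows how C₊ and C₋ weight the two classes. The data, step size, and epoch count are made-up choices:

```python
import numpy as np

def train_svm(X, y, c_pos, c_neg, lr=0.05, epochs=500):
    """Subgradient descent on 0.5*||w||^2 + (1/n) * sum_i C_i * hinge_i."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    c = np.where(y == 1, c_pos, c_neg)       # per-point asymmetric cost
    for _ in range(epochs):
        margins = y * (X @ w - b)
        a = margins < 1.0                    # points with nonzero hinge loss
        grad_w = w - ((c[a] * y[a]) @ X[a]) / n
        grad_b = (c[a] * y[a]).sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

X = np.array([[2., 2.], [3., 1.], [2., 3.], [-2., -2.], [-3., -1.], [-2., -3.]])
y = np.array([1, 1, 1, -1, -1, -1])
w, b = train_svm(X, y, c_pos=2.0, c_neg=1.0)
print((np.where(X @ w - b >= 0, 1, -1) == y).all())   # separable data: True
```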
SVM with Asymmetric Cost (2)
The Lagrangian, with dual variables α, μ ∈ Rⁿ (αᵢ, μᵢ ≥ 0):
  L(w, b, ξ, α, μ) = (1/2)||w||² + Σᵢ Cᵢξᵢ + Σᵢ αᵢ(1 − ξᵢ − yᵢ(wᵀxᵢ − b)) − Σᵢ μᵢξᵢ

Karush-Kuhn-Tucker (KKT) conditions:
  ∂L/∂w = 0  ⇒  w = Σᵢ αᵢ yᵢ xᵢ                                  (1)
  ∂L/∂b = 0  ⇒  Σᵢ αᵢ yᵢ = 0                                     (2)
  ∂L/∂ξᵢ = 0 ⇒  αᵢ + μᵢ = Cᵢ, hence 0 ≤ αᵢ ≤ Cᵢ, i = 1, …, n     (3)

Complementary slackness:
  αᵢ(1 − ξᵢ − yᵢ(wᵀxᵢ − b)) = 0,  μᵢξᵢ = 0,  i = 1, …, n
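Condition (1) says the primal weight vector is recovered from the dual variables; a tiny numeric check (the α values below are made up, chosen only to satisfy the equality constraint (2)):

```python
import numpy as np

# KKT sketch: w = sum_i alpha_i * y_i * x_i, with sum_i alpha_i * y_i = 0.
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y = np.array([1, 1, -1])
alpha = np.array([0.5, 0.5, 1.0])        # made-up feasible dual variables

assert abs(alpha @ y) < 1e-12            # KKT condition (2)
w = (alpha * y) @ X                      # KKT condition (1)
print(w)                                 # [1.5 1.5]
```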
SVM with Asymmetric Cost (3)
The dual problem:
  max_{α∈Rⁿ} 1ᵀα − (1/2) αᵀ K̃ α
  s.t. yᵀα = 0, 0 ≤ αᵢ ≤ Cᵢ, i = 1, …, n
where K̃ = diag(y) K diag(y) and Kᵢⱼ = K(xᵢ, xⱼ) = xᵢᵀxⱼ.

This is a quadratic optimization problem for a given cost structure; re-solving it over the whole space (C₊, C₋) would be computationally intractable.
Following the SVM regularization path algorithm (Hastie et al., 2005), the authors work with conditions (1)-(3) and the KKT complementary-slackness conditions instead of the dual problem.
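The label-modulated kernel matrix K̃ in the dual is cheap to form; a minimal sketch with a linear kernel and made-up points:

```python
import numpy as np

def k_tilde(X, y):
    """K~ = diag(y) K diag(y) with the linear kernel K_ij = x_i^T x_j."""
    K = X @ X.T                  # linear-kernel Gram matrix
    return np.outer(y, y) * K    # same as diag(y) @ K @ diag(y)

X = np.array([[1.0, 1.0], [1.0, 2.0]])
y = np.array([1, -1])
print(k_tilde(X, y))   # [[ 2. -3.]
                       #  [-3.  5.]]
```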
SVM Regularization Path (1)
SVM regularization path: C₊ = C₋ = C = 1/λ.
The cost is symmetric, so the search is along the λ axis.

Define active sets of data points:
• Margin:          M = {i : yᵢ(wᵀxᵢ − b) = 1}
• Left of margin:  L = {i : yᵢ(wᵀxᵢ − b) < 1}
• Right of margin: R = {i : yᵢ(wᵀxᵢ − b) > 1}

From the KKT conditions αᵢ(1 − ξᵢ − yᵢ(wᵀxᵢ − b)) = 0, μᵢξᵢ = 0, i = 1, …, n:
  αᵢ = Cᵢ for i ∈ L,  0 ≤ αᵢ ≤ Cᵢ for i ∈ M,  αᵢ = 0 for i ∈ R.
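Partitioning points into the three active sets is a one-liner per set; a small sketch (labels and function values are made up):

```python
import numpy as np

def active_sets(y, f_vals, tol=1e-9):
    """Split indices into margin (M), left of margin (L), right of margin (R)."""
    yf = y * f_vals
    M = np.where(np.abs(yf - 1.0) <= tol)[0]   # y_i f(x_i) = 1
    L = np.where(yf < 1.0 - tol)[0]            # y_i f(x_i) < 1
    R = np.where(yf > 1.0 + tol)[0]            # y_i f(x_i) > 1
    return M, L, R

y = np.array([1, 1, -1])
f = np.array([1.0, 0.5, -2.0])
M, L, R = active_sets(y, f)
print(M, L, R)   # [0] [1] [2]
```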
SVM Regularization Path (2): Initialization (n₊ = n₋)
Consider λ sufficiently large (C very small): all the points are in L,
  αᵢ = Cᵢ = 1/λ,  yᵢ f(xᵢ) < 1,  i = 1, …, n,
so that w = Σᵢ αᵢ yᵢ xᵢ = (1/λ) w*, with w* = Σᵢ yᵢ xᵢ.
Condition (2), Σᵢ αᵢ yᵢ = 0, holds since n₊ = n₋, and the αᵢ remain at Cᵢ as λ decreases, until one positive and one negative example hit the margin simultaneously:
  b + 1 = (1/λ) w*ᵀxᵢ, i ∈ I₊
  b − 1 = (1/λ) w*ᵀxᵢ, i ∈ I₋

SVM Regularization Path (3)
Define
  i₊ = argmax_{i∈I₊} w*ᵀxᵢ,  i₋ = argmin_{i∈I₋} w*ᵀxᵢ.
The critical condition for the first two points hitting the margin, f(x_{i₊}) = 1 and f(x_{i₋}) = −1, gives
  λ₀ = (w*ᵀx_{i₊} − w*ᵀx_{i₋}) / 2
  b₀ = (w*ᵀx_{i₊} + w*ᵀx_{i₋}) / (w*ᵀx_{i₊} − w*ᵀx_{i₋})
For n₊ ≠ n₋, this initial condition keeps the same form except for the definition of i₊, i₋.
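The balanced-case initialization is a simple closed form; a sketch on made-up 1-d-style data (this assumes n₊ = n₋, as in the slide):

```python
import numpy as np

def initialize_path(X, y):
    """Balanced-case path start: lambda0 = (w*^T x_{i+} - w*^T x_{i-}) / 2,
    b0 = (w*^T x_{i+} + w*^T x_{i-}) / (w*^T x_{i+} - w*^T x_{i-})."""
    w_star = (y[:, None] * X).sum(axis=0)   # w* = sum_i y_i x_i
    s = X @ w_star
    sp = s[y == 1].max()                    # i+ = argmax over positives
    sn = s[y == -1].min()                   # i- = argmin over negatives
    return (sp - sn) / 2.0, (sp + sn) / (sp - sn)

X = np.array([[2.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [-3.0, 0.0]])
y = np.array([1, 1, -1, -1])
print(initialize_path(X, y))   # (17.5, -0.2)
```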
SVM Regularization Path (4)
The path: as λ decreases, αᵢ changes only for i ∈ M, until one of the following events happens:
• A point from L or R enters M;
• A point in M leaves the set to join either R or L.
Between events λ^{l+1} < λ < λ^l, consider only the m points on the margin, M^l:
  αⱼ(λ) = αⱼ^l − (λ^l − λ) bⱼ,  j ∈ M^l ∪ {0}
where bⱼ is some function of {yⱼ, xⱼ}, j ∈ M^l (b₀ corresponds to the offset).
Therefore the αⱼ for points on the margin proceed linearly in λ, while the function f changes in a piecewise-inverse manner in λ:
  f(x) = (λ^l/λ)[f^l(x) − h^l(x)] + h^l(x),  h^l(x) = Σ_{j∈M^l} yⱼ bⱼ K(x, xⱼ) + b₀
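The two between-events update rules can be sketched directly (the bⱼ slope, h^l value, and λ values below are made-up scalars, taken as given):

```python
# Sketch of the between-events updates along the regularization path.
def update_alpha(alpha_l, b_j, lam_l, lam):
    """alpha_j moves linearly in lambda: alpha_j^l - (lam^l - lam) * b_j."""
    return alpha_l - (lam_l - lam) * b_j

def update_f(f_l, h_l, lam_l, lam):
    """f moves piecewise-inversely: (lam^l/lam) * (f^l - h^l) + h^l."""
    return (lam_l / lam) * (f_l - h_l) + h_l

print(update_alpha(0.8, 0.1, 4.0, 2.0))  # 0.8 - 2 * 0.1 = 0.6
print(update_f(2.0, 1.0, 4.0, 2.0))      # 2 * (2 - 1) + 1 = 3.0
```

At λ = λ^l both rules return the current values, as they should.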
SVM Regularization Path (5)
• Update regularization:
    λ^{l+1} = the largest λ < λ^l among all the possible events
• Update active sets and solutions:
    αⱼ^{l+1} = αⱼ^l − (λ^l − λ^{l+1}) bⱼ,  j ∈ M^l ∪ {0}
    f^{l+1}(x) = (λ^l/λ^{l+1})[f^l(x) − h^l(x)] + h^l(x)
• Stopping condition: in the separable case, terminate when L becomes empty; in the non-separable case, terminate when f^l(x) = h^l(x).
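Finding λ^{l+1} amounts to checking when each margin point's α hits a bound; a hedged sketch covering only the leave-the-margin events (points entering M from L or R are handled analogously via f, and the α, b, C values below are made up):

```python
def next_breakpoint(alpha_l, b, c, lam_l, eps=1e-12):
    """Largest lambda < lam^l at which some alpha_j(lam) = alpha_j^l -
    (lam^l - lam) * b_j reaches a bound (0 or C_j)."""
    candidates = []
    for a_j, b_j, c_j in zip(alpha_l, b, c):
        if b_j == 0.0:
            continue                          # this alpha does not move
        for bound in (0.0, c_j):
            lam = lam_l - (a_j - bound) / b_j
            if lam < lam_l - eps:             # event must lie below lam^l
                candidates.append(lam)
    return max(candidates) if candidates else 0.0

print(next_breakpoint([0.5], [1.0], [1.0], 2.0))   # alpha hits 0 at lam = 1.5
```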
Path with Cost Asymmetry (1): Exploration in the 2-d space (C₊, C₋)
Path initialization: start at situations when all points are in L:
  αᵢ = Cᵢ, i = 1, …, n, with yᵀα = 0, i.e. n₊C₊ = n₋C₋.
Follow the updating procedure of the 1-d case along a line in the (C₊, C₋) plane: the regularization changes while the cost asymmetry is fixed.
Among all the classifiers produced, find the best one given the user's cost function (C₊⁰, C₋⁰).
Paths start from points (C₊¹, C₋¹) on the initialization line and proceed in fixed directions: scaling (C₊, C₋) to (rC₊, rC₋) changes the regularization, while the direction of the line sets the asymmetry.
Path with Cost Asymmetry (2): Producing ROC curves
Collecting the classifiers along these lines, we can build ROC curves: each classifier contributes one (false-positive rate, true-positive rate) point.
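Turning a family of classifiers into an ROC curve is mechanical; a minimal sketch for the simplest family, one score function with a varying intercept (labels and scores below are made up):

```python
import numpy as np

def roc_points(y, scores):
    """One (FPR, TPR) point per intercept/threshold, largest threshold first."""
    pts = []
    for t in np.sort(np.unique(scores))[::-1]:
        pred = np.where(scores >= t, 1, -1)
        tpr = ((pred == 1) & (y == 1)).sum() / (y == 1).sum()
        fpr = ((pred == 1) & (y == -1)).sum() / (y == -1).sum()
        pts.append((fpr, tpr))
    return pts

y = np.array([1, 1, -1, -1])
s = np.array([0.9, 0.6, 0.4, 0.2])
print(roc_points(y, s))   # [(0.0, 0.5), (0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]
```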
Results (1)
For 1000 testing asymmetries, three methods are compared:
• "one" - take the testing asymmetry as the training cost asymmetry;
• "int" - vary the intercept of "one" and build an ROC curve, then select the optimal classifier;
• "all" - select the optimal classifier from the ROC curve obtained by varying both the training asymmetry and the intercept.
A nested cross-validation is used:
• the outer cross-validation produces overall accuracy estimates for the classifier;
• the inner cross-validation selects optimal classifier parameters (training asymmetry and/or intercept).
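The nested cross-validation can be sketched as a skeleton; the stand-in "model" here is just a threshold on a precomputed score (a hypothetical simplification, where the real inner loop would select the training asymmetry and/or intercept), and the data are made up:

```python
import numpy as np

def kfold(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k folds."""
    return np.array_split(np.random.default_rng(seed).permutation(n), k)

def accuracy(scores, y, idx, thr):
    pred = np.where(scores[idx] >= thr, 1, -1)
    return (pred == y[idx]).mean()

def nested_cv(scores, y, params, k_outer=3, k_inner=2):
    accs = []
    for i, test_idx in enumerate(kfold(len(y), k_outer)):
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        # inner CV: pick the parameter with the best mean fold accuracy
        inner = kfold(len(train_idx), k_inner, seed=i + 1)
        best = max(params, key=lambda p: np.mean(
            [accuracy(scores, y, train_idx[f], p) for f in inner]))
        # outer CV: estimate accuracy of that choice on held-out data
        accs.append(accuracy(scores, y, test_idx, best))
    return float(np.mean(accs))

y = np.array([1] * 6 + [-1] * 6)
scores = np.array([0.9, 0.8, 0.7, 0.95, 0.85, 0.75,
                   0.1, 0.2, 0.3, 0.15, 0.25, 0.05])
print(nested_cv(scores, y, params=(0.4, 0.5, 0.6)))   # separable scores: 1.0
```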
Results (2)
Conclusions
An efficient algorithm is presented to build ROC curves by varying the training cost asymmetries for SVMs.
The main contribution is generalizing the SVM regularization path (Hastie et al., 2005) from a 1-d axis to a 2-d plane.
Because a convex surrogate is used for training, using the testing asymmetry as the training asymmetry leads to a non-optimal classifier.
Results show advantages of considering more training asymmetries.