Transcript of Machine Learning on a Budget (blogs.bu.edu/ktrap/files/2013/09/defense_talk.pdf)
Machine Learning on a Budget
Kirill Trapeznikov
September 5th, 2013
Advisors: Venkatesh Saligrama, David Castanon
Committee: Venkatesh Saligrama, David Castanon, Prakash Ishwar, Ioannis Paschalidis
Chair: Ayse Coskun
1 / 79
Supervised Learning
NatureP(x, y)
x y
example x = image, text article, . . . , collection of k sensor measurements
label y = scene category, article topic, . . . , target present / not present
2 / 79
Classifier
Goal is to find a classifier f(x) that minimizes the expected error:

min_f E_{x,y} [ 1[f(x) ≠ y] ]

And if P(x, y) is known, then f(x) is the maximum a posteriori rule:

f(x) = arg max_y P(y | x)
3 / 79
Empirical Risk Minimization
In most cases, P(x, y) is unknown and typically hard to estimate

Empirical Risk Minimization: collect training data (x1, y1), (x2, y2), . . . and approximate the expected risk

min_{f ∈ F} Σ_i 1[f(x_i) ≠ y_i]

F is a family of classifiers (e.g. linear separators)
[Diagram: Labeled Training Data → Supervised Learner → Classifier with small error f(x)]
4 / 79
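As an illustration of the ERM objective above, a minimal sketch (not from the talk): for the family of 1-D threshold classifiers, minimizing the empirical 0-1 risk reduces to a search over candidate thresholds.

```python
import numpy as np

def erm_threshold(x, y):
    """Empirical risk minimization over the family of 1-D threshold
    classifiers f(x) = sign(x - t): pick t minimizing training 0-1 loss."""
    candidates = np.concatenate(([x.min() - 1.0], np.sort(x)))
    best_t, best_err = None, np.inf
    for t in candidates:
        pred = np.where(x > t, 1, -1)
        err = np.mean(pred != y)          # empirical 0-1 risk on the sample
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

# toy separable data: negatives below 0, positives above
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([-1, -1, -1, 1, 1, 1])
t, err = erm_threshold(x, y)
```

With richer families F the search is no longer enumerable, which is why surrogate losses and optimization enter later in the talk.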
Learning Classifier
Most effort has been in learning a good classifier
how to parametrize the classifier family F
how to design and minimize a surrogate risk C(·) to replace 1[·]

min_{f ∈ F} Σ_i C(f(x_i), y_i)
Recently, the cost of learning has gained importance
Training Phase: labeling cost
Testing Phase: acquisition cost
5 / 79
Labeling Cost
To train a classifier we need: measurements x and labels y

often, a large collection of unlabeled data is available: the xi's

to obtain a label yi we need an expert, which is typically expensive
Examples
Medical Imaging
large amount of unlabeled data (tests, scans, etc.); to label, need a doctor/radiologist
Computer Vision/ Object Detection
unlabeled images/videos; requires user input to annotate objects in images
How to reduce labeling cost?
6 / 79
Reducing Labeling Cost
Unnecessary to label every example: many redundant, uninformative
[Diagram: Unlabeled Pool → Active Learner + Expert → Labeled Subset → Supervised Learner → Classifier with small error f(x)]
Active Learning: label a small fraction of the training data, X, to learn a good classifier

min_{L, |L| ≤ B} Σ_{i ∈ X} 1[f_L(x_i) ≠ y_i]

f_L(x): classifier trained on a labeled subset of examples L
labeling budget: label at most B examples
7 / 79
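The budgeted labeling loop above can be sketched generically; `query` and `fit` below are illustrative placeholders for a selection rule and a learner, not the talk's algorithm.

```python
import random

def active_learn(pool, oracle, budget, query, fit):
    """Generic pool-based active learning loop: repeatedly pick an
    unlabeled example, pay the expert (oracle) for its label, refit.
    `query` and `fit` are plug-in points for a selection rule and learner."""
    labeled = {}                        # subset L: index -> label
    unlabeled = set(range(len(pool)))
    classifier = None
    for _ in range(budget):             # label at most B examples
        i = query(unlabeled, pool, classifier)
        labeled[i] = oracle(pool[i])    # expensive expert call
        unlabeled.discard(i)
        classifier = fit(labeled, pool)
    return classifier, labeled

# toy instantiation: random queries, majority-vote "classifier"
pool = [-2.0, -1.0, 0.5, 1.5, 2.5]
oracle = lambda x: 1 if x > 0 else -1
query = lambda U, p, c: random.choice(sorted(U))
fit = lambda L, p: (lambda x: max(set(L.values()), key=list(L.values()).count))
clf, L = active_learn(pool, oracle, budget=3, query=query, fit=fit)
```

The rest of the talk is about choosing `query` well: random selection wastes budget on redundant examples.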
Test-time Costs
Given, a classifier, to make a decision
need to acquire sensor measurements: x = x1, x2, . . . , xK
with acquisition costs: c1, c2, . . . , cK
In many situations,
some sensors: less informative but fast/cheap
others: more informative but slow/expensive
cannot afford to use every sensor all the time
Applications
security screening: x-ray machine vs. manual inspection
medical diagnoses: physical exam vs. invasive procedure
computer vision: coarse features vs. high res
Not every decisions requires every sensor!
8 / 79
Reducing test-time costs
Sequential Sensor Selection
Sequentially select sensors to reduce average acquisition cost
Classify easy examples based on cheap sensors
Acquire expensive measurements only for difficult decisions
Learn a decision system F (x) from training data with full measurements:
min_F E_{x,y} [ error(F, x, y) + α cost(F, x) ]

F(x) controls when to stop and classify, or to request more sensor measurements

tune α to achieve a desired average acquisition budget
9 / 79
Active Learning: Reducing Training Cost
10 / 79
Overview
Active Boosted Learning: Active Learning algorithm in the boosting framework
version space active learning
ActBoost algorithm
theoretical convergence
experiments: comparison to other methods on several datasets
11 / 79
Active Learning Problem
[Diagram: Unlabeled Pool → Active Learner + Expert → Labeled Subset → Supervised Learner → Classifier with small error f(x)]
Label a small fraction of data to learn a good classifier
binary setting: labels y ∈ {+1, −1}
unlabeled pool of M examples: x1, x2, . . . , xM
12 / 79
Margin Based Active Learning
Margin-based AL: labels examples that are ambiguous w.r.t. the current classifier

[Schohn and Cohn, 2000, Balcan et al., 2007, Abe and Mamitsuka, 1998, Campbell et al., 2000]

[Diagram: loop of Update Model → Label Most Uncertain Sample → . . .]
Seems like a good strategy?
13 / 79
Sensitive to Initialization
Initialize from the first two clusters:
slow to converge (QBB method)
stuck in the first two clusters
[Figure: data in three clusters with initial labels from clusters 1 and 2; Accuracy (%) vs. Training Samples (#) for ActBoost and QBB]
(Our) Alternative Strategy: ActBoost, robust to initialization bias
Version Space Based approach
14 / 79
Version Space [Freund et al., 1997]
Version Space:
set of classifiers that agree with labeled data
Generalized Binary Search:
label examples to bisect version space ([Nowak, 2009])
Labeled 4 examples instead of 12
logarithmic reduction in # of labeled examples (for simple classifier families)
15 / 79
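For the 1-D threshold family on this slide, generalized binary search reduces to binary search over the sorted pool; a sketch (names are illustrative) that reproduces the "4 labels instead of 12" count:

```python
def gbs_threshold(xs, oracle):
    """Generalized binary search for a 1-D threshold classifier: each
    query bisects the set of consistent thresholds (the version space),
    so only O(log M) labels are needed instead of M."""
    xs = sorted(xs)
    lo, hi = 0, len(xs)            # boundary lies somewhere in xs[lo:hi]
    queries = 0
    while hi - lo > 0:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(xs[mid]) == 1:   # label +1: boundary at or left of mid
            hi = mid
        else:                      # label -1: boundary right of mid
            lo = mid + 1
    return lo, queries             # index of the first positive example

xs = list(range(12))                        # 12 unlabeled points
oracle = lambda x: 1 if x >= 8 else -1      # hidden true threshold
first_pos, n_labels = gbs_threshold(xs, oracle)
```

Here `n_labels` comes out to 4 for a pool of 12, matching the slide.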
Version Space
Issue:
Data is not separable with thresholds ⇒ version space is empty
But separable with intervals
Need to consider more complex classifiers: boosted classifiers
16 / 79
Boosting
Combine simple binary decisions to form a strong classifier
[Diagram: q1·h1(x) + q2·h2(x) + q3·h3(x) + q4·h4(x) → f(x)]

boosted classifier is parametrized by a weight vector q

q^T h(x) = Σ_{j=1}^{N} q_j h_j(x),  with weights q_j and weak learners h_j(x) ∈ {−1, +1}

assume a fixed set of N weak learners

x → [h1(x) h2(x) . . . hN(x)]^T = h(x)

weak learning assumption: version space is not empty [Freund and Schapire, 1996]
17 / 79
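The weighted vote above can be sketched directly; the stumps and weights below are illustrative, not from the talk:

```python
import numpy as np

def boosted_predict(q, stumps, x):
    """Boosted classifier sign(q^T h(x)) with weak learners h_j in {-1,+1};
    each weak learner here is a decision stump (threshold on one feature).
    The `or -1.0` maps a sign of exactly 0 to -1 to keep h in {-1,+1}."""
    h = np.array([np.sign(x[d] - t) or -1.0 for (d, t) in stumps])
    return 1 if q @ h > 0 else -1

# three illustrative stumps on a 2-D input: (dimension, threshold)
stumps = [(0, 0.0), (1, 0.0), (0, 1.0)]
q = np.array([0.5, 0.3, 0.2])          # weights on the simplex: sum to 1
pred = boosted_predict(q, stumps, np.array([2.0, -0.5]))
```

Restricting q to the probability simplex, as the next slide does, fixes the scale of the vote without changing its sign.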
Version Space of Boosted Classifiers
restrict q to the probability simplex

Q = { q | Σ_{j=1}^{N} q_j = 1, q_j ≥ 0 }

q correctly classifies x if

sgn[ q^T h(x) ] = y  ⇐⇒  y q^T h(x) > 0

For a labeled set of examples Lt

version space = space of q's that classify Lt correctly

Qt = { q ∈ Q | y_i h(x_i)^T q ≥ 0, ∀ i ∈ Lt }

Think of a polyhedron in N dimensions:

[Figure: polyhedron Qt cut by constraints y1 h(x1), y2 h(x2), y3 h(x3)]
18 / 79
Iterations of Active Learning
As example x_t is labeled at time t, the labeled set Lt grows:

∅ = L0 ⊂ L1 ⊂ L2 ⊂ . . . ⊂ Lt

Version space shrinks:

Q = Q0 ⊃ Q1 ⊃ Q2 ⊃ . . . ⊃ Qt

Labeled examples become constraints:

y_t q^T h(x_t) ≥ 0

[Figure: polyhedra Q0 ⊃ Q1 ⊃ Q2 ⊃ . . . ⊃ Qt, each cut by a new constraint y_t h(x_t)]
How to pick xt to maximally reduce version space?
19 / 79
Active Boosted Learning: ActBoost
Generalized binary search in the space of boosted classifiers
Label x_t to best bisect the version space → geometric reduction
y_t is unknown, but once revealed → about half of the version space is eliminated!

[Figure: hyperplane h(x_t) splits Qt into Qt+ and Qt−; requesting label y_t keeps Qt+ if y_t = +1, Qt− if y_t = −1]

ActBoost Strategy: label the x_t with the smallest volume difference

min_{x ∈ Ut} | Vol Qt+(x) − Vol Qt−(x) |
20 / 79
Approximate ActBoost Strategy
Approximate volume by uniform random samples from Qt (hit-and-run algorithm)

[Figure: samples q1, . . . , qD from Qt split by the hyperplane h(x)]

Label the x with the greatest disagreement among the samples

min_{x ∈ Ut} | Vol Qt+(x) − Vol Qt−(x) |  ≈  min_{x ∈ Ut} | Σ_{d=1}^{D} 1[h(x)^T q_d > 0] − Σ_{d=1}^{D} 1[h(x)^T q_d ≤ 0] |

pick x to equalize the number of q's in the red and blue sections
21 / 79
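The approximate strategy can be sketched as follows; simple rejection sampling from the simplex stands in for the hit-and-run sampler used in the talk, and the toy matrix `H` of weak-learner outputs is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_version_space(H, y, labeled, D=200, tries=20000):
    """Draw weight vectors q from the probability simplex that classify
    every labeled example correctly (rejection sampling stands in for the
    hit-and-run sampler)."""
    N = H.shape[1]
    samples = []
    for _ in range(tries):
        q = rng.dirichlet(np.ones(N))           # uniform on the simplex
        if all(y[i] * (H[i] @ q) > 0 for i in labeled):
            samples.append(q)
            if len(samples) == D:
                break
    return np.array(samples)

def actboost_query(H, qs, unlabeled):
    """Pick the unlabeled example whose hyperplane h(x) most evenly splits
    the sampled q's: minimize |#{q: h(x)^T q > 0} - #{q: h(x)^T q <= 0}|."""
    def imbalance(i):
        pos = int(np.sum(qs @ H[i] > 0))
        return abs(2 * pos - len(qs))
    return min(unlabeled, key=imbalance)

# toy problem: 2 weak learners, rows of H are h(x_i) for 3 unlabeled points
H = np.array([[1.0, 1.0],    # classified +1 by every q on the simplex
              [1.0, -1.0],   # sign depends on q: bisects the version space
              [-1.0, -1.0]]) # classified -1 by every q
qs = sample_version_space(H, y=np.array([1, 1, 1]), labeled=[], D=200)
choice = actboost_query(H, qs, unlabeled=[0, 1, 2])
```

The query picks the middle point, whose hyperplane splits the sampled q's roughly in half, while the other two leave all samples on one side.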
Summary of Convergence Results
After labeling t examples, what can we say?

We prove the following:

Volume Convergence

reduce volume to an ε fraction in t = log(1/ε) / log(1/λ) steps

logarithmic speed-up: log(1/ε) vs. 1/ε

λ is a computable constant, a property of x1, . . . , xM and h1, . . . , hN

reducing volume ⇏ reducing error!

Error Convergence of Sparse Strategy

need structure: search the p-sparse version space instead

limit to p non-zero q1, . . . , qN ∈ Q

label t ≥ ( log (N choose p) + log(1/ε) ) / log(1/λ) examples and achieve the error of a classifier trained on the full data

combinatorially hard: search (N choose p) subspaces!

ActBoost is a convex surrogate of the sparse strategy
22 / 79
Experiments
Compare to Query By Boosting (QBB)
margin-based method [Abe and Mamitsuka, 1998] in the boosting framework
labels examples that are ambiguous w.r.t. the current boosted classifier
requires initialization

ActBoost

use AdaBoost as the supervised learner to evaluate performance
stumps (thresholds on dimensions of x) as weak learners

[Diagram: ActBoost + Expert → AdaBoost → classifier]
23 / 79
Unbiased Initialization Experiments
Remove initialization bias by randomly resampling the initial labeled set
simulate a well behaved initialization (2D synthetic datasets)
[Figure: Accuracy (%) vs. Training Samples (#) for Random, ActBoost, ActBoost(sp), and QBB on (a) banana and (b) box datasets]

ActBoost is on par with QBB

ActBoost(sp) simulates the intractable sparse strategy (prior knowledge of the sparse support)
24 / 79
Biased Initialization Experiments
Simulate adversarial conditions:
data lives in clusters; initialize by labeling points only from the first two clusters

[Figure: three clusters with initial labels in clusters 1 and 2; Accuracy (%) vs. Training Samples (#) for ActBoost and QBB]
Goal: quickly discover (start labeling) 3rd cluster
25 / 79
Biased Initialization Experiments
Datasets consisting of three clusters, initialize from first two only
Dermatology: predict skin disease from physiology features
Soy: soybean disease from seed attributes
Iris: flower type from leaf measurements

[Figure: Accuracy (%) vs. Training Samples (#) for ActBoost and QBB on each dataset]

ActBoost quickly discovers unknown clusters
QBB does not explore the full space
26 / 79
Summary of ActBoost Work
Algorithm:
novel active learning algorithm in the boosting framework
labels examples to approximately halve the feasible set of boosted classifiers

Convergence Results:

characterize volume convergence in terms of properties of the weak learners and the unlabeled training data
logarithmic error convergence for a sparse strategy

Experiments:

performs on par with margin-based methods when initialization is unbiased
not sensitive to initialization bias, unlike margin-based methods

Publication: Active Boosted Learning, AISTATS, 2011
27 / 79
Sequential Sensor Selection: reducing test-time costs
28 / 79
Overview
Sequential Reject Classifiers: learning sequential decisions to reduce acquisition cost
motivation
importance of future uncertainty in learning decisions
two-stage example, novel empirical risk approach
multiple stages
experimental results
29 / 79
Sequential Reject Classifier
[Diagram: stages f1 → f2 → . . . → fK; each stage classifies or rejects to the next, moving from a cheap/fast sensor to a slow/costly sensor]
K stage decision system:
Stage k can use sensor k for a cost ck
Measurements can be high dimensional
Order of stages/sensors is given
Decision at each stage, fk(x1, x2, . . . , xk) ∈ {classify, reject}:

classify using measurements x1, x2, . . . , xk, or
request (reject to) the next sensor

Learn decisions, F = {f1, f2, . . . , fK}, to trade off error vs. cost

min_F E_{x,y} [ error(F, x, y) + α cost(F, x) ]
30 / 79
Example
Sensors of Increasing Resolutions
classify handwritten digit images
f1( ) f2( ) f3( ) f4( )?
low resolution(cheap)
high resolution(expensive)
Do we need all sensors for every decision?
31 / 79
Difficult Decision
f1( ) f2( ) f3( ) f4( )
classify
?
8
high acquisition cost: need full resolution to make a decision
32 / 79
Easy Decision
f1( ) f2( ) f3( ) f4( )
classify
?
1
small acquisition cost: full resolution is unnecessary
33 / 79
How to reduce sensor cost?
Sensor 1 is cheap, Sensor 2 is expensive
[Figure: decision boundaries using Sensor 1 alone (non-adaptive) vs. Sensors 1 and 2 (centralized)]
Centralized strategy:
use both sensors
high cost, low error
Non-adaptive strategy:
only use sensor 1
low cost, high error
34 / 79
A better strategy: be adaptive
Only request 2nd sensor on difficult examples
[Figure: stage 1 decision (classify or reject on Sensor 1) and stage 2 decision using both sensors]
35 / 79
How does it compare?
Same error rate as centralized for half the cost
[Figure: Error Rate vs. Average Cost per Sample; the adaptive strategy matches the centralized error at roughly half the cost, while the non-adaptive strategy has low cost but high error]
36 / 79
Deciding to reject
How to decide whether to use the next sensor?

[Diagram: f1 on a cheap/fast sensor classifies or rejects to f2 on an expensive/slow sensor]

Risk of a decision:

min [ current uncertainty (classify) , α × cost + future uncertainty (reject to next stage) ]

(uncertainty is in correct classification)

Does the acquisition cost justify the reduction in uncertainty?
37 / 79
Deciding to reject
Risk = min [ current uncertainty (classify) , α × cost + future uncertainty (reject to next stage) ]

Difficulty: the sensor output is not known since it has not been acquired

How to determine future uncertainty?

Must base the decision on collected measurements!

[Figure: scatter of 1st sensor vs. 2nd sensor]
38 / 79
Myopic Approach
It is not clear how to determine the uncertainty of the future:

min [ current uncertainty (classify) , α × cost + future uncertainty (reject to next stage) ]

Ignore the future, and only use current uncertainty to make a decision:

min [ current uncertainty (classify) , α × cost (reject to next stage) ]

Reduces to:

decision = { classify, uncertainty < threshold ; reject, uncertainty ≥ threshold }
39 / 79
Myopic In Discriminative Setting
Train a classifier h(x) at each stage

Classifier uncertainty ≈ distance to the decision boundary (margin)

Small distance → high uncertainty
Large distance → low uncertainty

[Figure: margin band around the boundary of h(x); points inside the threshold band are rejected to the next stage]

Related work: [Liu et al., 2008]
40 / 79
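The margin rule above can be sketched in a few lines; the threshold value below is illustrative:

```python
def myopic_decision(score, threshold):
    """Myopic reject rule at one stage: |score| is the margin (distance to
    the decision boundary) of the stage classifier h(x); a small margin
    means high uncertainty, so the example is rejected to the next sensor."""
    if abs(score) >= threshold:
        return "classify", (1 if score > 0 else -1)
    return "reject", None

# confident positive, confident negative, and ambiguous examples
confident_pos = myopic_decision(2.3, threshold=0.5)
confident_neg = myopic_decision(-1.1, threshold=0.5)
ambiguous = myopic_decision(0.2, threshold=0.5)
```

Sweeping the threshold traces out the error-vs-budget operating points shown on the following slides.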
Example 1
Data:
[Figure: 2-D data in the Sensor 1 / Sensor 2 plane]
41 / 79
Example 1
1st Stage Classifier: only utilizes Sensor 1
[Figure: 1st-stage classifier boundary in the Sensor 1 / Sensor 2 plane]
41 / 79
Example 1
2nd Stage Classifier: utilizes Sensors 1 and 2
[Figure: 2nd-stage classifier boundary using both sensors]
41 / 79
Example 1
Myopic Reject Classifier
[Figure: stage 1 decision regions (classify vs. reject) and the stage 2 decision]
42 / 79
Example 1
Myopic Reject Classifier
Requests sensor 2 where sensor 1 is ambiguous

Current uncertainty seems to be a good criterion for rejection

[Figure: reject region (request 2nd sensor) in the Sensor 1 / Sensor 2 plane]
43 / 79
Example 1: Error vs Budget
sweep threshold to generate different operating points
[Figure: error vs. budget for the myopic and optimal strategies; the curves nearly coincide]
Good performance overall
44 / 79
Example 2
[Figure: 2-D data in the Sensor 1 / Sensor 2 plane]
45 / 79
Example 2
1st Stage Classifier: only utilizes Sensor 1
[Figure: 1st-stage classifier boundary in the Sensor 1 / Sensor 2 plane]
45 / 79
Example 2
2nd Stage Classifier: utilizes Sensors 1 and 2
[Figure: 2nd-stage classifier boundary using both sensors]
45 / 79
Example 2
Region 1
[Figure: Region 1 highlighted in the Sensor 1 / Sensor 2 plane]
separable only with sensor 2
45 / 79
Example 2
Region 2
[Figure: Region 2 highlighted in the Sensor 1 / Sensor 2 plane]
neither sensor helps
45 / 79
Example 2
Myopic Reject Decision
Sensor 1 uncertainty is equally distributed between regions 1 and 2
Uniformly rejects in both regions

[Figure: myopic reject region spans both regions in the Sensor 1 / Sensor 2 plane]
46 / 79
Example 2
Myopic Reject Decision
Current uncertainty is equally distributed between regions 1 and 2
Without future uncertainty cannot tell where sensor 2 is useful
[Figure: error vs. budget; myopic is far from optimal, and its reject region covers both regions]
47 / 79
Myopic
[Figure: error vs. budget on the two examples; myopic works on Example 1 but fails on Example 2]
48 / 79
Future Uncertainty is Important
Need to incorporate future uncertainty in the decision
min [ current uncertainty (classify) , α × cost + future uncertainty (reject to next stage) ]
49 / 79
Generative & Parametric Methods
Known model: partially observable Markov decision process (POMDP)
Posterior Model: P( label | sensor measurements )
Likelihood Model: P( sensor k | sensor j )
Method 1: Learn models and solve POMDP
hard to learn models,
cannot solve POMDP in general case
Previous Work: [Ji and Carin, 2007, Kapoor and Horvitz, 2009, Zubek and Dietterich, 2002]
Method 2: Greedily maximize expected utility of a sensor
One-step look-ahead approximation to the POMDP; unclear how to choose the utility
Correlation across sensors: hard to learn the likelihood (e.g. sensor output = image)
Previous Work: [Kanani and Melville, 2008, Koller and Gao, 2011]
50 / 79
Our Approach
Avoid estimating probability models
Directly learn decision at each stage from training data
Empirical Risk Minimization (ERM): incorporates the uncertainty of the future in the current decision
51 / 79
Two Stage System
[Diagram: f1 on a cheap/fast sensor classifies or rejects to f2 on an expensive/slow sensor]
52 / 79
Parametrization of Reject Region
We developed several parametrizations of fk(x):

Binary Setting:

reject as the disagreement region of two binary decisions
learn f_k^p, f_k^n

fk(x) = { f_k^p(x), if f_k^p(x) = f_k^n(x) ; reject, if f_k^p(x) ≠ f_k^n(x) }

Multi-class Setting:

stage classifiers, dk(x) ∈ {1, . . . , C}, given/pretrained on x1, . . . , xk
access to σ_dk(x), confidence of classification (e.g. margin)
only learn a binary decision gk(x)

fk(x) = { dk(x), if σ_dk(x) > gk(x) ; reject, if σ_dk(x) ≤ gk(x) }

We focus on a simplified parametrization for illustration
53 / 79
Stage Classifiers
[Diagram: f1 on a cheap/fast sensor classifies or rejects to f2 on an expensive/slow sensor]
Fix classifiers at each stage:
d1(x) is standard classifier trained on sensor 1
d2(x) is standard classifier trained on sensor 1 & 2
d1(x) := d1(x1), d2(x) := d2(x1, x2)
54 / 79
Decompose Reject Decision
[Diagram: f1 on a cheap/fast sensor classifies or rejects to f2 on an expensive/slow sensor]
Decompose classification and rejection decisions:
g(x) is reject / not reject decision
f1(x) =
{d1(x), g(x) = not reject
reject, else
55 / 79
Risk Based Approach
[Diagram: f1 on a cheap/fast sensor classifies or rejects to f2 on an expensive/slow sensor]
Risks of Each Stage:
Current: Rcu(x) = 1[d1 misclassifies x]
Future: Rfu(x) = 1[d2 misclassifies x] + α× sensor 2 cost
Stage 1 reject decision g(x):
g(x) =
{classify at 1, Rcu(x) < Rfu(x)
reject to 2nd sensor, Rcu(x) ≥ Rfu(x)
Difficulty: Rcu,Rfu require ground truth y and Rfu requires sensor 2
56 / 79
Empirical Risk Minimization
Use training data with full measurement,
(x1, y1), (x2, y2), . . . , (xN , yN)
And system risk for a point x and decision g(x)
R(g , x, y) =
{Rcu(x, y), g(x) = not reject
Rfu(x, y), g(x) = reject
Minimize empirical risk,
min_g E_{x,y} [ R(g, x, y) ]  ≈  min_{g ∈ G} (1/N) Σ_{i=1}^{N} R(g, x_i, y_i)
57 / 79
Back to Example 2
[Figure: 2-D data in the Sensor 1 / Sensor 2 plane]
58 / 79
Example 2
[Figure: the learned reject region concentrates on region 1, where sensor 2 helps]
59 / 79
Example 2
Smaller error for the same cost
[Figure: myopic (error = 19%) vs. ours (error = 14.8%) reject regions at the same cost]
60 / 79
Example 2
Incorporating future uncertainty in the current decision improves performance

[Figure: error vs. budget for myopic, ours, and optimal; ours tracks the optimal curve]
61 / 79
Learning to Reject
How to learn reject decision g(x) (green region)?
Reduce reject option to learning a binary decision
Define a weighted supervised learning problem:
risk difference induces pseudo labels on training data,
pseudo label of xi =
{reject , Rcu(xi ) > Rfu(xi )
not reject, Rcu(xi ) ≤ Rfu(xi )
importance weights, risk difference = penalty for misclassifying
weight of xi = |Rcu(xi )− Rfu(xi )|
62 / 79
Learning to Reject
Risks induce pseudo-labels

[Diagram: pseudo-labeled data → learn reject decision]
pseudo label of xi =
{reject , Rcu(xi ) > Rfu(xi )
not reject, Rcu(xi ) ≤ Rfu(xi )
weight of xi = |Rcu(xi )− Rfu(xi )|
63 / 79
Reduction to supervised learning
Empirical risk minimization simplifies to weighted supervised learning:
arg min_{g ∈ G} (1/N) Σ_{i=1}^{N} R(g, x_i, y_i)  =  arg min_{g ∈ G} Σ_{i=1}^{N} 1[g(x_i) ≠ pseudo label of x_i] × weight of x_i
64 / 79
Reduction to supervised learning
min_{g ∈ G} Σ_{i=1}^{N} 1[g(x_i) ≠ pseudo label of x_i] × weight of x_i
Can be solved with existing supervised learning tools:
Choose:
surrogate loss C (z) ≥ 1[z>0] (e.g. logistic)
classifier family G (e.g. linear)
general framework, not tied to a specific learning algorithm
min_{g ∈ G} Σ_{i=1}^{N} C[g(x_i) × pseudo label of x_i] × weight of x_i
65 / 79
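The reduction above can be sketched numerically: the indicator risks Rcu and Rfu induce a pseudo label and an importance weight per training point, which any weighted classifier can then consume. The toy error patterns, α, and cost below are illustrative:

```python
import numpy as np

def reject_pseudo_labels(err1, err2, alpha, cost2):
    """Reduce learning the reject decision g(x) to weighted binary
    classification: compare the current risk (stage-1 error) against the
    future risk (stage-2 error plus acquisition cost); the sign of the
    difference gives the pseudo label and its magnitude the weight."""
    r_cu = err1.astype(float)                  # 1[d1 misclassifies x_i]
    r_fu = err2.astype(float) + alpha * cost2  # 1[d2 misclassifies] + a*c2
    pseudo = np.where(r_cu > r_fu, 1, -1)      # +1 = reject, -1 = not reject
    weight = np.abs(r_cu - r_fu)               # penalty for misclassifying
    return pseudo, weight

# four training points: (stage-1 wrong?, stage-2 wrong?)
err1 = np.array([1, 1, 0, 0])
err2 = np.array([0, 1, 0, 1])
pseudo, w = reject_pseudo_labels(err1, err2, alpha=0.5, cost2=1.0)
```

Only the first point is worth rejecting (sensor 2 fixes the mistake for less than its cost); the last point gets the largest weight because rejecting it would be both wrong and expensive.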
Generalize to Multiple Stages
Measurement x = [x1, . . . , xK ] and true label y
[Diagram: stages f1 → f2 → . . . → fK from cheap/fast to slow/costly sensors]

Learn decisions at each stage, F = {f1, f2, . . . , fK}
fk(x) =
{dk(x), gk(x) = not reject
reject, else
dk(x) is a standard classifier, pretrained on sensors 1, . . . , k
66 / 79
Stage-wise Decomposition
System Risk: R(F , x, y) = Error(F (x), y) + αCost(F , x)
Stage-wise recursion:
R(F, x_i, y_i) = R1(x_i, y_i, f1)

Rk(x_i, y_i, fk) = { δ_i^{k+1}, reject to next stage ; 1, error & not reject ; 0, correct & not reject }

Cost-to-go: δ_i^{k+1} = α c_{k+1} + R_{k+1}(x_i, y_i, f_{k+1})

the risk of future stage decisions f_{k+1}, . . . , fK if rejected
67 / 79
Stage-wise Decomposition
Rk(x_i, y_i, fk) = { δ_i^{k+1}, reject to next stage ; 1, error & not reject ; 0, correct & not reject }

Key Observation: given the past, f1, . . . , fk−1, and the future, fk+1, . . . , fK

Find the current decision, fk, from the single-stage risk Rk

Equivalent to a two-stage problem

Rcu(x_i) = 1[dk misclassifies x_i]
Rfu(x_i) = δ_i^{k+1}
68 / 79
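The stage-wise recursion can be sketched directly; `reject`, `err`, and `costs` below are illustrative inputs, with `reject[k][i]` playing the role of the decision f_k on example i:

```python
def stage_risk(k, i, reject, err, costs, alpha, K):
    """Per-example stage risk R_k from the recursion above: rejecting at
    stage k incurs the cost-to-go delta_{k+1} = alpha*c_{k+1} + R_{k+1};
    otherwise the stage's 0/1 classification error is paid."""
    if k < K - 1 and reject[k][i]:
        return alpha * costs[k + 1] + stage_risk(k + 1, i, reject, err,
                                                 costs, alpha, K)
    return float(err[k][i])

# two examples, three stages; reject[k][i] and err[k][i] are illustrative
reject = [[True, False], [True, False], [False, False]]
err    = [[1, 1], [1, 0], [0, 0]]   # 1 = stage classifier d_k is wrong
costs  = [1.0, 1.0, 1.0]            # sensor costs c_1 .. c_3
risk_rejected   = stage_risk(0, 0, reject, err, costs, alpha=0.2, K=3)
risk_classified = stage_risk(0, 1, reject, err, costs, alpha=0.2, K=3)
```

The first example is rejected twice and then classified correctly (risk 2 × 0.2 = 0.4); the second is classified, wrongly, at stage 1 (risk 1). Given fixed future decisions, minimizing this risk over f_k is exactly the two-stage problem stated above.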
Algorithm
For every training example xi :
cost-to-go, δ_i^{k+1}: empirical risk of the future stages
state, S_i^k: indicates if the example is still active at stage k

Algorithm: alternately minimize one stage at a time

For every stage k:

1: Learn decision fk:

min_{f ∈ F} Σ_{i=1}^{N} S_i^k Rk[f, x_i, y_i, δ_i^{k+1}]

2: Update the state, S_i^j, for future stages j > k
3: Update the cost-to-go, δ_i^j, for past stages j < k

repeat until a stopping criterion is satisfied
69 / 79
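The bookkeeping in steps 1–3 might look like this (a structural sketch only: `fit_stage` and `stage_outcome` are hypothetical callbacks standing in for the weighted learning subproblem and the learned stage's behaviour):

```python
import numpy as np

def train_multistage(fit_stage, stage_outcome, X, y, K, alpha, costs, sweeps=3):
    """Alternating minimization over stages. S[i, k] marks whether example i
    is still active at stage k; delta[i, k] is its cost-to-go from stage k."""
    N = len(y)
    F = [None] * K
    S = np.zeros((N, K), dtype=bool)
    S[:, 0] = True                      # every example enters stage 1
    delta = np.ones((N, K + 1))         # pessimistic initial cost-to-go
    for _ in range(sweeps):
        for k in range(K):
            # step 1: weighted single-stage problem on active examples
            F[k] = fit_stage(k, X, y, S[:, k], delta[:, k + 1])
            err_k, rejected = stage_outcome(F[k], X, y)
            # step 2: update states for future stages
            if k + 1 < K:
                S[:, k + 1] = S[:, k] & rejected
            # step 3: update the cost-to-go seen by earlier stages
            delta[:, k] = alpha * costs[k] + np.where(rejected, delta[:, k + 1], err_k)
    return F, S, delta

# toy callbacks: classify by the sign of the cheap feature, reject weak readings
fit_stage = lambda k, X, y, active, delta_next: None
def stage_outcome(f_k, X, y):
    err = (np.sign(X[:, 0]) != y).astype(float)
    return err, np.abs(X[:, 0]) < 0.5

X = np.array([[0.2], [2.0], [-0.3], [-1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
F, S, delta = train_multistage(fit_stage, stage_outcome, X, y, K=2,
                               alpha=0.1, costs=[1.0, 2.0])
```

Only the examples rejected by stage 1 (the two weak readings) stay active at stage 2, which is exactly the state update of step 2.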
Learning Stage Decision
Step 1:
f_k = arg min_{f ∈ F} ∑_{i=1}^{N} S_i^k R_k[f, x_i, y_i, δ_i^{k+1}]
simplifies to a two-stage setting:

R_cur(x_i) = 1[d_k misclassifies x_i]

R_fut(x_i) = δ_i^{k+1}
learn reject decision, gk(x), for every stage
supervised weighted classification problem
use any classification algorithm
70 / 79
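Concretely, the weighted classification problem for the reject decision g_k can be assembled from the current-stage errors and the cost-to-go (a hypothetical reduction; the names `reject_training_set`, `errors_cur`, and `delta_next` are mine): an example is pseudo-labeled 'reject' exactly when rejecting is cheaper than classifying now, weighted by how much the choice matters.

```python
import numpy as np

def reject_training_set(errors_cur, delta_next):
    """Pseudo-label +1 ('reject') when the cost-to-go is cheaper than the
    current stage's 0/1 loss; the weight is the gap between the two options."""
    pseudo = np.where(delta_next < errors_cur, 1.0, -1.0)
    weights = np.abs(errors_cur - delta_next)
    return pseudo, weights

errors_cur = np.array([1.0, 0.0, 1.0, 0.0])   # 1 where d_k misclassifies x_i
delta_next = np.array([0.3, 0.3, 0.9, 0.2])   # cost-to-go if rejected
pseudo, weights = reject_training_set(errors_cur, delta_next)
# feed (x_i, pseudo_i, weights_i) to any off-the-shelf weighted classifier
```

The result is a plain weighted binary problem, which is why any classification algorithm that accepts example weights can learn g_k.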
Example
Sensors of Varying Resolutions
classify handwritten digit images (mnist)
[Figure: four stages f1( ), f2( ), f3( ), f4( ), from low resolution (cheap) to high resolution (expensive)]

x — handwritten digit image; y ∈ {0, 1, . . . , 9} — label

Stage:   1     2     3       4
Sensor:  4x4   8x8   16x16   32x32
Cost:    1     2     3       4
Supervised Learner: logistic regression with linear classifiers
71 / 79
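The four "sensors" can be emulated by average-pooling the full-resolution digit (an assumed preprocessing step; the talk does not say how the lower resolutions were produced):

```python
import numpy as np

def make_resolution_stages(img32):
    """Downsample a 32x32 image to the 4x4, 8x8, and 16x16 'sensors' by
    average pooling; the cheapest sensor comes first."""
    def pool(img, f):
        h, w = img.shape
        # group f x f blocks and average within each block
        return img.reshape(h // f, f, w // f, f).mean(axis=(1, 3))
    return [pool(img32, 8), pool(img32, 4), pool(img32, 2), img32]

stages = make_resolution_stages(np.arange(1024, dtype=float).reshape(32, 32))
shapes = [s.shape for s in stages]   # cheapest to most expensive
```

Average pooling preserves the overall intensity, so each stage adds spatial detail rather than changing the image's gross statistics.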
Example
[Figure: digits 0, 1, and 8 rendered at each of the four sensor resolutions, Sensor 1 through Sensor 4]
Sensor selection depends on example
72 / 79
Handwritten Digit Dataset
Same performance as centralized (best) with much lower budget
[Figure: classification error vs. budget; curves for ours, myopic, and centralized]
73 / 79
More Experiments
Achieve target error rate with fraction of max budget
Dataset    Stages  Sensors                           Target Error  Myopic  Ours
synthetic  2                                         .147          52%     28%
pima       3       weight, age, blood tests          .245          41%     15%
threat     3       ir, pmmw, ammw                    .16           89%     71%
covertype  3       soils, wild. areas, elev, aspect  .285          79%     40%
letter     3       pixel counts, edge feat's         .25           81%     51%
mnist      4       res. levels                       .085          90%     52%
landsat    4       hyperspectral bands               .17           56%     31%
mam        2       CAD feat's, expert rating         .173          65%     25%
74 / 79
Summary of Theoretical Results
Train multi-stage decision system F (x)
How well does it perform on unseen data?
E_{x,y}[ 1[F(x) ≠ y] ] ?
We prove two types of test error bound:
Margin type bound (ACML 2012/ML 2013)
two stage, binary setting, for boosted classifiers
VC Dimension type bound (AISTATS 2013)
VC dimension = complexity of a classifier family

small VC dim = good generalization

complexity of our system scales as K log K times the most complex stage
75 / 79
Summary of Sequential Sensor Selection
Sequential Reject Classifiers
ordered sequence of stages/sensors
each stage either classifies or rejects to the next stage
multi-class and multi-stage
Novel Empirical Risk Minimization
Decompose into stage-wise empirical risk minimization
Reduce to a series of weighted supervised learning problems
Empirical Results
demonstrate on several datasets
achieve performance of a centralized system with less expensive sensors
Publications
Two Stage Decision System, IEEE SSP, 2012
Multi-Stage Classifier Design, ACML, 2012, and an extended version in Machine Learning, 2013
Supervised Sequential Classification Under Budget Constraints, AISTATS,2013
76 / 79
Future Directions
Dynamic Sensor Selection: learn a sequential decision system (a policy, F(x)) when the order of stages is no longer fixed
Difficulties
policy has to choose which sensor to acquire next, or to stop and classify
policy has to handle any subset of measurements
Promising direction: imitation learning
77 / 79
Future Directions
Main idea: instead of learning an optimal policy, learn to imitate a very good policy (an oracle)

the oracle can only be evaluated on the training set, since it requires ground truth

learn a parameterized policy function to match the oracle's decisions on the training set
can use this policy estimate on unseen examples
Imitation learning is well studied in the robotics community, where the oracle is a human operator.
Difficulty in the sensor-selection setting: designing or computing an oracle on the training data
78 / 79
Thank You!
advisors: Venkatesh Saligrama, David Castanon
rest of my committee: Prakash Ishwar, Ioannis Paschalidis
chair: Ayse Coskun
fellow BU students
79 / 79