Transcript of Machine Learning on a Budget (blogs.bu.edu/ktrap/files/2013/09/defense_talk.pdf)
Machine Learning on a Budget
Kirill Trapeznikov
September 5th, 2013
Advisors: Venkatesh Saligrama, David Castanon
Committee: Venkatesh Saligrama, David Castanon, Prakash Ishwar, Ioannis Paschalidis
Chair: Ayse Coskun
1 / 79
Supervised Learning
NatureP(x, y)
x y
example x = image, text article, . . . , collection of k sensor measurements
label y = scene category, article topic, . . . , target present / not present
2 / 79
Classifier
Goal is to find a classifier f(x) that minimizes the expected error:

min_f E_{x,y} [ 1[f(x) ≠ y] ]

And if P(x, y) is known, then f(x) is the maximum a posteriori rule:

f(x) = arg max_y P(y | x)
3 / 79
Empirical Risk Minimization
In most cases, P(x, y) is unknown and typically hard to estimate

Empirical Risk Minimization: collect training data (x1, y1), (x2, y2), . . . and approximate the expected risk

min_{f ∈ F} Σ_i 1[f(x_i) ≠ y_i]

F is a family of classifiers (e.g. linear separators)
[Diagram: Labeled Training Data → Supervised Learner → Classifier with small error f(x)]
4 / 79
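As an illustration of the ERM objective above, a minimal sketch (not from the talk): for the family of 1-D threshold classifiers, minimizing the empirical 0-1 risk reduces to a search over candidate thresholds.

```python
import numpy as np

def erm_threshold(x, y):
    """Empirical risk minimization over the family of 1-D threshold
    classifiers f(x) = sign(x - t): pick t minimizing training 0-1 loss."""
    candidates = np.concatenate(([x.min() - 1.0], np.sort(x)))
    best_t, best_err = None, np.inf
    for t in candidates:
        pred = np.where(x > t, 1, -1)
        err = np.mean(pred != y)          # empirical 0-1 risk on the sample
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

# toy separable data: negatives below 0, positives above
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([-1, -1, -1, 1, 1, 1])
t, err = erm_threshold(x, y)
```

With richer families F the search is no longer enumerable, which is why surrogate losses and optimization enter later in the talk.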
Learning Classifier
Most effort has been in learning a good classifier
how to parametrize the classifier family F
how to design and minimize a surrogate risk C(·) to replace 1[·]

min_{f ∈ F} Σ_i C(f(x_i), y_i)
Recently, the cost of learning has gained importance
Training Phase: labeling cost
Testing Phase: acquisition cost
5 / 79
Labeling Cost
To train a classifier we need: measurements x and labels y

often, a large collection of unlabeled data is available: the xi's

to obtain a label yi we need an expert, which is typically expensive
Examples
Medical Imaging
large amount of unlabeled data (tests, scans, etc.); to label, need a doctor/radiologist
Computer Vision/ Object Detection
unlabeled images/videos; requires user input to annotate objects in images
How to reduce labeling cost?
6 / 79
Reducing Labeling Cost
Unnecessary to label every example: many redundant, uninformative
[Diagram: Unlabeled Pool → Active Learner + Expert → Labeled Subset → Supervised Learner → Classifier with small error f(x)]
Active Learning: label a small fraction of the training data, X, to learn a good classifier

min_{L, |L| ≤ B} Σ_{i ∈ X} 1[f_L(x_i) ≠ y_i]

f_L(x): classifier trained on a labeled subset of examples L
labeling budget: label at most B examples
7 / 79
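The budgeted labeling loop above can be sketched generically; `query` and `fit` below are illustrative placeholders for a selection rule and a learner, not the talk's algorithm.

```python
import random

def active_learn(pool, oracle, budget, query, fit):
    """Generic pool-based active learning loop: repeatedly pick an
    unlabeled example, pay the expert (oracle) for its label, refit.
    `query` and `fit` are plug-in points for a selection rule and learner."""
    labeled = {}                        # subset L: index -> label
    unlabeled = set(range(len(pool)))
    classifier = None
    for _ in range(budget):             # label at most B examples
        i = query(unlabeled, pool, classifier)
        labeled[i] = oracle(pool[i])    # expensive expert call
        unlabeled.discard(i)
        classifier = fit(labeled, pool)
    return classifier, labeled

# toy instantiation: random queries, majority-vote "classifier"
pool = [-2.0, -1.0, 0.5, 1.5, 2.5]
oracle = lambda x: 1 if x > 0 else -1
query = lambda U, p, c: random.choice(sorted(U))
fit = lambda L, p: (lambda x: max(set(L.values()), key=list(L.values()).count))
clf, L = active_learn(pool, oracle, budget=3, query=query, fit=fit)
```

The rest of the talk is about choosing `query` well: random selection wastes budget on redundant examples.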
Test-time Costs
Given, a classifier, to make a decision
need to acquire sensor measurements: x = x1, x2, . . . , xK
with acquisition costs: c1, c2, . . . , cK
In many situations,
some sensors: less informative but fast/cheap
others: more informative but slow/expensive
cannot afford to use every sensor all the time
Applications
security screening: x-ray machine vs. manual inspection
medical diagnoses: physical exam vs. invasive procedure
computer vision: coarse features vs. high res
Not every decisions requires every sensor!
8 / 79
Reducing test-time costs
Sequential Sensor Selection
Sequentially select sensors to reduce average acquisition cost
Classify easy examples based on cheap sensors
Acquire expensive measurements only for difficult decisions
Learn a decision system F (x) from training data with full measurements:
min_F E_{x,y} [ error(F, x, y) + α cost(F, x) ]

F(x) controls when to stop and classify, or to request more sensor measurements

tune α to achieve a desired average acquisition budget
9 / 79
Active Learning: Reducing Training Cost
10 / 79
Overview
Active Boosted Learning: Active Learning algorithm in the boosting framework
version space active learning
ActBoost algorithm
theoretical convergence
experiments: comparison to other methods on several datasets
11 / 79
Active Learning Problem
[Diagram: Unlabeled Pool → Active Learner + Expert → Labeled Subset → Supervised Learner → Classifier with small error f(x)]
Label a small fraction of data to learn a good classifier
binary setting: labels y ∈ {+1, −1}
unlabeled pool of M examples: x1, x2, . . . , xM
12 / 79
Margin Based Active Learning
Margin-based AL: labels examples that are ambiguous w.r.t. the current classifier

[Schohn and Cohn, 2000, Balcan et al., 2007, Abe and Mamitsuka, 1998, Campbell et al., 2000]

[Diagram: loop of Update Model → Label Most Uncertain Sample → . . .]
Seems like a good strategy?
13 / 79
Sensitive to Initialization
Initialize from the first two clusters:
slow to converge (QBB method)
stuck in the first two clusters
[Figure: data in three clusters with initial labels from clusters 1 and 2; Accuracy (%) vs. Training Samples (#) for ActBoost and QBB]
(Our) Alternative Strategy: ActBoost, robust to initialization bias
Version Space Based approach
14 / 79
Version Space [Freund et al., 1997]
Version Space:
set of classifiers that agree with labeled data
Generalized Binary Search:
label examples to bisect version space ([Nowak, 2009])
Labeled 4 examples instead of 12
logarithmic reduction in # of labeled examples (for simple classifier families)
15 / 79
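For the 1-D threshold family on this slide, generalized binary search reduces to binary search over the sorted pool; a sketch (names are illustrative) that reproduces the "4 labels instead of 12" count:

```python
def gbs_threshold(xs, oracle):
    """Generalized binary search for a 1-D threshold classifier: each
    query bisects the set of consistent thresholds (the version space),
    so only O(log M) labels are needed instead of M."""
    xs = sorted(xs)
    lo, hi = 0, len(xs)            # boundary lies somewhere in xs[lo:hi]
    queries = 0
    while hi - lo > 0:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(xs[mid]) == 1:   # label +1: boundary at or left of mid
            hi = mid
        else:                      # label -1: boundary right of mid
            lo = mid + 1
    return lo, queries             # index of the first positive example

xs = list(range(12))                        # 12 unlabeled points
oracle = lambda x: 1 if x >= 8 else -1      # hidden true threshold
first_pos, n_labels = gbs_threshold(xs, oracle)
```

Here `n_labels` comes out to 4 for a pool of 12, matching the slide.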
Version Space
Issue:
Data is not separable with thresholds ⇒ version space is empty
But separable with intervals
Need to consider more complex classifiers: boosted classifiers
16 / 79
Boosting
Combine simple binary decisions to form a strong classifier
[Diagram: q1·h1(x) + q2·h2(x) + q3·h3(x) + q4·h4(x) → f(x)]

boosted classifier is parametrized by a weight vector q

q^T h(x) = Σ_{j=1}^{N} q_j h_j(x),  with weights q_j and weak learners h_j(x) ∈ {−1, +1}

assume a fixed set of N weak learners

x → [h1(x) h2(x) . . . hN(x)]^T = h(x)

weak learning assumption: version space is not empty [Freund and Schapire, 1996]
17 / 79
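The weighted vote above can be sketched directly; the stumps and weights below are illustrative, not from the talk:

```python
import numpy as np

def boosted_predict(q, stumps, x):
    """Boosted classifier sign(q^T h(x)) with weak learners h_j in {-1,+1};
    each weak learner here is a decision stump (threshold on one feature).
    The `or -1.0` maps a sign of exactly 0 to -1 to keep h in {-1,+1}."""
    h = np.array([np.sign(x[d] - t) or -1.0 for (d, t) in stumps])
    return 1 if q @ h > 0 else -1

# three illustrative stumps on a 2-D input: (dimension, threshold)
stumps = [(0, 0.0), (1, 0.0), (0, 1.0)]
q = np.array([0.5, 0.3, 0.2])          # weights on the simplex: sum to 1
pred = boosted_predict(q, stumps, np.array([2.0, -0.5]))
```

Restricting q to the probability simplex, as the next slide does, fixes the scale of the vote without changing its sign.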
Version Space of Boosted Classifiers
restrict q to the probability simplex

Q = { q | Σ_{j=1}^{N} q_j = 1, q_j ≥ 0 }

q correctly classifies x if

sgn[ q^T h(x) ] = y  ⇐⇒  y q^T h(x) > 0

For a labeled set of examples Lt

version space = space of q's that classify Lt correctly

Qt = { q ∈ Q | y_i h(x_i)^T q ≥ 0, ∀ i ∈ Lt }

Think of a polyhedron in N dimensions:

[Figure: polyhedron Qt cut by constraints y1 h(x1), y2 h(x2), y3 h(x3)]
18 / 79
Iterations of Active Learning
As example x_t is labeled at time t, the labeled set Lt grows:

∅ = L0 ⊂ L1 ⊂ L2 ⊂ . . . ⊂ Lt

Version space shrinks:

Q = Q0 ⊃ Q1 ⊃ Q2 ⊃ . . . ⊃ Qt

Labeled examples become constraints:

y_t q^T h(x_t) ≥ 0

[Figure: polyhedra Q0 ⊃ Q1 ⊃ Q2 ⊃ . . . ⊃ Qt, each cut by a new constraint y_t h(x_t)]
How to pick xt to maximally reduce version space?
19 / 79
Active Boosted Learning: ActBoost
Generalized binary search in the space of boosted classifiers
Label x_t to best bisect the version space → geometric reduction
y_t is unknown, but once revealed → about half of the version space is eliminated!

[Figure: hyperplane h(x_t) splits Qt into Qt+ and Qt−; requesting label y_t keeps Qt+ if y_t = +1, Qt− if y_t = −1]

ActBoost Strategy: label the x_t with the smallest volume difference

min_{x ∈ Ut} | Vol Qt+(x) − Vol Qt−(x) |
20 / 79
Approximate ActBoost Strategy
Approximate volume by uniform random samples from Qt (hit-and-run algorithm)

[Figure: samples q1, . . . , qD from Qt split by the hyperplane h(x)]

Label the x with the greatest disagreement among the samples

min_{x ∈ Ut} | Vol Qt+(x) − Vol Qt−(x) |  ≈  min_{x ∈ Ut} | Σ_{d=1}^{D} 1[h(x)^T q_d > 0] − Σ_{d=1}^{D} 1[h(x)^T q_d ≤ 0] |

pick x to equalize the number of q's in the red and blue sections
21 / 79
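The approximate strategy can be sketched as follows; simple rejection sampling from the simplex stands in for the hit-and-run sampler used in the talk, and the toy matrix `H` of weak-learner outputs is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_version_space(H, y, labeled, D=200, tries=20000):
    """Draw weight vectors q from the probability simplex that classify
    every labeled example correctly (rejection sampling stands in for the
    hit-and-run sampler)."""
    N = H.shape[1]
    samples = []
    for _ in range(tries):
        q = rng.dirichlet(np.ones(N))           # uniform on the simplex
        if all(y[i] * (H[i] @ q) > 0 for i in labeled):
            samples.append(q)
            if len(samples) == D:
                break
    return np.array(samples)

def actboost_query(H, qs, unlabeled):
    """Pick the unlabeled example whose hyperplane h(x) most evenly splits
    the sampled q's: minimize |#{q: h(x)^T q > 0} - #{q: h(x)^T q <= 0}|."""
    def imbalance(i):
        pos = int(np.sum(qs @ H[i] > 0))
        return abs(2 * pos - len(qs))
    return min(unlabeled, key=imbalance)

# toy problem: 2 weak learners, rows of H are h(x_i) for 3 unlabeled points
H = np.array([[1.0, 1.0],    # classified +1 by every q on the simplex
              [1.0, -1.0],   # sign depends on q: bisects the version space
              [-1.0, -1.0]]) # classified -1 by every q
qs = sample_version_space(H, y=np.array([1, 1, 1]), labeled=[], D=200)
choice = actboost_query(H, qs, unlabeled=[0, 1, 2])
```

The query picks the middle point, whose hyperplane splits the sampled q's roughly in half, while the other two leave all samples on one side.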
Summary of Convergence Results
After labeling t examples, what can we say?

We prove the following:

Volume Convergence

reduce volume to an ε fraction in t = log(1/ε) / log(1/λ) steps

logarithmic speed-up: log(1/ε) vs. 1/ε

λ is a computable constant, a property of x1, . . . , xM and h1, . . . , hN

reducing volume ⇏ reducing error!

Error Convergence of Sparse Strategy

need structure: search the p-sparse version space instead

limit to p non-zero q1, . . . , qN ∈ Q

label t ≥ ( log (N choose p) + log(1/ε) ) / log(1/λ) examples and achieve the error of a classifier trained on the full data

combinatorially hard: search (N choose p) subspaces!

ActBoost is a convex surrogate of the sparse strategy
22 / 79
Experiments
Compare to Query By Boosting (QBB)
margin-based method [Abe and Mamitsuka, 1998] in the boosting framework
labels examples that are ambiguous w.r.t. the current boosted classifier
requires initialization

ActBoost

use AdaBoost as the supervised learner to evaluate performance
stumps (thresholds on dimensions of x) as weak learners

[Diagram: ActBoost + Expert → AdaBoost → classifier]
23 / 79
Unbiased Initialization Experiments
Remove initialization bias by randomly resampling the initial labeled set
simulate a well behaved initialization (2D synthetic datasets)
[Figure: Accuracy (%) vs. Training Samples (#) for Random, ActBoost, ActBoost(sp), and QBB on (a) banana and (b) box datasets]

ActBoost is on par with QBB

ActBoost(sp) simulates the intractable sparse strategy (prior knowledge of the sparse support)
24 / 79
Biased Initialization Experiments
Simulate adversarial conditions:
data lives in clusters; initialize by labeling points only from the first two clusters

[Figure: three clusters with initial labels in clusters 1 and 2; Accuracy (%) vs. Training Samples (#) for ActBoost and QBB]
Goal: quickly discover (start labeling) 3rd cluster
25 / 79
Biased Initialization Experiments
Datasets consisting of three clusters, initialize from first two only
Dermatology: predict skin disease from physiology features
Soy: soybean disease from seed attributes
Iris: flower type from leaf measurements

[Figure: Accuracy (%) vs. Training Samples (#) for ActBoost and QBB on each dataset]

ActBoost quickly discovers unknown clusters
QBB does not explore the full space
26 / 79
Summary of ActBoost Work
Algorithm:
novel active learning algorithm in the boosting framework
labels examples to approximately halve the feasible set of boosted classifiers

Convergence Results:

characterize volume convergence in terms of properties of the weak learners and the unlabeled training data
logarithmic error convergence for a sparse strategy

Experiments:

performs on par with margin-based methods when initialization is unbiased
not sensitive to initialization bias, unlike margin-based methods

Publication: Active Boosted Learning, AISTATS, 2011
27 / 79
Sequential Sensor Selection: reducing test-time costs
28 / 79
Overview
Sequential Reject Classifiers: learning sequential decisions to reduce acquisition cost
motivation
importance of future uncertainty in learning decisions
two-stage example, novel empirical risk approach
multiple stages
experimental results
29 / 79
Sequential Reject Classifier
[Diagram: stages f1 → f2 → . . . → fK; each stage classifies or rejects to the next, moving from a cheap/fast sensor to a slow/costly sensor]
K stage decision system:
Stage k can use sensor k for a cost ck
Measurements can be high dimensional
Order of stages/sensors is given
Decision at each stage, fk(x1, x2, . . . , xk) ∈ {classify, reject}:

classify using measurements x1, x2, . . . , xk, or
request (reject to) the next sensor

Learn decisions, F = {f1, f2, . . . , fK}, to trade off error vs. cost

min_F E_{x,y} [ error(F, x, y) + α cost(F, x) ]
30 / 79
Example
Sensors of Increasing Resolutions
classify handwritten digit images
f1( ) f2( ) f3( ) f4( )?
low resolution(cheap)
high resolution(expensive)
Do we need all sensors for every decision?
31 / 79
Difficult Decision
f1( ) f2( ) f3( ) f4( )
classify
?
8
high acquisition cost: need full resolution to make a decision
32 / 79
Easy Decision
f1( ) f2( ) f3( ) f4( )
classify
?
1
small acquisition cost: full resolution is unnecessary
33 / 79
How to reduce sensor cost?
Sensor 1 is cheap, Sensor 2 is expensive
[Figure: decision boundaries using Sensor 1 alone (non-adaptive) vs. Sensors 1 and 2 (centralized)]
Centralized strategy:
use both sensors
high cost, low error
Non-adaptive strategy:
only use sensor 1
low cost, high error
34 / 79
A better strategy: be adaptive
Only request 2nd sensor on difficult examples
[Figure: stage 1 decision (classify or reject on Sensor 1) and stage 2 decision using both sensors]
35 / 79
How does it compare?
Same error rate as centralized for half the cost
[Figure: Error Rate vs. Average Cost per Sample; the adaptive strategy matches the centralized error at roughly half the cost, while the non-adaptive strategy has low cost but high error]
36 / 79
Deciding to reject
How to decide whether to use the next sensor?

[Diagram: f1 on a cheap/fast sensor classifies or rejects to f2 on an expensive/slow sensor]

Risk of a decision:

min [ current uncertainty (classify) , α × cost + future uncertainty (reject to next stage) ]

(uncertainty is in correct classification)

Does the acquisition cost justify the reduction in uncertainty?
37 / 79
Deciding to reject
Risk = min [ current uncertainty (classify) , α × cost + future uncertainty (reject to next stage) ]

Difficulty: the sensor output is not known since it has not been acquired

How to determine future uncertainty?

Must base the decision on collected measurements!

[Figure: scatter of 1st sensor vs. 2nd sensor]
38 / 79
Myopic Approach
It is not clear how to determine the uncertainty of the future:

min [ current uncertainty (classify) , α × cost + future uncertainty (reject to next stage) ]

Ignore the future, and only use current uncertainty to make a decision:

min [ current uncertainty (classify) , α × cost (reject to next stage) ]

Reduces to:

decision = { classify, uncertainty < threshold ; reject, uncertainty ≥ threshold }
39 / 79
Myopic In Discriminative Setting
Train a classifier h(x) at each stage

Classifier uncertainty ≈ distance to the decision boundary (margin)

Small distance → high uncertainty
Large distance → low uncertainty

[Figure: margin band around the boundary of h(x); points inside the threshold band are rejected to the next stage]

Related work: [Liu et al., 2008]
40 / 79
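The margin rule above can be sketched in a few lines; the threshold value below is illustrative:

```python
def myopic_decision(score, threshold):
    """Myopic reject rule at one stage: |score| is the margin (distance to
    the decision boundary) of the stage classifier h(x); a small margin
    means high uncertainty, so the example is rejected to the next sensor."""
    if abs(score) >= threshold:
        return "classify", (1 if score > 0 else -1)
    return "reject", None

# confident positive, confident negative, and ambiguous examples
confident_pos = myopic_decision(2.3, threshold=0.5)
confident_neg = myopic_decision(-1.1, threshold=0.5)
ambiguous = myopic_decision(0.2, threshold=0.5)
```

Sweeping the threshold traces out the error-vs-budget operating points shown on the following slides.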
Example 1
Data:
[Figure: 2-D data in the Sensor 1 / Sensor 2 plane]
41 / 79
Example 1
1st Stage Classifier: only utilizes Sensor 1
[Figure: 1st-stage classifier boundary in the Sensor 1 / Sensor 2 plane]
41 / 79
Example 1
2nd Stage Classifier: utilizes Sensors 1 and 2
[Figure: 2nd-stage classifier boundary using both sensors]
41 / 79
Example 1
Myopic Reject Classifier
[Figure: stage 1 decision regions (classify vs. reject) and the stage 2 decision]
42 / 79
Example 1
Myopic Reject Classifier
Requests sensor 2 where sensor 1 is ambiguous

Current uncertainty seems to be a good criterion for rejection

[Figure: reject region (request 2nd sensor) in the Sensor 1 / Sensor 2 plane]
43 / 79
Example 1: Error vs Budget
sweep threshold to generate different operating points
[Figure: error vs. budget for the myopic and optimal strategies; the curves nearly coincide]
Good performance overall
44 / 79
Example 2
[Figure: 2-D data in the Sensor 1 / Sensor 2 plane]
45 / 79
Example 2
1st Stage Classifier: only utilizes Sensor 1
[Figure: 1st-stage classifier boundary in the Sensor 1 / Sensor 2 plane]
45 / 79
Example 2
2nd Stage Classifier: utilizes Sensors 1 and 2
[Figure: 2nd-stage classifier boundary using both sensors]
45 / 79
Example 2
Region 1
[Figure: Region 1 highlighted in the Sensor 1 / Sensor 2 plane]
separable only with sensor 2
45 / 79
Example 2
Region 2
[Figure: Region 2 highlighted in the Sensor 1 / Sensor 2 plane]
neither sensor helps
45 / 79
Example 2
Myopic Reject Decision
Sensor 1 uncertainty is equally distributed between regions 1 and 2
Uniformly rejects in both regions

[Figure: myopic reject region spans both regions in the Sensor 1 / Sensor 2 plane]
46 / 79
Example 2
Myopic Reject Decision
Current uncertainty is equally distributed between regions 1 and 2
Without future uncertainty cannot tell where sensor 2 is useful
[Figure: error vs. budget; myopic is far from optimal, and its reject region covers both regions]
47 / 79
Myopic
[Figure: error vs. budget on the two examples; myopic works on Example 1 but fails on Example 2]
48 / 79
Future Uncertainty is Important
Need to incorporate future uncertainty in the decision
min [ current uncertainty (classify) , α × cost + future uncertainty (reject to next stage) ]
49 / 79
Generative & Parametric Methods
Known model: partially observable Markov decision process (POMDP)
Posterior Model: P( label | sensor measurements )
Likelihood Model: P( sensor k | sensor j )
Method 1: Learn models and solve POMDP
hard to learn models,
cannot solve POMDP in general case
Previous Work: [Ji and Carin, 2007, Kapoor and Horvitz, 2009, Zubek and Dietterich, 2002]
Method 2: Greedily maximize expected utility of a sensor
One-step look-ahead approximation to the POMDP; unclear how to choose the utility
Correlation across sensors: hard to learn the likelihood (e.g. sensor output = image)
Previous Work: [Kanani and Melville, 2008, Koller and Gao, 2011]
50 / 79
Our Approach
Avoid estimating probability models
Directly learn decision at each stage from training data
Empirical Risk Minimization (ERM): incorporates the uncertainty of the future in the current decision
51 / 79
Two Stage System
[Diagram: f1 on a cheap/fast sensor classifies or rejects to f2 on an expensive/slow sensor]
52 / 79
Parametrization of Reject Region
We developed several parametrizations of fk(x):

Binary Setting:

reject as the disagreement region of two binary decisions
learn f_k^p, f_k^n

fk(x) = { f_k^p(x), if f_k^p(x) = f_k^n(x) ; reject, if f_k^p(x) ≠ f_k^n(x) }

Multi-class Setting:

stage classifiers, dk(x) ∈ {1, . . . , C}, given/pretrained on x1, . . . , xk
access to σ_dk(x), confidence of classification (e.g. margin)
only learn a binary decision gk(x)

fk(x) = { dk(x), if σ_dk(x) > gk(x) ; reject, if σ_dk(x) ≤ gk(x) }

We focus on a simplified parametrization for illustration
53 / 79
Stage Classifiers
[Diagram: f1 on a cheap/fast sensor classifies or rejects to f2 on an expensive/slow sensor]
Fix classifiers at each stage:
d1(x) is standard classifier trained on sensor 1
d2(x) is standard classifier trained on sensor 1 & 2
d1(x) := d1(x1), d2(x) := d2(x1, x2)
54 / 79
Decompose Reject Decision
[Diagram: f1 on a cheap/fast sensor classifies or rejects to f2 on an expensive/slow sensor]
Decompose classification and rejection decisions:
g(x) is reject / not reject decision
f1(x) =
{d1(x), g(x) = not reject
reject, else
55 / 79
Risk Based Approach
[Diagram: f1 on a cheap/fast sensor classifies or rejects to f2 on an expensive/slow sensor]
Risks of Each Stage:
Current: Rcu(x) = 1[d1 misclassifies x]
Future: Rfu(x) = 1[d2 misclassifies x] + α× sensor 2 cost
Stage 1 reject decision g(x):
g(x) =
{classify at 1, Rcu(x) < Rfu(x)
reject to 2nd sensor, Rcu(x) ≥ Rfu(x)
Difficulty: Rcu,Rfu require ground truth y and Rfu requires sensor 2
56 / 79
Empirical Risk Minimization
Use training data with full measurement,
(x1, y1), (x2, y2), . . . , (xN , yN)
And system risk for a point x and decision g(x)
R(g , x, y) =
{Rcu(x, y), g(x) = not reject
Rfu(x, y), g(x) = reject
Minimize empirical risk,
min_g E_{x,y} [ R(g, x, y) ]  ≈  min_{g ∈ G} (1/N) Σ_{i=1}^{N} R(g, x_i, y_i)
57 / 79
Back to Example 2
[Figure: 2-D data in the Sensor 1 / Sensor 2 plane]
58 / 79
Example 2
[Figure: the learned reject region concentrates on region 1, where sensor 2 helps]
59 / 79
Example 2
Smaller error for the same cost
[Figure: myopic (error = 19%) vs. ours (error = 14.8%) reject regions at the same cost]
60 / 79
Example 2
Incorporating future uncertainty in the current decision improves performance

[Figure: error vs. budget for myopic, ours, and optimal; ours tracks the optimal curve]
61 / 79
Learning to Reject
How to learn reject decision g(x) (green region)?
Reduce reject option to learning a binary decision
Define a weighted supervised learning problem:
risk difference induces pseudo labels on training data,
pseudo label of xi =
{reject , Rcu(xi ) > Rfu(xi )
not reject, Rcu(xi ) ≤ Rfu(xi )
importance weights, risk difference = penalty for misclassifying
weight of xi = |Rcu(xi )− Rfu(xi )|
62 / 79
Learning to Reject
Risks induce pseudo-labels

[Diagram: pseudo-labeled data → learn reject decision]
pseudo label of xi =
{reject , Rcu(xi ) > Rfu(xi )
not reject, Rcu(xi ) ≤ Rfu(xi )
weight of xi = |Rcu(xi )− Rfu(xi )|
63 / 79
Reduction to supervised learning
Empirical risk minimization simplifies to weighted supervised learning:
arg min_{g ∈ G} (1/N) Σ_{i=1}^{N} R(g, x_i, y_i)  =  arg min_{g ∈ G} Σ_{i=1}^{N} 1[g(x_i) ≠ pseudo label of x_i] × weight of x_i
64 / 79
Reduction to supervised learning
min_{g ∈ G} Σ_{i=1}^{N} 1[g(x_i) ≠ pseudo label of x_i] × weight of x_i
Can be solved with existing supervised learning tools:
Choose:
surrogate loss C (z) ≥ 1[z>0] (e.g. logistic)
classifier family G (e.g. linear)
general framework, not tied to a specific learning algorithm
min_{g ∈ G} Σ_{i=1}^{N} C[g(x_i) × pseudo label of x_i] × weight of x_i
65 / 79
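The reduction above can be sketched numerically: the indicator risks Rcu and Rfu induce a pseudo label and an importance weight per training point, which any weighted classifier can then consume. The toy error patterns, α, and cost below are illustrative:

```python
import numpy as np

def reject_pseudo_labels(err1, err2, alpha, cost2):
    """Reduce learning the reject decision g(x) to weighted binary
    classification: compare the current risk (stage-1 error) against the
    future risk (stage-2 error plus acquisition cost); the sign of the
    difference gives the pseudo label and its magnitude the weight."""
    r_cu = err1.astype(float)                  # 1[d1 misclassifies x_i]
    r_fu = err2.astype(float) + alpha * cost2  # 1[d2 misclassifies] + a*c2
    pseudo = np.where(r_cu > r_fu, 1, -1)      # +1 = reject, -1 = not reject
    weight = np.abs(r_cu - r_fu)               # penalty for misclassifying
    return pseudo, weight

# four training points: (stage-1 wrong?, stage-2 wrong?)
err1 = np.array([1, 1, 0, 0])
err2 = np.array([0, 1, 0, 1])
pseudo, w = reject_pseudo_labels(err1, err2, alpha=0.5, cost2=1.0)
```

Only the first point is worth rejecting (sensor 2 fixes the mistake for less than its cost); the last point gets the largest weight because rejecting it would be both wrong and expensive.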
Generalize to Multiple Stages
Measurement x = [x1, . . . , xK ] and true label y
[Diagram: stages f1 → f2 → . . . → fK from cheap/fast to slow/costly sensors]

Learn decisions at each stage, F = {f1, f2, . . . , fK}
fk(x) =
{dk(x), gk(x) = not reject
reject, else
dk(x) is a standard classifier, pretrained on sensors 1, . . . , k
66 / 79
Stage-wise Decomposition
System Risk: R(F , x, y) = Error(F (x), y) + αCost(F , x)
Stage-wise recursion:
R(F, x_i, y_i) = R1(x_i, y_i, f1)

Rk(x_i, y_i, fk) = { δ_i^{k+1}, reject to next stage ; 1, error & not reject ; 0, correct & not reject }

Cost-to-go: δ_i^{k+1} = α c_{k+1} + R_{k+1}(x_i, y_i, f_{k+1})

the risk of future stage decisions f_{k+1}, . . . , fK if rejected
67 / 79
Stage-wise Decomposition
Rk(x_i, y_i, fk) = { δ_i^{k+1}, reject to next stage ; 1, error & not reject ; 0, correct & not reject }

Key Observation: given the past, f1, . . . , fk−1, and the future, fk+1, . . . , fK

Find the current decision, fk, from the single-stage risk Rk

Equivalent to a two-stage problem

Rcu(x_i) = 1[dk misclassifies x_i]
Rfu(x_i) = δ_i^{k+1}
68 / 79
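The stage-wise recursion can be sketched directly; `reject`, `err`, and `costs` below are illustrative inputs, with `reject[k][i]` playing the role of the decision f_k on example i:

```python
def stage_risk(k, i, reject, err, costs, alpha, K):
    """Per-example stage risk R_k from the recursion above: rejecting at
    stage k incurs the cost-to-go delta_{k+1} = alpha*c_{k+1} + R_{k+1};
    otherwise the stage's 0/1 classification error is paid."""
    if k < K - 1 and reject[k][i]:
        return alpha * costs[k + 1] + stage_risk(k + 1, i, reject, err,
                                                 costs, alpha, K)
    return float(err[k][i])

# two examples, three stages; reject[k][i] and err[k][i] are illustrative
reject = [[True, False], [True, False], [False, False]]
err    = [[1, 1], [1, 0], [0, 0]]   # 1 = stage classifier d_k is wrong
costs  = [1.0, 1.0, 1.0]            # sensor costs c_1 .. c_3
risk_rejected   = stage_risk(0, 0, reject, err, costs, alpha=0.2, K=3)
risk_classified = stage_risk(0, 1, reject, err, costs, alpha=0.2, K=3)
```

The first example is rejected twice and then classified correctly (risk 2 × 0.2 = 0.4); the second is classified, wrongly, at stage 1 (risk 1). Given fixed future decisions, minimizing this risk over f_k is exactly the two-stage problem stated above.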
Algorithm
For every training example xi :
cost-to-go, δ_i^{k+1}: empirical risk of the future stages
state, S_i^k: indicates if the example is still active at stage k

Algorithm: alternately minimize one stage at a time

For every stage k:

1: Learn decision fk:

min_{f ∈ F} Σ_{i=1}^{N} S_i^k Rk[f, x_i, y_i, δ_i^{k+1}]

2: Update the state, S_i^j, for future stages j > k
3: Update the cost-to-go, δ_i^j, for past stages j < k

repeat until a stopping criterion is satisfied
69 / 79
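The bookkeeping in steps 1–3 might look like this (a structural sketch only: `fit_stage` and `stage_outcome` are hypothetical callbacks standing in for the weighted learning subproblem and the learned stage's behaviour):

```python
import numpy as np

def train_multistage(fit_stage, stage_outcome, X, y, K, alpha, costs, sweeps=3):
    """Alternating minimization over stages. S[i, k] marks whether example i
    is still active at stage k; delta[i, k] is its cost-to-go from stage k."""
    N = len(y)
    F = [None] * K
    S = np.zeros((N, K), dtype=bool)
    S[:, 0] = True                      # every example enters stage 1
    delta = np.ones((N, K + 1))         # pessimistic initial cost-to-go
    for _ in range(sweeps):
        for k in range(K):
            # step 1: weighted single-stage problem on active examples
            F[k] = fit_stage(k, X, y, S[:, k], delta[:, k + 1])
            err_k, rejected = stage_outcome(F[k], X, y)
            # step 2: update states for future stages
            if k + 1 < K:
                S[:, k + 1] = S[:, k] & rejected
            # step 3: update the cost-to-go seen by earlier stages
            delta[:, k] = alpha * costs[k] + np.where(rejected, delta[:, k + 1], err_k)
    return F, S, delta

# toy callbacks: classify by the sign of the cheap feature, reject weak readings
fit_stage = lambda k, X, y, active, delta_next: None
def stage_outcome(f_k, X, y):
    err = (np.sign(X[:, 0]) != y).astype(float)
    return err, np.abs(X[:, 0]) < 0.5

X = np.array([[0.2], [2.0], [-0.3], [-1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
F, S, delta = train_multistage(fit_stage, stage_outcome, X, y, K=2,
                               alpha=0.1, costs=[1.0, 2.0])
```

Only the examples rejected by stage 1 (the two weak readings) stay active at stage 2, which is exactly the state update of step 2.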
Learning Stage Decision
Step 1:
f_k = arg min_{f ∈ F} ∑_{i=1}^{N} S_i^k R_k[f, x_i, y_i, δ_i^{k+1}]
simplifies to a two-stage setting:

R_cur(x_i) = 1[d_k misclassifies x_i]

R_fut(x_i) = δ_i^{k+1}
learn reject decision, gk(x), for every stage
supervised weighted classification problem
use any classification algorithm
70 / 79
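Concretely, the weighted classification problem for the reject decision g_k can be assembled from the current-stage errors and the cost-to-go (a hypothetical reduction; the names `reject_training_set`, `errors_cur`, and `delta_next` are mine): an example is pseudo-labeled 'reject' exactly when rejecting is cheaper than classifying now, weighted by how much the choice matters.

```python
import numpy as np

def reject_training_set(errors_cur, delta_next):
    """Pseudo-label +1 ('reject') when the cost-to-go is cheaper than the
    current stage's 0/1 loss; the weight is the gap between the two options."""
    pseudo = np.where(delta_next < errors_cur, 1.0, -1.0)
    weights = np.abs(errors_cur - delta_next)
    return pseudo, weights

errors_cur = np.array([1.0, 0.0, 1.0, 0.0])   # 1 where d_k misclassifies x_i
delta_next = np.array([0.3, 0.3, 0.9, 0.2])   # cost-to-go if rejected
pseudo, weights = reject_training_set(errors_cur, delta_next)
# feed (x_i, pseudo_i, weights_i) to any off-the-shelf weighted classifier
```

The result is a plain weighted binary problem, which is why any classification algorithm that accepts example weights can learn g_k.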
Example
Sensors of Varying Resolutions
classify handwritten digit images (mnist)
[Figure: four stages f1( ), f2( ), f3( ), f4( ), from low resolution (cheap) to high resolution (expensive)]

x — handwritten digit image; y ∈ {0, 1, . . . , 9} — label

Stage:   1     2     3       4
Sensor:  4x4   8x8   16x16   32x32
Cost:    1     2     3       4
Supervised Learner: logistic regression with linear classifiers
71 / 79
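The four "sensors" can be emulated by average-pooling the full-resolution digit (an assumed preprocessing step; the talk does not say how the lower resolutions were produced):

```python
import numpy as np

def make_resolution_stages(img32):
    """Downsample a 32x32 image to the 4x4, 8x8, and 16x16 'sensors' by
    average pooling; the cheapest sensor comes first."""
    def pool(img, f):
        h, w = img.shape
        # group f x f blocks and average within each block
        return img.reshape(h // f, f, w // f, f).mean(axis=(1, 3))
    return [pool(img32, 8), pool(img32, 4), pool(img32, 2), img32]

stages = make_resolution_stages(np.arange(1024, dtype=float).reshape(32, 32))
shapes = [s.shape for s in stages]   # cheapest to most expensive
```

Average pooling preserves the overall intensity, so each stage adds spatial detail rather than changing the image's gross statistics.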
Example
[Figure: digits 0, 1, and 8 rendered at each of the four sensor resolutions, Sensor 1 through Sensor 4]
Sensor selection depends on example
72 / 79
Handwritten Digit Dataset
Same performance as centralized (best) with much lower budget
[Figure: classification error vs. budget; curves for ours, myopic, and centralized]
73 / 79
More Experiments
Achieve target error rate with fraction of max budget
Dataset    Stages  Sensors                           Target Error  Myopic  Ours
synthetic  2                                         .147          52%     28%
pima       3       weight, age, blood tests          .245          41%     15%
threat     3       ir, pmmw, ammw                    .16           89%     71%
covertype  3       soils, wild. areas, elev, aspect  .285          79%     40%
letter     3       pixel counts, edge feat's         .25           81%     51%
mnist      4       res. levels                       .085          90%     52%
landsat    4       hyperspectral bands               .17           56%     31%
mam        2       CAD feat's, expert rating         .173          65%     25%
74 / 79
Summary of Theoretical Results
Train multi-stage decision system F (x)
How well does it perform on unseen data?
E_{x,y}[ 1[F(x) ≠ y] ] ?
We prove two types of test error bound:
Margin type bound (ACML 2012/ML 2013)
two stage, binary setting, for boosted classifiers
VC Dimension type bound (AISTATS 2013)
VC dimension = complexity of a classifier family

small VC dim = good generalization

complexity of our system scales as K log K times the most complex stage
75 / 79
Summary of Sequential Sensor Selection
Sequential Reject Classifiers
ordered sequence of stages/sensors
each stage either classifies or rejects to the next stage
multi-class and multi-stage
Novel Empirical Risk Minimization
Decompose into stage-wise empirical risk minimization
Reduce to a series of weighted supervised learning problems
Empirical Results
demonstrate on several datasets
achieve performance of a centralized system with less expensive sensors
Publications
Two Stage Decision System, IEEE SSP, 2012
Multi-Stage Classifier Design, ACML, 2012, and an extended version in Machine Learning, 2013
Supervised Sequential Classification Under Budget Constraints, AISTATS,2013
76 / 79
Future Directions
Dynamic Sensor Selection: learn a sequential decision system (a policy, F(x)) when the order of stages is no longer fixed
Difficulties
policy has to choose which sensor to acquire next, or to stop and classify
policy has to handle any subset of measurements
Promising direction: imitation learning
77 / 79
Future Directions
Main idea: instead of learning an optimal policy, learn to imitate a very good policy (an oracle)

the oracle can only be evaluated on the training set, since it requires ground truth

learn a parameterized policy function to match the oracle's decisions on the training set
can use this policy estimate on unseen examples
Imitation learning is well studied in the robotics community, where the oracle is a human operator.
Difficulty in the sensor-selection setting: designing or computing an oracle on the training data
78 / 79
Thank You!
advisors: Venkatesh Saligrama, David Castanon
rest of my committee: Prakash Ishwar, Ioannis Paschalidis
chair: Ayse Coskun
fellow BU students
79 / 79