Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. ·...

Robust Artificial IntelligenceReading Seminar; Tsinghua University

Thomas G. Dietterich, Oregon State [email protected]

1

Lecture 2: Rejection

• Given:• Training data 𝑥𝑥1,𝑦𝑦1 , … , (𝑥𝑥𝑁𝑁,𝑦𝑦𝑁𝑁)• Target accuracy level 1 − 𝜖𝜖• Learn a classifier 𝑓𝑓 and a rejection rule 𝑟𝑟

• At run time• Given query 𝑥𝑥𝑞𝑞• If 𝑟𝑟 𝑥𝑥𝑞𝑞 < 0, REJECT• Else classify 𝑓𝑓 𝑥𝑥𝑞𝑞

2

Papers for Today

• Cortes, C., DeSalvo, G., & Mohri, M. (2016). Learning with rejection. Lecture Notes in Artificial Intelligence, 9925 LNAI, 67–82. http://doi.org/10.1007/978-3-319-46379-7_5

• Shafer, G., & Vovk, V. (2008). A tutorial on conformal prediction. Journal of Machine Learning Research, 9, 371–421. Retrieved from http://arxiv.org/abs/0706.3188

• Papadopoulos, H. (2008). Inductive Conformal Prediction: Theory and Application to Neural Networks. Book chapter. https://www.researchgate.net/publication/221787122_Inductive_Conformal_Prediction_Theory_and_Application_to_Neural_Networks

3

http://doi.org/10.1007/978-3-319-46379-7_5

http://arxiv.org/abs/0706.3188

https://www.researchgate.net/publication/221787122_Inductive_Conformal_Prediction_Theory_and_Application_to_Neural_Networks

Basic Theory

• Suppose 𝑓𝑓∗ 𝑥𝑥,𝑦𝑦 = 𝑃𝑃(𝑦𝑦|𝑥𝑥) is the optimal probabilistic classifier

• Best prediction is �𝑦𝑦 = arg max𝑦𝑦

𝑓𝑓∗(𝑥𝑥,𝑦𝑦)

• Then the optimal rejection rule is to REJECT if 𝑓𝑓∗ 𝑥𝑥, �𝑦𝑦 < 1 − 𝜖𝜖

• (Chow 1970)

4

1.0

0.5

0.0

𝜖𝜖

Non-Optimal Case

• If 𝑓𝑓 is not optimal, we can still determine a threshold with performance guarantees

• Let 𝑓𝑓 𝑥𝑥𝑖𝑖 , �𝑦𝑦𝑖𝑖 , 𝐼𝐼 �𝑦𝑦𝑖𝑖 = 𝑦𝑦𝑖𝑖 be a set of calibration data points 𝑖𝑖 = 1, … ,𝑁𝑁

• Sort them by �̂�𝑝 �𝑦𝑦𝑖𝑖 𝑥𝑥𝑖𝑖 = 𝑓𝑓 𝑥𝑥𝑖𝑖 , �𝑦𝑦𝑖𝑖• Choose the smallest threshold 𝜏𝜏 such

that if 𝑓𝑓 𝑥𝑥𝑖𝑖 , �𝑦𝑦𝑖𝑖 > 𝜏𝜏 then the fraction of correct predictions is 1 − 𝜖𝜖

5

�̂�𝑝( �𝑦𝑦, 𝑥𝑥)

�𝑃𝑃( �𝑦𝑦, 𝑥𝑥)

1 − 𝜖𝜖

𝜏𝜏

Finite Sample (PAC) Guarantee

• 𝑃𝑃 𝑛𝑛 sup𝑥𝑥

�𝐹𝐹𝑛𝑛 𝑥𝑥 − 𝐹𝐹 𝑥𝑥 > 𝜆𝜆 ≤ 2 exp −2𝜆𝜆2 Massart (1990)

• Set 𝑥𝑥 ≔ 𝜏𝜏

• 𝑃𝑃 𝜂𝜂 > 𝜆𝜆𝑛𝑛

= 2 exp −2𝜆𝜆2

• Set 𝜆𝜆𝑛𝑛

= 𝜂𝜂 and 𝛿𝛿 = 2 exp −2𝜆𝜆2 ; solve for 𝑛𝑛

• 𝜆𝜆 = 𝜂𝜂 𝑛𝑛

• 𝛿𝛿 = 2 exp −2𝜂𝜂2𝑛𝑛

• log 𝛿𝛿2

= −𝜂𝜂2𝑛𝑛

• 𝑛𝑛 = 1𝜂𝜂2

log 2𝛿𝛿

• If 𝑛𝑛 > 1𝜂𝜂2

log 2𝛿𝛿

then w.p. 1 − 𝛿𝛿, the true error rate will be bounded by 1 − 𝜖𝜖 + 𝜂𝜂

6

�̂�𝑝( �𝑦𝑦, 𝑥𝑥)

�𝑃𝑃( �𝑦𝑦, 𝑥𝑥)

1 − 𝜖𝜖

𝜏𝜏

Related Work

• Geifman & El Yaniv (2017)• Develop confidence scores

based on either the softmax(“SR”) or Monte Carlo dropout (“MC-dropout”)

• Binary search for the threshold• Use an exact Binomial

confidence interval instead of Massart’s bound

• Union bound over the binary search queries

7

Cost-Sensitive Rejection

• Cost Matrix• Optimal Classifier

• For �̂�𝑝 𝑦𝑦 = 1 𝑥𝑥 ≥ 𝜏𝜏1, predict 1• For �̂�𝑝 𝑦𝑦 = 2 𝑥𝑥 ≥ 𝜏𝜏2, predict 2• Else REJECT

• Search all pairs 𝜏𝜏1, 𝜏𝜏2 to minimize expected cost

• Pietraszek (2005) provides a fast algorithm based on (a) isotonic regression and (b) computing the slopes on the ROC curve corresponding to 𝜏𝜏1 and 𝜏𝜏2

8

Actions

Probabilities Predict 1 Predict 2 Reject

𝑃𝑃(𝑦𝑦 = 1|𝑥𝑥) 0 𝑐𝑐12 𝑐𝑐1𝑟𝑟𝑃𝑃(𝑦𝑦 = 2|𝑥𝑥) 𝑐𝑐21 0 𝑐𝑐2𝑟𝑟

Support Vector Machines

• Key insight: Maximize the Margin around the Decision Boundary• Three strategies:

• Fit standard SVM, then calibrate or threshold• Fit a double-hinge loss (DHL) SVM that maximizes margin around the rejection

thresholds• Fit two separate functions (classifier and rejection function) that maximize

margins around the rejection thresholds

9

Reminder: Standard SVM• Linear classifier that maximizes the margin

between positive and negative examples

• 𝑦𝑦 ∈ {+1,−1} so 𝑦𝑦𝑖𝑖𝑓𝑓 𝑥𝑥𝑖𝑖 > 0 means 𝑥𝑥𝑖𝑖 is classified correctly

min𝑤𝑤,𝑏𝑏,𝜉𝜉

𝐶𝐶 𝑤𝑤 2 + ∑𝑖𝑖 𝜉𝜉𝑖𝑖 subject to

𝑦𝑦𝑖𝑖 𝑤𝑤⊤𝑥𝑥𝑖𝑖 + 𝑏𝑏 + 𝜉𝜉𝑖𝑖 ≥ 1 ∀𝑖𝑖

• The 𝜉𝜉𝑖𝑖 are “slack variables” the measure how “wrong” we are classifying 𝑥𝑥𝑖𝑖

• 𝐶𝐶 is the regularization parameter10

Double Hinge Loss (Herbei & Wegkamp, 2006; Bartlett & Wegkamp, 2008)

• Assume cost of rejection is 𝑐𝑐• Reject if 𝑓𝑓 𝑥𝑥 < 𝛿𝛿• Loss function ℓ𝑐𝑐 𝑦𝑦𝑓𝑓 𝑥𝑥

• if 𝑦𝑦𝑓𝑓 𝑥𝑥 < −𝛿𝛿 ℓ𝑐𝑐 = 1• if 𝑦𝑦𝑓𝑓 𝑥𝑥 ∈ −𝛿𝛿, +𝛿𝛿 ℓ𝑐𝑐 = 𝑐𝑐• if 𝑦𝑦𝑓𝑓 𝑥𝑥 > +𝛿𝛿 ℓ𝑐𝑐 = 0

• Convex upper bound 𝜙𝜙𝑐𝑐• if 𝑦𝑦𝑓𝑓 𝑥𝑥 < 0 𝜙𝜙𝑐𝑐 = 1 − 𝑎𝑎𝑦𝑦𝑓𝑓(𝑥𝑥)• if 𝑦𝑦𝑓𝑓 𝑥𝑥 ∈ 0,1 𝜙𝜙𝑐𝑐 = 1 − 𝑦𝑦𝑓𝑓(𝑥𝑥)• if 𝑦𝑦𝑓𝑓 𝑥𝑥 > 1 𝜙𝜙𝑐𝑐 = 0

11

−𝛿𝛿 +𝛿𝛿0

0𝑐𝑐

1

𝑦𝑦𝑓𝑓(𝑥𝑥)1

𝜙𝜙𝑐𝑐

ℓ𝑐𝑐

DHL Optimization Problem

• min𝑤𝑤,𝑏𝑏,𝜉𝜉,𝛾𝛾

∑𝑖𝑖 𝜉𝜉𝑖𝑖 + 1−2𝑐𝑐𝑐𝑐

𝛾𝛾𝑖𝑖 subject to

• 𝑦𝑦𝑖𝑖 𝑤𝑤⊤𝑥𝑥𝑖𝑖 + 𝑏𝑏 + 𝜉𝜉𝑖𝑖 ≥ 1• 𝑦𝑦𝑖𝑖 𝑤𝑤⊤𝑥𝑥𝑖𝑖 + 𝑏𝑏 + 𝛾𝛾𝑖𝑖 ≥ 0• 𝜉𝜉𝑖𝑖 ≥ 0; 𝛾𝛾𝑖𝑖 ≥ 0• ∑𝑖𝑖,𝑗𝑗 𝛼𝛼𝑖𝑖𝛼𝛼𝑗𝑗(𝑥𝑥𝑖𝑖 ⋅ 𝑥𝑥𝑗𝑗) ≤ 𝑟𝑟2 (regularization constraint)

• This is a quadratically-constrained quadratic program, so it can be solved, but it is not easy

12

Non-Optimal Case (2)• Defining the rejection function in

terms of ℎ assumes that the probability of error is monotonically related to �̂�𝑝(𝑦𝑦|𝑥𝑥).

• We saw last lecture that this is not necessarily true

• We can try to fix ℎ or we can learn a more complex 𝑟𝑟 function

• Unlikely to be a problem for flexible models, but could be a problem for linear and SVM methods

13

Method 3: Learn (𝑓𝑓, 𝑟𝑟) pair(Cortes, DeSalvo & Mohri, 2016)

• Two-dimensional loss function• If 𝑟𝑟 𝑥𝑥 ≥ 0 and 𝑦𝑦𝑓𝑓 𝑥𝑥 ≥ 0 loss = 0• If 𝑟𝑟 𝑥𝑥 ≥ 0 and 𝑦𝑦𝑓𝑓 𝑥𝑥 < 0 loss = 1• If 𝑟𝑟 𝑥𝑥 < 0 loss = 𝑐𝑐

14

𝑦𝑦𝑓𝑓(𝑥𝑥)

0

0

𝑟𝑟(𝑥𝑥)𝑐𝑐 𝑐𝑐

01

Convex Upper Bound

• 𝐿𝐿𝑀𝑀𝑀𝑀 𝑟𝑟, 𝑓𝑓, 𝑥𝑥,𝑦𝑦 = max 1 + 12𝑟𝑟 𝑥𝑥 − 𝑦𝑦𝑓𝑓 𝑥𝑥 , 𝑐𝑐 1 − 1

1−2𝑐𝑐𝑟𝑟 𝑥𝑥 , 0

15

CHR Optimization Problem

• 𝑓𝑓 𝑥𝑥 = 𝑤𝑤⊤𝑥𝑥 + 𝑏𝑏• 𝑟𝑟 𝑥𝑥 = 𝑢𝑢⊤𝑥𝑥 + 𝑏𝑏𝑏

• min𝑤𝑤,𝑢𝑢,𝜉𝜉

𝜆𝜆2𝑤𝑤 2 + 𝜆𝜆′

2𝑢𝑢 2 + ∑𝑖𝑖 𝜉𝜉𝑖𝑖 subject to

• 𝑐𝑐 1 − 11−2𝑐𝑐

𝑢𝑢⊤𝑥𝑥𝑖𝑖 + 𝑏𝑏′ ≤ 𝜉𝜉𝑖𝑖

• 12𝑢𝑢⊤𝑥𝑥𝑖𝑖 + 𝑏𝑏′ − 𝑦𝑦𝑖𝑖𝑤𝑤⊤𝑥𝑥𝑖𝑖 − 𝑏𝑏 ≤ 1 + 𝜉𝜉𝑖𝑖

• 0 ≤ 𝜉𝜉𝑖𝑖• By minimizing 𝜉𝜉𝑖𝑖 we are minimize the max of these three terms

16

Experimental Tests

17

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

liver bank skin pima australian cod haber

Tota

l Los

s

Data Set

Total Loss (Misclassification + Rejection)

DHL

CHR

Error on Non-Rejected Points

18

-0.1

0

0.1

0.2

0.3

0.4

0.5

liver bank skin pima australian cod haber

Misc

lass

ifica

tion

Loss

Data Set

Non-Rejected Error

DHL

CHR

Note: DHL modified to reject the same number of points as CHR

Reject Option Conclusions

• Basic thresholding is easy and gives PAC guarantees• 2-class thresholding with differential costs is easy• 𝐾𝐾-class thresholding?• Thresholding SVMs is interesting

• Focus the “margin” on the reject boundaries• Learning a 𝑓𝑓, 𝑟𝑟 pair is better than optimizing the double hinge loss

• Open question: How to jointly train DNNs and a rejection function

19

Conformal Prediction (online version)

• Given:• Training data 𝑧𝑧1, … , 𝑧𝑧𝑛𝑛−1 where 𝑧𝑧𝑖𝑖 = 𝑥𝑥𝑖𝑖 , 𝑦𝑦𝑖𝑖• Classifier 𝑓𝑓 trained on the training data• Nonconformity measure 𝐴𝐴𝑛𝑛:𝒵𝒵𝑛𝑛−1 × 𝒵𝒵 ↦ ℝ• Query 𝑥𝑥𝑛𝑛• Accuracy level 𝛿𝛿

• Find:• A set 𝐶𝐶 𝑥𝑥𝑞𝑞 ⊆ 1, … ,𝐾𝐾 such that 𝑦𝑦𝑞𝑞 ∈ 𝐶𝐶 𝑥𝑥𝑞𝑞 with probability 1 − 𝛿𝛿

• Method:• For each 𝑘𝑘, let 𝑧𝑧𝑛𝑛𝑘𝑘 = (𝑥𝑥𝑞𝑞, 𝑘𝑘)

• ∀𝑖𝑖 𝛼𝛼𝑖𝑖𝑘𝑘 ≔ 𝐴𝐴 𝑧𝑧1, … , 𝑧𝑧𝑖𝑖−1, 𝑧𝑧𝑖𝑖+1, … , 𝑧𝑧𝑛𝑛 , 𝑧𝑧𝑖𝑖 “how different is 𝑧𝑧𝑖𝑖 from the rest of the 𝑧𝑧 values?• Let 𝑝𝑝𝑘𝑘 = fraction of 𝛼𝛼1𝑘𝑘 , … ,𝛼𝛼𝑛𝑛𝑘𝑘 that are ≥ 𝛼𝛼𝑛𝑛𝑘𝑘

• 𝐶𝐶 𝑥𝑥𝑞𝑞 = 𝑘𝑘 𝑝𝑝𝑘𝑘 ≥ 𝛿𝛿• Output 𝐶𝐶 𝑥𝑥𝑞𝑞

20

Examples of Nonconformity Measures

• Conditional probability method:• Train a probabilistic classifier 𝑓𝑓 on 𝑧𝑧1, … , 𝑧𝑧𝑖𝑖−1, 𝑧𝑧𝑖𝑖+1, … , 𝑧𝑧𝑛𝑛• Then compute 𝐴𝐴 𝑧𝑧1, … , 𝑧𝑧𝑖𝑖−1, 𝑧𝑧𝑖𝑖+1, … , 𝑧𝑧𝑛𝑛 , 𝑧𝑧𝑖𝑖 = −log 𝑓𝑓 𝑧𝑧𝑖𝑖

• Nearest neighbor nonconformity• 𝐴𝐴 𝐵𝐵, 𝑧𝑧 = distance to nearest 𝑧𝑧′∈𝐵𝐵 in same class

distance to nearest 𝑧𝑧′∈𝐵𝐵 in different class

21

Additional Information

• In addition to outputting 𝐶𝐶 𝑥𝑥𝑞𝑞 , we can output• �𝑦𝑦𝑞𝑞 = arg max

𝑘𝑘𝑝𝑝𝑘𝑘 (the best prediction)

• 𝑝𝑝𝑞𝑞 = max𝑘𝑘

𝑝𝑝𝑘𝑘 (the p-value of the best prediction)

• 1 − max𝑘𝑘;𝑘𝑘≠ �𝑦𝑦𝑞𝑞

𝑝𝑝𝑘𝑘 (the “confidence”. We have more confidence if the second-best

p-value is small)

22

Batch (“inductive”) Conformal Prediction

• Divide data into training and calibration• Train 𝑓𝑓 on the training data• Let 𝑧𝑧1, … , 𝑧𝑧𝑛𝑛 be the validation data• Let 𝛼𝛼1, … ,𝛼𝛼𝑛𝑛 be the non-conformity scores of the validation data

• 𝛼𝛼𝑖𝑖 ≔ 𝐴𝐴 𝑧𝑧1, … , 𝑧𝑧𝑖𝑖−1, 𝑧𝑧𝑖𝑖+1, … , 𝑧𝑧𝑛𝑛 , 𝑧𝑧𝑖𝑖• Given query 𝑥𝑥𝑞𝑞

• For 𝑘𝑘 = 1, … ,𝐾𝐾• Let 𝑧𝑧𝑞𝑞𝑘𝑘 = (𝑥𝑥𝑞𝑞 ,𝑘𝑘)• let 𝛼𝛼𝑞𝑞𝑘𝑘 = 𝐴𝐴 𝑧𝑧1, … , 𝑧𝑧𝑛𝑛 , 𝑧𝑧𝑞𝑞𝑘𝑘• Let 𝑝𝑝𝑘𝑘 = fraction of 𝛼𝛼1, … ,𝛼𝛼𝑛𝑛,𝛼𝛼𝑞𝑞𝑘𝑘 that are ≥ 𝛼𝛼𝑞𝑞𝑘𝑘

• 𝐶𝐶 𝑥𝑥𝑞𝑞 = 𝑘𝑘 𝑝𝑝𝑘𝑘 ≥ 𝛿𝛿

• Key difference: 𝑧𝑧𝑞𝑞𝑘𝑘 does not affect the other non-conformity scores

23

Almost Equivalent to Learning a Threshold

• Let 𝜏𝜏 = the 𝛿𝛿 quantile of 𝛼𝛼1, … ,𝛼𝛼𝑛𝑛• Given query 𝑥𝑥𝑞𝑞

• For 𝑘𝑘 = 1, … ,𝐾𝐾• Let 𝑧𝑧𝑞𝑞𝑘𝑘 = (𝑥𝑥𝑞𝑞, 𝑘𝑘)• let 𝛼𝛼𝑞𝑞𝑘𝑘 = 𝐴𝐴 𝑧𝑧1, … , 𝑧𝑧𝑛𝑛 , 𝑧𝑧𝑞𝑞𝑘𝑘

• 𝐶𝐶 𝑥𝑥𝑞𝑞 = 𝑘𝑘 𝛼𝛼𝑞𝑞𝑘𝑘 ≥ 𝜏𝜏

• Additional difference: 𝜏𝜏 is computed without considering 𝛼𝛼𝑞𝑞𝑘𝑘

• IF 𝑛𝑛 is large enough, this does not matter

24

Experimental Results

• Non-conformity Measures• Resubstitution:

• Train 𝑓𝑓 on all data• Let �𝑦𝑦𝑖𝑖 = 𝑓𝑓 𝑥𝑥𝑖𝑖• 𝐴𝐴 𝑧𝑧1, … , 𝑧𝑧𝑖𝑖−1, 𝑧𝑧𝑖𝑖 , … , 𝑧𝑧𝑁𝑁 , 𝑥𝑥𝑖𝑖 ,𝑘𝑘 = 𝐼𝐼 �𝑦𝑦𝑖𝑖 = 𝑘𝑘

• Leave One Out:• Train 𝑓𝑓 on 𝑧𝑧1, … , 𝑧𝑧𝑖𝑖−1, 𝑧𝑧𝑖𝑖 , … , 𝑧𝑧𝑁𝑁• Let �𝑦𝑦𝑖𝑖 = 𝑓𝑓 𝑥𝑥𝑖𝑖• 𝐴𝐴 𝑧𝑧1, … , 𝑧𝑧𝑖𝑖−1, 𝑧𝑧𝑖𝑖 , … , 𝑧𝑧𝑁𝑁 , 𝑥𝑥𝑖𝑖 ,𝑘𝑘 = 𝐼𝐼 �𝑦𝑦𝑖𝑖 = 𝑘𝑘

25

Satellite

26

Resubstitution

Leave one out

Error:𝑦𝑦𝑖𝑖 ∉ 𝐶𝐶 𝑥𝑥𝑖𝑖

Shuttle

27

Resubstitution

Leave one out

Segmentation

28

Resubstitution

Leave one out

Pendigits + Random Forest(Dietterich, unpublished)

• Train a random forest on half of UCI Training Set• Use the predicted class probability 𝑃𝑃 𝑦𝑦 = 𝑘𝑘 𝑥𝑥 as the

(non)conformity score• Compute 𝜏𝜏 values using other half of Training Set• Compute 𝐶𝐶 on the Test Set

29

Cumulative Distribution Function for Class “9”

30

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

ecdf(p.val[, 10])

x

Fn(x

)

Empirical CDF

𝐶𝐶(𝑥𝑥, "9")

Pendigits Results

• All 𝜏𝜏 values were 0 (for 𝜖𝜖 = 0.001)• Probability 𝑦𝑦 ∈ Γ 𝑥𝑥 = 0.9997• Abstention rate = 0.72• Sizes of prediction sets Γ:

31

0.00

0.28

0.18

0.120.11 0.11

0.070.06

0.040.02

0.000

0.05

0.1

0.15

0.2

0.25

0.3

0 1 2 3 4 5 6 7 8 9 10

Frac

tion

of T

est P

oint

sSize of the Confidence Set

Simple Thresholding of max𝑘𝑘

�̂�𝑝 𝑦𝑦 = 𝑘𝑘 𝑥𝑥

32

0

20

40

60

80

100

120

140

160

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Num

ber o

f Val

idat

ion

Erro

rs

Threshold on p.max

Zoomed In: 𝜏𝜏 = 0.87 for 𝛿𝛿 = 0.05

33

0

1

2

3

4

5

6

7

8

0.7 0.75 0.8 0.85 0.9 0.95 1

Num

ber o

f Val

idat

ion

Erro

rs

Threshold on p.max

Test Set Results

• Probability of correct classification: 0.9987• Rejection rate: 33.4%

• [Conformal prediction was 72%]

34

Another Use Case: Lexicon Reduction

• US Postal Service Address Reading Task• (Madhvanath, Kleinberg, Govindaraju, 1997)

• Two classifiers• Method 1: Fast but not always accurate• Method 2: Slower but more accurate

• Can only afford to run on 1/3 of envelopes• Faster if it can be focused on a subset of the classes

• Apply conformal prediction using Method 1• Eliminate as many classes as possible• Apply Method 2 if 𝐶𝐶 𝑥𝑥 > 1

35

Summary

• Lecture 1: Calibration• Lecture 2: Rejection

• Method 1: Threshold 𝑓𝑓 with single or multiple thresholds• Multiple thresholds requires a change in the SVM methodology

• Method 2: Learn a separate rejection function and threshold it• Method 3: Conformal: Use thresholding to construct a confidence set

• Reject if 𝐶𝐶 𝑥𝑥𝑞𝑞 ≠ 1• Can perform “lexicon reduction”

• In my experience, Conformal Prediction is not good for Rejection, but more experiments are needed

36

Next Lecture

• All of these methods assume a closed world• What happens when queries may belong to “alien” classes not

observed during training?• Papers:

• Bendale, A., & Boult, T. (2016). Towards Open Set Deep Networks. In CVPR 2016 (pp. 1563–1572). http://doi.org/10.1109/CVPR.2016.173

• Liu, S., Garrepalli, R., Dietterich, T. G., Fern, A., & Hendrycks, D. (2018). Open Category Detection with PAC Guarantees. Proceedings of the 35th International Conference on Machine Learning, PMLR, 80, 3169–3178. http://proceedings.mlr.press/v80/liu18e.html

37

http://doi.org/10.1109/CVPR.2016.173

http://proceedings.mlr.press/v80/liu18e.html

Citations• Bartlett, P., Wegkamp, M. (2008). Classification with a reject option using a hinge loss. JMLR, 2008.• Chow (1970). On optimum recognition error and reject trade-off. IEEE Transactions on Computing.• Cortes, C., DeSalvo, G., & Mohri, M. (2016). Learning with rejection. Lecture Notes in Artificial Intelligence,

9925 LNAI, 67–82. http://doi.org/10.1007/978-3-319-46379-7_5• Geifman, Y., El Yaniv, R. (2017) Selective Classification for Deep Neural Networks. NIPS 2017. arXiv:

1705.08500• Herbei, R., Wegkamp, M. (2005). Classification with reject option. Canadian Journal of Statistics. • Madhvanath, S., Kleinberg, E., Govindaraju, V. (1997). Empirical Design of a Multi-Classifier

Thresholding/Control Strategy for Recognition of Handwritten Street Names. International Journal of Pattern Recognition and Artificial Intelligence, 11(6):933-946. https://doi.org/10.1142/S0218001497000421

• Papadopoulos, H. (2008). Inductive Conformal Prediction: Theory and Application to Neural Networks. Book chapter. https://www.researchgate.net/publication/221787122_Inductive_Conformal_Prediction_Theory_and_Application_to_Neural_Networks

• Pietraszek, T. (2005). Optimizing abstaining classifiers using ROC analysis. In ICML, 2005• Shafer, G., & Vovk, V. (2008). A tutorial on conformal prediction. Journal of Machine Learning Research, 9,

371–421. Retrieved from http://arxiv.org/abs/0706.3188

38

http://doi.org/10.1007/978-3-319-46379-7_5

https://doi.org/10.1142/S0218001497000421

https://www.researchgate.net/publication/221787122_Inductive_Conformal_Prediction_Theory_and_Application_to_Neural_Networks

http://arxiv.org/abs/0706.3188

Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. ·...

Documents

Transcript of Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. ·...