Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. ·...

38
Robust Artificial Intelligence Reading Seminar; Tsinghua University Thomas G. Dietterich, Oregon State University [email protected] 1

Transcript of Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. ·...

Page 1: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Robust Artificial IntelligenceReading Seminar; Tsinghua University

Thomas G. Dietterich, Oregon State [email protected]

1

Page 2: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Lecture 2: Rejection

• Given:• Training data 𝑥𝑥1,𝑦𝑦1 , … , (𝑥𝑥𝑁𝑁,𝑦𝑦𝑁𝑁)• Target accuracy level 1 − 𝜖𝜖• Learn a classifier 𝑓𝑓 and a rejection rule 𝑟𝑟

• At run time• Given query 𝑥𝑥𝑞𝑞• If 𝑟𝑟 𝑥𝑥𝑞𝑞 < 0, REJECT• Else classify 𝑓𝑓 𝑥𝑥𝑞𝑞

2

Page 3: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Papers for Today

• Cortes, C., DeSalvo, G., & Mohri, M. (2016). Learning with rejection. Lecture Notes in Artificial Intelligence, 9925 LNAI, 67–82. http://doi.org/10.1007/978-3-319-46379-7_5

• Shafer, G., & Vovk, V. (2008). A tutorial on conformal prediction. Journal of Machine Learning Research, 9, 371–421. Retrieved from http://arxiv.org/abs/0706.3188

• Papadopoulos, H. (2008). Inductive Conformal Prediction: Theory and Application to Neural Networks. Book chapter. https://www.researchgate.net/publication/221787122_Inductive_Conformal_Prediction_Theory_and_Application_to_Neural_Networks

3

Page 4: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Basic Theory

• Suppose 𝑓𝑓∗ 𝑥𝑥,𝑦𝑦 = 𝑃𝑃(𝑦𝑦|𝑥𝑥) is the optimal probabilistic classifier

• Best prediction is �𝑦𝑦 = arg max𝑦𝑦

𝑓𝑓∗(𝑥𝑥,𝑦𝑦)

• Then the optimal rejection rule is to REJECT if 𝑓𝑓∗ 𝑥𝑥, �𝑦𝑦 < 1 − 𝜖𝜖

• (Chow 1970)

4

1.0

0.5

0.0

𝜖𝜖

Page 5: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Non-Optimal Case

• If 𝑓𝑓 is not optimal, we can still determine a threshold with performance guarantees

• Let 𝑓𝑓 𝑥𝑥𝑖𝑖 , �𝑦𝑦𝑖𝑖 , 𝐼𝐼 �𝑦𝑦𝑖𝑖 = 𝑦𝑦𝑖𝑖 be a set of calibration data points 𝑖𝑖 = 1, … ,𝑁𝑁

• Sort them by �̂�𝑝 �𝑦𝑦𝑖𝑖 𝑥𝑥𝑖𝑖 = 𝑓𝑓 𝑥𝑥𝑖𝑖 , �𝑦𝑦𝑖𝑖• Choose the smallest threshold 𝜏𝜏 such

that if 𝑓𝑓 𝑥𝑥𝑖𝑖 , �𝑦𝑦𝑖𝑖 > 𝜏𝜏 then the fraction of correct predictions is 1 − 𝜖𝜖

5

�̂�𝑝( �𝑦𝑦, 𝑥𝑥)

�𝑃𝑃( �𝑦𝑦, 𝑥𝑥)

1 − 𝜖𝜖

𝜏𝜏

Page 6: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Finite Sample (PAC) Guarantee

• 𝑃𝑃 𝑛𝑛 sup𝑥𝑥

�𝐹𝐹𝑛𝑛 𝑥𝑥 − 𝐹𝐹 𝑥𝑥 > 𝜆𝜆 ≤ 2 exp −2𝜆𝜆2 Massart (1990)

• Set 𝑥𝑥 ≔ 𝜏𝜏

• 𝑃𝑃 𝜂𝜂 > 𝜆𝜆𝑛𝑛

= 2 exp −2𝜆𝜆2

• Set 𝜆𝜆𝑛𝑛

= 𝜂𝜂 and 𝛿𝛿 = 2 exp −2𝜆𝜆2 ; solve for 𝑛𝑛

• 𝜆𝜆 = 𝜂𝜂 𝑛𝑛

• 𝛿𝛿 = 2 exp −2𝜂𝜂2𝑛𝑛

• log 𝛿𝛿2

= −𝜂𝜂2𝑛𝑛

• 𝑛𝑛 = 1𝜂𝜂2

log 2𝛿𝛿

• If 𝑛𝑛 > 1𝜂𝜂2

log 2𝛿𝛿

then w.p. 1 − 𝛿𝛿, the true error rate will be bounded by 1 − 𝜖𝜖 + 𝜂𝜂

6

�̂�𝑝( �𝑦𝑦, 𝑥𝑥)

�𝑃𝑃( �𝑦𝑦, 𝑥𝑥)

1 − 𝜖𝜖

𝜏𝜏

Page 7: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Related Work

• Geifman & El Yaniv (2017)• Develop confidence scores

based on either the softmax(“SR”) or Monte Carlo dropout (“MC-dropout”)

• Binary search for the threshold• Use an exact Binomial

confidence interval instead of Massart’s bound

• Union bound over the binary search queries

7

Page 8: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Cost-Sensitive Rejection

• Cost Matrix• Optimal Classifier

• For �̂�𝑝 𝑦𝑦 = 1 𝑥𝑥 ≥ 𝜏𝜏1, predict 1• For �̂�𝑝 𝑦𝑦 = 2 𝑥𝑥 ≥ 𝜏𝜏2, predict 2• Else REJECT

• Search all pairs 𝜏𝜏1, 𝜏𝜏2 to minimize expected cost

• Pietraszek (2005) provides a fast algorithm based on (a) isotonic regression and (b) computing the slopes on the ROC curve corresponding to 𝜏𝜏1 and 𝜏𝜏2

8

Actions

Probabilities Predict 1 Predict 2 Reject

𝑃𝑃(𝑦𝑦 = 1|𝑥𝑥) 0 𝑐𝑐12 𝑐𝑐1𝑟𝑟𝑃𝑃(𝑦𝑦 = 2|𝑥𝑥) 𝑐𝑐21 0 𝑐𝑐2𝑟𝑟

Page 9: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Support Vector Machines

• Key insight: Maximize the Margin around the Decision Boundary• Three strategies:

• Fit standard SVM, then calibrate or threshold• Fit a double-hinge loss (DHL) SVM that maximizes margin around the rejection

thresholds• Fit two separate functions (classifier and rejection function) that maximize

margins around the rejection thresholds

9

Page 10: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Reminder: Standard SVM• Linear classifier that maximizes the margin

between positive and negative examples

• 𝑦𝑦 ∈ {+1,−1} so 𝑦𝑦𝑖𝑖𝑓𝑓 𝑥𝑥𝑖𝑖 > 0 means 𝑥𝑥𝑖𝑖 is classified correctly

min𝑤𝑤,𝑏𝑏,𝜉𝜉

𝐶𝐶 𝑤𝑤 2 + ∑𝑖𝑖 𝜉𝜉𝑖𝑖 subject to

𝑦𝑦𝑖𝑖 𝑤𝑤⊤𝑥𝑥𝑖𝑖 + 𝑏𝑏 + 𝜉𝜉𝑖𝑖 ≥ 1 ∀𝑖𝑖

• The 𝜉𝜉𝑖𝑖 are “slack variables” the measure how “wrong” we are classifying 𝑥𝑥𝑖𝑖

• 𝐶𝐶 is the regularization parameter10

Page 11: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Double Hinge Loss (Herbei & Wegkamp, 2006; Bartlett & Wegkamp, 2008)

• Assume cost of rejection is 𝑐𝑐• Reject if 𝑓𝑓 𝑥𝑥 < 𝛿𝛿• Loss function ℓ𝑐𝑐 𝑦𝑦𝑓𝑓 𝑥𝑥

• if 𝑦𝑦𝑓𝑓 𝑥𝑥 < −𝛿𝛿 ℓ𝑐𝑐 = 1• if 𝑦𝑦𝑓𝑓 𝑥𝑥 ∈ −𝛿𝛿, +𝛿𝛿 ℓ𝑐𝑐 = 𝑐𝑐• if 𝑦𝑦𝑓𝑓 𝑥𝑥 > +𝛿𝛿 ℓ𝑐𝑐 = 0

• Convex upper bound 𝜙𝜙𝑐𝑐• if 𝑦𝑦𝑓𝑓 𝑥𝑥 < 0 𝜙𝜙𝑐𝑐 = 1 − 𝑎𝑎𝑦𝑦𝑓𝑓(𝑥𝑥)• if 𝑦𝑦𝑓𝑓 𝑥𝑥 ∈ 0,1 𝜙𝜙𝑐𝑐 = 1 − 𝑦𝑦𝑓𝑓(𝑥𝑥)• if 𝑦𝑦𝑓𝑓 𝑥𝑥 > 1 𝜙𝜙𝑐𝑐 = 0

11

−𝛿𝛿 +𝛿𝛿0

0𝑐𝑐

1

𝑦𝑦𝑓𝑓(𝑥𝑥)1

𝜙𝜙𝑐𝑐

ℓ𝑐𝑐

Page 12: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

DHL Optimization Problem

• min𝑤𝑤,𝑏𝑏,𝜉𝜉,𝛾𝛾

∑𝑖𝑖 𝜉𝜉𝑖𝑖 + 1−2𝑐𝑐𝑐𝑐

𝛾𝛾𝑖𝑖 subject to

• 𝑦𝑦𝑖𝑖 𝑤𝑤⊤𝑥𝑥𝑖𝑖 + 𝑏𝑏 + 𝜉𝜉𝑖𝑖 ≥ 1• 𝑦𝑦𝑖𝑖 𝑤𝑤⊤𝑥𝑥𝑖𝑖 + 𝑏𝑏 + 𝛾𝛾𝑖𝑖 ≥ 0• 𝜉𝜉𝑖𝑖 ≥ 0; 𝛾𝛾𝑖𝑖 ≥ 0• ∑𝑖𝑖,𝑗𝑗 𝛼𝛼𝑖𝑖𝛼𝛼𝑗𝑗(𝑥𝑥𝑖𝑖 ⋅ 𝑥𝑥𝑗𝑗) ≤ 𝑟𝑟2 (regularization constraint)

• This is a quadratically-constrained quadratic program, so it can be solved, but it is not easy

12

Page 13: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Non-Optimal Case (2)• Defining the rejection function in

terms of ℎ assumes that the probability of error is monotonically related to �̂�𝑝(𝑦𝑦|𝑥𝑥).

• We saw last lecture that this is not necessarily true

• We can try to fix ℎ or we can learn a more complex 𝑟𝑟 function

• Unlikely to be a problem for flexible models, but could be a problem for linear and SVM methods

13

Page 14: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Method 3: Learn (𝑓𝑓, 𝑟𝑟) pair(Cortes, DeSalvo & Mohri, 2016)

• Two-dimensional loss function• If 𝑟𝑟 𝑥𝑥 ≥ 0 and 𝑦𝑦𝑓𝑓 𝑥𝑥 ≥ 0 loss = 0• If 𝑟𝑟 𝑥𝑥 ≥ 0 and 𝑦𝑦𝑓𝑓 𝑥𝑥 < 0 loss = 1• If 𝑟𝑟 𝑥𝑥 < 0 loss = 𝑐𝑐

14

𝑦𝑦𝑓𝑓(𝑥𝑥)

0

0

𝑟𝑟(𝑥𝑥)𝑐𝑐 𝑐𝑐

01

Page 15: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Convex Upper Bound

• 𝐿𝐿𝑀𝑀𝑀𝑀 𝑟𝑟, 𝑓𝑓, 𝑥𝑥,𝑦𝑦 = max 1 + 12𝑟𝑟 𝑥𝑥 − 𝑦𝑦𝑓𝑓 𝑥𝑥 , 𝑐𝑐 1 − 1

1−2𝑐𝑐𝑟𝑟 𝑥𝑥 , 0

15

Page 16: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

CHR Optimization Problem

• 𝑓𝑓 𝑥𝑥 = 𝑤𝑤⊤𝑥𝑥 + 𝑏𝑏• 𝑟𝑟 𝑥𝑥 = 𝑢𝑢⊤𝑥𝑥 + 𝑏𝑏𝑏

• min𝑤𝑤,𝑢𝑢,𝜉𝜉

𝜆𝜆2𝑤𝑤 2 + 𝜆𝜆′

2𝑢𝑢 2 + ∑𝑖𝑖 𝜉𝜉𝑖𝑖 subject to

• 𝑐𝑐 1 − 11−2𝑐𝑐

𝑢𝑢⊤𝑥𝑥𝑖𝑖 + 𝑏𝑏′ ≤ 𝜉𝜉𝑖𝑖

• 12𝑢𝑢⊤𝑥𝑥𝑖𝑖 + 𝑏𝑏′ − 𝑦𝑦𝑖𝑖𝑤𝑤⊤𝑥𝑥𝑖𝑖 − 𝑏𝑏 ≤ 1 + 𝜉𝜉𝑖𝑖

• 0 ≤ 𝜉𝜉𝑖𝑖• By minimizing 𝜉𝜉𝑖𝑖 we are minimize the max of these three terms

16

Page 17: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Experimental Tests

17

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

liver bank skin pima australian cod haber

Tota

l Los

s

Data Set

Total Loss (Misclassification + Rejection)

DHL

CHR

Page 18: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Error on Non-Rejected Points

18

-0.1

0

0.1

0.2

0.3

0.4

0.5

liver bank skin pima australian cod haber

Misc

lass

ifica

tion

Loss

Data Set

Non-Rejected Error

DHL

CHR

Note: DHL modified to reject the same number of points as CHR

Page 19: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Reject Option Conclusions

• Basic thresholding is easy and gives PAC guarantees• 2-class thresholding with differential costs is easy• 𝐾𝐾-class thresholding?• Thresholding SVMs is interesting

• Focus the “margin” on the reject boundaries• Learning a 𝑓𝑓, 𝑟𝑟 pair is better than optimizing the double hinge loss

• Open question: How to jointly train DNNs and a rejection function

19

Page 20: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Conformal Prediction (online version)

• Given:• Training data 𝑧𝑧1, … , 𝑧𝑧𝑛𝑛−1 where 𝑧𝑧𝑖𝑖 = 𝑥𝑥𝑖𝑖 , 𝑦𝑦𝑖𝑖• Classifier 𝑓𝑓 trained on the training data• Nonconformity measure 𝐴𝐴𝑛𝑛:𝒵𝒵𝑛𝑛−1 × 𝒵𝒵 ↦ ℝ• Query 𝑥𝑥𝑛𝑛• Accuracy level 𝛿𝛿

• Find:• A set 𝐶𝐶 𝑥𝑥𝑞𝑞 ⊆ 1, … ,𝐾𝐾 such that 𝑦𝑦𝑞𝑞 ∈ 𝐶𝐶 𝑥𝑥𝑞𝑞 with probability 1 − 𝛿𝛿

• Method:• For each 𝑘𝑘, let 𝑧𝑧𝑛𝑛𝑘𝑘 = (𝑥𝑥𝑞𝑞, 𝑘𝑘)

• ∀𝑖𝑖 𝛼𝛼𝑖𝑖𝑘𝑘 ≔ 𝐴𝐴 𝑧𝑧1, … , 𝑧𝑧𝑖𝑖−1, 𝑧𝑧𝑖𝑖+1, … , 𝑧𝑧𝑛𝑛 , 𝑧𝑧𝑖𝑖 “how different is 𝑧𝑧𝑖𝑖 from the rest of the 𝑧𝑧 values?• Let 𝑝𝑝𝑘𝑘 = fraction of 𝛼𝛼1𝑘𝑘 , … ,𝛼𝛼𝑛𝑛𝑘𝑘 that are ≥ 𝛼𝛼𝑛𝑛𝑘𝑘

• 𝐶𝐶 𝑥𝑥𝑞𝑞 = 𝑘𝑘 𝑝𝑝𝑘𝑘 ≥ 𝛿𝛿• Output 𝐶𝐶 𝑥𝑥𝑞𝑞

20

Page 21: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Examples of Nonconformity Measures

• Conditional probability method:• Train a probabilistic classifier 𝑓𝑓 on 𝑧𝑧1, … , 𝑧𝑧𝑖𝑖−1, 𝑧𝑧𝑖𝑖+1, … , 𝑧𝑧𝑛𝑛• Then compute 𝐴𝐴 𝑧𝑧1, … , 𝑧𝑧𝑖𝑖−1, 𝑧𝑧𝑖𝑖+1, … , 𝑧𝑧𝑛𝑛 , 𝑧𝑧𝑖𝑖 = −log 𝑓𝑓 𝑧𝑧𝑖𝑖

• Nearest neighbor nonconformity• 𝐴𝐴 𝐵𝐵, 𝑧𝑧 = distance to nearest 𝑧𝑧′∈𝐵𝐵 in same class

distance to nearest 𝑧𝑧′∈𝐵𝐵 in different class

21

Page 22: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Additional Information

• In addition to outputting 𝐶𝐶 𝑥𝑥𝑞𝑞 , we can output• �𝑦𝑦𝑞𝑞 = arg max

𝑘𝑘𝑝𝑝𝑘𝑘 (the best prediction)

• 𝑝𝑝𝑞𝑞 = max𝑘𝑘

𝑝𝑝𝑘𝑘 (the p-value of the best prediction)

• 1 − max𝑘𝑘;𝑘𝑘≠ �𝑦𝑦𝑞𝑞

𝑝𝑝𝑘𝑘 (the “confidence”. We have more confidence if the second-best

p-value is small)

22

Page 23: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Batch (“inductive”) Conformal Prediction

• Divide data into training and calibration• Train 𝑓𝑓 on the training data• Let 𝑧𝑧1, … , 𝑧𝑧𝑛𝑛 be the validation data• Let 𝛼𝛼1, … ,𝛼𝛼𝑛𝑛 be the non-conformity scores of the validation data

• 𝛼𝛼𝑖𝑖 ≔ 𝐴𝐴 𝑧𝑧1, … , 𝑧𝑧𝑖𝑖−1, 𝑧𝑧𝑖𝑖+1, … , 𝑧𝑧𝑛𝑛 , 𝑧𝑧𝑖𝑖• Given query 𝑥𝑥𝑞𝑞

• For 𝑘𝑘 = 1, … ,𝐾𝐾• Let 𝑧𝑧𝑞𝑞𝑘𝑘 = (𝑥𝑥𝑞𝑞 ,𝑘𝑘)• let 𝛼𝛼𝑞𝑞𝑘𝑘 = 𝐴𝐴 𝑧𝑧1, … , 𝑧𝑧𝑛𝑛 , 𝑧𝑧𝑞𝑞𝑘𝑘• Let 𝑝𝑝𝑘𝑘 = fraction of 𝛼𝛼1, … ,𝛼𝛼𝑛𝑛,𝛼𝛼𝑞𝑞𝑘𝑘 that are ≥ 𝛼𝛼𝑞𝑞𝑘𝑘

• 𝐶𝐶 𝑥𝑥𝑞𝑞 = 𝑘𝑘 𝑝𝑝𝑘𝑘 ≥ 𝛿𝛿

• Key difference: 𝑧𝑧𝑞𝑞𝑘𝑘 does not affect the other non-conformity scores

23

Page 24: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Almost Equivalent to Learning a Threshold

• Let 𝜏𝜏 = the 𝛿𝛿 quantile of 𝛼𝛼1, … ,𝛼𝛼𝑛𝑛• Given query 𝑥𝑥𝑞𝑞

• For 𝑘𝑘 = 1, … ,𝐾𝐾• Let 𝑧𝑧𝑞𝑞𝑘𝑘 = (𝑥𝑥𝑞𝑞, 𝑘𝑘)• let 𝛼𝛼𝑞𝑞𝑘𝑘 = 𝐴𝐴 𝑧𝑧1, … , 𝑧𝑧𝑛𝑛 , 𝑧𝑧𝑞𝑞𝑘𝑘

• 𝐶𝐶 𝑥𝑥𝑞𝑞 = 𝑘𝑘 𝛼𝛼𝑞𝑞𝑘𝑘 ≥ 𝜏𝜏

• Additional difference: 𝜏𝜏 is computed without considering 𝛼𝛼𝑞𝑞𝑘𝑘

• IF 𝑛𝑛 is large enough, this does not matter

24

Page 25: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Experimental Results

• Non-conformity Measures• Resubstitution:

• Train 𝑓𝑓 on all data• Let �𝑦𝑦𝑖𝑖 = 𝑓𝑓 𝑥𝑥𝑖𝑖• 𝐴𝐴 𝑧𝑧1, … , 𝑧𝑧𝑖𝑖−1, 𝑧𝑧𝑖𝑖 , … , 𝑧𝑧𝑁𝑁 , 𝑥𝑥𝑖𝑖 ,𝑘𝑘 = 𝐼𝐼 �𝑦𝑦𝑖𝑖 = 𝑘𝑘

• Leave One Out:• Train 𝑓𝑓 on 𝑧𝑧1, … , 𝑧𝑧𝑖𝑖−1, 𝑧𝑧𝑖𝑖 , … , 𝑧𝑧𝑁𝑁• Let �𝑦𝑦𝑖𝑖 = 𝑓𝑓 𝑥𝑥𝑖𝑖• 𝐴𝐴 𝑧𝑧1, … , 𝑧𝑧𝑖𝑖−1, 𝑧𝑧𝑖𝑖 , … , 𝑧𝑧𝑁𝑁 , 𝑥𝑥𝑖𝑖 ,𝑘𝑘 = 𝐼𝐼 �𝑦𝑦𝑖𝑖 = 𝑘𝑘

25

Page 26: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Satellite

26

Resubstitution

Leave one out

Error:𝑦𝑦𝑖𝑖 ∉ 𝐶𝐶 𝑥𝑥𝑖𝑖

Page 27: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Shuttle

27

Resubstitution

Leave one out

Page 28: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Segmentation

28

Resubstitution

Leave one out

Page 29: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Pendigits + Random Forest(Dietterich, unpublished)

• Train a random forest on half of UCI Training Set• Use the predicted class probability 𝑃𝑃 𝑦𝑦 = 𝑘𝑘 𝑥𝑥 as the

(non)conformity score• Compute 𝜏𝜏 values using other half of Training Set• Compute 𝐶𝐶 on the Test Set

29

Page 30: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Cumulative Distribution Function for Class “9”

30

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

ecdf(p.val[, 10])

x

Fn(x

)

Empirical CDF

𝐶𝐶(𝑥𝑥, "9")

Page 31: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Pendigits Results

• All 𝜏𝜏 values were 0 (for 𝜖𝜖 = 0.001)• Probability 𝑦𝑦 ∈ Γ 𝑥𝑥 = 0.9997• Abstention rate = 0.72• Sizes of prediction sets Γ:

31

0.00

0.28

0.18

0.120.11 0.11

0.070.06

0.040.02

0.000

0.05

0.1

0.15

0.2

0.25

0.3

0 1 2 3 4 5 6 7 8 9 10

Frac

tion

of T

est P

oint

sSize of the Confidence Set

Page 32: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Simple Thresholding of max𝑘𝑘

�̂�𝑝 𝑦𝑦 = 𝑘𝑘 𝑥𝑥

32

0

20

40

60

80

100

120

140

160

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Num

ber o

f Val

idat

ion

Erro

rs

Threshold on p.max

Page 33: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Zoomed In: 𝜏𝜏 = 0.87 for 𝛿𝛿 = 0.05

33

0

1

2

3

4

5

6

7

8

0.7 0.75 0.8 0.85 0.9 0.95 1

Num

ber o

f Val

idat

ion

Erro

rs

Threshold on p.max

Page 34: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Test Set Results

• Probability of correct classification: 0.9987• Rejection rate: 33.4%

• [Conformal prediction was 72%]

34

Page 35: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Another Use Case: Lexicon Reduction

• US Postal Service Address Reading Task• (Madhvanath, Kleinberg, Govindaraju, 1997)

• Two classifiers• Method 1: Fast but not always accurate• Method 2: Slower but more accurate

• Can only afford to run on 1/3 of envelopes• Faster if it can be focused on a subset of the classes

• Apply conformal prediction using Method 1• Eliminate as many classes as possible• Apply Method 2 if 𝐶𝐶 𝑥𝑥 > 1

35

Page 36: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Summary

• Lecture 1: Calibration• Lecture 2: Rejection

• Method 1: Threshold 𝑓𝑓 with single or multiple thresholds• Multiple thresholds requires a change in the SVM methodology

• Method 2: Learn a separate rejection function and threshold it• Method 3: Conformal: Use thresholding to construct a confidence set

• Reject if 𝐶𝐶 𝑥𝑥𝑞𝑞 ≠ 1• Can perform “lexicon reduction”

• In my experience, Conformal Prediction is not good for Rejection, but more experiments are needed

36

Page 37: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Next Lecture

• All of these methods assume a closed world• What happens when queries may belong to “alien” classes not

observed during training?• Papers:

• Bendale, A., & Boult, T. (2016). Towards Open Set Deep Networks. In CVPR 2016 (pp. 1563–1572). http://doi.org/10.1109/CVPR.2016.173

• Liu, S., Garrepalli, R., Dietterich, T. G., Fern, A., & Hendrycks, D. (2018). Open Category Detection with PAC Guarantees. Proceedings of the 35th International Conference on Machine Learning, PMLR, 80, 3169–3178. http://proceedings.mlr.press/v80/liu18e.html

37

Page 38: Robust Artificial Intelligence - College of Engineeringtgd/classes/qinghua... · 2018. 11. 10. · Lecture 2: Rejection • Given: • Training data 𝑥𝑥 1,𝑦𝑦 1,…,(𝑥𝑥

Citations• Bartlett, P., Wegkamp, M. (2008). Classification with a reject option using a hinge loss. JMLR, 2008.• Chow (1970). On optimum recognition error and reject trade-off. IEEE Transactions on Computing.• Cortes, C., DeSalvo, G., & Mohri, M. (2016). Learning with rejection. Lecture Notes in Artificial Intelligence,

9925 LNAI, 67–82. http://doi.org/10.1007/978-3-319-46379-7_5• Geifman, Y., El Yaniv, R. (2017) Selective Classification for Deep Neural Networks. NIPS 2017. arXiv:

1705.08500• Herbei, R., Wegkamp, M. (2005). Classification with reject option. Canadian Journal of Statistics. • Madhvanath, S., Kleinberg, E., Govindaraju, V. (1997). Empirical Design of a Multi-Classifier

Thresholding/Control Strategy for Recognition of Handwritten Street Names. International Journal of Pattern Recognition and Artificial Intelligence, 11(6):933-946. https://doi.org/10.1142/S0218001497000421

• Papadopoulos, H. (2008). Inductive Conformal Prediction: Theory and Application to Neural Networks. Book chapter. https://www.researchgate.net/publication/221787122_Inductive_Conformal_Prediction_Theory_and_Application_to_Neural_Networks

• Pietraszek, T. (2005). Optimizing abstaining classifiers using ROC analysis. In ICML, 2005• Shafer, G., & Vovk, V. (2008). A tutorial on conformal prediction. Journal of Machine Learning Research, 9,

371–421. Retrieved from http://arxiv.org/abs/0706.3188

38