Self-supervised Learning for Generalizable Out-of-Distribution Detection


Transcript of "Self-supervised Learning for Generalizable Out-of-Distribution Detection" (AAAI poster)

Authors: Sina Mohseni (1,2), Mandar Pitale (1), JBS Yadawa (1), Zhangyang Wang (2)
Affiliations: (1) NVIDIA, (2) Texas A&M University
Source: people.tamu.edu/~sina.mohseni/papers/AAAI-Poster-final.pdf


We propose a new technique that relies on self-supervision for generalizable out-of-distribution (OOD) feature learning and for rejecting OOD samples at inference time. In particular:

✓ It does not need prior knowledge of the distribution of the targeted OOD samples for tuning.

✓ It incurs no extra computation or memory overhead, unlike methods such as DNN ensembles and MC-dropout.

✓ It performs favorably against state-of-the-art OOD detection methods, for example:

Train Setup: $D^{in}_{train}$: CIFAR, $D^{out}_{train}$: Tiny Images dataset

Test Setup: $D^{out}_{test}$: equal mix of five outlier test sets

✓ Generalizable OOD Detection: Our technique does not need to know the distribution of the targeted OOD samples. We test OOD detection performance when $D^{out}_{test}$ is a mix of five different datasets (equal mix, random sampling) and show that our technique outperforms the state of the art in both OOD detection AUROC and $D^{in}_{test}$ coverage (a sketch of assembling such a mixed test set appears after this list).

✓ Synthesized OOD Training Set: Our experimental results support using a synthesized training set for learning OOD features; however, in all experiments we observed superior results when using real OOD samples from outlier datasets.
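
As a concrete illustration of the mixed outlier test set described above, here is a minimal PyTorch sketch (not the authors' code) of drawing an equal, randomly sampled mix from several outlier test sets; the dataset list, sample budget, and seed are illustrative assumptions.

```python
import random
from torch.utils.data import ConcatDataset, Subset

def make_mixed_ood_test_set(outlier_datasets, total_size=5000, seed=0):
    """Equal mix of several outlier test sets via random sampling.

    outlier_datasets: list of torch Datasets (e.g., SVHN, Texture, Places365,
    LSUN, CIFAR-100 test splits); the exact datasets and sizes here are
    assumptions for illustration, not the poster's configuration.
    """
    rng = random.Random(seed)
    per_set = total_size // len(outlier_datasets)   # equal share per dataset
    parts = []
    for ds in outlier_datasets:
        idx = rng.sample(range(len(ds)), min(per_set, len(ds)))
        parts.append(Subset(ds, idx))
    return ConcatDataset(parts)
```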

Table 1: OOD detection performance. Each cell reports Baseline / OE [4] / Our Method. FPR @ 95% TPR: lower is better; AUROC and AUPR: higher is better.

$D^{in}_{train}$: MNIST, $D^{out}_{train}$: E-MNIST
  $D^{out}_{test}$ | FPR @ 95% TPR | AUROC | AUPR
  not-MNIST | 17.11 / 0.25 / 0 | 95.98 / 99.86 / 99.99 | 95.75 / 99.86 / 99.99
  F-MNIST | 2.96 / 0.99 / 0 | 99.3 / 99.83 / 100 | 99.19 / 99.83 / 100
  k-MNIST | 10.54 / 0.03 / 0.35 | 97.11 / 97.60 / 99.91 | 96.46 / 97.05 / 99.91

$D^{in}_{train}$: SVHN, $D^{out}_{train}$: Tiny Images
  $D^{out}_{test}$ | FPR @ 95% TPR | AUROC | AUPR
  Texture | 4.7 / 1.04 / 2.28 | 98.4 / 99.75 / 99.37 | 93.07 / 99.09 / 98.16
  Places365 | 2.55 / 0.02 / 0.05 | 99.27 / 99.99 / 99.94 | 99.1 / 99.99 / 99.93
  LSUN | 2.75 / 0.05 / 0.04 | 99.18 / 99.98 / 99.94 | 97.57 / 99.95 / 99.98
  CIFAR10 | 5.88 / 3.11 / 0.31 | 98.04 / 99.26 / 99.83 | 94.91 / 97.88 / 99.60
  CIFAR100 | 7.74 / 4.01 / 0.07 | 97.48 / 99 / 99.93 | 93.92 / 97.19 / 99.81

$D^{in}_{train}$: CIFAR-10, $D^{out}_{train}$: Tiny Images
  $D^{out}_{test}$ | FPR @ 95% TPR | AUROC | AUPR
  SVHN | 28.49 / 8.41 / 3.62 | 90.05 / 98.2 / 99.18 | 60.27 / 97.97 / 99.13
  Texture | 43.27 / 14.9 / 3.07 | 88.42 / 96.7 / 99.19 | 78.65 / 94.39 / 98.78
  Places365 | 44.78 / 19.07 / 10.86 | 88.23 / 95.41 / 97.57 | 86.33 / 95.32 / 97.77
  LSUN | 38.31 / 15.2 / 4.27 | 89.11 / 96.43 / 98.92 | 86.61 / 96.01 / 98.74
  CIFAR100 | 43.12 / 26.59 / 30.07 | 87.83 / 92.93 / 93.83 | 85.21 / 92.31 / 94.23

$D^{in}_{train}$: CIFAR-100, $D^{out}_{train}$: Tiny Images
  $D^{out}_{test}$ | FPR @ 95% TPR | AUROC | AUPR
  SVHN | 69.33 / 52.61 / 18.22 | 71.33 / 82.86 / 95.82 | 67.81 / 80.21 / 95.03
  Texture | 71.83 / 55.97 / 40.30 | 73.59 / 84.23 / 89.76 | 57.41 / 75.76 / 83.55
  Places365 | 70.26 / 57.77 / 39.96 | 73.97 / 82.65 / 89.08 | 70.46 / 81.47 / 88.00
  LSUN | 73.92 / 63.56 / 41.24 | 70.64 / 79.51 / 88.88 | 66.35 / 77.85 / 87.59
  CIFAR10 | 65.12 / 59.96 / 57.79 | 75.33 / 77.53 / 77.70 | 71.29 / 72.82 / 72.31

➢ 1) Architecture: Our method imposes minimal change to the model architecture, adding only extra nodes to the last layer of the network that are trained on outlier features and used to detect OOD samples (see the sketch after this list). We use two-step training, which starts by learning the normal training set and then continues with an OOD clustering step.

➢ 3) Self-Supervised Out-of-Distribution Learning: We train the auxiliary head on OOD features using an unlabeled OOD training set for which we generate pseudo-random labels. A two-term loss function, $\mathcal{L}_{total} = \mathcal{L}_{in} + \lambda \, \mathcal{L}_{out}$, is used for joint in- and out-of-distribution feature learning. Sketches of the widened output layer and of this two-term loss follow this list.
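
A minimal sketch of the architectural change in 1), assuming a standard torchvision backbone (the poster does not specify one): the final layer is widened so that a single softmax covers both the in-distribution classes and the extra reject classes. Whether the extra nodes are present from the start or appended after step 1 is not specified; here they are present from the start for simplicity.

```python
import torchvision.models as models

def build_classifier_with_reject_head(num_in_classes=10, num_reject_classes=5):
    """Single network whose last layer has num_in_classes + num_reject_classes
    output nodes; the extra nodes are trained on outlier features and later
    used as the OOD-detection signal. ResNet-18 is an illustrative choice."""
    return models.resnet18(num_classes=num_in_classes + num_reject_classes)
```

And a sketch of the two-term objective in 3), under the assumption that the last reject-class logits sit after the in-distribution logits; `lam` plays the role of $\lambda$, and the pseudo-random labels are drawn uniformly over the reject classes.

```python
import torch
import torch.nn.functional as F

def two_term_loss(model, x_in, y_in, x_out, num_in_classes, num_reject_classes, lam=1.0):
    """L_total = L_in + lam * L_out with pseudo-random labels for OOD samples."""
    # Supervised cross-entropy on the labeled in-distribution batch.
    loss_in = F.cross_entropy(model(x_in), y_in)

    # Self-supervised term: assign each unlabeled OOD sample a pseudo-random
    # label drawn from the reject classes (indices after the normal classes).
    y_out = torch.randint(num_in_classes, num_in_classes + num_reject_classes,
                          (x_out.size(0),), device=x_out.device)
    loss_out = F.cross_entropy(model(x_out), y_out)

    return loss_in + lam * loss_out
```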

✓ Robust in-distribution classification: We tested our technique for its effect on the normal (in-distribution) error rate and on the coverage lost to false-negative and false-positive detections. Compared to OE and the baseline, our technique retains higher coverage of the normal test set when rejecting OOD samples (a sketch of computing such a risk-coverage curve appears after this list).

✓ Number of reject classes: In our experiments, the impact of the number of reject classes on OOD detection performance was mild. We used five reject classes for the CIFAR-10, MNIST, and SVHN experiments and ten reject classes for the CIFAR-100 experiment.
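
As referenced above, one simple way to compute a risk-coverage curve of the kind shown in the Empirical Results: sweep a rejection threshold over a per-sample OOD score (higher meaning more OOD-like) and record coverage and error on the retained samples. The score definition and threshold grid are illustrative assumptions, not the poster's exact evaluation protocol.

```python
import numpy as np

def risk_coverage_curve(ood_scores, correct, num_thresholds=50):
    """Coverage (fraction of samples kept) and classification error on the
    kept samples, as the OOD-score rejection threshold is swept.

    ood_scores: per-sample OOD scores (higher = more OOD-like; samples above
                the threshold are rejected)
    correct:    boolean array, True where the classifier's prediction is correct
    """
    ood_scores = np.asarray(ood_scores, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    curve = []
    for t in np.linspace(ood_scores.min(), ood_scores.max(), num_thresholds):
        keep = ood_scores <= t
        coverage = keep.mean()
        error = 1.0 - correct[keep].mean() if keep.any() else 0.0
        curve.append((coverage, error))
    return curve
```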


➢ 2) Supervised In-distribution Training: We first train the model on the normal distribution to reach the desired classification performance, using the cross-entropy loss $\mathcal{L}_{in}$ for this normal training.

Algorithm: Two-step training on the in- and out-of-distribution training sets

Step 1: Supervised In-Distribution Learning
Input: Batch of $D^{in}_{train}$ samples in $c$ different classes.
Train on the in-distribution set by solving:
$\min_\theta \; \mathbb{E}_{P_{in}(\hat{x},\hat{y})} \left[ -\log P_\theta(y = \hat{y} \mid \hat{x}) \right]$

Step 2: Self-Supervised Out-of-Distribution Learning
Input: Batch of mixed $D^{in}_{train}$ samples and unlabeled $D^{out}_{train}$ samples, set of $k$ OOD classes.
Train on the mixed set by solving:
$\min_\theta \; \mathbb{E}_{P_{in}(\hat{x},\hat{y})} \left[ -\log P_\theta(y = \hat{y} \mid \hat{x}) \right] + \lambda \, \mathbb{E}_{P_{out}(\hat{x},\, rand(k))} \left[ -\log P_\theta(y = rand(k) \mid \hat{x}) \right]$
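
A condensed sketch of the two-step procedure in the algorithm box above. The optimizer, learning rate, epoch counts, and data-loader conventions (the OOD loader is assumed to yield (image, dummy_label) pairs) are assumptions, not the authors' settings.

```python
import torch
import torch.nn.functional as F

def train_two_step(model, in_loader, out_loader, num_in_classes, num_reject_classes,
                   lam=1.0, epochs_in=100, epochs_out=10, lr=0.1, device="cuda"):
    """Step 1: supervised learning on D_train_in.
    Step 2: self-supervised learning on mixed D_train_in / unlabeled D_train_out batches."""
    model = model.to(device)
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)

    # Step 1: minimize E_{P_in}[ -log P_theta(y = y_hat | x_hat) ]
    for _ in range(epochs_in):
        for x, y in in_loader:
            x, y = x.to(device), y.to(device)
            loss = F.cross_entropy(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Step 2: add the lambda-weighted OOD term with pseudo-random reject labels
    for _ in range(epochs_out):
        for (x_in, y_in), (x_out, _) in zip(in_loader, out_loader):
            x_in, y_in, x_out = x_in.to(device), y_in.to(device), x_out.to(device)
            y_out = torch.randint(num_in_classes, num_in_classes + num_reject_classes,
                                  (x_out.size(0),), device=device)
            loss = (F.cross_entropy(model(x_in), y_in)
                    + lam * F.cross_entropy(model(x_out), y_out))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```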

Motivation

The real-world deployment of Deep Neural Network (DNN) algorithms in safety-critical applications such as autonomous vehicles needs to address a variety of DNN vulnerabilities, including 1) generalization error, 2) out-of-distribution samples, and 3) adversarial attacks.

For instance, examples of OOD samples in a traffic sign recognition application are illustrated below:

[Figure: traffic-sign examples, contrasting samples from the training set distribution with samples from outside the training set distribution.]

Solution Overview

[Diagram: the in-distribution training set is used for supervised learning of the $D_{in}$ samples and the OOD training set for self-supervised learning of the $D_{out}$ samples; both feed the same network and its single output layer.]

✓ The problem we consider in this paper is detecting OOD outliers ($D_{out}$) using the same classifier $P_\theta(y \mid x)$ trained on the normal distribution ($D_{in}$).

✓ We add an auxiliary head to the network and use two-step training for the $D_{in}$ and $D_{out}$ distributions.

✓ We first apply supervised training on $D_{in}$, followed by self-supervised training on the unlabeled $D_{out}$ set.

➢ 4) Inference: We use a single softmax function over all output classes and take the sum of the softmax outputs of the OOD classes as the OOD-detection signal. OOD detection therefore takes only one forward pass and adds no memory overhead (sketched below).
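
A minimal sketch of this inference rule, again assuming the reject-class logits come after the in-distribution logits in the output layer:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict_with_ood_score(model, x, num_in_classes):
    """One forward pass, one softmax over all output classes. The summed
    softmax mass on the reject (OOD) classes is the OOD-detection signal."""
    probs = F.softmax(model(x), dim=1)
    ood_score = probs[:, num_in_classes:].sum(dim=1)   # high value -> likely OOD
    class_pred = probs[:, :num_in_classes].argmax(dim=1)
    return class_pred, ood_score
```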


References:
[1] Hendrycks et al. "A baseline for detecting misclassified and out-of-distribution examples in neural networks." ICLR 2017.
[2] Liang et al. "Enhancing the reliability of out-of-distribution image detection in neural networks." ICLR 2018.
[3] Pidhorskyi et al. "Generative probabilistic novelty detection with adversarial autoencoders." NeurIPS 2018.
[4] Hendrycks et al. "Deep anomaly detection with outlier exposure." ICLR 2019.

[Figure: risk-coverage curves in the presence of a mixed $D^{out}_{test}$, plotting total classification error versus test coverage (in) for BaseLine, OE, and Our Method; one panel for $D_{in}$: CIFAR-10 and one for $D_{in}$: CIFAR-100.]

➢ OOD Detection Performance: To evaluate our method, we train and test our technique on multiple image datasets. Note that in all experiments we used different unlabeled OOD training and test sets. Table 1 compares our OOD detection performance with state-of-the-art methods; a sketch of the reported detection metrics is given below.
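
For reference, a sketch of how the reported detection metrics can be computed with scikit-learn, treating OOD samples as the positive class (a common convention; the poster does not spell out its exact metric definitions):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, roc_curve

def detection_metrics(scores_in, scores_out):
    """FPR at 95% TPR, AUROC, and AUPR for an OOD score where higher = more OOD-like.
    scores_in:  OOD scores on the in-distribution test set (negatives)
    scores_out: OOD scores on the outlier test set (positives)"""
    labels = np.concatenate([np.zeros(len(scores_in)), np.ones(len(scores_out))])
    scores = np.concatenate([scores_in, scores_out])
    auroc = roc_auc_score(labels, scores)
    aupr = average_precision_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    fpr_at_95_tpr = fpr[np.searchsorted(tpr, 0.95)]   # FPR at the first threshold with TPR >= 95%
    return fpr_at_95_tpr, auroc, aupr
```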