
arXiv:2012.10209v5 [cs.CL] 1 Apr 2021

Deep Open Intent Classification with Adaptive Decision Boundary

Hanlei Zhang,1,2 Hua Xu,1,2,* Ting-En Lin1,2

1State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

2Beijing National Research Center for Information Science and Technology (BNRist), Beijing 100084, China

[email protected], [email protected], [email protected]

Abstract

Open intent classification is a challenging task in dialogue systems. On the one hand, it should ensure the quality of known intent identification. On the other hand, it needs to detect the open (unknown) intent without prior knowledge. Current models are limited in finding the appropriate decision boundary to balance the performances of both known intents and the open intent. In this paper, we propose a post-processing method to learn the adaptive decision boundary (ADB) for open intent classification. We first utilize the labeled known intent samples to pre-train the model. Then, we automatically learn the adaptive spherical decision boundary for each known class with the aid of well-trained features. Specifically, we propose a new loss function to balance both the empirical risk and the open space risk. Our method does not need open intent samples and is free from modifying the model architecture. Moreover, our approach remains surprisingly robust with less labeled data and fewer known intents. Extensive experiments on three benchmark datasets show that our method yields significant improvements compared with the state-of-the-art methods. The codes are released at https://github.com/thuiar/Adaptive-Decision-Boundary.

Introduction

Identifying the user's open intent plays a significant role in dialogue systems. As shown in Figure 1, we have two known intents for specific purposes, such as book flight and restaurant reservation. However, there are also utterances with irrelevant or unsupported intents that our system cannot handle. It is necessary to distinguish these utterances from the known intents as much as possible. On the one hand, effectively identifying the open intent can improve customer satisfaction by reducing false-positive errors. On the other hand, we can use the open intent to discover potential user needs.

We regard open intent classification as an (n+1)-class classification task as suggested in (Shu, Xu, and Liu 2017; Lin and Xu 2019a), and group open classes into the (n+1)th class. Our goal is to classify the n-class known intents into their corresponding classes correctly while identifying the (n+1)th class, the open intent.

*Hua Xu is the corresponding author. Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: An example of open intent classification. Book flight and Restaurant reservation are two known intents. We should identify them correctly while detecting the sentences with the open intent.

To solve this problem, Scheirer et al. (2013) propose the concept of open space risk as the measure of open classification. Fei and Liu (2016) reduce the open space risk by learning the closed boundary of each positive class in the similarity space. However, they fail to capture high-level semantic concepts with SVM. Bendale and Boult (2016) manage to reduce the open space risk through deep neural networks (DNNs), but need to sample open classes for selecting the core hyperparameters. Hendrycks and Gimpel (2017) use the softmax probability as the confidence score, but also need to select the confidence threshold with negative samples. Shu, Xu, and Liu (2017) replace softmax with the sigmoid activation function, and calculate the confidence thresholds of each class based on statistics. However, the statistics-based thresholds cannot learn the essential differences between known classes and the open class. Lin and Xu (2019a) propose to learn the deep intent features with the margin loss and detect the unknown intent with the local outlier factor (Breunig et al. 2000). However, it has no specific decision boundaries for distinguishing the open intent, and needs model architecture modification.

Most of the existing methods need to design specific classifiers for identifying the open class (Bendale and Boult 2016; Shu, Xu, and Liu 2017; Lin and Xu 2019a), and perform poorly with the common classifier (Hendrycks and Gimpel 2017).



Figure 2: The model architecture of our approach. Firstly, we use BERT to extract intent features and pre-train the model with labeled samples. Then, we initialize the centroids $\{c_i\}_{i=1}^{K}$ and the radii of the decision boundaries $\{\Delta_i\}_{i=1}^{K}$ for each known class. Next, we propose the boundary loss to learn tight decision boundaries adaptive to the known intent feature space. Finally, we perform open classification with the learned decision boundaries to both identify known classes and detect the open class.

Moreover, the performance of open classification largely depends on the decision conditions. Most of these methods need negative samples for determining the suitable decision conditions (Scheirer et al. 2013; Fei and Liu 2016; Hendrycks and Gimpel 2017; Liang, Li, and Srikant 2018). It is also a complicated and time-consuming process to manually select the optimal decision condition, which is not applicable in real scenarios.

To solve these problems, we use known intents as prior knowledge, and propose a novel post-processing method to learn the adaptive decision boundary (ADB) for open intent classification. As illustrated in Figure 2, we first extract intent representations from BERT (Devlin et al. 2019). Then, we pre-train the model under the supervision of the softmax loss. We define centroids for each known class and suppose intent features of each known class are constrained in a closed ball area. Next, we aim to learn the radius of the spherical area to obtain the decision boundaries. Specifically, we initialize the boundary parameters with the standard normal distribution and use a learnable activation function as a projection to get the radius of each decision boundary.

The suitable decision boundaries should satisfy two conditions. On the one hand, they should be broad enough to surround known intent samples as much as possible. On the other hand, they need to be tight enough to prevent the open intent samples from being identified as known intents. To address these issues, we propose a new loss function, which optimizes the boundary parameters by balancing both the open space risk and the empirical risk (Scheirer et al. 2013). With the boundary loss, the decision boundaries automatically learn to adapt to the intent feature space until the two risks balance. We find our post-processing method can learn discriminative decision boundaries for detecting the open intent even without modifying the original model architecture.

We summarize our contributions as follows. Firstly, we propose a novel post-processing method for open classification, with no need for prior knowledge of the open intent. Secondly, we propose a new loss function to automatically learn tight decision boundaries adaptive to the feature space. To the best of our knowledge, this is the first attempt to adopt deep neural networks to learn adaptive decision boundaries for open classification. Thirdly, extensive experiments conducted on three challenging datasets show that our approach yields consistently better and more robust results compared with the state-of-the-art methods.

The Proposed Approach

Intent Representation

We use the BERT model to extract deep intent features. Given the ith input sentence $s_i$, we get all its token embeddings $[CLS, T_1, \cdots, T_N] \in \mathbb{R}^{(N+1) \times H}$ from the last hidden layer of BERT. As suggested in (Lin, Xu, and Zhang 2020), we perform mean-pooling on these token embeddings to synthesize the high-level semantic features in one sentence, and get the averaged representation $x_i \in \mathbb{R}^{H}$:

$$x_i = \text{mean-pooling}([CLS, T_1, \cdots, T_N]), \quad (1)$$

where CLS is the vector for text classification, N is the sequence length and H is the hidden layer size. To further strengthen the feature extraction capability, we feed $x_i$ into a dense layer h to get the intent representation $z_i \in \mathbb{R}^{D}$:

$$z_i = h(x_i) = \sigma(W_h x_i + b_h), \quad (2)$$

where D is the dimension of the intent representation, $\sigma$ is the ReLU activation function, and $W_h \in \mathbb{R}^{H \times D}$ and $b_h \in \mathbb{R}^{D}$ respectively denote the weights and the bias term of layer h.
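A minimal PyTorch sketch of this feature extractor, assuming the HuggingFace transformers API; the class name and the simplified pooling (which ignores padding for brevity) are our own choices, not the authors' released code:

```python
import torch
import torch.nn as nn
from transformers import BertModel

class IntentFeatureExtractor(nn.Module):
    """Mean-pooled BERT token embeddings followed by a dense layer (Eqs. 1-2)."""

    def __init__(self, feat_dim=768):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.dense = nn.Linear(self.bert.config.hidden_size, feat_dim)  # layer h

    def forward(self, input_ids, attention_mask=None):
        # Token embeddings [CLS, T_1, ..., T_N] from the last hidden layer.
        tokens = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        x = tokens.mean(dim=1)             # Eq. (1): mean-pooling over tokens
        return torch.relu(self.dense(x))   # Eq. (2): z_i = sigma(W_h x_i + b_h)
```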

Dataset        Classes  #Training  #Validation  #Test  Vocabulary Size  Length (max / mean)
BANKING        77       9,003      1,000        3,080  5,028            79 / 11.91
OOS            150      15,000     3,000        5,700  8,376            28 / 8.31
StackOverflow  20       12,000     2,000        6,000  17,182           41 / 9.18

Table 1: Statistics of BANKING, OOS and StackOverflow datasets. # indicates the total number of sentences.

Pre-training

As the decision boundaries learn to adapt to the intent feature space, we need to learn intent representations first. Due to the lack of open intent samples, we use known intents as prior knowledge to pre-train the model. In order to reflect the effectiveness of the learned decision boundary, we use the simple softmax loss $\mathcal{L}_s$ to learn the intent feature $z_i$:

$$\mathcal{L}_s = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(\phi(z_i)_{y_i})}{\sum_{j=1}^{K}\exp(\phi(z_i)_j)}, \quad (3)$$

where $\phi(\cdot)$ is a linear classifier and $\phi(\cdot)_j$ is the output logit of the jth class. Then, we use the pre-trained model to extract intent features for learning the decision boundaries.
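In PyTorch, the softmax loss of Eq. (3) is ordinary cross-entropy over a linear classifier. A sketch with stand-in features and our own variable names:

```python
import torch
import torch.nn as nn

feat_dim, num_known_classes = 768, 77     # e.g., BANKING with all classes known
classifier = nn.Linear(feat_dim, num_known_classes)   # the linear classifier phi
criterion = nn.CrossEntropyLoss()                     # softmax loss L_s, Eq. (3)

z = torch.randn(128, feat_dim, requires_grad=True)    # stand-in for features z_i
y = torch.randint(num_known_classes, (128,))          # known-intent labels y_i
loss_s = criterion(classifier(z), y)                  # Eq. (3)
loss_s.backward()
```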

Adaptive Decision Boundary Learning

In this section, we propose our approach to learning the adaptive decision boundary (ADB) for open intent classification. First, we introduce the formulation of the decision boundary. Then, we propose our boundary learning strategy for optimization. Finally, we use the learned decision boundary to perform open classification.

Decision Boundary Formulation

The superiority of the spherical decision boundary for open classification has been shown in (Fei and Liu 2016). Compared with the half-space binary linear classifier (Scholkopf et al. 2001) or two parallel hyper-planes (Scheirer et al. 2013), the bounded spherical area greatly reduces the open space risk. Inspired by this, we aim to learn a decision boundary for each class that constrains the known intents within a ball area.

Let $S = \{(z_1, y_1), \ldots, (z_N, y_N)\}$ be the known intent examples with their corresponding labels, and let $S_k$ denote the set of examples labeled with class k. The centroid $c_k \in \mathbb{R}^{D}$ is the mean vector of the embedded samples in $S_k$:

$$c_k = \frac{1}{|S_k|}\sum_{(z_i, y_i) \in S_k} z_i, \quad (4)$$

where $|S_k|$ denotes the number of examples in $S_k$. We define $\Delta_k$ as the radius of the decision boundary with respect to the centroid $c_k$. For each known intent $z_i$, we aim to satisfy the following constraints:

$$\forall z_i \in S_k,\ \|z_i - c_k\|_2 \le \Delta_k, \quad (5)$$

where $\|z_i - c_k\|_2$ denotes the Euclidean distance between $z_i$ and $c_k$. That is, we hope examples belonging to class k are constrained in the ball area with centroid $c_k$ and radius $\Delta_k$. As the radius $\Delta_k$ needs to be adaptive to the intent feature space, we use the deep neural network to optimize a learnable boundary parameter $\hat{\Delta}_k \in \mathbb{R}$. As suggested in (Tapaswi, Law, and Fidler 2019), we use the Softplus activation function as the mapping between $\hat{\Delta}_k$ and $\Delta_k$:

$$\Delta_k = \log(1 + e^{\hat{\Delta}_k}). \quad (6)$$

The Softplus activation function has the following advantages. First, it is differentiable everywhere for $\hat{\Delta}_k \in \mathbb{R}$. Second, it ensures the learned radius $\Delta_k$ is above zero. Finally, it exhibits near-linear characteristics like ReLU and allows for a larger $\Delta_k$ if necessary.
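Centroid initialization (Eq. 4) and the Softplus projection (Eq. 6) are a few lines each. A sketch, where the helper name and the stand-in data are our own:

```python
import torch
import torch.nn.functional as F

def class_centroids(z, y, num_classes):
    """Eq. (4): the centroid c_k is the mean feature vector of class k."""
    return torch.stack([z[y == k].mean(dim=0) for k in range(num_classes)])

num_classes, feat_dim = 77, 768                    # e.g., BANKING; D = 768
z = torch.randn(1024, feat_dim)                    # stand-in pre-trained features
y = torch.randint(num_classes, (1024,))            # stand-in known-intent labels
centroids = class_centroids(z, y, num_classes)     # fixed during boundary learning

delta_hat = torch.nn.Parameter(torch.randn(num_classes))  # \hat{Delta}_k ~ N(0, 1)
delta = F.softplus(delta_hat)                      # Eq. (6): radius Delta_k > 0
```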

Boundary Learning

The decision boundaries should be adaptive to the intent feature space to balance both the empirical risk and the open space risk (Bendale and Boult 2015). For example, if $\|z_i - c_{y_i}\|_2 > \Delta_{y_i}$, the known intent samples are outside their corresponding decision boundaries, which may introduce more empirical risk. Therefore, the decision boundaries need to expand to contain more samples from known classes. If $\|z_i - c_{y_i}\|_2 < \Delta_{y_i}$, then although more known intent samples are likely to be identified with broader decision boundaries, broader boundaries may also admit more open intent samples and increase the open space risk. Thus, we propose the boundary loss $\mathcal{L}_b$:

$$\mathcal{L}_b = \frac{1}{N}\sum_{i=1}^{N}\left[\delta_i\left(\|z_i - c_{y_i}\|_2 - \Delta_{y_i}\right) + (1 - \delta_i)\left(\Delta_{y_i} - \|z_i - c_{y_i}\|_2\right)\right], \quad (7)$$

where $y_i$ is the label of the ith sample and $\delta_i$ is defined as:

$$\delta_i := \begin{cases} 1, & \text{if } \|z_i - c_{y_i}\|_2 > \Delta_{y_i}, \\ 0, & \text{if } \|z_i - c_{y_i}\|_2 \le \Delta_{y_i}. \end{cases} \quad (8)$$
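Eqs. (7) and (8) translate directly into code. A sketch reusing the centroids and delta_hat from the sketch above:

```python
import torch
import torch.nn.functional as F

def boundary_loss(z, y, centroids, delta_hat):
    """Eq. (7): expand a radius when a known sample falls outside its ball
    (empirical risk), shrink it when there is slack (open space risk)."""
    delta = F.softplus(delta_hat)                      # Eq. (6)
    dist = torch.norm(z - centroids[y], p=2, dim=1)    # ||z_i - c_{y_i}||_2
    outside = (dist > delta[y]).float()                # indicator delta_i, Eq. (8)
    return (outside * (dist - delta[y])
            + (1.0 - outside) * (delta[y] - dist)).mean()
```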

Then, we update the boundary parameter $\hat{\Delta}_k$ with respect to $\mathcal{L}_b$ as follows:

$$\hat{\Delta}_k := \hat{\Delta}_k - \eta \frac{\partial \mathcal{L}_b}{\partial \hat{\Delta}_k}, \quad (9)$$

where $\eta$ is the learning rate of the boundary parameters $\hat{\Delta}$ and $\frac{\partial \mathcal{L}_b}{\partial \hat{\Delta}_k}$ is computed by:

$$\frac{\partial \mathcal{L}_b}{\partial \hat{\Delta}_k} = \frac{\sum_{i=1}^{N}\delta'(y_i = k)\cdot(-1)^{\delta_i}}{\sum_{i=1}^{N}\delta'(y_i = k)} \cdot \frac{1}{1 + e^{-\hat{\Delta}_k}}, \quad (10)$$

where $\delta'(y_i = k) = 1$ if $y_i = k$ and $\delta'(y_i = k) = 0$ otherwise. We only update the radius $\Delta_k$ of the classes k that appear in a mini-batch, which ensures the denominator is not zero.

With the boundary loss $\mathcal{L}_b$, the boundaries can adapt to the intent feature space and learn suitable decision boundaries. The learned decision boundaries not only effectively surround most of the known intent samples, but also stay close to each known class centroid, which is effective for identifying the open intent samples.
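In a modern autograd framework, Eqs. (9) and (10) need not be implemented by hand: only the boundary parameters receive gradients, since the features and centroids are fixed after pre-training. A sketch of the update loop under that assumption, reusing boundary_loss and the variables defined above:

```python
import torch

# Only \hat{Delta} is optimized; z, y, centroids come from the pre-trained model.
optimizer = torch.optim.Adam([delta_hat], lr=0.05)

for epoch in range(20):
    loss_b = boundary_loss(z, y, centroids, delta_hat)   # Eq. (7)
    optimizer.zero_grad()
    loss_b.backward()                                    # autograd yields Eq. (10)
    optimizer.step()                                     # the update of Eq. (9)
```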

Known                BANKING             OOS                 StackOverflow
Ratio  Methods   Accuracy  F1-score  Accuracy  F1-score  Accuracy  F1-score

25%    MSP       43.67     50.09     47.02     47.62     28.67     37.85
       DOC       56.99     58.03     74.97     66.37     42.74     47.73
       OpenMax   49.94     54.14     68.50     61.99     40.28     45.98
       DeepUnk   64.21     61.36     81.43     71.16     47.84     52.05
       ADB       78.85     71.62     87.59     77.19     86.72     80.83

50%    MSP       59.73     71.18     62.96     70.41     52.42     63.01
       DOC       64.81     73.12     77.16     78.26     52.53     62.84
       OpenMax   65.31     74.24     80.11     80.56     60.35     68.18
       DeepUnk   72.73     77.53     83.35     82.16     58.98     68.01
       ADB       78.86     80.90     86.54     85.05     86.40     85.83

75%    MSP       75.89     83.60     74.07     82.38     72.17     77.95
       DOC       76.77     83.34     78.73     83.59     68.91     75.06
       OpenMax   77.45     84.07     76.80     73.16     74.42     79.78
       DeepUnk   78.52     84.31     83.71     86.23     72.33     78.28
       ADB       81.08     85.96     86.32     88.53     82.78     85.99

Table 2: Results of open classification with different known class proportions (25%, 50% and 75%) on BANKING, OOS and StackOverflow datasets. "Accuracy" and "F1-score" respectively denote the accuracy score and macro F1-score over all classes.

Open Classification with Decision Boundary

After training, we use the centroids and the learned decision boundaries of each known class for inference. We suppose known intent samples are constrained in the closed ball areas produced by their corresponding centroids and decision boundaries. On the contrary, the open intent samples are outside all of the bounded spherical areas. Specifically, we perform open intent classification as follows:

$$y = \begin{cases} \text{open}, & \text{if } d(z_i, c_k) > \Delta_k,\ \forall k \in \mathcal{Y}; \\ \arg\min_{k \in \mathcal{Y}} d(z_i, c_k), & \text{otherwise}, \end{cases} \quad (11)$$

where $d(z_i, c_k)$ denotes the Euclidean distance between $z_i$ and $c_k$, and $\mathcal{Y} = \{1, 2, \cdots, K\}$ denotes the known intent labels.
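A sketch of the decision rule of Eq. (11), where the helper name is ours and the integer label K stands for the open class:

```python
import torch

def predict(z, centroids, delta, open_label):
    """Eq. (11): 'open' if z_i lies outside every ball, else nearest centroid."""
    dists = torch.cdist(z, centroids)               # d(z_i, c_k), shape (B, K)
    pred = dists.argmin(dim=1)                      # arg min_k d(z_i, c_k)
    pred[(dists > delta).all(dim=1)] = open_label   # outside all K boundaries
    return pred
```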

Experiments

Datasets

We conduct experiments on three challenging real-world datasets to evaluate our approach. The detailed statistics are shown in Table 1.

BANKING  A fine-grained dataset in the banking domain (Casanueva et al. 2020). It contains 77 intents and 13,083 customer service queries.

OOS  A dataset for intent classification and out-of-scope prediction (Larson et al. 2019). It contains 150 intents, 22,500 in-domain queries and 1,200 out-of-domain queries.

StackOverflow  A dataset published on Kaggle.com, containing 3,370,528 technical question titles. We use the processed dataset (Xu et al. 2015), which has 20 different classes and 1,000 samples for each class.

Baselines

We compare our method with the following state-of-the-art open classification methods: OpenMax (Bendale and Boult 2016), MSP (Hendrycks and Gimpel 2017), DOC (Shu, Xu, and Liu 2017) and DeepUnk (Lin and Xu 2019a).

As OpenMax is an open set detection method in computer vision, we adapt it for open intent classification. We firstly use the softmax loss to train a classifier on known intents, then fit a Weibull distribution to the classifier's output logits. Finally, we recalibrate the confidence scores with the OpenMax layer. Due to the lack of open intent samples for tuning, we adopt the default hyperparameters of OpenMax. We use the same confidence threshold (0.5) as in (Lin and Xu 2019a) for MSP. For a fair comparison, we replace the backbone network of these methods with the same BERT model as ours.

Evaluation Metrics

Following previous work (Shu, Xu, and Liu 2017; Lin and Xu 2019a), we regard all the open classes as one rejected class. To evaluate the overall performance, we use the accuracy score (Accuracy) and macro F1-score (F1-score) as metrics. They are calculated over all classes (known classes and the open class). We also calculate the macro F1-score over the known classes and the open class respectively, which better evaluates the fine-grained performance.
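With the open class mapped to one extra label, these metrics reduce to standard scikit-learn calls. A sketch assuming labels in {0, ..., K}, where the largest label K is the open class:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 2, 3, 3, 1]   # toy stand-ins; label 3 (= K) is the open class
y_pred = [0, 1, 2, 3, 1, 1]

acc = accuracy_score(y_true, y_pred)                 # Accuracy over all K+1 classes
f1_all = f1_score(y_true, y_pred, average="macro")   # macro F1-score over all classes
per_class = f1_score(y_true, y_pred, average=None)   # one score per label 0..K
f1_known, f1_open = per_class[:-1].mean(), per_class[-1]
```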

Experimental Settings

Following the same settings as in (Shu, Xu, and Liu 2017; Lin and Xu 2019a), we keep some classes as unknown (open) and integrate them back during testing. All datasets are divided into training, validation and test sets. The number of known classes is varied with proportions of 25%, 50%, and 75% of all classes in the training set. The remaining classes are regarded as one open class and removed from the training set. Both the known classes and the open class are used for testing.
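A sketch of this splitting protocol; the helper and its names are ours:

```python
import random

def make_open_split(train_set, test_set, classes, known_ratio, seed=0):
    """Keep a random known_ratio of classes as known; the rest form one open
    class that appears only at test time (a sketch of the protocol above)."""
    rng = random.Random(seed)
    known = set(rng.sample(sorted(classes), int(known_ratio * len(classes))))
    train = [(s, y) for s, y in train_set if y in known]
    test = [(s, y if y in known else "open") for s, y in test_set]
    return train, test
```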

Known              BANKING          OOS              StackOverflow
Ratio  Methods   Open    Known    Open    Known    Open    Known

25%    MSP       41.43   50.55    50.88   47.53    13.03   42.82
       DOC       61.42   57.85    81.98   65.96    41.25   49.02
       OpenMax   51.32   54.28    75.76   61.62    36.41   47.89
       DeepUnk   70.44   60.88    87.33   70.73    49.29   52.60
       ADB       84.56   70.94    91.84   76.80    90.88   78.82

50%    MSP       41.19   71.97    57.62   70.58    23.99   66.91
       DOC       55.14   73.59    79.00   78.25    25.44   66.58
       OpenMax   54.33   74.76    81.89   80.54    45.00   70.49
       DeepUnk   69.53   77.74    85.85   82.11    43.01   70.51
       ADB       78.44   80.96    88.65   85.00    87.34   85.68

75%    MSP       39.23   84.36    59.08   82.59    33.96   80.88
       DOC       50.60   83.91    72.87   83.69    16.76   78.95
       OpenMax   50.85   84.64    76.35   73.13    44.87   82.11
       DeepUnk   58.54   84.75    81.15   86.27    37.59   81.00
       ADB       66.47   86.29    83.92   88.58    73.86   86.80

Table 3: Results of open classification with different known class ratios (25%, 50% and 75%) on BANKING, OOS and StackOverflow datasets. "Open" and "Known" denote the macro F1-score over the open class and the known classes respectively.

Figure 3: The boundary learning process (the learned decision boundary radius $\Delta$ of each class per training epoch, one curve per known class, with 50% known classes on StackOverflow).

For each known class ratio, we report the average performance over ten runs of experiments.

We employ the BERT model (bert-uncased, with a 12-layer transformer) implemented in PyTorch (Wolf et al. 2019) and adopt most of its suggested hyperparameters for optimization. To speed up the training procedure and achieve better performance, we freeze all but the last transformer layer parameters of BERT. The training batch size is 128, and the learning rate is 2e-5. For the boundary loss $\mathcal{L}_b$, we employ Adam (Kingma and Ba 2014) to optimize the boundary parameters at a learning rate of 0.05.
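Freezing all but the last transformer layer can be done by filtering parameter names. A sketch against the HuggingFace parameter naming convention (the exact trainable set in the released code may differ):

```python
model = IntentFeatureExtractor()   # from the extractor sketch above
for name, param in model.bert.named_parameters():
    # Keep only the 12th (last) transformer block of BERT trainable.
    param.requires_grad = name.startswith("encoder.layer.11")
```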

Results

Table 2 and Table 3 show the performance of all compared methods, where the best results are highlighted in bold. Firstly, we observe the overall performance. Table 2 shows the accuracy score and macro F1-score over all classes.

Figure 4: Influence of the learned decision boundary (macro F1-score as the boundary is scaled from 0.25$\Delta$ to 1.75$\Delta$, with 50% known classes on OOS).

With 25%, 50%, and 75% known classes, our approach consistently achieves the best results and outperforms the other baselines by a significant margin. Compared with the best results of all baselines, our method improves the accuracy score (Accuracy) on BANKING by 14.64%, 6.13%, and 2.56%, on OOS by 6.16%, 3.19%, and 2.61%, and on StackOverflow by 38.88%, 27.42%, and 10.45% in the 25%, 50% and 75% settings respectively, which demonstrates the superiority of our method.

Secondly, we notice that the improvements on StackOverflow are much more drastic than on the other two datasets. We suppose the improvements mainly depend on the characteristics of the datasets. Most baselines lack explicit or suitable decision boundaries for identifying the open intent, so they are more sensitive to different datasets. For example, they are limited in distinguishing difficult semantic intents (e.g., technical question titles in StackOverflow) without prior knowledge. By contrast, our method learns specific and tight decision boundaries for each known class, which is more effective for open intent classification.

Figure 5: Influence of the labeled ratio on three datasets with different known class proportions (25%, 50%, 75%). Each panel plots the accuracy score against labeled ratios from 0.2 to 1.0.

Thirdly, we observe the fine-grained performance. Table 3 shows the macro F1-score on the open intent and the known intents respectively. We notice that our method not only achieves substantial improvements on the open class, but also largely enhances the performance on the known classes compared with the baselines. That is because our method can learn specific and tight decision boundaries for detecting the open class while ensuring the quality of known intent classification.

Discussion

Boundary Learning Process

Figure 3 shows the decision boundary learning process. At first, most parameters are assigned small values near zero after initialization, which leads to small radii under the Softplus activation function. As the initial radii are too small, the empirical risk plays a dominant role. Therefore, the radius of each decision boundary expands to contain more known intent samples belonging to its class. As the training process goes on, the radius of each decision boundary learns to be large enough to contain most of the known intents. However, a large radius will also admit redundant open intent samples. In this case, the open space risk plays a dominant role, which prevents the radius from enlarging. Finally, the decision boundaries converge at a balance between the empirical risk and the open space risk.

Effect of Decision Boundary

To verify the effectiveness of the learned decision boundary, we use different ratios of $\Delta$ as boundaries during testing. As shown in Figure 4, ADB achieves the best performance with $\Delta$ among all assigned decision boundaries, which verifies the tightness of the learned decision boundary. Moreover, we notice that the performance of open classification is sensitive to the size of the decision boundaries. Overly compact decision boundaries increase the empirical risk by misclassifying more known intent samples as the open intent. Correspondingly, overly relaxed decision boundaries increase the open space risk by misclassifying more open intent samples as known intents. As shown in Figure 4, both of these cases perform worse compared with $\Delta$.
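This ablation amounts to rescaling the learned radii at test time. A sketch reusing the predict helper and variables from the sketches above, with stand-in test data:

```python
import torch
from sklearn.metrics import f1_score

z_test = torch.randn(500, feat_dim)               # stand-in test features
y_test = torch.randint(num_classes + 1, (500,))   # labels including open class K
for ratio in (0.25, 0.50, 0.75, 1.00, 1.25, 1.50, 1.75):
    y_pred = predict(z_test, centroids, ratio * delta, open_label=num_classes)
    print(f"{ratio:.2f}*delta: macro F1 = {f1_score(y_test, y_pred, average='macro'):.4f}")
```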

Effect of Labeled Data

To investigate the influence of labeled data, we vary the labeled ratio in the training set in the range of 0.2, 0.4, 0.6, 0.8 and 1.0. We use Accuracy as the score to evaluate the performance. As shown in Figure 5, ADB outperforms all the other baselines on the three datasets in almost all settings. Besides, it maintains a more robust performance under different labeled ratios compared with the other methods.

Notably, the statistics-based methods (e.g., MSP and DOC) show better performance with less labeled data. We suppose the reason is that the predicted scores have low confidence when there is less prior knowledge for training, which helps reject the open intent with the threshold. However, as the amount of labeled data increases, these methods tend to be biased towards the known intents, with the aid of the strong feature extraction capability of DNNs (Nguyen, Yosinski, and Clune 2015). Therefore, their performance drops dramatically.

In addition, we notice that OpenMax and DeepUnk are two competitive baselines. We suppose the reason is that they both leverage the characteristics of the intent feature distribution to detect the open class. However, OpenMax computes the centroid of each known class with only the correctly classified positive training samples. The quality of the centroids is easily influenced by the number of training samples. DeepUnk adopts a density-based novelty detection algorithm to perform open classification, which is also limited by the prior knowledge of labeled data. Thus, their performance drops dramatically with less labeled data, as shown in Figure 5.

Effect of Known Classes

We vary the known class ratio among 25%, 50% and 75%, and show the results in Table 2 and Table 3. Firstly, we observe the overall performance in Table 2. Compared with other methods, our method achieves huge improvements in all settings of the three datasets. All baselines drop dramatically as the known class ratio decreases. By contrast, our method still achieves robust results on the accuracy score with fewer training samples.

Then, we observe the fine-grained performance in Table 3. We notice that all baselines achieve high scores on the known classes, but they are limited in identifying the open intent and suffer poor performance on it. However, our method still yields the best results on both the known classes and the open class. This further demonstrates that suitable learned decision boundaries help balance both the empirical risk and the open space risk.

Related Work

Intent Detection

There have been many works on intent detection in dialogue systems in recent years (Min et al. 2020; Qin et al. 2020; Zhang et al. 2019; E et al. 2019; Qin et al. 2019). Nevertheless, they all make the closed-world classification assumption, without the open intent. Srivastava, Labutov, and Mitchell (2018) perform intent detection with a zero-shot learning (ZSL) method. However, ZSL is different from our task because it only contains novel classes during testing.

Unknown intent detection is a specific task for detecting the unknown intent. Brychcin and Kral (2017) propose an unsupervised approach to modeling intents, but fail to utilize the prior knowledge of known intents. Kim and Kim (2018) jointly train the in-domain (ID) classifier and the out-of-domain (OOD) detector, but need to sample OOD utterances. Yu et al. (2017) adopt adversarial learning to generate positive and negative samples for training the classifier. Ryu et al. (2018) use a generative adversarial network (GAN) to train on the ID samples and detect the OOD samples with the discriminator. However, it has been shown that deep generative models fail to capture high-level semantics on real-world data (Nalisnick et al. 2019; Mundt et al. 2019). Recent methods try to learn friendly features for detecting the unknown intent (Lin and Xu 2019a; Gangal et al. 2020; Yan et al. 2020), but they need to modify the model architecture, and fail to construct specific decision boundaries.

Open World Classification

Early work used SVMs to solve open set problems. One-class classifiers (Scholkopf et al. 2001; Tax and Duin 2004) find the decision boundary based on the positive training data. For multi-class open classification, one-vs-all SVM (Rifkin and Klautau 2004) trains a binary classifier for each class and treats negatively classified samples as the open class. Scheirer et al. (2013) extend the method to computer vision and introduce the concept of open space risk. Jain, Scheirer, and Boult (2014) estimate the unnormalized posterior probability of inclusion for open set problems. They fit the probability distributions with statistical Extreme Value Theory (EVT) using a Weibull-calibrated multi-class SVM. Scheirer, Jain, and Boult (2014) propose a Compact Abating Probability (CAP) model, which further improves the performance of the Weibull-calibrated SVM by truncating the abating probability. However, all these methods need negative samples for selecting the decision boundary or the probability threshold. Moreover, SVMs cannot capture advanced semantic features of intents (Lin and Xu 2019b).

Recently, researchers have used deep neural networks for open classification. OpenMax (Bendale and Boult 2016) fits a Weibull distribution to the outputs of the penultimate layer, but still needs negative samples for selecting the best hyperparameters. MSP (Hendrycks and Gimpel 2017) calculates the softmax probability of known samples and rejects low-confidence unknown samples with a threshold. ODIN (Liang, Li, and Srikant 2018) uses temperature scaling and input preprocessing to enlarge the differences between known and unknown samples. However, both of these methods (Hendrycks and Gimpel 2017; Liang, Li, and Srikant 2018) need unknown samples to select the confidence threshold manually. DOC (Shu, Xu, and Liu 2017) uses the sigmoid function and calculates the confidence threshold based on Gaussian statistics, but it performs worse when the output probabilities are not discriminative.

Conclusion

In this paper, we propose a novel post-processing method for open intent classification. After pre-training the model with labeled samples, our model can automatically learn specific and tight decision boundaries adaptive to the known intent feature space. Our method requires neither open intent samples nor model architecture modification. Extensive experiments on three benchmark datasets show that our method yields significant improvements over the compared baselines, and is more robust with less labeled data and fewer known intents.

Acknowledgments

This work is supported by the seed fund of Tsinghua University (Department of Computer Science and Technology)-Siemens Ltd., China Joint Research Center for Industrial Intelligence and Internet of Things.

References

Bendale, A.; and Boult, T. 2015. Towards Open World Recognition. In Proceedings of CVPR, 1893–1902.
Bendale, A.; and Boult, T. E. 2016. Towards Open Set Deep Networks. In Proceedings of CVPR, 1563–1572.
Breunig, M. M.; Kriegel, H.-P.; Ng, R. T.; and Sander, J. 2000. LOF: Identifying Density-Based Local Outliers. In ACM SIGMOD Record, volume 29, 93–104.
Brychcin, T.; and Kral, P. 2017. Unsupervised Dialogue Act Induction using Gaussian Mixtures. In Proceedings of EACL, 485–490.
Casanueva, I.; Temcinas, T.; Gerz, D.; Henderson, M.; and Vulic, I. 2020. Efficient Intent Detection with Dual Sentence Encoders. In Proceedings of ACL Workshop.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT, 4171–4186.
E, H.; Niu, P.; Chen, Z.; and Song, M. 2019. A Novel Bi-directional Interrelated Model for Joint Intent Detection and Slot Filling. In Proceedings of ACL, 5467–5471.
Fei, G.; and Liu, B. 2016. Breaking the Closed World Assumption in Text Classification. In Proceedings of NAACL-HLT, 506–514.
Gangal, V.; Arora, A.; Einolghozati, A.; and Gupta, S. 2020. Likelihood Ratios and Generative Classifiers for Unsupervised Out-of-Domain Detection in Task Oriented Dialog. In Proceedings of AAAI, 7764–7771.
Hendrycks, D.; and Gimpel, K. 2017. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. In Proceedings of ICLR.
Jain, L. P.; Scheirer, W. J.; and Boult, T. E. 2014. Multi-Class Open Set Recognition Using Probability of Inclusion. In Proceedings of ECCV.
Kim, J.-K.; and Kim, Y.-B. 2018. Joint Learning of Domain Classification and Out-of-Domain Detection with Dynamic Class Weighting for Satisficing False Acceptance Rates. In Proceedings of INTERSPEECH, 556–560.
Kingma, D. P.; and Ba, J. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.
Larson, S.; Mahendran, A.; Peper, J. J.; Clarke, C.; Lee, A.; Hill, P.; Kummerfeld, J. K.; Leach, K.; Laurenzano, M. A.; Tang, L.; and Mars, J. 2019. An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction. In Proceedings of EMNLP-IJCNLP, 1311–1316.
Liang, S.; Li, Y.; and Srikant, R. 2018. Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks. In Proceedings of ICLR.

Lin, T.-E.; and Xu, H. 2019a. Deep Unknown Intent Detection with Margin Loss. In Proceedings of ACL, 5491–5496.
Lin, T.-E.; and Xu, H. 2019b. A Post-processing Method for Detecting Unknown Intent of Dialogue System via Pre-trained Deep Neural Network Classifier. Knowledge-Based Systems 186: 104979.
Lin, T.-E.; Xu, H.; and Zhang, H. 2020. Discovering New Intents via Constrained Deep Adaptive Clustering with Cluster Refinement. In Proceedings of AAAI, 8360–8367.
Min, Q.; Qin, L.; Teng, Z.; Liu, X.; and Zhang, Y. 2020. Dialogue State Induction Using Neural Latent Variable Models. In Proceedings of IJCAI, 3845–3852.
Mundt, M.; Pliushch, I.; Majumder, S.; and Ramesh, V. 2019. Open Set Recognition Through Deep Neural Network Uncertainty: Does Out-of-Distribution Detection Require Generative Classifiers? In Proceedings of ICCV Workshops.
Nalisnick, E.; Matsukawa, A.; Teh, Y. W.; Gorur, D.; and Lakshminarayanan, B. 2019. Do Deep Generative Models Know What They Don't Know? In Proceedings of ICLR.
Nguyen, A.; Yosinski, J.; and Clune, J. 2015. Deep Neural Networks Are Easily Fooled: High Confidence Predictions for Unrecognizable Images. In Proceedings of CVPR, 427–436.
Qin, L.; Che, W.; Li, Y.; Ni, M.; and Liu, T. 2020. DCR-Net: A Deep Co-Interactive Relation Network for Joint Dialog Act Recognition and Sentiment Classification. In Proceedings of AAAI, 8665–8672.
Qin, L.; Che, W.; Li, Y.; Wen, H.; and Liu, T. 2019. A Stack-Propagation Framework with Token-Level Intent Detection for Spoken Language Understanding. In Proceedings of EMNLP, 2078–2087.
Rifkin, R. M.; and Klautau, A. 2004. In Defense of One-Vs-All Classification. Journal of Machine Learning Research 5: 101–141.
Ryu, S.; Koo, S.; Yu, H.; and Lee, G. G. 2018. Out-of-domain Detection based on Generative Adversarial Network. In Proceedings of EMNLP, 714–718.
Scheirer, W. J.; de Rezende Rocha, A.; Sapkota, A.; and Boult, T. E. 2013. Toward Open Set Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(7): 1757–1772.
Scheirer, W. J.; Jain, L. P.; and Boult, T. E. 2014. Probability Models for Open Set Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 36.
Scholkopf, B.; Platt, J. C.; Shawe-Taylor, J.; Smola, A. J.; and Williamson, R. C. 2001. Estimating the Support of a High-Dimensional Distribution. Neural Computation 13(7): 1443–1471.
Shu, L.; Xu, H.; and Liu, B. 2017. DOC: Deep Open Classification of Text Documents. In Proceedings of EMNLP, 2911–2916.
Srivastava, S.; Labutov, I.; and Mitchell, T. 2018. Zero-shot Learning of Classifiers from Natural Language Quantification. In Proceedings of ACL, 306–316.

Tapaswi, M.; Law, M. T.; and Fidler, S. 2019. Video Face Clustering with Unknown Number of Clusters. In Proceedings of ICCV, 5026–5035.
Tax, D.; and Duin, R. 2004. Support Vector Data Description. Machine Learning 54: 45–66.
Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; and Brew, J. 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv preprint arXiv:1910.03771.
Xu, J.; Wang, P.; Tian, G.; Xu, B.; Zhao, J.; Wang, F.; and Hao, H. 2015. Short Text Clustering via Convolutional Neural Networks. In Proceedings of NAACL-HLT, 62–69.
Yan, G.; Fan, L.; Li, Q.; Liu, H.; Zhang, X.; Wu, X.-M.; and Lam, A. Y. 2020. Unknown Intent Detection Using Gaussian Mixture Model with an Application to Zero-shot Intent Classification. In Proceedings of ACL, 1050–1060.
Yu, Y.; Qu, W.-Y.; Li, N.; and Guo, Z. 2017. Open-Category Classification by Adversarial Sample Generation. In Proceedings of IJCAI, 3357–3363.
Zhang, C.; Li, Y.; Du, N.; Fan, W.; and Yu, P. 2019. Joint Slot Filling and Intent Detection via Capsule Neural Networks. In Proceedings of ACL, 5259–5267.