
World Wide Web https://doi.org/10.1007/s11280-019-00688-8

End-to-End latent-variable task-oriented dialogue system with exact log-likelihood optimization

Haotian Xu¹ · Haiyun Peng² · Haoran Xie³ · Erik Cambria² · Liuyang Zhou¹ · Weiguo Zheng¹

Received: 3 April 2019 / Revised: 22 April 2019 / Accepted: 29 April 2019 /

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract We propose an end-to-end dialogue model based on a hierarchical encoder-decoder, which employs a discrete latent variable to learn underlying dialogue intentions. The system is able to model the structure of utterances dominated by the statistics of the language, and the dependencies among utterances in dialogues, without manual dialogue state design. We argue that the discrete latent variable interprets the intentions that guide machine response generation. We also propose a model that can be refined autonomously with reinforcement learning, since intention selection at each dialogue turn can be formulated as a sequential decision-making process. Our experiments show that the model optimized with exact MLE is much more robust than neural variational inference in terms of dialogue success rate, with limited BLEU sacrifice.

Keywords Dialogue model · Hierarchical encoder-decoder · Log-likelihood optimization · Dialogue intention

1 Introduction

Task-oriented dialogue systems aim to produce responses by accessing information from knowledge bases and planning over multiple dialogue turns. Conventional task-oriented dialogue systems generally have a complicated pipeline consisting of natural language understanding [2], dialogue state tracking [16, 25], dialogue policy [29], commonsense reasoning [30], and natural language generation [23].

A limitation of such systems is that errors can propagate from upstream to downstream modules, making the source of an error difficult to identify and track.

This article belongs to the Topical Collection: Computational Social Science as the Ultimate Web Intelligence. Guest Editors: Xiaohui Tao, Juan D. Velasquez, Jiming Liu, and Ning Zhong

Liuyang Zhou [email protected]

Extended author information available on the last page of the article.

(2020) 23:1989–2002

Published online: 7 June 2019


Moreover, each component in the pipeline needs to be re-trained whenever preceding components are updated, which causes several issues in practice. In addition, training natural language understanding and dialogue state tracking modules often requires massive labeled data, which is time-consuming and difficult to obtain for a specific domain.

To ameliorate these limitations of conventional pipeline dialogue systems, recent efforts have been made in constructing neural-network-based end-to-end models. Such end-to-end systems aim to optimize directly towards final system objectives [6, 24] or perform component-wise optimization [12]. Most recent end-to-end models are either trained in deep reinforcement learning-based systems [5, 26] or in supervised approaches [6, 24]. The former relies on interaction between the system and a human user (or user simulator), while the latter depends on established dialogue corpora, be they user-user or user-machine.

However, these models do not address the uncertainty of the user's intention, which is crucial for better dialogue modeling [24]. Wen et al. [24] propose a latent intention dialogue model (LIDM) to learn heterogeneous distributions of communicative intentions. It uses discrete latent random variables to represent dialogue intentions, based on which appropriate responses can be generated. Augmented with discrete latent random variables, their model can be trained in unsupervised, semi-supervised and reinforcement learning manners, which makes it easy to adapt to industrial applications.

In this work, we present a task-oriented neural network dialogue model that is end-to-end trainable, augmented with a discrete latent intention inferred from the user's utterance. The contributions of this article are as follows.

– The proposed dialogue model can be fully trained in an end-to-end manner under unsupervised, semi-supervised or reinforcement learning frameworks.

– For parameter estimation, we propose to apply exact maximum log-likelihood rather than variational inference.

– Our experiments show that exact MLE achieves a higher success rate than variational inference, with little sacrifice in BLEU.

– We further illustrate the benefit of the discrete latent variable in our model by comparing it to a fully deterministic model [21] and a model with a continuous latent variable [20].

The remaining sections of this article are organized as follows. Section 2 reviews relevant studies on dialogue models and end-to-end neural networks. Section 3 introduces the proposed methodology, including the system architecture, model training and inference. In Section 4, experiments verifying the effectiveness of the proposed method are discussed. Finally, the findings, conclusion and future research directions are given in Section 5.

2 Related work

Recent approaches apply the partially observable Markov decision process (POMDP) [29] to tackle task-oriented dialogue and use reinforcement learning for online policy optimization via user interaction [7]. These approaches need extra effort to carefully design the state and action spaces to make reinforcement policy learning tractable.

Ever since the success of end-to-end trainable neural networks in modeling chit-chat dialogues [19, 28], many efforts have been made to extend end-to-end models to task-oriented dialogues. Williams et al. [26] proposed hybrid code networks (HCN), which extract different kinds of features from the utterance as the dialogue state and apply a recurrent neural network (RNN) to track dialogue history. For answer generation, they select the final response directly from candidate responses. This model can be tuned by supervised and reinforcement learning. Madotto et al. [13] developed a new model named Mem2seq to address the issue of incorporating knowledge bases, using a pointer network to integrate a neural generative model with multi-hop attention. More recently, Zhao et al. [33] proposed a framework for latent action, which employs latent variables to model all possible actions of an end-to-end agent and presents unsupervised models for inducing the action space.

Compared to this approach, our model, consisting of encoder-decoder modules, can generate more diverse responses and can easily integrate chit-chat ability [32]. With discrete latent variables, our model can be trained under unsupervised, semi-supervised and reinforcement learning, which is preferable for industrial applications. Wen et al. [24] proposed an end-to-end trainable neural network with a discrete latent variable (LIDM). Though LIDM can be trained by unsupervised, semi-supervised and reinforcement learning like our model, it is modularly connected, which may propagate errors from upstream modules to downstream components. Dhingra et al. [5] introduced an end-to-end reinforcement learning dialogue agent for information access. Their model innovated in making the KB differentiable, introducing a soft retrieval process for KB entry selection. Such a soft-KB lookup may be sensitive to information updates in the KB and lacks scalability to large-scale databases in the real world.

In our model, we use symbolic queries and employ external services (e.g., a recommender system) to select KB entities, because entity ranking in real-world systems can utilize even richer feature sets, such as user profiles, location, and time context.

3 Methodology

3.1 System architecture

Figure 1 presents the overall architecture of our model. A continuous-form dialogue state over a sequence of turns is maintained in the state $s_t$ of a dialogue-level LSTM. At each dialogue turn $t$, this dialogue-level LSTM takes in the encoding of the user utterance $U_t$ from the bi-LSTM and the latent intention $z_t^j$ derived from the intent network. The attention decoder takes $U_t$, $z_t^j$ and $y_t$ as inputs to generate the response $r_t$, where $y_t$ is a variable that indicates whether there are items satisfying the user's constraints:

$$y_t = \begin{cases} 1, & \exists \text{ items satisfying user's constraints} \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$

For user utterance $u_t = \{w_t^1, \ldots, w_t^N\}$, we apply a single-layer biLSTM:

$$U_t^n = \mathrm{biLSTM}(w_t^n, U_t^{n-1}) \qquad (2)$$

The higher-level context RNN keeps track of past utterances by processing each utterance vector $U_t$ iteratively. The hidden state of the context RNN, $s_t$, represents a summary of the dialogue up to turn $t$:

$$s_t = \mathrm{LSTM}(s_{t-1}, U_t) \qquad (3)$$
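To make the two-level encoding in (2)-(3) concrete, below is a minimal sketch assuming TensorFlow 2 (the authors report a TensorFlow implementation, but this is our own illustrative code, not theirs; `encode_turn` and the layer sizes follow Section 4.2). Note that (3) writes the context RNN as an LSTM while Section 4.2 reports GRUs; the sketch follows Section 4.2.

```python
import tensorflow as tf

# Utterance-level encoder (Eq. 2): a single-layer biLSTM over word embeddings.
# Sizes follow Section 4.2: 100-dim embeddings, 256 hidden units per direction,
# vocabulary size 606 (Section 4.1).
embed = tf.keras.layers.Embedding(input_dim=606, output_dim=100)
utt_encoder = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(256, return_sequences=True, return_state=True))

# Dialogue-level context RNN (Eq. 3): one step per dialogue turn.
# Initialize with context_cell.get_initial_state(batch_size=..., dtype=tf.float32).
context_cell = tf.keras.layers.GRUCell(256)

def encode_turn(word_ids, context_state):
    """Encode one user utterance and update the dialogue-level state s_t."""
    x = embed(word_ids)                        # [batch, n_words, 100]
    seq, fwd_h, _, bwd_h, _ = utt_encoder(x)   # seq holds U_t^n for every n
    u_t = tf.concat([fwd_h, bwd_h], axis=-1)   # utterance vector U_t, [batch, 512]
    s_t, context_state = context_cell(u_t, context_state)
    return seq, u_t, s_t, context_state
```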

Figure 1 The architecture of the proposed end-to-end task-oriented dialogue model

Conditioning on the dialogue context $s_t$ and the current utterance vector $U_t$, we use a single MLP layer as a latent network to generate a discrete latent variable that models the user's intention, in order to handle the variability of dialogues. A latent intention $z_t^j$ (also termed an action under the umbrella of reinforcement learning) can then be sampled from the conditional distribution:

$$z_t^j \sim \pi_{\Theta_2}(z_t \mid s_t, U_t) \qquad (4)$$
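The latent network itself can be sketched in a few lines (again an illustration under our own naming, not the released implementation; we assume $d = 50$ intentions, one of the sizes used in Section 4.2):

```python
# Latent intent network (Eq. 4): a single MLP layer over [s_t; U_t] producing
# logits for a d-way categorical distribution over intentions.
intent_mlp = tf.keras.layers.Dense(50)                    # d = 50 intentions

def sample_intention(s_t, u_t):
    logits = intent_mlp(tf.concat([s_t, u_t], axis=-1))   # [batch, 50]
    z = tf.random.categorical(logits, num_samples=1)      # z_t^j ~ pi(z_t | s_t, U_t)
    return tf.squeeze(z, axis=-1), tf.nn.log_softmax(logits)
```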

For response generation, we employ an attention decoder [1]. Attention offers the decoder the capability to attend to each hidden state in the encoder and to dynamically decide their importance at every decoding step. In particular, let $U_t^n$ and $c_t^n$ denote the hidden state outputs of the encoder RNNs and the attention vector at turn $t$ and time step $n$, respectively:

$$c_t^n = \sum_i \alpha_{n,i}^t U_t^i \qquad (5)$$

where

$$\alpha_{n,i}^t = \mathrm{softmax}(U_t^i W_a h_{n-1}^t + b_a) \qquad (6)$$

and $h_{n-1}^t$ is the previous hidden state output by the decoder RNN at turn $t$ and time $n-1$, and $\{W_a, b_a\}$ are trainable parameters. The sampled intent $z_t^j$, the attention $c_t$ and the dialogue history $s_t$, as a whole, determine the response generation according to a conditional language generation model:

$$p(r_t \mid s_t, z_t^j, y_t, c_t) = \prod_l p_{\Theta_1}(r_t^{l+1} \mid r_t^l, h_t^l, c_t^l, z_t^j, y_t) \qquad (7)$$
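For concreteness, here is a sketch of one attention read, (5)-(6), in NumPy (our own helper, with shapes that are assumptions consistent with the equations, not taken from the paper's code):

```python
import numpy as np

def attention_step(U, h_prev, W_a, b_a):
    """One decoding-step attention read (Eqs. 5-6).

    U:      encoder hidden states U_t^i, shape [n_words, d]
    h_prev: previous decoder hidden state h_{n-1}^t, shape [d]
    """
    scores = U @ W_a @ h_prev + b_a      # one score per encoder position i
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # softmax over positions (Eq. 6)
    return alpha @ U                     # context vector c_t^n (Eq. 5)
```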


Thus we can obtain the complete likelihood, summing over the latent intention, with parameters $\Theta = \{\Theta_1, \Theta_2\}$:

$$p_\Theta(r_t \mid s_t, c_t, y_t) = \sum_{z_t} p_{\Theta_1}(r_t \mid s_t, z_t, y_t, c_t)\, \pi_{\Theta_2}(z_t \mid s_t, U_t) \qquad (8)$$

3.2 Model training

In this subsection, we describe how our model can be optimized with unsupervised, semi-supervised and reinforcement learning.

3.2.1 Exact MLE

Previous works such as [24, 31] also built generative models with discrete latent variables, for example a latent intention derived from the user's utterance, or a variable modeling the uncertainty of entity detection. The latent variables in these models all follow multinomial distributions. In fact, for a d-dimensional multinomial distribution, we can sum over it with d evaluations and acquire the exact log-likelihood, without resorting to a variational inference approximation. When an inference network is optimized via the ELBO, sampling is still needed, which may lose key information, because there are always gaps between the true and the approximated posterior distributions.

Thus, based on (8), the gradients of the log-likelihood are:

$$\frac{\partial \log p_\Theta(r_t \mid s_t, c_t, y_t)}{\partial \Theta_1} = \sum_{z_t} \frac{\partial p_{\Theta_1}(r_t \mid s_t, z_t, y_t, c_t)}{\partial \Theta_1} \frac{\pi_{\Theta_2}(z_t \mid s_t, U_t)}{p_\Theta(r_t \mid s_t, c_t, y_t)} \qquad (9)$$

$$\frac{\partial \log p_\Theta(r_t \mid s_t, c_t, y_t)}{\partial \Theta_2} = \sum_{z_t} \frac{\partial \pi_{\Theta_2}(z_t \mid s_t, U_t)}{\partial \Theta_2} \frac{p_{\Theta_1}(r_t \mid s_t, z_t, y_t, c_t)}{p_\Theta(r_t \mid s_t, c_t, y_t)} \qquad (10)$$

In order to compute (9) and (10), we need to calculate $p_\Theta(r_t \mid s_t, c_t, y_t)$, which is a normalizing constant. In general, we can compute it exactly, since this requires only limited computation. In practice, however, we may use several samples to approximate it:

$$p_\Theta(r_t \mid s_t, c_t, y_t) \approx \sum_j p_{\Theta_1}(r_t \mid s_t, z_t^j, y_t, c_t)\, \pi_{\Theta_2}(z_t^j \mid s_t, U_t) \qquad (11)$$
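As a minimal numerical sketch (the helper names and the per-intention log-probability inputs are our own assumptions, not the paper's code), the exact marginal in (8) and the sampled approximation in (11) can both be computed in log-space; written with a framework's differentiable ops, the same expressions let automatic differentiation produce the gradients (9)-(10) with no inference network:

```python
import numpy as np

def log_sum_exp(x):
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))   # numerically stable

def exact_log_marginal(log_p_response, log_pi):
    """log p(r_t | s_t, c_t, y_t) by enumerating all d intentions (Eq. 8).

    log_p_response[j] = log p_{Theta1}(r_t | s_t, z_t=j, y_t, c_t)
    log_pi[j]         = log pi_{Theta2}(z_t=j | s_t, U_t), normalized over j
    """
    return log_sum_exp(log_p_response + log_pi)

def sampled_log_marginal(log_p_response, log_pi, n_samples, rng):
    """Approximate the marginal with a few sampled intentions (Eq. 11)."""
    idx = rng.choice(len(log_pi), size=n_samples, replace=False,
                     p=np.exp(log_pi))
    return log_sum_exp(log_p_response[idx] + log_pi[idx])
```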

In addition, our model can also be optimized via neural variational inference learning (NVIL) [14], as used in [24]. More details can be found in [24].

3.2.2 Semi-supervision

Although exact MLE requires no labels for $z_t$, we can also apply semi-supervised learning when part of the data is labeled with intentions $z_t$. Instead of the unsupervised clustering used in [24] to infer latent intentions, we apply the Hidden Topic Markov Model (HTMM) [8] to extract the dialogue topic as the latent intention for each user utterance in each dialogue session. Thus we can automatically generate labels $z_t$ for a portion of the training examples, giving $(u_t, r_t, z_t) \in L$ and $(u_t, r_t) \in U$. Accordingly, the objective function on the unlabelled data $U$ is:

$$\mathcal{L}_1 = \sum_{(u_t, r_t) \in U} \log p_\Theta(r_t \mid s_t, c_t, y_t) = \sum_{(u_t, r_t) \in U} \log \sum_{z_t} p_{\Theta_1}(r_t \mid s_t, z_t, y_t, c_t)\, \pi_{\Theta_2}(z_t \mid s_t, U_t) \qquad (12)$$

The objective function on the partially labelled data $L$ is:

$$\mathcal{L}_2 = \sum_{(u_t, r_t, z_t) \in L} \log\left[p_{\Theta_1}(r_t \mid s_t, z_t, y_t, c_t)\, \pi_{\Theta_2}(z_t \mid s_t, U_t)\right] \qquad (13)$$

Thus the final objective function is given as:

$$\mathcal{L} = \mathcal{L}_1 + \alpha \mathcal{L}_2 \qquad (14)$$

where $\alpha$ decides the trade-off between the unsupervised and supervised examples.
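A sketch of the combined objective (12)-(14), reusing `exact_log_marginal` from above (the data-structure layout is our own assumption for illustration):

```python
def semi_supervised_objective(unlabelled, labelled, alpha):
    """L = L1 + alpha * L2 (Eqs. 12-14), to be maximized.

    unlabelled: iterable of (log_p_response, log_pi) pairs, turns in U
    labelled:   iterable of (log_p_response, log_pi, z) triples, turns in L,
                where z is the HTMM-derived intention label
    """
    l1 = sum(exact_log_marginal(lp, lpi) for lp, lpi in unlabelled)   # Eq. 12
    l2 = sum(lp[z] + lpi[z] for lp, lpi, z in labelled)               # Eq. 13
    return l1 + alpha * l2                                            # Eq. 14
```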

3.2.3 Reinforcement learning

Using the discrete latent variable as the intention leverages operational experience in controlling and refining the model's behavior.

The learned generative network $\pi_{\Theta_2}(z_t \mid s_t, U_t)$ encodes the policy discovered from the underlying data distribution; however, it is not always optimal for every specific task. Since $\pi_{\Theta_2}(z_t \mid s_t, U_t)$ is itself a parameterized policy network, whichever objective function we are interested in, we can use any policy-gradient-based reinforcement learning method [27] to fine-tune the policy.

Given an objective function of interest $J$, we optimize it via the REINFORCE algorithm [27]:

$$\frac{\partial J}{\partial \Theta_2} \approx \frac{1}{M} \sum_{m=1}^{M} R_t^{(m)} \frac{\partial \log \pi_{\Theta_2}(z_t^{(m)} \mid s_t, U_t)}{\partial \Theta_2} \qquad (15)$$
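As a sketch of (15) (the argument layout is ours; in practice a framework's autodiff would supply the per-sample score-function gradients):

```python
def reinforce_gradient(rewards, log_pi_grads):
    """Monte-Carlo estimate of dJ/dTheta_2 (Eq. 15).

    rewards[m]:      return R_t^(m) observed for the m-th sampled intention
    log_pi_grads[m]: gradient of log pi(z_t^(m) | s_t, U_t) w.r.t. Theta_2
    """
    M = len(rewards)
    return sum(r * g for r, g in zip(rewards, log_pi_grads)) / M
```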

For more information, please refer to Section 4.2 (Training details).

3.3 Inference

During inference, given only the user utterance, we first sample a latent intention and generate the response by finding $\arg\max_{z_t} \log p(r_t \mid s_t, z_t, y_t, c_t)$, where $z_t \sim \pi_{\Theta_2}(z_t \mid s_t, U_t)$. However, this computation is intensive. Alternatively, we can approximate it via beam search: we select the top-$k$ intentions $\{z_t^1, z_t^2, \ldots, z_t^k\}$ based on $\pi_{\Theta_2}(z_t \mid s_t, U_t)$, and the response is then generated via:

$$r_t^* = \arg\max_{z_t \in \{z_t^1, z_t^2, \ldots, z_t^k\}} \log p(r_t \mid s_t, z_t, y_t, c_t) \qquad (16)$$

In our experiments, we find that k = 1 (equivalent to greedy selection) already achieves good performance.
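A sketch of the top-k selection in (16) follows; `decode_fn` is a hypothetical helper, assumed to decode a response for a given intention and return it with its log-likelihood:

```python
import numpy as np

def generate_response(log_pi, decode_fn, k=1):
    """Pick the best response among the top-k intentions (Eq. 16)."""
    top_k = np.argsort(log_pi)[::-1][:k]              # k most probable intentions
    candidates = [decode_fn(int(z)) for z in top_k]   # [(response, log_lik), ...]
    return max(candidates, key=lambda pair: pair[1])[0]
```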


4 Experiments

4.1 Dataset and metrics

We evaluate our proposed model using the CamRest676 corpus collected by [18]. The task is to help users find a restaurant around Cambridge, UK. To constrain the search, three informable slots (area, price range and food) are designed. For users to inquire about information on the mentioned restaurants, six requestable slots (the aforementioned three slots plus address, phone and postcode) are defined. The dataset consists of 676 dialogues (both finished and unfinished) and around 2750 conversational turns in total.

There are 99 unique restaurants in the dataset. Following previous work [18, 24], the corpus is partitioned into training, validation and test sets at a 3:1:1 ratio. The vocabulary size is 606 after preprocessing the original CamRest676 corpus¹. We use task success rate [22] and BLEU score [17] to evaluate all models on the held-out test set. We evaluate our model with optional blocks, optimized with either the exact maximum log-likelihood or the variational lower bound. The hierarchical recurrent encoder-decoder (HRED) [21] is a deterministic counterpart of our model; it is used as a baseline to verify the effectiveness of the latent intention for modeling dialogue variability.

The other baseline, the hierarchical latent variable encoder-decoder model (VHRED) [20], is a modification of HRED with a continuous Gaussian latent variable, optimized via the ELBO. It is used to show the benefits of a discrete latent variable with exact MLE optimization. All models are also evaluated via a Web interface, with human judges recruited from a company. Each judge was asked to focus on a task and carry out a conversation with the machine. At the end of each conversation, judges rated the model's performance based on perceived comprehension ability, naturalness of responses, and subjective success rate on a scale of 1 to 5, as used in [24].

For reinforcement learning fine-tuning, the per-turn reward $R_t$ (used in Section 4.2) is defined as:

$$R_t = \eta \cdot \mathrm{sBLEU}(r_t, \hat{r}_t) + \begin{cases} 1, & r_t \text{ improves the dialogue} \\ -1, & r_t \text{ degrades the dialogue} \\ 0, & \text{otherwise} \end{cases} \qquad (17)$$
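Read as code, (17) is a simple mixture of sentence-level BLEU and a task-progress signal. A minimal sketch (ours, not the authors' code; how a turn is judged to improve or degrade the dialogue follows LIDM and is passed in here as a precomputed flag):

```python
def turn_reward(sent_bleu, progress, eta=0.5):
    """R_t = eta * sBLEU(r_t, r_hat_t) + {+1, -1, 0}  (Eq. 17).

    sent_bleu: sentence-level BLEU between generated and ground-truth response
    progress:  +1 if the turn improves the dialogue, -1 if it degrades it, else 0
    eta:       trade-off constant, set to 0.5 in Section 4.2
    """
    assert progress in (-1, 0, 1)
    return eta * sent_bleu + progress
```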

4.2 Training details

For all experiments, we set the word embedding size to 100. All encoders are bidirectional LSTMs; the forward and backward LSTMs each have 256 hidden units. The context RNN and decoder RNNs are GRUs; the hidden state size is 256 for the context GRU and 512 for the decoder GRU².

To make a direct comparison with LIDM, the latent intention size was set to 10, 30 and 50, both for the variational inference learning used in [24] (with 5 sampled intentions to calculate gradients) and for exact MLE, for which these moderate sizes keep the exact summation computationally efficient. The trade-off factor λ is set to 0.1 [10]. All models³ are trained at a learning rate of 0.0001. When further tuned with REINFORCE, we set the learning rate to 0.00001, and every mini-batch contains 32 training dialogues.

¹ Similar to LIDM, all dialogues are pre-processed by delexicalization [9]: based on the ontology, slot-value specific words are substituted with their corresponding generic tokens.
² We concatenate the hidden states of the bidirectional encoder RNN at the last step to initialize the hidden states of the decoder RNNs.
³ Except for the models trained with REINFORCE.

We optimize all models in an end-to-end fashion using Adam [11], tuning hyper-parameters and early stopping on the held-out validation set. The dropout rate in the context RNN and decoder RNNs starts at 0.5 and linearly decreases to 0.0. When fine-tuning with REINFORCE, we only optimize the policy network, keeping the parameters of the attention decoder fixed. We use the reward function $R_t$ described in LIDM (Eq. 17), where the constant $\eta$ is set to 0.5, $r_t$ is the generated response, $\hat{r}_t$ is the ground-truth response, and $\mathrm{sBLEU}(r_t, \hat{r}_t)$ is the sentence-level BLEU score. During testing, we greedily select the latent intention and feed it into the decoder RNNs for response generation. We implement our models and all experiments using TensorFlow.
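For reference, the reported hyper-parameters collect into the following summary (the key names are our own, not the authors' configuration file):

```python
TRAINING_CONFIG = {
    "word_embedding_size": 100,
    "encoder_hidden_units": 256,        # per direction, bidirectional LSTM
    "context_gru_hidden": 256,
    "decoder_gru_hidden": 512,
    "latent_intention_sizes": (10, 30, 50),
    "nvil_sampled_intentions": 5,
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "reinforce_learning_rate": 1e-5,
    "reinforce_batch_size": 32,         # dialogues per mini-batch
    "dropout": "0.5, linearly annealed to 0.0",
    "reward_eta": 0.5,
    "trade_off_lambda": 0.1,
}
```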

4.3 Experimental results

Our model was assessed through both offline and online evaluations for exact MLE and NVIL [14], with optional blocks including the database vector $y_t$ and reinforcement learning refinement. In Tables 1 and 2, the model using exact MLE achieves a higher success rate than the variational approximation, even with a lower-dimensional latent intention. This indicates that exact MLE can force the model to generate slot-related tokens, whereas a model optimized via NVIL loses detailed information due to the high variance incurred when optimizing the parameters of the inference network used in the ELBO. Regarding BLEU, however, the latter achieves better results than the former. One possible cause is that the variational lower bound optimized by NVIL is not aligned with task success rate, while exact MLE appears to align with it better. As for the impact of $y_t$ and reinforcement learning refinement, models with $y_t$ improve performance when optimized with the exact log-likelihood objective, and harm it with variational inference. Since we only sample 5 intentions to estimate parameters using NVIL, key information may be lost owing to the limited number of sampled intentions and the accuracy of the posterior approximation.

Table 1  Offline results of our model via exact MLE and NVIL optimization, varying the latent intention dimension from 10 to 50

                          Success rate                 BLEU
Models                    10       30       50         10      30      50

exact MLE
HRED                      72.5%    -        -          0.349   -       -
our model                 76.9%    66.3%    68.1%      0.293   0.295   0.296
our model, +y_t           77.5%    70.6%    81.9%      0.310   0.297   0.304
our model, +RL            78.1%    66.9%    67.5%      0.296   0.294   0.294
our model, +y_t, +RL      78.1%    70.6%    81.2%      0.309   0.299   0.303

NVIL
VHRED                     76.9%    68.8%    72.5%      0.327   0.358   0.354
our model                 71.3%    73.8%    76.3%      0.321   0.336   0.314
our model, +y_t           74.4%    73.1%    75.6%      0.323   0.312   0.320
our model, +RL            71.9%    73.8%    76.3%      0.320   0.337   0.315
our model, +y_t, +RL      75.0%    73.8%    75.0%      0.322   0.313   0.320

Values in bold denote best performance (HRED, being deterministic, has a single result independent of the latent dimension)


Table 2  Results of human evaluation, where y_t is the indicator from the KB indicating whether there are matched subjects in it. The significance test (p < 0.07) is based on a two-tailed Student's t-test of our model with optional blocks optimized via exact MLE or variational inference

Metrics                      Success rate   Naturalness   Comprehension   # of Turns

published models
HRED                         89.3%          3.98          4.01            4.60
VHRED                        88.7%          3.70          3.91            4.42
NDM                          91.5%          4.08          4.21            4.45
LIDM                         92.0%          4.40          4.29            4.54
LIDM, +RL                    93.0%          4.40          4.28            4.29

our models optimized via NVIL [14]
our model                    90.5%          4.01          4.10            4.40
our model, +y_t              90.9%          4.05          4.15            4.27
our model, +RL               92.3%          4.30          4.19            4.50
our model, +y_t, +RL         91.5%          4.12          4.14            4.30

our models optimized via exact MLE
our model                    91.5%          4.07          4.12            4.31
our model, +y_t              90.7%          4.09          4.19            4.29
our model, +RL               91.3%          4.35          4.21            4.52
our model, +y_t, +RL         91.9%          4.14          4.23            4.45

Moreover, models with reinforcement learning refinement, whether optimized by exact MLE or NVIL, show little improvement in all settings, which contradicts the results in LIDM [24]. The reason may lie in the model architecture or the definition of the reward function: our model has a totally different architecture from LIDM, making it hard to explore the space of latent intentions with a reward function similar to the one used in [24].

Compared to HRED in Table 1, our model achieves better success rates when optimized via either exact MLE or the variational lower bound, indicating that the latent variable enlarges the model's capacity for modeling the uncertainty of natural dialogues relative to the deterministic HRED. However, the fact that HRED obtains the best BLEU suggests that it may generate more fluent responses.

Table 3  Results of published models, where "+" marks results of NDM from [18] with attention and self-supervised sub-task neurons, and "*" marks results of LIDM from [24] with various hyper-parameter settings

Published models                      Success rate   BLEU

NDM+                                  76.1%          0.212
NDM+Att                               79.0%          0.224
NDM+Att+SS                            81.8%          0.240
LIDM*, I=50                           66.9%          0.238
LIDM, I=70                            61.0%          0.246
LIDM, I=100                           63.2%          0.242
LIDM, I=50, +RL                       82.4%          0.231
LIDM, I=70, +RL                       81.6%          0.230
LIDM, I=100, +RL                      84.6%          0.240
our best model, I=50, exact MLE       81.9%          0.304

Values in bold denote best performance


Table 4  Sampled dialogue results of HRED, VHRED and our model with the optional database vector y_t and reinforcement learning fine-tuning, on the test corpus, using variational inference

User:                  I need to find an expensive restaurant that's in the south section of the city.
HRED:                  the good luck chinese food takeaway is a expensive restaurant in the south area. would you like their information ?
VHRED:                 what food are you looking for ?
our model:             what food would you like ?
our model, +y_t:       what food are you looking for ?
our model, +RL:        the address is 82 Cherry Hinton Road Cherry Hinton.
our model, +y_t, +RL:  what food are you looking for ?

User:                  no i don't care about the type of cuisine.
HRED:                  the good luck chinese food takeaway serves chinese food and is in the expensive pricerange.
VHRED:                 the good luck is a expensive restaurant in the south area.
our model:             the good luck chinese food takeaway serves chinese food and is in the expensive pricerange.
our model, +y_t:       the good luck chinese food takeaway is a expensive restaurant in the south area. would you like their address ?
our model, +RL:        serve chinese food.
our model, +y_t, +RL:  the good luck chinese food takeaway serves chinese food and is in the expensive pricerange.

User:                  what is their address ?
HRED:                  82 Cherry Hinton Road Cherry Hinton
VHRED:                 82 Cherry Hinton Road Cherry Hinton
our model:             82 Cherry Hinton Road Cherry Hinton
our model, +y_t:       82 Cherry Hinton Road Cherry Hinton
our model, +RL:        82 Cherry Hinton Road Cherry Hinton
our model, +y_t, +RL:  82 Cherry Hinton Road Cherry Hinton

User:                  what is the food they serve ?
HRED:                  they serve chinese food
VHRED:                 they serve chinese food
our model:             they serve chinese food
our model, +y_t:       they serve chinese food .
our model, +RL:        they serve chinese food
our model, +y_t, +RL:  they serve chinese food

User:                  nope , that's what i was looking for . thanks a lot !
HRED:                  you're welcome .
VHRED:                 thank you
our model:             thank you for using our system . good bye
our model, +y_t:       goodbye .
our model, +RL:        thank you .
our model, +y_t, +RL:  thank you .


Compared to VHRED, the results demonstrate that a discrete latent variable is, in general, more appropriate for dialogue modeling than a continuous one following a Gaussian distribution, in terms of success rate. Like HRED, however, VHRED also achieves better BLEU than our model, owing to our model's more difficult optimization, which requires calculating a normalizing constant.

Compared to other existing models, such as [18] and [24] presented in Table 3, our models optimized with the exact log-likelihood or variational lower bound objective achieve a task success rate comparable to LIDM (LIDM, I=100, +RL) and NDM (with attention and self-supervised sub-task neurons). This shows that our model, which is end-to-end trainable, can learn a reasonable dialogue policy without the massive labeled data that NDM and LIDM need to train a dialogue state tracker.

In Tables 4 and 5, we list some samples from the test dataset. When optimized with variational inference, our model with reinforcement learning prefers generating slot-related tokens that provide restaurant information. Hence, it exhibits much greedier behavior, achieving a high task success rate with short responses.

Table 5  Sampled dialogue results of our model with the optional database vector y_t and reinforcement learning fine-tuning, on the test corpus, using exact MLE

User:                  I need to find an expensive restaurant that's in the south section of the city.
our model:             what food are you looking for ?
our model, +y_t:       what food are you looking for ?
our model, +RL:        what food are you looking for ?
our model, +y_t, +RL:  what food are you looking for ?

User:                  no i don't care about the type of cuisine .
our model:             there is a expensive chinese restaurant in the expensive pricerange. would you like the phone ?
our model, +y_t:       there is The Good Luck . would you like the information ?
our model, +RL:        there is a expensive chinese restaurant in the expensive pricerange. would you like the phone ?
our model, +y_t, +RL:  there is The Good Luck . would you like the information ?

User:                  what is their address ?
our model:             The Good Luck is located at 82 Cherry Hinton Road Cherry Hinton. is there anything else i can help you with today ?
our model, +y_t:       the address for The Good Luck is 82 Cherry Hinton Road Cherry Hinton. would you like their phone ?
our model, +RL:        there address is 82 Cherry Hinton Road Cherry Hinton , is there anything else i can help you with ?
our model, +y_t, +RL:  there address is 82 Cherry Hinton Road Cherry Hinton , is there anything else i can help you with ?

User:                  what is the food they serve ?
our model:             they serve chinese food .
our model, +y_t:       they serve chinese food .
our model, +RL:        they serve chinese food .
our model, +y_t, +RL:  they serve chinese food .

User:                  nope , that's what i was looking for . thanks a lot !
our model:             thank you .
our model, +y_t:       goodbye .
our model, +RL:        thank you .
our model, +y_t, +RL:  goodbye .


The generated responses from our model (optimized via variational inference) tend to be similar to one another. We conjecture that the distribution of learned intentions may be peaked in several cases, so the model prefers to sample those intentions, generating responses that lack diversity. Optimized via exact inference, the model not only tends to generate information relevant to the user's constraints, but also provides extra information for the user to request, which ultimately creates a better experience.

Furthermore, the model optimized via exact inference tends to generate more natural and comprehensible responses than the one optimized via variational inference. The reason is that, although the approximate posterior is accurate, variational inference only samples a subset of intentions to optimize the model. Those sampled intentions are very likely to generate slot-related tokens but may lack the basic structure of utterances, which leads to low naturalness scores. In contrast, exact inference employs all intentions, generating slot-related tokens along with some extra information, which makes responses more comprehensible.

To assess human evaluation performance, we evaluated our model with optional blocks optimized via exact or variational inference, as well as HRED and VHRED, on our Web interface⁴, recruiting human judges in a company. Each judge was asked to focus on a task and carry out a conversation with the machine. At the end of each conversation, judges rated the model's performance based on perceived comprehension ability, naturalness of responses, and subjective success rate on a scale of 1 to 5.

5 Conclusion

In conclusion, we propose a probabilistic generative dialogue model for task-oriented dialogue systems. Specifically, in order to represent the latent intentions of users and ultimately model the uncertainty of those intentions, our model uses a hierarchical recurrent neural network structure to model the historical information of the dialogue and augments it with discrete latent variables.

Our framework can be trained in a fully end-to-end, unsupervised fashion with the exact log-likelihood or the variational lower bound. Furthermore, we can extract dialogue topics using HTMM to infer labels for the latent intention, enabling semi-supervised learning. The model thus achieves better scalability and is easy to extend to other domains without excessive data labeling. Owing to its ability to model the variability of natural dialogues, augmenting the model with latent intentions leads to better generation performance. The experimental results on the corpus and the human evaluation show that our model optimized with exact MLE or NVIL prevails over state-of-the-art methods in terms of BLEU scores and achieves comparable success rates.

For future work, we would like to extend our model to multi-domain dialogue modeling [15] and augment it with chit-chat ability for a better customer experience. We may also design new reward functions for reinforcement learning [3] to boost performance. For more complicated applications, interfaces and reasoning mechanisms over a knowledge graph [4] will be designed, using reinforcement learning for reasoning on the knowledge graph.

Acknowledgements This work was supported by the Shenzhen Science and Technology Innovation Committee with the project name of Intelligent Question Answering Robot, under grant No. CKCY20170508121036342.

⁴ http://gd.xxx.ai/static/chat.html


References

1. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. Computer Science (2014)
2. Barr, A.: Natural language understanding. AI Mag. 1(1), 5 (2017)
3. Cuayahuitl, H., Yu, S., Williamson, A., Carse, J.: Deep reinforcement learning for multi-domain dialogue systems. arXiv:1611.08675 (2016)
4. Das, R., Dhuliawala, S., Zaheer, M., Vilnis, L., Durugkar, I., Krishnamurthy, A., Smola, A., McCallum, A.: Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning. arXiv:1711.05851 (2017)
5. Dhingra, B., Li, L., Li, X., Gao, J., Chen, Y.N., Ahmed, F., Deng, L.: Towards end-to-end reinforcement learning of dialogue agents for information access. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 484–495 (2017)
6. Eric, M., Manning, C.D.: A copy-augmented sequence-to-sequence architecture gives good performance on task-oriented dialogue. arXiv:1701.04024 (2017)
7. Gasic, M., Breslin, C., Henderson, M., Kim, D., Szummer, M., Thomson, B., Tsiakoulis, P., Young, S.: On-line policy optimisation of Bayesian spoken dialogue systems via human interaction. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8367–8371. IEEE (2013)
8. Gruber, A., Weiss, Y., Rosen-Zvi, M.: Hidden topic Markov models. In: Artificial Intelligence and Statistics, pp. 163–170 (2007)
9. Henderson, M., Thomson, B., Young, S.: Word-based dialog state tracking with recurrent neural networks. In: Meeting of the Special Interest Group on Discourse and Dialogue, pp. 292–299 (2014)
10. Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., Lerchner, A.: beta-VAE: Learning basic visual concepts with a constrained variational framework. In: Proceedings of International Conference on Learning Representations (ICLR) (2017)
11. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: The International Conference on Learning Representations (ICLR) (2015)
12. Liu, B., Tur, G., Hakkani-Tur, D., Shah, P., Heck, L.: End-to-end optimization of task-oriented dialogue model with deep reinforcement learning. arXiv:1711.10712 (2017)
13. Madotto, A., Wu, C.S., Fung, P.: Mem2seq: Effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. arXiv:1804.08217 (2018)
14. Mnih, A., Gregor, K.: Neural variational inference and learning in belief networks. In: Proceedings of the 31st International Conference on Machine Learning (ICML). http://arxiv.org/abs/1402.0030 (2014)
15. Mrksic, N., Seaghdha, D.O., Thomson, B., Gasic, M., Su, P.H., Vandyke, D., Wen, T.H., Young, S.: Multi-domain dialog state tracking using recurrent neural networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Vol. 2: Short Papers), pp. 794–799 (2015)
16. Mrksic, N., Seaghdha, D.O., Wen, T., Thomson, B., Young, S.J.: Neural belief tracker: Data-driven dialogue state tracking. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 1777–1788 (2017)
17. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Meeting on Association for Computational Linguistics, pp. 311–318 (2002)
18. Rojas-Barahona, L.M., Gasic, M., Mrksic, N., Su, P., Ultes, S., Wen, T., Young, S.J., Vandyke, D.: A network-based end-to-end trainable task-oriented dialogue system. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Vol. 1: Long Papers, pp. 438–449 (2017)
19. Serban, I.V., Sordoni, A., Bengio, Y., Courville, A.C., Pineau, J.: Building end-to-end dialogue systems using generative hierarchical neural network models. In: AAAI, pp. 3776–3784 (2016)
20. Serban, I.V., Sordoni, A., Lowe, R., Charlin, L., Pineau, J., Courville, A.C., Bengio, Y.: A hierarchical latent variable encoder-decoder model for generating dialogues. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI), pp. 3295–3301 (2017)
21. Sordoni, A., Bengio, Y., Vahabi, H., Lioma, C., Grue Simonsen, J., Nie, J.Y.: A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In: Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pp. 553–562. ACM (2015)
22. Su, P.H., Vandyke, D., Gasic, M., Kim, D., Mrksic, N., Wen, T.H., Young, S.: Learning from real users: Rating dialogue success with neural networks for reinforcement learning in spoken dialogue systems. In: Proceedings of the 16th Annual Conference of the International Speech Communication Association (INTERSPEECH) (2015)
23. Wen, T., Gasic, M., Mrksic, N., Su, P., Vandyke, D., Young, S.J.: Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1711–1721. http://arxiv.org/abs/1508.01745 (2015)
24. Wen, T.H., Miao, Y., Blunsom, P., Young, S.: Latent intention dialogue models. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 70, pp. 3732–3741. PMLR, International Convention Centre, Sydney, Australia (2017). http://proceedings.mlr.press/v70/wen17a.html
25. Williams, J., Raux, A., Henderson, M.: The dialog state tracking challenge series: a review. Dialogue & Discourse 7(3), 4–33 (2016)
26. Williams, J.D., Asadi, K., Zweig, G.: Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning. arXiv:1702.03274 (2017)
27. Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8(3-4), 229–256 (1992)
28. Wu, C.S., Madotto, A., Winata, G.I., Fung, P.: End-to-end dynamic query memory network for entity-value independent task-oriented dialog. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6154–6158. IEEE (2018)
29. Young, S., Gasic, M., Thomson, B., Williams, J.D.: POMDP-based statistical spoken dialog systems: A review. Proc. IEEE 101(5), 1160–1179 (2013)
30. Young, T., Cambria, E., Chaturvedi, I., Zhou, H., Biswas, S., Huang, M.: Augmenting end-to-end dialogue systems with commonsense knowledge. In: AAAI, pp. 4970–4977 (2018)
31. Zhang, Y., Dai, H., Kozareva, Z., Smola, A.J., Song, L.: Variational reasoning for question answering with knowledge graph. arXiv:1709.04071 (2017)
32. Zhao, T., Lu, A., Lee, K., Eskenazi, M.: Generative encoder-decoder models for task-oriented spoken dialog systems with chatting capability. In: 18th Annual SIGdial Meeting on Discourse and Dialogue (SIGDIAL). http://arxiv.org/abs/1706.08476 (2017)
33. Zhao, T., Xie, K., Eskenazi, M.: Rethinking action spaces for reinforcement learning in end-to-end dialog agents with latent variable models. arXiv:1902.08858 (2019)

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Affiliations

Haotian Xu¹ · Haiyun Peng² · Haoran Xie³ · Erik Cambria² · Liuyang Zhou¹ · Weiguo Zheng¹

Haotian Xu, [email protected]
Haiyun Peng, [email protected]
Haoran Xie, [email protected]
Erik Cambria, [email protected]
Weiguo Zheng, [email protected]

1 Zhiyan Technology (Shenzhen) Limited, Shenzhen, China
2 School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
3 Department of Mathematics and Information Technology, The Education University of Hong Kong, Tai Po, Hong Kong
