
Adversarial Bootstrapping for Dialogue Model Training

Oluwatobi O. Olabiyi, Erik T. Mueller, Christopher Larson, Tarek Lahlou
Capital One Conversation Research, Vienna, VA

{oluwatobi.olabiyi, erik.mueller, christopher.larson2, tarek.lahlou}@capitalone.com

Abstract

Open domain neural dialogue models, despite their successes, are known to produce responses that lack relevance, diversity, and in many cases coherence. These shortcomings stem from the limited ability of common training objectives to directly express these properties, as well as their interplay with training datasets and model architectures. Toward addressing these problems, this paper proposes bootstrapping a dialogue response generator with an adversarially trained discriminator. The method involves training a neural generator in both autoregressive and traditional teacher-forcing modes, with the maximum likelihood loss of the autoregressive outputs weighted by the score from a metric-based discriminator model. The discriminator input is a mixture of ground truth labels, the teacher-forcing outputs of the generator, and distractors sampled from the dataset, thereby allowing for richer feedback on the autoregressive outputs of the generator. To improve the calibration of the discriminator output, we also bootstrap the discriminator by matching the intermediate features of the ground truth and the generator's autoregressive output. We explore different sampling and adversarial policy optimization strategies during training in order to understand how to encourage response diversity without sacrificing relevance. Our experiments show that adversarial bootstrapping is effective at addressing exposure bias, leading to improvement in response relevance and coherence. The improvement is demonstrated with state-of-the-art results on the Movie and Ubuntu dialogue datasets with respect to human evaluations and BLEU, ROUGE, and distinct n-gram scores.

Introduction

End-to-end neural dialogue models have demonstrated the ability to generate reasonable responses to human interlocutors. However, a significant gap remains between these state-of-the-art dialogue models and human-level discourse. The fundamental problem with neural dialogue modeling is exemplified by their generic responses, such as I don't know, I'm not sure, or how are you, when conditioned on broad ranges of dialogue contexts. In addition to the limited contextual information in single-turn Seq2Seq models (Sutskever, Vinyals, and Le 2014; Vinyals and Le 2015; Li et al. 2016a), which has motivated the need for hierarchical recurrent encoder decoder (HRED) multi-turn models (Serban et al. 2016; Xing et al. 2017; Serban et al. 2017b; Serban et al. 2017a; Olabiyi et al. 2018; Olabiyi et al. 2019), previous work points to three underlying reasons why neural models fail at dialogue response generation.

i) Exposure Bias: Similar to language and machine translation models, traditional conversation models are trained with the model input taken from the ground truth rather than a previous output (a method known as teacher forcing (Williams and Zipser 1989)). During inference, however, the model uses past outputs, i.e., is used autoregressively. Interestingly, training with teacher forcing does not present a significant problem in the machine translation setting since the conditional distribution of the target given the source is well constrained. On the other hand, this is problematic in the dialogue setting since the learning task is unconstrained (Lowe et al. 2015). In particular, there are several suitable target responses per dialogue context and vice versa. This discrepancy between training and inference is known as exposure bias (Williams and Zipser 1989; Lamb et al. 2016) and significantly limits the informativeness of the responses, as the decoding error compounds rapidly during inference. Training methods that incorporate autoregressive sampling into model training have been explored to address this (Li et al. 2016b; Li et al. 2017; Yu et al. 2017; Che et al. 2017; Zhang et al. 2017; Xu et al. 2017; Zhang et al. 2018b).

ii) Training data: The inherent problem with dialogue training data, although identified, has not been particularly addressed in the literature (Sharath, Tandon, and Bauer 2017). Human conversations contain a large number of generic, uninformative responses with little or no semantic information, giving rise to a classic class-imbalance problem. This problem also exists at the word and turn level; human dialogue (Banchs 2012; Serban et al. 2017b) contains non-uniform sequence entropy that is concave with respect to the token position, with the tokens at the beginning and end of a sequence having lower entropy than those in the middle (see Fig. 1; a sketch of how this positional entropy can be measured follows this list). This initial positive entropy gradient can create learning barriers for recurrent models, and is a primary contributing factor to their short, generic outputs.

Figure 1: Positional entropy of the Movie and Ubuntu datasets. Applying a greedy training objective to the datasets can achieve low overall entropy just by overfitting to low entropy regions, resulting in short and generic responses.

iii) Training Objective: Most existing dialogue models are trained using maximum likelihood estimation (MLE) (Sutskever, Vinyals, and Le 2014; Vinyals and Le 2015; Serban et al. 2016; Xing et al. 2017) with teacher forcing, because autoregressive sampling leads to unstable training. Unfortunately, the use of MLE is incongruent with the redundant nature of dialogue datasets, exacerbates the exposure bias problem in dialogue datasets, and is the primary factor leading to uninteresting and generic responses. Alternative training frameworks that complement MLE with other constraints, such as generative adversarial networks, reinforcement learning, and variational auto-encoders that specifically encourage diversity, have been explored to overcome the limitations of the MLE objective alone (Li et al. 2016a; Li et al. 2016b; Li et al. 2017; Yu et al. 2017; Che et al. 2017; Zhang et al. 2017; Xu et al. 2017; Serban et al. 2017b; Zhang et al. 2018b; Olabiyi et al. 2018; Olabiyi et al. 2019).
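As an aside, the positional entropy plotted in Fig. 1 is straightforward to measure. The sketch below is our own illustration, not the authors' code; it assumes corpus is a list of tokenized responses and estimates the entropy of the token distribution at a given position:

```python
import math
from collections import Counter

def positional_entropy(corpus, position):
    """Entropy (in bits) of the token distribution at a given position,
    computed over all responses long enough to reach that position."""
    counts = Counter(tokens[position] for tokens in corpus
                     if len(tokens) > position)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```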

In this paper, we propose an adversarial bootstrapping framework for training dialogue models. This framework tackles the class imbalance caused by the redundancy in dialogue training data, and addresses the problem of exposure bias in dialogue models. Bootstrapping has been proposed in the literature as a way to handle data with noisy, subjective, and incomplete labels by combining cross-entropy losses from both the ground truth (i.e., teacher forcing) and model outputs (i.e., autoregression) (Reed et al. 2015; Grandvalet and Bengio 2005; Grandvalet and Bengio 2006). Here, we first extend its use to dialogue model training to encourage the generation of high variance response sequences for a given ground truth target (Reed et al. 2015). This should reduce the tendency of dialogue models to reproduce those generic and uninteresting target responses present in the training data. This is achieved by training a discriminator adversarially and using the feedback from the discriminator to weigh the cross-entropy loss from the model-predicted target. The gradient from the feedback provided by the discriminator encourages the dialogue model to generate a wide range of structured outputs. Second, we bootstrap the discriminator to improve the calibration of its output. We use the similarity between the representations of the generator's autoregressive output and the ground truth, taken from an intermediate layer of the discriminator, as an additional target for the discriminator. This further improves the diversity of the generator's output without sacrificing relevance. We apply adversarial bootstrapping to multi-turn dialogue models. Architecture-wise, we employ an HRED generator and an HRED discriminator, as depicted in Fig. 2, with a shared hierarchical recurrent encoder. In our experiments, the proposed adversarial bootstrapping demonstrates state-of-the-art performance on the Movie and Ubuntu datasets, as measured in terms of both automatic (BLEU, ROUGE, and distinct n-gram) scores and human evaluations.

Related Work

The literature on dialogue modeling, even in the multi-turn scenario, is vast (see (Serban et al. 2016; Xing et al. 2017; Serban et al. 2017b; Serban et al. 2017a; Olabiyi et al. 2018; Olabiyi, Khazan, and Mueller 2018; Olabiyi et al. 2019; Li et al. 2016b)), so in this section we focus on key relevant previous papers. The proposed adversarial bootstrapping is closely related to the use of reinforcement learning for dialogue response generation with an adversarially trained discriminator serving as a reward function (Li et al. 2017). First, we employ a different discriminator training strategy from Li et al. (2017). The negative samples for our discriminator consist of (i) the generator's deterministic teacher-forcing output and (ii) distractors sampled from the training set. This makes the discriminator's task more challenging and improves the quality of the feedback to the generator by discouraging the generation of high frequency generic responses. Also, while Li et al. (2017) sample over all the possible outputs of the generator, we take samples from the generator's top k outputs or the MAP output with Gaussian noise as additional inputs. This allows our model to explore mostly plausible trajectories during training, in contrast to Li et al. (2017), where the discriminator mostly scores the generated samples very low. The top k sampling strategy also mitigates the gradient variance problem found in the traditional policy optimization employed by Li et al. (2017). Finally, we bootstrap our discriminator with the similarity between the intermediate representations of the generator's autoregressive output and the ground truth to improve the calibration of the discriminator output.

Model

Let $\mathbf{x}_i = (x_1, x_2, \cdots, x_i)$ denote the context or conversation history up to turn $i$ and let $x_{i+1}$ denote the associated target response. Provided input-target samples $(\mathbf{x}_i, x_{i+1})$, we aim to learn a generative model $p_{\theta_G}(\mathbf{y}_i \mid \mathbf{x}_i)$ that scores representative hypotheses $\mathbf{y}_i$ given arbitrary dialogue contexts $\mathbf{x}_i$, such that responses that are indistinguishable from informative and diverse target responses are favored with high scores and all others are given low scores. Notationally, we write the collection of possible responses at turn $i$ as the set $\mathcal{Y}_i$ containing elements $\mathbf{y}_i = (y_i^1, y_i^2, \cdots, y_i^{T_i})$, where $T_i$ is the length of the $i$-th candidate response $\mathbf{y}_i$ and $y_i^t$ is the $t$-th word of that response.


Generator Bootstrapping

To achieve the goal outlined above, we propose an Adversarial Bootstrapping (AB) approach to training multi-turn dialogue models such as the one depicted in Fig. 2. The adversarial bootstrapping for the generator can be expressed according to the objective

$$\mathcal{L}_{AB}(\theta_G) = -\sum_{\mathbf{y}_i \in \mathcal{Y}_i} t_G(\mathbf{y}_i) \log p_{\theta_G}(\mathbf{y}_i \mid \mathbf{x}_i) \qquad (1)$$

where $t_G(\cdot)$ is the target variable that controls the generator training. Indeed, hard bootstrapping (Reed et al. 2015) is one such special case of (1), wherein $t_G(\mathbf{y}_i) = \beta$ for $\mathbf{y}_i = x_{i+1}$, $t_G(\mathbf{y}_i) = 1 - \beta$ for $\mathbf{y}_i = \arg\max_{\mathbf{y}_i} p_{\theta_G}(\mathbf{y}_i \mid \mathbf{x}_i)$, and $t_G(\mathbf{y}_i) = 0$ otherwise, where $\beta$ is a hyperparameter. Similarly, MLE is another special case, in which $t_G(\mathbf{y}_i) = 1$ for $\mathbf{y}_i = x_{i+1}$, and $0$ otherwise. It is reasonable to assume from these formulations that bootstrapping will outperform MLE since it does not assume all negative outputs are equally wrong.

Interestingly, Li et al. (2017) make use of the MLE setting but additionally rely on the sampling stochasticity to obtain non-zero credit assignment information from the discriminator for the generator policy updates. To avoid this inconsistency, we instead modify the generator target to

$$t_G(\mathbf{y}_i) = \begin{cases} \beta & \mathbf{y}_i = x_{i+1} \\ 0 & \mathbf{y}_i = \arg\max_{\mathbf{y}_i} p_{\theta_G}(\mathbf{y}_i \mid \mathbf{x}_i) \\ \alpha\, Q_{\theta_D}(\mathbf{y}_i, \mathbf{x}_i) & \text{otherwise} \end{cases} \qquad (2)$$

where $\alpha$ is a hyperparameter and $Q_{\theta_D}(\mathbf{y}_i, \mathbf{x}_i) \in [0, 1]$ is the bootstrapped target obtained from a neural network discriminator $D$ with parameters $\theta_D$. The first two assignments in (2) are also used in training the discriminator, in addition to the human-generated distractors, denoted $x^-_{i+1}$, from the dataset. In detail, we make use of the term

$$t_D(\mathbf{y}_i) = \begin{cases} \beta & \mathbf{y}_i = x_{i+1} \\ 0 & \mathbf{y}_i = \arg\max_{\mathbf{y}_i} p_{\theta_G}(\mathbf{y}_i \mid \mathbf{x}_i) \\ 0 & \mathbf{y}_i = x^-_{i+1} \end{cases} \qquad (3)$$

within the context of the objective function. Namely, the discriminator objective is the cross-entropy between the output and the target of the discriminator, given by

$$\mathcal{L}_{AB}(\theta_D) = -\sum_{\mathbf{y}_i \in \mathcal{Y}_i} \left[ t_D(\mathbf{y}_i) \log Q_{\theta_D}(\mathbf{y}_i, \mathbf{x}_i) + (1 - t_D(\mathbf{y}_i)) \log (1 - Q_{\theta_D}(\mathbf{y}_i, \mathbf{x}_i)) \right]. \qquad (4)$$

The inclusion of human-generated negative samples encourages the discriminator to assign low scores to high frequency, generic target responses in the dataset, thereby discouraging the generator from reproducing them.
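To make the interplay of (1)-(4) concrete, the following sketch (our own illustration, not our released implementation; names and shapes are assumptions) computes the bootstrapped targets and both losses for a batch of candidate responses, given boolean masks marking which candidates equal the ground truth or the generator's greedy output, utterance-level discriminator scores q, and generator log-likelihoods log_p:

```python
import torch

def generator_target(is_ground_truth, is_greedy, q, alpha=1.0, beta=1.0):
    # Eq. (2): beta for the ground truth, 0 for the generator's greedy
    # (argmax) output, and alpha * Q_theta_D(y_i, x_i) otherwise.
    t = alpha * q
    t = torch.where(is_greedy, torch.zeros_like(t), t)
    t = torch.where(is_ground_truth, torch.full_like(t, beta), t)
    return t

def generator_loss(t_g, log_p):
    # Eq. (1): negative log-likelihood weighted by the bootstrapped target.
    return -(t_g * log_p).sum()

def discriminator_loss(t_d, q, eps=1e-8):
    # Eq. (4): cross-entropy between the discriminator output Q and the
    # target t_D of Eq. (3) (beta for the ground truth, 0 for both the
    # teacher-forcing output and the sampled distractors x^-_{i+1}).
    return -(t_d * torch.log(q + eps)
             + (1.0 - t_d) * torch.log(1.0 - q + eps)).sum()
```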

Discriminator Bootstrapping

In addition to bootstrapping the generator with the discriminator, we can also bootstrap the discriminator using a similarity measure, $S(\cdot, \cdot) \in [0, 1]$, between latent representations of the sampled generator outputs, $\mathbf{y}_i \sim p_{\theta_G}(\mathbf{y}_i \mid \mathbf{x}_i)$, and of the ground truth, both encoded by the discriminator, i.e.,

$$t_D(\mathbf{y}_i) = S(h_D(\mathbf{y}_i), h_D(x_{i+1})), \quad \mathbf{y}_i \sim p_{\theta_G}(\mathbf{y}_i \mid \mathbf{x}_i) \qquad (5)$$

In our experiments, we chose the cosine similarity metric for $S(\cdot, \cdot)$ and the output of the discriminator before the logit projection for $h_D$. This helps to better calibrate the discriminator's judgment of the generator's outputs.
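A minimal sketch of (5) under stated assumptions follows; h_y and h_x are the discriminator's pre-logit feature vectors for a sampled response and for the ground truth. Since raw cosine similarity lies in [-1, 1], the clamp into [0, 1] is our assumption to match the required range of S:

```python
import torch.nn.functional as F

def bootstrapped_discriminator_target(h_y, h_x):
    # Eq. (5): similarity between discriminator features of a sampled
    # response, h_D(y_i), and of the ground truth, h_D(x_{i+1}).
    # Clamping into [0, 1] is our assumption (cosine is in [-1, 1]).
    return F.cosine_similarity(h_y, h_x, dim=-1).clamp(min=0.0)
```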

Sampling Strategy

To backpropagate the learning signal for the case where $t_G(\mathbf{y}_i) = \alpha\, Q_{\theta_D}(\mathbf{y}_i, \mathbf{x}_i)$, we explore both stochastic and deterministic policy gradient methods. For stochastic policies, we approximate the gradient of $\mathcal{L}_{AB}(\theta_G)$ w.r.t. $\theta_G$ by Monte Carlo samples using the REINFORCE policy gradient method (Li et al. 2017; Glynn 1990; Williams 1992):

$$\nabla_{\theta_G} \mathcal{L}_{AB}(\theta_G) \approx \mathbb{E}_{p_{\theta_G}(\mathbf{y}_i \mid \mathbf{x}_i)}\, Q_{\theta_D}(\mathbf{y}_i, \mathbf{x}_i) \cdot \nabla_{\theta_G} \log p_{\theta_G}(\mathbf{y}_i \mid \mathbf{x}_i). \qquad (6)$$

For deterministic policies, we approximate the gradient according to (Silver et al. 2014; Zhang et al. 2018b)

$$\nabla_{\theta_G} \mathcal{L}_{AB}(\theta_G) \approx \mathbb{E}_{p(\mathbf{z}_i)}\, \nabla_{\mathbf{y}_{\max}} Q_{\theta_D}(\mathbf{y}_{\max}, \mathbf{x}_i) \cdot \nabla_{\theta_G} \log p_{\theta_G}(\mathbf{y}_{\max} \mid \mathbf{x}_i, \mathbf{z}_i) \qquad (7)$$

where $\mathbf{y}_{\max} = \arg\max_{\mathbf{y}_i} p_{\theta_G}(\mathbf{y}_i \mid \mathbf{x}_i, \mathbf{z}_i)$ and $\mathbf{z}_i \sim \mathcal{N}(0, \mathbf{I})$ is the source of randomness. We denote the model trained with (7) as aBoots_gau. To reduce the variance of (6), we propose a novel approach of sampling from the top k generator outputs using (i) a categorical distribution based on the output logits (aBoots_cat), similar to the treatment of Radford et al. (2019), and (ii) a uniform distribution (aBoots_uni), where top k is a hyperparameter. This is especially useful for dialogue modeling with large vocabulary sizes.
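A minimal sketch of the per-step top-k sampling used by the stochastic policy, assuming decoder logits for one step (function and argument names are ours, not released code):

```python
import torch

def sample_top_k(logits, k=10, uniform=False):
    """Sample a token id from the generator's top-k outputs.

    uniform=False: categorical over renormalized top-k logits (aBoots_cat).
    uniform=True:  uniform over the top-k candidates (aBoots_uni).
    logits: tensor of shape (..., V).
    """
    top_logits, top_idx = torch.topk(logits, k, dim=-1)
    if uniform:
        choice = torch.randint(k, top_idx.shape[:-1] + (1,))
    else:
        probs = torch.softmax(top_logits, dim=-1)
        choice = torch.multinomial(probs, num_samples=1)
    return top_idx.gather(-1, choice).squeeze(-1)
```

Restricting the sample space to the top k candidates keeps the REINFORCE updates of (6) on plausible trajectories, which is the variance-reduction effect described above.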

Encoder

Referring to the network architecture in Fig. 2, the generator and discriminator share the same encoder. The encoder uses two RNNs to handle multi-turn representations, similar to the approach of Serban et al. (2016). First, during turn $i$, a bidirectional encoder RNN, $eRNN(\cdot)$, with an initial state of $h_i^0$, maps the conversation context $x_i$, comprising the sequence of input symbols $(x_i^1, x_i^2, \cdots, x_i^{J_i})$, where $J_i$ is the sequence length, into a sequence of hidden state vectors $\{e_i^j\}_{j=1}^{J_i}$ according to

$$e_i^j = eRNN(E(x_i^j), e_i^{j-1}), \quad j = 1, \cdots, J_i \qquad (8)$$

where $E(\cdot)$ is the embedding lookup and $E \in \mathbb{R}^{h_{dim} \times V}$ is the embedding matrix with dimension $h_{dim}$ and vocabulary size $V$. The vector representation of the input sequence $x_i$ is the L2 pooling over the encoded sequence $\{e_i^j\}_{j=1}^{J_i}$ (Serban et al. 2016). In addition, we use the output sequence as an attention memory for the generator, as depicted in Fig. 2. This is done to improve the relevance of the generated response.

To capture $\mathbf{x}_i$, we use a unidirectional context RNN, $cRNN(\cdot)$, to combine the past dialogue context $h_{i-1}$ with the L2 pooling of the encoded sequence as

$$h_i = cRNN(L2(\{e_i^j\}_{j=1}^{J_i}), h_{i-1}). \qquad (9)$$
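A condensed sketch of this hierarchical encoder, (8)-(9), is given below. It is illustrative rather than the actual implementation: single-layer GRUs are used for brevity (the Training section specifies 3-layer GRUs), and the L2 pooling is read here as a root-mean-square over time, which is our assumption:

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Utterance-level eRNN (Eq. (8)) plus dialogue-level cRNN (Eq. (9))."""

    def __init__(self, vocab_size=50000, hdim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hdim)  # E(.)
        self.eRNN = nn.GRU(hdim, hdim, batch_first=True, bidirectional=True)
        self.cRNN = nn.GRU(2 * hdim, hdim, batch_first=True)

    def forward(self, utterance_tokens, h_prev=None):
        # Eq. (8): encode the current utterance into states {e_i^j}.
        e, _ = self.eRNN(self.embed(utterance_tokens))  # (B, J_i, 2*hdim)
        # L2 pooling over time (root mean square; our assumption).
        pooled = torch.sqrt((e ** 2).mean(dim=1))       # (B, 2*hdim)
        # Eq. (9): fold the pooled utterance into the context state h_i.
        _, h = self.cRNN(pooled.unsqueeze(1), h_prev)   # h: (1, B, hdim)
        return e, h  # e doubles as the attention memory for the generator
```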


Figure 2: A multi-turn recurrent architecture with adversarial bootstrapping: The generator and discriminator share the same encoder (through the context state) and the same word embeddings. The generator also uses the word embeddings as the output projection weights. The encoder and the discriminator RNNs are bidirectional, while the context and generator RNNs are unidirectional.

Generator

The generator, denoted $gRNN(\cdot)$, is a unidirectional decoder RNN with an attention mechanism (Bahdanau, Cho, and Bengio 2015; Luong et al. 2015). Similar to Serban et al. (2016), the decoder RNN is initialized with the last state of the context RNN. The generator outputs a hidden state representation $g_i^j$ for each previous token $\chi^{j-1}$ according to

$$g_i^j = gRNN(E(\chi^{j-1}), g_i^{j-1}, a_i^j, h_i) \qquad (10)$$

where $a_i^j$ is the attention over the encoded sequence $\{e_i^j\}_{j=1}^{J_i}$. When the generator is run in teacher-forcing mode, as is typically done during training, the previous token from the ground truth is used, i.e., $\chi = x_{i+1}$. During inference (autoregressive mode), the generator's previous decoded output is used, i.e., $\chi = \mathbf{y}_i$.

The decoder hidden state $g_i^j$ is mapped to a probability distribution, typically through a logistic layer $\sigma(\cdot)$, yielding

$$p_{\theta_G}(y_i^j \mid \chi^{1:j-1}, \mathbf{x}_i) = \mathrm{softmax}(\sigma(g_i^j)/\tau) \qquad (11)$$

where $\tau$ is a hyperparameter, $\sigma(g_i^j) = E \cdot g_i^j + b_g$, and $b_g \in \mathbb{R}^{1 \times V}$ is the logit bias. The generative model can then be derived as:

$$p_{\theta_G}(\mathbf{y}_i \mid \mathbf{x}_i) = p_{\theta_G}(y_i^1 \mid \mathbf{x}_i) \prod_{j=2}^{T_i} p_{\theta_G}(y_i^j \mid \chi^{1:j-1}, \mathbf{x}_i) \qquad (12)$$
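A sketch of the output layer of (11), assuming the (V, hdim) embedding-matrix layout common in deep learning frameworks; per (12), the sequence likelihood is then the product of these per-token distributions:

```python
import torch

def output_distribution(g_j, E, b_g, tau=1.0):
    # Eq. (11): the word embedding matrix E, stored here with shape
    # (V, hdim), doubles as the output projection (see Fig. 2).
    logits = g_j @ E.t() + b_g             # sigma(g_i^j), shape (B, V)
    return torch.softmax(logits / tau, dim=-1)
```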

Discriminator

The discriminator $Q_{\theta_D}(\mathbf{y}_i, \mathbf{x}_i)$ is a binary classifier that takes as input a response sequence $\mathbf{y}_i$ and a dialogue context $\mathbf{x}_i$ and is trained with the output labels provided in (3) and (5). The discriminator, as shown in Fig. 2, is an RNN, $dRNN(\cdot)$, that shares the hierarchical encoder and the word embeddings with the generator, with its initial state being the final state of the context RNN. The last layer of the discriminator RNN is fed to a logistic layer and a sigmoid function to produce the normalized $Q$ (action-value function) value for a pair of dialogue context (state) and response (action).

We explore two options for estimating the $Q$ value, i.e., at the word or utterance level. At the utterance level, we use (4) in conjunction with a unidirectional discriminator RNN. The $Q$ value is calculated using the last output of $dRNN(\cdot)$, i.e.,

$$Q_{\theta_D}(\mathbf{y}_i, \mathbf{x}_i) = \mathrm{sigmoid}(\sigma(d_i^{T_i})) \qquad (13)$$

where $\sigma(d_i^{T_i}) = W_d \cdot d_i^{T_i} + b_d$, with $W_d$ and $b_d$ the logit projection and bias, respectively. At the word level, the discriminator RNN (we use a bidirectional RNN in our implementation) produces a word-level evaluation. The normalized $Q$ value and the adversarial bootstrapping objective function are then respectively given by

$$d_i^j = Q_{\theta_D}(y_i^j, \mathbf{x}_i \mid \mathbf{y}_i) = \mathrm{sigmoid}(\sigma(d_i^j)) \qquad (14)$$

$$\mathcal{L}_{AB}(\theta_G) = -\sum_{\mathbf{y}_i \in \mathcal{Y}_i} \sum_{j=1}^{T_i} t_G(y_i^j) \log p_{\theta_G}(y_i^j \mid \mathbf{x}_i) \qquad (15)$$

where

$$t_G(y_i^j) = \begin{cases} \beta & y_i^j = x_{i+1}^j \\ 0 & y_i^j = \arg\max_{y_i^j} p_{\theta_G}(y_i^j \mid \chi^{1:j-1}, \mathbf{x}_i) \\ (1-\beta)\, d_i^j & \text{otherwise} \end{cases} \qquad (16)$$
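The word-level variant differs from the utterance-level sketch shown earlier only in that targets and losses are per token. A minimal sketch under the same assumptions (names and shapes are ours):

```python
import torch

def word_level_generator_loss(log_p, d, is_ground_truth, is_greedy, beta=1.0):
    # Eqs. (15)-(16): per-token bootstrapped cross-entropy. log_p and the
    # word-level discriminator scores d (Eq. (14)) have shape (B, T), as do
    # the boolean masks for the first two cases of Eq. (16).
    t = (1.0 - beta) * d
    t = torch.where(is_greedy, torch.zeros_like(t), t)
    t = torch.where(is_ground_truth, torch.full_like(t, beta), t)
    return -(t * log_p).sum()
```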


Training

We train both the generator and discriminator simultaneously, with two samples for the generator and three for the discriminator. In all our experiments, we use the generator's teacher-forcing outputs to train the discriminator (i.e., the argmax cases of (2) and (3)). The encoder parameters are included with the generator, i.e., we did not update the encoder during discriminator updates. Each RNN is a 3-layer GRU cell with a hidden state size (hdim) of 512. The word embedding size is the same as hdim, and the vocabulary size V is 50,000. Other hyperparameters include β = 1, α = 1, τ = 1, and top k = 10 for aBoots_uni, aBoots_cat, and aBoots_gau. Although we used a single top k value during training, we avoided training with multiple top k values by instead searching for the optimum top k (between 1 and 20) on the validation set using the BLEU score, and we used the obtained optimum values for inference. Other training parameters are as follows: the initial learning rate is 0.5, with a decay rate factor of 0.99 applied when the generator loss has increased over two iterations. We use a batch size of 64 and clip gradients at 5.0. All parameters are initialized with Xavier uniform random initialization (Glorot and Bengio 2010). Due to the large vocabulary size, we use a sampled softmax loss (Jean et al. 2015) for the generator to limit the GPU memory requirements and expedite the training process. However, we use the full softmax for evaluation. The model is trained end-to-end using the stochastic gradient descent algorithm.
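For reference, the stated hyperparameters can be collected into a single configuration; this is our condensation of the paragraph above, not released code:

```python
# Training configuration as stated in the text (optimizer and schedule
# details beyond those stated are omitted).
config = dict(
    rnn_layers=3,        # 3-layer GRU cells throughout
    hdim=512,            # hidden state and word embedding size
    vocab_size=50000,
    alpha=1.0, beta=1.0, tau=1.0,
    top_k=10,            # optimum searched between 1 and 20 on validation BLEU
    learning_rate=0.5,   # SGD, decayed by 0.99 when generator loss rises
    batch_size=64,
    grad_clip=5.0,
)
```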

Experiments

Setup

We evaluated the proposed adversarial bootstrapping (aBoots), with both generator and discriminator bootstrapping, on the Movie Triples and Ubuntu Dialogue corpora, randomly split into training, validation, and test sets using 90%, 5%, and 5% proportions. We performed minimal preprocessing of the datasets by replacing all words except the top 50,000 most frequent words with an UNK symbol. The Movie dataset (Serban et al. 2016) spans a wide range of topics with few spelling mistakes and contains about 240,000 dialogue triples, which makes it suitable for studying the relevance vs. diversity tradeoff in multi-turn conversations. The Ubuntu dataset, extracted from the Ubuntu Internet Relay Chat channel (Serban et al. 2017b), contains about 1.85 million conversations with an average of 5 utterances per conversation. This dataset is ideal for training dialogue models that can provide expert knowledge/recommendation in domain-specific conversations.

We explore different variants of aBoots along the choice of discrimination level (either word (_w) or utterance (_u)) and sampling strategy (either uniform (_uni), categorical (_cat), or with Gaussian noise (_gau)). We compare their performance with existing state-of-the-art dialogue models, including (V)HRED¹ (Serban et al. 2016; Serban et al. 2017b) and DAIM² (Zhang et al. 2018b). For completeness, we also include results from a transformer-based Seq2Seq model (Vaswani et al. 2017).

¹ Implementation obtained from https://github.com/julianser/hed-dlg-truncated
² Implementation obtained from https://github.com/dreasysnail/converse_GAN

We compare the performance of the models based on the informativeness (a combination of relevance and diversity metrics) of the generated responses. For relevance, we adopted the BLEU-2 (Papineni et al. 2002) and ROUGE-2 (Lin 2014) scores. For diversity, we adopted the distinct unigram (DIST-1) and bigram (DIST-2) (Li et al. 2016a) scores, as well as the normalized average sequence length (NASL) score (Olabiyi et al. 2018).
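DIST-n is the ratio of distinct n-grams to the total number of generated n-grams; a minimal implementation of this standard metric (our own sketch) is:

```python
def distinct_n(responses, n):
    # DIST-n: fraction of distinct n-grams among all n-grams in the
    # generated responses; each response is a list of tokens.
    ngrams = [tuple(tokens[i:i + n])
              for tokens in responses
              for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)
```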

For human evaluation, we follow a similar setup as Li et al. (2016a), employing crowd-sourced judges to evaluate a random selection of 200 samples. We present both the multi-turn context and the generated responses from the models to 3 judges and ask them to rank the response quality in terms of informativeness. Ties are not allowed. The informativeness measure captures the temporal appropriateness, i.e., the degree to which the generated response is temporally and semantically appropriate for the dialogue context, as well as other factors such as the length of the response and repetitions. For analysis, we pair the models and compute the average number of times each model is ranked higher than the other.

Results and Discussion

Quantitative Evaluation

The quantitative measures reported in Table 1 show that adversarial bootstrapping (aBoots) gives the best overall relevance and diversity performance in comparison to (V)HRED, hredGAN, DAIM, and Transformer on both the Movie and Ubuntu datasets. We believe that the combination of improved discriminator training and the policy-based objective is responsible for the observed performance improvement. On the other hand, the multi-turn models (V)HRED and hredGAN suffer performance loss due to exposure bias, since autoregressive sampling is not included in their training. Although DAIM uses autoregressive sampling, its poor performance shows the limitation of the single-turn architecture and GAN objective compared to the multi-turn architecture and policy-based objective in aBoots. The transformer Seq2Seq model, which performs better than RNNs on the machine translation task, also suffers from exposure bias and overfits very quickly to the low entropy regions in the data, which leads to poor inference performance. Also, the results from the aBoots models indicate that word-level discrimination performs better than utterance-level discrimination, consistent with the results reported by Olabiyi et al. (2018) for the hredGAN model. While it is difficult to identify why some models generate very long responses, we observe that models with Gaussian noise inputs (e.g., hredGAN and aBoots_gau) may be using the latent Gaussian distribution to better encode response length information; indeed, this is an area of ongoing work. Among the variants of aBoots, we observe that models trained with a stochastic policy, aBoots_uni and aBoots_cat, outperform



Table 1: Automatic evaluation of generator performance

                 Movie                                            Ubuntu
Model            BLEU    ROUGE   DIST-1/2        NASL     BLEU    ROUGE   DIST-1/2        NASL
HRED             0.0474  0.0384  0.0026/0.0056   0.535    0.0177  0.0483  0.0203/0.0466   0.892
VHRED            0.0606  0.1181  0.0048/0.0163   0.831    0.0171  0.0855  0.0297/0.0890   0.873
hredGAN_u        0.0493  0.2416  0.0167/0.1306   0.884    0.0137  0.0716  0.0260/0.0847   1.379
hredGAN_w        0.0613  0.3244  0.0179/0.1720   1.540    0.0216  0.1168  0.0516/0.1821   1.098
DAIM             0.0155  0.0077  0.0005/0.0006   0.721    0.0015  0.0131  0.0013/0.0048   1.626
Transformer      0.0360  0.0760  0.0107/0.0243   1.602    0.0030  0.0384  0.0465/0.0949   0.566
aBoots_u_gau     0.0642  0.3326  0.0526/0.2475   0.764    0.0115  0.2064  0.1151/0.4188   0.819
aBoots_w_gau     0.0749  0.3755  0.0621/0.3051   0.874    0.0107  0.1712  0.1695/0.7661   1.235
aBoots_u_uni     0.0910  0.4015  0.0660/0.3677   0.975    0.0156  0.1851  0.0989/0.4181   0.970
aBoots_w_uni     0.0902  0.4048  0.0672/0.3653   0.972    0.0143  0.1984  0.1214/0.5443   1.176
aBoots_u_cat     0.0880  0.4063  0.0624/0.3417   0.918    0.0210  0.1491  0.0523/0.1795   1.040
aBoots_w_cat     0.0940  0.3973  0.0613/0.3476   1.016    0.0233  0.2292  0.1288/0.5190   1.208

Table 2: Human evaluation of generator performance based on response informativeness

Model Pair                    Movie            Ubuntu
aBoots_w_cat – DAIM           0.957 – 0.043    0.960 – 0.040
aBoots_w_cat – HRED           0.645 – 0.355    0.770 – 0.230
aBoots_w_cat – VHRED          0.610 – 0.390    0.746 – 0.254
aBoots_w_cat – hredGAN_w      0.550 – 0.450    0.556 – 0.444

those trained with a deterministic policy, aBoots_gau. Notably, we find that for the stochastic policy there is a tradeoff in relevance and diversity between top k categorical and uniform sampling. Categorical sampling tends to perform better on relevance but worse on diversity. We believe this is because top k categorical sampling causes the generator to exploit high likelihood outputs (i.e., those more likely to be encountered during inference) more than uniform sampling of the top candidates does, while still allowing the policy to explore. This, however, comes with some loss of diversity, although not a significant one. Overall, the automatic evaluation indicates that adversarial bootstrapping trained with a stochastic policy using the top k categorical sampling strategy gives the best performance.

Qualitative Evaluation

As part of our evaluation, we also consider scores from human judges. Specifically, we have each evaluator compare responses from five models: aBoots_w_cat, hredGAN_w, (V)HRED, and DAIM. The pairwise human preferences are reported in Table 2. These data indicate a significant preference for responses generated by aBoots_w_cat compared to both (V)HRED and DAIM. We observe that aBoots_w_cat is preferred over hredGAN_w on average, although not by a significant margin. We note that this score was computed from only 200 evaluation samples, which is likely too small to demonstrate a strong preference for aBoots_w_cat. It is also worth noting that the hredGAN_w model represents a strong baseline, based on previous human evaluations (Olabiyi et al. 2018), against which to compare our adversarially trained models. It is interesting to note that although automatic evaluation scores hredGAN_w much lower than aBoots_w_cat on relevance, the long response length of hredGAN_w, which indicates strong diversity, has a considerable impact on how human evaluators judge the informativeness of its responses. Table 4 shows example responses from the models.

Ablation Studies

In this section, we examine the effect of partial bootstrapping on model performance. Here, the target in (5) is excluded from the discriminator. The automatic evaluation results on all the variants of aBoots are reported in Table 3. The table shows that generator models bootstrapped by a discriminator that is not itself bootstrapped generally perform worse than ones with a bootstrapped discriminator. The improvement is particularly evident in the best performing variant, aBoots_w_cat. We attribute this performance improvement to the better calibration of the discriminator obtained from bootstrapping the discriminator output with the similarity measure between the generator's autoregressive output and the ground truth during training.

Conclusion

We have proposed a novel training technique, adversarial bootstrapping, which is useful for dialogue modeling. The method addresses the issues of data-induced redundancy and exposure bias in dialogue models trained with maximum likelihood. This is achieved by bootstrapping the teacher-forcing MLE objective with feedback on autoregressive outputs from an adversarially trained discriminator. This feedback discourages the generator from producing bland and generic responses that are characteristic of MLE training. Experimental results indicate that a doubly bootstrapped system produces better performance than a system where


Table 3: Automatic evaluation of aBoots models with generator bootstrapping only

                 Movie                                            Ubuntu
Model            BLEU    ROUGE   DIST-1/2        NASL     BLEU    ROUGE   DIST-1/2        NASL
aBoots_g_u_gau   0.0638  0.3193  0.0498/0.2286   0.778    0.0150  0.1298  0.0480/0.1985   0.960
aBoots_g_w_gau   0.0729  0.3678  0.0562/0.3049   1.060    0.0123  0.1370  0.0646/0.1820   0.841
aBoots_g_u_uni   0.0801  0.3972  0.0655/0.3414   0.869    0.0124  0.1424  0.0636/0.1853   0.870
aBoots_g_w_uni   0.0860  0.4046  0.0671/0.3514   0.838    0.0170  0.2049  0.1074/0.4646   1.349
aBoots_g_u_cat   0.0836  0.3887  0.0597/0.3276   0.917    0.0131  0.1214  0.0597/0.3276   1.060
aBoots_g_w_cat   0.0928  0.4029  0.0613/0.3358   0.976    0.0202  0.2343  0.1254/0.4805   0.873

Table 4: Sample responses of DAIM, (V)HRED, hredGAN, and aBoots_w_cat.

Movie

Context_0:    we ' re grateful , <person> . it ' s not that . those craftsman tools . they ' re a dream come true .
Context_1:    your dad and i were raised with a certain work ethic , that ' s all .
HRED:         <person> .
VHRED:        <person> , mr . <person> . <person> , i ' m not going to tell you , <person> . i ' m sorry . i ' m sorry , <person> . i ' m sorry
hredGAN_w:    <person> , i know you ' re see . <continued utterance> <person> , please .
aBoots_w_cat: <person> , i ' m a little worried about my future . i ' m not here to marry you .

Context_0:    where would they be going ?
Context_1:    we ' ve got two choices . one of them is wrong .
HRED:         i ' m sorry , <person> .
VHRED:        i know .
hredGAN_w:    <person> , i know , it ' s a real hair . with you a little little little , little little little little somethin
aBoots_w_cat: i don ' t care where they are , i don ' t know anyone will be outside .

Ubuntu

Context_0:    DJones: update manager won't detect 12.04.1 as a new version if you are already running 12.04, because 12.04.1 = 12.04 + lots of package updates
Context_1:    DJones I am running 10.04.
Context_2:    what does 'gksu update-manager c' report?
HRED:         I don't know what the problem is
VHRED:        you mean ?
hredGAN_w:    yes , I get the new one , I just want to get the new one
aBoots_w_cat: it says that command not found

Context_0:    libgnome2-dev is the package that has that pkgconfig file
Context_1:    After I installed libgnome2-dev I got a different output. http://pastebin.com/zpDeGs81 sorry it is so long.
Context_2:    I thought that was what was going on thanks. :) I ran the command ./configure && sudo make && sudo make install would that have worked.
HRED:         I don't know what the problem is
VHRED:        what do you mean ? eou what are you trying ?
hredGAN_w:    I have no idea , I can just copy the files and copy files to the other one
aBoots_w_cat: yes I did . I just did sudo apt-get update and it worked

only the generator is bootstrapped. Also, the model variant characterized by top k categorical sampling, stochastic policy optimization, and word-level discrimination gives the best performance. The results demonstrate that the proposed method leads to models generating more relevant and diverse responses in comparison to existing methods.

References

[Bahdanau, Cho, and Bengio 2015] Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR 2015).

[Banchs 2012] Banchs, R. E. 2012. Movie-DiC: A movie dialogue corpus for research and development. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, 203–207.

[Che et al. 2017] Che, T.; Li, Y.; Zhang, R.; Hjelm, R. D.; Li, W.; Song, Y.; and Bengio, Y. 2017. Maximum-likelihood augmented discrete generative adversarial networks. arXiv preprint arXiv:1702.07983.

[Glorot and Bengio 2010] Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics.

[Glynn 1990] Glynn, P. W. 1990. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM 33(10):75–84.

[Grandvalet and Bengio 2005] Grandvalet, Y., and Bengio, Y. 2005. Semi-supervised learning by entropy minimization. In NIPS, 529–536.

[Grandvalet and Bengio 2006] Grandvalet, Y., and Bengio, Y. 2006. Entropy regularization. In Semi-Supervised Learning. MIT Press. doi:10.7551/mitpress/9780262033589.003.0009.

[Graves et al. 2016] Graves, A.; Wayne, G.; Reynolds, M.; Harley, T.; Danihelka, I.; Grabska-Barwińska, A.; Colmenarejo, S. G.; Grefenstette, E.; Ramalho, T.; Agapiou, J.; Badia, A. P.; Hermann, K. M.; Zwols, Y.; Ostrovski, G.; Cain, A.; King, H.; Summerfield, C.; Blunsom, P.; Kavukcuoglu, K.; and Hassabis, D. 2016. Hybrid computing using a neural network with dynamic external memory. Nature 538:471–476.

[Graves, Wayne, and Danihelka 2014] Graves, A.; Wayne, G.; and Danihelka, I. 2014. Neural Turing machines. arXiv preprint arXiv:1410.5401.

[Jean et al. 2015] Jean, S.; Cho, K.; Memisevic, R.; and Bengio, Y. 2015. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007.

[Kulis 2013] Kulis, B. 2013. Metric learning: A survey. Foundations and Trends in Machine Learning 5(4):287–364.

[Lamb et al. 2016] Lamb, A.; Goyal, A.; Zhang, Y.; Zhang, S.; Courville, A.; and Bengio, Y. 2016. Professor forcing: A new algorithm for training recurrent networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS 2016).

[Li et al. 2016a] Li, J.; Galley, M.; Brockett, C.; Gao, J.; and Dolan, B. 2016a. A diversity-promoting objective function for neural conversation models. In Proceedings of NAACL-HLT.


[Li et al. 2016b] Li, J.; Monroe, W.; Ritter, A.; Galley, M.; Gao, J.; and Jurafsky, D. 2016b. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541v4.

[Li et al. 2017] Li, J.; Monroe, W.; Shi, T.; Ritter, A.; and Jurafsky, D. 2017. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547.

[Lin 2014] Lin, C. Y. 2014. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out.

[Lowe et al. 2015] Lowe, R.; Pow, N.; Serban, I.; and Pineau, J. 2015. The Ubuntu Dialogue Corpus: A large dataset for research in unstructured multi-turn dialogue systems. In SIGDIAL.

[Luong et al. 2015] Luong, M. T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W. 2015. Addressing the rare word problem in neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics.

[Nakamura et al. 2019] Nakamura, R.; Sudoh, K.; Yoshino, K.; and Nakamura, S. 2019. Another diversity-promoting objective function for neural dialogue generation. In AAAI Workshop on Reasoning and Learning for Human-Machine Dialogues (DEEP-DIAL).

[Olabiyi et al. 2018] Olabiyi, O.; Salimov, A.; Khazane, A.; and Mueller, E. 2018. Multi-turn dialogue response generation in an adversarial learning framework. arXiv preprint arXiv:1805.11752.

[Olabiyi et al. 2019] Olabiyi, O.; Khazan, A.; Salimov, A.; and Mueller, E. 2019. An adversarial learning framework for a persona-based multi-turn dialogue model. In NAACL NeuralGen Workshop.

[Olabiyi, Khazan, and Mueller 2018] Olabiyi, O.; Khazan, A.; and Mueller, E. 2018. An adversarial learning framework for a persona-based multi-turn dialogue model. In 17th IEEE International Conference on Machine Learning and Applications (ICMLA).

[Papineni et al. 2002] Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318.

[Radford et al. 2019] Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language models are unsupervised multitask learners. https://d4mucfpksywv.cloudfront.net/better-language-models.

[Reed et al. 2015] Reed, S.; Lee, H.; Anguelov, D.; Szegedy, C.; Erhan, D.; and Rabinovich, A. 2015. Training deep neural networks on noisy labels with bootstrapping. In ICLR.

[Serban et al. 2016] Serban, I.; Sordoni, A.; Bengio, Y.; Courville, A.; and Pineau, J. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI 2016), 3776–3784.

[Serban et al. 2017a] Serban, I. V.; Klinger, T.; Tesauro, G.; Talamadupula, K.; Zhou, B.; Bengio, Y.; and Courville, A. 2017a. Multiresolution recurrent neural networks: An application to dialogue response generation. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI 2017).

[Serban et al. 2017b] Serban, I. V.; Sordoni, A.; Lowe, R.; Charlin, L.; Pineau, J.; Courville, A.; and Bengio, Y. 2017b. A hierarchical latent variable encoder-decoder model for generating dialogue. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI 2017).

[Shao et al. 2017] Shao, L.; Gouws, S.; Britz, D.; Goldie, A.; Strope, B.; and Kurzweil, R. 2017. Generating long and diverse responses with neural conversational models. In Proceedings of the International Conference on Learning Representations (ICLR).

[Sharath, Tandon, and Bauer 2017] Sharath, T.; Tandon, S.; and Bauer, R. 2017. A dual encoder sequence to sequence model for open-domain dialogue modeling. arXiv preprint arXiv:1710.10520.

[Silver et al. 2014] Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic policy gradient algorithms. In ICML.

[Sutskever, Vinyals, and Le 2014] Sutskever, I.; Vinyals, O.; and Le, Q. 2014. Sequence to sequence learning with neural networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 3104–3112.

[Vaswani et al. 2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is all you need. In NIPS.

[Vinyals and Le 2015] Vinyals, O., and Le, Q. 2015. A neural conversational model. In Proceedings of the ICML Deep Learning Workshop.

[Williams and Zipser 1989] Williams, R. J., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1(2):270–280.

[Williams 1992] Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4):229–256.

[Xing et al. 2017] Xing, C.; Wu, W.; Wu, Y.; Zhou, M.; Huang, Y.; and Ma, W. 2017. Hierarchical recurrent attention network for response generation. arXiv preprint arXiv:1701.07149.

[Xu et al. 2017] Xu, Z.; Liu, B.; Wang, B.; Chengjie, S.; Wang, X.; Wang, Z.; and Qi, C. 2017. Neural response generation via GAN with an approximate embedding layer. In EMNLP.

[Yu et al. 2017] Yu, L.; Zhang, W.; Wang, J.; and Yu, Y. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI 2017).

[Zhang et al. 2017] Zhang, Y.; Gan, Z.; Fan, K.; Chen, Z.; Henao, R.; Shen, D.; and Carin, L. 2017. Adversarial feature matching for text generation. arXiv preprint arXiv:1706.03850.


[Zhang et al. 2018a] Zhang, S.; Dinan, E.; Urbanek, J.; Szlam, A.; Kiela, D.; and Weston, J. 2018a. Personalizing dialogue agents: I have a dog, do you have pets too? arXiv preprint arXiv:1801.07243v3.

[Zhang et al. 2018b] Zhang, Y.; Galley, M.; Gao, J.; Gan, Z.; Li, X.; Brockett, C.; and Dolan, B. 2018b. Generating informative and diverse conversational responses via adversarial information maximization. In NeurIPS.