
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 143–153, Hong Kong, China, November 3–7, 2019. ©2019 Association for Computational Linguistics


Building Task-Oriented Visual Dialog Systems Through Alternative Optimization Between Dialog Policy and Language Generation

Mingyang Zhou  Josh Arnold  Zhou Yu
Department of Computer Science

University of California, Davis
{minzhou, jarnold, joyu}@ucdavis.edu

Abstract

Reinforcement learning (RL) is an effective approach to learning an optimal dialog policy for task-oriented visual dialog systems. A common practice is to apply RL on a neural sequence-to-sequence (seq2seq) framework with the action space being the output vocabulary of the decoder. However, it is difficult to design a reward function that balances learning an effective policy with generating natural dialog responses. This paper proposes a novel framework that alternatively trains an RL policy for image guessing and a supervised seq2seq model to improve dialog generation quality. We evaluate our framework on the GuessWhich task, where it achieves state-of-the-art performance in both task completion and dialog quality.

1 Introduction

Visually-grounded conversational artificial intelligence (AI) is an important field that explores the extent to which intelligent systems can hold meaningful conversations about visual content. Visually-grounded conversational AI can be applied to a wide range of real-world tasks, including assisting blind people in navigating their surroundings, online recommendation systems, and analysing large amounts of visual media through natural language. Current approaches to these tasks use an end-to-end framework that maps the multi-modal context to a deep vector in order to decode a natural dialog response. This framework can be trained through supervised learning (SL) with the objective of maximizing the likelihood of the response given a human-human dialog history. Given large amounts of conversational data, the neural end-to-end system can effectively learn to generate coherent and natural language.

While much success has been achieved by applying neural sequence-to-sequence models to open visually-grounded conversation, a visual dialog system also needs to learn an optimal strategy to efficiently accomplish an external goal through natural conversation. To address this issue, various image guessing tasks such as GuessWhich (Chattopadhyay et al., 2017) and GuessWhat (de Vries et al., 2016) have been proposed to evaluate a visually-grounded conversational agent on its ability to retrieve visual content by conversing in natural language. To obtain an optimal dialog policy, reinforcement learning (RL) is introduced to enable the neural end-to-end framework to model a more effective action distribution by exploring different dialog strategies. With the application of RL, the visual dialog system can generate more consistent responses and achieve a higher level of engagement in the conversation than a dialog system trained via SL (Das et al., 2017b; Zhang et al., 2017). A typical way to apply RL to a dialog system is to assign a task-related reward to influence the utterance generation process, treating each output word as an action step. A significant limitation of this approach is that it is difficult to achieve an optimal dialog policy that both effectively completes the external goal and generates natural utterances (Zhao et al., 2019; Das et al., 2017b). As there is no ground truth reference during the RL training stage, the dialog system can only leverage the reward signal when generating responses. The resulting utterances often deviate from natural language, because it is challenging to define a comprehensive reward that covers all aspects of dialog quality and assigns appropriate rewards over the large word-vocabulary action space.

In this paper we propose a novel learning curriculum to address the challenge of jointly learning the dialog policy and language generation for task-oriented dialog systems. In our framework, we separate the training of the image retrieval policy from dialog generation by applying RL, with the goal of achieving an optimal policy for guessing the target image at every turn. In addition, we apply a language model objective function to optimize the utterance generator and mitigate language degeneration. We specifically study this framework on the image guessing task GuessWhich, where a conversational agent attempts to guess a target image by asking a series of questions. Compared to state-of-the-art RL visual dialog systems, our method achieves superior performance in both task accomplishment and dialog quality.

2 Related Work

2.1 Visual Dialog System

Visual dialog systems are an emerging area of interdisciplinary research that attracts both the vision and language communities due to its potential applications. Das et al. (2017a) proposed a visual dialog task in which a conversational agent attempts to answer questions regarding an assigned image based on a dialog history. To approach this task, they initially collected data by having two people chat about an image, with one person acting as the questioner and the other as the answerer. GuessWhich (Chattopadhyay et al., 2017) extends VisDial with the goal of building an agent that learns how to identify a target image through questions and answers. de Vries et al. (2016) additionally introduced a game in which a series of yes-or-no questions are asked by an agent in order to locate an object in an image. Many researchers have approached these tasks via reinforcement learning (RL), with the goal of obtaining an optimal dialog policy. Zhang et al. (2017), for example, designed three rewards with respect to the goals of task achievement, efficiency, and question informativeness to help the agent achieve an effective question generation policy for the GuessWhat game. Das et al. (2017b) applied reinforcement learning to the GuessWhich task and demonstrated a moderate improvement in accuracy compared to the supervised learning approach. Both methods apply RL on a neural end-to-end pipeline to jointly influence language generation and dialog policy. Due to the challenge of designing an appropriate reward for language generation, these methods generate responses that deviate from human natural language. Zhang et al. (2018) proposed an approach involving hierarchical reinforcement learning and state-adaptation techniques that enable the agent to learn an optimal and efficient multi-modal policy. The bottleneck of Zhang et al. (2018)'s method, however, is that the system response is retrieved from a predefined set of human-written or system-generated utterances. The number of predefined responses is limited, so this method does not easily generalize to other tasks in real-world settings. We address these limitations by applying RL on a reduced, yet more relevant, action space while optimizing the dialog generator in a supervised fashion. We alternatively optimize policy learning and language generation to combine the two tasks.

2.2 RL on Task-oriented Dialog System

Various RL-based models have been proposed to train task-oriented dialog systems (Williams and Young, 2007). To build a traditional modular dialog system, researchers first identify the semantic representation, such as the dialog acts and slots in user utterances. They then accumulate these semantic representations over time to track the dialog state. Finally, they apply RL to learn an optimized dialog policy given the dialog state (Raux et al., 2005; Shi and Yu, 2018). Such modular dialog systems are effective in narrow task domains, such as searching bus route schedules or reserving a restaurant through natural language, but they fail to generalize to complex settings where the size of the action space increases. Owing to the development of deep learning, RL on neural sequence-to-sequence models has been explored in more complex dialog domains such as open-domain conversation (Li et al., 2016) and negotiation (Lewis et al., 2017). However, due to the difficulty of assigning appropriate rewards when operating in a large action space, these frameworks cannot generate fluent dialog utterances. Zhao et al. (2019) proposed a latent-action RL framework to marry the advantages of the module-based and sequence-to-sequence approaches, learning an optimal dialog policy in a complex task-oriented dialog domain while achieving decent conversation quality. We study a similar issue in a multi-modal task-oriented dialog scenario and propose an iterative approach that optimizes the dialog policy via RL and system response generation via SL.


Figure 1: The proposed end-to-end framework of the conversational agent for the GuessWhich task-oriented visual dialog task.

3 Method

3.1 Problem Setting

In the GuessWhich problem, we aim to build an agent (Q-Bot) that attempts to guess an image i_tgt that another agent (A-Bot) knows by asking it a series of questions. At the beginning of the conversation, the Q-Bot is primed with a short caption c of the target image, which itself is known only to the A-Bot. At every round t, the Q-Bot generates a question q_t to elicit as much information as possible about the target image, and the A-Bot provides an appropriate answer a_t with regard to q_t and the target image. In the end, the agent guesses the target image from a set of images, considering the entire conversation.

In addition, our dialog system also guesses a candidate image i_t out of an image database I = {i_k}_{k=0}^{m} at every turn. This action models the process of sequentially updating the visual belief state of the target image based on the latest dialog history. Conditioned on the current guessed image and the prior dialog context, the system generates an optimal question in order to obtain the maximum information from the A-Bot that can strengthen the system's belief about the target image. At the end of the conversation, our Q-Bot guesses the target image based on the multimodal context s_n = (q_{1:n}, a_{1:n}, i_{1:n}, c), consisting of the dialog history and the trajectory of guessed images.

3.2 Model Architecture

Our Q-Bot is constructed on top of a hierarchical encoder-decoder framework (Serban et al., 2015), which consists of three major components: the Response Encoder, the Question Decoder, and the Image Guesser. We introduce each component as follows:

Response Encoder The goal of the response encoder is to append the question q_t, the answer a_t, and the guessed image i_t received at the current round to the dialog history and obtain an updated vector representation of the multimodal context s_t. The image i_t is encoded with a pre-trained convolutional neural network, VGG-16 (Simonyan and Zisserman, 2015), followed by a linear embedding layer; the resulting image feature vector is denoted z_t. The question-answer pair of the current round (q_t, a_t) is mapped to a hidden state vector f_t through an LSTM-based QA Encoder. We then apply a linear projection on the concatenation of f_t and z_t to obtain the multi-modal context vector h_t for the current round. The context vector is then passed through another LSTM encoder, the History Encoder, which generates an updated dialog history representation s_t = HistoryEnc(h_t, s_{t-1}). We denote the trainable parameters of the Response Encoder as θ_e.

Question Decoder The question decoder is a two-layer LSTM network initialized with the most recent dialog history representation vector s_t from the response encoder. It sequentially samples words to produce the next question q_t. The learned parameters of the question decoder are denoted θ_d.


Image Guesser The Image Guesser attempts to identify the candidate image that best aligns with the dialog history. Given an image database I = {i_k}_{k=0}^{m} from which we sample the candidate image, we first extract the image feature representations {z_k}_{k=0}^{m} for all candidate images with the convolutional neural network and image embedding layer defined in the response encoder. We can then select a candidate image for the current turn based on the Euclidean distance d(z_k, s_t) between each candidate's image feature and the current dialog history vector. The image with the smallest Euclidean distance is selected as the guess i_t at the current round. The associated parameters of the image guesser are denoted θ_g.
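For concreteness, the following is a minimal PyTorch-style sketch of the three Q-Bot components described above. The module names, dimensions, and single-layer choices are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class QBot(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hid_dim=512, img_feat_dim=4096):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        # Response Encoder (theta_e): QA encoder + image embedding + history encoder
        self.qa_encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.img_embed = nn.Linear(img_feat_dim, hid_dim)     # on top of VGG-16 features
        self.fuse = nn.Linear(2 * hid_dim, hid_dim)           # concat(f_t, z_t) -> h_t
        self.history_encoder = nn.LSTMCell(hid_dim, hid_dim)  # s_t = HistoryEnc(h_t, s_{t-1})
        # Question Decoder (theta_d): two-layer LSTM initialized from s_t
        self.decoder = nn.LSTM(emb_dim, hid_dim, num_layers=2, batch_first=True)
        self.out_proj = nn.Linear(hid_dim, vocab_size)

    def encode_round(self, qa_tokens, img_feat, state=None):
        """Update the dialog state s_t with the latest (q_t, a_t) pair and guessed image i_t."""
        emb = self.word_emb(qa_tokens)                  # (B, T, emb_dim)
        _, (f_t, _) = self.qa_encoder(emb)              # f_t: (1, B, hid_dim)
        z_t = self.img_embed(img_feat)                  # (B, hid_dim)
        h_t = self.fuse(torch.cat([f_t[-1], z_t], -1))  # (B, hid_dim)
        s_t, c_t = self.history_encoder(h_t, state)     # updated history representation
        return s_t, c_t

    def init_decoder_state(self, s_t):
        """The Question Decoder starts from s_t; word-by-word sampling is omitted here."""
        h0 = s_t.unsqueeze(0).repeat(2, 1, 1)           # two decoder layers
        return h0, torch.zeros_like(h0)

    def guess_image(self, s_t, candidate_feats):
        """Image Guesser (theta_g): pick the candidate closest to s_t in Euclidean distance."""
        z = self.img_embed(candidate_feats)             # (M, hid_dim)
        dist = torch.cdist(s_t, z)                      # (B, M)
        return dist.argmin(dim=-1), dist
```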

3.3 Training Dialog System

We follow the two-stage training scheme used in many previous end-to-end RL dialog systems (Das et al., 2017b; Zhang et al., 2017; Zhao et al., 2019): we first pre-train the dialog framework with a supervised objective and then apply reinforcement learning to learn an optimal policy for retrieving the target image. Supervised pre-training is a critical step that facilitates effective policy exploration during RL training, as it is difficult to explore a complex action space with limited prior knowledge. During RL training, we introduce an alternative learning method between dialog policy exploration and natural utterance generation that addresses the issue of language degeneration in previous RL-based visual dialog systems (Das et al., 2017b). We introduce each training method as follows.

3.3.1 Supervised Pre-training

During supervised pre-training, we jointly optimize the objectives of generating questions and predicting target image features from dialog contexts. Question generation is optimized by maximizing the log conditional probability of the next question given the ground truth dialog at every round of the conversation. For image feature prediction, we minimize the mean square error (MSE) between the target image feature z_tgt and the dialog context vector s_t at each round. The joint loss function for supervised pre-training is:

L_SL(θ_r, θ_d, θ_g) = α Σ_{t=0}^{n} log p(q_t | s_t) + β Σ_{t=0}^{n} MSE(z_tgt, s_t)    (1)

where α and β are the weights assigned to each task's objective in the joint objective function. With SL pre-training, the dialog system gains the ability to estimate a visual object and emit a natural language sentence given a dialog context.
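A hedged sketch of the joint objective in Equation (1), interpreting the question-generation term as a negative log-likelihood to be minimized together with the MSE term. The function name, weights, and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def supervised_loss(question_logits, question_targets, dialog_states, target_img_feat,
                    alpha=1.0, beta=1.0, pad_idx=0):
    # question_logits: (B, T, vocab), question_targets: (B, T)
    nll = F.cross_entropy(question_logits.reshape(-1, question_logits.size(-1)),
                          question_targets.reshape(-1), ignore_index=pad_idx)
    # dialog_states: (B, hid), target_img_feat: (B, hid)  --  MSE(z_tgt, s_t)
    mse = F.mse_loss(dialog_states, target_img_feat)
    return alpha * nll + beta * mse
```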

3.3.2 Reinforcement Learning on Image Retrieval

In our framework, we treat the sequence of image guesses throughout the conversation as a partially observable Markov decision process and train a policy network via RL to obtain an optimal strategy for retrieving the target image. We formally describe the state, policy, action, rewards, and training procedure of our pipeline below.

State The dialog state in our framework consists of a combination of multimodal contexts, including the image caption c, the dialog history with the A-Bot [q_1, a_1, ..., q_t, a_t], and the image guessing trajectory [i_1, i_2, ..., i_t].

Policy The dialog policy π_{θ_r,θ_g}(i_t | s_t) is a stochastic policy that samples the candidate image to guess from an image set based on the previous dialog history. The policy is learned from the response encoder and the image guesser, which are parameterized by θ_r and θ_g.

Action The full action space is the set of images in the database from which we can sample a guess. As the pre-training process already enables the system to approximate the target image feature z_tgt with the dialog history representation vector s_t, we reduce the action space to the top-K images nearest to s_t in Euclidean distance. The probability of sampling an image i_j is obtained by applying a softmax over the distances of the top-K candidates to s_t: π(j) = exp(-d_j) / Σ_{k=1}^{K} exp(-d_k), where d_j is the distance between the j-th image feature and the dialog history state vector s_t.
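The reduced action space can be sketched as follows: restrict guessing to the top-K images nearest to s_t and sample from a softmax over the negated distances, π(j) = exp(-d_j) / Σ_k exp(-d_k). Variable names and the default K are illustrative assumptions.

```python
import torch

def sample_guess(s_t, candidate_feats, k=20):
    # s_t: (hid,), candidate_feats: (M, hid) image features from the image embedding layer
    dists = torch.norm(candidate_feats - s_t, dim=-1)          # d_j for every candidate
    topk_dist, topk_idx = torch.topk(dists, k, largest=False)  # K nearest images
    probs = torch.softmax(-topk_dist, dim=-1)                  # pi(j) over the top-K set
    j = torch.multinomial(probs, 1).item()                     # stochastic policy sample
    return topk_idx[j].item(), probs[j]
```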


Rewards We use the ranking percentile of the target image with respect to the dialog history vector s_t as the reward signal for the guess at each turn. The goal is to maximize the expected discounted return E[Σ_{t=1}^{n} γ^t r_t] over the n-round conversation, where r_t is the ranking percentile of the target image at round t and γ is the discount factor in (0, 1).
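A small sketch of the turn-level reward and the discounted return. The exact percentile normalization is an assumption; only the idea (fraction of candidates ranked below the target, discounted over rounds) is taken from the text.

```python
import torch

def ranking_percentile_reward(s_t, candidate_feats, target_idx):
    # Reward r_t: percentile rank of the target image under the current dialog state.
    dists = torch.norm(candidate_feats - s_t, dim=-1)
    rank = (dists < dists[target_idx]).sum().item()     # candidates ranked above the target
    return 1.0 - rank / (len(dists) - 1)                 # percentile in [0, 1]

def discounted_return(rewards, gamma=0.99):
    # E[sum_t gamma^t r_t] estimated from one rollout, with t starting at 1.
    return sum((gamma ** t) * r for t, r in enumerate(rewards, start=1))
```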

Training Procedure Inspired by the RL training process for the iterative image retrieval framework of Guo et al. (2018), we apply the policy improvement theorem (Sutton and Barto, 1998) to estimate an improved policy π*(s_t) from an existing policy π(s_t) obtained from the pre-trained dialog system. Given a dialog state s_t and the action a_t derived from the existing policy, the value estimated by the current policy for taking action a_t is Q_π(s_t, a_t) = E[Σ_{t'=t}^{n} γ^{t'} r_{t'}]. To improve on this, we explore a different action a*_t ≠ a_t such that a larger policy value Q_π(s_t, a*_t) > Q_π(s_t, a_t), estimated with the current policy, is achieved. We can then adjust the existing policy π(s_t) to a new policy π*(s_t) that executes the optimal action a*_t given the current dialog state. The parameters of the policy are optimized via a cross-entropy loss conditioned on the derived optimal action a*_t:

L_RL(θ_r, θ_g) = E[-Σ_{t=1}^{n} log π_{θ_r,θ_g}(a*_t | s_t)]    (2)

Compared to previous RL visually-grounded conversational agents (Das et al., 2017b), conducting policy learning at the level of guessing the image has several advantages. First, the action space of the top-K nearest neighbors is much smaller than the output word vocabulary, which makes it easier to explore optimal strategies. Second, only the parameters of the response encoder and image guesser are optimized during the RL training stage; the question decoder stays intact, so the dialog system is less likely to suffer from language deviation.
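One way to realize the policy-improvement step behind Equation (2) is sketched below: search the candidate actions for one with a higher estimated value than the current action, then train the policy toward it with a cross-entropy loss. The rollout-based value estimate (value_fn) and the function signature are simplified assumptions, not the authors' exact procedure.

```python
import torch
import torch.nn.functional as F

def policy_improvement_loss(policy_logits, current_action, candidate_actions, value_fn, state):
    """policy_logits: (K,) scores over the top-K actions; value_fn estimates Q(s, a)."""
    best_action, best_value = current_action, value_fn(state, current_action)
    for a in candidate_actions:                     # look for a*_t with a larger Q estimate
        q = value_fn(state, a)
        if q > best_value:
            best_action, best_value = a, q
    target = torch.tensor([best_action])            # a*_t becomes the supervision signal
    return F.cross_entropy(policy_logits.unsqueeze(0), target)
```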

3.3.3 Alternating Policy Learning and Language Generation

Although the parameters of the decoder are not updated during the RL training stage, the shared response encoder of the dialog context is still optimized with policy learning. The language distribution captured by the response encoder and question decoder would therefore gradually diverge from the original human dialog distribution. To prevent this potential language degeneration, we alternatively optimize the dialog system with the policy learning objective in Equation 2 and the language model objective in Equation 1 at every other epoch. This ensures that the dialog system maintains a good estimate of the human language distribution while effectively exploring dialog actions to achieve the task of guessing the right image.
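The alternating schedule can be written as a simple training loop; `rl_epoch` and `sl_epoch` stand in for the two objectives and are placeholders introduced here for illustration.

```python
def train_alternating(model, rl_epoch, sl_epoch, num_epochs=20):
    # Alternate between the two objectives at every other epoch.
    for epoch in range(num_epochs):
        if epoch % 2 == 0:
            rl_epoch(model)   # optimize Equation (2): image-guessing policy
        else:
            sl_epoch(model)   # optimize Equation (1): keep generation close to human dialog
```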

4 Experiments

4.1 AI-AI Image Guessing Game

We evaluate the performance of our task-oriented dialog system by playing the image guessing game GuessWhich with an automatic answer bot. The conversational agent's goal is to locate the target image out of 9,628 test images by interacting with the other player over five conversation exchanges. We evaluate the agent on both goal achievement and utterance generation quality using two automatic evaluation metrics: Percentile Mean Rank (PMR) and perplexity, respectively. PMR estimates how well the agent can rank the target image against the other candidates in the test database based on its current dialog state. Perplexity estimates the closeness of the generated response to a reference utterance given a dialog context from the VisDial dataset.
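A hedged sketch of the PMR metric: for each game at a given round, compute the percentile rank of the target image against all test candidates and average over games. The exact normalization used by the authors may differ.

```python
import torch

def percentile_mean_rank(dialog_states, candidate_feats, target_indices):
    # dialog_states: (N, hid) one state per game; candidate_feats: (M, hid); target_indices: (N,) long
    dists = torch.cdist(dialog_states, candidate_feats)        # (N, M) distances to all candidates
    target_d = dists.gather(1, target_indices.unsqueeze(1))    # (N, 1) distance to each target
    percentiles = (dists > target_d).float().mean(dim=1)       # fraction ranked below the target
    return percentiles.mean().item()
```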

4.2 Human-AI Image Guessing Game

To evaluate the ability of our task-oriented dialog system in a realistic conversational scenario, we also have our agent play the image guessing game with human users. The games are set up as 20-image guessing games in which the agent attempts to guess a target image out of a pool of 20 candidate images by asking a human player five rounds of questions. The human player takes the role of the answer bot and answers the agent's questions with respect to the target image.

In this setting, the agent's performance on task accomplishment is evaluated by the game win rate. The quality of the dialogs is manually rated on four criteria: fluency, comprehension, diversity, and relevance. Fluency measures the naturalness and readability of the generated questions in English. Comprehension measures the consistency of the generated question with respect to the previous dialog context. Diversity evaluates the uniqueness of the questions generated within one game. Relevance measures how well the asked question relates to the target image and the given caption.


RL-Q-IG:
  Q: are bananas in bunch? A: yes it's in bunch
  Q: are they on table? A: yes they are
  Q: is this in kitchen? A: I'm not sure
  Q: any people? A: Yes there are several people
  Q: any other fruits? A: NO, only bananas

SL-Q-IG:
  Q: what color are planes? A: no planes there
  Q: are they in city? A: yes it is
  Q: are they in baskets? A: yes they are
  Q: any people? A: yes there are
  Q: animals? A: no animals

RL-Q:
  Q: is this outside? A: yes it is
  Q: is there any people? A: yes ...
  Q: is this in zoo? A: no it is not
  Q: are there any people? A: yes there are
  Q: is there any people? A: yes ...

Table 1: A dialog example with the ground truth caption: bunches of bananas hang on a wall and arranged for sale. Blue indicates ideal relevant questions and orange indicates less relevant questions.


4.3 Comparative Models

We compare the performance of our model with state-of-the-art task-oriented visual dialog systems. We also perform an ablation study to evaluate the contribution of different design choices in our framework. We introduce each model as follows:

SL-Q: The dialog agent from (Das et al., 2017b), which is trained with a joint supervised learning objective function for language generation and image prediction.

RL-Q: The dialog agent from (Das et al., 2017b), which is fine-tuned on a trained SL-Q by applying RL to the action space of the output word vocabulary.

SL-Q-IG: The dialog agent from our framework built on top of SL-Q. Compared to SL-Q, SL-Q-IG has an additional image guesser module that makes a guess at the target image at every round, as well as an image encoder that fuses the guessed candidate image into the dialog history tracker. We train this model only with the supervised learning objective introduced in Equation 1.

RL-Q-IG: SL-Q-IG fine-tuned with our RL method, which is applied to the action space of guessing the candidate image. We alternate between optimizing the dialog policy and language generation.

RL-Q-IG-NA: SL-Q-IG fine-tuned by applying RL to the action space of guessing the candidate image, optimized with the policy learning objective alone.

RL-Q-IG-W: The dialog agent from our framework, fine-tuned on a trained SL-Q-IG by applying reinforcement learning to the output word vocabulary. It follows the same training procedure as RL-Q to conduct policy learning.

All SL dialog agents are trained on the VisDial dataset with the default settings from (Das et al., 2017b) for 40 epochs. The RL dialog agents are then fine-tuned from their corresponding SL dialog agents for another 20 epochs. We evaluate every model on AI-AI image guessing games with the same answer bot, which is trained on the VisDial dataset with a visual question answering objective. Only RL-Q, SL-Q-IG, and RL-Q-IG are evaluated in the human evaluation.

4.4 Human-AI Evaluation Implementation

In order to evaluate the effectiveness of the model, we designed three human evaluation tasks. Six college students were recruited to conduct the evaluation. Each student evaluated 100 games using the ground truth captions and 30 games using human generated captions. An additional three evaluators each completed 30 rounds of the relevancy experiment.

Ground Truth Captions We generated 100 image guessing games that used the ground truth captions to ensure a consistent amount of information is supplied across all human evaluators. Each game consists of a randomly selected set of 20 images from the VisDial dataset, with one image randomly chosen as the target. For each game, we test three different models, each twice, resulting in a total of 600 evaluated games from the 100 generated games. We keep the identity of the models anonymous to the evaluator.

During each game, the human evaluator is presented with the target image the agent is trying to guess. Five rounds of Q&A take place in which the agent asks a question to elicit information and the human evaluator responds with a relevant, truthful answer. At the end of each game, the evaluator is asked to rate the conversation on four criteria: fluency, relevance, comprehension, and diversity.

Human Generated Captions In order to distinguish SL-Q-IG and RL-Q-IG in a more natural setting, we generate an additional 30 games, similar to the previous human evaluation task, except that at the beginning of the game the evaluator is asked to provide the caption for the target image instead of using the ground truth.

Relevance Experiment We noticed that the human evaluators found rating dialogs on the relevance criterion challenging and nuanced. To reduce this difficulty, we designed a separate experiment in which, using the conversations obtained from the previous 600 evaluated ground truth games, a human evaluator is presented with three complete conversations side by side. The evaluator then selects the conversation most relevant to the image caption. The three conversations share the same caption but correspond to different models, allowing for an effective comparison of the relevance of each model.

5 Results

5.1 Results on AI-AI Image Guess Game

Image Retrieval It is clear from Figure 2 that our dialog system significantly outperforms the baseline models from (Das et al., 2017b) in terms of PMR at every round of the dialog. PMR estimates how well the agent can rank the target image against the other candidates in the test database. The biggest improvement is observed between SL-Q-IG and SL-Q. In comparison to SL-Q, SL-Q-IG tracks the additional context from the previously guessed images, which leads to a better estimate of the target image. RL-Q-IG performs better than SL-Q-IG in terms of PMR, which suggests that fine-tuning dialog systems with RL can further improve the success of guessing the correct image. The best image retrieval result is achieved by RL-Q-IG-NA, as its objective function is based solely on policy learning without considering dialog generation quality.

Figure 2: The percentile mean rank (PMR) over the 5-round dialog in the AI-AI image guessing game

Model      | PMR    | Perplexity
SL-Q       | 90.07% | 79.49
SL-Q-IG    | 96.09% | 61.42
RL-Q       | 94.78% | 544.97
RL-Q-IG    | 96.81% | 54.66
RL-Q-IG-NA | 96.88% | 363.88
RL-Q-IG-W  | 96.65% | 227.35

Table 2: RL-Q-IG-NA performs best in PMR and RL-Q-IG performs best in perplexity.

Although our framework achieves improved image retrieval accuracy, we observe little additional improvement in PMR after further rounds of conversation. We suspect this is partially because images from MSCOCO contain a diverse selection of objects and background scenes, making images easily distinguishable with a detailed caption. In cases where candidate images are visually similar or the given caption is not informative, additional rounds of dialog are necessary to identify the target image.

Language Generation We observe a substantial increase in perplexity from SL-Q to RL-Q in Table 2, demonstrating a bottleneck when applying RL to improve language generation. By decoupling policy learning from language generation and alternatively optimizing the dialog policy and the language model, our RL-Q-IG avoids language deviation while still achieving an optimal dialog policy for the image retrieval task. To further evaluate the contributions of RL and the alternating training curriculum, we conduct two ablation studies. RL-Q-IG-NA is fine-tuned with the policy learning objective alone, without alternatively applying the language model loss.


Model   | Win  | Fluency | Relevance | Comprehension | Diversity
RL-Q    | 59.6 | 4.19    | 3.22      | 2.60          | 2.50
SL-Q-IG | 62.7 | 4.18    | 3.96      | 3.18          | 3.22
RL-Q-IG | 67.5 | 4.40    | 4.02      | 3.50          | 3.25

Table 3: Evaluation results on the human-AI image guessing game initialized with ground truth captions.

Model   | Win  | Fluency | Relevance | Comprehension | Diversity
RL-Q    | 29.2 | 4.04    | 2.88      | 2.71          | 2.29
SL-Q-IG | 40.6 | 4.16    | 3.19      | 2.75          | 2.69
RL-Q-IG | 67.6 | 4.23    | 3.74      | 3.32          | 3.06

Table 4: Evaluation results on the human-AI image guessing game initialized with human generated captions.

While RL-Q-IG-NA only achieves an incremental improvement of less than 0.1% over the full framework RL-Q-IG in terms of PMR, it suffers a dramatic increase in perplexity from 61.42 to 363.88, suggesting that alternatively applying the supervised learning objective prevents the language model from deviating from the human language distribution. We additionally apply policy learning to the question decoder of SL-Q-IG, following the RL fine-tuning process in (Das et al., 2017b), to train the agent RL-Q-IG-W. While applying word-level RL enables RL-Q to achieve a moderate improvement over SL-Q in terms of PMR, we did not observe the same degree of advantage for RL-Q-IG-W over SL-Q-IG. Additionally, RL-Q-IG-W shows a considerable increase in perplexity compared to the SL pre-trained agent, which confirms the drawbacks of applying RL to a large action space for language generation.

5.2 Results on Human-AI Image Guess Game

The performance of a dialog agent evaluated with a user simulator does not necessarily reflect its performance with real humans (de Vries et al., 2016), so we also conduct human evaluation of the different dialog agents. From the results summarized in Table 3 and Table 4, we observe consistently strong performance of our method from conversations with the AI agent to conversations with real humans. Our RL-Q-IG significantly outperforms the baseline RL agent on all criteria in both settings. RL-Q-IG's advantage over SL-Q-IG is not significant when the agents are primed with the ground truth image caption. This observation correlates with the results of the AI-AI game, as both RL-Q-IG and SL-Q-IG achieve a PMR above 96% when presented with the ground truth caption. However, when a human generated caption is given, the performance of the SL pre-trained agent drops considerably on all metrics except fluency, while our RL agent maintains similar performance. Applying RL to fine-tune the dialog system enables the agent to generate more consistent dialogs in unseen scenarios. We also notice a degradation of the baseline RL agent from its performance with the user simulator, which suggests that its deviation from natural language is due to the sub-optimal RL training on a large action space.

We conduct a qualitative analysis of the dialogs generated by the three models with human players. Besides a marginal improvement over the RL baseline model and the SL pre-trained agent in terms of decreased repetition and grammar mistakes, the questions generated by our RL agent are distinctly more relevant to the image caption. For example, in Table 1, we show the three dialogs generated by RL-Q-IG, SL-Q-IG, and RL-Q in one game. Given the image caption bunches of bananas hang on a wall and arranged for sale, RL-Q and SL-Q-IG ask very general questions that are not related to the caption, mentioning "planes", "zoo", and "animals". In comparison, our agent asks high-quality questions about the caption that cover "bananas" and "fruits". These questions help our RL agent obtain useful information to guess the target image. This advantage is also evident in the results of the comparative evaluation of question relevance in Table 5. We credit the positive result to the dialog policy, which explores multiple paths to conduct the conversation; the optimal path involves a set of questions that obtains the maximum information about the target image so that the agent can construct the best estimate of the target image.


Model        | Preferred (%)
RL-Q         | 8.93
SL-Q-ImGuess | 39.90
RL-Q-IG      | 51.20

Table 5: Results of the comparative evaluation of relevance on the human-AI image guessing dialogs.

6 Conclusion and Future Work

We present a novel framework for building a task-oriented visual dialog system. We model the agent to simultaneously optimize two actions: guessing the image and generating effective questions. We achieve this by alternatively applying reinforcement learning to obtain an effective image guessing policy and applying supervised learning to enhance the quality of generated questions. By decoupling policy learning from language generation, we overcome the language degeneration seen in word-level reinforcement learning frameworks. Both automatic and human evaluations suggest that our proposed framework leads to a higher task completion rate and improved dialog quality.

In the future, we plan to collect a fashion retrieval visual dialog dataset that simulates a realistic application for multi-modal dialog systems. To address the limitation that a high image retrieval rate can be reached with just the captions from the VisDial dataset, we plan to construct a challenging candidate image pool in which images are visually similar to each other. This will incentivize the dialog system to conduct multiple rounds of dialog in order to retrieve the target image successfully. Furthermore, we will explore additional task-oriented settings where we can decouple task accomplishment from language generation to evaluate the extent to which our framework generalizes to other conversational tasks.

References

Prithvijit Chattopadhyay, Deshraj Yadav, Viraj Prabhu, Arjun Chandrasekaran, Abhishek Das, Stefan Lee, Dhruv Batra, and Devi Parikh. 2017. Evaluating visual conversational agents via cooperative human-AI games. CoRR, abs/1708.05122.

Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jose M.F. Moura, Devi Parikh, and Dhruv Batra. 2017a. Visual Dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Abhishek Das, Satwik Kottur, Jose M.F. Moura, Stefan Lee, and Dhruv Batra. 2017b. Learning cooperative visual dialog agents with deep reinforcement learning. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

Xiaoxiao Guo, Hui Wu, Yu Cheng, Steven Rennie, and Rogerio Schmidt Feris. 2018. Dialog-based interactive image retrieval. CoRR, abs/1805.00145.

Mike Lewis, Denis Yarats, Yann N. Dauphin, Devi Parikh, and Dhruv Batra. 2017. Deal or no deal? End-to-end learning for negotiation dialogues. CoRR, abs/1706.05125.

Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. 2016. Deep reinforcement learning for dialogue generation. CoRR, abs/1606.01541.

Antoine Raux, Brian Langner, Dan Bohus, Alan W Black, and Maxine Eskenazi. 2005. Let's go public! Taking a spoken dialog system to the real world. In Proceedings of Interspeech 2005.

Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2015. Hierarchical neural network generative models for movie dialogues. CoRR, abs/1507.04808.

Weiyan Shi and Zhou Yu. 2018. Sentiment adaptive end-to-end dialog systems. CoRR, abs/1804.10731.

K. Simonyan and A. Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.

Richard S. Sutton and Andrew G. Barto. 1998. Introduction to Reinforcement Learning, 1st edition. MIT Press, Cambridge, MA, USA.

Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron C. Courville. 2016. GuessWhat?! Visual object discovery through multi-modal dialogue. CoRR, abs/1611.08481.

Jason D. Williams and Steve Young. 2007. Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language, 21(2):393–422.

Jiaping Zhang, Tiancheng Zhao, and Zhou Yu. 2018. Multimodal hierarchical reinforcement learning policy for task-oriented visual dialog. CoRR, abs/1805.03257.

Junjie Zhang, Qi Wu, Chunhua Shen, Jian Zhang, Jianfeng Lu, and Anton van den Hengel. 2017. Asking the difficult questions: Goal-oriented visual question generation via intermediate rewards. CoRR, abs/1711.07614.

Tiancheng Zhao, Kaige Xie, and Maxine Eskenazi. 2019. Rethinking action spaces for reinforcement learning in end-to-end dialog agents with latent variable models. CoRR, abs/1902.08858.


A Human Evaluation Interface

We show the web interface for our human-AI guessing game in Figure 3. At the beginning of every game the human player is presented with a randomly sampled target image. For the trained Q-Bot, 20 images are sampled as the candidate pool, with the objective of retrieving the target image. At the end of the conversation, the Q-Bot's guess is revealed to the human player, followed by a request to evaluate the quality of the conversation on the four criteria introduced in Section 4.2, on a scale from 1 to 5.

Our human relevancy evaluation interface is demonstrated in Figure 4. We present side-by-side questions generated in the same game by each of the three different models and ask the human evaluators to pick the conversation that is most semantically relevant to the target image caption.

Figure 3: The web interface for the human-AI guessing game. The left image is a target image randomly sampled from (Das et al., 2017a). The center section is a chat platform for the human to communicate with a trained Q-Bot. On the right-hand side are the 20 candidate images sampled for the Q-Bot to retrieve the target image from.

Figure 4: The user interface for the human relevancy experiment. The conversations from the three different models in the same game setting are presented in the three boxes. At the top of every conversation is the caption for the target image, followed by the five questions asked by the Q-Bot in that game.


B Qualitative Examples

RL-Q-ImGuess:
  Q: what color is vase? A: it is transparent
  Q: is it on table? A: yes it is
  Q: is it in kitchen? A: I do not know
  Q: any people? A: no, there is no people
  Q: what color is the table? A: yes there is

SL-Q-ImGuess:
  Q: what color is vase? A: it is brown
  Q: is it in vase? A: yes it is
  Q: is it in vase? A: yes it is
  Q: is it in vase? A: yes it is
  Q: is it in vase? A: yes it is

RL-Q:
  Q: what color is table? A: it is orange
  Q: is there anything else on table? A: yes there are flowers
  Q: what color is table? A: it is orange
  Q: is there anything else on table? A: yes there are
  Q: is there anything else on table? A: yes there is

Table 6: A dialog example with the human generated caption: there is a plant in a vase and cookies. Blue highlights diverse questions and orange indicates poor diversity.

RL-Q-ImGuess:
  Q: are men old? A: No they are not
  Q: are they in uniform? A: I'm not sure
  Q: is it sunny? A: yes it is
  Q: are they on road? A: yes they are
  Q: are they in parking lot? A: No they're not

SL-Q-ImGuess:
  Q: how old are men? A: 30 years old
  Q: are they in city? A: yes they are
  Q: is it sunny? A: yes it is
  Q: any other people? A: no other people
  Q: animals? A: no other animals

RL-Q:
  Q: what color is hat? A: there is no hat
  Q: is it sunny? A: it seems yes
  Q: is this in city? A: it is in city
  Q: are there any people in picture? A: there are two people
  Q: is this in home or restaurant? A: it is outside

Table 7: A dialog example with the human generated caption: two men in formal wear standing next to a monster truck. Blue highlights ideal relevant questions and orange indicates less relevant questions.

RL-Q-ImGuess:
  Q: how many horses are there? A: there are no horses
  Q: are they in wild? A: no, they are in village
  Q: is there fence? A: yes, there is fence
  Q: any people? A: yes, a person
  Q: animals? A: yes, several bulls

SL-Q-ImGuess:
  Q: how many horses are there? A: There are no horses, only bulls
  Q: what color are horses? A: they are white and black
  Q: are they in field? A: Yes they are
  Q: is it sunny? A: Yes it is
  Q: is it sunny? A: Yes it is

RL-Q:
  Q: is this in park? A: maybe it is in park
  Q: are there any people in picture? A: yes there is one picture
  Q: is this in home or restaurant? A: it is in neither of those
  Q: are there any people in picture? A: yes there is one person
  Q: is this in restaurant? A: no it is not

Table 8: A dialog example with the human generated caption: an image of running with the bulls outside. Blue indicates ideal relevant questions and orange highlights irrelevant/repeating questions.

RL-Q-ImGuess:
  Q: what color is kite? A: It is orange
  Q: is it sunny? A: Yes it is
  Q: any other people? A: Yes there are two people
  Q: any animals? A: No there is not
  Q: any trees? A: Yes, there are several trees

SL-Q-ImGuess:
  Q: what color is horse? A: there is no horse
  Q: how old is man? A: he is young
  Q: is he wearing hat? A: no, he is not wearing hat
  Q: what color is horse? A: there is no horse
  Q: is it sunny? A: yes it is

RL-Q:
  Q: is there only 1 person? A: no there is not
  Q: is this in park? A: yes it is
  Q: are there any people in picture? A: yes there are
  Q: is there any other food on table? A: no food on the table
  Q: is there anything else on table? A: nothing

Table 9: A dialog example with the human generated caption: a man holding a kite while a girl tries to fly it. Blue indicates ideal relevant questions and orange indicates poor relevance.