
Obj2Text: Generating Visually Descriptive Language from Object Layouts

Xuwang Yin, Vicente Ordonez
Department of Computer Science, University of Virginia, Charlottesville, VA.

[xy4cm, vicente]@virginia.edu

Abstract

Generating captions for images is a task that has recently received considerable attention. In this work we focus on caption generation for abstract scenes, or object layouts, where the only information provided is a set of objects and their locations. We propose OBJ2TEXT, a sequence-to-sequence model that encodes a set of objects and their locations as an input sequence using an LSTM network, and decodes this representation using an LSTM language model. We show that our model, despite encoding object layouts as a sequence, can represent spatial relationships between objects, and generate descriptions that are globally coherent and semantically relevant. We test our approach in a task of object-layout captioning by using only object annotations as inputs. We additionally show that our model, combined with a state-of-the-art object detector, improves an image captioning model from 0.863 to 0.950 (CIDEr score) on the test benchmark of the standard MS-COCO Captioning task.

1 Introduction

Natural language generation (NLG) is a long-standing goal in natural language processing. There have already been several successes in applications such as financial reporting (Kukich, 1983; Smadja and McKeown, 1990) or weather forecasts (Konstas and Lapata, 2012; Wen et al., 2015); however, it is still a challenging task for less structured and open domains. Given recent progress in training robust visual recognition models using convolutional neural networks, the task of generating natural language descriptions for arbitrary images has received considerable attention (Vinyals et al., 2015; Karpathy and Fei-Fei, 2015; Mao et al., 2015). In general, generating visually descriptive language can be useful for various tasks such as human-machine communication, accessibility, image retrieval, and search. However, this task is still challenging, and it depends on developing both a robust visual recognition model and a reliable language generation model. In this paper, we instead tackle the task of describing object layouts, where the categories for the objects in an input scene and their corresponding locations are known. Object layouts are commonly used for story-boarding, sketching, and computer graphics applications. Additionally, by using our object layout captioning model on the outputs of an object detector, we are also able to improve image captioning models.

Figure 1: Overview of our proposed model for generating visually descriptive language from object layouts. The input (a) is an object layout that consists of object categories and their corresponding bounding boxes, the encoder (b) uses a two-stream recurrent neural network to encode the input object layout, and the decoder (c) uses a standard LSTM recurrent neural network to generate text.

Object layouts contain rich semantic information; however, they also abstract away several other visual cues such as color, texture, and appearance, thus introducing a different set of challenges than those found in traditional image captioning.

We propose OBJ2TEXT, a sequence-to-sequence model that encodes object layouts using an LSTM network (Hochreiter and Schmidhuber, 1997), and decodes natural language descriptions using an LSTM-based neural language model.¹ Natural language generation systems usually consist of two steps: content planning and surface realization. The first step decides on the content to be included in the generated text, and the second step connects the concepts using structural language properties. In our proposed model, OBJ2TEXT, content planning is performed by the encoder, and surface realization is performed by the decoder. Our model is trained on the standard MS-COCO dataset (Lin et al., 2014), which includes both object annotations for the task of object detection, and textual descriptions for the task of image captioning. While most previous research has been devoted to one of these two tasks in isolation, our paper presents, to our knowledge, the first approach for learning mappings between object annotations and textual descriptions. Using several lesioned versions of the proposed model, we explored the effect of object counts and locations on the quality and accuracy of the generated natural language descriptions.

Generating visually descriptive language requires, beyond syntax and semantics, an understanding of the physical world. We also take inspiration from recent work by Schmaltz et al. (2016), where the goal was to reconstruct a sentence from a bag-of-words (BOW) representation using a simple surface-level language model based on an encoder-decoder sequence-to-sequence architecture. In contrast to this previous approach, our model is grounded on visual data and its corresponding spatial information, so it goes beyond word re-ordering. Also relevant to our work is Yao et al. (2016a), which explored the task of oracle image captioning by providing a language generation model with a list of manually defined visual concepts known to be present in the image. In addition, our model is able to leverage both quantity and spatial information as additional cues associated with each object/concept, thus allowing it to learn about verbosity and spatial relations in a supervised fashion.

¹We build on neuraltalk2 and make our Torch code and an interactive demo of our model available at the following url: http://vision.cs.virginia.edu/obj2text

In summary, our contributions are as follows:

• We demonstrate that despite encoding object layouts as a sequence using an LSTM, our model can still effectively capture spatial information for the captioning task. We perform ablation studies to measure the individual impact of object counts and locations.

• We show that a model relying only on object annotations, as opposed to pixel data, performs competitively in image captioning despite the ambiguity of the setup for this task.

• We show that more accurate and comprehensive descriptions can be generated on the image captioning task by combining our OBJ2TEXT model, applied to the outputs of a state-of-the-art object detector, with a standard image captioning approach.

2 Task

We evaluate OBJ2TEXT on the tasks of object layout captioning and image captioning. In the first task, the input is an object layout that takes the form of a set of object category and bounding box pairs $\langle o, l \rangle = \{\langle o_i, l_i \rangle\}$, and the output is natural language. This task resembles the second task of image captioning, except that the input is an object layout instead of a standard raster image represented as a pixel array. We experiment on the MS-COCO dataset for both tasks. For the first task, object layouts are derived from ground-truth bounding box annotations, and in the second task object layouts are obtained using the outputs of an object detector over the input image.
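To make the input format concrete, the snippet below (Python, used here only for exposition; the paper's code builds on neuraltalk2 in Torch/Lua) writes out a possible layout for the Figure 1 example as the category/bounding-box pairs just described. The box coordinates are invented for illustration.

```python
# Hypothetical layout for the Figure 1 example: two "person" boxes and one "kite".
# Each box is [x, y, w, h], normalized to [0, 1] by the image size (see Section 4.1);
# the specific numbers below are made up for illustration only.
layout = [
    ("person", [0.10, 0.55, 0.15, 0.40]),
    ("person", [0.30, 0.50, 0.15, 0.45]),
    ("kite",   [0.55, 0.05, 0.35, 0.30]),
]
# A target caption for this layout could be "a group of people are flying a kite".
```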

3 Related Work

Our work is related to previous works that used clipart scenes for visually-grounded tasks, including sentence interpretation (Zitnick and Parikh, 2013; Zitnick et al., 2013) and predicting object dynamics (Fouhey and Zitnick, 2014). The cited advantage of abstract scene representations, such as the ones provided by the clipart scenes dataset proposed in (Zitnick and Parikh, 2013), is their ability to separate the complexity of pattern recognition from semantic visual representation. Abstract scene representations also maintain common-sense knowledge about the world. The works of Vedantam et al. (2015b) and Eysenbach et al. (2016) proposed methods to learn common-sense knowledge from clipart scenes, while the method of Yatskar et al. (2016), similar to our work, leverages object annotations for natural images. Understanding abstract scenes has been demonstrated to be a useful capability for both language and vision tasks, and our work is another step in this direction.

Our work is also related to other language generation tasks such as image and video captioning (Farhadi et al., 2010; Ordonez et al., 2011; Mason and Charniak, 2014; Ordonez et al., 2015; Xu et al., 2015; Donahue et al., 2015; Mao et al., 2015; Fang et al., 2015). This problem is interesting because it combines two challenging but perhaps complementary tasks: visual recognition and generating coherent language. Fueled by recent advances in training deep neural networks (Krizhevsky et al., 2012) and the availability of large annotated datasets with images and captions such as the MS-COCO dataset (Lin et al., 2014), recent methods on this task perform end-to-end learning from pixels to text. Most recent approaches use a variation of an encoder-decoder model where a convolutional neural network (CNN) extracts visual features from the input image (encoder), and passes its outputs to a recurrent neural network (RNN) that generates a caption as a sequence of words (decoder) (Karpathy and Fei-Fei, 2015; Vinyals et al., 2015). However, the MS-COCO dataset, containing object annotations, is also a popular benchmark in computer vision for the task of object detection, where the objective is to go from pixels to a collection of object locations. In this paper, we instead frame our problem as going from a collection of object categories and locations (object layouts) to image captions. This requires a novel approach to encode these object layouts instead of pixels, and allows for analyzing the image captioning task from a different perspective. Several other recent works use a similar sequence-to-sequence approach to generate text from source code input (Iyer et al., 2016), or to translate text from one language to another (Bahdanau et al., 2015).

There have also been a few previous works explicitly analyzing the role of spatial and geometric relations between objects for vision- and language-related tasks. The work of Elliott and Keller (2013) manually defined a dictionary of object-object relations based on geometric cues. The work of Ramisa et al. (2015) focuses on predicting a preposition given two entities and their locations in an image. Previous works of Plummer et al. (2015) and Rohrbach et al. (2016) showed that switching from a classification-based CNN network to a detection-based Fast R-CNN network improves performance for phrase localization. The work of Hu et al. (2016) showed that encoding image regions with spatial information is crucial for natural language object retrieval, as the task explicitly asks for locations of target objects. Unlike these previous efforts, our model is trained end-to-end for the language generation task, and takes as input a holistic view of the scene layout, potentially learning higher-order relations between objects.

4 Model

In this section we describe our base OBJ2TEXT model for encoding object layouts to produce text (Section 4.1), as well as two further variations that use our model to generate captions for real images: OBJ2TEXT-YOLO, which uses the YOLO object detector (Redmon and Farhadi, 2017) to generate layouts of object locations from real images (Section 4.2), and OBJ2TEXT-YOLO + CNN-RNN, which further combines the previous model with an encoder-decoder image captioning model that uses a convolutional neural network to encode the image (Section 4.3).

4.1 OBJ2TEXT

OBJ2TEXT is a sequence-to-sequence model that encodes an input object layout as a sequence, and decodes a textual description by predicting the next word at each time step. Given a training data set comprising $N$ observations $\{\langle o^{(n)}, l^{(n)} \rangle\}$, where $\langle o^{(n)}, l^{(n)} \rangle$ is a pair of sequences of input category and location vectors, together with a corresponding set of target captions $\{s^{(n)}\}$, the encoder and decoder are trained jointly by minimizing a loss function over the training set using stochastic gradient descent:

$$W^* = \operatorname*{argmin}_W \sum_{n=1}^{N} L(\langle o^{(n)}, l^{(n)} \rangle, s^{(n)}), \qquad (1)$$

in which $W = (W_1, W_2)$ is the group of encoder parameters $W_1$ and decoder parameters $W_2$. The loss function is a negative log-likelihood of the generated description given the encoded object layout:

$$L(\langle o^{(n)}, l^{(n)} \rangle, s^{(n)}) = -\log p(s^{(n)} \mid h_L^{(n)}, W_2), \qquad (2)$$

where $h_L^{(n)}$ is computed using the LSTM-based encoder (eqs. 3 and 4) from the object layout inputs $\langle o^{(n)}, l^{(n)} \rangle$, and $p(s^{(n)} \mid h_L^{(n)}, W_2)$ is computed using the LSTM-based decoder (eqs. 5, 6, and 7).
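For clarity, combining eq. (2) with the word-level factorization used by the decoder (eq. 5 below), the per-example loss reduces to the usual sum of per-word cross-entropy terms; this restatement is ours, not an additional equation from the paper:

$$L(\langle o^{(n)}, l^{(n)} \rangle, s^{(n)}) = -\sum_{t=1}^{T_2} \log p\big(s^{(n)}_t \mid h_L^{(n)}, s^{(n)}_{<t}; W_2\big).$$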

At inference time we encode an input layout $\langle o, l \rangle$ into its representation $h_L$, and sample a sentence word by word based on $p(s_t \mid h_L, s_{<t})$ as computed by the decoder at time step $t$. Finding the optimal sentence $s^* = \operatorname*{argmax}_s p(s \mid h_L)$ requires the evaluation of an exponential number of sentences, as at each time step we have $K$ choices for a word vocabulary of size $K$. As a common practice for an approximate solution, we follow (Vinyals et al., 2015) and use beam search to limit the choices for words at each time step by only keeping the ones with the highest probabilities.

Encoder: The encoder at each time step $t$ takes as input a pair $\langle o_t, l_t \rangle$, where $o_t$ is the object category encoded as a one-hot vector of size $V$, and $l_t = [B^x_t, B^y_t, B^w_t, B^h_t]$ is the location configuration vector that contains the left-most position, top-most position, and the width and height of the bounding box corresponding to object $o_t$, all normalized to the range $[0, 1]$ with respect to the input image dimensions. $o_t$ and $l_t$ are mapped to vectors of the same size $k$ and added to form the input $x_t$ to one time step of the LSTM-based encoder as follows:

$$x_t = W_o o_t + (W_l l_t + b_l), \quad x_t \in \mathbb{R}^k, \qquad (3)$$

in which $W_o \in \mathbb{R}^{k \times V}$ is a categorical embedding matrix (the word encoder), and $W_l \in \mathbb{R}^{k \times 4}$ and bias $b_l \in \mathbb{R}^k$ are the parameters of a linear transformation unit (the object location encoder).

Setting the initial value of the cell state vector $c^e_0 = 0$ and hidden state vector $h^e_0 = 0$, the LSTM-based encoder takes the input sequence $(x_1, \ldots, x_{T_1})$ and generates a sequence of hidden state vectors $(h^e_1, \ldots, h^e_{T_1})$ using the following step function (we omit cell state variables and internal transition gates for simplicity, as we use a standard LSTM cell definition):

$$h^e_t = \mathrm{LSTM}(h^e_{t-1}, x_t; W_1). \qquad (4)$$

We use the last hidden state vector $h_L = h^e_{T_1}$ as the encoded representation of the input layout $\langle o, l \rangle$, from which we generate the corresponding description $s$.
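As a concrete reading of eqs. (3) and (4), the following is a minimal PyTorch sketch of the layout encoder (the paper's actual implementation builds on neuraltalk2 in Torch/Lua, so the class and variable names here are illustrative assumptions):

```python
import torch
import torch.nn as nn

class LayoutEncoder(nn.Module):
    """Minimal sketch of the OBJ2TEXT layout encoder (eqs. 3-4)."""
    def __init__(self, vocab_size, k):
        super().__init__()
        self.obj_embed = nn.Embedding(vocab_size, k)  # W_o: category embedding
        self.loc_embed = nn.Linear(4, k)              # W_l, b_l: location encoder
        self.lstm = nn.LSTM(input_size=k, hidden_size=k, batch_first=True)

    def forward(self, categories, boxes):
        # categories: (B, T) integer object-category ids
        # boxes: (B, T, 4) normalized [x, y, w, h] values in [0, 1]
        x = self.obj_embed(categories) + self.loc_embed(boxes)  # eq. (3)
        _, (h_n, _) = self.lstm(x)                              # eq. (4)
        return h_n[-1]  # last hidden state h_L: the layout representation
```

A layout of $T_1$ objects is thus reduced to a single vector $h_L$ of size $k$, which the decoder consumes.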

Decoder: The decoder takes the encoded layout $h_L$ as input and generates a sequence of multinomial distributions over a vocabulary of words using an LSTM neural language model. The joint probability distribution of the generated sentence $s = (s_1, \ldots, s_{T_2})$ is factorized into a product of conditional probabilities:

$$p(s \mid h_L) = \prod_{t=1}^{T_2} p(s_t \mid h_L, s_{<t}), \qquad (5)$$

where each factor is computed using a softmax function over the hidden states of the decoder LSTM as follows:

$$p(s_t \mid h_L, s_{<t}) = \mathrm{softmax}(W_h h^d_{t-1} + b_h), \qquad (6)$$

$$h^d_t = \mathrm{LSTM}(h^d_{t-1}, W_s s_t; W_2), \qquad (7)$$

where $W_s$ is the categorical embedding matrix for the one-hot encoded caption sequence of symbols. By setting $h^d_{-1} = 0$ and $c^d_{-1} = 0$ for the initial hidden state and cell state, the layout representation is encoded into the decoder network at time step 0 as a regular input:

$$h^d_0 = \mathrm{LSTM}(h^d_{-1}, h_L; W_2). \qquad (8)$$

We use beam search to sample from the LSTM, as is routinely done in previous literature, in order to generate text.
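A matching sketch of the decoder (eqs. 5-8), again in PyTorch with illustrative names rather than the paper's Torch/Lua code, using teacher forcing during training:

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Minimal sketch of the LSTM language-model decoder (eqs. 5-8)."""
    def __init__(self, vocab_size, k):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, k)   # W_s
        self.cell = nn.LSTMCell(input_size=k, hidden_size=k)
        self.out = nn.Linear(k, vocab_size)             # W_h, b_h

    def forward(self, h_L, captions):
        # h_L: (B, k) layout representation; captions: (B, T) word ids (teacher forcing)
        h = torch.zeros_like(h_L)
        c = torch.zeros_like(h_L)
        h, c = self.cell(h_L, (h, c))                   # eq. (8): h_L is the step-0 input
        logits = []
        for t in range(captions.size(1)):
            logits.append(self.out(h))                  # eq. (6), pre-softmax scores
            h, c = self.cell(self.word_embed(captions[:, t]), (h, c))  # eq. (7)
        # a softmax over each time slice gives the factors of eq. (5)
        return torch.stack(logits, dim=1)
```

Training then minimizes the cross-entropy between these per-step distributions and the ground-truth words, matching the loss in eq. (2); at test time, the greedy or beam-search loop described above replaces teacher forcing.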

4.2 OBJ2TEXT-YOLO

For the task of image captioning we propose OBJ2TEXT-YOLO. This model takes an image as input, extracts an object layout (object categories and locations) with the state-of-the-art YOLO object detection model (Redmon and Farhadi, 2017), and uses OBJ2TEXT as described in Section 4.1 to generate a natural language description of the input layout and hence, the input image. The model is trained using the standard back-propagation algorithm, but the error is not back-propagated to the object detection module.
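A small hedged sketch of the glue between the detector and the encoder: pixel-space detections are normalized by the image size into the [x, y, w, h] format of Section 4.1. The detection tuple format used here is an assumption for illustration, not YOLO's actual output API.

```python
def detections_to_layout(detections, image_w, image_h):
    """Convert (class_name, [x, y, w, h]) pixel-space detections into the
    normalized layout sequence consumed by the OBJ2TEXT encoder."""
    layout = []
    for class_name, (x, y, w, h) in detections:
        layout.append((class_name, [x / image_w, y / image_h,
                                    w / image_w, h / image_h]))
    return layout
```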

4.3 OBJ2TEXT-YOLO + CNN-RNN

Figure 2: Image captioning by joint learning of visual features and object layout encoding.

For the image captioning task we experiment with a combined model (see Figure 2) where we take an image as input, and then use two separate computation branches to extract visual feature information and object layout information. These two streams of information are then passed to an LSTM neural language model to generate a description. Visual features are extracted using the VGG-16 (Simonyan and Zisserman, 2015) convolutional neural network pre-trained on the ImageNet classification task (Russakovsky et al., 2015). Object layouts are extracted using the YOLO object detection system, and its output object locations are encoded using our proposed OBJ2TEXT encoder. These two streams of information are encoded into vectors of the same size, and their sum is the input to the language model that generates a textual description. The model is trained using the standard back-propagation algorithm, where the error is back-propagated to both branches but not to the object detection module. The weights of the image CNN model are fine-tuned only after the layout encoding branch is well trained, but no significant overall performance improvements were observed from fine-tuning.
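A minimal sketch of the fusion just described, assuming (as the text states) that the visual and layout streams are projected to vectors of the same size and summed before being fed to the language model; the projection layer and class names are illustrative:

```python
import torch.nn as nn

class FusedCaptioner(nn.Module):
    """Sketch of the OBJ2TEXT-YOLO + CNN-RNN fusion: layout encoding + projected CNN features."""
    def __init__(self, layout_encoder, decoder, cnn_feat_dim, k):
        super().__init__()
        self.layout_encoder = layout_encoder            # OBJ2TEXT encoder sketched earlier
        self.visual_proj = nn.Linear(cnn_feat_dim, k)   # projects VGG-16 features to size k
        self.decoder = decoder                          # LSTM language model sketched earlier

    def forward(self, cnn_features, categories, boxes, captions):
        h_L = self.layout_encoder(categories, boxes)    # object layout stream
        v = self.visual_proj(cnn_features)              # visual feature stream
        return self.decoder(h_L + v, captions)          # the sum conditions the language model
```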

5 Experimental Setup

We evaluate the proposed models on the MS-COCO (Lin et al., 2014) dataset, which is a popular image captioning benchmark that also contains object extent annotations. In the object layout captioning task the model uses the ground-truth object extents as input object layouts, while in the image captioning task the model takes raw images as input. The quality of the generated descriptions is evaluated using both human evaluations and automatic metrics. We train and validate our models based on the commonly adopted split regime (113,287 training images, 5000 validation and 5000 test images) used in (Karpathy et al., 2016), and also test our model on the MS-COCO official test benchmark.

We implement our models based on the open source image captioning system Neuraltalk2 (Karpathy et al., 2016). Other configurations, including data preprocessing and training hyper-parameters, also follow Neuraltalk2. We trained our models on a GTX 1080 GPU with 8GB of memory for 400k iterations, using a batch size of 16 and an Adam optimizer with alpha of 0.8, beta of 0.999, and epsilon of 1e-08. Descriptions for the CNN-RNN approach are generated using the publicly available code and model checkpoint provided by Neuraltalk2 (Karpathy et al., 2016). Captions for the online test set evaluations are generated using beam search of size 2, but score histories on the split validation set are based on captions generated without beam search (i.e., max sampling at each time step).
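As a hedged illustration of that optimizer configuration in PyTorch (the original code is Torch/Lua via Neuraltalk2): the reported "alpha" and "beta" are read here as Adam's two momentum coefficients, following Neuraltalk2's naming, which is an assumption; the model below is a stand-in, since the full captioner is defined elsewhere.

```python
import torch
import torch.nn as nn

model = nn.LSTM(512, 512)  # stand-in for the captioning model's parameters
optimizer = torch.optim.Adam(
    model.parameters(),
    betas=(0.8, 0.999),   # the paper's "alpha" and "beta", read as Adam's beta1/beta2
    eps=1e-8,
)
# Training in the paper: batch size 16, ~400k iterations on a single GTX 1080.
```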

Ablation on Object Locations and Counts: We set up an experiment where we remove the input locations from the OBJ2TEXT encoder to study the effect on the generated captions, and to confirm whether the model is actually using spatial information during surface realization. In this restricted version of our model, the LSTM encoder at each time step only takes the object category embedding vector as input. The OBJ2TEXT model additionally encodes different instances of the same object category at different time steps, potentially encoding in some of its hidden states information about how many objects of a particular class are in the image. For example, in the object annotation presented in the input in Figure 1, there are two instances of "person". We perform an additional experiment where our model has access neither to object locations nor to the number of object instances, by providing only the set of object categories. Note that in this set of experiments the object layouts are given as inputs; thus we assume full access to ground-truth object annotations, even in the test split. In the experimental results section we use the "-GT" suffix to indicate that input object layouts are obtained from the ground-truth object annotations provided by the MS-COCO dataset.

Image Captioning Experiment: In this experiment we assess whether the image captioning model OBJ2TEXT-YOLO, which relies only on object categories and locations, can give comparable performance to a CNN-RNN model based on Neuraltalk2 (Karpathy et al., 2016) that has full access to visual image features. We also explore how much a combined OBJ2TEXT-YOLO + CNN-RNN model can improve over a CNN-RNN model by fusing object count and location information that is not explicitly encoded in a traditional CNN-RNN approach.

Human Evaluation Protocol: We use a two-alternative forced-choice (2AFC) approach to compare two methods that generate captions. For this, we set up a task on Amazon Mechanical Turk where users are presented with an image and two alternative captions, and they have to choose the caption that best describes the image. Users are not prompted to use any single criterion, but rather a holistic assessment of the captions, including their semantics, syntax, and the degree to which they describe the image content. In our experiment we randomly sample 500 captions generated by various models for MS COCO online test set images, and use three users per image to obtain annotations. Note that three users choosing randomly between two options have a 25% chance of all selecting the same caption for a given image. In our experiments comparing method A vs. method B, we report the percentage of times A was picked over B (Choice-all), the percentage of times all users selected the same method, either A or B (Agreement), and the percentage of times A was picked over B only for those cases where all users agreed (Choice-agreement).
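The 25% chance-agreement figure follows directly from three annotators each choosing independently and uniformly between the two captions:

$$P(\text{all three agree by chance}) = 2 \times \left(\tfrac{1}{2}\right)^{3} = \tfrac{1}{4} = 25\%.$$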

6 Results

Impact of Object Locations and Counts: Figure 3a shows the CIDEr (Vedantam et al., 2015a) and BLEU-4 (Papineni et al., 2002) score histories on our validation set during 400k iterations of training of OBJ2TEXT, as well as a version of our model that does not use object locations, and a version that uses neither object locations nor object counts. These results show that our model effectively uses both object locations and counts to generate better captions, and that the absence of either of these two cues hurts performance. Table 1 confirms these results on the test split after a full round of training.

Furthermore, human evaluation results in the first row of Table 2 show that the OBJ2TEXT model with access to object locations is preferred by users, especially in cases where all evaluators agreed on their choice (62% over the baseline that does not have access to locations). In Figure 4 we additionally present qualitative examples showing side-by-side predictions of OBJ2TEXT-GT and OBJ2TEXT-GT (no obj-locations). These results indicate that (1) perhaps not surprisingly, object counts are useful for generating better quality descriptions, and (2) object location information, when properly encoded, is an important cue for generating more accurate descriptions. We additionally implemented a nearest neighbor baseline by representing the objects in the input layout using an orderless bag-of-words representation of object counts; its CIDEr score on the test split was only 0.387.

On top of OBJ2TEXT we additionally experimented with the global attention model proposed in (Luong et al., 2015), so that a weighted combination of the encoder hidden states is forwarded to the decoding neural language model; however, we did not notice any overall gains in accuracy from this formulation. We observed that this model provided gains only for longer input sequences, where it is more likely that the LSTM network forgets its past history (Bahdanau et al., 2015). However, in MS-COCO the average number of objects in each image is rather modest, so the last hidden state can capture well the overall nuances of the visual input.

Object Layout Encoding for Image Captioning: Figure 3b shows the CIDEr and BLEU-4 score histories on the validation set during 400k iterations of training of OBJ2TEXT-YOLO, CNN-RNN, and their combination. These results show that OBJ2TEXT-YOLO performs surprisingly close to CNN-RNN, and the model resulting from combining the two clearly outperforms each method alone. Table 3 shows MS-COCO evaluation results on the test set using the online benchmark service, and confirms the results obtained on the validation split, where CNN-RNN seems to have only a slight edge over OBJ2TEXT-YOLO, which lacks access to pixel data after the object detection stage. Human evaluation results in Table 2, rows 2 and 3, further confirm these findings. These results show that meaningful descriptions can be generated solely based on object category and location information, even without access to color and texture input.

The combined model performs better than the two individual models, improving the CIDEr score of the basic CNN-RNN model from 0.863 to 0.950, and human evaluation results show that the combined model is preferred over the basic CNN-RNN model for 65.3% of the images for which all evaluators were in agreement about the selected method. These results show that explicitly encoded object counts and location information, which is often overlooked in traditional image captioning approaches, can boost the performance of existing models.


Figure 3: Score histories of various models on the MS COCO split validation set. (a) BLEU-4 and CIDEr score histories of lesioned versions of the proposed model (OBJ2TEXT-GT; OBJ2TEXT-GT (no obj-locations); OBJ2TEXT-GT (no obj-locations, no obj-counts)) for the task of object layout captioning. (b) Score histories of the image captioning models OBJ2TEXT-YOLO, CNN-RNN, and OBJ2TEXT-YOLO + CNN-RNN; performance boosts of the CNN-RNN and combined models around iterations 100K and 250K are due to fine-tuning of the image CNN model.

Method                                      BLEU-4   CIDEr   METEOR   ROUGE-L
OBJ2TEXT-GT (no obj-locations, no counts)   0.210    0.759   0.215    0.464
OBJ2TEXT-GT (no obj-locations)              0.233    0.837   0.222    0.482
OBJ2TEXT-GT                                 0.253    0.922   0.238    0.507

Table 1: Performance of lesioned versions of the proposed model on the MS COCO split test set.

Intuitively, object layout and visual features are complementary: neural network models for visual feature extraction are trained on a classification task where object-level information such as the number of instances and locations is ignored in the objective. Object layouts, on the other hand, contain categories and their bounding boxes but do not have access to rich image features such as image background, object attributes, and objects with categories not present in the object detection vocabulary.

Figure 5 provides a three-way comparison of captions generated by the three image captioning models, with the captions preferred by human evaluators annotated in bold text. Analysis of actual outputs gives us insight into the benefits of combining object layout information and visual features obtained using a CNN. Our OBJ2TEXT-YOLO model makes many mistakes because of a lack of image context information, since it only has access to the object layout, while CNN-RNN makes many mistakes because the visual recognition model is imperfect at predicting the correct content. The combined model is usually able to generate more accurate and comprehensive descriptions.

In this work we only explored encoding spatial information with object labels, but object labels could be readily augmented with rich semantic features that provide more detailed descriptions of objects or image regions. For example, the work of You et al. (2016) and Yao et al. (2016b) showed that visual features trained with semantic concepts (text entities mentioned in captions) instead of object labels are useful for image captioning, although they did not consider encoding semantic concepts with spatial information. In the case of object annotations, the MS-COCO dataset only provides object labels and bounding boxes, but there are other datasets such as Flickr30K Entities (Plummer et al., 2015) and the Visual Genome dataset (Krishna et al., 2017) that provide richer region-to-phrase correspondence annotations. In addition, the fusion of object counts and spatial information with CNN visual features could in principle benefit other vision and language tasks such as visual question answering. We leave these possible extensions as future work.

7 Conclusion

We introduced OBJ2TEXT, a sequence-to-sequence model to generate visual descriptions for object layouts where only categories and locations are specified.


Alternatives                                       Choice-all   Choice-agreement   Agreement
OBJ2TEXT-GT vs. OBJ2TEXT-GT (no obj-locations)     54.1%        62.1%              40.6%
OBJ2TEXT-YOLO vs. CNN-RNN                          45.6%        40.6%              54.7%
OBJ2TEXT-YOLO + CNN-RNN vs. CNN-RNN                58.1%        65.3%              49.5%
OBJ2TEXT-GT vs. HUMAN                              23.6%        9.9%               58.8%

Table 2: Human evaluation results using two-alternative forced-choice evaluation. Choice-all is the percentage of times the first alternative was chosen. Choice-agreement is the percentage of times the first alternative was chosen, counting only cases where all annotators agreed. Agreement is the percentage of cases where all annotators agreed (random chance is 25%).

MS COCO Test Set Performance     CIDEr   ROUGE-L   METEOR   B-4     B-3     B-2     B-1
5-Refs
  OBJ2TEXT-YOLO                  0.830   0.497     0.228    0.262   0.361   0.500   0.681
  CNN-RNN                        0.857   0.514     0.237    0.283   0.387   0.529   0.705
  OBJ2TEXT-YOLO + CNN-RNN        0.932   0.528     0.250    0.300   0.404   0.546   0.719
40-Refs
  OBJ2TEXT-YOLO                  0.853   0.636     0.305    0.508   0.624   0.746   0.858
  CNN-RNN                        0.863   0.654     0.318    0.540   0.656   0.775   0.877
  OBJ2TEXT-YOLO + CNN-RNN        0.950   0.671     0.334    0.569   0.686   0.802   0.896

Table 3: The 5-Refs and 40-Refs performance of OBJ2TEXT-YOLO, CNN-RNN, and the combined approach on the MS COCO online test set. The 5-Refs performance is measured using 5 ground-truth reference captions, while the 40-Refs performance is measured using 40 ground-truth reference captions.

Figure 4: Qualitative examples comparing generated captions of (a) OBJ2TEXT-GT, and (b) OBJ2TEXT-GT (no obj-locations).

Our proposed model shows that an orderless visual input representation of concepts is not enough to produce good descriptions; object extents, locations, and object counts all contribute to generating more accurate image descriptions. Crucially, we show that our encoding mechanism is able to capture useful spatial information using an LSTM network to produce image descriptions, even when the input is provided as a sequence rather than as an explicit 2D representation of objects. Additionally, using our proposed OBJ2TEXT model in combination with an existing image captioning model and a robust object detector, we showed improved results on the task of image captioning.

Acknowledgments

This work was supported in part by an NVIDIA Hardware Grant. We are also thankful for the feedback from Mark Yatskar and anonymous reviewers of this paper.


Figure 5: Qualitative examples comparing the generated captions of (a) OBJ2TEXT-YOLO, (b) CNN-RNN and (c) OBJ2TEXT-YOLO + CNN-RNN. Images are selected from the 500 human evaluation images and annotated with YOLO object detection results. Captions preferred by human evaluators with agreement are highlighted in bold text.


References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR).

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2625–2634.

Desmond Elliott and Frank Keller. 2013. Image description using visual dependency representations. In EMNLP, volume 13, pages 1292–1302.

Benjamin Eysenbach, Carl Vondrick, and Antonio Torralba. 2016. Who is mistaken? arXiv preprint arXiv:1612.01175.

Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh K Srivastava, Li Deng, Piotr Dollar, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C Platt, et al. 2015. From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1473–1482.

Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In European Conference on Computer Vision, pages 15–29. Springer.

David F Fouhey and C Lawrence Zitnick. 2014. Predicting object dynamics in scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2019–2026.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, and Trevor Darrell. 2016. Natural language object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4555–4564.

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2073–2083, Berlin, Germany. Association for Computational Linguistics.

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137.

Andrej Karpathy et al. 2016. Neuraltalk2. https://github.com/karpathy/neuraltalk2/.

Ioannis Konstas and Mirella Lapata. 2012. Unsupervised concept-to-text generation with hypergraphs. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 752–761.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Neural Information Processing Systems (NIPS), pages 1097–1105.

Karen Kukich. 1983. Design of a knowledge-based report generator. In Proceedings of the 21st Annual Meeting on Association for Computational Linguistics, pages 145–150. Association for Computational Linguistics.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In ECCV, pages 740–755. Springer.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. CoRR, abs/1508.04025.

Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. 2015. Deep captioning with multimodal recurrent neural networks (m-RNN). ICLR.

Rebecca Mason and Eugene Charniak. 2014. Nonparametric method for data-driven image captioning. In ACL (2), pages 592–598.

Vicente Ordonez, Xufeng Han, Polina Kuznetsova, Girish Kulkarni, Margaret Mitchell, Kota Yamaguchi, Karl Stratos, Amit Goyal, Jesse Dodge, Alyssa Mensch, Hal Daume III, Alexander C. Berg, Yejin Choi, and Tamara L. Berg. 2015. Large scale retrieval and generation of image descriptions. International Journal of Computer Vision, pages 1–14.

Vicente Ordonez, Girish Kulkarni, and Tamara L Berg. 2011. Im2Text: Describing images using 1 million captioned photographs. In Advances in Neural Information Processing Systems, pages 1143–1151.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, pages 2641–2649.

Arnau Ramisa, JK Wang, Ying Lu, Emmanuel Dellandrea, Francesc Moreno-Noguer, and Robert Gaizauskas. 2015. Combining geometric, textual and visual features for predicting prepositions in image descriptions. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 214–220. Association for Computational Linguistics.

Joseph Redmon and Ali Farhadi. 2017. YOLO9000: Better, faster, stronger. In Computer Vision and Pattern Recognition (CVPR).

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, and Bernt Schiele. 2016. Grounding of textual phrases in images by reconstruction. In European Conference on Computer Vision, pages 817–834. Springer.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252.

Allen Schmaltz, Alexander M. Rush, and Stuart Shieber. 2016. Word ordering without syntax. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2319–2324, Austin, Texas.

Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (ICLR).

Frank A Smadja and Kathleen R McKeown. 1990. Automatically extracting and representing collocations for language generation. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 252–259.

Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015a. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.

Ramakrishna Vedantam, Xiao Lin, Tanmay Batra, C Lawrence Zitnick, and Devi Parikh. 2015b. Learning common sense through visual abstraction. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2542–2550.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1711–1721, Lisbon, Portugal.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.

Li Yao, Nicolas Ballas, Kyunghyun Cho, John R. Smith, and Yoshua Bengio. 2016a. Oracle performance for visual captioning. In British Machine Vision Conference (BMVC).

Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, and Tao Mei. 2016b. Boosting image captioning with attributes. arXiv preprint arXiv:1611.01646.

Mark Yatskar, Vicente Ordonez, and Ali Farhadi. 2016. Stating the obvious: Extracting visual common sense knowledge. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 193–198, San Diego, California. Association for Computational Linguistics.

Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4651–4659.

C Lawrence Zitnick and Devi Parikh. 2013. Bringing semantics into focus using visual abstraction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3009–3016.

C Lawrence Zitnick, Devi Parikh, and Lucy Vanderwende. 2013. Learning the visual interpretation of sentences. In Proceedings of the IEEE International Conference on Computer Vision, pages 1681–1688.