Translating Related Words to Videos and Back through ...jcorso/pubs/wsdm13-das-final.pdf ·...

Translating Related Words to Videos and Back throughLatent Topics

Pradipto DasCSE Dept., SUNY Buffalo

Buffalo, NY [email protected]

Rohini K. SrihariCSE Dept., SUNY Buffalo


Jason J. CorsoCSE Dept., SUNY Buffalo


ABSTRACTDocuments containing video and text are becoming moreand more widespread and yet content analysis of those doc-uments depends primarily on the text. Although automateddiscovery of semantically related words from text improvesfree text query understanding, translating videos into textsummaries facilitates better video search particularly in theabsence of accompanying text. In this paper, we proposea multimedia topic modeling framework suitable for pro-viding a basis for automatically discovering and translatingsemantically related words obtained from textual metadataof multimedia documents to semantically related videos orframes from videos. The framework jointly models video andtext and is flexible enough to handle different types of docu-ment features in their constituent domains such as discreteand real valued features from videos representing actions,objects, colors and scenes as well as discrete features fromtext. Our proposed models show much better fit to the mul-timedia data in terms of held-out data log likelihoods. For agiven query video, our models translate low level vision fea-tures into bag of keyword summaries which can be furthertranslated using simple natural language generation tech-niques into human readable paragraphs. We quantitativelycompare the results of video to bag of words translationagainst a state-of-the-art baseline object recognition modelfrom computer vision. We show that text translations frommultimodal topic models vastly outperform the baseline ona multimedia dataset downloaded from the Internet.

Categories and Subject DescriptorsI.2.7 [Computing Methodologies]: Artificial Intelligence—Natural Language Processing

General TermsAlgorithms, Experimentation

Keywordsmultimedia topic models; video to text summarization

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.WSDM’13, February 4–8, 2012, Rome, Italy.Copyright 2013 ACM 978-1-4503-1869-3/13/02 ...$15.00.

1. INTRODUCTIONIn recent years there has been an abundance of multimedia

data in the form of video contents from television networks,video uploads to websites and so on. However, organizingsuch data by integrating the semantic content of the videosis a very difficult problem. On the other hand, summarizingvideos directly into text summaries can lead to significantimprovements in many end-user applications—multimediasearch experience, content based advertisements [33], help-ing the visually impaired and so on. In this paper we concernourselves with two language agnostic tasks: i) Building atopic modeling framework to model multimedia documentsconsisting of videos and textual metadata and ii) use thetopic modeling framework to predict bag-of-word summariesfor a new video belonging to a previously known category.The first task helps us discover semantically related con-cepts in the text through latent topics and translating themto topically related videos or frames. The second task takesa video and generates intermediate text keywords ideal fornatural language generation.

Human summary Footage of people skateboarding and doing tricks - skateboarder falls and hits head

Human summary One guy is making a wooden table indoors

Figure 1: An example of the task of video summarization

As a further addition we also experiment with efficientnatural language sentence generation from the predicted bagof words (henceforth BoW) using language models and con-fidence of syntactic parse tree generation following a simpletemplate. The first two tasks, though, are the focus of thispaper. Fig. 1 shows some keyframes of two sample videosfrom our training dataset and the short summaries writtenby human annotators. This dataset is discussed in Section1.1. In our paper, translation from video to text is synony-mous to summarizing a video with a set of textual keywords.

It seems intuitive that a topic model which incorporateslow level vision features representing objects, actions, colorand scenes and correspond those to the text summaries shouldhave a better chance of describing the multimedia data.We thus seek representations of visual data that mimic thesubject-verb-object-scene quadruplet structure of English sen-tences in terms of subject, object and scene nouns as wellas verbs. Additionally illumination gives rise to color whichdifferentiates one object from another and is often expressedas adjectives. Objects, actions and color can be visualized

using specific word concepts and can be counted over timethus lending themselves to quantization but scene representsglobal energy distributions which pervade the arrangementof objects and is thus better represented as real values.

In the context of this paper, our training data consistsof videos and associated summaries. The training data isalso available with event category labels for e.g. “boardingevent” and topic modeling video documents in each eventcan allow us to discover “sub-topics” e.g. “skateboarding”,“snowboarding” and “surfing”. Of course, our topic mod-els do not include any event label bias and can be appliedon the overall dataset as well. However we will see shortlythat doing so can be very unappealing to end users whensummaries need to be generated.

Apart from the topical analysis of video documents, theproblem of generating summaries directly from videos alsohas significant end user appeal. We emphasize that the videosummarization/translation task in this paper is to describean entire video firstly as a bag of salient keywords i.e. BoW,and then, as a further addition, use simple Natural LanguageGeneration (henceforth NLG) techniques to summarize thebag of words into a human readable paragraph of text wher-ever possible. Translating and generating summaries from avideo can always be looked upon as finding the right infor-mation need which is paramount to any search problem.

Video summarization in our context is different from im-age annotation mentioned in early literature [27] and in moremodern ones [18]. If we perfectly annotate all objects andactions in each frame correctly, we can probably use a stan-dard topic modeling technique [5] to figure out the themes inthe multimedia documents. However, detecting objects [15]and scenes [22] in images and actions [11] and keyframes[14] in videos in a reliable manner are open problems in thecomputer vision community [18]. Most of these detectiontechniques fall in the domain of supervised learning and re-quire large amounts of annotated data. Video annotationtask is a particularly laborious process even though toolslike VATIC [29] have been written to ease the effort (for aninteractive demo of the tool, see VATIC’s website1). Alsofiring thousands of detectors to accurately label even a singlekeyframe of an unknown video leads to many false positives.This is particularly true of the videos in-the-wild i.e. videosdownloaded from the Internet where there are plenty of res-olution problems, severe motion blurs, camera shakes etc.These are the videos that we experiment with in this paper.

We view the video to text translation/summarization prob-lem in the light of multidocument summarization of plaintext documents which has been popularized by the TextAnalysis Conference2 (TAC). In the multidocument summa-rization track of TAC, participants are given document sets(docsets) of newswire articles typically belonging to 5 majorevent types like “Health and Safety,”“Accidents and Natu-ral Disasters” etc. and are asked to generate a fixed lengthfluent summary of the documents in each docset. A docsetin the TAC setting is unique in that it contains a set of doc-uments that are relevant for a particular information needlike “Cyclone Katrina.” The system summaries are scored inseveral ways including the most reliable manual way usingPYRAMID [21] evaluation but systems usually are tunedw.r.t. the automatic ROUGE [16] scoring. By analogy, we

1http://mit.edu/vondrick/vatic/

2http://www.nist.gov/tac/2011/Summarization/

assume that each docset here corresponds to a video andcontains a sequence of frames and a set of keyframes. Attest time we are given unknown event specific videos with-out any text summary. For measuring system performance,we generate summaries of videos and evaluate them usingthe recall oriented ROUGE-1 score to measure the percentoverlap of the words in the short ground truth summaries.

Human Summary: Montage of clips from an outdoor wedding Predicted bag of words summary: birthday wed indoor outdoors mob dance flash cake parade ceremony fish

Figure 2: An example of vocabulary intrusion in the task of videosummarization. Best viewed with magnification

A key concern in generating a BoW summary of a video isthe vocabulary intrusion problem. Fig. 2 shows an exampleof vocabulary intrusion in the task of video summarizationthat arises out of topic modeling on the entire vocabularyof the corpus. If we consider a vocabulary of V words—theprobability of getting the top L words correctly in the sum-mary is (1/V )(1/(V − 1))...(1/(V − L + 1)). If V is large(such as 2000) then the probability is very low. Further, ifthe entire vocabulary is used, then intrusive words describ-ing other but related event categories like “birthday, flashmob, dance, parade, fish” can appear with high probability(see a possible predicted BoW summary in Fig. 2 from atopic model(Fig. 3b) trained over all events with numberof topics set to 200). This problem is mitigated by firstclassifying the test video into its corresponding event cate-gory (Section 4.4) and then using a topic model to predictthe BoW summary. In the absence of the event labels, thisdirection improves readability and is much faster.

The novelty in our new approach to topic model-ing video documents with textual metadata is the use ofthe right features for the videos and augmenting basic topicmodels for joint modeling with those features along withtext. We represent each video in terms of objects, actions,color (represented with discrete distributions) and scenes(represented with Normal distributions with unknown meansand variances) and try to find a translation space that trans-lates the pattern of these features to a permutation in lan-guage vocabulary. Such a representation of a video is bothintuitive and logical. We observe that the interplay of thefull spectrum of representations (Section 4) indeed yield thehighest likelihoods to held out test data than those usingpartial representations (Section 4.1).

1.1 Dataset DescriptionThe dataset that we use for the video summarization task

is released as part of NIST’s 2011 TRECVID MultimediaEvent Detection (MED) evaluation set3. The dataset con-sists of a collection of Internet multimedia content posted tothe various Internet video hosting sites. The training set isorganized into 15 event categories, some of which are:1) Attempting a board trick 2)Feeding an animal 3)Landing a fish

4)Wedding ceremony 5) Working on a woodworking project etc.

We use the videos and their textual metadata in all the 15events as training data. There are 2062 clips with summariesin the training set with almost equal distribution amongstthe events. The test set which we use is called the Transpar-ent Development (Dev-T) collection. The Dev-T collectionincludes positive instances of the first 5 training events andnear positive instances for the last 10 events—a total of 630

3http://www.nist.gov/itl/iad/mig/med11.cfm

videos labeled with event category information (and associ-ated human synopses which are to be compared against forsummarization performance). Each summary is a short andvery high level description of the entire video and rangesfrom 2 to 40 words but on average 10 words (with stop-words). We remove standard English stopwords and retainonly the word morphologies (not required) from the synopsesas our training vocabularies. The proportion of videos be-longing to events 6 through 15 in the Dev-T set is much lowcompared to the proportion for the other events since thoseclips are considered to be “related” instances which coveronly part of the event category specifications. The perfor-mances of our topic models are evaluated on those kinds ofclips as well. The numbers of videos in events 6 through15 in the Dev-T set are {4,9,5,7,8,3,3,3,10,8} while there arearound 120 videos per event for the first 5 events. All othervideos in the Dev-T set neither have any event category la-bel nor are identified as positive, negative or related videosand we do not consider these videos in our experiments.

1.2 Evaluation MeasuresWe measure the predictive performance of the topic mod-

els using the Evidence Lower BOunds (ELBO) on held-outtest set—the Dev-T collection with summaries, as well asthe predictive ELBO for BoW summary generation on theheld-out Dev-T collection without summaries (Section 4.1).ELBO is just log likelihood and is directly related to mea-suring average perplexity of the model per observed textualword [5, 4]. We also evaluate our BoW summaries usingthe ROUGE scorer. ROUGE measures the n-gram overlapfor system generated summaries to the ones written by an-notators and the scores are interpreted in terms of recall.Usually 4 gold standard summaries are needed for evalua-tion but here we use the base case of using only one shortsummary as a reference summary per video on this dataset.While summarizing, since our primary task is to evaluateonly the BoW summaries generated from a video, we usethe ROUGE-1 unigram measure. We evaluate 5 and 10 key-words long BoW summaries respecting the average lengthof the short human summaries. Since we are consideringvideos in the Dev-T set with event category information,we can use the ROUGE evaluation setup of multidocumentsummarization as used in TAC. If the categories are notknown, we can multiply the ROUGE scores with the eventclassification accuracies to obtain lower bounds (see Section4.4 for lower bounds on classification accuracies). Evalua-tions with higher order n-grams are not needed for unigramtranslations. We do not use manual evaluations since thedata cannot be released for public verifications.

The task of discovering topically related words is mostlyevaluated w.r.t ELBO. We use the topic models from [4]as baselines. We modify the GM-LDA model in [4] fol-lowing [26] to use discrete visual data and name the modelMMLDA—“MM”stands for the multinomials for text as wellas the multinomials for the visual words. We implement adeterministic optimization framework for MMLDA insteadof the non-deterministic sampling as in [26]. The Corr-LDAmodel in [4] is also extended by using Normal-Wishart pri-ors and named Corr-MGLDA (M for Multinomials and Gfor Gaussians). For evaluating video to text summarizationbased on ROUGE-1 scores, we use a non-topic model basedautomatic image annotation tool as the baseline for video la-beling by using labels aggregated from keyframes. Our topicmodel based video summarization methods outperform the

state-of-the-art image to text translation model [15] appliedon video keyframes in terms of ROUGE-1 scores of the pre-dicted keyword summaries.

2. RELATED WORKMakadia et al. [18] uses nearest neighbor and label trans-

fer techniques to annotate images suitable for the image re-trieval task. However, we can not directly apply their meth-ods as the individual frames/keyframes of the videos in ourdataset are not annotated. Based on the size and genre ofour dataset, such annotations prove very expensive and wedo not follow that direction. Further, we are interested inthe task of direct natural language summarization of the en-tire video and not specific annotation of a vast majority ofpossible objects, actions and scenes in every frame/keyframeof the video. The closest work to our task is by Yang et al.[34] where low level object and scene classifiers are used toobtain object and scene labels in an image. These are thencombined using background language models and HiddenMarkov Models to predict a natural language sentence thatautomatically includes the best possible verb i.e. action. Wewill observe in Section 4.4 that actions, which are intrinsicto videos, are important event discriminators. Further, noneof above mentioned methods can discover related conceptsas latent topics and translate them into related frames.

In the domain of topic modeling of images with captions,the Corr-LDA model has recently been extended to handlea multinomial feature space in [25] with different numberof topics for visual word type and textual word type. Themodel learns an association from the topic proportions overimage domain to those over text domain through a regressionformulation. However, during prediction, this dependencyneeds to be marginalized out anyways. Also, if we quantizeevery type of real valued vision feature using some clusteringalgorithm such as K-means into C clusters, then each Crepresents a parameter of the final model and performanceanalysis become that much more difficult. Ahmed et al. [1]uses Gaussian feature vectors and mention Normal-Wishartpriors but do not use them—they use uniform priors in anon-deterministic sampling framework instead.

On the other hand the Continuous Relevance Model (CRM)[13] and Multiple Bernoulli Relevance Model (MBRM) [8]assume different, nonparametric density representations ofthe joint word-image space. CRM gets rid of the latent fac-tor representation and achieves non-parameterization. Thedataset used in [8] for MBRM has hierarchical word anno-tations which are handled using multiple Bernoulli modelsrather than multinomial distributions. In our dataset, multi-nomial distributions are sufficient since the summaries readlike very short documents with repeated word morphologies.

Detecting objects can often be seen as an important steptowards identifying the main topic of a video and generatinga BoW summary. To that end, Torresani et al. [28] trans-form an image feature vector into a another lower dimen-sional feature vector whose values are the outputs of severalcategory classifiers (which are named “classemes” in theirpaper). We take a similar approach to convert Object Bank[15] (OB) feature vectors to high level sparse histogram ofobject detectors to be used in our discrete video data rep-resentation and as baseline for video to text translation. Toextract OB features, keyframes are identified to reduce com-putational time. Keyframe detection is a research topic inits own right, where some recent ones include more involved

techniques [14] using Transfer Learning from accompanyingtext transcripts. However, the keyframes extracted usingthe change in color histogram [35] satisfy our purposes.

In the domain of topic modeling of videos, the HiddenTopic Markov Model in [32] does not incorporate both textand visual words in a single framework and also does notuse a fuller representation of videos as we do. A Markovianassumption is also imposed in [10] for modeling actions andidentifying behaviors (with no automatic labeling), however,we can safely ignore frame dependence because our actionfeatures are derived using temporal windows and activitytracking is not an objective in this paper. The reformula-tions of LDA and CTM using class labels and without anytemporal dynamics in [30] also target activity classification.To the best of our knowledge this is the first work to usemixed membership topic models for video to text summa-rization which can eliminate frame-wise object annotations.

The proposed models are discussed in the following sec-tion in as much depth as possible. The use of asymmetricDirichlet priors over the topic proportions helps us achievebetter sparsity in topics. However, this also leads to sin-gularities in precision matrices conditioned on topics whenNormal-Wishart priors for real valued data are not used.

3. THE PROPOSED MODELSIn this section we describe our proposed topic models

for multimedia documents. We call the model in Fig. 3dMMGLDA, short for Multinomial-Multinomial-Gaussian LDA,and the model in Fig. 3e to Corr-MMGLDA, short for cor-respondence MMGLDA. In our context, the correspondenceLDA model [4] places a probabilistic constraint on the cor-respondence of summary words to Gaussian observations—aword is likely to be generated by the topic which is agreedupon by most of the Gaussian instances in the video docu-ment. Since the real valued GIST features (see Section 4)“summarize” a scene in an image, we want a stronger influ-ence of the topic of the scene on the summary text. Thisassumption is relaxed in MMGLDA. Of course the corre-spondence could have been established between the discreteobservations only or both discrete and real valued ones butconditioned on the current dataset, we want more flexibil-ity in topics for sampling discrete observations. We avoidoverly complicated topic models and instead go for betterdata representations and supporting models with just theright amount of complexity.

In MMGLDA, for a multimedia document d, it is possi-ble to have different topics competing for each occurrenceof wm. In Corr-MMGLDA, the number of such modes isconstrained to be much fewer. The asymmetric α can yieldfew additional modes which group co-occurring data domi-nant in densities or masses in separate latent topics. Thisphenomenon is observed for a larger number of latent topics.

Table 1 explains the symbols used in the two proposedtopic models. Everywhere in this paper, we assume that Kis the number of topics. The generative processes for theproposed models are illustrated below:

For each video document d ∈ 1, ...,DChoose a topic proportion θ|α ∼ Dir(α)For each position h in d

Choose topic indicator zh|θ ∼Mult(θ)Choose a discrete video “word” wh|zh = k,ρ ∼Mult(ρzh

)For each real valued observation o in d

Choose topic indicator zo|θ ∼ N (μ,Λ−1)Choose a real valued wo|zo = k,μ,Λ−1 ∼ N (μzo

,Λ−1zo )

Symbol Meaning (r.v. = random variable)D total number of multimedia “documents”

M total number of discrete text features per multime-dia document d ∈ {1, ...,D}

H total number of discrete visual features in a multi-media document d ∈ {1, ...,D}

O total number of real valued visual features per mul-timedia document d ∈ {1, ...,D}

α ={α1, ..., αK}

r.v. for asymmetric Dirichlet prior for the documentlevel topic proportions

θd r.v. for document level latent topic proportions

ρ corpus level topic multinomials over discrete videofeatures

β corpus level topic multinomials for textual words

μ means of topic Gaussians for the real valued featuresfrom videos

Λ =(Σ−1)

precision (inverse covariance) matrices of topicGaussians for the real valued features from videos

ym in Figs.3b, and 3d

indicator variable for a sample from θd for discretetext features

ym in Figs.3c and 3e

indicator variable for document level real valued da-tum correspondences

zh indicator variable for a sample from θd for discretevisual features

zo indicator variable for a sample from θd for real val-ued visual features

wm r.v. for textual word at position m in document d;vocabulary size of V

wh r.v. for vision oriented discrete feature at positionh in document d; vocabulary size corrVH

wo r.v. for the oth Gaussian feature vector with a di-mensionality of P in document d

Table 1: Meanings of the variables used in the models

For each position m in video dChoose ym ∼ Uniform(1, ...,O) (for Fig. 3e)or Choose ym|θ ∼Mult(θ) (for Fig. 3d)Choose a word wm ∼ p(wm|zym ,β) (for Fig. 3e)or Choose a word wm ∼ p(wm|ym,β) (for Fig. 3d)

In all further notations, wM is the ensemble of observedrandom variables that represent summary words in the dth

multimedia document. Similar notations hold for wO, wH

and the indicators y and z. In this paper, the text vocabu-laries are event specific and of size 312 words on average.

Fig. 3a shows the Object Bank [15] (OB) baseline thatwe initially used to translate videos to text. The boxes la-beled OB1, ..., OBN are the individual object detectors inObject Bank. The positive responses of the detectors leadtowards identifying the label of the objects in the keyframesand hence translating the entire video. We choose this base-line to verify the difficult nature of our dataset—there is a10% overlap between OB’s vocabulary and the test set vo-cabulary, and we should expect to see at least 2-5% recall inROUGE-1 recall scores for most events based on a 40-50%ROUGE-1 recall achieved by the best 100-word multidocu-ment text summarization systems in TAC competitions.

3.1 Inference on Latent VariablesWe use the Variational Bayesian Expectation Maximiza-

tion (VB-EM) [2, 31] algorithmic framework as the opti-mization framework. An advantage of VB-EM is that it isdeterministic.

The derivations for the MMG class of topic models be-come sufficiently complicated due to the need for using priorsover the parameters governing the real valued observations.Since the nature of the modes for topic proportions is not

(a) Object Bank objectdetection model [15]

D

M H

d

ym zh

wh wm

K

(b) MMLDA [4, 26]

D

M O

d

ym zo

wo wm

K

K μ

(c) Corr-MGLDA [4](extended)

D

M O

d

ym zo

wo wm

K

H zh

wh

K K

μ

(d) MMGLDA (pro-posed)

D

M O

d

ym zo

wo wm

K

H zh

wh

K K

μ

(e) Corr-MMGLDA(proposed)

Figure 3: Graphical model representations of existing topic models and proposed extensions— Figs. 3d and 3e. In this paper, we extendthe model in Fig. 3c i.e. the Corr-LDA model in [4] with Normal-Wishart priors over parameters for real valued observations as well.

known in advance, singularities arising out ill-conditionedtopic covariance matrices must be handled. This problemis mitigated in a principled way by introducing independentNormal-Wishart priors governing the mean vectors and pre-cision matrices of the Gaussians conditioned on the topics.Since both μ andΛ are unknown we cannot factorize p(μ,Λ)directly because the variance of the distribution over μ isa function of Λ. Instead we use combinations of Normal-Wishart priors on each Gaussian component as:

p(μ,Λ) =K∏

k=1

N (μk |m0, (κ0Λk)−1)W(Λk|W0, ν0) (1)

where Σ−1k = Λk is the precision matrix for the kth fac-

tor or topic. This is similar to the mixture model used in[19]. To preserve the dependence between the means andcovariances, a partially factorized tractable q distributionwith “free” variational parameters γ, φ, φ(O),φ(H) (for ev-ery multimedia document d ∈ D) is imposed by

q(θ,y, zO , zH |γ,φ,φ(O)φ(H)) =

⎡⎣ D∏d=1

q(θd|γd)

⎡⎣ Md∏m=1

q(yd,m

|φd,m)

Od∏o=1

q(zd,o|φ(O)d,o )

Hd∏h=1

q(zd,h|φ(H)d,h )

⎤⎦⎤⎦ K∏

i=1

q(μi,Λi) (2)

with θd ∼ Dirichlet(γd), zd,o ∼ Mult(φ(O)d,o ) and zd,h ∼

Mult(φ(H)d,h ). The maximum likelihood (ML) estimates of

free parameters are found by optimizing the lower bound onlog p(wM ,wH ,wO|α,β,ρ,μ,Λ).

The hyperparameters for α in the asymmetric Dirichletcase (the concentration parameter and the base measure)and κ0, ν0, m0 and W0 are not shown in Figs. 3c, 3d and3e and in equ. 2 above. Also φ are the free parametersof the variational summary word multinomials over Gaus-sian observations in the correspondence multimodal mod-els or summary word multinomials over topics in the plainmultimodal models; φ(O) are the free parameters of the vari-ational Gaussian observation multinomials over topics andsimilarly for φ(H) for discrete visual features.

The variational posterior distribution q(μk,Λk) does notfactorize into the product of the marginals, but we can al-ways write it as q(μk,Λk) = q(μk|Λk)q(Λk). Then weuse the result from mean field theory[24, 31] that says thatthe log of the optimal solution for factor qj is obtained byconsidering the log of the joint distribution over all hid-den and observed variables and then taking the expecta-tion with respect to all of the other factors {qi} for i �= ji.e. for visible and hidden variable ensembles V and H ,log q∗j (Hj) = Ei�=j [log p(V,H)] + const. This results in aNormal-Wishart distribution and is given by:

q(μk,Λk) = N (μk|mk, (κkΛk)−1)W(Λk|Wk, νk) (3)

where Σ−1k = Λk is the precision matrix for the kth factor

or topic. The expression in Equ. 3 is obtained by firstwriting out the expression for log q∗(.) and selecting thoseterms that involve μk and Λk. This yields:

log q∗(μ,Λ) =K∑i=1

log p(μi,Λi)+ (4)

D∑d=1

Od∑o=1

K∑i=1

φ(O)d,o,i logN (wd,o|μi,Λ

−1i ) + const

Note that the variance of the distribution over μk is afunction of Λk. The random variables mk and Wk can bethought of as surrogates to m0 and W0 and that κk and νksurrogates to κ0 and ν0 but conditioned on latent topic k.The expressions for these variables, which are also used inthe M-Step updates, can be found in Equs. 23, 24, 22 and 25.These expressions are obtained by matching the moments ofμk and Λk to the Normal and Wishart distribution expres-sions. The optimal solution for q∗(μk,Λk) depends on themoments evaluated with respect to the distributions of othervariables, and so the variational update equations are cou-pled and must be solved iteratively. Following [5, 4], let usnow write down the objective functional, L(.), to be max-imized which acts as the lower bound to the true data loglikelihood. We have for the MMGLDA model,

LMMG = Eq[log p(θ|α)] +Eq[log p(yM |θ)]+ Eq[log p(wM |yM , β)] + Eq[log p(zO|θ)]+ Eq[log p(wO|zO,μ,Λ)] + Eq[log p(μ,Λ|m0,W0, κ0, ν0)]

+ Eq[log p(zH |θ)] + Eq[log p(wH |zH ,ρ)]−Eq [log q(θ,yM ,

zO, zH ,μ,Λ|γ,φ,φ(O),φ(H),m,W,κ,ν)] (5)

We only highlight the derivations for the expressions:

Eq[µi,Λi]

[ln|Λi|

2− (wo−μi)

′Λi(wo−μi)2

], Eq[µi,Λi]

[log p(μi,Λi)]

and Eq[µi,Λi][log q(μi,Λi)]. In the variational Bayesian set-

ting, the expression:

Eq[µi,Λi]

[(ln |Λi|)/2 − ((wo − μi)

�Λi(wo − μi))/2]

needs to be evaluated in the log likelihood calculation forevery video document d to update the free distributions

given the current parameter values. The term[ln|Λi|

2

]is

the normalization factor of the Gaussians and its expecta-tions can cause the log likelihood to be positive. We there-fore only evaluate Eq[µi,Λi]

[−((wo − μi)�Λi(wo − μi))/2

]

for the per document updates and subtract the log of theexponentials of the aggregations as an approximation. Weindependently derive and mention only the final expressionsfor the following variables due to space constraints and selfcontainment of this paper:

Eq[µi,Λi][ln |Λi|] =

P∑p=1

ψ

(νi + 1− p

2

)+ P ln 2 + ln |Wi| (6)

Eq[µi,Λi]

[(wo − μi)

′Λi(wo − μi)

]= Pκ−1

i + νi((wo −mi)

′Wi(wo −mi)

)(7)

Eq[µ,Λ][log q(μ,Λ)] =

K∑i=1

{1

2ln Λi +

P

2lnκi

2π− P

2−H[q(Λi)]

}

(8)

H[q(Λi)] = − lnZ(Wi, νi)− (νi − P − 1)

2ln Λi +

νiP

2, where

(9)

• Z(Wi, νi) = |Wi|−νi/2

⎛⎝2νiP/2πP (P−1)/4

P∏p=1

Γ

(νi + 1− p

2

)⎞⎠−1

• ln Λi = Eq[ln |Λi|] =P∑

p=1

ψ

(νi + 1− p

2

)+ P ln 2 + ln |Wi|

Note that Ψ is the digamma function. For the expression∑Ki=1 Eq[µi,Λi]

[log p(μi,Λi)], we have:

K∑i=1

Eq[µi,Λi][log p(μi,Λi)] =

1

2

K∑i=1

{P ln(

κ0

2π) + ln Λi − κ0P

κi

−κ0νi(mi −m0)′Wi(mi −m0)

}+K lnZ(W0, ν0)

+ν0 − P − 1

2

K∑i=1

ln Λi − 1

2

K∑i=1

νiTr(W−10 Wi) (10)

Using the lower bound LMMG, the ML estimations of thehidden variables in video document d can be obtained usingLagrange Multipliers on φ(H), φ(O) and φ as follows:

φ(H)d,h,i ∝ exp

⎧⎨⎩ψ(γd,i)− ψ(

K∑j=1

γd,j) + log ρi,wd,h

⎫⎬⎭ (11)

φ(O)d,o,i ∝ exp

⎧⎨⎩ψ(γd,i)− ψ(

K∑j=1

γd,j) +Eq[µi,Λi][(ln |Λi|)/2

−((wo − μi)′Λi(wo − μi))/2

]}(12)

φd,m,i ∝ exp

⎧⎨⎩ψ(γd,i)− ψ(

K∑j=1

γd,j) + log βi,wd,m

⎫⎬⎭ (13)

γd,i = αi +

Md∑m=1

φd,m,i +

Od∑o=1

φ(O)d,o,i +

Hd∑h=1

φ(H)d,h,i (14)

For the Corr-MMGLDA model, Eq[log p(wM |zyM , β)]expands out to be:

M∑m=1

K∑i=1

(O∑

o=1

φm,oφ(O)o,i

)log βi,wm (15)

Also,

Eq[log q(yM |φyM)] =

M∑m=1

O∑o=1

φm,o log φm,o (16)

and Eq[log p(yM |O)] is constant for all m in d. Equation(15) is a computational bottleneck because finding the con-fidence of the word wm on topic i necessitates the elimina-tion of uncertainties of wm’s dependence on wo and wo’sdependence on topic i. This is also a strong point since themarginalization suggests a stronger influence of a topic on asummary word if that influence is justified by most wos.

Using a similar lower bound LCorr−MMG for Corr-MMGLDA,the ML estimations of the hidden variables in video d canbe obtained as follows:

φ(H)d,h,i ∝ exp

⎧⎨⎩ψ(γd,i)− ψ(

K∑j=1

γd,j) + log ρi,wd,h

⎫⎬⎭ (17)

φ(O)d,o,i ∝ exp

⎧⎨⎩ψ(γd,i)− ψ(

K∑j=1

γd,j) + Eq[µi,Λi]

[log |Λi|

2−

(wo − μi)′Λi(wo − μi)

2

]+

Md∑m=1

φd,m,o log βzym ,wd,m

⎫⎬⎭ (18)

φd,m,o ∝ exp

{K∑i=1

φ(O)d,o,i log βi,wd,m

}(19)

γd,i = αi +

Od∑o=1

φ(O)d,o,i +

Hd∑h=1

φ(H)d,h,i (20)

3.2 Model Parameter EstimationsBefore deriving the expressions for the maximum a poste-

riori and maximum likelihood estimates of the parameters ofthe proposed models using moment matching (Section 3.1)and derivatives w.r.t the parameters of the functional L(.)

let us define the following quantities for each topic i:

Ni =D∑

d=1

Od∑o=1

φ(O)d,o,i; xi =

1

Ni

D∑d=1

Od∑o=1

φ(O)d,o,iwd,o

Si =

∑Dd=1

∑Odo=1 φ

(O)d,o,i(wd,o − xi)(wd,o − xi)

′

Ni(21)

By using moment matching techniques on Equ. 3, we obtainthe following MAP expressions in general for each topic i:

κi = κ0 +Ni (22)

mi =1

κi(κ0m0 +Nixi) (23)

W−1i = W−1

0 +NiSi +κ0Ni

κ0 +Ni(xi −m0)(xi −m0)

′(24)

νi = ν0 +Ni (25)

Further, using some algebraic manipulations and utilizingLagrange Multipliers for β and ρ for each topic i, we obtain:For the MMGLDA model:

ρi,j ∝D∑

d=1

Hd∑h=1

corrVH∑j=1

φ(H)d,h,iδ(wd,h, j) (26)

βi,j ∝D∑

d=1

Md∑m=1

V∑j=1

φd,m,iδ(wd,m, j) (27)

For the Corr-MMGLDA model:

βi,j ∝D∑

d=1

Md∑m=1

V∑j=1

(O∑

o=1

φd,m,oφ(O)d,o,i

)δ(wd,m, j) (28)

The update for parameters ρi,j remain the same. To opti-mize the α parameters, we follow the corresponding expres-sions in [5] and optimize using Newton’s iterative gradientbased method using backtracking line search.

For predicting a bag of words summary from an ensembleof low level features of video document d and the learntp(wv|zv = k,β), we permute the vocabulary V for the newtest video as:

p(wv|wO,wH) ≈O∑

o=1

K∑i=1

φ(O)d,o,ip(wv|βi) +

H∑h=1

K∑i=1

φ(H)d,h,ip(wv|βi)

(29)

4. EXPERIMENTAL SETUP AND RESULTSHere we briefly mention the descriptors that we use to

represent the videos. To represent actions, we use featuresknown as Histogram of Oriented Gradients in 3D (HOG3D)[11]. The gradient directions are binned by mapping themto 10 polar meridian and 6 polar parallel planes and thentreating half spaces to be equivalent. We resized the videoframes such that the largest dimension (height or width)was 160 pixels, and extracted HOG3D features from a densesampling of frames. Our HOG3D parameters resulted ina 300-dimensional feature vector using support volumes ofdimension 2 × 2 × 5 and 5 × 3 polar co-ordinate bins. Wethen use K-means clustering to create a 1000-word codebookfollowing [3] from a random sampling of the training data.

Color histogram features are also used as part of the dis-crete visual data. We use 512 RGB color bins and his-tograms are computed on densely sampled frames. Due tolarge deviations in the extremities of the color spectrum, weuse the histogram between the 15th and 85th percentiles av-eraged across a video and counts normalized to lie in [1,100].

Finally we use Object Banks [15] for a histogram patternof positive object detections. OB transforms an image intoa 44604 dimensional concatenated feature vector for each ofthe 177 off-the-shelf object detectors that are currently used.Each entry within a 252 dimensional detection feature vectorrepresents the distance from the decision hyperplane midwaywithin the margins for different scale-space transformationsof the image. The object labels in OB cover only about 10%of the summary words (246 out of 2687 for the training setand 166 out of 1219 for the Dev-T set). Keyframes usedfor these features are extracted using the change in colorhistogram method [35] and the positive OB responses arequantized following classemes in [28]. Thus wH in Figs. 3b,3d and 3e consists of codebook histograms from HOG3D,color and OB. Needless to say, the contributions of theseoff-the-shelf object detectors are not significant at all.

The real valued features we use in our video represen-tation are those representing scenes as mentioned in [22].The scene property by itself induces image summarizationin a way that is consistent with human perception of vision[23]. A set of perceptual dimensions is proposed along theboundary viewpoint (e.g. depth, openness, expansion, per-spective) and along the content viewpoint (e.g. naturalness,roughness, ruggedness, etc.) which represent the dominantspatial structure of a scene. These features are named GISTfeatures as is common parlance in computer vision litera-ture. To compute these features, we have used the setup in[7] leading to a 960-dimensional descriptor for each frame.We calculate GIST features for every 10th frame.

To save computational time, the GIST features are pro-jected into lower dimensions using Principle Component Anal-ysis (PCA). PCA is done on the training data across all event

categories to remove the dependence of the visual descriptorson specific events. We first visualize the lower (15, 30 and60) dimensional GIST features in two dimensions using t-statistic based stochastic neighbor embedding (t-SNE) [17],however, the separations look more or less the same and donot yield conclusive evidence of choosing the right number ofdimensions. By inspecting the plots from t-SNE, we choose15 dimensions and validate the choice by both manually in-specting the eigenvalues and experimentally cross-validatingwith the baseline Corr-MGLDA topic model (Fig. 3c). 30 or60 dimensional features decreased the ELBO of the model.We do not select further lower dimensions based on signif-icance of the eigenvalues. Each wo in Figs. 3c, 3d and 3erepresents a frame in 15 dimensions corresponding to a GISTfeature vector.

4.1 Model Log Likelihoods and TopicsIn this section, we evaluate the topic models in terms of

ELBO on the held-out Dev-T set acting as a test set (withthe human summaries) for posterior inference and as a pre-diction set (without human summaries) for BoW summarygeneration. Multinomial parameters are seeded and Gaus-sian parameters are randomly initialized. The base measures

Model Topic 1 Topic 2 Topic 3

Corr-

MMG

LDA

wed couple cer-emony churchring footageexchangebride groomhelicopter

wed ceremonybride groomchurch flowervow exchangering walk kissoutdoors

wed Hawai USbeach guestplace footagescene ringminister leiKailua

Wed

dingceremony

Corr-

MG

LDA-

PDS

wed ceremonycouple bridegroom churchflower exchangevow

wed ceremonybride groomcouple flowerchurch manoutdoors vow

wed ceremonybride groomcouple flowerchurch manoutdoors vow

Table 2: 3 latent topics for an event from proposed 5-topic Corr-MMGLDA and Corr-MGLDA-PDS models. The topics from Corr-MGLDA-PDS are similar as a result of high values of αk obtainedafter running Corr-MGLDA-PDS on scaled i.e. normalized data. Thetopics from Corr-MMGLDA are qualitatively far superior and indi-cates sub-events of the “Wedding ceremony” event.

of α are initialized to 0.1 and normalized while its concen-tration parameter is set to 10. An issue with the real valuedfeatures is the influence of data normalization on the ELBOsfrom the topic models. We have observed that when the datais not normalized to lie within [0,1]P , the sequence of ELBOsfrom Corr-MGLDA during EM often indicate suboptimalityeven during training. The “PDS” suffix (in the table and allother figures) means “Positive Data Scaling” i.e. each realvalued vector is sum-normalized to [0,1]P independently.

-700000-600000-500000-400000-300000-200000-100000

0E001 E002 E003 E004 E005

MMLDA Corr-MGLDA-PDS MMGLDA Corr-MMGLDA

Figure 4: Test ELBOs on events E001-E005 in the Dev-T set. Lower is better.

The PDS nor-malization fixesthis problem andraises ELBOs forCorr-MGLDA sig-nificantly but con-vergence is slower.However in thelatter setting, thevalues of αk become large which destroys sparsity in topics.This is possibly due to strong overlap of modes within the[0, 1]P hypercube where one dimension is severely correlatedwith the others. Examples of such topics on the “WeddingCeremony” event is given in Table 2 where all topics arealmost alike and lose subjective interpretability.

The new topic models with both Multinomial and Gaus-

sian distributions on the video features do not suffer from thedata scaling problem. It is possible that the mean parameterspace for the tractable distributions over both discrete andreal valued observations prevents co-ordinate ascent steps todwell in suboptimal regions that could arise out of extremevalues in the real valued data alone.

-60000

-50000

-40000

-30000

-20000

-10000

0E006 E007 E008 E009 E010 E011 E012 E013 E014 E015

MMLDA Corr-MGLDA-PDS MMGLDA Corr-MMGLDA

Figure 5: Test ELBOs on events E006-E015from Dev-T set. Lower is better

Although theNormal-Wishartpriors act asregularizers, au-tomatically tun-ing W0 usinganother levelof priors orfrom the dataitself is not used here. In general optimization withtractable distributions and parameter constraints (e.g. non-negativity, boundedness and positive definiteness) can benon-convex [31].

Figures 4 and 5 show the test ELBOs of MMGLDA andCorr-MMGLDA versus the MMLDA model and the Corr-MGLDA model with PDS. The ELBOs for MMLDA are offthe charts (at least three to four times the cut-off shown inthe graphs). For the first 5 events, the videos contain posi-tive instances of the events in Dev-T set. For this subset ofevents, the MMGLDA family of models outperform the best

-1500000

-1300000

-1100000

-900000

-700000

-500000

-300000

-100000

E001 E002 E003 E004 E005

MMLDA Corr-MGLDA Corr-MGLDA-PDS MMGLDA Corr-MMGLDA

Figure 6: Prediction ELBOs on first 5event for Dev-T set. Lower is better.

version of Corr-MGLDA in termsof ELBOs (i.e.with PDS). Fig-ures 4 and 5 areobtained usingK=20 topics—K being set through5-fold cross -validation. Forthe last 10 events (Fig. 5), the videos contain only relatedinstances of the events in Dev-T set—dissimilar to the train-ing configuration i.e. the annotators are unsure about therelevance of the videos to the event category. In this case,Corr-MGLDA-PDS do not perform worse in general sincethe GIST features are global features [22].

The prediction performance on the first 5 events is shownin terms of ELBO in Fig. 6 for the same value of K. Fig. 6shows that MMLDA does not perform well in terms of wordprediction ELBO measure. We can also see the effects ofsub-optimality when PDS is suppressed for Corr-MGLDA(Corr-MGLDA in Fig. 6). MMGLDA and Corr-MMGLDAagain perform comparably and outperforms Corr-MGLDA-PDS on the first 5 events except E002—“feeding an animal”—a very complex event for computer vision.

For events 6 through 15, the prediction ELBO graphs alsolook very similar to that in Fig. 5 (not shown here due tospace constraints). PDS on our proposed MMGLDA fam-ily shows even better ELBOs, but topic sparsity problemsmitigate only a little and we do not report those here. Allthese experiments are run using m0 set to 0, W0 set to abroader prior I, the identity matrix, ν0 set to P and κ0 setto 1. Normalizing the data to lie in [0,1]P with I as priorsfor Λks leads to sharing of topic responsibilities of the realvalued data by only a few Gaussians thereby contributingmuch less to the overall log-likelihood.

It is also observed that the means of the ELBOs of our

proposed models are significantly less negative (i.e. better)at 95% confidence level (using paired t-test) than the exist-ing topic models during cross-validation on the training set.

-400000

-350000

-300000

-250000

-200000

-150000

-100000

-50000

0K=5 K=10 K=15

Corr-MGLDA Corr-MGLDA-PDS

MMGLDA Corr-MMGLDA

Figure 7: Average test ELBOs onall events in the Dev-T set for dif-ferent topics. Lower is better.

For most events, EL-BOs for proposed mod-els with K=10 arenot statistically worseeither and show slightlyhigher ROUGE-1 scoresfor some events. Fig-ure 7 shows the macroaverage of test EL-BOs across all the 15events in the Dev-Tset. We omit the line graph for MMLDA as it is out of axislimits. The graphs confirm the superior fit of our proposedmodels to a natural representation of multimedia (test) data.

4.2 Translating Related Words to VideosTable 3 shows how latent topics can first be used to dis-

cover most probable related words from unstructured textwhich can then be translated to most probable frames fromone or more videos (and hence the videos themselves). Theframes correspond to wos in Fig. 3 and Table 1.

Topic 6: mob flash dance people mall out-doors large gather woman public plaza

Topic 10: Hollywood dog star wars robotlight saber R2D2 eye lens blvd fight street

Table 3: 2 latent topics for “Flash mob” eventfrom a 10-topic Corr-MMGLDA

We observefrom Table3 how top-ics 6 and10 decom-pose the“Flashmob gath-ering”eventinto its con-stituent sub-themes. Whiletopic 6 de-scribes flashmob dancesin outdoorsand near plazas,topic 10 fo-cuses on aflash mobin Hollywoodposing in StarWars costumes and light sabers along with the famousminiature robot R2D2. Due to space constraints we can-not show more samples of numerous such examples.

Table 4 shows the inter-translation of modalities for topic

Topic 10: place fight fake public eventgather flash mob music star outdoors war

Table 4: Topic 10 for the “Flash mob” eventfrom a 10-topic MMGLDA

10 from MMGLDAcorrespondingto that inTable 3 fromCorr-MMGLDA.Note how thetopic losesspecificity (e.g.misses“R2D2”,“light,”“saber”within thetop few words)and focuses on generality (e.g. flash mob). Topic 6 forMMGLDA is exactly the same as that for Corr-MMGLDA.Table 5 shows log of the ratio: αk

|Λk| for the two proposed

models and gives us a hint on how “broad” a topic k may bevs. how much variance in the visual summary is it able tocapture. A relatively higher value of the ratio means thata topic captures more variance and hence the volume cap-tured by the determinant of the inverse covariance matrixΛk, i.e. |Λk|, through its spanning eigenvectors is propor-tionally less. For MMGLDA, this ratio is always relativelylower in our setting and this means that the model capturesmore generic patterns first giving rise to a lower |Λk|−1.

The last column in Table 5 is the average of the ratiosfor the other topics from the two models for event eight.

Model k=6 k=10 avgk �={6,10}Corr-MMGLDA 104.628 8.164 40.2398

MMGLDA 104.623 8.102 40.0702

Table 5: log αk|Λk| values for topics in event 8

The ratioscan be largefor a moregeneral topic(e.g. topic6 in Table3) owing to a higher αk too. Although, all values are closedue to the use of the broader prior, I, it is observed thatCorr-MMGLDA discovers related words which are qualita-tively superior. However, the corresponding most probableframes are almost similar for both models in most cases.Translating related words to frames is best judged manu-ally, but, the ratio we use here can be a viable alternative.

4.3 Translating/Summarizing Videos To TextTable 6 reports the ROUGE-1 (henceforth R-1) scores of

the predicted 5 to 10 keyword summaries from the differ-ent models when compared with the corresponding shorthuman synopses. Sometimes full sentences cannot be gen-erated from the predicted words due to a deficient languagemodel. This and the short nature of human synopses are theprimary reasons why we perform only R-1 evaluation. Someexamples of the sentences/phrases are shown in Fig. 8.

Bag of words: feed bird food outdoors woman eat hand leaf daytime boy giraffe zoo cup girl goat Sentences: Boys feed birds by hand. Girl feed birds by hand. Girl eats food in zoo. Woman feed birds by hand. Woman feeds goat in zoo.

[Event E002 - Feeding animal] Actual Summary: little boy feeds goat

Bag of words: fish noodle man sea boat bare land big stretch person catch hand stream catfish Sentences: Men catch fish on boat. Men catch fish by hand. Men catch fish in stream. Men catch fish in boat. [Event E003 - Landing a fish] Actual Summary: catching big fish off dock

Bag of words: church wed ceremony inside bride groom aisle dress doughnut couple clap hug kiss sign Sentences: Could not generate sentence current template and language model is insufficient, but phrases are found

[Event E004 - Wedding] Actual Summary: wedding ceremony in church

Figure 8: Bag of keywords and sentence translations from ourproposed MMGLDA (K = 20) for some clips from events 2, 3 and4 from the Dev-T set. Best viewed in color and magnification.

In Table 6, OB is the Object Bank baseline (see just aheadof Section 3.1)—it confirms the difficulty of detecting objectson this dataset. The quantized OB responses perform poorlysince a-priori it is hard to know which object detectors willbe needed and existing but irrelevant object detectors canproduce an unpredictable pattern of false positives. Creat-

ing object models for every genre of data requires expensiveannotation efforts. Even if there is a 100% overlap betweenour training vocabulary and object models, the R-1 scoresfor OB may only increase by 10-folds which is still low.

Purely multinomial topic models showing lower ELBOscan perform quite well in BoW summarization. MMLDAassigns likelihoods based on success and failure of indepen-dent events and failures contribute highly negative termsto the log likelihoods but this does not indicate the model’ssummarization performance where low probability terms arepruned out. Gaussian components can partially removethe independence through covariance modeling and fit thedata better at the cost of higher time and space complexity.

Model n=5 n=10 OB

E001

MMLDA 0.182 0.248

0.0*Corr-MGLDA-PDS 0.187 0.257

MMGLDA 0.179 0.245

Corr-MMGLDA 0.139* 0.192*

E002

MMLDA 0.186 0.249


MMGLDA 0.186 0.237


E003

MMLDA 0.221 0.265

0.012*Corr-MGLDA-PDS 0.233 0.263

MMGLDA 0.228 0.267 =1%

Corr-MMGLDA 0.171* 0.230

E004

MMLDA 0.265 0.302


MMGLDA 0.264 0.321

Corr-MMGLDA 0.221 0.247*E005

MMLDA 0.167 0.213

0.005*Corr-MGLDA-PDS 0.180 0.208

MMGLDA 0.165 0.205 =0.5%


6-15Avg. MMLDA 0.216 0.252

0.001*Corr-MGLDA-PDS 0.211 0.258

MMGLDA 0.210 0.243 =0.1%

Corr-MMGLDA 0.179* 0.221

Table 6: Individual and average ROUGE-1 scoreson the events—best results from 10/20 latent top-ics are shown. The value of n represents the top-n most probable keywords. A (*) means signif-icantly worse performance at 95% confidence to{MM,MMG}LDAs. These results are only reportedfor the same hyperparameter settings.

The R-1 scoresfrom MM(G)LDAs arecompara-ble for 5and 10 key-words withno sta-tistical dif-ference, how-ever, apossible rea-son forlower R-1 scoresfor Corr-MMGLDAmodel isthat dueto bet-ter cor-respondenceto the topicof the GISTenergy inthe scene,when atopicallyrelevant butnon-summaryword ischosen upfront, more related but non-summary words arealso drawn in. As future research, we also like to do a prin-cipled initialization of Gaussian parameter priors as in [19].

However, the high scores with Corr-MGLDA-PDS is en-tirely co-incidental—the topics are more or less uniform andeach one covers parts of the sub-events equally. Further,each wo’s density over those topics is uniform enough tonot achieve a reasonable permutation. The same thing hap-pens when PDS is used for our MMGLDA family of mod-els and the summaries completely lose subjective appeal al-though R-1 scores improve considerably. This is similar tothe qualitatively degenerate approach—taking the top n fre-quent words from the event vocabulary and using those assummaries for every test video. The scores for MMLDAand MMGLDA are also comparable to this setting. Quan-tification of the permutation quality has not been done.

Scores in Table 6 need to be multiplied by the event clas-sification accuracies to obtain lower bounds for clips having

no event labels. The scores become competitive for larger nand much larger K if we topic model on the entire corpus.

4.3.1 Natural Language GenerationTo translate a video into multiple sentences from pre-

dicted keywords, we use an ordered-sequence template as<subject, verb, object, preposition, scene-noun>. Languagemodels from the data at hand is used to prune impossiblesequences. The subjects, objects, verbs and nouns extractedfrom the training synopses using dependency grammars andPOS models appear in each generated sentence only once.

We use the parser in [12] to score the sequence of wordsfollowing the template. The sentences are ordered accordingto bigram and parse tree scores. When complete sentencescannot be generated due to a deficient language model, weoutput possible bigrams and trigrams (see E004 in Fig. 8).Similar corpus based sentence generation techniques can befound in [34, 9] but NLG is a research topic in its own right.

4.4 Event ClassificationFig. 9 shows 5-fold cross-validation on the 15-event train-

ing set and also the test accuracies on Dev-T53.5248

46.9932 47.0417

18.5901

56.5217

49.8413 49.6025

23.2804

0

10

20

30

40

50

60

HOG3D+OB+CHIST HOG3D+OB HOG3D OB

CROSS-VALIDATION ON EVENTS TEST ON Dev-T

Figure 9: Event detection accuracies forcross-validation (light gray bars) and test(dark gray bars) with different features

set for eventclassification. Ac-SVM classi-fier from thelibSVM [6] pack-age is usedwith defaultsettings for mul-ticlass classi-fication. Al-though around 50% classification accuracy can be easilyachieved using the discrete visual features that we use,higher accuracies can be obtained using better kernels, fu-sion of classifiers and optimizing Detection-Error-Tradeoffcurves while cross validating [20]. However, these discus-sions are outside the scope of this paper.

5. CONCLUSIONOur new topic models show better fits to multimedia data

representation consisting of discrete and real valued featuresfrom videos as well as accompanying short textual synopses.In general Corr-MMGLDA improves on text to video trans-lation while the non-correspondence versions perform bet-ter in video to text summarization. Video summarizationthrough topic models significantly out-perform that throughstate-of-the-art object detectors and thus can be used asnew baselines. Our NLG component has suffered from se-vere data sparsity and impoverished language models andwe wish to overcome these using external knowledge bases.ACKNOWLEDGEMENTS This work was supported by the In-

telligence Advanced Research Projects Activity (IARPA) via Depart-

ment of Interior National Business Center contract number D11PC20069.

The U.S. Government is authorized to reproduce and distribute reprints

for Governmental purposes notwithstanding any copyright annotation

thereon. Disclaimer: The views and conclusions contained herein are

those of the authors and should not be interpreted as necessarily rep-

resenting the official policies or endorsements, either expressed or im-

plied, of IARPA, DOI/NBC, or the U.S. Government. We thank Hon-

eywell ACS Labs, Simon Fraser Univ. & Kitware Inc. for support in

feature extraction and the anonymous reviewers for their comments.

6. REFERENCES[1] A Ahmed, E. P. Xing, W. W. Cohen, and R. F. Murphy.

Structured correspondence topic models for mining captionedfigures in biological literature. In SIGKDD, 2009.

[2] M. J. Beal. Variational Algorithms for Approximate BayesianInference. PhD thesis, UCL, 2003.

[3] P. Bilinski and F. Bremond. Evaluation of local descriptors foraction recognition in videos. In ICCV, 2011.

[4] D. Blei and M. Jordan. Modeling annotated data. In SIGIR,2003.

[5] D. M. Blei, A. Ng, and M. I. Jordan. Latent dirichletallocation. JMLR, 3:993–1022, 2003.

[6] C. Chang and C. Lin. LIBSVM: A library for support vectormachines. ACM TIST, 2:27:1–27:27, 2011.

[7] M. Douze, H. Jegou, H. Sandhawalia, L. Amsaleg, andC. Schmid. Evaluation of gist descriptors for web-scale imagesearch. In CIVR, pages 19:1–19:8, 2009.

[8] S. L. Feng, R. Manmatha, and V. Lavrenko. Multiple bernoullirelevance models for image and video annotation. In CVPR,2004.

[9] A. Gupta, Y. Verma, and C. V. Jawahar. Choosing linguisticsover vision to describe images. In AAAI, 2012.

[10] T. M. Hospedales, S. Gong, and T. Xiang. A markov clusteringtopic model for mining behaviour in video. In ICCV, 2009.

[11] A. Klaser, M. Marszalek, and C. Schmid. A spatio-temporaldescriptor based on 3d-gradients. In BMVC, 2008.

[12] D. Klein and C. D. Manning. Accurate unlexicalized parsing. InACL, 2003.

[13] V. Lavrenko, R. Manmatha, and J. Jeon. A model for learningthe semantics of pictures. In NIPS. 2004.

[14] L. Li, K. Zhou, G. Xue, H. Zha, and Y. Yu. Videosummarization via transferrable structured learning. In WWW,2011.

[15] L-J. Li, H. Su, E. P. Xing, and L. Fei-fei. Object bank: Ahigh-level image representation for scene classification andsemantic feature sparsification. In NIPS, 2010.

[16] C-Y. Lin and E. Hovy. Automatic evaluation of summariesusing n-gram co-occurrence statistics. In NAACL HLT, 2003.

[17] L. V. D. Maaten and G. Hinton. Visualizing high-dimensionaldata using t-sne. JMLR, 9(Nov):2579–2605, 2008.

[18] A. Makadia, V. Pavlovic, and S. Kumar. A new baseline forimage annotation. In ECCV, pages 316–329, 2008.

[19] N. Nasios and A.G. Bors. Variational learning for gaussianmixture models. IEEE Transactions on Systems, Man, andCybernetics, Part B: Cybernetics, 36(4):849 –862, 2006.

[20] P. Natarajan et al. BBN VISER. In TRECVID MED, 2011.

[21] A. Nenkova and R. Passonneau. Evaluating content selection insummarization: The pyramid method. In HLT-NAACL, 2004.

[22] A. Oliva and A. Torralba. Modeling the shape of the scene: Aholistic representation of the spatial envelope. Int. J. Comput.Vision, 42(3):145–175, 2001.

[23] A. Oliva and A. Torralba. Building the gist of a scene: the roleof global image features in recognition. Progress in BrainResearch, 155(1):23–36, 2006.

[24] G. Parisi. Statistical Field Theory. Addison-Wesley, 1988.

[25] D. Putthividhya, H. T. Attias, and S. S. Nagarajan. Topicregression multi-modal latent dirichlet allocation for imageannotation. In CVPR, 2010.

[26] D. Ramage, P. Heymann, C. D. Manning, andH. Garcia-Molina. Clustering the tagged web. In WSDM, 2009.

[27] R. K. Srihari. Piction: A system that uses captions to labelhuman faces in newspaper photographs. In AAAI, 1991.

[28] L. Torresani, M. Szummer, and A. Fitzgibbon. Efficient objectcategory recognition using classemes. In ECCV, 2010.

[29] C. Vondrick, D. Patterson, and D. Ramanan. Efficiently ScalingUp Video Annotation with Crowdsourced Marketplaces. InECCV, 2010.

[30] Yang W. and Greg M. Human action recognition by semi-latenttopic models. IEEE PAMI, 31(10):1762–1774, 2009.

[31] M. J. Wainwright and M. I. Jordan. Graphical Models,Exponential Families, and Variational Inference. NowPublishers Inc., Hanover, MA, USA, 2008.

[32] J. Wanke, A. Ulges, C. Lampert, and T. Breuel. Topic modelsfor semantics-preserving video compression. In MIR, 2010.

[33] M. J. Welch, J. Cho, and W. Chang. Generating advertisingkeywords from video content. In CIKM, 2010.

[34] Y. Yang, C. L. Teo, H. Daume III, and Y. Aloimonos.Corpus-guided sentence generation of natural images. InEMNLP, 2011.

[35] H. J. Zhang, C. Low, S. Smoliar, and J. Wu. Video parsing,retrieval and browsing: an integrated and content-basedsolution. In ACM MM, 1995.

Translating Related Words to Videos and Back through ...jcorso/pubs/wsdm13-das-final.pdf ·...

Documents

Transcript of Translating Related Words to Videos and Back through ...jcorso/pubs/wsdm13-das-final.pdf ·...