
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 380–386, Berlin, Germany, August 7-12, 2016. © 2016 Association for Computational Linguistics

A Latent Concept Topic Model for Robust Topic Inference Using Word Embeddings

Weihua Hu† and Jun’ichi Tsujii‡§
†Department of Computer Science, The University of Tokyo, Japan

‡Artificial Intelligence Research Center, AIST, Japan
§School of Computer Science, The University of Manchester, UK
{hu,j-tsujii}@ms.k.u-tokyo.ac.jp, @aist.go.jp

Abstract

Uncovering thematic structures of SNS and blog posts is a crucial yet challenging task, because of the severe data sparsity induced by the short length of texts and the diverse use of vocabulary. This hinders effective topic inference of traditional LDA, because it infers topics based on document-level co-occurrence of words. To robustly infer topics in such contexts, we propose a latent concept topic model (LCTM). Unlike LDA, LCTM reveals topics via co-occurrence of latent concepts, which we introduce as latent variables to capture conceptual similarity of words. More specifically, LCTM models each topic as a distribution over the latent concepts, where each latent concept is a localized Gaussian distribution over the word embedding space. Since the number of unique concepts in a corpus is often much smaller than the number of unique words, LCTM is less susceptible to the data sparsity. Experiments on the 20Newsgroups corpus show the effectiveness of LCTM in dealing with short texts, as well as the capability of the model in handling held-out documents with a high degree of OOV words.

1 Introduction

Probabilistic topic models, such as Latent Dirichlet allocation (LDA) (Blei et al., 2003), are widely used to uncover hidden topics within a text corpus. LDA models each document as a mixture of topics, where each topic is a distribution over words. In essence, LDA reveals latent topics in a corpus by implicitly capturing document-level word co-occurrence patterns (Wang and McCallum, 2006).

In recent years, Social Networking Services and blogs have become increasingly prevalent due to the explosive growth of the Internet. Uncovering the thematic structures of these posts is crucial for tasks such as market review and trend estimation (Asur and Huberman, 2010). However, compared to more conventional documents, such as news articles and academic papers, analyzing the thematic content of blog posts can be challenging, because of their typically short length and the use of diverse vocabulary by various authors. These factors substantially decrease the chance of topically related words co-occurring in the same post, which in turn hinders effective topic inference in conventional topic models. Additionally, a small corpus size can further exacerbate the problem, since word co-occurrence statistics become sparser as the number of documents decreases.

Recently, word embedding models such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) have gained much attention for their ability to form clusters of conceptually similar words in the embedding space. Inspired by this, we propose a latent concept topic model (LCTM) that infers topics based on document-level co-occurrence of references to the same concept. More specifically, we introduce a new latent variable, termed a latent concept, to capture conceptual similarity of words, and redefine each topic as a distribution over the latent concepts. Each latent concept is then modeled as a localized Gaussian distribution over the embedding space. This is illustrated in Figure 1, where we denote the centers of the Gaussian distributions as concept vectors. We see that each concept vector captures a representative concept of the surrounding words, and the Gaussian distributions model the small variation between the latent concepts and the actual use of words. Since the number of unique concepts that are referenced in a corpus is often much smaller than the number of unique words, we expect topically related latent concepts to co-occur many times, even in short texts with diverse usage of words. This in turn promotes topic inference in LCTM.

Figure 1: Projected latent concepts on the word embedding space. Concept vectors are annotated with their representative concepts in parentheses.

LCTM has the further advantage of using continuous word embeddings. Traditional LDA assumes a fixed vocabulary of word types. This modeling assumption prevents LDA from handling out-of-vocabulary (OOV) words in held-out documents. On the other hand, since our topic model operates on the continuous vector space, it can naturally handle OOV words once their vector representations are provided.

The main contributions of our paper are as follows: we propose LCTM, which infers topics via document-level co-occurrence patterns of latent concepts, and derive a collapsed Gibbs sampler for approximate inference. We show that LCTM can accurately represent short texts by outperforming conventional topic models in a clustering task. By means of a classification task, we furthermore demonstrate that LCTM achieves superior performance to other state-of-the-art topic models in handling documents with a high degree of OOV words.

The remainder of the paper is organized as follows: related work is summarized in Section 2, while LCTM and its inference algorithm are presented in Section 3. Experiments on the 20Newsgroups corpus are presented in Section 4, and a conclusion is given in Section 5.

2 Related Work

There have been a number of previous studies on topic models that incorporate word embeddings. The closest model to LCTM is Gaussian LDA (Das et al., 2015), which models each topic as a Gaussian distribution over the word embedding space. However, the assumption that topics are unimodal in the embedding space is not appropriate, since topically related words such as ‘neural’ and ‘networks’ can occur distantly from each other in the embedding space. Nguyen et al. (2015) proposed topic models that incorporate information from word vectors in modeling topic-word distributions. Similarly, Petterson et al. (2010) exploit external word features to improve the Dirichlet prior of the topic-word distributions. However, neither of these models can handle OOV words, because they assume fixed word types.

Latent concepts in LCTM are closely related to ‘constraints’ in the interactive topic model (ITM) (Hu et al., 2014). Both latent concepts and constraints are designed to group conceptually similar words using external knowledge, in an attempt to aid topic inference. The difference lies in their modeling assumptions: latent concepts in LCTM are modeled as Gaussian distributions over the embedding space, while constraints in ITM are sets of conceptually similar words that are interactively identified by humans for each topic. Each constraint for each topic is then modeled as a multinomial distribution over the constrained set of words that were identified as mutually related by humans. In Section 4, we consider a variant of ITM whose constraints are instead inferred using external word embeddings.

As regards short texts, a well-known topic model is the Biterm Topic Model (BTM) (Yan et al., 2013). BTM directly models the generation of biterms (pairs of words) in the whole corpus. However, the assumption that pairs of co-occurring words should be assigned to the same topic might be too strong (Chen et al., 2015).

3 Latent Concept Topic Model

3.1 Generative Model

The primary difference between LCTM and conventional topic models is that LCTM describes the generative process of word vectors in documents, rather than the words themselves.

Suppose α and β are parameters for the Dirichlet priors, and let v_{d,i} denote the word embedding for word type w_{d,i}. The generative process of LCTM is as follows.

1. For each topic k:
   (a) Draw a topic-concept distribution ϕ_k ∼ Dirichlet(β).

2. For each latent concept c:
   (a) Draw a concept vector µ_c ∼ N(µ, σ_0² I).

3. For each document d:
   (a) Draw a document-topic distribution θ_d ∼ Dirichlet(α).
   (b) For the i-th word w_{d,i} in document d:
       i. Draw its topic assignment z_{d,i} ∼ Categorical(θ_d).
       ii. Draw its latent concept assignment c_{d,i} ∼ Categorical(ϕ_{z_{d,i}}).
       iii. Draw a word vector v_{d,i} ∼ N(µ_{c_{d,i}}, σ² I).

Figure 2: Graphical representation. (a) LDA; (b) LCTM.

The graphical models for LDA and LCTM are shown in Figure 2. Compared to LDA, LCTM adds another layer of latent variables to indicate the conceptual similarity of words.
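To make the generative process above concrete, the following sketch samples topics, latent concepts, and word vectors according to steps 1–3. It is an illustrative reading of the model only: the function name, its arguments, and the use of NumPy are our own choices, and symmetric Dirichlet priors are assumed.

```python
import numpy as np

def generate_corpus(K, S, D, doc_lengths, alpha, beta,
                    mu0, sigma0_sq, sigma_sq, seed=0):
    """Sample word vectors from the LCTM generative process (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 1: topic-concept distributions phi_k ~ Dirichlet(beta)
    phi = rng.dirichlet(np.full(S, beta), size=K)
    # Step 2: concept vectors mu_c ~ N(mu0, sigma0^2 I)
    mu = mu0 + np.sqrt(sigma0_sq) * rng.standard_normal((S, D))
    docs = []
    for n_d in doc_lengths:
        # Step 3(a): document-topic distribution theta_d ~ Dirichlet(alpha)
        theta = rng.dirichlet(np.full(K, alpha))
        words = []
        for _ in range(n_d):
            z = rng.choice(K, p=theta)         # 3(b)i.  topic assignment
            c = rng.choice(S, p=phi[z])        # 3(b)ii. latent concept assignment
            v = mu[c] + np.sqrt(sigma_sq) * rng.standard_normal(D)  # 3(b)iii. word vector
            words.append((z, c, v))
        docs.append(words)
    return phi, mu, docs
```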

3.2 Posterior Inference

In our application, we observe documents consisting of word vectors and wish to infer posterior distributions over all the hidden variables. Since there is no analytical solution to the posterior, we derive a collapsed Gibbs sampler to perform approximate inference. During the inference, we sample a latent concept assignment as well as a topic assignment for each word in each document as follows:

p(z_{d,i} = k \mid c_{d,i} = c, \mathbf{z}^{-d,i}, \mathbf{c}^{-d,i}, \mathbf{v})
    \propto \left( n_{d,k}^{-d,i} + \alpha_k \right) \cdot \frac{n_{k,c}^{-d,i} + \beta_c}{n_{k,\cdot}^{-d,i} + \sum_{c'} \beta_{c'}},  \quad (1)

p(c_{d,i} = c \mid z_{d,i} = k, v_{d,i}, \mathbf{z}^{-d,i}, \mathbf{c}^{-d,i}, \mathbf{v}^{-d,i})
    \propto \left( n_{k,c}^{-d,i} + \beta_c \right) \cdot \mathcal{N}(v_{d,i} \mid \boldsymbol{\mu}_c, \sigma_c^2 I),  \quad (2)

where n_{d,k} is the number of words assigned to topic k in document d, and n_{k,c} is the number of words assigned to both topic k and latent concept c. When an index is replaced by ‘·’, the number is obtained by summing over that index. The superscript -d,i indicates that the current assignments of z_{d,i} and c_{d,i} are ignored. N(·|µ, Σ) is a multivariate Gaussian density function with mean µ and covariance matrix Σ. The parameters µ_c and σ_c² in Eq. (2) are associated with the latent concept c and are defined as follows:

\boldsymbol{\mu}_c = \frac{\sigma^2 \boldsymbol{\mu} + \sigma_0^2 \sum_{(d',i') \in A_c^{-d,i}} v_{d',i'}}{\sigma^2 + n_{\cdot,c}^{-d,i} \sigma_0^2},  \quad (3)

\sigma_c^2 = \left( 1 + \frac{\sigma_0^2}{n_{\cdot,c}^{-d,i} \sigma_0^2 + \sigma^2} \right) \sigma^2,  \quad (4)

where A_c^{-d,i} ≡ {(d′, i′) | c_{d′,i′} = c ∧ (d′, i′) ≠ (d, i)} (Murphy, 2012). Eq. (1) is similar to the collapsed Gibbs sampler of LDA (Griffiths and Steyvers, 2004), except that the second term of Eq. (1) is concerned with topic-concept distributions. Eq. (2), for sampling latent concepts, has an intuitive interpretation: the first term encourages concept assignments that are consistent with the current topic assignment, while the second term encourages concept assignments that are consistent with the observed word. The Gaussian variance parameter σ² acts as a trade-off parameter between the two terms via σ_c². In Section 4.2, we study the effect of σ² on document representation.
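As a concrete illustration of Eqs. (1)–(4), the following sketch resamples the topic and concept assignments of a single word vector given the current count statistics. It is a minimal sketch under our own conventions (symmetric α and β, NumPy arrays for the counts, and the running sufficient statistics concept_sum/concept_cnt as assumed helpers), not the authors' implementation.

```python
import numpy as np

def resample_token(v, d, z_old, c_old, n_dk, n_kc, concept_sum, concept_cnt,
                   alpha, beta, sigma_sq, sigma0_sq, mu0, rng):
    """Resample (topic, concept) for one word vector v in document d (Eqs. 1-4)."""
    K, S = n_kc.shape
    D = v.shape[0]

    # Remove the current token's contribution from all counts (the "-d,i" statistics).
    n_dk[d, z_old] -= 1
    n_kc[z_old, c_old] -= 1
    concept_sum[c_old] -= v
    concept_cnt[c_old] -= 1

    # Concept posterior mean and variance, Eqs. (3)-(4).
    denom = sigma_sq + concept_cnt[:, None] * sigma0_sq
    mu_c = (sigma_sq * mu0 + sigma0_sq * concept_sum) / denom          # (S, D)
    var_c = (1.0 + sigma0_sq / (concept_cnt * sigma0_sq + sigma_sq)) * sigma_sq

    # Eq. (1): sample a topic given the (current) concept assignment.
    p_z = (n_dk[d] + alpha) * (n_kc[:, c_old] + beta) / (n_kc.sum(axis=1) + S * beta)
    z_new = rng.choice(K, p=p_z / p_z.sum())

    # Eq. (2): sample a concept given the new topic assignment (log-space for stability).
    log_p_c = np.log(n_kc[z_new] + beta) \
              - 0.5 * D * np.log(var_c) \
              - 0.5 * ((v - mu_c) ** 2).sum(axis=1) / var_c
    p_c = np.exp(log_p_c - log_p_c.max())
    c_new = rng.choice(S, p=p_c / p_c.sum())

    # Add the token back with its new assignments.
    n_dk[d, z_new] += 1
    n_kc[z_new, c_new] += 1
    concept_sum[c_new] += v
    concept_cnt[c_new] += 1
    return z_new, c_new
```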

3.3 Prediction of Topic Proportions

After the posterior inference, the posterior means of {θ_d} and {ϕ_k} are straightforward to calculate:

\theta_{d,k} = \frac{n_{d,k} + \alpha_k}{n_{d,\cdot} + \sum_{k'} \alpha_{k'}}, \qquad
\phi_{k,c} = \frac{n_{k,c} + \beta_c}{n_{k,\cdot} + \sum_{c'} \beta_{c'}}.  \quad (5)
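A direct transcription of Eq. (5), assuming symmetric priors and NumPy count matrices n_dk (documents × topics) and n_kc (topics × concepts); the function name is ours.

```python
import numpy as np

def posterior_means(n_dk, n_kc, alpha, beta):
    """Posterior means of document-topic and topic-concept distributions (Eq. 5)."""
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + n_dk.shape[1] * alpha)
    phi = (n_kc + beta) / (n_kc.sum(axis=1, keepdims=True) + n_kc.shape[1] * beta)
    return theta, phi
```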

Also, the posterior means for {µ_c} are given by Eq. (3). We can then use these values to predict the topic proportions θ_{dnew} of an unseen document dnew using collapsed Gibbs sampling as follows:

p(z_{d_{new},i} = k \mid v_{d_{new},i}, \mathbf{v}^{-d_{new},i}, \mathbf{z}^{-d_{new},i}, \boldsymbol{\phi}, \boldsymbol{\mu})
    \propto \left( n_{d_{new},k}^{-d_{new},i} + \alpha_k \right) \cdot
    \sum_c \phi_{k,c} \, \frac{\mathcal{N}(v_{d_{new},i} \mid \boldsymbol{\mu}_c, \sigma_c^2 I)}{\sum_{c'} \mathcal{N}(v_{d_{new},i} \mid \boldsymbol{\mu}_{c'}, \sigma_{c'}^2 I)}.  \quad (6)

The second term of Eq. (6) is a weighted average of ϕ_{k,c} with respect to the latent concepts. We see that more weight is given to concepts whose corresponding vectors µ_c are closer to the word vector v_{dnew,i}. This is to be expected, because the statistics of nearby concepts should give more information about the word. We also see from Eq. (6) that the topic assignment of a word is determined by its embedding, instead of its word type. Therefore, LCTM can naturally handle OOV words once their embeddings are provided.
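The following sketch illustrates how Eq. (6) can be used to infer the topic proportions of an unseen document purely from its word embeddings, which is what allows OOV words to contribute. The estimated phi, concept means, and variances are assumed to come from training; the sampling loop, function name, and argument layout are our own simplifications rather than the authors' code.

```python
import numpy as np

def predict_topic_proportions(doc_vecs, phi, mu_c, var_c, alpha, n_iter=500, seed=0):
    """Estimate theta for an unseen document from its word embeddings (Eq. 6 sketch)."""
    rng = np.random.default_rng(seed)
    K, S = phi.shape
    D = doc_vecs.shape[1]

    # For each word vector, compute its concept responsibilities:
    # N(v | mu_c, sigma_c^2 I) normalized over all concepts (second factor of Eq. 6).
    sq_dist = ((doc_vecs[:, None, :] - mu_c[None, :, :]) ** 2).sum(-1)        # (N, S)
    log_dens = -0.5 * D * np.log(var_c)[None, :] - 0.5 * sq_dist / var_c[None, :]
    resp = np.exp(log_dens - log_dens.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)
    word_topic = resp @ phi.T                                                  # (N, K)

    # Collapsed Gibbs sampling over topic assignments of the new document only.
    z = rng.integers(K, size=len(doc_vecs))
    n_k = np.bincount(z, minlength=K)
    for _ in range(n_iter):
        for i in range(len(doc_vecs)):
            n_k[z[i]] -= 1
            p = (n_k + alpha) * word_topic[i]
            z[i] = rng.choice(K, p=p / p.sum())
            n_k[z[i]] += 1
    return (n_k + alpha) / (n_k.sum() + K * alpha)
```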

3.4 Reducing the Computational Complexity

From Eqs. (1) and (2), we see that the computational complexity of sampling per word is O(K + SD), where K, S and D are the numbers of topics, latent concepts and embedding dimensions, respectively. Since K ≪ S holds in usual settings, the dominant computation involves the sampling of latent concepts, which costs O(SD) per word.

However, since LCTM assumes that the Gaussian variance σ² is relatively small, the chance of a word being assigned to a distant concept is negligible. Thus, we can reasonably assume that each word is assigned to one of its M ≪ S nearest concepts, which reduces the computational complexity to O(MD). Since concept vectors can move slightly in the embedding space during inference, we periodically update the nearest concepts for each word type.
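A minimal sketch of this restriction, assuming the word-type and concept vectors are held in NumPy arrays; the function name and the brute-force distance computation (which materializes a V × S × D intermediate) are our own simplifications.

```python
import numpy as np

def nearest_concepts(word_vecs, concept_vecs, M=300):
    """For each word type, precompute the indices of its M nearest concept vectors.

    Restricting the concept sampling of Eq. (2) to these M candidates reduces the
    per-word cost from O(SD) to O(MD); the table is refreshed periodically as the
    concept vectors move during inference.
    """
    # Brute-force squared Euclidean distances between word types and concept vectors.
    d2 = ((word_vecs[:, None, :] - concept_vecs[None, :, :]) ** 2).sum(-1)  # (V, S)
    return np.argsort(d2, axis=1)[:, :M]                                    # (V, M)
```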

To further reduce the computational complexity, we could apply dimensionality reduction algorithms such as PCA or t-SNE (Van der Maaten and Hinton, 2008) to the word embeddings to make D smaller. We leave this to future work.

4 Experiments

4.1 Datasets and Models Description

In this section, we study the empirical performance of LCTM on short texts. We used the 20Newsgroups corpus, which consists of discussion posts about various news subjects authored by diverse readers. Each document in the corpus is tagged with one of twenty newsgroups. Only posts with fewer than 50 words were extracted for the training datasets. For external word embeddings, we used 50-dimensional GloVe vectors¹ pre-trained on Wikipedia. The datasets are summarized in Table 1. See Appendix A for the details of the dataset preprocessing.

We compare the performance of LCTM to the following six baselines:

• LFLDA (Nguyen et al., 2015), an extension of Latent Dirichlet Allocation that incorporates word embedding information.

• LFDMM (Nguyen et al., 2015), an extension of Dirichlet Multinomial Mixtures that incorporates word embedding information.

• nI-cLDA, non-interactive constrained Latent Dirichlet Allocation, a variant of ITM (Hu et al., 2014) where constraints are inferred by applying k-means to external word embeddings. Each resulting word cluster is then regarded as a constraint. See Appendix B for the details of the model.

• GLDA (Das et al., 2015), Gaussian LDA.

• BTM (Yan et al., 2013), Biterm Topic Model.

• LDA (Blei et al., 2003).

In all the models, we set the number of topics to 20. For LCTM (resp. nI-cLDA), we set the number of latent concepts (resp. constraints) to 1000. See Appendix C for the details of the hyperparameter settings.

Dataset     Doc size   Vocab size   Avg len
400short    400        4729         31.87
800short    800        7329         31.78
1561short   1561       10644        31.83
held-out    7235       37944        140.15

Table 1: Summary of datasets.

¹ Downloaded from http://nlp.stanford.edu/projects/glove/

4.2 Document Clustering

To demonstrate that LCTM yields a superior representation of short documents compared to the baselines, we evaluated the performance of each model on a document clustering task. We used the learned topic proportions as features and applied k-means to cluster the documents, then compared the resulting clusters to the actual newsgroup labels. Clustering performance is measured by Adjusted Mutual Information (AMI) (Manning et al., 2008); higher AMI indicates better clustering performance. Figure 3 illustrates the quality of clustering as a function of the Gaussian variance parameter σ². We see that setting σ² = 0.5 consistently gives good clustering performance for all the datasets with varying sizes, so we set σ² = 0.5 in the later evaluation. Figure 4 compares AMI across four topic models. We see that LCTM outperforms the topic models without word embeddings, and performs comparably to LFLDA and nI-cLDA, both of which incorporate word embedding information to aid topic inference. However, as we will see in the next section, LCTM can better handle OOV words in held-out documents than LFLDA and nI-cLDA do.

Figure 3: Relationship between σ² and AMI.

Figure 4: Comparison of clustering performance across topic models.
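A sketch of this evaluation pipeline, assuming the learned topic proportions are stacked into a (num_docs × num_topics) array; the function name and the scikit-learn calls used here (KMeans, adjusted_mutual_info_score) reflect our reading of the setup rather than the authors' exact script.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score

def clustering_ami(topic_proportions, newsgroup_labels, n_clusters=20, seed=0):
    """Cluster documents by their topic proportions and score against true labels."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    predicted = km.fit_predict(topic_proportions)   # cluster id for each document
    return adjusted_mutual_info_score(newsgroup_labels, predicted)
```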

4.3 Representation of Held-out Documents with OOV Words

To show that our model can better predict the topic proportions of documents containing OOV words than other topic models, we conducted an experiment on a classification task. In particular, we inferred topics from the training dataset and predicted the topic proportions of held-out documents using the collapsed Gibbs sampler. With the inferred topic proportions on both the training dataset and the held-out documents, we then trained a multi-class classifier (multi-class logistic regression implemented in the sklearn² python module) on the training dataset and predicted the newsgroup labels of the held-out documents.
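A sketch of this classification step with scikit-learn's logistic regression, assuming the topic proportions are used directly as features; the function name and solver settings are our own choices.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def classify_heldout(train_theta, train_labels, heldout_theta, heldout_labels):
    """Train a multi-class logistic regression on topic proportions; return held-out accuracy."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_theta, train_labels)
    return accuracy_score(heldout_labels, clf.predict(heldout_theta))
```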

We compared the classification accuracy using LFLDA, nI-cLDA, LDA, GLDA, LCTM and a variant of LCTM (LCTM-UNK) that ignores OOV words in the held-out documents. A higher classification accuracy indicates a better representation of unseen documents. Table 2 shows the proportion of OOV words and the classification accuracy on the held-out documents.

² See http://scikit-learn.org/stable/.

Training set      400short   800short   1561short
OOV proportion    0.348      0.253      0.181

Method            Classification accuracy
LCTM              0.302      0.367      0.416
LCTM-UNK          0.262      0.340      0.406
LFLDA             0.253      0.333      0.410
nI-cLDA           0.261      0.333      0.412
LDA               0.215      0.293      0.382
GLDA              0.0527     0.0529     0.0529
Chance rate       0.0539     0.0539     0.0539

Table 2: Proportions of OOV words and classification accuracy on the held-out documents.

We see that LCTM-UNK outperforms the other topic models in almost every setting, demonstrating the superiority of our method even when OOV words are ignored. Moreover, the fact that LCTM outperforms LCTM-UNK in all cases clearly illustrates that LCTM can effectively use information about OOV words to further improve the representation of unseen documents. The results also show that the improvement of LCTM over LCTM-UNK increases as the proportion of OOV words becomes greater.

5 Conclusion

In this paper, we have proposed LCTM, a topic model well suited to short texts with diverse vocabulary. LCTM infers topics according to document-level co-occurrence patterns of latent concepts, and is thus robust to diverse vocabulary usage and data sparsity in short texts. We showed experimentally that LCTM can produce a superior representation of short documents compared to conventional topic models. We additionally demonstrated that LCTM can exploit OOV words to improve the representation of unseen documents. Although this paper has focused on improving the performance of LDA by introducing a latent concept for each word, the same idea can readily be applied to other topic models that extend LDA.

Acknowledgments

We thank the anonymous reviewers for their constructive feedback. We also thank Hideki Mima for helpful discussions and Paul Thompson for insightful reviews of the paper. This paper is based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO).


References

Sitaram Asur and Bernardo A. Huberman. 2010. Predicting the future with social media. In Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on, volume 1, pages 492–499. IEEE.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022.

Weizheng Chen, Jinpeng Wang, Yan Zhang, Hongfei Yan, and Xiaoming Li. 2015. User based aggregation for biterm topic model. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, 2:489–494.

Rajarshi Das, Manzil Zaheer, and Chris Dyer. 2015. Gaussian LDA for topic models with word embeddings. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 795–804.

Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235.

Yuening Hu, Jordan L. Boyd-Graber, Brianna Satinoff, and Alison Smith. 2014. Interactive topic modeling. Machine Learning, 95(3):423–469.

Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, et al. 2008. Introduction to Information Retrieval, volume 1. Cambridge University Press, Cambridge.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Kevin P. Murphy. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.

Dat Quoc Nguyen, Richard Billingsley, Lan Du, and Mark Johnson. 2015. Improving topic models with latent feature word representations. Transactions of the Association for Computational Linguistics, 3:299–313.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.

James Petterson, Wray Buntine, Shravan M. Narayanamurthy, Tibério S. Caetano, and Alex J. Smola. 2010. Word features for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pages 1921–1929.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.

Xuerui Wang and Andrew McCallum. 2006. Topics over time: a non-Markov continuous-time model of topical trends. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 424–433. ACM.

Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. 2013. A biterm topic model for short texts. In Proceedings of the 22nd International Conference on World Wide Web, pages 1445–1456. International World Wide Web Conferences Steering Committee.

A Dataset Preprocessing

We preprocessed the 20Newsgroups corpus as follows. We downloaded the bag-of-words representation of the corpus available online³. Stop words⁴ and words that were not covered by the GloVe vectors were both removed. After the preprocessing, we extracted short texts containing fewer than 50 words for the training datasets. We created three training datasets with varying numbers of documents, and one held-out dataset. Each dataset was balanced in terms of the proportion of documents belonging to each newsgroup.

³ http://qwone.com/~jason/20Newsgroups/
⁴ Available at http://www.nltk.org/

B Non-Interactive Constrained LDA (nI-cLDA)

We describe nI-cLDA, a variant of the interactive topic model (Hu et al., 2014). nI-cLDA is non-interactive in the sense that constraints are inferred from the word embeddings instead of being interactively identified by humans. In particular, we apply k-means to the word embeddings to cluster words, and each resulting cluster is then regarded as a constraint. In general, constraints can differ from topic to topic. Let r_{k,w} be the constraint of topic k to which word w belongs. The generative process of nI-cLDA is as follows; it is essentially the same as that of Hu et al. (2014).

1. For each topic k:
   (a) Draw a topic-constraint distribution ϕ_k ∼ Dirichlet(β).
   (b) For each constraint s of topic k:
       i. Draw a constraint-word distribution π_{k,s} ∼ Dirichlet(γ).

2. For each document d:
   (a) Draw a document-topic distribution θ_d ∼ Dirichlet(α).
   (b) For the i-th word w_{d,i} in document d:
       i. Draw its topic assignment z_{d,i} ∼ Categorical(θ_d).
       ii. Draw its constraint assignment l_{d,i} ∼ Categorical(ϕ_{z_{d,i}}).
       iii. Draw a word w_{d,i} ∼ Categorical(π_{z_{d,i}, l_{d,i}}).

Let V be the vocabulary. We note that π_{k,s} is a multinomial distribution over W_{k,s}, a subset of V defined as W_{k,s} ≡ {w ∈ V | r_{k,w} = s}. W_{k,s} represents a constrained set of words that are conceptually related to each other under topic k.

In our application, we observe documents and the constraints for each topic, and wish to infer posterior distributions over all the hidden variables. We apply collapsed Gibbs sampling for approximate inference. For the details of the inference, see Hu et al. (2014).
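A minimal sketch of how the constraints could be obtained from external embeddings with scikit-learn's k-means, as described above; the function name and the choice to share one clustering across all topics are our assumptions.

```python
from sklearn.cluster import KMeans

def infer_constraints(word_embeddings, n_constraints=1000, seed=0):
    """Cluster word embeddings with k-means; each cluster is treated as one constraint.

    word_embeddings: array of shape (V, D) aligned with the vocabulary.
    Returns an array r of length V, where r[w] is the constraint id of word w.
    """
    km = KMeans(n_clusters=n_constraints, random_state=seed, n_init=10)
    return km.fit_predict(word_embeddings)
```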

C Hyperparameter Settings

For all the topic models, we used symmetric Dirichlet priors. The hyperparameters were set as follows: for our models (LCTM and LCTM-UNK), nI-cLDA and LDA, we set α = 0.1 and β = 0.01. For nI-cLDA, we set the parameter of the Dirichlet prior for the constraint-word distribution (γ in Appendix B) to 0.1. For our model, we also set σ_0² = 1.0 and µ to the average of the word vectors. We randomly initialized the topic assignments in all the models, and initialized the latent concept assignments using k-means clustering on the word embeddings, implemented with the sklearn⁵ python module. We set M (the number of nearest concepts to sample from) to 300, and updated the nearest concepts every 5 iterations. For LFLDA, LFDMM, BTM and Gaussian LDA, we used the original implementations available online⁶ and retained the default hyperparameters.

We ran all the topic models for 1500 iterationsfor training, and 500 iterations for predicting held-out documents.

⁵ See http://scikit-learn.org/stable/.
⁶ LFTM: https://github.com/datquocnguyen/LFTM
  BTM: https://github.com/xiaohuiyan/BTM
  GLDA: https://github.com/rajarshd/Gaussian_LDA
