
Transferring Topical Knowledge from Auxiliary Long Texts for Short Text Clustering

Ou Jin∗
Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, China
[email protected]

Nathan N. Liu
Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
[email protected]

Kai Zhao
NEC Labs China, Beijing, China
[email protected]

Yong Yu
Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, China
[email protected]

Qiang Yang
Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
[email protected]

ABSTRACT

With the rapid growth of social Web applications such as Twitter and online advertisements, the task of understanding short texts is becoming more and more important. Most traditional text mining techniques are designed to handle long text documents. For short text messages, many of the existing techniques are not effective due to the sparseness of text representations. To understand short messages, we observe that it is often possible to find topically related long texts, which can be utilized as auxiliary data when mining the target short texts. In this article, we present a novel approach to clustering short text messages via transfer learning from auxiliary long text data. We show that while some previous works enhance short text clustering with related long texts, most of them ignore the semantic and topical inconsistencies between the target and auxiliary data and may hurt the clustering performance on the short texts. To accommodate the possible inconsistencies between source and target data, we propose a novel topic model, the Dual Latent Dirichlet Allocation (DLDA) model, which jointly learns two sets of topics on short and long texts and couples the topic parameters to cope with the potential inconsistencies between data sets. We demonstrate through large-scale clustering experiments on both advertisement and Twitter data that we obtain superior performance over several state-of-the-art techniques for clustering short text documents.

∗Part of this work was done while the first author was a visiting student at the Hong Kong University of Science and Technology.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CIKM'11, October 24–28, 2011, Glasgow, Scotland, UK.
Copyright 2011 ACM 978-1-4503-0717-8/11/10 ...$10.00.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Clustering; I.2.6 [Artificial Intelligence]: Learning; I.2.7 [Artificial Intelligence]: Natural Language Processing—Text analysis

General Terms

Algorithms, Experimentation

Keywords

Short Text, Statistical Generative Models, Unsupervised Learning

1. INTRODUCTION

Short texts play an important role in various emerging Web applications such as online advertisements and micro-blogging. Sponsored search and display advertisements, as commonly found on search result pages or general web pages, can usually accommodate just a few keywords or sentences. Similarly, the popular micro-blogging service Twitter restricts the message length to less than 140 characters. Effective techniques for mining short texts are crucial to these application domains. While many successful text mining techniques have been developed in the past, they have only been designed for and tested on traditional long text corpora such as blogs or newswires. Directly applying these methods to short texts often leads to poor results [12]. Compared with long texts, short text mining has to address two inherent difficulties caused by their highly sparse representations: the lack of sufficient word co-occurrence information and the lack of context information in the text.

To alleviate the sparseness of short texts, a common approach is to conduct "pseudo relevance feedback" in order to enrich the original short text corpus with an additional set of auxiliary data consisting of semantically related long texts. This can be achieved by sending the input short texts as queries to a search engine to retrieve a set of the most relevant results [18]. Another popular method is to match short texts with topics learned from general knowledge repositories such as Wikipedia or ODP [12, 11]. Once the auxiliary data or auxiliary topics are obtained, they are often directly combined with the original short texts, which are then processed by traditional text mining models.

While pseudo-relevance-feedback based data augmentation is a promising approach, one should note that such a process is inherently noisy, and there is a risk that a certain portion of the auxiliary data may in fact be semantically unrelated to the original short texts. Similarly, unrelated or noisy auxiliary topics may also bring negative effects. Therefore, naively combining the short texts with semantically unrelated long texts or topics could hurt the performance on the short texts. The problem can become even more serious for unsupervised learning on short texts, as there is no label information to guide the selection of auxiliary data and auxiliary topics.

In this paper, we study the problem of enhancing short text clustering by incorporating auxiliary long texts. We propose a class of topic models, Dual Latent Dirichlet Allocation (DLDA), which jointly learns a set of target topics on the short texts and another set of auxiliary topics on the long texts, while carefully modeling the attribution to each type of topic when generating documents. In particular, we design and compare two mechanisms for DLDA to accommodate domain inconsistencies between the two data sets: (1) using two asymmetric priors on the topic mixing proportions to control the relative importance of the different topic classes for generating short and long texts; (2) introducing a latent binary switch variable to control whether each document should be generated using target or auxiliary topics. Our DLDA model allows topical structure to be shared in a more flexible manner between the collections of short and long texts and can therefore lead to more robust performance improvements when the auxiliary long texts are only partially related to the input short texts. Clustering experiments on two real-world data sets consisting of textual advertisements and Twitter messages demonstrate consistent improvements over existing methods for short text clustering, especially when there are many irrelevant documents in the auxiliary data set.

2. RELATED WORK

2.1 Mining Short Text

Due to its importance in popular web applications such as Twitter, short text mining has attracted growing interest in recent years. Hong et al. [4] compared different schemes for training standard topic models on tweets from Twitter. Sriram et al. [19] compared features for classification to overcome the problems caused by the shortness of tweets. Mihalcea et al. [8] proposed to measure the similarity of short text snippets by using both corpus-based and knowledge-based measures of word similarity. Sahami et al. [18] presented a novel similarity measure for short text snippets, which utilizes search engines to provide additional context for the given short text, much like query expansion techniques; this similarity measure can also be proven to be a kernel. [21] improved Sahami's work by adding a learning process to make the measure more appropriate for the target corpus. Phan et al. [12, 11] proposed to convert an additional knowledge base into topics to improve the representation of short texts; the knowledge base is crawled with selected seeds from several topics to avoid noise. Hu et al. [5] proposed a three-level framework to utilize knowledge from both Wikipedia and WordNet. Most of these works focus on how to acquire auxiliary data in order to enrich short texts. They generally make the implicit assumption that the auxiliary data are semantically related to the input short texts, which is hardly true in practice due to the noisy nature of the pseudo relevance feedback operation.

2.2 Transfer Learning

A closely related area to our work is transfer learning [15], which studies how to transfer knowledge from a related auxiliary domain to the target domain to help the learning task in the target domain. However, most existing works in this area focus on supervised [20] or semi-supervised settings [14], whereas we need to solve the problem in a totally unsupervised fashion. A similar setting has been considered by Dai et al. [1], who proposed a co-clustering based solution to enhance clustering on target domain data with the help of an auxiliary data set. However, the model is not designed for handling short texts. Moreover, it makes the strong assumption that the word co-clusters are completely shared between the two domains, which, as we will show, is much less effective than our more flexible DLDA model.

3. DUAL LATENT DIRICHLET ALLOCATION (DLDA)

Due to its high-dimensional yet extremely sparse representations, clustering short texts directly based on the bag-of-words representation can be very ineffective, as we demonstrate in the experiments. In this section, we develop a topic modeling based approach to discover a low-dimensional structure that can more effectively capture the semantic relationships between documents. Directly learning topic models on short texts is much harder than on traditional long texts. For this reason, [12, 11] proposed to train topic models on a collection of long texts in the same domain and then perform inference on short texts to help the learning task on short texts. However, in highly dynamic domains like Twitter, where novel topics and trends constantly emerge, it is not always possible to find strongly related long texts via a search engine or a static knowledge base such as Wikipedia. Furthermore, for application domains like advertisement, short texts and long texts are often used for very different purposes. As a result, they may adopt quite distinct language styles. For example, when merchants advertise a product using short banner ads, the content often emphasizes the credibility and price aspects, while on a Web page for selling the product, merchants may focus more on branding and product features. Similarly, when comparing Twitter messages and blog articles posted by the same authors, one can also note significant differences in their content as well as language styles.

In the presence of such inconsistencies between short texts and auxiliary long texts, it would be unreasonable to assume that the topical structure of the two domains is completely identical, as done in several previous works [12, 11, 20]. In this section, we describe a better solution to the problem by designing a novel topic model, referred to as Dual Latent Dirichlet Allocation (DLDA), which can distinguish between consistent and inconsistent topical structures across domains when learning topics from short texts with an additional set of auxiliary long texts.

Figure 1: Graphical representation of LDA

In the following sections, we first review the basic LDA model and several related extensions, and then put forward the α-DLDA and γ-DLDA models, which extend the LDA model to cope with domain differences based on two different mechanisms.

3.1 Task Setting and Background

Let $\mathcal{W}^{tar} = \{\vec{w}^{tar}_m\}_{m=1}^{M^{tar}}$ denote the set of short texts from the target domain (i.e., the domain of interest), which are to be clustered, and let $\mathcal{W}^{aux} = \{\vec{w}^{aux}_m\}_{m=1}^{M^{aux}}$ denote a set of auxiliary documents consisting of long texts. The auxiliary data set could be extracted from a general knowledge base, such as Wikipedia, or from documents relevant to the target short texts. In this work, we attempt to transfer the topical knowledge from the auxiliary long texts to help with the unsupervised learning task on the target short texts. Any long texts that are topically related to the target short texts can be used as auxiliary data.

3.2 Latent Dirichlet Allocation and Extensions

A common approach to dealing with data with high-dimensional sparse representations is dimensionality reduction, which has a rich history and literature. A recent trend in dimensionality reduction is the use of probabilistic models. In particular, Latent Dirichlet Allocation (LDA) is a Bayesian probabilistic graphical model, which models each document as a mixture of underlying topics and generates each word from one topic. The generative process of a document is described in Table 1, which corresponds to the graphical model shown in Figure 1. A document $\vec{w}_d = \{w_{d,n}\}_{n=1}^{N_d}$ is associated with a multinomial distribution $\mathrm{Mult}(\vec{\theta}_d)$ that determines the mixing proportions of the different topics within the document. The topic assignment for each word is performed by sampling a particular topic $z_{d,n}$ from $\mathrm{Mult}(\vec{\theta}_d)$. Finally, a particular word $w_{d,n}$ is generated by sampling from the multinomial distribution $\mathrm{Mult}(\vec{\phi}_{z_{d,n}})$ over the words in the corpus.

The LDA model is entirely unsupervised, whereas in many applications one can expect to have additional knowledge such as class labels or tags.

Table 1: The generative process of LDA

• For each topic $z \in \{1, \ldots, K\}$, draw a multinomial distribution over terms, $\vec{\phi}_z \sim \mathrm{Dir}(\vec{\beta})$.
• For each document $d \in \{1, \ldots, M\}$:
  – Draw a multinomial distribution over topics, $\vec{\theta}_d \sim \mathrm{Dir}(\vec{\alpha})$.
  – For each word $w_{d,n}$ in document $d$:
    ∗ Draw a topic $z_{d,n} \sim \mathrm{Multinomial}(\vec{\theta}_d)$.
    ∗ Draw a word $w_{d,n} \sim \mathrm{Multinomial}(\vec{\phi}_{z_{d,n}})$.
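To make the generative process of Table 1 concrete, the following minimal sketch samples a toy corpus from the model. All sizes and hyperparameter values below are illustrative assumptions, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, M, N_d = 5, 1000, 100, 50          # topics, vocabulary, documents, words/doc (toy values)
alpha, beta = 0.1, 0.01                   # assumed symmetric Dirichlet hyperparameters

phi = rng.dirichlet(np.full(V, beta), size=K)        # topic-word distributions phi_z
docs = []
for d in range(M):
    theta_d = rng.dirichlet(np.full(K, alpha))       # document-topic proportions theta_d
    z = rng.choice(K, size=N_d, p=theta_d)           # topic assignment z_{d,n} per word
    w = np.array([rng.choice(V, p=phi[k]) for k in z])   # word w_{d,n} from the chosen topic
    docs.append(w)
```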

To incorporate such side information so as to arm the LDA model with supervision, several extensions of LDA have been proposed that impose various constraints on the document-specific topic-mixing proportions $\vec{\theta}_d$. In the DiscLDA model [7], additional supervisory information in the form of document labels is utilized by learning a class-label dependent linear transformation of $\vec{\theta}_d$ with a discriminative criterion.

Similarly, to handle documents annotated with multiple tags, such as social bookmarks, the Labeled LDA model [17] learns a topic for each tag and restricts each document to be generated using only those topics corresponding to the set of tags associated with the document. This is achieved by setting a document-specific hyper-parameter vector $\vec{\alpha}_d$. The model was later successfully applied to a large collection of Twitter messages annotated with hashtags to map the short messages into different categories [16]. Note that although both [16] and our work apply topic models to characterize short text collections, the focus and objectives of the two works are quite different: [16] studies the utility of hashtags, whereas our work is motivated by the use of additional auxiliary long text documents. In practice, both tags and auxiliary long texts are potential data sources that can be exploited, and we believe the two approaches are complementary to each other.

3.3 α-DLDA

Two key ideas are exploited in our DLDA model for coping with domain inconsistencies:

1. We model two separate sets of topics for the auxiliary data and the target data, which capture the major themes within the two data sets, respectively. Distinguishing auxiliary topics from target topics allows irrelevant or inconsistent topical knowledge in the auxiliary data to be filtered out.

2. We use different generative processes for auxiliary and target documents, so that the model favors generating a document using the topics that belong to its domain.

To realize the first idea, we can simply split the topics of an LDA model into two groups. In particular, we assume that there are $K^{tar}$ topics with parameters $\{\vec{\phi}^{tar}_1, \ldots, \vec{\phi}^{tar}_{K^{tar}}\}$ in the target domain and $K^{aux}$ topics with parameters $\{\vec{\phi}^{aux}_1, \ldots, \vec{\phi}^{aux}_{K^{aux}}\}$ in the auxiliary domain.

To realize the second idea, a simple approach is to properly set the hyper-parameter vector $\vec{\alpha}$, which determines the prior distribution for the document-specific mixing proportions $\vec{\theta}$.

Figure 2: Graphical representation of α-DLDA

Traditionally, without any prior knowledge, the entries of $\vec{\alpha}$ are often assumed to be all equal to some small positive value, which corresponds to having no preference for particular topics. To make the model generate target documents using target topics more often, we can make the entries of $\vec{\alpha}$ corresponding to the target topics greater than those corresponding to the auxiliary topics. The same idea can be applied to designate another asymmetric Dirichlet prior for generating the topic mixing proportions of auxiliary documents. This leads to two separate asymmetric Dirichlet distributions with parameters

$$\vec{\alpha}^{tar} = [\alpha^{tar}_{aux}, \ldots, \alpha^{tar}_{aux}, \alpha^{tar}_{tar}, \ldots, \alpha^{tar}_{tar}]$$

and

$$\vec{\alpha}^{aux} = [\alpha^{aux}_{aux}, \ldots, \alpha^{aux}_{aux}, \alpha^{aux}_{tar}, \ldots, \alpha^{aux}_{tar}],$$

respectively, which corresponds to the generative process depicted in Figure 2. We refer to this particular variation of the DLDA model, based on modifying the $\alpha$ parameters, as the α-DLDA model. Note that the inference and learning algorithms of the basic LDA model can be directly applied to the α-DLDA model, which does not change the model structure but only imposes certain settings of the hyperparameters.
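As a minimal sketch of these two asymmetric priors, the vectors can be built as follows. The topic counts and the large/small values are illustrative assumptions; as in the vectors above, the auxiliary-topic entries come first and the target-topic entries second.

```python
import numpy as np

K_aux, K_tar = 40, 60                                   # assumed topic counts
alpha_big = 50.0 / (K_aux + K_tar)                       # assumed value for the favored entries
alpha_small = 0.01                                       # assumed value for the disfavored entries

# Prior for target documents: favor the target-topic block.
alpha_tar = np.concatenate([np.full(K_aux, alpha_small),   # alpha^tar_aux entries
                            np.full(K_tar, alpha_big)])    # alpha^tar_tar entries

# Prior for auxiliary documents: favor the auxiliary-topic block.
alpha_aux = np.concatenate([np.full(K_aux, alpha_big),     # alpha^aux_aux entries
                            np.full(K_tar, alpha_small)])  # alpha^aux_tar entries
```

A standard LDA sampler can then be used unchanged; the only difference is which of the two prior vectors is applied to a document, depending on its corpus.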

3.4 γ-DLDA

The α-DLDA model uses asymmetric Dirichlet priors to control the relative importance of target versus auxiliary topics when generating a document. However, the asymmetric Dirichlet prior is imposed on all the documents from each domain. In this section, we develop the γ-DLDA model, which constrains each document to be generated using either auxiliary or target topics. It introduces a document-dependent binary switch variable that chooses between the two types of topics when generating the document. This new mechanism allows the model to automatically capture whether a document is more related to the target or the auxiliary domain.

More specifically, under the γ-DLDA model, each document $d$ is associated with (1) a binomial distribution $\pi_d$ over auxiliary versus target topics with Beta prior $\vec{\gamma}^c = [\gamma^c_{aux}, \gamma^c_{tar}]$ for each corpus $c \in \{aux, tar\}$, and (2) two multinomial distributions $\vec{\theta}^{aux}_d$ and $\vec{\theta}^{tar}_d$ over auxiliary and target topics, respectively, each with a symmetric Dirichlet prior $\vec{\alpha}$.

Table 2: The generative process of DLDA

• For each target topic $z \in \{1, \ldots, K^{tar}\}$, draw a multinomial distribution over terms, $\vec{\phi}^{tar}_z \sim \mathrm{Dir}(\vec{\beta}^{tar})$.
• For each auxiliary topic $z \in \{1, \ldots, K^{aux}\}$, draw a multinomial distribution over terms, $\vec{\phi}^{aux}_z \sim \mathrm{Dir}(\vec{\beta}^{aux})$.
• For each corpus $c \in \{aux, tar\}$:
  – For each document $d \in \{1, \ldots, M^c\}$:
    ∗ Draw a multinomial distribution over target topics, $\vec{\theta}^{tar}_d \sim \mathrm{Dir}(\vec{\alpha}^{tar})$.
    ∗ Draw a multinomial distribution over auxiliary topics, $\vec{\theta}^{aux}_d \sim \mathrm{Dir}(\vec{\alpha}^{aux})$.
    ∗ Draw a binomial distribution over target versus auxiliary topics, $\pi_d \sim \mathrm{Beta}(\vec{\gamma}^c)$.
    ∗ For each word $w_{d,n}$ in document $d$:
      · Draw a binary switch $x_{d,n} \sim \mathrm{Binomial}(\pi_d)$.
      · If $x_{d,n} = tar$, draw a target topic $z_{d,n} \sim \mathrm{Multinomial}(\vec{\theta}^{tar}_d)$.
      · If $x_{d,n} = aux$, draw an auxiliary topic $z_{d,n} \sim \mathrm{Multinomial}(\vec{\theta}^{aux}_d)$.
      · Draw a word $w_{d,n} \sim \mathrm{Multinomial}(\vec{\phi}^{x_{d,n}}_{z_{d,n}})$.

The generative process is shown in Table 2, and a graphical representation of the model is shown in Figure 3.

The Beta prior parameterized by $\vec{\gamma}^c$ can be used to capture the prior belief about the consistency between the two domains. Here we constrain $\gamma^{aux}_{aux} > \gamma^{aux}_{tar}$ and $\gamma^{tar}_{tar} > \gamma^{tar}_{aux}$, in order to ensure that the auxiliary topics and target topics focus more on modeling the documents in their respective corpora.
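The following minimal sketch simulates the generative process of Table 2 for a toy corpus. All sizes and hyperparameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
V, K_tar, K_aux, N_d = 500, 10, 10, 30       # toy vocabulary/topic/document sizes (assumed)
alpha, beta = 0.1, 0.01                      # assumed symmetric Dirichlet hyperparameters
gamma = {"tar": {"tar": 0.5, "aux": 0.2},    # gamma^tar_tar > gamma^tar_aux (assumed values)
         "aux": {"tar": 0.2, "aux": 0.5}}    # gamma^aux_aux > gamma^aux_tar

phi_tar = rng.dirichlet(np.full(V, beta), size=K_tar)   # target topic-word distributions
phi_aux = rng.dirichlet(np.full(V, beta), size=K_aux)   # auxiliary topic-word distributions

def generate_doc(corpus):
    """Generate one document from corpus c in {'tar', 'aux'} following Table 2."""
    theta_tar = rng.dirichlet(np.full(K_tar, alpha))
    theta_aux = rng.dirichlet(np.full(K_aux, alpha))
    pi_d = rng.beta(gamma[corpus]["tar"], gamma[corpus]["aux"])  # P(word uses a target topic)
    words = []
    for _ in range(N_d):
        if rng.random() < pi_d:                          # binary switch x_{d,n} = tar
            z = rng.choice(K_tar, p=theta_tar)
            words.append(rng.choice(V, p=phi_tar[z]))
        else:                                            # binary switch x_{d,n} = aux
            z = rng.choice(K_aux, p=theta_aux)
            words.append(rng.choice(V, p=phi_aux[z]))
    return words

short_doc = generate_doc("tar")   # a target-corpus (short text) document
```

Swapping the corpus argument switches the Beta prior, so auxiliary documents draw mostly on auxiliary topics and target documents mostly on target topics.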

3.4.1 Parameter Estimation with Gibbs Sampling

From the generative graphical model, we can write the joint distribution of all observed and hidden variables given the hyperparameters:

$$p(\vec{w}^c_m, \vec{z}, \vec{x}, \vec{\theta}, \vec{\pi}, \Phi \mid \vec{\alpha}, \vec{\beta}, \vec{\gamma}^c) = \prod_{n=1}^{N_m} p(w_{m,n} \mid \vec{\phi}^{x_{m,n}}_{z_{m,n}})\, p(x_{m,n} \mid \vec{\pi}_m)\, p(z_{m,n} \mid \vec{\theta}_m) \cdot p(\vec{\theta}_m \mid \vec{\alpha}) \cdot p(\vec{\pi}_m \mid \vec{\gamma}^c) \cdot p(\Phi \mid \vec{\beta}) \quad (1)$$

The likelihood of a document $\vec{w}^c_m$ can be obtained by integrating out $\vec{\theta}$, $\vec{\pi}$, $\Phi$ and summing over $z_{m,n}$ and $x_{m,n}$:

$$p(\vec{w}^c_m \mid \vec{\alpha}, \vec{\beta}, \vec{\gamma}^c) = \int\!\!\int\!\!\int p(\vec{\theta}_m \mid \vec{\alpha})\, p(\vec{\pi}_m \mid \vec{\gamma}^c)\, p(\Phi \mid \vec{\beta}) \prod_{n=1}^{N_m} p(w_{m,n} \mid \vec{\theta}_m, \vec{\pi}_m, \Phi)\, d\Phi\, d\vec{\theta}_m\, d\vec{\pi}_m \quad (4)$$

Figure 3: Graphical representation of the proposed γ-DLDA model

Finally, the likelihood of the whole data set $\mathcal{W} = \{\vec{w}^{aux}_m\}_{m=1}^{M^{aux}} \cup \{\vec{w}^{tar}_m\}_{m=1}^{M^{tar}}$ under the model is

$$p(\mathcal{W} \mid \vec{\alpha}, \vec{\beta}, \vec{\gamma}) = \prod_{m=1}^{M^{aux}} p(\vec{w}^{aux}_m \mid \vec{\alpha}, \vec{\beta}, \vec{\gamma}^{aux}) \cdot \prod_{m=1}^{M^{tar}} p(\vec{w}^{tar}_m \mid \vec{\alpha}, \vec{\beta}, \vec{\gamma}^{tar}) \quad (6)$$

To estimate the parameters, we need to infer the latent variables conditioned on the observed variables, i.e., $p(\vec{x}, \vec{z} \mid \vec{w}, \vec{\alpha}, \vec{\beta}, \vec{\gamma})$, where $\vec{x}$ and $\vec{z}$ are the vectors of auxiliary/target binary-switch assignments and topic assignments for all the words, respectively. We perform approximate inference using Gibbs sampling, a Markov Chain Monte Carlo (MCMC) algorithm, with the following updating formulas.

For auxiliary topics $z \in \{1, \ldots, K^{aux}\}$ (i.e., $x = aux$),

$$p(x_i = x, z_i = z \mid w_i = w, \vec{x}_{\neg i}, \vec{z}_{\neg i}, \vec{w}_{\neg i}, \vec{\alpha}, \vec{\beta}, \vec{\gamma}) \propto \frac{n^{aux,z}_{w,\neg i} + \beta_w}{\sum_{v=1}^{V} (n^{aux,z}_{v,\neg i} + \beta_v)} \cdot \frac{n^{aux,z}_{d,\neg i} + \alpha_z}{\sum_{k=1}^{K^{aux}} (n^{aux,k}_{d,\neg i} + \alpha_k)} \cdot (n^{aux}_{d,x,\neg i} + \gamma^{c_i}_x) \quad (8)$$

For target topics $z \in \{1, \ldots, K^{tar}\}$ (i.e., $x = tar$),

$$p(x_i = x, z_i = z \mid w_i = w, \vec{x}_{\neg i}, \vec{z}_{\neg i}, \vec{w}_{\neg i}, \vec{\alpha}, \vec{\beta}, \vec{\gamma}) \propto \frac{n^{tar,z}_{w,\neg i} + \beta_w}{\sum_{v=1}^{V} (n^{tar,z}_{v,\neg i} + \beta_v)} \cdot \frac{n^{tar,z}_{d,\neg i} + \alpha_z}{\sum_{k=1}^{K^{tar}} (n^{tar,k}_{d,\neg i} + \alpha_k)} \cdot (n^{tar}_{d,x,\neg i} + \gamma^{c_i}_x) \quad (11)$$

where $n^{aux,z}_{w,\neg i}$ is the number of times the term $w$ is assigned to auxiliary topic $z$; similarly, $n^{tar,z}_{w,\neg i}$ is the count of $w$ for target topic $z$.

Table 3: The statistics of the two corpora

            Short Text                Long Text
            words/doc    docs         words/doc    docs
ADs         29.06        182209       560.40       99737
TWEETs      9.71         24047        1106.79      4482

$n^{aux,z}_{d,\neg i}$ is the number of times a word in document $d$ is assigned to auxiliary topic $z$, while $n^{tar,z}_{d,\neg i}$ is the corresponding count for target topic $z$. $n^{aux}_{d,x,\neg i}$ and $n^{tar}_{d,x,\neg i}$ denote the number of words in document $d$ assigned to auxiliary and to target topics, respectively. $c_i$ denotes the corpus ($aux$ or $tar$) from which the current document is drawn. For all counts, the subscript $\neg i$ indicates that the $i$-th word is excluded from the computation.

After the Gibbs sampler reaches the burn-in state, we can harvest samples and count the word assignments in order to estimate the parameters:

$$\theta^x_{d,z} \propto n^{x,z}_d + \alpha^x \quad (14)$$
$$\phi^x_{z,w} \propto n^{x,z}_w + \beta^x \quad (15)$$
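A minimal sketch of the collapsed Gibbs update for a single token, following Equations (8) and (11) with symmetric hyperparameters, is shown below. The dictionary-of-count-arrays layout and all variable names are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_word(w, d, corpus, counts, K_tar, K_aux, V, alpha, beta, gamma):
    """Collapsed Gibbs update for one token of term w in document d.
    The counts passed in are assumed to already exclude this token (the '¬i' counts).
    `counts` is a dict of NumPy arrays (layout assumed for illustration):
      tar_wz (V x K_tar), tar_z (K_tar), tar_dz (D x K_tar),
      aux_wz (V x K_aux), aux_z (K_aux), aux_dz (D x K_aux), d_x (D x 2: [tar, aux])."""
    # x = tar block: word-topic term, doc-topic term, switch term.
    p_tar = ((counts["tar_wz"][w] + beta) / (counts["tar_z"] + V * beta)
             * (counts["tar_dz"][d] + alpha) / (counts["tar_dz"][d].sum() + K_tar * alpha)
             * (counts["d_x"][d, 0] + gamma[corpus]["tar"]))
    # x = aux block.
    p_aux = ((counts["aux_wz"][w] + beta) / (counts["aux_z"] + V * beta)
             * (counts["aux_dz"][d] + alpha) / (counts["aux_dz"][d].sum() + K_aux * alpha)
             * (counts["d_x"][d, 1] + gamma[corpus]["aux"]))
    p = np.concatenate([p_tar, p_aux])
    p /= p.sum()
    idx = rng.choice(K_tar + K_aux, p=p)
    return ("tar", idx) if idx < K_tar else ("aux", idx - K_tar)
```

After burn-in, θ and φ can be recovered by normalizing the same count arrays plus their priors, in the spirit of Equations (14) and (15).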

3.5 Clustering with Hidden Topics

After estimating the model parameters and inferring the parameters for new documents, we represent each document $d$ by $\theta_d$ as

$$f_d = \left[\frac{\theta^{aux}_{d,1}}{S^{aux}_1}, \ldots, \frac{\theta^{aux}_{d,K^{aux}}}{S^{aux}_{K^{aux}}}, \frac{\theta^{tar}_{d,1}}{S^{tar}_1}, \ldots, \frac{\theta^{tar}_{d,K^{tar}}}{S^{tar}_{K^{tar}}}\right]$$

where $S^x_j = \sum_i \theta^x_{i,j}$, $x \in \{aux, tar\}$. We normalize the scale of each feature in order to reduce the importance of topics that are overly general, such as a topic that represents functional words and common language. Such topics tend to occur in most documents but lack discriminative power.
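A minimal sketch of this feature construction follows; the array layout (documents in rows, topics in columns, averaged over post-burn-in samples) is an assumption for illustration.

```python
import numpy as np

def topic_features(theta_aux, theta_tar):
    """Build f_d for every document: divide each topic column by its total mass
    S^x_j = sum_i theta^x_{i,j} over all documents, then concatenate the two blocks.
    theta_aux: (D, K_aux) array; theta_tar: (D, K_tar) array."""
    S_aux = theta_aux.sum(axis=0)
    S_tar = theta_tar.sum(axis=0)
    return np.hstack([theta_aux / S_aux, theta_tar / S_tar])
```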

Another important issue is that we should collect sufficient samples of $\theta$ during sampling. Because the texts are short, the result from a single sample may vary considerably, while averaging over multiple samples produces a more robust result. This process may be affected by the label-switching problem of MCMC algorithms, but in practice we find that this is not a serious issue.

Once we have the topic-based representations of the short text documents, we can apply traditional clustering methods to them. Since these features already incorporate knowledge from the auxiliary data, clustering on the new representation can achieve better results, as we will demonstrate in the experiments.

4. EXPERIMENTAL RESULTS

4.1 Data Set

To evaluate our DLDA algorithm, we conduct experiments on two real data sets: a collection of advertisements (ADs) from an online consumer-to-consumer (C2C) shopping Web site and a collection of tweets (TWEETs) from Twitter.

The short texts in the ADs data set are a collection of text-based display advertisements crawled from the e-commerce Web site of a commercial company. To obtain the auxiliary long texts, we randomly crawled product web pages listed on the e-commerce site's home page. Each advertisement is labeled with one of 42 classes according to the product taxonomy used by the site. To create the TWEETs data set, we crawled 197,535 tweets from the timelines of 405 Twitter users. We then selected the 24,047 tweets (12.1%) that contain hashtags (i.e., semantic annotations added by the author of a tweet). Among these tweets, 4,282 (17.8%) contain URLs; we therefore crawled the content of the referenced URLs to form the set of auxiliary long texts. Some statistics of the two data sets are shown in Table 3, from which we can see that the short texts in each data set contain only a small fraction of the number of words in the corresponding long texts. Moreover, the short texts in the TWEETs data set contain very few words on average and are much shorter than those in the ADs data set.

We performed standard data preprocessing, including stop-word removal, on the raw text. For the TWEETs data, we filtered out non-semantic symbols such as mentions of user names (@user) and shortened URLs. We also extracted hashtags (#tag) from the text, since we rely on the hashtags as the semantic labels for evaluating clustering quality.

It should be noted that our model does not require any correspondence structure between the short texts and the long texts, such as which tweet contains which URL. This type of correspondence is not available at all in the ADs data set. The short and long texts in both data sets are of a very different nature and are only topically related. The crawled Web pages in the ADs data set may not necessarily contain any information related to the products mentioned in the advertisements. Similarly, in the TWEETs data set, the referenced URLs are often news articles and blogs written by the users themselves, whereas the tweet messages are mostly the users' subjective comments. Thus, the tweets and the web pages are likely about the same topics, though the language styles can be quite different.

4.2 Evaluation Criteria

In the experiments on the ADs data, we used Entropy, Purity and Normalized Mutual Information (NMI) as the evaluation criteria. Purity is a simple and transparent evaluation measure, which helps us understand the quality of a clustering result directly. Entropy and NMI have information-theoretic interpretations. NMI allows us to trade off the quality of the clustering against the number of clusters, which the Purity measure cannot do. The larger the Purity and NMI values, the higher the quality of the clustering. Similarly, a smaller Entropy means better performance.
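As a minimal sketch of two of these criteria (the toy labels below are illustrative, and the NMI call relies on scikit-learn's standard implementation rather than the authors' code):

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def purity(labels_true, labels_pred):
    """Fraction of documents that fall in the majority class of their cluster."""
    total = 0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        total += np.bincount(members).max()
    return total / len(labels_true)

# Toy example: 6 documents, 3 ground-truth classes, 2 predicted clusters.
labels_true = np.array([0, 0, 1, 1, 2, 2])
labels_pred = np.array([0, 0, 0, 1, 1, 1])
print(purity(labels_true, labels_pred))
print(normalized_mutual_info_score(labels_true, labels_pred))
```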

Due to the lack of ground truth for the TWEETs data, we use the hashtags contained in the tweets as indicators of their hidden meaning. Since many tweets are assigned more than one hashtag, we employ the Davies-Bouldin Validity Index (DBI) [2], calculated on the hashtags, as the evaluation criterion. DBI is a function of the ratio of within-cluster scatter to between-cluster separation, defined as

$$DBI = \frac{1}{n} \sum_{i=1}^{n} \max_{i \neq j} \left\{ \frac{S_n(Q_i) + S_n(Q_j)}{S(Q_i, Q_j)} \right\}$$

where $n$ is the number of clusters, $Q_x$ is the centroid of cluster $x$, $S_n(Q_x)$ is the average distance from all elements in cluster $x$ to the centroid $Q_x$, and $S(Q_i, Q_j)$ is the distance between centroids $Q_i$ and $Q_j$. Here we use the Euclidean distance for both $S_n(\cdot)$ and $S(\cdot, \cdot)$. Since algorithms that produce clusters with low intra-cluster distances (high intra-cluster similarity) and high inter-cluster distances (low inter-cluster similarity) yield a low Davies-Bouldin index, the clustering algorithm that produces the collection of clusters with the smallest DBI is considered the best under this criterion.
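The definition above can be computed directly as in the following sketch (a straightforward reading of the formula with Euclidean distances; function and variable names are our own):

```python
import numpy as np

def davies_bouldin(X, labels):
    """DBI over clusters given by integer labels; X holds one feature row per document.
    Assumes at least two clusters."""
    ids = np.unique(labels)
    centroids = np.array([X[labels == i].mean(axis=0) for i in ids])
    # Average distance from each cluster's members to its centroid, S_n(Q_i).
    scatter = np.array([np.linalg.norm(X[labels == i] - centroids[k], axis=1).mean()
                        for k, i in enumerate(ids)])
    n = len(ids)
    dbi = 0.0
    for i in range(n):
        ratios = [(scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(n) if j != i]
        dbi += max(ratios)
    return dbi / n
```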

4.3 Baseline Methods and Implementation Details

We compare our DLDA with the following methods:

Direct: Direct clustering without incorporating auxiliary data. We use CLUTO^1 with the TF-IDF representation.

LDA-short: Learn an LDA model from the short texts and cluster directly with the learned θ.

LDA-long: Learn an LDA model from the long texts and then apply it to the short texts, as proposed in [12]. We use GibbsLDA++^2, the implementation mentioned in that paper.

LDA-both: Learn an LDA model from the combined collection of long texts and short texts, as proposed in [20].

Self-Taught Clustering (STC): A state-of-the-art unsupervised transfer learning method [1], which exploits an additional set of unlabeled auxiliary data for clustering the target data. It performs co-clustering^3 on both data sets simultaneously and uses shared word clusters to bridge the two domains.

For LDA-short, LDA-long, LDA-both and our DLDA model, we first build the topic-based representations of the short texts following the procedure described in Section 3.5, and then use CLUTO to obtain the document clusters.

For each LDA-based algorithm, we repeat the procedure five times and report the mean. We tune each algorithm to its best parameter setting. For the experiments on the ADs data, we set the number of clusters to 42, since the data set is labeled with 42 classes. For the experiments on the TWEETs data, we set the number of clusters to 50.

4.4 Results

The experimental results on the two corpora are shown in Table 4 and Table 5, respectively. As expected, our methods α-DLDA and γ-DLDA clearly outperformed all the other baseline methods on both data sets. Due to the high dimensionality and sparseness of the short text representations, directly clustering the short texts gave the poorest performance. LDA-short outperformed Direct on both data sets, which demonstrates the benefit of using a topic-based low-dimensional representation for the short texts.

^1 CLUTO: http://glaros.dtc.umn.edu/gkhome/views/cluto/
^2 GibbsLDA++: http://gibbslda.sourceforge.net/
^3 Co-clustering: http://www.cse.ust.hk/TL/index.html

Table 4: The performance of all the evaluated methods on ADs

             Purity    Entropy    NMI
Direct       0.280     0.681      0.215
LDA-short    0.305     0.650      0.250
LDA-long     0.360     0.599      0.310
LDA-both     0.362     0.594      0.315
STC          0.320     0.636      0.259
α-DLDA       0.383     0.578      0.336
γ-DLDA       0.392     0.553      0.364

Table 5: The performance in DBI of all the evaluated methods on TWEETs

             DBI
Direct       18.270
LDA-short    15.002
LDA-long     14.249
LDA-both     13.960
STC          13.342
α-DLDA       12.485
γ-DLDA       11.748

All methods that utilize auxiliary long texts significantly outperform both Direct and LDA-short, which clearly shows the value of transferring knowledge from auxiliary long texts. Interestingly, the LDA-long model, which ignores all the short texts during training, performs better than LDA-short. We believe this is because learning topic models on short texts is inherently much more difficult.

Both variations of the proposed DLDA model consistently beat the LDA-both and STC algorithms. LDA-both does not distinguish the target domain from the auxiliary domain, and can therefore suffer from domain inconsistencies. The STC model distinguishes target and auxiliary domains, but lacks a document-level mechanism for determining whether a document is more related to the target or the auxiliary domain. The empirical results confirm that DLDA can effectively address these difficulties.

Finally, when comparing α-DLDA with γ-DLDA, we note that α-DLDA treats the documents within each set uniformly, whereas the document-dependent binary-switch variables of γ-DLDA help select only those auxiliary long texts that are related to the target domain. This document-level mechanism gives γ-DLDA extra flexibility over α-DLDA, which can lead to the observed superior performance.

4.5 Influence of the Parameters and Auxiliary Data

We conduct additional experiments to show the influence of the parameters and the auxiliary data. These experiments are performed on the ADs corpus with NMI as the evaluation metric. Due to the stochastic nature of the algorithm, we repeat each run 5 times and report the mean as well as the standard deviation.

Hyperparameters.

In [3], the parameters are set as $\alpha = 50/K$ and $\beta = 0.01$. In our work, we set $\gamma^{aux}_{tar} = \gamma^{tar}_{aux} = \gamma_{small}$ and $\gamma^{tar}_{tar} = \gamma^{aux}_{aux} = \gamma_{big}$, where $\gamma_{small} < \gamma_{big}$. Since the model is not very sensitive to this prior, we use $\gamma_{small} = 0.2$ and $\gamma_{big} = 0.5$ in the experiments.

For α-DLDA, we rely on the setting of the $\alpha$ prior to constrain the topic selection. We have four parameters $\alpha^{tar}_{aux}$, $\alpha^{tar}_{tar}$, $\alpha^{aux}_{aux}$, $\alpha^{aux}_{tar}$ to be tuned. We set $\alpha^{aux}_{aux} = 50/K$ and vary the other three in $\{0, 0.01, 0.05, 50/K\}$.

Figure 4: The performance of all the algorithms on ADs with different K

Topic Numbers.

The performance of all the LDA-based algorithms with different numbers of topics is shown in Figures 4 and 5, for the ADs and TWEETs data sets respectively. For α-DLDA and γ-DLDA, $K = K^{aux} + K^{tar}$ represents the overall complexity of the model. With the same K, the computational complexities of these LDA-based methods are nearly the same.

From these results, we find that the method without any additional information performs poorly, while the other algorithms achieve much better performance. The topic models perform similarly when the number of topics is small, but differ considerably when the number of topics is large. The methods that treat auxiliary data and target data differently can handle a larger number of topics with better results.

We also examine different $K^{aux}$ values for γ-DLDA. The results are shown in Figure 6. For a set of K values ranging between 60 and 140, we plot the model performance as $K^{aux}$ goes from 10 to K. The value of K controls the overall complexity of the topic model, whereas tuning $K^{aux}$ allows us to enforce different degrees of topic sharing between the target and auxiliary data. Given a fixed K, reducing $K^{aux}$ is equivalent to enforcing more shared topics between the two domains, whereas increasing $K^{aux}$ allows more auxiliary domain-dependent information to be captured by the auxiliary topics. From Figure 6, we can clearly see that the optimal performance is achieved with neither very small nor very large $K^{aux}$ values, which shows the important tradeoff between topic sharing and domain inconsistencies.

Figure 5: The performance of all the algorithms on TWEETs with different K

Auxiliary Data.

The amount of auxiliary data involved in the learning process also influences the result. In the second set of experiments, we examine this influence by varying the amount of auxiliary data used for training from 10% to 100%. The results are shown in Figure 7. We can see that even with a small amount of auxiliary data, the performance of α-DLDA and γ-DLDA is already much better than that of the other methods. Furthermore, increasing the amount of auxiliary data leads to consistent improvement. With its greater flexibility, γ-DLDA begins to perform better as more auxiliary data is involved.

Irrelevant Data.

We also compare the robustness of our algorithms α-DLDA and γ-DLDA with the other two topic-model-based methods, LDA-long and LDA-both. In this experiment, the auxiliary data is mixed with documents randomly selected from a general knowledge base. There are certainly many useful documents in the general knowledge base that share topics with our target documents, but without any filtering, much more irrelevant data can be included. These documents are not strongly related to the target data and can be considered noise in the auxiliary data. We control the noise level by inserting different numbers of irrelevant documents into the auxiliary data. The performance of the different models under different noise levels is shown in Figure 8. The results of LDA-long and LDA-both are not very stable when more irrelevant documents are added to the auxiliary data, whereas our method γ-DLDA achieves stable performance across different amounts of noise. This shows that, by explicitly considering domain inconsistency, we can effectively improve the robustness of short text clustering with auxiliary long texts.

When the auxiliary documents contain no noise, there is no large difference between γ-DLDA and α-DLDA. In Figure 8, however, we find that simply treating the documents uniformly causes considerable problems.

Figure 6: The influence of different $K^{aux}$ and $K^{aux} + K^{tar}$ for γ-DLDA

Figure 7: The influence of the amount of auxiliary data used

Manually adjusting the α priors can alleviate this problem, but it requires substantial human effort; without careful tuning of the α priors, the performance can drop dramatically.

The behavior of γ-DLDA can also be explained by the following analysis. We calculate π when mixing in the irrelevant data, using

$$\pi^c_{d,x} \propto n^c_{d,x} + \gamma^c_x.$$

The average values of π for auxiliary documents and for target documents are then calculated separately, as shown in Figure 9. With more irrelevant data, the model automatically adjusts the relevance between target documents and auxiliary topics. This adaptability helps the model avoid using many irrelevant topics.
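A minimal sketch of this per-corpus averaging of π (the count-array layout and names are our own illustrative assumptions):

```python
import numpy as np

def average_pi(n_dx, gamma, corpus_of_doc):
    """Average normalized switch proportions pi^c_{d,x} per corpus.
    n_dx: (D, 2) counts of words per document assigned to [target, auxiliary] topics;
    gamma: dict mapping corpus name to its (gamma^c_tar, gamma^c_aux) prior;
    corpus_of_doc: length-D sequence of 'tar'/'aux' labels."""
    out = {}
    for c in ("tar", "aux"):
        idx = [d for d, corp in enumerate(corpus_of_doc) if corp == c]
        unnorm = n_dx[idx] + np.array(gamma[c])          # n^c_{d,x} + gamma^c_x
        pi = unnorm / unnorm.sum(axis=1, keepdims=True)  # normalize per document
        out[c] = pi.mean(axis=0)                          # average over documents in corpus c
    return out
```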

5. CONCLUSIONS AND FUTURE WORK

In this paper, we presented a novel type of topic model for enhancing short text clustering by incorporating auxiliary long texts.

Figure 8: The influence of irrelevant data in the auxiliary data

Figure 9: The variation of π with irrelevant data in the auxiliary data

The model jointly learns two sets of latent topics on the short and long texts. By considering the difference between the topics of the auxiliary and target data, our method can robustly improve the clustering results on the target data, even when there are many irrelevant documents in the auxiliary data. The experimental results show that DLDA outperforms many state-of-the-art methods. This validates that, by considering the difference between auxiliary data and target data, the clustering quality on short texts can be improved.

In the future, we wish to explore other forms of domain-difference criteria between data sets and evaluate their performance when knowledge is transferred between the data. We will also consider other forms of topic models to enable more effective learning.

6. ACKNOWLEDGEMENTS

We thank the support of Hong Kong RGC project 621010 and the Project of International Cooperation and Exchanges of the National Natural Science Foundation of China 60931160445.

7. REFERENCES

[1] W. Dai, Q. Yang, G.-R. Xue, and Y. Yu. Self-taught clustering. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 200–207. ACM, 2008.
[2] D. L. Davies and D. W. Bouldin. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2):224–227, April 1979.
[3] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101:5228–5235, April 2004.
[4] L. Hong and B. Davison. Empirical study of topic modeling in Twitter. In 1st Workshop on Social Media Analytics, 2010.
[5] X. Hu, N. Sun, C. Zhang, and T.-S. Chua. Exploiting internal and external semantics for the clustering of short texts using world knowledge. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM '09, pages 919–928. ACM, 2009.
[6] X. Hu, X. Zhang, C. Lu, E. K. Park, and X. Zhou. Exploiting Wikipedia as external knowledge for document clustering. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, pages 389–396. ACM, 2009.
[7] S. Lacoste-Julien, F. Sha, and M. I. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. In NIPS, 2008.
[8] R. Mihalcea, C. Corley, and C. Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1, pages 775–780. AAAI Press, 2006.
[9] C.-T. Nguyen, X.-H. Phan, S. Horiguchi, T.-T. Nguyen, and Q.-T. Ha. Web search clustering and labeling with hidden topics. ACM Transactions on Asian Language Information Processing, 8:12:1–12:40, August 2009.
[10] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22:1345–1359, October 2010.
[11] X.-H. Phan, C.-T. Nguyen, D.-T. Le, L.-M. Nguyen, S. Horiguchi, and Q.-T. Ha. A hidden topic-based framework towards building applications with short web documents. IEEE Transactions on Knowledge and Data Engineering, 99, 2010.
[12] X.-H. Phan, L.-M. Nguyen, and S. Horiguchi. Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In Proceedings of the 17th International Conference on World Wide Web, WWW '08, pages 91–100. ACM, 2008.
[13] X. Quan, G. Liu, Z. Lu, X. Ni, and L. Wenyin. Short text similarity based on probabilistic topics. Knowledge and Information Systems, 25:473–491, December 2010.
[14] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, pages 759–766. ACM, 2007.
[15] R. Raina, A. Y. Ng, and D. Koller. Constructing informative priors using transfer learning. In ICML '06, pages 713–720, New York, NY, USA, 2006. ACM.
[16] D. Ramage, S. Dumais, and D. Liebling. Characterizing microblogs with topic models. In ICWSM, 2010.
[17] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning. Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, EMNLP '09, pages 248–256. Association for Computational Linguistics, 2009.
[18] M. Sahami and T. D. Heilman. A web-based kernel function for measuring the similarity of short text snippets. In Proceedings of the 15th International Conference on World Wide Web, WWW '06, pages 377–386. ACM, 2006.
[19] B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu, and M. Demirbas. Short text classification in Twitter to improve information filtering. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '10, pages 841–842. ACM, 2010.
[20] G.-R. Xue, W. Dai, Q. Yang, and Y. Yu. Topic-bridged PLSA for cross-domain text classification. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '08, pages 627–634. ACM, 2008.
[21] W.-T. Yih and C. Meek. Improving similarity measures for short segments of text. In Proceedings of the 22nd National Conference on Artificial Intelligence - Volume 2, pages 1489–1494. AAAI Press, 2007.