

Int. J. Mach. Learn. & Cyber. DOI 10.1007/s13042-017-0681-9

ORIGINAL ARTICLE

Self‑organizing weighted incremental probabilistic latent semantic analysis

Ning Li1,2 · Wenjuan Luo2 · Kun Yang3 · Fuzhen Zhuang2 · Qing He2 · Zhongzhi Shi2 

Received: 5 February 2016 / Accepted: 10 April 2017 © Springer-Verlag Berlin Heidelberg 2017

Abstract  PLSA (Probabilistic Latent Semantic Analysis) is a popular topic modeling technique which has been widely applied in text mining to discover the underlying topics embedded in a data corpus. However, because data keep growing and changing, it is necessary to discover the dynamic topics and process large datasets incrementally. Moreover, PLSA models suffer from the problem of inferring new documents. To overcome these problems, in this paper we propose a novel Weighted Incremental PLSA algorithm, called WIPLSA, to dynamically discover topics and incrementally learn the topics from new documents. The experiments verify that the proposed WIPLSA can capture the dynamic topics hidden in a dynamically updated data corpus. Compared with PLSA, MAP PLSA and QB PLSA, WIPLSA achieves better perplexity on large datasets, which makes it applicable to big data mining. In addition, WIPLSA performs well in the application of document categorization.

Keywords  Probabilistic latent semantic analysis · Weighted incremental learning · Similarity · Big data

* Ning Li [email protected]

Wenjuan Luo [email protected]

Kun Yang [email protected]

Fuzhen Zhuang [email protected]

Qing He [email protected]

Zhongzhi Shi [email protected]

1 Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China

2 The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China

3 National Institute of Metrology, Beijing 100029, China

1 Introduction

As our collective knowledge continues to be digitized and stored in the form of news, blogs, web pages, scientific articles, books, images, sound, video, and social networks, it becomes more difficult to find and discover what we are looking for. We need new computational tools to help organize, search and understand these vast amounts of information [1–5].

In many text collections, we encounter the scenario that a document contains multiple topics. Extracting such topics/subtopics/themes from the text collection is important for many text mining tasks [6–11], such as text retrieval, topic detection and so on. The traditional modeling method is the "bag of words" model, and the VSM (Vector Space Model) is commonly used as the representation for documents. However, this kind of representation ignores the relationships between words. For example, "actor" and "player" are treated as different words in the "bag of words" model although they actually have similar meanings and arguably belong to a single topic. A variety of probabilistic topic models have recently been proposed to analyze the content of documents and represent the meaning of words [12–15]. They typically factor each document into different topics, and each topic is represented as a distribution over terms. Research has demonstrated that they can be successfully applied to better understand collections of text documents [16–19].

Topic models started from latent semantic indexing (LSI) [20], which aims to represent documents in a low-dimensional vector space using singular value decomposition. The overwhelming success came with Probabilistic Latent Semantic Analysis (PLSA) [21], a probabilistic implementation of LSI. The main idea is to describe documents in terms of their topic compositions. However, PLSA does not make any assumptions about how the documents are generated, resulting in the overfitting problem. To solve this problem, Blei et al. [22] extended PLSA by introducing a Dirichlet prior, calling the new generative model Latent Dirichlet Allocation (LDA). Over the past decades, there have been numerous proposals for topic models that identify patterns of word occurrence in large collections of documents; these patterns reflect the underlying topics represented in the collection and can then be used to organize, search, index and browse large collections of documents [23–25].

While traditional topic models are computationally expensive to construct, considerable effort has been devoted to topic modeling in a dynamic way. Incremental learning methods provide a feasible solution to reduce the computational cost and make topic modeling applicable to large-scale datasets. Such incremental learning schemes appear in [26–28]. In particular, an adaptive Bayesian latent semantic analysis algorithm is proposed in [26]: an incremental PLSA algorithm is constructed to estimate the parameters and update the hyperparameters incrementally. The parameter updating combines the previous hyperparameters with the statistics related to the current documents. However, the two combined parts are treated equally. In many cases, we believe it is necessary to adjust the ratio between the previous hyperparameters and the new statistics. For example, if the newly arriving documents are semantically similar to the old ones, that is, they share similar latent topics, then the contribution of the new documents to model training is small. Take an extreme case: suppose there is only one document collected and PLSA is built upon it. If a new document arrives that is identical to the existing one, the model will not change after the new document is collected, because the word occurrences are the same. That is to say, the new PLSA model is unaffected by the new document; in other words, the new document contributes no new knowledge to the model. In that case, it is unreasonable to weight the new statistics equally in the incremental hyperparameter updating.

In this paper, we are particularly interested in topic analysis for document streams and present a weighted incremental PLSA. Specifically, we focus on extending the adaptive Bayesian latent semantic analysis model in a more reasonable way. Adding a balance factor between the previous hyperparameters and the new statistics can definitely improve the topic modeling performance. Take an earthquake as an example: the topics of the reports about an earthquake evolve over time. In the first stage, the topics concern the earthquake occurrence; casualties and rescues are reported in the second stage; and finally reconstruction becomes the main topic. We use the reports about the earthquake as the input data, which arrive incrementally. Our target is to learn how the topics change over time. After the topics of the current data have been learned, the newly arriving data may be semantically different from them. Therefore, we should give larger weights to the newly arriving data, so that the new topics can be highlighted. Finally, the detected topics can reflect the news evolution process.

The major contributions of our approach are as follows.

First, we add a balance factor to reflect the incremental learning. In this way, we can clearly learn the topics when applying topic models on short text collections.

Second, the balance factor is measured by the similarity between the old and new knowledge instead of being specified as a fixed value, so the incremental modeling is a self-organizing process.

Finally, we use real-world data to conduct extensive experiments evaluating the effectiveness of the proposed algorithm.

The rest of the paper is organized as follows. We first introduce the preliminary knowledge in Sect. 2. In Sect. 3 we present the related work. The weighted incremental PLSA is discussed in detail in Sect. 4. Section 5 shows the experimental results. Finally, we draw a conclusion and discuss future research plans in Sect. 6.

Before we continue, some notations used in this paper are listed in Table 1.

2 Preliminaries

In order to describe our algorithm clearly, we introduce the PLSA algorithm in this section.

The basic idea of PLSA is to treat the words in each document as observations from a mixture model where the component models are the topic word distributions. The selection of different components is controlled by a set of mixing weights. Words in the same document share the same mixing weights.

For a text collection D = {d_1, …, d_N}, each occurrence of a word w belongs to the vocabulary W = {w_1, …, w_M}. Suppose there are K topics in total; the topic composition of a document d is given by the distribution (P(z_1|d), P(z_2|d), …, P(z_K|d)) with Σ_{k=1}^{K} P(z_k|d) = 1. In other words, each document may belong to different topics. Every topic z is represented by a multinomial distribution over the vocabulary. For example, if words such as "basketball" and "football" occur with a high probability, the topic can be interpreted as being about "sports". We use P(d_i) to denote the probability that a particular document d_i will be observed, P(w_j|z_k) denotes the class-conditional probability of a specific word conditioned on the latent class variable z_k, and P(z_k|d_i) signifies a document-specific probability distribution over the latent variable space. Each word w_j in document d_i can be generated as follows. First, select a document d_i with probability P(d_i). Second, pick a latent class z_k with probability P(z_k|d_i). Finally, generate a word w_j with probability P(w_j|z_k). Figure 1 shows the graphical model.
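The three-step generative process above can be sketched as follows. This is an illustrative simulation only; the toy distributions (`doc_probs`, `topic_given_doc`, `word_given_topic`) are hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy distributions (hypothetical, for illustration only):
# P(d) over 2 documents, P(z|d) over 3 topics, P(w|z) over 4 words.
doc_probs = np.array([0.5, 0.5])                     # P(d_i)
topic_given_doc = np.array([[0.7, 0.2, 0.1],         # P(z_k | d_i), one row per document
                            [0.1, 0.3, 0.6]])
word_given_topic = np.array([[0.4, 0.3, 0.2, 0.1],   # P(w_j | z_k), one row per topic
                             [0.1, 0.4, 0.4, 0.1],
                             [0.25, 0.25, 0.25, 0.25]])

def generate_word():
    """One draw from the PLSA generative process: document, then topic, then word."""
    d = rng.choice(len(doc_probs), p=doc_probs)                       # select d_i with P(d_i)
    z = rng.choice(topic_given_doc.shape[1], p=topic_given_doc[d])    # pick z_k with P(z_k|d_i)
    w = rng.choice(word_given_topic.shape[1], p=word_given_topic[z])  # generate w_j with P(w_j|z_k)
    return d, z, w

d, z, w = generate_word()
```

Repeating `generate_word` many times produces the document-word co-occurrence counts n(d_i, w_j) that the EM procedure below takes as input.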

The standard procedure for maximum likelihood estimation in PLSA is the Expectation Maximization (EM) algorithm [21]. According to the EM algorithm and the PLSA model, in the E-step, P(z|d, w) is updated by Eq. (1). It is the probability that a word w in a particular document d is explained by the topic corresponding to z. In the M-step, we update P(w|z) and P(z|d) by Eqs. (2) and (3) respectively.

P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k) P(z_k \mid d_i)}{\sum_{l=1}^{K} P(w_j \mid z_l) P(z_l \mid d_i)}    (1)

P(w_j \mid z_k) = \frac{\sum_{i=1}^{N} n(d_i, w_j) P(z_k \mid d_i, w_j)}{\sum_{m=1}^{M} \sum_{i=1}^{N} n(d_i, w_m) P(z_k \mid d_i, w_m)}    (2)

P(z_k \mid d_i) = \frac{\sum_{j=1}^{M} n(d_i, w_j) P(z_k \mid d_i, w_j)}{\sum_{l=1}^{K} \sum_{j=1}^{M} n(d_i, w_j) P(z_l \mid d_i, w_j)}    (3)
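The E-step of Eq. (1) and the M-step of Eqs. (2) and (3) can be sketched in NumPy as below. This is a minimal sketch under our own variable naming; the random initialization and iteration count are illustrative, not prescribed by the paper.

```python
import numpy as np

def plsa_em(n_dw, K, iters=50, seed=0):
    """Vanilla PLSA via EM. n_dw: (N, M) document-word count matrix."""
    rng = np.random.default_rng(seed)
    N, M = n_dw.shape
    # Random, row-normalized initial parameters.
    p_w_z = rng.random((K, M)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)  # P(w_j|z_k)
    p_z_d = rng.random((N, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)  # P(z_k|d_i)
    for _ in range(iters):
        # E-step, Eq. (1): P(z_k|d_i,w_j) ∝ P(w_j|z_k) P(z_k|d_i), normalized over k.
        p_z_dw = p_z_d[:, :, None] * p_w_z[None, :, :]          # shape (N, K, M)
        p_z_dw /= p_z_dw.sum(axis=1, keepdims=True) + 1e-12
        # M-step, Eqs. (2) and (3): reweight by the counts and renormalize.
        weighted = n_dw[:, None, :] * p_z_dw                    # n(d_i,w_j) P(z_k|d_i,w_j)
        p_w_z = weighted.sum(axis=0)                            # numerator of Eq. (2)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=2)                            # numerator of Eq. (3)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_z, p_z_d
```

Each M-step denominator is just the sum of the corresponding numerator over its free index, so the returned rows of `p_w_z` and `p_z_d` are proper distributions.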

3 Related work

In this section, we introduce the main incremental extensions of PLSA and highlight MAP PLSA and QB PLSA. Due to the variability of increasing data, it is necessary to discover the dynamic topics and process the large data set incrementally. Moreover, PLSA models suffer from the problem of inferring new documents. To overcome these problems, new words and documents can be incorporated into an existing system to update a PLSA model, and different updating methods are utilized for model learning [27]. There are several noteworthy related works; here, we give a brief introduction.

Hofmann proposed the "fold-in" update scheme in [29]. The incremental strategy was to update P(z|d) in the model while keeping P(w|z) fixed. A similar "fold-in" approach was also used in [30], where the authors proposed incrementally Built Aspect Models (BAMs) to dynamically discover new topics from document streams. BAMs were probabilistic models designed to accommodate new topics via a spectral algorithm. This approach retained all the conditional probabilities of the old words, given the old latent variables, and the spectral step was used to estimate the probabilities of

Table 1 The notation convention for parameter estimation

Notations                              Explanations
D = {d_1, d_2, …, d_N}                 Training/adaptation data with N documents
W = {w_1, w_2, …, w_M}                 Vocabulary with M words
K                                      The number of topics
θ = {P(w_j|z_k), P(z_k|d_i)}           PLSA parameter set with latent variable z_k in Z = {z_1, …, z_K}
φ = {α_{j,k}, β_{k,i}}                 Hyperparameters of the PLSA parameters P(w_j|z_k) and P(z_k|d_i)
P(z_k|d_i, w_j)                        Posterior probability of latent variable z_k generating document d_i and word w_j
n(d_i, w_j)                            Occurrences of word w_j in document d_i
n(d_i)                                 Total occurrences of {w_1, …, w_M} in document d_i
R(θ̂|θ)                                 Log posterior probability with current estimate θ and new estimate θ̂
γ                                      Balance factor between new documents and old documents
D^n = {D_1, …, D_n}                    Sequence of adaptation documents
{D_0, D_1, …, D_n}                     Sequence of input documents, including training ones and adaptation ones
φ^(n) = {α^(n)_{j,k}, β^(n)_{k,i}}     Hyperparameters of the PLSA parameters P(w_j|z_k) and P(z_k|d_i) at the nth epoch
V                                      The vector representing the document set which has N documents
v_i                                    The ith component of V
P(w_{d_i})                             The probability of generating the words in document d_i

Fig. 1 The graphic model of PLSA


new words and new documents. The new model became the starting point for discovering subsequent new themes so that the latent variables in the BAM model could be grown incrementally.

Tzu-Chuan Chou et al. [28] proposed the Incremental Probabilistic Latent Semantic Indexing (IPLSI) algorithm to address the problem of online event detection. It alleviated the threshold-dependency problem and simultaneously maintained the continuity of the latent semantics to better capture the story line development of events.

For the recommendation application, an incremental recommendation algorithm based on PLSA was proposed in [27] for automatic question recommendation. The proposed algorithm considers not only the users' long-term and short-term interests, but also users' negative and positive feedback. Wu and Wang [31] designed an incremental learning scheme for Triadic PLSA (TPLSA) for the collaborative filtering task that can make both forced prediction and free prediction.

The most relevant work to our method was proposed in [26], where the authors presented a Bayesian PLSA framework. An incremental PLSA algorithm was constructed to accomplish the parameter estimation as well as the hyperparameter updating. The maximum a posteriori PLSA (MAP PLSA) and quasi-Bayes PLSA (QB PLSA) were used for corrective training and incremental learning respectively. In MAP PLSA, Dirichlet priors and their hyperparameters α, β were adopted to characterize the variations of topic-dependent word and document probabilities, and MAP PLSA involved both word-level P(w|z) and document-level P(z|d) parameters. Figure 2 shows the graphical model of MAP PLSA.

Fig. 2 The graphic model for MAP PLSA

Here, M was the size of the word set and N was the number of documents. The EM algorithm was used to train this model. In the M-step, P_MAP(w_j|z_k) and P_MAP(z_k|d_i) were updated by Eqs. (4) and (5).

P_{MAP}(w_j \mid z_k) = \frac{\sum_{i=1}^{N} n(d_i, w_j) P(z_k \mid d_i, w_j) + (\alpha_{j,k} - 1)}{\sum_{m=1}^{M} \left[ \sum_{i=1}^{N} n(d_i, w_m) P(z_k \mid d_i, w_m) + (\alpha_{j,m} - 1) \right]}    (4)

P_{MAP}(z_k \mid d_i) = \frac{\sum_{j=1}^{M} n(d_i, w_j) P(z_k \mid d_i, w_j) + (\beta_{k,i} - 1)}{n(d_i) + \sum_{l=1}^{K} (\beta_{l,i} - 1)}    (5)

Since no incremental learning mechanism was designed in MAP PLSA for accumulating statistics for PLSA adaptation, QB estimation was developed for incremental learning. Suppose that the sequence of adaptation documents was D^n = {D_1, D_2, …, D_n}. At each epoch, the parameters were updated from the current block of documents D_n = {d_i^(n), w_j^(n)} and the accumulated statistics φ^(n−1) = {α^(n−1)_{j,k}, β^(n−1)_{k,i}}. The updating equations were as follows.

\alpha^{(n)}_{j,k} = \sum_{i=1}^{N_n} n(d_i^{(n)}, w_j^{(n)}) P^{(n)}(z_k \mid d_i^{(n)}, w_j^{(n)}) + \alpha^{(n-1)}_{j,k}    (6)

\beta^{(n)}_{k,i} = \sum_{j=1}^{M} n(d_i^{(n)}, w_j^{(n)}) P^{(n)}(z_k \mid d_i^{(n)}, w_j^{(n)}) + \beta^{(n-1)}_{k,i}    (7)

The initial hyperparameters were estimated from the original training data by

\alpha^{(0)}_{j,k} = 1 + \sum_{i=1}^{N} n(d_i, w_j) P(z_k \mid d_i, w_j)    (8)

\beta^{(0)}_{k,i} = 1 + \sum_{j=1}^{M} n(d_i, w_j) P(z_k \mid d_i, w_j)    (9)

Here, the posterior probability P(z_k|d_i, w_j) was calculated by substituting Eqs. (2) and (3) into (1).

However, the relationship between the old and new documents is not considered in this incremental processing.

4 Weighted incremental PLSA

In this section, we present our method in detail. First, we introduce the incremental mechanism of our algorithm, in which a weight measures the contributions of the past accumulated knowledge and the currently observed data. Then we introduce the method for computing this weight. Finally, we summarize the algorithm.

4.1 The incremental mechanism

In MAP PLSA, the parameter updating combines the previous hyperparameters with the statistics related to the current documents. This means that the current parameters

are correlated to both the past accumulated knowledge and the currently observed data. Furthermore, the contributions of the two parts to model training are treated as equal. This is unreasonable in some cases. On the one hand, when the new documents are semantically similar to the old ones, that is, they share a similar latent topic distribution, the contribution of the new documents to model training is small. On the other hand, if the new documents are totally different from the old ones, they bring new knowledge into the system and deserve more attention. So the ratio between the previous hyperparameters and the statistics related to the new documents should not be fixed during parameter updating. To address this problem, we introduce a balance factor to reflect this situation: different weights represent the different contributions of the old and new documents. We call the resulting method weighted incremental PLSA (WIPLSA). It offers the user the ability to control, through the balance factor, the tradeoff between past accumulated knowledge and currently observed data. Suppose the input document sequence is {D_0, D_1, …, D_n}; D_0 is used to train the original model, and the remaining blocks are used for incremental updating.

We calculate the WIPLSA parameters by maximizing the posterior probability P(θ|D). Since it is intractable to compute directly, we maximize the sum of the logarithms of the likelihood P(D|θ) and the prior density g(θ):

\theta_{WIPLSA} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} \left[ \log P(D \mid \theta) + \log g(\theta) \right]    (10)

Here, D signifies the adaptation documents and θ is the parameter set {P(z_k|d_i), P(w_j|z_k)}; we use a Dirichlet density as the conjugate prior. Assume

g(\theta) \propto \prod_{k=1}^{K} \left[ \prod_{j=1}^{M} P(w_j \mid z_k)^{\alpha_{j,k}-1} \prod_{i=1}^{N} P(z_k \mid d_i)^{\beta_{k,i}-1} \right]^{\gamma}    (11)

where {α_{j,k}, β_{k,i}} are the hyperparameters of the Dirichlet densities. The EM algorithm is used to estimate the parameters. In the E-step, the posterior expectation function R(θ̂|θ) is calculated. In the M-step, we maximize it with respect to θ̂ to obtain new estimates; this maximization is done under the constraints Σ_{j=1}^{M} P(w_j|z_k) = 1 and Σ_{k=1}^{K} P(z_k|d_i) = 1. We expand the expectation function as

\tilde{R}(\hat{\theta} \mid \theta) \propto \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \sum_{k=1}^{K} P(z_k \mid d_i, w_j) \log [\hat{P}(w_j \mid z_k) \hat{P}(z_k \mid d_i)]
  + \sum_{j=1}^{M} \sum_{k=1}^{K} \gamma(\alpha_{j,k} - 1) \log \hat{P}(w_j \mid z_k) + \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma(\beta_{k,i} - 1) \log \hat{P}(z_k \mid d_i)
  + \eta_w \left( 1 - \sum_{j=1}^{M} \hat{P}(w_j \mid z_k) \right) + \eta_d \left( 1 - \sum_{i=1}^{N} \hat{P}(z_k \mid d_i) \right)
= \sum_{j=1}^{M} \sum_{k=1}^{K} \left[ \left( \sum_{i=1}^{N} n(d_i, w_j) P(z_k \mid d_i, w_j) + \gamma(\alpha_{j,k} - 1) \right) \log \hat{P}(w_j \mid z_k) \right]
  + \sum_{i=1}^{N} \sum_{k=1}^{K} \left[ \left( \sum_{j=1}^{M} n(d_i, w_j) P(z_k \mid d_i, w_j) + \gamma(\beta_{k,i} - 1) \right) \log \hat{P}(z_k \mid d_i) \right]
  + \eta_w \left( 1 - \sum_{j=1}^{M} \hat{P}(w_j \mid z_k) \right) + \eta_d \left( 1 - \sum_{i=1}^{N} \hat{P}(z_k \mid d_i) \right)    (12)

where η_d and η_w are Lagrange multipliers. We differentiate R̃(θ̂|θ) with respect to P̂(w_j|z_k) and P̂(z_k|d_i) to obtain the new WIPLSA estimates

P_{WIPLSA}(w_j \mid z_k) = \frac{\sum_{i=1}^{N} n(d_i, w_j) P(z_k \mid d_i, w_j) + \gamma(\alpha_{j,k} - 1)}{\sum_{m=1}^{M} \left[ \sum_{i=1}^{N} n(d_i, w_m) P(z_k \mid d_i, w_m) + \gamma(\alpha_{j,m} - 1) \right]}    (13)

P_{WIPLSA}(z_k \mid d_i) = \frac{\sum_{j=1}^{M} n(d_i, w_j) P(z_k \mid d_i, w_j) + \gamma(\beta_{k,i} - 1)}{n(d_i) + \gamma \sum_{l=1}^{K} (\beta_{l,i} - 1)}    (14)

Similarly, we obtain the updating mechanism of the hyperparameters, given by

\alpha^{(n)}_{j,k} = \frac{1}{\gamma} \sum_{i=1}^{N_n} n(d_i^{(n)}, w_j^{(n)}) P^{(n)}(z_k \mid d_i^{(n)}, w_j^{(n)}) + \alpha^{(n-1)}_{j,k}    (15)

\beta^{(n)}_{k,i} = \frac{1}{\gamma} \sum_{j=1}^{M} n(d_i^{(n)}, w_j^{(n)}) P^{(n)}(z_k \mid d_i^{(n)}, w_j^{(n)}) + \beta^{(n-1)}_{k,i}    (16)

We can see that the new hyperparameters α^(n)_{j,k} and β^(n)_{k,i} are determined by the previous ones, together with the current statistics related to w_j^(n) and d_i^(n). The initial hyperparameters can be estimated according to Eqs. (8) and (9). Equations (13)–(16) constitute the M-step; in the E-step, P(z_k|d_i, w_j) is updated as in standard PLSA, i.e.

P(z_k \mid d_i, w_j) = \frac{P_{WIPLSA}(w_j \mid z_k) P_{WIPLSA}(z_k \mid d_i)}{\sum_{l=1}^{K} P_{WIPLSA}(w_j \mid z_l) P_{WIPLSA}(z_l \mid d_i)}    (17)

4.2 The weight measure

As mentioned in Sect. 4.1, 1/γ is used as the balance factor between the new data statistics and the prior. Suppose D_old and D_new are the existing documents and the new documents respectively. If D_new is similar to D_old, the topic distributions will change little; that is to say, the contribution of D_new to the model updating is small, and the weight of D_new should be small. Conversely, if D_new is totally different from D_old, the topic distributions will change significantly, D_new will make a great contribution to the model updating, and the weight of D_new should be large. Let Sim(X, Y) be the similarity between collections X and Y; we set

\gamma = \frac{Sim(D_{new}, D_{old})}{1 - Sim(D_{new}, D_{old})}    (18)

Here, cosine similarity is chosen as the similarity measure. Since D_new and D_old both represent multiple documents, we collapse each document set into a single vector to measure the similarity. Let V be the vector representing a document set D with N documents. Each component v_i is defined as

v_i = \sum_{j=1}^{N} n(d_j, v_i)    (19)

where n(d_j, v_i) signifies the occurrences of word v_i in document d_j, one of the documents in D. Hence, Sim(D_new, D_old) is computed by

Sim(D_{new}, D_{old}) = \cos(V_{new}, V_{old}) = \frac{V_{new} \cdot V_{old}}{|V_{new}| \times |V_{old}|}    (20)

We assume that the input document stream is {D_0, D_1, …, D_n}; D_0 is used to train the model and the remaining blocks are used for updating. We can obtain P(z), P(w|z) and P(z|d) by the EM algorithm. P(w|z) reflects the probability that a word belongs to a topic, and P(z|d) represents the probability that a document belongs to a topic. Our aim is to find the words that reflect the latent topics, so we need to find the topic that each word belongs to and the related words under each topic.

The WIPLSA algorithm is summarized in Algorithm 1.

Algorithm 1 WIPLSA
Input: The document collection D = {D_0, D_1, …, D_n}, the topic number TNum.
Output: P_WIPLSA(z_k|d_i), P_WIPLSA(w_j|z_k).
1.  Initialize TNum;
2.  Estimate initial hyperparameters α^(0)_{j,k} and β^(0)_{k,i} from the training documents D_0;
3.  while the EM termination condition for training is not satisfied
4.    for each D_i in {D_1, …, D_n}
5.      compute the balance parameter γ according to Eq. (18);
6.      update α^(n)_{j,k} and β^(n)_{k,i} according to Eqs. (15) and (16);
7.      update P_WIPLSA(w_j|z_k) and P_WIPLSA(z_k|d_i) according to Eqs. (13) and (14);
8.    end for
9.  end while
10. output P_WIPLSA(z_k|d_i) and P_WIPLSA(w_j|z_k);
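A minimal sketch of one WIPLSA adaptation epoch follows, combining the balance factor of Eqs. (18)–(20) with the weighted updates of Eqs. (13)–(16). All helper names are our own; in particular, the 1/γ scaling of the accumulated statistics follows our reading of Eqs. (15) and (16) and should be treated as an assumption.

```python
import numpy as np

def doc_set_vector(n_dw):
    """Eq. (19): collapse a document set's (N, M) count matrix into one M-vector."""
    return n_dw.sum(axis=0)

def balance_factor(n_dw_new, n_dw_old, eps=1e-12):
    """Eqs. (18) and (20): gamma from the cosine similarity of the set vectors."""
    v_new, v_old = doc_set_vector(n_dw_new), doc_set_vector(n_dw_old)
    sim = v_new @ v_old / (np.linalg.norm(v_new) * np.linalg.norm(v_old) + eps)
    return sim / (1.0 - sim + eps)

def wiplsa_epoch(n_dw, alpha, beta, p_z_dw, gamma):
    """One M-step of WIPLSA, Eqs. (13)-(16).
    n_dw: (N, M) counts of the new block; alpha: (M, K); beta: (K, N);
    p_z_dw: (N, K, M) posterior P(z_k|d_i,w_j) from the E-step; gamma: balance factor."""
    weighted = n_dw[:, None, :] * p_z_dw                     # n(d_i,w_j) P(z_k|d_i,w_j)
    # Eq. (13): word-topic parameters, with the prior term scaled by gamma.
    num_w = weighted.sum(axis=0).T + gamma * (alpha - 1.0)   # (M, K)
    p_w_z = (num_w / num_w.sum(axis=0, keepdims=True)).T     # (K, M), rows sum to 1
    # Eq. (14): topic-document parameters; summing over k recovers the denominator.
    num_d = weighted.sum(axis=2).T + gamma * (beta - 1.0)    # (K, N)
    p_z_d = (num_d / num_d.sum(axis=0, keepdims=True)).T     # (N, K), rows sum to 1
    # Eqs. (15)-(16): fold the new-block statistics, scaled by 1/gamma, into the priors.
    alpha_new = alpha + weighted.sum(axis=0).T / gamma
    beta_new = beta + weighted.sum(axis=2).T / gamma
    return p_w_z, p_z_d, alpha_new, beta_new
```

A similar new block (cosine similarity near 1) yields a large γ, so the prior dominates and the accumulated statistics barely move, which is exactly the self-organizing behavior described in Sect. 4.2.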

5 Experimental analysis

In this section, we show how the method was evaluated. First we describe the datasets, and then we introduce the performance metrics. Furthermore, we show the performance on dynamic topic discovery and large-dataset processing. Finally, we compare the different algorithms in the application of document categorization.

5.1 The datasets description

WIPLSA aims to dynamically discover topics and incrementally learn from new documents. In order to examine the effects of time and topic evolution on model adaptation, we choose data that change over time as the training and adaptation documents.

Considering that significant events tend to have their own course of development, with different topics presented in different time periods, events from Netease news1 were crawled for the experiments.

Netease summarizes the related reports about various important events; the "Yaan earthquake" and the "Asiana Airlines crash in Los Angeles" were chosen to form two datasets, named Yaan (800 documents) and Asiana (200 documents) respectively. In order to identify the topics accurately, we adopt the "title" and "description" fields, and each web page is treated as a document. After crawling, a series of preprocessing steps is applied, including word segmentation [32], word frequency statistics, dictionary building, vector representation and sorting the data by webpage time.

For the Yaan dataset, we select the first 40% of the documents (N = 320) for training and the remainder for incremental updating. We know that in an earthquake event, the occurrence is reported first, then the rescue, and finally the reconstruction; usually, the proportion of rescue news is larger than that of other news. So we examine the topic evolution by adding a different number of adaptation documents (N = 100, N = 280, N = 100) at three learning epochs.

1 http://news.163.com/special/.

For the Asiana dataset, we similarly use 80 documents for training and 120 adaptation documents. The documents span the period from 2013-07-09 08:30 to 2013-08-06 14:15. In order to reflect the topic evolution more realistically, we divide the adaptation documents into five learning epochs by observing the contents in each time period.

Furthermore, in order to evaluate the model performance on large datasets, we use a subset of the TREC AP corpus containing 16,333 documents with 23,075 unique terms as the original dataset. We replicate it 100 times to obtain a large dataset for the perplexity evaluation. One-third of the adaptation documents is used at each epoch. Table 2 shows the three datasets.

Finally, we evaluate the proposed method for document categorization using the TDT2 dataset. The TDT2 corpus (NIST Topic Detection and Tracking corpus) consists of data collected during the first half of 1998 from six sources, including two newswires (APW, NYT), two radio programs (VOA, PRI) and two television programs (CNN, ABC). It contains 11,201 on-topic documents classified into 96 semantic categories. Here, we use a subset of the original TDT2 corpus: documents appearing in two or more categories were removed and only the largest five categories were kept, leaving 6146 documents in total. 70% of them are used as training documents and 30% for testing. The training documents were further divided into subsets for model training and adaptation as given in Table 3: the training samples of each category were roughly partitioned into half for training and the other half for adaptation, and one-third of the adaptation documents is used at each epoch.

5.2 The performance metrics

When we train the models, the documents in the corpora are treated as unlabeled, and we wish to achieve high likelihood on a held-out test set. To evaluate the models, we compute the perplexity of a held-out test set. The perplexity measures the average word branching factor of a document model. It is monotonically decreasing in the likelihood of the test data, and is algebraically equivalent to the inverse of the geometric mean per-word likelihood. A lower perplexity score therefore indicates better generalization performance: the lower the perplexity, the smaller the source uncertainty and the better the modeling. For a test set of N documents, the perplexity is defined as follows [22].

where, M is the total number of words.For the model performance evaluation in dealing with

large dataset, we use perplexity as the performance met-rics. And the classification error rate is used to evaluate the model performance in document categorization.
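As a minimal sketch (not the authors' code), the perplexity of Eq. (21) can be computed directly from per-document log-likelihoods and word counts:

```python
import numpy as np

def perplexity(doc_log_likelihoods, doc_lengths):
    """Perplexity as in Eq. (21): exp(-sum_i log P(w_{d_i}) / sum_i n(d_i)).

    doc_log_likelihoods: log P(w_{d_i}) for each test document d_i
    doc_lengths: word count n(d_i) of each test document
    """
    total_ll = float(np.sum(doc_log_likelihoods))
    total_words = float(np.sum(doc_lengths))
    return float(np.exp(-total_ll / total_words))

# Toy check: two documents of 10 words, per-word likelihood 0.1
# -> perplexity equals 1 / 0.1 = 10
ll = [10 * np.log(0.1), 10 * np.log(0.1)]
print(perplexity(ll, [10, 10]))  # 10.0 (up to floating point)
```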

5.3 The dynamic topic discovering

The Yaan dataset records the earthquake occurrence, rescue and post-earthquake reconstruction news. We set the topic number K = 10, and for each epoch we list the topic with the most documents in Fig. 3. The displayed word stems are the ten most probable words in the topic, from left to right in descending order.

$$\mathrm{perplexity}(D_{test}) = \exp\left\{ -\frac{\sum_{i=1}^{N} \log P(\mathbf{w}_{d_i})}{\sum_{i=1}^{N} n(d_i)} \right\} \tag{21}$$

$$P(\mathbf{w}) = \sum_{z} P(z) \prod_{n=1}^{M} P(w_n \mid z) \tag{22}$$

where M is the total number of words.

Table 2 The document numbers for Yaan, Asiana and TREC datasets

Datasets   Training set   Number of incremental updating
                          1st epoch   2nd epoch   3rd epoch   4th epoch   5th epoch
Yaan       320            100         280         100         –           –
Asiana     80             15          24          24          27          30
TREC AP    653,320        326,660     326,660     326,660     –           –

Table 3 The document numbers for TDT2 dataset

                                            Class1   Class2   Class3   Class4   Class5
Number of training documents                645      640      428      284      154
Number of incremental documents per epoch   215      213      142      94       51
Number of test documents                    554      549      368      245      134


We can see that the balance factor introduced into the incremental learning stage of WIPLSA acts as a kind of knowledge guiding the topic evolution, so the topics change dynamically as new documents are collected. Topic(1) reflects the earthquake locations, the rescue and the casualties are embodied in Topic(2), and Topic(3) describes the post-disaster reconstruction. WIPLSA can thus dynamically discover topics and incrementally learn from new documents.
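The displayed top-word lists can be read off a learned topic-word distribution P(w|z) as sketched below; the array shapes, function name and toy vocabulary are illustrative assumptions:

```python
import numpy as np

def top_words(p_w_given_z, vocab, n=10):
    """Return the n most probable words for each topic.

    p_w_given_z: array of shape (K, V), each row a P(w|z) distribution
    vocab: list of V word stems
    """
    tops = []
    for row in p_w_given_z:
        idx = np.argsort(row)[::-1][:n]   # indices in descending probability
        tops.append([vocab[i] for i in idx])
    return tops

# Toy example: two topics over a four-word vocabulary
vocab = ["quake", "rescue", "rebuild", "aid"]
p = np.array([[0.5, 0.3, 0.1, 0.1],
              [0.1, 0.1, 0.6, 0.2]])
print(top_words(p, vocab, n=2))  # [['quake', 'rescue'], ['rebuild', 'aid']]
```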

The Asiana dataset records the whole course of the Asiana Airlines crash event. The topic number K is set to 5. In order to compare how the topics change between WIPLSA and QB PLSA, Fig. 4 shows all the topics in each epoch, each displayed by its top-5 stems. By considering the similarity between the old documents and the new ones, the topics obtained by WIPLSA reflect the development of the event more clearly. First, the Asiana crash happened and China announced the list of the safe people. Second, the cause was thoroughly investigated and Asiana Airlines apologized to the Chinese people. Third, a girl was suspected of having been crushed and the accident may have been caused by the pilot. Fourth, the crushed girl's name was identified and lawyers provided legal aid for the victims' families. Finally, the compensation issues were discussed.

To measure the performance of WIPLSA and QB PLSA in each epoch, we use perplexity as the metric. The topic number is fixed to 3 and 5 for the Yaan and Asiana datasets respectively. Figure 5 shows the perplexity comparison on the Yaan and Asiana datasets. We can see that WIPLSA performs better than QB PLSA in each epoch, which indicates that the balance parameter benefits the model's generalization.

5.4 Generalization performance on large dataset

For large-scale datasets, incremental learning provides a feasible solution that reduces the computational cost. In the next experiment, we therefore evaluate the performance in dealing with a large-scale dataset.

We evaluate the perplexities on the TREC AP dataset. We first compare WIPLSA with the PLSA, MAP PLSA and QB PLSA models. The EM algorithm is used to train all the latent variable models with exactly the same stopping criterion: the average change in expected log likelihood is less than 0.001%. MAP PLSA, QB PLSA and WIPLSA are initialized from the standard PLSA. We use 30% of the data for training and 60% for incremental learning, with the remainder serving as test data. Specifically, PLSA is trained using all the training and adaptation documents, and MAP PLSA performs batch adaptation using all adaptation documents. QB PLSA and WIPLSA execute incremental adaptation by adding 20% of the adaptation documents at each learning epoch. Because there is no incremental process in PLSA and MAP PLSA, in order to compare the models clearly, for QB PLSA and WIPLSA we only use the perplexity at the final epoch.
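The stopping rule above can be sketched with a generic EM driver; the callables `e_step`, `m_step` and `log_likelihood` are hypothetical placeholders for the model-specific updates, not the authors' implementation:

```python
def run_em(e_step, m_step, log_likelihood, params, tol=1e-5, max_iter=500):
    """Generic EM driver with the paper's stopping rule: iterate until the
    relative change in expected log likelihood falls below tol
    (0.001% == 1e-5). A sketch, not the authors' implementation."""
    prev_ll = log_likelihood(params)
    for _ in range(max_iter):
        posterior = e_step(params)          # E-step: posterior statistics
        params = m_step(posterior)          # M-step: re-estimate parameters
        ll = log_likelihood(params)
        if abs(ll - prev_ll) / abs(prev_ll) < tol:
            break                           # converged under the 0.001% rule
        prev_ll = ll
    return params
```

For example, with dummy updates that move a scalar parameter halfway toward 2 and a log likelihood peaked at 2, the driver converges to a value near 2.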

The perplexity is examined by changing the number of latent variables K. Figure 6 presents the perplexity for PLSA, MAP PLSA, QB PLSA and WIPLSA on TREC AP for different values of K. It can be seen that, owing to the adaptation stage, MAP PLSA, QB PLSA and WIPLSA do improve the document modeling, and the perplexities are significantly reduced by model adaptation. The perplexities of WIPLSA are smaller than those of the other models. When K > 200, the perplexity decreases at an extremely slow rate as the topic number increases; since the computational cost grows greatly with the number of topics, it is not worth increasing it further, so we choose K = 200 as the optimal value.

Furthermore, we also evaluate the improvement brought by incremental learning. Figure 7 shows the perplexity for WIPLSA at each epoch. We find that the perplexity consistently decreases as more incremental learning epochs are performed, indicating that WIPLSA performs well on large datasets.

5.5 Performance for document categorization

Finally, we compare PLSA, MAP PLSA, QB PLSA and WIPLSA in the application of document categorization, using the TDT2 dataset. For PLSA and MAP PLSA, the training documents are used to train the model and the testing documents are used for performance testing. For QB PLSA and WIPLSA, we use the adaptation documents for incremental learning. When we classify a test document, we first determine the PLSA probability for each test document, and then calculate each class feature vector by averaging the PLSA probabilities over all documents corresponding to the class. Finally, the cosine similarity between the feature vector of a test document and that of a given class model is calculated, and the test document is labeled with the class of maximum similarity. Comparing with the true labels, the classification error rates are calculated.

Fig. 3 The word stems that reflect the topics in Yaan dataset

Fig. 4 The topics comparison between WIPLSA and QB PLSA in Asiana dataset
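The classification procedure above (class vectors as averaged topic probabilities, cosine similarity, maximum-similarity labeling) can be sketched as follows; the shapes and names are illustrative assumptions:

```python
import numpy as np

def classify(test_vecs, class_vecs):
    """Label each test document with the class whose averaged PLSA
    topic-probability vector has maximum cosine similarity to it.

    test_vecs: (n_docs, K) topic probabilities of test documents
    class_vecs: (n_classes, K) per-class mean topic probabilities
    """
    t = test_vecs / np.linalg.norm(test_vecs, axis=1, keepdims=True)
    c = class_vecs / np.linalg.norm(class_vecs, axis=1, keepdims=True)
    sims = t @ c.T                      # cosine similarity matrix
    return np.argmax(sims, axis=1)      # index of the most similar class

# Toy example: two classes concentrated on different topics
classes = np.array([[0.8, 0.1, 0.1],
                    [0.1, 0.1, 0.8]])
docs = np.array([[0.7, 0.2, 0.1],
                 [0.2, 0.1, 0.7]])
print(classify(docs, classes))  # [0 1]
```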

Figure 8 shows the classification error rates of the different methods. It can be seen that MAP PLSA, QB PLSA and WIPLSA outperform PLSA via adaptation.


Fig. 5 Perplexity of WIPLSA and QB PLSA on Yaan and Asiana dataset

Fig. 6 Perplexity results on TREC AP corpora for PLSA, MAP PLSA, QB PLSA and WIPLSA



WIPLSA performs better than MAP PLSA and QB PLSA: at the 3rd epoch, WIPLSA obtains a classification error rate of 2.78%, versus 3.56% for PLSA, 3.15% for MAP PLSA and 3.09% for QB PLSA. The error rate consistently decreases over the incremental learning epochs. WIPLSA fully considers the data characteristics by adding more semantic information into the training process, and therefore shows better performance in the application of document categorization.

Fig. 7 Perplexity results on TREC AP corpora for WIPLSA at each epoch


Fig. 8 Classification error rate results on TDT2 dataset for PLSA, MAP PLSA, QB PLSA and WIPLSA



6 Conclusions

This paper proposes a weighted incremental PLSA (WIPLSA) for solving updating problems in natural language systems. The learning process consists of training and updating; the parameter updating combines the previous hyperparameters with the statistics of the current documents. We introduce the similarity between the old and new knowledge as a weight that can change dynamically during updating. We have experimentally verified the claimed advantages in terms of perplexity evaluation on a large dataset and good performance in the application of document categorization. Future work will deal in greater detail with specific applications as well as with extensions and generalizations of the presented method.

Acknowledgements The work is supported by the National Natural Science Foundation of China (No. 91546122, 61602438, 61573335, 61473273, 61473274, 61363058), National High-tech R&D Program of China (863 Program) (No. 2014AA015105), National Science and Technology Support Program (No. 2014BAK02B07), National major R&D program of Beijing Municipal Science & Technology Commission (Z161100002616032), and Guangdong provincial science and technology plan projects (No. 2015 B 010109005).

References

1. Blei DM (2012) Probabilistic topic models. Commun ACM 55:77–84

2. Yan Y, Chen L, Tjhi W-C (2013) Fuzzy semi-supervised co-clustering for text documents. Fuzzy Sets Syst 215:74–89

3. Shehata S, Karray F, Kamel MS (2013) An efficient concept-based retrieval model for enhancing text retrieval quality. Knowl Inf Syst 1–24

4. Freire A, Cacheda F, Formoso V, Carneiro V (2013) Analysis of performance evaluation techniques for large-scale information retrieval. Analyzing the Performance of Top-K Retrieval Algorithms, invited speaker, p 2001

5. Choo J, Lee C, Clarkson E, Liu Z, Lee H, Chau DHP, Li F, Kannan R, Stolper CD, Inouye D et al (2013) VisIRR: interactive visual information retrieval and recommendation for large-scale document data

6. Mei Q, Zhai C (2001) A note on EM algorithm for probabilistic latent semantic analysis. In: Proceedings of the International Conference on Information and Knowledge Management, CIKM

7. Bai L, Liang J, Dang C, Cao F (2013) A novel fuzzy clustering algorithm with between-cluster information for categorical data. Fuzzy Sets Syst 215:55–73

8. Liu CL, Chang TH, Li HH (2013) Clustering documents with labeled and unlabeled documents using fuzzy semi-kmeans. Fuzzy Sets Syst

9. Hakala K, Van Landeghem S, Salakoski T, Van de Peer Y, Ginter F (2013) Evex in st13: application of a large-scale text mining resource to event extraction and network construction. ACL 2013:26

10. Zhou E, Zhong N, Li Y (2013) Extracting news blog hot topics based on the w2t methodology. World Wide Web, pp 1–28

11. Wang X, Wang J (2013) A method of hot topic detection in blogs using n-gram model. J Softw 8:184–191

12. Steyvers M, Griffiths T (2007) Probabilistic topic models. Handb Latent Semantic Anal 427:424–440

13. Blei DM, Lafferty JD (2006) Dynamic topic models. In: Proceedings of the 23rd international conference on Machine learning. ACM, pp 113–120

14. Wang X, McCallum A (2006) Topics over time: a non-Markov continuous-time model of topical trends. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 424–433

15. Wang C, Blei D, Heckerman D (2012) Continuous time dynamic topic models. arXiv:1206.3298

16. Aggarwal CC, Zhai C (2012) Mining text data. Springer

17. Gruber A, Rosen-Zvi M, Weiss Y (2012) Latent topic models for hypertext. arXiv:1206.3254

18. Bolshakova E, Loukachevitch N, Nokel M (2013) Topic models can improve domain term extraction. In: Advances in Information Retrieval. Springer, pp 684–687

19. Lin C, He Y, Everson R, Ruger S (2012) Weakly supervised joint sentiment-topic detection from text. IEEE Trans Knowl Data Eng 24:1134–1145

20. Deerwester SC, Dumais ST, Landauer TK, Furnas GW, Harshman RA (1990) Indexing by latent semantic analysis. JASIS 41:391–407

21. Hofmann T (1999) Probabilistic latent semantic indexing. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. ACM, pp 50–57

22. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

23. Chaney AJB, Blei DM (2012) Visualizing topic models. In: ICWSM

24. Zhai K, Boyd-Graber J, Asadi N, Alkhouja (2012) Mr. LDA: a flexible large scale topic modeling package using variational inference in MapReduce. In: Proceedings of the 21st international conference on World Wide Web. ACM, pp 879–888

25. Li N, Zhuang F, He Q, Shi Z (2012) PPLSA: parallel probabilistic latent semantic analysis based on MapReduce. In: Intelligent Information Processing VI. Springer, pp 40–49

26. Chien J-T, Wu M-S (2008) Adaptive bayesian latent semantic analysis. IEEE Trans Audio Speech Lang Process 16:198–207

27. Wu H, Wang Y, Cheng X (2008) Incremental probabilistic latent semantic analysis for automatic question recommendation. In: Proceedings of the 2008 ACM conference on Recommender systems. ACM, pp 99–106

28. Chou T-C, Chen MC (2008) Using incremental PLSI for threshold-resilient online event analysis. IEEE Trans Knowl Data Eng 20:289–299

29. Hofmann T (2001) Unsupervised learning by probabilistic latent semantic analysis. Mach Learn 42:177–196

30. Surendran AC, Sra S (2006) Incremental aspect models for mining document streams. In: Knowledge Discovery in Databases: PKDD 2006. Springer, pp 633–640

31. Wu H, Wang Y (2009) Incremental learning of triadic plsa for collaborative filtering. In: Active Media Technology. Springer, pp 81–92

32. Qian Y (2016) Context based approach to overlapping ambiguity resolution in Chinese word segmentation. J Chongqing Technol Bus Univ (Nat Sci Edn) 20–24