Int. J. Mach. Learn. & Cyber. DOI 10.1007/s13042-017-0681-9
ORIGINAL ARTICLE
Self‑organizing weighted incremental probabilistic latent semantic analysis
Ning Li1,2 · Wenjuan Luo2 · Kun Yang3 · Fuzhen Zhuang2 · Qing He2 · Zhongzhi Shi2
Received: 5 February 2016 / Accepted: 10 April 2017 © Springer-Verlag Berlin Heidelberg 2017
Abstract PLSA (Probabilistic Latent Semantic Analysis) is a popular topic modeling technique that has been widely applied in text mining to discover the underlying topics embedded in a data corpus. However, due to the variability of growing data, it is necessary to discover dynamic topics and to process large datasets incrementally. Moreover, PLSA models suffer from the problem of inferring topics for new documents. To overcome these problems, in this paper we propose a novel weighted incremental PLSA algorithm, called WIPLSA, to dynamically discover topics and incrementally learn them from new documents. The experiments verify that the proposed WIPLSA captures the dynamic topics hidden in a dynamically updated data corpus. Compared with PLSA, MAP PLSA and QB PLSA, WIPLSA achieves lower perplexity on large datasets, which makes it applicable to big data mining. In addition, WIPLSA performs well in the application of document categorization.

Keywords Probabilistic latent semantic analysis · Weighted incremental learning · Similarity · Big data

1 Introduction

As our collective knowledge continues to be digitized and stored in the form of news, blogs, web pages, scientific articles, books, images, sound, video, and social networks, it becomes more difficult to find and discover what we are looking for. We need new computational tools to help organize, search and understand these vast amounts of information [1–5].

In many text collections, we encounter the scenario that a document contains multiple topics. Extracting such topics/subtopics/themes from the text collection is important for many text mining tasks [6–11], such as text retrieval, topic detection and so on. The traditional modeling approach is the "bag of words" model, and the VSM (Vector Space Model) is commonly used to represent documents. However, this representation ignores the relationships between words. For example, "actor" and "player" are treated as different words in the "bag of words" model while they actually have similar meanings and arguably belong to a single topic. A variety of probabilistic topic models have recently been proposed to analyze the content of documents and represent the meaning of words [12–15]. They typically factor each document into different topics, and each topic is represented as a distribution over terms. Research has demonstrated that it
* Ning Li [email protected]
Wenjuan Luo [email protected]
Kun Yang [email protected]
Fuzhen Zhuang [email protected]
Qing He [email protected]
Zhongzhi Shi [email protected]
1 Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
2 The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
3 National Institute of Metrology, Beijing 100029, China
can be successfully applied to better understand collections of text documents [16–19].

Topic models started from latent semantic indexing (LSI) [20], which aims to represent documents in a low-dimensional vector space using singular value decomposition. The overwhelming success came with Probabilistic Latent Semantic Analysis (PLSA) [21], a probabilistic implementation of LSI. The main idea is to describe documents in terms of their topic compositions. However, PLSA does not make any assumptions about how the documents are generated, resulting in the overfitting problem. To solve this problem, Blei et al. [22] extended PLSA by introducing a Dirichlet prior, calling the new generative model Latent Dirichlet Allocation (LDA). Over the past decades, there have been numerous proposals for topic models that identify patterns of word occurrences in large collections of documents; these patterns reflect the underlying topics represented in the collection and can then be used to organize, search, index and browse large collections of documents [23–25].
While traditional topic models are computationally expensive to construct, great efforts have been made toward topic modeling in a dynamic way. Incremental learning methods provide a feasible solution to reduce the computational cost and make topic modeling applicable to large-scale datasets. Such incremental learning schemes appear in [26–28]. In particular, an adaptive Bayesian latent semantic analysis algorithm is proposed in [26]: an incremental PLSA algorithm estimates the parameters and updates the incremental hyperparameters, combining the previous hyperparameters with the statistics related to the current documents. However, the two combined parts are treated equally. In many cases, we believe it is necessary to adjust the ratio between the previous hyperparameters and the new statistics. For example, if the newly arriving documents are semantically similar to the old ones, that is, they share similar latent topics, then the contribution of the new documents to the model training is small. Taking the extreme case, suppose there is only one document collected and PLSA is built upon it. If a new document arrives that is identical to the existing one, the model will not change after it is collected, because the word occurrences are the same. That is to say, the new PLSA model is unrelated to the new document; in other words, the new document contributes no new knowledge to the model. In that case, it is unreasonable to weight the new statistics equally in the incremental hyperparameter updating.
In this paper, we are particularly interested in topic analysis for document streams and present a weighted incremental PLSA. Specifically, we focus on extending the adaptive Bayesian latent semantic analysis model in a more reasonable way: adding a balance factor between the previous hyperparameters and the new statistics can definitely improve the topic modeling performance. Take an earthquake as an example: the topics of the reports about an earthquake evolve over time. In the first stage, the topics concern the earthquake occurring; casualties and rescues are reported in the second stage; and finally the reconstruction is the main topic. We use the reports about the earthquake as the input data, which arrive incrementally. Our target is to learn the topics changing over time. When the topics of the current data have been learned, newly arriving data may be semantically different from them. Therefore, we should give larger weights to the newly arriving data so that the new topics are highlighted. Finally, the detected topics can reflect the news evolution process.
The major contributions of our approach are as follows.

First, we add a balance factor to reflect the incremental learning. In this way, we can clearly learn the topics when applying topic models to short text collections.

Second, the balance factor is measured by the similarity between the old and new knowledge instead of being fixed to a specified value, so the incremental modeling is a self-organizing process.

Finally, we use real-world data to conduct extensive experiments evaluating the effectiveness of the proposed algorithm.
The rest of the paper is organized as follows. We first introduce the preliminary knowledge in Sect. 2. In Sect. 3 we present the related work. The weighted incremental PLSA is discussed in detail in Sect. 4. Section 5 shows the experimental results. Finally, we draw a conclusion and discuss future research plans in Sect. 6.
Before we continue, some notations used in this paper are listed in Table 1.
2 Preliminaries
In order to describe our algorithm clearly, we introduce the PLSA algorithm in this section.
The basic idea of PLSA is to treat the words in each document as observations from a mixture model where the component models are the topic word distributions. The selection of different components is controlled by a set of mixing weights. Words in the same document share the same mixing weights.
For a text collection D = {d1, ..., dN}, each occurrence of a word w belongs to the vocabulary W = {w1, ..., wM}. Suppose there are K topics in total; the topic distribution of a document d is (P(z1|d), P(z2|d), ..., P(zK|d)) with $\sum_{k=1}^{K} P(z_k|d) = 1$. In other words, each document may belong to several topics. Every topic z is represented by a multinomial distribution over the vocabulary. For example, if words such as "basketball" and "football" occur with a high
probability, the topic can be interpreted as being about "sports". We use P(di) to denote the probability that a particular document di is observed, P(wj|zk) the class-conditional probability of a specific word given the latent class variable zk, and P(zk|di) a document-specific probability distribution over the latent variable space. Each word wj in document di is generated as follows: first, select a document di with probability P(di); second, pick a latent class zk with probability P(zk|di); finally, generate a word wj with probability P(wj|zk). Figure 1 shows the graphical model.
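The three-step generative process above can be sketched with toy distributions (all probability values below are hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy distributions: 2 documents, 2 topics, 3 words
p_d = np.array([0.5, 0.5])                  # P(d_i)
p_z_d = np.array([[0.8, 0.2],               # P(z_k | d_i), one row per document
                  [0.3, 0.7]])
p_w_z = np.array([[0.6, 0.3, 0.1],          # P(w_j | z_k), one row per topic
                  [0.1, 0.2, 0.7]])

def generate_word():
    i = rng.choice(2, p=p_d)        # 1) select a document d_i
    k = rng.choice(2, p=p_z_d[i])   # 2) pick a latent class z_k
    j = rng.choice(3, p=p_w_z[k])   # 3) generate a word w_j
    return i, j

pairs = [generate_word() for _ in range(5)]
```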
The standard procedure for maximum likelihood estimation in PLSA is the Expectation Maximization (EM) algorithm [21]. According to the EM algorithm and the PLSA model, in the E-step, P(z|d, w), the probability that a word w in a particular document d is explained by topic z, is updated by Eq. (1). In the M-step, we update P(w|z) and P(z|d) by Eqs. (2) and (3), respectively.
$$P(z_k|d_i,w_j) = \frac{P(w_j|z_k)\,P(z_k|d_i)}{\sum_{l=1}^{K} P(w_j|z_l)\,P(z_l|d_i)} \quad(1)$$

$$P(w_j|z_k) = \frac{\sum_{i=1}^{N} n(d_i,w_j)\,P(z_k|d_i,w_j)}{\sum_{m=1}^{M}\sum_{i=1}^{N} n(d_i,w_m)\,P(z_k|d_i,w_m)} \quad(2)$$

$$P(z_k|d_i) = \frac{\sum_{j=1}^{M} n(d_i,w_j)\,P(z_k|d_i,w_j)}{\sum_{l=1}^{K}\sum_{j=1}^{M} n(d_i,w_j)\,P(z_l|d_i,w_j)} \quad(3)$$
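The E- and M-steps of Eqs. (1)–(3) can be sketched in NumPy with dense arrays (a toy illustration with random initialization and a fixed iteration count, not the authors' implementation):

```python
import numpy as np

def plsa_em(counts, K, n_iters=100, seed=0):
    """EM for PLSA on a document-word count matrix (N docs x M words).

    Returns P(w|z) as a (K, M) array and P(z|d) as an (N, K) array.
    """
    rng = np.random.default_rng(seed)
    N, M = counts.shape
    p_w_z = rng.random((K, M)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((N, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        # E-step, Eq. (1): P(z|d,w) proportional to P(w|z) * P(z|d)
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]      # (N, M, K)
        joint /= joint.sum(axis=2, keepdims=True) + 1e-12
        # Weighted counts n(d,w) * P(z|d,w), shared by Eqs. (2) and (3)
        nw = counts[:, :, None] * joint                      # (N, M, K)
        # M-step, Eq. (2): renormalize over words per topic
        p_w_z = nw.sum(axis=0).T                             # (K, M)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        # M-step, Eq. (3): renormalize over topics per document
        p_z_d = nw.sum(axis=1)                               # (N, K)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_z, p_z_d
```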
3 Related work
In this section, we introduce the main incremental extensions of PLSA and highlight MAP PLSA and QB PLSA. Due to the variability of growing data, it is necessary to discover dynamic topics and process large datasets incrementally. Moreover, PLSA models suffer from the problem of inferring topics for new documents. To overcome these problems, new words and documents are incorporated into an existing system to update a PLSA model, and different updating methods are utilized for model learning [27]. There are several noteworthy related works; here, we give a brief introduction.
Hofmann proposed the "fold-in" update scheme in [29]. The incremental strategy was to update P(z|d) in the model while keeping P(w|z) fixed. A similar "fold-in" approach was also used in [30], whose authors proposed incrementally Built Aspect Models (BAMs) to dynamically discover new topics from document streams. BAMs were probabilistic models designed to accommodate new topics with a spectral algorithm. This approach retained all the conditional probabilities of the old words given the old latent variables, and the spectral step was used to estimate the probabilities of
Table 1 The notation convention for parameter estimation

Notation                              Explanation
D = {d1, d2, ..., dN}                 Training/adaptation data with N documents
W = {w1, w2, ..., wM}                 Vocabulary with M words
K                                     The number of topics
θ = {P(wj|zk), P(zk|di)}              PLSA parameter set with latent variable zk in Z = {z1, ..., zK}
ψ = {αj,k, βk,i}                      Hyperparameters of the PLSA parameters P(wj|zk) and P(zk|di)
P(zk|di, wj)                          Posterior probability of latent variable zk generating document di and word wj
n(di, wj)                             Occurrences of word wj in document di
n(di)                                 Total occurrences of {w1, ..., wM} in document di
R(θ̂|θ)                                Log posterior probability with current estimate θ and new estimate θ̂
γ                                     Balance factor between new documents and old documents
𝔻n = {D1, ..., Dn}                    Sequence of adaptation documents
{D0, D1, ..., Dn}                     Sequence of input documents, including training and adaptation ones
ψ(n) = {α(n)j,k, β(n)k,i}             Hyperparameters of the PLSA parameters P(wj|zk) and P(zk|di) at the nth epoch
V                                     The vector representing the document set of N documents
vi                                    The ith component of V
P(w_di)                               The probability of generating the words in document di
Fig. 1 The graphic model of PLSA
new words and new documents. The new model became the starting point for discovering subsequent new themes so that the latent variables in the BAM model could be grown incrementally.
Tzu-Chuan Chou et al. [28] proposed the Incremental Probabilistic Latent Semantic Indexing (IPLSI) algorithm to address the problem of online event detection. It alleviated the threshold-dependency problem and simultaneously maintained the continuity of the latent semantics to better capture the story-line development of events.
For the recommendation application, an incremental recommendation algorithm based on PLSA was proposed in [27] for automatic question recommendation. The proposed algorithm considered not only the users' long-term and short-term interests, but also the users' negative and positive feedback. Wu and Wang [31] designed an incremental learning scheme for Triadic PLSA (TPLSA) for collaborative filtering that could make both forced prediction and free prediction.
The most relevant work to our method was proposed in [26], where the authors presented a Bayesian PLSA framework. An incremental PLSA algorithm was constructed to accomplish the parameter estimation as well as the hyperparameter updating. The maximum a posteriori PLSA (MAP PLSA) and quasi-Bayes PLSA (QB PLSA) were used for corrective training and incremental learning, respectively. In MAP PLSA, Dirichlet priors with hyperparameters α, β were adopted to characterize the variations of topic-dependent word and document probabilities, and MAP PLSA involved both word-level P(w|z) and document-level P(z|d) parameters. Figure 2 shows the graphical model of MAP PLSA.

Here, M is the size of the word set and N is the number of documents. The EM algorithm was used to train this model. In the M-step, PMAP(wj|zk) and PMAP(zk|di) were updated by Eqs. (4) and (5).
$$P_{\mathrm{MAP}}(w_j|z_k) = \frac{\sum_{i=1}^{N} n(d_i,w_j)P(z_k|d_i,w_j) + (\alpha_{j,k}-1)}{\sum_{m=1}^{M}\bigl[\sum_{i=1}^{N} n(d_i,w_m)P(z_k|d_i,w_m) + (\alpha_{m,k}-1)\bigr]} \quad(4)$$
$$P_{\mathrm{MAP}}(z_k|d_i) = \frac{\sum_{j=1}^{M} n(d_i,w_j)P(z_k|d_i,w_j) + (\beta_{k,i}-1)}{n(d_i) + \sum_{l=1}^{K}(\beta_{l,i}-1)} \quad(5)$$

In MAP PLSA, no incremental learning mechanism was designed for accumulating statistics for PLSA adaptation, so quasi-Bayes (QB) estimation was developed for incremental learning. Suppose the sequence of adaptation documents was $\mathbb{D}_n = \{D_1, D_2, \ldots, D_n\}$. At each epoch, the parameters were updated from the current block of documents $D_n = \{d_i^{(n)}, w_j^{(n)}\}$ and the accumulated statistics $\psi^{(n-1)} = \{\alpha_{j,k}^{(n-1)}, \beta_{k,i}^{(n-1)}\}$. The updating equations were as follows:

$$\alpha_{j,k}^{(n)} = \sum_{i=1}^{N_n} n(d_i^{(n)},w_j^{(n)})\,P^{(n)}(z_k|d_i^{(n)},w_j^{(n)}) + \alpha_{j,k}^{(n-1)} \quad(6)$$

$$\beta_{k,i}^{(n)} = \sum_{j=1}^{M} n(d_i^{(n)},w_j^{(n)})\,P^{(n)}(z_k|d_i^{(n)},w_j^{(n)}) + \beta_{k,i}^{(n-1)} \quad(7)$$

The initial hyperparameters were estimated from the original training data by

$$\alpha_{j,k}^{(0)} = 1 + \sum_{i=1}^{N} n(d_i,w_j)\,P(z_k|d_i,w_j) \quad(8)$$

$$\beta_{k,i}^{(0)} = 1 + \sum_{j=1}^{M} n(d_i,w_j)\,P(z_k|d_i,w_j) \quad(9)$$

Here, the posterior probability P(zk|di, wj) was calculated by substituting Eqs. (2) and (3) into (1).

However, the relationship between the old and new documents is not considered in this incremental processing.

Fig. 2 The graphic model for MAP PLSA

4 Weighted incremental PLSA

In this section, we present our method in detail. First, we introduce the incremental mechanism of our algorithm, in which a weight measures the contributions of the past accumulated knowledge and the current observed data. Then we introduce how this weight is computed. Finally, we summarize the algorithm.

4.1 The incremental mechanism

In MAP PLSA, the parameter updating combines the previous hyperparameters with the statistics related to the current documents. That is, the current parameters
are correlated to the past accumulated knowledge and the current observed data. Furthermore, the contributions of the two parts to the model training are considered to be equal, which is unreasonable in some cases. On the one hand, when the new documents are semantically similar to the old ones, that is, they share a similar latent topic distribution, the contribution of the new documents to the model training is small. On the other hand, if the new documents are totally different from the old ones, they bring new knowledge into the system and deserve more attention. Therefore, the ratio between the previous hyperparameters and the statistics related to the new documents should not be fixed during parameter updating. In order to address this problem, we introduce a balance factor to reflect this situation, that is, different weights represent the different contributions of the old and new documents, and we propose the weighted incremental PLSA, called WIPLSA. It offers the user the ability to control, through the balance factor, the tradeoff between the past accumulated knowledge and the current observed data. Suppose the input document sequence is {D0, D1, ..., Dn}; D0 is used to train the original model, and the remaining blocks are used for incremental updating.

We calculate the WIPLSA parameters by maximizing the posterior probability P(θ|D). Since it is intractable to compute directly, we maximize the sum of the logarithms of the likelihood function P(D|θ) and the prior density g(θ):

$$\theta_{\mathrm{WIPLSA}} = \arg\max_{\theta} P(\theta|D) = \arg\max_{\theta}\,\bigl[\log P(D|\theta) + \log g(\theta)\bigr] \quad(10)$$

Here, D signifies the adaptation documents and θ = {P(zk|di), P(wj|zk)} is the parameter set; we use the Dirichlet density as the conjugate prior. Assume

$$g(\theta) \propto \prod_{k=1}^{K}\left[\prod_{j=1}^{M} P(w_j|z_k)^{\alpha_{j,k}-1}\prod_{i=1}^{N} P(z_k|d_i)^{\beta_{k,i}-1}\right]^{\gamma} \quad(11)$$

where {αj,k, βk,i} are the hyperparameters of the Dirichlet densities. The EM algorithm is used to estimate the parameters. In the E-step, the posterior expectation function R(θ̂|θ) is calculated; in the M-step, we maximize it with respect to θ̂ to get new estimates, under the constraints $\sum_{j=1}^{M} P(w_j|z_k) = 1$ and $\sum_{k=1}^{K} P(z_k|d_i) = 1$. We expand the expectation function as

$$\begin{aligned}\tilde{R}(\hat{\theta}|\theta) \propto{}& \sum_{i=1}^{N}\sum_{j=1}^{M} n(d_i,w_j)\sum_{k=1}^{K} P(z_k|d_i,w_j)\log\bigl[\hat{P}(w_j|z_k)\hat{P}(z_k|d_i)\bigr] \\ &+ \sum_{j=1}^{M}\sum_{k=1}^{K}\gamma(\alpha_{j,k}-1)\log\hat{P}(w_j|z_k) + \sum_{i=1}^{N}\sum_{k=1}^{K}\gamma(\beta_{k,i}-1)\log\hat{P}(z_k|d_i) \\ &+ \eta_w\Bigl(1-\sum_{j=1}^{M}\hat{P}(w_j|z_k)\Bigr) + \eta_d\Bigl(1-\sum_{i=1}^{N}\hat{P}(z_k|d_i)\Bigr) \\ ={}& \sum_{j=1}^{M}\sum_{k=1}^{K}\Bigl[\Bigl(\sum_{i=1}^{N} n(d_i,w_j)P(z_k|d_i,w_j)+\gamma(\alpha_{j,k}-1)\Bigr)\log\hat{P}(w_j|z_k)\Bigr] \\ &+ \sum_{i=1}^{N}\sum_{k=1}^{K}\Bigl[\Bigl(\sum_{j=1}^{M} n(d_i,w_j)P(z_k|d_i,w_j)+\gamma(\beta_{k,i}-1)\Bigr)\log\hat{P}(z_k|d_i)\Bigr] \\ &+ \eta_w\Bigl(1-\sum_{j=1}^{M}\hat{P}(w_j|z_k)\Bigr) + \eta_d\Bigl(1-\sum_{i=1}^{N}\hat{P}(z_k|d_i)\Bigr)\end{aligned} \quad(12)$$

where ηd and ηw are Lagrange multipliers. Differentiating R̃(θ̂|θ) with respect to P̂(wj|zk) and P̂(zk|di) yields the new WIPLSA estimates

$$P_{\mathrm{WIPLSA}}(w_j|z_k) = \frac{\sum_{i=1}^{N} n(d_i,w_j)P(z_k|d_i,w_j) + \gamma(\alpha_{j,k}-1)}{\sum_{m=1}^{M}\bigl[\sum_{i=1}^{N} n(d_i,w_m)P(z_k|d_i,w_m) + \gamma(\alpha_{m,k}-1)\bigr]} \quad(13)$$

$$P_{\mathrm{WIPLSA}}(z_k|d_i) = \frac{\sum_{j=1}^{M} n(d_i,w_j)P(z_k|d_i,w_j) + \gamma(\beta_{k,i}-1)}{n(d_i) + \gamma\sum_{l=1}^{K}(\beta_{l,i}-1)} \quad(14)$$

Similarly, we obtain the updating mechanism of the hyperparameters:

$$\alpha_{j,k}^{(n)} = \frac{1}{\gamma}\sum_{i=1}^{N_n} n(d_i^{(n)},w_j^{(n)})\,P^{(n)}(z_k|d_i^{(n)},w_j^{(n)}) + \alpha_{j,k}^{(n-1)} \quad(15)$$

$$\beta_{k,i}^{(n)} = \frac{1}{\gamma}\sum_{j=1}^{M} n(d_i^{(n)},w_j^{(n)})\,P^{(n)}(z_k|d_i^{(n)},w_j^{(n)}) + \beta_{k,i}^{(n-1)} \quad(16)$$

We can see that the new hyperparameters α(n)j,k and β(n)k,i are determined by the previous ones, together with the current statistics related to w(n)j and d(n)i. The initial hyperparameters can be estimated according to Eqs. (8) and (9). Equations (13)–(16) constitute the M-step; in the E-step, P(zk|di, wj) is updated as in standard PLSA, i.e.

$$P(z_k|d_i,w_j) = \frac{P_{\mathrm{WIPLSA}}(w_j|z_k)\,P_{\mathrm{WIPLSA}}(z_k|d_i)}{\sum_{l=1}^{K} P_{\mathrm{WIPLSA}}(w_j|z_l)\,P_{\mathrm{WIPLSA}}(z_l|d_i)} \quad(17)$$

4.2 The weight measure

As mentioned in Sect. 4.1, 1/γ is used as the balance factor between the new data statistics and the prior. Suppose Dold and Dnew are the existing documents and the new documents, respectively. If Dnew is similar to Dold, the topic distributions change little; that is to say, the contribution of Dnew to the model update is small and the weight of Dnew should be small. Conversely, if Dnew is totally different from Dold, the topic distributions change significantly; Dnew makes a great contribution to the model update and its weight should be large. Let Sim(X, Y) be the similarity between the collections X and Y; we set

$$\gamma = \frac{\mathrm{Sim}(D_{new},D_{old})}{1-\mathrm{Sim}(D_{new},D_{old})} \quad(18)$$

Here, cosine similarity is chosen as the similarity measure. Since Dnew and Dold both represent multiple documents, we collapse each document set into a single vector to measure the similarity. Let V be the vector representing a document set D with N documents; each component vi is defined as

$$v_i = \sum_{j=1}^{N} n(d_j, v_i) \quad(19)$$

where n(dj, vi) signifies the occurrences of word vi in document dj, one of the documents in D. Hence, Sim(Dnew, Dold) is computed by

$$\mathrm{Sim}(D_{new},D_{old}) = \cos(V_{new},V_{old}) = \frac{V_{new}\cdot V_{old}}{|V_{new}|\times|V_{old}|} \quad(20)$$

We assume that the input document stream is {D0, D1, ..., Dn}; D0 is used to train the model and the remaining blocks are used for updating. We obtain P(z), P(w|z) and P(z|d) by the EM algorithm. P(w|z) reflects the probability that a word belongs to a topic, and P(z|d) represents the topic distribution of a document. Our aim is to find the words that reflect the latent topics, so we need to find the topic that each word belongs to and the related words under each topic.
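The weight computation of Sect. 4.2 (collapse each document set into a count vector, take the cosine similarity, then form γ) can be sketched as follows; `balance_factor` is a hypothetical helper name, and the `eps` guard against identical collections is our addition:

```python
import numpy as np

def collection_vector(count_rows):
    """Eq. (19): sum the word-count rows of a document set into one vector."""
    return np.asarray(count_rows, dtype=float).sum(axis=0)

def cosine_sim(v1, v2):
    """Eq. (20): cosine similarity between two collection vectors."""
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def balance_factor(D_new, D_old, eps=1e-9):
    """Eq. (18): gamma = sim / (1 - sim); 1/gamma then weights the new statistics."""
    sim = cosine_sim(collection_vector(D_new), collection_vector(D_old))
    return sim / max(1.0 - sim, eps)   # guard: identical collections give sim = 1
```

When the new block duplicates the old one, sim approaches 1, γ becomes huge and 1/γ approaches 0, so the new statistics are ignored, matching the motivating example in Sect. 1.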
The WIPLSA algorithm is summarized in Algorithm 1.
Algorithm 1 WIPLSA
Input: the document collection D = {D0, D1, ..., Dn}; the topic number TNum.
Output: PWIPLSA(zk|di), PWIPLSA(wj|zk).
1. Initialize TNum;
2. Estimate the initial hyperparameters α(0)j,k and β(0)k,i from the training documents D0;
3. while the EM termination condition for training is not satisfied
4.     for each Di in {D1, ..., Dn}
5.         compute the balance parameter γ according to Eq. (18)
6.         update α(n)j,k and β(n)k,i according to Eqs. (15) and (16)
7.         update PWIPLSA(wj|zk) and PWIPLSA(zk|di) according to Eqs. (13) and (14)
8.     end for
9. end while
10. output PWIPLSA(zk|di) and PWIPLSA(wj|zk);
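The hyperparameter update of step 6 (Eqs. (15) and (16)) can be sketched as a single vectorized step; `update_hyperparameters` is a hypothetical helper, and the β update sums over words, mirroring the initialization in Eq. (9):

```python
import numpy as np

def update_hyperparameters(counts_new, p_z_dw, alpha_prev, beta_prev, gamma):
    """One WIPLSA hyperparameter update epoch (Eqs. (15) and (16)).

    counts_new : (N_n, M) word counts of the new document block
    p_z_dw     : (N_n, M, K) posteriors P(z_k | d_i, w_j) on the new block
    alpha_prev : (M, K) accumulated word-topic hyperparameters
    beta_prev  : (K, N_n) accumulated topic-document hyperparameters
    gamma      : balance factor from Eq. (18); 1/gamma weights the new statistics
    """
    stats = counts_new[:, :, None] * p_z_dw          # n(d,w) * P(z|d,w)
    alpha = stats.sum(axis=0) / gamma + alpha_prev   # Eq. (15): sum over documents
    beta = stats.sum(axis=1).T / gamma + beta_prev   # Eq. (16): sum over words
    return alpha, beta
```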
5 Experimental analysis
In this section, we will show that how the method was eval-uated. First we describe the datasets, and then we introduce the performance metrics.Furthermore, we show the per-formance on dynamic topic discovering and large dataset processing. Finally, we conduct the comparison of different algorithms in application of document categorization.
5.1 The datasets description
WIPLSA aims to dynamically discover topics and incrementally learn from new documents. In order to examine the effects of time and topic evolution on model adaptation, we choose data changing over time as the training and adaptation documents.

Considering that significant events tend to have a full course of development, with different topics presented in different time periods, events from Netease news1 are crawled for the experiments.

Netease summarizes the related reports about various important events; the "Yaan earthquake" and the "Asiana Airlines crash in Los Angeles" are chosen and crawled to form two datasets, named Yaan (800 documents) and Asiana (200 documents), respectively. In order to accurately identify the topics, we adopt the "title" and "description" fields, and each web page is treated as a document. After crawling, a series of preprocessing steps follows, including word segmentation [32], word frequency statistics, dictionary building, vector representation and sorting the data by webpage time.
For the Yaan dataset, we select the first 40% of the documents (N = 320) for training and the remaining for incremental
1 http://news.163.com/special/.
updating. We know that, in an earthquake event, the occurrence is reported first, then the rescue, and finally the reconstruction. Usually, the proportion of rescue news is larger than that of other news, so we examine the topic evolution by adding different numbers of adaptation documents (N = 100, N = 280, N = 100) at three learning epochs.
For the Asiana dataset, we similarly use 80 documents for training and 120 adaptation documents. The documents span the period from 2013-07-09 08:30 to 2013-08-06 14:15. In order to reflect the topic evolution more realistically, we divide the adaptation documents into five learning epochs by observing the contents in each time period.
Furthermore, in order to evaluate model performance on large datasets, we use a subset of the TREC AP corpus containing 16,333 documents with 23,075 unique terms as the original dataset. We replicate it 100 times to obtain a large dataset for the perplexity evaluation. One-third of the adaptation documents is used at each epoch. Table 2 shows the three datasets.
Finally, we evaluate the proposed method for document categorization using the TDT2 dataset. The TDT2 corpus (NIST Topic Detection and Tracking corpus) consists of data collected during the first half of 1998 from six sources, including two newswires (APW, NYT), two radio programs (VOA, PRI) and two television programs (CNN, ABC). It contains 11,201 on-topic documents classified into 96 semantic categories. Here, we use a subset of the original TDT2 corpus: documents appearing in two or more categories are removed and only the five largest categories are kept, leaving 6146 documents in total. 70% of them are used as training documents and 30% for testing. The training documents are further divided into subsets for model training and adaptation as given in Table 3: the training samples of each category are roughly partitioned in half for training and adaptation, and one-third of the adaptation documents are used at each epoch.
5.2 The performance metrics
When we train the models, the documents in the corpora are treated as unlabeled; we wish to achieve high likelihood on a held-out test set. To evaluate the models, we compute the perplexity of the held-out test set. Perplexity measures the average word branching factor of a document model: it is monotonically decreasing in the likelihood of the test data and is algebraically equivalent to the inverse of the geometric mean per-word likelihood. A lower perplexity score therefore indicates better generalization performance and smaller source uncertainty. For a test set of N documents, the perplexity is defined as follows [22]:

$$\mathrm{perplexity}(D_{test}) = \exp\left(-\frac{\sum_{i=1}^{N} \log P(\mathbf{w}_{d_i})}{\sum_{i=1}^{N} n(d_i)}\right) \quad(21)$$

$$P(\mathbf{w}) = \sum_{z} P(z)\prod_{n=1}^{M} P(w_n|z) \quad(22)$$

where M is the total number of words.

For the model performance evaluation on large datasets, we use perplexity as the performance metric, and the classification error rate is used to evaluate model performance in document categorization.

5.3 The dynamic topic discovering

The Yaan dataset records the earthquake occurrence, rescue and post-earthquake reconstruction news. We set the topic number K = 10, and for each epoch we list the topic with the most documents in Fig. 3. The displayed word stems are the ten most probable words in the topic, from left to right in descending order.
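The perplexity metric of Sect. 5.2 (Eqs. (21) and (22)) can be sketched as follows, evaluating the mixture in log space for numerical stability (a hypothetical helper, not the authors' evaluation code):

```python
import numpy as np

def perplexity(test_counts, p_z, p_w_z):
    """Perplexity of held-out documents under a topic mixture model.

    test_counts : (N, M) word counts per test document
    p_z         : (K,) topic weights P(z)
    p_w_z       : (K, M) word distributions P(w|z)
    """
    log_lik = 0.0
    total_words = 0
    for row in test_counts:
        # Eq. (22) in log space: log P(w_d) = logsumexp_z [log P(z) + sum_n c_n log P(w_n|z)]
        log_mix = np.log(p_z + 1e-12) + row @ np.log(p_w_z + 1e-12).T   # (K,)
        m = log_mix.max()
        log_lik += m + np.log(np.exp(log_mix - m).sum())
        total_words += row.sum()
    return float(np.exp(-log_lik / total_words))   # Eq. (21)
```

As a sanity check, a single uniform topic over 4 words gives a perplexity of exactly 4.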
Table 2 The document numbers for the Yaan, Asiana and TREC datasets

Dataset   Training   Incremental updating per epoch
                     1st      2nd      3rd      4th   5th
Yaan      320        100      280      100      -     -
Asiana    80         15       24       24       27    30
TREC AP   653320     326660   326660   326660   -     -
Table 3 The document numbers for the TDT2 dataset

                                          Class1   Class2   Class3   Class4   Class5
Number of training documents              645      640      428      284      154
Number of incremental documents/epoch     215      213      142      94       51
Number of test documents                  554      549      368      245      134
We can see that the balance factor introduced into the incremental learning stage of WIPLSA acts as a kind of knowledge guiding the topic evolution, so the topics change dynamically as new documents are collected. Topic(1) reflects the earthquake locations, the rescue and the casualties are embodied in Topic(2), and Topic(3) describes the post-disaster reconstruction. WIPLSA can thus dynamically discover topics and incrementally learn from new documents.
The Asiana dataset records the whole course of the Asiana Airlines event. The topic number K is set to 5. In order to compare the topic changes between WIPLSA and QB PLSA, Fig. 4 shows all the topics in each epoch, each displayed by its top-5 stems. Because the similarity between the old and new documents is taken into account, the topics obtained by WIPLSA reflect the development of events more clearly. First, the Asiana crash happened and China announced the list of the safe people. Second, the cause was thoroughly investigated and Asiana Airlines apologized to the Chinese people. Third, a girl was suspected of having been crushed, and the accident may have been caused by the pilot. Fourth, the crushed girl's name was identified and lawyers provided legal aid for the victims' families. Finally, the compensation issues were discussed.
In order to measure the performance of WIPLSA and QB PLSA at each epoch, we use perplexity as the metric. The topic number is fixed to 3 and 5 for the Yaan and Asiana datasets, respectively. Figure 5 shows the perplexity comparison on the Yaan and Asiana datasets. We can see that WIPLSA performs better than QB PLSA at every epoch, indicating that the parameter γ benefits the model generalization.
5.4 Generalization performance on large dataset
For large-scale datasets, incremental learning provides a feasible solution which can reduce the computational cost. So, in the next experiment, we evaluate the performance on a large-scale dataset.
We evaluate the perplexities on the TREC AP dataset. We firstly compare WIPLSA with PLSA, MAP PLSA and QB PLSA models. EM algorithm is used to train all the hidden variable models with exactly the same stopping criteria, that the average change in expected log likelihood is less than 0.001%. Also, MAP PLSA, QB PLSA and WIPLSA are initialized from the standard PLSA. We use 30% of the data for training and 60% for incremental learn-ing, the remaining served as test data. Specifically, PLSA is trained using all the training and adaptation documents, MAP PLSA performs batch adaptation using all adaptation documents. QB PLSA and WIPLSA executes incremental adaptation by adding 20% adaptation documents at each learning epoch. Because that there is no incremental pro-gress in PLSA and MAP PLSA, in order to clearly compare the performance of different models, for QB PLSA and WIPLSA, we only use the perplexity at the final epoch.
The perplexities is examined by changing the number of latent variable K. Figure 6 presents the perplexity for the PLSA, MAP PLSA, QB PLSA and WIPLSA on TREC AP for different values of K. It can be seen due to the adap-tation stage, MAP PLSA, QB PLSA and WIPLSA do improve the document modeling and the perplexities are significantly reduced by model adaptation. Perplexities of WIPLSA are smaller than the other models. When k > 200
, the perplexity decreases at an extremely slow rate with the increase of topic number. Furthermore, it is not worth increasing the number of topics any more because the com-putational cost increases greatly with the increase of the topic number, so we choose K = 200 as the optimal value.
Furthermore, we evaluate the improvement brought by incremental learning. Figure 7 shows the perplexity of WIPLSA at each epoch. We find that perplexity consistently decreases with additional incremental learning epochs, which indicates that WIPLSA performs well on large datasets.
5.5 Performance for document categorization
Finally, we compare PLSA, MAP PLSA, QB PLSA and WIPLSA in the application of document
Fig. 3 The word stems that reflect the topics in the Yaan dataset
Int. J. Mach. Learn. & Cyber.
Fig. 4 Comparison of the topics found by WIPLSA and QB PLSA on the Asiana dataset
categorization. The TDT2 dataset is used for this experiment. For PLSA and MAP PLSA, the training documents are used to train the model and the testing documents are used for performance testing. For QB PLSA and WIPLSA, we additionally use the adaptation documents for incremental learning. To classify a test document, we first determine the PLSA probability for each test document, and then compute each class feature vector by averaging the PLSA probabilities over all documents corresponding to that class. Finally, the cosine similarity of the feature vectors between a test document and a given class model is calculated, and the test document is labeled with the class of maximum similarity. Comparing with the true labels, we compute the classification error rates.
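The categorization procedure described above (class feature vectors as averages in topic space, cosine similarity, max-similarity assignment) can be sketched as follows; the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def classify(doc_topic_test, doc_topic_train, train_labels):
    """Nearest-class-centroid classification in topic space.

    doc_topic_*:  rows are per-document topic distributions P(z|d)
    train_labels: one class label per training document
    """
    labels = np.array(train_labels)
    classes = sorted(set(train_labels))
    # Class feature vector = average topic distribution of its documents
    centroids = np.array([doc_topic_train[labels == c].mean(axis=0)
                          for c in classes])
    # Cosine similarity between each test document and each class centroid
    a = doc_topic_test / np.linalg.norm(doc_topic_test, axis=1, keepdims=True)
    b = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = a @ b.T                                   # (n_test, n_classes)
    # Assign each test document the class of maximum similarity
    return [classes[i] for i in sims.argmax(axis=1)]
```

This nearest-centroid scheme evaluates only the quality of the learned topic representations, which is why the same classifier is applied on top of all four models.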
Figure 8 shows the classification error rates of the different methods. It can be seen that MAP PLSA, QB PLSA and WIPLSA all outperform PLSA thanks to adaptation.
Fig. 5 Perplexity of WIPLSA and QB PLSA at each epoch on the a Yaan and b Asiana datasets
Fig. 6 Perplexity results on the TREC AP corpus for PLSA, MAP PLSA, QB PLSA and WIPLSA
WIPLSA also performs better than MAP PLSA and QB PLSA: at the 3rd epoch, WIPLSA obtains a classification error rate of 2.78%, compared with 3.56% for PLSA, 3.15% for MAP PLSA and 3.09% for QB PLSA. Moreover, the error rate consistently decreases over the incremental learning epochs. Because WIPLSA fully considers the data characteristics by adding more semantic information into the training process, it shows better performance in the application of document categorization.
Fig. 7 Perplexity results on the TREC AP corpus for WIPLSA at each epoch
Fig. 8 Classification error rate results on TDT2 dataset for PLSA, MAP PLSA, QB PLSA and WIPLSA
6 Conclusions
This paper proposes a weighted incremental PLSA (WIPLSA) for solving model-updating problems in natural language systems. The learning process consists of training and updating; the parameter update combines the previous hyperparameters with the statistics of the current documents. We introduce the similarity between the old and new knowledge as a weight that can change dynamically during updating. We have experimentally verified the claimed advantages in terms of perplexity evaluation on a large dataset and good performance in the application of document categorization. Future work will deal in larger detail with specific applications as well as with extensions and generalizations of the presented method.
Acknowledgements The work is supported by the National Natural Science Foundation of China (No. 91546122, 61602438, 61573335, 61473273, 61473274, 61363058), the National High-tech R&D Program of China (863 Program) (No. 2014AA015105), the National Science and Technology Support Program (No. 2014BAK02B07), the National Major R&D Program of Beijing Municipal Science & Technology Commission (No. Z161100002616032), and the Guangdong Provincial Science and Technology Plan Projects (No. 2015B010109005).
References
1. Blei DM (2012) Probabilistic topic models. Commun ACM 55:77–84
2. Yan Y, Chen L, Tjhi W-C (2013) Fuzzy semi-supervised co-clustering for text documents. Fuzzy Sets Syst 215:74–89
3. Shehata S, Karray F, Kamel MS (2013) An efficient concept-based retrieval model for enhancing text retrieval quality. Knowl Inf Syst 1–24
4. Freire A, Cacheda F, Formoso V, Carneiro V (2013) Analysis of performance evaluation techniques for large-scale information retrieval. Analyzing the Performance of Top-K Retrieval Algo-rithms, INVITED SPEAKER, p 2001
5. Choo J, Lee C, Clarkson E, Liu Z, Lee H, Chau DHP, Li F, Kan-nan R, Stolper CD, Inouye D et al (2013) Visirr: Interactive visual information retrieval and recommendation for large-scale document data
6. Mei Q, Zhai C (2001) A note on EM algorithm for probabilistic latent semantic analysis. In: Proceedings of the International Conference on Information and Knowledge Management, CIKM
7. Bai L, Liang J, Dang C, Cao F (2013) A novel fuzzy clustering algorithm with between-cluster information for categorical data. Fuzzy Sets Syst 215:55–73
8. Liu CL, Chang TH, Li HH (2013) Clustering documents with labeled and unlabeled documents using fuzzy semi-kmeans. Fuzzy Sets Syst
9. Hakala K, Van Landeghem S, Salakoski T, Van de Peer Y, Ginter F (2013) EVEX in ST'13: application of a large-scale text mining resource to event extraction and network construction. ACL 2013:26
10. Zhou E, Zhong N, Li Y (2013) Extracting news blog hot topics based on the w2t methodology. World Wide Web, pp 1–28
11. Wang X, Wang J (2013) A method of hot topic detection in blogs using n-gram model. J Softw 8:184–191
12. Steyvers M, Griffiths T (2007) Probabilistic topic models. Handb Latent Semantic Anal 427:424–440
13. Blei DM, Lafferty JD (2006) Dynamic topic models. In: Proceedings of the 23rd international conference on Machine learning. ACM, pp 113–120
14. Wang X, McCallum A (2006) Topics over time: a non-markov continuous-time model of topical trends. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 424–433
15. Wang C, Blei D, Heckerman D (2012) Continuous time dynamic topic models. arXiv:1206.3298
16. Aggarwal CC, Zhai C (2012) Mining text data. Springer
17. Gruber A, Rosen-Zvi M, Weiss Y (2012) Latent topic models for hypertext. arXiv:1206.3254
18. Bolshakova E, Loukachevitch N, Nokel M (2013) Topic models can improve domain term extraction. In: Advances in Information Retrieval. Springer, pp 684–687
19. Lin C, He Y, Everson R, Ruger S (2012) Weakly supervised joint sentiment-topic detection from text. IEEE Trans Knowl Data Eng 24:1134–1145
20. Deerwester SC, Dumais ST, Landauer TK, Furnas GW, Harsh-man RA (1990) Indexing by latent semantic analysis. JASIS 41:391–407
21. Hofmann T (1999) Probabilistic latent semantic indexing. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. ACM, pp 50–57
22. Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3:993–1022
23. Chaney AJB, Blei DM (2012) Visualizing topic models. In: ICWSM
24. Zhai K, Boyd-Graber J, Asadi N, Alkhouja (2012) Mr. LDA: a flexible large scale topic modeling package using variational inference in MapReduce. In: Proceedings of the 21st international conference on World Wide Web. ACM, pp 879–888
25. Li N, Zhuang F, He Q, Shi Z (2012) PPLSA: parallel probabilistic latent semantic analysis based on MapReduce. In: Intelligent Information Processing VI. Springer, pp 40–49
26. Chien J-T, Wu M-S (2008) Adaptive bayesian latent semantic analysis. IEEE Trans Audio Speech Lang Process 16:198–207
27. Wu H, Wang Y, Cheng X (2008) Incremental probabilistic latent semantic analysis for automatic question recommendation. In: Proceedings of the 2008 ACM conference on Recommender systems. ACM, pp 99–106
28. Chou T-C, Chen MC (2008) Using incremental PLSI for threshold-resilient online event analysis. IEEE Trans Knowl Data Eng 20:289–299
29. Hofmann T (2001) Unsupervised learning by probabilistic latent semantic analysis. Mach Learn 42:177–196
30. Surendran AC, Sra S (2006) Incremental aspect models for mining document streams. In: Knowledge Discovery in Databases: PKDD 2006. Springer, pp 633–640
31. Wu H, Wang Y (2009) Incremental learning of triadic plsa for collaborative filtering. In: Active Media Technology. Springer, pp 81–92
32. Qian Y (2016) Context based approach to overlapping ambiguity resolution in chinese word segmentation. J Chongqing Technol Bus Univ (Nat Sci Edn) 20–24