Clustering Search Results Using PLSA
23/4/21 1
Clustering Search Results Using PLSA
洪春涛
Outline
• Motivation
• Introduction to document clustering and the PLSA algorithm
• Work in progress and test results
Motivation
• Current Internet search engines return too much information
• Clustering the search results may help users find the desired information quickly
Figure: a demo of Google search results, with some hits labeled ‘The writer Truman Capote’ and others ‘The film Truman Capote’.
Document clustering
• Put the ‘similar’ documents together
=> How do we define ‘similar’?
Vector Space Model of documents
The Vector Space Model (VSM) represents a document as a vector of term counts:
Doc1: I see a bright future.
Doc2: I see nothing.
        I   see   a   bright   future   nothing
doc1    1   1     1   1        1        0
doc2    1   1     0   0        0        1
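The term vectors for Doc1 and Doc2 can be reproduced with a few lines of code. The following is a minimal sketch, not the implementation used in this work: the names `tokenize`, `TermMatrix` and `buildTermMatrix` are hypothetical, the tokenizer simply drops punctuation, and the vocabulary is kept in first-seen order.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Split a sentence into word tokens, replacing punctuation with spaces.
// Deliberately naive: no stemming, no stop-word removal, no case folding.
std::vector<std::string> tokenize(const std::string& text) {
    std::string cleaned;
    for (unsigned char c : text) {
        cleaned += std::isalnum(c) ? static_cast<char>(c) : ' ';
    }
    std::istringstream in(cleaned);
    std::vector<std::string> tokens;
    std::string tok;
    while (in >> tok) tokens.push_back(tok);
    return tokens;
}

// Vocabulary (first-seen order) plus one term-count vector per document.
struct TermMatrix {
    std::vector<std::string> vocabulary;
    std::vector<std::vector<int>> counts;  // counts[d][t]
};

TermMatrix buildTermMatrix(const std::vector<std::string>& docs) {
    TermMatrix m;
    std::map<std::string, std::size_t> index;  // term -> column
    std::vector<std::vector<std::string>> tokenized;
    // First pass: collect the vocabulary.
    for (const auto& doc : docs) {
        tokenized.push_back(tokenize(doc));
        for (const auto& tok : tokenized.back()) {
            if (index.emplace(tok, m.vocabulary.size()).second) {
                m.vocabulary.push_back(tok);
            }
        }
    }
    // Second pass: count term occurrences per document.
    for (const auto& toks : tokenized) {
        std::vector<int> row(m.vocabulary.size(), 0);
        for (const auto& tok : toks) ++row[index.at(tok)];
        m.counts.push_back(row);
    }
    assert(m.counts.size() == docs.size());  // one row per document
    return m;
}
```

Calling `buildTermMatrix` on the two example sentences yields the vocabulary I, see, a, bright, future, nothing and the two count rows shown in the table.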
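Cosine as Distance Between Documents
The similarity between doc1 and doc2 is then defined as the cosine of the angle between their term vectors (a larger value means the documents are closer):
\[ \cos(doc_1, doc_2) = \frac{doc_1 \cdot doc_2}{|doc_1| \, |doc_2|} \]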
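As a worked example, the vectors doc1 = (1, 1, 1, 1, 1, 0) and doc2 = (1, 1, 0, 0, 0, 1) have dot product 2 and lengths √5 and √3, so their cosine is 2/√15 ≈ 0.516. A minimal sketch of the computation follows; the function name `cosineSimilarity` and the convention of returning 0 for an all-zero vector are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine of the angle between two term vectors of equal length.
double cosineSimilarity(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    // Assumption: an empty document (zero vector) is similar to nothing.
    if (normA == 0.0 || normB == 0.0) return 0.0;
    return dot / (std::sqrt(normA) * std::sqrt(normB));
}
```

Because term counts are never negative, the result always lies between 0 (no shared terms) and 1 (same direction).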
Problems with cosine similarity
• Synonymy: different words may have the same meaning
  – car manufacturer = automobile maker
• Polysemy: a word may have several different meanings
  – ‘Truman Capote’ may mean the writer or the film
=> We need a model that reflects the ‘meaning’
Probabilistic Latent Semantic Analysis
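Graphical model of PLSA (figure with nodes D, Z and W):
D: document
Z: latent class
W: word
\[ P(d, w) = P(d) \, P(w \mid d) \]
\[ P(w \mid d) = \sum_{z \in Z} P(w \mid z) \, P(z \mid d) \]
These can also be written as:
\[ P(d, w) = \sum_{z \in Z} P(z) \, P(w \mid z) \, P(d \mid z) \]
Figure: example graph with nodes D2, Z1 and W1 and edge probabilities 0.1, 0.9, 0.3, 0.7, 0.8 and 0.2.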
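The step from the first form to the second is a single application of Bayes' rule. As a short derivation, using only the symbols defined on this slide:

```latex
\begin{aligned}
P(d, w) &= P(d) \sum_{z \in Z} P(w \mid z)\, P(z \mid d) \\
        &= \sum_{z \in Z} P(w \mid z)\, \bigl[ P(z \mid d)\, P(d) \bigr] \\
        &= \sum_{z \in Z} P(w \mid z)\, \bigl[ P(d \mid z)\, P(z) \bigr]
           \qquad \text{since } P(z \mid d)\, P(d) = P(d, z) = P(d \mid z)\, P(z) \\
        &= \sum_{z \in Z} P(z)\, P(w \mid z)\, P(d \mid z)
\end{aligned}
```

The second form treats documents and words symmetrically: both are generated from the latent class z.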
• Through maximum likelihood estimation, one obtains the estimated parameters:
  P(d|z): this is what we want, a document-topic matrix that reflects the meanings of the documents
  P(w|z)
  P(z)
Our approach
1. Get the P(d|z) matrix by PLSA, and
2. Run the k-means clustering algorithm on the rows of that matrix (one row per document)
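The second step can be sketched as a plain Lloyd-style k-means over the rows of the D*Z matrix. This is an illustrative sketch under stated assumptions rather than the code used in this work: the names `Matrix`, `squaredDistance` and `kMeans` are hypothetical, distances are Euclidean, and the first k rows seed the centroids so that the result is deterministic.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // one row per document

// Squared Euclidean distance between two rows of equal length.
double squaredDistance(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double diff = a[i] - b[i];
        sum += diff * diff;
    }
    return sum;
}

// Returns the cluster label of every row.
// Assumption: the first k rows are used as the initial centroids.
std::vector<int> kMeans(const Matrix& rows, std::size_t k, int maxIterations) {
    assert(k > 0 && k <= rows.size());
    Matrix centroids(rows.begin(), rows.begin() + static_cast<std::ptrdiff_t>(k));
    std::vector<int> labels(rows.size(), -1);
    for (int iter = 0; iter < maxIterations; ++iter) {
        // Assignment step: attach every row to its nearest centroid.
        bool changed = false;
        for (std::size_t d = 0; d < rows.size(); ++d) {
            int best = 0;
            double bestDist = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < k; ++c) {
                const double dist = squaredDistance(rows[d], centroids[c]);
                if (dist < bestDist) { bestDist = dist; best = static_cast<int>(c); }
            }
            if (labels[d] != best) { labels[d] = best; changed = true; }
        }
        if (!changed) break;  // converged
        // Update step: move every centroid to the mean of its rows.
        Matrix sums(k, std::vector<double>(rows[0].size(), 0.0));
        std::vector<int> sizes(k, 0);
        for (std::size_t d = 0; d < rows.size(); ++d) {
            const std::size_t c = static_cast<std::size_t>(labels[d]);
            ++sizes[c];
            for (std::size_t j = 0; j < rows[d].size(); ++j) sums[c][j] += rows[d][j];
        }
        for (std::size_t c = 0; c < k; ++c) {
            if (sizes[c] == 0) continue;  // keep the old centroid of an empty cluster
            for (std::size_t j = 0; j < sums[c].size(); ++j) centroids[c][j] = sums[c][j] / sizes[c];
        }
    }
    return labels;
}
```

In practice the seeding strategy matters; random restarts or a smarter initialization would be natural replacements for the first-k-rows assumption.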
Problems with this approach
• PLSA takes too much time
  – Solution: optimization and parallelization
Algorithm Outline
Expectation Maximization (EM) algorithm:
Tempered EM:
E-step:
M-step:
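For reference, the tempered EM (TEM) updates for PLSA are usually stated as follows for the symmetric parameterization with P(z), P(d|z) and P(w|z). The notation is an assumption about what the slide showed: n(d, w) is the count of word w in document d, and β is the tempering parameter, with β = 1 giving standard EM.

```latex
% E-step
P(z \mid d, w) =
  \frac{P(z)\, \bigl[ P(d \mid z)\, P(w \mid z) \bigr]^{\beta}}
       {\sum_{z' \in Z} P(z')\, \bigl[ P(d \mid z')\, P(w \mid z') \bigr]^{\beta}}

% M-step (each distribution is normalized to sum to 1)
P(w \mid z) \propto \sum_{d} n(d, w)\, P(z \mid d, w), \qquad
P(d \mid z) \propto \sum_{w} n(d, w)\, P(z \mid d, w), \qquad
P(z) \propto \sum_{d} \sum_{w} n(d, w)\, P(z \mid d, w)
```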
Basic Data Structures
p_w_z_current, p_w_z_prev: dense double matrix, W*Z
p_d_z_current, p_d_z_prev: dense double matrix, D*Z
p_z_current, p_z_prev: double array, Z
n_d_w: sparse integer matrix, N entries
Lemur Implementation
• On-demand calculation of p_z_d_w (recomputed each time it is needed)
• Computational complexity: O(W*D*Z^2)
• For the new3 dataset, which contains 9558 documents and 83487 unique terms, one TEM iteration takes days to finish
23/4/21 15
Optimization of the Algorithm
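• Reduce complexity
  – calculate p_z_d_w just once in an iteration
  – complexity reduced to O(N*Z)
• Reduce cache misses by reordering the loops so that z is innermost:
for (int d = 1; d < numDocs; d++) {               // documents (bounds as on the slide)
    for (int w = 0; w < numTermsInThisDoc; w++) { // terms that occur in document d
        for (int z = 0; z < numZ; z++) {          // latent classes
            ….
        }
    }
}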
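The O(N*Z) idea can be illustrated with a sketch of one tempered EM iteration that visits each nonzero entry of n_d_w once, computes p_z_d_w for that entry, and immediately accumulates it into the next parameters. This is a sketch under assumptions, not the optimized code measured here: the types `Cooccurrence` and `PlsaModel`, the function `emIteration`, the row-major layout and the standard PLSA updates with tempering parameter `beta` are all illustrative choices.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One nonzero entry of the sparse count matrix n_d_w.
struct Cooccurrence {
    std::size_t d;  // document index
    std::size_t w;  // word index
    double count;   // n(d, w)
};

// Dense parameters, stored row-major so the z values of one row are adjacent.
struct PlsaModel {
    std::size_t numDocs, numWords, numZ;
    std::vector<double> p_z;    // Z
    std::vector<double> p_d_z;  // D*Z, entry (d, z) at p_d_z[d * numZ + z]
    std::vector<double> p_w_z;  // W*Z, entry (w, z) at p_w_z[w * numZ + z]
};

// One tempered EM iteration; cost is O(N*Z) plus O((D+W)*Z) for normalization.
PlsaModel emIteration(const PlsaModel& prev, const std::vector<Cooccurrence>& n_d_w, double beta) {
    const std::size_t Z = prev.numZ;
    assert(prev.p_z.size() == Z);
    PlsaModel next{prev.numDocs, prev.numWords, Z,
                   std::vector<double>(Z, 0.0),
                   std::vector<double>(prev.numDocs * Z, 0.0),
                   std::vector<double>(prev.numWords * Z, 0.0)};
    std::vector<double> p_z_d_w(Z);  // posterior for the current (d, w) only
    double total = 0.0;              // sum of all counts that contributed
    for (const Cooccurrence& e : n_d_w) {
        // E-step for this entry: P(z|d,w) is proportional to P(z) [P(d|z) P(w|z)]^beta.
        double norm = 0.0;
        for (std::size_t z = 0; z < Z; ++z) {
            p_z_d_w[z] = prev.p_z[z] *
                         std::pow(prev.p_d_z[e.d * Z + z] * prev.p_w_z[e.w * Z + z], beta);
            norm += p_z_d_w[z];
        }
        if (norm <= 0.0) continue;  // entry has zero probability under the model
        // M-step accumulation: add n(d,w) * P(z|d,w) to all three tables.
        for (std::size_t z = 0; z < Z; ++z) {
            const double weight = e.count * p_z_d_w[z] / norm;
            next.p_d_z[e.d * Z + z] += weight;
            next.p_w_z[e.w * Z + z] += weight;
            next.p_z[z] += weight;
        }
        total += e.count;
    }
    // Normalize so that P(d|z) and P(w|z) sum to 1 per class and P(z) sums to 1.
    for (std::size_t z = 0; z < Z; ++z) {
        const double mass = next.p_z[z];
        if (mass <= 0.0) continue;
        for (std::size_t d = 0; d < next.numDocs; ++d) next.p_d_z[d * Z + z] /= mass;
        for (std::size_t w = 0; w < next.numWords; ++w) next.p_w_z[w * Z + z] /= mass;
        next.p_z[z] = mass / total;
    }
    return next;
}
```

Keeping z as the innermost index means each (d, w) entry reads and writes short contiguous runs of length Z, which is the access pattern the loop order on the slide aims for.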
Parallelization: Access Pattern
Data Race
solution: divide the co-occurrence table into blocks
Block Dispatching Algorithm
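One common way to dispatch such blocks without conflicts is a diagonal schedule, sketched below. It is not necessarily the dispatching algorithm used in this work. The sketch assumes that the race comes from two threads accumulating into the same document row or the same word row, that documents and words are each split into T ranges for T workers, and that the names `Block` and `diagonalSchedule` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A block of the co-occurrence table: one document range and one word range.
struct Block {
    std::size_t docBlock;
    std::size_t wordBlock;
};

// Returns T rounds of T blocks each. In round r, worker t processes block
// (t, (t + r) % T), so within a round no two workers share a document range
// or a word range, and after T rounds every block has been processed once.
std::vector<std::vector<Block>> diagonalSchedule(std::size_t numWorkers) {
    assert(numWorkers > 0);
    std::vector<std::vector<Block>> rounds(numWorkers);
    for (std::size_t r = 0; r < numWorkers; ++r) {
        for (std::size_t t = 0; t < numWorkers; ++t) {
            rounds[r].push_back(Block{t, (t + r) % numWorkers});
        }
    }
    return rounds;
}
```

With this schedule the workers need a barrier only between rounds; how evenly the work is balanced depends on how the ranges are chosen.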
Block Dividing Algorithm
Experiment Setup
cranmed
Speedup
Memory Bandwidth Usage
HPC134, Tulsa
Memory Related Pipeline Stalls
Available Memory Bandwidth of the Two Machines
END
Backup slides
Test Results
Dataset   PLSA     VSM
Tr23      0.4977   0.5273
K1b       0.8473   0.5724
sports    0.7575   0.5563
Table 1. F-score of PLSA and VSM
sizeZ       10    20    50    100
Lemur       29    48    263   1015
Optimized   2     3.2   7     13
Table 2. Time used in one EM iteration (in seconds)
Uses the k1b dataset (2340 docs, 21247 unique terms, 530374 terms)
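Dividing the Lemur time by the optimized time in Table 2 gives ratios of about 14.5, 15, 37.6 and 78.1 for sizeZ = 10, 20, 50 and 100, so the gap widens as the number of latent classes grows. The small helper below reproduces that arithmetic; the name `speedups` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise ratio baseline[i] / optimized[i].
std::vector<double> speedups(const std::vector<double>& baseline,
                             const std::vector<double>& optimized) {
    assert(baseline.size() == optimized.size());
    std::vector<double> ratios;
    for (std::size_t i = 0; i < baseline.size(); ++i) {
        ratios.push_back(baseline[i] / optimized[i]);
    }
    return ratios;
}
```

For Table 2 the call would be `speedups({29, 48, 263, 1015}, {2, 3.2, 7, 13})`.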
Thanks!