A New Suffix Tree Similarity Measure for Document Clustering Hung Chim, Xiaotie Deng WWW 07.
A New Suffix Tree Similarity Measure for Document Clustering
Hung Chim, Xiaotie Deng
WWW 07
1. Document Clustering
• Agglomerative Hierarchical Clustering (AHC)
• Suffix Tree Clustering (STC)
- commonly used in search result clustering
2-1. Suffix Tree Clustering
Ex: 3 documents
• cat ate cheese
• cat ate mouse too
• mouse ate cheese too
cat ate cheese
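The three example documents can be explored with a minimal sketch that lists every word-level phrase shared by at least two documents. It is a naive stand-in for a generalized suffix tree, not the tree construction itself: it enumerates all prefixes of all suffixes, which costs quadratic time per document. The function name `shared_phrases` and the 0-based document ids are assumptions made for illustration.

```python
from collections import defaultdict

def shared_phrases(docs):
    """Map each word-level phrase to the set of document ids containing it.

    Naive substitute for a generalized suffix tree: every prefix of every
    suffix is enumerated explicitly (quadratic per document, not linear).
    """
    phrase_docs = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.split()
        for start in range(len(words)):                    # each suffix
            for end in range(start + 1, len(words) + 1):   # each prefix of it
                phrase_docs[tuple(words[start:end])].add(doc_id)
    # Keep only phrases shared by at least two documents.
    return {p: ids for p, ids in phrase_docs.items() if len(ids) >= 2}

docs = ["cat ate cheese", "cat ate mouse too", "mouse ate cheese too"]
base_clusters = shared_phrases(docs)
for phrase, ids in sorted(base_clusters.items()):
    print(" ".join(phrase), sorted(ids))
```

The naive enumeration also reports "cat" on its own. A suffix tree would fold it into the "cat ate" node, because "cat" is always followed by "ate" in these documents.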
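2-2. Base Cluster
score(B) = |B| · f(|P|)
• |B|: number of documents in base cluster B
• |P|: number of words in phrase P, not counting stopwords or words that appear in <= 3 documents or in > 40% of documents
• f: penalizes single-word phrases, constant for |P| > 6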
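The base cluster score can be sketched as follows. The slide only states that f penalizes single words and is constant beyond 6 words, so the concrete values here (0.5 for a single word, linear growth from 2 to 6) are assumptions, as are the names `phrase_length_factor`, `base_cluster_score`, and `doc_freq`.

```python
def phrase_length_factor(effective_len):
    """Hypothetical f(|P|): penalize single words, flat beyond 6 words."""
    if effective_len <= 0:
        return 0.0
    if effective_len == 1:
        return 0.5                       # assumed penalty, not from the slides
    return float(min(effective_len, 6))  # assumed linear growth, capped at 6

def base_cluster_score(doc_ids, phrase, doc_freq, n_docs, stopwords=frozenset()):
    """score(B) = |B| * f(|P|), where |P| counts only informative words."""
    def informative(word):
        df = doc_freq.get(word, 0)
        # Drop stopwords, rare words (<= 3 docs) and frequent words (> 40%).
        return word not in stopwords and 3 < df <= 0.4 * n_docs

    effective_len = sum(1 for w in phrase if informative(w))
    return len(doc_ids) * phrase_length_factor(effective_len)

# Illustrative document frequencies for a 100-document collection.
doc_freq = {"the": 90, "suffix": 10, "tree": 12}
score = base_cluster_score({1, 2, 3, 4, 5, 6, 7, 8},
                           ("the", "suffix", "tree"), doc_freq, n_docs=100)
print(score)  # 8 documents * f(2) = 16.0
```

Here "the" is discarded because it appears in more than 40% of the documents, so only two words contribute to |P|.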
2-3. Combining Base Clusters
• Keep the top k (= 500) base clusters
• Merge high-overlap base clusters: merge Bi & Bj iff
|Bi ∩ Bj| / |Bi| > 0.5 and
|Bi ∩ Bj| / |Bj| > 0.5
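The merge rule can be sketched with a small union-find over the base clusters. Treating the final clusters as connected components of the "high overlap" relation is an assumption about how the pairwise rule is applied transitively. The function name `merge_base_clusters` is hypothetical, and every base cluster is assumed to be a non-empty set of document ids.

```python
def merge_base_clusters(base_clusters, threshold=0.5):
    """Union base clusters linked by mutual overlap above `threshold`.

    base_clusters: list of non-empty sets of document ids.
    Returns the merged document-id sets (one per connected component).
    """
    n = len(base_clusters)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            common = len(base_clusters[i] & base_clusters[j])
            if (common / len(base_clusters[i]) > threshold
                    and common / len(base_clusters[j]) > threshold):
                parent[find(i)] = find(j)

    merged = {}
    for i, docs in enumerate(base_clusters):
        merged.setdefault(find(i), set()).update(docs)
    return list(merged.values())

# Base clusters of the cat/mouse example (0-based document ids):
# "cat ate", "ate", "cheese", "mouse", "too", "ate cheese"
example = [{0, 1}, {0, 1, 2}, {0, 2}, {1, 2}, {1, 2}, {0, 2}]
print(merge_base_clusters(example))
```

In this example every two-document base cluster overlaps the "ate" cluster by more than half in both directions, so all six end up in a single merged cluster.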
2-4. Advantage
• High precision even when using snippets
• Incremental and linear time
• Order Independent
• No magic k
- but what about the top k base clusters? The 0.5 threshold?
3. New Suffix Tree Clustering
di^T = [tfidf(n1, di), tfidf(n2, di), …]
(n1, n2, …: nodes of the suffix tree)
Group-average AHC (GAHC)
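The document vectors and the similarity used by GAHC can be sketched as below. The slide only says "tfidf", so the weighting (1 + log tf) · log(1 + N / df) is an assumed variant, and the helper names (`tfidf_vectors`, `cosine`, `group_average_similarity`) and the `node_tf` input layout are hypothetical. Group average is taken here as the mean pairwise cosine similarity between two clusters, which is one common reading of the term.

```python
import math

def tfidf_vectors(node_tf, n_docs):
    """Build sparse per-document vectors over suffix tree nodes.

    node_tf: {node: {doc_id: frequency of the node's phrase in that doc}},
             with every frequency >= 1 and every inner dict non-empty.
    Returns {doc_id: {node: weight}}.
    """
    vectors = {d: {} for d in range(n_docs)}
    for node, tf_by_doc in node_tf.items():
        idf = math.log(1 + n_docs / len(tf_by_doc))   # assumed idf variant
        for doc_id, tf in tf_by_doc.items():
            vectors[doc_id][node] = (1 + math.log(tf)) * idf
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def group_average_similarity(cluster_a, cluster_b, vectors):
    """Mean pairwise cosine similarity between two lists of doc ids."""
    total = sum(cosine(vectors[a], vectors[b])
                for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

# Nodes of the cat/mouse example, each phrase occurring once per document.
node_tf = {
    "cat ate": {0: 1, 1: 1}, "ate": {0: 1, 1: 1, 2: 1},
    "cheese": {0: 1, 2: 1}, "mouse": {1: 1, 2: 1},
    "too": {1: 1, 2: 1}, "ate cheese": {0: 1, 2: 1},
}
vectors = tfidf_vectors(node_tf, n_docs=3)
print(cosine(vectors[0], vectors[1]))
print(group_average_similarity([0], [1, 2], vectors))
```

A GAHC loop would repeatedly merge the two clusters with the highest group-average similarity until a stopping criterion is met. That loop is left out of the sketch.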
4. Evaluation
• Use F-measure
precision(Ci, Gj) = |Ci ∩ Gj| / |Ci|
recall(Ci, Gj) = |Ci ∩ Gj| / |Gj|
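The precision and recall definitions above can be combined into an overall F-measure as sketched below. The slide gives only precision and recall. The harmonic-mean F and the class-size-weighted maximum over clusters are the usual conventions for clustering evaluation and are assumed here. The function name `f_measure` is hypothetical.

```python
def f_measure(clusters, classes):
    """Overall F-measure of a clustering against ground-truth classes.

    clusters: list of sets of document ids (the Ci).
    classes:  list of sets of document ids (the Gj), assumed disjoint.
    """
    n_docs = sum(len(g) for g in classes)
    total = 0.0
    for g in classes:
        best = 0.0
        for c in clusters:
            common = len(c & g)
            if common == 0:
                continue                       # F is 0 for this pair
            precision = common / len(c)
            recall = common / len(g)
            best = max(best, 2 * precision * recall / (precision + recall))
        total += len(g) / n_docs * best        # weight by class size
    return total

clusters = [{0, 1}, {2, 3, 4}]
classes = [{0, 1, 2}, {3, 4}]
print(f_measure(clusters, classes))
```

For class {0, 1, 2} the best match is cluster {0, 1} (precision 1, recall 2/3), and for class {3, 4} it is cluster {2, 3, 4} (precision 2/3, recall 1). Both pairs give F = 0.8.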
• OHSUMED Document Collection
- MeSH indexing terms
• RCV1 Document Collection
- categories
5. Comparison
• STC: seldom generates large clusters
• NSTC: not incremental