Clustering
-
Upload
shannon-ross -
Category
Documents
-
view
213 -
download
0
description
Transcript of Clustering
-
Clustering Web Search Results
Iwona Biaynicka-Birula
-
Iwona Biaynicka-Birula - Clustering Web Search Results
Overview
What is clustering? Applying clustering to web search results Clustering algorithms Case studies Related topics not covered
Clustering Clustering in general Document clustering in general
Other search and browsing aids Classification Visualization Query expansion
-
Iwona Biaynicka-Birula - Clustering Web Search Results
Clustering the act of grouping similar object into sets
In the web search context: organizing web pages (search results) into groups, so that different groups correspond to different user needs search engine i.e.: engine car part Engine Corp.
What is clustering?
-
Iwona Biaynicka-Birula - Clustering Web Search Results
Clustering vs. Classification
Classification assigns objects to predefined groups
Clustering infers groups based on clustered objects
-
Iwona Biaynicka-Birula - Clustering Web Search Results
Why cluster web search results?
Flat ranked list not enough
Documents pertaining to different topics cannot be compared
Relationships between the results
Cluster Hypothesis (van Rijsbergen 1979): Closely related documents tend to be relevant to the same requests.
Aids user-engine interaction
Browsing
Help user express his need
-
Iwona Biaynicka-Birula - Clustering Web Search Results
Why not just document clustering?
Web search results clustering is a version of document clustering, but
Billions of pages
Constantly changing
Data mainly unstructured and heterogeneous
Additional information to consider (i.e. links, click-through data, etc.)
-
Iwona Biaynicka-Birula - Clustering Web Search Results
Some requirements
Fast
Immediate response to query
Flexible
Web content changes constantly
User-oriented
Main goal is to aid the user in finding sought information
-
Iwona Biaynicka-Birula - Clustering Web Search Results
Main issues
Online or offline clustering? What to use as input
Entire documents Snippets Structure information (links) Other data (i.e. click-through) Use stop word lists, stemming, etc.
How to define similarity? Content (i.e. vector-space model) Link analysis Usage statistics
How to group similar documents? How to label the groups?
-
Iwona Biaynicka-Birula - Clustering Web Search Results
Clustering algorithms
Flat or hierarchical?
Overlapping?
Hard or soft?
Incremental?
Predefined cluster number?
Requiring explicit similarity measure? Distance measure?
-
Iwona Biaynicka-Birula - Clustering Web Search Results
Clustering algorithms
Distance-based Hierarchical
Agglomerative Hierarchical Clustering (AHC) Flat
K-means (can be fuzzy) Single-pass (incremental)
Other Suffix Tree Clustering (Grouper) Self-organizing (Kohonen) maps (neural
networks) Latent Semantic Indexing (LSI) (reducing the
dimensionality of the vector-space)
-
Iwona Biaynicka-Birula - Clustering Web Search Results
Agglomerative hierarchical clustering
-
Iwona Biaynicka-Birula - Clustering Web Search Results
Clustering result: dendrogram
-
Iwona Biaynicka-Birula - Clustering Web Search Results
AHC variants
Various ways of calculating cluster similarity
single-link (minimum) complete-link (maximum)
Group-average (average)
-
Iwona Biaynicka-Birula - Clustering Web Search Results
K-means clustering (k=3)
-
Iwona Biaynicka-Birula - Clustering Web Search Results
Single-pass
-
Iwona Biaynicka-Birula - Clustering Web Search Results
Selected systems
Scatter/Gather
Grouper
Carrot2
Vivisimo
Mapuccino
(Su et. al. 2001)
SHOC
-
Iwona Biaynicka-Birula - Clustering Web Search Results
Scatter/Gather
(Cutting et. al. 1992)
Designed for browsing
Based on two novel clustering algorithms
Buckshot fast for online clustering
Fractionation accurate for offline initial clustering of the entire set
-
Iwona Biaynicka-Birula - Clustering Web Search Results
Grouper
(Zamir and Etzioni 1997, 1999)
Online
Operates on query result snippets
Clusters together documents with large common subphrases
Suffix Tree Clustering (STC)
STC induces labeling
-
Iwona Biaynicka-Birula - Clustering Web Search Results
Suffix Tree Clustering (STC)
Linear
Incremental
Overlapping
Can be extended to hierarchical
-
Iwona Biaynicka-Birula - Clustering Web Search Results
STC algorithm
Step 1: Cleaning Stemming
Sentence boundary identification
Punctuation elimination
Step 2: Suffix tree construction Produces base clusters (internal nodes)
Base clusters are scored based on size and phrase score (which depends on length and word quality)
Step 3: Merging base clusters Highly overlapping clusters are merged
-
Iwona Biaynicka-Birula - Clustering Web Search Results
Carrot2
(Stefanowski and Weiss 2003) http://www.cs.put.poznan.pl/dweiss/carr
ot/ Component framework Allows substituting components for
Input (i.e. snippets from other search engines) Filter
Stemming Distance measure Clustering
Output
-
Iwona Biaynicka-Birula - Clustering Web Search Results
Vivsimo
Commercial
http://www.vivisimo.com/
Online
Hierarchical
Conceptual
-
Iwona Biaynicka-Birula - Clustering Web Search Results
Other
Mapuccino (IBM) (Maarek et. al. 2000)
http://www.alphaworks.ibm.com/tech/mapuccino
Relatively efficient AHC (O(n2))
Similarity based on vector-space model
(Su et. al. 2001) Only usage statistics used as input
Recursive Density Based Clustering
SHOC (Zhang and Dong 2004)
Grouper-like
Key phrase discovery
-
Iwona Biaynicka-Birula - Clustering Web Search Results
References
Douglass Cutting, David Karger, Jan Pedersen, and John W. Tukey, Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, 1992. Proceedings of the 15th Annual International ACM/SIGIR Conference, Copenhagen.
O. Zamir and O. Etzioni, Grouper: a dynamic clustering interface to web search results, May 1999. In Proceedings of the Eighth International World Wide Web Conference, Toronto, CanadaM. Steinbach, G.
Y.S. Maarek, R. Fagin, I.Z. Ben-Shaul, D. Pelleg, Ephemeral document clustering for web applications, 2000. Technical Report RJ 10186, IBM Research
Zhong Su, Qiang Yang, HongHiang Zhang, Xiaowei Xu and Yuhen Hu, Correlation-based Document Clustering using Web Logs, 2001.
J. Stefanowski, D. Weiss. Carrot2 and Language Properties in Web Search Results Clustering, 2003. In: Lecture Notes in Artificial Intelligence: Advances in Web Intelligence, Proceedings of the First International Atlantic Web Intelligence Conference, Madrit, Spain, vol. 2663 (), pp. 240249
Dell Zhang, Yisheng Dong. Semantic, Hierarchical, Online Clustering of Web Search Results, Apr 2004. In Proceedings of the 6th Asia Pacific Web Conference (APWEB), Hangzhou, China
-
Iwona Biaynicka-Birula - Clustering Web Search Results
Thank you
Questions?
http://www.di.unipi.it/~iwona/Clustering.ppt