
Clustering Web Search Results

Iwona Białynicka-Birula


Overview

What is clustering?

Applying clustering to web search results

Clustering algorithms

Case studies

Related topics not covered

Clustering: clustering in general, document clustering in general

Other search and browsing aids: classification, visualization, query expansion


What is clustering?

Clustering: the act of grouping similar objects into sets

In the web search context: organizing web pages (search results) into groups, so that different groups correspond to different user needs

Example: results for the query "engine" may fall into groups such as search engine, car part, and Engine Corp.


Clustering vs. Classification

Classification assigns objects to predefined groups

Clustering infers groups based on clustered objects


Why cluster web search results?

Flat ranked list not enough

Documents pertaining to different topics cannot be compared

Relationships between the results

Cluster Hypothesis (van Rijsbergen 1979): Closely related documents tend to be relevant to the same requests.

Aids user-engine interaction

Browsing

Helps users express their needs


Why not just document clustering?

Web search results clustering is a version of document clustering, but:

Billions of pages

Constantly changing

Data mainly unstructured and heterogeneous

Additional information to consider (e.g. links, click-through data)


Some requirements

Fast

Immediate response to query

Flexible

Web content changes constantly

User-oriented

Main goal is to aid the user in finding sought information


Main issues

Online or offline clustering?

What to use as input?

Entire documents

Snippets

Structure information (links)

Other data (e.g. click-through)

Use stop word lists, stemming, etc.

How to define similarity?

Content (e.g. vector-space model)

Link analysis

Usage statistics

How to group similar documents?

How to label the groups?
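
As an illustration of the content-based similarity option, here is a minimal sketch of the vector-space model applied to result snippets. The tiny stop-word list, the tokenizer, and the sample snippets are assumptions made for the example; no stemming is applied and raw term counts stand in for any weighting scheme a real system would use.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

def term_vector(text):
    """Lowercase, tokenize, drop stop words, and count term frequencies."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(weight * v[term] for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical snippets for the query "engine".
snippets = [
    "Search engine for the web",
    "Web search engine ranking",
    "Car engine repair and parts",
]
vectors = [term_vector(s) for s in snippets]
sim_web = cosine_similarity(vectors[0], vectors[1])  # two search-engine snippets
sim_car = cosine_similarity(vectors[0], vectors[2])  # search engine vs. car part
```

The two search-engine snippets share three terms and score much higher than the search-engine and car-part pair, which share only "engine".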


Clustering algorithms

Flat or hierarchical?

Overlapping?

Hard or soft?

Incremental?

Predefined cluster number?

Requiring explicit similarity measure? Distance measure?


Clustering algorithms

Distance-based

Hierarchical: Agglomerative Hierarchical Clustering (AHC)

Flat: K-means (can be fuzzy), Single-pass (incremental)

Other

Suffix Tree Clustering (Grouper)

Self-organizing (Kohonen) maps (neural networks)

Latent Semantic Indexing (LSI) (reducing the dimensionality of the vector space)


Agglomerative hierarchical clustering


Clustering result: dendrogram


AHC variants

Various ways of calculating cluster similarity

Single-link (minimum)

Complete-link (maximum)

Group-average (average)
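
The three variants differ only in how the distance between two clusters is derived from the pairwise distances of their members. The sketch below is a naive illustration on one-dimensional points, not an efficient implementation; the function name, the `linkage` parameter, and the sample data are assumptions for the example. It stops at k clusters instead of building the full dendrogram.

```python
def agglomerative(points, k, linkage="single"):
    """Merge the two closest clusters until only k clusters remain."""

    def cluster_distance(c1, c2):
        # 1-D points keep the example small; distance is the absolute difference.
        pairwise = [abs(a - b) for a in c1 for b in c2]
        if linkage == "single":      # minimum: closest pair of members
            return min(pairwise)
        if linkage == "complete":    # maximum: farthest pair of members
            return max(pairwise)
        if linkage == "average":     # group average over all pairs
            return sum(pairwise) / len(pairwise)
        raise ValueError(f"unknown linkage: {linkage}")

    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > k:
        # Pick the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Illustrative data with steadily growing gaps between neighbours.
data = [0, 10, 26, 45, 68]
by_single = agglomerative(data, 2, "single")
by_complete = agglomerative(data, 2, "complete")
by_average = agglomerative(data, 2, "average")
```

On this data single-link chains the first four points together, while complete-link produces a more balanced split, which is the usual qualitative difference between the two.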


K-means clustering (k=3)
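
A minimal sketch of plain (hard) k-means with k=3 on two-dimensional points. Seeding the centroids with the first k points is an assumption made to keep the example deterministic; implementations commonly use random or more careful seeding, and the sample points are made up.

```python
def k_means(points, k, iterations=20):
    """Alternate assignment and centroid update until nothing changes."""

    def sq_dist(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    centroids = list(points[:k])  # assumption: seed with the first k points
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return assignment, centroids

# Three loose groups around (0, 0), (10, 10) and (20, 0).
points = [(0, 0), (10, 10), (20, 0), (1, 1), (0, 1),
          (10, 11), (11, 10), (21, 0), (20, 1)]
labels, centers = k_means(points, k=3)
```

Unlike AHC, the number of clusters must be fixed in advance, and the result depends on the initial centroids.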


Single-pass
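
A minimal sketch of single-pass (incremental) clustering: each document is seen once and either joins the most similar existing cluster or starts a new one. The cosine measure, the 0.5 threshold, the summed-vector centroid, and the sample documents are assumptions for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def single_pass(docs, threshold=0.5):
    """Process documents in arrival order; never revisit earlier decisions."""
    clusters = []  # each: {"centroid": term -> summed count, "members": [doc index]}
    for idx, text in enumerate(docs):
        vec = {}
        for term in text.lower().split():
            vec[term] = vec.get(term, 0) + 1
        # Find the most similar existing cluster, if any.
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(idx)
            for term, w in vec.items():  # fold the document into the centroid
                best["centroid"][term] = best["centroid"].get(term, 0) + w
        else:
            clusters.append({"centroid": dict(vec), "members": [idx]})
    return [c["members"] for c in clusters]

# Hypothetical snippets for the query "jaguar".
docs = ["jaguar car dealer", "jaguar big cat", "jaguar car price", "big cat habitat"]
groups = single_pass(docs)
```

The result depends on the order in which documents arrive and on the threshold, which is the price paid for handling each document only once.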


Selected systems

Scatter/Gather

Grouper

Carrot2

Vivisimo

Mapuccino

(Su et al. 2001)

SHOC


Scatter/Gather

(Cutting et al. 1992)

Designed for browsing

Based on two novel clustering algorithms

Buckshot: fast, for online clustering

Fractionation: accurate, for offline initial clustering of the entire set
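
A rough sketch of the idea behind Buckshot as described by Cutting et al.: run the expensive agglomerative step only on a random sample of about sqrt(k*n) documents, then use the resulting k cluster centers to assign all n documents. One-dimensional numbers stand in for document vectors, and the function name, seed parameter, and data are assumptions; the iterative refinement used in the original system is omitted.

```python
import math
import random

def buckshot(points, k, seed=0):
    """Seed k centers from AHC on a small sample, then assign every point."""
    rng = random.Random(seed)
    sample_size = min(len(points), math.ceil(math.sqrt(k * len(points))))
    sample = rng.sample(points, sample_size)

    def avg_dist(c1, c2):
        # Group-average distance between two clusters of 1-D points.
        return sum(abs(a - b) for a in c1 for b in c2) / (len(c1) * len(c2))

    # Phase 1: group-average AHC on the sample only, down to k clusters.
    clusters = [[p] for p in sample]
    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: avg_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    centers = [sum(c) / len(c) for c in clusters]

    # Phase 2: assign every point (sampled or not) to its nearest center.
    groups = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda c: abs(p - centers[c]))
        groups[nearest].append(p)
    return groups

groups = buckshot([0, 1, 50, 51, 100, 101], k=3)
```

Because the quadratic AHC step sees only about sqrt(k*n) items, its cost grows roughly linearly with k*n, which is what makes the approach usable online.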


Grouper

(Zamir and Etzioni 1997, 1999)

Online

Operates on query result snippets

Clusters together documents with large common subphrases

Suffix Tree Clustering (STC)

STC induces labeling


Suffix Tree Clustering (STC)

Linear

Incremental

Overlapping

Can be extended to hierarchical


STC algorithm

Step 1: Cleaning

Stemming

Sentence boundary identification

Punctuation elimination

Step 2: Suffix tree construction

Produces base clusters (internal nodes)

Base clusters are scored based on size and phrase score (which depends on length and word quality)

Step 3: Merging base clusters

Highly overlapping clusters are merged
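
The three steps can be sketched as follows. This is a simplified illustration, not the published algorithm: a plain phrase index (word n-grams up to `max_len`) stands in for the suffix tree, stemming is skipped, the score is just document count times phrase length, and the 0.5 overlap threshold, function name, and snippets are assumptions for the example.

```python
import re
from collections import defaultdict

def stc_sketch(snippets, max_len=3, overlap=0.5):
    # Step 1: cleaning (lowercase, split at sentence boundaries, drop punctuation).
    docs = []
    for text in snippets:
        sentences = re.split(r"[.!?]+", text.lower())
        docs.append([re.findall(r"[a-z0-9]+", s) for s in sentences])

    # Step 2: base clusters = phrases shared by at least two documents.
    # (A phrase index replaces the suffix tree in this sketch.)
    phrase_docs = defaultdict(set)
    for doc_id, sentences in enumerate(docs):
        for words in sentences:
            for n in range(1, max_len + 1):
                for i in range(len(words) - n + 1):
                    phrase_docs[" ".join(words[i:i + n])].add(doc_id)
    base = {p: d for p, d in phrase_docs.items() if len(d) >= 2}
    # Simplified score: number of documents times phrase length in words.
    scores = {p: len(d) * len(p.split()) for p, d in base.items()}

    # Step 3: merge base clusters whose document sets overlap heavily.
    phrases = sorted(base, key=lambda p: (-scores[p], p))
    parent = {p: p for p in phrases}

    def find(p):
        while parent[p] != p:
            p = parent[p]
        return p

    for i, a in enumerate(phrases):
        for b in phrases[i + 1:]:
            common = len(base[a] & base[b])
            if common / len(base[a]) > overlap and common / len(base[b]) > overlap:
                parent[find(b)] = find(a)

    merged = defaultdict(lambda: {"phrases": [], "docs": set()})
    for p in phrases:
        root = find(p)
        merged[root]["phrases"].append(p)  # shared phrases double as labels
        merged[root]["docs"] |= base[p]
    return list(merged.values())

# Hypothetical snippets for the query "jaguar".
snippets = [
    "Jaguar car dealer. Best prices!",
    "Used jaguar car for sale",
    "Big cat: the jaguar in the wild",
    "Big cat conservation",
]
clusters = stc_sketch(snippets)
```

The shared phrases that define each cluster also serve as its label, and a document can land in more than one cluster (here the third snippet belongs to both the "jaguar car" and the "big cat" group).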


Carrot2

(Stefanowski and Weiss 2003)

http://www.cs.put.poznan.pl/dweiss/carrot/

Component framework

Allows substituting components for:

Input (e.g. snippets from other search engines)

Filter (stemming, distance measure, clustering)

Output


Vivisimo

Commercial

http://www.vivisimo.com/

Online

Hierarchical

Conceptual


Other

Mapuccino (IBM) (Maarek et al. 2000)

http://www.alphaworks.ibm.com/tech/mapuccino

Relatively efficient AHC (O(n^2))

Similarity based on vector-space model

(Su et al. 2001)

Only usage statistics used as input

Recursive Density Based Clustering

SHOC (Zhang and Dong 2004)

Grouper-like

Key phrase discovery


References

Douglass Cutting, David Karger, Jan Pedersen, and John W. Tukey. Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, 1992. In Proceedings of the 15th Annual International ACM/SIGIR Conference, Copenhagen.

O. Zamir and O. Etzioni. Grouper: A Dynamic Clustering Interface to Web Search Results, May 1999. In Proceedings of the Eighth International World Wide Web Conference, Toronto, Canada.

M. Steinbach, G.

Y.S. Maarek, R. Fagin, I.Z. Ben-Shaul, and D. Pelleg. Ephemeral Document Clustering for Web Applications, 2000. Technical Report RJ 10186, IBM Research.

Zhong Su, Qiang Yang, HongJiang Zhang, Xiaowei Xu, and Yuhen Hu. Correlation-based Document Clustering using Web Logs, 2001.

J. Stefanowski and D. Weiss. Carrot2 and Language Properties in Web Search Results Clustering, 2003. In Lecture Notes in Artificial Intelligence: Advances in Web Intelligence, Proceedings of the First International Atlantic Web Intelligence Conference, Madrid, Spain, vol. 2663, pp. 240-249.

Dell Zhang and Yisheng Dong. Semantic, Hierarchical, Online Clustering of Web Search Results, Apr 2004. In Proceedings of the 6th Asia Pacific Web Conference (APWEB), Hangzhou, China.


Thank you

Questions?

http://www.di.unipi.it/~iwona/Clustering.ppt
