Clustering

 Clustering Web Search Results Iwona Białynicka-Birula


  • Clustering Web Search Results

    Iwona Białynicka-Birula

  • Iwona Białynicka-Birula - Clustering Web Search Results

    Overview

    What is clustering?

    Applying clustering to web search results

    Clustering algorithms

    Case studies

    Related topics not covered

    Clustering
      Clustering in general
      Document clustering in general

    Other search and browsing aids
      Classification
      Visualization
      Query expansion

  • What is clustering?

    Clustering: the act of grouping similar objects into sets

    In the web search context: organizing web pages (search results) into groups, so that different groups correspond to different user needs

    E.g. the query "engine" may refer to a search engine, a car part, or Engine Corp.

  • Clustering vs. Classification

    Classification assigns objects to predefined groups

    Clustering infers groups based on clustered objects

  • Why cluster web search results?

    A flat ranked list is not enough
      Documents pertaining to different topics cannot be compared

    Relationships between the results
      Cluster Hypothesis (van Rijsbergen 1979): Closely related documents tend to be relevant to the same requests.

    Aids user-engine interaction
      Browsing
      Helps users express their needs

  • Why not just document clustering?

    Web search results clustering is a version of document clustering, but:

    Billions of pages

    Constantly changing

    Data mainly unstructured and heterogeneous

    Additional information to consider (e.g. links, click-through data)

  • Some requirements

    Fast
      Immediate response to query

    Flexible
      Web content changes constantly

    User-oriented
      Main goal is to aid the user in finding sought information

  • Main issues

    Online or offline clustering?

    What to use as input?
      Entire documents
      Snippets
      Structure information (links)
      Other data (e.g. click-through)
      Use stop word lists, stemming, etc.

    How to define similarity?
      Content (e.g. vector-space model)
      Link analysis
      Usage statistics

    How to group similar documents?

    How to label the groups?
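As a rough illustration of content-based similarity in the vector-space model, the sketch below turns snippets into term-frequency vectors and compares them with cosine similarity. The stop-word list, the tokenizer, and the sample snippets are assumptions made for this example and are not taken from the slides; stemming is omitted.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not from the slides).
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "to", "is"}

def term_vector(text):
    """Lowercase, tokenize, drop stop words, and count term frequencies."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical search-result snippets for the query "engine".
snippets = [
    "search engine for the web",
    "web search engine ranking",
    "car engine repair parts",
]
vectors = [term_vector(s) for s in snippets]
sim_01 = cosine_similarity(vectors[0], vectors[1])  # two "search engine" snippets
sim_02 = cosine_similarity(vectors[0], vectors[2])  # "search engine" vs. "car engine"
```

Here the two search-engine snippets share three terms and score higher than the search-engine and car-engine pair, which share only "engine". Real systems usually weight terms (for example with TF-IDF) rather than using raw counts.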

  • Clustering algorithms

    Flat or hierarchical?

    Overlapping?

    Hard or soft?

    Incremental?

    Predefined cluster number?

    Requiring explicit similarity measure? Distance measure?

  • Clustering algorithms

    Distance-based
      Hierarchical
        Agglomerative Hierarchical Clustering (AHC)
      Flat
        K-means (can be fuzzy)
        Single-pass (incremental)

    Other
      Suffix Tree Clustering (Grouper)
      Self-organizing (Kohonen) maps (neural networks)
      Latent Semantic Indexing (LSI) (reducing the dimensionality of the vector-space)

  • Agglomerative hierarchical clustering

  • Clustering result: dendrogram

  • AHC variants

    Various ways of calculating cluster similarity
      Single-link (minimum)
      Complete-link (maximum)
      Group-average (average)
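The three AHC variants differ only in how the distance between two clusters is derived from the pairwise distances of their members. The sketch below is a naive illustration on 2-D points with Euclidean distance; the function names and the sample points are assumptions for this example. It recomputes all cluster distances at every step, so it is far slower than the optimized implementations used in practice.

```python
import math
from itertools import combinations

# Each linkage reduces the list of pairwise member distances to one number.
LINKAGES = {
    "single": min,                            # closest pair of members
    "complete": max,                          # farthest pair of members
    "average": lambda ds: sum(ds) / len(ds),  # mean over all pairs
}

def cluster_distance(a, b, linkage):
    """Distance between clusters a and b under the chosen linkage."""
    return LINKAGES[linkage]([math.dist(p, q) for p in a for q in b])

def ahc(points, num_clusters, linkage="single"):
    """Merge the two closest clusters until num_clusters remain.

    Returns the final clusters and the merge history, which is the
    information a dendrogram would display.
    """
    clusters = [[p] for p in points]  # start with one cluster per point
    merges = []
    while len(clusters) > num_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]], linkage),
        )
        distance = cluster_distance(clusters[i], clusters[j], linkage)
        merges.append((clusters[i], clusters[j], distance))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges

# Hypothetical data: two tight pairs and one outlier.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merges = ahc(points, 3, linkage="average")
```

Running the merge loop down to a single cluster (num_clusters=1) records the full merge history, from which the hierarchy can be cut at any level.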

  • K-means clustering (k=3)
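A minimal sketch of the standard k-means loop with k=3 follows, assuming 2-D points, Euclidean distance, and initial centroids supplied by the caller (they are often chosen at random). The data and function names are made up for this illustration.

```python
import math

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, initial_centroids, max_iterations=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = list(initial_centroids)
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data forming three well-separated groups.
points = [(0, 0), (0, 2), (10, 0), (10, 2), (5, 10), (5, 12)]
centroids, clusters = k_means(points, [(0, 0), (10, 0), (5, 10)])
```

The number of clusters is fixed in advance by the number of initial centroids, and the result can depend on where those centroids start.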

  • Single-pass
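Single-pass clustering processes each item once: an item joins the nearest existing cluster if it is close enough, and otherwise starts a new cluster. The sketch below assumes 2-D points, Euclidean distance to the cluster centroid, and a caller-chosen distance threshold; all names and values are illustrative.

```python
import math

def single_pass(points, threshold):
    """Incrementally cluster points in one pass over the input."""
    clusters = []  # each cluster: {"centroid": tuple, "members": list}
    for p in points:
        # Find the existing cluster whose centroid is nearest to p, if any.
        best = min(clusters, key=lambda c: math.dist(p, c["centroid"]), default=None)
        if best is not None and math.dist(p, best["centroid"]) <= threshold:
            best["members"].append(p)
            n = len(best["members"])
            best["centroid"] = tuple(sum(x) / n for x in zip(*best["members"]))
        else:
            # Too far from every cluster: start a new one.
            clusters.append({"centroid": p, "members": [p]})
    return clusters

# Hypothetical stream of points arriving in this order.
stream = [(0, 0), (1, 0), (10, 10), (0, 1), (11, 10)]
result = single_pass(stream, threshold=3)
```

The number of clusters is not fixed in advance, but the outcome depends on the threshold and on the order in which items arrive.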

  • Selected systems

    Scatter/Gather

    Grouper

    Carrot2

    Vivisimo

    Mapuccino

    (Su et al. 2001)

    SHOC

  • Scatter/Gather

    (Cutting et al. 1992)

    Designed for browsing

    Based on two novel clustering algorithms
      Buckshot: fast, for online clustering
      Fractionation: accurate, for offline initial clustering of the entire set
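The general idea behind Buckshot, as described by Cutting et al., is to run a slower hierarchical clustering only on a random sample of about sqrt(k·n) items, then use the resulting k cluster centers as seeds and assign every item to its nearest seed. The sketch below illustrates that idea on 2-D points with Euclidean distance; the function names, the data, and the omission of any later refinement passes are simplifying assumptions.

```python
import math
import random
from itertools import combinations

def centroid(points):
    n = len(points)
    return tuple(sum(x) / n for x in zip(*points))

def group_average_ahc(points, k):
    """Naive group-average agglomerative clustering down to k clusters."""
    clusters = [[p] for p in points]

    def avg_dist(a, b):
        return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: avg_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters

def buckshot(points, k, rng):
    """Seed k clusters from a sqrt(k*n) sample, then assign all points."""
    sample_size = min(len(points), max(k, math.ceil(math.sqrt(k * len(points)))))
    sample = rng.sample(points, sample_size)
    seeds = [centroid(c) for c in group_average_ahc(sample, k)]
    clusters = [[] for _ in seeds]
    for p in points:
        nearest = min(range(len(seeds)), key=lambda i: math.dist(p, seeds[i]))
        clusters[nearest].append(p)
    return seeds, clusters

# Hypothetical data: three groups of four points each.
data = [(x + dx, y + dy) for x, y in [(0, 0), (10, 10), (20, 0)]
        for dx in (0, 1) for dy in (0, 1)]
seeds, clusters = buckshot(data, k=3, rng=random.Random(0))
```

Because the quadratic-cost step runs on only sqrt(k·n) items, the overall cost grows roughly linearly with k·n, which is what makes the approach attractive for online use. The result depends on the random sample.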

  • Grouper

    (Zamir and Etzioni 1997, 1999)

    Online

    Operates on query result snippets

    Clusters together documents with large common subphrases

    Suffix Tree Clustering (STC)

    STC induces labeling

  • Suffix Tree Clustering (STC)

    Linear

    Incremental

    Overlapping

    Can be extended to hierarchical

  • STC algorithm

    Step 1: Cleaning
      Stemming
      Sentence boundary identification
      Punctuation elimination

    Step 2: Suffix tree construction
      Produces base clusters (internal nodes)
      Base clusters are scored based on size and phrase score (which depends on length and word quality)

    Step 3: Merging base clusters
      Highly overlapping clusters are merged
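The three steps can be sketched as follows. This is a simplified illustration, not the STC implementation: it indexes phrases in a plain dictionary instead of building a suffix tree (so it is not linear-time), it skips stemming, and it uses a placeholder score of cluster size times phrase length. The 0.5 overlap threshold, the function names, and the snippets are assumptions for this example.

```python
import re
from collections import defaultdict

def clean(snippet):
    """Step 1: split into sentences, drop punctuation, lowercase (no stemming)."""
    sentences = re.split(r"[.!?;]+", snippet.lower())
    return [re.findall(r"[a-z0-9]+", s) for s in sentences if s.strip()]

def base_clusters(snippets, max_phrase_len=4):
    """Step 2 (simplified): a base cluster is a phrase shared by 2+ documents."""
    phrase_docs = defaultdict(set)
    for doc_id, snippet in enumerate(snippets):
        for words in clean(snippet):
            for start in range(len(words)):
                stop = min(start + max_phrase_len, len(words))
                for end in range(start + 1, stop + 1):
                    phrase_docs[tuple(words[start:end])].add(doc_id)
    scored = [
        (len(docs) * len(phrase), phrase, docs)  # placeholder score
        for phrase, docs in phrase_docs.items()
        if len(docs) >= 2
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored

def merge_base_clusters(bases, overlap=0.5):
    """Step 3: union base clusters whose document sets overlap heavily."""
    parent = list(range(len(bases)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(bases)):
        for j in range(i + 1, len(bases)):
            a, b = bases[i][2], bases[j][2]
            common = len(a & b)
            if common / len(a) > overlap and common / len(b) > overlap:
                parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(len(bases)):
        groups[find(i)].append(i)
    # Label each merged cluster with its highest-scoring phrase.
    return [
        {"label": bases[m[0]][1], "docs": set().union(*(bases[i][2] for i in m))}
        for m in groups.values()
    ]

# Hypothetical snippets returned for the query "jaguar".
snippets = ["jaguar car dealers", "new jaguar car models",
            "jaguar cat habitat", "wild jaguar cat"]
clusters = merge_base_clusters(base_clusters(snippets))
```

The shared phrases double as cluster labels, and a document can belong to several clusters at once, which matches the overlapping, self-labeling behavior attributed to STC on the surrounding slides.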

  • Carrot2

    (Stefanowski and Weiss 2003)

    http://www.cs.put.poznan.pl/dweiss/carrot/

    Component framework

    Allows substituting components for
      Input (e.g. snippets from other search engines)
      Filter
        Stemming
        Distance measure
        Clustering
      Output

  • Vivisimo

    Commercial

    http://www.vivisimo.com/

    Online

    Hierarchical

    Conceptual

  • Other

    Mapuccino (IBM) (Maarek et al. 2000)
      http://www.alphaworks.ibm.com/tech/mapuccino
      Relatively efficient AHC (O(n^2))
      Similarity based on vector-space model

    (Su et al. 2001)
      Only usage statistics used as input
      Recursive Density Based Clustering

    SHOC (Zhang and Dong 2004)
      Grouper-like
      Key phrase discovery

  • References

    Douglass Cutting, David Karger, Jan Pedersen, and John W. Tukey. Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, 1992. In Proceedings of the 15th Annual International ACM/SIGIR Conference, Copenhagen.

    O. Zamir and O. Etzioni. Grouper: A Dynamic Clustering Interface to Web Search Results, May 1999. In Proceedings of the Eighth International World Wide Web Conference, Toronto, Canada.

    M. Steinbach, G.

    Y.S. Maarek, R. Fagin, I.Z. Ben-Shaul, and D. Pelleg. Ephemeral Document Clustering for Web Applications, 2000. Technical Report RJ 10186, IBM Research.

    Zhong Su, Qiang Yang, HongJiang Zhang, Xiaowei Xu, and Yuhen Hu. Correlation-based Document Clustering using Web Logs, 2001.

    J. Stefanowski and D. Weiss. Carrot2 and Language Properties in Web Search Results Clustering, 2003. In Lecture Notes in Artificial Intelligence: Advances in Web Intelligence, Proceedings of the First International Atlantic Web Intelligence Conference, Madrid, Spain, vol. 2663, pp. 240-249.

    Dell Zhang and Yisheng Dong. Semantic, Hierarchical, Online Clustering of Web Search Results, Apr 2004. In Proceedings of the 6th Asia Pacific Web Conference (APWEB), Hangzhou, China.

  • Thank you

    Questions?

    http://www.di.unipi.it/~iwona/Clustering.ppt