Clustering Web Search enginescanf/CS533/hwSpring2015/... · STC : Suffix Tree Clustering, an...

Clustering Web Search Engines

Paper: Web Document Clustering: A Feasibility Demonstration, by Oren Zamir and Oren Etzioni

Presenters: Nabeel Abubaker and Hamzeh Ahangari


Introduction & Motivation

Normal search engines: unordered top-ranked results

◦ Takes time to find the “appropriate” result/document

Clustering search engines: top-ranked results, but clustered

◦ Easier to find your “appropriate” document in the “appropriate” cluster


An Example: The Famous “Jaguar”

A normal search engine might return:

◦ Jaguar, the car in document 1.

◦ Jaguar, the animal in document 2.

◦ Jaguar, the animal in document 3.

◦ Jaguar, the car in document 4.

◦ Jaguar, the animal in document 5.

◦ Jaguar, the car in document 6.

Clustered search result:

◦ Cluster 1: jaguar, the car. 24 results (click to show)

◦ Cluster 2: jaguar, the animal. 18 results (click to show)


Contributions

STC: Suffix Tree Clustering, an incremental, linear-time algorithm.

Identified key requirements for web clustering methods:

◦ Relevance

◦ Browsable summaries

◦ Overlap

◦ Snippet-tolerance

◦ Speed

◦ Incrementality

They created a prototype clustering web search engine called MetaCrawler-STC:

◦ MetaCrawler: normal search engine (actually a metasearch engine, 1994; now called Zoo)

◦ MetaCrawler-STC : Clustering search engine.


Simple architecture

* The clustering should be performed on separate machines for better performance.

Search engine results → Create clusters → Present them to user


Suffix Tree Clustering Algorithm


3 steps:

◦ Cleaning: light stemming

◦ Identifying base clusters: building the suffix tree

◦ Combining base clusters: merging clusters based on similarity

Linear time

No need to define “number of clusters”

Not sensitive to similarity threshold
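The third step, combining base clusters, can be sketched roughly as follows. This is a minimal illustration, not the presenters' or authors' code: it assumes two base clusters are linked when their shared documents exceed a fraction `threshold` of each cluster, and that final clusters are the connected components of the resulting graph. The function name `merge_base_clusters`, the default value 0.5, and the sample phrases and document ids are all assumptions made for the example; the slides give no threshold value.

```python
from itertools import combinations


def merge_base_clusters(base, threshold=0.5):
    """Group base clusters whose document sets overlap heavily.

    `base` maps a label (here a phrase) to a non-empty set of document ids.
    Two base clusters are linked when the documents they share make up more
    than `threshold` of each of them; the connected components of that
    graph become the final clusters.
    """
    labels = list(base)
    parent = {label: label for label in labels}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(labels, 2):
        shared = len(base[a] & base[b])
        if shared / len(base[a]) > threshold and shared / len(base[b]) > threshold:
            parent[find(a)] = find(b)

    merged = {}
    for label in labels:
        group = merged.setdefault(find(label), {"phrases": [], "docs": set()})
        group["phrases"].append(label)
        group["docs"] |= base[label]
    return list(merged.values())


# Hypothetical base clusters for the "jaguar" query (illustrative data only).
base = {
    "jaguar car": {1, 4, 6},
    "car dealer": {1, 4},
    "big cat": {2, 3, 5},
    "rain forest": {2, 3},
}
result = merge_base_clusters(base)
for group in result:
    print(group["phrases"], sorted(group["docs"]))
```

Because a document can belong to several base clusters, merged clusters built this way may share documents, which matches the overlap requirement listed earlier. The number of clusters falls out of the graph rather than being chosen up front.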


Suffix Tree Clustering Algorithm


The suffix tree of the strings “cat ate cheese”, “mouse ate cheese too” and “cat ate mouse too”.

Multi-word phrase, not single word

Overlap

Incremental

Order independent
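To make the example strings concrete, the sketch below finds the phrases shared by at least two of the three documents. It is a simplified stand-in for the suffix tree, not the tree itself: it enumerates every prefix of every word-level suffix, which takes quadratic time per document rather than linear time. The name `base_clusters` and the `min_docs` parameter are assumptions made for this illustration.

```python
from collections import defaultdict


def base_clusters(docs, min_docs=2):
    """Map each word phrase to the set of document ids that contain it.

    Every prefix of every word-level suffix is recorded, which covers the
    phrases a word-level suffix tree can represent. A phrase becomes a base
    cluster when at least `min_docs` documents share it.
    """
    phrase_docs = defaultdict(set)
    for doc_id, text in enumerate(docs, start=1):
        words = text.lower().split()
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                phrase_docs[" ".join(words[start:end])].add(doc_id)
    return {p: ids for p, ids in phrase_docs.items() if len(ids) >= min_docs}


docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
clusters = base_clusters(docs)
for phrase in sorted(clusters):
    print(phrase, sorted(clusters[phrase]))
```

The multi-word phrases "cat ate" (documents 1 and 3) and "ate cheese" (documents 1 and 2) show up as base clusters alongside single words, and document 1 belongs to both, which illustrates the overlap property. This naive enumeration also reports "cat" separately from "cat ate" even though both cover the same documents; a compact suffix tree would represent them with a single node.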


Experiments


Critical reasons for STC's success:

◦ Multi-word phrases

◦ Overlap


Experiments


Impact of multi-word phrases:

◦ STC-no-phrases

Impact of overlap:

◦ STC-no-overlap


Experiments


Improving other algorithms with multi-word phrases vs. single words

+/- impact:

◦ Big impact on STC

◦ Not dramatic for others

◦ Can’t be plugged in!

◦ An inextricable part of STC


Experiments


Improving other clustering algorithms with overlapping vs. non-overlapping clusters

+/- impact:

◦ Big impact on STC

◦ Not dramatic for others

◦ Can’t be plugged in!

◦ An inextricable part of STC


Experiments


Snippets vs whole document

Decrease in quality is apparent but small


Reference(s)


Zamir, Oren, and Oren Etzioni. "Web document clustering: A feasibility demonstration." Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 1998.


Thank you
