![Page 1: Web search-metrics-tutorial-](https://reader031.fdocuments.net/reader031/viewer/2022030304/58794f4c1a28abb1418b5877/html5/thumbnails/1.jpg)
Web Search Engine Metrics for Measuring User
Satisfaction [Section 3 of 7: Coverage]
Ali Dasdan, eBay
Kostas Tsioutsiouliklis, Yahoo!
Emre Velipasaoglu, Yahoo!
With contributions from Prasad Kantamneni, Yahoo!
27 Apr 2010
(Update in Aug 2015: The authors work in different companies now.)
![Page 2: Web search-metrics-tutorial-](https://reader031.fdocuments.net/reader031/viewer/2022030304/58794f4c1a28abb1418b5877/html5/thumbnails/2.jpg)
Tutorial @
19th International World Wide Web
Conference
http://www2010.org/
April 26-30, 2010
![Page 3: Web search-metrics-tutorial-](https://reader031.fdocuments.net/reader031/viewer/2022030304/58794f4c1a28abb1418b5877/html5/thumbnails/3.jpg)
© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009-2010.
Disclaimers
• This talk presents the opinions of the authors. It does not necessarily reflect the views of our employers.
• This talk does not imply that these metrics are used by our employers; if they are used, they may not be used in the way described in this talk.
• The examples are just that – examples. Please do not generalize them to the level of comparing search engines.
![Page 4: Web search-metrics-tutorial-](https://reader031.fdocuments.net/reader031/viewer/2022030304/58794f4c1a28abb1418b5877/html5/thumbnails/4.jpg)
Coverage Metrics Section 3/7
of WWW’10 Tutorial on Web Search Engine Metrics
by A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu
![Page 5: Web search-metrics-tutorial-](https://reader031.fdocuments.net/reader031/viewer/2022030304/58794f4c1a28abb1418b5877/html5/thumbnails/5.jpg)
Example on coverage: URL was not found
![Page 6: Web search-metrics-tutorial-](https://reader031.fdocuments.net/reader031/viewer/2022030304/58794f4c1a28abb1418b5877/html5/thumbnails/6.jpg)
Example on coverage: But content was found under different URLs
![Page 7: Web search-metrics-tutorial-](https://reader031.fdocuments.net/reader031/viewer/2022030304/58794f4c1a28abb1418b5877/html5/thumbnails/7.jpg)
Example on coverage: URL was also found after some time
![Page 8: Web search-metrics-tutorial-](https://reader031.fdocuments.net/reader031/viewer/2022030304/58794f4c1a28abb1418b5877/html5/thumbnails/8.jpg)
Definitions for coverage
• Coverage refers to the presence of content of interest in a catalog.
• Coverage ratio – defined as the ratio of the number of documents (pages) found to the number of documents (pages) tested
  – Can be represented as a distribution when many document attributes are considered together
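As a minimal sketch of the definition above, the coverage ratio can be computed overall and then broken down by one document attribute to get a distribution. The record layout (`found`, `lang`) and the sample URLs are assumptions made for illustration only; they do not come from the tutorial.

```python
from collections import defaultdict


def coverage_ratio(found_flags):
    """Ratio of documents found to documents tested."""
    flags = list(found_flags)
    if not flags:
        raise ValueError("no documents tested")
    return sum(flags) / len(flags)


def coverage_by_attribute(records, attribute):
    """Coverage ratio per value of one attribute (a simple distribution)."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[attribute]].append(rec["found"])
    return {value: coverage_ratio(flags) for value, flags in groups.items()}


# Hypothetical test set: each record says whether the page was found.
records = [
    {"url": "http://example.com/a", "lang": "en", "found": True},
    {"url": "http://example.com/b", "lang": "en", "found": False},
    {"url": "http://example.org/c", "lang": "de", "found": True},
    {"url": "http://example.org/d", "lang": "en", "found": True},
]

overall = coverage_ratio(rec["found"] for rec in records)
by_lang = coverage_by_attribute(records, "lang")
print(overall, by_lang)
```

Grouping by several attributes at once (for example language and page type) would follow the same pattern with a tuple key.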
![Page 9: Web search-metrics-tutorial-](https://reader031.fdocuments.net/reader031/viewer/2022030304/58794f4c1a28abb1418b5877/html5/thumbnails/9.jpg)
Some background: Shingling and Jaccard Index
Doc = (a b c d e) (5 terms)
2-grams: (a b, b c, c d, d e)
Shingles for 2-grams (after hashing them): 10, 3, 7, 16
Min shingle: 3 (used as a signature of Doc)

Doc1 = (a b c d e)
Doc2 = (a e f g)
Doc1 ∩ Doc2 = (a e)
Doc1 ∪ Doc2 = (a b c d e f g)
Jaccard index = |Doc1 ∩ Doc2| / |Doc1 ∪ Doc2| = 2 / 7 ≈ 30%
(Broder’s shingling method estimates this index.)
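The example above can be reproduced with a short sketch. The choice of SHA-1 truncated to 32 bits as the shingle hash is an assumption for illustration, so the hash values will not match the 10, 3, 7, 16 shown on the slide. As on the slide, the Jaccard index here is taken over term sets.

```python
import hashlib


def ngrams(terms, n=2):
    """All sequences of n consecutive terms."""
    return [tuple(terms[i:i + n]) for i in range(len(terms) - n + 1)]


def shingle(gram, bits=32):
    """Hash one n-gram to an integer (hash choice is an assumption)."""
    digest = hashlib.sha1(" ".join(gram).encode("utf-8")).digest()
    return int.from_bytes(digest[:bits // 8], "big")


def min_shingle(terms, n=2):
    """Smallest shingle, usable as a one-value signature of the document."""
    return min(shingle(gram) for gram in ngrams(terms, n))


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc1 = "a b c d e".split()
doc2 = "a e f g".split()

print(ngrams(doc1))         # the four 2-grams of Doc1
print(min_shingle(doc1))    # signature; value depends on the hash used
print(jaccard(doc1, doc2))  # 2 / 7
```

Note that `min_shingle` raises `ValueError` for a document with fewer than `n` terms; a real implementation would need a policy for such short inputs.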
![Page 10: Web search-metrics-tutorial-](https://reader031.fdocuments.net/reader031/viewer/2022030304/58794f4c1a28abb1418b5877/html5/thumbnails/10.jpg)
How to measure coverage
• Given an input document with its URL
• Query by URL (QBU)
  – enter URL at the target search engine’s query interface
  – if the URL is not found, then iterate using “normalized” forms of the same URL
• Query by content (QBC)
  – if URL is not given or URL search has failed, then perform this search
  – generate a set of queries (called strong queries) from the document
  – submit the queries to the target search engine’s query interface
  – combine the returned results
  – perform a more thorough similarity check between the returned documents and the input document
• Compute coverage ratio over multiple documents
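The procedure above can be sketched as a small driver that tries QBU first and falls back to QBC. Everything the driver calls is passed in as a parameter, and the `toy_*` helpers and the in-memory `INDEX` are hypothetical stand-ins for a real search engine interface, URL normalizer, query generator, and similarity check.

```python
def query_by_url(url, search, url_variants):
    """QBU: try the URL itself, then its normalized forms."""
    for candidate in [url, *url_variants(url)]:
        if any(result_url == candidate for result_url, _ in search(candidate)):
            return True
    return False


def query_by_content(text, search, strong_queries, is_similar):
    """QBC: submit strong queries, combine results, then check similarity."""
    combined = {}
    for query in strong_queries(text):
        for result_url, result_text in search(query):
            combined.setdefault(result_url, result_text)
    return any(is_similar(text, other) for other in combined.values())


def measure_coverage(documents, search, url_variants, strong_queries, is_similar):
    """Coverage ratio over (url_or_None, text) pairs."""
    found = 0
    for url, text in documents:
        if url is not None and query_by_url(url, search, url_variants):
            found += 1
        elif query_by_content(text, search, strong_queries, is_similar):
            found += 1
    return found / len(documents) if documents else 0.0


# --- Hypothetical stand-ins, for illustration only ---
INDEX = {
    "example.com/a": "alpha beta gamma delta",
    "example.com/copy-of-b": "one two three four five",
}


def toy_search(query):
    return [(u, t) for u, t in INDEX.items() if query == u or query in t]


def toy_variants(url):
    return [url[4:]] if url.startswith("www.") else []


def toy_strong_queries(text):
    return [" ".join(text.split()[:3])]


def toy_is_similar(a, b, threshold=0.8):
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return bool(union) and len(sa & sb) / len(union) >= threshold


documents = [
    ("www.example.com/a", "alpha beta gamma delta"),   # found via normalized URL
    ("example.com/b", "one two three four five"),      # found via content only
    (None, "nothing like this here"),                  # not found
]
ratio = measure_coverage(documents, toy_search, toy_variants,
                         toy_strong_queries, toy_is_similar)
print(ratio)
```

Keeping the search interface behind a callable makes it possible to run the same measurement against several engines, which is what the comparison use case needs.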
![Page 11: Web search-metrics-tutorial-](https://reader031.fdocuments.net/reader031/viewer/2022030304/58794f4c1a28abb1418b5877/html5/thumbnails/11.jpg)
Query-by-Content flowchart
[Flowchart] String signature (terms from page) → Strings combined into queries → Search results extraction → Similarity check using shingles
![Page 12: Web search-metrics-tutorial-](https://reader031.fdocuments.net/reader031/viewer/2022030304/58794f4c1a28abb1418b5877/html5/thumbnails/12.jpg)
Query by content: How to generate queries
• Select sequences of terms by frequency
  – terms with the lowest frequency or highest TF-IDF
• Select sequences of terms by position
  – +/- two terms at every 5th term
• Select sequences of terms randomly
  – find a sequence of consecutive terms randomly in the document
• Select sequences of terms randomly via shingles
  – find the document’s shingles signature
  – find the corresponding sequences of terms
  – This method produces the same query signature for the same document, as opposed to the method above.
  – This method also beats the method above.
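Two of the selection strategies above can be sketched as follows. This is one possible reading of the slide, not the authors' implementation: "via shingles" is interpreted as keeping the n-grams with the smallest hash values, and the parameters `n`, `k`, `step`, `radius` and the SHA-1 hash are assumptions.

```python
import hashlib


def by_position(terms, step=5, radius=2):
    """Take +/- `radius` terms around every `step`-th term."""
    queries = []
    for center in range(step - 1, len(terms), step):
        lo = max(0, center - radius)
        queries.append(" ".join(terms[lo:center + radius + 1]))
    return queries


def via_shingles(terms, n=4, k=3):
    """Keep the k n-grams whose hashes are smallest.

    The hash acts as a fixed pseudo-random order, so the same document
    always yields the same queries.
    """
    grams = {tuple(terms[i:i + n]) for i in range(len(terms) - n + 1)}
    ranked = sorted(
        grams,
        key=lambda g: hashlib.sha1(" ".join(g).encode("utf-8")).hexdigest(),
    )
    return [" ".join(g) for g in ranked[:k]]


terms = [f"t{i}" for i in range(1, 13)]  # t1 .. t12
print(by_position(terms))
print(via_shingles(terms))
```

Unlike drawing a random offset on each run, the hash-based selection is repeatable, which is the property the slide attributes to the shingle method.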
![Page 13: Web search-metrics-tutorial-](https://reader031.fdocuments.net/reader031/viewer/2022030304/58794f4c1a28abb1418b5877/html5/thumbnails/13.jpg)
Further issues to consider
• URL normalization
  – Example: wikipedia.org/wiki/Casino_Royale or wikipedia.org/?title=Casino_Royale
  – see Dasgupta, Kumar, and Sasturkar (2008)
• Page templates and ads
  – or how to avoid undesired matches
• Search for non-textual content
  – images, mathematical formulas, tables and other similar structures
• Definition of content similarity
• Syntactic vs. semantic match
• How to balance coverage against other objectives
  – E.g., what if a page is found at position 1 vs. position 10?
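For the URL normalization issue, a minimal sketch of generating candidate forms of a URL is shown below. The generic variations (scheme, `www.` prefix, trailing slash) and the hand-written `wiki_title_rule` are assumptions for illustration; the cited work learns such rewrite rules automatically rather than hard-coding them.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit


def _split(url):
    """Parse a URL, assuming http when no scheme is given."""
    return urlsplit(url if "://" in url else "http://" + url)


def normalized_forms(url):
    """Generic candidate forms: scheme, www prefix, trailing slash."""
    parts = _split(url)
    host = parts.netloc.lower()
    hosts = {host, host[4:] if host.startswith("www.") else "www." + host}
    path = parts.path or "/"
    paths = {path.rstrip("/") or "/", path if path.endswith("/") else path + "/"}
    return sorted(
        urlunsplit((scheme, h, p, parts.query, ""))
        for scheme in ("http", "https")
        for h in hosts
        for p in paths
    )


def wiki_title_rule(url):
    """Hypothetical site-specific rule: /?title=X  ->  /wiki/X."""
    parts = _split(url)
    title = parse_qs(parts.query).get("title")
    if parts.path in ("", "/") and title:
        return urlunsplit((parts.scheme, parts.netloc, "/wiki/" + title[0], "", ""))
    return url


forms = normalized_forms("wikipedia.org/wiki/Casino_Royale")
rewritten = wiki_title_rule("wikipedia.org/?title=Casino_Royale")
print(rewritten in forms)
```

With the rule applied, both spellings from the example map into the same candidate set, so a query-by-URL check can treat them as one page.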
![Page 14: Web search-metrics-tutorial-](https://reader031.fdocuments.net/reader031/viewer/2022030304/58794f4c1a28abb1418b5877/html5/thumbnails/14.jpg)
Key problems
• Measure web growth in general and along any dimension
• Compare search engines automatically and reliably
• Improve content-based search, including semantic-similarity search
• Improve copy detection methods for quality and performance, including URL-based copy detection
![Page 15: Web search-metrics-tutorial-](https://reader031.fdocuments.net/reader031/viewer/2022030304/58794f4c1a28abb1418b5877/html5/thumbnails/15.jpg)
Reference review on coverage metrics
• Luhn (1957)
  – summarizes an input document by selecting terms or sentences by frequency
  – Bharat and Broder (1998) discovered the same method independently for a different purpose.
  – Aside: Luhn also invented hashing with chaining.
• Bar-Yossef and Gurevich (2008)
  – introduces improved methods to randomly sample pages from a search engine’s index using its public query interface, a problem introduced by Bharat and Broder (1998)
![Page 16: Web search-metrics-tutorial-](https://reader031.fdocuments.net/reader031/viewer/2022030304/58794f4c1a28abb1418b5877/html5/thumbnails/16.jpg)
Reference review on coverage metrics
• Dasdan et al. (2009), Pereira and Ziviani (2004)
  – represents an input document by selecting (sequences of) terms randomly or by frequency
  – uses the term-based document signature as queries (called strong queries) for similarity search
  – Yang et al. (2009) proposes similar methods for blog search.
  – Dasdan et al. (2009) proposes random sequence selection via shingling.
• Olston and Najork (2010)
  – gives a detailed survey of web crawling
  – discusses how to optimize for both coverage and freshness in a web crawler
![Page 17: Web search-metrics-tutorial-](https://reader031.fdocuments.net/reader031/viewer/2022030304/58794f4c1a28abb1418b5877/html5/thumbnails/17.jpg)
References
• Z. Bar-Yossef and M. Gurevich (2008), Random sampling from a search engine’s index, J. ACM, 55(5).
• K. Bharat, A. Broder (1998), A technique for measuring the relative size and overlap of public Web search engines, WWW’98.
• S. Brin, J. Davis, and H. Garcia-Molina (1995), Copy detection mechanisms for digital documents, SIGMOD’95.
• A. Dasdan, P. D’Alberto, C. Drome, and S. Kolay (2009), Automating retrieval for similar content using search engine query interface, CIKM’09.
• A. Dasgupta, R. Kumar, and A. Sasturkar (2008), De-duping URLs via Rewrite Rules, KDD’08.
  – Also Koppula et al. (2009), Learning URL patterns from web page de-duplication, WSDM’10.
• H. Luhn (1957), A statistical approach to mechanized encoding and searching of literary information, IBM J. Research and Dev., 1(4):309–317.
• H. P. Luhn (1958), The automatic creation of literature abstracts, IBM J. Research and Dev., 2(2).
• C. Olston and M. Najork (2010), Web crawling, Chapter in Foundations and Trends in Information Retrieval, 4(3):175–246.
• A.R. Pereira Jr. and N. Ziviani (2004), Retrieving similar documents from the Web, J. Web Engineering, 2(4):247-261.
• Y. Yang, N. Bansal, W. Dakka, P. Ipeirotis, N. Koudas, D. Papadias (2009), Query by document, WSDM’09.