Web search-metrics-tutorial-


Web Search Engine Metrics for Measuring User Satisfaction [Section 3 of 7: Coverage]

Ali Dasdan, eBay

Kostas Tsioutsiouliklis, Yahoo!

Emre Velipasaoglu, Yahoo!

With contributions from Prasad Kantamneni, Yahoo!

27 Apr 2010

(Update in Aug 2015: The authors work in different companies now.)


Tutorial @ 19th International World Wide Web Conference

http://www2010.org/

April 26-30, 2010

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009-2010.

Disclaimers

•  This talk presents the opinions of the authors. It does not necessarily reflect the views of our employers.

•  This talk does not imply that these metrics are used by our employers. If they are used, they may not be used in the way described in this talk.

•  The examples are just that – examples. Please do not generalize them to the level of comparing search engines.


Coverage Metrics Section 3/7

of WWW’10 Tutorial on Web Search Engine Metrics

by A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu


Example on coverage: URL was not found



Example on coverage: But content was found under different URLs



Example on coverage: URL was also found after some time



Definitions for coverage

•  Coverage refers to presence of content of interest in a catalog.

•  Coverage ratio – defined as the ratio of the number of documents (pages) found to the number of documents (pages) tested

–  Can be represented as a distribution when many document attributes are considered together
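
The definition above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the record layout (a boolean `found` field plus attribute fields such as `lang`) and both function names are assumptions made for the example.

```python
from collections import defaultdict

def coverage_ratio(found_flags):
    """Ratio of documents found to documents tested."""
    flags = list(found_flags)
    if not flags:
        raise ValueError("no documents tested")
    return sum(flags) / len(flags)

def coverage_by_attribute(records, attribute):
    """Coverage ratio per attribute value, i.e. coverage as a distribution.

    `records` is assumed to be a list of dicts with a boolean 'found' key.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[attribute]].append(rec["found"])
    return {value: coverage_ratio(flags) for value, flags in groups.items()}

# Hypothetical test outcomes for four documents.
records = [
    {"lang": "en", "found": True},
    {"lang": "en", "found": False},
    {"lang": "de", "found": True},
    {"lang": "en", "found": True},
]
overall = coverage_ratio(r["found"] for r in records)   # 3 of 4 found
per_lang = coverage_by_attribute(records, "lang")       # one ratio per language
```

Grouping by several attributes at once (language, site, page age) gives the multi-attribute distribution mentioned in the last bullet.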



Some background: Shingling and Jaccard Index


Doc = (a b c d e) (5 terms)
2-grams: (a b, b c, c d, d e)
Shingles for 2-grams (after hashing them): 10, 3, 7, 16
Min shingle: 3 (used as a signature of Doc)

Doc1 = (a b c d e)
Doc2 = (a e f g)
Doc1 ∩ Doc2 = (a e)
Doc1 ∪ Doc2 = (a b c d e f g)
Jaccard index = |Doc1 ∩ Doc2| / |Doc1 ∪ Doc2| = 2 / 7 ≈ 29%
(Broder’s shingling method estimates this index.)
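
The worked example above can be reproduced with a short sketch. The hash function (truncated SHA-1) is an assumption chosen only for stability across runs, so the hashed shingle values will differ from the illustrative 10, 3, 7, 16 on the slide. The function names are likewise hypothetical.

```python
import hashlib

def shingles(terms, k=2):
    """All k-grams (as tuples) over a sequence of terms, in document order."""
    return [tuple(terms[i:i + k]) for i in range(len(terms) - k + 1)]

def shingle_hash(gram):
    """Stable 64-bit hash of one k-gram (the hash choice is an assumption)."""
    digest = hashlib.sha1(" ".join(gram).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def min_shingle(terms, k=2):
    """Smallest hashed shingle, usable as a one-value document signature."""
    return min(shingle_hash(g) for g in shingles(terms, k))

def jaccard(a, b):
    """|A intersect B| / |A union B| over two collections of items."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

doc1 = "a b c d e".split()
doc2 = "a e f g".split()

grams = shingles(doc1)          # the four 2-grams of Doc1
signature = min_shingle(doc1)   # deterministic for a given term sequence
index = jaccard(doc1, doc2)     # 2 shared terms out of 7 distinct terms
```

As on the slide, the Jaccard index here is computed over terms. Applying `jaccard` to the two documents' shingle sets instead gives the resemblance measure that shingling is meant to estimate.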



How to measure coverage

•  Given an input document with its URL

•  Query by URL (QBU)

–  enter URL at the target search engine’s query interface

–  if the URL is not found, then iterate using “normalized” forms of the same URL

•  Query by content (QBC)

–  if URL is not given or URL search has failed, then perform this search

–  generate a set of queries (called strong queries) from the document

–  submit the queries to the target search engine’s query interface

–  combine the returned results

–  perform a more thorough similarity check between the returned documents and the input document

•  Compute coverage ratio over multiple documents
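
The procedure above can be sketched as a small driver that tries QBU first and falls back to QBC. Everything engine-specific is passed in as a callable, because the slides do not name an API: `search_url`, `url_variants`, `make_queries`, `search`, and `similar` are all hypothetical hooks, and the in-memory `index` below merely stands in for a real search engine.

```python
def found_by_url(url, search_url, url_variants):
    """QBU: try the URL itself, then its normalized forms."""
    if not url:
        return False
    return any(search_url(u) for u in [url, *url_variants(url)])

def found_by_content(text, make_queries, search, similar):
    """QBC: issue strong queries, pool the results, then verify similarity."""
    candidates = []
    for query in make_queries(text):
        candidates.extend(search(query))
    return any(similar(text, candidate) for candidate in candidates)

def measure_coverage(docs, search_url, url_variants, make_queries, search, similar):
    """Coverage ratio over documents given as {'url': ..., 'text': ...} dicts."""
    docs = list(docs)
    if not docs:
        raise ValueError("no documents tested")
    found = sum(
        found_by_url(d.get("url"), search_url, url_variants)
        or found_by_content(d["text"], make_queries, search, similar)
        for d in docs
    )
    return found / len(docs)

# Toy stand-in for a search engine: URL -> indexed text.
index = {
    "http://example.com/a": "alpha beta gamma delta",
    "http://mirror.example.org/b": "one two three four five",
}
engine = dict(
    search_url=lambda u: u in index,
    url_variants=lambda u: [u.rstrip("/")],
    make_queries=lambda text: [" ".join(text.split()[:3])],
    search=lambda q: [t for t in index.values() if q in t],
    similar=lambda a, b: a == b,
)
docs = [
    {"url": "http://example.com/a/", "text": "alpha beta gamma delta"},  # found via normalized URL
    {"url": "http://example.com/b", "text": "one two three four five"},  # found via content only
    {"url": None, "text": "never indexed text here"},                    # not found
]
ratio = measure_coverage(docs, **engine)
```

The three toy documents mirror the three screenshot examples earlier in this section: a URL hit, a content hit under a different URL, and a miss.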



Query-by-Content flowchart


String signature: Terms from page

Strings combined into queries

Similarity check using shingles

Search results extraction


Query by content: How to generate queries

•  Select sequences of terms by frequency

–  terms with the lowest frequency or highest TF-IDF

•  Select sequences of terms by position

–  +/- two terms at every 5th term

•  Select sequences of terms randomly

–  find a sequence of consecutive terms randomly in the document

•  Select sequences of terms randomly via shingles

–  find the document’s shingles signature

–  find the corresponding sequences of terms

–  This method produces the same query signature for the same document, as opposed to the method above.

–  This method also beats the method above.
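
Three of the selection strategies above can be sketched as follows. This is an illustration under stated assumptions rather than the published method: frequency is counted within the document only (no corpus-level IDF), the window and step sizes follow the "+/- two terms at every 5th term" bullet, and the shingle variant keeps the `n` term sequences with the smallest SHA-1 hashes. All function names and default parameters are hypothetical.

```python
import hashlib
from collections import Counter

def by_frequency(terms, n=5):
    """The n rarest terms in the document, returned in document order."""
    counts = Counter(terms)
    first_pos = {}
    for i, term in enumerate(terms):
        first_pos.setdefault(term, i)
    # Ties on count are broken by first position so the result is deterministic.
    rarest = sorted(counts, key=lambda t: (counts[t], first_pos[t]))[:n]
    return sorted(rarest, key=first_pos.get)

def by_position(terms, step=5, radius=2):
    """Windows of +/- radius terms centred on every step-th term."""
    queries = []
    for center in range(step - 1, len(terms), step):
        lo = max(0, center - radius)
        queries.append(terms[lo:center + radius + 1])
    return queries

def by_shingles(terms, k=4, n=3):
    """The n k-term sequences whose hashes are smallest.

    The choice looks random with respect to content, yet the same document
    always yields the same queries.
    """
    grams = {tuple(terms[i:i + k]) for i in range(len(terms) - k + 1)}
    def hashed(gram):
        return hashlib.sha1(" ".join(gram).encode("utf-8")).hexdigest()
    return [list(g) for g in sorted(grams, key=hashed)[:n]]

terms = [f"t{i}" for i in range(1, 13)]            # t1 .. t12
windows = by_position(terms)                        # windows around t5 and t10
rare = by_frequency("the cat sat on the mat the end".split(), n=3)
stable = by_shingles(terms) == by_shingles(list(terms))
```

Each selected sequence would typically be joined with spaces and submitted as a quoted phrase; whether a given search engine honors phrase quoting is an assumption to check per engine.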



Further issues to consider

•  URL normalization

–  Example: wikipedia.org/wiki/Casino_Royale or wikipedia.org/?title=Casino_Royale

–  see Dasgupta, Kumar, and Sasturkar (2008)

•  Page templates and ads

–  or how to avoid undesired matches

•  Search for non-textual content

–  images, mathematical formulas, tables and other similar structures

•  Definition of content similarity

•  Syntactic vs. semantic match

•  How to balance coverage against other objectives

–  E.g., what if a page is found at position 1 vs. position 10?
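
For the URL normalization issue above, a minimal sketch using Python's standard `urllib.parse` is shown here. The generic rules (default scheme, lowercase host, dropped default port and fragment, stripped trailing slash, sorted query parameters) are common conventions, not rules taken from the cited paper, and `title_rule` is a hypothetical hand-written rewrite that covers only the Casino_Royale example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize(url):
    """Generic syntactic normalization of a URL string."""
    if "://" not in url:
        url = "http://" + url              # assumed default scheme
    parts = urlsplit(url)
    host = parts.hostname or ""            # urlsplit lowercases the hostname
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(parts.scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path.rstrip("/") or "/"
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((parts.scheme, netloc, path, query, ""))

def title_rule(url):
    """Hypothetical site-specific rewrite: /?title=X becomes /wiki/X."""
    parts = urlsplit(normalize(url))
    params = dict(parse_qsl(parts.query))
    if parts.path == "/" and set(params) == {"title"}:
        return urlunsplit((parts.scheme, parts.netloc, "/wiki/" + params["title"], "", ""))
    return normalize(url)

a = title_rule("wikipedia.org/wiki/Casino_Royale")
b = title_rule("wikipedia.org/?title=Casino_Royale")
same_page = a == b
```

Hand-written rules like `title_rule` do not scale across sites, which is the motivation for learning rewrite rules automatically as in Dasgupta, Kumar, and Sasturkar (2008).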



Key problems

•  Measure web growth in general and along any dimension

•  Compare search engines automatically and reliably

•  Improve content-based search, including semantic-similarity search

•  Improve copy detection methods for quality and performance, including URL based copy detection



Reference review on coverage metrics

•  Luhn (1957)

–  summarizes an input document by selecting terms or sentences by frequency

–  Bharat and Broder (1998) discovered the same method independently for a different purpose.

–  Aside: Luhn also invented hashing with chaining.

•  Bar-Yossef and Gurevich (2008)

–  introduces improved methods to randomly sample pages from a search engine’s index using its public query interface, a problem introduced by Bharat and Broder (1998)



Reference review on coverage metrics

•  Dasdan et al. (2009), Pereira and Ziviani (2004)

–  represents an input document by selecting (sequences of) terms randomly or by frequency

–  uses the term-based document signature as queries (called strong queries) for similarity search

–  Yang et al. (2009) proposes similar methods for blog search.

–  Dasdan et al. (2009) proposes random sequence selection via shingling.

•  Olston and Najork (2010)

–  gives a detailed survey of web crawling

–  discusses how to optimize for both coverage and freshness in a web crawler



References

•  Z. Bar-Yossef and M. Gurevich (2008), Random sampling from a search engine’s index, J. ACM, 55(5).

•  K. Bharat and A. Broder (1998), A technique for measuring the relative size and overlap of public Web search engines, WWW’98.

•  S. Brin, J. Davis, and H. Garcia-Molina (1995), Copy detection mechanisms for digital documents, SIGMOD’95.

•  A. Dasdan, P. D’Alberto, C. Drome, and S. Kolay (2009), Automating retrieval for similar content using search engine query interface, CIKM’09.

•  A. Dasgupta, R. Kumar, and A. Sasturkar (2008), De-duping URLs via Rewrite Rules, KDD’08.

–  Also Koppula et al. (2009), Learning URL patterns from web page de-duplication, WSDM’10.

•  H. Luhn (1957), A statistical approach to mechanized encoding and searching of literary information, IBM J. Research and Dev., 1(4):309–317.

•  H. P. Luhn (1958), The automatic creation of literature abstracts, IBM J. Research and Dev., 2(2).

•  C. Olston and M. Najork (2010), Web crawling, Chapter in Foundations and Trends in Information Retrieval, 4(3):175–246.

•  A.R. Pereira Jr. and N. Ziviani (2004), Retrieving similar documents from the Web, J. Web Engineering, 2(4):247–261.

•  Y. Yang, N. Bansal, W. Dakka, P. Ipeirotis, N. Koudas, and D. Papadias (2009), Query by document, WSDM’09.
