Web search-metrics-tutorial-

21
1 Web Search Engine Metrics for Measuring User Satisfaction [Section 6 of 7: Freshness] Ali Dasdan, eBay Kostas Tsioutsiouliklis, Yahoo! Emre Velipasaoglu, Yahoo! With contributions from Prasad Kantamneni, Yahoo! 27 Apr 2010 (Update in Aug 2015: The authors work in different companies now.)

Transcript of Web search-metrics-tutorial-

1

Web Search Engine Metrics for Measuring User

Satisfaction [Section 6 of 7: Freshness]

Ali Dasdan, eBay

Kostas Tsioutsiouliklis, Yahoo!

Emre Velipasaoglu, Yahoo!

With contributions from Prasad Kantamneni, Yahoo!

27 Apr 2010

(Update in Aug 2015: The authors work in different companies now.)

2

Tutorial @

19th International World Wide Web

Conference

http://www2010.org/

April 26-30, 2010

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009-2010.

Disclaimers

•  This talk presents the opinions of the authors. It does not necessarily reflect the views of our employers.

•  This talk does not imply that these metrics are used by our employers, or should they be used, they may not be used in the way described in this talk.

•  The examples are just that – examples. Please do not generalize them to the level of comparing search engines.

3

4

Freshness Metrics Section 6/7

of WWW’10 Tutorial on Web Search Engine Metrics

by A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009-2010.

Example on freshness: Stale abstract in Search Results Page

5

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009-2010.

Example on freshness: Actual page content

6

http://en.wikipedia.org/wiki/John_Yoo:

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009-2010.

Example on freshness: Fresh abstract now

7

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009-2010.

Definitions illustrated for a page

8

(Dasdan and Huynh, WWW’09)

Last sync Page is up-to-date or fresh until time 3.

CRAWLED

0

TIME

1 2 3 4 5 6

MODIFIED MODIFIED

INDEXED CLICKED

AGE=3

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009-2010.

Definitions illustrated for a page

9

CRAWLED

0

TIME

1 2 3 4 5 6

MODIFIED MODIFIED

INDEXED CLICKED

TIME

FRESHNESS

AGE 0

3

1

(Dasdan and Huynh, WWW’09)

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009-2010.

Freshness and age of a page

•  The freshness F(p,t) of a local page p at time t is – 1 if p is up-to-date at time t – 0 otherwise

•  The age A(p,t) of a local page p at time t is – 0 if p is up-to-date at time t –  t−tmod otherwise, where tmod is the time of

the first modification after the last sync of p.

10

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009-2010.

Freshness and age of a catalog

•  S: catalog of documents •  Sc: catalog of clicked documents •  Basic freshness and age •  Unweighted freshness and age

•  Weighted freshness and age (c(): #clicks)

11

F (S, t) =1

|S|�

p�S

F (p, t); A(S, t) =1

|S|�

p�S

A(p, t)

Fu(Sc, t) =1

|Sc|�

p�Sc

F (p, t); Au(Sc, t) =1

|Sc|�

p�Sc

A(p, t)

Fw(Sc, t) =�

p�ScF (p, t)c(p, t)

�p�Sc

c(p, t); Aw(Sc, t) =

�p�Sc

A(p, t)c(p, t)�

p�Scc(p, t)

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009-2010.

How to measure freshness

•  Find the true refresh history of each page in the sample –  needs independent crawling

•  Compare with the history in the search engine

•  Determine freshness and age –  basic form: averaged over all documents in the catalog

•  Consider clicked or viewed documents –  unweighted form: averaged over all clicked or viewed

documents in the catalog –  weighted form: unweighted form weighted with #clicks or

#views (or any other weight function)

12

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009-2010.

How to measure freshness: Example

13

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009-2010.

Further issues to consider

•  Sampling pages –  random, from DMOZ, revisited, popular

•  Classifying pages –  topical, importance, change period, refresh period

•  Refresh period for monitoring –  daily, hourly, minutely

•  Measuring change –  hashing (MD5, Broder’s shingles, Charikar’s SimHash),

Jaccard’s index, Dice coefficient, word frequency distribution similarity, structural similarity via DOM trees

•  note: •  What is change?

–  content, “information”, structure, status, links, features, ads •  How to balance freshness against other objectives

14

Jaccard index = |A ⇥B|/|A �B|; Dice coe�cient = 2|A ⇥B|/(|A| + |B|)

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009-2010.

Key problems

•  Measure the evaluation of the content on the Web

•  Design refresh policies to adapt to the changes on the Web

•  Reduce latency from discovery to serving

•  Improve freshness metrics

15

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009-2010.

Reference review on web page change patterns

•  Cho & Garcia-Molina (2000): Crawled 720K pages once a day for 4 months.

•  Ntoulas, Cho, & Olston (2004): Crawled 150 sites once a week for a year.

–  found: most pages didn’t change; changes were minor; freq of change couldn’t predict degree of change but degree of change could predict future degree of change;

•  Fetterly, Manasse, Najork, & Wiener (2003): Crawled 150M pages once a week for 11 weeks.

–  found: past change could predict future change; page length & top level domain name were correlated with change;

•  Olston & Panday (2008): Crawled 10K random pages and 10K pages sampled from DMOZ every two days for several months.

–  found: moderate correlation between change frequency and information longevity

•  Adar, Teevan, Dumais, & Elsas (2009): Crawled 55K revisited pages (sub)hourly for 5 weeks.

–  found: higher change rates compared to random pages; large portions of pages changing more than hourly; focus on pages with important static or dynamic content;

16

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009-2010.

Reference review on predicting refresh rates

•  Grimes, Ford & Tassone (2008) –  determines optimal crawl rates under a set of scenarios:

•  while doing estimation; while fairly sure of the estimate; •  when crawls are expensive, and when they are cheap;

•  Matloff (2005) –  derives estimators similar to Cho & Garcia-Molina but lower

variance (and with improved theory) –  also derives estimators for non-Poisson case –  finds that Poisson model is not very good for its data

•  but the estimators seem accurate (bias around 10%)

•  Singh (2007) –  non-homogeneous Poisson, localized windows, piecewise,

Weibull, experimental evaluation •  No work seems to consider non-periodical case.

17

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009-2010.

Reference review on freshness metrics

•  Cho & Garcia-Molina (2003) –  freshness & age of one page –  average/expected freshness & age of one page & corpus –  freshness & age wrt Poisson model of change –  weighted freshness & age –  sync policies

•  uniform (better): all pages at the same rate •  nonuniform: rates proportionally to change rates

–  sync order •  fixed order (better), random order

–  to improve freshness, penalize pages that change too often –  to improve age, sync proportionally to freq but uniform is not far

from optimal

18

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009-2010.

Reference review on freshness metrics

•  Olston and Najork (2010) –  gives a detailed survey of web crawling –  discusses how to optimize for both coverage and freshness in a

web crawler •  Han et al. (2004) and Dasdan and Huynh (2009) add user

perspective with weights. •  Lewandowski (2008) and Kim and Kang (2007) compare top

three search engines for freshness.

19

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009-2010.

References 1/2

•  E. Adar, J. Teevan, S. Dumais, and J.L. Elsas (2009), The Web changes everything: Understanding the dynamics of Web content, WSDM’09.

•  J. Cho and H. Garcia-Molina (2000), The evolution of the Web and implications for an incremental crawler, VLDB’00.

•  D. Fetterly, M. Manasse, M. Najork, and J. Wiener (2003), A Large scale study of the evolution of Web pages, WWW’03.

•  F. Grandi (2000), Introducing an annotated bibliography on temporal and evolution aspects in the World Wide Web, SIGMOD Records, 33(2):84-86.

•  A. Ntoulas, J. Cho, and C. Olston (2004), What’s new on the Web? The evolution of the Web from a search engine perspective, WWW’04.

20

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009-2010.

References 2/2

•  J. Cho and H. Garcia-Molina (2003), Effective page refresh policies for web crawlers, ACM Trans. Database Syst., 28(4):390-426.

•  J. Cho and H. Garcia-Molina (2003), Estimating frequency of change, ACM Trans. Inter. Tech., 3(3):256-290.

•  A. Dasdan and X. Huynh, User-centric content freshness metrics for search engines, WWW’09.

•  J. Dean (2009), Challenges in building large-scale information retrieval systems, WSDM’09.

•  A. Dong et al. (2010), Towards recency ranking in web search, WSDM’10. •  C. Grimes, D. Ford, and E. Tassone (2008), Keeping a search engine index fresh: Risk

and optimality in estimating refresh rates for web pages, INTERFACE’08. •  J. Han, N. Cercone, and X. Hu (2004), A Weighted freshness metric for maintaining a

search engine local repository, WI’04. •  Y.S. Kim and B.H. Kang (2007), Coverage and timeliness analysis of search engines

with webpage monitoring results, WISE’07. •  D. Lewandowski, H. Wahlig, and G. Meyer-Bautor (2006), The freshness of web search

engine databases, J. Info. Syst., 32(2):131-148. •  D. Lewandowski (2008), A three-year study on the freshness of Web search engine

databases, to appear in J. Info. Syst., 2008. •  N. Matloff (2005), Estimation of internet file-access/modification rates from indirect

data, ACM Trans. Model. Comput. Simul., 15(3):233-253. •  C. Olston and M. Najork (2010), Web crawling, Chapter in Foundations and Trends in

Information Retrieval, 4(3):175--246. •  C. Olston and S. Padley (2008), Recrawl scheduling based on information longevity,

WWW’08. •  S.R. Singh (2007), Estimating the rate of web page changes, IJCAI’07.

21