Project Topic: Performance and Cost Tradeoffs in Web Search
Nick Craswell, Francis Crimmins, David Hawking, Alistair Moffat

Reviewed By:
Johnny Sia, csia005
Allen Wang, awan015
Li Li, lli057
Hui Zhang, hzha113



Outline

► Motivation and Introduction by Johnny
► Background information by Allen
► Case Study and cost model analysis (Research Finder) by Li
► New hybrid approach and Conclusion by Henry

Motivation

► Web search engines crawl the web to gather the data that they index
► Slowly crawling the web to download pages from many websites results in a large amount of data being transferred across networks
► These network costs must be paid for!
► In the case of Google, ONE crawl of the 3 billion websites it indexes would have a network cost of over $1.5 million

8 Billion web pages!!!

Introduction

► Two standard approaches to providing a search service
  • Periodic Crawling (e.g. Google)
  • Metasearch (e.g. MetaCrawler)
► A new alternative
  • A crawl-metasearch hybrid model

Aim

► Aim: To find the most cost-effective way to support web search services.
► Where does cost come from?
  • Nothing is FREE!!
  • Data traffic costs a lot!!
► Two common approaches:
  • Web-Crawling
  • MetaSearch

Web-Crawling

► What is a crawler?
  A: a program that automatically collects Web pages to create a local index.
► Pros.
  • Less query processing time required
  • Fast response to users
  • Fixed amount of cost
► Cons.
  • Expensive!!!
  • Indexed data become stale

Metasearch

How does it work?

[Diagram: Users send a QUERY to the Metasearch engine. Query wrapping forwards the QUERY to the local search engines (University of Auckland, AUT, MIT, MSDN (Microsoft)). Each returns Results, and results merging produces the Final Results.]

Metasearch

► Pros.
  • Cheap to maintain (really?)
  • “Fresh” data
► Cons.
  • Quality of the search depends on local servers.
  • Need a “wrapper” to forward queries
  • Results from various servers need to be merged
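
The two metasearch steps named on the slides, query wrapping and results merging, can be sketched roughly as follows. This is only an illustration: the server names, URL templates, and the round-robin merging rule are assumptions made for the example, not something the slides or the paper specify.

```python
from itertools import zip_longest
from urllib.parse import quote_plus

# Hypothetical per-server query templates ("wrappers").
# None of these names or URLs come from the slides.
WRAPPERS = {
    "server_a": "https://a.example/search?q={query}",
    "server_b": "https://b.example/find?terms={query}",
}


def wrap_query(query, wrappers=WRAPPERS):
    """Rewrite one user query into the form each local server expects."""
    encoded = quote_plus(query)
    return {name: tmpl.format(query=encoded) for name, tmpl in wrappers.items()}


def merge_results(result_lists):
    """Round-robin interleave ranked result lists, dropping duplicate URLs.

    Assumption: interleaving by rank is one simple merging rule among many.
    """
    merged, seen = [], set()
    for tier in zip_longest(*result_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


if __name__ == "__main__":
    print(wrap_query("web search"))
    print(merge_results([["u1", "u2", "u3"], ["u2", "u4"]]))
```

Every extra server needs its own entry in the wrapper table, which is the maintenance burden the "Cons" list points at.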

Case Study

► Panoptic (Research Finder)
  • Searchable full-text index-based retrieval system
  • Based on regular crawl
  • The newest version also introduced a metasearch model
  • Operated by a range of Australian research institutions
  • The eight largest Australian universities contribute more data to the Panoptic crawl

Case Study (cont’d)

► Rate of change
  a) Pages which have disappeared
  b) Pages which changed so much => bad results
  c) Pages which changed a little => good answers
► Changes (c) could be ignored
► Changes (a) and (b) are most important in a search system. Why?

As a crawl becomes stale, users are more likely to see an embarrassing result.

Case Study (cont’d)

Over an eight-day period:
• Disappearance: 1.6%
• Small changes: 8.2%
• Large changes: 6.4%
• No changes: 83.8%

Normally, pages in the .com domain change more frequently than those in the .edu domain (result from 151 million pages)
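
As a quick check on these figures, the short sketch below adds up the two change types that the earlier slide calls most important, disappearances (a) and large changes (b). The dictionary keys are names chosen for this example; the percentages are the ones listed above.

```python
# Change rates over the eight-day period, in percent, as listed on the slide.
changes = {
    "disappeared": 1.6,    # type (a)
    "large_change": 6.4,   # type (b)
    "small_change": 8.2,   # type (c), could be ignored
    "no_change": 83.8,
}

# Types (a) and (b) are the ones that risk an embarrassing result.
harmful = changes["disappeared"] + changes["large_change"]
total = sum(changes.values())

print(f"harmful changes (a)+(b): {harmful:.1f}% of pages")
print(f"all categories together: {total:.1f}%")
```

So about 8% of pages fall into the harmful categories after eight days, and the four categories account for all pages.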

Cost Model

Fc: Crawl frequency. Unit: crawls/month. Value: 1
So: Crawling overhead. Unit: B_fetched/B_indexed. Value: 1.7
Sd: Combined data size. Unit: GB. Value: 33.3
Ct: Transfer cost. Unit: $/GB. Value: 22.5 (= 0.0225 $/MB)
Fq: Query arrival rate. Unit: queries/month. Value: 10,000
Sq: Size of query response page. Unit: GB/query. Value: 2x10^(-5) (= 20 kB)
Nc: Number of servers being federated. Value: 175

1 NZD = 0.9368 AUD; 0.07 NZD/MB = 0.066 AUD/MB

For crawling: Fc x Sd x So x Ct
For answering queries: Fq x Sq x (Nc + 1) x Ct
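
To make the two formulas concrete, here is a minimal sketch that plugs in the parameter values from the slide. The function names are invented for this example. With these values the monthly cost comes to roughly $1,274 for crawling and $792 for answering queries by metasearch.

```python
# Parameter values as listed on the slide.
Fc = 1        # crawl frequency, crawls/month
Sd = 33.3     # combined data size, GB
So = 1.7      # crawling overhead, bytes fetched / bytes indexed
Ct = 22.5     # transfer cost, $/GB
Fq = 10_000   # query arrival rate, queries/month
Sq = 2e-5     # size of a query response page, GB/query
Nc = 175      # number of servers being federated


def crawl_cost(Fc, Sd, So, Ct):
    """Monthly network cost of crawling: Fc x Sd x So x Ct."""
    return Fc * Sd * So * Ct


def query_cost(Fq, Sq, Nc, Ct):
    """Monthly network cost of answering queries: Fq x Sq x (Nc + 1) x Ct."""
    return Fq * Sq * (Nc + 1) * Ct


crawl = crawl_cost(Fc, Sd, So, Ct)
query = query_cost(Fq, Sq, Nc, Ct)
print(f"crawling:          ${crawl:,.2f} per month")
print(f"answering queries: ${query:,.2f} per month")
```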

Cost Model (cont’d)

► When the number of queries per month is low: metasearch is cost effective
► When the number of queries per month is high: crawl is cost effective

For answering queries: Fq x Sq x (Nc + 1) x Ct
For crawling: Fc x Sd x So x Ct
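
The crossover between "low" and "high" query load can be estimated by setting the two expressions equal and solving for Fq. The sketch below does this with the parameter values from the cost model slide. The helper names are made up for the example, and the resulting break-even figure is a consequence of those values, not a number quoted on the slides.

```python
# Parameter values from the cost model slide.
Fc, Sd, So = 1, 33.3, 1.7   # crawls/month, GB, fetched/indexed ratio
Sq, Nc = 2e-5, 175          # GB/query, federated servers
Ct = 22.5                   # $/GB (cancels out of the break-even point)


def break_even_queries(Fc, Sd, So, Sq, Nc):
    """Fq at which Fq x Sq x (Nc + 1) x Ct equals Fc x Sd x So x Ct."""
    return (Fc * Sd * So) / (Sq * (Nc + 1))


def cheaper_approach(Fq):
    """Return which approach has the lower monthly network cost at load Fq."""
    metasearch = Fq * Sq * (Nc + 1) * Ct
    crawl = Fc * Sd * So * Ct
    return "metasearch" if metasearch < crawl else "crawl"


fq_star = break_even_queries(Fc, Sd, So, Sq, Nc)
print(f"break-even at about {fq_star:,.0f} queries/month")
print("10,000 queries/month ->", cheaper_approach(10_000))
print("20,000 queries/month ->", cheaper_approach(20_000))
```

Under these values the break-even sits at roughly 16,000 queries per month, so the 10,000 queries per month in the case study fall on the metasearch side.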

Performance and Cost Tradeoffs in Web Search

► New hybrid approach
  • A full index is suitable for a large query load; however, metasearch would be better if the query arrival rate is lower
  • Metasearches the largest organizations and crawls the others
  • Can reduce the crawl cost by approximately half
    ► e.g. proof of concept demonstration at: http://thylacine.panopticsearch.com/hybriddemo/index.cgi
  • Still faces the disadvantages of metasearch
    ► e.g. the need to write wrappers, response time issues, the reliance on quality local search services
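
A rough way to see where the savings come from is to combine the two cost formulas: crawl only the data that is not metasearched, and pay the per-query cost only for the metasearched servers. The sketch below does that under two loudly stated assumptions that are NOT taken from the slides: that 8 servers are metasearched, and that they hold half of the combined data (chosen to match the "approximately half" claim). The way the two formulas are combined is also an assumption of this example.

```python
# Parameter values from the cost model slide.
Fc, Sd, So, Ct = 1, 33.3, 1.7, 22.5
Fq, Sq = 10_000, 2e-5

# ASSUMPTIONS for illustration only (not from the slides):
N_META = 8         # hypothetical number of large organizations metasearched
META_SHARE = 0.5   # hypothetical fraction of the data those organizations hold


def hybrid_cost(n_meta, meta_share):
    """Crawl the non-metasearched data and forward queries to n_meta servers."""
    crawl_part = Fc * Sd * (1 - meta_share) * So * Ct
    query_part = Fq * Sq * (n_meta + 1) * Ct
    return crawl_part + query_part


full_crawl = Fc * Sd * So * Ct
hybrid = hybrid_cost(N_META, META_SHARE)
print(f"full crawl: ${full_crawl:,.2f} per month")
print(f"hybrid:     ${hybrid:,.2f} per month")
```

The result depends entirely on the assumed split, so it should be read as an illustration of the mechanism rather than as a figure from the study.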

Performance and Cost Tradeoffs in Web Search

► Conclusion
  • The group presented useful cost models and discussed several alternative approaches
  • Regrettably, many of the discussed options are not currently feasible
  • The most promising cost-reduction alternative in the current situation seems to be incremental, variable-frequency crawling
  • This model could be incorporated into a hybrid metasearch model with further savings, provided result merging can be performed sufficiently well in the future
  • No single solution is reasonable for all operational search systems
  • The state of the art in this research area remains a challenging and attractive subject