Reviewed By: Johnny Sia, csia005; Allen Wang, awan015; Li Li, lli057; Hui Zhang, hzha113
Project Topic:
Performance and Cost Tradeoffs in Web Search
Nick Craswell, Francis Crimmins, David Hawking, Alistair Moffat
Reviewed By:
Johnny Sia, csia005
Allen Wang, awan015
Li Li, lli057
Hui Zhang, hzha113
Outline
►Motivation and Introduction by Johnny
►Background information by Allen
►Case Study and cost model analysis (Research Finder) by Li
►New hybrid approach and Conclusion by Henry
Motivation
►Web search engines crawl the web to gather the data that they index
►Slowly crawling the web to download pages from many websites results in a large amount of data being transferred across networks
►These network costs must be paid for!
►In the case of Google, ONE crawl of the 3 billion web pages it indexes would have a network cost of over $1.5 million
Introduction
►Two standard approaches to providing a search service
  • Periodic Crawling (e.g. Google)
  • Metasearch (e.g. MetaCrawler)
►A new alternative
  • A crawl-metasearch hybrid model
Aim
►Aim: To find the most cost-effective way to support web search services.
►Where does cost come from?
  • Nothing is FREE!!
  • Data traffic costs a lot!!
►Two common approaches:
  • Web-Crawling
  • MetaSearch
Web-Crawling
►What is a crawler?
  A: a program that automatically collects Web pages to create a local index.
►Pros.
  • Less query processing time required
  • Fast response to users
  • Fixed amount of cost
►Cons.
  • Expensive!!!
  • Indexed data become stale
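As a rough illustration of the crawler definition above, the following Python sketch crawls a tiny in-memory "web" and builds a local inverted index. The `TOY_WEB` pages, their URLs, and the function names are hypothetical and exist only for this example. A real crawler would fetch pages over the network, which is where the transfer cost comes from; here the byte counter merely stands in for that traffic.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example/home": ("web search costs", ["a.example/about", "b.example/home"]),
    "a.example/about": ("crawl the web", ["a.example/home"]),
    "b.example/home": ("metasearch costs less", []),
}


def crawl(seed, web):
    """Breadth-first crawl from seed; returns (inverted index, bytes fetched)."""
    index = defaultdict(set)  # term -> set of URLs containing that term
    seen = {seed}
    queue = deque([seed])
    bytes_fetched = 0
    while queue:
        url = queue.popleft()
        text, links = web[url]
        bytes_fetched += len(text.encode("utf-8"))  # stands in for network traffic
        for term in text.split():
            index[term].add(url)
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return dict(index), bytes_fetched


def search(index, term):
    """Answer a query from the local index, with no further page fetches."""
    return sorted(index.get(term, set()))


index, fetched = crawl("a.example/home", TOY_WEB)
print(search(index, "costs"))
print(fetched)
```

The sketch shows both sides of the trade-off listed on the slide: queries are answered locally and quickly, but every page has to be downloaded up front, and the index only reflects the web as it was at crawl time.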
Metasearch
How does it work?
[Diagram: Users send a QUERY to the Metasearch engine. After Query Wrapping, the QUERY is forwarded to the Local Search Engines (University of Auckland, AUT, MIT, MSDN (Microsoft)). Each engine returns its Results, and after Results Merging the Final Results are returned to the users.]
Metasearch
►Pros.
  • Cheap to maintain (really?)
  • "Fresh" data
►Cons.
  • Quality of the search depends on local servers.
  • Need "wrapper" to forward queries
  • Results from various servers need to be merged
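The two extra steps that metasearch needs, query wrapping and results merging, can be sketched in a few lines of Python. Everything here is hypothetical: the `LOCAL_ENGINES` stand-ins, their result URLs, and the wrapper that only normalises the query string. The slides do not say how results are merged, so round-robin interleaving is used purely as one simple assumption.

```python
# Hypothetical local search engines: each takes a query and returns its own ranked list.
LOCAL_ENGINES = {
    "auckland": lambda q: [f"auckland.example/{q}/1", f"auckland.example/{q}/2"],
    "aut": lambda q: [f"aut.example/{q}/1"],
    "mit": lambda q: [],  # an engine with no matches for the query
}


def wrap_query(engine_name, query):
    """Query wrapping: adapt the user query to one engine's syntax.

    Assumption: every engine accepts a trimmed, lower-cased term. Real wrappers
    are written per engine, which is the maintenance cost noted on the slide.
    """
    return query.strip().lower()


def merge_round_robin(result_lists):
    """Results merging: take the rank-1 result of each engine, then rank 2, and so on."""
    merged = []
    depth = max((len(results) for results in result_lists), default=0)
    for rank in range(depth):
        for results in result_lists:
            if rank < len(results):
                merged.append(results[rank])
    return merged


def metasearch(query, engines):
    """Forward the wrapped query to every engine and merge what comes back."""
    result_lists = [engine(wrap_query(name, query)) for name, engine in engines.items()]
    return merge_round_robin(result_lists)


final_results = metasearch(" Search ", LOCAL_ENGINES)
print(final_results)
```

Note that the quality of `final_results` can never be better than what the local engines return, which is the first disadvantage listed above.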
Case Study
►Panoptic (Research Finder)
  • Searchable full-text index-based retrieval system
  • Based on regular crawl
  • The newest version also introduced the metasearch model
  • Operated by a range of Australian research institutions
  • The eight largest Australian universities contribute the most data to the Panoptic crawl
Case Study (cont'd)
►Rate of change
  a) Pages which have disappeared
  b) Pages which changed so much => bad results
  c) Pages which changed a little => good answers
►Changes (c) could be ignored
►Changes (a) and (b) are most important in a search system. Why?
  As a crawl becomes stale, the users are more likely to see an embarrassing result.
Case Study (cont'd)
Over an eight-day period:
•Disappearance: 1.6%
•Small changes: 8.2%
•Large changes: 6.4%
•No changes: 83.8%
Normally, pages in the .com domain change more frequently than those in the .edu domain (result from 151 million pages)
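To connect these figures to the (a)/(b)/(c) classification, the short Python sketch below adds up the share of pages whose change can hurt search results. The dictionary keys and the function name are made up for this example; the percentages are the ones on the slide, and treating only disappearances and large changes as harmful follows the slide's own argument.

```python
# Observed change over an eight-day period (percent of pages), taken from the slide.
CHANGE_PCT = {
    "disappeared": 1.6,    # case (a)
    "large_change": 6.4,   # case (b)
    "small_change": 8.2,   # case (c), ignorable
    "no_change": 83.8,
}


def harmful_pct(change_pct):
    """Percent of pages whose change can lead to a bad result: cases (a) and (b)."""
    return change_pct["disappeared"] + change_pct["large_change"]


total = sum(CHANGE_PCT.values())
harmful = harmful_pct(CHANGE_PCT)
print(round(total, 1))
print(round(harmful, 1))
```

The four categories account for all pages, and cases (a) and (b) together cover 8.0% of them after eight days.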
Cost Model
Symbol  Meaning                             Unit                  Value
Fc      Crawl frequency                     crawls/month          1
Sd      Combined data size                  GB                    33.3
So      Crawling overhead                   B_fetched/B_indexed   1.7
Ct      Transfer cost                       $/GB                  22.5 (= 0.0225 $/MB)
Fq      Query arrival rate                  queries/month         10,000
Sq      Size of query response page         GB/query              2x10^(-5) (= 20 kB)
Nc      Number of servers being federated   -                     175

Currency note: 1 NZD = 0.9368 AUD, so 0.07 NZD/MB = 0.066 AUD/MB

For crawling: Fc x Sd x So x Ct
For answering queries: Fq x Sq x (Nc + 1) x Ct
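Plugging the slide's values into the two formulas makes the comparison concrete. The Python sketch below is only a direct transcription of those formulas; the function and constant names are chosen for this example and do not come from the paper.

```python
def crawl_cost(fc, sd, so, ct):
    """Monthly cost of crawling: Fc x Sd x So x Ct."""
    return fc * sd * so * ct


def query_cost(fq, sq, nc, ct):
    """Monthly cost of answering queries by metasearch: Fq x Sq x (Nc + 1) x Ct."""
    return fq * sq * (nc + 1) * ct


# Parameter values as listed on the slide.
FC = 1        # crawls/month
SD = 33.3     # GB
SO = 1.7      # bytes fetched per byte indexed
CT = 22.5     # $/GB
FQ = 10_000   # queries/month
SQ = 2e-5     # GB/query (20 kB)
NC = 175      # federated servers

monthly_crawl = crawl_cost(FC, SD, SO, CT)
monthly_query = query_cost(FQ, SQ, NC, CT)
print(f"crawl: ${monthly_crawl:,.0f}/month")
print(f"query: ${monthly_query:,.0f}/month")
```

With these inputs the crawl formula gives roughly $1,274 per month and the query formula gives $792 per month.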
Cost Model (cont'd)
The number of queries per month is low: Metasearch is cost effective
The number of queries per month is high: Crawl is cost effective
Metasearch cost: Fq x Sq x (Nc + 1) x Ct
Crawl cost: Fc x Sd x So x Ct
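The point where the two costs are equal can be worked out by setting the formulas equal to each other; the transfer cost Ct appears on both sides and cancels. The following Python sketch computes that break-even query rate using the parameter values given for the cost model. The function names are hypothetical, and the result is an illustration of the slide's formulas rather than a figure quoted from the paper.

```python
def break_even_query_rate(fc, sd, so, sq, nc):
    """Queries/month at which Fq x Sq x (Nc + 1) x Ct equals Fc x Sd x So x Ct.

    Ct cancels, so the threshold depends only on data volumes and server count.
    """
    return (fc * sd * so) / (sq * (nc + 1))


def cheaper_approach(fq, threshold):
    """Below the threshold metasearch is cheaper; above it a crawl is cheaper."""
    return "metasearch" if fq < threshold else "crawl"


# Values from the cost model: Fc = 1, Sd = 33.3 GB, So = 1.7, Sq = 2e-5 GB, Nc = 175.
threshold = break_even_query_rate(fc=1, sd=33.3, so=1.7, sq=2e-5, nc=175)
print(f"break-even at about {threshold:,.0f} queries/month")
print(cheaper_approach(10_000, threshold))
print(cheaper_approach(20_000, threshold))
```

Under these values the break-even lies at about 16,082 queries per month, so the 10,000 queries per month in the case study fall on the metasearch side.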
Performance and Cost Tradeoffs in Web Search
►New hybrid approach
  • A full index is suitable for a large query load; however, metasearch would be better if the query arrival rate is lower
  • Metasearches the largest organizations and crawls the others
  • Can reduce the crawl cost by approximately half
    ►e.g. proof of concept demonstration at: http://thylacine.panopticsearch.com/hybriddemo/index.cgi
  • Still faces the disadvantages of metasearch
    ►e.g. the need to write wrappers, response time issues, the reliance on quality local search services
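One way to reason about the hybrid is to add a reduced crawl cost to a metasearch cost over only the metasearched servers. The Python sketch below does this under loudly stated assumptions: adding the two cost-model formulas together is this example's own simplification, not the authors' stated hybrid model; `crawled_fraction = 0.5` reflects the "approximately half" claim above; and `n_metasearched = 8` is a guess based on the eight largest universities mentioned in the case study.

```python
def hybrid_cost(fc, sd, so, ct, fq, sq, crawled_fraction, n_metasearched):
    """Assumed hybrid cost: crawl part of the data, metasearch a few large servers.

    crawl part : Fc x (Sd x crawled_fraction) x So x Ct
    query part : Fq x Sq x (n_metasearched + 1) x Ct
    """
    crawl_part = fc * (sd * crawled_fraction) * so * ct
    query_part = fq * sq * (n_metasearched + 1) * ct
    return crawl_part + query_part


# Cost-model values from the slides, plus the two ASSUMED hybrid parameters.
cost = hybrid_cost(
    fc=1, sd=33.3, so=1.7, ct=22.5,   # crawl parameters
    fq=10_000, sq=2e-5,               # query parameters
    crawled_fraction=0.5,             # assumption: half the data is still crawled
    n_metasearched=8,                 # assumption: eight servers are metasearched
)
print(f"hybrid: ${cost:,.0f}/month")
```

With these assumed inputs the sketch gives about $677 per month. The number only illustrates how the formula behaves and is not a result reported by the authors.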
Performance and Cost Tradeoffs in Web Search
►Conclusion
  • The group presented useful cost models and discussed several alternative approaches
  • Regrettably, many of the discussed options are not currently feasible
  • The most promising cost-reduction alternative in the current situation seems to be incremental, variable-frequency crawling
  • This model could be incorporated into a hybrid metasearch model with further savings, provided result merging can be performed sufficiently well in the future
  • There is no single reasonable solution for all operational search systems
  • The state of the art in this research area remains a challenging and attractive subject