HOW THINGS WORK

Web Search Engines: Part 1
David Hawking, CSIRO ICT Centre

A behind-the-scenes look explores the data processing “miracle” that characterizes Web crawling and searching.

In 1995, when the number of “usefully searchable” Web pages was a few tens of millions, it was widely believed that “indexing the whole of the Web” was already impractical or would soon become so due to its exponential growth. A little more than a decade later, the GYM search engines—Google, Yahoo!, and Microsoft—are indexing almost a thousand times as much data and between them providing reliable sub-second responses to around a billion queries a day in a plethora of languages.

If this were not enough, the major engines now provide much higher-quality answers. For most searchers, these engines do a better job of ranking and presenting results, respond more quickly to changes in interesting content, and more effectively eliminate dead links, duplicate pages, and off-topic spam.

In this two-part series, we go behind the scenes and explain how this data processing “miracle” is possible. We focus on whole-of-Web search but note that enterprise search tools and portal search interfaces use many of the same data structures and algorithms.

Search engines cannot and should not index every page on the Web. After all, thanks to dynamic Web page generators such as automatic calendars, the number of pages is infinite.

To provide a useful and cost-effective service, search engines must reject as much low-value automated content as possible. In addition, they can ignore huge volumes of Web-accessible data, such as ocean temperatures and astrophysical observations, without harm to search effectiveness. Finally, Web search engines have no access to restricted content, such as pages on corporate intranets.

What follows is not an inside view of any particular commercial engine—whose precise details are jealously guarded secrets—but a characterization of the problems that whole-of-Web search services face and an explanation of the techniques available to solve these problems.

INFRASTRUCTURE

Figure 1 shows a generic search engine architecture. For redundancy and fault tolerance, large search engines operate multiple, geographically distributed data centers. Within a data center, services are built up from clusters of commodity PCs. The type of PC in these clusters depends upon price, CPU speed, memory and disk size, heat output, reliability, and physical size (labs.google.com/papers/googlecluster-ieee.pdf). The total number of servers for the largest engines is now reported to be in the hundreds of thousands.

Within a data center, clusters or individual servers can be dedicated to specialized functions, such as crawling, indexing, query processing, snippet generation, link-graph computations, result caching, and insertion of advertising content. Table 1 provides a glossary defining Web search engine terms.

Large-scale replication is required to handle the necessary throughput. For example, if a particular set of hardware can answer a query every 500 milliseconds, then the search engine company must replicate that hardware a thousandfold to achieve throughput of 2,000 queries per second. Distributing the load among replicated clusters requires high-throughput, high-reliability network front ends.

Currently, the amount of Web data that search engines crawl and index is on the order of 400 terabytes, placing heavy loads on server and network infrastructure. Allowing for overheads, a full crawl would saturate a 10-Gbps network link for more than 10 days. Index structures for this volume of data could reach 100 terabytes, leading to major challenges in maintaining index consistency across data centers. Copying a full set of indexes from one data center to another over a second 10-gigabit link takes more than a day.
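As a back-of-the-envelope check of these figures (a sketch only, assuming an idealized, fully utilized link and 8 bits per byte):

    # Rough transfer-time arithmetic for the volumes quoted above.
    CRAWL_BYTES = 400e12          # ~400 terabytes of crawled Web data
    INDEX_BYTES = 100e12          # ~100 terabytes of index structures
    LINK_BPS = 10e9               # a 10-Gbps network link

    def transfer_days(num_bytes, bits_per_second):
        return num_bytes * 8 / bits_per_second / 86_400   # seconds per day

    print(f"full crawl: {transfer_days(CRAWL_BYTES, LINK_BPS):.1f} days")   # ~3.7 days raw
    print(f"index copy: {transfer_days(INDEX_BYTES, LINK_BPS):.1f} days")   # ~0.9 days raw

The raw figures of roughly 3.7 days and 0.9 days show how, once protocol overheads, retries, and politeness delays are added, the column's "more than 10 days" and "more than a day" follow.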

CRAWLING ALGORITHMS

The simplest crawling algorithm uses a queue of URLs yet to be visited and a fast mechanism for determining if it has already seen a URL. This requires huge data structures—a simple list of 20 billion URLs contains more than a terabyte of data.

The crawler initializes the queue with one or more “seed” URLs. A good seed URL will link to many high-quality Web sites—for example, www.dmoz.org or wikipedia.org.

Crawling proceeds by making an HTTP request to fetch the page at the first URL in the queue. When the crawler fetches the page, it scans the contents for links to other URLs and adds each previously unseen URL to the queue. Finally, the crawler saves the page content for indexing. Crawling continues until the queue is empty.
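A minimal sketch of this loop, assuming a breadth-first queue and a regular expression for link extraction (a real crawler would also honor robots.txt and apply the other refinements described below):

    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen
    import re

    def crawl(seed_urls, max_pages=100):
        """Simplest crawler: a queue of URLs to visit plus a set of URLs already seen."""
        queue = deque(seed_urls)
        seen = set(seed_urls)
        saved = {}                                    # url -> page content, for indexing
        while queue and len(saved) < max_pages:
            url = queue.popleft()
            try:
                page = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except (OSError, ValueError):
                continue                              # dead link, timeout, bad URL, ...
            saved[url] = page
            for href in re.findall(r'href="([^"#]+)"', page):
                link = urljoin(url, href)             # resolve relative links
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    queue.append(link)
        return saved

    pages = crawl(["https://en.wikipedia.org/"])      # seeded from one of the examples above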


Real crawlers

In practice, this simple crawling algorithm must be extended to address the following issues.

Speed. If each HTTP request takes one second to complete—some will take much longer or fail to respond at all—the simple crawler can fetch no more than 86,400 pages per day. At this rate, it would take 634 years to crawl 20 billion pages. In practice, crawling is carried out using hundreds of distributed crawling machines.

A hashing function determines which crawling machine is responsible for a particular URL. If a crawling machine encounters a URL for which it is not responsible, it passes it on to the machine that is responsible for it.

Even hundredfold parallelism is not sufficient to achieve the necessary crawling rate. Each crawling machine therefore exploits a high degree of internal parallelism, with hundreds or thousands of threads issuing requests and waiting for responses.

Politeness. Unless care is taken, crawler parallelism introduces the risk that a single Web server will be bombarded with requests to such an extent that it becomes overloaded. Crawler algorithms are designed to ensure that only one request to a server is made at a time and that a politeness delay is inserted between requests. It is also necessary to take into account bottlenecks in the Internet; for example, search engine crawlers have sufficient bandwidth to completely saturate network links to entire countries.

Excluded content. Before fetching a page from a site, a crawler must fetch that site’s robots.txt file to determine whether the webmaster has specified that some or all of the site should not be crawled.

Duplicate content. Identical content is frequently published at multiple URLs. Simple checksum comparisons can detect exact duplicates, but when the page includes its own URL, a visitor counter, or a date, more sophisticated fingerprinting methods are needed.

Crawlers can save considerable resources by recognizing and eliminating duplication as early as possible because unrecognized duplicates can contain relative links to whole families of other duplicate content.

Search engines avoid some systematic causes of duplication by transforming URLs to remove superfluous parameters such as session IDs and by casefolding URLs from case-insensitive servers.
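A small sketch of these two defenses, URL canonicalization and exact-duplicate detection by checksum (the list of session-parameter names is purely illustrative):

    import hashlib
    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}   # illustrative only

    def canonicalize(url):
        """Lowercase the scheme and host, drop superfluous session parameters."""
        parts = urlsplit(url)
        query = [(k, v) for k, v in parse_qsl(parts.query)
                 if k.lower() not in SESSION_PARAMS]
        return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                           parts.path, urlencode(query), ""))

    seen_checksums = set()

    def is_exact_duplicate(page_bytes):
        """Checksum comparison catches byte-identical pages only."""
        digest = hashlib.sha1(page_bytes).digest()
        if digest in seen_checksums:
            return True
        seen_checksums.add(digest)
        return False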

Continuous crawling. Carrying out full crawls at fixed intervals would imply slow response to important changes in the Web. It would also mean that the crawler would continuously refetch low-value and static pages, thereby incurring substantial costs without significant benefit. For example, a corporate site’s 2002 media releases section rarely, if ever, requires recrawling.

Interestingly, submitting the query “current time New York” to the GYM engines reveals that each of these engines crawls the www.timeanddate.com/worldclock site every couple of days. However, no matter how often the engines crawl this site, the search result will always show the wrong time.

To increase crawling bang-per-buck, a priority queue replaces the simple queue. The URLs at the head of this queue have been assessed as having the highest priority for crawling, based on factors such as change frequency, incoming link count, click frequency, and so on. Once a URL is crawled, it is reinserted at a position in the queue determined by its reassessed priority. In this model, crawling need never stop.
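The priority queue and the politeness delay described above might be combined roughly as follows (the priority formula is invented for illustration):

    import heapq
    import time
    from urllib.parse import urlsplit

    POLITENESS_DELAY = 2.0        # seconds between requests to the same host
    frontier = []                 # min-heap of (-priority, url); highest priority pops first
    next_allowed = {}             # host -> earliest time another request may be sent

    def assess_priority(change_freq, inlink_count, click_freq):
        """Toy priority score; real engines combine many more signals."""
        return 10.0 * change_freq + 0.5 * inlink_count + click_freq

    def add_url(url, priority):
        heapq.heappush(frontier, (-priority, url))

    def next_url():
        """Return the highest-priority URL whose host is outside its politeness delay."""
        deferred = []
        chosen = None
        while frontier and chosen is None:
            neg_priority, url = heapq.heappop(frontier)
            host = urlsplit(url).netloc
            if time.time() >= next_allowed.get(host, 0.0):
                next_allowed[host] = time.time() + POLITENESS_DELAY
                chosen = url
            else:
                deferred.append((neg_priority, url))   # host still cooling off; retry later
        for item in deferred:                          # put deferred URLs back on the heap
            heapq.heappush(frontier, item)
        return chosen

    # After fetching, a URL is reinserted with a reassessed priority, so crawling never stops:
    # add_url(url, assess_priority(change_freq=0.2, inlink_count=40, click_freq=3))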

Figure 1. Generic search engine architecture. Enterprise search engines must provide adapters (top left) for all kinds of Web and non-Web data, but these are not required in a purely Web search.

[The figure shows the Web, local servers, databases, and e-mail feeding a crawler and collection indexes through six stages: 1. Gathering, 2. Extracting, 3. Indexing, 4. Querying, 5. Ranking, 6. Presenting.]

Table 1. Web search engine glossary.

URL: A Web page address—for example, http://www.computer.org.
Crawling: Traversing the Web by recursively following links from a seed.
Indexes: Data structures permitting rapid identification of which crawled pages contain particular words or phrases.
Spamming: Publication of artificial Web material designed to manipulate search engine rankings for financial gain.
Hashing function: A function for computing an integer within a desired range from a string of characters, such that the integers generated from large sets of strings—for example, URLs—are evenly distributed.


Spam rejection. Primitive spamming techniques, such as inserting misleading keywords into pages that are invisible to the viewer—for example, white text on a white background, zero-point fonts, or meta tags—are easily detected. In any case, they are ineffective now that rankings depend heavily upon link information (www-db.stanford.edu/pub/papers/google.pdf).

Modern spammers create artificial Web landscapes of domains, servers, links, and pages to inflate the link scores of the targets they have been paid to promote. Spammers also engage in cloaking, the process of delivering different content to crawlers than to site visitors.

Search engine companies use manual and automated analysis of link patterns and content to identify spam sites that are then included in a blacklist. A crawler can reject links to URLs on the current blacklist and can reject or lower the priority of pages that are linked to or from blacklisted sites.

FINAL CRAWLING THOUGHTS

The full story of Web crawling must include decoding hyperlinks computed in JavaScript; extraction of indexable words, and perhaps links, from binary documents such as PDF and Microsoft Word files; and converting character encodings such as ASCII, Windows codepages, and Shift-JIS into Unicode for consistent indexing (www.unicode.org/standard/standard.html).
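For the character-encoding step, a minimal sketch of decoding fetched bytes to Unicode (the fallback chain of encodings is an assumption; the declared charset would come from an HTTP header or a meta tag):

    def to_unicode(raw_bytes, declared_charset=None):
        """Decode fetched bytes to Unicode, falling back when the label is missing or wrong."""
        for charset in filter(None, [declared_charset, "utf-8", "cp1252", "shift_jis"]):
            try:
                return raw_bytes.decode(charset)
            except (LookupError, UnicodeDecodeError):
                continue
        return raw_bytes.decode("utf-8", errors="replace")   # last resort

    text = to_unicode(b"caf\xe9", declared_charset="cp1252")  # -> "café"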

Engineering a Web-scale crawler is not for the unskilled or fainthearted. Crawlers are highly complex parallel systems, communicating with millions of different Web servers, among which can be found every conceivable failure mode, all manner of deliberate and accidental crawler traps, and every variety of noncompliance with published standards. Consequently, the authors of the Mercator crawler found it necessary to write their own versions of low-level system software to achieve required performance and reliability (www.research.compaq.com/SRC/mercator/papers/www/paper.html).

It is not uncommon to find that a crawler has locked up, ground to a halt, crashed, burned up an entire network traffic budget, or unintentionally inflicted a denial-of-service attack on a Web server whose operator is now very irate.

Part two of this two-part series (Computer, How Things Work, Aug. 2006) will explain how search engines index crawled data and how they process queries. ■

David Hawking is a principal research scientist at CSIRO ICT Centre, Canberra, Australia, and Chief Scientist at funnelback.com. Contact him at [email protected].


HOW THINGS WORK

Web Search Engines: Part 2
David Hawking, CSIRO ICT Centre

A data processing “miracle” provides responses to hundreds of millions of Web searches each day.

Part 1 of this two-part series (How Things Work, June 2006, pp. 86-88) described search engine infrastructure and algorithms for crawling the Web. Part 2 reviews the algorithms and data structures required to index 400 terabytes of Web page text and deliver high-quality results in response to hundreds of millions of queries each day.

INDEXING ALGORITHMS

Search engines use an inverted file to rapidly identify indexing terms—the documents that contain a particular word or phrase (J. Zobel and A. Moffat, “Inverted Files for Text Search Engines,” to be published in ACM Computing Surveys, 2006). An inverted file is a concatenation of the postings lists for each distinct term. In its simplest form, each postings list comprises a sorted list of the ID numbers of the documents that contain it. A fast-lookup term dictionary references the postings lists for each term.

An indexer can create an inverted file in two phases. In the first phase, scanning, the indexer scans the text of each input document. For each indexable term it encounters, the indexer writes a posting consisting of a document number and a term number to a temporary file. Because of the scanning process, this file will naturally be in document number order.

In the second phase, inversion, the indexer sorts the temporary file into term number order, with the document number as the secondary sort key. It also records the starting point and length of the lists for each entry in the term dictionary.
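An in-memory sketch of the two phases, using the term string itself in place of a term number and a Python list in place of the temporary file:

    from collections import defaultdict
    import re

    def build_inverted_file(documents):
        """documents: list of strings; document numbers are their list positions."""
        # Phase 1: scanning -- emit (term, doc_number) postings in document order.
        postings = []
        for doc_num, text in enumerate(documents):
            for term in re.findall(r"\w+", text.lower()):
                postings.append((term, doc_num))
        # Phase 2: inversion -- sort by term, then document number, and record
        # each term's postings list in the dictionary.
        postings.sort()
        dictionary = defaultdict(list)
        for term, doc_num in postings:
            if not dictionary[term] or dictionary[term][-1] != doc_num:
                dictionary[term].append(doc_num)
        return dictionary

    index = build_inverted_file(["a nice night", "a nice day"])
    # index["nice"] == [0, 1]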

REAL INDEXERS

As Figure 1 shows, for high-quality rankings, real indexers store additional information in the postings, such as term frequency or positions. (For additional information, see S. Brin and L. Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine;” www-db.stanford.edu/pub/papers/google.pdf.)

Scaling up. The scale of the inversion problem for a Web-sized crawl is enormous: Estimating 500 terms in each of 20 billion pages, the temporary file might contain 10 trillion entries.

An obvious approach, document partitioning, divides up the URLs between machines in a cluster in the same way as the crawler. If the system uses 400 machines to index 20 billion pages, and the machines share the load evenly, then each machine manages a partition of 50 million pages.

Even with 400-fold partitioning, each inverted file contains around 25 billion entries, still a significant indexing challenge. An efficient indexer builds a partial inverted “file” in main memory as it scans documents and stops when available memory is exhausted. At that point, the indexer writes a partial inverted file to disk, clears memory, and starts indexing the next document. It then repeats this process until it has scanned all documents. Finally, it merges all the partial inverted files.
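The final merge of partial runs can be sketched as a k-way merge of sorted (term, document) postings (kept in memory here; a real indexer streams them from disk):

    import heapq

    def merge_partial_indexes(runs):
        """Each run is a sorted list of (term, doc_number) postings."""
        merged = {}
        for term, doc_num in heapq.merge(*runs):      # k-way merge of sorted runs
            merged.setdefault(term, []).append(doc_num)
        return merged

    run_a = [("nice", 2), ("night", 5)]
    run_b = [("nice", 17), ("zzzzz", 3)]
    full = merge_partial_indexes([run_a, run_b])       # {"nice": [2, 17], ...}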

Term lookup. The Web’s vocabulary is unexpectedly large, containing hundreds of millions of distinct terms. How can this be, you ask, when even the largest English dictionary lists only about a million words? The answer is that the Web includes documents in all languages, and that human authors have an apparently limitless propensity to create new words such as acronyms, trademarks, e-mail addresses, and proper names. People certainly want to search for R2-D2 and C-3PO as well as IBM, B-52, and, yes, Yahoo! and Google. Many nondictionary words are of course misspellings and typographical errors, but there is no safe way to eliminate them.

Search engines can choose from various forms of tries, trees, and hash tables for efficient term lookup (D.E. Knuth, The Art of Computer Programming: Sorting and Searching, Addison-Wesley, 1973). They can use a two-level structure to reduce disk seeks, an important consideration because modern CPUs can execute many millions of instructions in the time taken for one seek.

Compression. Indexers can reduce demands on disk space and memory by using compression algorithms for key data structures. Compressed data structures mean fewer disk accesses and can lead to faster indexing and faster query processing, despite the CPU cost of compression and decompression.
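The column does not name a particular scheme; a common one is to store the gaps between successive document numbers and encode each gap in a variable number of bytes, roughly as follows:

    def vbyte_encode(gaps):
        """Encode small positive integers, 7 bits per byte,
        with the high bit set on the final byte of each number."""
        out = bytearray()
        for n in gaps:
            chunk = []
            while True:
                chunk.append(n & 0x7F)
                n >>= 7
                if n == 0:
                    break
            chunk[0] |= 0x80                  # mark the terminating (low-order) byte
            out.extend(reversed(chunk))
        return bytes(out)

    def compress_postings(doc_numbers):
        """Delta-encode a sorted postings list, then variable-byte encode the gaps."""
        gaps = [doc_numbers[0]] + [b - a for a, b in zip(doc_numbers, doc_numbers[1:])]
        return vbyte_encode(gaps)

    compressed = compress_postings([2, 17, 111, 272])   # gaps 2, 15, 94, 161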

Phrases. In principle, a query processor can correctly answer phrase queries such as “National Science Foundation” by intersecting postings lists containing word position information. In practice, this is much too slow when postings lists are long.
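Given positional postings of the form (document, position), as in Figure 1, the in-principle approach looks like this sketch; as noted, production engines need the shortcuts described next:

    from collections import defaultdict

    def phrase_match(postings_per_word):
        """postings_per_word: one list of (doc, position) pairs per query word, in
        phrase order. Returns documents where the words appear consecutively."""
        by_doc = []
        for postings in postings_per_word:
            positions = defaultdict(set)
            for doc, pos in postings:
                positions[doc].add(pos)
            by_doc.append(positions)
        # Only documents containing every word can match.
        candidates = set(by_doc[0])
        for d in by_doc[1:]:
            candidates &= set(d)
        # A document matches if some position p of word 0 has p+1 for word 1, and so on.
        hits = []
        for doc in candidates:
            if any(all(pos + i in by_doc[i][doc] for i in range(1, len(by_doc)))
                   for pos in by_doc[0][doc]):
                hits.append(doc)
        return hits

    phrase_match([[(2, 3), (17, 1)], [(2, 4), (17, 9)]])   # -> [2]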


Special indexing tricks permit a more rapid response. One trick is to precompute postings lists for common phrases. Another is to subdivide the postings list for a word into sublists, according to the word that follows the primary word. For example, postings for the word “electrical” might be divided into sublists for “electrical apparatus,” “electrical current,” “electrical engineering,” and so on.

Anchor text. Web browsers highlight words in a Web page to indicate the presence of a link that users can click on. These words are known as link anchor text. Web search engines index anchor text with a link’s target—as well as its source—because anchor text provides useful descriptions of the target, except “click here,” of course. Pages that have many incoming links accumulate a variety of anchor text descriptions. The most useful descriptions can be repeated thousands of times, providing a strong signal of what the page is about. Anchor text contributes strongly to the quality of search results.

Link popularity score. Search engines assign pages a link popularity score derived from the frequency of incoming links. This can be a simple count or it can exclude links from within the same site. PageRank, a more sophisticated link popularity score, assigns different weights to links depending on the source’s page rank. PageRank computation is an eigenvector calculation on the page-page link connectivity matrix.

Processing matrices of rank 20 billion is computationally impractical, and researchers have invested a great deal of brainpower in determining how to reduce the scale of PageRank-like problems. One approach is to compute HostRanks from the much smaller host-host connectivity matrix and distribute PageRanks to individual pages within each site afterwards. PageRank is a Google technology, but other engines use variants of this approach.
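The eigenvector can be approximated by power iteration, as in this sketch of the standard formulation from the Brin and Page paper cited earlier (the damping factor of 0.85 is the conventional choice, not a figure from the column):

    def pagerank(links, damping=0.85, iterations=50):
        """links: dict mapping each page to the list of pages it links to;
        every link target is assumed to appear as a key."""
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
            for page, outlinks in links.items():
                if not outlinks:                        # dangling page: spread its rank evenly
                    share = damping * rank[page] / len(pages)
                    for p in pages:
                        new_rank[p] += share
                else:
                    share = damping * rank[page] / len(outlinks)
                    for target in outlinks:
                        new_rank[target] += share
            rank = new_rank
        return rank

    toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(pagerank(toy_graph))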

Query-independent score. Internally, search engines rank Web pages independently of any query, using a combination of query-independent factors such as link popularity, URL brevity, spam score, and perhaps the frequency with which users click them. A page with a high query-independent score has a higher a priori probability of retrieval than others that match the query equally well.

QUERY PROCESSING ALGORITHMS

By far the most common type of query that search engines receive consists of a small number of query words, without operators—for example, “Katrina” or “secretary of state.” Several researchers have reported that the average query length is around 2.3 words.

By default, current search engines return only documents containing all the query words. To achieve this, a simple-query processor looks up each query term in the term dictionary and locates its postings list. The processor simultaneously scans the postings lists for all the terms to find documents in common. It stops once it has found the required number of matching documents or when it reaches the end of a list.
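A sketch of this conjunctive scan for two terms, walking both sorted postings lists in step (real postings also carry the extra ranking information described in the indexing sections above):

    def intersect(list_a, list_b, needed=10):
        """list_a, list_b: sorted lists of document numbers.
        Returns up to `needed` documents that appear in both."""
        results, i, j = [], 0, 0
        while i < len(list_a) and j < len(list_b) and len(results) < needed:
            if list_a[i] == list_b[j]:
                results.append(list_a[i])
                i += 1
                j += 1
            elif list_a[i] < list_b[j]:
                i += 1
            else:
                j += 1
        return results

    intersect([2, 17, 111, 272], [3, 17, 111, 500])    # -> [17, 111]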

Figure 1. Inverted file index and associated data structures. In this simplified example, the alphabetically sorted term dictionary allows fast access to a word’s postings list within the inverted file. Postings contain both a document number and a word position within the document. Note that “nice” occurs five times all told, twice in document 111. On the basis of the first posting, the query processor has calculated a relevance score for document 2.

[The figure shows a term dictionary (aaaaa: 1 posting; nice: 5; night: 3; zzzzz: 2) pointing into the inverted file, the postings for “nice”—(2,3) (17,1) (111,2) (111,18) (272,6)—a score accumulator holding document 2 with score 0.407, and a document table listing each document’s length and URL, for example document 1: 5,327, computer.org/.]


In a document-partitioned environment, each machine in a cluster must answer the query on its subset of Web pages and then send the top-ranked results to a coordinating machine for merging and presentation.

REAL QUERY PROCESSORS

The major problem with the simple-query processor is that it returns poor results. In response to the query “the Onion” (seeking the satirical newspaper site), pages about soup and gardening would almost certainly swamp the desired result.

Result quality

Result quality can be dramatically improved if the query processor scans to the end of the lists and then sorts the long list of results according to a relevance-scoring function that takes into account the number of query term occurrences, document length, inlink score, anchor text matches, phrase matches, and so on. The MSN search engine reportedly takes into account more than 300 ranking factors.

Unfortunately, this approach is too computationally expensive to allow processing queries like “the Onion” quickly enough. The postings list for “the” contains billions of entries, and the number of documents needing to be scored and sorted—those that contain both “the” and “onion”—is on the order of tens of millions.

Speeding things up

Real search engines use many techniques to speed things up.

Skipping. Scanning postings lists one entry at a time is very slow. If the next document containing “Onion” is number 2,000,000 and the current position in the “the” list is 1,500,000, the search obviously should skip half a million postings in the latter list as fast as possible. A small amount of additional structure in postings lists permits the query processor to skip forward in steps of hundreds or thousands to the required document number.
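One way to realize such skipping, for postings held as a sorted sequence in memory, is a binary search rather than a linear scan; production engines instead embed explicit skip pointers in their compressed on-disk lists:

    from bisect import bisect_left

    def skip_to(postings, current_index, target_doc):
        """Advance within a sorted postings list to the first entry whose
        document number is >= target_doc, without examining every entry."""
        return bisect_left(postings, target_doc, lo=current_index)

    the_list = range(3_000_000)                    # toy stand-in for the postings of "the"
    pos = skip_to(the_list, 1_500_000, 2_000_000)  # half a million entries skipped at once
    # the_list[pos] == 2_000_000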

Early termination. The query processor can save a great deal of computation if the indexer creates indexes in which it sorts postings lists in order of decreasing value. It can usually stop processing after scanning only a small fraction of the lists because later results are less likely to be valuable than those already seen. At first glance, early termination seems to be inconsistent with skipping and compression techniques, which require postings to be in document number order. But there is a solution.

Clever assignment of document numbers. Instead of arbitrarily numbering documents, the crawler or indexer can number them to reflect their decreasing query-independent score. In other words, document number 1 (or 0 if you are a mathematician!) is the document with the highest a priori probability of retrieval.

This approach achieves a win-win-win solution: effective postings compression, skipping, and early termination.

Caching. There is a strong economic incentive for search engines to use caching to reduce the cost of answering queries. In the simplest case, the search engine precomputes and stores HTML results pages for thousands of the most popular queries. A dedicated machine can use a simple in-memory lookup to answer such queries. The normal query processor also uses caching to reduce the cost of accessing commonly needed parts of term dictionaries and inverted files.
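In its simplest form, the results cache is just an in-memory table keyed on the query string; a sketch (the cache size and the stand-in query evaluator are assumptions):

    from functools import lru_cache

    def evaluate_query(query):
        """Stand-in for the full query processor described above (hypothetical)."""
        return f"<html>results for {query!r}</html>"

    @lru_cache(maxsize=100_000)            # keep results pages for popular queries
    def cached_results_page(query):
        return evaluate_query(query)

    cached_results_page("the onion")       # computed once...
    cached_results_page("the onion")       # ...then served from the in-memory cache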

Limited space prevents discussion of many other fascinating aspects of search engine operation, such as estimating the number of hits for a query when it has not been fully evaluated, generating advertisements targeted to the search query, searching images and videos, merging search results from other sources such as news, generating spelling suggestions from query logs, creating query-biased result snippets, and performing linguistic operations such as word-stemming in a multilingual environment.

A high priority for search engine operation is monitoring the search quality to ensure that it does not decrease when a new index is installed or when the search algorithm is modified. But that is a story in itself. ■

David Hawking is a principal research scientist at CSIRO ICT Centre, Canberra, Australia, and chief scientist at funnelback.com. Contact him at [email protected].
