Qian Liu, Computer and Information Sciences Department


A Presentation on

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Sergey Brin and Lawrence Page


Problem
• Size of the Web:
  • On the order of hundreds of terabytes
  • Still growing
• Problems with search engines (AltaVista, Excite, etc.):
  • Return huge numbers of document entries
  • Too many low-quality or marginally relevant matches


Problem
• Yahoo!:
  • Expensive
  • Slow to improve
  • Cannot cover all esoteric topics
• Problems with users:
  • Inexperienced
  • Do not provide tightly constrained keywords


Motivation and Applications
• Improve the quality of web search engines
• Scale to keep up with the growth of the Web
• Academic search engine research
  • Current search engine technology is advertising oriented
• An “open” search engine
  • Support research activities on large-scale web data


Methods: Basic Idea
Q: “How can a search engine automatically identify high-quality web pages for my topic?”
A: Use hypertextual information to improve search precision:
• Link structure
• Anchor text
• Proximity
• Visual presentation


Methods: PageRank
• PageRank: a measure of importance based on citation (link) ranking
• Link structure: a latent human annotation of importance
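
The paper defines the PageRank of a page A in terms of the pages T1, ..., Tn that link to it, their outgoing-link counts C(Ti), and a damping factor d: PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)). Below is a minimal Python sketch of the iterative computation; the toy link graph and the value d = 0.85 are assumptions for illustration.

```python
# Iterative PageRank, following the formula from the Brin & Page paper:
#   PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
# The toy link graph and d = 0.85 are assumptions for illustration.

def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    ranks = {p: 1.0 for p in pages}                          # initial PR values
    out_degree = {p: len(links.get(p, [])) for p in pages}

    for _ in range(iterations):
        new_ranks = {}
        for page in pages:
            # Sum the contributions of every page that links to `page`.
            incoming = sum(ranks[src] / out_degree[src]
                           for src, targets in links.items()
                           if page in targets)
            new_ranks[page] = (1 - d) + d * incoming
        ranks = new_ranks
    return ranks

toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(toy_graph))
```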


Methods: PageRank
Why does PageRank work?
• Users often want information from a “trusted” source
  • Collaborative trust
• Inexpensive to compute
  • Allows fast updates
• Fewer privacy implications
  • Only public information is used


Methods: Anchor Text
• Associate anchor text not only with the page the link is on, but also with the page the link points to
• Anchor text often provides accurate descriptions of web pages
• Makes it possible to search pages that cannot themselves be indexed (e.g., images, programs)
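
A minimal sketch of the idea, assuming a simple in-memory map from target URL to the anchor phrases that point at it; the data structures and example URLs are illustrative, not Google's actual index format.

```python
from collections import defaultdict

# Map each target URL to the anchor phrases that point at it, so a page can be
# found by words it never contains itself. Structures and URLs are illustrative.
anchor_index = defaultdict(list)      # target URL -> [(source URL, anchor text)]

def record_link(source_url, target_url, anchor_text):
    # Anchor text describes the *target*, so it is indexed under target_url.
    anchor_index[target_url].append((source_url, anchor_text.lower()))

def search_by_anchor(query):
    query = query.lower()
    return [url for url, anchors in anchor_index.items()
            if any(query in text for _, text in anchors)]

record_link("http://example.edu/links.html", "http://example.com/campus.jpg",
            "Picture of the campus")
print(search_by_anchor("campus"))     # finds the image purely via anchor text
```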


Methods: Proximity
• Hits: occurrences of a query word in a document
• Hit locations are recorded in the index
• Multi-word search: calculate proximity, i.e., how far apart the hits occur in the document (or anchor)
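
A minimal sketch of scoring proximity from stored hit positions; the bucketing of distances into weights is an assumption for illustration, since the slide and the paper only say that closer hits count for more.

```python
# Score a multi-word query by how close together its hits occur.
# Hit positions are assumed to come from the index; the distance buckets
# below are illustrative assumptions.

def proximity_weight(positions_a, positions_b):
    closest = min(abs(a - b) for a in positions_a for b in positions_b)
    if closest <= 1:        # adjacent: effectively a phrase match
        return 8
    if closest <= 8:        # nearby, e.g. within the same sentence
        return 4
    if closest <= 50:       # same region of the document
        return 2
    return 1                # "not even close"

# Word positions of "bill" and "clinton" in one hypothetical document:
print(proximity_weight([3, 57], [4, 90]))   # adjacent hits -> weight 8
```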


Methods: Visual Presentation
• Font size: larger or bolder fonts receive higher weights
• Capitalization: capitalized hits receive higher weights


Methods: Architecture and Major Data Structures


Methods: Major Applications
• Crawling
• Indexing
• Searching


Crawling
How a crawler works:
• Starts from a defined webspace
• Requests URLs
• Stores the returned objects in a file system
• Examines the content of each object, scanning for HTML anchor tags <A ...>
• Ignores URLs that do not conform to the specified rules; visits URLs that do
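
A minimal single-process sketch of that loop, using Python's standard urllib and a simple regex for anchor tags; the seed URL, the webspace prefix rule, and the page limit are assumptions for illustration.

```python
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

# A toy crawler loop illustrating the steps above; not a production crawler.
def crawl(seed, allowed_prefix, max_pages=10):
    queue, seen, stored = deque([seed]), {seed}, {}
    while queue and len(stored) < max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue                              # unreachable URL: skip it
        stored[url] = html                        # store the returned object
        # Examine the content: scan for HTML anchor tags and their targets.
        for href in re.findall(r'<a\s+[^>]*href="([^"]+)"', html, re.I):
            link = urljoin(url, href)
            # Visit only URLs that conform to the defined webspace rule.
            if link.startswith(allowed_prefix) and link not in seen:
                seen.add(link)
                queue.append(link)
    return stored

# pages = crawl("http://example.com/", "http://example.com/")
```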


Crawling
Google’s web crawling system:
• Fast distributed crawling system
• A URLServer serves lists of URLs to the crawlers
• Each crawler keeps 300 connections open at once
• Each connection can be in one of several states:
  1. Looking up DNS
  2. Connecting to host
  3. Sending request
  4. Receiving response
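
A sketch of how one crawler process might keep hundreds of fetches in flight at once, assuming the third-party aiohttp library and a semaphore capped at 300 connections; the function names and error handling are illustrative, not Google's implementation.

```python
import asyncio
import aiohttp   # third-party HTTP client, used here only for illustration

async def fetch(session, limit, url):
    async with limit:                  # at most 300 connections open at once
        try:
            # DNS lookup, connect, send request, and receive response all
            # happen here; asyncio lets other fetches progress meanwhile.
            async with session.get(url) as response:
                return url, await response.text()
        except Exception:
            return url, None

async def crawl_batch(urls):
    limit = asyncio.Semaphore(300)
    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        return await asyncio.gather(*(fetch(session, limit, u) for u in urls))

# results = asyncio.run(crawl_batch(["http://example.com/"]))
```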


Indexing
• Uses Flex to generate a lexical analyzer
• Parses each document
• Converts each word into a wordID
• Converts each document into a set of hits
• The sorter sorts the result by wordID to generate the inverted index
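
A minimal sketch of that pipeline, with a regex tokenizer standing in for the Flex-generated lexical analyzer and simplified stand-ins for the lexicon, forward index, and inverted index.

```python
import re
from collections import defaultdict

# Simplified stand-ins for the real data structures.
lexicon = {}                          # word -> wordID
forward_index = {}                    # docID -> list of (wordID, position) hits
inverted_index = defaultdict(list)    # wordID -> sorted list of (docID, position)

def word_id(word):
    return lexicon.setdefault(word, len(lexicon))

def index_document(doc_id, text):
    # Parse the document and convert it into a set of hits.
    forward_index[doc_id] = [(word_id(w.lower()), pos)
                             for pos, w in enumerate(re.findall(r"\w+", text))]

def build_inverted_index():
    # The "sorter" step: regroup the hits by wordID instead of by docID.
    for doc_id, hits in forward_index.items():
        for wid, pos in hits:
            inverted_index[wid].append((doc_id, pos))
    for postings in inverted_index.values():
        postings.sort()

index_document(1, "Large scale hypertextual web search engine")
index_document(2, "Improving web search quality")
build_inverted_index()
print(inverted_index[lexicon["search"]])   # the doclist (with positions) for "search"
```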


Searching
1. Seek to the start of the doclist for every query word.
2. Scan through the doclists until there is a document that matches all the search terms.
3. Compute the rank of that document for the query.
4. If we are not at the end of any doclist, go to step 2.
5. Sort the documents that have matched by rank and return the top k.
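
A simplified sketch of that loop, collapsing the doclist scan into a set intersection over a tiny assumed index; the rank function is a placeholder for the ranking described on the next slide.

```python
# The tiny index, the placeholder rank function, and the names below are
# assumptions for illustration.
toy_index = {                      # word -> sorted doclist of matching docIDs
    "web":    [1, 2, 3],
    "search": [1, 3, 4],
}

def search(query, index, k=10, rank=lambda doc: 1.0):
    doclists = [index.get(word.lower(), []) for word in query.split()]
    if not doclists or any(not dl for dl in doclists):
        return []
    # Step 2: keep only documents that appear in every word's doclist.
    matches = set(doclists[0]).intersection(*map(set, doclists[1:]))
    # Steps 3 and 5: rank each match, sort by rank, return the top k.
    return [doc for _, doc in sorted(((rank(d), d) for d in matches),
                                     reverse=True)[:k]]

print(search("web search", toy_index))   # -> documents 1 and 3
```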


Searching: Ranking
The ranking function combines:
• PageRank
• Type weight
• Count weight
• Proximity
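
A sketch of how those signals might be combined into a single score; the specific type weights and the multiplicative combination are assumptions for illustration, since the paper lists the ingredients but not the coefficients.

```python
# Illustrative combination of the four signals on this slide; the weights
# and the mixing formula are assumptions, not Google's actual values.

TYPE_WEIGHTS = {"title": 10, "anchor": 8, "url": 6, "large_font": 4, "plain": 1}

def ir_score(hits):
    """hits: list of (hit_type, count_weight) pairs for one document and query."""
    return sum(TYPE_WEIGHTS.get(hit_type, 1) * count_weight
               for hit_type, count_weight in hits)

def final_rank(pagerank, hits, proximity_weight):
    # Combine the text-based IR score with the link-based PageRank score.
    return ir_score(hits) * proximity_weight * pagerank

print(final_rank(pagerank=0.8,
                 hits=[("title", 1.0), ("plain", 0.5)],
                 proximity_weight=4))
```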


Results
A search on “bill clinton” returns:
• High-quality pages
• Non-crawlable pages (found via anchor text)
• No results about a Bill other than Clinton, and no results about a Clinton other than Bill


Comparison with Other Search Engines
1. Breadth-first search vs. depth-first search
2. Comparison with WebCrawler:
   WebCrawler: files it cannot index, such as pictures, sounds, etc., are not retrieved.
   Google: uses anchor text, so such files can still be found.
3. Number of crawlers:
   WebCrawler: 15
   Google: typically 3


Comparison with Other Search Engines (continued)

4. Quantity vs. quality

AltaVista: favors quantity
Google: provides quality search


Weak Points of the Study
1. To limit response time, once a certain number of matching documents have been found, the searcher stops scanning, sorts what it has, and returns the results, which can yield sub-optimal results.
2. Lacks features such as Boolean operators, negation, etc.
3. Search efficiency: no optimizations such as query caching or subindices on common terms.


Suggestions for Future Study
1. Using link structure:
   When calculating PageRank, exclude links between two pages in the same web domain, since such links often serve navigation functions and do not confer authority.
2. Personalize PageRank by increasing the weight of a user’s homepage or bookmarks (see the sketch below).

“99% of the Web information is useless to 99% of the Web users.”
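
A minimal sketch of the personalization idea from point 2: concentrate the random-jump term of PageRank on a user's bookmarked pages instead of spreading it over the whole Web. The normalized form of the formula, the bookmark set, and the toy graph are assumptions for illustration.

```python
# Personalized PageRank: all of the (1 - d) "random jump" probability is
# given to the user's own pages. Bookmarks and graph are illustrative.

def personalized_pagerank(links, bookmarks, d=0.85, iterations=50):
    pages = set(links) | {t for targets in links.values() for t in targets}
    # Teleport vector: the (1 - d) mass goes only to the bookmarked pages.
    e = {p: (1.0 / len(bookmarks) if p in bookmarks else 0.0) for p in pages}
    ranks = {p: 1.0 / len(pages) for p in pages}
    out_degree = {p: len(links.get(p, [])) for p in pages}
    for _ in range(iterations):
        ranks = {p: (1 - d) * e[p]
                    + d * sum(ranks[src] / out_degree[src]
                              for src, targets in links.items()
                              if p in targets)
                 for p in pages}
    return ranks

toy_graph = {"home": ["A"], "A": ["B"], "B": ["home", "A"]}
print(personalized_pagerank(toy_graph, bookmarks={"home"}))
```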


Suggestions for Future Study (continued)
3. Make use of hubs: collections of links to authorities.
4. In addition to anchor text, use the text surrounding links, too.


Conclusions
• Quality search results
• Techniques: PageRank, anchor text, proximity
• A complete architecture for crawling, indexing, and searching
