Qian Liu, Computer and Information Sciences Department


A Presentation on

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Sergey Brin and Lawrence Page


Problem
• Size of the Web:
  • On the order of hundreds of terabytes
  • Still growing
• Problems with search engines (AltaVista, Excite, etc.):
  • Return huge numbers of document entries
  • Too many low-quality or marginally relevant matches


Problem
• Yahoo!:
  • Expensive
  • Slow to improve
  • Cannot cover all esoteric topics
• Problems with users:
  • Inexperienced
  • Do not provide tightly constrained keywords


Motivation and Applications
• Improve the quality of web search engines
• Scale to keep up with the growth of the Web
• Academic search engine research
  • Current search engine technology is advertising oriented
• An “open” search engine
  • Support research activities on large-scale web data


Methods: Basic Idea
Q: “How can a search engine automatically identify high-quality web pages for my topic?”
A: Use hypertextual information to improve search precision:
• Link structure
• Anchor text
• Proximity
• Visual presentation


Methods: PageRank
• PageRank: a measure of importance based on citation (link) ranking
• Link structure: a latent human annotation of importance
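
The paper defines the PageRank of a page A in terms of the pages T1, ..., Tn that link to it, their outgoing-link counts C(Ti), and a damping factor d: PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)). Below is a minimal Python sketch of the iterative computation; the toy link graph and the value d = 0.85 are assumptions for illustration.

```python
# Iterative PageRank, following the formula from the Brin & Page paper:
#   PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
# The toy link graph and d = 0.85 are assumptions for illustration.

def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    ranks = {p: 1.0 for p in pages}                          # initial PR values
    out_degree = {p: len(links.get(p, [])) for p in pages}

    for _ in range(iterations):
        new_ranks = {}
        for page in pages:
            # Sum the contributions of every page that links to `page`.
            incoming = sum(ranks[src] / out_degree[src]
                           for src, targets in links.items()
                           if page in targets)
            new_ranks[page] = (1 - d) + d * incoming
        ranks = new_ranks
    return ranks

toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(toy_graph))
```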


Methods: PageRank
Why does PageRank work?
• Users often want information from a “trusted” source
  • Collaborative trust
• Inexpensive to compute
  • Allows fast updates
• Fewer privacy implications
  • Only public information is used


Methods: Anchor Text
• Associate anchor text not only with the page the link is on, but also with the page the link points to
• Anchor text often provides accurate descriptions of web pages
• Makes it possible to search pages that cannot themselves be indexed (e.g., images, programs)
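
A minimal sketch of the idea, assuming a simple in-memory map from target URL to the anchor phrases that point at it; the data structures and example URLs are illustrative, not Google's actual index format.

```python
from collections import defaultdict

# Map each target URL to the anchor phrases that point at it, so a page can be
# found by words it never contains itself. Structures and URLs are illustrative.
anchor_index = defaultdict(list)      # target URL -> [(source URL, anchor text)]

def record_link(source_url, target_url, anchor_text):
    # Anchor text describes the *target*, so it is indexed under target_url.
    anchor_index[target_url].append((source_url, anchor_text.lower()))

def search_by_anchor(query):
    query = query.lower()
    return [url for url, anchors in anchor_index.items()
            if any(query in text for _, text in anchors)]

record_link("http://example.edu/links.html", "http://example.com/campus.jpg",
            "Picture of the campus")
print(search_by_anchor("campus"))     # finds the image purely via anchor text
```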


Methods: Proximity
• Hits: occurrences of a query word in a document
• Hit locations are recorded in the index
• Multi-word search: calculate proximity, i.e., how far apart the hits occur in the document (or anchor)
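
A minimal sketch of scoring proximity from stored hit positions; the bucketing of distances into weights is an assumption for illustration, since the slide and the paper only say that closer hits count for more.

```python
# Score a multi-word query by how close together its hits occur.
# Hit positions are assumed to come from the index; the distance buckets
# below are illustrative assumptions.

def proximity_weight(positions_a, positions_b):
    closest = min(abs(a - b) for a in positions_a for b in positions_b)
    if closest <= 1:        # adjacent: effectively a phrase match
        return 8
    if closest <= 8:        # nearby, e.g. within the same sentence
        return 4
    if closest <= 50:       # same region of the document
        return 2
    return 1                # "not even close"

# Word positions of "bill" and "clinton" in one hypothetical document:
print(proximity_weight([3, 57], [4, 90]))   # adjacent hits -> weight 8
```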


Methods: Visual Presentation
• Font size: larger or bolder fonts receive higher weights
• Capitalization: capitalized hits receive higher weights


Methods: Architecture and Major Data Structures


Methods: Major Applications
• Crawling
• Indexing
• Searching


Crawling
How a crawler works:
• Starts from a defined webspace
• Requests URLs
• Stores the returned objects in a file system
• Examines the content of each object, scanning for HTML anchor tags <A ...>
• Ignores URLs that do not conform to the specified rules; visits URLs that do
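
A minimal single-process sketch of that loop, using Python's standard urllib and a simple regex for anchor tags; the seed URL, the webspace prefix rule, and the page limit are assumptions for illustration.

```python
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

# A toy crawler loop illustrating the steps above; not a production crawler.
def crawl(seed, allowed_prefix, max_pages=10):
    queue, seen, stored = deque([seed]), {seed}, {}
    while queue and len(stored) < max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue                              # unreachable URL: skip it
        stored[url] = html                        # store the returned object
        # Examine the content: scan for HTML anchor tags and their targets.
        for href in re.findall(r'<a\s+[^>]*href="([^"]+)"', html, re.I):
            link = urljoin(url, href)
            # Visit only URLs that conform to the defined webspace rule.
            if link.startswith(allowed_prefix) and link not in seen:
                seen.add(link)
                queue.append(link)
    return stored

# pages = crawl("http://example.com/", "http://example.com/")
```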


Crawling
Google’s web crawling system:
• Fast distributed crawling system
• A URLServer serves lists of URLs to the crawlers
• Each crawler keeps 300 connections open at once
• Each connection can be in one of several states:
  1. Looking up DNS
  2. Connecting to host
  3. Sending request
  4. Receiving response
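
A sketch of how one crawler process might keep hundreds of fetches in flight at once, assuming the third-party aiohttp library and a semaphore capped at 300 connections; the function names and error handling are illustrative, not Google's implementation.

```python
import asyncio
import aiohttp   # third-party HTTP client, used here only for illustration

async def fetch(session, limit, url):
    async with limit:                  # at most 300 connections open at once
        try:
            # DNS lookup, connect, send request, and receive response all
            # happen here; asyncio lets other fetches progress meanwhile.
            async with session.get(url) as response:
                return url, await response.text()
        except Exception:
            return url, None

async def crawl_batch(urls):
    limit = asyncio.Semaphore(300)
    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        return await asyncio.gather(*(fetch(session, limit, u) for u in urls))

# results = asyncio.run(crawl_batch(["http://example.com/"]))
```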


Indexing
• Uses Flex to generate a lexical analyzer
• Parses each document
• Converts each word into a wordID
• Converts each document into a set of hits
• The sorter sorts the result by wordID to generate the inverted index
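
A minimal sketch of that pipeline, with a regex tokenizer standing in for the Flex-generated lexical analyzer and simplified stand-ins for the lexicon, forward index, and inverted index.

```python
import re
from collections import defaultdict

# Simplified stand-ins for the real data structures.
lexicon = {}                          # word -> wordID
forward_index = {}                    # docID -> list of (wordID, position) hits
inverted_index = defaultdict(list)    # wordID -> sorted list of (docID, position)

def word_id(word):
    return lexicon.setdefault(word, len(lexicon))

def index_document(doc_id, text):
    # Parse the document and convert it into a set of hits.
    forward_index[doc_id] = [(word_id(w.lower()), pos)
                             for pos, w in enumerate(re.findall(r"\w+", text))]

def build_inverted_index():
    # The "sorter" step: regroup the hits by wordID instead of by docID.
    for doc_id, hits in forward_index.items():
        for wid, pos in hits:
            inverted_index[wid].append((doc_id, pos))
    for postings in inverted_index.values():
        postings.sort()

index_document(1, "Large scale hypertextual web search engine")
index_document(2, "Improving web search quality")
build_inverted_index()
print(inverted_index[lexicon["search"]])   # the doclist (with positions) for "search"
```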


Searching
1. Seek to the start of the doclist for every query word.
2. Scan through the doclists until there is a document that matches all the search terms.
3. Compute the rank of that document for the query.
4. If we are not at the end of any doclist, go to step 2.
5. Sort the documents that have matched by rank and return the top k.
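
A simplified sketch of that loop, collapsing the doclist scan into a set intersection over a tiny assumed index; the rank function is a placeholder for the ranking described on the next slide.

```python
# The tiny index, the placeholder rank function, and the names below are
# assumptions for illustration.
toy_index = {                      # word -> sorted doclist of matching docIDs
    "web":    [1, 2, 3],
    "search": [1, 3, 4],
}

def search(query, index, k=10, rank=lambda doc: 1.0):
    doclists = [index.get(word.lower(), []) for word in query.split()]
    if not doclists or any(not dl for dl in doclists):
        return []
    # Step 2: keep only documents that appear in every word's doclist.
    matches = set(doclists[0]).intersection(*map(set, doclists[1:]))
    # Steps 3 and 5: rank each match, sort by rank, return the top k.
    return [doc for _, doc in sorted(((rank(d), d) for d in matches),
                                     reverse=True)[:k]]

print(search("web search", toy_index))   # -> documents 1 and 3
```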


Searching: Ranking
The ranking function combines:
• PageRank
• Type weight
• Count weight
• Proximity
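
A sketch of how those signals might be combined into a single score; the specific type weights and the multiplicative combination are assumptions for illustration, since the paper lists the ingredients but not the coefficients.

```python
# Illustrative combination of the four signals on this slide; the weights
# and the mixing formula are assumptions, not Google's actual values.

TYPE_WEIGHTS = {"title": 10, "anchor": 8, "url": 6, "large_font": 4, "plain": 1}

def ir_score(hits):
    """hits: list of (hit_type, count_weight) pairs for one document and query."""
    return sum(TYPE_WEIGHTS.get(hit_type, 1) * count_weight
               for hit_type, count_weight in hits)

def final_rank(pagerank, hits, proximity_weight):
    # Combine the text-based IR score with the link-based PageRank score.
    return ir_score(hits) * proximity_weight * pagerank

print(final_rank(pagerank=0.8,
                 hits=[("title", 1.0), ("plain", 0.5)],
                 proximity_weight=4))
```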


Results
A search on “bill clinton” returns:
• High-quality pages
• Non-crawlable pages (found via anchor text)
• No results about a Bill other than Clinton, and no results about a Clinton other than Bill


Comparison with Other Search Engines
1. Breadth-first search vs. depth-first search
2. Comparison with WebCrawler:
   WebCrawler: files it cannot index, such as pictures, sounds, etc., are not retrieved.
   Google: uses anchor text, so such files can still be found.
3. Number of crawlers:
   WebCrawler: 15
   Google: typically 3


Comparison with Other Search Engines (continued)

4. Quantity vs. quality

AltaVista: favors quantity
Google: provides quality search


Weak Points of the Study
1. To limit response time, once a certain number of matching documents have been found, the searcher stops scanning, sorts what it has, and returns the results, which can yield sub-optimal results.
2. Lacks features such as Boolean operators, negation, etc.
3. Search efficiency: no optimizations such as query caching or subindices on common terms.


Suggestions for Future Study
1. Using link structure:
   When calculating PageRank, exclude links between two pages in the same web domain, since such links often serve navigation functions and do not confer authority.
2. Personalize PageRank by increasing the weight of a user’s homepage or bookmarks (see the sketch below).

“99% of the Web information is useless to 99% of the Web users.”
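
A minimal sketch of the personalization idea from point 2: concentrate the random-jump term of PageRank on a user's bookmarked pages instead of spreading it over the whole Web. The normalized form of the formula, the bookmark set, and the toy graph are assumptions for illustration.

```python
# Personalized PageRank: all of the (1 - d) "random jump" probability is
# given to the user's own pages. Bookmarks and graph are illustrative.

def personalized_pagerank(links, bookmarks, d=0.85, iterations=50):
    pages = set(links) | {t for targets in links.values() for t in targets}
    # Teleport vector: the (1 - d) mass goes only to the bookmarked pages.
    e = {p: (1.0 / len(bookmarks) if p in bookmarks else 0.0) for p in pages}
    ranks = {p: 1.0 / len(pages) for p in pages}
    out_degree = {p: len(links.get(p, [])) for p in pages}
    for _ in range(iterations):
        ranks = {p: (1 - d) * e[p]
                    + d * sum(ranks[src] / out_degree[src]
                              for src, targets in links.items()
                              if p in targets)
                 for p in pages}
    return ranks

toy_graph = {"home": ["A"], "A": ["B"], "B": ["home", "A"]}
print(personalized_pagerank(toy_graph, bookmarks={"home"}))
```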


Suggestions for Future Study (continued)
3. Make use of hubs: collections of links to authorities.
4. In addition to anchor text, use the text surrounding links, too.


Conclusions
• Quality search results
• Techniques: PageRank, anchor text, proximity
• A complete architecture for crawling, indexing, and searching
