1 The anatomy of a Large Scale Search Engine Sergey Brin,Lawrence Page Dept. CS of Stanford...

1

The anatomy of a Large Scale Search Engine

Sergey Brin, Lawrence Page
Dept. of Computer Science, Stanford University

2

Outline

Introduction
Design Goals
System Features
System Anatomy
Results & Performance
Conclusion

3

Introduction

Google: a large-scale search engine
Designed to crawl and index the web efficiently
Crawler: downloads web pages
Gives better results than existing systems
The name is a play on googol, 10^100
Engineering a search engine is a challenging task.
Two ways of surfing the web:
  High-quality human-maintained lists (Yahoo!)
    too slow to improve, can’t cover esoteric topics

4

Introduction(Cont.)

Human-maintained lists are also expensive to build and maintain.
Search engines (Google)
  search by keywords
  too many low-quality matches
  people try to mislead automated search engines
Challenges in creating a search engine:
  Fast crawling technology
  Efficient use of storage space
  Handling queries quickly

5

Introduction(Cont.)

Scaling with the web
Hardware performance keeps improving
  exceptions: disk seek time, operating system robustness
Google’s data structures are optimized for fast and efficient access.
Google is a centralized search engine.

6

Design Goals

Improved search quality
  Junk results often wash out any results that a user is interested in
  As the collection size grows, we need tools with very high precision
  Use of hypertextual information
    quality filtering
    link structure and link text
Support novel research on large-scale web data

7

System features

PageRank: bringing order to the web
  Most web search engines have largely ignored the link graph
  Google has built link maps containing 518 million hyperlinks
  Corresponds well with people’s idea of importance
    citation importance

[Figure: pages B and C each link to page A.]
B and C are backlinks of A

8

•For this example:

PR(A) = (1-d) + d(PR(T1)/3 + PR(T2) + PR(T3) + PR(T4)/2)

•Motivation:
–Pages that are cited from many places are worth looking at.
–Pages that are cited from an important place are worth looking at.

9

System features

PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
  Assume page A has pages T1...Tn which point to it.
  d is a damping factor between 0 and 1, usually set to 0.85.
  C(A): the number of links going out of page A.
Random surfer
  Is given a random URL
  Clicks randomly on links
  After a while gets bored and starts again from a new random URL
  At each page the surfer follows a link with probability d and, with probability 1 - d, gets bored and requests another random page.
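The PageRank formula can be evaluated by simple iteration. The sketch below is a minimal illustration, not the implementation used by Google: the function name `pagerank`, the fixed iteration count, the starting value of 1.0 per page, and the three-page graph are all assumptions made for the example. It applies the damping factor d to the sum of incoming contributions, as in the worked example for page A.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}  # assumed starting value
    for _ in range(iterations):
        new_pr = {p: 1.0 - d for p in pages}
        for source, targets in links.items():
            if not targets:
                continue  # pages without out-links pass nothing on in this sketch
            share = d * pr[source] / len(targets)  # d * PR(T) / C(T)
            for target in targets:
                new_pr[target] += share
        pr = new_pr
    return pr


# Hypothetical graph: B and C are backlinks of A, and A links back to B.
graph = {"A": ["B"], "B": ["A"], "C": ["A"]}
ranks = pagerank(graph)
```

In this toy graph A ends up with the highest rank because it has two backlinks, while C, which nothing links to, keeps only the base value 1 - d.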

10

System features

Difference from traditional methods
  Links from different pages are not counted equally
  Normalizes by the number of links on a page
Anchor text
  Associate link text with the page it points to.
  Advantages:
    Anchors often provide a more accurate description than the page itself
    Anchors can exist for documents that can’t be indexed: images, non-text documents
    Can return pages that haven’t been crawled
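A small sketch of the anchor-text idea: the text of each link is filed under the URL it points to rather than under the page that contains it. The function name `index_anchors`, the tuple layout, and the example URLs are hypothetical.

```python
from collections import defaultdict


def index_anchors(links):
    """Attach each link's anchor text to its target URL, not to the source page."""
    anchors = defaultdict(list)
    for source_url, target_url, text in links:
        anchors[target_url].append(text)
    return anchors


# Hypothetical links; the image target has no text of its own to index.
links = [
    ("http://a.example/", "http://b.example/logo.png", "company logo"),
    ("http://c.example/", "http://b.example/logo.png", "b.example logo image"),
]
anchor_index = index_anchors(links)
```

The image can now be found by the words "logo" or "company" even though it was never parsed as text.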

12

System features

Other features
  Location information: use of proximity in search
  Visual presentation information: relative font size
  Full raw HTML is available
    Users can view a cached version of pages

13

System Anatomy

14

System Anatomy

Designed to avoid disk seeks.
Web pages are fetched, compressed, and stored in a repository.
The indexer parses the documents into hits (stored in barrels) and anchors.

15

Major Data Structures

Hit Lists
Forward Index
Inverted Index
Crawling the web
Indexing the web
Life of a Query
The Ranking system

21

Hit List
What is a hit list?
A hit list is a list of occurrences of a particular word in a particular document, including position, font, and capitalization information.
Stored in both the forward and inverted indices.
Uses a hand-optimized compact encoding (less space than a simple encoding, less bit manipulation than Huffman coding).
Each hit takes 2 bytes.

Field widths in bits:
Plain:   cap:1  imp:3  position:12
Fancy:   cap:1  imp:3  type:4  position:8
Anchor:  cap:1  imp:3  type:4  hash:4  position:4
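To make the 2-byte layout concrete, here is a sketch that packs a plain hit into 16 bits using the field widths above (cap:1, imp:3, position:12). The bit order within the 16 bits and the function names are assumptions; the slide only gives the widths.

```python
def pack_plain_hit(cap, imp, position):
    """Pack a plain hit into 16 bits: cap (1 bit), imp (3 bits), position (12 bits)."""
    if not (0 <= cap < 2 and 0 <= imp < 8 and 0 <= position < 4096):
        raise ValueError("field out of range")
    # Assumed layout: cap in the top bit, then imp, then position.
    return (cap << 15) | (imp << 12) | position


def unpack_plain_hit(hit):
    """Split a 16-bit plain hit back into (cap, imp, position)."""
    return (hit >> 15) & 0x1, (hit >> 12) & 0x7, hit & 0xFFF


hit = pack_plain_hit(cap=1, imp=3, position=42)
fields = unpack_plain_hit(hit)
```

With 12 bits, positions above 4095 cannot be represented, so a real encoder would need a policy for them; this sketch simply rejects such values.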

23

Forward Index
Given a docID, get its wordIDs and hit lists.
Partially sorted and stored in forward barrels.
Each barrel holds a range of wordIDs.
A docID can appear in more than one barrel.
Instead of storing the actual wordID, each wordID is stored as a relative difference from the minimum wordID in that barrel, which fits in 24 bits (2^24, about 16 million, wordIDs per barrel).

Barrel layout (field widths in bits):
docID  wordID:24  nhits:8  hit hit hit hit
       wordID:24  nhits:8  hit hit hit hit
       null wordID
docID  wordID:24  nhits:8  hit hit hit hit
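The relative wordID trick can be sketched as follows: a 24-bit offset from the barrel's minimum wordID and an 8-bit hit count share one 32-bit header, followed by the 2-byte hits. The byte order, the function names, and the sample values are assumptions for illustration.

```python
import struct


def pack_entry(word_id, barrel_min_word_id, hits):
    """Encode one (wordID, hit list) entry as a 32-bit header plus 16-bit hits."""
    relative = word_id - barrel_min_word_id
    if not 0 <= relative < 2 ** 24:
        raise ValueError("wordID does not fit in this barrel")
    if len(hits) >= 2 ** 8:
        raise ValueError("hit count does not fit in 8 bits")
    header = (relative << 8) | len(hits)  # wordID:24, nhits:8
    return struct.pack(f">I{len(hits)}H", header, *hits)


def unpack_entry(data, barrel_min_word_id):
    """Decode an entry produced by pack_entry."""
    (header,) = struct.unpack_from(">I", data, 0)
    nhits = header & 0xFF
    hits = struct.unpack_from(f">{nhits}H", data, 4)
    return (header >> 8) + barrel_min_word_id, list(hits)


# Hypothetical barrel whose smallest wordID is 1,000,000.
entry = pack_entry(1_000_005, 1_000_000, [45098, 7])
decoded = unpack_entry(entry, 1_000_000)
```

Storing the offset instead of the full wordID is what leaves room for the 8-bit hit count in the same 4 bytes.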

24

Inverted Index
Given a wordID, get the docIDs that contain it.
Consists of the same barrels as the forward index, after they have been processed by the sorter.
For every valid wordID, the lexicon contains a pointer into the barrel that wordID falls into.
Two sets of inverted barrels: one set for hit lists which include title or anchor hits, another for all hit lists.

Lexicon entries:
wordID  ndocs
wordID  ndocs

Inverted barrel entries (field widths in bits):
docID:27  nhits:5  hit hit hit hit
docID:27  nhits:5  hit hit hit hit
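As a rough illustration of how a forward index turns into an inverted one, the sketch below regroups (docID, wordID, hits) data by wordID and derives the ndocs count kept in the lexicon. It uses in-memory dictionaries rather than barrels on disk, and the names `invert` and `ndocs` are hypothetical.

```python
from collections import defaultdict


def invert(forward_index):
    """Turn {docID: {wordID: hits}} into {wordID: [(docID, hits), ...]}."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):  # keep each doclist ordered by docID
        for word_id, hits in forward_index[doc_id].items():
            inverted[word_id].append((doc_id, hits))
    return dict(inverted)


# Hypothetical forward index with two documents.
forward = {7: {100: [1, 5], 200: [2]}, 3: {100: [9]}}
inverted = invert(forward)
# ndocs: the number of documents in each word's doclist.
ndocs = {word_id: len(doclist) for word_id, doclist in inverted.items()}
```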

25

Crawling the web

Fast distributed crawling system.
The URL server and crawlers are implemented in Python.
A single URL server feeds about 3 crawlers.
Each crawler keeps about 300 connections open at the same time; the system downloads about 600 KB of data per second.
Each crawler maintains its own DNS cache to avoid repeated lookups.
Each connection is in one of four states: looking up DNS, connecting to host, sending request, receiving response.
Asynchronous IO is used to manage events.
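The crawler design, many open connections driven by asynchronous IO plus a DNS cache, can be sketched with Python's `asyncio`. This is a simulation under stated assumptions, not the original crawler: no network traffic is generated, `asyncio.sleep(0)` stands in for the DNS lookup and the HTTP exchange, and the URLs and function names are made up for the example.

```python
import asyncio

MAX_CONNECTIONS = 300  # per-crawler figure quoted on the slide


async def resolve(host, dns_cache):
    """Stand-in for a DNS lookup; a cache hit skips the lookup."""
    if host not in dns_cache:
        await asyncio.sleep(0)  # simulated lookup
        dns_cache[host] = f"ip-of-{host}"
    return dns_cache[host]


async def fetch(url, dns_cache, limit):
    """Simulate the states: look up DNS, connect, send request, receive response."""
    host = url.split("/")[2]
    async with limit:  # at most MAX_CONNECTIONS fetches in flight
        address = await resolve(host, dns_cache)
        await asyncio.sleep(0)  # simulated connect, request, and response
        return url, address


async def crawl(urls):
    dns_cache = {}
    limit = asyncio.Semaphore(MAX_CONNECTIONS)
    results = await asyncio.gather(*(fetch(u, dns_cache, limit) for u in urls))
    return results, dns_cache


# Hypothetical URL list: ten pages spread over three hosts.
urls = [f"http://host{i % 3}.example/page{i}" for i in range(10)]
results, cache = asyncio.run(crawl(urls))
```

Ten pages on three hosts leave only three entries in the cache, which is the point of caching lookups per crawler.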

26

Indexing the web

Parsing
  The parser must handle a huge array of possible errors:
    HTML typos
    Non-ASCII characters
    HTML tags nested hundreds deep
  Google developed its own parser
Indexing documents into barrels
  Words are turned into wordIDs
  An in-memory hash table, the lexicon, does the mapping
  New additions are logged to a file
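A minimal sketch of the lexicon idea: an in-memory hash table maps words to wordIDs, and words that are not yet known are assigned the next ID and recorded in a log. The class name, the ID assignment scheme, and the in-memory list standing in for the log file are assumptions.

```python
class Lexicon:
    """In-memory word -> wordID table that logs newly added words."""

    def __init__(self, base_words):
        self.word_ids = {word: i for i, word in enumerate(base_words)}
        self.new_word_log = []  # stands in for the log file

    def word_id(self, word):
        if word not in self.word_ids:
            self.word_ids[word] = len(self.word_ids)  # next free ID
            self.new_word_log.append(word)
        return self.word_ids[word]


# Hypothetical base lexicon of two words.
lexicon = Lexicon(["the", "web"])
ids = [lexicon.word_id(w) for w in "the web search web".split()]
```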

27

Indexing the web

Parallelization
  Shared base lexicon of 14 million words
  Each indexer writes a log of all the extra words not in the base lexicon
Sorting
  Creates the inverted index
  Produces two types of barrels:
    for titles and anchors
    for full text
  Sorters run in parallel
  The sorting is done in main memory

28

Searching

1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all of the search terms.
5. Compute the rank of that document.
6. If we’re in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we’re not at the end of any doclist, go to step 4.
8. Sort the matched documents by rank and return the top K.
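The query steps can be sketched in a few lines. This is a simplified illustration, not the original evaluator: it intersects whole doclists with set operations instead of scanning them in step, the barrels are plain dictionaries, and the `rank` callback and sample data are hypothetical.

```python
def search(word_ids, short_barrel, full_barrel, rank, k=10):
    """Return the top-k docIDs that contain every query word.

    Each barrel maps wordID -> {docID: hit list}. The short barrel is
    consulted first, then the full barrel.
    """
    scores = {}
    for barrel in (short_barrel, full_barrel):
        doclists = [barrel.get(w, {}) for w in word_ids]
        if not doclists:
            break
        # Documents present in every doclist match all of the search terms.
        matching = set(doclists[0]).intersection(*doclists[1:])
        for doc_id in matching:
            if doc_id not in scores:
                scores[doc_id] = rank([dl[doc_id] for dl in doclists])
    return sorted(scores, key=scores.get, reverse=True)[:k]


def count_hits(hit_lists):
    """Toy ranking function: total number of hits across the query words."""
    return sum(len(hits) for hits in hit_lists)


# Hypothetical barrels for wordIDs 1 and 2.
short = {1: {10: [3, 4, 6]}, 2: {10: [5, 9], 11: [1]}}
full = {1: {10: [3, 4, 6, 40], 12: [7]}, 2: {10: [5, 9], 11: [1], 12: [2]}}
top = search([1, 2], short, full, rank=count_hits)
```

Document 11 is dropped because it contains only one of the two query words.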

30

The Ranking system

PageRank(TM) determines the relative importance of each page Google crawls on the web.
The ranking system also considers the text included in the links to a site and the text on each page, alongside the PageRank of the sites linking to the site being evaluated.
For a single-word search, check the hit list for that word.
In a multi-word search, hits occurring close together in a document are weighted higher than hits occurring far apart.
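The proximity idea for multi-word queries can be illustrated with a toy score based on the smallest distance between hits of two words. The weighting 1 / (1 + gap) is an assumption chosen for simplicity; the slide does not give the actual weights.

```python
def proximity_score(positions_a, positions_b):
    """Score two words' hit positions: the closer the nearest pair, the higher."""
    gap = min(abs(a - b) for a in positions_a for b in positions_b)
    return 1.0 / (1 + gap)  # assumed weighting, for illustration only


adjacent = proximity_score([3], [4])   # the words appear next to each other
far_apart = proximity_score([3], [40])  # the words are 37 positions apart
```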

32

Results

Example: query “Bill Clinton”
  Returns results from whitehouse.gov
  Returns the email address of the president
  All the results are high-quality pages
  No broken links
  No “Bill” without “Clinton” and vice versa

33

Storage Requirements

The repository is compressed
About 55 GB for all the data used by the search engine
Most queries can be answered using just the short inverted index
With better compression, a high-quality search engine can fit onto a 7 GB drive of a new PC.

35

System Performance

It took 9 days to download 26 million pages
Once running smoothly, the crawler averaged 48.5 pages per second
The indexer and crawler ran simultaneously
The indexer runs at 54 pages per second
The sorters run in parallel on 4 machines; the whole sorting process took 24 hours.
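A quick arithmetic check of these figures, using only the numbers on the slide. The overall crawl average works out lower than 48.5 pages per second, so one plausible reading is that 48.5 describes the rate once the crawler was running smoothly rather than the average over the whole 9 days. The variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

pages = 26_000_000
crawl_days = 9
indexer_rate = 54  # pages per second

# Average crawl rate over the whole 9 days, in pages per second.
average_crawl_rate = pages / (crawl_days * SECONDS_PER_DAY)

# Time the indexer alone would need for all pages at its stated rate, in days.
indexing_days = pages / indexer_rate / SECONDS_PER_DAY
```

The overall average comes to roughly 33 pages per second, and indexing every page at 54 pages per second would take about five and a half days.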

36

Conclusion

Scalable search engine
High-quality search results
Search techniques
  PageRank
  Anchor text
  Proximity information
Search features
  Catalog, Site Search, Cached links, Similar pages, Who links to you, File Types
Speed: efficient algorithms, thousands of low-cost PCs networked together
