1 The anatomy of a Large Scale Search Engine Sergey Brin,Lawrence Page Dept. CS of Stanford...

Date posted: 22-Dec-2015

The Anatomy of a Large-Scale Search Engine

Sergey Brin, Lawrence Page
Dept. of Computer Science, Stanford University


Outline

• Introduction
• Design Goals
• System Features
• System Anatomy
• Results & Performance
• Conclusion


Introduction

• Google: a large-scale search engine
• Designed to crawl and index the web efficiently
  – Crawler: downloads web pages
• Gives better search results than existing systems
• The name Google is a play on googol, 10^100
• Engineering a search engine (SE) is a challenging task
• Two ways of surfing the web:
  – High-quality human-maintained lists (Yahoo!): too slow to improve, can’t cover esoteric topics


Introduction (Cont.)

• Human-maintained lists are also expensive to build and maintain
• Search engines (Google)
  – Search by keywords
  – Too many low-quality matches
  – People try to mislead automated search engines
• Challenges in creating a search engine
  – Fast crawling technology
  – Efficient use of storage space
  – Handling queries quickly


Introduction (Cont.)

• Scaling with the web
  – Hardware performance has improved
  – Exceptions: disk seek time, operating system robustness
• Google’s data structures are optimized for fast and efficient access
• Google is a centralized SE


Design Goals

• Improved search quality
  – Junk results often wash out any results that a user is interested in
  – As the collection size grows, we need tools with very high precision
  – Use of hypertextual information for quality filtering: link structure and link text
• Support novel research on large-scale web data


System features

• PageRank: bringing order to the web
  – Most web SEs have largely ignored the link graph
  – Maps containing 518 million hyperlinks
  – Corresponds well with people’s idea of importance
  – Based on citation importance

[Figure: pages B and C each link to page A; B and C are backlinks of A.]


• For this example:

PR(A) = (1-d) + d(PR(T1)/3 + PR(T2) + PR(T3) + PR(T4)/2)

• Motivation:
  – Pages that are cited from many places are worth looking at.
  – Pages that are cited from an important place are worth looking at.


System features

PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

• Assume page A has pages T1…Tn which point to it.
• The parameter d (the damping factor) can be set between 0 and 1, usually 0.85.
• C(A): the number of links going out of page A.
• Random surfer model
  – The surfer is given a random URL
  – Clicks randomly on links
  – After a while gets bored and jumps to a new random URL
  – At each page the “random surfer” follows a link with probability d; with probability 1-d the surfer gets bored and requests another random page
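The PageRank formula on this slide can be evaluated by simple repeated substitution. The following is a minimal sketch, not the authors' implementation: the function name `pagerank`, the three-page `graph`, and the fixed iteration count are assumptions made for illustration, and every page is assumed to have at least one outgoing link or no inbound dependence on such a page.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    `links` maps each page to the pages it links to (hypothetical input format).
    """
    out_links = {page: set(targets) for page, targets in links.items()}
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)

    ranks = {page: 1.0 for page in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_ranks = {}
        for page in pages:
            # Each backlink T contributes PR(T) divided by its out-degree C(T).
            incoming = sum(
                ranks[src] / len(targets)
                for src, targets in out_links.items()
                if page in targets
            )
            new_ranks[page] = (1 - d) + d * incoming
        ranks = new_ranks
    return ranks


# Hypothetical graph: B and C are backlinks of A, and A links back to B.
graph = {"A": ["B"], "B": ["A"], "C": ["A"]}
ranks = pagerank(graph)
```

In this toy graph page C has no backlinks, so its rank settles at 1-d, while A ends up ranked highest because both other pages cite it.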


System features

• Difference from traditional methods
  – Links from pages are not counted equally
  – Normalizes by the number of links on a page
• Anchor text
  – Associate link text with the page it points to
  – Advantages:
    – Anchors often provide a more accurate description
    – Can exist for documents that can’t be indexed (images, non-text docs)
    – Can return pages that haven’t been crawled


System features

• Other features
  – Location information: use of proximity in search
  – Visual presentation information: relative font size
  – Full raw HTML is available
    – Users can view a cached version of pages


System Anatomy


System Anatomy

• Designed to avoid disk seeks.
• Web pages are fetched, compressed, and stored in the repository.
• The indexer parses the documents into hits (stored in barrels) and anchors.


Major Data Structures

• Hit Lists
• Forward Index
• Inverted Index
• Crawling the web
• Indexing the web
• Life of a Query
• The Ranking system


Hit List

• What is a hit list? A list of occurrences of a particular word in a particular document, including position, font, and capitalization information.
• Stored in both the forward and inverted indices.
• Uses a hand-optimized compact encoding (less space, less bit manipulation).
• Each hit takes 2 bytes.

Plain:    Cap:1  Imp:3  Position:12
Fancy:    Cap:1  Imp:3  Type:4  Pos:8
Anchor:   Cap:1  Imp:3  Type:4  Hash:4  Pos:4
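To make the 2-byte encoding concrete, here is a minimal sketch that packs a plain hit (1 capitalization bit, 3 importance bits, 12 position bits) into 16 bits. The function names and the choice to place fields from the most significant bit downward are assumptions for illustration; the slide gives only the field widths, not the bit order.

```python
import struct


def pack_plain_hit(cap, imp, position):
    """Pack a plain hit into 16 bits: cap (1 bit), imp (3 bits), position (12 bits)."""
    if not (0 <= cap < 2 and 0 <= imp < 8 and 0 <= position < 4096):
        raise ValueError("field does not fit its bit width")
    # Assumed layout: cap in the top bit, then imp, then position.
    return (cap << 15) | (imp << 12) | position


def unpack_plain_hit(hit):
    """Split a 16-bit plain hit back into (cap, imp, position)."""
    return (hit >> 15) & 0x1, (hit >> 12) & 0x7, hit & 0xFFF


hit = pack_plain_hit(cap=1, imp=3, position=100)
raw = struct.pack(">H", hit)  # exactly two bytes on disk
```

Fancy and anchor hits would follow the same pattern with their own field widths, which also sum to 16 bits.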


Forward Index

• Given a docID, get its wordIDs and hit lists.
• Partially sorted and stored in forward barrels.
• Each barrel holds a range of wordIDs.
• Duplicated docIDs exist in the barrels.
• Instead of storing the actual wordID, each wordID is stored as a relative difference from the minimum wordID in that barrel, so 24 bits suffice: 2^24 ≈ 16 million wordIDs per barrel.

docID   wordID:24   nhits:8   hit hit hit hit
        wordID:24   nhits:8   hit hit hit hit
        null wordID
docID   wordID:24   nhits:8   hit hit hit hit
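A forward barrel entry as described here pairs a 24-bit relative wordID with an 8-bit hit count, followed by the 2-byte hits. The sketch below shows one way to serialize such an entry; the function names, the big-endian byte order, and the sample wordIDs are assumptions for illustration only.

```python
import struct


def encode_entry(word_id, barrel_min_word_id, hits):
    """Serialize one (wordID, hit list) entry of a forward barrel."""
    relative = word_id - barrel_min_word_id
    if not 0 <= relative < 2**24:
        raise ValueError("wordID is outside this barrel's 24-bit range")
    if len(hits) >= 2**8:
        raise ValueError("hit count does not fit in 8 bits")
    header = (relative << 8) | len(hits)  # 24 bits of wordID + 8 bits of nhits
    return struct.pack(f">I{len(hits)}H", header, *hits)


def decode_entry(data, barrel_min_word_id, offset=0):
    """Return (wordID, hits, offset of the next entry)."""
    (header,) = struct.unpack_from(">I", data, offset)
    relative, nhits = header >> 8, header & 0xFF
    hits = list(struct.unpack_from(f">{nhits}H", data, offset + 4))
    return barrel_min_word_id + relative, hits, offset + 4 + 2 * nhits


# Hypothetical barrel whose smallest wordID is 5,000,000.
entry = encode_entry(5_000_123, 5_000_000, [45156, 7])
decoded = decode_entry(entry, 5_000_000)
```

Storing the difference from the barrel minimum is what lets the wordID fit in 24 bits while leaving the remaining 8 bits of the word for the hit count.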


Inverted Index

• Given a wordID, get the docIDs.
• Stored in the same barrels as the forward index, after they have been processed by the sorter.
• For every valid wordID, the lexicon contains a pointer into the barrel that the wordID falls into.
• Two sets of inverted barrels: one for hit lists that include title or anchor hits, another for all hit lists.

Lexicon:            wordID   ndocs
                    wordID   ndocs
Inverted barrels:   docID:27   nhits:5   hit hit hit hit
                    docID:27   nhits:5   hit hit hit hit
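The relationship between the two indices can be sketched as a re-sort of the same entries: the forward index is keyed by docID, the inverted index by wordID. In this illustrative sketch the dictionaries, the `invert` function, and a lexicon that stores only the document count per wordID are assumptions; the real lexicon also holds a pointer into the barrel.

```python
from collections import defaultdict


def invert(forward_index):
    """Turn {docID: {wordID: hits}} into {wordID: [(docID, hits), ...]}."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):  # keep each doclist ordered by docID
        for word_id, hits in forward_index[doc_id].items():
            inverted[word_id].append((doc_id, hits))
    # Simplified lexicon: wordID -> number of documents containing it (ndocs).
    lexicon = {word_id: len(doclist) for word_id, doclist in inverted.items()}
    return dict(inverted), lexicon


# Hypothetical forward index with two documents and two wordIDs.
forward = {2: {10: [3], 11: [1, 9]}, 1: {10: [5]}}
inverted, lexicon = invert(forward)
```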


Crawling the web

• Fast distributed crawling system.
• The URL server and the crawlers are implemented in Python.
• One URL server, 3 crawlers.
• Each crawler keeps about 300 connections open at the same time.
• Speed: about 600K/sec of data.
• Each crawler keeps its own DNS cache.
• Each connection is in one of several states: looking up DNS, connecting to host, sending request, receiving response.
• Asynchronous IO manages the events.
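The crawler design on this slide (many open connections, a per-crawler DNS cache, and a small set of connection states driven by asynchronous IO) can be sketched with Python's `asyncio`. This is only a simulation under stated assumptions: the class and method names are hypothetical, no network traffic is sent, and `asyncio.sleep(0)` stands in for each real IO step.

```python
import asyncio
from enum import Enum


class State(Enum):
    LOOKING_UP_DNS = "looking up DNS"
    CONNECTING = "connecting to host"
    SENDING_REQUEST = "sending request"
    RECEIVING_RESPONSE = "receiving response"


class Crawler:
    def __init__(self, max_connections=300):
        self.max_connections = max_connections
        self.dns_cache = {}   # host -> address, kept per crawler
        self.dns_lookups = 0  # counts cache misses only

    async def resolve(self, host):
        if host not in self.dns_cache:
            self.dns_lookups += 1
            await asyncio.sleep(0)  # stand-in for a real DNS query
            self.dns_cache[host] = f"address-of-{host}"
        return self.dns_cache[host]

    async def fetch(self, slots, host):
        trace = []
        async with slots:  # at most max_connections fetches in flight
            for state in State:
                trace.append(state)
                if state is State.LOOKING_UP_DNS:
                    await self.resolve(host)
                else:
                    await asyncio.sleep(0)  # stand-in for socket IO
        return host, trace

    async def crawl(self, hosts):
        slots = asyncio.Semaphore(self.max_connections)
        return await asyncio.gather(*(self.fetch(slots, h) for h in hosts))


crawler = Crawler()
results = asyncio.run(crawler.crawl(["a.example", "b.example", "a.example"]))
```

Each simulated connection walks through the four states in order, and a second crawl of an already resolved host is served from the cache without another lookup.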


Indexing the web

• Parsing
  – The parser must handle errors:
    – HTML typos
    – Non-ASCII characters
    – HTML tags nested hundreds deep
  – They developed their own parser
• Indexing documents into barrels
  – Turning words into wordIDs
  – In-memory hash table: the lexicon
  – New additions are logged to a file
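The word-to-wordID step can be sketched as a hash table lookup that logs any word it has not seen before. This is an illustrative sketch only: the `Lexicon` class, its method names, and the use of an in-memory list in place of a log file are assumptions.

```python
class Lexicon:
    """In-memory hash table mapping words to wordIDs."""

    def __init__(self, base_words):
        self.word_ids = {word: i for i, word in enumerate(base_words)}
        self.new_word_log = []  # stand-in for the log file of new additions

    def word_id(self, word):
        if word not in self.word_ids:
            # Unknown word: assign the next free wordID and log the addition.
            self.word_ids[word] = len(self.word_ids)
            self.new_word_log.append(word)
        return self.word_ids[word]


lexicon = Lexicon(["web", "search"])
ids = [lexicon.word_id(w) for w in ["search", "engine", "web", "engine"]]
```

Here only "engine" is new, so it is logged once and receives the next free wordID.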


Indexing the web (Cont.)

• Parallelization
  – Base lexicon fixed at 14 million words, plus a log of all the extra words
• Sorting
  – Creates the inverted index
  – Produces two types of barrels:
    – for titles and anchors
    – for full text
  – Sorters run in parallel
  – The sorting is done in main memory


Searching

1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all of the search terms.
5. Compute the rank of that document.
6. If we’re at the end of the short barrels, start at the doclist of the full barrel for every word and go to step 4.
7. If we’re not at the end of any doclist, go to step 4.
8. Sort the documents by rank and return the top K.
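The query steps above can be sketched as follows. This is a simplification under stated assumptions rather than the system's actual evaluator: barrels are plain dictionaries of the form `{wordID: {docID: hits}}`, doclists are intersected with set operations instead of being scanned in step, and `count_rank` is a hypothetical stand-in for the real ranking function.

```python
import heapq


def matching_docs(word_ids, barrel):
    """docIDs that appear in the doclist of every query word."""
    doclists = [set(barrel.get(word_id, {})) for word_id in word_ids]
    return sorted(set.intersection(*doclists)) if doclists else []


def count_rank(doc_id, hit_lists):
    """Hypothetical rank: total number of hits across all query words."""
    return sum(len(hits) for hits in hit_lists)


def search(query, lexicon, short_barrel, full_barrel, rank=count_rank, k=10):
    words = query.lower().split()                        # step 1
    if not words or any(w not in lexicon for w in words):
        return []
    word_ids = [lexicon[w] for w in words]               # step 2
    scores = {}
    for barrel in (short_barrel, full_barrel):           # step 3, then step 6
        for doc_id in matching_docs(word_ids, barrel):   # steps 4 and 7
            hit_lists = [barrel[w][doc_id] for w in word_ids]
            score = rank(doc_id, hit_lists)              # step 5
            scores[doc_id] = max(scores.get(doc_id, 0), score)
    return heapq.nlargest(k, scores, key=scores.get)     # step 8


# Hypothetical data: wordID 0 = "bill", wordID 1 = "clinton".
lexicon = {"bill": 0, "clinton": 1}
short_barrel = {0: {1: [2]}, 1: {1: [3], 2: [1]}}
full_barrel = {0: {1: [2, 10, 11], 3: [4]}, 1: {1: [3, 12], 3: [5, 6, 7, 8, 9, 9]}}
top = search("Bill Clinton", lexicon, short_barrel, full_barrel)
```

Only documents that contain every query word are ranked, so document 2, which matches "clinton" alone, never appears in the result.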


The Ranking system

• PageRank(TM) determines the relative importance of each page Google crawls on the web.
• Among the characteristics the ranking evaluates are the text included in the links to a site, the text on each page, and the PageRank of the sites linking to the site being evaluated.
• Single-word search: check the hit list for that word.
• Multi-word search: hits occurring close together in a document are weighted higher than hits occurring far apart.
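The proximity idea for multi-word queries can be illustrated with a small sketch that finds the closest pair of hits for two words and turns the gap into a weight. The function names and the `1 / gap` weighting are assumptions chosen for illustration; the slide states only that closer hits are weighted higher.

```python
def min_distance(positions_a, positions_b):
    """Smallest gap between a position of word A and a position of word B.

    Both lists must be sorted in ascending order.
    """
    i = j = 0
    best = float("inf")
    while i < len(positions_a) and j < len(positions_b):
        best = min(best, abs(positions_a[i] - positions_b[j]))
        # Advance whichever pointer is behind to look for a closer pair.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return best


def proximity_weight(positions_a, positions_b):
    """Hypothetical weight: larger when the two words occur closer together."""
    gap = min_distance(positions_a, positions_b)
    if gap == float("inf"):
        return 0.0  # one of the words does not occur in the document
    return 1.0 / max(gap, 1)


near = proximity_weight([3, 40], [5, 41])  # closest pair is 40 and 41
far = proximity_weight([3], [50])
```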


Results

• Example: query “Bill Clinton”
  – Returns results from whitehouse.gov
  – Returns the email address of the president
  – All the results are high-quality pages
  – No broken links
  – No “Bill” without “Clinton” and vice versa


Storage Requirements

• Compression is used on the repository
• About 55GB for all the data used by the SE
• Most queries can be answered using just the short inverted index
• With better compression, a high-quality SE can fit onto a 7GB drive of a new PC


System Performance

• It took 9 days to download 26 million pages
• Peak rate: 48.5 pages per second (the last 11 million pages took 63 hours)
• The indexer and the crawler ran simultaneously
• The indexer runs at 54 pages per second
• The sorters run in parallel on 4 machines; the whole process took 24 hours
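A quick arithmetic check helps relate the two crawl figures. Averaged over the whole 9-day crawl the rate is lower than 48.5 pages per second; that figure matches the final stretch reported in the original paper, where the last 11 million pages were downloaded in 63 hours. The variable names below are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Whole crawl: 26 million pages in 9 days.
overall_rate = 26_000_000 / (9 * SECONDS_PER_DAY)

# Final stretch: 11 million pages in 63 hours.
peak_rate = 11_000_000 / (63 * 60 * 60)

print(f"overall: {overall_rate:.1f} pages/sec, peak: {peak_rate:.1f} pages/sec")
```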


Conclusion

• Scalable search engine
• High-quality search results
• Search techniques
  – PageRank
  – Anchor text
  – Proximity information
• Search features
  – Catalog, Site Search, Cached links, Similar pages, Who links to you, File Types
• Speed: efficient algorithms, thousands of low-cost PCs networked together