1 The anatomy of a Large Scale Search Engine Sergey Brin,Lawrence Page Dept. CS of Stanford...
Date posted: 22-Dec-2015
1
The anatomy of a Large Scale Search Engine
Sergey Brin, Lawrence Page
Dept. of CS, Stanford University
2
Outline
• Introduction
• Design Goals
• System Features
• System Anatomy
• Results & Performance
• Conclusion
3
Introduction
• Google: a large-scale search engine
• Designed to crawl and index the web efficiently
• Crawler: downloads web pages
• Gives better search results
• The name Google comes from googol = 10^100
• Engineering a search engine (SE) is a challenging task.
• Two ways of surfing the web:
  – High-quality human-maintained lists (Yahoo!): too slow to improve, can't cover esoteric topics
4
Introduction (Cont.)
• Human-maintained lists are also expensive to build and maintain.
• Search engines (Google):
  – search by keywords
  – return too many low-quality matches
  – people try to mislead automated search engines
• Challenges in creating a search engine:
  – fast crawling technology
  – efficient use of storage space
  – handling queries quickly
5
Introduction (Cont.)
• Scaling with the web
• Improved hardware performance
  – exceptions: disk seek time, OS
• Google's data structures are optimized for fast and efficient access.
• Google is a centralized SE.
6
Design Goals
• Improved search quality
  – Junk results often wash out any results that a user is interested in.
  – As the collection size grows, we need tools with very high precision.
• Use of hypertextual information
  – Quality filtering
  – Link structure and link text
• Support novel research on large-scale web data
7
System features
• PageRank: bringing order to the web
• Most web search engines have largely ignored the link graph.
• Maps containing 518 million hyperlinks
• Corresponds well with people's idea of importance (citation importance)
[Figure: pages A, B, and C; B and C are backlinks of A]
8
• For this example:
  PR(A) = (1-d) + d(PR(T1)/3 + PR(T2) + PR(T3) + PR(T4)/2)
• Motivation:
  – Pages that are cited from many places are worth looking at.
  – Pages that are cited from an important place are worth looking at.
9
System features
PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
• Assume page A has pages T1…Tn which point to it.
• The parameter d is a damping factor that can be set between 0 and 1 (usually 0.85).
• C(A): the number of links going out of page A.
• Random surfer
  – Is given a random URL
  – Clicks randomly on links
  – After a while gets bored and gets a new random URL
  – At each page the "random surfer" follows a link with probability d, and gets bored and requests another random page with probability 1-d.
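The PageRank formula on this slide can be evaluated by repeated substitution until the values settle. The following is a minimal sketch in Python, not the authors' implementation; the function name `pagerank`, the dictionary input format, the toy graph, and the fixed iteration count are all assumptions made for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over every page T linking to A.

    `links` maps each page to the list of pages it links to
    (a hypothetical input format chosen for this example).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    pr = {page: 1.0 for page in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each backlink T contributes PR(T) divided by its out-link count C(T)
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Toy graph: B and C are backlinks of A; A links back to B.
toy_links = {"A": ["B"], "B": ["A"], "C": ["A"]}
ranks = pagerank(toy_links)
```

In this toy graph, page C has no backlinks, so its rank stays at the minimum value 1-d, while A, which is cited by two pages, ends up ranked highest.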
10
System features
• Difference from traditional methods
  – Links from different pages are not counted equally
  – Normalizes by the number of links on a page
• Anchor text
  – Associate link text with the page it points to.
  – Advantages:
    – Anchors provide a more accurate description
    – Can exist for documents that can't be indexed (images, non-text docs)
    – Can return pages that haven't been crawled
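As a rough illustration of associating link text with the page it points to, the sketch below groups anchor text by target URL. The function name `index_anchors`, the tuple layout, and the example URLs are hypothetical; the slides do not describe the actual data structure.

```python
from collections import defaultdict


def index_anchors(links):
    """Group anchor text under the URL it points to.

    `links` is an iterable of (source_url, target_url, anchor_text) tuples,
    a layout assumed for this example.
    """
    anchors = defaultdict(list)
    for _source_url, target_url, anchor_text in links:
        # The text is credited to the target page, not the page containing the link
        anchors[target_url].append(anchor_text)
    return dict(anchors)


example_links = [
    ("http://a.example/", "http://img.example/logo.png", "company logo"),
    ("http://b.example/", "http://img.example/logo.png", "logo image"),
]
anchor_index = index_anchors(example_links)
```

Here the image URL becomes searchable through its anchor text even though the image itself contains no indexable words and may never have been crawled.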
11
12
System features
• Other features
  – Location information: use of proximity in search
  – Visual information: relative font size
  – Full raw HTML is available
  – Users can view a cached version of pages
13
System Anatomy
14
System Anatomy
• Designed to avoid disk seeks.
• Web pages are fetched, compressed, and stored in a repository.
• The indexer parses the documents into hits (stored in barrels) and anchors.
15
Major Data Structures
• Hit Lists
• Forward Index
• Inverted Index
• Crawling the web
• Indexing the web
• Life of a Query
• The Ranking system
16
17
18
19
20
21
Hit List
• What is a hit list?
• A hit list is a list of occurrences of a particular word in a particular document, including position, font, and capitalization information.
• Stored in both the forward and inverted indices.
• Encoded with a hand-optimized compact encoding (less space than a simple encoding, less bit manipulation than Huffman coding).
• Each hit takes 2 bytes of storage.

Hit formats (field:bits, 16 bits per hit):
Plain:   Cap:1  Imp:3  Position:12
Fancy:   Cap:1  Imp:3  Type:4  Pos:8
Anchor:  Cap:1  Imp:3  Type:4  Hash:4  Pos:4
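To make the 2-byte plain hit concrete, here is a minimal Python sketch that packs the three fields into 16 bits using the widths listed on the slide (Cap:1, Imp:3, Position:12). The function names and the ordering of the fields within the 16 bits are assumptions; the slide gives only the widths.

```python
def pack_plain_hit(cap, imp, position):
    """Pack a plain hit into 16 bits: cap (1 bit) | imp (3 bits) | position (12 bits).

    The bit order is an assumption made for this example.
    """
    if not (0 <= cap < 2 and 0 <= imp < 8 and 0 <= position < 4096):
        raise ValueError("field out of range for a plain hit")
    return (cap << 15) | (imp << 12) | position


def unpack_plain_hit(hit):
    """Recover (cap, imp, position) from a 16-bit plain hit."""
    return (hit >> 15) & 0x1, (hit >> 12) & 0x7, hit & 0xFFF


hit = pack_plain_hit(cap=1, imp=3, position=100)
```

Packing fields with shifts and masks like this keeps each hit at exactly two bytes while needing only a few cheap bit operations to decode.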
22
23
Forward Index
• Given a docID, get its wordIDs and hit lists.
• Partially sorted and stored in forward barrels.
• Each barrel holds a range of wordIDs.
• Duplicate docIDs exist in the barrels.
• Instead of storing the actual wordID, each wordID is stored as a relative difference from the minimum wordID in that barrel, so 24 bits are enough: 2^24 ≈ 16 million wordIDs per barrel.

Forward barrel layout (field:bits):
docID | wordID:24 | nhits:8 | hit hit hit hit
      | wordID:24 | nhits:8 | hit hit hit hit
      | null wordID
docID | wordID:24 | nhits:8 | hit hit hit hit
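The relative wordID encoding can be sketched as follows: the wordID is stored as an offset from the barrel's minimum wordID in 24 bits, next to an 8-bit hit count, giving a 32-bit header per word. This is an illustrative Python sketch; the function names and the placement of the two fields inside the 32 bits are assumptions.

```python
def pack_forward_entry(word_id, barrel_min_word_id, nhits):
    """Pack a (wordID, nhits) header into 32 bits: 24-bit relative wordID | 8-bit hit count."""
    relative = word_id - barrel_min_word_id
    if not 0 <= relative < 2 ** 24:
        raise ValueError("wordID does not fall inside this barrel's range")
    if not 0 <= nhits < 2 ** 8:
        raise ValueError("hit count does not fit in 8 bits")
    return (relative << 8) | nhits


def unpack_forward_entry(entry, barrel_min_word_id):
    """Recover the absolute wordID and the hit count from a packed header."""
    return (entry >> 8) + barrel_min_word_id, entry & 0xFF


entry = pack_forward_entry(word_id=5_000_123, barrel_min_word_id=5_000_000, nhits=4)
```

Because only the offset is stored, the absolute wordID can exceed 24 bits as long as each barrel covers a range of at most 2^24 wordIDs.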
24
Inverted Index
• Given a wordID, get the docIDs that contain it.
• Consists of the same barrels as the forward index, after they have been processed by the sorter.
• For every valid wordID, the lexicon contains a pointer into the barrel that the wordID falls into.
• Two sets of inverted barrels: one set for hit lists that include title or anchor hits, another for all hit lists.

Layout (field:bits):
Lexicon:          wordID | ndocs
                  wordID | ndocs
Inverted barrels: docID:27 | nhits:5 | hit hit hit hit
                  docID:27 | nhits:5 | hit hit hit hit
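The relationship between the two indices can be illustrated by re-sorting forward records by wordID. The Python sketch below is a simplification under stated assumptions: the record layout `(docID, wordID, hits)`, the function name `invert`, and the in-memory dictionary stand in for the on-disk barrels and the lexicon pointers.

```python
from collections import defaultdict


def invert(forward_records):
    """Turn (docID, wordID, hits) forward records into wordID -> [(docID, hits), ...]."""
    inverted = defaultdict(list)
    # Sorting by (wordID, docID) mimics the sorter regrouping the barrel by word
    for doc_id, word_id, hits in sorted(forward_records, key=lambda r: (r[1], r[0])):
        inverted[word_id].append((doc_id, hits))
    return dict(inverted)


forward = [(1, 7, [3, 9]), (1, 8, [4]), (2, 7, [1])]
index = invert(forward)
```

After inversion, looking up wordID 7 yields the list of documents containing it, which is the doclist that query processing scans.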
25
Crawling the web
• Fast distributed crawling system.
• The URL server and the crawlers are implemented in Python.
• A single URL server feeds about 3 crawlers.
• Each crawler keeps about 300 connections open at the same time; the system fetches about 600K of data per second.
• Each crawler maintains its own internal DNS cache.
• Each connection moves through the states: looking up DNS, connecting to host, sending request, receiving response.
• Asynchronous I/O is used to manage events.
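The benefit of a per-crawler DNS cache can be shown with a small sketch: each host name is resolved once and later lookups are answered from memory. The class name `DnsCache` and the injected `resolver` callable are hypothetical; the slides do not show how the crawler's cache was built.

```python
class DnsCache:
    """Remember host -> address results so each host is resolved only once."""

    def __init__(self, resolver):
        # `resolver` is any callable mapping a host name to an address,
        # for example socket.gethostbyname (injected here so it can be swapped out)
        self._resolver = resolver
        self._cache = {}

    def lookup(self, host):
        if host not in self._cache:
            self._cache[host] = self._resolver(host)
        return self._cache[host]


resolved = []


def fake_resolver(host):
    """Stand-in resolver that records every real lookup it performs."""
    resolved.append(host)
    return "10.0.0.1"


cache = DnsCache(fake_resolver)
first = cache.lookup("example.com")
second = cache.lookup("example.com")
```

With hundreds of connections open at once, many of them to the same hosts, skipping repeated lookups removes one of the four connection states for most requests.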
26
Indexing the web
• Parsing
  – The parser must know how to handle errors:
    – HTML typos
    – Non-ASCII characters
    – HTML tags nested hundreds deep
  – The authors developed their own parser.
• Indexing documents into barrels
  – Turning words into wordIDs
  – In-memory hash table: the lexicon
  – New additions are logged to a file
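The word-to-wordID step can be sketched as a hash table that assigns the next free ID to unseen words and records them in a log. This is a minimal Python illustration; the class name `Lexicon`, the in-memory list used as the log, and the ID assignment scheme are assumptions.

```python
class Lexicon:
    """In-memory hash table mapping words to wordIDs."""

    def __init__(self, base_words):
        # Base words get IDs in order; assumes `base_words` has no duplicates
        self._ids = {word: i for i, word in enumerate(base_words)}
        self.new_word_log = []  # stands in for the log file of new additions

    def word_id(self, word):
        if word not in self._ids:
            self._ids[word] = len(self._ids)
            self.new_word_log.append(word)
        return self._ids[word]


lexicon = Lexicon(["web", "search"])
known_id = lexicon.word_id("search")
new_id = lexicon.word_id("crawler")
```

Logging new words instead of sharing every update is one way to let several indexers run in parallel against the same base lexicon and reconcile the extra words afterwards.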
27
Indexing the web
• Parallelization
  – Shared base lexicon of 14 million words
  – Log of all the extra words
• Sorting
  – Creates the inverted index
  – Produces two types of barrels: one for titles and anchors, one for full text
  – Sorters run in parallel
  – The sorting is done in main memory
28
Searching
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all of the search terms.
5. Compute the rank of that document.
6. If we're at the end of the short barrels, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we're not at the end of any doclist, go to step 4.
8. Sort the documents by rank and return the top K.
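The query steps on this slide can be approximated in a few lines. The Python sketch below is a simplification under stated assumptions: barrels are plain dictionaries mapping a wordID to a list of docIDs, set intersection replaces the sequential doclist scan, steps 1 and 2 are assumed to have been done by the caller, and `rank` is a hypothetical scoring callable.

```python
def search(word_ids, short_barrel, full_barrel, rank, k=10):
    """Return the top-k docIDs that contain every query word."""
    matched = []
    seen = set()
    for barrel in (short_barrel, full_barrel):  # steps 3 and 6: short barrels first
        doclists = [set(barrel.get(word_id, ())) for word_id in word_ids]
        common = set.intersection(*doclists) if doclists else set()
        for doc_id in sorted(common):  # step 4: documents matching all terms
            if doc_id not in seen:
                seen.add(doc_id)
                matched.append((rank(doc_id), doc_id))  # step 5: compute the rank
    matched.sort(reverse=True)  # step 8: sort by rank
    return [doc_id for _, doc_id in matched[:k]]


short = {1: [10, 20], 2: [20]}
full = {1: [10, 20, 30], 2: [20, 30, 40]}
scores = {20: 0.9, 30: 0.5}
results = search([1, 2], short, full, rank=lambda doc_id: scores.get(doc_id, 0.0))
```

Document 20 is found in the short barrel and document 30 only in the full barrel; both contain every query word, so both are returned in rank order.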
29
30
The Ranking system
• PageRank(TM) determines the relative importance of each page Google crawls on the web.
• Among the characteristics the ranking evaluates are the text included in the links to a site, the text on each page, and the PageRank of the sites linking to the site being evaluated.
• Single-word search: check the hit list for that word.
• Multi-word search: hits occurring close together in a document are weighted higher than hits occurring far apart.
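One simple way to weight nearby hits higher is to score a document by the smallest distance between occurrences of the two query words. The Python sketch below uses an inverse-distance weight, which is an assumption chosen for illustration and not the weighting Google uses; the function name `proximity_score` is also hypothetical.

```python
def proximity_score(positions_a, positions_b):
    """Score in (0, 1]: higher when the two words occur closer together.

    Both arguments are non-empty lists of word positions within one document.
    """
    closest = min(abs(a - b) for a in positions_a for b in positions_b)
    # Inverse distance is an illustrative choice; adjacent words score 1.0
    return 1.0 / closest if closest > 0 else 1.0


adjacent = proximity_score([3], [4])
far_apart = proximity_score([3], [13])
```

A document where the two words are adjacent therefore outranks one where they are ten positions apart, which matches the behaviour described for multi-word queries.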
31
32
Results
• Example: query "Bill Clinton"
  – Returns results from "Whitehouse.gov"
  – Returns the email address of the president
  – All the results are high-quality pages
  – No broken links
  – No "Bill" without "Clinton" and vice versa
33
Storage Requirements
• Compression is used on the repository.
• About 55 GB for all the data used by the SE.
• Most queries can be answered using just the short inverted index.
• With better compression, a high-quality SE can fit onto a 7 GB drive of a new PC.
34
35
System Performance
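• It took about 9 days to download 26 million pages.
• Once running smoothly, the crawler downloaded the last 11 million pages in 63 hours, or 48.5 pages per second.
• The indexer and crawler ran simultaneously.
• The indexer runs at about 54 pages per second.
• The sorters run in parallel; using 4 machines, the whole sorting process took about 24 hours.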
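The crawl rates can be checked with a little arithmetic. Note that 26 million pages in 9 days works out to roughly 33 pages per second overall; the 48.5 pages per second figure matches the rate reported in the original paper for the last 11 million pages, which were downloaded in 63 hours (that detail comes from the paper, not from this slide). The variable names below are illustrative.

```python
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

# Whole crawl: 26 million pages in about 9 days
overall_rate = 26_000_000 / (9 * SECONDS_PER_DAY)

# Steady-state stretch: the last 11 million pages in 63 hours
steady_rate = 11_000_000 / (63 * SECONDS_PER_HOUR)

print(f"overall: {overall_rate:.1f} pages/s, steady state: {steady_rate:.1f} pages/s")
```

Since the indexer runs at about 54 pages per second, it can keep up with the crawler even at the steady-state rate.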
36
Conclusion
• Scalable search engine
• High-quality search results
• Search techniques
  – PageRank
  – Anchor text
  – Proximity information
• Search features
  – Catalog, Site Search, Cached links, Similar pages, Who links to you, File Types
• Speed: efficient algorithms, thousands of low-cost PCs networked together