Google Scale Data Management
-
Upload
igor-morin -
Category
Documents
-
view
38 -
download
1
description
Transcript of Google Scale Data Management
Google Scale Data Management
33/27/27
(partially based on slides by Qing Li)
Google (..a lecture on that??)
A good example of where a good understanding of computer scienceand some entrepreneurial spirit can take you..
44/27/27
(partially based on slides by Qing Li)
Google (what?..couldn’t they come up with a better name?)
..doesn’t mean anything, right? ...or does it mean “search” in French?... Surprising, but “Google” actually means stg.
• “Googol” is the name of a “number”…
• A big one : 1 followed by 100 zeros
..this reflects the company's mission....and the reality of the world:
• …there are a whole lot of “data” out there…
• …and whoever helps people (e.g. me) manage it effectively deserves to be rich!
55/27/27
(partially based on slides by Qing Li)
..in fact there is more to Google than search...
These are data intensive applications
66/27/27
(partially based on slides by Qing Li)
Data
Crawling
Web
Question #1: How can it know so much?...because it crawls…
Database
77/27/27
(partially based on slides by Qing Li)
A whole lot of data
One SearchEngine
A whole lot of users
Question #2: How can it be so fast?
88/27/27
(partially based on slides by Qing Li)
Data
Index
Crawling
Web
…
d1d3K2
d1d2K1
Question #1: How can it be so fast?...because it indexes…
Indexing
Indexes: quick lookup tables(like the index pages in a book)
Database
99/27/27
(partially based on slides by Qing Li)
Question #2: How can it be so fast?
Database(copies of all web pages!)
1
Front-end web server
1010/27/27
(partially based on slides by Qing Li)
Question #2: How can it be so fast?
Database(copies of all web pages!)
Front-end web server
Index servers
1 2
use a lot of computers…but, intelligently! (parallelism) …organize data for fast access (databases, indexing)
1111/27/27
(partially based on slides by Qing Li)
Question #2: How can it be so fast?
Database(copies of all web pages!)
Front-end web server
Index servers Content servers
1 2 3
use a lot of computers…but, intelligently! (parallelism) …organize data for fast access (databases, indexing)
1212/27/27
(partially based on slides by Qing Li)
Question #2: How can it be so fast?
Database(copies of all web pages!)
Front-end web server
Index servers Content servers
1 2 3
456
use a lot of computers…but, intelligently! (parallelism) …organize data for fast access (databases, indexing)
1313/27/27
(partially based on slides by Qing Li)
Question #2: How can it be so fast?
Front-end web server
Index servers
1
2
use a lot of computers…but, intelligently! (parallelism) …organize data for fast access (databases, indexing) …most people want to know the same thing anyhow (caching)
CACHE(copies of recent results)
Database(copies of all web pages!)
Content servers
1414/27/27
(partially based on slides by Qing Li)
Question 3: How can it serve the most relevant page?
ANALYZE!!!!
1515/27/27
(partially based on slides by Qing Li)
Question 3: How can it serve the most relevant page?
Text analysis• Analyze the content of a page to determine if it is
relevant to the query or not
1616/27/27
(partially based on slides by Qing Li)
Question 3: How can it serve the most relevant page?
Text analysis• Analyze the content of a page to determine if it is
relevant to the query or not
Link analysis• Analyze the (incoming and outgoing) links of pages to
determine if the page is “worthy” or not
1717/27/27
(partially based on slides by Qing Li)
Question 3: How can it serve the most relevant page?
Text analysis• Analyze the content of a page to determine if it is
relevant to the query or not
Link analysis• Analyze the (incoming and outgoing) links of pages to
determine if the page is “worthy” or not
User analysis• Analyze the users search context to guess what the
user wants to do to serve the most relevant content (and the advertisement!!)
1818/27/27
(partially based on slides by Qing Li)
Text Analysis
Text analysis
Rawtext
Index
CrawlingWeb
2121/27/27
(partially based on slides by Qing Li)
Term Extraction
Extraction of index terms• Stemming
• “informed”, “informs” inform
2222/27/27
(partially based on slides by Qing Li)
Term Extraction
Extraction of index terms• Stemming
• “informed”, “informs” inform
• Compound word identification • {“white”, “house”} or {“white house”} ????
2323/27/27
(partially based on slides by Qing Li)
Term Extraction
Extraction of index terms• Stemming
• “informed”, “informs” inform
• Compound word identification • {“white”, “house”} or {“white house”} ????
• Removal of stop words• “a”, “an”, “the”, “is”, “are”, “am”, …
• they do not help discriminate between relevant and irrelevant documents
2424/27/27
(partially based on slides by Qing Li)
Stop words
Those terms that occur too frequently in a language are not good discriminators for search
Rank (of the term)
Frequencyof usage
in english language
stop words
the an ASU
2626/27/27
(partially based on slides by Qing Li)
“Inverted” index
search
.
.
.
ASU
.
.
.
.
.
.
tiger
3
.
.
.
2
.
.
.
.
.
.
2
Terms
Directory
Doc #1---------------
Doc #2---------------
Doc #5---------------
...
...
Query“ASU”
Count
DocumentCollection
requires “fast” string search!!(hashes, search trees)
2828/27/27
(partially based on slides by Qing Li)
What are the weights????
They need to capture how good the term is in describing the content of the page
t1
t2
t1 better than t2
)(___
),"("_),"("
Docinpagekeywordsofnumber
DocASUinpagenumberDocASUtf
term-frequency
Document
2929/27/27
(partially based on slides by Qing Li)
What are the weights????
They need to capture how discriminating the term is (remember the stop words??)
t1
t2
t1t1
t1t1
t1 t1
t2 t2
t2 better than t1
3030/27/27
(partially based on slides by Qing Li)
What are the weights????
They need to capture how discriminating the term is (remember the stop words??)
t1
t2
t1t1
t1t1
t1 t1
t2 t2
t2 better than t1
)"("___
__log)"("
ASUwithdocumentofnumber
documentsofnumberASUidf
inverse-document-frequency
3131/27/27
(partially based on slides by Qing Li)
What are the weights????
They need to capture how • good the term is in describing the content of
the page
• discriminating the term is..
)"("),"("),"(" ASUidfDocASUtfDocASUtfidf
3333/27/27
(partially based on slides by Qing Li)
How are the results ranked based on relevance?
3838/27/27
(partially based on slides by Qing Li)
Vectors…what are they???
Page with “ASU” weight <0.5>, “CSE” weight <0.7>, and “Selcuk” weight <0.9>
0 0.5
0.7
0.9
ASU
CSE
Selcuk
<0.5,0.7.0.9>
3939/27/27
(partially based on slides by Qing Li)
A web database (a vector space!)
Database
ASU
CSE
Selcuk
4040/27/27
(partially based on slides by Qing Li)
A web database (a vector space!)
Query “ASU CSE Selcuk”
ASU
CSE
Selcuk
4141/27/27
(partially based on slides by Qing Li)
Measuring relevance: “Find 2 closest vectors”
ASU
CSE
Selcuk
Requires “multidimensional” index structures
Query “ASU CSE Selcuk”
4242/27/27
(partially based on slides by Qing Li)
How about “importance” of a page…can we measure it?
4343/27/27
(partially based on slides by Qing Li)
How about “importance” of a page…can we measure it?
We might be able to learn how important a page is by studying its connectivity (“link analysis”)
more links to a page implies a better page
4444/27/27
(partially based on slides by Qing Li)
How about “importance” of a page…can we measure it?
We might be able to learn how important a page is by studying its connectivity (“link analysis”)
more links to a page implies a better page
Links from a more important page should count more than links from a weaker page
4545/27/27
(partially based on slides by Qing Li)
Hubs and authorities
•Good hubs should point to good authorities
•Good authorities must be pointed by good hubs.
“Graph theory” helps analyze large networks of links
4646/27/27
(partially based on slides by Qing Li)
PageRank
PR(A) = PR(C)/1
more links to a page implies a better page
Links from a more important page should count more than links from a weaker page
4747/27/27
(partially based on slides by Qing Li)
PageRank
PR(A) = PR(C)/1
PR(B) = PR(A) / 2
more links to a page implies a better page
Links from a more important page should count more than links from a weaker page
4848/27/27
(partially based on slides by Qing Li)
PageRank
PR(A) = PR(C)/1
PR(B) = PR(A) / 2
PR(C) = PR(A) / 2 + PR(B)/1
These types of problems are generally solved using “linear algebra”
more links to a page implies a better page
Links from a more important page should count more than links from a weaker page
5050/27/27
(partially based on slides by Qing Li)
How to compute ranks??
PageRank (PR) definition is recursive• PageRank of a page depends on and influences PageRanks
of other pages
5151/27/27
(partially based on slides by Qing Li)
How to compute ranks??
PageRank (PR) definition is recursive
• PageRank of a page depends on and influences PageRanks of other pages
Solved iteratively• To compute PageRank:
• Choose arbitrary initial PR_old and use it to compute PR_new
• Repeat, setting PR_old to PR_new until PR converges (the difference between old and new PR is sufficiently small)
• Rank values typically converge in 50-100 iterations
5252/27/27
(partially based on slides by Qing Li)
Related CSE courses 185 Introduction to Web Development 205 Object-Oriented Programming and Data Structures 310 Data Structures and Algorithms 355 Introduction to Theoretical Computer Science 412 Database Management 414 Advanced Database Concepts 450 Design and Analysis of Algorithms 457 Theory of Formal Languages 471 Introduction to Artificial Intelligence 494 Cognitive Systems and Intelligent Agents 494 Information Retrieval, Mining 494 Numerical Linear Algebra Data Exploration 494 Introduction to High Performance Computing
5353/27/27
(partially based on slides by Qing Li)
Related CSE courses 185 Introduction to Web Development 205 Object-Oriented Programming and Data Structures 310 Data Structures and Algorithms 355 Introduction to Theoretical Computer Science 412 Database Management 414 Advanced Database Concepts 450 Design and Analysis of Algorithms 457 Theory of Formal Languages 471 Introduction to Artificial Intelligence 494 Cognitive Systems and Intelligent Agents 494 Information Retrieval, Mining 494 Numerical Linear Algebra Data Exploration 494 Introduction to High Performance Computing
5454/27/27
(partially based on slides by Qing Li)
Reference L. Page and S. Brin,The PageRank citation ranking: Bringing
order to the web , Stanford Digital Library Technique, Working paper 1999-0120, 1998.
L. Page and S. Brin. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh International Web Conference (WWW 98), 1998.