Google Scale Data Management

43
Google Scale Data Management

description

Google Scale Data Management. Google (..a lecture on that??). A good example of where a good understanding of computer science and some entrepreneurial spirit can take you. Google (what?..couldn’t they come up with a better name?). ..doesn’t mean anything, right? - PowerPoint PPT Presentation

Transcript of Google Scale Data Management

Page 1: Google Scale Data Management

Google Scale Data Management

Page 2: Google Scale Data Management

33/27/27

(partially based on slides by Qing Li)

Google (..a lecture on that??)

A good example of where a good understanding of computer scienceand some entrepreneurial spirit can take you..

Page 3: Google Scale Data Management

44/27/27

(partially based on slides by Qing Li)

Google (what?..couldn’t they come up with a better name?)

..doesn’t mean anything, right? ...or does it mean “search” in French?... Surprising, but “Google” actually means stg.

• “Googol” is the name of a “number”…

• A big one : 1 followed by 100 zeros

..this reflects the company's mission....and the reality of the world:

• …there are a whole lot of “data” out there…

• …and whoever helps people (e.g. me) manage it effectively deserves to be rich!

Page 4: Google Scale Data Management

55/27/27

(partially based on slides by Qing Li)

..in fact there is more to Google than search...

These are data intensive applications

Page 5: Google Scale Data Management

66/27/27

(partially based on slides by Qing Li)

Data

Crawling

Web

Question #1: How can it know so much?...because it crawls…

Database

Page 6: Google Scale Data Management

77/27/27

(partially based on slides by Qing Li)

A whole lot of data

One SearchEngine

A whole lot of users

Question #2: How can it be so fast?

Page 7: Google Scale Data Management

88/27/27

(partially based on slides by Qing Li)

Data

Index

Crawling

Web

d1d3K2

d1d2K1

Question #1: How can it be so fast?...because it indexes…

Indexing

Indexes: quick lookup tables(like the index pages in a book)

Database

Page 8: Google Scale Data Management

99/27/27

(partially based on slides by Qing Li)

Question #2: How can it be so fast?

Database(copies of all web pages!)

1

Front-end web server

Page 9: Google Scale Data Management

1010/27/27

(partially based on slides by Qing Li)

Question #2: How can it be so fast?

Database(copies of all web pages!)

Front-end web server

Index servers

1 2

use a lot of computers…but, intelligently! (parallelism) …organize data for fast access (databases, indexing)

Page 10: Google Scale Data Management

1111/27/27

(partially based on slides by Qing Li)

Question #2: How can it be so fast?

Database(copies of all web pages!)

Front-end web server

Index servers Content servers

1 2 3

use a lot of computers…but, intelligently! (parallelism) …organize data for fast access (databases, indexing)

Page 11: Google Scale Data Management

1212/27/27

(partially based on slides by Qing Li)

Question #2: How can it be so fast?

Database(copies of all web pages!)

Front-end web server

Index servers Content servers

1 2 3

456

use a lot of computers…but, intelligently! (parallelism) …organize data for fast access (databases, indexing)

Page 12: Google Scale Data Management

1313/27/27

(partially based on slides by Qing Li)

Question #2: How can it be so fast?

Front-end web server

Index servers

1

2

use a lot of computers…but, intelligently! (parallelism) …organize data for fast access (databases, indexing) …most people want to know the same thing anyhow (caching)

CACHE(copies of recent results)

Database(copies of all web pages!)

Content servers

Page 13: Google Scale Data Management

1414/27/27

(partially based on slides by Qing Li)

Question 3: How can it serve the most relevant page?

ANALYZE!!!!

Page 14: Google Scale Data Management

1515/27/27

(partially based on slides by Qing Li)

Question 3: How can it serve the most relevant page?

Text analysis• Analyze the content of a page to determine if it is

relevant to the query or not

Page 15: Google Scale Data Management

1616/27/27

(partially based on slides by Qing Li)

Question 3: How can it serve the most relevant page?

Text analysis• Analyze the content of a page to determine if it is

relevant to the query or not

Link analysis• Analyze the (incoming and outgoing) links of pages to

determine if the page is “worthy” or not

Page 16: Google Scale Data Management

1717/27/27

(partially based on slides by Qing Li)

Question 3: How can it serve the most relevant page?

Text analysis• Analyze the content of a page to determine if it is

relevant to the query or not

Link analysis• Analyze the (incoming and outgoing) links of pages to

determine if the page is “worthy” or not

User analysis• Analyze the users search context to guess what the

user wants to do to serve the most relevant content (and the advertisement!!)

Page 17: Google Scale Data Management

1818/27/27

(partially based on slides by Qing Li)

Text Analysis

Text analysis

Rawtext

Index

CrawlingWeb

Page 18: Google Scale Data Management

2121/27/27

(partially based on slides by Qing Li)

Term Extraction

Extraction of index terms• Stemming

• “informed”, “informs” inform

Page 19: Google Scale Data Management

2222/27/27

(partially based on slides by Qing Li)

Term Extraction

Extraction of index terms• Stemming

• “informed”, “informs” inform

• Compound word identification • {“white”, “house”} or {“white house”} ????

Page 20: Google Scale Data Management

2323/27/27

(partially based on slides by Qing Li)

Term Extraction

Extraction of index terms• Stemming

• “informed”, “informs” inform

• Compound word identification • {“white”, “house”} or {“white house”} ????

• Removal of stop words• “a”, “an”, “the”, “is”, “are”, “am”, …

• they do not help discriminate between relevant and irrelevant documents

Page 21: Google Scale Data Management

2424/27/27

(partially based on slides by Qing Li)

Stop words

Those terms that occur too frequently in a language are not good discriminators for search

Rank (of the term)

Frequencyof usage

in english language

stop words

the an ASU

Page 22: Google Scale Data Management

2626/27/27

(partially based on slides by Qing Li)

“Inverted” index

search

Google

.

.

.

ASU

.

.

.

.

.

.

tiger

3

.

.

.

2

.

.

.

.

.

.

2

Terms

Directory

Doc #1---------------

Doc #2---------------

Doc #5---------------

...

...

Query“ASU”

Count

DocumentCollection

requires “fast” string search!!(hashes, search trees)

Page 23: Google Scale Data Management

2828/27/27

(partially based on slides by Qing Li)

What are the weights????

They need to capture how good the term is in describing the content of the page

t1

t2

t1 better than t2

)(___

),"("_),"("

Docinpagekeywordsofnumber

DocASUinpagenumberDocASUtf

term-frequency

Document

Page 24: Google Scale Data Management

2929/27/27

(partially based on slides by Qing Li)

What are the weights????

They need to capture how discriminating the term is (remember the stop words??)

t1

t2

t1t1

t1t1

t1 t1

t2 t2

t2 better than t1

Page 25: Google Scale Data Management

3030/27/27

(partially based on slides by Qing Li)

What are the weights????

They need to capture how discriminating the term is (remember the stop words??)

t1

t2

t1t1

t1t1

t1 t1

t2 t2

t2 better than t1

)"("___

__log)"("

ASUwithdocumentofnumber

documentsofnumberASUidf

inverse-document-frequency

Page 26: Google Scale Data Management

3131/27/27

(partially based on slides by Qing Li)

What are the weights????

They need to capture how • good the term is in describing the content of

the page

• discriminating the term is..

)"("),"("),"(" ASUidfDocASUtfDocASUtfidf

Page 27: Google Scale Data Management

3333/27/27

(partially based on slides by Qing Li)

How are the results ranked based on relevance?

Page 28: Google Scale Data Management

3838/27/27

(partially based on slides by Qing Li)

Vectors…what are they???

Page with “ASU” weight <0.5>, “CSE” weight <0.7>, and “Selcuk” weight <0.9>

0 0.5

0.7

0.9

ASU

CSE

Selcuk

<0.5,0.7.0.9>

Page 29: Google Scale Data Management

3939/27/27

(partially based on slides by Qing Li)

A web database (a vector space!)

Database

ASU

CSE

Selcuk

Page 30: Google Scale Data Management

4040/27/27

(partially based on slides by Qing Li)

A web database (a vector space!)

Query “ASU CSE Selcuk”

ASU

CSE

Selcuk

Page 31: Google Scale Data Management

4141/27/27

(partially based on slides by Qing Li)

Measuring relevance: “Find 2 closest vectors”

ASU

CSE

Selcuk

Requires “multidimensional” index structures

Query “ASU CSE Selcuk”

Page 32: Google Scale Data Management

4242/27/27

(partially based on slides by Qing Li)

How about “importance” of a page…can we measure it?

Page 33: Google Scale Data Management

4343/27/27

(partially based on slides by Qing Li)

How about “importance” of a page…can we measure it?

We might be able to learn how important a page is by studying its connectivity (“link analysis”)

more links to a page implies a better page

Page 34: Google Scale Data Management

4444/27/27

(partially based on slides by Qing Li)

How about “importance” of a page…can we measure it?

We might be able to learn how important a page is by studying its connectivity (“link analysis”)

more links to a page implies a better page

Links from a more important page should count more than links from a weaker page

Page 35: Google Scale Data Management

4545/27/27

(partially based on slides by Qing Li)

Hubs and authorities

•Good hubs should point to good authorities

•Good authorities must be pointed by good hubs.

“Graph theory” helps analyze large networks of links

Page 36: Google Scale Data Management

4646/27/27

(partially based on slides by Qing Li)

PageRank

PR(A) = PR(C)/1

more links to a page implies a better page

Links from a more important page should count more than links from a weaker page

Page 37: Google Scale Data Management

4747/27/27

(partially based on slides by Qing Li)

PageRank

PR(A) = PR(C)/1

PR(B) = PR(A) / 2

more links to a page implies a better page

Links from a more important page should count more than links from a weaker page

Page 38: Google Scale Data Management

4848/27/27

(partially based on slides by Qing Li)

PageRank

PR(A) = PR(C)/1

PR(B) = PR(A) / 2

PR(C) = PR(A) / 2 + PR(B)/1

These types of problems are generally solved using “linear algebra”

more links to a page implies a better page

Links from a more important page should count more than links from a weaker page

Page 39: Google Scale Data Management

5050/27/27

(partially based on slides by Qing Li)

How to compute ranks??

PageRank (PR) definition is recursive• PageRank of a page depends on and influences PageRanks

of other pages

Page 40: Google Scale Data Management

5151/27/27

(partially based on slides by Qing Li)

How to compute ranks??

PageRank (PR) definition is recursive

• PageRank of a page depends on and influences PageRanks of other pages

Solved iteratively• To compute PageRank:

• Choose arbitrary initial PR_old and use it to compute PR_new

• Repeat, setting PR_old to PR_new until PR converges (the difference between old and new PR is sufficiently small)

• Rank values typically converge in 50-100 iterations

Page 41: Google Scale Data Management

5252/27/27

(partially based on slides by Qing Li)

Related CSE courses 185 Introduction to Web Development 205 Object-Oriented Programming and Data Structures 310 Data Structures and Algorithms 355 Introduction to Theoretical Computer Science 412 Database Management 414 Advanced Database Concepts 450 Design and Analysis of Algorithms 457 Theory of Formal Languages 471 Introduction to Artificial Intelligence 494 Cognitive Systems and Intelligent Agents 494 Information Retrieval, Mining 494 Numerical Linear Algebra Data Exploration 494 Introduction to High Performance Computing

Page 42: Google Scale Data Management

5353/27/27

(partially based on slides by Qing Li)

Related CSE courses 185 Introduction to Web Development 205 Object-Oriented Programming and Data Structures 310 Data Structures and Algorithms 355 Introduction to Theoretical Computer Science 412 Database Management 414 Advanced Database Concepts 450 Design and Analysis of Algorithms 457 Theory of Formal Languages 471 Introduction to Artificial Intelligence 494 Cognitive Systems and Intelligent Agents 494 Information Retrieval, Mining 494 Numerical Linear Algebra Data Exploration 494 Introduction to High Performance Computing

Page 43: Google Scale Data Management

5454/27/27

(partially based on slides by Qing Li)

Reference L. Page and S. Brin,The PageRank citation ranking: Bringing

order to the web , Stanford Digital Library Technique, Working paper 1999-0120, 1998.

L. Page and S. Brin. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh International Web Conference (WWW 98), 1998.