How does Google Google: A journey into the wondrous mathematics behind your favorite websites


description

A talk I gave at the annual meeting for the MetroNY section of the MAA about how Google works from a link-ranking perspective. (http://sections.maa.org/metrony/) Based on a talk by Margot Gerritsen (which used elements from another talk I gave years ago, yay co-author improvements!)

Transcript of How does Google Google: A journey into the wondrous mathematics behind your favorite websites

Page 1: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

How Does Google?

David F. Gleich
Computer Science
Purdue University

A journey into the wondrous mathematics behind your favorite websites

1

Page 2: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

Mathematics underlies an enormous number of the websites we use every day!

2

Page 3: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

1. Google’s PageRank
2. Multi-armed bandits and internet experiments

3

Page 4: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

4

Page 5: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

Larry Page, Sergey Brin
• Created a web-search algorithm called “backrub”
• Spun off a company “Googol” based on the paper
• The importance of a page is determined by the importance of pages that link to it.

Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. “The PageRank Citation Ranking: Bringing Order to the Web.” Technical Report, Stanford InfoLab, 1999.

5

Page 6: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

A websearch primer
1. Crawl webpages
2. Analyze webpage text (information retrieval)
3. Analyze webpage links
4. Fit over 200 measures to human evaluations
5. Produce rankings
6. Continuously update

6

Page 7: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

Pages, nodes, incoming links, outgoing links, and “importance”

[Figure: pages a, b, and c link to a target page. Labels: “Important” pages that link to me; “Important” pages that link to Purdue]

7

Page 8: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

8

Page 9: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

Tim Davis and Yifan Hu Sparse Matrix Gallery

Page 10: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html

1000 vertices on 8.5-by-11 paper

1,000,000,000,000 vertices (one trillion): paper the size of Manhattan island (23 sq miles)?

The web

10
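The Manhattan comparison can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes the drawing density stays at 1000 vertices per 8.5-by-11 inch sheet and ignores margins, and the names used are illustrative.

```python
# Rough check of the "paper the size of Manhattan" comparison.
# Assumption: the density stays at 1000 vertices per letter-size sheet.
VERTICES_PER_SHEET = 1000
SHEET_AREA_SQ_IN = 8.5 * 11      # one 8.5-by-11 inch sheet
INCHES_PER_MILE = 63_360         # 5280 feet * 12 inches


def paper_area_sq_miles(num_vertices: int) -> float:
    """Area of paper needed to draw num_vertices at the assumed density."""
    sheets = num_vertices / VERTICES_PER_SHEET
    return sheets * SHEET_AREA_SQ_IN / INCHES_PER_MILE**2


web_area = paper_area_sq_miles(1_000_000_000_000)
print(f"{web_area:.1f} square miles")
```

Under these assumptions the result lands a little above 23 square miles, in line with the figure quoted on the slide.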

Page 11: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

We need something better!

11

Page 12: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

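A wee web-graph: link counting is too easy to game!

[Figure: the wee web-graph on nodes 1 to 6. Node 1 links to nodes 2, 3, and 4 (weight 1/3 each); node 2 links to nodes 3 and 6 (weight 1/2 each); node 3 links to node 4; nodes 4 and 5 link to each other; node 6 has no out-links.]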

12

Page 13: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

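A wee web-graph: link counting is too easy to game!

[Figure: the wee web-graph on nodes 1 to 6, with weight 1/3 on each link out of node 1 and weight 1/2 on each link out of node 2]

The importance of a page is determined by the importance of pages that link to it.

\[
\begin{aligned}
x_1 &= 0 \\
x_2 &= \tfrac{1}{3} x_1 \\
x_3 &= \tfrac{1}{3} x_1 + \tfrac{1}{2} x_2 \\
x_4 &= \tfrac{1}{3} x_1 + x_3 + x_5 \\
x_5 &= x_4 \\
x_6 &= \tfrac{1}{2} x_2
\end{aligned}
\]

13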

Page 14: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

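The importance of a page is determined by the importance of pages that link to it

\[ x_i = \sum_{j \in B_i} \frac{1}{d_j} x_j \]

• \(x_i\): “importance” of page i
• \(x_j\): “importance” of page j
• \(B_i\): the “back-links” of page i, that is, the pages that link to page i. Why it was called Backrub!
• \(d_j\): number of links page j uses (out-degree in graph theory)

Example: \( x_3 = \tfrac{1}{3} x_1 + \tfrac{1}{2} x_2 \)

[Figure: nodes 1 and 2 link to node 3 with weights 1/3 and 1/2]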

14
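The back-link sum can be sketched directly in Python. The `out_links` dictionary is an assumed encoding of the wee web-graph, read off the fractions in the slides, and the function name `importance_update` is made up for illustration.

```python
# One application of x_i = sum over back-links j of x_j / d_j.
# out_links is an assumed encoding of the wee web-graph from the slides.
out_links = {1: [2, 3, 4], 2: [3, 6], 3: [4], 4: [5], 5: [4], 6: []}


def importance_update(x: dict) -> dict:
    """Recompute every page's importance from the pages that link to it."""
    new_x = {page: 0.0 for page in out_links}
    for j, targets in out_links.items():
        for i in targets:
            # Page j splits its importance evenly over its d_j out-links.
            new_x[i] += x[j] / len(targets)
    return new_x


x = {page: 1.0 for page in out_links}
updated = importance_update(x)
print(updated[3])  # x3 = (1/3) x1 + (1/2) x2 with x1 = x2 = 1
```

Looping over out-links and accumulating into the target is equivalent to summing over each page's back-links, and it avoids building the back-link sets explicitly.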

Page 15: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

We can rewrite this equation in a more mathematically convenient way

\[
\begin{aligned}
x_1 &= 0\,x_1 + 0\,x_2 + 0\,x_3 + 0\,x_4 + 0\,x_5 + 0\,x_6 \\
x_2 &= \tfrac{1}{3} x_1 + 0\,x_2 + 0\,x_3 + 0\,x_4 + 0\,x_5 + 0\,x_6 \\
x_3 &= \tfrac{1}{3} x_1 + \tfrac{1}{2} x_2 + 0\,x_3 + 0\,x_4 + 0\,x_5 + 0\,x_6 \\
x_4 &= \tfrac{1}{3} x_1 + 0\,x_2 + 1\,x_3 + 0\,x_4 + 1\,x_5 + 0\,x_6 \\
x_5 &= 0\,x_1 + 0\,x_2 + 0\,x_3 + 1\,x_4 + 0\,x_5 + 0\,x_6 \\
x_6 &= 0\,x_1 + \tfrac{1}{2} x_2 + 0\,x_3 + 0\,x_4 + 0\,x_5 + 0\,x_6
\end{aligned}
\]

15

Page 16: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

And even more conveniently!

\[
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}
=
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
1/3 & 0 & 0 & 0 & 0 & 0 \\
1/3 & 1/2 & 0 & 0 & 0 & 0 \\
1/3 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 1/2 & 0 & 0 & 0 & 0
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}
\quad \text{or} \quad x = Px
\]

Element k in column m = "probability" of going from node m to node k

16
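The matrix form is easy to try out with NumPy. In this minimal sketch the entries of `P` are copied from the slide; the helper name `step` is an illustrative choice, not something from the talk.

```python
import numpy as np

# Column m holds the "probabilities" of leaving node m
# (nodes are 1-based on the slides, 0-based here).
P = np.array([
    [0,   0,   0, 0, 0, 0],
    [1/3, 0,   0, 0, 0, 0],
    [1/3, 1/2, 0, 0, 0, 0],
    [1/3, 0,   1, 0, 1, 0],
    [0,   0,   0, 1, 0, 0],
    [0,   1/2, 0, 0, 0, 0],
])


def step(x: np.ndarray) -> np.ndarray:
    """Evaluate the right-hand side of x = P x."""
    return P @ x


column_sums = P.sum(axis=0)
print(column_sums)  # columns of nodes with out-links sum to 1; node 6's column sums to 0
```

The column sums make the "probability" reading concrete: every node that has out-links spreads a total weight of 1 over them.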

Page 17: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

The matrix P for websites shows a lot of structure

Every dot is a non-zero element indicating a link. Matrices are sparse and generally have block structure. Block structure can be exploited to speed up the ranking algorithm.

17

Page 18: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

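But this idea doesn’t work for the wee web-graph

[Figure: the wee web-graph on nodes 1 to 6, with weight 1/3 on each link out of node 1 and weight 1/2 on each link out of node 2]

Nodes 1, 4 and 5 determine everything!

\[
\begin{aligned}
x_1 &= 0 \\
x_2 &= \tfrac{1}{3} x_1 = 0 \\
x_3 &= \tfrac{1}{3} x_1 + \tfrac{1}{2} x_2 = 0 \\
x_4 &= \tfrac{1}{3} x_1 + x_3 + x_5 = x_5 \\
x_5 &= x_4 \\
x_6 &= \tfrac{1}{2} x_2 = 0
\end{aligned}
\]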

18
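The same conclusion can be checked numerically: solutions of x = Px form the null space of I − P. The sketch below uses NumPy's SVD for this; the matrix entries are copied from the slides and the variable names are illustrative.

```python
import numpy as np

# Matrix P of the wee web-graph (column m = links out of node m).
P = np.array([
    [0,   0,   0, 0, 0, 0],
    [1/3, 0,   0, 0, 0, 0],
    [1/3, 1/2, 0, 0, 0, 0],
    [1/3, 0,   1, 0, 1, 0],
    [0,   0,   0, 1, 0, 0],
    [0,   1/2, 0, 0, 0, 0],
])

# Solutions of x = P x are exactly the null space of (I - P).
A = np.eye(6) - P
rank = np.linalg.matrix_rank(A)

# When the rank is n - 1, the right singular vector belonging to the
# smallest singular value spans the null space.
_, _, Vt = np.linalg.svd(A)
direction = np.abs(Vt[-1])
direction = direction / direction.max()

print(rank)
print(np.round(direction, 6))
```

A rank of 5 means the solutions form a single line, and the direction has non-zero entries only for nodes 4 and 5, so every other page ends up with zero importance.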

Page 19: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

But this idea doesn’t work for the wee web-graph

[Figure: the wee web-graph on nodes 1 to 6, with weight 1/3 on each link out of node 1 and weight 1/2 on each link out of node 2]

• Node 1: “lonely”
• Nodes 4 and 5: “mutual admiration societies”
• Node 6: “anti-social”

These nodes need to be “fixed” to get a reliable and useful ranking!

19

Page 20: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

The gang of four to the rescue

Andrei Markov

Oskar Perron

Georg Frobenius

Richard von Mises

20

Page 21: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

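Let’s fix it up and force node 6 to choose, or link to everyone

[Figure: the wee web-graph on nodes 1 to 6]

Before:

\[
P = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
1/3 & 0 & 0 & 0 & 0 & 0 \\
1/3 & 1/2 & 0 & 0 & 0 & 0 \\
1/3 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 1/2 & 0 & 0 & 0 & 0
\end{bmatrix}
\]

After node 6 links to everyone:

\[
P = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 1/6 \\
1/3 & 0 & 0 & 0 & 0 & 1/6 \\
1/3 & 1/2 & 0 & 0 & 0 & 1/6 \\
1/3 & 0 & 1 & 0 & 1 & 1/6 \\
0 & 0 & 0 & 1 & 0 & 1/6 \\
0 & 1/2 & 0 & 0 & 0 & 1/6
\end{bmatrix}
\]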

21
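The fix for node 6 generalizes to any page without out-links: replace its all-zero column by a uniform column. A minimal NumPy sketch follows; the function name `fix_dangling_columns` is hypothetical, chosen here for illustration.

```python
import numpy as np


def fix_dangling_columns(P) -> np.ndarray:
    """Replace every all-zero column with the uniform column 1/n."""
    P = np.array(P, dtype=float)      # np.array copies, so the input is untouched
    n = P.shape[0]
    dangling = P.sum(axis=0) == 0     # columns of pages with no out-links
    P[:, dangling] = 1.0 / n
    return P


# Original matrix from the slides; column 6 (index 5) is all zeros.
P = np.array([
    [0,   0,   0, 0, 0, 0],
    [1/3, 0,   0, 0, 0, 0],
    [1/3, 1/2, 0, 0, 0, 0],
    [1/3, 0,   1, 0, 1, 0],
    [0,   0,   0, 1, 0, 0],
    [0,   1/2, 0, 0, 0, 0],
])
P_fixed = fix_dangling_columns(P)
print(P_fixed[:, 5])
```

After the fix every column sums to 1, which is what lets the entries be read as probabilities for all six nodes.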

Page 22: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

Taxation is the way to representation!

[Figure: pages a, b, and c link to a page]

If a page is a good page, then it’ll still be a good page if we “tax” the importance from a, b, and c. We can redistribute the taxed amounts to all pages, including lonely nodes!

22

Page 23: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

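The importance of a page is determined by the importance of pages that link to it*

* After tax and any benefits

\[ x_i = \sum_{j \in B_i} \alpha \frac{x_j}{d_j} + (1 - \alpha) b_i \]

• \(\alpha x_j / d_j\): the total importance that page j contributes to page i
• \(b_i\): benefits to page i
• \(\alpha\): sets the taxation rate of all pages (each page is taxed a fraction \(1 - \alpha\) of its importance)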

23
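One way to see that the tax only moves importance around is to sum the equation over all pages. The short derivation below assumes every page has at least one out-link (\(d_j \ge 1\), as after the fix for node 6) and \(\alpha < 1\).

```latex
\begin{aligned}
\sum_i x_i
  &= \alpha \sum_i \sum_{j \in B_i} \frac{x_j}{d_j} + (1-\alpha) \sum_i b_i \\
  &= \alpha \sum_j d_j \cdot \frac{x_j}{d_j} + (1-\alpha) \sum_i b_i
     && \text{(page $j$ appears in exactly $d_j$ back-link sets)} \\
  &= \alpha \sum_j x_j + (1-\alpha) \sum_i b_i
  \quad\Longrightarrow\quad
  \sum_i x_i = \sum_i b_i .
\end{aligned}
```

So if the benefits \(b_i\) sum to 1, the importances \(x_i\) sum to 1 as well: what is taxed away from the links is handed back through the benefits.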

Page 24: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

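\[
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}
= \alpha
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 1/6 \\
1/3 & 0 & 0 & 0 & 0 & 1/6 \\
1/3 & 1/2 & 0 & 0 & 0 & 1/6 \\
1/3 & 0 & 1 & 0 & 1 & 1/6 \\
0 & 0 & 0 & 1 & 0 & 1/6 \\
0 & 1/2 & 0 & 0 & 0 & 1/6
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix}
+ (1 - \alpha)
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \end{bmatrix}
\]

Perron and Frobenius showed the new equation always has a unique solution

\[ x = \alpha P x + (1 - \alpha) b \]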

24
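For a graph this small the unique solution can be computed directly by rearranging the equation into a linear system. In the sketch below, α = 0.85 and a uniform b are assumptions (the slide leaves both symbolic); they were chosen because they are consistent with the rounded vectors shown on the following slide.

```python
import numpy as np

# Fixed matrix from the slides: node 6 now links to everyone.
P_fixed = np.array([
    [0,   0,   0, 0, 0, 1/6],
    [1/3, 0,   0, 0, 0, 1/6],
    [1/3, 1/2, 0, 0, 0, 1/6],
    [1/3, 0,   1, 0, 1, 1/6],
    [0,   0,   0, 1, 0, 1/6],
    [0,   1/2, 0, 0, 0, 1/6],
])
alpha = 0.85            # assumed value; not stated on this slide
b = np.full(6, 1 / 6)   # assumed uniform benefits

# x = alpha P x + (1 - alpha) b   <=>   (I - alpha P) x = (1 - alpha) b
x = np.linalg.solve(np.eye(6) - alpha * P_fixed, (1 - alpha) * b)
print(np.round(x, 2))
```

A direct solve is only practical for tiny examples like this one; for web-scale matrices an iterative method is the realistic option.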

Page 25: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

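What von Mises and Richardson showed is that guess, check, and correct works!

[Figure: the wee web-graph on nodes 1 to 6]

\[ x^{(\text{new})} = \alpha P x^{(\text{old})} + (1 - \alpha) b \]

\[
x^{(\text{start})} = \begin{bmatrix} 0.17 \\ 0.17 \\ 0.17 \\ 0.17 \\ 0.17 \\ 0.17 \end{bmatrix}
\quad
x^{(1)} = \begin{bmatrix} 0.05 \\ 0.10 \\ 0.17 \\ 0.38 \\ 0.19 \\ 0.12 \end{bmatrix}
\quad
x^{(2)} = \begin{bmatrix} 0.04 \\ 0.06 \\ 0.10 \\ 0.36 \\ 0.36 \\ 0.08 \end{bmatrix}
\quad
x^{(\infty)} = \begin{bmatrix} 0.03 \\ 0.04 \\ 0.06 \\ 0.43 \\ 0.39 \\ 0.05 \end{bmatrix}
\]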

25
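The guess, check, and correct loop is only a few lines of NumPy. This sketch assumes α = 0.85 and a uniform b, which reproduce the slide's rounded vectors to two decimals; the function name `pagerank_iterate` and the stopping tolerance are illustrative choices.

```python
import numpy as np

# Fixed matrix from the slides: node 6 links to everyone.
P_fixed = np.array([
    [0,   0,   0, 0, 0, 1/6],
    [1/3, 0,   0, 0, 0, 1/6],
    [1/3, 1/2, 0, 0, 0, 1/6],
    [1/3, 0,   1, 0, 1, 1/6],
    [0,   0,   0, 1, 0, 1/6],
    [0,   1/2, 0, 0, 0, 1/6],
])


def pagerank_iterate(P, alpha=0.85, tol=1e-10, max_iter=1000):
    """Repeat x_new = alpha P x_old + (1 - alpha) b; return all iterates."""
    n = P.shape[0]
    b = np.full(n, 1 / n)        # assumed uniform benefits
    x = b.copy()                 # start from the uniform guess
    history = [x]
    for _ in range(max_iter):
        x_new = alpha * (P @ x) + (1 - alpha) * b
        history.append(x_new)
        if np.abs(x_new - x).sum() < tol:   # stop once the update is tiny
            break
        x = x_new
    return history


history = pagerank_iterate(P_fixed)
print(np.round(history[1], 2))
print(np.round(history[2], 2))
print(np.round(history[-1], 2))
```

Each pass needs only one matrix-vector product, which is why this approach scales to sparse matrices far larger than the six-node example.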

Page 26: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

26

Page 27: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

There’s still a lot of work left to do to make a search engine

• Make it fast!
• Watch out for spam
• Watch out for manipulation
• Personalize
• Experiment!

27

Page 28: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

1. Google’s PageRank
2. Multi-armed bandits and internet experiments

28

Page 29: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

http://adamlofting.com/736/drawn-multi-armed-bandit-experiments/multi-armed-bandit/

Not this!

29

Page 30: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

http://upload.wikimedia.org/wikipedia/en/8/82/Las_Vegas_slot_machines.jpg

This!

Pays out $0.92/dollar

Pays out $0.98/dollar

Pays out $0.95/dollar

Pays out $0.99/dollar

30

Page 31: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

What in the heck does a multi-armed bandit have to do with Google?

31

Page 32: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

What in the heck does a multi-armed bandit have to do with Google?

Pays out $0.92/view

Pays out $0.66/view

Pays out $0.91/view to show ads

Pays out -$0.02/view to hide ads

32

Page 33: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

How to optimize your website without exploiting the bandits

• Try condition A 100 times, find 45 “wins”
• Try condition B 100 times, find 85 “wins”
• Try condition C 100 times, find 10 “wins”
• …
• Choose the best!

33

Page 34: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

This field has some of the best terminology
• Explore
• Exploit
• Regret

34

Page 35: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

This field has some of the best terminology
• Explore: Visiting Las Vegas!
• Exploit: Your new winning strategy!
• Regret: That you didn’t quit after winning the first round

35

Page 36: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

This field has some of the best terminology
• Explore: Testing slot machines/experiments for their reward
• Exploit: Playing the best reward you’ve found so far
• Regret: How much you lost due to exploration

36

Page 37: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

How to optimize your website without exploiting the bandits

• Try condition A 100 times, find 45 “wins”
• Try condition B 100 times, find 85 “wins”
• Try condition C 100 times, find 10 “wins”
• …
• Choose the best!

Pure exploration! We only exploit our findings at the end!

37
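The pure-exploration recipe can be simulated in a few lines. This is a sketch under a loud assumption: the win counts on the slide (45, 85, and 10 out of 100) are treated as the true win probabilities of conditions A, B, and C. The function name `explore_then_choose` is made up for illustration.

```python
import random

# Assumed true win probabilities, taken from the win counts on the slide.
TRUE_WIN_RATE = {"A": 0.45, "B": 0.85, "C": 0.10}


def explore_then_choose(trials_per_condition: int, rng: random.Random):
    """Try every condition equally often, then pick the best observed count."""
    wins = {}
    for name, p in TRUE_WIN_RATE.items():
        # Each trial is a simulated visitor who "wins" with probability p.
        wins[name] = sum(rng.random() < p for _ in range(trials_per_condition))
    best = max(wins, key=wins.get)
    return best, wins


best, wins = explore_then_choose(100, random.Random(0))
print(best, wins)
```

Every condition receives the same number of trials no matter how badly it performs, which is exactly the cost that the bandit strategies on the following slides try to reduce.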

Page 38: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

How to optimize your website exploiting the bandits

Pure exploration!
• Try condition A 5 times, find 4 wins
• Try condition B 5 times, find 4 wins
• Try condition C 5 times, find 2 wins

Exploit our knowledge
• Try condition A 7 times, find 3 wins
• Try condition B 7 times, find 5 wins
• Try condition C 1 time, find 0 wins

Condition    A     B     C
Est. Return  0.58  0.75  0.33

38
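The estimated returns in the table are just pooled win rates over both rounds. A minimal sketch of that arithmetic, where the data layout and the name `estimated_return` are illustrative:

```python
# (trials, wins) for the exploration round and the exploitation round.
rounds = {
    "A": [(5, 4), (7, 3)],
    "B": [(5, 4), (7, 5)],
    "C": [(5, 2), (1, 0)],
}


def estimated_return(results) -> float:
    """Pooled win rate: total wins divided by total trials."""
    trials = sum(t for t, _ in results)
    wins = sum(w for _, w in results)
    return wins / trials


estimates = {name: round(estimated_return(r), 2) for name, r in rounds.items()}
print(estimates)  # {'A': 0.58, 'B': 0.75, 'C': 0.33}
```

Condition C gets only one extra trial in the second round because its early results were poor; that is the exploitation step in action.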

Page 39: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

The goal of these problems is to construct optimal strategies to minimize regret

Regret: how much you left “on the table” by exploring

\[ \text{regret} = E[\text{play best always} - \text{plays made based on data}] \]

A zero-regret strategy is one where regret(T trials) is sublinear in T as the number of plays T → ∞

regret 100-each: \( \frac{255}{300} - \frac{140}{300} = 0.38 \)

regret 30-mixed: \( \frac{25.5}{30} - \frac{0.45 \times 12 + 0.85 \times 12 + 0.1 \times 6}{30} = 0.31 \)

39
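Both regret figures can be reproduced with a short calculation. As on the slide, this sketch treats the observed win rates (0.45, 0.85, 0.10) as the true rates and reports regret per play; the function name `regret_per_play` is illustrative.

```python
# Win rates per condition, treated as the true rates (as the slide does).
rates = {"A": 0.45, "B": 0.85, "C": 0.10}
BEST_RATE = max(rates.values())


def regret_per_play(plays: dict) -> float:
    """Average reward of always playing the best arm minus what was earned."""
    total_plays = sum(plays.values())
    expected_wins = sum(rates[name] * n for name, n in plays.items())
    return BEST_RATE - expected_wins / total_plays


regret_100_each = regret_per_play({"A": 100, "B": 100, "C": 100})
regret_30_mixed = regret_per_play({"A": 12, "B": 12, "C": 6})
print(round(regret_100_each, 2), round(regret_30_mixed, 2))  # 0.38 0.31
```

The mixed strategy has lower regret per play because it shifts plays away from condition C after only a handful of trials.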

Page 40: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

[The bandit problem] was formulated during the [second world] war, and efforts to solve it so sapped the energies and minds of Allied analysts that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage.

Peter Whittle (Whittle, 1979) Discussion of “Bandit processes and dynamical allocation indices”

Their importance to website optimization, advertising, and recommendation has rejuvenated research on these problems with fascinating new questions.

40

Page 41: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

Math is everywhere, and especially in your favorite websites! Matrices and probability are key ingredients.

41

Page 42: How does Google Google: A journey into the wondrous mathematics behind your favorite websites

PageRank on Wikipedia

α = 0.50           α = 0.85                α = 0.99
United States      United States           C:Contents
C:Living people    C:Main topic classif.   C:Main topic classif.
France             C:Contents              C:Fundamental
Germany            C:Living people         United States
England            C:Ctgs. by country      C:Wikipedia admin.
United Kingdom     United Kingdom          P:List of portals
Canada             C:Fundamental           P:Contents/Portals
Japan              C:Ctgs. by topic        C:Portals
Poland             C:Wikipedia admin.      C:Society
Australia          France                  C:Ctgs. by topic

Note: Top 10 articles on Wikipedia with highest PageRank


42