PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford),...

119
Prasad L21LinkAnalysis 1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney (UT, Austin)

Transcript of PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford),...

Page 1: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Prasad L21LinkAnalysis 1

Link Analysis

Adapted from Lectures by Prabhakar Raghavan (Yahoo,

Stanford), Christopher Manning (Stanford), and Raymond Mooney (UT,

Austin)

Page 2: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Evolution of Search Engines 1st Generation : Retrieved documents that

matched keyword-based queries using boolean model.

2nd Generation : Incorporated content-specific relevance ranking based on vector space model (TF-IDF), to deal with high recall.

3rd Generation: Incorporated content-independent source ranking using hyperlink structure to overcome spamming and to exploit “collective web wisdom”.

Prasad L21LinkAnalysis 2

Page 3: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

3

HTML Structure & Feature Weighting

Weight tokens under particular HTML tags more heavily: <TITLE> tokens (Google seems to like title

matches) <H1>,<H2>… tokens <META> keyword tokens

Parse a page into conceptual sections (e.g. navigation links vs. page content) and weight tokens differently based on section.

Page 4: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

(cont’d)

3rd Generation: Tried to glean relative semantic emphasis of words based on syntactic features such as fonts and span of query term hits to enhance efficacy of VSM.

Future search engines will incorporate (i) trusted sources, (ii) background knowledge, user profile, and search context from past query history, to personalize search, and (iii) apply additional reasoning and heuristics to improve satisfaction of information need.Prasad L21LinkAnalysis 4

Page 5: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

5

Meta-Search Engines

Search engine that passes query to several other search engines and integrates results. Submit queries to host sites. Parse resulting HTML pages to extract

search results. Integrate multiple rankings (rank

aggregation) into a “consensus” ranking. Present integrated results to user.

Examples: Metacrawler, SavvySearch ,Dogpile

Page 6: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

6

Bibliometrics: Citation Analysis Many standard documents include

bibliographies (or references), explicit citations to other previously published documents.

Using citations as links, standard corpora can be viewed as a graph.

The structure of this graph, independent of the content, can provide interesting information about the similarity of documents for clustering and about their significance for ranking.

Page 7: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

7

Bibliographic Coupling: Similarity Measure

Measure of similarity of documents introduced by Kessler in 1963.

The bibliographic coupling of two documents A and B is the number of documents cited by both A and B.

Size of the intersection of their bibliographies. Maybe want to normalize by size of

bibliographies?A B

Page 8: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

8

Co-Citation : Similarity Measure

An alternate citation-based measure of similarity introduced by Small in 1973.

Number of documents that cite both A and B. Maybe want to normalize by total number of

documents citing either A or B ?

A B

Page 9: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

9

Impact Factor (of a journal)

Developed by Garfield in 1972 to measure the importance (quality, influence) of scientific journals.

Measure of how often papers in the journal are cited by other scientists.

Computed and published annually by the Institute for Scientific Information (ISI).

The impact factor of a journal J in year Y is the average number of citations (from indexed documents published in year Y) to a paper published in J in year Y1 or Y2.

Does not account for the quality of the citing article.

Page 10: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

10

Authorities

Authorities are pages that are recognized as providing significant, trustworthy, and useful information on a topic.

In-degree (number of pointers to a page) is one simple measure of authority.

However in-degree treats all links as equal. Should links from pages that are

themselves authoritative count for more?

Page 11: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

11

Hubs

Hubs are index pages that provide lots of useful links to relevant content pages (topic authorities). Hub pages for IR are included in

the course home page: http://www.cs.utexas.edu/users/mo

oney/ir-course

Page 12: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

12

Hubs and Authorities

Together they tend to form a bipartite graph:

Hubs Authorities

Page 13: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

Simple iterative logic The Good, The Bad and The Unknown

Good nodes won’t point to Bad nodes All other combinations plausible

13

?

?

?

?Good Bad

Page 14: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

Simple iterative logic Good nodes won’t point to Bad nodes

If you point to a Bad node, you’re Bad If a Good node points to you, you’re Good

14

?

?

?

?Good Bad

Page 15: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

Simple iterative logic Good nodes won’t point to Bad nodes

If you point to a Bad node, you’re Bad If a Good node points to you, you’re Good

15

?

?Good Bad

Page 16: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

Simple iterative logic Good nodes won’t point to Bad nodes

If you point to a Bad node, you’re Bad If a Good node points to you, you’re Good

16

?Good Bad

Sometimes need probabilistic analogs – e.g., mail spam

Page 17: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

17

Today’s lecture:Basics of 3rd Generation Search

Engines

Role of anchor text

Link analysis for ranking Pagerank and variants (Page and Brin)

BONUS: Illustrates how to solve an important problem by adapting topology/statistics of large dataset for mathematically sound analysis and a practical scalable algorithm

HITS (Klienberg)

Prasad L21LinkAnalysis

Page 18: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

18

The Web as a Directed Graph

Assumption 1: A hyperlink between pages denotes author perceived relevance (quality signal)

Assumption 2: The anchor text of the hyperlink describes the target page (textual context)

Page Ahyperlink Page BAnchor

Prasad L21LinkAnalysis

Page 19: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

19

Anchor Text WWW Worm - McBryan [Mcbr94]

For ibm how to distinguish between: IBM’s home page (mostly graphical) IBM’s copyright page (high term freq. for ‘ibm’) Rival’s spam page (arbitrarily high term freq.)

www.ibm.com

“ibm” “ibm.com” “IBM home page”

A million pieces of anchor text with “ibm” send a strong signal

Prasad L21LinkAnalysis

Page 20: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

20

Anchor text containing IBM pointing to www.ibm.com

Page 21: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

21

Indexing anchor text

When indexing a document D, include anchor text from links pointing to D.

www.ibm.com

Armonk, NY-based computergiant IBM announced today

Joe’s computer hardware linksCompaqHPIBM

Big Blue today announcedrecord profits for the quarter

Page 22: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Google Ranking Controversy surrounding “Attention : Good vs Bad”

Prasad L21LinkAnalysis 22

Page 23: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

23

Google bombs

A Google bomb is a search with “bad” results due to maliciously manipulated anchor text.

Google introduced a new weighting function in January 2007that fixed many Google bombs.

Still some remnants: [dangerous cult] on Google, Bing, Yahoo Coordinated link creation by those who dislike the Church of

Scientology Defused Google bombs: [dumb motherf…], [who is a failure?],

[evil empire]

Page 24: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Motivation and Introduction to PageRank

Why is Page Importance Rating important? New challenges for information retrieval on the World

Wide Web. # Web pages: 150 million by 1998,

1000 billion by 2008 Indexed by Google in 2014: ~ 50 billion of > 1 trillion Diversity of web pages: different topics, quality, etc.

What is PageRank? A method for rating the importance of web

pages objectively and mechanically using the link structure of the web.

Page 25: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

29

Initial PageRank Idea

Can view it as a process of PageRank “flowing” from pages to the pages they cite.

.1

.09

.05

.05

.03

.03

.03

.08

.08

.03

Page 26: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

30

Sample Stable Fixpoint

0.4

0.4

0.2

0.2

0.2

0.2

0.4

Page 27: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

31

Pagerank scoring : More details

Imagine a browser doing a random walk on web pages: Start at a random page At each step, go out of the current page

along one of the links on that page, with equal probability

“In the steady state” each page has a long-term visit rate - use this as the page’s score.

1/31/31/3

Prasad L21LinkAnalysis

Page 28: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

32

Not quite enough

The web is full of dead-ends. Random walk can get stuck in dead-ends. Makes no sense to talk about long-term visit

rates.

??

Prasad L21LinkAnalysis

Page 29: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

33

Teleporting

At a dead end, jump to a random web page.

At any non-dead end, with probability 10%, jump to a random web page. With remaining probability

(90%), go out on a random link. 10% - a parameter.

Prasad L21LinkAnalysis

Page 30: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

34

Result of teleporting

Now cannot get stuck locally. There is a long-term rate at

which any page is visited (not obvious, will show this).

How do we compute this visit rate?

Prasad L21LinkAnalysis

Page 31: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

35

Markov chains

A Markov chain consists of n states, plus an nn transition probability matrix P.

At each step, we are in exactly one of these states.

For 1 i,j n, the matrix entry Pij tells us the probability of j being the next state, given we are currently in state i.

i jPij

Pii>0is OK.

Prasad

Page 32: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

36

.11

ij

n

j

P

Markov chains

Clearly, for all i, Markov chains are abstractions of random

walks. Exercise: represent the teleporting random

walk (eliminating dead ends) as a Markov chain. For this case, start with …:

Prasad

Page 33: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

37

Ergodic Markov chains

A Markov chain is ergodic if you have a path from any state to

any other. For any start state, after a finite

transient time T0, the probability of being in any state at a fixed time T>T0 is nonzero.

Notergodic(even/odd).Prasad

Page 34: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

38

Ergodic Markov chains

For any ergodic Markov chain, there is a unique long-term visit rate for each state. Steady-state probability

distribution. Over a long time-period, we visit

each state in proportion to this rate. It doesn’t matter where we start.

Prasad L21LinkAnalysis

Page 35: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

39

Probability vectors

A probability (row) vector x = (x1, … xn) tells us where the walk is, at any point.

E.g., (000…1…000) means we’re in state i.i n1

More generally, the vector x = (x1, … xn) means the walk is in state i with probability xi.

.11

n

iix

Prasad

Page 36: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

40

Change in probability vector

If the probability vector is x = (x1, … xn) at this step, what is it at the next step?

Recall that row i of the transition probability matrix P tells us where we go next from state i.

So from x, our next state is distributed as xP.

Prasad L21LinkAnalysis

.1

jij

n

ii xPx

Page 37: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

41

Steady state example

The steady state looks like a vector of probabilities a = (a1, … an): ai is the probability of being in state i.

1 23/4

1/4

3/41/4

For this example, a1=1/4 and a2=3/4.

Prasad L21LinkAnalysis

Page 38: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

42

How do we compute this vector?

Let a = (a1, … an) denote the row vector of steady-state probabilities.

If current state is described by a, then the next step is distributed as aP.

But a is the steady state, so a=aP. So a is the (left) eigenvector for P. Corresponds to the “principal” eigenvector of

P with the largest eigenvalue. Transition probability matrices always have

largest eigenvalue 1 (and others are smaller). Principal eigenvector can be computed iteratively

from any initial vector.Prasad

Page 39: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

43

One way of computing a

Recall, regardless of where we start, we eventually reach the steady state a.

Start with any distribution (say x=(10…0)). After one step, we’re at xP; after two steps at xP2 , then xP3 and so on. “Eventually” means for “large” k, xPk = a. Algorithm: multiply x by increasing powers

of P until the product looks stable.

Prasad L21LinkAnalysis

Page 40: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

An example of Simplified PageRank (Transposed version)

PageRank Calculation: first iteration

Mij is the weight of link from j to i.

½

.1

ijij

n

j

xxM

Page 41: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

An example of Simplified PageRank (Transposed version)

PageRank Calculation: second iteration

Mij is the weight of link from j to i.

½

.1

ijij

n

j

xxM

Page 42: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

An example of Simplified PageRank (Transposed version)

Convergence after some iterations

Mij is the weight of link from j to i.

½

.1

ijij

n

j

xxM

Page 43: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

47

Power method: Example

What is the PageRank / steady state in this example?

Page 44: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

48

Computing PageRank: Power Example

Page 45: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

49

Computing PageRank: Power Examplex1

Pt(d1)x2

Pt(d2)

P11 = 0.1P21 = 0.3

P12 = 0.9P22 = 0.7

t0 0 1 = xP

t1 = xP2

t2 = xP3

t3 = xP4

. . .

t∞ = xP∞

Pt(d1) = Pt-1(d1) * P11 + Pt-1(d2) * P21

Pt(d2) = Pt-1(d1) * P12 + Pt-1(d2) * P22

Page 46: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

50

Computing PageRank: Power Examplex1

Pt(d1)x2

Pt(d2)

P11 = 0.1P21 = 0.3

P12 = 0.9P22 = 0.7

t0 0 1 0.3 0.7 = xP

t1 = xP2

t2 = xP3

t3 = xP4

. . .

t∞ = xP∞

Pt(d1) = Pt-1(d1) * P11 + Pt-1(d2) * P21

Pt(d2) = Pt-1(d1) * P12 + Pt-1(d2) * P22

Page 47: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

51

Computing PageRank: Power Examplex1

Pt(d1)x2

Pt(d2)

P11 = 0.1P21 = 0.3

P12 = 0.9P22 = 0.7

t0 0 1 0.3 0.7 = xP

t1 0.3 0.7 = xP2

t2 = xP3

t3 = xP4

. . .

t∞ = xP∞

Pt(d1) = Pt-1(d1) * P11 + Pt-1(d2) * P21

Pt(d2) = Pt-1(d1) * P12 + Pt-1(d2) * P22

Page 48: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

52

Computing PageRank: Power Examplex1

Pt(d1)x2

Pt(d2)

P11 = 0.1P21 = 0.3

P12 = 0.9P22 = 0.7

t0 0 1 0.3 0.7 = xP

t1 0.3 0.7 0.24 0.76 = xP2

t2 = xP3

t3 = xP4

. . .

t∞ = xP∞

Pt(d1) = Pt-1(d1) * P11 + Pt-1(d2) * P21

Pt(d2) = Pt-1(d1) * P12 + Pt-1(d2) * P22

Page 49: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

53

Computing PageRank: Power Examplex1

Pt(d1)x2

Pt(d2)

P11 = 0.1P21 = 0.3

P12 = 0.9P22 = 0.7

t0 0 1 0.3 0.7 = xP

t1 0.3 0.7 0.24 0.76 = xP2

t2 0.24 0.76 = xP3

t3 = xP4

. . .

t∞ = xP∞

Pt(d1) = Pt-1(d1) * P11 + Pt-1(d2) * P21

Pt(d2) = Pt-1(d1) * P12 + Pt-1(d2) * P22

Page 50: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

54

Computing PageRank: Power Examplex1

Pt(d1)x2

Pt(d2)

P11 = 0.1P21 = 0.3

P12 = 0.9P22 = 0.7

t0 0 1 0.3 0.7 = xP

t1 0.3 0.7 0.24 0.76 = xP2

t2 0.24 0.76 0.252 0.748 = xP3

t3 = xP4

. . .

t∞ = xP∞

Pt(d1) = Pt-1(d1) * P11 + Pt-1(d2) * P21

Pt(d2) = Pt-1(d1) * P12 + Pt-1(d2) * P22

Page 51: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

55

Computing PageRank: Power Examplex1

Pt(d1)x2

Pt(d2)

P11 = 0.1P21 = 0.3

P12 = 0.9P22 = 0.7

t0 0 1 0.3 0.7 = xP

t1 0.3 0.7 0.24 0.76 = xP2

t2 0.24 0.76 0.252 0.748 = xP3

t3 0.252 0.748 = xP4

. . .

t∞ = xP∞

Pt(d1) = Pt-1(d1) * P11 + Pt-1(d2) * P21

Pt(d2) = Pt-1(d1) * P12 + Pt-1(d2) * P22

Page 52: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

56

Computing PageRank: Power Examplex1

Pt(d1)x2

Pt(d2)

P11 = 0.1P21 = 0.3

P12 = 0.9P22 = 0.7

t0 0 1 0.3 0.7 = xP

t1 0.3 0.7 0.24 0.76 = xP2

t2 0.24 0.76 0.252 0.748 = xP3

t3 0.252 0.748 0.2496 0.7504 = xP4

. . .

t∞ = xP∞

Pt(d1) = Pt-1(d1) * P11 + Pt-1(d2) * P21

Pt(d2) = Pt-1(d1) * P12 + Pt-1(d2) * P22

Page 53: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

57

Computing PageRank: Power Examplex1

Pt(d1)x2

Pt(d2)

P11 = 0.1P21 = 0.3

P12 = 0.9P22 = 0.7

t0 0 1 0.3 0.7 = xP

t1 0.3 0.7 0.24 0.76 = xP2

t2 0.24 0.76 0.252 0.748 = xP3

t3 0.252 0.748 0.2496 0.7504 = xP4

. . . . . .

t∞ = xP∞

Pt(d1) = Pt-1(d1) * P11 + Pt-1(d2) * P21

Pt(d2) = Pt-1(d1) * P12 + Pt-1(d2) * P22

Page 54: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

58

Computing PageRank: Power Examplex1

Pt(d1)x2

Pt(d2)

P11 = 0.1P21 = 0.3

P12 = 0.9P22 = 0.7

t0 0 1 0.3 0.7 = xP

t1 0.3 0.7 0.24 0.76 = xP2

t2 0.24 0.76 0.252 0.748 = xP3

t3 0.252 0.748 0.2496 0.7504 = xP4

. . . . . .

t∞ 0.25 0.75 = xP∞

Pt(d1) = Pt-1(d1) * P11 + Pt-1(d2) * P21

Pt(d2) = Pt-1(d1) * P12 + Pt-1(d2) * P22

Page 55: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

59

Computing PageRank: Power Examplex1

Pt(d1)x2

Pt(d2)

P11 = 0.1P21 = 0.3

P12 = 0.9P22 = 0.7

t0 0 1 0.3 0.7 = xP

t1 0.3 0.7 0.24 0.76 = xP2

t2 0.24 0.76 0.252 0.748 = xP3

t3 0.252 0.748 0.2496 0.7504 = xP4

. . . . . .

t∞ 0.25 0.75 0.25 0.75 = xP∞

Pt(d1) = Pt-1(d1) * P11 + Pt-1(d2) * P21

Pt(d2) = Pt-1(d1) * P12 + Pt-1(d2) * P22

Page 56: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

60

Computing PageRank: Power Examplex1

Pt(d1)x2

Pt(d2)

P11 = 0.1P21 = 0.3

P12 = 0.9P22 = 0.7

t0 0 1 0.3 0.7 = xP

t1 0.3 0.7 0.24 0.76 = xP2

t2 0.24 0.76 0.252 0.748 = xP3

t3 0.252 0.748 0.2496 0.7504 = xP4

. . . . . .

t∞ 0.25 0.75 0.25 0.75 = xP∞PageRank vector = = (, ) = (0.25, 0.75)Pt(d1) = Pt-1(d1) * P11 + Pt-1(d2) * P21

Pt(d2) = Pt-1(d1) * P12 + Pt-1(d2) * P22

Page 57: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

61

Power method: Example

What is the PageRank / steady state in this example?

The steady state distribution (= the PageRanks) in this example are 0.25 for d1 and 0.75 for d2.

Page 58: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

62

Exercise: Compute PageRank using power method

Page 59: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

63

Exercise: Compute PageRank using power method

Page 60: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

64

Solution

Page 61: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

65

Solutionx1

Pt(d1)x2

Pt(d2)

P11 = 0.7P21 = 0.2

P12 = 0.3P22 = 0.8

t0 0 1

t1

t2

t3

t∞PageRank vector = = (, ) = (0.4, 0.6)Pt(d1) = Pt-1(d1) * P11 + Pt-1(d2) * P21

Pt(d2) = Pt-1(d1) * P12 + Pt-1(d2) * P22

Page 62: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

66

Solutionx1

Pt(d1)x2

Pt(d2)

P11 = 0.7P21 = 0.2

P12 = 0.3P22 = 0.8

t0 0 1 0.2 0.8

t1

t2

t3

t∞PageRank vector = = (, ) = (0.4, 0.6)Pt(d1) = Pt-1(d1) * P11 + Pt-1(d2) * P21

Pt(d2) = Pt-1(d1) * P12 + Pt-1(d2) * P22

Page 63: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

67

Solutionx1

Pt(d1)x2

Pt(d2)

P11 = 0.7P21 = 0.2

P12 = 0.3P22 = 0.8

t0 0 1 0.2 0.8

t1 0.2 0.8

t2

t3

t∞PageRank vector = = (, ) = (0.4, 0.6)Pt(d1) = Pt-1(d1) * P11 + Pt-1(d2) * P21

Pt(d2) = Pt-1(d1) * P12 + Pt-1(d2) * P22

Page 64: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

68

Solutionx1

Pt(d1)x2

Pt(d2)

P11 = 0.7P21 = 0.2

P12 = 0.3P22 = 0.8

t0 0 1 0.2 0.8

t1 0.2 0.8 0.3 0.7

t2

t3

t∞PageRank vector = = (, ) = (0.4, 0.6)Pt(d1) = Pt-1(d1) * P11 + Pt-1(d2) * P21

Pt(d2) = Pt-1(d1) * P12 + Pt-1(d2) * P22

Page 65: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

69

Solutionx1

Pt(d1)x2

Pt(d2)

P11 = 0.7P21 = 0.2

P12 = 0.3P22 = 0.8

t0 0 1 0.2 0.8

t1 0.2 0.8 0.3 0.7

t2 0.3 0.7

t3

t∞PageRank vector = = (, ) = (0.4, 0.6)Pt(d1) = Pt-1(d1) * P11 + Pt-1(d2) * P21

Pt(d2) = Pt-1(d1) * P12 + Pt-1(d2) * P22

Page 66: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

70

Solutionx1

Pt(d1)x2

Pt(d2)

P11 = 0.7P21 = 0.2

P12 = 0.3P22 = 0.8

t0 0 1 0.2 0.8

t1 0.2 0.8 0.3 0.7

t2 0.3 0.7 0.35 0.65

t3

t∞PageRank vector = = (, ) = (0.4, 0.6)Pt(d1) = Pt-1(d1) * P11 + Pt-1(d2) * P21

Pt(d2) = Pt-1(d1) * P12 + Pt-1(d2) * P22

Page 67: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

71

Solutionx1

Pt(d1)x2

Pt(d2)

P11 = 0.7P21 = 0.2

P12 = 0.3P22 = 0.8

t0 0 1 0.2 0.8

t1 0.2 0.8 0.3 0.7

t2 0.3 0.7 0.35 0.65

t3 0.35 0.65

t∞PageRank vector = = (, ) = (0.4, 0.6)Pt(d1) = Pt-1(d1) * P11 + Pt-1(d2) * P21

Pt(d2) = Pt-1(d1) * P12 + Pt-1(d2) * P22

Page 68: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

72

Solutionx1

Pt(d1)x2

Pt(d2)

P11 = 0.7P21 = 0.2

P12 = 0.3P22 = 0.8

t0 0 1 0.2 0.8

t1 0.2 0.8 0.3 0.7

t2 0.3 0.7 0.35 0.65

t3 0.35 0.65 0.375 0.625

t∞PageRank vector = = (, ) = (0.4, 0.6)Pt(d1) = Pt-1(d1) * P11 + Pt-1(d2) * P21

Pt(d2) = Pt-1(d1) * P12 + Pt-1(d2) * P22

Page 69: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

73

Solutionx1

Pt(d1)x2

Pt(d2)

P11 = 0.7P21 = 0.2

P12 = 0.3P22 = 0.8

t0 0 1 0.2 0.8

t1 0.2 0.8 0.3 0.7

t2 0.3 0.7 0.35 0.65

t3 0.35 0.65 0.375 0.625

. . .

t∞PageRank vector = = (, ) = (0.4, 0.6)Pt(d1) = Pt-1(d1) * P11 + Pt-1(d2) * P21

Pt(d2) = Pt-1(d1) * P12 + Pt-1(d2) * P22

Page 70: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

74

Solutionx1

Pt(d1)x2

Pt(d2)

P11 = 0.7P21 = 0.2

P12 = 0.3P22 = 0.8

t0 0 1 0.2 0.8

t1 0.2 0.8 0.3 0.7

t2 0.3 0.7 0.35 0.65

t3 0.35 0.65 0.375 0.625

. . .

t∞ 0.4 0.6PageRank vector = = (, ) = (0.4, 0.6)Pt(d1) = Pt-1(d1) * P11 + Pt-1(d2) * P21

Pt(d2) = Pt-1(d1) * P12 + Pt-1(d2) * P22

Page 71: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

75

Solutionx1

Pt(d1)x2

Pt(d2)

P11 = 0.7P21 = 0.2

P12 = 0.3P22 = 0.8

t0 0 1 0.2 0.8

t1 0.2 0.8 0.3 0.7

t2 0.3 0.7 0.35 0.65

t3 0.35 0.65 0.375 0.625

. . .

t∞ 0.4 0.6 0.4 0.6PageRank vector = = (, ) = (0.4, 0.6)Pt(d1) = Pt-1(d1) * P11 + Pt-1(d2) * P21

Pt(d2) = Pt-1(d1) * P12 + Pt-1(d2) * P22

Page 72: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

Example web graph

Page 73: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

77

Transition (probability) matrix

d0 d1 d2 d3 d4 d5 d6

d0 0.00 0.00 1.00 0.00 0.00 0.00 0.00

d1 0.00 0.50 0.50 0.00 0.00 0.00 0.00

d2 0.33 0.00 0.33 0.33 0.00 0.00 0.00

d3 0.00 0.00 0.00 0.50 0.50 0.00 0.00

d4 0.00 0.00 0.00 0.00 0.00 0.00 1.00

d5 0.00 0.00 0.00 0.00 0.00 0.50 0.50

d6 0.00 0.00 0.00 0.33 0.33 0.00 0.33

Page 74: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

78

Transition matrix with teleporting

d0 d1 d2 d3 d4 d5 d6

d0 0.02 0.02 0.88 0.02 0.02 0.02 0.02

d1 0.02 0.45 0.45 0.02 0.02 0.02 0.02

d2 0.31 0.02 0.31 0.31 0.02 0.02 0.02

d3 0.02 0.02 0.02 0.45 0.45 0.02 0.02

d4 0.02 0.02 0.02 0.02 0.02 0.02 0.88

d5 0.02 0.02 0.02 0.02 0.02 0.45 0.45

d6 0.02 0.02 0.02 0.31 0.31 0.02 0.31

Page 75: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

79

Power method vectors xPk x xP1 xP2 xP3 xP4 xP5 xP6 xP7 xP8 xP9 xP10 xP11 xP12 xP13

d0 0.14 0.06 0.09 0.07 0.07 0.06 0.06 0.06 0.06 0.05 0.05 0.05 0.05 0.05

d1 0.14 0.08 0.06 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04

d2 0.14 0.25 0.18 0.17 0.15 0.14 0.13 0.12 0.12 0.12 0.12 0.11 0.11 0.11

d3 0.14 0.16 0.23 0.24 0.24 0.24 0.24 0.25 0.25 0.25 0.25 0.25 0.25 0.25

d4 0.14 0.12 0.16 0.19 0.19 0.20 0.21 0.21 0.21 0.21 0.21 0.21 0.21 0.21

d5 0.14 0.08 0.06 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04

d6 0.14 0.25 0.23 0.25 0.27 0.28 0.29 0.29 0.30 0.30 0.30 0.30 0.31 0.31

Page 76: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

80

Example web graph

PageRankd0 0.05

d1 0.04

d2 0.11

d3 0.25

d4 0.21

d5 0.04

d6 0.31

Page 77: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Link Structure of the Web

150 million web pages 1.7 billion links

Backlinks and Forward links:

A and B are C’s backlinksC is A and B’s forward link

Intuitively, a webpage is important if it has a lot of backlinks.

What if a webpage has only one link off www.yahoo.com?

Page 78: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

A Simple Version of PageRank

u: a web page Bu: the set of u’s backlinks

Nv: the number of forward links of page v c: the normalization factor to make

||R||L1 = 1 (||R||L1= |R1 + … + Rn|)

Page 79: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

A Problem with Simplified PageRank

A loop:

During each iteration, the loop accumulates rank but never distributes rank to other

pages!

Page 80: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

An example of the Problem

½

Page 81: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

An example of the Problem

½

Page 82: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

An example of the Problem

½

Page 83: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Random Walks in Graphs

The Random Surfer Model The simplified model: the standing probability

distribution of a random walk on the graph of the web. User simply keeps clicking successive links at random.

The modified model: the “random surfer” simply keeps clicking successive links at random, but periodically “gets bored” and jumps to a random page based on the distribution of E.

Page 84: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Modified Version of PageRank

E(u): a distribution of ranks of web pages that “users” jump to when they “get bored” after successive links at random.

Page 85: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

An example of Modified PageRank

89

½

Page 86: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

90

Pagerank summary

Preprocessing: Given graph of links, build matrix P. From it, compute a. The entry ai is a number between 0 and 1:

the pagerank of page i. Query processing:

Retrieve pages meeting query. Rank them by their pagerank. Order is query-independent.

Prasad L21LinkAnalysis

Page 87: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

91

The reality

Pagerank is used in google, but with so many other clever heuristics.

Prasad L21LinkAnalysis

Page 88: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

92

Pagerank: Issues and Variants How realistic is the random surfer model?

What if we modeled the back button? [Fagi00]

Surfer behavior sharply skewed towards short paths [Hube98]

Search engines, bookmarks & directories make jumps non-random.

Biased Surfer Models Weight link traversal probabilities based on

match with topic/query (skewed selection) Bias jumps to pages on topic (e.g., based on

personal bookmarks & categories of interest)Prasad

Page 89: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

96

Non-uniform Teleportation

Teleport with 10% probability to a Sports page

Sports

Prasad

Page 90: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Hubs and Authorities

Alternative to Pagerank

Prasad L21LinkAnalysis 101

Page 91: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Hyperlink-Induced Topic Search (HITS) - Klei98

In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages: Hub pages are good lists of links on a

subject. e.g., “Bob’s list of cancer-related links.”

Authority pages occur recurrently on good hubs for the subject.

Best suited for “broad topic” queries rather than for page-finding queries.

Gets at a broader slice of common opinion.

Page 92: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

103

Hubs and Authorities

Thus, a good hub page for a topic points to many authoritative pages for that topic.

A good authority page for a topic is pointed to by many good hubs for that topic.

Circular definition - will turn this into an iterative computation.Prasad L21LinkAnalysis

Page 93: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

104

The hope

AT&T Alice

SprintBob MCI

Long distance telephone companies

HubsAuthorities

Prasad

Page 94: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

105

Example for hubs and authorities

Page 95: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

106

High-level scheme

Extract from the web a base set of pages that could be good hubs or authorities.

From these, identify a small set of top hub and authority pages; iterative algorithm.

Prasad L21LinkAnalysis

Page 96: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

107

Base set

Given text query (say browser), use a text index to get all pages containing browser. Call this the root set of pages.

Add in any page that either points to a page in the root set, or is pointed to by a page in the root set.

Call this the base set.

Prasad L21LinkAnalysis

Page 97: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

108

Visualization

Rootset

Base set

Prasad L21LinkAnalysis

Page 98: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

109

Assembling the base set [Klei98]

Root set typically 200-1000 nodes. Base set may have up to 5000 nodes. How do you find the base set nodes?

Follow out-links by parsing root set pages. Get in-links (and out-links) from a

connectivity server. (Actually, suffices to text-index strings of

the form href=“URL” to get in-links to URL.)

Prasad L21LinkAnalysis

Page 99: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

110

Distilling hubs and authorities

Compute, for each page x in the base set, a hub score h(x) and an authority score a(x).

Initialize: for all x, h(x)1; a(x) 1; Iteratively update all h(x), a(x); After iterations

output pages with highest h() scores as top hubs

highest a() scores as top authorities.

Key

Prasad L21LinkAnalysis

Page 100: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

111

Illustrated Update Rules

2

3

a4 = h1 + h2 + h3

1

5

7

6

4

4h4 = a5 + a6 + a7

Page 101: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

112

Iterative update

Repeat the following updates, for all x:

yx

yaxh

)()(

xy

yhxa

)()(

x

x

Prasad L21LinkAnalysis

y ’s

y ’s

Page 102: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

113

Scaling

To prevent the h() and a() values from getting too big, can normalize after each iteration.

Scaling factor doesn’t really matter: we only care about the relative

values of the scores.

Prasad L21LinkAnalysis

Page 103: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

114

Old Results

Authorities for query: “Java” java.sun.com comp.lang.java FAQ

Authorities for query: “search engine” Yahoo.com Excite.com Lycos.com Altavista.com

Authorities for query: “Gates” Microsoft.com roadahead.com

Page 104: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

115

Similar Page Results from Hubs

Given “honda.com” toyota.com ford.com bmwusa.com saturncars.com nissanmotors.com audi.com volvocars.com

Page 105: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

118

Japan Elementary Schools

The American School in Japan The Link Page ‰ªès—§ˆä� � “c¬ŠwZƒz[ƒƒy[ƒW � � � � � Kids' Space ˆÀés—§ˆÀ鼕� � � � ”¬ŠwZ � � ‹{鋳ˆç� ‘åŠw•�‘®¬ŠwZ � � KEIMEI GAKUEN Home Page ( Japanese ) Shiranuma Home Page fuzoku-es.fukui-u.ac.jp welcome to Miasa E&J school _� “Þ쌧E‰¡•ls—§� � � ’†ì¼¬ŠwZ‚̃y� � � � http://www...p/~m_maru/index.html fukui haruyama-es HomePage Torisu primary school goo Yakumo Elementary,Hokkaido,Japan FUZOKU Home Page Kamishibun Elementary School...

schools LINK Page-13 “ú–{‚ÌŠwZ � a‰„¬ŠwZƒz[ƒƒy[ƒW � � � � � � 100 Schools Home Pages (English) K-12 from Japan 10/...rnet and Education ) http://www...iglobe.ne.jp/~IKESAN ‚l‚f‚j¬ŠwZ‚U� � ”N‚P‘g•¨Œê ÒŠ—� ’¬—§ÒŠ—� “Œ¬ŠwZ � � Koulutus ja oppilaitokset TOYODA HOMEPAGE Education Cay's Homepage(Japanese) –y“쬊wZ‚̃z[ƒƒy[ƒW � � � � � UNIVERSITY ‰J—³¬ŠwZ DRAGON97-TOP � � ‰ª¬ŠwZ‚T� � � ”N‚P‘gƒz[ƒƒy[ƒW � � � ¶µ°é¼ÂÁ© ¥á¥Ë¥å¡¼ ¥á¥Ë¥å¡¼

Hubs Authorities

Page 106: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

119

Authorities for query [Chicago Bulls]

0.85 www.nba.com/bulls0.25 www.essex1.com/people/jmiller/bulls.htm

“da Bulls”0.20 www.nando.net/SportServer/basketball/nba/chi.html

“The Chicago Bulls”0.15 Users.aol.com/rynocub/bulls.htm

“The Chicago Bulls Home Page ”0.13 www.geocities.com/Colosseum/6095

“Chicago Bulls”

(Ben Shaul et al, WWW8)

Page 107: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

120

Hubs for query [Chicago Bulls]

1.62 www.geocities.com/Colosseum/1778“Unbelieveabulls!!!!!”

1.24 www.webring.org/cgi-bin/webring?ring=chbulls“Chicago Bulls”

0.74 www.geocities.com/Hollywood/Lot/3330/Bulls.html“Chicago Bulls”

0.52 www.nobull.net/web_position/kw-search-15-M2.html“Excite Search Results: bulls ”

0.52 www.halcyon.com/wordltd/bball/bulls.html“Chicago Bulls Links”

(Ben Shaul et al, WWW8)

Page 108: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

121

Things to note

Pulled together good pages regardless of language of page content.

Use link analysis only after base set assembled iterative scoring is query-independent.

Iterative computation after text index retrieval - significant overhead.

Prasad L21LinkAnalysis

Page 109: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

122

Proof of convergence

nn adjacency matrix A: each of the n pages in the base set has a

row and column in the matrix. Entry Aij = 1 if page i links to page j, else =

0.

1 2

3

1 2 3

1

2

3

0 1 0

1 1 1

1 0 0Prasad L21LinkAnalysis

Page 110: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

123

Hub/authority vectors

View the hub scores h() and the authority scores a() as vectors with n components.

Recall the iterative updates

yx

yaxh

)()(

xy

yhxa

)()(

Prasad

Page 111: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

124

Rewrite in matrix form

h=Aa. a=Ath.

Recall At is the

transpose of A.

Substituting, h=AAth and a=AtAa.Thus, h is an eigenvector of AAt and a is an eigenvector of AtA.

Further, we know the algorithm is for computing eigenvector: the power iteration method.

Guaranteed to converge.Prasad

Page 112: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

Example web graph

Page 113: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

Raw matrix A for HITS

d0 d1 d2 d3 d4 d5 d6

d0 0 0 1 0 0 0 0

d1 0 1 1 0 0 0 0

d2 1 0 1 2 0 0 0

d3 0 0 0 1 1 0 0

d4 0 0 0 0 0 0 1

d5 0 0 0 0 0 1 1

d6 0 0 0 2 1 0 1

Page 114: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

Hub vectors h0 , hi = A*ai , i ≥1

h0 h1 h2 h3 h4 h5

d0 0.14 0.06 0.04 0.04 0.03 0.03

d1 0.14 0.08 0.05 0.04 0.04 0.04

d2 0.14 0.28 0.32 0.33 0.33 0.33

d3 0.14 0.14 0.17 0.18 0.18 0.18

d4 0.14 0.06 0.04 0.04 0.04 0.04

d5 0.14 0.08 0.05 0.04 0.04 0.04

d6 0.14 0.30 0.33 0.34 0.35 0.35

d i

1

Page 115: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

Authority vector a = AT*hi-1 , i ≥ 1

a1 a2 a3 a4 a5 a6 a7

d0 0.06 0.09 0.10 0.10 0.10 0.10 0.10

d1 0.06 0.03 0.01 0.01 0.01 0.01 0.01

d2 0.19 0.14 0.13 0.12 0.12 0.12 0.12

d3 0.31 0.43 0.46 0.46 0.46 0.47 0.47

d4 0.13 0.14 0.16 0.16 0.16 0.16 0.16

d5 0.06 0.03 0.02 0.01 0.01 0.01 0.01

d6 0.19 0.14 0.13 0.13 0.13 0.13 0.13

ci1

Page 116: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

129

Example web graph

a hd0 0.10 0.03

d1 0.01 0.04

d2 0.12 0.33

d3 0.47 0.18

d4 0.16 0.04

d5 0.01 0.04

d6 0.13 0.35

Page 117: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

Top-ranked pages

Pages with highest in-degree: d2, d3, d6

Pages with highest out-degree: d2, d6

Pages with highest PageRank: d6

Pages with highest in-degree: d6 (close: d2)

Pages with highest authority score: d3

Page 118: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

Introduction to Information RetrievalIntroduction to Information Retrieval

131

PageRank vs. HITS: Discussion PageRank can be precomputed, HITS has to be computed at

query time. HITS is too expensive in most application scenarios.

PageRank and HITS make two different design choices concerning (i) the eigenproblem formalization (ii) the set of pages to apply the formalization to.

These two are orthogonal. We could also apply HITS to the entire web and PageRank to a

small base set. Claim: On the web, a good hub almost always is also a good authority. The actual difference between PageRank ranking and HITS ranking is

therefore not as large as one might expect.

Page 119: PrasadL21LinkAnalysis1 Link Analysis Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney.

132

Issues

Topic Drift Off-topic pages can cause off-topic

“authorities” to be returned E.g., the neighborhood graph can be

about a “super topic” Mutually Reinforcing Affiliates

Affiliated pages/sites can boost each others’ scores

Linkage between affiliated pages is not a useful signal

Prasad L21LinkAnalysis