Link Analysis (Prasad, Lecture 21)
Adapted from lectures by Prabhakar Raghavan (Yahoo, Stanford), Christopher Manning (Stanford), and Raymond Mooney (UT Austin)
Evolution of Search Engines
1st generation: retrieved documents that matched keyword-based queries, using the Boolean model.
2nd generation: incorporated content-specific relevance ranking based on the vector space model (TF-IDF), to deal with high recall.
3rd generation: incorporated content-independent source ranking using hyperlink structure, to overcome spamming and to exploit "collective web wisdom".
HTML Structure & Feature Weighting
Weight tokens under particular HTML tags more heavily:
<TITLE> tokens (Google seems to like title matches)
<H1>, <H2>, … tokens
<META> keyword tokens
Parse a page into conceptual sections (e.g., navigation links vs. page content) and weight tokens differently based on section.
Evolution of Search Engines (cont'd)
3rd generation: also tried to glean the relative semantic emphasis of words from syntactic features such as fonts and the span of query-term hits, to enhance the efficacy of the vector space model.
Future search engines will (i) incorporate trusted sources, (ii) use background knowledge, user profiles, and search context from past query history to personalize search, and (iii) apply additional reasoning and heuristics to improve satisfaction of the information need.
Meta-Search Engines
A meta-search engine passes a query to several other search engines and integrates the results:
Submit the query to the host sites.
Parse the resulting HTML pages to extract the search results.
Integrate the multiple rankings (rank aggregation) into a "consensus" ranking.
Present the integrated results to the user.
Examples: MetaCrawler, SavvySearch, Dogpile.
Bibliometrics: Citation Analysis
Many standard documents include bibliographies (or references): explicit citations to other previously published documents.
Using citations as links, standard corpora can be viewed as a graph.
The structure of this graph, independent of the content, can provide interesting information about the similarity of documents (useful for clustering) and about their significance (useful for ranking).
Bibliographic Coupling: Similarity Measure
A measure of the similarity of documents, introduced by Kessler in 1963.
The bibliographic coupling of two documents A and B is the number of documents cited by both A and B, i.e., the size of the intersection of their bibliographies.
One may want to normalize by the sizes of the bibliographies.
Co-Citation: Similarity Measure
An alternative citation-based measure of similarity, introduced by Small in 1973: the number of documents that cite both A and B.
One may want to normalize by the total number of documents citing either A or B.
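The two measures can be sketched in a few lines of Python; the citation data below is a made-up toy example, not from the lecture.

```python
# Toy citation data (hypothetical): each document maps to the set of
# documents it cites.
citations = {
    "A": {"p1", "p2", "p3"},
    "B": {"p2", "p3", "p4"},
    "C": {"p1", "p4"},
}

def bibliographic_coupling(a, b, cites):
    # Kessler (1963): number of documents cited by both a and b.
    return len(cites[a] & cites[b])

def co_citation(a, b, cites):
    # Small (1973): number of documents that cite both a and b.
    return sum(1 for refs in cites.values() if a in refs and b in refs)

print(bibliographic_coupling("A", "B", citations))  # A and B share p2, p3 -> 2
print(co_citation("p2", "p3", citations))           # p2 and p3 both cited by A and B -> 2
```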
Impact Factor (of a journal)
Developed by Garfield in 1972 to measure the importance (quality, influence) of scientific journals.
A measure of how often papers in the journal are cited by other scientists.
Computed and published annually by the Institute for Scientific Information (ISI).
The impact factor of a journal J in year Y is the average number of citations (from indexed documents published in year Y) to a paper published in J in year Y−1 or Y−2.
It does not account for the quality of the citing article.
Authorities
Authorities are pages that are recognized as providing significant, trustworthy, and useful information on a topic.
In-degree (the number of pointers to a page) is one simple measure of authority.
However, in-degree treats all links as equal. Shouldn't links from pages that are themselves authoritative count for more?
Hubs
Hubs are index pages that provide many useful links to relevant content pages (topic authorities).
Hub pages for IR are included in the course home page: http://www.cs.utexas.edu/users/mooney/ir-course
Hubs and Authorities
Together they tend to form a bipartite graph: hubs on one side, authorities on the other.
Introduction to Information Retrieval
Simple iterative logic: The Good, The Bad and The Unknown
Good nodes won't point to Bad nodes; all other combinations are plausible.
This gives two propagation rules: if you point to a Bad node, you're Bad; if a Good node points to you, you're Good.
Starting from a few known Good and Bad nodes, applying the rules repeatedly labels more and more of the initially unknown nodes, until no rule applies and any remaining nodes stay unknown.
We sometimes need probabilistic analogs of these rules, e.g., for mail spam.
Today's lecture: Basics of 3rd Generation Search Engines
Role of anchor text.
Link analysis for ranking:
PageRank and variants (Page and Brin)
HITS (Kleinberg)
Bonus: illustrates how to solve an important problem by adapting the topology/statistics of a large dataset for mathematically sound analysis and a practical, scalable algorithm.
The Web as a Directed Graph
Assumption 1: a hyperlink from page A to page B denotes author-perceived relevance of B (a quality signal).
Assumption 2: the anchor text of the hyperlink describes the target page (textual context).
Anchor Text: WWW Worm, McBryan [McBr94]
For the query ibm, how do we distinguish between:
IBM's home page (mostly graphical),
IBM's copyright page (high term frequency for 'ibm'), and
a rival's spam page (arbitrarily high term frequency)?
Anchor texts such as "ibm", "ibm.com", and "IBM home page" all point to www.ibm.com; a million pieces of anchor text with "ibm" send a strong signal.
Anchor text containing IBM pointing to www.ibm.com
Indexing anchor text
When indexing a document D, include anchor text from links pointing to D.
Example anchor contexts pointing to www.ibm.com:
"Armonk, NY-based computer giant IBM announced today …"
"Joe's computer hardware links: Compaq, HP, IBM"
"Big Blue today announced record profits for the quarter"
Google Ranking: Controversy surrounding "Attention: Good vs. Bad"
Google bombs
A Google bomb is a search with "bad" results due to maliciously manipulated anchor text.
Google introduced a new weighting function in January 2007 that defused many Google bombs.
Some remnants remain: [dangerous cult] on Google, Bing, and Yahoo, the result of coordinated link creation by those who dislike the Church of Scientology.
Defused Google bombs: [dumb motherf…], [who is a failure?], [evil empire].
Motivation and Introduction to PageRank
Why is page importance rating important? New challenges for information retrieval on the World Wide Web:
Number of web pages: 150 million by 1998, 1000 billion by 2008; about 50 billion of the more than 1 trillion known pages indexed by Google in 2014.
Diversity of web pages: different topics, quality, etc.
What is PageRank? A method for rating the importance of web pages objectively and mechanically, using the link structure of the web.
Initial PageRank Idea
Can view it as a process of PageRank “flowing” from pages to the pages they cite.
Sample Stable Fixpoint
(figure: a small graph whose node ranks, 0.4 and 0.2, are unchanged by a further round of flow)
PageRank scoring: more details
Imagine a browser doing a random walk on web pages:
Start at a random page.
At each step, go out of the current page along one of the links on that page, chosen with equal probability (e.g., 1/3 each for a page with three out-links).
"In the steady state," each page has a long-term visit rate; use this as the page's score.
Not quite enough
The web is full of dead ends.
A random walk can get stuck in dead ends, and then it makes no sense to talk about long-term visit rates.
Teleporting
At a dead end, jump to a random web page.
At any non-dead end, with probability 10%, jump to a random web page; with the remaining probability (90%), go out on a random link.
The 10% is a parameter.
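As a minimal sketch, the teleporting rule can be turned into a transition matrix; the 10% teleport probability and the tiny two-page graph below are illustrative assumptions.

```python
import numpy as np

def transition_matrix(A, teleport=0.10):
    """Build the random-surfer transition matrix from adjacency matrix A
    (A[i][j] = 1 if page i links to page j)."""
    A = np.asarray(A, dtype=float)
    n = len(A)
    P = np.zeros((n, n))
    for i in range(n):
        out = A[i].sum()
        if out == 0:
            P[i] = 1.0 / n                       # dead end: jump anywhere uniformly
        else:
            P[i] = teleport / n + (1 - teleport) * A[i] / out
    return P

A = [[0, 1],      # page 0 links only to page 1
     [0, 0]]      # page 1 is a dead end
P = transition_matrix(A)
print(P)          # rows [0.05, 0.95] and [0.5, 0.5]; every row sums to 1
```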
Result of teleporting
Now the walk cannot get stuck locally, and there is a long-term rate at which any page is visited (not obvious; we will show this).
How do we compute this visit rate?
Markov chains
A Markov chain consists of n states, plus an n×n transition probability matrix P.
At each step, we are in exactly one of these states.
For 1 ≤ i, j ≤ n, the matrix entry Pij tells us the probability of j being the next state, given that we are currently in state i.
Pii > 0 is OK (a state may transition to itself).
Markov chains (cont'd)
Clearly, for all i, Σj Pij = 1.
Markov chains are abstractions of random walks.
Exercise: represent the teleporting random walk (eliminating dead ends) as a Markov chain.
Ergodic Markov chains
A Markov chain is ergodic if there is a path from any state to any other, and if, for any start state, after a finite transient time T0, the probability of being in any state at a fixed time T > T0 is nonzero.
A chain that deterministically alternates between two states (visiting each only on even or odd steps) is not ergodic.
Ergodic Markov chains (cont'd)
For any ergodic Markov chain, there is a unique long-term visit rate for each state: the steady-state probability distribution.
Over a long time period, we visit each state in proportion to this rate. It doesn't matter where we start.
Probability vectors
A probability (row) vector x = (x1, …, xn) tells us where the walk is at any point.
E.g., (0 0 0 … 1 … 0 0 0), with the 1 in position i, means we're in state i.
More generally, the vector x = (x1, …, xn) means the walk is in state i with probability xi; hence Σi xi = 1.
Change in probability vector
If the probability vector is x = (x1, …, xn) at this step, what is it at the next step?
Recall that row i of the transition probability matrix P tells us where we go next from state i.
So from x, our next state is distributed as xP; componentwise, x'j = Σi xi Pij.
Steady state example
The steady state looks like a vector of probabilities a = (a1, …, an), where ai is the probability of being in state i.
Example: two states, where state 1 moves to state 2 with probability 3/4 (staying put with probability 1/4), and state 2 moves to state 1 with probability 1/4 (staying put with probability 3/4).
For this example, a1 = 1/4 and a2 = 3/4.
How do we compute this vector?
Let a = (a1, …, an) denote the row vector of steady-state probabilities.
If the current state is described by a, then the next step is distributed as aP. But a is the steady state, so a = aP.
So a is a (left) eigenvector of P; it corresponds to the "principal" eigenvector of P, the one with the largest eigenvalue.
Transition probability matrices always have largest eigenvalue 1 (the others are smaller).
The principal eigenvector can be computed iteratively from any initial vector.
One way of computing a
Recall that, regardless of where we start, we eventually reach the steady state a.
Start with any distribution, say x = (1 0 … 0). After one step we're at xP; after two steps at xP2, then xP3, and so on. "Eventually" means that for large k, xPk ≈ a.
Algorithm: multiply x by increasing powers of P until the product looks stable.
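A minimal power-method sketch in Python; the matrix P here is the two-state chain used in the worked example later in these slides (P11 = 0.1, P12 = 0.9, P21 = 0.3, P22 = 0.7).

```python
import numpy as np

P = np.array([[0.1, 0.9],
              [0.3, 0.7]])      # transition probability matrix

x = np.array([0.0, 1.0])        # any start distribution works
for _ in range(100):            # multiply by increasing powers of P
    nxt = x @ P
    if np.allclose(nxt, x, atol=1e-12):
        break                   # the product looks stable
    x = nxt

print(x)                        # converges to the steady state (0.25, 0.75)
```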
An example of Simplified PageRank (transposed version)
Here Mij is the weight of the link from j to i (e.g., 1/2 when j has two out-links), and each iteration computes
xi = Σj Mij xj .
The first iteration redistributes the initial ranks along the links, the second redistributes them again, and after a few more iterations the rank vector converges.
Power method: Example
What is the PageRank / steady state in this example?
Computing PageRank: Power Example
Two pages d1, d2 with transition probabilities P11 = 0.1, P12 = 0.9, P21 = 0.3, P22 = 0.7. Each row shows the distribution (Pt(d1), Pt(d2)) at time t and the next product:

t0: (0, 1)            xP  = (0.3, 0.7)
t1: (0.3, 0.7)        xP2 = (0.24, 0.76)
t2: (0.24, 0.76)      xP3 = (0.252, 0.748)
t3: (0.252, 0.748)    xP4 = (0.2496, 0.7504)
...
t∞: (0.25, 0.75)      xP∞ = (0.25, 0.75)

Update rule: Pt(d1) = Pt-1(d1)·P11 + Pt-1(d2)·P21 and Pt(d2) = Pt-1(d1)·P12 + Pt-1(d2)·P22.
PageRank vector = (0.25, 0.75).
Power method: Example
What is the PageRank / steady state in this example?
The steady-state distribution (= the PageRank values) in this example is 0.25 for d1 and 0.75 for d2.
Exercise: Compute PageRank using power method
Solution
Transition probabilities: P11 = 0.7, P12 = 0.3, P21 = 0.2, P22 = 0.8.

t0: (0, 1)            xP  = (0.2, 0.8)
t1: (0.2, 0.8)        xP2 = (0.3, 0.7)
t2: (0.3, 0.7)        xP3 = (0.35, 0.65)
t3: (0.35, 0.65)      xP4 = (0.375, 0.625)
...
t∞: (0.4, 0.6)

Update rule: Pt(d1) = Pt-1(d1)·P11 + Pt-1(d2)·P21 and Pt(d2) = Pt-1(d1)·P12 + Pt-1(d2)·P22.
PageRank vector = (0.4, 0.6).
Example web graph
Transition (probability) matrix
d0 d1 d2 d3 d4 d5 d6
d0 0.00 0.00 1.00 0.00 0.00 0.00 0.00
d1 0.00 0.50 0.50 0.00 0.00 0.00 0.00
d2 0.33 0.00 0.33 0.33 0.00 0.00 0.00
d3 0.00 0.00 0.00 0.50 0.50 0.00 0.00
d4 0.00 0.00 0.00 0.00 0.00 0.00 1.00
d5 0.00 0.00 0.00 0.00 0.00 0.50 0.50
d6 0.00 0.00 0.00 0.33 0.33 0.00 0.33
Transition matrix with teleporting
d0 d1 d2 d3 d4 d5 d6
d0 0.02 0.02 0.88 0.02 0.02 0.02 0.02
d1 0.02 0.45 0.45 0.02 0.02 0.02 0.02
d2 0.31 0.02 0.31 0.31 0.02 0.02 0.02
d3 0.02 0.02 0.02 0.45 0.45 0.02 0.02
d4 0.02 0.02 0.02 0.02 0.02 0.02 0.88
d5 0.02 0.02 0.02 0.02 0.02 0.45 0.45
d6 0.02 0.02 0.02 0.31 0.31 0.02 0.31
Power method vectors xP^k:
x xP1 xP2 xP3 xP4 xP5 xP6 xP7 xP8 xP9 xP10 xP11 xP12 xP13
d0 0.14 0.06 0.09 0.07 0.07 0.06 0.06 0.06 0.06 0.05 0.05 0.05 0.05 0.05
d1 0.14 0.08 0.06 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04
d2 0.14 0.25 0.18 0.17 0.15 0.14 0.13 0.12 0.12 0.12 0.12 0.11 0.11 0.11
d3 0.14 0.16 0.23 0.24 0.24 0.24 0.24 0.25 0.25 0.25 0.25 0.25 0.25 0.25
d4 0.14 0.12 0.16 0.19 0.19 0.20 0.21 0.21 0.21 0.21 0.21 0.21 0.21 0.21
d5 0.14 0.08 0.06 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04
d6 0.14 0.25 0.23 0.25 0.27 0.28 0.29 0.29 0.30 0.30 0.30 0.30 0.31 0.31
Example web graph: PageRank
d0 0.05
d1 0.04
d2 0.11
d3 0.25
d4 0.21
d5 0.04
d6 0.31
Link Structure of the Web
150 million web pages, 1.7 billion links.
Backlinks and forward links: if pages A and B link to page C, then A and B are C's backlinks, and C is a forward link of A and of B.
Intuitively, a web page is important if it has a lot of backlinks.
But what if a web page has only one link, off www.yahoo.com?
A Simple Version of PageRank
R(u) = c · Σ_{v ∈ Bu} R(v) / Nv
where u is a web page, Bu is the set of u's backlinks, Nv is the number of forward links of page v, and c is the normalization factor that makes ||R||L1 = 1 (||R||L1 = |R1| + … + |Rn|).
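A sketch of this simplified iteration on a made-up three-page graph (the graph and page names are illustrative, not from the lecture):

```python
links = {            # page -> its forward links
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}

def simplified_pagerank(links, iterations=100):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        nxt = {p: 0.0 for p in pages}
        for v, outs in links.items():
            for u in outs:
                nxt[u] += rank[v] / len(outs)   # each backlink v contributes R(v)/Nv
        total = sum(nxt.values())               # c normalizes so ||R||_L1 = 1
        rank = {p: r / total for p, r in nxt.items()}
    return rank

rank = simplified_pagerank(links)
print(rank)   # converges to a: 0.4, b: 0.2, c: 0.4
```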
A Problem with Simplified PageRank
Consider a loop: during each iteration, the loop accumulates rank but never distributes rank to other pages!
An example over several iterations shows the rank draining into the loop.
Random Walks in Graphs
The random surfer model:
The simplified model: the standing probability distribution of a random walk on the graph of the web; the user simply keeps clicking successive links at random.
The modified model: the "random surfer" keeps clicking successive links at random, but periodically "gets bored" and jumps to a random page based on the distribution E.
Modified Version of PageRank
R'(u) = c · Σ_{v ∈ Bu} R'(v) / Nv + c · E(u)
where E(u) is a distribution of rank over web pages that "users" jump to when they "get bored" after following successive links at random.
PageRank summary
Preprocessing: given the graph of links, build the matrix P; from it, compute a. The entry ai is a number between 0 and 1: the PageRank of page i.
Query processing: retrieve the pages meeting the query and rank them by their PageRank. The order is query-independent.
The reality
PageRank is used in Google, but together with many other clever heuristics.
PageRank: Issues and Variants
How realistic is the random surfer model?
What if we modeled the back button? [Fagi00]
Surfer behavior is sharply skewed towards short paths. [Hube98]
Search engines, bookmarks, and directories make jumps non-random.
Biased surfer models:
Weight link-traversal probabilities based on match with the topic/query (skewed selection).
Bias jumps to pages on topic (e.g., based on personal bookmarks and categories of interest).
Non-uniform Teleportation
Teleport with 10% probability to a Sports page (rather than to a uniformly random page).
Hubs and Authorities
An alternative to PageRank.
Hyperlink-Induced Topic Search (HITS) [Klei98]
In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages:
Hub pages are good lists of links on a subject, e.g., "Bob's list of cancer-related links."
Authority pages occur recurrently on good hubs for the subject.
Best suited for "broad topic" queries rather than for page-finding queries; gets at a broader slice of common opinion.
Hubs and Authorities
Thus, a good hub page for a topic points to many authoritative pages for that topic, and a good authority page for a topic is pointed to by many good hubs for that topic.
This is a circular definition; we will turn it into an iterative computation.
The hope
(figure: hub pages by users such as Alice and Bob point to the authorities AT&T, Sprint, and MCI, the long-distance telephone companies)
Example for hubs and authorities
High-level scheme
Extract from the web a base set of pages that could be good hubs or authorities.
From these, identify a small set of top hub and authority pages via an iterative algorithm.
Base set
Given a text query (say, browser), use a text index to get all pages containing browser; call this the root set of pages.
Add in any page that either points to a page in the root set or is pointed to by a page in the root set.
Call this the base set.
Visualization
(figure: the root set as a small region inside the larger base set)
Assembling the base set [Klei98]
The root set typically contains 200-1000 nodes; the base set may have up to 5000 nodes.
How do you find the base set nodes?
Follow out-links by parsing the root set pages.
Get in-links (and out-links) from a connectivity server.
(Actually, it suffices to text-index strings of the form href="URL" to get the in-links to URL.)
Distilling hubs and authorities
Compute, for each page x in the base set, a hub score h(x) and an authority score a(x).
Initialize: for all x, h(x) ← 1 and a(x) ← 1.
Iteratively update all h(x), a(x) (the key step).
After the iterations, output the pages with the highest h() scores as top hubs, and those with the highest a() scores as top authorities.
Illustrated Update Rules
If pages 1, 2, and 3 all point to page 4, then a4 = h1 + h2 + h3.
If page 4 points to pages 5, 6, and 7, then h4 = a5 + a6 + a7.
Iterative update
Repeat the following updates, for all x:
h(x) = Σ_{x→y} a(y)   (sum of authority scores over the pages y that x points to)
a(x) = Σ_{y→x} h(y)   (sum of hub scores over the pages y that point to x)
Scaling
To prevent the h() and a() values from getting too big, normalize after each iteration.
The scaling factor doesn't really matter: we only care about the relative values of the scores.
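The update-and-normalize loop can be sketched as follows; the four-page base set is a made-up illustration, not from the lecture.

```python
links = {                        # page -> pages it points to
    "hub1": ["auth1", "auth2"],
    "hub2": ["auth1", "auth2"],
    "auth1": [],
    "auth2": [],
}

def hits(links, iterations=20):
    pages = list(links)
    h = {p: 1.0 for p in pages}               # initialize h(x) <- 1
    a = {p: 1.0 for p in pages}               # initialize a(x) <- 1
    for _ in range(iterations):
        a = {x: sum(h[y] for y in pages if x in links[y]) for x in pages}
        h = {x: sum(a[y] for y in links[x]) for x in pages}
        # normalize; the scaling factor doesn't matter, only relative values do
        for s in (a, h):
            norm = sum(v * v for v in s.values()) ** 0.5 or 1.0
            for k in s:
                s[k] /= norm
    return h, a

h, a = hits(links)
print(h)   # hub1 and hub2 get the top hub scores
print(a)   # auth1 and auth2 get the top authority scores
```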
Old Results
Authorities for query "Java": java.sun.com, comp.lang.java FAQ.
Authorities for query "search engine": Yahoo.com, Excite.com, Lycos.com, Altavista.com.
Authorities for query "Gates": Microsoft.com, roadahead.com.
Similar Page Results from Hubs
Given "honda.com": toyota.com, ford.com, bmwusa.com, saturncars.com, nissanmotors.com, audi.com, volvocars.com.
Japan Elementary Schools
The returned hubs and authorities are mostly Japanese-language elementary-school pages (the original slide lists many of them in Japanese); recognizable entries include The American School in Japan, Kids' Space, KEIMEI GAKUEN Home Page, fuzoku-es.fukui-u.ac.jp, Miasa E&J school, Torisu primary school, Yakumo Elementary (Hokkaido), 100 Schools Home Pages (English), and K-12 from Japan.
Authorities for query [Chicago Bulls]
0.85 www.nba.com/bulls
0.25 www.essex1.com/people/jmiller/bulls.htm ("da Bulls")
0.20 www.nando.net/SportServer/basketball/nba/chi.html ("The Chicago Bulls")
0.15 Users.aol.com/rynocub/bulls.htm ("The Chicago Bulls Home Page")
0.13 www.geocities.com/Colosseum/6095 ("Chicago Bulls")
(Ben-Shaul et al., WWW8)
Hubs for query [Chicago Bulls]
1.62 www.geocities.com/Colosseum/1778 ("Unbelieveabulls!!!!!")
1.24 www.webring.org/cgi-bin/webring?ring=chbulls ("Chicago Bulls")
0.74 www.geocities.com/Hollywood/Lot/3330/Bulls.html ("Chicago Bulls")
0.52 www.nobull.net/web_position/kw-search-15-M2.html ("Excite Search Results: bulls")
0.52 www.halcyon.com/wordltd/bball/bulls.html ("Chicago Bulls Links")
(Ben-Shaul et al., WWW8)
Things to note
HITS pulled together good pages regardless of the language of the page content.
Link analysis is used only after the base set is assembled; the iterative scoring itself is then query-independent (it ignores the query text).
The iterative computation after text-index retrieval adds significant overhead.
Proof of convergence
n×n adjacency matrix A: each of the n pages in the base set has a row and a column in the matrix.
Entry Aij = 1 if page i links to page j, else 0.
Example with three pages, where page 1 points to 2, page 2 points to 1, 2, and 3, and page 3 points to 1:

    1  2  3
1   0  1  0
2   1  1  1
3   1  0  0
Hub/authority vectors
View the hub scores h() and the authority scores a() as vectors with n components.
Recall the iterative updates:
h(x) = Σ_{x→y} a(y)
a(x) = Σ_{y→x} h(y)
Rewrite in matrix form
h = Aa and a = ATh (recall that AT is the transpose of A).
Substituting, h = AATh and a = ATAa. Thus h is an eigenvector of AAT and a is an eigenvector of ATA.
Further, the algorithm we know for computing eigenvectors, the power iteration method, applies, and it is guaranteed to converge.
Example web graph
Raw matrix A for HITS
d0 d1 d2 d3 d4 d5 d6
d0 0 0 1 0 0 0 0
d1 0 1 1 0 0 0 0
d2 1 0 1 2 0 0 0
d3 0 0 0 1 1 0 0
d4 0 0 0 0 0 0 1
d5 0 0 0 0 0 1 1
d6 0 0 0 2 1 0 1
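As a sketch, running the h = A·a, a = AT·h iteration with L1 normalization (as in the tables that follow) on this matrix reproduces the scores; entering the matrix row by row is the only assumption here.

```python
import numpy as np

A = np.array([                   # raw HITS matrix from the example web graph
    [0, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 2, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 2, 1, 0, 1],
], dtype=float)

h = np.ones(7) / 7               # h0: uniform start
for _ in range(100):
    a = A.T @ h                  # a_i = A^T h_{i-1}, then L1-normalize
    a /= a.sum()
    h = A @ a                    # h_i = A a_i, then L1-normalize
    h /= h.sum()

# d3 ends up with the highest authority score, d6 with the highest hub score
print(np.round(a, 2), np.round(h, 2))
```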
Hub vectors h0, hi = (1/di) A·ai, i ≥ 1, where di is the normalization factor
h0 h1 h2 h3 h4 h5
d0 0.14 0.06 0.04 0.04 0.03 0.03
d1 0.14 0.08 0.05 0.04 0.04 0.04
d2 0.14 0.28 0.32 0.33 0.33 0.33
d3 0.14 0.14 0.17 0.18 0.18 0.18
d4 0.14 0.06 0.04 0.04 0.04 0.04
d5 0.14 0.08 0.05 0.04 0.04 0.04
d6 0.14 0.30 0.33 0.34 0.35 0.35
Authority vectors ai = (1/ci) AT·hi−1, i ≥ 1, where ci is the normalization factor
a1 a2 a3 a4 a5 a6 a7
d0 0.06 0.09 0.10 0.10 0.10 0.10 0.10
d1 0.06 0.03 0.01 0.01 0.01 0.01 0.01
d2 0.19 0.14 0.13 0.12 0.12 0.12 0.12
d3 0.31 0.43 0.46 0.46 0.46 0.47 0.47
d4 0.13 0.14 0.16 0.16 0.16 0.16 0.16
d5 0.06 0.03 0.02 0.01 0.01 0.01 0.01
d6 0.19 0.14 0.13 0.13 0.13 0.13 0.13
Example web graph: final scores
     a     h
d0  0.10  0.03
d1  0.01  0.04
d2  0.12  0.33
d3  0.47  0.18
d4  0.16  0.04
d5  0.01  0.04
d6  0.13  0.35
Top-ranked pages
Pages with highest in-degree: d2, d3, d6
Pages with highest out-degree: d2, d6
Page with highest PageRank: d6
Page with highest hub score: d6 (close: d2)
Page with highest authority score: d3
PageRank vs. HITS: Discussion
PageRank can be precomputed; HITS has to be computed at query time. HITS is too expensive in most application scenarios.
PageRank and HITS make two different design choices concerning (i) the eigenproblem formalization and (ii) the set of pages to apply the formalization to.
These two choices are orthogonal: we could also apply HITS to the entire web, and PageRank to a small base set.
Claim: on the web, a good hub almost always is also a good authority. The actual difference between the PageRank ranking and the HITS ranking is therefore not as large as one might expect.
Issues
Topic drift: off-topic pages can cause off-topic "authorities" to be returned; e.g., the neighborhood graph can be about a "super topic."
Mutually reinforcing affiliates: affiliated pages/sites can boost each other's scores; linkage between affiliated pages is not a useful signal.