Link Analysis. 2 Objectives To review common approaches to link analysis To calculate the popularity...

Link Analysis


Objectives

• To review common approaches to link analysis

• To calculate the popularity of a site based on link analysis

• To model human judgments indirectly


Outline

1. Early Approaches to Link Analysis
2. Hubs and Authorities: HITS
3. PageRank
4. Stability
5. Probabilistic Link Analysis
6. Limitations of Link Analysis


Early Approaches

Basic Assumptions

• Hyperlinks contain information about the human judgment of a site

• The more incoming links a site has, the more important it is judged to be

Bray (1996)

• The visibility of a site is measured by the number of other sites pointing to it

• The luminosity of a site is measured by the number of other sites to which it points

• Limitation: this fails to capture the relative importance of different parent (child) sites
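Visibility and luminosity are simply in-degree and out-degree counts. A minimal sketch, assuming the Web graph is given as a list of (source, target) pairs; the function name and the choice to ignore self-links are assumptions of this example, not part of Bray's definition:

```python
from collections import defaultdict

def visibility_and_luminosity(edges):
    """Return (visibility, luminosity) dicts for a list of (source, target) links."""
    parents = defaultdict(set)   # sites pointing to each site
    children = defaultdict(set)  # sites each site points to
    for src, dst in edges:
        if src != dst:           # assumption: self-links are ignored
            parents[dst].add(src)
            children[src].add(dst)
    sites = {s for edge in edges for s in edge}
    visibility = {s: len(parents[s]) for s in sites}    # number of distinct parents
    luminosity = {s: len(children[s]) for s in sites}   # number of distinct children
    return visibility, luminosity

edges = [("a", "c"), ("b", "c"), ("c", "d"), ("a", "d")]
vis, lum = visibility_and_luminosity(edges)
```

Here "c" and "d" both have visibility 2, even though their parents differ in importance, which is the limitation noted above.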


Early Approaches

Mark (1988)

• To calculate the score S of a document at vertex v:

$$S(v) = s(v) + \frac{1}{|ch[v]|} \sum_{w \in ch[v]} S(w)$$

– v: a vertex in the hypertext graph G = (V, E)
– S(v): the global score
– s(v): the score if the document is isolated
– ch[v]: the children of the document at vertex v

• Limitations:
– Requires G to be a directed acyclic graph (DAG)
– If v has a single link to w, S(v) > S(w)
– If v has a long path to w and s(v) < s(w), then S(v) > S(w), which is unreasonable
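The recursive score can be sketched as a memoized traversal. This is an illustration only: the dictionaries `children` and `local` and the function name are assumptions, and the recursion terminates only because the graph is assumed to be a DAG.

```python
from functools import lru_cache

def global_scores(children, local):
    """S(v) = s(v) + (1/|ch[v]|) * sum of S(w) over the children w of v."""
    @lru_cache(maxsize=None)
    def S(v):
        kids = children.get(v, ())
        if not kids:                       # a leaf keeps its isolated score
            return local[v]
        return local[v] + sum(S(w) for w in kids) / len(kids)
    return {v: S(v) for v in local}

# v has a single link to w, and v is locally weaker than w.
children = {"v": ("w",), "w": ()}
local = {"v": 0.2, "w": 0.9}
scores = global_scores(children, local)
```

With these positive local scores S(v) = 0.2 + 0.9 exceeds S(w) = 0.9, so the weaker page outranks the page it links to, which is the limitation listed above.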


Early Approaches

Marchiori (1997)

• A remedy for the previous approach (Mark 1988)

$$S(v) = s(v) + h(v)$$

– S(v): overall information
– s(v): textual information
– h(v): hyper information

• Hyper information should complement textual information to obtain the overall information

$$h(v) = \sum_{w \in ch[v]} F^{r(v, w)} S(w)$$

– F: a fading constant, F ∈ (0, 1)
– r(v, w): the rank of w after sorting the children of v by S(w)
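A small sketch of the fading sum. Two details are assumptions of this example rather than statements from the slide: children are sorted by decreasing S(w), and ranks start at 1. The graph is again assumed to be acyclic so that the recursion terminates.

```python
def overall_information(children, textual, fading=0.5):
    """S(v) = s(v) + sum over children of fading**rank * S(child)."""
    cache = {}

    def S(v):
        if v not in cache:
            # Assumption: best child first, ranks numbered 1, 2, 3, ...
            ranked = sorted((S(w) for w in children.get(v, ())), reverse=True)
            hyper = sum(fading ** rank * score
                        for rank, score in enumerate(ranked, start=1))
            cache[v] = textual[v] + hyper
        return cache[v]

    return {v: S(v) for v in textual}

children = {"v": ["w1", "w2"], "w1": [], "w2": []}
textual = {"v": 0.2, "w1": 0.8, "w2": 0.4}
info = overall_information(children, textual)
```

For v this gives 0.2 + 0.5 · 0.8 + 0.25 · 0.4 = 0.7, so each additional child contributes less than the one ranked before it.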


HITS - Kleinberg’s Algorithm

• HITS: Hypertext Induced Topic Selection

• For each vertex v ∈ V in a subgraph of interest:
– a(v): the authority of v
– h(v): the hubness of v

• A site is very authoritative if it receives many citations. Citations from important sites weigh more than citations from less important sites

• Hubness shows the importance of a site as a hub. A good hub is a site that links to many authoritative sites


Authority and Hubness

[Figure: nodes 2, 3, and 4 link to node 1; node 1 links to nodes 5, 6, and 7]

a(1) = h(2) + h(3) + h(4)
h(1) = a(5) + a(6) + a(7)


Authority and Hubness Convergence

• Recursive dependency:

$$a(v) \leftarrow \sum_{w \in pa[v]} h(w)$$

$$h(v) \leftarrow \sum_{w \in ch[v]} a(w)$$

• Using linear algebra, we can prove that a(v) and h(v) converge
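A sketch of the argument, assuming A denotes the adjacency matrix of the subgraph with $A_{vw} = 1$ when $(v, w) \in E$ (the vector notation is introduced here for illustration). The two updates read, in vector form,

$$a \leftarrow A^{T} h, \qquad h \leftarrow A\, a,$$

so applying both updates in turn gives

$$a \leftarrow A^{T} A\, a, \qquad h \leftarrow A A^{T} h .$$

With normalization after each step, this is power iteration on the symmetric positive semidefinite matrices $A^{T}A$ and $AA^{T}$. Provided the largest eigenvalue is strictly greater than the others and the starting vector is not orthogonal to the corresponding eigenvector, a and h converge to the principal eigenvectors of $A^{T}A$ and $AA^{T}$ respectively.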


HITS Example

Find a base subgraph:

• Start with a root set R = {1, 2, 3, 4} of nodes relevant to the topic

• Expand the root set R to include all the children and a fixed number of parents of the nodes in R

• The result is a new set S (the base subgraph)


HITS Example

BaseSubgraph(R, d)
S ← R
for each v in R
    do S ← S ∪ ch[v]
       P ← pa[v]
       if |P| > d
           then P ← arbitrary subset of P having size d
       S ← S ∪ P
return S
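The pseudocode above can be sketched in Python as follows. The dictionaries `children` and `parents` (node to set of neighbours) and the `rng` parameter are assumptions of this example; the "arbitrary subset" is drawn at random here.

```python
import random

def base_subgraph(root, children, parents, d, rng=random):
    """Expand root set R with all children and at most d parents of each node."""
    s = set(root)
    for v in root:
        s |= children.get(v, set())              # add every child of v
        p = set(parents.get(v, set()))
        if len(p) > d:                           # keep only d arbitrary parents
            p = set(rng.sample(sorted(p), d))
        s |= p
    return s

# Hypothetical link structure around the root set {1, 2, 3, 4}.
children = {1: {5}, 2: {5, 6}, 3: set(), 4: {7}}
parents = {1: {8, 9, 10}, 2: {8}, 3: set(), 4: {11}}
S = base_subgraph({1, 2, 3, 4}, children, parents, d=2, rng=random.Random(0))
```

Capping the number of parents keeps very popular pages, which may have an enormous number of inlinks, from blowing up the size of the base subgraph.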


HITS Example

Hubs and authorities: two n-dimensional vectors a and h

HubsAuthorities(G)
1 ← [1, …, 1] ∈ R^{|V|}
a_0 ← h_0 ← 1
t ← 1
repeat
    for each v in V
        do a_t(v) ← Σ_{w ∈ pa[v]} h_{t-1}(w)
           h_t(v) ← Σ_{w ∈ ch[v]} a_{t-1}(w)
    a_t ← a_t / ||a_t||
    h_t ← h_t / ||h_t||
    t ← t + 1
until ||a_t − a_{t-1}|| + ||h_t − h_{t-1}|| < ε
return (a_t, h_t)
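A runnable sketch of the iteration, with authority summed over parents and hubness summed over children as in the recursive dependency stated earlier. The function signature, the Euclidean norm, the `max_iter` safeguard, and the example graph are assumptions of this illustration.

```python
import math

def hits(nodes, edges, eps=1e-10, max_iter=1000):
    """Return (authority, hubness) dicts for a directed graph given as (u, w) edges."""
    parents = {v: [] for v in nodes}
    children = {v: [] for v in nodes}
    for u, w in edges:
        children[u].append(w)
        parents[w].append(u)

    def normalize(x):
        norm = math.sqrt(sum(val * val for val in x.values())) or 1.0
        return {v: val / norm for v, val in x.items()}

    a = normalize({v: 1.0 for v in nodes})
    h = dict(a)
    for _ in range(max_iter):
        # Both updates use the values from the previous iteration.
        new_a = normalize({v: sum(h[w] for w in parents[v]) for v in nodes})
        new_h = normalize({v: sum(a[w] for w in children[v]) for v in nodes})
        change = (math.sqrt(sum((new_a[v] - a[v]) ** 2 for v in nodes))
                  + math.sqrt(sum((new_h[v] - h[v]) ** 2 for v in nodes)))
        a, h = new_a, new_h
        if change < eps:
            break
    return a, h

# Hubs 1 and 2 link to both 3 and 4; the weaker hub 5 links only to 3.
nodes = [1, 2, 3, 4, 5]
edges = [(1, 3), (1, 4), (2, 3), (2, 4), (5, 3)]
a, h = hits(nodes, edges)
```

In this example page 3 ends up as the strongest authority, and hubs 1 and 2 outrank hub 5 because they point to both authorities. On graphs whose dominant eigenvalue is not unique, the result can depend on the starting vector.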


HITS Example Results

[Figure: authority and hubness weights for nodes 1 to 15]


HITS Improvements

Bharat and Henzinger (1998)

• HITS problems
1) A document can contain many identical links to the same document on another host
2) Links are generated automatically (e.g. messages posted on newsgroups)

• Solutions
1) Assign weights to identical multiple edges that are inversely proportional to their multiplicity
2) Prune irrelevant nodes, or regulate the influence of a node with a relevance weight


PageRank

• Introduced by Page et al. (1998)
– The weight is assigned by the rank of parents

• Difference from HITS
– HITS uses hubness and authority weights
– The PageRank of a page is proportional to its parents’ rank, but inversely proportional to its parents’ outdegree


Matrix Notation

Adjacency Matrix

A =

* http://www.kusatro.kyoto-u.com


Matrix Notation

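• Matrix Notation

$$r = \alpha B r = M r$$

– r: an eigenvector of B
– α: a constant tied to the associated eigenvalue; since B r = (1/α) r, the eigenvalue is 1/α

• Recall the eigenvalue problem for a matrix A:

$$A x = \lambda x \quad\Longleftrightarrow\quad (A - \lambda I)\, x = 0$$

B =

• Finding PageRank means finding the eigenvector of B associated with this eigenvalue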


Matrix Notation

$$B = P D P^{-1}$$

– D: diagonal matrix of the eigenvalues {λ1, …, λn}
– P: invertible matrix whose columns are the eigenvectors

• PageRank: the eigenvector (column of P) associated with the largest eigenvalue

• PageRank r1 = the normalized dominant eigenvector


Matrix Notation

• Confirming the result
– A page's rank reflects the number of inlinks from highly ranked pages
– The relative ranks of pages 5 and 2, and of pages 6 and 7, are hard to explain

• Interesting topic
– How do you get your homepage ranked highly?


Markov Chain Notation

• Random surfer model
– Description of a random walk through the Web graph
– M is interpreted as a transition matrix; the rank of a page is the asymptotic probability that the surfer is currently browsing that page

$$r_t = M r_{t-1}$$

– M: transition matrix of a first-order Markov chain (stochastic)

• Does it converge to some sensible solution (as t → ∞) regardless of the initial ranks?


Problem

• “Rank Sink” Problem
– In general, many Web pages have no inlinks or no outlinks
– This results in dangling edges in the graph

E.g.

– No parent → rank 0: M^t converges to a matrix whose last column is all zero
– No children → no solution: M^t converges to the zero matrix
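Both effects can be seen on a three-page chain. This is a minimal sketch, assuming the iteration r_t = M r_{t-1} with M[i][j] = 1/outdegree(j) when page j links to page i; the helper names and the example graph 0 → 1 → 2 are assumptions of this illustration.

```python
def transition_matrix(n, edges):
    """M[i][j] = 1/outdegree(j) if page j links to page i, else 0."""
    outdeg = [0] * n
    for j, _ in edges:
        outdeg[j] += 1
    M = [[0.0] * n for _ in range(n)]
    for j, i in edges:
        M[i][j] += 1.0 / outdeg[j]
    return M

def step(M, r):
    """One iteration r_t = M r_{t-1}."""
    n = len(r)
    return [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]

# Page 0 has no parent and page 2 has no children.
M = transition_matrix(3, [(0, 1), (1, 2)])
r = [1 / 3] * 3
history = [r]
for _ in range(3):
    r = step(M, r)
    history.append(r)
```

Page 0 drops to rank 0 after one step because nothing links to it, and after three steps all rank has drained out through the childless page 2, leaving the zero vector.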


Modification

• The surfer will restart browsing by picking a new Web page at random

M = (B + E)

– E: escape matrix
– M: stochastic matrix

• Still a problem?
– It is not guaranteed that M is primitive
– If M is stochastic and primitive, PageRank converges to the corresponding stationary distribution of M


PageRank Algorithm

* Page et al, 1998


Distribution of the Mixture Model

• The probability distribution that results from combining the Markovian random walk distribution and the static rank source distribution

$$r = \varepsilon e + (1 - \varepsilon) x$$

– ε: probability of selecting a non-linked page
– r: PageRank

• Now the transition matrix [εH + (1 − ε)M] is primitive and stochastic

• r_t converges to the dominant eigenvector
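A sketch of power iteration with the mixed transition matrix εH + (1 − ε)M. Several choices here are assumptions of this example rather than details given on the slides: H is taken to be the uniform matrix with entries 1/n, pages without outlinks are treated as linking to every page so that M is stochastic, and ε = 0.15.

```python
def pagerank(n, edges, eps=0.15, tol=1e-12, max_iter=1000):
    """Power iteration on eps*H + (1 - eps)*M for pages 0..n-1."""
    out = [[] for _ in range(n)]
    for j, i in edges:
        out[j].append(i)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        new = [eps / n] * n                    # eps*H*r, since the ranks sum to 1
        for j in range(n):
            if out[j]:
                share = (1 - eps) * r[j] / len(out[j])
                for i in out[j]:               # split rank among the children
                    new[i] += share
            else:                              # assumption: dangling page jumps uniformly
                for i in range(n):
                    new[i] += (1 - eps) * r[j] / n
        delta = sum(abs(x - y) for x, y in zip(new, r))
        r = new
        if delta < tol:
            break
    return r

# Pages 0, 1, 2 form a cycle; page 3 has no parent and links to page 2.
ranks = pagerank(4, [(0, 1), (1, 2), (2, 0), (3, 2)])
```

Page 3 no longer collapses to rank 0: with no inlinks it keeps the baseline ε/n, while page 2, which collects rank from both page 1 and page 3, comes out on top.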


Stability

• Are link analysis algorithms based on eigenvectors stable, in the sense that their results don’t change significantly when the graph is perturbed?

• Suppose the connectivity of a portion of the graph is changed arbitrarily
– How will this affect the results of the algorithms?


Stability of HITS

Ng et al. (2001)

• A bound on the number of hyperlinks k that can be added to or deleted from one page without affecting the authority or hubness weights
– δ: the eigengap λ1 – λ2
– d: the maximum outdegree of G

• It is possible to perturb a symmetric matrix by a quantity that grows as δ and that produces a constant perturbation of the dominant eigenvector


Stability of PageRank

• The parameter ε of the mixture model has a stabilizing role

• If the set of pages affected by the perturbation has a small rank, the overall change will also be small

Ng et al. (2001)
– V: the set of vertices touched by the perturbation

A tighter bound by Bianchini et al. (2001)
– δ(j) >= 2 depends on the edges incident on j


SALSA

• SALSA (Lempel and Moran 2001)
– Probabilistic extension of the HITS algorithm
– The random walk is carried out by following hyperlinks both in the forward and in the backward direction

• Two separate random walks
– Hub walk
– Authority walk


Forming a Bipartite Graph in SALSA


Random Walks

• Hub walk
– Follow a Web link from a hub page u_h to an authority page w_a (a forward link), and then
– Immediately traverse a backlink from w_a to a hub page v_h, where (u, w) ∈ E and (v, w) ∈ E

• Authority walk
– Follow a Web link backward from an authority page w_a to a hub page u_h (a backward link), and then
– Immediately traverse a forward link from u_h to an authority page z_a, where (u, w) ∈ E and (u, z) ∈ E


Computing Weights

• Hub weights are computed from sums of products of the inverse degrees of the in-links and the out-links
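One way to read this is as the hub random walk: from hub u, pick one of its outlinks to an authority w with probability 1/outdegree(u), then one of w's inlinks back to a hub v with probability 1/indegree(w). The sketch below builds those transition probabilities and iterates them to a stationary distribution. The function name, the uniform starting distribution, the fixed iteration count, and the assumption that all hubs are connected through shared authorities are choices made for this illustration.

```python
from collections import defaultdict

def salsa_hub_weights(edges, iterations=200):
    """Approximate stationary distribution of the hub walk for (hub, authority) edges."""
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for u, w in edges:
        out_links[u].add(w)
        in_links[w].add(u)
    hubs = sorted(out_links)
    # P[u][v]: probability of moving from hub u to hub v in one forward+backward step
    P = {u: defaultdict(float) for u in hubs}
    for u in hubs:
        for w in out_links[u]:
            for v in in_links[w]:
                P[u][v] += 1.0 / len(out_links[u]) / len(in_links[w])
    weights = {u: 1.0 / len(hubs) for u in hubs}
    for _ in range(iterations):
        nxt = dict.fromkeys(hubs, 0.0)
        for u in hubs:
            for v, p in P[u].items():
                nxt[v] += weights[u] * p
        weights = nxt
    return weights

# Hubs 1 and 2 link to authorities 3 and 4; hub 5 links only to 3.
hub_w = salsa_hub_weights([(1, 3), (1, 4), (2, 3), (2, 4), (5, 3)])
```

On this connected example the hub weights come out proportional to out-degree (0.4, 0.4, 0.2), so a single densely linked hub cannot dominate the way it can under mutual reinforcement.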


Why We Care

• Lempel and Moran (2001) showed theoretically that SALSA weights are more robust than HITS weights in the presence of the Tightly Knit Community (TKC) effect
– This effect occurs when a small collection of pages (related to a given topic) is connected so that every hub links to every authority; it includes the mutual reinforcement effect as a special case

• The pages in a community connected in this way can be ranked highly by HITS, higher than pages in a much larger collection where only some hubs link to some authorities

• TKC could be exploited by spammers hoping to increase their page weight (e.g. link farms)


A Similar Approach

• Rafiei and Mendelzon (2000) and Ng et al. (2001) propose similar approaches using reset as in PageRank

• Unlike PageRank, in this model the surfer will follow a forward link on odd steps but a backward link on even steps

• The stability properties of these ranking distributions are similar to those of PageRank (Ng et al. 2001)


Overcoming TKC

• Similarity downweight sequencing and sequential clustering (Roberts and Rosenthal 2003)
– Consider the underlying structure of clusters
– Suggest downweight sequencing to avoid the Tightly Knit Community problem
– Results indicate the approach is effective for the few queries tested, but it is still untested on a large scale


PHITS and More

• PHITS: Cohn and Chang (2000)
– Only the principal eigenvector is extracted using SALSA, so the authority along the remaining eigenvectors is completely neglected

• Account for more eigenvectors of the co-citation matrix

• See also Lempel, Moran (2003)


Limits of Link Analysis

• META tags / invisible text
– Search engines relying on meta tags in documents are often misled (intentionally) by web developers

• Pay-for-place
– Search engine bias: organizations pay search engines for page rank
– Advertisements: organizations pay high-ranking pages for advertising space
  • With a primary effect of increased visibility to end users and a secondary effect of increased respectability due to relevance to a high-ranking page


Limits of Link Analysis

• Stability
– Adding even a small number of nodes/edges to the graph has a significant impact

• Topic drift (similar to TKC)
– A top authority may be a hub of pages on a different topic, resulting in increased rank of the authority page

• Content evolution
– Adding/removing links/content can affect the intuitive authority rank of a page, requiring recalculation of page ranks


Further Reading

• R. Lempel and S. Moran, Rank Stability and Rank Similarity of Link-Based Web Ranking Algorithms in Authority Connected Graphs, Submitted to Information Retrieval, special issue on Advances in Mathematics/Formal Methods in Information Retrieval, 2003.

• M. Henzinger, Link Analysis in Web Information Retrieval, Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2000.

• L. Getoor, N. Friedman, D. Koller, and A. Pfeffer. Relational Data Mining, S. Dzeroski and N. Lavrac, Eds., Springer-Verlag, 2001