A Theoretical Justification of Link Prediction Heuristics

23
1 A Theoretical Justification of Link Prediction Heuristics Deepayan Chakrabarti ([email protected]) Purnamrita Sarkar Andrew Moore

description

A Theoretical Justification of Link Prediction Heuristics. Deepayan Chakrabarti ([email protected]). Purnamrita Sarkar. Andrew Moore. Link Prediction. Which pair of nodes {i,j} should be connected?. Alice. Bob. Charlie. Goal: Recommend a movie. Link Prediction. - PowerPoint PPT Presentation

Transcript of A Theoretical Justification of Link Prediction Heuristics

Page 1: A Theoretical Justification of Link Prediction Heuristics

1

A Theoretical Justification of Link Prediction Heuristics

Deepayan Chakrabarti ([email protected])

Purnamrita Sarkar Andrew Moore

Page 2: A Theoretical Justification of Link Prediction Heuristics

Link Prediction

Which pair of nodes {i,j} should be connected?

Alice

Bob

Charlie

Goal: Recommend a movie

2

Page 3: A Theoretical Justification of Link Prediction Heuristics

Link Prediction

Which pair of nodes {i,j} should be connected?

Goal: Suggest friends

3

Page 4: A Theoretical Justification of Link Prediction Heuristics

Link Prediction Heuristics

Predict link between nodes Connected by the shortest path With the most common neighbors (length 2 paths) More weight to low-degree common nbrs

(Adamic/Adar)

3 followers

1000

followers

Prolific common friends

Less evidence

Less prolific

Much more evidence

Alice

Bob

Charlie

Page 5: A Theoretical Justification of Link Prediction Heuristics

Link Prediction Heuristics

Predict link between nodes Connected by the shortest path With the most common neighbors (length 2 paths) More weight to low-degree common nbrs (Adamic/Adar) With more short paths (e.g. length 3 paths )

exponentially decaying weights to longer paths (Katz measure)

Page 6: A Theoretical Justification of Link Prediction Heuristics

Previous Empirical Studies*

Random Shortest Path

Common Neighbors

Adamic/Adar Ensemble of short paths

Link

pre

dict

ion

accu

racy

*

*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007

How do we justify these observations?

Especially if the graph is sparse

6

Page 7: A Theoretical Justification of Link Prediction Heuristics

Link Prediction – Generative Model

Unit volume universe

Model:1. Nodes are uniformly distributed points in a latent space

2. This space has a distance metric

3. Points close to each other are likely to be connected in the graph

Logistic distance function (Raftery+/2002)

7

Page 8: A Theoretical Justification of Link Prediction Heuristics

8

1

½

Higher probability of linking

radius r

α determines the steepness

Link prediction ≈ find nearest neighbor who is not currently linked to the node.

Equivalent to inferring distances in the latent space

Link Prediction – Generative Model

Model:1. Nodes are uniformly distributed points in a latent space

2. This space has a distance metric

3. Points close to each other are likely to be connected in the graph

Page 9: A Theoretical Justification of Link Prediction Heuristics

Previous Empirical Studies*

Random Shortest Path

Common Neighbors

Adamic/Adar Ensemble of short paths

Link

pre

dict

ion

accu

racy

*

*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007

Especially if the graph is sparse

9

Page 10: A Theoretical Justification of Link Prediction Heuristics

Common Neighbors

Pr2(i,j) = Pr(common neighbor|dij)

jkikijjkikjkik2 dd)d|d,d()d|~Pr()d|~Pr(j)(i,Pr Pkjki

Product of two logistic probabilities, integrated over a volume determined by dij

i j

10

As α∞ Logistic Step function

Much easier to analyze!

Page 11: A Theoretical Justification of Link Prediction Heuristics

Common Neighbors

11

Everyone has same radius r

i j

)dr,A(r,j)(i,Pr ij2

# common nbrs gives a bound

on distance

DD

rr/2

ij

/1

ij

V(r)

εη/N12d

V(r)

εη/N12

21εN

η)dr,A(r,ε

N

ηP

η=Number of common

neighbors

V(r)=volume of radius r in

D dims

Unit volume universe

Page 12: A Theoretical Justification of Link Prediction Heuristics

Common Neighbors

OPT = node closest to i MAX = node with max common neighbors with i

Theorem:

w.h.p

Link prediction by common neighbors is asymptotically optimal

dOPT ≤ dMAX ≤ dOPT + 2[ε/V(1)]1/D

12

Page 13: A Theoretical Justification of Link Prediction Heuristics

Common Neighbors: Distinct Radii Node k has radius rk .

ik if dik ≤ rk (Directed graph) rk captures popularity of node k

13

Type 2: i k j

rk rk

A(rk , rk ,dij)

i jk

Type 1: i k j

rirj

A(ri , rj ,dij)

i jk

i

rk

k

j

m

Page 14: A Theoretical Justification of Link Prediction Heuristics

Type 2 common neighbors

i j

kη1 ~ Bin[N1 , A(r1, r1,

dij)]

η2 ~ Bin[N2 , A(r2, r2, dij)]

Example graph:

N1 nodes of radius r1 and N2 nodes of radius r2

r1 << r2

Pick d* to maximize Pr[η1 , η2 | dij]

w(r1) E[η1|d*] + w(r2) E[η2|d*] = w(r1)η1 + w(r2) η2

Weighted common neighbors

Inversely related to d*

Page 15: A Theoretical Justification of Link Prediction Heuristics

Common Neighbors: Distinct Radii Node k has radius rk .

ik if dik ≤ rk (Directed graph) rk captures popularity of node k

“Weighted” common neighbors: Predict (i,j) pairs with highest Σ w(r)η(r)

Weight for nodes of radius r

# common neighbors of radius r

i

rk

k

j

m

15

Page 16: A Theoretical Justification of Link Prediction Heuristics

Type 2 common neighbors

r is close to max radius

D1

deg

const

r

constw(r)

Real world graphs generally fall in this range

i

rk

k

j

Presence of common neighbor is very informative

Absence is very informative

Adamic/Adar

1/r

16

Page 17: A Theoretical Justification of Link Prediction Heuristics

Previous Empirical Studies*

Random Shortest Path

Common Neighbors

Adamic/Adar Ensemble of short paths

Link

pre

dict

ion

accu

racy

*

*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007

Especially if the graph is sparse

17

Page 18: A Theoretical Justification of Link Prediction Heuristics

l hop Paths

Common neighbors = 2 hop paths

Analysis of longer paths: two components1. Bounding E(ηl | dij). [ηl = # l hop paths]

Bounds Prl (i,j) by using triangle inequality on a series of common neighbor probabilities.

2. ηl ≈ E(ηl | dij)

Triangulation

Page 19: A Theoretical Justification of Link Prediction Heuristics

l hop Paths

Common neighbors = 2 hop paths

Analysis of longer paths: two components1. Bounding E(ηl | dij). [ηl = # l hop paths]

Bounds Prl (i,j) by using triangle inequality on a series of common neighbor probabilities.

2. ηl ≈ E(ηl | dij) Bounded dependence of ηl on position of each node

Can use McDiarmid’s inequality to bound

|ηl - E(ηl|dij)|

Page 20: A Theoretical Justification of Link Prediction Heuristics

ℓ-hop Paths

Common neighbors = 2 hop paths

For longer paths:

Bounds are weaker For ℓ’ ≥ ℓ we need ηℓ’ >> ηℓ to obtain similar bounds

justifies the exponentially decaying weight given to longer paths by the Katz measure

δN,,ηg-11)r(rdij

20

Page 21: A Theoretical Justification of Link Prediction Heuristics

Summary

Three key ingredients

1. Closer points are likelier to be linked. Small World Model- Watts, Strogatz, 1998, Kleinberg 2001

2. Triangle inequality holds necessary to extend to ℓ-hop paths

3. Points are spread uniformly at random Otherwise properties will depend on location as well as distance

21

Page 22: A Theoretical Justification of Link Prediction Heuristics

Summary

Random Shortest Path

Common Neighbors

Adamic/Adar Ensemble of short paths

Link

pre

dict

ion

accu

racy

*

*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007

The number of paths matters, not the

length

For large dense graphs, common neighbors are

enough

Differentiating between different degrees is

important

In sparse graphs, paths of length 3 or

more help in prediction.

22

Page 23: A Theoretical Justification of Link Prediction Heuristics

Sweep Estimators

Qr = Fraction of nodes with radius ≤ r which are

common neighbors

TR = Fraction of nodes with radius ≥ R which

are common neighbors

Number of common

neighbors of a given radius r

Large Qr small dijSmall TR large dij