k-Nearest Neighbors in Uncertain Graphs (Michalis Potamias, Francesco Bonchi, Aristides Gionis,...

Post on 05-Aug-2015

k-Nearest Neighbors in Uncertain Graphs

Michalis Potamias Francesco Bonchi

Aristides Gionis George Kollios

Nearest Neighbors in Uncertain Graphs @ VLDB 2010

Thesis

• Many complex networks are modeled as probabilistic (i.e., uncertain) graphs.

• The probabilistic treatment of such graphs leads to better understanding of real data.

Probabilistic Protein-Protein Interaction Networks

Possible interactions between proteins are established through biological experiments that entail uncertainty. The edge probability represents that uncertainty.

[Figure: example probabilistic graph on nodes A, B, C, D with edge probabilities 0.2, 0.4, 0.6, 0.3 and 0.7.]

Source: Asthana et al., Genome Research 2004

• Neighbors of a given node in a standard graph?
  – Nodes close in terms of shortest path distance!

• How do we define neighbors in probabilistic graphs?

• How do we define the distance?
  – Treat them as weighted graphs (N06)
  – Nodes with high reliability (GR04)
  – Most probable path (BI03)
  – …shortest paths? (VLDB10)

• Why is it important to find good neighbors of proteins in PPI networks?
  – Detection of candidate co-complex relationships.
  – Actual co-complex relationships can be established through experiments in the lab.

Outline

• Thesis

• Probabilistic PPI Networks

• Distance Definition

• Sampling Algorithms

• kNN Pruning

• Experiments

Distance Definition

[Figure: the probabilistic graph on nodes A, B, C, D, with edge probabilities 0.2, 0.4, 0.6, 0.3 and 0.7, and its 32 possible worlds (one per subset of the five edges).]

The graph and a world:

[Figure: the probabilistic graph next to one of its possible worlds, in which only the edges (A,B) and (B,D) are present.]

$$\Pr(\text{world}) = p(A,B)\,p(B,D)\,\bigl(1 - p(B,C)\bigr)\,\bigl(1 - p(C,D)\bigr)\,\bigl(1 - p(A,D)\bigr)$$

[Figure: the graph, a world, and the PDF of the shortest path length d(B,D) over all possible worlds.]

d(B,D) = 1: probability 0.30
d(B,D) = 2: probability 0.26
d(B,D) = inf: probability 0.44
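The possible-world semantics can be sketched in a few lines of Python. This is only an illustration: it assumes hop-count distances, and the assignment of the five probabilities to the edges in `EDGE_PROBS` is hypothetical, because the transcript does not preserve which probability belongs to which edge. Under this assumed assignment, exhaustive enumeration gives roughly 0.30, 0.26 and 0.44 for d(B,D) = 1, 2 and inf.

```python
from collections import deque
from itertools import product
from math import inf

# Hypothetical assignment of the five slide probabilities to edges (assumption).
EDGE_PROBS = {
    ("A", "B"): 0.2,
    ("A", "D"): 0.6,
    ("B", "C"): 0.4,
    ("C", "D"): 0.7,
    ("B", "D"): 0.3,
}


def hop_distance(edges, source, target):
    """BFS hop distance in one deterministic world; inf if unreachable."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return dist[u]
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return inf


def distance_pdf(edge_probs, source, target):
    """Exact PDF of the shortest path length, enumerating all 2^m worlds."""
    edges = list(edge_probs)
    pdf = {}
    for present in product([True, False], repeat=len(edges)):
        prob = 1.0
        world = []
        for edge, keep in zip(edges, present):
            p = edge_probs[edge]
            # Present edges contribute p, absent edges contribute (1 - p).
            prob *= p if keep else 1.0 - p
            if keep:
                world.append(edge)
        d = hop_distance(world, source, target)
        pdf[d] = pdf.get(d, 0.0) + prob
    return pdf


pdf = distance_pdf(EDGE_PROBS, "B", "D")
```

The enumeration visits 2^5 = 32 worlds here, which is exactly why it does not scale to real graphs.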

• Use well-known statistics of the shortest path PDF:
  – Median
  – Majority (mode)
  – ExpectedReliable

• Infinity problem: the distance is infinite in every world where the two nodes are disconnected.

• Hard! They require explicit enumeration of the possible worlds, so we resort to sampling.

[Figure: PDF of the shortest path length d(B,D), with probabilities 0.3, 0.26 and 0.44 for lengths 1, 2 and inf. For this PDF, d_med = 2, d_maj = inf and d_exp = 1.46.]
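The three statistics can be checked against the example PDF of d(B,D) with a short Python sketch. The function names are illustrative, and the median is taken here as the smallest distance whose cumulative probability reaches 1/2, which is an assumption about tie handling that matches the value 2 shown on the slide.

```python
from math import inf

# PDF of d(B,D) as read off the slide.
PDF = {1: 0.3, 2: 0.26, inf: 0.44}


def median_distance(pdf):
    """Smallest d whose cumulative probability reaches 1/2 (assumed convention)."""
    total = 0.0
    for d in sorted(pdf):
        total += pdf[d]
        if total >= 0.5:
            return d
    return inf


def majority_distance(pdf):
    """Most probable distance (the mode of the PDF)."""
    return max(pdf, key=pdf.get)


def expected_reliable_distance(pdf):
    """Expected distance over the worlds in which the target is reachable."""
    reliability = sum(p for d, p in pdf.items() if d != inf)
    if reliability == 0:
        return inf
    return sum(d * p for d, p in pdf.items() if d != inf) / reliability
```

ExpectedReliable sidesteps the infinity problem by conditioning on reachability: (1 · 0.3 + 2 · 0.26) / 0.56 ≈ 1.46.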

Sampling Algorithms

1. Sample (a small number of) worlds
2. Compute the sample median (approximation)
3. Output the result

Bounds for each distance:
– Median (Chernoff bound)
– ExpectedReliable (Hoeffding inequality)
– Majority (no bound)
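A minimal Python sketch of the sampling steps, assuming hop-count distances and independent edges. The helper names and the `seed` parameter are illustrative, not taken from the talk, and the sample median is taken as the lower median.

```python
import random
from collections import deque
from math import inf


def sample_world(edge_probs, rng):
    """Keep each edge independently with its probability."""
    return [edge for edge, p in edge_probs.items() if rng.random() < p]


def hop_distance(edges, source, target):
    """BFS hop distance in one sampled world; inf if unreachable."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return dist[u]
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return inf


def sampled_median_distance(edge_probs, source, target, num_worlds, seed=0):
    """Estimate the median distance from num_worlds sampled worlds."""
    rng = random.Random(seed)
    distances = sorted(
        hop_distance(sample_world(edge_probs, rng), source, target)
        for _ in range(num_worlds)
    )
    # Lower median: smallest distance that covers at least half of the samples.
    return distances[(num_worlds - 1) // 2]
```

For example, `sampled_median_distance(probs, "B", "D", num_worlds=200)` would estimate the median of d(B,D) for a dictionary `probs` that maps edge pairs to probabilities.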

BIOMINE: database of biological entities and uncertain interactions from U. Helsinki; 1M nodes, 10M edges

FLICKR: users from flickr.com; edges have been created assuming homophily based on the Jaccard coefficient of Flickr groups; 77K nodes, 20M edges

kNN Pruning

• Query: given a probabilistic graph and a source node, find the set of k nodes closest to the source.

• Naïve algorithm:
  1. Sample worlds
  2. Run Dijkstra traversals and compute a PDF of the shortest-path distance per node
  3. Calculate the median distance to all nodes using the PDFs
  4. Compute the k-NN
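The naïve algorithm can be sketched as follows. This is an illustration under stated assumptions: edges are unweighted, so a BFS stands in for the Dijkstra traversal, and the function names, the `seed` parameter and the alphabetical tie-breaking are hypothetical choices.

```python
import random
from collections import deque
from math import inf


def bfs_distances(edges, source):
    """Hop distances from source to every reachable node of one world."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def naive_knn(edge_probs, source, k, num_worlds, seed=0):
    """k nodes with the smallest sample-median distance from source."""
    rng = random.Random(seed)
    nodes = {n for edge in edge_probs for n in edge} - {source}
    samples = {n: [] for n in nodes}
    for _ in range(num_worlds):
        # Step 1: fully instantiate one world.
        world = [e for e, p in edge_probs.items() if rng.random() < p]
        # Step 2: traverse it and record one distance sample per node.
        dist = bfs_distances(world, source)
        for n in nodes:
            samples[n].append(dist.get(n, inf))
    # Step 3: lower median of the sampled distances of every node.
    medians = {n: sorted(ds)[(num_worlds - 1) // 2] for n, ds in samples.items()}
    # Step 4: rank all nodes and keep the k closest.
    ranked = sorted(nodes, key=lambda n: (medians[n], n))
    return [(n, medians[n]) for n in ranked[:k]]
```

Every world is fully instantiated and every node gets a histogram, which is the cost that pruning tries to avoid.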

[Figure sequence (naive): 1-NN query under the median distance from source node A, with a sample of 5 worlds. The graph has nodes A to G and edge probabilities 0.9, 0.3, 0.4, 0.6, 0.8, 0.5, 0.3 and 0.7. Each sampled world is fully instantiated and traversed, filling a distance histogram (distances 1, 2, 3) for every node B, C, D, E, F, G.]

• Pruning algorithm:
  – Sample worlds on the fly
  – Increase the horizon of each Dijkstra one hop at a time
  – Maintain truncated PDF histograms
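One way to picture the pruning idea is the Python sketch below. It is a simplification under stated assumptions, not the authors' implementation: edges are unweighted, so each Dijkstra is replaced by a BFS advanced one hop at a time, the truncated histograms are reduced to a per-node counter of worlds in which the node lies within the current horizon, and all names are hypothetical. Nodes whose median distance is infinite are never returned, so fewer than k results may come back.

```python
import random
from math import ceil


def pruned_knn(edge_probs, source, k, num_worlds, seed=0):
    """k-NN under the (lower) median distance with lazily sampled worlds."""
    rng = random.Random(seed)
    adj = {}
    for (u, v), p in edge_probs.items():
        adj.setdefault(u, []).append((v, p))
        adj.setdefault(v, []).append((u, p))

    # Traversal state kept in memory for every world.
    visited = [{source} for _ in range(num_worlds)]
    frontier = [[source] for _ in range(num_worlds)]
    reached = {}  # node -> number of worlds where it is within the horizon
    median = {}   # node -> median distance, once it is known
    half = ceil(num_worlds / 2)
    horizon = 0

    while len(median) < k and any(frontier):
        horizon += 1
        for w in range(num_worlds):
            next_frontier = []
            for u in frontier[w]:
                for v, p in adj.get(u, []):
                    # An edge is sampled only when it could discover a new
                    # node, so it is decided at most once per world.
                    if v not in visited[w] and rng.random() < p:
                        visited[w].add(v)
                        next_frontier.append(v)
                        reached[v] = reached.get(v, 0) + 1
            frontier[w] = next_frontier
        # A node's median is the first horizon at which half the worlds reach it.
        for node, count in reached.items():
            if node not in median and count >= half:
                median[node] = horizon

    ranked = sorted(median, key=lambda n: (median[n], n))
    return [(n, median[n]) for n in ranked[:k]]
```

The loop stops as soon as k nodes have a known median, so nodes beyond the current horizon are never visited and the corresponding edges are never sampled.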

[Figure sequence (pruning): the same 1-NN median query from node A with a sample of 5 worlds. The worlds are expanded one hop at a time, and only the histograms of B and C are maintained: B has distance 1, C has distance >1.]

• B has distance 1
• C has distance greater than 1
• D, E, F, G, … were not discovered (d > 1)
• The 1-NN set is complete with B, so there is no need to continue
• Just 2 nodes visited (and 2 histograms maintained)
• Worlds were only partially instantiated
• Same answer as the naive algorithm
• With a small cost: Dijkstra state needs to be maintained in memory for all worlds

BIOMINE: database of biological entities and uncertain interactions from U. Helsinki; 1M nodes, 10M edges

FLICKR: users from flickr.com; edges have been created assuming homophily based on the Jaccard coefficient of Flickr groups; 77K nodes, 20M edges

DBLP: authors from DBLP; probabilities have been assigned based on the number of coauthored papers; 226K nodes, 1.4M edges

For 200 worlds and 5-NN, the speedups were 247x (BIOMINE), 111x (FLICKR) and 269x (DBLP).

Less uncertainty, more pruning

• Boost the probabilities of edges by giving each edge d chances: p becomes 1 - (1 - p)^d
• d = 1: original graph
• Increasing d, p goes to 1

[Figure: the example graph with edge probabilities 0.2, 0.4, 0.6, 0.3, 0.7 boosted to 1 - 0.8^d, 1 - 0.6^d, 1 - 0.4^d, 1 - 0.7^d, 1 - 0.3^d.]
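The boosting rule shown in the figure, where an edge with probability 0.2 becomes 1 - 0.8^d, can be written as a one-line Python function. The name `boost` is illustrative.

```python
def boost(p, d):
    """Probability that an edge appears in at least one of d independent chances."""
    return 1.0 - (1.0 - p) ** d


# Boosted versions of the five example probabilities for d = 3.
boosted = {p: boost(p, 3) for p in (0.2, 0.4, 0.6, 0.3, 0.7)}
```

With d = 1 the graph is unchanged, and as d grows every nonzero probability approaches 1, so the graph becomes less uncertain.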

Experiments

• Dataset
  – Probabilistic PPI network [Krogan et al., Nature 06]
  – Protein co-complex relationships (ground truth) [Mewes et al., Nuc Acids Res 04]

• Experiment
  – Choose a ground truth edge (A,B)
  – Choose a node C s.t. there is no ground truth edge (A,C)
  – Classification task: distinguish between the two types of edges, (A,B) and (A,C)

Conclusion

• Probabilistic graph analysis benefits from possible-world semantics.

– Extended standard graph concepts to probabilistic graphs and designed approximation algorithms to compute them

– Introduced novel pruning algorithms for kNN in probabilistic graphs

– Confirmed the efficacy of our framework on real data.

Future Work

• Enrich model
  – Node probabilities
  – Arbitrary PDFs

• Explore random walks further

Thank you!

?