Reachability Querying: An Independent Permutation Labeling Approach (published in VLDB 2014)

Presenter: WEI, Hao

Graph Reachability Query

Given a directed graph G = (V, E) and two vertices u and v, u is said to reach v if there exists a path from u to v in G.

Any directed graph can be transformed into a DAG by contracting each strongly connected component into a single vertex; the query is trivial if u and v are in the same strongly connected component.

[Figure: example DAG with 12 vertices, v0 to v11]

Query(v1, v8): Reachable

Query(v2, v11): Unreachable
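As a baseline for the definition above, a reachability query can be answered by a plain depth-first search. The sketch below is illustrative only: the edge list `EDGES` is an assumption, since the figure's edges are not recoverable from the transcript, and it was chosen so that the two example queries come out as stated on the slide.

```python
# Hypothetical edge list for a 12-vertex DAG (an assumption; the slide's
# figure is not available in the transcript).
EDGES = [(0, 2), (0, 3), (0, 5), (1, 4), (1, 5), (2, 4), (3, 4), (3, 7),
         (4, 6), (4, 9), (5, 7), (5, 8), (5, 10), (5, 11), (6, 8)]


def build_adjacency(edges):
    """Map each vertex to the list of its direct successors."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, []).append(dst)
    return adj


def reaches(adj, u, v):
    """Return True if there is a path from u to v (iterative DFS)."""
    stack, seen = [u], {u}
    while stack:
        node = stack.pop()
        if node == v:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


ADJ = build_adjacency(EDGES)
print(reaches(ADJ, 1, 8))   # Query(v1, v8)
print(reaches(ADJ, 2, 11))  # Query(v2, v11)
```

In the worst case this search touches every vertex reachable from u, which is the cost the labeling approaches try to avoid.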

The Issue and the Challenge

The 'Big Data' era brings us large graphs with millions of nodes and edges.

web-uk dataset: 133 million nodes, 5 billion edges

DAG of web-uk: 22 million nodes, 38 million edges

Traditional approaches are not applicable.

Related Work

Recent works build an index, label(u), offline for every node u.

Label-Only Approach: answer Query(u, v) using only label(u) and label(v).

Hop Labeling: TF-Label, Hierarchy Label, Distribution Label, …

Transitive Closure Compression: Chain-Cover, Tree-Cover, …

Non-linear index construction time and index size; may generate an unacceptably large index.

Label+G Approach: answer Query(u, v) using label(u) and label(v), with the possibility of accessing G if needed.

Interval labeling: GRIPP, GRAIL, Ferrari, …

Linear index size, but may need to perform DFS.

Main Idea of IP Labeling

Out(u) denotes the set of vertices that u can reach, including u itself.

In(u) denotes the set of vertices that can reach u, including u itself.

u can reach v iff $Out(v) \subseteq Out(u)$ and $In(u) \subseteq In(v)$.

If $Out(v) \not\subseteq Out(u)$ or $In(u) \not\subseteq In(v)$, u cannot reach v.

Both checks are time- and space-consuming on large sets if an exact answer is needed.
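The containment conditions above can be checked directly on a small graph. The following sketch materializes Out(u) and In(u) for every vertex and compares the set-containment test against plain reachability for all pairs. The edge list is a hypothetical 12-vertex DAG (an assumption, not the slide's figure), and the helper names are illustrative.

```python
# Hypothetical 12-vertex DAG (an assumption, not taken from the slides).
EDGES = [(0, 2), (0, 3), (0, 5), (1, 4), (1, 5), (2, 4), (3, 4), (3, 7),
         (4, 6), (4, 9), (5, 7), (5, 8), (5, 10), (5, 11), (6, 8)]
N = 12


def closure(n, edges):
    """Return a list where entry u is the set of vertices reachable from u,
    including u itself."""
    succ = [[] for _ in range(n)]
    for src, dst in edges:
        succ[src].append(dst)
    result = []
    for u in range(n):
        seen, stack = {u}, [u]
        while stack:
            for w in succ[stack.pop()]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        result.append(seen)
    return result


OUT = closure(N, EDGES)
# In(u) is Out(u) on the graph with every edge reversed.
IN = closure(N, [(dst, src) for src, dst in EDGES])


def reachable_by_sets(u, v):
    """The slide's condition: Out(v) is a subset of Out(u) and In(u) of In(v)."""
    return OUT[v] <= OUT[u] and IN[u] <= IN[v]


mismatches = [(u, v) for u in range(N) for v in range(N)
              if reachable_by_sets(u, v) != (v in OUT[u])]
print(mismatches)  # pairs where the two tests disagree
```

Storing the full sets costs up to |V| entries per vertex, which is why the approach replaces them with small sketches.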

Main Idea of IP Labeling

The IP label aims to answer an unreachable query pair (u, v) by detecting $Out(v) \not\subseteq Out(u)$ or $In(u) \not\subseteq In(v)$.

It is based on min-wise independent permutations.

It answers queries with a high-probability guarantee.

It has linear index construction time and index size.

Min-wise Independent Permutation

Given two sets A and B (Out(u), Out(v) or In(v), In(u)) and a random permutation $\pi$, according to the definition of min-wise independent permutation,

$$\Pr(\min\{\pi(A)\} = \min\{\pi(B)\}) = \frac{|A \cap B|}{|A \cup B|}$$
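The min-wise property, Pr(min π(A) = min π(B)) = |A ∩ B| / |A ∪ B|, can be verified exactly on a tiny universe by enumerating every permutation. This is only a sanity check under the assumption of a uniformly random permutation; the sets and the function name are made up for illustration.

```python
from fractions import Fraction
from itertools import permutations


def min_match_probability(universe, a, b):
    """Exact Pr[min pi(A) == min pi(B)] over all permutations of the universe."""
    hits = total = 0
    for perm in permutations(range(len(universe))):
        pi = dict(zip(universe, perm))  # element -> permutation value
        total += 1
        if min(pi[x] for x in a) == min(pi[x] for x in b):
            hits += 1
    return Fraction(hits, total)


universe = ["a", "b", "c", "d", "e"]
A = {"a", "b", "c"}
B = {"b", "c", "d"}

prob = min_match_probability(universe, A, B)
jaccard = Fraction(len(A & B), len(A | B))
print(prob, jaccard)  # prints: 1/2 1/2
```

The two minima coincide exactly when the smallest value over A ∪ B falls on an element of A ∩ B, which is where the ratio comes from.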

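K-min-wise Independent Permutation

We propose to use the top-k smallest numbers instead of the single smallest number to improve the performance.

Let $\min_k\{\pi(A)\}$ be the subset of $\pi(A)$ containing up to the k smallest numbers of $\pi(A)$.

We define an order ($\succeq$) between $\min_k\{\pi(A)\}$ and $\min_k\{\pi(B)\}$, such that $\min_k\{\pi(A)\} \succeq \min_k\{\pi(B)\}$ if every $\pi(b_i) \in \min_k\{\pi(B)\} \setminus \min_k\{\pi(A)\}$ is larger than the largest number in $\min_k\{\pi(A)\}$. We use $\not\succeq$ otherwise.

K-min-wise Independent Permutation

We prove that if $\min_k\{\pi(A)\} \not\succeq \min_k\{\pi(B)\}$ is true, then $B \not\subseteq A$.

Let $|A| = p$, $|A \cup B| = q$ and $|\min_k\{\pi(A)\}| = k_A$. For $k_A \le k$,

$$\Pr(\min_k\{\pi(A)\} \not\succeq \min_k\{\pi(B)\}) = 1 - \binom{p}{k_A} \Big/ \binom{q}{k_A} \quad (\text{for } q \ge p \ge k_A)$$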
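To make the k-min-wise test concrete, the sketch below implements the order check on two labels and computes, by exhaustive enumeration over all permutations of a six-element universe, the exact probability that the check detects that B is not contained in A. It then compares the result with the closed form 1 − C(p, k_A) / C(q, k_A), where q is taken to be |A ∪ B|; that reading of the formula is an assumption here, and the function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations
from math import comb


def min_k(values, k):
    """Up to the k smallest numbers of `values`, as a sorted list."""
    return sorted(values)[:k]


def order_holds(label_a, label_b):
    """True if every number in label_b that is missing from label_a is larger
    than the largest number in label_a."""
    top = max(label_a)
    members = set(label_a)
    return all(x > top for x in label_b if x not in members)


def detection_probability(n, a, b, k):
    """Exact Pr[order does NOT hold] over all permutations of range(n)."""
    hits = total = 0
    for perm in permutations(range(n)):
        total += 1
        label_a = min_k([perm[x] for x in a], k)
        label_b = min_k([perm[x] for x in b], k)
        if not order_holds(label_a, label_b):
            hits += 1
    return Fraction(hits, total)


A = {0, 1, 2, 3}
B = {2, 3, 4, 5}   # B is not a subset of A
k = 2
p, q, k_a = len(A), len(A | B), min(k, len(A))

exact = detection_probability(6, A, B, k)
formula = 1 - Fraction(comb(p, k_a), comb(q, k_a))
print(exact, formula)  # prints: 3/5 3/5
```

The check fails to detect non-containment only when the k_A smallest values over A ∪ B all fall on elements of A, which is the event counted by the ratio of binomial coefficients.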

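Independent Permutation Generation

[Figure: the example DAG, with each vertex vi relabeled by its permutation value π(vi)]

Vertex   v0   v1   v2   v3   v4   v5   v6   v7   v8   v9   v10   v11
π(v)      7   11    8    6    3    0    2    1   10    4     9     5

The permutation is generated by the Knuth Shuffle.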
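For reference, a Knuth (Fisher-Yates) shuffle produces a uniformly random permutation in linear time. The sketch below is a generic version; the function name and the seed are arbitrary choices for illustration, not taken from the talk.

```python
import random


def knuth_shuffle(n, rng):
    """Return a uniformly random permutation of 0..n-1 (Fisher-Yates)."""
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randint(0, i)  # both bounds are inclusive
        perm[i], perm[j] = perm[j], perm[i]
    return perm


# pi[i] is the permutation value assigned to vertex vi.
pi = knuth_shuffle(12, random.Random(2014))
print(pi)
```

Each position is swapped once, so generating the permutation costs O(|V|) time.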

IP Label

The IP label of u consists of two parts:

Lout(u): the min_k set of Out(u), $\min_k\{\pi(Out(u))\}$

Lin(u): the min_k set of In(u), $\min_k\{\pi(In(u))\}$

IP Label

[Figure: the example DAG with permutation values; labels are computed for k = 5]

Vertex   Lout                Lin
v0       {0, 1, 2, 3, 4}     {7}
v1       {0, 1, 2, 3, 4}     {11}
v2       {2, 3, 4, 8, 10}    {7, 8}
v3       {1, 2, 3, 4, 6}     {6, 7}
v4       {2, 3, 4, 10}       {3, 6, 7, 8, 11}
v5       {0, 1, 5, 9, 10}    {0, 7, 11}
v6       {2, 10}             {2, 3, 6, 7, 8}
v7       {1}                 {0, 1, 6, 7, 11}
v8       {10}                {0, 2, 3, 6, 7}
v9       {4}                 {3, 4, 6, 7, 8}
v10      {9}                 {0, 7, 9, 11}
v11      {5}                 {0, 5, 7, 11}
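One way to build such labels in a single pass is to process the vertices in reverse topological order for Lout (and in topological order for Lin), merging the labels of the neighbours and keeping only the k smallest values. The sketch below follows that idea. The edge list is an assumption chosen to be consistent with the label table for k = 5, not the figure itself; the permutation `PI` is read off the singleton labels in the table; the function names are illustrative.

```python
from heapq import nsmallest

# Hypothetical edges (an assumption consistent with the label table).
EDGES = [(0, 2), (0, 3), (0, 5), (1, 4), (1, 5), (2, 4), (3, 4), (3, 7),
         (4, 6), (4, 9), (5, 7), (5, 8), (5, 10), (5, 11), (6, 8)]
PI = [7, 11, 8, 6, 3, 0, 2, 1, 10, 4, 9, 5]  # PI[i] is the value of vertex vi
K = 5


def topological_order(n, edges):
    """Kahn's algorithm; returns the vertices in topological order."""
    succ = [[] for _ in range(n)]
    indeg = [0] * n
    for src, dst in edges:
        succ[src].append(dst)
        indeg[dst] += 1
    order = [v for v in range(n) if indeg[v] == 0]
    for v in order:  # the list grows while it is being scanned
        for w in succ[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                order.append(w)
    return order


def build_labels(n, edges, pi, k):
    """Return (l_out, l_in): sorted lists of up to k permutation values."""
    succ = [[] for _ in range(n)]
    pred = [[] for _ in range(n)]
    for src, dst in edges:
        succ[src].append(dst)
        pred[dst].append(src)
    order = topological_order(n, edges)
    l_out, l_in = [None] * n, [None] * n
    for v in reversed(order):  # successors are labeled before v
        merged = {pi[v]}.union(*(l_out[w] for w in succ[v]))
        l_out[v] = nsmallest(k, merged)
    for v in order:            # predecessors are labeled before v
        merged = {pi[v]}.union(*(l_in[w] for w in pred[v]))
        l_in[v] = nsmallest(k, merged)
    return l_out, l_in


L_OUT, L_IN = build_labels(12, EDGES, PI, K)
print(L_OUT[2], L_IN[4])  # prints: [2, 3, 4, 8, 10] [3, 6, 7, 8, 11]
```

Merging truncated labels is safe because the k smallest values of a union always lie among the k smallest values of each part, so each edge contributes at most k values to a merge.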

IP Label

Q1: Query(v2, v7)

Lout(v2) = {2, 3, 4, 8, 10}, Lout(v7) = {1}

$1 \notin Lout(v2)$, $1 \in Lout(v7)$, and 1 is smaller than the largest number in Lout(v2).

So $Lout(v2) \not\succeq Lout(v7) \Rightarrow Out(v7) \not\subseteq Out(v2)$, and v2 cannot reach v7.
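The label-only test used in this example can be written as a small predicate over the four labels involved. The sketch below applies it to Query(v2, v7) and, for contrast, to Query(v1, v3), using the label values from the table for k = 5; the function names are illustrative.

```python
def order_holds(label_a, label_b):
    """True if every number in label_b that is missing from label_a is larger
    than the largest number in label_a."""
    top = max(label_a)
    members = set(label_a)
    return all(x > top for x in label_b if x not in members)


def labels_reject(lout_u, lin_u, lout_v, lin_v):
    """True if the labels alone prove that u cannot reach v."""
    out_ok = order_holds(lout_u, lout_v)  # consistent with Out(v) within Out(u)
    in_ok = order_holds(lin_v, lin_u)     # consistent with In(u) within In(v)
    return not (out_ok and in_ok)


# Query(v2, v7): value 1 is in Lout(v7), not in Lout(v2), and below 10.
q1 = labels_reject([2, 3, 4, 8, 10], [7, 8], [1], [0, 1, 6, 7, 11])
# Query(v1, v3): neither test fires, so the labels alone are inconclusive.
q2 = labels_reject([0, 1, 2, 3, 4], [11], [1, 2, 3, 4, 6], [6, 7])
print(q1, q2)  # prints: True False
```

A `True` result is a definite "unreachable"; a `False` result only means the labels could not decide and the graph has to be consulted.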

IP Label

Q2: Query(v1, v3)

$Lout(v1) \succeq Lout(v3)$ and $Lin(v3) \succeq Lin(v1)$, so the labels alone cannot rule out reachability.

$\Pr(Lout(v1) \not\succeq Lout(v3)) = \frac{1}{2}$, $\Pr(Lin(v3) \not\succeq Lin(v1)) = \frac{2}{3}$

Let $|A| = p$, $|A \cup B| = q$ and $|\min_k\{\pi(A)\}| = k_A$. For $k_A < k$ (so $k_A = p$),

$$\Pr(\min_k\{\pi(A)\} \not\succeq \min_k\{\pi(B)\}) = 1 - 1 \Big/ \binom{q}{p}$$

For the Lin test, A = In(v3) and B = In(v1), so p = 2 and q = 3, which gives 1 - 1/3 = 2/3.

IP Label

Q4: Query(v1, v3)

DFS from v1 visits its descendants v4 and v5, and both are pruned by their labels:

$Lout(v4) \not\succeq Lout(v3)$, $Lin(v3) \not\succeq Lin(v4)$

$Lout(v5) \not\succeq Lout(v3)$, $Lin(v3) \not\succeq Lin(v5)$

Probability of detecting that the vertex cannot reach v3:

Vertex   Lout test   Lin test
v1       1/2         2/3
v4       14/15       9/10
v5       125/126     5/6

The probability increases significantly!

IP Label

Assume DFS is needed even though u cannot reach v. Consider a vertex w, a descendant of u, that is visited by the DFS towards v. Then the following hold:

$\Pr(Lout(u) \not\succeq Lout(v)) < \Pr(Lout(w) \not\succeq Lout(v))$

$\Pr(Lin(v) \not\succeq Lin(u)) < \Pr(Lin(v) \not\succeq Lin(w))$

As the DFS goes deeper, it becomes much more likely to answer the unreachability query, so the search can stop at an early stage.
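The interplay between the label test and the DFS can be sketched as follows: the labels are checked at every vertex the search visits, and a subtree is abandoned as soon as the labels prove that the current vertex cannot reach the target. The labels below are the k = 5 values from the table; the edge list is an assumption consistent with those labels, and the function names are illustrative rather than the authors' implementation.

```python
# Hypothetical edges (an assumption consistent with the label table).
EDGES = [(0, 2), (0, 3), (0, 5), (1, 4), (1, 5), (2, 4), (3, 4), (3, 7),
         (4, 6), (4, 9), (5, 7), (5, 8), (5, 10), (5, 11), (6, 8)]
L_OUT = {0: [0, 1, 2, 3, 4], 1: [0, 1, 2, 3, 4], 2: [2, 3, 4, 8, 10],
         3: [1, 2, 3, 4, 6], 4: [2, 3, 4, 10], 5: [0, 1, 5, 9, 10],
         6: [2, 10], 7: [1], 8: [10], 9: [4], 10: [9], 11: [5]}
L_IN = {0: [7], 1: [11], 2: [7, 8], 3: [6, 7], 4: [3, 6, 7, 8, 11],
        5: [0, 7, 11], 6: [2, 3, 6, 7, 8], 7: [0, 1, 6, 7, 11],
        8: [0, 2, 3, 6, 7], 9: [3, 4, 6, 7, 8], 10: [0, 7, 9, 11],
        11: [0, 5, 7, 11]}

SUCC = {}
for src, dst in EDGES:
    SUCC.setdefault(src, []).append(dst)


def order_holds(label_a, label_b):
    """True if every number in label_b that is missing from label_a is larger
    than the largest number in label_a."""
    top = max(label_a)
    members = set(label_a)
    return all(x > top for x in label_b if x not in members)


def labels_reject(w, v):
    """True if the labels prove that w cannot reach v."""
    return not (order_holds(L_OUT[w], L_OUT[v])
                and order_holds(L_IN[v], L_IN[w]))


def query(u, v):
    """Return (reachable, number of vertices the DFS had to visit)."""
    visited = set()

    def dfs(w):
        if w == v:
            return True
        visited.add(w)
        if labels_reject(w, v):  # prune: nothing below w can lead to v
            return False
        return any(x not in visited and dfs(x) for x in SUCC.get(w, ()))

    return dfs(u), len(visited)


print(query(2, 7))  # prints: (False, 1)
print(query(1, 3))  # prints: (False, 3)
```

With these assumed edges, Query(v1, v3) is settled after looking at v1, v4 and v5 only, whereas an unpruned DFS from v1 would walk through all nine vertices reachable from it.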

Two Optimizations

Huge-Vertex Label: build an additional index to handle the huge vertices of the graph

Level Label: use the topological structure to prune the search space

Performance Studies

Real Datasets:

Dataset      |V(G)|   |E(G)|   davg    R-ratio
uniprotenc   25M      25M      0.999   1.30E-7
twitter      18M      18M      1.013   7.39E-2
web-uk       22M      38M      1.678   1.50E-1
citeseerx    6.5M     15M      2.295   4.07E-4
go-uniprot   6.9M     34M      4.990   3.64E-6
govwild      8.0M     23M      2.948   7.20E-5

Performance Studies

Index Construction Time (in seconds)

Dataset      TF-Label   DL       GRAIL    Ferrari   IP+
uniprotenc   58.529     22.280   58.242   24.292    18.96
twitter      15.291     13.719   32.323   19.972    12.44
web-uk       ---        24.240   44.031   26.927    17.46
citeseerx    91.877     12.045   23.170   19.792    7.54
go-uniprot   38.668     18.277   44.557   40.365    9.68
govwild      30.520     18.584   29.237   19.924    8.45

Performance Studies

Query Time (in milliseconds)

Dataset      TF-Label   DL        GRAIL     Ferrari   IP+
uniprotenc   119.164    119.618   820.249   116.351   54.205
twitter      102.923    104.698   ---       82.212    79.285
web-uk       ---        146.429   ---       214.857   253.082
citeseerx    230.318    111.329   28774     131.534   101.444
go-uniprot   55.279     153.214   499.505   313.300   34.577
govwild      254.785    128.199   719.494   295.432   112.990

Performance Studies

[Figure: distribution of the number of vertices visited]

Conclusion

We propose a new IP labeling approach, the first to exploit randomness to answer reachability queries.

Our new labeling approach has linear index construction time and index size. With independent permutations, the query performance is guaranteed with high probability.

We evaluate the proposed approach through extensive experimental studies, in which it shows both good efficiency and good scalability.
