Reachability Querying: An Independent Permutation Labeling Approach (published in VLDB 2014)...
Reachability Querying: An Independent Permutation Labeling Approach (published in VLDB 2014)
Presenter: WEI, Hao
Graph Reachability Query
Given a directed graph G = (V, E) and two vertices u and v, u is said to reach v if there exists a path from u to v in G.
Any directed graph can be easily transformed into a DAG by collapsing each strongly connected component into a single vertex; the query is trivial if u and v are in the same strongly connected component.
[Figure: example DAG with 12 vertices, v0 to v11]
Query(v1, v8): Reachable
Query(v2, v11): Unreachable
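The definition above can be sketched as a plain depth-first search. This is only an illustration: the graph `toy_dag` is a hypothetical five-vertex DAG (the edges of the slide's example graph are not recoverable from the transcript), and `reaches` is a name chosen here.

```python
from typing import Dict, List

def reaches(graph: Dict[int, List[int]], u: int, v: int) -> bool:
    """Return True if there is a directed path from u to v (iterative DFS)."""
    stack = [u]
    seen = {u}
    while stack:
        x = stack.pop()
        if x == v:
            return True
        for y in graph.get(x, []):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False

# Hypothetical DAG for illustration, not the graph from the slide.
toy_dag = {0: [1, 2], 1: [3], 2: [3], 3: [], 4: [2]}

print(reaches(toy_dag, 0, 3))  # 0 -> 1 -> 3 is a path
print(reaches(toy_dag, 3, 0))  # no edge leaves vertex 3
```

Every query costs a traversal in the worst case, which is what index-based approaches try to avoid on large graphs.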
The Issue and the Challenge
The ‘Big Data’ era brings us large graphs with millions of nodes and edges.
web-uk dataset: 133 million nodes, 5 billion edges
DAG of web-uk: 22 million nodes, 38 million edges
Traditional approaches are not applicable.
Related Work
Recent works build an index, label(u), offline for every node u.
Label-Only Approach: answer Query(u, v) using label(u) and label(v) only
Hop Labeling: TF-Label, Hierarchy Label, Distribution Label, …
Transitive Closure Compression: Chain-Cover, Tree-Cover, …
Non-linear index construction time and index size; may generate an unacceptably large index
Label+G Approach: answer Query(u, v) using label(u) and label(v), with the possibility of accessing G if needed
Interval labeling: GRIPP, GRAIL, Ferrari, …
Linear index size, but may need to perform DFS
Main Idea of IP Labeling
Out(u) denotes the set of vertices that u can reach, including u itself.
In(u) denotes the set of vertices that can reach u, including u itself.
u can reach v iff $Out(v) \subseteq Out(u)$ and $In(u) \subseteq In(v)$.
If $Out(v) \not\subseteq Out(u)$ or $In(u) \not\subseteq In(v)$, u cannot reach v.
Both tests are time- and space-consuming on large sets if an exact answer is needed.
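A small sketch of the set-containment view of reachability, assuming a hypothetical five-vertex DAG (`toy_dag`) and helper names invented for this example. It materializes Out(u) and In(u) exactly, which is precisely the cost that becomes prohibitive on large graphs.

```python
from typing import Dict, List, Set

# Hypothetical DAG for illustration; every vertex appears as a key.
toy_dag: Dict[int, List[int]] = {0: [1, 2], 1: [3], 2: [3], 3: [], 4: [2]}

def reverse_graph(succ: Dict[int, List[int]]) -> Dict[int, List[int]]:
    """Build the predecessor lists of a graph given as successor lists."""
    pred: Dict[int, List[int]] = {u: [] for u in succ}
    for u, targets in succ.items():
        for v in targets:
            pred[v].append(u)
    return pred

def closure(adj: Dict[int, List[int]], start: int) -> Set[int]:
    """All vertices reachable from start along adj, including start itself."""
    seen = {start}
    stack = [start]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen

toy_rev = reverse_graph(toy_dag)

def out_set(u: int) -> Set[int]:
    return closure(toy_dag, u)   # Out(u)

def in_set(u: int) -> Set[int]:
    return closure(toy_rev, u)   # In(u)

def reaches_by_sets(u: int, v: int) -> bool:
    """u reaches v iff Out(v) is a subset of Out(u) and In(u) is a subset of In(v)."""
    return out_set(v) <= out_set(u) and in_set(u) <= in_set(v)
```

On this toy graph, `reaches_by_sets(u, v)` agrees with the direct test `v in out_set(u)` for every pair of vertices.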
Main Idea of IP Labeling
The IP label aims to answer an unreachable query pair (u, v) by detecting $Out(v) \not\subseteq Out(u)$ or $In(u) \not\subseteq In(v)$:
based on min-wise independent permutations
high-probability guarantee of answering a query
linear index construction time and index size
Min-wise Independent Permutation
Given two sets A and B (Out(u) and Out(v), or In(v) and In(u)) and a random permutation $\pi$, according to the definition of min-wise independent permutation,
$$\Pr\big(\min\{\pi(A)\} = \min\{\pi(B)\}\big) = \frac{|A \cap B|}{|A \cup B|}$$
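The standard min-wise property states that, under a uniformly random permutation, the minima of two sets coincide with probability |A ∩ B| / |A ∪ B|. The sketch below checks this exactly on a tiny universe by enumerating every permutation; the sets and the function name are chosen here for illustration only.

```python
from fractions import Fraction
from itertools import permutations

def min_match_probability(a, b, universe):
    """Exact Pr(min pi(A) == min pi(B)) over all permutations of the universe."""
    hits = total = 0
    for perm in permutations(range(len(universe))):
        pi = dict(zip(universe, perm))  # element -> its number under this permutation
        total += 1
        if min(pi[x] for x in a) == min(pi[x] for x in b):
            hits += 1
    return Fraction(hits, total)

# Illustrative sets: |A ∩ B| = 2 and |A ∪ B| = 5; element 5 belongs to neither set.
A = {0, 1, 2}
B = {1, 2, 3, 4}
U = [0, 1, 2, 3, 4, 5]

jaccard = Fraction(len(A & B), len(A | B))
print(min_match_probability(A, B, U), jaccard)
```

The two minima agree exactly when the smallest number over A ∪ B falls on an element of A ∩ B, which is why the ratio appears.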
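K-min-wise Independent Permutation
We propose to use the k smallest numbers instead of only the single smallest number to improve the performance.
$\min_k\{\pi(A)\}$ is the subset of $\pi(A)$ containing up to the k smallest numbers of $\pi(A)$.
We define an order ($\succeq$) between $\min_k\{\pi(A)\}$ and $\min_k\{\pi(B)\}$, such that $\min_k\{\pi(A)\} \succeq \min_k\{\pi(B)\}$ if every $\pi(b_i) \in \min_k\{\pi(B)\} \setminus \min_k\{\pi(A)\}$ is larger than the largest number in $\min_k\{\pi(A)\}$. We use $\min_k\{\pi(A)\} \not\succeq \min_k\{\pi(B)\}$ otherwise.
K-min-wise Independent Permutation
We prove that if $\min_k\{\pi(A)\} \not\succeq \min_k\{\pi(B)\}$ is true, then $B \not\subseteq A$.
Let $|A| = p$, $|A \cup B| = q$ and $|\min_k\{\pi(A)\}| = k_A$ for $k_A \le k$. Then
$$\Pr\big(\min_k\{\pi(A)\} \not\succeq \min_k\{\pi(B)\}\big) = 1 - \binom{p}{k_A} \Big/ \binom{q}{k_A} \quad (\text{for } q \ge p \ge k_A)$$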
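The formula on this slide did not survive extraction, so the sketch below rests on two assumptions: (1) non-containment is detected when the k-min label of B holds a number that is absent from the k-min label of A yet smaller than the largest number in A's label, and (2) the detection probability has the closed form 1 - C(p, k_A) / C(q, k_A) with p = |A|, q = |A ∪ B| and k_A = min(k, p). Under these assumptions the closed form reproduces the probabilities quoted in the worked queries later in the deck. All function names are invented here, and the brute-force routine checks the closed form on small sets.

```python
from fractions import Fraction
from itertools import permutations
from math import comb

def min_k(numbers, k):
    """Up to the k smallest numbers, in increasing order."""
    return sorted(numbers)[:k]

def detects(label_a, label_b):
    """True if label_b holds a number absent from label_a yet below max(label_a)."""
    largest = max(label_a)
    return any(x < largest and x not in label_a for x in label_b)

def detection_probability(p, q, k):
    """Assumed closed form: 1 - C(p, k_A) / C(q, k_A), with k_A = min(k, p)."""
    k_a = min(k, p)
    return 1 - Fraction(comb(p, k_a), comb(q, k_a))

def brute_force_probability(a, b, k):
    """Exact detection rate over all permutations of the elements of A ∪ B."""
    universe = sorted(a | b)
    hits = total = 0
    for perm in permutations(range(len(universe))):
        pi = dict(zip(universe, perm))
        label_a = min_k((pi[x] for x in a), k)
        label_b = min_k((pi[x] for x in b), k)
        hits += detects(label_a, label_b)  # a bool counts as 0 or 1
        total += 1
    return Fraction(hits, total)

# Illustrative sets: p = 3, q = 4, k = 2.
print(brute_force_probability({0, 1, 2}, {2, 3}, 2), detection_probability(3, 4, 2))
```

For instance, with k = 5, a set A of 9 elements and a union of 10 elements give `detection_probability(9, 10, 5) == Fraction(1, 2)`.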
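Independent Permutation Generation
[Figure: the example DAG, shown with its vertex IDs and again with the number assigned to each vertex by a random permutation generated with the Knuth shuffle: v0→7, v1→11, v2→8, v3→6, v4→3, v5→0, v6→2, v7→1, v8→10, v9→4, v10→9, v11→5]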
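The Knuth (Fisher-Yates) shuffle named on the slide produces a uniformly random permutation in linear time. A minimal sketch follows; the function name and the optional `seed` parameter are choices made for this example.

```python
import random
from typing import List, Optional

def knuth_shuffle(n: int, seed: Optional[int] = None) -> List[int]:
    """Return a uniformly random permutation of 0..n-1 (Fisher-Yates / Knuth shuffle)."""
    rng = random.Random(seed)
    perm = list(range(n))
    # Walk from the last position down, swapping each slot with a random
    # slot at or before it.
    for i in range(n - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive on both ends
        perm[i], perm[j] = perm[j], perm[i]
    return perm

# One permutation number per vertex of a 12-vertex graph: vertex i gets pi[i].
pi = knuth_shuffle(12)
```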
IP Label
The IP label of u consists of two parts:
Lout(u): the $\min_k$ set of Out(u), $\min_k\{\pi(Out(u))\}$
Lin(u): the $\min_k$ set of In(u), $\min_k\{\pi(In(u))\}$
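The slides do not show the construction procedure. One way to obtain both labels in a single pass each, assumed here for illustration, is to merge labels along a topological order: Lout(u) is the k smallest numbers among π(u) and the Lout labels of u's successors, and Lin(u) is built symmetrically from predecessors. The graph, the permutation `pi`, and the function names below are hypothetical.

```python
from collections import deque
from typing import Dict, List, Tuple

def topological_order(succ: Dict[int, List[int]]) -> List[int]:
    """Kahn's algorithm; every vertex must appear as a key of succ."""
    indeg = {u: 0 for u in succ}
    for u in succ:
        for v in succ[u]:
            indeg[v] += 1
    queue = deque(u for u in succ if indeg[u] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order

def build_labels(succ: Dict[int, List[int]], pi: Dict[int, int], k: int
                 ) -> Tuple[Dict[int, List[int]], Dict[int, List[int]]]:
    """Return (Lout, Lin): the k smallest permutation numbers of Out(u) and In(u)."""
    order = topological_order(succ)
    pred: Dict[int, List[int]] = {u: [] for u in succ}
    for u in succ:
        for v in succ[u]:
            pred[v].append(u)
    lout: Dict[int, List[int]] = {}
    lin: Dict[int, List[int]] = {}
    for u in reversed(order):          # successors are labeled before u
        merged = {pi[u]}
        for v in succ[u]:
            merged.update(lout[v])
        lout[u] = sorted(merged)[:k]
    for u in order:                    # predecessors are labeled before u
        merged = {pi[u]}
        for w in pred[u]:
            merged.update(lin[w])
        lin[u] = sorted(merged)[:k]
    return lout, lin

# Hypothetical DAG and permutation, with k = 2.
toy_dag = {0: [1, 2], 1: [3], 2: [3], 3: [], 4: [2]}
toy_pi = {0: 3, 1: 0, 2: 4, 3: 1, 4: 2}
toy_lout, toy_lin = build_labels(toy_dag, toy_pi, k=2)
```

Merging truncated labels is safe because the k smallest numbers of a union of sets are always among the k smallest numbers of each set.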
IP Label
[Figure: the example DAG annotated with the permutation number of each vertex]
IP labels for k = 5:
Vertex  Lout               Lin
v0      {0, 1, 2, 3, 4}    {7}
v1      {0, 1, 2, 3, 4}    {11}
v2      {2, 3, 4, 8, 10}   {7, 8}
v3      {1, 2, 3, 4, 6}    {6, 7}
v4      {2, 3, 4, 10}      {3, 6, 7, 8, 11}
v5      {0, 1, 5, 9, 10}   {0, 7, 11}
v6      {2, 10}            {2, 3, 6, 7, 8}
v7      {1}                {0, 1, 6, 7, 11}
v8      {10}               {0, 2, 3, 6, 7}
v9      {4}                {3, 4, 6, 7, 8}
v10     {9}                {0, 7, 9, 11}
v11     {5}                {0, 5, 7, 11}
Q1: Query(v2, v7)
Lout(v2) = {2, 3, 4, 8, 10}, Lout(v7) = {1}
$1 \notin L_{out}(v_2)$, $1 \in L_{out}(v_7)$, and 1 is smaller than the largest number in Lout(v2).
So $L_{out}(v_2) \not\succeq L_{out}(v_7)$, and therefore $Out(v_7) \not\subseteq Out(v_2)$: v2 cannot reach v7.
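The check used in Q1 can be replayed on the label table from the slides (k = 5). In this sketch the function names are invented, and applying the same test to the Lin labels with the roles of u and v swapped is an assumption based on the stated goal of detecting that In(u) is not contained in In(v).

```python
# Labels copied from the slide's table (k = 5).
LOUT = {
    "v0": {0, 1, 2, 3, 4}, "v1": {0, 1, 2, 3, 4}, "v2": {2, 3, 4, 8, 10},
    "v3": {1, 2, 3, 4, 6}, "v4": {2, 3, 4, 10}, "v5": {0, 1, 5, 9, 10},
    "v6": {2, 10}, "v7": {1}, "v8": {10}, "v9": {4}, "v10": {9}, "v11": {5},
}
LIN = {
    "v0": {7}, "v1": {11}, "v2": {7, 8}, "v3": {6, 7},
    "v4": {3, 6, 7, 8, 11}, "v5": {0, 7, 11}, "v6": {2, 3, 6, 7, 8},
    "v7": {0, 1, 6, 7, 11}, "v8": {0, 2, 3, 6, 7}, "v9": {3, 4, 6, 7, 8},
    "v10": {0, 7, 9, 11}, "v11": {0, 5, 7, 11},
}

def detects(label_a, label_b):
    """True if label_b holds a number absent from label_a yet below max(label_a)."""
    largest = max(label_a)
    return any(x < largest and x not in label_a for x in label_b)

def unreachable_by_labels(u, v):
    """True if the labels alone prove that u cannot reach v."""
    return detects(LOUT[u], LOUT[v]) or detects(LIN[v], LIN[u])

print(unreachable_by_labels("v2", "v7"))  # Q1: 1 is missing from Lout(v2) and 1 < 10
```

A `False` result only means the labels are inconclusive; it does not prove reachability, so a graph search may still be needed.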
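Q2: Query(v1, v3)
$L_{out}(v_1) \succeq L_{out}(v_3)$ and $L_{in}(v_3) \succeq L_{in}(v_1)$, so the labels of v1 and v3 alone cannot rule out reachability.
$\Pr(L_{out}(v_1) \not\succeq L_{out}(v_3)) = \frac{1}{2}$, $\Pr(L_{in}(v_3) \not\succeq L_{in}(v_1)) = \frac{2}{3}$
Let $|A| = p$, $|A \cup B| = q$ and $|\min_k\{\pi(A)\}| = k_A$ for $k_A \le k$. Then
$$\Pr\big(\min_k\{\pi(A)\} \not\succeq \min_k\{\pi(B)\}\big) = 1 - \binom{p}{k_A} \Big/ \binom{q}{k_A}$$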
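Query(v1, v3), continued at v4, a descendant of v1:
$L_{out}(v_4) \not\succeq L_{out}(v_3)$ and $L_{in}(v_3) \not\succeq L_{in}(v_4)$
$\Pr(L_{out}(v_4) \not\succeq L_{out}(v_3)) = \frac{14}{15}$, $\Pr(L_{in}(v_3) \not\succeq L_{in}(v_4)) = \frac{9}{10}$, up from $\frac{1}{2}$ and $\frac{2}{3}$ at v1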
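Query(v1, v3), continued at v5, a descendant of v1:
$L_{out}(v_5) \not\succeq L_{out}(v_3)$ and $L_{in}(v_3) \not\succeq L_{in}(v_5)$
$\Pr(L_{out}(v_5) \not\succeq L_{out}(v_3)) = \frac{125}{126}$, $\Pr(L_{in}(v_3) \not\succeq L_{in}(v_5)) = \frac{5}{6}$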
The probability increases significantly!
Query(v1, v3): $\frac{1}{2}, \frac{2}{3}$ at v1; $\frac{14}{15}, \frac{9}{10}$ at v4; $\frac{125}{126}, \frac{5}{6}$ at v5
IP Label
Assume DFS is needed even though u cannot reach v. Consider a vertex w, a descendant of u, that is visited by the DFS towards v. The following are true:
$\Pr(L_{out}(u) \not\succeq L_{out}(v)) < \Pr(L_{out}(w) \not\succeq L_{out}(v))$
$\Pr(L_{in}(v) \not\succeq L_{in}(u)) < \Pr(L_{in}(v) \not\succeq L_{in}(w))$
As the DFS goes deeper, it becomes much more likely to answer the unreachability query, and therefore the search can stop at an early stage.
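A sketch of a label-guided DFS in the spirit of this slide: at every visited vertex w the labels are tested again, and the branch is cut as soon as they prove that w cannot reach v. The graph and its k = 2 labels are hypothetical values precomputed for a toy permutation, and the function names are invented here; this is not the authors' implementation.

```python
from typing import Dict, List

def detects(label_a: List[int], label_b: List[int]) -> bool:
    """True if label_b holds a number absent from label_a yet below max(label_a)."""
    largest = max(label_a)
    return any(x < largest and x not in label_a for x in label_b)

def query(succ: Dict[int, List[int]], lout: Dict[int, List[int]],
          lin: Dict[int, List[int]], u: int, v: int) -> bool:
    """DFS from u towards v that prunes every vertex the labels rule out."""
    stack = [u]
    seen = {u}
    while stack:
        w = stack.pop()
        if w == v:
            return True
        # Out(v) not within Out(w), or In(w) not within In(v): w cannot reach v.
        if detects(lout[w], lout[v]) or detects(lin[v], lin[w]):
            continue
        for x in succ[w]:
            if x not in seen:
                seen.add(x)
                stack.append(x)
    return False

# Hypothetical DAG with labels for k = 2 under the toy permutation
# 0 -> 3, 1 -> 0, 2 -> 4, 3 -> 1, 4 -> 2.
toy_dag = {0: [1, 2], 1: [3], 2: [3], 3: [], 4: [2]}
toy_lout = {0: [0, 1], 1: [0, 1], 2: [1, 4], 3: [1], 4: [1, 2]}
toy_lin = {0: [3], 1: [0, 3], 2: [2, 3], 3: [0, 1], 4: [2]}

print(query(toy_dag, toy_lout, toy_lin, 0, 3))
print(query(toy_dag, toy_lout, toy_lin, 1, 2))
```

Because the label test never rules out a vertex that can actually reach v, pruning does not change the answer; it only shortens the search.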
Two Optimizations
Huge-Vertex Label: build an additional index to handle the huge vertices of the graph
Level Label: use the topological structure to prune the search space
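The slide does not define the level label. One common way to use topological structure for pruning, assumed here purely for illustration, is to give each vertex the length of the longest path from any source: if u reaches a different vertex v, then level(u) < level(v), so any pair that violates this can be answered as unreachable without a search. The graph and names below are hypothetical.

```python
from collections import deque
from typing import Dict, List

def topological_levels(succ: Dict[int, List[int]]) -> Dict[int, int]:
    """Longest-path distance from any source, computed with Kahn's algorithm."""
    indeg = {u: 0 for u in succ}
    for u in succ:
        for v in succ[u]:
            indeg[v] += 1
    level = {u: 0 for u in succ}
    queue = deque(u for u in succ if indeg[u] == 0)
    while queue:
        u = queue.popleft()
        for v in succ[u]:
            level[v] = max(level[v], level[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return level

def level_rules_out(level: Dict[int, int], u: int, v: int) -> bool:
    """True if the levels alone prove that u cannot reach v."""
    return u != v and level[u] >= level[v]

# Hypothetical DAG for illustration.
toy_dag = {0: [1, 2], 1: [3], 2: [3], 3: [], 4: [2]}
toy_level = topological_levels(toy_dag)
```

The test is one-sided: a `False` result leaves the query undecided and the other labels or a search must take over.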
Performance Studies
Real datasets:
Dataset      |V(G)|   |E(G)|   d_avg   R-ratio
uniprotenc   25M      25M      0.999   1.30E-7
twitter      18M      18M      1.013   7.39E-2
web-uk       22M      38M      1.678   1.50E-1
citeseerx    6.5M     15M      2.295   4.07E-4
go-uniprot   6.9M     34M      4.990   3.64E-6
govwild      8.0M     23M      2.948   7.20E-5
Performance Studies
Index Construction Time (in seconds)
Dataset      TF-Label   DL       GRAIL    Ferrari   IP+
uniprotenc   58.529     22.280   58.242   24.292    18.96
twitter      15.291     13.719   32.323   19.972    12.44
web-uk       ---        24.240   44.031   26.927    17.46
citeseerx    91.877     12.045   23.170   19.792    7.54
go-uniprot   38.668     18.277   44.557   40.365    9.68
govwild      30.520     18.584   29.237   19.924    8.45
Performance Studies
Query Time (in milliseconds)
Dataset      TF-Label   DL        GRAIL     Ferrari   IP+
uniprotenc   119.164    119.618   820.249   116.351   54.205
twitter      102.923    104.698   ---       82.212    79.285
web-uk       ---        146.429   ---       214.857   253.082
citeseerx    230.318    111.329   28774     131.534   101.444
go-uniprot   55.279     153.214   499.505   313.300   34.577
govwild      254.785    128.199   719.494   295.432   112.990
Performance Studies
[Figure: distribution of the number of vertices visited]
Conclusion
We propose a new IP labeling approach, the first one to explore randomness for answering reachability queries.
Our new labeling approach has linear index construction time and index size. By independent permutation, the query performance is guaranteed with high probability.
We analyze the performance of our approach through extensive experimental studies, which show that it is both efficient and scalable.