Faster Query Answering in Probabilistic Databases using Read-Once Functions
1
Faster Query Answering in Probabilistic Databases using Read-Once Functions
Sudeepa Roy
Joint work with
Vittorio Perduca and Val Tannen
University of Pennsylvania
2
Probabilistic Databases
Possible worlds model
  Each possible world w is a standard database instance and has a probability P[w]
  Compact representation D based on independence assumptions
Query Semantics in Probabilistic Databases
  (wlog.) Boolean query q
  Traditional database: q(D) ∈ {true, false}
  Probabilistic database: P[q(D)] = ∑_{w : q(w) = true} P[w]
Goal: Efficiently evaluate P[q(D)]
  Data complexity; want time polynomial in n = |D|
3
Computation of P[q(D)]
Can we efficiently compute P[q(D)]?
  NO: in general, #P-hard
Dalvi-Suciu ’04 ff.: Positive queries can be partitioned into
  Safe queries: safe plans run in poly-time on all instances
  Unsafe queries: data complexity is #P-hard
    Includes very simple queries like R(x) S(x, y) T(y)
Given q as input, we can efficiently decide whether q is safe
BUT: for unsafe queries, probabilities on some instances can be efficiently computed
Our Approach: take both q and D as input
Restrictions
Tuple-independent representation D: each tuple t is annotated by its probability P[t]

D =
  R:  tuple      Probability
      a1         0.3
      a2         0.4
      a3         0.6

  S:  tuple      Probability
      a1 b1      0.1
      a2 b1      0.5
      a3 b2      0.2
      a3 b3      0.1

  T:  tuple      Probability
      b1         0.7
      b2         0.8
      b3         0.4

w = a possible world containing exactly R(a1), S(a1, b1), T(b1)
P[w] = 0.3 (1 – 0.4) (1 – 0.6) · 0.1 (1 – 0.5) (1 – 0.2) (1 – 0.1) · 0.7 (1 – 0.8) (1 – 0.4)

Conjunctive query without self-join (CQ-): q():= R(x) S(x, y) T(y)
(This is the H0 query from Suciu’s keynote)
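The possible-worlds semantics can be checked by brute force on this small instance. The sketch below is only an illustration: the data layout and the names `world_prob`, `q_holds` and `query_prob` are assumptions made for this example, and the enumeration is exponential in |D|, which is exactly what the talk sets out to avoid.

```python
from itertools import product

# Tuple-independent example instance D from the slide: tuple -> P[t].
R = {("a1",): 0.3, ("a2",): 0.4, ("a3",): 0.6}
S = {("a1", "b1"): 0.1, ("a2", "b1"): 0.5, ("a3", "b2"): 0.2, ("a3", "b3"): 0.1}
T = {("b1",): 0.7, ("b2",): 0.8, ("b3",): 0.4}

# Flat list of (table name, tuple, probability) triples.
TUPLES = (
    [("R", t, p) for t, p in R.items()]
    + [("S", t, p) for t, p in S.items()]
    + [("T", t, p) for t, p in T.items()]
)


def world_prob(present):
    """P[w] for a world given as a set of (table, tuple) pairs."""
    prob = 1.0
    for table, tup, p in TUPLES:
        prob *= p if (table, tup) in present else (1.0 - p)
    return prob


def q_holds(present):
    """Evaluate the Boolean query q() :- R(x), S(x, y), T(y) on one world."""
    for table, tup in present:
        if table == "S":
            x, y = tup
            if ("R", (x,)) in present and ("T", (y,)) in present:
                return True
    return False


def query_prob():
    """P[q(D)]: sum of P[w] over all worlds w in which q is true."""
    total = 0.0
    for bits in product([False, True], repeat=len(TUPLES)):
        present = {(tab, tup) for (tab, tup, _), b in zip(TUPLES, bits) if b}
        if q_holds(present):
            total += world_prob(present)
    return total


# The world w shown on the slide.
w = {("R", ("a1",)), ("S", ("a1", "b1")), ("T", ("b1",))}
p_w = world_prob(w)
p_q = query_prob()
```

Here `p_w` reproduces the product written out for P[w], and `p_q` sums over all 2^10 worlds of the instance.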
5
Query Answering in Two Steps: Example
Event variables for tuples
Step 1: Event expression for q(D), or “lineage”
  E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3
  The “form” of the expression depends on the query plan; here π()((R ⋈ S) ⋈ T)
Step 2: Compute P[q(D)] = P[E] given Pr[w1] = 0.3, Pr[v1] = 0.4, ….
This work: take advantage of Read-Once expressions
D =
  R:  tuple      Event variable   Probability
      a1         w1               0.3
      a2         w2               0.4
      a3         w3               0.6

  S:  tuple      Event variable   Probability
      a1 b1      v1               0.1
      a2 b1      v2               0.5
      a3 b2      v3               0.2
      a3 b3      v4               0.1

  T:  tuple      Event variable   Probability
      b1         u1               0.7
      b2         u2               0.8
      b3         u3               0.4
q():= R(x), S(x, y), T(y)
EASY
HARD
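Step 1 can be illustrated with a small join that emits one product term per join result. This is a sketch under assumptions: the dictionaries mapping tuples to event variables and the names `lineage` and `to_string` are made up for this example, and the loop is hard-wired to the query q():= R(x), S(x, y), T(y) rather than to a general query plan.

```python
# Event variables for the example instance: tuple -> event variable name.
R = {"a1": "w1", "a2": "w2", "a3": "w3"}
S = {("a1", "b1"): "v1", ("a2", "b1"): "v2", ("a3", "b2"): "v3", ("a3", "b3"): "v4"}
T = {"b1": "u1", "b2": "u2", "b3": "u3"}


def lineage():
    """DNF lineage of q() :- R(x), S(x, y), T(y).

    Each S-tuple (x, y) that joins with R(x) and T(y) contributes
    one product term (conjunction of three event variables).
    """
    terms = []
    for (x, y), v in S.items():
        if x in R and y in T:
            terms.append((R[x], v, T[y]))
    return terms


def to_string(terms):
    """Render a list of product terms as a sum of products."""
    return " + ".join(" ".join(term) for term in terms)


E = lineage()
E_text = to_string(E)
```

On the instance above, `E_text` matches the expression E given for Step 1.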
6
Read-Once Boolean Expressions
Expression in read-once form: every variable occurs exactly once
  e.g. ((x+y)z + w)(u+v)
  Linear time probability computation:
    P(x y) = P(x) P(y)
    P(x + y) = 1 – (1 – P(x)) (1 – P(y))
Read-once expression: has an equivalent read-once form, e.g.
  xzu + xzv + yzu + yzv + wu + wv   [in DNF, as large as O(n^|q|)]
  xzu + xzv + (yz + w)(u+v)         [not in DNF, can be much smaller]
Non-read-once expressions: no read-once form
  e.g. xy + yz + zx, xy + yz + zw
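The two rules for P(x y) and P(x + y) give a single bottom-up pass over an expression in read-once form, because sibling subexpressions share no variables and are therefore independent. A minimal sketch, assuming a nested-tuple encoding of the expression tree and made-up probabilities of 0.5 for every variable (neither is from the talk):

```python
from math import prod

# Hypothetical probabilities for the variables of ((x+y)z + w)(u+v).
PROB = {"x": 0.5, "y": 0.5, "z": 0.5, "w": 0.5, "u": 0.5, "v": 0.5}


def prob(node):
    """Probability of a read-once expression tree.

    A node is either a variable name or a tuple ("and" | "or", child, ...).
    Correct only when no variable is repeated in the tree, so that the
    children of every node are independent events.
    """
    if isinstance(node, str):
        return PROB[node]
    op, *children = node
    ps = [prob(c) for c in children]
    if op == "and":
        return prod(ps)                      # P(x y) = P(x) P(y)
    return 1.0 - prod(1.0 - p for p in ps)   # P(x + y) = 1 - (1 - P(x))(1 - P(y))


# ((x + y) z + w)(u + v) from the slide.
expr = ("and",
        ("or", ("and", ("or", "x", "y"), "z"), "w"),
        ("or", "u", "v"))
result = prob(expr)
```

Each node is visited once, so the running time is linear in the size of the read-once form.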
7
Read-Once Event Expressions
Safe plans for safe queries directly produce expressions in read-once form (Olteanu-Huang ’08)
Unsafe queries can also produce read-once expressions
Our example is read-once
  E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3
    = (w1 v1 + w2 v2) u1 + w3 (v3 u2 + v4 u3)
  Corresponds to unsafe query q():= R(x) S(x, y) T(y)
  No query plan can produce the read-once form directly
8
Problem Definition
Given
  a boolean CQ- query q,
  a tuple-independent database D,
Can we efficiently decide whether the event expression corresponding to q(D) is read-once?
If yes, can we compute the read-once form efficiently? (Then P[q(D)] can be computed efficiently.)
9
Read-once-ness: only a sufficient condition to efficiently compute P[q(D)]
  e.g., E = x1 x2 + x2 x3 + x3 x4 + …
    Not read-once
    P[E] can be computed in poly-time using dynamic programming
  Moreover, see detailed analysis in Jha-Suciu ’11 using OBDD, FBDD, d-DNNF
[Figure: E is read-once; read-once form of E can be computed efficiently; P[E] can be computed efficiently]
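The slide does not spell out the dynamic program for the chain expression E = x1 x2 + x2 x3 + x3 x4 + …, so the following is one possible version, offered as a sketch. It assumes independent variables and uses the observation that E is false exactly when no two adjacent variables are both true; the function name `chain_prob` is hypothetical.

```python
def chain_prob(p):
    """P[x1 x2 + x2 x3 + ... + x(n-1) xn] for independent variables.

    p[i] is the probability that x(i+1) is true. The DP tracks the
    probability that no adjacent pair is true so far, split by whether
    the most recent variable is false (f0) or true (f1).
    """
    if len(p) < 2:
        return 0.0  # no product term at all
    f0, f1 = 1.0 - p[0], p[0]
    for q in p[1:]:
        # Next variable false: any previous state is fine.
        # Next variable true: previous variable must be false.
        f0, f1 = (f0 + f1) * (1.0 - q), f0 * q
    return 1.0 - (f0 + f1)


two = chain_prob([0.3, 0.4])   # single term x1 x2
four = chain_prob([0.5] * 4)   # x1 x2 + x2 x3 + x3 x4
```

The loop is linear in the number of variables, even though the expression has no read-once form once it contains three or more terms.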
10
Outline
Background
  Existing characterization of read-once expressions
  Co-occurrence Graphs
Our Contributions
  Co-table graph
  Step 1. Computation of co-table graph
  Step 2. Computation of read-once form
Related work, Future work and Conclusion
12
Characterization of Read-once Expressions
A positive boolean expression is read-once if and only if its “co-occurrence graph” is P4-free (no simple induced path with four vertices) and “normal” (Gurvich ’77, ’91).
Can be checked (and computed) in poly-time if the expression is given in DNF (Golumbic-Mintz-Rotics ’06)
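The P4-freeness half of the characterization can be illustrated with a brute-force test on small graphs. This sketch is not the poly-time algorithm cited on the slide: it tries every ordered 4-tuple of vertices, the function name `has_induced_p4` is an assumption, and it does not test normality.

```python
from itertools import permutations


def has_induced_p4(vertices, edges):
    """True if the graph contains an induced path a - b - c - d.

    Induced means the three path edges are present and the other
    three possible edges among the four vertices are absent.
    """
    edge_set = {frozenset(e) for e in edges}

    def joined(a, b):
        return frozenset((a, b)) in edge_set

    for a, b, c, d in permutations(vertices, 4):
        if (joined(a, b) and joined(b, c) and joined(c, d)
                and not joined(a, c) and not joined(a, d) and not joined(b, d)):
            return True
    return False


# Co-occurrence graph of xy + yz + zw: the path x - y - z - w.
path = has_induced_p4("xyzw", [("x", "y"), ("y", "z"), ("z", "w")])
# Co-occurrence graph of xy + yz + zx: a triangle.
triangle = has_induced_p4("xyz", [("x", "y"), ("y", "z"), ("z", "x")])
```

The first expression fails the P4-free condition. The second passes it but is still not read-once, because the clique {x, y, z} is not covered by any single disjunct, so the graph is not normal.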
13
Co-occurrence Graph - GCO
Graph on variables in the expression as vertices
  1. Express boolean expression in irredundant DNF
     xy + xyz + zx → xy + zx
  2. Put an edge between variables if they co-occur in a disjunct
Can be easily computed if the expression is in DNF
[Figure: GCO for xy + zx, with edges x - y and x - z]
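The two construction steps can be sketched directly for a positive DNF given as a list of disjuncts. The representation (each disjunct as a string of single-letter variables) and the function names are assumptions for this example; redundant disjuncts are removed by absorption, that is, a disjunct is dropped when another disjunct is a proper subset of it.

```python
from itertools import combinations


def irredundant(terms):
    """Step 1: drop every disjunct that strictly contains another disjunct."""
    sets = {frozenset(t) for t in terms}
    return {t for t in sets if not any(other < t for other in sets)}


def co_occurrence_edges(terms):
    """Step 2: one edge per pair of variables sharing an irredundant disjunct."""
    edges = set()
    for term in irredundant(terms):
        for a, b in combinations(sorted(term), 2):
            edges.add((a, b))
    return edges


# xy + xyz + zx reduces to xy + zx, as on the slide.
reduced = irredundant(["xy", "xyz", "zx"])
edges = co_occurrence_edges(["xy", "xyz", "zx"])
```

Note that y and z are not adjacent: they only co-occur in the absorbed disjunct xyz.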
15
Our Contributions
1. DNF of event expression is not needed for CQ-
   GCO can be directly computed from “provenance DAGs”
2. We do not need to compute GCO
   A subgraph of GCO suffices: the “co-table graph” GCT
Our Framework
  (1) Compute GCO, then use existing read-once testing algorithms
  (2) Compute GCT, then use our read-once testing algorithm
  (1) uses Gurvich’s characterization vs. (2) uses an alternative
  (2) is faster than (1)
16
Provenance DAG
Event expressions, called “lineage” (Suciu keynote), are a form of provenance (Green-Karvounarakis-Tannen ’07).
We use provenance DAGs (Green et al. ’07)
Query q():= R(x), S(x, y), T(y)
Query Plan π()((R ⋈ S) ⋈ T)
E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3
[Figure: provenance DAG for E with leaves w1, w2, w3, v1, v2, v3, v4, u1, u2, u3]
17
Co-Table Graph -- GCT
Subgraph of GCO: |GCT| ≤ |GCO|
Keep an edge of GCO only if the tables of its two variables share variables in q
  e.g.: q():= R(x) S(y); R, S have n tuples each; GCO has n^2 edges, GCT has zero!
q():= R(x) S(x, y) T(y)
E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3
[Figure: GCO and GCT for E on the vertices w1, w2, w3, v1, v2, v3, v4, u1, u2, u3]
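The difference between the two graphs on the running example can be made concrete by building GCO from the DNF and then filtering its edges. This is a sketch for illustration only: it goes through the DNF, which the talk's algorithm avoids, and the mapping from variable-name prefix to table (`TABLE_OF`) is an assumption of this example.

```python
from itertools import combinations

# Query variables used by each table in q() :- R(x), S(x, y), T(y).
QUERY = {"R": {"x"}, "S": {"x", "y"}, "T": {"y"}}
# Assumed naming convention: w* come from R, v* from S, u* from T.
TABLE_OF = {"w": "R", "v": "S", "u": "T"}

# Disjuncts of E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3 (already irredundant).
TERMS = [("w1", "v1", "u1"), ("w2", "v2", "u1"),
         ("w3", "v3", "u2"), ("w3", "v4", "u3")]


def table(var):
    """Table that an event variable belongs to."""
    return TABLE_OF[var[0]]


def g_co(terms):
    """Co-occurrence graph: variables sharing a disjunct are adjacent."""
    return {pair for term in terms for pair in combinations(sorted(term), 2)}


def g_ct(terms, query):
    """Co-table graph: GCO edges whose two tables share a query variable."""
    return {(a, b) for a, b in g_co(terms) if query[table(a)] & query[table(b)]}


co = g_co(TERMS)
ct = g_ct(TERMS, QUERY)
dropped = co - ct
```

The dropped edges are exactly the R-to-T pairs, since R(x) and T(y) share no variable in q.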
18
Our Algorithm
Input: Provenance DAG, H
  Obtained from the query plan
Step 1: Compute GCT
  (the same procedure can compute GCO as well)
Step 2: Compute read-once form (if possible)
  Otherwise output that the event expression is not read-once
19
Step 1: Computing GCT
Theorem: Two variables are adjacent in GCT if and only if their least common ancestor set contains a product-node in the provenance DAG
  e.g. E = xy + xz
  [Figure: provenance DAG for E = xy + xz with leaves y, x, z]
Proof uses critically the no-self-join assumption
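The adjacency condition in the theorem can be tried out on the small example E = xy + xz. The sketch below is a naive check, not the talk's procedure: the DAG encoding, the node names, and the reading of "least common ancestor set" as the common ancestors that have no other common ancestor below them are all assumptions of this example.

```python
# Provenance DAG for E = xy + xz: node -> (label, children).
# "+" is a sum node, "*" a product node, "var" a leaf variable.
DAG = {
    "root": ("+", ["p1", "p2"]),
    "p1": ("*", ["x", "y"]),
    "p2": ("*", ["x", "z"]),
    "x": ("var", []),
    "y": ("var", []),
    "z": ("var", []),
}


def descendants(node):
    """All nodes reachable from node, including node itself (iterative DFS)."""
    seen, stack = set(), [node]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(DAG[n][1])
    return seen


def lca_set(a, b):
    """Common ancestors of a and b with no common ancestor strictly below them."""
    common = {n for n in DAG if {a, b} <= descendants(n)}
    return {n for n in common if not (descendants(n) - {n}) & common}


def adjacent(a, b):
    """Adjacency test from the theorem: some least common ancestor is a product node."""
    return any(DAG[n][0] == "*" for n in lca_set(a, b))


xy = adjacent("x", "y")
xz = adjacent("x", "z")
yz = adjacent("y", "z")
```

For y and z the only common ancestor is the sum node at the root, so they are not adjacent, which agrees with the disjuncts xy and xz.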
20
Step 2: Computing Read-once Form
Input: GCT
Alternate between Row Decomposition and Table Decomposition
  Recursive computation
  Exactly one can be done at a recursion level, otherwise not read-once
  Proof uses critically the no-union assumption
  Sound and complete
Row decomposition: same query q on each part of the rows; E = E1 + E2 + E3
Table decomposition: q splits into q1, q2; E = E1 E2
21
Example: Row Decomposition
q():= R(x), S(x, y), T(y)
E = w1 v1 u1 + w2 v2 u1 + w3 v3 u2 + w3 v4 u3
[Figure: GCT on w1, w2, w3, v1, v2, v3, v4, u1, u2, u3 and the tables R, S, T with their event variables. Row decomposition splits the rows into parts combined by +. The first part is:
  R1: a1 (w1), a2 (w2)
  S1: a1 b1 (v1), a2 b1 (v2)
  T1: b1 (u1)]
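The row split in this example coincides with the connected components of the co-table graph, which can be found with a plain BFS. Treating row decomposition as a connected-components computation is an assumption of this sketch (the slides only show the result), and the edge list below is GCT for the running example, written out by hand.

```python
from collections import deque

VERTICES = ["w1", "w2", "w3", "v1", "v2", "v3", "v4", "u1", "u2", "u3"]
# Co-table graph of E: R-S and S-T pairs that co-occur in a disjunct.
GCT_EDGES = [("w1", "v1"), ("v1", "u1"), ("w2", "v2"), ("v2", "u1"),
             ("w3", "v3"), ("v3", "u2"), ("w3", "v4"), ("v4", "u3")]


def components(vertices, edges):
    """Connected components of an undirected graph, via BFS."""
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for start in vertices:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        seen.add(start)
        while queue:
            v = queue.popleft()
            comp.add(v)
            for n in adj[v]:
                if n not in seen:
                    seen.add(n)
                    queue.append(n)
        comps.append(comp)
    return comps


parts = components(VERTICES, GCT_EDGES)
```

The first component holds the variables of R1, S1 and T1 from the slide; the second holds w3, v3, v4, u2, u3, and the two parts are combined by +.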
22
Example: Table Decomposition
Input: the first part from row decomposition
  R1: a1 (w1), a2 (w2)
  S1: a1 b1 (v1), a2 b1 (v2)
  T1: b1 (u1)
q():= R(x), S(x, y), T(y)
q1():= R(x), S(x, y1)    q2():= T(y2)
E1 = (w1 v1 + w2 v2)     E2 = u1
E = (w1 v1 + w2 v2) u1
Final Expression: (w1 v1 + w2 v2) u1 + w3 (v3 u2 + v4 u3)
23
Overall Time Complexity
Input: Provenance DAG H
Step 1: Compute GCT or GCO
  Time complexity ≈ O(n mH + WH mCO)
  mH = #edges in H, WH = width of H, mCO = #edges in GCO, mCT = #edges in GCT
Step 2: Compute read-once form (if possible)
  Using our algorithm: O((mCT + n) min(|q|, √n)); data complexity O(mCT + n)
  Using existing algorithms: O(mCO + n), mCT ≤ mCO
Summary
  Analysis uses “charging argument”
  Bound recursion depth, total time at each recursion level
  Step 1 is more expensive
  Step 2 is linear
    in |GCO| for existing algorithms
    in |GCT| for our algorithms
    |GCT| ≤ |GCO|
25
Related Work
Sen-Deshpande-Getoor ’10
  Independent work, considers the same problem
  Shows that “normality” check is not needed for CQ-
  Tests P4-freeness using “lineage-trees” without computing the co-occurrence graph
Our work:
  Computes the co-occurrence graph without DNF computation, so existing algorithms can be used
    Was an open question in Sen-Deshpande-Getoor ’10
  Obtains a faster and simpler algorithm
    Time complexity comparison in the paper
    Uses BFS/DFS, easier to implement
    Uses compact provenance DAGs instead of lineage trees
26
Other Related Work
Semantics of probabilistic query answering
  Fuhr-Rollecke ’97, Zimanyi ’97
Dichotomy of CQ-, CQ and UCQ queries
  Dalvi-Suciu ’04, ’07, Dalvi-Schnaitter-Suciu ’10
Knowledge compilation techniques
  Olteanu-Huang ’08, Jha-Olteanu-Suciu ’10, Jha-Suciu ’11, Fink-Olteanu ’11
27
Conclusion and Future Work
Can the co-occurrence/co-table graph be computed as a pre-processing step?
  This is the more expensive step
  Akin to building indexes on databases, but depends on the query’s “join pattern”
  Cache the already computed GCT with the join pattern
How to handle
  Larger classes of queries (UCQ?) and database models (disjoint independent?)
  Other efficient knowledge-compilation forms
28
Thank You.
Questions?