Constraint Processing Techniques for Improving Join Computation: A Proof of Concept Anagh Lal &...
-
date post
19-Dec-2015 -
Category
Documents
-
view
221 -
download
0
Transcript of Constraint Processing Techniques for Improving Join Computation: A Proof of Concept Anagh Lal &...
Constraint Processing Techniques for Improving Join Computation:
A Proof of Concept
Anagh Lal & Berthe Y. Choueiry
Constraint Systems Laboratory
Department of Computer Science & EngineeringUniversity of Nebraska-Lincoln
An illustrative example
R2
A B C
1 12 23
1 13 23
1 14 23
1 15 23
2 10 25
3 17 20
3 18 22
4 10 25
5 12 23
5 13 23
5 14 23
5 15 23
6 13 27
6 14 27
8 14 28
Join querySELECT R1.A,R1.B,R1.C
FROM R1,R2
WHERE R1.A=R2.A
AND R1.B=R2.B
AND R1.C=R2.C 10 tuples in 3 nested tuples
R1 J oin R2 (Compacted)
A B C
{1, 5} {12, 13, 14} {23}
{2, 4} {10} {25}
{6} {13, 14} {27}
R1
A B C
1 12 23
1 13 23
1 14 23
2 10 25
3 16 30
3 16 24
4 10 25
5 12 23
5 13 23
5 14 23
6 13 27
6 14 27
7 14 28
7 19 20
Advantages
Direct• Savings of number of tuple comparisons• Savings in I/O for next operator• Space reduction of materialized join queries
Future applications• Use for query size estimation• Assist in high-level analysis of data & in data
mining
Our contributions A new representation of a join query as a Constraint
Satisfaction Problem (CSP) A new sorting-based bundling algorithm
• Suitable for CSPs with fewer and larger constraints (i.e., join)• Improves memory usage
A new sort-merge join algorithm for producing (dynamically) bundled tuples• Yields compact representation, saves memory space
Identification of possible applications• Data analysis • Materialized views• Assisting query-size estimation Suggested, not yet demonstrated
Constraint Satisfaction Problem
Given P = (V, D, C)• V = {Vi}, a set of variables
• D = {DVi}, the set of their respective domains
• C is a set of constraints restricting the acceptable combination of values for variables.
• Solution is a consistent assignment of values to variables
Query: find 1 solution, all solutions, etc.
V3
{d}
{a, b, d} {a, b, c}
{c, d, e, f}
V4
V2V1
Solving CSPs Typically, DFS & backtracking Improvement
• Static bundling [Freuder 91]
• Dynamic bundling [our group]– Based on dynamically identifying symmetries
– Guaranteed never less efficient than non-bundling, static bundling
Without bundling Static bundling
S
c d, e, f
dV1
V2
Dynamic bundling
c e, f d
dV1
V2
S
c e f d
dV1
V2
S
V3
{d}
{a, b, d} {a, b, c}
{c, d, e, f}
V4
V2V1
Modeling Join as a CSP
Attributes of relations CSP variables Attribute values variable domains Relations relational constraints Join conditions join-condition constraintsSELECT R1.A,R1.B,R1.C
FROM R1,R2
WHERE R1.A=R2.A
AND R1.B=R2.B
AND R1.C=R2.C
Sorting-based bundling
Heuristic for variable ordering Place variables linked by join conditions as close to each other as possible
R1.A
R2.A
R1.B
R2.B
R1.C
R2.C
R1
R2
Sort relations using above ordering Next: Compute bundles of variable
ahead in variable ordering (R1.A)
Bundling an attribute
Partition of a constraintTuples of the relation having the same value of R1.A
Compare projected tuples of first partition with those of another partition
Compare with every other partition to get complete bundle
Partition
Unequalpartitions
Symmetricpartitions
Bundle {1, 5}
R1
A B C
1 12 23
1 13 23
1 14 23
2 10 25
5 12 23
5 13 23
5 14 23
Join using dynamic bundlingSelect next-
variableCompute next
valid bundle
Foundbundle?
Last variable?
Move to previous variable
Undo previous
assignment
1st in Ordering?
No
No
Yes
Output onetuple
Start
Stop
Yes
Yes
No
Assign bundle
Finding the valid bundle
R1
A B C
1 12 23
1 13 23
1 14 23
2 10 25
3 16 30
3 16 24
R2
A B C
1 12 23
1 13 23
1 14 23
1 15 23
2 10 25
3 17 20
{1, 5, x}{1, 5, y, z}
Common {1, 5}1. Compute a bundle
for the attribute 2. Check bundle
validity with future constraints
3. If no common value found GOTO 1
Assign variable with the surviving values in the bundle
Analysis of overheads
For Bundling• Additional data structures: 2 arrays, 1 pointer• Only 1 array may become cumbersome
Array size is largest • when all the values of a variable are in one
bundle • But, this case also leads to best savings!
Improved implementation • Use of Bitmaps?
Progressive Merge Join PMJ: A sort-merge algorithm by [Dittrich et al. 03] Provides early results
• Assists in query size-estimation Two main phases
• Sorting: starts producing results in this phase• Merging phase: merges sorted runs
We use the framework of the PMJ for our external join.
Implemented & evaluated with the XXL library• We use the same library for our implementation
Preliminary experiments
Data sets• Random: 2 relations R1, R2 with same schema as example
– Each relation: 10’000 tuples– Memory size: 4’000 tuples– Page size 200 tuples
• Real-world problem: 3 relations, 4 attributes
Compaction rate achieved• Random problem: 1.48
– Savings compensate for even worst case (of the current experimental implementation)
• Real-world problem: 2.26 (69 tuples in 32 nested tuples)
Related work
Join algorithms• Well established algorithms• Do not focus on exploiting symmetry
Database compression• Output results are not compressed• Compression at value level, not tuple level
Related work (contd)
[Mamoulis & Papadias 1998] • Join using FC for spatial DB • Restricted to binary constraints• No compaction of solution space
[Bayardo et al. 1996]• Reduce the number of the intermediate tuples of a sequence of
joins
[Rich et al. 1993]• Do not compact join attribute values• Does not detect redundancy present in the grouped sub-relations
Future work
Refine implementation • Use of lighter data structures
Test usefulness in the context of Constraint DBs• Values are continuous intervals, e.g. spatial database
Conduct thorough evaluations of overall performance & overhead (memory & CPU) on different data distributions
Investigate benefit of using bundling• query size estimation • materialized views
Bundling relations: Data structures Considering the portion
of the relation in memory Current-Inst: To store the
current instantiations of past variables Vp of R1.
Current-Constraint: selection of R’:• Past variable values equal
Current-Inst• Current variable Vc > all
previous instantiations of Vc
Bundling relations:Computing bundles (Algorithm 1)
NEXT-PARTITION(p) returns the first unchecked partition in Current-Constraint following the partition p.
Sorted constraints Checking equality of tuples is efficient
Bundling relations: Data structures
Processed-Values: Cumulatively stores non-representative values of bundles
Computing bundles of Vc Values of Vc in it are ignored
Partition p is marked as checked when:• Value(p) is in an instantiation bundle
• p is selected for comparing with other partitions to check for bundles
Join computation: In memory
Two subsets of relations (some pages) in memory: • Algorithm to find result of joining the two.• Join computed as a search
– Finding all solutions
• After finding one solution, search resumes from same depth
– Algorithm shown can be entered at any “depth” in the search
• Uses Algorithm 1 to find bundles for assigning to variables
Join computation: In memory
Join as a search (Algo. 2) BACKTRACK
• Variable[depth] in Current-Inst reset
• Processed-Values for the variable emptied
• Value in Current-Solution reset
• Current-Constraint re-computed
Undoes the effects of the previous instantiation.
Expanded onnext slide