Supporting top-k join queries in relational databases

49
Supporting top-k join queries in relational databases Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid Presented by Rebecca M. Atchley Thursday, April 19, 2007

description

Supporting top-k join queries in relational databases. Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid Presented by Rebecca M. Atchley. Thursday, April 19, 2007. Examples to Consider. single criterion ranking Example 1 – Video DB System storing video features - PowerPoint PPT Presentation

Transcript of Supporting top-k join queries in relational databases

Page 1: Supporting top-k join queries in relational databases

Supporting top-k join queries in relational databases

Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid

Presented by Rebecca M. AtchleyThursday, April 19, 2007

Page 2: Supporting top-k join queries in relational databases

2

Examples to Consider

single criterion rankingExample 1 – Video DB System storing video features

multi-criteria ranking (or top-k join query)Example 2 – Same Video DB System - Get top 10

frames most similar to query image based on color AND texture combined.

Example 3 - a user interested in finding top 5 locations where combined cost of buying a house (in Houses DB) and paying school tuition (in Schools DB) in that location is minimum.

Page 3: Supporting top-k join queries in relational databases

3

Motivation

• Sort-merge joins (MGJN) only preserves order of joined column data

• Nested-loop joins (NLJN) only orders on the outer relations are preserved through the join

• Hash join (HSJN) doesn’t preserve order if hash tables do not fit in memory

Page 4: Supporting top-k join queries in relational databases

4

Example Ranking Query Q1

SELECT A.1, B.2

FROM A, B, C

WHERE A.1 = B.1 and B.2 = C.2

ORDER BY (0.3*A.1 + 0.7*B.2)

STOP AFTER 5

Problems???

Page 5: Supporting top-k join queries in relational databases

5

What Is Needed

1) Perform basic join operation

2) Conform with current query operator interfaceSo it can be integrated into query execution plans

3) Utilize the orderings of its inputsAvoid the unnecessary sorting of the join results

4) Produce 1st ranked join results ASAP

5) Adapt to input fluctuationsMajor characteristic in applications that depend on ranking

Page 6: Supporting top-k join queries in relational databases

6

Paper’s Contributions• Proposes a new Rank-Join algorithm satisfying this

criteria along with its correctness proof• Analyzes the I/O cost of the algorithm along with proof of

its optimality• Implements the proposed algorithm in pipelined rank-join

operators (based on ripple join)Integrate into QEPs as ordinary join operationsRetain orders of inputs – avoid sort of join resultsProduce top-k results incrementally

• Proposes an optimal join strategyscore-guided and adaptive

• Provides optimization mechanism to determine best order to perform the rank-join operations

• Evaluates performance and compares other approaches

Page 7: Supporting top-k join queries in relational databases

7

Related Work

• Fagin et al. introduced the 1st set of algorithms to answer ranking queries

• The TA Algorithm

• The NRA Algorithm

• The J* Algorithm

• The NRA-RJ Algorithm

• Importance-based join processing

Page 8: Supporting top-k join queries in relational databases

8

Overview of the Ripple JoinJOIN : L.A = R.A L and R are descending, ordered by B

We get a tuple from L and a tuple from R

(L1(1,1,5) R1(1,3,5))

No valid join resultL

R

Page 9: Supporting top-k join queries in relational databases

9

3. Ripple Join --contd

JOIN: L.A = R.A

We get a second tuple from L and a second tuple from R and join with prior tuples, creating all possible combinations

(L2,R2) {(2,2,4),(2,1,4)}

L

R

Page 10: Supporting top-k join queries in relational databases

10

Ripple Join --contd

JOIN: L.A = R.A

(L2,R2) {(2,2,4),(2,1,4)}

(L2,R1) {2,2,4), (1,3,5)}

(L1,R2) {(1,1,5), (2,1,4)}

is a valid join result!

L

R

Page 11: Supporting top-k join queries in relational databases

11

Variations of the Ripple Join

Rectangle – obtain tuples from one source at a higher rate than from the other source

Block – obtain data b tuples at a time, for classic ripple join b = 1

Hash –in memory, maintain hash tables of the samples obtained so far

Faster IO Degrades to block ripple join when hash tables exceed memory size

Page 12: Supporting top-k join queries in relational databases

12

The Rank-Join Algorithm:

Page 13: Supporting top-k join queries in relational databases

13

A Rank-Join Algorithm ExampleSelect * From L, R

Where L.A = R.A

Order By L.B + R.B

Stop After 2

Initial Input

(1). Get a valid join combination using some join strategy

Ripple Select (L1, R1) => No Valid Join Result

Next input(1). Get a valid join combination using some join strategyRipple Select (L2, R2)

(L2, R2), (L2, R1), (L1, R2) => (L1, R2) is a valid join result

Page 14: Supporting top-k join queries in relational databases

14

Select * From L, R

Where L.A = R.A

Order By L.B + R.B

Stop After 2

(2). Compute the score (J) for the result

J1(L1, R2) => L.B + R.B = 5 + 4 = 9

(3). Compute a threshold scoreT = Max ( Last L.B + First R.B, First L.B + Last R.B )

For Ripple Selection (L2, R2) => T = Max ( L2.B + R1.B, L1.B + R2.B ) = Max (4+5, 5+4) = 9

Example Continued (1)

Page 15: Supporting top-k join queries in relational databases

15

Example Continued (4)

J1 = 9

T = 9

(4). J1 >= T, so report J1 in top-k results (i.e. add it to list Lk )

Since we need top 2 (k=2), continue until k=2 and Min(J1, J2, …Jk) > T

Select * From L, R

Where L.A = R.A

Order By L.B + R.B

Stop After 2

Page 16: Supporting top-k join queries in relational databases

16

Example Continued (5)

Next input1). Get a valid join combination using some join strategyRipple Select (L3, R3)

(L3, R3), (L3, R1), (L3, R2), (L1, R3), (L2, R3)

=> (L3, R3), (L2, R3) are valid join results

(2). Compute the scores (J) for the results

J2(L2, R3) = 4 + 3 = 7 J3(L3, R3) = 3 + 3= 6

Select * From L, R

Where L.A = R.A

Order By L.B + R.B

Stop After 2

Page 17: Supporting top-k join queries in relational databases

17

Example Continued (6)

(3). Calculate a NEW threshold T

T = Max ( Last L.B + First R.B, First L.B + Last R.B )

= Max ( L3.B + R1.B , L1.B + R3.B )

= Max(3 + 5, 5 + 3)

= 8

Select * From L, R

Where L.A = R.A

Order By L.B + R.B

Stop After 2

Page 18: Supporting top-k join queries in relational databases

18

Example Continued (7)

T = 8

J1(L1,R2) = 9 reported

J2( L2, R3) = 7

J3(L3, R3) = 6

Note, J’s are in descending order

(4). Min (J) = 6 < T so continue

Comment: Calculate T before J is more efficient. Can stop after find first Jk >= T

Select * From L, R

Where L.A = R.A

Order By L.B + R.B

Stop After 2

Page 19: Supporting top-k join queries in relational databases

19

Example Continued (8)

Next input(1). Get a valid join combination using some join strategyRipple Select ( L4, R4) => (L4, R1), (L2, R4), (L3, R4)

(2). Compute the scores (J) for the results

J(L4, R4) = 7, J(L2, R4) = 6, J(L3, R4) = 5

(3). Calculate a NEW threshold T

T = Max( L4.B+R1.B, L1.B + R4.B ) = Max( 7, 7 ) = 7

Select * From L, R

Where L.A = R.A

Order By L.B + R.B

Stop After 2

Page 20: Supporting top-k join queries in relational databases

20

Example Continued (9)

T= 7J1(L1,R2) = 9, J2(L2, R3) = 7, J3(L4, R1) = 7, J3(L3, R3) = 6, J4(L2, R4) = 6, J5(L3, R4) = 5(4). Min(J1, J2) = 7 >= T (k = 2), so report J2 and STOP

Comment: When reach all records, T does not need to be calculated, unless, calculate T first, and compare each J(i) with T immediately

Select * From L, R

Where L.A = R.A

Order By L.B + R.B

Stop After 2

Page 21: Supporting top-k join queries in relational databases

21

Example With Different StrategySelect * From L, R

Where L.A = R.A

Order By L.B + R.B

Stop After 2

Instead of Select Ripple, Select Rectangle Ripple…Obtain all tuples in L and only 1st tuple in R

Initial Input

(1). Get a valid join combination using some join strategyRectangle Ripple (L1 to L4, R1) => (L4, R1) is a valid join result

(2). Compute the score (J) for the resultJ1(L4, R1) = 2 + 5 = 7

(3). Calculate a NEW threshold TT = Max( L4.B+R1.B, L1.B + R1.B ) = Max( 7, 10 ) = 10

Page 22: Supporting top-k join queries in relational databases

22

Example With Different Strategy Continued (2)

Select * From L, R

Where L.A = R.A

Order By L.B + R.B

Stop After 2

(4). Min(J1, J2) = 7 < T = 10 (k = 2), so just continue

Next Input(1). Get a valid join combination using some join strategyRectangle Ripple (L1 to L4, R2) => (L1, R2) is a new valid join result

(2). Compute the score (J) for the resultJ2(L1, R2) = 9

(3). Calculate a NEW threshold TT = Max( L1.B+R1.B, L1.B + R2.B ) = Max( 10, 9 ) = 10

Page 23: Supporting top-k join queries in relational databases

23

Example With Different Strategy Continued (3)

Select * From L, R

Where L.A = R.A

Order By L.B + R.B

Stop After 2

(4). Min(J1, J2) = 9 < T = 10 (k = 2), so continue

J2 cannot be reported because of threshold = 10, but this was the top-ranked join result in prior strategy…

This suggests using join strategies that reduce the threshold value as quickly as possible to be able to report top-ranked join results early on.

Page 24: Supporting top-k join queries in relational databases

24

New Physical rank-join Operators

• Hash rank join operator (HRJN)

- Use Hash Ripple Join

- Two hash table contain the two inputs

- A queue holds ordered join results

- Ltop, Rtop, Lbottom, Rbottom are used

to calculate T

Page 25: Supporting top-k join queries in relational databases

25

Open operation of HRJN

Page 26: Supporting top-k join queries in relational databases

26

GetNext operation of HRJN

Page 27: Supporting top-k join queries in relational databases

27

HRJN Implementation Issues

1) Buffer problem

2) Local Ranking Problem

L1 L2

OP2

OP1

L3

Page 28: Supporting top-k join queries in relational databases

28

Solving the Issues

• Use Block Ripple Join to Solve Local Ranking Problem –- Set p = 2

Page 29: Supporting top-k join queries in relational databases

29

HRJN* join strategy• HRJN* is score-guided

- How to select next (block) tupleT1 = Ltop + Rbottom , T2 = Lbottom + Rtop

T = Max(T1, T2)If T1 > T2, need to reduce T1. How?

• HRJN* uses XJoin to determine input availability and use that as a guide

Page 30: Supporting top-k join queries in relational databases

30

HRJN* join strategy• HRJN* is score-guided

- How to select next (block) tupleT1 = Ltop + Rbottom , T2 = Lbottom + Rtop

T = Max(T1, T2)If T1 > T2, need to reduce T1. How? Reduce Rbottom and not reduce Lbottom (descending ordered), thus more tuples should be retrieved from R to reduce T1

• HRJN* uses XJoin to determine input availability and use that as a guide

Page 31: Supporting top-k join queries in relational databases

31

Effect of Join Order

• When more than two tables join, the join order matters. (A and C have high similarity)

Page 32: Supporting top-k join queries in relational databases

32

Determining Join Order

• Rank-Join order heuristic

- Get a ranked sample of size S from L and R

- Calculate the similarity using footrule distance

Where L(i) and R(j) are the ranks of object i in L

and object j in R and i,j is a valid join result

Page 33: Supporting top-k join queries in relational databases

33

The Rank-Join Order Algorithm

Page 34: Supporting top-k join queries in relational databases

34

Similarity-Based Join Ordering

Page 35: Supporting top-k join queries in relational databases

35

7. Generalizing the rank-join

• Using indexes1) an index on only one of the two inputs

2) an index on each of the two inputs.

• Eliminate duplications

• Faster termination

Page 36: Supporting top-k join queries in relational databases

36

Performance Evaluation

Page 37: Supporting top-k join queries in relational databases

37Selectivity = 0.2 and m= 4

Page 38: Supporting top-k join queries in relational databases

38Selectivity = 0.2 and m= 4

Page 39: Supporting top-k join queries in relational databases

39Selectivity = 0.2 and m= 4

Page 40: Supporting top-k join queries in relational databases

40m = 4 and k = 50

Page 41: Supporting top-k join queries in relational databases

41m = 4 and k = 50

Page 42: Supporting top-k join queries in relational databases

42m = 4 and k = 50

Page 43: Supporting top-k join queries in relational databases

43selectivity = 0.2% and k = 50

Page 44: Supporting top-k join queries in relational databases

44selectivity = 0.2% and k = 50

Page 45: Supporting top-k join queries in relational databases

45selectivity = 0.2% and k = 50

Page 46: Supporting top-k join queries in relational databases

46

Page 47: Supporting top-k join queries in relational databases

47

Conclusions

• New join-rank algorithm (independent of join strategy) is correct and optimal

• Physical query operator HRJN (Hash Rank Join) is based on ripple join implements algorithm

• Score-guided join strategy applied to HRJN is the HRJN* operator and integrates into QEPs

• Efficient rank-order join heuristic chooses near-optimal join order

• General rank-join algorithm uses indexes for faster termination of ranking

• Experimental evaluation shows significant performance enhancement

Page 48: Supporting top-k join queries in relational databases

48

References

• Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid: “Supporting top-k join queries in relational databases”. VLDB J. 13(3): 207-221 (2004)

• Jing Chen: CSE 6392 - Spring 2005, University of Texas at Arlington, PowerPoint slide presentation of “Supporting top-k join queries in relational databases”, http://ranger.uta.edu/~gdas/Courses/Spring2005/DBIR/slides/top-k_join.ppt.

Page 49: Supporting top-k join queries in relational databases

49