Answering Why-Not Questions on Top-K Queries
Transcript of Answering Why-Not Questions on Top-K Queries
![Page 1: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/1.jpg)
Answering Why-not Questions on Top-K Queries
Andy He and Eric Lo
The Hong Kong Polytechnic University
![Page 2: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/2.jpg)
Background
- The database community has focused on performance issues for decades
- Recently, more people have turned their focus to usability issues
  - Supporting keyword search
  - Query auto-completion
  - Explaining your query result (a.k.a. Why and Why-Not Questions)
2/33
![Page 3: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/3.jpg)
Why-Not Questions
- You post a query Q
- The database returns you a result R
- R gives you a “surprise”
  - E.g., a tuple m that you are expecting in the result is missing, and you ask “WHY??!”
- You pose a why-not question (Q, R, m)
- The database returns you an explanation E
3/33
![Page 4: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/4.jpg)
The (short) history of Why-Not
- Chapman and Jagadish, “Why Not?” [SIGMOD 09]
  - Select-Project-Join (SPJ) queries
  - Explanation E = “tell you which operator excludes the expected tuple”
- Huang, Chen, A.H. Doan, and J. Naughton, “On the Provenance of Non-Answers to Queries Over Extracted Data” [PVLDB 09]
  - SPJ queries
  - Explanation E = “tell you how to modify the data”
4/33
![Page 5: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/5.jpg)
The (short) history of Why-Not
- Herschel and Hernández, “Explaining Missing Answers to SPJUA Queries” [PVLDB 10]
  - SPJUA queries
  - Explanation E = “tell you how to modify the data”
- Tran and C.Y. Chan, “How to ConQueR Why-Not Questions” [SIGMOD 10]
  - SPJA queries
  - Explanation E = “tell you how to modify your query”
5/33
![Page 6: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/6.jpg)
About this work
- Why-not questions on top-k queries
- Example: hotels with attributes <Price, Distance to City Center>
  - Top-3 hotels under the weighting worigin = <0.5, 0.5>
  - Result
    - Rank 1: Sheraton
    - Rank 2: Westin
    - Rank 3: InterContinental
- “WHY is my favorite, Renaissance, NOT in the top-3 result?”
  - Is my value of k too small?
  - Or should I revise my weighting?
  - Or do I need to modify both k and the weighting?
- Explanation E = “tell you how to refine your top-k query in order to get your favorites back into the result”
6/33
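To make the setting concrete, the following Python sketch runs a weighted-sum top-k query over five hotels. The attribute values, the assumption that both attributes are normalized to [0, 1] with larger meaning better, and the helper names `score`, `top_k` and `rank_of` are all illustrative; none of them come from the talk.

```python
from typing import Dict, List, Tuple

# Hypothetical (price score, distance score) pairs, assumed to be normalized
# to [0, 1] so that a larger value is always better.
HOTELS: Dict[str, Tuple[float, float]] = {
    "Sheraton": (0.9, 0.7),
    "Westin": (0.8, 0.7),
    "InterContinental": (0.9, 0.5),
    "Hilton": (0.8, 0.5),
    "Renaissance": (0.3, 0.9),
}


def score(w: Tuple[float, ...], x: Tuple[float, ...]) -> float:
    """Weighted-sum score of one tuple under weighting w."""
    return sum(wi * xi for wi, xi in zip(w, x))


def top_k(data: Dict[str, Tuple[float, ...]], w: Tuple[float, ...], k: int) -> List[str]:
    """Names of the k tuples with the highest score, best first."""
    ranked = sorted(data, key=lambda name: score(w, data[name]), reverse=True)
    return ranked[:k]


def rank_of(data: Dict[str, Tuple[float, ...]], w: Tuple[float, ...], name: str) -> int:
    """1-based rank of one tuple under weighting w."""
    return top_k(data, w, len(data)).index(name) + 1


if __name__ == "__main__":
    print(top_k(HOTELS, (0.5, 0.5), 3))
    print(rank_of(HOTELS, (0.5, 0.5), "Renaissance"))
    print(top_k(HOTELS, (0.1, 0.9), 3))
```

With these made-up values, Renaissance is ranked fifth under <0.5, 0.5> and enters the top 3 under <0.1, 0.9>, which mirrors the kind of situation the why-not question describes.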
![Page 7: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/7.jpg)
One possible answer: only modify k
- Original query Q(koriginal=3, woriginal=<0.5,0.5>)
- The ranking of Renaissance under the original weighting woriginal=<0.5,0.5>
  - Rank 1: Sheraton
  - Rank 2: Westin
  - Rank 3: InterContinental
  - Rank 4: Hilton
  - Rank 5: Renaissance
- Refined query #1: Q1(k=5, w=<0.5,0.5>)
7/33
![Page 8: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/8.jpg)
Another possible answer: only modify the weighting
- Original query Q(k=3, woriginal=<0.5,0.5>)
- Refined query #1: Q1(k=5, w=<0.5,0.5>)
- If we set the weighting w=<0.1,0.9>
  - Rank 1: Hotel E
  - Rank 2: Hotel F
  - Rank 3: Renaissance
- Refined query #2: Q2(k=3, w=<0.1,0.9>)
8/33
![Page 9: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/9.jpg)
Yet another possible answer: modify both
- Original query Q(k=3, w=<0.5,0.5>)
- Refined query #1: Q1(k=5, w=<0.5,0.5>)
- Refined query #2: Q2(k=3, w=<0.1,0.9>)
- If we set the weighting w=<0.9,0.1>
  - Rank 1: Hotel A
  - Rank 2: Hotel B
  - Rank 3: Hotel C
  - …
  - Rank 10000: Renaissance
- Refined query #3: Q3(k=10000, w=<0.9,0.1>)
9/33
![Page 10: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/10.jpg)
Our objective
- Find the refined query that minimizes a penalty function, with the missing tuple m in the top-k result
- Prefer Modify K (PMK)
- Prefer Modify Weighting (PMW)
- Never Mind (NM, default)
10/33
![Page 11: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/11.jpg)
Basic idea
- For each weighting wi ∈ W
  - Run PROGRESS(wi, UNTIL-SEE-m)
  - Obtain the ranking ri of m under the weighting wi
  - Form a refined query Qi(k=ri, w=wi)
- Return the refined query with the least penalty
- Problem: W is infinite!
11/33
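The loop on this slide can be sketched in Python for a finite list of candidate weightings. This is only an illustration under stated assumptions: `progress_until_see` is a stand-in that computes the rank of m by counting higher-scoring tuples (the talk does not show how PROGRESS is implemented), the penalty is the unnormalized form λk ∆k + λw ∆w with ∆k clamped at zero and λk = λw = 0.5, and the hotel values are made up.

```python
import math
from typing import Dict, Iterable, Tuple

Point = Tuple[float, ...]

# Hypothetical data, larger attribute values assumed to be better.
HOTELS: Dict[str, Point] = {
    "Sheraton": (0.9, 0.7),
    "Westin": (0.8, 0.7),
    "InterContinental": (0.9, 0.5),
    "Hilton": (0.8, 0.5),
    "Renaissance": (0.3, 0.9),
}


def score(w: Point, x: Point) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def progress_until_see(data: Dict[str, Point], w: Point, m: str) -> int:
    """Stand-in for PROGRESS(w, UNTIL-SEE-m): the 1-based rank of m under w."""
    target = score(w, data[m])
    return 1 + sum(1 for name, x in data.items()
                   if name != m and score(w, x) > target)


def penalty(k: int, w: Point, k_orig: int, w_orig: Point,
            lam_k: float = 0.5, lam_w: float = 0.5) -> float:
    """Assumed unnormalized penalty: lam_k * delta_k + lam_w * delta_w."""
    delta_k = max(0, k - k_orig)
    delta_w = math.dist(w, w_orig)  # Euclidean distance between weightings
    return lam_k * delta_k + lam_w * delta_w


def best_refined_query(data: Dict[str, Point], m: str, k_orig: int,
                       w_orig: Point, candidates: Iterable[Point]):
    """Try every candidate weighting and keep the least-penalty refined query."""
    best = None
    for w in candidates:
        r = progress_until_see(data, w, m)   # ranking r_i of m under w_i
        p = penalty(r, w, k_orig, w_orig)    # refined query Q_i(k=r_i, w=w_i)
        if best is None or p < best[2]:
            best = (r, w, p)
    return best


if __name__ == "__main__":
    candidates = [(0.5, 0.5), (0.1, 0.9), (0.9, 0.1)]
    print(best_refined_query(HOTELS, "Renaissance", 3, (0.5, 0.5), candidates))
```

The sketch only works because `candidates` is a finite list; the full weighting space is infinite, so it cannot be enumerated this way.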
![Page 12: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/12.jpg)
Our approach: sampling
- For each weighting wi ∈ W
  - Run PROGRESS(wi, UNTIL-SEE-m)
  - Obtain the ranking ri of m under the weighting wi
  - Form a refined query Qi(k=ri, w=wi)
- Return the refined query with the least penalty
- W is a set of weightings drawn from a restricted weighting space
- Key Theorem: The optimal refined query Qbest is either Q1, or else Qbest has a weighting wbest in a restricted weighting space.
12/33
![Page 13: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/13.jpg)
How large should the sample size be?
- We say a refined query is a best-T% refined query if its penalty is smaller than that of (1-T)% of the refined queries
- We want to obtain such a query with a probability larger than a threshold Pr
13/33
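The slide does not show the resulting formula. A standard argument, assuming the samples are drawn independently and uniformly, is that each sample falls among the best T% with probability T, so s samples contain at least one such query with probability 1 - (1 - T)^s. Requiring this to reach Pr gives s ≥ log(1 - Pr) / log(1 - T). The function name `sample_size` and the example values below are illustrative.

```python
import math


def sample_size(t: float, pr: float) -> int:
    """Smallest s with 1 - (1 - t)**s >= pr.

    t  : target quality as a fraction (0.005 means best-0.5%)
    pr : required probability of hitting a best-t query
    Assumes independent, uniform samples (an assumption of this sketch).
    """
    if not (0.0 < t < 1.0 and 0.0 < pr < 1.0):
        raise ValueError("t and pr must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - pr) / math.log(1.0 - t))


if __name__ == "__main__":
    print(sample_size(0.005, 0.7))
```

A smaller T or a larger Pr both increase the required sample size.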
![Page 14: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/14.jpg)
The PROGRESS operation can be expensive
- Original query Q(k=3, woriginal=<0.5,0.5>)
- Refined query #1: Q1(k=5, w=<0.5,0.5>)
- If we set the weighting w=<0.9,0.1>
  - Rank 1: Hotel A
  - Rank 2: Hotel B
  - Rank 3: Hotel C
  - …
  - Rank 10000: Renaissance
- Refined query: Q2(k=10000, w=<0.9,0.1>)
- Very slow!
14/33
![Page 15: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/15.jpg)
Two optimization techniques
- Stop each PROGRESS operation early
- Skip some PROGRESS operations
15/33
![Page 16: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/16.jpg)
Stop earlier
- The original query Q(k=3, worigin=<0.5,0.5>)
- Refined query #1: Q1(k=5, w=<0.5,0.5>)
- If we set the weighting w=<0.9,0.1>
  - Rank 1: Hotel A
  - Rank 2: Hotel B
  - Rank 3: Hotel C
  - …
  - Rank 5: Hotel D
  - …
16/33
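The slide gives only the example, so the following is one plausible reading rather than the authors' exact rule: once m has not appeared by a rank whose penalty already matches or exceeds the best penalty found so far (here Q1), the scan under that weighting can be abandoned. The sketch assumes the unnormalized penalty λk ∆k + λw ∆w with λk = λw = 0.5, uses a heap as a stand-in for progressive retrieval, and runs on made-up data with 100 filler hotels.

```python
import heapq
import math
from typing import Dict, Optional, Tuple

Point = Tuple[float, ...]


def score(w: Point, x: Point) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def progress_with_early_stop(data: Dict[str, Point], w: Point, m: str,
                             k_orig: int, w_orig: Point, best_penalty: float,
                             lam_k: float = 0.5, lam_w: float = 0.5
                             ) -> Tuple[Optional[int], int]:
    """Return (rank of m or None if abandoned, number of tuples retrieved)."""
    delta_w = math.dist(w, w_orig)
    # Max-heap by score (negated); tuples are popped one at a time.
    heap = [(-score(w, x), name) for name, x in data.items()]
    heapq.heapify(heap)
    scanned = 0
    while heap:
        rank = scanned + 1
        # Even if m showed up at this rank, the query could not beat the best.
        if lam_k * max(0, rank - k_orig) + lam_w * delta_w >= best_penalty:
            return None, scanned
        _, name = heapq.heappop(heap)
        scanned += 1
        if name == m:
            return rank, scanned
    return None, scanned


# Hypothetical data: five named hotels plus fillers that outrank Renaissance
# under <0.9, 0.1> but not under <0.5, 0.5>.
DATA: Dict[str, Point] = {
    "Sheraton": (0.9, 0.7), "Westin": (0.8, 0.7),
    "InterContinental": (0.9, 0.5), "Hilton": (0.8, 0.5),
    "Renaissance": (0.3, 0.9),
}
DATA.update({f"Filler {i}": (0.95, 0.0) for i in range(100)})

if __name__ == "__main__":
    q1_penalty = 0.5 * (5 - 3)  # penalty of Q1(k=5, w=<0.5,0.5>)
    print(progress_with_early_stop(DATA, (0.9, 0.1), "Renaissance",
                                   3, (0.5, 0.5), q1_penalty))
    print(progress_with_early_stop(DATA, (0.9, 0.1), "Renaissance",
                                   3, (0.5, 0.5), math.inf))
```

With the bound from Q1 the scan is abandoned after a handful of tuples, whereas without a bound it has to retrieve every tuple ranked above Renaissance.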
![Page 17: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/17.jpg)
Skip PROGRESS operation (a)
- Similar weightings may lead to similar rankings
  - Based on the “Reverse Top-K” paper, ICDE’10
- Therefore, the query result of PROGRESS(wx, UNTIL-SEE-m) could be used to deduce the query result of PROGRESS(wy, UNTIL-SEE-m)
  - Provided that wx and wy are similar
17/33
![Page 18: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/18.jpg)
Skip PROGRESS operation (a)
- E.g., original query Q(k=3, worigin=<0.5,0.5>)
- Refined query #1: Q1(k=5, w=<0.5,0.5>)
- What do the scores look like if we set w=<0.6,0.4>?

Score under w=<0.5,0.5>

| Hotel | Score |
|---|---|
| Sheraton | 10 |
| Westin | 9 |
| InterContinental | 8 |
| Hilton | 7 |
| Renaissance | 6 |

Score under w=<0.6,0.4>

| Hotel | Score |
|---|---|
| Sheraton | 9 |
| Westin | 10 |
| InterContinental | 7 |
| Hilton | 8 |
| Renaissance | 5 |
18/33
![Page 19: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/19.jpg)
Skip PROGRESS operation (b)
- We can skip a weighting w if its change ∆w from the original weighting worigin is too large.
- E.g., if we already have a refined query with a penalty of 0.5 and a weighting w has a change ∆w of 1, we can skip w entirely.
19/33
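A minimal sketch of this pruning test, assuming the unnormalized penalty λk ∆k + λw ∆w with ∆k ≥ 0: the weighting term alone is a lower bound on the penalty of any refined query that uses w, so w can be skipped when that bound already reaches the best penalty found. The function name `can_skip` and the value λw = 0.5 are illustrative assumptions.

```python
import math
from typing import List, Tuple

Point = Tuple[float, ...]


def can_skip(w: Point, w_orig: Point, best_penalty: float,
             lam_w: float = 0.5) -> bool:
    """True if no refined query with weighting w can beat best_penalty.

    Since delta_k >= 0, lam_w * delta_w is a lower bound on the penalty.
    """
    return lam_w * math.dist(w, w_orig) >= best_penalty


def prune(candidates: List[Point], w_orig: Point,
          best_penalty: float) -> List[Point]:
    """Keep only the weightings that still need a PROGRESS operation."""
    return [w for w in candidates if not can_skip(w, w_orig, best_penalty)]


if __name__ == "__main__":
    sampled = [(0.5, 0.5), (0.6, 0.4), (1.0, 0.0), (0.0, 1.0)]
    print(prune(sampled, (0.5, 0.5), best_penalty=0.3))
```

The test costs one distance computation per weighting, which is far cheaper than running PROGRESS on it.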
![Page 20: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/20.jpg)
Experiments
- Case study on NBA data
- Experiments on synthetic data
20/33
![Page 21: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/21.jpg)
Case study on NBA data
- Compare with a pure random sampling version, which does not draw samples from the restricted weighting space but from the complete weighting space
21/33
![Page 22: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/22.jpg)
Find the top-3 centers in NBA history
- 5 attributes (weighting = 1/5 each)
  - POINTS
  - REBOUND
  - BLOCKING
  - FIELD GOAL
  - FREE THROW
- Initial result
  - Rank 1: Chamberlain
  - Rank 2: Abdul-Jabbar
  - Rank 3: O’Neal
22/33
![Page 23: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/23.jpg)
Find the top-3 centers in NBA history
- Why Not ?!
- We choose “Prefer Modify Weighting”

| | Sampling on the restricted sampling space | Sampling on the whole weighting space |
|---|---|---|
| Refined query | Top-3 | Top-7 |
| ∆k | 0 | 4 |
| Time (ms) | 156 | 154 |
| Penalty | 0.069 | 0.28 |
23/33
![Page 24: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/24.jpg)
Synthetic Data
- Uniform, Anti-correlated, Correlated
- Scalability
24/33
![Page 25: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/25.jpg)
Varying query dimensions
25/33
![Page 26: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/26.jpg)
Varying ko
26/33
![Page 27: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/27.jpg)
Varying the ranking of the missing object
27/33
![Page 28: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/28.jpg)
Varying the number of missing objects
28/33
![Page 29: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/29.jpg)
Varying T%
29/33
[Figure panels: Time, Quality]
![Page 30: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/30.jpg)
Varying Pr
30/33
![Page 31: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/31.jpg)
Optimization effectiveness
31/33
![Page 32: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/32.jpg)
Conclusions
- We are the first to answer why-not questions on top-k queries
- We prove that finding the optimal answer is computationally expensive
- A sampling-based method is proposed
- The optimal answer is proved to be in a restricted sample space
- Two optimization techniques are proposed
  - Stop each PROGRESS operation early
  - Skip some PROGRESS operations
32/33
![Page 33: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/33.jpg)
Thanks
Q&A
![Page 34: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/34.jpg)
Deal with multiple missing objects M
- We have to modify the algorithm a little bit:
  - Do a simple filtering on the set of missing objects
    - If mi dominates mj in the data space, remove mi from M
    - Because every time mj shows up in a top-k result, mi must be there
  - Condition UNTIL-SEE-m becomes UNTIL-SEE-ALL-OBJECTS-IN-M
34/33
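The filtering step can be sketched as follows, assuming larger attribute values are better and all weights are non-negative, so a dominating object never scores below the object it dominates. The names `dominates` and `filter_missing` and the sample points are hypothetical.

```python
from typing import Dict, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """a dominates b: at least as good everywhere, strictly better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def filter_missing(missing: Dict[str, Point]) -> Dict[str, Point]:
    """Drop every missing object that dominates another missing object.

    Once the dominated object reaches the top-k, its dominator is there too,
    so only the dominated ones need to be tracked.
    """
    return {
        name: p for name, p in missing.items()
        if not any(dominates(p, q)
                   for other, q in missing.items() if other != name)
    }


if __name__ == "__main__":
    M = {"A": (0.9, 0.9), "B": (0.5, 0.5), "C": (0.2, 0.95)}
    print(sorted(filter_missing(M)))
```

In this example A dominates B and is removed, while C stays because neither A nor B dominates it and it dominates nothing.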
![Page 35: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/35.jpg)
Penalty Model
- Original query Q(3, worigin)
- Refined query Q1(5, worigin)
- Penalty of changing k: ∆k = 5 - 3 = 2
- Penalty of changing w: ∆w = ||worigin - worigin||2 = 0
- Basic penalty model: Penalty(5, worigin) = λk ∆k + λw ∆w (λk + λw = 1)
35/33
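As a worked illustration of the basic model, the sketch below evaluates it for the three refined queries from the earlier hotel example. The choice λk = λw = 0.5, the clamping of ∆k at zero, and the function name `penalty` are assumptions made for illustration; the normalized version on the next slide is not reproduced here.

```python
import math
from typing import Tuple

Point = Tuple[float, ...]


def penalty(k: int, w: Point, k_orig: int, w_orig: Point,
            lam_k: float = 0.5, lam_w: float = 0.5) -> float:
    """Basic penalty: lam_k * delta_k + lam_w * delta_w, lam_k + lam_w = 1."""
    delta_k = max(0, k - k_orig)     # assumed: shrinking k costs nothing
    delta_w = math.dist(w, w_orig)   # L2 distance between the weightings
    return lam_k * delta_k + lam_w * delta_w


K_ORIGIN = 3
W_ORIGIN = (0.5, 0.5)
REFINED = {
    "Q1": (5, (0.5, 0.5)),
    "Q2": (3, (0.1, 0.9)),
    "Q3": (10000, (0.9, 0.1)),
}

if __name__ == "__main__":
    for name, (k, w) in REFINED.items():
        print(name, round(penalty(k, w, K_ORIGIN, W_ORIGIN), 3))
```

Because ∆k can be in the thousands while ∆w stays small, the unnormalized form is dominated by the k term, which suggests why a normalized penalty function is useful.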
![Page 36: Answering Why-Not Questions on Top-K Queries](https://reader036.fdocuments.net/reader036/viewer/2022062523/58f28edd1a28ab7d5a8b4639/html5/thumbnails/36.jpg)
Normalized penalty function
36/33