Probabilistic Ranking of Database Query Results Surajit Chaudhuri, Microsoft Research Gautam Das,...
Probabilistic Ranking of Database Query Results
Surajit Chaudhuri, Microsoft Research
Gautam Das, Microsoft Research
Vagelis Hristidis, Florida International University
Gerhard Weikum, MPI Informatik
Presented by Raghunath Ravi and Sivaramakrishnan Subramani, CSE@UTA
Roadmap
Motivation
Key Problems
System Architecture
Construction of Ranking Function
Implementation
Experiments
Conclusion and open problems
Motivation
Many-answers problem
Two alternative solutions:
◦ Query reformulation
◦ Automatic ranking
Apply probabilistic model in IR to DB tuple ranking
Example: Realtor Database
House Attributes: Price, City, Bedrooms, Bathrooms, SchoolDistrict, Waterfront, BoatDock, Year
Query: City = 'Seattle' AND Waterfront = TRUE
Too Many Results!
Intuitively, houses with a lower Price, more Bedrooms, or a BoatDock are generally preferable
Rank According to Unspecified Attributes
Score of a Result Tuple t depends on
Global Score: Global Importance of Unspecified Attribute Values [CIDR2003]
◦ E.g., Newer Houses are generally preferred
Conditional Score: Correlations between Specified and Unspecified Attribute Values
◦ E.g., Waterfront → BoatDock
◦ E.g., Many Bedrooms → Good School District
Key Problems
Given a Query Q, how to combine the Global and Conditional Scores into a Ranking Function?
◦ Use Probabilistic Information Retrieval (PIR).
How to calculate the Global and Conditional Scores?
◦ Use Query Workload and Data.
System Architecture
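PIR Review
Bayes' Rule:
$$p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)}$$
Product Rule:
$$p(a, b \mid c) = p(a \mid c)\, p(b \mid a, c)$$
Document (Tuple) t, Query Q
R: Relevant Documents
$\bar{R} = D - R$: Irrelevant Documents
$$\mathrm{Score}(t) = \frac{p(R \mid t)}{p(\bar{R} \mid t)} = \frac{p(t \mid R)\, p(R) / p(t)}{p(t \mid \bar{R})\, p(\bar{R}) / p(t)} \propto \frac{p(t \mid R)}{p(t \mid \bar{R})}$$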
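The proportionality in the score can be checked numerically. The sketch below is a toy illustration, not part of the original deck: the likelihood values, the prior `P_R`, and the function names are all invented. It shows that ranking by the posterior odds p(R|t)/p(R̄|t) gives the same order as ranking by the likelihood ratio p(t|R)/p(t|R̄), because the two differ only by the constant factor p(R)/p(R̄).

```python
# Toy check: posterior odds and likelihood ratio rank tuples identically.
# All numbers are invented for illustration.

def posterior_odds(p_t_given_r, p_t_given_not_r, p_r):
    """p(R|t) / p(not R|t) via Bayes' rule; the p(t) terms cancel."""
    p_not_r = 1.0 - p_r
    return (p_t_given_r * p_r) / (p_t_given_not_r * p_not_r)

def likelihood_ratio(p_t_given_r, p_t_given_not_r):
    """p(t|R) / p(t|not R), the query-independent-prior-free score."""
    return p_t_given_r / p_t_given_not_r

# tuple id -> (p(t|R), p(t|not R)); hypothetical values
tuples = {
    "t1": (0.30, 0.10),
    "t2": (0.05, 0.20),
    "t3": (0.40, 0.40),
}
P_R = 0.2  # hypothetical prior probability of relevance

by_odds = sorted(tuples, key=lambda t: posterior_odds(*tuples[t], P_R), reverse=True)
by_ratio = sorted(tuples, key=lambda t: likelihood_ratio(*tuples[t]), reverse=True)
```

Both orderings place `t1` first and `t2` last, so dropping the prior does not change the ranking.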
Adaptation of PIR to DB
Tuple t is considered as a document
Partition t into t(X) and t(Y), the values of the specified attributes X and the unspecified attributes Y
t(X) and t(Y) are written as X and Y
Derive from the initial scoring function until the final ranking function is obtained
Preliminary Derivation
Limited Independence Assumptions
Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed
$$p(X \mid C) = \prod_{x \in X} p(x \mid C)$$
$$p(Y \mid C) = \prod_{y \in Y} p(y \mid C)$$
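As a small worked instance of the assumption, the probability of a set of values given a condition C factorizes into a product of per-value probabilities. The sketch below uses invented probabilities and a hypothetical helper name `p_set_given_c`.

```python
from math import prod

def p_set_given_c(values, atomic):
    """p(V|C) as the product of p(v|C) over v in V (independence assumed)."""
    return prod(atomic[v] for v in values)

# Hypothetical per-value probabilities p(y|C) for two unspecified values.
p_y_given_c = {"Bedrooms=4": 0.5, "BoatDock=TRUE": 0.2}

p_Y = p_set_given_c(["Bedrooms=4", "BoatDock=TRUE"], p_y_given_c)
```

Here `p_Y` is the product 0.5 × 0.2; no joint statistics over the two Y values are needed.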
Continuing Derivation
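Pre-computing Atomic Probabilities in Ranking Function
$$\mathrm{Score}(t) \propto \prod_{y \in Y} \frac{p(y \mid R)}{p(y \mid D)} \prod_{y \in Y} \prod_{x \in X} \frac{1}{p(x \mid y, D)}$$
◦ $p(y \mid R)$: Use Workload
◦ $p(y \mid D)$, $p(x \mid y, D)$: Use Data
$p(y \mid W)$: Relative frequency in W
$p(y \mid D)$: Relative frequency in D
$p(x \mid y, W)$: (# of tuples in W that contain x, y) / total # of tuples in W
$p(x \mid y, D)$: (# of tuples in D that contain x, y) / total # of tuples in D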
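The relative-frequency estimates can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: rows are plain dictionaries, a "value" is an (attribute, value) pair, the workload is assumed to be recorded as the attribute values each query specifies, and the function name `atomic_probabilities` is hypothetical. The conditional is estimated here as count(x, y) / count(y), the usual conditional relative frequency; dividing by the total number of rows instead would yield the joint frequency of x and y.

```python
from collections import Counter
from itertools import permutations

def atomic_probabilities(rows):
    """Estimate p(v) and p(x|y) by relative frequency over `rows`.

    Each row maps attribute name -> value. The same routine can be run
    on the data D or on a workload W stored in the same shape.
    """
    n = len(rows)
    single = Counter()
    pair = Counter()
    for row in rows:
        items = list(row.items())
        single.update(items)                 # counts of each (attr, value)
        pair.update(permutations(items, 2))  # ordered pairs (x, y), x != y
    p = {v: c / n for v, c in single.items()}
    # Assumption of this sketch: p(x|y) = count(x, y) / count(y).
    p_cond = {(x, y): c / single[y] for (x, y), c in pair.items()}
    return p, p_cond

# Invented toy data table D.
D = [
    {"City": "Seattle", "Waterfront": True, "BoatDock": True},
    {"City": "Seattle", "Waterfront": True, "BoatDock": False},
    {"City": "Seattle", "Waterfront": False, "BoatDock": False},
    {"City": "Bellevue", "Waterfront": False, "BoatDock": False},
]

p_D, pc_D = atomic_probabilities(D)
```

With this toy table, `p_D[("City", "Seattle")]` is 3/4, and `pc_D[(("BoatDock", True), ("Waterfront", True))]` is 1/2, since one of the two waterfront rows has a boat dock.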
Architecture of Ranking Systems
Scan Algorithm
Preprocessing - Atomic Probabilities Module
◦ Computes and Indexes the Quantities P(y | W), P(y | D), P(x | y, W), and P(x | y, D) for All Distinct Values x and y
Execution
◦ Select Tuples that Satisfy the Query
◦ Scan and Compute Score for Each Result-Tuple
◦ Return Top-K Tuples
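The execution steps can be sketched as follows. This is a minimal illustration, assuming the score multiplies the ratios P(y | W) / P(y | D) over unspecified values y and P(x | y, W) / P(x | y, D) over pairs of specified x and unspecified y, built from the four indexed quantities. The function names, the smoothing constant `EPS`, and the toy probabilities are all hypothetical.

```python
import heapq

EPS = 1e-6  # smoothing for values missing from an index; an assumption of this sketch

def score(t, specified, p_w, p_d, pc_w, pc_d):
    """Product of workload/data ratios for one result tuple t."""
    xs = [(a, v) for a, v in t.items() if a in specified]
    ys = [(a, v) for a, v in t.items() if a not in specified]
    s = 1.0
    for y in ys:
        s *= p_w.get(y, EPS) / p_d.get(y, EPS)
        for x in xs:
            s *= pc_w.get((x, y), EPS) / pc_d.get((x, y), EPS)
    return s

def scan_top_k(result_tuples, specified, p_w, p_d, pc_w, pc_d, k):
    """Score every tuple that satisfies the query and keep the k best."""
    return heapq.nlargest(
        k, result_tuples,
        key=lambda t: score(t, specified, p_w, p_d, pc_w, pc_d))

# Invented toy inputs: query City = 'Seattle', BoatDock unspecified.
x = ("City", "Seattle")
dock, no_dock = ("BoatDock", True), ("BoatDock", False)
p_w = {dock: 0.4, no_dock: 0.1}
p_d = {dock: 0.2, no_dock: 0.8}
pc_w = {(x, dock): 0.5, (x, no_dock): 0.5}
pc_d = {(x, dock): 0.5, (x, no_dock): 0.5}

t_a = {"City": "Seattle", "BoatDock": True}
t_b = {"City": "Seattle", "BoatDock": False}
top = scan_top_k([t_b, t_a], {"City"}, p_w, p_d, pc_w, pc_d, k=1)
```

In this toy setup the tuple with a boat dock ranks first because boat docks are requested in the workload more often than they occur in the data.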
Beyond Scan Algorithm
Scan algorithm is inefficient
◦ Many tuples in the answer set
Another extreme
◦ Pre-compute top-K tuples for all possible queries
◦ Still infeasible in practice
Trade-off solution
◦ Pre-compute ranked lists of tuples for all possible atomic queries
◦ At query time, merge ranked lists to get top-K tuples
Output from Index Module
CondList Cx
◦ {AttName, AttVal, TID, CondScore}
◦ B+ tree index on (AttName, AttVal, CondScore)
GlobList Gx
◦ {AttName, AttVal, TID, GlobScore}
◦ B+ tree index on (AttName, AttVal, GlobScore)
Index Module
Preprocessing Component
Preprocessing
◦ For Each Distinct Value x of the Database, Calculate and Store the Conditional (Cx) and the Global (Gx) Lists as follows
◦ For Each Tuple t Containing x, Calculate its conditional and global scores and add them to Cx and Gx respectively
◦ Sort Cx, Gx by decreasing scores
Execution
◦ Query Q: X1=x1 AND … AND Xs=xs
◦ Execute Threshold Algorithm [Fag01] on the following lists: Cx1,…,Cxs, and Gxb, where Gxb is the shortest list among Gx1,…,Gxs
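The execution step can be illustrated with a compact Threshold Algorithm over pre-sorted lists. This is a sketch under assumptions rather than the deck's implementation: each list is an in-memory Python list of (TID, score) pairs sorted by decreasing score, a dictionary stands in for random access through the B+ tree, per-list scores are combined by multiplication, and only TIDs present in every list count as query results. The function name and the toy scores are hypothetical.

```python
import heapq

def threshold_top_k(lists, k):
    """Threshold Algorithm with product aggregation.

    `lists` holds lists of (tid, score) pairs, each sorted by decreasing
    non-negative score. Returns up to k (score, tid) pairs, best first.
    """
    lookup = [dict(lst) for lst in lists]  # random access by tid
    top = []                               # min-heap of (score, tid)
    seen = set()
    for depth in range(max(len(lst) for lst in lists)):
        last = []  # score at the current sorted-access position of each list
        for lst in lists:
            if depth >= len(lst):
                last.append(0.0)  # exhausted list: no unseen tid can qualify
                continue
            tid, s = lst[depth]
            last.append(s)
            if tid in seen:
                continue
            seen.add(tid)
            if all(tid in lk for lk in lookup):
                total = 1.0
                for lk in lookup:
                    total *= lk[tid]
                heapq.heappush(top, (total, tid))
                if len(top) > k:
                    heapq.heappop(top)
        threshold = 1.0
        for s in last:
            threshold *= s
        # Stop once k results score at least as high as any unseen tuple could.
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True)

# Invented toy lists for a query with two conditions plus one global list.
c_seattle = [(1, 0.9), (2, 0.6), (3, 0.3)]
c_waterfront = [(2, 0.8), (1, 0.5), (3, 0.4)]
g_shortest = [(3, 0.9), (1, 0.7), (2, 0.2)]

top1 = threshold_top_k([c_seattle, c_waterfront, g_shortest], k=1)
top2 = threshold_top_k([c_seattle, c_waterfront, g_shortest], k=2)
```

For `k=1` the loop stops after reading two positions of each list, because the best score found so far already exceeds the product of the scores at the current positions.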
List Merge Algorithm
Experimental Setup
Datasets:
◦ MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)
◦ Internet Movie Database (http://www.imdb.com)
Software and Hardware:
◦ Microsoft SQL Server 2000 RDBMS
◦ P4 2.8-GHz PC, 1 GB RAM
◦ C#, Connected to RDBMS through DAO
Quality Experiments
Conducted on Seattle Homes and Movies tables
Collect a workload from users
Compare Conditional Ranking Method in the paper with the Global Method [CIDR03]
Quality Experiment: Average Precision
For each query Qi, generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples
Let each user mark 10 tuples in Hi as most relevant to Qi
Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm
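The slide does not spell out the measure itself. One plausible reading, assumed here, is the fraction of the algorithm's returned tuples that the user also marked as relevant; the function name and the TIDs below are invented for illustration.

```python
def precision(user_marked, algorithm_top):
    """Fraction of the algorithm's returned tuples that the user also marked."""
    returned = set(algorithm_top)
    return len(set(user_marked) & returned) / len(returned)

# Hypothetical TIDs drawn from a 30-tuple set Hi.
user_marked = range(1, 11)     # the 10 tuples the user marked
algorithm_top = range(6, 16)   # the 10 tuples an algorithm returned

p = precision(user_marked, algorithm_top)
```

In this toy case five of the ten returned tuples were marked by the user. Averaging such values over users and queries would give one figure per ranking method.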
Quality Experiment: Fraction of Users Preferring Each Algorithm
5 new queries
Users were given the top-5 results
Performance Experiments
Datasets

| Table | NumTuples | Database Size (MB) |
|---|---|---|
| Seattle Homes | 17463 | 1.936 |
| US Homes | 1380762 | 140.432 |

Compare 2 Algorithms:
◦ Scan algorithm
◦ List Merge algorithm
Performance Experiments – Pre-computation Time
Performance Experiments – Execution Time
Conclusions and Future Work
Conclusions
◦ Completely Automated Approach for the Many-Answers Problem which Leverages Data and Workload Statistics and Correlations
◦ Based on PIR
Drawbacks
◦ Multiple-table queries
◦ Non-categorical attributes
Future Work
◦ Empty-Answer Problem
◦ Handle Plain Text Attributes
Questions?