Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides...

37
Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer Gupta

Transcript of Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides...

Page 1: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

Automated Ranking of Database Query ResultsSanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis

Presented by

Mahadevkirthi Mahadevraj

Sameer Gupta

Page 2: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

Contents

Introduction Different ranking functions Breaking ties Implementation Experiments Conclusion

Page 3: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

Introduction

Automated ranking of the results of the query is popular aspect of IR.

Database system support only a boolean query model. Empty answers

- Too selective queries. Many answers

- Too broad queries.

Page 4: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

Automated Ranking functions for the ‘Empty Answers Problem’

IDF Similarity

- Mimics the TF-IDF concept for Heterogeneous data.

QF Similarity

- Utilizes workload information.

QFIDF Similarity

- Combination of QF and IDF.

Page 5: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

Inverse Document Frequency (IDF)

IR technique

Cosine similarity between Q and D is normalized dot product of these two vectors.

whereQ = set of key wordsD = set of documents

IDF(Inverse Document Frequency) is defined as IDF(w)=log(N/F(w))

whereN = number of documents.F(w) = number of documents in which ‘w’ appears.

IDF implies most commonly occurring words convey less information The Cosine similarity may be further refined by scaling each component with the IDF of corresponding

word

Page 6: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

IDF Similarity

),(),(1

qtS kk

m

kk

QTSIM

Database(only categorical attribute)

T=<t1,……tm>

Q=<q1,…...qm> Condition of the form “WHERE A1=q1 AND … AND Am=qm “

IDFk(t)=log(n/Fk(t))

n-number of tuples in database

Fk(t) -Frequency of tuples in database where Ak=t a1 a2 : : : : a4

For a pair of values ‘u’ and ‘v’ in Ak, Sk (u,v) is defined as t1

IDFk (u) if u =v and 0 otherwise . Similarity between T and Q : :

t3

Sum of corresponding similarity coefficients over all attributes: :

•TF is irrelavant

Similarity function known as IDF similarity

Page 7: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

IDF Similarity Example

Consider a query

Select car from automobile_database

Where type=“convertible” and manufacturer=“Nissan”;

- Convertible is rare and hence IDF is high.

Page 8: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

Numerical V/s Categorical Data

Consider a query

Select house from house_database

Where price=$300,000 and no_bedrooms=10;

Page 9: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

Numeric IDF Similarity Example

Consider a query :

Select house from house_database where no_of_rooms=4

Here v=4

Attribute b1 No of Rooms(u)

Diff ( |u-v|+1) Sim= 1/Diff Output Attr

B1 5 2 0.5 B4

B2 11 6 0.166667 B1

B3 9 4 0.25 B3

B4 4 1 1 B2

Page 10: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

Generalizations of IDF similarity For numeric data

Inappropriate to use previous categorical similarity coefficients. frequency of numeric value depends on nearby values.

Discretizing numeric to categorical attribute is problematic. Solution:

{t1,t2…..tn} be the values of attribute A.For every value t,

sum of”contributions” of t from every other point t i

contributions modeled as gaussian

distribution Similarity function is

bandwidth parameter

Page 11: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

Other Generalisations

Let query ‘q’ have a C – condition where C is generalized

as “A in Q” where Q is a set of values for categorical

attributes or a range [lb,ub] for numerical attributes.

The generalized similarity function is as shown :

Page 12: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

Problems with IDF Similarity

Problem :

In a realtor database, more homes are built in

recent years such as 2007 and 2008 as compared to 1980

and1981.Thus recent years have small IDF. Yet newer

homes have higher demand.

Solution : QF Similarity.

Page 13: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

QF Similarity : leveraging workloads

Importance of attribute values is determined by frequency of their occurrence in workload.

In the example above , it is reasonable to assume that more

queries are requesting for newer homes than for older homes. Thus the frequency of the year 2008 appearing in the workload will be more than that of year 1981.

Page 14: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

QF Similarity

For categorical data

query frequency QF(q)=

raw frequency of occurrence of value q of attribute A in query strings of workload (RQF(q) _______________________________________________________________

raw frequency of most frequently occuring value in workload (RQFMax)

s(t,q)= QF(q), if q=t 0 , otherwise

Page 15: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

QF similarity example

Consider a workload where attribute

A= { 1,1,2,3,4,5,5,5,5,2}

If now a query requests for A=1, then

QF (1) = RQF(1)/RQFMax = 2/4 .

If a query requests for an attribute value not in the

workload, then QF=0.

Page 16: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

QF similarity : Different Attributes

Similarity between pairs of different categorical attribute values can also be derived from workload

eg. To find S(TOYOTA CAMRY,HONDA ACCORD)

The similarity coefficient between tuple and query in this case is defined by jaccard coefficient scaled by QF factor as shown below.

S(t,q)=J(W(t),W(q))QF(q)

Page 17: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

Analyzing workloads

Analyzing IN clauses of queries:If certain pair of values often occur together in the workload ,they are similar .e.g. queries with C as “MFR IN {TOYOTA,HONDA,NISSAN}”

Several recent queries in workload by a specific user repeatedly requesting for TOYOTA and HONDA.

Numerical values that occur in the workload can also benefit from query frequency analysis.

Page 18: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

QFIDF Similarity

QF is purely workload-based. Big disadvantage for insufficient or

unreliable workloads. For QFIDF Similarity

S(t,q)=QF(q) *IDF(q) when t=q

where QF(q)=(RQF(q)+1)/(RQFMax+1). Thus we get small non zero value even if value is never referenced in

workload model

Page 19: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

Breaking ties using QF

Problem: Many tuples may tie for the same similarity score and get ordered arbitarily.Arise in empty and many answers problem.

Solution: Determine the weights of missing attribute values that reflect their “global importance” for ranking purposes by using workload information. Extend QF similarity ,use quantity to break ties.

Consider a query requesting for 4 bedroom houses .

- Arlington is less important than dallas .

Page 20: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

Problems with Breaking ties using IDF

large IDF scenario for missing attributes.- Arlington homes are given more preference than dallas

homes since Arlington has a higher IDF, but this scenario is not true in real practice.

Small IDF scenario for missing attributes.- Consider homes with decks , but since we are considering smaller

IDF preference will be given to homes without decks since they have a smaller IDF which is not true in real practice.

Page 21: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

Implementation

Pre-processing component Query–processing component

Page 22: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

Pre-processing component

Compute and store a representation of similarity function in auxiliary database tables. For categorical data, compute IDF(t) (resp QF(t)) ,to compute frequency

of occurences of values in database and store the results in auxillary database tables.

For numeric data, an approximate representation of smooth function IDF() (resp(QF()) is stored, so that function value is retrieved at runtime.

Page 23: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

Query processing component Main task: Given a query Q and an integer K, retrieve Top-K tuples

from the database using one of the ranking functions. Ranking function is extracted in pre-processing phase. SQL-DBMS functionality used for solving top-K problem.

Handling simpler query processing problem Input: table R with M categorical columns, Key column TID, C is

conjunction of form Ak=qk..... and integer K.

Output: top-K tuples of R similar to Q. Similarity function: Overlap Similarity.

Page 24: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

Implementation of Top-K operator

Traditional approach ?

Indexed based approach

overlap similarity function satisfies the following monotonic property. Adapt Fagin’s TA algorithm. If T and U are two tuples such that for all K, Sk(tk,qk)< Sk(uk,qk)

then SIM(T,Q) < SIM(U,Q) To adapt TA implement Sorted and random access methods. Performs sorted access for each attribute, retrieve complete tuples with corresponding TID by

random access and maintains buffer of Top-K tuples seen so far.

Page 25: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

Read all grades of an object once seen from a sorted access• No need to wait until the lists give k common objects

Do sorted access (and corresponding random accesses) until you have seen the top k answers.

• How do we know that grades of seen objects are higher than the grades of unseen objects ?

• Predict maximum possible grade unseen objects:

a: 0.9

b: 0.8

c: 0.72....

L1L2

d: 0.9

a: 0.85

b: 0.7

c: 0.2

.

.

.

.f: 0.65

d: 0.6

f: 0.6

Seen

Possibly unseen Threshold value

Threshold Algorithm (TA)

T = min(0.72, 0.7) = 0.7

Page 26: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

ID A1 A2 Min(A1,A2)

Step 1: - parallel sorted access to each list

(a, 0.9)

(b, 0.8)

(c, 0.72)

(d, 0.6)

.

.

.

.

L1 L2

(d, 0.9)

(a, 0.85)

(b, 0.7)

(c, 0.2)

.

.

.

.

a

d

0.9

0.9

0.85 0.85

0.6 0.6

For each object seen: - get all grades by random access - determine Min(A1,A2) - amongst 2 highest seen ? keep in buffer

Example – Threshold Algorithm

Page 27: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

ID A1 A2 Min(A1,A2)a: 0.9

b: 0.8

c: 0.72

d: 0.6

.

.

.

.

L1 L2

d: 0.9

a: 0.85

b: 0.7

c: 0.2

.

.

.

.

Step 2: - Determine threshold value based on objects currently seen under sorted access. T = min(L1, L2)

a

d

0.9

0.9

0.85 0.85

0.6 0.6

T = min(0.9, 0.9) = 0.9

- 2 objects with overall grade ≥ threshold value ? stop else go to next entry position in sorted list and repeat step 1

Example – Threshold Algorithm

Page 28: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

ID A1 A2 Min(A1,A2)

Step 1 (Again): - parallel sorted access to each list

(a, 0.9)

(b, 0.8)

(c, 0.72)

(d, 0.6)

.

.

.

.

L1 L2

(d, 0.9)

(a, 0.85)

(b, 0.7)

(c, 0.2)

.

.

.

.

a

d

0.9

0.9

0.85 0.85

0.6 0.6

For each object seen: - get all grades by random access - determine Min(A1,A2) - amongst 2 highest seen ? keep in buffer

b 0.8 0.7 0.7

Example – Threshold Algorithm

Page 29: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

ID A1 A2 Min(A1,A2)a: 0.9

b: 0.8

c: 0.72

d: 0.6

.

.

.

.

L1 L2

d: 0.9

a: 0.85

b: 0.7

c: 0.2

.

.

.

.

Step 2 (Again): - Determine threshold value based on objects currently seen. T = min(L1, L2)

a

b

0.9

0.7

0.85 0.85

0.8 0.7

T = min(0.8, 0.85) = 0.8

- 2 objects with overall grade ≥ threshold value ? stop else go to next entry position in sorted list and repeat step 1

Example – Threshold Algorithm

Page 30: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

ID A1 A2 Min(A1,A2)a: 0.9

b: 0.8

c: 0.72

d: 0.6

.

.

.

.

L1 L2

d: 0.9

a: 0.85

b: 0.7

c: 0.2

.

.

.

.

Situation at stopping condition

a

b

0.9

0.7

0.85 0.85

0.8 0.7

T = min(0.72, 0.7) = 0.7

Example – Threshold Algorithm

Page 31: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

Indexed-based TA(ITA)

Sorted access

Random access

Page 32: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

Indexed-based TA(ITA)

Stopping Condition

Hypothetical tuple – current value a1,…, ap for A1,… Ap, corresponding to index seeks on L1,…, Lp and qp+1,….. qm for remaining columns from the query directly.

Termination – Similarity of hypothetical tuple to the query< tuple in Top-k buffer with least similarity.

Page 33: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

ITA for Numeric columns

Consider a query has condition Ak = qk for a numeric column Ak.

Two index scan is performed on Ak.

- First retrieve TID’s > qk in incresing order.

- Second retrieve TID’s < qk in decreasing order.

We then pick TID’s from the merged stream.

ITA can be extended for IN,range queries . Additional

challenges arise for ranking function over set of tables.

Page 34: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

Ranking functions quality v/s workload

Experiments

Page 35: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

Varying no of attributes Varying K in Top-K

Experiments

Page 36: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

Conclusion

Automated Ranking Infrastructure for SQL databases. Extended TF-IDF based techniques from Information

retrieval to numeric and mixed data. Implementation of Ranking function that exploited

indexed access (Fagin’s TA)

Page 37: Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.