Query Flocks

Query Flocks Umar Hammoud Elizabeth Cash March 25, 2003


Page 1: Query Flocks

Query Flocks

Umar Hammoud

Elizabeth Cash

March 25, 2003

Page 2: Query Flocks

Presentation Based On

Paper Title: Query Flocks: A Generalization of Association-Rule Mining

Authors: Dick Tsur, Jeffrey D. Ullman, Serge Abiteboul, Chris Clifton, Rajeev Motwani, Svetlozar Nestorov, Arnon Rosenthal

Page 3: Query Flocks

Association-Rule

• The goal is to find sets of items that are associated

• The fact that they are associated is called an association rule

Page 4: Query Flocks

Market Basket Mining

• Understand the behavior of the customers when they shop to improve marketing

• An attempt by a retail store to learn which items its customers purchase together

• A way to find items that tend to appear together in a market basket

Page 5: Query Flocks

Precise Measures of Association

Given a relation baskets(BID, Item), where BID is the basket ID:

1. Support: The items must appear in many baskets.

2. Confidence: The probability of one item given that the others are in the basket must be high.

3. Interest: That probability must be significantly higher or lower than the expected probability if items were purchased at random.

Page 6: Query Flocks

Examples: Measures of Association

People who buy milk often buy cereal: {cereal, milk}

1. High support means that many people buy both cereal and milk.

2. High confidence means that a lot of people who buy cereal also buy milk.

3. High interest means that if you buy cereal, then you are much more likely to buy milk than the general population.
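The three measures can be made concrete on a toy baskets(BID, Item) relation. The sketch below is illustrative only: the data is invented, the helper name `baskets_with` is hypothetical, and interest is taken to be the difference between the confidence and the overall probability of milk, which is one common reading of the definition above and an assumption here.

```python
# Toy baskets(BID, Item) relation; the data is invented for illustration.
baskets = [
    (1, "cereal"), (1, "milk"),
    (2, "cereal"), (2, "milk"),
    (3, "cereal"),
    (4, "milk"),
    (5, "bread"),
]

def baskets_with(items):
    """Return the set of basket IDs that contain every item in `items`."""
    by_basket = {}
    for bid, item in baskets:
        by_basket.setdefault(bid, set()).add(item)
    return {bid for bid, contents in by_basket.items() if set(items) <= contents}

n_baskets = len({bid for bid, _ in baskets})

# Support: number of baskets containing both cereal and milk.
support = len(baskets_with({"cereal", "milk"}))

# Confidence: fraction of cereal baskets that also contain milk.
confidence = support / len(baskets_with({"cereal"}))

# Interest (assumed definition): confidence minus the overall fraction of
# baskets containing milk, i.e. how far the rule departs from random purchase.
p_milk = len(baskets_with({"milk"})) / n_baskets
interest = confidence - p_milk
```

A positive interest says that cereal buyers are more likely than the general population to buy milk; a value near zero says that cereal tells us little about milk.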

Page 7: Query Flocks

Association-Rule Optimization

Association-rule mining can be optimized by taking advantage of many query-optimization ideas (e.g., “a-priori”)

Page 8: Query Flocks

The A-Priori Optimization

• Using this technique, tuples can be eliminated before the join

Let S be a set of items that appear together in at least n baskets,

and let S’ be a subset of S.

Then S’ also appears in at least n baskets.
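Read in the other direction, the property says that a pair can only reach the threshold if each of its items does, so infrequent items can be discarded before pairs are counted. A minimal sketch of that pruning step, assuming baskets are given as a dictionary from basket ID to a set of items (the function name `frequent_pairs` and the sample data are hypothetical):

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(basket_items, n):
    """Return the item pairs that appear together in at least n baskets.

    basket_items maps a basket ID to the set of items in that basket.
    """
    # Pass 1: count single items and keep those in at least n baskets.
    item_counts = Counter(item for items in basket_items.values() for item in items)
    frequent_items = {item for item, count in item_counts.items() if count >= n}

    # Pass 2: count only pairs made of frequent items (the a-priori pruning).
    pair_counts = Counter()
    for items in basket_items.values():
        candidates = sorted(items & frequent_items)
        pair_counts.update(combinations(candidates, 2))

    return {pair for pair, count in pair_counts.items() if count >= n}

example = {
    1: {"milk", "cereal", "bread"},
    2: {"milk", "cereal"},
    3: {"milk", "eggs"},
    4: {"cereal"},
}
result = frequent_pairs(example, 2)
```

Here bread and eggs are dropped after the first pass, so the second pass never counts a pair that involves them.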

Page 9: Query Flocks

A-Priori Generalization

• Extended to provide efficient mining of very large databases, for many different kinds of patterns.

• Can be used for:

– general-purpose mining systems

– future generations of query optimizers

• Known as “Query flocks”

Page 10: Query Flocks

Query Flocks

• A parameterized query with a filter condition to eliminate the “uninteresting” values of the parameters

• Represented in Datalog

Page 11: Query Flocks

Mining Languages

Can SQL be used as a mining Language?

 In principle, it can, but the right optimizations are not there.

Page 12: Query Flocks

SQL: What’s the Problem?

SELECT i1.Item, i2.Item
FROM baskets i1, baskets i2
WHERE i1.Item < i2.Item AND
      i1.BID = i2.BID
GROUP BY i1.Item, i2.Item
HAVING 20 <= COUNT(i1.BID)

• The A-Priori trick has not been implemented by any conventional optimizer

• Better performance can be achieved if the query is rewritten in the following way:

– First find the items that appear in at least 20 baskets

– Then join the set of these items with the baskets relation
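One way the rewritten query might look is sketched below using Python's built-in sqlite3 module on an in-memory table. The sample rows, the CTE name `frequent`, and the variable names are assumptions made for illustration; the slide does not give the rewritten SQL, and each (BID, Item) pair is assumed to occur at most once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baskets (BID INTEGER, Item TEXT)")

# Invented data: milk and cereal in 25 baskets, caviar in only 3 of them.
rows = [(bid, item) for bid in range(1, 26) for item in ("milk", "cereal")]
rows += [(bid, "caviar") for bid in range(1, 4)]
conn.executemany("INSERT INTO baskets VALUES (?, ?)", rows)

# The query as written on the slide.
original = """
SELECT i1.Item, i2.Item
FROM baskets i1, baskets i2
WHERE i1.Item < i2.Item AND i1.BID = i2.BID
GROUP BY i1.Item, i2.Item
HAVING 20 <= COUNT(i1.BID)
"""

# A-priori rewrite: first find items in at least 20 baskets, then restrict
# both sides of the self-join to those items.
rewritten = """
WITH frequent(Item) AS (
    SELECT Item FROM baskets GROUP BY Item HAVING COUNT(BID) >= 20
)
SELECT i1.Item, i2.Item
FROM baskets i1, baskets i2
WHERE i1.Item < i2.Item AND i1.BID = i2.BID
  AND i1.Item IN (SELECT Item FROM frequent)
  AND i2.Item IN (SELECT Item FROM frequent)
GROUP BY i1.Item, i2.Item
HAVING 20 <= COUNT(i1.BID)
"""

original_pairs = conn.execute(original).fetchall()
rewritten_pairs = conn.execute(rewritten).fetchall()
```

Both queries return the same pairs; the rewrite simply keeps the caviar rows out of the self-join, which is where the savings come from on large data.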

Page 13: Query Flocks

Mining with Flocks

• Many data-mining problems can benefit from the A-Priori technique as an optimization

• The formalism of “query flocks” is an important tool for building better optimizers

Page 14: Query Flocks

Query Flocks

• A family of queries, identical except for the values of their parameters, that are asked simultaneously

• The answers to these queries are filtered

• The queries whose answers pass the filter contribute their parameter values to the answer of the flock

Page 15: Query Flocks

Query Flock Settings

• Queries are parameterized by one or more parameters

• Ability to express filter conditions about the results of the query

Page 16: Query Flocks

Query Flock Designation

• One or more predicates that represent data stored as relations

• A set of parameters with names starting with $

• A query

• A filter that specifies a condition

Page 17: Query Flocks

Language for Flocks

• “Conjunctive Queries” augmented with arithmetic and with union

• Datalog is used rather than SQL because it gives the following capabilities:

– The notion of a “safe query” for Datalog figures into potential optimizations

– The set of options for adapting the A-Priori trick to arbitrary flocks is most easily expressed in Datalog

• SQL is used for the filter language only

Page 18: Query Flocks

Market Basket as a Query Flock

QUERY:
answer(B) :-
    baskets(B,$1) AND
    baskets(B,$2)

FILTER:
COUNT(answer.B) >= 20
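The flock can be read operationally as: for every assignment of values to $1 and $2, run the query, then keep the assignment only if the answer passes the filter. A deliberately naive sketch of that reading follows; the function name `evaluate_flock` and the sample data are hypothetical, and no optimization is attempted.

```python
from itertools import product

def evaluate_flock(baskets, threshold):
    """Return the ($1, $2) assignments whose answer passes the filter.

    baskets is a set of (B, item) tuples.
    """
    items = {item for _, item in baskets}
    accepted = set()
    for p1, p2 in product(items, repeat=2):
        # QUERY: answer(B) :- baskets(B,$1) AND baskets(B,$2)
        answer = {b for b, item in baskets if item == p1 and (b, p2) in baskets}
        # FILTER: COUNT(answer.B) >= threshold
        if len(answer) >= threshold:
            accepted.add((p1, p2))
    return accepted

demo = {(1, "milk"), (1, "cereal"), (2, "milk"), (2, "cereal"), (3, "milk")}
accepted_pairs = evaluate_flock(demo, 2)
```

As written, the result contains both orderings of each pair as well as pairs of an item with itself, because nothing in the query relates $1 to $2. It also runs one query per assignment, which is the cost the a-priori style of pruning is meant to reduce.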

Page 19: Query Flocks

Language Extensions

• To apply the proposed query optimizations, extensions must be added:

– Negated subgoals

– Arithmetic subgoals for variables and parameters

Page 20: Query Flocks

Extensions Usage

• Add arithmetic extension to the previous query to restrict item pairs to appear in lexicographic order

answer(B) :-
    baskets(B,$1) AND
    baskets(B,$2) AND
    $1 < $2

Page 21: Query Flocks

Extensions Usage

• Given the following relations:

diagnoses(Patient, Disease)
exhibits(Patient, Symptom)
treatments(Patient, Medicine)
causes(Disease, Symptom)

Find unexplained side effects

QUERY:
answer(P) :-
    exhibits(P,$s) AND
    treatments(P,$m) AND
    diagnoses(P,D) AND
    NOT causes(D,$s)

FILTER: COUNT(answer.P) >= 20
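A direct evaluation of this flock is sketched below, assuming each relation is a Python set of tuples. The function name `side_effect_flock` and the sample data are invented. The negated subgoal is read in the usual Datalog way: a patient P is in the answer if some disease D diagnosed for P does not cause the symptom $s.

```python
from collections import defaultdict

def side_effect_flock(diagnoses, exhibits, treatments, causes, threshold):
    """Return the ($s, $m) pairs whose answer has at least `threshold` patients."""
    patients = defaultdict(set)  # ($s, $m) -> patients P in answer(P)
    for p, s in exhibits:
        # diagnoses(P,D) AND NOT causes(D,$s): some diagnosis of P
        # does not explain the symptom.
        unexplained = any(
            pd == p and (d, s) not in causes for pd, d in diagnoses
        )
        if not unexplained:
            continue
        for pt, m in treatments:
            if pt == p:
                patients[(s, m)].add(p)
    # FILTER: COUNT(answer.P) >= threshold
    return {params for params, ps in patients.items() if len(ps) >= threshold}

diagnoses = {(1, "flu"), (2, "flu"), (3, "flu")}
exhibits = {(1, "rash"), (2, "rash"), (3, "rash"),
            (1, "fever"), (2, "fever"), (3, "fever")}
treatments = {(1, "drugX"), (2, "drugX"), (3, "drugX")}
causes = {("flu", "fever")}

suspects = side_effect_flock(diagnoses, exhibits, treatments, causes, threshold=3)
```

In the sample data, fever is explained by flu, so only the rash remains as a candidate side effect of drugX.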

Page 22: Query Flocks

Generalizing A-Priori Techniques

• Evaluate the less expensive query first

The answer allows us to upper-bound the size of the answer obtained with certain parameter values.

If the bound is less than the filter threshold, eliminate those parameter values without further consideration.

For query Q1 to put an upper bound on the size of the result of query Q2, it must be provable that the result of Q2 is a subset of the result of Q1.

• The containment-mapping theorem says: Q2 ⊆ Q1 can hold if Q1 is constructed from Q2 by:

1. Taking a subset of the subgoals of Q2, and

2. Splitting zero or more variables into several variables.

Page 23: Query Flocks

Safe Query Example

answer(B) :- baskets(B,$1) AND baskets(B,$2)

 Two safe subqueries are formed by taking proper subsets of its subgoals:

answer(B) :- baskets(B,$1)

and

answer(B) :- baskets(B,$2)

Page 24: Query Flocks

Safe Query Example cont.

If we take the first subquery, we can ask:

• For which values of $1 does the query answer(B) :- baskets(B,$1) produce a number of values of B that is over the threshold given in the filter?

• Any other value of $1 can be eliminated as a member of a pair of items meeting the filter condition

Page 25: Query Flocks

Search for Optimal Query-Flock Evaluators

R(P) := FILTER(P,Q,C)

P: the set of parameters

Q: a query involving the parameters P

R: a relation whose tuples are values of the parameters P

C: a condition on the result of the query Q

Page 26: Query Flocks

A Query Plan

okS($s) := FILTER($s, answer(P) :- exhibits(P,$s), COUNT(answer.P) >= 20);

okM($m) := FILTER($m, answer(P) :- treatments(P,$m), COUNT(answer.P) >= 20);

ok($s,$m) := FILTER({$s,$m},
    answer(P) :- okS($s) AND okM($m) AND diagnoses(P,D) AND exhibits(P,$s) AND treatments(P,$m) AND NOT causes(D,$s),
    COUNT(answer.P) >= 20);
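The three steps of this plan can be traced in ordinary code. The following is a sketch under the assumption that each relation is a set of tuples; the function name `run_plan`, the local variable names, and the sample data are hypothetical. The first two steps prune symptoms and medicines that cannot reach the threshold on their own, and the last step evaluates the full query only for the surviving values.

```python
from collections import defaultdict

def run_plan(diagnoses, exhibits, treatments, causes, threshold=20):
    """Evaluate the side-effect flock using the okS / okM / ok plan."""
    # okS($s): symptoms exhibited by at least `threshold` patients.
    by_symptom = defaultdict(set)
    for p, s in exhibits:
        by_symptom[s].add(p)
    ok_s = {s for s, ps in by_symptom.items() if len(ps) >= threshold}

    # okM($m): medicines given to at least `threshold` patients.
    by_medicine = defaultdict(set)
    for p, m in treatments:
        by_medicine[m].add(p)
    ok_m = {m for m, ps in by_medicine.items() if len(ps) >= threshold}

    # ok($s,$m): the full query, restricted to surviving parameter values.
    diseases = defaultdict(set)
    for p, d in diagnoses:
        diseases[p].add(d)
    medicines = defaultdict(set)
    for p, m in treatments:
        if m in ok_m:
            medicines[p].add(m)

    answer = defaultdict(set)  # ($s, $m) -> patients P
    for p, s in exhibits:
        if s not in ok_s:
            continue
        # diagnoses(P,D) AND NOT causes(D,$s)
        if any((d, s) not in causes for d in diseases[p]):
            for m in medicines[p]:
                answer[(s, m)].add(p)
    return {params for params, ps in answer.items() if len(ps) >= threshold}

diagnoses = {(1, "flu"), (2, "flu"), (3, "flu")}
exhibits = {(1, "rash"), (2, "rash"), (3, "hives"), (1, "fever"), (2, "fever")}
treatments = {(1, "drugX"), (2, "drugX"), (3, "drugY")}
causes = {("flu", "fever")}

plan_result = run_plan(diagnoses, exhibits, treatments, causes, threshold=2)
```

With a threshold of 2, hives and drugY are eliminated by the first two steps and never take part in the final, more expensive step.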

Page 27: Query Flocks

Is There a Rule for Generating the Query Plans?

• Consider only sequences of filter steps that satisfy these conditions:

– Each step must use the same filter condition as the original query flock

– Each step must define a uniquely named relation

– The final step must not delete any subgoals of the original query; it may have additional subgoals derived from previous steps, of course

– Each step is derived from the given query flock as follows:

  • Start with the original query flock

  • Add in zero or more subgoals that are copies of the left side of the assignment ( := ) in some previous filter step

  • Delete zero or more subgoals but, following the optimization principle for conjunctive queries, make sure that the resulting query is safe

Page 28: Query Flocks

Exponential Search:Query Plan

• Candidate for best possible plan

– A long sequence of steps in which each uses the results of the previous step

• How to restrict the search

– Select sets of parameters

– Select a list of subsets of the subgoals of the original query that form safe queries

Page 29: Query Flocks

Dynamic Selection of Filter Steps

• We let the sizes of intermediate relations determine whether or not to apply filters

• The important special case is when the set of parameters for a relation has not previously been encountered:

– If the support threshold is low, then filtering is likely to be useful

– If the support threshold is high, then filtering is unlikely to be useful

Page 30: Query Flocks

Possible Query Plan

temp1($s) := FILTER($s,
    answer(P) :- exhibits(P,$s),
    COUNT(answer.P) >= 20
)

temp2(P,$s,$m) := (temp1($s) JOIN exhibits(P,$s)) JOIN treatments(P,$m)

temp3($s,$m) := FILTER({$s,$m},
    answer(P) :- temp2(P,$s,$m),
    COUNT(answer.P) >= 20
)

temp4(P,D,$s,$m) := ((temp3($s,$m) JOIN temp2(P,$s,$m)) JOIN diagnoses(P,D)) JOIN (NOT causes(D,$s))

sideEffect($s,$m) := FILTER({$s,$m},
    answer(P) :- temp4(P,D,$s,$m),
    COUNT(answer.P) >= 20
)

Page 31: Query Flocks

Conclusions

• It’s a generate-and-test model for data-mining problems

• Uses "parts of queries" constructively to prune answer sets for main queries

• Provides a parameterized way to specify a set of queries, whose answer is the parameter(s)

Page 32: Query Flocks

So What Should Tim Tell His Mother?

• In one sentence:

Generalization of query optimization techniques to be used for data mining.

• And Questions?