Query Flocks

Umar Hammoud

Elizabeth Cash

March 25, 2003

Presentation Based On

Paper Title: Query Flocks: A Generalization of Association-Rule Mining

Authors: Dick Tsur, Jeffrey D. Ullman, Serge Abiteboul, Chris Clifton, Rajeev Motwani, Svetlozar Nestorov, Arnon Rosenthal

Association-Rule

• The goal is to find sets of items that are associated

• The fact of their association is called an association rule

Market Basket Mining

• Understand the behavior of the customers when they shop to improve marketing

• An attempt by a retail store to learn which items its customers purchase together

• A way to find items that tend to appear together in a market basket

Precise Measures of Association

Given a relation baskets(BID, Item), where BID is the basket ID:

1. Support: The items must appear in many baskets.

2. Confidence: The probability of one item given that the others are in the basket must be high.

3. Interest: That probability must be significantly higher or lower than the expected probability if items were purchased at random.

Examples: Measures of Association

People who buy milk often buy cereal: {cereal, milk}

1. High support means that many people buy both cereal and milk.

2. High confidence means that a lot of people who buy cereal also buy milk.

3. High interest means that if you buy cereal, then you are much more likely to buy milk than the general population.
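
The three measures can be made concrete with a small sketch. The basket contents below are made-up toy data, and interest is computed here as the ratio of confidence to the item's overall frequency, which is one common choice rather than the slides' own definition.

```python
from typing import Dict, Set

# Hypothetical toy data: basket ID -> set of items in that basket.
baskets: Dict[int, Set[str]] = {
    1: {"cereal", "milk"},
    2: {"cereal", "milk", "bread"},
    3: {"cereal"},
    4: {"milk"},
    5: {"bread"},
}


def support(items: Set[str]) -> int:
    """Number of baskets that contain every item in `items`."""
    return sum(1 for contents in baskets.values() if items <= contents)


def confidence(given: Set[str], target: str) -> float:
    """Fraction of baskets containing `given` that also contain `target`."""
    return support(given | {target}) / support(given)


def interest(given: Set[str], target: str) -> float:
    """Confidence divided by the baseline frequency of `target`.

    A value well above or below 1 suggests the items are not independent.
    """
    baseline = support({target}) / len(baskets)
    return confidence(given, target) / baseline
```

On this toy data, {cereal, milk} has support 2, and the confidence of milk given cereal is 2/3.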

Association-Rule Optimization

Can be optimized by taking advantage of many of the query optimization ideas (e.g. “a-priori”)

The A-Priori Optimization

• Using this technique tuples can be eliminated before the join

Let S be a set of items that appear in at least n baskets

And S’ is subset of S

Then S’ appears in at least n baskets
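
A minimal sketch of how this property prunes work when counting pairs: an item that is not itself in at least n baskets cannot belong to a pair that is. The function name `frequent_pairs` and the dictionary layout of the input are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Set, Tuple


def frequent_pairs(
    baskets: Dict[int, Set[str]], threshold: int
) -> Dict[Tuple[str, str], int]:
    """Return item pairs appearing together in at least `threshold` baskets."""
    # Pass 1: count single items and keep only the frequent ones.
    item_counts = Counter(item for contents in baskets.values() for item in contents)
    frequent_items = {item for item, n in item_counts.items() if n >= threshold}

    # Pass 2: count pairs, but only among frequent items (the A-Priori pruning).
    pair_counts: Counter = Counter()
    for contents in baskets.values():
        kept = sorted(contents & frequent_items)
        pair_counts.update(combinations(kept, 2))

    return {pair: n for pair, n in pair_counts.items() if n >= threshold}
```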

A-Priori Generalization

• Extended to provide efficient mining of very large databases, for many different kinds of patterns.

• Can be used for:

– general-purpose mining systems

– future generations of query optimizers

• Known as “Query flocks”

Query Flocks

• A parameterized query with a filter condition to eliminate the “uninteresting” values of the parameters

• Represented in Datalog

Mining Languages

Can SQL be used as a mining Language?

In principle it can, but conventional SQL optimizers do not perform the right optimization.

SQL: What’s the Problem?

SELECT i1.Item, i2.Item
FROM baskets i1, baskets i2
WHERE i1.Item < i2.Item AND
      i1.BID = i2.BID
GROUP BY i1.Item, i2.Item
HAVING 20 <= COUNT(i1.BID)

• The A-Priori trick has not been implemented by any conventional optimizer

• Better performance can be achieved if the query is rewritten in the following way:

– First find those items that appear in at least 20 baskets

– Then join the set of these items with the baskets relation
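
The rewrite can be tried out with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the rows are made-up toy data, the threshold is lowered from 20 to 2 so the toy data produces results, and the `ok` common table expression is one possible way to express the "frequent items first" step, not the paper's formulation.

```python
import sqlite3

THRESHOLD = 2  # the slides use 20; lowered here for the toy data

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baskets (BID INTEGER, Item TEXT)")
conn.executemany(
    "INSERT INTO baskets VALUES (?, ?)",
    [(1, "a"), (1, "b"), (1, "c"), (2, "a"), (2, "b"),
     (3, "a"), (3, "c"), (4, "b"), (4, "d")],
)

# The query from the slide, with the threshold as a bound parameter.
NAIVE = """
SELECT i1.Item, i2.Item
FROM baskets i1, baskets i2
WHERE i1.Item < i2.Item AND i1.BID = i2.BID
GROUP BY i1.Item, i2.Item
HAVING ? <= COUNT(i1.BID)
"""

# Hand-applied A-Priori: restrict both sides of the join to frequent items.
REWRITTEN = """
WITH ok(Item) AS (
    SELECT Item FROM baskets GROUP BY Item HAVING COUNT(BID) >= ?
)
SELECT i1.Item, i2.Item
FROM baskets i1, baskets i2
WHERE i1.Item < i2.Item AND i1.BID = i2.BID
  AND i1.Item IN (SELECT Item FROM ok)
  AND i2.Item IN (SELECT Item FROM ok)
GROUP BY i1.Item, i2.Item
HAVING ? <= COUNT(i1.BID)
"""

naive_pairs = sorted(conn.execute(NAIVE, (THRESHOLD,)).fetchall())
apriori_pairs = sorted(conn.execute(REWRITTEN, (THRESHOLD, THRESHOLD)).fetchall())
```

Both queries return the same pairs; the rewritten one simply joins fewer tuples, because item "d" is dropped before the self-join.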

Mining with Flocks

• Many data-mining problems can benefit from the A-Priori technique as an optimization

• The formalism of “query flocks” is an important tool for building better optimizers

Query Flocks

• A family of identical queries that are asked simultaneously

• The answers to these queries are filtered

• Parameter values whose query answers pass the filter become part of the flock's answer

Query Flock Settings

• Queries are parameterized by one or more parameters

• Ability to express filter conditions about the results of the query

Query Flock Designation

• One or more predicates that represent data stored as relations

• A set of parameters with names starting with $

• A query

• A filter that specifies a condition the query result must satisfy for a parameter assignment to be accepted

Language for Flocks

• “Conjunctive Queries” augmented with arithmetic and with union

• Datalog is used rather than SQL because it gives the following capabilities:

– The notion of “safe query” for Datalog figures into potential optimizations

– The set of options for adapting the A-Priori trick to arbitrary flocks is most easily expressed in Datalog

• SQL is used for the filter language only

Market Basket as a Query Flock

QUERY:

answer(B) :-
    baskets(B,$1) AND
    baskets(B,$2)

FILTER:

COUNT(answer.B) >= 20

Language Extensions

• To apply the proposed query optimizations, extensions must be added:

– Negated subgoals

– Arithmetic subgoals for variables and parameters

Extensions Usage

• Add arithmetic extension to the previous query to restrict item pairs to appear in lexicographic order

answer(B) :-
    baskets(B,$1) AND
    baskets(B,$2) AND
    $1 < $2
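
Read literally, a flock is a generate-and-test procedure: try every assignment of values to the parameters, run the query, and keep the assignments whose answer passes the filter. The sketch below does exactly that for the market-basket flock with the `$1 < $2` subgoal. The function name `evaluate_basket_flock` and the set-of-tuples encoding of `baskets` are assumptions, and this brute-force version deliberately applies no optimization.

```python
from itertools import product
from typing import List, Set, Tuple


def evaluate_basket_flock(
    baskets: Set[Tuple[int, str]], threshold: int
) -> List[Tuple[str, str]]:
    """Return all ($1, $2) assignments accepted by the flock."""
    items = sorted({item for _, item in baskets})
    accepted = []
    for p1, p2 in product(items, repeat=2):
        # Arithmetic subgoal: $1 < $2.
        if not p1 < p2:
            continue
        # Query: answer(B) :- baskets(B,$1) AND baskets(B,$2).
        answer = {b for (b, item) in baskets if item == p1 and (b, p2) in baskets}
        # Filter: COUNT(answer.B) >= threshold.
        if len(answer) >= threshold:
            accepted.append((p1, p2))
    return accepted
```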

Extensions Usage

• Given the following relations:

– diagnoses(Patient, Disease)

– exhibits(Patient, Symptom)

– treatments(Patient, Medicine)

– causes(Disease, Symptom)

Find unexplained side effects

QUERY:

answer(P) :-
    exhibits(P,$s) AND
    treatments(P,$m) AND
    diagnoses(P,D) AND
    NOT causes(D,$s)

FILTER:

COUNT(answer.P) >= 20
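
A direct, unoptimized reading of this flock in Python may help clarify the negated subgoal: a patient counts toward ($s, $m) if they exhibit $s, take $m, and have some diagnosed disease that does not cause $s. The function name and the set-of-tuples encoding of the four relations are assumptions for illustration.

```python
from typing import Set, Tuple

Pair = Tuple[object, str]


def side_effect_flock(
    exhibits: Set[Pair],
    treatments: Set[Pair],
    diagnoses: Set[Pair],
    causes: Set[Tuple[str, str]],
    threshold: int,
) -> Set[Tuple[str, str]]:
    """Return (symptom, medicine) pairs accepted by the flock."""
    symptoms = {s for _, s in exhibits}
    medicines = {m for _, m in treatments}
    accepted = set()
    for s in symptoms:
        for m in medicines:
            # answer(P) :- exhibits(P,$s) AND treatments(P,$m)
            #              AND diagnoses(P,D) AND NOT causes(D,$s)
            answer = {
                p
                for (p, d) in diagnoses
                if (p, s) in exhibits
                and (p, m) in treatments
                and (d, s) not in causes
            }
            if len(answer) >= threshold:
                accepted.add((s, m))
    return accepted
```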

Generalizing A-Priori Techniques

• Evaluate the less expensive query first

– Its answer gives an upper bound on the size of the answer obtained with certain parameter values

– If the bound is below the filter threshold, those parameter values can be eliminated without further consideration

• For query Q1 to put an upper bound on the size of the result of query Q2, it must be provable that the result of Q2 is a subset of the result of Q1

• The containment-mapping theorem says: Q2 ⊆ Q1 can hold if Q1 is constructed from Q2 by:

1. Taking a subset of the subgoals of Q2, and

2. Splitting zero or more variables into several variables.

Safe Query Example

answer(B) :- baskets(B,$1) AND baskets(B,$2)

Two safe subqueries are formed by taking proper subsets of the subgoals:

answer(B) :- baskets(B,$1)

and

answer(B) :- baskets(B,$2)

Safe Query Example cont.

If we take the first, we can ask:

• For what values of $1 does the query answer(B) :- baskets(B,$1) produce a number of values of B that meets the threshold given in the filter?

• Any other value of $1 can be eliminated as a member of a pair of items meeting the filter condition

Search for Optimal Query-Flock Evaluators

R(P) := FILTER(P,Q,C)

P: set of parameters

Q: query involving the parameters P

R: relation whose tuples are values of the parameters P

C: condition on the result of the query Q
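
One way to picture a FILTER step is as a higher-order function: given candidate values for the parameters, a query, and a condition, it returns the parameter assignments whose query result satisfies the condition. The sketch below is an illustrative model only; the name `filter_step`, the `domains` argument, and the use of callables for Q and C are assumptions not taken from the paper.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Set


def filter_step(
    domains: Dict[str, Sequence],
    query: Callable[[Dict[str, object]], Set],
    condition: Callable[[Set], bool],
) -> List[Dict[str, object]]:
    """Model of R(P) := FILTER(P, Q, C).

    domains   -- candidate values for each parameter in P
    query     -- Q: maps a parameter assignment to its answer set
    condition -- C: decides whether an answer set is acceptable
    """
    names = list(domains)
    accepted = []
    for values in product(*(domains[name] for name in names)):
        assignment = dict(zip(names, values))
        if condition(query(assignment)):
            accepted.append(assignment)
    return accepted


# Usage with hypothetical data: symptoms exhibited by at least 2 patients.
exhibits = {(1, "rash"), (2, "rash"), (3, "fever")}
ok_s = filter_step(
    {"$s": ["fever", "rash"]},
    lambda a: {p for (p, s) in exhibits if s == a["$s"]},
    lambda answer: len(answer) >= 2,
)
```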

A Query Plan

okS($s) := FILTER($s, answer(P) :- exhibits(P,$s), COUNT(answer.P) >= 20);

okM($m) := FILTER($m, answer(P) :- treatments(P,$m), COUNT(answer.P) >= 20);

ok($s,$m) := FILTER({$s,$m}, answer(P) :-

okS($s) AND okM($m) AND diagnoses(P,D) AND exhibits(P,$s) AND treatments(P,$m) AND NOT causes(D,$s),

COUNT(answer.P) >= 20);
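
The three steps of this plan can be sketched in Python as follows. The relations are encoded as sets of tuples and the function name `run_plan` is hypothetical; the point is only that the final step considers a (symptom, medicine) pair when both members survived their single-parameter filters.

```python
from collections import Counter
from typing import Dict, Set, Tuple


def run_plan(
    exhibits: Set[Tuple[int, str]],
    treatments: Set[Tuple[int, str]],
    diagnoses: Set[Tuple[int, str]],
    causes: Set[Tuple[str, str]],
    threshold: int,
) -> Set[Tuple[str, str]]:
    # Step 1: okS($s) keeps symptoms exhibited by enough patients.
    s_counts = Counter(s for _, s in exhibits)
    ok_s = {s for s, n in s_counts.items() if n >= threshold}

    # Step 2: okM($m) keeps medicines given to enough patients.
    m_counts = Counter(m for _, m in treatments)
    ok_m = {m for m, n in m_counts.items() if n >= threshold}

    # Step 3: the full query, restricted to surviving parameter values.
    patients: Dict[Tuple[str, str], Set[int]] = {}
    for (p, s) in exhibits:
        if s not in ok_s:
            continue
        for (p2, m) in treatments:
            if p2 != p or m not in ok_m:
                continue
            # diagnoses(P,D) AND NOT causes(D,$s)
            if any(dp == p and (d, s) not in causes for (dp, d) in diagnoses):
                patients.setdefault((s, m), set()).add(p)

    return {pair for pair, ps in patients.items() if len(ps) >= threshold}
```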

Is There a Rule for Generating Query Plans?

• Consider only sequences of filter steps that satisfy these conditions:

– Each step must use the same filter condition as the original query flock

– Each step must define a uniquely named relation

– The final step must not delete any subgoals of the original query; it may have additional subgoals derived from previous steps

– Each step is derived from the given query flock as follows:

1. Start with the original query flock

2. Add zero or more subgoals that are copies of the left side of the assignment ( := ) in some previous filter step

3. Delete zero or more subgoals but, following the optimization principle for conjunctive queries, make sure that the resulting query is safe
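
The safety requirement on deleted subgoals can be checked mechanically. The sketch below assumes the usual Datalog notion of safety: every variable or parameter that appears in the head, in a negated subgoal, or in an arithmetic subgoal must also appear in some positive relational subgoal. The tuple encoding of subgoals and the function names are hypothetical, and all terms are assumed to be variables or parameters (no constants).

```python
from itertools import combinations
from typing import List, Sequence, Tuple

# A subgoal is (kind, predicate, terms) with kind in {"pos", "neg", "arith"}.
Subgoal = Tuple[str, str, Tuple[str, ...]]


def is_safe(head_vars: Sequence[str], subgoals: Sequence[Subgoal]) -> bool:
    """Check safety of a query built from the given subgoals."""
    covered = set()
    needed = set(head_vars)
    for kind, _, terms in subgoals:
        if kind == "pos":
            covered.update(terms)
        else:
            needed.update(terms)
    return needed <= covered


def safe_subqueries(
    head_vars: Sequence[str], subgoals: Sequence[Subgoal]
) -> List[Tuple[Subgoal, ...]]:
    """Enumerate every non-empty subset of subgoals that forms a safe query."""
    result = []
    for size in range(1, len(subgoals) + 1):
        for subset in combinations(subgoals, size):
            if is_safe(head_vars, subset):
                result.append(subset)
    return result


# The side-effects query, in the encoding assumed above.
SIDE_EFFECTS: List[Subgoal] = [
    ("pos", "exhibits", ("P", "$s")),
    ("pos", "treatments", ("P", "$m")),
    ("pos", "diagnoses", ("P", "D")),
    ("neg", "causes", ("D", "$s")),
]
candidates = safe_subqueries(["P"], SIDE_EFFECTS)
```

Under this encoding, the negated subgoal is only allowed in a subquery that also keeps `exhibits` and `diagnoses`, since those are the positive subgoals binding `$s` and `D`.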

Exponential Search:Query Plan

• Candidate for best possible plan:

– A long sequence of steps in which each uses the results of the previous step

• How to restrict the search:

– Select sets of parameters

– Select a list of subsets of the subgoals of the original query that form safe queries

Dynamic Selection of Filter Steps

• We let the sizes of intermediate relations determine whether or not to apply filters

• An important special case: the set of parameters for a relation has not previously been encountered

– If the support threshold is low, then filtering is likely to be useful

– If the support threshold is high, then filtering is unlikely to be useful

Possible Query Plan

temp1($s) := FILTER($s,
    answer(P) :- exhibits(P,$s),
    COUNT(answer.P) >= 20
)

temp2(P,$s,$m) := (temp1($s) JOIN exhibits(P,$s)) JOIN treatments(P,$m)

temp3($s,$m) := FILTER({$s,$m},
    answer(P) :- temp2(P,$s,$m),
    COUNT(answer.P) >= 20
)

temp4(P,D,$s,$m) := ((temp3($s,$m) JOIN temp2(P,$s,$m))
    JOIN diagnoses(P,D)) JOIN (NOT causes(D,$s))

sideEffect($s,$m) := FILTER({$s,$m},
    answer(P) :- temp4(P,D,$s,$m),
    COUNT(answer.P) >= 20
)

Conclusions

• Query flocks are a generate-and-test model for data-mining problems

• Uses "parts of queries" constructively to prune answer sets for main queries

• Provides a parameterized way to specify a set of queries, whose answer is the parameter(s)

So What Should Tim Tell His Mother?

• In one sentence:

Generalization of query optimization techniques to be used for data mining.

• And Questions?
