Zeng Zeng, Bharadwaj, IEEE TRANSACTIONS ON COMPUTERS, VOL. 55, NO. 11, NOVEMBER 2006
Sampling Large Databases for Association Rules Jingting Zeng CIS 664 Presentation March 13, 2007.
Sampling Large Databases for Association Rules
Jingting Zeng
CIS 664 Presentation
March 13, 2007
Outline
Association Rules Problem Overview
Association Rules Definitions
Previous Work on Association Rules
Toivonen’s Algorithm
Experiment Results
Conclusion
Overview
Purpose: If people tend to buy A and B together, then a buyer of A is a good target for an advertisement for B.
The Market-Basket Example
Items frequently purchased together: Bread ⇒ PeanutButter
Uses: Placement, Advertising, Sales, Coupons
Objective: increase sales and reduce costs
Other Example
The same technology has other uses.
University course enrollment data has been analyzed to find combinations of courses taken by the same students.
Scale of Problem
WalMart sells 100,000 items and can store hundreds of millions of baskets.
The Web has 100,000,000 words and several billion pages.
Association Rule Definitions
Set of items: I={I1,I2,…,Im}
Transactions: D={t1,t2, …, tn}, tj ⊆ I
Support of an itemset: Percentage of transactions which contain that itemset.
Frequent itemset: Itemset whose number of occurrences is above a threshold.
Association Rule Definitions
Association Rule (AR): implication X ⇒ Y where X, Y ⊆ I and X ∩ Y = ∅
Support of AR (s) X ⇒ Y: Percentage of transactions that contain X ∪ Y
Confidence of AR (α) X ⇒ Y: Ratio of the number of transactions that contain X ∪ Y to the number that contain X
Example
B1 = {m, c, b}    B2 = {m, p, j}
B3 = {m, b}       B4 = {c, j}
B5 = {m, p, b}    B6 = {m, c, b, j}
B7 = {c, b, j}    B8 = {b, c}
Association Rule: {m, b} ⇒ c
Support = 2/8 = 25%
Confidence = 2/4 = 50%
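The arithmetic on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the function names `support` and `confidence` are hypothetical helpers, not part of the presentation, and the baskets are the eight from the example.

```python
# The eight example baskets B1..B8, each represented as a set of items.
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"},
    {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"},
    {"c", "b", "j"}, {"b", "c"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(lhs, rhs, baskets):
    """support(lhs ∪ rhs) divided by support(lhs)."""
    return support(set(lhs) | set(rhs), baskets) / support(lhs, baskets)

# Rule {m, b} => c
rule_support = support({"m", "b", "c"}, baskets)          # 2 of 8 baskets
rule_confidence = confidence({"m", "b"}, {"c"}, baskets)  # 2 of the 4 with {m, b}
print(rule_support, rule_confidence)
```

Only B1 and B6 contain all of m, b and c, while B1, B3, B5 and B6 contain {m, b}, which gives the 25% and 50% figures.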
Association Rule Problem
Given a set of items I={I1,I2,…,Im} and a database of transactions D={t1,t2, …, tn} where ti={Ii1,Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y that meet a minimum support threshold and a minimum confidence threshold.
Association Rule Techniques
Find all frequent itemsets
Generate strong association rules from the frequent itemsets
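The second step can be sketched as follows, assuming the first step has already produced a support count for every frequent itemset. The function name `generate_rules`, the count threshold of 4 and the confidence threshold of 0.75 are illustrative assumptions; the counts are taken from the eight-basket example (items m, b, c, j).

```python
from itertools import combinations

def generate_rules(supports, min_conf):
    """Return (lhs, rhs, confidence) for every rule meeting min_conf.

    `supports` maps each frequent itemset (a frozenset) to its support count.
    By monotonicity every subset of a frequent itemset is also a key.
    """
    rules = []
    for itemset, count in supports.items():
        # Try every non-empty proper subset as the left-hand side.
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                conf = count / supports[lhs]
                if conf >= min_conf:
                    rules.append((lhs, itemset - lhs, conf))
    return rules

# Support counts from the eight example baskets, for itemsets with count >= 4.
supports = {
    frozenset("m"): 5, frozenset("b"): 6, frozenset("c"): 5, frozenset("j"): 4,
    frozenset("mb"): 4, frozenset("bc"): 4,
}
rules = generate_rules(supports, min_conf=0.75)
for lhs, rhs, conf in rules:
    print(sorted(lhs), "=>", sorted(rhs), round(conf, 2))
```

With these assumed thresholds only m ⇒ b and c ⇒ b survive (confidence 4/5 each); b ⇒ m and b ⇒ c reach only 4/6.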
APriori Algorithm
A two-pass approach called a-priori limits the need for main memory.
Key idea: monotonicity: if a set of items appears at least s times, so does every subset.
Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets.
APriori Algorithm (contd.)
Pass 1: Read baskets and count in main memory the occurrences of each item.
Requires only memory proportional to the number of items.
Pass 2: Read baskets again and count in main memory only those pairs whose two items were both found in Pass 1 to have occurred at least s times.
Requires memory proportional to the square of the number of frequent items only.
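The two passes can be sketched in Python for the case of pairs. This is a minimal illustration under stated assumptions rather than a tuned implementation: `frequent_pairs` is a hypothetical name, the baskets are the eight from the earlier example, and s = 4 is an arbitrary choice.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, s):
    """Two-pass A-Priori for pairs; s is the minimum number of baskets."""
    # Pass 1: count single items (memory proportional to the number of items).
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent_items = {item for item, n in item_counts.items() if n >= s}

    # Pass 2: count only pairs whose two items are both frequent.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(item for item in basket if item in frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= s}

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"},
    {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"},
    {"c", "b", "j"}, {"b", "c"},
]
pairs = frequent_pairs(baskets, s=4)
print(pairs)
```

Item p occurs in only two baskets, so no pair containing p is ever counted in the second pass.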
Partitioning
Divide database into partitions D1,D2,…,Dp
Apply Apriori to each partition
Any large itemset must be large in at least one partition.
Partitioning Algorithm
1. Divide D into partitions D1,D2,…,Dp;
2. For i = 1 to p do
3.     Li = Apriori(Di);
4. C = L1 ∪ … ∪ Lp;
5. Count C on D to generate L;
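The five steps above can be sketched as follows. The `apriori` helper is a simple level-wise miner written for this illustration, not the optimized algorithm from the literature; `partition_mine`, the relative threshold `min_frac` and the example data are likewise assumptions.

```python
from itertools import combinations

def apriori(baskets, min_frac):
    """Itemsets (frozensets) contained in at least min_frac of the baskets."""
    min_count = min_frac * len(baskets)
    level = {frozenset([i]) for basket in baskets for i in basket}
    frequent = set()
    while level:
        level = {c for c in level if sum(c <= b for b in baskets) >= min_count}
        frequent |= level
        k = len(next(iter(level))) + 1 if level else 0
        # Join frequent (k-1)-sets, then prune candidates with an infrequent subset.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        level = {c for c in joined
                 if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
    return frequent

def partition_mine(baskets, p, min_frac):
    """Steps 1 to 5: mine each partition, union the results, recount on D."""
    size = -(-len(baskets) // p)  # ceiling division
    parts = [baskets[i:i + size] for i in range(0, len(baskets), size)]
    candidates = set().union(*(apriori(part, min_frac) for part in parts))
    min_count = min_frac * len(baskets)
    return {c for c in candidates if sum(c <= b for b in baskets) >= min_count}

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
large = partition_mine(baskets, p=2, min_frac=0.5)
print(sorted(sorted(s) for s in large))
```

Because any itemset that is large in D must be large in at least one partition, the union C never misses a large itemset, and the final count over D only removes false candidates.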
Sampling
Large databases
Sample the database and apply Apriori to the sample.
Potentially Frequent Itemsets (PL): Large itemsets from sample
Negative Border (BD⁻):
Generalization of Apriori-Gen applied to itemsets of varying sizes.
Minimal set of itemsets which are not in PL, but whose subsets are all in PL.
Negative Border Example
Let Items = {A,…,F} and suppose the frequent itemsets are: {A}, {B}, {C}, {F}, {A,B}, {A,C}, {A,F}, {C,F}, {A,C,F}
The whole negative border is: {{B,C}, {B,F}, {D}, {E}}
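The border in this example can be reproduced with a short sketch. The function name `negative_border` is a hypothetical helper, and the code assumes the given collection is downward closed (every subset of a member is also a member), which holds for frequent itemsets.

```python
def negative_border(frequent, items):
    """Itemsets not in `frequent` whose immediate subsets are all in `frequent`.

    Assumes `frequent` is downward closed, so checking the immediate subsets
    is enough to know that every proper subset is frequent.
    """
    frequent = {frozenset(f) for f in frequent}
    # A singleton's only proper subset is the empty set, so every
    # non-frequent single item belongs to the border.
    border = {frozenset([i]) for i in items} - frequent
    # Any larger border set is a frequent set extended by one item.
    for f in frequent:
        for i in items:
            cand = f | {i}
            if i in f or cand in frequent:
                continue
            if all(cand - {x} in frequent for x in cand):
                border.add(cand)
    return border

items = set("ABCDEF")
frequent = ["A", "B", "C", "F", "AB", "AC", "AF", "CF", "ACF"]
border = negative_border(frequent, items)
print(sorted("".join(sorted(s)) for s in border))
```

{A,B,C} is not in the border, for instance, because its subset {B,C} is not frequent.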
Toivonen’s Algorithm
Start as in the simple algorithm, but lower the threshold slightly for the sample.
Example: if the sample is 1% of the baskets, use 0.008 as the support threshold rather than 0.01.
Goal is to avoid missing any itemset that is frequent in the full set of baskets.
Toivonen’s Algorithm (contd.)
Add to the itemsets that are frequent in the sample the negative border of these itemsets.
An itemset is in the negative border if it is not deemed frequent in the sample, but all its immediate subsets are.
Example: ABCD is in the negative border if and only if it is not frequent, but all of ABC, BCD, ACD, and ABD are.
Toivonen’s Algorithm (contd.)
In a second pass, count all candidate frequent itemsets from the first pass, and also count the negative border.
If no itemset from the negative border turns out to be frequent, then the candidates found to be frequent in the whole data are exactly the frequent itemsets.
Toivonen’s Algorithm (contd.)
What if we find something in the negative border is actually frequent?
We must start over again!
But by choosing the support threshold for the sample wisely, we can make the probability of failure low, while still keeping the number of itemsets checked on the second pass low enough to fit in main memory.
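Putting the slides together, one round of the procedure might look like the sketch below. It is an illustration under assumptions, not the code used in the experiments: the sample is passed in explicitly so the behaviour is reproducible, the helper names and thresholds (`min_frac`, `lowered_frac`) are invented for the example, and `apriori` is a simple level-wise miner.

```python
from itertools import combinations

def frac(itemset, baskets):
    """Fraction of baskets containing `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def apriori(baskets, min_frac):
    """Simple level-wise miner returning frequent itemsets as frozensets."""
    level = {frozenset([i]) for b in baskets for i in b}
    frequent = set()
    while level:
        level = {c for c in level if frac(c, baskets) >= min_frac}
        frequent |= level
        k = len(next(iter(level))) + 1 if level else 0
        joined = {a | b for a in level for b in level if len(a | b) == k}
        level = {c for c in joined
                 if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
    return frequent

def negative_border(frequent, items):
    """Non-frequent itemsets whose immediate subsets are all frequent."""
    border = {frozenset([i]) for i in items} - frequent
    for f in frequent:
        for i in items - f:
            cand = f | {i}
            if cand not in frequent and all(cand - {x} in frequent for x in cand):
                border.add(cand)
    return border

def toivonen_round(baskets, sample, min_frac, lowered_frac):
    """Return the frequent itemsets of `baskets`, or None if we must start over."""
    items = {i for b in baskets for i in b}
    candidates = apriori(sample, lowered_frac)    # first pass, on the sample only
    border = negative_border(candidates, items)
    # Second pass: count the candidates and the negative border on the full data.
    if any(frac(c, baskets) >= min_frac for c in border):
        return None                               # a border itemset is frequent
    return {c for c in candidates if frac(c, baskets) >= min_frac}

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
good_sample = [baskets[0], baskets[2], baskets[5], baskets[6]]
bad_sample = baskets[:4]
ok = toivonen_round(baskets, good_sample, min_frac=0.5, lowered_frac=0.4)
failed = toivonen_round(baskets, bad_sample, min_frac=0.5, lowered_frac=0.4)
print(ok, failed)
```

With the first sample the border is {p} and {m, j}, neither of which is frequent in the full data, so the result is exact. With the second sample {b, c} lands in the border yet is frequent overall, so the round reports failure and a new sample would be needed.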
Experiment
Synthetic data set characteristics (T = row size on average, I = size of maximal frequent sets on average)
Experiment (contd.)
Lowered frequency thresholds (%) chosen so that the probability of missing any given frequent set is less than δ = 0.001
Number of trials with misses
Conclusions
Advantages: Reduced failure probability, while keeping candidate count low enough for memory
Disadvantages: Potentially large number of candidates in second pass
Thank you!