Association Analysis
Association Rule Mining: Definition
• Given a set of records, each of which contains some number of items from a given collection:
– Produce dependency rules that predict the occurrence of an item based on the occurrences of other items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Rules Discovered: {Milk} → {Coke}, {Diaper, Milk} → {Beer}
Association Rules
• Marketing and Sales Promotion:
– Let the rule discovered be {Bagels, …} → {Potato Chips}
– Potato Chips as consequent ⇒ can be used to determine what should be done to boost its sales.
– Bagels in the antecedent ⇒ can be used to see which products would be affected if the store discontinues selling bagels.
Two key issues
• First, discovering patterns from a large transaction data set can be computationally expensive.
• Second, some of the discovered patterns are potentially spurious because they may happen simply by chance.
Items and transactions
• Let:
– I = {i1, i2, …, id} be the set of all items in a market basket data set, and
– T = {t1, t2, …, tN} be the set of all transactions.
• Each transaction ti contains a subset of the items in I.
• Itemset
– A collection of one or more items. Example: {Milk, Bread, Diaper}
• k-itemset
– An itemset that contains k items.
• Transaction width
– The number of items present in a transaction.
• A transaction tj is said to contain an itemset X if X is a subset of tj.
– E.g., the second transaction contains the itemset {Bread, Diaper} but not {Bread, Milk}.
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Definition: Frequent Itemset
• Support count (σ)
– Frequency of occurrence of an itemset.
– E.g., σ({Milk, Bread, Diaper}) = 2
• Support
– Fraction of transactions that contain an itemset: s(X) = σ(X)/N.
– E.g., s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold.
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Definition: Association Rule
• Association Rule
– An implication expression of the form X → Y, where X and Y are disjoint itemsets.
– Example: {Milk, Diaper} → {Beer}
• Rule Evaluation Metrics for X → Y
– Support (s)
• Fraction of transactions that contain both X and Y: s = σ(X ∪ Y) / |T|
– Confidence (c)
• Measures how often items in Y appear in transactions that contain X: c = σ(X ∪ Y) / σ(X)
Example: {Milk, Diaper} → {Beer}
s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
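The support and confidence definitions above can be checked directly in code. The sketch below (not from the slides) encodes the five example transactions as Python sets and computes s and c for {Milk, Diaper} → {Beer}; the function names are illustrative, not part of any library.

```python
# Toy reimplementation of the slide's support/confidence example.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain itemset X."""
    return sum(1 for t in transactions if itemset <= t)

def support(X, Y, transactions):
    """s(X -> Y) = sigma(X u Y) / N."""
    return support_count(X | Y, transactions) / len(transactions)

def confidence(X, Y, transactions):
    """c(X -> Y) = sigma(X u Y) / sigma(X)."""
    return support_count(X | Y, transactions) / support_count(X, transactions)

X, Y = {"Milk", "Diaper"}, {"Beer"}
print(support(X, Y, transactions))     # 0.4
print(confidence(X, Y, transactions))  # 0.666...
```

This reproduces the numbers on the slide: σ({Milk, Diaper, Beer}) = 2 out of N = 5 transactions gives s = 0.4, and σ({Milk, Diaper}) = 3 gives c = 2/3.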
Why Use Support and Confidence?
Support
– A rule that has very low support may occur simply by chance.
– Support is often used to eliminate uninteresting rules.
– Support also has a desirable property that can be exploited for the efficient discovery of association rules.
Confidence
– Measures the reliability of the inference made by a rule.
– For a rule X → Y, the higher the confidence, the more likely it is for Y to be present in transactions that contain X.
– Confidence provides an estimate of the conditional probability of Y given X.
Association Rule Mining Task
• Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!
Brute-force approach
• Suppose there are d items. We first choose k of the items to form the left-hand side of the rule. There are C(d, k) ways of doing this.
• Then there are C(d−k, i) ways to choose the remaining items to form the right-hand side of the rule, where 1 ≤ i ≤ d−k.
The total number of rules is therefore:

$$R = \sum_{k=1}^{d-1} \binom{d}{k} \sum_{i=1}^{d-k} \binom{d-k}{i} = \sum_{k=1}^{d-1} \binom{d}{k}\left(2^{d-k} - 1\right) = 3^d - 2^{d+1} + 1$$

We applied: $\sum_{i=1}^{n} \binom{n}{i} = 2^n - 1$

We also have that: $\sum_{k=0}^{d} \binom{d}{k} x^{d-k} = (1+x)^d$. For $x = 2$: $\sum_{k=0}^{d} \binom{d}{k} 2^{d-k} = 3^d$, so $\sum_{k=1}^{d-1} \binom{d}{k} 2^{d-k} = 3^d - 2^d - 1$ and $\sum_{k=1}^{d-1} \binom{d}{k} = 2^d - 2$.

Therefore $R = (3^d - 2^d - 1) - (2^d - 2) = 3^d - 2^{d+1} + 1$.
Brute-force approach
• R = 3^d − 2^(d+1) + 1
• For d = 6: 3^6 − 2^7 + 1 = 602 possible rules.
• However, 80% of the rules are discarded after applying minsup = 20% and minconf = 50%, making most of the computation wasted.
• So it would be useful to prune the rules early, without having to compute their support and confidence values.
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
An initial step toward improving the performance: decouple the support and confidence requirements.
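The closed form R = 3^d − 2^(d+1) + 1 can be sanity-checked by summing the combinatorial count directly. A quick Python check (not from the slides; function names are illustrative):

```python
from math import comb

def rule_count(d):
    # Choose k items for the antecedent, then any nonempty subset
    # of the remaining d - k items for the consequent.
    return sum(comb(d, k) * (2 ** (d - k) - 1) for k in range(1, d))

def closed_form(d):
    return 3 ** d - 2 ** (d + 1) + 1

print(rule_count(6))  # 602, matching 3^6 - 2^7 + 1
```

Both expressions agree for every d, confirming the derivation above.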
Mining Association Rules
Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
If the itemset is infrequent, then all six candidate rules can be pruned immediately without us having to compute their confidence values.
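The observation that all binary partitions of one itemset share the same support can be demonstrated by enumerating them. A small sketch (not from the slides; `sigma` is an illustrative helper name):

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(X):
    """Support count of itemset X."""
    return sum(1 for t in transactions if X <= t)

itemset = {"Beer", "Diaper", "Milk"}
N = len(transactions)
# Every binary partition X -> (itemset - X) has support sigma(itemset)/N,
# identical for all six rules; only the confidence varies with X.
for k in range(1, len(itemset)):
    for lhs in combinations(sorted(itemset), k):
        X, Y = set(lhs), itemset - set(lhs)
        s = sigma(itemset) / N
        c = sigma(itemset) / sigma(X)
        print(sorted(X), "->", sorted(Y), f"s={s:.1f} c={c:.2f}")
```

Since s depends only on σ(itemset), checking that one number against minsup decides all six rules at once, which is exactly the decoupling argued for above.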
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup (these itemsets are called frequent itemsets).
2. Rule Generation
– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset (these rules are called strong rules).
• The computational requirements for frequent itemset generation are more expensive than those of rule generation.
– We focus first on frequent itemset generation.
Frequent Itemset Generation
[Itemset lattice over d = 5 items {A, B, C, D, E}: the null set at the top, then the 1-itemsets A…E, the 2-itemsets AB…DE, and so on down to ABCDE.]
Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation
• Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database
– Match each of the N transactions against each of the M candidates
– Complexity ~ O(NMw) ⇒ expensive, since M = 2^d !!!
• w is the maximum transaction width.
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
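The brute-force counting loop can be written in a few lines, which also makes the O(NMw) cost visible: one subset check per candidate per transaction. A sketch for the toy data (not from the slides):

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))  # d = 6 distinct items

# Enumerate all M = 2^d - 1 nonempty candidate itemsets and scan the
# N transactions for each one: roughly N * M * w subset comparisons.
support_count = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        support_count[cand] = sum(1 for t in transactions if set(cand) <= t)

print(len(support_count))               # 2^6 - 1 = 63 candidates
print(support_count[("Bread", "Milk")]) # 3
```

Even for d = 6 this enumerates 63 candidates; for realistic item catalogs the 2^d blow-up is what the Apriori principle on the next slides avoids.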
Reducing the Number of Candidates
• Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent.
• This is due to the anti-monotone property of support:
∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)
• The Apriori principle, stated conversely:
– If an itemset such as {a, b} is infrequent, then all of its supersets must be infrequent too.
Illustrating the Apriori Principle
[Itemset lattice over {A, B, C, D, E}: once AB is found to be infrequent, all of its supersets (ABC, ABD, ABE, ABCD, ABCE, ABDE, ABCDE) are pruned from the search.]
Illustrating the Apriori Principle
Minimum Support = 3

Items (1-itemsets):
Item Count
Bread 4
Coke 2
Milk 4
Beer 3
Diaper 4
Eggs 1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
Itemset Count
{Bread, Milk} 3
{Bread, Beer} 2
{Bread, Diaper} 3
{Milk, Beer} 2
{Milk, Diaper} 3
{Beer, Diaper} 3

Triplets (3-itemsets):
Itemset Count
{Bread, Milk, Diaper} 3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 41 itemsets.
With support-based pruning: 6 + 6 + 1 = 13.
With the Apriori principle we need to keep only this triplet, because it is the only one whose subsets are all frequent.
Apriori Algorithm
• Method:
– Let k = 1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified:
• k = k + 1
• Generate length-k candidate itemsets from the length-(k−1) frequent itemsets
• Prune candidate itemsets containing any length-(k−1) subset that is infrequent
• Count the support of each remaining candidate by scanning the DB and eliminate the infrequent candidates, leaving only the frequent ones
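The loop above can be sketched compactly in Python. This is a minimal illustration, not an optimized implementation (real Apriori implementations use hash trees for support counting); minsup here is an absolute count, and all names are illustrative.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Minimal Apriori sketch: returns {frozenset: support count}."""
    # Frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Generate length-k candidates from length-(k-1) frequent itemsets...
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # ...and prune any candidate with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # One DB scan to count support, then eliminate infrequent candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: n for s, n in counts.items() if n >= minsup}
        result.update(frequent)
        k += 1
    return result

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
freq = apriori(transactions, minsup=3)
```

On the toy data with minsup = 3, the frequent 1-itemsets are Bread, Milk, Beer, and Diaper (Coke and Eggs are pruned), matching the counts on the previous slide.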
Candidate Generation and Pruning
There are many ways to generate candidate itemsets. An effective candidate generation procedure:
1. Should avoid generating too many unnecessary candidates.
– A candidate itemset is unnecessary if at least one of its subsets is infrequent.
2. Must ensure that the candidate set is complete,
– i.e., no frequent itemsets are left out by the candidate generation procedure.
3. Should not generate the same candidate itemset more than once.
– E.g., the candidate itemset {a, b, c, d} can be generated in many ways:
• by merging {a, b, c} with {d},
• {c} with {a, b, d}, etc.
Brute force
• A brute-force method considers every k-itemset as a potential candidate and then applies the candidate pruning step to remove the unnecessary candidates.
Fk−1 × F1 Method
• Extend each frequent (k−1)-itemset with a frequent 1-itemset.
• Is it complete?
• The procedure is complete because every frequent k-itemset is composed of a frequent (k−1)-itemset and a frequent 1-itemset.
• However, it does not prevent the same candidate itemset from being generated more than once.
– E.g., {Bread, Diaper, Milk} can be generated by merging
• {Bread, Diaper} with {Milk},
• {Bread, Milk} with {Diaper}, or
• {Diaper, Milk} with {Bread}.
Lexicographic Order
• Avoid generating duplicate candidates by ensuring that the items in each frequent itemset are kept sorted in lexicographic order.
• Each frequent (k−1)-itemset X is then extended only with frequent items that are lexicographically larger than the items in X.
• For example, the itemset {Bread, Diaper} can be augmented with {Milk}, since Milk is lexicographically larger than Bread and Diaper.
• However, we don't augment {Diaper, Milk} with {Bread} nor {Bread, Milk} with {Diaper}, because they violate the lexicographic ordering condition.
• Is it complete?
Lexicographic Order - Completeness
• Is it complete? Yes. Let (i1, …, ik−1, ik) be a frequent k-itemset sorted in lexicographic order.
• Since it is frequent, by the Apriori principle, (i1, …, ik−1) and (ik) are frequent as well, i.e., (i1, …, ik−1) ∈ Fk−1 and (ik) ∈ F1.
• Since ik is lexicographically larger than i1, …, ik−1, the itemset (i1, …, ik−1) will be joined with (ik), giving (i1, …, ik−1, ik) as a candidate k-itemset.
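The lexicographic extension rule is a one-line condition in code. A sketch of the Fk−1 × F1 step (not from the slides; itemsets are represented as sorted tuples, and the function name is illustrative):

```python
def fk1_f1_candidates(freq_k1, freq_1):
    """F_{k-1} x F_1 with lexicographic extension: each sorted (k-1)-itemset
    is extended only with frequent items larger than its last item, so every
    candidate is generated exactly once."""
    candidates = []
    for itemset in freq_k1:          # tuples kept in lexicographic order
        for (item,) in freq_1:
            if item > itemset[-1]:   # extend only with strictly larger items
                candidates.append(itemset + (item,))
    return candidates

pairs = [("Bread", "Diaper"), ("Bread", "Milk"), ("Diaper", "Milk")]
singles = [("Bread",), ("Diaper",), ("Milk",)]
print(fk1_f1_candidates(pairs, singles))
# [('Bread', 'Diaper', 'Milk')] -- generated once, from {Bread, Diaper} only
```

As on the slide, {Bread, Diaper} is augmented with Milk, while {Bread, Milk} and {Diaper, Milk} produce nothing, so the duplicate generations from the previous slide disappear.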
Still too many candidates…
• E.g., merging {Beer, Diaper} with {Milk} is unnecessary because one of the resulting subsets, {Beer, Milk}, is infrequent.
• Heuristics are available to reduce (prune) the number of unnecessary candidates.
• E.g., for a candidate k-itemset to be viable,
– every item in the candidate must be contained in at least k−1 of the frequent (k−1)-itemsets.
– {Beer, Diaper, Milk} is a viable candidate 3-itemset only if every item in the candidate, including Beer, is contained in at least 2 frequent 2-itemsets.
• Since there is only one frequent 2-itemset containing Beer, all candidate itemsets involving Beer must be infrequent.
• Why? Because in a frequent k-itemset, each item is contained in exactly k−1 of its (k−1)-subsets, and all of these must be frequent.
Fk−1 × F1 Method (illustration)
Fk−1 × Fk−1 Method
• Merge a pair of frequent (k−1)-itemsets only if their first k−2 items are identical.
• E.g., the frequent itemsets {Bread, Diaper} and {Bread, Milk} are merged to form the candidate 3-itemset {Bread, Diaper, Milk}.
• We don't merge {Beer, Diaper} with {Diaper, Milk}, because the first item in the two itemsets is different.
– Indeed, if {Beer, Diaper, Milk} were a viable candidate, it would have been obtained by merging {Beer, Diaper} with {Beer, Milk} instead.
• This illustrates both the completeness of the candidate generation procedure and the advantage of using lexicographic ordering to prevent duplicate candidates.
Pruning?
• Because each candidate is obtained by merging a pair of frequent (k−1)-itemsets, an additional candidate pruning step is needed to ensure that its remaining k−2 subsets of size k−1 are frequent.
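The merge condition and the extra pruning step above can be sketched together in a few lines (not from the slides; itemsets are sorted tuples, and the function name is illustrative):

```python
from itertools import combinations

def fk1_fk1_candidates(freq_k1):
    """F_{k-1} x F_{k-1}: merge two sorted (k-1)-itemsets that share their
    first k-2 items, then prune candidates with an infrequent (k-1)-subset."""
    freq = set(freq_k1)
    items = sorted(freq)
    out = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            a, b = items[i], items[j]
            if a[:-1] == b[:-1]:     # identical first k-2 items
                cand = a[:-1] + tuple(sorted((a[-1], b[-1])))
                # Additional pruning: every (k-1)-subset must be frequent.
                if all(sub in freq for sub in combinations(cand, len(cand) - 1)):
                    out.append(cand)
    return out

pairs = [("Beer", "Diaper"), ("Bread", "Diaper"), ("Bread", "Milk"),
         ("Diaper", "Milk")]
print(fk1_fk1_candidates(pairs))
# [('Bread', 'Diaper', 'Milk')]
```

As on the slide, only {Bread, Diaper} and {Bread, Milk} share their first item, so {Bread, Diaper, Milk} is the lone candidate; {Beer, Diaper} and {Diaper, Milk} are never merged.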
Fk−1 × Fk−1 Method (illustration)