Data Mining Association Analysis: Basic Concepts
and Algorithms
Based on Introduction to Data Mining by Tan, Steinbach, Kumar
Outline
Association rules & mining
Finding frequent itemsets
– Apriori algorithm
– compact representations of frequent itemsets
– alternative discovery methods
– FP-growth algorithm
Generating association rules
Evaluating discovered rules
Association Rule Mining
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Market-Basket transactions
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Examples of Association Rules:
{Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}

Implication here means co-occurrence, not causality!
Definition: Frequent Itemset

Itemset
– A collection of one or more items; example: {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items

Support count ($\sigma$)
– Frequency of occurrence of an itemset
– E.g., $\sigma(\{\text{Milk, Bread, Diaper}\}) = 2$

Support (s)
– Fraction of transactions that contain an itemset
– E.g., $s(\{\text{Milk, Bread, Diaper}\}) = 2/5$

Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold
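A quick illustration of these definitions over the market-basket transactions above (a minimal Python sketch; the variable and helper names are our own, not from the text):

```python
# Transactions from the market-basket table above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    """sigma(itemset): number of transactions containing the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

def support(itemset):
    """s(itemset): fraction of transactions containing the itemset."""
    return support_count(itemset) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}))  # 2
print(support({"Milk", "Bread", "Diaper"}))        # 0.4  (= 2/5)
```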
Definition: Association Rule

Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
– Support (s): fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X

Example: for {Milk, Diaper} → {Beer},

$s = \frac{\sigma(\{\text{Milk, Diaper, Beer}\})}{|T|} = \frac{2}{5} = 0.4$

$c = \frac{\sigma(\{\text{Milk, Diaper, Beer}\})}{\sigma(\{\text{Milk, Diaper}\})} = \frac{2}{3} \approx 0.67$
In general, for a rule X → Y:

$s(X \to Y) = \frac{\sigma(X \cup Y)}{|T|} \qquad\qquad c(X \to Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}$
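Continuing the earlier sketch, rule support and confidence follow directly from support counts (again, illustrative helper names of our own):

```python
def rule_support(X, Y):
    """s(X -> Y) = sigma(X u Y) / |T|"""
    return support_count(set(X) | set(Y)) / len(transactions)

def confidence(X, Y):
    """c(X -> Y) = sigma(X u Y) / sigma(X)"""
    return support_count(set(X) | set(Y)) / support_count(X)

print(rule_support({"Milk", "Diaper"}, {"Beer"}))  # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}))    # 0.666...  (~ 0.67)
```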
Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!
Computational Complexity
Given d unique items:
– Total number of itemsets = $2^d$
– Total number of possible association rules:

$R = \sum_{k=1}^{d-1} \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j} = 3^d - 2^{d+1} + 1$

If d = 6, R = 602 rules
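A quick numerical check of this formula (a throwaway sketch):

```python
from math import comb

d = 6
R = sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
        for k in range(1, d))
print(R)                      # 602
print(3**d - 2**(d + 1) + 1)  # 602, the closed form
```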
Mining Association Rules
Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
   – Generate all itemsets whose support ≥ minsup
2. Rule Generation
   – Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of the itemset

Frequent itemset generation is still computationally expensive.
Frequent Itemset Generation
[Figure: the itemset lattice over items A–E, from the null set at the top down to ABCDE at the bottom]

Given d items, there are $2^d$ possible candidate itemsets.
Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database, matching each transaction against every candidate
– Complexity ~ O(MNw), where N is the number of transactions, M = $2^d$ is the number of candidates, and w is the maximum transaction width ⇒ expensive!
Frequent Itemset Generation Strategies
Reduce the number of candidates (M)
– Complete search: M = $2^d$
– Use pruning techniques to reduce M

Reduce the number of comparisons (MN)
– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction

Reduce the number of transactions (N)
– A transaction that does not contain any frequent k-itemset cannot contain any frequent (k+1)-itemset
Reducing Number of Candidates
Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent
– Equivalently, if an itemset is NOT frequent, then none of its supersets can be frequent

The Apriori principle holds due to the following property of support:
– The support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support: support can only decrease as more items are added to a set

$\forall X, Y: (X \subseteq Y) \Rightarrow s(X) \ge s(Y)$
Illustrating the Apriori Principle

[Figure: the itemset lattice over A–E; once an itemset such as AB is found to be infrequent, all of its supersets can be pruned from consideration]
Apriori Algorithm
Method:
1. Let k = 1
2. Generate frequent itemsets of length 1
3. Repeat until no new frequent itemsets are identified:
   a) generate (k+1)-candidate itemsets by joining pairs of frequent k-itemsets that differ in only one item
   b) prune candidate itemsets containing subsets of length k that are infrequent
   c) count the support of each candidate by scanning the DB
   d) prune candidates that are infrequent, leaving only those that are frequent

(a minimal Python sketch of this loop follows)
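The sketch reuses transactions and support_count from the earlier snippet. It is illustrative only; e.g., the candidate join via set union is a simplification of the ordered-prefix join usually used:

```python
from itertools import combinations

min_count = 3  # minimum support count

# step 2: frequent 1-itemsets
items = {i for t in transactions for i in t}
freq = [{frozenset([i]) for i in items if support_count([i]) >= min_count}]

k = 1
while freq[-1]:
    # step 3a: join pairs of frequent k-itemsets differing in one item
    candidates = {a | b for a in freq[-1] for b in freq[-1]
                  if len(a | b) == k + 1}
    # step 3b: prune candidates that have an infrequent k-subset
    candidates = {c for c in candidates
                  if all(frozenset(s) in freq[-1] for s in combinations(c, k))}
    # steps 3c-3d: count supports and keep only the frequent candidates
    freq.append({c for c in candidates if support_count(c) >= min_count})
    k += 1

print([set(s) for level in freq for s in level])
```

On the five example transactions with min_count = 3, this yields exactly the frequent itemsets walked through on the next slides.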
Generating Frequent 1-Itemsets
Consider a set of transactions T with 6 unique items, and suppose the minimum support count is 3. Apriori makes a first pass through T to obtain support counts for the candidate 1-itemsets:

Candidates: {Bread}, {Coke}, {Milk}, {Beer}, {Diaper}, {Eggs}

Support counts: {Bread} 4, {Coke} 2, {Milk} 4, {Beer} 3, {Diaper} 4, {Eggs} 1

Coke and Eggs are infrequent; pruning them leaves the frequent 1-itemsets:
{Bread} 4, {Milk} 4, {Beer} 3, {Diaper} 4
Generating Frequent 2-Itemsets
Starting from the frequent 1-itemsets {Bread}, {Milk}, {Beer}, {Diaper}:

Step 3.a: generate candidates for frequent 2-itemsets:
{Bread,Milk}, {Bread,Beer}, {Bread,Diaper}, {Milk,Beer}, {Milk,Diaper}, {Beer,Diaper}

Step 3.b: no pruning, since no candidate has an infrequent subset

Step 3.c: obtain support counts:
{Bread,Milk} 3, {Bread,Beer} 2, {Bread,Diaper} 3, {Milk,Beer} 2, {Milk,Diaper} 3, {Beer,Diaper} 3

Step 3.d: {Bread,Beer} and {Milk,Beer} are infrequent; pruning them leaves the frequent 2-itemsets:
{Bread,Milk} 3, {Bread,Diaper} 3, {Milk,Diaper} 3, {Beer,Diaper} 3
Generating Frequent 3-Itemsets
Starting from the frequent 2-itemsets:

Step 3.a: generate candidates for frequent 3-itemsets:
{Bread,Diaper,Milk}, {Beer,Bread,Diaper}, {Beer,Diaper,Milk}

Step 3.b: prune candidates with infrequent subsets; e.g., {Beer,Bread,Diaper} is pruned since its subset {Bread,Beer} is infrequent (likewise {Beer,Diaper,Milk}, whose subset {Milk,Beer} is infrequent), leaving only {Bread,Diaper,Milk}

Step 3.c: obtain the support count: {Bread,Milk,Diaper} 3

Step 3.d: no pruning, since the support is sufficient; frequent 3-itemset: {Bread,Milk,Diaper}

Number of candidates considered:
• without pruning: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 (there are 6C1 candidates for frequent 1-itemsets)
• with pruning: 6C1 + 4C2 + 3 = 6 + 6 + 3 = 15
Reducing Number of Comparisons
Candidate counting:
– Scan the database of transactions to determine the support of each candidate itemset
– To reduce the number of comparisons, store the candidates in a hash structure
– Instead of matching each transaction against every candidate, match it only against the candidates contained in the hashed buckets

[Figure: N transactions matched against a hash structure with k buckets]
Generate Hash Tree

Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

To build the hash tree, we need:
• Items in itemsets and transactions ordered (in the same way)
• A hash function; here h(x) = x mod 3, giving buckets {1,4,7}, {2,5,8}, {3,6,9}
• A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets in a leaf exceeds the max leaf size, split the node); here max leaf size = 3

Nodes hash on the first item at the root level, on the second item at the next level, and on the third item below that.
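A compact sketch of building such a hash tree (an assumed structure with h(x) = x mod 3 and max leaf size 3; not the book's code):

```python
MAX_LEAF_SIZE = 3

class Node:
    def __init__(self):
        self.children = {}   # bucket -> Node, used by internal nodes
        self.itemsets = []   # stored candidates, used by leaf nodes
        self.is_leaf = True

def insert(node, itemset, depth=0):
    """Insert a sorted candidate itemset, hashing on the item at `depth`."""
    if node.is_leaf:
        node.itemsets.append(itemset)
        # split an overflowing leaf, as long as there is an item left to hash on
        if len(node.itemsets) > MAX_LEAF_SIZE and depth < len(itemset):
            node.is_leaf = False
            stored, node.itemsets = node.itemsets, []
            for s in stored:
                insert(node, s, depth)   # re-insert into the new internal node
        return
    bucket = itemset[depth] % 3          # h(x) = x mod 3
    child = node.children.setdefault(bucket, Node())
    insert(child, itemset, depth + 1)

root = Node()
for c in [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6),
          (2,3,4), (5,6,7), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8)]:
    insert(root, c)
```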
Association Rule Discovery: Hash Tree

[Figure sequence: the candidate hash tree for the 15 itemsets. At the root, an itemset is hashed on its first item (1, 4, or 7 → left child; 2, 5, or 8 → middle child; 3, 6, or 9 → right child); at the next level it is hashed on its second item, and below that on its third. When a leaf exceeds the max leaf size, it is split by hashing on the next item. For example, the leaf holding {3 5 6}, {3 5 7}, {6 8 9} is not split further since it holds no more than max-leaf-size itemsets; if it were split by the third item, {3 5 6} and {6 8 9} would share one child and {3 5 7} another.]
Subset Operation

Given a transaction t = {1 2 3 5 6}, what are the possible subsets of size 3?

[Figure: systematic enumeration of all ten 3-subsets, fixing the first item (level 1), then the second (level 2), then the third (level 3): {1 2 3}, {1 2 5}, {1 2 6}, {1 3 5}, {1 3 6}, {1 5 6}, {2 3 5}, {2 3 6}, {2 5 6}, {3 5 6}]
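This enumeration is exactly what itertools.combinations yields for an ordered transaction:

```python
from itertools import combinations

t = (1, 2, 3, 5, 6)
for subset in combinations(t, 3):
    print(subset)
# (1, 2, 3), (1, 2, 5), (1, 2, 6), (1, 3, 5), (1, 3, 6),
# (1, 5, 6), (2, 3, 5), (2, 3, 6), (2, 5, 6), (3, 5, 6)
```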
Subset Operation Using the Hash Tree

To find which candidates are contained in transaction t = {1 2 3 5 6}, hash on each item that can start a 3-subset, partitioning the work into 1 + {2 3 5 6}, 2 + {3 5 6}, and 3 + {5 6}, and follow the corresponding root branches of the hash tree.
Each branch is expanded the same way at the next level; e.g., the branch 1 + {2 3 5 6} expands into 1 2 + {3 5 6}, 1 3 + {5 6}, and 1 5 + {6}, and so on down to the leaves, where the transaction's 3-subsets are compared against the candidates stored there.

[Figure: the recursive matching of t = {1 2 3 5 6} against the candidate hash tree]

For this transaction, only 5 leaves are visited and only 9 of the 15 candidates are compared against the transaction.
Factors Affecting Complexity of Apriori
Choice of minimum support threshold
– lowering the support threshold results in more frequent itemsets
– this may increase (1) the number of candidates and (2) the max length of frequent itemsets

Dimensionality (number of unique items) of the data set
– more candidates, and more space needed to store their support counts
– if the number of frequent items also increases, both computation and I/O costs may increase

Size of database
– since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions

Average transaction width
– transaction width increases with denser data sets
– this may increase the max length of frequent itemsets and the traversals of the hash tree (the number of subsets in a transaction increases with its width)
Compact Representation of Frequent Itemsets
Itemsets with positive support counts
– 3 groups: {A1, …, A10}, {B1, …, B10}, {C1, …, C10}
– no transaction contains items from different groups

All subsets of items within the same group have the same support count
– e.g., the support counts of {A1}, {A1, A2}, and {A1, A2, A3} are all 5

[Table: 15 transactions over 30 items; TIDs 1–5 contain exactly {A1, …, A10}, TIDs 6–10 exactly {B1, …, B10}, TIDs 11–15 exactly {C1, …, C10}]
Compact Representation of Frequent Itemsets
Suppose the minimum support count is 3. The number of frequent itemsets is

$3 \times \sum_{k=1}^{10} \binom{10}{k} = 3 \times (2^{10} - 1) = 3069$

so we need a compact representation.
Maximal Frequent Itemset
[Figure: the itemset lattice over A–E with a border separating the frequent itemsets (above) from the infrequent itemsets (below); the maximal frequent itemsets sit just above the border]

An itemset is a maximal frequent itemset if it is frequent but none of its immediate supersets is frequent.
Every frequent itemset is either maximal or a subset of some maximal frequent itemset.
Properties of Maximal Frequent Itemsets
If level k is the highest level with maximal frequent itemsets, then level k only contains maximal frequent itemsets or infrequent itemsets
[Figure: the same lattice annotated with levels 0 (null) through 5 (ABCDE)]
Furthermore, level k + 1 (if it exists) contains only infrequent itemsets.
Knowing the maximal frequent itemsets means knowing all frequent itemsets, but not their supports.
Closed Itemset
An itemset is closed if none of its immediate supersets has the same support as the itemset
TID Items
1 ABC
2 ABCD
3 BCE
4 ACDE
5 DE
[Figure: the itemset lattice over A–E annotated with the support count of each itemset, computed from the transactions above (e.g., $\sigma$(C) = 4, $\sigma$(AC) = 3, $\sigma$(ABC) = 2); itemsets with count 0, such as ABCDE, are supported by no transaction, and the closed itemsets are highlighted]
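Closedness and maximality can be checked directly from support counts; here is a small sketch over this slide's five transactions, using the minimum support count of 2 that the following slides assume (helper names are our own):

```python
from itertools import combinations

transactions = [{"A","B","C"}, {"A","B","C","D"}, {"B","C","E"},
                {"A","C","D","E"}, {"D","E"}]
items = sorted({i for t in transactions for i in t})

def sigma(s):
    return sum(1 for t in transactions if s <= t)

def immediate_supersets(s):
    return [s | {i} for i in items if i not in s]

itemsets = [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

min_count = 2
frequent = [s for s in itemsets if sigma(s) >= min_count]

# closed: no immediate superset has the same support count
closed_frequent = [s for s in frequent
                   if all(sigma(p) < sigma(s) for p in immediate_supersets(s))]

# maximal: no immediate superset is frequent
maximal_frequent = [s for s in frequent
                    if all(sigma(p) < min_count for p in immediate_supersets(s))]

print(len(frequent), len(closed_frequent), len(maximal_frequent))  # 14 9 4
```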
Properties of Closed Itemsets
Every itemset is either a closed itemset or has the same support as one of its immediate supersets
If X is not closed, then X has the same support as its immediate superset with the largest support. To see this, suppose Y1, …, Yk are X's immediate supersets and Yi has the largest support among them. Then

$\sigma(X) \ge \sigma(Y_j)$ for every j, so $\sigma(X) \ \ge\ \max\{\sigma(Y_1), \ldots, \sigma(Y_k)\} = \sigma(Y_i)$

Since X is not closed, some immediate superset has the same support as X; that superset must be $Y_i$, hence $\sigma(X) = \sigma(Y_i)$.
Suppose the highest level of the lattice is level k
– the single itemset at level k (the one containing all items) must be closed, since it has no supersets
Consequently, we can compute the support counts of all itemsets from the support counts of the closed itemsets alone.
Working down the lattice level by level: the support of an itemset at level k can be computed from the supports at level k + 1, since an itemset at level k that is not closed has the same support as its immediate superset with the largest support at level k + 1.
Closed Frequent Itemsets
An itemset is a closed frequent itemset if it is both frequent and closed.

With a minimum support count of 2 in this example: # frequent = 14, # closed frequent = 9.

[Figure: the annotated lattice with the closed frequent itemsets shaded]
Frequent Itemsets: Maximal vs Closed
Every maximal frequent itemset is a closed frequent itemset
With a minimum support count of 2: # closed frequent = 9, # maximal frequent = 4.

[Figure: the annotated lattice marking each closed frequent itemset as either "closed and maximal" or "closed but not maximal"]
Frequent Itemsets: Maximal vs Closed
Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets
Alternative Methods for Frequent Itemset Generation

Traversal of the itemset lattice
– Breadth-first: e.g., the Apriori algorithm
– Depth-first: often used to search for maximal frequent itemsets

[Figure: (a) breadth-first vs (b) depth-first traversal of the itemset lattice]
Depth-first traversal finds maximal frequent itemsets more quickly, and it enables substantial pruning
– once abcd is known to be maximal frequent, its subsets (e.g., bc) cannot be maximal; but supersets of bc that are not subsets of abcd (e.g., bce) may still be maximal
Alternative Methods for Frequent Itemset Generation

Traversal of the itemset lattice
– General-to-specific, specific-to-general, and bidirectional

[Figure: the frequent itemset border in the lattice under (a) general-to-specific, (b) specific-to-general, and (c) bidirectional search]
General-to-specific: going from k-itemsets to (k+1)-itemsets
– Adopted by Apriori; effective when the border between frequent and infrequent itemsets is near the top of the lattice
Specific-to-general: going from (k+1)-itemsets to k-itemsets
– often used to find maximal frequent itemsets
– effective when the border is near the bottom of the lattice
Bidirectional: going in both directions at the same time
– needs more space for storing candidates
– may quickly discover the border in situations like (c) above
Alternative Methods for Frequent Itemset Generation

Traverse the lattice by equivalence classes, formed by
– itemsets having the same number of items (e.g., Apriori)
– or itemsets having the same prefix or suffix

[Figure: over items A–D, (a) a prefix tree grouping itemsets by common prefix (e.g., prefix A) and (b) a suffix tree grouping them by common suffix (e.g., suffix D)]
Storage Formats for Transactions

Horizontal layout (e.g., Apriori): store the set/list of items associated with each transaction

TID: Items — 1: A,B,E; 2: B,C,D; 3: C,E; 4: A,C,D; 5: A,B,C,D; 6: A,E; 7: A,B; 8: A,B,C; 9: A,C,D; 10: B

Vertical layout: store the set/list of transactions (TID-list) associated with each item

A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 5, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6
Vertical Data Layout
Computing support = intersecting TID-lists, as sketched below
– the length of a TID-list shrinks as the size of the itemset grows
– problem: the initial lists may be too large to fit in memory
⇒ we need a more compact representation: FP-growth
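A tiny sketch of support counting by TID-list intersection under the vertical layout above (illustrative names):

```python
tid_lists = {
    "A": {1, 4, 5, 6, 7, 8, 9},
    "B": {1, 2, 5, 7, 8, 10},
    "C": {2, 3, 4, 5, 8, 9},
    "D": {2, 4, 5, 9},
    "E": {1, 3, 6},
}

def tid_list(itemset):
    """Intersect the TID-lists of all items in the itemset."""
    return set.intersection(*(tid_lists[i] for i in itemset))

print(tid_list({"A", "C", "D"}))       # {4, 5, 9}
print(len(tid_list({"A", "C", "D"})))  # support count = 3
```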
FP-growth Algorithm: Key Ideas

Limitations of Apriori
– it may need to generate a large number of candidates; e.g., 1k frequent 1-itemsets yield on the order of 1M candidate 2-itemsets
– it needs to scan the database repeatedly

FP-growth (frequent pattern growth)
– discovers frequent itemsets without generating candidates
– often needs to scan the database only twice
FP-growth Algorithm
Step 1:
– scan the database once to discover the frequent 1-itemsets
– scan the database a second time to build a compact representation of the transactions in the form of an FP-tree

Step 2:
– use the constructed FP-tree to recursively find frequent itemsets via divide-and-conquer: turn the problem of finding k-itemsets into a set of subproblems, each finding the k-itemsets ending in a different suffix
FP-tree & Construction
Transactions are represented by paths from the root
– items within a transaction are ordered
– each node is labeled item : number of transactions whose path passes through it

Transaction database: 1 {A,B}, 2 {B,C,D}, 3 {A,C,D,E}, 4 {A,D,E}, 5 {A,B,C}, 6 {A,B,C,D}, 7 {A}, 8 {A,B,C}, 9 {A,B,D}, 10 {B,C,E}

After reading TID=1, the tree is the single path null → A:1 → B:1.
Transactions with common prefixes (e.g., T1 = {A,B} and T3 = {A,C,D,E}, which share the prefix A) share the paths for their prefixes.
Items are often ordered in decreasing support counts
– s(A) = 8, s(B) = 7, s(C) = 6, s(D) = 5, s(E) = 3, so the order is A, B, C, D, E
– a pass over the database is needed to obtain these counts
– with this order, transactions tend to share paths, giving a smaller tree (low branching factor, less bushy)
FP-tree Construction
Nodes for the same item on different paths are connected via node links.

After reading TID=2 ({B,C,D}), the tree has two paths, null → A:1 → B:1 and null → B:1 → C:1 → D:1, with a node link connecting the two B nodes.
After reading TID=3 ({A,C,D,E}): T1 and T3 share the prefix A, so the count of the shared A node becomes A:2, and a new branch C:1 → D:1 → E:1 is added beneath it; the path B:1 → C:1 → D:1 from TID=2 is unchanged.
FP-tree Construction: Completed Tree

After all 10 transactions, the completed FP-tree is:

null
├─ A:8
│  ├─ B:5
│  │  ├─ C:3
│  │  │  └─ D:1
│  │  └─ D:1
│  ├─ C:1
│  │  └─ D:1
│  │     └─ E:1
│  └─ D:1
│     └─ E:1
└─ B:2
   └─ C:2
      ├─ D:1
      └─ E:1

A header table with one pointer per item (A, B, C, D, E) points to the first node for that item, and node links chain all nodes of the same item. These pointers are used to assist frequent itemset generation.
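A minimal sketch of this two-pass construction (class and variable names are our own, not the book's):

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}          # item -> FPNode

transactions = [
    {"A","B"}, {"B","C","D"}, {"A","C","D","E"}, {"A","D","E"},
    {"A","B","C"}, {"A","B","C","D"}, {"A"}, {"A","B","C"},
    {"A","B","D"}, {"B","C","E"},
]

# pass 1: support counts, used to order items within each transaction
counts = defaultdict(int)
for t in transactions:
    for i in t:
        counts[i] += 1
order = sorted(counts, key=counts.get, reverse=True)    # A, B, C, D, E

# pass 2: insert each transaction, sharing common prefixes
root = FPNode(None, None)
node_links = defaultdict(list)      # header table: item -> its nodes
for t in transactions:
    node = root
    for item in sorted(t, key=order.index):
        if item not in node.children:
            node.children[item] = FPNode(item, node)
            node_links[item].append(node.children[item])
        node = node.children[item]
        node.count += 1
```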
FP-growth: Finding Frequent Itemsets
Take the FP-tree generated in step 1 as input.

Find frequent itemsets with common suffixes
– e.g., suppose the order of items is A, B, C, D, E
– itemsets are found in reverse order: first those ending with E, then D, then C, …
– starting with the items of lowest support makes pruning more likely

For each particular suffix, find frequent itemsets via divide-and-conquer; e.g., consider the itemsets ending with E:
1. obtain the sub-tree of the FP-tree in which all paths end with E
2. compute the support count of E; if E is frequent,
3. check and prepare to solve the subproblems: DE, CE, BE, AE
4. recursively compute each subproblem in a similar fashion
In more detail, the sub-tree for a suffix is transformed into a conditional FP-tree; again for the itemsets ending with E:
1. obtain the sub-tree of the FP-tree in which all paths end with E
2. compute the support count of E; if E is frequent,
3. transform the sub-tree for suffix E into the conditional FP-tree for suffix E:
   – update the counts
   – remove the leaves (the E nodes)
   – prune infrequent items
4. use the conditional FP-tree for E to recursively solve DE, CE, BE, AE, in the same fashion as the FP-tree is used to solve E, D, C, B, A

(A sketch of gathering the prefix paths for a suffix item follows.)
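Using the FP-tree and node_links from the sketch above, the prefix paths for a suffix item can be gathered by walking the node links and parent pointers (an illustrative helper, not the book's code):

```python
def prefix_paths(item):
    """For each node of `item`, walk up to the root, recording the prefix
    path together with that node's count."""
    paths = []
    for node in node_links[item]:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        paths.append((list(reversed(path)), node.count))
    return paths

print(prefix_paths("E"))
# expected: [(['A', 'C', 'D'], 1), (['A', 'D'], 1), (['B', 'C'], 1)]
# support count of E = sum of the E-node counts = 3
```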
Traversal of the Lattice by FP-growth

[Figure: the itemset lattice with all itemsets sharing the common suffix E (E, DE, CE, BE, AE, then CDE, BDE, ADE, …, up to ABCDE) grouped into one subproblem, explored depth-first in the numbered order 1, 2, 3, 4]
Discover Frequent Itemsets via FP-tree

[Figure: the same lattice, showing how the FP-tree yields the sub-tree for suffix E, which in turn yields the sub-tree for suffix DE]
Obtaining the Sub-Tree for Suffix E

Find the first path containing E by looking E up in the header table; find the remaining E nodes by following the node links.

[Figure: the full FP-tree with the header-table pointer for E and the node links among the three E nodes highlighted]
Sub-Tree for Suffix E

Remove all descendants of the E nodes and all nodes that are not ancestors of an E node. The paths in the resulting sub-tree are the prefix paths ending in E:

null
├─ A:8
│  ├─ C:1
│  │  └─ D:1
│  │     └─ E:1
│  └─ D:1
│     └─ E:1
└─ B:2
   └─ C:2
      └─ E:1
Current Position of Processing

[Figure: lattice position — the sub-tree for suffix E has just been obtained]
Determine if E is Frequent
The support count of E is the sum of the counts of all E nodes
– suppose the minimum support count is 2
– the support count of E is 1 + 1 + 1 = 3, so E is frequent
Prepare for Solving DE, CE, BE, and AE
Turn the original sub-tree into the conditional FP-tree for suffix E by:
1. updating the counts on the prefix paths
2. removing the E nodes
3. pruning infrequent items
Update Counts on Prefix Paths
The counts on the E (leaf) nodes are correct, but the counts on internal nodes may not be
– they still reflect paths that do not contain E; e.g., the path B → C → D was removed, so the count of 2 on that branch is no longer correct
Start from the leaves and go upward
– if node X has only one child Y, the count of X = the count of Y
– otherwise, the count of X = the sum of the counts of X's children

After updating, the sub-tree becomes:

null
├─ A:2
│  ├─ C:1
│  │  └─ D:1
│  │     └─ E:1
│  └─ D:1
│     └─ E:1
└─ B:1
   └─ C:1
      └─ E:1
Remove E Nodes
The E nodes can now be removed
– the counts of the internal nodes have been updated
– the E nodes carry no further information for solving DE, CE, BE, AE

This leaves:

null
├─ A:2
│  ├─ C:1
│  │  └─ D:1
│  └─ D:1
└─ B:1
   └─ C:1
Prune Infrequent Itemsets
If the sum of the counts of all X nodes is below the minimum support count, remove X, since XE cannot be frequent. Here B (total count 1) is removed, and its child C:1 is attached to the root, giving the conditional FP-tree for E:

null
├─ A:2
│  ├─ C:1
│  │  └─ D:1
│  └─ D:1
└─ C:1
Current Position of Processing

[Figure: in the lattice, the sub-tree for suffix E has been turned into the conditional FP-tree for suffix E (counts updated, E leaves removed, infrequent items pruned); this conditional FP-tree is now used to solve DE, CE, BE, AE, just as the full FP-tree is used to solve E, D, C, B, A]
Solving for DE
From the conditional FP-tree for E, obtain the sub-tree for suffix DE (all paths ending in D):

null
└─ A:2
   ├─ C:1
   │  └─ D:1
   └─ D:1
Current Position of Processing

[Figure: lattice position — the sub-tree for suffix DE has just been obtained]
Solving for DE
The support count of DE = 2 (the sum of the counts of all D nodes in the sub-tree)
– DE is frequent, so we need to solve CDE, BDE, ADE
Preparing for Solving CDE, BDE, ADE
Turn the sub-tree for suffix DE into the conditional FP-tree for suffix DE:
– update the counts and remove the D leaves, giving null → A:2 → C:1
– prune infrequent items: C has total count 1 < 2, so it is removed, leaving null → A:2
Current Position of Processing

[Figure: lattice position — the sub-tree for suffix DE has been turned into the conditional FP-tree for suffix DE, ready to solve CDE, BDE, ADE]
Solving CDE, BDE, ADE
The sub-trees for both CDE and BDE are empty: the conditional FP-tree for DE contains no paths with item C or B.

Working on ADE: its sub-tree is null → A:2, so ADE (support count = 2) is frequent
– there are no further subproblems under ADE, so backtrack
– there are no more subproblems under DE either, so backtrack and move to the next subproblem, CE
Current Position of Processing

[Figure: lattice position — about to solve suffix CE]
Solving for Suffix CE

From the conditional FP-tree for E, the sub-tree for suffix CE consists of the paths null → A:2 → C:1 and null → C:1. The support count of CE = 1 + 1 = 2, so CE is frequent.

The conditional FP-tree for suffix CE is empty: after updating counts A becomes A:1, and once the C leaves are removed, the now-infrequent A (count 1 < 2) is pruned, leaving nothing. So we are done with CE; the next subproblem is BE.
Solving for Suffix BE

The sub-tree for BE is empty: no path in the conditional FP-tree for E contains B (B was pruned as infrequent). Checking AE next: there are paths containing A.
Current Position of Processing

[Figure: lattice position — preparing to solve suffix AE]
Solving for Suffix AE

The sub-tree for suffix AE is null → A:2, so AE (support count = 2) is frequent. Done with AE, backtrack; done with E, backtrack; the next subproblem is suffix D.
Current Position of Processing

[Figure: lattice position — all suffix-E subproblems are solved; ready to solve suffix D]
Found Frequent Itemsets with Suffix E

E, DE, ADE, CE, and AE were discovered, in this order. Processing continues with suffix D.
Rule Generation
Given a frequent itemset L, find all non-empty subsets f ⊂ L such that the rule f → L − f satisfies the minimum confidence requirement
– If {A,B,C,D} is a frequent itemset, the candidate rules are:
  ABC→D, ABD→C, ACD→B, BCD→A, A→BCD, B→ACD, C→ABD, D→ABC,
  AB→CD, AC→BD, AD→BC, BC→AD, BD→AC, CD→AB

If |L| = k, then there are $2^k - 2$ candidate association rules (ignoring L → ∅ and ∅ → L).
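A brute-force sketch of this enumeration, reusing support_count and the five market-basket transactions from the earlier Apriori sketch (illustrative helper names of our own):

```python
from itertools import combinations

def rules_from_itemset(L, min_conf):
    """Enumerate the 2^|L| - 2 candidate rules f -> L - f and keep those
    meeting the minimum confidence."""
    L = frozenset(L)
    kept = []
    for r in range(1, len(L)):                 # every non-empty proper subset f
        for f in map(frozenset, combinations(L, r)):
            conf = support_count(L) / support_count(f)
            if conf >= min_conf:
                kept.append((set(f), set(L - f), conf))
    return kept

for lhs, rhs, conf in rules_from_itemset({"Milk", "Diaper", "Beer"}, 0.6):
    print(lhs, "->", rhs, round(conf, 2))   # the four rules with c >= 0.6
```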
How can rules be generated efficiently from frequent itemsets?
– In general, confidence does not have an anti-monotone property
– e.g., c(ABC → D) can be larger or smaller than c(AB → D):

$c(ABC \to D) = \frac{\sigma(ABCD)}{\sigma(ABC)} \qquad c(AB \to D) = \frac{\sigma(ABD)}{\sigma(AB)}$
But the confidence of rules generated from the same itemset does have an anti-monotone property
– e.g., for L = {A,B,C,D}:

$c(BCD \to A) \ \ge\ c(CD \to AB) \ \ge\ c(D \to ABC)$

since

$\frac{\sigma(ABCD)}{\sigma(BCD)} \ \ge\ \frac{\sigma(ABCD)}{\sigma(CD)} \ \ge\ \frac{\sigma(ABCD)}{\sigma(D)}$

Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
– more items on the right → lower or equal confidence
Rule Generation for Apriori Algorithm
[Figure: the lattice of rules for {A,B,C,D}, from ABCD → { } at the top down to A → BCD, B → ACD, C → ABD, D → ABC at the bottom; when a rule such as BCD → A is found to have low confidence, all rules below it whose consequent contains A (CD → AB, BD → AC, BC → AD, D → ABC, C → ABD, B → ACD) are pruned]
A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
– e.g., join(CD → AB, BD → AC) produces the candidate rule D → ABC
– prune D → ABC if its subset rule AD → BC does not have high confidence
Pattern Evaluation
Association rule algorithms tend to produce too many rules – many of them are uninteresting
In the original formulation of association rules, support & confidence are used
Additional interestingness measures can be used to prune/rank the derived patterns
Computing Interestingness Measure
Given a rule X → Y, the information needed to compute its interestingness can be obtained from a contingency table:

Contingency table for X → Y:
         Y      ¬Y
 X      f11    f10   | f1+
¬X      f01    f00   | f0+
        f+1    f+0   | |T|

f11: support count of X and Y
f10: support count of X and ¬Y
f01: support count of ¬X and Y
f00: support count of ¬X and ¬Y

These counts are used to define various measures: support, confidence, lift, cosine, Jaccard coefficient, etc.
Computing Interestingness Measure

From the contingency table,

$c(X \to Y) = \frac{P(X, Y)}{P(X)} = P(Y \mid X) = \frac{f_{11}}{f_{11} + f_{10}} = \frac{f_{11}}{f_{1+}}$
Drawback of Confidence
         Coffee  ¬Coffee
 Tea       15       5   |  20
¬Tea       75       5   |  80
           90      10   | 100

Association Rule: Tea → Coffee

Confidence = P(Coffee | Tea) = 15/20 = 0.75

but P(Coffee) = 0.9, and P(Coffee | ¬Tea) = 75/80 = 0.9375 > 0.75

Although the confidence is high, the rule is misleading.
Computing Interestingness Measure

Confidence

$c(X \to Y) = \frac{P(X, Y)}{P(X)} = P(Y \mid X)$

is based on P(X,Y) and P(X), but we also need to take P(Y) into account: we need to measure the correlation of X and Y.
Measuring Correlation
Population of 1000 students
– 600 students know how to swim (S)
– 700 students know how to bike (B)
– 420 students know how to swim and bike (S,B)

– P(S,B) = 420/1000 = 0.42
– P(S) × P(B) = 0.6 × 0.7 = 0.42

– P(S,B) = P(S) P(B) ⇒ independent
– P(S,B) > P(S) P(B) ⇒ positively correlated
– P(S,B) < P(S) P(B) ⇒ negatively correlated
For the same population, define

$\text{Lift}(S, B) = \frac{P(B \mid S)}{P(B)} = \frac{P(S, B)}{P(S)\,P(B)}$

– P(S,B) = P(S) P(B) ⇔ lift = 1 ⇔ S and B are independent
– P(S,B) > P(S) P(B) ⇔ lift > 1 ⇔ positively correlated
– P(S,B) < P(S) P(B) ⇔ lift < 1 ⇔ negatively correlated
Measures on Correlation
The larger the value, usually the more likely the two variables/events are correlated.

$\text{Lift}(X, Y) = \frac{P(X, Y)}{P(X)\,P(Y)}$

$\text{Cosine}(X, Y) = \frac{P(X, Y)}{\sqrt{P(X)\,P(Y)}}$

$\text{Jaccard}(X, Y) = \frac{P(X, Y)}{P(X) + P(Y) - P(X, Y)}$

… and many others.
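A small helper computing these three measures from the four contingency-table counts (an illustrative sketch, not library code):

```python
import math

def measures(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    p_xy = f11 / n             # P(X,Y)
    p_x = (f11 + f10) / n      # P(X) = f1+ / N
    p_y = (f11 + f01) / n      # P(Y) = f+1 / N
    return {
        "lift":    p_xy / (p_x * p_y),
        "cosine":  p_xy / math.sqrt(p_x * p_y),
        "jaccard": p_xy / (p_x + p_y - p_xy),
    }

# the Tea -> Coffee table from earlier: lift = 0.75 / 0.9 ~ 0.833
print(measures(15, 5, 75, 5))
```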
Computing Lift from the Contingency Table

$\text{Lift}(X, Y) = \frac{P(X, Y)}{P(X)\,P(Y)} = \frac{f_{11}/N}{(f_{1+}/N)(f_{+1}/N)} = \frac{N\,f_{11}}{f_{1+}\,f_{+1}}$
For the Tea → Coffee table above: Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9, so

Lift = 0.75 / 0.9 = 0.8333 (< 1: Tea and Coffee are negatively associated)
Drawback of Lift
         Y    ¬Y                          Y    ¬Y
 X      10     0 |  10            X      90     0 |  90
¬X       0    90 |  90           ¬X       0    10 |  10
        10    90 | 100                   90    10 | 100

$\text{Lift} = \frac{0.1}{0.1 \times 0.1} = 10 \qquad\qquad \text{Lift} = \frac{0.9}{0.9 \times 0.9} = 1.11$

Statistical independence means that if P(X,Y) = P(X) P(Y), then Lift = 1. Yet here Lift = 10 when X and Y seldom occur together, while Lift = 1.11 when X and Y frequently occur together.
Computing Cosine from the Contingency Table

$\text{Cosine}(X, Y) = \frac{P(X, Y)}{\sqrt{P(X)\,P(Y)}} = \frac{f_{11}/N}{\sqrt{(f_{1+}/N)(f_{+1}/N)}} = \frac{f_{11}}{\sqrt{f_{1+}\,f_{+1}}}$
Cosine vs Lift

$\text{Cosine}(X, Y) = \frac{f_{11}}{\sqrt{f_{1+}\,f_{+1}}} \qquad\qquad \text{Lift}(X, Y) = \frac{N\,f_{11}}{f_{1+}\,f_{+1}}$

• If X and Y are independent, Lift(X,Y) = 1 but Cosine(X,Y) = $\sqrt{P(X)\,P(Y)}$
• Cosine does not depend on N (and hence on f00), unlike Lift
Computing Jaccard from the Contingency Table

$\text{Jaccard}(X, Y) = \frac{P(X, Y)}{P(X) + P(Y) - P(X, Y)} = \frac{f_{11}}{f_{1+} + f_{+1} - f_{11}}$
Property under Null Addition

Null addition adds s transactions that contain neither X nor Y: f00 becomes f00 + s and N becomes N + s, while f11, f10, and f01 are unchanged.

Invariant measures: confidence, cosine, Jaccard, etc.
Non-invariant measures: support (f11/N vs f11/(N+s)), lift, etc.
Property under Variable Permutation

Variable permutation swaps the roles of A and B in the contingency table. Does M(A,B) = M(B,A)? If yes, M is symmetric; otherwise it is asymmetric.

E.g., does c(A → B) = c(B → A)?

$c(A \to B) = \frac{\sigma(A \cup B)}{\sigma(A)} \qquad c(B \to A) = \frac{\sigma(A \cup B)}{\sigma(B)}$

— in general, no: confidence is asymmetric.
Does lift(A → B) = lift(B → A)?

$\text{lift}(A \to B) = \frac{P(A, B)}{P(A)\,P(B)} = \frac{P(B, A)}{P(B)\,P(A)} = \text{lift}(B \to A)$

— yes: lift is symmetric.
Symmetric measures: support, lift, cosine, Jaccard, etc.
Asymmetric measures: confidence, etc.
Applying Interestingness Measures
Association rules (A → B) are directional
– confidence (asymmetric) matches this directionality and is intuitive
– but confidence does not capture correlation
– so use additional measures, such as lift, to rank the discovered rules and further prune those with low ranks