Mining Frequent patterns without candidate generation

Jiawei Han, Jian Pei and Yiwen Yin

Abdullah Mueen

Problem: Mining Frequent Patterns

• I = {a1, a2, …, am} is a set of items.

• DB = {T1, T2, …, Tn} is the database of transactions, where each transaction is a non-empty subset of I.

• A pattern is also a subset of I.

• A pattern is frequent if it is contained in (supported by) at least a fixed number (ξ) of transactions.
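As a small illustration of these definitions, the sketch below counts the support of a pattern and compares it with ξ. The names `support` and `is_frequent` are illustrative only, and the five transactions are the ones used in the FP-tree example of this deck. It assumes that "frequent" means support ≥ ξ, which is how the worked example behaves (p has support 3 and is frequent at ξ = 3).

```python
def support(pattern, db):
    """Number of transactions in db that contain every item of pattern."""
    items = set(pattern)
    return sum(1 for t in db if items <= set(t))


def is_frequent(pattern, db, xi):
    """Assumption: a pattern is frequent when its support is at least xi."""
    return support(pattern, db) >= xi


# Transactions 100-500 of the FP-tree example, one string of items each.
DB = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]

print(support("cp", DB))          # transactions containing both c and p
print(is_frequent("cp", DB, 3))
print(is_frequent("ap", DB, 3))
```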

Previous work : Apriori

• It may need to generate a huge number of candidate itemsets. To discover a frequent pattern of size k, it must generate about 2^k candidates in total, because every non-empty subset of the pattern is itself a candidate.

• It may need to scan the database repeatedly and check for the frequencies of the candidates.
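To make the candidate blow-up concrete, the sketch below enumerates the non-empty subsets of one pattern. Every such subset is itself frequent, so a level-wise method has to generate at least these 2^k − 1 itemsets as candidates. The helper name `nonempty_subsets` is an illustrative choice, not part of Apriori.

```python
from itertools import combinations


def nonempty_subsets(pattern):
    """Yield every non-empty subset of pattern, smallest subsets first."""
    items = sorted(pattern)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield combo


subsets = list(nonempty_subsets("fcam"))
print(len(subsets), 2 ** 4 - 1)   # both counts agree for k = 4

# The same count for larger k grows exponentially.
for k in (10, 30, 100):
    print(k, 2 ** k - 1)
```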

FP-growth

• FP-growth mines frequent patterns without generating candidate sets. It grows patterns from fragments.

• It builds an extended prefix tree (FP-tree) for the transaction database. The tree is a compressed representation of the database, so mining does not need repeated scans of the database.

FP-tree

TID   Items Bought       Frequent Items
100   f,a,c,d,g,i,m,p    f,c,a,m,p
200   a,b,c,f,l,m,o      f,c,a,b,m
300   b,f,h,j,o          f,b
400   b,c,k,s,p          c,b,p
500   a,f,c,e,l,p,m,n    f,c,a,m,p

The frequent items are sorted in descending order of frequency.

root
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1

Minimum support (ξ) = 3
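The construction of this tree can be sketched in two scans, as below. This is a minimal sketch rather than the authors' implementation: `Node`, `build_fp_tree`, `render` and the `tie_order` parameter are names introduced here. Several items tie in frequency (f and c at 4; a, b, m and p at 3), so `tie_order="fcabmp"` is passed only to reproduce the item order shown on the slide. A different tie-break gives a differently shaped but equally valid tree.

```python
from collections import Counter


class Node:
    """One FP-tree node: an item, its count, and children keyed by item."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, xi, tie_order=""):
    # Scan 1: count the support of every item.
    counts = Counter(i for t in transactions for i in set(t))

    def key(i):
        # Descending count; ties follow tie_order, then alphabetical order.
        pos = tie_order.index(i) if i in tie_order else len(tie_order)
        return (-counts[i], pos, i)

    rank = {i: r for r, i in enumerate(
        sorted((i for i in counts if counts[i] >= xi), key=key))}

    # Scan 2: insert the ordered frequent items of each transaction.
    root = Node(None)
    header = {i: [] for i in rank}      # item -> list of its tree nodes
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = Node(item, node)
                header[item].append(child)
            child.count += 1
            node = child
    return root, header


def render(node, depth=0):
    """Return the tree below node as indented 'item:count' lines."""
    lines = []
    for child in node.children.values():
        lines.append("  " * depth + f"{child.item}:{child.count}")
        lines.extend(render(child, depth + 1))
    return lines


DB = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
root, header = build_fp_tree(DB, 3, tie_order="fcabmp")
print("\n".join(render(root)))
```

The `header` dictionary plays the role of the header table: it lists all nodes of each item, which is what later steps follow to collect prefix paths.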

Conditional FP-tree of p

Conditional pattern base for p

Items Bought    Frequent Items
f,c,a,m         c
f,c,a,m         c
c,b             c

Conditional FP-tree of p

root
  c:3

Minimum support (ξ) = 3

The set of frequent patterns containing p is { p, cp }
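The conditional pattern base can be sketched as follows. For brevity the sketch reads the prefixes straight from the ordered "Frequent Items" column instead of following node links in the FP-tree; both give the same multiset of prefix paths. The function names are illustrative.

```python
from collections import Counter

# "Frequent Items" column of the example, one ordered string per transaction.
ORDERED = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]


def conditional_pattern_base(item, ordered_transactions):
    """Collect the items that precede item in each ordered transaction."""
    base = []
    for t in ordered_transactions:
        if item in t:
            prefix = t[:t.index(item)]
            if prefix:
                base.append(prefix)
    return base


def conditional_frequent_items(base, xi):
    """Items that stay frequent inside the conditional pattern base."""
    counts = Counter(i for prefix in base for i in prefix)
    return {i: c for i, c in counts.items() if c >= xi}


base_p = conditional_pattern_base("p", ORDERED)
print(base_p)                                  # ['fcam', 'cb', 'fcam']
print(conditional_frequent_items(base_p, 3))   # {'c': 3}
```

Only c survives the threshold, which is why the conditional FP-tree of p has the single node c:3.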

Frequent patterns containing m

Conditional pattern base for m

Items Bought    Frequent Items
f,c,a           f,c,a
f,c,a           f,c,a
f,c,a,b         f,c,a

Conditional FP-tree of m

root
  f:3
    c:3
      a:3

Conditional pattern base for am

Items    Freq. Items
f,c      f,c
f,c      f,c
f,c      f,c

Conditional FP-tree of am

root
  f:3
    c:3

Conditional pattern base for cam

Items    Freq. Items
f        f
f        f
f        f

Conditional FP-tree of cam

root
  f:3

The set of frequent patterns containing m is { m, am, cam, fcam, fam, cm, fcm, fm }

Complete Frequent Pattern set

Length 1: f, a, c, p, m, b
Length 2: fc, cp, fm, fa, ca, am, cm
Length 3: cam, fca, fcm, fam
Length 4: fcam

The patterns containing m are generated by the conditional FP-tree of m, which is a single path.

• A single path generates each combination of its nodes as a frequent pattern.

• The support of a pattern equals the minimum support among the nodes it contains.

root
  f:3
    c:3
      a:3
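The single-path rule can be sketched directly. The function `single_path_patterns` and its argument layout are assumptions made for this illustration: the path is a list of (item, count) pairs from the root downward, and the suffix is the pattern whose conditional tree this is.

```python
from itertools import combinations


def single_path_patterns(path, suffix_items, suffix_support):
    """Enumerate every combination of path nodes, each joined with the suffix.

    The support of a combination is the smallest count among its nodes.
    """
    patterns = {suffix_items: suffix_support}   # the suffix itself
    for size in range(1, len(path) + 1):
        for combo in combinations(path, size):
            items = "".join(i for i, _ in combo) + suffix_items
            patterns[items] = min(c for _, c in combo)
    return patterns


# Conditional FP-tree of m: the single path f:3, c:3, a:3.
m_patterns = single_path_patterns([("f", 3), ("c", 3), ("a", 3)], "m", 3)
for pattern, sup in m_patterns.items():
    print(pattern, sup)
```

A path with three nodes has 2^3 − 1 = 7 non-empty combinations, which together with m itself gives the eight patterns containing m.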

Pseudocode

Procedure FP-growth(Tree, α)
  if Tree contains a single path P then
    for each combination β of the nodes in P do
      generate pattern β ∪ α with support = minimum support of the nodes in β
  else
    for each ai in the header of Tree do
      generate pattern β = ai ∪ α with support = ai.support
      construct β's conditional pattern base and conditional FP-tree Treeβ
      if Treeβ ≠ Ø then
        call FP-growth(Treeβ, β)
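The procedure above can be turned into a compact runnable sketch. This is one possible reading of the pseudocode under stated assumptions, not the authors' code: the names `Node`, `build_tree`, `single_path`, `fp_growth` and `mine` are introduced here, conditional pattern bases are passed as (items, count) pairs, and frequency ties are broken alphabetically, so the initial tree is shaped differently from the one on the slide while the mined patterns are the same.

```python
from collections import Counter
from itertools import combinations


class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}


def build_tree(weighted, xi):
    """Build an FP-tree from (items, count) pairs; return (root, header)."""
    counts = Counter()
    for items, n in weighted:
        for i in set(items):
            counts[i] += n
    frequent = sorted((i for i in counts if counts[i] >= xi),
                      key=lambda i: (-counts[i], i))
    rank = {i: r for r, i in enumerate(frequent)}
    root, header = Node(None, None), {i: [] for i in rank}
    for items, n in weighted:
        node = root
        for i in sorted((i for i in set(items) if i in rank), key=rank.get):
            if i not in node.children:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += n
    return root, header


def single_path(root):
    """Return [(item, count), ...] if the tree is one path, else None."""
    path, node = [], root
    while node.children:
        if len(node.children) > 1:
            return None
        node = next(iter(node.children.values()))
        path.append((node.item, node.count))
    return path


def fp_growth(root, header, alpha, xi, out):
    path = single_path(root)
    if path is not None:
        # Single path: every combination of its nodes, joined with alpha.
        for size in range(1, len(path) + 1):
            for combo in combinations(path, size):
                beta = frozenset(i for i, _ in combo) | alpha
                out[beta] = min(c for _, c in combo)
        return
    for item, nodes in header.items():
        beta = alpha | {item}
        out[beta] = sum(n.count for n in nodes)
        # Conditional pattern base: the prefix path of every node of item.
        base = []
        for n in nodes:
            prefix, p = [], n.parent
            while p.item is not None:
                prefix.append(p.item)
                p = p.parent
            if prefix:
                base.append((prefix, n.count))
        sub_root, sub_header = build_tree(base, xi)
        if sub_header:                   # conditional FP-tree is not empty
            fp_growth(sub_root, sub_header, beta, xi, out)


def mine(transactions, xi):
    root, header = build_tree([(t, 1) for t in transactions], xi)
    out = {}
    fp_growth(root, header, frozenset(), xi, out)
    return {"".join(sorted(p)): s for p, s in out.items()}


DB = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
patterns = mine(DB, 3)
print(len(patterns))
print(sorted(patterns.items()))
```

Pattern names are printed with their items in alphabetical order, so fcam appears as `acfm`.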

Implementation issues

• Different support thresholds (ξ) yield different FP-trees. A tree built for ξ = 20 can also answer any query with ξ ≥ 20, so we may choose ξ = 20 if 98% of the queries have ξ ≥ 20.

• Updating the FP-tree after each new transaction may be costly. We may count the occurrence frequency of every item and update the tree only when the relative frequency of an item changes significantly.

New Challenges

• FP-growth may output a very large number of frequent patterns for small ξ and very few for large ξ. We may not know the right ξ for our purpose.

• Which frequent patterns are good instances for generating interesting association rules?

Top-K frequent closed patterns

• A closed pattern is a pattern whose support is strictly larger than that of each of its proper super-patterns.

TID   Items Bought       Frequent Items
100   f,a,c,d,g,i,m,p    f,c,a,m,p
200   a,b,c,f,l,m,o      f,c,a,b,m
300   b,f,h,j,o          f,b
400   b,c,k,s,p          f,c,b,p
500   a,f,c,e,l,p,m,n    f,c,a,m,p

Frequent patterns with their supports:

Length 1: f:5, c:4, a:3, b:3, m:3, p:3
Length 2: fc:4, fp:3, fb:3, cp:3, fm:3, fa:3, ca:3, am:3, cm:3
Length 3: cam:3, fca:3, fcm:3, fam:3
Length 4: fcam:3

• We can also specify the minimum length of the patterns.

• The top-2 frequent closed patterns with length ≥ 2 are fc and fcam.
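The closedness test and the top-K selection can be sketched by brute force on this small example. Two assumptions are made and should be read as such: transaction 400 is taken to contain f as well, which is what its "Frequent Items" entry and the count f:5 imply, and patterns with equal support are ranked alphabetically, an arbitrary tie-break. The function names are illustrative, and this is not the Top-K mining algorithm itself.

```python
from itertools import combinations

# Assumption: transaction 400 includes f (see the lead-in above).
DB = ["facdgimp", "abcflmo", "bfhjo", "bcfksp", "afcelpmn"]


def frequent_patterns(db, xi):
    """Brute force: test every combination of the individually frequent items."""
    items = sorted({i for t in db for i in t})
    items = [i for i in items if sum(i in t for t in db) >= xi]
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            sup = sum(all(i in t for i in combo) for t in db)
            if sup >= xi:
                result[frozenset(combo)] = sup
    return result


def closed_patterns(patterns):
    """Keep the patterns that have no proper super-pattern of equal support."""
    return {p: s for p, s in patterns.items()
            if not any(p < q and s == t for q, t in patterns.items())}


def top_k_closed(patterns, k, min_length):
    """The k closed patterns of highest support with at least min_length items."""
    closed = [(s, "".join(sorted(p)))
              for p, s in closed_patterns(patterns).items()
              if len(p) >= min_length]
    return sorted(closed, key=lambda x: (-x[0], x[1]))[:k]


print(top_k_closed(frequent_patterns(DB, 3), k=2, min_length=2))
```

Pattern names are printed with their items in alphabetical order, so fc appears as `cf` and fcam as `acfm`. Several closed patterns share support 3 here, so the second place depends on the tie-break.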

Mining Top-K closed FP

• The algorithm starts with an FP-tree built with a support threshold of 0.

• While building the tree, it prunes patterns shorter than min_length.

• After the tree is built, it prunes the relatively infrequent patterns by raising the support threshold.

• Mining is performed on the final pruned FP-tree.

Compressed Frequent Pattern

• FP-growth may end up with a large set of patterns.

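• We can compress the set of frequent patterns by grouping it into as few clusters as possible and selecting a representative pattern from each cluster.

Length 1: f, a, c, p, m, b
Length 2: fc, cp, fm, fa, ca, am, cm
Length 3: cam, fca, fcm, fam
Length 4: fcam

Representative patterns: {fcam, cam, cp, b}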

Clustering Criterion

• For each cluster there must be a representative pattern Pr.

• D(P, Pr) ≤ δ for all patterns P inside the cluster of Pr.

• D(P1, P2) = 1 − |T(P1) ∩ T(P2)| / |T(P1) ∪ T(P2)|

• T(P) is the set of transactions that support P.

• D is a metric for closed patterns.
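The distance can be computed directly from the transaction sets, as in the sketch below. It uses the five example transactions as printed in the "Items Bought" column of the FP-tree slide. `T` and `D` follow the notation above, while `in_cluster` is a helper name introduced here to express the criterion D(P, Pr) ≤ δ.

```python
DB = {100: "facdgimp", 200: "abcflmo", 300: "bfhjo",
      400: "bcksp", 500: "afcelpmn"}


def T(pattern, db):
    """IDs of the transactions that contain every item of pattern."""
    return {tid for tid, items in db.items() if set(pattern) <= set(items)}


def D(p1, p2, db):
    """1 - |T(p1) ∩ T(p2)| / |T(p1) ∪ T(p2)|; assumes at least one
    of the two patterns is supported by some transaction."""
    t1, t2 = T(p1, db), T(p2, db)
    return 1 - len(t1 & t2) / len(t1 | t2)


def in_cluster(pattern, representative, db, delta):
    """True when pattern is within distance delta of the representative."""
    return D(pattern, representative, db) <= delta


print(D("cam", "fcam", DB))   # same supporting transactions, distance 0
print(D("c", "cp", DB))       # 1 - 3/4
print(in_cluster("cam", "fcam", DB, delta=0.1))
```

Two patterns supported by exactly the same transactions are at distance 0, so one of them can represent the other without any loss of support information.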

Summary

• FP-tree is an extended prefix tree that summarizes the database in a compressed form.

• FP-growth is an algorithm for mining frequent patterns using FP-tree.

• FP-tree can also be used to mine Top-K frequent closed patterns and Compressed frequent patterns.

References

• Mining Frequent Patterns without Candidate Generation. Jiawei Han, Jian Pei and Yiwen Yin.

• Mining Top-K Frequent Closed Patterns without Minimum Support. Jiawei Han, Jianyong Wang, Ying Lu and Petre Tzvetkov.

• Mining Compressed Frequent-Pattern Sets. Dong Xin, Jiawei Han, Xifeng Yan and Hong Cheng.

Thank You