Frequent Pattern and Association Analysis (based on the slides of the book: Data Mining: C & T)
Frequent Pattern and Association Analysis
(based on the slides of the book: Data Mining: C & T)
SAD Tagus 2004/05 H. Galhardas
Frequent Pattern and Association Analysis

- Basic concepts
- Scalable mining methods
- Mining a variety of rules and interesting patterns
- Constraint-based mining
- Mining sequential and structured patterns
- Extensions and applications
What Is Frequent Pattern Analysis?

Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set.

First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of association rule mining.
Motivation

Finding inherent regularities in data:
- What products were often purchased together? — Beer and diapers?!
- What are the subsequent purchases after buying a PC?
- What kinds of DNA are sensitive to this new drug?
- Can we automatically classify web documents?

Examples: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, DNA sequence analysis, etc.
Basic Concepts (1)

- Item: Boolean variable representing its presence or absence
- Basket: Boolean vector of variables, analyzed to discover patterns of items that are frequently associated (or bought) together
- Association rules: association patterns X => Y
Basic Concepts (2)

Itemset X = {x1, …, xk}. Find all the rules X => Y with minimum support and confidence.

- Support, s: probability that a transaction contains X ∪ Y
  Support = # tuples(X ∪ Y) / # total tuples = P(X ∪ Y)
- Confidence, c: conditional probability that a transaction having X also contains Y
  Confidence = # tuples(X ∪ Y) / # tuples(X) = P(Y|X)

(Venn diagram: customers buying beer, customers buying diapers, customers buying both.)

Transaction-id | Items bought
10 | A, B, D
20 | A, C, D
30 | A, D, E
40 | B, E, F
50 | B, C, D, E, F
Basic Concepts (3)

Interesting (or strong) rules: satisfy both a minimum support threshold and a minimum confidence threshold.

Transaction-id | Items bought
10 | A, B, D
20 | A, C, D
30 | A, D, E
40 | B, E, F
50 | B, C, D, E, F

Let sup_min = 50%, conf_min = 50%. Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}.

Association rules:
- A => D (60%, 100%): 60% of all analyzed transactions show that A and D are bought together; 100% of customers that bought A also bought D
- D => A (60%, 75%)
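The support and confidence figures above can be reproduced directly from the transaction table; a minimal sketch (function and variable names are my own):

```python
# Transaction database from the slide (Tid -> items)
db = {
    10: {"A", "B", "D"},
    20: {"A", "C", "D"},
    30: {"A", "D", "E"},
    40: {"B", "E", "F"},
    50: {"B", "C", "D", "E", "F"},
}

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db.values()) / len(db)

def confidence(lhs, rhs):
    """P(rhs | lhs): share of transactions with lhs that also contain rhs."""
    lhs, rhs = set(lhs), set(rhs)
    return support(lhs | rhs) / support(lhs)

print(support({"A", "D"}))        # 0.6  -> support of A => D is 60%
print(confidence({"A"}, {"D"}))   # 1.0  -> confidence of A => D is 100%
print(confidence({"D"}, {"A"}))   # 0.75 -> confidence of D => A is 75%
```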
Basic Concepts (4)

- Itemset (or pattern): set of items (a k-itemset has k items)
- Occurrence frequency of itemset: # of transactions containing the itemset. Also known as: frequency, support count or count of the itemset
- Frequent itemset: itemset satisfying minimum support, i.e., frequency >= min_sup * # total transactions
- Confidence(A => B) = P(B|A) = frequency(A ∪ B) / frequency(A)

The problem of mining association rules reduces to the problem of mining frequent itemsets.
Mining frequent itemsets

A long pattern contains a combinatorial number of sub-patterns.

Solution: mine closed patterns and max-patterns.

- An itemset X is closed frequent if X is frequent and there exists no super-itemset Y ⊃ X with the same support as X
- An itemset X is a max-itemset if X is frequent and there exists no frequent super-itemset Y ⊃ X
- Closed itemsets are a lossless compression of frequent patterns: they reduce the # of patterns and rules
Example

DB = {<a1, …, a100>, <a1, …, a50>}, min_sup = 1.

- What is the set of closed itemsets?
  <a1, …, a100>: 1
  <a1, …, a50>: 2
- What is the set of max-patterns?
  <a1, …, a100>: 1
- What is the set of all patterns? All 2^100 − 1 non-empty subsets of <a1, …, a100> — far too many to enumerate!
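The closed/max definitions can be checked on a scaled-down analogue of this example — two transactions where the second is a prefix of the first (the four-item size and names are illustrative, not from the slide):

```python
from itertools import combinations

# Scaled-down analogue of the slide's DB, min_sup = 1.
db = [frozenset("abcd"), frozenset("ab")]
min_sup = 1

items = sorted(set().union(*db))
all_itemsets = [frozenset(c) for r in range(1, len(items) + 1)
                for c in combinations(items, r)]

sup = {x: sum(x <= t for t in db) for x in all_itemsets}
frequent = {x: s for x, s in sup.items() if s >= min_sup}

# Closed: no frequent proper superset has the same support.
closed = {x for x in frequent
          if not any(x < y and sup[y] == sup[x] for y in frequent)}
# Max: no proper superset is frequent at all.
maximal = {x for x in frequent if not any(x < y for y in frequent)}

print(sorted("".join(sorted(x)) for x in closed))   # ['ab', 'abcd']
print(sorted("".join(sorted(x)) for x in maximal))  # ['abcd']
```

As on the slide, the closed itemsets keep both patterns with their exact supports, while the single max-pattern drops the support of the shorter one; all 2^4 − 1 itemsets are frequent.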
Frequent Pattern and Association Analysis

- Basic concepts
- Scalable mining methods
- Mining a variety of rules and interesting patterns
- Constraint-based mining
- Mining sequential and structured patterns
- Extensions and applications
Scalable Methods for Mining Frequent Patterns

Downward closure (or Apriori) property of frequent patterns: any subset of a frequent itemset must be frequent. If {beer, diaper, nuts} is frequent, so is {beer, diaper}, i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}.

Scalable mining methods — three major approaches:
- Apriori (Agrawal & Srikant @VLDB'94)
- Frequent pattern growth (FPgrowth — Han, Pei & Yin @SIGMOD'00)
- Vertical data format approach (Charm — Zaki & Hsiao @SDM'02)
Apriori: A Candidate Generation-and-Test Approach

Method:
- Initially, scan DB once to get the frequent 1-itemsets
- Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
- Test the candidates against the DB
- Terminate when no frequent or candidate set can be generated

Apriori pruning principle: if there is any itemset which is infrequent, its supersets should not be generated/tested!
How to Generate Candidates?

Suppose the items in Lk-1 are listed in an order.

Step 1: self-joining Lk-1

    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning

    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
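The self-join and pruning steps above can be sketched in Python (function name is mine; the test values come from the candidate-generation example later in these slides, where L3 = {abc, abd, acd, ace, bcd} yields C4 = {abcd}):

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets.

    Self-join: merge two (k-1)-itemsets that agree on their first k-2
    items (in sorted order). Prune: drop any candidate with an
    infrequent (k-1)-subset, per the Apriori pruning principle.
    """
    L_prev = {tuple(sorted(x)) for x in L_prev}
    candidates = set()
    for p in L_prev:
        for q in L_prev:
            # join condition: equal prefixes, last item of p < last item of q
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))
    # pruning: every (k-1)-subset of a candidate must be in L_prev
    return {c for c in candidates
            if all(s in L_prev for s in combinations(c, k - 1))}

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(apriori_gen(L3, 4))  # {('a','b','c','d')} — acde is pruned (ade not in L3)
```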
Example

Database TDB (sup_min = 2):

Tid | Items
10 | A, C, D
20 | B, C, E
30 | A, B, C, E
40 | B, E

1st scan — C1:

Itemset | sup
{A} | 2
{B} | 3
{C} | 3
{D} | 1
{E} | 3

L1:

Itemset | sup
{A} | 2
{B} | 3
{C} | 3
{E} | 3

C2 (self-join on L1):

Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}

2nd scan — C2 with counts:

Itemset | sup
{A, B} | 1
{A, C} | 2
{A, E} | 1
{B, C} | 2
{B, E} | 3
{C, E} | 2

L2:

Itemset | sup
{A, C} | 2
{B, C} | 2
{B, E} | 3
{C, E} | 2

C3:

Itemset
{B, C, E}

3rd scan — L3:

Itemset | sup
{B, C, E} | 2
The Apriori Algorithm

Pseudo-code:

    Ck: candidate itemset of size k
    Lk: frequent itemset of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1
            that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;
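The pseudo-code above can be sketched as a plain Python implementation (names are mine; support counting scans each transaction per candidate rather than using the hash-tree discussed later, so this is a didactic sketch, not an optimized version):

```python
from itertools import combinations

def apriori(transactions, min_sup_count):
    """Return every frequent itemset with its support count (Apriori)."""
    transactions = [frozenset(t) for t in transactions]

    # Scan 1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {x: c for x, c in counts.items() if c >= min_sup_count}
    frequent = dict(L)

    k = 2
    while L:
        # Self-join + prune to build the candidate k-itemsets.
        prev = sorted(tuple(sorted(x)) for x in L)
        candidates = set()
        for p in prev:
            for q in prev:
                if p[:-1] == q[:-1] and p[-1] < q[-1]:
                    c = p + (q[-1],)
                    if all(frozenset(s) in L for s in combinations(c, k - 1)):
                        candidates.add(frozenset(c))
        # Scan the DB once to count the surviving candidates.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        L = {x: n for x, n in counts.items() if n >= min_sup_count}
        frequent.update(L)
        k += 1
    return frequent

# Database TDB from the worked example, sup_min = 2.
freq = apriori([{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}], 2)
print(freq[frozenset({"B","C","E"})])  # 2, matching L3 in the example
```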
Important Details of Apriori

How to generate candidates?
- Step 1: self-joining Lk
- Step 2: pruning

Example of candidate generation:
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3*L3 gives abcd (from abc and abd) and acde (from acd and ace)
- Pruning: acde is removed because ade is not in L3
- C4 = {abcd}
Another example (1)

Another example (2)
Generation of association rules from frequent itemsets

Confidence(A => B) = P(B|A) = support_count(A ∪ B) / support_count(A)

- support_count(A ∪ B): number of transactions containing itemsets A ∪ B
- support_count(A): number of transactions containing itemset A

Association rules can be generated as:
- For each frequent itemset l, generate all nonempty subsets of l
- For every nonempty subset s of l, output the rule s => (l − s) if support_count(l) / support_count(s) >= min_conf
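The two-step rule generation above can be sketched as follows; it assumes the frequent itemsets and their support counts have already been mined (e.g., by Apriori), and the function name is mine:

```python
from itertools import combinations

def generate_rules(support_counts, min_conf):
    """Emit every rule s => (l - s) whose confidence reaches min_conf.

    `support_counts` maps each frequent itemset (frozenset) to its
    support count; confidence = support_count(l) / support_count(s).
    """
    rules = []
    for l, l_count in support_counts.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):          # all nonempty proper subsets s
            for s in map(frozenset, combinations(l, r)):
                conf = l_count / support_counts[s]
                if conf >= min_conf:
                    rules.append((s, l - s, conf))
    return rules

# Frequent itemsets of the earlier 5-transaction example (sup_min = 50%).
counts = {frozenset("A"): 3, frozenset("D"): 4, frozenset("AD"): 3}
for s, rhs, conf in generate_rules(counts, 0.5):
    print(set(s), "=>", set(rhs), f"conf={conf:.2f}")
```

This reproduces the two rules from the Basic Concepts example: A => D with confidence 1.00 and D => A with confidence 0.75.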
How to Count Supports of Candidates?

Why is counting the supports of candidates a problem?
- The total number of candidates can be very huge
- One transaction may contain many candidates

Method based on hashing:
- Candidate itemsets are stored in a hash-tree
- A leaf node of the hash-tree contains a list of itemsets and counts
- An interior node contains a hash table
- Subset function: finds all the candidates contained in a transaction
Challenges of Frequent Pattern Mining

Challenges:
- Multiple scans of the transaction database
- Huge number of candidates
- Tedious workload of support counting for candidates

Improving Apriori — general ideas:
- Partitioning
- Sampling
- Others
Partition: Scan Database Only Twice

Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
- Scan 1: partition the database and find local frequent patterns
- Scan 2: consolidate global frequent patterns
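A minimal sketch of the two-scan idea (the local miner is brute force for brevity, and the partition sizes and function names are illustrative assumptions, not from the slides):

```python
from itertools import combinations

def frequent_itemsets(transactions, min_sup_count):
    """Brute-force local miner: all itemsets meeting min_sup_count."""
    items = sorted(set().union(*transactions))
    result = {}
    for r in range(1, len(items) + 1):
        for c in map(frozenset, combinations(items, r)):
            n = sum(c <= t for t in transactions)
            if n >= min_sup_count:
                result[c] = n
    return result

def partition_mine(transactions, min_sup_ratio, n_parts=2):
    """Two-scan partition algorithm: local mining, then global counting."""
    transactions = [frozenset(t) for t in transactions]
    size = -(-len(transactions) // n_parts)  # ceiling division
    # Scan 1: any globally frequent itemset is frequent in >= 1 partition,
    # so the union of local frequent itemsets contains every candidate.
    candidates = set()
    for i in range(0, len(transactions), size):
        part = transactions[i:i + size]
        local_min = max(1, int(min_sup_ratio * len(part)))
        candidates |= frequent_itemsets(part, local_min).keys()
    # Scan 2: count the candidates over the whole database.
    min_count = min_sup_ratio * len(transactions)
    return {c for c in candidates
            if sum(c <= t for t in transactions) >= min_count}

db = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
print(sorted("".join(sorted(x)) for x in partition_mine(db, 0.5)))
```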
Sampling for Frequent Patterns

- Select a sample of the original database, mine frequent patterns within the sample using Apriori
- Scan the database once to verify the frequent itemsets found in the sample
- Scan the database again to find missed frequent patterns
- Use a lower support threshold to find frequent itemsets in the sample
- Trade off accuracy vs efficiency
Bottleneck of Frequent-pattern Mining

- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates.
  To find the frequent itemset i1 i2 … i100:
  # of scans: 100
  # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30!
- Bottleneck: candidate generation-and-test

Can we avoid candidate generation? Yes!
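The candidate count above can be checked directly (by the binomial theorem, the sum of C(100, k) over k = 1..100 is 2^100 − 1):

```python
from math import comb

# Sum of C(100, k) for k = 1..100: every nonempty sub-itemset of i1..i100.
total = sum(comb(100, k) for k in range(1, 101))
assert total == 2**100 - 1
print(f"{total:.3e}")  # ~1.268e+30, the slide's 1.27 * 10^30
```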
Visualization of Association Rules: Plane Graph

Visualization of Association Rules: Rule Graph

Visualization of Association Rules (SGI/MineSet 3.0)
Frequent Pattern and Association Analysis

- Basic concepts
- Scalable mining methods
- Mining a variety of rules and interesting patterns
- Constraint-based mining
- Mining sequential and structured patterns
- Extensions and applications
Mining Various Kinds of Association Rules

- Mining multi-level association
- Mining multi-dimensional association
- Mining quantitative association
- Mining interesting correlation patterns
Example
Mining Multiple-Level Association Rules

- Items often form a hierarchy
- Flexible support settings: items at the lower level are expected to have lower support

Example hierarchy:

    Milk [support = 10%]
        2% Milk [support = 6%]
        Skim Milk [support = 4%]

Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5%
Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%

With uniform support, Skim Milk (4%) is missed at level 2; with reduced support, both 2% Milk and Skim Milk are kept.
Multi-level Association: Redundancy Filtering

Some rules may be redundant due to "ancestor" relationships between items. Example:

R1: milk => wheat bread [support = 8%, confidence = 70%]
R2: 2% milk => wheat bread [support = 2%, confidence = 72%]

R1 is an ancestor of R2: R1 can be obtained from R2 by replacing items with their ancestors in the concept hierarchy.

R2 is redundant if its support is close to the "expected" value, based on the rule's ancestor.
Mining Multi-Dimensional Association

Single-dimensional rules:
  buys(X, "milk") => buys(X, "bread")

Multi-dimensional rules: two or more dimensions or predicates
- Inter-dimension association rules (no repeated predicates):
  age(X, "19-25") ∧ occupation(X, "student") => buys(X, "coke")
- Hybrid-dimension association rules (repeated predicates):
  age(X, "19-25") ∧ buys(X, "popcorn") => buys(X, "coke")
Recall: categorical vs quantitative attributes

- Categorical (or nominal) attributes: finite number of possible values, no ordering among values
- Quantitative attributes: numeric, implicit ordering among values
Mining Quantitative Associations

Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
- Static discretization based on predefined concept hierarchies (data cube methods)
- Dynamic discretization based on data distribution (quantitative rules, e.g., Agrawal & Srikant @SIGMOD'96)
- Clustering: distance-based association (e.g., Yang & Miller @SIGMOD'97): one-dimensional clustering then association
- Deviation (such as Aumann and Lindell @KDD'99):
  Sex = female => Wage: mean = $7/hr (overall mean = $9)
Static Discretization of Quantitative Attributes (1)

- Attributes are discretized prior to mining using a concept hierarchy
- Numeric values are replaced by ranges
- The frequent itemset mining algorithm must be modified so that frequent predicate sets are searched: instead of searching only one attribute (e.g., buys), search through all relevant attributes (e.g., age, occupation, buys) and treat each attribute-value pair as an itemset
Static Discretization of Quantitative Attributes (2)

- A data cube is well suited for mining and may already exist
- The cells of an n-dimensional cuboid correspond to the predicate sets and can be used to store the support counts
- Mining from data cubes can be much faster

Lattice of cuboids for the attributes (age, income, buys):

    ()
    (age)  (income)  (buys)
    (age, income)  (age, buys)  (income, buys)
    (age, income, buys)
Mining Various Kinds of Association Rules

- Mining multi-level association
- Mining multi-dimensional association
- Mining quantitative association
- Mining interesting correlation patterns
Mining interesting correlation patterns

Strong association rules (with high support and confidence) can be uninteresting. Example:

play basketball => eat cereal [40%, 66.7%] is misleading: the overall percentage of students eating cereal is 75%, which is higher than 66.7%.

play basketball => not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence.

           | Basketball | Not basketball | Sum (row)
Cereal     | 2000       | 1750           | 3750
Not cereal | 1000       | 250            | 1250
Sum (col.) | 3000       | 2000           | 5000
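The percentages above follow directly from the contingency table; a quick check (variable names are mine):

```python
# Contingency table from the slide.
n_total = 5000
n_basketball = 3000
n_cereal = 3750
n_basketball_and_cereal = 2000

# Rule: play basketball => eat cereal  [support, confidence]
support = n_basketball_and_cereal / n_total          # 0.40
confidence = n_basketball_and_cereal / n_basketball  # 0.667

# The overall cereal rate exceeds the rule's confidence: misleading rule.
overall_cereal = n_cereal / n_total                  # 0.75
print(support, round(confidence, 3), overall_cereal)

# lift < 1: playing basketball and eating cereal are negatively correlated.
lift = support / ((n_basketball / n_total) * overall_cereal)
print(round(lift, 3))  # 0.889
```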
Are All the Rules Found Interesting?

- The confidence of a rule only estimates the conditional probability of item B given item A, but doesn't measure the real correlation between A and B
- Another example: buy walnuts => buy milk [1%, 80%] is misleading if 85% of customers buy milk
- Support and confidence are not good to represent correlations
- Many other interestingness measures exist (Tan, Kumar, Srivastava @KDD'02)
Are All the Rules Found Interesting?

- Correlation rule: A => B [support, confidence, correlation measure]
- Measure of dependent/correlated events: correlation coefficient or lift
- The occurrence of A is independent of the occurrence of B if P(A ∪ B) = P(A) P(B)

    lift = P(A ∪ B) / (P(A) P(B)) = conf(A => B) / sup(B) = P(B|A) / P(B)

- If lift > 1, A and B are positively correlated; if lift < 1, they are negatively correlated
Lift - example

           | Milk  | No Milk | Sum (row)
Coffee     | m, c  | ~m, c   | c
No Coffee  | m, ~c | ~m, ~c  | ~c
Sum (col.) | m     | ~m      |

Rule: Coffee => Milk

DB | m, c | ~m, c | m, ~c  | ~m, ~c  | lift
A1 | 1000 | 100   | 100    | 10,000  | 9.26
A2 | 100  | 1000  | 1000   | 100,000 | 8.44
A3 | 1000 | 100   | 10,000 | 100,000 | 9.18
A4 | 1000 | 1000  | 1000   | 1000    | 1
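The lift column can be reproduced from the four cell counts of each database, using lift(Coffee => Milk) = P(m ∪ c) / (P(m) P(c)) as defined above (function name is mine):

```python
def lift(mc, not_m_c, m_not_c, not_m_not_c):
    """Lift of Coffee => Milk from the 2x2 contingency cell counts."""
    n = mc + not_m_c + m_not_c + not_m_not_c
    p_mc = mc / n                      # P(milk and coffee)
    p_m = (mc + m_not_c) / n           # P(milk)
    p_c = (mc + not_m_c) / n           # P(coffee)
    return p_mc / (p_m * p_c)

# The four databases from the slide's table.
print(round(lift(1000, 100, 100, 10_000), 2))      # 9.26
print(round(lift(100, 1000, 1000, 100_000), 2))    # 8.44
print(round(lift(1000, 100, 10_000, 100_000), 2))  # 9.18
print(round(lift(1000, 1000, 1000, 1000), 2))      # 1.0
```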
Mining Highly Correlated Patterns

- lift and χ² are not good measures for correlations in transactional DBs
- all-conf or coherence could be good measures (Omiecinski @TKDE'03)

    all_conf(X) = sup(X) / max_item_sup(X)
    coh(X) = sup(X) / |universe(X)|

- Both all-conf and coherence have the downward closure property
- Efficient algorithms can be derived for mining (Lee et al. @ICDM'03sub)
Bibliography

(Book) Data Mining: Concepts and Techniques, J. Han & M. Kamber, Morgan Kaufmann, 2001 (Chap. 6 of the 2001 book; Chap. 4 of the draft)