Frequent Pattern and Association Analysis (based on the slides of the book: Data Mining: C & T)
Frequent Pattern and Association Analysis
(based on the slides of the book: Data Mining: C & T)
SAD Tagus 2004/05 H. Galhardas
Frequent Pattern and Association Analysis

- Basic concepts
- Scalable mining methods
- Mining a variety of rules and interesting patterns
- Constraint-based mining
- Mining sequential and structured patterns
- Extensions and applications
What Is Frequent Pattern Analysis?

Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set.

First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of association rule mining.
Motivation

Finding inherent regularities in data:
- What products were often purchased together? — Beer and diapers?!
- What are the subsequent purchases after buying a PC?
- What kinds of DNA are sensitive to this new drug?
- Can we automatically classify web documents?

Examples: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, DNA sequence analysis, etc.
Basic Concepts (1)

- Item: Boolean variable representing its presence or absence
- Basket: Boolean vector of variables, analyzed to discover patterns of items that are frequently associated (or bought) together
- Association rules: association patterns X => Y
Basic Concepts (2)

Itemset X = {x1, …, xk}. Find all the rules X => Y with minimum support and confidence.

- Support, s: probability that a transaction contains X ∪ Y
  Support = # tuples(X ∪ Y) / # total tuples = P(X ∪ Y)
- Confidence, c: conditional probability that a transaction having X also contains Y
  Confidence = # tuples(X ∪ Y) / # tuples(X) = P(Y|X)

(Venn diagram: customers buying beer, customers buying diapers, customers buying both.)

Transaction-id | Items bought
10 | A, B, D
20 | A, C, D
30 | A, D, E
40 | B, E, F
50 | B, C, D, E, F
Basic Concepts (3)

Interesting (or strong) rules: satisfy both a minimum support threshold and a minimum confidence threshold.

Transaction-id | Items bought
10 | A, B, D
20 | A, C, D
30 | A, D, E
40 | B, E, F
50 | B, C, D, E, F

Let sup_min = 50%, conf_min = 50%. Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}.

Association rules:
- A => D (60%, 100%): 60% of all analyzed transactions show that A and D are bought together; 100% of customers that bought A also bought D
- D => A (60%, 75%)
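The support and confidence figures above can be reproduced directly from the transaction table; a minimal sketch (function and variable names are my own):

```python
# Transaction database from the slide (Tid -> items)
db = {
    10: {"A", "B", "D"},
    20: {"A", "C", "D"},
    30: {"A", "D", "E"},
    40: {"B", "E", "F"},
    50: {"B", "C", "D", "E", "F"},
}

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db.values()) / len(db)

def confidence(lhs, rhs):
    """P(rhs | lhs): share of transactions with lhs that also contain rhs."""
    lhs, rhs = set(lhs), set(rhs)
    return support(lhs | rhs) / support(lhs)

print(support({"A", "D"}))        # 0.6  -> support of A => D is 60%
print(confidence({"A"}, {"D"}))   # 1.0  -> confidence of A => D is 100%
print(confidence({"D"}, {"A"}))   # 0.75 -> confidence of D => A is 75%
```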
Basic Concepts (4)

- Itemset (or pattern): set of items (a k-itemset has k items)
- Occurrence frequency of itemset: # of transactions containing the itemset. Also known as: frequency, support count or count of the itemset
- Frequent itemset: itemset satisfying minimum support, i.e., frequency >= min_sup * # total transactions
- Confidence(A => B) = P(B|A) = frequency(A ∪ B) / frequency(A)

The problem of mining association rules reduces to the problem of mining frequent itemsets.
Mining frequent itemsets

A long pattern contains a combinatorial number of sub-patterns.

Solution: mine closed patterns and max-patterns.

- An itemset X is closed frequent if X is frequent and there exists no super-itemset Y ⊃ X with the same support as X
- An itemset X is a max-itemset if X is frequent and there exists no frequent super-itemset Y ⊃ X
- Closed itemsets are a lossless compression of frequent patterns: they reduce the # of patterns and rules
Example

DB = {<a1, …, a100>, <a1, …, a50>}, min_sup = 1.

- What is the set of closed itemsets?
  <a1, …, a100>: 1
  <a1, …, a50>: 2
- What is the set of max-patterns?
  <a1, …, a100>: 1
- What is the set of all patterns? All 2^100 − 1 non-empty subsets of <a1, …, a100> — far too many to enumerate!
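The closed/max definitions can be checked on a scaled-down analogue of this example — two transactions where the second is a prefix of the first (the four-item size and names are illustrative, not from the slide):

```python
from itertools import combinations

# Scaled-down analogue of the slide's DB, min_sup = 1.
db = [frozenset("abcd"), frozenset("ab")]
min_sup = 1

items = sorted(set().union(*db))
all_itemsets = [frozenset(c) for r in range(1, len(items) + 1)
                for c in combinations(items, r)]

sup = {x: sum(x <= t for t in db) for x in all_itemsets}
frequent = {x: s for x, s in sup.items() if s >= min_sup}

# Closed: no frequent proper superset has the same support.
closed = {x for x in frequent
          if not any(x < y and sup[y] == sup[x] for y in frequent)}
# Max: no proper superset is frequent at all.
maximal = {x for x in frequent if not any(x < y for y in frequent)}

print(sorted("".join(sorted(x)) for x in closed))   # ['ab', 'abcd']
print(sorted("".join(sorted(x)) for x in maximal))  # ['abcd']
```

As on the slide, the closed itemsets keep both patterns with their exact supports, while the single max-pattern drops the support of the shorter one; all 2^4 − 1 itemsets are frequent.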
Frequent Pattern and Association Analysis

- Basic concepts
- Scalable mining methods
- Mining a variety of rules and interesting patterns
- Constraint-based mining
- Mining sequential and structured patterns
- Extensions and applications
Scalable Methods for Mining Frequent Patterns

Downward closure (or Apriori) property of frequent patterns: any subset of a frequent itemset must be frequent. If {beer, diaper, nuts} is frequent, so is {beer, diaper}, i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}.

Scalable mining methods — three major approaches:
- Apriori (Agrawal & Srikant @VLDB'94)
- Frequent pattern growth (FPgrowth — Han, Pei & Yin @SIGMOD'00)
- Vertical data format approach (Charm — Zaki & Hsiao @SDM'02)
Apriori: A Candidate Generation-and-Test Approach

Method:
- Initially, scan DB once to get the frequent 1-itemsets
- Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
- Test the candidates against the DB
- Terminate when no frequent or candidate set can be generated

Apriori pruning principle: if there is any itemset which is infrequent, its supersets should not be generated/tested!
How to Generate Candidates?

Suppose the items in Lk-1 are listed in an order.

Step 1: self-joining Lk-1

    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning

    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
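The self-join and pruning steps above can be sketched in Python (function name is mine; the test values come from the candidate-generation example later in these slides, where L3 = {abc, abd, acd, ace, bcd} yields C4 = {abcd}):

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets.

    Self-join: merge two (k-1)-itemsets that agree on their first k-2
    items (in sorted order). Prune: drop any candidate with an
    infrequent (k-1)-subset, per the Apriori pruning principle.
    """
    L_prev = {tuple(sorted(x)) for x in L_prev}
    candidates = set()
    for p in L_prev:
        for q in L_prev:
            # join condition: equal prefixes, last item of p < last item of q
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))
    # pruning: every (k-1)-subset of a candidate must be in L_prev
    return {c for c in candidates
            if all(s in L_prev for s in combinations(c, k - 1))}

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(apriori_gen(L3, 4))  # {('a','b','c','d')} — acde is pruned (ade not in L3)
```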
Example

Database TDB (sup_min = 2):

Tid | Items
10 | A, C, D
20 | B, C, E
30 | A, B, C, E
40 | B, E

1st scan — C1:

Itemset | sup
{A} | 2
{B} | 3
{C} | 3
{D} | 1
{E} | 3

L1:

Itemset | sup
{A} | 2
{B} | 3
{C} | 3
{E} | 3

C2 (self-join on L1):

Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}

2nd scan — C2 with counts:

Itemset | sup
{A, B} | 1
{A, C} | 2
{A, E} | 1
{B, C} | 2
{B, E} | 3
{C, E} | 2

L2:

Itemset | sup
{A, C} | 2
{B, C} | 2
{B, E} | 3
{C, E} | 2

C3:

Itemset
{B, C, E}

3rd scan — L3:

Itemset | sup
{B, C, E} | 2
The Apriori Algorithm

Pseudo-code:

    Ck: candidate itemset of size k
    Lk: frequent itemset of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1
            that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;
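The pseudo-code above can be sketched as a plain Python implementation (names are mine; support counting scans each transaction per candidate rather than using the hash-tree discussed later, so this is a didactic sketch, not an optimized version):

```python
from itertools import combinations

def apriori(transactions, min_sup_count):
    """Return every frequent itemset with its support count (Apriori)."""
    transactions = [frozenset(t) for t in transactions]

    # Scan 1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {x: c for x, c in counts.items() if c >= min_sup_count}
    frequent = dict(L)

    k = 2
    while L:
        # Self-join + prune to build the candidate k-itemsets.
        prev = sorted(tuple(sorted(x)) for x in L)
        candidates = set()
        for p in prev:
            for q in prev:
                if p[:-1] == q[:-1] and p[-1] < q[-1]:
                    c = p + (q[-1],)
                    if all(frozenset(s) in L for s in combinations(c, k - 1)):
                        candidates.add(frozenset(c))
        # Scan the DB once to count the surviving candidates.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        L = {x: n for x, n in counts.items() if n >= min_sup_count}
        frequent.update(L)
        k += 1
    return frequent

# Database TDB from the worked example, sup_min = 2.
freq = apriori([{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}], 2)
print(freq[frozenset({"B","C","E"})])  # 2, matching L3 in the example
```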
Important Details of Apriori

How to generate candidates?
- Step 1: self-joining Lk
- Step 2: pruning

Example of candidate generation:
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3*L3 gives abcd (from abc and abd) and acde (from acd and ace)
- Pruning: acde is removed because ade is not in L3
- C4 = {abcd}
Another example (1)

Another example (2)
Generation of association rules from frequent itemsets

Confidence(A => B) = P(B|A) = support_count(A ∪ B) / support_count(A)

- support_count(A ∪ B): number of transactions containing itemsets A ∪ B
- support_count(A): number of transactions containing itemset A

Association rules can be generated as:
- For each frequent itemset l, generate all nonempty subsets of l
- For every nonempty subset s of l, output the rule s => (l − s) if support_count(l) / support_count(s) >= min_conf
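The two-step rule generation above can be sketched as follows; it assumes the frequent itemsets and their support counts have already been mined (e.g., by Apriori), and the function name is mine:

```python
from itertools import combinations

def generate_rules(support_counts, min_conf):
    """Emit every rule s => (l - s) whose confidence reaches min_conf.

    `support_counts` maps each frequent itemset (frozenset) to its
    support count; confidence = support_count(l) / support_count(s).
    """
    rules = []
    for l, l_count in support_counts.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):          # all nonempty proper subsets s
            for s in map(frozenset, combinations(l, r)):
                conf = l_count / support_counts[s]
                if conf >= min_conf:
                    rules.append((s, l - s, conf))
    return rules

# Frequent itemsets of the earlier 5-transaction example (sup_min = 50%).
counts = {frozenset("A"): 3, frozenset("D"): 4, frozenset("AD"): 3}
for s, rhs, conf in generate_rules(counts, 0.5):
    print(set(s), "=>", set(rhs), f"conf={conf:.2f}")
```

This reproduces the two rules from the Basic Concepts example: A => D with confidence 1.00 and D => A with confidence 0.75.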
How to Count Supports of Candidates?

Why is counting the supports of candidates a problem?
- The total number of candidates can be very huge
- One transaction may contain many candidates

Method based on hashing:
- Candidate itemsets are stored in a hash-tree
- A leaf node of the hash-tree contains a list of itemsets and counts
- An interior node contains a hash table
- Subset function: finds all the candidates contained in a transaction
Challenges of Frequent Pattern Mining

Challenges:
- Multiple scans of the transaction database
- Huge number of candidates
- Tedious workload of support counting for candidates

Improving Apriori — general ideas:
- Partitioning
- Sampling
- Others
Partition: Scan Database Only Twice

Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
- Scan 1: partition the database and find local frequent patterns
- Scan 2: consolidate global frequent patterns
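A minimal sketch of the two-scan idea (the local miner is brute force for brevity, and the partition sizes and function names are illustrative assumptions, not from the slides):

```python
from itertools import combinations

def frequent_itemsets(transactions, min_sup_count):
    """Brute-force local miner: all itemsets meeting min_sup_count."""
    items = sorted(set().union(*transactions))
    result = {}
    for r in range(1, len(items) + 1):
        for c in map(frozenset, combinations(items, r)):
            n = sum(c <= t for t in transactions)
            if n >= min_sup_count:
                result[c] = n
    return result

def partition_mine(transactions, min_sup_ratio, n_parts=2):
    """Two-scan partition algorithm: local mining, then global counting."""
    transactions = [frozenset(t) for t in transactions]
    size = -(-len(transactions) // n_parts)  # ceiling division
    # Scan 1: any globally frequent itemset is frequent in >= 1 partition,
    # so the union of local frequent itemsets contains every candidate.
    candidates = set()
    for i in range(0, len(transactions), size):
        part = transactions[i:i + size]
        local_min = max(1, int(min_sup_ratio * len(part)))
        candidates |= frequent_itemsets(part, local_min).keys()
    # Scan 2: count the candidates over the whole database.
    min_count = min_sup_ratio * len(transactions)
    return {c for c in candidates
            if sum(c <= t for t in transactions) >= min_count}

db = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
print(sorted("".join(sorted(x)) for x in partition_mine(db, 0.5)))
```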
Sampling for Frequent Patterns

- Select a sample of the original database, mine frequent patterns within the sample using Apriori
- Scan the database once to verify the frequent itemsets found in the sample
- Scan the database again to find missed frequent patterns
- Use a lower support threshold to find frequent itemsets in the sample
- Trade off accuracy vs efficiency
Bottleneck of Frequent-pattern Mining

- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates.
  To find the frequent itemset i1 i2 … i100:
  # of scans: 100
  # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30!
- Bottleneck: candidate generation-and-test

Can we avoid candidate generation? Yes!
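The candidate count above can be checked directly (by the binomial theorem, the sum of C(100, k) over k = 1..100 is 2^100 − 1):

```python
from math import comb

# Sum of C(100, k) for k = 1..100: every nonempty sub-itemset of i1..i100.
total = sum(comb(100, k) for k in range(1, 101))
assert total == 2**100 - 1
print(f"{total:.3e}")  # ~1.268e+30, the slide's 1.27 * 10^30
```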
Visualization of Association Rules: Plane Graph

Visualization of Association Rules: Rule Graph

Visualization of Association Rules (SGI/MineSet 3.0)
Frequent Pattern and Association Analysis

- Basic concepts
- Scalable mining methods
- Mining a variety of rules and interesting patterns
- Constraint-based mining
- Mining sequential and structured patterns
- Extensions and applications
Mining Various Kinds of Association Rules

- Mining multi-level association
- Mining multi-dimensional association
- Mining quantitative association
- Mining interesting correlation patterns
Example
Mining Multiple-Level Association Rules

- Items often form a hierarchy
- Flexible support settings: items at the lower level are expected to have lower support

Example hierarchy:

    Milk [support = 10%]
        2% Milk [support = 6%]
        Skim Milk [support = 4%]

Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5%
Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%

With uniform support, Skim Milk (4%) is missed at level 2; with reduced support, both 2% Milk and Skim Milk are kept.
Multi-level Association: Redundancy Filtering

Some rules may be redundant due to "ancestor" relationships between items. Example:

R1: milk => wheat bread [support = 8%, confidence = 70%]
R2: 2% milk => wheat bread [support = 2%, confidence = 72%]

R1 is an ancestor of R2: R1 can be obtained from R2 by replacing items with their ancestors in the concept hierarchy.

R2 is redundant if its support is close to the "expected" value, based on the rule's ancestor.
Mining Multi-Dimensional Association

Single-dimensional rules:
  buys(X, "milk") => buys(X, "bread")

Multi-dimensional rules: two or more dimensions or predicates
- Inter-dimension association rules (no repeated predicates):
  age(X, "19-25") ∧ occupation(X, "student") => buys(X, "coke")
- Hybrid-dimension association rules (repeated predicates):
  age(X, "19-25") ∧ buys(X, "popcorn") => buys(X, "coke")
Recall: categorical vs quantitative attributes

- Categorical (or nominal) attributes: finite number of possible values, no ordering among values
- Quantitative attributes: numeric, implicit ordering among values
Mining Quantitative Associations

Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
- Static discretization based on predefined concept hierarchies (data cube methods)
- Dynamic discretization based on data distribution (quantitative rules, e.g., Agrawal & Srikant @SIGMOD'96)
- Clustering: distance-based association (e.g., Yang & Miller @SIGMOD'97): one-dimensional clustering then association
- Deviation (such as Aumann and Lindell @KDD'99):
  Sex = female => Wage: mean = $7/hr (overall mean = $9)
Static Discretization of Quantitative Attributes (1)

- Attributes are discretized prior to mining using a concept hierarchy
- Numeric values are replaced by ranges
- The frequent itemset mining algorithm must be modified so that frequent predicate sets are searched: instead of searching only one attribute (e.g., buys), search through all relevant attributes (e.g., age, occupation, buys) and treat each attribute-value pair as an itemset
Static Discretization of Quantitative Attributes (2)

- A data cube is well suited for mining and may already exist
- The cells of an n-dimensional cuboid correspond to the predicate sets and can be used to store the support counts
- Mining from data cubes can be much faster

Lattice of cuboids for the attributes (age, income, buys):

    ()
    (age)  (income)  (buys)
    (age, income)  (age, buys)  (income, buys)
    (age, income, buys)
Mining Various Kinds of Association Rules

- Mining multi-level association
- Mining multi-dimensional association
- Mining quantitative association
- Mining interesting correlation patterns
Mining interesting correlation patterns

Strong association rules (with high support and confidence) can be uninteresting. Example:

play basketball => eat cereal [40%, 66.7%] is misleading: the overall percentage of students eating cereal is 75%, which is higher than 66.7%.

play basketball => not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence.

           | Basketball | Not basketball | Sum (row)
Cereal     | 2000       | 1750           | 3750
Not cereal | 1000       | 250            | 1250
Sum (col.) | 3000       | 2000           | 5000
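The percentages above follow directly from the contingency table; a quick check (variable names are mine):

```python
# Contingency table from the slide.
n_total = 5000
n_basketball = 3000
n_cereal = 3750
n_basketball_and_cereal = 2000

# Rule: play basketball => eat cereal  [support, confidence]
support = n_basketball_and_cereal / n_total          # 0.40
confidence = n_basketball_and_cereal / n_basketball  # 0.667

# The overall cereal rate exceeds the rule's confidence: misleading rule.
overall_cereal = n_cereal / n_total                  # 0.75
print(support, round(confidence, 3), overall_cereal)

# lift < 1: playing basketball and eating cereal are negatively correlated.
lift = support / ((n_basketball / n_total) * overall_cereal)
print(round(lift, 3))  # 0.889
```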
Are All the Rules Found Interesting?

- The confidence of a rule only estimates the conditional probability of item B given item A, but doesn't measure the real correlation between A and B
- Another example: buy walnuts => buy milk [1%, 80%] is misleading if 85% of customers buy milk
- Support and confidence are not good to represent correlations
- Many other interestingness measures exist (Tan, Kumar, Srivastava @KDD'02)
Are All the Rules Found Interesting?

- Correlation rule: A => B [support, confidence, correlation measure]
- Measure of dependent/correlated events: correlation coefficient or lift
- The occurrence of A is independent of the occurrence of B if P(A ∪ B) = P(A) P(B)

    lift = P(A ∪ B) / (P(A) P(B)) = conf(A => B) / sup(B) = P(B|A) / P(B)

- If lift > 1, A and B are positively correlated; if lift < 1, they are negatively correlated
Lift - example

           | Milk  | No Milk | Sum (row)
Coffee     | m, c  | ~m, c   | c
No Coffee  | m, ~c | ~m, ~c  | ~c
Sum (col.) | m     | ~m      |

Rule: Coffee => Milk

DB | m, c | ~m, c | m, ~c  | ~m, ~c  | lift
A1 | 1000 | 100   | 100    | 10,000  | 9.26
A2 | 100  | 1000  | 1000   | 100,000 | 8.44
A3 | 1000 | 100   | 10,000 | 100,000 | 9.18
A4 | 1000 | 1000  | 1000   | 1000    | 1
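The lift column can be reproduced from the four cell counts of each database, using lift(Coffee => Milk) = P(m ∪ c) / (P(m) P(c)) as defined above (function name is mine):

```python
def lift(mc, not_m_c, m_not_c, not_m_not_c):
    """Lift of Coffee => Milk from the 2x2 contingency cell counts."""
    n = mc + not_m_c + m_not_c + not_m_not_c
    p_mc = mc / n                      # P(milk and coffee)
    p_m = (mc + m_not_c) / n           # P(milk)
    p_c = (mc + not_m_c) / n           # P(coffee)
    return p_mc / (p_m * p_c)

# The four databases from the slide's table.
print(round(lift(1000, 100, 100, 10_000), 2))      # 9.26
print(round(lift(100, 1000, 1000, 100_000), 2))    # 8.44
print(round(lift(1000, 100, 10_000, 100_000), 2))  # 9.18
print(round(lift(1000, 1000, 1000, 1000), 2))      # 1.0
```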
Mining Highly Correlated Patterns

- lift and χ² are not good measures for correlations in transactional DBs
- all-conf or coherence could be good measures (Omiecinski @TKDE'03)

    all_conf(X) = sup(X) / max_item_sup(X)
    coh(X) = sup(X) / |universe(X)|

- Both all-conf and coherence have the downward closure property
- Efficient algorithms can be derived for mining (Lee et al. @ICDM'03sub)
Bibliography

(Book) Data Mining: Concepts and Techniques, J. Han & M. Kamber, Morgan Kaufmann, 2001 (Chap. 6 of the 2001 book; Chap. 4 of the draft)