Association Discovery from Databases Association rules are a simple formalism for expressing...



Page 1: Association Discovery from Databases Association rules are a simple formalism for expressing positive connections between columns in a 0/1 matrix. A classical.

Association Discovery from Databases

Association rules are a simple formalism for expressing positive connections between columns in a 0/1 matrix. A classical example is a rule which states that if a customer buys beer and sausage, then with 80% confidence he/she also buys mustard.

Association rule mining: Finding associations or correlations among a set of items or objects in transaction databases, relational databases, and data warehouses.

Applications: Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, etc.

Examples: Rule form: LHS => RHS [support, confidence].

buys(x, diapers) => buys(x, beers) [0.5%, 60%]
major(x, CS) ^ takes(x, DB) => grade(x, A) [1%, 75%]


Association Rule: Basic Concepts

• Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
• Find: all rules that correlate the presence of one set of items with that of another set of items
  – E.g., 98% of people who purchase tires and auto accessories also get automotive services done
• Applications
  – * => Maintenance Agreement (What should the store do to boost Maintenance Agreement sales?)
  – Home Electronics => * (What other products should the store stock up on?)
  – Attached mailing in direct marketing
  – Detecting “ping-pong”ing of patients, faulty “collisions”


Associations, Support and Confidence

Let I = {i1, i2, ..., im} be a set of literals, each called an item.
Let D = {t1, t2, ..., tn} be a set of transactions, where a transaction, t, is a set of items.

• An association rule is of the form X => Y, where X and Y are subsets of I and X ∩ Y = ∅.
• Each rule has two measures of value: support and confidence.
• Support indicates how frequently the pattern occurs, and confidence denotes the strength of implication in the rule.
• The support of the rule X => Y is support(X ∪ Y), the fraction of transactions in D that contain every item of X and of Y.
• c is the confidence of the rule X => Y if c% of the transactions that contain X also contain Y, which can be written as the ratio:

support(X ∪ Y) / support(X)
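The two measures can be computed directly from these definitions. The following is a minimal Python sketch, not a reference implementation: the function names `support` and `confidence` and the constant `TRANSACTIONS` are illustrative choices, and the data is the four-transaction toy database used in this deck's examples.

```python
# Toy database: the four transactions from this deck's running example.
TRANSACTIONS = [
    frozenset("ABC"),  # TID 2000
    frozenset("AC"),   # TID 1000
    frozenset("AD"),   # TID 4000
    frozenset("BEF"),  # TID 5000
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(lhs, rhs, transactions):
    """support(lhs ∪ rhs) / support(lhs) for the rule lhs => rhs.

    Assumes lhs occurs in at least one transaction; otherwise the
    ratio is undefined and this sketch raises ZeroDivisionError.
    """
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    return support(lhs | rhs, transactions) / support(lhs, transactions)


print(support({"A", "C"}, TRANSACTIONS))       # 0.5
print(confidence({"A"}, {"C"}, TRANSACTIONS))  # about 0.667
print(confidence({"C"}, {"A"}, TRANSACTIONS))  # 1.0
```

Representing each transaction as a `frozenset` makes the containment test a single subset comparison (`itemset <= t`).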


Rule Measures: Support and Confidence

• Find all the rules X & Y => Z with minimum confidence and support
  – support, s: probability that a transaction contains {X & Y & Z}
  – confidence, c: conditional probability that a transaction having {X & Y} also contains Z

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

With minimum support 50% and minimum confidence 50%, we have
  – A => C (50%, 66.6%)
  – C => A (50%, 100%)

[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both.]


Association Discovery

Given a user-specified minimum support (called MINSUP) and minimum confidence (called MINCONF), an important problem is to find all high-confidence rules over large itemsets (frequent sets, i.e., sets with high support), that is, all rules whose support and confidence are at least minsup and minconf.

This problem can be decomposed into two subproblems:

1. Find all large itemsets: those with support ≥ minsup (frequent sets).

2. For a large itemset X and B ∈ X (or Y ⊂ X), find those rules X\{B} => B (X-Y => Y) for which confidence ≥ minconf.


Mining Association Rules—An Example

For rule A => C:
support = support({A & C}) = 50%
confidence = support({A & C}) / support({A}) = 66.6%

The Apriori principle: Any subset of a frequent itemset must be frequent

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

Min. support 50%
Min. confidence 50%
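The frequent-itemset table above can be reproduced by brute force on such a small database. The sketch below is only a check, not the Apriori algorithm: it enumerates every possible itemset, which is exponential in the number of items. The names `frequent_itemsets_bruteforce` and `all_subsets_frequent` are hypothetical, and "frequent" is assumed to mean support ≥ the minimum, which matches the table (itemsets at exactly 50% are listed).

```python
from itertools import combinations

# The four transactions from the table above.
TRANSACTIONS = [frozenset("ABC"), frozenset("AC"), frozenset("AD"), frozenset("BEF")]
MIN_SUPPORT = 0.5


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets_bruteforce(transactions, min_support):
    """Map every itemset with support >= min_support to its support."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = support(frozenset(combo), transactions)
            if s >= min_support:
                frequent[frozenset(combo)] = s
    return frequent


def all_subsets_frequent(frequent):
    """Apriori principle: every non-empty proper subset of a frequent itemset is frequent."""
    return all(
        frozenset(sub) in frequent
        for itemset in frequent
        for k in range(1, len(itemset))
        for sub in combinations(itemset, k)
    )


frequent = frequent_itemsets_bruteforce(TRANSACTIONS, MIN_SUPPORT)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), f"{s:.0%}")
print(all_subsets_frequent(frequent))  # True
```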


Rules from frequent sets

• X = {mustard, sausage, beer}; frequency = 0.4
• Y = {mustard, sausage, beer, chips}; frequency = 0.2
• if the customer buys mustard, sausage, and beer, then the probability that he/she buys chips is 0.5
• simple descriptive pattern
• statistical meaning: confidence of A => B: P(B|A)
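The value 0.5 follows from the confidence ratio applied to the two frequencies on this slide. As a worked equation (the notation freq and conf is shorthand introduced here, with Y = X ∪ {chips}):

```latex
\mathrm{conf}\bigl(X \Rightarrow \{\text{chips}\}\bigr)
  = \frac{\mathrm{freq}(X \cup \{\text{chips}\})}{\mathrm{freq}(X)}
  = \frac{\mathrm{freq}(Y)}{\mathrm{freq}(X)}
  = \frac{0.2}{0.4}
  = 0.5
```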


Mining Frequent Itemsets: the Key Step

• Find the frequent itemsets: the sets of items that have minimum support
  – A subset of a frequent itemset must also be a frequent itemset (Apriori rule)
    • i.e., if {AB} is a frequent itemset, both {A} and {B} must be frequent itemsets
  – Iteratively find frequent itemsets with cardinality from 1 to k (k-itemset)
• Use the frequent itemsets to generate association rules.


The Algorithm

(1) The frequent sets can be computed through iteration.

1st iteration: the large 1-itemsets are found by scanning the database.
kth iteration: Ck is created by applying Apriori-gen to Lk-1 and is then scanned for frequent sets.

Apriori-gen generates only those k-itemsets whose every (k-1)-itemset subset is frequent (in Lk-1).

(2) Generating rules.

For each frequent set X, output all rules R(X, Y) = (X-Y => Y), where Y is a non-empty proper subset of X, for which c(R(X, Y)) = supp(X)/supp(X-Y) is at least minconf.
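Step (2) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: `generate_rules` and `SUPPORT_COUNT` are hypothetical names, and the support counts are those of the four-transaction sample database (items A to E) worked through later in these slides. Confidence is computed as supp(X)/supp(X-Y), exactly as in the formula above.

```python
from itertools import combinations

# Support counts for the frequent itemsets needed here, taken from the
# deck's sample database (TIDs 100-400). Illustrative input only.
SUPPORT_COUNT = {
    frozenset("B"): 3, frozenset("C"): 3, frozenset("E"): 3,
    frozenset("BC"): 2, frozenset("BE"): 3, frozenset("CE"): 2,
    frozenset("BCE"): 2,
}


def generate_rules(itemset, support_count, minconf):
    """Return (lhs, rhs, confidence) for each rule X-Y => Y with confidence >= minconf.

    Y ranges over the non-empty proper subsets of the frequent set X.
    """
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):
        for rhs in combinations(sorted(itemset), size):
            rhs = frozenset(rhs)
            lhs = itemset - rhs
            conf = support_count[itemset] / support_count[lhs]
            if conf >= minconf:
                rules.append((lhs, rhs, conf))
    return rules


for lhs, rhs, conf in generate_rules("BCE", SUPPORT_COUNT, minconf=0.6):
    print(sorted(lhs), "=>", sorted(rhs), f"confidence = {conf:.1%}")
```

Because every subset of a frequent set is itself frequent, the lookup `support_count[lhs]` always succeeds when the dictionary holds all frequent itemsets.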


The Apriori Algorithm

• Join Step: Ck is generated by joining Lk-1 with itself

• Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset

• Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
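The pseudo-code above translates fairly directly into Python. The following is a minimal sketch rather than an optimized implementation, with several assumptions: the names `apriori_gen`, `apriori`, and `DATABASE` are illustrative, the minimum support is given as an absolute count, and the join is simplified to "union of any two frequent (k-1)-itemsets that has size k" instead of the classic prefix-based join. After pruning, both joins yield the same candidates, since a surviving k-itemset is one whose (k-1)-subsets are all frequent.

```python
from itertools import combinations


def apriori_gen(prev_frequent, k):
    """Join L(k-1) with itself, then prune candidates with an infrequent (k-1)-subset."""
    prev = set(prev_frequent)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in prev for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates


def apriori(transactions, min_count):
    """Return {itemset: support count} for every itemset with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]
    # First pass: count single items to obtain L1.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)
    k = 2
    while current:
        # Generate Ck from L(k-1), then scan the database to count candidates.
        counts = {c: 0 for c in apriori_gen(current, k)}
        for t in transactions:
            for c in counts:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent


# Sample database from these slides (TIDs 100, 200, 300, 400).
DATABASE = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(DATABASE, min_count=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The loop stops when a level produces no frequent itemsets, mirroring the `Lk != ∅` condition in the pseudo-code.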


Example

Consider the database in Table 2.2.

Table 2.2. Sample transaction database

TID   Items
---   ----------------------
100   A C D
200   B C E
300   A B C E
400   B E

Let minimum support = 50% and minimum confidence = 60%. Since there are four records in the table, an itemset must occur in at least 2 transactions (4 x 50% = 2) to meet minsup.


The process of finding frequent sets

Database D
TID   Items
100   A C D
200   B C E
300   A B C E
400   B E

Candidate 1-itemsets (after scanning D)
Itemset   Support count
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

Frequent 1-itemsets
Itemset   Support count
{A}       2
{B}       3
{C}       3
{E}       3

Candidate 2-itemsets (after scanning D)
Itemset   Support count
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

Frequent 2-itemsets
Itemset   Support count
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

Candidate 3-itemsets (after scanning D)
Itemset     Support count
{B, C, E}   2

Frequent 3-itemsets
Itemset     Support count
{B, C, E}   2

Derive association rules. We have one large 3-itemset, {B, C, E}, with s = 50%. With the predetermined minconf = 60%, we get:

B and C implies E, with support = 50% and confidence = 100%.
B and E implies C, with support = 50% and confidence = 66.7%.
C and E implies B, with support = 50% and confidence = 100%.
B implies C and E, with support = 50% and confidence = 66.7%.
C implies B and E, with support = 50% and confidence = 66.7%.
E implies B and C, with support = 50% and confidence = 66.7%.


The Apriori Algorithm — Example

Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D to obtain C1
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2 (generated from L1)
itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

Scan D to count C2
itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2
itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3 (generated from L2)
itemset
{2 3 5}

Scan D to obtain L3
itemset   sup
{2 3 5}   2


General framework for rule discovery

• a class P of patterns
• specify whether a pattern p ∈ P occurs frequently enough (support) and is also interesting (confidence)
• compute
  PI(d, P) = { p ∈ P | p occurs sufficiently often in d and p is interesting }
• examples:
  – P: all association rules
  – P’: all association rules with B on the right-hand side
  – P’’: all association rules with B on the right-hand side and C occurring in the left-hand side
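The pattern classes P, P’ and P’’ can be read as successively tighter filters over one set of rules. The sketch below illustrates that reading only. The rule list is the six rules derived from {B, C, E} earlier in these slides, the names `RULES`, `in_p_prime`, and `in_p_double_prime` are hypothetical, and "B on the right-hand side" is assumed to mean that B is one of the items in the consequent (not necessarily the only one).

```python
# Rules as (lhs, rhs) pairs: the six rules derived from {B, C, E}.
RULES = [
    (frozenset("BC"), frozenset("E")),
    (frozenset("BE"), frozenset("C")),
    (frozenset("CE"), frozenset("B")),
    (frozenset("B"), frozenset("CE")),
    (frozenset("C"), frozenset("BE")),
    (frozenset("E"), frozenset("BC")),
]


def in_p_prime(rule):
    """P': rules with B on the right-hand side (assumed: B is in the consequent)."""
    lhs, rhs = rule
    return "B" in rhs


def in_p_double_prime(rule):
    """P'': rules in P' that also have C in the left-hand side."""
    lhs, rhs = rule
    return in_p_prime(rule) and "C" in lhs


p_prime = [r for r in RULES if in_p_prime(r)]
p_double_prime = [r for r in RULES if in_p_double_prime(r)]

for lhs, rhs in p_double_prime:
    print(sorted(lhs), "=>", sorted(rhs))
```

In practice such constraints would more usefully be pushed into the candidate generation itself, so that uninteresting patterns are never counted; filtering afterwards is just the simplest way to show the idea.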


Association Rules in Table Form