Dbm630 lecture05


Transcript of Dbm630 lecture05

Page 1: Dbm630 lecture05

DBM630: Data Mining and Data Warehousing
MS.IT. Rangsit University
Semester 2/2011

Lecture 5: Association Rule Mining
by Kritsada Sriphaew (sriphaew.k AT gmail.com)

Page 2: Dbm630 lecture05

Topics


Association rule mining

Mining single-dimensional association rules

Mining multilevel association rules

Other measurements: interest and conviction

From association rule mining to correlation analysis


Page 3: Dbm630 lecture05

What is Association Mining?


Association rule mining:

Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.

Applications:

Basket data analysis, cross-marketing, catalog design, clustering, classification, etc.

Ex.: Rule form: "Body ⇒ Head [support, confidence]", i.e., "Antecedent ⇒ Consequent [support, confidence]"

buys(x, "diapers") ⇒ buys(x, "beers") [0.5%, 60%]

major(x, "CS") ^ takes(x, "DB") ⇒ grade(x, "A") [1%, 75%]


Page 4: Dbm630 lecture05

A typical example of association rule mining is market basket analysis.


Page 5: Dbm630 lecture05

Rule Measures: Support/Confidence

Find all the rules "Antecedent(s) ⇒ Consequent(s)" with minimum support and confidence.

Support, s: the probability that a transaction contains {A ∪ C}.

Confidence, c: the conditional probability that a transaction containing A also contains C.

Let min. sup. = 50% and min. conf. = 50%:
A ⇒ C (s=50%, c=66.7%)

C ⇒ A (s=50%, c=100%)

Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold.

Such thresholds can be set by users or domain experts

Transaction ID Items Bought

2000 A,B,C

1000 A,C

4000 A,D

5000 B,E,F

Transactional databases

• Support= 50% means that 50% of all transactions under analysis show that A and C are purchased together

• Confidence=66.7% means that 66.7% of the customers who purchased A also bought C


Page 6: Dbm630 lecture05


Rule Measures: Support/Confidence

TransID Items Bought

T001 A,B,C

T002 A,C

T003 A,D

T004 B,E,F

Rule: A ⇒ C

support(A ⇒ C) = P(A ∪ C)

confidence(A ⇒ C) = P(C|A) = P(A ∪ C) / P(A)

• A ⇒ B (1/4 = 25%, 1/3 = 33.3%)

• B ⇒ A (1/4 = 25%, 1/2 = 50%)

• A ⇒ C (2/4 = 50%, 2/3 = 66.7%)

• C ⇒ A (2/4 = 50%, 2/2 = 100%)

• A, B ⇒ C (1/4 = 25%, 1/1 = 100%)

• A, C ⇒ B (1/4 = 25%, 1/2 = 50%)

• B, C ⇒ A (1/4 = 25%, 1/1 = 100%)

Frequency

A = 3

B = 2

C = 2

AB = 1

AC = 2

BC = 1

ABC = 1

(Venn diagram: customers who buy beer (A), customers who buy diapers (C), and the overlap of customers who buy both (A & C); support and confidence are probabilities over these sets.)
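To make these numbers concrete, here is a minimal Python sketch (not from the slides; the function names are my own) that recomputes the support and confidence values above from the four-transaction database.

```python
# Toy transaction database from the slide (T001-T004).
transactions = [
    {"A", "B", "C"},  # T001
    {"A", "C"},       # T002
    {"A", "D"},       # T003
    {"B", "E", "F"},  # T004
]

def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """support(A ∪ C) / support(A)."""
    return support(set(antecedent) | set(consequent), db) / support(antecedent, db)

print(support({"A", "C"}, transactions))       # 0.5      -> s(A => C) = 50%
print(confidence({"A"}, {"C"}, transactions))  # 0.666... -> c(A => C) = 66.7%
print(confidence({"C"}, {"A"}, transactions))  # 1.0      -> c(C => A) = 100%
```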


Page 7: Dbm630 lecture05

Association Rule: Support/Confidence for Relational Tables


In the case where each transaction is a row in a relational table:

Find all rules that correlate the presence of one set of attributes with that of another set of attributes.

outlook   temp. humidity windy play-time play sponsor
sunny     hot   high     True  85        Y    Sony
sunny     hot   high     False 90        Y    HP
rainy     hot   normal   True  63        Y    Ford
rainy     mild  high     True  5         N    Ford
overcast  cool  low      False 56        Y    HP
overcast  hot   low      True  25        N    Sony
overcast  cool  normal   True  5         N    Nokia
rainy     mild  high     True  86        Y    Honda
rainy     mild  low      False 78        Y    Ford
sunny     hot   high     True  74        Y    Sony

• If temperature = hot then humidity = high (s=3/10, c=3/5)

• If windy = true and play = Y then humidity = high and outlook = overcast (s=2/10, c=2/4)

• If windy = true and play = Y and humidity = high then outlook = overcast (s=2/10, c=2/3)
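As a small illustration of how such a relational row can be fed to the same single-dimensional machinery, here is a Python sketch of my own (not from the slides) that turns a row into a set of attribute=value items.

```python
# One row of the relational table above, written as a Python dict (illustrative only).
row = {"outlook": "sunny", "temp": "hot", "humidity": "high",
       "windy": True, "play": "Y", "sponsor": "Sony"}

# Each row becomes a set of attribute=value items, so rules such as
# "temperature = hot => humidity = high" are ordinary association rules over these items.
items = {f"{attr}={val}" for attr, val in row.items()}
print(sorted(items))
# ['humidity=high', 'outlook=sunny', 'play=Y', 'sponsor=Sony', 'temp=hot', 'windy=True']
```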


Page 8: Dbm630 lecture05

Association Rule Mining: Types


Boolean vs. quantitative associations (based on the types of values handled):

buys(x, "SQLServer") ^ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]

age(x, "30..39") ^ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]

Single-dimensional vs. multidimensional associations.

Single-level vs. multilevel analysis: What brands of beers are associated with what brands of diapers?

Various extensions: max-patterns and closed itemsets.


Page 9: Dbm630 lecture05

An Example (single dimensional Boolean association Rule Mining)


For rule A ⇒ C: support = support({A, C}) = 50%

confidence = support({A, C})/support({A}) = 66.7%

The Apriori principle:

Any subset of a frequent itemset must be frequent

Transaction ID Items Bought

2000 A,B,C

1000 A,C

4000 A,D

5000 B,E,F

Frequent Itemset Support

{A} 75%

{B} 50%

{C} 50%

{A,C} 50%

Min. support 50%

Min. confidence 50%


Page 10: Dbm630 lecture05

Two Steps in Mining Association Rules


Step 1: Find the frequent itemsets, i.e., the sets of items that have minimum support.

A subset of a frequent itemset must also be a frequent itemset; i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.

Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).

Step 2: Use the frequent itemsets to generate association rules (see the sketch below).
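The following Python sketch illustrates Step 2 under the assumption that the frequent itemsets and their supports (Step 1) are already available; the function name generate_rules and the data layout are my own.

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """Derive rules A => C from each frequent itemset.

    `frequent` maps frozenset itemsets to their support, e.g. {frozenset('AC'): 0.5, ...}.
    """
    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = supp / frequent[antecedent]  # Apriori guarantees the antecedent is frequent
                if conf >= min_conf:
                    rules.append((set(antecedent), set(itemset - antecedent), supp, conf))
    return rules

frequent = {frozenset("A"): 0.75, frozenset("B"): 0.5, frozenset("C"): 0.5,
            frozenset("AC"): 0.5}
for a, c, s, conf in generate_rules(frequent, min_conf=0.5):
    print(a, "=>", c, f"s={s:.0%}, c={conf:.1%}")
# {'A'} => {'C'} s=50%, c=66.7%   and   {'C'} => {'A'} s=50%, c=100.0%
```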


Page 11: Dbm630 lecture05

The Apriori Algorithm


Join Step: Ck is generated by joining Lk-1 with itself

Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset

Pseudo-code:

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
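A compact Python rendering of the loop above (an illustrative sketch, not the lecture's reference code; it uses a combinations-based shortcut that is equivalent to the join + prune steps):

```python
from itertools import combinations

def apriori(db, min_support):
    """Level-wise Apriori sketch: generate candidates from the previous level,
    count them with one database scan, keep those meeting min_support."""
    n = len(db)
    # L1: frequent 1-itemsets.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {c: s / n for c, s in counts.items() if s / n >= min_support}
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        # Candidate (k+1)-itemsets: every (k+1)-combination of currently frequent
        # items whose k-subsets are all frequent (equivalent to join + prune).
        items = sorted({i for itemset in frequent for i in itemset})
        candidates = [frozenset(c) for c in combinations(items, k + 1)
                      if all(frozenset(s) in frequent for s in combinations(c, k))]
        counts = {c: sum(c <= t for t in db) for c in candidates}  # one scan per level
        frequent = {c: s / n for c, s in counts.items() if s / n >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# The example database from the next slide (TIDs 100-400).
db = [{"1", "3", "4"}, {"2", "3", "5"}, {"1", "2", "3", "5"}, {"2", "5"}]
for itemset, supp in sorted(apriori(db, 0.5).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), supp)
# ['1'] 0.5 ... ['2', '3', '5'] 0.5 -- matching L1, L2 and L3 on the next slide
```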



Page 12: Dbm630 lecture05

The Apriori Algorithm — Example

Database D:

TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D → C1 (candidate 1-itemsets with counts): {1}:2, {2}:3, {3}:3, {4}:1, {5}:3

L1 (frequent 1-itemsets, minimum support count = 2): {1}:2, {2}:3, {3}:3, {5}:3

C2 (candidates from joining L1 with itself): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2 with counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2

L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3: {2 3 5}; Scan D → {2 3 5}:2

L3: {2 3 5}:2

Page 13: Dbm630 lecture05

How to Generate Candidates?


Suppose the items in Lk-1 are listed in an order

Step 1: self-joining Lk-1

INSERT INTO Ck
SELECT p.item1, p.item2, …, p.itemk-1, q.itemk-1
FROM Lk-1 p, Lk-1 q
WHERE p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning

ForAll itemsets c IN Ck DO
    ForAll (k-1)-subsets s OF c DO
        IF (s is not in Lk-1) THEN DELETE c FROM Ck

Page 14: Dbm630 lecture05

Example of Generating Candidates


L3 = {abc, abd, acd, ace, bcd}

Self-joining: L3 × L3

abc + abd → abcd

acd + ace → acde

Pruning: acde is removed because its subset ade is not in L3

C4 = {abcd}
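A literal Python rendering of this join + prune step (a sketch; function and variable names are my own), applied to the L3 example above so the result can be checked against the slide:

```python
from itertools import combinations

def generate_candidates(L_prev, k):
    """Self-join L_{k-1} on the first k-2 items, then prune candidates that have
    an infrequent (k-1)-subset."""
    sorted_prev = [tuple(sorted(s)) for s in L_prev]
    joined = set()
    for p in sorted_prev:                       # join step
        for q in sorted_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.add(frozenset(p + (q[-1],)))
    return {c for c in joined                   # prune step
            if all(frozenset(s) in L_prev for s in combinations(sorted(c), k - 1))}

L3 = {frozenset(x) for x in ("abc", "abd", "acd", "ace", "bcd")}
print(generate_candidates(L3, 4))
# {frozenset({'a', 'b', 'c', 'd'})} -- abcd is kept; acde is pruned (ade is not in L3)
```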


Page 15: Dbm630 lecture05

How to Count Supports of Candidates?


Why is counting supports of candidates a problem?

The total number of candidates can be huge

One transaction may contain many candidates

Method:

Candidate itemsets are stored in a hash-tree

Leaf node of hash-tree contains a list of itemsets and counts

Interior node contains a hash table

Subset function: finds all the candidates contained in a transaction


Page 16: Dbm630 lecture05

Subset Function


Subset function: finds all the candidates contained in a transaction. (1) Generate Hash Tree (2) Hashing each item in the transactions

(Figure: a hash tree is built over the C2 candidates {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}; each transaction in the database from slide 12 is hashed item by item down the tree, incrementing the counts of the candidates it contains.)
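The slide's hash tree is an optimization of the basic idea; as a simpler stand-in (my own sketch), the subset function can be written by enumerating each transaction's k-subsets and looking them up in a plain dictionary of candidates:

```python
from itertools import combinations

def count_candidates(db, candidates, k):
    """For each transaction, enumerate its k-subsets and bump the count of those
    that are candidates (a dictionary stand-in for the hash tree)."""
    counts = {c: 0 for c in candidates}
    for t in db:
        for subset in combinations(sorted(t), k):
            key = frozenset(subset)
            if key in counts:
                counts[key] += 1
    return counts

db = [{"1", "3", "4"}, {"2", "3", "5"}, {"1", "2", "3", "5"}, {"2", "5"}]
C2 = {frozenset(p) for p in [("1", "2"), ("1", "3"), ("1", "5"),
                             ("2", "3"), ("2", "5"), ("3", "5")]}
for c, n in sorted(count_candidates(db, C2, 2).items(), key=lambda kv: sorted(kv[0])):
    print(sorted(c), n)
# ['1','2'] 1  ['1','3'] 2  ['1','5'] 1  ['2','3'] 2  ['2','5'] 3  ['3','5'] 2
```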


Page 17: Dbm630 lecture05

Is Apriori Fast Enough? — Performance Bottlenecks


The core of the Apriori algorithm:

Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets.

Use database scans and pattern matching to collect counts for the candidate itemsets.

The bottleneck of Apriori: candidate generation.

Huge candidate sets: 10^4 frequent 1-itemsets will generate on the order of 10^7 candidate 2-itemsets. To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates.

Multiple scans of the database: Apriori needs (n + 1) scans, where n is the length of the longest pattern.


Page 18: Dbm630 lecture05

Mining Frequent Patterns Without Candidate Generation


Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure

highly condensed, but complete for frequent pattern mining

avoid costly database scans

Develop an efficient, FP-tree-based frequent pattern mining method

A divide-and-conquer methodology: decompose mining tasks into smaller ones

Avoid candidate generation: sub-database test only!


Page 19: Dbm630 lecture05

Construct FP-tree from Transaction DB


TID  Items bought               (Ordered) frequent items
100  {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200  {a, b, c, f, l, m, o}      {f, c, a, b, m}
300  {b, f, h, j, o}            {f, b}
400  {b, c, k, s, p}            {c, b, p}
500  {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

min_support = 0.5

Header table (item : frequency): f : 4, c : 4, a : 3, b : 3, m : 3, p : 3

Resulting FP-tree:

{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1

Steps:

1. Scan the DB once and find the frequent 1-itemsets (single-item patterns).

2. Order the frequent items in frequency-descending order.

3. Scan the DB again and construct the FP-tree.
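A self-contained Python sketch of these three steps (class and function names are my own; ties in the frequency ordering are broken alphabetically here, so the exact tree shape can differ slightly from the slide's, which lists f before c):

```python
from collections import Counter, defaultdict

class FPNode:
    """One FP-tree node: an item, a count, a parent link, and child nodes."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fptree(db, min_count):
    # Step 1: one scan to count single items and keep the frequent ones.
    freq = Counter(item for t in db for item in t)
    freq = {i: c for i, c in freq.items() if c >= min_count}
    # Steps 2-3: re-scan, order each transaction by descending frequency, insert into the tree.
    root = FPNode(None)
    header = defaultdict(list)          # header table: item -> nodes holding that item
    for t in db:
        ordered = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = FPNode(item, parent=node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header, freq

db = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
root, header, freq = build_fptree(db, min_count=3)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})
# each frequent item's total count across its tree nodes: f 4, c 4, a 3, b 3, m 3, p 3
```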


Page 20: Dbm630 lecture05

Mining Frequent Patterns using FP-tree


General idea (divide-and-conquer): recursively grow frequent pattern paths using the FP-tree.

Method:

For each item, construct its conditional pattern base, and then its conditional FP-tree.

Repeat the process on each newly created conditional FP-tree until the resulting FP-tree is empty or contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern). A compact sketch of this recursion appears below.

Benefits: completeness and compactness.

Completeness: the FP-tree never breaks a long pattern of any transaction and preserves complete information for frequent pattern mining.

Compactness: it discards irrelevant information (infrequent items are gone), orders items in frequency-descending order (more frequent items are more likely to be shared), and is smaller than the original database.
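The recursion itself can be sketched compactly in Python if the conditional pattern base is kept as a plain list of (prefix, count) pairs instead of an actual FP-tree; this is my own simplification of the divide-and-conquer idea above, not the lecture's implementation, but it produces the same frequent patterns for the running example.

```python
from collections import Counter

def pattern_growth(cond_db, min_count, suffix=()):
    """Recursively mine a conditional pattern base given as [(prefix_tuple, count), ...].
    Prefixes must list items in one fixed, frequency-descending order."""
    counts = Counter()
    for prefix, cnt in cond_db:
        for item in prefix:
            counts[item] += cnt
    patterns = {}
    for item, cnt in counts.items():
        if cnt < min_count:
            continue
        new_suffix = (item,) + suffix
        patterns[frozenset(new_suffix)] = cnt
        # Conditional pattern base for `item`: the part of each prefix before `item`.
        conditional = [(prefix[:prefix.index(item)], c)
                       for prefix, c in cond_db
                       if item in prefix and prefix.index(item) > 0]
        patterns.update(pattern_growth(conditional, min_count, new_suffix))
    return patterns

order = ["f", "c", "a", "b", "m", "p"]   # frequency-descending order from the slide
raw = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
db = [(tuple(i for i in order if i in t), 1) for t in raw]
result = pattern_growth(db, min_count=3)
print(sorted("".join(sorted(p)) for p in result if "m" in p))
# ['acfm', 'acm', 'afm', 'am', 'cfm', 'cm', 'fm', 'm'] -- the eight m-patterns from the slides
```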


Page 21: Dbm630 lecture05

Step 1: From FP-tree to Conditional Pattern Base


Starting at the frequent-item header table of the FP-tree:

Traverse the FP-tree by following the links of each frequent item.

Accumulate all of the transformed prefix paths of that item to form its conditional pattern base.

Conditional pattern bases:

item  conditional pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1

(The FP-tree and header table from page 19 are repeated here for reference.)

Page 22: Dbm630 lecture05

Step 2: Construct Conditional FP-tree


For each pattern-base

Accumulate the count for each item in the base

Construct the FP-tree for the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1

m-conditional FP-tree: {} → f:3 → c:3 → a:3

All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam

(The FP-tree and header table from page 19 are repeated here for reference.)

Page 23: Dbm630 lecture05

Mining Frequent Patterns by Creating Conditional Pattern-Bases

Item  Conditional pattern-base        Conditional FP-tree
f     Empty                           Empty
c     {(f:3)}                         {(f:3)}|c
a     {(fc:3)}                        {(f:3, c:3)}|a
b     {(fca:1), (f:1), (c:1)}         Empty
m     {(fca:2), (fcab:1)}             {(f:3, c:3, a:3)}|m
p     {(fcam:2), (cb:1)}              {(c:3)}|p


Page 24: Dbm630 lecture05

Step 3: Recursively mine the conditional FP-trees

m-conditional FP-tree: {} → f:3 → c:3 → a:3

am-conditional FP-tree: {} → f:3 → c:3

cm-conditional FP-tree: {} → f:3

cam-conditional FP-tree: {} → f:3


Page 25: Dbm630 lecture05

Single FP-tree Path Generation


Suppose an FP-tree T has a single path P.

The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P.

m-conditional pattern base: fca:2, fcab:1

m-conditional FP-tree: {} → f:3 → c:3 → a:3

All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam

(The FP-tree and header table from page 19 are repeated here for reference.)

Page 26: Dbm630 lecture05

FP-growth vs. Apriori: Scalability with the Support Threshold

(Figure: run time in seconds vs. support threshold (%) on data set T25I20D10K, comparing D1 FP-growth runtime with D1 Apriori runtime.)


Page 27: Dbm630 lecture05

CHARM - Mining Closed Association Rules

Instead of the horizontal DB format, the vertical format is used.

Instead of traditional frequent itemsets, closed frequent itemsets are mined.

Transaction Items

1 ABDE

2 BCE

3 ABDE

4 ABCE

5 ABCDE

6 BCD

Items Transaction

A 1345

B 123456

C 2456

D 1356

E 12345

(The first table above is the horizontal DB; the second is the corresponding vertical DB, mapping each item to its tidset.)
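A tiny Python sketch of the conversion (my own illustration): the vertical format maps each item to its tidset, and support then becomes a tidset intersection.

```python
from collections import defaultdict

# Horizontal database from the slide: tid -> set of items.
horizontal = {1: set("ABDE"), 2: set("BCE"), 3: set("ABDE"),
              4: set("ABCE"), 5: set("ABCDE"), 6: set("BCD")}

# Vertical format: item -> tidset.  The support of an itemset is then just the
# size of the intersection of its members' tidsets.
vertical = defaultdict(set)
for tid, items in horizontal.items():
    for item in items:
        vertical[item].add(tid)

print(sorted(vertical["A"]))                  # [1, 3, 4, 5]
print(len(vertical["A"] & vertical["D"]) / 6) # support(AD) = 3/6 = 0.5
```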


Page 28: Dbm630 lecture05

CHARM – Frequent Itemsets and Their Supports


An example database and its frequent itemsets

Vertical DB:

Item  Tidset
A     1345
B     123456
C     2456
D     1356
E     12345

Min. support = 0.5

Support  Itemsets
1.00     B
0.83     BE, E
0.67     A, C, D, AB, AE, BC, BD, ABE
0.50     AD, CE, DE, ABD, ADE, BCE, BDE, ABDE


Page 29: Dbm630 lecture05

CHARM - Closed Itemsets


Closed frequent itemsets and their corresponding frequent itemsets

Closed Itemset  Tidset   Sup.  Corresponding Frequent Itemsets
B               123456   1.00  B
BE              12345    0.83  BE, E
ABE             1345     0.67  ABE, AB, AE, A
BD              1356     0.67  BD, D
BC              2456     0.67  BC, C
ABDE            135      0.50  ABDE, ABD, ADE, BDE, AD, DE
BCE             245      0.50  BCE, CE
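For this toy database, the closed frequent itemsets can be checked by brute force in a few lines of Python (an illustrative check of my own, not the CHARM algorithm itself, which avoids enumerating all frequent itemsets):

```python
from itertools import combinations

db = {1: set("ABDE"), 2: set("BCE"), 3: set("ABDE"),
      4: set("ABCE"), 5: set("ABCDE"), 6: set("BCD")}

def tidset(itemset):
    """Transactions containing every item of `itemset`."""
    return {tid for tid, items in db.items() if set(itemset) <= items}

items = sorted({i for t in db.values() for i in t})
frequent = [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r) if len(tidset(c)) / len(db) >= 0.5]
# A frequent itemset is closed iff no proper superset has the same tidset.
closed = [x for x in frequent
          if not any(x < y and tidset(x) == tidset(y) for y in frequent)]
for x in sorted(closed, key=len):
    print("".join(sorted(x)), sorted(tidset(x)), round(len(tidset(x)) / len(db), 2))
# prints B, BC, BD, BE, ABE, BCE, ABDE -- the seven closed itemsets in the table above
```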


Page 30: Dbm630 lecture05

The CHARM Algorithm


CHARM(D ⊆ I × T, minsup):
1. Nodes = { Ij × t(Ij) : Ij ∈ I and |t(Ij)| ≥ minsup }
2. CHARM-EXTEND(Nodes, C)

CHARM-EXTEND(Nodes, C):
3. for each Xi × t(Xi) in Nodes
4.     NewN = ∅ and X = Xi
5.     for each Xj × t(Xj) in Nodes, with f(j) > f(i)
6.         X' = X ∪ Xj and Y = t(Xi) ∩ t(Xj)
7.         CHARM-PROPERTY(Nodes, NewN)
8.     if NewN ≠ ∅ then CHARM-EXTEND(NewN, C)
9.     C = C ∪ {X}  // if X is not subsumed

CHARM-PROPERTY(Nodes, NewN):
1. if |Y| ≥ minsup then
2.     if t(Xi) = t(Xj) then            // Property 1
3.         Remove Xj from Nodes
4.         Replace all Xi with X'
5.     else if t(Xi) ⊂ t(Xj) then       // Property 2
6.         Replace all Xi with X'
7.     else if t(Xi) ⊃ t(Xj) then       // Property 3
8.         Remove Xj from Nodes
9.         Add X' × Y to NewN
10.    else if t(Xi) ≠ t(Xj) then       // Property 4
11.        Add X' × Y to NewN

(Figure: the CHARM search tree over itemset × tidset pairs, starting from A×1345, B×123456, C×2456, D×1356, E×12345 and extending to nodes such as AB×1345, ABE×1345, ABC×45, ABD×135, ABDE×135, BC×2456, BCD×56, BCE×245, BD×1356, BE×12345, and BDE×135.)


Page 31: Dbm630 lecture05

Presentation of Association Rules (Table Form)


Page 32: Dbm630 lecture05

Visualization of Association Rule Using Plane Graph


Page 33: Dbm630 lecture05

Visualization of Association Rule Using Rule Graph


Page 34: Dbm630 lecture05

Multiple-Level Association Rules

Items often form a hierarchy.

Items at the lower levels are expected to have lower support.

Rules regarding itemsets at the appropriate levels could be quite useful.

A transaction database can be encoded based on dimensions and levels.

We can explore shared multi-level mining.

TID ITEMS

T1 {1121, 1122, 1212}

T2 {1222, 1121, 1122, 1213}

T3 {1124, 1213}

T4 {1111, 1211, 1232, 1221, 1223}

Concept hierarchy with level-wise encoding (figure):

Food (1)
├─ Milk (11)
│  ├─ Skim (111)
│  └─ 2% (112)
│     ├─ Fraser (1121)
│     └─ Sunset (1124)
└─ Bread (12)
   ├─ Wheat (121)
   │  └─ Wonder (1213)
   └─ White (122)
      └─ Wonder (1222)

Mining multilevel association rules from transactional databases.
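A small Python sketch of the level-wise encoding (my own illustration): each encoded item is expanded to itself plus all of its ancestors, so higher-level itemsets can be mined from the same extended transactions.

```python
def ancestors(code):
    """Expand an encoded item to itself plus all higher-level ancestors,
    e.g. '1121' (Fraser 2% milk) -> ['1121', '112', '11', '1']."""
    return [code[:k] for k in range(len(code), 0, -1)]

transaction = ["1121", "1122", "1212"]   # T1 from the table above
extended = sorted({a for item in transaction for a in ancestors(item)})
print(extended)
# ['1', '11', '112', '1121', '1122', '12', '121', '1212']
```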


Page 35: Dbm630 lecture05

Mining Multi-Level Associations


A top-down, progressive deepening approach:

First find high-level strong rules: milk ⇒ bread [20%, 60%]

Then find their lower-level "weaker" rules: 2% milk ⇒ wheat bread [6%, 50%]

Variations of mining multiple-level association rules:

Level-crossed association rules: 2% milk ⇒ Wonder wheat bread [3%, 60%]

Association rules with multiple, alternative hierarchies: 2% milk ⇒ Wonder bread [8%, 72%]


Page 36: Dbm630 lecture05

Multi-level Association: Redundancy Filtering


Some rules may be redundant due to "ancestor" relationships between items.

Example:

milk ⇒ wheat bread [s=8%, c=70%]

2% milk ⇒ wheat bread [s=2%, c=72%]

We say the first rule is an ancestor of the second rule.

A rule is redundant if its support is close to the "expected" value based on the rule's ancestor. For instance, if roughly a quarter of the milk sold is 2% milk, the expected support of the second rule is about 8% × 1/4 = 2%, which matches its actual support, so the second rule adds no new information.


Page 37: Dbm630 lecture05

Multi-Level Mining: Progressive Deepening


A top-down, progressive deepening approach:

First mine high-level frequent items: milk (15%), bread (10%)

Then mine their lower-level "weaker" frequent itemsets: 2% milk (5%), wheat bread (4%)

Different min_support thresholds across levels lead to different algorithms:

If the same min_support is adopted across all levels, then an itemset t is tossed if any of t's ancestors is infrequent.

If a reduced min_support is adopted at lower levels, then only those descendants whose ancestors' supports are frequent/non-negligible are examined.


Page 38: Dbm630 lecture05

Problem of Confidence


Example: (Aggarwal & Yu, PODS98)

Among 5000 students

3000 play basketball

3750 eat cereal

2000 both play basketball and eat cereal

play basketball ⇒ eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.

play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence.

basketball not basketball sum(row)

cereal 2000 1750 3750

not cereal 1000 250 1250

sum(col.) 3000 2000 5000


Page 39: Dbm630 lecture05

Interest/Lift/Correlation

Interest (or lift, correlation): takes both P(A) and P(B) into consideration.

P(A ∪ B) = P(A)P(B) if A and B are independent events.

A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated.

Lift(play basketball ⇒ eat cereal) = 0.89

Lift(play basketball ⇒ not eat cereal) = 1.33

basketball not basketball sum(row)

cereal 2000 1750 3750

not cereal 1000 250 1250

sum(col.) 3000 2000 5000

lift(A ⇒ B) = P(A ∪ B) / (P(A) P(B))

lift(basketball ⇒ cereal) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.889

lift(basketball ⇒ not cereal) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
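A one-function Python check of these two lift values (my own helper, using the contingency-table counts above):

```python
def lift(n_ab, n_a, n_b, n_total):
    """lift(A => B) = P(A and B) / (P(A) * P(B))."""
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))

print(round(lift(2000, 3000, 3750, 5000), 3))  # 0.889  (basketball => cereal)
print(round(lift(1000, 3000, 1250, 5000), 3))  # 1.333  (basketball => not cereal)
```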


Page 40: Dbm630 lecture05

Conviction

Conviction (Brin, 1997):

conviction(A ⇒ B) = (1 − support(B)) / (1 − confidence(A ⇒ B))

0 ≤ conv(A ⇒ B) ≤ ∞

A and B are statistically independent if and only if conv(A ⇒ B) = 1.

0 < conv(A ⇒ B) < 1 if and only if P(B|A) < P(B); B is negatively correlated with A.

1 < conv(A ⇒ B) < ∞ if and only if P(B|A) > P(B); B is positively correlated with A.


basketball not basketball sum(row)

cereal 2000 1750 3750

not cereal 1000 250 1250

sum(col.) 3000 2000 5000

conviction(play basketball ⇒ eat cereal) = (1 − 3750/5000) / (1 − 0.667) = 0.25 / 0.333 ≈ 0.75

conviction(play basketball ⇒ not eat cereal) = (1 − 1250/5000) / (1 − 0.333) = 0.75 / 0.667 ≈ 1.125
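And the corresponding conviction values computed directly from the definition (a small helper of my own):

```python
def conviction(n_ab, n_a, n_b, n_total):
    """conviction(A => B) = (1 - support(B)) / (1 - confidence(A => B))."""
    support_b = n_b / n_total
    conf = n_ab / n_a
    return (1 - support_b) / (1 - conf)

print(round(conviction(2000, 3000, 3750, 5000), 3))  # 0.75   (basketball => cereal)
print(round(conviction(1000, 3000, 1250, 5000), 3))  # 1.125  (basketball => not cereal)
```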


Page 41: Dbm630 lecture05

From Association Mining to Correlation Analysis

Ex. Strong rules are not necessarily interesting

Of 10,000 transactions:

• 6,000 customer transactions include computer games
• 7,500 customer transactions include videos
• 4,000 customer transactions include both computer games and videos

Suppose that a data mining program for discovering association rules is run on the data, using a min_sup of 30% and a min_conf of 60%. The following association rule is discovered:

buys(X, "computer games") ⇒ buys(X, "videos") [s = 40%, c = 66%]

(support = 4000/10000; confidence = 4000/6000)

(Venn diagram: the games and videos circles overlap in 4,000 transactions.)

Page 42: Dbm630 lecture05

This rule is misleading because the probability of purchasing videos is already 75%, which is greater than 66%.

In fact, computer games and videos are negatively associated: the purchase of one of these items actually decreases the likelihood of purchasing the other. Therefore, we could easily make unwise business decisions based on this rule.

buys(X, "computer games") ⇒ buys(X, "videos") [support=40%, confidence=66%]

A misleading “strong” association rule


Page 43: Dbm630 lecture05

From Association Analysis to Correlation Analysis

To help filter out misleading "strong" associations, correlation rules can be used:

A ⇒ B [support, confidence, correlation]

Lift is a simple correlation measure, given as follows.

The occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are dependent and correlated.

lift(A, B) = P(A ∪ B) / (P(A)P(B)) = P(B|A) / P(B) = conf(A ⇒ B) / sup(B)

If lift(A, B) < 1, then the occurrence of A is negatively correlated with the occurrence of B.

If lift(A, B) > 1, then A and B are positively correlated, meaning that the occurrence of one implies the occurrence of the other.


Page 44: Dbm630 lecture05

From Association Analysis to Correlation Analysis (Cont.)

Ex. Correlation analysis using lift

The lift of this rule is P{game,video} / (P{game} × P{video}) = 0.40/(0.6 ×0.75) = 0.89

There is a negative correlation between the occurrence of {game} and {video}

Ex. Is the following rule misleading?

buys walnuts ⇒ buys milk [1%, 80%]

if 85% of customers buy milk
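Using the lift formula from the previous page (a quick check of my own, not given on the slide): lift = conf / sup(milk) = 0.80 / 0.85 ≈ 0.94 < 1, so buying walnuts is slightly negatively correlated with buying milk, and despite its 80% confidence the rule is indeed mildly misleading.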


buys(X, "computer games") ⇒ buys(X, "videos") [support=40%, confidence=66%]


Page 45: Dbm630 lecture05

Homework

Given a transactional database, which is a LOG file recording each user's visits to web pages over a period of time, find trustworthy association rules. Assume that you are the data analyst and may set the minimum support and minimum confidence yourself; explain the reasons for the values you chose, and also check whether the resulting rules are misleading. If any are, explain how you would fix them.

TID List of items

T001 P1, P2, P3, P4

T002 P3, P6

T003 P2, P5, P1

T004 P5, P4, P3,P6

T005 P1, P3, P4, P2

(The items P1-P6 are the web pages visited.)

Page 46: Dbm630 lecture05

Feb 26, 2011 (14:00)


Quiz I

Star-net Query (Multidimensional Table)

Data Cube Computation (Memory Calculation)

Data Preprocessing (Normalization, Smoothing by binning)

Association Rule Mining