Dbm630 lecture05


Transcript of Dbm630 lecture05

Page 1: Dbm630 lecture05

DBM630: Data Mining and Data Warehousing
MS.IT. Rangsit University
Semester 2/2011

Lecture 5: Association Rule Mining
by Kritsada Sriphaew (sriphaew.k AT gmail.com)

Page 2: Dbm630 lecture05

Topics


Association rule mining

Mining single-dimensional association rules

Mining multilevel association rules

Other measurements: interest and conviction

From association rule mining to correlation analysis


Page 3: Dbm630 lecture05

What is Association Mining?


Association rule mining:

Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.

Applications:

Basket data analysis, cross-marketing, catalog design, clustering, classification, etc.

Ex.: Rule form: "Body ⇒ Head [support, confidence]", i.e., "Antecedent ⇒ Consequent [support, confidence]"

buys(x, "diapers") ⇒ buys(x, "beers") [0.5%, 60%]

major(x, "CS") ^ takes(x, "DB") ⇒ grade(x, "A") [1%, 75%]


Page 4: Dbm630 lecture05

A typical example of association rule mining is market basket analysis.


Page 5: Dbm630 lecture05

Rule Measures: Support/Confidence

Find all the rules "Antecedent(s) ⇒ Consequent(s)" with minimum support and confidence.

Support, s: the probability that a transaction contains {A ∪ C}.

Confidence, c: the conditional probability that a transaction containing A also contains C.

Let min. sup. = 50% and min. conf. = 50%:
A ⇒ C (s=50%, c=66.7%)

C ⇒ A (s=50%, c=100%)

Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold.

Such thresholds can be set by users or domain experts

Transaction ID Items Bought

2000 A,B,C

1000 A,C

4000 A,D

5000 B,E,F

Transactional databases

• Support= 50% means that 50% of all transactions under analysis show that A and C are purchased together

• Confidence=66.7% means that 66.7% of the customers who purchased A also bought C


Page 6: Dbm630 lecture05


Rule Measures: Support/Confidence

TransID Items Bought

T001 A,B,C

T002 A,C

T003 A,D

T004 B,E,F

Rule: A ⇒ C

support(A ⇒ C) = P(A ∪ C)

confidence(A ⇒ C) = P(C|A) = P(A ∪ C) / P(A)

• A ⇒ B (1/4 = 25%, 1/3 = 33.3%)

• B ⇒ A (1/4 = 25%, 1/2 = 50%)

• A ⇒ C (2/4 = 50%, 2/3 = 66.7%)

• C ⇒ A (2/4 = 50%, 2/2 = 100%)

• A, B ⇒ C (1/4 = 25%, 1/1 = 100%)

• A, C ⇒ B (1/4 = 25%, 1/2 = 50%)

• B, C ⇒ A (1/4 = 25%, 1/1 = 100%)

Frequency

A = 3

B = 2

C = 2

AB = 1

AC = 2

BC = 1

ABC = 1

(Venn diagram: customers who buy beer (A), customers who buy diapers (C), and the overlap of customers who buy both (A & C); support and confidence are probabilities over these sets.)
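To make these numbers concrete, here is a minimal Python sketch (not from the slides; the function names are my own) that recomputes the support and confidence values above from the four-transaction database.

```python
# Toy transaction database from the slide (T001-T004).
transactions = [
    {"A", "B", "C"},  # T001
    {"A", "C"},       # T002
    {"A", "D"},       # T003
    {"B", "E", "F"},  # T004
]

def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """support(A ∪ C) / support(A)."""
    return support(set(antecedent) | set(consequent), db) / support(antecedent, db)

print(support({"A", "C"}, transactions))       # 0.5      -> s(A => C) = 50%
print(confidence({"A"}, {"C"}, transactions))  # 0.666... -> c(A => C) = 66.7%
print(confidence({"C"}, {"A"}, transactions))  # 1.0      -> c(C => A) = 100%
```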


Page 7: Dbm630 lecture05

Association Rule: Support/Confidence for Relational Tables


In the case where each transaction is a row in a relational table:

Find all rules that correlate the presence of one set of attributes with that of another set of attributes.

outlook   temp. humidity windy play-time play sponsor
sunny     hot   high     True  85        Y    Sony
sunny     hot   high     False 90        Y    HP
rainy     hot   normal   True  63        Y    Ford
rainy     mild  high     True  5         N    Ford
overcast  cool  low      False 56        Y    HP
overcast  hot   low      True  25        N    Sony
overcast  cool  normal   True  5         N    Nokia
rainy     mild  high     True  86        Y    Honda
rainy     mild  low      False 78        Y    Ford
sunny     hot   high     True  74        Y    Sony

• If temperature = hot then humidity = high (s=3/10, c=3/5)

• If windy = true and play = Y then humidity = high and outlook = overcast (s=2/10, c=2/4)

• If windy = true and play = Y and humidity = high then outlook = overcast (s=2/10, c=2/3)
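As a small illustration of how such a relational row can be fed to the same single-dimensional machinery, here is a Python sketch of my own (not from the slides) that turns a row into a set of attribute=value items.

```python
# One row of the relational table above, written as a Python dict (illustrative only).
row = {"outlook": "sunny", "temp": "hot", "humidity": "high",
       "windy": True, "play": "Y", "sponsor": "Sony"}

# Each row becomes a set of attribute=value items, so rules such as
# "temperature = hot => humidity = high" are ordinary association rules over these items.
items = {f"{attr}={val}" for attr, val in row.items()}
print(sorted(items))
# ['humidity=high', 'outlook=sunny', 'play=Y', 'sponsor=Sony', 'temp=hot', 'windy=True']
```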


Page 8: Dbm630 lecture05

Association Rule Mining: Types


Boolean vs. quantitative associations (based on the types of values handled):

buys(x, "SQLServer") ^ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]

age(x, "30..39") ^ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]

Single-dimensional vs. multidimensional associations.

Single-level vs. multilevel analysis: What brands of beers are associated with what brands of diapers?

Various extensions: max-patterns and closed itemsets.


Page 9: Dbm630 lecture05

An Example (single dimensional Boolean association Rule Mining)


For rule A ⇒ C: support = support({A, C}) = 50%

confidence = support({A, C})/support({A}) = 66.7%

The Apriori principle:

Any subset of a frequent itemset must be frequent

Transaction ID Items Bought

2000 A,B,C

1000 A,C

4000 A,D

5000 B,E,F

Frequent Itemset Support

{A} 75%

{B} 50%

{C} 50%

{A,C} 50%

Min. support 50%

Min. confidence 50%


Page 10: Dbm630 lecture05

Two Steps in Mining Association Rules


Step 1: Find the frequent itemsets, i.e., the sets of items that have minimum support.

A subset of a frequent itemset must also be a frequent itemset; i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.

Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).

Step 2: Use the frequent itemsets to generate association rules (see the sketch below).
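The following Python sketch illustrates Step 2 under the assumption that the frequent itemsets and their supports (Step 1) are already available; the function name generate_rules and the data layout are my own.

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """Derive rules A => C from each frequent itemset.

    `frequent` maps frozenset itemsets to their support, e.g. {frozenset('AC'): 0.5, ...}.
    """
    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = supp / frequent[antecedent]  # Apriori guarantees the antecedent is frequent
                if conf >= min_conf:
                    rules.append((set(antecedent), set(itemset - antecedent), supp, conf))
    return rules

frequent = {frozenset("A"): 0.75, frozenset("B"): 0.5, frozenset("C"): 0.5,
            frozenset("AC"): 0.5}
for a, c, s, conf in generate_rules(frequent, min_conf=0.5):
    print(a, "=>", c, f"s={s:.0%}, c={conf:.1%}")
# {'A'} => {'C'} s=50%, c=66.7%   and   {'C'} => {'A'} s=50%, c=100.0%
```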


Page 11: Dbm630 lecture05

The Apriori Algorithm


Join Step: Ck is generated by joining Lk-1 with itself

Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset

Pseudo-code:

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
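A compact Python rendering of the loop above (an illustrative sketch, not the lecture's reference code; it uses a combinations-based shortcut that is equivalent to the join + prune steps):

```python
from itertools import combinations

def apriori(db, min_support):
    """Level-wise Apriori sketch: generate candidates from the previous level,
    count them with one database scan, keep those meeting min_support."""
    n = len(db)
    # L1: frequent 1-itemsets.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {c: s / n for c, s in counts.items() if s / n >= min_support}
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        # Candidate (k+1)-itemsets: every (k+1)-combination of currently frequent
        # items whose k-subsets are all frequent (equivalent to join + prune).
        items = sorted({i for itemset in frequent for i in itemset})
        candidates = [frozenset(c) for c in combinations(items, k + 1)
                      if all(frozenset(s) in frequent for s in combinations(c, k))]
        counts = {c: sum(c <= t for t in db) for c in candidates}  # one scan per level
        frequent = {c: s / n for c, s in counts.items() if s / n >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# The example database from the next slide (TIDs 100-400).
db = [{"1", "3", "4"}, {"2", "3", "5"}, {"1", "2", "3", "5"}, {"2", "5"}]
for itemset, supp in sorted(apriori(db, 0.5).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), supp)
# ['1'] 0.5 ... ['2', '3', '5'] 0.5 -- matching L1, L2 and L3 on the next slide
```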



Page 12: Dbm630 lecture05

The Apriori Algorithm — Example

Database D:

TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D → C1 (candidate 1-itemsets with counts): {1}:2, {2}:3, {3}:3, {4}:1, {5}:3

L1 (frequent 1-itemsets, minimum support count = 2): {1}:2, {2}:3, {3}:3, {5}:3

C2 (candidates from joining L1 with itself): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2 with counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2

L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3: {2 3 5}; Scan D → {2 3 5}:2

L3: {2 3 5}:2

Page 13: Dbm630 lecture05

How to Generate Candidates?


Suppose the items in Lk-1 are listed in an order

Step 1: self-joining Lk-1

INSERT INTO Ck
SELECT p.item1, p.item2, …, p.itemk-1, q.itemk-1
FROM Lk-1 p, Lk-1 q
WHERE p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning

ForAll itemsets c IN Ck DO
    ForAll (k-1)-subsets s OF c DO
        IF (s is not in Lk-1) THEN DELETE c FROM Ck

Page 14: Dbm630 lecture05

Example of Generating Candidates


L3 = {abc, abd, acd, ace, bcd}

Self-joining: L3 × L3

abc + abd → abcd

acd + ace → acde

Pruning: acde is removed because its subset ade is not in L3

C4 = {abcd}
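A literal Python rendering of this join + prune step (a sketch; function and variable names are my own), applied to the L3 example above so the result can be checked against the slide:

```python
from itertools import combinations

def generate_candidates(L_prev, k):
    """Self-join L_{k-1} on the first k-2 items, then prune candidates that have
    an infrequent (k-1)-subset."""
    sorted_prev = [tuple(sorted(s)) for s in L_prev]
    joined = set()
    for p in sorted_prev:                       # join step
        for q in sorted_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.add(frozenset(p + (q[-1],)))
    return {c for c in joined                   # prune step
            if all(frozenset(s) in L_prev for s in combinations(sorted(c), k - 1))}

L3 = {frozenset(x) for x in ("abc", "abd", "acd", "ace", "bcd")}
print(generate_candidates(L3, 4))
# {frozenset({'a', 'b', 'c', 'd'})} -- abcd is kept; acde is pruned (ade is not in L3)
```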


Page 15: Dbm630 lecture05

How to Count Supports of Candidates?


Why is counting supports of candidates a problem?

The total number of candidates can be huge

One transaction may contain many candidates

Method:

Candidate itemsets are stored in a hash-tree

Leaf node of hash-tree contains a list of itemsets and counts

Interior node contains a hash table

Subset function: finds all the candidates contained in a transaction


Page 16: Dbm630 lecture05

Subset Function


Subset function: finds all the candidates contained in a transaction. (1) Generate Hash Tree (2) Hashing each item in the transactions

(Figure: a hash tree is built over the C2 candidates {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}; each transaction in the database from slide 12 is hashed item by item down the tree, incrementing the counts of the candidates it contains.)
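The slide's hash tree is an optimization of the basic idea; as a simpler stand-in (my own sketch), the subset function can be written by enumerating each transaction's k-subsets and looking them up in a plain dictionary of candidates:

```python
from itertools import combinations

def count_candidates(db, candidates, k):
    """For each transaction, enumerate its k-subsets and bump the count of those
    that are candidates (a dictionary stand-in for the hash tree)."""
    counts = {c: 0 for c in candidates}
    for t in db:
        for subset in combinations(sorted(t), k):
            key = frozenset(subset)
            if key in counts:
                counts[key] += 1
    return counts

db = [{"1", "3", "4"}, {"2", "3", "5"}, {"1", "2", "3", "5"}, {"2", "5"}]
C2 = {frozenset(p) for p in [("1", "2"), ("1", "3"), ("1", "5"),
                             ("2", "3"), ("2", "5"), ("3", "5")]}
for c, n in sorted(count_candidates(db, C2, 2).items(), key=lambda kv: sorted(kv[0])):
    print(sorted(c), n)
# ['1','2'] 1  ['1','3'] 2  ['1','5'] 1  ['2','3'] 2  ['2','5'] 3  ['3','5'] 2
```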


Page 17: Dbm630 lecture05

Is Apriori Fast Enough? — Performance Bottlenecks


The core of the Apriori algorithm:

Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets.

Use database scans and pattern matching to collect counts for the candidate itemsets.

The bottleneck of Apriori: candidate generation.

Huge candidate sets: 10^4 frequent 1-itemsets will generate on the order of 10^7 candidate 2-itemsets. To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates.

Multiple scans of the database: Apriori needs (n + 1) scans, where n is the length of the longest pattern.


Page 18: Dbm630 lecture05

Mining Frequent Patterns Without Candidate Generation


Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure

highly condensed, but complete for frequent pattern mining

avoid costly database scans

Develop an efficient, FP-tree-based frequent pattern mining method

A divide-and-conquer methodology: decompose mining tasks into smaller ones

Avoid candidate generation: sub-database test only!


Page 19: Dbm630 lecture05

Construct FP-tree from Transaction DB


TID  Items bought               (Ordered) frequent items
100  {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200  {a, b, c, f, l, m, o}      {f, c, a, b, m}
300  {b, f, h, j, o}            {f, b}
400  {b, c, k, s, p}            {c, b, p}
500  {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

min_support = 0.5

Header table (item : frequency): f : 4, c : 4, a : 3, b : 3, m : 3, p : 3

Resulting FP-tree:

{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1

Steps:

1. Scan the DB once and find the frequent 1-itemsets (single-item patterns).

2. Order the frequent items in frequency-descending order.

3. Scan the DB again and construct the FP-tree.
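A self-contained Python sketch of these three steps (class and function names are my own; ties in the frequency ordering are broken alphabetically here, so the exact tree shape can differ slightly from the slide's, which lists f before c):

```python
from collections import Counter, defaultdict

class FPNode:
    """One FP-tree node: an item, a count, a parent link, and child nodes."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fptree(db, min_count):
    # Step 1: one scan to count single items and keep the frequent ones.
    freq = Counter(item for t in db for item in t)
    freq = {i: c for i, c in freq.items() if c >= min_count}
    # Steps 2-3: re-scan, order each transaction by descending frequency, insert into the tree.
    root = FPNode(None)
    header = defaultdict(list)          # header table: item -> nodes holding that item
    for t in db:
        ordered = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = FPNode(item, parent=node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header, freq

db = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
root, header, freq = build_fptree(db, min_count=3)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})
# each frequent item's total count across its tree nodes: f 4, c 4, a 3, b 3, m 3, p 3
```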


Page 20: Dbm630 lecture05

Mining Frequent Patterns using FP-tree


General idea (divide-and-conquer): recursively grow frequent pattern paths using the FP-tree.

Method:

For each item, construct its conditional pattern base, and then its conditional FP-tree.

Repeat the process on each newly created conditional FP-tree until the resulting FP-tree is empty or contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern). A compact sketch of this recursion appears below.

Benefits: completeness and compactness.

Completeness: the FP-tree never breaks a long pattern of any transaction and preserves complete information for frequent pattern mining.

Compactness: it discards irrelevant information (infrequent items are gone), orders items in frequency-descending order (more frequent items are more likely to be shared), and is smaller than the original database.
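The recursion itself can be sketched compactly in Python if the conditional pattern base is kept as a plain list of (prefix, count) pairs instead of an actual FP-tree; this is my own simplification of the divide-and-conquer idea above, not the lecture's implementation, but it produces the same frequent patterns for the running example.

```python
from collections import Counter

def pattern_growth(cond_db, min_count, suffix=()):
    """Recursively mine a conditional pattern base given as [(prefix_tuple, count), ...].
    Prefixes must list items in one fixed, frequency-descending order."""
    counts = Counter()
    for prefix, cnt in cond_db:
        for item in prefix:
            counts[item] += cnt
    patterns = {}
    for item, cnt in counts.items():
        if cnt < min_count:
            continue
        new_suffix = (item,) + suffix
        patterns[frozenset(new_suffix)] = cnt
        # Conditional pattern base for `item`: the part of each prefix before `item`.
        conditional = [(prefix[:prefix.index(item)], c)
                       for prefix, c in cond_db
                       if item in prefix and prefix.index(item) > 0]
        patterns.update(pattern_growth(conditional, min_count, new_suffix))
    return patterns

order = ["f", "c", "a", "b", "m", "p"]   # frequency-descending order from the slide
raw = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
db = [(tuple(i for i in order if i in t), 1) for t in raw]
result = pattern_growth(db, min_count=3)
print(sorted("".join(sorted(p)) for p in result if "m" in p))
# ['acfm', 'acm', 'afm', 'am', 'cfm', 'cm', 'fm', 'm'] -- the eight m-patterns from the slides
```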


Page 21: Dbm630 lecture05

Step 1: From FP-tree to Conditional Pattern Base


Starting at the frequent-item header table of the FP-tree:

Traverse the FP-tree by following the links of each frequent item.

Accumulate all of the transformed prefix paths of that item to form its conditional pattern base.

Conditional pattern bases:

item  conditional pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1

(The FP-tree and header table from page 19 are repeated here for reference.)

Page 22: Dbm630 lecture05

Step 2: Construct Conditional FP-tree


For each pattern-base

Accumulate the count for each item in the base

Construct the FP-tree for the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1

m-conditional FP-tree: {} → f:3 → c:3 → a:3

All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam

(The FP-tree and header table from page 19 are repeated here for reference.)

Page 23: Dbm630 lecture05

Mining Frequent Patterns by Creating Conditional Pattern-Bases

Item  Conditional pattern-base        Conditional FP-tree
f     Empty                           Empty
c     {(f:3)}                         {(f:3)}|c
a     {(fc:3)}                        {(f:3, c:3)}|a
b     {(fca:1), (f:1), (c:1)}         Empty
m     {(fca:2), (fcab:1)}             {(f:3, c:3, a:3)}|m
p     {(fcam:2), (cb:1)}              {(c:3)}|p


Page 24: Dbm630 lecture05

Step 3: Recursively mine the conditional FP-trees

m-conditional FP-tree: {} → f:3 → c:3 → a:3

am-conditional FP-tree: {} → f:3 → c:3

cm-conditional FP-tree: {} → f:3

cam-conditional FP-tree: {} → f:3


Page 25: Dbm630 lecture05

Single FP-tree Path Generation


Suppose an FP-tree T has a single path P.

The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P.

m-conditional pattern base: fca:2, fcab:1

m-conditional FP-tree: {} → f:3 → c:3 → a:3

All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam

(The FP-tree and header table from page 19 are repeated here for reference.)

Page 26: Dbm630 lecture05

FP-growth vs. Apriori: Scalability with the Support Threshold

(Figure: run time in seconds vs. support threshold (%) on data set T25I20D10K, comparing D1 FP-growth runtime with D1 Apriori runtime.)


Page 27: Dbm630 lecture05

CHARM - Mining Closed Association Rules

Instead of the horizontal DB format, the vertical format is used.

Instead of traditional frequent itemsets, closed frequent itemsets are mined.

Transaction Items

1 ABDE

2 BCE

3 ABDE

4 ABCE

5 ABCDE

6 BCD

Items Transaction

A 1345

B 123456

C 2456

D 1356

E 12345

(The first table above is the horizontal DB; the second is the corresponding vertical DB, mapping each item to its tidset.)
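A tiny Python sketch of the conversion (my own illustration): the vertical format maps each item to its tidset, and support then becomes a tidset intersection.

```python
from collections import defaultdict

# Horizontal database from the slide: tid -> set of items.
horizontal = {1: set("ABDE"), 2: set("BCE"), 3: set("ABDE"),
              4: set("ABCE"), 5: set("ABCDE"), 6: set("BCD")}

# Vertical format: item -> tidset.  The support of an itemset is then just the
# size of the intersection of its members' tidsets.
vertical = defaultdict(set)
for tid, items in horizontal.items():
    for item in items:
        vertical[item].add(tid)

print(sorted(vertical["A"]))                  # [1, 3, 4, 5]
print(len(vertical["A"] & vertical["D"]) / 6) # support(AD) = 3/6 = 0.5
```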


Page 28: Dbm630 lecture05

CHARM – Frequent Itemsets and Their Supports


An example database and its frequent itemsets

Vertical DB:

Item  Tidset
A     1345
B     123456
C     2456
D     1356
E     12345

Min. support = 0.5

Support  Itemsets
1.00     B
0.83     BE, E
0.67     A, C, D, AB, AE, BC, BD, ABE
0.50     AD, CE, DE, ABD, ADE, BCE, BDE, ABDE


Page 29: Dbm630 lecture05

CHARM - Closed Itemsets


Closed frequent itemsets and their corresponding frequent itemsets

Closed Itemset  Tidset   Sup.  Corresponding Frequent Itemsets
B               123456   1.00  B
BE              12345    0.83  BE, E
ABE             1345     0.67  ABE, AB, AE, A
BD              1356     0.67  BD, D
BC              2456     0.67  BC, C
ABDE            135      0.50  ABDE, ABD, ADE, BDE, AD, DE
BCE             245      0.50  BCE, CE
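For this toy database, the closed frequent itemsets can be checked by brute force in a few lines of Python (an illustrative check of my own, not the CHARM algorithm itself, which avoids enumerating all frequent itemsets):

```python
from itertools import combinations

db = {1: set("ABDE"), 2: set("BCE"), 3: set("ABDE"),
      4: set("ABCE"), 5: set("ABCDE"), 6: set("BCD")}

def tidset(itemset):
    """Transactions containing every item of `itemset`."""
    return {tid for tid, items in db.items() if set(itemset) <= items}

items = sorted({i for t in db.values() for i in t})
frequent = [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r) if len(tidset(c)) / len(db) >= 0.5]
# A frequent itemset is closed iff no proper superset has the same tidset.
closed = [x for x in frequent
          if not any(x < y and tidset(x) == tidset(y) for y in frequent)]
for x in sorted(closed, key=len):
    print("".join(sorted(x)), sorted(tidset(x)), round(len(tidset(x)) / len(db), 2))
# prints B, BC, BD, BE, ABE, BCE, ABDE -- the seven closed itemsets in the table above
```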


Page 30: Dbm630 lecture05

The CHARM Algorithm


CHARM(D ⊆ I × T, minsup):
1. Nodes = { Ij × t(Ij) : Ij ∈ I and |t(Ij)| ≥ minsup }
2. CHARM-EXTEND(Nodes, C)

CHARM-EXTEND(Nodes, C):
3. for each Xi × t(Xi) in Nodes
4.     NewN = ∅ and X = Xi
5.     for each Xj × t(Xj) in Nodes, with f(j) > f(i)
6.         X' = X ∪ Xj and Y = t(Xi) ∩ t(Xj)
7.         CHARM-PROPERTY(Nodes, NewN)
8.     if NewN ≠ ∅ then CHARM-EXTEND(NewN, C)
9.     C = C ∪ {X}  // if X is not subsumed

CHARM-PROPERTY(Nodes, NewN):
1. if |Y| ≥ minsup then
2.     if t(Xi) = t(Xj) then            // Property 1
3.         Remove Xj from Nodes
4.         Replace all Xi with X'
5.     else if t(Xi) ⊂ t(Xj) then       // Property 2
6.         Replace all Xi with X'
7.     else if t(Xi) ⊃ t(Xj) then       // Property 3
8.         Remove Xj from Nodes
9.         Add X' × Y to NewN
10.    else if t(Xi) ≠ t(Xj) then       // Property 4
11.        Add X' × Y to NewN

(Figure: the CHARM search tree over itemset × tidset pairs, starting from A×1345, B×123456, C×2456, D×1356, E×12345 and extending to nodes such as AB×1345, ABE×1345, ABC×45, ABD×135, ABDE×135, BC×2456, BCD×56, BCE×245, BD×1356, BE×12345, and BDE×135.)


Page 31: Dbm630 lecture05

Presentation of Association Rules (Table Form)


Page 32: Dbm630 lecture05

Visualization of Association Rule Using Plane Graph


Page 33: Dbm630 lecture05

Visualization of Association Rule Using Rule Graph


Page 34: Dbm630 lecture05

Multiple-Level Association Rules

Items often form a hierarchy.

Items at the lower levels are expected to have lower support.

Rules regarding itemsets at the appropriate levels could be quite useful.

A transaction database can be encoded based on dimensions and levels.

We can explore shared multi-level mining.

TID ITEMS

T1 {1121, 1122, 1212}

T2 {1222, 1121, 1122, 1213}

T3 {1124, 1213}

T4 {1111, 1211, 1232, 1221, 1223}

Concept hierarchy with level-wise encoding (figure):

Food (1)
├─ Milk (11)
│  ├─ Skim (111)
│  └─ 2% (112)
│     ├─ Fraser (1121)
│     └─ Sunset (1124)
└─ Bread (12)
   ├─ Wheat (121)
   │  └─ Wonder (1213)
   └─ White (122)
      └─ Wonder (1222)

Mining multilevel association rules from transactional databases.
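A small Python sketch of the level-wise encoding (my own illustration): each encoded item is expanded to itself plus all of its ancestors, so higher-level itemsets can be mined from the same extended transactions.

```python
def ancestors(code):
    """Expand an encoded item to itself plus all higher-level ancestors,
    e.g. '1121' (Fraser 2% milk) -> ['1121', '112', '11', '1']."""
    return [code[:k] for k in range(len(code), 0, -1)]

transaction = ["1121", "1122", "1212"]   # T1 from the table above
extended = sorted({a for item in transaction for a in ancestors(item)})
print(extended)
# ['1', '11', '112', '1121', '1122', '12', '121', '1212']
```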


Page 35: Dbm630 lecture05

Mining Multi-Level Associations


A top-down, progressive deepening approach:

First find high-level strong rules: milk ⇒ bread [20%, 60%]

Then find their lower-level "weaker" rules: 2% milk ⇒ wheat bread [6%, 50%]

Variations of mining multiple-level association rules:

Level-crossed association rules: 2% milk ⇒ Wonder wheat bread [3%, 60%]

Association rules with multiple, alternative hierarchies: 2% milk ⇒ Wonder bread [8%, 72%]


Page 36: Dbm630 lecture05

Multi-level Association: Redundancy Filtering


Some rules may be redundant due to "ancestor" relationships between items.

Example:

milk ⇒ wheat bread [s=8%, c=70%]

2% milk ⇒ wheat bread [s=2%, c=72%]

We say the first rule is an ancestor of the second rule.

A rule is redundant if its support is close to the "expected" value based on the rule's ancestor. For instance, if roughly a quarter of the milk sold is 2% milk, the expected support of the second rule is about 8% × 1/4 = 2%, which matches its actual support, so the second rule adds no new information.


Page 37: Dbm630 lecture05

Multi-Level Mining: Progressive Deepening


A top-down, progressive deepening approach:

First mine high-level frequent items: milk (15%), bread (10%)

Then mine their lower-level "weaker" frequent itemsets: 2% milk (5%), wheat bread (4%)

Different min_support thresholds across levels lead to different algorithms:

If the same min_support is adopted across all levels, then an itemset t is tossed if any of t's ancestors is infrequent.

If a reduced min_support is adopted at lower levels, then only those descendants whose ancestors' supports are frequent/non-negligible are examined.


Page 38: Dbm630 lecture05

Problem of Confidence


Example: (Aggarwal & Yu, PODS98)

Among 5000 students

3000 play basketball

3750 eat cereal

2000 both play basketball and eat cereal

play basketball ⇒ eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.

play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence.

basketball not basketball sum(row)

cereal 2000 1750 3750

not cereal 1000 250 1250

sum(col.) 3000 2000 5000


Page 39: Dbm630 lecture05

Interest/Lift/Correlation

Interest (or lift, correlation): takes both P(A) and P(B) into consideration.

P(A ∪ B) = P(A)P(B) if A and B are independent events.

A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated.

Lift(play basketball ⇒ eat cereal) = 0.89

Lift(play basketball ⇒ not eat cereal) = 1.33

basketball not basketball sum(row)

cereal 2000 1750 3750

not cereal 1000 250 1250

sum(col.) 3000 2000 5000

lift(A ⇒ B) = P(A ∪ B) / (P(A) P(B))

lift(basketball ⇒ cereal) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.889

lift(basketball ⇒ not cereal) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
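A one-function Python check of these two lift values (my own helper, using the contingency-table counts above):

```python
def lift(n_ab, n_a, n_b, n_total):
    """lift(A => B) = P(A and B) / (P(A) * P(B))."""
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))

print(round(lift(2000, 3000, 3750, 5000), 3))  # 0.889  (basketball => cereal)
print(round(lift(1000, 3000, 1250, 5000), 3))  # 1.333  (basketball => not cereal)
```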


Page 40: Dbm630 lecture05

Conviction

Conviction (Brin, 1997):

conviction(A ⇒ B) = (1 − support(B)) / (1 − confidence(A ⇒ B))

0 ≤ conv(A ⇒ B) ≤ ∞

A and B are statistically independent if and only if conv(A ⇒ B) = 1.

0 < conv(A ⇒ B) < 1 if and only if P(B|A) < P(B); B is negatively correlated with A.

1 < conv(A ⇒ B) < ∞ if and only if P(B|A) > P(B); B is positively correlated with A.


basketball not basketball sum(row)

cereal 2000 1750 3750

not cereal 1000 250 1250

sum(col.) 3000 2000 5000

conviction(play basketball ⇒ eat cereal) = (1 − 3750/5000) / (1 − 0.667) = 0.25 / 0.333 ≈ 0.75

conviction(play basketball ⇒ not eat cereal) = (1 − 1250/5000) / (1 − 0.333) = 0.75 / 0.667 ≈ 1.125
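And the corresponding conviction values computed directly from the definition (a small helper of my own):

```python
def conviction(n_ab, n_a, n_b, n_total):
    """conviction(A => B) = (1 - support(B)) / (1 - confidence(A => B))."""
    support_b = n_b / n_total
    conf = n_ab / n_a
    return (1 - support_b) / (1 - conf)

print(round(conviction(2000, 3000, 3750, 5000), 3))  # 0.75   (basketball => cereal)
print(round(conviction(1000, 3000, 1250, 5000), 3))  # 1.125  (basketball => not cereal)
```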


Page 41: Dbm630 lecture05

From Association Mining to Correlation Analysis

Ex. Strong rules are not necessarily interesting

Of 10,000 transactions:

• 6,000 customer transactions include computer games
• 7,500 customer transactions include videos
• 4,000 customer transactions include both computer games and videos

Suppose that a data mining program for discovering association rules is run on the data, using a min_sup of 30% and a min_conf of 60%. The following association rule is discovered:

buys(X, "computer games") ⇒ buys(X, "videos") [s = 40%, c = 66%]

(support = 4000/10000; confidence = 4000/6000)

(Venn diagram: the games and videos circles overlap in 4,000 transactions.)

Page 42: Dbm630 lecture05

This rule is misleading because the probability of purchasing videos is already 75%, which is greater than 66%.

In fact, computer games and videos are negatively associated: the purchase of one of these items actually decreases the likelihood of purchasing the other. Therefore, we could easily make unwise business decisions based on this rule.

buys(X, "computer games") ⇒ buys(X, "videos") [support=40%, confidence=66%]

A misleading “strong” association rule


Page 43: Dbm630 lecture05

From Association Analysis to Correlation Analysis

To help filter out misleading "strong" associations, correlation rules can be used:

A ⇒ B [support, confidence, correlation]

Lift is a simple correlation measure, given as follows.

The occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are dependent and correlated.

lift(A, B) = P(A ∪ B) / (P(A)P(B)) = P(B|A) / P(B) = conf(A ⇒ B) / sup(B)

If lift(A, B) < 1, then the occurrence of A is negatively correlated with the occurrence of B.

If lift(A, B) > 1, then A and B are positively correlated, meaning that the occurrence of one implies the occurrence of the other.


Page 44: Dbm630 lecture05

From Association Analysis to Correlation Analysis (Cont.)

Ex. Correlation analysis using lift

The lift of this rule is P{game,video} / (P{game} × P{video}) = 0.40/(0.6 ×0.75) = 0.89

There is a negative correlation between the occurrence of {game} and {video}

Ex. Is the following rule misleading?

buys walnuts ⇒ buys milk [1%, 80%]

if 85% of customers buy milk
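Using the lift formula from the previous page (a quick check of my own, not given on the slide): lift = conf / sup(milk) = 0.80 / 0.85 ≈ 0.94 < 1, so buying walnuts is slightly negatively correlated with buying milk, and despite its 80% confidence the rule is indeed mildly misleading.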


buys(X, "computer games") ⇒ buys(X, "videos") [support=40%, confidence=66%]


Page 45: Dbm630 lecture05

Homework

Given a transactional database, which is a LOG file recording each user's visits to web pages over a period of time, find trustworthy association rules. Assume that you are the data analyst and may set the minimum support and minimum confidence yourself; explain the reasons for the values you chose, and also check whether the resulting rules are misleading. If any are, explain how you would fix them.

TID List of items

T001 P1, P2, P3, P4

T002 P3, P6

T003 P2, P5, P1

T004 P5, P4, P3,P6

T005 P1, P3, P4, P2

(The items P1-P6 are the web pages visited.)

Page 46: Dbm630 lecture05

Feb 26, 2011 (14:00)


Quiz I

Star-net Query (Multidimensional Table)

Data Cube Computation (Memory Calculation)

Data Preprocessing (Normalization, Smoothing by binning)

Association Rule Mining