Efficiently Clustering Transactional Data with Weighted Coverage Density
M. Hua Yan, Keke Chen, and Ling Liu
Proceedings of the 15th International Conference on Information and Knowledge Management, ACM CIKM, 2006
Presenter: 吳建良
Outline
- Motivation
- SCALE Framework
- BKPlot Method
- WCD Clustering Algorithm
- Cluster Validity Evaluation
- Experimental Results
Motivation
- Transactional data is a special kind of categorical data.
  - Example: t1 = {milk, bread, beer}, t2 = {milk, bread}
  - It can be transformed into a row-by-column table of Boolean values.
  - Large volume and high dimensionality make the existing algorithms inefficient at processing the transformed data.
- Algorithms for clustering transactional data: LargeItem, CLOPE, CCCD
  - They require users to manually tune at least one or two parameters.
  - Suitable settings for these parameters differ from dataset to dataset.
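The row-by-column transformation mentioned above can be illustrated with a minimal sketch. It uses the two example transactions from the slide; the function name `to_boolean_table` and the dictionary layout are assumptions made only for illustration.

```python
# Toy data: the two example transactions from the slide.
transactions = {
    "t1": {"milk", "bread", "beer"},
    "t2": {"milk", "bread"},
}


def to_boolean_table(transactions):
    """Return (sorted item list, {transaction id: one Boolean per item})."""
    # Columns are all distinct items seen in any transaction.
    items = sorted(set().union(*transactions.values()))
    # Each row marks which columns the transaction contains.
    rows = {tid: [item in t for item in items] for tid, t in transactions.items()}
    return items, rows


items, rows = to_boolean_table(transactions)
for tid, row in rows.items():
    print(tid, dict(zip(items, row)))
```

With many distinct items the table becomes wide and mostly False, which is the dimensionality problem the slide refers to.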
SCALE Framework
- ACE & BkPlot (SSDBM'05)
  - ACE: Agglomerative Categorical clustering with Entropy criterion
  - BkPlot:
    - Examines the entropy difference between the clustering structures with varying K
    - Reports the Ks where the clustering structure changes dramatically
- Evaluation metrics
  - LISR: Large Item Size Ratio
  - AMI: Average pair-clusters Merging Index
ACE Algorithm
- Bottom-up process
  - Initially, each record is a cluster.
  - Iteratively, find the most similar pair of clusters $C_p$ and $C_q$, and then merge them.
- Incremental entropy:
$$I_m(C_p, C_q) = (n_p + n_q)\,\hat{H}(C_p \cup C_q) - \big(n_p \hat{H}(C_p) + n_q \hat{H}(C_q)\big)$$
- The most similar pair of clusters is the pair whose $I_m(C_p, C_q)$ is minimum among all possible pairs.
- $I_m^{(K)}$ denotes the $I_m$ value in forming the K-cluster partition from the (K+1)-cluster partition.
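The incremental entropy can be sketched in a few lines. The slides do not define $\hat{H}$, so this sketch assumes it is the sum of the per-column empirical Shannon entropies (base 2) of the records in a cluster; the function names `cluster_entropy` and `incremental_entropy` are hypothetical.

```python
from collections import Counter
from math import log2


def cluster_entropy(records):
    """Assumed H-hat: sum over columns of the empirical entropy of that column."""
    n = len(records)
    total = 0.0
    for column in zip(*records):          # iterate column by column
        for count in Counter(column).values():
            p = count / n
            total -= p * log2(p)
    return total


def incremental_entropy(cp, cq):
    """I_m(C_p, C_q) = (n_p + n_q) H(C_p u C_q) - (n_p H(C_p) + n_q H(C_q))."""
    merged = cp + cq
    return (len(merged) * cluster_entropy(merged)
            - (len(cp) * cluster_entropy(cp) + len(cq) * cluster_entropy(cq)))


# Merging identical records adds no impurity; merging different ones does.
same = incremental_entropy([("a", "x")], [("a", "x")])
diff = incremental_entropy([("a",)], [("b",)])
print(same, diff)
```

An agglomerative loop would evaluate this quantity for every pair of current clusters and merge the pair with the smallest value.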
BkPlot
- Increasing rate of entropy:
$$I(K) = \frac{1}{Nd}\, I_m^{(K)}$$
  where N is the total number of records and d is the number of columns.
- Small increasing rate
  - Merging does not introduce any impurity into the clusters.
  - The clustering structure is not significantly changed.
- Large increasing rate
  - Merging introduces considerable impurity into the partitions.
  - The clustering structure can be changed significantly.
BkPlot (contd.)
- Relative changes
  - Use relative changes to determine whether a globally significant clustering structure emerges.
  - $I(K) \approx I(K+1)$, but $I(K-1) > I(K)$
BkPlot (contd.)
- Entropy Characteristic Graph (ECG)
- Second-order differential of ECG: $\delta^2 I(K)$
$$\delta I(K) = I(K) - I(K+1)$$
$$\delta^2 I(K) = \delta I(K-1) - \delta I(K)$$
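As a rough illustration of how the second-order differential singles out candidate Ks, the sketch below computes $\delta^2 I(K)$ from a table of $I(K)$ values. The function name `bkplot` and the numbers in `i_rate` are made up for the example; they are not results from the paper.

```python
def bkplot(i_rate):
    """i_rate maps K -> I(K). Returns {K: delta^2 I(K)} wherever it is defined."""
    def delta(k):
        # First-order difference: delta I(K) = I(K) - I(K+1)
        return i_rate[k] - i_rate[k + 1]

    # delta^2 I(K) = delta I(K-1) - delta I(K) needs I(K-1), I(K) and I(K+1).
    return {k: delta(k - 1) - delta(k)
            for k in sorted(i_rate)
            if k - 1 in i_rate and k + 1 in i_rate}


# Made-up increasing rates: flat for K >= 3, then rising sharply below K = 3.
i_rate = {1: 0.9, 2: 0.5, 3: 0.1, 4: 0.1, 5: 0.1}
b = bkplot(i_rate)
best_k = max(b, key=b.get)
print(b, best_k)
```

In this toy series the peak falls at the K where $I(K) \approx I(K+1)$ but $I(K-1) > I(K)$, matching the pattern described on the previous slide.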
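WCD Clustering Algorithm
- Notations
  - D: transactional dataset
  - N: size of the dataset
  - $I = \{I_1, I_2, \dots, I_m\}$: a set of items
  - $t_j = \{I_{j1}, I_{j2}, \dots, I_{jl}\}$: a transaction
- A transaction clustering result $C^K = \{C_1, C_2, \dots, C_K\}$ is a partition of D, where
$$C_1 \cup \dots \cup C_K = D, \quad C_i \neq \emptyset, \quad C_i \cap C_j = \emptyset$$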
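Intra-cluster Similarity Measure
- Coverage Density (CD)
  - Given a cluster $C_k$:
    - $M_k$: number of distinct items
    - $I_k = \{I_{k1}, I_{k2}, \dots, I_{kM_k}\}$: item set of $C_k$
    - $N_k$: number of transactions in $C_k$
    - $S_k$: sum of occurrences of all items in $C_k$
$$CD(C_k) = \frac{\text{filled cells}}{\text{rectangle area}} = \frac{S_k}{N_k \times M_k} = \frac{\sum_{j=1}^{M_k} \mathrm{occur}(I_{kj})}{N_k \times M_k}$$
- The higher the CD, the more compact the cluster.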
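The CD formula translates directly into code. This is a minimal sketch that assumes a cluster is a non-empty list of Python sets, one set of items per transaction; the function name `coverage_density` and the toy cluster are illustrative only.

```python
from collections import Counter


def coverage_density(cluster):
    """CD(C_k) = S_k / (N_k * M_k): filled cells over rectangle area."""
    occur = Counter(item for t in cluster for item in t)
    n_k = len(cluster)            # number of transactions
    m_k = len(occur)              # number of distinct items
    s_k = sum(occur.values())     # total item occurrences (filled cells)
    return s_k / (n_k * m_k)


# Toy cluster: 3 transactions over 3 items with 5 filled cells.
cd = coverage_density([{"a", "b", "c"}, {"a"}, {"b"}])
print(cd)
```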
Intra-cluster Similarity Measure (contd.)
- Drawback of CD
  - It is insufficient to measure the density of frequent itemsets.
  - Each item has an equal contribution in a cluster: $W_j = \frac{1}{M_k}$
  - Two clusters may have the same CD but different filled-cell distributions.
- Figure: two clusters over items a, b, c, each with $CD = \frac{5}{9}$
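Intra-cluster Similarity Measure (contd.)
- Weighted Coverage Density (WCD)
  - Focuses on high-frequency items
  - Define $W_j$ as
$$W_j = \frac{\mathrm{occur}(I_{kj})}{S_k}, \quad \text{s.t. } \sum_{j=1}^{M_k} W_j = 1$$
$$WCD(C_k) = \frac{1}{N_k} \sum_{j=1}^{M_k} \mathrm{occur}(I_{kj}) \times W_j = \frac{1}{N_k} \sum_{j=1}^{M_k} \frac{\mathrm{occur}(I_{kj}) \times \mathrm{occur}(I_{kj})}{S_k} = \frac{\sum_{j=1}^{M_k} \mathrm{occur}(I_{kj})^2}{S_k \times N_k}$$
- Figure: example over items a, b, c comparing CD and WCD; values shown: 1/3, 3/6, 2/6, 1/6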
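To see how WCD separates clusters that CD cannot, here is a small sketch under the same assumption that a cluster is a non-empty list of item sets. The two toy clusters are invented for illustration: both have 5 filled cells in a 3 x 3 rectangle, but one concentrates its occurrences on a single item.

```python
from collections import Counter


def coverage_density(cluster):
    """CD(C_k) = S_k / (N_k * M_k)."""
    occur = Counter(item for t in cluster for item in t)
    return sum(occur.values()) / (len(cluster) * len(occur))


def weighted_coverage_density(cluster):
    """WCD(C_k) = sum_j occur(I_kj)^2 / (S_k * N_k)."""
    occur = Counter(item for t in cluster for item in t)
    s_k = sum(occur.values())
    return sum(c * c for c in occur.values()) / (s_k * len(cluster))


skewed = [{"a", "b", "c"}, {"a"}, {"a"}]      # occurrences: a=3, b=1, c=1
spread = [{"a", "b"}, {"b", "c"}, {"c"}]      # occurrences: a=1, b=2, c=2

print(coverage_density(skewed), coverage_density(spread))
print(weighted_coverage_density(skewed), weighted_coverage_density(spread))
```

Both clusters share the same CD, while the cluster dominated by one frequent item gets the higher WCD.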
Clustering Criterion
- Expected Weighted Coverage Density (EWCD)
$$EWCD(C^K) = \sum_{k=1}^{K} \frac{N_k}{N}\, WCD(C_k) = \frac{1}{N} \sum_{k=1}^{K} \frac{\sum_{j=1}^{M_k} \mathrm{occur}(I_{kj})^2}{S_k}$$
- The clustering algorithm tries to maximize the EWCD.
- When every individual transaction is considered as a cluster, the EWCD reaches its maximum value of 1.
- Use the BKPlot method to generate a set of candidate "best Ks".
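The EWCD criterion can be checked numerically with a short sketch. It assumes a clustering is a list of non-empty clusters, each a list of item sets; the function name `ewcd` and the toy data are illustrative.

```python
from collections import Counter


def ewcd(clusters):
    """EWCD = (1/N) * sum_k ( sum_j occur(I_kj)^2 / S_k )."""
    n = sum(len(cluster) for cluster in clusters)
    total = 0.0
    for cluster in clusters:
        occur = Counter(item for t in cluster for item in t)
        total += sum(c * c for c in occur.values()) / sum(occur.values())
    return total / n


# Every transaction in its own cluster gives the trivial maximum of 1.
singletons = ewcd([[{"a", "b"}], [{"c"}]])
# Putting unrelated transactions together lowers the score.
merged = ewcd([[{"a", "b"}, {"c"}]])
print(singletons, merged)
```

Because the singleton partition always scores 1, the criterion is only meaningful for a fixed K, which is why candidate Ks are chosen separately with BKPlot.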
WCD Clustering Algorithm

Input: Dataset D, number of clusters K, initial K seeds
Output: K clusters

/* Phase 1: Initialization */
K seeds form the initial K clusters;
while not end of D do
    read one transaction t from D;
    add t into C_i that maximizes EWCD;
    write <t, i> back to D;

/* Phase 2: Iteration */
while moveMark = true do
    moveMark = false;
    randomly generate the access sequence R;
    while has not checked all transactions do
        read <t, i>;
        if moving t to cluster C_j increases EWCD and i ≠ j then
            moveMark = true;
            write <t, j> back to D;
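The two phases can be sketched in memory as follows. This is not the authors' implementation: it assumes transactions are sets of distinct items, keeps labels in a list instead of writing `<t, i>` back to D, treats the seeds as separate seed transactions that stay in their clusters, and adds a `max_passes` safety cap. Since N is fixed, maximizing EWCD is equivalent to maximizing the sum over clusters of `sum(occur^2) / S_k`, so only that per-cluster ratio is tracked. The names `Cluster` and `wcd_cluster` are hypothetical.

```python
import random
from collections import Counter


class Cluster:
    """Running statistics needed for sum_j occur^2 / S_k."""

    def __init__(self):
        self.occur = Counter()
        self.size = 0   # S_k: total item occurrences
        self.sq = 0     # sum of squared item occurrences

    def ratio(self):
        return self.sq / self.size if self.size else 0.0

    def ratio_after(self, t, sign):
        """Ratio if transaction t were added (sign=+1) or removed (sign=-1)."""
        sq, size = self.sq, self.size + sign * len(t)
        for item in t:
            c = self.occur[item]
            sq += (c + sign) ** 2 - c ** 2
        return sq / size if size else 0.0

    def update(self, t, sign):
        for item in t:
            c = self.occur[item]
            self.sq += (c + sign) ** 2 - c ** 2
            self.occur[item] = c + sign
        self.size += sign * len(t)


def wcd_cluster(transactions, seeds, rng=None, max_passes=100):
    rng = rng or random.Random(0)
    clusters = [Cluster() for _ in seeds]
    for cluster, seed in zip(clusters, seeds):
        cluster.update(seed, +1)

    # Phase 1: put each transaction where it raises the criterion most.
    labels = []
    for t in transactions:
        gains = [c.ratio_after(t, +1) - c.ratio() for c in clusters]
        i = max(range(len(clusters)), key=gains.__getitem__)
        clusters[i].update(t, +1)
        labels.append(i)

    # Phase 2: keep moving transactions while some move increases the criterion.
    moved, passes = True, 0
    while moved and passes < max_passes:
        moved, passes = False, passes + 1
        order = list(range(len(transactions)))
        rng.shuffle(order)                      # random access sequence R
        for idx in order:
            t, i = transactions[idx], labels[idx]
            loss = clusters[i].ratio_after(t, -1) - clusters[i].ratio()
            best_j, best_delta = i, 1e-12
            for j, c in enumerate(clusters):
                if j != i:
                    delta = loss + c.ratio_after(t, +1) - c.ratio()
                    if delta > best_delta:
                        best_j, best_delta = j, delta
            if best_j != i:
                clusters[i].update(t, -1)
                clusters[best_j].update(t, +1)
                labels[idx] = best_j
                moved = True
    return labels


data = [{"a", "b"}, {"a", "b", "c"}, {"x", "y"}, {"x", "y", "z"}]
labels = wcd_cluster(data, seeds=[{"a", "b"}, {"x", "y"}])
print(labels)
```

Tracking `sq` and `size` incrementally means each candidate move costs time proportional to the transaction length rather than the cluster size.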
Cluster Validity Evaluation
- LISR (Large Item Size Ratio)
  - Measures the preservation of frequent itemsets:
$$LISR = \sum_{k=1}^{K} \frac{N_k}{N} \times \frac{LS_k}{S_k}$$
    where $LS_k$ is the number of large items in $C_k$.
  - A higher LISR means higher co-occurrence of items, and therefore a higher possibility of finding more frequent itemsets at the user-specified minimum support.
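A possible reading of LISR is sketched below. The slide only says that $LS_k$ counts the large items of a cluster; this sketch assumes it means the total occurrences of items whose support inside the cluster is at least a minimum support threshold, so that $LS_k / S_k$ stays between 0 and 1. The function name `lisr` and the parameter `min_support` are hypothetical.

```python
from collections import Counter


def lisr(clusters, min_support):
    """LISR = sum_k (N_k / N) * (LS_k / S_k), with LS_k as assumed above."""
    n = sum(len(cluster) for cluster in clusters)
    score = 0.0
    for cluster in clusters:
        occur = Counter(item for t in cluster for item in t)
        n_k = len(cluster)
        s_k = sum(occur.values())
        # Assumption: an item is "large" if it appears in >= min_support * N_k transactions.
        ls_k = sum(c for c in occur.values() if c >= min_support * n_k)
        score += (n_k / n) * (ls_k / s_k)
    return score


# One toy cluster: a appears 3 times, b twice, c once; c is not large at 50% support.
value = lisr([[{"a", "b"}, {"a", "b"}, {"a", "c"}]], min_support=0.5)
print(value)
```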
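Cluster Validity Evaluation (contd.)
- Inter-cluster dissimilarity between $C_i$ and $C_j$:
$$d(C_i, C_j) = \frac{N_i}{N_i + N_j}\, CD(C_i) + \frac{N_j}{N_i + N_j}\, CD(C_j) - CD(C_i \cup C_j)$$
- Simplify:
$$d(C_i, C_j) = \frac{N_i}{N_i + N_j} \cdot \frac{S_i}{N_i M_i} + \frac{N_j}{N_i + N_j} \cdot \frac{S_j}{N_j M_j} - \frac{S_i + S_j}{(N_i + N_j)\, M_{ij}}$$
$$= \frac{1}{N_i + N_j} \left( \frac{S_i}{M_i} + \frac{S_j}{M_j} - \frac{S_i + S_j}{M_{ij}} \right)$$
$$= \frac{1}{N_i + N_j} \left( S_i \left( \frac{1}{M_i} - \frac{1}{M_{ij}} \right) + S_j \left( \frac{1}{M_j} - \frac{1}{M_{ij}} \right) \right)$$
  where $M_{ij}$ is the number of distinct items after merging the two clusters, thus $M_{ij} \ge \max\{M_i, M_j\}$.
- Because $\frac{1}{M_i} \ge \frac{1}{M_{ij}}$ and $\frac{1}{M_j} \ge \frac{1}{M_{ij}}$, $d(C_i, C_j)$ is a real number between 0 and 1.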
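Cluster Validity Evaluation (contd.)
- Example
  - If $M_i = M_j = M_{ij}$, then $d(C_i, C_j) = 0$ (figure: $C_i$ and $C_j$ both over items a, b, c).
  - If $M_i = M_j = 3$ and $M_{ij} = 5$ (figure: $C_i$ over items a, b, c; $C_j$ over items c, d, e):
$$d(C_i, C_j) = \frac{1}{6} \left( 5 \left( \frac{1}{3} - \frac{1}{5} \right) + 4 \left( \frac{1}{3} - \frac{1}{5} \right) \right) = \frac{1}{5}$$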
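The dissimilarity can be evaluated directly from its definition in terms of CD. The sketch below assumes clusters are non-empty lists of item sets; the transactions in `ci` and `cj` are invented so that they match the counts used in the example ($S_i = 5$, $S_j = 4$, $N_i + N_j = 6$, $M_i = M_j = 3$, $M_{ij} = 5$), since the original figure is not available.

```python
from collections import Counter


def coverage_density(cluster):
    """CD(C_k) = S_k / (N_k * M_k)."""
    occur = Counter(item for t in cluster for item in t)
    return sum(occur.values()) / (len(cluster) * len(occur))


def dissimilarity(ci, cj):
    """d = N_i/(N_i+N_j) CD(C_i) + N_j/(N_i+N_j) CD(C_j) - CD(C_i u C_j)."""
    ni, nj = len(ci), len(cj)
    weighted = (ni * coverage_density(ci) + nj * coverage_density(cj)) / (ni + nj)
    return weighted - coverage_density(ci + cj)


# Hypothetical clusters: items {a,b,c} with 5 filled cells, items {c,d,e} with 4.
ci = [{"a", "b", "c"}, {"a"}, {"b"}]
cj = [{"c", "d"}, {"d"}, {"e"}]
d_example = dissimilarity(ci, cj)

# Same item set on both sides (M_i = M_j = M_ij) gives zero dissimilarity.
d_same = dissimilarity([{"a", "b", "c"}, {"a", "b"}], [{"a", "b", "c"}])
print(d_example, d_same)
```

Under these assumptions the definition and the simplified formula agree: the first pair evaluates to 1/5 and the second to 0.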
Cluster Validity Evaluation (contd.)
- AMI (Average pair-clusters Merging Index)
  - Evaluates the overall inter-cluster dissimilarity of a clustering result having K clusters:
$$AMI = \frac{1}{K} \sum_{i=1}^{K} D_i, \quad D_i = \min\{\, d(C_i, C_j) : j = 1, \dots, K,\ j \neq i \,\}$$
  - The higher the AMI, the better the clustering quality.
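AMI then averages each cluster's distance to its nearest neighbour. A minimal sketch, assuming at least two clusters, each a non-empty list of item sets, and repeating the CD-based dissimilarity so the block is self-contained; the function names and toy clusters are illustrative.

```python
from collections import Counter


def coverage_density(cluster):
    occur = Counter(item for t in cluster for item in t)
    return sum(occur.values()) / (len(cluster) * len(occur))


def dissimilarity(ci, cj):
    ni, nj = len(ci), len(cj)
    weighted = (ni * coverage_density(ci) + nj * coverage_density(cj)) / (ni + nj)
    return weighted - coverage_density(ci + cj)


def ami(clusters):
    """AMI = (1/K) * sum_i min_{j != i} d(C_i, C_j); requires K >= 2."""
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += min(dissimilarity(clusters[i], clusters[j])
                     for j in range(k) if j != i)
    return total / k


clusters = [
    [{"a", "b", "c"}, {"a"}, {"b"}],
    [{"c", "d"}, {"d"}, {"e"}],
]
score = ami(clusters)
print(score)
```

With only two clusters, each $D_i$ is just the single pairwise dissimilarity, so the AMI equals that value.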
Experiments
- Datasets
  - Tc30a6r1000: 1000 records, 30 columns, 6 possible attribute values
  - Zoo: 101 records, 18 attributes
  - Mushroom: 8124 instances, 22 attributes
  - Mushroom100k: the mushroom data sampled with duplicates, 100,000 instances
  - TxI4Dx: generated with the IBM Data Generator
Experimental Results
- Tc30a6r1000
  - The repulsion parameter r of CLOPE controls the number of clusters.
  - Figure panels: 5 clusters, 9 clusters
Experimental Results (contd.)
- Zoo: K=7 is the best
  - Figure panels: 2 clusters, 4 clusters, 7 clusters
Experimental Results (contd.)
- Mushroom: K=19 is the best
Experimental Results (contd.)
- Performance evaluation on mushroom100k
  - Figure panels: r=0.5~4.0, r=2.0
Experimental Results (contd.)
- Performance evaluation on TxI4Dx
  - Figure panels: T10I4Dx, TxI4D100k