Clustering Categorical Data
Steven X. Wang
Department of Mathematics and Statistics
York University
April 11, 2005
-
Presentation Outline
Brief literature review
Some new algorithms for categorical data
Challenges in clustering categorical data
Future work and discussions
-
Algorithms for Continuous Data
There are many clustering algorithms proposed in the literature:
1. K-means
2. EM algorithm
3. Hierarchical clustering
4. CLARANS
5. OPTICS
-
Algorithms for Categorical Data
There are only a few algorithms for clustering categorical data:
K-modes (a modification of K-means)
AutoClass (based on the EM algorithm)
ROCK and CLOPE
-
Categorical Data Structure
Categorical data has a different structure than continuous data.
The distance functions used for continuous data might not be applicable to categorical data.
Algorithms for clustering continuous data cannot be applied directly to categorical data.
-
K-means for Clustering Continuous Data
K-means is one of the oldest and most widely used algorithms for clustering continuous data.
1) Choose the number of clusters and initialize the cluster centers.
2) Perform iterations until a selected convergence criterion is reached (a sketch follows below).
3) Computational complexity: O(n).
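For illustration only (not code from the talk), here is a minimal plain-Python sketch of steps 1) and 2); the function and parameter names are my own:

    import random

    def kmeans(points, k, max_iter=100, tol=1e-6):
        """Minimal K-means sketch for continuous data (points: lists of floats)."""
        centers = random.sample(points, k)  # step 1: initialize the cluster centers
        for _ in range(max_iter):           # step 2: iterate until convergence
            # Assign each point to its nearest center (squared Euclidean distance).
            clusters = [[] for _ in range(k)]
            for x in points:
                j = min(range(k),
                        key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centers[c])))
                clusters[j].append(x)
            # Recompute each center as the mean of its cluster.
            new_centers = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centers[j]
                           for j, cl in enumerate(clusters)]
            # Stop when the centers barely move (the convergence criterion).
            shift = sum((a - b) ** 2
                        for c0, c1 in zip(centers, new_centers)
                        for a, b in zip(c0, c1))
            centers = new_centers
            if shift < tol:
                break
        return centers, clusters

Each iteration is a single pass over the n observations, which is where the O(n) per-iteration cost comes from.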
-
Categorical Sample Space
Assume that the data set is stored in an n × p matrix, where n is the number of observations and p is the number of categorical variables.
The sample space consists of all possible combinations generated by the p variables.
The sample space is discrete and has no natural origin.
-
K-modes for Categorical Data
K-modes has exactly the same structure as K-means, i.e., choose k cluster modes and iterate until convergence (the sketch below shows the two ingredients that change).
K-modes has a fundamental flaw: the partition is sensitive to the input order, i.e., the clustering results can differ for the same data set if the input order is different.
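To make the "same structure as K-means" point concrete, here is a sketch (mine, not from the talk) of the two ingredients K-modes swaps in, the simple-matching dissimilarity and the per-attribute mode:

    from collections import Counter

    def mismatch(x, y):
        """Simple-matching dissimilarity: number of attributes where x and y differ."""
        return sum(a != b for a, b in zip(x, y))

    def cluster_mode(records):
        """Mode of a cluster: the most frequent category in each attribute column."""
        return [Counter(col).most_common(1)[0][0] for col in zip(*records)]

Using mismatch in place of squared Euclidean distance and cluster_mode in place of the mean turns the K-means sketch above into K-modes. Note that most_common breaks ties by order of first encounter, which is one concrete way the input order can leak into the final partition.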
-
AutoClass Algorithm
This is an algorithm applicable to both continuous and categorical data.
It is a model-based algorithm that does not require the number of clusters as input.
Computational complexity: O(n log n).
The EM algorithm converges slowly and is sensitive to the initial values.
-
Hamming Distance and CD vector
Hamming distance measures the number of attributes on which two categorical observations differ.
Hamming distance has been used for clustering categorical data in algorithms similar to K-modes.
We construct the Categorical Distance (CD) vector to project the sample space into a 1-dimensional space (sketched below).
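The precise construction is given in Zhang, Wang and Song (2005); a minimal reading consistent with these slides is to count, for a chosen origin, how many observations fall at each Hamming distance from it. The sketch below is mine, under that assumption:

    def hamming(x, y):
        """Hamming distance: number of attributes on which x and y differ."""
        return sum(a != b for a, b in zip(x, y))

    def cd_vector(data, origin):
        """Count the observations at each Hamming distance 0..p from the origin."""
        counts = [0] * (len(origin) + 1)
        for row in data:
            counts[hamming(row, origin)] += 1
        return counts

This projects the p-dimensional discrete sample space onto a single coordinate, distance from the origin; since the space has no natural origin, different origins give different CD vectors.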
-
Example of a CD vector
[Figure: an example CD vector.]
-
More on CD vector
The dense region of the CD vector is not necessarily a cluster!
We can construct many CD vectors on one data set by choosing different origins.
-
UCD: Expected CD vector under the null.
[Figure: plot of the UCD vector.]
-
[Figure: the CD vector and the UCD vector shown side by side for comparison.]
-
CD Algorithm
1. Find a cluster center.
2. Construct the CD vector given the current center.
3. Perform a modified Chi-square test.
4. If we reject the null, determine the radius of the current cluster and extract the cluster.
5. Repeat until we do not reject the null.
(A schematic of this loop follows below.)
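As a schematic of the control flow only (my sketch, reusing hamming and cd_vector from earlier): the three helper functions are hypothetical placeholders for the center-finding, modified Chi-square, and radius procedures defined in the paper, so they are passed in as arguments.

    def cd_clustering(data, find_center, reject_null, cluster_radius):
        """Control flow of the CD algorithm; the three helpers are placeholders."""
        clusters, remaining = [], list(data)
        while remaining:
            center = find_center(remaining)      # step 1: a cluster center
            cd = cd_vector(remaining, center)    # step 2: CD vector for this center
            if not reject_null(cd):              # step 3: modified Chi-square test
                break                            # step 5: stop when we fail to reject
            radius = cluster_radius(cd)          # step 4: radius of the cluster
            inside = [r for r in remaining if hamming(r, center) <= radius]
            remaining = [r for r in remaining if hamming(r, center) > radius]
            clusters.append(inside)              # step 4: extract the cluster
        return clusters, remaining               # leftovers remain unclustered

Each pass extracts one cluster outright rather than iterating to convergence, which is the structural reason for the complexity bound given on a later slide.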
-
Numerical Comparison with K-modes and AutoClass

                 CD    AutoClass      K-modes
No. of Clusters   4        4      [3]   [4]   [5]
--------------------------------------------------
Classif. Rates  100%     100%     75%   84%   82%
Variations        0%       0%      6%   15%   10%
Inform. Gain    100%     100%     67%   84%   93%
Variations        0%       0%     10%   15%   11%
--------------------------------------------------
Soybean Data: n = 47 and p = 35. No. of clusters = 4.
-
Numerical Comparison with K-modes and AutoClass

                 CD    AutoClass      K-modes
No. of Clusters   7        3      [6]   [7]   [8]
--------------------------------------------------
Classif. Rates   95%      73%     74%   72%   71%
Variations        0%       0%      6%   15%   10%
Inform. Gain     92%      60%     75%   79%   81%
Variations        0%       0%      7%    6%    6%
--------------------------------------------------
Zoo Data: n = 101 and p = 16. No. of clusters = 7.
-
Computational Complexity
The upper bound of the computational complexity of our algorithm is O(kpn).
It is much less computationally intensive than K-modes and AutoClass since it does not demand convergence.
-
CD Algorithm
It is based on Hamming distance.
It does not require input parameters.
It has no convergence criterion.
Ref: Zhang, Wang and Song (2005), JASA, to appear.
-
Difficulties in Clustering Categorical Data
Distance function
Similarity measure to organize clusters
Scalability or computational complexity
-
Challenge 1: Distance Function
Hamming distance is a natural and reasonable choice if the categorical scale has no natural order (nominal data).
If we apply a method for nominal data, such as the CD algorithm, to ordinal data, there may be a serious loss of information, since the order is ignored.
-
Challenge 2: Organization of Clusters
Organization of clusters is crucial in clustering large data sets.
Similarity measures are needed to organize clusters in hierarchical clustering.
Different similarity measures will give different results.
-
Challenge 3: Scalability
In practice, an approximate answer is so much better than no answer at all.
Complexity: O(n).
Scalability: O(mn).
How many variables are we dealing with?
-
Challenge 1:
What to do about the ordering? Proposing a reasonable distance function for ordinal data might require a careful examination of the dependence structure. We need to look into different measures of association for categorical data. (A naive rank-based possibility is sketched below.)
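For illustration only, one naive way to respect the ordering is to map each ordinal category to its rank and compare ranks. This is my sketch, not a proposal from the talk, and it deliberately ignores the dependence structure mentioned above:

    def ordinal_distance(x, y, rank):
        """Rank-difference distance; rank[j] maps each category of
        attribute j to an integer rank, each term normalized to [0, 1]."""
        return sum(abs(rank[j][a] - rank[j][b]) / (len(rank[j]) - 1)
                   for j, (a, b) in enumerate(zip(x, y)))

    # 'low' and 'high' are now farther apart than 'low' and 'medium',
    # whereas Hamming distance treats any two distinct values the same.
    rank = [{"low": 0, "medium": 1, "high": 2}]
    print(ordinal_distance(["low"], ["high"], rank))    # 1.0
    print(ordinal_distance(["low"], ["medium"], rank))  # 0.5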
-
Challenge 2:
A naive measure of similarity would be the distance between two clusters. Entropy might be a good one to try even though it is not a distance function. (One way to make this concrete is sketched below.)
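One way to make the entropy idea concrete (my construction, not the speaker's) is to score a candidate merge of two clusters by the increase in attribute-wise Shannon entropy when they are pooled, with a smaller increase meaning more similar clusters:

    from collections import Counter
    from math import log

    def attr_entropy(records):
        """Sum of the Shannon entropies of a cluster's attribute columns."""
        n = len(records)
        total = 0.0
        for col in zip(*records):
            for count in Counter(col).values():
                total -= (count / n) * log(count / n)
        return total

    def merge_cost(c1, c2):
        """Entropy increase from pooling two clusters (lower = more similar)."""
        merged = c1 + c2
        return (len(merged) * attr_entropy(merged)
                - len(c1) * attr_entropy(c1)
                - len(c2) * attr_entropy(c2))

As the slide says, this is not a distance function (the triangle inequality can fail), but it still orders candidate merges in a hierarchical scheme.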
-
Challenge 3:
There are many hierarchical clustering algorithms available. Any clustering algorithm could be integrated into them if the distance function and similarity measure can be defined appropriately.
-
Beyond Categorical Data
The ultimate goal is to cluster any data set with a complex data structure.
Mixed data types would be next on the list. The challenge there is again the distance function, i.e., the dependence structure between the continuous part and the categorical portion. (A standard starting point is sketched below.)
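One standard starting point for mixed data is a Gower-style dissimilarity (Gower, 1971): range-normalized absolute differences for numeric attributes and simple matching for categorical ones. The sketch is my illustration, and it treats attributes independently, sidestepping exactly the dependence structure raised above:

    def gower_distance(x, y, numeric, ranges):
        """Gower-style dissimilarity for mixed records, averaged over attributes.
        numeric: indices of the numeric attributes;
        ranges[j]: max - min of numeric attribute j over the data set."""
        total = 0.0
        for j, (a, b) in enumerate(zip(x, y)):
            if j in numeric:
                total += abs(a - b) / ranges[j]  # normalized numeric difference
            else:
                total += a != b                  # simple matching for categories
        return total / len(x)

    # One numeric attribute (range 100) and one categorical attribute:
    print(gower_distance([10.0, "red"], [60.0, "blue"], {0}, {0: 100.0}))  # 0.75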
-
More Challenges
Measure of uncertainty
Hard clustering vs. soft clustering
Parallel computing
-
Thank you!