Cluster Analysis - Texas Tech University (redwood.cs.ttu.edu/~rahewett/teach/ai16/11-Cluster.pdf)
CS 5331, by Rattikorn Hewett, Texas Tech University
Cluster Analysis
Han and Kamber, ch 8
2
Outline
n Overview n Input Data types n Clustering methods
¨ Partitioning Methods ¨ Hierarchical Methods ¨ Density-Based Methods ¨ Grid-Based Methods ¨ Model-Based Methods
n Outlier Analysis n Summary
2
3
Motivation
n Given a data set in which we do not know what to look for
¨ The first step in identifying useful patterns is to group data by their similarity
¨ Once data are grouped (clustered/categorized), the properties of each group can be analyzed
n Cluster Analysis
¨ Is a process of grouping data into meaningful classes/clusters (e.g., groups with similar measures)
¨ Is unsupervised learning - learning by observation vs. learning by examples
¨ Can be used
n as a stand-alone tool to get insight into the data distribution
n as a preprocessing step for other algorithms
4
Example Applications
n Bioinformatics: Categorize genes with similar functions
n Marketing: Characterize customers based on purchasing patterns for targeted marketing programs
n Land use: Identify areas of similar land use from earth observation data
n Insurance: Identify groups of motor insurance policy holders with a high average claim cost
n City-planning: Identify groups of houses according to their house type, value, and geographical location
n Earthquake studies: Characterize earthquake epicenters along continent faults
n WWW: Document classification; cluster Web log data to discover groups of similar access patterns
3
5
Evaluation and Issues
n Basic criteria for evaluating a clustering method
¨ Quality of results (depending on the similarity measures applied)
n High-quality clusters → high intra-class and low inter-class similarity
¨ Ability to discover some or all of the hidden patterns
n Main issue in data mining - scalability
¨ Size of databases: need efficient clustering techniques for very large, high-dimensional databases
¨ Complexity of data types and shapes: need clustering techniques for mixed numerical & categorical data
6
Requirements
n Scalability
¨ Large databases
¨ High dimensionality
¨ Ability to deal with different types of attributes
n Performance
¨ Robust - able to deal with noise and outliers
¨ Insensitive to the order of input records
¨ Discovery of clusters with arbitrary shape
n Practical Aspects
¨ Minimal requirements for domain knowledge to determine input parameters
¨ Incorporation of user-specified constraints
¨ Interpretability and usability
4
7
Outline
n Overview n Input Data types n Clustering methods
¨ Partitioning Methods ¨ Hierarchical Methods ¨ Density-Based Methods ¨ Grid-Based Methods ¨ Model-Based Methods
n Outlier Analysis n Summary
8
Typical data structures
n Memory-based clustering algorithms typically operate on two data structures:
¨ Data matrix - n data instances with p attributes (two-mode matrix: rows and columns represent different entities)

    [ x_11  ...  x_1f  ...  x_1p ]
    [ ...   ...  ...   ...  ...  ]
    [ x_i1  ...  x_if  ...  x_ip ]
    [ ...   ...  ...   ...  ...  ]
    [ x_n1  ...  x_nf  ...  x_np ]

¨ Dissimilarity matrix - d(i,j) = difference/dissimilarity between data instances i and j (one-mode matrix: rows and columns represent the same entities)

    [ 0                          ]
    [ d(2,1)   0                 ]
    [ d(3,1)   d(3,2)   0        ]
    [ :        :        :        ]
    [ d(n,1)   d(n,2)   ...    0 ]
5
9
Input data types
n Interval-scaled variables: roughly, measurements on a continuous "linear" scale, e.g., weights, coordinates
n Binary variables
n Nominal, ordinal, and ratio-scaled variables
¨ Nominal: extends binary variables to more than two values, e.g., map-color = {blue, yellow, red, green}
¨ Ordinal: extends nominal variables with a meaningful order, e.g., medal = {gold, silver, bronze}
¨ Ratio-scaled: positive measurements on a non-linear scale, e.g., growth and decay - exponential scales
n Mixed - attributes have different data types
10
Prepare input data
From a given data set, we need to
1) standardize/normalize the data (if needed), and
2) compute the dissimilarity matrix
6
11
Interval-scaled variables (1)
n Standardize data - for each column (attribute) f:
¨ Calculate the mean absolute deviation

    s_f = (1/n) (|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|)

  where m_f = (1/n) (x_1f + x_2f + ... + x_nf)

¨ Calculate the standardized measurement (z-score)

    z_if = (x_if - m_f) / s_f

n Using the mean absolute deviation is more robust than using the standard deviation (deviations are not squared → the effect of outliers is reduced)
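To make the standardization step concrete, here is a minimal NumPy sketch (not from the slides) that computes the mean absolute deviation per attribute and the resulting z-scores; the sample matrix is illustrative and columns are assumed non-constant.

    import numpy as np

    def standardize(X):
        """Standardize each column f using the mean absolute deviation:
        z_if = (x_if - m_f) / s_f, where s_f = mean(|x_if - m_f|).
        Assumes no column is constant (s_f > 0)."""
        X = np.asarray(X, dtype=float)
        m = X.mean(axis=0)                      # column means m_f
        s = np.abs(X - m).mean(axis=0)          # mean absolute deviations s_f
        return (X - m) / s                      # z-scores

    # Example: three 2-dimensional data instances
    X = [[1.0, 200.0], [2.0, 300.0], [3.0, 1000.0]]
    print(standardize(X))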
12
Interval-scaled variables (2)
n Distances are normally used to measure the similarity or dissimilarity between two data objects
n Minkowski distance:

    d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q)

  where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and q is a positive integer

n If q = 1, d is the Manhattan distance
n If q = 2, d is the Euclidean distance
n Weighted Euclidean distance:

    d(i, j) = ( w_1 |x_i1 - x_j1|^2 + w_2 |x_i2 - x_j2|^2 + ... + w_p |x_ip - x_jp|^2 )^(1/2)
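A small sketch of the Minkowski distance as defined above, with optional weights giving the weighted Euclidean form when q = 2; the example points and function name are illustrative.

    import numpy as np

    def minkowski(x, y, q=2, w=None):
        """Minkowski distance between p-dimensional points x and y.
        q=1 gives Manhattan, q=2 gives Euclidean; optional weights w
        give the weighted form."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        w = np.ones_like(x) if w is None else np.asarray(w, float)
        return (w * np.abs(x - y) ** q).sum() ** (1.0 / q)

    print(minkowski([1, 2], [4, 6], q=1))   # Manhattan: 7.0
    print(minkowski([1, 2], [4, 6], q=2))   # Euclidean: 5.0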
7
13
Binary Variables (2)
Computing the dissimilarity matrix
n A contingency table for binary data:

                          Instance j
                           1      0     sum
        Instance i   1     a      b     a+b
                     0     c      d     c+d
                    sum   a+c    b+d     p

  Note: b and c count the 1-0 and 0-1 mismatches; d counts the 0-0 matches

n Simple matching coefficient: for symmetric binary variables, i.e., values 0 and 1 are equally valuable, e.g., gender (0 = male, 1 = female)

    d(i, j) = (b + c) / (a + b + c + d)

n Jaccard coefficient: for asymmetric binary variables, i.e., outcome 1 is more important than 0, e.g., 1 = HIV positive, 0 = HIV negative

    d(i, j) = (b + c) / (a + b + c)
14
Example
¨ Gender is a symmetric attribute
¨ The remaining attributes are asymmetric binary
¨ Let the values Y and P be set to 1, and the value N be set to 0

    Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
    Jack    M       Y      N      P       N       N       N
    Mary    F       Y      N      P       N       P       N
    Jim     M       Y      P      N       N       N       N

Using the Jaccard coefficient on the asymmetric attributes:

    d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
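The Jaccard computations in this example can be reproduced with a short sketch such as the following (the attribute encoding follows the slide; the helper name is illustrative).

    def jaccard_dissimilarity(x, y):
        """Jaccard dissimilarity d(i,j) = (b + c) / (a + b + c) for
        asymmetric binary vectors (0-0 matches, counted by d, are ignored)."""
        a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
        b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
        c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
        return (b + c) / (a + b + c)

    # Fever, Cough, Test-1..Test-4 encoded with Y/P -> 1 and N -> 0
    jack, mary, jim = [1, 0, 1, 0, 0, 0], [1, 0, 1, 0, 1, 0], [1, 1, 0, 0, 0, 0]
    print(round(jaccard_dissimilarity(jack, mary), 2))  # 0.33
    print(round(jaccard_dissimilarity(jack, jim),  2))  # 0.67
    print(round(jaccard_dissimilarity(jim,  mary), 2))  # 0.75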
8
15
Nominal Variables (2)
n A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
Computing the dissimilarity matrix:
n Method 1: Simple matching
¨ m = number of matches (number of variables on which instances i and j have the same state)
¨ p = total number of variables

    d(i, j) = (p - m) / p

n Method 2: use a large number of binary variables
¨ Create a new binary variable for each of the nominal states
16
Ordinal Variables
n An ordinal variable can be discrete or continuous
n Order is important, e.g., rank
n Can be treated like interval-scaled variables:
¨ Let x_if be the value of variable f for the i-th instance
¨ Replace x_if by its rank r_if ∈ {1, ..., M_f}
¨ Normalize each variable so that its range is [0, 1] by replacing r_if by

    z_if = (r_if - 1) / (M_f - 1)

¨ Compute the dissimilarity using methods for interval-scaled variables
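A minimal sketch of the rank-based normalization for ordinal variables; the medal example follows the earlier slide, and the helper name is an assumption.

    def normalize_ordinal(values, ordered_states):
        """Map ordinal values to [0, 1]: rank r_if in {1..M_f} becomes
        z_if = (r_if - 1) / (M_f - 1)."""
        M = len(ordered_states)
        rank = {state: i + 1 for i, state in enumerate(ordered_states)}
        return [(rank[v] - 1) / (M - 1) for v in values]

    print(normalize_ordinal(["bronze", "gold", "silver"],
                            ["bronze", "silver", "gold"]))  # [0.0, 1.0, 0.5]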
9
17
Ratio-Scaled Variables
n Ratio-scaled variable: a positive measurement on a non-linear scale, approximately exponential, e.g., Ae^(Bt) or Ae^(-Bt)
n Methods:
¨ Treat them like interval-scaled variables - not a good choice! (why? - the scale can be distorted)
¨ Apply a logarithmic transformation, y_if = log(x_if), then use techniques for interval-scaled variables
¨ Treat x_if as continuous ordinal data and treat the ranks as interval-scaled values - recommended
18
Mixed Variable Types
n A database may contain all six types of variables
¨ symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio
n One may use a weighted formula to combine their effects:

    d(i, j) = Σ_{f=1..p} δ_ij^(f) d_ij^(f)  /  Σ_{f=1..p} δ_ij^(f)

¨ δ_ij^(f) = 0 if either (1) x_if or x_jf is missing, or (2) f is asymmetric binary and x_if = x_jf = 0; otherwise δ_ij^(f) = 1
¨ d_ij^(f) is computed according to the data type of f
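The weighted combination might be sketched as follows for a few attribute types (interval values are assumed already scaled to [0, 1]; the type labels, helper name, and sample data are illustrative, not from the slides).

    def mixed_dissimilarity(x, y, types):
        """Combined dissimilarity d(i,j) = sum_f delta_f * d_f / sum_f delta_f
        over attributes of mixed types. `types` holds one of 'interval',
        'nominal', 'asym_binary' per attribute."""
        num, den = 0.0, 0.0
        for xf, yf, t in zip(x, y, types):
            if xf is None or yf is None:                 # missing value: delta = 0
                continue
            if t == "asym_binary" and xf == 0 and yf == 0:
                continue                                 # 0-0 match ignored: delta = 0
            if t == "interval":
                d = abs(xf - yf)                         # assumes range [0, 1]
            else:                                        # nominal / binary attribute
                d = 0.0 if xf == yf else 1.0
            num += d
            den += 1.0
        return num / den if den else 0.0

    # Two instances: (normalized age, color, asymmetric binary test result)
    print(mixed_dissimilarity((0.2, "red", 1), (0.5, "red", 0),
                              ("interval", "nominal", "asym_binary")))  # ~0.43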
10
19
Outline
n Overview n Input Data types n Clustering methods
¨ Partitioning Methods ¨ Hierarchical Methods ¨ Density-Based Methods ¨ Grid-Based Methods ¨ Model-Based Methods
n Outlier Analysis n Summary
20
Categories of Clustering Methods
n Partitioning algorithms: Construct various partitions and then evaluate them by some criterion
n Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion
n Density-based: based on connectivity and density functions
n Grid-based: based on a multiple-level granularity structure
n Model-based: A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model
11
21
Partitioning Methods
n Given ¨ a database of n objects ¨ k = number of clusters to form, k ≤ n
n To find k partitions of the database that optimize a "similarity function" so that
¨ Objects of different clusters/partitions are "dissimilar"
¨ Objects of the same partition are "similar"
n Basic Idea:
¨ Create initial partitions
¨ Iterative relocation - move objects from one group to another to improve the objective function (e.g., minimize error)
22
Partitioning Methods (cont)
n Approaches
¨ Global optimum: exhaustively enumerate all partitions
¨ Heuristic methods:
n Basic idea:
¨ Pick a reference point for each cluster
¨ Assign data objects so as to minimize the sum of dissimilarities between each object and the reference point of its cluster
n k-means: each cluster is represented by the center (mean) of the cluster
n k-medoids: each cluster is represented by one of the objects in the cluster
12
23
The k-Means Method
n Input: k = number of clusters and a database of n objects
n Output: a set of k clusters that minimizes the squared error
n Algorithm (sketch)
¨ Randomly pick k objects as cluster centers
¨ Repeat
n Assign each object to the cluster with the nearest center
n Update the center (mean value of the objects in the cluster) of each cluster
¨ Until no change (i.e., the squared error converges)
n Squared error:

    E = Σ_{i=1..k} Σ_{p ∈ C_i} |p - m_i|^2

  where m_i = mean of cluster C_i and p = an object (point) in C_i
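A minimal NumPy sketch of the k-means loop just described (random initial centers drawn from the data; an illustration, not the course's reference implementation, and the sample data are made up).

    import numpy as np

    def k_means(X, k, max_iter=100, seed=0):
        """Minimal k-means: pick k random objects as centers, then alternate
        assignment to the nearest center and recomputation of cluster means."""
        rng = np.random.default_rng(seed)
        X = np.asarray(X, float)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # Assign each object to the nearest center (squared Euclidean distance)
            labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
            # Recompute each center as the mean of its cluster (keep old center if empty)
            new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                    else centers[i] for i in range(k)])
            if np.allclose(new_centers, centers):   # squared error has converged
                break
            centers = new_centers
        return labels, centers

    X = [[1, 1], [1.5, 2], [8, 8], [9, 9], [1, 0], [8.5, 9.5]]
    labels, centers = k_means(X, k=2)
    print(labels, centers)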
24
The k-Means Method: example with K = 2
[Figure: scatter plots of the same data set at successive iterations]
¨ Arbitrarily choose K objects as the initial cluster centers
¨ Assign each object to the most similar center
¨ Update the cluster means
¨ Reassign objects, update the cluster means again, and repeat until the assignments no longer change
13
25
Advantages and disadvantages
n Strength
¨ Relatively efficient: O(tkn), where n = # objects, k = # clusters, and t = # iterations; normally k, t << n
¨ Other clustering methods are costlier, e.g., PAM: O(k(n-k)^2) per iteration
n Weakness
¨ Applicable only when a mean is defined - what about categorical data?
¨ Need to specify the number of clusters, k, in advance
¨ The mean is sensitive to noisy data and outliers
¨ Not suitable for discovering clusters with non-convex shapes
¨ Often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms
26
Variations of the k-Means Method
n Variants of k-means differ in
¨ Selection of the initial k means
¨ Dissimilarity calculations
¨ Strategies for calculating cluster means
n k-modes method: handles categorical data [Huang'98]
¨ Replaces the means of clusters with modes
¨ Uses a frequency-based method to update the modes of clusters
n k-prototypes method: integrates k-means and k-modes to handle mixed data types
n EM (Expectation Maximization): extends k-means by assigning each object to a cluster according to a weight representing its probability of membership
n Scalable k-means method: discard objects whose cluster membership is ascertained
14
27
Problem with k-Means Method
n The k-means algorithm is sensitive to outliers!
¨ An object with an extremely large value may substantially distort the distribution of the data
n k-medoids: instead of taking the mean value of the objects in a cluster as a reference point, use a medoid, the most centrally located object in the cluster
n k-means is centroid-based; k-medoids is based on the most central object
[Figure: the same data clustered around means vs. around medoids]
28
The k-Medoids Method
n Input: k = number of clusters and a database of n objects
n Output: a set of k clusters that minimizes the sum of dissimilarities of all objects to their nearest medoid (most centrally located object)
n Algorithm (sketch)
¨ Randomly pick k objects as the medoids of k clusters
¨ Repeat
n Assign each remaining object to the cluster with the nearest medoid
n Randomly select a non-medoid object, O_random
n Compute the cost of swapping a medoid O_j with O_random
n If the cost is < 0 then swap (i.e., O_random becomes a medoid and O_j does not)
¨ Until no change (i.e., the error converges)
n Cost ~ (squared error with the swap) - (squared error without the swap); cost < 0 means the swap reduces the squared error
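A naive sketch of the medoid-swapping idea described above, accepting any medoid/non-medoid swap with negative cost; the distance measure, data, and helper names are illustrative and this is far less efficient than the published algorithms.

    import numpy as np

    def k_medoids(X, k, max_iter=100, seed=0):
        """Naive k-medoids sketch: greedily accept any medoid/non-medoid swap
        that lowers the total dissimilarity of objects to their nearest medoid."""
        rng = np.random.default_rng(seed)
        X = np.asarray(X, float)
        D = np.abs(X[:, None, :] - X[None, :, :]).sum(-1)   # Manhattan distance matrix
        medoids = list(rng.choice(len(X), size=k, replace=False))

        def total_cost(meds):
            return D[:, meds].min(axis=1).sum()

        for _ in range(max_iter):
            improved = False
            for mi in range(k):
                for o in range(len(X)):
                    if o in medoids:
                        continue
                    trial = medoids[:mi] + [o] + medoids[mi + 1:]
                    if total_cost(trial) < total_cost(medoids):   # swap cost < 0
                        medoids, improved = trial, True
            if not improved:
                break
        labels = D[:, medoids].argmin(axis=1)
        return labels, [X[m] for m in medoids]

    X = [[1, 1], [2, 2], [8, 8], [9, 9], [1, 0], [100, 100]]   # one extreme outlier
    print(k_medoids(X, k=2))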
15
29
Variations of the k-Medoids Methods
n PAM (Partitioning Around Medoids) [Kaufman and Rousseeuw, 87]
¨ Randomly select an initial set of medoids
¨ Iteratively replace a medoid by the non-medoid that gives the greatest error reduction
n PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean
n PAM works effectively for small data sets, but does not scale well to large data sets
¨ O(k(n-k)^2) per iteration, where n = # data objects and k = # clusters
30
Variations of the k-Medoids Methods
n CLARA (Clustering LARge Applications) [Kaufmann & Rousseeuw, 90]
¨ Draws multiple samples of the data set, applies PAM to each sample, and returns the best clustering as the output
¨ Strength: deals with larger data sets than PAM
¨ Weakness:
n Efficiency depends on the sample size
n A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
n CLARANS (Clustering Large Applications based on RANdomized Search) [Ng & Han, 94]
¨ Combines CLARA's idea with random search to avoid sampling bias
¨ CLARANS does not confine the medoid search to a fixed sample but randomly selects nodes to be searched after a local optimum is found
¨ More efficient than PAM and CLARA
16
31
Outline
n Overview n Input Data types n Clustering methods
¨ Partitioning Methods ¨ Hierarchical Methods ¨ Density-Based Methods ¨ Grid-Based Methods ¨ Model-Based Methods
n Outlier Analysis n Summary
32
Hierarchical Clustering Method
n Agglomerative method - bottom-up (merge iteratively)
¨ Place each object in its own cluster, then merge clusters into larger clusters until all objects are in a single cluster or a termination condition holds
¨ Most hierarchical methods are of this kind - they differ in the definition of "inter-cluster" similarity
¨ Example: AGNES (AGglomerative NESting)
n Divisive method - top-down (divide iteratively)
¨ Start with all objects in one cluster, then subdivide the cluster into smaller pieces until termination (e.g., the specified number of clusters is reached, or the two closest clusters are close enough)
¨ Example: DIANA (DIvisive ANAlysis)
17
33
Hierarchical Methods
n Needs a termination condition - not necessarily the number of clusters given as an input
n Uses a distance matrix as the clustering criterion
[Figure: objects a, b, c, d, e; the agglomerative method (AGNES) proceeds from Step 0 to Step 4, merging {a,b}, then {d,e}, then {c,d,e}, then {a,b,c,d,e}; the divisive method (DIANA) proceeds in the reverse direction]
34
Measures of distances
Let d(p, p') be the distance between points p and p', let p_i and p_j be points in clusters C_i and C_j respectively, let n_i be the number of points in cluster C_i, and let m_i be the mean of the points in cluster C_i
n Minimum distance: d_min(C_i, C_j) = min_{p_i, p_j} d(p_i, p_j)
n Maximum distance: d_max(C_i, C_j) = max_{p_i, p_j} d(p_i, p_j)
n Mean distance: d_mean(C_i, C_j) = d(m_i, m_j)
n Average distance: d_avg(C_i, C_j) = (1 / (n_i * n_j)) Σ_{p_i ∈ C_i} Σ_{p_j ∈ C_j} d(p_i, p_j)
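The four inter-cluster distance measures can be computed directly from their definitions, e.g. with the following sketch (Euclidean point distance is assumed; the example clusters and function names are illustrative).

    import numpy as np

    def euclid(p, q):
        return float(np.linalg.norm(np.asarray(p, float) - np.asarray(q, float)))

    def cluster_distances(Ci, Cj):
        """Minimum (single-link), maximum (complete-link), mean, and average
        inter-cluster distances between clusters Ci and Cj."""
        pairs = [euclid(p, q) for p in Ci for q in Cj]
        d_min = min(pairs)
        d_max = max(pairs)
        d_mean = euclid(np.mean(Ci, axis=0), np.mean(Cj, axis=0))
        d_avg = sum(pairs) / (len(Ci) * len(Cj))
        return d_min, d_max, d_mean, d_avg

    Ci, Cj = [(0, 0), (1, 0)], [(4, 0), (6, 0)]
    print(cluster_distances(Ci, Cj))   # (3.0, 6.0, 4.5, 4.5)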
18
35
AGNES & DIANA
n Introduced in [Kaufmann and Rousseeuw, 90]
n Implemented in statistical analysis packages, e.g., S+
n Use the single-link method (the similarity between two clusters is the similarity between the closest pair of points, one from each cluster) and the dissimilarity matrix
n AGNES
¨ Merge the clusters that have the least dissimilarity
¨ Continue in a non-descending fashion
¨ Eventually all objects belong to the same cluster
n DIANA performs the inverse order of AGNES
[Figure: three scatter plots of the same data, shown first as individual points, then as partially merged clusters, then as a single cluster]
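A small sketch of single-link agglomerative merging in the spirit of AGNES (a brute-force version for illustration only; function and variable names are assumptions, not the package implementation).

    import numpy as np

    def single_link_agnes(points, num_clusters=1):
        """AGNES-style agglomerative clustering with single-link distance:
        repeatedly merge the two clusters whose closest pair of points is nearest."""
        clusters = [[tuple(p)] for p in points]          # start: one object per cluster
        while len(clusters) > num_clusters:
            best = None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    d = min(np.linalg.norm(np.subtract(p, q))
                            for p in clusters[a] for q in clusters[b])
                    if best is None or d < best[0]:
                        best = (d, a, b)
            _, a, b = best
            clusters[a] += clusters.pop(b)               # merge the least dissimilar pair
        return clusters

    pts = [(0, 0), (0.5, 0), (5, 5), (5.5, 5), (10, 0)]
    print(single_link_agnes(pts, num_clusters=3))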
36
Issues
n Major weaknesses
¨ Do not scale well: e.g., agglomerative methods have time complexity of at least O(n^2), where n is the total number of objects
¨ The merge/split process is irreversible
n To address these drawbacks:
¨ Integrate hierarchical clustering with iterative relocation:
BIRCH - Balanced Iterative Reducing and Clustering using Hierarchies [Zhang, Ramakrishnan & Livny, 96]
¨ Careful analysis of object "linkages" at each level of the hierarchy:
CURE - Clustering Using REpresentatives [Guha, Rastogi & Shim, 98]
ROCK - RObust Clustering using linKs [Guha, Rastogi & Shim, 99]
CHAMELEON [Karypis, Han & Kumar, 99]
19
37
BIRCH
n Basic Idea: Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
¨ Phase 1: scan the DB to build an initial in-memory CF tree that preserves the inherent hierarchical structure of the data (similar to building a B+-tree)
¨ Phase 2: apply a (selected) clustering algorithm to cluster the leaf nodes of the CF tree
38
Clustering Feature: CF = (N, LS, SS)
n N: number of data points in the sub-cluster
n LS: linear sum of the N points, Σ_{i=1..N} X_i
n SS: square sum of the N points, Σ_{i=1..N} X_i^2
n A CF summarizes the statistics (0th, 1st, and 2nd moments) of a sub-cluster
n Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8),
  CF = (5, (16,30), (54,190))
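The CF triple and its additivity (which is what lets sub-clusters be merged cheaply in the CF tree) can be illustrated with a short sketch; it reuses the five points from the example above, and the function names are illustrative.

    def clustering_feature(points):
        """Compute the CF triple (N, LS, SS) of a sub-cluster, where LS and SS
        are the per-dimension linear and square sums of the points."""
        dims = len(points[0])
        N = len(points)
        LS = tuple(sum(p[d] for p in points) for d in range(dims))
        SS = tuple(sum(p[d] ** 2 for p in points) for d in range(dims))
        return N, LS, SS

    def merge_cf(cf1, cf2):
        """CFs are additive: the CF of the union of two disjoint sub-clusters
        is the component-wise sum of their CFs."""
        (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
        return (n1 + n2,
                tuple(a + b for a, b in zip(ls1, ls2)),
                tuple(a + b for a, b in zip(ss1, ss2)))

    pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
    print(clustering_feature(pts))                     # (5, (16, 30), (54, 190))
    print(merge_cf(clustering_feature(pts[:2]), clustering_feature(pts[2:])))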
20
39
BIRCH
n Scales linearly
¨ Finds a good clustering with a single scan and improves the quality with a few additional scans
n Weaknesses
¨ Handles only numeric data
¨ Sensitive to the order of the data records
¨ Uses the notion of diameter for clustering → does not perform well for non-spherical shapes
40
CURE
¨ Basic Idea:
n Use multiple representative points for each cluster → adjusts well to non-spherical shapes
n Shrink the representative points toward the center of the cluster → dampens the effect of outliers
¨ Scales well without sacrificing clustering quality
¨ Does not handle categorical data, and results are significantly affected by the parameter settings
21
41
ROCK
n Basic Idea:
¨ Construct a sparse graph from a similarity matrix using the concept of "interconnectivity"
¨ Perform hierarchical clustering on the sparse graph
n Similarity measures are not distance-based
n Cluster "similarity" is based on the number of points from different clusters that have neighbors in common
n Suitable for clustering categorical data
42
CHAMELEON
n Hierarchical clustering based on
¨ k-nearest neighbors
¨ Dynamic modeling
n Measures similarity based on interconnectivity and closeness (proximity)
¨ If these measures between two clusters are high relative to the internal measures within the clusters, merge the clusters
¨ The merge is based on a dynamic model adapting to the internal structure of the clusters being merged
n CURE ignores the interconnectivity of objects; ROCK ignores the closeness of clusters
22
43
Outline
n Overview n Input Data types n Clustering methods
¨ Partitioning Methods ¨ Hierarchical Methods ¨ Density-Based Methods ¨ Grid-Based Methods ¨ Model-Based Methods
n Outlier Analysis n Summary
44
Density-Based Clustering Methods
n Clustering based on density (local cluster criterion), such as density-connected points
n Major features: ¨ Discover clusters of arbitrary shape ¨ Handle noise ¨ One scan ¨ Need density parameters as termination condition
n Several interesting studies: ¨ DBSCAN [Ester et al., 96] ¨ OPTICS [Ankerst et al., 99] ¨ DENCLUE [Hinneburg & Keim, 98] ¨ CLIQUE [Agrawal et al., 98]
23
45
Terminologies
n ε-neighborhood of an object p: N_ε(p) = {q | dist(p, q) ≤ ε}
n q is a core object if N_ε(q) contains at least MinPts objects (a specified minimum number)
n p is directly density-reachable from q if
¨ p is in N_ε(q), and
¨ q is a core object
[Figure: with MinPts = 5 and ε = 1 cm, p lies in the ε-neighborhood of core object q, so p is directly density-reachable from q]
46
Additional terms
n Density-reachable:
¨ p is density-reachable from q if there is a chain of points p_1, ..., p_n with p_1 = q and p_n = p such that p_{i+1} is directly density-reachable from p_i w.r.t. ε and MinPts
n Density-connected:
¨ A point p is density-connected to a point q if there is a point o such that both p and q are density-reachable from o w.r.t. ε and MinPts
n Density-reachability is transitive (but not symmetric)
n Density-connectedness is symmetric
[Figure: a chain q = p_1, ..., p illustrating density-reachability, and a point o from which both p and q are density-reachable, illustrating density-connectedness]
24
47
Density-based Clustering
n DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
¨ Grows regions of sufficiently high density into clusters
¨ A cluster is a maximal set of density-connected points with respect to density-reachability
¨ Objects not in any cluster are considered "noise"
¨ Requires input parameters ε and MinPts
n OPTICS (Ordering Points To Identify the Clustering Structure)
¨ Computes a cluster ordering for automatic and interactive analysis
¨ The ordering represents the density-based clustering structure (objects are selected so that clusters with a small radius ε, i.e., high density, are finished first)
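A minimal sketch of the DBSCAN region-growing idea, using the ε-neighborhood and MinPts definitions from the previous slides; the parameter values and data are illustrative, and this is not an optimized implementation.

    import numpy as np

    def dbscan(X, eps, min_pts):
        """Minimal DBSCAN sketch: grow clusters from core objects by expanding
        directly density-reachable neighbors; label -1 marks noise."""
        X = np.asarray(X, float)
        n = len(X)
        labels = [-1] * n                      # -1 = noise / unassigned
        visited = [False] * n

        def neighbors(i):
            return [j for j in range(n) if np.linalg.norm(X[i] - X[j]) <= eps]

        cluster = 0
        for i in range(n):
            if visited[i]:
                continue
            visited[i] = True
            seeds = neighbors(i)
            if len(seeds) < min_pts:           # i is not a core object
                continue
            labels[i] = cluster
            while seeds:
                j = seeds.pop()
                if not visited[j]:
                    visited[j] = True
                    j_neigh = neighbors(j)
                    if len(j_neigh) >= min_pts:     # j is also a core object
                        seeds.extend(j_neigh)
                if labels[j] == -1:
                    labels[j] = cluster
            cluster += 1
        return labels

    X = [(1, 1), (1.2, 1.1), (0.9, 1.0), (5, 5), (5.1, 5.2), (5.2, 4.9), (9, 9)]
    print(dbscan(X, eps=0.5, min_pts=3))    # two clusters plus one noise point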
48
Density-based Clustering
n DENCLUE (DENsity-based CLUstEring)
¨ Solid mathematical foundation - uses influence functions
¨ Good for data sets with large amounts of noise
¨ Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
¨ Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)
¨ But needs a large number of parameters
25
49
Outline
n Overview n Input Data types n Clustering methods
¨ Partitioning Methods ¨ Hierarchical Methods ¨ Density-Based Methods ¨ Grid-Based Methods ¨ Model-Based Methods
n Outlier Analysis n Summary
50
Grid-Based Clustering Method
n Basic steps: ¨ Quantize the object space into a finite number of grid cells
¨ Perform clustering on the grid structure
n Examples
¨ STING [Wang et al., 97]: a grid-based method using statistical information stored in grid cells to answer queries
¨ WaveCluster [Sheikholeslami et al., 98] and CLIQUE [Agrawal et al., 98]: both grid-based and density-based
26
51
STING
n Each cell
¨ at a high level is partitioned into smaller cells at the next lower level
¨ has precomputed statistical information, e.g., mean, max, min, distribution type
n Query answering is a top-down process
¨ Determine the highest layer at which to start
¨ For each cell in the layer,
n compute a confidence interval reflecting the cell's relevance to the query
n remove irrelevant cells
n move to the next lower level
¨ Repeat the above step until the query specification is met, then return the region of relevant cells
n PROS: fast, easy to parallelize; CONS: cluster boundaries are only horizontal or vertical
52
WaveCluster
n A multi-resolution clustering approach
¨ Impose a grid structure on the data space
¨ Apply a wavelet transform to the feature space (from the grid cell information)
n Decomposes a signal into different frequency sub-bands
n Preserves the relative distance between objects at different resolutions
¨ Cluster by finding dense regions in the transformed space
n Both grid-based and density-based, with input parameters:
¨ number of grid cells for each dimension
¨ the wavelet, and the number of applications of the wavelet transform
n Advantages:
¨ Scales to large databases - O(n)
¨ Does not require the number of clusters or a neighborhood radius
¨ Effective removal of outliers
¨ Detects arbitrarily shaped clusters at different scales
¨ Not sensitive to noise or to input order
27
53
Transformation
[Figure: the wavelet transform of the feature space at high, medium, and low resolution]
The transform decomposes the data into four frequency sub-bands focusing on
• the mean neighborhood of each point
• horizontal edges
• vertical edges
• corners
54
CLIQUE
n Grid-based and density-based, using the Apriori principle
¨ Partition the space into a grid structure
¨ A cluster is a maximal set of connected dense units
n Narrowing the search space: a k-dimensional candidate dense unit cannot be dense if any of its (k-1)-dimensional projection units is not dense
¨ Determine the minimal cover of each cluster
n PROS:
¨ Scales well - good for clustering high-dimensional data in large databases
¨ Insensitive to the order of input
n CONS:
¨ The accuracy of the clustering may be degraded - the price paid for the simplicity of the method
28
55
Outline
n Overview n Input Data types n Clustering methods
¨ Partitioning Methods ¨ Hierarchical Methods ¨ Density-Based Methods ¨ Grid-Based Methods ¨ Model-Based Methods
n Outlier Analysis n Summary
56
Model-Based Clustering Methods n Goal – to fit the data to some mathematical model n Statistical approach ~ Conceptual learning
¨ Cobweb ¨ CLASSIT ¨ Autoclass
n Neural net approach ¨ SOM (Self Organizing Map)
29
57
Statistical Approach
n Conceptual learning: cluster (group similar objects) + characterize (generalize & simplify) concepts
n COBWEB [Fisher, 87]
¨ Builds a classification tree in which each node (not branch) has a label
¨ Uses a heuristic evaluation function called category utility that
n represents the difference in the expected number of attribute values when a hypothesized category is used and when it is not
n rewards intraclass similarity and interclass dissimilarity
n is based on the assumption that attributes are independent
¨ CONS:
n the independence assumption
n expensive to compute and update probability distributions
n the tree is not height-balanced - bad for skewed data
n CLASSIT - an extension of COBWEB for continuous data - similar problems
n AutoClass - uses Bayesian statistical analysis - popular in industry
58
Neural Net Approach
n Competitive learning
¨ Hierarchical architecture of neural units where
n each unit in a given layer takes inputs from ALL units in the previous layer
n units within a cluster in a layer compete - the one that best corresponds to the output of the previous layer is picked
n the winning unit adjusts the weights on its connections
¨ Cluster ~ a mapping of low-level features to high-level features
n SOM (Self-Organizing Feature Maps)
¨ The unit whose weight vector is closest to the current object becomes the winner
¨ More details in the reference papers
30
59
Outline
n Overview n Input Data types n Clustering methods
¨ Partitioning Methods ¨ Hierarchical Methods ¨ Density-Based Methods ¨ Grid-Based Methods ¨ Model-Based Methods
n Outlier Analysis n Summary
60
Outlier Discovery
n Outliers: data objects that
¨ do not comply with the general behavior of the data
¨ are considerably dissimilar from (inconsistent with) the remaining data
n Generally, data mining algorithms try to minimize the effect of outliers, but... "one person's noise could be another person's signal"
n Example applications:
¨ Credit card fraud detection
¨ Telecom fraud detection
¨ Marketing customized to purchasers with extreme incomes
¨ Finding unusual responses to medical treatments
n Outlier discovery: find the top n outlier points
31
61
Approaches
n Statistical-based
¨ Assume a distribution model for the data (e.g., normal)
¨ Apply hypothesis testing to see whether objects belong to the distribution
¨ Drawbacks:
n most tests are for single attributes
n the data distribution may not be known
n Distance-based
¨ Outliers - objects that do not have enough neighbors within a given distance
¨ Various algorithms: index-based, nested-loop, cell-based
¨ Drawbacks:
n does not scale well to high dimensions
n requires experimentation to choose the input parameters
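A naive nested-loop sketch of distance-based outlier detection: an object is flagged when fewer than a given fraction of the other objects lie within distance d of it; the threshold parameters, data, and function name are illustrative.

    import numpy as np

    def distance_based_outliers(X, d, min_fraction):
        """Flag object o as a distance-based outlier if fewer than
        `min_fraction` of the other objects lie within distance d of o
        (a naive nested-loop version)."""
        X = np.asarray(X, float)
        n = len(X)
        outliers = []
        for i in range(n):
            close = sum(1 for j in range(n) if j != i
                        and np.linalg.norm(X[i] - X[j]) <= d)
            if close < min_fraction * (n - 1):
                outliers.append(i)
        return outliers

    X = [(1, 1), (1.1, 1.2), (0.9, 1.0), (1.2, 0.8), (10, 10)]
    print(distance_based_outliers(X, d=1.0, min_fraction=0.5))  # [4]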
62
Approaches (contd)
n Deviation-based
¨ Outliers - objects that deviate from the main characteristics of the data
¨ Example techniques:
n sequential exception technique
n OLAP data cube technique (explore regions of anomalies - ch. 2)
¨ Sequential exception technique
n Given a set S of n objects, build a sequence of subsets S_1 ⊂ S_2 ⊂ ... ⊂ S_m, where m ≤ n
n For each subset, compare its similarity with that of the preceding subset in the sequence
n Find the smallest subset whose removal results in the greatest reduction of dissimilarity in the residual set; this subset is called the "exception set" - the outliers
32
63
Outline
n Overview n Input Data types n Clustering methods
¨ Partitioning Methods ¨ Hierarchical Methods ¨ Density-Based Methods ¨ Grid-Based Methods ¨ Model-Based Methods
n Outlier Analysis n Summary
64
Problems and Challenges
n Considerable progress has been made in scalable clustering methods ¨ Partitioning: k-means, k-medoids, CLARANS ¨ Hierarchical: BIRCH, CURE ¨ Density-based: DBSCAN, CLIQUE, OPTICS ¨ Grid-based: STING, WaveCluster ¨ Model-based: Autoclass, Denclue, Cobweb
n Current clustering techniques do not address all of these requirements adequately
n Constraint-based clustering analysis: constraints may exist in the data space (e.g., bridges and highways) or in user queries
33
65
Constraint-Based Clustering Analysis
n Fewer parameters but more user-desired constraints, e.g., an ATM allocation problem
66
Clustering With Obstacle Objects
[Figure: clustering results when taking obstacles into account vs. not taking obstacles into account]
34
67
Summary
n Cluster analysis groups objects based on their similarity and has wide applications
n Measure of similarity can be computed for various types of data
n Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
n Outlier detection and analysis are very useful for fraud detection, etc. and can be performed by statistical, distance-based or deviation-based approaches
n There are still lots of research issues on cluster analysis, such as constraint-based clustering
68
References
n R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98.
n M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
n M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. SIGMOD'99.
n P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
n M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96.
n M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95.
n D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
n D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamical systems. VLDB'98.
n S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. SIGMOD'98.
n A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
35
69
References (cont)
n L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
n E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98.
n G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley & Sons, 1988.
n P. Michaud. Clustering techniques. Future Generation Computer Systems, 13, 1997.
n R. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. VLDB'94.
n E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
n G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB'98.
n W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. VLDB'97.
n T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD'96.