Lecture 10 Clustering

Preview

Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods

Examples of Clustering Applications

Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

Land use: Identification of areas of similar land use in an earth observation database

Insurance: Identifying groups of motor insurance policy holders with a high average claim cost

Urban planning: Identifying groups of houses according to their house type, value, and geographical location

Seismology: Observed earth quake epicenters should be clustered along continent faults

What Is a Good Clustering?

A good clustering method will produce clusters with High intra-class similarity Low inter-class similarity

Precise definition of clustering quality is difficult Application-dependent Ultimately subjective

Requirements for Clustering in Data Mining

Scalability Ability to deal with different types of attributes Discovery of clusters with arbitrary shape Minimal domain knowledge required to

determine input parameters Ability to deal with noise and outliers Insensitivity to order of input records Robustness wrt high dimensionality Incorporation of user-specified constraints Interpretability and usability

Similarity and Dissimilarity Between Objects

Euclidean distance (p = 2):

Properties of a metric d(i,j): d(i,j) 0 d(i,i) = 0 d(i,j) = d(j,i) d(i,j) d(i,k) + d(k,j)

)||...|||(|),( 22

11 pp jx

Major Clustering Approaches

Partitioning: Construct various partitions and then

evaluate them by some criterion

Hierarchical: Create a hierarchical decomposition of

the set of objects using some criterion

Model-based: Hypothesize a model for each cluster

and find best fit of models to data

Density-based: Guided by connectivity and density

functions

Partitioning Algorithms

Partitioning method: Construct a partition of a database D of n objects into a set of k clusters

Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion Global optimal: exhaustively enumerate all partitions Heuristic methods: k-means and k-medoids

algorithms k-means (MacQueen, 1967): Each cluster is

represented by the center of the cluster k-medoids or PAM (Partition around medoids)

(Kaufman & Rousseeuw, 1987): Each cluster is represented by one of the objects in the cluster

K-Means Clustering

Given k, the k-means algorithm consists of four steps: Select initial centroids at random. Assign each object to the cluster with

the nearest centroid. Compute each centroid as the mean

of the objects assigned to it. Repeat previous 2 steps until no

change.

k-Means [MacQueen ’67]

Another example (k=3)

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

Iteration 1

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

Iteration 2

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

Iteration 3

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

Iteration 4

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

Iteration 5

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

Iteration 6

K-Means Clustering (contd.)

Example

0 1 2 3 4 5 6 7 8 9 10

Comments on the K-Means Method

Strengths Relatively efficient: O(tkn), where n is # objects, k is

# clusters, and t is # iterations. Normally, k, t << n.

Often terminates at a local optimum. The global optimum may be found using techniques such as simulated annealing and genetic algorithms

Weaknesses Applicable only when mean is defined (what about

categorical data?) Need to specify k, the number of clusters, in advance Trouble with noisy data and outliers Not suitable to discover clusters with non-convex

shapes

Hierarchical Clustering

Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition

Step 0 Step 1 Step 2 Step 3 Step 4

a b c d e

Step 4 Step 3 Step 2 Step 1 Step 0

agglomerative(AGNES)

divisive(DIANA)

Simple example

1 3 2 5 4 60

Cut-off point

Single link example

close?

Inter-Cluster similarity?

MIN(Single-link)

MAX(complete-link)

Distance Between Centroids

Group Average(average-link)

What is: density-based problem?

Classes of arbitrary spatial shapes

Desired properties: good efficiency (large databases) minimal domain knowledge (determine

parameters)

Intuition: notion of density

Lecture 10 Clustering

Documents

Transcript of Lecture 10 Clustering

CSC321: Neural Networks Lecture 12: Clustering

Lecture 17: Hierarchical Clustering

Lecture 10: Semantic Segmentation and Clustering

Unsupervised Learning: Clustering 1 Lecture 16: Clustering Web Search and Mining.

Lecture 15 Clustering - Oregon State University

Lecture 10: Clustering - GitHub Pages · •Unsupervised learning •Clustering •K-means •Algorithm •How to choose 𝐾 •Initialization •Properties •Agglomerative clustering

CSC 411: Lecture 12: Clustering - Department of Computer …urtasun/courses/CSC411_Fall16/12... · 2016. 10. 30. · CSC 411: Lecture 12: Clustering Richard Zemel, Raquel Urtasun

Lecture 12: Clustering May 5, 2010. Clustering (Ch 16 and 17) Document clustering Motivations Document representations Success criteria Clustering.

Lecture 20: Clustering and Evolution

DMTM Lecture 14 Density based clustering

Lecture on Modeling Tools for Clustering & Regressionhy590-21/pdfs/Modeling_tools_k... · 2017. 10. 31. · Lecture on Modeling Tools for Clustering & Regression . Data Clustering

lecture 8: clustering algorithms · Clustering Givenalargedataset,groupdatapointsinto‘clusters’. Datapointsinthesameclusteraresimilarinsomesense. E.g.clusterstudentsscores(todecidegrade)

Lecture outline Clustering aggregation – Reference: A. Gionis, H. Mannila, P. Tsaparas: Clustering aggregation, ICDE 2004 Co-clustering (or bi-clustering)

LECTURE 22: CLUSTERING

Lecture 12: Unsupervised learning. Clustering€¦ · Lecture 12: Unsupervised learning. Clustering Introduction to unsupervised learning Clustering problem K-means clustering Hierarchical

DMTM Lecture 15 Clustering evaluation

Andrew Rosenberg- Lecture 23: Clustering Evaluation

Lecture 6 : Clustering

Information Networks Graph Clustering Lecture 14.

LECTURE 28: HIERARCHICAL CLUSTERING