Machine Learning: Algorithms and Applications
Floriano Zini
Free University of Bozen-Bolzano, Faculty of Computer Science
Academic Year 2011-2012
Lecture 11: 21 May 2012
Unsupervised Learning (cont…)
Slides courtesy of Bing Liu: www.cs.uic.edu/~liub/WebMiningBook.html
Road map
• Basic concepts
• K-means algorithm
• Representation of clusters
• Hierarchical clustering
• Distance functions
• Data standardization
• Handling mixed attributes
• Which clustering algorithm to use?
• Cluster evaluation
• Summary
Mixed attributes
• The distance functions we have seen are for data whose attributes are all numeric, all nominal, etc.
• In many practical cases, data has attributes of different types, drawn from the following six:
  – interval-scaled
  – ratio-scaled
  – symmetric binary
  – asymmetric binary
  – nominal
  – ordinal
• Clustering a data set involving mixed attributes is a challenging problem
Convert to a single type
• One common way of dealing with mixed attributes is to:
  1. Choose a dominant attribute type
  2. Convert the other types to this type
• E.g., if most attributes in a data set are interval-scaled:
  – we convert ordinal and ratio-scaled attributes to interval-scaled attributes
  – it is also appropriate to treat symmetric binary attributes as interval-scaled attributes
Convert to a single type (cont…)
• It does not make much sense to convert a nominal attribute or an asymmetric binary attribute to an interval-scaled attribute
  – but this is frequently done in practice, by assigning numbers to the values according to some hidden ordering, e.g., the prices of the fruits
• Alternatively, a nominal attribute can be converted to a set of (symmetric) binary attributes, which are then treated as numeric attributes (a sketch of this conversion follows)
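As an illustration, here is a minimal sketch of that nominal-to-binary conversion in plain Python; the fruit categories are a hypothetical example, not from the slides:

```python
def one_hot(values, categories):
    """Map each nominal value to a 0/1 vector, one position per category."""
    return [[1 if v == c else 0 for c in categories] for v in values]

# e.g. a fruit attribute in {apple, orange, pear} becomes three binary attributes:
# one_hot(["apple", "pear"], ["apple", "orange", "pear"])
# -> [[1, 0, 0], [0, 0, 1]]
```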
Combining individual distances
• This approach computes the individual attribute distances and then combines them
• A combination formula, proposed by Gower, is

$$dist(\mathbf{x}_i, \mathbf{x}_j) = \frac{\sum_{f=1}^{r} \delta_{ij}^{f}\, d_{ij}^{f}}{\sum_{f=1}^{r} \delta_{ij}^{f}} \qquad (4)$$

  – The distance dist(xi, xj) is between 0 and 1
  – r is the number of attributes
  – d_ij^f is the distance contributed by attribute f, in the range [0, 1]
  – δ_ij^f is an indicator for attribute f:

$$\delta_{ij}^{f} = \begin{cases} 0 & \text{if } x_{if} \text{ or } x_{jf} \text{ is missing} \\ 0 & \text{if attribute } f \text{ is asymmetric binary and } x_{if} = x_{jf} = 0 \\ 1 & \text{otherwise (} x_{if} \text{ and } x_{jf} \text{ are not missing)} \end{cases}$$
Combining individual distances (cont…)
• If f is a binary or nominal attribute:

$$d_{ij}^{f} = \begin{cases} 1 & \text{if } x_{if} \neq x_{jf} \\ 0 & \text{otherwise} \end{cases}$$

  – distance (4) then reduces to:
    ◦ equation (3) of lecture 10 if all attributes are nominal
    ◦ the simple matching distance (1) of lecture 10 if all attributes are symmetric binary
    ◦ the Jaccard distance (2) of lecture 10 if all attributes are asymmetric binary
• If f is interval-scaled:

$$d_{ij}^{f} = \frac{|x_{if} - x_{jf}|}{R_f}, \qquad R_f = \max(f) - \min(f)$$

  – R_f is the value range of f
  – If all the attributes are interval-scaled, distance (4) reduces to the Manhattan distance, assuming that all attribute values are standardized
• Ordinal and ratio-scaled attributes are converted to interval-scaled attributes and handled in the same way
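To make equation (4) concrete, here is a minimal sketch in Python; the attribute-type tags, the `ranges` argument, and the use of `None` for missing values are illustrative assumptions, not from the slides:

```python
def gower_distance(xi, xj, types, ranges):
    """Gower's combined distance, equation (4).
    types[f] in {"interval", "nominal", "sym_binary", "asym_binary"};
    ranges[f] = max(f) - min(f) for interval-scaled attributes."""
    num = den = 0.0
    for f, t in enumerate(types):
        a, b = xi[f], xj[f]
        if a is None or b is None:          # delta_ij^f = 0: missing value
            continue
        if t == "asym_binary" and a == 0 and b == 0:
            continue                        # delta_ij^f = 0: joint absence ignored
        if t == "interval":
            d = abs(a - b) / ranges[f]      # d_ij^f = |x_if - x_jf| / R_f
        else:                               # binary or nominal attribute
            d = 0.0 if a == b else 1.0      # d_ij^f = simple mismatch
        num += d                            # delta_ij^f = 1 for this attribute
        den += 1.0
    return num / den if den else 0.0

# e.g. gower_distance([1.0, "red", 1], [3.0, "blue", 0],
#                     ["interval", "nominal", "asym_binary"], {0: 10.0})
```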
Road map
• Basic concepts
• K-means algorithm
• Representation of clusters
• Hierarchical clustering
• Distance functions
• Data standardization
• Handling mixed attributes
• Which clustering algorithm to use?
• Cluster evaluation
• Summary
How to choose a clustering algorithm
• Clustering research has a long history
  – A vast collection of algorithms is available
  – We only introduced several main algorithms
• Choosing the "best" algorithm is challenging
  – Every algorithm has limitations and works well only with certain data distributions
  – It is very hard, if not impossible, to know what distribution the application data follow
    ◦ The data may not fully follow any "ideal" structure or distribution required by the algorithms
  – One also needs to decide how to standardize the data, choose a suitable distance function, and select other parameter values
How to choose a clustering algorithm (cont…)
• Due to these complexities, the common practice is to:
  1. run several algorithms using different distance functions and parameter settings
  2. carefully analyze and compare the results
• The interpretation of the results must be based on:
  – insight into the meaning of the original data
  – knowledge of the algorithms used
• Clustering is highly application dependent and, to a certain extent, subjective (personal preferences)
Road map
• Basic concepts
• K-means algorithm
• Representation of clusters
• Hierarchical clustering
• Distance functions
• Data standardization
• Handling mixed attributes
• Which clustering algorithm to use?
• Cluster evaluation
• Summary
Cluster evaluation: a hard problem
• The quality of a clustering is very hard to evaluate because
  – we do not know the correct clusters
• Some methods are used:
  – User inspection: a panel of experts inspects the resulting clusters and scores them
    ◦ Study centroids and spreads
    ◦ Examine rules (e.g., from a decision tree) that describe the clusters
    ◦ For text documents, one can inspect by reading the documents
  – The final score is the average of the individual scores
• Manual inspection is labor intensive and time consuming
Cluster evaluation: ground truth
• We use some labeled data (built for classification)
  – Assumption: each class is a cluster
• Let the classes in the data D be C = (c1, c2, …, ck)
  – The clustering method produces k clusters, which divide D into k disjoint subsets D1, D2, …, Dk
• After clustering, a confusion matrix is constructed
  – From the matrix, we compute various measurements: entropy, purity, precision, recall, and F-score
Evaluation measures: entropy
• For each cluster Di, we can measure its entropy as

$$entropy(D_i) = -\sum_{j=1}^{k} \Pr{}_i(c_j)\,\log_2 \Pr{}_i(c_j)$$

  – Pr_i(cj): proportion of class cj in cluster Di
• The entropy of the whole clustering is

$$entropy_{total}(D) = \sum_{i=1}^{k} \frac{|D_i|}{|D|}\, entropy(D_i)$$

  – |Di|/|D| is the weight of cluster Di, proportional to its size
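For instance, a minimal sketch in Python, computing both quantities from per-class counts in each cluster (the count-vector representation is an assumption made for illustration):

```python
import math

def cluster_entropy(class_counts):
    """Entropy of one cluster, given its per-class counts."""
    n = sum(class_counts)
    # by convention, a term with Pr_i(c_j) = 0 contributes 0
    return -sum((c / n) * math.log2(c / n) for c in class_counts if c > 0)

def total_entropy(clusters):
    """Size-weighted sum of cluster entropies; `clusters` is a list of
    per-class count vectors, one per cluster."""
    n = sum(sum(c) for c in clusters)
    return sum(sum(c) / n * cluster_entropy(c) for c in clusters)
```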
Evaluation measures: purity
• Purity measures the extent to which a cluster contains only one class of data

$$purity(D_i) = \max_{j} \Pr{}_i(c_j)$$

• The purity of the whole clustering is

$$purity_{total}(D) = \sum_{i=1}^{k} \frac{|D_i|}{|D|}\, purity(D_i)$$

  – |Di|/|D| is the weight of cluster Di, proportional to its size
• Precision, recall, and F-measure can be computed as well, based on the class that is most frequent in the cluster
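Purity can be computed from the same per-class counts as in the entropy sketch above (the counts in the comment are hypothetical):

```python
def cluster_purity(class_counts):
    """Fraction of the cluster that belongs to its most frequent class."""
    return max(class_counts) / sum(class_counts)

def total_purity(clusters):
    """Size-weighted sum of cluster purities."""
    n = sum(sum(c) for c in clusters)
    return sum(sum(c) / n * cluster_purity(c) for c in clusters)

# Hypothetical run: three clusters over three classes
# total_purity([[45, 3, 2], [4, 40, 6], [1, 5, 44]]) -> 0.86
```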
An example
• We can use the total entropy or total purity to compare:
  – different clustering results from the same algorithm
  – different algorithms
• Precision, recall, and F-measure can be computed as well for each cluster, based on its most frequent class
  – In the example, the precision of Science in cluster 1 is 0.89 and the recall is 0.83; the F-measure is thus 0.86 (see the check below)
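As a quick check, the F-measure is the harmonic mean of precision and recall:

$$F = \frac{2PR}{P+R} = \frac{2 \times 0.89 \times 0.83}{0.89 + 0.83} \approx 0.86$$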
A remark about ground-truth evaluation
• Commonly used to compare different clustering algorithms
• A real-life data set for clustering has no class labels
  – Thus, although an algorithm may perform very well on some labeled data sets, there is no guarantee that it will perform well on the actual application data at hand
• The fact that it performs well on some labeled data sets does, however, give us some confidence in the quality of the algorithm
• This evaluation method is said to be based on external data or information
Evaluation based on internal information
• Intra-cluster cohesion (compactness):
  – Cohesion measures how near the data points in a cluster are to the cluster centroid
  – The sum of squared error (SSE) is a commonly used measure (see the sketch after this list)
• Inter-cluster separation (isolation):
  – Separation means that different cluster centroids should be far away from one another
• In most applications, expert judgments are still the key
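As an illustration, here is a minimal NumPy sketch of both internal measures; the function names and interfaces are mine, not from the slides:

```python
import numpy as np

def sse(X, labels, centroids):
    """Intra-cluster cohesion: sum of squared distances of each point
    to the centroid of the cluster it is assigned to (lower = tighter)."""
    labels = np.asarray(labels)
    return sum(np.sum((X[labels == i] - c) ** 2)
               for i, c in enumerate(centroids))

def min_separation(centroids):
    """Inter-cluster separation: smallest pairwise centroid distance
    (larger = better isolated clusters)."""
    C = np.asarray(centroids)
    d = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=-1)
    return d[np.triu_indices(len(C), k=1)].min()
```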
Indirect evaluation
• In some applications, clustering is not the primary task, but is used to help perform another task
• We can use the performance on the primary task to compare clustering methods
• For instance, suppose the primary task of an application is to provide book-purchase recommendations to online shoppers
  – If we can cluster shoppers according to their features, we might be able to provide better recommendations
  – We can evaluate different clustering algorithms based on how well they help with the recommendation task
  – Here, we assume that the recommendations can be reliably evaluated
Road map
• Basic concepts
• K-means algorithm
• Representation of clusters
• Hierarchical clustering
• Distance functions
• Data standardization
• Handling mixed attributes
• Which clustering algorithm to use?
• Cluster evaluation
• Summary
Summary
• Clustering has a long history and is still an active research area
  – There are a huge number of clustering algorithms
  – More are still coming every year
• We only introduced several main algorithms. There are many others, e.g.:
  – density-based algorithms, sub-space clustering, scale-up methods, neural-network-based methods, fuzzy clustering, co-clustering, etc.
• Clustering is hard to evaluate, but very useful in practice
  – This partially explains why a large number of clustering algorithms are still being devised every year
• Clustering is highly application dependent and to some extent subjective
Reinforcement Learning
These slides are adapted from slides by Tom Mitchell, modified by Liviu Ciortuz
Introduction
• Supervised learning is the simplest and most studied type of learning
• How can an agent learn behaviors when it doesn't have a teacher to tell it how to perform?
  – The agent has a task to perform
  – It takes some actions in the world
  – At some later point, it gets feedback telling it how well it did on performing the task
  – The agent performs the same task over and over again
• This problem is called reinforcement learning:
  – The agent gets positive reward for tasks done well
  – The agent gets negative reward for tasks done poorly
Introduction (cont…)
• The goal is to get the agent to act in the world so as to maximize its rewards
• The agent has to figure out what it did that made it get the reward/punishment
  – This is known as the credit assignment problem
• Reinforcement learning can be used to train computers to do many tasks, such as:
  – playing board games
  – job-shop scheduling
  – controlling robots
  – flight/taxi scheduling
  – …
Overview
• Task: control learning
  – make an autonomous agent (robot) perform actions, observe their consequences, and learn a control strategy
• The Q-learning algorithm
  – acquires optimal control strategies from delayed rewards, even when the agent has no prior knowledge of the effects of its actions on the environment (a sketch follows the problem definition below)
• Reinforcement learning is related to dynamic programming, which is used to solve optimization problems
  – While DP assumes that the agent/program knows the effects (and rewards) of all its actions, in RL the agent has to experiment in the real world
Reinforcement Learning Problem
• Example: playing Backgammon (TD-Gammon [Tesauro, 1995]); immediate reward:
  – +100 if win
  – -100 if lose
  – 0 otherwise
• Target function to learn: a policy

$$\pi : S \rightarrow A$$

• Goal: maximize the discounted cumulative reward

$$r_0 + \gamma r_1 + \gamma^2 r_2 + \ldots \qquad \text{where } 0 \leq \gamma < 1$$
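To make this setup concrete, here is a minimal sketch of tabular Q-learning in Python, using the standard update Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') − Q(s,a)]. The environment interface (`env.reset()` returning a state, `env.step(a)` returning `(next_state, reward, done)`) and the hyperparameter values are illustrative assumptions, not from the slides:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                     # Q[(state, action)] -> value
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection: mostly exploit, sometimes explore
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)
            # move Q(s,a) toward the delayed-reward target
            target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```

The greedy policy read off the table, π(s) = argmax_a Q(s, a), approximates the target function π : S → A above.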
Control learning characteristics
Learning Sequential Control Strategies Using Markov Decision Processes
Agent’s Learning Task