9977 Cluster
Transcript of 9977 Cluster
-
8/16/2019 9977 Cluster
1/38
What is clustering?
• A grouping of data objects such that the objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups
• Inter-cluster distances are maximized; intra-cluster distances are minimized
Outliers
• Outliers are objects that do not belong to any cluster or form clusters of very small cardinality
• In some applications we are interested in discovering outliers, not clusters (outlier analysis)
[Figure: a cluster of points with a few outliers marked off to the side]
Why do we cluster?
• Clustering: given a collection of data objects, group them so that
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Clustering results are used:
– As a stand-alone tool to get insight into data distribution
• Visualization of clusters may unveil important information
– As a preprocessing step for other algorithms
• Efficient indexing or compression often relies on clustering
The clustering task
• Group observations so that the observations belonging in the same group are similar, whereas observations in different groups are different
• Basic questions:
– What does "similar" mean?
– What is a good partition of the objects? I.e., how is the quality of a solution measured?
– How do we find a good partition of the observations?
Observations to cluster
• Usually data objects consist of a set of attributes (also known as dimensions)
• Real-valued attributes/variables
– e.g., salary, height
• Binary attributes
– e.g., gender (M/F), has_cancer (T/F)
• Nominal (categorical) attributes
– e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)
• Ordinal/ranked attributes
– e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)
Observations to cluster
• If all d dimensions are real-valued, then we can visualize each data point as a point in a d-dimensional space
• If all d dimensions are binary, then we can think of each data point as a binary vector
Distance functions
• The distance d(x, y) between two objects x and y is a metric if
– d(i, j) ≥ 0 (non-negativity)
– d(i, i) = 0 (isolation)
– d(i, j) = d(j, i) (symmetry)
– d(i, j) ≤ d(i, h) + d(h, j) (triangular inequality)
• The definitions of distance functions are usually different for real, boolean, categorical, and ordinal variables.
• Weights may be associated with different variables based on applications and data semantics.
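The four axioms above can be checked mechanically. A minimal sketch (our own illustration, not the lecturer's code; the helper `d` is a hypothetical Euclidean distance) that verifies each property on a small sample:

```python
import math

# Hypothetical helper: Euclidean distance between two points given as tuples.
def d(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

for i in points:
    assert d(i, i) == 0                  # isolation
    for j in points:
        assert d(i, j) >= 0              # non-negativity
        assert d(i, j) == d(j, i)        # symmetry
        for h in points:
            # triangular inequality (tiny slack for float rounding)
            assert d(i, j) <= d(i, h) + d(h, j) + 1e-9

print("Euclidean distance satisfies all four metric axioms on this sample")
```

Squared-Euclidean distance, by contrast, would fail the triangular inequality, which is why the choice of distance matters to the algorithms later in the deck.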
Data structures
• Data matrix (n tuples/objects × d attributes/dimensions):

    [ x_11 ... x_1f ... x_1d ]
    [ ...              ...   ]
    [ x_i1 ... x_if ... x_id ]
    [ ...              ...   ]
    [ x_n1 ... x_nf ... x_nd ]

• Distance matrix (n × n, objects × objects; symmetric with zero diagonal):

    [ 0                        ]
    [ d(2,1)  0                ]
    [ d(3,1)  d(3,2)  0        ]
    [ ...     ...              ]
    [ d(n,1)  d(n,2)  ...    0 ]
Distance functions for binary vectors
• Jaccard similarity between binary vectors x and y:

    JSim(x, y) = |x ∩ y| / |x ∪ y|

• Jaccard distance between binary vectors x and y:

    JDist(x, y) = 1 − JSim(x, y)
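The two formulas above can be sketched directly in Python (a minimal illustration with made-up example vectors, treating each binary vector as the set of positions where it is 1):

```python
# Jaccard similarity: |x ∩ y| / |x ∪ y| over the positions set to 1.
def jaccard_sim(x, y):
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)    # |x ∩ y|
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)   # |x ∪ y|
    return both / either if either else 1.0

# Jaccard distance is one minus the similarity.
def jaccard_dist(x, y):
    return 1.0 - jaccard_sim(x, y)

x = [1, 0, 1, 1, 0]
y = [1, 1, 0, 1, 0]
print(jaccard_sim(x, y))   # 2 shared ones / 4 positions set in either → 0.5
print(jaccard_dist(x, y))  # 1 − 0.5 → 0.5
```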
Distance functions for real-valued vectors
• L_p norms or Minkowski distance:

    L_p(x, y) = ( |x_1 − y_1|^p + |x_2 − y_2|^p + ... + |x_d − y_d|^p )^(1/p)
              = ( Σ_{i=1..d} |x_i − y_i|^p )^(1/p)

where p is a positive integer
• If p = 1, L_1 is the Manhattan (or city block) distance:

    L_1(x, y) = |x_1 − y_1| + |x_2 − y_2| + ... + |x_d − y_d| = Σ_{i=1..d} |x_i − y_i|
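The Minkowski family can be written as one function of p (a minimal sketch with made-up example vectors, not the lecturer's code):

```python
def minkowski(x, y, p):
    """L_p distance between two real-valued vectors of equal length."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

x, y = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(minkowski(x, y, 1))  # Manhattan: |1−4| + |2−6| + |3−3| = 7.0
print(minkowski(x, y, 2))  # Euclidean: sqrt(9 + 16 + 0) = 5.0
```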
Distance functions for real-valued vectors
• If p = 2, L_2 is the Euclidean distance:

    d(x, y) = sqrt( |x_1 − y_1|² + |x_2 − y_2|² + ... + |x_d − y_d|² )

• Also one can use weighted distance:

    d(x, y) = sqrt( w_1|x_1 − y_1|² + w_2|x_2 − y_2|² + ... + w_d|x_d − y_d|² )

    d(x, y) = w_1|x_1 − y_1| + w_2|x_2 − y_2| + ... + w_d|x_d − y_d|
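The weighted variant can be sketched as follows (our own illustration; the weight vector here is a made-up example showing how a weight of 0 drops an attribute entirely):

```python
import math

def weighted_euclidean(x, y, w):
    """Weighted L2 distance; w_i scales the contribution of attribute i."""
    return math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))

x, y = [0.0, 0.0], [3.0, 4.0]
print(weighted_euclidean(x, y, [1.0, 1.0]))  # plain Euclidean: 5.0
print(weighted_euclidean(x, y, [1.0, 0.0]))  # second attribute ignored: 3.0
```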
Algorithms: basic concept
• Construct a partition of a set of n objects into a set of k clusters
– Hierarchical clustering
• Single linkage
• Complete linkage
• Average linkage
– Partitioning clustering
• k-means
The k-means problem
• Given a set of n points in a d-dimensional space and an integer k
• Task: choose a set of k points {c_1, c_2, ..., c_k} in the d-dimensional space to form clusters {C_1, C_2, ..., C_k} such that

    Cost(C) = Σ_{i=1..k} Σ_{x ∈ C_i} L_2²(x − c_i)

is minimized
• Some special cases: k = 1, k = n
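The cost function above can be sketched directly (a minimal illustration with made-up clusters and centers, not the lecturer's code):

```python
# k-means cost: sum over clusters of squared L2 distances from each
# point to its assigned cluster center.
def kmeans_cost(clusters, centers):
    total = 0.0
    for pts, c in zip(clusters, centers):
        for x in pts:
            total += sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return total

clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0)]]
centers = [(1.0, 0.0), (10.0, 0.0)]
print(kmeans_cost(clusters, centers))  # 1 + 1 + 0 = 2.0
```

For k = n every point can be its own center, making the cost 0, which is the special case the next slides return to.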
The k-means algorithm
• One way of solving the k-means problem
• Randomly pick k cluster centers {c_1, ..., c_k}
• For each i, set the cluster C_i to be the set of points that are closer to c_i than they are to c_j for all j ≠ i
• For each i, let c_i be the center of cluster C_i (mean of the vectors in C_i)
• Repeat until convergence
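The steps above (Lloyd's iteration) can be sketched in Python. This is our own minimal illustration, not the lecturer's code; the empty-cluster handling and the fixed seed are our assumptions:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Sketch of the k-means algorithm: random centers, assign, recompute."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)               # randomly pick k centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its closest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centers[j])))
            clusters[i].append(p)
        # Update step: each center becomes the mean of its cluster.
        new_centers = []
        for cl, old in zip(clusters, centers):
            if cl:
                new_centers.append(tuple(sum(c) / len(cl) for c in zip(*cl)))
            else:
                new_centers.append(old)           # keep an emptied center
        if new_centers == centers:                # converged: centers stable
            break
        centers = new_centers
    return centers, clusters

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
centers, clusters = kmeans(pts, 2)
print(centers)
```

Note that the result depends on the random initialization, which is exactly the weakness the next slides discuss.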
k-means algorithm
• Finds a local optimum
• Often converges quickly (but not always)
• The choice of initial points can have a large influence
– Clusters of different densities
– Clusters of different sizes
• Outliers can also cause a problem (Example?)
Some alternatives to random initialization of the central points
• Multiple runs
– Helps, but probability is not on our side
• Select the original set of points by methods other than random, e.g., pick the most distant (from each other) points as cluster centers (k-means++ algorithm)
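A seeding procedure in the spirit of k-means++ can be sketched as follows (our own simplified illustration, not the canonical implementation: after the first random center, each new center is drawn with probability proportional to its squared distance D(x)² from the nearest center chosen so far):

```python
import random

def kmeanspp_init(points, k, seed=0):
    """k-means++-style seeding sketch: favor points far from chosen centers."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]                # first center: uniform
    while len(centers) < k:
        # D(x)^2: squared distance from x to its nearest chosen center.
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
              for p in points]
        # Draw the next center with probability proportional to D(x)^2.
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers

pts = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0)]
print(kmeanspp_init(pts, 2))
```

With these example points, the isolated point at (10, 0) dominates the D(x)² weights, so it is very likely to be picked as the second center, which is the spread-out behavior the slide asks for.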
What is the right number of clusters?
• Or, who sets the value of k?
• For n points to be clustered, consider the case where k = n. What is the value of the error function?
• What happens when k = 1?
• Since we want to minimize the error, why don't we always select k = n?
Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
– A tree-like diagram that records the sequences of merges or splits
[Figure: six points grouped into nested clusters, with the corresponding dendrogram over points 1–6 and merge heights from 0 to 0.2]
Strengths of hierarchical clustering
• No assumptions on the number of clusters
– Any desired number of clusters can be obtained by "cutting" the dendrogram at the proper level
• Hierarchical clustering may correspond to meaningful taxonomies
Agglomerative clustering algorithm
• Most popular hierarchical clustering technique
• Basic algorithm:
1. Compute the distance matrix between the input data points
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the distance matrix
6. Until only a single cluster remains
• Key operation is the computation of the distance between two clusters
– Different definitions of the distance between clusters lead to different algorithms
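The basic algorithm above can be sketched naively in Python (our own illustration, not the lecturer's code; instead of maintaining an explicit distance matrix, this sketch recomputes cluster distances each round, and the `linkage` argument is the pluggable cluster-distance definition the slide mentions):

```python
# Naive agglomerative clustering following the slide's basic algorithm.
# linkage=min gives single link, linkage=max gives complete link.
def agglomerative(points, dist, linkage=min, num_clusters=1):
    clusters = [[p] for p in points]         # step 2: each point is a cluster
    while len(clusters) > num_clusters:      # step 6: stop condition
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Cluster distance: linkage over all cross-cluster pairs.
                d = linkage(dist(x, y)
                            for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best                       # step 4: merge the closest pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]                      # step 5: distances recomputed lazily
    return clusters

dist = lambda x, y: abs(x - y)
print(agglomerative([1.0, 1.2, 5.0, 5.1], dist, num_clusters=2))
# → [[1.0, 1.2], [5.0, 5.1]]
```

Recomputing pairwise distances every round is O(n³) or worse; real implementations keep and update the distance matrix as in steps 1 and 5, but the merge logic is the same.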
Input / Initial setting
• Start with clusters of individual points and a distance/proximity matrix
[Figure: five points p1–p5 and their distance/proximity matrix]
Intermediate state
• After some merging steps, we have some clusters
[Figure: clusters C1–C5 and their distance/proximity matrix]
Intermediate state
• Merge the two closest clusters (C2 and C5) and update the distance matrix.
[Figure: clusters C1–C5 with C2 and C5 about to be merged, and their distance/proximity matrix]
After merging
• "How do we update the distance matrix?"
[Figure: the merged cluster C2 ∪ C5, with its distances to C1, C3, and C4 marked "?" in the matrix]
Distance between two clusters
• Each cluster is a set of points
• How do we define distance between two sets of points?
– Lots of alternatives
– Not an easy task
Distance between two clusters
• Single-link distance between clusters C_i and C_j is the minimum distance between an object in C_i and an object in C_j
• The distance is defined by the two most similar objects:

    D_sl(C_i, C_j) = min{ d(x, y) | x ∈ C_i, y ∈ C_j }
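The single-link definition is a one-liner in Python (our own sketch with a made-up 1-D example, not the lecturer's code):

```python
# Single-link distance: minimum pairwise distance across the two clusters.
def single_link(Ci, Cj, dist):
    return min(dist(x, y) for x in Ci for y in Cj)

dist = lambda x, y: abs(x - y)
print(single_link([1.0, 2.0], [5.0, 9.0], dist))  # closest pair (2, 5) → 3.0
```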
Single-link clustering: example
• Determined by one pair of points, i.e., by one link in the proximity graph.
I1 I2 I3 I4 I5
I1 1.00 0.90 0.10 0.65 0.20
I2 0.90 1.00 0.70 0.60 0.50
I3 0.10 0.70 1.00 0.40 0.30
I4 0.65 0.60 0.40 1.00 0.80
I5 0.20 0.50 0.30 0.80 1.00
Single-link clustering: example
[Figure: nested single-link clusters over points 1–6 and the corresponding dendrogram with merge heights from 0 to 0.2]
Distance between two clusters
• Complete-link distance between clusters C_i and C_j is the maximum distance between an object in C_i and an object in C_j
• The distance is defined by the two most dissimilar objects:

    D_cl(C_i, C_j) = max{ d(x, y) | x ∈ C_i, y ∈ C_j }
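Mirroring the single-link sketch, complete link just swaps min for max (our own illustration with a made-up 1-D example):

```python
# Complete-link distance: maximum pairwise distance across the two clusters.
def complete_link(Ci, Cj, dist):
    return max(dist(x, y) for x in Ci for y in Cj)

dist = lambda x, y: abs(x - y)
print(complete_link([1.0, 2.0], [5.0, 9.0], dist))  # farthest pair (1, 9) → 8.0
```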
Complete-link clustering: example
• Distance between clusters is determined by the two most distant points in the different clusters
I1 I2 I3 I4 I5
I1 1.00 0.90 0.10 0.65 0.20
I2 0.90 1.00 0.70 0.60 0.50
I3 0.10 0.70 1.00 0.40 0.30
I4 0.65 0.60 0.40 1.00 0.80
I5 0.20 0.50 0.30 0.80 1.00
Complete-link clustering: example
[Figure: nested complete-link clusters over points 1–6 and the corresponding dendrogram with merge heights from 0 to 0.4]
Distance between two clusters
• Group average distance between clusters C_i and C_j is the average distance between an object in C_i and an object in C_j:

    D_avg(C_i, C_j) = (1 / (|C_i| · |C_j|)) Σ_{x ∈ C_i, y ∈ C_j} d(x, y)
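The group-average formula can likewise be sketched directly (our own illustration with a made-up 1-D example, not the lecturer's code):

```python
# Group-average distance: mean of all cross-cluster pairwise distances.
def group_average(Ci, Cj, dist):
    return sum(dist(x, y) for x in Ci for y in Cj) / (len(Ci) * len(Cj))

dist = lambda x, y: abs(x - y)
print(group_average([1.0, 2.0], [5.0, 9.0], dist))  # (4 + 8 + 3 + 7) / 4 = 5.5
```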
Average-link clustering: example
• Proximity of two clusters is the average of pairwise proximity between points in the two clusters.

I1 I2 I3 I4 I5
I1 1.00 0.90 0.10 0.65 0.20
I2 0.90 1.00 0.70 0.60 0.50
I3 0.10 0.70 1.00 0.40 0.30
I4 0.65 0.60 0.40 1.00 0.80
I5 0.20 0.50 0.30 0.80 1.00
Average-link clustering: example
[Figure: nested average-link clusters over points 1–6 and the corresponding dendrogram with merge heights from 0 to 0.25]
Average-link clustering
• Compromise between single and complete link
• Strengths
– Less susceptible to noise and outliers
Thank you