Clustering Lecture


Transcript of Clustering Lecture


    Unsupervised Clustering

Clustering is a very general problem that appears in many different settings (not necessarily in a data mining context):

Grouping “similar products” together to improve the efficiency of a production line

Packing “similar items” into containers

Grouping “similar customers” together

Grouping “similar stocks” together

The Similarity Concept

Obviously, the concept of similarity is key to clustering. Using similarity definitions that are specific to a domain may generate more acceptable clusters. E.g., products that require the same or similar tools/processes in the production line are similar.

Articles that are in the course pack of the same course are similar.

General similarity measures are required for general-purpose algorithms.

Clustering: The K-Means Algorithm (Lloyd, 1982)

1. Choose a value for K, the total number of clusters.

2. Randomly choose K points as cluster centers.

3. Assign the remaining instances to their closest cluster center.

4. Calculate a new cluster center for each cluster.

5. Repeat steps 3–4 until the cluster centers do not change.

(A minimal code sketch of these steps follows below.)
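Below is a minimal NumPy sketch of the five steps above (Lloyd's algorithm). It is illustrative only; the function name, the seed handling, and the tiny example data set are my own choices rather than part of the lecture.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly choose K points as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each instance to its closest cluster center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recalculate each cluster center as the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 5: stop when the cluster centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Tiny two-dimensional example with K = 2
X = np.array([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5],
              [2.0, 3.5], [3.0, 2.5], [5.0, 6.0]])
print(kmeans(X, k=2))
```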

Distance Measure

The similarity is captured by a distance measure in this algorithm. The originally proposed measure of distance is the Euclidean distance:

$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}, \qquad X = (x_1, x_2, \ldots, x_n),\; Y = (y_1, y_2, \ldots, y_n)$$
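A quick worked example (the two points are chosen arbitrarily, not taken from the slides): for $X = (1.0, 1.5)$ and $Y = (1.0, 4.5)$,

$$d(X, Y) = \sqrt{(1.0-1.0)^2 + (1.5-4.5)^2} = \sqrt{9} = 3.0$$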

An Example Using K-Means

Table 3.6 • K-Means Input Values

Instance    X      Y
1           1.0    1.5
2           1.0    4.5
3           2.0    1.5
4           2.0    3.5
5           3.0    2.5
6           5.0    6.0

[Figure: the six instances of Table 3.6 plotted in two dimensions (x vs. f(x)), labeled 1–6.]

Table 3.7 • Several Applications of the K-Means Algorithm (K = 2)

Outcome    Cluster Centers    Cluster Points    Squared Error
1          (2.67, 4.67)       2, 4, 6           14.50
           (2.00, 1.83)       1, 3, 5
2          (1.5, 1.5)         1, 3              15.94
           (2.75, 4.125)      2, 4, 5, 6
3          (1.8, 2.7)         1, 2, 3, 4, 5      9.60
           (5, 6)             6
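As a check (this calculation is mine, not part of the original slides), the squared error for Outcome 3 can be recomputed from the Table 3.6 coordinates and the center (1.8, 2.7):

$$SSE = (0.8^2 + 1.2^2) + (0.8^2 + 1.8^2) + (0.2^2 + 1.2^2) + (0.2^2 + 0.8^2) + (1.2^2 + 0.2^2) + 0 = 2.08 + 3.88 + 1.48 + 0.68 + 1.48 = 9.60$$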

[Figure: the six instances plotted again (x vs. f(x)).]


    General Considerations

Works best when the clusters in the data are of approximately equal size.

Attribute significance cannot be determined.

Lacks explanation capabilities.

Requires real-valued data. Categorical data can be converted to real values, but the distance function needs to be worked out carefully.

We must select the number of clusters present in the data.

Data normalization may be required if attribute ranges vary significantly.

Alternative distance measures may generate different clusters.

K-means Clustering

Partitional clustering approach.

Each cluster is associated with a centroid (center point).

Each point is assigned to the cluster with the closest centroid.

K-means Clustering – Details

Initial centroids are often chosen randomly. Clusters produced vary from one run to another.

The centroid is (typically) the mean of the points in the cluster.

“Closeness” is measured by Euclidean distance, cosine similarity, correlation, etc.

K-means will converge for the common similarity measures mentioned above. Most of the convergence happens in the first few iterations, so the stopping condition is often changed to “until relatively few points change clusters.”

Complexity is O(n × K × I × d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes.

Two Different K-means Clusterings

[Figure: three panels (Original Points, Optimal Clustering, Sub-optimal Clustering) showing the same data grouped two different ways.]

Importance of Choosing Initial Centroids

[Figure: snapshots of K-means Iterations 1 through 6 for one choice of initial centroids.]

Importance of Choosing Initial Centroids

[Figure: K-means Iterations 1 through 6, shown as separate panels.]

Evaluating K-means Clusters

The most common measure is the Sum of Squared Error (SSE). For each point, the error is the distance to the nearest cluster center; to get the SSE, we square these errors and sum them:

$$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist(m_i, x)^2$$

Here $x$ is a data point in cluster $C_i$ and $m_i$ is the representative point for cluster $C_i$; one can show that $m_i$ corresponds to the center (mean) of the cluster.

Given two clusterings, we can choose the one with the smaller error.
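A short NumPy helper for the SSE above, applied to Outcome 3 of Table 3.7. This is a sketch of my own, not code from the lecture:

```python
import numpy as np

def sse(X, labels, centers):
    # Sum over clusters of squared Euclidean distances to each cluster's center
    return sum(np.sum((X[labels == i] - m) ** 2) for i, m in enumerate(centers))

# Outcome 3 of Table 3.7: instances 1-5 in one cluster, instance 6 alone
X = np.array([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5],
              [2.0, 3.5], [3.0, 2.5], [5.0, 6.0]])
labels = np.array([0, 0, 0, 0, 0, 1])
centers = np.array([X[labels == 0].mean(axis=0),   # (1.8, 2.7)
                    X[labels == 1].mean(axis=0)])  # (5.0, 6.0)
print(sse(X, labels, centers))                     # -> 9.6, matching the table
```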

Importance of Choosing Initial Centroids …

[Figure: K-means Iterations 1 through 5 for a different choice of initial centroids.]

Importance of Choosing Initial Centroids …

[Figure: K-means Iterations 1 through 5.]

Problems with Selecting Initial Points

If there are K “real” clusters, then the chance of selecting one centroid from each cluster is small. The chance is relatively small when K is large.

If clusters are the same size, n, then the probability is K!·n^K / (Kn)^K = K!/K^K.

For example, if K = 10, then probability = 10!/10^10 = 0.00036.

Sometimes the initial centroids will readjust themselves in the “right” way, and sometimes they don’t.

Consider an example of five pairs of clusters.
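A one-liner to confirm the arithmetic above (my own check, not from the slides):

```python
from math import factorial

K = 10
print(factorial(K) / K**K)   # 0.00036288, i.e. roughly 0.00036
```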

Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree.

Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.

[Figure: nested clusters of six points and the corresponding dendrogram, with merge heights between 0 and 0.2.]

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters: any desired number of clusters can be obtained by “cutting” the dendrogram at the proper level.

They may correspond to meaningful taxonomies, e.g., in the biological sciences (animal kingdom, phylogeny reconstruction, …).

Hierarchical Clustering

Two main types of hierarchical clustering:

Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.

Divisive: start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters).

Traditional hierarchical algorithms use a similarity or distance matrix, and merge or split one cluster at a time.

Agglomerative Clustering Algorithm

The more popular hierarchical clustering technique. The basic algorithm is straightforward:

1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the proximity matrix
6. Until only a single cluster remains

The key operation is the computation of the proximity of two clusters. Different approaches to defining the distance between clusters distinguish the different algorithms. (A short SciPy sketch follows below.)
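A minimal sketch of agglomerative clustering using SciPy's hierarchical-clustering utilities, run on the Table 3.6 points (the data choice and the cut at two clusters are my own, not from the lecture):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5],
              [2.0, 3.5], [3.0, 2.5], [5.0, 6.0]])

# Build the merge tree; 'single' = MIN linkage ('complete', 'average', 'ward' also work)
Z = linkage(X, method="single", metric="euclidean")

# Cut the dendrogram to obtain, e.g., two clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```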

Starting Situation

Start with clusters of individual points and a proximity matrix.

[Figure: a row of individual points and the corresponding initial proximity matrix.]

Intermediate Situation

After some merging steps, we have some clusters.

[Figure: clusters C1–C5 and the corresponding proximity matrix.]

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

[Figure: clusters C1–C5, with C2 and C5 highlighted for merging, and the proximity matrix.]

After Merging

The question is: “How do we update the proximity matrix?”

[Figure: the merged cluster C2 ∪ C5 alongside C1, C3, and C4; the proximity-matrix entries involving C2 ∪ C5 are marked “?”.]

How to Define Inter-Cluster Similarity

[Figure: two clusters with “Similarity?” between them, next to the proximity matrix.]

MIN
MAX
Group Average
Distance Between Centroids
Other methods driven by an objective function (Ward's Method uses squared error)

(A small sketch of the first three definitions follows below.)
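A small sketch of the three pairwise-based definitions, for two clusters given as index lists into a precomputed distance matrix D (the function and variable names are mine, for illustration only):

```python
import numpy as np

def inter_cluster_distance(D, a, b, how="min"):
    pair = D[np.ix_(a, b)]        # all pairwise distances between the two clusters
    if how == "min":              # MIN / single link
        return pair.min()
    if how == "max":              # MAX / complete link
        return pair.max()
    if how == "average":          # group average
        return pair.mean()
    raise ValueError(f"unknown linkage: {how}")
```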


Cluster Similarity: MIN or Single Link

Similarity of two clusters is based on the two most similar (closest) points in the different clusters. Determined by one pair of points, i.e., by one link in the proximity graph.

       I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00

Hierarchical Clustering: MIN

[Figure: Nested Clusters and Dendrogram produced by MIN (single link).]

Strength of MIN

[Figure: Original Points and the resulting Two Clusters.]

Can handle non-elliptical shapes.

Limitations of MIN

[Figure: Original Points and the resulting Two Clusters.]

Sensitive to noise and outliers.

Cluster Similarity: MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters. Determined by all pairs of points in the two clusters.

       I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00

Hierarchical Clustering: MAX

[Figure: Nested Clusters and Dendrogram produced by MAX (complete link).]

Strength of MAX

[Figure: Original Points and the resulting Two Clusters.]

Less susceptible to noise and outliers.

Limitations of MAX

[Figure: Original Points and the resulting Two Clusters.]

Tends to break large clusters.

Biased towards globular clusters.

Cluster Similarity: Group Average

Proximity of two clusters is the average of pairwise proximity between points in the two clusters.

Hierarchical Clustering: Group Average

[Figure: Nested Clusters and Dendrogram produced by Group Average.]

Hierarchical Clustering: Group Average

Compromise between Single and Complete Link.

Strengths: less susceptible to noise and outliers.

Limitations: biased towards globular clusters.

Cluster Similarity: Ward's Method

Similarity of two clusters is based on the increase in squared error when the two clusters are merged. Similar to group average if the distance between points is the squared distance.

Less susceptible to noise and outliers.

Biased towards globular clusters.

Hierarchical analogue of K-means; can be used to initialize K-means. (A short SciPy example follows below.)
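Ward's method is available through the same SciPy call used earlier; a sketch, with the data set again being my own choice:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5],
              [2.0, 3.5], [3.0, 2.5], [5.0, 6.0]])

# At each step, merge the pair of clusters that yields the smallest
# increase in total within-cluster squared error
Z = linkage(X, method="ward")
print(fcluster(Z, t=2, criterion="maxclust"))   # e.g., cut into two clusters
```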

Hierarchical Clustering: Comparison

[Figure: the same data clustered with MIN, MAX, Group Average, and Ward's Method, shown side by side.]

Hierarchical Clustering: Time and Space Requirements

O(N²) space, since it uses the proximity matrix (N is the number of points).

O(N³) time in many cases: there are N merge steps, and at each step the proximity matrix must be searched and updated.

Hierarchical Clustering: Problems and Limitations

Once a decision is made to combine two clusters, it cannot be undone.