9977 Cluster

  • 8/16/2019 9977 Cluster

    1/38

What is clustering?

• A grouping of data objects such that the objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups
• Inter-cluster distances are maximized
• Intra-cluster distances are minimized

Outliers

• Outliers are objects that do not belong to any cluster or form clusters of very small cardinality
• In some applications we are interested in discovering outliers, not clusters (outlier analysis)

[Figure: a cluster and outliers]

Why do we cluster?

• Clustering: given a collection of data objects, group them so that they are
  – Similar to one another within the same cluster
  – Dissimilar to the objects in other clusters
• Clustering results are used:
  – As a stand-alone tool to get insight into data distribution
    • Visualization of clusters may unveil important information
  – As a preprocessing step for other algorithms
    • Efficient indexing or compression often relies on clustering

The clustering task

• Group observations so that the observations belonging to the same group are similar, whereas observations in different groups are different
• Basic questions:
  – What does "similar" mean?
  – What is a good partition of the objects? I.e., how is the quality of a solution measured?
  – How to find a good partition of the observations?

Observations to cluster

• Usually data objects consist of a set of attributes (also known as dimensions)
• Real-valued attributes/variables
  – e.g., salary, height
• Binary attributes
  – e.g., gender (M/F), has_cancer (T/F)
• Nominal (categorical) attributes
  – e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)
• Ordinal/Ranked attributes
  – e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)

Observations to cluster

• If all d dimensions are real-valued, then we can visualize each data point as a point in a d-dimensional space
• If all d dimensions are binary, then we can think of each data point as a binary vector

Distance functions

• The distance d(x, y) between two objects x and y is a metric if
  – d(i, j) ≥ 0 (non-negativity)
  – d(i, i) = 0 (isolation)
  – d(i, j) = d(j, i) (symmetry)
  – d(i, j) ≤ d(i, h) + d(h, j) (triangular inequality)
• The definitions of distance functions are usually different for real, boolean, categorical, and ordinal variables.
• Weights may be associated with different variables based on applications and data semantics.
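
The four metric conditions can be checked numerically on a finite sample of points. The sketch below is only an illustration: `is_metric_on` is a hypothetical helper, not something from the slides. Passing on a sample does not prove that a function is a metric, but a failure shows that it is not.

```python
from itertools import product

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def squared_euclidean(x, y):
    """Squared Euclidean distance; it violates the triangular inequality."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def is_metric_on(points, d, tol=1e-12):
    """Check the four metric conditions for d on every triple of sample points."""
    for i, j, h in product(points, repeat=3):
        if d(i, j) < 0:                        # non-negativity
            return False
        if d(i, i) != 0:                       # isolation
            return False
        if abs(d(i, j) - d(j, i)) > tol:       # symmetry
            return False
        if d(i, j) > d(i, h) + d(h, j) + tol:  # triangular inequality
            return False
    return True

sample = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
print(is_metric_on(sample, manhattan))          # True
print(is_metric_on(sample, squared_euclidean))  # False
```

The squared Euclidean distance fails because going from (0, 0) to (2, 0) costs 4, while the detour through (1, 0) costs only 1 + 1.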

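Data Structures

• Data matrix (rows: tuples/objects; columns: attributes/dimensions)

$$
\begin{pmatrix}
x_{11} & \cdots & x_{1d} \\
\vdots &        & \vdots \\
x_{i1} & \cdots & x_{id} \\
\vdots &        & \vdots \\
x_{n1} & \cdots & x_{nd}
\end{pmatrix}
$$

• Distance matrix (rows and columns: objects)

$$
\begin{pmatrix}
0      &        &        &   \\
d(2,1) & 0      &        &   \\
d(3,1) & d(3,2) & 0      &   \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{pmatrix}
$$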
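
As a rough illustration of how the two structures relate, the sketch below derives a distance matrix from a small data matrix. The function name `distance_matrix` and the sample data are assumptions made for this example; the Euclidean distance is supplied by `math.dist` from the Python standard library (Python 3.8+).

```python
import math

def distance_matrix(data, d):
    """Return the n x n matrix of pairwise distances for the rows of `data`."""
    n = len(data)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):            # fill the lower triangle: d(i, j) for j < i
            dist[i][j] = d(data[i], data[j])
            dist[j][i] = dist[i][j]   # mirror it, since d is symmetric
    return dist

# Data matrix: n = 3 objects (rows), d = 2 attributes (columns).
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
for row in distance_matrix(data, math.dist):
    print(row)
```

Only the lower triangle needs to be computed, which is why the matrix on the slide is shown in triangular form.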

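Distance functions for binary vectors

X   1 0 0 1 1 1
Y   0 1 1 0 1 0

• Jaccard similarity between binary vectors X and Y (each vector is read as the set of positions where it holds a 1):

$$ JSim(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} $$

• Jaccard distance between binary vectors X and Y:

$$ Jdist(X, Y) = 1 - JSim(X, Y) $$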
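
A minimal sketch of both quantities, applied to the two example vectors 100111 and 011010. The function names are illustrative, and returning similarity 1 for two all-zero vectors is an assumption the slides do not address.

```python
def jaccard_similarity(x, y):
    """|X ∩ Y| / |X ∪ Y|, reading each binary vector as the set of positions holding a 1."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    if either == 0:
        return 1.0  # assumption: two all-zero vectors are treated as identical
    return both / either

def jaccard_distance(x, y):
    """One minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(x, y)

x = [1, 0, 0, 1, 1, 1]
y = [0, 1, 1, 0, 1, 0]
print(jaccard_similarity(x, y))  # 1/6: one shared position out of six covered positions
print(jaccard_distance(x, y))    # 5/6
```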

Distance functions for real-valued vectors

• L_p norms or Minkowski distance:

$$ L_p(x, y) = \left( |x_1 - y_1|^p + |x_2 - y_2|^p + \cdots + |x_d - y_d|^p \right)^{1/p} = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p} $$

  where p is a positive integer

• If p = 1, L_1 is the Manhattan (or city block) distance:

$$ L_1(x, y) = |x_1 - y_1| + |x_2 - y_2| + \cdots + |x_d - y_d| = \sum_{i=1}^{d} |x_i - y_i| $$
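
The Minkowski formula translates almost directly into code. The sketch below uses a hypothetical function name and made-up vectors; it is meant only to show how p selects the distance.

```python
def minkowski(x, y, p):
    """L_p distance between two equal-length real-valued vectors, for integer p >= 1."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski(x, y, 1))  # Manhattan: 3 + 4 + 0 = 7
print(minkowski(x, y, 2))  # Euclidean: sqrt(9 + 16 + 0) = 5
```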

Distance functions for real-valued vectors

• If p = 2, L_2 is the Euclidean distance:

$$ d(x, y) = \sqrt{|x_1 - y_1|^2 + |x_2 - y_2|^2 + \cdots + |x_d - y_d|^2} $$

• Also one can use weighted distance:

$$ d(x, y) = \sqrt{w_1 |x_1 - y_1|^2 + w_2 |x_2 - y_2|^2 + \cdots + w_d |x_d - y_d|^2} $$

$$ d(x, y) = w_1 |x_1 - y_1| + w_2 |x_2 - y_2| + \cdots + w_d |x_d - y_d| $$
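
To see why weights matter, consider attributes on very different scales. The sketch below uses hypothetical records (salary in thousands, height in metres) and an illustrative function name; the weights are arbitrary choices, not values from the slides.

```python
import math

def weighted_euclidean(x, y, w):
    """sqrt(sum_i w_i * |x_i - y_i|^2) for non-negative weights w."""
    return math.sqrt(sum(wi * (a - b) ** 2 for a, b, wi in zip(x, y, w)))

# Hypothetical records: (salary in thousands, height in metres).
a, b = (50.0, 1.60), (52.0, 1.90)
print(weighted_euclidean(a, b, (1.0, 1.0)))  # dominated by the salary difference
print(weighted_euclidean(a, b, (0.0, 1.0)))  # height only, roughly 0.3
```

With unit weights the result is the plain Euclidean distance, so the salary gap of 2 swamps the height gap of 0.3.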

Algorithms: basic concept

• Construct a partition of a set of n objects into a set of k clusters
  – Hierarchical Clustering: Single Linkage, Complete Linkage, Average Linkage
  – Partitioning Clustering: K-means

The k-means problem

• Given a set X of n points in a d-dimensional space and an integer k
• Task: choose a set of k points {c_1, c_2, …, c_k} in the d-dimensional space to form clusters {C_1, C_2, …, C_k} such that

$$ Cost(C) = \sum_{i=1}^{k} \sum_{x \in C_i} L_2^2(x - c_i) $$

  is minimized
• Some special cases: k = 1, k = n
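
The cost is a double sum: over clusters, and within each cluster over its points. A small sketch with made-up clusters and centers (the name `kmeans_cost` is illustrative):

```python
def kmeans_cost(clusters, centers):
    """Sum over clusters of squared Euclidean distances from each point to its center."""
    total = 0.0
    for points, c in zip(clusters, centers):
        for x in points:
            total += sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return total

clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (10.0, 4.0)]]
centers = [(1.0, 0.0), (10.0, 2.0)]
print(kmeans_cost(clusters, centers))  # 10.0 = (1 + 1) + (4 + 4)
```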

The k-means algorithm

• One way of solving the k-means problem
• Randomly pick k cluster centers {c_1, …, c_k}
• For each i, set the cluster C_i to be the set of points in X that are closer to c_i than they are to c_j for all j ≠ i
• For each i, let c_i be the center of cluster C_i (mean of the vectors in C_i)
• Repeat until convergence
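
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the initial centers are drawn from the data points, ties go to the lower-indexed center, and an empty cluster keeps its previous center. All three are assumptions the slides leave open.

```python
import random

def sq_dist(x, c):
    """Squared Euclidean distance between two points."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    """Return (centers, clusters) after alternating assignment and update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # random initial centers taken from the data
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest center.
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda i: sq_dist(x, centers[i]))
            clusters[nearest].append(x)
        # Update step: each center moves to the mean of its cluster
        # (assumption: an empty cluster keeps its previous center).
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:           # convergence: the centers stopped moving
            break
        centers = new_centers
    return centers, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, clusters = kmeans(points, k=2)
print(sorted(centers))
```

On this toy data the two well-separated groups are recovered; on harder data the result depends on the initial centers.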

k-means algorithm

• Finds a local optimum
• Often converges quickly (but not always)
• The choice of initial points can have a large influence
  – Clusters of different densities
  – Clusters of different sizes
• Outliers can also cause a problem (Example?)

Some alternatives to random initialization of the central points

• Multiple runs
  – Helps, but probability is not on our side
• Select the original set of points by methods other than random. E.g., pick the most distant (from each other) points as cluster centers (the idea behind the kmeans++ algorithm)
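
One simple way to pick mutually distant centers is a greedy farthest-first rule, sketched below. Note the difference from k-means++ proper, which chooses each new center at random with probability proportional to its squared distance from the nearest center already chosen. The function name and the choice of the first point as the starting center are assumptions of this sketch.

```python
import math

def farthest_first_centers(points, k):
    """Pick k initial centers, each as far as possible from those already chosen."""
    centers = [points[0]]                     # assumption: start from the first point
    while len(centers) < k:
        # Distance from a point to its closest already-chosen center.
        def gap(x):
            return min(math.dist(x, c) for c in centers)
        centers.append(max(points, key=gap))
    return centers

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (10.0, 0.0), (10.0, 1.0)]
print(farthest_first_centers(points, 3))
# [(0.0, 0.0), (10.0, 1.0), (5.0, 5.0)]
```

The deterministic rule is sensitive to outliers, since an outlier is by definition far from everything; the randomized choice in k-means++ softens that.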

What is the right number of clusters?

• ...or who sets the value of k?
• For n points to be clustered, consider the case where k = n. What is the value of the error function?
• What happens when k = 1?
• Since we want to minimize the error, why don't we always select k = n?
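
The two extreme cases can be worked out on a toy data set. In the sketch below (function names and points are illustrative), k = 1 uses the mean of all points as the single center, and k = n makes every point its own center. The error at k = n is zero, yet that solution says nothing about how the data groups.

```python
def cost_single_cluster(points):
    """k = 1: the best single center is the mean of all points."""
    n = len(points)
    center = tuple(sum(c) / n for c in zip(*points))
    return sum(sum((xi - ci) ** 2 for xi, ci in zip(x, center)) for x in points)

def cost_one_cluster_per_point(points):
    """k = n: every point is its own center, so every term of the sum is zero."""
    return sum(sum((xi - ci) ** 2 for xi, ci in zip(x, x)) for x in points)

points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
print(cost_single_cluster(points))         # 8.0: each point is at squared distance 2 from (1, 1)
print(cost_one_cluster_per_point(points))  # 0.0
```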

Hierarchical Clustering

• Produces a set of nested clusters organized as a hierarchical tree
• Can be visualized as a dendrogram
  – A tree-like diagram that records the sequences of merges or splits

[Figure: dendrogram (leaf order 1, 3, 2, 5, 4, 6; merge heights from 0 to 0.2) next to the corresponding nested clusters of six points]

Strengths of Hierarchical Clustering

• No assumptions on the number of clusters
  – Any desired number of clusters can be obtained by "cutting" the dendrogram at the proper level
• Hierarchical clustering may correspond to meaningful taxonomies

Agglomerative clustering

Agglomerative clustering algorithm

• Most popular hierarchical clustering technique
• Basic algorithm
  1. Compute the distance matrix between the input data points
  2. Let each data point be a cluster
  3. Repeat
  4.   Merge the two closest clusters
  5.   Update the distance matrix
  6. Until only a single cluster remains
• Key operation is the computation of the distance between two clusters
  – Different definitions of the distance between clusters lead to different algorithms
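
The basic algorithm can be sketched with the cluster distance passed in as a function, which makes the last point concrete: swapping that function changes the algorithm. For brevity this sketch recomputes cluster distances on demand instead of maintaining an explicit distance matrix; the names `agglomerative` and `single_link`, and the one-dimensional sample points, are illustrative assumptions.

```python
import math

def single_link(a, b):
    """Minimum distance between a point of cluster a and a point of cluster b."""
    return min(math.dist(x, y) for x in a for y in b)

def agglomerative(points, cluster_distance=single_link):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[p] for p in points]          # step 2: every point starts as a cluster
    merges = []
    while len(clusters) > 1:                  # steps 3 and 6: repeat until one cluster is left
        # Step 4: find the closest pair of clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j], cluster_distance(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        # Step 5, simplified: no matrix is stored, distances are recomputed next round.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
for left, right, dist in agglomerative(points):
    print(left, right, dist)
```

The merge distances printed here (1, 1, 4, 14) are the heights at which the corresponding branches would join in a dendrogram.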

Input / Initial setting

• Start with clusters of individual points and a distance/proximity matrix

[Figure: points p1 to p5 and the distance/proximity matrix with rows and columns p1, p2, p3, p4, p5, ...]

Intermediate State

• After some merging steps, we have some clusters

[Figure: clusters C1 to C5 and the distance/proximity matrix with rows and columns C1, C2, C3, C4, C5]

Intermediate State

• Merge the two closest clusters (C2 and C5) and update the distance matrix.

[Figure: clusters C1 to C5 and the distance/proximity matrix with rows and columns C1, C2, C3, C4, C5]

After Merging

• "How do we update the distance matrix?"

[Figure: clusters C1, C3, C4 and the merged cluster C2 ∪ C5; the row and column for C2 ∪ C5 in the distance/proximity matrix are filled with question marks]

Distance between two clusters

• Each cluster is a set of points
• How do we define distance between two sets of points?
  – Lots of alternatives
  – Not an easy task

Distance between two clusters

• Single-link distance between clusters C_i and C_j is the minimum distance between any object in C_i and any object in C_j
• The distance is defined by the two most similar objects

$$ D_{sl}(C_i, C_j) = \min_{x, y} \{ d(x, y) \mid x \in C_i, y \in C_j \} $$
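
A short sketch of the single-link definition, using made-up clusters on a line. It also illustrates one consequence of the definition: after two clusters are merged, the single-link distance from the merged cluster to any other cluster is the smaller of the two previous distances, so a stored distance matrix can be updated without revisiting the points.

```python
import math

def single_link(ci, cj):
    """Minimum distance over all pairs (x, y) with x in ci and y in cj."""
    return min(math.dist(x, y) for x in ci for y in cj)

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(4.0, 0.0), (9.0, 0.0)]
c = [(6.0, 0.0)]
print(single_link(a, b))  # 3.0, set by the closest pair (1, 0) and (4, 0)

# Distance from the merged cluster a + b to c equals the smaller old distance.
print(single_link(a + b, c) == min(single_link(a, c), single_link(b, c)))  # True
```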

Single-link clustering: example

• Determined by one pair of points, i.e., by one link in the proximity graph.

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00

Single-link clustering: example

[Figure: nested clusters of six points (left) and the corresponding dendrogram (right; leaf order 3, 6, 2, 5, 4, 1; merge heights from 0 to 0.2)]

Distance between two clusters

• Complete-link distance between clusters C_i and C_j is the maximum distance between any object in C_i and any object in C_j
• The distance is defined by the two most dissimilar objects

$$ D_{cl}(C_i, C_j) = \max_{x, y} \{ d(x, y) \mid x \in C_i, y \in C_j \} $$
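
The complete-link definition differs from single-link only in replacing the minimum with a maximum. The sketch below uses made-up clusters on a line and also shows the matching update rule: after a merge, the distance from the merged cluster to any other cluster is the larger of the two previous distances.

```python
import math

def complete_link(ci, cj):
    """Maximum distance over all pairs (x, y) with x in ci and y in cj."""
    return max(math.dist(x, y) for x in ci for y in cj)

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(4.0, 0.0), (9.0, 0.0)]
c = [(6.0, 0.0)]
print(complete_link(a, b))  # 9.0, set by the farthest pair (0, 0) and (9, 0)

# Distance from the merged cluster a + b to c equals the larger old distance.
print(complete_link(a + b, c) == max(complete_link(a, c), complete_link(b, c)))  # True
```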

Complete-link clustering: example

• Distance between clusters is determined by the two most distant points in the different clusters

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00

Complete-link clustering: example

[Figure: nested clusters of six points (left) and the corresponding dendrogram (right; leaf order 3, 6, 4, 1, 2, 5; merge heights from 0 to 0.4)]

Distance between two clusters

• Group average distance between clusters C_i and C_j is the average distance between any object in C_i and any object in C_j

$$ D_{avg}(C_i, C_j) = \frac{1}{|C_i| \times |C_j|} \sum_{x \in C_i,\, y \in C_j} d(x, y) $$
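
A sketch of the group-average definition on made-up clusters. It also checks the corresponding update rule: after merging clusters A and B, the average distance to another cluster C is the size-weighted mean of the two previous averages, (|A| D(A, C) + |B| D(B, C)) / (|A| + |B|).

```python
import math

def average_link(ci, cj):
    """Mean distance over all |ci| * |cj| pairs (x, y) with x in ci and y in cj."""
    total = sum(math.dist(x, y) for x in ci for y in cj)
    return total / (len(ci) * len(cj))

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(4.0, 0.0), (9.0, 0.0)]
c = [(6.0, 0.0)]
print(average_link(a, b))  # 6.0 = (4 + 9 + 3 + 8) / 4

# Size-weighted update after merging a and b.
updated = (len(a) * average_link(a, c) + len(b) * average_link(b, c)) / (len(a) + len(b))
print(abs(average_link(a + b, c) - updated) < 1e-12)  # True
```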

Average-link clustering: example

• Proximity of two clusters is the average of pairwise proximity between points in the two clusters.

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00

Average-link clustering: example

[Figure: nested clusters of six points (left) and the corresponding dendrogram (right; leaf order 3, 6, 4, 1, 2, 5; merge heights from 0 to 0.25)]

Average-link clustering

• Compromise between Single and Complete Link
• Strengths
  – Less susceptible to noise and outliers

Thank you