Clustering With R

  • 8/13/2019 Clustering With R

    1. Install R from http://cran.r-project.org/bin/windows/base/

    For clustering you need the following packages: cluster (usually installed by default), fpc, pvclust, mcclust.

    2. Install all these packages with the command

    > install.packages("package_name", lib="path_of_lib")

    e.g.

    > install.packages("fpc", lib="C:/Program Files/R/R-2.15.1/library")

    (">" is the R prompt.)

    3. An installed package can be loaded with the command

    > library(package_name)

    e.g.

    > library(fpc)
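
    Steps 2 and 3 can be combined into a small check of which packages are still missing. This is only a sketch: missing_packages is a hypothetical helper name, not part of R, and the install line is left commented out because it needs an internet connection.

```r
# Packages named in the text; "cluster" normally ships with R.
pkgs <- c("cluster", "fpc", "pvclust", "mcclust")

# Hypothetical helper: names of the packages that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)
if (length(to_install) > 0) {
  # Uncomment to download from CRAN (needs an internet connection):
  # install.packages(to_install)
  message("Not installed yet: ", paste(to_install, collapse = ", "))
}
```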

    4. Copy the data file into the working directory. You can find your working directory with the command

    > getwd()

    Or you can set the path with

    > setwd("path_of_wd")

    5. Load the data into an R data frame with the command read.csv (for csv files)

    e.g.

    > mydata <- read.csv("1-total_vanz_client_engros_num.csv")

    Of course, you can change this overly long Romanian file name.

    You can load any other file, but remember that for this type of clustering the file may contain only numerical data.
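
    If a file also has text columns, one possible way to keep only the numerical ones is sketched below. The small CSV written here is made-up stand-in data (its column names are assumptions), used only so that the example runs on its own.

```r
# Write a tiny stand-in CSV to a temporary file (hypothetical data,
# used here only so the example is self-contained).
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(client = c("A", "B", "C"),
             sales  = c(10.5, 20.0, 7.25),
             orders = c(3, 8, 2)),
  csv_path, row.names = FALSE
)

raw <- read.csv(csv_path)

# Keep only the numeric columns, since the clustering
# functions used in this guide expect numbers.
is_num <- vapply(raw, is.numeric, logical(1))
mydata <- raw[, is_num, drop = FALSE]

str(mydata)
```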

    6. The first type of clustering is a "classical" clustering using the k-means algorithm.

    Just type

    > kmeans.result <- kmeans(mydata, k)

    (k is the number of clusters you want; in theory it could be any number). You can try the algorithm with several values of k, starting from 2.
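
    One way to try several numbers of clusters in one go is sketched here on the built-in iris data (numeric columns only). The range 2:6, the seed and nstart = 10 are arbitrary choices for illustration.

```r
set.seed(1)                 # k-means starts from random centers
x <- iris[, 1:4]            # built-in data, numeric columns only

ks <- 2:6
wss <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})

# Total within-cluster sum of squares for each k.
data.frame(k = ks, tot.withinss = wss)
```

    A lower tot.withinss means tighter clusters, but it always tends to drop as k grows, so look for the point where the improvement levels off.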

    If you type > kmeans.result you can see the clustering result at any time.

    You can see particular details of the clustering through the components of the result:

    "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size"

    e.g.

    > kmeans.result$cluster (to see only the clusters)

    or

    > kmeans.result$centers (to see the centroid of every cluster) etc.

    We can plot a graph for 2 or 3 variables, but I will not go into too much detail here.
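
    As a rough illustration of such a plot, here is a sketch on the built-in iris data with 3 clusters. The choice of the two sepal variables and of 3 clusters is an assumption made only for the example.

```r
set.seed(1)
x <- iris[, 1:4]
kmeans.result <- kmeans(x, centers = 3, nstart = 10)

kmeans.result$size       # number of points in each cluster
kmeans.result$centers    # one row of coordinates per cluster

# Scatter plot of two variables, colored by cluster, with centroids marked.
vars <- c("Sepal.Length", "Sepal.Width")
plot(x[, vars], col = kmeans.result$cluster)
points(kmeans.result$centers[, vars], col = 1:3, pch = 8, cex = 2)
```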

    The pamk() function (from the fpc package) does not require the user to choose the number of clusters: it calls the function pam() and estimates the number of clusters.

    e.g.

    > pamk.result <- pamk(mydata)

    Type

    > pamk.result

    and you will see the result.

    To use pam() directly (it comes from the cluster package, so load it with library(cluster)) you have to choose the number of clusters k yourself:

    > pam.result <- pam(mydata, k)

    Type > pam.result to see the result.

    You will observe that pamk() takes more time than kmeans() or pam().

    The major difference between kmeans and pam/pamk is that in k-means a cluster is represented by its center, while in the k-medoids algorithms (pam/pamk) the cluster is represented by the object closest to the center of the cluster.
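
    The difference between a center and a medoid can be seen directly in the results. The sketch below uses the built-in iris data and 3 clusters, both chosen only for illustration.

```r
library(cluster)            # provides pam(); normally ships with R

set.seed(1)
x <- iris[, 1:4]

km <- kmeans(x, centers = 3, nstart = 10)
pm <- pam(x, k = 3)

km$centers                  # averages: usually not actual rows of x
pm$medoids                  # each medoid is one of the observations
pm$id.med                   # row numbers of the medoids in x

# Check: every medoid row matches the row of x it points to.
all(as.matrix(x[pm$id.med, ]) == pm$medoids)
```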

    7. We can do hierarchical clustering with the hclust() function.

    Type

    > hc <- hclust(dist(mydata), method="ave")

    This method is more complicated: for plotting we need a variable to use as labels, which could be an index of the initial data. If we apply this, I'll give more details.
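
    A possible way to plot the tree with row numbers as labels is sketched below. It assumes the built-in iris data, a random sample of 40 rows and a cut into 3 groups, all picked only to keep the picture readable.

```r
set.seed(1)
idx <- sample(nrow(iris), 40)          # a small sample keeps the plot readable
x <- iris[idx, 1:4]

hc <- hclust(dist(x), method = "average")

# Use the original row numbers as leaf labels.
plot(hc, hang = -1, labels = as.character(idx))

# Cut the tree into 3 groups and draw boxes around them.
groups <- cutree(hc, k = 3)
rect.hclust(hc, k = 3)
table(groups)
```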

    8. For density-based clustering we can use the DBSCAN algorithm from the fpc package. The main idea is to group objects into one cluster if they are connected to one another by a densely populated area. There are 2 parameters: eps (the reachability distance, which defines the size of the neighborhood; if it is too small you can get zero clusters!) and MinPts (the minimum number of points in a neighborhood). Most of the time you have to try different values of these parameters.

    For example, if you try

    > ds <- dbscan(mydata, eps=0.1, MinPts=5)

    you get zero clusters (not enough dense points).

    If the number of points in the neighborhood of a point is no less than MinPts, then this point is a "dense point".
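
    The "dense point" rule can be checked by hand with base R. This is a sketch of the definition only, not of how fpc implements it: the eps and MinPts values are illustrative, and counting the point itself as part of its own neighborhood is an assumption.

```r
x <- as.matrix(iris[, 1:4])
eps <- 0.5        # neighborhood radius (illustrative value)
MinPts <- 5       # minimum neighborhood size (illustrative value)

d <- as.matrix(dist(x))              # pairwise Euclidean distances
n_neighbors <- rowSums(d <= eps)     # counts include the point itself
is_dense <- n_neighbors >= MinPts

table(is_dense)
```

    Making eps smaller or MinPts larger leaves fewer dense points, which is why badly chosen values can give zero clusters.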

    The strength of density-based clustering is that it can discover clusters of various shapes and sizes and is insensitive to noise (k-means finds clusters that are spherical and of approximately similar size).

    Unfortunately, the file I found does not seem to respond to density-based clustering (it seems to have no relevant variation in point density; you can check this by trying different values for eps and MinPts, and with > ds <- dbscan(mydata, eps=1, MinPts=1) you will find at least 100 clusters).

    But if you want to see how this algorithm works with some meaningful results, you can try it on a very small data set (which is available by default in R).

    Type

    > iris2 <- iris[-5] # to remove the non-numeric column

    > ds <- dbscan(iris2, eps=0.2, MinPts=5)

    See the result with

    > ds

    and the clusters with

    > ds$cluster

    You can change eps and MinPts to see what happens (the data set describes flower species).
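
    To judge a DBSCAN result on iris, it may help to compare the clusters with the known species. The sketch below runs only if fpc is installed, and eps = 0.5 is an arbitrary illustrative value, not a recommendation.

```r
iris2 <- iris[-5]                     # drop the non-numeric Species column

if (requireNamespace("fpc", quietly = TRUE)) {
  ds <- fpc::dbscan(iris2, eps = 0.5, MinPts = 5)
  # Cluster 0 holds the points treated as noise.
  print(table(cluster = ds$cluster, species = iris$Species))
} else {
  ds <- NULL
  message("Package 'fpc' is not installed; skipping the DBSCAN run.")
}
```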

    For the data we have, I think it is good to test kmeans, pam and pamk.

    Once we decide which algorithms to use, we can write an R function to simplify this entire manual job.
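
    Such a function could look roughly like the sketch below. compare_clusterings is a hypothetical name, and the function covers only kmeans and pam (pamk is left out so that the example does not depend on fpc). It is shown on the built-in iris data.

```r
library(cluster)   # for pam(); normally ships with R

# Hypothetical helper: run k-means and PAM with the same k and
# return the cluster assignments side by side.
compare_clusterings <- function(data, k, seed = 1) {
  set.seed(seed)
  km <- kmeans(data, centers = k, nstart = 10)
  pm <- pam(data, k = k)
  list(
    kmeans = km,
    pam = pm,
    assignments = data.frame(kmeans = km$cluster, pam = pm$clustering)
  )
}

res <- compare_clusterings(iris[, 1:4], k = 3)
table(res$assignments)   # cross-tabulation of the two labelings
```

    The cluster numbers themselves are arbitrary, so the two methods agree when each row of the table has one dominant cell, not when the numbers match.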
