Clustering With R
Uploaded by adrian-iosif
8/13/2019 Clustering With R
1.Install R from http://cran.r-project.org/bin/windows/base/
For clustering you need the following packages: cluster (usually installed by
default), fpc, pvclust, mcclust.
2.Install all these packages with the command
>install.packages("package_name", lib="path_of_lib")
e.g.
>install.packages("fpc", lib="C:/Program Files/R/R-2.14.1/library")
(">" is the R prompt)
An installed package can be loaded with the command
>library(package_name)
e.g.
>library(fpc)
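As a rough sketch, the check below lists which of the packages named above are still missing before you call library(). It uses only base R; the variable names are my own, and the install line is left commented out because it needs an internet connection.

```r
# Packages named in this guide.
needed <- c("cluster", "fpc", "pvclust", "mcclust")

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
not_installed <- needed[!installed]

if (length(not_installed) > 0) {
  message("Not installed yet: ", paste(not_installed, collapse = ", "))
  # install.packages(not_installed)   # uncomment to install from CRAN
} else {
  message("All packages are available.")
}
```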
3.Copy the data file into the working directory. You can find your working directory with
the command
>getwd()
Or you can set the path with
>setwd("path_of_wd")
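A quick way to confirm that the data file really is where R will look for it is to list the csv files in the working directory. This is only a small sketch with base R functions; the variable names are mine.

```r
wd <- getwd()   # current working directory

# All files in that directory whose names end in .csv (upper or lower case).
csv_files <- list.files(wd, pattern = "\\.csv$", ignore.case = TRUE)

cat("Working directory:", wd, "\n")
cat("CSV files found:", length(csv_files), "\n")
```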
4.Load the data into an R data frame with the command read.csv (for csv files)
e.g.
>mydata<-read.csv("1-total_vanz_client_engros_num.csv")
Of course, you can change this overly long Romanian file name.
You can load any other file, but remember, for this type of clustering the file may
contain only numerical data.
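If a file also contains text columns, one possible approach is to keep only the numeric columns after loading. The sketch below uses a tiny made-up csv file (the column names client, sales and orders are hypothetical, not taken from the real data file) so that it can be run anywhere.

```r
# Hypothetical stand-in for the real data file: one text column, two numeric ones.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(client = c("a", "b", "c"),
                     sales  = c(10.5, 20, 7.25),
                     orders = c(3, 8, 2)),
          tmp, row.names = FALSE)

mydata <- read.csv(tmp)

# Keep only the columns that R read as numbers.
numeric_cols <- vapply(mydata, is.numeric, logical(1))
mydata_num <- mydata[, numeric_cols, drop = FALSE]
str(mydata_num)
```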
5.The first type of clustering is a "classical" clustering using the k-means algorithm.
Just type
>kmeans.result<-kmeans(mydata, 3)
"3" is the number of clusters you want (it could be, in theory, any number up to the
number of rows). You can try the algorithm with 2,3,4,5 etc. clusters.
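k-means starts from randomly chosen centers, so two runs can give different results. The sketch below, on simulated data of my own (not the data file from this guide), fixes the random seed and uses the nstart argument of kmeans() to try several starting points and keep the best one.

```r
set.seed(42)   # fix the random numbers so the run is repeatable

# Two well separated groups of 20 points each, in two dimensions.
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))

# nstart = 10: run the algorithm from 10 random starts and keep the best result.
km <- kmeans(toy, centers = 2, nstart = 10)
km$size      # number of points in each cluster
```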
If you type >kmeans.result you can see the result of the clustering at any time.
You can see particular details of the clustering using the components of the result:
"cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss"
"size"
e.g.
>kmeans.result$cluster (to see only the clusters)
or
>kmeans.result$centers (to see the centroid of every cluster) etc.
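To show how these components relate to each other, here is a small sketch on the iris data set that ships with R (my choice of example data, with its non-numeric fifth column removed). The total sum of squares splits into the within-cluster part and the between-cluster part.

```r
set.seed(1)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 10)

names(km)            # the available components
table(km$cluster)    # how many rows fell into each cluster
km$centers           # one row of coordinates per cluster

# totss = tot.withinss + betweenss, and tot.withinss = sum(withinss)
all.equal(km$totss, km$tot.withinss + km$betweenss)
all.equal(km$tot.withinss, sum(km$withinss))
```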
We can plot a graph for 2 or 3 variables, but I will not go into too many details.
6.Another type of clustering is k-medoids, with the pam() and pamk() functions. The
pamk() function (from the fpc package) does not require the user to choose the number
of clusters: it calls the function pam() and estimates the number of clusters.
e.g.
>pamk.result<-pamk(mydata)
Type
>pamk.result
and you will see the result.
To use pam() (from the cluster package) you have to choose the number of clusters (for example 3):
>pam.result<-pam(mydata, 3)
Type >pam.result to see the result.
You will observe that pamk() takes more time than kmeans() or pam().
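A minimal sketch of both calls, again on simulated data of my own. pam() comes from the cluster package; the pamk() part only runs if the fpc package is installed, and the krange values are just an illustration.

```r
library(cluster)

set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))

# pam(): we choose the number of clusters ourselves.
pam.result <- pam(toy, k = 2)
table(pam.result$clustering)   # cluster sizes
pam.result$medoids             # the representative point of each cluster

# pamk(): tries every k in krange and keeps the one it judges best.
if (requireNamespace("fpc", quietly = TRUE)) {
  pamk.result <- fpc::pamk(toy, krange = 2:5)
  cat("Number of clusters estimated by pamk:", pamk.result$nc, "\n")
}
```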
The major difference between kmeans and pam/pamk is that while in k-means a
cluster is represented by its center, in k-medoids (the pam/pamk algorithms) the
cluster is represented by the object closest to the center of the cluster.
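The difference is easy to see on six hand-made points (my own toy example): the k-means centers are averages that do not appear in the data, while the medoids are actual rows of it.

```r
library(cluster)

# Two obvious groups of three points each.
x <- matrix(c(1, 1,
              1, 2,
              2, 1,
              8, 8,
              8, 9,
              9, 8), ncol = 2, byrow = TRUE)

# Start k-means from rows 1 and 4 so that nothing is random.
km <- kmeans(x, centers = x[c(1, 4), ])
pm <- pam(x, k = 2)

km$centers   # cluster means: not rows of x
pm$medoids   # medoids: always rows of x
pm$id.med    # row numbers of the medoids in x
```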
7.We can do hierarchical clustering with the hclust() function.
Type
>hc<-hclust(dist(mydata), method="average")
This method is more complicated: for plotting we need a variable to use as a label, which
could be an index of the initial data. If we apply this, I'll give more details.
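As an illustration of the labelling idea, the sketch below clusters a random sample of 30 rows from the built-in iris data (my choice of example) and keeps the original row numbers to use as labels. cutree() then cuts the tree into a chosen number of groups; the value 3 is only an example.

```r
set.seed(7)
idx <- sample(nrow(iris), 30)           # row numbers of the sampled data

hc <- hclust(dist(iris[idx, 1:4]), method = "average")

# Cut the tree into 3 groups and count the members of each.
groups <- cutree(hc, k = 3)
table(groups)

# In an interactive session, draw the tree with the row numbers as labels:
# plot(hc, labels = idx)
```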
8.For density-based clustering we can use the DBSCAN algorithm from the fpc
package. The main idea is to group objects into one cluster if they are connected
to one another by a densely populated area. There are 2 parameters: eps, the
reachability distance, which defines the size of the neighborhood (if it is too small you can
get zero clusters!), and MinPts, the minimum number of points required in that
neighborhood. Usually you will need to try different values of these parameters.
For example, if you try with
>ds<-dbscan(mydata, eps=0.1, MinPts=4)
you get zero clusters (not enough dense points).
If the number of points in the neighborhood of a point is no less than MinPts, then
this point is a "dense point".
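The "dense point" rule can be checked by hand with base R. The sketch below uses five made-up points and counts, for each one, how many points lie within eps of it. I am assuming here that a point counts as its own neighbor, as in the usual description of DBSCAN; the values of eps and MinPts are only illustrative.

```r
# Four points close together and one far away.
x <- matrix(c(0,   0,
              0,   0.1,
              0.1, 0,
              0.1, 0.1,
              5,   5), ncol = 2, byrow = TRUE)

eps    <- 0.2
MinPts <- 4

d <- as.matrix(dist(x))            # all pairwise distances
neighbours <- rowSums(d <= eps)    # neighborhood size, the point itself included
dense <- neighbours >= MinPts      # TRUE for dense points

data.frame(neighbours, dense)
```

With these values the four close points are dense and the isolated one is not; making eps smaller than 0.1 leaves no dense point at all, which is the "zero clusters" situation described above.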
The strength of density-based clustering is that it can discover clusters with
various shapes and sizes and it is insensitive to noise (k-means finds clusters that are
roughly spherical and of approximately similar size).
Unfortunately, the file I found seems to be unsuited to density-based clustering
(it seems to have no relevant variation in point density; you can check this by
trying different values for eps and MinPts, and with >ds<-dbscan(mydata, eps=1, MinPts=1)
you will find a hundred or more clusters!).
But if you want to see how this algorithm works and get some results, you can try
it on a very small data set that comes with R by default.
Type
>iris2<-iris[-5] #to remove the non-numeric column
>ds<-dbscan(iris2, eps=0.2, MinPts=4)
See the result with
>ds
and the clusters with
>ds$cluster
You can change eps and MinPts to see what happens (the data set contains measurements
of flower species).
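To see the effect of eps without retyping the command, one option is a small loop like the sketch below. The three eps values are my own picks for illustration, and the dbscan part only runs if the fpc package is installed. In the result, cluster number 0 marks the noise points.

```r
iris2 <- iris[-5]   # drop the non-numeric Species column

if (requireNamespace("fpc", quietly = TRUE)) {
  for (eps in c(0.2, 0.4, 0.8)) {
    ds <- fpc::dbscan(iris2, eps = eps, MinPts = 4)
    cat("eps =", eps,
        ": clusters =", max(ds$cluster),
        ", noise points =", sum(ds$cluster == 0), "\n")
  }
}
```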
For the data we have, I think it is good to test kmeans, pam and pamk.
Once we decide which algorithms we will use, we can write an R function to
simplify this entire manual job.
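Such a function could look something like the sketch below. It is only one possible design: the name compare_clusterings is hypothetical, and comparing the results by average silhouette width (from the cluster package) is my own choice, not something decided in this guide. Higher values suggest better separated clusters.

```r
library(cluster)

# Run kmeans and pam for several numbers of clusters and report, for each k,
# the average silhouette width of both results.
compare_clusterings <- function(data, ks = 2:5, seed = 1) {
  data <- as.matrix(data)
  d <- dist(data)
  rows <- lapply(ks, function(k) {
    set.seed(seed)                       # make the kmeans run repeatable
    km <- kmeans(data, centers = k, nstart = 10)
    pm <- pam(data, k = k)
    data.frame(
      k          = k,
      kmeans_sil = mean(silhouette(km$cluster, d)[, "sil_width"]),
      pam_sil    = pm$silinfo$avg.width
    )
  })
  do.call(rbind, rows)
}

# Example call on the built-in iris data, without its non-numeric column.
res <- compare_clusterings(iris[-5])
res
```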