05 k-means clustering
K-Means Clustering
Scale!
• Scale to large datasets
– Hadoop MapReduce implementations that scale linearly with data.
– Fast sequential algorithms whose runtime doesn’t depend on the size of the data
– Goal: To be as fast as possible for any algorithm
• Scalable to support your business case
– Apache Software License 2.0
• Scalable community
– Vibrant, responsive and diverse
– Come to the mailing list and find out more
History of Mahout
• Summer 2007 – Developers needed scalable ML
– Mailing list formed
• Community formed – Apache contributors
– Academia & industry
– Lots of initial interest
• Project formed under Apache Lucene – January 25, 2008
Where is ML Used Today
• Internet search clustering
• Knowledge management systems
• Social network mapping
• Taxonomy transformations
• Marketing analytics
• Recommendation systems
• Log analysis & event filtering
• SPAM filtering, fraud detection
Clustering
• Call it fuzzy grouping based on a notion of similarity
Mahout Clustering
• Plenty of Algorithms: K-Means, Fuzzy K-Means, Mean Shift, Canopy, Dirichlet
• Group similar looking objects
• Notion of similarity: Distance measure
– Euclidean
– Cosine
– Tanimoto
– Manhattan
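The four distance measures listed above can be sketched for dense vectors as follows. This is an illustrative sketch, not Mahout's actual DistanceMeasure classes; the class and method names here are made up for the example.

```java
// Illustrative implementations of the four distance measures above,
// for dense double[] vectors. Not Mahout's DistanceMeasure API.
public class Distances {
    public static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }
    public static double manhattan(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]);
        return sum;
    }
    // Cosine distance = 1 - cosine similarity
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
    // Tanimoto distance = 1 - dot / (|a|^2 + |b|^2 - dot)
    public static double tanimoto(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
        }
        return 1.0 - dot / (na + nb - dot);
    }
}
```

Note that cosine and Tanimoto are bounded similarities turned into distances, while Euclidean and Manhattan grow with vector magnitude; which one is appropriate depends on whether document length should matter.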
Classification
• Predicting the type of a new object based on its features
• The types are predetermined
[Figure: example classes, Dog vs. Cat]
Mahout Classification
• Plenty of algorithms
– Naïve Bayes
– Complementary Naïve Bayes
– Random Forests
– Logistic Regression (SGD)
– Support Vector Machines (patch ready)
• Learn a model from manually classified data
• Predict the class of a new object based on its features and the learned model
Understanding data - Vectors
• X = 5, Y = 3 gives the point (5, 3)
• The vector denoted by point (5, 3) is simply Array([5, 3]) or HashMap([0 => 5], [1 => 3])
[Figure: the point (5, 3) plotted on X and Y axes]
Representing Vectors – The basics
• Now think 3, 4, 5, ….. n-dimensional
• Think of a document as a bag of words.
“she sells sea shells on the sea shore”
• Now map them to integers
she => 0
sells => 1
sea => 2
and so on
• The resulting vector [1.0, 1.0, 2.0, … ]
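The word-to-integer mapping and counting described above can be sketched as below. This is a minimal illustration of dictionary-based vectorization, not Mahout's vectorizer; the class and method names are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Dictionary-based bag-of-words sketch: assign each new word the next
// integer index, then count occurrences per index. Illustrative only.
public class BagOfWords {
    public static double[] vectorize(String text) {
        String[] words = text.toLowerCase().split("\\s+");
        // Build the dictionary: word -> integer index, in order of first appearance
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        for (String w : words) dictionary.putIfAbsent(w, dictionary.size());
        // Count occurrences into the vector
        double[] vector = new double[dictionary.size()];
        for (String w : words) vector[dictionary.get(w)]++;
        return vector;
    }
}
```

For "she sells sea shells on the sea shore" this yields seven dimensions, with "sea" (index 2) counted twice, matching the [1.0, 1.0, 2.0, …] vector above.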
Vectors
• Imagine one dimension for each word.
• Each dimension is also called a feature
• Two techniques
– Dictionary Based
– Randomizer Based
K-Means clustering
• K-means clustering is a classical clustering algorithm that uses an expectation-maximization-like technique to partition a number of data points into k clusters.
• K-means clustering is commonly used for a number of classification applications.
• It is often run on extremely large data sets, on the order of hundreds of millions of points and tens of gigabytes of data.
• Because k-means is run on such large data sets, and because of certain characteristics of the algorithm, it is a good candidate for parallelization.
K-Means clustering
• At its simplest, the algorithm is given as inputs a set of n d-dimensional points and a number of desired clusters k.
• For the purposes of this explanation, we will consider points in a Euclidean space.
• However, the clustering algorithm will work in any space, provided a distance metric is given as input as well.
• Initially, k points are chosen as cluster centers.
• There is no fixed way to determine these initial points; instead, a number of heuristics are used. Most commonly, k points are chosen at random from the sample of n points.
• Once the k initial centers are chosen, the distance from every point in the set to each of the k centers is calculated, and each point is assigned to the cluster center whose distance is closest.
• Using this assignment of points to cluster centers, each cluster center is recalculated as the centroid of its member points.
• This process is then iterated until convergence is reached.
• That is, points are reassigned to centers, and centroids recalculated, until the k cluster centers shift by less than some delta value.
• Because of the way k-means works, it has the weakness that it can converge to a local optimum rather than the global optimum.
• This happens when the initial k center points are chosen badly.
Pseudocode
iterate {
  Compute distance from all points to all k centers
  Assign each point to the nearest k-center
  Compute the average of all points assigned to each k-center
  Replace the k-centers with the new averages
}
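The pseudocode above can be sketched as a small sequential Java program. This is a minimal illustration (using squared Euclidean distance for the nearest-center comparison), not Mahout's implementation:

```java
// Minimal sequential k-means sketch: assign each point to its nearest center,
// recompute each center as the centroid of its members, and repeat until the
// centers move less than a convergence delta. Illustrative only.
public class KMeans {
    public static double[][] cluster(double[][] points, double[][] centers, double delta) {
        int k = centers.length, d = points[0].length;
        while (true) {
            double[][] sums = new double[k][d];
            int[] counts = new int[k];
            for (double[] p : points) {
                // Assign p to the nearest center (squared distance suffices for argmin)
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = 0;
                    for (int j = 0; j < d; j++) {
                        double diff = p[j] - centers[c][j];
                        dist += diff * diff;
                    }
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                counts[best]++;
                for (int j = 0; j < d; j++) sums[best][j] += p[j];
            }
            // Replace each center with the centroid of its assigned points
            double shift = 0;
            for (int c = 0; c < k; c++) {
                for (int j = 0; j < d; j++) {
                    double nc = counts[c] > 0 ? sums[c][j] / counts[c] : centers[c][j];
                    shift = Math.max(shift, Math.abs(nc - centers[c][j]));
                    centers[c][j] = nc;
                }
            }
            if (shift < delta) return centers;  // converged
        }
    }
}
```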
[Figures: successive K-Means iterations, showing points assigned to centers c1, c2, c3 and the centroids shifting each iteration until convergence]
Parallelizing k-means
• In order to parallelize k-means, we want to come up with a scheme where we can operate on each point in the data set independently.
• In the first step of the iterative process of k-means, it is necessary to compute the distance from each point to each of the k cluster centers and assign that point to the cluster with the minimum distance.
• Thus, there is a small amount of shared data – namely the cluster centers.
• However, this is small in comparison to the number of data points.
• So the parallelization scheme involves duplicating the cluster centers, however once this is duplicated each data point can be operated on independently of the others and we can gain a nice speedup.
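The scheme above can be sketched with a Java parallel stream standing in for the MapReduce mappers: the centers array is the small shared data visible to every worker, and each point's nearest-center assignment is computed independently. This is an illustration of the idea only, not Mahout's Hadoop job:

```java
import java.util.stream.IntStream;

// Parallel assignment step of k-means: the small shared centers array is read
// by all workers, while each point is processed independently. Illustrative only.
public class ParallelAssign {
    public static int[] assign(double[][] points, double[][] centers) {
        return IntStream.range(0, points.length).parallel()
            .map(i -> {
                // Find the nearest center for point i (squared Euclidean distance)
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < centers.length; c++) {
                    double dist = 0;
                    for (int j = 0; j < points[i].length; j++) {
                        double diff = points[i][j] - centers[c][j];
                        dist += diff * diff;
                    }
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                return best;
            })
            .toArray();
    }
}
```

In the MapReduce version the same structure appears: mappers receive the centers as side data, emit (nearest-center, point) pairs, and reducers average each center's points into the new centroid.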
Step 1 – Convert dataset into a Hadoop Sequence File
• SGML files
– $ mkdir -p reuters-sgm && cd reuters-sgm && tar xzf ../reuters21578.tar.gz && cd ..
• Extract content from SGML to text file
– $ mahout org.apache.lucene.benchmark.utils.ExtractReuters reuters-sgm reuters/out
Step 1 – Convert dataset into a Hadoop Sequence File
• Use seqdirectory tool to convert text file into a Hadoop Sequence File
– $ mahout seqdirectory \
-i reuters/out \
-o reuters/seqdir \
-c UTF-8 \
-chunk 5
Hadoop Sequence File
• Sequence of Records, where each record is a <Key, Value> pair
– <Key1, Value1>
– <Key2, Value2>
– …
– …
– …
– <KeyN, ValueN>
• Key and Value need to be of class org.apache.hadoop.io.Text
– Key = Record name or File name or unique identifier
– Value = Content as UTF-8 encoded string
• TIP: Dump data from your database directly into Hadoop Sequence Files
Writing to Sequence Files
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path path = new Path("testdata/part-00000");
SequenceFile.Writer writer = new SequenceFile.Writer(
    fs, conf, path, Text.class, Text.class);
for (int i = 0; i < MAX_DOCS; i++) {
  writer.append(new Text(documents(i).Id()),
      new Text(documents(i).Content()));
}
writer.close();
Generate Vectors from Sequence Files
• Steps
1. Compute Dictionary
2. Assign integers for words
3. Compute feature weights
4. Create vector for each document using word-integer mapping and feature-weight
Or
• Simply run $ mahout seq2sparse
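The four manual steps above can be sketched as follows, using the common tf · log(N/df) weighting. This is a textbook-style illustration with invented names, not necessarily the exact formula seq2sparse applies:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Sketch of steps 1-4: build the dictionary implicitly, count term frequencies,
// and weight each term by tf * log(N / df). Illustrative only, not Mahout code.
public class TfIdf {
    public static Map<String, Double> weights(List<String[]> docs, String[] doc) {
        // Step 1-2: document frequency of each word across the corpus
        Map<String, Integer> df = new HashMap<>();
        for (String[] d : docs)
            for (String w : new HashSet<>(Arrays.asList(d)))
                df.merge(w, 1, Integer::sum);
        // Step 3: term frequency within this document
        Map<String, Integer> tf = new HashMap<>();
        for (String w : doc) tf.merge(w, 1, Integer::sum);
        // Step 4: combine into tf-idf weights for the document's vector
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet())
            weights.put(e.getKey(),
                e.getValue() * Math.log((double) docs.size() / df.get(e.getKey())));
        return weights;
    }
}
```

A word appearing in every document gets weight zero, which is exactly why TFIDF damps common words like "the" relative to plain term frequency.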
Generate Vectors from Sequence Files
• $ mahout seq2sparse \
  -i reuters/seqdir/ \
  -o reuters/sparse
• Important options
– Ngrams
– Lucene Analyzer for tokenizing
– Feature Pruning
  • Min support
  • Max Document Frequency
  • Min LLR (for ngrams)
– Weighting Method
  • TF vs. TFIDF
  • lp-Norm
  • Log normalize length
Start K-Means clustering
• $ mahout kmeans \
  -i reuters/sparse/tfidf/ \
  -c reuters-kmeans-clusters \
  -o reuters-kmeans \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -cd 0.1 \
  -x 10 -k 20 -ow
• Things to watch out for
– Number of iterations
– Convergence delta
– Distance Measure
– Creating assignments
Inspect clusters
• $ bin/mahout clusterdump \
  -s reuters-kmeans/clusters-9 \
  -d reuters-out-seqdir-sparse-kmeans/dictionary.file-0 \
  -dt sequencefile -b 100 -n 20
Typical output
:VL-21438{n=518 c=[0.56:0.019, 00:0.154, 00.03:0.018, 00.18:0.018, …
Top Terms:
iran => 3.1861672217321213
strike => 2.567886952727918
iranian => 2.133417966282966
union => 2.116033937940266
said => 2.101773806290277
workers => 2.066259451354332
gulf => 1.9501374918521601
had => 1.6077752463145605
he => 1.5355078004962228
FAQs
• How to get rid of useless words
• How to see document-to-cluster assignments
• How to choose appropriate weighting
• How to run this on a cluster
• How to scale
• How to choose k
• How to improve similarity measurement
FAQs
• How to get rid of useless words
– Use StopwordsAnalyzer
• How to see document-to-cluster assignments
– Run the clustering process at the end of centroid generation using -cl
• How to choose appropriate weighting
– If it's long text, go with TFIDF. Use normalization if documents differ in length
• How to run this on a cluster
– Set the HADOOP_CONF directory to point to your Hadoop cluster's conf directory
• How to scale
– Use a small value of k to partially cluster the data, then do full clustering on each cluster
FAQs
• How to choose k
– Figure out based on the data you have. Trial and error
– Or use Canopy Clustering and distance threshold to figure it out
– Or use Spectral clustering
• How to improve Similarity Measurement
– Not all features are equal
– Small weight difference for certain types creates a large semantic difference
– Use WeightedDistanceMeasure
– Or write a custom DistanceMeasure
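A custom weighted measure along these lines might look like the sketch below: each feature gets its own weight, so a small numeric difference on an important feature can dominate the distance. It mirrors the idea behind WeightedDistanceMeasure but is not Mahout's actual API:

```java
// Weighted Euclidean distance sketch: per-feature weights let semantically
// important features dominate the measure. Illustrative, not Mahout's API.
public class WeightedEuclidean {
    private final double[] weights;

    public WeightedEuclidean(double[] weights) {
        this.weights = weights;
    }

    public double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += weights[i] * diff * diff;  // weight scales this feature's contribution
        }
        return Math.sqrt(sum);
    }
}
```

With weights {1, 100}, a difference of 3 on the second feature counts ten times as much as the same difference on the first, which is the kind of semantic emphasis the slide describes.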
More clustering algorithms
• Canopy
• Fuzzy K-Means
• Mean Shift
• Dirichlet process clustering
• Spectral clustering.
End of session
Day 4: K-Means Clustering