K-means under Mahout

Document clustering example using K-means under Mahout.

Document clustering using K-means under Mahout

1. Edit system variables. In your home directory, edit the file .bash_profile (for Mac users) or .bashrc (for Linux users) and add the following lines:

$vi .bash_profile
export JAVA_HOME=$(/usr/libexec/java_home)
export HADOOP_HOME=/Users/DoubleJ/Software/hadoop-1.0.3   (replace with your Hadoop directory)
export HADOOP_CONF_DIR=/Users/DoubleJ/Software/hadoop-1.0.3/conf

Then save and exit the file, and reload it:

$source .bash_profile

2. Download the data

$curl http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz -o ~/Software/mahout-work/reuters.tar.gz

3. Uncompress the data

$mkdir ~/Software/mahout-work/sgm
$tar xzf ~/Software/mahout-work/reuters.tar.gz -C ~/Software/mahout-work/sgm

4. Obtain the segment data

$./bin/mahout org.apache.lucene.benchmark.utils.ExtractReuters ~/Software/mahout-work/sgm ~/Software/mahout-work/text
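After sourcing the file, it is worth sanity-checking that the variables are visible to child processes before starting Hadoop. A small Python sketch (the variable names come from the steps above; the helper function is purely illustrative):

```python
import os

def check_vars(env, names=("JAVA_HOME", "HADOOP_HOME", "HADOOP_CONF_DIR")):
    """Return the subset of `names` missing or empty in the given environment mapping."""
    return [n for n in names if not env.get(n)]

# Check the real environment of this process
missing = check_vars(os.environ)
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All Hadoop environment variables are set.")
```

If anything is reported missing, re-check the export lines and re-run `source`.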

5. Generate the sequence directory, which contains all the texts

5.1 Send the segment data to the Hadoop file system

$cd $HADOOP_HOME
$./bin/start-all.sh
$./bin/hadoop fs -mkdir hadoop-mahout


$./bin/hadoop fs -put ~/Software/mahout-work/out hadoop-mahout   (takes a while)
$./bin/hadoop fs -ls hadoop-mahout/out

5.2 Create the sequence directory

$cd ~/Software/mahout-distribution-0.7
$./bin/mahout seqdirectory -i hadoop-mahout/out -o hadoop-mahout/out-seqdir -c UTF-8 -chunk 5
$./bin/hadoop fs -ls hadoop-mahout
$./bin/hadoop fs -ls hadoop-mahout/out-seqdir
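Conceptually, seqdirectory walks the input directory and packs each text file into a key/value record (key = the file's path, value = its contents), which is what later stages consume. A minimal Python sketch of that mapping, assuming nothing about Mahout's actual SequenceFile encoding (the file names below are made up for the demo):

```python
import os
import tempfile

def pack_directory(root):
    """Map each file under `root` to a (key, value) record, roughly what
    seqdirectory does: key = path relative to the input dir, value = text."""
    records = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            key = "/" + os.path.relpath(path, root)
            with open(path, encoding="utf-8") as f:
                records[key] = f.read()
    return records

# Demo with two throwaway documents
with tempfile.TemporaryDirectory() as d:
    for name, text in [("reut-000.txt", "cocoa prices rose"),
                       ("reut-001.txt", "grain exports fell")]:
        with open(os.path.join(d, name), "w", encoding="utf-8") as f:
            f.write(text)
    recs = pack_directory(d)
    print(sorted(recs))           # ['/reut-000.txt', '/reut-001.txt']
    print(recs["/reut-000.txt"])  # cocoa prices rose
```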


6. Create vector files to represent the texts

$./bin/mahout seq2sparse -i hadoop-mahout/out-seqdir/ -o hadoop-mahout/vectors --maxDFPercent 85 --namedVector   (takes a while)
$cd $HADOOP_HOME
$./bin/hadoop fs -ls hadoop-mahout
$./bin/hadoop fs -ls hadoop-mahout/vectors

In addition, you can check the completed jobs in the Hadoop JobTracker web UI at http://localhost:50030.
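Under the hood, seq2sparse tokenizes each document and weights terms with TF-IDF. A rough Python sketch of the weighting, using the classic definition tf(t, d) * log(N / df(t)) rather than Mahout's exact (Lucene-style, smoothed) formula:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Classic tf-idf: tf(t, d) * log(N / df(t)).
    `docs` is a list of token lists; returns one {term: weight} dict per doc."""
    n = len(docs)
    df = Counter()                       # document frequency per term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)                # term frequency within this doc
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

docs = [["cocoa", "prices", "rose"],
        ["grain", "prices", "fell"],
        ["cocoa", "exports", "rose"]]
vecs = tfidf_vectors(docs)
# "prices" appears in 2 of 3 docs, so it gets a lower weight than "grain"
print(round(vecs[1]["prices"], 3), round(vecs[1]["grain"], 3))
```

Terms that occur in most documents get small weights, which is also why the --maxDFPercent 85 flag above discards terms appearing in more than 85% of documents outright.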

7. Run K-means

$./bin/mahout kmeans -i hadoop-mahout/vectors/tfidf-vectors -c hadoop-mahout/cluster-centroids -o hadoop-mahout/kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow --clustering -cl
$cd $HADOOP_HOME
$./bin/hadoop fs -ls hadoop-mahout/
$./bin/hadoop fs -ls hadoop-mahout/cluster-centroids
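The command above runs at most 10 iterations (-x 10) with k=20 random initial centroids, measuring closeness by cosine distance. A bare-bones Python sketch of the same loop (Lloyd's algorithm), on tiny 2-D points with k=2 purely for illustration:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity, the quantity CosineDistanceMeasure computes."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def kmeans(points, centroids, max_iter=10):
    """Lloyd's algorithm with cosine distance; returns the final assignments."""
    assign = []
    for _ in range(max_iter):
        # Assign each point to its nearest centroid
        assign = [min(range(len(centroids)),
                      key=lambda c: cosine_distance(p, centroids[c]))
                  for p in points]
        # Recompute each centroid as the mean of its member points
        new = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                new.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new.append(centroids[c])   # keep an empty cluster's centroid
        if new == centroids:               # converged early
            break
        centroids = new
    return assign

points = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
print(kmeans(points, [points[0], points[2]]))  # [0, 0, 1, 1]
```

Cosine distance cares about the angle between vectors rather than their length, which suits TF-IDF document vectors whose magnitudes vary with document length.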


Again, you can check the completed jobs in the Hadoop JobTracker web UI at http://localhost:50030.

8. View the results in a human-readable way. Dump the document-to-cluster mapping:

$./bin/mahout seqdumper -i hadoop-mahout/kmeans/clusteredPoints/part-m-00000 > ~/Software/mahout-work/cluster-docs.txt
$ls -l ~/Software/mahout-work

Then download the view.pl script and run it using the following command.


$perl view.pl ~/Software/mahout-work/cluster-docs.txt ~/Software/mahout-work/cluster-results.txt

The file cluster-results.txt is a text file where each line has two tab-separated columns: the cluster ID and its corresponding file name.
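Assuming the two-column tab-separated format just described (cluster ID, then file name), grouping the documents per cluster takes only a few lines of Python; the sample lines below are invented for illustration:

```python
import io
from collections import defaultdict

def group_by_cluster(lines):
    """Group file names by cluster ID from tab-separated 'id<TAB>name' lines."""
    clusters = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue                       # skip blank lines
        cluster_id, filename = line.split("\t", 1)
        clusters[cluster_id].append(filename)
    return dict(clusters)

# Stand-in for open("cluster-results.txt"); contents are hypothetical
sample = io.StringIO("0\treut2-000.sgm-0.txt\n"
                     "3\treut2-000.sgm-1.txt\n"
                     "0\treut2-000.sgm-2.txt\n")
groups = group_by_cluster(sample)
print(groups["0"])  # ['reut2-000.sgm-0.txt', 'reut2-000.sgm-2.txt']
```

From here you can count cluster sizes or spot-check which Reuters articles landed together.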