final_ppt
![Page 1: final_ppt](https://reader030.fdocuments.net/reader030/viewer/2022013121/54364a0a219acdd95f8b52e4/html5/thumbnails/1.jpg)
T.Karthikeyan
![Page 2: final_ppt](https://reader030.fdocuments.net/reader030/viewer/2022013121/54364a0a219acdd95f8b52e4/html5/thumbnails/2.jpg)
What in the World is Apache Mahout?
Applications, Examples
Recommenders, Clustering, Classification, Freq. Pattern Mining, Genetic
Utilities: Lucene/Vectorizer
Math: Vectors/Matrices/SVD
Collections (primitives)
Apache Hadoop
![Page 3: final_ppt](https://reader030.fdocuments.net/reader030/viewer/2022013121/54364a0a219acdd95f8b52e4/html5/thumbnails/3.jpg)
Mahout Clustering
Algorithms: K-Means, Fuzzy K-Means, Mean Shift, Canopy, Dirichlet, Spectral Clustering (based on eigenvalues), MinHash clustering, LDA-based clustering
Notion of similarity (distance measures): Euclidean, Cosine, Tanimoto, Manhattan
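The four distance measures named above can be sketched in a few lines. This is an illustrative plain-Python version using the textbook definitions, not Mahout's own classes; the function names and the sample vectors `a` and `b` are made up for the example, and both vectors are assumed to be non-zero.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    # Straight-line distance between the two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine of the angle between the vectors (assumes non-zero vectors).
    return 1.0 - _dot(a, b) / (math.sqrt(_dot(a, a)) * math.sqrt(_dot(b, b)))

def tanimoto_distance(a, b):
    # 1 - Tanimoto coefficient: a.b / (|a|^2 + |b|^2 - a.b).
    d = _dot(a, b)
    return 1.0 - d / (_dot(a, a) + _dot(b, b) - d)

# Hypothetical sample vectors.
a = [1.0, 0.0, 2.0]
b = [0.0, 1.0, 2.0]
```

Cosine and Tanimoto ignore overall vector length to different degrees, which is why they are often preferred for text vectors of varying document size.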
![Page 4: final_ppt](https://reader030.fdocuments.net/reader030/viewer/2022013121/54364a0a219acdd95f8b52e4/html5/thumbnails/4.jpg)
Clustering our own data
Dataset
Hadoop SequenceFile format: ./mahout seqdirectory <options>
Sparse vector format: ./mahout seq2sparse <options>
Clustering driver class: ./mahout <kmeans/…> <options>
Dump cluster output: ./mahout clusterdump <options>
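To make the "sparse vector" step concrete, here is a rough sketch of the idea: a dictionary maps each term to an integer index, and each document stores only the indices of the terms it contains. This is a simplified illustration with raw term counts and whitespace tokenization; it is not what `seq2sparse` does internally, and the helper names and sample documents are invented.

```python
from collections import Counter

def build_dictionary(docs):
    """Assign each distinct term an integer index in first-seen order."""
    dictionary = {}
    for doc in docs:
        for term in doc.lower().split():
            dictionary.setdefault(term, len(dictionary))
    return dictionary

def to_sparse_vector(doc, dictionary):
    """Map a document to {term index: term count}, omitting absent terms."""
    counts = Counter(doc.lower().split())
    return {dictionary[t]: float(c) for t, c in counts.items() if t in dictionary}

# Hypothetical toy corpus.
docs = ["oil prices rise", "oil exports fall", "grain exports rise"]
dictionary = build_dictionary(docs)
vectors = [to_sparse_vector(d, dictionary) for d in docs]
```

The dictionary is what later lets a cluster dump translate term indices back into readable words.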
![Page 5: final_ppt](https://reader030.fdocuments.net/reader030/viewer/2022013121/54364a0a219acdd95f8b52e4/html5/thumbnails/5.jpg)
Clustering Examples
Using the Reuters dataset (SGML files):
$ bin/mahout seqdirectory -i reuters-ip -o reuters-seqdir \
    -c UTF-8 -chunk 1
$ bin/mahout seq2sparse -i reuters-seqdir -o reuters-sparse
$ bin/mahout kmeans -i reuters-sparse/tfidf-vectors/ \
    -c reuters-clusters \
    -o reuters-kmeans \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -cd 0.1 -x 10 -k 20 -ow
$ bin/mahout clusterdump -d reuters-sparse/dictionary.file-0 \
    -s reuters-kmeans-clusters/clusters-19 -b 10 -n 10
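The k-means options used above (a distance measure, a convergence delta `-cd`, and a maximum iteration count `-x`) correspond to the loop sketched below. This is a minimal single-machine illustration, not Mahout's MapReduce implementation; measuring convergence as the largest centroid shift under the same distance is an assumption, as are the function names and the toy points. Vectors are assumed non-zero.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def mean(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def nearest(point, centroids):
    return min(range(len(centroids)), key=lambda i: cosine_distance(point, centroids[i]))

def kmeans(points, initial_centroids, convergence_delta=0.1, max_iterations=10):
    centroids = [list(c) for c in initial_centroids]
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: recompute centroids (an empty cluster keeps its old one).
        new_centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        moved = max(cosine_distance(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if moved <= convergence_delta:
            break
    return centroids, [nearest(p, centroids) for p in points]

# Hypothetical 2-D points; the first and third serve as initial centroids.
points = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
centroids, assignments = kmeans(points, initial_centroids=[points[0], points[2]])
```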
![Page 6: final_ppt](https://reader030.fdocuments.net/reader030/viewer/2022013121/54364a0a219acdd95f8b52e4/html5/thumbnails/6.jpg)
Mahout Classification
Algorithms implemented: Naïve Bayes, Complementary Naïve Bayes, Random Forest, Logistic Regression (sequential algorithm), Hidden Markov Models
Upcoming algorithms: Support Vector Machines, classification based on Perceptron and Winnow
![Page 7: final_ppt](https://reader030.fdocuments.net/reader030/viewer/2022013121/54364a0a219acdd95f8b52e4/html5/thumbnails/7.jpg)
Bayes, CBayes Classifier
Preprocessing raw data into classifiable data
![Page 8: final_ppt](https://reader030.fdocuments.net/reader030/viewer/2022013121/54364a0a219acdd95f8b52e4/html5/thumbnails/8.jpg)
Bayes, CBayes Classifier Example
Using the Newsgroup dataset:
$ ./mahout prepare20newsgroups -p 20news-bydate-train -o 20news-train \
    -a org.apache.lucene.analysis.standard.StandardAnalyzer \
    -c UTF-8
$ ./mahout trainclassifier -i 20news-train -o 20news-model \
    -type <cbayes,bayes> \
    -ng 1 -source hdfs
$ ./mahout testclassifier -d 20news-test -m 20news-model \
    -type <cbayes,bayes> \
    -ng 1 -source hdfs
Output: confusion matrix
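A confusion matrix simply counts, for every actual class, how often each class was predicted. The sketch below shows the bookkeeping in plain Python; it does not reproduce Mahout's output format, and the labels and predictions are invented toy data.

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

# Hypothetical results for two newsgroup classes.
labels = ["sci.space", "rec.autos"]
actual = ["sci.space", "sci.space", "rec.autos", "rec.autos", "rec.autos"]
predicted = ["sci.space", "rec.autos", "rec.autos", "rec.autos", "sci.space"]
matrix = confusion_matrix(actual, predicted, labels)
```

Entries on the diagonal are correct classifications; everything off the diagonal is a misclassification.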
![Page 9: final_ppt](https://reader030.fdocuments.net/reader030/viewer/2022013121/54364a0a219acdd95f8b52e4/html5/thumbnails/9.jpg)
Logistic Regression
"x","y","shape","color","k","k0","xx","xy","yy","a","b","c","bias"
0.923307513352484,0.0135197141207755,21,2,4,8,0.852496764213146,...,1
./mahout trainlogistic --input input.csv --output ./model \
    --target color --categories 2
./mahout runlogistic --input test.csv --model ./model \
    --auc --confusion
AUC = 0.97
CONFUSION MATRIX (O/P)
      A      B
A   24.0    2.0
B    3.0   11.0
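As a quick worked check, the overall accuracy can be read off the confusion matrix shown on this slide: the diagonal holds the correctly classified examples, whichever axis is "actual". The variable names are only for illustration.

```python
# Confusion matrix values from the slide.
matrix = [[24.0, 2.0],
          [3.0, 11.0]]

total = sum(sum(row) for row in matrix)                   # all test examples
correct = sum(matrix[i][i] for i in range(len(matrix)))   # diagonal entries
accuracy = correct / total
```

That is 35 of 40 examples classified correctly, an accuracy of 0.875.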
![Page 10: final_ppt](https://reader030.fdocuments.net/reader030/viewer/2022013121/54364a0a219acdd95f8b52e4/html5/thumbnails/10.jpg)
Random Forest
Input: ARFF or CSV
Generate a file descriptor for the dataset:
$ $HADOOP_HOME/bin/hadoop jar \
    $MAHOUT_HOME/core/target/mahout-core-0.6-SNAPSHOT-job.jar \
    org.apache.mahout.df.tools.Describe -p KDDTrain.arff -f Train.info \
    -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
Run the example:
$ $HADOOP_HOME/bin/hadoop jar \
    $MAHOUT_HOME/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar \
    org.apache.mahout.df.mapreduce.BuildForest <options>
Use the decision forest to classify new data:
$ $HADOOP_HOME/bin/hadoop jar \
    $MAHOUT_HOME/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar \
    org.apache.mahout.df.mapreduce.TestForest -i Test.arff -ds Train.info <options>
Output: confusion matrix
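The `-d` descriptor string is compact, so it helps to expand it. The sketch below assumes the usual reading of this notation: `N` is a numerical attribute, `C` a categorical one, `L` the label, and a number repeats the token that follows it. The function name is hypothetical and this is not Mahout's own parser.

```python
def expand_descriptor(descriptor):
    """Expand e.g. '3 C' into ['C', 'C', 'C']; a bare token counts once."""
    expanded = []
    repeat = 1
    for token in descriptor.split():
        if token.isdigit():
            repeat = int(token)      # applies to the next attribute token
        else:
            expanded.extend([token] * repeat)
            repeat = 1
    return expanded

attributes = expand_descriptor("N 3 C 2 N C 4 N C 8 N 2 C 19 N L")
```

Under that reading the descriptor covers 42 columns: 34 numerical, 7 categorical, and the label in the last position.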
![Page 11: final_ppt](https://reader030.fdocuments.net/reader030/viewer/2022013121/54364a0a219acdd95f8b52e4/html5/thumbnails/11.jpg)
Dimension reduction
Algorithms implemented: Singular Value Decomposition, Stochastic Singular Value Decomposition
Upcoming algorithms: Principal Components Analysis, Independent Component Analysis, Gaussian Discriminative Analysis
Input: real-valued matrix
0.12 0.8 0.123
0.89 2.33 1.445
4.12 2.123 3.12
./mahout <svd/ssvd> <options>
Output: eigenvectors
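To show what "eigenvectors" means for the 3×3 input matrix on this slide, the sketch below finds the dominant right singular vector by power iteration on AᵀA. Power iteration is swapped in here for brevity; it is not the algorithm behind Mahout's `svd` or `ssvd` jobs, and the helper names are made up.

```python
import math

# The input matrix from the slide.
A = [[0.12, 0.8, 0.123],
     [0.89, 2.33, 1.445],
     [4.12, 2.123, 3.12]]

def matvec(m, v):
    return [sum(a * x for a, x in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def top_singular(m, iterations=200):
    """Return the largest singular value and its right singular vector."""
    mt = transpose(m)
    v = [1.0] * len(m[0])
    for _ in range(iterations):
        w = matvec(mt, matvec(m, v))   # apply A^T A
        n = norm(w)
        v = [x / n for x in w]         # renormalize to unit length
    sigma = norm(matvec(m, v))
    return sigma, v

sigma, v = top_singular(A)
```

The returned `v` is an eigenvector of AᵀA with eigenvalue `sigma ** 2`; projecting rows of A onto the leading few such vectors is what reduces the dimension.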
![Page 12: final_ppt](https://reader030.fdocuments.net/reader030/viewer/2022013121/54364a0a219acdd95f8b52e4/html5/thumbnails/12.jpg)
Frequent Pattern Mining
Algorithm: Parallel FP-Growth
Input: dat or CSV
Running Parallel FP-Growth:
$ ./mahout fpg -i retail.dat -o patterns -k 50 -method mapreduce -regex '[\ ]' -s 2
Viewing the results:
$ ./mahout seqdumper -s patterns/part-?-00000 -n 4
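The minimum support option (`-s 2`) keeps only itemsets that occur in at least two transactions. The sketch below illustrates that idea with brute-force counting on space-separated transactions; it is deliberately not FP-Growth, it caps itemset size at two, and the function name and sample transactions are invented.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support=2, max_size=2):
    """Count every itemset up to max_size and keep those meeting min_support."""
    counts = Counter()
    for line in transactions:
        items = sorted(set(line.split(" ")))   # items are separated by a single space
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

# Hypothetical transactions, one per line, items given as IDs.
transactions = ["1 2 3", "1 2", "2 3", "4"]
patterns = frequent_itemsets(transactions)
```

FP-Growth reaches the same kind of result without enumerating every candidate, which is what makes it practical on large transaction files.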
![Page 13: final_ppt](https://reader030.fdocuments.net/reader030/viewer/2022013121/54364a0a219acdd95f8b52e4/html5/thumbnails/13.jpg)
Recommenders / Collaborative Filtering
Algorithms:
• Non-distributed recommenders ("Taste")
• Distributed item-based collaborative filtering
• Collaborative filtering using a parallel matrix factorization
Input is a text file: user, item, preference
![Page 14: final_ppt](https://reader030.fdocuments.net/reader030/viewer/2022013121/54364a0a219acdd95f8b52e4/html5/thumbnails/14.jpg)
Collaborative Filtering using a parallel matrix factorization
• Run distributed ALS-WR to factorize the rating matrix defined by the training set:
$MAHOUT parallelALS --input TrainingSet --output out \
    --tempDir tmp --numFeatures 20 --numIterations 10 --lambda 0.065
• Compute predictions against the probe set and measure the error:
$MAHOUT evaluateFactorization --input TrainingSet --output op \
    --tempDir tmp1
• Compute recommendations:
$MAHOUT recommendfactorized --input userRatings --output recommendations \
    --numRecommendations 6 --maxRating 5
Input: rating matrix or CSV
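Once the rating matrix is factorized, a predicted rating is the dot product of a user's feature vector and an item's feature vector, and the error against held-out ratings can be summarized as a root-mean-square error. The sketch below assumes RMSE as the error measure and uses tiny made-up factors with two features instead of the 20 used above; all names and numbers are hypothetical.

```python
import math

def predict(user_features, item_features):
    # Predicted rating = dot product of the two feature vectors.
    return sum(u * i for u, i in zip(user_features, item_features))

def rmse(ratings, user_factors, item_factors):
    """ratings is a list of (user, item, rating) triples."""
    squared_errors = [
        (rating - predict(user_factors[user], item_factors[item])) ** 2
        for user, item, rating in ratings
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical factors and probe ratings.
user_factors = {1: [1.0, 2.0], 2: [0.5, 1.0]}
item_factors = {10: [2.0, 1.0], 20: [1.0, 1.0]}
probe = [(1, 10, 5.0), (1, 20, 3.0), (2, 10, 2.0)]
error = rmse(probe, user_factors, item_factors)
```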
![Page 15: final_ppt](https://reader030.fdocuments.net/reader030/viewer/2022013121/54364a0a219acdd95f8b52e4/html5/thumbnails/15.jpg)
SUMMARY

| ALGORITHMS | INPUT |
| --- | --- |
| All clustering algorithms; Bayes, CBayes classifiers | Sparse vector |
| Logistic regression, Random forest, FP Growth | CSV |
| Taste, Collaborative filtering | User, item, preference |
| SVD, SSVD | Matrix |
![Page 16: final_ppt](https://reader030.fdocuments.net/reader030/viewer/2022013121/54364a0a219acdd95f8b52e4/html5/thumbnails/16.jpg)