Apache Mahout


Page 1: Apache  Mahout

Apache Mahout

Qiaodi Zhuang, Xijing Zhang

Page 2: Apache  Mahout

What is Mahout?

Mahout is a scalable machine learning library from Apache.

It uses the MapReduce paradigm, which in combination with Hadoop offers an inexpensive way to solve machine learning problems.

[1] Owen, Sean, Robin Anil, Ted Dunning, and Ellen Friedman. Mahout in Action. Manning, 2011.

Page 3: Apache  Mahout

Problem & Challenge

Many datasets are now far too large for a single machine and cannot fit into main memory.

[2] http://www.orzota.com/apache-mahout-and-machine-learning/

Page 4: Apache  Mahout

Mahout’s Algorithms:

Clustering: K-means, Fuzzy K-means

Classification: SVM, Random Forests

Recommender

Pattern Mining

Regression

Page 5: Apache  Mahout

K-means Algorithm:

Input: a database D of m records r1, ..., rm, and a desired number of clusters k

Output: a set of k clusters that minimizes the squared-error criterion

Begin
    Randomly choose k records as the centroids for the k clusters;
    repeat
        assign each record ri to the cluster whose centroid (mean) is closest to ri among the k clusters;
        recalculate the centroid (mean) of each cluster based on the records assigned to it;
    until no change;
End;
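The k-means steps on this slide can be sketched in plain Python. This is a minimal single-machine illustration, not Mahout code. The names `kmeans`, `squared_distance`, and `squared_error`, and the `initial_centroids`, `seed`, and `max_iter` parameters, are assumptions introduced here for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(records, k, initial_centroids=None, seed=0, max_iter=100):
    """Cluster `records` into k groups, following the slide's pseudocode.

    `initial_centroids`, `seed`, and `max_iter` are hypothetical
    conveniences for this sketch; the slide only says "randomly choose
    k records" and "until no change".
    """
    if initial_centroids is None:
        # Randomly choose k records as the starting centroids.
        rng = random.Random(seed)
        initial_centroids = rng.sample(records, k)
    centroids = [list(c) for c in initial_centroids]

    assignment = None
    for _ in range(max_iter):
        # Assign each record to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(r, centroids[c]))
            for r in records
        ]
        # "until no change": stop once no record switches cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment

        # Recalculate each centroid as the mean of its assigned records.
        for c in range(k):
            members = [r for r, a in zip(records, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]

    return centroids, assignment


def squared_error(records, centroids, assignment):
    """Squared-error criterion: total squared distance to assigned centroids."""
    return sum(squared_distance(r, centroids[a]) for r, a in zip(records, assignment))
```

Because the starting centroids are chosen at random, the result is in general a local minimum of the squared-error criterion; different starting points can give different clusters.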

Page 6: Apache  Mahout

K-means Clustering in Mahout

[3] K-means Clustering in the Cloud -- A Mahout Test, R. M. Esteves et al., IEEE Advanced Information Networking and Applications, 2011.
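The figure from this slide is not preserved in the transcript. As a rough illustration of how one k-means iteration can be split into map and reduce steps, here is a single-process Python sketch. It is an assumption about the general approach, not Mahout's actual implementation; `map_assign`, `reduce_recompute`, and `kmeans_iteration` are hypothetical names, and a dictionary stands in for the shuffle phase that Hadoop would perform.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_assign(record, centroids):
    """Map step: emit (index of nearest centroid, (record vector, count 1))."""
    nearest = min(
        range(len(centroids)),
        key=lambda c: squared_distance(record, centroids[c]),
    )
    return nearest, (list(record), 1)


def reduce_recompute(partials):
    """Reduce step: sum the vectors and counts for one cluster, return the mean."""
    total = None
    count = 0
    for vector, n in partials:
        total = list(vector) if total is None else [t + v for t, v in zip(total, vector)]
        count += n
    return [t / count for t in total]


def kmeans_iteration(records, centroids):
    """One k-means iteration expressed as map, group-by-key, reduce."""
    groups = defaultdict(list)  # stands in for the MapReduce shuffle
    for record in records:
        key, value = map_assign(record, centroids)
        groups[key].append(value)

    # Clusters that received no records keep their previous centroid.
    new_centroids = [list(c) for c in centroids]
    for key, partials in groups.items():
        new_centroids[key] = reduce_recompute(partials)
    return new_centroids
```

In this formulation each map call needs only one record plus the current centroids, so the records can be spread over many machines. Emitting (vector, count) pairs rather than bare records also allows partial sums to be combined before the reduce step.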

Page 7: Apache  Mahout

Evaluation

The dataset is from the 1999 KDD Cup. It has 4,940,000 records, each with 41 attributes and 1 label (converted to numerical form). A 1.1 GB dataset was used; this file was randomly segmented into smaller files.

[3] K-means Clustering in the Cloud -- A Mahout Test, R. M. Esteves et al., IEEE Advanced Information Networking and Applications, 2011.

Page 8: Apache  Mahout

[3] K-means Clustering in the Cloud -- A Mahout Test, R. M. Esteves et al., IEEE Advanced Information Networking and Applications, 2011.

Page 9: Apache  Mahout

Future

Classification: decision trees such as J48 and ID3

Clustering: DBSCAN and Cobweb clustering techniques

Association Rules: Apriori

Page 10: Apache  Mahout

References:

[1] Owen, Sean, Robin Anil, Ted Dunning, and Ellen Friedman. Mahout in Action. Manning, 2011.

[2] http://www.orzota.com/apache-mahout-and-machine-learning/

[3] K-means Clustering in the Cloud -- A Mahout Test, R. M. Esteves et al., IEEE Advanced Information Networking and Applications, 2011.

[4] https://mahout.apache.org/

[5] http://www.ibm.com/developerworks/java/library/j-mahout/

Page 11: Apache  Mahout

Questions?

Page 12: Apache  Mahout

Thank you!