Mahout Tutorial FOSSMEET NITC

Practical Machine Learning: A Tutorial on Apache Mahout. Biju B, NLP R&D Division, 365Media Pvt. Ltd. [email protected] FOSSMEET NITC, Calicut, 4-6 February 2011. Biju B & Jaganadh G, Practical Machine Learning.

description

Biju B and Jaganadh G's presentation on Mahout at FOSSMEET-NITC

Transcript of Mahout Tutorial FOSSMEET NITC

Page 1: Mahout Tutorial FOSSMEET NITC

Practical Machine Learning: A Tutorial on Apache Mahout

Biju B, NLP R&D Division, 365Media Pvt. Ltd. [email protected]

FOSSMEET NITC, Calicut

4-6 February 2011

Biju B & Jaganadh G Practical Machine Learning

Page 2: Mahout Tutorial FOSSMEET NITC

nlp r d $ whoweare

Working in Natural Language Processing (NLP), Machine Learning, and Data Mining

Passionate about Free and Open Source :-)

In free time, teaches Python, blogs at http://jaganadhg.freeflux.net/blog, and contributes to OpenStreetMap

Work for 365Media Pvt. Ltd., Coimbatore, India

Twitter handles: @jaganadhg, @bijub

Page 3: Mahout Tutorial FOSSMEET NITC

Machine Learning

Machine learning is a subfield of artificial intelligence (AI) concerned with algorithms that allow computers to learn.

This talk is not intended as an introduction to Machine Learning

Don't expect mathy equations here

Page 7: Mahout Tutorial FOSSMEET NITC

Machine Learning and Our Life

Do you think Machine Learning has any impact on our lives?

Yes

In our day-to-day lives we use many Machine Learning powered tools

Recommendation Engines

Clustering

Classification, Spam Filtering

Sentiment Analysis

Fraud Detection

Page 15: Mahout Tutorial FOSSMEET NITC

Mahout

Open source project of the Apache Software Foundation

The goal of the project is to build scalable machine learning libraries

Page 16: Mahout Tutorial FOSSMEET NITC

Mahout

Mahout: a person who drives an elephant ;-)

The name comes from the project's use of Apache Hadoop, whose logo is an elephant.

Page 17: Mahout Tutorial FOSSMEET NITC

Why a new library?

There are more than 30 Java libraries and tools available for Machine Learning: Weka, Mallet, Classifier4J, RapidMiner, ...

Processing a large amount of data is not an easy task

Machine Learning tools are expected to produce results quickly

If the amount of data is too large, it is not easy to process on a single machine (even a powerful one)

Mahout is scalable: the core algorithms in Mahout are implemented on top of Apache Hadoop using the map/reduce paradigm
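
To make the map/reduce paradigm mentioned above concrete, here is a minimal single-process sketch in Python. It is only an illustration of the three phases (map, shuffle, reduce) on a word-count task; it does not use Hadoop or Mahout, and the function names and sample sentences are assumptions made up for this example.

```python
from collections import defaultdict


def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run the three phases in sequence on one machine."""
    return reduce_phase(shuffle(map_phase(documents)))


# Hypothetical input, invented for this illustration.
counts = word_count(["mahout drives the elephant", "the elephant is hadoop"])
```

On a real cluster the map and reduce steps would run in parallel on different machines, each handling only a slice of the data, which is what lets the approach scale beyond a single machine.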

Page 18: Mahout Tutorial FOSSMEET NITC

Algorithms in Apache Mahout

Collaborative Filtering

User- and item-based recommenders

K-Means, Fuzzy K-Means clustering

Mean Shift clustering

Dirichlet process clustering

Latent Dirichlet Allocation

Singular value decomposition

Parallel Frequent Pattern mining

Complementary Naive Bayes classifier

Random forest (decision-tree-based) classifier

Page 29: Mahout Tutorial FOSSMEET NITC

Recommendation

Filter information based on user preferences

Search a large set of people to find a smaller set whose tastes are similar to yours

e.g. Amazon's book recommendations, Netflix movie recommendations
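
The idea of finding people with similar tastes can be sketched as a tiny user-based recommender in plain Python. This is a rough illustration under stated assumptions, not Mahout's implementation: the ratings data, the user and item names, and the choice of cosine similarity over shared items are all made up for this example.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: rating on a 1-5 scale}.
RATINGS = {
    "alice": {"book_a": 5.0, "book_b": 3.0, "book_c": 4.0},
    "bob": {"book_a": 5.0, "book_b": 3.0, "book_c": 4.0, "book_d": 5.0},
    "carol": {"book_a": 1.0, "book_b": 5.0, "book_e": 4.0},
}


def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def recommend(user, ratings, k=2):
    """Rank items the user has not rated, using the k most similar users."""
    others = [
        (cosine_similarity(ratings[user], ratings[other]), other)
        for other in ratings
        if other != user
    ]
    neighbours = sorted(others, reverse=True)[:k]

    totals, weights = {}, {}
    for similarity, other in neighbours:
        if similarity <= 0:
            continue
        for item, rating in ratings[other].items():
            if item in ratings[user]:
                continue  # skip items the user already knows
            totals[item] = totals.get(item, 0.0) + similarity * rating
            weights[item] = weights.get(item, 0.0) + similarity

    # Score = similarity-weighted average rating among the neighbours.
    scores = {item: totals[item] / weights[item] for item in totals}
    return sorted(scores, key=scores.get, reverse=True)


suggestions = recommend("alice", RATINGS)
```

Here "bob" rates the shared books exactly as "alice" does, so his extra book ranks first; "carol" is less similar, so her extra book ranks second.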

Page 30: Mahout Tutorial FOSSMEET NITC

Document Classification

Classify documents based on their content

e.g. spam filtering, priority inbox
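
As a rough illustration of content-based classification such as spam filtering, the sketch below trains a plain multinomial Naive Bayes classifier with add-one smoothing in Python. It is not Mahout's code and not the Complementary Naive Bayes variant listed earlier; the training sentences, labels, and function names are assumptions invented for this example.

```python
from collections import Counter, defaultdict
from math import log


def train(labelled_docs):
    """Count words per label, documents per label, and the vocabulary."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocabulary = set()
    for text, label in labelled_docs:
        words = text.lower().split()
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        # Start from the log prior of the label.
        score = log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data, invented for this illustration.
model = train([
    ("win cash prize now", "spam"),
    ("cheap prize click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report for review", "ham"),
])
```

Calling `classify("claim your cash prize", model)` picks the label whose training documents share the most vocabulary with the message, weighted by how often each word appeared.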

Page 31: Mahout Tutorial FOSSMEET NITC

Demo

Building recommendation engines with Mahout

Document Classification with Mahout

Page 33: Mahout Tutorial FOSSMEET NITC

Reference

Mahout in Action, by Sean Owen and Robin Anil, published by Manning Publications.

Taming Text, by Grant Ingersoll and Tom Morton, published by Manning Publications.

Introducing Apache Mahout, by Grant Ingersoll: an introduction to Apache Mahout focused on clustering, classification, and collaborative filtering. https://www.ibm.com/developerworks/java/library/j-mahout/index.html

Programming Collective Intelligence: Building Smart Web 2.0 Applications. http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325

Page 34: Mahout Tutorial FOSSMEET NITC

Useful Resources

Apache Mahout site: http://mahout.apache.org/

Apache Mahout mailing list: [email protected]

The code which I used for the Mahout demo is available at http://bitbucket.org/jaganadhg/blog/src/tip/bck9/java/

Twenty Newsgroups data set: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz

Page 35: Mahout Tutorial FOSSMEET NITC

Questions?

Page 36: Mahout Tutorial FOSSMEET NITC

Acknowledgments

Thanks to :

Manning Publications for a review copy of the book "Mahout in Action"

Apache Mahout mailing list members

Ted Dunning and Robin Anil for suggestions

@chelakkandupoda for review and criticism

Mukundhanchari, R&D Director, 365Media Pvt. Ltd., for support and encouragement

Page 37: Mahout Tutorial FOSSMEET NITC

Finally
