SDEC2011 Essentials of Mahout

Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC. Copyright for all other & referenced work is retained by their respective owners. Essentials of Mahout Mastering Hadoop Map-reduce for Data Analysis Shashank Tiwari blog: shanky.org | twitter: @tshanky st@treasuryofideas.com

Page 1: SDEC2011 Essentials of Mahout

Essentials of Mahout
Mastering Hadoop Map-reduce for Data Analysis

Shashank Tiwari
blog: shanky.org | twitter: @tshanky
st@treasuryofideas.com

Page 2: SDEC2011 Essentials of Mahout

What is Apache Mahout?

• A scalable machine learning infrastructure

• Built on top of Hadoop MapReduce

• Currently supports:

• Clustering, classification, collaborative filtering, and more

Page 3: SDEC2011 Essentials of Mahout

A Little History

• Founded by folks active in the Lucene community

• Inspired by work at Stanford: “Map-Reduce for Machine Learning on Multicore” -- http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf

Page 4: SDEC2011 Essentials of Mahout

Project Goal

• Create a community-driven, scalable, and robust machine learning infrastructure

• Leverage Hadoop for parallel processing and scalability

• Provide an abstraction on top of Hadoop so that machine-learning users need not deal with the map and reduce primitives when they build their solutions.

Page 5: SDEC2011 Essentials of Mahout

Supported Algorithms

• Collaborative Filtering

• User-based and item-based recommenders

• K-Means, Fuzzy K-Means clustering

• Mean Shift clustering

• Dirichlet process clustering

Page 6: SDEC2011 Essentials of Mahout

More Supported Algorithms

• Latent Dirichlet Allocation

• Singular value decomposition

• Parallel Frequent Pattern mining

• Complementary Naive Bayes classifier

• Random forest (decision-tree-based) classifier

• ...and growing

Page 7: SDEC2011 Essentials of Mahout

Focus Areas

• Collaborative Filtering

• Clustering

• Classification

Page 8: SDEC2011 Essentials of Mahout

Build and Install

• Required Software:

• Java 1.6.x

• Maven 2.0.11+

• Get source: svn co http://svn.apache.org/repos/asf/mahout/trunk mahout

• Compile & install core & examples: mvn install

• Alternatively, run the steps individually: mvn compile, mvn package, and mvn install

Page 9: SDEC2011 Essentials of Mahout

Recommendation Examples

• mvn -q exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.example.grouplens.GroupLensRecommenderEvaluatorRunner" -Dexec.args="-i /Users/tshanky/workspace/hadoop_workspace/grouplens/ratings.dat"

• https://cwiki.apache.org/confluence/display/MAHOUT/RecommendationExamples

Page 10: SDEC2011 Essentials of Mahout

Common Use Cases

• Shopping: Amazon, Netflix

• Who to follow/friend: Twitter/Facebook

• Web resource classification, spam filtering, pattern recognition in financial markets

Page 11: SDEC2011 Essentials of Mahout

Collaborative Filtering Basis

• User-based: recommend items by finding similar users. User preferences keep changing, so this method poses challenges.

• Item-based: calculate similarity between items and make recommendations. Items usually don’t change much, so this method is often reliable.

• Slope-one: fast and efficient item-based recommendation for when user ratings are numeric rather than boolean (yes/no, like/dislike).

• Model-based: build a model of users and their ratings, then make recommendations from that model.
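
As a rough illustration of the slope-one idea in the list above, the plain-Java sketch below predicts a missing rating from the average rating difference between pairs of items, weighted by how many users rated both. It does not use Mahout's own classes; the class name, method names, and the three-user sample data are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class SlopeOneSketch {

    // Hypothetical sample data: user -> (item -> rating).
    static Map<String, Map<String, Double>> sampleData() {
        Map<String, Map<String, Double>> data = new HashMap<>();
        rate(data, "john", "A", 5.0);
        rate(data, "john", "B", 3.0);
        rate(data, "john", "C", 2.0);
        rate(data, "mark", "A", 3.0);
        rate(data, "mark", "B", 4.0);
        rate(data, "lucy", "B", 2.0);
        rate(data, "lucy", "C", 5.0);
        return data;
    }

    static void rate(Map<String, Map<String, Double>> data, String user, String item, double value) {
        data.computeIfAbsent(user, k -> new HashMap<>()).put(item, value);
    }

    /** Weighted slope-one estimate of user's rating for targetItem; NaN if there is no overlap. */
    static double predict(Map<String, Map<String, Double>> data, String user, String targetItem) {
        Map<String, Double> userRatings = data.get(user);
        if (userRatings == null) {
            return Double.NaN;
        }
        double weightedSum = 0.0;
        int totalCount = 0;
        for (Map.Entry<String, Double> rated : userRatings.entrySet()) {
            String otherItem = rated.getKey();
            if (otherItem.equals(targetItem)) {
                continue;
            }
            // Average difference (target - other) over users who rated both items.
            double diffSum = 0.0;
            int count = 0;
            for (Map<String, Double> ratings : data.values()) {
                Double target = ratings.get(targetItem);
                Double other = ratings.get(otherItem);
                if (target != null && other != null) {
                    diffSum += target - other;
                    count++;
                }
            }
            if (count > 0) {
                weightedSum += (rated.getValue() + diffSum / count) * count;
                totalCount += count;
            }
        }
        return totalCount == 0 ? Double.NaN : weightedSum / totalCount;
    }

    public static void main(String[] args) {
        // Lucy has not rated item A; estimate it from her ratings of B and C.
        System.out.println(predict(sampleData(), "lucy", "A"));
    }
}
```

With this sample data the estimate for Lucy on item A works out to 13/3, roughly 4.33: the B-to-A difference averages 0.5 over two users, and the C-to-A difference is 3 from one user.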

Page 12: SDEC2011 Essentials of Mahout

Clustering Basis

• Clustering algorithms also use the notion of similarity to group similar items into a cluster.

• Both collaborative filtering and clustering use the notion of a distance, which can be calculated using a number of different techniques.

• Example: Euclidean distance,
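
A minimal sketch of the Euclidean distance mentioned above, written in plain Java rather than with Mahout's distance classes. The class name and the two sample rating vectors are hypothetical; each vector position is assumed to hold one user's rating for the same item.

```java
final class EuclideanDistanceSketch {

    /** Straight-line distance between two vectors of equal length. */
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        double[] first = {5.0, 3.0, 1.0};
        double[] second = {2.0, 7.0, 1.0};
        // Differences are (3, -4, 0), so the distance is sqrt(9 + 16 + 0).
        System.out.println(distance(first, second)); // prints 5.0
    }
}
```

A smaller distance means the two users (or items) are more alike, which is how the same measure can serve both recommenders and clustering.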

Page 14: SDEC2011 Essentials of Mahout

Mahout Taste Framework

• Taste Collaborative Filtering:

• Taste is an open source project for CF started by Sean Owen on SourceForge and donated to Mahout in 2008.

• Has been applied to a number of different data sets successfully.

• Mahout's support for building recommendation engines is based primarily on the Taste library.

• The library supports both user-based and item-based recommendations.

• Can be used with Java or over RESTful web-service endpoints.

Page 15: SDEC2011 Essentials of Mahout

Taste Framework : Primary Classes

• DataModel: Model for Users, Items, and Preferences

• UserSimilarity: Interface defining the similarity between two users

• ItemSimilarity: Interface defining the similarity between two items

• Recommender: Interface for providing recommendations

• UserNeighborhood: Interface for computing a neighborhood of similar users. These are used by the Recommenders.
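
To show how the roles listed above fit together, here is a self-contained Java sketch. It deliberately avoids Mahout's real interfaces: every name in it is hypothetical, the data model is just nested maps, and the similarity measure (1 / (1 + Euclidean distance over co-rated items)) is an assumption chosen for brevity, not necessarily what Taste uses.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TasteRolesSketch {

    // "DataModel" role: userId -> (itemId -> preference). Sample data is made up.
    static Map<Long, Map<Long, Double>> sampleModel() {
        Map<Long, Map<Long, Double>> model = new HashMap<>();
        addPref(model, 1L, 101L, 5.0);
        addPref(model, 1L, 102L, 3.0);
        addPref(model, 2L, 101L, 5.0);
        addPref(model, 2L, 102L, 3.0);
        addPref(model, 2L, 103L, 4.0);
        addPref(model, 3L, 101L, 1.0);
        addPref(model, 3L, 102L, 5.0);
        addPref(model, 3L, 103L, 1.0);
        return model;
    }

    static void addPref(Map<Long, Map<Long, Double>> model, long user, long item, double value) {
        model.computeIfAbsent(user, k -> new HashMap<>()).put(item, value);
    }

    // "UserSimilarity" role: 1 / (1 + Euclidean distance over co-rated items).
    static double similarity(Map<Long, Double> a, Map<Long, Double> b) {
        double sumOfSquares = 0.0;
        int shared = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                double diff = e.getValue() - other;
                sumOfSquares += diff * diff;
                shared++;
            }
        }
        return shared == 0 ? 0.0 : 1.0 / (1.0 + Math.sqrt(sumOfSquares));
    }

    // "UserNeighborhood" role: the n users most similar to userId.
    static List<Long> neighborhood(Map<Long, Map<Long, Double>> model, long userId, int n) {
        Map<Long, Double> mine = model.get(userId);
        List<Long> others = new ArrayList<>(model.keySet());
        others.remove(Long.valueOf(userId));
        others.sort(Comparator.comparingDouble((Long u) -> -similarity(mine, model.get(u))));
        return new ArrayList<>(others.subList(0, Math.min(n, others.size())));
    }

    // "Recommender" role: similarity-weighted average of the neighbors' ratings
    // for items the user has not rated yet.
    static Map<Long, Double> recommend(Map<Long, Map<Long, Double>> model, long userId, int n) {
        Map<Long, Double> mine = model.get(userId);
        Map<Long, Double> weighted = new HashMap<>();
        Map<Long, Double> weights = new HashMap<>();
        for (Long neighbor : neighborhood(model, userId, n)) {
            double sim = similarity(mine, model.get(neighbor));
            for (Map.Entry<Long, Double> e : model.get(neighbor).entrySet()) {
                if (!mine.containsKey(e.getKey())) {
                    weighted.merge(e.getKey(), sim * e.getValue(), Double::sum);
                    weights.merge(e.getKey(), sim, Double::sum);
                }
            }
        }
        Map<Long, Double> estimates = new HashMap<>();
        for (Map.Entry<Long, Double> e : weighted.entrySet()) {
            double totalWeight = weights.get(e.getKey());
            if (totalWeight > 0.0) {
                estimates.put(e.getKey(), e.getValue() / totalWeight);
            }
        }
        return estimates;
    }

    public static void main(String[] args) {
        // User 1 agrees exactly with user 2, so user 2's rating of item 103 dominates.
        System.out.println(recommend(sampleModel(), 1L, 1));
    }
}
```

The point of the split is that each role can be swapped independently: a different similarity measure or neighborhood size changes the recommendations without touching the data model.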

Page 16: SDEC2011 Essentials of Mahout

Taste Framework : Online vs Offline

• Can do online recommendations for a few thousand data sets.

• Leverages Hadoop for offline recommendation calculations on large data sets.

Page 17: SDEC2011 Essentials of Mahout

Understanding the Group Lens Implementation

• Provides insight into a sample Mahout Taste framework implementation.

• Uses the publicly available GroupLens data set

• Part of the distribution so you can analyze it, modify it, and use it as an inspiration for your own implementation

• Easy to follow example

Page 18: SDEC2011 Essentials of Mahout

Group Lens Implementation Source

• GroupLensDataModel.java

• GroupLensRecommender.java

• GroupLensRecommenderBuilder.java

• GroupLensRecommenderEvaluatorRunner.java

Page 19: SDEC2011 Essentials of Mahout

Group Lens Runner -- evaluator

• Instantiates an evaluator:

• RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();

• Computes the mean absolute error (MAE): the average absolute difference between predicted and actual ratings

• Parses input parameters:

• File ratingsFile = TasteOptionParser.getRatings(args);
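
The score such an evaluator reports can be sketched in a few lines of plain Java. This is an illustration of the average-absolute-difference calculation only, not Mahout's implementation; the class name and the sample ratings are hypothetical.

```java
final class MeanAbsoluteErrorSketch {

    /** Average of |predicted - actual| over all held-out ratings. */
    static double meanAbsoluteError(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("need two non-empty arrays of equal length");
        }
        double total = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            total += Math.abs(predicted[i] - actual[i]);
        }
        return total / predicted.length;
    }

    public static void main(String[] args) {
        double[] predicted = {4.0, 3.0, 5.0};
        double[] actual = {5.0, 3.0, 3.0};
        // Absolute errors are 1, 0, and 2, so the mean is 3 / 3.
        System.out.println(meanAbsoluteError(predicted, actual)); // prints 1.0
    }
}
```

A lower score is better: 0.0 would mean every estimated rating matched the real one.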

Page 20: SDEC2011 Essentials of Mahout

Group Lens Runner -- data model

• Parses a colon-delimited ratings file:

• DataModel model = ratingsFile == null ? new GroupLensDataModel() : new GroupLensDataModel(ratingsFile);
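
For a sense of what parsing such a file involves, here is a minimal plain-Java sketch. It assumes each line of ratings.dat has the form userId::itemId::rating::timestamp; the class, the Rating holder, and the sample line are illustrative and are not taken from GroupLensDataModel.

```java
final class RatingLineParserSketch {

    /** Simple value holder for one parsed preference. */
    static final class Rating {
        final long userId;
        final long itemId;
        final float value;

        Rating(long userId, long itemId, float value) {
            this.userId = userId;
            this.itemId = itemId;
            this.value = value;
        }
    }

    /** Parses one "userId::itemId::rating[::timestamp]" line; the timestamp is ignored. */
    static Rating parse(String line) {
        String[] fields = line.trim().split("::");
        if (fields.length < 3) {
            throw new IllegalArgumentException("expected userId::itemId::rating, got: " + line);
        }
        return new Rating(
                Long.parseLong(fields[0]),
                Long.parseLong(fields[1]),
                Float.parseFloat(fields[2]));
    }

    public static void main(String[] args) {
        Rating r = parse("1::1193::5::978300760");
        System.out.println(r.userId + " rated " + r.itemId + " as " + r.value);
    }
}
```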

Page 21: SDEC2011 Essentials of Mahout

Group Lens Runner -- evaluate with recommendation builder

• Evaluates using GroupLensRecommender:

• double evaluation = evaluator.evaluate(new GroupLensRecommenderBuilder(), null, model, 0.9, 0.3);
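
In the call above, per the RecommenderEvaluator Javadoc, 0.9 is the share of each user's preferences used for training (the rest are held out and compared with the estimates) and 0.3 is the share of users included in the evaluation. The hold-out step can be sketched in plain Java as follows; the class and method names are hypothetical, and the random per-preference assignment is an assumption about the general technique, not a copy of Mahout's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

final class HoldOutSplitSketch {

    /**
     * Splits one user's preferences into [training, test].
     * Each preference lands in training with probability trainingFraction.
     */
    static <T> List<List<T>> split(List<T> prefs, double trainingFraction, Random random) {
        List<T> training = new ArrayList<>();
        List<T> test = new ArrayList<>();
        for (T pref : prefs) {
            if (random.nextDouble() < trainingFraction) {
                training.add(pref);
            } else {
                test.add(pref);
            }
        }
        List<List<T>> result = new ArrayList<>();
        result.add(training);
        result.add(test);
        return result;
    }

    public static void main(String[] args) {
        List<Integer> itemIds = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        List<List<Integer>> parts = split(itemIds, 0.9, new Random(42));
        System.out.println("training: " + parts.get(0));
        System.out.println("test:     " + parts.get(1));
    }
}
```

The recommender is then built from the training part only, and its estimates for the test part are what the evaluator scores.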

Page 22: SDEC2011 Essentials of Mahout

Questions?

• blog: shanky.org | twitter: @tshanky

• st@treasuryofideas.com