SDEC2011 Essentials of Mahout
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.Copyright for all other & referenced work is retained by their respective owners.
Essentials of Mahout
Mastering Hadoop Map-reduce for Data Analysis
Shashank Tiwari
blog: shanky.org | twitter: @tshanky | [email protected]
What is Apache Mahout?
• A scalable machine learning infrastructure
• Built on top of Hadoop MapReduce
• Currently supports:
• Clustering, classification, collaborative filtering, and more
A Little History
• Founded by folks active in the Lucene community
• Inspired by work at Stanford: “Map-Reduce for Machine Learning on Multicore” -- http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf
Project Goal
• Create a community-driven, scalable, and robust machine learning infrastructure
• Leverage Hadoop for parallel processing and scalability
• Provide an abstraction on top of Hadoop so that machine-learning users do not have to deal with the map and reduce primitives when they build their solutions.
Supported Algorithms
• Collaborative Filtering
• User-based and item-based recommenders
• K-Means, Fuzzy K-Means clustering
• Mean Shift clustering
• Dirichlet process clustering
More Supported Algorithms
• Latent Dirichlet Allocation
• Singular value decomposition
• Parallel Frequent Pattern mining
• Complementary Naive Bayes classifier
• Random forest (decision-tree-based) classifier
• ...and growing
Focus Areas
• Collaborative Filtering
• Clustering
• Classification
Build and Install
• Required Software:
• Java 1.6.x
• Maven 2.0.11+
• Get source: svn co http://svn.apache.org/repos/asf/mahout/trunk mahout
• Compile & install core & examples: mvn install
• Alternatively, run the phases individually: mvn compile, mvn package, and mvn install
Recommendation Examples
• mvn -q exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.example.grouplens.GroupLensRecommenderEvaluatorRunner" -Dexec.args="-i /Users/tshanky/workspace/hadoop_workspace/grouplens/ratings.dat"
• https://cwiki.apache.org/confluence/display/MAHOUT/RecommendationExamples
Common Use Cases
• Shopping: Amazon, Netflix
• Who to follow/friend: Twitter/Facebook
• Web resource classification, spam filtering, pattern recognition in financial markets
Collaborative Filtering Basis
• User-based: recommend items by finding similar users. User preferences keep changing so this method poses challenges.
• Item-based: calculate similarity between items and make recommendations. Usually items don’t change much so the method is often reliable.
• Slope-one: fast and efficient item-based recommendation when user ratings are more than boolean yes/no, like/dislike.
• Model-based: provide recommendations by building a model of users and their ratings.
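To make the slope-one idea concrete, here is a minimal sketch of the weighted Slope One scheme in plain Java. It is an illustration only, not Mahout's implementation: the class name, method names, and the tiny ratings table are all made up for this example.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal weighted Slope One sketch. All names are illustrative, not Mahout's API. */
final class SlopeOneSketch {

    /**
     * Predicts a user's rating for targetItem.
     * ratings maps userId -> (itemId -> rating).
     * Returns NaN when no co-rated item pair supports a prediction.
     */
    static double predict(Map<Integer, Map<Integer, Double>> ratings, int userId, int targetItem) {
        Map<Integer, Double> userRatings = ratings.get(userId);
        if (userRatings == null) {
            return Double.NaN;
        }
        double weightedSum = 0.0;
        int totalCount = 0;
        for (Map.Entry<Integer, Double> rated : userRatings.entrySet()) {
            int otherItem = rated.getKey();
            if (otherItem == targetItem) {
                continue;
            }
            // Average difference (target - other) over users who rated both items.
            double diffSum = 0.0;
            int count = 0;
            for (Map<Integer, Double> r : ratings.values()) {
                Double t = r.get(targetItem);
                Double o = r.get(otherItem);
                if (t != null && o != null) {
                    diffSum += t - o;
                    count++;
                }
            }
            if (count > 0) {
                // Weight each item's estimate by the number of users supporting it.
                weightedSum += (rated.getValue() + diffSum / count) * count;
                totalCount += count;
            }
        }
        return totalCount == 0 ? Double.NaN : weightedSum / totalCount;
    }

    /** Made-up sample data: three users (1, 2, 3) rating three items (1, 2, 3). */
    static Map<Integer, Map<Integer, Double>> demoRatings() {
        Map<Integer, Map<Integer, Double>> ratings = new HashMap<>();
        Map<Integer, Double> u1 = new HashMap<>();
        u1.put(1, 5.0); u1.put(2, 3.0); u1.put(3, 2.0);
        Map<Integer, Double> u2 = new HashMap<>();
        u2.put(1, 3.0); u2.put(2, 4.0);
        Map<Integer, Double> u3 = new HashMap<>();
        u3.put(2, 2.0); u3.put(3, 5.0);
        ratings.put(1, u1);
        ratings.put(2, u2);
        ratings.put(3, u3);
        return ratings;
    }

    public static void main(String[] args) {
        // User 3 has not rated item 1; estimate it from items 2 and 3.
        System.out.println(predict(demoRatings(), 3, 1));
    }
}
```

With the sample data, the estimate for user 3 on item 1 combines two item pairs: (2 + 0.5) weighted by 2 co-raters and (5 + 3) weighted by 1, giving 13/3.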
Clustering Basis
• Clustering algorithms also use the notion of similarity to group similar items into a cluster.
• Both Collaborative filtering and clustering use the notion of a distance, which could be calculated using a number of different techniques.
• Example: Euclidean distance,
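As a small illustration of the distance idea, the sketch below computes the Euclidean distance between two rating vectors in plain Java and turns it into a similarity score. The 1 / (1 + d) conversion is one common convention and is an assumption here, as are the class and method names.

```java
/** Euclidean distance sketch. Names and the similarity mapping are illustrative. */
final class EuclideanSketch {

    /** Straight-line distance between two vectors of equal length. */
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have equal length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Maps a distance to (0, 1]: identical vectors score 1, distant ones approach 0. */
    static double similarity(double[] a, double[] b) {
        return 1.0 / (1.0 + distance(a, b));
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0};
        double[] y = {4.0, 6.0};
        System.out.println(distance(x, y));   // prints 5.0
        System.out.println(similarity(x, y));
    }
}
```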
Mahout Taste Framework
• Taste Collaborative Filtering:
• Taste is an open source project for CF started by Sean Owen on SourceForge and donated to Mahout in 2008.
• Has been applied to a number of different data sets successfully.
• Mahout's support for building recommendation engines is based primarily on the Taste library.
• The library supports both user-based and item-based recommendations.
• Can be used with Java or over RESTful web-service endpoints.
Taste Framework : Primary Classes
• DataModel: Model for Users, Items, and Preferences
• UserSimilarity: Interface defining the similarity between two users
• ItemSimilarity: Interface defining the similarity between two items
• Recommender: Interface for providing recommendations
• UserNeighborhood: Interface for computing a neighborhood of similar users. These are used by the Recommenders.
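The sketch below shows how these roles fit together in a user-based recommender, using plain Java stand-ins. It deliberately does not use Mahout: the class, its methods, the similarity formula, and the sample data are assumptions chosen only to mirror the division of responsibilities listed above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy stand-ins for the roles listed above. These are NOT Mahout's interfaces. */
final class TasteRolesSketch {

    /** DataModel role: userId -> (itemId -> preference value). */
    private final Map<Long, Map<Long, Double>> model;

    TasteRolesSketch(Map<Long, Map<Long, Double>> model) {
        this.model = model;
    }

    /** UserSimilarity role: 1 / (1 + Euclidean distance over co-rated items). */
    double userSimilarity(long a, long b) {
        double sumOfSquares = 0.0;
        int shared = 0;
        for (Map.Entry<Long, Double> pref : model.get(a).entrySet()) {
            Double other = model.get(b).get(pref.getKey());
            if (other != null) {
                double diff = pref.getValue() - other;
                sumOfSquares += diff * diff;
                shared++;
            }
        }
        return shared == 0 ? 0.0 : 1.0 / (1.0 + Math.sqrt(sumOfSquares));
    }

    /** UserNeighborhood role: the n users most similar to the given user. */
    List<Long> neighborhood(long user, int n) {
        List<Long> others = new ArrayList<>(model.keySet());
        others.remove(Long.valueOf(user));
        others.sort(Comparator.comparingDouble((Long u) -> -userSimilarity(user, u)));
        return others.subList(0, Math.min(n, others.size()));
    }

    /** Recommender role: similarity-weighted estimate of a preference. */
    double estimate(long user, long item, int n) {
        double weighted = 0.0;
        double totalSimilarity = 0.0;
        for (long neighbor : neighborhood(user, n)) {
            Double pref = model.get(neighbor).get(item);
            if (pref != null) {
                double s = userSimilarity(user, neighbor);
                weighted += s * pref;
                totalSimilarity += s;
            }
        }
        return totalSimilarity == 0.0 ? Double.NaN : weighted / totalSimilarity;
    }

    /** Builds a tiny made-up data set: users 1-3, items 10-12. */
    static TasteRolesSketch demo() {
        Map<Long, Map<Long, Double>> m = new HashMap<>();
        Map<Long, Double> u1 = new HashMap<>();
        u1.put(10L, 5.0); u1.put(11L, 3.0);
        Map<Long, Double> u2 = new HashMap<>();
        u2.put(10L, 5.0); u2.put(11L, 3.0); u2.put(12L, 4.0);
        Map<Long, Double> u3 = new HashMap<>();
        u3.put(10L, 1.0); u3.put(11L, 5.0); u3.put(12L, 2.0);
        m.put(1L, u1);
        m.put(2L, u2);
        m.put(3L, u3);
        return new TasteRolesSketch(m);
    }

    public static void main(String[] args) {
        TasteRolesSketch sketch = demo();
        System.out.println("neighbors of user 1: " + sketch.neighborhood(1L, 1));
        System.out.println("estimate for item 12: " + sketch.estimate(1L, 12L, 1));
    }
}
```

In this toy data, user 2 agrees with user 1 on every shared item, so user 2 forms the neighborhood and user 1's estimate for item 12 is simply user 2's rating.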
Taste Framework : Online vs Offline
• Can serve online (real-time) recommendations when the data set is small, on the order of a few thousand entries.
• Leverages Hadoop for offline recommendation calculations on large data sets.
Understanding the Group Lens Implementation
• Provides insight into a sample Mahout Taste framework implementation.
• Uses the publicly available GroupLens data set
• Part of the distribution so you can analyze it, modify it, and use it as an inspiration for your own implementation
• Easy to follow example
Group Lens Implementation Source
• GroupLensDataModel.java
• GroupLensRecommender.java
• GroupLensRecommenderBuilder.java
• GroupLensRecommenderEvaluatorRunner.java
Group Lens Runner -- evaluator
• Instantiates an evaluator:
• RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
• Uses mean absolute error: the average absolute difference between estimated and actual ratings
• Parses input parameters:
• File ratingsFile = TasteOptionParser.getRatings(args);
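The metric behind the evaluator can be sketched in a few lines of plain Java. This is not the Mahout class itself, just the arithmetic of an average absolute difference; the class name, method name, and sample numbers are made up for illustration.

```java
/** Average absolute difference sketch. Names and data are illustrative only. */
final class MeanAbsoluteErrorSketch {

    /** Mean of |predicted - actual| over all pairs; lower is better, 0 is a perfect match. */
    static double meanAbsoluteError(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("need two non-empty arrays of equal length");
        }
        double total = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            total += Math.abs(predicted[i] - actual[i]);
        }
        return total / predicted.length;
    }

    public static void main(String[] args) {
        double[] predicted = {4.0, 3.0, 5.0};
        double[] actual = {5.0, 3.0, 3.0};
        // Absolute errors are 1, 0, and 2, so the mean is 1.0.
        System.out.println(meanAbsoluteError(predicted, actual));
    }
}
```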
Group Lens Runner -- data model
• Parses a colon-delimited ratings file:
• DataModel model = ratingsFile == null ? new GroupLensDataModel() : new GroupLensDataModel(ratingsFile);
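For a feel of what parsing such a file involves, here is a plain Java sketch that reads one line. It assumes the layout userID::itemID::rating::timestamp with a double-colon separator; the class name and the sample line are hypothetical, and this is not the GroupLensDataModel code.

```java
/** One parsed rating. The "::" layout is an assumption; names are illustrative. */
final class RatingLineSketch {
    final long userId;
    final long itemId;
    final double rating;

    RatingLineSketch(long userId, long itemId, double rating) {
        this.userId = userId;
        this.itemId = itemId;
        this.rating = rating;
    }

    /** Parses "userID::itemID::rating[::timestamp]"; the timestamp, if present, is ignored. */
    static RatingLineSketch parse(String line) {
        String[] fields = line.trim().split("::");
        if (fields.length < 3) {
            throw new IllegalArgumentException("expected at least 3 fields: " + line);
        }
        return new RatingLineSketch(
                Long.parseLong(fields[0]),
                Long.parseLong(fields[1]),
                Double.parseDouble(fields[2]));
    }

    public static void main(String[] args) {
        // Made-up sample line, not taken from the real data set.
        RatingLineSketch r = parse("42::1001::4::978300000");
        System.out.println(r.userId + " rated " + r.itemId + " as " + r.rating);
    }
}
```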
Group Lens Runner -- evaluate with recommendation builder
• Evaluates using the GroupLensRecommender
• double evaluation = evaluator.evaluate(new GroupLensRecommenderBuilder(), null, model, 0.9, 0.3);
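The last two arguments, 0.9 and 0.3, are the training percentage and the evaluation percentage. The plain Java sketch below illustrates one way such a hold-out split can work, assuming that roughly 90% of each selected user's preferences go to training, the rest are held out for testing, and only about 30% of users take part at all. The class and method names are hypothetical and the logic is a simplification, not Mahout's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hold-out split sketch. A simplified illustration, not Mahout's implementation. */
final class HoldOutSplitSketch {
    final List<Integer> training = new ArrayList<>();
    final List<Integer> test = new ArrayList<>();

    /** Sends each of one user's preference indices to training with the given probability. */
    static HoldOutSplitSketch split(int prefCount, double trainingPercentage, Random random) {
        HoldOutSplitSketch result = new HoldOutSplitSketch();
        for (int i = 0; i < prefCount; i++) {
            if (random.nextDouble() < trainingPercentage) {
                result.training.add(i);
            } else {
                result.test.add(i);
            }
        }
        return result;
    }

    /** Decides whether a user is included in the evaluation at all. */
    static boolean includeUser(double evaluationPercentage, Random random) {
        return random.nextDouble() < evaluationPercentage;
    }

    public static void main(String[] args) {
        Random random = new Random(42L);
        if (includeUser(0.3, random)) {
            HoldOutSplitSketch s = split(100, 0.9, random);
            System.out.println(s.training.size() + " training, " + s.test.size() + " held out");
        } else {
            System.out.println("user skipped for this evaluation run");
        }
    }
}
```

The recommender is then built from the training portion and scored on how closely its estimates match the held-out preferences, so a lower evaluation value indicates a better fit.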
Questions?
• blog: shanky.org | twitter: @tshanky