Introducing Apache Mahout
Scalable Machine Learning for All!
Grant Ingersoll
Agenda
• What is Machine Learning?
  – Definitions
  – Types
  – Applications
• Mahout
  – What?
  – Why?
  – How?
  – Who?
What is Machine Learning?
[Image: http://upload.wikimedia.org/wikipedia/en/4/49/Terminator.jpg]
Or?
[Image: http://en.wikipedia.org/wiki/Image:Hal-9000.jpg]
NOT!
How about?
Google News
Or?
Amazon.com
Definition
• “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
  – Intro. to Machine Learning by E. Alpaydin
• Subset of Artificial Intelligence
  – Many other fields: comp sci., biology, math, psychology, etc.
Characterizations
• Lots of Data
• Identifiable Features in that Data
• Too big/costly for people to handle
  – People still can help
Types
• Supervised
  – Using labeled training data, create function that predicts output of unseen inputs
• Unsupervised
  – Using unlabeled data, create function that predicts output
• Semi-Supervised
  – Uses labeled and unlabeled data
Classification/Categorization
• Spam Filtering
• Named Entity Recognition
• Phrase Identification
• Sentiment Analysis
• Classification into a Taxonomy
Clustering
• Find Natural Groupings
  – Documents
  – Search Results
  – People
  – Genetic traits in groups
  – Many, many more uses
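One way to picture "finding natural groupings" is k-means, one of the clustering algorithms Mahout includes. The following is a minimal single-machine sketch in plain Python, not Mahout's implementation; the `kmeans` function, the sample points, and the starting centroids are all assumptions made for illustration.

```python
from math import dist


def kmeans(points, centroids, iterations=10):
    """Plain k-means on tuples of floats (illustrative, single machine)."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = []
        for old, members in zip(centroids, clusters):
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return centroids, clusters


# Hypothetical 2-D data with two visually obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

With this data the first three points end up in one cluster and the last three in the other. A distributed version would split the assignment step across mappers and the centroid update across reducers, but that is beyond this sketch.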
Collaborative Filtering
• Recommend people and products
  – User-User
    • User likes X, you might too
  – Item-Item
    • People who bought X also bought Y
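The item-item idea ("people who bought X also bought Y") can be sketched with simple co-occurrence counts. This is a toy illustration in plain Python, not the Taste API; the basket data and the `cooccurrence` and `also_bought` helpers are hypothetical.

```python
from collections import Counter
from itertools import permutations


def cooccurrence(baskets):
    """Count, for each item, how often every other item shares a basket with it."""
    counts = {}
    for basket in baskets:
        for a, b in permutations(set(basket), 2):
            counts.setdefault(a, Counter())[b] += 1
    return counts


def also_bought(counts, item, n=2):
    """Return up to n items most often bought together with `item`."""
    # Sort by descending count, then by name so ties are deterministic.
    ranked = sorted(counts.get(item, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:n]]


# Hypothetical purchase histories, one set of items per user.
baskets = [
    {"book", "lamp", "pen"},
    {"book", "pen"},
    {"book", "lamp"},
    {"pen", "ink"},
]
counts = cooccurrence(baskets)
suggestions = also_bought(counts, "book")
```

Raw counts favor popular items; production recommenders typically replace them with a similarity measure, which this sketch leaves out.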
Info. Retrieval
• Learning Ranking Functions
• Learning Spelling Corrections
• User Click Analysis and Tracking
Other
• Image Analysis
• Robotics
• Games
• Higher level natural language processing
• Many, many others
What is Apache Mahout?
• A Mahout is an elephant trainer/driver/keeper, hence…
[Image] + Machine Learning =
(and other distributed techniques)
What?
• Hadoop brings:
  – Map/Reduce API
  – HDFS
  – In other words, scalability and fault-tolerance
• Thus, Mahout’s Goal is:
  – Scalable Machine Learning with Apache License
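To make the Map/Reduce model concrete, here is a single-process word-count sketch in plain Python. It does not use the Hadoop API; `map_phase`, `shuffle`, and `reduce_phase` are hypothetical names that mirror the stages of a Map/Reduce job, which a real cluster would spread across many machines.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


# Hypothetical input records; a real job would read them from HDFS.
records = ["mahout rides hadoop", "hadoop scales", "mahout learns"]
pairs = [pair for record in records for pair in map_phase(record)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each map call sees only one record and each reduce call sees only one key, the work can be partitioned freely, which is where the scalability comes from.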
Why Mahout?
• Many Open Source ML libraries either:
  – Lack Community
  – Lack Documentation and Examples
  – Lack Scalability
  – Lack the Apache License ;-)
  – Or are research-oriented
• Personal: Learn more ML
• Intelligent Apps are the Present and Future
  – See the Hadoop talks tomorrow and Friday!
• Goal: Overcome gaps the Apache Way!
Current Status
• Close to Initial release
  – Focused on examples, docs, bug fixes
• What’s in it:
  – Simple Matrix/Vector library
  – Taste Collaborative Filtering
  – Clustering
    • Canopy/K-Means/Fuzzy K-Means/Mean-shift
  – Classifiers
    • Naïve Bayes
    • Complementary NB
  – Evolutionary
    • Integration with Watchmaker for fitness function
How?
• Examples
  – Taste
  – Clustering
  – Classification
  – Evolutionary
Taste: Movie Recommendations
• Given ratings by users of movies, recommend other movies
• http://lucene.apache.org/mahout/taste.html#demo
Clustering: Synthetic Control Data
• http://archive.ics.uci.edu/ml/datasets/Synthetic+Control+Chart+Time+Series
• Each clustering impl. has an example Job for running in <MAHOUT_HOME>/examples
  – o.a.mahout.clustering.syntheticcontrol.*
• Outputs clusters…
Classification: NB and CNB Examples
• 20 Newsgroups
  – http://cwiki.apache.org/confluence/display/MAHOUT/TwentyNewsgroups
• Wikipedia
  – http://cwiki.apache.org/confluence/display/MAHOUT/WikipediaBayesExample
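For readers new to Naïve Bayes, here is a minimal multinomial sketch with add-one smoothing in plain Python. It is not Mahout's implementation and does not cover Complementary NB; the toy corpus, its labels, and the `train` and `classify` helpers are made up for illustration.

```python
from collections import Counter, defaultdict
from math import log


def train(docs):
    """docs: list of (label, text). Returns the counts the classifier needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in docs:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def classify(model, text):
    """Pick the label with the highest log posterior (add-one smoothing)."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        score = log(label_counts[label] / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            score += log((word_counts[label][word] + 1) / denom)  # log likelihood
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus, far smaller than the 20 Newsgroups data.
docs = [
    ("sports", "the team won the game"),
    ("sports", "great game and great score"),
    ("tech", "new chip speeds up the computer"),
    ("tech", "the computer runs new software"),
]
model = train(docs)
prediction = classify(model, "who won the game")
```

Working in log space avoids numeric underflow when documents are long, and the smoothing keeps an unseen word such as "who" from zeroing out a label.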
Evolutionary
• Traveling Salesman
  – http://cwiki.apache.org/confluence/display/MAHOUT/Traveling+Salesman
• Class Discovery
  – http://cwiki.apache.org/confluence/display/MAHOUT/Class+Discovery
What’s Next?
• Release 0.1!
• Shared Amazon Images (others?)
• More Examples
• Winnow/Perceptron (MAHOUT-85)
• HBase and HAMA support
• Normalize I/O format for data
• Solr Integration (SOLR-769)
• Other Algorithms: SVM, Linear Regression, etc.
When, Where, Who
• When? Now!
  – Mahout is growing
• Who? You!
  – We want Java programmers who:
    • Are comfortable with math
    • Like to work on large, hard problems
• Where?
  – http://lucene.apache.org/mahout
  – http://cwiki.apache.org/MAHOUT
  – mahout-{user|dev}@lucene.apache.org
Resources
• “Programming Collective Intelligence” by Toby Segaran
• “Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank
• Hadoop - http://hadoop.apache.org
• http://mloss.org/software/