Introducing Apache Mahout


Page 1: Introducing Apache Mahout

Introducing Apache Mahout

Scalable Machine Learning for All!

Grant Ingersoll

Page 2: Introducing Apache Mahout

Agenda
• What is Machine Learning?
  – Definitions
  – Types
  – Applications
• Mahout
  – What?
  – Why?
  – How?
  – Who?

Page 3: Introducing Apache Mahout

What is Machine Learning?

[Image: http://upload.wikimedia.org/wikipedia/en/4/49/Terminator.jpg]

Or?

[Image: http://en.wikipedia.org/wiki/Image:Hal-9000.jpg]

NOT!

Page 4: Introducing Apache Mahout

How about?

Google News

Page 5: Introducing Apache Mahout

Or?

Amazon.com

Page 6: Introducing Apache Mahout

Definition
• “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
  – Introduction to Machine Learning by E. Alpaydin
• Subset of Artificial Intelligence
  – Many other fields: comp sci., biology, math, psychology, etc.

Page 7: Introducing Apache Mahout

Characterizations
• Lots of Data
• Identifiable Features in that Data
• Too big/costly for people to handle
  – People still can help

Page 8: Introducing Apache Mahout

Types
• Supervised
  – Using labeled training data, create function that predicts output of unseen inputs
• Unsupervised
  – Using unlabeled data, create function that predicts output
• Semi-Supervised
  – Uses labeled and unlabeled data

Page 9: Introducing Apache Mahout

Classification/Categorization
• Spam Filtering
• Named Entity Recognition
• Phrase Identification
• Sentiment Analysis
• Classification into a Taxonomy

Page 10: Introducing Apache Mahout

Clustering
• Find Natural Groupings
  – Documents
  – Search Results
  – People
  – Genetic traits in groups
  – Many, many more uses

Page 11: Introducing Apache Mahout

Collaborative Filtering
• Recommend people and products
  – User-User
    • User likes X, you might too
  – Item-Item
    • People who bought X also bought Y
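The Item-Item idea ("people who bought X also bought Y") can be sketched by counting how often two items share a purchase history. This is a minimal, single-machine illustration in plain Python, not Taste's API; the function names and the purchase data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def item_item_counts(baskets):
    """Count, for every pair of items, how many baskets contain both."""
    counts = defaultdict(lambda: defaultdict(int))
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def also_bought(counts, item, top_n=3):
    """Items most often bought together with `item`, most frequent first."""
    neighbours = counts.get(item, {})
    # Sort by descending count, then by name so ties are deterministic.
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]


# Hypothetical purchase histories, one set of items per user.
baskets = [
    {"X", "Y", "Z"},
    {"X", "Y"},
    {"X", "W"},
    {"Y", "Z"},
]
counts = item_item_counts(baskets)
print(also_bought(counts, "X"))  # ['Y', 'W', 'Z']
```

Raw co-occurrence counts favour popular items; a real recommender would usually normalize them with a similarity measure.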

Page 12: Introducing Apache Mahout

Info. Retrieval
• Learning Ranking Functions
• Learning Spelling Corrections
• User Click Analysis and Tracking

Page 13: Introducing Apache Mahout

Other
• Image Analysis
• Robotics
• Games
• Higher level natural language processing
• Many, many others

Page 14: Introducing Apache Mahout

What is Apache Mahout?
• A Mahout is an elephant trainer/driver/keeper, hence…

[Image] + Machine Learning =

(and other distributed techniques)

Page 15: Introducing Apache Mahout

What?
• Hadoop brings:
  – Map/Reduce API
  – HDFS
  – In other words, scalability and fault-tolerance
• Thus, Mahout’s Goal is:
  – Scalable Machine Learning with Apache License
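To make the Map/Reduce programming model concrete, here is a rough in-process sketch of its three stages (map, group by key, reduce) applied to word counting. It is plain Python and does not use Hadoop's API; every function name and the input lines are assumptions made for illustration. On a real cluster the map and reduce calls would run in parallel on many machines, which is where the scalability comes from.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and stream out (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to that key's list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def count_mapper(line):
    """Emit (word, 1) for each word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(_key, values):
    """Add up the partial counts for one word."""
    return sum(values)


# Hypothetical input lines.
lines = ["mahout rides hadoop", "hadoop scales", "mahout learns"]
word_counts = reduce_phase(shuffle(map_phase(lines, count_mapper)), sum_reducer)
print(word_counts)
```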

Page 16: Introducing Apache Mahout

Why Mahout?
• Many Open Source ML libraries either:
  – Lack Community
  – Lack Documentation and Examples
  – Lack Scalability
  – Lack the Apache License ;-)
  – Or are research-oriented
• Personal: Learn more ML
• Intelligent Apps are the Present and Future
  – See the Hadoop talks tomorrow and Friday!
• Goal: Overcome gaps the Apache Way!

Page 17: Introducing Apache Mahout

Current Status
• Close to Initial release
  – Focused on examples, docs, bug fixes
• What’s in it:
  – Simple Matrix/Vector library
  – Taste Collaborative Filtering
  – Clustering
    • Canopy/K-Means/Fuzzy K-Means/Mean-shift
  – Classifiers
    • Naïve Bayes
    • Complementary NB
  – Evolutionary
    • Integration with Watchmaker for fitness function

Page 18: Introducing Apache Mahout

How?
• Examples
  – Taste
  – Clustering
  – Classification
  – Evolutionary

Page 19: Introducing Apache Mahout

Taste: Movie Recommendations

• Given ratings by users of movies, recommend other movies

• http://lucene.apache.org/mahout/taste.html#demo

Page 20: Introducing Apache Mahout

Clustering: Synthetic Control Data

• http://archive.ics.uci.edu/ml/datasets/Synthetic+Control+Chart+Time+Series

• Each clustering impl. has an example Job for running in <MAHOUT_HOME>/examples
  – o.a.mahout.clustering.syntheticcontrol.*

• Outputs clusters…
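For readers unfamiliar with what such a clustering job computes, the following is a small single-machine sketch of K-Means (Lloyd's algorithm): assign each point to its nearest centroid, recompute each centroid as the mean of its points, and repeat until nothing changes. It is not Mahout's distributed implementation, and the points and starting centroids are made-up toy data rather than the Synthetic Control series.

```python
import math


def closest(point, centroids):
    """Index of the centroid nearest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[closest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


# Hypothetical 2-D points forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
print(final_centroids)
print(clusters)
```

The result depends on the starting centroids; Canopy clustering, listed earlier in the deck, is one common way to choose them.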

Page 21: Introducing Apache Mahout

Classification: NB and CNB Examples

• 20 Newsgroups
  – http://cwiki.apache.org/confluence/display/MAHOUT/TwentyNewsgroups
• Wikipedia
  – http://cwiki.apache.org/confluence/display/MAHOUT/WikipediaBayesExample
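As a rough idea of what a Naïve Bayes text classifier does, here is a tiny multinomial Naïve Bayes with add-one (Laplace) smoothing in plain Python. It is a sketch under simplifying assumptions (whitespace tokenization, a handful of invented training messages), not Mahout's NB or Complementary NB code, and all names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in labelled_docs:
        class_docs[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log-posterior for `text`."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        # Log prior: fraction of training documents with this label.
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (word, label) pairs from zeroing the score.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training messages.
training = [
    ("spam", "win cash now"),
    ("spam", "cheap cash offer"),
    ("ham", "meeting agenda today"),
    ("ham", "lunch meeting now"),
]
model = train(training)
print(classify(model, "cash offer now"))  # spam
print(classify(model, "agenda meeting"))  # ham
```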

Page 22: Introducing Apache Mahout

Evolutionary
• Traveling Salesman
  – http://cwiki.apache.org/confluence/display/MAHOUT/Traveling+Salesman
• Class Discovery
  – http://cwiki.apache.org/confluence/display/MAHOUT/Class+Discovery
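The role of the fitness function in an evolutionary approach can be illustrated with a very small Traveling Salesman sketch: candidate tours are scored by length, the shorter half survives each generation, and children are made by swapping two cities in a survivor. This is standalone Python, not Watchmaker or Mahout code; the city coordinates, parameter defaults, and function names are assumptions, and a serious solver would also use crossover.

```python
import math
import random


def tour_length(tour, cities):
    """Fitness: total length of the closed tour (lower is better)."""
    return sum(
        math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


def mutate(tour, rng):
    """Return a copy of the tour with two randomly chosen cities swapped."""
    i, j = rng.sample(range(len(tour)), 2)
    child = list(tour)
    child[i], child[j] = child[j], child[i]
    return child


def evolve(cities, population_size=20, generations=200, seed=0):
    """Keep the fitter half each generation and refill with mutated survivors."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    base = list(range(len(cities)))
    population = [rng.sample(base, len(base)) for _ in range(population_size)]

    def fitness(tour):
        return tour_length(tour, cities)

    for _ in range(generations):
        population.sort(key=fitness)
        survivors = population[: population_size // 2]
        children = [
            mutate(rng.choice(survivors), rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children
    return min(population, key=fitness)


# Hypothetical city coordinates.
cities = [(0, 0), (0, 3), (4, 3), (4, 0), (2, 5), (2, -2)]
best = evolve(cities)
print(best, round(tour_length(best, cities), 2))
```

Because survivors are carried over unchanged, the best tour in the population never gets worse from one generation to the next.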

Page 23: Introducing Apache Mahout

What’s Next?
• Release 0.1!
• Shared Amazon Images (others?)
• More Examples
• Winnow/Perceptron (MAHOUT-85)
• HBase and HAMA support
• Normalize I/O format for data
• Solr Integration (SOLR-769)
• Other Algorithms: SVM, Linear Regression, etc.

Page 24: Introducing Apache Mahout

When, Where, Who
• When? Now!
  – Mahout is growing
• Who? You!
  – We want Java programmers who:
    • Are comfortable with math
    • Like to work on large, hard problems
• Where?
  – http://lucene.apache.org/mahout
  – http://cwiki.apache.org/MAHOUT
  – mahout-{user|dev}@lucene.apache.org

Page 25: Introducing Apache Mahout

Resources
• “Programming Collective Intelligence” by Toby Segaran
• “Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank
• Hadoop - http://hadoop.apache.org
• http://mloss.org/software/