Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

68
Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland

Transcript of Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Page 1: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Hands on!Speakers: Ted Dunning, Robin Anil

OSCON 2011, Portland

Page 2: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

About Us Ted Dunning:

Chief Application Architect at MapRCommitter and PMC Member at Apache MahoutPreviously: MusicMatch (Yahoo! Music), Veoh recommendation, ID Analytics

Robin Anil:Software Engineer at GoogleCommitter and PMC Member at Apache MahoutPreviously: Yahoo! (Display ads), Minekey recommendation

Page 3: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Agenda Intro to Mahout (5 mins) Overview of Algorithms in Mahout (10 mins) Hands on Mahout!

- Clustering (30 mins)

- Classification (30 mins)

- Advanced topics with Q&A (15 mins)

Page 4: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Mission

To build a scalable machine learning library

Page 5: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Scale! Scale to large datasets

- Hadoop MapReduce implementations that scales linearly with data.

- Fast sequential algorithms whose runtime doesn’t depend on the size of the data

- Goal: To be as fast as possible for any algorithm Scalable to support your business case

- Apache Software License 2 Scalable community

- Vibrant, responsive and diverse

- Come to the mailing list and find out more

Page 6: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Current state of ML libraries Lack community Lack scalability Lack documentations and examples Lack Apache licensing Are not well tested Are Research oriented Not built over existing production quality libraries Lack “Deployability”

Page 7: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Algorithms and Applications

Page 8: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Clustering Call it fuzzy grouping based on a notion of similarity

Page 9: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Mahout Clustering Plenty of Algorithms: K-Means,

Fuzzy K-Means, Mean Shift,Canopy, Dirichlet

Group similar looking objects

Notion of similarity: Distance measure:

- Euclidean

- Cosine

- Tanimoto

- Manhattan

Page 10: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Classification Predicting the type of a new object based on its features The types are predetermined

Dog Cat

Page 11: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Mahout Classification Plenty of algorithms

- Naïve Bayes

- Complementary Naïve Bayes

- Random Forests

- Logistic Regression (SGD)

- Support Vector Machines (patch ready)

Learn a model from a manually classified data Predict the class of a new object based on its

features and the learned model

Page 12: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Part 1 - Clustering

Page 13: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Understanding data - Vectors

X = 5 , Y = 3(5, 3)

The vector denoted by point (5, 3) is simply Array([5, 3]) or HashMap([0 => 5], [1 => 3])

Y

X

Page 14: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Representing Vectors – The basics Now think 3, 4, 5, ….. n-dimensional Think of a document as a bag of words.

“she sells sea shells on the sea shore” Now map them to integers

she => 0

sells => 1

sea => 2

and so on The resulting vector [1.0, 1.0, 2.0, … ]

Page 15: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Vectors Imagine one dimension for each word. Each dimension is also called a feature Two techniques

- Dictionary Based

- Randomizer Based

Page 16: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Clustering Reuters dataset

Page 17: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Step 1 – Convert dataset into a Hadoop Sequence File http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz Download (8.2 MB) and extract the SGML files.

- $ mkdir -p mahout-work/reuters-sgm

- $ cd mahout-work/reuters-sgm && tar xzf ../reuters21578.tar.gz && cd .. && cd ..

Extract content from SGML to text file

- $ bin/mahout org.apache.lucene.benchmark.utils.ExtractReuters mahout-work/reuters-sgm mahout-work/reuters-out

Page 18: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Step 1 – Convert dataset into a Hadoop Sequence File Use seqdirectory tool to convert text file into a Hadoop Sequence File

- $ bin/mahout seqdirectory \

-i mahout-work/reuters-out \

-o mahout-work/reuters-out-seqdir \

-c UTF-8 -chunk 5

Page 19: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Hadoop Sequence File Sequence of Records, where each record is a <Key, Value> pair

- <Key1, Value1>

- <Key2, Value2>

- …

- …

- …

- <Keyn, Valuen> Key and Value needs to be of class org.apache.hadoop.io.Text

- Key = Record name or File name or unique identifier

- Value = Content as UTF-8 encoded string

TIP: Dump data from your database directly into Hadoop Sequence Files (see next slide)

Page 20: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Writing to Sequence Files Configuration conf = new Configuration();

FileSystem fs = FileSystem.get(conf);

Path path = new Path("testdata/part-00000");

SequenceFile.Writer writer = new SequenceFile.Writer(

fs, conf, path, Text.class, Text.class);

for (int i = 0; i < MAX_DOCS; i++)

writer.append(new Text(documents(i).Id()),

new Text(documents(i).Content()));

}

writer.close();

Page 21: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Generate Vectors from Sequence Files Steps

1. Compute Dictionary

2. Assign integers for words

3. Compute feature weights

4. Create vector for each document using word-integer mapping and feature-weight

Or

Simply run $ bin/mahout seq2sparse

Page 22: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Generate Vectors from Sequence Files $ bin/mahout seq2sparse \ -i mahout-work/reuters-out-seqdir/ \ -o mahout-work/reuters-out-seqdir-sparse-kmeans

Important options

- Ngrams

- Lucene Analyzer for tokenizing

- Feature Pruning

- Min support

- Max Document Frequency

- Min LLR (for ngrams)

- Weighting Method

- TF v/s TFIDF

- lp-Norm

- Log normalize length

Page 23: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Start K-Means clustering $ bin/mahout kmeans \ -i mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ \ -c mahout-work/reuters-kmeans-clusters \ -o mahout-work/reuters-kmeans \ -dm org.apache.mahout.distance.CosineDistanceMeasure –cd 0.1 \ -x 10 -k 20 –ow

Things to watch out for

- Number of iterations

- Convergence delta

- Distance Measure

- Creating assignments

Page 24: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

c1

c2

c3

K-Means clustering

Page 25: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

c1

c2

c3

K-Means clustering

Page 26: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

c1

c2

c3

c1

c2

c3

K-Means clustering

Page 27: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

c1

c2

c3

K-Means clustering

Page 28: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Inspect clusters $ bin/mahout clusterdump \ -s mahout-work/reuters-kmeans/clusters-9 \ -d mahout-work/reuters-out-seqdir-sparse-kmeans/dictionary.file-0 \ -dt sequencefile -b 100 -n 20

Typical output

:VL-21438{n=518 c=[0.56:0.019, 00:0.154, 00.03:0.018, 00.18:0.018, …

Top Terms:

iran => 3.1861672217321213

strike => 2.567886952727918

iranian => 2.133417966282966

union => 2.116033937940266

said => 2.101773806290277

workers => 2.066259451354332

gulf => 1.9501374918521601

had => 1.6077752463145605

he => 1.5355078004962228

Page 29: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

FAQs How to get rid of useless words How to see documents to cluster assignments How to choose appropriate weighting How to run this on a cluster How to scale How to choose k How to improve similarity measurement

Page 30: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

FAQs How to get rid of useless words

- Increase minSupport and or decrease dfPercent

- Use StopwordsAnalyzer How to see documents to cluster assignments

- Run clustering process at the end of centroid generation using –cl How to choose appropriate weighting

- If its long text, go with tfidf. Use normalization if documents different in length

How to run this on a cluster

- Set HADOOP_CONF directory to point to your hadoop cluster conf directory How to scale

- Use small value of k to partially cluster data and then do full clustering on each cluster.

Page 31: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

FAQs How to choose k

- Figure out based on the data you have. Trial and error

- Or use Canopy Clustering and distance threshold to figure it out

- Or use Spectral clustering How to improve Similarity Measurement

- Not all features are equal

- Small weight difference for certain types creates a large semantic difference

- Use WeightedDistanceMeasure

- Or write a custom DistanceMeasure

Page 32: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Interesting problems Cluster users talking about OSCON’11 and cluster them based on what they are tweeting

- Can you suggest people to network with. Use user generate tags that people have given for musicians and cluster them

- Use the cluster to pre-populate suggest-box to autocomplete tags when users type Cluster movies based on abstract and description and show related movies.

- Note: How it can augment recommendations or collaborative filtering algorithms.

Page 33: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

More clustering algorithms Canopy Fuzzy K-Means Mean Shift Dirichlet process clustering Spectral clustering.

Page 34: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Part 2 - Classification

Page 35: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Preliminaries Code is available from github:

- [email protected]:tdunning/Chapter-16.git EC2 instances available Thumb drives also available Email to [email protected] Twitter @ted_dunning

Page 36: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

A Quick Review What is classification?

- goes-ins: predictors

- goes-outs: target variable What is classifiable data?

- continuous, categorical, word-like, text-like

- uniform schema How do we convert from classifiable data to feature vector?

Page 37: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Data Flow

Not quite so simple

Not quite so simple

Page 38: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Classifiable Data Continuous

- A number that represents a quantity, not an id

- Blood pressure, stock price, latitude, mass Categorical

- One of a known, small set (color, shape) Word-like

- One of a possibly unknown, possibly large set Text-like

- Many word-like things, usually unordered

Page 39: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

But that isn’t quite there Learning algorithms need feature vectors

- Have to convert from data to vector Can assign one location per feature

- or category

- or word Can assign one or more locations with hashing

- scary

- but safe on average

Page 40: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Data Flow

Page 41: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.
Page 42: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

The pipeline

Classifiable DataClassifiable Data VectorsVectors

Page 43: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Instance and Target Variable

Page 44: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Instance and Target Variable

Page 45: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Hashed Encoding

Page 46: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

What about collisions?

Page 47: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Let’s write some code

(cue relaxing background music)

Page 48: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Generating new features Sometimes the existing features are difficult to use Restating the geometry using new reference points may help Automatic reference points using k-means can be better than manual references

Page 49: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

K-means using target

Page 50: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

K-means features

Page 51: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

More code!

(cue relaxing background music)

Page 52: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Integration Issues Feature extraction is ideal for map-reduce

- Side data adds some complexity Clustering works great with map-reduce

- Cluster centroids to HDFS

Model training works better sequentially

- Need centroids in normal files Model deployment shouldn’t depend on HDFS

Page 53: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Parallel Stochastic Gradient Descent

Averagemodels

Trainsubmodel

Model

Input

Page 54: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Variational Dirichlet Assignment

Updatemodel

Gathersufficientstatistics

Model

Input

Page 55: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Old tricks, new dogs

Mapper

- Assign point to cluster

- Emit cluster id, (1, point)

Combiner and reducer

- Sum counts, weighted sum of points

- Emit cluster id, (n, sum/n)

Output to HDFS

Read fromHDFS to local disk by distributed cache

Read fromHDFS to local disk by distributed cache

Written by map-reduceWritten by map-reduce

Read from local disk from distributed cache

Read from local disk from distributed cache

Page 56: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Old tricks, new dogs

Mapper

- Assign point to cluster

- Emit cluster id, 1, point

Combiner and reducer

- Sum counts, weighted sum of points

- Emit cluster id, n, sum/n

Output to HDFSMapR FS

Read fromNFS

Read fromNFS

Written by map-reduceWritten by map-reduce

Page 57: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Modeling architecture

Featureextraction

anddown

sampling

Input

Side-data

Datajoin

SequentialSGD

Learning

Map-reduceMap-reduce

Now via NFSNow via NFS

Page 58: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

More in Mahout

Page 59: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Topic modeling Grouping similar or co-occurring features into a topic

- Topic “Lol Cat”:

- Cat

- Meow

- Purr

- Haz

- Cheeseburger

- Lol

Page 60: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Mahout Topic Modeling Algorithm: Latent Dirichlet Allocation

- Input a set of documents

- Output top K prominent topics and thefeatures in each topic

Page 61: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Recommendations Predict what the user likes based on

- His/Her historical behavior

- Aggregate behavior of people similar to him

Page 62: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Mahout Recommenders Different types of recommenders

- User based

- Item based Full framework for storage, online

online and offline computation of recommendations Like clustering, there is a notion of similarity in users or items

- Cosine, Tanimoto, Pearson and LLR

Page 63: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Frequent Pattern Mining Find interesting groups of items based on how they co-occur in a dataset

Page 64: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Mahout Parallel FPGrowth Identify the most commonly

occurring patterns from

- Sales Transactions buy “Milk, eggs and bread”

- Query Logs

ipad -> apple, tablet, iphone

- Spam Detection

Yahoo! http://www.slideshare.net/hadoopusergroup/mail-antispam

Page 65: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Get Started http://mahout.apache.org [email protected] - Developer mailing list [email protected] - User mailing list Check out the documentations and wiki for quickstart http://svn.apache.org/repos/asf/mahout/trunk/ Browse Code

Send me email!

- [email protected]

- [email protected]

- [email protected]

Try out MapR!

- www.mapr.com

Page 66: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Resources “Mahout in Action” Owen, Anil, Dunning, Friedman

http://www.manning.com/owen

“Taming Text” Ingersoll, Morton, Farrishttp://www.manning.com/ingersoll

“Introducing Apache Mahout”http://www.ibm.com/developerworks/java/library/j-mahout/

Page 67: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

Thanks to Apache Foundation Mahout Committers Google Summer of Code Organizers And Students OSCON Open source!

Page 68: Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland.

References news.google.com Cat http://www.flickr.com/photos/gattou/3178745634/ Dog http://www.flickr.com/photos/30800139@N04/3879737638/ Milk Eggs Bread http://www.flickr.com/photos/nauright/4792775946/ Amazon Recommendations twitter