Classification with Naive Bayes


Transcript of Classification with Naive Bayes

Page 1: Classification with Naive Bayes

Copyright 2011 Cloudera Inc. All rights reserved

Classification with Naïve Bayes: A Deep Dive into Apache Mahout

Page 2: Classification with Naive Bayes


Today’s speaker – Josh Patterson

[email protected] / twitter: @jpatanooga

• Master’s Thesis: self-organizing mesh networks

– Published in IAAI-09: TinyTermite: A Secure Routing Algorithm

• Conceived, built, and led Hadoop integration for the openPDC project at TVA (Smartgrid stuff)

– Led a small team that designed classification techniques for time series and MapReduce

– Open source work at http://openpdc.codeplex.com

• Now: Solutions Architect at Cloudera


Page 3: Classification with Naive Bayes


What is Classification?

• Supervised Learning

• We give the system a set of instances to learn from

• System builds knowledge of some structure

– Learns “concepts”

• System can then classify new instances

Page 4: Classification with Naive Bayes


Supervised vs Unsupervised Learning

• Supervised

– Give system examples/instances of multiple concepts

– System learns “concepts”

– More “hands on”

– Examples: Naïve Bayes, neural nets

• Unsupervised

– Uses unlabeled data

– Builds joint density model

– Example: k-means clustering

Page 5: Classification with Naive Bayes


Naïve Bayes

• Called Naïve Bayes because it’s based on Bayes’ rule and “naively” assumes the features are independent given the label

– It is only valid to multiply probabilities when the events are independent

– A simplistic assumption in real life

– Despite the name, Naïve Bayes works well on real datasets

Page 6: Classification with Naive Bayes


Naïve Bayes Classifier

• Simple probabilistic classifier based on

– applying Bayes’ theorem (from Bayesian statistics)

– strong (naive) independence assumptions.

– A more descriptive term for the underlying probability model would be “independent feature model”.

Page 7: Classification with Naive Bayes


Naïve Bayes Classifier (2)

• Assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature.

– Example:

• a fruit may be considered to be an apple if it is red, round, and about 4" in diameter.

• Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to contribute independently to the probability that this fruit is an apple.
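
The independence assumption is what keeps the math simple: the classifier just multiplies the per-feature conditional probabilities together with the class prior (in practice, summing their logs to avoid underflow). Below is a minimal sketch of that decision rule in Java for the apple example; the probability values are invented for illustration and are not taken from Mahout.

// Minimal sketch of the naive Bayes scoring rule for the "apple" example.
// All probability values below are invented for illustration.
public class NaiveBayesSketch {
    public static void main(String[] args) {
        double priorApple          = 0.30;  // fraction of training fruits labeled "apple"
        double pRedGivenApple      = 0.70;  // Pr[red | apple], estimated from counts
        double pRoundGivenApple    = 0.90;  // Pr[round | apple]
        double pFourInchGivenApple = 0.60;  // Pr[~4" diameter | apple]

        // Naive assumption: features contribute independently, so we can
        // multiply their probabilities (here, sum their logs).
        double logScore = Math.log(priorApple)
                        + Math.log(pRedGivenApple)
                        + Math.log(pRoundGivenApple)
                        + Math.log(pFourInchGivenApple);

        System.out.println("log-score for class 'apple': " + logScore);
        // The label with the highest score across all candidate classes wins.
    }
}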

Page 8: Classification with Naive Bayes


A Little Bit o’ Theory

Page 9: Classification with Naive Bayes


Condensing Meaning

• To train our system we need:

– The total number of input training instances (count)

– Counts of tuples: {attribute_n, outcome_o, value_m}

– Total counts of each outcome_o: {outcome-count}

• To calculate each Pr[E_n|H]:

– Pr[E_n|H] = {attribute_n, outcome_o, value_m} / {outcome-count}

…From the Vapor of That Last Big Equation
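
Since training reduces to counting, a small sketch may help make this concrete. The snippet below tallies {attribute value, outcome} tuples and outcome totals for a toy, single-attribute version of the weather data used by Witten et al.; the attribute values and labels are illustrative only.

// Training by counting: derive Pr[E_n|H] from tuple counts and outcome counts.
import java.util.HashMap;
import java.util.Map;

public class CountingTrainer {
    public static void main(String[] args) {
        // Toy training instances: {attribute value, outcome/label}.
        String[][] training = {
            {"sunny", "no"}, {"sunny", "no"}, {"overcast", "yes"},
            {"rainy", "yes"}, {"sunny", "yes"}
        };

        Map<String, Integer> tupleCounts   = new HashMap<>(); // {value|outcome} -> count
        Map<String, Integer> outcomeCounts = new HashMap<>(); // outcome -> count

        for (String[] row : training) {
            tupleCounts.merge(row[0] + "|" + row[1], 1, Integer::sum);
            outcomeCounts.merge(row[1], 1, Integer::sum);
        }

        // Pr[outlook=sunny | yes] = count{sunny, yes} / count{yes} = 1/3
        double pr = (double) tupleCounts.getOrDefault("sunny|yes", 0)
                  / outcomeCounts.get("yes");
        System.out.println("Pr[outlook=sunny | yes] = " + pr);
    }
}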

Page 10: Classification with Naive Bayes


A Real Example From Witten, et al

Page 11: Classification with Naive Bayes


Enter Apache Mahout

• What is it?

– Apache Mahout is a scalable machine learning library that supports large data sets

• What Are the Major Algorithm Types?

– Classification

– Recommendation

– Clustering

• http://mahout.apache.org/

Page 12: Classification with Naive Bayes


Mahout Algorithms

Size of Dataset | Mahout Algorithm                        | Execution Model | Characteristics
Small           | SGD                                     | Sequential      | Uses all types of predictor variables
Medium          | Naïve Bayes / Complementary Naïve Bayes | Parallel        | Prefers text; high training cost
Large           | Random Forest                           | Parallel        | Uses all types of predictor variables; high training cost

Page 13: Classification with Naive Bayes


Naïve Bayes and Text

• Naive Bayes does not model text well.

– “Tackling the Poor Assumptions of Naive Bayes Text Classifiers”

• http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf

– Mahout makes some modifications based around TF-IDF scoring (next slide)

• Includes two other pre-processing steps, common for information retrieval but not for Naive Bayes classification
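
For context before the next slide: standard TF-IDF weights a term by how often it appears in a document, discounted by how many documents in the corpus contain it. The tiny sketch below uses a common log-based IDF variant; the exact formula and smoothing Mahout applies differ, so treat this only as background.

// Plain TF-IDF: tf(term, doc) * idf(term). Illustrative variant only;
// Mahout's vectorizer uses its own weighting and smoothing details.
public class TfIdfSketch {
    // termFreq:     occurrences of the term in this document
    // docCount:     total number of documents in the corpus
    // docsWithTerm: number of documents containing the term
    static double tfIdf(int termFreq, int docCount, int docsWithTerm) {
        double tf  = termFreq;                                    // raw term frequency
        double idf = Math.log((double) docCount / docsWithTerm);  // rarer terms score higher
        return tf * idf;
    }

    public static void main(String[] args) {
        // A term appearing 3 times in a doc, present in 10 of 1,000 docs: ~13.8
        System.out.println(tfIdf(3, 1000, 10));
    }
}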

Page 14: Classification with Naive Bayes


High Level Algorithm

• For each feature (word) in each doc:

– Calculate the “Weight-Normalized TF-IDF”

• The Weight-Normalized TF-IDF for a given feature in a label is the TF-IDF calculated using the standard IDF multiplied by the Weight-Normalized TF

– We calculate the sum of the Weight-Normalized TF-IDF over all features in a label, called Sigma_k, and set alpha_i == 1.0

Weight = log[ (W-N-TF-IDF + alpha_i) / (Sigma_k + N) ]
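
A small numeric sketch of that weight formula follows; alpha_i is 1.0 as stated above, N is taken to be the vocabulary size, and all input values are made up for illustration.

// Sketch of the per-feature weight from this slide:
//   weight = log( (wnTfIdf + alpha_i) / (Sigma_k + N) )
// Input values are invented for illustration.
public class WeightSketch {
    static double weight(double wnTfIdf, double alphaI, double sigmaK, double vocabularySize) {
        return Math.log((wnTfIdf + alphaI) / (sigmaK + vocabularySize));
    }

    public static void main(String[] args) {
        double wnTfIdf = 4.2;      // weight-normalized TF-IDF of one feature in one label
        double alphaI  = 1.0;      // smoothing term, per the slide
        double sigmaK  = 1500.0;   // sum of W-N-TF-IDF over all features in the label
        double n       = 20000.0;  // vocabulary size
        System.out.println(weight(wnTfIdf, alphaI, sigmaK, n));  // a negative log weight
    }
}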

Page 15: Classification with Naive Bayes


BayesDriver Training Workflow

1. BayesFeatureDriver

– Compute the parts of TF-IDF via Term-Doc-Count, WordFreq, and FeatureCount

2. BayesTfIdfDriver

– Calculate the TF-IDF of each feature/word in each label

3. BayesWeightSummerDriver

– Take the TF-IDF and calculate the trainer weights

4. BayesThetaNormalizerDriver

– Calculate the normalization factor Sigma_Wij for each complement class

Naïve Bayes Training MapReduce Workflow in Mahout

Page 16: Classification with Naive Bayes


Logical Classification Process

1. Gather, Clean, and Examine the Training Data

– Really get to know your data!

2. Train the Classifier, allowing the system to “Learn” the “Concepts”

– But not “overfit” to this specific training data set

3. Classify New Unseen Instances

– With Naïve Bayes, we calculate the probability of each class with respect to this instance

Page 17: Classification with Naive Bayes


How Is Classification Done?

• Sequentially or via MapReduce

• TestClassifier.java

– Creates ClassifierContext

– For each file in the directory:

• For each line:

– Break the line into a map of tokens

– Feed the array of words to the classifier engine for a new classification/label

– Collect the classifications as output
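
A rough Java sketch of that per-line loop is below. The SimpleClassifier interface and its classify method are simplified stand-ins for Mahout's ClassifierContext, not the exact Mahout 0.x API, and the whitespace tokenizer is only a placeholder for the real analyzer.

// Rough sketch of the classification loop described above.
// SimpleClassifier stands in for Mahout's ClassifierContext; the method
// name and types are illustrative, not the actual Mahout API.
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ClassifyDirSketch {
    interface SimpleClassifier {           // stand-in for ClassifierContext
        String classify(String[] tokens);  // returns the predicted label
    }

    static void classifyDir(Path dir, SimpleClassifier classifier) throws IOException {
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {                           // for each file in dir
                for (String line : Files.readAllLines(file)) {  // for each line
                    String[] tokens = line.toLowerCase().split("\\W+"); // break into tokens
                    String label = classifier.classify(tokens);         // new classification
                    System.out.println(file.getFileName() + "\t" + label); // collect output
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Dummy classifier that labels everything "unknown", just to exercise the loop.
        classifyDir(Paths.get(args[0]), tokens -> "unknown");
    }
}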

Page 18: Classification with Naive Bayes


A Quick Note About Training Data…

• Your classifier can only be as good as the training data lets it be…

– If you don’t do good data prep, everything will perform poorly

– Data collection and pre-processing take the bulk of the time

Page 19: Classification with Naive Bayes


Enough Math, Run the Code

• Download and install Mahout

– http://www.apache.org

• Run 20Newsgroups Example

– https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups

– Uses Naïve Bayes Classification

– Download and extract 20news-bydate.tar.gz from the 20newsgroups dataset

Page 20: Classification with Naive Bayes


Generate Test and Train Dataset

Training Dataset:

mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
  -p examples/bin/work/20news-bydate/20news-bydate-train \
  -o examples/bin/work/20news-bydate/bayes-train-input \
  -a org.apache.mahout.vectorizer.DefaultAnalyzer \
  -c UTF-8

Test Dataset:

mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
  -p examples/bin/work/20news-bydate/20news-bydate-test \
  -o examples/bin/work/20news-bydate/bayes-test-input \
  -a org.apache.mahout.vectorizer.DefaultAnalyzer \
  -c UTF-8

Page 21: Classification with Naive Bayes


Train and Test Classifier

Train:

$MAHOUT_HOME/bin/mahout trainclassifier \
  -i 20news-input/bayes-train-input \
  -o newsmodel \
  -type bayes \
  -ng 3 \
  -source hdfs

Test:

$MAHOUT_HOME/bin/mahout testclassifier \
  -m newsmodel \
  -d 20news-input \
  -type bayes \
  -ng 3 \
  -source hdfs \
  -method mapreduce

Page 22: Classification with Naive Bayes


Other Use Cases

• Predictive Analytics

– You’ll hear this term a lot in the field, especially in the context of SAS

• General Supervised Learning Classification

– We can recognize a lot of things with practice

• And lots of tuning!

• Document Classification

• Sentiment Analysis

Page 23: Classification with Naive Bayes


Questions?

• We’re Hiring!

• Cloudera’s Distro of Apache Hadoop:

– http://www.cloudera.com

• Resources

– “Tackling the Poor Assumptions of Naive Bayes Text Classifiers”

• http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf