Text Classification Powered by Apache Mahout and Lucene

Presented by Isabel Drost-Fromm, Software Developer, Apache Software Foundation/Nokia Gate 5 GmbH at Lucene/Solr Revolution 2013 Dublin. Text classification automates the task of filing documents into pre-defined categories based on a set of example documents. The first step in automating classification is to transform the documents into feature vectors. Though this step is highly domain specific, Apache Mahout provides a lot of easy-to-use tooling to help you get started, most of which relies heavily on Apache Lucene for analysis, tokenisation and filtering. This session shows how to use faceting to quickly get an understanding of the fields in your documents. It walks you through the steps necessary to convert your text documents into feature vectors that Mahout classifiers can use, including a few anecdotes on drafting domain-specific features.

Transcript of Text Classification Powered by Apache Mahout and Lucene

Page 1: Text Classification Powered by Apache Mahout and Lucene

Text classification with Apache Mahout and Lucene

Page 2: Text Classification Powered by Apache Mahout and Lucene

Isabel Drost-Fromm

Software Engineer at Nokia Maps*

Member of the Apache Software Foundation

Co-Founder of Berlin Buzzwords and Berlin Apache Hadoop GetTogether

Co-founder of Apache Mahout

*We are hiring, talk to me or mail [email protected]

Page 3: Text Classification Powered by Apache Mahout and Lucene
Page 4: Text Classification Powered by Apache Mahout and Lucene
Page 5: Text Classification Powered by Apache Mahout and Lucene


Page 6: Text Classification Powered by Apache Mahout and Lucene

https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout

… provide your own success story online.

Page 7: Text Classification Powered by Apache Mahout and Lucene


Page 8: Text Classification Powered by Apache Mahout and Lucene

Classification?

Page 9: Text Classification Powered by Apache Mahout and Lucene
Page 10: Text Classification Powered by Apache Mahout and Lucene

January 8, 2008 by Pink Sherbet Photography, http://www.flickr.com/photos/pinksherbet/2177961471/

Page 11: Text Classification Powered by Apache Mahout and Lucene

By freezelight, http://www.flickr.com/photos/63056612@N00/155554663/

Page 12: Text Classification Powered by Apache Mahout and Lucene

http://www.flickr.com/photos/29143375@N05/3344809375/in/photostream/

http://www.flickr.com/photos/redux/409356158/

Page 13: Text Classification Powered by Apache Mahout and Lucene

http://www.flickr.com/photos/29143375@N05/3344809375/in/photostream/

http://www.flickr.com/photos/redux/409356158/

Page 14: Text Classification Powered by Apache Mahout and Lucene

Image by jasondevilla, http://www.flickr.com/photos/jasondv/91960897/

Page 15: Text Classification Powered by Apache Mahout and Lucene

How a linear classifier sees data

Page 16: Text Classification Powered by Apache Mahout and Lucene

Image by ZapTheDingbat (Light meter), http://www.flickr.com/photos/zapthedingbat/3028168415

Page 17: Text Classification Powered by Apache Mahout and Lucene

Instance*

(sometimes also called example, item, or in databases a row)

Page 18: Text Classification Powered by Apache Mahout and Lucene

Feature*

(sometimes also called attribute, signal, predictor, co-variate, or column in databases)

Page 19: Text Classification Powered by Apache Mahout and Lucene

Label*

(sometimes also called class, target variable)

Page 20: Text Classification Powered by Apache Mahout and Lucene
Page 21: Text Classification Powered by Apache Mahout and Lucene
Page 22: Text Classification Powered by Apache Mahout and Lucene
Page 23: Text Classification Powered by Apache Mahout and Lucene

Image taken in Lisbon/ Portugal.

Page 24: Text Classification Powered by Apache Mahout and Lucene

Image by jasondevilla, http://www.flickr.com/photos/jasondv/91960897/

Page 25: Text Classification Powered by Apache Mahout and Lucene
Page 26: Text Classification Powered by Apache Mahout and Lucene

● Remove noise.

Page 27: Text Classification Powered by Apache Mahout and Lucene
Page 28: Text Classification Powered by Apache Mahout and Lucene

● Remove noise.

● Convert text to vectors.

Page 29: Text Classification Powered by Apache Mahout and Lucene

Text consists of terms and phrases.

Page 30: Text Classification Powered by Apache Mahout and Lucene

Encoding issues?

Chinese? Japanese?

“New York” vs. new York?

“go” vs. “going” vs. “went” vs. “gone”?

“go” vs. “Go”?
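Each of the questions above is a decision the analysis chain has to make. The following is a minimal sketch in plain Python, not Lucene's analyzer API. It assumes simple word splitting, lowercasing, and a tiny hand-written lemma table; the names `analyze` and `LEMMAS` are hypothetical and exist only for illustration.

```python
import re

# Hypothetical lemma table, for illustration only. A real analysis chain
# would use a stemmer or lemmatizer instead of a hand-written dictionary.
LEMMAS = {"going": "go", "went": "go", "gone": "go"}

def analyze(text):
    """Split text into word tokens, lowercase them, map known forms to a lemma."""
    tokens = re.findall(r"\w+", text.lower())
    return [LEMMAS.get(token, token) for token in tokens]

print(analyze("Go, going, went, gone!"))  # ['go', 'go', 'go', 'go']
print(analyze("New York"))                # ['new', 'york']
```

Note what is lost: lowercasing merges "go" and "Go", and word splitting breaks "New York" into two unrelated terms. Whether that is acceptable depends on the domain.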

Page 31: Text Classification Powered by Apache Mahout and Lucene

Terms? Tokens? Wait!

Page 32: Text Classification Powered by Apache Mahout and Lucene
Page 33: Text Classification Powered by Apache Mahout and Lucene

Now we have terms – how to turn them into vectors?

Page 34: Text Classification Powered by Apache Mahout and Lucene

If we looked at two phrases only:

Sunny weather

High performance computing

Page 35: Text Classification Powered by Apache Mahout and Lucene

Aaron

Zuse

Page 36: Text Classification Powered by Apache Mahout and Lucene

Binary bag of words

● Imagine an n-dimensional space.

● Each dimension = one possible word in texts.

● Entry in vector is one if word occurs in text.

● Problem:

– How to know all possible terms in unknown text?

$b_{i,j} = \begin{cases} 1 & \text{if } x_i \in d_j \\ 0 & \text{else} \end{cases}$
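A minimal sketch of the binary encoding in plain Python, assuming the text is already tokenized. The function name `binary_bag_of_words` and the four-word vocabulary are made up for illustration.

```python
def binary_bag_of_words(tokens, vocabulary):
    """Return a 0/1 vector with one entry per vocabulary word."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

# The vocabulary has to be fixed up front, which is the problem named
# on the slide: a term that is not in it ("computing") is silently dropped.
vocabulary = ["high", "performance", "sunny", "weather"]

print(binary_bag_of_words(["sunny", "weather", "sunny"], vocabulary))
# [0, 0, 1, 1]
print(binary_bag_of_words(["high", "performance", "computing"], vocabulary))
# [1, 1, 0, 0]
```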

Page 37: Text Classification Powered by Apache Mahout and Lucene

Term Frequency

● Imagine an n-dimensional space.

● Each dimension = one possible word in texts.

● Entry in vector equal to the word's frequency.

● Problem:

– Common words dominate vectors.

$b_{i,j} = n_{i,j}$
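The same idea with counts instead of 0/1 entries, again as a plain Python sketch with a made-up vocabulary and sentence. It also shows the problem from the slide: the entry for "the" is the largest one although the word carries little meaning.

```python
from collections import Counter

def term_frequency_vector(tokens, vocabulary):
    """Return the number of occurrences of each vocabulary word in tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vocabulary = ["the", "sunny", "weather"]
tokens = "the weather in the north is the same as the weather here".split()

print(term_frequency_vector(tokens, vocabulary))  # [4, 0, 2]
```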

Page 38: Text Classification Powered by Apache Mahout and Lucene

TF with stop wording

● Imagine an n-dimensional space.

● Each dimension = one possible word in texts.

● Filter stopwords.

● Entry in vector equal to the word's frequency.

● Problem:

– Common and uncommon words with same weight.

$b_{i,j} = n_{i,j}$

Page 39: Text Classification Powered by Apache Mahout and Lucene

TF-IDF

● Imagine an n-dimensional space.

● Each dimension = one possible word in texts.

● Filter stopwords.

● Entry in vector equal to the weighted frequency.

● Problem:

– Long texts get larger values.

$b_{i,j} = n_{i,j} \times \log \frac{|D|}{|\{d : t_i \in d\}|}$
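A sketch of the weighting above in plain Python, assuming tokenized documents and a fixed vocabulary. The names `tf_idf_vectors`, `docs` and `vocab` are hypothetical, and a term that occurs in no document gets weight 0 here, which is a choice of this sketch rather than something the slide specifies.

```python
import math
from collections import Counter

def tf_idf_vectors(documents, vocabulary):
    """documents: list of token lists. Returns one weighted vector per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([
            counts[t] * math.log(n_docs / doc_freq[t]) if doc_freq[t] else 0.0
            for t in vocabulary
        ])
    return vectors

docs = [["sunny", "weather"], ["rainy", "weather"], ["sunny", "sunny", "day"]]
vocab = ["weather", "sunny", "day"]

for vector in tf_idf_vectors(docs, vocab):
    print(vector)
```

In the third document "sunny" occurs twice but is shared with another document, so it is weighted by log(3/2), while "day" occurs once in a single document and is weighted by log(3). A term that appears in every document would get weight log(1) = 0.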

Page 40: Text Classification Powered by Apache Mahout and Lucene

Hashed feature vectors

● Imagine an n-dimensional space.

● Each word in texts = hashed to one dimension.

● Entry in vector set to one if a word hashed to it.
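A sketch of the hashing idea in plain Python; this is not Mahout's encoder implementation. It assumes CRC32 as the hash function (Python's built-in `hash` for strings changes between runs) and an arbitrary vector size of 16; `hashed_vector` is a hypothetical name.

```python
import zlib

def hashed_vector(tokens, n_dimensions=16):
    """Set the entry a token hashes to; no vocabulary is needed up front."""
    vector = [0] * n_dimensions
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % n_dimensions
        vector[index] = 1
    return vector

print(hashed_vector(["sunny", "weather"]))
print(hashed_vector(["a", "word", "never", "seen", "before"]))
```

The vector size is fixed regardless of how many distinct words show up, so previously unseen terms are no longer a problem. The price is that two different words may hash to the same dimension.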

Page 41: Text Classification Powered by Apache Mahout and Lucene
Page 42: Text Classification Powered by Apache Mahout and Lucene

<

Page 43: Text Classification Powered by Apache Mahout and Lucene

How a linear classifier sees data

Page 44: Text Classification Powered by Apache Mahout and Lucene
Page 45: Text Classification Powered by Apache Mahout and Lucene

HTML → Apache Tika → Fulltext → Lucene Analyzer → Tokenstream + x → FeatureVectorEncoder → Vector → OnlineLearner → Model
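The flow on this slide can be imitated end to end in a few lines. The sketch below is plain Python and does not use the Tika, Lucene or Mahout APIs: `encode` stands in for analysis plus hashed encoding, and `ToyOnlineLearner` is a hypothetical, very small logistic regression trained by stochastic gradient descent, one example at a time. The training texts, the learning rate and the number of passes are made up.

```python
import math
import re
import zlib

N_DIMENSIONS = 2 ** 18  # large, so hash collisions are unlikely in this toy

def encode(text):
    """Stand-in for analyzer + encoder: tokenize, lowercase, hash to a sparse vector."""
    tokens = re.findall(r"\w+", text.lower())
    return {zlib.crc32(t.encode("utf-8")) % N_DIMENSIONS: 1.0 for t in tokens}

class ToyOnlineLearner:
    """Binary logistic regression updated after every single training example."""

    def __init__(self, learning_rate=0.5):
        self.weights = {}
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, vector):
        score = self.bias + sum(self.weights.get(i, 0.0) * v for i, v in vector.items())
        return 1.0 / (1.0 + math.exp(-score))

    def train(self, label, vector):
        # Gradient step for the log-loss of one example.
        error = label - self.predict(vector)
        self.bias += self.learning_rate * error
        for i, v in vector.items():
            self.weights[i] = self.weights.get(i, 0.0) + self.learning_rate * error * v

examples = [
    (1, "sunny weather today"),
    (1, "weather forecast: sunny"),
    (0, "high performance computing"),
    (0, "computing cluster performance"),
]

model = ToyOnlineLearner()
for _ in range(50):  # several passes over the tiny training set
    for label, text in examples:
        model.train(label, encode(text))

print(model.predict(encode("sunny weather")))
print(model.predict(encode("performance computing")))
```

The first score should come out close to 1 and the second close to 0. Evaluating on phrases made of training words is of course optimistic.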

Page 46: Text Classification Powered by Apache Mahout and Lucene

Image by ZapTheDingbat (Light meter), http://www.flickr.com/photos/zapthedingbat/3028168415

Page 47: Text Classification Powered by Apache Mahout and Lucene

Goals

● Did I use the best model parameters?

● How well will my model perform in the wild?

Page 48: Text Classification Powered by Apache Mahout and Lucene

Tune model parameters,

Experiment with tokenization,

Experiment with vector encoding

Compute expected performance

Page 49: Text Classification Powered by Apache Mahout and Lucene
Page 50: Text Classification Powered by Apache Mahout and Lucene

Performance

● Use same data for training and testing.

● Problem:

– Highly optimistic.

– Model generalization unknown.

Page 51: Text Classification Powered by Apache Mahout and Lucene

Performance

● Use same data for training and testing.

● Problem:

– Highly optimistic.

– Model generalization unknown.

DON'T

Page 52: Text Classification Powered by Apache Mahout and Lucene

Performance

● Use just a fraction for training.

● Set some data aside for testing.

● Problems:

– Pessimistic predictor: Not all data used for training.

– Result may depend on which data was set aside.

Page 53: Text Classification Powered by Apache Mahout and Lucene

Performance

● Partition your data into n fractions.

● Each fraction set aside for testing in turn.

● Problem:

– Still a pessimistic predictor.
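A sketch of the partitioning step in plain Python. The function name `k_fold_splits` is hypothetical, and the sketch assumes the instances are already in random order; sorted data should be shuffled first.

```python
def k_fold_splits(n_items, n_folds):
    """Yield (train_indices, test_indices); each fold is the test set exactly once."""
    indices = list(range(n_items))
    for fold in range(n_folds):
        test = indices[fold::n_folds]
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test

for train, test in k_fold_splits(n_items=6, n_folds=3):
    print("train:", train, "test:", test)
# train: [1, 2, 4, 5] test: [0, 3]
# train: [0, 2, 3, 5] test: [1, 4]
# train: [0, 1, 3, 4] test: [2, 5]
```

The reported performance is then the average over the folds. Each model is still trained on only part of the data, hence the pessimistic estimate.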

Page 54: Text Classification Powered by Apache Mahout and Lucene

Performance

● Use just a fraction for training.

● Set some data aside for tuning and testing.

● Problems:

– Highly optimistic.

– Parameters manually tuned to testing data.

Page 55: Text Classification Powered by Apache Mahout and Lucene

Performance

● Use just a fraction for training.

● Set some data aside for tuning and testing.

● Problems:

– Highly optimistic.

– Parameters manually tuned to testing data.

DON'T

Page 56: Text Classification Powered by Apache Mahout and Lucene

Performance

● Use just a fraction for training.

● Set some data aside for tuning.

● Set another set of data aside for testing.

● Problems:

– Pretty pessimistic as not all data is used.

– May depend on which data was set aside.
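A sketch of the three-way split in plain Python. The 60/20/20 proportions, the fixed seed and the name `train_tune_test_split` are assumptions of this example, not recommendations from the talk.

```python
import random

def train_tune_test_split(items, tune_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle once, then cut the data into disjoint train, tune and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_tune = int(len(shuffled) * tune_fraction)
    test = shuffled[:n_test]
    tune = shuffled[n_test:n_test + n_tune]
    train = shuffled[n_test + n_tune:]
    return train, tune, test

train, tune, test = train_tune_test_split(range(10))
print(len(train), len(tune), len(test))  # 6 2 2
```

Parameters are chosen by looking at the tuning part only; the test part is touched once, at the very end.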

Page 57: Text Classification Powered by Apache Mahout and Lucene

Performance Measures

Page 58: Text Classification Powered by Apache Mahout and Lucene

                             Correct prediction: negative   Correct prediction: positive
Model prediction: positive   false positive                 true positive
Model prediction: negative   true negative                  false negative

Page 59: Text Classification Powered by Apache Mahout and Lucene

Accuracy

$\mathrm{ACC} = \frac{\text{true positive} + \text{true negative}}{\text{true positive} + \text{false positive} + \text{false negative} + \text{true negative}}$

● Problems:

– What if class distribution is skewed?
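A small worked example of the skew problem, in plain Python with made-up labels: 5 positive and 95 negative instances, and a "model" that always predicts negative.

```python
def accuracy(actual, predicted):
    """Fraction of instances where the prediction equals the correct label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

actual = [1] * 5 + [0] * 95
always_negative = [0] * 100

print(accuracy(actual, always_negative))  # 0.95
```

The score looks excellent although the model never finds a single positive instance.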

Page 60: Text Classification Powered by Apache Mahout and Lucene

Precision/ Recall

$\mathrm{Precision} = \frac{\text{true positive}}{\text{true positive} + \text{false positive}}$

$\mathrm{Recall} = \frac{\text{true positive}}{\text{true positive} + \text{false negative}}$

● Problem:

– Depends on decision threshold.
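To make the threshold dependence concrete, here is a plain Python sketch with made-up labels and classifier scores. The function name `precision_recall` is hypothetical; an instance is predicted positive when its score is at or above the threshold.

```python
def precision_recall(actual, scores, threshold):
    """Compute precision and recall for one decision threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

actual = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]

for threshold in (0.5, 0.35):
    print(threshold, precision_recall(actual, scores, threshold))
```

At threshold 0.5 both values are 2/3; lowering it to 0.35 gives precision 0.75 and recall 1.0. Same model, same data, different numbers.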

Page 61: Text Classification Powered by Apache Mahout and Lucene

ROC Curves

Page 62: Text Classification Powered by Apache Mahout and Lucene

ROC Curves

Orange rate

Page 63: Text Classification Powered by Apache Mahout and Lucene

ROC Curves

False orange rate

True orange rate

Page 64: Text Classification Powered by Apache Mahout and Lucene

ROC Curves

False orange rate

True orange rate

Page 65: Text Classification Powered by Apache Mahout and Lucene

ROC Curves

False orange rate

True orange rate

Page 66: Text Classification Powered by Apache Mahout and Lucene

ROC Curves

False orange rate

True orange rate

Page 67: Text Classification Powered by Apache Mahout and Lucene

ROC Curves

False orange rate

True orange rate

Page 68: Text Classification Powered by Apache Mahout and Lucene

AUC – area under ROC

False orange rate

True orange rate
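A sketch of how the curve and the area under it can be computed, in plain Python with made-up scores. It treats "orange" as the positive class, so the axes correspond to the false positive rate and the true positive rate. The names `roc_points` and `auc` are hypothetical, and the sketch assumes all scores are distinct.

```python
def roc_points(actual, scores):
    """Return (false positive rate, true positive rate) points, one per threshold."""
    n_pos = sum(actual)
    n_neg = len(actual) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the threshold step by step: highest-scored instance first.
    for score, label in sorted(zip(scores, actual), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

actual = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]

points = roc_points(actual, scores)
print(points)
print(auc(points))  # about 0.889, i.e. 8/9
```

The value 8/9 has a direct reading: of the nine positive/negative pairs, eight are ranked in the right order. Unlike precision and recall, this number does not depend on a single threshold.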

Page 69: Text Classification Powered by Apache Mahout and Lucene

Photo taken by fras1977, http://www.flickr.com/photos/fras/4992313333/

Page 70: Text Classification Powered by Apache Mahout and Lucene

Image by Medienmagazin pro, http://www.flickr.com/photos/medienmagazinpro/6266643422

Page 71: Text Classification Powered by Apache Mahout and Lucene
Page 72: Text Classification Powered by Apache Mahout and Lucene

http://www.flickr.com/photos/generated/943078008/

Page 73: Text Classification Powered by Apache Mahout and Lucene

Math libs/ Mahout collections

Apache Hadoop-ready

Recommendations/Collaborative filtering

Classification/Logistic Regression/ SGD

Sequence learning/HMM

kNN and matrix factorization based Collaborative filtering

Classification/Naïve Bayes, random forest

Frequent item sets/(P)FPGrowth

Co-Location search

LDA

Clustering/Mean shift, k-Means, Canopy, Dirichlet Process

Page 74: Text Classification Powered by Apache Mahout and Lucene

Image by pareeerica, http://www.flickr.com/photos/pareeerica/3711741298/

Libraries to have a look at: Vowpal Wabbit, Mallet, LibSvm, LibLinear, Libfm, Incanter, GraphLab, Scikits learn

Get your hands dirty: http://kaggle.com

https://cwiki.apache.org/confluence/display/MAHOUT/Collections

Where to get more information: “Mahout in Action” - Manning, “Taming Text” - Manning, “Machine Learning” - Andrew Ng

https://cwiki.apache.org/confluence/display/MAHOUT/Books+Tutorials+and+Talks

https://cwiki.apache.org/confluence/display/MAHOUT/Reference+Reading

Frameworks worth mentioning: Apache Mahout, Apache Giraph, Matlab/Octave, R, Shogun, Weka, RapidI, MyMediaLite

Where to meet these people: RecSys, ICML, NIPS, ECML, KDD, WSDM, PKDD, JMLR, ApacheCon, Berlin Buzzwords, O'Reilly Strata

Page 75: Text Classification Powered by Apache Mahout and Lucene

Get started today with the right tools.

January 8, 2008 by dreizehn28, http://www.flickr.com/photos/1328/2176949559

Page 76: Text Classification Powered by Apache Mahout and Lucene

Discuss ideas and problems online.

November 16, 2005 [phil h], http://www.flickr.com/photos/hi-phi/64055296

Page 77: Text Classification Powered by Apache Mahout and Lucene

Discuss ideas and problems in person.

Images taken at Berlin Buzzwords 2011/12/13 by Philipp Kaden. See you there end of May 2014.

Page 78: Text Classification Powered by Apache Mahout and Lucene
Page 79: Text Classification Powered by Apache Mahout and Lucene

Become a committer yourself

Page 80: Text Classification Powered by Apache Mahout and Lucene

http://BerlinBuzzwords.de – End of May 2014 in Berlin/ Germany.

Online – user/[email protected], [email protected], [email protected]

Interest in solving hard problems.

Being part of a lively community.

Engineering best practices.

Bug reports, patches, features.

Documentation, code, examples.

Image by: Patrick McEvoy

Page 81: Text Classification Powered by Apache Mahout and Lucene
Page 82: Text Classification Powered by Apache Mahout and Lucene
Page 83: Text Classification Powered by Apache Mahout and Lucene
Page 84: Text Classification Powered by Apache Mahout and Lucene

http://www.flickr.com/photos/29143375@N05/3344809375/in/photostream/

http://www.flickr.com/photos/redux/409356158/

Page 85: Text Classification Powered by Apache Mahout and Lucene
Page 86: Text Classification Powered by Apache Mahout and Lucene
Page 87: Text Classification Powered by Apache Mahout and Lucene

http://www.flickr.com/photos/29143375@N05/3344809375/in/photostream/

http://www.flickr.com/photos/redux/409356158/

Page 88: Text Classification Powered by Apache Mahout and Lucene

By freezelight, http://www.flickr.com/photos/63056612@N00/155554663/