Text Classification Powered by Apache Mahout and Lucene

Transcript
Page 1: Text Classification Powered by Apache Mahout and Lucene

Text classification with Apache Mahout and Lucene

Page 2: Text Classification Powered by Apache Mahout and Lucene

Isabel Drost-Fromm

Software Engineer at Nokia Maps*

Member of the Apache Software Foundation

Co-Founder of Berlin Buzzwords and Berlin Apache Hadoop GetTogether

Co-founder of Apache Mahout

*We are hiring, talk to me or mail [email protected]

Page 3: Text Classification Powered by Apache Mahout and Lucene
Page 4: Text Classification Powered by Apache Mahout and Lucene
Page 5: Text Classification Powered by Apache Mahout and Lucene

TM

Page 6: Text Classification Powered by Apache Mahout and Lucene

https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout

… provide your own success story online.

Page 7: Text Classification Powered by Apache Mahout and Lucene

TM

Page 8: Text Classification Powered by Apache Mahout and Lucene

Classification?

Page 9: Text Classification Powered by Apache Mahout and Lucene
Page 10: Text Classification Powered by Apache Mahout and Lucene

January 8, 2008 by Pink Sherbet Photography http://www.flickr.com/photos/pinksherbet/2177961471/

Page 11: Text Classification Powered by Apache Mahout and Lucene

By freezelight, http://www.flickr.com/photos/63056612@N00/155554663/

Page 12: Text Classification Powered by Apache Mahout and Lucene

http://www.flickr.com/photos/29143375@N05/3344809375/in/photostream/

http://www.flickr.com/photos/redux/409356158/

Page 13: Text Classification Powered by Apache Mahout and Lucene

http://www.flickr.com/photos/29143375@N05/3344809375/in/photostream/

http://www.flickr.com/photos/redux/409356158/

Page 14: Text Classification Powered by Apache Mahout and Lucene

Image by jasondevilla http://www.flickr.com/photos/jasondv/91960897/

Page 15: Text Classification Powered by Apache Mahout and Lucene

How a linear classifier sees data

Page 16: Text Classification Powered by Apache Mahout and Lucene

Image by ZapTheDingbat (Light meter) http://www.flickr.com/photos/zapthedingbat/3028168415

Page 17: Text Classification Powered by Apache Mahout and Lucene

Instance*

(sometimes also called example, item, or in databases a row)

Page 18: Text Classification Powered by Apache Mahout and Lucene

Feature*

(sometimes also called attribute, signal, predictor, co-variate, or column in databases)

Page 19: Text Classification Powered by Apache Mahout and Lucene

Label*

(sometimes also called class, target variable)

Page 20: Text Classification Powered by Apache Mahout and Lucene
Page 21: Text Classification Powered by Apache Mahout and Lucene
Page 22: Text Classification Powered by Apache Mahout and Lucene
Page 23: Text Classification Powered by Apache Mahout and Lucene

Image taken in Lisbon/ Portugal.

Page 24: Text Classification Powered by Apache Mahout and Lucene

Image by jasondevilla http://www.flickr.com/photos/jasondv/91960897/

Page 25: Text Classification Powered by Apache Mahout and Lucene
Page 26: Text Classification Powered by Apache Mahout and Lucene

● Remove noise.

Page 27: Text Classification Powered by Apache Mahout and Lucene
Page 28: Text Classification Powered by Apache Mahout and Lucene

● Remove noise.

● Convert text to vectors.

Page 29: Text Classification Powered by Apache Mahout and Lucene

Text consists of terms and phrases.

Page 30: Text Classification Powered by Apache Mahout and Lucene

Encoding issues?

Chinese? Japanese?

“New York” vs. new York?

“go” vs. “going” vs. “went” vs. “gone”?

“go” vs. “Go”?
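
A minimal Python sketch of one of these choices, case folding. The function name `tokenize` and the regular expression are assumptions for illustration only, not the Lucene analyzer used in the talk. Folding case merges "go" and "Go", but it also erases the difference between "New York" and "new York".

```python
import re

def tokenize(text, lowercase=True):
    """Split text into word tokens; optionally fold everything to lower case."""
    tokens = re.findall(r"\w+", text)
    return [t.lower() for t in tokens] if lowercase else tokens

print(tokenize("Go, go!"))                   # ['go', 'go']
print(tokenize("Go, go!", lowercase=False))  # ['Go', 'go']
```

Stemming ("go" vs. "going") and languages without whitespace between words need more than this; a real analyzer chain handles those cases.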

Page 31: Text Classification Powered by Apache Mahout and Lucene

Terms? Tokens? Wait!

Page 32: Text Classification Powered by Apache Mahout and Lucene
Page 33: Text Classification Powered by Apache Mahout and Lucene

Now we have terms – how to turn them into vectors?

Page 34: Text Classification Powered by Apache Mahout and Lucene

Sunny weather

High performance computing

If we looked at two phrases only:

Page 35: Text Classification Powered by Apache Mahout and Lucene

Aaron

Zuse

Page 36: Text Classification Powered by Apache Mahout and Lucene

Binary bag of words

● Imagine an n-dimensional space.

● Each dimension = one possible word in texts.

● Entry in vector is one, if word occurs in text.

● Problem:

– How to know all possible terms in unknown text?

$$b_{i,j} = \begin{cases} 1 & \text{if } x_i \in d_j \\ 0 & \text{else} \end{cases}$$
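
A small sketch of the binary bag of words in Python, assuming whitespace tokenization; `build_vocabulary` and `binary_vector` are hypothetical helper names. It also shows the problem from the slide: a term that was not seen when the vocabulary was built has no dimension.

```python
def build_vocabulary(documents):
    """Assign one dimension index to every distinct term in the known texts."""
    vocab = {}
    for doc in documents:
        for term in doc.lower().split():
            vocab.setdefault(term, len(vocab))
    return vocab

def binary_vector(doc, vocab):
    """Entry is 1 if the term occurs in the text, 0 otherwise."""
    vec = [0] * len(vocab)
    for term in doc.lower().split():
        if term in vocab:  # unknown terms (here: "cloud") are silently dropped
            vec[vocab[term]] = 1
    return vec

vocab = build_vocabulary(["sunny weather", "high performance computing"])
print(binary_vector("sunny sunny computing cloud", vocab))  # [1, 0, 0, 0, 1]
```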

Page 37: Text Classification Powered by Apache Mahout and Lucene

Term Frequency

● Imagine an n-dimensional space.

● Each dimension = one possible word in texts.

● Entry in vector equal to the word's frequency.

● Problem:

– Common words dominate vectors.

$$b_{i,j} = n_{i,j}$$
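
A sketch of raw term frequencies in Python, assuming whitespace tokenization; the helper name `term_frequencies` and the sample sentence are made up for illustration. The most frequent entries are common words, which is the problem named above.

```python
from collections import Counter

def term_frequencies(doc):
    """Count how often each term occurs in one text."""
    return Counter(doc.lower().split())

tf = term_frequencies("the cat sat on the mat and the dog sat too")
print(tf.most_common(2))  # [('the', 3), ('sat', 2)]
```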

Page 38: Text Classification Powered by Apache Mahout and Lucene

TF with stop wording

● Imagine an n-dimensional space.

● Each dimension = one possible word in texts.

● Filter stopwords.

● Entry in vector equal to the word's frequency.

● Problem:

– Common and uncommon words with same weight.

$$b_{i,j} = n_{i,j}$$
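
The same count with a stop word filter, as a sketch. The `STOPWORDS` set below is a tiny illustrative list chosen for this example, not a list shipped by Lucene or Mahout.

```python
from collections import Counter

# Tiny illustrative stop word list (an assumption, not a library default).
STOPWORDS = {"the", "on", "and", "too", "a", "of"}

def tf_without_stopwords(doc, stopwords=STOPWORDS):
    """Term frequencies computed after dropping stop words."""
    return Counter(t for t in doc.lower().split() if t not in stopwords)

print(tf_without_stopwords("the cat sat on the mat"))
```

Every remaining term still counts the same, whether it is rare or appears in nearly every text.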

Page 39: Text Classification Powered by Apache Mahout and Lucene

TF-IDF

● Imagine an n-dimensional space.

● Each dimension = one possible word in texts.

● Filter stopwords.

● Entry in vector equal to the weighted frequency.

● Problem:

– Long texts get larger values.

$$b_{i,j} = n_{i,j} \times \log \frac{|D|}{|\{d : t_i \in d\}|}$$
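
A sketch of this weighting in plain Python, assuming whitespace tokenization and the unsmoothed formula from the slide; the function name `tf_idf` is hypothetical. Library implementations often add smoothing or length normalization, which this sketch leaves out.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document: n_ij * log(|D| / df_i)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({t: n * math.log(n_docs / doc_freq[t])
                        for t, n in counts.items()})
    return vectors

vectors = tf_idf(["a b", "a c"])
print(vectors[0])  # "a" occurs in every document, so its weight is 0
```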

Page 40: Text Classification Powered by Apache Mahout and Lucene

Hashed feature vectors

● Imagine an n-dimensional space.

● Each word in texts = hashed to one dimension.

● Entry in vector set to one, if word hashed to it.
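
A sketch of the hashing idea in Python. The choice of MD5, the default of 16 dimensions, and the name `hashed_vector` are assumptions for illustration; this is not Mahout's encoder. No vocabulary is needed, so unknown words still land somewhere, at the price of possible collisions.

```python
import hashlib

def hashed_vector(doc, n_dims=16):
    """Hash every term to one of n_dims positions and set that entry to 1."""
    vec = [0] * n_dims
    for term in doc.lower().split():
        digest = hashlib.md5(term.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n_dims
        vec[index] = 1  # two different terms may collide on the same index
    return vec

print(hashed_vector("sunny weather"))
```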

Page 41: Text Classification Powered by Apache Mahout and Lucene
Page 42: Text Classification Powered by Apache Mahout and Lucene

<

Page 43: Text Classification Powered by Apache Mahout and Lucene

How a linear classifier sees data

Page 44: Text Classification Powered by Apache Mahout and Lucene
Page 45: Text Classification Powered by Apache Mahout and Lucene

HTML → Apache Tika → Fulltext → Lucene Analyzer → Tokenstream (+x) → FeatureVectorEncoder → Vector → OnlineLearner → Model
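
A toy end-to-end sketch of the later stages of this pipeline in Python. Everything here is a stand-in: whitespace splitting replaces the Lucene analyzer, `encode` replaces the feature vector encoder, and `TinyOnlineLearner` is a hypothetical logistic regression trained with stochastic gradient descent, not Mahout's API. Text extraction with Tika is skipped entirely.

```python
import hashlib
import math

N_DIMS = 64  # assumed vector size for this toy example

def encode(text):
    """Token stream to vector: hash each token to a dimension, binary entries."""
    vec = [0.0] * N_DIMS
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % N_DIMS] = 1.0
    return vec

class TinyOnlineLearner:
    """Logistic regression updated one instance at a time (SGD)."""

    def __init__(self, n_dims, learning_rate=0.5):
        self.weights = [0.0] * n_dims
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, vec):
        score = self.bias + sum(w * x for w, x in zip(self.weights, vec))
        return 1.0 / (1.0 + math.exp(-score))

    def train(self, vec, label):
        error = label - self.predict(vec)  # gradient of the log-likelihood
        self.bias += self.learning_rate * error
        for i, x in enumerate(vec):
            if x:
                self.weights[i] += self.learning_rate * error * x

training_data = [("sunny weather", 1), ("high performance computing", 0)]
learner = TinyOnlineLearner(N_DIMS)
for _ in range(50):
    for text, label in training_data:
        learner.train(encode(text), label)

print(learner.predict(encode("sunny weather")))
print(learner.predict(encode("high performance computing")))
```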

Page 46: Text Classification Powered by Apache Mahout and Lucene

Image by ZapTheDingbat (Light meter) http://www.flickr.com/photos/zapthedingbat/3028168415

Page 47: Text Classification Powered by Apache Mahout and Lucene

Goals

● Did I use the best model parameters?

● How well will my model perform in the wild?

Page 48: Text Classification Powered by Apache Mahout and Lucene

Tune model parameters,

Experiment with tokenization,

Experiment with vector encoding

Compute expected performance

Page 49: Text Classification Powered by Apache Mahout and Lucene
Page 50: Text Classification Powered by Apache Mahout and Lucene

Performance

● Use same data for training and testing.

● Problem:

– Highly optimistic.

– Model generalization unknown.

Page 51: Text Classification Powered by Apache Mahout and Lucene

Performance

● Use same data for training and testing.

● Problem:

– Highly optimistic.

– Model generalization unknown.

DON'T

Page 52: Text Classification Powered by Apache Mahout and Lucene

Performance

● Use just a fraction for training.

● Set some data aside for testing.

● Problems:

– Pessimistic predictor: Not all data used for training.

– Result may depend on which data was set aside.
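
A sketch of such a hold-out split in Python. The function name, the 20% test fraction, and the fixed seed are assumptions for illustration. Changing the seed changes which instances are set aside, which is the second problem listed above.

```python
import random

def train_test_split(instances, test_fraction=0.2, seed=42):
    """Shuffle once, then set the first part aside for testing."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(range(10))
print(len(train), len(test))  # 8 2
```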

Page 53: Text Classification Powered by Apache Mahout and Lucene

Performance

● Partition your data into n fractions.

● Each fraction set aside for testing in turn.

● Problem:

– Still a pessimistic predictor.
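
The partitioning step can be sketched in Python as follows; `k_fold_splits` is a hypothetical helper that expects a list and assigns instances to folds round-robin. Each instance is used for testing exactly once.

```python
def k_fold_splits(instances, n_folds):
    """Yield (train, test) pairs; every fold is the test set exactly once."""
    folds = [instances[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

for train, test in k_fold_splits(list(range(10)), 5):
    print(len(train), len(test))  # 8 2 for every fold
```

Averaging the score over all folds gives the performance estimate; shuffling the data before partitioning is usually a good idea if it arrives in sorted order.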

Page 54: Text Classification Powered by Apache Mahout and Lucene

Performance

● Use just a fraction for training.

● Set some data aside for tuning and testing.

● Problems:

– Highly optimistic.

– Parameters manually tuned to testing data.

Page 55: Text Classification Powered by Apache Mahout and Lucene

Performance

● Use just a fraction for training.

● Set some data aside for tuning and testing.

● Problems:

– Highly optimistic.

– Parameters manually tuned to testing data.

DON'T

Page 56: Text Classification Powered by Apache Mahout and Lucene

Performance

● Use just a fraction for training.

● Set some data aside for tuning.

● Set another set of data aside for testing.

● Problems:

– Pretty pessimistic as not all data is used.

– May depend on which data was set aside.
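
A sketch of the three-way split in Python; the name `three_way_split`, the fractions, and the seed are assumptions. Parameters are tuned against the tuning set only, and the test set is touched once at the end.

```python
import random

def three_way_split(instances, tune_fraction=0.2, test_fraction=0.2, seed=42):
    """Return (train, tune, test) as three disjoint lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_tune = int(len(shuffled) * tune_fraction)
    test = shuffled[:n_test]
    tune = shuffled[n_test:n_test + n_tune]
    train = shuffled[n_test + n_tune:]
    return train, tune, test

train, tune, test = three_way_split(range(10))
print(len(train), len(tune), len(test))  # 6 2 2
```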

Page 57: Text Classification Powered by Apache Mahout and Lucene

Performance Measures

Page 58: Text Classification Powered by Apache Mahout and Lucene

                             Correct prediction: negative   Correct prediction: positive

Model prediction: positive   false positive                 true positive

Model prediction: negative   true negative                  false negative

Page 59: Text Classification Powered by Apache Mahout and Lucene

Accuracy

$$\mathrm{ACC} = \frac{\text{true positive} + \text{true negative}}{\text{true positive} + \text{false positive} + \text{false negative} + \text{true negative}}$$

● Problems:

– What if class distribution is skewed?
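
A short Python illustration of the skew problem, using made-up labels: with 5 positives among 100 instances, a model that always predicts "negative" still reaches 95% accuracy without finding a single positive.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the correct one."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

y_true = [1] * 5 + [0] * 95   # skewed: only 5% positives
always_negative = [0] * 100   # a useless model
print(accuracy(y_true, always_negative))  # 0.95
```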

Page 60: Text Classification Powered by Apache Mahout and Lucene

Precision/ Recall

$$\text{Precision} = \frac{\text{true positive}}{\text{true positive} + \text{false positive}}$$

$$\text{Recall} = \frac{\text{true positive}}{\text{true positive} + \text{false negative}}$$

● Problem:

– Depends on decision threshold.
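
A sketch of the threshold dependence in Python. The scores and labels are made up for illustration, and `precision_recall` is a hypothetical helper. Lowering the threshold raises recall here, while raising it trades recall for precision.

```python
def precision_recall(y_true, scores, threshold):
    """Predict positive when score >= threshold, then compute both measures."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
for threshold in (0.35, 0.5, 0.7):
    print(threshold, precision_recall(y_true, scores, threshold))
```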

Page 61: Text Classification Powered by Apache Mahout and Lucene

ROC Curves

Page 62: Text Classification Powered by Apache Mahout and Lucene

ROC Curves

Orange rate

Page 63: Text Classification Powered by Apache Mahout and Lucene

ROC Curves

False orange rate

True orange rate

Page 68: Text Classification Powered by Apache Mahout and Lucene

AUC – area under ROC

False orange rate

True orange rate
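
A sketch of AUC in Python that avoids drawing the curve. It uses the known equivalence that the area under the ROC curve equals the probability that a randomly chosen positive is scored higher than a randomly chosen negative, with ties counted as half. The function name `auc` and the sample scores are assumptions for illustration.

```python
def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
print(auc(y_true, scores))  # 8 of 9 positive/negative pairs are ordered correctly
```

Unlike precision and recall, this number does not depend on a single decision threshold. The pairwise loop is quadratic, so it suits small evaluation sets only.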

Page 69: Text Classification Powered by Apache Mahout and Lucene

Photo taken by fras1977 http://www.flickr.com/photos/fras/4992313333/

Page 70: Text Classification Powered by Apache Mahout and Lucene

Image by Medienmagazin pro http://www.flickr.com/photos/medienmagazinpro/6266643422

Page 71: Text Classification Powered by Apache Mahout and Lucene
Page 72: Text Classification Powered by Apache Mahout and Lucene

http://www.flickr.com/photos/generated/943078008/

Page 73: Text Classification Powered by Apache Mahout and Lucene

Math libs/ Mahout collections

Apache Hadoop-ready

Recommendations/Collaborative filtering

Classification/Logistic Regression/ SGD

Sequence learning/HMM

kNN and matrix factorization-based collaborative filtering

Classification/Naïve Bayes, random forest

Frequent item sets/(P)FPGrowth

Co-Location search

LDA

Clustering/ Mean shift, k-Means, Canopy, Dirichlet Process

Page 74: Text Classification Powered by Apache Mahout and Lucene

Image by pareeerica http://www.flickr.com/photos/pareeerica/3711741298/

Libraries to have a look at: Vowpal Wabbit, Mallet, LibSvm, LibLinear, Libfm, Incanter, GraphLab, Scikits learn

Get your hands dirty: http://kaggle.com

https://cwiki.apache.org/confluence/display/MAHOUT/Collections

Where to get more information:
“Mahout in Action” - Manning
“Taming Text” - Manning
“Machine Learning” - Andrew Ng

https://cwiki.apache.org/confluence/display/MAHOUT/Books+Tutorials+and+Talks

https://cwiki.apache.org/confluence/display/MAHOUT/Reference+Reading

Frameworks worth mentioning: Apache Mahout, Apache Giraph, Matlab/Octave, R, Shogun, Weka, RapidI, MyMediaLite

Where to meet these people: RecSys, ICML, NIPS, ECML, KDD, WSDM, PKDD, JMLR, ApacheCon, Berlin Buzzwords, O'Reilly Strata

Page 75: Text Classification Powered by Apache Mahout and Lucene

Get started today with the right tools.

January 8, 2008 by dreizehn28 http://www.flickr.com/photos/1328/2176949559

Page 76: Text Classification Powered by Apache Mahout and Lucene

Discuss ideas and problems online.

November 16, 2005 [phil h] http://www.flickr.com/photos/hi-phi/64055296

Page 77: Text Classification Powered by Apache Mahout and Lucene

Discuss ideas and problems in person.

Images taken at Berlin Buzzwords 2011/12/13 by Philipp Kaden. See you there end of May 2014.

Page 78: Text Classification Powered by Apache Mahout and Lucene
Page 79: Text Classification Powered by Apache Mahout and Lucene

Become a committer yourself

Page 80: Text Classification Powered by Apache Mahout and Lucene

http://BerlinBuzzwords.de – End of May 2014 in Berlin/ Germany.

Online – user/[email protected], [email protected], [email protected]

Interest in solving hard problems.

Being part of a lively community.

Engineering best practices.

Bug reports, patches, features.

Documentation, code, examples.

Image by: Patrick McEvoy

Page 81: Text Classification Powered by Apache Mahout and Lucene
Page 82: Text Classification Powered by Apache Mahout and Lucene
Page 83: Text Classification Powered by Apache Mahout and Lucene
Page 84: Text Classification Powered by Apache Mahout and Lucene

http://www.flickr.com/photos/29143375@N05/3344809375/in/photostream/

http://www.flickr.com/photos/redux/409356158/

Page 85: Text Classification Powered by Apache Mahout and Lucene
Page 86: Text Classification Powered by Apache Mahout and Lucene
Page 87: Text Classification Powered by Apache Mahout and Lucene

http://www.flickr.com/photos/29143375@N05/3344809375/in/photostream/

http://www.flickr.com/photos/redux/409356158/

Page 88: Text Classification Powered by Apache Mahout and Lucene

By freezelight, http://www.flickr.com/photos/63056612@N00/155554663/