Page 1

Text Classification & Naïve Bayes

CMSC 723 / LING 723 / INST 725

MARINE CARPUAT

marine@cs.umd.edu

Some slides by Dan Jurafsky & James Martin, Jacob Eisenstein

Page 2

Today

• Text classification problems

– and their evaluation

• Linear classifiers

– Features & Weights

– Bag of words

– Naïve Bayes

Machine Learning, Probability

Linguistics

Page 3

TEXT CLASSIFICATION

Page 4

Is this spam?

From: "Fabian Starr" <[email protected]>

Subject: Hey! Sofware for the funny prices!

Get the great discounts on popular software today for PC and Macintosh: http://iiled.org/Cj4Lmx

70-90% Discounts from retail price!!! All sofware is instantly available to download - No Need Wait!

Page 5

Who wrote which Federalist papers?

• 1787-8: anonymous essays try to convince New York to ratify the U.S. Constitution: Jay, Madison, Hamilton

• Authorship of 12 of the letters in dispute

• 1963: solved by Mosteller and Wallace using Bayesian methods

James Madison, Alexander Hamilton

Page 6

Positive or negative movie review?

• unbelievably disappointing

• Full of zany characters and richly applied satire, and some great plot twists

• this is the greatest screwball comedy ever filmed

• It was pathetic. The worst part about it was the boxing scenes.

Page 7

What is the subject of this article?

• Antagonists and Inhibitors

• Blood Supply

• Chemistry

• Drug Therapy

• Embryology

• Epidemiology

• …

MEDLINE Article → ? → MeSH Subject Category Hierarchy

Page 8

Text Classification

• Assigning subject categories, topics, or genres

• Spam detection

• Authorship identification

• Age/gender identification

• Language Identification

• Sentiment analysis

• …

Page 9

Text Classification: definition

• Input:

– a document w

– a fixed set of classes Y = {y1, y2,…, yJ}

• Output: a predicted class y ∈ Y

Page 10

Classification Methods: Hand-coded rules

• Rules based on combinations of words or other features

– spam: black-list-address OR (“dollars” AND “have been selected”)

• Accuracy can be high

– if rules are carefully refined by an expert

• But building and maintaining these rules is expensive
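A rule like the slide's spam example can be sketched directly in code. This is a minimal, hypothetical illustration: the black-listed domain is invented for the example, and real rule systems combine many more such tests.

```python
# Hypothetical black list of sender domains (invented for this sketch).
BLACK_LIST = {"spamcorp.example"}

def is_spam(sender_domain, body):
    """Flag mail as spam if the sender domain is black-listed, or if the
    body contains both "dollars" and "have been selected"."""
    if sender_domain in BLACK_LIST:
        return True
    text = body.lower()
    return "dollars" in text and "have been selected" in text
```

Each rule is cheap to evaluate, but the rule set itself is what an expert must keep refining as spammers adapt, which is the maintenance cost the slide warns about.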

Page 11

Classification Methods: Supervised Machine Learning

• Input

– a document w

– a fixed set of classes Y = {y1, y2,…, yJ}

– A training set of m hand-labeled documents (w1,y1),....,(wm,ym)

• Output

– a learned classifier w → y

Page 12

Aside: getting examples for supervised learning

• Human annotation

– By experts or non-experts (crowdsourcing)

– Found data

• Truth vs. gold standard

• How do we know how good a classifier is?

– Accuracy on held out data

Page 13

Aside: evaluating classifiers

• How do we know how good a classifier is?

– Compare classifier predictions with human annotation

– On held-out test examples

– Evaluation metrics: accuracy, precision, recall

Page 14

The 2-by-2 contingency table

               correct    not correct
selected       tp         fp
not selected   fn         tn
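The four cells of the table can be filled by comparing a classifier's selections against the correct (gold) labels; a minimal sketch:

```python
def contingency(selected, correct):
    """Fill the 2-by-2 contingency table.

    selected, correct: parallel lists of booleans, one pair per item
    (was the item selected by the classifier? was it actually correct?).
    Returns (tp, fp, fn, tn).
    """
    tp = sum(s and c for s, c in zip(selected, correct))
    fp = sum(s and not c for s, c in zip(selected, correct))
    fn = sum(not s and c for s, c in zip(selected, correct))
    tn = sum(not s and not c for s, c in zip(selected, correct))
    return tp, fp, fn, tn
```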

Page 15

Precision and recall

• Precision: % of selected items that are correct

  P = tp / (tp + fp)

• Recall: % of correct items that are selected

  R = tp / (tp + fn)

               correct    not correct
selected       tp         fp
not selected   fn         tn

Page 16

A combined measure: F

• A combined measure that assesses the P/R tradeoff is the F measure (weighted harmonic mean):

  F = 1 / (α·(1/P) + (1 − α)·(1/R)) = (β² + 1)·P·R / (β²·P + R)

• People usually use the balanced F1 measure

  – i.e., with β = 1 (that is, α = ½):

  F = 2PR / (P + R)
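The weighted harmonic mean above translates directly into code. A small sketch (P and R are assumed to be precision and recall in [0, 1], not both zero):

```python
def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision p and recall r:
    F = (beta^2 + 1) * P * R / (beta^2 * P + R).
    beta=1 gives the balanced F1 measure, 2PR / (P + R)."""
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
```

Raising beta above 1 weights recall more heavily; lowering it below 1 favors precision.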

Page 17

LINEAR CLASSIFIERS

Page 18

Bag of words
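A bag-of-words representation keeps word counts and throws away word order. A minimal sketch, assuming simple lowercased whitespace tokenization (real tokenizers do more):

```python
from collections import Counter

def bag_of_words(document):
    """Map a document string to its unordered word counts."""
    return Counter(document.lower().split())
```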

Page 19

Defining features

Page 20

Linear classification

Page 21

Linear Models for Classification

• Feature function representation

• Weights
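A linear classifier scores each class as a dot product between a weight vector and a feature function, then predicts the highest-scoring class. A minimal sketch: the word-class pair feature function below is one common bag-of-words choice, not the only one, and the weights here would come from whatever learning method is used.

```python
from collections import Counter

def features(doc, y):
    """Feature function f(w, y): pair each word of the document with
    the candidate class, so each class gets its own weights per word."""
    return Counter((word, y) for word in doc.lower().split())

def predict(doc, classes, theta):
    """Predict argmax over classes of theta . f(doc, y).

    theta: dict mapping (word, class) features to real-valued weights.
    """
    def score(y):
        return sum(theta.get(feat, 0.0) * count
                   for feat, count in features(doc, y).items())
    return max(classes, key=score)
```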

Page 22

How can we learn weights?

• By hand

• Probability

– Today: Naïve Bayes

• Discriminative training

– e.g., perceptron, support vector machines

Page 23

Generative Story for Multinomial Naïve Bayes

• A hypothetical stochastic process describing how training examples are generated

Page 24

Prediction with Naïve Bayes
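Prediction chooses the class maximizing log P(y) + Σ log P(word | y), summing in log space to avoid underflow. A minimal sketch; the tiny floor for unseen words is a placeholder to keep the log defined, and smoothing (next slides) is the principled fix:

```python
import math

def nb_predict(doc, priors, likelihoods):
    """Multinomial Naive Bayes prediction.

    priors: {class: P(y)}
    likelihoods: {class: {word: P(word | y)}}
    """
    words = doc.lower().split()
    def log_joint(y):
        return math.log(priors[y]) + sum(
            # 1e-12 floor for unseen words: placeholder, not smoothing.
            math.log(likelihoods[y].get(w, 1e-12)) for w in words)
    return max(priors, key=log_joint)
```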

Page 25

Parameter Estimation

• “count and normalize”

• Parameters of a multinomial distribution

– Relative frequency estimator

– Formally: this is the maximum likelihood estimate

• See CIML for derivation
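"Count and normalize" can be sketched directly: relative frequencies of classes give the priors, and relative frequencies of words within each class give the likelihoods. These are the maximum likelihood estimates (no smoothing yet):

```python
from collections import Counter, defaultdict

def estimate(labeled_docs):
    """Maximum likelihood (relative frequency) estimates for
    multinomial Naive Bayes.

    labeled_docs: list of (document string, class label) pairs.
    Returns (priors, likelihoods) as used at prediction time.
    """
    class_counts = Counter(y for _, y in labeled_docs)
    word_counts = defaultdict(Counter)
    for doc, y in labeled_docs:
        word_counts[y].update(doc.lower().split())
    n = len(labeled_docs)
    priors = {y: c / n for y, c in class_counts.items()}
    likelihoods = {y: {w: c / sum(wc.values()) for w, c in wc.items()}
                   for y, wc in word_counts.items()}
    return priors, likelihoods
```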

Page 26

Smoothing
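Without smoothing, a word never seen with a class gets probability zero and wipes out the whole product. Add-one (Laplace) smoothing is one standard fix; a minimal sketch:

```python
def smoothed_likelihoods(word_counts, vocab):
    """Add-one (Laplace) smoothed estimates of P(word | class).

    word_counts: {class: {word: count}}
    vocab: set of all words in the vocabulary.
    Adds 1 to every count, so unseen words get small nonzero probability.
    """
    out = {}
    for y, counts in word_counts.items():
        total = sum(counts.values()) + len(vocab)
        out[y] = {w: (counts.get(w, 0) + 1) / total for w in vocab}
    return out
```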

Page 27

Naïve Bayes recap

Page 28

Today

• Text classification problems

– and their evaluation

• Linear classifiers

– Features & Weights

– Bag of words

– Naïve Bayes

Machine Learning, Probability

Linguistics