
Transcript of “Two Methods for Improving Text Classification when Training Data is Sparse”, Andrew McCallum

Page 1:

Two Methods for Improving Text Classification

when Training Data is Sparse

Andrew McCallum (mccallum@cs.cmu.edu)

Just Research (formerly JPRC)

Carnegie Mellon University

For more detail see http://www.cs.cmu.edu/~mccallum

Improving Text Classification by Shrinkage in a Hierarchy of Classes (Sub. to ICML-98)

McCallum, Rosenfeld, Mitchell, Ng

Learning to Classify Text from Labeled and Unlabeled Documents (AAAI-98)

Nigam, McCallum, Thrun, Mitchell

Page 2:

The Task: Document Classification (AKA “Document Categorization”, “Routing”, or “Tagging”)

Automatically placing documents in their correct categories.

[Figure: the classification task.]

Categories: Magnetism, Relativity, Evolution, Botany, Irrigation, Crops

Training data: word lists for each category, e.g. “wheat corn silo grow ...”, “wheat tulips splicing grow ...”, “water grating ditch tractor ...”, “selection mutation Darwin ...”

Testing data: “wheat grow tractor ...”

Page 3:

A Probabilistic Approach to Document Classification

Pick the most probable class, given the evidence:

$\hat{c} = \arg\max_{c_j} \Pr(c_j \mid d)$

where $c_j$ is a class (like “Crops”) and $d$ is a document (like “wheat grow tractor...”).

Bayes Rule:

$\Pr(c_j \mid d) = \frac{\Pr(c_j)\,\Pr(d \mid c_j)}{\Pr(d)}$

“Naïve Bayes” (independence assumption):

$\Pr(c_j \mid d) = \frac{\Pr(c_j)\,\prod_{i=1}^{|d|} \Pr(w_{d_i} \mid c_j)}{\Pr(d)}$

where $w_{d_i}$ is the $i$-th word in $d$ (like “grow”).
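To make the rule concrete, here is a minimal Python sketch of Naïve Bayes classification as stated above, assuming the parameters Pr(c) and Pr(w|c) have already been estimated; the names `priors` and `word_probs` and the toy numbers are illustrative, not from the talk.

```python
import math

def classify(document_words, priors, word_probs):
    """Return the class c maximizing Pr(c) * prod_i Pr(w_i | c).

    Works in log space to avoid floating-point underflow on long documents.
    priors:     dict mapping class -> Pr(c)
    word_probs: dict mapping class -> dict mapping word -> Pr(w | c)
    """
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for w in document_words:
            # Skip words missing from the model; with Laplace smoothing
            # (later slides) every vocabulary word has nonzero probability.
            if w in word_probs[c]:
                score += math.log(word_probs[c][w])
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Example with made-up numbers:
priors = {"Crops": 0.5, "Irrigation": 0.5}
word_probs = {
    "Crops":      {"wheat": 0.05, "grow": 0.04, "tractor": 0.01},
    "Irrigation": {"wheat": 0.01, "grow": 0.01, "tractor": 0.05},
}
print(classify(["wheat", "grow", "tractor"], priors, word_probs))  # -> "Crops"
```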

Page 4:

Comparison with TFIDF

TFIDF/Rocchio:

$\mathrm{Score}(d, c) = \sum_{i=1}^{|V|} \mathrm{TFIDF}(w_i, d)\,\mathrm{TFIDF}(w_i, c) \,/\, Z$

Naïve Bayes:

$\mathrm{Score}(d, c) = \Big[\log \Pr(c) + \sum_{i=1}^{|V|} \mathrm{TF}(w_i, d)\,\log \Pr(w_i \mid c)\Big] \,/\, Z$

where Z is a normalization constant.
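Both scores are linear in per-word statistics of the document, which the following minimal sketch tries to make explicit; the function names and the fixed toy vocabulary are hypothetical, and Z is omitted in both cases.

```python
import math

VOCAB = ["wheat", "grow", "tractor", "tulips"]

def tfidf_rocchio_score(tfidf_doc, tfidf_class):
    """Score(d, c) = sum_i TFIDF(w_i, d) * TFIDF(w_i, c)   (Z omitted)."""
    return sum(tfidf_doc.get(w, 0.0) * tfidf_class.get(w, 0.0) for w in VOCAB)

def naive_bayes_score(tf_doc, prior, word_probs):
    """Score(d, c) = log Pr(c) + sum_i TF(w_i, d) * log Pr(w_i | c)   (Z omitted).

    Assumes every vocabulary word has a nonzero (smoothed) probability.
    """
    return math.log(prior) + sum(
        tf_doc.get(w, 0) * math.log(word_probs[w]) for w in VOCAB
    )
```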

Page 5:

Parameter Estimation in Naïve Bayes

Naïve Bayes:

$\Pr(c_j \mid d) = \frac{\Pr(c_j)\,\prod_{i=1}^{|d|} \Pr(w_{d_i} \mid c_j)}{\Pr(d)}$

Bayes-optimal estimate of $\Pr(w \mid c)$ (via Laplace smoothing):

$\Pr(w_t \mid c_j) = \frac{1 + \sum_{d_k \in c_j} \mathrm{TF}(w_t, d_k)}{|V| + \sum_{t'=1}^{|V|} \sum_{d_k \in c_j} \mathrm{TF}(w_{t'}, d_k)}$
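A minimal Python sketch of this smoothed estimator, assuming documents are given as lists of tokens with class labels; the function and variable names are hypothetical.

```python
from collections import Counter

def estimate_word_probs(docs, labels, vocabulary):
    """Laplace-smoothed estimate:
    Pr(w_t | c_j) = (1 + sum_{d_k in c_j} TF(w_t, d_k))
                    / (|V| + sum_t' sum_{d_k in c_j} TF(w_t', d_k))
    """
    word_probs = {}
    for c in set(labels):
        counts = Counter()
        for doc, label in zip(docs, labels):
            if label == c:
                counts.update(w for w in doc if w in vocabulary)
        total = sum(counts.values())  # total word occurrences in class c
        word_probs[c] = {
            w: (1 + counts[w]) / (len(vocabulary) + total) for w in vocabulary
        }
    return word_probs
```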

A Key Problem: Getting better estimates of Pr(w|c)

Page 6:

Document Classification in a Hierarchy of Classes

Andrew McCallum

Roni Rosenfeld

Tom Mitchell

Andrew Ng

Page 7:

The Idea: “Deleted Interpolation” or “Shrinkage”

We can improve the parameter estimates in a leaf by averaging them with the estimates in its ancestors. This represents a tradeoff between reliability and specificity.

[Figure: the same task, with the categories arranged in a hierarchy.]

Categories: Science at the root, with children Physics (Magnetism, Relativity), Biology (Evolution, Botany), and Agriculture (Irrigation, Crops)

Training data: word lists for the leaf categories, e.g. “wheat corn silo grow ...”, “wheat tulips splicing grow ...”, “water grating ditch tractor ...”, “selection mutation Darwin ...”

Testing data: “wheat grow tractor ...”

Page 8:

“Deleted Interpolation” or “Shrinkage”

“Deleted Interpolation” in class hierarchy space:

$\hat{\Pr}(w_i \mid c_j) = \lambda_j \Pr(w_i \mid c_j) + (1 - \lambda_j)\,\Pr(w_i \mid \mathrm{parent}(c_j))$

Learn the λ’s via EM, performing the E-step with leave-one-out cross-validation.

“Deleted Interpolation” in N-gram space:

$\hat{\Pr}(w_{d_i} \mid w_{d_{i-1}}, w_{d_{i-2}}) = \lambda \Pr(w_{d_i} \mid w_{d_{i-1}}, w_{d_{i-2}}) + (1 - \lambda)\,\Pr(w_{d_i} \mid w_{d_{i-1}})$

[Jelinek and Mercer, 1980], [James and Stein, 1961]
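The two-term interpolation above extends naturally to a weighted sum over all ancestors of a leaf class. Below is a minimal Python sketch of that shrinkage-smoothed estimate, assuming per-node estimates and already-learned mixture weights λ are available; the function name, variable names, and toy numbers are hypothetical.

```python
def shrinkage_estimate(word, path_estimates, lambdas):
    """Shrinkage / deleted interpolation along a path in the class hierarchy.

    path_estimates: list of dicts mapping word -> Pr(word | node), ordered
                    leaf first, then parent, grandparent, ..., root.
    lambdas:        mixture weights for the same nodes, summing to 1.
    Returns the interpolated estimate
        Pr_hat(w | leaf) = sum_k lambda_k * Pr(w | ancestor_k).
    """
    return sum(
        lam * est.get(word, 0.0) for lam, est in zip(lambdas, path_estimates)
    )

# Example: leaf "Crops", parent "Agriculture", root "Science" (toy numbers).
path = [
    {"wheat": 0.05, "tractor": 0.01},    # Pr(w | Crops)
    {"wheat": 0.02, "tractor": 0.02},    # Pr(w | Agriculture)
    {"wheat": 0.001, "tractor": 0.001},  # Pr(w | Science)
]
lambdas = [0.6, 0.3, 0.1]
print(shrinkage_estimate("wheat", path, lambdas))  # 0.0361
```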

Page 9:

Experimental Results

• Industry Sector Dataset: 71 classes, 6.5k documents, 1.2 million words, 30k vocabulary
• 20 Newsgroups Dataset: 15 classes, 15k documents, 1.7 million words, 52k vocabulary
• Yahoo Science Dataset: 95 classes, 13k documents, 0.6 million words, 44k vocabulary

Page 10:

Learning to Classify Text from Labeled and Unlabeled Documents

Kamal Nigam

Andrew McCallum

Sebastian Thrun

Tom Mitchell

Page 11:

The Scenario

Training data with class labels: Web pages the user says are interesting; Web pages the user says are uninteresting.

Data available at training time, but without class labels: Web pages the user hasn’t seen or said anything about.

Can we use the unlabeled documents to increase accuracy?

Page 12:

Using the Unlabeled Data

1. Build a classification model using limited labeled data.
2. Use the model to guess the labels of the unlabeled documents.
3. Use all documents to build a new classification model, which is more accurate because it is trained using more data.

Page 13:

Expectation Maximization [Dempster, Laird, Rubin 1977]

Applies when there are two inter-dependent unknowns:
(1) the word probabilities for each class, and
(2) the class labels of the unlabeled documents.

• E-step: Use the current “guess” of (1) to estimate the value of (2): use the classification model built from the limited training data to assign probabilistic labels to the unlabeled documents.
• M-step: Use the probabilistic estimates of (2) to update (1): use the probabilistic class labels on the unlabeled documents to build a more accurate classification model.
• Repeat E- and M-steps until convergence (a sketch of this loop follows below).
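A high-level Python sketch of this loop, under the assumption that two hypothetical helpers exist: `train_naive_bayes(docs, soft_labels)`, which fits a model from documents with (possibly fractional) class weights, and `predict_proba(model, doc)`, which returns Pr(c|d) for every class.

```python
def em_with_unlabeled(labeled_docs, labels, unlabeled_docs, classes,
                      train_naive_bayes, predict_proba, n_iters=10):
    """Semi-supervised EM loop: alternate between probabilistically labeling
    the unlabeled documents (E-step) and refitting the classifier (M-step)."""
    def hard(y):
        # Represent a known label as a degenerate probability distribution.
        return {c: (1.0 if c == y else 0.0) for c in classes}

    labeled_weights = [hard(y) for y in labels]

    # Initial model built from the labeled documents only.
    model = train_naive_bayes(labeled_docs, labeled_weights)

    for _ in range(n_iters):  # or: loop until the soft labels stop changing
        # E-step: the current model assigns probabilistic labels to unlabeled docs.
        soft_labels = [predict_proba(model, d) for d in unlabeled_docs]
        # M-step: refit on all documents -- hard labels for the labeled data,
        # probabilistic labels for the unlabeled data.
        model = train_naive_bayes(labeled_docs + unlabeled_docs,
                                  labeled_weights + soft_labels)
    return model
```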

Page 14:

Why it Works -- An Example

[Figure: labeled training documents for two classes, Baseball and Ice Skating, plus unlabeled documents.]

Baseball (labeled): “The new hitter struck out...”, “Struck out in last inning...”, “Homerun in the first inning...”, “Pete Rose is not as good an athlete as Tara Lipinski...”

Ice Skating (labeled): “Fell on the ice...”, “Perfect triple jump...”, “Katarina Witt’s gold medal performance...”, “New ice skates...”, “Practice at the ice rink every day...”

Unlabeled data: “Tara Lipinski new ice skates didn’t hurt her performance. She graced the ice with a series of perfect jumps and won the gold medal...”, “Tara Lipinski bought a new house for her parents.”

The figure also shows per-class estimates of Pr(Lipinski): 0.01, 0.001, 0.02, 0.003.

Page 15:

EM for Text Classification

Expectation-step (guess the class labels):

$\Pr(c_j \mid d) = \frac{\Pr(c_j)\,\prod_{i=1}^{|d|} \Pr(w_{d_i} \mid c_j)}{\Pr(d)}$

Maximization-step (set parameters using the guesses):

$\Pr(w_t \mid c_j) = \frac{1 + \sum_{k} \mathrm{TF}(w_t, d_k)\,\Pr(c_j \mid d_k)}{|V| + \sum_{t'=1}^{|V|} \sum_{k} \mathrm{TF}(w_{t'}, d_k)\,\Pr(c_j \mid d_k)}$
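A minimal Python sketch of these two steps, assuming documents are lists of in-vocabulary tokens and the variable names are hypothetical; class priors, which the formulas on this slide do not show, would be re-estimated analogously in the M-step.

```python
import math
from collections import Counter

def e_step(doc, priors, word_probs):
    """E-step: Pr(c_j | d) proportional to Pr(c_j) * prod_i Pr(w_{d_i} | c_j),
    normalized so the probabilities sum to one over the classes."""
    log_scores = {
        c: math.log(priors[c]) + sum(math.log(word_probs[c][w]) for w in doc)
        for c in priors
    }
    top = max(log_scores.values())                      # guard against underflow
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

def m_step(docs, soft_labels, vocabulary, classes):
    """M-step: Laplace-smoothed, probabilistically weighted counts:
    Pr(w_t | c_j) = (1 + sum_k TF(w_t, d_k) Pr(c_j | d_k))
                    / (|V| + sum_t' sum_k TF(w_t', d_k) Pr(c_j | d_k))"""
    word_probs = {}
    for c in classes:
        weighted = Counter()
        for doc, probs in zip(docs, soft_labels):
            for w, n in Counter(w for w in doc if w in vocabulary).items():
                weighted[w] += n * probs[c]
        total = sum(weighted.values())
        word_probs[c] = {
            w: (1 + weighted[w]) / (len(vocabulary) + total) for w in vocabulary
        }
    return word_probs
```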

Page 16:

Experimental Results -- The Data

• Four classes of Web pages: Student, Faculty, Course, Project; 4,199 Web pages total
• Twenty newsgroups from UseNet: several groups each of religion, politics, sports, and comp.*; 1,000 articles per class
• News articles from Reuters: 90 different categories; 12,902 articles total

Page 17:

Word Vector Evolution with EM

Iteration 0: intelligence, DD, artificial, understanding, DDw, dist, identical, rus, arrange, games, dartmouth, natural, cognitive, logic, proving, prolog

Iteration 1: DD, D, lecture, cc, D*, DD:DD, handout, due, problem, set, tay, DDam, yurtas, homework, kfoury, sec

Iteration 2: D, DD, lecture, cc, DD:DD, due, D*, homework, assignment, handout, set, hw, exam, problem, DDam, postscript

(D is a digit)

Page 18:

Related Work

• Using EM to reduce the need for training examples: [Miller and Uyar 1997], [Shahshahani and Landgrebe 1994]
• AutoClass, unsupervised EM with Naïve Bayes: [Cheeseman 1988]
• Using EM to fill in missing values: [Ghahramani and Jordan 1995]