
Transcript of “Two Methods for Improving Text Classification when Training Data is Sparse”, Andrew McCallum

Page 1:

Two Methods for Improving Text Classification

when Training Data is Sparse

Andrew McCallum (mccallum@cs.cmu.edu)

Just Research (formerly JPRC)

Carnegie Mellon University

For more detail see http://www.cs.cmu.edu/~mccallum

Improving Text Classification by Shrinkage in a Hierarchy of Classes (Sub. to ICML-98)

McCallum, Rosenfeld, Mitchell, Ng

Learning to Classify Text from Labeled and Unlabeled Documents (AAAI-98)

Nigam, McCallum, Thrun, Mitchell

Page 2:

The Task: Document Classification (AKA “Document Categorization”, “Routing”, or “Tagging”)

Automatically placing documents in their correct categories.

[Figure: the classification task.]

Categories: Magnetism, Relativity, Evolution, Botany, Irrigation, Crops

Training data: word lists for each category, e.g. “wheat corn silo grow ...”, “wheat tulips splicing grow ...”, “water grating ditch tractor ...”, “selection mutation Darwin ...”

Testing data: “wheat grow tractor ...”

Page 3:

A Probabilistic Approach to Document Classification

Pick the most probable class, given the evidence:

$\hat{c} = \arg\max_{c_j} \Pr(c_j \mid d)$

where $c_j$ is a class (like “Crops”) and $d$ is a document (like “wheat grow tractor...”).

Bayes Rule:

$\Pr(c_j \mid d) = \frac{\Pr(c_j)\,\Pr(d \mid c_j)}{\Pr(d)}$

“Naïve Bayes” (independence assumption):

$\Pr(c_j \mid d) = \frac{\Pr(c_j)\,\prod_{i=1}^{|d|} \Pr(w_{d_i} \mid c_j)}{\Pr(d)}$

where $w_{d_i}$ is the $i$-th word in $d$ (like “grow”).
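To make the rule concrete, here is a minimal Python sketch of Naïve Bayes classification as stated above, assuming the parameters Pr(c) and Pr(w|c) have already been estimated; the names `priors` and `word_probs` and the toy numbers are illustrative, not from the talk.

```python
import math

def classify(document_words, priors, word_probs):
    """Return the class c maximizing Pr(c) * prod_i Pr(w_i | c).

    Works in log space to avoid floating-point underflow on long documents.
    priors:     dict mapping class -> Pr(c)
    word_probs: dict mapping class -> dict mapping word -> Pr(w | c)
    """
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for w in document_words:
            # Skip words missing from the model; with Laplace smoothing
            # (later slides) every vocabulary word has nonzero probability.
            if w in word_probs[c]:
                score += math.log(word_probs[c][w])
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Example with made-up numbers:
priors = {"Crops": 0.5, "Irrigation": 0.5}
word_probs = {
    "Crops":      {"wheat": 0.05, "grow": 0.04, "tractor": 0.01},
    "Irrigation": {"wheat": 0.01, "grow": 0.01, "tractor": 0.05},
}
print(classify(["wheat", "grow", "tractor"], priors, word_probs))  # -> "Crops"
```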

Page 4:

Comparison with TFIDF

TFIDF/Rocchio:

$\mathrm{Score}(d, c) = \sum_{i=1}^{|V|} \mathrm{TFIDF}(w_i, d)\,\mathrm{TFIDF}(w_i, c) \,/\, Z$

Naïve Bayes:

$\mathrm{Score}(d, c) = \Big[\log \Pr(c) + \sum_{i=1}^{|V|} \mathrm{TF}(w_i, d)\,\log \Pr(w_i \mid c)\Big] \,/\, Z$

where Z is a normalization constant.
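Both scores are linear in per-word statistics of the document, which the following minimal sketch tries to make explicit; the function names and the fixed toy vocabulary are hypothetical, and Z is omitted in both cases.

```python
import math

VOCAB = ["wheat", "grow", "tractor", "tulips"]

def tfidf_rocchio_score(tfidf_doc, tfidf_class):
    """Score(d, c) = sum_i TFIDF(w_i, d) * TFIDF(w_i, c)   (Z omitted)."""
    return sum(tfidf_doc.get(w, 0.0) * tfidf_class.get(w, 0.0) for w in VOCAB)

def naive_bayes_score(tf_doc, prior, word_probs):
    """Score(d, c) = log Pr(c) + sum_i TF(w_i, d) * log Pr(w_i | c)   (Z omitted).

    Assumes every vocabulary word has a nonzero (smoothed) probability.
    """
    return math.log(prior) + sum(
        tf_doc.get(w, 0) * math.log(word_probs[w]) for w in VOCAB
    )
```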

Page 5:

Parameter Estimation in Naïve Bayes

Naïve Bayes:

$\Pr(c_j \mid d) = \frac{\Pr(c_j)\,\prod_{i=1}^{|d|} \Pr(w_{d_i} \mid c_j)}{\Pr(d)}$

Bayes-optimal estimate of $\Pr(w \mid c)$ (via Laplace smoothing):

$\Pr(w_t \mid c_j) = \frac{1 + \sum_{d_k \in c_j} \mathrm{TF}(w_t, d_k)}{|V| + \sum_{t'=1}^{|V|} \sum_{d_k \in c_j} \mathrm{TF}(w_{t'}, d_k)}$
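A minimal Python sketch of this smoothed estimator, assuming documents are given as lists of tokens with class labels; the function and variable names are hypothetical.

```python
from collections import Counter

def estimate_word_probs(docs, labels, vocabulary):
    """Laplace-smoothed estimate:
    Pr(w_t | c_j) = (1 + sum_{d_k in c_j} TF(w_t, d_k))
                    / (|V| + sum_t' sum_{d_k in c_j} TF(w_t', d_k))
    """
    word_probs = {}
    for c in set(labels):
        counts = Counter()
        for doc, label in zip(docs, labels):
            if label == c:
                counts.update(w for w in doc if w in vocabulary)
        total = sum(counts.values())  # total word occurrences in class c
        word_probs[c] = {
            w: (1 + counts[w]) / (len(vocabulary) + total) for w in vocabulary
        }
    return word_probs
```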

A Key Problem: Getting better estimates of Pr(w|c)

Page 6:

Document Classification in a Hierarchy of Classes

Andrew McCallum

Roni Rosenfeld

Tom Mitchell

Andrew Ng

Page 7:

The Idea: “Deleted Interpolation” or “Shrinkage”

We can improve the parameter estimates in a leaf by averaging them with the estimates in its ancestors. This represents a tradeoff between reliability and specificity.

[Figure: the same task, with the categories arranged in a hierarchy.]

Categories: Science at the root, with children Physics (Magnetism, Relativity), Biology (Evolution, Botany), and Agriculture (Irrigation, Crops)

Training data: word lists for the leaf categories, e.g. “wheat corn silo grow ...”, “wheat tulips splicing grow ...”, “water grating ditch tractor ...”, “selection mutation Darwin ...”

Testing data: “wheat grow tractor ...”

Page 8:

“Deleted Interpolation” or “Shrinkage”

“Deleted Interpolation” in class hierarchy space:

$\hat{\Pr}(w_i \mid c_j) = \lambda_j \Pr(w_i \mid c_j) + (1 - \lambda_j)\,\Pr(w_i \mid \mathrm{parent}(c_j))$

Learn the λ’s via EM, performing the E-step with leave-one-out cross-validation.

“Deleted Interpolation” in N-gram space:

$\hat{\Pr}(w_{d_i} \mid w_{d_{i-1}}, w_{d_{i-2}}) = \lambda \Pr(w_{d_i} \mid w_{d_{i-1}}, w_{d_{i-2}}) + (1 - \lambda)\,\Pr(w_{d_i} \mid w_{d_{i-1}})$

[Jelinek and Mercer, 1980], [James and Stein, 1961]
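The two-term interpolation above extends naturally to a weighted sum over all ancestors of a leaf class. Below is a minimal Python sketch of that shrinkage-smoothed estimate, assuming per-node estimates and already-learned mixture weights λ are available; the function name, variable names, and toy numbers are hypothetical.

```python
def shrinkage_estimate(word, path_estimates, lambdas):
    """Shrinkage / deleted interpolation along a path in the class hierarchy.

    path_estimates: list of dicts mapping word -> Pr(word | node), ordered
                    leaf first, then parent, grandparent, ..., root.
    lambdas:        mixture weights for the same nodes, summing to 1.
    Returns the interpolated estimate
        Pr_hat(w | leaf) = sum_k lambda_k * Pr(w | ancestor_k).
    """
    return sum(
        lam * est.get(word, 0.0) for lam, est in zip(lambdas, path_estimates)
    )

# Example: leaf "Crops", parent "Agriculture", root "Science" (toy numbers).
path = [
    {"wheat": 0.05, "tractor": 0.01},    # Pr(w | Crops)
    {"wheat": 0.02, "tractor": 0.02},    # Pr(w | Agriculture)
    {"wheat": 0.001, "tractor": 0.001},  # Pr(w | Science)
]
lambdas = [0.6, 0.3, 0.1]
print(shrinkage_estimate("wheat", path, lambdas))  # 0.0361
```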

Page 9:

Experimental Results

• Industry Sector Dataset: 71 classes, 6.5k documents, 1.2 million words, 30k vocabulary
• 20 Newsgroups Dataset: 15 classes, 15k documents, 1.7 million words, 52k vocabulary
• Yahoo Science Dataset: 95 classes, 13k documents, 0.6 million words, 44k vocabulary

Page 10:

Learning to Classify Text from Labeled and Unlabeled Documents

Kamal Nigam

Andrew McCallum

Sebastian Thrun

Tom Mitchell

Page 11:

The Scenario

Training data with class labels: Web pages the user says are interesting; Web pages the user says are uninteresting.

Data available at training time, but without class labels: Web pages the user hasn’t seen or said anything about.

Can we use the unlabeled documents to increase accuracy?

Page 12:

Using the Unlabeled Data

1. Build a classification model using limited labeled data.
2. Use the model to guess the labels of the unlabeled documents.
3. Use all documents to build a new classification model, which is more accurate because it is trained using more data.

Page 13:

Expectation Maximization [Dempster, Laird, Rubin 1977]

Applies when there are two inter-dependent unknowns:
(1) the word probabilities for each class, and
(2) the class labels of the unlabeled documents.

• E-step: Use the current “guess” of (1) to estimate the value of (2): use the classification model built from the limited training data to assign probabilistic labels to the unlabeled documents.
• M-step: Use the probabilistic estimates of (2) to update (1): use the probabilistic class labels on the unlabeled documents to build a more accurate classification model.
• Repeat E- and M-steps until convergence (a sketch of this loop follows below).
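A high-level Python sketch of this loop, under the assumption that two hypothetical helpers exist: `train_naive_bayes(docs, soft_labels)`, which fits a model from documents with (possibly fractional) class weights, and `predict_proba(model, doc)`, which returns Pr(c|d) for every class.

```python
def em_with_unlabeled(labeled_docs, labels, unlabeled_docs, classes,
                      train_naive_bayes, predict_proba, n_iters=10):
    """Semi-supervised EM loop: alternate between probabilistically labeling
    the unlabeled documents (E-step) and refitting the classifier (M-step)."""
    def hard(y):
        # Represent a known label as a degenerate probability distribution.
        return {c: (1.0 if c == y else 0.0) for c in classes}

    labeled_weights = [hard(y) for y in labels]

    # Initial model built from the labeled documents only.
    model = train_naive_bayes(labeled_docs, labeled_weights)

    for _ in range(n_iters):  # or: loop until the soft labels stop changing
        # E-step: the current model assigns probabilistic labels to unlabeled docs.
        soft_labels = [predict_proba(model, d) for d in unlabeled_docs]
        # M-step: refit on all documents -- hard labels for the labeled data,
        # probabilistic labels for the unlabeled data.
        model = train_naive_bayes(labeled_docs + unlabeled_docs,
                                  labeled_weights + soft_labels)
    return model
```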

Page 14:

Why it Works -- An Example

[Figure: labeled training documents for two classes, Baseball and Ice Skating, plus unlabeled documents.]

Baseball (labeled): “The new hitter struck out...”, “Struck out in last inning...”, “Homerun in the first inning...”, “Pete Rose is not as good an athlete as Tara Lipinski...”

Ice Skating (labeled): “Fell on the ice...”, “Perfect triple jump...”, “Katarina Witt’s gold medal performance...”, “New ice skates...”, “Practice at the ice rink every day...”

Unlabeled data: “Tara Lipinski new ice skates didn’t hurt her performance. She graced the ice with a series of perfect jumps and won the gold medal...”, “Tara Lipinski bought a new house for her parents.”

The figure also shows per-class estimates of Pr(Lipinski): 0.01, 0.001, 0.02, 0.003.

Page 15:

EM for Text Classification

Expectation-step (guess the class labels):

$\Pr(c_j \mid d) = \frac{\Pr(c_j)\,\prod_{i=1}^{|d|} \Pr(w_{d_i} \mid c_j)}{\Pr(d)}$

Maximization-step (set parameters using the guesses):

$\Pr(w_t \mid c_j) = \frac{1 + \sum_{k} \mathrm{TF}(w_t, d_k)\,\Pr(c_j \mid d_k)}{|V| + \sum_{t'=1}^{|V|} \sum_{k} \mathrm{TF}(w_{t'}, d_k)\,\Pr(c_j \mid d_k)}$
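A minimal Python sketch of these two steps, assuming documents are lists of in-vocabulary tokens and the variable names are hypothetical; class priors, which the formulas on this slide do not show, would be re-estimated analogously in the M-step.

```python
import math
from collections import Counter

def e_step(doc, priors, word_probs):
    """E-step: Pr(c_j | d) proportional to Pr(c_j) * prod_i Pr(w_{d_i} | c_j),
    normalized so the probabilities sum to one over the classes."""
    log_scores = {
        c: math.log(priors[c]) + sum(math.log(word_probs[c][w]) for w in doc)
        for c in priors
    }
    top = max(log_scores.values())                      # guard against underflow
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

def m_step(docs, soft_labels, vocabulary, classes):
    """M-step: Laplace-smoothed, probabilistically weighted counts:
    Pr(w_t | c_j) = (1 + sum_k TF(w_t, d_k) Pr(c_j | d_k))
                    / (|V| + sum_t' sum_k TF(w_t', d_k) Pr(c_j | d_k))"""
    word_probs = {}
    for c in classes:
        weighted = Counter()
        for doc, probs in zip(docs, soft_labels):
            for w, n in Counter(w for w in doc if w in vocabulary).items():
                weighted[w] += n * probs[c]
        total = sum(weighted.values())
        word_probs[c] = {
            w: (1 + weighted[w]) / (len(vocabulary) + total) for w in vocabulary
        }
    return word_probs
```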

Page 16:

Experimental Results -- The Data

• Four classes of Web pages: Student, Faculty, Course, Project; 4,199 Web pages total
• Twenty newsgroups from UseNet: several groups each of religion, politics, sports, and comp.*; 1,000 articles per class
• News articles from Reuters: 90 different categories; 12,902 articles total

Page 17:

Word Vector Evolution with EM

Iteration 0: intelligence, DD, artificial, understanding, DDw, dist, identical, rus, arrange, games, dartmouth, natural, cognitive, logic, proving, prolog

Iteration 1: DD, D, lecture, cc, D*, DD:DD, handout, due, problem, set, tay, DDam, yurtas, homework, kfoury, sec

Iteration 2: D, DD, lecture, cc, DD:DD, due, D*, homework, assignment, handout, set, hw, exam, problem, DDam, postscript

(D is a digit)

Page 18:

Related Work

• Using EM to reduce the need for training examples: [Miller and Uyar 1997], [Shahshahani and Landgrebe 1994]
• AutoClass, unsupervised EM with Naïve Bayes: [Cheeseman 1988]
• Using EM to fill in missing values: [Ghahramani and Jordan 1995]