Document Categorization


Page 1: Document Categorization

Document Categorization

• Problem: given
– a collection of documents, and
– a taxonomy of subject areas

• Classification: Determine the subject area(s) most pertinent to each document

• Indexing: Select a set of keywords / index terms appropriate to each document

Page 2: Document Categorization

Classification Techniques

• Manual (a.k.a. Knowledge Engineering)
– typically, rule-based expert systems

• Machine Learning
– Probabilistic (e.g., Naïve Bayesian)
– Decision Structures (e.g., Decision Trees)
– Profile-Based
  • compare document to profile(s) of subject classes
  • similarity rules similar to those employed in I.R.
– Support Vector Machines (SVM)
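As an illustration of the profile-based approach listed above, here is a minimal sketch in Python. It assumes whitespace tokenization, raw term counts, and cosine similarity as the I.R.-style similarity rule; the function names and the toy training documents are hypothetical and do not come from the case study.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector for one document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_profiles(labelled_docs):
    """Sum the term vectors of each class's documents into one profile."""
    profiles = {}
    for text, label in labelled_docs:
        profiles.setdefault(label, Counter()).update(term_vector(text))
    return profiles


def classify(text, profiles):
    """Return the class whose profile is most similar to the document."""
    vec = term_vector(text)
    return max(profiles, key=lambda label: cosine(vec, profiles[label]))


# Toy training data (invented for illustration only).
training = [
    ("rotor blade lift helicopter", "Aviation Technology"),
    ("wing drag lift aircraft", "Aviation Technology"),
    ("crop soil irrigation yield", "Agriculture"),
    ("soil wheat harvest crop", "Agriculture"),
]
profiles = build_profiles(training)
label = classify("helicopter rotor drag", profiles)
```

A real system would likely weight terms (for example with TF-IDF) rather than use raw counts; that refinement is omitted here to keep the sketch short.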

Page 3: Document Categorization

Machine Learning Procedures

• Usually train-and-test
– Exploit an existing collection in which documents have already been classified
  • a portion used as the training set
  • another portion used as a test set
– permits measurement of classifier effectiveness
– allows tuning of classifier parameters to yield maximum effectiveness

• Single- vs. multi-label
– can one document be assigned to multiple categories?
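The train-and-test procedure can be sketched as follows. This is an assumption-laden illustration, not the procedure used in the case study: the 70/30 split, the fixed seed, and the document identifiers are all hypothetical. Labels are stored as sets so that the same structure covers both the single-label and the multi-label case.

```python
import random


def train_test_split(labelled_docs, train_fraction=0.7, seed=0):
    """Shuffle already-classified documents and cut them into two portions."""
    shuffled = list(labelled_docs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Each document carries a set of labels: one element for single-label
# documents, several for multi-label documents (toy data).
docs = [
    (f"doc{i}",
     {"Aviation Technology"} if i % 2
     else {"Agriculture", "Earth Sciences and Oceanography"})
    for i in range(10)
]
train, test = train_test_split(docs)
```

The test portion is held back during training so that measured effectiveness reflects documents the classifier has not seen.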

Page 4: Document Categorization

Automatic Indexing

• Assign to each document up to k terms drawn from a controlled vocabulary

• Typically reduced to a multi-label classification problem
– each keyword corresponds to a class of documents for which that keyword is an appropriate descriptor
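The reduction of indexing to multi-label classification might look like the sketch below: one binary scorer per controlled-vocabulary keyword, with a document receiving at most k of the keywords whose scorer accepts it. The scorers here are hypothetical stand-ins (trigger-word counts minus a threshold), not trained classifiers, and the keyword names are only examples.

```python
def make_scorer(trigger_words):
    """Hypothetical per-keyword classifier: positive score means 'accept'."""
    def score(doc):
        words = doc.lower().split()
        return sum(words.count(w) for w in trigger_words) - 0.5
    return score


def assign_index_terms(doc, keyword_classifiers, k=3):
    """Return up to k keywords whose classifier accepts the document,
    strongest score first."""
    scored = [(clf(doc), kw) for kw, clf in keyword_classifiers.items()]
    accepted = sorted((s, kw) for s, kw in scored if s > 0)
    accepted.reverse()
    return [kw for _, kw in accepted[:k]]


classifiers = {
    "helicopters": make_scorer({"rotor", "helicopter"}),
    "aerodynamics": make_scorer({"lift", "drag"}),
    "bombers": make_scorer({"bomber", "payload"}),
}
terms = assign_index_terms("helicopter rotor rotor lift test", classifiers, k=2)
```

Ranking by score before truncating to k is one possible design choice; it presumes the per-keyword scores are comparable with one another.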

Page 5: Document Categorization

Case Study: SVM categorization

• Document Collection from DTIC
– 10,000 documents
  • previously classified manually
– Taxonomy of
  • 25 broad subject fields, divided into a total of
  • 251 narrower groups
– Documents average 2705 ± 1464 words and 623 ± 274 significant unique terms
– Collection has 32,457 significant unique terms

Page 6: Document Categorization

Document Size Distribution

[Figure: histogram of the document collection. x-axis: words per document, in bins of 1000 from 0-1000 to 8001 and above; y-axis: number of documents, scale 0 to 80.]

Page 7: Document Categorization

Unique Term Distribution

[Figure: histogram. x-axis: number of unique significant terms per document, in bins of 300 from 0-300 to 1501 and above; y-axis: number of documents, scale 0 to 120.]

Page 8: Document Categorization

Sample: Broad Subject Fields

01--Aviation Technology
02--Agriculture
03--Astronomy and Astrophysics
04--Atmospheric Sciences
05--Behavioral and Social Sciences
06--Biological and Medical Sciences
07--Chemistry
08--Earth Sciences and Oceanography

Page 9: Document Categorization

Sample: Narrow Subject Groups

Aviation Technology

01 Aerodynamics

02 Military Aircraft Operations

03 Aircraft

0301 Helicopters

0302 Bombers

0303 Attack and Fighter Aircraft

0304 Patrol and Reconnaissance Aircraft

Page 10: Document Categorization

Distribution among Categories

[Figure: Broad Category Distribution, a histogram. x-axis: number of documents in category, in bins of 50 from 1-50 to >500; y-axis: number of categories, scale 0 to 6.]

Page 11: Document Categorization

Detailed Category Distribution

[Figure: histogram. x-axis: number of documents per category, in bins of 100 from 0-100 to >1000; y-axis: number of categories, scale 0 to 250.]

Page 12: Document Categorization

Baseline

• Establish baseline for conventional techniques
– classification
– training an SVM for each subject area

• “off-the-shelf” document modelling and SVM libraries

Page 13: Document Categorization

Why SVM?

• Prior studies have suggested good results with SVM

• relatively immune to “overfitting” – fitting to coincidental relations encountered during training

• low dimensionality of model parameters

Page 14: Document Categorization

Machine Learning: Support Vector Machines

• Binary Classifier
– Finds the hyperplane with the largest margin separating the two classes of training samples
– Subsequently classifies items based on which side of the hyperplane they fall

[Figure: diagram of a separating hyperplane and its margin; axis labels: Font size, Line number.]
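The idea of a margin and of classifying by side of the hyperplane can be made concrete with a small sketch. It does not train an SVM: both hyperplanes below are hand-picked for four invented 2-D samples, and the code only measures how far the closest training sample lies from each. An SVM would prefer the candidate with the larger margin.

```python
import math


def decision_value(w, b, x):
    """Signed value of w.x + b; its sign says which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Assign +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def geometric_margin(w, b, samples):
    """Distance from the hyperplane to the closest training sample
    (negative if any sample is on the wrong side)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision_value(w, b, x) / norm for x, y in samples)


# Toy, linearly separable samples: (features, label).
samples = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((0.0, 0.0), -1), ((1.0, 0.0), -1)]

# Two hand-picked separating hyperplanes (assumptions, not trained models).
margin_a = geometric_margin((1.0, 1.0), -2.5, samples)
margin_b = geometric_margin((1.0, 2.0), -3.5, samples)
```

Both candidates separate the samples, but the second leaves more room around the hyperplane, which is the property the SVM optimizes.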

Page 15: Document Categorization

SVM Evaluation

[Figure: flow diagram of the SVM evaluation process. Boxes: Collection, Term Extraction, Collection Model, Training Sample, Document Conversion, SVM Training, Test Sample, Document Conversion, SVM, Accepted in Class.]

Page 16: Document Categorization

Baseline SVM Evaluation

– Training & Testing process repeated for multiple subject categories

– Determine accuracy
  • overall
  • positive (ability to recognize new documents that belong in the class the SVM was trained for)
  • negative (ability to reject new documents that belong to other classes)

– Explore Training Issues
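The three accuracy figures named above can be computed as in the following sketch. The function name and the example predictions are hypothetical; the code assumes the test set contains at least one positive and one negative document.

```python
def accuracy_report(predictions, truths):
    """Overall, positive and negative accuracy for one binary classifier.

    predictions[i] is True if the classifier accepted document i;
    truths[i] is True if document i really belongs to the class.
    """
    pos = [p for p, t in zip(predictions, truths) if t]
    neg = [p for p, t in zip(predictions, truths) if not t]
    correct_pos = sum(1 for p in pos if p)      # class members accepted
    correct_neg = sum(1 for p in neg if not p)  # non-members rejected
    return {
        "overall": (correct_pos + correct_neg) / len(truths),
        "positive": correct_pos / len(pos),
        "negative": correct_neg / len(neg),
    }


# Toy test set: 4 documents in the class, 4 outside it.
truths = [True, True, True, True, False, False, False, False]
predictions = [True, True, True, False, False, False, True, True]
report = accuracy_report(predictions, truths)
```

Reporting positive and negative accuracy separately matters because a good overall figure can hide a classifier that rejects most true members of its class.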

Page 17: Document Categorization

SVM “Out of the Box”

• 16 broad categories with 150 or more documents

• Lucene library for model preparation

• LibSVM for SVM training & testing
– no normalization or parameter tuning

• Training set of 100/100 (positive/negative samples)

• Test set of 50/50
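A possible way to draw the 100/100 training set and 50/50 test set for one category is sketched below. The sampling strategy (uniform random, without replacement, training and test documents disjoint) is an assumption, as are the function name and the placeholder document identifiers.

```python
import random


def sample_split(positives, negatives, n_train=100, n_test=50, seed=0):
    """Draw disjoint, balanced training and test sets for one category."""
    rng = random.Random(seed)
    pos = rng.sample(positives, n_train + n_test)
    neg = rng.sample(negatives, n_train + n_test)
    train = [(d, 1) for d in pos[:n_train]] + [(d, -1) for d in neg[:n_train]]
    test = [(d, 1) for d in pos[n_train:]] + [(d, -1) for d in neg[n_train:]]
    return train, test


# Placeholder identifiers: a category with 150 members in a 10,000-document collection.
positives = [f"pos-{i}" for i in range(150)]
negatives = [f"neg-{i}" for i in range(9850)]
train, test = sample_split(positives, negatives)
```

With 100 training and 50 test positives, a category needs at least 150 documents, which matches the threshold used to select the 16 broad categories.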

Page 18: Document Categorization

SVM Accuracy "out of the box"

[Figure: chart of SVM accuracy for each of the 16 categories, with series Average and Std. Dev; y-axis scale 0 to 90.]

Page 19: Document Categorization

“OOtB” Interpretation

• Reasonable performance on broad categories given modest training set size.

• A related experiment showed that, with normalization and optimized parameter selection, accuracy could be improved by as much as an additional 10%

Page 20: Document Categorization

Training Set Size

Page 21: Document Categorization

Training Set Size

• accuracy plateaus for training set sizes well under the number of terms in the document model

Page 22: Document Categorization

Training Issues

• Training Set Size
– Concern: detailed subject groups may have too few known examples to perform effective SVM training in that subject
– Possible Solution: collection may have few positive examples, but has many, many negative examples

• Positive/Negative Training Mixes
– effects on accuracy

Page 23: Document Categorization

Increased Negative Training

Page 24: Document Categorization

Training Set Composition

• experiment performed with 50 positive training examples
– OotB SVM training

• increasing the number of negative training examples has little effect on overall accuracy

• but positive accuracy is reduced

Page 25: Document Categorization

Interpretation

• may indicate a weakness in SVM
– or simply further evidence of the importance of optimizing SVM parameters

• may indicate unsuitability of treating SVM output as a simple boolean decision
– might do better as “best fit” in a multi-label classifier