Document Categorization


• Problem: given
  – a collection of documents, and
  – a taxonomy of subject areas

• Classification: Determine the subject area(s) most pertinent to each document

• Indexing: Select a set of keywords / index terms appropriate to each document

Classification Techniques

• Manual (a.k.a. Knowledge Engineering)
  – typically, rule-based expert systems

• Machine Learning
  – Probabilistic (e.g., Naïve Bayesian)
  – Decision Structures (e.g., Decision Trees)
  – Profile-Based
    • compare document to profile(s) of subject classes
    • similarity rules like those employed in I.R.
  – Support Vector Machines (SVM)
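The profile-based approach can be illustrated with a minimal sketch. It assumes bag-of-words term vectors, class profiles built by summing the vectors of each class's training documents, and cosine similarity as the I.R.-style similarity rule; the function names and the toy training data are hypothetical, not taken from the slides.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term-frequency vector (lowercased whitespace tokens)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_profiles(labeled_docs):
    """Sum the term vectors of each class's documents into one profile."""
    profiles = {}
    for text, label in labeled_docs:
        profiles.setdefault(label, Counter()).update(term_vector(text))
    return profiles

def classify(text, profiles):
    """Return the class whose profile is most similar to the document."""
    vec = term_vector(text)
    return max(profiles, key=lambda label: cosine(vec, profiles[label]))

# Toy training data (illustrative only).
training = [
    ("rotor blade lift helicopter", "aviation"),
    ("wing lift drag aircraft", "aviation"),
    ("crop soil yield wheat", "agriculture"),
    ("soil irrigation crop harvest", "agriculture"),
]
profiles = build_profiles(training)
predicted = classify("aircraft wing drag measurements", profiles)
```

Here the test document shares terms only with the aviation profile, so that class is selected. A production system would likely weight terms (for example with tf-idf) rather than use raw counts.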

Machine Learning Procedures

• Usually train-and-test
  – Exploit an existing collection in which documents have already been classified
    • a portion used as the training set
    • another portion used as a test set
  – permits measurement of classifier effectiveness
  – allows tuning of classifier parameters to yield maximum effectiveness

• Single- vs. multi-label
  – can one document be assigned to multiple categories?
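A minimal sketch of the train-and-test procedure, assuming a single-label collection. The "classifier" here is a deliberately trivial stand-in (a word-count threshold learned from the training portion); `split_collection`, `train_threshold`, and `effectiveness` are hypothetical names used only for illustration.

```python
import random

def split_collection(labeled_docs, train_fraction=0.7, seed=0):
    """Shuffle a labeled collection and split it into training and test portions."""
    docs = list(labeled_docs)
    random.Random(seed).shuffle(docs)
    cut = int(len(docs) * train_fraction)
    return docs[:cut], docs[cut:]

def train_threshold(train_set):
    """'Train' a toy classifier: midpoint between the two classes' word counts."""
    longest_short = max(n for n, label in train_set if label == "short")
    shortest_long = min(n for n, label in train_set if label == "long")
    return (longest_short + shortest_long) / 2

def effectiveness(classify, test_set):
    """Fraction of test documents whose predicted label matches the manual label."""
    hits = sum(1 for doc, label in test_set if classify(doc) == label)
    return hits / len(test_set)

# Toy collection: each "document" is just its word count, labeled by hand.
collection = [(n, "long" if n >= 1000 else "short") for n in range(100, 2100, 100)]
train_set, test_set = split_collection(collection)

threshold = train_threshold(train_set)
score = effectiveness(lambda n: "long" if n >= threshold else "short", test_set)
```

The test portion is never seen during training, which is what makes the measured effectiveness meaningful. When parameters are tuned against measured effectiveness, it is common to hold out a further validation portion so the final test score stays unbiased.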

Automatic Indexing

• Assign to each document up to k terms drawn from a controlled vocabulary

• Typically reduced to a multi-label classification problem
  – each keyword corresponds to a class of documents for which that keyword is an appropriate descriptor
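The reduction to multi-label classification can be sketched as one binary scorer per controlled-vocabulary keyword, keeping at most k accepted keywords. The scorers below are hypothetical stand-ins that count cue-term hits; in practice each would be a trained classifier.

```python
def index_document(doc_terms, keyword_classifiers, k=3):
    """Assign up to k controlled-vocabulary keywords to one document.

    keyword_classifiers maps each keyword to a scoring function; a positive
    score means the keyword is judged an appropriate descriptor.
    """
    scores = {kw: score(doc_terms) for kw, score in keyword_classifiers.items()}
    accepted = [kw for kw, s in scores.items() if s > 0]
    accepted.sort(key=lambda kw: scores[kw], reverse=True)
    return accepted[:k]

def overlap_scorer(cue_terms):
    """Hypothetical stand-in for a trained classifier: counts cue-term hits."""
    cue_terms = set(cue_terms)
    return lambda doc_terms: len(cue_terms & set(doc_terms))

classifiers = {
    "Helicopters": overlap_scorer({"rotor", "helicopter", "hover"}),
    "Bombers": overlap_scorer({"bomber", "payload"}),
    "Aerodynamics": overlap_scorer({"lift", "drag", "airflow"}),
}
keywords = index_document(["rotor", "hover", "lift", "test"], classifiers, k=2)
```

With these toy scorers the document is indexed under "Helicopters" and "Aerodynamics"; "Bombers" is rejected because its scorer finds no matching terms.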

Case Study: SVM categorization

• Document Collection from DTIC
  – 10,000 documents
    • previously classified manually
  – Taxonomy of
    • 25 broad subject fields, divided into a total of
    • 251 narrower groups

  – Document lengths average 2705 ± 1464 words, with 623 ± 274 significant unique terms
  – Collection has 32,457 significant unique terms

Document Size Distribution

[Figure: histogram. x-axis: words per document, in bins of 1000 from 0-1000 through 8001 and up; y-axis: documents (scale 0 to 80).]

Document Collection

Unique Term Distribution

[Figure: histogram. x-axis: # unique significant terms per document, in bins of 300 from 0-300 through 1501 and up; y-axis: # documents (scale 0 to 120).]

Sample: Broad Subject Fields

01--Aviation Technology
02--Agriculture
03--Astronomy and Astrophysics
04--Atmospheric Sciences
05--Behavioral and Social Sciences
06--Biological and Medical Sciences
07--Chemistry
08--Earth Sciences and Oceanography

Sample: Narrow Subject Groups

Aviation Technology
  01 Aerodynamics
  02 Military Aircraft Operations
  03 Aircraft
    0301 Helicopters
    0302 Bombers
    0303 Attack and Fighter Aircraft
    0304 Patrol and Reconnaissance Aircraft

Distribution among Categories

Broad Category Distribution

[Figure: histogram. x-axis: # documents in category, in bins of 50 from 1-50 through >500; y-axis: # categories (scale 0 to 6).]

Detailed Category Distribution

[Figure: histogram. x-axis: # documents per category, in bins of 100 from 0-100 through >1000; y-axis: # categories (scale 0 to 250).]

Baseline

• Establish baseline for conventional techniques
  – classification
  – training SVM for each subject area

• “off-the-shelf” document modelling and SVM libraries

Why SVM?

• Prior studies have suggested good results with SVM

• relatively immune to “overfitting” (fitting to coincidental relations encountered during training)

• low dimensionality of model parameters

Machine Learning: Support Vector Machines

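• Binary Classifier
  – Finds the hyperplane with the largest margin separating the two classes of training samples
  – Subsequently classifies items based on which side of the hyperplane they fall

[Figure: two classes of training samples separated by a hyperplane, with the margin marked; remaining labels: Font size, Line number]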
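Once a hyperplane has been found, classification reduces to checking the sign of a linear score. The sketch below assumes a linear SVM whose hyperplane is given by a weight vector `w` and offset `b` in canonical form, in which case the margin width is 2 / ||w||; the weights shown are hypothetical placeholders, not the output of any training run.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign says which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Return +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin_width(w):
    """Width of the margin for a hyperplane in canonical form: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical weights, standing in for the result of SVM training.
w, b = [1.0, -1.0], 0.0
side_a = classify(w, b, [3.0, 1.0])   # point on the positive side
side_b = classify(w, b, [1.0, 3.0])   # point on the negative side
width = margin_width(w)
```

Training is the search for the `w` and `b` that maximize this width while keeping the two classes on opposite sides; only the classification step is shown here.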

SVM Evaluation

[Diagram: SVM evaluation pipeline. Box labels: Collection, Term Extraction, Collection Model, Training Sample, Test Sample, SVM, Accepted in Class, Document Conversion (twice), SVM Training]

Baseline SVM Evaluation

– Training & Testing process repeated for multiple subject categories

– Determine accuracy
  • overall
  • positive (ability to recognize new documents that belong in the class the SVM was trained for)
  • negative (ability to reject new documents that belong to other classes)

– Explore Training Issues
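The three accuracy figures can be computed from parallel lists of predictions and true memberships. This is a minimal sketch with made-up data; `accuracy_report` is a hypothetical helper name. Positive accuracy as defined above corresponds to what is often called recall or sensitivity, and negative accuracy to specificity.

```python
def accuracy_report(predictions, truths):
    """Compute overall, positive, and negative accuracy.

    predictions and truths are parallel lists of booleans: True means
    "in the class the SVM was trained for".
    """
    pairs = list(zip(predictions, truths))
    on_positives = [p for p, t in pairs if t]       # predictions for true members
    on_negatives = [p for p, t in pairs if not t]   # predictions for non-members
    return {
        "overall": sum(p == t for p, t in pairs) / len(pairs),
        "positive": sum(on_positives) / len(on_positives),
        "negative": sum(not p for p in on_negatives) / len(on_negatives),
    }

# Made-up results for six test documents (three in the class, three not).
report = accuracy_report(
    predictions=[True, True, False, True, False, False],
    truths=[True, True, True, False, False, False],
)
```

Reporting the positive and negative figures separately matters because a classifier can score well overall while rarely accepting documents that do belong to the class.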

SVM “Out of the Box”

• 16 broad categories with 150 or more documents

• Lucene library for model preparation

• LibSVM for SVM training & testing
  – no normalization or parameter tuning

• Training set of 100/100 (positive/negative samples)

• Test set of 50/50
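A sketch of how balanced, disjoint training and test samples of this shape might be drawn for one category. The slides do not say how sampling was done, so the uniform random sampling, the `sample_sets` name, and the synthetic labels are all assumptions. Note that 100 training plus 50 test positives requires 150 documents in the category, which would be consistent with the 150-document cutoff above.

```python
import random

def sample_sets(doc_labels, category, n_train=100, n_test=50, seed=0):
    """Draw disjoint balanced training and test samples for one category.

    doc_labels maps a document id to the set of categories assigned to it.
    Returns (train, test), each a list of (doc_id, is_positive) pairs.
    """
    rng = random.Random(seed)
    positives = sorted(d for d, cats in doc_labels.items() if category in cats)
    negatives = sorted(d for d, cats in doc_labels.items() if category not in cats)
    need = n_train + n_test
    if len(positives) < need or len(negatives) < need:
        raise ValueError("not enough documents for the requested sample sizes")
    pos = rng.sample(positives, need)
    neg = rng.sample(negatives, need)
    train = [(d, True) for d in pos[:n_train]] + [(d, False) for d in neg[:n_train]]
    test = [(d, True) for d in pos[n_train:]] + [(d, False) for d in neg[n_train:]]
    return train, test

# Synthetic labels: 150 documents in field "01", 300 in field "07".
labels = {f"doc{i}": {"01"} for i in range(150)}
labels.update({f"doc{i}": {"07"} for i in range(150, 450)})
train, test = sample_sets(labels, "01")
```

Repeating this with different seeds gives the repeated runs needed to report an average and a standard deviation per category.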

SVM Accuracy "out of the box"

[Figure: chart of accuracy (scale 0 to 90) for each of the 16 categories (x-axis: Categories, 1 to 16); series: Average, Std. Dev]

“OOtB” Interpretation

• Reasonable performance on broad categories given modest training set size.

• A related experiment showed that, with normalization and optimized parameter selection, accuracy could be improved by as much as an additional 10%.
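The slides do not say which normalization or which parameters were used in the related experiment. As one plausible illustration, the sketch below scales term vectors to unit length and runs an exhaustive grid search; the parameter names `C` and `gamma` are those of an RBF-kernel SVM and are an assumption here, and `fake_accuracy` is a placeholder objective so the example runs without an SVM library.

```python
import itertools
import math

def l2_normalize(vector):
    """Scale a sparse term-weight vector (dict) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vector.values()))
    return {t: w / norm for t, w in vector.items()} if norm else dict(vector)

def grid_search(evaluate, grid):
    """Return (best_params, best_score) over every combination in grid.

    evaluate is a caller-supplied function that trains and tests a
    classifier with the given parameters and returns its accuracy.
    """
    names = sorted(grid)
    best = None
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if best is None or score > best[1]:
            best = (params, score)
    return best

def fake_accuracy(C, gamma):
    """Placeholder objective; a real one would train and test an SVM."""
    return 1.0 - abs(math.log10(C) - 1) * 0.1 - abs(math.log10(gamma) + 2) * 0.1

unit = l2_normalize({"rotor": 3.0, "lift": 4.0})
best_params, best_score = grid_search(
    fake_accuracy, {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]}
)
```

Normalizing to unit length keeps long documents from dominating purely because of their size, which is one reason normalization tends to help with document lengths as varied as those in this collection.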

Training Set Size


• accuracy plateaus for training set sizes well under the number of terms in the document model

Training Issues

• Training Set Size
  – Concern: detailed subject groups may have too few known examples to perform effective SVM training in that subject
  – Possible Solution: collection may have few positive examples, but has many, many negative examples
• Positive/Negative Training Mixes
  – effects on accuracy

Increased Negative Training

Training Set Composition

• experiment performed with 50 positive training examples
  – OotB SVM training

• increasing the number of negative training examples has little effect on overall accuracy

• but positive accuracy is reduced

Interpretation

• may indicate a weakness in SVM
  – or simply further evidence of the importance of optimizing SVM parameters
• may indicate unsuitability of treating SVM output as a simple boolean decision
  – might do better as “best fit” in a multi-label classifier
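The difference between the boolean reading and a "best fit" reading can be sketched as follows, assuming each per-category SVM exposes a real-valued decision value (signed distance from its hyperplane). The function names and the decision values are hypothetical.

```python
def boolean_labels(decision_values):
    """Categories whose SVM decision value is positive (simple yes/no reading)."""
    return [c for c, v in decision_values.items() if v > 0]

def best_fit_labels(decision_values, max_labels=1):
    """The max_labels categories with the highest decision values, whatever their sign."""
    ranked = sorted(decision_values, key=decision_values.get, reverse=True)
    return ranked[:max_labels]

# Hypothetical decision values for one document from three per-category SVMs.
values = {
    "01 Aviation Technology": -0.2,
    "07 Chemistry": -1.3,
    "08 Earth Sciences and Oceanography": -0.9,
}
as_boolean = boolean_labels(values)
as_best_fit = best_fit_labels(values)
```

In this example every SVM rejects the document, so the boolean reading assigns no category at all, while the best-fit reading still selects the category that came closest to accepting it. That is the situation in which heavier negative training would lower positive accuracy under the boolean reading without necessarily changing the ranking.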
