Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology...

34
Department of Electrical Engineering and Computer Science Text Classification Combining Clustering and Hierarchical Approaches Shankar Ranganathan MS Thesis Defense May 3 rd , 2004 Committee Dr. Susan Gauch (Chair) Dr. Perry Alexander Dr. David Andrews

Transcript of Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology...

Page 1: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Text Classification Combining Clustering and Hierarchical Approaches

Shankar RanganathanMS Thesis Defense

May 3rd, 2004

CommitteeDr. Susan Gauch (Chair)

Dr. Perry AlexanderDr. David Andrews

Page 2: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Presentation OutlineSearch Engines TodayContributions Related WorkText Classification – Our ApproachExperiments and EvaluationConclusionsFuture Work

Page 3: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Search Engines TodayReturn results based on simple key-word

matches.No regard for conceptual information.For E.g. : If the query is “SALSA”, Is it……

Page 4: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

KeyConcept Architecture

Page 5: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Contributions

Novel approach to Text Classification by combining clustering within the concepts with hierarchical text classificationEffect of clustering on flat classification versus hierarchical classificationEffect of ignoring versus using concept wise distinction lower down the hierarchies

Page 6: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Related Work IText Classification

Yang, Sebastiani: Comparison of Text classification methods - K-Nearest Neighbors, linear least square fit, Naïve Bayesian, Support Vector Machines, Decision treesHierarchical Classification: Proposed by Koller. Further work by – Sun, Labrou, Sasaki, Dumais, Wang

Page 7: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Related Work IIChaffee, YAHOO, Open Directory Project : OntologyManning, Dubes, Kaufman –Document clusteringAgglomerative (Guha, Karypis) vs. Divisive (Zhao)Lots of packages available on net –Cluto, Chameleon, Rock, Cure, DocCluster, Siftware etc.,Perkowitz – Cluster Mining

Page 8: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Text ClassificationTwo Step Process : Training the classifier and Classification of new documentsTraining Phase:

Classifier is fed with documents that have been classified manuallyLearns about the features (vocabulary) of the various categories into which new documents can be classified

Page 9: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Text Classification contd…

Classification Phase:Classifier assigns category (ies) to new documents based on the similarity of the features of input document and of the categories that it learned during training

Page 10: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Text Classification – Our ApproachVector Space model (tf-idf)Training data are documents that are manually assigned to the categories Open Directory Project’s Standard Tree which is our reference OntologyClassifier creates a vector of vocabulary terms and associated weights in an inverted file

Page 11: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Standard Tree

Page 12: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Text Classification – Our Approach ..Feature selection during training (selecting training documents) plays a primary role towards improving classification accuracy.

Hierarchical classificationUse of Clustering

Page 13: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Flat Classification vs. Hierarchical Classification

1

2 3

654

1 65432

Hierarchical ClassificationFlat Classification

Top

BusinessArts

Music TV Employment

Top-down level-based Hierarchical classification

Page 14: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Role of ClusteringImprove feature selectionEliminate documents that tend to confuse the classifierIdentify within-category clusters, and extract cluster(s)’ representative pagesDocument mining within the framework of cluster mining

Page 15: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Text Classification – Our Approach contd…

During Classification phase, a vector of input document is createdSimilarity between training this vector and vector of each concept during training is computed using dot productNew document is assigned to the categories with best matches

Page 16: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Classifier Output

Page 17: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Experimental Set-upSource of training data: Open Directory Project (dmoz.org) – ODP ontology contains hierarchical informationTest data: Randomly-selected level 3 documentsClustering package: CLUTO

Clustering method: Partitional clusteringSimilarity function: Cosine functionProgram used: vcluster - zscores

Page 18: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Experimental Setup…..Baseline – Random Selection

All concepts from levels 1, 2 and 3 with at least 32 documents (total 1484)2 documents from each concept was randomly withheld for testing (total - 2978)Trained with randomly-selected 30 documents from each concept( around 44500)Accuracy = 46.6 %

Performance of the flat classifier

0.0

10.0

20.0

30.0

40.0

50.0

60.0

70.0

80.0

90.0

1 2 3 4 5 6 7 8 9 10

topn %

of d

ocum

ents

in to

pn

Baseline - RandomSelection

Page 19: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

EvaluationDoes selecting documents closest to the centroid to train improve accuracy ?For hierarchical classification, how far down the hierarchy should we go in each step ?What is the number of documents to train the classifier to get best results ?‘Ignore’ or ‘consider’ tree structure among children ?

Page 20: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Experiment 1 : Effect of clustering on Flat Classification

0.200

0.300

0.400

0.500

0.600

0.700

0.800

0.900

1 2 3 4 5 6 7 8 9 10

Random

closest to centroid

Farthest from centroid

farthest to each other

Best observed accuracy – Selecting documents closest to the centroid (49.5%)Poor performance – Selecting documents farthest from the centroid (29.5%)Selecting documents farthest from each other –48.6%

Page 21: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Experiment 2- Effect of clustering on training Set selection for hierarchical classification

1 Classifier at level 1, 15 at level 2, 358 at level 3Documents from parent & children ( & grandchildren put in the same pool to select)Parameters we tune : Depth, Random selection vs. clustering, # of documents

0

1 2

3 4 5 6

87 109

0

1 2

3 4 5 6

8710

9

0

1 2

3 4 5 6

8710

9

Page 22: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Experiment 2a – Study of Level 1 Decision

Using Level I documents

12.5

13

13.5

14

14.5

15

15.5

16

10 20 30 40

Number of documents

Maximum observed accuracy 15.8% - Very Poor

Very few documents at level-1

So, go deeper…...

% c

orre

ct m

atch

for l

evel

I de

cisi

on

Random

closest tcentroid

o

Page 23: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

2.a: Study of Level 1 Decision.....

Level I Decision Using Level I and II

010203040506070

10 20 30 40 50 60 90

Number of Documents

% c

orre

ct m

atch

es fo

r de

cisi

on a

t Lev

el 1

Random

Closest tocentroid

Level I Decision using Level I , II and III

0

20

40

60

80

100

10 20 30 40 50 60 90

Number of Documents

% c

orre

ct m

atch

es fo

r lev

el

one

deci

sion Random

Closest tocentroid

Maximum accuracy of 81.6% for level 1 decision when documents from levels 1,2 & 3 are used

Page 24: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Expt 2.b: Study of level 2 decision

Level II Decision Using Just level II documents

0

10

20

30

40

50

60

70

10 20 30 40 50 60

Number of documents

% C

orre

ct M

atch

es

Random

closest to centroid

Level I Decision using Level I , II and III

0

20

40

60

80

100

10 20 30 40 50 60 90

Number of Documents

% c

orre

ct m

atch

es fo

r lev

el

one

deci

sion Random

Closest tocentroid

Maximum accuracy of 71.3% for level 2 decision when documents from levels 1,2 & 3 are used. 40 documents to train per concept.

Page 25: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Expt 2.c: Study of Level 3 DecisionLevel III Decision Using Level III Documents

0

10

20

30

40

50

60

70

10 20 30 40 50 60

Number of Documents

%Co

rrec

tMat

ches

RandomClosest to centroid

Maximum accuracy for random selection = 55.2%Maximum accuracy by selecting docts closest to the centroid = 65.4%40.3% relative improvement over baseline

Page 26: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Expt 3: Effect of clustering on hierarchical classification, distributing training set across sub-concepts

Documents selected from each sub-conceptParameters we plan to tune : Depth, # of docts, random vs. closest to the centroid

Page 27: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Experiment 3.a: Level 1 DecisionLevel One decision Using Documents

Closest Centroid

0

20

40

60

80

100

1 2 3 4

Number of documents

% A

ccur

acy

Usingdocumentsfrom level one Using documentsupto level 2Usingdocumentsupto level 3Usingdocumentsupto level 4

Including level 4 – almost same results as level 391.2% Accuracy – 2 documents closest to the centroid from each concept down till level 3Poor results while using just level 1 or level 1 & 2

Page 28: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Experiment 3.b: Level 2 DecisionUsing documents from levels 2&3, 2,3&4 yield almost identical resultsWe use till level 3 -computational time and complexityBest observed accuracy – 84.4% - 2 docts per concept closest to the centroid

Level Two Decision Using Documents Closest to Centroid

0102030405060708090

1 2 3 4

Number of Documents

% A

ccur

acy

Usingdocuments atlevel 2 onlyUsingDocuments atlevel 2 and 3Usingdocuments atlevel 2, 3 and 4

Page 29: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Experiment 3.c Level 3 decisionOverall best accuracy of 79.1% at level 3 using one document from each concept that is closest to the centroid.

Level Three Decision Using Documents Closest to Centroid

7374757677787980

1 2 3 4

Number of documents

% A

ccur

acy

Using documents atlevel 3 only

Using Documents atlevels 3 and 4

Page 30: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Training Strategy2 training documents from each concept Down to level-3These documents are closest to the centroid in each conceptAccuracy of 77.9% when we use clustering as compared to 71.8% when we select random documents

Two documents per category closest to the centroid

0

20

40

60

80

100

1 2 3

Decision at Level

% A

ccur

acy

RandomClosest to centroid

Page 31: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Validation TestingValidation Testing

0

20

40

60

80

100

1 2 3

Level

% A

ccur

acy Random

Documents closestto the centroid

Different Test dataRole of clustering enhances accuracy from 79.7% to 89% at level-1 and final accuracy from 69.8% to 76.2%.Statistically significant( t-test value = 3.23E-05) improvement

Page 32: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

ConclusionsMaximum Accuracy of 77.9% when we use :Hierarchical Classification,2 documents closest to the centroid from each concept down till level-3 to train the classifier

Summary

46.649.5

65.469.8

77.9

0

10

20

30

40

50

60

70

80

90

1

Experiments

% A

ccur

acy

Flat Classif icat ion

Flat Classif icat ion with Clustering

Hierarchical Classif icat ion (putt ingsub-concepts in the same pool)

Hierarchical classif icat ion -randomly-select ing 2 documentsfrom each subconcept

Hierarchical classif icat ion -Select ing 2 documents from eachconcept that are closest to thecentroid

Page 33: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

Future WorkUse of other classifiers like the SVMHow to deal with the dynamic web ?Trials on other data setsRecovery mechanism when error is made at the parent levelFurther ‘divide and conquer’ –Binary decisions

Page 34: Text Classification Combining Clustering and Hierarchical ...Project (dmoz.org) – ODP ontology contains hierarchical information {Test data: Randomly-selected level 3 documents {Clustering

Department of Electrical Engineering and Computer Science

????’s or !!!!’s

Thank You