Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue...

25
Adaptation of Hierarchical Adaptation of Hierarchical clustering by areas for automatic clustering by areas for automatic construction of electronic construction of electronic catalogue catalogue Nizhny Novgorod, 2010 Prepared by Prepared by Fedor Vladimirovich Fedor Vladimirovich Borisyuk Borisyuk Scientific adviser Scientific adviser Doctor of technical Doctor of technical sciences sciences , , Professor Professor Vladimir Ivanovich Vladimir Ivanovich Lobachevsly State University of Nizhny Novgorod Faculty of Computational Mathematics and Cybernatics Chair of the Mathematical Support of Computer

Transcript of Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue...

Page 1: Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue Nizhny Novgorod, 2010 Prepared by Fedor Vladimirovich.

Adaptation of Hierarchical clustering by areas Adaptation of Hierarchical clustering by areas for automatic construction of electronic for automatic construction of electronic

cataloguecatalogue

Nizhny Novgorod, 2010

Prepared byPrepared by Fedor Vladimirovich BorisyukFedor Vladimirovich Borisyuk

Scientific adviserScientific adviser Doctor of technical sciencesDoctor of technical sciences, , ProfessorProfessor Vladimir Ivanovich ShvetsovVladimir Ivanovich Shvetsov

Lobachevsly State University of Nizhny NovgorodFaculty of Computational Mathematics and Cybernatics

Chair of the Mathematical Support of Computer

Page 2: Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue Nizhny Novgorod, 2010 Prepared by Fedor Vladimirovich.

Problem definitionProblem definition

Initial dataInitial data

Collection of text documents.Collection of text documents.

PurposePurpose

Automatically construct hierarchical catalogue, Automatically construct hierarchical catalogue, which reflects thematic areas of given initial which reflects thematic areas of given initial collection.collection.

Page 3: Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue Nizhny Novgorod, 2010 Prepared by Fedor Vladimirovich.

Examples of web catalogues: Examples of web catalogues: Yandex CatalogueYandex Catalogue

Yandex Catalogue stores Yandex Catalogue stores information on  tens information on  tens of thousands of Russian of thousands of Russian websites.websites.

Uses 16 major topics,Uses 16 major topics,  each of them not more than six each of them not more than six

levels in depth.levels in depth.

Yandex Catalogue was Yandex Catalogue was compiled and is updated compiled and is updated manually. manually.

Page 4: Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue Nizhny Novgorod, 2010 Prepared by Fedor Vladimirovich.

Examples of web catalogues: Examples of web catalogues: eLibrary.rueLibrary.ru

Russian scientific electronic Russian scientific electronic library library eLIBRARY.RUeLIBRARY.RU

12 12 millions of scientific millions of scientific articlesarticles

Use library classificator Use library classificator GRNTI (State rubricator of GRNTI (State rubricator of scientific and technical scientific and technical information)information)

Maintained by group of Maintained by group of experts.experts.

Depth of catalogue is noDepth of catalogue is no more than 3.more than 3.

Page 5: Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue Nizhny Novgorod, 2010 Prepared by Fedor Vladimirovich.

Why to construct electronic Why to construct electronic catalogue automaticallycatalogue automatically

Big amounts of text data is accumulated and is Big amounts of text data is accumulated and is continuously growing.continuously growing.

Most of the catalogues are maintained with Most of the catalogues are maintained with support of human experts. High labor costs!support of human experts. High labor costs!

Subjectivity of human experts.Subjectivity of human experts.

Traditional and manually prepared catalogues Traditional and manually prepared catalogues and classifiers can not reflect high rates of and classifiers can not reflect high rates of informational progress in required areasinformational progress in required areas.

Page 6: Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue Nizhny Novgorod, 2010 Prepared by Fedor Vladimirovich.

Related worksRelated works

Tao Li and Shenghuo Zhu have used linear Tao Li and Shenghuo Zhu have used linear discriminant projection approach for discriminant projection approach for transformation of document space onto transformation of document space onto lower-dimensional space and then cluster lower-dimensional space and then cluster the documents into hierarchy using the documents into hierarchy using Hierarchical agglomerative clustering Hierarchical agglomerative clustering algorithm.algorithm.

O. Peskova develops a modification of O. Peskova develops a modification of layerwise clustering method of Ayvazyan. layerwise clustering method of Ayvazyan. There was found a 4% advantage in average There was found a 4% advantage in average f-measure of the developed clustering f-measure of the developed clustering method over Hierarchical agglomerative method over Hierarchical agglomerative clustering algorithm.clustering algorithm.

Page 7: Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue Nizhny Novgorod, 2010 Prepared by Fedor Vladimirovich.

Mechanism of automatic construction Mechanism of automatic construction of electronic catalogueof electronic catalogue

Preparation of document images for clustering

Unclassified text

collection

Hierarchical

structure of

electronic

catalogue

Hierarchical clustering by

areasPost

processing of Hierarchical

structure

Parameters of clustering algorithm

Page 8: Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue Nizhny Novgorod, 2010 Prepared by Fedor Vladimirovich.

Preparation of document images: Preparation of document images: Suggested algorithm of keywords selectionSuggested algorithm of keywords selection

For all words of the document stem is extracted using For all words of the document stem is extracted using Porter algorithm.Porter algorithm.

Remove stop words and words, which have frequency Remove stop words and words, which have frequency more than predefined max frequency or less than more than predefined max frequency or less than predefined minimum frequency.predefined minimum frequency.

Weight of the stemWeight of the stemii in the document D is calculated using in the document D is calculated using modified TFxIDF formula.modified TFxIDF formula.

No more than 300 stems with the highest weight are No more than 300 stems with the highest weight are selected as keywords to represent the document.selected as keywords to represent the document.

The number of keywords reduced using suggested The number of keywords reduced using suggested selective feature reduction algorithm.selective feature reduction algorithm.

Page 9: Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue Nizhny Novgorod, 2010 Prepared by Fedor Vladimirovich.

Weighting formulaWeighting formulaTFxIDF weighting formulas TFxIDF weighting formulas

Notations:Notations:

TfTfi i - is term frequency in document D - is term frequency in document D

MaxStemFreqMaxStemFreqD D - max frequency between all stems in D - max frequency between all stems in D

TDN - total number of documents in collectionTDN - total number of documents in collection

DNDNi i - number of documents where this stem occurs- number of documents where this stem occurs

IDFIDFi i - inversed document frequency. - inversed document frequency.

ii

  TDNIDF = log (1+ )DN

(0.5 0.5 TF )MaxStemFreq i

Weight (stem ) ( )  IDFiD

Di

Page 10: Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue Nizhny Novgorod, 2010 Prepared by Fedor Vladimirovich.

Suggested Selective feature space reductionSuggested Selective feature space reductionPurposePurpose: : Select keywords with the best discrimination power in relation to possible Select keywords with the best discrimination power in relation to possible

catalogue areas.catalogue areas.

Selective feature space reduction algorithm:Selective feature space reduction algorithm:

1.1. Cluster the documents collection using modified Hierarchical by areas algorithm. Cluster the documents collection using modified Hierarchical by areas algorithm. Each area in the tree is characterized by keywords vector.Each area in the tree is characterized by keywords vector.

2.2. Execute keywords extraction algorithm on areas keywords vectors to select the Execute keywords extraction algorithm on areas keywords vectors to select the keywords of each area in relation to other areas.keywords of each area in relation to other areas.

3.3. Remove from documents keywords, which are not presented in areas feature space.Remove from documents keywords, which are not presented in areas feature space.

TDN - total number of documents in collectionTDN - total number of documents in collection

DNDNi i - number of documents where this stem occurs- number of documents where this stem occurs

Page 11: Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue Nizhny Novgorod, 2010 Prepared by Fedor Vladimirovich.

Basics of Hierarchical by areas Basics of Hierarchical by areas clusteringclustering

1.1. Object of clustering is text document.Object of clustering is text document.

2.2. Document is characterized by vector of keywords.Document is characterized by vector of keywords.

3.3. Tree of Areas.Tree of Areas.

Защита от сбоев вычислительных узлов

Распределенная файловая система

Высокая производительность

Автоматическое распределение нагрузки

Реализация MapReduce с открытым исходным кодом Apache Hadoop

B

A

D C

E F HG

Page 12: Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue Nizhny Novgorod, 2010 Prepared by Fedor Vladimirovich.

Characters of AreaCharacters of Area

1.1. Characterized by vector of keywords, which Characterized by vector of keywords, which is prepared from the keywords of documents is prepared from the keywords of documents in this area. Each keyword has a weight.in this area. Each keyword has a weight.

2.2. Documents, which are belongs to Area. Documents, which are belongs to Area. There is limit on the number of documents There is limit on the number of documents in area.in area.

33. . Area can have a children. There is limit on Area can have a children. There is limit on the number of children.the number of children.

Page 13: Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue Nizhny Novgorod, 2010 Prepared by Fedor Vladimirovich.

StartupStartup

Lets we have incoming flow of Lets we have incoming flow of documents. documents.

Incrementally builds Area tree from Incrementally builds Area tree from incoming documentsincoming documents

Page 14: Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue Nizhny Novgorod, 2010 Prepared by Fedor Vladimirovich.

Hierarchical by areas clusteringHierarchical by areas clustering1 1 stepstep.. Area = Root area Area = Root area Verify possibility to insert the document Doc.Verify possibility to insert the document Doc. Put Doc to RecycleBin if proximity is less than min.Put Doc to RecycleBin if proximity is less than min.

2 2 stepstep.. SearchSearch for closest to Doc child of Area.for closest to Doc child of Area.

3 3 stepstep.. IF ChildIF Child closer to Doc than tocloser to Doc than to Area, Area,

thenthen Area = Child Area = Child and go to step 2and go to step 2..

4 4 stepstep.. Insert document in AreaInsert document in Area..

5 5 stepstep.. Verify limits: IFVerify limits: IF Area Area is crowdedis crowded, , then divide it.then divide it.

IFIF number of children is more then limit – number of children is more then limit – integrate them. integrate them.

66 stepstep.. Update set of keywords of areas, which areUpdate set of keywords of areas, which are

located on the path to the resulted area.located on the path to the resulted area.

Page 15: Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue Nizhny Novgorod, 2010 Prepared by Fedor Vladimirovich.

RecycleBinRecycleBin

All those documents which do not meet entry criteria of All those documents which do not meet entry criteria of the areas of the certain level should be temporary stored in the areas of the certain level should be temporary stored in special area on the same level – in RecycleBin.special area on the same level – in RecycleBin.

When the number of objects in the RecycleBin exceed the When the number of objects in the RecycleBin exceed the predefined limit, RecycleBin is divided and detached area is predefined limit, RecycleBin is divided and detached area is connected to the current level.connected to the current level.

A

B C X

R

Page 16: Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue Nizhny Novgorod, 2010 Prepared by Fedor Vladimirovich.

Divide operationDivide operation

A A

DReasonReason: : too many documents in area. too many documents in area.

Divide area using K-means algorithm into two parts. Divide area using K-means algorithm into two parts.

Connect areas C and D to the tree. B area will host Connect areas C and D to the tree. B area will host integrated characteristics.integrated characteristics.

B

C

Page 17: Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue Nizhny Novgorod, 2010 Prepared by Fedor Vladimirovich.

Integrate operationIntegrate operation

Reason:Reason: number of children is more then predefined limit. number of children is more then predefined limit.

Find two most close areas (B and C) and unite them under Find two most close areas (B and C) and unite them under one parent area. one parent area.

Parent area D will have as center - average of keywords Parent area D will have as center - average of keywords vectors from both integrated areas B and C.vectors from both integrated areas B and C.

A A

B C X XB C

D

Page 18: Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue Nizhny Novgorod, 2010 Prepared by Fedor Vladimirovich.

Areas Tree is filled like pyramid of Areas Tree is filled like pyramid of champagnechampagne

Page 19: Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue Nizhny Novgorod, 2010 Prepared by Fedor Vladimirovich.

Post processing of generated Post processing of generated Hierarchical structureHierarchical structure

Setup areas Titles as three first keywords from Setup areas Titles as three first keywords from top of area keywords vector sorted by weight.top of area keywords vector sorted by weight.

For all the tree - make a links between areas at For all the tree - make a links between areas at the same level if distance between keyword the same level if distance between keyword vectors of these areas is greater than calculated vectors of these areas is greater than calculated distance. Purpose - referring similar or related distance. Purpose - referring similar or related rubrics.rubrics.

Page 20: Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue Nizhny Novgorod, 2010 Prepared by Fedor Vladimirovich.

Test collectionsTest collections

Collection nameCollection name Collection charactericticsCollection characterictics

20NewsGroups20NewsGroups 20000 articles evenly divided 20000 articles evenly divided among 20 Usenet newsgroups. among 20 Usenet newsgroups. Language:EnglishLanguage:English

NNSU8NNSU8 1302 scientific taken from 1302 scientific taken from

portal of Nizhny Novgorod portal of Nizhny Novgorod State University. State University.

Language:RussianLanguage:Russian

Page 21: Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue Nizhny Novgorod, 2010 Prepared by Fedor Vladimirovich.

External clustering evaluationExternal clustering evaluationFor each pair of

the documents Di and Dj

Di and Dj contain in one

cluster of “sample”

partitioning

Di and Dj contain in different cluster of “sample”

partitioningDi and Dj contain in one cluster of automatic clustering

tp (true positive)

fp (false positive)

Di and Dj contains in different clusters of automatic clustering

fn (false negative)

tn (true negative)

RecallRecall

PrecisionPrecision

F-measureF-measure

tptp+f

n

Recall =

tptp+f

p

Precision=

2*Recall*PrecPrec+Rec

allF-measure=

Page 22: Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue Nizhny Novgorod, 2010 Prepared by Fedor Vladimirovich.

Computational experiments: Computational experiments: Evaluation of average metricsEvaluation of average metrics

Metric Metric

Algorithms and datasets Algorithms and datasets

Hierarchical by areas Hierarchical by areas Hierarchical Hierarchical

AgglomerativeAgglomerative

20News 20News

groups groups NNSU8 NNSU8

20News 20News

groups groups NNSU8 NNSU8

Recall Recall 0.79 0.79 0.66 0.66 0.1 0.1 0.40 0.40

Precision Precision 0.35 0.35 0.59 0.59 0.11 0.11 0.38 0.38

F-measure F-measure 0.48 0.48 0.6 0.6 0.1 0.1 0.33 0.33

Time Time

(msec) (msec) 2505 2505 2391 2391 2896 2896 45116 45116

Page 23: Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue Nizhny Novgorod, 2010 Prepared by Fedor Vladimirovich.

Top levels of catalogue generated by Top levels of catalogue generated by the hierarchical clustering by areasthe hierarchical clustering by areas

NNSU8 collectionNNSU8 collection

AreasAreas MembersMembers

11 Law, philosophy;Law, philosophy;

22 Mathematics;Mathematics;

33 Sociology;Sociology;

44 Economics;Economics;

55 PhysicsPhysics

66 Biology, Chemistry;Biology, Chemistry;

20NewsGroups collection20NewsGroups collection

AreasAreas MembersMembers

11talk.politics.mideast,talk.politics.guns, talk.politics.mideast,talk.politics.guns,

talk.politics.misctalk.politics.misc22 comp.graphics, comp.os.ms-windows.misccomp.graphics, comp.os.ms-windows.misc

33rec.sport.baseball,rec.sport.hockey, rec.sport.baseball,rec.sport.hockey,

rec.autos, rec.motorcyclesrec.autos, rec.motorcycles44 sci.crypt, sci.med, sci.spacesci.crypt, sci.med, sci.space

55comp.sys.ibm.pc.hardware; comp.sys.ibm.pc.hardware;

comp.sys.mac.hardwarecomp.sys.mac.hardware66 soc.religion.christiansoc.religion.christian

77sci.electronics;misc.forsale; sci.electronics;misc.forsale;

comp.windows.xcomp.windows.x88 talk.religion.misctalk.religion.misc

Page 24: Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue Nizhny Novgorod, 2010 Prepared by Fedor Vladimirovich.

ConclusionsConclusionsEffective method of keywords extraction Effective method of keywords extraction from text documents in purpose of text from text documents in purpose of text clustering is presented.clustering is presented.

Taken computational experiments showed Taken computational experiments showed efficiency of suggested approach with using efficiency of suggested approach with using of Hierarchical clustering by areas algorithm of Hierarchical clustering by areas algorithm for automatic construction of electronic for automatic construction of electronic catalogue. catalogue.

Page 25: Adaptation of Hierarchical clustering by areas for automatic construction of electronic catalogue Nizhny Novgorod, 2010 Prepared by Fedor Vladimirovich.

Thank youThank you!! Questions? Questions?

[email protected] – Fedor Vladimirovich BorisyukFedor Vladimirovich [email protected] - Vladimir Ivanovich ShvetsovVladimir Ivanovich Shvetsov