Self-organizing maps applied to information retrieval of dissertations and theses from BDTD-UFPE
Self-organizing maps applied to information retrieval of dissertations and theses
from BDTD-UFPE
Bruno [email protected]
Renato [email protected]
Outline
• Information Retrieval Systems (IRS)
• IRS + SOM
• Related Works
• Document Collection
• System Architecture
• Methodology
• Results
Information Retrieval Systems (IRS)
• Indexing, searching, and classifying textual documents
• Users' information needs
• Matching users' queries to the system's vocabulary
IRS + SOM
Self-Organizing Maps
Information Retrieval System
IRS + SOM
• Navigation interface built through document maps
• Document maps
  – Self-organizing map trained with document vectors
Related Works
• First works (1991-1995)
  – Lin / Merkl
• Major projects (1996-2000)
  – Arizona Digital Library, WEBSOM, SOMLib
• Diversification (2001-2005)
  – LiGHtSOM, GHSOM, H2SOM
• Convergence (2006)
Document Collection
• UFPE Digital Library of Theses and Dissertations (BDTD-UFPE)
  – Offers the full text of all theses and dissertations produced in the university's graduate programs
  – Approximately 6000 documents
  – Linked to the Brazilian BDTD and to NDLTD (Networked Digital Library of Theses and Dissertations)
System Architecture
Pipeline stages and the output of each stage:
• Document Acquisition: documents' content
• Document Indexing: inverted index
• Document Representation: document vectors
• Dimensionality Reduction: reduced vectors
• Volume Reduction: prototype vectors
• Construction of Document Map: document map
Methodology
• Document acquisition
  – Harvesting through the OAI-PMH protocol
  – XML files containing the documents' metadata
  – Data extraction with the Java library JColtrane
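The extraction of metadata from a harvested OAI-PMH response might look roughly like the sketch below. It is written in Python with the standard-library `xml.etree.ElementTree` swapped in for JColtrane, which the slides actually used. The sample XML is invented for illustration, it assumes the `oai_dc` (Dublin Core) metadata format, and `extract_records` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and the oai_dc (Dublin Core) format.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample of a ListRecords response; a real harvest would fetch it
# over HTTP from the repository with verb=ListRecords&metadataPrefix=oai_dc.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample thesis title</dc:title>
          <dc:description>Sample abstract text.</dc:description>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def extract_records(xml_text):
    """Return one dict per record with identifier, title and description."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for rec in root.findall(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier",
                                       default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "description": rec.findtext(".//dc:description",
                                        default="", namespaces=NS),
        })
    return records


records = extract_records(SAMPLE_RESPONSE)
```

The title and description fields are the kind of text that would then be passed on to the indexing step.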
Methodology
• Indexing
  – Performed with the Java library Lucene
  – Stemming; elimination of digits and stopwords
  – Inverted index built according to the vector space model
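As a rough illustration of what such an indexing step produces, here is a minimal Python sketch of an inverted index with digit removal, stopword elimination and stemming. It does not use Lucene; the stopword list, the toy `crude_stem` suffix stripper and the sample documents are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (assumption, not Lucene's list).
STOPWORDS = {"the", "of", "and", "a", "in", "to", "for"}


def crude_stem(token):
    """Toy suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, keep alphabetic tokens only (drops digits), filter, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


docs = {
    "d1": "Training of self-organizing maps in 2006",
    "d2": "Maps for indexing theses",
}
index = build_inverted_index(docs)
```

Each posting keeps the term frequency, which is what the document representation step needs later.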
Methodology
• Document representation
  – Documents are represented by vectors whose components are indexed by terms; each value is a function of the term's frequency of occurrence in the document.
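The slide does not say which function of term frequency was used. The sketch below assumes one common choice, tf-idf with `idf = log(N / df)`, purely for illustration; the function names and the toy tokenized documents are hypothetical.

```python
import math
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Sorted list of distinct terms; position in the list is the vector index."""
    return sorted({t for doc in tokenized_docs for t in doc})


def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, one tf-idf vector per document)."""
    vocab = build_vocabulary(tokenized_docs)
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


tokenized_docs = [["map", "train", "map"], ["map", "index"], ["thes", "index"]]
vocab, vectors = tfidf_vectors(tokenized_docs)
```

Terms absent from a document get a zero component, so the vectors are sparse once the vocabulary grows to thousands of terms.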
Methodology
• Dimensionality reduction
  – Feature selection based on word frequency
  – Stopword elimination
  – Final dimensionality: 13095 terms
• Volume reduction
  – Not used
  – Volume: 4781 documents
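Frequency-based feature selection can be sketched as below. The slides do not give the thresholds that were used, so `min_df` and `max_df_ratio` are hypothetical parameters, and filtering on document frequency is one assumed reading of "word frequency".

```python
from collections import Counter


def select_terms(tokenized_docs, min_df=2, max_df_ratio=0.9):
    """Keep terms that are neither too rare nor too common across documents."""
    n_docs = len(tokenized_docs)
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    return sorted(
        t for t, count in df.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    )


def reduce_document(doc, selected_terms):
    """Drop every token that is not in the selected vocabulary."""
    keep = set(selected_terms)
    return [t for t in doc if t in keep]


tokenized_docs = [
    ["som", "map", "thesis"],
    ["som", "index", "thesis"],
    ["som", "map"],
    ["som", "rare"],
]
selected = select_terms(tokenized_docs)
reduced = [reduce_document(doc, selected) for doc in tokenized_docs]
```

In this toy collection "som" is dropped because it occurs in every document, and "index" and "rare" are dropped because they occur only once.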
Methodology
• Document map construction
  – Single stage
  – somtoolbox functions for MATLAB
  – Document vectors normalized before training
  – SOM with rectangular structure (10 x 12) and hexagonal neighborhood
Methodology
• Document map construction
  – Weights initialized linearly along the eigenvectors with the two largest eigenvalues
  – Batch SOM algorithm with dot-product metric
  – Gaussian neighborhood function
  – Neighborhood size decreasing linearly with the number of epochs
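A batch SOM with these ingredients can be sketched in plain Python as follows. This is not the somtoolbox implementation: for brevity it initializes the weights from random data samples instead of the linear eigenvector initialization named above, and the function names, the seed and the toy data are all assumptions.

```python
import math
import random


def hex_grid(rows, cols):
    """Unit positions on a hexagonal lattice: odd rows shifted by half a cell."""
    return [(c + 0.5 * (r % 2), r * math.sqrt(3) / 2)
            for r in range(rows) for c in range(cols)]


def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def best_matching_unit(x, weights):
    # Dot-product metric: the winner has the largest inner product with x.
    return max(range(len(weights)), key=lambda j: dot(x, weights[j]))


def train_batch_som(data, rows, cols, epochs, radius_start, radius_end, seed=0):
    rng = random.Random(seed)
    data = [normalize(x) for x in data]
    grid = hex_grid(rows, cols)
    # Assumption: sample-based initialization, not the slide's linear one.
    weights = [list(rng.choice(data)) for _ in grid]
    for epoch in range(epochs):
        # Neighborhood radius decreases linearly over the epochs.
        t = epoch / max(epochs - 1, 1)
        radius = radius_start + t * (radius_end - radius_start)
        bmus = [best_matching_unit(x, weights) for x in data]
        new_weights = []
        for j, (gx, gy) in enumerate(grid):
            num, den = [0.0] * len(data[0]), 0.0
            for x, b in zip(data, bmus):
                d2 = (gx - grid[b][0]) ** 2 + (gy - grid[b][1]) ** 2
                h = math.exp(-d2 / (2 * radius ** 2))  # Gaussian neighborhood
                den += h
                num = [n + h * xi for n, xi in zip(num, x)]
            # Batch update: neighborhood-weighted mean, renormalized to unit
            # length so that dot products stay comparable between units.
            new_weights.append(
                normalize([n / den for n in num]) if den > 0 else weights[j])
        weights = new_weights
    return weights, grid


data = [[1, 0, 0], [0.9, 0.1, 0], [0, 0, 1], [0, 0.1, 0.9]]
weights, grid = train_batch_som(data, rows=2, cols=3, epochs=10,
                                radius_start=2, radius_end=0.8)
```

Unlike the sequential algorithm, each epoch here recomputes all model vectors at once from the whole data set, so no learning rate is needed.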
Methodology
• Document map construction
  – Parameters
    • Number of epochs
      – Rough phase: 10 epochs
      – Fine-tuning phase: 10 epochs
    • Neighborhood size
      – Rough phase
        » Initial: [(number of units along the largest dimension) / 2] + 1
        » Final: 2
      – Fine-tuning phase
        » Initial: 2
        » Final: 0.8
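For the 10 x 12 map these parameters give an initial rough-phase radius of 12 / 2 + 1 = 7. The short Python sketch below works this out and lists the radius per epoch. It assumes that the square brackets mean rounding down and that the final value is reached exactly in the last epoch; both function names are hypothetical.

```python
def initial_radius(rows, cols):
    """[(units along the largest dimension) / 2] + 1, brackets read as floor."""
    return max(rows, cols) // 2 + 1


def linear_schedule(start, end, epochs):
    """Radius for each epoch, moving linearly from start to end."""
    if epochs == 1:
        return [float(start)]
    step = (end - start) / (epochs - 1)
    return [start + i * step for i in range(epochs)]


rough_phase = linear_schedule(initial_radius(10, 12), 2, epochs=10)
fine_tuning_phase = linear_schedule(2, 0.8, epochs=10)
```

The rough phase starts with a neighborhood that covers most of the map and the fine-tuning phase ends with one narrower than a single unit spacing.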
Methodology
• User interface construction
  – Documents are mapped to the node whose model vector is closest in terms of cosine distance
  – Each map node is labeled according to category
    • Knowledge areas (CHLA, CBS, TCEN)
    • Graduate programs
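The mapping and labeling steps can be sketched as follows. The slides do not state how a node's label is chosen from the categories of its documents, so the majority vote used here is an assumption, as are the function names and the toy vectors.

```python
import math
from collections import Counter, defaultdict


def cosine_similarity(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def assign_documents(doc_vectors, model_vectors):
    """Index of the node with the highest cosine similarity for each document."""
    return [
        max(range(len(model_vectors)),
            key=lambda j: cosine_similarity(v, model_vectors[j]))
        for v in doc_vectors
    ]


def label_nodes(assignments, categories):
    """Assumption: a node takes the most frequent category among its documents."""
    per_node = defaultdict(Counter)
    for node, category in zip(assignments, categories):
        per_node[node][category] += 1
    return {node: counts.most_common(1)[0][0]
            for node, counts in per_node.items()}


model_vectors = [[1, 0], [0, 1]]
doc_vectors = [[2, 0.1], [0.2, 3], [5, 1]]
categories = ["CBS", "TCEN", "CBS"]

assignments = assign_documents(doc_vectors, model_vectors)
node_labels = label_nodes(assignments, categories)
```

Nodes that receive no documents get no label in this sketch.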
Results
Categories   Accuracy   F1 micro   F1 macro   Topographic error
3            0.96       0.96       0.96       0.01
61           0.66       0.66       0.44       0.01
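The classification metrics in the table can be computed as in the Python sketch below, which uses invented labels and a hypothetical `evaluate` function. In single-label classification micro-averaged F1 equals accuracy, which matches the identical Accuracy and F1 micro columns. Macro-averaged F1 weights every category equally, so a gap such as 0.66 versus 0.44 would be consistent with weaker results on small categories, although the slides do not state the cause.

```python
def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def evaluate(y_true, y_pred):
    """Return (accuracy, micro F1, macro F1) for single-label predictions."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    counts = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        counts[label] = (tp, fp, fn)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Micro: pool the counts of all categories, then compute one F1.
    micro = f1(sum(c[0] for c in counts.values()),
               sum(c[1] for c in counts.values()),
               sum(c[2] for c in counts.values()))
    # Macro: compute F1 per category, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    return accuracy, micro, macro


# Invented labels: one large category and two small ones.
y_true = ["A", "A", "A", "A", "B", "C"]
y_pred = ["A", "A", "A", "A", "A", "C"]
accuracy, micro_f1, macro_f1 = evaluate(y_true, y_pred)
```

In this toy case the single missed document of the small category "B" barely moves accuracy but pulls macro F1 down sharply.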
Results
[Figure panels: Knowledge Areas; Graduate Programs]
Acknowledgement