Acknowledge 07 Automated Retrieval And Categorization Of Texts In An E Learning Environment The...
Automated Information Retrieval and Text Categorization: The RIKS Demonstrator
Acknowledge final event, November 25, 2008
Marie-Francine Moens, Erik Boiy, Javier Arias (HMDB-LIIR)
Saskia Debergh (i.Know)
Philippe De Lombaerde, Birger Fühne (UNU-CRIS)
Overview
• UNU-CRIS: The RIKS Demonstrator
• K.U.Leuven:
– Content extraction from multilingual Web pages
– Text categorization: machine learning approach
– Search engine and indexing infrastructure
– Interfacing the Acknowledge platform
• i.Know:
– Information forensics
Acknowledge 25-11-2008
The RIKS Demonstrator
• United Nations University – Comparative Regional Integration Studies (UNU-CRIS)
• Issues addressed in research and capacity building:
– (i) emergence of regional (= supra-national) governance level
– (ii) linkages with other governance levels (national, global/UN)
– (iii) building of regional institutions
– (iv) growing regional interdependence, etc.
• RIKS = Regional Integration Knowledge System (UNU-CRIS and GARNET NoE)
The RIKS Demonstrator
Issues addressed in the demonstrator:
How to automate retrieval and processing (cleaning, search, categorization, presentation) of particular types of relevant information in an e-learning environment?
– ‘News’: short texts, various formats, dynamic collection, short life cycle, role of news in e-learning application
– ‘Documentation’: heterogeneous texts (scientific articles, theses, essays, ...), rather static collection
– Treaty texts: long and complex texts, static collection, issue of accessibility
Example output (RIKS)
Demo
K.U.Leuven: Content extraction from multilingual Web pages
• = Extracting the main content from a Web page and removing extraneous data (navigation menus, advertisements, etc.)
• Requirements of the tool:
– Accurate
– Generic
– Multilingual
– Fast
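The slides do not describe how the extraction tool works internally, so the following is only a minimal sketch of one common generic heuristic, not the method of [Arias et al. submitted]: keep text blocks that are long enough and contain few link characters. The names `BlockCollector` and `extract_main_content`, the thresholds, and the sample page are all assumptions made for illustration.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "article", "section"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collects text blocks and counts the characters that sit inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # list of (text, link_chars) pairs
        self._text = []
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0

    def flush(self):
        # Close the current block and normalise its whitespace.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def extract_main_content(html, min_chars=60, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    kept = [
        text for text, link_chars in parser.blocks
        if len(text) >= min_chars and link_chars / len(text) <= max_link_density
    ]
    return "\n".join(kept)


# Hypothetical page: a navigation menu, one real paragraph, one advertisement.
SAMPLE_PAGE = """
<html><body>
<div><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About</a></div>
<p>Regional integration agreements have multiplied over the last two decades, and their scope now extends well beyond trade.</p>
<div>Advertisement: <a href="http://ads.example">Buy now and save on everything today, click here for details!</a></div>
</body></html>
"""

main_text = extract_main_content(SAMPLE_PAGE)
print(main_text)
```

The heuristic only looks at block length and link density, so it does not depend on the language of the page, which is one plausible way to meet the "generic" and "multilingual" requirements. Its accuracy on real pages would need to be measured.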
[Arias et al. submitted]
[Arias et al. submitted]; [5] = [Gottron 2008]
K.U.Leuven: Text categorization
• Heterogeneous documentation and Google News classified into 27 categories (e.g., trade, poverty, ...)
• Supervised classifier: Multinomial Naïve Bayes, Support Vector Machine, ...
• Features:
– different features: unigrams, bigrams, feature item sets, ...
• Additional feature selection:
– Chi Square, Information Gain, Linear Classifier Weights, Orthogonal Centroid Feature Selection
• Different test setups
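As an illustration of the supervised approach listed above, here is a minimal sketch of a Multinomial Naïve Bayes classifier over unigram and bigram features with Laplace smoothing. It is not the K.U.Leuven implementation. The class and function names, the four training sentences, and the two categories are hypothetical stand-ins for the 27 RIKS categories.

```python
import math
from collections import Counter, defaultdict


def ngram_features(text):
    """Lowercased unigrams plus bigrams (joined with '_')."""
    tokens = text.lower().split()
    bigrams = [a + "_" + b for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


class MultinomialNB:
    """Multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_docs = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(ngram_features(text))
        self.vocab = {f for counts in self.feature_counts.values() for f in counts}
        return self

    def predict(self, text):
        n_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.feature_counts.items():
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # Log prior plus the log likelihood of every known feature.
            score = math.log(self.class_docs[label] / n_docs)
            for feature in ngram_features(text):
                if feature in self.vocab:
                    score += math.log((counts[feature] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training set.
TRAIN = [
    ("tariff cuts boost regional trade", "trade"),
    ("new free trade agreement signed", "trade"),
    ("rural poverty rates remain high", "poverty"),
    ("aid programme targets extreme poverty", "poverty"),
]

clf = MultinomialNB().fit([t for t, _ in TRAIN], [label for _, label in TRAIN])
prediction = clf.predict("ministers discuss trade agreement")
print(prediction)
```

Features unseen in training are ignored at prediction time. A feature selection step such as Chi Square would further shrink `vocab` before scoring, and is left out of this sketch.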
K.U.Leuven: Text categorization
K.U.Leuven: search engine (RIKS)
Demo
Knowing that you don't know what you should know
1. Information Forensics ‐ Smart Indexing
more than just an index
distinguishes between concepts and relations
recognises word groups as meaningful units
starts from unstructured text (bottom‐up instead of top‐down)
Top‐down: knowledge → keywords → text
Bottom‐up: text → concepts and relations → knowledge
© i.Know NV ‐ All rights reserved.
1. Information Forensics – Smart Indexing
Traditional indexing (keywords):
De Fortis Bank werd overgenomen door BNP Paribas. (English: "Fortis Bank was taken over by BNP Paribas.")
Processing steps: stopwords, stemming, calculation, correlation
Keyword Index
Terms: Fortis, Bank, werd, overgenomen, door, BNP, Paribas
Weights (in slide order): 0.38, 0.08, 0.23, 0.21, 0.12, 0.34, 0.27
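The traditional keyword pipeline can be sketched roughly as follows. The slide does not say how its weights are calculated, so TF-IDF is used here purely as an assumed stand-in, and the resulting numbers will not match those on the slide. The stopword list, the `crude_stem` rule, and the second document are hypothetical. The "correlation" step is left out because the slide does not say what it computes.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list, not a complete Dutch list.
DUTCH_STOPWORDS = {"de", "het", "een", "werd", "door", "en", "van"}


def crude_stem(token):
    """Strip a trailing '-en' from longer tokens; a real system would use a proper stemmer."""
    if token.endswith("en") and len(token) > 5:
        return token[:-2]
    return token


def keywords(text):
    """Tokenise, drop stopwords, and stem what remains."""
    tokens = re.findall(r"\w+", text.lower())
    return [crude_stem(t) for t in tokens if t not in DUTCH_STOPWORDS]


def build_keyword_index(documents):
    """Return {doc_id: {term: weight}} using a simple TF-IDF weighting."""
    term_lists = {doc_id: keywords(text) for doc_id, text in documents.items()}
    doc_freq = Counter(term for terms in term_lists.values() for term in set(terms))
    n_docs = len(documents)
    index = {}
    for doc_id, terms in term_lists.items():
        tf = Counter(terms)
        index[doc_id] = {
            term: (count / len(terms)) * math.log(1 + n_docs / doc_freq[term])
            for term, count in tf.items()
        }
    return index


DOCS = {
    "d1": "De Fortis Bank werd overgenomen door BNP Paribas.",
    "d2": "De bank opent een nieuw kantoor.",  # hypothetical second document
}

index = build_keyword_index(DOCS)
for term, weight in sorted(index["d1"].items()):
    print(term, round(weight, 2))
```

Each keyword ends up as an isolated entry: the index no longer records that "Fortis Bank" and "BNP Paribas" are units, or which one took over the other.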
1. Information Forensics – Smart Indexing
Smart Indexing (concepts and relations):
De Fortis Bank werd overgenomen door BNP Paribas.
Processing steps: concept detection, relation detection
Smart Index
Concept: Fortis Bank
Relation: werd overgenomen door ("was taken over by")
Concept: BNP Paribas
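The slides do not explain how i.Know detects concepts and relations, so the following is only a toy sketch of the idea of grouping words into alternating concept and relation units. It assumes a small hand-written lexicon of relation words (`RELATION_WORDS`) and determiners (`DETERMINERS`); both sets and the function name `smart_index` are hypothetical.

```python
import re

# Hypothetical lexicons, just large enough for the example sentence.
RELATION_WORDS = {"werd", "overgenomen", "door"}
DETERMINERS = {"de", "het", "een"}


def smart_index(sentence):
    """Split a sentence into alternating ('Concept', ...) and ('Relation', ...) units."""
    units = []
    for token in re.findall(r"\w+", sentence):
        lower = token.lower()
        if lower in DETERMINERS:
            continue  # determiners are dropped from the units
        kind = "Relation" if lower in RELATION_WORDS else "Concept"
        if units and units[-1][0] == kind:
            # Extend the current unit so that word groups stay together.
            units[-1] = (kind, units[-1][1] + " " + token)
        else:
            units.append((kind, token))
    return units


units = smart_index("De Fortis Bank werd overgenomen door BNP Paribas.")
for kind, text in units:
    print(kind + ":", text)
```

Unlike a keyword index, the output keeps "Fortis Bank" and "BNP Paribas" as single units and records the relation that links them.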
2. Categorisation based on Smart Indexing
Preconditions:
Pre‐defined taxonomy/ontology
Top‐down processing
Advantages of Smart Indexing:
Smart Indexing results can be used to fill and enrich the taxonomy, thus ensuring the entries are:
relevant
precise
complete
2. Categorisation
Input:
The Agreement will be applied with the European Union and with the EFTA states.
Smart Indexing (concepts and relations):
The Agreement will be applied with the European Union and with the EFTA states.
Categorisation:
EFTA, EU
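A minimal sketch of taxonomy-based categorisation on top of detected concepts might look like this. The taxonomy entries, the concept list (assumed to be the output of a Smart Indexing step), and the function names are all hypothetical. The slides do not show how i.Know matches concepts to categories.

```python
# Hypothetical pre-defined taxonomy: category -> accepted concept phrases.
TAXONOMY = {
    "EU": {"european union", "eu"},
    "EFTA": {"efta", "efta states"},
}


def categorise(concepts, taxonomy=TAXONOMY):
    """Return the sorted categories whose entries match one of the concepts."""
    normalised = {c.lower() for c in concepts}
    return sorted(cat for cat, entries in taxonomy.items() if entries & normalised)


def unmatched_concepts(concepts, taxonomy=TAXONOMY):
    """Concepts that no taxonomy entry covers yet: candidates for enriching it."""
    known = set().union(*taxonomy.values())
    return [c for c in concepts if c.lower() not in known]


# Assumed Smart Indexing output for:
# "The Agreement will be applied with the European Union and with the EFTA states."
concepts = ["Agreement", "European Union", "EFTA states"]

categories = categorise(concepts)
candidates = unmatched_concepts(concepts)
print(categories, candidates)
```

Because whole concepts are matched rather than single keywords, "European Union" maps to one category instead of two unrelated index terms. The unmatched concepts give a simple starting point for filling and enriching the taxonomy.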
i.Know: news categorization (RIKS)
Demo
Thank you