CIG Conference Norwich September 2006
AUTINDEX: Automatic Indexing and Classification of Texts
Catherine Pease & Paul Schmidt
IAI, Saarbrücken
{cath,paul}@iai.uni-sb.de
http://www.iai.uni-sb.de
Automatic Indexing and Classification of Texts
AUTINDEX:
• calculates keywords in texts
• places text in its appropriate classification
APPLICATIONS
• Information Services for indexing scientific articles
• Document Management Systems for text classification according to content
• Libraries for indexing incoming books and articles
Basic Components
• Morpho-syntactic analysis: tagging and lemmatisation
• Shallow parsing: resolution of grammatical ambiguities and identification of NPs
Linguistic Resources for Pre-processing
• Morphological Analyser & Morpheme dictionaries
• Grammar rules for shallow parsing
Morphological Analyser
“Cost reduction”
cost:
{lu=cost,ls=cost,c=verb,vtype=fiv}
{lu=cost,ls=cost,c=verb,vtype=inf}
{lu=cost,ls=cost,c=noun,nb=sg}
reduction:
{lu=reduction,ls=reduce,c=noun,nb=sg}
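The readings above are flat feature bundles, so they are easy to load into dictionaries for further processing. The following is a minimal sketch, assuming the bracketed `name=value` notation shown on the slide; the function name `parse_reading` is hypothetical and not part of AUTINDEX, and the slide does not spell out what the feature names (`lu`, `ls`, `c`, `vtype`, `nb`) stand for.

```python
def parse_reading(reading: str) -> dict[str, str]:
    """Turn '{lu=cost,ls=cost,c=noun,nb=sg}' into a feature dictionary."""
    body = reading.strip().strip("{}")
    features = {}
    for pair in body.split(","):
        name, _, value = pair.partition("=")
        features[name.strip()] = value.strip()
    return features


# The three readings the slide lists for "cost".
cost_readings = [
    "{lu=cost,ls=cost,c=verb,vtype=fiv}",
    "{lu=cost,ls=cost,c=verb,vtype=inf}",
    "{lu=cost,ls=cost,c=noun,nb=sg}",
]
parsed = [parse_reading(r) for r in cost_readings]

# "cost" is ambiguous between two categories, which later steps must resolve.
categories = sorted({p["c"] for p in parsed})
print(categories)  # ['noun', 'verb']
```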
Shallow Parsing
The company evaluated the cost reduction
• "cost" is resolved as a noun
• Chunks: [The company] NP, [evaluated] finite verb, [the cost reduction] NP
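The two steps on this slide, resolving the noun/verb ambiguity of "cost" and then grouping words into NPs, can be illustrated with a toy example. This is only a sketch under stated assumptions: the lexicon, the single disambiguation rule (prefer the noun reading after a determiner or noun) and the chunking rule are all hypothetical stand-ins, not the grammar rules AUTINDEX actually uses.

```python
# Toy lexicon: the categories a morphological analyser might return
# for each word of the example sentence (hypothetical data).
LEXICON = {
    "the": {"det"},
    "company": {"noun"},
    "evaluated": {"verb"},
    "cost": {"noun", "verb"},
    "reduction": {"noun"},
}


def disambiguate(tokens):
    """Pick one category per token with a single illustrative rule."""
    tags = []
    for tok in tokens:
        cats = LEXICON[tok.lower()]
        if len(cats) == 1:
            tag = next(iter(cats))
        elif tags and tags[-1] in {"det", "noun"} and "noun" in cats:
            tag = "noun"  # inside a noun phrase, prefer the noun reading
        else:
            tag = "verb"
        tags.append(tag)
    return tags


def chunk(tokens, tags):
    """Group determiner + noun runs into NPs; label verbs as finite verbs."""
    chunks, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "det" and current:
            chunks.append(("NP", current))  # a determiner opens a new NP
            current = []
        if tag in {"det", "noun"}:
            current.append(tok)
        else:
            if current:
                chunks.append(("NP", current))
                current = []
            chunks.append(("finite verb", [tok]))
    if current:
        chunks.append(("NP", current))
    return chunks


tokens = "The company evaluated the cost reduction".split()
tags = disambiguate(tokens)
result = chunk(tokens, tags)
print(result)
```

For the example sentence this yields an NP, a finite verb and a second NP containing "cost reduction", which is the candidate multiword term for indexing.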
Controlled Indexing
• Identifies multiword terms and their syntactic variants
• Calculates keywords based on frequency and semantic weighting
• Checks thesaurus for relevant entry
• Classifies text
Linguistic Resources for Indexing
• Multiword Terms and Variants
  Direct match: cost reduction -> cost reduction
  Indirect match (inflectional differences): cost reduction -> cost reductions
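The distinction between direct and indirect matches can be sketched as a comparison before and after lemmatisation. The lemma table and the function names below are hypothetical; a real system would take lemmas from the morphological analyser rather than a hand-written dictionary.

```python
# Hypothetical lemma table standing in for the morphological analyser.
LEMMAS = {"reductions": "reduction", "costs": "cost"}


def lemmatise(phrase):
    """Lower-case a phrase and replace each word by its lemma if known."""
    return [LEMMAS.get(word, word) for word in phrase.lower().split()]


def match_kind(term, candidate):
    """Return 'direct', 'indirect' (inflection only) or None."""
    if term.lower() == candidate.lower():
        return "direct"
    if lemmatise(term) == lemmatise(candidate):
        return "indirect"
    return None


print(match_kind("cost reduction", "cost reduction"))   # direct
print(match_kind("cost reduction", "cost reductions"))  # indirect
```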
Linguistic Resources for Indexing
• Lexical synonyms: rise - increase
• Derivational synonyms:
  biomagnetic - biomagnetism
  air pollutant - air pollution
Linguistic Resources for Indexing
• Structural variants: costs of reduction - reduction costs
• Combined (structural plus derivational):
  transmitted DC power - DC power transmission
  to calculate plane waves - plane wave calculation
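One way to picture how structural and derivational variants could be mapped to the same term is to reduce each phrase to an order-independent set of stems, in the spirit of the `ls` feature from the morphological analysis. The stem table, the function-word list and the function names here are assumptions made for illustration; the slides do not describe how AUTINDEX implements this step.

```python
# Hypothetical stem table (compare ls=reduce for "reduction" in the
# morphological analysis) and a small function-word list.
STEMS = {
    "costs": "cost",
    "reduction": "reduce",
    "transmitted": "transmit",
    "transmission": "transmit",
    "calculation": "calculate",
    "waves": "wave",
}
FUNCTION_WORDS = {"of", "to", "the"}


def variant_key(phrase):
    """Drop function words, stem the rest, and ignore word order."""
    words = [w for w in phrase.lower().split() if w not in FUNCTION_WORDS]
    return frozenset(STEMS.get(w, w) for w in words)


def same_term(a, b):
    """True if two phrases reduce to the same set of stems."""
    return variant_key(a) == variant_key(b)


print(same_term("costs of reduction", "reduction costs"))
print(same_term("transmitted DC power", "DC power transmission"))
print(same_term("to calculate plane waves", "plane wave calculation"))
```

Ignoring word order keeps the sketch short, but it would also conflate phrases whose meaning depends on order, so a production system would need syntactic constraints on which reorderings count as variants.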
Semantic Weighting
• 140 semantic types in dictionaries
• Weight assigned to nouns depending on semantic type
• Result of weighting: a set of keywords belonging to the most frequent semantic classes
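As an illustration of combining frequency with a weight per semantic type, the sketch below multiplies the two and ranks the results. The slides do not give the actual formula, the list of 140 types or their weights, so the multiplication, the type names and the numbers are all assumptions.

```python
from collections import Counter

# Hypothetical semantic types and weights; the real dictionaries
# distinguish 140 types, which are not listed in the slides.
SEMANTIC_TYPE = {"reduction": "process", "pollutant": "substance", "idea": "abstract"}
TYPE_WEIGHT = {"process": 2.0, "substance": 1.5, "abstract": 0.5}


def rank_keywords(lemmas, top_n=2):
    """Score each noun lemma as frequency * weight of its semantic type."""
    freq = Counter(lemmas)
    scores = {
        lemma: count * TYPE_WEIGHT.get(SEMANTIC_TYPE.get(lemma), 1.0)
        for lemma, count in freq.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


nouns = ["reduction", "idea", "idea", "idea", "pollutant", "reduction"]
print(rank_keywords(nouns))  # [('reduction', 4.0), ('idea', 1.5)]
```

In this toy run "idea" is the most frequent noun, yet "reduction" ranks first because its assumed semantic type carries a higher weight.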
Classification
• Descriptors annotated with Classification Code
• Hyperonym and Synonym relations used
• Frequency used to calculate Topic Classification
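A frequency-based choice of topic from code-annotated descriptors might look like the sketch below. The descriptor-to-code table and the code labels are invented for the example, and picking the single most frequent code is an assumption about how the frequencies are used.

```python
from collections import Counter

# Hypothetical annotation of thesaurus descriptors with classification codes.
DESCRIPTOR_CODE = {
    "cost reduction": "ECON",
    "air pollution": "ENV",
    "emission": "ENV",
}


def classify(descriptor_counts):
    """Sum descriptor frequencies per code and return the top code."""
    totals = Counter()
    for descriptor, count in descriptor_counts.items():
        code = DESCRIPTOR_CODE.get(descriptor)
        if code is not None:
            totals[code] += count
    return totals.most_common(1)[0][0] if totals else None


found = {"cost reduction": 2, "air pollution": 2, "emission": 1}
print(classify(found))  # ENV
```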
User-Specific Thesauri
• Keywords checked against Thesaurus
• Hierarchical Structure of Thesaurus used to calculate Descriptors:
  - hyperonym relations
  - synonym relations
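The lookup of a keyword in a thesaurus through synonym and hyperonym relations can be sketched as follows. The tiny thesaurus, the names `SYNONYM_OF` and `BROADER`, and the strategy of climbing to the nearest broader descriptor are hypothetical; the slides state only that these relations are used to calculate descriptors.

```python
# Hypothetical miniature thesaurus.
DESCRIPTORS = {"increase", "change", "pollution"}
SYNONYM_OF = {"rise": "increase"}       # non-preferred term -> preferred term
BROADER = {"growth": "increase",        # narrower term -> hyperonym
           "increase": "change"}


def to_descriptor(keyword):
    """Map a keyword to a descriptor via synonyms, then hyperonyms."""
    term = SYNONYM_OF.get(keyword, keyword)
    seen = set()
    while term not in DESCRIPTORS:
        if term in seen or term not in BROADER:
            return None  # no descriptor reachable (or a cycle in the data)
        seen.add(term)
        term = BROADER[term]
    return term


print(to_descriptor("rise"))    # increase
print(to_descriptor("growth"))  # increase
```

A keyword that maps to no descriptor returns `None` and could then be kept as a free term.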
Example Output
• Keywords: List of descriptors from thesaurus plus weighting
• List of free terms / free descriptors plus weighting
• Topic Classification with relevant code
Free Indexing
• Free indexing follows the same steps as for controlled indexing but without the use of a thesaurus
• The result is a list of free descriptors
Architecture
Bilingual Components
• Automatic language recognition
• Bilingual dictionaries
• Bilingual thesauri
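Automatic language recognition decides which dictionaries and thesaurus to apply to a text. The slides do not say how it is done or which language pair is meant; the sketch below uses a simple stopword count for English and German purely as an assumed illustration.

```python
# Hypothetical stopword lists; the language pair is an assumption.
STOPWORDS = {
    "en": {"the", "of", "and", "is", "in", "to"},
    "de": {"der", "die", "das", "und", "ist", "von"},
}


def guess_language(text):
    """Return the language whose stopwords occur most often, or None."""
    words = text.lower().split()
    scores = {
        lang: sum(word in stopwords for word in words)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_language("The cost of the reduction is high"))     # en
print(guess_language("Die Reduktion der Kosten ist wichtig"))  # de
```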
Libraries & the Internet
• Switch of focus from libraries to the Internet because of:
- search engines, e.g. Google
- poor access to library resources
Reasons for Poor Access
• search tools need full text match
• human indexation too general and inconsistent
• no flexibility in terms of semantic relations
AUTINDEX in Libraries
• A high percentage of all queries return no hits in the electronic library catalogue
• Of the rest, a high percentage is not used
IntelligentCAPTURE
• Complete processing chain for digital content in libraries:
- scanning of contents tables
- treatment with OCR technology
- automatic indexation
- feeding results into library system
- integration of improved retrieval system
Dandelon database
• Supports 16 EU languages for multilingual retrieval
• Running in 4 countries at 9 libraries
Work Flow
Summary
• AUTINDEX provides for controlled and free indexing
• Integrated in a complete processing chain, AUTINDEX can be used to improve access to library resources through efficient methods of indexation