Finding Domain Terms using Wikipedia
Jorge Vivaldi Palatresi
Applied Linguistics Institute
Universitat Pompeu Fabra
Horacio Rodríguez Hontoria
TALP Research Center
Universitat Politècnica de Catalunya
[email protected]
Outline
• Introduction
• Related approaches
• Methodology
• Evaluation
• Conclusions and future work
Introduction
• Problem: to automatically extract terminological units from specialized texts
• Result: a list of all the Wikipedia (WP) categories and page titles that our system considers to belong to the domain of interest.
Related approaches
• Magnini et al., 2000
• Montoyo et al., 2001
• Missikoff et al., 2002
• Vivaldi, Rodríguez, 2002
• Vivaldi, Rodríguez, 2004
• Bernardini et al., 2006
• Cui et al., 2008
Graph structure of Wikipedia
[Figure: the WP category graph (nodes A to G) linked to WP pages (P1, P2, P3). Other WP resources shown: redirection table, disambiguation pages, interwiki links, external links, infobox.]
Methodology: overview
[Figure: pipeline. Starting from the domain name, WP categories are collected (top categories, then domain categories) together with WP pages (domain pages). Both sets are filtered, with bootstrapping, to produce the final domain term set.]

Main steps:
1) Find the domain name in WP as a category.
2) Look for all the subcategories and pages related to the domain.
3) Extract all descendants of the domain category, avoiding loops.
4) Remove proper names and service classes.
5) Filter categories and pages.
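Step 3 (collecting descendants while avoiding loops) can be illustrated with a minimal sketch. It assumes the category graph is available as a plain dictionary from a category to its direct subcategories; the function name `collect_descendants` and the toy graph are hypothetical and are not taken from the authors' system.

```python
from collections import deque


def collect_descendants(top, subcats):
    """Breadth-first walk of the category graph starting at `top`.

    `subcats` maps a category name to its direct subcategories
    (a hypothetical, in-memory stand-in for the WP category table).
    The `seen` set guarantees each category is visited once, so a
    cycle in the graph cannot cause an endless walk.
    """
    seen = {top}
    queue = deque([top])
    while queue:
        cat = queue.popleft()
        for child in subcats.get(cat, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Toy graph with a loop: "Batteries" points back to its ancestor.
toy_graph = {
    "Chemistry": ["Electrochemistry", "Chemical compounds"],
    "Electrochemistry": ["Batteries"],
    "Batteries": ["Electrochemistry"],
}
descendants = collect_descendants("Chemistry", toy_graph)
```

A visited set is the simplest way to make the traversal terminate on a graph that is not a strict tree, which is the situation the step describes.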
Methodology: filtering
• Category level
• Page level
Methodology: filtering
• Category level
[Figure: the top category of the domain spans CatSet1. For a category C, its direct super-categories are split into those inside CatSet1, those outside CatSet1, and neutral ones; these groups yield the category score.]
Methodology: filtering
• Page level
[Figure: the top category of the domain spans CatSet2. For a page P of category C, its categories are split into those inside CatSet2, those outside CatSet2, and neutral ones; these groups yield the page score.]
Methodology: category filtering
CatSet1: set of descendant categories
For each $c \in \mathrm{CatSet1}$:
  $A_c$: set of direct supercategories of $c$
  $n_1 = \#\{a \in A_c : a \in \mathrm{CatSet1}\}$
  $n_2 = \#\{a \in A_c : a \notin \mathrm{CatSet1}\}$
  accept $c$ if $n_1 > n_2$
CatSet2: set of filtered descendant categories
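The category-level filter on this slide compares, for each category, how many of its direct supercategories fall inside and outside CatSet1. The sketch below is one possible reading, under three assumptions that the slide does not confirm: the comparison is strict (inside count greater than outside count), neutral supercategories are simply left out of both counts, and the top category is always kept. All names and the toy data are hypothetical.

```python
def filter_categories(top, catset1, supercats, neutral=frozenset()):
    """Return the categories of `catset1` that pass the filter (CatSet2).

    supercats: maps a category to its direct supercategories.
    neutral:   supercategories ignored in both counts (assumption).
    """
    accepted = {top}  # assumption: the domain's top category always stays
    for c in catset1:
        direct = [a for a in supercats.get(c, ()) if a not in neutral]
        n1 = sum(1 for a in direct if a in catset1)  # inside the domain
        n2 = len(direct) - n1                        # outside the domain
        if n1 > n2:  # assumption: strict comparison
            accepted.add(c)
    return accepted


catset1 = {"Chemistry", "Electrochemistry", "Cheeses"}
supercats = {
    "Electrochemistry": ["Chemistry", "Hidden categories"],
    "Cheeses": ["Chemistry", "Dairy products", "Foods"],
}
catset2 = filter_categories(
    "Chemistry", catset1, supercats, neutral={"Hidden categories"}
)
```

In this toy input "Cheeses" has one supercategory inside the domain and two outside, so it is dropped, while "Electrochemistry" is kept.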
Methodology: page filtering
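termCats(dtc): set of categories assigned to a dtc
$\mathrm{PathToDomainCat}(c) = \begin{cases} 1 & \text{if } c \in \mathrm{CatSet2} \\ 0 & \text{if } c \notin \mathrm{CatSet2} \end{cases}$
$\mathrm{WPDC}(dtc) = \dfrac{\sum_{c \in \mathrm{termCats}(dtc)} \mathrm{PathToDomainCat}(c)}{\sum_{c \in \mathrm{termCats}(dtc)} \bigl(1 - \mathrm{PathToDomainCat}(c)\bigr)}$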
Additional category filtering using page scores:
catTerm: set of pages associated with a category
- MicroStrict: accept the category if the number of elements of catTerm with a positive score is greater than the number with a negative score.
- MicroLoose: the same, with a greater-than-or-equal test.
- Macro: instead of counting the pages with positive/negative scores, use the components of those scores.
Page filtering example: “semantics” (in Computing domain)
[Figure: category paths linking "semantics" to the top category Computing. Nodes shown: theoretical computer science, software, software engineering, formal methods.]
Categories of "semantics": {linguistics, philosophy of language, semiotics, theoretical computer science, philosophical logic}
WPDC(semantics) = 0.25
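The score of 0.25 for "semantics" can be reproduced with a small sketch. It rests on two assumptions that are not stated on the slide: the score is the ratio of the page's categories that lie in the filtered domain set to those that do not, and of the five categories listed only "theoretical computer science" lies in that set. The function name and the fallback for pages with no outside category are hypothetical.

```python
def page_score(term_cats, domain_cats):
    """Ratio of in-domain to out-of-domain categories of a page.

    term_cats:   categories assigned to the page.
    domain_cats: filtered domain categories (CatSet2 on the slides).
    """
    inside = sum(1 for c in term_cats if c in domain_cats)
    outside = len(term_cats) - inside
    if outside == 0:
        # Assumption: a page with no outside category gets an unbounded score.
        return float("inf")
    return inside / outside


semantics_cats = [
    "linguistics",
    "philosophy of language",
    "semiotics",
    "theoretical computer science",
    "philosophical logic",
]
# Assumed domain set: the Computing-side nodes shown in the example figure.
computing_cats = {
    "theoretical computer science",
    "software",
    "software engineering",
    "formal methods",
}
score = page_score(semantics_cats, computing_cats)  # 1 inside, 4 outside
```

With one category inside the domain and four outside, the ratio is 1/4, which matches the value on the slide.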
Category filtering example using pages score: “chemistry”
#  DTC                                 MicroStrict   MicroLoose   Macro      Vote   Result
                                       ok    ko      ok    ko     ok    ko
1  electroquímica (electrochemistry)   13    5       16    2      36    12   +3     Accept
2  quesos (cheeses)                    0     8       6     2      8     12   -1     Reject
3  óxidos de carbono (carbon oxides)   1     1       2     0      4     3    +2     Accept
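The Vote column of the three rows above is consistent with a simple rule: each criterion contributes +1 when its ok count exceeds its ko count, -1 when it is lower, and 0 on a tie, and the category is accepted when the total is positive. This rule is inferred from the three rows only and is an assumption, not a documented part of the method; the names below are hypothetical.

```python
def sign(x):
    """Return +1, 0 or -1 according to the sign of x."""
    return (x > 0) - (x < 0)


def vote(counts):
    """Sum one signed vote per criterion; `counts` holds (ok, ko) pairs."""
    return sum(sign(ok - ko) for ok, ko in counts)


# (ok, ko) pairs in the order MicroStrict, MicroLoose, Macro.
rows = {
    "electroquímica": [(13, 5), (16, 2), (36, 12)],
    "quesos": [(0, 8), (6, 2), (8, 12)],
    "óxidos de carbono": [(1, 1), (2, 0), (4, 3)],
}
# Assumption: a category is accepted when the summed vote is positive.
results = {
    name: (vote(counts), "Accept" if vote(counts) > 0 else "Reject")
    for name, counts in rows.items()
}
```

The tie in the MicroStrict column of row 3 is what distinguishes this rule from a plain accept/reject count: it contributes nothing, giving +2 rather than +1.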
Evaluation
• Partial evaluation: "Chemistry" and "Astronomy"
  – Test against Magnini et al., 2000 (WordNet 1.6)
  – Low coverage: 25% for Chemistry and 15% for Astronomy
• Full evaluation: "Medicine"
  – Test against SNOMED-CT Spanish Edition (2009)
  – Wide coverage of the clinical domain: 800K terms
Partial evaluation (iteration #1)

Domain                          Chemistry           Astronomy
Language                        EN        ES        EN        ES
Initial categories              188374    2070      188816    44631
Categories after pruning        1334      557       790       143
Categories                      49        43        5         6
Category precision              93.9      62.8      0         16.7
Pages found (loose)             833       1038      284       119
Pages found (strict)            580       700       284       81
Page precision [%] (loose)      61.3      52.6      34.8      31.9
Page precision [%] (strict)     62.7      56.6      37.2      27.2
[Figure: precision per iteration (1 to 6) for Chemistry. Series: EN-loose, EN-strict, ES-loose, ES-strict.]
[Figure: precision per iteration (1 to 6) for Astronomy. Series: EN-loose, EN-strict, ES-loose, ES-strict.]
Full evaluation (iteration #1)

Evaluation using                WN        SNOMED-CT
Initial categories              2431
Categories after pruning        839
Categories                      174       394
Category precision              27.6      54
Pages (loose)                   2091      4182
Pages (strict)                  1724      3492
Page precision [%] (loose)      21.0      58
Page precision [%] (strict)     23.2      62
[Figure: precision per iteration (1 to 6) for Medicina (Medicine). Series in the legend: ES-loose-WN (listed twice), ES-loose-SNOMED, ES-strict-SNOMED.]
Validation issues
Accepts: whisky, cigar, udder, fire
Rejects: oral cancer, renal colic, phoniatrics, surgical instruments
Conclusions
• Good results when evaluated against a specialised resource
• Term list filtering must be improved (e.g., by eliminating proper names)
Future work
• Apply this method to other languages/domains
• Improve filtering using in/out links of selected pages
• Improve filtering by also using the page content
• Use this WP knowledge to improve a term extractor