Finding Domain Terms using Wikipedia
Jorge Vivaldi Palatresi, Applied Linguistics Institute, Universitat Pompeu Fabra
Horacio Rodríguez Hontoria, TALP Research Center, Universitat Politècnica de Catalunya
Outline
• Introduction
• Related approaches
• Methodology
• Evaluation
• Conclusions and future work
Introduction
• Problem: to automatically extract terminological units from specialized texts
• Result: a list of all the Wikipedia (WP) categories and page titles that our system considers to belong to the domain of interest.
Related approaches
• Magnini et al., 2000
• Montoyo et al., 2001
• Missikoff et al., 2002
• Vivaldi, Rodríguez, 2002
• Vivaldi, Rodríguez, 2004
• Bernardini et al., 2006
• Cui et al., 2008
Graph structure of Wikipedia
[Figure: Wikipedia as two linked graphs: WP categories (nodes A to G) and WP pages (nodes P1 to P3). Other elements shown: redirection table, disambiguation pages, interwiki links, external links, InfoBox.]
Methodology: overview
Main steps:
1) Find the domain name as a category in WP.
2) Look for all the subcategories/pages related to the domain.
3) Extract all descendants of the domain name, avoiding loops.
4) Remove proper names and service classes.
5) Filter categories and pages.
[Figure: pipeline diagram. Labels: WP, domain, top categories, domain categories, domain pages, filtering (categories), filtering (pages), bootstrapping, final domain term set.]
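Step 3 (extracting all descendants while avoiding loops) can be illustrated with a breadth-first traversal that remembers which categories it has already visited. This is only a minimal sketch: the function name `collect_descendants`, the dictionary representation of the category graph and the toy graph itself are assumptions made for illustration, not the authors' implementation.

```python
from collections import deque


def collect_descendants(top_category, subcategories):
    """Return every category reachable from top_category, each visited once.

    subcategories: dict mapping a category to its direct subcategories
    (a hypothetical in-memory stand-in for the WP category graph).
    """
    seen = {top_category}
    queue = deque([top_category])
    while queue:
        current = queue.popleft()
        for child in subcategories.get(current, ()):
            if child not in seen:  # already-visited nodes are skipped, so loops end
                seen.add(child)
                queue.append(child)
    seen.discard(top_category)  # the top category is not its own descendant
    return seen


# Hypothetical toy graph containing a loop (C points back to A).
toy_graph = {"A": ["B", "C"], "B": ["D"], "C": ["A", "D"], "D": []}
descendants = collect_descendants("A", toy_graph)
```

The visited set is what keeps the traversal finite: the WP category graph is not a strict tree, so a plain recursive descent could revisit the same category indefinitely.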
Methodology: filtering
• Category level
• Page level
Methodology: filtering
• Category level
[Figure: the top category of the domain and its descendants (CatSet1). For a category C, the direct super-categories are grouped into those in CatSet1, those not in CatSet1 and neutral ones; the groups feed the category score.]
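Methodology: filtering
• Page level
[Figure: the top category of the domain and its filtered descendants (CatSet2). For a page P, its categories are grouped into those in CatSet2, those not in CatSet2 and neutral categories; the groups feed the page score.]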
Methodology: category filtering
CatSet1: set of descendant categories
CatSet2: set of filtered descendant categories
a: set of direct supercategories of c
$$n_1 = \#\{a \in \mathrm{CatSet1}\}, \qquad n_2 = \#\{a \notin \mathrm{CatSet1}\}$$
$$\mathrm{CatSet2} = \{\, c \in \mathrm{CatSet1} : \text{accept if } (n_1 > n_2) \,\}$$
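The category filter can be sketched in Python as follows. Several points are assumptions rather than facts taken from the slides: the acceptance test is taken to be n1 > n2, the top category is counted as belonging to the domain when it appears as a super-category, neutral super-categories are simply ignored, and all names and the toy data are hypothetical.

```python
def filter_categories(top, cat_set1, supercategories, neutral=frozenset()):
    """Keep the categories of cat_set1 whose parents are mostly in the domain.

    top: top category of the domain (assumed to count as in-domain).
    supercategories: dict mapping a category to its direct super-categories.
    neutral: super-categories ignored by the count.
    """
    in_domain = set(cat_set1) | {top}
    cat_set2 = set()
    for c in cat_set1:
        parents = [a for a in supercategories.get(c, ()) if a not in neutral]
        n1 = sum(1 for a in parents if a in in_domain)  # parents inside the domain
        n2 = len(parents) - n1                          # parents outside the domain
        if n1 > n2:  # assumed acceptance test
            cat_set2.add(c)
    return cat_set2


# Hypothetical toy data, not taken from Wikipedia.
toy_supercategories = {
    "electrochemistry": ["chemistry", "categories by topic"],
    "cheeses": ["chemistry", "dairy products", "food"],
}
cat_set2 = filter_categories(
    top="chemistry",
    cat_set1={"electrochemistry", "cheeses"},
    supercategories=toy_supercategories,
    neutral={"categories by topic"},
)
```

In this toy run "electrochemistry" has one in-domain parent and none outside, so it is kept, while "cheeses" has one in-domain parent against two outside and is dropped.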
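Methodology: page filtering
termCats: set of categories assigned to a dtc
$$\mathrm{PathToDomainCat}(c) = \begin{cases} 1 & \text{if } c \in \mathrm{CatSet2} \\ 0 & \text{if } c \notin \mathrm{CatSet2} \end{cases}$$
$$\mathrm{WPDC}(dtc) = \frac{\sum_{c \in \mathrm{termCats}} \mathrm{PathToDomainCat}(c)}{\sum_{c \in \mathrm{termCats}} \bigl(1 - \mathrm{PathToDomainCat}(c)\bigr)}$$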
Additional category filtering using page scores:
catTerm: set of pages associated with a category
- MicroStrict: accept the category if the number of elements of catTerm with a positive score is greater than the number of elements with a negative score.
- MicroLoose: the same, with a greater-or-equal test.
- Macro: instead of counting the pages with positive/negative scores, use the components of those scores.
Page filtering example: “semantics” (in Computing domain)
Categories of “semantics”: {linguistics, philosophy of language, semiotics, theoretical computer science, philosophical logic}
[Figure: category graph of the Computing domain. Labels: Computing, theoretical computer science, software, software engineering, formal methods, semantics.]
WPCD(semantics) = 0.25
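The score of 0.25 for “semantics” can be reproduced with a short sketch. It rests on two assumptions: that the page score is the number of a page's categories lying in the filtered domain set divided by the number lying outside it, and that the Computing category set contains the four categories visible in the figure. Both are readings of the slide rather than confirmed definitions, and the function names are hypothetical.

```python
def path_to_domain_cat(category, cat_set2):
    """1 if the category is in the filtered domain set, else 0."""
    return 1 if category in cat_set2 else 0


def domain_coefficient(term_cats, cat_set2):
    """Assumed page score: in-domain categories divided by out-of-domain ones."""
    inside = sum(path_to_domain_cat(c, cat_set2) for c in term_cats)
    outside = len(term_cats) - inside
    if outside == 0:
        return float("inf")  # every category is in the domain
    return inside / outside


# Categories of the page "semantics", as listed on the slide.
semantics_cats = [
    "linguistics",
    "philosophy of language",
    "semiotics",
    "theoretical computer science",
    "philosophical logic",
]
# Assumed Computing categories, taken from the labels of the figure.
computing_cat_set2 = {
    "theoretical computer science",
    "software",
    "software engineering",
    "formal methods",
}
score = domain_coefficient(semantics_cats, computing_cat_set2)  # 1 / 4 = 0.25
```

Only “theoretical computer science” lies in the assumed domain set, so the ratio is one in-domain category against four outside, which matches the value on the slide.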
Category filtering example using page scores: “chemistry”
#  DTC                                 MicroStrict   MicroLoose   Macro      Vote  Result
                                       ok    ko      ok    ko     ok    ko
1  electroquímica (electrochemistry)   13    5       16    2      36    12   +3    Accept
2  quesos (cheeses)                    0     8       6     2      8     12   -1    Reject
3  óxidos de carbono (carbon oxides)   1     1       2     0      4     3    +2    Accept
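The Vote column is consistent with a simple scheme in which each of the three criteria contributes +1 when its "ok" count exceeds its "ko" count, -1 when it is lower and 0 on a tie, with the category accepted when the total is positive. That scheme is inferred from the three rows of the table and is an assumption, not a rule stated on the slide; the function names are hypothetical.

```python
def sign(x):
    """Return +1, 0 or -1 according to the sign of x."""
    return (x > 0) - (x < 0)


def vote(counts):
    """Sum one vote per criterion from (ok, ko) pairs (assumed scheme)."""
    return sum(sign(ok - ko) for ok, ko in counts)


# (ok, ko) pairs for MicroStrict, MicroLoose and Macro, copied from the table.
rows = {
    "electroquímica": [(13, 5), (16, 2), (36, 12)],
    "quesos": [(0, 8), (6, 2), (8, 12)],
    "óxidos de carbono": [(1, 1), (2, 0), (4, 3)],
}
votes = {name: vote(counts) for name, counts in rows.items()}
results = {name: "Accept" if v > 0 else "Reject" for name, v in votes.items()}
```

Under this reading the tie in the MicroStrict column for "óxidos de carbono" contributes nothing, which is why its vote is +2 rather than +3.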
Evaluation
• Partial evaluation: “chemistry” and “astronomy”
  – Test against Magnini et al., 2000 (WordNet 1.6)
  – Low coverage: 25% for Chemistry and 15% for Astronomy
• Full evaluation: “Medicine”
  – Test against SNOMED-CT Spanish Edition (2009)
  – Wide coverage of the clinical domain: 800K terms
Partial evaluation
Domain                           Chemistry          Astronomy
Language                         EN       ES        EN       ES
Initial categories               188374   2070      188816   44631
Categories after pruning         1334     557       790      143
Iteration #1
  Categories                     49       43        5        6
  Categories: precision [%]      93.9     62.8      0        16.7
  Pages found (loose)            833      1038      284      119
  Pages found (strict)           580      700       284      81
  Pages: precision [%] (loose)   61.3     52.6      34.8     31.9
  Pages: precision [%] (strict)  62.7     56.6      37.2     27.2
[Figure: Chemistry. Precision (y axis, 50 to 70) per iteration (x axis, 1 to 6). Series: EN-loose, EN-strict, ES-loose, ES-strict.]
[Figure: Astronomy. Precision (y axis, 20 to 50) per iteration (x axis, 1 to 6). Series: EN-loose, EN-strict, ES-loose, ES-strict.]
Full evaluation
Evaluation using                 WN       SNOMED-CT
Initial categories               2431
Categories after pruning         839
Iteration #1
  Categories                     174      394
  Categories: precision [%]      27.6     54
  Pages found (loose)            2091     4182
  Pages found (strict)           1724     3492
  Pages: precision [%] (loose)   21.0     58
  Pages: precision [%] (strict)  23.2     62
[Figure: Medicina (Medicine). Precision (y axis, 10 to 70) per iteration (x axis, 1 to 6). Legend entries: ES-loose-WN, ES-loose-WN, ES-loose-SNOMED, ES-strict-SNOMED.]
Validation issues
• Accepts: whisky, cigar, udder, fire
• Reject: oral cancer, renal colic, phoniatrics, surgical instruments
Conclusions
• Good results when evaluated against a specialised resource
• Term list filtering must be improved (e.g. by eliminating proper names)
Future work
• Apply this method to other languages/domains
• Improve filtering using the in/out links of selected pages
• Improve filtering using the page content as well
• Use this WP knowledge to improve a term extractor