Finding Domain Terms using Wikipedia Jorge Vivaldi Palatresi Applied Linguistics Institute...

Finding Domain Terms using Wikipedia. Jorge Vivaldi Palatresi, Applied Linguistics Institute, Universitat Pompeu Fabra. Horacio Rodríguez Hontoria, TALP Research Center, Universitat Politécnica de Catalunya.


Page 1: Finding Domain Terms using Wikipedia Jorge Vivaldi Palatresi Applied Linguistics Institute Universitat Pompeu Fabra Horacio Rodríguez.

Finding Domain Terms using Wikipedia

Jorge Vivaldi Palatresi
Applied Linguistics Institute
Universitat Pompeu Fabra

Horacio Rodríguez Hontoria
TALP Research Center
Universitat Politécnica de Catalunya

Page 2

Outline

• Introduction
• Related approaches
• Methodology
• Evaluation
• Conclusions and future work

Page 3

Introduction

• Problem: to automatically extract terminological units from specialized texts

• Result: a list of all the Wikipedia (WP) categories and page titles that our system considers to belong to the domain of interest.

Page 4

Related approaches

• Magnini et al., 2000
• Montoyo et al., 2001
• Missikoff et al., 2002
• Vivaldi, Rodríguez, 2002
• Vivaldi, Rodríguez, 2004
• Bernardini et al., 2006
• Cui et al., 2008

Page 5

Graph structure of Wikipedia

[Figure: Wikipedia drawn as a graph with two parts, WP categories (nodes A to G) and WP pages (P1, P2, P3). Other labelled elements: redirection table, disambiguation pages, interwiki links, external links, InfoBox.]

Page 6

Methodology: overview

[Figure: pipeline overview. Labels: WP, domain, top categories, domain categories, domain pages, filtering (categories), filtering (pages), bootstrapping, final domain term set.]

Main steps:

1) Find the domain name as a category in WP.
2) Look for all the subcategories/pages related to the domain.
3) Extract all descendants of the domain name, avoiding loops.
4) Remove proper names and service classes.
5) Filter categories and pages.
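The descendant extraction in step 3 can be sketched as a breadth-first traversal with a visited set, which is one common way to avoid loops in a category graph. This is a minimal sketch, not the authors' implementation: the `subcategories` mapping and the toy category names are hypothetical stand-ins for data read from Wikipedia.

```python
from collections import deque


def collect_descendants(top_category, subcategories):
    """Return every category reachable from top_category, visiting each once.

    subcategories is a hypothetical dict mapping a category name to the list
    of its direct subcategories (an assumption; the slides do not specify
    the data source or its format).
    """
    visited = {top_category}
    queue = deque([top_category])
    while queue:
        current = queue.popleft()
        for child in subcategories.get(current, []):
            if child not in visited:  # already-seen nodes are skipped, which breaks loops
                visited.add(child)
                queue.append(child)
    visited.discard(top_category)  # keep only proper descendants
    return visited


# Toy graph with a loop: C lists the top category A as a subcategory.
toy_graph = {"A": ["B", "C"], "B": ["D"], "C": ["A", "D"], "D": []}
print(sorted(collect_descendants("A", toy_graph)))
```

The visited set guarantees termination even when a category is (directly or indirectly) its own ancestor, as in the toy graph.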

Page 7

Methodology: filtering

• Category level

• Page level

Page 8

Methodology: filtering

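• Category level

[Figure: category-level filtering. CatSet1 is collected below the top category of the domain; for each category C, the Category Score takes as inputs its direct super-categories ∈ CatSet1, its direct super-categories ∉ CatSet1, and its direct neutral super-categories.]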

Page 9

Methodology: filtering

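• Page level

[Figure: page-level filtering. CatSet2 lies below the top category of the domain; for each page P among the pages of a category C, the Page Score takes as inputs its categories ∈ CatSet2, its categories ∉ CatSet2, and its neutral categories.]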

Page 10

Methodology: category filtering

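CatSet1: set of descendant categories

For each $c \in CatSet_1$, with $a$ ranging over the set of direct supercategories of $c$:

$$\#_1 = \#\{a : a \in CatSet_1\}, \qquad \#_2 = \#\{a : a \notin CatSet_1\}$$

$$CatSet_2 = \{\, c \in CatSet_1 : \text{accept if } (\#_1 \;\ldots\; \#_2) \,\}$$

CatSet2: set of filtered descendant categories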
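A minimal sketch of the category-level filter, assuming that a category is kept when more of its direct supercategories fall inside the domain than outside it. The strict comparison, the decision to ignore neutral supercategories, the treatment of the top category as in-domain, and all names and toy data below are assumptions for illustration; the slides do not spell these details out.

```python
def filter_categories(cat_set1, supercategories, top_category, neutral=frozenset()):
    """Build CatSet2 from CatSet1 under the assumptions stated above.

    supercategories: hypothetical dict mapping a category to its direct
    supercategories. neutral: hypothetical set of supercategories that are
    counted neither for nor against a category.
    """
    in_domain = set(cat_set1) | {top_category}  # assumption: the top category counts as in-domain
    cat_set2 = set()
    for c in cat_set1:
        parents = [a for a in supercategories.get(c, []) if a not in neutral]
        n_in = sum(1 for a in parents if a in in_domain)  # supercategories inside the domain
        n_out = len(parents) - n_in                       # supercategories outside the domain
        if n_in > n_out:  # assumed acceptance test (strict majority)
            cat_set2.add(c)
    return cat_set2


# Toy data, invented for the example.
toy_cat_set1 = {"physical chemistry", "electrochemistry", "cheeses"}
toy_supers = {
    "physical chemistry": ["chemistry"],
    "electrochemistry": ["physical chemistry", "chemistry"],
    "cheeses": ["dairy products", "foods", "physical chemistry"],
}
print(sorted(filter_categories(toy_cat_set1, toy_supers, "chemistry")))
```

In the toy data, "cheeses" is dropped because two of its three supercategories lie outside the domain.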

Page 11

Methodology: page filtering

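termCats(dtc): set of categories assigned to a dtc

$$PathToDomainCat(c) = \begin{cases} 1 & \text{if } c \in CatSet_2 \\ 0 & \text{if } c \notin CatSet_2 \end{cases}$$

WPDC(dtc): quotient of two sums over $c \in termCats(dtc)$, each built from $PathToDomainCat(c)$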

Additional category filtering using page scores:

catTerm: set of pages associated with a category

- MicroStrict: accept the category if the number of elements of catTerm with a positive score is greater than the number of elements with a negative score.
- MicroLoose: the same, with a greater-or-equal test.
- Macro: instead of counting the pages with positive/negative scores, use the components of those scores.
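The two Micro tests can be sketched as a comparison of counts. This is a hedged illustration only: it assumes page scores are signed numbers and that a score of exactly zero counts for neither side; the function name and sample scores are hypothetical. Macro is left out because the slides do not define the score components it uses.

```python
def micro_accept(page_scores, strict=True):
    """Decide whether a category is accepted from the scores of its pages.

    page_scores: hypothetical list of signed page scores for the pages in
    catTerm. strict=True applies the MicroStrict test (greater than),
    strict=False the MicroLoose test (greater than or equal).
    """
    positives = sum(1 for s in page_scores if s > 0)
    negatives = sum(1 for s in page_scores if s < 0)
    if strict:
        return positives > negatives
    return positives >= negatives


# One positive and one negative page: a tie, so the two tests disagree.
tied_scores = [0.5, -0.2]
print(micro_accept(tied_scores, strict=True), micro_accept(tied_scores, strict=False))
```

The two tests differ only on ties, which is why a category with balanced evidence can pass MicroLoose and fail MicroStrict.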

Page 12

Page filtering example: “semantics” (in Computing domain)

[Figure: category paths between the page "semantics" and the domain category "Computing". Labelled nodes: theoretical computer science, software, software engineering, formal methods.]

Categories of "semantics": {linguistics, philosophy of language, semiotics, theoretical computer science, philosophical logic}

WPDC(semantics) = 0.25
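The following sketch assumes that the score is the ratio of a page's categories inside CatSet2 to those outside it. That reading is an assumption, chosen because it reproduces the 0.25 of the "semantics" example when only "theoretical computer science" lies in the Computing domain; it is not a definition confirmed by the slides. The function names and the contents of `computing_cat_set2` are likewise hypothetical.

```python
import math


def path_to_domain_cat(category, cat_set2):
    """1 if the category belongs to CatSet2, 0 otherwise."""
    return 1 if category in cat_set2 else 0


def domain_ratio(term_cats, cat_set2):
    """Assumed score: in-domain categories divided by out-of-domain categories."""
    n_in = sum(path_to_domain_cat(c, cat_set2) for c in term_cats)
    n_out = len(term_cats) - n_in
    if n_out == 0:
        # No out-of-domain category: the ratio is unbounded (or 0 for an empty list).
        return math.inf if n_in else 0.0
    return n_in / n_out


semantics_cats = [
    "linguistics",
    "philosophy of language",
    "semiotics",
    "theoretical computer science",
    "philosophical logic",
]
# Assumed CatSet2 for Computing, taken from the node labels on the slide.
computing_cat_set2 = {
    "theoretical computer science",
    "software",
    "software engineering",
    "formal methods",
}
print(domain_ratio(semantics_cats, computing_cat_set2))
```

A plain average over all five categories would give 0.2 instead, so the two readings can be told apart on this example.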

Page 13

Category filtering example using pages score: “chemistry”

#  DTC                                  MicroStrict   MicroLoose   Macro      Vote  Result
                                        ok    ko      ok    ko     ok    ko
1  electroquímica (electrochemistry)    13    5       16    2      36    12   +3    Accept
2  quesos (cheeses)                     0     8       6     2      8     12   -1    Reject
3  óxidos de carbono (carbon oxides)    1     1       2     0      4     3    +2    Accept
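The Vote column is consistent with each of the three criteria casting +1 when ok exceeds ko, -1 when ko exceeds ok, and 0 on a tie, with the category accepted when the sum is positive. Both the per-criterion rule and the acceptance threshold are inferred from the three rows above rather than stated on the slide, so the sketch below should be read as one plausible reconstruction.

```python
def sign(x):
    """Return +1, 0 or -1 according to the sign of x."""
    return (x > 0) - (x < 0)


def vote(counts):
    """Sum one vote per criterion; counts is a list of (ok, ko) pairs."""
    return sum(sign(ok - ko) for ok, ko in counts)


# (ok, ko) pairs for MicroStrict, MicroLoose and Macro, copied from the table.
rows = {
    "electroquímica": [(13, 5), (16, 2), (36, 12)],
    "quesos": [(0, 8), (6, 2), (8, 12)],
    "óxidos de carbono": [(1, 1), (2, 0), (4, 3)],
}
for name, counts in rows.items():
    total = vote(counts)
    # Assumed decision rule: accept when the summed vote is positive.
    print(name, total, "Accept" if total > 0 else "Reject")
```

The tie in the MicroStrict column of row 3 contributes 0, which is what yields +2 rather than +1 or +3.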

Page 14

Evaluation

• Partial evaluation: “chemistry” and “astronomy”
– Test against Magnini et al., 2000 (WordNet 1.6)
– Low coverage: 25% for Chemistry and 15% for Astronomy

• Full evaluation: “Medicine”
– Test against SNOMED-CT Spanish Edition (2009)
– Wide coverage of the clinical domain: 800K terms

Page 15

Partial evaluation

Domain                            Chemistry           Astronomy
Language                          EN        ES        EN        ES
Initial categories                188374    2070      188816    44631
#Categories after pruning         1334      557       790       143
Iteration #1
  Categories                      49        43        5         6
  Precision [%]                   93.9      62.8      0         16.7
  Pages found (loose)             833       1038      284       119
  Pages found (strict)            580       700       284       81
  Page prec. [%] (loose)          61.3      52.6      34.8      31.9
  Page prec. [%] (strict)         62.7      56.6      37.2      27.2

[Figure: precision (%) per iteration (1 to 6), two plots. Chemistry: y-axis from 50 to 70. Astronomy: y-axis from 20 to 50. Series in both plots: EN-loose, EN-strict, ES-loose, ES-strict.]

Page 16

Full evaluation

Evaluation using                  WN        SNOMED-CT
Initial categories                2431
Categories after pruning          839
Iteration #1
  Categories                      174       394
  Precision [%]                   27.6      54
  Pages (loose)                   2091      4182
  Pages (strict)                  1724      3492
  Page prec. [%] (loose)          21.0      58
  Page prec. [%] (strict)         23.2      62

[Figure: precision (%) per iteration (1 to 6) for Medicina (Medicine), y-axis from 10 to 70. Legend entries: ES-loose-WN (listed twice), ES-loose-SNOMED, ES-strict-SNOMED.]

Validation issues

Accepts: whisky, cigar, udder, fire

Reject: oral cancer, renal colic, phoniatrics, surgical instruments

Page 17

Conclusions

• Good results when evaluated against a specialised resource

• Term list filtering must be improved (e.g., by eliminating proper names)

Page 18

Future work

• Apply this method to other languages/domains
• Improve filtering using in/out links of selected pages
• Improve filtering by also using the page content
• Use this WP knowledge to improve a term extractor
