INFuture2009: Digital Resources and Knowledge Sharing, 4-7 November 2009

FF & FER

Comparative Analysis of Automatic Term and Collocation Extraction

Sanja Seljan, Bojana Dalbelo Bašić, Jan Šnajder, Davor Delač, Matija Šamec-Gjurin, Dina Crnec

Faculty of Humanities and Social Sciences, Department of Information Sciences
Faculty of Electrical Engineering and Computing


Overview

I. Introduction
– Reasons for extraction

II. Research
– Resources & tools
– Extracted lists

III. Evaluation
– Precision, recall, F-measure

IV. Conclusion

I. Introduction

• Monolingual and multilingual resources
– Helpful
– Integrated
– Require human intervention

• EU pre-accession activities
– Speed up + consistency

• Used in further research and practice

• List:
– Terms (Member State, European Union)
– Collocations (adopt a/the resolution, decided as follows)
– Multi-word units (depend on, well-being)

• Term extraction process:
– Term extraction (term acquisition): identification
– Term recognition: verification

II. Research

• Resources
– 10 documents: legislation, Croatian-English

• Tools
– TermeX tool (FER) – list A
– SDL Multi Term Extract + NooJ (FF) – list B

• Reference list
– Used for evaluation of both extracted lists

Reference list

• 470 terms and collocations
• Unigrams excluded
• Balance between lexical coverage, adequacy, practicality
– terms (NPs: 346/470)
– collocations (VPs)

Reference list

• Contains:
– Terms (acquiring company, applicant country)
– Collocations (adopt a/the resolution, decided as follows, entry into force, having regard to)
– Names and abbreviations (Economic and Monetary Union EMU, European Union EU)
– Relevant embedded terms (crime prevention, crime prevention bodies, national crime prevention measures)

List B

• Language-independent, statistically based SDL Multi Term Extract tool
– Frequency threshold set to 4
– Filtered by the list of stop-words -> 369 candidates

• Language-dependent NooJ tool
– 36 local grammars -> 512 candidates
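The statistical step above (count word sequences, apply a frequency threshold, filter with a stop-word list) can be sketched roughly as follows. This is only an illustration, not how SDL Multi Term Extract works internally: the stop-word list, the n-gram sizes, the reading of "threshold set to 4" as "at least 4 occurrences", and the rule that a candidate may not start or end with a stop word are all assumptions.

```python
from collections import Counter

# Hypothetical stop-word list; the list used in the study is not given.
STOP_WORDS = {"the", "of", "a", "an", "in", "to", "and"}


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_candidates(tokens, min_freq=4, sizes=(2, 3, 4)):
    """Count n-grams and keep the frequent ones.

    Assumed filter: a candidate must occur at least `min_freq` times and
    must not start or end with a stop word.
    """
    counts = Counter()
    for n in sizes:
        counts.update(ngrams(tokens, n))
    return {
        " ".join(gram): freq
        for gram, freq in counts.items()
        if freq >= min_freq
        and gram[0] not in STOP_WORDS
        and gram[-1] not in STOP_WORDS
    }


# Toy input: one phrase repeated four times.
tokens = ("the member state of the european union " * 4).split()
candidates = extract_candidates(tokens)
```

On such toy input, sequences like "the member" are dropped by the stop-word rule and sequences spanning a repetition boundary are dropped by the frequency threshold, while "member state" and "european union" survive. A purely statistical filter can still let through noisy longer sequences, which is one reason the candidates require verification.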

List A

• TermeX
– Lexical association measures (AMs)
– 14 AMs (PMI, Dice, Chi-square, …)
– Lemmatization
– POS filtering
– Frequency threshold set to ?

List A

• Extracted terms ranked by AM value
– 1816 candidates

• AMs used:
– 2-grams – PMI
– 3-grams, 4-grams – heuristic extensions

• Noun phrases only
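Ranking 2-grams by PMI can be sketched as below. This is a minimal illustration of pointwise mutual information, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), estimated from raw counts; it is not TermeX code, and the function name, the `min_freq` parameter and the toy sentence are assumptions made for the example. The heuristic extensions for 3-grams and 4-grams are not specified in the slides and are not attempted here.

```python
import math
from collections import Counter


def rank_bigrams_by_pmi(tokens, min_freq=1):
    """Rank adjacent word pairs by PMI = log2(p(x, y) / (p(x) * p(y)))."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    scored = []
    for (x, y), freq in bigram_counts.items():
        if freq < min_freq:
            continue  # rare pairs get inflated PMI, so drop them first
        p_xy = freq / n_bigrams
        p_x = unigram_counts[x] / n_unigrams
        p_y = unigram_counts[y] / n_unigrams
        scored.append((math.log2(p_xy / (p_x * p_y)), f"{x} {y}"))
    # Highest association first; ties broken alphabetically.
    return sorted(scored, key=lambda item: (-item[0], item[1]))


tokens = "member state shall inform the member state and the commission".split()
ranked = rank_bigrams_by_pmi(tokens, min_freq=2)
```

Without the frequency filter, a pair that occurs only once (such as "shall inform" in the toy sentence) scores higher than "member state", because PMI favours rare words. That behaviour is a common reason for combining PMI with a frequency threshold.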

Results

• Evaluation
– F1-measure (precision, recall)
– True positives calculated by taking into account inflection (suffix stripping)

                List A   List B
No. of terms    1816     508
Valid terms     202      234
Precision (%)   11.56    47.37
Recall (%)      42.98    49.79
F1 (%)          18.22    48.55
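The evaluation can be sketched as follows, assuming recall is the share of the 470 reference entries that were found and F1 is the harmonic mean of precision and recall. The suffix stripping used in the study is not described, so `normalize` below is a crude, hypothetical stand-in that truncates every word to its first five characters; the function names and toy lists are likewise illustrative only.

```python
def normalize(term, stem_len=5):
    """Crude stand-in for suffix stripping: lowercase and truncate each word."""
    return " ".join(word[:stem_len] for word in term.lower().split())


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale in, same scale out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(candidates, reference):
    """Return (precision, recall, F1) after normalizing both lists."""
    cand_keys = {normalize(c) for c in candidates}
    ref_keys = {normalize(r) for r in reference}
    true_positives = len(cand_keys & ref_keys)
    precision = true_positives / len(cand_keys) if cand_keys else 0.0
    recall = true_positives / len(ref_keys) if ref_keys else 0.0
    return precision, recall, f1_score(precision, recall)


# Toy lists: "Member States" matches "member state" after normalization.
candidates = ["Member States", "European Union", "of the"]
reference = ["member state", "European Union", "entry into force", "applicant country"]
precision, recall, f1 = evaluate(candidates, reference)

# Values from the results table: recall from counts, F1 from reported P and R.
recall_a = 100 * 202 / 470
recall_b = 100 * 234 / 470
f1_a = f1_score(11.56, 42.98)
f1_b = f1_score(47.37, 49.79)
```

With the table's figures, the recall values follow from 202/470 and 234/470, and the F1 values follow from the reported precision and recall.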

Results

• List A unsatisfactory
– Low recall: verb phrases and terms consisting of more than 4 words are not extracted
– Low precision: the list is ranked, so precision can be improved with a cut-off (true positives are better ranked)

• List B modest
– Can be improved with lemmatization, definition of upper/lower cases, more detailed local grammars
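The effect of a cut-off on a ranked list can be illustrated with a short sketch: if true positives tend to sit near the top, precision computed over only the top k candidates is higher than precision over the whole list. The function name, the cut-off values and the toy data are assumptions; exact string matching is used here for brevity instead of suffix stripping.

```python
def precision_at_cutoffs(ranked_candidates, reference, cutoffs):
    """Return {k: precision of the top-k candidates} for each cut-off k."""
    reference = set(reference)
    wanted = set(cutoffs)
    results = {}
    hits = 0
    for rank, candidate in enumerate(ranked_candidates, start=1):
        if candidate in reference:
            hits += 1
        if rank in wanted:
            results[rank] = hits / rank
    return results


# Toy ranked list, best-scoring candidate first.
ranked_list = [
    "member state",
    "european union",
    "shall be",
    "applicant country",
    "of the",
    "in the",
]
reference_terms = {"member state", "european union", "applicant country"}
by_cutoff = precision_at_cutoffs(ranked_list, reference_terms, cutoffs=[2, 4, 6])
```

In this toy example precision falls as the cut-off grows. The trade-off is that a stricter cut-off can also remove valid terms ranked lower in the list, which lowers recall.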

Conclusion

• Comparison of two hybrid approaches to term extraction

• Human-created lists differ from extracted lists
– human knowledge, experience and intuition

• Space for improvement: automatic extraction combined with human intervention


Thank you!
