Discovery of MWEs

Tomas Krilavičius, Justina Mandravickaitė, Michael Oakes, Carlos Ramisch, Federico Sangati, Veronika Vincze

WG3 Report on MWE Processing

Valletta, Malta, 19-20 March 2015

Abstract

This poster summarizes the WG3 survey on tools and techniques for automatic MWE discovery. It serves as a basis for the state-of-the-art report of WG3 on hybrid & multilingual MWE processing. Keywords and pointers provided here are further detailed in our shared document: http://goo.gl/IE0hLC

Parallel Corpora

One-to-many correspondences are exploited in MWE detection, and cross-lingual MWE detection is also enabled.

Techniques based on word alignment, dependency parsing, alignment mismatches and/or decision trees have been used for MWE detection. Statistical MT systems also exploit MWE-annotated parallel corpora.
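As a rough illustration of how one-to-many correspondences might flag MWE candidates, the sketch below assumes that word alignments are already available as (source index, target index) pairs. The function name, the input format and the hand-made alignment are hypothetical and do not come from any tool in the survey; the sentence pair builds on the poster's "make a decision" / "entscheiden" example.

```python
from collections import defaultdict

def one_to_many_candidates(src_tokens, tgt_tokens, alignment):
    """Return (target word, source phrase) pairs where a single target
    word is aligned to two or more source words.

    alignment: iterable of (src_index, tgt_index) pairs (assumed format).
    """
    by_target = defaultdict(list)
    for s, t in alignment:
        by_target[t].append(s)

    candidates = []
    for t, src_idxs in sorted(by_target.items()):
        if len(src_idxs) >= 2:
            phrase = " ".join(src_tokens[i] for i in sorted(src_idxs))
            candidates.append((tgt_tokens[t], phrase))
    return candidates

# Toy sentence pair with a hand-made alignment (illustration only).
src = ["they", "make", "a", "decision"]
tgt = ["sie", "entscheiden"]
align = [(0, 0), (1, 1), (2, 1), (3, 1)]
print(one_to_many_candidates(src, tgt, align))
```

In practice the alignment would come from a word aligner, and the candidates would still need filtering, since alignment errors also produce one-to-many links.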

Association Measures

Lexical measures that estimate the association strength between words are one of the main tools employed in unsupervised discovery of MWEs in corpora. There are different ways of measuring this strength of word association:

• Measures based on raw frequency
• Measures based on information theory, e.g. pointwise mutual information
• Measures based on contingency tables, e.g. chi-square
• Statistical significance
• Measures of association between 3 or more words
• Measures which use linguistic information in addition to word frequencies, e.g. affinity of a word to a syntactic pattern

⇒ No consensus about the best type of measure to use in each case
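Two of the measures listed above, pointwise mutual information and chi-square, can be computed from four corpus counts. The following is a minimal sketch for bigrams under standard textbook definitions; the function names and the toy counts are assumptions made for illustration, not values from the survey.

```python
import math

def contingency(f_xy, f_x, f_y, n):
    """Observed 2x2 contingency table for a bigram (x, y).

    f_xy: count of the bigram; f_x: count of x as first word;
    f_y: count of y as second word; n: total number of bigrams.
    """
    o11 = f_xy                    # x followed by y
    o12 = f_x - f_xy              # x followed by something else
    o21 = f_y - f_xy              # something else followed by y
    o22 = n - f_x - f_y + f_xy    # neither x first nor y second
    return o11, o12, o21, o22

def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information in bits: log2(P(x,y) / (P(x) P(y)))."""
    return math.log2((f_xy * n) / (f_x * f_y))

def chi_square(f_xy, f_x, f_y, n):
    """Pearson chi-square statistic for the 2x2 table."""
    o11, o12, o21, o22 = contingency(f_xy, f_x, f_y, n)
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den

# Toy counts, invented for illustration.
print(pmi(8, 16, 32, 1024))
print(chi_square(8, 16, 32, 1024))
```

Different measures can rank the same candidates differently, which is one practical face of the lack of consensus noted above.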

Supervised Machine Learning

Most machine learning methods use lexical resources, i.e., corpora, treebanks, dictionaries, lexicons, etc. → Dependency on certain resources makes supervised machine learning approaches partly language-dependent. Primary lexical resources can be complemented with web dictionaries and WordNet.

Features typically employed:
• n-gram frequencies
• lemmas
• orthographic variations
• association measures
• morphosyntactic patterns

Common techniques for complementing supervised machine learning: filtering, pregrouping, re-ranking, thresholds, combination, POS tags, chunks, chunk sequences, manual annotation and evaluation.
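To make the feature types above concrete, here is a small sketch that turns a candidate into a feature dictionary and then applies a threshold filter followed by re-ranking. The candidate format (surface, lemma, POS triples), the feature names and all counts and scores are hypothetical; a real system would feed such dictionaries to a classifier trained on annotated data.

```python
from collections import Counter

def candidate_features(candidate, ngram_counts, assoc_scores):
    """Build a feature dict for one candidate.

    candidate: list of (surface, lemma, pos) triples (assumed format).
    """
    surfaces = [s for s, _, _ in candidate]
    lemmas = tuple(l for _, l, _ in candidate)
    return {
        "lemmas": " ".join(lemmas),
        "ngram_freq": ngram_counts[lemmas],             # n-gram frequency
        "has_hyphen": any("-" in s for s in surfaces),  # orthographic variation
        "assoc": assoc_scores.get(lemmas, 0.0),         # association measure
        "pos_pattern": "+".join(p for _, _, p in candidate),
    }

def filter_and_rerank(feature_dicts, min_assoc):
    """Drop candidates below the association threshold, best first."""
    kept = [f for f in feature_dicts if f["assoc"] >= min_assoc]
    return sorted(kept, key=lambda f: f["assoc"], reverse=True)

# Toy counts and scores, invented for illustration.
counts = Counter({("hot", "dog"): 12, ("red", "car"): 3})
scores = {("hot", "dog"): 6.1, ("red", "car"): 0.4}
cands = [
    [("hot", "hot", "ADJ"), ("dogs", "dog", "NOUN")],
    [("red", "red", "ADJ"), ("car", "car", "NOUN")],
]
feats = [candidate_features(c, counts, scores) for c in cands]
ranked = filter_and_rerank(feats, min_assoc=1.0)
print([f["lemmas"] for f in ranked])
```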

Tools

Corpus searches and concordancers
Sketch Engine, AntConc, WordSmith

Association measures and patterns
UCS, Text::NSP, mwetoolkit, LocalMaxs, ACCURAT toolkit, Xtract (Dragon), bgMWE

Token-based annotation/tagging
jMWE, AMALGr, FIPS-Co, StringNet

Recurring tree fragments
FragmentSeeker, DiscoDOP, Varro

Semantic Properties

Based on the non-decomposability property: the meaning of an MWE cannot be derived from the meanings of its component words.

Context distribution methods (hot dog ≠ dog)

[Figure: HOT DOG, SANDWICH, CAT]

Substitution methods

Expression        Substitution      MWE
Break the vase    Break the cup     NO
Break the ice     Break the snow    YES
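The context distribution idea can be sketched with plain co-occurrence counts: build a context vector for the expression and for other words, then compare them with cosine similarity. The code below is a toy illustration under that assumption; the window size, function names and the tiny corpus are invented, and real systems use large corpora and weighted or learned vectors.

```python
import math
from collections import Counter

def context_vector(tokens, target, window=2):
    """Count words within `window` positions of each occurrence of
    `target`, where `target` is a tuple of one or more tokens."""
    n = len(target)
    vec = Counter()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + n:i + n + window]
            vec.update(left + right)
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

# Tiny invented corpus; note that occurrences of "dog" inside
# "hot dog" are also counted for "dog" in this simple version.
corpus = ("i ate a hot dog with mustard . i ate a sandwich with mustard . "
          "the dog chased the cat . the cat chased the dog .").split()

hot_dog = context_vector(corpus, ("hot", "dog"))
sandwich = context_vector(corpus, ("sandwich",))
dog = context_vector(corpus, ("dog",))
print(cosine(hot_dog, sandwich) > cosine(hot_dog, dog))
```

A low similarity between an expression and its head word is then taken as a hint of non-decomposability. A substitution test could be sketched along similar lines by comparing corpus counts of the original expression and its variants.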

Evaluation

How to evaluate a lexicon of automatically discovered MWEs?

Challenging and open question

[Figure (parallel corpora): "make a decision" aligned with German "entscheiden"; "make a decision" aligned with Portuguese "tomar uma decisão"]

Discovery of MWEs: Methodologies

[Figure: raw corpora; analysed corpora (POS tagged, parsed...); annotated corpora; [ranked] MWE lists]

[Figure: evaluation of discovered MWEs. Steps: 1. Annotate/judge; 2. Measure; 3. Use/integrate. Labels: candidate table with columns MWE, Score, Annot.; Threshold; Precision: X%, Recall (?): Y%; Parser, Parser+MWEs]
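The measuring step can be illustrated with a short sketch that applies a score threshold to a candidate list and computes precision and recall against a set of manually judged MWEs. The function name, the scores and the gold set are all invented for illustration. Recall here is only defined relative to whatever gold list is available, which is usually incomplete.

```python
def precision_recall(scored, gold, threshold):
    """Precision and recall of candidates scoring at or above `threshold`.

    scored: dict mapping candidate -> score (assumed format).
    gold: set of candidates judged to be true MWEs.
    """
    selected = {c for c, s in scored.items() if s >= threshold}
    true_pos = len(selected & gold)
    precision = true_pos / len(selected) if selected else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall

# Toy scores and judgements, invented for illustration.
scored = {"hot dog": 6.1, "break the ice": 4.8, "the dog": 2.5, "red car": 0.4}
gold = {"hot dog", "break the ice", "kick the bucket"}
p, r = precision_recall(scored, gold, threshold=2.0)
print(f"precision={p:.2f} recall={r:.2f}")
```

Raising the threshold typically trades recall for precision, so reporting results at several thresholds gives a fuller picture than a single cut-off.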

Discovery of MWEs: Tools & Evaluation

Thank you!
