Post on 26-Sep-2020
Discovery of MWEs
Tomas Krilavičius, Justina Mandravickaitė, Michael Oakes, Carlos Ramisch, Federico Sangati, Veronika Vincze
WG3 Report on MWE Processing
Valletta, Malta19,20 March 2015
D����������MWE�WG� R����� �� MWE P���������
T���� K����������, J������ M�������������̇, M������ O����,C����� R������, F������� S������, V������� V�����
A�������This poster summarizes the WG3 survey on tools and techniques for automatic MWE discovery. It serves as a ground forthe state-of-the-art report of WG3 on hybrid & multilingual MWE processing. Keywords and pointers provided here are furtherdetailed in our shared document: http://goo.gl/IE0hLC
P������� C������One-to-many correspondences are ex-
ploited in MWE detection and cross-lingualMWE detection is also enabled:
Techniques based on word alignment,dependency parsing, alignment mismatchesand/or decision trees have been used for MWEdetection. Statistical MT systems also exploitMWE-annotated paralell corpora.
A���������� ��������Lexical measures that estimate the associ-
ation strength between words are one of themain tools employed in unsupervised discoveryof MWEs in corpora. There are di�erent waysof measuring this strength of word association:
• Measures based on raw frequency• Measures based on information theory,
e.g. pointwise mutual information• Measures based on the contingency ta-
bles, e.g. chi-square• Statistical significance• Measures of association between 3 or
more words• Measures which use linguistic information
in addition to word frequencies, e.g. a�n-ity of a word to a syntactic pattern
=) No consensus about best type of measureto use in each case
S��������� M������ L�������Most machine learning methods use lexical
resources, i.e., corpora, treebanks, dictionar-ies, lexicons, etc. ! Dependency on certainresources makes supervised machine learn-ing approaches partly language-dependent Pri-mary lexical resources can be complementedwith web dictionaries and WordNet.
Features typically employed:• n-gram frequencies• lemmas• orthographic variations• association measures• morphosyntactic patternsCommon techniques for complementing su-
pervised machine learning: filtering, pregroup-ing, re-ranking, thresholds, combination, POStags, chunks, chunk sequences, manual anno-tation and evaluation.
T����
Corpus searches and concordancersSketch engine, AntConc, WordSmith
Association measures and patternsUCS, Text::NSP, mwetoolkit, LocalMaxs,
ACCURAT toolkit, Xtract (Dragon), bgMWEToken-based annotation/tagging
jMWE, AMALGr, FIPS-Co, StringNetRecurring tree fragments
FragmentSeeker, DiscoDOP, Varro
S������� P���������Based on the non-decomposability property:the meaning of an MWE cannot be derived fromthe meanings of its component words.Context distribution methods (hot dog 6= dog)
HOT DOG
SANDWICHCAT
DOG
Substitution methods
Expression Substitution MWEBreak the vase Break the cup NOBreak the ice Break the snow YES
E���������
How to evaluate a lexicon of automaticallydiscovered MWEs?
Challenging and open question
D����������MWE�WG� R����� �� MWE P���������
T���� K����������, J������ M�������������̇, M������ O����,C����� R������, F������� S������, V������� V�����
A�������This poster summarizes the WG3 survey on tools and techniques for automatic MWE discovery. It serves as a ground forthe state-of-the-art report of WG3 on hybrid & multilingual MWE processing. Keywords and pointers provided here are furtherdetailed in our shared document: http://goo.gl/IE0hLC
P������� C������One-to-many correspondences are ex-
ploited in MWE detection and cross-lingualMWE detection is also enabled:
make a decision entscheiden
make a decision tomar uma decisão
Techniques based on word alignment,dependency parsing, alignment mismatchesand/or decision trees have been used for MWEdetection. Statistical MT systems also exploitMWE-annotated paralell corpora.
A���������� ��������Lexical measures that estimate the associ-
ation strength between words are one of themain tools employed in unsupervised discoveryof MWEs in corpora. There are di�erent waysof measuring this strength of word association:
• Measures based on raw frequency• Measures based on information theory,
e.g. pointwise mutual information• Measures based on the contingency ta-
bles, e.g. chi-square• Statistical significance• Measures of association between 3 or
more words• Measures which use linguistic information
in addition to word frequencies, e.g. a�n-ity of a word to a syntactic pattern
=) No consensus about best type of measureto use in each case
S��������� M������ L�������Most machine learning methods use lexical
resources, i.e., corpora, treebanks, dictionar-ies, lexicons, etc. ! Dependency on certainresources makes supervised machine learn-ing approaches partly language-dependent Pri-mary lexical resources can be complementedwith web dictionaries and WordNet.
Features typically employed:• n-gram frequencies• lemmas• orthographic variations• association measures• morphosyntactic patternsCommon techniques for complementing su-
pervised machine learning: filtering, pregroup-ing, re-ranking, thresholds, combination, POStags, chunks, chunk sequences, manual anno-tation and evaluation.
T����
Corpus searches and concordancersSketch engine, AntConc, WordSmith
Association measures and patternsUCS, Text::NSP, mwetoolkit, LocalMaxs,
ACCURAT toolkit, Xtract (Dragon), bgMWEToken-based annotation/tagging
jMWE, AMALGr, FIPS-Co, StringNetRecurring tree fragments
FragmentSeeker, DiscoDOP, Varro
S������� P���������Based on the non-decomposability property:the meaning of an MWE cannot be derived fromthe meanings of its component words.Context distribution methods (hot dog 6= dog)
HOT DOG
SANDWICHCAT
DOG
Substitution methods
Expression Substitution MWEBreak the vase Break the cup NOBreak the ice Break the snow YES
E���������
How to evaluate a lexicon of automaticallydiscovered MWEs?
Challenging and open question
D����������MWE�WG� R����� �� MWE P���������
T���� K����������, J������ M�������������̇, M������ O����,C����� R������, F������� S������, V������� V�����
A�������This poster summarizes the WG3 survey on tools and techniques for automatic MWE discovery. It serves as a ground forthe state-of-the-art report of WG3 on hybrid & multilingual MWE processing. Keywords and pointers provided here are furtherdetailed in our shared document: http://goo.gl/IE0hLC
P������� C������One-to-many correspondences are ex-
ploited in MWE detection and cross-lingualMWE detection is also enabled:
Techniques based on word alignment,dependency parsing, alignment mismatchesand/or decision trees have been used for MWEdetection. Statistical MT systems also exploitMWE-annotated paralell corpora.
A���������� ��������Lexical measures that estimate the associ-
ation strength between words are one of themain tools employed in unsupervised discoveryof MWEs in corpora. There are di�erent waysof measuring this strength of word association:
• Measures based on raw frequency• Measures based on information theory,
e.g. pointwise mutual information• Measures based on the contingency ta-
bles, e.g. chi-square• Statistical significance• Measures of association between 3 or
more words• Measures which use linguistic information
in addition to word frequencies, e.g. a�n-ity of a word to a syntactic pattern
=) No consensus about best type of measureto use in each case
S��������� M������ L�������Most machine learning methods use lexical
resources, i.e., corpora, treebanks, dictionar-ies, lexicons, etc. ! Dependency on certainresources makes supervised machine learn-ing approaches partly language-dependent Pri-mary lexical resources can be complementedwith web dictionaries and WordNet.
Features typically employed:• n-gram frequencies• lemmas• orthographic variations• association measures• morphosyntactic patternsCommon techniques for complementing su-
pervised machine learning: filtering, pregroup-ing, re-ranking, thresholds, combination, POStags, chunks, chunk sequences, manual anno-tation and evaluation.
T����
Corpus searches and concordancersSketch engine, AntConc, WordSmith
Association measures and patternsUCS, Text::NSP, mwetoolkit, LocalMaxs,
ACCURAT toolkit, Xtract (Dragon), bgMWEToken-based annotation/tagging
jMWE, AMALGr, FIPS-Co, StringNetRecurring tree fragments
FragmentSeeker, DiscoDOP, Varro
S������� P���������Based on the non-decomposability property:the meaning of an MWE cannot be derived fromthe meanings of its component words.Context distribution methods (hot dog 6= dog)
HOT DOG
SANDWICHCAT
DOG
Substitution methods
Expression Substitution MWEBreak the vase Break the cup NOBreak the ice Break the snow YES
E���������
How to evaluate a lexicon of automaticallydiscovered MWEs?
Challenging and open question
D����������MWE�WG� R����� �� MWE P���������
T���� K����������, J������ M�������������̇, M������ O����,C����� R������, F������� S������, V������� V�����
A�������This poster summarizes the WG3 survey on tools and techniques for automatic MWE discovery. It serves as a ground forthe state-of-the-art report of WG3 on hybrid & multilingual MWE processing. Keywords and pointers provided here are furtherdetailed in our shared document: http://goo.gl/IE0hLC
P������� C������One-to-many correspondences are ex-
ploited in MWE detection and cross-lingualMWE detection is also enabled:
Techniques based on word alignment,dependency parsing, alignment mismatchesand/or decision trees have been used for MWEdetection. Statistical MT systems also exploitMWE-annotated paralell corpora.
A���������� ��������Lexical measures that estimate the associ-
ation strength between words are one of themain tools employed in unsupervised discoveryof MWEs in corpora. There are di�erent waysof measuring this strength of word association:
• Measures based on raw frequency• Measures based on information theory,
e.g. pointwise mutual information• Measures based on the contingency ta-
bles, e.g. chi-square• Statistical significance• Measures of association between 3 or
more words• Measures which use linguistic information
in addition to word frequencies, e.g. a�n-ity of a word to a syntactic pattern
=) No consensus about best type of measureto use in each case
T����
Corpus searches and concordancersSketch engine, AntConc, WordSmith
Association measures and patternsUCS, Text::NSP, mwetoolkit, LocalMaxs,
ACCURAT toolkit, Xtract (Dragon), bgMWEToken-based annotation/tagging
jMWE, AMALGr, FIPS-Co, StringNetRecurring tree fragments
FragmentSeeker, DiscoDOP, Varro
S������� P���������Based on the non-decomposability property:the meaning of an MWE cannot be derived fromthe meanings of its component words.Context distribution methods (hot dog 6= dog)
HOT DOG
SANDWICHCAT
DOG
Substitution methods
Expression Substitution MWEBreak the vase Break the cup NOBreak the ice Break the snow YES
E���������
How to evaluate a lexicon of automaticallydiscovered MWEs?
Challenging and open question
Discovery of MWEsMethodologies
D����������MWE�WG� R����� �� MWE P���������
T���� K����������, J������ M�������������̇, M������ O����,C����� R������, F������� S������, V������� V�����
A�������This poster summarizes the WG3 survey on tools and techniques for automatic MWE discovery. It serves as a ground forthe state-of-the-art report of WG3 on hybrid & multilingual MWE processing. Keywords and pointers provided here are furtherdetailed in our shared document: http://goo.gl/IE0hLC
P������� C������One-to-many correspondences are ex-
ploited in MWE detection and cross-lingualMWE detection is also enabled:
Techniques based on word alignment,dependency parsing, alignment mismatchesand/or decision trees have been used for MWEdetection. Statistical MT systems also exploitMWE-annotated paralell corpora.
A���������� ��������Lexical measures that estimate the associ-
ation strength between words are one of themain tools employed in unsupervised discoveryof MWEs in corpora. There are di�erent waysof measuring this strength of word association:
• Measures based on raw frequency• Measures based on information theory,
e.g. pointwise mutual information• Measures based on the contingency ta-
bles, e.g. chi-square• Statistical significance• Measures of association between 3 or
more words• Measures which use linguistic information
in addition to word frequencies, e.g. a�n-ity of a word to a syntactic pattern
=) No consensus about best type of measureto use in each case
T����
ababs
a/N b/Vc/J s/Ds/A d/P
a b ac s ds d a d
raw corpora
analysed corpora(POS tagged, parsed...)
a b ac s ds d a d
a bd a d
annotated corpora
[ranked]MWE lists
Corpus searches and concordancersSketch engine, AntConc, WordSmith
Association measures and patternsUCS, Text::NSP, mwetoolkit, LocalMaxs,
ACCURAT toolkit, Xtract (Dragon), bgMWEToken-based annotation/tagging
jMWE, AMALGr, FIPS-Co, StringNetRecurring tree fragments
FragmentSeeker, DiscoDOP, Varro
S������� P���������Based on the non-decomposability property:the meaning of an MWE cannot be derived fromthe meanings of its component words.Context distribution methods (hot dog 6= dog)
HOT DOG
SANDWICHCAT
DOG
Substitution methods
Expression Substitution MWEBreak the vase Break the cup NOBreak the ice Break the snow YES
E���������
How to evaluate a lexicon of automaticallydiscovered MWEs?
Challenging and open question
D����������MWE�WG� R����� �� MWE P���������
T���� K����������, J������ M�������������̇, M������ O����,C����� R������, F������� S������, V������� V�����
A�������This poster summarizes the WG3 survey on tools and techniques for automatic MWE discovery. It serves as a ground forthe state-of-the-art report of WG3 on hybrid & multilingual MWE processing. Keywords and pointers provided here are furtherdetailed in our shared document: http://goo.gl/IE0hLC
P������� C������One-to-many correspondences are ex-
ploited in MWE detection and cross-lingualMWE detection is also enabled:
Techniques based on word alignment,dependency parsing, alignment mismatchesand/or decision trees have been used for MWEdetection. Statistical MT systems also exploitMWE-annotated paralell corpora.
A���������� ��������Lexical measures that estimate the associ-
ation strength between words are one of themain tools employed in unsupervised discoveryof MWEs in corpora. There are di�erent waysof measuring this strength of word association:
• Measures based on raw frequency• Measures based on information theory,
e.g. pointwise mutual information• Measures based on the contingency ta-
bles, e.g. chi-square• Statistical significance• Measures of association between 3 or
more words• Measures which use linguistic information
in addition to word frequencies, e.g. a�n-ity of a word to a syntactic pattern
=) No consensus about best type of measureto use in each case
T����
Corpus searches and concordancersSketch engine, AntConc, WordSmith
Association measures and patternsUCS, Text::NSP, mwetoolkit, LocalMaxs,
ACCURAT toolkit, Xtract (Dragon), bgMWEToken-based annotation/tagging
jMWE, AMALGr, FIPS-Co, StringNetRecurring tree fragments
FragmentSeeker, DiscoDOP, Varro
S������� P���������Based on the non-decomposability property:the meaning of an MWE cannot be derived fromthe meanings of its component words.Context distribution methods (hot dog 6= dog)
HOT DOG
SANDWICHCAT
DOG
Substitution methods
Expression Substitution MWEBreak the vase Break the cup NOBreak the ice Break the snow YES
E���������
How to evaluate a lexicon of automaticallydiscovered MWEs?
3. Use/integrate2. Measure
1. Annotate/judge
MWEw1 w2 w3
w1 w4w2 w7 w1
MWE Score Annot.
Threshold
Precision: X%Recall (?): Y%
xi
Parser+MWEsParser
Challenging and open question
Discovery of MWEsTools & Evaluation
Thank you!
Discovery of MWEs
Tomas Krilavičius, Justina Mandravickaitė, Michael Oakes, Carlos Ramisch, Federico Sangati, Veronika Vincze
WG3 Report on MWE Processing