Discovery of MWEs · Most machine learning methods use lexical resources, i.e., corpora, treebanks,...
Transcript of Discovery of MWEs · Most machine learning methods use lexical resources, i.e., corpora, treebanks,...
Discovery of MWEs
Tomas Krilavičius, Justina Mandravickaitė, Michael Oakes, Carlos Ramisch, Federico Sangati, Veronika Vincze
WG3 Report on MWE Processing
Valletta, Malta19,20 March 2015
D����������MWE�WG� R����� �� MWE P���������
T���� K����������, J������ M�������������̇, M������ O����,C����� R������, F������� S������, V������� V�����
A�������This poster summarizes the WG3 survey on tools and techniques for automatic MWE discovery. It serves as a ground forthe state-of-the-art report of WG3 on hybrid & multilingual MWE processing. Keywords and pointers provided here are furtherdetailed in our shared document: http://goo.gl/IE0hLC
P������� C������One-to-many correspondences are ex-
ploited in MWE detection and cross-lingualMWE detection is also enabled:
Techniques based on word alignment,dependency parsing, alignment mismatchesand/or decision trees have been used for MWEdetection. Statistical MT systems also exploitMWE-annotated paralell corpora.
A���������� ��������Lexical measures that estimate the associ-
ation strength between words are one of themain tools employed in unsupervised discoveryof MWEs in corpora. There are di�erent waysof measuring this strength of word association:
• Measures based on raw frequency• Measures based on information theory,
e.g. pointwise mutual information• Measures based on the contingency ta-
bles, e.g. chi-square• Statistical significance• Measures of association between 3 or
more words• Measures which use linguistic information
in addition to word frequencies, e.g. a�n-ity of a word to a syntactic pattern
=) No consensus about best type of measureto use in each case
S��������� M������ L�������Most machine learning methods use lexical
resources, i.e., corpora, treebanks, dictionar-ies, lexicons, etc. ! Dependency on certainresources makes supervised machine learn-ing approaches partly language-dependent Pri-mary lexical resources can be complementedwith web dictionaries and WordNet.
Features typically employed:• n-gram frequencies• lemmas• orthographic variations• association measures• morphosyntactic patternsCommon techniques for complementing su-
pervised machine learning: filtering, pregroup-ing, re-ranking, thresholds, combination, POStags, chunks, chunk sequences, manual anno-tation and evaluation.
T����
Corpus searches and concordancersSketch engine, AntConc, WordSmith
Association measures and patternsUCS, Text::NSP, mwetoolkit, LocalMaxs,
ACCURAT toolkit, Xtract (Dragon), bgMWEToken-based annotation/tagging
jMWE, AMALGr, FIPS-Co, StringNetRecurring tree fragments
FragmentSeeker, DiscoDOP, Varro
S������� P���������Based on the non-decomposability property:the meaning of an MWE cannot be derived fromthe meanings of its component words.Context distribution methods (hot dog 6= dog)
HOT DOG
SANDWICHCAT
DOG
Substitution methods
Expression Substitution MWEBreak the vase Break the cup NOBreak the ice Break the snow YES
E���������
How to evaluate a lexicon of automaticallydiscovered MWEs?
Challenging and open question
D����������MWE�WG� R����� �� MWE P���������
T���� K����������, J������ M�������������̇, M������ O����,C����� R������, F������� S������, V������� V�����
A�������This poster summarizes the WG3 survey on tools and techniques for automatic MWE discovery. It serves as a ground forthe state-of-the-art report of WG3 on hybrid & multilingual MWE processing. Keywords and pointers provided here are furtherdetailed in our shared document: http://goo.gl/IE0hLC
P������� C������One-to-many correspondences are ex-
ploited in MWE detection and cross-lingualMWE detection is also enabled:
make a decision entscheiden
make a decision tomar uma decisão
Techniques based on word alignment,dependency parsing, alignment mismatchesand/or decision trees have been used for MWEdetection. Statistical MT systems also exploitMWE-annotated paralell corpora.
A���������� ��������Lexical measures that estimate the associ-
ation strength between words are one of themain tools employed in unsupervised discoveryof MWEs in corpora. There are di�erent waysof measuring this strength of word association:
• Measures based on raw frequency• Measures based on information theory,
e.g. pointwise mutual information• Measures based on the contingency ta-
bles, e.g. chi-square• Statistical significance• Measures of association between 3 or
more words• Measures which use linguistic information
in addition to word frequencies, e.g. a�n-ity of a word to a syntactic pattern
=) No consensus about best type of measureto use in each case
S��������� M������ L�������Most machine learning methods use lexical
resources, i.e., corpora, treebanks, dictionar-ies, lexicons, etc. ! Dependency on certainresources makes supervised machine learn-ing approaches partly language-dependent Pri-mary lexical resources can be complementedwith web dictionaries and WordNet.
Features typically employed:• n-gram frequencies• lemmas• orthographic variations• association measures• morphosyntactic patternsCommon techniques for complementing su-
pervised machine learning: filtering, pregroup-ing, re-ranking, thresholds, combination, POStags, chunks, chunk sequences, manual anno-tation and evaluation.
T����
Corpus searches and concordancersSketch engine, AntConc, WordSmith
Association measures and patternsUCS, Text::NSP, mwetoolkit, LocalMaxs,
ACCURAT toolkit, Xtract (Dragon), bgMWEToken-based annotation/tagging
jMWE, AMALGr, FIPS-Co, StringNetRecurring tree fragments
FragmentSeeker, DiscoDOP, Varro
S������� P���������Based on the non-decomposability property:the meaning of an MWE cannot be derived fromthe meanings of its component words.Context distribution methods (hot dog 6= dog)
HOT DOG
SANDWICHCAT
DOG
Substitution methods
Expression Substitution MWEBreak the vase Break the cup NOBreak the ice Break the snow YES
E���������
How to evaluate a lexicon of automaticallydiscovered MWEs?
Challenging and open question
D����������MWE�WG� R����� �� MWE P���������
T���� K����������, J������ M�������������̇, M������ O����,C����� R������, F������� S������, V������� V�����
A�������This poster summarizes the WG3 survey on tools and techniques for automatic MWE discovery. It serves as a ground forthe state-of-the-art report of WG3 on hybrid & multilingual MWE processing. Keywords and pointers provided here are furtherdetailed in our shared document: http://goo.gl/IE0hLC
P������� C������One-to-many correspondences are ex-
ploited in MWE detection and cross-lingualMWE detection is also enabled:
Techniques based on word alignment,dependency parsing, alignment mismatchesand/or decision trees have been used for MWEdetection. Statistical MT systems also exploitMWE-annotated paralell corpora.
A���������� ��������Lexical measures that estimate the associ-
ation strength between words are one of themain tools employed in unsupervised discoveryof MWEs in corpora. There are di�erent waysof measuring this strength of word association:
• Measures based on raw frequency• Measures based on information theory,
e.g. pointwise mutual information• Measures based on the contingency ta-
bles, e.g. chi-square• Statistical significance• Measures of association between 3 or
more words• Measures which use linguistic information
in addition to word frequencies, e.g. a�n-ity of a word to a syntactic pattern
=) No consensus about best type of measureto use in each case
S��������� M������ L�������Most machine learning methods use lexical
resources, i.e., corpora, treebanks, dictionar-ies, lexicons, etc. ! Dependency on certainresources makes supervised machine learn-ing approaches partly language-dependent Pri-mary lexical resources can be complementedwith web dictionaries and WordNet.
Features typically employed:• n-gram frequencies• lemmas• orthographic variations• association measures• morphosyntactic patternsCommon techniques for complementing su-
pervised machine learning: filtering, pregroup-ing, re-ranking, thresholds, combination, POStags, chunks, chunk sequences, manual anno-tation and evaluation.
T����
Corpus searches and concordancersSketch engine, AntConc, WordSmith
Association measures and patternsUCS, Text::NSP, mwetoolkit, LocalMaxs,
ACCURAT toolkit, Xtract (Dragon), bgMWEToken-based annotation/tagging
jMWE, AMALGr, FIPS-Co, StringNetRecurring tree fragments
FragmentSeeker, DiscoDOP, Varro
S������� P���������Based on the non-decomposability property:the meaning of an MWE cannot be derived fromthe meanings of its component words.Context distribution methods (hot dog 6= dog)
HOT DOG
SANDWICHCAT
DOG
Substitution methods
Expression Substitution MWEBreak the vase Break the cup NOBreak the ice Break the snow YES
E���������
How to evaluate a lexicon of automaticallydiscovered MWEs?
Challenging and open question
D����������MWE�WG� R����� �� MWE P���������
T���� K����������, J������ M�������������̇, M������ O����,C����� R������, F������� S������, V������� V�����
A�������This poster summarizes the WG3 survey on tools and techniques for automatic MWE discovery. It serves as a ground forthe state-of-the-art report of WG3 on hybrid & multilingual MWE processing. Keywords and pointers provided here are furtherdetailed in our shared document: http://goo.gl/IE0hLC
P������� C������One-to-many correspondences are ex-
ploited in MWE detection and cross-lingualMWE detection is also enabled:
Techniques based on word alignment,dependency parsing, alignment mismatchesand/or decision trees have been used for MWEdetection. Statistical MT systems also exploitMWE-annotated paralell corpora.
A���������� ��������Lexical measures that estimate the associ-
ation strength between words are one of themain tools employed in unsupervised discoveryof MWEs in corpora. There are di�erent waysof measuring this strength of word association:
• Measures based on raw frequency• Measures based on information theory,
e.g. pointwise mutual information• Measures based on the contingency ta-
bles, e.g. chi-square• Statistical significance• Measures of association between 3 or
more words• Measures which use linguistic information
in addition to word frequencies, e.g. a�n-ity of a word to a syntactic pattern
=) No consensus about best type of measureto use in each case
T����
Corpus searches and concordancersSketch engine, AntConc, WordSmith
Association measures and patternsUCS, Text::NSP, mwetoolkit, LocalMaxs,
ACCURAT toolkit, Xtract (Dragon), bgMWEToken-based annotation/tagging
jMWE, AMALGr, FIPS-Co, StringNetRecurring tree fragments
FragmentSeeker, DiscoDOP, Varro
S������� P���������Based on the non-decomposability property:the meaning of an MWE cannot be derived fromthe meanings of its component words.Context distribution methods (hot dog 6= dog)
HOT DOG
SANDWICHCAT
DOG
Substitution methods
Expression Substitution MWEBreak the vase Break the cup NOBreak the ice Break the snow YES
E���������
How to evaluate a lexicon of automaticallydiscovered MWEs?
Challenging and open question
Discovery of MWEsMethodologies
D����������MWE�WG� R����� �� MWE P���������
T���� K����������, J������ M�������������̇, M������ O����,C����� R������, F������� S������, V������� V�����
A�������This poster summarizes the WG3 survey on tools and techniques for automatic MWE discovery. It serves as a ground forthe state-of-the-art report of WG3 on hybrid & multilingual MWE processing. Keywords and pointers provided here are furtherdetailed in our shared document: http://goo.gl/IE0hLC
P������� C������One-to-many correspondences are ex-
ploited in MWE detection and cross-lingualMWE detection is also enabled:
Techniques based on word alignment,dependency parsing, alignment mismatchesand/or decision trees have been used for MWEdetection. Statistical MT systems also exploitMWE-annotated paralell corpora.
A���������� ��������Lexical measures that estimate the associ-
ation strength between words are one of themain tools employed in unsupervised discoveryof MWEs in corpora. There are di�erent waysof measuring this strength of word association:
• Measures based on raw frequency• Measures based on information theory,
e.g. pointwise mutual information• Measures based on the contingency ta-
bles, e.g. chi-square• Statistical significance• Measures of association between 3 or
more words• Measures which use linguistic information
in addition to word frequencies, e.g. a�n-ity of a word to a syntactic pattern
=) No consensus about best type of measureto use in each case
T����
ababs
a/N b/Vc/J s/Ds/A d/P
a b ac s ds d a d
raw corpora
analysed corpora(POS tagged, parsed...)
a b ac s ds d a d
a bd a d
annotated corpora
[ranked]MWE lists
Corpus searches and concordancersSketch engine, AntConc, WordSmith
Association measures and patternsUCS, Text::NSP, mwetoolkit, LocalMaxs,
ACCURAT toolkit, Xtract (Dragon), bgMWEToken-based annotation/tagging
jMWE, AMALGr, FIPS-Co, StringNetRecurring tree fragments
FragmentSeeker, DiscoDOP, Varro
S������� P���������Based on the non-decomposability property:the meaning of an MWE cannot be derived fromthe meanings of its component words.Context distribution methods (hot dog 6= dog)
HOT DOG
SANDWICHCAT
DOG
Substitution methods
Expression Substitution MWEBreak the vase Break the cup NOBreak the ice Break the snow YES
E���������
How to evaluate a lexicon of automaticallydiscovered MWEs?
Challenging and open question
D����������MWE�WG� R����� �� MWE P���������
T���� K����������, J������ M�������������̇, M������ O����,C����� R������, F������� S������, V������� V�����
A�������This poster summarizes the WG3 survey on tools and techniques for automatic MWE discovery. It serves as a ground forthe state-of-the-art report of WG3 on hybrid & multilingual MWE processing. Keywords and pointers provided here are furtherdetailed in our shared document: http://goo.gl/IE0hLC
P������� C������One-to-many correspondences are ex-
ploited in MWE detection and cross-lingualMWE detection is also enabled:
Techniques based on word alignment,dependency parsing, alignment mismatchesand/or decision trees have been used for MWEdetection. Statistical MT systems also exploitMWE-annotated paralell corpora.
A���������� ��������Lexical measures that estimate the associ-
ation strength between words are one of themain tools employed in unsupervised discoveryof MWEs in corpora. There are di�erent waysof measuring this strength of word association:
• Measures based on raw frequency• Measures based on information theory,
e.g. pointwise mutual information• Measures based on the contingency ta-
bles, e.g. chi-square• Statistical significance• Measures of association between 3 or
more words• Measures which use linguistic information
in addition to word frequencies, e.g. a�n-ity of a word to a syntactic pattern
=) No consensus about best type of measureto use in each case
T����
Corpus searches and concordancersSketch engine, AntConc, WordSmith
Association measures and patternsUCS, Text::NSP, mwetoolkit, LocalMaxs,
ACCURAT toolkit, Xtract (Dragon), bgMWEToken-based annotation/tagging
jMWE, AMALGr, FIPS-Co, StringNetRecurring tree fragments
FragmentSeeker, DiscoDOP, Varro
S������� P���������Based on the non-decomposability property:the meaning of an MWE cannot be derived fromthe meanings of its component words.Context distribution methods (hot dog 6= dog)
HOT DOG
SANDWICHCAT
DOG
Substitution methods
Expression Substitution MWEBreak the vase Break the cup NOBreak the ice Break the snow YES
E���������
How to evaluate a lexicon of automaticallydiscovered MWEs?
3. Use/integrate2. Measure
1. Annotate/judge
MWEw1 w2 w3
w1 w4w2 w7 w1
MWE Score Annot.
Threshold
Precision: X%Recall (?): Y%
xi
Parser+MWEsParser
Challenging and open question
Discovery of MWEsTools & Evaluation
Thank you!
Discovery of MWEs
Tomas Krilavičius, Justina Mandravickaitė, Michael Oakes, Carlos Ramisch, Federico Sangati, Veronika Vincze
WG3 Report on MWE Processing