Collocation Extraction Using Monolingual Word Alignment Method


Transcript of Collocation Extraction Using Monolingual Word Alignment Method

  • Collocation Extraction Using Monolingual Word Alignment Method
    Zhanyi Liu, Haifeng Wang, Hua Wu, Sheng Li
    EMNLP 2009

  • Collocation
    Two words:
    - Consecutive ("by accident")
    - Interrupted ("take ... advice")
    Other examples:
    - Proper nouns ("New York")
    - Compound nouns ("ice cream")
    - Correlative conjunctions ("either ... or")

  • Previous Works
    Co-occurring word pairs: word pairs within a given window size
    Association measures: frequency, log-likelihood, mutual information, ...
    Disadvantages:
    - Long-span collocations ("either ... or", "because ... so") are limited by the window size
    - False collocations: any word pair inside the window may be counted
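
To make the window-based baseline concrete, here is a minimal Python sketch (an illustration, not the authors' code; the toy sentences, window size, and function names are assumptions) that counts word pairs inside a fixed window and ranks them by pointwise mutual information:

```python
import math
from collections import Counter

def window_pairs(tokens, window=5):
    """Yield word pairs that co-occur within `window` positions."""
    for i, w in enumerate(tokens):
        for v in tokens[i + 1:i + window]:
            yield (w, v)

def pmi_scores(sentences, window=5):
    """Pointwise mutual information for each co-occurring pair."""
    word_freq, pair_freq = Counter(), Counter()
    for sent in sentences:
        word_freq.update(sent)
        pair_freq.update(window_pairs(sent, window))
    n_words = sum(word_freq.values())
    n_pairs = sum(pair_freq.values())
    return {
        (w, v): math.log2((f / n_pairs) /
                          ((word_freq[w] / n_words) * (word_freq[v] / n_words)))
        for (w, v), f in pair_freq.items()
    }

# Toy data (assumed for illustration)
sents = [["he", "did", "it", "by", "accident"],
         ["she", "won", "by", "accident"]]
ranked = sorted(pmi_scores(sents).items(), key=lambda kv: -kv[1])
```

Note that any pair inside the window is counted, which is exactly the false-collocation weakness the slide points out, and pairs spanning more than the window are missed entirely.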

  • Monolingual Word Alignment
    Bilingual word alignment (BWA): trained on source-target sentence pairs
    Monolingual word alignment (MWA): trained on source-source sentence pairs, obtained by replicating the corpus

  • Monolingual Word Alignment (2)
    Bilingual vs. monolingual alignment
    Constraint: a word never collocates with itself
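
The corpus-replication step can be sketched as follows (a minimal illustration under the assumptions on this slide; the helper names are invented): each sentence is paired with a copy of itself, and self-alignment links i = a_i are excluded so that a word never collocates with itself.

```python
def make_mwa_pairs(corpus):
    """Replicate a monolingual corpus into source-source sentence pairs."""
    return [(sent, list(sent)) for sent in corpus]

def allowed_links(length):
    """All 1-based position links (i, a_i) except self-alignments i == a_i."""
    return [(i, j) for i in range(1, length + 1)
                   for j in range(1, length + 1) if i != j]

# Toy corpus (assumed for illustration)
corpus = [["take", "my", "advice"]]
pairs = make_mwa_pairs(corpus)
links = allowed_links(3)
```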

  • MWA Model
    Sentence with l words: S = {w1, ..., wl}
    Alignment: A = {(i, ai) | i ∈ [1, l]}
    Example: A = {(2,3), (3,2), (4,7), (6,7), ...}
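
The alignment representation on this slide can be shown with a small hypothetical example (the sentence and links below are invented for illustration and are not from the paper):

```python
# S = {w1, ..., wl}: a sentence of l words (positions are 1-based).
S = ["we", "take", "his", "good", "advice", "very", "seriously"]
# A = {(i, a_i)}: position i collocates with position a_i (hypothetical links).
A = {(2, 5), (5, 2), (3, 5), (4, 5)}

def collocated_pairs(sentence, alignment):
    """Turn position links into word pairs, keeping i < a_i to drop mirrors."""
    return {(sentence[i - 1], sentence[j - 1])
            for (i, j) in alignment if i < j}
```

Here the interrupted collocation "take ... advice" is recovered from the link (2, 5) even though the two words are not adjacent.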

  • MWA Model (2)
    Adapt IBM Model 3 to MWA.
    EM training produces three probabilities:
    - Word collocation probability
    - Position collocation probability, e.g. d(4|7,12): the probability that the 4th word collocates with the 7th word in a 12-word sentence
    - Fertility probability: the probability that wi collocates with φi words

  • Collocation Extraction
    Extract candidate pairs and rank them by alignment probability.
    Word pairs are filtered when freq(wi, wj) is below a threshold.

  • Initial Experiment
    Language: Chinese
    Training data: LDC2007T03 Tagged Chinese Gigaword, Xinhua portion, 28M words
    Gold set: handcrafted collocation dictionaries, 56,888 collocations

  • Initial Experiment (2)
    Metric: precision
    Baselines: frequency, log-likelihood, mutual information
    Log-likelihood achieves the best baseline performance

  • Initial Experiment (3)
    Observations:
    - Precision is low
    - The gold set is small (57K / 200K = 28%)
    - Precision is low when N < 20K

  • Observation
    Frequency vs. probability vs. precision
    Precision curve: lower frequency --> lower precision
    Alignment probability curve: lower frequency --> higher probability

  • Observation (2)
    Conclusion: what causes the lower precision of the top 20K?
    Collocations with low frequency but high alignment probability

  • Improved MWA Method
    Add a penalty function y = f(x), where x = freq(w1, w2):
    - When x is small, y approaches 0 (penalize)
    - When x is large, y approaches 1 (do not penalize)
    y = e^(-b/x), with b tuned to 25
    New ranking score: alignment probability combined with the penalty
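
The penalty on this slide is straightforward to write down; combining it multiplicatively with the alignment probability is one natural reading of the "new ranking score" (a sketch, not necessarily the authors' exact formula):

```python
import math

def penalty(freq, b=25):
    """y = e^(-b/x): near 0 for rare pairs, near 1 for frequent ones."""
    return math.exp(-b / freq)

def ranking_score(align_prob, freq, b=25):
    """Alignment probability damped by the frequency penalty (assumed product form)."""
    return align_prob * penalty(freq, b)
```

With b = 25, a pair seen only once is damped almost to zero, while a pair seen thousands of times keeps nearly its full alignment probability; this is what suppresses the low-frequency, high-probability pairs identified on the previous slide.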

  • Further Evaluation
    Automatic evaluation: greatly outperforms the best baseline
    For the top 1K, precision is 20.6% vs. 11.7%
    The exponential penalty function plays a key role

  • Further Evaluation (2)
    Human evaluation: for each of the top 1K collocations, tag "True" or "False"
    Four "False" cases:
    - A: two semantically related words
    - B: part of a multi-word collocation (>= 3 words)
    - C: a high-frequency bigram
    - D: two words that co-occur frequently

  • Further Evaluation (3)
    True collocations are far more numerous than in the baseline
    False collocations:
    - A: semantically related words, not distinguishable by MWA
    - B: only the two-word part is extracted; few collocations have >= 3 words
    - C: frequent bigrams, not distinguishable by MWA
    - D: much rarer than in the baseline

  • Further Evaluation (3) cont.
    MWA is able to produce long-span collocations:
    48 extracted collocations have a span > 6; 33 are tagged "True" (69% precision)

  • Fertility vs. Precision
    Manually label 100 sentences and observe fertility:
    - 78% of words collocate with 1 word
    - 17% of words collocate with 2 words
    - 95% of words have fertility <= 2

  • Conclusion
    Main contributions:
    - Successfully adapted BWA to MWA
    - Proposed a ranking method: alignment probability + exponential penalty function
    - The initial failure is thoroughly discussed
    Future work: Improving Statistical Machine Translation with Monolingual Collocation (ACL 2010); improve alignment and the phrase table