Collocation Extraction Using Monolingual Word Alignment Method


Transcript of Collocation Extraction Using Monolingual Word Alignment Method

  • Collocation Extraction Using Monolingual Word Alignment Method
    Zhanyi Liu, Haifeng Wang, Hua Wu, Sheng Li
    EMNLP 2009

  • Collocation
    Two words:
    - Consecutive ("by accident")
    - Interrupted ("take ... advice")
    Other examples:
    - Proper nouns ("New York")
    - Compound nouns ("ice cream")
    - Correlative conjunctions ("either ... or")

  • Previous Works
    Co-occurring word pairs: word pairs within a given window size
    Association measures: frequency, log-likelihood, mutual information, ...
    Disadvantages:
    - Long-span collocations ("either ... or", "because ... so") are limited by the window size
    - False collocations: any word pair inside the window may be counted
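
To make the window-based baseline concrete, here is a minimal Python sketch (an illustration, not the authors' code; the toy sentences, window size, and function names are assumptions) that counts word pairs inside a fixed window and ranks them by pointwise mutual information:

```python
import math
from collections import Counter

def window_pairs(tokens, window=5):
    """Yield word pairs that co-occur within `window` positions."""
    for i, w in enumerate(tokens):
        for v in tokens[i + 1:i + window]:
            yield (w, v)

def pmi_scores(sentences, window=5):
    """Pointwise mutual information for each co-occurring pair."""
    word_freq, pair_freq = Counter(), Counter()
    for sent in sentences:
        word_freq.update(sent)
        pair_freq.update(window_pairs(sent, window))
    n_words = sum(word_freq.values())
    n_pairs = sum(pair_freq.values())
    return {
        (w, v): math.log2((f / n_pairs) /
                          ((word_freq[w] / n_words) * (word_freq[v] / n_words)))
        for (w, v), f in pair_freq.items()
    }

# Toy data (assumed for illustration)
sents = [["he", "did", "it", "by", "accident"],
         ["she", "won", "by", "accident"]]
ranked = sorted(pmi_scores(sents).items(), key=lambda kv: -kv[1])
```

Note that any pair inside the window is counted, which is exactly the false-collocation weakness the slide points out, and pairs spanning more than the window are missed entirely.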

  • Monolingual Word Alignment
    Bilingual word alignment (BWA): trained on source-target sentence pairs
    Monolingual word alignment (MWA): trained on source-source sentence pairs, obtained by replicating the corpus

  • Monolingual Word Alignment (2)
    Bilingual vs. monolingual alignment
    Constraint: a word never collocates with itself
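
The corpus-replication step can be sketched as follows (a minimal illustration under the assumptions on this slide; the helper names are invented): each sentence is paired with a copy of itself, and self-alignment links i = a_i are excluded so that a word never collocates with itself.

```python
def make_mwa_pairs(corpus):
    """Replicate a monolingual corpus into source-source sentence pairs."""
    return [(sent, list(sent)) for sent in corpus]

def allowed_links(length):
    """All 1-based position links (i, a_i) except self-alignments i == a_i."""
    return [(i, j) for i in range(1, length + 1)
                   for j in range(1, length + 1) if i != j]

# Toy corpus (assumed for illustration)
corpus = [["take", "my", "advice"]]
pairs = make_mwa_pairs(corpus)
links = allowed_links(3)
```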

  • MWA Model
    Sentence with l words: S = {w1, ..., wl}
    Alignment: A = {(i, ai) | i ∈ [1, l]}
    Example: A = {(2,3), (3,2), (4,7), (6,7), ...}
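
The alignment representation on this slide can be shown with a small hypothetical example (the sentence and links below are invented for illustration and are not from the paper):

```python
# S = {w1, ..., wl}: a sentence of l words (positions are 1-based).
S = ["we", "take", "his", "good", "advice", "very", "seriously"]
# A = {(i, a_i)}: position i collocates with position a_i (hypothetical links).
A = {(2, 5), (5, 2), (3, 5), (4, 5)}

def collocated_pairs(sentence, alignment):
    """Turn position links into word pairs, keeping i < a_i to drop mirrors."""
    return {(sentence[i - 1], sentence[j - 1])
            for (i, j) in alignment if i < j}
```

Here the interrupted collocation "take ... advice" is recovered from the link (2, 5) even though the two words are not adjacent.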

  • MWA Model (2)
    Adapt IBM Model 3 to MWA.
    EM training produces three probabilities:
    - Word collocation probability
    - Position collocation probability, e.g. d(4|7,12): the probability that the 4th word collocates with the 7th word in a 12-word sentence
    - Fertility probability: the probability that wi collocates with φi words

  • Collocation Extraction
    Extract candidate pairs and rank them by alignment probability.
    Word pairs are filtered when freq(wi, wj) is below a threshold.

  • Initial Experiment
    Language: Chinese
    Training data: LDC2007T03 Tagged Chinese Gigaword, Xinhua portion, 28M words
    Gold set: handcrafted collocation dictionaries, 56,888 collocations

  • Initial Experiment (2)
    Metric: precision
    Baselines: frequency, log-likelihood, mutual information
    Log-likelihood achieves the best baseline performance

  • Initial Experiment (3)
    Observations:
    - Precision is low
    - The gold set is small (57K / 200K = 28%)
    - Precision is low when N < 20K

  • Observation
    Frequency vs. probability vs. precision
    Precision curve: lower frequency --> lower precision
    Alignment probability curve: lower frequency --> higher probability

  • Observation (2)
    Conclusion: what causes the lower precision of the top 20K?
    Collocations with low frequency but high alignment probability

  • Improved MWA Method
    Add a penalty function y = f(x), where x = freq(w1, w2):
    - When x is small, y approaches 0 (penalize)
    - When x is large, y approaches 1 (do not penalize)
    y = e^(-b/x), with b tuned to 25
    New ranking score: alignment probability combined with the penalty
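
The penalty on this slide is straightforward to write down; combining it multiplicatively with the alignment probability is one natural reading of the "new ranking score" (a sketch, not necessarily the authors' exact formula):

```python
import math

def penalty(freq, b=25):
    """y = e^(-b/x): near 0 for rare pairs, near 1 for frequent ones."""
    return math.exp(-b / freq)

def ranking_score(align_prob, freq, b=25):
    """Alignment probability damped by the frequency penalty (assumed product form)."""
    return align_prob * penalty(freq, b)
```

With b = 25, a pair seen only once is damped almost to zero, while a pair seen thousands of times keeps nearly its full alignment probability; this is what suppresses the low-frequency, high-probability pairs identified on the previous slide.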

  • Further Evaluation
    Automatic evaluation: greatly outperforms the best baseline
    For the top 1K, precision is 20.6% vs. 11.7%
    The exponential penalty function plays a key role

  • Further Evaluation (2)
    Human evaluation: for each of the top 1K collocations, tag "True" or "False"
    Four "False" cases:
    - A: two semantically related words
    - B: part of a multi-word collocation (>= 3 words)
    - C: a high-frequency bigram
    - D: two words that co-occur frequently

  • Further Evaluation (3)
    True collocations are far more numerous than in the baseline
    False collocations:
    - A: semantically related words, not distinguishable by MWA
    - B: only the two-word part is extracted; few collocations have >= 3 words
    - C: frequent bigrams, not distinguishable by MWA
    - D: much rarer than in the baseline

  • Further Evaluation (3) cont.
    MWA is able to produce long-span collocations:
    48 extracted collocations have a span > 6; 33 are tagged "True" (69% precision)

  • Fertility vs. Precision
    Manually label 100 sentences and observe fertility:
    - 78% of words collocate with 1 word
    - 17% of words collocate with 2 words
    - 95% of words have fertility <= 2

  • Conclusion
    Main contributions:
    - Successfully adapted BWA to MWA
    - Proposed a ranking method: alignment probability + exponential penalty function
    - The initial failure is thoroughly discussed
    Future work: Improving Statistical Machine Translation with Monolingual Collocation (ACL 2010); improve alignment and the phrase table