Collocation Extraction Using Monolingual Word Alignment Method

Zhanyi Liu, Haifeng Wang, Hua Wu, Sheng Li

EMNLP 2009

Collocation

• Two words
  – Consecutive ("by accident")
  – Interrupted ("take ... advice")

• Other examples
  – Proper noun ("New York")
  – Compound nouns ("ice cream")
  – Correlative conjunction ("either ... or")

Previous Works

• Co-occurring word pairs
  – Word pairs within a given window size

• Association measures
  – Frequency, log-likelihood, mutual information ...

• Disadvantages
  – Long-span collocations are missed
    • "either ... or", "because ... so"
    • Limited by window size
  – False collocations
    • Any word pair within the window becomes a candidate
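The window-based approach and both of its disadvantages can be illustrated with a minimal sketch. The function names, the toy sentences, and the particular PMI normalisation below are assumptions made for illustration; they are not taken from the slides.

```python
import math
from collections import Counter

def window_pairs(sentences, window=2):
    """Count unordered word pairs that co-occur within `window` tokens."""
    pair_freq = Counter()
    word_freq = Counter()
    for sent in sentences:
        word_freq.update(sent)
        for i, w in enumerate(sent):
            # Pair w with each of the next `window` tokens.
            for j in range(i + 1, min(i + 1 + window, len(sent))):
                pair_freq[tuple(sorted((w, sent[j])))] += 1
    return pair_freq, word_freq

def pmi(pair, pair_freq, word_freq):
    """Pointwise mutual information of a pair (one common formulation)."""
    n_pairs = sum(pair_freq.values())
    n_words = sum(word_freq.values())
    p_xy = pair_freq[pair] / n_pairs
    p_x = word_freq[pair[0]] / n_words
    p_y = word_freq[pair[1]] / n_words
    return math.log2(p_xy / (p_x * p_y))

# Toy sentences (hypothetical).
sentences = [
    "either you stay here tonight or you leave".split(),
    "take my advice".split(),
]
pair_freq, word_freq = window_pairs(sentences, window=2)
```

With `window=2` the pair ("either", "or") is never counted because the two words are five tokens apart, while widening the window to catch it also admits every other pair inside that window as a candidate.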

Monolingual Word Alignment

• Bilingual word alignment (BWA)
  – Source-target sentence pairs

• Monolingual word alignment (MWA)
  – Source-source sentence pairs
  – Replicate the corpus

Monolingual Word Alignment (2)

[Figure: word alignment examples, bilingual vs. monolingual]

• A word never collocates with itself
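A minimal sketch of the two ideas on these slides: replicating the corpus to get source-source pairs, and excluding self-alignment. The helper names `replicate` and `candidate_links` are hypothetical, and positions are assumed to be 1-based.

```python
def replicate(corpus):
    """Pair every sentence with a copy of itself (source-source pairs)."""
    return [(sent, list(sent)) for sent in corpus]

def candidate_links(sentence):
    """All 1-based position pairs (i, j) with i != j.

    The i != j condition encodes "a word never collocates with itself".
    """
    n = len(sentence)
    return [(i, j) for i in range(1, n + 1) for j in range(1, n + 1) if i != j]

pairs = replicate([["take", "my", "advice"]])
links = candidate_links(pairs[0][0])
```

For a sentence of l words this leaves l * (l - 1) candidate links instead of the l * l that a bilingual aligner would consider.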

MWA Model

• Sentence with $l$ words: $S = \{w_1, \dots, w_l\}$

• Alignment: $A = \{(i, a_i) \mid i \in [1, l]\}$

• Example: $A = \{(2,3), (3,2), (4,7), (6,7), \dots\}$

MWA Model (2)

• Adapt IBM Model 3 to MWA

• EM training produces three probabilities
  – Word collocation probability
  – Position collocation probability
    • $d(4 \mid 7, 12)$: probability that the 4th word collocates with the 7th word in a 12-word sentence
  – Fertility probability
    • Probability that $w_i$ collocates with $\Phi_i$ words
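To make the EM idea concrete, here is a deliberately simplified sketch. It estimates only a word collocation probability in the style of IBM Model 1, with self-alignment excluded; it does not model position or fertility, so it is not the adapted Model 3 the slides describe. The function name, toy corpus, and iteration count are assumptions.

```python
from collections import defaultdict

def train_mwa_model1(corpus, iterations=5):
    """EM estimate of t(w | v) over monolingual sentences.

    Simplification: lexical probabilities only (Model 1 style),
    without the position and fertility models of Model 3.
    """
    vocab = {w for sent in corpus for w in sent}
    t = defaultdict(lambda: 1.0 / len(vocab))  # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for sent in corpus:
            for i, w in enumerate(sent):
                # E-step: w may link to any other position, never to itself.
                others = [v for j, v in enumerate(sent) if j != i]
                norm = sum(t[(w, v)] for v in others)
                if norm == 0.0:
                    continue  # single-word sentence, nothing to link
                for v in others:
                    frac = t[(w, v)] / norm
                    count[(w, v)] += frac
                    total[v] += frac
        # M-step: renormalise expected counts per conditioning word v.
        t = defaultdict(float, {(w, v): c / total[v] for (w, v), c in count.items()})
    return t

# Toy corpus (hypothetical).
corpus = [
    ["take", "my", "advice"],
    ["take", "his", "advice"],
    ["my", "dog"],
    ["his", "dog"],
]
t = train_mwa_model1(corpus)
```

After each M-step the values t(w | v) sum to 1 over w for every conditioning word v that occurs in a sentence of two or more words.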

Collocation Extraction

• Extract and rank; filter out pairs with $\mathrm{freq}(w_i, w_j) < 5$

• Symmetric assumption
  – $(w_i, w_j) = (w_j, w_i)$
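The extract-and-rank step might look like the sketch below. The slides do not say how the two directional probabilities are combined under the symmetric assumption; averaging them is an assumption made here, and the function name and toy inputs are hypothetical.

```python
def rank_collocations(t, pair_freq, min_freq=5):
    """Rank unordered pairs by the mean of both directional probabilities.

    Averaging is an assumed way to realise the symmetric assumption.
    """
    scored = []
    for (a, b), freq in pair_freq.items():
        if freq < min_freq:  # slide: filter when freq(wi, wj) < 5
            continue
        score = (t.get((a, b), 0.0) + t.get((b, a), 0.0)) / 2
        scored.append(((a, b), score))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy directional probabilities and pair frequencies (hypothetical).
t = {
    ("take", "advice"): 0.4, ("advice", "take"): 0.6,
    ("the", "of"): 0.05, ("of", "the"): 0.05,
    ("rare", "pair"): 0.9, ("pair", "rare"): 0.9,
}
pair_freq = {("advice", "take"): 12, ("of", "the"): 300, ("pair", "rare"): 2}
ranked = rank_collocations(t, pair_freq)
```

Here the pair ("pair", "rare") is dropped by the frequency filter even though its probability is the highest, which is the kind of low-frequency candidate the later slides return to.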

Initial Experiment

• Chinese

• Training data
  – LDC2007T03 Tagged Chinese Gigaword
  – Xinhua portion, 28M words

• Gold set
  – Handcrafted collocation dictionaries
  – 56,888 collocations

Initial Experiment (2)

• Precision

• Baselines
  – Frequency, log-likelihood, mutual information
  – Log-likelihood achieves the best performance
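The precision formula itself is not preserved in this transcript. A common reading, assumed here, is the share of the top-N extracted pairs that appear in the gold set; the function name and toy data are hypothetical.

```python
def precision_at_n(ranked_pairs, gold, n):
    """Fraction of the top-n ranked pairs found in the gold set.

    Pairs are compared as unordered sets, matching the symmetric assumption.
    """
    top = ranked_pairs[:n]
    if not top:
        return 0.0
    hits = sum(1 for pair in top if frozenset(pair) in gold)
    return hits / len(top)

# Toy gold set and ranking (hypothetical).
gold = {frozenset(("take", "advice")), frozenset(("by", "accident"))}
ranked_pairs = [("advice", "take"), ("of", "the"), ("by", "accident"), ("he", "said")]
p_at_2 = precision_at_n(ranked_pairs, gold, 2)
```

Because a small gold set cannot contain every true collocation, precision measured this way understates the real quality of the list, which matters for the observations that follow.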

Initial Experiment (3)

• Observation
  – Precision is low
    • Small gold set (57K/200K = 28%)
  – Low precision when N < 20K

Observation

• Frequency vs. Probability vs. Precision
  – Precision curve
    • Lower freq --> lower precision
  – Alignment probability curve
    • Lower freq --> higher probability

Observation (2)

• Conclusion
  – What causes the lower precision in the top 20K?
  – Collocations with low frequency but high probability

Improved MWA Method

• Add a penalization function $y = f(x)$, where $x = \mathrm{freq}(w_1, w_2)$
  – When $x$ is small, $y$ approaches 0 (penalize)
  – When $x$ is large, $y$ approaches 1 (do not penalize)

• $y = e^{-b/x}$ ($b$ is tuned to 25)

• New ranking score
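A short sketch of the penalty with b = 25. The formula for the new ranking score did not survive in this transcript; multiplying the alignment probability by the penalty is an assumption, as are the function names and sample values.

```python
import math

def penalty(freq, b=25.0):
    """Exponential penalty y = e^(-b/x); freq must be positive."""
    return math.exp(-b / freq)

def penalized_score(prob, freq, b=25.0):
    """Assumed ranking score: alignment probability times the penalty."""
    return prob * penalty(freq, b)

# Penalty at a low, a medium, and a high frequency (illustrative values).
low, mid, high = penalty(5), penalty(25), penalty(250)

# A rare pair with high probability vs. a frequent pair with lower probability.
rare = penalized_score(0.9, 5)
frequent = penalized_score(0.3, 250)
```

At x = 25 the penalty equals e^(-1), roughly 0.37; at x = 5 it is below 0.01, and at x = 250 it is above 0.9. Under the assumed product form, the rare pair with probability 0.9 therefore ranks below the frequent pair with probability 0.3.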

Further Evaluation

• Automatic evaluation
  – Greatly outperforms the best baseline
  – For top 1K, 20.6% vs. 11.7%
  – Exponential function plays a key role

Further Evaluation (2)

• Human evaluation
  – Top 1K collocations
  – For each collocation, tag "True" or "False"

• 4 "False" cases
  – A: two semantically related words
    • (醫生, 護士) "doctor, nurse"
  – B: part of a multi-word collocation (>= 3 words)
    • (自我, 機制) in (自我, 約束, 機制) "self-restraint mechanism"
  – C: high-frequency bigram
    • (他, 說) "he said", (這, 是) "this is", (很, 好) "very good"
  – D: two words that co-occur frequently
    • (北京, 月) "Beijing, month", (和, 為) "and, for"

Further Evaluation (3)

• Far more true collocations than the baseline

• False collocations
  – A: semantically related, not distinguishable by MWA
  – B: only two-word collocations are extracted
    • Few collocations have >= 3 words
  – C: frequent bigrams, not distinguishable by MWA
  – D: far fewer than the baseline

Further Evaluation (3) cont.

• MWA is able to produce long-span collocations

• 48 extracted collocations with span > 6
  – 33 are tagged "True"
    • (處於, 狀態) "be in ... state", (由於, 因此) "because ... therefore"
  – 69% precision

Fertility vs. Precision

• Manually label 100 sentences and observe fertility
  – 78% of words collocate with 1 word
  – 17% of words collocate with 2 words
  – 95% of words have fertility <= 2

• Limit $\Phi_{max}$
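A fertility distribution like the one reported above could be computed from labelled alignments roughly as follows. The sketch assumes that the fertility of a position is the number of links (i, a_i) pointing at it; the function name and the toy alignment are hypothetical.

```python
from collections import Counter

def fertility_distribution(alignment, sentence_length):
    """Share of positions having each fertility value.

    Assumption: fertility of position j = number of links (i, a_i) with a_i == j.
    """
    fertility = Counter(target for _, target in alignment)
    histogram = Counter(
        fertility.get(pos, 0) for pos in range(1, sentence_length + 1)
    )
    return {k: v / sentence_length for k, v in sorted(histogram.items())}

# Toy alignment over a 5-word sentence, 1-based positions (hypothetical).
alignment = {(1, 2), (2, 1), (3, 5), (4, 5), (5, 3)}
dist = fertility_distribution(alignment, 5)
```

If most of the mass sits at fertility 1 and 2, as in the labelled sample, a small cap on the maximum fertility excludes few genuine links.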

Conclusion

• Main contribution
  – Successfully adapt BWA to MWA
  – Propose a ranking method
    • Alignment probability + exponential penalty function

• Initial failure is well discussed

• Future work
  – Improving Statistical Machine Translation with Monolingual Collocation, ACL 2010
  – Improve alignment, phrase table
