Collocation Extraction Using Monolingual Word Alignment Method


Transcript of Collocation Extraction Using Monolingual Word Alignment Method

  • Collocation Extraction Using Monolingual Word Alignment Method
    Zhanyi Liu, Haifeng Wang, Hua Wu, Sheng Li
    EMNLP 2009

  • Collocation
    Two words:
    - Consecutive ("by accident")
    - Interrupted ("take ... advice")
    Other examples:
    - Proper nouns ("New York")
    - Compound nouns ("ice cream")
    - Correlative conjunctions ("either ... or")

  • Previous Works
    Co-occurring word pairs: word pairs within a given window size
    Association measures: frequency, log-likelihood, mutual information, ...
    Disadvantages:
    - Long-span collocations ("either ... or", "because ... so") are limited by the window size
    - False collocations: any word pair inside the window may be counted
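
To make the window-based baseline concrete, here is a minimal Python sketch (an illustration, not the authors' code; the toy sentences, window size, and function names are assumptions) that counts word pairs inside a fixed window and ranks them by pointwise mutual information:

```python
import math
from collections import Counter

def window_pairs(tokens, window=5):
    """Yield word pairs that co-occur within `window` positions."""
    for i, w in enumerate(tokens):
        for v in tokens[i + 1:i + window]:
            yield (w, v)

def pmi_scores(sentences, window=5):
    """Pointwise mutual information for each co-occurring pair."""
    word_freq, pair_freq = Counter(), Counter()
    for sent in sentences:
        word_freq.update(sent)
        pair_freq.update(window_pairs(sent, window))
    n_words = sum(word_freq.values())
    n_pairs = sum(pair_freq.values())
    return {
        (w, v): math.log2((f / n_pairs) /
                          ((word_freq[w] / n_words) * (word_freq[v] / n_words)))
        for (w, v), f in pair_freq.items()
    }

# Toy data (assumed for illustration)
sents = [["he", "did", "it", "by", "accident"],
         ["she", "won", "by", "accident"]]
ranked = sorted(pmi_scores(sents).items(), key=lambda kv: -kv[1])
```

Note that any pair inside the window is counted, which is exactly the false-collocation weakness the slide points out, and pairs spanning more than the window are missed entirely.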

  • Monolingual Word Alignment
    Bilingual word alignment (BWA): trained on source-target sentence pairs
    Monolingual word alignment (MWA): trained on source-source sentence pairs, obtained by replicating the corpus

  • Monolingual Word Alignment (2)
    Bilingual vs. monolingual alignment
    Constraint: a word never collocates with itself
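
The corpus-replication step can be sketched as follows (a minimal illustration under the assumptions on this slide; the helper names are invented): each sentence is paired with a copy of itself, and self-alignment links i = a_i are excluded so that a word never collocates with itself.

```python
def make_mwa_pairs(corpus):
    """Replicate a monolingual corpus into source-source sentence pairs."""
    return [(sent, list(sent)) for sent in corpus]

def allowed_links(length):
    """All 1-based position links (i, a_i) except self-alignments i == a_i."""
    return [(i, j) for i in range(1, length + 1)
                   for j in range(1, length + 1) if i != j]

# Toy corpus (assumed for illustration)
corpus = [["take", "my", "advice"]]
pairs = make_mwa_pairs(corpus)
links = allowed_links(3)
```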

  • MWA Model
    Sentence with l words: S = {w1, ..., wl}
    Alignment: A = {(i, ai) | i ∈ [1, l]}
    Example: A = {(2,3), (3,2), (4,7), (6,7), ...}
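
The alignment representation on this slide can be shown with a small hypothetical example (the sentence and links below are invented for illustration and are not from the paper):

```python
# S = {w1, ..., wl}: a sentence of l words (positions are 1-based).
S = ["we", "take", "his", "good", "advice", "very", "seriously"]
# A = {(i, a_i)}: position i collocates with position a_i (hypothetical links).
A = {(2, 5), (5, 2), (3, 5), (4, 5)}

def collocated_pairs(sentence, alignment):
    """Turn position links into word pairs, keeping i < a_i to drop mirrors."""
    return {(sentence[i - 1], sentence[j - 1])
            for (i, j) in alignment if i < j}
```

Here the interrupted collocation "take ... advice" is recovered from the link (2, 5) even though the two words are not adjacent.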

  • MWA Model (2)
    Adapt IBM Model 3 to MWA.
    EM training produces three probabilities:
    - Word collocation probability
    - Position collocation probability, e.g. d(4|7,12): the probability that the 4th word collocates with the 7th word in a 12-word sentence
    - Fertility probability: the probability that wi collocates with φi words

  • Collocation Extraction
    Extract candidate pairs and rank them by alignment probability.
    Word pairs are filtered when freq(wi, wj) is below a threshold.

  • Initial Experiment
    Language: Chinese
    Training data: LDC2007T03 Tagged Chinese Gigaword, Xinhua portion, 28M words
    Gold set: handcrafted collocation dictionaries, 56,888 collocations

  • Initial Experiment (2)
    Metric: precision
    Baselines: frequency, log-likelihood, mutual information
    Log-likelihood achieves the best baseline performance

  • Initial Experiment (3)
    Observations:
    - Precision is low
    - The gold set is small (57K / 200K = 28%)
    - Precision is low when N < 20K

  • Observation
    Frequency vs. probability vs. precision
    Precision curve: lower frequency --> lower precision
    Alignment probability curve: lower frequency --> higher probability

  • Observation (2)
    Conclusion: what causes the lower precision of the top 20K?
    Collocations with low frequency but high alignment probability

  • Improved MWA Method
    Add a penalty function y = f(x), where x = freq(w1, w2):
    - When x is small, y approaches 0 (penalize)
    - When x is large, y approaches 1 (do not penalize)
    y = e^(-b/x), with b tuned to 25
    New ranking score: alignment probability combined with the penalty
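
The penalty on this slide is straightforward to write down; combining it multiplicatively with the alignment probability is one natural reading of the "new ranking score" (a sketch, not necessarily the authors' exact formula):

```python
import math

def penalty(freq, b=25):
    """y = e^(-b/x): near 0 for rare pairs, near 1 for frequent ones."""
    return math.exp(-b / freq)

def ranking_score(align_prob, freq, b=25):
    """Alignment probability damped by the frequency penalty (assumed product form)."""
    return align_prob * penalty(freq, b)
```

With b = 25, a pair seen only once is damped almost to zero, while a pair seen thousands of times keeps nearly its full alignment probability; this is what suppresses the low-frequency, high-probability pairs identified on the previous slide.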

  • Further Evaluation
    Automatic evaluation: greatly outperforms the best baseline
    For the top 1K, precision is 20.6% vs. 11.7%
    The exponential penalty function plays a key role

  • Further Evaluation (2)
    Human evaluation: for each of the top 1K collocations, tag "True" or "False"
    Four "False" cases:
    - A: two semantically related words
    - B: part of a multi-word collocation (>= 3 words)
    - C: a high-frequency bigram
    - D: two words that co-occur frequently

  • Further Evaluation (3)
    True collocations are far more numerous than in the baseline
    False collocations:
    - A: semantically related words, not distinguishable by MWA
    - B: only the two-word part is extracted; few collocations have >= 3 words
    - C: frequent bigrams, not distinguishable by MWA
    - D: much rarer than in the baseline

  • Further Evaluation (3) cont.
    MWA is able to produce long-span collocations:
    48 extracted collocations have a span > 6; 33 are tagged "True" (69% precision)

  • Fertility vs. Precision
    Manually label 100 sentences and observe fertility:
    - 78% of words collocate with 1 word
    - 17% of words collocate with 2 words
    - 95% of words have fertility <= 2

  • Conclusion
    Main contributions:
    - Successfully adapted BWA to MWA
    - Proposed a ranking method: alignment probability + exponential penalty function
    - The initial failure is thoroughly discussed
    Future work: Improving Statistical Machine Translation with Monolingual Collocation (ACL 2010); improve alignment and the phrase table