Comparing Word Relatedness Measures Based on Google n-grams Aminul ISLAM, Evangelos MILIOS, Vlado...

Comparing Word Relatedness Measures Based on Google n-grams

Aminul ISLAM, Evangelos MILIOS, Vlado KEŠELJ

Faculty of Computer Science, Dalhousie University, Halifax, Canada


COLING 2012

Introduction

● Word relatedness has a wide range of applications
– IR: image retrieval, query extension, …
– Paraphrase recognition
– Malapropism detection and correction
– Automatic creation of thesauri
– Speech recognition
– …

Introduction

● Methods can be categorized into 3 groups:
– Corpus-based
  ● Supervised
  ● Unsupervised
– Knowledge-based
  ● Semantic resources are used
– Hybrid

Introduction

● This paper focuses on unsupervised corpus-based measures

● Six measures are compared

Problem

● Unsupervised corpus-based measures usually rely on co-occurrence statistics, mostly word n-grams and their frequencies
– Co-occurrence statistics are corpus-specific
– Most corpora do not come with co-occurrence statistics, so they cannot be used on-line
– Some measures use web search results, but those results vary over time

Motivation

● How can different measures be compared fairly?
● Observation
– All of these measures use co-occurrence statistics
– A corpus that comes with co-occurrence information, e.g. Google n-grams, is probably a good resource

Google N-Grams

● A publicly available corpus with
– Co-occurrence statistics (uni-grams to 5-grams)
– A large volume of <del>web text</del>
  ● Digitized books: over 5.2 million books published since 1500
– Data format:
  ● ngram year match_count volume_count
  ● e.g.:
    – analysis is often described as 1991 1 1 1
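A minimal sketch of reading one record in the format above. It assumes the fields are tab-separated in the raw files (the slide shows them space-separated, which would be ambiguous because the n-gram itself contains spaces). The example record carries one more number than the three fields named on the slide, so this sketch takes the last field as volume_count and ignores anything between match_count and it; that choice is an assumption. The names NgramRecord and parse_ngram_line are hypothetical.

```python
from typing import NamedTuple, Tuple


class NgramRecord(NamedTuple):
    ngram: Tuple[str, ...]   # the words of the n-gram
    year: int
    match_count: int
    volume_count: int


def parse_ngram_line(line: str) -> NgramRecord:
    """Parse one tab-separated record (assumed layout, see lead-in)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 tab-separated fields, got {len(fields)}")
    return NgramRecord(
        ngram=tuple(fields[0].split(" ")),
        year=int(fields[1]),
        match_count=int(fields[2]),
        volume_count=int(fields[-1]),  # assumption: last field is volume_count
    )


# The example record from the slide, with tabs between fields.
record = parse_ngram_line("analysis is often described as\t1991\t1\t1\t1")
print(record)
```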

Another Motivation

● To find an indirect mapping between Google n-grams and web search results
– Thus, Google n-grams might be usable on-line

How About WordNet?

● In 2006, Budanitsky and Hirst evaluated 5 knowledge-based measures using WordNet
– Creating a resource like WordNet requires a lot of effort
– Its coverage of words is not sufficient for NLP tasks
– The resource is language-specific, while Google n-grams cover more than 10 languages

Notations

● C(w1 … wn)
– Frequency of the n-gram
● D(w1 … wn)
– # of web docs (up to 5-grams)
● M(w1, w2)
– C(w1 wi w2)

Notations

● (w1, w2)
– 1/2 [ C(w1 wi w2) + C(w2 wi w1) ]
● N
– # of docs used in Google n-grams
● |V|
– # of uni-grams in Google n-grams
● Cmax
– max frequency in Google n-grams

Assumptions

● Some measures use web search results and co-occurrence information that Google n-grams do not provide, but
– C(w1) ≥ D(w1)
– C(w1 w2) ≥ D(w1 w2)
● This holds because uni-grams and bi-grams may occur multiple times in one document
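The inequality can be checked on a toy example. The sketch below counts C (total occurrences) and D (number of documents containing the n-gram) over three made-up documents; the documents and the function names are illustrative assumptions, not data from the paper.

```python
from typing import List


def ngram_frequency(docs: List[List[str]], ngram: List[str]) -> int:
    """C(ngram): total number of occurrences across all documents."""
    n = len(ngram)
    return sum(
        1
        for doc in docs
        for i in range(len(doc) - n + 1)
        if doc[i:i + n] == ngram
    )


def document_frequency(docs: List[List[str]], ngram: List[str]) -> int:
    """D(ngram): number of documents containing the n-gram at least once."""
    n = len(ngram)
    return sum(
        1
        for doc in docs
        if any(doc[i:i + n] == ngram for i in range(len(doc) - n + 1))
    )


# Made-up documents, already tokenized on whitespace.
toy_docs = [
    "new york is new".split(),
    "york is big".split(),
    "i like new york new york".split(),
]

c_uni = ngram_frequency(toy_docs, ["new"])
d_uni = document_frequency(toy_docs, ["new"])
c_bi = ngram_frequency(toy_docs, ["new", "york"])
d_bi = document_frequency(toy_docs, ["new", "york"])
print(c_uni, d_uni, c_bi, d_bi)
```

Each document contributes at most 1 to D but possibly more to C, which is why C can never fall below D.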

Assumptions

● Considering the lower limits
– C(w1) ≈ D(w1)
– C(w1 w2) ≈ D(w1 w2)

Measures

● Jaccard Coefficient
● Simpson Coefficient

Measures

● Dice Coefficient
● Pointwise Mutual Information
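The formulas for these four measures did not survive in this transcript. As a rough guide, the sketch below uses their textbook forms, taking the counts of w1, w2, and the pair (w1, w2) plus a total N as inputs; the exact n-gram-based variants used in the paper may differ. Under the assumption C ≈ D, either frequencies or document counts could be passed in. The function and parameter names are hypothetical.

```python
import math


def jaccard(c1: float, c2: float, c12: float) -> float:
    """Textbook Jaccard: joint count over the size of the union."""
    union = c1 + c2 - c12
    return c12 / union if union > 0 else 0.0


def simpson(c1: float, c2: float, c12: float) -> float:
    """Textbook Simpson (overlap): joint count over the smaller count."""
    smaller = min(c1, c2)
    return c12 / smaller if smaller > 0 else 0.0


def dice(c1: float, c2: float, c12: float) -> float:
    """Textbook Dice: twice the joint count over the sum of counts."""
    total = c1 + c2
    return 2 * c12 / total if total > 0 else 0.0


def pmi(c1: float, c2: float, c12: float, n: float) -> float:
    """Textbook PMI: log2 of P(w1, w2) / (P(w1) * P(w2)), with P = count / n."""
    if min(c1, c2, c12) <= 0:
        return float("-inf")
    return math.log2((c12 * n) / (c1 * c2))


# Made-up counts, only to exercise the functions.
print(jaccard(10, 20, 5), simpson(10, 20, 5), dice(10, 20, 5), pmi(10, 20, 5, 1000))
```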

Measures

● Normalized Google Distance (NGD) variation
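The formula on this slide is also missing from the transcript. For reference, here is a sketch of the standard NGD of Cilibrasi and Vitányi, not the variation the slide refers to, which is not reproduced here. The inputs are the counts of w1, w2, and the pair, plus a total N; note that NGD is a distance, so lower values mean more related. The function name and the made-up counts are assumptions.

```python
import math


def ngd(c1: float, c2: float, c12: float, n: float) -> float:
    """Standard Normalized Google Distance from raw counts."""
    if c1 <= 0 or c2 <= 0:
        raise ValueError("individual counts must be positive")
    if c12 <= 0:
        return float("inf")  # the two words never co-occur
    log1, log2_, log12 = math.log(c1), math.log(c2), math.log(c12)
    denominator = math.log(n) - min(log1, log2_)
    if denominator <= 0:
        raise ValueError("n must exceed the smaller individual count")
    return (max(log1, log2_) - log12) / denominator


# Made-up counts, only to exercise the function.
print(ngd(10, 20, 5, 1000))
```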

Measures

● Relatedness based on Tri-grams (RT)

Evaluation

● Compare with human judgments
– Human judgment is considered to be the upper limit
● Evaluate the measures with respect to a particular application
– Evaluate relatedness of words
  ● Text similarity

Compare With Human Judgments

● Rubenstein and Goodenough's 65 Word Pairs
– 51 people rated 65 English word pairs on a scale of 0.0 to 4.0
● Miller and Charles' 28 Noun Pairs
– A restriction of R&G to 30 pairs, rated by 38 human judges
– Most researchers use 28 pairs because 2 were omitted from an early version of WordNet
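Comparison with human judgments on such word-pair sets is commonly reported as a correlation between the measure's scores and the mean human ratings. The sketch below computes a Pearson correlation by hand; it assumes Pearson is the statistic of interest here, and the ratings and scores are invented placeholders, not values from either data set.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented placeholders: mean human ratings (0.0 to 4.0) and measure scores
# for the same four hypothetical word pairs.
human_ratings = [0.4, 1.2, 3.1, 3.9]
measure_scores = [0.05, 0.30, 0.55, 0.90]
print(pearson(human_ratings, measure_scores))
```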

Result

Application-based Evaluation

● TOEFL's 80 Synonym Questions
– Given a problem word, infinite, and four alternative words, limitless, relative, unusual, structural, choose the most related word
● ESL's 50 Synonym Questions
– Same task as TOEFL's 80 synonym questions
– Except that the synonym questions are from English as a 2nd Language tests
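The synonym task reduces to picking the alternative with the highest relatedness to the problem word. The sketch below shows that selection step with a pluggable relatedness function; the scores in toy_scores are invented placeholders standing in for a real measure, and all names are hypothetical.

```python
from typing import Callable, Dict, Sequence, Tuple


def answer_synonym_question(
    problem: str,
    alternatives: Sequence[str],
    relatedness: Callable[[str, str], float],
) -> str:
    """Return the alternative most related to the problem word."""
    return max(alternatives, key=lambda alt: relatedness(problem, alt))


# Invented placeholder scores; a real run would call one of the measures
# with counts looked up in the n-gram corpus.
toy_scores: Dict[Tuple[str, str], float] = {
    ("infinite", "limitless"): 0.9,
    ("infinite", "relative"): 0.3,
    ("infinite", "unusual"): 0.2,
    ("infinite", "structural"): 0.1,
}


def toy_relatedness(w1: str, w2: str) -> float:
    return toy_scores.get((w1, w2), 0.0)


choice = answer_synonym_question(
    "infinite", ["limitless", "relative", "unusual", "structural"], toy_relatedness
)
print(choice)
```

Accuracy on a question set is then the fraction of questions for which the chosen alternative matches the answer key.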

Result

Text Similarity

● Find the similarity between two text items
● Plug different word relatedness measures into a single text similarity measure, and evaluate the results of that text similarity measure on a standard data set
● 30 sentence pairs from one of the most used data sets were used

Result

Result

● Pearson correlation coefficient with mean human similarity ratings:
– Ho et al. (2010) used one WordNet-based measure, applied those scores in the method of Islam and Inkpen (2008), and achieved 0.895
– Tsatsaronis et al. (2010) achieved 0.856
– Islam et al. (2012) achieved 0.916
● The improvement over Ho et al. (2010) is statistically significant at the 0.05 level

Conclusion

● Any measure that uses n-gram statistics can easily be applied to the Google n-gram corpus and be fairly evaluated against existing work on standard data sets for different tasks
● An indirect mapping of co-occurrence statistics between the Google n-gram corpus and a web search engine was found using some assumptions

Conclusion

● Measures based on n-grams are language-independent
– They can be implemented for another language if it has a sufficiently large n-gram corpus
