Comparing Word Relatedness Measures Based on Google n-grams
Aminul ISLAM, Evangelos MILIOS, Vlado KEŠELJ
Faculty of Computer Science, Dalhousie University, Halifax, Canada
COLING 2012



Page 2

Introduction
● Word relatedness has a wide range of applications
– IR: image retrieval, query expansion, …
– Paraphrase recognition
– Malapropism detection and correction
– Automatic creation of thesauri
– Speech recognition
– …

Page 3

Introduction
● Methods can be categorized into three groups:
– Corpus-based
  ● Supervised
  ● Unsupervised
– Knowledge-based
  ● Use semantic resources
– Hybrid

Page 4

Introduction
● This paper focuses on unsupervised corpus-based measures
● Six measures are compared

Page 5

Problem
● Unsupervised corpus-based measures usually use co-occurrence statistics, mostly word n-grams and their frequencies
– Co-occurrence statistics are corpus-specific
– Most corpora do not provide co-occurrence statistics, so they cannot be used on-line
– Some measures use web search results, but those results vary over time

Page 6

Motivation
● How can different measures be compared fairly?
● Observation
– These measures use co-occurrence statistics
– A corpus with co-occurrence information, e.g. Google n-grams, is probably a good resource

Page 7

Google N-Grams
● A publicly available corpus with
– Co-occurrence statistics (uni-grams to 5-grams)
– A large volume of <del>web text</del>
● Digitized books: over 5.2 million books published since 1500
– Data format:
  ● ngram year match_count volume_count
  ● e.g.:
– analysis is often described as 1991 1 1 1
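The record layout above can be read with a few lines of Python. This is only a sketch under stated assumptions: the fields are assumed to be tab-separated, and because the example row on the slide carries one more numeric column than the listed format, the sketch treats the last column as `volume_count` and ignores anything between `match_count` and it. `NgramRecord` and `parse_ngram_line` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class NgramRecord(NamedTuple):
    """One row of the n-gram data (field names follow the slide)."""
    ngram: str
    year: int
    match_count: int
    volume_count: int


def parse_ngram_line(line: str) -> NgramRecord:
    # Assumption: fields are tab-separated, n-gram text first.
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 fields, got {len(fields)}")
    # Assumption: the last column is volume_count; any extra column
    # between match_count and volume_count is skipped.
    return NgramRecord(
        ngram=fields[0],
        year=int(fields[1]),
        match_count=int(fields[2]),
        volume_count=int(fields[-1]),
    )


record = parse_ngram_line("analysis is often described as\t1991\t1\t1\t1")
print(record)
```

Summing `match_count` over all years for a given n-gram would give a frequency of the kind the later slides call C(w1 … wn).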

Page 8

Another Motivation
● To find an indirect mapping between Google n-grams and web search results
– Thus, it might be used on-line

Page 9

How About WordNet?
● In 2006, Budanitsky and Hirst evaluated 5 knowledge-based measures using WordNet
– Creating a resource like WordNet requires a lot of effort
– Its coverage of words is not sufficient for NLP tasks
– Such a resource is language-specific, while Google n-grams cover more than 10 languages

Page 10

Notations
● C(w1 … wn)
– Frequency of the n-gram
● D(w1 … wn)
– # of web docs containing the n-gram (up to 5-grams)
● M(w1, w2)
– C(w1 wi w2)

Page 11

Notations
● (w1, w2)
– 1/2 [ C(w1 wi w2) + C(w2 wi w1) ]
● N
– # of docs used in Google n-grams
● |V|
– # of uni-grams in Google n-grams
● Cmax
– max frequency in Google n-grams
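The symmetric tri-gram quantity 1/2 [ C(w1 wi w2) + C(w2 wi w1) ] can be illustrated on a toy table of tri-gram counts. The slide does not show how the middle word wi is aggregated, so this sketch assumes the counts are summed over every middle word; the counts and the helper names `trigram_sum` and `symmetric_trigram_score` are hypothetical.

```python
from collections import Counter

# Hypothetical tri-gram frequencies C(w1 wi w2), invented for illustration.
trigram_counts = Counter({
    ("car", "and", "driver"): 40,
    ("car", "or", "driver"): 10,
    ("driver", "of", "car"): 30,
    ("car", "is", "red"): 5,
})


def trigram_sum(counts, first, last):
    """Sum C(first wi last) over all middle words wi (an assumption)."""
    return sum(c for (a, _, b), c in counts.items() if a == first and b == last)


def symmetric_trigram_score(counts, w1, w2):
    """1/2 [ C(w1 wi w2) + C(w2 wi w1) ], aggregated over wi."""
    return 0.5 * (trigram_sum(counts, w1, w2) + trigram_sum(counts, w2, w1))


score = symmetric_trigram_score(trigram_counts, "car", "driver")
print(score)  # 0.5 * (50 + 30) = 40.0
```

Because both word orders are averaged, the score is the same for (w1, w2) and (w2, w1).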

Page 12

Assumptions
● Some measures use web search results and co-occurrence information not provided by Google n-grams, but
– C(w1) ≥ D(w1)
– C(w1 w2) ≥ D(w1 w2)
● This is because uni-grams and bi-grams may occur multiple times in one document

Page 13

Assumptions
● Considering the lower limits
– C(w1) ≈ D(w1)
– C(w1 w2) ≈ D(w1 w2)
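The inequality C(w) ≥ D(w) can be checked on a toy corpus: a word that appears twice in one document adds two to its frequency but only one to its document count. The three documents below are invented for illustration, and whitespace tokenization is an assumption.

```python
# Toy corpus (hypothetical), one string per document.
docs = [
    "the cat sat on the mat",
    "the dog",
    "a cat and a cat",
]


def corpus_frequency(docs, word):
    """C(w): total number of occurrences of the word in the corpus."""
    return sum(doc.split().count(word) for doc in docs)


def document_frequency(docs, word):
    """D(w): number of documents that contain the word at least once."""
    return sum(1 for doc in docs if word in doc.split())


for word in ("the", "cat", "dog"):
    c, d = corpus_frequency(docs, word), document_frequency(docs, word)
    print(word, c, d)  # C(w) is never smaller than D(w)
```

The two counts coincide exactly when no document repeats the word, which is the lower limit used in the approximation C ≈ D.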

Page 14

Measures
● Jaccard Coefficient
● Simpson Coefficient
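The formulas on this slide did not survive in the transcript. As a rough sketch, the textbook forms of the two coefficients are shown below with n-gram frequencies C in place of document counts D, following the approximation on the Assumptions slides. These are the standard definitions and may differ in detail from the variants used in the paper; the counts are hypothetical.

```python
def jaccard(c1, c2, c12):
    """Jaccard coefficient: C(w1 w2) / (C(w1) + C(w2) - C(w1 w2))."""
    denom = c1 + c2 - c12
    return c12 / denom if denom > 0 else 0.0


def simpson(c1, c2, c12):
    """Simpson (overlap) coefficient: C(w1 w2) / min(C(w1), C(w2))."""
    denom = min(c1, c2)
    return c12 / denom if denom > 0 else 0.0


# Hypothetical counts: C(w1) = 1000, C(w2) = 400, C(w1 w2) = 100.
print(jaccard(1000, 400, 100))  # 100 / 1300
print(simpson(1000, 400, 100))  # 100 / 400 = 0.25
```

Simpson normalizes by the rarer word only, so it is less sensitive than Jaccard to one word being far more frequent than the other.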

Page 15

Measures
● Dice Coefficient
● Pointwise Mutual Information
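The formulas for these two measures are also missing from the transcript. The sketch below uses the standard definitions, with frequencies C standing in for document counts and N as the normalizing total; the exact variants in the paper may differ, and all counts are hypothetical.

```python
import math


def dice(c1, c2, c12):
    """Dice coefficient: 2 * C(w1 w2) / (C(w1) + C(w2))."""
    denom = c1 + c2
    return 2 * c12 / denom if denom > 0 else 0.0


def pmi(c1, c2, c12, n):
    """Pointwise mutual information: log2( P(w1 w2) / (P(w1) * P(w2)) ).

    Probabilities are estimated as count / n. Returns -inf when the
    pair never co-occurs, which is the mathematical limit.
    """
    if c12 == 0:
        return float("-inf")
    return math.log2((c12 / n) / ((c1 / n) * (c2 / n)))


# Hypothetical counts: C(w1) = 1000, C(w2) = 400, C(w1 w2) = 100, N = 1,000,000.
print(dice(1000, 400, 100))            # 200 / 1400
print(pmi(1000, 400, 100, 1_000_000))  # log2(250)
```

Unlike the set-overlap coefficients, PMI is unbounded and depends on N, so scores from corpora of different sizes are not directly comparable.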

Page 16

Measures
● Normalized Google Distance (NGD) variation
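The slide names a variation of NGD, but the formula is not in the transcript, so the variation itself is not reproduced here. For orientation only, the sketch below implements the standard Normalized Google Distance with frequencies C in place of page counts; the counts are hypothetical.

```python
import math


def ngd(f1, f2, f12, n):
    """Standard Normalized Google Distance (not the slide's variation).

    NGD = (max(log f1, log f2) - log f12) / (log n - min(log f1, log f2))
    Smaller values mean the two words are more closely related.
    """
    if f12 == 0:
        return float("inf")  # the words never co-occur
    log1, log2, log12 = math.log(f1), math.log(f2), math.log(f12)
    return (max(log1, log2) - log12) / (math.log(n) - min(log1, log2))


# Hypothetical counts: C(w1) = 1000, C(w2) = 400, C(w1 w2) = 100, N = 1,000,000.
print(ngd(1000, 400, 100, 1_000_000))  # log(10) / log(2500)
```

NGD is a distance, so it has to be inverted or rescaled before it can be compared with similarity scores such as Jaccard or Dice.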

Page 17

Measures
● Relatedness based on Tri-grams (RT)

Page 18

Evaluation
● Compare with human judgments
– It is considered to be the upper limit
● Evaluate the measures with respect to a particular application
– Evaluate relatedness of words
● Text Similarity

Page 19

Compare With Human Judgments
● Rubenstein and Goodenough's 65 Word Pairs
– 51 people rated 65 pairs of English words on a scale of 0.0 to 4.0
● Miller and Charles' 28 Noun Pairs
– Restricts R&G to 30 pairs, rated by 38 human judges
– Most researchers use 28 pairs because 2 were omitted from an early version of WordNet
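One common way to compare a measure with such human ratings is a correlation coefficient over the word pairs; the later Result slide reports Pearson correlation for the text similarity task. The sketch below computes Pearson's r in plain Python. Both score lists are invented for illustration and are not taken from either data set.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Hypothetical human ratings (0.0 to 4.0) and measure scores for four word pairs.
human_ratings = [3.9, 3.1, 1.2, 0.4]
measure_scores = [0.81, 0.55, 0.20, 0.05]

r = pearson(human_ratings, measure_scores)
print(r)
```

Pearson's r is unchanged by linear rescaling, so a measure that outputs values in [0, 1] can be compared directly with ratings on the 0.0 to 4.0 scale.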

Page 20

Result

Page 21

Result

Page 22

Application-based Evaluation
● TOEFL's 80 Synonym Questions
– Given a problem word, e.g. infinite, and four alternative words (limitless, relative, unusual, structural), choose the most related word
● ESL's 50 Synonym Questions
– Same task as TOEFL's 80 synonym questions
– Except that the synonym questions are from English as a 2nd Language tests
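Answering a synonym question with a relatedness measure amounts to scoring each alternative against the problem word and picking the highest. A minimal sketch, assuming the measure is available as a function of two words; `toy_relatedness` and its scores are hypothetical stand-ins for any of the six measures.

```python
def answer_synonym_question(problem, alternatives, relatedness):
    """Return the alternative with the highest relatedness to the problem word."""
    return max(alternatives, key=lambda alt: relatedness(problem, alt))


# Hypothetical scores, invented for illustration only.
toy_scores = {
    ("infinite", "limitless"): 0.62,
    ("infinite", "relative"): 0.18,
    ("infinite", "unusual"): 0.11,
    ("infinite", "structural"): 0.07,
}


def toy_relatedness(w1, w2):
    return toy_scores.get((w1, w2), 0.0)


choice = answer_synonym_question(
    "infinite",
    ["limitless", "relative", "unusual", "structural"],
    toy_relatedness,
)
print(choice)  # limitless
```

Accuracy on the task is then the fraction of questions for which the chosen alternative matches the answer key.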

Page 23

Result

Page 24

Result

Page 25

Text Similarity
● Find the similarity between two text items
● Plug different word relatedness measures into a single text similarity measure, and evaluate the results of that text similarity measure on a standard data set
● 30 sentence pairs from one of the most widely used data sets were used

Page 26

Result

Page 27

Result
● Pearson correlation coefficient with mean human similarity ratings:
– Ho et al. (2010) used one WordNet-based measure and applied its scores in the method of Islam and Inkpen (2008), achieving 0.895
– Tsatsaronis et al. (2010) achieved 0.856
– Islam et al. (2012) achieved 0.916
● The improvement over Ho et al. (2010) is statistically significant at the 0.05 level

Page 28

Conclusion
● Any measure that uses n-gram statistics can easily be applied to the Google n-gram corpus and be fairly evaluated against existing work on standard data sets for different tasks
● An indirect mapping of co-occurrence statistics between the Google n-gram corpus and a web search engine was found, using some assumptions

Page 29

Conclusion
● Measures based on n-grams are language-independent
– They can be implemented for another language if it has a sufficiently large n-gram corpus