Comparing Word Relatedness Measures Based on Google n-grams
Aminul ISLAM, Evangelos MILIOS, Vlado KEŠELJ
Faculty of Computer Science, Dalhousie University, Halifax, Canada
[email protected], [email protected], [email protected]
COLING 2012
Introduction
●Word relatedness has a wide range of applications
– IR: image retrieval, query expansion, …
– Paraphrase recognition
– Malapropism detection and correction
– Automatic creation of thesauri
– Speech recognition
– …
Introduction
●Methods can be categorized into three groups:
– Corpus-based
●Supervised
●Unsupervised
– Knowledge-based
●Semantic resources are used
– Hybrid
Introduction
●This paper focuses on unsupervised corpus-based measures
●Six measures are compared
Problem
●Unsupervised corpus-based measures usually use co-occurrence statistics, mostly word n-grams and their frequencies
– The co-occurrence statistics are corpus-specific
– Most corpora do not provide co-occurrence statistics, so they cannot be used online
– Some measures use web search results, but those results vary over time
Motivation
●How can different measures be compared fairly?
●Observation
– They all use co-occurrence statistics
– A corpus with co-occurrence information, e.g., Google n-grams, is probably a good resource
Google N-Grams
●A publicly available corpus with
– Co-occurrence statistics (uni-grams to 5-grams)
– A large volume of text: not web text, but digitized books (over 5.2 million books published since 1500)
– Data format:
●ngram year match_count volume_count
●e.g.:
– analysis is often described as 1991 1 1 1
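A record in this format can be parsed as follows; this is a minimal sketch that assumes the fields are tab-separated (the slide shows only the field order):

```python
def parse_ngram_line(line):
    """Split a raw Google n-gram record into (ngram, year, match_count, volume_count).

    Assumes tab-separated fields; the n-gram itself may contain spaces,
    so we split from the right on the last three tabs.
    """
    ngram, year, match_count, volume_count = line.rsplit("\t", 3)
    return ngram, int(year), int(match_count), int(volume_count)

# The example record from the slide, tab-separated:
record = parse_ngram_line("analysis is often described as\t1991\t1\t1")
```

Splitting from the right keeps multi-word n-grams intact even though they contain internal whitespace.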
Another Motivation
●To find an indirect mapping between Google n-grams and web search results
– Thus, the measures might be used online
How About WordNet?
●In 2006, Budanitsky and Hirst evaluated five knowledge-based measures using WordNet
– Creating a resource like WordNet requires a lot of effort
– Its word coverage is not sufficient for many NLP tasks
– The resource is language-specific, while Google n-grams covers more than 10 languages
Notations
●C(w1 … wn)
– Frequency of the n-gram w1 … wn
●D(w1 … wn)
– Number of web documents containing the n-gram (up to 5-grams)
●M(w1, w2)
– C(w1 wi w2)
Notations
●(w1, w2)
– ½ [ C(w1 wi w2) + C(w2 wi w1) ]
●N
– Number of documents used in Google n-grams
●|V|
– Number of uni-grams in Google n-grams
●Cmax
– Maximum frequency in Google n-grams
Assumptions
●Some measures use web search results, i.e., co-occurrence information not provided by Google n-grams; however
– C(w1) ≥ D(w1)
– C(w1 w2) ≥ D(w1 w2)
●This is because uni-grams and bi-grams may occur multiple times in a single document
Assumptions
●Taking the lower limits
– C(w1) ≈ D(w1)
– C(w1 w2) ≈ D(w1 w2)
Measures
●Jaccard Coefficient
●Simpson Coefficient
Measures
●Dice Coefficient
●Pointwise Mutual Information
Measures
●Normalized Google Distance (NGD) variation
Measures
●Relatedness based on Tri-grams (RT)
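The formulas on these slides were shown as images and did not survive transcription. As a sketch, here are the standard definitions of four of the measures, written over n-gram counts; the variable names (c1 = C(w1), c2 = C(w2), c12 = C(w1 w2), n = N) are mine, and the paper's Google-n-gram adaptations, the NGD variation, and the tri-gram-based RT measure differ in their details:

```python
import math

def jaccard(c1, c2, c12):
    # Overlap over union, approximated with counts: C(w1 w2) / (C(w1) + C(w2) - C(w1 w2))
    return c12 / (c1 + c2 - c12)

def simpson(c1, c2, c12):
    # Overlap coefficient: C(w1 w2) / min(C(w1), C(w2))
    return c12 / min(c1, c2)

def dice(c1, c2, c12):
    # 2 * C(w1 w2) / (C(w1) + C(w2))
    return 2 * c12 / (c1 + c2)

def pmi(c1, c2, c12, n):
    # Log-ratio of observed co-occurrence probability to the independence baseline
    return math.log2((c12 / n) / ((c1 / n) * (c2 / n)))

def ngd(c1, c2, c12, n):
    # Normalized Google Distance (Cilibrasi & Vitanyi), with n-gram counts
    # standing in for web page counts, per the C ≈ D assumption above
    return (max(math.log(c1), math.log(c2)) - math.log(c12)) / \
           (math.log(n) - min(math.log(c1), math.log(c2)))
```

Note that NGD is a distance (lower means more related), while the coefficients and PMI are similarities (higher means more related).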
Evaluation
●Compare with human judgments
– This is considered to be the upper limit
●Evaluate the measures with respect to a particular application
– Evaluate the relatedness of words
●Text similarity
Compare With Human Judgments
●Rubenstein and Goodenough's 65 word pairs
– 51 people rated 65 pairs of English words on a scale of 0.0 to 4.0
●Miller and Charles' 28 noun pairs
– Restricting R&G to 30 pairs, with 38 human judges
– Most researchers use 28 of the pairs because 2 were omitted from an early version of WordNet
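Agreement with mean human ratings is typically reported as a correlation coefficient. A minimal sketch of Pearson's r between a measure's scores and human ratings, with toy inputs rather than the actual data sets:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy example: measure scores vs. human ratings for three word pairs
r = pearson([0.1, 0.5, 0.9], [0.5, 2.0, 3.5])
```

In practice each word pair in R&G or M&C contributes one (measure score, mean human rating) point, and r is computed over all pairs.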
Result
Application-based Evaluation
●TOEFL's 80 synonym questions
– Given a problem word, e.g., infinite, and four alternative words, limitless, relative, unusual, and structural, choose the most related word
●ESL's 50 synonym questions
– Same as the TOEFL 80-synonym-questions task
– Except that the synonym questions are from English as a Second Language tests
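Any of the relatedness measures can answer a synonym question by picking the alternative most related to the problem word. A minimal sketch; the relatedness table and its scores here are made-up illustrations, not values from the paper:

```python
def answer_synonym_question(problem, alternatives, relatedness):
    """Choose the alternative with the highest relatedness to the problem word."""
    return max(alternatives, key=lambda w: relatedness(problem, w))

# Hypothetical relatedness scores for the slide's example question
toy_scores = {
    ("infinite", "limitless"): 0.9,
    ("infinite", "relative"): 0.2,
    ("infinite", "unusual"): 0.1,
    ("infinite", "structural"): 0.05,
}
relatedness = lambda a, b: toy_scores.get((a, b), 0.0)

best = answer_synonym_question(
    "infinite", ["limitless", "relative", "unusual", "structural"], relatedness
)
```

Accuracy on the 80 (or 50) questions is then the fraction answered correctly.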
Result
Text Similarity
●Find the similarity between two text items
●Plug the different word-relatedness measures into a single text similarity measure, and evaluate the results of that text similarity measure on a standard data set
●30 sentence pairs from one of the most widely used data sets were used
Result
●Pearson correlation coefficient with mean human similarity ratings:
– Ho et al. (2010), who used a WordNet-based measure and applied those scores within the method of Islam and Inkpen (2008), achieved 0.895
– Tsatsaronis et al. (2010) achieved 0.856
– Islam et al. (2012) achieved 0.916
●The improvement over Ho et al. (2010) is statistically significant at the 0.05 level
Conclusion
●Any measure that uses n-gram statistics can easily be applied to the Google n-gram corpus and fairly evaluated against existing work on standard data sets for different tasks
●The paper finds an indirect mapping of co-occurrence statistics between the Google n-gram corpus and a web search engine, using some assumptions
Conclusion
●Measures based on n-grams are language-independent
– They can be implemented for any other language that has a sufficiently large n-gram corpus