JIGSAW: an Algorithm for Word Sense Disambiguation
description
Transcript of JIGSAW: an Algorithm for Word Sense Disambiguation
EVALITA 2007Evaluation of NLP Tools for Italian
EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy
JIGSAW: an Algorithm for Word Sense Disambiguation
Dipartimento di InformaticaUniversity of Bari
Pierpaolo Basile ([email protected])Giovanni Semeraro ([email protected])
EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy
Word Sense Disambiguation
Word Sense Disambiguation (WSD) is the problem of selecting a sense for a word from a set of predefined possibilitiesSense Inventory usually comes from a
dictionary or thesaurusKnowledge intensive methods, supervised
learning, and (sometimes) bootstrapping approaches
EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy
All Words WSD
Attempt to disambiguate all open-class words in a text: “He put his suit over the back of the chair”
How?Knowledge-based approachesUse information from dictionariesPosition in a semantic networkUse discourse propertiesMinimally supervised approachesMost frequent sense
EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy
JIGSAW
Knowledge-based WSD algorithmDisambiguation of words in a text by
exploiting WordNet sensesCombination of three different strategies
to disambiguate nouns, verbs, adjectives and adverbs
Main motivation: the effectiveness of a WSD algorithm is strongly influenced by the POS tag of the target word
EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy
JIGSAW algorithm
Input: document d = {w1, w2, ... , wh} Output: list of WordNet synsets X = {s1, s2, ... ,
sk}each element si is obtained by disambiguating the
target word wi
based on the information obtained from WordNet about words in the context context C of the target word: a window of n words to the left
and another n words to the right, for a total of 2n surrounding words
For each word JIGSAW adopts a different strategy based on POS tag
EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy
JIGSAW_nouns: the idea
Based on Resnik [Resnik95] algorithm for disambiguating noun groups
Given a set of nouns W={w1,w2, ... ,wn} from document d:each wi has an associated sense inventory
Si={si1, si2, ... , sik} of possible senses
Goal: assigning each wi with the most appropriate sense sihSi, according to the similarity of wi with the other nouns in W
EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy
JIGSAW_nouns: semantic similarity
w = cat C = {mouse}
cat mousemouse
Mouse#1: any of numerous small
rodents…
Mouse#2: a hand-operated electronic
device …
Cat#1: feline mammal…
Cat#2: computerized axial
tomography…
T={02244530,03651364}
Wcat={02037721,00847815}
T={mouse#1,mouse#2}
Wcat={cat#1,cat#2}
“The white cat is hunting the mouse”
0.726
0.0
0.107
0.0
EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy
JIGSAW_nouns: MSS support
MostSpecificSubsumer between wordsGive more importance to senses that are
hyponym of MSS Combine MSS support with semantic
similarity
Wcat={cat#1,cat#2} T={mouse#1,mouse#2}
MostSpecificSubsumer{placental_mammal}
EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy
Difference between JIGSAW_nouns and Resnik
Leacock-Chodorow measure to calculate similarity (instead Information Content)
a Gaussian factor G, which takes into account the distance between words in the text
a factor R, which takes into account the synset frequency score in WordNet
a parameterized search for the MSS (Most Specific Subsumer) between two concepts
EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy
JIGSAW_verbs: the idea
Try to establish a relation between verbs and nouns (distinct IS-A hierarchies in WordNet)
Verb wi disambiguated using:nouns in the context C of wi
nouns into the description (gloss + WordNet usage examples) of each candidate synset for wi
EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy
JIGSAW_verbs: algorithm [1/4]
For each candidate synset sik of wi
computes nouns(i, k): the set of nouns in the description for sik
for each wj in C and each synset sik computes the highest similarity maxjk
maxjk is the highest similarity value for wj wrt the nouns related to the k-th sense for wi (using Leacock-Chodorow measure)
EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy
1. (70) play -- (participate in games or sport; "We played hockey all afternoon"; "play cards"; "Pele played for the Brazilian teams in many important matches")
2. (29) play -- (play on an instrument; "The band played all night long")
3. …
nouns(play,1): game, sport, hockey, afternoon, card, team, match
wi=playC={basketball, soccer}
nouns(play,2): instrument, band, night
nouns(play,35): …
…
I play basketball and soccer
JIGSAW_verbs: algorithm [2/4]
EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy
nouns(play,1): game, sport, hockey, afternoon, card, team, match
basketball
basketball1
basketballh
MAXbasketball = MAXi Sim(wi,basketball) winouns(play,1)
game
game1
game2
gamek
…
sport
sport1
sport2
sportm
…
…
wi=playC={basketball, soccer}
JIGSAW_verbs: algorithm [3/4]
similarity
EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy
JIGSAW_verbs: algorithm [4/4]
finally, an overall similarity score, (i, k), among sik and the whole context C is computed:
hhi
Cwjkji
wposwposG
wposwposG
kRkij
))(),((
max))(),((
)(),(
the synset assigned to wi is the one with the highest value
EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy
JIGSAW_others
Based on the WSD algorithm proposed by Banerjee and Pedersen [Banerjee07] (inspired to Lesk)
Idea: computes the overlap between the glosses of each candidate sense for the target word to the glosses of all words in its context
assigns the synset with the highest overlap score if ties occur, the most common synset in WordNet is
chosen
EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy
Evaluation
EVALITA All-Words-Taskdisambiguate all words in a text
Dataset16 texts in Italian languageabout 5000 words (tagged by ItalWordNet)
ProcessingWSD needs others NLP steps:
Text normalization and Tokenization Part-Of-Speech Tagging (based on ACOPOST) Lemmatization (based on Morph-it! Resource)
META
EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy
Evaluation (Results)
system precision recall attempted
JIGSAW 0,560 0,414 73,95%
Baseline (1° sense)
0,669 0,669 100%
Comments: results are encouraging considering that our system exploits only ItalWordNetpre-processing phases, lemmatization and POS-tagging, introduce errors:
77,66% lemmatization precision 76,23% POS-tagging precision
EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy
Conclusions
Conclusions:Knowledge-based WSD algorithmDifferent strategy for each POS-taggingUse only WordNet (ItalWordNet) and some
heuristics Advantage: use the same strategy for Italian and
English [Basile07] Drawback: low precision (now)
EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy
Future Work
Including new knowledge sources:Web
(e.g. Topic Signature)
Wikipedia (e.g. Wikipedia similarity)
Do not include resources that are available for only few languages
Use others heuristics: Statistical distribution of senses instead WordNet frequency WordNet domains
EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy
References S. Banerjee and T. Pedersen. An adapted lesk algorithm for word
sense disambiguation using wordnet. In CICLing’02: Proc. 3rd Int’l Conf. on Computational Linguistics and Intelligent Text Processing,pages 136–145, London, UK, 2002. Springer-Verlag.
P. Basile, M. de Gemmis, A.L. Gentile, P. Lops, and G. Semeraro. JIGSAW algorithm for word sense disambiguation. In SemEval-2007: 4th Int. Workshop on Semantic Evaluations, pages 398–401. ACL press, 2007.
C. Leacock and M. Chodorow. Combining local context and wordnet similarity for word sense identification. In C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, pages 305–332. MIT Press, 1998.
P. Resnik. Disambiguating noun groupings with respect to WordNet senses. In Proceedings of the Third Workshop on Very Large Corpora, pages 54–68. Association for Computational Linguistics, 1995.
EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy
Backup slides
EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy
Results (detail)
total valid precision
nouns 2405 1923 0,556
verbs 1479 1118 0,375
others 694 330 0,676
proper nouns 159 126 0,913
Comments:polysemy of verb is highgenerally proper nouns have only one sense lemmatizer and pos-tagger work worst for adjectives
EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy
JIGSAWnouns: The idea Inspired by the idea proposed by Resnik
[Resnik95]
[Resnik95] P. Resnik. Disambiguating noun groupings with respect to WordNet senses. In Proceedings of the Third Workshop on Very Large Corpora, pages 54–68. Association for Computational Linguistics, 1995.
Most plausible assignment of senses to a set of co-occurring nouns is the one that maximizes the relatedness of meanings among the senses chosen
[ w1, w2, … wn ]
[s11 s12 … s1k] [s21 s22 … s1h] [sn1 sn2 … snm]
[ w1 w2 … wn ]
Relatedness is measured by computing a score for each sij
confidence with which sij is the most appropriate synset for wi
The score is a way to assign credit to word senses based on similarity with co-occurring words (support)
[0.4 0.3 0.5] [0.2 0.3 0.4] [0.6 0.1 0.2]
EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy
JIGSAWnouns: The support
[s11 s12 … s1k] [s21 s22 … s1h] [sn1 sn2 … snm]
[ w1 w2 … wn ]
the more similar two words are, the more informative will be the most specific concept that subsumes both of them
0.56
0.15semantic similarity score
most specific subsumer MSS
EVALITA 2007 – Evaluation of NLP Tools for Italian, 10 September 2007 - Roma, Italy
JIGSAWnouns: The idea
[s11 s12 … s1k] [s21 s22 … s1h] [sn1 sn2 … snm]
[ w1 w2 … wn ]
MSS = placental mammal
most specific subsumer MSS
cat mouse
feline mamma
l
rodent[0.56 0.0 0.0] [0.56 0.0 0.56] [0.0 0.0 0.0]
Placental mammal
Carnivore Rodent
Feline, felid
Cat(feline mammal)
Mouse(rodent)
MSS