
STOCKHOLM UNIVERSITY
Department of Linguistics

Using WordNet and Semantic Similarity to Disambiguate an Ontology

    Martin Warin

    Abstract

In this work I have used a number of different semantic similarity measures and WordNet to enrich an ontology (the Common Procurement Vocabulary) with information about its leaf-nodes. Nouns are extracted from the leaf-nodes and compared for semantic similarity against the nouns extracted from their parent-node. The leaf-node noun-senses with the highest overall similarity with the parent-node noun-senses are seen as good descriptors of the leaf-node and are added to the ontology.

It is shown that the similarity measure proposed by Leacock and Chodorow is the one most suitable for this task, out of five measures compared. Leacock-Chodorow shows average precision between 0.644 and 0.711 and recall between 0.534 and 0.977, depending on the kind of threshold used. Baseline average precision peaks at 0.592.

In the end, this work aims to contribute to the work of Henrik Oxhammar (PhD student at Stockholm University), who is investigating the possibilities of automatically classifying product descriptions on the WWW according to the Common Procurement Vocabulary. For this to be possible, more information about the leaf-nodes in the ontology than is currently there is needed.

Master's thesis (D-uppsats) in General Linguistics with a specialization in Computational Linguistics
10 credits, spring term 2004. Supervisor: Martin Volk


Contents

I Introduction
1 Structure
2 Notation
3 Aim
4 Material
  4.1 The Common Procurement Vocabulary
  4.2 WordNet
    4.2.1 The synset
    4.2.2 The tennis-problem
  4.3 Measuring semantic similarity
    4.3.1 Semantic similarity versus semantic relatedness
    4.3.2 Information content
    4.3.3 Path length
    4.3.4 Leacock-Chodorow
    4.3.5 Resnik
    4.3.6 Wu-Palmer
    4.3.7 Lin
    4.3.8 Jiang-Conrath
    4.3.9 Semantic relatedness measures
    4.3.10 Evaluating performance of semantic similarity measures
  4.4 Previous work
    4.4.1 Previous work on ontologies
    4.4.2 Previous work using semantic similarity

II Method
5 Description of the program
  5.1 How to find WordNet-senses given a CPV-node
  5.2 Comparing word-senses for semantic similarity
  5.3 When things go wrong
    5.3.1 Error scenario #1: No words in W_leaf
    5.3.2 Error scenario #2: No words in W_parent
    5.3.3 Error scenario #3: Only one and the same word in W_leaf and W_parent
6 Precision and recall in word-sense disambiguation
  6.1 Gold standard
7 Evaluation of the measures
  7.1 Thresholds

III Results, discussion and conclusion
8 Results
9 Discussion
  9.1 Quantitative analysis of results
  9.2 Qualitative analysis of results
  9.3 Biased towards Resnik?
10 Conclusion
  10.1 Future work

A Test set


    Part I

Introduction

1 Structure

First of all, some matters concerning the notation are explained in Section 2. After that, the aim of this work is presented in Section 3.

In Section 4 I will explain the two sources of data, the Common Procurement Vocabulary (Section 4.1) and WordNet (Section 4.2): what they are, how they are structured and how they can be used. After that comes an introduction to semantic similarity and some ways to measure it (Section 4.3). This is followed by some previous research on ontologies and some previous work using semantic similarity measures (Section 4.4).

Part II describes how the task was executed and evaluated, and in Part III the results are presented and discussed, followed by conclusions and thoughts about future work in this area.

2 Notation

WordNet synsets are written as the lexical items they contain, e.g. beet/beetroot.

WordNet-senses and glosses are written in angle brackets: <beet#2> and <round red root vegetable>.

All WordNet-senses and synsets are nouns unless otherwise stated.

Very long WordNet glosses will be shortened, as in <a bluish-white lustrous metallic element...>.

Nodes in the Common Procurement Vocabulary are written within square brackets: [Beetroot].

The code of a node in the Common Procurement Vocabulary will only be included when important ([01121111 Beetroot]), otherwise left out.

3 Aim

The aim of this work is to successfully disambiguate the ontology called the Common Procurement Vocabulary (CPV) using a measure for semantic similarity and WordNet. By disambiguating is meant providing a small ranked list of WordNet-senses for each leaf-node in the ontology. These WordNet-senses should be good candidates for describing the node as a whole, or parts of it.

Consider the CPV-node [Jackets], which is a child of [Jackets and blazers]. There are five different senses of jacket in WordNet: <a short coat>, <an outer wrapping or casing>, <(dentistry) an artificial crown fitted over a broken or decayed tooth>, <the outer skin of a potato> and <the tough metal shell casing for certain kinds of ammunition>. To disambiguate jacket, we compare its five senses with the information found in the parent node, the one sense of blazer, described as <a lightweight single-breasted jacket; often striped in the colors of a club or school>. If the semantic similarity measure we use is accurate, we will find that the jacket-sense most similar to the blazer-sense is <a short coat>. This sense will top the list of senses assigned to [Jackets], followed by the other senses.

Primarily, the reason for doing this work is to enable automatic classification of products or services according to the Common Procurement Vocabulary. Henrik Oxhammar (PhD student at Stockholm University) is working on such a project. He is building a vector space model, where products will be matched to nodes in the Common Procurement Vocabulary. The nodes must first be broadened by building clouds of semantically similar items around them.

This work will hopefully also be of use for anyone wishing to add information from WordNet to their ontology or disambiguate with WordNet-senses.

In one sentence, this is what I want to find out: is the method described in this work sufficient for successfully disambiguating the leaf-nodes of a whole ontology?

4 Material

In this section I present the material I have used. First is the Common Procurement Vocabulary (4.1), the ontology that is to be enriched with information about its leaf-nodes. Second is WordNet (4.2), the lexical database from which the enriching information is taken. After that, I describe the WordNet::Similarity package (4.3), a collection of tools for measuring semantic similarity. It is with the help of these tools that it is decided which pieces of information are to be used for enriching. Last in this section comes some information about previous work in this field (4.4).

4.1 The Common Procurement Vocabulary

The Common Procurement Vocabulary (henceforth CPV) is an EU standard document, compiled by the EU project Simap, whose objective is to develop information systems supporting effective procurement within the union. The CPV is used for classifying products or services in common procurement, and using it is recommended by the Commission. They argue that the CPV improves openness and transparency of common procurement, and that it helps overcome language barriers [European Union and European Parliament 2002]. As of now, the CPV is available in 11 European languages: Spanish, Danish, German, Greek, English, French, Italian, Dutch, Portuguese, Finnish and Swedish. 1

1 It can be downloaded at http://simap.eu.int/EN/pub/src/main5.htm, last visited September 27, 2004.


The CPV is an ontology, i.e. a collection of knowledge about a domain, structured in a meaningful way. Its nodes consist of a unique eight-digit code and a classification, e.g. 30233171-0 Disks. The ninth digit is only used for control purposes and can in practice be ignored; in fact, that is what will be done throughout the rest of this work. The codes are structured in such a way that they represent hierarchical relationships (IS-A or hypernymy/hyponymy relations) between nodes.

30000000 [Office and computing machinery, equipment and supplies.]
30200000 [Computer equipment and supplies.]
30230000 [Computer hardware.]
30233000 [Storage, input or output units.]
30233100 [Computer storage units.]
30233170 [Storage media.]
30233171 [Disks.]

Figure 1: CPV: The ancestors to Disks.

Parent:
01117100 [Raw vegetable materials used in textile production.]

Children:
01117110 [Cotton.]
01117120 [Jute.]
01117130 [Flax.]

Figure 2: CPV: Siblings.

The design is fairly simple. A top-level node has a code with three digits to the left identifying the branch, followed by zeros. The next generation inherits these three digits and adds a non-zero digit of its own, the next generation in turn inherits these four digits and adds a non-zero digit of its own, etc. If a node has children, the first child inherits the parent code and substitutes the leftmost zero with 1, the second child substitutes it with 2 instead, etc. A case like this is illustrated in Figure 2.

By substituting the rightmost non-zero digit with a zero in any given CPV-code, one gets the code belonging to its parent. It is also possible to find the siblings of any given CPV-node by systematically substituting the rightmost non-zero digit with all other non-zero digits.
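To make the code arithmetic concrete, here is a minimal sketch of the parent and sibling derivation just described. It is an illustration only, not the program used in this work, and the helper names are made up.

    def parent_code(code: str) -> str:
        """Replace the rightmost non-zero digit of an 8-digit CPV code with a zero."""
        i = max(i for i, d in enumerate(code) if d != "0")
        return code[:i] + "0" + code[i + 1:]

    def sibling_codes(code: str) -> list[str]:
        """Substitute the rightmost non-zero digit with every other non-zero digit."""
        i = max(i for i, d in enumerate(code) if d != "0")
        return [code[:i] + d + code[i + 1:] for d in "123456789" if d != code[i]]

    print(parent_code("30233171"))    # 30233170, cf. Figure 1
    print(sibling_codes("01117110"))  # 01117120, 01117130, ... cf. Figure 2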

The CPV contains 8323 nodes of the type discussed above, out of which 5644 are leaf-nodes. Apart from the type of nodes discussed above, there is also a shorter list of supplementary vocabulary with a code structure of its own. It is used in combination with regular CPV-nodes to further describe them, e.g. what they will be used for (Figure 3). This supplementary vocabulary will not be treated in this work.


E001-0 For cleaning
E002-7 For children
E003-4 For day homes
E004-1 For graphic purposes
E005-8 For access

Figure 3: CPV: Supplementary vocabulary.

During my work with the CPV, I have found a few oddities, for example that [28528000 Swords] would be the parent of [28528900 Metal fittings], and that [30192130 Pencils] would be a kind of [30192100 Erasers]. These oddities are probably due to the fact that the CPV has been revised several times. They are very few as far as I have seen, and they do not seem to affect the outcome in any significant way.

4.2 WordNet

WordNet is a semantic network database for English developed at Princeton under the direction of George A. Miller. The latest version is WordNet 2.0; however, in this work the version used is 1.7.1 from 2001. 2 Versions for other languages have been made, e.g. EuroWordNet [Vossen 1998], which also contains cross-lingual links.

2 The reason for using WordNet 1.7.1 is that the tool WordNet::Similarity had not been made compatible with WordNet 2.0 by the time I began this project.

The basic building-block in WordNet is the synset (Figure 4). A synset is a set of synonyms denoting the same concept, paired with a description (a.k.a. gloss) of the synset. The synsets are interconnected with different relational links, such as hypernymy (is-a-kind-of), meronymy (is-a-part-of), antonymy (is-an-opposite-of) and others.

Since in this work only nouns are used, only noun synsets will be discussed here. The synsets for the other parts of speech are construed in a slightly different fashion. The interested reader may consult [Fellbaum 1998], chapter 2 for adjectives and adverbs, and chapter 3 for verbs.

The lexical items in a synset are not absolute synonyms; rather, they are interchangeable within some context ([Fellbaum 1998], pp. 23-24).

The synsets for the different parts of speech are stored in separate files: data.noun, data.verb, etc.

When using WordNet, the user types in a query, e.g. beetroot, and the senses of that query are looked up. The user can then choose to look at synsets related to the ones shown.

4.2.1 The synset

All noun synsets contain the following data, which can be seen in Figure 4:

06429642 13 n 02 beet 0 beetroot 0 004 @ 06420440 n 0000 #p 09775909 n 0000 ~ 06429887 n 0000 ~ 06429988 n 0000 | round red root vegetable

Figure 4: WordNet: A typical synset.

A synset ID (06429642). This is used by all other synsets when referring to this one.


A two-digit code (13) between 03 and 28, identifying the synset as descending from one of 25 so-called unique beginners. The unique beginners are top nodes in the WordNet hierarchy and can be seen as semantic primes of a sort. Unique beginner 13, which is the case here, is food. Other unique beginners are 05 animal, fauna, 20 plant, flora, 27 substance, etc. In fact, there is another synset for beetroot/beet, this one descending from unique beginner 20 plant, flora and with a slightly different description focusing on the plant-sense of the word. This is the case with all polysemous words; they have one synset for every sense of the word, usually descending from different unique beginners.

A part-of-speech tag. n is for nouns.

A number (02) indicating how many lexical items the synset contains (beetroot and beet).

Pairs containing one lexical item and a number (beet 0) indicating where in the file index.sense this sense of the word can be found. index.sense provides an alternate way of searching for synsets, but this will not be used here.

A three-digit number specifying how many other synsets the synset points to.

A number of relation pointers. The first relation is always that to the hypernym synset (@ 06420440), which in this case is root vegetables; various other relation-type and synset pairs usually follow. In this case the #p stands for a part-holonym relation, and the ~ for hyponym relations. Since only the hypernym relation is used in this work, the others will not be discussed here. 3

Relation pointers are usually confined to within the parts of speech, i.e. there is no relation pointing from e.g. the noun concept snow to the adjective concepts white or cold, but there are relations to other noun concepts such as water. The exceptions are the relations attribute and domain-term. The domain-term relation first appeared in WordNet 2.0 and is described in Section 4.2.2. The attribute relation is a relation between certain nouns and adjectives; the noun weight is an attribute for which the adjectives light and heavy express values. In WordNet 1.7.1, there are only 325 noun-synsets with attribute relations.

3 The interested reader can find more about the relations used in WordNet at www.cogsci.princeton.edu/~wn/man/wninput.5WN.html or in [Fellbaum 1998], pp. 37-41.

A description (also called gloss) of the synset. It is seldom longer than a couple of sentences. Often the description also contains one or a few example sentences so the user may see how the word is used in context. The example sentences are generally cut off in this work, in order to improve readability.

Another type of file, index.noun (and index.verb etc.), contains one lexical item per line, paired with the synset ID(s) of the synset(s) in which the lexical item occurs (Figure 5). This file is used when a query is sent to WordNet, in order for it to find the synsets to retrieve. Later in this work, index.noun will be used to check whether a word is included in WordNet or not.

beethoven n 2 1 @ 2 0 08870823 06064724
beetle n 2 3 @ ~ #m 2 1 01829893 03236882
beetleweed n 1 2 @ #m 1 0 10192028
beetroot n 2 4 @ ~ #p %p 2 0 09775909 06429642

Figure 5: WordNet: index.noun.

    4.2.2 The tennis-problem

In WordNet version 1.7.1, the relations between noun synsets are antonymy, hypernymy, hyponymy, holonymy and meronymy. With these, it is possible to relate tennis with badminton (coordinate terms, i.e. both having the same hypernym: both tennis and badminton are kinds of court game), court game (hypernym), doubles (hyponym: doubles is a kind of tennis) and set point (meronym: a set point is a part of tennis). There are no antonyms, i.e. opposites, to tennis, but there is an antonymy relation between e.g. defeat and victory.

But there is no way to link tennis to many things that most humans would agree really relate to tennis in some way (forehand, racquet, serve etc.). With WordNet 2.0 came the relation domain-term, with which these loose relations can be captured. It works in two ways: one can either find the domain(s) in which serve occurs, or search for the terms that come with a domain. This relation might prove useful in tasks such as the one undertaken in this work. However, there are as yet no semantic similarity tools which can exploit this relation.

In the WordNet::Similarity package, the only measure that uses WordNet relations other than hyponymy relations is the semantic relatedness measure Hirst-St-Onge [Hirst and St-Onge 1998]. However, the domain-term relation is not used even by the implementation in version 1.0, which is adapted to WordNet 2.0. This might change in coming versions. 4

    4 Ted Pedersen, personal correspondence, August 2004.


4.3 Measuring semantic similarity

The task at hand in this work is to find WordNet-senses which describe a leaf-node in the CPV (the node as a whole or parts of it) in the context of its parent node. In practice, this means that all nouns are extracted from the leaf-node and its parent node. All nouns extracted from the leaf-node are evaluated for semantic similarity against all nouns from the parent node. This is described in more detail in Part II.

How, then, are the nouns evaluated for semantic similarity? Luckily, there is a Perl package called WordNet::Similarity [Pedersen et al. 2004], containing implementations of eight algorithms for measuring semantic similarity or relatedness. In this work, WordNet::Similarity version 0.09 is used. 5

Out of these eight measures, I have chosen to run a test on five of them. This is described in Sections 4.3.10 and 7.

The five measures used here will be referred to as the Leacock-Chodorow measure [Leacock and Chodorow 1998], the Jiang-Conrath measure [Jiang and Conrath 1997], the Lin measure [Lin 1998], the Resnik measure [Resnik 1995b] and the Wu-Palmer measure [Wu and Palmer 1994].

The reasons for choosing these are partly that they are all designed for dealing with semantic similarity, as opposed to semantic relatedness, within IS-A taxonomies (well, the Lin measure is intended to be useful in nearly any environment, see 4.3.7 below), and partly because of time limitations. The measures Lin, Resnik and Jiang-Conrath are based on the notion of information content (4.3.2), and the measures Leacock-Chodorow and Wu-Palmer are based on path length (4.3.3).
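As a concrete illustration of what these five measures return, the sketch below computes all of them for one pair of noun senses. It uses the Python NLTK interface to WordNet rather than the Perl WordNet::Similarity package used in this work, and a current WordNet rather than 1.7.1, so the exact numbers will differ; it is only meant to show the shape of the computation.

    # Requires the NLTK "wordnet" and "wordnet_ic" data packages.
    from nltk.corpus import wordnet as wn
    from nltk.corpus import wordnet_ic

    # Information content counts for the three IC-based measures
    # (here from the Brown corpus; this work used SemCor via WordNet::Similarity).
    brown_ic = wordnet_ic.ic("ic-brown.dat")

    a = wn.synset("apple.n.01")
    b = wn.synset("pear.n.01")

    print("Leacock-Chodorow:", a.lch_similarity(b))           # path-length based
    print("Wu-Palmer:       ", a.wup_similarity(b))           # path-length based
    print("Resnik:          ", a.res_similarity(b, brown_ic)) # IC of the LCS
    print("Lin:             ", a.lin_similarity(b, brown_ic))
    print("Jiang-Conrath:   ", a.jcn_similarity(b, brown_ic))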

When describing the five measures below, I will give examples of how they differ in scoring semantic similarity. It must be noted, however, that one cannot tell how good a measure is only by looking at its scores. Only when the measures are used to perform a task can one actually evaluate how well they work ([Budanitsky and Hirst 2004], p. 2).

    4.3.1 Semantic similarity versus semantic relatedness

Semantic similarity is a special case of semantic relatedness. Semantic relatedness is the question of how related two concepts are, using any kind of relation, whereas semantic similarity only considers the IS-A relation (hypernymy/hyponymy). Car and gasoline may be closely related to each other, e.g. because gasoline is the fuel most often used by cars. Car and bicycle are semantically similar, not because they both have wheels and means of steering and propulsion, but because they are both instances of vehicle.

The relation between semantic similarity and semantic relatedness is asymmetric: if two concepts are similar, they are also related, but they are not necessarily similar just because they are related.

There are three measures for semantic relatedness in WordNet::Similarity: Extended gloss overlap [Banerjee and Pedersen 2003], Vector [Patwardhan 2003] and Hirst-St-Onge [Hirst and St-Onge 1998]. Because of time limitations, they were not included in this work. How they would perform at tasks similar to the one at hand, compared to the semantic similarity measures used here, remains to be seen. These measures will be briefly discussed in Section 4.3.9.

5 WordNet::Similarity can be downloaded freely at groups.yahoo.com/group/wn-similarity/. There is also a web interface at www.d.umn.edu/~mich0212/cgi-bin/similarity/similarity.cgi. Last visited September 27, 2004.

    4.3.2 Information content

Since three of the five measures below are based on the notion of information content, here follows a short description of what that is, after which the semantic similarity measures are explained.

Each node in the WordNet noun taxonomy is a synset (4.2.1). The information content of a node is minus the logarithm of the sum of all probabilities (computed from corpus frequencies) of all words in that synset ([Resnik 1995b]).

Let x be a synset in WordNet and p(x) the probability of encountering an instance (i.e. one of the lexical items in the synset) of x, computed from a corpus. The information content of x is -log p(x). The default counting method in WordNet::Similarity is to count an occurrence of e.g. the word beet as an instance of each synset it occurs in, i.e. the frequency of both beet as a plant and beet as a food increases by one. The smoothing technique used is add-1.
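A minimal sketch of this computation, under the simplifying assumption that per-synset counts from a sense-tagged corpus are already available (the actual WordNet::Similarity counting details differ):

    import math
    from collections import Counter

    def information_content(counts: Counter, num_synsets: int) -> dict:
        """IC(x) = -log p(x), with add-1 smoothing over all synsets."""
        total = sum(counts.values()) + num_synsets  # add 1 to every synset's count
        return {syn: -math.log((counts[syn] + 1) / total) for syn in counts}

    # Toy counts keyed by synset identifiers (illustrative only).
    counts = Counter({"beet#food": 3, "beet#plant": 1, "root_vegetable": 10})
    ic = information_content(counts, num_synsets=3)
    print(ic)  # rarer synsets get higher information content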

By default in WordNet::Similarity, word-sense frequencies are computed from the SemCor corpus [Landes et al. 1998], a portion of the Brown Corpus tagged with WordNet-senses. This is used in this work too. [Patwardhan 2003] (pp. 41-43) has shown that the performance of the information content-based measures is noticeably influenced by the corpus used, as can be seen in Table 1. The numbers in Table 1 show the correlation with human judges ([Miller and Charles 1991]; I shall return to this experiment in 4.3.10). All measures correlate more closely with human judgement as the corpus size increases. The only exception is the Jiang-Conrath measure, which has the highest correlation using the second largest corpus. [Patwardhan 2003] believes that the Jiang-Conrath measure is relatively independent of corpus size, once the corpus is above a certain minimum size and is relatively balanced (p. 42).

Table 1: The effect of using different corpora on the information content-based measures' correlation with human judges.

Corpus                    Size (word tokens)   Resnik   Lin     Jiang-Conrath
SemCor                    200 000              0.714    0.698   0.727
Brown Corpus              1 million            0.730    0.744   0.802
Penn Treebank             1.3 million          0.746    0.749   0.826
British National Corpus   100 million          0.753    0.751   0.812


    4.3.3 Path length

The two measures Leacock-Chodorow and Wu-Palmer are based on path length. Simply counting the number of nodes or relation links between nodes in a taxonomy may seem a plausible way of measuring semantic similarity: the shorter the distance between two concepts, the higher their similarity. However, this has proved not to be a successful method of measuring semantic similarity. A problem with this method is that it relies on the links in the taxonomy representing uniform distances ([Resnik 1995b]). Therefore, most similarity or relatedness measures based on path length use some value to scale path length with.

Path length measures have the advantage of being independent of corpus statistics, and are therefore uninfluenced by sparse data.

    4.3.4 Leacock-Chodorow

The Leacock-Chodorow measure [Leacock and Chodorow 1998] is not based on the notion of information content, unlike Lin, Resnik and Jiang-Conrath, but on path length. The similarity between two concepts a and b equals minus the logarithm of the number of nodes along the shortest path between them, divided by double the maximum depth (from the lowest node to the top) of the taxonomy in which a and b occur. The number of nodes between two siblings, i.e. two nodes with the same parent node, is three.

    sim_LCH(a, b) = max[ -log( length(a, b) / 2D ) ]

where length(a, b) is the number of nodes on the shortest path between a and b, and D is the maximum depth of the taxonomy.

The Leacock-Chodorow measure assumes a virtual top node dominating all nodes, and will therefore always return a value greater than zero as long as the two concepts compared can be found in WordNet, since there will always be a path between them. No corpus data is used by the Leacock-Chodorow measure, so it cannot be affected by sparse data problems.

The Leacock-Chodorow measure gives a score of 3.583 for maximum similarity 6, that is, the similarity of a concept and itself, such as <apple#1> and <apple#1>. The score it gives for <apple#1> and <pear#1> is 2.484, the same as for <apple#1> and <cortland#1> (<large red-skinned apple>, an immediate child of <apple#1>).
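These reported scores are consistent with the formula above if one assumes natural logarithms and a maximum taxonomy depth of D = 18 nodes (an assumption inferred from the reported numbers, not stated in the text). A concept compared with itself has a path length of one node, and two siblings have a path length of three nodes:

    sim_LCH(apple#1, apple#1) = -ln(1 / 36) ≈ 3.583
    sim_LCH(apple#1, pear#1)  = -ln(3 / 36) ≈ 2.485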

    4.3.5 Resnik

The semantic similarity measure proposed in [Resnik 1995b] is defined as follows: the similarity score of two concepts in an IS-A taxonomy equals the information content value of their lowest common subsumer (LCS, the lowest node subsuming/dominating them both).

6 At least when using WordNet 1.7.1. If other versions have a different depth, this value will also be different.


    sim_RES(a, b) = max[ IC( LCS(a, b) ) ]

Resnik's measure does not capture the difference in semantic similarity in the following case. Assume three concepts a, b and c_a, where c_a is a special instance of a. The lowest common subsumer for a and b is then the same as for c_a and b! In practice, this means that the Resnik score for <apple#1> and <pear#1> is no different from that for <pear#1> and <cortland#1>.

The Resnik score for two identical concepts, such as <apple#1> and <apple#1>, still equals the information content value of the lowest common subsumer. This means that the Resnik score for <apple#1> and <apple#1> differs from the score for <spoon#1> and <spoon#1>, since the lowest common subsumers (which in this special case are immediate parents) of <apple#1> and <spoon#1> do not have exactly the same frequency in SemCor. This is rather counterintuitive; is <apple#1> not as semantically similar to <apple#1> as <spoon#1> is to <spoon#1>? The Resnik measure is the only one of the five measures discussed here that does not give the same value for <apple#1>-<apple#1> and <spoon#1>-<spoon#1>.

    If the only LCS is the virtual root node, Resnik will return zero.

    4.3.6 Wu-Palmer

The semantic similarity measure in [Wu and Palmer 1994] was first used in a system for English-Mandarin machine translation of verbs. Like the Leacock-Chodorow measure, it is not based on information content but on path length. It is defined as:

    sim_WUP(a, b) = max[ 2 * depth( LCS(a, b) ) / ( length(a, b) + 2 * depth( LCS(a, b) ) ) ]

Though this formula may look more complex than the others, it is rather straightforward: twice the depth of the lowest common subsumer of the two concepts is divided by the sum of the length (number of nodes) between the concepts and twice that depth.

Two identical concepts (<apple#1>-<apple#1>) receive similarity score 1 from Wu-Palmer. Sometimes, due to a bug in the implementation, similarity scores higher than 1 can be assigned to concepts. 7 When computing the similarity of <apple#1>-<cortland#1> there are paths of different lengths from the top node (a virtual top node is assumed) to the concepts. This somehow results in an LCS that the algorithm (or the implementation of it) believes is situated deeper than the concepts it subsumes, and the similarity score reported is 1.2. Also <apple#1>-<pear#1> results in a score above 1, namely 1.125. Despite this bug, the measure will be evaluated with the others; it need not be a fatal flaw.

The same thing goes for Wu-Palmer as for Leacock-Chodorow: both will always return a value greater than 0, and neither will be affected by sparse data problems.

    7 Jason Michelizzi, personal correspondence, August 2004.


    4.3.7 Lin

In [Lin 1998], the semantic similarity between two concepts a and b in a taxonomy is defined as

    sim_LIN(a, b) = max[ 2 * log p( LCS(a, b) ) / ( log p(a) + log p(b) ) ]

The algorithm is intended to be useful in any domain, as long as there can be a probabilistic model for that domain, unlike Resnik, Jiang-Conrath and Leacock-Chodorow, which all presuppose a taxonomy. Lin motivates his measure with two arguments: that there were no similarity measures around that were not tied to a particular application or domain model (such as a taxonomy), and that the fundamental assumptions of previous similarity measures were not explicitly stated. Lin is very firm on the second point; he lists the intuitions and assumptions underlying the measure, and then gives a logical proof that the measure actually conforms with them.

The Lin measure gives scores between 0 and 1. <apple#1>-<pear#1> gets a score of 0.935, meaning that they are highly similar. <apple#1>-<apple#1> is scored 1 and <apple#1>-<cortland#1> 0. The zero score is caused by sparse data, as described in 4.3.8 below.

    4.3.8 Jiang-Conrath

Originally, the Jiang-Conrath measure [Jiang and Conrath 1997] measures semantic distance, i.e. a high number indicates less similarity between two concepts than a lower number. In the WordNet::Similarity implementation the scoring has been inverted,

    sim_JCN(a, b) = 1 / dist_JCN(a, b)

so that a high value indicates greater semantic similarity between two concepts than a low value.

The Jiang-Conrath measure can be weighted so that it takes into consideration factors like the depth at which the compared concepts are situated, and the fact that some areas of a taxonomy are denser (i.e. nodes there have more children) than others. When, as in the WordNet::Similarity implementation, these weights are not used, the Jiang-Conrath measure (i.e. the un-inverted version, showing semantic distance) can be defined as:

    dist_JCN(a, b) = max[ IC(a) + IC(b) - 2 * IC( LCS(a, b) ) ]

The fact that the information content for a, b and LCS(a, b) must be known also makes the Jiang-Conrath measure especially sensitive to sparse data problems.

Identical concepts (<apple#1>-<apple#1>) are scored close to 29 million, which might seem a bit strange, since <apple#1>-<pear#1> is only scored 0.665. This has to do with the original nature of the Jiang-Conrath algorithm: the semantic distance between two identical concepts must be zero, and when inverted, the value must be very high.

<apple#1>-<cortland#1> gets a score of zero, because cortland was never seen in the corpus. Add-1 smoothing gives <cortland#1> a frequency of 1 anyway, but then the information content of cortland becomes very high and would indicate high similarity with practically any concept. Therefore, the implementation returns zero whenever either of the two concepts has frequency 1. 8

    4.3.9 Semantic relatedness measures

A semantic relatedness measure based on gloss overlaps is presented in [Banerjee and Pedersen 2003]. The core idea is that if the glosses (definitions) of two concepts share words, then they are related; the more words the two concept glosses share, the more related the concepts are. The glosses in WordNet tend to be too short for this kind of evaluation to be successful. Therefore, the glosses of the concepts related by hypernymy, hyponymy, meronymy and holonymy are also taken into consideration.

The relatedness score for this measure is the number of words occurring both in the glosses of the concepts related (by the above relations) to concept a together with the gloss of a itself, and in the glosses of the concepts related to concept b together with the gloss of b itself. Multi-word units are given extra weight (the square of the number of words in the multi-word unit). Its correlation with human judgement is reported as 0.67 by [Banerjee and Pedersen 2003].

Another semantic relatedness measure, proposed in [Patwardhan 2003], uses context vectors combining gloss overlaps and corpus statistics. This measure shows very high correlation with human judgement, a correlation of 0.877 (p. 40), which will be shown below to be close to the upper bound.

The Hirst-St-Onge measure [Hirst and St-Onge 1998] treats different relations in WordNet as directions; hypernymy and meronymy are upwards, antonymy and attribute are horizontal, etc. Similarity between two concepts (the original algorithm does not support comparing single senses, but this has been made possible in the WordNet::Similarity implementation) is computed by taking the shortest path minus the number of changes in direction. The Hirst-St-Onge measure has a correlation with human judgement of 0.68 according to [Seco et al. 2004].

    4.3.10 Evaluating performance of semantic similarity measures

When a measure of semantic similarity (for English) is evaluated, the Gold Standard most often used is [Miller and Charles 1991]. In that experiment, 30 pairs of nouns were given to 38 undergraduate students, who were asked to give similarity-of-meaning ratings for each pair. The scale went from 0 (no similarity) to 4 (perfect similarity). The average score for each pair was seen as a measure of how semantically similar humans found that pair of words.

This experiment was later replicated by [Resnik 1995b] on a smaller scale, using the same word pairs but only ten subjects. The outcome was practically consistent with [Miller and Charles 1991].

8 This is discussed further at http://groups.yahoo.com/group/wn-similarity/message/7 and http://groups.yahoo.com/group/wn-similarity/message/8, last visited September 27, 2004.


[Resnik 1995b] also found that the average correlation between his subjects and [Miller and Charles 1991] was r = 0.885. This is usually seen as the upper bound for what a computer might achieve at the same task.

In [Seco et al. 2004], all modules in the WordNet::Similarity package (and their own implementations of the same algorithms) were evaluated with the same noun pairs as in [Miller and Charles 1991]. The scores for the five measures used here (the WordNet::Similarity implementations) according to [Seco et al. 2004] can be seen in Table 2. One should note that [Seco et al. 2004] used WordNet 2.0, so the numbers for WordNet 1.7.1 might be somewhat different. Also, the numbers differ somewhat from those in Table 1, probably because different versions of both WordNet and WordNet::Similarity were used, and because [Patwardhan 2003] and [Seco et al. 2004] used different corpora and smoothing techniques for the information content-based measures.

Table 2: Five semantic similarity measures' correlation with human judgement.

Leacock-Chodorow   0.82
Jiang-Conrath      0.81
Lin                0.80
Resnik             0.77
Wu-Palmer          0.74

[Budanitsky and Hirst 2004] made an evaluation of the measures Lin, Resnik, Leacock-Chodorow, Jiang-Conrath and Hirst-St-Onge (their own implementations) in a malapropism detection and correction task. They found that Jiang-Conrath performed best, followed by Lin and Leacock-Chodorow in shared second place, then Resnik, followed by Hirst-St-Onge. They used the Brown corpus for word counts.

In order to see how well the different measures performed at the task at hand in this work, I ran a test, letting the five measures disambiguate 73 CPV-nodes. The test is described in Section 7, and the results and a discussion of them in Section 8 and Part III.

4.4 Previous work

4.4.1 Previous work on ontologies

The construction and maintenance of an ontology is difficult, expensive and highly time-consuming, as one must train domain experts in formal knowledge representation ([Faatz and Steinmetz 2002], p. 2). Therefore, every way in which ontology construction and/or maintenance can be successfully automated or semi-automated is welcome.

Ontology enrichment is one such way. This usually means that new nodes are created in an existing ontology. New concepts are identified by mining large corpora, either domain-specific corpora such as a collection of computer magazines, general-purpose corpora such as the Brown corpus, or corpora constructed from search-engine results.


Ontologies covering e.g. technical domains constantly need updating, since new products with new names appear all the time. A new clustering algorithm for finding new synonyms for a concept (such as P II for Pentium 2) is presented in [Valarakos et al. 2004]. If there exists a node Pentium 2, new versions of that name will be clustered around the existing node. New names which are not (orthographically) similar enough to any existing node will result in a new node.

In [Roux et al. 2000] an information extraction system for the domain of genomics is presented. Its knowledge of the domain is represented in an ontology as verb frames such as represses(protein(Antp), gene(BicD)), meaning that the Antp protein represses the BicD gene. The verb repress with a protein as first argument will usually take a gene as its second argument, so the system can learn new gene names by looking for any frame matching repress(protein(), ()).

[Agirre et al. 2000] wanted to find out if they could collapse some of the senses in WordNet, arguing that WordNet often makes too fine distinctions between concepts. They also wanted to find information about topical relations that WordNet could be enriched with.

For this, they constructed queries like (all words in the description of WordNet-sense x) NOT (all words in the descriptions of the other WordNet-senses for that word) for all senses of 20 words. These queries were sent to AltaVista, and the first 100 retrieved documents were tied to the word-sense for that particular query.

When all word-senses had a collection of documents, the documents for each word-sense were compared to the documents belonging to all the other senses of that word. The words (in the documents) with a distinct frequency for one of the collections were seen as topic signatures for the word-sense to which their collection was tied. They were then able to cluster senses of a word which had high overlap in topic signature words, thus reducing the number of sense distinctions in WordNet. The topic signature words for a sense could also be used for topical relations in WordNet.

    4.4.2 Previous work using semantic similarity

In [Resnik 1995a] the semantic similarity measure described in [Resnik 1995b] is used to disambiguate groups of similar polysemous nouns.

He observes that when two polysemous nouns are similar, i.e. they have one sense each which are similar, their lowest common subsumer tells us which of their senses are relevant. For instance, take doctor and nurse. They are both polysemous, and the lowest common subsumer for all of their senses taken together is health professional. When this is known, the senses descending from health professional can be assigned. An algorithm disambiguating large groups of nouns (groups like tie, jacket, suit or head, body, hands, eye, voice, arm, seat, hair, mouth) was proposed, and its results were almost comparable to those of human judges.

Two human judges were independently given the same set of 125 noun groups from Roget's thesaurus, together with one of the words in the group to disambiguate. The human judges were forced to choose only among that word's WordNet-senses, and to give a value for how confident they were that this was the one correct sense. Only answers with confidence higher than 1 (on a scale of 0-4) were kept.

Results from Judge 2 were used to create an upper bound for the results on Judge 1 and vice versa. A baseline was also created by randomly choosing senses. Table 3 shows that the algorithm performs rather well compared to the upper bound.

Table 3: Evaluation of Resnik's disambiguation algorithm.

           Upper bound %   Algorithm correct %   Random correct %
Judge 1    65.7            58.6                  34.8
Judge 2    68.6            60.5                  33.3

Part II

Method

In this part, I describe how I have solved the task at hand: how candidates for enriching a leaf-node were found and evaluated, how certain errors were worked around, and how the test and the Gold Standard were designed.

    5 Description of the program

In this section I will explain how my program works. I do this in order to facilitate comparison and repeatability.

First of all, let me briefly recapitulate what it is I am doing. In this work, the leaf-nodes in the CPV have been selected as nodes to be enriched with information from WordNet. Only leaf-nodes are treated, because they are generally more informative and specific than non-terminal nodes. The nouns in each leaf-node will be extracted, as will the nouns in its parent-node. The senses of the nouns in the leaf-node will be compared against the senses of the nouns in the parent node for semantic similarity, using different similarity measures. The senses of the leaf-node nouns which get the highest overall similarity scores, i.e. are most semantically similar overall to the senses of the parent-node nouns, will be selected for enriching the leaf-node.

The program takes CPV-nodes as input, and for each node it outputs a ranked list of senses. An overview of the stages involved from input to output can be seen in Table 5.

Table 5: A schematic description of the seven stages involved in the program, from input (a CPV leaf-node) to output (a ranked list of senses).

1) Find the parent of the current CPV leaf-node.
   Example: leaf-node = [15223000 Frozen fish steaks], parent = [15220000 Frozen fish, fish fillets and other fish meat]

2) Extract words (nouns that can be found in WordNet) from leaf-node and parent. Lemmatize words, search for compounds.
   Example: W_leaf = {fish, steak, fish steak}, W_parent = {fish, meat, fillet, fish fillet}

3) Filter out stoplisted words.
   Example: W_leaf = {fish, steak, fish steak}, W_parent = {fish, meat, fillet, fish fillet}

4) Look up all remaining words in WordNet and assign senses to all words.
   Example: L = {fish#1-4, steak#1, fish steak#1}, P = {fish#1-4, meat#1-3, fillet#1-5, fish fillet#1}

5) Filter out senses whose gloss contains stoplisted words.
   Example: L = {fish#1-2+4, steak#1, fish steak#1}, P = {fish#1-2+4, meat#1-3, fillet#1-5, fish fillet#1}

6) Compute a similarity score for each pair of L and P senses, and assign the score to the L-sense:
   for each pair of senses (a, b) where a is in L and b is in P {
     compute similarity(a, b) unless a = b
     add the similarity score to score(a)
   }

7) Sort L-senses according to score(sense) / score(all senses), i.e. each sense's share of the total score.
   Example: 0.343 <steak#1> <a slice of meat cut from the fleshy part of an animal or large fish>
            0.343 <fish steak#1> <cross-section slice of a large fish>
            0.189 <fish#2> <the flesh of fish used as food>
            0.070 <fish#1> <any of various mostly cold-blooded aquatic vertebrates ...>
            0.054 <fish#4> <the twelfth sign of the zodiac; the sun is in this sign ...>

5.1 How to find WordNet-senses given a CPV-node

In order to have full control and overview of what is done, I have chosen not to use the WordNet browser (called wnb) which is included in the WordNet package. Instead I have written a Perl module that does just the things I need in this work; nothing more, nothing less. This Perl module will simply be referred to as the WordNet noun-searcher. Its function is to retrieve only WordNet noun-senses, given a word.

The first step in the program is finding the parent of each input CPV leaf-node. Then both nodes are transformed into relevant queries which can be sent to my WordNet noun-searcher, which then returns all WordNet noun-senses for each query word. This transformation requires that all words are lemmatized. For this, the program uses bits and pieces from the WordNet lemmatizer called Morphy.

Lemmatizing is very important, since only uninflected word-forms are stored in data.noun and index.noun, and most CPV-nodes are in the plural. Information about the words in the CPV-node [Curtains, drapes, valances and textile blinds.] is stored under curtain, drape, valance, textile and blind in WordNet.

First of all, the lemmatizer looks for the input word in a long list of irregular nouns, where their lemma-forms are also listed, enabling easy substitution. If the input word is not found there, a simple set of rules for lemmatizing regular nouns is employed. All rules are tried; the resulting forms which cannot be found in index.noun are immediately discarded, and the ones that can be found are kept. Words that are included in a stoplist, containing 448 common or unwanted words, are filtered out.
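The following is a minimal sketch of that lemmatization and filtering step, not the actual Morphy-based Perl code; the irregular-noun list, suffix rules and stoplist shown here are illustrative stand-ins.

    IRREGULAR = {"children": "child", "teeth": "tooth", "geese": "goose"}  # stand-in list
    SUFFIX_RULES = [("ses", "s"), ("xes", "x"), ("ches", "ch"), ("shes", "sh"), ("ies", "y"), ("s", "")]
    STOPLIST = {"work", "service", "product"}  # stand-in for the 448-word stoplist

    def lemmatize(word: str, in_wordnet) -> list[str]:
        """Return candidate lemmas of `word` that exist in index.noun and are not stoplisted."""
        if word in IRREGULAR:
            candidates = [IRREGULAR[word]]
        else:
            candidates = [word[: -len(suf)] + repl
                          for suf, repl in SUFFIX_RULES if word.endswith(suf)]
            candidates.append(word)  # the word may already be a lemma
        return [c for c in candidates if in_wordnet(c) and c not in STOPLIST]

    # `in_wordnet` stands for a lookup in index.noun, here simulated with a small set.
    index_noun = {"curtain", "drape", "valance", "blind", "work"}
    print(lemmatize("curtains", index_noun.__contains__))  # ['curtain']
    print(lemmatize("work", index_noun.__contains__))      # [] -- 'work' is stoplisted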

In order to find compound words as well 9, all pairs that can be made of the lemmatized words, such that the second word is not the same as or preceding the first word, are also looked up in index.noun.

That is, for all words in the original node [w1, w2 ... wn], make pairs of the kind [wx, wy] such that y > x. Discard all pairs that cannot be found in index.noun, and save those that can.

Generating compounds this way makes sure that in, e.g., the case of the CPV-node [Lead plates, sheets, strip and foil.], not only lead plate but also lead sheet, lead strip and lead foil are tried. The constraint that y > x also makes sure that in the case of e.g. [Command and control system.], control system and command system are looked for, but system command is not.
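A minimal sketch of the pair generation under the y > x constraint (the membership test against index.noun is simulated here with a small set):

    def candidate_compounds(words: list[str], in_index_noun) -> list[str]:
        """All two-word compounds [wx, wy] with y > x that exist in index.noun."""
        pairs = [f"{words[x]} {words[y]}"
                 for x in range(len(words))
                 for y in range(x + 1, len(words))]
        return [p for p in pairs if in_index_noun(p)]

    index_noun = {"lead plate", "lead sheet", "lead strip", "lead foil", "control system"}
    print(candidate_compounds(["lead", "plate", "sheet", "strip", "foil"], index_noun.__contains__))
    # ['lead plate', 'lead sheet', 'lead strip', 'lead foil']
    print(candidate_compounds(["command", "control", "system"], index_noun.__contains__))
    # ['control system'] -- 'system command' is never generated because y > x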

The queries that have been extracted from the CPV-node are then sent to the WordNet noun-searcher, which retrieves at least one word-sense per query. These word-senses are saved for later. This procedure is done twice: once for the leaf-node and once for its parent.

A second, smaller stoplist, containing the words figurative, slang, informal, astrology, (ethnic) slur, baseball, chess, folklore, old/new testament, archaic, someone, anatomy, psychoanalysis and obscene, is used to filter out some of the WordNet-senses retrieved. If the sense gloss contains any of the above words, the sense is discarded. All WordNet-senses with years in their gloss are also filtered out. Such concepts often denote historical events or persons, e.g. <wood#3> <United States film actress (1938-1981)>. These concepts are not useful when working with the CPV and are therefore discarded. When working with other ontologies, such stoplists would certainly have to be revised to fit their needs.
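A minimal sketch of that gloss-based sense filter (the gloss stoplist and the year pattern follow the description above; the data structures are illustrative):

    import re

    GLOSS_STOPLIST = {"figurative", "slang", "informal", "astrology", "slur", "baseball",
                      "chess", "folklore", "testament", "archaic", "someone", "anatomy",
                      "psychoanalysis", "obscene"}
    YEAR = re.compile(r"\b1\d{3}\b|\b20\d{2}\b")  # four-digit years, e.g. 1938

    def keep_sense(gloss: str) -> bool:
        """Discard senses whose gloss contains a stoplisted word or a year."""
        words = set(re.findall(r"[a-z]+", gloss.lower()))
        return not (words & GLOSS_STOPLIST) and not YEAR.search(gloss)

    print(keep_sense("round red root vegetable"))                # True
    print(keep_sense("United States film actress (1938-1981)"))  # False (contains years)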

9 Only two-word compounds are considered here. There seem to be very few or no three-word compounds in the CPV, and most more-than-two-word compounds in WordNet are of the type imaginary part of a complex number or breach of the covenant of warranty.


This stoplist was manually compiled during the earliest stages of this work. It is a simple and rather crude heuristic for lowering the number of unique comparisons somewhat. Without this stoplist, the number of unique comparisons made for the whole CPV is ca 198 000; with it, 184 000, i.e. ca 14 000 fewer comparisons.

A stoplist of this kind could also have been compiled automatically, by investigating the glosses of senses that are always given low or no similarity. Time limitations prevented me from investigating this possibility.

5.2 Comparing word-senses for semantic similarity

From the leaf-node, we now have a set of words W_leaf = {w1 ... wn}, with each word wi having an associated set of senses Si = {s1 ... sn}. There is also a set of words from the parent, W_parent = {w1 ... wn}, with each word wj having a set of senses Sj = {s1 ... sn}. L is the set of all senses associated with all words in W_leaf, and P is the set of all senses associated with all words in W_parent. This corresponds to stage 5 in Table 5.

All pairs of L and P senses which are not associated with the same word (we do not want to compare senses of two identical words with each other, except in the specific case described in Section 5.3.3) are sent to the module WordNet::Similarity, where their semantic similarity is computed according to the chosen similarity measure. The output from the module is a semantic similarity value for that pair of word-senses.

Table 4 shows how the semantic similarity values for all pairs of L and P senses are computed. The CPV-node disambiguated is [Beetroot], child of [Root vegetables].

Table 4: Finding a good description for [Beetroot].

L-sense      P-sense             Similarity
beetroot#1   root#1              1.33735833155429
beetroot#1   root#2              0
beetroot#1   root#3              0.828662971392306
beetroot#1   root#4              0
beetroot#1   root#5              0
beetroot#1   root#7              0
beetroot#1   root#8              0.828662971392306
beetroot#1   vegetable#1         0.828662971392306
beetroot#1   vegetable#2         0
beetroot#1   root vegetable#1    0.828662971392306
beetroot#2   root#1              0.828662971392306
beetroot#2   root#3              0.828662971392306
beetroot#2   root#4              0
beetroot#2   root#5              0
beetroot#2   root#7              0
beetroot#2   root#8              0.828662971392306
beetroot#2   vegetable#1         8.8602465125736
beetroot#2   vegetable#2         0.828662971392306
beetroot#2   root vegetable#1    10.1759233064795

Sum beetroot#1 = 4.65201021712351
Sum beetroot#2 = 22.3508217046223

Word-sense         Gloss
beetroot#1         beet having a massively swollen red root; widely grown ...
beetroot#2         round red root vegetable
root#1             the usually underground organ that lacks buds or leaves ...
root#2             (linguistics) the form of a word after all affixes ...
root#3             the place where something begins, where it springs ...
root#4             a number that when multiplied by itself some number ...
root#5             the set of values that give a true statement when ...
root#7             a simple form inferred as the common basis from ...
root#8             the part of a tooth that is embedded in the jaw ...
vegetable#1        edible seeds or roots or stems or leaves ...
vegetable#2        any of various herbaceous plants cultivated ...
root vegetable#1   any of various fleshy edible underground roots ...

The scores for all pairs containing the same L-sense are added together and assigned to that L-sense. The assumption is that the higher the score, the more likely that L-sense is to be a good description of the CPV-node from which it was extracted. When all L-senses and P-senses have been compared, the L-senses are sorted according to their total score. This corresponds to stages 6 and 7 in Table 5.

In Table 4 we see that <beetroot#2> got an almost five times higher score than <beetroot#1>. There are usually many more senses involved in most CPV-nodes; in this particular case there were only two, and both of them are good descriptions of beetroots.
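A minimal sketch of that accumulation and ranking step, with the similarity function left abstract (in this work it would be a call into WordNet::Similarity; the values below are rounded, illustrative entries from Table 4):

    from collections import defaultdict

    def rank_leaf_senses(L, P, similarity, same_word):
        """Sum each L-sense's similarity to all P-senses of other words, then rank."""
        score = defaultdict(float)
        for a in L:
            for b in P:
                if not same_word(a, b):       # skip senses of one and the same word
                    score[a] += similarity(a, b)
        return sorted(score.items(), key=lambda item: item[1], reverse=True)

    L = ["beetroot#1", "beetroot#2"]
    P = ["root#1", "vegetable#1", "root vegetable#1"]
    toy = {("beetroot#1", "root#1"): 1.337, ("beetroot#1", "vegetable#1"): 0.829,
           ("beetroot#1", "root vegetable#1"): 0.829, ("beetroot#2", "root#1"): 0.829,
           ("beetroot#2", "vegetable#1"): 8.860, ("beetroot#2", "root vegetable#1"): 10.176}
    ranking = rank_leaf_senses(L, P, lambda a, b: toy[(a, b)],
                               same_word=lambda a, b: a.split("#")[0] == b.split("#")[0])
    print(ranking)  # beetroot#2 ranks first, as in Table 4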

5.3 When things go wrong

There are a few scenarios in which the program is unable to produce any candidates for enriching a CPV-node. They will be discussed in turn below.

5.3.1 Error scenario #1: No words in W_leaf

In some cases, the words in the leaf-node are all included in the stoplist or simply not in WordNet, and are therefore not put in W_leaf.

For instance, the only W_leaf-word from the CPV-node [Infill work.] is work; infill is not included in WordNet. work has been added to the stoplist, since it is, first of all, a very common word; it appears 303 times in the leaf-nodes of the CPV. Second, the 8 senses returned from WordNet for the query work are also too generic to be of much use (even the possibly correct ones).

Since the semantic similarity computation relies on there being words in both W_leaf and W_parent, something has to be done in order to fill W_leaf with something. I see two solutions. First, one could relax the constraint that W_leaf-words must not be included in the stoplist. Second, one could let the parent node take the place of the leaf-node, and let the grandparent take the place of the parent node; in other words, moving up one generation in the CPV.

I have chosen the latter solution, assuming that one will in this way be able to capture some, although often less specific, information relevant to the original leaf-node. Information captured using the first solution will often be too generic and is less likely to be restricted to the relevant domain.

5.3.2 Error scenario #2: No words in W_parent

Another, similar case is when there are no words in W_parent, for the same reasons as in the case above. Here too one can either relax the stoplist constraint or move up one generation. I have chosen the same sort of solution here, letting W_parent take words from the grandparent.

In the case of [Dental hygiene products], the parent is [Dental consumables]. W_leaf will here consist of only hygiene, since product is stoplisted and dental is not in WordNet as a noun. Using the adjective form of dental is not really an option, since the similarity measures used here are confined to within part-of-speech boundaries. W_parent will be empty, since consumable is also not in WordNet as a noun. Instead, the grandparent [Disposable non-chemical medical consumables and haematological consumables] will be the source of words for W_parent. The domain will hopefully remain the same, even if there now is a wider gap between W_leaf and W_parent.

5.3.3 Error scenario #3: Only one and the same word in W_leaf and W_parent

Another possibility is that W_parent and W_leaf contain only one and the same word. In such cases, the constraint that two identical words may not be sent together for similarity computation is relaxed. It is replaced by a more forgiving constraint saying that two identical words may be sent, but not two identical word-senses.

The parent of [Tropical wood] is simply [Wood], so both W_parent and W_leaf will contain the single word wood. Senses 4-6 of wood are left out by the program since they all denote historical persons. Senses 1, 2, 7 and 8 are compared with each other, but not with themselves.

At an earlier stage of this work, whenever W_parent and W_leaf contained one and the same word, W_parent was instead filled with words from the grandparent. This resulted in slightly fewer cases where no candidates were found, but the candidates found with the relaxed constraint were more often relevant.


6 Precision and recall in word-sense disambiguation

Average precision is an evaluation measure widely used in word-sense disambiguation and information retrieval. I find that the easiest way to describe average precision is in terms of information retrieval, which is what I will do here.

Let the words to be disambiguated be the query, and the set of all WordNet-senses the documents in your collection. Assume you want to find the best documents matching the CPV-node [Printing services for commercial catalogues]; then the query you send is printing, services, commercial, service, catalogue (extracted from the CPV-node as described in 5.1). The ranked list of retrieved documents for this query can be seen in Table 6.

Table 6: A ranked list of WordNet-senses.

Ranking  Document
1        [The business of printing]
2        [A commercially sponsored ad on radio or television]
3        [A complete list of things; usually arranged systematically]
4        [A book or pamphlet containing an enumeration of things]
5        [All the copies of a work printed at one time]
6        [Text handwritten in the style of printed matter]
7        [Reproduction by applying ink to paper as for publication]

Now, the average precision of a list such as Table 6 can be computed as follows. First, pick out all correct documents (according to your Gold Standard) and mark them with their position in the original list. Next, for each document in the new list, divide its position in the new list by its position in the old list. The average of these quotients is the average precision at this particular CPV-node (Table 7).

Recall is computed as usual: by dividing the number of correct documents found by the total number of correct documents.

Table 7: A demonstration of average precision.

Ranking  Document                                                       New / Old
1        [The business of printing]                                     1/1 = 1
2        [A complete list of things; usually arranged systematically]   2/3 = 0.67
3        [A book or pamphlet containing an enumeration of things]       3/4 = 0.75
4        [Reproduction by applying ink to paper as for publication]     4/7 = 0.57

Average precision = (1 + 0.67 + 0.75 + 0.57) / 4 = 0.7475
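A minimal sketch of both computations (plain Python; the labels in the example are invented stand-ins for the documents of Tables 6 and 7):

```python
def average_precision(ranked, correct):
    """Average precision of a ranked list against a set of correct items."""
    found, quotients = 0, []
    for old_position, item in enumerate(ranked, start=1):
        if item in correct:
            found += 1                        # = position in the "new" list
            quotients.append(found / old_position)
    return sum(quotients) / len(quotients) if quotients else 0.0

def recall(ranked, correct):
    """Share of the correct items that appear in the ranked list at all."""
    return sum(1 for item in ranked if item in correct) / len(correct)

# Invented labels standing in for the seven documents of Table 6;
# the four marked correct are those kept in Table 7.
ranked = ['doc1', 'doc2', 'doc3', 'doc4', 'doc5', 'doc6', 'doc7']
correct = {'doc1', 'doc3', 'doc4', 'doc7'}
print(average_precision(ranked, correct))  # 0.747... (Table 7 gets 0.7475 from rounded quotients)
print(recall(ranked, correct))             # 1.0
```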


6.1 Gold standard

In order to determine recall and average precision, one must first have a Gold Standard. A good way of acquiring a Gold Standard would be to compile a list of a number of CPV-nodes and all WordNet-senses which can be extracted from them. Then a number of subjects, preferably native English speakers, would be instructed to (without collaborating) pick out those WordNet-senses which they think are good candidates for describing the whole or parts of each CPV-node. The WordNet-senses for a CPV-node which all subjects agreed on could then be viewed as correct.

Unfortunately, this was not possible to achieve because of limited time and resources. Instead, the Gold Standard was made by myself and Henrik Oxhammar. It consists of 73 randomly selected CPV-nodes and the WordNet-senses which we judged to be good candidates for describing the nodes as a whole or in part.

7 Evaluation of the measures

For the test, the five measures Leacock-Chodorow, Lin, Jiang-Conrath, Wu-Palmer and Resnik were given the 73 randomly selected CPV-nodes seen in the appendix. After stoplists and constraints have filtered out some of the word-senses extracted from these nodes, there are 362 different word-senses in L-position for the measures to rank, out of which 184 are marked as correct in the Gold Standard. All correct senses are assumed to be equally correct.

I have also assumed that compound words generally are more informative and less polysemous than simplex words. Therefore, extra weights were assigned to compound words in L-position. The weights were 1.25 and 1.75. If average precision and recall increase when weights are used, this would indicate that the assumption is correct.

The output from the five similarity measures was evaluated with average precision (Section 6). A baseline was created by taking the average results from ten runs with randomly generated numbers instead of similarity values. If a measure does not perform better than the baseline, there is a serious flaw in either the method used, the (implementation of the) similarity measure, or both.
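The compound weighting and the random baseline could look roughly like this (a sketch under the assumption that each candidate sense carries a raw score from a similarity measure; names and data structures are mine, not the original program's):

```python
import random

def weight_compounds(scores, weight=1.25):
    """Boost compound word-senses (e.g. 'fish steak#1') by 1.25 or 1.75."""
    return {sense: score * (weight if ' ' in sense else 1.0)
            for sense, score in scores.items()}

def baseline_rankings(senses, runs=10):
    """Baseline: rank with random numbers instead of similarity values;
    precision and recall are then averaged over the ten runs."""
    rankings = []
    for _ in range(runs):
        noise = {sense: random.random() for sense in senses}
        rankings.append(sorted(senses, key=noise.get, reverse=True))
    return rankings
```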

7.1 Thresholds

I also wanted to investigate how the different measures perform when thresholds are used. It is likely that when using this program in an application, one does not want to see all candidate senses, just a few of the highest ranked. When none of the thresholds mentioned below are used, all word-senses with a score higher than zero are included in the ranked list.

The first threshold allows the top three candidate word-senses plus all lower ones which have a score no less than two thirds of that of the third-ranked word-sense. If there are only three or fewer word-senses in the ranked list, all are selected. This threshold is intended to cut off the lowest ranked word-senses, and will reward a measure which gives all correct word-senses equally high scores. It is a bit hazardous, though, since it would also reward a measure for giving all word-senses the same score.

The second threshold looks at how many correct word-senses there are in the Gold Standard for a CPV-node. If there are, e.g., four correct word-senses for a CPV-node in the Gold Standard, then only the four top candidates for that node are kept.

    The two thresholds are not used in combination with each other.
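A sketch of the two thresholds (illustrative; `ranked` is assumed to be a list of (score, sense) pairs sorted by descending score, and `n_gold` the number of correct senses for the node in the Gold Standard):

```python
def threshold_1(ranked):
    """Keep the top three senses plus any lower sense scoring at least
    two thirds of the third-ranked score."""
    if len(ranked) <= 3:
        return list(ranked)
    cutoff = ranked[2][0] * 2 / 3
    return list(ranked[:3]) + [pair for pair in ranked[3:] if pair[0] >= cutoff]

def threshold_2(ranked, n_gold):
    """Keep only as many top candidates as there are correct senses
    in the Gold Standard for this CPV-node."""
    return list(ranked[:n_gold])
```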

    Part III

Results, discussion and conclusion

In this part, the results from the test in Section 7 are presented (Section 8). After that follows a discussion and analysis of the results (Section 9), conclusions (Section 10) and thoughts regarding what can be done in the future (Section 10.1).

8 Results

The results from the test described in Section 7 are presented in Tables 8-11. Table 8 shows the results when no thresholds are used, Table 9 shows what happens when the first threshold is applied, and Table 10 shows the effect of the second threshold (Section 7.1). Table 11 shows the mean score for the five measures when the weights have been averaged out (recall has been omitted here, since the variations between the different weight classes are so small). The highest value in each column of each table is written in bold.

9 Discussion

When looking at Table 8, it is important not to be deceived by the high recall values for Leacock-Chodorow, Wu-Palmer and the baseline. No threshold is used there, and since these three measures always return a value greater than zero, all word-senses are ranked and recall becomes very high. The fact that recall is not 1 is due to a few word-extraction failures of the kind described in Section 5.3. The differences in recall for the information-content-based measures (Resnik, Lin and Jiang-Conrath) are due to their different tendencies to return zero.

9.1 Quantitative analysis of results

What do all the numbers in Tables 8-11 tell us? First of all, that the Leacock-Chodorow measure clearly performs best, followed by the Resnik measure. In Table 8, Resnik consistently shows higher average precision than Leacock-Chodorow, but when the two thresholds are used, the relation is reversed (Tables 9 and 10). This is also apparent in Table 11.


Table 8: Disambiguation results from 73 CPV-nodes. No threshold.

Measure            Weight  Average precision  Recall
Resnik             1.00    0.678              0.839
                   1.25    0.697              0.839
                   1.75    0.705              0.839
Leacock-Chodorow   1.00    0.663              0.977
                   1.25    0.684              0.977
                   1.75    0.684              0.977
Wu-Palmer          1.00    0.655              0.977
                   1.25    0.677              0.977
                   1.75    0.677              0.977
Lin                1.00    0.604              0.580
                   1.25    0.614              0.580
                   1.75    0.616              0.580
Jiang-Conrath      1.00    0.607              0.724
                   1.25    0.622              0.724
                   1.75    0.626              0.724
Baseline           1.00    0.554              0.977
                   1.25    0.560              0.977
                   1.75    0.565              0.977


Table 9: Disambiguation results from 73 CPV-nodes. Threshold 1 used.

Measure            Weight  Average precision  Recall
Resnik             1.00    0.673              0.661
                   1.25    0.696              0.667
                   1.75    0.711              0.667
Leacock-Chodorow   1.00    0.688              0.845
                   1.25    0.711              0.845
                   1.75    0.711              0.845
Wu-Palmer          1.00    0.680              0.770
                   1.25    0.703              0.770
                   1.75    0.703              0.770
Lin                1.00    0.608              0.483
                   1.25    0.624              0.483
                   1.75    0.626              0.483
Jiang-Conrath      1.00    0.615              0.661
                   1.25    0.629              0.661
                   1.75    0.633              0.655
Baseline           1.00    0.572              0.774
                   1.25    0.578              0.774
                   1.75    0.592              0.774


Table 10: Disambiguation results from 73 CPV-nodes. Threshold 2 used.

Measure            Weight  Average precision  Recall
Resnik             1.00    0.633              0.506
                   1.25    0.653              0.529
                   1.75    0.668              0.529
Leacock-Chodorow   1.00    0.644              0.534
                   1.25    0.677              0.552
                   1.75    0.677              0.552
Wu-Palmer          1.00    0.624              0.517
                   1.25    0.652              0.529
                   1.75    0.652              0.529
Lin                1.00    0.555              0.425
                   1.25    0.577              0.431
                   1.75    0.577              0.437
Jiang-Conrath      1.00    0.568              0.448
                   1.25    0.596              0.460
                   1.75    0.590              0.471
Baseline           1.00    0.505              0.454
                   1.25    0.525              0.459
                   1.75    0.546              0.476

Table 11: Average precision when weight has been evened out.

               Resnik  Leacock-Chodorow  Wu-Palmer  Lin    Jiang-Conrath  Baseline
No threshold   0.693   0.677             0.669      0.611  0.618          0.559
Threshold 1    0.693   0.703             0.695      0.619  0.626          0.581
Threshold 2    0.651   0.667             0.643      0.570  0.585          0.525


It is also clear that Lin and Jiang-Conrath did not perform very well, which is rather remarkable. In e.g. Table 10 they show average precision dangerously close to the baseline, and recall mostly lower than the baseline. As is shown by [Budanitsky and Hirst 2004] (Section 4.3.10), both the Jiang-Conrath measure and the Lin measure (sharing place with Leacock-Chodorow) perform very well, outperforming the Resnik measure at malapropism detection and correction. Also in the comparison made by [Seco et al. 2004] (Table 2), both Lin and Jiang-Conrath show higher correlation with human judgement than Resnik and Wu-Palmer.

We see that Wu-Palmer comes third, but closer to Resnik than to Jiang-Conrath. The question whether the bug in the WordNet::Similarity implementation of the Wu-Palmer measure (Section 4.3.6) was fatal or not has gotten an answer: no, the bug is not fatal. The bug only occurs twice in the test set, in the word-sense pairs <oil#1> - <motor oil#1> and <ferry#1> - <boat#1>, and these pairs seem not to have affected the outcome.[10]

[10] Actually, during the process of finishing this thesis, a new version of WordNet::Similarity was released where this particular bug had been fixed. So even if it turned out not to be fatal, we need not worry about it anymore.

Another fact is that the weighting of compounds seems to have had a positive effect on both recall and average precision in nearly all cases. A weight of 1.25 makes the measures perform better than without weights, and a weight of 1.75 makes them perform even better. Only Jiang-Conrath shows a slight loss of recall or precision in two cases, in Tables 9 and 10, with weight 1.75. This probably has to do with an incorrect compound extracted from a CPV-node: when it is given enough weight, it moves up one place in the ranking and switches places with a correct non-compound word-sense. Incorrect compounds do occur, but not very frequently. One example is service book (<a book setting forth the forms of church service>) extracted from [Printing services for account books].

The positive effect of compound weighting is probably rather limited. Using weights ten times larger than the ones used here will probably not increase performance much, since compounds are not that common and since incorrect compounds will appear.

9.2 Qualitative analysis of results

In this test, where 73 CPV-nodes were disambiguated, 5 nodes did not receive any candidate word-senses at all. They are [Dental haemostatic], [Aerial photography services], [Non-destructive testing services], [Repair and maintenance services of furniture] and [Incubators]. That is 7% of the nodes in the test set. For all of the leaf-nodes in the CPV the percentage is somewhat lower, 4.9%, or 274 nodes, using the heuristics described. That these nodes do not get any candidates is a flaw in the program itself and has nothing to do with any of the similarity measures. If 274 CPV leaf-nodes do not get any candidate word-senses, it still means that 5370 nodes do.

The reason that five of the nodes in the test set did not receive any candidate word-senses is errors like the ones described in Section 5.3. Improved heuristics (such as smarter stoplists), and other sources of information, such as extracting words from siblings in the CPV as well, might help here.

I would also like to look a little closer at a few of the CPV-nodes in the test set and see how the measures perform. No weights or thresholds are used here.

First, let us look at [Lead], a child of [Lead, zinc and tin]. This is a perfect example: lead is a highly polysemous word with as many as 16 senses in WordNet, and just one of the senses is correct here (<a soft heavy toxic malleable metallic element ...>). The information in the parent node is potentially very helpful; zinc is monosemous and highly similar to lead, and tin has only three senses, out of which one denotes the metal. Interestingly, all of the measures rank the correct sense of lead highest and give maximum recall and average precision for this particular CPV-node.

Another interesting CPV-node in the test set is [Mixes for stocks], an instance of [Stocks]. Two senses are correct, out of 21 extracted by the program: <a commercially prepared mixture of dry ingredients> and <liquid in which meat and vegetables are simmered ...>.

    Table 12: Results for disambiguating [Mixes for stocks].

Measure            Average precision  Recall
Leacock-Chodorow   0.567              1
Resnik             0.327              1
Lin                1                  0.5
Wu-Palmer          0.317              1
Jiang-Conrath      1                  0.5
Baseline           0.095              1

The words extracted by the program are W_parent = {stocks, stock} and W_leaf = {mix, stocks, stock}. The L-senses of mix will be compared for semantic similarity with the P-senses of both stocks and stock, whereas the L-senses of stock will only be compared with the one P-sense of stocks, and the L-senses of stocks will only be compared with the P-senses of stock. This is because senses belonging to the same word are not compared with each other if there are senses of other words available.

All measures give the L-sense <a commercially prepared mixture of dry ingredients> a high ranking: first or second place. The correct sense of stock is often not ranked as high. Leacock-Chodorow, the most successful measure even in this small example, places the correct sense of mix first and stock at rank 15. Resnik, on the other hand, places the correct sense of mix second and stock thirteenth. How come the average precision of Leacock-Chodorow is so much higher than that of Resnik, then? The fact is that average precision can be rather unforgiving of such seemingly small differences:

lch = (1/1 + 2/15) / 2 = 0.567        res = (1/2 + 2/13) / 2 = 0.327


We saw in this example that the senses of stock were not compared with as many other senses as the senses of mix. A possible way of evening out this difference, which arises from the number of other senses a sense is compared with, could be to divide each sense's total score by the number of senses it was compared with.
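A sketch of that normalisation (hypothetical; comparison_counts would have to be recorded while the sense pairs are generated):

```python
def normalised_scores(summed_scores, comparison_counts):
    """Even out unequal numbers of comparisons by dividing each sense's
    summed similarity score by the number of senses it was compared with."""
    return {sense: summed_scores[sense] / comparison_counts[sense]
            for sense in summed_scores}
```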

9.3 Biased towards Resnik?

Three of the measures can return zero when computing the similarity of two concepts: Resnik, Jiang-Conrath and Lin, and they show different tendencies to do so. Looking at how often they returned zero for the nodes in the test set, we see that Resnik returns zero 163 times, Jiang-Conrath 174 times and Lin 301 times. Lin's poor results can partly be explained by this; the relatively small SemCor corpus does not suffice for word frequencies when using the Lin measure. Jiang-Conrath, on the other hand, returns zero only a few times more than Resnik, so Jiang-Conrath's poor performance cannot be as easily explained.

Another explanation might be found in the fact that there is a bias towards the Resnik measure: the default mode of WordNet::Similarity (the one used here) uses the kind of frequency counting and smoothing done by Resnik. [Patwardhan 2003] (p. 42) shows that Jiang-Conrath's correlation with human judgement increases by 13% relative when optimal settings are used. If Jiang-Conrath's performance in this test were to rise by 13% relative, it would beat Resnik in Table 8 and Leacock-Chodorow in Table 9, but Leacock-Chodorow would still be best in Table 10.
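As a rough arithmetic check against the averaged figures in Table 11 (only an approximation, since a relative gain in correlation with human judgement need not carry over one-to-one to average precision): 0.618 × 1.13 ≈ 0.698 > 0.693 (Resnik, no threshold), 0.626 × 1.13 ≈ 0.707 > 0.703 (Leacock-Chodorow, threshold 1), but 0.585 × 1.13 ≈ 0.661 < 0.667 (Leacock-Chodorow, threshold 2).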

10 Conclusion

I have shown a method of disambiguating an ontology using WordNet and five different measures of semantic similarity. The measure best suited for this task is shown to be the Leacock-Chodorow measure. The average precision of this similarity measure ranges from 0.644 to 0.711, and recall from 0.534 to 0.977, both depending on the kind of threshold used.

Out of the three semantic similarity measures based on information content, the Resnik, Jiang-Conrath and Lin measures, only Resnik proved to be of much use. However, the performance of the other two might improve considerably when using larger corpora than SemCor, as was shown by [Patwardhan 2003].

The method described, regardless of the semantic similarity measure used, has the potential of enriching 95% of the CPV leaf-nodes (5370 out of 5644) with some kind of information. Average precision and recall for Leacock-Chodorow suggest that this information is often relevant. How useful it is will be shown in the coming work of Henrik Oxhammar.

This method could also be used for disambiguating other English-based ontologies, provided that the information in them is not too domain-specific to be found in WordNet. Though the actual program written here uses WordNet 1.7.1 and WordNet::Similarity 0.09, only a few minor changes are needed in order to use later versions.


10.1 Future work

It would be interesting to see how other computer-readable information sources, e.g. online dictionaries, thesauri or search engines, could be used instead of WordNet for similar tasks. To some extent, I think it would be possible. The only part of WordNet used in this work was the noun taxonomy, and similar structures could be obtained from thesauri etc.

Another way of disambiguating an ontology such as the CPV would be to take advantage of the fact that there exist exact copies of it in different languages. A word which is polysemous in one language is perhaps monosemous in one of the others. For example, to disambiguate the English CPV-node [31521320 Torches], you could look up the Swedish node by simply searching for the code. There you find [31521320 Ficklampor], which is monosemous in Swedish. All you need to do then is to look up ficklampa (the singular form of ficklampor) in a machine-readable Swedish-English dictionary to find the correct sense of torch.
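A hedged sketch of that idea (the lookup tables are placeholders; no particular resources or formats are assumed to exist):

```python
def cross_lingual_sense(code, cpv_sv, sv_en_senses):
    """Pick the English WordNet sense for a CPV node via its Swedish label.

    cpv_sv maps a CPV code to a lemmatised Swedish label, and sv_en_senses
    is a hypothetical Swedish-English dictionary that maps a monosemous
    Swedish noun directly to an English word-sense."""
    swedish_label = cpv_sv[code]                 # 31521320 -> 'ficklampa'
    return sv_en_senses.get(swedish_label)       # -> the flashlight sense of torch
```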

References

[Agirre et al. 2000] E. Agirre, O. Ansa, E. Hovy, and D. Martinez. 2000. Enriching very large ontologies using the WWW. In Proceedings of the Ontology Learning Workshop, ECAI, Berlin.

[Banerjee and Pedersen 2003] S. Banerjee and T. Pedersen. 2003. Extended gloss overlaps as a measure of semantic relatedness. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, pages 805-810, Acapulco.

[Budanitsky and Hirst 2004] Alexander Budanitsky and Graeme Hirst. 2004. Evaluating WordNet-based measures of semantic distance. Submitted for publication.

[European Union and European Parliament 2002] European Union and European Parliament. 2002. On the Common Procurement Vocabulary (CPV). Regulation (EC) No 2195/2002 of the European Parliament and of the Council, November 5.

[Faatz and Steinmetz 2002] Andreas Faatz and Ralf Steinmetz. 2002. Ontology enrichment with texts from the WWW. In Proc. of the 13th European Conference on Machine Learning (ECML 2002) and the 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2002), Helsinki.

[Fellbaum 1998] Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. Language, Speech, and Communication. MIT Press, Cambridge, MA.

[Hirst and St-Onge 1998] Graeme Hirst and David St-Onge. 1998. Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms, chapter 13, pages 305-332. MIT Press, Cambridge, MA.

[Jiang and Conrath 1997] Jay Jiang and David Conrath. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference Research on Computational Linguistics, Taiwan.

[Landes et al. 1998] Shari Landes, Claudia Leacock, and Randee Tengi. 1998. Building Semantic Concordances, chapter 8, pages 199-216. MIT Press, Cambridge, MA.

[Leacock and Chodorow 1998] Claudia Leacock and Martin Chodorow. 1998. Combining Local Context and WordNet Similarity for Word Sense Identification, chapter 11, pages 265-283. MIT Press, Cambridge, MA.

[Lin 1998] Dekang Lin. 1998. An information-theoretic definition of similarity. In Proc. 15th International Conf. on Machine Learning, pages 296-304. Morgan Kaufmann, San Francisco, CA.

[Miller and Charles 1991] George A. Miller and Walter G. Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1-28.

[Patwardhan 2003] Siddharth Patwardhan. 2003. Incorporating dictionary and corpus information into a context vector measure of semantic relatedness. Master's thesis, University of Minnesota.

[Pedersen et al. 2004] Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi. 2004. WordNet::Similarity - measuring the relatedness of concepts.

[Resnik 1995a] Philip Resnik. 1995a. Disambiguating noun groupings with respect to WordNet senses. In David Yarowsky and Kenneth Church, editors, Proceedings of the Third Workshop on Very Large Corpora, pages 54-68, Somerset, New Jersey. Association for Computational Linguistics.

[Resnik 1995b] Philip Resnik. 1995b. Using information content to evaluate semantic similarity in a taxonomy. In Chris Mellish, editor, IJCAI-95, pages 448-453, Montréal, Canada.

[Roux et al. 2000] Claude Roux, Denys Proux, François Rechenmann, and Laurent Julliard. 2000. An ontology enrichment method for a pragmatic information extraction system gathering data on genetic interactions. In Proc. of the Workshop on Ontology Learning at ECAI 2000, Berlin.

[Seco et al. 2004] Nuno Seco, Tony Veale, and Jer Hayes. 2004. An intrinsic information content metric for semantic similarity in WordNet. In Proceedings of ECAI 2004, page PAGES?, Valencia, Spain. The 16th European Conference on Artificial Intelligence.

[Valarakos et al. 2004] Alexandros Valarakos, Georgios Paliouras, Vangelis Karkaletsis, and George Vouros. 2004. A name-matching algorithm for supporting ontology enrichment. In Proceedings of SETN 2004, 3rd Hellenic Conference on Artificial Intelligence, Samos, Greece.

[Vossen 1998] Piek Vossen, editor. 1998. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, Dordrecht.

[Wu and Palmer 1994] Zhibiao Wu and Martha Palmer. 1994. Verb semantics and lexical selection. In 32nd Annual Meeting of the Association for Computational Linguistics, pages 133-138.


    A Test set

Land-reclamation work
Urban solid-refuse disposal services
Dental haemostatic
Repair and maintenance services of furniture
Multi-functional buildings
Industrial process control equipment
Network components
Unloading semi-trailers for agriculture
Marine patrol vessels
Printing services for stamp-impressed paper
Dampers
Railway transport of bulk liquids or gases
Aerial photography services
Fire hoses
Copper sulphate
Gully-emptying services
Chests of drawers
Leather waste
Electric pumps
Structural steelworks
Overhead projectors
Painters' brushes
Carbon paper
Flat-rolled products of iron
Sweet pies
Notepaper
Tomatoes
Non-destructive testing services
Shorts
Installation services of beverage-processing machinery
Clinical-waste collection services
Motor oils
Smoke-extraction equipment
Underground railway works
Apparatus for detecting fluids
Natural sponges
Chicken cuts
Landscaping work for roof gardens
Safe-deposit lockers
Time recorders
Ferry boats
Ass, mule or hinny meat
Studio mixing console
Paints and wallcoverings
Shampoos
System quality assurance planning services
Breakwater
Facilities management services
Mixes for stocks
Cable-laying ship services
Plastic self-adhesive sheets
Insulated cable joints
Refrigerated showcases
Dressing gowns
Lead
Repair and maintenance services of dampers
Incubators
Electrical signalling equipment for railways
Pollution-monitoring services
File covers
Pig iron
Fittings for loose-leaf binders or files
Sausage products
Reconditioning services of rolling stock seats
Gas-detection equipment
Printing services for account books
Headbands
Diced, sliced and other frozen potatoes
Electronic cards
Ship refuelling services
Processed pulses
Wheelchair tyres
Metal sheets