Word Meaning and Similarity - ecology lab•sim resnik (c 1,c 2) = -log P( LCS(c 1,c 2) ) Philip...

Word Meaning and Similarity Word Senses and Word Relations Slides are adapted from Dan Jurafsky

Page 1: Word Meaning and Similarity - ecology lab•sim resnik (c 1,c 2) = -log P( LCS(c 1,c 2) ) Philip Resnik. 1995. Using Information Content to Evaluate Semantic Similarity in a Taxonomy.

Word Meaning and Similarity

Word Senses and Word Relations

Slides are adapted from Dan Jurafsky

Page 2

Reminder: lemma and wordform

• A lemma or citation form
  • Same stem, part of speech, rough semantics
• A wordform
  • The “inflected” word as it appears in text

Wordform   Lemma
banks      bank
sung       sing
duermes    dormir

Page 3

Lemmas have senses

• One lemma “bank” can have many meanings:
  • Sense 1: “…a bank can hold the investments in a custodial account…”
  • Sense 2: “…as agriculture burgeons on the east bank the river will shrink even more”
• Sense (or word sense)
  • A discrete representation of an aspect of a word’s meaning.
• The lemma bank here has two senses

Page 4

Homonymy

Homonyms: words that share a form but have unrelated, distinct meanings:

• bank1: financial institution, bank2: sloping land
• bat1: club for hitting a ball, bat2: nocturnal flying mammal

1. Homographs (bank/bank, bat/bat)
2. Homophones:
   1. Write and right
   2. Piece and peace

Page 5

Homonymy causes problems for NLP applications

• Information retrieval
  • “bat care”
• Machine Translation
  • bat: murciélago (animal) or bate (for baseball)
• Text-to-Speech
  • bass (stringed instrument) vs. bass (fish)

Page 6

Polysemy

• 1. The bank was constructed in 1875 out of local red brick.
• 2. I withdrew the money from the bank
• Are those the same sense?
  • Sense 2: “A financial institution”
  • Sense 1: “The building belonging to a financial institution”
• A polysemous word has related meanings
  • Most non-rare words have multiple meanings

Page 7

Metonymy or Systematic Polysemy: A systematic relationship between senses

• Lots of types of polysemy are systematic
  • School, university, hospital
  • All can mean the institution or the building.
• A systematic relationship:
  • Building ↔ Organization
• Other such kinds of systematic polysemy:
  • Author (Jane Austen wrote Emma) ↔ Works of Author (I love Jane Austen)
  • Tree (Plums have beautiful blossoms) ↔ Fruit (I ate a preserved plum)

Page 8

How do we know when a word has more than one sense?

• The “zeugma” test: Two senses of serve?
  • Which flights serve breakfast?
  • Does Lufthansa serve Philadelphia?
  • ?Does Lufthansa serve breakfast and San Jose?
• Since this conjunction sounds weird,
  • we say that these are two different senses of “serve”

Page 9

Synonyms

• Words that have the same meaning in some or all contexts.
  • filbert / hazelnut
  • couch / sofa
  • big / large
  • automobile / car
  • vomit / throw up
  • Water / H2O
• Two lexemes are synonyms
  • if they can be substituted for each other in all situations
  • If so they have the same propositional meaning

Page 10

Synonyms

• But there are few (or no) examples of perfect synonymy.
  • Even if many aspects of meaning are identical
  • Still may not preserve the acceptability based on notions of politeness, slang, register, genre, etc.
• Example:
  • Water / H2O
  • Big / large
  • Brave / courageous

Page 11

Synonymy is a relation between senses rather than words

• Consider the words big and large
• Are they synonyms?
  • How big is that plane?
  • Would I be flying on a large or small plane?
• How about here:
  • Miss Nelson became a kind of big sister to Benjamin.
  • ?Miss Nelson became a kind of large sister to Benjamin.
• Why?
  • big has a sense that means being older, or grown up
  • large lacks this sense

Page 12

Antonyms

• Senses that are opposites with respect to one feature of meaning
• Otherwise, they are very similar!

  dark/light   short/long   fast/slow   rise/fall
  hot/cold     up/down      in/out

• More formally: antonyms can
  • define a binary opposition or be at opposite ends of a scale
    • long/short, fast/slow
  • Be reversives:
    • rise/fall, up/down

Page 13

Hyponymy and Hypernymy

• One sense is a hyponym of another if the first sense is more specific, denoting a subclass of the other
  • car is a hyponym of vehicle
  • mango is a hyponym of fruit
• Conversely hypernym/superordinate (“hyper is super”)
  • vehicle is a hypernym of car
  • fruit is a hypernym of mango

Superordinate/hyper    vehicle   fruit   furniture
Subordinate/hyponym    car       mango   chair

Page 14

Hyponymy more formally

• Extensional:
  • The class denoted by the superordinate extensionally includes the class denoted by the hyponym
• Entailment:
  • A sense A is a hyponym of sense B if being an A entails being a B
• Hyponymy is usually transitive
  • (A hypo B and B hypo C entails A hypo C)
• Another name: the IS-A hierarchy
  • A IS-A B (or A ISA B)
  • B subsumes A
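The transitivity of IS-A can be illustrated with a small sketch. The links for car, mango and chair come from the slides; the extra links (sedan, artifact) are invented here purely to give the chain a third level, and the function names are hypothetical.

```python
# Toy IS-A links: each sense maps to its direct hypernym.
# "sedan" and "artifact" are invented for illustration, not taken from the slides.
HYPERNYM = {
    "sedan": "car",
    "car": "vehicle",
    "vehicle": "artifact",
    "mango": "fruit",
    "chair": "furniture",
}

def hypernyms(sense):
    """Return every direct and inherited hypernym of a sense, nearest first."""
    chain = []
    while sense in HYPERNYM:
        sense = HYPERNYM[sense]
        chain.append(sense)
    return chain

def is_a(a, b):
    """True if a is a hyponym of b, directly or by transitivity (b subsumes a)."""
    return b in hypernyms(a)

print(hypernyms("sedan"))        # walks sedan -> car -> vehicle -> artifact
print(is_a("sedan", "vehicle"))  # holds only through transitivity
print(is_a("vehicle", "car"))    # the relation is not symmetric
```

Following the chain upward is enough here because each toy sense has a single hypernym; a real thesaurus can give a sense several hypernyms, which would need a graph search instead.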

Page 15

Hyponyms and Instances

• WordNet has both classes and instances.
• An instance is an individual, a proper noun that is a unique entity
  • San Francisco is an instance of city
  • But city is a class
    • city is a hyponym of municipality...location...

Page 16

Word Meaning and Similarity

Word Senses and Word Relations

Page 17

Word Meaning and Similarity

WordNet and other Online Thesauri

Page 18

Applications of Thesauri and Ontologies

• Information Extraction
• Information Retrieval
• Question Answering
• Bioinformatics and Medical Informatics
• Machine Translation

Page 19

WordNet 3.0

• A hierarchically organized lexical database
• On-line thesaurus + aspects of a dictionary
• Some other languages available or under development
  • (Arabic, Finnish, German, Portuguese…)

Category    Unique Strings
Noun        117,798
Verb        11,529
Adjective   22,479
Adverb      4,481

Page 20

Senses of “bass” in WordNet

Page 21

How is “sense” defined in WordNet?

• The synset (synonym set), the set of near-synonyms, instantiates a sense or concept, with a gloss
• Example: chump as a noun with the gloss:
  “a person who is gullible and easy to take advantage of”
• This sense of “chump” is shared by 9 words:
  chump1, fool2, gull1, mark9, patsy1, fall guy1, sucker1, soft touch1, mug2
• Each of these senses has this same gloss
  • (Not every sense; sense 2 of gull is the aquatic bird)

Page 22

WordNet Hypernym Hierarchy for “bass”

Page 23

WordNet Noun Relations

Page 24

WordNet 3.0

• Where it is:
  • http://wordnetweb.princeton.edu/perl/webwn
• Libraries
  • Python: WordNet from NLTK
    • http://www.nltk.org/Home
  • Java:
    • JWNL, extJWNL on sourceforge

Page 25

MeSH: Medical Subject Headings thesaurus from the National Library of Medicine

• MeSH (Medical Subject Headings)
  • 177,000 entry terms that correspond to 26,142 biomedical “headings”
• Hemoglobins
  [Synset] Entry Terms: Eryhem, Ferrous Hemoglobin, Hemoglobin
  Definition: The oxygen-carrying proteins of ERYTHROCYTES. They are found in all vertebrates and some invertebrates. The number of globin subunits in the hemoglobin quaternary structure differs between species. Structures range from monomeric to a variety of multimeric arrangements

Page 26

The MeSH Hierarchy

Page 27

Uses of the MeSH Ontology

• Provide synonyms (“entry terms”)
  • E.g., glucose and dextrose
• Provide hypernyms (from the hierarchy)
  • E.g., glucose ISA monosaccharide
• Indexing in MEDLINE/PubMED database
  • NLM’s bibliographic database:
    • 20 million journal articles
    • Each article hand-assigned 10-20 MeSH terms

Page 28

Word Meaning and Similarity

WordNet and other Online Thesauri

Page 29

Word Meaning and Similarity

Word Similarity: Thesaurus Methods

Page 30

Word Similarity

• Synonymy: a binary relation
  • Two words are either synonymous or not
• Similarity (or distance): a looser metric
  • Two words are more similar if they share more features of meaning
• Similarity is properly a relation between senses
  • The word “bank” is not similar to the word “slope”
  • Bank1 is similar to fund3
  • Bank2 is similar to slope5
• But we’ll compute similarity over both words and senses

Page 31

Why word similarity

• Information retrieval
• Question answering
• Machine translation
• Natural language generation
• Language modeling
• Automatic essay grading
• Plagiarism detection
• Document clustering

Page 32

Word similarity and word relatedness

• We often distinguish word similarity from word relatedness
  • Similar words: near-synonyms
  • Related words: can be related any way
    • car, bicycle: similar
    • car, gasoline: related, not similar

Page 33

Two classes of similarity algorithms

• Thesaurus-based algorithms
  • Are words “nearby” in hypernym hierarchy?
  • Do words have similar glosses (definitions)?
• Distributional algorithms
  • Do words have similar distributional contexts?

Page 34

Path based similarity

• Two concepts (senses/synsets) are similar if they are near each other in the thesaurus hierarchy
  • = have a short path between them
  • concepts have path 1 to themselves

Page 35

Refinements to path-based similarity

• pathlen(c1,c2) = 1 + number of edges in the shortest path in the hypernym graph between sense nodes c1 and c2

• simpath(c1,c2):

$$\mathrm{sim}_{\mathrm{path}}(c_1, c_2) = \frac{1}{\mathrm{pathlen}(c_1, c_2)}$$

  • ranges from 0 to 1 (identity)

• wordsim(w1,w2):

$$\mathrm{wordsim}(w_1, w_2) = \max_{c_1 \in \mathrm{senses}(w_1),\; c_2 \in \mathrm{senses}(w_2)} \mathrm{sim}_{\mathrm{path}}(c_1, c_2)$$

Page 36

Example: path-based similarity

simpath(c1,c2) = 1/pathlen(c1,c2)

simpath(nickel, coin) = 1/2 = .5
simpath(fund, budget) = 1/2 = .5
simpath(nickel, currency) = 1/4 = .25
simpath(nickel, money) = 1/6 = .17
simpath(coinage, Richter scale) = 1/6 = .17
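The numbers in this example can be reproduced with a short breadth-first search. The hierarchy figure did not survive in this transcript, so the edges below are an assumption: they were chosen so that the path lengths match the values listed above, and the function names are illustrative only.

```python
from collections import deque

# Assumed hypernym edges (child -> parent), picked to be consistent with the
# similarity values on the slide; the original figure is not in the transcript.
PARENT = {
    "medium of exchange": "standard",
    "scale": "standard",
    "currency": "medium of exchange",
    "money": "medium of exchange",
    "coinage": "currency",
    "coin": "coinage",
    "nickel": "coin",
    "dime": "coin",
    "fund": "money",
    "budget": "fund",
    "Richter scale": "scale",
}

def neighbors(node):
    """Nodes one edge away: the children of node plus its parent, if any."""
    out = [child for child, parent in PARENT.items() if parent == node]
    if node in PARENT:
        out.append(PARENT[node])
    return out

def pathlen(c1, c2):
    """1 + number of edges on the shortest path between c1 and c2."""
    seen, queue = {c1}, deque([(c1, 0)])
    while queue:
        node, edges = queue.popleft()
        if node == c2:
            return 1 + edges
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, edges + 1))
    raise ValueError(f"no path between {c1!r} and {c2!r}")

def simpath(c1, c2):
    """Path-based similarity: the reciprocal of pathlen."""
    return 1 / pathlen(c1, c2)

def wordsim(senses1, senses2):
    """Word-level similarity: the best simpath over all pairs of senses."""
    return max(simpath(a, b) for a in senses1 for b in senses2)

for pair in [("nickel", "coin"), ("fund", "budget"), ("nickel", "currency"),
             ("nickel", "money"), ("coinage", "Richter scale")]:
    print(pair, round(simpath(*pair), 2))
```

Because pathlen adds 1 to the edge count, a concept compared with itself gets pathlen 1 and similarity 1, which is the "identity" end of the range.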

Page 37

Problem with basic path-based similarity

• Assumes each link represents a uniform distance
  • But nickel to money seems to us to be closer than nickel to standard
  • Nodes high in the hierarchy are very abstract
• We instead want a metric that
  • Represents the cost of each edge independently
  • Words connected only through abstract nodes
    • are less similar

Page 38

Information content similarity metrics

• Let’s define P(c) as:
  • The probability that a randomly selected word in a corpus is an instance of concept c
  • Formally: there is a distinct random variable, ranging over words, associated with each concept in the hierarchy
    • for a given concept, each observed noun is either
      • a member of that concept with probability P(c)
      • not a member of that concept with probability 1-P(c)
• All words are members of the root node (Entity)
  • P(root)=1
• The lower a node in hierarchy, the lower its probability

Resnik 1995. Using information content to evaluate semantic similarity in a taxonomy. IJCAI

Page 39

Information content similarity

• Train by counting in a corpus
  • Each instance of hill counts toward frequency of natural elevation, geological formation, entity, etc
  • Let words(c) be the set of all words that are children of node c
    • words(“geo-formation”) = {hill, ridge, grotto, coast, cave, shore, natural elevation}
    • words(“natural elevation”) = {hill, ridge}

$$P(c) = \frac{\sum_{w \in \mathrm{words}(c)} \mathrm{count}(w)}{N}$$

[Figure: fragment of the hierarchy with the nodes entity, geological-formation, natural elevation, shore, cave, hill, ridge, coast, grotto]
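The counting scheme on this slide can be sketched as follows. Several things here are assumptions rather than content from the slides: the parent/child arrangement of the nodes (the figure is lost, so the tree was chosen to agree with the two words(c) examples), the corpus counts (made up), N taken as the total of all counts, and the decision to count a node's own word toward its concept so that leaves such as hill get a non-zero probability.

```python
# Assumed tree shape (parent -> children), consistent with the words(c) examples.
CHILDREN = {
    "entity": ["geological-formation"],
    "geological-formation": ["natural elevation", "cave", "shore"],
    "natural elevation": ["hill", "ridge"],
    "cave": ["grotto"],
    "shore": ["coast"],
}

# Made-up corpus counts, for illustration only.
COUNT = {
    "entity": 1, "geological-formation": 1, "natural elevation": 1,
    "hill": 5, "ridge": 2, "cave": 3, "grotto": 1, "shore": 2, "coast": 4,
}

N = sum(COUNT.values())  # assumed: total number of counted word tokens

def words(c):
    """The concept c itself plus every node below it in the hierarchy."""
    found = [c]
    for child in CHILDREN.get(c, []):
        found.extend(words(child))
    return found

def prob(c):
    """P(c): share of corpus tokens that are instances of concept c."""
    return sum(COUNT.get(w, 0) for w in words(c)) / N

for concept in ["entity", "geological-formation", "natural elevation", "hill"]:
    print(concept, prob(concept))
```

With these toy counts the root gets probability 1 and the probability shrinks at every step down the tree, which is the behaviour the slides describe.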

Page 40

Information content similarity

• WordNet hierarchy augmented with probabilities P(c)

D. Lin. 1998. An Information-Theoretic Definition of Similarity. ICML 1998

Page 41

Information content: definitions

• Information content:
  IC(c) = -log P(c)
• Most informative subsumer (Lowest common subsumer)
  LCS(c1,c2) = The most informative (lowest) node in the hierarchy subsuming both c1 and c2

Page 42

Using information content for similarity: the Resnik method

• The similarity between two words is related to their common information
• The more two words have in common, the more similar they are
• Resnik: measure common information as:
  • The information content of the most informative (lowest) subsumer (MIS/LCS) of the two nodes
  • simresnik(c1,c2) = -log P( LCS(c1,c2) )

Philip Resnik. 1995. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. IJCAI 1995.
Philip Resnik. 1999. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. JAIR 11, 95-130.
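A minimal sketch of the Resnik measure, under stated assumptions: the probability of geological-formation is the one quoted in the Lin example elsewhere in these slides, the parent links between hill, coast and geological-formation are assumed (the hierarchy figure is missing from the transcript), and the natural logarithm is used because that worked example uses ln.

```python
import math

# Probability quoted in the slides' hill/coast example.
P = {"geological-formation": 0.00176}

# Assumed parent links for the hierarchy fragment.
PARENT = {
    "hill": "natural elevation",
    "natural elevation": "geological-formation",
    "coast": "shore",
    "shore": "geological-formation",
    "geological-formation": "entity",
}

def ancestors(c):
    """c followed by each node on the way up to the root."""
    chain = [c]
    while c in PARENT:
        c = PARENT[c]
        chain.append(c)
    return chain

def lcs(c1, c2):
    """Lowest common subsumer: the first ancestor of c1 that also subsumes c2."""
    above_c2 = set(ancestors(c2))
    for node in ancestors(c1):
        if node in above_c2:
            return node
    raise ValueError("the two concepts share no subsumer")

def ic(c):
    """Information content IC(c) = -log P(c)."""
    return -math.log(P[c])

def sim_resnik(c1, c2):
    """Resnik similarity: information content of the lowest common subsumer."""
    return ic(lcs(c1, c2))

print(lcs("hill", "coast"))
print(round(sim_resnik("hill", "coast"), 2))
```

Only the probability of the subsumer enters the score, so any two concepts that meet at the same node receive the same Resnik similarity.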

Page 43

Dekang Lin method

• Intuition: Similarity between A and B is not just what they have in common
• The more differences between A and B, the less similar they are:
  • Commonality: the more A and B have in common, the more similar they are
  • Difference: the more differences between A and B, the less similar
• Commonality: IC(common(A,B))
• Difference: IC(description(A,B)) - IC(common(A,B))

Dekang Lin. 1998. An Information-Theoretic Definition of Similarity. ICML

Page 44

Dekang Lin similarity theorem

• The similarity between A and B is measured by the ratio between the amount of information needed to state the commonality of A and B and the information needed to fully describe what A and B are

$$\mathrm{sim}_{\mathrm{Lin}}(A, B) \propto \frac{IC(\mathrm{common}(A, B))}{IC(\mathrm{description}(A, B))}$$

• Lin (altering Resnik) defines IC(common(A,B)) as 2 x information of the LCS

$$\mathrm{sim}_{\mathrm{Lin}}(c_1, c_2) = \frac{2 \log P(\mathrm{LCS}(c_1, c_2))}{\log P(c_1) + \log P(c_2)}$$
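One way to connect the ratio to the final formula is to substitute IC(c) = -log P(c). The step below assumes, as the denominator of the final formula suggests, that fully describing the pair costs IC(c1) + IC(c2); the slides do not state this explicitly.

```latex
\begin{align*}
\mathrm{sim}_{\mathrm{Lin}}(c_1, c_2)
  &= \frac{2\, IC(\mathrm{LCS}(c_1, c_2))}{IC(c_1) + IC(c_2)} \\
  &= \frac{-2 \log P(\mathrm{LCS}(c_1, c_2))}{-\log P(c_1) - \log P(c_2)} \\
  &= \frac{2 \log P(\mathrm{LCS}(c_1, c_2))}{\log P(c_1) + \log P(c_2)}
\end{align*}
```

The minus signs cancel, which is why the final formula is written with plain logarithms even though both numerator and denominator are negative numbers.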

Page 45

Lin similarity function

$$\mathrm{sim}_{\mathrm{Lin}}(c_1, c_2) = \frac{2 \log P(\mathrm{LCS}(c_1, c_2))}{\log P(c_1) + \log P(c_2)}$$

$$\mathrm{sim}_{\mathrm{Lin}}(\text{hill}, \text{coast}) = \frac{2 \log P(\text{geological-formation})}{\log P(\text{hill}) + \log P(\text{coast})} = \frac{2 \ln 0.00176}{\ln 0.0000189 + \ln 0.0000216} = .59$$
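The arithmetic of the hill/coast example can be checked directly. This is only a sketch: the function takes the three probabilities as plain numbers (its name and signature are illustrative) and uses the natural logarithm, as the worked example does.

```python
import math

def sim_lin(p_lcs, p_c1, p_c2):
    """Lin similarity from P(LCS(c1, c2)), P(c1) and P(c2)."""
    return 2 * math.log(p_lcs) / (math.log(p_c1) + math.log(p_c2))

# Probabilities from the slide: geological-formation, hill, coast.
score = sim_lin(0.00176, 0.0000189, 0.0000216)
print(round(score, 2))  # 0.59
```

The base of the logarithm does not affect the result, since it cancels between numerator and denominator. When both concepts coincide with their subsumer the ratio is 1, the maximum of the measure.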

Page 46

The (extended) Lesk Algorithm

• A thesaurus-based measure that looks at glosses
• Two concepts are similar if their glosses contain similar words
  • Drawing paper: paper that is specially prepared for use in drafting
  • Decal: the art of transferring designs from specially prepared paper to a wood or glass or metal surface
• For each n-word phrase that’s in both glosses
  • Add a score of n^2
  • Paper and specially prepared for 1 + 2^2 = 5
• Compute overlap also for other relations
  • glosses of hypernyms and hyponyms
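The gloss-overlap score can be sketched as a greedy search for the longest shared phrase. This is an illustration under simplifying assumptions, not the published algorithm in full: it tokenizes on whitespace, does no stop-word filtering, and scores only one pair of glosses rather than summing over related synsets.

```python
def overlap(gloss1, gloss2):
    """Sum n**2 over shared n-word phrases, taking the longest phrases first."""
    a, b = gloss1.lower().split(), gloss2.lower().split()
    score = 0
    while True:
        best = (0, 0, 0)  # (phrase length, start in a, start in b)
        for i in range(len(a)):
            for j in range(len(b)):
                n = 0
                # Extend the match while both glosses agree on a real word.
                while (i + n < len(a) and j + n < len(b)
                       and a[i + n] is not None and a[i + n] == b[j + n]):
                    n += 1
                if n > best[0]:
                    best = (n, i, j)
        n, i, j = best
        if n == 0:
            return score
        score += n ** 2
        # Blank out the matched words so they are not counted again.
        a[i:i + n] = [None] * n
        b[j:j + n] = [None] * n

drawing_paper = "paper that is specially prepared for use in drafting"
decal = ("the art of transferring designs from specially prepared paper "
         "to a wood or glass or metal surface")
print(overlap(drawing_paper, decal))  # 5
```

On the two glosses from the slide the function first finds "specially prepared" (2 words, score 4) and then "paper" (1 word, score 1), giving the total of 5 shown above.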

Page 47

Summary: thesaurus-based similarity

$$\mathrm{sim}_{\mathrm{path}}(c_1, c_2) = \frac{1}{\mathrm{pathlen}(c_1, c_2)}$$

$$\mathrm{sim}_{\mathrm{resnik}}(c_1, c_2) = -\log P(\mathrm{LCS}(c_1, c_2))$$

$$\mathrm{sim}_{\mathrm{lin}}(c_1, c_2) = \frac{2 \log P(\mathrm{LCS}(c_1, c_2))}{\log P(c_1) + \log P(c_2)}$$

$$\mathrm{sim}_{\mathrm{jiangconrath}}(c_1, c_2) = \frac{1}{2 \log P(\mathrm{LCS}(c_1, c_2)) - \left(\log P(c_1) + \log P(c_2)\right)}$$

$$\mathrm{sim}_{\mathrm{eLesk}}(c_1, c_2) = \sum_{r, q \in \mathrm{RELS}} \mathrm{overlap}(\mathrm{gloss}(r(c_1)), \mathrm{gloss}(q(c_2)))$$
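As a rough illustration of the Jiang-Conrath entry in this summary, the sketch below writes the measure as the reciprocal of the distance IC(c1) + IC(c2) - 2 IC(LCS), which is non-negative because the subsumer is at least as probable as either concept. The function name is hypothetical, and the probabilities are the hill/coast values from the Lin example in these slides.

```python
import math

def sim_jiang_conrath(p_lcs, p_c1, p_c2):
    """Reciprocal of the Jiang-Conrath distance IC(c1) + IC(c2) - 2*IC(LCS)."""
    distance = 2 * math.log(p_lcs) - (math.log(p_c1) + math.log(p_c2))
    return 1 / distance

# P(geological-formation), P(hill), P(coast) from the slides.
print(round(sim_jiang_conrath(0.00176, 0.0000189, 0.0000216), 3))
```

The distance is zero when both concepts coincide with their subsumer, so a practical implementation would need to handle that case instead of dividing by zero.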

Page 48

Libraries for computing thesaurus-based similarity

• NLTK
  • http://nltk.github.com/api/nltk.corpus.reader.html?highlight=similarity-nltk.corpus.reader.WordNetCorpusReader.res_similarity
• WordNet::Similarity
  • http://wn-similarity.sourceforge.net/
  • Web-based interface:
    • http://marimba.d.umn.edu/cgi-bin/similarity/similarity.cgi

Page 49

Evaluating similarity

• Extrinsic (task-based, end-to-end) Evaluation:
  • Question Answering
  • Spell Checking
  • Essay grading
• Intrinsic Evaluation:
  • Correlation between algorithm and human word similarity ratings
    • Wordsim353: 353 noun pairs rated 0-10. sim(plane,car)=5.77
  • Taking TOEFL multiple-choice vocabulary tests
    • Levied is closest in meaning to: imposed, believed, requested, correlated
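The intrinsic evaluation by correlation can be sketched with a rank correlation. The slides do not say which correlation is meant; Spearman's is used here as one common choice. Apart from the 5.77 rating for (plane, car), every number below is invented for illustration, and the simple formula assumes there are no tied values.

```python
def ranks(values):
    """Rank of each value, 1 for the smallest (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation for two equally long lists without ties."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Human ratings on the 0-10 scale; only 5.77 comes from the slide.
human = [5.77, 9.1, 1.2, 7.4]
# Invented scores from two hypothetical similarity algorithms.
system_a = [0.33, 0.90, 0.05, 0.50]
system_b = [0.60, 0.90, 0.05, 0.50]

print(spearman(human, system_a))
print(spearman(human, system_b))
```

A system that orders the word pairs exactly as the human raters do scores 1, and each disagreement in the ordering lowers the correlation.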

Page 50

Word Meaning and Similarity

Word Similarity: Thesaurus Methods