Measuring Similarity Between Contexts and Concepts
Transcript of Measuring Similarity Between Contexts and Concepts
1
Measuring Similarity Between Concepts and Contexts
Ted Pedersen
Department of Computer Science
University of Minnesota, Duluth
http://www.d.umn.edu/~tpederse
2
The problems…
Recognize similar (or related) concepts
  frog : amphibian
  Duluth : snow
Recognize similar contexts
  I bought some food at the store :
  I purchased something to eat at the market
3
Similarity and Relatedness
Two concepts are similar if they are connected by is-a relationships.
  A frog is-a-kind-of amphibian
  An illness is-a health_condition
Two concepts can be related in many ways…
  A human has-a-part liver
  Duluth receives-a-lot-of snow
…similarity is one way to be related
4
The approaches…
Measure conceptual similarity using a structured repository of knowledge
  The lexical database WordNet
Measure contextual similarity using knowledge-lean methods that are based on co-occurrence information from large corpora
5
Why measure conceptual similarity?
A word will take the sense that is most related to the surrounding context
  I love Java, especially the beaches and the weather.
  I love Java, especially the support for concurrent programming.
  I love java, especially first thing in the morning with a bagel.
6
Word Sense Disambiguation
…can be performed by finding the sense of a word most related to its neighbors
Here, we define similarity and relatedness with respect to WordNet
  WordNet::Similarity
    http://wn-similarity.sourceforge.net
  WordNet::SenseRelate
    AllWords – assign a sense to every content word
    TargetWord – assign a sense to a given word
    http://senserelate.sourceforge.net
7
SenseRelate
For each sense of a target word in context
  For each content word in the context
    • For each sense of that content word
      Measure similarity/relatedness between the sense of the target word and the sense of the content word with WordNet::Similarity
  Keep a running sum for the score of each sense of the target
Pick the sense of the target word with the highest score with the words in context
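The nested loops above can be sketched in a few lines. The toy sense inventory (`SENSES`) and relatedness scores (`RELATEDNESS`) below are invented stand-ins; the real system uses WordNet senses and a WordNet::Similarity measure.

```python
# A minimal, self-contained sketch of the SenseRelate target-word loop.
# All sense identifiers and scores are invented for illustration.

SENSES = {
    "java": ["java#island", "java#language", "java#coffee"],
    "beach": ["beach#shore"],
    "programming": ["programming#computing"],
    "bagel": ["bagel#food"],
}

# Invented symmetric relatedness scores; unlisted pairs score 0.
RELATEDNESS = {
    ("java#island", "beach#shore"): 0.9,
    ("java#language", "programming#computing"): 0.9,
    ("java#coffee", "bagel#food"): 0.9,
}

def relatedness(s1, s2):
    return RELATEDNESS.get((s1, s2), RELATEDNESS.get((s2, s1), 0.0))

def sense_relate(target, context_words):
    """For each sense of the target, sum its best relatedness to each
    context word's senses; return the target sense with the highest total."""
    best_sense, best_score = None, float("-inf")
    for sense in SENSES[target]:
        score = sum(
            max((relatedness(sense, cs) for cs in SENSES[w]), default=0.0)
            for w in context_words if w in SENSES
        )
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense
```

For the slide-5 examples, `sense_relate("java", ["beach"])` picks the island sense, while a context containing `"programming"` picks the language sense.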
8
WordNet::Similarity
Path based measures
  Shortest path (path)
  Wu & Palmer (wup)
  Leacock & Chodorow (lch)
  Hirst & St-Onge (hso)
Information content measures
  Resnik (res)
  Jiang & Conrath (jcn)
  Lin (lin)
Gloss based measures
  Banerjee and Pedersen (lesk)
  Patwardhan and Pedersen (vector, vector_pairs)
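As an illustration of the simplest of these, the path measure scores two concepts by the inverse of (1 + shortest is-a path length). The tiny hierarchy below is invented; the real measures walk WordNet's hierarchies.

```python
# Sketch of the shortest-path (path) measure over a toy is-a hierarchy.
from collections import deque

IS_A = {               # child -> parent
    "frog": "amphibian",
    "amphibian": "animal",
    "dog": "mammal",
    "mammal": "animal",
}

def path_length(a, b):
    """Breadth-first search over is-a links, treated as undirected edges."""
    adj = {}
    for child, parent in IS_A.items():
        adj.setdefault(child, set()).add(parent)
        adj.setdefault(parent, set()).add(child)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None                      # not connected

def path_similarity(a, b):
    d = path_length(a, b)
    return None if d is None else 1.0 / (1 + d)
```

Here `path_similarity("frog", "amphibian")` is 0.5 (one link apart), while `path_similarity("frog", "dog")` is 0.2 (four links, via animal).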
9
Why don't path finding and information content solve the problem?
Concepts must be organized in a hierarchy, and connected in that hierarchy
Limited to comparing nouns with nouns, or maybe verbs with verbs
Limited to similarity measures (is-a)
What about mixed parts of speech?
  Murder (noun) and horrible (adjective)
  Tobacco (noun) and drinking (verb)
10
Using Dictionary Glosses to Measure Relatedness
Lesk (1986) Algorithm – measure the relatedness of two concepts by counting the number of shared words in their definitions
  Cold – a mild viral infection involving the nose and respiratory passages (but not the lungs)
  Flu – an acute febrile highly contagious viral disease
Adapted Lesk (Banerjee & Pedersen, 2003) – expand glosses to include those of directly related concepts
  Cold – a common cold affecting the nasal passages and resulting in congestion and sneezing and headache; mild viral infection involving the nose and respiratory passages (but not the lungs); a disease affecting the respiratory system
  Flu – an acute and highly contagious respiratory disease of swine caused by the orthomyxovirus thought to be the same virus that caused the 1918 influenza pandemic; an acute febrile highly contagious viral disease; a disease that can be communicated from one person to another
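The original overlap count can be sketched directly from the two glosses quoted above; the minimal stop list is an invented placeholder for a real one.

```python
# Original Lesk overlap: relatedness is the number of content words two
# definitions share. The stop list here is a tiny invented stand-in.

STOP = {"a", "an", "and", "the", "of", "but", "not", "in", "to", "that"}

def lesk_overlap(gloss1, gloss2):
    """Count distinct content words shared by two definitions."""
    words1 = {w.lower().strip("().,;") for w in gloss1.split()} - STOP
    words2 = {w.lower().strip("().,;") for w in gloss2.split()} - STOP
    return len(words1 & words2)

cold = ("a mild viral infection involving the nose and respiratory "
        "passages (but not the lungs)")
flu = "an acute febrile highly contagious viral disease"

print(lesk_overlap(cold, flu))  # "viral" is the only shared content word -> 1
```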
11
Context/Gloss Vectors
Leskian approaches require exact matches in glosses
  Glosses are short, use related but not identical words
Solution? Expand glosses by replacing each content word with a co-occurrence vector derived from corpora
  Rows are words in glosses, columns are the co-occurring words in a corpus, cell values are their log-likelihood ratios
Average the word vectors to create a single vector that represents the gloss/sense (Patwardhan & Pedersen, 2003)
  2nd order co-occurrences
Measure relatedness using cosine rather than exact match!
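A minimal sketch of the gloss-vector construction follows; the co-occurrence values are invented toys standing in for corpus-derived log-likelihood scores.

```python
# Gloss vectors: replace each gloss word with its co-occurrence vector,
# average into one vector per gloss, compare glosses by cosine.
import numpy as np

# word -> co-occurrence vector over a fixed column vocabulary (invented).
COOC = {
    "viral":      np.array([4.2, 0.0, 1.1]),
    "infection":  np.array([3.9, 0.5, 0.0]),
    "disease":    np.array([3.5, 0.2, 0.4]),
    "contagious": np.array([2.8, 0.0, 0.9]),
}

def gloss_vector(words):
    """Average the co-occurrence vectors of the gloss's content words."""
    vecs = [COOC[w] for w in words if w in COOC]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

g1 = gloss_vector(["viral", "infection"])
g2 = gloss_vector(["contagious", "disease"])
score = cosine(g1, g2)   # high: the glosses share 2nd order co-occurrences
```

Note that `g1` and `g2` share no word directly, yet score high on cosine: exactly the sparsity problem the exact-match Lesk overlap cannot handle.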
12
Gloss/Context Vectors
13
2nd order co-occurrences
Two words that occur together (within some number of positions of each other) are first order co-occurrences
Words A and B that co-occur with C separately but not with each other are second order co-occurrences (i.e., a friend of a friend)
  "military intelligence" and "military armor" are first order co-occurrences
  "Intelligence" and "armor" are 2nd order co-occurrences (via military)
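The "friend of a friend" definition can be made concrete over the example pairs above: two words are second order co-occurrences if they share a neighbour but never co-occur directly.

```python
# Find second order co-occurrences from a list of first order pairs.

FIRST_ORDER = [("military", "intelligence"), ("military", "armor")]

def second_order_pairs(pairs):
    neighbours = {}          # word -> set of first order neighbours
    direct = set()           # directly co-occurring pairs
    for a, b in pairs:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
        direct.add(frozenset((a, b)))
    result = set()
    words = list(neighbours)
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            # second order: no direct link, but a shared neighbour
            if frozenset((a, b)) not in direct and neighbours[a] & neighbours[b]:
                result.add(frozenset((a, b)))
    return result

pairs = second_order_pairs(FIRST_ORDER)
# yields the intelligence/armor pair, linked via "military"
```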
14
WSD Experiment
Senseval-2 data consists of 73 nouns, verbs, and adjectives, with approximately 8,600 "training" examples and 4,300 "test" examples.
  Best supervised system 64%
  SenseRelate 53% (lesk, vector)
  Most frequent sense 48%
Window of context is defined by position, and includes 2 content words to both the left and right, which are measured against the word being disambiguated.
  Positional proximity is not always associated with semantic similarity.
15
Human Relatedness Experiment
Miller and Charles (1991) created 30 pairs of nouns that were scored on a relatedness scale by over 50 human subjects
Vector measure correlates at over 80% with human relatedness judgements
Next closest measure is lesk (at 70%)
All other measures at less than 65%
16
Why gloss based measures don't solve the problem…
WordNet
  Nouns – 80,000 concepts
  Verbs – 13,000 concepts
  Adjectives – 18,000 concepts
  Adverbs – 4,000 concepts
Words not found in WordNet can't be disambiguated by SenseRelate
17
Knowledge Lean Methods
Can measure the similarity between two words by comparing co-occurrence vectors created for each.
Can measure the similarity of two contexts by representing them as 2nd order co-occurrence vectors and comparing.
18
Word Sense Discrimination
Cluster different senses of words like line or interest based on contextual similarity.
  Pedersen & Bruce, 1997
  Schutze, 1998
  Purandare & Pedersen, 2004
Hard to evaluate; senses of words are somewhat ill defined, and distinctions made by clustering methods may or may not correspond with human intuitions
http://senseclusters.sourceforge.net
19
Name Discrimination
Names that occur in similar contexts may refer to the same person.
  George Miller is an eminent psychologist.
  George Miller is one of the founders of modern cognitive science.
  George Miller is a member of the US House of Representatives.
20
21
22
23
24
Objective
Given some number of contexts containing "John Smith", identify those that are similar to each other
Group similar contexts together, and assume they are associated with a single individual
Generate an identifying label from the content of the different clusters
25
Similarity of Context?
Context 1: He drives his car fast
Context 2: Jim speeds in his auto
  Car -> motor, garage, gasoline, insurance
  Auto -> motor, insurance, gasoline, accident
Car and Auto occur with many of the same words. They share many first order co-occurrences.
  • They are therefore similar!
Less direct relationship, more resistant to sparsity!
26
Representing a Context
Represent a context as the average of all the first order vectors of the words in the context
Very similar to the vector measure – the only difference is that the context here comes from raw corpora, whereas in the vector measure the context is a dictionary definition
27
Feature Selection
Bigrams – two word sequences that may have one intervening word between them
  Frequency > 1
  Log-likelihood ratio > 3.841
  OR stop list
Must occur within Ft positions of the target, where Ft is typically set to 5 or 20
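The cutoff of 3.841 is the 95% critical value of chi-squared with one degree of freedom. A sketch of the G² log-likelihood score for a bigram, computed from its 2x2 contingency table, might look like the following; all counts are invented.

```python
# G^2 log-likelihood score for a bigram; keep the bigram as a feature
# if the score exceeds 3.841.
import math

def log_likelihood(n11, n1p, np1, npp):
    """G^2 for bigram (w1, w2): n11 = count(w1 w2), n1p = count(w1 *),
    np1 = count(* w2), npp = total bigrams."""
    # observed 2x2 contingency table, row by row
    obs = [n11, n1p - n11, np1 - n11, npp - n1p - np1 + n11]
    row = [n1p, npp - n1p]
    col = [np1, npp - np1]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / npp
            if obs[2 * i + j] > 0:
                g2 += obs[2 * i + j] * math.log(obs[2 * i + j] / expected)
    return 2 * g2
```

A strongly associated bigram (say 20 joint occurrences where chance predicts 0.2) scores far above 3.841 and is kept; an at-chance bigram scores near zero and is filtered out.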
28
Second Order Context Representation
Bigrams used to create a word matrix
  Cell values = log-likelihood of the word pair
Rows are first order co-occurrence vectors for a word
Represent a context by averaging the vectors of the words in that context
  Context includes the Cxt positions around the target, where Cxt is typically 5 or 20.
29
2nd Order Context Vectors
He won an Oscar, but Tom Hanks is still a nice guy.
[Table: rows are the first order co-occurrence vectors of "won", "Oscar", and "guy", plus their average (the O2 context vector); columns include baseball, football, actor, movie, war, family, and needle; cell values are log-likelihood scores.]
30
Limits of co-occurrence vectors
[Tables: first order co-occurrence vectors for words such as Kill, Murder, and Burn, with columns including Destroy, Fire, Shoot, Missile, Weapon, CD, Pipe, Bomb, Command, and Execute.]
31
Singular Value Decomposition
What it does (for sure):
  Smoothes out zeroes
  Finds Principal Components
What it might do:
  Capture Polysemy
  Word Space to Semantic Space
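The smoothing effect can be sketched with a rank-k reconstruction of an invented co-occurrence matrix: keep the top k singular values, and the zero cells of the reconstruction drift toward values implied by the principal components.

```python
# SVD smoothing sketch over an invented word-by-word co-occurrence matrix.
import numpy as np

M = np.array([
    [2.1, 0.0, 3.4, 0.0],     # rows: first order co-occurrence vectors
    [1.9, 0.0, 3.1, 0.0],
    [0.0, 4.2, 0.0, 2.8],
])

U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2                          # keep the top-k principal components
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

Here the first two rows are nearly parallel, so the rank-2 reconstruction is almost exact; with real data the discarded components carry noise, and the reconstruction fills zero cells with small non-zero values.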
32
After context representation…
Second order vector is an average of the word vectors that make up the context; captures indirect relationships
  Reduced by SVD to principal components
Now, cluster the vectors!
  We use the method of repeated bisections
  CLUTO
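Repeated bisections can be sketched as repeatedly splitting the largest cluster in two with 2-means until k clusters remain. The 2-D "context vectors" below are invented toys, and this plain implementation is only a stand-in for CLUTO's criterion-driven version.

```python
# Sketch of clustering by repeated bisections.
import numpy as np

def two_means(X):
    """Plain 2-means, seeded with the two mutually farthest points.
    Returns a boolean mask over the rows of X (assumes len(X) >= 2)."""
    d = np.linalg.norm(X[:, None] - X[None], axis=2)
    i, j = np.unravel_index(d.argmax(), d.shape)
    centers = X[[i, j]].astype(float)
    for _ in range(20):
        dist = np.linalg.norm(X[:, None] - centers[None], axis=2)
        assign = dist.argmin(axis=1)
        for c in (0, 1):
            if (assign == c).any():
                centers[c] = X[assign == c].mean(axis=0)
    return assign == 0

def repeated_bisections(X, k):
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)   # bisect the largest cluster
        idx = clusters.pop(0)
        mask = two_means(X[idx])
        clusters += [idx[mask], idx[~mask]]
    return clusters

# two well-separated groups of toy context vectors
X = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]])
clusters = repeated_bisections(X, 2)
```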
33
Experimental Data
Created from the AFE GigaWord corpus
  170,969,000 words
  May 1994 – May 1997
  December 2001 – June 2002
Created name conflated pseudo words
  25 words to the left and right of the target
34
Name Conflated Data

Name          Count     Name                 Count     New       Total     Maj.
Japan         118,712   France               112,357   JapAnce   231,069   51.4%
Jordan        25,539    Egyptian             21,762    JorGypt   46,431    53.9%
Shimon Peres  7,846     Slobodan Milosovic   6,176     MonSlo    13,734    56.0%
Microsoft     3,401     IBM                  2,406     MSIIBM    5,807     58.6%
Tajik         3,002     Rolf Ekeus           1,071     JikRol    4,073     73.7%
Ronaldo       1,652     David Beckham        740       RoBeck    2,452     69.3%
35

          #         Maj.   Cxt 5, Ft 5   Cxt 5, Ft 20   Cxt 20, Ft 5   Cxt 20, Ft 20
Robeck    2,452     69.3   57.3          72.7           85.9           54.7
JikRol    4,073     73.7   94.7          96.2           91.0           90.4
MSIIBM    5,807     58.6   47.7          51.3           68.0           60.0
MonSLo    13,734    56.0   62.8          96.6           54.6           91.4
JorGypt   46,431    53.9   56.6          59.1           57.0           53.0
JapAnce   231,069   51.4   51.1          51.1           50.3           50.3
36
Conclusions
Tradeoff between the size of the context and the feature selection space
  Context small – Feature large: narrow window around the target word where many possible features are represented
  Context large – Feature small: large window around the target word where a selective set of features is represented
SVD didn't help/hurt
  Results shown are without SVD
37
Ongoing work
Creating Path Finding Measures of Relatedness
Stopping Clustering Automatically
Cluster labeling
…Bring together finding conceptual similarity and contextual similarity
38
Thanks to…
WordNet::Similarity and SenseRelate
  http://wn-similarity.sourceforge.net
  http://senserelate.sourceforge.net
  Siddharth Patwardhan
  Satanjeev Banerjee
  Jason Michelizzi
SenseClusters
  http://senseclusters.sourceforge.net
  Anagha Kulkarni
  Amruta Purandare