Measuring Similarity Between Contexts and Concepts

1

Measuring Similarity Between Concepts and Contexts

Ted Pedersen
Department of Computer Science
University of Minnesota, Duluth
http://www.d.umn.edu/~tpederse

2

The problems…

Recognize similar (or related) concepts
  • frog : amphibian
  • Duluth : snow

Recognize similar contexts
  • I bought some food at the store : I purchased something to eat at the market

3

Similarity and Relatedness

Two concepts are similar if they are connected by is-a relationships.
  • A frog is-a-kind-of amphibian
  • An illness is-a health_condition

Two concepts can be related many ways…
  • A human has-a-part liver
  • Duluth receives-a-lot-of snow

…similarity is one way to be related

4

The approaches…

Measure conceptual similarity using a structured repository of knowledge
  • Lexical database WordNet

Measure contextual similarity using knowledge lean methods that are based on co-occurrence information from large corpora

5

Why measure conceptual similarity?

A word will take the sense that is most related to the surrounding context
  • I love Java, especially the beaches and the weather.
  • I love Java, especially the support for concurrent programming.
  • I love java, especially first thing in the morning with a bagel.

6

Word Sense Disambiguation

…can be performed by finding the sense of a word most related to its neighbors

Here, we define similarity and relatedness with respect to WordNet
  • WordNet::Similarity
  • http://wn-similarity.sourceforge.net

WordNet::SenseRelate
  • AllWords – assign a sense to every content word
  • TargetWord – assign a sense to a given word
  • http://senserelate.sourceforge.net

7

SenseRelate

For each sense of a target word in context
  • For each content word in the context
      • For each sense of that content word
          • Measure similarity/relatedness between sense of target word and sense of content word with WordNet::Similarity
          • Keep running sum for score of each sense of target

Pick sense of target word with highest score with words in context
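
The loop on this slide can be sketched in a few lines of Python. This is only an illustration of the running-sum idea, not the WordNet::SenseRelate code: the sense labels, the `toy_relatedness` function, and its scores are invented stand-ins for calls to WordNet::Similarity.

```python
from typing import Callable, Dict, List


def pick_sense(
    target_senses: List[str],
    context_senses: Dict[str, List[str]],
    relatedness: Callable[[str, str], float],
) -> str:
    """Return the target sense with the highest summed relatedness to the context."""
    scores = {}
    for t_sense in target_senses:                 # each sense of the target word
        total = 0.0
        for senses in context_senses.values():    # each content word in the context
            for c_sense in senses:                # each sense of that content word
                total += relatedness(t_sense, c_sense)
        scores[t_sense] = total                   # running sum for this target sense
    return max(scores, key=scores.get)


# Invented scores standing in for a real relatedness measure (assumption).
TOY_SCORES = {
    ("java#island", "beach#shore"): 0.8,
    ("java#island", "weather#climate"): 0.5,
    ("java#coffee", "beach#shore"): 0.1,
    ("java#language", "weather#climate"): 0.05,
}


def toy_relatedness(a: str, b: str) -> float:
    return TOY_SCORES.get((a, b), 0.0)


best = pick_sense(
    ["java#island", "java#coffee", "java#language"],
    {"beaches": ["beach#shore"], "weather": ["weather#climate"]},
    toy_relatedness,
)
print(best)
```

With these made-up scores the island sense of "Java" collects the largest sum, which mirrors the first example sentence on slide 5.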

8

WordNet::Similarity

Path based measures
  • Shortest path (path)
  • Wu & Palmer (wup)
  • Leacock & Chodorow (lch)
  • Hirst & St-Onge (hso)

Information content measures
  • Resnik (res)
  • Jiang & Conrath (jcn)
  • Lin (lin)

Gloss based measures
  • Banerjee and Pedersen (lesk)
  • Patwardhan and Pedersen (vector, vector_pairs)

9

Why don’t path finding and info. content solve the problem?

Concepts must be organized in a hierarchy, and connected in that hierarchy
  • Limited to comparing nouns with nouns, or maybe verbs with verbs
  • Limited to similarity measures (is-a)

What about mixed parts of speech?
  • Murder (noun) and horrible (adjective)
  • Tobacco (noun) and drinking (verb)

10

Using Dictionary Glosses to Measure Relatedness

Lesk (1985) Algorithm – measure relatedness of two concepts by counting the number of shared words in their definitions
  • Cold - a mild viral infection involving the nose and respiratory passages (but not the lungs)
  • Flu - an acute febrile highly contagious viral disease

Adapted Lesk (Banerjee & Pedersen, 2003) – expand glosses to include those concepts directly related
  • Cold - a common cold affecting the nasal passages and resulting in congestion and sneezing and headache; mild viral infection involving the nose and respiratory passages (but not the lungs); a disease affecting the respiratory system
  • Flu - an acute and highly contagious respiratory disease of swine caused by the orthomyxovirus thought to be the same virus that caused the 1918 influenza pandemic; an acute febrile highly contagious viral disease; a disease that can be communicated from one person to another
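
As a rough illustration of the gloss overlap idea, the sketch below counts the distinct content words shared by the Cold and Flu glosses quoted on this slide. It is a simplification: the stop list and tokenizer are assumptions made for the example, and the lesk measure in WordNet::Similarity scores overlaps differently from this plain set intersection.

```python
import re

# Small illustrative stop list (assumption, not the one used by any real package).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "by", "to", "be",
              "that", "can", "from", "but", "not"}


def content_words(gloss: str) -> set:
    """Lowercase the gloss, split on non-alphanumerics, drop stop words."""
    return {w for w in re.findall(r"[a-z0-9]+", gloss.lower()) if w not in STOP_WORDS}


def shared_words(gloss_a: str, gloss_b: str) -> set:
    """Distinct content words that appear in both glosses."""
    return content_words(gloss_a) & content_words(gloss_b)


cold = "a mild viral infection involving the nose and respiratory passages (but not the lungs)"
flu = "an acute febrile highly contagious viral disease"

cold_expanded = (
    "a common cold affecting the nasal passages and resulting in congestion and "
    "sneezing and headache; mild viral infection involving the nose and respiratory "
    "passages (but not the lungs); a disease affecting the respiratory system"
)
flu_expanded = (
    "an acute and highly contagious respiratory disease of swine caused by the "
    "orthomyxovirus thought to be the same virus that caused the 1918 influenza "
    "pandemic; an acute febrile highly contagious viral disease; a disease that can "
    "be communicated from one person to another"
)

basic = shared_words(cold, flu)
adapted = shared_words(cold_expanded, flu_expanded)
print(sorted(basic), sorted(adapted))
```

The original glosses share only "viral", while the expanded glosses also share "respiratory" and "disease", which is the gain the adapted version is after.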

11

Context/Gloss Vectors

Leskian approaches require exact matches in glosses
  • Glosses are short, use related but not identical words

Solution? Expand glosses by replacing each content word with a co-occurrence vector derived from corpora
  • Rows are words in glosses, columns are the co-occurring words in a corpus, cell values are their log-likelihood ratios

Average the word vectors to create a single vector that represents the gloss/sense (Patwardhan & Pedersen, 2003)
  • 2nd order co-occurrences

Measure relatedness using cosine rather than exact match!
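
A minimal sketch of the averaging and cosine steps follows. The three columns and all cell values are invented for illustration (real cells would be log-likelihood ratios from a corpus), and the function names are hypothetical rather than part of WordNet::Similarity.

```python
import math

# Hypothetical co-occurrence columns and word vectors; values are made up.
COLUMNS = ("virus", "fever", "engine")
WORD_VECTORS = {
    "viral":     [9.0, 4.0, 0.0],
    "infection": [7.0, 5.0, 0.0],
    "febrile":   [3.0, 8.0, 0.0],
    "disease":   [6.0, 6.0, 0.0],
    "motor":     [0.0, 0.0, 9.0],
    "vehicle":   [0.0, 0.5, 7.0],
}


def gloss_vector(gloss_words):
    """Average the vectors of the gloss words that have a row in the matrix."""
    rows = [WORD_VECTORS[w] for w in gloss_words if w in WORD_VECTORS]
    return [sum(col) / len(rows) for col in zip(*rows)]


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


cold = gloss_vector(["viral", "infection"])
flu = gloss_vector(["febrile", "disease"])
car = gloss_vector(["motor", "vehicle"])

print(round(cosine(cold, flu), 3), round(cosine(cold, car), 3))
```

The two toy glosses for cold and flu have no word in common, so an exact-match count would be zero, yet their averaged vectors point in nearly the same direction.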

12

Gloss/Context Vectors

13

2nd order co-occurrences

Two words that occur together (within some number of positions of each other) are first order co-occurrences

Words A and B that co-occur with C separately but not with each other are second order co-occurrences (i.e., a friend of a friend)
  • “military intelligence” and “military armor” are first order co-occurrences
  • “Intelligence” and “armor” are 2nd order co-occurrences (via military)
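
The "friend of a friend" definition can be made concrete with a small sketch. It takes a list of first order pairs (here just the two from the slide) and reports word pairs that share a neighbor without co-occurring directly. The function names are hypothetical, and positional windows are ignored for brevity.

```python
from collections import defaultdict
from itertools import combinations


def first_order(pairs):
    """Map each word to the set of words it directly co-occurs with."""
    neighbors = defaultdict(set)
    for a, b in pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def second_order(pairs):
    """Word pairs that share at least one neighbor but never co-occur directly."""
    neighbors = first_order(pairs)
    result = {}
    for a, b in combinations(sorted(neighbors), 2):
        shared = neighbors[a] & neighbors[b]
        if shared and b not in neighbors[a]:
            result[(a, b)] = shared      # the words that link a and b
    return result


pairs = [("military", "intelligence"), ("military", "armor")]
links = second_order(pairs)
print(links)
```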

14

WSD Experiment

Senseval-2 data consists of 73 nouns, verbs, and adjectives, approximately 8,600 “training” examples and 4,300 “test” examples.
  • Best supervised system 64%
  • SenseRelate 53% (lesk, vector)
  • Most frequent sense 48%

Window of context is defined by position, includes 2 content words to both the left and right which are measured against the word being disambiguated.
  • Positional proximity is not always associated with semantic similarity.

15

Human Relatedness Experiment

Miller and Charles (1991) created 30 pairs of nouns that were scored on a relatedness scale by over 50 human subjects

Vector measure correlates at over 80% with human relatedness judgements
  • Next closest measure is lesk (at 70%)
  • All other measures at less than 65%

16

Why gloss based measures don’t solve the problem…

WordNet
  • Nouns – 80,000 concepts
  • Verbs – 13,000 concepts
  • Adjectives – 18,000 concepts
  • Adverbs – 4,000 concepts

Words not found in WordNet can’t be disambiguated by SenseRelate

17

Knowledge Lean Methods

Can measure similarity between two words by comparing co-occurrence vectors created for each.

Can measure similarity of two contexts by representing them as 2nd order co-occurrence vectors and comparing.

18

Word Sense Discrimination

Cluster different senses of words like line or interest based on contextual similarity.
  • Pedersen & Bruce, 1997
  • Schutze, 1998
  • Purandare & Pedersen, 2004

Hard to evaluate, senses of words are somewhat ill defined, distinctions made by clustering methods may or may not correspond with human intuitions

http://senseclusters.sourceforge.net

19

Name Discrimination

Names that occur in similar contexts may refer to the same person.
  • George Miller is an eminent psychologist.
  • George Miller is one of the founders of modern cognitive science.
  • George Miller is a member of the US House of Representatives.

24

Objective

Given some number of contexts containing “John Smith”, identify those that are similar to each other

Group similar contexts together, assume they are associated with single individual

Generate an identifying label from the content of the different clusters

25

Similarity of Context?

Context 1: He drives his car fast
Context 2: Jim speeds in his auto

Car -> motor, garage, gasoline, insurance
Auto -> motor, insurance, gasoline, accident

Car and Auto occur with many of the same words. They share many first order co-occurrences.
  • They are therefore similar!

Less direct relationship, more resistant to sparsity!

26

Representing a Context

Represent a context as the average of all the first order vectors of the words in the context

Very similar to vector measure – the only difference is that the context here comes from raw corpora, whereas in the vector measure the context is a dictionary definition

27

Feature Selection

Bigrams – two word sequences that may have one intervening word between them
  • Frequency > 1
  • Log-likelihood ratio > 3.841
  • OR stop list

Must occur within Ft positions of target, Ft typically set to 5 or 20
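
The selection criteria above can be sketched as follows. The log-likelihood ratio is computed from a 2x2 contingency table of bigram counts, and 3.841 is the chi-squared critical value for one degree of freedom at the 0.05 level. The function names and example counts are hypothetical, and reading "OR stop list" as "drop the bigram if either word is a stop word" is an assumption.

```python
import math

CRITICAL_VALUE = 3.841  # threshold quoted on the slide


def log_likelihood_ratio(n11, n1p, np1, npp):
    """G^2 for a bigram.

    n11: count of (w1, w2) together, n1p: count of w1 in first position,
    np1: count of w2 in second position, npp: total number of bigrams.
    """
    observed = [n11, n1p - n11, np1 - n11, npp - n1p - np1 + n11]
    expected = [
        n1p * np1 / npp,
        n1p * (npp - np1) / npp,
        (npp - n1p) * np1 / npp,
        (npp - n1p) * (npp - np1) / npp,
    ]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def keep_bigram(w1, w2, n11, n1p, np1, npp, stop_words=frozenset()):
    """Apply the three filters: stop list, frequency > 1, G^2 > 3.841."""
    if w1 in stop_words or w2 in stop_words:   # assumed meaning of "OR" mode
        return False
    return n11 > 1 and log_likelihood_ratio(n11, n1p, np1, npp) > CRITICAL_VALUE


# Invented counts: the pair occurs 50 times in a sample of 1000 bigrams.
print(keep_bigram("tom", "hanks", 50, 60, 70, 1000))
```

When the two words are independent (for example counts 10, 100, 100, 1000) the observed and expected cells coincide and the score is zero, so such a bigram would be discarded.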

28

Second Order Context Representation

Bigrams used to create a word matrix
  • Cell values = log-likelihood of word pair

Rows are first order co-occurrence vector for a word

Represent context by averaging vectors of words in that context
  • Context includes the Cxt positions around the target, where Cxt is typically 5 or 20.

29

2nd Order Context Vectors

He won an Oscar, but Tom Hanks is still a nice guy.

           needle    family    war       movie     actor     football  baseball
won        0         0         8.7399    51.7812   30.520    3324.98   18.5533
Oscar      0         0         0         136.0441  29.576    0         0
guy        0         18818.55  0         0         0         205.5469  134.5102
O2context  0         6272.85   2.9133    62.6084   20.032    1176.84   51.021
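
As a worked check of this slide, the sketch below averages the rows for "won", "Oscar", and "guy" column by column. The assignment of values to columns is my reading of the flattened table, so treat it as an assumption; under that reading the averages match the O2context row to the precision shown.

```python
COLUMNS = ["needle", "family", "war", "movie", "actor", "football", "baseball"]

# First order vectors of the context words that have a row in the word matrix
# (values as read from the slide; column alignment is an assumption).
ROWS = {
    "won":   [0, 0, 8.7399, 51.7812, 30.520, 3324.98, 18.5533],
    "Oscar": [0, 0, 0, 136.0441, 29.576, 0, 0],
    "guy":   [0, 18818.55, 0, 0, 0, 205.5469, 134.5102],
}


def average_rows(rows):
    """Column-wise mean of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


o2context = average_rows(list(ROWS.values()))
for name, value in zip(COLUMNS, o2context):
    print(f"{name:>8}: {value:.4f}")
```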

30

Limits of co-occurrence vectors

[Figure: two co-occurrence matrices; the cell values did not survive extraction. Labels in the first: Weapon, Missile, Shoot, Fire, Destroy, Murder, Kill. Labels in the second: Execute, Command, Bomb, Pipe, Fire, CD, Burn.]

31

Singular Value Decomposition

What it does (for sure):
  • Smoothes out zeroes
  • Finds Principal Components

What it might do:
  • Capture Polysemy
  • Word Space to Semantic Space

32

After context representation…

Second order vector is an average of word vectors that make up context, captures indirect relationships
  • Reduced by SVD to principal components

Now, cluster the vectors!
  • We use the method of repeated bisections
  • CLUTO

33

Experimental Data

Created from AFE GigaWord corpus
  • 170,969,000 words
  • May 1994-May 1997
  • December 2001-June 2002

Created name conflated pseudo words

25 words to left and right of target

34

Name Conflated Data

Name           Count     Name                 Count     New       Total     Maj.
Ronaldo        1,652     David Beckham        740       RoBeck    2,452     69.3%
Tajik          3,002     Rolf Ekeus           1,071     JikRol    4,073     73.7%
Microsoft      3,401     IBM                  2,406     MSIIBM    5,807     58.6%
Shimon Peres   7,846     Slobodan Milosovic   6,176     MonSlo    13,734    56.0%
Jordan         25,539    Egyptian             21,762    JorGypt   46,431    53.9%
Japan          118,712   France               112,357   JapAnce   231,069   51.4%

35

                            Cxt 5           Cxt 20
          #         Maj.    Ft 5    Ft 20   Ft 5    Ft 20
RoBeck    2,452     69.3    57.3    72.7    85.9    54.7
JikRol    4,073     73.7    94.7    96.2    91.0    90.4
MSIIBM    5,807     58.6    47.7    51.3    68.0    60.0
MonSlo    13,734    56.0    62.8    96.6    54.6    91.4
JorGypt   46,431    53.9    56.6    59.1    57.0    53.0
JapAnce   231,069   51.4    51.1    51.1    50.3    50.3

36

Conclusions

Tradeoff between size of context and feature selection space
  • Context small – Feature large : narrow window around target word where many possible features represented
  • Context large – Feature small : large window around target word where a selective set of features represented

SVD didn’t help/hurt
  • Results shown are without SVD

37

Ongoing work

Creating Path Finding Measures of Relatedness

Stopping Clustering Automatically

Cluster labeling

…Bring together finding conceptual similarity and contextual similarity

38

Thanks to…

WordNet::Similarity and SenseRelate
  • http://wn-similarity.sourceforge.net
  • http://senserelate.sourceforge.net
  • Siddharth Patwardhan
  • Satanjeev Banerjee
  • Jason Michelizzi

SenseClusters
  • http://senseclusters.sourceforge.net
  • Anagha Kulkarni
  • Amruta Purandare
