Word Sense Induction Using Graphs of Collocations
WSI using Graphs of Collocations
Paper by: Ioannis P. Klapaftis and Suresh Manandhar
Presented by: Ahmad R. Shahid
Word Sense Induction (WSI)
• Identifying different senses (uses) of a word
• Finds applications in Information Retrieval (IR) and Machine Translation (MT)
• Most of the work in WSI is based on the vector-space model
– Each context of a target word is represented as a vector of features
– Context vectors are clustered and the resulting clusters are taken to represent the induced senses.
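The vector-space pipeline above can be sketched in a few lines of Python. This is a toy illustration, not the method of any particular system: contexts become bag-of-words vectors (the stopword list and the 0.2 cosine threshold are arbitrary choices for the example), and a greedy pass clusters them into induced senses.

```python
import math
from collections import Counter

STOP = {"the", "a", "and", "on", "at", "we", "was", "after", "along"}

def context_vector(context, target):
    """Bag-of-words vector of a context, minus the target and stopwords."""
    return Counter(w for w in context.lower().split()
                   if w != target and w not in STOP)

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def induce_senses(contexts, target, threshold=0.2):
    """Greedy clustering of context vectors: a context joins the most
    similar existing cluster above the threshold, otherwise it starts a
    new cluster; each final cluster stands for one induced sense."""
    clusters = []                        # [summed vector, member indices]
    for idx, ctx in enumerate(contexts):
        v = context_vector(ctx, target)
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(v, c[0])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([v.copy(), [idx]])
        else:
            best[0].update(v)
            best[1].append(idx)
    return [members for _, members in clusters]

contexts = [
    "the bank approved the loan and the mortgage",
    "the bank raised the interest rate on the loan",
    "we walked along the river bank at sunset",
    "the river bank was muddy after the flood",
]
senses = induce_senses(contexts, "bank")
```

With these four contexts the sketch separates the financial and river senses of bank into two clusters.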
Word Sense Induction (WSI)
Graph based methods
• Agirre et al. (2007) used co-occurrence graphs
– Vertices are words, and two vertices share an edge if they co-occur in the same context
• Each edge receives a weight indicating the strength of the relationship between the words (vertices)
– Co-occurrence graphs have highly dense subgraphs representing the different clusters (senses) the target word may have
• Each cluster has a “hub”
– Hubs are highly connected vertices
Graph based methods
• Each cluster (induced sense) consists of a set of words that are semantically related to the particular sense.
• Graph-based methods assume that each context word is related to one and only one sense of the target word
– Not always valid
Graph based methods
• Consider the contexts for the target word network:
– To install our satellite system please call our technicians and book an appointment. Connection to our television network is free of charge…
– To connect to the BT network, proceed with the installation of the connection software and then reboot your system…
• Two senses are used: 1) Television Network, 2) Computer Network
Graph based methods
• Any hard-clustering approach would assign system to only one of the two senses of network, even though it is related to both
– The same is true for connection
• The two words cannot be filtered out as noise, since they are semantically related to the target word
WSI using Graph Clustering
Small Lexical Worlds
Small Lexical Worlds
• Small worlds
– The characteristic path length (L)
• Mean length of the shortest path between two nodes of the graph. Let d_min(i, j) be the length of the shortest path between two nodes i and j, and let N be the total number of node pairs; then:

L = (1/N) Σ_{i,j} d_min(i, j)

– Clustering coefficient (C)
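The characteristic path length can be computed directly with breadth-first search. A minimal sketch, assuming an unweighted, undirected adjacency-set representation, averaging d_min(i, j) over all connected node pairs:

```python
from collections import deque
from itertools import combinations

def shortest_path_len(adj, s, t):
    """BFS shortest path length between nodes s and t (None if disconnected)."""
    seen, q = {s}, deque([(s, 0)])
    while q:
        node, d = q.popleft()
        if node == t:
            return d
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                q.append((nb, d + 1))
    return None

def characteristic_path_length(adj):
    """Mean of d_min(i, j) over all connected node pairs."""
    dists = [shortest_path_len(adj, i, j) for i, j in combinations(adj, 2)]
    dists = [d for d in dists if d is not None]
    return sum(dists) / len(dists)

# A 4-node path graph a-b-c-d: pairwise distances 1, 2, 3, 1, 2, 1
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
```

For the path graph above, L = (1 + 2 + 3 + 1 + 2 + 1) / 6 = 10/6.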
Clustering Coefficient
• For each node i, one can define a local clustering coefficient C_i equal to the proportion of connections that actually exist between the neighbors Γ(i) of that node:

C_i = E(Γ(i)) / C(|Γ(i)|, 2)

– E(Γ(i)) is the number of edges between the neighbors of i, and C(|Γ(i)|, 2) is the maximum possible number of such edges
• For a node with four neighbors the maximum number of connections is C(4, 2) = 6
– If five of these connections actually exist, C_i = 5/6 ≈ 0.83
• The global coefficient C is the mean of the local coefficients:

C = (1/N) Σ_{i=1}^{N} E(Γ(i)) / C(|Γ(i)|, 2)

– It is 0 for a totally disconnected graph and 1 for a complete graph.
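The local and global clustering coefficients translate directly into code. A small sketch over the same adjacency-set representation:

```python
from itertools import combinations

def local_clustering(adj, i):
    """C_i = edges among the neighbors of i, divided by C(|Γ(i)|, 2)."""
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return links / (k * (k - 1) / 2)

def global_clustering(adj):
    """Global coefficient C: mean of the local coefficients over all nodes."""
    return sum(local_clustering(adj, i) for i in adj) / len(adj)

# Complete graph on 3 nodes: every local coefficient is 1, so C = 1
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
```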
Small World Networks
• They lie somewhere between regular graphs and random graphs
• In the case of a random graph of N nodes whose mean degree is k:

L_rand ~ log(N) / log(k),   C_rand ~ k/N

• Small world graphs are characterized by:

L ~ L_rand   and   C >> C_rand
Small World Networks
• At a constant mean degree, the number of nodes can increase exponentially, whereas the characteristic path length will only increase in a linear way
– “Six degrees of separation”
• In a small world there will be bundles, or highly connected groups
– Friends of a given individual will be much more likely to be acquainted with each other than would be predicted if the edges of the graph were simply drawn at random
Small World Networks
Adam Smith
• Every individual...generally, indeed, neither intends to promote the public interest, nor knows how much he is promoting it. By preferring the support of domestic to that of foreign industry he intends only his own security; and by directing that industry in such a manner as its produce may be of the greatest value, he intends only his own gain, and he is in this, as in many other cases, led by an invisible hand to promote an end which was no part of his intention.
• The Wealth of Nations, Book IV Chapter II
• “Greed is good”: the basic theme of Capitalism
Co-occurrence Graphs
• Co-occurrence graphs are small world graphs
– The number of nodes can increase exponentially, whereas the characteristic path length will only increase in a linear way
• They are scale-free
– They contain a small number of highly connected hubs and a large number of weakly connected nodes
Co-occurrence Graphs
• Since they are small-world networks
– They contain highly dense subgraphs (hubs) which represent the different clusters (senses) the target word may have
High Density Components
• Different uses of a target word form highly interconnected “bundles” in a small world of co-occurrences (high density components)
– Barrage (in the sense of a hydraulic dam) must co-occur frequently with eau, ouvrage, riviere, crue, irrigation, production, electricite (water, work, river, flood, irrigation, production, electricity)
• Those words themselves are likely to be interconnected
High Density Components
• Detecting the different uses of a word amounts to isolating the high density components in the co-occurrence graph
– Most exact graph-partitioning techniques are NP-hard
• Given that the graphs have several thousand nodes and edges, only approximate heuristic-based methods can be employed
Detecting Root Hubs
• In every high-density component, one of the nodes has a higher degree than the others
– Called the component’s root hub
• For the most frequent use of barrage (hydraulic dam), the root hub is the word eau (water).
• The isolated root hub is deleted along with all of its neighbors
– A root hub must have at least 6 neighbors (a threshold determined empirically)
Minimum Spanning Tree (MST)
• After isolating the root hub along with all its neighbors, the next root hub is identified and the process is repeated
• An MST is computed by taking the target word as the root and making its first level consist of the previously identified root hubs
Minimum Spanning Tree (MST)
Veronis Algorithm
• Iteratively finds the candidate root hub
– The one with the highest degree
• The root hub is deleted along with its direct neighbors from the graph
– Only if it satisfies certain heuristics:
• Minimum number of vertices in a hub
• The average weight between the candidate root hub and its adjacent neighbors
• Minimum frequency of a root hub
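The iterative root-hub procedure can be sketched as follows. This is a simplification of the Véronis-style heuristics: only the minimum-neighbor threshold from the slides (6) is kept, the frequency and average-edge-weight checks are omitted, and the toy graph below is invented for illustration.

```python
def detect_root_hubs(adj, min_neighbors=6):
    """Iteratively pick the highest-degree node as a root hub, then delete
    it and all its neighbors; stop when no candidate has enough neighbors.
    (The frequency and average-weight heuristics are omitted here.)"""
    graph = {v: set(nbrs) for v, nbrs in adj.items()}
    hubs = []
    while graph:
        hub = max(graph, key=lambda v: len(graph[v]))
        if len(graph[hub]) < min_neighbors:
            break
        hubs.append(hub)
        removed = graph[hub] | {hub}
        graph = {v: nbrs - removed
                 for v, nbrs in graph.items() if v not in removed}
    return hubs

# Toy graph: two dense components, one around 'eau' and one around 'route'
edges = [("eau", w) for w in ["riviere", "crue", "irrigation", "production",
                              "electricite", "pluie", "lac"]]
edges += [("route", w) for w in ["voiture", "camion", "trafic",
                                 "asphalte", "pont", "ville"]]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
```

On this toy graph the procedure first isolates eau (degree 7), deletes its component, and then isolates route (degree 6).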
Collocational Graphs for WSI
• Let bc be the base corpus
– It consists of paragraphs containing the target word tw
• The aim is to induce the senses of tw given bc as the only input
• Let rc be a large reference corpus
– The British National Corpus (BNC) has been used for this study
Corpus pre-processing
• Initially, tw is removed from bc
• Each paragraph p_i of bc and rc is POS-tagged
• Only nouns are kept and lemmatized
– Since they are less ambiguous than verbs, adverbs or adjectives
• At this stage each paragraph p_i, both in bc and rc, is a list of lemmatized nouns
Corpus pre-processing
• Each paragraph p_i in bc contains nouns which are semantically related to tw, as well as common nouns which are noisy, in the sense that they are not semantically related to tw
• In order to filter out the noise, they used a technique based on corpora comparison using the log-likelihood ratio (G²).
Corpus pre-processing
• The aim is to check if the distribution of a word w_i, given that it appears in bc, is similar to the distribution of w_i, given that it appears in rc, i.e. p(w_i | bc) = p(w_i | rc)
– This is the null hypothesis
– If it is true, w_i will have a small G² value and should be removed from the paragraphs of bc.
Corpus pre-processing
• If the probability of occurrence of a word in the base corpus is the same as in the reference corpus, then it loses its discriminating power and must be weeded out
– In other words, if the observed frequency of a word is very close to its expected value, then it really hasn’t got much to say:

p(w_i | bc) = p(w_i | rc) = p(w)
Corpus pre-processing
• The expressions are:

G² = 2 Σ_{i,j} n_ij · log(n_ij / m_ij)

m_ij = (Σ_{k=1}^{2} n_ik · Σ_{k=1}^{2} n_kj) / N

– The n_ij correspond to values in the observed values (OT) table
– The m_ij correspond to values in the expected values (ET) table
• The values in ET are calculated from the values in OT using the equation for m_ij
Corpus pre-processing
• They created two noun frequency lists
– lbc, derived from the bc corpus
– lrc, derived from the rc corpus
• For each word w_i ∈ lbc, they created two contingency tables
– OT contains the observed counts taken from lbc and lrc
– ET contains the expected values under the model of independence
Corpus pre-processing
• Then they calculated G², where n_ij is the (i, j) cell of OT, m_ij is the (i, j) cell of ET, and N = Σ_{i,j} n_ij
• lbc is first filtered by removing words which have a relative frequency in lbc less than in lrc
– The resulting lbc is then sorted by the G² values
• The G²-sorted list is used to remove words from each paragraph of bc which have a G² value less than a pre-specified threshold (parameter p1).
• At the end of this stage, each paragraph p_i ∈ bc is a list of nouns which are assumed to be topically related to the target word tw
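The G² score can be sketched as follows, assuming a 2×2 contingency table per word: the observed counts n_ij are the word's frequency versus the remaining tokens in bc and rc, and the expected counts m_ij come from the marginals. The corpus sizes below are made up for illustration.

```python
import math

def g2(freq_bc, freq_rc, total_bc, total_rc):
    """Log-likelihood ratio G^2 for a word from a 2x2 contingency table.
    Rows are (bc, rc); columns are (word count, other tokens).
    Expected values are m_ij = row_total * col_total / N."""
    ot = [[freq_bc, total_bc - freq_bc],
          [freq_rc, total_rc - freq_rc]]
    n = total_bc + total_rc
    rows = [sum(r) for r in ot]
    cols = [ot[0][j] + ot[1][j] for j in range(2)]
    g = 0.0
    for i in range(2):
        for j in range(2):
            m = rows[i] * cols[j] / n
            if ot[i][j] > 0:
                g += ot[i][j] * math.log(ot[i][j] / m)
    return 2 * g
```

A word with the same relative frequency in both corpora (e.g. 1% in each) scores G² ≈ 0 and would be filtered out; a word that is much more frequent in bc than in rc scores high and is kept.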
Corpus pre-processing
Collocational Graph
• A key problem at this stage is the determination of related nouns
– They can be grouped into collocations
• Each collocation is assigned a weight
• In this study collocations of size 2 are considered (pairs of words)
– They consist of 2 nouns
Collocational Graph
• Collocations are detected by generating all the C(n, 2) combinations for each n-length paragraph
– Then measuring their frequency
• The frequency of a collocation is the number of paragraphs which contain that collocation
• Consider the following paragraphs:
– To install our satellite system please call our technicians and book an appointment. Connection to our television network is free of charge…
– To connect to the BT network, proceed with the installation of the connection software and then reboot your system…
– The C(n, 2) combinations for each n-length paragraph of our example would provide us with 24 unique collocations, such as {system, technician}, {system, connection}, etc.
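Collocation detection as described above is a few lines of Python. The noun lists below paraphrase the example paragraphs, with one word per noun kept for brevity; counts are per paragraph, so a pair seen twice in one paragraph still contributes 1.

```python
from itertools import combinations

def collocation_frequencies(paragraphs):
    """Frequency of each size-2 collocation: the number of paragraphs
    whose noun list contains both nouns of the pair."""
    freq = {}
    for nouns in paragraphs:
        # sorted(set(...)) gives each unordered pair a canonical key
        for pair in combinations(sorted(set(nouns)), 2):
            freq[pair] = freq.get(pair, 0) + 1
    return freq

# Noun lists of the two example paragraphs (target word removed)
p1 = ["system", "technician", "appointment", "connection", "television"]
p2 = ["connection", "installation", "software", "system"]
freq = collocation_frequencies([p1, p2])
```

Here {connection, system} gets frequency 2 because it appears in both paragraphs, while pairs like {system, technician} appear in only one.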
Collocational Graph
• Although the use of G² aims at keeping in bc those words which are related to the target one, this does not necessarily mean that their pairwise combinations are useful for discriminating the senses of tw
– For example, ambiguous collocations which are related to both senses of tw should not be taken into account, such as the {system, connection} collocation
• To circumvent this problem, each extracted collocation is assigned a weight, which measures the relative frequency of the two nouns co-occurring.
– Collocations are usually weighted using information-theoretic measures such as pointwise mutual information (PMI)
Collocational Graph
• Conditional probabilities produce better results than PMI, which overestimates rare events
– Hence they used conditional probabilities
• Let freq_ij denote the number of paragraphs in which nouns i and j co-occur, and freq_j denote the number of paragraphs in which noun j occurs
– Since G² allows us to capture the words which are related to tw, the calculations of collocation frequency take place on the whole SemEval-2007 WSI (SWSI) corpus (27132 paragraphs)
• This deals with data sparsity and helps determine whether a candidate collocation appears frequently enough to be included in the graphs
Collocational Graph
• We can measure the conditional probabilities p(i | j) and p(j | i) in a similar manner:

p(i | j) = freq_ij / freq_j

– The final weight w applied to collocation c_ij is the average of the two conditional probabilities:

w(c_ij) = (p(i | j) + p(j | i)) / 2

– They only extracted collocations which had a frequency (parameter p2) and a weight (parameter p3) higher than pre-specified thresholds
• The collocational graph can now be created, in which each extracted and weighted collocation is represented as a vertex, and two vertices share an edge if they co-occur in one or more paragraphs of bc
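Putting the frequency and weight thresholds together, a sketch of the extraction step (the counts and the p2/p3 values below are illustrative, not the paper's tuned parameters):

```python
def collocation_weight(freq_ij, freq_i, freq_j):
    """w(c_ij): average of p(i|j) = freq_ij / freq_j and
    p(j|i) = freq_ij / freq_i."""
    return (freq_ij / freq_j + freq_ij / freq_i) / 2

def extract_collocations(pair_freq, noun_freq, p2=2, p3=0.2):
    """Keep pairs whose paragraph frequency >= p2 and weight >= p3."""
    kept = {}
    for (i, j), f in pair_freq.items():
        w = collocation_weight(f, noun_freq[i], noun_freq[j])
        if f >= p2 and w >= p3:
            kept[(i, j)] = w
    return kept

# Invented paragraph counts for a handful of candidate pairs
pair_freq = {("network", "television"): 4, ("network", "software"): 1,
             ("connection", "system"): 3}
noun_freq = {"network": 10, "television": 5, "software": 9,
             "connection": 6, "system": 12}
kept = extract_collocations(pair_freq, noun_freq)
```

{network, software} is dropped by the frequency threshold p2, while the two surviving pairs get weights (0.8 + 0.4)/2 = 0.6 and (0.25 + 0.5)/2 = 0.375.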
Collocational Graph
• The next stage is to weight the edges of the initial collocational graph G, as well as to discover new edges connecting the vertices of G
– The constructed graph is sparse, since we are attempting to identify rare events, i.e. edges connecting collocations
• The solution to the problem of data sparsity is smoothing
Weighting and Populating
• For each vertex i (collocation c_i), they associated a vertex vector VC_i containing the vertices (collocations) which share an edge with i in graph G
• The table shows an example of two vertices, i.e. cnn_nbc and nbc_news, which are not connected in the graph G of the target word network
Weighting and Populating
• In the next step, the similarity between each vertex vector VC_i and each vertex vector VC_j is calculated
• Lee [4] showed that the Jaccard similarity coefficient (JC) has superior performance over other symmetric similarity measures such as cosine, L1 norm, Euclidean distance, Jensen-Shannon divergence, etc.
Weighting and Populating
• Using JC for estimating the similarity between vertex vectors yields:

JC(VC_i, VC_j) = |VC_i ∩ VC_j| / |VC_i ∪ VC_j|

– Two collocations c_i and c_j are said to be mutually similar if c_i is the most similar collocation to c_j and the other way around.
Weighting and Populating
• Two mutually similar collocations c_i and c_j are clustered, with the result that an occurrence of a collocation c_k with one of c_i, c_j is also counted as an occurrence with the other collocation
– In the table (slide 34), if cnn_nbc and nbc_news are mutually similar, then the zero-frequency event between nbc_news and cnn_tv is set equal to the joint frequency between cnn_nbc and cnn_tv
• Many collocations connected to one of the target collocations are not connected to the other, although they should be, since both of the target collocations are contextually related, i.e. both of them refer to the Television Network sense.
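The Jaccard-based smoothing step can be sketched as follows, with invented vertex vectors loosely modelled on the cnn_nbc / nbc_news example: if two collocations are each other's nearest neighbour under JC, their neighbour sets are pooled.

```python
def jaccard(a, b):
    """JC(VC_i, VC_j) = |VC_i ∩ VC_j| / |VC_i ∪ VC_j|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def mutually_similar(vectors, i, j):
    """True if i and j are each other's most similar collocation."""
    def nearest(x):
        return max((y for y in vectors if y != x),
                   key=lambda y: jaccard(vectors[x], vectors[y]))
    return nearest(i) == j and nearest(j) == i

# Vertex vectors: the graph neighbours of each collocation (toy data)
vectors = {
    "cnn_nbc":  {"cnn_tv", "news_network", "nbc_tv"},
    "nbc_news": {"news_network", "nbc_tv", "abc_news"},
    "bt_line":  {"phone_line"},
}

# Smoothing: mutually similar collocations pool their co-occurrences,
# so an edge observed with one is also counted for the other
if mutually_similar(vectors, "cnn_nbc", "nbc_news"):
    pooled = vectors["cnn_nbc"] | vectors["nbc_news"]
    vectors["cnn_nbc"] = vectors["nbc_news"] = set(pooled)
```

After pooling, nbc_news inherits the cnn_tv edge it previously lacked, exactly the kind of zero-frequency event the smoothing is meant to fill in.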
Weighting and Populating
• The weight applied to each edge connecting vertices i and j (collocations c_i and c_j) is the maximum of their conditional probabilities:

w(i, j) = max(p(i | j), p(j | i)),   where   p(i | j) = freq_{i,j} / freq_j

• freq_{i,j} denotes the number of paragraphs in which i and j co-occur, and freq_j denotes the number of paragraphs in which j occurs
Inducing Senses and Tagging
• The final graph G′, resulting from the previous stage, is clustered in order to produce the induced senses.
• The two criteria for choosing a clustering algorithm were:
– Its ability to automatically induce the number of clusters
– Its execution time
Inducing Senses and Tagging
• Markov Clustering Algorithm (MCL)
– Is fast
– Is based on stochastic flow in graphs
– The number of clusters produced depends on an inflation parameter
• Chinese Whispers (CW)
– Is a randomized graph-clustering method
– Time-linear in the number of edges
– Does not require any input parameters
– Not guaranteed to converge
– Automatically infers the number and size of clusters
Inducing Senses and Tagging
• Normalized MinCut
– A graph partitioning technique
– The graph is partitioned into two subgraphs by minimising the total association between the two subgraphs
– Iteratively applied to each extracted subgraph until a user-defined criterion is met (e.g. number of clusters)
Inducing Senses and Tagging
• CW initially assigns all vertices to different classes
• Each vertex i is processed for an x (parameter p4) number of iterations and inherits the strongest class in its local neighbourhood (LN) in an update step.
– LN is defined as the set of vertices which share a direct connection with vertex i
• During the update step for a vertex i, each class cl receives a score equal to the sum of the weights of the edges (i, j), where j has been assigned class cl
– The maximum score determines the strongest class
» In case of multiple strongest classes, one is chosen at random
• Classes are updated immediately, which means that a node can inherit classes from its LN that were introduced there in the same iteration
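A compact Chinese Whispers sketch, following the description above (one class per vertex, immediate updates, randomized processing order). One deviation: ties are broken deterministically here so the example is reproducible, whereas the algorithm as described breaks them at random.

```python
import random
from collections import defaultdict

def chinese_whispers(edges, iterations=20, seed=0):
    """Weighted Chinese Whispers sketch: every vertex starts in its own
    class; each vertex then adopts the class with the highest total edge
    weight in its local neighbourhood. Classes are updated immediately,
    so a label can spread within a single iteration."""
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    label = {v: v for v in adj}              # one class per vertex
    rng = random.Random(seed)
    nodes = list(adj)
    for _ in range(iterations):
        rng.shuffle(nodes)                   # randomized processing order
        for v in nodes:
            scores = defaultdict(float)
            for nb, w in adj[v].items():
                scores[label[nb]] += w       # class score = sum of weights
            if scores:
                # deterministic tie-break (by class name) for reproducibility
                label[v] = max(scores, key=lambda c: (scores[c], c))
    return label

# Two dense triangles joined by one weak edge should form two classes
edges = [("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
         ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
         ("c", "d", 0.1)]
labels = chinese_whispers(edges)
```

The weak c-d edge can never outscore the within-triangle weights, so the two triangles settle into two distinct classes, i.e. two induced senses.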
Evaluation
• The WSI approach was evaluated under the framework of SemEval-2007 WSI task (SWSI)
• The corpus consists of texts from the Wall Street Journal corpus, hand-tagged with OntoNotes senses
• They focused on all 35 nouns of SWSI, ignoring verbs
Evaluation
• They induced the senses of each target noun tn, and then tagged each instance of tn with one of its induced senses
• The SWSI organizers employ two evaluation schemes
– Unsupervised evaluation
• The results of systems are treated as clusters of target-noun contexts and the gold standard (GS) senses as classes
– Supervised evaluation
• The training corpus is used to map the induced clusters to GS senses. The testing corpus is then used to measure performance
Evaluation
• A perfect clustering solution is defined in terms of Homogeneity and Completeness
• Homogeneity
– Each induced cluster has exactly the same contexts as one of the classes
• Completeness
– Each class has exactly the same contexts as one of the clusters
Evaluation
• F-Score is used to assess the overall quality of clustering
– It measures both Homogeneity and Completeness
– Other measures, entropy and purity, only measure the first
Evaluation
• Purity
– Let q be the number of classes in the gold standard (GS), k be the number of clusters, n_r be the size of cluster r, n_r^i be the number of data points in class i that belong to cluster r, and n be the total number of data points; then:

Purity = (1/n) Σ_{r=1}^{k} max_i n_r^i

• Entropy

Entropy = - Σ_{r=1}^{k} (n_r / n) Σ_{i=1}^{q} (n_r^i / n_r) · log(n_r^i / n_r) / log(q)
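Both measures can be computed directly from per-cluster class counts. A sketch, using the notation above (n_r^i as counts, entropy normalized by log q; the toy counts below are invented):

```python
import math

def purity(clusters, n):
    """Purity = (1/n) * sum over clusters of the largest class count."""
    return sum(max(counts.values()) for counts in clusters) / n

def entropy(clusters, n, q):
    """Normalized entropy: weighted sum of per-cluster class entropies,
    divided by log q, so a perfectly pure clustering scores 0."""
    total = 0.0
    for counts in clusters:
        n_r = sum(counts.values())
        h = -sum((c / n_r) * math.log(c / n_r)
                 for c in counts.values() if c > 0)
        total += (n_r / n) * h / math.log(q)
    return total

# Each cluster is a map class -> number of its data points in the cluster
clusters = [{"tv": 8, "computer": 2}, {"computer": 9, "tv": 1}]
p = purity(clusters, 20)   # (8 + 9) / 20 = 0.85
```

Note how a one-cluster-per-instance solution trivially gets purity 1 and entropy 0, which is why the slides pair these measures with F-Score.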
Evaluation
• Their WSI methodology used Jaccard similarity to populate the graph, referred to as Col-JC
– Col-BL induces senses as Col-JC does, but without smoothing
• Baselines
– 1cl1inst: assigns each instance to a distinct cluster
– 1c1w: groups all instances of a target word into one cluster
• Equal to the most frequent sense (MFS) baseline in the supervised evaluation
– The sense which appears most often in an annotated text
Evaluation
• The evaluation results are given in the table below:
– UOY and UBS-AC have used labeled data for parameter estimation
– I2R, UPV_SI and UMND2 do not state how their parameters were estimated
Analysis
• Evaluation of WSI methods is a difficult task
– The 1cl1inst baseline achieves perfect purity and entropy, but scores low on F-Score
• Because the senses of the GS are spread among the induced clusters, causing a low unsupervised recall
• The supervised recall of 1cl1inst is undefined, because each cluster tags one and only one instance in the corpus
– Clusters tagging instances in the test corpus do not tag any instances in the train corpus, so the mapping cannot be performed
Analysis
• The 1c1w baseline achieves a high F-Score due to the dominance of the MFS in the testing corpus
– Its purity, entropy and supervised recall are much lower than those of other systems
Analysis
• A clustering solution which achieves high supervised recall does not necessarily achieve a high F-Score
– Because F-Score penalizes systems for getting the number of GS classes wrong, as the 1cl1inst baseline does
Analysis
• No system was able to achieve high performance in both settings except their technique
• Col-BL (Col-JC) achieved 72.9% (78%) F-Score
Analysis
• The target of smoothing was to reduce the number of clusters and obtain a better mapping of clusters to GS senses, but without affecting the clustering quality
Bibliography
1) Ioannis Klapaftis and Suresh Manandhar, ‘Word Sense Induction Using Graphs of Collocations’, in Proceedings of the 18th European Conference on Artificial Intelligence (ECAI-2008).
2) Agirre, E., Soroa, A.: Ubc-as: A graph based unsupervised system for induction and classification. In: Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pp. 346–349 (2007)
Bibliography
3) J. Véronis. 2004. HyperLex: lexical cartography for information retrieval. Computer Speech & Language, 18(3):223–252.
4) Lee, L. 1999. Measures of distributional similarity. In Proc. ACL ’99, 25–32.