Word Sense Induction Using Graphs of Collocations
WSI using Graphs of Collocations
Paper by: Ioannis P. Klapaftis and Suresh Manandhar
Presented by: Ahmad R. Shahid
Word Sense Induction (WSI)
• Identifying different senses (uses) of a word
• Finds applications in Information Retrieval (IR) and Machine Translation (MT)
• Most of the work in WSI is based on the vector-space model
– Each context of a target word is represented as a vector of features
– Context vectors are clustered and the resulting clusters are taken to represent the induced senses.
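The vector-space pipeline above can be sketched in a few lines of Python. This is a toy illustration, not the method of any particular system: contexts become bag-of-words vectors (the stopword list and the 0.2 cosine threshold are arbitrary choices for the example), and a greedy pass clusters them into induced senses.

```python
import math
from collections import Counter

STOP = {"the", "a", "and", "on", "at", "we", "was", "after", "along"}

def context_vector(context, target):
    """Bag-of-words vector of a context, minus the target and stopwords."""
    return Counter(w for w in context.lower().split()
                   if w != target and w not in STOP)

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def induce_senses(contexts, target, threshold=0.2):
    """Greedy clustering of context vectors: a context joins the most
    similar existing cluster above the threshold, otherwise it starts a
    new cluster; each final cluster stands for one induced sense."""
    clusters = []                        # [summed vector, member indices]
    for idx, ctx in enumerate(contexts):
        v = context_vector(ctx, target)
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(v, c[0])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([v.copy(), [idx]])
        else:
            best[0].update(v)
            best[1].append(idx)
    return [members for _, members in clusters]

contexts = [
    "the bank approved the loan and the mortgage",
    "the bank raised the interest rate on the loan",
    "we walked along the river bank at sunset",
    "the river bank was muddy after the flood",
]
senses = induce_senses(contexts, "bank")
```

With these four contexts the sketch separates the financial and river senses of bank into two clusters.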
Word Sense Induction (WSI)
Graph based methods
• Agirre et al. (2007) used co-occurrence graphs
– Vertices are words, and two vertices share an edge if they co-occur in the same context
• Each edge receives a weight indicating the strength of the relationship between the words (vertices)
– Co-occurrence graphs have highly dense subgraphs representing the different clusters (senses) the target word may have
• Each cluster has a “hub”
– Hubs are highly connected vertices
Graph based methods
• Each cluster (induced sense) consists of a set of words that are semantically related to the particular sense.
• Graph-based methods assume that each context word is related to one and only one sense of the target word
– Not always valid
Graph based methods
• Consider the contexts for the target word network:
– To install our satellite system please call our technicians and book an appointment. Connection to our television network is free of charge…
– To connect to the BT network, proceed with the installation of the connection software and then reboot your system…
• Two senses are used: 1) Television Network, 2) Computer Network
Graph based methods
• Any hard-clustering approach would assign system to only one of the two senses of network, even though it is related to both
– The same is true for connection
• The two words cannot be filtered out as noise, since they are semantically related to the target word
WSI using Graph Clustering
Small Lexical Worlds
Small Lexical Worlds
• Small worlds
– The characteristic path length (L)
• Mean length of the shortest path between two nodes of the graph. Let d_min(i, j) be the length of the shortest path between two nodes i and j, and let N be the total number of node pairs; then:

L = (1/N) Σ_{i,j} d_min(i, j)

– Clustering coefficient (C)
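The characteristic path length can be computed directly with breadth-first search. A minimal sketch, assuming an unweighted, undirected adjacency-set representation, averaging d_min(i, j) over all connected node pairs:

```python
from collections import deque
from itertools import combinations

def shortest_path_len(adj, s, t):
    """BFS shortest path length between nodes s and t (None if disconnected)."""
    seen, q = {s}, deque([(s, 0)])
    while q:
        node, d = q.popleft()
        if node == t:
            return d
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                q.append((nb, d + 1))
    return None

def characteristic_path_length(adj):
    """Mean of d_min(i, j) over all connected node pairs."""
    dists = [shortest_path_len(adj, i, j) for i, j in combinations(adj, 2)]
    dists = [d for d in dists if d is not None]
    return sum(dists) / len(dists)

# A 4-node path graph a-b-c-d: pairwise distances 1, 2, 3, 1, 2, 1
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
```

For the path graph above, L = (1 + 2 + 3 + 1 + 2 + 1) / 6 = 10/6.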
Clustering Coefficient
• For each node i, one can define a local clustering coefficient C_i equal to the proportion of connections that actually exist between the neighbors Γ(i) of that node:

C_i = E(Γ(i)) / C(|Γ(i)|, 2)

– E(Γ(i)) is the number of edges between the neighbors of i, and C(|Γ(i)|, 2) is the maximum possible number of such edges
• For a node with four neighbors the maximum number of connections is C(4, 2) = 6
– If five of these connections actually exist, C_i = 5/6 ≈ 0.83
• The global coefficient C is the mean of the local coefficients:

C = (1/N) Σ_{i=1}^{N} E(Γ(i)) / C(|Γ(i)|, 2)

– It is 0 for a totally disconnected graph and 1 for a complete graph.
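The local and global clustering coefficients translate directly into code. A small sketch over the same adjacency-set representation:

```python
from itertools import combinations

def local_clustering(adj, i):
    """C_i = edges among the neighbors of i, divided by C(|Γ(i)|, 2)."""
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return links / (k * (k - 1) / 2)

def global_clustering(adj):
    """Global coefficient C: mean of the local coefficients over all nodes."""
    return sum(local_clustering(adj, i) for i in adj) / len(adj)

# Complete graph on 3 nodes: every local coefficient is 1, so C = 1
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
```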
Small World Networks
• They lie somewhere between regular graphs and random graphs
• In the case of a random graph of N nodes whose mean degree is k:

L_rand ~ log(N) / log(k),   C_rand ~ k/N

• Small world graphs are characterized by:

L ~ L_rand   and   C >> C_rand
Small World Networks
• At a constant mean degree, the number of nodes can increase exponentially, whereas the characteristic path length will only increase in a linear way
– “Six degrees of separation”
• In a small world there will be bundles, or highly connected groups
– Friends of a given individual will be much more likely to be acquainted with each other than would be predicted if the edges of the graph were simply drawn at random
Small World Networks
Adam Smith
• Every individual...generally, indeed, neither intends to promote the public interest, nor knows how much he is promoting it. By preferring the support of domestic to that of foreign industry he intends only his own security; and by directing that industry in such a manner as its produce may be of the greatest value, he intends only his own gain, and he is in this, as in many other cases, led by an invisible hand to promote an end which was no part of his intention.
• The Wealth of Nations, Book IV Chapter II
• “Greed is good”: the basic theme of Capitalism
Co-occurrence Graphs
• Co-occurrence graphs are small world graphs
– The number of nodes can increase exponentially, whereas the characteristic path length will only increase in a linear way
• They are scale-free
– They contain a small number of highly connected hubs and a large number of weakly connected nodes
Co-occurrence Graphs
• Since they are small-world networks
– They contain highly dense subgraphs (hubs) which represent the different clusters (senses) the target word may have
High Density Components
• Different uses of a target word form highly interconnected “bundles” in a small world of co-occurrences (high density components)
– Barrage (in the sense of a hydraulic dam) must co-occur frequently with eau, ouvrage, riviere, crue, irrigation, production, electricite (water, work, river, flood, irrigation, production, electricity)
• Those words themselves are likely to be interconnected
High Density Components
• Detecting the different uses of a word amounts to isolating the high density components in the co-occurrence graph
– Most exact graph-partitioning techniques are NP-hard
• Given that the graphs have several thousand nodes and edges, only approximate heuristic-based methods can be employed
Detecting Root Hubs
• In every high-density component, one of the nodes has a higher degree than the others
– Called the component’s root hub
• For the most frequent use of barrage (hydraulic dam), the root hub is the word eau (water).
• The isolated root hub is deleted along with all of its neighbors
– A root hub must have at least 6 neighbors (a threshold determined empirically)
Minimum Spanning Tree (MST)
• After isolating the root hub along with all its neighbors, the next root hub is identified and the process is repeated
• An MST is computed by taking the target word as the root and making its first level consist of the previously identified root hubs
Minimum Spanning Tree (MST)
Veronis Algorithm
• Iteratively finds the candidate root hub
– The one with the highest degree
• The root hub is deleted along with its direct neighbors from the graph
– Only if it satisfies certain heuristics:
• Minimum number of vertices in a hub
• The average weight between the candidate root hub and its adjacent neighbors
• Minimum frequency of a root hub
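The iterative root-hub procedure can be sketched as follows. This is a simplification of the Véronis-style heuristics: only the minimum-neighbor threshold from the slides (6) is kept, the frequency and average-edge-weight checks are omitted, and the toy graph below is invented for illustration.

```python
def detect_root_hubs(adj, min_neighbors=6):
    """Iteratively pick the highest-degree node as a root hub, then delete
    it and all its neighbors; stop when no candidate has enough neighbors.
    (The frequency and average-weight heuristics are omitted here.)"""
    graph = {v: set(nbrs) for v, nbrs in adj.items()}
    hubs = []
    while graph:
        hub = max(graph, key=lambda v: len(graph[v]))
        if len(graph[hub]) < min_neighbors:
            break
        hubs.append(hub)
        removed = graph[hub] | {hub}
        graph = {v: nbrs - removed
                 for v, nbrs in graph.items() if v not in removed}
    return hubs

# Toy graph: two dense components, one around 'eau' and one around 'route'
edges = [("eau", w) for w in ["riviere", "crue", "irrigation", "production",
                              "electricite", "pluie", "lac"]]
edges += [("route", w) for w in ["voiture", "camion", "trafic",
                                 "asphalte", "pont", "ville"]]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
```

On this toy graph the procedure first isolates eau (degree 7), deletes its component, and then isolates route (degree 6).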
Collocational Graphs for WSI
• Let bc be the base corpus
– It consists of paragraphs containing the target word tw
• The aim is to induce the senses of tw given bc as the only input
• Let rc be a large reference corpus
– The British National Corpus (BNC) has been used for this study
Corpus pre-processing
• Initially, tw is removed from bc
• Each paragraph p_i of bc and rc is POS-tagged
• Only nouns are kept and lemmatized
– Since they are less ambiguous than verbs, adverbs or adjectives
• At this stage each paragraph p_i, both in bc and rc, is a list of lemmatized nouns
Corpus pre-processing
• Each paragraph p_i in bc contains nouns which are semantically related to tw, as well as common nouns which are noisy, in the sense that they are not semantically related to tw
• In order to filter out the noise, they used a technique based on corpora comparison using the log-likelihood ratio (G²).
Corpus pre-processing
• The aim is to check if the distribution of a word w_i, given that it appears in bc, is similar to the distribution of w_i, given that it appears in rc, i.e. p(w_i | bc) = p(w_i | rc)
– This is the null hypothesis
– If it is true, w_i will have a small G² value and should be removed from the paragraphs of bc.
Corpus pre-processing
• If the probability of occurrence of a word in the base corpus is the same as in the reference corpus, then it loses its discriminating power and must be weeded out
– In other words, if the observed frequency of a word is very close to its expected value, then it really hasn’t got much to say:

p(w_i | bc) = p(w_i | rc) = p(w)
Corpus pre-processing
• The expressions are:

G² = 2 Σ_{i,j} n_ij · log(n_ij / m_ij)

m_ij = (Σ_{k=1}^{2} n_ik · Σ_{k=1}^{2} n_kj) / N

– The n_ij correspond to values in the observed values (OT) table
– The m_ij correspond to values in the expected values (ET) table
• The values in ET are calculated from the values in OT using the equation for m_ij
Corpus pre-processing
• They created two noun frequency lists
– lbc, derived from the bc corpus
– lrc, derived from the rc corpus
• For each word w_i ∈ lbc, they created two contingency tables
– OT contains the observed counts taken from lbc and lrc
– ET contains the expected values under the model of independence
Corpus pre-processing
• Then they calculated G², where n_ij is the (i, j) cell of OT, m_ij is the (i, j) cell of ET, and N = Σ_{i,j} n_ij
• lbc is first filtered by removing words which have a relative frequency in lbc less than in lrc
– The resulting lbc is then sorted by the G² values
• The G²-sorted list is used to remove words from each paragraph of bc which have a G² value less than a pre-specified threshold (parameter p1).
• At the end of this stage, each paragraph p_i ∈ bc is a list of nouns which are assumed to be topically related to the target word tw
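The G² score can be sketched as follows, assuming a 2×2 contingency table per word: the observed counts n_ij are the word's frequency versus the remaining tokens in bc and rc, and the expected counts m_ij come from the marginals. The corpus sizes below are made up for illustration.

```python
import math

def g2(freq_bc, freq_rc, total_bc, total_rc):
    """Log-likelihood ratio G^2 for a word from a 2x2 contingency table.
    Rows are (bc, rc); columns are (word count, other tokens).
    Expected values are m_ij = row_total * col_total / N."""
    ot = [[freq_bc, total_bc - freq_bc],
          [freq_rc, total_rc - freq_rc]]
    n = total_bc + total_rc
    rows = [sum(r) for r in ot]
    cols = [ot[0][j] + ot[1][j] for j in range(2)]
    g = 0.0
    for i in range(2):
        for j in range(2):
            m = rows[i] * cols[j] / n
            if ot[i][j] > 0:
                g += ot[i][j] * math.log(ot[i][j] / m)
    return 2 * g
```

A word with the same relative frequency in both corpora (e.g. 1% in each) scores G² ≈ 0 and would be filtered out; a word that is much more frequent in bc than in rc scores high and is kept.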
Corpus pre-processing
Collocational Graph
• A key problem at this stage is the determination of related nouns
– They can be grouped into collocations
• Each collocation is assigned a weight
• In this study collocations of size 2 are considered (pairs of words)
– They consist of 2 nouns
Collocational Graph
• Collocations are detected by generating all the C(n, 2) combinations for each n-length paragraph
– Then measuring their frequency
• The frequency of a collocation is the number of paragraphs which contain that collocation
• Consider the following paragraphs:
– To install our satellite system please call our technicians and book an appointment. Connection to our television network is free of charge…
– To connect to the BT network, proceed with the installation of the connection software and then reboot your system…
– The C(n, 2) combinations for each n-length paragraph of our example would provide us with 24 unique collocations, such as {system, technician}, {system, connection}, etc.
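Collocation detection as described above is a few lines of Python. The noun lists below paraphrase the example paragraphs, with one word per noun kept for brevity; counts are per paragraph, so a pair seen twice in one paragraph still contributes 1.

```python
from itertools import combinations

def collocation_frequencies(paragraphs):
    """Frequency of each size-2 collocation: the number of paragraphs
    whose noun list contains both nouns of the pair."""
    freq = {}
    for nouns in paragraphs:
        # sorted(set(...)) gives each unordered pair a canonical key
        for pair in combinations(sorted(set(nouns)), 2):
            freq[pair] = freq.get(pair, 0) + 1
    return freq

# Noun lists of the two example paragraphs (target word removed)
p1 = ["system", "technician", "appointment", "connection", "television"]
p2 = ["connection", "installation", "software", "system"]
freq = collocation_frequencies([p1, p2])
```

Here {connection, system} gets frequency 2 because it appears in both paragraphs, while pairs like {system, technician} appear in only one.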
Collocational Graph
• Although the use of G² aims at keeping in bc those words which are related to the target one, this does not necessarily mean that their pairwise combinations are useful for discriminating the senses of tw
– For example, ambiguous collocations which are related to both senses of tw should not be taken into account, such as the {system, connection} collocation
• To circumvent this problem, each extracted collocation is assigned a weight, which measures the relative frequency of the two nouns co-occurring.
– Collocations are usually weighted using information-theoretic measures such as pointwise mutual information (PMI)
Collocational Graph
• Conditional probabilities produce better results than PMI, which overestimates rare events
– Hence they used conditional probabilities
• Let freq_ij denote the number of paragraphs in which nouns i and j co-occur, and freq_j denote the number of paragraphs in which noun j occurs
– Since G² allows us to capture the words which are related to tw, the calculations of collocation frequency take place on the whole SemEval-2007 WSI (SWSI) corpus (27132 paragraphs)
• This deals with data sparsity and helps determine whether a candidate collocation appears frequently enough to be included in the graphs
Collocational Graph
• We can measure the conditional probabilities p(i | j) and p(j | i) in a similar manner:

p(i | j) = freq_ij / freq_j

– The final weight w applied to collocation c_ij is the average of the two conditional probabilities:

w(c_ij) = (p(i | j) + p(j | i)) / 2

– They only extracted collocations which had a frequency (parameter p2) and a weight (parameter p3) higher than pre-specified thresholds
• The collocational graph can now be created, in which each extracted and weighted collocation is represented as a vertex, and two vertices share an edge if they co-occur in one or more paragraphs of bc
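Putting the frequency and weight thresholds together, a sketch of the extraction step (the counts and the p2/p3 values below are illustrative, not the paper's tuned parameters):

```python
def collocation_weight(freq_ij, freq_i, freq_j):
    """w(c_ij): average of p(i|j) = freq_ij / freq_j and
    p(j|i) = freq_ij / freq_i."""
    return (freq_ij / freq_j + freq_ij / freq_i) / 2

def extract_collocations(pair_freq, noun_freq, p2=2, p3=0.2):
    """Keep pairs whose paragraph frequency >= p2 and weight >= p3."""
    kept = {}
    for (i, j), f in pair_freq.items():
        w = collocation_weight(f, noun_freq[i], noun_freq[j])
        if f >= p2 and w >= p3:
            kept[(i, j)] = w
    return kept

# Invented paragraph counts for a handful of candidate pairs
pair_freq = {("network", "television"): 4, ("network", "software"): 1,
             ("connection", "system"): 3}
noun_freq = {"network": 10, "television": 5, "software": 9,
             "connection": 6, "system": 12}
kept = extract_collocations(pair_freq, noun_freq)
```

{network, software} is dropped by the frequency threshold p2, while the two surviving pairs get weights (0.8 + 0.4)/2 = 0.6 and (0.25 + 0.5)/2 = 0.375.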
Collocational Graph
• The next stage is to weight the edges of the initial collocational graph G, as well as to discover new edges connecting the vertices of G
– The constructed graph is sparse, since we are attempting to identify rare events, i.e. edges connecting collocations
• The solution to the problem of data sparsity is smoothing
Weighting and Populating
• For each vertex i (collocation c_i), they associated a vertex vector VC_i containing the vertices (collocations) which share an edge with i in graph G
• The table shows an example of two vertices, i.e. cnn_nbc and nbc_news, which are not connected in the graph G of the target word network
Weighting and Populating
• In the next step, the similarity between each vertex vector VC_i and each vertex vector VC_j is calculated
• Lee [4] showed that the Jaccard similarity coefficient (JC) has superior performance over other symmetric similarity measures such as cosine, L1 norm, Euclidean distance, Jensen-Shannon divergence, etc.
Weighting and Populating
• Using JC for estimating the similarity between vertex vectors yields:

JC(VC_i, VC_j) = |VC_i ∩ VC_j| / |VC_i ∪ VC_j|

– Two collocations c_i and c_j are said to be mutually similar if c_i is the most similar collocation to c_j and the other way around.
Weighting and Populating
• Two mutually similar collocations c_i and c_j are clustered, with the result that an occurrence of a collocation c_k with one of c_i, c_j is also counted as an occurrence with the other collocation
– In the table (slide 34), if cnn_nbc and nbc_news are mutually similar, then the zero-frequency event between nbc_news and cnn_tv is set equal to the joint frequency between cnn_nbc and cnn_tv
• Many collocations connected to one of the target collocations are not connected to the other, although they should be, since both of the target collocations are contextually related, i.e. both of them refer to the Television Network sense.
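The Jaccard-based smoothing step can be sketched as follows, with invented vertex vectors loosely modelled on the cnn_nbc / nbc_news example: if two collocations are each other's nearest neighbour under JC, their neighbour sets are pooled.

```python
def jaccard(a, b):
    """JC(VC_i, VC_j) = |VC_i ∩ VC_j| / |VC_i ∪ VC_j|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def mutually_similar(vectors, i, j):
    """True if i and j are each other's most similar collocation."""
    def nearest(x):
        return max((y for y in vectors if y != x),
                   key=lambda y: jaccard(vectors[x], vectors[y]))
    return nearest(i) == j and nearest(j) == i

# Vertex vectors: the graph neighbours of each collocation (toy data)
vectors = {
    "cnn_nbc":  {"cnn_tv", "news_network", "nbc_tv"},
    "nbc_news": {"news_network", "nbc_tv", "abc_news"},
    "bt_line":  {"phone_line"},
}

# Smoothing: mutually similar collocations pool their co-occurrences,
# so an edge observed with one is also counted for the other
if mutually_similar(vectors, "cnn_nbc", "nbc_news"):
    pooled = vectors["cnn_nbc"] | vectors["nbc_news"]
    vectors["cnn_nbc"] = vectors["nbc_news"] = set(pooled)
```

After pooling, nbc_news inherits the cnn_tv edge it previously lacked, exactly the kind of zero-frequency event the smoothing is meant to fill in.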
Weighting and Populating
• The weight applied to each edge connecting vertices i and j (collocations c_i and c_j) is the maximum of their conditional probabilities:

w(i, j) = max(p(i | j), p(j | i)),   where   p(i | j) = freq_{i,j} / freq_j

• freq_{i,j} denotes the number of paragraphs in which i and j co-occur, and freq_j denotes the number of paragraphs in which j occurs
Inducing Senses and Tagging
• The final graph G′, resulting from the previous stage, is clustered in order to produce the induced senses.
• The two criteria for choosing a clustering algorithm were:
– Its ability to automatically induce the number of clusters
– Its execution time
Inducing Senses and Tagging
• Markov Clustering Algorithm (MCL)
– Is fast
– Is based on stochastic flow in graphs
– The number of clusters produced depends on an inflation parameter
• Chinese Whispers (CW)
– Is a randomized graph-clustering method
– Time-linear in the number of edges
– Does not require any input parameters
– Not guaranteed to converge
– Automatically infers the number and size of clusters
Inducing Senses and Tagging
• Normalized MinCut
– A graph partitioning technique
– The graph is partitioned into two subgraphs by minimising the total association between the two subgraphs
– Iteratively applied to each extracted subgraph until a user-defined criterion is met (e.g. number of clusters)
Inducing Senses and Tagging
• CW initially assigns all vertices to different classes
• Each vertex i is processed for an x (parameter p4) number of iterations and inherits the strongest class in its local neighbourhood (LN) in an update step.
– LN is defined as the set of vertices which share a direct connection with vertex i
• During the update step for a vertex i, each class cl receives a score equal to the sum of the weights of the edges (i, j), where j has been assigned class cl
– The maximum score determines the strongest class
» In case of multiple strongest classes, one is chosen at random
• Classes are updated immediately, which means that a node can inherit classes from its LN that were introduced there in the same iteration
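A compact Chinese Whispers sketch, following the description above (one class per vertex, immediate updates, randomized processing order). One deviation: ties are broken deterministically here so the example is reproducible, whereas the algorithm as described breaks them at random.

```python
import random
from collections import defaultdict

def chinese_whispers(edges, iterations=20, seed=0):
    """Weighted Chinese Whispers sketch: every vertex starts in its own
    class; each vertex then adopts the class with the highest total edge
    weight in its local neighbourhood. Classes are updated immediately,
    so a label can spread within a single iteration."""
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    label = {v: v for v in adj}              # one class per vertex
    rng = random.Random(seed)
    nodes = list(adj)
    for _ in range(iterations):
        rng.shuffle(nodes)                   # randomized processing order
        for v in nodes:
            scores = defaultdict(float)
            for nb, w in adj[v].items():
                scores[label[nb]] += w       # class score = sum of weights
            if scores:
                # deterministic tie-break (by class name) for reproducibility
                label[v] = max(scores, key=lambda c: (scores[c], c))
    return label

# Two dense triangles joined by one weak edge should form two classes
edges = [("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
         ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
         ("c", "d", 0.1)]
labels = chinese_whispers(edges)
```

The weak c-d edge can never outscore the within-triangle weights, so the two triangles settle into two distinct classes, i.e. two induced senses.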
Evaluation
• The WSI approach was evaluated under the framework of SemEval-2007 WSI task (SWSI)
• The corpus consists of texts from the Wall Street Journal corpus, hand-tagged with OntoNotes senses
• They focused on all 35 nouns of SWSI, ignoring verbs
Evaluation
• They induced the senses of each target noun tn, and then tagged each instance of tn with one of its induced senses
• The SWSI organizers employ two evaluation schemes
– Unsupervised evaluation
• The results of systems are treated as clusters of target-noun contexts and the gold standard (GS) senses as classes
– Supervised evaluation
• The training corpus is used to map the induced clusters to GS senses. The testing corpus is then used to measure performance
Evaluation
• A perfect clustering solution is defined in terms of Homogeneity and Completeness
• Homogeneity
– Each induced cluster has exactly the same contexts as one of the classes
• Completeness
– Each class has exactly the same contexts as one of the clusters
Evaluation
• F-Score is used to assess the overall quality of clustering
– It measures both Homogeneity and Completeness
– Other measures, entropy and purity, only measure the first
Evaluation
• Purity
– Let q be the number of classes in the gold standard (GS), k be the number of clusters, n_r be the size of cluster r, n_r^i be the number of data points in class i that belong to cluster r, and n be the total number of data points; then:

Purity = (1/n) Σ_{r=1}^{k} max_i n_r^i

• Entropy

Entropy = - Σ_{r=1}^{k} (n_r / n) Σ_{i=1}^{q} (n_r^i / n_r) · log(n_r^i / n_r) / log(q)
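Both measures can be computed directly from per-cluster class counts. A sketch, using the notation above (n_r^i as counts, entropy normalized by log q; the toy counts below are invented):

```python
import math

def purity(clusters, n):
    """Purity = (1/n) * sum over clusters of the largest class count."""
    return sum(max(counts.values()) for counts in clusters) / n

def entropy(clusters, n, q):
    """Normalized entropy: weighted sum of per-cluster class entropies,
    divided by log q, so a perfectly pure clustering scores 0."""
    total = 0.0
    for counts in clusters:
        n_r = sum(counts.values())
        h = -sum((c / n_r) * math.log(c / n_r)
                 for c in counts.values() if c > 0)
        total += (n_r / n) * h / math.log(q)
    return total

# Each cluster is a map class -> number of its data points in the cluster
clusters = [{"tv": 8, "computer": 2}, {"computer": 9, "tv": 1}]
p = purity(clusters, 20)   # (8 + 9) / 20 = 0.85
```

Note how a one-cluster-per-instance solution trivially gets purity 1 and entropy 0, which is why the slides pair these measures with F-Score.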
Evaluation
• Their WSI methodology used Jaccard similarity to populate the graph, referred to as Col-JC
– Col-BL induces senses as Col-JC does, but without smoothing
• Baselines
– 1cl1inst: assigns each instance to a distinct cluster
– 1c1w: groups all instances of a target word into one cluster
• Equal to the most frequent sense (MFS) baseline in the supervised evaluation
– The sense which appears most often in an annotated text
Evaluation
• The evaluation results are given in the table below:
– UOY and UBS-AC have used labeled data for parameter estimation
– I2R, UPV_SI and UMND2 do not state how their parameters were estimated
Analysis
• Evaluation of WSI methods is a difficult task
– The 1cl1inst baseline achieves perfect purity and entropy, but scores low on F-Score
• Because the senses of the GS are spread among the induced clusters, causing a low unsupervised recall
• The supervised recall of 1cl1inst is undefined, because each cluster tags one and only one instance in the corpus
– Clusters tagging instances in the test corpus do not tag any instances in the train corpus, so the mapping cannot be performed
Analysis
• The 1c1w baseline achieves a high F-Score due to the dominance of the MFS in the testing corpus
– Its purity, entropy and supervised recall are much lower than those of other systems
Analysis
• A clustering solution which achieves high supervised recall does not necessarily achieve a high F-Score
– Because F-Score penalizes systems for getting the number of GS classes wrong, as the 1cl1inst baseline does
Analysis
• No system was able to achieve high performance in both settings except their technique
• Col-BL (Col-JC) achieved 72.9% (78%) F-Score
Analysis
• The target of smoothing was to reduce the number of clusters and obtain a better mapping of clusters to GS senses, but without affecting the clustering quality
Bibliography
1) Ioannis Klapaftis and Suresh Manandhar, ‘Word Sense Induction Using Graphs of Collocations’, in Proceedings of the 18th European Conference on Artificial Intelligence (ECAI-2008).
2) Agirre, E., Soroa, A.: Ubc-as: A graph based unsupervised system for induction and classification. In: Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pp. 346–349 (2007)
Bibliography
3) J. Véronis. 2004. HyperLex: lexical cartography for information retrieval. Computer Speech & Language, 18(3):223–252.
4) Lee, L. 1999. Measures of distributional similarity. In Proc. ACL ’99, 25–32.