
Word Sense Disambiguation

2002. 1. 18.

Kyung-Hee Sung

Foundations of Statistical NLP, Chapter 7


Contents

Methodological Preliminaries

Supervised Disambiguation
– Bayesian classification / An information-theoretic approach

Disambiguation based on dictionaries, thesauri and bilingual dictionaries
– One sense per discourse, one sense per collocation

Unsupervised Disambiguation


Introduction

Word sense disambiguation
– Many words have several meanings or senses.
– Many words have different usages. Ex) butter may be used as a noun or as a verb.
– The task of disambiguation is done by looking at the context of the word's use.


Methodological Preliminaries (1/2)

Supervised learning: we know the actual status (sense label) for each piece of data on which we train (learning from labeled data). → Classification task

Unsupervised learning: we do not know the classification of the data in the training sample (learning from unlabeled data). → Clustering task


Methodological Preliminaries (2/2)

Pseudowords: artificial ambiguous words, created in order to test the performance of disambiguation algorithms. Ex) banana-door
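A rough sketch of how such a test set can be built (function and variable names here are illustrative, not from the text): every occurrence of the two chosen words is replaced by the conflated pseudoword, and the original word is kept as the gold sense label.

import re

def make_pseudoword_corpus(sentences, w1="banana", w2="door", pseudo="banana-door"):
    # Conflate two real words into one artificial ambiguous word.
    # The original word serves as the gold 'sense' label, so a
    # disambiguator can be evaluated without hand-labeled data.
    pattern = re.compile(r"\b(%s|%s)\b" % (re.escape(w1), re.escape(w2)))
    test_sentences, gold_senses = [], []
    for sent in sentences:
        hits = pattern.findall(sent)
        if hits:
            gold_senses.extend(hits)
            test_sentences.append(pattern.sub(pseudo, sent))
    return test_sentences, gold_senses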

Performance estimation
– Upper bound: human performance
– Lower bound (baseline): the performance of the simplest possible algorithm, usually the assignment of all contexts to the most frequent sense.


Supervised Disambiguation

Bayesian classification: treats the context of occurrence as a bag of words without structure; integrates information from all words in the context window.

Information-theoretic approach: looks at only one informative feature in the context, which may be sensitive to text structure.


Notations

Symbol            Meaning
w                 an ambiguous word
s1, …, sk, …, sK  senses of the ambiguous word w (a semantic label of w)
c1, …, ci, …, cI  contexts of w in a corpus
v1, …, vj, …, vJ  words used as contextual features for disambiguation


Bayesian classification (1/2)

Assumption: we have a corpus where each use of the ambiguous word is labeled with its correct sense.

Bayes decision rule: minimizes the probability of error.
– Decide s′ if P(s′|c) > P(sk|c) for all sk ≠ s′

s′ = argmax_sk P(sk|c)
   = argmax_sk P(c|sk) P(sk) / P(c)   ← using Bayes' rule
   = argmax_sk P(c|sk) P(sk)          ← P(c) is constant for all senses


Bayesian classification (2/2)

Naive Bayes independence assumption
– All the structure and linear ordering of words within the context is ignored. → bag of words model
– The presence of one word in the bag is independent of another.

Decision rule for Naive Bayes
– Decide s′ if s′ = argmax_sk [ log P(sk) + Σ_{vj in c} log P(vj|sk) ]
– P(vj|sk) = C(vj, sk) / C(sk) and P(sk) = C(sk) / C(w)   ← computed by MLE
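A minimal sketch of this classifier, assuming simple list/dict inputs; add-one smoothing is a practical addition here so unseen words do not zero out a sense, which plain MLE would.

import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_contexts):
    # labeled_contexts: list of (context_words, sense) pairs
    sense_counts = Counter()
    word_counts = defaultdict(Counter)   # C(vj, sk)
    vocab = set()
    for words, sense in labeled_contexts:
        sense_counts[sense] += 1
        for v in words:
            word_counts[sense][v] += 1
            vocab.add(v)
    total = sum(sense_counts.values())
    priors = {s: c / total for s, c in sense_counts.items()}  # P(sk)
    return priors, word_counts, vocab

def disambiguate(context_words, priors, word_counts, vocab):
    # Decide s' = argmax_sk [ log P(sk) + sum_j log P(vj|sk) ]
    best_sense, best_score = None, float("-inf")
    V = len(vocab)
    for sense, prior in priors.items():
        n = sum(word_counts[sense].values())
        score = math.log(prior)
        for v in context_words:
            score += math.log((word_counts[sense][v] + 1) / (n + V))
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense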


An Information-theoretic approach (1/2)

The Flip-Flop algorithm applied to finding indicators for disambiguation between two senses:

find random partition P = {P1, P2} of {t1, …, tm}
while (improving) do
    find partition Q = {Q1, Q2} of {x1, …, xn} that maximizes I(P; Q)
    find partition P = {P1, P2} of {t1, …, tm} that maximizes I(P; Q)
end
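A toy sketch of this loop follows. The exhaustive subset search below stands in for the efficient splitting procedure the algorithm normally relies on, so it is only feasible for small sets; all names are illustrative.

import math
from itertools import combinations

def mutual_info(counts, P1, Q1):
    # I(P;Q) over the 2x2 table induced by P = {P1, P2}, Q = {Q1, Q2};
    # counts[(t, x)] = co-occurrence count of translation t and indicator x.
    table = [[0.0, 0.0], [0.0, 0.0]]
    for (t, x), c in counts.items():
        table[0 if t in P1 else 1][0 if x in Q1 else 1] += c
    total = sum(sum(row) for row in table)
    mi = 0.0
    for i in range(2):
        for j in range(2):
            pij = table[i][j] / total
            if pij > 0:
                pi = sum(table[i]) / total
                pj = (table[0][j] + table[1][j]) / total
                mi += pij * math.log(pij / (pi * pj))
    return mi

def best_split(items, score):
    # Exhaustive search over proper subsets (exponential; toy-sized only).
    best, best_mi = None, -1.0
    for r in range(1, len(items)):
        for subset in combinations(items, r):
            mi = score(set(subset))
            if mi > best_mi:
                best, best_mi = set(subset), mi
    return best, best_mi

def flip_flop(counts, translations, indicators):
    translations, indicators = list(translations), list(indicators)
    P1 = set(translations[: len(translations) // 2])  # initial partition
    prev = -1.0
    while True:  # while (improving)
        Q1, _ = best_split(indicators, lambda q: mutual_info(counts, P1, q))
        P1, mi = best_split(translations, lambda p: mutual_info(counts, p, Q1))
        if mi <= prev:
            return P1, Q1
        prev = mi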


An Information-theoretic approach (2/2)

To translate prendre (French) based on its object:
– Translations, {t1, …, tm} = { take, make, rise, speak }
– Values of the indicator, {x1, …, xn} = { mesure, note, exemple, décision, parole }

1. Initial partition: P1 = { take, rise }, P2 = { make, speak }

2. Find partition Q1 = { mesure, note, exemple }, Q2 = { décision, parole }
   ← This division gives us the most information for distinguishing P1 from P2 (maximizes I(P; Q)).

3. Find partition P1 = { take }, P2 = { make, rise, speak }
   ← Now always correct for take.

Relations (English):
   take a measure
   take notes
   take an example
   make a decision
   make a speech
   rise to speak


Dictionary-based disambiguation (1/2)

A word's dictionary definitions are good indicators for the senses they define.

for all senses sk of w do
    score(sk) = overlap(Dk, ∪_{vj in c} Evj)   // number of common words
end
choose s′ s.t. s′ = argmax_sk score(sk)

Symbol        Meaning
D1, …, DK     dictionary definitions of the senses s1, …, sK
sj1, …, sjL   senses of vj
Evj           dictionary definitions of a word vj: Evj = ∪_l Djl, the union of the definitions of vj's senses
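A minimal sketch of this scoring loop (the dictionary structures are illustrative stand-ins; a real system would normalize and stem definition words):

def lesk_disambiguate(sense_definitions, context_words, dictionary):
    # sense_definitions: {sense sk: set of words in its definition Dk}
    # dictionary: {word v: set of words in Ev, the union of the
    #              definitions of v's senses}
    evidence = set()
    for v in context_words:
        evidence |= dictionary.get(v, set())
    scores = {sk: len(dk & evidence)  # overlap(Dk, U_vj Evj)
              for sk, dk in sense_definitions.items()}
    return max(scores, key=scores.get)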


Dictionary-based disambiguation (2/2)

Simplified example: ash
– The score is the number of words shared by the sense definition and the context.

Sense              Definition
s1 tree            a tree of the olive family
s2 burned stuff    the solid residue left when combustible material is burned

Scores       Context
s1   s2
0    1       This cigar burns slowly and creates a stiff ash.
1    0       The ash is one of the last trees to come into leaf.


Thesaurus-based disambiguation (1/2)

The semantic categories of the words in a context determine the semantic category of the context as a whole, and this category in turn determines which word senses are used.

Each word is assigned one or more subject codes in the dictionary.
– t(sk): subject code of sense sk.
– The score is the number of context words compatible with the subject code of sense sk.

for all senses sk of w do
    score(sk) = Σ_{vj in c} δ(t(sk), vj)   // δ = 1 if t(sk) is among the subject codes of vj, else 0
end
choose s′ s.t. s′ = argmax_sk score(sk)
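A sketch of this scoring rule under the same kind of toy structures (the subject-code tables are illustrative stand-ins for a thesaurus such as Roget's):

def thesaurus_disambiguate(sense_code, word_codes, context_words):
    # sense_code: {sense sk: subject code t(sk)}
    # word_codes: {word v: set of subject codes assigned to v}
    # score(sk) = sum over vj in context of delta(t(sk), vj)
    scores = {}
    for sk, code in sense_code.items():
        scores[sk] = sum(1 for v in context_words
                         if code in word_codes.get(v, set()))
    return max(scores, key=scores.get)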


Thesaurus-based disambiguation (2/2)

Word      Sense             Roget category    Accuracy
bass      musical [beis]    MUSIC             99%
          fish [bæs]        ANIMAL, INSECT    100%
interest  curiosity         REASONING         88%
          advantage         INJUSTICE         34%
          financial         DEBT              90%
          share             PROPERTY          38%

Self-interest (advantage) is not topic-specific. When a sense is spread out over several topics, the topic-based classification algorithm fails.


Disambiguation based on translations in a second-language corpus

In order to disambiguate an occurrence of interest in English (first language), we identify the phrase it occurs in and search a German (second language) corpus for instances of the phrase.
– The English phrase showed interest: show (E) → zeigen (G)
– zeigen (G) will only occur with Interesse (G), since 'legal shares' are usually not shown.
– We can conclude that interest in the phrase to show interest belongs to the sense 'attention, concern'.
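A sketch of the counting idea (the pair extraction and sense mapping are assumed inputs; a real system would parse an aligned corpus):

from collections import Counter

def disambiguate_by_translation(pairs, verb_g, sense_of_object):
    # pairs: (verb, object) pairs from the German corpus,
    #        e.g. ('zeigen', 'Interesse')
    # verb_g: German translation of the English verb, e.g. 'zeigen'
    # sense_of_object: maps German translations of the ambiguous word
    #        to English senses, e.g. {'Interesse': 'attention, concern'}
    votes = Counter()
    for verb, obj in pairs:
        if verb == verb_g and obj in sense_of_object:
            votes[sense_of_object[obj]] += 1
    return votes.most_common(1)[0][0] if votes else None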


One sense per discourse constraint

The sense of a target word is highly consistent within any given document.
– If the first occurrence of plant is a use of the sense 'living being', then later occurrences are likely to refer to living beings too.

for all documents dm do
    determine the majority sense sk of w in dm
    assign all occurrences of w in dm to sk
end
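The loop above maps directly onto a few lines of code (a sketch, assuming per-document lists of initially assigned senses):

from collections import Counter

def one_sense_per_discourse(doc_senses):
    # doc_senses: {document id dm: list of senses initially assigned
    #              to the occurrences of w in that document}
    result = {}
    for dm, senses in doc_senses.items():
        majority, _ = Counter(senses).most_common(1)[0]
        result[dm] = [majority] * len(senses)  # assign all occurrences
    return result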


One sense per collocation constraint

Word senses are strongly correlated with certain contextual features, like other words in the same phrasal unit.
– Collocational features fm are ranked according to the ratio P(sk1|fm) / P(sk2|fm), estimated from C(sk, fm)   ← the number of occurrences of sense sk with collocation fm
– Relying on only the strongest feature has the advantage that no integration of different sources of evidence is necessary.
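A sketch of the ranking for a two-sense word; the smoothing constant is an illustrative choice to keep the ratio finite when one of the counts is zero.

import math

def rank_collocations(counts, s1, s2, alpha=0.1):
    # counts[(sk, fm)] = C(sk, fm), the number of occurrences of
    # sense sk with collocation fm
    features = {fm for (_, fm) in counts}
    ranked = sorted(
        features,
        key=lambda fm: abs(math.log(
            (counts.get((s1, fm), 0) + alpha) /
            (counts.get((s2, fm), 0) + alpha))),
        reverse=True)
    return ranked  # decide using the single strongest feature present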


Unsupervised Disambiguation (1/3)

There are situations in which even the small amount of lexical information required by the previous methods (labeled data, dictionaries, or aligned corpora) is not available.

Sense tagging requires that some characterization of the senses be provided. However, sense discrimination can be performed in a completely unsupervised fashion.

Context-group discrimination: a completely unsupervised algorithm that clusters unlabeled occurrences of a word.


An EM algorithm (1/2)

1. Initialize the parameters of the model μ randomly. Compute the log-likelihood of the corpus C:
   l(C|μ) = log Π_i Σ_k P(ci|sk) P(sk) = Σ_i log Σ_k P(ci|sk) P(sk)

2. While l(C|μ) is improving, repeat:
   (a) E-step. Estimate hik, the posterior probability that sense sk generated context ci:
       hik = P(ci|sk) P(sk) / Σ_{k′} P(ci|sk′) P(sk′)
       where P(ci|sk) = Π_{vj in ci} P(vj|sk)   ← Naive Bayes assumption


An EM algorithm (2/2)

   (b) M-step. Re-estimate the parameters P(vj|sk) and P(sk) by way of MLE, using the hik as fractional counts:
       P(vj|sk) = Σ_{i: vj in ci} hik / Σ_{j′} Σ_{i: vj′ in ci} hik
       P(sk) = Σ_i hik / I
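A compact sketch of the whole loop; probabilities are kept in linear space for brevity, and a practical version would work with logs and test convergence of l(C|μ). All names are illustrative, and every context word is assumed to be in the vocabulary.

import random

def em_sense_discrimination(contexts, vocab, K, iterations=20, seed=0):
    rng = random.Random(seed)
    # 1. Random initialization of the model parameters mu
    prior = [1.0 / K] * K                      # P(sk)
    p_word = []
    for _ in range(K):
        raw = {v: rng.random() for v in vocab}
        z = sum(raw.values())
        p_word.append({v: p / z for v, p in raw.items()})  # P(vj|sk)
    for _ in range(iterations):                # 2. while improving
        # (a) E-step: hik proportional to P(ci|sk) P(sk), with the Naive
        #     Bayes assumption P(ci|sk) = prod_{vj in ci} P(vj|sk)
        h = []
        for ci in contexts:
            scores = []
            for k in range(K):
                s = prior[k]
                for v in ci:
                    s *= p_word[k][v]
                scores.append(s)
            z = sum(scores) or 1.0
            h.append([s / z for s in scores])
        # (b) M-step: re-estimate P(sk) and P(vj|sk) by MLE, using
        #     hik as fractional counts
        for k in range(K):
            prior[k] = sum(h[i][k] for i in range(len(contexts))) / len(contexts)
            counts = {v: 1e-9 for v in vocab}  # tiny floor avoids zeros
            for i, ci in enumerate(contexts):
                for v in ci:
                    counts[v] += h[i][k]
            z = sum(counts.values())
            p_word[k] = {v: c / z for v, c in counts.items()}
    return prior, p_word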


Unsupervised Disambiguation (2/3)

Unsupervised disambiguation can easily be adapted to produce distinctions between usage types.
– Ex) the distinction between physical banks (in the context of bank robberies) and banks as abstract corporations (in the context of corporate mergers).

The unsupervised algorithm splits dictionary senses into fine-grained contextual variants.
– Usually, the induced clusters do not line up well with dictionary senses. Ex) 'lawsuit' → 'civil suit' and 'criminal suit'


Unsupervised Disambiguation (3/3)

Infrequent senses and senses that have few collocations are hard to isolate in unsupervised disambiguation.

Results of the EM algorithm
– The algorithm fails for words whose senses are topic-independent, such as 'to teach' for train.

Word    Sense                   Accuracy (mean)   σ
suit    lawsuit                 95                0
        the suit you wear       96                0
train   line of railroad cars   79                19
        to teach                55                31

← Mean and standard deviation over ten experiments with different initial conditions.