Number Sense Disambiguation
Stuart Moore
Supervised by:
Anna Korhonen (Computer Lab)
Sabine Buchholz (Toshiba CRL)
Number Sense Disambiguation
Similar to Word Sense Disambiguation
Seek to classify numbers into different senses
  - e.g. year, time, telephone number...
Applications
Speech Synthesis
  - 1990: "nineteen-ninety" or "one thousand, nine hundred and ninety"
  - 2015: "two thousand and fifteen" or "eight fifteen p.m."
Information Retrieval
Parsing
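As a rough illustration of why the sense matters for speech synthesis, the sketch below reads the same digit string in three ways. The function names are hypothetical, the sense labels are borrowed from the Sproat et al taxonomy, and the reading rules are deliberately simplified (four-digit tokens only, years always read as two pairs); it is not the expansion used by any real synthesiser.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def as_cardinal(n):
    """Spell out 0..9999 as a cardinal, British style with 'and'."""
    thousands, rest = divmod(n, 1000)
    hundreds, last = divmod(rest, 100)
    head = []
    if thousands:
        head.append(ONES[thousands] + " thousand")
    if hundreds:
        head.append(ONES[hundreds] + " hundred")
    text = ", ".join(head)
    if last or not head:
        tail = two_digits(last)
        text = text + " and " + tail if head else tail
    return text


def as_year(n):
    """Read a four-digit year as two pairs (a simplification)."""
    high, low = divmod(n, 100)
    return f"{two_digits(high)} {two_digits(low)}"


def as_time(n):
    """Read a four-digit token as a 24-hour clock time (assumed HHMM)."""
    hour, minute = divmod(n, 100)
    suffix = "a.m." if hour < 12 else "p.m."
    return f"{two_digits(hour % 12 or 12)} {two_digits(minute)} {suffix}"


def verbalise(token, sense):
    """Pick a reading for the token according to its (given) sense label."""
    readers = {"NUM": as_cardinal, "NYER": as_year, "NTIME": as_time}
    return readers[sense](int(token))


print(verbalise("1990", "NYER"))   # nineteen ninety
print(verbalise("1990", "NUM"))    # one thousand, nine hundred and ninety
print(verbalise("2015", "NUM"))    # two thousand and fifteen
print(verbalise("2015", "NTIME"))  # eight fifteen p.m.
```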
Aim
To successfully classify numbers into sense categories
To use a semi-supervised method
  - Avoids the need for a large, human annotated training set
  - Allows economical adaptation to different languages and domains
Differences with Word Sense Disambiguation
There are infinitely many numbers – you will almost certainly come across 'digit strings' you have not seen in training data.
  - Intuitively, the models for 2007 and 2008 should be similar
  - But the model for 5, or 2007.4, should be different
There is no resource equivalent to a dictionary, enumerating all possible senses of a number.
Previous System
The report Normalization of Non-Standard Words (Sproat et al, 2001) defines a taxonomy of 13 'senses' for numbers
They annotated 4 corpora, the largest of which is a subsection of the North American News Text Corpus – newswire text from 1994-97
They used this to create a decision tree classifier
The main focus of the report was the performance when expanding abbreviations, and numbers were not examined in detail.
Number Sense Categories

| Label | Description | Examples | Count |
|---|---|---|---|
| NUM | Number (Cardinal) | 12, 45, 1/2, 0.6 | 21253 (56.53%) |
| NYER | Year(s) | 1998, 80s, 1900s, 2003 | 7659 (20.37%) |
| NORD | Number (Ordinal) | May 7, 3rd, Bill Gates III | 3264 (8.68%) |
| MONEY | Money (US or other) | $3.45, HK$300, Y20,000, $200K | 2909 (7.74%) |
| NIDE | Identifier | 747, 386, I5, pc110, 3A | 1027 (2.73%) |
| NTEL | Telephone number (or part of) | 212 555-4523 | 507 (1.35%) |
| NTIME | A (compound) time | 3:20, 11:45 | 440 (1.17%) |
| NDATE | A (compound) date | 2/2/99, 14/03/87 (or US) 03/14/87 | 307 (0.82%) |
| NDIG | Number as digits | Room 101 | 74 (0.20%) |
| NADDR | Number as street address | 45 North Street, 5000 Pennsylvania Ave | 69 (0.18%) |
| NZIP | Zip code or PO box | 91020 | 66 (0.18%) |

(Counts are from the training data of the North American News Text Corpus)
Overview of my system
Based on work by Yarowsky (1995) investigating decision lists for Word Sense Disambiguation
Takes a few annotated 'seed examples', together with a large, unannotated corpus.
Generates one model using the seed examples, and applies this to the unannotated corpus.
The corpus labelled by that model is used as input to generate another model.
The process can be iterated many times.
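The loop described above can be sketched roughly as follows. This is a minimal illustration and not the system itself: `bootstrap` and `train_toy` are hypothetical names, the toy learner only memorises unambiguous neighbouring tokens, and keeping the seeds in every round is an assumption that the slides do not confirm.

```python
from collections import defaultdict


def bootstrap(seeds, unlabelled, train, iterations):
    """Train on the seeds, label the corpus, retrain on the result, repeat.

    seeds: dict mapping example -> sense.
    train: function taking such a dict and returning a model, where a model
           maps an example to a sense or to None (abstain).
    """
    labelled = dict(seeds)
    for _ in range(iterations):
        model = train(labelled)
        labelled = dict(seeds)  # assumption: seeds are kept in every round
        for example in unlabelled:
            sense = model(example)
            if sense is not None:
                labelled[example] = sense
    return labelled


def train_toy(labelled):
    """Toy learner: remember neighbouring tokens seen with only one sense."""
    senses_seen = defaultdict(set)
    for (prev, _number, nxt), sense in labelled.items():
        senses_seen[("prev", prev)].add(sense)
        senses_seen[("next", nxt)].add(sense)
    rules = {f: next(iter(s)) for f, s in senses_seen.items() if len(s) == 1}

    def model(example):
        prev, _number, nxt = example
        found = {rules[f] for f in (("prev", prev), ("next", nxt)) if f in rules}
        return found.pop() if len(found) == 1 else None

    return model


# Each example is (previous token, number, next token).
seeds = {("in", "1998", "."): "NYER", ("costs", "12", "dollars"): "NUM"}
unlabelled = [("in", "2003", "when"), ("since", "2001", "when"),
              ("costs", "45", "pounds"), ("weighs", "30", "pounds")]

for rounds in (1, 2):
    result = bootstrap(seeds, unlabelled, train_toy, rounds)
    print(rounds, len(result) - len(seeds), "corpus examples labelled")
```

In this toy run the first round labels only the examples that share a context token with a seed; the second round reaches the remaining two through context learned in the first.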
Features
The context of each number is examined for a list of features.
Local context: ± 5 tokens from the number
  - Punctuation, words, word stems, number features
  - Specific location (e.g. token following number)
Wider context: ± 15 tokens from the number
  - Words and word stems only
  - Bag of words (anywhere within the window)
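A minimal sketch of the two context windows is shown below. The feature naming scheme and the function name `extract_features` are assumptions made for illustration, and word stems and number features are left out to keep the example short.

```python
def extract_features(tokens, i, local=5, wide=15):
    """Collect context features for the number at position i.

    Local features record the exact offset of each neighbouring token
    (punctuation included); wide features are an unordered bag of words.
    """
    feats = set()
    # Local context: position-specific, any token type.
    for offset in range(-local, local + 1):
        j = i + offset
        if offset != 0 and 0 <= j < len(tokens):
            feats.add(f"local[{offset:+d}]={tokens[j].lower()}")
    # Wider context: alphabetic words only, position ignored.
    start, end = max(0, i - wide), min(len(tokens), i + wide + 1)
    for j in range(start, end):
        if j != i and tokens[j].isalpha():
            feats.add(f"bag={tokens[j].lower()}")
    return feats


tokens = "The meeting starts at 3:20 , not 4 .".split()
features = extract_features(tokens, tokens.index("3:20"))
print(sorted(features))
```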
Rules
Each rule is conditional on the presence of one or two features
  - Consider all possible combinations of features that occur together at least five times in the training corpus.
Based on Yarowsky's rules, but more powerful
  - He had 'Bag of word' rules, and some rules combining two words in the local area
  - He did not have any specific numeric or punctuation features.
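One way to enumerate candidate rules under this constraint is sketched below. The name `candidate_rules` is hypothetical, and applying the same minimum count to single-feature rules is an assumption; the slide only states the threshold for features that occur together.

```python
from collections import Counter
from itertools import combinations


def candidate_rules(examples, min_count=5):
    """Return every one- or two-feature rule seen at least min_count times.

    examples: iterable of feature sets, one per number occurrence.
    A rule is a tuple of feature names, sorted so each pair has one form.
    """
    counts = Counter()
    for feats in examples:
        ordered = sorted(feats)
        for feature in ordered:
            counts[(feature,)] += 1
        for pair in combinations(ordered, 2):
            counts[pair] += 1
    return {rule for rule, n in counts.items() if n >= min_count}


examples = [{"a", "b"}] * 5 + [{"a", "c"}] * 4
print(sorted(candidate_rules(examples)))
```

Here `c` and the pair `(a, c)` occur only four times, so neither becomes a rule.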
Ranking Rules
Follows Yarowsky (1995)
  - For each rule, count the number of examples for each number sense
  - Calculate Log Likelihood:

$$\text{LogLike} = \log\left(\frac{\text{Count}(\text{Positive Examples})}{\text{Count}(\text{Negative Examples})}\right)$$

α is a parameter that can be varied to change the effect of negative examples on the model
Rank rules according to log likelihood
  - When classifying, use the first rule that matches the target sentence
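The ranking and first-match classification can be sketched as below. The slide does not show where α enters the score, so adding it to the negative count is purely an assumption here (it also avoids division by zero); `rank_rules`, `classify` and the `cutoff` argument are hypothetical names, with the cut-off mirroring the log likelihood cut off used in the performance charts.

```python
import math
from collections import Counter


def rank_rules(examples, rules, alpha=0.1):
    """Score each (rule, sense) pair and sort from highest to lowest.

    examples: list of (feature_set, sense) pairs.
    rules: iterable of feature tuples; a rule matches when all of its
           features are present.
    ASSUMPTION: score = log(positive / (negative + alpha)).
    """
    scored = []
    for rule in rules:
        by_sense = Counter(s for feats, s in examples if set(rule) <= feats)
        total = sum(by_sense.values())
        for sense, positive in by_sense.items():
            negative = total - positive
            scored.append((math.log(positive / (negative + alpha)), rule, sense))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


def classify(feats, decision_list, cutoff=0.0):
    """Return the sense of the first matching rule at or above the cut-off."""
    for score, rule, sense in decision_list:
        if score < cutoff:
            break  # the list is sorted, so no later rule can qualify
        if set(rule) <= feats:
            return sense
    return None  # unclassified; a default sense could be applied here


examples = ([({"next=pm"}, "NTIME")] * 3
            + [({"prev=in"}, "NYER")] * 4
            + [({"prev=in"}, "NUM")])
decision_list = rank_rules(examples, [("next=pm",), ("prev=in",)])
print(classify({"prev=in"}, decision_list))              # NYER
print(classify({"prev=in"}, decision_list, cutoff=2.0))  # None
```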
Performance as a fully supervised system
We applied the method to the entire training set, and investigated its performance on the training and test sets
  - This gives an idea of the 'upper bound' of performance of the system
Performance on training data
[Chart: Correct, Default makes correct, Default makes incorrect and Incorrect (y-axis 0% to 100%) against Log Likelihood cut off (x-axis 0 to 8.4). Annotated value: 97.2%.]
Performance on test data
[Chart: Correct, Default makes correct, Default makes incorrect and Incorrect (y-axis 0% to 100%) against Log Likelihood cut off (x-axis 0 to 8.4). Annotated values: 66.0% and 81.2%.]
Performance as a fully Supervised system - Summary
Accuracy is 66.0% on test data
  - Using the most common number type for unclassified examples increases accuracy to 81.2%
The Sproat et al system achieves an accuracy of 97.6% on the same task
  - Uses decision trees instead of decision lists
  - Decision trees generally classify everything, so they are less suitable for an iterative process.
Performance as a fully Supervised system - Summary
A large proportion of the test data, approximately 25%, was unclassified.
By adding in unlabelled data to the training set, we hope to increase coverage of the rules, and thereby boost accuracy (experiment not yet performed)
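As a rough check on how the summary figures fit together, the sketch below works out what the default class contributes, using only the numbers quoted on these slides (66.0%, 81.2%, and roughly 25% unclassified). Because the 25% is approximate, the resulting share is indicative only, and `default_gain` is a hypothetical helper.

```python
def default_gain(acc_rules, acc_with_default, unclassified):
    """Return the accuracy added by the default class, and the fraction of
    unclassified examples that the default class gets right."""
    gain = acc_with_default - acc_rules
    return gain, gain / unclassified


gain, share = default_gain(0.660, 0.812, 0.25)
print(f"{gain:.3f} of all test examples are rescued by the default")  # 0.152
print(f"{share:.2f} of the unclassified examples")                    # 0.61
```

So roughly three in five of the unclassified numbers belong to the most common type, and the rest are the errors that better rule coverage could recover.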
Performance as a semi-supervised system
Concept: Provide a small number of seed examples, from which rules are extrapolated over various iterations.
  - Important to have high precision in the first iteration (Recall can be low, as long as it's not too low)
  - Future iterations aim to improve recall
Performance as a semi-supervised system
After experimenting with a few different strategies for the first iteration, the following was found to perform best:
Rank all rules based on their scores from the seed examples
For each number type, take the three highest scoring rules (more if several had an equal score)
Apply these rules to the unlabelled data.
  - If a number is matched by rules from more than one number type, do not classify it
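The first-iteration strategy can be sketched as follows, assuming rules have already been scored as `(score, rule, sense)` triples. The function names are hypothetical, and "more if several had an equal score" is interpreted here as keeping every rule tied with the third-best score.

```python
from collections import defaultdict


def select_seed_rules(scored_rules, top_n=3):
    """Keep the top_n highest-scoring rules per sense, plus any ties."""
    by_sense = defaultdict(list)
    for score, rule, sense in scored_rules:
        by_sense[sense].append((score, rule))
    selected = []
    for sense, items in by_sense.items():
        items.sort(key=lambda item: item[0], reverse=True)
        if len(items) > top_n:
            threshold = items[top_n - 1][0]
            items = [item for item in items if item[0] >= threshold]
        selected.extend((rule, sense) for _score, rule in items)
    return selected


def label_unambiguous(feats, selected):
    """Return a sense only if every matching rule agrees on it."""
    senses = {sense for rule, sense in selected if set(rule) <= feats}
    return senses.pop() if len(senses) == 1 else None


scored = [(3.0, ("a",), "NYER"), (2.0, ("b",), "NYER"), (1.0, ("c",), "NYER"),
          (1.0, ("d",), "NYER"), (0.5, ("e",), "NYER"), (2.0, ("a",), "NUM")]
selected = select_seed_rules(scored)
print(label_unambiguous({"b"}, selected))  # NYER
print(label_unambiguous({"a"}, selected))  # None
```

The second call returns `None` because feature `a` supports rules for two different number types, so the example is left unclassified.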
How many seed examples are needed?
Seed examples were randomly picked from the training data
Equal numbers of seed examples for each number type
Definite improvement seen for going up to 40 seed examples
Limited improvement after this point
[Chart: Precision (% of those assigned where the category is correct), y-axis 0% to 100%, against Number of seed examples per number type (x-axis 20 to 60).]
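Picking an equal number of random seeds per number type might look like the sketch below. The name `pick_seeds` and the fixed random seed are assumptions for illustration; how the original experiment handled types with fewer examples than requested is not stated, so this version simply takes all that are available.

```python
import random
from collections import defaultdict


def pick_seeds(labelled, per_type, rng=None):
    """Randomly choose up to per_type seed examples for each number type.

    labelled: iterable of (example, sense) pairs from the training data.
    """
    rng = rng or random.Random(0)  # fixed seed so runs are repeatable
    by_sense = defaultdict(list)
    for example, sense in labelled:
        by_sense[sense].append(example)
    return {sense: rng.sample(pool, min(per_type, len(pool)))
            for sense, pool in by_sense.items()}


training = ([(f"year-{i}", "NYER") for i in range(100)]
            + [(f"zip-{i}", "NZIP") for i in range(10)])
chosen = pick_seeds(training, per_type=40)
print({sense: len(examples) for sense, examples in chosen.items()})
```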
Performance of the second iteration – training data
[Chart: Correct, Default makes correct, Default makes incorrect and Incorrect (y-axis 0% to 100%) against Log Likelihood cut off (x-axis 0 to 8.4).]
Baseline: 56.24%
Peak: 84.84% (LogLike >= 5.0)
Performance of the second iteration – test data
[Chart: Correct, Default makes correct, Default makes incorrect and Incorrect (y-axis 0% to 100%) against Log Likelihood cut off (x-axis 0 to 8.4).]
Peak: 75.2% (LogLike >= 5.2)
Using the previous peak value, cut off = 5.0, gives 74.93% accuracy
Future Work
Error analysis of the data
More sophisticated features
  - Part of Speech tags, or a parser
More sophisticated rules
  - Try to allow more than two features per rule, without creating too many rules to be handled.
Different rule strategies
  - Closer to a decision tree
  - Other machine learning methods?
Future Work
Increase coverage
  - Investigate use of document level features, using method from Stevenson et al, 2008
Investigate different strategies for picking the seed examples
  - Distribute according to relative frequency of categories, rather than a set number per category
Investigate the effects of more unannotated data
  - Can use sections of the North American News Corpus that haven't been annotated.
Future Work
Consider modifying the number classes
  - Should some categories be combined?
  - Would moving the categories into a tree structure improve performance?
  - Are different classes needed for different domains (e.g. financial, biomedical) or languages?
Investigate corpus for consistency
  - A few inconsistent examples have been identified
Number Features
Does the number start with a leading zero
Is the number an integer
How many digits in the number
The real value of the number
The number rounded to one significant figure
  - So 1500 ≤ x < 2500 maps to 2000
The token with all digits removed
  - 1st becomes st, 70mph becomes mph
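The number features listed above could be computed along these lines. This is a sketch under stated assumptions: the feature names and both function names are hypothetical, tokens that do not parse as a plain number get no value, and halves are rounded upwards so that 1500 maps to 2000 as in the slide's example.

```python
import math
import re


def round_one_sig(x):
    """Round to one significant figure, with halves rounded away from zero."""
    if x == 0:
        return 0.0
    magnitude = 10 ** math.floor(math.log10(abs(x)))
    return math.copysign(math.floor(abs(x) / magnitude + 0.5) * magnitude, x)


def number_features(token):
    """Return a dict of simple features describing a numeric token."""
    try:
        value = float(token)
    except ValueError:
        value = None  # e.g. "1st" or "70mph" have no plain numeric value
    return {
        "leading_zero": token.startswith("0"),
        "is_integer": re.fullmatch(r"[0-9]+", token) is not None,
        "num_digits": len(re.sub(r"[^0-9]", "", token)),
        "value": value,
        "rounded_1sf": None if value is None else round_one_sig(value),
        "non_digit_residue": re.sub(r"[0-9]", "", token),
    }


for token in ("2007", "007", "1st", "70mph"):
    print(token, number_features(token))
```

With rounding to one significant figure, 2007 and 2008 share the value 2000 while 5 does not, which matches the earlier intuition that their models should be similar.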