A BAYESIAN APPROACH TO SPELLING CORRECTION. ‘Noisy channels’ In a number of tasks involving...
-
Upload
alan-bates -
Category
Documents
-
view
220 -
download
0
Transcript of A BAYESIAN APPROACH TO SPELLING CORRECTION. ‘Noisy channels’ In a number of tasks involving...
![Page 1: A BAYESIAN APPROACH TO SPELLING CORRECTION. ‘Noisy channels’ In a number of tasks involving natural language, the problem can be viewed as recovering.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d205503460f949f5679/html5/thumbnails/1.jpg)
A BAYESIAN APPROACH TO SPELLING CORRECTION
![Page 2: A BAYESIAN APPROACH TO SPELLING CORRECTION. ‘Noisy channels’ In a number of tasks involving natural language, the problem can be viewed as recovering.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d205503460f949f5679/html5/thumbnails/2.jpg)
‘Noisy channels’
In a number of tasks involving natural language, the problem can be viewed as recovering an ‘original signal’ distorted by a `noisy channel’:
– Speech recognition– Spelling correction– OCR / handwriting recognition– (less felicitously perhaps): pronunciation variation
This metaphor has provided the justification for the Bayesian approach to statistical NLP,which has found application also outside these application areas
![Page 3: A BAYESIAN APPROACH TO SPELLING CORRECTION. ‘Noisy channels’ In a number of tasks involving natural language, the problem can be viewed as recovering.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d205503460f949f5679/html5/thumbnails/3.jpg)
Spelling Errors
They are leaving in about fifteen minuets to go to her house
The study was conducted mainly be John Black.
The design an construction of the system will take more than one year.
Hopefully, all with continue smoothly in my absence.
Can they lave him my messages?
I need to notified the bank of this problem.
He is trying to fine out.
![Page 4: A BAYESIAN APPROACH TO SPELLING CORRECTION. ‘Noisy channels’ In a number of tasks involving natural language, the problem can be viewed as recovering.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d205503460f949f5679/html5/thumbnails/4.jpg)
Handwriting recognition
From Woody Allen’s Take the Money and Run (1969)– Allen (a bank robber), walks up to the teller and
hands her a note that reads. "I have a gun. Give me all your cash."
The teller, however, is puzzled, because he reads "I have a gub." "No, it's gun", Allen says. "Looks like 'gub' to me," the teller says, then asks another teller to help him read the note, then another, and finally everyone is arguing over what the note means.
![Page 5: A BAYESIAN APPROACH TO SPELLING CORRECTION. ‘Noisy channels’ In a number of tasks involving natural language, the problem can be viewed as recovering.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d205503460f949f5679/html5/thumbnails/5.jpg)
Spelling errors
How common are spelling errors?– .005% in carefully edited newswire– 1-3% in `normal’ human written text– 20% of web queries are misspelled (Google
includes spelling correction algorithms)– 38% in applications like directory lookup– Handwriting recognition errors:
Apple Newton: 2-3%
![Page 6: A BAYESIAN APPROACH TO SPELLING CORRECTION. ‘Noisy channels’ In a number of tasks involving natural language, the problem can be viewed as recovering.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d205503460f949f5679/html5/thumbnails/6.jpg)
Types of spelling errors
Damerau (1964): 80% of all misspelled words (non-word errors) caused by SINGLE-ERROR MISSPELLINGS:
– INSERTION: the ther– DELETION: the th– SUBSTITUTION: the thw– TRANSPOSITION: the hte
![Page 7: A BAYESIAN APPROACH TO SPELLING CORRECTION. ‘Noisy channels’ In a number of tasks involving natural language, the problem can be viewed as recovering.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d205503460f949f5679/html5/thumbnails/7.jpg)
Dealing with spelling errors (Kukich, 1992)
3 increasingly broader problems:– NON-WORD ERROR DETECTION: ‘graffe’ instead
of ‘giraffe’– ISOLATED WORD-ERROR CORRECTION:
replacing ‘graffe’ with ‘giraffe’ without looking at context
– CONTEXT-DEPENDENT ERROR DETECTION / CORRECTION: detecting also spelling errors that result in a real world
![Page 8: A BAYESIAN APPROACH TO SPELLING CORRECTION. ‘Noisy channels’ In a number of tasks involving natural language, the problem can be viewed as recovering.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d205503460f949f5679/html5/thumbnails/8.jpg)
Detecting non-word errors: Dictionaries
Peterson, 1986: large dictionaries may do more damage than good– wont– veery
Damerau and Mays (1989): no evidence this was the case
![Page 9: A BAYESIAN APPROACH TO SPELLING CORRECTION. ‘Noisy channels’ In a number of tasks involving natural language, the problem can be viewed as recovering.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d205503460f949f5679/html5/thumbnails/9.jpg)
The Noisy Channel model
![Page 10: A BAYESIAN APPROACH TO SPELLING CORRECTION. ‘Noisy channels’ In a number of tasks involving natural language, the problem can be viewed as recovering.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d205503460f949f5679/html5/thumbnails/10.jpg)
Bayesian inference
`Bayesian inference’ is the name given to techniques typically used in diagnostics to identify the CAUSE of certain OBSERVATIONS
The name ‘Bayesian’ comes from the fact that Bayes’ rule is used to ‘turn around’ a problem: from one of finding statistics about the posterior probability of the CAUSE to one of finding the posterior probability of the OBSERVATIONS
![Page 11: A BAYESIAN APPROACH TO SPELLING CORRECTION. ‘Noisy channels’ In a number of tasks involving natural language, the problem can be viewed as recovering.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d205503460f949f5679/html5/thumbnails/11.jpg)
Bayesian inference: the equations
(These are equations that we will encounter again and again for different tasks)
The statistical formulation of the problem of finding the most likely `explanation’ for the observation:
Using Bayes’ Rule, this probability can be `turned around’:
)|(argmaxˆVw
OwPw
)(
)()|(argmaxˆ
Vw OP
wPwOPw
![Page 12: A BAYESIAN APPROACH TO SPELLING CORRECTION. ‘Noisy channels’ In a number of tasks involving natural language, the problem can be viewed as recovering.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d205503460f949f5679/html5/thumbnails/12.jpg)
Bayesian equations, 2
Some of these quantities are easy to compute, but others much less so – especially P(O)
Fortunately, we don’t really need to compute this term!! (It’s the same for ALL `explanations’)
This equation is a pattern that we will encounter again and again.
priorlikelihood
Vw)()|(argmaxˆ wPwOPw
![Page 13: A BAYESIAN APPROACH TO SPELLING CORRECTION. ‘Noisy channels’ In a number of tasks involving natural language, the problem can be viewed as recovering.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d205503460f949f5679/html5/thumbnails/13.jpg)
Applying the Bayesian Method to Spelling: Kernigham et al, 1990
correct takes words rejected by spell and generates a list of potential correct words
Two steps:1. Proposing candidate corrections
2. Scoring the candidates
An example of isolated word-error correction
![Page 14: A BAYESIAN APPROACH TO SPELLING CORRECTION. ‘Noisy channels’ In a number of tasks involving natural language, the problem can be viewed as recovering.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d205503460f949f5679/html5/thumbnails/14.jpg)
Proposing candidate corrections
The noisy channel assumption: misspelled word the result of a `noisy channel’ – the typist performing a single MISTYPING OPERATION
Four possible operations: – INSERTION: x xy– DELETION: xy x– SUBSTITUTION: y x– REVERSAL: xy yx
At most one operation involved (cfr. Damerau, 1964)
![Page 15: A BAYESIAN APPROACH TO SPELLING CORRECTION. ‘Noisy channels’ In a number of tasks involving natural language, the problem can be viewed as recovering.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d205503460f949f5679/html5/thumbnails/15.jpg)
Example: acress
Error Correction Corr letter Error letter
Position Type
acress actress t - 2 del
acress cress - a 0 ins
acress caress ca ac 0 transp
acress access c r 2 sub
acress across o e 3 sub
acress acres - s 5 ins
![Page 16: A BAYESIAN APPROACH TO SPELLING CORRECTION. ‘Noisy channels’ In a number of tasks involving natural language, the problem can be viewed as recovering.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d205503460f949f5679/html5/thumbnails/16.jpg)
Scoring the candidates
Choose the correction with the highest probability:
P(c): MLE estimation in a 44M words corpus, with smoothing (Good-Turing)
priorlikelihood
Cc)()|(argmaxˆ cPctPc
c freq(c) p(c)
actress 1343 .0000315
cress 0 .000000014
caress 4 .0000001
access 2280 .000058
![Page 17: A BAYESIAN APPROACH TO SPELLING CORRECTION. ‘Noisy channels’ In a number of tasks involving natural language, the problem can be viewed as recovering.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d205503460f949f5679/html5/thumbnails/17.jpg)
A simplification
THE TRAINING CORPUS:
acress actress actressacress acres acres acres
WOULD WANT:
Likelihoods: P(acress|actress) = 1/3 P(acress|acress) = ¼
APPROXIMATE WITH:
P(acress|actress) = del[ct,c] / count[ct] = 1/3 (?)
![Page 18: A BAYESIAN APPROACH TO SPELLING CORRECTION. ‘Noisy channels’ In a number of tasks involving natural language, the problem can be viewed as recovering.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d205503460f949f5679/html5/thumbnails/18.jpg)
Confusion matrices
Difficult to compute directly, but can be estimated by looking at LOCAL FACTORS only
Entry [m,n] in a CONFUSION MATRIX for SUBSTITUTION will tell us how often n is used instead of m
Kernighan et al used four confusion matrices:– del[x,y] (number of times x is typed instead of correct xy)– ins[x,y] (number of times xy is typed instead of correct x)– sub[x,y] (number of times y is typed instead of correct x)– trans[x,y] (number of times yx is typed instead of correct xy)
![Page 19: A BAYESIAN APPROACH TO SPELLING CORRECTION. ‘Noisy channels’ In a number of tasks involving natural language, the problem can be viewed as recovering.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d205503460f949f5679/html5/thumbnails/19.jpg)
Estimating the likelihood of a typo
],[
],[][
],[][
],[],[
],[
)|(
1
1
1
1
1
1
pp
pp
p
pp
p
pp
pp
pp
cccount
cctransccount
ctsubccount
tcinscccount
ccdel
ctP
![Page 20: A BAYESIAN APPROACH TO SPELLING CORRECTION. ‘Noisy channels’ In a number of tasks involving natural language, the problem can be viewed as recovering.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d205503460f949f5679/html5/thumbnails/20.jpg)
Resulting likelihoods
c Freq(c) P(c) P(t|c) P(t|c)p(c) %
actress 1343 .0000315 .000117 3.69x10-9 37%
cress 0 .000000014 .00000144 2.02x10-12 0%
caress 4 .0000001 .00000164 1.64x10-13 0%
access 2280 .000058 .000000209 1.21x10-11 0%
across 8436 .00019 .0000093 1.77x10-9 18%
acres 2879 .000065 .0000321 2.09x10-9 21%
acres 2879 .000065 .0000342 2.09x10-9 23%
![Page 21: A BAYESIAN APPROACH TO SPELLING CORRECTION. ‘Noisy channels’ In a number of tasks involving natural language, the problem can be viewed as recovering.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d205503460f949f5679/html5/thumbnails/21.jpg)
Evaluation (3 judges, 329 triples)
Method Results Perc. Correct
correct 286/329 87±1.9%
no prior 263/329 80 ±2.2%
no channel 247/329 75 ±2.4%
neither 172/329 52 ±2.8
Judge 1 271/273 99 ±0.5
Judge 2 271/275 99 ±0.7
Judge 3 271/281 96 ±1.1
![Page 22: A BAYESIAN APPROACH TO SPELLING CORRECTION. ‘Noisy channels’ In a number of tasks involving natural language, the problem can be viewed as recovering.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d205503460f949f5679/html5/thumbnails/22.jpg)
More sophisticated methods
MINIMUM EDIT DISTANCE: allow for the possibility of more than one problem
N-GRAM models: use context (detect ‘real words’)
![Page 23: A BAYESIAN APPROACH TO SPELLING CORRECTION. ‘Noisy channels’ In a number of tasks involving natural language, the problem can be viewed as recovering.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d205503460f949f5679/html5/thumbnails/23.jpg)
References
Jurafsky and Martin, chapter 5 Kernighan, M. D., Church, K. W., and Gale, W. A.
(1990). A spelling correction method based on a noisy channel model. COLING-90, 205-211.
Karen Kukich (1992). Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4), 377-439.
More recent work:– Brill, E. and Moore, R. An improved error model for noisy
channel spelling correction Proc. ACL 2000