Transcript of Introduction to Natural Language Processing (600.465): Language Modeling (and the Noisy Channel)
*Introduction to Natural Language Processing (600.465)
Language Modeling (and the Noisy Channel)
Dr. Jan Hajič
CS Dept., Johns Hopkins Univ.
www.cs.jhu.edu/~hajic
The Noisy Channel
• Prototypical case:
  Input → The channel (adds noise) → Output (noisy)
  0,1,1,1,0,1,0,1,... → 0,1,1,0,0,1,1,0,...
• Model: probability of error (noise):
• Example: p(0|1) = .3, p(1|1) = .7, p(1|0) = .4, p(0|0) = .6
• The Task:
known: the noisy output; want to know: the input (decoding)
Noisy Channel Applications
• OCR
  – straightforward: text → print (adds noise), scan → image
• Handwriting recognition
  – text → neurons, muscles (“noise”), scan/digitize → image
• Speech recognition (dictation, commands, etc.)
  – text → conversion to acoustic signal (“noise”) → acoustic waves
• Machine Translation
  – text in target language → translation (“noise”) → source language
• Also: Part-of-Speech Tagging
  – sequence of tags → selection of word forms → text
Noisy Channel: The Golden Rule of OCR, ASR, HR, MT, ...
• Recall:
  p(A|B) = p(B|A) p(A) / p(B)   (Bayes formula)
  A_best = argmax_A p(B|A) p(A)   (The Golden Rule)
• p(B|A): the acoustic/image/translation/lexical model
  – application-specific name
  – will be explored later
• p(A): the language model
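The Golden Rule can be exercised on the binary channel from the earlier slide. A minimal sketch, where the uniform prior p(A) is an assumption standing in for a real language model:

```python
# Decoding one noisy bit via the Golden Rule, with the channel model from
# the earlier slide: p(0|1)=.3, p(1|1)=.7, p(1|0)=.4, p(0|0)=.6.
# The uniform prior p(A) below is an assumption for illustration.

channel = {('0', '1'): 0.3, ('1', '1'): 0.7,   # p(observed | sent)
           ('1', '0'): 0.4, ('0', '0'): 0.6}
prior = {'0': 0.5, '1': 0.5}                   # assumed p(A)

def decode(observed):
    # A_best = argmax_A p(B|A) p(A)
    return max(prior, key=lambda a: channel[(observed, a)] * prior[a])

print(decode('1'), decode('0'))  # 1 0
```

With this prior the channel dominates: observing 1 favors input 1 (.7 × .5 = .35 beats .4 × .5 = .20). A stronger language model would shift these decisions.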
Dan Jurafsky
Probabilistic Language Models
Why?
Dan Jurafsky
Probabilistic Language Modeling
• Goal: compute the probability of a sentence or sequence of words:
  P(W) = P(w1,w2,w3,w4,w5,...,wn)
• Related task: probability of an upcoming word:
  P(w5|w1,w2,w3,w4)
• A model that computes either of these, P(W) or P(wn|w1,w2,...,wn-1), is called a language model.
• A better name would be “the grammar,” but “language model” (or LM) is standard.
The Perfect Language Model
• Sequence of word forms [forget about tagging for the moment]
• Notation: A ~ W = (w1,w2,w3,...,wd)
• The big (modeling) question:
p(W) = ?
• Well, we know (Bayes/chain rule →):
  p(W) = p(w1,w2,w3,...,wd) = p(w1) × p(w2|w1) × p(w3|w1,w2) × ... × p(wd|w1,w2,...,wd-1)
• Not practical (even for short W, too many parameters)
Markov Chain
• Unlimited memory (cf. previous foil):
  – for wi, we know all its predecessors w1,w2,w3,...,wi-1
• Limited memory:
  – we disregard predecessors that are “too old”
  – remember only k previous words: wi-k,wi-k+1,...,wi-1
  – called a “kth order Markov approximation”
• Plus the stationarity assumption (no change over time):
  p(W) ≅ ∏i=1..d p(wi|wi-k,wi-k+1,...,wi-1),   d = |W|
n-gram Language Models
• (n-1)th order Markov approximation → n-gram LM:
  p(W) =df ∏i=1..d p(wi|wi-n+1,wi-n+2,...,wi-1)
  (wi is the prediction; wi-n+1,...,wi-1 is the history)
• In particular (assume vocabulary size |V| = 60k):
  • 0-gram LM: uniform model, p(w) = 1/|V|, 1 parameter
  • 1-gram LM: unigram model, p(w), 6×10⁴ parameters
  • 2-gram LM: bigram model, p(wi|wi-1), 3.6×10⁹ parameters
  • 3-gram LM: trigram model, p(wi|wi-2,wi-1), 2.16×10¹⁴ parameters
LM: Observations
• How large should n be?
  – theoretically, nothing is enough
  – but anyway: as large as possible (→ close to the “perfect” model)
  – empirically: 3
    • parameter estimation? (reliability, data availability, storage space, ...)
    • 4 is too much: |V| = 60k → 1.296×10¹⁹ parameters
    • but 6-7 would be (almost) ideal (given enough data): in fact, one can recover the original text from its 7-grams!
• Reliability ~ 1 / Detail (→ need a compromise; detail = higher-order n-grams)
• For now, keep word forms (no “linguistic” processing)
Parameter Estimation
• Parameter: a numerical value needed to compute p(w|h)
• From data (how else?)
• Data preparation:
  • get rid of formatting etc. (“text cleaning”)
  • define words (separate punctuation but include it; call each token a “word”)
  • define sentence boundaries (insert the “words” <s> and </s>)
  • letter case: keep, discard, or be smart:
    – name recognition
    – number type identification
    [these are huge problems per se!]
  • numbers: keep, replace by <num>, or be smart (form ~ pronunciation)
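A minimal sketch of these preparation steps (the function name `normalize` and its regexes are illustrative, not from any particular toolkit): replace numbers by <num>, separate but keep punctuation, and insert the boundary "words":

```python
import re

# Hypothetical text-preparation helper: numbers -> <num>, punctuation kept
# as separate tokens, sentence-boundary "words" <s> and </s> inserted.
def normalize(text):
    text = re.sub(r'\d+', '<num>', text)             # numbers -> <num>
    tokens = re.findall(r'<num>|\w+|[^\w\s]', text)  # words + punctuation
    return ['<s>'] + tokens + ['</s>']

print(normalize('He bought 2 cans of soda.'))
```

A real pipeline would also handle the case and name-recognition issues listed above; this sketch deliberately keeps case as-is.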
Maximum Likelihood Estimate
• MLE: relative frequency...
  – ...best predicts the data at hand (the “training data”)
• Trigrams from training data T:
  – count sequences of three words in T: c3(wi-2,wi-1,wi)
    [NB: the notation just says that the three words follow each other]
  – count sequences of two words in T: c2(wi-1,wi):
    • either use c2(y,z) = ∑w c3(y,z,w)
    • or count differently at the beginning (& end) of the data!
  p(wi|wi-2,wi-1) =est. c3(wi-2,wi-1,wi) / c2(wi-2,wi-1)
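The MLE trigram estimate, with c2 derived by summing c3 as above, can be sketched as:

```python
from collections import Counter

# MLE trigram estimation: p(w3|w1,w2) = c3(w1,w2,w3) / c2(w1,w2),
# with c2(y,z) = sum over w of c3(y,z,w), as on the slide.
def mle_trigram(tokens):
    c3 = Counter(zip(tokens, tokens[1:], tokens[2:]))
    c2 = Counter()
    for (y, z, w), n in c3.items():
        c2[(y, z)] += n                  # c2(y,z) = sum_w c3(y,z,w)
    return lambda w, y, z: c3[(y, z, w)] / c2[(y, z)]

p = mle_trigram('<s> <s> He can buy the can of soda .'.split())
print(p('buy', 'He', 'can'))  # 1.0
```

On this tiny corpus every trigram history is seen exactly once, so every estimate is 1.0, which is exactly the overfitting the later slides address.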
Character Language Model
• Use individual characters instead of words:
  p(W) =df ∏i=1..d p(ci|ci-n+1,ci-n+2,...,ci-1)
• Same formulas etc.
• Might consider 4-grams, 5-grams or even more
• Good only for language comparison
• To convert cross-entropy between letter- and word-based models:
  HS(pc) = HS(pw) / (avg. # of characters per word in S)
LM: an Example
• Training data: <s> <s> He can buy the can of soda.
  – Unigram: p1(He) = p1(buy) = p1(the) = p1(of) = p1(soda) = p1(.) = .125,
    p1(can) = .25
  – Bigram: p2(He|<s>) = 1, p2(can|He) = 1, p2(buy|can) = .5, p2(of|can) = .5, p2(the|buy) = 1, ...
  – Trigram: p3(He|<s>,<s>) = 1, p3(can|<s>,He) = 1, p3(buy|He,can) = 1, p3(of|the,can) = 1, ..., p3(.|of,soda) = 1
  – (normalized for all n-grams) Entropy: H(p1) = 2.75, H(p2) = .25, H(p3) = 0 ← Great?!
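The unigram entropy H(p1) = 2.75 bits can be checked directly; the two <s> padding "words" are excluded from the unigram counts, as on the slide:

```python
from collections import Counter
from math import log2

# H(p1) = -sum_w p1(w) log2 p1(w) for the example's training data.
tokens = 'He can buy the can of soda .'.split()
counts = Counter(tokens)
n = len(tokens)                                   # 8 tokens
H = -sum((c / n) * log2(c / n) for c in counts.values())
print(H)  # 2.75
```

Six words at probability .125 contribute 3 bits each and "can" at .25 contributes 2 bits, giving 6 × .375 + .5 = 2.75.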
Dan Jurafsky
Language Modeling Toolkits
• SRILM
  • http://www.speech.sri.com/projects/srilm/
Dan Jurafsky
Google N-Gram Release, August 2006
…
Dan Jurafsky
Google N-Gram Release
• serve as the incoming 92
• serve as the incubator 99
• serve as the independent 794
• serve as the index 223
• serve as the indication 72
• serve as the indicator 120
• serve as the indicators 45
• serve as the indispensable 111
• serve as the indispensible 40
• serve as the individual 234
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Dan Jurafsky
Evaluation: How good is our model?
• Does our language model prefer good sentences to bad ones?
  • Assign higher probability to “real” or “frequently observed” sentences
    than to “ungrammatical” or “rarely observed” sentences?
• We train the parameters of our model on a training set.
• We test the model’s performance on data we haven’t seen.
  • A test set is an unseen dataset that is different from our training set, totally unused.
  • An evaluation metric tells us how well our model does on the test set.
Dan Jurafsky
Extrinsic evaluation of N-gram models
• Best evaluation for comparing models A and B:
  • Put each model in a task
    • spelling corrector, speech recognizer, MT system
  • Run the task, get an accuracy for A and for B
    • How many misspelled words corrected properly
    • How many words translated correctly
  • Compare accuracy for A and B
Dan Jurafsky
Difficulty of extrinsic (in-vivo) evaluation of N-gram models
• Extrinsic evaluation
  • Time-consuming; can take days or weeks
• So
  • Sometimes use intrinsic evaluation: perplexity
  • Bad approximation
    • unless the test data looks just like the training data
    • So generally only useful in pilot experiments
  • But it is helpful to think about.
Dan Jurafsky
Intuition of Perplexity
• The Shannon Game:
  • How well can we predict the next word?
    I always order pizza with cheese and ____
    The 33rd President of the US was ____
    I saw a ____
  • Unigrams are terrible at this game. (Why?)
• A better model of a text is one which assigns a higher probability to the word that actually occurs.

Possible continuations of the first example, with probabilities:
  mushrooms 0.1
  pepperoni 0.1
  anchovies 0.01
  ...
  fried rice 0.0001
  ...
  and 1e-100

[Photo: Claude Shannon]
Dan Jurafsky
Perplexity
• Perplexity is the inverse probability of the test set, normalized by the number of words:
  PP(W) = P(w1 w2 ... wN)^(−1/N)
• Chain rule:
  PP(W) = ( ∏i=1..N 1/P(wi|w1,...,wi-1) )^(1/N)
• For bigrams:
  PP(W) = ( ∏i=1..N 1/P(wi|wi-1) )^(1/N)
• Minimizing perplexity is the same as maximizing probability
• The best language model is one that best predicts an unseen test set
  • Gives the highest P(sentence)
Dan Jurafsky
The Shannon Game intuition for perplexity
• From Josh Goodman
• How hard is the task of recognizing the digits ‘0,1,2,3,4,5,6,7,8,9’?
  • Perplexity 10
• How hard is recognizing (30,000) names at Microsoft?
  • Perplexity = 30,000
• If a system has to recognize
  • Operator (1 in 4)
  • Sales (1 in 4)
  • Technical Support (1 in 4)
  • 30,000 names (1 in 120,000 each)
  • Perplexity is 53
• Perplexity is the weighted equivalent branching factor
Dan Jurafsky
Perplexity as branching factor
• Let’s suppose a sentence consisting of random digits
• What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?
  PP(W) = ((1/10)^N)^(−1/N) = 10
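Computing PP(W) = P(W)^(−1/N) for the uniform digit model confirms the branching-factor reading; a sketch:

```python
from math import prod

# PP(W) = P(W)^(-1/N); for P = 1/10 per digit this is 10 for any N.
def perplexity(probs):
    n = len(probs)
    return prod(probs) ** (-1 / n)

digits = '3 1 4 1 5 9 2 6'.split()
print(round(perplexity([1 / 10] * len(digits)), 6))  # 10.0
```

The answer does not depend on the sentence length: perplexity measures the effective branching factor of the task, not the length of the test string.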
Dan Jurafsky
Lower perplexity = better model
• Training: 38 million words; test: 1.5 million words; WSJ

  N-gram order:  Unigram   Bigram   Trigram
  Perplexity:        962      170       109
Dan Jurafsky
The Wall Street Journal
LM: an Example
• Training data: <s> <s> He can buy the can of soda.
  – Unigram: p1(He) = p1(buy) = p1(the) = p1(of) = p1(soda) = p1(.) = .125,
    p1(can) = .25
  – Bigram: p2(He|<s>) = 1, p2(can|He) = 1, p2(buy|can) = .5, p2(of|can) = .5, p2(the|buy) = 1, ...
  – Trigram: p3(He|<s>,<s>) = 1, p3(can|<s>,He) = 1, p3(buy|He,can) = 1, p3(of|the,can) = 1, ..., p3(.|of,soda) = 1
  – (normalized for all n-grams) Entropy: H(p1) = 2.75, H(p2) = .25, H(p3) = 0 ← Great?!
LM: an Example (The Problem)
• Cross-entropy:
  • S = <s> <s> It was the greatest buy of all. (test data)
• Even HS(p1) fails (= HS(p2) = HS(p3) = ∞), because:
  – all unigrams but p1(the), p1(buy), p1(of) and p1(.) are 0
  – all bigram probabilities are 0
  – all trigram probabilities are 0
• We want to make all probabilities non-zero → data sparseness handling
The Zero Problem
• “Raw” n-gram language model estimate:
  – necessarily, some zeros
    • many: a trigram model has 2.16×10¹⁴ parameters, but data ~ 10⁹ words
  – which are true 0s?
    • optimal situation: even the least frequent trigram would be seen several times, in order to distinguish its probability from the other trigrams’
    • unfortunately, the optimal situation cannot happen (open question: how much data would we need?)
  – → we don’t know
  – we must eliminate the zeros
• Two kinds of zeros: p(w|h) = 0, or even p(h) = 0!
Why do we need Nonzero Probs?
• To avoid infinite cross-entropy:
  – happens when an event is found in the test data which has not been seen in the training data
  – H(p) = ∞ prevents comparing data with > 0 “errors”
• To make the system more robust
  – low count estimates:
    • they typically happen for “detailed” but relatively rare configurations
  – high count estimates: reliable but less “detailed”
Eliminating the Zero Probabilities: Smoothing
• Get a new p’(w) (same Ω): almost p(w), but with no zeros
• Discount w for (some) p(w) > 0: new p’(w) < p(w)
  ∑w∈discounted (p(w) − p’(w)) = D
• Distribute D to all w with p(w) = 0: new p’(w) > p(w)
  – possibly also to other w with low p(w)
• For some w (possibly): p’(w) = p(w)
• Make sure ∑w∈Ω p’(w) = 1
• There are many ways of smoothing
Smoothing by Adding 1 (Laplace)
• Simplest but not really usable:
  – Predicting words w from a vocabulary V, training data T:
    p’(w|h) = (c(h,w) + 1) / (c(h) + |V|)
    • for non-conditional distributions: p’(w) = (c(w) + 1) / (|T| + |V|)
  – Problem if |V| > c(h) (as is often the case; sometimes even |V| >> c(h)!)
• Example: Training data: <s> what is it what is small ?   |T| = 8
  • V = { what, is, it, small, ?, <s>, flying, birds, are, a, bird, . }, |V| = 12
  • p(it) = .125, p(what) = .25, p(.) = 0
    p(what is it?) = .25² × .125² ≅ .001
    p(it is flying.) = .125 × .25 × 0² = 0
  • p’(it) = .1, p’(what) = .15, p’(.) = .05
    p’(what is it?) = .15² × .1² ≅ .0002
    p’(it is flying.) = .1 × .15 × .05² ≅ .00004
  (assume word independence!)
Adding less than 1
• Equally simple:
  – Predicting words w from a vocabulary V, training data T:
    p’(w|h) = (c(h,w) + λ) / (c(h) + λ|V|),  λ < 1
    • for non-conditional distributions: p’(w) = (c(w) + λ) / (|T| + λ|V|)
• Example: Training data: <s> what is it what is small ?   |T| = 8
  • V = { what, is, it, small, ?, <s>, flying, birds, are, a, bird, . }, |V| = 12
  • p(it) = .125, p(what) = .25, p(.) = 0
    p(what is it?) = .25² × .125² ≅ .001
    p(it is flying.) = .125 × .25 × 0² = 0
• Use λ = .1:
  • p’(it) ≅ .12, p’(what) ≅ .23, p’(.) ≅ .01
    p’(what is it?) = .23² × .12² ≅ .0007
    p’(it is flying.) = .12 × .23 × .01² ≅ .000003
Language Modeling
Advanced: Good Turing Smoothing
Reminder: Add-1 (Laplace) Smoothing
More general formulations: Add-k
Unigram prior smoothing
Advanced smoothing algorithms
• Intuition used by many smoothing algorithms:
  – Good-Turing
  – Kneser-Ney
  – Witten-Bell
• Use the count of things we’ve seen once
  – to help estimate the count of things we’ve never seen
Notation: Nc = Frequency of frequency c
• Nc = the count of things we’ve seen c times
• Example: Sam I am I am Sam I do not eat
  I: 3, Sam: 2, am: 2, do: 1, not: 1, eat: 1
  N1 = 3, N2 = 2, N3 = 1
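Computing the Nc values for this toy corpus takes one extra `Counter` over the word counts:

```python
from collections import Counter

# Frequency-of-frequency counts for the corpus on the slide:
# N1 = 3 (do, not, eat), N2 = 2 (Sam, am), N3 = 1 (I).
tokens = 'Sam I am I am Sam I do not eat'.split()
word_counts = Counter(tokens)
N = Counter(word_counts.values())   # N[c] = number of word types seen c times
print(N[1], N[2], N[3])  # 3 2 1
```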
Good-Turing smoothing intuition
• You are fishing (a scenario from Josh Goodman), and have caught:
  – 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish
• How likely is it that the next species is trout?
  – 1/18
• How likely is it that the next species is new (i.e., catfish or bass)?
  – Let’s use our estimate of things-we-saw-once to estimate the new things.
  – 3/18 (because N1 = 3)
• Assuming so, how likely is it that the next species is trout?
  – Must be less than 1/18 (discounted by 3/18)!
  – How to estimate?
Good-Turing calculations
• Unseen (bass or catfish):
  – c = 0
  – MLE p = 0/18 = 0
  – P*GT(unseen) = N1/N = 3/18
• Seen once (trout):
  – c = 1
  – MLE p = 1/18
  – c*(trout) = 2 × N2/N1 = 2 × 1/3 = 2/3
  – P*GT(trout) = (2/3)/18 = 1/27
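These numbers can be reproduced from the frequency-of-frequency table of the catch:

```python
# Good-Turing estimates for the fishing example: 18 fish with
# frequency-of-frequencies N1=3 (trout, salmon, eel), N2=1, N3=1, N10=1.
Nc = {1: 3, 2: 1, 3: 1, 10: 1}
N = 18

def c_star(c):
    # discounted count c* = (c+1) * N_{c+1} / N_c
    return (c + 1) * Nc.get(c + 1, 0) / Nc[c]

p_unseen = Nc[1] / N        # P*_GT(unseen) = N1/N = 3/18
p_trout = c_star(1) / N     # (2 * 1/3) / 18 = 1/27
print(round(p_unseen, 4), round(p_trout, 4))  # 0.1667 0.037
```

As promised, the trout estimate drops from the MLE 1/18 to 1/27 because 3/18 of the mass was reserved for unseen species.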
Ney et al.’s Good Turing Intuition
[Figure: held-out counts of words, grouped by training-set frequency]
H. Ney, U. Essen, and R. Kneser. 1995. On the estimation of ‘small’ probabilities by leaving-one-out. IEEE Trans. PAMI 17(12):1202–1212.
Ney et al.’s Good-Turing Intuition (slide from Dan Klein)
• Intuition from leave-one-out validation:
  – Take each of the c training words out in turn
  – c training sets of size c−1, held-out sets of size 1
  – What fraction of held-out words are unseen in training? N1/c
  – What fraction of held-out words are seen k times in training? (k+1)Nk+1/c
  – So in the future we expect (k+1)Nk+1/c of the words to be those with training count k
  – There are Nk words with training count k
  – Each should occur with probability: (k+1)Nk+1/c/Nk
  – ...or expected count: k* = (k+1)Nk+1/Nk
[Figure: frequency-of-frequency bars N0, N1, N2, ..., N3510, N4417 for the training data vs. the held-out data]
Good-Turing complications (slide from Dan Klein)
• Problem: what about “the”? (say c = 4417)
  – For small k, Nk > Nk+1
  – For large k, the counts are too jumpy, and zeros wreck the estimates
  – Simple Good-Turing [Gale and Sampson]: replace the empirical Nk with a best-fit power law once counts get unreliable
Resulting Good-Turing numbers
• Numbers from Church and Gale (1991)
• 22 million words of AP Newswire

  Count c   Good-Turing c*
  0         .0000270
  1         0.446
  2         1.26
  3         2.24
  4         3.24
  5         4.22
  6         5.19
  7         6.21
  8         7.24
  9         8.25
Language Modeling
Advanced:
Kneser-Ney Smoothing
![Page 48: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/48.jpg)
Resulting Good-Turing numbers
• Numbers from Church and Gale (1991)
• 22 million words of AP Newswire
• It sure looks like c* ≅ (c − .75)
  Count c   Good-Turing c*
  0         .0000270
  1         0.446
  2         1.26
  3         2.24
  4         3.24
  5         4.22
  6         5.19
  7         6.21
  8         7.24
  9         8.25
Absolute Discounting Interpolation
• Save ourselves some time and just subtract 0.75 (or some d):
  P_AbsoluteDiscounting(wi|wi-1) = (c(wi-1,wi) − d) / c(wi-1) + λ(wi-1) P(wi)
  (discounted bigram + interpolation weight × unigram)
  – (Maybe keeping a couple of extra values of d for counts 1 and 2)
• But should we really just use the regular unigram P(w)?
Kneser-Ney Smoothing I
• Better estimate for probabilities of lower-order unigrams!
  – Shannon game: I can’t see without my reading ____?
  – “Francisco” is more common than “glasses”
  – ... but “Francisco” always follows “San”
• The unigram is useful exactly when we haven’t seen this bigram!
• Instead of P(w) (“How likely is w?”):
• Pcontinuation(w): “How likely is w to appear as a novel continuation?”
  – For each word, count the number of bigram types it completes
  – Every bigram type was a novel continuation the first time it was seen
Kneser-Ney Smoothing II
• How many times does w appear as a novel continuation:
  |{wi-1 : c(wi-1, w) > 0}|
• Normalized by the total number of word bigram types:
  Pcontinuation(w) = |{wi-1 : c(wi-1, w) > 0}| / |{(wj-1, wj) : c(wj-1, wj) > 0}|
Kneser-Ney Smoothing III
• Alternative metaphor: the number of word types seen to precede w,
  |{wi-1 : c(wi-1, w) > 0}|
• normalized by the number of word types preceding all words:
  Pcontinuation(w) = |{wi-1 : c(wi-1, w) > 0}| / ∑w’ |{w’i-1 : c(w’i-1, w’) > 0}|
• A frequent word (Francisco) occurring in only one context (San) will have a low continuation probability
Kneser-Ney Smoothing IV
P_KN(wi|wi-1) = max(c(wi-1, wi) − d, 0) / c(wi-1) + λ(wi-1) Pcontinuation(wi)
• λ is a normalizing constant: the probability mass we’ve discounted,
  λ(wi-1) = (d / c(wi-1)) × |{w : c(wi-1, w) > 0}|
  (the normalized discount × the number of word types that can follow wi-1
   = the # of word types we discounted = the # of times we applied the normalized discount)
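A minimal sketch of interpolated Kneser-Ney bigrams on a toy corpus, following the formulas above (d = 0.75; the variable names are illustrative):

```python
from collections import Counter

# Interpolated Kneser-Ney bigram sketch (toy corpus, d = 0.75).
tokens = '<s> he can buy the can of soda . </s>'.split()
bigrams = Counter(zip(tokens, tokens[1:]))
hist = Counter(tokens[:-1])                 # c(w_{i-1}): history counts
d = 0.75
cont = Counter(w for (_, w) in bigrams)     # |{w' : c(w', w) > 0}|
total_types = len(bigrams)                  # total number of bigram types

def p_kn(w, prev):
    p_cont = cont[w] / total_types
    follow_types = sum(1 for b in bigrams if b[0] == prev)
    lam = (d / hist[prev]) * follow_types   # normalizing constant lambda
    return max(bigrams[(prev, w)] - d, 0) / hist[prev] + lam * p_cont

print(round(p_kn('buy', 'can'), 3))  # 0.208
```

The discounted mass for the history "can" (2 × 0.75 / 2) is exactly the λ weight given to the continuation distribution, so the probabilities over all word types still sum to 1.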
Kneser-Ney Smoothing: Recursive formulation
P_KN(wi|wi-n+1 ... wi-1) = max(c_KN(wi-n+1 ... wi) − d, 0) / c_KN(wi-n+1 ... wi-1) + λ(wi-n+1 ... wi-1) P_KN(wi|wi-n+2 ... wi-1)
where c_KN(•) is the regular count for the highest order, and the continuation count for lower orders.
• Continuation count = the number of unique single-word contexts for •
Backoff and Interpolation
• Sometimes it helps to use less context
  – Condition on less context for contexts you haven’t learned much about
• Backoff:
  – use trigram if you have good evidence,
  – otherwise bigram, otherwise unigram
• Interpolation:
  – mix unigram, bigram, trigram
• Interpolation works better
Smoothing by Combination: Linear Interpolation
• Combine what?
  • distributions of various levels of detail vs. reliability
• n-gram models:
  • use (n−1)-gram, (n−2)-gram, ..., uniform
  • detail decreases, and reliability increases, toward the uniform end
• Simplest possible combination:
  – sum of probabilities, normalize:
    • p(0|0) = .8, p(1|0) = .2, p(0|1) = 1, p(1|1) = 0, p(0) = .4, p(1) = .6:
    • p’(0|0) = .6, p’(1|0) = .4, p’(0|1) = .7, p’(1|1) = .3
    • (p’(0|0) = 0.5 × p(0|0) + 0.5 × p(0))
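The four smoothed values can be verified directly from p’(x|y) = 0.5 × p2(x|y) + 0.5 × p1(x):

```python
# Verify the simplest-combination example with equal weights 0.5/0.5.
p2 = {('0', '0'): .8, ('1', '0'): .2, ('0', '1'): 1.0, ('1', '1'): 0.0}
p1 = {'0': .4, '1': .6}

# key = (predicted symbol, history symbol)
p_prime = {k: 0.5 * v + 0.5 * p1[k[0]] for k, v in p2.items()}
print({k: round(x, 2) for k, x in p_prime.items()})
# {('0', '0'): 0.6, ('1', '0'): 0.4, ('0', '1'): 0.7, ('1', '1'): 0.3}
```

Because the component distributions each sum to 1 for a fixed history and the weights sum to 1, the interpolated distribution needs no extra normalization.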
Typical n-gram LM Smoothing
• Weight in less detailed distributions using λ = (λ0, λ1, λ2, λ3):
  p’(wi|wi-2,wi-1) = λ3 p3(wi|wi-2,wi-1) + λ2 p2(wi|wi-1) + λ1 p1(wi) + λ0/|V|
• Normalize: λi ≥ 0, ∑i=0..n λi = 1 is sufficient (λ0 = 1 − ∑i=1..n λi) (n = 3)
• Estimation using MLE:
  – fix the p3, p2, p1 and |V| parameters as estimated from the training data
  – then find the {λi} which minimizes the cross-entropy (maximizes the probability of the data): −(1/|D|) ∑i=1..|D| log2(p’(wi|hi))
Held-out Data
• What data to use to estimate λ?
  – (bad) try the training data T: but we will always get λ3 = 1
    • why? (let piT be the i-gram distribution estimated by relative frequency from T)
    • minimizing HT(p’) over a vector λ, p’ = λ3 p3T + λ2 p2T + λ1 p1T + λ0/|V|
      – remember: HT(p’) = H(p3T) + D(p3T||p’); p3T is fixed → H(p3T) is fixed (and best)
      – which p’ minimizes HT(p’)? Obviously, a p’ for which D(p3T||p’) = 0
      – ...and that’s p’ = p3T (because D(p||p) = 0, as we know)
      – ...and certainly p’ = p3T if λ3 = 1 (maybe in some other cases, too)
      – (p’ = 1 × p3T + 0 × p2T + 0 × p1T + 0/|V|)
  – thus: do not use the training data to estimate λ!
    • we must hold out part of the training data (“held-out” data, H)
    • ...call the remaining data the (true/raw) training data, T
    • the test data S (e.g., for comparison purposes): still different data!
The Formulas (for H)
• Repeat: minimizing −(1/|H|) ∑i=1..|H| log2(p’(wi|hi)) over λ, where
  p’(wi|hi) = p’(wi|wi-2,wi-1) = λ3 p3(wi|wi-2,wi-1) + λ2 p2(wi|wi-1) + λ1 p1(wi) + λ0/|V|
• “Expected counts (of the lambdas)”: j = 0..3 (see next page)
  c(λj) = ∑i=1..|H| λj pj(wi|hi) / p’(wi|hi)
• “Next λ”: j = 0..3
  λj,next = c(λj) / ∑k=0..3 c(λk)
The (Smoothing) EM Algorithm
1. Start with some λ such that λj > 0 for all j ∈ 0..3.
2. Compute the “expected counts” for each λj.
3. Compute a new set of λj, using the “next λ” formula.
4. Start over at step 2, unless a termination condition is met.
   • Termination condition: convergence of λ.
     – Simply set some ε, and finish if |λj − λj,next| < ε for each j (step 3).
• Guaranteed to converge: follows from Jensen’s inequality, plus a technical proof.
Simple Example
• Raw distribution (unigram only; smooth with uniform): p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c…r}, = 0 for the rest: s,t,u,v,w,x,y,z
• Heldout data: baby; use one set of λs (λ1: unigram, λ0: uniform)
• Start with λ1 = .5; p’λ(b) = .5 × .5 + .5 / 26 = .27
p’λ(a) = .5 × .25 + .5 / 26 = .14
p’λ(y) = .5 × 0 + .5 / 26 = .02
c(λ1) = .5×.5/.27 + .5×.25/.14 + .5×.5/.27 + .5×0/.02 = 2.72
c(λ0) = .5×.04/.27 + .5×.04/.14 + .5×.04/.27 + .5×.04/.02 = 1.28
Normalize: λ1,next = .68, λ0,next = .32.
Repeat from step 2 (recompute p’λ first for efficient computation, then the c(λi), ...)
Finish when the new lambdas are almost equal to the old ones (say, < 0.01 difference).
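The single iteration above can be checked in a few lines (the .04 on the slide is the rounded value of 1/26 ≈ .038):

```python
# One EM step on heldout "baby" (lambda_1: unigram, lambda_0: uniform over 26 letters).
lam1, lam0 = 0.5, 0.5
p1 = {'b': 0.5, 'a': 0.25, 'y': 0.0}
uniform = 1.0 / 26

c1 = c0 = 0.0
for w in "baby":
    p_mix = lam1 * p1[w] + lam0 * uniform   # p'(b)=.27, p'(a)=.14, p'(y)=.02
    c1 += lam1 * p1[w] / p_mix              # expected count c(lambda_1) -> 2.72
    c0 += lam0 * uniform / p_mix            # expected count c(lambda_0) -> 1.28

lam1_next = c1 / (c1 + c0)                  # normalize: .68
lam0_next = c0 / (c1 + c0)                  # .32
```

Note that c(λ1) + c(λ0) = |H| = 4 exactly, so normalizing is just dividing by the heldout size.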
![Page 62: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/62.jpg)
62
Some More Technical Hints
• Set V = {all words from the training data}.
• You may also consider V = T ∪ H, but it does not make the coding in any way simpler (in fact, harder).
• But: you must never use the test data for your vocabulary!
• Prepend two “words” in front of all data:
• avoids beginning-of-data problems
• call these indices -1 and 0: then the formulas hold exactly
• When cn(w) = 0:
• Assign 0 probability to pn(w|h) where cn-1(h) > 0, but a uniform probability (1/|V|) to those pn(w|h) where cn-1(h) = 0 [this must be done both when working on the heldout data during EM, as well as when computing the cross-entropy on the test data!]
![Page 63: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/63.jpg)
63
Introduction to Natural Language Processing (600.465)
Mutual Information and Word Classes(class n-gram)
Dr. Jan Hajič
CS Dept., Johns Hopkins Univ.
www.cs.jhu.edu/~hajic
![Page 64: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/64.jpg)
64
The Problem
• Not enough data• Language Modeling: we do not see “correct” n-grams
– solution so far: smoothing
• suppose we see:– short homework, short assignment, simple homework
• but not:– simple assignment
• What happens to our (bigram) LM?– p(homework | simple) = high probability
– p(assignment | simple) = low probability (smoothed with p(assignment))
– They should be much closer!
![Page 65: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/65.jpg)
65
Word Classes
• Observation: similar words behave in a similar way– trigram LM:
– in the ... (all nouns/adj);
– catch a ... (all things which can be caught, incl. their accompanying adjectives);
– trigram LM, conditioning: – a ... homework (any attribute of homework: short, simple, late, difficult),
– ... the woods (any verb that has the woods as an object: walk, cut, save)
– trigram LM: both:– a (short,long,difficult,...) (homework,assignment,task,job,...)
![Page 66: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/66.jpg)
66
Solution
• Use the Word Classes as the “reliability” measure• Example: we see
• short homework, short assignment, simple homework
– but not:• simple assignment
– Cluster into classes:• (short, simple) (homework, assignment)
– covers “simple assignment”, too
• Gaining: realistic estimates for unseen n-grams• Losing: accuracy (level of detail) within classes
![Page 67: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/67.jpg)
67
The New Model• Rewrite the n-gram LM using classes:
– Was: [k = 1..n]• pk(wi|hi) = c(hi,wi) / c(hi) [history: (k-1) words]
– Introduce classes:
pk(wi|hi) = p(wi|ci) pk(ci|hi) !• history: classes, too: [for trigram: hi = ci-2,ci-1, bigram: hi = ci-1]
– Smoothing as usual• over pk(wi|hi), where each is defined as above (except uniform which stays
at 1/|V|)
![Page 68: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/68.jpg)
68
Training Data
• Suppose we already have a mapping:– r: V →C assigning each word its class (ci = r(wi))
• Expand the training data:– T = (w1, w2, ..., w|T|) into
– TC = (<w1,r(w1)>, <w2,r(w2)>, ..., <w|T|,r(w|T|)>)
• Effectively, we have two streams of data:– word stream: w1, w2, ..., w|T|
– class stream: c1, c2, ..., c|T| (def. as ci = r(wi))
• Expand Heldout, Test data too
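As a concrete sketch of the expansion (the class map r here is a made-up toy, not one induced from data):

```python
# Toy class map r: V -> C, so that c_i = r(w_i).
r = {'short': 'ADJ', 'simple': 'ADJ', 'homework': 'N', 'assignment': 'N'}

T = ['short', 'homework', 'simple', 'assignment']   # word stream w_1..w_|T|
TC = [(w, r[w]) for w in T]                         # expanded data <w_i, r(w_i)>
C = [c for _, c in TC]                              # class stream c_1..c_|T|
```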
![Page 69: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/69.jpg)
69
Training the New Model
• As expected, using ML estimates:– p(wi|ci) = p(wi|r(wi)) = c(wi) / c(r(wi)) = c(wi) / c(ci)
• !!! c(wi,ci) = c(wi) [since ci determined by wi]
– pk(ci|hi):
• p3(ci|hi) = p3(ci|ci-2 ,ci-1) = c(ci-2 ,ci-1,ci) / c(ci-2 ,ci-1)
• p2(ci|hi) = p2(ci|ci-1) = c(ci-1,ci) / c(ci-1)
• p1(ci|hi) = p1(ci) = c(ci) / |T|
• Then smooth as usual – not the p(wi|ci) nor pk(ci|hi) individually, but the pk(wi|hi)
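For the bigram case, the ML estimates above are a couple of counter lookups; a sketch, with a hypothetical function name:

```python
from collections import Counter

def train_class_bigram(words, r):
    """ML estimates p(w|c) = c(w)/c(c) and
    p2(c_i|c_{i-1}) = c(c_{i-1},c_i)/c(c_{i-1});
    smoothing, as noted above, is applied later to the combined pk(w|h)."""
    classes = [r[w] for w in words]
    cw, cc = Counter(words), Counter(classes)
    cbg = Counter(zip(classes, classes[1:]))          # class bigram counts
    p_w_given_c = {w: n / cc[r[w]] for w, n in cw.items()}
    p2 = {(d, e): n / cc[d] for (d, e), n in cbg.items()}
    return p_w_given_c, p2
```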
![Page 70: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/70.jpg)
70
Classes: How To Get Them
• We supposed the classes are given• Maybe they are in [human] dictionaries, but...
– dictionaries are incomplete
– dictionaries are unreliable
– they do not define classes as an equivalence relation (classes overlap)
– they do not define classes suitable for LM • small, short... maybe; small and difficult?
• we have to construct them from data (again...)
![Page 71: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/71.jpg)
71
Creating the Word-to-Class Map
• We will talk about bigrams from now• Bigram estimate:
• p2(ci|hi) = p2(ci|ci-1) = c(ci-1,ci) / c(ci-1) = c(r(wi-1),r(wi)) / c(r(wi-1))
• Form of the model: (class bigram)– just the raw bigram for now:
• P(T) = Πi=1..|T| p(wi|r(wi)) p2(r(wi)|r(wi-1))    (p2(c1|c0) =df p(c1))
• Maximize over r (given r → fixed p, p2):– define the objective L(r) = 1/|T| Σi=1..|T| log(p(wi|r(wi)) p2(r(wi)|r(wi-1)))
– rbest = argmaxr L(r) (L(r) = normalized logprob of the training data... as usual) (or negative cross-entropy)
![Page 72: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/72.jpg)
72
Simplifying the Objective Function• Start from L(r) = 1/|T| Σi=1..|T| log(p(wi|r(wi)) p2(r(wi)|r(wi-1))):
1/|T| Σi=1..|T| log(p(wi|r(wi)) p(r(wi)) p2(r(wi)|r(wi-1)) / p(r(wi))) =
1/|T| Σi=1..|T| log(p(wi,r(wi)) p2(r(wi)|r(wi-1)) / p(r(wi))) =
1/|T| Σi=1..|T| log(p(wi)) + 1/|T| Σi=1..|T| log(p2(r(wi)|r(wi-1)) / p(r(wi))) =
-H(W) + 1/|T| Σi=1..|T| log(p2(r(wi)|r(wi-1)) p(r(wi-1)) / (p(r(wi-1)) p(r(wi)))) =
-H(W) + 1/|T| Σi=1..|T| log(p(r(wi),r(wi-1)) / (p(r(wi-1)) p(r(wi)))) =
-H(W) + Σd,e∈C p(d,e) log( p(d,e) / (p(d) p(e)) ) =
-H(W) + I(D,E)  (the event E picks the class adjacent, to the right, of the one picked by D)
• Since H(W) does not depend on r, we ended up with the need to maximize I(D,E).
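The final quantity I(D,E) can be computed from a class stream with empirical probabilities; a short sketch (log base 2, so the result is in bits):

```python
import math
from collections import Counter

def adjacent_class_mi(classes):
    """I(D,E) = sum over d,e of p(d,e) * log2( p(d,e) / (p(d) p(e)) ),
    where D picks a class and E picks the class adjacent to its right."""
    pairs = list(zip(classes, classes[1:]))
    n = len(pairs)
    pde = Counter(pairs)                 # joint counts of adjacent classes
    pd = Counter(d for d, _ in pairs)    # marginal of the left class
    pe = Counter(e for _, e in pairs)    # marginal of the right class
    return sum((c / n) * math.log2((c / n) / ((pd[d] / n) * (pe[e] / n)))
               for (d, e), c in pde.items())
```

A constant stream gives 0 bits; a strictly alternating stream is close to the 1-bit maximum for two classes.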
![Page 73: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/73.jpg)
73
Maximizing Mutual Information(dependent on the mapping r)
• Result from previous foil:– Maximizing the probability of data amounts to
maximizing I(D,E), the mutual information of the adjacent classes.
• Good:– We know what a MI is, and we know how to maximize.
• Bad:– There is no way to maximize over so many
possible partitionings: on the order of |V|^|V| - no way to test them all.
![Page 74: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/74.jpg)
74
The Greedy Algorithm• Define a merging operation on the mapping r: V → C:
– merge: (r,k,l) → (r’,C’), with |C’| = |C| - 1, such that– C’ = (C - {k,l}) ∪ {m} (throw out k and l, add a new m ∉ C)
– r’(w) = ..... m for w ∈ r-1({k,l}),
..... r(w) otherwise.
• 1. Start with each word in its own class (C = V), r = identity.
• 2. Merge two classes k,l into one, m, such that (k,l) = argmaxk,l Imerge(r,k,l)(D,E).
• 3. Set new (r,C) = merge(r,k,l).
• 4. Repeat 2 and 3 until |C| reaches predetermined size.
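A direct, deliberately inefficient sketch of steps 1-4 (real implementations update the MI score incrementally instead of rescoring every candidate merge from scratch; the function names are mine):

```python
import math
from collections import Counter

def mi(classes):
    """Mutual information of adjacent classes, in bits."""
    pairs = list(zip(classes, classes[1:]))
    n = len(pairs)
    pde = Counter(pairs)
    pd = Counter(d for d, _ in pairs)
    pe = Counter(e for _, e in pairs)
    return sum((c / n) * math.log2((c / n) / ((pd[d] / n) * (pe[e] / n)))
               for (d, e), c in pde.items())

def greedy_classes(words, target_size):
    r = {w: w for w in set(words)}               # 1. each word in its own class
    while len(set(r.values())) > target_size:
        C = sorted(set(r.values()))
        # 2. choose the merge (k,l) maximizing MI after merging
        k, l = max(((k, l) for i, k in enumerate(C) for l in C[i + 1:]),
                   key=lambda kl: mi([kl[0] if r[w] in kl else r[w] for w in words]))
        for w in r:                              # 3. apply merge(r,k,l);
            if r[w] == l:                        #    the new class m is named
                r[w] = k                         #    after k here
    return r                                     # 4. loop until |C| = target_size
```

On the toy adjective/noun data from the earlier slides, two merges recover exactly the (short, simple) and (homework, assignment) classes.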
![Page 75: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/75.jpg)
75
Word Classes in Applications
• Word Sense Disambiguation: context not seen [enough(-times)]
• Parsing: verb-subject, verb-object relations• Speech recognition (acoustic model): need more
instances of [rare(r)] sequences of phonemes• Machine Translation: translation equivalent
selection [for rare(r) words]
![Page 76: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/76.jpg)
Spelling Correction and
the Noisy Channel
The Spelling Correction Task
![Page 77: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/77.jpg)
Dan Jurafsky
Applications for spelling correction
77
Web search
PhonesWord processing
![Page 78: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/78.jpg)
Dan Jurafsky
Spelling Tasks
• Spelling Error Detection• Spelling Error Correction:
• Autocorrect • hte → the
• Suggest a correction• Suggestion lists
78
![Page 79: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/79.jpg)
Dan Jurafsky
Types of spelling errors
• Non-word Errors• graffe → giraffe
• Real-word Errors• Typographical errors
• three → there• Cognitive Errors (homophones)
• piece → peace, • too → two
79
![Page 80: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/80.jpg)
Dan Jurafsky
Rates of spelling errors
26%: Web queries. Wang et al. 2003
13%: Retyping, no backspace: Whitelaw et al. (English & German)
7%: Words corrected retyping on a phone-sized organizer
2%: Words uncorrected on organizer. Soukoreff & MacKenzie 2003
1-2%: Retyping: Kane and Wobbrock 2007, Grudin 1983
80
![Page 81: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/81.jpg)
Dan Jurafsky
Non-word spelling errors
• Non-word spelling error detection:• Any word not in a dictionary is an error• The larger the dictionary the better
• Non-word spelling error correction:• Generate candidates: real words that are similar to error• Choose the one which is best:
• Shortest weighted edit distance• Highest noisy channel probability
81
![Page 82: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/82.jpg)
Dan Jurafsky
Real word spelling errors
• For each word w, generate candidate set:• Find candidate words with similar pronunciations• Find candidate words with similar spelling• Include w in candidate set
• Choose best candidate• Noisy Channel • Classifier
82
![Page 83: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/83.jpg)
Spelling Correction and
the Noisy Channel
The Noisy Channel Model of Spelling
![Page 84: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/84.jpg)
Dan Jurafsky
Noisy Channel Intuition
84
![Page 85: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/85.jpg)
Dan Jurafsky
Noisy Channel
• We see an observation x of a misspelled word• Find the correct word w
85
![Page 86: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/86.jpg)
Dan Jurafsky
History: Noisy channel for spelling proposed around 1990
• IBM• Mays, Eric, Fred J. Damerau and Robert L. Mercer. 1991.
Context based spelling correction. Information Processing and Management, 23(5), 517–522
• AT&T Bell Labs• Kernighan, Mark D., Kenneth W. Church, and William A. Gale.
1990. A spelling correction program based on a noisy channel model. Proceedings of COLING 1990, 205-210
![Page 87: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/87.jpg)
Dan Jurafsky
Non-word spelling error example
acress
87
![Page 88: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/88.jpg)
Dan Jurafsky
Candidate generation
• Words with similar spelling• Small edit distance to error
• Words with similar pronunciation• Small edit distance of pronunciation to error
88
![Page 89: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/89.jpg)
Dan Jurafsky
Damerau-Levenshtein edit distance
• Minimal edit distance between two strings, where edits are:• Insertion• Deletion• Substitution• Transposition of two adjacent letters
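The four edit operations can be implemented with the standard dynamic program plus one extra case for transpositions (the "restricted" Damerau-Levenshtein variant); a sketch:

```python
def damerau_levenshtein(s, t):
    """Minimal number of insertions, deletions, substitutions, and
    adjacent transpositions needed to turn s into t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                              # delete everything
    for j in range(n + 1):
        d[0][j] = j                              # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]
```

All the acress candidates on the next slide are at distance 1 under this measure.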
89
![Page 90: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/90.jpg)
Dan Jurafsky
Words within 1 of acress

| Error | Candidate Correction | Correct Letter | Error Letter | Type |
|---|---|---|---|---|
| acress | actress | t | - | deletion |
| acress | cress | - | a | insertion |
| acress | caress | ca | ac | transposition |
| acress | access | c | r | substitution |
| acress | across | o | e | substitution |
| acress | acres | - | s | insertion |
| acress | acres | - | s | insertion |

90
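Candidates within edit distance 1 can also be generated the other way around: enumerate every single-edit variant of the error and intersect with a dictionary (Norvig-style generation; the tiny lexicon here is just the table's candidates):

```python
def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one edit away: deletions, adjacent transpositions,
    substitutions, and insertions."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}
    transposes = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    substitutions = {L + c + R[1:] for L, R in splits if R for c in alphabet}
    inserts = {L + c + R for L, R in splits for c in alphabet}
    return deletes | transposes | substitutions | inserts

# Filter against a dictionary to keep only real-word candidates:
lexicon = {"actress", "cress", "caress", "access", "across", "acres"}
candidates = edits1("acress") & lexicon
```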
![Page 91: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/91.jpg)
Dan Jurafsky
Candidate generation
• 80% of errors are within edit distance 1• Almost all errors within edit distance 2
• Also allow insertion of space or hyphen• thisidea this idea• inlaw in-law
91
![Page 92: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/92.jpg)
Dan Jurafsky
Language Model
• Use any of the language modeling algorithms we’ve learned• Unigram, bigram, trigram• Web-scale spelling correction (web-scale language modeling)
• Stupid backoff
92
• “Stupid backoff” (Brants et al. 2007)• No discounting, just use relative frequencies
![Page 93: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/93.jpg)
Dan Jurafsky
Unigram Prior probability
| word | Frequency of word | P(word) |
|---|---|---|
| actress | 9,321 | .0000230573 |
| cress | 220 | .0000005442 |
| caress | 686 | .0000016969 |
| access | 37,038 | .0000916207 |
| across | 120,844 | .0002989314 |
| acres | 12,874 | .0000318464 |

93

Counts from 404,253,213 words in the Corpus of Contemporary American English (COCA)
![Page 94: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/94.jpg)
Dan Jurafsky
Channel model probability
• Error model probability, Edit probability• Kernighan, Church, Gale 1990
• Misspelled word x = x1, x2, x3… xm
• Correct word w = w1, w2, w3,…, wn
• P(x|w) = probability of the edit • (deletion/insertion/substitution/transposition)
94
![Page 95: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/95.jpg)
Dan Jurafsky
Computing error probability: confusion matrix
del[x,y]: count(xy typed as x)
ins[x,y]: count(x typed as xy)
sub[x,y]: count(x typed as y)
trans[x,y]: count(xy typed as yx)
Insertion and deletion conditioned on previous character
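Turned into code, two of the four cases look like this; the counts below are invented placeholders for illustration, not the published Kernighan-Church-Gale matrices:

```python
# Hypothetical confusion-matrix counts (illustration only).
del_m = {('c', 't'): 117}       # del[x,y]: count of "xy" typed as "x"
sub_m = {('e', 'o'): 93}        # sub[x,y]: count of "x" typed as "y"
chars = {'a': 50000, 'e': 60000, 'o': 40000}   # count(x) in the corpus
bigrams = {('c', 't'): 5000, ('c', 'a'): 8000}  # count(xy) in the corpus

def p_deletion(x, y):
    """P of typing "x" when "xy" was intended: del[x,y] / count(xy)."""
    return del_m.get((x, y), 0) / bigrams[(x, y)]

def p_substitution(x, y):
    """P of typing "y" when "x" was intended: sub[x,y] / count(x)."""
    return sub_m.get((x, y), 0) / chars[x]
```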
95
![Page 96: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/96.jpg)
Dan Jurafsky
Confusion matrix for spelling errors
![Page 97: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/97.jpg)
Dan Jurafsky
Generating the confusion matrix
• Peter Norvig’s list of errors• Peter Norvig’s list of counts of single-edit errors
97
![Page 98: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/98.jpg)
Dan Jurafsky
Channel model
98
Kernighan, Church, Gale 1990
![Page 99: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/99.jpg)
Dan Jurafsky
Channel model for acress

| Candidate Correction | Correct Letter | Error Letter | x\|w | P(x\|word) |
|---|---|---|---|---|
| actress | t | - | c\|ct | .000117 |
| cress | - | a | a\|# | .00000144 |
| caress | ca | ac | ac\|ca | .00000164 |
| access | c | r | r\|c | .000000209 |
| across | o | e | e\|o | .0000093 |
| acres | - | s | es\|e | .0000321 |
| acres | - | s | ss\|s | .0000342 |

99
![Page 100: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/100.jpg)
Dan Jurafsky
Noisy channel probability for acress

| Candidate Correction | Correct Letter | Error Letter | x\|w | P(x\|word) | P(word) | 10^9 × P(x\|w)P(w) |
|---|---|---|---|---|---|---|
| actress | t | - | c\|ct | .000117 | .0000231 | 2.7 |
| cress | - | a | a\|# | .00000144 | .000000544 | .00078 |
| caress | ca | ac | ac\|ca | .00000164 | .00000170 | .0028 |
| access | c | r | r\|c | .000000209 | .0000916 | .019 |
| across | o | e | e\|o | .0000093 | .000299 | 2.8 |
| acres | - | s | es\|e | .0000321 | .0000318 | 1.0 |
| acres | - | s | ss\|s | .0000342 | .0000318 | 1.0 |

100
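Scoring the table in code (channel and prior values copied from the slides, one entry per candidate, keeping the es|e derivation for acres):

```python
# Candidate scores 10^9 * P(x|w) * P(w), as in the table above.
candidates = {
    "actress": (0.000117, 0.0000231),
    "cress":   (0.00000144, 0.000000544),
    "caress":  (0.00000164, 0.00000170),
    "access":  (0.000000209, 0.0000916),
    "across":  (0.0000093, 0.000299),
    "acres":   (0.0000321, 0.0000318),
}
scores = {w: 1e9 * p_x_w * p_w for w, (p_x_w, p_w) in candidates.items()}
best = max(scores, key=scores.get)   # "across" narrowly beats "actress"
```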
![Page 101: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/101.jpg)
Dan Jurafsky
Using a bigram language model
• “a stellar and versatile acress whose combination of sass and glamour…”
• Counts from the Corpus of Contemporary American English with add-1 smoothing
• P(actress|versatile) = .000021   P(whose|actress) = .0010• P(across|versatile) = .000021   P(whose|across) = .000006
• P(“versatile actress whose”) = .000021 × .0010 = 210 × 10^-10
• P(“versatile across whose”) = .000021 × .000006 = 1 × 10^-10
101
![Page 103: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/103.jpg)
Dan Jurafsky
Evaluation
• Some spelling error test sets• Wikipedia’s list of common English misspellings• Aspell filtered version of that list• Birkbeck spelling error corpus• Peter Norvig’s list of errors (includes Wikipedia and Birkbeck, for training
or testing)
103
![Page 104: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/104.jpg)
Spelling Correction and
the Noisy Channel
Real-Word Spelling Correction
![Page 105: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/105.jpg)
Dan Jurafsky
Real-word spelling errors
• …leaving in about fifteen minuets to go to her house.• The design an construction of the system…• Can they lave him my messages?• The study was conducted mainly be John Black.
• 25-40% of spelling errors are real words Kukich 1992
105
![Page 106: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/106.jpg)
Dan Jurafsky
Solving real-world spelling errors
• For each word in sentence• Generate candidate set
• the word itself • all single-letter edits that are English words• words that are homophones
• Choose best candidates• Noisy channel model• Task-specific classifier
106
![Page 107: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/107.jpg)
Dan Jurafsky
Noisy channel for real-word spell correction
• Given a sentence w1,w2,w3,…,wn
• Generate a set of candidates for each word wi
• Candidate(w1) = {w1, w’1 , w’’1 , w’’’1 ,…}
• Candidate(w2) = {w2, w’2 , w’’2 , w’’’2 ,…}
• Candidate(wn) = {wn, w’n , w’’n , w’’’n ,…}
• Choose the sequence W that maximizes P(W)
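A brute-force sketch of that search; the candidate sets and the toy bigram scorer standing in for P(W) are made up for illustration:

```python
import itertools

def best_sequence(sentence, candidates, score):
    """Try every way of replacing each word by one of its candidates and
    keep the sequence W maximizing score(W) (a stand-in for P(W)).
    Exponential in sentence length; fine only for short sentences."""
    options = [candidates(w) for w in sentence]
    return max(itertools.product(*options), key=score)
```

Real systems replace the exhaustive product with dynamic programming over the candidate lattice.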
![Page 108: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/108.jpg)
Dan Jurafsky
Noisy channel for real-word spell correction
108
![Page 109: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/109.jpg)
Dan Jurafsky
Noisy channel for real-word spell correction
109
![Page 110: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/110.jpg)
Dan Jurafsky
Simplification: One error per sentence
• Out of all possible sentences with one word replaced• w1, w’’2,w3,w4 two off thew
• w1,w2,w’3,w4 two of the
• w’’’1,w2,w3,w4 too of thew
• …
• Choose the sequence W that maximizes P(W)
![Page 111: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/111.jpg)
Dan Jurafsky
Where to get the probabilities
• Language model• Unigram• Bigram• Etc
• Channel model• Same as for non-word spelling correction• Plus need probability for no error, P(w|w)
111
![Page 112: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/112.jpg)
Dan Jurafsky
Probability of no error
• What is the channel probability for a correctly typed word?• P(“the”|“the”)
• Obviously this depends on the application• .90 (1 error in 10 words)• .95 (1 error in 20 words)• .99 (1 error in 100 words)• .995 (1 error in 200 words)
112
![Page 113: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/113.jpg)
Dan Jurafsky
Peter Norvig’s “thew” example
113
| x | w | x\|w | P(x\|w) | P(w) | 10^9 P(x\|w)P(w) |
|---|---|---|---|---|---|
| thew | the | ew\|e | 0.000007 | 0.02 | 144 |
| thew | thew |  | 0.95 | 0.00000009 | 90 |
| thew | thaw | e\|a | 0.001 | 0.0000007 | 0.7 |
| thew | threw | h\|hr | 0.000008 | 0.000004 | 0.03 |
| thew | thwe | ew\|we | 0.000003 | 0.00000004 | 0.0001 |
![Page 114: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/114.jpg)
Spelling Correction and
the Noisy Channel
State-of-the-art Systems
![Page 115: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/115.jpg)
Dan Jurafsky
HCI issues in spelling
• If very confident in correction• Autocorrect
• Less confident• Give the best correction
• Less confident• Give a correction list
• Unconfident• Just flag as an error
115
![Page 116: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/116.jpg)
Dan Jurafsky
State of the art noisy channel
• We never just multiply the prior and the error model• Independence assumptions → probabilities not commensurate• Instead: weigh them
• Learn λ from a development test set
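In log space the weighted score is log P(x|w) + λ log P(w), i.e. P(x|w) · P(w)^λ. A sketch (the function name is mine); with the earlier acress numbers, the choice of λ can flip the decision:

```python
import math

def best_correction(candidates, lam):
    """Weighted noisy channel: argmax over w of
    log P(x|w) + lam * log P(w).  lam is tuned on a development
    set; lam = 1 recovers the plain product P(x|w) P(w)."""
    return max(candidates,
               key=lambda w: math.log(candidates[w][0]) + lam * math.log(candidates[w][1]))

cands = {"actress": (0.000117, 0.0000231),   # (P(x|w), P(w))
         "across":  (0.0000093, 0.000299)}
```

Down-weighting the language model (λ < 1) lets the stronger channel evidence for actress win.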
116
![Page 117: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/117.jpg)
Dan Jurafsky
Phonetic error model
• Metaphone, used in GNU aspell • Convert misspelling to metaphone pronunciation
• “Drop duplicate adjacent letters, except for C.”• “If the word begins with 'KN', 'GN', 'PN', 'AE', 'WR', drop the first letter.”• “Drop 'B' if after 'M' and if it is at the end of the word”• …
• Find words whose pronunciation is 1-2 edit distance from misspelling’s• Score result list
• Weighted edit distance of candidate to misspelling• Edit distance of candidate pronunciation to misspelling pronunciation
117
![Page 118: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/118.jpg)
Dan Jurafsky
Improvements to channel model
• Allow richer edits (Brill and Moore 2000)• ent → ant• ph → f• le → al
• Incorporate pronunciation into channel (Toutanova and Moore 2002)
118
![Page 119: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/119.jpg)
Dan Jurafsky
Channel model
• Factors that could influence p(misspelling|word)• The source letter• The target letter• Surrounding letters• The position in the word• Nearby keys on the keyboard• Homology on the keyboard• Pronunciations• Likely morpheme transformations
119
![Page 120: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/120.jpg)
Dan Jurafsky
Nearby keys
![Page 121: *Introduction to Natural Language Processing (600.465) Language Modeling (and the Noisy Channel)](https://reader036.fdocuments.net/reader036/viewer/2022062408/5681363b550346895d9db4b9/html5/thumbnails/121.jpg)
Dan Jurafsky
Classifier-based methods for real-word spelling correction
• Instead of just channel model and language model• Use many features in a classifier such as MaxEnt, CRF.• Build a classifier for a specific pair like: whether/weather
• “cloudy” within +- 10 words• ___ to VERB• ___ or not
121