Language Modelling - COM4513/6513 Natural Language Processing
Language Modelling
COM4513/6513 Natural Language Processing
Nikos [email protected]
@nikaletras
Computer Science Department
Week 3, Spring 2020
In previous lecture...
Our first NLP problem: Text classification
But we ignored word order (apart from short sequences, e.g. n-grams)!
In this lecture...
Our second NLP problem: Language Modelling
What is the probability of a given sequence of words in a particular language (e.g. English)?
Odd problem. Applications?
Applications of LMs
Word likelihood for query completion in information retrieval (“Is Sheffield” → try it on your search engine)
Language detection (is “Ciao Sheffield” Italian or English?)
Grammatical error detection (“You’re welcome” or “Your welcome”?)
Speech recognition (“I was tired too.” or “I was tired two.”?)
Problem setup
Training data is an (often large) set of sentences x^m, each a sequence of words x_n:

D_train = {x^1, ..., x^M},  x = [x_1, ..., x_N]

for example:

x = [<s>, The, water, is, clear, ., </s>]

<s> denotes the start of the sentence; </s> denotes the end of the sentence.
Calculate sentence probabilities
We want to learn a model that returns the probability of anunseen sentence x:
P(x) = P(x_1, ..., x_N), for all x ∈ V^maxN

where V is the vocabulary and V^maxN the set of all possible sentences (up to some maximum length maxN).
How to compute probability?
Unigram language model
Multiply the probability of each word appearing in the sentence x, computed over the entire corpus:

P(x) = ∏_{n=1}^{N} P(x_n) = ∏_{n=1}^{N} c(x_n) / Σ_{x∈V} c(x)

<s> i love playing basketball </s>
<s> arctic monkeys are from sheffield </s>
<s> i study in sheffield uni </s>

P(i love) = P(i)P(love) = (2/20) · (1/20) = 0.005
What could go wrong?
<s> i love playing basketball </s><s> arctic monkeys are from sheffield </s><s> i study in sheffield uni </s>
The most probable word is <s> (3/20)
The most probable single-word sentence is “<s>”
The most probable two-word sentence is “<s> <s>”
The most probable N-word sentence is “<s>” repeated N times
Maximum Likelihood Estimation
Instead of assuming independence:

P(x) = ∏_{n=1}^{N} P(x_n)

we assume that each word depends on all previous ones:

P(x) = P(x_1, ..., x_N)
     = P(x_1) P(x_2, ..., x_N | x_1)
     = P(x_1) P(x_2|x_1) ... P(x_N | x_1, ..., x_{N−1})
     = ∏_{n=1}^{N} P(x_n | x_1, ..., x_{n−1})   (chain rule)
What could go wrong?
Problems with MLE
Let’s analyse this:
P(x) = P(x_1) P(x_2|x_1) P(x_3|x_1, x_2) ... P(x_N | x_1, ..., x_{N−1})

P(x_n | x_1, ..., x_{n−1}) = c(x_1, ..., x_{n−1}, x_n) / c(x_1, ..., x_{n−1})
As we condition on more words, the counts become sparser
Bigram Language Models
Assume that the choice of a word depends only on the one before it:

P(x) = ∏_{n=1}^{N} P(x_n | x_{n−1}) = ∏_{n=1}^{N} c(x_{n−1}, x_n) / c(x_{n−1})

This is the k-th order Markov assumption with k = 1:

P(x_n | x_{n−1}, ..., x_1) ≈ P(x_n | x_{n−1}, ..., x_{n−k})
Bigram LM: From counts to probabilities
Unigram counts:

            arctic  monkeys  are   my    favourite  band
            100     600      4000  3000  500        200

Bigram counts (rows: x_{i−1}, cols: x_i):

            arctic  monkeys  are  my   favourite  band
arctic      0       10       2    0    0          0
monkeys     0       0        250  1    5          0
are         3       45       0    600  25         1
my          0       2        0    1    300        5
favourite   0       1        0    0    0          50
band        0       0        3    10   0          0
Bigram LM: From counts to probabilities
From the bigram count matrix, compute probabilities by dividing each cell by the unigram count for its row.

Bigram probabilities (rows: x_{i−1}, cols: x_i):

            arctic  monkeys  are    my      favourite  band
arctic      0       0.1      0.02   0       0          0
monkeys     0       0        0.417  0.0017  0.008      0
are         0.0008  0.0113   0      0.15    0.0063     0.00025
my          0       0.0007   0      0.0003  0.1        0.0017
favourite   0       0.002    0      0       0          0.1
band        0       0        0.015  0.05    0          0
Example: Bigram language model
x = [arctic, monkeys, are, my, favourite, band]

P(x) = P(monkeys|arctic) P(are|monkeys) P(my|are) P(favourite|my) P(band|favourite)
     = c(arctic, monkeys)/c(arctic) · ... · c(favourite, band)/c(favourite)
     = 0.1 · 0.417 · 0.15 · 0.1 · 0.1
     = 0.00006255
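The same computation can be sketched in Python using the count tables above (the dictionaries below hold only the non-zero counts needed for this example; the names are my own):

```python
# Unigram and bigram counts taken from the slide tables.
unigram = {"arctic": 100, "monkeys": 600, "are": 4000,
           "my": 3000, "favourite": 500, "band": 200}
bigram = {("arctic", "monkeys"): 10, ("monkeys", "are"): 250,
          ("are", "my"): 600, ("my", "favourite"): 300,
          ("favourite", "band"): 50}

def bigram_prob(prev, word):
    """MLE estimate c(x_{n-1}, x_n) / c(x_{n-1})."""
    return bigram.get((prev, word), 0) / unigram[prev]

def sentence_prob(words):
    """Product of bigram probabilities over consecutive word pairs."""
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

x = ["arctic", "monkeys", "are", "my", "favourite", "band"]
# 10/100 * 250/600 * 600/4000 * 300/3000 * 50/500 = 0.0000625
# (the slide's 0.00006255 comes from rounding 250/600 to 0.417)
```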
Longer contexts (N-gram LMs)
P(x | context) = P(context, x) / P(context) = c(context, x) / c(context)

In a bigram LM the context is x_{n−1}; in a trigram LM it is x_{n−2}, x_{n−1}; etc.

The longer the context:
the more likely it is to capture long-range dependencies: “I saw a tiger that was really very...” fierce or talkative?
the sparser the counts (zero probabilities)
5-grams and training sets with billions of tokens are common.
Unknown Words
If a word was never encountered in training, any sentence containing it will have probability 0.

It happens:
all corpora are finite
new words keep emerging

Common solutions:
Generate unknown words in the training data by replacing low-frequency words with a special UNKNOWN token
Use classes of unknown words, e.g. names and numbers
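The first solution can be sketched as follows (the min_count threshold and the <UNK> spelling are illustrative choices, not fixed by the slides):

```python
from collections import Counter

def replace_rare(sentences, min_count=2, unk="<UNK>"):
    """Map words seen fewer than min_count times to the UNK token,
    so the model learns a probability for unknown words."""
    counts = Counter(w for s in sentences for w in s)
    return [[w if counts[w] >= min_count else unk for w in s]
            for s in sentences]

sents = [["i", "love", "basketball"], ["i", "love", "sheffield"]]
# "basketball" and "sheffield" occur only once, so both become <UNK>
```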
Implementation details
Dealing with large datasets requires efficiency:
use log probabilities to avoid underflow (very small numbers)
use efficient data structures for sparse counts, e.g. lossy data structures (Bloom filters)
How do we train and evaluate our language models?
We need train/dev/test data
Evaluation approaches
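A quick illustration of the log-probability trick, reusing the bigram probabilities from the earlier example (names are my own):

```python
import math

# Bigram probabilities from the "arctic monkeys" example.
probs = [0.1, 0.417, 0.15, 0.1, 0.1]

# Multiplying many small probabilities eventually underflows to 0.0;
# summing their logs stays well within floating-point range.
log_p = sum(math.log(p) for p in probs)
p = math.exp(log_p)   # exponentiate back only when it is safe to do so
```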
Intrinsic Evaluation: Accuracy
How well does our LM predict the next word?
I always order pizza with cheese and... mushrooms? bread? and?
Accuracy: how often the LM predicts the correct word
The higher the better
Intrinsic Evaluation: Perplexity
Perplexity: the inverse probability of the test set x = [x_1, ..., x_N], normalised by the number of words N:

PP(x) = P(x_1, ..., x_N)^{−1/N}
      = ( 1 / P(x_1, ..., x_N) )^{1/N}
      = ( 1 / ∏_{i=1}^{N} P(x_i | x_1, ..., x_{i−1}) )^{1/N}

It measures how well a probability distribution predicts a sample.

The lower the better.

Why is a bigram language model likely to have lower perplexity than a unigram one? There is more context to predict the next word!
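Perplexity is usually computed in log space; the exp-of-mean-negative-log form below is algebraically the same as the N-th root form above (function name is my own):

```python
import math

def perplexity(cond_probs):
    """PP = exp(-(1/N) * Σ log P(x_i | history)): the inverse probability
    of the sequence, normalised by its length N."""
    n = len(cond_probs)
    return math.exp(-sum(math.log(p) for p in cond_probs) / n)

# Sanity check: a model assigning uniform probability 1/4 to every
# word has perplexity 4, regardless of sequence length.
```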
The problem with perplexity
Doesn’t always correlate with application performance
Can’t evaluate non-probabilistic LMs
Extrinsic Evaluation
Sentence completion
Grammatical error correction: detect “odd” sentences and propose alternatives
Natural language generation: prefer more “natural” sentences
Speech recognition
Machine translation
Smoothing
What happens when words that are in our vocabulary appear in an unseen context in the test data?

They will be assigned zero probability.

Smoothing (or discounting) to the rescue: Steal from the rich and give to the poor!
Smoothing intuition
Taking from the frequent and giving to the rare (discounting)
Add-1 Smoothing
Add-1 (or Laplace) smoothing adds one to all bigram counts:

P_add-1(x_n | x_{n−1}) = (c(x_{n−1}, x_n) + 1) / (c(x_{n−1}) + |V|)
Pretend we have seen all bigrams at least once!
Add-k Smoothing
Add-1 puts too much probability mass on unseen bigrams; it is better to add k, with k < 1:

P_add-k(x_n | x_{n−1}) = (c(x_{n−1}, x_n) + k) / (c(x_{n−1}) + k|V|)
k is a hyperparameter: choose optimal value on the dev set!
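A minimal sketch of the add-k estimate (the function name and the default k are my own; k = 1 recovers Laplace smoothing):

```python
def add_k_prob(prev, word, bigram_counts, unigram_counts, vocab_size, k=0.05):
    """Add-k estimate: (c(prev, word) + k) / (c(prev) + k * |V|).
    Every bigram, seen or not, gets a non-zero probability."""
    return (bigram_counts.get((prev, word), 0) + k) / \
           (unigram_counts.get(prev, 0) + k * vocab_size)

# With k=1 (Laplace), a seen bigram c=2 out of c(prev)=4 and |V|=10
# gives (2+1)/(4+10) = 3/14, while an unseen one gets 1/14.
```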
Interpolation
Longer contexts are more informative:
“dog bites ...” is better than “bites ...”
but only if they are frequent enough:
is “canid bites ...” better than “bites ...”?

Can we combine evidence from unigram, bigram and trigram probabilities?
Simple Linear Interpolation
For a trigram LM:

P_SLI(x_n | x_{n−1}, x_{n−2}) = λ3 P(x_n | x_{n−1}, x_{n−2}) + λ2 P(x_n | x_{n−1}) + λ1 P(x_n),   λ_i > 0, Σ λ_i = 1

A weighted average of unigram, bigram and trigram probabilities.

How do we choose the values of the λs? Parameter tuning on the dev set!
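A sketch of the interpolated estimate; the λ values below are illustrative defaults, in practice they are tuned on the dev set:

```python
def interpolated_prob(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """Simple linear interpolation: a weighted average of the trigram,
    bigram and unigram estimates. The weights must be positive and
    sum to 1 so the result is still a probability distribution."""
    l3, l2, l1 = lambdas
    assert abs(l3 + l2 + l1 - 1.0) < 1e-9, "lambdas must sum to 1"
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

# Even if the trigram estimate is 0 (unseen context), the bigram and
# unigram terms keep the interpolated probability non-zero.
```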
Backoff
Start with n-gram order k, but if the counts are 0 use order k − 1:

BO(x_n | x_{n−1} ... x_{n−k}) =
  P(x_n | x_{n−1} ... x_{n−k}),      if c(x_{n−k} ... x_n) > 0
  BO(x_n | x_{n−1} ... x_{n−k+1}),   otherwise
Is this a probability distribution?
Backoff
NO! We must discount the probabilities of contexts with non-zero counts (P*) and distribute the left-over mass to the shorter-context ones:

P_BO(x_n | x_{n−1} ... x_{n−k}) =
  P*(x_n | x_{n−1} ... x_{n−k}),                           if c(x_{n−k} ... x_n) > 0
  α_{x_{n−1}...x_{n−k}} P_BO(x_n | x_{n−1} ... x_{n−k+1}),  otherwise

α_{x_{n−1}...x_{n−k}} = β_{x_{n−1}...x_{n−k}} / Σ_{x_n} P_BO(x_n | x_{n−1} ... x_{n−k+1})

β is the left-over probability mass for the (n−k)-gram context.
Absolute Discounting
Using 22M words for training and an equally sized held-out set:

Can you predict the held-out (test) set average count given the training count?

Held-out counts ≈ training counts − 0.75 (an absolute discount)
Absolute discounting
P_AbsDiscount(x_n | x_{n−1}) = (c(x_{n−1}, x_n) − d) / c(x_{n−1}) + λ_{x_{n−1}} P(x_n)

d = 0.75; the λs are tuned to ensure we have a valid probability distribution.

This is a component of Kneser-Ney discounting:
Intuition: a word can be very frequent, but only follow very few contexts, e.g. Francisco is frequent but almost always follows San.
The unigram probability in the context of the bigram should capture how likely x_n is to be a novel continuation.
32
Stupid Backoff
Do we really need probabilities? Estimating the additional parameters takes time for large corpora.

If scoring is enough, stupid backoff works adequately:

\[
S_{BO}(x_n \mid x_{n-1} \dots x_{n-k}) =
\begin{cases}
P(x_n \mid x_{n-1} \dots x_{n-k}), & \text{if } c(x_{n-k} \dots x_n) > 0\\
\lambda \, S_{BO}(x_n \mid x_{n-1} \dots x_{n-k+1}), & \text{otherwise}
\end{cases}
\]

Empirically, $\lambda = 0.4$ was found to work well.

They called it stupid because they didn't expect it to work well!
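Stupid backoff is simple enough to fit in a few lines. A sketch for the bigram case, with a toy corpus as the assumption:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(unigrams.values())
LAMBDA = 0.4  # the empirically chosen backoff weight

def sbo_score(w, prev):
    # relative frequency if the bigram was seen, otherwise back
    # off to the unigram relative frequency, scaled by lambda
    if bigrams[(prev, w)] > 0:
        return bigrams[(prev, w)] / unigrams[prev]
    return LAMBDA * unigrams[w] / total

print(sbo_score("cat", "the"))  # 2/3: seen bigram, no discounting
print(sbo_score("ran", "sat"))  # 0.4 * 1/9: backed-off unigram
```

Because no mass is discounted and redistributed, the scores for a context need not sum to 1: they are rankings, not probabilities, which is precisely the trade-off the slide describes.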
33
Syntax-based language models
P(binoculars|saw)
more informative than:
P(binoculars|strong, very, with, ship, the, saw)
34
Last words: More data defeats smarter models!
From Large Language Models in Machine Translation
35
Bibliography
Chapter 3 from Jurafsky & Martin
Chapter 6 from Eisenstein
Michael Collins’ notes on LMs
36
Coming up next...
We have learned how to model word sequences using Markov models

In the following lecture we will look at how to perform part-of-speech tagging using:

- the Hidden Markov Model (HMM)
- Conditional Random Fields (CRFs), an extension of logistic regression for sequence modelling