[Book Reading] 機械翻訳 (Machine Translation) - Section 3 No.1
Transcript of [Book Reading] 機械翻訳 - Section 3 No.1
![Page 1: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/1.jpg)
Language Model
MT Study Meeting 5/21
Hiroyuki Fudaba
![Page 2: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/2.jpg)
How can you say whether a sentence is natural or not?
$e_1$ = he is big
$e_2$ = is big he
$e_3$ = this is a purple dog
![Page 3: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/3.jpg)
How can you say whether a sentence is natural or not?
$e_1$ = he is big ← correct
$e_2$ = is big he ← grammatically wrong
$e_3$ = this is a purple dog ← semantically wrong
![Page 4: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/4.jpg)
Language model probability
We want to treat "naturalness" statistically. We represent it with the language model probability $P(e)$:
$P(e = \text{he is big}) = 0.7$
$P(e = \text{is big he}) = 0.3$
$P(e = \text{this is a purple dog}) = 0.5$
![Page 5: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/5.jpg)
Some ways to estimate $P(e)$
- n-gram model
- positional language model
- factored language model
- cache language model
![Page 6: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/6.jpg)
Basis of n-gram
We notate a sentence as $\mathbf{e} = e_1^I$, with $I$ being its length:
$e$ = he is big
$e_1 = \text{he},\ e_2 = \text{is},\ e_3 = \text{big},\ I = 3$
We can define $P(e)$ as follows:
$P(e = \text{he is big}) = P(I = 3,\ e_1 = \text{he},\ e_2 = \text{is},\ e_3 = \text{big})$
$= P(e_1 = \text{he},\ e_2 = \text{is},\ e_3 = \text{big},\ e_4 = \langle\text{eos}\rangle)$
$= P(e_0 = \langle\text{bos}\rangle,\ e_1 = \text{he},\ e_2 = \text{is},\ e_3 = \text{big},\ e_4 = \langle\text{eos}\rangle)$
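In code, the $\langle\text{bos}\rangle$/$\langle\text{eos}\rangle$ padding above might be sketched like this (function name and marker strings are my own choices):

```python
def pad(sentence, n=2):
    """Wrap a tokenized sentence with sentence-boundary markers.

    For an n-gram model, n-1 <bos> tokens are usually prepended so that
    every word, including the first one, has a full-length context.
    """
    return ["<bos>"] * (n - 1) + sentence + ["<eos>"]

tokens = pad("he is big".split())
# tokens == ["<bos>", "he", "is", "big", "<eos>"]: I = 3, and e_4 = <eos>
```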
![Page 7: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/7.jpg)
Estimating $P(e)$ in a simple way
Assuming that natural sentences appear more frequently than unnatural ones, a simple way to estimate $P(e)$ is the following:
- Bring a big training corpus $E_{train}$
- Count the frequency of each sentence in $E_{train}$
$P_s(e) = \frac{\mathrm{freq}(e)}{\mathrm{size}(E_{train})} = \frac{c_{train}(e)}{\sum_{\tilde{e}} c_{train}(\tilde{e})}$
$c_{train}(e = \text{he is big})$ returns how many sentences exactly match "he is big".
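As a quick sketch of this whole-sentence estimator (the toy corpus and names are my own):

```python
from collections import Counter

def sentence_probability(e, train):
    """P_s(e) = c_train(e) / size(E_train): the relative frequency of the
    exact sentence e in the training corpus."""
    counts = Counter(train)
    return counts[e] / len(train)

E_train = ["he is big", "he is big", "she is small", "it is big"]
p_seen = sentence_probability("he is big", E_train)    # 0.5 (2 of 4 sentences)
p_unseen = sentence_probability("he is small", E_train)  # 0.0: never seen
```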
![Page 8: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/8.jpg)
The problem with the simple estimation
When $E_{train}$ contains neither sentence $e_1$ nor $e_2$, you cannot say which is more natural:
$c_{train}(e_1) = c_{train}(e_2) = 0$
$P_S(e_1) = \frac{c_{train}(e_1)}{\sum_{\tilde{e}} c_{train}(\tilde{e})} = 0$
$P_S(e_2) = \frac{c_{train}(e_2)}{\sum_{\tilde{e}} c_{train}(\tilde{e})} = 0$
You cannot compare them if both values are 0…
![Page 9: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/9.jpg)
Solution to $P(e) = 0$
Rather than treating a sentence as a whole, let's treat a sentence as data composed of words:
$P(X, Y) = P(X \mid Y)\, P(Y)$
$P(e = \text{he is big}) = P(e_1 = \text{he} \mid e_0 = \langle\text{bos}\rangle)$
$\times P(e_2 = \text{is} \mid e_0 = \langle\text{bos}\rangle,\ e_1 = \text{he})$
$\times P(e_3 = \text{big} \mid e_0 = \langle\text{bos}\rangle,\ e_1 = \text{he},\ e_2 = \text{is})$
$\times P(e_4 = \langle\text{eos}\rangle \mid e_0 = \langle\text{bos}\rangle,\ e_1 = \text{he},\ e_2 = \text{is},\ e_3 = \text{big})$
![Page 10: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/10.jpg)
Solution to $P(e) = 0$
$P_S(e) = \frac{c_{train}(e)}{\sum_{\tilde{e}} c_{train}(\tilde{e})} = P(e_1^I) = \prod_{i=1}^{I+1} P_{ML}(e_i \mid e_0^{i-1})$
$P_{ML}(e_i \mid e_0^{i-1}) = \frac{c_{train}(e_0^{i})}{c_{train}(e_0^{i-1})}$
So far $P(e_1^I)$ is exactly equal to $P_S(e)$, which means it still doesn't work.
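To see why the full-history chain-rule product is still the same estimator, here is a toy check (my own code, not from the book): the product telescopes back to the sentence's relative frequency.

```python
from collections import Counter

def prefix_counts(train):
    """Count every <bos>-anchored prefix of every padded training sentence."""
    counts = Counter()
    for s in train:
        padded = ["<bos>"] + s.split() + ["<eos>"]
        for i in range(1, len(padded) + 1):
            counts[tuple(padded[:i])] += 1
    return counts

def chain_rule_probability(e, counts):
    """prod_{i=1}^{I+1} P_ML(e_i | e_0^{i-1}) with full histories."""
    padded = ["<bos>"] + e.split() + ["<eos>"]
    p = 1.0
    for i in range(1, len(padded)):
        hist, extended = tuple(padded[:i]), tuple(padded[:i + 1])
        if counts[hist] == 0:
            return 0.0
        p *= counts[extended] / counts[hist]
    return p

E_train = ["he is big", "he is big", "she is small", "it is big"]
counts = prefix_counts(E_train)
p = chain_rule_probability("he is big", counts)
# p == 0.5, exactly the sentence's relative frequency P_s(e); an unseen
# sentence such as "she is big" still comes out as 0.
```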
![Page 11: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/11.jpg)
Idea of the n-gram model
Rather than considering all the words that appear before the word we are looking at, let's consider only the $n-1$ words that appear immediately before it.
Instead of considering all words…
(diagram: $\langle\text{bos}\rangle$ he is big $\langle\text{eos}\rangle$, with every earlier word linked to the current one)
![Page 12: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/12.jpg)
Idea of the n-gram model
Rather than considering all the words that appear before the word we are looking at, let's consider only the $n-1$ words that appear immediately before it.
…consider only the $n-1$ words:
(diagram: $\langle\text{bos}\rangle$ he is big $\langle\text{eos}\rangle$, with only the immediately preceding words linked)
![Page 13: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/13.jpg)
The n-gram model precisely
From the previous expression
$P(e_1^I) = \prod_{i=1}^{I+1} P_{ML}(e_i \mid e_0^{i-1})$
we can approximate $P(e)$ as follows:
$P(e_1^I) \approx \prod_{i=1}^{I+1} P_{ML}(e_i \mid e_{i-n+1}^{i-1})$
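A minimal bigram ($n = 2$) sketch of this approximation, under my own naming and a toy corpus:

```python
from collections import Counter

def train_bigram(corpus):
    """Maximum-likelihood bigram model: P_ML(w | h) = c(h, w) / c(h)."""
    context, pair = Counter(), Counter()
    for s in corpus:
        padded = ["<bos>"] + s.split() + ["<eos>"]
        for h, w in zip(padded, padded[1:]):
            context[h] += 1
            pair[(h, w)] += 1
    def p(w, h):
        return pair[(h, w)] / context[h] if context[h] else 0.0
    return p

p = train_bigram(["he is big", "she is small"])
seen = p("he", "<bos>") * p("is", "he") * p("big", "is") * p("<eos>", "big")
novel = p("he", "<bos>") * p("is", "he") * p("small", "is") * p("<eos>", "small")
# Both come to 0.25: the novel sentence "he is small" gets a nonzero score
# because every one of its bigrams was observed somewhere in training.
```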
![Page 14: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/14.jpg)
How does this help?
With a bigram model ($n = 2$):
$P(e = \text{he is big}) \approx P(e_i = \text{he} \mid e_{i-1} = \langle\text{bos}\rangle)$
$\times P(e_i = \text{is} \mid e_{i-1} = \text{he})$
$\times P(e_i = \text{big} \mid e_{i-1} = \text{is})$
$\times P(e_i = \langle\text{eos}\rangle \mid e_{i-1} = \text{big})$
Intuitively, a short subsequence appears at least as often as any longer sequence containing it, so $P(e)$ estimated with the n-gram model is less likely to be 0.
![Page 15: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/15.jpg)
Smoothing the n-gram model
The n-gram model is less likely to estimate $P(e) = 0$,
but it still has a possibility of estimating 0
→ Smoothing
![Page 16: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/16.jpg)
Idea of smoothing
Combine the probabilities of the n-gram and the (n-1)-gram. Even if the probability of a word $w$ cannot be estimated with the n-gram, there is a possibility that it can be estimated with the (n-1)-gram:
$P_{3\text{-gram}}(\text{small} \mid \text{he is}) = 0$
$P_{2\text{-gram}}(\text{small} \mid \text{is}) = 0.03$
(bar chart: probabilities for P(he|⟨bos⟩), P(is|⟨bos⟩ he), P(big|he is), P(small|he is), P(⟨eos⟩|is big))
![Page 17: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/17.jpg)
Linear interpolation
The easiest and most basic way to express the idea:
$P(e_i \mid e_{i-n+1}^{i-1}) = (1 - a)\, P_{ML}(e_i \mid e_{i-n+1}^{i-1}) + a\, P_{ML}(e_i \mid e_{i-n+2}^{i-1}), \quad 0 \le a \le 1$
The problem is adjusting $a$ to a good value.
So how can we do that?
![Page 18: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/18.jpg)
Adjusting $a$ to a good value
An easy way to achieve this is the following:
- Bring a dataset that is different from the training data (a held-out set)
- Select the $a$ that gives the highest likelihood on that dataset
We can improve performance further by considering each context.
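The held-out procedure above can be sketched as follows (a bigram interpolated with a unigram; the corpus, names, and the 0.1-step grid are my own assumptions):

```python
from collections import Counter
import math

def counts(corpus):
    """Unigram and bigram counts over a <bos>/<eos>-padded corpus."""
    uni, bi = Counter(), Counter()
    for s in corpus:
        padded = ["<bos>"] + s.split() + ["<eos>"]
        uni.update(padded)
        bi.update(zip(padded, padded[1:]))
    return uni, bi

def interp(w, h, uni, bi, total, a):
    """(1 - a) * P_ML(w | h) + a * P_ML(w): bigram backed off to unigram."""
    p_bi = bi[(h, w)] / uni[h] if uni[h] else 0.0
    return (1 - a) * p_bi + a * uni[w] / total

def log_likelihood(corpus, uni, bi, total, a):
    ll = 0.0
    for s in corpus:
        padded = ["<bos>"] + s.split() + ["<eos>"]
        for h, w in zip(padded, padded[1:]):
            p = interp(w, h, uni, bi, total, a)
            ll += math.log(p) if p > 0 else float("-inf")
    return ll

train = ["he is big", "she is small", "he is tall"]
heldout = ["she is big"]
uni, bi = counts(train)
total = sum(uni.values())

# Grid-search a on the held-out set, keeping the value with highest likelihood.
best_a = max((i / 10 for i in range(1, 10)),
             key=lambda a: log_likelihood(heldout, uni, bi, total, a))
```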
![Page 19: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/19.jpg)
Witten-Bell smoothing
How should we choose $a$ if the n-gram counts look like the following?

| after "President was" | count | after "President Ronald" | count |
|---|---|---|---|
| elected | 5 | Reagan | 38 |
| the | 3 | Caza | 1 |
| in | 3 | Venetiaan | 1 |
| First | 3 | | |
| … | | | |
| 52 kinds | sum 110 | 3 kinds | sum 40 |
![Page 20: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/20.jpg)
Witten-Bell smoothing
It is likely that an unknown word comes after the context "President was":
$a$ should be large, so that the (n-1)-gram is emphasized more.
$P(e_i \mid e_{i-n+1}^{i-1}) = (1 - a)\, P_{ML}(e_i \mid e_{i-n+1}^{i-1}) + a\, P_{ML}(e_i \mid e_{i-n+2}^{i-1})$
(count table as before: "President was": 52 kinds, sum 110; "President Ronald": 3 kinds, sum 40)
![Page 21: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/21.jpg)
Witten-Bell smoothing
It is unlikely that an unknown word comes after the context "President Ronald":
$a$ should be small, so that the n-gram is emphasized more.
$P(e_i \mid e_{i-n+1}^{i-1}) = (1 - a)\, P_{ML}(e_i \mid e_{i-n+1}^{i-1}) + a\, P_{ML}(e_i \mid e_{i-n+2}^{i-1})$
(count table as before: "President was": 52 kinds, sum 110; "President Ronald": 3 kinds, sum 40)
![Page 22: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/22.jpg)
Idea of Witten-Bell smoothing
If you only have a single coefficient $a$ to adjust, you cannot take the context of each word into account.
→ Why not use a different $a$ for each context?
![Page 23: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/23.jpg)
Witten-Bell smoothing precisely
Simple smoothing:
$P(e_i \mid e_{i-n+1}^{i-1}) = (1 - a)\, P_{ML}(e_i \mid e_{i-n+1}^{i-1}) + a\, P_{ML}(e_i \mid e_{i-n+2}^{i-1})$
Witten-Bell smoothing:
$P_{WB}(e_i \mid e_{i-n+1}^{i-1}) = \left(1 - a_{e_{i-n+1}^{i-1}}\right) P_{ML}(e_i \mid e_{i-n+1}^{i-1}) + a_{e_{i-n+1}^{i-1}}\, P_{ML}(e_i \mid e_{i-n+2}^{i-1})$
$a_{e_{i-n+1}^{i-1}} = \frac{u(e_{i-n+1}^{i-1},\, *)}{u(e_{i-n+1}^{i-1},\, *) + c(e_{i-n+1}^{i-1})}$
![Page 24: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/24.jpg)
Witten-Bell smoothing precisely
$a_{e_{i-n+1}^{i-1}} = \frac{u(e_{i-n+1}^{i-1},\, *)}{u(e_{i-n+1}^{i-1},\, *) + c(e_{i-n+1}^{i-1})}$
$u(e_{i-n+1}^{i-1},\, *)$ represents how many kinds of words continue after the context $e_{i-n+1}^{i-1}$:
$u(\text{President was},\, *) = 52$
$u(\text{President Ronald},\, *) = 3$
(count table as before: "President was": 52 kinds, sum 110; "President Ronald": 3 kinds, sum 40)
![Page 25: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/25.jpg)
Witten-Bell smoothing precisely
$a_{e_{i-n+1}^{i-1}} = \frac{u(e_{i-n+1}^{i-1},\, *)}{u(e_{i-n+1}^{i-1},\, *) + c(e_{i-n+1}^{i-1})}$
$a_{\text{President was}} = \frac{52}{110 + 52} = 0.32$
$a_{\text{President Ronald}} = \frac{3}{40 + 3} = 0.07$
(count table as before: "President was": 52 kinds, sum 110; "President Ronald": 3 kinds, sum 40)
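Plugging the slide's counts into the formula (a minimal sketch; the function name is my own):

```python
def witten_bell_a(u, c):
    """a = u / (u + c), where u is the number of distinct word types seen
    after the context and c is the total count of the context."""
    return u / (u + c)

a_was = witten_bell_a(52, 110)    # "President was": many different continuations
a_ronald = witten_bell_a(3, 40)   # "President Ronald": almost always "Reagan"
# a_was is about 0.32 (lean on the (n-1)-gram), a_ronald about 0.07 (trust the n-gram)
```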
![Page 26: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/26.jpg)
Absolute discounting
Yet another smoothing method. Unlike Witten-Bell smoothing, which uses $P_{ML}$, it subtracts a constant value $d$ from the frequency of each word in order to estimate the probability:
$P_d(e_i \mid e_0^{i-1}) = \frac{\max\left(c_{train}(e_0^{i}) - d,\ 0\right)}{c_{train}(e_0^{i-1})}$
![Page 27: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/27.jpg)
Absolute discounting
So why do we subtract? We want to treat low-frequency words like unknown words, because low-frequency counts cannot really be trusted. By doing this, the (n-1)-gram gets more weight.
![Page 28: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/28.jpg)
Absolute discounting
$P_d(e_i \mid e_{i-n+1}^{i-1}) = \frac{\max\left(c_{train}(e_{i-n+1}^{i}) - d,\ 0\right)}{c_{train}(e_{i-n+1}^{i-1})}$
With $d = 0.5$:
$P_d(e_i = \text{Reagan} \mid e_{i-2}^{i-1} = \text{President Ronald}) = \frac{38 - 0.5}{40} = 0.9375$
$P_d(e_i = \text{Caza} \mid e_{i-2}^{i-1} = \text{President Ronald}) = \frac{1 - 0.5}{40} = 0.0125$
$P_d(e_i = \text{Venetiaan} \mid e_{i-2}^{i-1} = \text{President Ronald}) = \frac{1 - 0.5}{40} = 0.0125$
(count table as before: "President was": 52 kinds, sum 110; "President Ronald": 3 kinds, sum 40)
![Page 29: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/29.jpg)
Absolute discounting
$P_d(e_i = \text{Reagan} \mid e_{i-2}^{i-1} = \text{President Ronald}) = 0.9375$
$P_d(e_i = \text{Caza} \mid e_{i-2}^{i-1} = \text{President Ronald}) = 0.0125$
$P_d(e_i = \text{Venetiaan} \mid e_{i-2}^{i-1} = \text{President Ronald}) = 0.0125$
$a_{e_{i-n+1}^{i-1}} = 1 - (0.9375 + 0.0125 + 0.0125) = 0.0375$
An efficient way of computing this is the following:
$a_{e_{i-n+1}^{i-1}} = \frac{u(e_{i-n+1}^{i-1},\, *) \times d}{c(e_{i-n+1}^{i-1})}$
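The worked example above in code ($d = 0.5$, counts from the slide; naming is my own), checking that the leftover mass matches the shortcut formula:

```python
def discounted_prob(count, context_count, d=0.5):
    """P_d = max(c - d, 0) / c(context): absolute discounting."""
    return max(count - d, 0) / context_count

# Counts after "President Ronald": Reagan 38, Caza 1, Venetiaan 1 (total 40).
probs = [discounted_prob(c, 40) for c in (38, 1, 1)]  # 0.9375, 0.0125, 0.0125

leftover = 1 - sum(probs)   # probability mass freed up by the discounting
shortcut = 3 * 0.5 / 40     # u(context, *) * d / c(context)
# Both come to 0.0375 (up to float rounding): the mass handed to the (n-1)-gram.
```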
![Page 30: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/30.jpg)
Absolute discounting
Now that we do not use the maximum likelihood estimate, the n-gram probability is estimated as follows:
$P(e_i \mid e_{i-n+1}^{i-1}) = P_d(e_i \mid e_{i-n+1}^{i-1}) + a_{e_{i-n+1}^{i-1}}\, P(e_i \mid e_{i-n+2}^{i-1})$
Quite similar to linear interpolation, but it differs in that absolute discounting uses $P_d$:
$P(e_i \mid e_{i-n+1}^{i-1}) = (1 - a)\, P_{ML}(e_i \mid e_{i-n+1}^{i-1}) + a\, P_{ML}(e_i \mid e_{i-n+2}^{i-1})$
![Page 31: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/31.jpg)
Kneser-Ney smoothing
- Achieves excellent performance
- Similar to absolute discounting
- Pays attention to words that appear only in specific contexts
![Page 32: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/32.jpg)
Kneser-Ney smoothing
A lower-order model is needed only when the count in the higher-order model is small.
Suppose "San Francisco" is common, but "Francisco" appears only after "San":
- Both "San" and "Francisco" get a high unigram probability
- But we want to give "Francisco" a low unigram probability!
![Page 33: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/33.jpg)
Kneser-Ney smoothing
Kneser-Ney is defined as follows:
$P_{kn}(e_i \mid e_{i-n+1}^{i-1}) = \frac{\max\left(u(*,\ e_{i-n+2}^{i}) - d,\ 0\right)}{u(*,\ e_{i-n+2}^{i-1},\ *)}$
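The continuation-count intuition can be sketched for the unigram case (my own simplified code, ignoring the discount $d$): a word's lower-order score depends on how many distinct words precede it, not on its raw frequency.

```python
from collections import Counter

def continuation_probs(corpus):
    """Unigram Kneser-Ney idea: P_cont(w) is proportional to the number of
    distinct words that precede w, normalized by the number of bigram types."""
    preceding = {}          # word -> set of distinct left neighbors
    bigram_types = set()
    for s in corpus:
        padded = ["<bos>"] + s.split() + ["<eos>"]
        for h, w in zip(padded, padded[1:]):
            preceding.setdefault(w, set()).add(h)
            bigram_types.add((h, w))
    return {w: len(hs) / len(bigram_types) for w, hs in preceding.items()}

p = continuation_probs(["san francisco", "in san francisco", "to san francisco"])
# "francisco" and "san" both occur 3 times, but "francisco" follows only "san",
# so its continuation probability is much lower than "san"'s.
```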
![Page 34: [Book Reading] 機械翻訳 - Section 3 No.1](https://reader034.fdocuments.net/reader034/viewer/2022051123/587e2a341a28abb93e8b5c09/html5/thumbnails/34.jpg)
Unknown words
Even though smoothing reduces the chance of $P(e) = 0$, the possibility of getting 0 still remains.
We may give probability to unknown words uniformly over a vocabulary of size $V$:
$P_{unk}(e_i) = \frac{1}{V}$