© 2017 NAVER LABS. All rights reserved.
Natural Language Processing:
Language Modelling
Matthias Gallé (@mgalle)
Naver Labs Europe
8th January 2018
Language Modelling
Language is ambiguous, and we are constantly decoding the most probable meaning
We want to compute
$P(s) = P(w_1 w_2 w_3 \ldots w_{|s|})$
“But it must be recognized that the notion ‘probability of a sentence’ is an entirely useless one, under any known interpretation of this term.” (Noam Chomsky)
Language Model: uses
Spell correction
Re-ranking for:
• OCR
• ASR
• MT
But also a fundamental building block for Q&A, summarization, etc. (including IR)
P(boil an egg) > P(boil a egg)
P(boil an egg) > P(boil Enoch)
How to compute
$P(w_1 w_2 w_3 \ldots w_{|s|}) = P(w_{|s|} \mid w_1 w_2 \ldots w_{|s|-1}) \cdot P(w_{|s|-1} \mid w_1 w_2 \ldots w_{|s|-2}) \cdots P(w_1)$
Def. conditional probability: P(A | B) = P(A,B) / P(B)
Ex: P(boil an egg for you) = P(you | boil an egg for) * P(for | boil an egg) * P(egg | boil an) * P(an | boil) * P(boil)
Not enough statistics
s c(s)
boil 2796
boil an 269
boil an egg 28
boil an egg for 1
boil an egg for you 0
But, c(an egg for) = 8 and c(an egg) = 1571
Billion Word Corpus (http://www.statmt.org/lm-benchmark/)
Not enough statistics: Markovian assumption
Memoryless property
Only most recent words are important
intuitively, this should be approximately true for most text
how common are long-range correlations in text?
Assume that $P(w_i \mid w_1 w_2 \ldots w_{i-1}) = P(w_i \mid w_{i-n+1} \ldots w_{i-2} w_{i-1})$
Example: 2-gram model
P(boil an egg for you) = P(you | for) * P(for | egg) * P(egg | an) * P(an | boil)
s c(s)
for you 39191
egg for 136
an egg 1571
boil an 29
Maximum Likelihood Estimation
Just counting
$P(w_i \mid w_{i-n+1} \ldots w_{i-1}) = \frac{c(w_{i-n+1} \ldots w_{i-1} w_i)}{c(w_{i-n+1} \ldots w_{i-1})}$
2-gram
P(boil an egg for you) = c(for you)/c(for) * c(egg for)/c(egg) * c(an egg)/c(an) * c(boil an)/c(boil)
= 39191/6598312 * 136/7871 * 1571/2442763 * 29/2659 = 7.198e-10
P(boil a egg for you) = 1.056e-12
and P(an egg) = 6.4e-4 vs P(a egg) = 6.51e-6
P(boil Enoch for you) =3.43e-14
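As an illustration (my own sketch, not from the slides), the whole MLE bigram computation fits in a few lines of Python; the counts below are the ones quoted on this slide:

# Minimal sketch of MLE bigram scoring, using the counts quoted above.
from functools import reduce

unigram = {"boil": 2659, "an": 2442763, "egg": 7871, "for": 6598312}
bigram = {("boil", "an"): 29, ("an", "egg"): 1571,
          ("egg", "for"): 136, ("for", "you"): 39191}

def p_mle(prev, word):
    """MLE estimate P(word | prev) = c(prev word) / c(prev)."""
    return bigram.get((prev, word), 0) / unigram[prev]

def sentence_prob(words):
    """Bigram chain-rule product P(w2|w1) * ... (P(w1) omitted, as on the slide)."""
    return reduce(lambda p, i: p * p_mle(words[i - 1], words[i]),
                  range(1, len(words)), 1.0)

print(sentence_prob("boil an egg for you".split()))  # ~7.2e-10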
Toolkits
CMU-Cambridge (http://www.speech.cs.cmu.edu/SLM)
SRILM (http://www.speech.sri.com/projects/srilm/)
KenLM (https://kheafield.com/code/kenlm/)
integrated into Moses (phrase-based MT)
Data
Google Books Ngrams
http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
8M books (6% of all books ever published)
Binned by years (culturomics)
Ngrams of Corpus of Contemporary American English
https://www.ngrams.info/download_coca.asp
Any dataset
The one task in NLP where annotation is not an issue
You still have to pre-process
(a small domain-specific corpus often beats a large generic one)
Evaluation
1. Word Error Rate: how many times your best prediction is incorrect
Problem: what if your second-best prediction was correct, with p = 0.49?
2. Final task: impact of different LM on end-task
time-consuming
confounding variables
Evaluation: Perplexity
Interpretation: maximise probability
“how well can you predict upcoming word?”
“amount of surprise that the test data generates for your model“
Perplexity: $\mathrm{ppx}(q) = 2^{-\frac{1}{N}\sum_i \log_2 q(x_i)} = 2^{H(p,q)}$, where $p$ is the observed (empirical) distribution ($p(x) = c(x)/N$)
Min ppx = max probability
Unit of entropy: bits per unit (word)
how many bits do I need to encode the correct option given the model?
perfect model: perplexity 1 (0 bits)
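A minimal sketch of the perplexity computation (my own illustration, not from the slides), assuming the model q is given as a function from token to probability:

import math

def perplexity(q, tokens):
    """ppx(q) = 2^{-(1/N) * sum_i log2 q(x_i)} = 2^{H(p, q)}."""
    log_sum = sum(math.log2(q(x)) for x in tokens)
    return 2 ** (-log_sum / len(tokens))

# Toy check: a uniform model over a 4-word vocabulary has perplexity 4.
uniform = lambda w: 0.25
print(perplexity(uniform, ["boil", "an", "egg", "for"]))  # 4.0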
Evaluation: Perplexity
Cross-entropy: how many additional bits do I need to encode the correct answer wrt p, knowing q
corpus ppx/H
Penn Treebank (PTB) ~ 60
1 Billion Word Corpus ~ 30 (23 with ensemble)
Char-PTB ~ 3.03 [1.6 bits-per-character]
Char-War & Peace ~ 2.46 [1.3 bpc]
Shortcomings of Perplexity
$\mathrm{ppx}(q) = 2^{-\frac{1}{N}\sum_i \log_2 q(x_i)}$
What happens if you never saw x?
Lots of effort to model unseen events
SPiCe competition (http://spice.lif.univ-mrs.fr) used nDCG (from IR)
Drawbacks of MLE: lack of generalization
Actually, P(boil Enoch for you) = 0
“boil Enoch” never occurs
Law of NLP: You will always find new n-grams
Other drawbacks of MLE: Overfitting
Use LM as text generator (Trump speeches)
2-grams:
'The biggest concern with King of pro - fought very hard to look at a state legislature opposed it was she received a two hours during the oath to me the Obama administration has known opportunity . <END>'
'And even knows how bad for you , and leave government service members who will never before in our borders from a whole thing . <END>'
'The Democratic Convention was invaded and restore dignity and put her emails she delivers for teachers but we 're asking for the systemic failures in line , go and together as your jobs , about the side of the swamp of every year including one from our southern border with backdoor tariffs , you 've been charged with all is no moral character . <END>'
4-grams:
'I do have a reaction to the prosecutor in Baltimore who indicted those police officers who probably could have made a deal to DESTROY the laptops of government officials implicated in a massive criminal cover - up her crimes . <END>'
'It is no great secret that many of the great veterans as you saw , and it arrives on November 8th , the arrogance of Washington , D.C. <END>'
'Every day we fail to enforce our laws is a day when a loving parent is at risk of losing their tax - exempt status . <END>'
Other drawbacks of MLE: Overfitting
Use LM as text generator (Trump speeches)
7-grams:
'Perhaps it is easy for politicians to lose touch with reality when they are being paid millions of dollars to read speeches to Wall Street executives instead of spending time with real people in real pain . <END>'
'You saw it the other day with the truck screaming out the window . <END>'
Smoothing
“Whenever data sparsity is an issue, smoothing can help performance,
and data sparsity is almost always an issue in statistical modeling. In the
extreme case where there is so much training data that all parameters
can be accurately trained without smoothing, one can almost always
expand the model, such as by moving to a higher n-gram model, to
achieve improved performance. With more parameters data sparsity
becomes an issue again, but with proper smoothing the models are
usually more accurate than the original models. Thus, no matter how
much data one has, smoothing can almost always help performance, and
for a relatively small effort.”
Chen & Goodman (1998)
Smoothing
Simplest: Laplace smoothing
reserve some probability mass for unseen events
Just assume you saw each word $\alpha$ times more than you actually did (called Laplace when $\alpha = 1$)
MLE: $P(w \mid c) = \frac{\#(cw)}{\#(c)}$
Smoothed: $P(w \mid c) = \frac{\#(cw) + \alpha}{\#(c) + \alpha V}$
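A one-function sketch of add-$\alpha$ smoothing (my own; the vocabulary size V below is a hypothetical value for illustration):

def p_add_alpha(c_cw, c_c, V, alpha=1.0):
    """Add-alpha estimate P(w|c) = (#(cw) + alpha) / (#c + alpha*V).
    alpha = 1 is Laplace smoothing; unseen events get a small
    non-zero probability instead of 0."""
    return (c_cw + alpha) / (c_c + alpha * V)

# "boil Enoch" is unseen, but no longer gets probability 0
# (V = 800000 is an illustrative vocabulary size):
print(p_add_alpha(c_cw=0, c_c=2659, V=800000))  # small but > 0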
Back-off & Interpolation
Sometimes less is better
Back-off: if not enough evidence, use smaller context
if $c(w_{i-2} w_{i-1}) > K$: $p(w_i \mid w_{i-2} w_{i-1}) = p_2(w_i \mid w_{i-2} w_{i-1})$
otherwise: $p(w_i \mid w_{i-2} w_{i-1}) = p_1(w_i \mid w_{i-1})$
Back-off & Interpolation
Interpolation (aka Jelinek-Mercer) : combine signal from >1 context
$p(w_i \mid w_{i-2} w_{i-1}) = \lambda_1 p_2(w_i \mid w_{i-2} w_{i-1}) + (1 - \lambda_1) p_1(w_i \mid w_{i-1})$
Can be done recursively.
Base model: MLE, or uniform
$\lambda_i$:
• estimated using held-out data (≠training). Why?
• can be context-dependent (bucketing to reduce parameter explosion)
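A minimal sketch of recursive Jelinek-Mercer interpolation (my own; the $\lambda$ values are illustrative placeholders, not tuned on held-out data as they should be):

def p_interp(w, prev2, prev1, p2, p1, p0, lam2=0.6, lam1=0.3):
    """Interpolate trigram, bigram and unigram estimates.
    p2, p1, p0 are assumed to be probability functions (e.g. MLE);
    lam2 and lam1 are placeholder weights."""
    return (lam2 * p2(w, prev2, prev1)
            + lam1 * p1(w, prev1)
            + (1 - lam2 - lam1) * p0(w))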
Witten-Bell
Intuition:
• not many different words will follow (“San ____”) or (“in spite ___”)
• if context-diversity is high, then the context doesn't provide much information
  if a word w occurs V times with V distinct contexts, then its presence is not informative
• Model the probability of using the smaller-order model as $1 - \lambda_c = \frac{rd(c)}{rd(c) + \sum_{w_i} \#(c w_i)}$
where $rd(c)$ is the right-diversity of $c$: $|\{w : \#(cw) > 0\}|$
Witten-Bell
Example:
• “New” always followed by “York”: $1 - \lambda_{New} = \frac{1}{1 + c(\text{New York})}$
• $c$ always followed by a different word: $1 - \lambda_c = \frac{\#c}{2\#c} = \frac{1}{2}$
Originally developed for compression
Compressing ~ Learning
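A small sketch of the Witten-Bell weight (my own), reproducing both examples above:

def witten_bell_weight(counts_after_c):
    """Given the dict {w: #(cw)} of words observed after context c,
    return (1 - lambda_c) = rd(c) / (rd(c) + sum_w #(cw)),
    the probability mass handed to the lower-order model."""
    rd = len(counts_after_c)              # right-diversity of c
    total = sum(counts_after_c.values())  # occurrences of context c
    return rd / (rd + total)

# "New" always followed by "York":
print(witten_bell_weight({"York": 100}))   # 1 / (1 + 100)
# context always followed by a different word (#c = rd(c)):
print(witten_bell_weight({"a": 1, "b": 1, "c": 1, "d": 1}))  # 0.5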
Kneser-Ney
Key innovation: Absolute discounting + better modelling of lower-order
Absolute discounting: again, reserve probability mass for unseen events
$p(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1} w_i) - \delta,\, 0)}{c(w_{i-1})} + \lambda_{w_{i-1}}\, p(w_i)$
We need $\sum_{w_i} p(w_i \mid w_{i-1}) = 1$. How many times does $\delta$ get discounted?
• V?
• $rd(w_{i-1})$!
⇒ $\lambda_{w_{i-1}} = \delta \cdot \frac{rd(w_{i-1})}{c(w_{i-1})}$
Kneser-Ney II
Lower-order model
Assumes it is coming from an interpolated model:
lower-order only used when higher-order useless
Intuition: assume "Rica" occurs very often, but only ever preceded by "Costa".
In an interpolated model the unigram probability p(Rica) would be relatively high, even though the unigram is only consulted when the higher-order estimate p(Rica | c) is not modelled well enough; that happens when c ≠ Costa, so p(Rica) should be low there.
Kneser-Ney II
Lower-order model
$p(w_i) = \frac{ld(w_i)}{N_{\text{bigrams}}}$
where $ld(w_i)$ is the left-diversity of $w_i$ (the number of distinct words preceding it)
Can be done recursively as well.
“Modified Kneser-Ney” (normally) uses 3 different values for 𝛿 (for ngrams occurring 1, 2 and 3+ times).
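A compact sketch of interpolated bigram Kneser-Ney (my own, with a single discount $\delta$ rather than the three of modified KN; all names are mine):

from collections import Counter, defaultdict

def kneser_ney_bigram(bigrams, delta=0.75):
    """Build p(w | prev) from a list of (prev, w) pairs, using absolute
    discounting plus a left-diversity ("continuation") unigram model."""
    c_big = Counter(bigrams)
    c_prev = Counter(prev for prev, _ in bigrams)
    rd = defaultdict(set)   # right diversity: distinct words after prev
    ld = defaultdict(set)   # left diversity: distinct words before w
    for prev, w in c_big:
        rd[prev].add(w)
        ld[w].add(prev)
    n_bigram_types = len(c_big)

    def p(w, prev):
        """prev is assumed to have been observed in training."""
        p_cont = len(ld[w]) / n_bigram_types          # lower-order model
        lam = delta * len(rd[prev]) / c_prev[prev]    # reserved mass
        return max(c_big[(prev, w)] - delta, 0) / c_prev[prev] + lam * p_cont

    return p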
Data structures
• Hashing
• Approximate hashing:
Bloom Filters: store each n-gram $w$ with its quantized count $1 + \lfloor \log \#w \rfloor$
• Quantize probabilities
• Suffix Trees
Smoothed Bloom filter language models: Tera-Scale LMs on the Cheap. Talbot & Osborne. EMNLP 2007
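For intuition, a minimal Bloom filter sketch (my own simplification; Talbot & Osborne additionally store the quantized counts in the filter):

import hashlib

class BloomFilter:
    """k hash functions over an m-bit array. Membership tests can give
    false positives but never false negatives."""
    def __init__(self, m=1 << 20, k=4):
        self.m, self.k, self.bits = m, k, bytearray(m // 8 + 1)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("boil an egg")
print("boil an egg" in bf, "boil Enoch" in bf)  # True, (almost surely) False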
Data structures: Suffix Trees
https://www.researchgate.net/publication/315676593_Accelerating_a_BWT-based_exact_search_on_multi-GPU_heterogeneous_computing_platforms
https://en.wikipedia.org/wiki/Suffix_tree
With suffix links
Suffix Trees as Language Models. Kennington et al. LREC 2012
Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees. Shareghi, EMNLP 2016
Efficiency: Pruning
• Count-based: remove all ngrams < K
removes in particular singletons
• Probability-based: $p(cw)\,[\log p(w \mid c) - \log p'(w \mid c)]$
• Relative-entropy: $\sum_{c_i, w_j} p(c_i w_j)\,[\log p(w_j \mid c_i) - \log p'(w_j \mid c_i)]$
!! Pruning and smoothing do not always work well together
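Count-based pruning is a one-liner; a sketch (mine):

def prune_by_count(ngram_counts, K=2):
    """Count-based pruning: drop all n-grams seen fewer than K times
    (in particular, all singletons when K = 2)."""
    return {ng: c for ng, c in ngram_counts.items() if c >= K}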
Out-of-Vocabulary words
What if you see new words?
(remember: ~1k new English words per year, plus named entities, spelling errors, transliterations)
One of the most common problems in NLP
1. Character-based LM
• Hybrid word-char models
2. Train with <UNK> token
• Keep only V' words (most common, centroids, most-discriminative)
• Everything not in V' gets mapped to <UNK>
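A minimal sketch of the <UNK> strategy with a most-common-words vocabulary (my own; max_size is an illustrative parameter):

from collections import Counter

def build_vocab(tokens, max_size=50000):
    """Keep the max_size most common words as V'."""
    return {w for w, _ in Counter(tokens).most_common(max_size)}

def map_unk(tokens, vocab):
    """Everything outside V' becomes the <UNK> token."""
    return [w if w in vocab else "<UNK>" for w in tokens]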
Other LM approaches: History-based
$P(w \mid ctxt, history) = \lambda P(w \mid ctxt) + (1 - \lambda)\,\frac{\#(w \in history)}{|history|}$
Or just linear interpolation: $P(w \mid ctxt, history) = \lambda P(w \mid ctxt) + (1 - \lambda)\, P_{history}(w \mid ctxt)$
Also useful when you have a (small) amount of in-domain text
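A sketch of the cache/history interpolation above (mine), assuming the static model is given as a function p_lm:

def p_cache(w, ctxt, history, p_lm, lam=0.9):
    """Cache / history-based LM: interpolate the static model with the
    empirical frequency of w in the recent history (a list of tokens)."""
    p_hist = history.count(w) / len(history) if history else 0.0
    return lam * p_lm(w, ctxt) + (1 - lam) * p_hist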
Other approaches: Parsing-based
We decided $P(w_1, w_2, \ldots, w_{|s|}) = P(w_{|s|} \mid w_1, \ldots, w_{|s|-1}) \cdots P(w_2 \mid w_1) \cdot P(w_1)$, which is a completely arbitrary decomposition
Assume you have a probabilistic parser with which you can compute $p(t \mid s)$; then you can define
$P(s) = \sum_t P(s \mid t)$
Other approaches: Class-based
Define set of classes 𝑐1, 𝑐2, … , 𝑐𝑘
Define $P(w_i \mid ctxt) = \sum_{c_j} P(w_i \mid c_j)\, P(c_j \mid ctxt)$
Typical classes are Part-of-Speech tags (Noun, Verb, Adj, etc.)
But they can also be induced
Paris is the capital of France → CITY is the capital of COUNTRY
Berlin is the capital of Germany → CITY is the capital of COUNTRY
Rome is the capital of Italy → CITY is the capital of COUNTRY
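The class-based formula as a direct sketch (mine), assuming the two component distributions are given as functions:

def p_class_based(w, ctxt, classes, p_w_given_c, p_c_given_ctxt):
    """Class-based LM: P(w | ctxt) = sum_c P(w | c) * P(c | ctxt)."""
    return sum(p_w_given_c(w, c) * p_c_given_ctxt(c, ctxt) for c in classes)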
Other approaches: max-entropy
Define feature vector 𝜙(𝑤𝑖 , 𝑤1…𝑤𝑖−1) of size d
And then, find 𝜃 that maximises
$P_\theta(w_i \mid w_1 \ldots w_{i-1}) = \frac{\exp(\theta \cdot \phi(w_i, w_1 \ldots w_{i-1}))}{\sum_{w \in \Sigma} \exp(\theta \cdot \phi(w, w_1 \ldots w_{i-1}))}$
Can use any feature you can dream of (syntactic, grammatical, external)
Log-linear model: easy to train (convex)
Can control the # parameters (d)
Drawback: the normalization sums over a huge vocabulary
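A sketch of the max-entropy prediction rule (mine), assuming a feature function phi is given; note that the normalization loops over the whole vocabulary:

import numpy as np

def p_maxent(theta, phi, w, history, vocab):
    """Log-linear LM: P(w | history) proportional to exp(theta . phi(w, history)).
    vocab is a list of all words; phi returns a numpy feature vector."""
    scores = np.array([theta @ phi(v, history) for v in vocab])
    probs = np.exp(scores - scores.max())   # subtract max for stability
    probs /= probs.sum()
    return probs[vocab.index(w)]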
Neural Language Modeling
The cat is walking in a bedroom
A dog was running in the room
To us these two are very similar, but they would share almost no parameters in an n-gram model
Language is discrete
No clear relationship between dog/cat, is/was
Continuous representation of words
Already existed before 2010, but was not mainstream
Nowadays, ubiquitous in NLP
Continuous => Learn
Overall view
[Figure: feed-forward neural LM. Context words $w_{i-4}, w_{i-3}, w_{i-2}, w_{i-1}$ are mapped to embeddings $e(w_{i-4}), \ldots, e(w_{i-1})$, concatenated and fed through a tanh hidden layer; a softmax over the vocabulary predicts $w_i$.]
Practical details
lots of effort on how to do this efficiently
data-parallel training
parameter-parallel training
The final prediction was actually an interpolation of the NN with a 3-gram model
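A forward-pass sketch of such a feed-forward neural LM (my own, with untrained random weights; the sizes V, d, h, n are illustrative):

import numpy as np

rng = np.random.default_rng(0)
V, d, h, n = 10000, 64, 128, 4        # vocab, embedding, hidden, context size

E  = rng.normal(0, 0.1, (V, d))       # word embeddings e(w)
W1 = rng.normal(0, 0.1, (h, n * d))   # concatenated input -> hidden
W2 = rng.normal(0, 0.1, (V, h))       # hidden -> vocabulary scores

def nnlm_probs(context_ids):
    """Concatenate the n context embeddings, apply a tanh hidden layer,
    then a softmax over the vocabulary."""
    x = np.concatenate([E[i] for i in context_ids])      # (n*d,)
    hidden = np.tanh(W1 @ x)                             # (h,)
    scores = W2 @ hidden                                 # (V,)
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                               # P(w_i | context)

p = nnlm_probs([1, 2, 3, 4])          # P(w_i | w_{i-4} ... w_{i-1})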
Recurrent Neural Networks: the intuition
Obtains state-of-the-art in many NLP tasks
RNNs are designed to handle sequences (the main difference with vision)
Unlike a traditional feed-forward NN, an RNN feeds back into itself:
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Recurrent Neural Networks: the intuition
To bypass the explosion of parameters of n-gram models, use the same weights at each time-step
[Figure: unrolled RNN. At each step $t$, the input $x_t$ and the previous hidden state $h_{t-1}$ produce the new state $h_t$ and the prediction $y_t$; the same matrices $W_h$, $W_w$, $W_p$ are reused at every step.]
h: size of hidden layer
d: size of word embedding
V: size of vocabulary
$W_h \in \mathbb{R}^{h \times h}$, $W_w \in \mathbb{R}^{h \times d}$, $W_p \in \mathbb{R}^{V \times h}$
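A sketch of one recurrent step with the shared matrices above (my own, untrained random weights, illustrative sizes):

import numpy as np

rng = np.random.default_rng(0)
V, d, h = 10000, 64, 128
Wh = rng.normal(0, 0.1, (h, h))   # state -> state (shared across steps)
Ww = rng.normal(0, 0.1, (h, d))   # embedding -> state
Wp = rng.normal(0, 0.1, (V, h))   # state -> vocabulary scores

def rnn_step(h_prev, x_t):
    """One time-step: the same Wh, Ww, Wp are reused at every position,
    so the parameter count is independent of the sequence length."""
    h_t = np.tanh(Wh @ h_prev + Ww @ x_t)
    scores = Wp @ h_t
    exp = np.exp(scores - scores.max())
    return h_t, exp / exp.sum()       # new state, P(w_{t+1} | w_1 .. w_t)

state = np.zeros(h)
for x in rng.normal(size=(5, d)):     # 5 dummy embedded words
    state, probs = rnn_step(state, x)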
LSTM
NNs are trained with gradient-descent methods
Back-propagation composes gradients across layers and time-steps
This ends up multiplying many numbers in [0, 1], which tends to 0
⇒ vanishing gradient problem
Long short-term memory
Adds gates: switches that decide when to forget the past
http://deeplearning.net/tutorial/lstm.html
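A sketch of one LSTM step showing the forget gate explicitly (the standard formulation; variable names are mine):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [x; h_prev] to the four gate pre-activations
    (shape (4H, d+H)); b has shape (4H,)."""
    z = W @ np.concatenate([x, h_prev]) + b
    H = h_prev.size
    f = sigmoid(z[:H])            # forget gate: keep or drop the past
    i = sigmoid(z[H:2*H])         # input gate: admit new information
    o = sigmoid(z[2*H:3*H])       # output gate
    g = np.tanh(z[3*H:])          # candidate cell update
    c = f * c_prev + i * g        # additive cell state: gradients flow
    h = o * np.tanh(c)            #   without repeated squashing
    return h, c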