Language Modeling 2
CS287
Review: Windowed Setup
Let V be the vocabulary of English and let s score any window of size d_win = 5. If we see the phrase
[ the dog walks to the ]
it should score higher under s than
[ the dog house to the ]
[ the dog cats to the ]
[ the dog skips to the ]
...
Review: Continuous Bag-of-Words (CBOW)
y = softmax((∑_i x_{0,i} W^0 / (d_win − 1)) W^1)
- Attempt to predict the middle word
[ the dog walks to the ]
Example: CBOW
x = (v(w3) + v(w4) + v(w6) + v(w7)) / (d_win − 1)
y = δ(w5)
- W^1 is no longer partitioned by row (order is lost)
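A minimal numpy sketch of this CBOW forward pass; the matrix sizes, initialization, and context indices below are illustrative assumptions, not values from the lecture:

```python
import numpy as np

V, d_hid, d_win = 10_000, 50, 5           # assumed sizes, for illustration
W0 = np.random.randn(V, d_hid) * 0.01     # word embeddings (lookup by row)
W1 = np.random.randn(d_hid, V) * 0.01     # output projection

def cbow_forward(context_ids):
    """context_ids: indices of the d_win - 1 context words (middle word removed)."""
    # Summing one-hot rows times W0 equals summing embedding rows; word order is lost.
    x = W0[context_ids].sum(axis=0) / (d_win - 1)
    z = x @ W1
    z -= z.max()                           # stabilize the softmax
    p = np.exp(z) / np.exp(z).sum()
    return p                               # distribution over the middle word

p = cbow_forward([3, 4, 6, 7])             # e.g. v(w3) + v(w4) + v(w6) + v(w7)
print(p.argmax(), p.max())
```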
Review: Softmax Issues
Use a softmax to force a distribution,
softmax(z) = exp(z) / ∑_{c∈C} exp(z_c)
log softmax(z) = z − log ∑_{c∈C} exp(z_c)
- Issue: the class set C is huge.
- For C&W, 100,000 types; for word2vec, 1,000,000 types
- Note: the largest dataset is 6 billion words
Quiz
We have trained a depth-two balanced softmax tree. Give an algorithm and its time complexity for:
- Computing the optimal class decision.
- Greedily approximating the optimal class decision.
- Appropriately sampling a word.
- Computing the full distribution over classes.
Answers I
- Computing the optimal class decision, i.e. arg max_y p(y|x)?
  - Requires full enumeration, O(|V|):
    arg max_y p(y|x) = arg max_y p(y|C, x) p(C|x)
- Greedily approximating the optimal class decision?
  - Walk the tree, O(√|V|):
    c* = arg max_c p(C = c|x)
    y* = arg max_y p(y|C = c*, x)
Answers II
- Appropriately sampling a word y ∼ p(y|x)?
  - Walk the tree, O(√|V|):
    c ∼ p(C|x)
    y ∼ p(y|C = c, x)
- Computing the full distribution over classes?
  - Enumerate, O(|V|): for each word w in class c,
    p(y = δ(w)|x) = p(C = c|x) p(y = δ(w)|C = c, x)
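A small numpy sketch of these four operations for a two-level (depth-two) softmax tree; the class assignment, the random scores standing in for the model, and the toy sizes are illustrative assumptions rather than the lecture's actual parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
V = 9                                           # toy vocabulary size
C = 3                                           # ~sqrt(|V|) classes, each with V // C words
word_class = np.repeat(np.arange(C), V // C)    # fixed class of each word (assumed)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class_scores = rng.normal(size=C)               # stand-ins for model scores given x
word_scores = rng.normal(size=V)

p_class = softmax(class_scores)                 # p(C | x)
p_word_given_class = np.zeros(V)                # p(y | C, x)
for c in range(C):
    idx = word_class == c
    p_word_given_class[idx] = softmax(word_scores[idx])

# 4. Full distribution / 1. optimal decision: enumerate all |V| words, O(|V|).
p_full = p_class[word_class] * p_word_given_class
y_opt = p_full.argmax()

# 2. Greedy approximation: best class, then best word inside it, O(sqrt(|V|)).
c_star = p_class.argmax()
in_star = np.flatnonzero(word_class == c_star)
y_greedy = in_star[p_word_given_class[in_star].argmax()]

# 3. Exact sampling: sample a class, then a word within it, O(sqrt(|V|)).
c = rng.choice(C, p=p_class)
in_c = np.flatnonzero(word_class == c)
y_sample = rng.choice(in_c, p=p_word_given_class[in_c] / p_word_given_class[in_c].sum())

print(y_opt, y_greedy, y_sample, round(p_full.sum(), 6))   # p_full sums to 1
```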
Contents
Language Modeling
NGram Models
Smoothing
Language Models in Practice
Language Modeling Task
Given a sequence of text give a probability distribution over the next
word.
The Shannon game. Estimate the probability of the next letter/word
given the previous.
THE ROOM WAS NOT VERY LIGHT A SMALL OBLONG
READING LAMP ON THE DESK SHED GLOW ON
POLISHED
Language Modeling
Shannon (1948), A Mathematical Theory of Communication
We may consider a discrete source, therefore, to be
represented by a stochastic process. Conversely, any
stochastic process which produces a discrete sequence of
symbols chosen from a finite set may be considered a discrete
source. This will include such cases as:
1. Natural written languages such as English, German,
Chinese. ...
Shannon’s Babblers I
4. Third-order approximation (trigram structure as in English).
IN NO 1ST LAT WHEY CRATICT FROURE BIRS GROCID
PONDENOME OF DEMONSTURES OF THE REPTAGIN IS
REGOACTIONA OF CRE
5. First-Order Word Approximation. Rather than continue
with tetragram, ... , n-gram structure it is easier and better to
jump at this point to word units. Here words are chosen
independently but with their appropriate frequencies.
REPRESENTING AND SPEEDILY IS AN GOOD APT OR
COME CAN DIFFERENT NATURAL HERE HE THE A IN
CAME THE TO OF TO EXPERT GRAY COME TO
FURNISHES THE LINE MESSAGE HAD BE THESE.
Shannon’s Babblers II
6. Second-Order Word Approximation. The word transition
probabilities are correct but no further structure is included.
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH
WRITER THAT THE CHARACTER OF THIS POINT IS
THEREFORE ANOTHER METHOD FOR THE LETTERS
THAT THE TIME OF WHO EVER TOLD THE PROBLEM
FOR AN UNEXPECTED
The resemblance to ordinary English text increases quite
noticeably at each of the above steps.
Smoothness Image/Language
It is a capital mistake to theorize before one has data.
Insensibly one begins to twist facts to suit theories, instead of
theories to suit facts. -Sherlock Holmes, A Scandal in Bohemia
Repair Image / Language (Chatterjee et al., 2009)
It is a capital mistake to theorize before one has . . .
Repair Image / Language (Chatterjee et al., 2009)
108 938 285 28 184 29 593 219 58 772 . . .
Language Modeling
Crucially important for:
- Speech Recognition
- Machine Translation
- Many deep learning applications
  - Captioning
  - Dialogue
  - Summarization
  - ...
Statistical Model of Speech
How permanent are their records?
Statistical Model of Speech
Statistical Model of Speech
Transcription: Help peppermint on their records
Statistical Model of Speech
Transcription: How permanent are their records
Language Modeling Formally
Goal: compute the probability of a sentence.
- Factorization:
  p(w1, ..., wn) = ∏_{t=1}^{n} p(wt | w1, ..., w_{t−1})
- Estimate the probability of the next word, conditioned on the prefix,
  p(wt | w1, ..., w_{t−1}; θ)
Machine Learning Setup
Multi-class prediction problem,
(x1, y1), ..., (xn, yn)
- yi: the one-hot next word
- xi: representation of the prefix (w1, ..., w_{t−1})
Challenges:
- How do you represent the input?
- Smoothing is crucially important.
- The output space is very large (next class).
Problem Metric
Previously, we used accuracy as a metric.
Language modeling uses a version of average negative log-likelihood.
- For test data w1, ..., wn:
  NLL = −(1/n) ∑_{i=1}^{n} log p(wi | w1, ..., w_{i−1})
- We actually report perplexity,
  perp = exp(−(1/n) ∑_{i=1}^{n} log p(wi | w1, ..., w_{i−1}))
Note: this requires modeling the full distribution, as opposed to just the argmax (hinge loss).
Perplexity: Intuition
- The effective size of an equivalent uniform distribution.
- If words were uniform: perp = |V| = 10,000
- Using a unigram model: perp ≈ 400
- log2(perp) gives the average number of bits needed per word.
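A minimal sketch of computing these quantities from per-word model probabilities; the probability list here is made up for illustration:

```python
import math

# Hypothetical model probabilities p(w_i | w_1, ..., w_{i-1}) for a 5-word test sequence.
probs = [0.1, 0.02, 0.3, 0.05, 0.2]

nll = -sum(math.log(p) for p in probs) / len(probs)   # average negative log-likelihood
perp = math.exp(nll)                                  # perplexity = exp(NLL)

print(f"NLL = {nll:.3f} nats/word, perplexity = {perp:.1f}")
print(f"bits per word = {nll / math.log(2):.3f}")     # log2(perp)
```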
Contents
Language Modeling
NGram Models
Smoothing
Language Models in Practice
n-gram Models
In practice the representation doesn't use the whole prefix (w1, ..., w_{t−1}):
p(wi | w1, ..., w_{i−1}) ≈ p(wi | w_{i−n+1}, ..., w_{i−1}; θ)
- We call this an n-gram model.
- w_{i−n+1}, ..., w_{i−1}: the context
Bigram Models
Back to count-based multinomial estimation.
- Feature set F: the previous word
- Input vector x is sparse
- Count matrix F ∈ R^{V×V}
- Counts come from training data:
  F_{c,w} = ∑_i 1(w_{i−1} = c, wi = w)
  p_ML(wi = w | wi−1 = c; θ) = F_{c,w} / F_{c,•}
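A small sketch of this count-based bigram estimator over a toy corpus; the corpus and the dictionary-of-dictionaries layout are illustrative choices, not from the lecture:

```python
from collections import defaultdict

corpus = "the dog barks the dog sleeps the cat runs a dog runs".split()

# F[c][w] ~ F_{c,w}: count of word w following context word c
F = defaultdict(lambda: defaultdict(int))
for prev, cur in zip(corpus, corpus[1:]):
    F[prev][cur] += 1

def p_ml(w, c):
    """Maximum-likelihood bigram probability p_ML(w | c) = F_{c,w} / F_{c,.}."""
    total = sum(F[c].values())
    return F[c][w] / total if total else 0.0

print(p_ml("dog", "the"))   # 2/3: "the" is followed by "dog" twice, "cat" once
print(p_ml("runs", "dog"))  # 1/3
```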
Trigram Models
- Feature set F: the previous two words (conjunction)
- Input vector x is sparse
- Count matrix F ∈ R^{(V×V)×V}
  F_{c,w} = ∑_i 1(w_{i−2:i−1} = c, wi = w)
  p_ML(wi = w | w_{i−2:i−1} = c; θ) = F_{c,w} / F_{c,•}
Notation
- c = w_{i−n+1:i−1}: the context
- c′ = w_{i−n+2:i−1}: the context without its first word
- c′′ = w_{i−n+3:i−1}: the context without its first two words
- [x]_+ = max{0, x}: positive part (ReLU)
- F_{•,w} = ∑_c F_{c,w}: sum over a dimension
- Maximum likelihood: p_ML(w|c) = F_{c,w} / F_{c,•}
- Non-zero count indicator, for all c, w: N_{c,w} = 1(F_{c,w} > 0)
NGram Models
It is common to go up to 5-grams,
F_{c,w} = ∑_i 1(w_{i−5+1:i−1} = c, wi = w)
The count matrix becomes very sparse at 5-grams.
Google 1T
Number of tokens: 1,024,908,267,229
Number of sentences: 95,119,665,584
Size compressed (counts only): 24 GB
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663
Contents
Language Modeling
NGram Models
Smoothing
Language Models in Practice
NGram Models
- Maximum-likelihood estimates model language terribly: unseen n-grams get zero probability.
- N-gram counts are sparse, so we need to handle unseen cases.
- The presentation follows the work of Chen and Goodman (1999).
Count Modifications
- Laplace smoothing:
  F_{c,w} ← F_{c,w} + α for all c, w
- Good-Turing smoothing (Good, 1953):
  F_{c,w} ← (F_{c,w} + 1) · hist(F_{c,w} + 1) / hist(F_{c,w}) for all c, w,
  where hist(r) is the number of n-grams with count r.
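A short sketch of Laplace-smoothed bigram probabilities; the counts, vocabulary, and α below are assumptions for illustration:

```python
# Laplace (add-alpha) smoothed bigram probability, assuming the toy counts and
# vocabulary below; a sketch, not the lecture's implementation.
vocab = ["the", "a", "dog", "cat", "barks", "sleeps", "runs"]
F = {"the": {"dog": 2, "cat": 1}}        # toy counts F_{c,w}
alpha = 0.1

def p_laplace(w, c):
    num = F.get(c, {}).get(w, 0) + alpha
    den = sum(F.get(c, {}).values()) + alpha * len(vocab)
    return num / den

print(p_laplace("dog", "the"))   # (2 + 0.1) / (3 + 0.7)
print(p_laplace("runs", "the"))  # unseen bigram still gets probability mass
```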
Real Issues N-Gram Sparsity
- Histogram of trigram counts (figure)
Idea 1: Interpolation
For trigrams:
p_interp(w|c) = λ1 p_ML(w|c) + λ2 p_ML(w|c′) + λ3 p_ML(w|c′′)
Ensure that the λs form a convex combination:
∑_i λi = 1 and λi ≥ 0 for all i
Idea 1: Interpolation (Jelinek-Mercer Smoothing)
Can be written recursively,
p_interp(w|c) = λ p_ML(w|c) + (1 − λ) p_interp(w|c′)
Ensure that λ forms a convex combination: 0 ≤ λ ≤ 1
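A recursive sketch of this Jelinek-Mercer interpolation for a trigram model; the fixed λ, the count tables, and the backoff-to-uniform base case are illustrative assumptions:

```python
# Jelinek-Mercer interpolation sketch: p(w | context) interpolates the ML
# estimate with the same model on a shorter context. Counts and lambda are
# assumed, and the empty context backs off to a uniform distribution.
V = 10_000
lam = 0.7

counts = {                      # F_{c,w} for contexts of length 2, 1, 0
    ("the", "dog"): {"barks": 1, "runs": 2},
    ("dog",): {"barks": 1, "runs": 2, "sleeps": 1},
    (): {"barks": 3, "runs": 5, "sleeps": 2, "the": 6, "dog": 4},
}

def p_interp(w, context):
    if not context:                          # base case: unigram + uniform
        F_c = counts[()]
        return lam * F_c.get(w, 0) / sum(F_c.values()) + (1 - lam) / V
    F_c = counts.get(tuple(context), {})
    total = sum(F_c.values())
    p_ml = F_c.get(w, 0) / total if total else 0.0
    return lam * p_ml + (1 - lam) * p_interp(w, context[1:])

print(p_interp("runs", ("the", "dog")))
print(p_interp("sleeps", ("the", "dog")))    # unseen trigram, nonzero via backoff
```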
How to set parameters: Validation Tuning
- Treat λ as a hyperparameter.
- Can estimate λ with EM (or with gradients directly).
- Alternatively: grid search on held-out data.
- Can add more parameters, e.g. λ(c, w).
- Language models are often stored with a probability and a backoff value (λ) per entry.
How to set parameters: Witten-Bell
Define λ(c, w) as a function of c, w:
(1 − λ) = N_{c,•} / (N_{c,•} + F_{c,•})
- N_{c,•}: the number of unique word types seen after context c
p_wb(w|c) = (F_{c,w} + N_{c,•} · p_wb(w|c′)) / (F_{c,•} + N_{c,•})
The interpolation weight estimates the probability of a new event: it is proportional to how often a new word type follows the context.
Idea 2: Absolute Discounting
Similar in form to interpolation:
p_abs(w|c) = λ(w, c) p_ML(w|c) + η(c) p_abs(w|c′)
Delete counts and redistribute:
p_abs(w|c) = [F_{c,w} − D]_+ / F_{c,•} + η(c) p_abs(w|c′)
For a given c we need
∑_w [λ(w, c) p1(w) + η(c) p2(w)] = ∑_w λ(w, c) p1(w) + η(c) = 1
(using ∑_w p2(w) = 1).
In-class: to ensure a valid distribution, what are λ and η when D = 1?
Answer
λ(w) = [F_{c,w} − D]_+ / F_{c,w}

η(c) = 1 − ∑_w [F_{c,w} − D]_+ / F_{c,•}
     = ∑_w (F_{c,w} − [F_{c,w} − D]_+) / F_{c,•}

Total counts minus discounted counts: with D = 1, each word type with a non-zero count contributes exactly D, so
η(c) = D · N_{c,•} / F_{c,•}
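A sketch of absolute discounting for bigrams with D = 1, backing off to a unigram distribution; the counts and the unigram backoff values are assumptions for illustration:

```python
# Absolute discounting sketch (D = 1): subtract D from every seen count and
# give the freed mass eta(c) = D * N_{c,.} / F_{c,.} to a lower-order model.
D = 1.0
F = {"the": {"dog": 2, "cat": 1}}                            # toy bigram counts F_{c,w}
unigram = {"the": 0.3, "dog": 0.3, "cat": 0.2, "runs": 0.2}  # assumed p(w)

def p_abs(w, c):
    F_c = F.get(c, {})
    total = sum(F_c.values())
    if total == 0:
        return unigram.get(w, 0.0)
    discounted = max(F_c.get(w, 0) - D, 0.0) / total
    eta = D * sum(1 for v in F_c.values() if v > 0) / total
    return discounted + eta * unigram.get(w, 0.0)

print(p_abs("dog", "the"))    # (2-1)/3 + (2/3)*0.3
print(p_abs("runs", "the"))   # unseen bigram: mass comes from the backoff
print(sum(p_abs(w, "the") for w in unigram))   # sums to 1
```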
Lower-Order Smoothing Issue
Consider the example: San Francisco and Los Francisco.
Would like:
- p(wi−1 = San, wi = Francisco) to be high
- p(wi−1 = Los, wi = Francisco) to be very low
However, interpolation alone doesn't ensure this. Why?
(1 − λ) p(wi = Francisco)
could be quite high, from seeing San Francisco many times.
Kneser-Ney Smoothing
- Uses absolute discounting instead of ML.
- However, we want the distribution to match ML on lower-order terms.
- Ensure this by marginalizing out and matching the lower-order distribution.
- Uses a different function, KN′, for the lower-order term.
- The most commonly used smoothing technique (details follow).
Review: Chain-Rule and Marginalization
Chain rule:
p(X, Y) = p(X|Y) p(Y)
Marginalization:
p(X, Y) = ∑_{z∈Z} p(X, Y, Z = z)
Marginal matching constraint: we may have pA(X, Y) ≠ pB(X, Y) while still requiring
∑_y pA(X, Y = y) = pB(X)
Kneser-Ney Smoothing
Main idea: match the ML marginals for every lower-order context c′ and word w, writing the full context as c = c1 c′:
∑_{c1} p_KN(c1, c′, w) = ∑_{c1} p_KN(w|c) p_ML(c) = p_ML(c′, w)
i.e.
∑_{c1} p_KN(w|c) F_{c,•} / F_{•,•} = F_{c′,w} / F_{•,•}
[ c1 c′ w ]
Kneser-Ney Smoothing
p_KN(w|c) = [F_{c,w} − D]_+ / F_{c,•} + η(c) p_KN′(w|c′), where η(c) = D N_{c,•} / F_{c,•}

Substituting into the marginal constraint:
F_{c′,w} = ∑_{c1} F_{c,•} p_KN(w|c)
         = ∑_{c1} F_{c,•} [ [F_{c,w} − D]_+ / F_{c,•} + (D / F_{c,•}) N_{c,•} p_KN′(w|c′) ]
         = ∑_{c1 : N_{c,w} > 0} (F_{c,w} − D) + ∑_{c1} D N_{c,•} p_KN′(w|c′)
         = F_{c′,w} − N_{•c′,w} D + D N_{•c′,•} p_KN′(w|c′)

Solving for the lower-order term gives p_KN′(w|c′) = N_{•c′,w} / N_{•c′,•}.
Kneser-Ney Smoothing
Final equation for the unigram case:
p_KN′(w) = N_{•,w} / N_{•,•}
- Intuition: the proportion of unique bigram types ending in w.
p_KN′(w|c′) = N_{•c′,w} / N_{•c′,•}
- Intuition: the proportion of unique n-gram types (with middle c′) ending in w.
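A bigram-level sketch of interpolated Kneser-Ney: the higher-order term is absolute-discounted, and the lower-order term is the continuation probability N_{•,w} / N_{•,•}. The corpus and D are illustrative assumptions:

```python
from collections import defaultdict

# Interpolated Kneser-Ney for bigrams (a sketch with an assumed corpus and D).
corpus = "the dog barks the dog runs the cat runs a cat sleeps".split()
D = 0.75

F = defaultdict(lambda: defaultdict(int))          # F_{c,w}
for c, w in zip(corpus, corpus[1:]):
    F[c][w] += 1

# Continuation counts: N_{.,w} = number of distinct contexts that precede w.
N_cont = defaultdict(set)
for c, w in zip(corpus, corpus[1:]):
    N_cont[w].add(c)
num_bigram_types = sum(len(s) for s in N_cont.values())   # N_{.,.}

def p_kn(w, c):
    F_c = F[c]
    total = sum(F_c.values())
    p_cont = len(N_cont[w]) / num_bigram_types             # p_KN'(w)
    if total == 0:
        return p_cont
    eta = D * sum(1 for v in F_c.values() if v > 0) / total
    return max(F_c.get(w, 0) - D, 0.0) / total + eta * p_cont

print(p_kn("runs", "the"))
print(p_kn("dog", "a"))    # "a dog" is unseen; "dog" follows only "the", so p_KN' is low
```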
Modified Kneser-Ney
Set D based on F_{c,w}:
D =
  0    if F_{c,w} = 0
  D1   if F_{c,w} = 1
  D2   if F_{c,w} = 2
  D3+  if F_{c,w} ≥ 3
- The modifications to the formula are in the paper (Chen and Goodman, 1999).
Contents
Language Modeling
NGram Models
Smoothing
Language Models in Practice
Language Model Issues
1. LMs become very sparse.
   - Obviously cannot store full matrices (|V| × |V| × |V| > 10^12 entries).
2. LMs are very big.
3. Lookup speed is crucial.
Trie Data structure
ROOT
  a
    mouse (c7)
    cat
      meows (c6)
    dog
      runs (c5)
  the
    mouse (c4)
    cat
      runs (c3)
    dog
      sleeps (c2)
      barks (c1)
Issues?
Reverse Trie Data structure
ROOT
  runs (c)
    cat (c)
      the (c)
    dog (c)
      a (c)
  meows (c)
    cat (c)
      a (c)
  sleeps (c)
    dog (c)
      the (c)
  barks (c)
    dog (c)
      the (c)
  mouse (c)
    the (c)
    a (c)
Used in several standard language modeling toolkits.
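A minimal sketch of storing n-gram counts in a reverse trie (last word at the root), which makes it cheap to find all contexts that precede a word; the data layout is an illustrative assumption, not a particular toolkit's format:

```python
# Reverse-trie sketch: n-grams are inserted last word first, so all contexts
# that precede a given word sit under one subtree. Illustrative layout only.
def insert(trie, ngram, count=1):
    node = trie
    for word in reversed(ngram):
        node = node.setdefault(word, {})
    node["#"] = node.get("#", 0) + count       # store the count at the leaf

def lookup(trie, ngram):
    node = trie
    for word in reversed(ngram):
        node = node.get(word)
        if node is None:
            return 0                            # unseen n-gram
    return node.get("#", 0)

trie = {}
for ngram in [("the", "dog", "barks"), ("the", "dog", "sleeps"),
              ("the", "cat", "runs"), ("a", "dog", "runs")]:
    insert(trie, ngram)

print(lookup(trie, ("the", "dog", "barks")))    # 1
print(lookup(trie, ("a", "cat", "meows")))      # 0: falls off the trie
print(sorted(trie["runs"]))                     # words that can immediately precede "runs"
```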
Efficiency: Hash Tables
- KenLM finds it more efficient to hash n-grams directly.
- All n-grams of a given order are kept in a hash table.
- Fast linear-probing hash tables make this work well.
Quantization
- Memory usage is often a concern.
- Probabilities are kept in log space.
- KenLM quantizes from 32 bits down to fewer bits (others use 8 bits).
- Quantization is done by sorting, binning, and averaging.
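A small sketch of that sort/bin/average style of quantization: sort the log-probabilities, split them into 2^bits equal-size bins, and replace each value with its bin's mean. The input array and bit width are made up; this is not KenLM's actual code:

```python
import numpy as np

def quantize(log_probs, bits=4):
    """Sort values, split into 2**bits equal-size bins, average each bin.

    Returns the per-bin centroids (the codebook) and, for every input value,
    the index of its bin. A sketch, not KenLM's implementation.
    """
    order = np.argsort(log_probs)
    bins = np.array_split(order, 2 ** bits)     # contiguous runs of sorted values
    codebook = np.array([log_probs[b].mean() for b in bins])
    codes = np.empty(len(log_probs), dtype=np.int32)
    for i, b in enumerate(bins):
        codes[b] = i
    return codebook, codes

rng = np.random.default_rng(0)
log_probs = np.log(rng.uniform(1e-4, 1.0, size=1000))         # fake 32-bit float values
codebook, codes = quantize(log_probs, bits=4)                  # store 4 bits per entry
print(codebook.shape, np.abs(codebook[codes] - log_probs).mean())   # reconstruction error
```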
Finite State Automata
Conclusion
Today,
- Language Modeling
- Various ideas in smoothing
- Tricks for efficient implementation
Next time,
- Neural Network Language Models