Speech Recognition
Lecture 5: N-gram Language Models
Eugene Weinstein
Google, NYU Courant Institute
[email protected]
Credit: Mehryar Mohri
Eugene Weinstein - Speech Recognition, Courant Institute, NYU
Components
Acoustic and pronunciation model:
• Pr(o | d): observation seq. → distribution seq.
• Pr(d | c): distribution seq. → CD phone seq.
• Pr(c | p): CD phone seq. → phoneme seq.
• Pr(p | w): phoneme seq. → word seq.
Language model: Pr(w), distribution over word sequences.

Pr(o | w) = Σ_{d,c,p} Pr(o | d) Pr(d | c) Pr(c | p) Pr(p | w).
Language Models
Definition: a probability distribution Pr[w] over sequences of words w = w1 ... wk.
• Critical component of a speech recognition system.
Problems:
• Learning: use a large text corpus (e.g., several million words) to estimate Pr[w]. Models in this course: n-gram models.
• Efficiency: computational representation and use.
This Lecture
• n-gram models: definition and problems
• Good-Turing estimate
• Smoothing techniques
• Evaluation
• Representation of n-gram models
• Shrinking
• LMs based on probabilistic automata
N-Gram Models
Definition: an n-gram model is a probability distribution based on the nth-order Markov assumption:

∀i, Pr[wi | w1 ... wi−1] = Pr[wi | hi], with |hi| ≤ n − 1.

• Most widely used language models.
Consequence: by the chain rule,

Pr[w] = ∏_{i=1}^{k} Pr[wi | w1 ... wi−1] = ∏_{i=1}^{k} Pr[wi | hi].
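As a minimal sketch of the chain-rule factorization above, the sketch below scores a sequence with a conditional model whose history is truncated to n − 1 words; the toy model and its probabilities are hypothetical, not from the lecture:

```python
import math

def sequence_log_prob(words, cond_prob, n):
    """log Pr[w] = sum_i log Pr[w_i | h_i], truncating each history to n-1 words."""
    total = 0.0
    for i, w in enumerate(words):
        h = tuple(words[max(0, i - (n - 1)):i])  # Markov assumption: |h_i| <= n-1
        total += math.log(cond_prob(w, h))
    return total

# Hypothetical toy conditional model over {"a", "b"}.
def toy_cond_prob(w, h):
    return 0.7 if (h and h[-1] == w) else 0.3

p = math.exp(sequence_log_prob(["a", "a", "b"], toy_cond_prob, n=2))
# p = Pr[a] * Pr[a|a] * Pr[b|a] = 0.3 * 0.7 * 0.3
```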
Maximum Likelihood
Likelihood: probability of observing sample x1, ..., xm under distribution p ∈ P, which, given the independence assumption, is

Pr[x1, ..., xm] = ∏_{i=1}^{m} p(xi).

Principle: select the distribution maximizing the sample probability:

p* = argmax_{p∈P} ∏_{i=1}^{m} p(xi),  or equivalently  p* = argmax_{p∈P} Σ_{i=1}^{m} log p(xi).
Example: Bernoulli Trials
Problem: find the most likely Bernoulli distribution, given a sequence of coin flips

H, T, T, H, T, H, T, H, H, H, T, T, ..., H.

Bernoulli distribution: p(H) = θ, p(T) = 1 − θ.
Likelihood:

l(p) = log θ^{N(H)} (1 − θ)^{N(T)} = N(H) log θ + N(T) log(1 − θ).

Solution: l is differentiable and concave;

dl(p)/dθ = N(H)/θ − N(T)/(1 − θ) = 0  ⇒  θ = N(H) / (N(H) + N(T)).
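The closed-form solution can be checked numerically: the sketch below (with a hypothetical flip sequence) confirms that the frequency estimate N(H)/(N(H)+N(T)) is not beaten by any θ on a fine grid:

```python
import math

flips = list("HTTHTHTHHHTT") + ["H"]  # hypothetical coin-flip sample
nH, nT = flips.count("H"), flips.count("T")

# Closed-form MLE from setting dl/dtheta = 0.
theta_mle = nH / (nH + nT)

# Log-likelihood l(theta) = N(H) log theta + N(T) log(1 - theta).
def loglik(theta):
    return nH * math.log(theta) + nT * math.log(1 - theta)

# Grid check: no theta on a fine grid beats the closed form (l is concave).
best = max(loglik(t / 1000) for t in range(1, 1000))
assert loglik(theta_mle) >= best - 1e-12
```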
Maximum Likelihood Estimation
Definitions:
• n-gram: sequence of n consecutive words.
• S: sample or corpus of size m.
• c(w1 ... wk): count of the sequence w1 ... wk.
ML estimates: for c(w1 ... wn−1) ≠ 0,

Pr[wn | w1 ... wn−1] = c(w1 ... wn) / c(w1 ... wn−1).

• But, c(w1 ... wn) = 0 ⇒ Pr[wn | w1 ... wn−1] = 0!
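The ratio-of-counts estimate can be sketched for bigrams as follows (the corpus is a hypothetical toy example; note that any bigram absent from the corpus gets probability exactly zero, which is the sparsity problem discussed next):

```python
from collections import Counter

def ml_bigram_prob(corpus, w1, w2):
    """ML estimate Pr[w2 | w1] = c(w1 w2) / c(w1) over a tokenized corpus.
    Sketch: the final token slightly inflates c(w1) since it has no successor."""
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    if unigrams[w1] == 0:
        raise ValueError("undefined: c(w1) = 0")
    return bigrams[(w1, w2)] / unigrams[w1]

corpus = "the cat sat on the mat".split()  # hypothetical toy corpus
p = ml_bigram_prob(corpus, "the", "cat")   # c(the cat)/c(the) = 1/2
p_unseen = ml_bigram_prob(corpus, "the", "sat")  # zero: never observed
```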
N-Gram Model Problems
Sparsity: assigning probability zero to sequences not found in the sample → speech recognition errors.
• Smoothing: adjusting ML estimates to reserve probability mass for unseen events. Central techniques in language modeling.
• Class-based models: create models based on classes (e.g., DAY) or phrases.
Representation: for |Σ| = 100,000, the number of bigrams is 10^10, the number of trigrams 10^15!
• Weighted automata: exploiting sparsity.
Smoothing Techniques
Typical form: interpolation of n-gram models, e.g., trigram, bigram, and unigram frequencies:

Pr[w3 | w1w2] = λ1 c(w3 | w1w2) + λ2 c(w3 | w2) + λ3 c(w3).

Some widely used techniques:
• Katz back-off models (Katz, 1987).
• Interpolated models (Jelinek and Mercer, 1980).
• Kneser-Ney models (Kneser and Ney, 1995).
Good-Turing Estimate (Good, 1953)
Definitions:
• Sample S of m words drawn from vocabulary Σ.
• c(x): count of word x in S.
• S_k: set of words appearing k times.
• M_k: probability of drawing a word in S_k.
Good-Turing estimate G_k of M_k:

G_k = ((k + 1)/m) |S_{k+1}| = (k + 1) |S_{k+1}| / m.
Good-Turing Count Estimate
Definition: let r = c(w1 ... wk) and n_r = |S_r|; then

c*(w1 ... wk) = (G_r · m) / |S_r| = (r + 1) n_{r+1} / n_r.
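The adjusted counts c* = (r + 1) n_{r+1} / n_r and the missing-mass estimate G_0 = n_1 / m can be sketched directly (the sample is a hypothetical toy; real implementations also smooth the n_r themselves, since high counts often have n_{r+1} = 0, which this sketch handles by falling back to the raw count):

```python
from collections import Counter

def good_turing_counts(sample):
    """Adjusted counts c*(x) = (r + 1) n_{r+1} / n_r, where n_r = |S_r|."""
    counts = Counter(sample)
    n = Counter(counts.values())  # n[r] = number of types seen exactly r times
    adjusted = {}
    for word, r in counts.items():
        if n[r + 1] > 0:
            adjusted[word] = (r + 1) * n[r + 1] / n[r]
        else:
            adjusted[word] = float(r)  # fall back to raw count when n_{r+1} = 0
    return adjusted

sample = ["a", "a", "b", "b", "c", "d"]  # hypothetical sample, m = 6
adj = good_turing_counts(sample)
# Missing mass estimate G_0 = n_1 / m: two singletons (c, d) out of 6 tokens.
missing_mass = Counter(Counter(sample).values())[1] / len(sample)
```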
Simple Method (Jeffreys, 1948)
Additive smoothing: add α ∈ [0, 1] to the count of each n-gram:

Pr[wn | w1 ... wn−1] = (c(w1 ... wn) + α) / (c(w1 ... wn−1) + α|Σ|).

• Poor performance (Gale and Church, 1994).
• Not a principled attempt to estimate or make use of an estimate of the missing mass.
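A sketch of additive smoothing for bigrams (hypothetical toy corpus and vocabulary); the point to check is that unseen bigrams now get nonzero mass while the conditional distribution still sums to one:

```python
from collections import Counter

def additive_bigram_prob(corpus, vocab, w1, w2, alpha=0.5):
    """Additive (Jeffreys) smoothing: (c(w1 w2) + alpha) / (c(w1) + alpha |V|)."""
    unigrams = Counter(corpus[:-1])  # only positions that have a successor
    bigrams = Counter(zip(corpus, corpus[1:]))
    return (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * len(vocab))

corpus = "a b a b a".split()  # hypothetical toy corpus
vocab = {"a", "b"}
# Unseen bigram (b, b) now gets nonzero probability.
p_unseen = additive_bigram_prob(corpus, vocab, "b", "b")
# Probabilities over the vocabulary still sum to 1 for context "b".
total = sum(additive_bigram_prob(corpus, vocab, "b", w) for w in vocab)
```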
Katz Back-off Model (Katz, 1987)
Idea: back off to a lower order model for zero counts.
• If c(w_1^{n−1}) = 0, then Pr[wn | w_1^{n−1}] = Pr[wn | w_2^{n−1}].
• Otherwise,

Pr[wn | w_1^{n−1}] =
  d_{c(w_1^n)} · c(w_1^n) / c(w_1^{n−1})   if c(w_1^n) > 0,
  α Pr[wn | w_2^{n−1}]                     otherwise,

where α ensures normalization and d_k is a discount coefficient such that

d_k = 1 if k > K;  d_k ≈ (k + 1) n_{k+1} / (k n_k) otherwise.

Katz suggests K = 5.
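A simplified sketch of the back-off idea for bigrams. It is not the full Katz model: a single fixed discount replaces the Good-Turing based coefficients d_k, and the corpus is a hypothetical toy; the check is that the back-off weight α makes each conditional distribution sum to one:

```python
from collections import Counter

def katz_bigram(corpus, vocab, d=0.75):
    """Katz-style back-off bigram sketch.

    Assumption: a single fixed discount d stands in for the count-dependent
    discounts d_k; alpha redistributes the held-out mass over words unseen
    after history w1, in proportion to unigram ML estimates. Assumes the
    context w1 has been observed at least once.
    """
    uni = Counter(corpus)
    bi = Counter(zip(corpus, corpus[1:]))
    m = len(corpus)

    def p_uni(w):
        return uni[w] / m

    def p(w2, w1):
        seen = {w for (a, w) in bi if a == w1}
        ctx_total = sum(bi[(w1, w)] for w in seen)
        if bi[(w1, w2)] > 0:
            return (bi[(w1, w2)] - d) / ctx_total
        # Back off: scale unigram mass of unseen continuations by held-out mass.
        held_out = d * len(seen) / ctx_total
        denom = sum(p_uni(w) for w in vocab - seen)
        return held_out * p_uni(w2) / denom

    return p

corpus = "a b a b a c".split()  # hypothetical toy corpus
vocab = {"a", "b", "c"}
p = katz_bigram(corpus, vocab)
total = sum(p(w, "a") for w in vocab)  # should be 1 for context "a"
```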
Interpolated Models (Jelinek and Mercer, 1980)
Idea: interpolation of different order models:

Pr[w3 | w1w2] = λ c(w3 | w1w2) + β c(w3 | w2) + (1 − λ − β) c(w3), with 0 ≤ λ, β ≤ 1.

• λ and β are estimated by using held-out samples:
• the sample is split into multiple “buckets”, and a set of interpolation parameters is learned per bucket;
• optimization using the expectation-maximization (EM) algorithm;
• deleted interpolation: k-fold cross-validation.
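The EM step for the interpolation weights can be sketched generically: the component models are held fixed, each held-out word contributes its probability under each component, and only the mixture weights are re-estimated. The input probabilities below are hypothetical, standing in for bigram and unigram model outputs:

```python
def em_interpolation_weights(heldout_probs, iters=50):
    """EM for interpolation weights (sketch, Jelinek-Mercer style).

    heldout_probs: list of tuples (p1_i, p2_i, ...) giving each component
    model's probability for held-out word i. Components are fixed; only
    the mixture weights lambda_j are learned.
    """
    k = len(heldout_probs[0])
    lam = [1.0 / k] * k
    for _ in range(iters):
        post_sums = [0.0] * k
        for probs in heldout_probs:
            mix = sum(l * p for l, p in zip(lam, probs))
            for j in range(k):  # E-step: posterior of component j for this word
                post_sums[j] += lam[j] * probs[j] / mix
        lam = [s / len(heldout_probs) for s in post_sums]  # M-step
    return lam

# Hypothetical held-out probabilities from a bigram and a unigram model.
data = [(0.5, 0.1), (0.4, 0.1), (0.01, 0.2), (0.6, 0.1)]
lam = em_interpolation_weights(data)
```

The weights stay on the simplex by construction, and the component that better predicts the held-out data receives the larger weight.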
Kneser-Ney Model
Idea: combination of back-off and interpolation, but backing off to a lower order model based on counts of contexts. Extension of absolute discounting:

Pr[w3 | w1w2] = max{0, c(w1w2w3) − D} / c(w1w2) + λ · c(·w3) / Σ_w c(·w),

where c(·w) denotes the number of distinct contexts in which w appears, λ ensures normalization, and D is a constant.
Modified version (Chen and Goodman, 1998): D is a function of c(w1w2w3).
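A bigram-order sketch of the interpolated variant, with a fixed discount D and a hypothetical toy corpus; the distinctive ingredient is the continuation count c(·w) (distinct left contexts of w) used in place of raw unigram counts:

```python
from collections import Counter

def kneser_ney_bigram(corpus, D=0.75):
    """Interpolated Kneser-Ney bigram sketch with fixed discount D.

    The lower-order term uses continuation counts c(.w), the number of
    distinct left contexts of w, rather than raw unigram counts.
    """
    bi = Counter(zip(corpus, corpus[1:]))
    left_contexts = Counter()          # c(.w): distinct predecessors of w
    for (w1, w2) in bi:
        left_contexts[w2] += 1
    total_cont = sum(left_contexts.values())
    ctx_count = Counter(w1 for (w1, _) in zip(corpus, corpus[1:]))

    def p(w2, w1):
        distinct_after = len({w for (a, w) in bi if a == w1})
        lam = D * distinct_after / ctx_count[w1]   # normalizing back-off weight
        p_cont = left_contexts[w2] / total_cont    # continuation probability
        return max(0.0, bi[(w1, w2)] - D) / ctx_count[w1] + lam * p_cont

    return p

corpus = "a b a b a c".split()  # hypothetical toy corpus
p = kneser_ney_bigram(corpus)
total = sum(p(w, "a") for w in {"a", "b", "c"})  # should be 1 for context "a"
```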
Evaluation
Average log-likelihood of a test sample of size N:

L̂(q) = (1/N) Σ_{k=1}^{N} log2 q[wk | hk], with |hk| ≤ n − 1.

Perplexity: the perplexity of the model is defined as (“average branching factor”)

PP(q) = 2^{−L̂(q)}.

• For English texts, typically PP(q) ∈ [50, 1000] and 6 ≤ −L̂(q) ≤ 10 bits.
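The definition translates directly into code; the per-word probabilities below are hypothetical stand-ins for q[wk | hk] on a test sample, and the uniform case makes the "average branching factor" reading concrete:

```python
import math

def perplexity(model_probs):
    """PP(q) = 2^{-Lhat(q)}, with Lhat = (1/N) sum_k log2 q[w_k | h_k]."""
    N = len(model_probs)
    avg_ll = sum(math.log2(p) for p in model_probs) / N
    return 2 ** (-avg_ll)

# Hypothetical per-word probabilities assigned by a model to a test sample.
# Uniform probability 1/4 per word gives an average branching factor of 4.
pp = perplexity([0.25, 0.25, 0.25, 0.25])
```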
In Practice
Evaluation: the word error rate of a speech recognizer is often a better evaluation criterion than perplexity.
n-gram order: typically n = 3, 4, or 5. Higher-order n-grams typically do not yield any benefit.
Smoothing: small differences between back-off, interpolated, and other models (Chen and Goodman, 1998).
Special symbols: beginning and end of sentences, start and stop symbols.
Example: Bigram Model
(Excerpt from Allauzen, Roark, and Mohri, 2003.)

The set of states of the WFA representing a backoff or interpolated model is defined by associating a state q_h to each sequence of length less than n found in the corpus:

Q = {q_h : |h| < n and c(h) > 0}.

Its transition set E is defined as the union of the following set of failure transitions:

{(q_{wh′}, φ, −log(α_h), q_{h′}) : q_{wh′} ∈ Q}

and the following set of regular transitions:

{(q_h, w, −log(Pr(w|h)), n_{hw}) : q_{hw} ∈ Q},

where n_{hw} is defined by:

n_{hw} = q_{hw} if 0 < |hw| < n;  q_{h′w} if |hw| = n, where h = w′h′.   (4)

Figure 2 illustrates this construction for a trigram model. Treating φ-transitions as regular symbols, this is a deterministic automaton. Figure 3 shows a complete Katz backoff bigram model built from counts taken from the following toy corpus and using failure transitions:

<s> b a a a a </s>
<s> b a a a a </s>
<s> a </s>

where <s> denotes the start symbol and </s> the end symbol for each sentence. Note that the start symbol <s> does not label any transition; it encodes the history <s>. All transitions labeled with the end symbol </s> lead to the single final state of the automaton.

4.3 Approximate Offline Representation

The common method used for an offline representation of an n-gram language model can be easily derived from the representation using failure transitions by simply replacing each φ-transition by an ε-transition. Thus, a transition that could only be taken in the absence of any other alternative in the exact representation can now be taken regardless of whether there exists an alternative transition. Thus the approximate representation may contain paths whose weight does not correspond to the exact probability of the string labeling that path according to the model.

Consider for example the start state in Figure 3, labeled with <s>. In a failure transition model, there exists only one path from the start state to the state labeled a, with a cost of 1.108, since the φ-transition cannot be traversed with an input of a. If the φ-transition is replaced by an ε-transition, there is a second path to the state labeled a: taking the ε-transition to the history-less state, then the a transition out of the history-less state. This path is not part of the probabilistic model; we shall refer to it as an invalid path.

[Figure 3: Example of representation of a bigram model with failure transitions. States <s>, a, b, </s>, and a history-less back-off state; transition costs include a/1.108, b/0.693, and φ/0.231 out of <s>, plus a/0.287, a/0.405, a/0.441, b/1.945, φ/0.356, φ/4.856, </s>/1.101, and </s>/1.540.]

In this case, there is a problem, because the cost of the invalid path to the state (the sum of the two transition costs, 0.672) is lower than the cost of the true path. Hence the WFA with ε-transitions gives a lower cost (higher probability) to all strings beginning with the symbol a. Note that the invalid path from the state labeled <s> to the state labeled b has a higher cost than the correct path, which is not a problem in the tropical semiring.

4.4 Exact Offline Representation

This section presents a method for constructing an exact offline representation of an n-gram language model whose size remains practical for large-vocabulary tasks. The main idea behind the construction is to modify the topology of the WFA to remove any path containing ε-transitions whose cost is lower than the correct cost associated by the model to the string labeling that path. Since, as a result, the low-cost path for each string will have the correct cost, this guarantees the correctness of the representation in the tropical semiring.

The construction admits two parts: the detection of the invalid paths of the WFA, and the modification of the topology by splitting states to remove the invalid paths.

To detect invalid paths, we determine first their initial non-ε transitions. Let E_ε denote the set of ε-transitions of the original automaton. Let P_q be the set of all paths π = e1 ... ek ∈ (E − E_ε)^k, k > 0, leading to state q such that for all i, i = 1 ... k, p[e_i] is the destination state of some ε-transition.

Lemma 2. For an n-gram language model, the number of paths in P_q is less than the n-gram order: |P_q| < n.

Proof. For all π_i ∈ P_q, let π_i = π′_i e_i. By definition, there is some e′_i ∈ E_ε such that n[e′_i] = p[e_i] = q_{h_i}. By definition of ε-transitions in the model, |h_i| < n − 1 for all i. It follows from the definition of regular transitions that n[e_i] = q_{h_i w} = q. Hence, h_i = h_j = h, i.e., e_i = e_j = e, for all π_i, π_j ∈ P_q. Then, P_q = {πe : π ∈ P_{q_h}} ∪ {e}. The history-less state has no incoming non-ε paths; therefore, by recursion, |P_q| = |P_{q_h}| + 1 = |hw| < n.

We now define transition sets D_{qq′} (originally empty) following this procedure: for all states r ∈ Q and all π = e1 ... ek ∈ P_r, if there exists another path π′ = ēπ̄ such that ē ∈ E_ε and i[π′] = i[π], and either (i) n[π′] = n[π] and w[π] < w[π′], or (ii) there exists e′ ∈ E_ε such that p[e′] = n[π′] and n[e′] = n[π] and w[π] < w[π′e′], then e1 is added to the set: D_{p[π]p[π′]} ← D_{p[π]p[π′]} ∪ {e1}.
Failure Transitions (Aho and Corasick, 1975; Mohri, 1997)
Definition: a failure transition φ at state q of an automaton is a transition taken at that state, without consuming any symbol, when no regular transition at q has the desired input label.
• Thus, a failure transition corresponds to the semantics of “otherwise”.
Advantages (originally used in string matching):
• more compact representation;
• dealing with unknown alphabet symbols.
Weighted Automata Representation

Representation of a trigram model using failure transitions:

[Figure: trigram-model fragment with states for histories wi−2wi−1, wi−1wi, wi−1, and wi; word transitions labeled wi and back-off transitions labeled φ.]

[Figure 4.7: Katz back-off n-gram model. (a) Representation of a trigram model with failure transitions labeled with φ. (b) Bigram model derived from the input text “hello bye bye”, with transitions such as bye/0.693, hello/1.386, bye/7.625, hello/7.625, bye/8.318, φ/-0.287, φ/-0.693, φ/-1.386, and final weights 1/8.318 and 2/1.386. The automaton is defined over the log semiring (the transition weights are negative log-probabilities). State 0 is the initial state. State 1 corresponds to the word bye and state 3 to the word hello. State 2 is the back-off state.]

Let c(w) denote the number of occurrences of a sequence w in the corpus. c(hi) and c(hiwi) can be used to estimate the conditional probability P(wi | hi). When c(hi) ≠ 0, the maximum likelihood estimate of P(wi | hi) is:

P̂(wi | hi) = c(hiwi) / c(hi).   (4.3.8)

But a classical data sparsity problem arises in the design of all n-gram models: the corpus, no matter how large, may contain no occurrence of hi (c(hi) = 0). A solution to this problem is based on smoothing techniques. This consists of adjusting P̂ to reserve some probability mass for unseen n-gram sequences.

Let P̃(wi | hi) denote the adjusted conditional probability. A smoothing technique widely used in language modeling is the Katz back-off technique. The idea is to “back off” to lower order n-gram sequences when c(hiwi) = 0. Define the back-off sequence of hi as the lower order n-gram sequence suffix of hi and denote it by h′i: hi = u h′i for some word u. Then, in a Katz back-off model, P(wi | hi) is defined as follows:

P(wi | hi) = P̃(wi | hi) if c(hiwi) > 0;  α_{hi} P(wi | h′i) otherwise,   (4.3.9)

where α_{hi} is a factor ensuring normalization. The Katz back-off model admits a natural representation by a weighted automaton in which each state encodes a conditioning history.
Approximate Representation
The cost of a representation without failure transitions and ε-transitions is prohibitive.
• For a trigram model (n = 3), |Σ|^{n−1} states and |Σ|^n transitions are needed.
• Exact on-the-fly representation is possible, but with a drawback: no offline optimization.
Approximation (empirically, limited accuracy loss):
• φ-transitions replaced by ε-transitions;
• log semiring replaced by tropical semiring.
Alternative: exact representation with ε-transitions (Allauzen, Roark, and Mohri, 2003).
Shrinking
Idea: remove some n-grams from the model while minimally affecting its quality.
Main motivation: real-time speech recognition (speed and memory).
Method of (Seymore and Rosenfeld, 1996): rank n-grams (w, h) according to the difference of log probabilities before and after shrinking:

c*(wh) [ log p[w|h] − log p′[w|h] ].
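The Seymore-Rosenfeld ranking can be sketched as a scoring pass over candidate n-grams; the entries below are hypothetical (counts and before/after probabilities are made up), with p′ standing for the probability the shrunken model would assign:

```python
import math

def shrink_rank(ngrams):
    """Rank n-grams by c*(wh) * (log p[w|h] - log p'[w|h]), Seymore-Rosenfeld style.

    ngrams: dict mapping (h, w) -> (count, p_full, p_shrunk), hypothetical
    inputs. Lowest scores are the best candidates for removal, since pruning
    them changes the assigned log probabilities the least.
    """
    scores = {
        hw: c * (math.log(p) - math.log(p_shrunk))
        for hw, (c, p, p_shrunk) in ngrams.items()
    }
    return sorted(scores, key=scores.get)

# Hypothetical entries: a frequent n-gram whose probability barely changes
# versus a rare one whose probability changes by a large factor.
order = shrink_rank({
    ("the", "cat"): (100, 0.20, 0.19),
    ("purple", "cat"): (2, 0.30, 0.01),
})
```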
Shrinking
Method of (Stolcke, 1998): greedy removal of n-grams based on the relative entropy D(p ‖ p′) between the models before and after removal, computed independently for each n-gram.
• Slightly lower perplexity.
• But the ranking is close to that of (Seymore and Rosenfeld, 1996), both in definition and in empirical results.
LMs Based on Probabilistic Automata (Allauzen, Mohri, and Roark, 2003)
Definition: expected count of sequence x in probabilistic automaton A:

c(x) = Σ_{u∈Σ*} |u|_x A(u),

where |u|_x is the number of occurrences of x in u.
Computation:
• use counting weighted transducers.
• can be generalized to other moments of the counts.
Example: Bigram Transducer

[Figure: weighted transducer T over {a, b}: state 0 with loops a:ε/1 and b:ε/1, transitions a:a/1 and b:b/1 to state 1, transitions a:a/1 and b:b/1 to final state 2/1, which has loops a:ε/1 and b:ε/1.]

Weighted transducer T: X ∘ T computes the (expected) count of each bigram in {aa, ab, ba, bb} occurring in X.
Counting Transducers
X is an automaton representing a string or any other regular expression.
Alphabet Σ = {a, b}; counted pattern x = ab.

[Figure: counting transducer with state 0 (loops a:ε/1 and b:ε/1), transition X:X/1 to final state 1/1, which also has loops a:ε/1 and b:ε/1.]

Example: for the input string bbabaabba, the two successful paths output εεabεεεεε and εεεεεabεε, one per occurrence of ab.
References
• Cyril Allauzen, Mehryar Mohri, and Brian Roark. Generalized algorithms for constructing statistical language models. In Proceedings of the 41st Meeting of the Association for Computational Linguistics (ACL 2003), Sapporo, Japan, July 2003.
• Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza, Jennifer C. Lai, and Robert L. Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467-479, 1992.
• Stanley Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, 1998.
• William Gale and Kenneth W. Church. What's wrong with adding one? In N. Oostdijk and P. de Haan, editors, Corpus-Based Research into Language. Rodopi, Amsterdam, 1994.
• Irving J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40:237-264, 1953.
• Frederick Jelinek and Robert L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, pages 381-397, 1980.
• Slava Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35:400-401, 1987.
• Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume 1, pages 181-184, 1995.
• David A. McAllester and Robert E. Schapire. On the convergence rate of Good-Turing estimators. In Proceedings of the Conference on Learning Theory (COLT), pages 1-6, 2000.
• Hermann Ney, Ute Essen, and Reinhard Kneser. On structuring probabilistic dependences in stochastic language modeling. Computer Speech and Language, 8:1-38, 1994.
• Kristie Seymore and Ronald Rosenfeld. Scalable backoff language models. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), 1996.
• Andreas Stolcke. Entropy-based pruning of back-off language models. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, pages 270-274, 1998.
• Ian H. Witten and Timothy C. Bell. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4):1085-1094, 1991.