Speech Recognition
Lecture 5: N-gram Language Models

Eugene Weinstein
Google, NYU Courant Institute
[email protected]

Slide credit: Mehryar Mohri


Components

Acoustic and pronunciation model:

\Pr(o \mid w) = \sum_{d,c,p} \Pr(o \mid d)\, \Pr(d \mid c)\, \Pr(c \mid p)\, \Pr(p \mid w).

• \Pr(o \mid d): observation seq. given distribution seq.

• \Pr(d \mid c): distribution seq. given CD phone seq.

• \Pr(c \mid p): CD phone seq. given phoneme seq.

• \Pr(p \mid w): phoneme seq. given word seq.

Language model: \Pr(w), distribution over word sequences.


Language Models

Definition: probability distribution \Pr[w] over sequences of words w = w_1 \ldots w_k.

• Critical component of a speech recognition system.

Problems:

• Learning: use a large text corpus (e.g., several million words) to estimate \Pr[w]. Models in this course: n-gram models.

• Efficiency: computational representation and use.


This Lecture

• n-gram models: definition and problems

• Good-Turing estimate

• Smoothing techniques

• Evaluation

• Representation of n-gram models

• Shrinking

• LMs based on probabilistic automata


N-Gram Models

Definition: an n-gram model is a probability distribution based on a Markov assumption of order n - 1:

\forall i, \quad \Pr[w_i \mid w_1 \ldots w_{i-1}] = \Pr[w_i \mid h_i], \quad |h_i| \le n - 1.

• Most widely used language models.

Consequence: by the chain rule,

\Pr[w] = \prod_{i=1}^{k} \Pr[w_i \mid w_1 \ldots w_{i-1}] = \prod_{i=1}^{k} \Pr[w_i \mid h_i].
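To make the factorization concrete, here is a minimal Python sketch (not from the lecture) that scores a sentence by multiplying conditional probabilities with the history truncated to at most n - 1 words; the probability table `probs` and the floor for unseen events are invented for illustration.

```python
import math

def ngram_log_prob(sentence, probs, n=3, unk=1e-7):
    """Return log Pr[w] = sum_i log Pr[w_i | h_i], with the history truncated to n - 1 words."""
    logp = 0.0
    for i, w in enumerate(sentence):
        h = tuple(sentence[max(0, i - (n - 1)):i])     # truncated history h_i
        logp += math.log(probs.get((h, w), unk))       # floor value for events missing from the table
    return logp

# Toy table of conditional probabilities Pr[w | h] (invented values).
probs = {((), "the"): 0.2,
         (("the",), "cat"): 0.1,
         (("the", "cat"), "sat"): 0.3}
print(ngram_log_prob(["the", "cat", "sat"], probs))
```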


Maximum Likelihood

Likelihood: probability of observing the sample x_1, \ldots, x_m under the distribution p \in P, which, given the independence assumption, is

\Pr[x_1, \ldots, x_m] = \prod_{i=1}^{m} p(x_i).

Principle: select the distribution maximizing the sample probability,

p^\star = \operatorname{argmax}_{p \in P} \prod_{i=1}^{m} p(x_i), \quad \text{or} \quad p^\star = \operatorname{argmax}_{p \in P} \sum_{i=1}^{m} \log p(x_i).


Example: Bernoulli Trials

Problem: find the most likely Bernoulli distribution, given a sequence of coin flips

H, T, T, H, T, H, T, H, H, H, T, T, \ldots, H.

Bernoulli distribution: p(H) = \theta, \; p(T) = 1 - \theta.

Likelihood:

l(p) = \log \theta^{N(H)} (1 - \theta)^{N(T)} = N(H) \log \theta + N(T) \log(1 - \theta).

Solution: l is differentiable and concave;

\frac{dl(p)}{d\theta} = \frac{N(H)}{\theta} - \frac{N(T)}{1 - \theta} = 0 \;\Longrightarrow\; \theta = \frac{N(H)}{N(H) + N(T)}.
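A small Python sketch of this computation: it takes a coin-flip sequence (following the visible part of the slide's example, with the elided middle omitted), computes theta = N(H) / (N(H) + N(T)), and checks that no other choice of theta gives a higher log-likelihood.

```python
import math

# Coin-flip sequence following the visible part of the slide's example (elided middle omitted).
flips = list("HTTHTHTHHHTT") + ["H"]
n_h, n_t = flips.count("H"), flips.count("T")

theta_ml = n_h / (n_h + n_t)

def log_likelihood(theta):
    # l(p) = N(H) log(theta) + N(T) log(1 - theta)
    return n_h * math.log(theta) + n_t * math.log(1 - theta)

print("theta_ML =", round(theta_ml, 3))
print(log_likelihood(0.5) <= log_likelihood(theta_ml))   # any other theta does no better
```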


Maximum Likelihood Estimation

Definitions:

• n-gram: sequence of n consecutive words.

• S: sample or corpus of size m.

• c(w_1 \ldots w_k): count of the sequence w_1 \ldots w_k.

ML estimates: for c(w_1 \ldots w_{n-1}) \ne 0,

\Pr[w_n \mid w_1 \ldots w_{n-1}] = \frac{c(w_1 \ldots w_n)}{c(w_1 \ldots w_{n-1})}.

• But, c(w_1 \ldots w_n) = 0 \Longrightarrow \Pr[w_n \mid w_1 \ldots w_{n-1}] = 0!
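A brief sketch of these ML estimates on a made-up toy corpus, illustrating both the count ratio and the zero-probability problem for unseen n-grams:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
n = 2                                              # bigram model

ngrams = Counter(tuple(corpus[i:i + n]) for i in range(len(corpus) - n + 1))
hists = Counter(tuple(corpus[i:i + n - 1]) for i in range(len(corpus) - n + 2))

def p_ml(w, h):
    h = tuple(h)
    return ngrams[h + (w,)] / hists[h] if hists[h] else 0.0

print(p_ml("cat", ["the"]))   # 2/3: "the cat" occurs twice, "the" three times
print(p_ml("dog", ["the"]))   # 0.0: an unseen n-gram receives zero probability
```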


N-Gram Model Problems

Sparsity: assigning probability zero to sequences not found in the sample ⇒ speech recognition errors.

• Smoothing: adjusting ML estimates to reserve probability mass for unseen events. Central techniques in language modeling.

• Class-based models: create models based on classes (e.g., DAY) or phrases.

Representation: for |Σ| = 100,000, the number of bigrams is 10^{10}, the number of trigrams 10^{15}!

• Weighted automata: exploiting sparsity.


Smoothing Techniques

Typical form: interpolation of n-gram models, e.g., trigram, bigram, and unigram frequencies:

\Pr[w_3 \mid w_1 w_2] = \alpha_1 c(w_3 \mid w_1 w_2) + \alpha_2 c(w_3 \mid w_2) + \alpha_3 c(w_3).

Some widely used techniques:

• Katz back-off models (Katz, 1987).

• Interpolated models (Jelinek and Mercer, 1980).

• Kneser-Ney models (Kneser and Ney, 1995).


Good-Turing Estimate

Definitions:

• S: sample of m words drawn from the vocabulary Σ.

• c(x): count of word x in S.

• S_k: set of words appearing k times.

• M_k: probability of drawing a word in S_k.

Good-Turing estimate of M_k (Good, 1953):

G_k = \frac{(k+1)\,|S_{k+1}|}{m}.


Good-Turing Count Estimate

Definition: let r = c(w_1 \ldots w_k) and n_r = |S_r|; then

c^*(w_1 \ldots w_k) = \frac{G_r\, m}{|S_r|} = \frac{(r+1)\, n_{r+1}}{n_r}.
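A small sketch of the Good-Turing quantities on a toy sample. The sample is invented, and the handling of large counts is simplified; in practice the n_r values are themselves smoothed before applying the formula.

```python
from collections import Counter

sample = "a b c d e f f g g h h h".split()
m = len(sample)

counts = Counter(sample)            # c(x)
n = Counter(counts.values())        # n_r: number of words seen exactly r times

def gt_count(x):
    # c*(x) = (r + 1) n_{r+1} / n_r; returns 0 when n_{r+1} = 0 (in practice the
    # n_r values are smoothed before applying the formula to large r).
    r = counts[x]
    return (r + 1) * n[r + 1] / n[r] if n[r] else 0.0

print({x: round(gt_count(x), 2) for x in sorted(counts)})
print("missing mass G_0 = n_1 / m =", round(n[1] / m, 2))
```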


Simple Method

Additive smoothing (Jeffreys, 1948): add \delta \in [0, 1] to the count of each n-gram:

\Pr[w_n \mid w_1 \ldots w_{n-1}] = \frac{c(w_1 \ldots w_n) + \delta}{c(w_1 \ldots w_{n-1}) + \delta\,|\Sigma|}.

• Poor performance (Gale and Church, 1994).

• Not a principled attempt to estimate or make use of an estimate of the missing mass.
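A minimal sketch of additive smoothing for a bigram model; the toy corpus and the value of delta are illustrative.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
vocab = set(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
hists = Counter(corpus[:-1])                      # history counts

def p_add(w, h, delta=0.5):
    return (bigrams[(h, w)] + delta) / (hists[h] + delta * len(vocab))

print(p_add("cat", "the"))    # seen bigram, slightly discounted
print(p_add("ran", "the"))    # unseen bigram now receives non-zero probability
```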


Katz Back-off Model

Idea (Katz, 1987): back off to a lower order model for zero counts.

• If c(w_1^{n-1}) = 0, then \Pr[w_n \mid w_1^{n-1}] = \Pr[w_n \mid w_2^{n-1}].

• Otherwise,

\Pr[w_n \mid w_1^{n-1}] =
\begin{cases}
d_{c(w_1^n)} \dfrac{c(w_1^n)}{c(w_1^{n-1})} & \text{if } c(w_1^n) > 0, \\
\alpha \, \Pr[w_n \mid w_2^{n-1}] & \text{otherwise,}
\end{cases}

where d_k is a discount coefficient such that

d_k
\begin{cases}
= 1 & \text{if } k > K, \\
\approx \dfrac{(k+1)\, n_{k+1}}{k\, n_k} & \text{otherwise.}
\end{cases}

Katz suggests K = 5.
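A simplified sketch of a Katz back-off bigram model, not Katz's exact formulation: seen bigrams are discounted with the Good-Turing ratio above for k ≤ K (capped at 1 so the sketch stays well behaved on tiny data), and the freed mass is redistributed over unseen continuations through a back-off weight alpha(h). The toy corpus echoes the one used in the bigram-model figure later in the lecture, but this sketch is not expected to reproduce that figure's weights.

```python
from collections import Counter

corpus = "<s> b a a a a </s> <s> b a a a a </s> <s> a </s>".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
total = sum(unigrams.values())
n = Counter(bigrams.values())   # n_k: number of distinct bigrams observed exactly k times
K = 5

def d(k):
    # Good-Turing discount ratio from the slide, capped at 1 for stability on tiny data.
    if k > K or n[k] == 0 or n[k + 1] == 0:
        return 1.0
    return min(1.0, (k + 1) * n[k + 1] / (k * n[k]))

def p_uni(w):
    return unigrams[w] / total

def p_katz(w, h):
    k = bigrams[(h, w)]
    if k > 0:
        return d(k) * k / unigrams[h]                       # discounted ML estimate
    seen = sum(d(c) * c / unigrams[h] for (h2, _), c in bigrams.items() if h2 == h)
    unseen = [u for u in unigrams if bigrams[(h, u)] == 0]
    denom = sum(p_uni(u) for u in unseen)
    alpha = (1.0 - seen) / denom if denom else 0.0          # back-off weight alpha(h)
    return alpha * p_uni(w)                                 # back off to the unigram model

print(p_katz("a", "b"))        # seen bigram: discounted relative frequency
print(p_katz("</s>", "<s>"))   # unseen bigram: weight backed off to the unigram estimate
```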


Interpolated Models

Idea (Jelinek and Mercer, 1980): interpolation of different order models:

\Pr[w_3 \mid w_1 w_2] = \alpha\, c(w_3 \mid w_1 w_2) + \beta\, c(w_3 \mid w_2) + (1 - \alpha - \beta)\, c(w_3), \quad \text{with } 0 \le \alpha, \beta \le 1.

• \alpha and \beta are estimated by using held-out samples.

• The sample is split into multiple "buckets"; a set of interpolation parameters is learned per bucket.

• Optimization using the expectation-maximization (EM) algorithm (see the sketch after this list).

• Deleted interpolation: k-fold cross-validation.
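A sketch of estimating an interpolation weight on held-out data with EM, simplified to a two-way bigram/unigram mixture: the corpora, the add-one-smoothed unigram used to avoid zeros, and the single weight lambda are all simplifications for illustration.

```python
from collections import Counter

train = "the cat sat on the mat the cat ran".split()
held_out = "the cat sat on the rug".split()

bi = Counter(zip(train, train[1:]))
uni = Counter(train)
V = len(uni)

def p_uni(w):
    return (uni[w] + 1) / (len(train) + V)            # add-one smoothed unigram (to avoid zeros)

def p_bi(w, h):
    return bi[(h, w)] / uni[h] if uni[h] else 0.0     # ML bigram

lam = 0.5                                             # initial interpolation weight
for _ in range(20):                                   # EM iterations on the held-out data
    post, count = 0.0, 0
    for h, w in zip(held_out, held_out[1:]):
        num = lam * p_bi(w, h)
        den = num + (1 - lam) * p_uni(w)
        post += num / den                             # E-step: responsibility of the bigram component
        count += 1
    lam = post / count                                # M-step: update the mixture weight
print("lambda =", round(lam, 3))
```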


Kneser-Ney Model

Idea: combination of back-off and interpolation, but backing off to the lower order model based on counts of contexts. Extension of absolute discounting.

\Pr[w_3 \mid w_1 w_2] = \frac{\max\{0,\; c(w_1 w_2 w_3) - D\}}{c(w_1 w_2)} + \alpha\, \frac{c(\cdot\, w_3)}{\sum_{w_3} c(\cdot\, w_3)},

where D is a constant.

Modified version (Chen and Goodman, 1998): D is a function of c(w_1 w_2 w_3).
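A sketch of an interpolated Kneser-Ney bigram model in the spirit of the formula above: absolute discounting of the bigram count, with the lower-order term based on counts of distinct contexts rather than raw unigram counts. The toy corpus, the value D = 0.75, and the particular normalization used for the back-off weight are illustrative choices, not taken from the slide.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran on the mat".split()
bi = Counter(zip(corpus, corpus[1:]))
uni = Counter(corpus[:-1])                    # history counts
D = 0.75                                      # absolute discount (illustrative value)

contexts = Counter(w for (_, w) in bi)        # number of distinct left contexts of each word, c(. w)
total_contexts = sum(contexts.values())       # total number of distinct bigram types

def p_cont(w):
    return contexts[w] / total_contexts       # continuation probability

def p_kn(w, h):
    discounted = max(0.0, bi[(h, w)] - D) / uni[h]
    alpha = D * sum(1 for (h2, _) in bi if h2 == h) / uni[h]   # mass freed by discounting
    return discounted + alpha * p_cont(w)

print(p_kn("cat", "the"))   # seen bigram
print(p_kn("ran", "the"))   # unseen bigram of a known word: weight comes from the continuation term
```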


Evaluation

Average log-likelihood of a test sample of size N:

\hat{L}(q) = \frac{1}{N} \sum_{k=1}^{N} \log_2 q[w_k \mid h_k], \quad |h_k| \le n - 1.

Perplexity: the perplexity PP(q) of the model is defined as ("average branching factor")

PP(q) = 2^{-\hat{L}(q)}.

• For English texts, typically PP(q) \in [50, 1000] and 6 \le -\hat{L}(q) \le 10 bits.
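A short sketch of both formulas on a made-up test sample and toy bigram model q:

```python
import math

test = ["the", "cat", "sat"]
q = {("<s>", "the"): 0.2, ("the", "cat"): 0.1, ("cat", "sat"): 0.25}   # toy bigram model q

hist = ["<s>"] + test[:-1]
N = len(test)
L = sum(math.log2(q[(h, w)]) for h, w in zip(hist, test)) / N          # average log-likelihood (bits)
PP = 2 ** (-L)                                                         # perplexity
print(f"average log-likelihood: {L:.2f} bits, perplexity: {PP:.1f}")
```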


In Practice

Evaluation: an empirical evaluation based on the word error rate of a speech recognizer is often more informative than perplexity.

n-gram order: typically n = 3, 4, or 5. Higher-order n-grams typically do not yield any benefit.

Smoothing: small differences between back-off, interpolated, and other models (Chen and Goodman, 1998).

Special symbols: beginning and end of sentences, start and stop symbols.


Example: Bigram Model


Figure 3 (Allauzen, Roark, and Mohri, 2003): example of the representation of a bigram model with failure transitions — a complete Katz back-off bigram model built from the counts of the toy corpus

<s> b a a a a </s>
<s> b a a a a </s>
<s> a </s>

Each state encodes an observed history of length less than n; regular transitions are labeled with a word w and weighted by -log Pr(w | h), and failure (φ) transitions lead to the back-off state with weight -log α_h. The start symbol <s> labels no transition (it only encodes the start history), and all transitions labeled with the end symbol </s> lead to the single final state of the automaton.



Failure Transitions

Definition: a failure transition φ at state q of an automaton is a transition taken at that state without consuming any symbol, when no regular transition at q has the desired input label.

• Thus, a failure transition corresponds to the semantics of "otherwise".

Advantages: originally used in string matching (Aho and Corasick, 1975; Mohri, 1997).

• More compact representation.

• Dealing with unknown alphabet symbols.
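A small sketch of this "otherwise" semantics: at state q, follow the arc labeled with the input symbol if one exists; otherwise take the φ arc (consuming no input) and retry from the fallback state. The automaton, its weights, and the state names below are hypothetical.

```python
def step(state, symbol, arcs, fail):
    """Consume one symbol: follow a matching arc, taking failure arcs as needed."""
    weight = 0.0
    while (state, symbol) not in arcs:
        if state not in fail:
            raise ValueError(f"no transition for {symbol!r} from {state!r}")
        fw, state = fail[state]           # failure transition: add its weight, consume nothing
        weight += fw
    w, nxt = arcs[(state, symbol)]
    return nxt, weight + w

# arcs[(state, symbol)] = (weight, next state); fail[state] = (weight, fallback state)
arcs = {("h_a", "a"): (0.5, "h_a"), ("h_b", "a"): (0.3, "h_a"),
        ("", "a"): (0.7, "h_a"), ("", "b"): (1.6, "h_b")}
fail = {"h_a": (0.4, ""), "h_b": (0.2, "")}

state, total = "h_b", 0.0
for sym in "aab":                          # no arc labeled b leaves h_a, so the last step backs off
    state, w = step(state, sym, arcs, fail)
    total += w
print(state, round(total, 2))              # h_b 2.8
```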


Weighted Automata Representation


Representation of a trigram model using failure transitions.

Figure: schematic topology with states for the histories w_{i-2} w_{i-1}, w_{i-1} w_i, w_{i-1}, w_i, and the back-off state, connected by word transitions labeled w_i and by failure transitions φ to lower-order histories.

Figure 4.7, from the "Statistical Natural Language Processing" chapter (Katz back-off n-gram model): (a) representation of a trigram model with failure transitions labeled φ; (b) bigram model derived from the input text "hello bye bye", defined over the log semiring (transition weights are negative log-probabilities). State 0 is the initial state, state 1 corresponds to the word bye, state 3 to the word hello, and state 2 is the back-off state. Each state encodes a history h: when c(hw) > 0 the adjusted probability P̃(w | h) is used, and otherwise the model backs off to the lower-order history with a normalization factor α_h.


Approximate Representation

The cost of a representation without failure transitions or ε-transitions is prohibitive.

• For a trigram model, |Σ|^{n-1} = |Σ|^2 states and |Σ|^n = |Σ|^3 transitions are needed.

• Exact on-the-fly representation, but drawback: no offline optimization.

Approximation: empirically, limited accuracy loss.

• φ-transitions replaced by ε-transitions.

• Log semiring replaced by the tropical semiring.

Alternative: exact representation with ε-transitions (Allauzen, Roark, and Mohri, 2003).


Shrinking

Idea: remove some n-grams from the model while minimally affecting its quality.

Main motivation: real-time speech recognition (speed and memory).

Method of (Seymore and Rosenfeld, 1996): rank n-grams (w, h) according to the difference of log probabilities before and after shrinking,

c^*(wh)\,\bigl(\log p[w \mid h] - \log p'[w \mid h]\bigr),

where p' denotes the model after removal of the n-gram.
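A sketch of this ranking on hypothetical model tables: p is the current model, p' the probability obtained through the back-off path after removal, and `weight` stands in for the count term c*(wh); n-grams with the smallest scores are pruned first.

```python
import math

# p: current model probability; p_backoff: probability assigned through the back-off path
# after removal; weight: stands in for the count term c*(wh). All values are invented.
p         = {("the", "cat"): 0.20, ("the", "mat"): 0.05, ("cat", "sat"): 0.30}
p_backoff = {("the", "cat"): 0.12, ("the", "mat"): 0.04, ("cat", "sat"): 0.10}
weight    = {("the", "cat"): 250,  ("the", "mat"): 12,   ("cat", "sat"): 40}

def score(hw):
    return weight[hw] * (math.log(p[hw]) - math.log(p_backoff[hw]))

for hw in sorted(p, key=score):            # prune the lowest-scoring n-grams first
    print(hw, round(score(hw), 2))
```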


Shrinking

Method of (Stolcke, 1998): greedy removal of n-grams based on the relative entropy D(p ‖ p') of the models before and after removal, computed independently for each n-gram.

• Slightly lower perplexity.

• But the ranking is close to that of (Seymore and Rosenfeld, 1996), both in its definition and in empirical results.


LMs Based on Probabilistic Automata

Definition: expected count c(x) of the sequence x in a probabilistic automaton A (Allauzen, Mohri, and Roark, 2003):

c(x) = \sum_{u \in \Sigma^*} |u|_x\, A(u),

where |u|_x is the number of occurrences of x in u.

Computation:

• use counting weighted transducers.

• can be generalized to other moments of the counts.


Example: Bigram Transducer

Figure: weighted transducer T over Σ = {a, b} for bigram counting, with three states. State 0 (initial) has self-loops a:ε/1 and b:ε/1 and transitions a:a/1 and b:b/1 to state 1; state 1 has transitions a:a/1 and b:b/1 to state 2; state 2 (final, with weight 1) has self-loops a:ε/1 and b:ε/1.

The composition X ∘ T computes the (expected) count of each bigram in {aa, ab, ba, bb} occurring in X.
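A sketch of the same quantity computed directly by dynamic programming on a small acyclic probabilistic automaton: summing forward probability × arc × arc × backward probability over consecutive arc pairs gives the expected count of each bigram, which is what the composition X ∘ T computes. The automaton below is hypothetical.

```python
from collections import defaultdict

# arcs: (source, label, probability, destination); state 0 is initial, state 3 is final.
arcs = [(0, "a", 0.6, 1), (0, "b", 0.4, 1),
        (1, "a", 0.5, 2), (1, "b", 0.5, 2),
        (2, "b", 1.0, 3)]
initial, final = 0, 3

forward = defaultdict(float, {initial: 1.0})          # total probability of reaching each state
for src, _, p, dst in arcs:                           # arcs are listed in topological order
    forward[dst] += forward[src] * p
backward = defaultdict(float, {final: 1.0})           # total probability of finishing from each state
for src, _, p, dst in reversed(arcs):
    backward[src] += p * backward[dst]

expected = defaultdict(float)
for s1, l1, p1, d1 in arcs:
    for s2, l2, p2, d2 in arcs:
        if d1 == s2:                                  # consecutive arcs spell the bigram l1 l2
            expected[l1 + l2] += forward[s1] * p1 * p2 * backward[d2]
print(dict(expected))                                 # e.g. the expected count of "ab" is 0.8
```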


Counting Transducers

X is an automaton representing a string or any other regular expression; here x = ab over the alphabet Σ = {a, b}.

Figure: counting transducer for x, with two states. State 0 (initial) has self-loops a:ε/1 and b:ε/1 and a transition X:X/1 (mapping an occurrence of x to itself) to state 1; state 1 (final, with weight 1) has self-loops a:ε/1 and b:ε/1.

Example: on the input bbabaabba, the two successful paths output εεabεεεεε and εεεεεabεε, one for each occurrence of ab.


References

• Cyril Allauzen, Mehryar Mohri, and Brian Roark. Generalized algorithms for constructing statistical language models. In Proceedings of the 41st Meeting of the Association for Computational Linguistics (ACL 2003), Sapporo, Japan, July 2003.

• Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza, Jennifer C. Lai, and Robert L. Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467-479, 1992.

• Stanley Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, 1998.

• William Gale and Kenneth W. Church. What's wrong with adding one? In N. Oostdijk and P. de Haan, editors, Corpus-Based Research into Language. Rodopi, Amsterdam, 1994.

• I. J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40:237-264, 1953.

• Frederick Jelinek and Robert L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, pages 381-397, 1980.


• Slava Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35:400-401, 1987.

• Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 1, pages 181-184, 1995.

• David A. McAllester and Robert E. Schapire. On the convergence rate of Good-Turing estimators. In Proceedings of the Conference on Learning Theory (COLT), pages 1-6, 2000.

• Hermann Ney, Ute Essen, and Reinhard Kneser. On structuring probabilistic dependences in stochastic language modeling. Computer Speech and Language, 8:1-38, 1994.

• Kristie Seymore and Ronald Rosenfeld. Scalable backoff language models. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), 1996.

• Andreas Stolcke. Entropy-based pruning of back-off language models. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, pages 270-274, 1998.


• Ian H. Witten and Timothy C. Bell. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4):1085-1094, 1991.