EEM4R Spoken Language Processing - Introduction
Speech Technology Lab
bɜ:mɪŋǝm
Training HMMs
Version 4: February 2005
Outline
Parameter estimation
Maximum Likelihood (ML) parameter estimation
ML for Gaussian PDFs
ML for HMMs – the Baum-Welch algorithm
HMM adaptation:
– MAP estimation
– MLLR
Discrete variables
Suppose that Y is a random variable which can take any value in a discrete set X={x1,x2,…,xM}
Suppose that y1,y2,…,yN are samples of the random variable Y
If cm is the number of times that yn = xm, then an estimate of the probability that Y takes the value xm is given by:

P(xm) = P(yn = xm) ≈ cm / N
Discrete Probability Mass Function
[Bar chart: estimated probability P(n) plotted against symbol n, for n = 1,…,9; vertical axis from 0 to 0.25]

Symbol            1    2    3   4   5   6   7    8    9   Total
Num. Occurrences  120  231  90  87  63  57  156  203  91  1098
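The counting estimate above can be sketched in a few lines of Python. This is a minimal illustration (not from the original slides), using the occurrence counts from the table:

```python
# Relative-frequency estimate of a discrete PMF, using the occurrence
# counts for symbols 1..9 from the table above.
counts = {1: 120, 2: 231, 3: 90, 4: 87, 5: 63, 6: 57, 7: 156, 8: 203, 9: 91}

N = sum(counts.values())                    # total number of samples
P = {m: c / N for m, c in counts.items()}   # P(x_m) ~ c_m / N

print(N)            # 1098
print(P[2])         # 231/1098 ~ 0.21
```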
Continuous Random Variables
In most practical applications the data are not restricted to a finite set of values – they can take any value in N-dimensional space
Simply counting the number of occurrences of each value is no longer a viable way of estimating probabilities…
…but there are generalisations of this approach which are applicable to continuous variables – these are referred to as non-parametric methods
Continuous Random Variables
An alternative is to use a parametric model.
In a parametric model, probabilities are defined by a small set of parameters.
The simplest example is a normal, or Gaussian, model.
A Gaussian probability density function (PDF) is defined by two parameters:
– its mean μ, and
– its variance σ²
Gaussian PDF
The ‘standard’ 1-dimensional Gaussian PDF has:
– mean μ = 0
– variance σ² = 1

[Figure: the standard Gaussian PDF plotted for −5 ≤ x ≤ 5; the peak value, at x = 0, is about 0.4]
Gaussian PDF
The probability that x lies between a and b is the area under the PDF between a and b.

[Figure: standard Gaussian PDF plotted for −5 ≤ x ≤ 5, with the area between a and b shaded; this area is P(a ≤ x ≤ b)]
Gaussian PDF
For a 1-dimensional Gaussian PDF p with mean μ and variance σ²:

p(x) = p(x | μ, σ) = (1 / √(2πσ²)) exp( −(x − μ)² / (2σ²) )

The factor 1/√(2πσ²) is a constant which ensures that the area under the curve is 1; the exponential term defines the ‘bell’ shape.
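The formula above translates directly into code. A minimal sketch (not from the slides):

```python
import math

def gaussian_pdf(x, mu, var):
    """1-D Gaussian density p(x | mu, sigma^2)."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# At the mean of the standard Gaussian (mu=0, var=1) the density is
# 1/sqrt(2*pi) ~ 0.3989, matching the peak of the 'standard' plot.
print(round(gaussian_pdf(0.0, 0.0, 1.0), 4))  # 0.3989
```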
More examples
[Figure: four Gaussian PDFs with mean 0, plotted for −5 ≤ x ≤ 5, with variances 0.1, 1.0, 10.0 and 5.0; the smaller the variance, the taller and narrower the bell]
Fitting a Gaussian PDF to Data
Suppose y = y1,…,yt,…,yT is a sequence of T data values.
Given a Gaussian PDF p with mean μ and variance σ, define:

p(y | μ, σ) = ∏_{t=1}^{T} p(yt | μ, σ)

How do we choose μ and σ to maximise this probability?
Fitting a Gaussian PDF to Data

[Figure: the same data fitted by two Gaussian PDFs – a good fit, with the PDF centred on the data, and a poor fit, with the PDF offset from the data]
Maximum Likelihood Estimation
Define the best fitting Gaussian to be the one such that p(y | μ, σ) is maximised.
Terminology:
– p(y | μ, σ) as a function of y is the probability (density) of y
– p(y | μ, σ) as a function of μ, σ is the likelihood of μ, σ
Maximising p(y | μ, σ) with respect to μ, σ is called Maximum Likelihood (ML) estimation of μ, σ.
ML estimation of μ, σ
Intuitively:
– The maximum likelihood estimate of μ should be the average value of y1,…,yT (the sample mean)
– The maximum likelihood estimate of σ² should be the variance of y1,…,yT (the sample variance)
This turns out to be true: p(y | μ, σ) is maximised by setting:

μ = (1/T) ∑_{t=1}^{T} yt ,   σ² = (1/T) ∑_{t=1}^{T} (yt − μ)²
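The ML estimates above are just the sample mean and sample variance. A minimal sketch (not from the slides):

```python
def ml_estimates(y):
    """Maximum-likelihood mean and variance of a 1-D sample."""
    T = len(y)
    mu = sum(y) / T
    # Note: ML divides by T, not the unbiased T-1.
    var = sum((yt - mu) ** 2 for yt in y) / T
    return mu, var

y = [1.0, 2.0, 3.0, 4.0]
mu, var = ml_estimates(y)
print(mu, var)   # 2.5 1.25
```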
Proof
First note that maximising p(y) is the same as maximising log p(y):

log p(y | μ, σ) = log ∏_{t=1}^{T} p(yt | μ, σ) = ∑_{t=1}^{T} log p(yt | μ, σ)

At a maximum:

0 = ∂/∂μ log p(y | μ, σ) = ∑_{t=1}^{T} ∂/∂μ log p(yt | μ, σ) = ∑_{t=1}^{T} (yt − μ) / σ²

Also,

log p(yt | μ, σ) = −(1/2) log(2πσ²) − (yt − μ)² / (2σ²)

So,

μ = (1/T) ∑_{t=1}^{T} yt ,   and, differentiating with respect to σ² instead,   σ² = (1/T) ∑_{t=1}^{T} (yt − μ)²
ML training for HMMs
Now consider:
– An N-state HMM M, each of whose states is associated with a Gaussian PDF
– A training sequence y1,…,yT
For simplicity assume that each yt is 1-dimensional.
ML training for HMMs
If we knew that:
– y1,…,ye(1) correspond to state 1
– ye(1)+1,…,ye(2) correspond to state 2
– :
– ye(n-1)+1,…,ye(n) correspond to state n
– :
Then we could set the mean of state n to the average value of ye(n-1)+1,…,ye(n)
ML Training for HMMs
y1,…,ye(1), ye(1)+1,…,ye(2), ye(2)+1,…,yT
Unfortunately we don’t know that ye(n-1)+1,…,ye(n) correspond to state n…
Solution
1. Define an initial HMM – M0
2. Use the Viterbi algorithm to compute the optimal state sequence between M0 and y1,…,yT
[Figure: HMM states aligned against the observation sequence y1 y2 y3 … yt … yT]
Solution (continued)
Use optimal state sequence to segment y
[Figure: the observation sequence y1 … yT segmented at the optimal state boundaries ye(1), ye(2), …]
Reestimate parameters to get a new model M1
Solution (continued)
Now repeat whole process using M1 instead of M0, to get a new model M2
Then repeat again using M2 to get a new model M3
….
p(y | M0) ≤ p(y | M1) ≤ p(y | M2) ≤ … ≤ p(y | Mn) ≤ …
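The iterative segment-and-reestimate procedure described above can be sketched for a small left-to-right HMM with one Gaussian per state. This is a simplified illustration (not from the slides): transition probabilities are taken as equal and only the state means and variances are re-estimated; the data and initial means are made up.

```python
import math

def log_gauss(x, mu, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def viterbi_path(y, means, variances):
    """Best state sequence for a left-to-right HMM (stay/advance taken as
    equally likely, start in state 0, end in the last state)."""
    N, T = len(means), len(y)
    NEG = float("-inf")
    delta = [[NEG] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    delta[0][0] = log_gauss(y[0], means[0], variances[0])
    for t in range(1, T):
        for j in range(N):
            # predecessors: stay in j, or advance from j-1
            cands = [(delta[t - 1][j], j)]
            if j > 0:
                cands.append((delta[t - 1][j - 1], j - 1))
            best, back[t][j] = max(cands)
            delta[t][j] = best + log_gauss(y[t], means[j], variances[j])
    path = [N - 1]                      # backtrace from the final state
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

def viterbi_train(y, means, variances, iters=5):
    """Segment with Viterbi, then re-estimate each state's Gaussian."""
    for _ in range(iters):
        path = viterbi_path(y, means, variances)
        for j in range(len(means)):
            seg = [yt for yt, s in zip(y, path) if s == j]
            if seg:
                means[j] = sum(seg) / len(seg)
                variances[j] = max(sum((v - means[j]) ** 2 for v in seg) / len(seg), 1e-3)
    return means, variances

# Toy data: nine observations in three clear clusters, with deliberately
# poor initial means; training pulls the means onto the clusters.
y = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9, 10.0, 10.1, 9.9]
means, variances = viterbi_train(y, [0.0, 4.0, 8.0], [1.0, 1.0, 1.0])
print(means)
```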
Local optimization
[Figure: p(y | M) plotted against M – the sequence M0, M1, …, Mn climbs the likelihood surface to a local optimum, which may not be the global optimum]
Baum-Welch optimization
The algorithm just described is often called Viterbi training or Viterbi reestimation.
It is often used to train large sets of HMMs.
An alternative method is called Baum-Welch reestimation.
Baum-Welch Reestimation
[Figure: observation sequence y1 y2 y3 … yt yt+1 … yT, with state si aligned to yt]

γt(i) = P(xt = si | y, M) – the probability of being in state si at time t, given the whole observation sequence
‘Forward’ Probabilities
[Figure: observation sequence y1 y2 y3 … yt yt+1 … yT]

αt(i) = Prob(y1,…,yt, xt = si | M) = [ ∑_j αt−1(j) aji ] bi(yt)
‘Backward’ Probabilities
[Figure: observation sequence y1 y2 y3 … yt yt+1 … yT]

βt(i) = Prob(yt+1,…,yT | xt = si, M) = ∑_j aij bj(yt+1) βt+1(j)
‘Forward-Backward’ Algorithm
[Figure: observation sequence y1 y2 y3 … yt yt+1 … yT]

γt(i) = αt(i) βt(i) / p(y | M)
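The forward and backward recursions combine to give the state-occupancy probabilities γt(i). A minimal sketch for a small Gaussian-output HMM (not from the slides; the two-state example data are made up, and no numerical scaling is used, so it is only suitable for short sequences):

```python
import math

def forward_backward(y, a, means, variances, init):
    """Forward-backward for a small Gaussian-output HMM.
    a[i][j] = transition prob, init[i] = initial state prob."""
    N, T = len(means), len(y)

    def b(j, x):  # state output density
        v = variances[j]
        return math.exp(-((x - means[j]) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)

    # forward:  alpha[t][i] = p(y_1..y_t, x_t = s_i)
    alpha = [[init[i] * b(i, y[0]) for i in range(N)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t - 1][j] * a[j][i] for j in range(N)) * b(i, y[t])
                      for i in range(N)])

    # backward: beta[t][i] = p(y_{t+1}..y_T | x_t = s_i)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(a[i][j] * b(j, y[t + 1]) * beta[t + 1][j] for j in range(N))

    py = sum(alpha[T - 1][i] for i in range(N))          # p(y | M)
    gamma = [[alpha[t][i] * beta[t][i] / py for i in range(N)] for t in range(T)]
    return gamma, py

# Two states with means 0 and 5: the first two observations clearly
# belong to state 0, the last two to state 1.
y = [0.0, 0.2, 5.1, 4.8]
a = [[0.9, 0.1], [0.1, 0.9]]
gamma, py = forward_backward(y, a, [0.0, 5.0], [1.0, 1.0], [0.5, 0.5])
```

At every time t, γt(·) sums to 1 over the states, as a posterior distribution must.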
Adaptation
A modern large-vocabulary continuous speech recognition system has many thousands of parameters.
Many hours of speech data are used to train the system (e.g. 200+ hours!)
Speech data comes from many speakers.
Hence the recogniser is ‘speaker independent’.
But performance for an individual would be better if the system were speaker dependent.
Adaptation
For a single speaker, only a small amount of training data is available
Viterbi reestimation or Baum-Welch reestimation will not work
‘Parameters vs training data’
[Figure: performance plotted against number of parameters, for a larger and a smaller training set]
Adaptation
Two common approaches to adaptation (with small amounts of training data):
– Bayesian adaptation, also known as MAP adaptation (MAP = Maximum a Posteriori)
– Transform-based adaptation, also known as MLLR (MLLR = Maximum Likelihood Linear Regression)
Bayesian adaptation
Uses a well-trained, ‘speaker-independent’ HMM as a prior for the estimate of the parameters of the speaker-dependent HMM.
E.g.:

[Figure: speaker-independent state PDF, the sample mean of the adaptation data, and the MAP estimate of the mean, which lies between the prior mean and the sample mean]
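For a Gaussian state mean, the MAP estimate interpolates between the speaker-independent prior mean and the sample mean of the adaptation data. A minimal sketch of the standard update (not from the slides; the weight tau and the data values are illustrative assumptions):

```python
def map_mean(prior_mean, tau, y):
    """MAP estimate of a Gaussian mean: a weighted combination of the
    speaker-independent prior mean and the adaptation data.
    tau controls the weight given to the prior; with little data the
    estimate stays near the prior, with lots of data it approaches
    the sample mean."""
    T = len(y)
    return (tau * prior_mean + sum(y)) / (tau + T)

y = [1.8, 2.2, 2.0]           # small amount of adaptation data (sample mean 2.0)
print(map_mean(0.0, 3.0, y))  # ~1.0: halfway, since tau == T == 3
```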
Transform-based adaptation
Each acoustic vector is typically 40-dimensional.
So a linear transform of the acoustic data needs 40 × 40 = 1600 parameters.
This is much less than the 10s or 100s of thousands of parameters needed to train the whole system.
Estimate a linear transform to map speaker-independent into speaker-dependent parameters.
Transform-based adaptation

[Figure: a ‘best fit’ transform maps the speaker-independent parameters towards the speaker-dependent data points, giving the adapted parameters]
Summary
Maximum Likelihood (ML) estimation
Viterbi HMM parameter estimation
Baum-Welch HMM parameter estimation
Forward and backward probabilities
Adaptation:
– Bayesian adaptation
– Transform-based adaptation