EEM4R Spoken Language Processing - Introduction
Speech Technology Lab
bɜ:mɪŋǝm
Training HMMs
Version 4: February 2005
Outline
Parameter estimation
Maximum Likelihood (ML) parameter estimation
ML for Gaussian PDFs
ML for HMMs – the Baum-Welch algorithm
HMM adaptation:
– MAP estimation
– MLLR
Discrete variables
Suppose that Y is a random variable which can take any value in a discrete set X={x1,x2,…,xM}
Suppose that y1,y2,…,yN are samples of the random variable Y
If cm is the number of times that yn = xm, then an estimate of the probability that Y takes the value xm is given by:

P(xm) = P(yn = xm) ≈ cm / N
Discrete Probability Mass Function
[Bar chart: estimated probability P(n) plotted against symbol n, for n = 1,…,9; vertical axis from 0 to 0.25]

Symbol            1    2    3   4   5   6   7    8    9   Total
Num. Occurrences  120  231  90  87  63  57  156  203  91  1098
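The counting estimate above can be sketched in a few lines of Python. This is a minimal illustration (not from the original slides), using the occurrence counts from the table:

```python
# Relative-frequency estimate of a discrete PMF, using the occurrence
# counts for symbols 1..9 from the table above.
counts = {1: 120, 2: 231, 3: 90, 4: 87, 5: 63, 6: 57, 7: 156, 8: 203, 9: 91}

N = sum(counts.values())                    # total number of samples
P = {m: c / N for m, c in counts.items()}   # P(x_m) ~ c_m / N

print(N)            # 1098
print(P[2])         # 231/1098 ~ 0.21
```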
Continuous Random Variables
In most practical applications the data are not restricted to a finite set of values – they can take any value in N-dimensional space
Simply counting the number of occurrences of each value is no longer a viable way of estimating probabilities…
…but there are generalisations of this approach which are applicable to continuous variables – these are referred to as non-parametric methods
Continuous Random Variables
An alternative is to use a parametric model.
In a parametric model, probabilities are defined by a small set of parameters.
The simplest example is a normal, or Gaussian, model.
A Gaussian probability density function (PDF) is defined by two parameters:
– its mean μ, and
– its variance σ²
Gaussian PDF
The ‘standard’ 1-dimensional Gaussian PDF has:
– mean μ = 0
– variance σ² = 1

[Figure: the standard Gaussian PDF plotted for −5 ≤ x ≤ 5; the peak value, at x = 0, is about 0.4]
Gaussian PDF
The probability that x lies between a and b is the area under the PDF between a and b.

[Figure: standard Gaussian PDF plotted for −5 ≤ x ≤ 5, with the area between a and b shaded; this area is P(a ≤ x ≤ b)]
Gaussian PDF
For a 1-dimensional Gaussian PDF p with mean μ and variance σ²:

p(x) = p(x | μ, σ) = (1 / √(2πσ²)) exp( −(x − μ)² / (2σ²) )

The factor 1/√(2πσ²) is a constant which ensures that the area under the curve is 1; the exponential term defines the ‘bell’ shape.
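The formula above translates directly into code. A minimal sketch (not from the slides):

```python
import math

def gaussian_pdf(x, mu, var):
    """1-D Gaussian density p(x | mu, sigma^2)."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# At the mean of the standard Gaussian (mu=0, var=1) the density is
# 1/sqrt(2*pi) ~ 0.3989, matching the peak of the 'standard' plot.
print(round(gaussian_pdf(0.0, 0.0, 1.0), 4))  # 0.3989
```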
More examples
[Figure: four Gaussian PDFs with mean 0, plotted for −5 ≤ x ≤ 5, with variances 0.1, 1.0, 10.0 and 5.0; the smaller the variance, the taller and narrower the bell]
Fitting a Gaussian PDF to Data
Suppose y = y1,…,yt,…,yT is a sequence of T data values.
Given a Gaussian PDF p with mean μ and variance σ, define:

p(y | μ, σ) = ∏_{t=1}^{T} p(yt | μ, σ)

How do we choose μ and σ to maximise this probability?
Fitting a Gaussian PDF to Data

[Figure: the same data fitted by two Gaussian PDFs – a good fit, with the PDF centred on the data, and a poor fit, with the PDF offset from the data]
Maximum Likelihood Estimation
Define the best fitting Gaussian to be the one such that p(y | μ, σ) is maximised.
Terminology:
– p(y | μ, σ) as a function of y is the probability (density) of y
– p(y | μ, σ) as a function of μ, σ is the likelihood of μ, σ
Maximising p(y | μ, σ) with respect to μ, σ is called Maximum Likelihood (ML) estimation of μ, σ.
ML estimation of μ, σ
Intuitively:
– The maximum likelihood estimate of μ should be the average value of y1,…,yT (the sample mean)
– The maximum likelihood estimate of σ² should be the variance of y1,…,yT (the sample variance)
This turns out to be true: p(y | μ, σ) is maximised by setting:

μ = (1/T) ∑_{t=1}^{T} yt ,   σ² = (1/T) ∑_{t=1}^{T} (yt − μ)²
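The ML estimates above are just the sample mean and sample variance. A minimal sketch (not from the slides):

```python
def ml_estimates(y):
    """Maximum-likelihood mean and variance of a 1-D sample."""
    T = len(y)
    mu = sum(y) / T
    # Note: ML divides by T, not the unbiased T-1.
    var = sum((yt - mu) ** 2 for yt in y) / T
    return mu, var

y = [1.0, 2.0, 3.0, 4.0]
mu, var = ml_estimates(y)
print(mu, var)   # 2.5 1.25
```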
Proof
First note that maximising p(y) is the same as maximising log p(y):

log p(y | μ, σ) = log ∏_{t=1}^{T} p(yt | μ, σ) = ∑_{t=1}^{T} log p(yt | μ, σ)

At a maximum:

0 = ∂/∂μ log p(y | μ, σ) = ∑_{t=1}^{T} ∂/∂μ log p(yt | μ, σ) = ∑_{t=1}^{T} (yt − μ) / σ²

Also,

log p(yt | μ, σ) = −(1/2) log(2πσ²) − (yt − μ)² / (2σ²)

So,

μ = (1/T) ∑_{t=1}^{T} yt ,   and, differentiating with respect to σ² instead,   σ² = (1/T) ∑_{t=1}^{T} (yt − μ)²
ML training for HMMs
Now consider:
– An N-state HMM M, each of whose states is associated with a Gaussian PDF
– A training sequence y1,…,yT
For simplicity assume that each yt is 1-dimensional.
ML training for HMMs
If we knew that:
– y1,…,ye(1) correspond to state 1
– ye(1)+1,…,ye(2) correspond to state 2
– :
– ye(n-1)+1,…,ye(n) correspond to state n
– :
Then we could set the mean of state n to the average value of ye(n-1)+1,…,ye(n)
ML Training for HMMs
y1,…,ye(1), ye(1)+1,…,ye(2), ye(2)+1,…,yT
Unfortunately we don’t know that ye(n-1)+1,…,ye(n) correspond to state n…
Solution
1. Define an initial HMM – M0
2. Use the Viterbi algorithm to compute the optimal state sequence between M0 and y1,…,yT
[Figure: HMM states aligned against the observation sequence y1 y2 y3 … yt … yT]
Solution (continued)
Use optimal state sequence to segment y
[Figure: the observation sequence y1 … yT segmented at the optimal state boundaries ye(1), ye(2), …]
Reestimate parameters to get a new model M1
Solution (continued)
Now repeat whole process using M1 instead of M0, to get a new model M2
Then repeat again using M2 to get a new model M3
….
p(y | M0) ≤ p(y | M1) ≤ p(y | M2) ≤ … ≤ p(y | Mn) ≤ …
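The iterative segment-and-reestimate procedure described above can be sketched for a small left-to-right HMM with one Gaussian per state. This is a simplified illustration (not from the slides): transition probabilities are taken as equal and only the state means and variances are re-estimated; the data and initial means are made up.

```python
import math

def log_gauss(x, mu, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def viterbi_path(y, means, variances):
    """Best state sequence for a left-to-right HMM (stay/advance taken as
    equally likely, start in state 0, end in the last state)."""
    N, T = len(means), len(y)
    NEG = float("-inf")
    delta = [[NEG] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    delta[0][0] = log_gauss(y[0], means[0], variances[0])
    for t in range(1, T):
        for j in range(N):
            # predecessors: stay in j, or advance from j-1
            cands = [(delta[t - 1][j], j)]
            if j > 0:
                cands.append((delta[t - 1][j - 1], j - 1))
            best, back[t][j] = max(cands)
            delta[t][j] = best + log_gauss(y[t], means[j], variances[j])
    path = [N - 1]                      # backtrace from the final state
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

def viterbi_train(y, means, variances, iters=5):
    """Segment with Viterbi, then re-estimate each state's Gaussian."""
    for _ in range(iters):
        path = viterbi_path(y, means, variances)
        for j in range(len(means)):
            seg = [yt for yt, s in zip(y, path) if s == j]
            if seg:
                means[j] = sum(seg) / len(seg)
                variances[j] = max(sum((v - means[j]) ** 2 for v in seg) / len(seg), 1e-3)
    return means, variances

# Toy data: nine observations in three clear clusters, with deliberately
# poor initial means; training pulls the means onto the clusters.
y = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9, 10.0, 10.1, 9.9]
means, variances = viterbi_train(y, [0.0, 4.0, 8.0], [1.0, 1.0, 1.0])
print(means)
```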
Local optimization
[Figure: p(y | M) plotted against M – the sequence M0, M1, …, Mn climbs the likelihood surface to a local optimum, which may not be the global optimum]
Baum-Welch optimization
The algorithm just described is often called Viterbi training or Viterbi reestimation.
It is often used to train large sets of HMMs.
An alternative method is called Baum-Welch reestimation.
Baum-Welch Reestimation
[Figure: observation sequence y1 y2 y3 … yt yt+1 … yT, with state si aligned to yt]

γt(i) = P(xt = si | y, M) – the probability of being in state si at time t, given the whole observation sequence
‘Forward’ Probabilities
[Figure: observation sequence y1 y2 y3 … yt yt+1 … yT]

αt(i) = Prob(y1,…,yt, xt = si | M) = [ ∑_j αt−1(j) aji ] bi(yt)
‘Backward’ Probabilities
[Figure: observation sequence y1 y2 y3 … yt yt+1 … yT]

βt(i) = Prob(yt+1,…,yT | xt = si, M) = ∑_j aij bj(yt+1) βt+1(j)
‘Forward-Backward’ Algorithm
[Figure: observation sequence y1 y2 y3 … yt yt+1 … yT]

γt(i) = αt(i) βt(i) / p(y | M)
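The forward and backward recursions combine to give the state-occupancy probabilities γt(i). A minimal sketch for a small Gaussian-output HMM (not from the slides; the two-state example data are made up, and no numerical scaling is used, so it is only suitable for short sequences):

```python
import math

def forward_backward(y, a, means, variances, init):
    """Forward-backward for a small Gaussian-output HMM.
    a[i][j] = transition prob, init[i] = initial state prob."""
    N, T = len(means), len(y)

    def b(j, x):  # state output density
        v = variances[j]
        return math.exp(-((x - means[j]) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)

    # forward:  alpha[t][i] = p(y_1..y_t, x_t = s_i)
    alpha = [[init[i] * b(i, y[0]) for i in range(N)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t - 1][j] * a[j][i] for j in range(N)) * b(i, y[t])
                      for i in range(N)])

    # backward: beta[t][i] = p(y_{t+1}..y_T | x_t = s_i)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(a[i][j] * b(j, y[t + 1]) * beta[t + 1][j] for j in range(N))

    py = sum(alpha[T - 1][i] for i in range(N))          # p(y | M)
    gamma = [[alpha[t][i] * beta[t][i] / py for i in range(N)] for t in range(T)]
    return gamma, py

# Two states with means 0 and 5: the first two observations clearly
# belong to state 0, the last two to state 1.
y = [0.0, 0.2, 5.1, 4.8]
a = [[0.9, 0.1], [0.1, 0.9]]
gamma, py = forward_backward(y, a, [0.0, 5.0], [1.0, 1.0], [0.5, 0.5])
```

At every time t, γt(·) sums to 1 over the states, as a posterior distribution must.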
Adaptation
A modern large-vocabulary continuous speech recognition system has many thousands of parameters.
Many hours of speech data are used to train the system (e.g. 200+ hours!)
Speech data comes from many speakers.
Hence the recogniser is ‘speaker independent’.
But performance for an individual would be better if the system were speaker dependent.
Adaptation
For a single speaker, only a small amount of training data is available
Viterbi reestimation or Baum-Welch reestimation will not work
‘Parameters vs training data’
[Figure: performance plotted against number of parameters, for a larger and a smaller training set]
Adaptation
Two common approaches to adaptation (with small amounts of training data):
– Bayesian adaptation, also known as MAP adaptation (MAP = Maximum a Posteriori)
– Transform-based adaptation, also known as MLLR (MLLR = Maximum Likelihood Linear Regression)
Bayesian adaptation
Uses a well-trained, ‘speaker-independent’ HMM as a prior for the estimate of the parameters of the speaker-dependent HMM.
E.g.:

[Figure: speaker-independent state PDF, the sample mean of the adaptation data, and the MAP estimate of the mean, which lies between the prior mean and the sample mean]
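For a Gaussian state mean, the MAP estimate interpolates between the speaker-independent prior mean and the sample mean of the adaptation data. A minimal sketch of the standard update (not from the slides; the weight tau and the data values are illustrative assumptions):

```python
def map_mean(prior_mean, tau, y):
    """MAP estimate of a Gaussian mean: a weighted combination of the
    speaker-independent prior mean and the adaptation data.
    tau controls the weight given to the prior; with little data the
    estimate stays near the prior, with lots of data it approaches
    the sample mean."""
    T = len(y)
    return (tau * prior_mean + sum(y)) / (tau + T)

y = [1.8, 2.2, 2.0]           # small amount of adaptation data (sample mean 2.0)
print(map_mean(0.0, 3.0, y))  # ~1.0: halfway, since tau == T == 3
```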
Transform-based adaptation
Each acoustic vector is typically 40-dimensional.
So a linear transform of the acoustic data needs 40 × 40 = 1600 parameters.
This is much less than the 10s or 100s of thousands of parameters needed to train the whole system.
Estimate a linear transform to map speaker-independent into speaker-dependent parameters.
Transform-based adaptation

[Figure: a ‘best fit’ transform maps the speaker-independent parameters towards the speaker-dependent data points, giving the adapted parameters]
Summary
Maximum Likelihood (ML) estimation
Viterbi HMM parameter estimation
Baum-Welch HMM parameter estimation
Forward and backward probabilities
Adaptation:
– Bayesian adaptation
– Transform-based adaptation