Minimum Phone Error (MPE) Model and Feature Training
ShihHsiang 2006
The derivation flow of the various training criteria
Difference
• MPE vs. ORCE
– ORCE focuses on word error rate and is implemented on N-best results
– MPE focuses on phone accuracy and is implemented on a word graph; it also introduces a prior distribution over the newly estimated models (I-smoothing)
• MPE vs. MMI
– MMI treats the correct transcription as the numerator lattice and the whole word graph as the denominator lattice (the competing sequences)
– MPE treats all possible correct sequences on the word graph as the numerator lattice, and all possible wrong sequences as the denominator lattice
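To make the MPE vs. MMI contrast concrete, here is a minimal sketch on a toy list of hypotheses rather than a real word graph. The scores, the phone accuracies and all function names are made up for illustration. MMI scores the log posterior of the correct transcription, while MPE averages the phone accuracy of every hypothesis, weighted by its posterior.

```python
import numpy as np

def hypothesis_posteriors(log_acoustic, log_lm):
    """Posterior of each hypothesis from acoustic and language-model log scores."""
    score = log_acoustic + log_lm
    score = score - score.max()       # stabilise before exponentiating
    weights = np.exp(score)
    return weights / weights.sum()

def mmi_objective(post, ref_index):
    """MMI: log posterior of the correct transcription."""
    return float(np.log(post[ref_index]))

def mpe_objective(post, phone_acc):
    """MPE: posterior-weighted average phone accuracy over all hypotheses."""
    return float(np.dot(post, phone_acc))

# Toy utterance with three competing hypotheses (hypothetical numbers).
log_acoustic = np.array([-10.0, -11.0, -12.0])
log_lm = np.array([-1.0, -1.0, -1.0])
phone_acc = np.array([5.0, 4.0, 2.0])  # phone accuracy of each hypothesis
post = hypothesis_posteriors(log_acoustic, log_lm)
f_mmi = mmi_objective(post, ref_index=0)
f_mpe = mpe_objective(post, phone_acc)
```

In this toy setting a partly correct hypothesis still contributes to the MPE value through its phone accuracy, whereas MMI only looks at the posterior of the single reference.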
fMPE (cont.)
• Feature-space minimum phone error (fMPE) is a discriminative training method which adds an offset to the old feature

$$y_t = o_t + M h_t$$

– $o_t$: current feature
– $M$: transform matrix
– $h_t$: high-dimensional feature
• Each vector $h_t$ contains 10,000 Gaussian posterior probabilities, and the Gaussian likelihoods are evaluated with no priors
[Figure labels: current frame, average]
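The offset $y_t = o_t + M h_t$ can be sketched in a few lines of NumPy. This is a toy illustration, not the author's implementation: the sizes are tiny, the Gaussians are assumed diagonal, and the helper names are hypothetical. Posteriors are obtained by normalising likelihoods only (no priors), as stated above. Any averaging over neighbouring frames is left out.

```python
import numpy as np

def gaussian_posteriors(o_t, means, variances):
    """Posteriors of diagonal Gaussians for one frame, assuming no priors."""
    # Log-likelihood of o_t under every Gaussian, shape (num_gauss,)
    log_lik = -0.5 * np.sum(
        np.log(2.0 * np.pi * variances) + (o_t - means) ** 2 / variances,
        axis=1,
    )
    log_lik -= log_lik.max()          # stabilise before exponentiating
    lik = np.exp(log_lik)
    return lik / lik.sum()

def fmpe_transform(o_t, M, h_t):
    """New feature: old feature plus the projected high-dimensional vector."""
    return o_t + M @ h_t

rng = np.random.default_rng(0)
p, q = 3, 5                           # toy sizes: feature dim, number of Gaussians
means = rng.normal(size=(q, p))
variances = np.ones((q, p))
o_t = rng.normal(size=p)
h_t = gaussian_posteriors(o_t, means, variances)
M = np.zeros((p, q))                  # a zero matrix leaves the feature unchanged
y_t = fmpe_transform(o_t, M, h_t)
```

Starting from a zero matrix means the first training iteration begins from the original features.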
fMPE (cont.)
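• Objective Function

$$F_{MPE} = \sum_{r} \sum_{u} \frac{P(O_r \mid u)\, P(u)}{\sum_{v \in \mathrm{lattice}} P(O_r \mid v)\, P(v)}\, \mathrm{Acc}(u, s_r)$$

• Using gradient descent to update the transformation matrix

$$\frac{\partial F_{MPE}}{\partial M_{ij}} = \sum_{t=1}^{T} \frac{\partial F_{MPE}}{\partial y_{ti}} \frac{\partial y_{ti}}{\partial M_{ij}} = \sum_{t=1}^{T} \frac{\partial F_{MPE}}{\partial y_{ti}}\, h_{tj}$$

• Direct differential

$$\frac{\partial F}{\partial y_{ti}^{\mathrm{direct}}} = \sum_{s=1}^{S} \sum_{m=1}^{M} \frac{\partial F}{\partial l_{smt}} \frac{\partial l_{smt}}{\partial y_{ti}}$$

• Indirect differential

$$\frac{\partial F}{\partial y_{ti}^{\mathrm{indirect}}} = \sum_{s=1}^{S} \sum_{m=1}^{M} \frac{\gamma_{sm}(t)}{\gamma_{sm}} \left( \frac{\partial F}{\partial \mu_{smi}} + 2\, \frac{\partial F}{\partial \sigma_{smi}^{2}} \left( y_{ti} - \mu_{smi} \right) \right)$$

• Total differential

$$\frac{\partial F_{MPE}}{\partial y_{ti}} = \frac{\partial F}{\partial y_{ti}^{\mathrm{direct}}} + \frac{\partial F}{\partial y_{ti}^{\mathrm{indirect}}}$$

• Update

$$M_{ij} := M_{ij} + \nu_{ij}\, \frac{\partial F_{MPE}}{\partial M_{ij}}$$

where $l_{smt}$ is the log-likelihood of Gaussian $m$ of state $s$ at time $t$, $\gamma_{sm}(t)$ and $\gamma_{sm}$ are the occupation counts of that Gaussian at time $t$ and summed over time, $\mu_{smi}$ and $\sigma_{smi}^{2}$ are its mean and variance in dimension $i$, and $\nu_{ij}$ is the learning rate. In the summation limits, $S$ and $M$ denote the numbers of states and of Gaussians per state.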
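The chain rule for the matrix gradient and the update step can be sketched as follows. This is a toy illustration under stated assumptions: the per-frame differentials are random placeholders rather than values computed from a lattice, the learning rate is a made-up constant, and the function names are hypothetical.

```python
import numpy as np

def grad_wrt_M(dF_dy, H):
    """dF/dM_ij = sum over t of dF/dy_ti * h_tj.

    dF_dy: (T, p) total differential per frame (direct + indirect)
    H:     (T, q) high-dimensional vectors h_t, one row per frame
    """
    return dF_dy.T @ H                # shape (p, q)

def update_M(M, grad, nu):
    """One update step: M_ij <- M_ij + nu_ij * dF/dM_ij."""
    return M + nu * grad

T, p, q = 4, 2, 3                     # toy sizes
rng = np.random.default_rng(1)
dF_dy_direct = rng.normal(size=(T, p))    # placeholder values
dF_dy_indirect = rng.normal(size=(T, p))  # placeholder values
H = rng.random(size=(T, q))

dF_dy = dF_dy_direct + dF_dy_indirect
grad = grad_wrt_M(dF_dy, H)
nu = np.full((p, q), 0.1)             # hypothetical per-element learning rate
M_new = update_M(np.zeros((p, q)), grad, nu)
```

Writing the sum over frames as a single matrix product keeps the accumulation in one call instead of a loop over $t$.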
fMPE (cont.)
• When only the direct differential is used to update the transformation matrix, significant improvements are obtained but are lost very soon once the acoustic model is retrained with ML
• The indirect differential therefore aims to reflect the model change that results from ML training with the new features
offset fMPE
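• Offset fMPE differs from the original fMPE in the definition of the high-dimensional vector $h_t$ of posterior probabilities

$$h_t = \left[\, 5.0\,\gamma_t^{1},\ \gamma_t^{1}\frac{x_{t,1}-\mu_{1,1}}{\sigma_{1,1}},\ \gamma_t^{1}\frac{x_{t,2}-\mu_{1,2}}{\sigma_{1,2}},\ \ldots,\ 5.0\,\gamma_t^{N},\ \gamma_t^{N}\frac{x_{t,1}-\mu_{N,1}}{\sigma_{N,1}},\ \gamma_t^{N}\frac{x_{t,2}-\mu_{N,2}}{\sigma_{N,2}},\ \ldots \,\right]^{T}$$

where $\gamma_t^{i}$ represents the posterior of the $i$-th Gaussian at time $t$

$$\gamma_t^{i} = \frac{p(O_t \mid g_i)}{\sum_{j=1}^{N} p(O_t \mid g_j)}$$

– the $(x - \mu)/\sigma$ terms are dimension dependent
– size of $h_t$: $N(d+1)$, for $N$ Gaussians and $d$-dimensional features
• The number of Gaussians needed is about 1,000, which is significantly lower than the 100,000 of the original fMPE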
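A small sketch of how the offset fMPE vector could be assembled: each Gaussian contributes $d + 1$ entries, the scaled posterior followed by the posterior-weighted normalised offsets. The function name and the toy numbers are hypothetical, and the posteriors are passed in as given rather than computed.

```python
import numpy as np

def offset_fmpe_vector(x_t, means, sigmas, posteriors):
    """Stack [5.0 * gamma, gamma * (x - mu) / sigma] for every Gaussian."""
    blocks = []
    for mu, sigma, gamma in zip(means, sigmas, posteriors):
        offsets = gamma * (x_t - mu) / sigma          # d dimension-dependent terms
        blocks.append(np.concatenate(([5.0 * gamma], offsets)))
    return np.concatenate(blocks)                     # length N * (d + 1)

N, d = 2, 3                                           # toy sizes
x_t = np.array([1.0, 2.0, 3.0])
means = np.array([[1.0, 2.0, 3.0],
                  [0.0, 0.0, 0.0]])
sigmas = np.ones((N, d))
posteriors = np.array([0.75, 0.25])                   # assumed already normalised
h_t = offset_fmpe_vector(x_t, means, sigmas, posteriors)
```

With roughly 1,000 Gaussians and a few dozen feature dimensions, this vector is far shorter than a vector of 100,000 posteriors, while still carrying where the frame sits inside each Gaussian.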
Dimension-weighted offset fMPE
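• Unlike offset fMPE, which gives the same weight to each dimension of the feature offset vector, dimension-weighted offset fMPE
– calculates the posterior probability on each dimension of the feature offset vector

$$h_t = \left[\, 5.0\,\gamma_t^{1},\ \gamma_t^{1}(1)\frac{x_{t,1}-\mu_{1,1}}{\sigma_{1,1}},\ \gamma_t^{1}(2)\frac{x_{t,2}-\mu_{1,2}}{\sigma_{1,2}},\ \ldots,\ 5.0\,\gamma_t^{N},\ \gamma_t^{N}(1)\frac{x_{t,1}-\mu_{N,1}}{\sigma_{N,1}},\ \gamma_t^{N}(2)\frac{x_{t,2}-\mu_{N,2}}{\sigma_{N,2}},\ \ldots \,\right]^{T}$$

where $\gamma_t^{i}(d)$ is the posterior of the $i$-th Gaussian at time $t$ computed on dimension $d$

$$\gamma_t^{i}(d) = \frac{p(O_t(d) \mid g_i(d))}{\sum_{j=1}^{N} p(O_t(d) \mid g_j(d))}$$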
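The per-dimension posterior can be sketched as below. Two assumptions are made here that the slides do not spell out: each per-dimension likelihood is taken to be a univariate Gaussian from a diagonal covariance, and the leading 5.0 entry keeps the ordinary (full) posterior. Function names and numbers are hypothetical.

```python
import numpy as np

def per_dim_posteriors(x_t, means, sigmas):
    """gamma[i, d]: posterior of Gaussian i computed on dimension d alone."""
    # Univariate Gaussian likelihood per Gaussian and per dimension, shape (N, d)
    lik = np.exp(-0.5 * ((x_t - means) / sigmas) ** 2) / (np.sqrt(2.0 * np.pi) * sigmas)
    return lik / lik.sum(axis=0, keepdims=True)       # normalise over Gaussians

def dim_weighted_vector(x_t, means, sigmas, full_post):
    """Offset vector whose offsets are weighted by per-dimension posteriors."""
    gamma_d = per_dim_posteriors(x_t, means, sigmas)
    blocks = []
    for i in range(means.shape[0]):
        offsets = gamma_d[i] * (x_t - means[i]) / sigmas[i]
        blocks.append(np.concatenate(([5.0 * full_post[i]], offsets)))
    return np.concatenate(blocks)

# Symmetric toy case: the frame lies exactly between the two Gaussians.
x_t = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0],
                  [1.0, 1.0]])
sigmas = np.ones((2, 2))
full_post = np.array([0.5, 0.5])                      # assumed given
gamma_d = per_dim_posteriors(x_t, means, sigmas)
h_t = dim_weighted_vector(x_t, means, sigmas, full_post)
```

When a frame is close to a Gaussian in one dimension but far from it in another, the per-dimension posteriors differ, so the offsets are no longer scaled by a single shared weight.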
Experiments (on MATBN)
• Error rates (%) for MPE and fMPE for different features, on different acoustic levels.
Experiments (cont.)
• CER(%) for offset fMPE and dimension-weighted offset fMPE with different features
Connect to SPLICE
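• Decomposition Scheme 1

$$\underbrace{y_t}_{p \times 1} = \underbrace{o_t}_{p \times 1} + \underbrace{M}_{p \times q}\ \underbrace{h_t}_{q \times 1}$$

– $M$ is split into blocks $M^{(1)}, \ldots, M^{(n)}$, each of size $p \times p$
– $h_t$ is split into sub-vectors $h_t^{(1)}, \ldots, h_t^{(n)}$, each of size $p \times 1$

$$y_t = o_t + \sum_{i=1}^{n} M^{(i)} h_t^{(i)}$$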
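The first decomposition is a block partition of the matrix product, which can be checked numerically. The sketch below uses random toy values and assumes $q = n \cdot p$ so that the matrix splits into $n$ square blocks.

```python
import numpy as np

p, n = 3, 4                           # toy sizes: feature dim, number of blocks
q = n * p                             # assumes q is a multiple of p
rng = np.random.default_rng(2)
M = rng.normal(size=(p, q))
h_t = rng.random(size=q)
o_t = rng.normal(size=p)

M_blocks = np.split(M, n, axis=1)     # n blocks, each of shape (p, p)
h_blocks = np.split(h_t, n)           # n sub-vectors, each of length p

y_full = o_t + M @ h_t
y_blocks = o_t + sum(Mi @ hi for Mi, hi in zip(M_blocks, h_blocks))
```

Both forms give the same compensated feature; the block form only makes explicit that each sub-vector is mapped by its own $p \times p$ matrix into a bias vector.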
Connect to SPLICE (cont.)
• Compensation of the original feature is carried out by adding a large number of bias vectors, each of which is computed as a full-rank rotation of a small set of posterior probabilities
• Maximum-likelihood estimation

$$y_t = o_t + \sum_{i=1}^{n} M^{(i)} h_t^{(i)} \approx o_t + M^{(i^*)} h_t^{(i^*)}$$

– $i^*$ denotes the term that is greater than the remaining $(n-1)$ terms
Connect to SPLICE (cont.)
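• Decomposition Scheme 2

$$\underbrace{y_t}_{p \times 1} = \underbrace{o_t}_{p \times 1} + \underbrace{M}_{p \times q}\ \underbrace{h_t}_{q \times 1}$$

– $M = [\, m_1\ m_2\ m_3\ \cdots\ m_q \,]$ is split into columns $m_k$
– $h_t = [\, h_{t,1},\ h_{t,2},\ \ldots,\ h_{t,q} \,]^{T}$ is split into scalar entries $h_{t,k}$

$$y_t = o_t + M h_t = o_t + \sum_{k=1}^{q} h_{t,k}\, m_k = o_t + \sum_{k=1}^{q} p(k \mid x_t)\, m_k$$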
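The second decomposition reads the same product column by column: each column of the matrix is a correction vector and each entry of the posterior vector is its weight. A toy numerical check with random values follows; the sizes are arbitrary.

```python
import numpy as np

p, q = 3, 6                           # toy sizes: feature dim, number of posteriors
rng = np.random.default_rng(3)
M = rng.normal(size=(p, q))           # columns m_k are the correction vectors
o_t = rng.normal(size=p)

weights = rng.random(size=q)
h_t = weights / weights.sum()         # stand-in for the posteriors p(k | x_t)

y_matrix = o_t + M @ h_t
y_columns = o_t + sum(h_t[k] * M[:, k] for k in range(q))
```

The two results agree, so the added offset is a posterior-weighted sum of the columns of the matrix.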
Connect to SPLICE (cont.)
• The compensation vector consists of a linear weighted sum of a set of frame-independent correction vectors, where the weight is the posterior probability associated with the corresponding correction vector
• The key difference is
– the bias vector for compensation in fMPE is specific to each time frame t
– the bias vector in feature-space stochastic matching is common to all frames in the utterance