A New Approach to Utterance Verification Based on Neighborhood Information in Model Space
Authors: Hui Jiang, Chin-Hui Lee
Reporter: 陳燦輝
References
[1] H. Jiang and C.-H. Lee, “A new approach to utterance verification based on neighborhood information in model space,” IEEE Trans. Speech Audio Processing, vol. 11, no. 5, pp. 425–434, 2003.
[2] H. Jiang, K. Hirose, and Q. Huo, “Robust speech recognition based on Bayesian prediction approach,” IEEE Trans. Speech Audio Processing, vol. 7, pp. 426–440, July 1999.
[3] N. Merhav and C.-H. Lee, “A minimax classification approach with application to robust speech recognition,” IEEE Trans. Speech Audio Processing, vol. 1, pp. 90–100, 1993.
Outline
Introduction
UV based on neighborhood information
Bayes factors: a Bayesian tool for verification problems
Experiments
Summary and Conclusions
Introduction
The major difficulty with likelihood ratio test (LRT)-based utterance verification (UV) is how to model the alternative hypothesis.
It is very important to know the properties of the competing source distributions.
In this paper, we investigate a novel idea: performing utterance verification based on neighborhood information in model space.
UV based on neighborhood information
Nested neighborhoods in model space:
First of all, let us look at the model space of HMMs. Suppose we have $N$ different HMMs in the recognizer, denoted as $\{\lambda_i \mid 1 \le i \le N\}$. Each $\lambda_i$ can be viewed as a point in the model space.
Intuitively, for every given model $\lambda_i$, we are able to enumerate a set of nested neighborhoods which surround the underlying model $\lambda_i$.
UV based on neighborhood information (cont)
Nested neighborhoods in model space (cont) :
Fig. 1. Illustration of the structure of nested neighborhoods in HMM model space.
UV based on neighborhood information (cont)
Nested neighborhoods in model space (cont):
For a given model $\lambda_i$, we can define a set of nested neighborhoods $\Omega_0^i, \Omega_1^i, \dots$ with increasing neighborhood sizes as follows:
1) Zero neighborhood $\Omega_0^i$: consists of the center $\lambda_i$ only.
2) Tight neighborhood $\Omega_1^i$: a very small neighborhood which tightly surrounds the model $\lambda_i$. This kind of neighborhood serves as a robust representation. The optimal model for a given utterance could shift slightly from the original position of the estimated model, but it is generally considered that the optimal model still resides somewhere within $\Omega_1^i$.
UV based on neighborhood information (cont)
Nested neighborhoods in model space (cont):
3) Medium neighborhood $\Omega_2^i$: has a medium size and is significantly larger than $\Omega_1^i$. Thus, $\Omega_2^i$ possibly includes all of $\lambda_i$'s potential competing models.
4) Large neighborhood $\Omega_3^i$: is even larger in size and should cover all related speech models in model space. The larger neighborhoods of different models should overlap with each other. On the other hand, a different model should have its own $\Omega_0^i$, $\Omega_1^i$, and $\Omega_2^i$.
5) Infinity neighborhood $\Omega_4^i$: has an infinite size and actually covers the entire space. Therefore, $\Omega_4^i$ should include all models in model space. In concept, these models are far away from the original model $\lambda_i$.
UV based on neighborhood information (cont)
For a given speech segment $X$, assume that an ASR system recognizes it as word $W$, which is represented by an HMM $\lambda_W$.
Traditionally, we formulate UV as a statistical hypothesis testing problem:
$H_0$: $X$ is truly from model $\lambda_W$
$H_1$: $X$ is NOT from model $\lambda_W$
Here, we translate the above hypothesis test into the following one:
$H_0'$: The true model of $X$ lies in the tight neighborhood $\Omega_1^W$ of $\lambda_W$
$H_1'$: The true model of $X$ lies in the region $\Omega_2^W - \Omega_1^W$ of $\lambda_W$
UV based on neighborhood information (cont)
i) Tight neighborhood $\Omega_1^W$: serves as a robust representation of the original model $\lambda_W$.
ii) Medium neighborhood $\Omega_2^W$: includes all potential competing models of $\lambda_W$, and $\Omega_2^W - \Omega_1^W$ denotes the holed region inside the medium neighborhood but excluding the tight neighborhood, as shown in Fig. 2.
Fig. 2. Illustration of hypothesis testing in the scenario of detecting speech recognition errors based on the neighborhood information.
Bayes factors
The Bayesian approach to hypothesis testing involves the calculation and evaluation of the so-called Bayes factor.
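Given the observation $X$ along with two hypotheses $H_0$ and $H_1$, the Bayes factor is computed as
$$BF = \frac{p(X \mid H_0)}{p(X \mid H_1)} = \frac{\int f(X \mid \lambda_0, H_0)\, p(\lambda_0 \mid H_0)\, d\lambda_0}{\int f(X \mid \lambda_1, H_1)\, p(\lambda_1 \mid H_1)\, d\lambda_1} \quad (1)$$
where, for $k = 0, 1$, $\lambda_k$ is the model parameter under $H_k$, $p(\lambda_k \mid H_k)$ is its prior density, and $f(X \mid \lambda_k, H_k)$ is the likelihood function of $X$ under $H_k$.
If $BF > \tau$, where $\tau$ is a pre-set critical threshold, then we accept $H_0$; otherwise we reject it.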
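The Bayes factor test in (1) can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration and is not from the paper: a one-dimensional Gaussian with unit variance stands in for the HMM likelihood, the priors are simple uniform intervals over the mean, and the marginal likelihoods are estimated by plain Monte Carlo sampling from the prior. The function names are hypothetical.

```python
import math
import random

def gaussian_likelihood(x, mean, var=1.0):
    """Likelihood f(x | mean) of an i.i.d. sequence x under N(mean, var)."""
    log_f = sum(-0.5 * math.log(2.0 * math.pi * var) - 0.5 * (v - mean) ** 2 / var
                for v in x)
    return math.exp(log_f)

def marginal_likelihood(x, sample_prior, n_samples=20000, seed=0):
    """Monte Carlo estimate of p(X | H): average f(X | theta) over prior samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += gaussian_likelihood(x, sample_prior(rng))
    return total / n_samples

def bayes_factor(x, sample_prior_h0, sample_prior_h1):
    """Ratio of the two marginal likelihoods, as in (1)."""
    return (marginal_likelihood(x, sample_prior_h0, seed=0)
            / marginal_likelihood(x, sample_prior_h1, seed=1))

def accept_h0(bf, tau):
    """Accept H0 when the Bayes factor exceeds the pre-set threshold tau."""
    return bf > tau

# Toy data close to zero: H0 puts the mean near 0, H1 puts it far away.
x = [0.1, -0.2, 0.3, 0.0]
bf = bayes_factor(x,
                  lambda rng: rng.uniform(-0.5, 0.5),   # prior under H0 (assumed)
                  lambda rng: rng.uniform(2.0, 4.0))    # prior under H1 (assumed)
decision = accept_h0(bf, tau=1.0)
```

In this toy setup the data sit inside the H0 prior range, so the estimated Bayes factor is far above 1 and H0 is accepted.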
Bayes factors (cont)
In order to use Bayes factors to solve the hypothesis testing problem, i.e. $H_0'$ vs. $H_1'$, two important issues must be addressed:
How to quantitatively define the neighborhoods $\Omega_1$ and $\Omega_2$.
How to properly choose the prior distribution $p(\cdot)$ of the HMM model parameters for each hypothesis.
Bayes factors (cont)
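CASE I: $(C, \rho)$ neighborhood and constrained uniform priors
Assume each HMM is an N-state CDHMM with parameters $\lambda = \{\pi, A, \theta_i \mid i = 1, 2, \dots, N\}$, where $\theta_i$ denotes the parameters of the $i$-th state in the HMM. Each state consists of several Gaussian mixtures, i.e. $\theta_i = \{\omega_{ik}, m_{ik}, r_{ik}\}$, where $k$ indicates the mixture number in the state.
We define the neighborhood form for both $\Omega_1$ and $\Omega_2$ as
$$\Omega(\lambda^*) = \{\lambda \mid \pi = \pi^*,\ A = A^*,\ \omega_{ik} = \omega^*_{ik},\ r_{ik} = r^*_{ik},\ |m_{ikd} - m^*_{ikd}| \le C d^{\rho-1},\ 1 \le i \le N,\ 1 \le k \le K,\ 1 \le d \le D\} \quad (2)$$
where $\lambda^* = \{\pi^*, A^*, \omega^*_{ik}, m^*_{ik}, r^*_{ik}\}$ denotes the original model parameter, which is the central point of the neighborhood, and $C$ ($C > 0$) and $\rho$ ($0 \le \rho \le 1$) are used to control the size of the neighborhood.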
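A membership test for the neighborhood in (2) can be sketched as follows. This is a minimal sketch under an assumption: the per-dimension bound is taken to be $C d^{\rho-1}$ with dimensions numbered from $d = 1$, and only the mean vector of a single Gaussian component is checked, since all other parameters are held at their original values. The function name `in_neighborhood` is hypothetical.

```python
def in_neighborhood(means, center_means, C, rho):
    """Return True if every mean component deviates from the center by at most
    C * d**(rho - 1), with dimensions numbered d = 1..D (assumed bound)."""
    return all(
        abs(m - m0) <= C * (d ** (rho - 1.0))
        for d, (m, m0) in enumerate(zip(means, center_means), start=1)
    )

center = [0.0, 0.0, 0.0]
candidate = [0.9, 0.72, 0.5]

# A smaller (C, rho) pair plays the role of the tight neighborhood,
# a larger pair plays the role of the medium neighborhood.
in_tight = in_neighborhood(candidate, center, C=1.0, rho=0.5)
in_medium = in_neighborhood(candidate, center, C=2.0, rho=0.9)
```

With these illustrative values the candidate falls outside the tight neighborhood (the bound for $d = 2$ is about 0.707) but inside the medium one.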
Bayes factors (cont)
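CASE I: $(C, \rho)$ neighborhood and constrained uniform priors (cont)
For the medium neighborhood $\Omega_2$, we choose larger values for $C$ and $\rho$, and for the tight neighborhood $\Omega_1$, we choose smaller values for $C$ and $\rho$.
Secondly, given the neighborhood, we assume that the prior distribution of the HMM parameter is a uniform p.d.f. constrained in the neighborhood. Based on these assumptions, the calculation of Bayes factors can be simplified as
$$BF = \frac{p(X \mid H_0')}{p(X \mid H_1')} = \frac{\int_{\Omega_1} f(X \mid \lambda)\, p(\lambda \mid H_0')\, d\lambda}{\int_{\Omega_2 - \Omega_1} f(X \mid \lambda)\, p(\lambda \mid H_1')\, d\lambda} = \frac{\int_{\Omega_2 - \Omega_1} d\lambda}{\int_{\Omega_1} d\lambda} \cdot \frac{\int_{\Omega_1} f(X \mid \lambda)\, d\lambda}{\int_{\Omega_2 - \Omega_1} f(X \mid \lambda)\, d\lambda} \quad (3)$$
We calculate the numerator and denominator separately.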
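Bayes factors (cont)
CASE I: $(C, \rho)$ neighborhood and constrained uniform priors (cont)
The VBPC (Viterbi Bayesian predictive classification) algorithm is used to compute each Bayesian predictive density $\tilde p(X)$, e.g. $\tilde p_1(X)$:
$$\tilde p(X) = \max_{s,l} \int f(X, s, l \mid \lambda)\, p(\lambda)\, d\lambda \quad (4)$$
Given a test utterance $X = (x_1, x_2, \dots, x_T)$ and a CDHMM parameter vector $\lambda$ along with its prior pdf $p(\lambda)$, the recursive search for approximately accomplishing the above VBPC is:
1) Initialization: $t = 1$
$$\delta_1(i) = \bar\pi_i \cdot \tilde b_i(x_1), \quad 1 \le i \le N \quad (5)$$
$$\psi_1(i) = 0, \quad 1 \le i \le N \quad (6)$$
where $\bar\pi$ denotes the mean of the prior pdf of the HMM parameter $\pi$, i.e. $\bar\pi = \pi^*$ (by the assumptions of the 13th and 14th slides and (2)), and
$$\tilde b_i(x_t) = \int p(x_t \mid \theta_i)\, p(\theta_i)\, d\theta_i \quad (7)$$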
Bayes factors (cont)
CASE I: $(C, \rho)$ neighborhood and constrained uniform priors (cont)
2) Recursion: for $2 \le t \le T$, $1 \le j \le N$ do
(2.1)
$$\delta_t(j) = \max_{1 \le i \le N} [\delta_{t-1}(i) \cdot \bar a_{ij}] \quad (7)$$
$$\psi_t(j) = \arg\max_{1 \le i \le N} [\delta_{t-1}(i) \cdot \bar a_{ij}] \quad (8)$$
The $\bar a_{ij}$ is the mean of the posterior pdf of $a_{ij}$ based on the optimal partial path up to the time instant $t$; it is given piecewise in (9), where $n_{ij}$ is the accumulated number of transitions from state $i$ to state $j$ based on the optimal partial path up to the time instant $t$.
Equations (10) and (11) restate the original formulation.
[The bodies of equations (9), (10) and (11) are not recoverable from this transcript.]
Bayes factors (cont)
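CASE I: $(C, \rho)$ neighborhood and constrained uniform priors (cont)
(2.2) Update the partial predictive value with respect to the state parameter $\theta_j$:
If [it is the first time to involve state $j$ in the computation of $\delta_t(j)$] then
$$\delta_t(j) = \delta_t(j) \cdot \tilde b_j(x_t) \quad (12)$$
Else
$$\delta_t(j) = \delta_t(j) \cdot \frac{\tilde b_j(x_j^{(1)}, x_j^{(2)}, \dots, x_j^{(N_j)}, x_t)}{\tilde b_j(x_j^{(1)}, x_j^{(2)}, \dots, x_j^{(N_j)})} \quad (13)$$
where $N_j$ is the accumulated number of feature vectors belonging to state $j$ based on the optimal partial path up to the time instant $t$, and $\tilde b_j(x_j^{(1)}, x_j^{(2)}, \dots, x_j^{(n)})$ denotes the contribution of the data $\{x_j^{(1)}, x_j^{(2)}, \dots, x_j^{(n)}\}$ residing at state $j$:
$$\tilde b_j(x_j^{(1)}, x_j^{(2)}, \dots, x_j^{(n)}) = \int p(x_j^{(1)} \mid \theta_j)\, p(x_j^{(2)} \mid \theta_j) \cdots p(x_j^{(n)} \mid \theta_j)\, p(\theta_j)\, d\theta_j \quad (14)$$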
Bayes factors (cont)
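CASE I: $(C, \rho)$ neighborhood and constrained uniform priors (cont)
3) Termination:
$$\tilde p(X \mid W) = \max_{1 \le i \le N} \delta_T(i) \quad (15)$$
Moreover, we have
$$\tilde b_j(x_t) = \sum_{l=1}^{K} \omega^*_{jl}\, f_{jl}(x_t) = \sum_{l=1}^{K} \omega^*_{jl} \prod_{d=1}^{D} f_{jld}(x_{td}) \quad (16)$$
where
$$f_{ikd}(x_{td}) = \frac{1}{2Cd^{\rho-1}} \int_{m^*_{ikd} - Cd^{\rho-1}}^{m^*_{ikd} + Cd^{\rho-1}} \sqrt{\frac{r^*_{ikd}}{2\pi}}\, e^{-\frac{r^*_{ikd}}{2}(x_{td} - m_{ikd})^2}\, dm_{ikd} = \frac{1}{2Cd^{\rho-1}} \left[ \Phi\!\left(\sqrt{r^*_{ikd}}\,(m^*_{ikd} - x_{td} + Cd^{\rho-1})\right) - \Phi\!\left(\sqrt{r^*_{ikd}}\,(m^*_{ikd} - x_{td} - Cd^{\rho-1})\right) \right] \quad (17)$$
where
$$\Phi(y) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{y} e^{-x^2/2}\, dx \quad (18)$$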
Bayes factors (cont)
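Derivation of (17):
$$f_{ikd}(x_{td}) = \frac{1}{2Cd^{\rho-1}} \int_{m^*_{ikd} - Cd^{\rho-1}}^{m^*_{ikd} + Cd^{\rho-1}} \sqrt{\frac{r^*_{ikd}}{2\pi}}\, e^{-\frac{r^*_{ikd}}{2}(x_{td} - m_{ikd})^2}\, dm_{ikd}$$
Let $z = \sqrt{r^*_{ikd}}\,(m_{ikd} - x_{td})$, so that $dz = \sqrt{r^*_{ikd}}\, dm_{ikd}$. Then
$$f_{ikd}(x_{td}) = \frac{1}{2Cd^{\rho-1}} \int_{\sqrt{r^*_{ikd}}(m^*_{ikd} - x_{td} - Cd^{\rho-1})}^{\sqrt{r^*_{ikd}}(m^*_{ikd} - x_{td} + Cd^{\rho-1})} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz$$
$$= \frac{1}{2Cd^{\rho-1}} \left[ \int_{-\infty}^{\sqrt{r^*_{ikd}}(m^*_{ikd} - x_{td} + Cd^{\rho-1})} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz - \int_{-\infty}^{\sqrt{r^*_{ikd}}(m^*_{ikd} - x_{td} - Cd^{\rho-1})} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz \right]$$
$$= \frac{1}{2Cd^{\rho-1}} \left[ \Phi\!\left(\sqrt{r^*_{ikd}}\,(m^*_{ikd} - x_{td} + Cd^{\rho-1})\right) - \Phi\!\left(\sqrt{r^*_{ikd}}\,(m^*_{ikd} - x_{td} - Cd^{\rho-1})\right) \right]$$
where
$$\Phi(y) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{y} e^{-x^2/2}\, dx$$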
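The closed form in (17) can be checked numerically. The sketch below is illustrative only: `half_width` stands for the quantity $C d^{\rho-1}$, $\Phi$ is computed from the error function in the standard library, and the reference value comes from a plain midpoint-rule integration over the uniform prior on the mean. All function names are hypothetical.

```python
import math

def std_normal_cdf(y):
    """Phi(y) as in (18), expressed through the error function."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))

def predictive_1d(x, m_star, r_star, half_width):
    """Closed form (17): Gaussian density averaged over a uniform prior on the
    mean in [m_star - half_width, m_star + half_width]; r_star is the precision."""
    s = math.sqrt(r_star)
    upper = std_normal_cdf(s * (m_star - x + half_width))
    lower = std_normal_cdf(s * (m_star - x - half_width))
    return (upper - lower) / (2.0 * half_width)

def predictive_1d_numeric(x, m_star, r_star, half_width, steps=20000):
    """Midpoint-rule integration of the same quantity, used as a cross-check."""
    lo = m_star - half_width
    dm = 2.0 * half_width / steps
    total = 0.0
    for n in range(steps):
        m = lo + (n + 0.5) * dm
        total += math.sqrt(r_star / (2.0 * math.pi)) * math.exp(-0.5 * r_star * (x - m) ** 2)
    return total * dm / (2.0 * half_width)

closed = predictive_1d(0.3, 0.0, 2.0, 0.5)
numeric = predictive_1d_numeric(0.3, 0.0, 2.0, 0.5)
```

As the half-width shrinks toward zero, the closed form approaches the ordinary Gaussian density at the original mean, which is the expected limiting behaviour of a uniform prior collapsing onto its center.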
Bayes factors (cont)
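CASE I: $(C, \rho)$ neighborhood and constrained uniform priors (cont)
$l_t$ denotes the mixture component label to which $x_t$ is "closest", i.e.
$$l_t = \arg\max_{l}\, \omega^*_{jl}\, f_{jl}(x_t) = \arg\max_{l}\, \omega^*_{jl} \prod_{d=1}^{D} f_{jld}(x_{td}) \quad (19)$$
Similarly, $\tilde b_j(x_j^{(1)}, x_j^{(2)}, \dots, x_j^{(n)})$ is calculated based on the "closest" mixture component label sequence corresponding to the data $\{x_j^{(1)}, x_j^{(2)}, \dots, x_j^{(n)}\}$:
$$\tilde b_j(x_j^{(1)}, x_j^{(2)}, \dots, x_j^{(n)}) = \prod_{k=1}^{K} (\omega^*_{jk})^{N_k}\, f_{jk}(x_k^{(1)}, x_k^{(2)}, \dots, x_k^{(N_k)}) = \prod_{k=1}^{K} (\omega^*_{jk})^{N_k} \prod_{d=1}^{D} f_{jkd}(x_{kd}^{(1)}, x_{kd}^{(2)}, \dots, x_{kd}^{(N_k)}) \quad (20)$$
where $\{x_k^{(1)}, x_k^{(2)}, \dots, x_k^{(N_k)}\}$ denote the feature vectors belonging to state $j$ in $X$ whose labels are "closest" to the mixture component $k$ of state $j$.
The joint term does not factorize over frames:
$$f_{ik}(x_k^{(1)}, x_k^{(2)}) \ne f_{ik}(x_k^{(1)})\, f_{ik}(x_k^{(2)})$$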
Bayes factors (cont)
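CASE I: $(C, \rho)$ neighborhood and constrained uniform priors (cont)
Then, with $m^*_{jkd}$ and $r^*_{jkd}$ being the pretrained mean and precision parameters, respectively, we have
$$f_{jkd}(x_{kd}^{(1)}, x_{kd}^{(2)}, \dots, x_{kd}^{(N_k)}) = \frac{1}{2Cd^{\rho-1}} \int_{m^*_{jkd} - Cd^{\rho-1}}^{m^*_{jkd} + Cd^{\rho-1}} \prod_{t=1}^{N_k} \sqrt{\frac{r^*_{jkd}}{2\pi}}\, e^{-\frac{r^*_{jkd}}{2}(x_{kd}^{(t)} - m_{jkd})^2}\, dm_{jkd}$$
$$= \frac{1}{2Cd^{\rho-1}} \left(\frac{r^*_{jkd}}{2\pi}\right)^{\frac{N_k - 1}{2}} \frac{\Psi_{kd}}{\sqrt{N_k}} \left[ \Phi\!\left(\sqrt{N_k r^*_{jkd}}\,(m^*_{jkd} - \bar x_{kd} + Cd^{\rho-1})\right) - \Phi\!\left(\sqrt{N_k r^*_{jkd}}\,(m^*_{jkd} - \bar x_{kd} - Cd^{\rho-1})\right) \right] \quad (21)$$
where $\Phi(\cdot)$ is defined in (18), and
$$\Psi_{kd} = \exp\left\{ -\tfrac{1}{2}\, r^*_{jkd}\, N_k \left( \overline{x^2}_{kd} - (\bar x_{kd})^2 \right) \right\} \quad (22)$$
with
$$\bar x_{kd} = \frac{1}{N_k} \sum_{t=1}^{N_k} x_{kd}^{(t)} \quad \text{and} \quad \overline{x^2}_{kd} = \frac{1}{N_k} \sum_{t=1}^{N_k} \left(x_{kd}^{(t)}\right)^2$$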
Bayes factors (cont)
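Derivation of the exponent in (21) and (22):
$$\exp\left( -\tfrac{1}{2}\, r^*_{jkd} \sum_{t=1}^{N_k} (x_{kd}^{(t)} - m_{jkd})^2 \right)$$
$$= \exp\left( -\tfrac{1}{2}\, r^*_{jkd}\, N_k \left[ \frac{1}{N_k}\sum_{t=1}^{N_k} (x_{kd}^{(t)})^2 - 2\, m_{jkd}\, \frac{1}{N_k} \sum_{t=1}^{N_k} x_{kd}^{(t)} + m_{jkd}^2 \right] \right)$$
$$= \exp\left( -\tfrac{1}{2}\, r^*_{jkd}\, N_k \left[ \overline{x^2}_{kd} - 2\, m_{jkd}\, \bar x_{kd} + m_{jkd}^2 \right] \right)$$
$$= \exp\left( -\tfrac{1}{2}\, r^*_{jkd}\, N_k \left[ (m_{jkd} - \bar x_{kd})^2 + \overline{x^2}_{kd} - (\bar x_{kd})^2 \right] \right)$$
$$= \exp\left( -\tfrac{1}{2}\, N_k\, r^*_{jkd}\, (m_{jkd} - \bar x_{kd})^2 \right) \cdot \exp\left( -\tfrac{1}{2}\, r^*_{jkd}\, N_k \left( \overline{x^2}_{kd} - (\bar x_{kd})^2 \right) \right)$$
$$= \exp\left( -\tfrac{1}{2}\, N_k\, r^*_{jkd}\, (m_{jkd} - \bar x_{kd})^2 \right) \cdot \Psi_{kd}$$
with
$$\Psi_{kd} = \exp\left\{ -\tfrac{1}{2}\, r^*_{jkd}\, N_k \left( \overline{x^2}_{kd} - (\bar x_{kd})^2 \right) \right\} \quad (22)$$
$$\bar x_{kd} = \frac{1}{N_k} \sum_{t=1}^{N_k} x_{kd}^{(t)} \quad \text{and} \quad \overline{x^2}_{kd} = \frac{1}{N_k} \sum_{t=1}^{N_k} \left(x_{kd}^{(t)}\right)^2$$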
Bayes factors (cont)
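Derivation of (21), continued:
$$f_{jkd}(x_{kd}^{(1)}, x_{kd}^{(2)}, \dots, x_{kd}^{(N_k)}) = \frac{1}{2Cd^{\rho-1}} \left(\frac{r^*_{jkd}}{2\pi}\right)^{\frac{N_k}{2}} \int_{m^*_{jkd} - Cd^{\rho-1}}^{m^*_{jkd} + Cd^{\rho-1}} e^{-\frac{r^*_{jkd}}{2} \sum_{t=1}^{N_k} (x_{kd}^{(t)} - m_{jkd})^2}\, dm_{jkd}$$
$$= \frac{1}{2Cd^{\rho-1}} \left(\frac{r^*_{jkd}}{2\pi}\right)^{\frac{N_k}{2}} \Psi_{kd} \int_{m^*_{jkd} - Cd^{\rho-1}}^{m^*_{jkd} + Cd^{\rho-1}} \exp\left( -\tfrac{1}{2}\, N_k\, r^*_{jkd}\, (m_{jkd} - \bar x_{kd})^2 \right) dm_{jkd}$$
Let $z = \sqrt{N_k r^*_{jkd}}\,(m_{jkd} - \bar x_{kd})$, so that $dz = \sqrt{N_k r^*_{jkd}}\, dm_{jkd}$. Then
$$= \frac{1}{2Cd^{\rho-1}} \left(\frac{r^*_{jkd}}{2\pi}\right)^{\frac{N_k}{2}} \Psi_{kd}\, \frac{1}{\sqrt{N_k r^*_{jkd}}} \int_{\sqrt{N_k r^*_{jkd}}(m^*_{jkd} - Cd^{\rho-1} - \bar x_{kd})}^{\sqrt{N_k r^*_{jkd}}(m^*_{jkd} + Cd^{\rho-1} - \bar x_{kd})} \exp\left(-\tfrac{1}{2} z^2\right) dz$$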
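The multi-frame closed form can be cross-checked in the same way as the single-frame case. The sketch below is an illustration under stated assumptions: `half_width` stands for $C d^{\rho-1}$, the closed form follows from completing the square and changing variables as in the derivation above, and the reference value is a midpoint-rule integration. It also shows numerically that the joint predictive value of two frames is not the product of the single-frame values. Function names are hypothetical.

```python
import math

def std_normal_cdf(y):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))

def joint_predictive_1d(xs, m_star, r_star, half_width):
    """Closed form for the product of Gaussian densities of all samples in xs,
    averaged over a uniform prior on the shared mean."""
    n = len(xs)
    mean = sum(xs) / n
    mean_sq = sum(v * v for v in xs) / n
    # Psi term: depends only on the sample variance of the data.
    psi = math.exp(-0.5 * r_star * n * (mean_sq - mean * mean))
    scale = math.sqrt(n * r_star)
    phi_diff = (std_normal_cdf(scale * (m_star - mean + half_width))
                - std_normal_cdf(scale * (m_star - mean - half_width)))
    norm = (r_star / (2.0 * math.pi)) ** ((n - 1) / 2.0) / math.sqrt(n)
    return norm * psi * phi_diff / (2.0 * half_width)

def joint_predictive_1d_numeric(xs, m_star, r_star, half_width, steps=20000):
    """Midpoint-rule integration over the mean, used as a cross-check."""
    lo = m_star - half_width
    dm = 2.0 * half_width / steps
    total = 0.0
    for s in range(steps):
        m = lo + (s + 0.5) * dm
        density = 1.0
        for v in xs:
            density *= math.sqrt(r_star / (2.0 * math.pi)) * math.exp(-0.5 * r_star * (v - m) ** 2)
        total += density
    return total * dm / (2.0 * half_width)

xs = [0.3, -0.1, 0.2]
closed = joint_predictive_1d(xs, 0.0, 2.0, 0.5)
numeric = joint_predictive_1d_numeric(xs, 0.0, 2.0, 0.5)

# Joint value of two identical frames versus the product of single-frame values.
joint_two = joint_predictive_1d([0.4, 0.4], 0.0, 2.0, 0.5)
product_two = joint_predictive_1d([0.4], 0.0, 2.0, 0.5) ** 2
```

The gap between `joint_two` and `product_two` is the reason the frames assigned to one mixture component are pooled through their mean and mean square instead of being scored independently.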
Bayes factors (cont)
CASE I: $(C, \rho)$ neighborhood and constrained uniform priors (cont)
In this paper, in order to balance the contributions from different models in the neighborhood, we introduce an exponential scale factor into the integral calculation.
The exponential scale factor is important for equalizing the contributions from different models in the neighborhood during the computation of the Bayes factor.
If we choose a scale factor greater than 1, the models with large likelihood values are emphasized. On the other hand, if the scale factor is smaller than 1, more weight is put on the models with smaller likelihood values.
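The effect of the exponential scale factor can be seen in a small sketch. It assumes, purely for illustration, that the scale is applied as an exponent on each model's likelihood before the contributions are combined; the slides do not spell out the exact placement, and the function name `scaled_weights` is hypothetical.

```python
def scaled_weights(likelihoods, scale):
    """Relative contribution of each model after raising its likelihood
    to the given exponent (assumed placement of the scale factor)."""
    powered = [v ** scale for v in likelihoods]
    total = sum(powered)
    return [p / total for p in powered]

likelihoods = [0.1, 0.4]
w_plain = scaled_weights(likelihoods, 1.0)   # unscaled contributions
w_sharp = scaled_weights(likelihoods, 1.2)   # exponent above 1
w_flat = scaled_weights(likelihoods, 0.5)    # exponent below 1
```

With an exponent above 1 the larger likelihood takes a bigger share of the total than in the unscaled case, and with an exponent below 1 the shares move closer together.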
Bayes factors (cont)
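CASE I: $(C, \rho)$ neighborhood and constrained uniform priors (cont)
Therefore, given the optimal path $\{s, l\}$, the approximate Bayesian predictive density is as follows:
$$\tilde p(X) \approx \int f(X, s, l \mid \lambda)\, p(\lambda)\, d\lambda$$
This integral factorizes into a product over the states $i = 1, \dots, N$, the mixture components $k = 1, \dots, K$ and the feature dimensions $d = 1, \dots, D$ of closed-form terms of the same type as (21), evaluated with the path statistics $n_{ik}$, $\bar x_{ikd}$ and $\overline{x^2}_{ikd}$.
[The full expanded product is not recoverable from this transcript.]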
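Bayes factors (cont)
CASE I: $(C, \rho)$ neighborhood and constrained uniform priors (cont)
where
$$n_{ik} = \sum_{t=1}^{T} \delta(s_t - i)\, \delta(l_t - k) \quad (23)$$
$$\bar x_{ikd} = \frac{1}{n_{ik}} \sum_{t=1}^{T} x_{td}\, \delta(s_t - i)\, \delta(l_t - k) \quad (24)$$
$$\overline{x^2}_{ikd} = \frac{1}{n_{ik}} \sum_{t=1}^{T} x_{td}^2\, \delta(s_t - i)\, \delta(l_t - k) \quad (25)$$
In the above, $\delta(\cdot)$ denotes the Kronecker delta indicator with
$$\delta(a - b) = \begin{cases} 1, & a = b \\ 0, & a \ne b \end{cases}$$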
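The path statistics in (23) to (25) amount to counting frames and averaging features per (state, mixture) pair. A minimal sketch, assuming the path is given as two label sequences aligned with the frames; the function name `path_statistics` and the dictionary layout are hypothetical.

```python
def path_statistics(frames, states, labels):
    """For each (state i, mixture k) visited on the path, return
    (n_ik, per-dimension mean, per-dimension mean of squares)."""
    acc = {}
    for x, i, k in zip(frames, states, labels):
        n, s1, s2 = acc.get((i, k), (0, [0.0] * len(x), [0.0] * len(x)))
        acc[(i, k)] = (
            n + 1,
            [a + v for a, v in zip(s1, x)],          # running sum of x_td
            [a + v * v for a, v in zip(s2, x)],      # running sum of x_td squared
        )
    return {
        key: (n, [a / n for a in s1], [a / n for a in s2])
        for key, (n, s1, s2) in acc.items()
    }

frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
states = [0, 0, 1]   # s_t: state index at each frame
labels = [0, 0, 1]   # l_t: mixture label at each frame
stats = path_statistics(frames, states, labels)
```

Only pairs that actually occur on the path get an entry, so the division by the count is always well defined.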
Bayes factors (cont)
CASE I: $(C, \rho)$ neighborhood and constrained uniform priors (cont)
How to specify the parameters $(C, \rho)$?
Global setting: We manually select $(C_1, \rho_1)$ for the tight neighborhood $\Omega_1$ and $(C_2, \rho_2)$ for the medium neighborhood $\Omega_2$, where $C_1 < C_2$ and $\rho_1 < \rho_2$. In this case we use the same tight (or medium) neighborhood for all different states and Gaussian mixtures in all HMMs.
State-dependent setting: Given an HMM state with parameter $\theta_i$, we assume the neighborhood for this state is defined as in (2). We calculate the maximum deviation $D(\theta_i)$ from the central point within the neighborhood in terms of euclidean distance; equation (26) expresses it as a maximum over the mixture components $k = 1, \dots, K$ of a sum over the dimensions $d = 1, \dots, D$ involving $C$, $d$ and $r_{ikd}$. [The exact form of (26) is not recoverable from this transcript.]
Thus, we first manually select $\rho_1$ and $\rho_2$ for the tight and medium neighborhoods, then we define the maximally allowed deviation distances for the tight and medium neighborhoods, i.e. $D_1$ and $D_2$ (usually $D_1 < D_2$), and finally we can calculate $C_1$ and $C_2$ for different HMM states from $D_1$ and $D_2$.
Bayes factors (cont)
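CASE II: delta priors
For the tight and medium neighborhoods, we construct a prior distribution as a mixture of delta functions:
$$p_t(\lambda) = \frac{1}{N_t} \sum_{\lambda_i \in \Omega_1} \delta(\lambda - \lambda_i) \quad (27)$$
where $N_t$ denotes the total number of models inside $\Omega_1$. Similarly, we can build a prior distribution for the region $\Omega_2 - \Omega_1$:
$$p_m(\lambda) = \frac{1}{N_m} \sum_{\lambda_i \in \Omega_2 - \Omega_1} \delta(\lambda - \lambda_i) \quad (28)$$
where $N_m$ denotes the total number of models for the region $\Omega_2 - \Omega_1$.
Based on these two priors, the Bayes factor to verify hypotheses $H_0'$ and $H_1'$ can be simplified as
$$BF = \frac{\frac{1}{N_t} \sum_{\lambda_i \in \Omega_1} f(X \mid \lambda_i)}{\frac{1}{N_m} \sum_{\lambda_i \in \Omega_2 - \Omega_1} f(X \mid \lambda_i)} \quad (29)$$
At present, many large-scale ASR systems use state-tied CDHMMs. Therefore, instead of building delta priors for each HMM model, we can also set up delta priors at the level of HMM states.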
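With delta priors the Bayes factor reduces to a ratio of two average likelihoods, one over the models in the tight neighborhood and one over the models in the surrounding region. A minimal sketch, assuming per-model log-likelihoods are already available and working in the log domain to avoid underflow on long utterances; the function names are hypothetical.

```python
import math

def log_mean_exp(log_values):
    """Numerically stable log of the arithmetic mean of exp(log_values)."""
    peak = max(log_values)
    return peak + math.log(sum(math.exp(v - peak) for v in log_values) / len(log_values))

def log_bayes_factor_delta(loglik_tight, loglik_ring):
    """Log Bayes factor for delta priors: average likelihood over the tight
    neighborhood divided by the average over the medium-minus-tight region."""
    return log_mean_exp(loglik_tight) - log_mean_exp(loglik_ring)

# Illustrative log-likelihoods of one segment under each neighboring model.
loglik_tight = [math.log(0.2), math.log(0.4)]
loglik_ring = [math.log(0.1), math.log(0.1), math.log(0.1)]
log_bf = log_bayes_factor_delta(loglik_tight, loglik_ring)
```

The decision then compares `log_bf` with the logarithm of the pre-set threshold.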
Experiments
We evaluate the proposed methods on the Bell Labs Communicator system.
In our recognition system, we used a 38-dimension feature vector, consisting of 12 Mel LPCCEP, 12 delta CEP, 12 delta-delta CEP, delta log-energy and delta-delta log-energy.
The acoustic models are state-tied, tri-phone CDHMM models, which consist of roughly 4K distinct HMM states with an average of 13.2 Gaussian mixture components per state.
Experiments (cont)
A class-based tri-gram LM including 2600 words is used.
The ASR system achieves 15.8% WER on our independent evaluation set, which includes 1395 utterances in total.
Based on the word and phoneme segmentations generated by the recognizer, we calculate a confidence score for every recognized word.
Experiments (cont)
Baseline system: likelihood ratio test.
New approach with settings in Case I:
We choose the $(C, \rho)$ neighborhood and the constrained uniform prior distribution. Since we use static, delta and delta-delta features, we slightly modify the neighborhood definition in (2) as
$$\Omega(\lambda^*) = \{\lambda \mid \pi = \pi^*,\ A = A^*,\ \omega_{ik} = \omega^*_{ik},\ r_{ik} = r^*_{ik},\ |m_{ikd} - m^*_{ikd}| \le C d^{\rho-1},\ |m_{ik(d + D/3)} - m^*_{ik(d + D/3)}| \le C d^{\rho-1},\ |m_{ik(d + 2D/3)} - m^*_{ik(d + 2D/3)}| \le C d^{\rho-1},\ 1 \le i \le N,\ 1 \le k \le K,\ 1 \le d \le D/3\} \quad (30)$$
Experiments (cont)
New approach with settings in Case I (cont):
For the state-dependent setting, we first set $\rho_1$ to a small value and $\rho_2$ to a large value. According to (26), we have manually checked the ranges $D_1 \in [0, 10.0]$ and $D_2 \in [100.0, 250.0]$.
New approach with settings in Case II:
We choose the delta priors in (27) and (28) at the level of HMM states. At first, for each distinct state, we calculate its distance from all other states. The distance between two HMM states is computed as the minimum euclidean distance between every possible pair of Gaussian components from these states:
$$\text{euclidean distance} = \sqrt{\sum_{d=1}^{D} (m_{ikd} - m_{jkd})^2}$$
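The state-to-state distance described above can be sketched directly. This is a minimal illustration that assumes each state is represented only by the list of mean vectors of its Gaussian components; the function name `state_distance` is hypothetical.

```python
import math

def state_distance(means_a, means_b):
    """Minimum euclidean distance over every pair of Gaussian mean vectors,
    one taken from each of the two states."""
    return min(math.dist(ma, mb) for ma in means_a for mb in means_b)

# Two toy states with two Gaussian components each (2-dimensional means).
state_a = [[0.0, 0.0], [10.0, 10.0]]
state_b = [[3.0, 4.0], [20.0, 20.0]]
d_ab = state_distance(state_a, state_b)
```

Because the minimum runs over all pairs, the measure is symmetric in the two states.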
Experiments (cont)
New approach with settings in Case II (cont):
For each state, we sort all other states according to their distances from the underlying state.
In the first case, denoted as Case II-A, for each underlying HMM state, we choose the neighborhood sizes to include exactly $N_t$ other states in $\Omega_1$ and $N_m$ in $\Omega_2 - \Omega_1$.
In the second case, denoted as Case II-B, from the top 1500 sorted states, we choose the neighborhood sizes so that $\Omega_1$ includes all other states with distance less than $D_t$ and $\Omega_2 - \Omega_1$ includes those with distance between $D_t$ and $D_m$.
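The two ways of picking neighboring states can be sketched as follows. The sketch assumes the distances from the underlying state to every other state are already computed and stored in a dictionary; the function names, argument names and the handling of the boundary values are illustrative choices, not taken from the paper.

```python
def neighborhoods_by_count(distances, n_tight, n_medium):
    """Case II-A style: the n_tight nearest states form the tight set,
    the next n_medium states form the surrounding region."""
    ranked = sorted(distances, key=distances.get)
    return ranked[:n_tight], ranked[n_tight:n_tight + n_medium]

def neighborhoods_by_distance(distances, d_tight, d_medium, top=1500):
    """Case II-B style: among the `top` nearest states, those closer than
    d_tight form the tight set, those between d_tight and d_medium the region."""
    ranked = sorted(distances, key=distances.get)[:top]
    tight = [s for s in ranked if distances[s] < d_tight]
    ring = [s for s in ranked if d_tight <= distances[s] < d_medium]
    return tight, ring

# Distances from one underlying state to four other states (toy values).
distances = {"a": 1.0, "b": 5.0, "c": 2.0, "d": 9.0}
tight_a, ring_a = neighborhoods_by_count(distances, n_tight=1, n_medium=2)
tight_b, ring_b = neighborhoods_by_distance(distances, d_tight=3.0, d_medium=6.0)
```

The count-based variant gives every state neighborhoods of the same size, while the distance-based variant lets the number of neighbors vary with how crowded the region around the state is.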
Experiments (cont)
TABLE I. VERIFICATION PERFORMANCE COMPARISON (EQUAL ERROR RATE IN %) OF BASELINE UV METHOD (LRT + ANTI-MODELS) WITH THE PROPOSED NEW APPROACH IN SEVERAL DIFFERENT SETTINGS. IN EACH CASE, THE BEST PERFORMANCE OF THE NEW APPROACH AND ITS CORRESPONDING PARAMETER SETTING ARE GIVEN. HERE WE ALWAYS FIX = 1.2
Experiments (cont)
Fig. 3. Comparison of ROC curves for different methods when verifying mis-recognized words against correctly recognized words in ASR outputs.
Summary and Conclusions
The basic idea is to assume that all competing models of a given model sit inside one neighborhood of the underlying model.
More research work is still needed to search for a better neighborhood definition in the high-dimensional HMM model space.
Another possible direction for future work: instead of Bayes factors, other tools such as generalized likelihood ratio testing (GLRT) can also be used to implement the neighborhood-based UV.