Feature Decorrelation on Speech Recognition
Hung-Shin Lee
Institute of Information Science, Academia Sinica
Dept. of Electrical Engineering, National Taiwan University

2009-10-09 @ IIS, Academia Sinica
References

1) J. Psutka and L. Muller, "Comparison of various feature decorrelation techniques in automatic speech recognition," in Proc. CITSA 2006.
2) Dr. Berlin Chen's lecture slides: http://berlin.csie.ntnu.edu.tw
3) Batlle et al., "Feature decorrelation methods in speech recognition - a comparative study," in Proc. ICSLP 1998.
4) K. Demuynck et al., "Improved feature decorrelation for HMM-based speech recognition," in Proc. ICSLP 1998.
5) W. Krzanowski, Principles of Multivariate Analysis - A User's Perspective, Oxford Press, 1988.
6) R. Gopinath, "Maximum likelihood modeling with Gaussian distributions for classification," in Proc. ICASSP 1998.
7) M. Gales, "Semi-tied covariance matrices for hidden Markov models," IEEE Trans. on Speech and Audio Processing, vol. 7, no. 3, pp. 272-281, 1999.
8) P. Olsen and R. Gopinath, "Modeling inverse covariance matrices by basis expansion," in Proc. ICASSP 2002.
9) P. Olsen and R. Gopinath, "Modeling inverse covariance matrices by basis expansion," IEEE Trans. on Speech and Audio Processing, vol. 12, no. 1, pp. 37-46, 2004.
10) N. Kumar and R. Gopinath, "Multiple linear transform," in Proc. ICASSP 2001.
11) N. Kumar and A. Andreou, "Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition," Speech Communication, vol. 26, no. 4, pp. 283-297, 1998.
12) A. Ljolje, "The importance of cepstral parameter correlations in speech recognition," Computer Speech and Language, vol. 8, pp. 223-232, 1994.
13) B. Flury, "Common principal components in k groups," Journal of the American Statistical Association, vol. 79, no. 388, 1984.
14) B. Flury and W. Gautschi, "An algorithm for simultaneous orthogonal transformation of several positive definite symmetric matrices to nearly diagonal form," SIAM Journal on Scientific and Statistical Computing, vol. 7, no. 1, 1986.
Outline

• Introduction to Feature Decorrelation (FD)
  – FD on Speech Recognition
• Feature-based Decorrelation (DCT, PCA, LDA, F-MLLT)
• Model-based Decorrelation (M-MLLT, EMLLT, Semi-tied covariance matrices, MLT)
• Common Principal Components (CPC)
Introduction to Feature Decorrelation -1

• Definition of the covariance matrix:
  – Random vector: X = [x_1 ... x_n]^T, where each x_i is a random variable
  – Covariance matrix and mean vector:

      Σ ≡ E[(X − μ)(X − μ)^T]
      μ ≡ E[X]

    (E[·] denotes the expected value)
Introduction to Feature Decorrelation -2

• Feature Decorrelation (FD)
  – To find transformations Θ that make all variables or parameters (nearly) uncorrelated:

      cov(X̃_i, X̃_j) = 0, ∀ X̃_i, X̃_j, i ≠ j     (X̃: transformed random variable)

  – Or, equivalently, to make the covariance matrix diagonal (not necessarily the identity):

      Θ^T Σ Θ = D     (Σ: covariance matrix; D: diagonal matrix)

    The covariance matrix can be global or depend on each class.
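The two statements above are easy to check numerically: the eigenvector matrix of Σ is one valid choice of Θ. A minimal numpy sketch (the data and dimensions below are synthetic, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 3-D data: apply a random linear map to white noise.
A = rng.normal(size=(3, 3))
X = rng.normal(size=(1000, 3)) @ A.T

# Sample covariance of the original features (off-diagonals are non-zero).
Sigma = np.cov(X, rowvar=False)

# Eigendecomposition: the columns of Theta are orthonormal eigenvectors.
eigvals, Theta = np.linalg.eigh(Sigma)

# The transformed covariance Theta^T Sigma Theta is (numerically) diagonal.
D = Theta.T @ Sigma @ Theta
```

Here D carries the eigenvalues of Σ on its diagonal, so the transformed variables are uncorrelated but not standardized (the matrix is diagonal, not the identity).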
Introduction to Feature Decorrelation -3

• Why Feature Decorrelation (FD)?
  – In many speech recognition systems, the observation density function for each HMM state is modeled as a mixture of diagonal-covariance Gaussians
  – For the sake of computational simplicity, the off-diagonal elements of the covariance matrices of the Gaussians are assumed to be close to zero

      f(x) = Σ_i π_i N(x; μ_i, Σ_i)     (the observation density function)
      N(x; μ_i, Σ_i) = (2π)^{-d/2} |Σ_i|^{-1/2} exp( -(1/2)(x − μ_i)^T Σ_i^{-1} (x − μ_i) )
FD on Speech Recognition -1

• Approaches for FD can be divided into two categories: feature-space and model-space
• Feature-space Schemes
  – It is hard to find a single transform which decorrelates all elements of the feature vector for all states
• Model-space Schemes
  – A different transform is selected depending on which component the observation was hypothesized to be generated from
  – In the limit, a transform may be used for each component, which is equivalent to a full covariance matrix system
FD on Speech Recognition -2

• Feature-space Schemes on LVCSR

  [Block diagram: Speech Signal → Front-End Preprocessing → Feature-space Decorrelation (DCT, PCA, LDA, MLLT, ...) → AM Training (on Training Data) and Speech Decoding (on Test Data, using AM, LM, Lexicon) → Textual Results]
FD on Speech Recognition -3

• Model-space Schemes on LVCSR

  [Block diagram: Speech Signal → Front-End Preprocessing → AM Training (on Training Data) with Model-space Decorrelation (MLLT, EMLLT, MLT, Semi-Tied, ...) and Speech Decoding (on Test Data, using AM, LM, Lexicon) → Textual Results]
FD on Speech Recognition -4

• Feature-Space Schemes
  – Without label information: Discrete Cosine Transform (DCT), Principal Component Analysis (PCA), Common Principal Components (CPC)
  – With label information: Linear Discriminant Analysis (LDA)
• Model-Space Schemes
  – Without label information: Maximum Likelihood Linear Transform (MLLT), Extended Maximum Likelihood Linear Transform (EMLLT), Semi-Tied Covariance Matrices, Multiple Linear Transforms (MLT)
Discrete Cosine Transform -1

• Discrete Cosine Transform (DCT)
  – Since the log-power spectrum is real and symmetric, the inverse DFT reduces to a Discrete Cosine Transform (DCT)
  – It is applied to the log-energies of the output filters (Mel-scaled filterbank) during the MFCC parameterization:

      c_j = sqrt(2/n) Σ_{i=1}^{n} x_i cos( π j (i − 0.5) / n ),   for j = 0, 1, ..., m − 1, m < n

    (x_i: i-th coordinate of the input vector x; c_j: j-th coordinate of the output vector c)

  – The DCT achieves only partial decorrelation
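A small sketch of this step in numpy. The sqrt(2/n) scaling follows the common HTK-style convention; that constant is an assumption here, since the slide's normalization factor did not survive extraction, and the toy input values are made up:

```python
import numpy as np

def dct_cepstra(log_energies, m):
    """DCT of log filterbank energies (HTK-style sqrt(2/n) scaling assumed):
    c_j = sqrt(2/n) * sum_{i=1..n} x_i * cos(pi*j*(i-0.5)/n), j = 0..m-1.
    """
    x = np.asarray(log_energies, dtype=float)
    n = len(x)
    i = np.arange(1, n + 1)
    return np.array([np.sqrt(2.0 / n) * np.sum(x * np.cos(np.pi * j * (i - 0.5) / n))
                     for j in range(m)])

# Toy example: 18 "log filterbank energies" -> 13 "cepstral" coefficients.
x = np.log(np.arange(1, 19, dtype=float))
c = dct_cepstra(x, 13)
```

A useful sanity check: for a constant input vector, every coefficient except c_0 vanishes, which is exactly the (partial) decorrelation property the cosine basis provides.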
Discrete Cosine Transform -2

• (Total) covariance matrix calculated using the 25-hour training data from MATBN

  [Surface plots: covariance of the 18 Mel-scaled filterbank log-energies vs. covariance of the 13 Mel-cepstral coefficients obtained with the DCT]
Principal Component Analysis -1

• Principal Component Analysis (PCA)
  – Based on the calculation of the major directions of variation of a set of data points in a high-dimensional space
  – Extracts the directions of greatest variance, assuming that the less variation the data has, the less information it carries about the features
  – Principal components: the largest eigenvectors of the total covariance matrix Σ

      V = [v_1, ..., v_p],   Σ v_i = λ_i v_i,   V^T Σ V = D     (D: diagonal matrix)
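A minimal sketch of PCA as described above, with made-up 5-dimensional data: projecting onto the largest eigenvectors leaves a diagonal covariance that carries the largest eigenvalues of Σ.

```python
import numpy as np

rng = np.random.default_rng(1)

# 5-D correlated data (synthetic, for illustration only).
A = rng.normal(size=(5, 5))
X = rng.normal(size=(2000, 5)) @ A.T
Sigma = np.cov(X, rowvar=False)

# Principal components: eigenvectors sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
V = eigvecs[:, order[:3]]          # keep the 3 largest: V = [v_1, v_2, v_3]

# Projected (decorrelated) features and their covariance.
Y = X @ V
C = np.cov(Y, rowvar=False)
# C is diagonal, carrying the 3 largest eigenvalues of Sigma
```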
Principal Component Analysis -2

• (Total) covariance matrix calculated using the 25-hour training data from MATBN

  [Surface plots: covariance of the 162-dimensional spliced Mel-filterbank features vs. covariance of the 39-dimensional PCA-transformed subspace]
Linear Discriminant Analysis -1

• Linear Discriminant Analysis (LDA)
  – Seeks a linear transformation matrix Θ that satisfies

      Θ = argmax_Θ trace( (Θ^T S_W Θ)^{-1} (Θ^T S_B Θ) )

  – Θ is formed by the largest eigenvectors of S_W^{-1} S_B
  – The LDA subspace is not orthogonal (but the PCA subspace is):

      Θ^T Θ ≠ I

  – The LDA subspace makes the transformed variables statistically uncorrelated
Linear Discriminant Analysis -2

• From any two distinct eigenvalue/eigenvector pairs (λ_i, θ_i) and (λ_j, θ_j):

      S_B θ_i = λ_i S_W θ_i,   S_B θ_j = λ_j S_W θ_j

  – Pre-multiplying by θ_j^T and θ_i^T, respectively:

      θ_j^T S_B θ_i = λ_i θ_j^T S_W θ_i,   θ_i^T S_B θ_j = λ_j θ_i^T S_W θ_j

  – Since S_B and S_W are symmetric, θ_j^T S_B θ_i = θ_i^T S_B θ_j and θ_j^T S_W θ_i = θ_i^T S_W θ_j, so

      λ_i θ_j^T S_W θ_i = λ_j θ_j^T S_W θ_i   ⇒   (λ_i − λ_j) θ_j^T S_W θ_i = 0

  – With λ_i ≠ λ_j and λ_i, λ_j ≠ 0 for i ≠ j:

      ∴ θ_i^T S_W θ_j = 0, for all i ≠ j

  Ref. (Krzanowski, 1988)
Linear Discriminant Analysis -3

• To overcome the arbitrary scaling of θ_i, it is usual to adopt the normalization

      θ_i^T S_W θ_i = 1

  so that

      θ_i^T S_B θ_i = λ_i θ_i^T S_W θ_i = λ_i
      ∴ Θ^T S_W Θ = I,   Θ^T S_B Θ = Λ = diag(λ_1, ..., λ_p)

• The total covariance matrix S_T = S_W + S_B is therefore also transformed into a diagonal matrix
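The normalization above can be checked numerically by solving the generalized eigenproblem S_B θ = λ S_W θ via Cholesky whitening, one standard way to do it. The scatter matrices below are made-up positive definite examples, not real speech statistics:

```python
import numpy as np

def lda_transform(Sw, Sb):
    """Solve Sb θ = λ Sw θ by Cholesky whitening of Sw.

    Columns of Theta are normalized so that Theta^T Sw Theta = I,
    and hence Theta^T Sb Theta = diag(λ_1, ..., λ_p).
    """
    L = np.linalg.cholesky(Sw)
    Linv = np.linalg.inv(L)
    M = Linv @ Sb @ Linv.T              # symmetric, same eigenvalues
    lam, U = np.linalg.eigh(M)
    order = np.argsort(lam)[::-1]       # largest eigenvalues first
    Theta = Linv.T @ U[:, order]
    return lam[order], Theta

# Hypothetical scatter matrices, positive definite by construction.
rng = np.random.default_rng(2)
B = rng.normal(size=(4, 4))
Sw = B @ B.T + 4 * np.eye(4)
C = rng.normal(size=(4, 4))
Sb = C @ C.T

lam, Theta = lda_transform(Sw, Sb)
# Theta^T Sw Theta = I and Theta^T Sb Theta = diag(lam): both diagonal,
# so the transformed variables are statistically uncorrelated.
```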
Linear Discriminant Analysis -4

• (Total) covariance matrix calculated using the 25-hour training data from MATBN

  [Surface plots: covariance of the 162-dimensional spliced Mel-filterbank features vs. covariance of the 39-dimensional LDA-transformed subspace]
Maximum Likelihood Linear Transform

• The Maximum Likelihood Linear Transform (MLLT) comes in two types:
  – Feature-based: (Gopinath, 1998)
  – Model-based: (Olsen, 2002), (Olsen, 2004)
• The common goal of the two types:
  – Find a global linear transformation matrix that decorrelates the features
Feature-based MLLT -1

• Feature-based Maximum Likelihood Linear Transform (F-MLLT)
  – Tries to alleviate four problems:
      (a) Data insufficiency, implying unreliable models
      (b) Large storage requirements
      (c) Large computational requirements
      (d) ML is not discriminating between classes
  – Solutions:
      (a)-(c): Sharing parameters across classes
      (d): Appealing to LDA
Feature-based MLLT -2

• Maximum Likelihood Modeling
  – The likelihood of the training data {(x_i, l_i)}, where l_i is the class label of x_i, is given by

      p(x_1^N, {μ_j}, {Σ_j}) = Π_{i=1}^{N} N(x_i; μ_{l_i}, Σ_{l_i})
                             = Π_{i=1}^{N} (2π)^{-d/2} |Σ_{l_i}|^{-1/2} exp( -(1/2)(x_i − μ_{l_i})^T Σ_{l_i}^{-1} (x_i − μ_{l_i}) )

  – Log-likelihood (see Appendix A):

      log p(x_1^N, {μ_j}, {Σ_j})
        = -(Nd/2) log(2π) − (1/2) Σ_{j=1}^{C} n_j ( (m_j − μ_j)^T Σ_j^{-1} (m_j − μ_j) + trace(Σ_j^{-1} S_j) + log|Σ_j| )

    where n_j is the number of samples in class j, and m_j and S_j are the sample mean and sample covariance of class j (the ML estimators)
Feature-based MLLT -3

• The idea of maximum likelihood estimation (MLE) is to choose the parameters {μ̂_j} and {Σ̂_j} so as to maximize

      log p(x_1^N, {μ_j}, {Σ_j})

  – ML estimators: {μ̂_j} and {Σ̂_j}
  – What's the difference between an "estimator" and an "estimate"? (An estimator is a rule, i.e. a function of the sample; an estimate is the value that rule takes on a particular sample.)
Feature-based MLLT -4

• Multiclass ML Modeling
  – The training data is modeled with Gaussians (μ̂_j, Σ̂_j)
  – The ML estimators (see Appendix B): μ̂_j = m_j, Σ̂_j = S_j
  – The log-ML value:

      log p*(x_1^N) = log p(x_1^N, {μ̂_j}, {Σ̂_j}) = g(N, d) − (1/2) Σ_{j=1}^{C} n_j log|S_j|

    where g(N, d) = -(Nd/2)(log(2π) + 1)

  – There is "no interaction" between the classes, and therefore unconstrained ML modeling is not "discriminating"
Feature-based MLLT -5

• Constrained ML – Diagonal Covariance
  – The ML estimators: μ̂_j = m_j, Σ̂_j = diag(S_j)
  – The log-ML value:

      log p*_diag(x_1^N) = g(N, d) − (1/2) Σ_{j=1}^{C} n_j log|diag(S_j)|

  – If one linearly transforms the data with Θ_j and models it using a diagonal Gaussian, the ML value is (see Appendix C)

      log p*_diag(x_1^N) = g(N, d) − (1/2) Σ_{j=1}^{C} n_j ( log|diag(Θ_j S_j Θ_j^T)| − log|Θ_j|^2 )

    where the log|Θ_j|^2 term comes from the Jacobian of the transformation. How should Θ be chosen for each class?
Feature-based MLLT -6

• Multiclass ML Modeling – Some Issues
  – If the sample size for each class is not large enough, then the ML parameter estimates may have large variance and hence be unreliable
  – The storage requirement for the model: O(Cd^2)
  – The computational requirement: O(Cd^2)
  – The parameters for each class are obtained independently. Why? The ML principle does not allow for discrimination between classes
Feature-based MLLT -7

• Multiclass ML Modeling – Some Issues (cont.)
  – Parameter sharing across classes reduces the number of parameters, the storage requirements, and the computational requirements
  – It is hard to justify that parameter sharing is more discriminating
  – We can appeal to Fisher's criterion of LDA and a result of Campbell to argue that sometimes constrained ML modeling is discriminating
  – But, what is discriminating?
Feature-based MLLT -8

• Multiclass ML Modeling – Some Issues (cont.)
  – We can globally transform the data with a unimodular matrix Θ (det(Θ) = 1) and model the transformed data with diagonal Gaussians (there is a loss in likelihood here, too)
  – Among all possible transformations Θ, we can choose the one that incurs the least loss in likelihood (in essence, we will find a linearly transformed (shared) feature space in which the diagonal Gaussian assumption is almost valid)
Feature-based MLLT -9

• Other constrained MLE schemes with sharing of parameters
  – Equal covariance
  – Clustering
• Covariance Diagonalization and Cluster Transformation
  – Classes are grouped into clusters
  – Each cluster C_j is modeled with a diagonal Gaussian in a transformed feature space
  – The ML estimators:

      μ̂_j = m_j,   Σ̂_j = Θ_{C_j}^{-1} diag(Θ_{C_j} S_j Θ_{C_j}^T) Θ_{C_j}^{-T}

  – The ML value:

      log p*_diag(x_1^N) = g(N, d) − (1/2) Σ_{j=1}^{C} n_j ( log|diag(Θ_{C_j} S_j Θ_{C_j}^T)| − log|Θ_{C_j}|^2 )
Feature-based MLLT -10

• One Cluster with Diagonal Covariance
  – When the number of clusters is one, there is a single global transformation Θ, and the classes are modeled as diagonal Gaussians in this feature space
  – The ML estimators:

      μ̂_j = m_j,   Σ̂_j = Θ^{-1} diag(Θ S_j Θ^T) Θ^{-T}

  – The log-ML value:

      log p*_diag(x_1^N) = g(N, d) − (1/2) Σ_{j=1}^{C} n_j log|diag(Θ S_j Θ^T)| + N log|Θ|

  – The optimal Θ can be obtained by optimization as follows:

      Θ̂ = argmin_Θ [ Σ_{j=1}^{C} (n_j/2) log|diag(Θ S_j Θ^T)| − N log|Θ| ]
Feature-based MLLT -11

• Optimization – the numerical approach
  – The objective function:

      F(Θ) = Σ_{j=1}^{C} (n_j/2) log|diag(Θ S_j Θ^T)| − N log|Θ|

  – Differentiating F with respect to Θ gives the derivative G:

      G(Θ) = Σ_{j=1}^{C} n_j diag(Θ S_j Θ^T)^{-1} Θ S_j − N Θ^{-T}

  – Directly optimizing the objective function is nontrivial: it requires numerical optimization techniques, and a full matrix must be stored for each class
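A hedged sketch of this numerical optimization: the objective F and gradient G above, minimized with plain gradient descent plus backtracking. This is just one possible optimizer, not necessarily the scheme used in the cited work, and the per-class statistics below are synthetic:

```python
import numpy as np

def f_mllt_objective(Theta, S_list, n_list):
    """F(Θ) = Σ_j (n_j/2)·log|diag(Θ S_j Θ^T)| − N·log|det Θ|."""
    N = sum(n_list)
    val = -N * np.log(abs(np.linalg.det(Theta)))
    for S, n in zip(S_list, n_list):
        val += 0.5 * n * np.sum(np.log(np.diag(Theta @ S @ Theta.T)))
    return val

def f_mllt_gradient(Theta, S_list, n_list):
    """G(Θ) = Σ_j n_j·diag(Θ S_j Θ^T)^{-1} Θ S_j − N·Θ^{-T}."""
    N = sum(n_list)
    G = -N * np.linalg.inv(Theta).T
    for S, n in zip(S_list, n_list):
        d = np.diag(Theta @ S @ Theta.T)
        G += n * (Theta @ S) / d[:, None]
    return G

# Hypothetical per-class sample covariances S_j and counts n_j.
rng = np.random.default_rng(3)
S_list = [(lambda A: A @ A.T + np.eye(4))(rng.normal(size=(4, 4))) for _ in range(3)]
n_list = [100, 80, 120]

# Gradient descent with backtracking line search (one of many possible choices).
Theta = np.eye(4)
f0 = f_mllt_objective(Theta, S_list, n_list)
f_cur = f0
for _ in range(50):
    G = f_mllt_gradient(Theta, S_list, n_list)
    step = 1e-3
    while step > 1e-12 and f_mllt_objective(Theta - step * G, S_list, n_list) >= f_cur:
        step *= 0.5
    cand = Theta - step * G
    if f_mllt_objective(cand, S_list, n_list) < f_cur:
        Theta = cand
        f_cur = f_mllt_objective(Theta, S_list, n_list)
# f_cur < f0: the likelihood loss of the diagonal assumption has been reduced
```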
Model-based MLLT -1

• In model-based MLLT, instead of having a distinct covariance matrix for every component in the recognizer, each covariance matrix consists of two elements:
  – A non-singular linear transformation matrix Θ, shared over a set of components
  – The diagonal elements λ_{jk} of the matrix Λ_j for each class j
• The precision matrices are constrained to be of the form

      P_j = Σ_j^{-1} = Θ^T Λ_j Θ = Σ_{k=1}^{d} λ_{jk} θ_k θ_k^T,   Λ_j = diag(λ_{j1}, ..., λ_{jd})

  (P_j: precision matrix; Λ_j: diagonal matrix; θ_k: the k-th row of Θ)

  Ref. (Olsen, 2004)
Model-based MLLT -2

• M-MLLT fits within the standard maximum-likelihood criterion used for training HMMs, with each state represented by a GMM:

      f(x|Θ) = Σ_{j=1}^{m} π_j N(x; μ_j, Σ_j)     (π_j: the j-th component weight)

• Log-likelihood function:

      L(Θ; X) = (1/N) Σ_{i=1}^{N} log f(x_i|Θ)
Model-based MLLT -3

• The parameters of the M-MLLT model for each HMM state,

      Θ = {π_1^m, μ_1^m, Λ_1^m, Θ},

  are estimated using a generalized expectation-maximization (EM) algorithm*
• The auxiliary "Q-function" should be introduced
  – The Q-function satisfies the inequality

      L(Θ̂; X) − L(Θ; X) ≥ Q(Θ̂, Θ) − Q(Θ, Θ)

    where Θ̂ = {π̂_1^m, μ̂_1^m, Λ̂_1^m, Θ̂} is the new parameter set and Θ is the old one

* A. P. Dempster et al., "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc. B, vol. 39, pp. 1-38, 1977.
Model-based MLLT -4

• The auxiliary Q-function is

      Q(Θ̂; Θ) = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{m} γ_ij L̂_ij(x_i)

  where the component log-likelihood of observation i under component j is

      L_ij(x) = log( π_j N(x; μ_j, Σ_j) )
              = log π_j + (1/2) ( log|P_j| − d log(2π) − trace( P_j (x − μ_j)(x − μ_j)^T ) )

  and the a posteriori probability of Gaussian component j given observation i is

      γ_ij = exp(L_ij(x_i)) / Σ_{k=1}^{m} exp(L_ik(x_i))
Model-based MLLT -5

• Gales' approach for deriving Θ
  1. Estimate the means and the component weights, which are independent of the other model parameters:

      π_j = (1/N) Σ_{i=1}^{N} γ_ij,   μ_j = Σ_{i=1}^{N} γ_ij x_i / Σ_{i=1}^{N} γ_ij

  2. Use the current estimate of the transform Θ and estimate the set of class-specific diagonal variances:

      λ_ji = θ_i S_j θ_i^T

    (θ_i: the i-th row vector of Θ; λ_ji: the i-th entry of the diagonal variance of component j)

  Ref. (Gales, 1999)
Model-based MLLT -6

• Gales' approach for deriving Θ (cont.)
  3. Estimate the transform Θ using the current set of diagonal covariances, row by row:

      θ̂_i = c_i G_i^{-1} sqrt( N / (c_i G_i^{-1} c_i^T) ),   G_i = Σ_j (N_j / σ̂_{ji}^2) S_j

    (c_i: the i-th row vector of the cofactors of the current Θ; N_j: the occupation count of component j; see Appendix D for cofactors)

  4. Go to step 2 until convergence, or until an appropriate stopping criterion is satisfied
Semi-Tied Covariance Matrices -1

• Introduction to Semi-Tied Covariance Matrices
  – A natural extension of the state-specific rotation scheme
  – The transform is estimated in a maximum-likelihood (ML) fashion given the current model parameters
  – The optimization is performed using a simple iterative scheme, which is guaranteed to increase the likelihood of the training data
  – An alternative approach to solving the optimization problem of DHLDA
Semi-Tied Covariance Matrices -2

• State-Specific Rotation
  – A full covariance matrix is calculated for each state s in the system, and is decomposed into its eigenvectors and eigenvalues:

      Σ_full^(s) = U^(s) Λ^(s) U^(s)T

  – All data from that state is then decorrelated using the calculated eigenvectors:

      ô(τ) = U^(s)T o(τ)

  – Multiple diagonal-covariance Gaussian components are then trained

  Ref. (Gales, 1999), (Kumar, 1998)
Semi-Tied Covariance Matrices -3

• Drawbacks of State-Specific Rotation
  – It does not fit within the standard ML estimation framework for training HMMs
  – The transforms are not related to the multiple-component models being used to model the data
Semi-Tied Covariance Matrices -4

• Semi-Tied Covariance Matrices
  – Instead of having a distinct covariance matrix for every component in the recognizer, each covariance matrix consists of two elements:

      Σ^(m) = H^(r) Σ_diag^(m) H^(r)T

    (Σ_diag^(m): the component-specific diagonal covariance element; H^(r): the semi-tied, class-dependent, non-diagonal matrix, which may be tied over a set of components)
Semi-Tied Covariance Matrices -5

• Semi-Tied Covariance Matrices (cont.)
  – It is very complex to optimize these parameters directly, so an expectation-maximization approach is adopted
  – Parameters for each component m:

      M = {π^(m), μ^(m), Σ_diag^(m), Θ^(r)},   Θ^(r) = H^(r)-1
Semi-Tied Covariance Matrices -6

• The auxiliary Q-function:

      Q(M, M̂) = −(1/2) Σ_{m,τ} γ_m(τ) ( log|Σ̂_diag^(m)| − log|Θ̂^(r)|^2
                 + (o(τ) − μ̂^(m))^T Θ̂^(r)T Σ̂_diag^(m)-1 Θ̂^(r) (o(τ) − μ̂^(m)) ) + const.

  where γ_m(τ) = p(q_m(τ) | M, O_T) is the posterior probability of component m at time τ
Semi-Tied Covariance Matrices -7

• If all the model parameters are to be simultaneously optimized, the Q-function may be rewritten as

      Q(M, M̂) = (1/2) ( β log|Θ̂^(r)|^2 − Σ_{m∈M(r)} ( Σ_τ γ_m(τ) ) log|diag(Θ̂^(r) W^(m) Θ̂^(r)T)| ) + const.

  where

      W^(m) = Σ_τ γ_m(τ) (o(τ) − μ̂^(m))(o(τ) − μ̂^(m))^T / Σ_τ γ_m(τ)
      μ̂^(m) = Σ_τ γ_m(τ) o(τ) / Σ_τ γ_m(τ)
      β = Σ_{m∈M(r)} Σ_τ γ_m(τ)
Semi-Tied Covariance Matrices -8

• The ML estimate of the diagonal element of the covariance matrix is given by

      Σ̂_diag^(m) = diag( Θ̂^(r) W^(m) Θ̂^(r)T )

• The reestimation formulae for the component weights and transition probabilities are identical to the standard HMM cases (Rabiner, 1989)
• Unfortunately, optimizing the new Q-function is nontrivial and more complicated, so an alternative approach is proposed next
Semi-Tied Covariance Matrices -9

• Gales' approach for optimizing the Q-function
  1. Estimate the means, which are independent of the other model parameters
  2. Using the current estimate of the semi-tied transform Θ̂^(r), estimate the set of component-specific diagonal variances:

      Σ̂_diag^(m) = diag( Θ̂^(r) W^(m) Θ̂^(r)T )

    This set of parameters will be denoted as {Σ̂_diag^(r)} = {Σ̂_diag^(m), m ∈ M(r)}
  3. Estimate the transform Θ̂^(r) using the current set {Σ̂_diag^(r)}
  4. Go to (2) until convergence, or until an appropriate stopping criterion is satisfied
Semi-Tied Covariance Matrices -10

• How to carry out step (3)?
  – Optimizing the semi-tied transform requires an iterative estimation scheme, even after fixing all the other model parameters
  – Selecting a particular row θ̂_i^(r) of Θ̂^(r) and rewriting the former Q-function using the current set {Σ̂_diag^(r)}:

      Q(M, M̂; {Σ̂_diag^(r)}) = (1/2) ( β log( c_i θ̂_i^(r)T )^2
          − Σ_{m∈M(r),τ} γ_m(τ) Σ_j ( θ̂_j^(r) (o(τ) − μ̂^(m)) )^2 / σ_diag,j^(m)2 ) + const.

    (c_i: the i-th row vector of the cofactors of Θ̂^(r), so that |Θ̂^(r)| = c_i θ̂_i^(r)T; σ_diag,j^(m)2: the j-th leading diagonal element of Σ̂_diag^(m))
Semi-Tied Covariance Matrices -11

• How to carry out step (3)? (cont.)
  – It can be shown that the ML estimate for the i-th row of the semi-tied transform, θ̂_i^(r), is given by

      θ̂_i^(r) = c_i G^(i)-1 sqrt( β / (c_i G^(i)-1 c_i^T) )

  where

      G^(i) = Σ_{m∈M(r)} (1 / σ_diag,i^(m)2) W^(m) Σ_τ γ_m(τ)
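A sketch of this row-wise update with synthetic statistics: the G^(i) matrices and β below are made up, whereas in practice they come from the accumulated W^(m) and occupation counts. Since each row update maximizes the Q-function over that row (with the other rows fixed), Q never decreases:

```python
import numpy as np

def semitied_row_updates(Theta, G_list, beta, n_iter=10):
    """Iterative row-wise ML update of the semi-tied transform.

    G_list[i] plays the role of G^(i); beta the total occupation count.
    Each row is set to  θ_i = c_i G^(i)^{-1} sqrt(β / (c_i G^(i)^{-1} c_i^T)),
    where c_i is the i-th row of the cofactor matrix of the current Θ.
    """
    d = Theta.shape[0]
    for _ in range(n_iter):
        for i in range(d):
            # Cofactor matrix: C = det(Θ) · Θ^{-T}; take its i-th row.
            c = np.linalg.det(Theta) * np.linalg.inv(Theta).T[i]
            Ginv = np.linalg.inv(G_list[i])
            scale = np.sqrt(beta / (c @ Ginv @ c))
            Theta[i] = scale * (c @ Ginv)
    return Theta

def q_value(Theta, G_list, beta):
    """The Θ-dependent part of Q: β·log|det Θ| − ½ Σ_i θ_i G^(i) θ_i^T."""
    q = beta * np.log(abs(np.linalg.det(Theta)))
    for i, G in enumerate(G_list):
        q -= 0.5 * Theta[i] @ G @ Theta[i]
    return q

# Hypothetical positive-definite G^(i) statistics.
rng = np.random.default_rng(4)
G_list = [(lambda A: A @ A.T + np.eye(3))(rng.normal(size=(3, 3))) for _ in range(3)]
beta = 50.0

Theta0 = np.eye(3)
q0 = q_value(Theta0, G_list, beta)
Theta = semitied_row_updates(Theta0.copy(), G_list, beta)
q1 = q_value(Theta, G_list, beta)
# q1 >= q0: each row update can only increase the Q-function
```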
Semi-Tied Covariance Matrices -12

• It can be shown that

      Q(M, M̂) ≥ Q(M, M̂; {Σ̂_diag^(r)})

  with equality when the diagonal elements of the covariance matrix are given by

      Σ̂_diag^(m) = diag( Θ̂^(r) W^(m) Θ̂^(r)T )

• During recognition, the log-likelihood is based on

      log( L(o(τ); μ^(m), Σ^(m), Θ^(r)) ) = log( N(Θ^(r) o(τ); Θ^(r) μ^(m), Σ_diag^(m)) ) + (1/2) log|Θ^(r)|^2
Extended MLLT -1

• The extended MLLT (EMLLT) is very similar to M-MLLT; the only difference is the precision matrix modeling:

      P_j = Σ_j^{-1} = Θ^T Λ_j Θ = Σ_{k=1}^{D} λ_{jk} θ_k θ_k^T     (the θ_k θ_k^T form a basis)

  – Note that in EMLLT, D ≥ d, while in M-MLLT, D = d
  – The λ_{jk} are not required to be positive, but they have to be chosen such that P_j is positive definite
  – The authors provided two algorithms for iterative update of the parameters

  Ref. (Olsen, 2002), (Olsen, 2004)
Extended MLLT -2

• The highlight of EMLLT:
  – More flexible: the number of basis elements {θ_k θ_k^T} can be gradually varied from d (MLLT) to d(d+1)/2 (full covariance) by controlling the value of D
Common Principal Components -1

• Why Common Principal Components (CPC)?
  – We often deal with the situation of the same variables being measured on objects from different groups, where the covariance structure may vary from group to group
  – But sometimes the covariance matrices of the different groups look somewhat similar, and it seems reasonable to assume that they have a common basic structure
• Goal of CPC
  – To find a rotation that diagonalizes the covariance matrices simultaneously

  Ref. (Flury, 1984), (Flury, 1986)
Common Principal Components -2

• Hypothesis of CPCs:

      H_c: B^T Σ_i B = Λ_i (diagonal),   i = 1, ..., k

  – The common principal components (CPCs):

      U_i = B^T x_i

  – Note that no canonical ordering of the columns of B need be given, since the rank order of the diagonal elements of the Λ_i is not necessarily the same for all groups
Common Principal Components -3

• Assume the n_i S_i are independently distributed as W_p(n_i, Σ_i). The common likelihood function is

      L(Σ_1, ..., Σ_k) = C Π_{i=1}^{k} exp( trace( -(n_i/2) Σ_i^{-1} S_i ) ) |Σ_i|^{-n_i/2}

• Instead of maximizing the likelihood function, minimize

      g(Σ_1, ..., Σ_k) = -2 log L(Σ_1, ..., Σ_k) + 2 log C
                       = Σ_{i=1}^{k} n_i ( log|Σ_i| + trace(Σ_i^{-1} S_i) )
Common Principal Components -4

• Assume H_c holds for some orthogonal matrix B = [β_1, ..., β_p], and Λ_i = diag(λ_i1, ..., λ_ip). Then

      log|Σ_i| = Σ_{j=1}^{p} log λ_ij,   i = 1, ..., k
      trace(Σ_i^{-1} S_i) = trace(B Λ_i^{-1} B^T S_i) = Σ_{j=1}^{p} β_j^T S_i β_j / λ_ij

• Therefore

      g(Σ_1, ..., Σ_k) = g(β_1, ..., β_p, λ_11, ..., λ_kp)
                       = Σ_{i=1}^{k} n_i Σ_{j=1}^{p} ( log λ_ij + β_j^T S_i β_j / λ_ij )
Common Principal Components -5

• The function g is to be minimized under the restrictions

      β_h^T β_j = 1 if h = j,   β_h^T β_j = 0 if h ≠ j

• Thus we wish to minimize the function

      G(Σ_1, ..., Σ_k) = g(Σ_1, ..., Σ_k) − Σ_{h=1}^{p} γ_h ( β_h^T β_h − 1 ) − 2 Σ_{h<j} γ_hj β_h^T β_j
Common Principal Components -6

• Minimization with respect to the λ_ij:

      ∂g/∂λ_ij = ∂/∂λ_ij [ Σ_{i=1}^{k} n_i Σ_{j=1}^{p} ( log λ_ij + β_j^T S_i β_j / λ_ij ) ] = 0
      ⇒ λ_ij = β_j^T S_i β_j,   i = 1, ..., k; j = 1, ..., p

  – Key point (keep it in mind): substituting back gives

      trace(Σ_i^{-1} S_i) = p,   i = 1, ..., k
Common Principal Components -7

• Minimization with respect to the β_j (cont.):

      ∂G/∂β_j = ∂/∂β_j [ Σ_{i=1}^{k} n_i Σ_{j=1}^{p} ( log λ_ij + β_j^T S_i β_j / λ_ij )
                − Σ_{h=1}^{p} γ_h ( β_h^T β_h − 1 ) − 2 Σ_{h<j} γ_hj β_h^T β_j ] = 0

      ⇒ Σ_{i=1}^{k} n_i S_i β_j / λ_ij − γ_j β_j − Σ_{h≠j} γ_hj β_h = 0,   j = 1, ..., p

  – Multiplying on the left by β_j^T, and using λ_ij = β_j^T S_i β_j, gives

      γ_j = Σ_{i=1}^{k} n_i,   j = 1, ..., p
Common Principal Components -8

• Minimization (cont.): Thus

      Σ_{i=1}^{k} n_i S_i β_j / λ_ij − ( Σ_{i=1}^{k} n_i ) β_j − Σ_{h≠j} γ_hj β_h = 0,   j = 1, ..., p

  – Multiplying on the left by β_l^T (l ≠ j) implies

      Σ_{i=1}^{k} n_i β_l^T S_i β_j / λ_ij = γ_lj,   j = 1, ..., p, l ≠ j

  – Note that β_l^T S_i β_j = β_j^T S_i β_l and γ_jl = γ_lj, so the equation for β_l likewise gives

      Σ_{i=1}^{k} n_i β_l^T S_i β_j / λ_il = γ_jl,   j = 1, ..., p, l ≠ j
Common Principal Components -9

• Minimization (cont.): Subtracting the two equations gives

      Σ_{i=1}^{k} n_i β_l^T S_i β_j / λ_ij − Σ_{i=1}^{k} n_i β_l^T S_i β_j / λ_il = 0

      ⇒ β_l^T ( Σ_{i=1}^{k} n_i (λ_il − λ_ij) / (λ_il λ_ij) S_i ) β_j = 0,   l ≠ j; l, j = 1, ..., p

    (this is the optimization objective)

  – These p(p−1)/2 equations have to be solved under the orthonormality conditions B^T B = I_p and λ_ij = β_j^T S_i β_j
Common Principal Components -10

• Solving procedure of CPC – the FG Algorithm
  – F-Algorithm

  [Algorithm listing for the F-algorithm (Flury & Gautschi, 1986)]
Common Principal Components -11

  – G-Algorithm

  [Algorithm listing for the G-algorithm (Flury & Gautschi, 1986)]
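Since the F- and G-algorithm listings did not survive in this transcript, here is a hedged substitute in the same spirit: Jacobi-style sweeps over column pairs, where each plane-rotation angle is found by a brute-force 1-D search rather than by the FG closed-form update. The two group covariances below are constructed to share their eigenvectors exactly, so the CPC hypothesis holds:

```python
import numpy as np

def cpc_objective(B, S_list, n_list):
    """Σ_i n_i·log|diag(B^T S_i B)|: the quantity the common rotation minimizes."""
    return sum(n * np.sum(np.log(np.diag(B.T @ S @ B)))
               for S, n in zip(S_list, n_list))

def rotation(p, j, l, t):
    """Plane rotation by angle t in the (j, l) coordinate plane."""
    R = np.eye(p)
    R[j, j] = R[l, l] = np.cos(t)
    R[j, l], R[l, j] = -np.sin(t), np.sin(t)
    return R

def cpc_jacobi(S_list, n_list, sweeps=10, n_angles=721):
    """Greedy pairwise sweeps; each pair's angle is picked by grid search."""
    p = S_list[0].shape[0]
    B = np.eye(p)
    angles = np.linspace(-np.pi / 4, np.pi / 4, n_angles)  # includes 0
    for _ in range(sweeps):
        for j in range(p - 1):
            for l in range(j + 1, p):
                vals = [cpc_objective(B @ rotation(p, j, l, t), S_list, n_list)
                        for t in angles]
                B = B @ rotation(p, j, l, angles[int(np.argmin(vals))])
    return B

# Two groups that share their eigenvectors exactly.
Q, _ = np.linalg.qr(np.random.default_rng(5).normal(size=(3, 3)))
S1 = Q @ np.diag([5.0, 2.0, 1.0]) @ Q.T
S2 = Q @ np.diag([1.0, 4.0, 2.0]) @ Q.T
B = cpc_jacobi([S1, S2], [30, 40])
# B^T S1 B and B^T S2 B are now close to diagonal
```

Because the angle grid contains 0, each pair step can never increase the objective, so the sweep is monotone, mirroring the monotonicity guarantee of the actual FG algorithm.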
Common Principal Components -12

• Likelihood Ratio Test
  – The sample common principal components:

      U_i = B̂^T X_i,   i = 1, ..., k

  – For the i-th group, the transformed covariance matrix is

      F_i = B̂^T S_i B̂,   i = 1, ..., k

  – Since Λ̂_i = diag(F_i), i = 1, ..., k, the statistic can be written as a function of the F_i alone:

      χ^2 = Σ_{i=1}^{k} n_i log( |diag(F_i)| / |F_i| ) = Σ_{i=1}^{k} n_i log( Π_{j=1}^{p} f_jj^(i) / Π_{j=1}^{p} l_ij )

    (l_ij: the eigenvalues of F_i)
Common Principal Components -13

• Likelihood Ratio Test (cont.)
  – The likelihood ratio criterion is a measure of the simultaneous diagonalizability of k positive definite symmetric matrices
  – The CPCs can be viewed as obtained by a simultaneous transformation, yielding variables that are as uncorrelated as possible
  – It can also be seen from the viewpoint of Hadamard's inequality:

      |F_i| ≤ |diag(F_i)|
Common Principal Components -14

• Actually, CPC can also be viewed through another measure of "deviation from diagonality":

      φ(F_i) = |diag(F_i)| / |F_i| ≥ 1

  – The CPC criterion can then be written as

      Φ(F_1, ..., F_k; n_1, ..., n_k) = Π_{i=1}^{k} ( φ(F_i) )^{n_i}

  – Let F_i = B^T A_i B for a given orthogonal matrix B; then

      Φ_0(A_1, ..., A_k; n_1, ..., n_k) = min_B Φ(B^T A_1 B, ..., B^T A_k B; n_1, ..., n_k)
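The two measures are straightforward to compute; a minimal sketch with 2x2 examples (by Hadamard's inequality, φ ≥ 1, with equality exactly when F is diagonal):

```python
import numpy as np

def phi(F):
    """Deviation from diagonality: φ(F) = |diag(F)| / |F| >= 1."""
    return np.prod(np.diag(F)) / np.linalg.det(F)

def Phi(F_list, n_list):
    """Φ(F_1..F_k; n_1..n_k) = Π_i φ(F_i)^{n_i}."""
    return np.prod([phi(F) ** n for F, n in zip(F_list, n_list)])

# A diagonal matrix attains the lower bound φ = 1.
D = np.diag([2.0, 3.0])
# A correlated matrix has φ > 1.
F = np.array([[2.0, 1.0],
              [1.0, 3.0]])
# phi(D) = 6/6 = 1;  phi(F) = 6/5 = 1.2
```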
Comparison Between CPC and F-MLLT

• CPC tries to minimize

      Σ_{i=1}^{C} n_i ( log|Σ_i| + trace(Σ_i^{-1} S_i) )

  with Σ_i = B Λ_i B^T, i = 1, ..., C, in the original space: the covariance estimates are KNOWN (fixed to the common-structure form)
• F-MLLT tries to maximize

      N log|A| − (1/2) Σ_{j=1}^{C} n_j log|diag(A S_j A^T)|

  where the covariance estimates are UNKNOWN
Appendix A -1

• Show that

      Π_{i=1}^{N} (2π)^{-d/2} |Σ_{l_i}|^{-1/2} exp( -(1/2)(x_i − μ_{l_i})^T Σ_{l_i}^{-1} (x_i − μ_{l_i}) )
        = (2π)^{-Nd/2} exp( -(1/2) Σ_j n_j ( (m_j − μ_j)^T Σ_j^{-1} (m_j − μ_j) + trace(Σ_j^{-1} S_j) + log|Σ_j| ) )

Appendix A -2

• For each class j, with m_j the sample mean and S_j = (1/n_j) Σ_{i: l_i=j} (x_i − m_j)(x_i − m_j)^T the sample covariance:

      Σ_{i: l_i=j} (x_i − μ_j)^T Σ_j^{-1} (x_i − μ_j)
        = Σ_{i: l_i=j} ( (x_i − m_j) + (m_j − μ_j) )^T Σ_j^{-1} ( (x_i − m_j) + (m_j − μ_j) )
        = Σ_{i: l_i=j} (x_i − m_j)^T Σ_j^{-1} (x_i − m_j)
          + 2 (m_j − μ_j)^T Σ_j^{-1} Σ_{i: l_i=j} (x_i − m_j)     ← this sum is 0
          + n_j (m_j − μ_j)^T Σ_j^{-1} (m_j − μ_j)
        = trace( Σ_j^{-1} Σ_{i: l_i=j} (x_i − m_j)(x_i − m_j)^T ) + n_j (m_j − μ_j)^T Σ_j^{-1} (m_j − μ_j)
        = n_j trace(Σ_j^{-1} S_j) + n_j (m_j − μ_j)^T Σ_j^{-1} (m_j − μ_j)

  Summing over the classes and collecting the (2π)^{-d/2} |Σ_j|^{-1/2} factors (which contribute the log|Σ_j| terms) gives the stated identity.
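The decomposition derived above is easy to verify numerically on synthetic data. Note that S here is the divide-by-n sample covariance, matching the derivation:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 50, 3
X = rng.normal(size=(n, d))      # synthetic samples of one class
mu = rng.normal(size=d)          # an arbitrary model mean
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)      # an arbitrary p.d. model covariance
Sinv = np.linalg.inv(Sigma)

m = X.mean(axis=0)               # sample mean
S = (X - m).T @ (X - m) / n      # sample covariance (divide by n)

lhs = sum((x - mu) @ Sinv @ (x - mu) for x in X)
rhs = n * np.trace(Sinv @ S) + n * (m - mu) @ Sinv @ (m - mu)
# lhs == rhs: the cross terms vanish because Σ_i (x_i − m) = 0
```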
Appendix B

• ML estimators for log p(x_1^N, {μ_j}, {Σ_j}):
  – For the means:

      ∂ log p / ∂μ_j = n_j Σ_j^{-1} (m_j − μ_j) = 0   ⇒   μ̂_j = m_j

  – For the covariances, differentiate with respect to Σ_j^{-1}, using ∂ trace(Σ_j^{-1} S_j) / ∂Σ_j^{-1} = S_j^T and ∂ log|Σ_j^{-1}| / ∂Σ_j^{-1} = Σ_j^T:

      ∂ log p / ∂Σ_j^{-1} = (n_j/2) ( Σ_j^T − S_j^T ) = 0   ⇒   Σ̂_j = S_j

    (the (m_j − μ_j)(m_j − μ_j)^T term vanishes at μ_j = μ̂_j = m_j)
Appendix C -1

• Change of Variable Theorem
  – Consider a one-to-one mapping g: R^n → R^n, Y = g(X)
  – Equal probability: the probability of falling in a region in the X space should be the same as the probability of falling in the corresponding region in the Y space
  – Suppose the region dx_1 dx_2 ... dx_n maps to the region dA in the Y space. Equating probabilities, we have

      f_Y(y_1, ..., y_n) dA = f_X(x_1, ..., x_n) dx_1 ... dx_n

  – The region dA is a hyperparallelepiped described by the vectors

      ( (∂y_1/∂x_1) dx_1, ..., (∂y_n/∂x_1) dx_1 ), ..., ( (∂y_1/∂x_n) dx_n, ..., (∂y_n/∂x_n) dx_n )

Appendix C -2

• Change of Variable Theorem (cont.)
  – The volume of the hyperparallelepiped dA can be calculated as a determinant:

      dA = |det( ∂y_i/∂x_j )| dx_1 ... dx_n = |J| dx_1 ... dx_n

    (J: the Jacobian of the function g)
  – So,

      f_Y(y_1, ..., y_n) |J| dx_1 ... dx_n = f_X(x_1, ..., x_n) dx_1 ... dx_n
      ⇒ f_Y(y_1, ..., y_n) = |J|^{-1} f_X(x_1, ..., x_n)
Appendix D -1

• If A is a square matrix, then the minor of entry a_ij, denoted by M_ij, is defined to be the determinant of the submatrix that remains after the i-th row and the j-th column are deleted from A. The number (−1)^{i+j} M_ij is denoted by c_ij and is called the cofactor of a_ij.
• Given the matrix

      B = [ b_11 b_12 b_13 ;
            b_21 b_22 b_23 ;
            b_31 b_32 b_33 ]

  for example,

      c_23 = (−1)^{2+3} M_23,   M_23 = det[ b_11 b_12 ; b_31 b_32 ] = b_11 b_32 − b_12 b_31
      c_33 = (−1)^{3+3} M_33

Appendix D -2

• Given an n-by-n matrix A = (a_ij), the determinant of A can be written as the sum of the cofactors multiplied by the entries that generated them:

      det(A) = Σ_{i=1}^{n} a_ij c_ij     (cofactor expansion along the j-th column)
      det(A) = Σ_{j=1}^{n} a_ij c_ij     (cofactor expansion along the i-th row)
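A direct transcription of the cofactor expansion into numpy (the 3x3 matrix is an arbitrary example):

```python
import numpy as np

def cofactor(A, i, j):
    """c_ij = (-1)^{i+j} · M_ij, where M_ij is the minor with row i and
    column j deleted from A."""
    M = np.delete(np.delete(A, i, axis=0), j, axis=1)
    return (-1) ** (i + j) * np.linalg.det(M)

def det_by_row_expansion(A, i=0):
    """det(A) = Σ_j a_ij · c_ij  (cofactor expansion along row i)."""
    return sum(A[i, j] * cofactor(A, i, j) for j in range(A.shape[1]))

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
# Expansion along any row gives the same value, det(A) = 8.
```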