Feature Decorrelation on Speech Recognition


Transcript of Feature Decorrelation on Speech Recognition

Page 1: Feature Decorrelation on Speech Recognition

Feature Decorrelation on Speech Recognition

Hung-Shin Lee

Institute of Information Science, Academia Sinica
Dept. of Electrical Engineering, National Taiwan University

2009-10-09 @ IIS, Academia Sinica

Page 2: Feature Decorrelation on Speech Recognition

References -1

1) J. Psutka and L. Muller, "Comparison of various feature decorrelation techniques in automatic speech recognition," in Proc. CITSA 2006.
2) Dr. Berlin Chen's lecture slides: http://berlin.csie.ntnu.edu.tw
3) Batlle et al., "Feature decorrelation methods in speech recognition - a comparative study," in Proc. ICSLP 1998.
4) K. Demuynck et al., "Improved feature decorrelation for HMM-based speech recognition," in Proc. ICSLP 1998.
5) W. Krzanowski, Principles of Multivariate Analysis - A User's Perspective, Oxford Press, 1988.
6) R. Gopinath, "Maximum likelihood modeling with Gaussian distributions for classification," in Proc. ICASSP 1998.
7) M. Gales, "Semi-tied covariance matrices for hidden Markov models," IEEE Trans. on Speech and Audio Processing, vol. 7, no. 3, pp. 272-281, 1999.
8) P. Olsen and R. Gopinath, "Modeling inverse covariance matrices by basis expansion," in Proc. ICASSP 2002.

Page 3: Feature Decorrelation on Speech Recognition

References -2

9) P. Olsen and R. Gopinath, "Modeling inverse covariance matrices by basis expansion," IEEE Trans. on Speech and Audio Processing, vol. 12, no. 1, pp. 37-46, 2004.
10) N. Kumar and R. Gopinath, "Multiple linear transform," in Proc. ICASSP 2001.
11) N. Kumar and A. Andreou, "Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition," Speech Communication, vol. 26, no. 4, pp. 283-297, 1998.
12) A. Ljolje, "The importance of cepstral parameter correlations in speech recognition," Computer Speech and Language, vol. 8, pp. 223-232, 1994.
13) B. Flury, "Common principal components in k groups," Journal of the American Statistical Association, vol. 79, no. 388, 1984.
14) B. Flury and W. Gautschi, "An algorithm for simultaneous orthogonal transformation of several positive definite symmetric matrices to nearly diagonal form," SIAM Journal on Scientific and Statistical Computing, vol. 7, no. 1, 1986.

Page 4: Feature Decorrelation on Speech Recognition

Outline

• Introduction to Feature Decorrelation (FD)
  – FD on Speech Recognition
• Feature-based Decorrelation (DCT, PCA, LDA, F-MLLT)
• Model-based Decorrelation (M-MLLT, EMLLT, Semi-tied covariance matrix, MLT)
• Common Principal Component (CPC)

Page 5: Feature Decorrelation on Speech Recognition

Outline

• Introduction to Feature Decorrelation (FD)
  – FD on Speech Recognition
• Feature-based Decorrelation (DCT, PCA, LDA, F-MLLT)
• Model-based Decorrelation (M-MLLT, EMLLT, Semi-tied covariance matrix, MLT)
• Common Principal Component (CPC)

Page 6: Feature Decorrelation on Speech Recognition

Introduction to Feature Decorrelation -1

• Definition of Covariance Matrix:
  – Random vector: $\mathbf{X} = [x_1, \ldots, x_n]^T$, where each $x_i$ is a random variable
  – Covariance matrix and mean vector:

$$\boldsymbol{\Sigma} \equiv E[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T], \qquad \boldsymbol{\mu} \equiv E[\mathbf{X}]$$

where $E[\cdot]$ denotes the expected value and $\mathbf{X}$ the random vector.
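Below is a minimal sketch (with hypothetical data shapes) of estimating these two quantities from a set of feature frames with NumPy:

```python
# Sample mean vector and covariance matrix of feature vectors.
import numpy as np

X = np.random.randn(1000, 13)           # hypothetical data: 1000 frames, 13-dim features

mu = X.mean(axis=0)                      # mean vector  mu = E[X]
Sigma = (X - mu).T @ (X - mu) / len(X)   # covariance  Sigma = E[(X - mu)(X - mu)^T]

# np.cov gives the same result (here with the 1/N normalization)
assert np.allclose(Sigma, np.cov(X, rowvar=False, bias=True))
```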


Page 7: Feature Decorrelation on Speech Recognition

Introduction to Feature Decorrelation -2

• Feature Decorrelation (FD)
  – To find transformations $\Theta$ that make all variables or parameters (nearly) uncorrelated:

$$\mathrm{cov}(\tilde{X}_i, \tilde{X}_j) = 0, \quad \forall \tilde{X}_i, \tilde{X}_j, \ i \neq j$$

where $\tilde{X}_i$ denotes a transformed random variable.

  – Or, equivalently, make the covariance matrix diagonal (not necessarily the identity):

$$\Theta^T \boldsymbol{\Sigma} \Theta = \mathbf{D}$$

where $\boldsymbol{\Sigma}$ is the covariance matrix and $\mathbf{D}$ a diagonal matrix. The covariance matrix can be global or depend on each class.

Page 8: Feature Decorrelation on Speech Recognition

Introduction to Feature Decorrelation -3

• Why Feature Decorrelation (FD)?
  – In many speech recognition systems, the observation density function for each HMM state is modeled as a mixture of diagonal-covariance Gaussians
  – For the sake of computational simplicity, the off-diagonal elements of the covariance matrices of the Gaussians are assumed to be close to zero

$$f(\mathbf{x}) = \sum_i \pi_i N(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i), \qquad N(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}_i|^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) \right)$$

where $f(\mathbf{x})$ is the observation density function and each $\boldsymbol{\Sigma}_i$ is a diagonal matrix.
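As a minimal sketch (assumed shapes, not tied to any particular toolkit), the diagonal assumption reduces each component evaluation to $O(d)$ work:

```python
# Log-density of a mixture of diagonal-covariance Gaussians.
import numpy as np

def diag_gmm_logpdf(x, weights, means, variances):
    """x: (d,); weights: (m,); means, variances: (m, d), diagonal covariances."""
    d = x.shape[0]
    # Per-component log N(x; mu_i, diag(var_i)): only O(d) per component,
    # which is the computational advantage of the diagonal assumption.
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_quad = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    return np.logaddexp.reduce(np.log(weights) + log_norm + log_quad)
```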

Page 9: Feature Decorrelation on Speech Recognition

FD on Speech Recognition -1

• Approaches for FD can be divided into two categories: feature-space and model-space
• Feature-space Schemes
  – It is hard to find a single transform which decorrelates all elements of the feature vector for all states
• Model-space Schemes
  – A different transform is selected depending on which component the observation was hypothesized to be generated from
  – In the limit, a transform may be used for each component, which is equivalent to a full covariance matrix system

Page 10: Feature Decorrelation on Speech Recognition

FD on Speech Recognition -2

• Feature-space Schemes on LVCSR (DCT, PCA, LDA, MLLT, ...)

[Block diagram: Speech Signal → Front-End Preprocessing → Feature-space Decorrelation → Speech Decoding → Textual Results; AM Training builds the AM from the Training Data, and the AM, LM, and Lexicon drive decoding of the Test Data]

Page 11: Feature Decorrelation on Speech Recognition

FD on Speech Recognition -3

• Model-space Schemes on LVCSR (MLLT, EMLLT, MLT, Semi-Tied, ...)

[Block diagram: Speech Signal → Front-End Preprocessing → Speech Decoding → Textual Results; Model-space Decorrelation is applied to the AM within AM Training, and the AM, LM, and Lexicon drive decoding of the Test Data]

Page 12: Feature Decorrelation on Speech Recognition

FD on Speech Recognition -4

                        | Without Label Information          | With Label Information
Feature-Space Schemes   | Discrete Cosine Transform (DCT),   | Linear Discriminant Analysis (LDA),
                        | Principal Component Analysis (PCA) | Common Principal Components (CPC)
Model-Space Schemes     |                                    | Maximum Likelihood Linear Transform (MLLT),
                        |                                    | Extended Maximum Likelihood Linear Transform (EMLLT),
                        |                                    | Semi-Tied Covariance Matrices,
                        |                                    | Multiple Linear Transforms (MLT)

Page 13: Feature Decorrelation on Speech Recognition

Outline

• Introduction to Feature Decorrelation (FD)
  – FD on Speech Recognition
• Feature-based Decorrelation (DCT, PCA, LDA, F-MLLT)
• Model-based Decorrelation (M-MLLT, EMLLT, Semi-tied covariance matrix, MLT)
• Common Principal Component (CPC)

Page 14: Feature Decorrelation on Speech Recognition

Discrete Cosine Transform -1

• Discrete Cosine Transform (DCT)
  – Since the log-power spectrum is real and symmetric, the inverse DFT reduces to a Discrete Cosine Transform (DCT)
  – It is applied to the log-energies of the output filters (Mel-scaled filterbank) during the MFCC parameterization:

$$c_j = \sum_{i=1}^{n} x_i \cos\!\left( \frac{\pi j (i - 0.5)}{n} \right), \quad \text{for } j = 0, 1, \ldots, m < n$$

where $x_i$ is the $i$-th coordinate of the input vector $\mathbf{x}$ and $c_j$ the $j$-th coordinate of the output vector $\mathbf{c}$.

  – Achieves only partial decorrelation
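A minimal sketch of the transform above (hypothetical filterbank values; the cosine matrix matches SciPy's unnormalized DCT-II up to a factor of 2):

```python
# DCT of log Mel-filterbank energies, as in MFCC extraction.
import numpy as np
from scipy.fftpack import dct

log_fbank = np.random.rand(18)                 # hypothetical 18 log filterbank energies
n = log_fbank.shape[0]
i = np.arange(1, n + 1)
j = np.arange(n)
C = np.cos(np.pi * np.outer(j, i - 0.5) / n)   # C[j, i] = cos(pi * j * (i - 0.5) / n)
cepstra = (C @ log_fbank)[:13]                 # keep the first 13 cepstra

assert np.allclose(cepstra, dct(log_fbank, type=2)[:13] / 2)
```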


Page 15: Feature Decorrelation on Speech Recognition

Discrete Cosine Transform -2

• (Total) covariance matrix calculated using the 25-hour training data from MATBN

[Figure: surface plots of the covariance matrix for the 18 Mel-scaled filterbank log-energies and for the 13 Mel-cepstra (using DCT)]

Page 16: Feature Decorrelation on Speech Recognition

Principal Component Analysis -1

• Principal Component Analysis (PCA)
  – Based on the calculation of the major directions of variation of a set of data points in a high-dimensional space
  – Extracts the directions of greatest variance, on the assumption that the less the data vary along a direction, the less feature information that direction carries
  – Principal components: the $p$ largest eigenvectors $\mathbf{V} = [\mathbf{v}_1, \ldots, \mathbf{v}_p]$ of the total covariance matrix $\boldsymbol{\Sigma}$ (see the sketch below):

$$\boldsymbol{\Sigma} \mathbf{v}_i = \lambda_i \mathbf{v}_i, \qquad \mathbf{V}^T \boldsymbol{\Sigma} \mathbf{V} = \mathbf{D}$$

where $\mathbf{D}$ is a diagonal matrix.
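A minimal sketch (hypothetical data) of PCA as an eigendecomposition of the total covariance matrix; the projected features are uncorrelated by construction:

```python
import numpy as np

X = np.random.randn(1000, 18)                  # hypothetical feature vectors
Sigma = np.cov(X, rowvar=False)

eigvals, V = np.linalg.eigh(Sigma)             # ascending eigenvalues
V = V[:, np.argsort(eigvals)[::-1]][:, :13]    # keep the 13 largest eigenvectors

Y = (X - X.mean(axis=0)) @ V                   # transformed (decorrelated) features
C = np.cov(Y, rowvar=False)                    # V^T Sigma V: (near-)diagonal
assert np.abs(C - np.diag(np.diag(C))).max() < 1e-8
```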

Page 17: Feature Decorrelation on Speech Recognition

Principal Component Analysis -2

• (Total) covariance matrix calculated using the 25-hour training data from MATBN

[Figure: surface plots of the covariance matrix for the 162-dim spliced Mel-filterbank features and for the 39-dim PCA-transformed subspace]

Page 18: Feature Decorrelation on Speech Recognition

Linear Discriminant Analysis -1

• Linear Discriminant Analysis (LDA)
  – Seeks a linear transformation matrix $\Theta$ that satisfies

$$\Theta = \arg\max_{\Theta} \ \mathrm{trace}\!\left( (\Theta^T \mathbf{S}_W \Theta)^{-1} (\Theta^T \mathbf{S}_B \Theta) \right)$$

  – $\Theta$ is formed by the largest eigenvectors of $\mathbf{S}_W^{-1} \mathbf{S}_B$ (see the sketch below)
  – The LDA subspace is not orthogonal (but the PCA subspace is): $\Theta^T \Theta \neq \mathbf{I}$
  – The LDA subspace makes the transformed variables statistically uncorrelated
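A minimal sketch (hypothetical data and labels) of LDA as the generalized symmetric eigenproblem $\mathbf{S}_B \mathbf{v} = \lambda \mathbf{S}_W \mathbf{v}$, solved with SciPy:

```python
import numpy as np
from scipy.linalg import eigh

def lda_transform(X, labels, p):
    """Return the p-column LDA transform Theta from data X and class labels."""
    d = X.shape[1]
    mu = X.mean(axis=0)
    S_W = np.zeros((d, d))           # within-class scatter
    S_B = np.zeros((d, d))           # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        S_W += (Xc - mc).T @ (Xc - mc)
        S_B += len(Xc) * np.outer(mc - mu, mc - mu)
    eigvals, Theta = eigh(S_B, S_W)  # solves S_B v = lambda S_W v
    return Theta[:, np.argsort(eigvals)[::-1][:p]]
```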


Page 19: Feature Decorrelation on Speech Recognition

Linear Discriminant Analysis -2

• From any two distinct eigenvalue/eigenvector pairs $(\lambda_i, \boldsymbol{\theta}_i)$ and $(\lambda_j, \boldsymbol{\theta}_j)$:

$$\mathbf{S}_B \boldsymbol{\theta}_i = \lambda_i \mathbf{S}_W \boldsymbol{\theta}_i, \qquad \mathbf{S}_B \boldsymbol{\theta}_j = \lambda_j \mathbf{S}_W \boldsymbol{\theta}_j$$

  – Pre-multiplying by $\boldsymbol{\theta}_j^T$ and $\boldsymbol{\theta}_i^T$, respectively:

$$\boldsymbol{\theta}_j^T \mathbf{S}_B \boldsymbol{\theta}_i = \lambda_i \, \boldsymbol{\theta}_j^T \mathbf{S}_W \boldsymbol{\theta}_i, \qquad \boldsymbol{\theta}_i^T \mathbf{S}_B \boldsymbol{\theta}_j = \lambda_j \, \boldsymbol{\theta}_i^T \mathbf{S}_W \boldsymbol{\theta}_j$$

  – Because $\mathbf{S}_B$ and $\mathbf{S}_W$ are symmetric, $\boldsymbol{\theta}_j^T \mathbf{S}_B \boldsymbol{\theta}_i = \boldsymbol{\theta}_i^T \mathbf{S}_B \boldsymbol{\theta}_j$ and $\boldsymbol{\theta}_j^T \mathbf{S}_W \boldsymbol{\theta}_i = \boldsymbol{\theta}_i^T \mathbf{S}_W \boldsymbol{\theta}_j$; subtracting the two equations gives $(\lambda_i - \lambda_j)\, \boldsymbol{\theta}_j^T \mathbf{S}_W \boldsymbol{\theta}_i = 0$, and since $\lambda_i \neq \lambda_j$:

$$\therefore \ \boldsymbol{\theta}_i^T \mathbf{S}_W \boldsymbol{\theta}_j = 0, \quad \forall i \neq j$$

Ref. (Krzanowski, 1988)

Page 20: Feature Decorrelation on Speech Recognition

Linear Discriminant Analysis -3

• To overcome the arbitrary scaling of $\boldsymbol{\theta}_i$, it is usual to adopt the normalization

$$\boldsymbol{\theta}_i^T \mathbf{S}_W \boldsymbol{\theta}_i = 1$$

$$\therefore \ \boldsymbol{\theta}_i^T \mathbf{S}_W \boldsymbol{\theta}_i = 1, \ \boldsymbol{\theta}_i^T \mathbf{S}_B \boldsymbol{\theta}_i = \lambda_i \ \Rightarrow \ \Theta^T \mathbf{S}_W \Theta = \mathbf{I}, \quad \Theta^T \mathbf{S}_B \Theta = \boldsymbol{\Lambda} = \begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_p \end{pmatrix}$$

• The total covariance matrix $\mathbf{S}_T = \mathbf{S}_W + \mathbf{S}_B$ is therefore also transformed into a diagonal matrix: $\Theta^T \mathbf{S}_T \Theta = \mathbf{I} + \boldsymbol{\Lambda}$

Page 21: Feature Decorrelation on Speech Recognition

Linear Discriminant Analysis -4

• (Total) covariance matrix calculated using the 25-hour training data from MATBN

[Figure: surface plots of the covariance matrix for the 162-dim spliced Mel-filterbank features and for the 39-dim LDA-transformed subspace]

Page 22: Feature Decorrelation on Speech Recognition

Maximum Likelihood Linear Transform

• The Maximum Likelihood Linear Transform (MLLT) seems to come in two types:
  – Feature-based: (Gopinath, 1998)
  – Model-based: (Olsen, 2002), (Olsen, 2004)
• The common goal of the two types:
  – Find a global linear transformation matrix that decorrelates features

Page 23: Feature Decorrelation on Speech Recognition

Feature-based MLLT -1

• Feature-based Maximum Likelihood Linear Transform (F-MLLT)
  – Tries to alleviate four problems:
    (a) Data insufficiency, implying unreliable models
    (b) Large storage requirements
    (c) Large computational requirements
    (d) ML is not discriminating between classes
  – Solutions:
    (a)-(c): Sharing parameters across classes
    (d): Appealing to LDA

Page 24: Feature Decorrelation on Speech Recognition

Feature-based MLLT -2

• Maximum Likelihood Modeling
  – The likelihood of the training data $\{(\mathbf{x}_i, l_i)\}$ is given by

$$p(\{\mathbf{x}_i\}_1^N, \{\boldsymbol{\mu}_j\}, \{\boldsymbol{\Sigma}_j\}) = \prod_{i=1}^{N} \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}_{l_i}|^{1/2}} \, e^{-\frac{1}{2} (\mathbf{x}_i - \boldsymbol{\mu}_{l_i})^T \boldsymbol{\Sigma}_{l_i}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_{l_i})} = \frac{e^{-\frac{1}{2} \sum_j n_j \left( (\mathbf{m}_j - \boldsymbol{\mu}_j)^T \boldsymbol{\Sigma}_j^{-1} (\mathbf{m}_j - \boldsymbol{\mu}_j) + \mathrm{trace}(\boldsymbol{\Sigma}_j^{-1} \mathbf{S}_j) + \log|\boldsymbol{\Sigma}_j| \right)}}{(2\pi)^{Nd/2}}$$

(see Appendix A)

  – Log-Likelihood:

$$\log p(\{\mathbf{x}_i\}_1^N, \{\boldsymbol{\mu}_j\}, \{\boldsymbol{\Sigma}_j\}) = -\frac{Nd}{2} \log(2\pi) - \frac{1}{2} \sum_{j=1}^{C} n_j \left( (\mathbf{m}_j - \boldsymbol{\mu}_j)^T \boldsymbol{\Sigma}_j^{-1} (\mathbf{m}_j - \boldsymbol{\mu}_j) + \mathrm{trace}(\boldsymbol{\Sigma}_j^{-1} \mathbf{S}_j) + \log|\boldsymbol{\Sigma}_j| \right)$$

where $\mathbf{m}_j$ and $\mathbf{S}_j$ are the sample mean and sample covariance of class $j$ (the ML estimators).

Page 25: Feature Decorrelation on Speech Recognition

Feature-based MLLT -3

• The idea of maximum likelihood estimation (MLE) is to choose the parameters $\{\hat{\boldsymbol{\mu}}_j\}$ and $\{\hat{\boldsymbol{\Sigma}}_j\}$ so as to maximize $\log p(\{\mathbf{x}_i\}_1^N, \{\boldsymbol{\mu}_j\}, \{\boldsymbol{\Sigma}_j\})$
  – ML Estimators: $\{\hat{\boldsymbol{\mu}}_j\}$ and $\{\hat{\boldsymbol{\Sigma}}_j\}$

What's the difference between "estimator" and "estimate"?

Page 26: Feature Decorrelation on Speech Recognition

Feature-based MLLT -4

• Multiclass ML Modeling
  – The training data is modeled with Gaussians $(\hat{\boldsymbol{\mu}}_j, \hat{\boldsymbol{\Sigma}}_j)$
  – The ML estimators: $\hat{\boldsymbol{\mu}}_j = \mathbf{m}_j, \ \hat{\boldsymbol{\Sigma}}_j = \mathbf{S}_j$ (see Appendix B)
  – The log-ML value:

$$\log p^*(\{\mathbf{x}_i\}_1^N) = \log p(\{\mathbf{x}_i\}_1^N, \{\hat{\boldsymbol{\mu}}_j\}, \{\hat{\boldsymbol{\Sigma}}_j\}) = -\frac{Nd}{2} \log(2\pi) - \frac{Nd}{2} - \frac{1}{2} \sum_{j=1}^{C} n_j \log|\mathbf{S}_j| = g(N, d) - \frac{1}{2} \sum_{j=1}^{C} n_j \log|\mathbf{S}_j|$$

  – There is "no interaction" between the classes, and therefore unconstrained ML modeling is not "discriminating"

Page 27: Feature Decorrelation on Speech Recognition

Feature-based MLLT -5

• Constrained ML - Diagonal Covariance
  – The ML estimators: $\hat{\boldsymbol{\mu}}_j = \mathbf{m}_j, \ \hat{\boldsymbol{\Sigma}}_j = \mathrm{diag}(\mathbf{S}_j)$
  – The log-ML value:

$$\log p^*_{\mathrm{diag}}(\{\mathbf{x}_i\}_1^N) = g(N, d) - \frac{1}{2} \sum_{j=1}^{C} n_j \log|\mathrm{diag}(\mathbf{S}_j)|$$

  – If one linearly transforms the data with $\Theta$ and models it using a diagonal Gaussian, the ML value is

$$\log p^*_{\mathrm{diag}}(\{\mathbf{x}_i\}_1^N) = g(N, d) - \frac{1}{2} \sum_{j=1}^{C} n_j \log|\mathrm{diag}(\Theta \mathbf{S}_j \Theta^T)| + N \log|\Theta|$$

where the $N \log|\Theta|$ term comes from the Jacobian of the transform (see Appendix C). How should $\Theta$ be chosen for each class?

Page 28: Feature Decorrelation on Speech Recognition

Feature-based MLLT -6

• Multiclass ML Modeling - Some Issues
  – If the sample size for each class is not large enough, then the ML parameter estimates may have large variance and hence be unreliable
  – The storage requirement for the model: $O(Cd^2)$
  – The computational requirement: $O(Cd^2)$
  – The parameters for each class are obtained independently: the ML principle does not allow for discrimination between classes (why?)

Page 29: Feature Decorrelation on Speech Recognition

Feature-based MLLT -7

• Multiclass ML Modeling - Some Issues (cont.)
  – Sharing parameters across classes reduces the number of parameters, the storage requirements, and the computational requirements
  – It is hard to justify that parameter sharing is more discriminating
  – We can appeal to Fisher's criterion of LDA and a result of Campbell to argue that sometimes constrained ML modeling is discriminating

But what is "discriminating"?

Page 30: Feature Decorrelation on Speech Recognition

Feature-based MLLT -8

• Multiclass ML Modeling - Some Issues (cont.)
  – We can globally transform the data with a unimodular matrix $\Theta$, $\det(\Theta) = 1$, and model the transformed data with diagonal Gaussians (there is a loss in likelihood, too)
  – Among all possible transformations $\Theta$, we can choose the one that incurs the least loss in likelihood (in essence, we will find a linearly transformed (shared) feature space in which the diagonal Gaussian assumption is almost valid)

Page 31: Feature Decorrelation on Speech Recognition

Feature-based MLLT -9

• Other constrained MLEs with sharing of parameters:
  – Equal covariance
  – Clustering
• Covariance Diagonalization and Cluster Transformation
  – Classes are grouped into clusters
  – Each cluster is modeled with a diagonal Gaussian in a transformed feature space
  – The ML estimators: $\hat{\boldsymbol{\mu}}_j = \mathbf{m}_j, \ \hat{\boldsymbol{\Sigma}}_j = \Theta_{C_j}^{-1} \, \mathrm{diag}(\Theta_{C_j} \mathbf{S}_j \Theta_{C_j}^T) \, \Theta_{C_j}^{-T}$
  – The ML value:

$$\log p^*_{\mathrm{diag}}(\{\mathbf{x}_i\}_1^N) = g(N, d) - \frac{1}{2} \sum_{j=1}^{C} n_j \log|\mathrm{diag}(\Theta_{C_j} \mathbf{S}_j \Theta_{C_j}^T)| + \sum_{j=1}^{C} n_j \log|\Theta_{C_j}|$$

Page 32: Feature Decorrelation on Speech Recognition

Feature-based MLLT -10

• One Cluster with Diagonal Covariance
  – When the number of clusters is one, there is a single global transformation, and the classes are modeled as diagonal Gaussians in this feature space
  – The ML estimators: $\hat{\boldsymbol{\mu}}_j = \mathbf{m}_j, \ \hat{\boldsymbol{\Sigma}}_j = \Theta^{-1} \, \mathrm{diag}(\Theta \mathbf{S}_j \Theta^T) \, \Theta^{-T}$
  – The log-ML value:

$$\log p^*_{\mathrm{diag}}(\{\mathbf{x}_i\}_1^N) = g(N, d) - \frac{1}{2} \sum_{j=1}^{C} n_j \log|\mathrm{diag}(\Theta \mathbf{S}_j \Theta^T)| + N \log|\Theta|$$

  – The optimal $\Theta$ can be obtained by the optimization

$$\Theta = \arg\min_{\Theta} \ \frac{1}{2} \sum_{j=1}^{C} n_j \log|\mathrm{diag}(\Theta \mathbf{S}_j \Theta^T)| - N \log|\Theta|$$

Page 33: Feature Decorrelation on Speech Recognition

Feature-based MLLT -11

• Optimization - the numerical approach:
  – The objective function:

$$F(\Theta) = \frac{1}{2} \sum_{j=1}^{C} n_j \log|\mathrm{diag}(\Theta \mathbf{S}_j \Theta^T)| - N \log|\Theta|$$

  – Differentiating $F$ with respect to $\Theta$ gives the derivative $G$:

$$G(\Theta) = \sum_j n_j \, \mathrm{diag}(\Theta \mathbf{S}_j \Theta^T)^{-1} \Theta \mathbf{S}_j - N \Theta^{-T}$$

  – Directly optimizing the objective function is nontrivial: it requires numerical optimization techniques, and a full matrix must be stored for each class (see the sketch below)
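A rough illustration (my own sketch, not Gopinath's exact recipe) of minimizing $F(\Theta)$ by plain gradient descent using the derivative above; the step size and convergence checking are deliberately naive:

```python
import numpy as np

def f_mllt(S, n, iters=500, lr=1e-3):
    """S: (C, d, d) class sample covariances; n: (C,) class sample counts."""
    d = S.shape[1]
    N = n.sum()
    Theta = np.eye(d)                       # start at the identity (det > 0)
    for _ in range(iters):
        # G = sum_j n_j diag(Theta S_j Theta^T)^{-1} Theta S_j - N Theta^{-T}
        G = -N * np.linalg.inv(Theta).T
        for Sj, nj in zip(S, n):
            TS = Theta @ Sj
            G += nj * TS / np.diag(TS @ Theta.T)[:, None]
        Theta -= lr * G                     # naive step; assumes det(Theta) stays positive
    return Theta
```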


Page 34: Feature Decorrelation on Speech Recognition

Outline

• Introduction to Feature Decorrelation (FD)
  – FD on Speech Recognition
• Feature-based Decorrelation (DCT, PCA, LDA, F-MLLT)
• Model-based Decorrelation (M-MLLT, EMLLT, Semi-tied covariance matrix, MLT)
• Common Principal Component (CPC)

Page 35: Feature Decorrelation on Speech Recognition

Model-based MLLT -1

• In model-based MLLT, instead of having a distinct covariance matrix for every component in the recognizer, each covariance matrix consists of two elements:
  – A non-singular linear transformation matrix $\Theta$, shared over a set of components
  – The diagonal elements in the matrix $\boldsymbol{\Lambda}_j$ for each class $j$:

$$\boldsymbol{\Lambda}_j = \begin{pmatrix} \lambda_{j1} & & 0 \\ & \ddots & \\ 0 & & \lambda_{jd} \end{pmatrix}$$

• The precision matrices are constrained to be of the form

$$\mathbf{P}_j = \boldsymbol{\Sigma}_j^{-1} = \Theta^T \boldsymbol{\Lambda}_j \Theta = \sum_{k=1}^{d} \lambda_{jk} \, \boldsymbol{\theta}_k \boldsymbol{\theta}_k^T$$

where $\mathbf{P}_j$ is the precision matrix and $\boldsymbol{\Lambda}_j$ a diagonal matrix.

Ref. (Olsen, 2004)

Page 36: Feature Decorrelation on Speech Recognition

Model-based MLLT -2

• M-MLLT fits within the standard maximum-likelihood criterion used for training HMMs, with each state represented by a GMM:

$$f(\mathbf{x} \,|\, \Theta) = \sum_{j=1}^{m} \pi_j \, N(\mathbf{x}; \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$$

where $\pi_j$ is the $j$-th component weight.

• Log-likelihood function:

$$L(\Theta : \mathbf{X}) = \frac{1}{N} \sum_{i=1}^{N} \log f(\mathbf{x}_i \,|\, \Theta)$$

Page 37: Feature Decorrelation on Speech Recognition

Model-based MLLT -3

• The parameters $\Theta = \{\pi_1, \boldsymbol{\mu}_1, \boldsymbol{\Lambda}_1, \ldots, \Theta\}$ of the M-MLLT model for each HMM state are estimated using a generalized expectation-maximization (EM) algorithm*
• The auxiliary "Q-function" should be introduced
  – The Q-function satisfies the inequality

$$L(\hat{\Theta} : \mathbf{X}) - L(\Theta : \mathbf{X}) \geq Q(\Theta, \hat{\Theta}) - Q(\Theta, \Theta)$$

where $\hat{\Theta} = \{\hat{\pi}_1, \hat{\boldsymbol{\mu}}_1, \hat{\boldsymbol{\Lambda}}_1, \ldots, \hat{\Theta}\}$ denotes the new parameters and $\Theta$ the old ones.

* A. P. Dempster et al., "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc. B, vol. 39, pp. 1-38, 1977.

Page 38: Feature Decorrelation on Speech Recognition

Model-based MLLT -4

• The auxiliary Q-function is

$$Q(\Theta; \hat{\Theta}) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{m} \gamma_{ij} \, L_{ij}(\mathbf{x}_i)$$

where the component log-likelihood of observation $i$ is

$$L_{ij}(\mathbf{x}_i) = \log\!\left( \pi_j N(\mathbf{x}_i; \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) \right) = \frac{1}{2} \left( 2 \log \pi_j - d \log(2\pi) + \log|\mathbf{P}_j| - \mathrm{trace}\!\left( \mathbf{P}_j (\mathbf{x}_i - \boldsymbol{\mu}_j)(\mathbf{x}_i - \boldsymbol{\mu}_j)^T \right) \right)$$

and the a posteriori probability of Gaussian component $j$ given observation $i$ is

$$\gamma_{ij} = \frac{\exp(\hat{L}_{ij}(\mathbf{x}_i))}{\sum_{k=1}^{m} \exp(\hat{L}_{ik}(\mathbf{x}_i))}$$

Page 39: Feature Decorrelation on Speech Recognition

Model-based MLLT -5

• Gales' Approach for deriving $\Theta$
  1. Estimate the mean and the component weight, which are independent of the other model parameters:

$$\pi_j = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ij}, \qquad \boldsymbol{\mu}_j = \frac{\sum_{i=1}^{N} \gamma_{ij} \, \mathbf{x}_i}{\sum_{i=1}^{N} \gamma_{ij}}$$

  2. Use the current estimate of the transform $\Theta$, and estimate the set of class-specific diagonal variances:

$$\lambda_{ji} = \boldsymbol{\theta}_i \mathbf{S}_j \boldsymbol{\theta}_i^T$$

where $\boldsymbol{\theta}_i$ is the $i$-th row vector of $\Theta$ and $\lambda_{ji}$ the $i$-th entry of the diagonal variance of component $j$.

Ref. (Gales, 1999)

Page 40: Feature Decorrelation on Speech Recognition

Model-based MLLT -6

• Gales' Approach for deriving $\Theta$ (cont.)
  3. Estimate the transform $\Theta$ using the current set of diagonal covariances (a sketch follows):

$$\hat{\boldsymbol{\theta}}_i = \mathbf{c}_i \mathbf{G}_i^{-1} \sqrt{\frac{N}{\mathbf{c}_i \mathbf{G}_i^{-1} \mathbf{c}_i^T}}, \qquad \mathbf{G}_i = \sum_{j} \frac{n_j}{\hat{\sigma}_{ji}^2} \mathbf{S}_j$$

where $\mathbf{c}_i$ is the $i$-th row vector of the cofactors of the current $\Theta$ (see Appendix D).

  4. Go to step 2 until convergence, or an appropriate criterion is satisfied
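A minimal sketch (assumed inputs) of step 3, the closed-form row-by-row update from Gales (1999); in the full algorithm it alternates with re-estimating the diagonal variances in step 2:

```python
import numpy as np

def update_transform_rows(Theta, S, n, sigma2, iters=10):
    """Theta: (d, d); S: (C, d, d) class covariances; n: (C,) counts;
    sigma2: (C, d) current diagonal variances, sigma2[j, i] = theta_i S_j theta_i^T."""
    d = Theta.shape[0]
    N = n.sum()
    for _ in range(iters):
        for i in range(d):
            G_i = sum(nj / s2[i] * Sj for Sj, nj, s2 in zip(S, n, sigma2))
            # Row i of the cofactor matrix: cof(Theta) = det(Theta) * Theta^{-T}
            c_i = np.linalg.det(Theta) * np.linalg.inv(Theta).T[i]
            cG = c_i @ np.linalg.inv(G_i)
            Theta[i] = cG * np.sqrt(N / (cG @ c_i))
    return Theta
```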


Page 41: Feature Decorrelation on Speech Recognition

Semi-Tied Covariance Matrices -1

• Introduction to Semi-Tied Covariance Matrices
  – A natural extension of the state-specific rotation scheme
  – The transform is estimated in a maximum-likelihood (ML) fashion given the current model parameters
  – The optimization is performed using a simple iterative scheme, which is guaranteed to increase the likelihood of the training data
  – An alternative approach to solving the optimization problem of DHLDA

Page 42: Feature Decorrelation on Speech Recognition

Semi-Tied Covariance Matrices -2

• State-Specific Rotation
  – A full covariance matrix is calculated for each state in the system and decomposed into its eigenvectors and eigenvalues:

$$\boldsymbol{\Sigma}^{(s)}_{\mathrm{full}} = \mathbf{U}^{(s)} \boldsymbol{\Lambda}^{(s)} \mathbf{U}^{(s)T}$$

  – All data from that state is then decorrelated using the calculated eigenvectors:

$$\mathbf{o}^{(s)}(\tau) = \mathbf{U}^{(s)T} \mathbf{o}(\tau)$$

  – Multiple diagonal-covariance Gaussian components are then trained

Ref. (Gales, 1999), (Kumar, 1998)

Page 43: Feature Decorrelation on Speech Recognition

Semi-Tied Covariance Matrices -3

• Drawbacks of State-Specific Rotation
  – It does not fit within the standard ML estimation framework for training HMMs
  – The transforms are not related to the multiple-component models being used to model the data

Page 44: Feature Decorrelation on Speech Recognition

Semi-Tied Covariance Matrices -4

• Semi-Tied Covariance Matrices
  – Instead of having a distinct covariance matrix for every component in the recognizer, each covariance matrix consists of two elements:

$$\boldsymbol{\Sigma}^{(m)} = \mathbf{H}^{(r)} \boldsymbol{\Sigma}^{(m)}_{\mathrm{diag}} \mathbf{H}^{(r)T}$$

where $\boldsymbol{\Sigma}^{(m)}_{\mathrm{diag}}$ is the component-specific diagonal covariance element, and $\mathbf{H}^{(r)}$ is the semi-tied class-dependent, non-diagonal matrix, which may be tied over a set of components.

Page 45: Feature Decorrelation on Speech Recognition

Semi-Tied Covariance Matrices -5

• Semi-Tied Covariance Matrices (cont.)
  – It is very complex to optimize these parameters directly, so an expectation-maximization approach is adopted
  – Parameters for each component $m$:

$$M = \{\pi^{(m)}, \boldsymbol{\mu}^{(m)}, \boldsymbol{\Sigma}^{(m)}_{\mathrm{diag}}, \Theta^{(r)}\}, \qquad \Theta^{(r)} = \mathbf{H}^{(r)-1}$$

Page 46: Feature Decorrelation on Speech Recognition

Semi-Tied Covariance Matrices -6

• The auxiliary Q-function:

$$Q(M, \hat{M}) = K - \frac{1}{2} \sum_{\tau,\, m \in M^{(r)}} \gamma_m(\tau) \left( \log\frac{|\hat{\boldsymbol{\Sigma}}^{(m)}_{\mathrm{diag}}|}{|\hat{\Theta}^{(r)}|^2} + (\mathbf{o}(\tau) - \hat{\boldsymbol{\mu}}^{(m)})^T \hat{\Theta}^{(r)T} \hat{\boldsymbol{\Sigma}}^{(m)-1}_{\mathrm{diag}} \hat{\Theta}^{(r)} (\mathbf{o}(\tau) - \hat{\boldsymbol{\mu}}^{(m)}) \right)$$

where $\gamma_m(\tau) = p(q_m(\tau) \,|\, M, \mathbf{O}_T)$ is the posterior probability of component $m$ at time $\tau$.

Page 47: Feature Decorrelation on Speech Recognition

Semi-Tied Covariance Matrices -7

• If all the model parameters are to be simultaneously optimized, then the Q-function may be rewritten as

$$Q(M, \hat{M}) = K - \frac{1}{2} \sum_{\tau,\, m \in M^{(r)}} \gamma_m(\tau) \log\frac{|\mathrm{diag}(\hat{\Theta}^{(r)} \mathbf{W}^{(m)} \hat{\Theta}^{(r)T})|}{|\hat{\Theta}^{(r)}|^2}$$

where

$$\mathbf{W}^{(m)} = \frac{\sum_\tau \gamma_m(\tau) \, (\mathbf{o}(\tau) - \hat{\boldsymbol{\mu}}^{(m)})(\mathbf{o}(\tau) - \hat{\boldsymbol{\mu}}^{(m)})^T}{\sum_\tau \gamma_m(\tau)}, \qquad \hat{\boldsymbol{\mu}}^{(m)} = \frac{\sum_\tau \gamma_m(\tau) \, \mathbf{o}(\tau)}{\sum_\tau \gamma_m(\tau)}, \qquad \beta = \sum_{\tau,\, m \in M^{(r)}} \gamma_m(\tau)$$

Page 48: Feature Decorrelation on Speech Recognition

Semi-Tied Covariance Matrices -8

• The ML estimate of the diagonal element of the covariance matrix is given by

$$\hat{\boldsymbol{\Sigma}}^{(m)}_{\mathrm{diag}} = \mathrm{diag}\!\left( \hat{\Theta}^{(r)} \mathbf{W}^{(m)} \hat{\Theta}^{(r)T} \right)$$

• The re-estimation formulae for the component weights and transition probabilities are identical to the standard HMM cases (Rabiner, 1989)
• Unfortunately, optimizing the new Q-function is nontrivial and more complicated, so an alternative approach is proposed next

Page 49: Feature Decorrelation on Speech Recognition

Semi-Tied Covariance Matrices -9

• Gales' approach for optimizing the Q-function
  1. Estimate the mean, which is independent of the other model parameters
  2. Using the current estimate of the semi-tied transform $\hat{\Theta}^{(r)}$, estimate the set of component-specific diagonal variances $\hat{\boldsymbol{\Sigma}}^{(m)}_{\mathrm{diag}} = \mathrm{diag}(\hat{\Theta}^{(r)} \mathbf{W}^{(m)} \hat{\Theta}^{(r)T})$. This set of parameters will be denoted as $\{\hat{\boldsymbol{\Sigma}}^{(r)}_{\mathrm{diag}}\} = \{\hat{\boldsymbol{\Sigma}}^{(m)}_{\mathrm{diag}}, \ m \in M^{(r)}\}$
  3. Estimate the transform $\hat{\Theta}^{(r)}$ using the current set $\{\hat{\boldsymbol{\Sigma}}^{(r)}_{\mathrm{diag}}\}$
  4. Go to (2) until convergence, or an appropriate criterion is satisfied

Page 50: Feature Decorrelation on Speech Recognition

Semi-Tied Covariance Matrices -10

• How to carry out step (3)?
  – Optimizing the semi-tied transform requires an iterative estimation scheme, even after fixing all the other model parameters
  – Selecting a particular row $\hat{\boldsymbol{\theta}}^{(r)}_i$ of $\hat{\Theta}^{(r)}$, the former Q-function can be rewritten using the current set $\{\hat{\boldsymbol{\Sigma}}^{(r)}_{\mathrm{diag}}\}$ as

$$Q(M, \hat{M}; \{\hat{\boldsymbol{\Sigma}}^{(r)}_{\mathrm{diag}}\}) = K + \frac{1}{2} \sum_{\tau,\, m \in M^{(r)}} \gamma_m(\tau) \left( \log\!\left( (\hat{\boldsymbol{\theta}}^{(r)}_i \mathbf{c}_i^T)^2 \right) - \sum_{j=1}^{d} \frac{\left( \hat{\boldsymbol{\theta}}^{(r)}_j (\mathbf{o}(\tau) - \hat{\boldsymbol{\mu}}^{(m)}) \right)^2}{\hat{\sigma}^{(m)2}_{\mathrm{diag},\,j}} \right)$$

where $\mathbf{c}_i$ is the $i$-th row vector of the cofactors of $\hat{\Theta}^{(r)}$ and $\hat{\sigma}^{(m)2}_{\mathrm{diag},\,j}$ is the $j$-th leading diagonal element of $\hat{\boldsymbol{\Sigma}}^{(m)}_{\mathrm{diag}}$.

Page 51: Feature Decorrelation on Speech Recognition

Semi-Tied Covariance Matrices -11

• How to carry out step (3)? (cont.)
  – It can be shown that the ML estimate for the $i$-th row of the semi-tied transform, $\hat{\boldsymbol{\theta}}^{(r)}_i$, is given by

$$\hat{\boldsymbol{\theta}}^{(r)}_i = \mathbf{c}_i \mathbf{G}^{(r)-1}_i \sqrt{\frac{\beta}{\mathbf{c}_i \mathbf{G}^{(r)-1}_i \mathbf{c}_i^T}}, \qquad \mathbf{G}^{(r)}_i = \sum_{m \in M^{(r)}} \frac{\sum_\tau \gamma_m(\tau)}{\hat{\sigma}^{(m)2}_{\mathrm{diag},\,i}} \mathbf{W}^{(m)}$$

This has the same form as the M-MLLT row update sketched earlier.

Page 52: Feature Decorrelation on Speech Recognition

Semi-Tied Covariance Matrices -12

• It can be shown that

$$Q(M, \hat{M}) \geq Q(M, \hat{M}; \{\hat{\boldsymbol{\Sigma}}^{(r)}_{\mathrm{diag}}\})$$

with equality when the diagonal elements of the covariance matrix are given by $\hat{\boldsymbol{\Sigma}}^{(m)}_{\mathrm{diag}} = \mathrm{diag}(\hat{\Theta}^{(r)} \mathbf{W}^{(m)} \hat{\Theta}^{(r)T})$

• During recognition the log-likelihood is based on

$$\log\!\left( N(\mathbf{o}(\tau); \boldsymbol{\mu}^{(m)}, \boldsymbol{\Sigma}^{(m)}, \Theta^{(r)}) \right) = \log\!\left( N(\Theta^{(r)} \mathbf{o}(\tau); \Theta^{(r)} \boldsymbol{\mu}^{(m)}, \boldsymbol{\Sigma}^{(m)}_{\mathrm{diag}}) \right) + \frac{1}{2} \log\!\left( |\Theta^{(r)}|^2 \right)$$
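A minimal sketch (assumed shapes) of this recognition-time computation: the observation is projected by $\Theta^{(r)}$, a diagonal Gaussian is evaluated, and the constant Jacobian term $\frac{1}{2}\log|\Theta^{(r)}|^2$ is added back:

```python
import numpy as np

def semitied_loglik(o, mu, var_diag, Theta_r):
    """o, mu: (d,); var_diag: (d,) diagonal variances; Theta_r: (d, d).
    In practice Theta_r @ mu and the Jacobian term are precomputed once."""
    z = Theta_r @ o - Theta_r @ mu       # decorrelated residual
    d = o.shape[0]
    log_gauss = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(var_diag))
                        + np.sum(z ** 2 / var_diag))
    # 0.5 * log(|det(Theta_r)|^2), via slogdet for numerical safety
    return log_gauss + 0.5 * np.linalg.slogdet(Theta_r @ Theta_r.T)[1]
```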

Page 53: Feature Decorrelation on Speech Recognition

Extended MLLT -1

• The extended MLLT (EMLLT) is very similar to M-MLLT; the only difference is the precision matrix modeling:

$$\mathbf{P}_j = \boldsymbol{\Sigma}_j^{-1} = \Theta^T \boldsymbol{\Lambda}_j \Theta = \sum_{k=1}^{D} \lambda_{jk} \, \boldsymbol{\theta}_k \boldsymbol{\theta}_k^T$$

where the rank-one matrices $\boldsymbol{\theta}_k \boldsymbol{\theta}_k^T$ form a basis.

  – Note that in EMLLT $D \geq d$, while in M-MLLT $D = d$
  – The $\lambda_{jk}$ are not required to be positive!
  – The $\{\lambda_{jk}\}$ have to be chosen such that $\mathbf{P}_j$ is positive definite
  – The authors provide two algorithms for iterative update of the parameters

Ref. (Olsen, 2002), (Olsen, 2004)
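A minimal sketch (hypothetical dimensions) of constructing an EMLLT precision matrix from $D \geq d$ rank-one basis elements and checking positive definiteness:

```python
import numpy as np

d, D = 13, 26                            # hypothetical feature dim and basis size
rng = np.random.default_rng(0)
Theta = rng.standard_normal((D, d))      # rows theta_k of the (non-square) basis
lam_j = rng.standard_normal(D)           # per-class coefficients; may be negative

# P_j = sum_k lam_jk * outer(theta_k, theta_k) = Theta^T Lambda_j Theta
P_j = Theta.T @ np.diag(lam_j) @ Theta

# The lambdas must keep P_j positive definite for a valid Gaussian
is_pd = np.all(np.linalg.eigvalsh(P_j) > 0)
```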

Page 54: Feature Decorrelation on Speech Recognition

Extended MLLT -2

• The highlight of EMLLT:
  – More flexible: the number of basis elements $\{\boldsymbol{\theta}_k \boldsymbol{\theta}_k^T\}$ can be gradually varied from $d$ (MLLT) to $d(d+1)/2$ (full covariance) by controlling the value of $D$

Page 55: Feature Decorrelation on Speech Recognition

Outline

• Introduction to Feature Decorrelation (FD)
  – FD on Speech Recognition
• Feature-based Decorrelation (DCT, PCA, LDA, F-MLLT)
• Model-based Decorrelation (M-MLLT, EMLLT, Semi-tied covariance matrix, MLT)
• Common Principal Component (CPC)

Page 56: Feature Decorrelation on Speech Recognition

Common Principal Components -1

• Why Common Principal Components (CPC)?
  – We often deal with the situation of the same variables being measured on objects from different groups, and the covariance structure may vary from group to group
  – But sometimes the covariance matrices of different groups look somehow similar, and it seems reasonable to assume that they have a common basic structure
• Goal of CPC
  – To find a rotation that diagonalizes the covariance matrices simultaneously

Ref. (Flury, 1984), (Flury, 1986)

Page 57: Feature Decorrelation on Speech Recognition

Common Principal Components -2

• Hypothesis of CPCs:

$$H_c: \ \mathbf{B}^T \boldsymbol{\Sigma}_i \mathbf{B} = \boldsymbol{\Lambda}_i, \quad i = 1, \ldots, k$$

where each $\boldsymbol{\Lambda}_i$ is a diagonal matrix.

  – The common principal components (CPCs): $\mathbf{U}_i = \mathbf{B}^T \mathbf{x}_i$
  – Note that no canonical ordering of the columns of $\mathbf{B}$ need be given, since the rank order of the diagonal elements of the $\boldsymbol{\Lambda}_i$ is not necessarily the same for all groups

Page 58: Feature Decorrelation on Speech Recognition

Common Principal Components -3

• Assume the $n_i \mathbf{S}_i$ are independently distributed as $W_p(n_i, \boldsymbol{\Sigma}_i)$. The common likelihood function is

$$L(\boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_k) = C \times \prod_{i=1}^{k} \exp\!\left( \mathrm{trace}\!\left( -\frac{n_i}{2} \boldsymbol{\Sigma}_i^{-1} \mathbf{S}_i \right) \right) |\boldsymbol{\Sigma}_i|^{-n_i/2}$$

• Instead of maximizing the likelihood function, minimize

$$g(\boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_k) = -2 \log L(\boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_k) + 2 \log C = \sum_{i=1}^{k} n_i \left( \log|\boldsymbol{\Sigma}_i| + \mathrm{trace}(\boldsymbol{\Sigma}_i^{-1} \mathbf{S}_i) \right)$$

Page 59: Feature Decorrelation on Speech Recognition

Common Principal Components -4

• Assume $H_c$ holds for some orthogonal matrix $\mathbf{B} = [\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_p]$, and $\boldsymbol{\Lambda}_i = \mathrm{diag}(\lambda_{i1}, \ldots, \lambda_{ip})$; then

$$\log|\boldsymbol{\Sigma}_i| = \sum_{j=1}^{p} \log \lambda_{ij}, \quad i = 1, \ldots, k$$

$$\mathrm{trace}(\boldsymbol{\Sigma}_i^{-1} \mathbf{S}_i) = \mathrm{trace}(\mathbf{B} \boldsymbol{\Lambda}_i^{-1} \mathbf{B}^T \mathbf{S}_i) = \mathrm{trace}(\boldsymbol{\Lambda}_i^{-1} \mathbf{B}^T \mathbf{S}_i \mathbf{B}) = \sum_{j=1}^{p} \frac{\boldsymbol{\beta}_j^T \mathbf{S}_i \boldsymbol{\beta}_j}{\lambda_{ij}}$$

• Therefore

$$g(\boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_k) = g(\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_p, \lambda_{11}, \ldots, \lambda_{1p}, \ldots, \lambda_{k1}, \ldots, \lambda_{kp}) = \sum_{i=1}^{k} n_i \sum_{j=1}^{p} \left( \log \lambda_{ij} + \frac{\boldsymbol{\beta}_j^T \mathbf{S}_i \boldsymbol{\beta}_j}{\lambda_{ij}} \right)$$

Page 60: Feature Decorrelation on Speech Recognition

Common Principal Components -5

• The function $g$ is to be minimized under the restrictions

$$\boldsymbol{\beta}_h^T \boldsymbol{\beta}_j = \begin{cases} 0 & \text{if } h \neq j \\ 1 & \text{if } h = j \end{cases}$$

• Thus we wish to minimize the function

$$G(\boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_k) = g(\boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_k) - \sum_{h=1}^{p} \gamma_h \left( \boldsymbol{\beta}_h^T \boldsymbol{\beta}_h - 1 \right) - 2 \sum_{h < j}^{p} \gamma_{hj} \, \boldsymbol{\beta}_h^T \boldsymbol{\beta}_j$$

Page 61: Feature Decorrelation on Speech Recognition

Common Principal Components -6

• Minimization:

$$\frac{\partial G(\boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_k)}{\partial \lambda_{ij}} = n_i \left( \frac{1}{\lambda_{ij}} - \frac{\boldsymbol{\beta}_j^T \mathbf{S}_i \boldsymbol{\beta}_j}{\lambda_{ij}^2} \right) = 0 \ \Rightarrow \ \lambda_{ij} = \boldsymbol{\beta}_j^T \mathbf{S}_i \boldsymbol{\beta}_j, \quad i = 1, \ldots, k; \ j = 1, \ldots, p$$

Key point! Keep it in mind:

$$\mathrm{trace}(\boldsymbol{\Sigma}_i^{-1} \mathbf{S}_i) = p, \quad i = 1, \ldots, k$$

Page 62: Feature Decorrelation on Speech Recognition

Common Principal Components -7

• Minimization (cont.):

$$\frac{\partial G(\boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_k)}{\partial \boldsymbol{\beta}_j} = \frac{\partial}{\partial \boldsymbol{\beta}_j} \left( \sum_{i=1}^{k} n_i \sum_{j=1}^{p} \left( \log \lambda_{ij} + \frac{\boldsymbol{\beta}_j^T \mathbf{S}_i \boldsymbol{\beta}_j}{\lambda_{ij}} \right) - \sum_{h=1}^{p} \gamma_h (\boldsymbol{\beta}_h^T \boldsymbol{\beta}_h - 1) - 2 \sum_{h < j}^{p} \gamma_{hj} \, \boldsymbol{\beta}_h^T \boldsymbol{\beta}_j \right) = \mathbf{0}$$

$$\Rightarrow \ \sum_{i=1}^{k} \frac{n_i}{\lambda_{ij}} \mathbf{S}_i \boldsymbol{\beta}_j - \gamma_j \boldsymbol{\beta}_j - \sum_{h \neq j} \gamma_{hj} \, \boldsymbol{\beta}_h = \mathbf{0}, \quad j = 1, \ldots, p$$

  – Multiplying on the left by $\boldsymbol{\beta}_j^T$ gives

$$\gamma_j = \sum_{i=1}^{k} n_i \frac{\boldsymbol{\beta}_j^T \mathbf{S}_i \boldsymbol{\beta}_j}{\lambda_{ij}} = \sum_{i=1}^{k} n_i, \quad j = 1, \ldots, p$$

Page 63: Feature Decorrelation on Speech Recognition

Common Principal Components -8

• Minimization (cont.): Thus

$$\sum_{i=1}^{k} \frac{n_i}{\lambda_{ij}} \mathbf{S}_i \boldsymbol{\beta}_j = \left( \sum_{i=1}^{k} n_i \right) \boldsymbol{\beta}_j + \sum_{h \neq j} \gamma_{hj} \, \boldsymbol{\beta}_h, \quad j = 1, \ldots, p$$

  – Multiplying on the left by $\boldsymbol{\beta}_l^T$ ($l \neq j$) implies

$$\sum_{i=1}^{k} n_i \frac{\boldsymbol{\beta}_l^T \mathbf{S}_i \boldsymbol{\beta}_j}{\lambda_{ij}} = \gamma_{lj}, \quad j = 1, \ldots, p, \ l \neq j$$

  – Note that $\boldsymbol{\beta}_j^T \mathbf{S}_i \boldsymbol{\beta}_l = \boldsymbol{\beta}_l^T \mathbf{S}_i \boldsymbol{\beta}_j$ and $\gamma_{jl} = \gamma_{lj}$, so

$$\sum_{i=1}^{k} n_i \frac{\boldsymbol{\beta}_l^T \mathbf{S}_i \boldsymbol{\beta}_j}{\lambda_{il}} = \gamma_{jl}, \quad j = 1, \ldots, p, \ l \neq j$$

Page 64: Feature Decorrelation on Speech Recognition

Common Principal Components -9

• Minimization (cont.):

$$\sum_{i=1}^{k} n_i \frac{\boldsymbol{\beta}_l^T \mathbf{S}_i \boldsymbol{\beta}_j}{\lambda_{ij}} - \sum_{i=1}^{k} n_i \frac{\boldsymbol{\beta}_l^T \mathbf{S}_i \boldsymbol{\beta}_j}{\lambda_{il}} = 0$$

$$\Rightarrow \ \boldsymbol{\beta}_l^T \left( \sum_{i=1}^{k} n_i \frac{\lambda_{il} - \lambda_{ij}}{\lambda_{il} \lambda_{ij}} \mathbf{S}_i \right) \boldsymbol{\beta}_j = 0, \quad l, j = 1, \ldots, p; \ l \neq j \qquad \text{(the optimization objective)}$$

  – These $p(p-1)/2$ equations have to be solved under the orthonormality conditions $\mathbf{B}^T \mathbf{B} = \mathbf{I}$ and $\lambda_{ij} = \boldsymbol{\beta}_j^T \mathbf{S}_i \boldsymbol{\beta}_j$

Page 65: Feature Decorrelation on Speech Recognition

Common Principal Components -10

• Solving procedure of CPC - the FG Algorithm
  – F-Algorithm (a rough sketch is given below)

[Figure: listing of the F-Algorithm]
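The original algorithm listings are figures and are not recoverable from this transcript. The following is a rough paraphrase (my own sketch, not Flury and Gautschi's exact listing) of the FG idea: Jacobi-like sweeps over pairs of columns of $\mathbf{B}$ (F-Algorithm), where each pair is rotated by an inner fixed-point iteration (G-Algorithm) until $\boldsymbol{\beta}_l^T \left( \sum_i n_i \frac{\lambda_{il} - \lambda_{ij}}{\lambda_{il}\lambda_{ij}} \mathbf{S}_i \right) \boldsymbol{\beta}_j = 0$:

```python
import numpy as np

def fg_algorithm(S, n, sweeps=20, inner=10):
    """S: (k, p, p) group covariances; n: (k,) group sizes. Returns orthogonal B."""
    p = S.shape[1]
    B = np.eye(p)
    for _ in range(sweeps):                      # F-Algorithm: sweep over column pairs
        for l in range(p - 1):
            for j in range(l + 1, p):
                Q = B[:, [l, j]]
                T = np.einsum('ab,kbc,cd->kad', Q.T, S, Q)  # 2x2 projection of each S_i
                q, r = np.array([1.0, 0.0]), np.array([0.0, 1.0])
                for _ in range(inner):           # G-Algorithm: fixed point on the rotation
                    lam_l = np.einsum('a,kab,b->k', q, T, q)
                    lam_j = np.einsum('a,kab,b->k', r, T, r)
                    w = n * (lam_l - lam_j) / (lam_l * lam_j)
                    M = np.einsum('k,kab->ab', w, T)         # 2x2 symmetric matrix
                    _, V = np.linalg.eigh(M)     # rotate (q, r) so that q^T M r = 0
                    if abs(V[0, 0]) < abs(V[0, 1]):
                        V = V[:, ::-1]           # keep the rotation closest to identity
                    if V[0, 0] < 0:
                        V = -V
                    q, r = V[:, 0], V[:, 1]
                B[:, [l, j]] = Q @ np.column_stack([q, r])
    return B
```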


Page 66: Feature Decorrelation on Speech Recognition

Common Principal Components -11

  – G-Algorithm

[Figure: listing of the G-Algorithm]

Page 67: Feature Decorrelation on Speech Recognition

Common Principal Components -12

• Likelihood Ratio Test
  – The sample common principal components: $\mathbf{U}_i = \hat{\mathbf{B}}^T \mathbf{X}_i, \ i = 1, \ldots, k$
  – For the $i$-th group, the transformed covariance matrix is $\mathbf{F}_i = \hat{\mathbf{B}}^T \mathbf{S}_i \hat{\mathbf{B}}, \ i = 1, \ldots, k$
  – Since $\hat{\boldsymbol{\Lambda}}_i = \mathrm{diag}(\mathbf{F}_i), \ i = 1, \ldots, k$, the statistic can be written as a function of the $\mathbf{F}_i$ alone:

$$X^2_{(k)} = \sum_{i=1}^{k} n_i \log\frac{|\mathrm{diag}(\mathbf{F}_i)|}{|\mathbf{F}_i|} = \sum_{i=1}^{k} n_i \log \prod_{j=1}^{p} \frac{f_{ijj}}{l_{ij}}$$

where the $f_{ijj}$ are the diagonal elements of $\mathbf{F}_i$ and the $l_{ij}$ its eigenvalues.
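A minimal sketch (assumed inputs) of this statistic, which measures how far a common basis $\hat{\mathbf{B}}$ is from diagonalizing all the group covariances simultaneously:

```python
import numpy as np

def cpc_lr_statistic(B_hat, S, n):
    """B_hat: (p, p) orthogonal; S: (k, p, p) group covariances; n: (k,) sizes."""
    stat = 0.0
    for Si, ni in zip(S, n):
        F = B_hat.T @ Si @ B_hat
        # n_i * log(|diag(F_i)| / |F_i|) >= 0 by Hadamard's inequality
        stat += ni * (np.sum(np.log(np.diag(F))) - np.linalg.slogdet(F)[1])
    return stat
```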

Page 68: Feature Decorrelation on Speech Recognition

Common Principal Components -13

• Likelihood Ratio Test (cont.)
  – The likelihood ratio criterion is a measure of the simultaneous diagonalizability of $k$ positive definite symmetric matrices
  – The CPCs can be viewed as obtained by a simultaneous transformation, yielding variables that are as uncorrelated as possible
  – It can also be seen from the viewpoint of Hadamard's inequality:

$$|\mathbf{F}_i| \leq |\mathrm{diag}(\mathbf{F}_i)|$$

Page 69: Feature Decorrelation on Speech Recognition

Common Principal Components -14

• Actually, CPC can also be viewed through another measure of "deviation from diagonality":

$$\varphi(\mathbf{F}_i) = \frac{|\mathrm{diag}(\mathbf{F}_i)|}{|\mathbf{F}_i|} \geq 1$$

  – The CPC criterion can then be written as

$$\Phi(\mathbf{F}_1, \ldots, \mathbf{F}_k; n_1, \ldots, n_k) = \prod_{i=1}^{k} \left( \varphi(\mathbf{F}_i) \right)^{n_i}$$

  – Let $\mathbf{F}_i = \mathbf{B}^T \mathbf{A}_i \mathbf{B}$ for a given orthogonal matrix $\mathbf{B}$; then

$$\Phi_0(\mathbf{A}_1, \ldots, \mathbf{A}_k; n_1, \ldots, n_k) = \min_{\mathbf{B}} \Phi(\mathbf{B}^T \mathbf{A}_1 \mathbf{B}, \ldots, \mathbf{B}^T \mathbf{A}_k \mathbf{B}; n_1, \ldots, n_k)$$

Page 70: Feature Decorrelation on Speech Recognition

Comparison Between CPC and F-MLLT

• CPC tries to maximize

$$-\sum_{i=1}^{C} n_i \left( \log|\boldsymbol{\Sigma}_i| + \mathrm{trace}(\boldsymbol{\Sigma}_i^{-1} \mathbf{S}_i) \right)$$

with $\boldsymbol{\Sigma}_i = \mathbf{B} \boldsymbol{\Lambda}_i \mathbf{B}^T, \ i = 1, \ldots, C$, in the original space. The estimates are KNOWN.

• F-MLLT tries to maximize

$$-\frac{1}{2} \sum_{j=1}^{C} n_j \log|\mathrm{diag}(\mathbf{A} \mathbf{S}_j \mathbf{A}^T)| + N \log|\mathbf{A}|$$

The estimates are UNKNOWN.

Page 71: Feature Decorrelation on Speech Recognition

Appendix A -1

• Show

$$\prod_{i=1}^{N} \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}_{l_i}|^{1/2}} \, e^{-\frac{1}{2} (\mathbf{x}_i - \boldsymbol{\mu}_{l_i})^T \boldsymbol{\Sigma}_{l_i}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_{l_i})} = \frac{e^{-\frac{1}{2} \sum_j n_j \left( (\mathbf{m}_j - \boldsymbol{\mu}_j)^T \boldsymbol{\Sigma}_j^{-1} (\mathbf{m}_j - \boldsymbol{\mu}_j) + \mathrm{trace}(\boldsymbol{\Sigma}_j^{-1} \mathbf{S}_j) + \log|\boldsymbol{\Sigma}_j| \right)}}{(2\pi)^{Nd/2}}$$

Page 72: Feature Decorrelation on Speech Recognition

Appendix A -2

For each class $j$ with $n_j$ samples:

$$\sum_{i=1}^{n_j} (\mathbf{x}_i - \boldsymbol{\mu}_j)^T \boldsymbol{\Sigma}_j^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_j) = \sum_{i=1}^{n_j} \left( (\mathbf{x}_i - \mathbf{m}_j) + (\mathbf{m}_j - \boldsymbol{\mu}_j) \right)^T \boldsymbol{\Sigma}_j^{-1} \left( (\mathbf{x}_i - \mathbf{m}_j) + (\mathbf{m}_j - \boldsymbol{\mu}_j) \right)$$

Expanding the quadratic form, the cross terms vanish because $\sum_{i=1}^{n_j} (\mathbf{x}_i - \mathbf{m}_j) = \mathbf{0}$:

$$= \sum_{i=1}^{n_j} (\mathbf{x}_i - \mathbf{m}_j)^T \boldsymbol{\Sigma}_j^{-1} (\mathbf{x}_i - \mathbf{m}_j) + n_j (\mathbf{m}_j - \boldsymbol{\mu}_j)^T \boldsymbol{\Sigma}_j^{-1} (\mathbf{m}_j - \boldsymbol{\mu}_j)$$

With the sample covariance $\mathbf{S}_j = \frac{1}{n_j} \sum_{i=1}^{n_j} (\mathbf{x}_i - \mathbf{m}_j)(\mathbf{x}_i - \mathbf{m}_j)^T$, the first term equals

$$\sum_{i=1}^{n_j} \mathrm{trace}\!\left( \boldsymbol{\Sigma}_j^{-1} (\mathbf{x}_i - \mathbf{m}_j)(\mathbf{x}_i - \mathbf{m}_j)^T \right) = n_j \, \mathrm{trace}(\boldsymbol{\Sigma}_j^{-1} \mathbf{S}_j)$$

so that

$$\sum_{i=1}^{n_j} (\mathbf{x}_i - \boldsymbol{\mu}_j)^T \boldsymbol{\Sigma}_j^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_j) = n_j \left( \mathrm{trace}(\boldsymbol{\Sigma}_j^{-1} \mathbf{S}_j) + (\mathbf{m}_j - \boldsymbol{\mu}_j)^T \boldsymbol{\Sigma}_j^{-1} (\mathbf{m}_j - \boldsymbol{\mu}_j) \right)$$

Summing over the classes and collecting the normalization factors $(2\pi)^{-d/2} |\boldsymbol{\Sigma}_j|^{-1/2}$ (whose product contributes the $\log|\boldsymbol{\Sigma}_j|$ terms in the exponent) gives the result.

Page 73: Feature Decorrelation on Speech Recognition

Appendix B

• ML estimators for $\log p(\{\mathbf{x}_i\}_1^N, \{\boldsymbol{\mu}_j\}, \{\boldsymbol{\Sigma}_j\})$:

$$\frac{\partial \log p(\{\mathbf{x}_i\}_1^N, \{\boldsymbol{\mu}_j\}, \{\boldsymbol{\Sigma}_j\})}{\partial \boldsymbol{\mu}_j} = n_j \boldsymbol{\Sigma}_j^{-1} (\mathbf{m}_j - \boldsymbol{\mu}_j) = \mathbf{0} \ \Rightarrow \ \hat{\boldsymbol{\mu}}_j = \mathbf{m}_j$$

$$\frac{\partial \log p(\{\mathbf{x}_i\}_1^N, \{\boldsymbol{\mu}_j\}, \{\boldsymbol{\Sigma}_j\})}{\partial \boldsymbol{\Sigma}_j^T} = \frac{\partial \left( -\frac{n_j}{2} \left( \mathrm{trace}(\boldsymbol{\Sigma}_j^{-1} \mathbf{S}_j) + \log|\boldsymbol{\Sigma}_j| \right) \right)}{\partial \boldsymbol{\Sigma}_j^T} = \frac{n_j}{2} \left( \boldsymbol{\Sigma}_j^{-T} \mathbf{S}_j^T \boldsymbol{\Sigma}_j^{-T} - \boldsymbol{\Sigma}_j^{-T} \right) = \mathbf{0} \ \Rightarrow \ \hat{\boldsymbol{\Sigma}}_j = \mathbf{S}_j$$

(the second derivative is evaluated at $\boldsymbol{\mu}_j = \mathbf{m}_j$).

Page 74: Feature Decorrelation on Speech Recognition

Appendix C -1

• Change of Variable Theorem
  – Consider a one-to-one mapping $g: \mathbb{R}^n \to \mathbb{R}^n, \ \mathbf{Y} = f(\mathbf{X})$
  – Equal probability: the probability of falling in a region in space $X$ should be the same as the probability of falling in the corresponding region in space $Y$
  – Suppose the region $dx_1 dx_2 \cdots dx_n$ maps to the region $dA$ in the $Y$ space. Equating probabilities, we have

$$f_{\mathbf{Y}}(y_1, \ldots, y_n) \, dA = f_{\mathbf{X}}(x_1, \ldots, x_n) \, dx_1 \cdots dx_n$$

  – The region $dA$ is a hyperparallelepiped described by the vectors

$$\left( \frac{\partial y_1}{\partial x_1} dx_1, \ldots, \frac{\partial y_n}{\partial x_1} dx_1 \right), \ \left( \frac{\partial y_1}{\partial x_2} dx_2, \ldots, \frac{\partial y_n}{\partial x_2} dx_2 \right), \ \ldots, \ \left( \frac{\partial y_1}{\partial x_n} dx_n, \ldots, \frac{\partial y_n}{\partial x_n} dx_n \right)$$

Page 75: Feature Decorrelation on Speech Recognition

Appendix C -2

• Change of Variable Theorem (cont.)
  – The volume of the hyperparallelepiped $dA$ can be calculated by the determinant

$$dA = \left| \det \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_n}{\partial x_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_1}{\partial x_n} & \cdots & \frac{\partial y_n}{\partial x_n} \end{pmatrix} \right| dx_1 \cdots dx_n = |\mathbf{J}| \, dx_1 \cdots dx_n$$

where $\mathbf{J}$ is the Jacobian of the function $g$.

  – So

$$f_{\mathbf{Y}}(y_1, \ldots, y_n) \, |\mathbf{J}| \, dx_1 \cdots dx_n = f_{\mathbf{X}}(x_1, \ldots, x_n) \, dx_1 \cdots dx_n \ \Rightarrow \ f_{\mathbf{Y}}(y_1, \ldots, y_n) = |\mathbf{J}|^{-1} f_{\mathbf{X}}(x_1, \ldots, x_n)$$

Page 76: Feature Decorrelation on Speech Recognition

Appendix D -1

• If $\mathbf{A}$ is a square matrix, then the minor of entry $a_{ij}$ is denoted by $M_{ij}$ and is defined to be the determinant of the submatrix that remains after the $i$-th row and the $j$-th column are deleted from $\mathbf{A}$. The number $(-1)^{i+j} M_{ij}$ is denoted by $c_{ij}$ and is called the cofactor of $a_{ij}$.
• Given the matrix

$$\mathbf{B} = \begin{pmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \end{pmatrix}$$

we have, for example,

$$c_{23} = (-1)^{2+3} M_{23}, \quad M_{23} = \det\begin{pmatrix} b_{11} & b_{12} \\ b_{31} & b_{32} \end{pmatrix} = b_{11} b_{32} - b_{12} b_{31}, \qquad c_{33} = (-1)^{3+3} M_{33}$$

Page 77: Feature Decorrelation on Speech Recognition

Appendix D -2

• Given the $n \times n$ matrix

$$\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}$$

  – The determinant of $\mathbf{A}$ can be written as the sum of its cofactors multiplied by the entries that generated them:

$$\det(\mathbf{A}) = a_{1j} c_{1j} + a_{2j} c_{2j} + \cdots + a_{nj} c_{nj} \quad \text{(cofactor expansion along the } j\text{-th column)}$$

$$\det(\mathbf{A}) = a_{i1} c_{i1} + a_{i2} c_{i2} + \cdots + a_{in} c_{in} \quad \text{(cofactor expansion along the } i\text{-th row)}$$
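A minimal sketch verifying the row expansion numerically with NumPy (the full cofactor matrix can also be obtained as $\det(\mathbf{A}) \, \mathbf{A}^{-T}$):

```python
import numpy as np

def cofactor(A, i, j):
    minor = np.delete(np.delete(A, i, axis=0), j, axis=1)
    return (-1) ** (i + j) * np.linalg.det(minor)

A = np.random.randn(4, 4)
i = 2
det_by_expansion = sum(A[i, j] * cofactor(A, i, j) for j in range(4))
assert np.isclose(det_by_expansion, np.linalg.det(A))
```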