Feature Decorrelation on Speech Recognition


Transcript of Feature Decorrelation on Speech Recognition

Page 1: Feature Decorrelation on Speech Recognition

Feature Decorrelation on Speech Recognition

Hung-Shin Lee

Institute of Information Science, Academia Sinica
Dept. of Electrical Engineering, National Taiwan University

2009-10-09 @ IIS, Academia Sinica

Page 2: Feature Decorrelation on Speech Recognition

References -1

1) J. Psutka and L. Muller, "Comparison of various feature decorrelation techniques in automatic speech recognition," in Proc. CITSA 2006.
2) Dr. Berlin Chen's lecture slides: http://berlin.csie.ntnu.edu.tw
3) Batlle et al., "Feature decorrelation methods in speech recognition - a comparative study," in Proc. ICSLP 1998.
4) K. Demuynck et al., "Improved feature decorrelation for HMM-based speech recognition," in Proc. ICSLP 1998.
5) W. Krzanowski, Principles of Multivariate Analysis - A User's Perspective, Oxford Press, 1988.
6) R. Gopinath, "Maximum likelihood modeling with Gaussian distributions for classification," in Proc. ICASSP 1998.
7) M. Gales, "Semi-tied covariance matrices for hidden Markov models," IEEE Trans. on Speech and Audio Processing, vol. 7, no. 3, pp. 272-281, 1999.
8) P. Olsen and R. Gopinath, "Modeling inverse covariance matrices by basis expansion," in Proc. ICASSP 2002.

Page 3: Feature Decorrelation on Speech Recognition

References -2

9) P. Olsen and R. Gopinath, "Modeling inverse covariance matrices by basis expansion," IEEE Trans. on Speech and Audio Processing, vol. 12, no. 1, pp. 37-46, 2004.
10) N. Kumar and R. Gopinath, "Multiple linear transform," in Proc. ICASSP 2001.
11) N. Kumar and A. Andreou, "Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition," Speech Communication, vol. 26, no. 4, pp. 283-297, 1998.
12) A. Ljolje, "The importance of cepstral parameter correlations in speech recognition," Computer Speech and Language, vol. 8, pp. 223-232, 1994.
13) B. Flury, "Common principal components in k groups," Journal of the American Statistical Association, vol. 79, no. 388, 1984.
14) B. Flury and W. Gautschi, "An algorithm for simultaneous orthogonal transformation of several positive definite symmetric matrices to nearly diagonal form," SIAM Journal on Scientific and Statistical Computing, vol. 7, no. 1, 1986.

Page 4: Feature Decorrelation on Speech Recognition

Outline

• Introduction to Feature Decorrelation (FD)
  – FD on Speech Recognition
• Feature-based Decorrelation (DCT, PCA, LDA, F-MLLT)
• Model-based Decorrelation (M-MLLT, EMLLT, Semi-tied covariance matrix, MLT)
• Common Principal Component (CPC)

Page 5: Feature Decorrelation on Speech Recognition

Outline

• Introduction to Feature Decorrelation (FD)
  – FD on Speech Recognition
• Feature-based Decorrelation (DCT, PCA, LDA, F-MLLT)
• Model-based Decorrelation (M-MLLT, EMLLT, Semi-tied covariance matrix, MLT)
• Common Principal Component (CPC)

Page 6: Feature Decorrelation on Speech Recognition

Introduction to Feature Decorrelation -1

• Definition of Covariance Matrix:
  – Random vector: $\mathbf{X} = [x_1, \ldots, x_n]^T$, where each $x_i$ is a random variable
  – Covariance matrix and mean vector:

$$\boldsymbol{\Sigma} \equiv E[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T], \qquad \boldsymbol{\mu} \equiv E[\mathbf{X}]$$

where $E[\cdot]$ denotes the expected value and $\mathbf{X}$ the random vector.
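Below is a minimal sketch (with hypothetical data shapes) of estimating these two quantities from a set of feature frames with NumPy:

```python
# Sample mean vector and covariance matrix of feature vectors.
import numpy as np

X = np.random.randn(1000, 13)           # hypothetical data: 1000 frames, 13-dim features

mu = X.mean(axis=0)                      # mean vector  mu = E[X]
Sigma = (X - mu).T @ (X - mu) / len(X)   # covariance  Sigma = E[(X - mu)(X - mu)^T]

# np.cov gives the same result (here with the 1/N normalization)
assert np.allclose(Sigma, np.cov(X, rowvar=False, bias=True))
```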


Page 7: Feature Decorrelation on Speech Recognition

Introduction to Feature Decorrelation -2

• Feature Decorrelation (FD)
  – To find transformations $\Theta$ that make all variables or parameters (nearly) uncorrelated:

$$\mathrm{cov}(\tilde{X}_i, \tilde{X}_j) = 0, \quad \forall \tilde{X}_i, \tilde{X}_j, \ i \neq j$$

where $\tilde{X}_i$ denotes a transformed random variable.

  – Or, equivalently, make the covariance matrix diagonal (not necessarily the identity):

$$\Theta^T \boldsymbol{\Sigma} \Theta = \mathbf{D}$$

where $\boldsymbol{\Sigma}$ is the covariance matrix and $\mathbf{D}$ a diagonal matrix. The covariance matrix can be global or depend on each class.

Page 8: Feature Decorrelation on Speech Recognition

Introduction to Feature Decorrelation -3

• Why Feature Decorrelation (FD)?
  – In many speech recognition systems, the observation density function for each HMM state is modeled as a mixture of diagonal-covariance Gaussians
  – For the sake of computational simplicity, the off-diagonal elements of the covariance matrices of the Gaussians are assumed to be close to zero

$$f(\mathbf{x}) = \sum_i \pi_i N(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i), \qquad N(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}_i|^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) \right)$$

where $f(\mathbf{x})$ is the observation density function and each $\boldsymbol{\Sigma}_i$ is a diagonal matrix.
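As a minimal sketch (assumed shapes, not tied to any particular toolkit), the diagonal assumption reduces each component evaluation to $O(d)$ work:

```python
# Log-density of a mixture of diagonal-covariance Gaussians.
import numpy as np

def diag_gmm_logpdf(x, weights, means, variances):
    """x: (d,); weights: (m,); means, variances: (m, d), diagonal covariances."""
    d = x.shape[0]
    # Per-component log N(x; mu_i, diag(var_i)): only O(d) per component,
    # which is the computational advantage of the diagonal assumption.
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_quad = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    return np.logaddexp.reduce(np.log(weights) + log_norm + log_quad)
```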

Page 9: Feature Decorrelation on Speech Recognition

FD on Speech Recognition -1

• Approaches for FD can be divided into two categories: feature-space and model-space
• Feature-space Schemes
  – It is hard to find a single transform which decorrelates all elements of the feature vector for all states
• Model-space Schemes
  – A different transform is selected depending on which component the observation was hypothesized to be generated from
  – In the limit, a transform may be used for each component, which is equivalent to a full covariance matrix system

Page 10: Feature Decorrelation on Speech Recognition

FD on Speech Recognition -2

• Feature-space Schemes on LVCSR (DCT, PCA, LDA, MLLT, ...)

[Block diagram: Speech Signal → Front-End Preprocessing → Feature-space Decorrelation → Speech Decoding → Textual Results; AM Training builds the AM from the Training Data, and the AM, LM, and Lexicon drive decoding of the Test Data]

Page 11: Feature Decorrelation on Speech Recognition

FD on Speech Recognition -3

• Model-space Schemes on LVCSR (MLLT, EMLLT, MLT, Semi-Tied, ...)

[Block diagram: Speech Signal → Front-End Preprocessing → Speech Decoding → Textual Results; Model-space Decorrelation is applied to the AM within AM Training, and the AM, LM, and Lexicon drive decoding of the Test Data]

Page 12: Feature Decorrelation on Speech Recognition

FD on Speech Recognition -4

                        | Without Label Information          | With Label Information
Feature-Space Schemes   | Discrete Cosine Transform (DCT),   | Linear Discriminant Analysis (LDA),
                        | Principal Component Analysis (PCA) | Common Principal Components (CPC)
Model-Space Schemes     |                                    | Maximum Likelihood Linear Transform (MLLT),
                        |                                    | Extended Maximum Likelihood Linear Transform (EMLLT),
                        |                                    | Semi-Tied Covariance Matrices,
                        |                                    | Multiple Linear Transforms (MLT)

Page 13: Feature Decorrelation on Speech Recognition

Outline

• Introduction to Feature Decorrelation (FD)
  – FD on Speech Recognition
• Feature-based Decorrelation (DCT, PCA, LDA, F-MLLT)
• Model-based Decorrelation (M-MLLT, EMLLT, Semi-tied covariance matrix, MLT)
• Common Principal Component (CPC)

Page 14: Feature Decorrelation on Speech Recognition

Discrete Cosine Transform -1

• Discrete Cosine Transform (DCT)
  – Since the log-power spectrum is real and symmetric, the inverse DFT reduces to a Discrete Cosine Transform (DCT)
  – It is applied to the log-energies of the output filters (Mel-scaled filterbank) during the MFCC parameterization:

$$c_j = \sum_{i=1}^{n} x_i \cos\!\left( \frac{\pi j (i - 0.5)}{n} \right), \quad \text{for } j = 0, 1, \ldots, m < n$$

where $x_i$ is the $i$-th coordinate of the input vector $\mathbf{x}$ and $c_j$ the $j$-th coordinate of the output vector $\mathbf{c}$.

  – Achieves only partial decorrelation
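A minimal sketch of the transform above (hypothetical filterbank values; the cosine matrix matches SciPy's unnormalized DCT-II up to a factor of 2):

```python
# DCT of log Mel-filterbank energies, as in MFCC extraction.
import numpy as np
from scipy.fftpack import dct

log_fbank = np.random.rand(18)                 # hypothetical 18 log filterbank energies
n = log_fbank.shape[0]
i = np.arange(1, n + 1)
j = np.arange(n)
C = np.cos(np.pi * np.outer(j, i - 0.5) / n)   # C[j, i] = cos(pi * j * (i - 0.5) / n)
cepstra = (C @ log_fbank)[:13]                 # keep the first 13 cepstra

assert np.allclose(cepstra, dct(log_fbank, type=2)[:13] / 2)
```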


Page 15: Feature Decorrelation on Speech Recognition

Discrete Cosine Transform -2

• (Total) covariance matrix calculated using the 25-hour training data from MATBN

[Figure: surface plots of the covariance matrix for the 18 Mel-scaled filterbank log-energies and for the 13 Mel-cepstra (using DCT)]

Page 16: Feature Decorrelation on Speech Recognition

Principal Component Analysis -1

• Principal Component Analysis (PCA)
  – Based on the calculation of the major directions of variation of a set of data points in a high-dimensional space
  – Extracts the directions of greatest variance, on the assumption that the less the data vary along a direction, the less feature information that direction carries
  – Principal components: the $p$ largest eigenvectors $\mathbf{V} = [\mathbf{v}_1, \ldots, \mathbf{v}_p]$ of the total covariance matrix $\boldsymbol{\Sigma}$ (see the sketch below):

$$\boldsymbol{\Sigma} \mathbf{v}_i = \lambda_i \mathbf{v}_i, \qquad \mathbf{V}^T \boldsymbol{\Sigma} \mathbf{V} = \mathbf{D}$$

where $\mathbf{D}$ is a diagonal matrix.
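A minimal sketch (hypothetical data) of PCA as an eigendecomposition of the total covariance matrix; the projected features are uncorrelated by construction:

```python
import numpy as np

X = np.random.randn(1000, 18)                  # hypothetical feature vectors
Sigma = np.cov(X, rowvar=False)

eigvals, V = np.linalg.eigh(Sigma)             # ascending eigenvalues
V = V[:, np.argsort(eigvals)[::-1]][:, :13]    # keep the 13 largest eigenvectors

Y = (X - X.mean(axis=0)) @ V                   # transformed (decorrelated) features
C = np.cov(Y, rowvar=False)                    # V^T Sigma V: (near-)diagonal
assert np.abs(C - np.diag(np.diag(C))).max() < 1e-8
```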

Page 17: Feature Decorrelation on Speech Recognition

Principal Component Analysis -2

• (Total) covariance matrix calculated using the 25-hour training data from MATBN

[Figure: surface plots of the covariance matrix for the 162-dim spliced Mel-filterbank features and for the 39-dim PCA-transformed subspace]

Page 18: Feature Decorrelation on Speech Recognition

Linear Discriminant Analysis -1

• Linear Discriminant Analysis (LDA)
  – Seeks a linear transformation matrix $\Theta$ that satisfies

$$\Theta = \arg\max_{\Theta} \ \mathrm{trace}\!\left( (\Theta^T \mathbf{S}_W \Theta)^{-1} (\Theta^T \mathbf{S}_B \Theta) \right)$$

  – $\Theta$ is formed by the largest eigenvectors of $\mathbf{S}_W^{-1} \mathbf{S}_B$ (see the sketch below)
  – The LDA subspace is not orthogonal (but the PCA subspace is): $\Theta^T \Theta \neq \mathbf{I}$
  – The LDA subspace makes the transformed variables statistically uncorrelated
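A minimal sketch (hypothetical data and labels) of LDA as the generalized symmetric eigenproblem $\mathbf{S}_B \mathbf{v} = \lambda \mathbf{S}_W \mathbf{v}$, solved with SciPy:

```python
import numpy as np
from scipy.linalg import eigh

def lda_transform(X, labels, p):
    """Return the p-column LDA transform Theta from data X and class labels."""
    d = X.shape[1]
    mu = X.mean(axis=0)
    S_W = np.zeros((d, d))           # within-class scatter
    S_B = np.zeros((d, d))           # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        S_W += (Xc - mc).T @ (Xc - mc)
        S_B += len(Xc) * np.outer(mc - mu, mc - mu)
    eigvals, Theta = eigh(S_B, S_W)  # solves S_B v = lambda S_W v
    return Theta[:, np.argsort(eigvals)[::-1][:p]]
```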


Page 19: Feature Decorrelation on Speech Recognition

Linear Discriminant Analysis -2

• From any two distinct eigenvalue/eigenvector pairs $(\lambda_i, \boldsymbol{\theta}_i)$ and $(\lambda_j, \boldsymbol{\theta}_j)$:

$$\mathbf{S}_B \boldsymbol{\theta}_i = \lambda_i \mathbf{S}_W \boldsymbol{\theta}_i, \qquad \mathbf{S}_B \boldsymbol{\theta}_j = \lambda_j \mathbf{S}_W \boldsymbol{\theta}_j$$

  – Pre-multiplying by $\boldsymbol{\theta}_j^T$ and $\boldsymbol{\theta}_i^T$, respectively:

$$\boldsymbol{\theta}_j^T \mathbf{S}_B \boldsymbol{\theta}_i = \lambda_i \, \boldsymbol{\theta}_j^T \mathbf{S}_W \boldsymbol{\theta}_i, \qquad \boldsymbol{\theta}_i^T \mathbf{S}_B \boldsymbol{\theta}_j = \lambda_j \, \boldsymbol{\theta}_i^T \mathbf{S}_W \boldsymbol{\theta}_j$$

  – Because $\mathbf{S}_B$ and $\mathbf{S}_W$ are symmetric, $\boldsymbol{\theta}_j^T \mathbf{S}_B \boldsymbol{\theta}_i = \boldsymbol{\theta}_i^T \mathbf{S}_B \boldsymbol{\theta}_j$ and $\boldsymbol{\theta}_j^T \mathbf{S}_W \boldsymbol{\theta}_i = \boldsymbol{\theta}_i^T \mathbf{S}_W \boldsymbol{\theta}_j$; subtracting the two equations gives $(\lambda_i - \lambda_j)\, \boldsymbol{\theta}_j^T \mathbf{S}_W \boldsymbol{\theta}_i = 0$, and since $\lambda_i \neq \lambda_j$:

$$\therefore \ \boldsymbol{\theta}_i^T \mathbf{S}_W \boldsymbol{\theta}_j = 0, \quad \forall i \neq j$$

Ref. (Krzanowski, 1988)

Page 20: Feature Decorrelation on Speech Recognition

Linear Discriminant Analysis -3

• To overcome the arbitrary scaling of $\boldsymbol{\theta}_i$, it is usual to adopt the normalization

$$\boldsymbol{\theta}_i^T \mathbf{S}_W \boldsymbol{\theta}_i = 1$$

$$\therefore \ \boldsymbol{\theta}_i^T \mathbf{S}_W \boldsymbol{\theta}_i = 1, \ \boldsymbol{\theta}_i^T \mathbf{S}_B \boldsymbol{\theta}_i = \lambda_i \ \Rightarrow \ \Theta^T \mathbf{S}_W \Theta = \mathbf{I}, \quad \Theta^T \mathbf{S}_B \Theta = \boldsymbol{\Lambda} = \begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_p \end{pmatrix}$$

• The total covariance matrix $\mathbf{S}_T = \mathbf{S}_W + \mathbf{S}_B$ is therefore also transformed into a diagonal matrix: $\Theta^T \mathbf{S}_T \Theta = \mathbf{I} + \boldsymbol{\Lambda}$

Page 21: Feature Decorrelation on Speech Recognition

Linear Discriminant Analysis -4

• (Total) covariance matrix calculated using the 25-hour training data from MATBN

[Figure: surface plots of the covariance matrix for the 162-dim spliced Mel-filterbank features and for the 39-dim LDA-transformed subspace]

Page 22: Feature Decorrelation on Speech Recognition

Maximum Likelihood Linear Transform

• The Maximum Likelihood Linear Transform (MLLT) seems to come in two types:
  – Feature-based: (Gopinath, 1998)
  – Model-based: (Olsen, 2002), (Olsen, 2004)
• The common goal of the two types:
  – Find a global linear transformation matrix that decorrelates features

Page 23: Feature Decorrelation on Speech Recognition

Feature-based MLLT -1

• Feature-based Maximum Likelihood Linear Transform (F-MLLT)
  – Tries to alleviate four problems:
    (a) Data insufficiency, implying unreliable models
    (b) Large storage requirements
    (c) Large computational requirements
    (d) ML is not discriminating between classes
  – Solutions:
    (a)-(c): Sharing parameters across classes
    (d): Appealing to LDA

Page 24: Feature Decorrelation on Speech Recognition

Feature-based MLLT -2

• Maximum Likelihood Modeling
  – The likelihood of the training data $\{(\mathbf{x}_i, l_i)\}$ is given by

$$p(\{\mathbf{x}_i\}_1^N, \{\boldsymbol{\mu}_j\}, \{\boldsymbol{\Sigma}_j\}) = \prod_{i=1}^{N} \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}_{l_i}|^{1/2}} \, e^{-\frac{1}{2} (\mathbf{x}_i - \boldsymbol{\mu}_{l_i})^T \boldsymbol{\Sigma}_{l_i}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_{l_i})} = \frac{e^{-\frac{1}{2} \sum_j n_j \left( (\mathbf{m}_j - \boldsymbol{\mu}_j)^T \boldsymbol{\Sigma}_j^{-1} (\mathbf{m}_j - \boldsymbol{\mu}_j) + \mathrm{trace}(\boldsymbol{\Sigma}_j^{-1} \mathbf{S}_j) + \log|\boldsymbol{\Sigma}_j| \right)}}{(2\pi)^{Nd/2}}$$

(see Appendix A)

  – Log-Likelihood:

$$\log p(\{\mathbf{x}_i\}_1^N, \{\boldsymbol{\mu}_j\}, \{\boldsymbol{\Sigma}_j\}) = -\frac{Nd}{2} \log(2\pi) - \frac{1}{2} \sum_{j=1}^{C} n_j \left( (\mathbf{m}_j - \boldsymbol{\mu}_j)^T \boldsymbol{\Sigma}_j^{-1} (\mathbf{m}_j - \boldsymbol{\mu}_j) + \mathrm{trace}(\boldsymbol{\Sigma}_j^{-1} \mathbf{S}_j) + \log|\boldsymbol{\Sigma}_j| \right)$$

where $\mathbf{m}_j$ and $\mathbf{S}_j$ are the sample mean and sample covariance of class $j$ (the ML estimators).

Page 25: Feature Decorrelation on Speech Recognition

Feature-based MLLT -3

• The idea of maximum likelihood estimation (MLE) is to choose the parameters $\{\hat{\boldsymbol{\mu}}_j\}$ and $\{\hat{\boldsymbol{\Sigma}}_j\}$ so as to maximize $\log p(\{\mathbf{x}_i\}_1^N, \{\boldsymbol{\mu}_j\}, \{\boldsymbol{\Sigma}_j\})$
  – ML Estimators: $\{\hat{\boldsymbol{\mu}}_j\}$ and $\{\hat{\boldsymbol{\Sigma}}_j\}$

What's the difference between "estimator" and "estimate"?

Page 26: Feature Decorrelation on Speech Recognition

Feature-based MLLT -4

• Multiclass ML Modeling
  – The training data is modeled with Gaussians $(\hat{\boldsymbol{\mu}}_j, \hat{\boldsymbol{\Sigma}}_j)$
  – The ML estimators: $\hat{\boldsymbol{\mu}}_j = \mathbf{m}_j, \ \hat{\boldsymbol{\Sigma}}_j = \mathbf{S}_j$ (see Appendix B)
  – The log-ML value:

$$\log p^*(\{\mathbf{x}_i\}_1^N) = \log p(\{\mathbf{x}_i\}_1^N, \{\hat{\boldsymbol{\mu}}_j\}, \{\hat{\boldsymbol{\Sigma}}_j\}) = -\frac{Nd}{2} \log(2\pi) - \frac{Nd}{2} - \frac{1}{2} \sum_{j=1}^{C} n_j \log|\mathbf{S}_j| = g(N, d) - \frac{1}{2} \sum_{j=1}^{C} n_j \log|\mathbf{S}_j|$$

  – There is "no interaction" between the classes, and therefore unconstrained ML modeling is not "discriminating"

Page 27: Feature Decorrelation on Speech Recognition

Feature-based MLLT -5

• Constrained ML - Diagonal Covariance
  – The ML estimators: $\hat{\boldsymbol{\mu}}_j = \mathbf{m}_j, \ \hat{\boldsymbol{\Sigma}}_j = \mathrm{diag}(\mathbf{S}_j)$
  – The log-ML value:

$$\log p^*_{\mathrm{diag}}(\{\mathbf{x}_i\}_1^N) = g(N, d) - \frac{1}{2} \sum_{j=1}^{C} n_j \log|\mathrm{diag}(\mathbf{S}_j)|$$

  – If one linearly transforms the data with $\Theta$ and models it using a diagonal Gaussian, the ML value is

$$\log p^*_{\mathrm{diag}}(\{\mathbf{x}_i\}_1^N) = g(N, d) - \frac{1}{2} \sum_{j=1}^{C} n_j \log|\mathrm{diag}(\Theta \mathbf{S}_j \Theta^T)| + N \log|\Theta|$$

where the $N \log|\Theta|$ term comes from the Jacobian of the transform (see Appendix C). How should $\Theta$ be chosen for each class?

Page 28: Feature Decorrelation on Speech Recognition

Feature-based MLLT -6

• Multiclass ML Modeling - Some Issues
  – If the sample size for each class is not large enough, then the ML parameter estimates may have large variance and hence be unreliable
  – The storage requirement for the model: $O(Cd^2)$
  – The computational requirement: $O(Cd^2)$
  – The parameters for each class are obtained independently: the ML principle does not allow for discrimination between classes (why?)

Page 29: Feature Decorrelation on Speech Recognition

Feature-based MLLT -7

• Multiclass ML Modeling - Some Issues (cont.)
  – Sharing parameters across classes reduces the number of parameters, the storage requirements, and the computational requirements
  – It is hard to justify that parameter sharing is more discriminating
  – We can appeal to Fisher's criterion of LDA and a result of Campbell to argue that sometimes constrained ML modeling is discriminating

But what is "discriminating"?

Page 30: Feature Decorrelation on Speech Recognition

Feature-based MLLT -8

• Multiclass ML Modeling - Some Issues (cont.)
  – We can globally transform the data with a unimodular matrix $\Theta$, $\det(\Theta) = 1$, and model the transformed data with diagonal Gaussians (there is a loss in likelihood, too)
  – Among all possible transformations $\Theta$, we can choose the one that incurs the least loss in likelihood (in essence, we will find a linearly transformed (shared) feature space in which the diagonal Gaussian assumption is almost valid)

Page 31: Feature Decorrelation on Speech Recognition

Feature-based MLLT -9

• Other constrained MLEs with sharing of parameters:
  – Equal covariance
  – Clustering
• Covariance Diagonalization and Cluster Transformation
  – Classes are grouped into clusters
  – Each cluster is modeled with a diagonal Gaussian in a transformed feature space
  – The ML estimators: $\hat{\boldsymbol{\mu}}_j = \mathbf{m}_j, \ \hat{\boldsymbol{\Sigma}}_j = \Theta_{C_j}^{-1} \, \mathrm{diag}(\Theta_{C_j} \mathbf{S}_j \Theta_{C_j}^T) \, \Theta_{C_j}^{-T}$
  – The ML value:

$$\log p^*_{\mathrm{diag}}(\{\mathbf{x}_i\}_1^N) = g(N, d) - \frac{1}{2} \sum_{j=1}^{C} n_j \log|\mathrm{diag}(\Theta_{C_j} \mathbf{S}_j \Theta_{C_j}^T)| + \sum_{j=1}^{C} n_j \log|\Theta_{C_j}|$$

Page 32: Feature Decorrelation on Speech Recognition

Feature-based MLLT -10

• One Cluster with Diagonal Covariance
  – When the number of clusters is one, there is a single global transformation, and the classes are modeled as diagonal Gaussians in this feature space
  – The ML estimators: $\hat{\boldsymbol{\mu}}_j = \mathbf{m}_j, \ \hat{\boldsymbol{\Sigma}}_j = \Theta^{-1} \, \mathrm{diag}(\Theta \mathbf{S}_j \Theta^T) \, \Theta^{-T}$
  – The log-ML value:

$$\log p^*_{\mathrm{diag}}(\{\mathbf{x}_i\}_1^N) = g(N, d) - \frac{1}{2} \sum_{j=1}^{C} n_j \log|\mathrm{diag}(\Theta \mathbf{S}_j \Theta^T)| + N \log|\Theta|$$

  – The optimal $\Theta$ can be obtained by the optimization

$$\Theta = \arg\min_{\Theta} \ \frac{1}{2} \sum_{j=1}^{C} n_j \log|\mathrm{diag}(\Theta \mathbf{S}_j \Theta^T)| - N \log|\Theta|$$

Page 33: Feature Decorrelation on Speech Recognition

Feature-based MLLT -11

• Optimization - the numerical approach:
  – The objective function:

$$F(\Theta) = \frac{1}{2} \sum_{j=1}^{C} n_j \log|\mathrm{diag}(\Theta \mathbf{S}_j \Theta^T)| - N \log|\Theta|$$

  – Differentiating $F$ with respect to $\Theta$ gives the derivative $G$:

$$G(\Theta) = \sum_j n_j \, \mathrm{diag}(\Theta \mathbf{S}_j \Theta^T)^{-1} \Theta \mathbf{S}_j - N \Theta^{-T}$$

  – Directly optimizing the objective function is nontrivial: it requires numerical optimization techniques, and a full matrix must be stored for each class (see the sketch below)
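A rough illustration (my own sketch, not Gopinath's exact recipe) of minimizing $F(\Theta)$ by plain gradient descent using the derivative above; the step size and convergence checking are deliberately naive:

```python
import numpy as np

def f_mllt(S, n, iters=500, lr=1e-3):
    """S: (C, d, d) class sample covariances; n: (C,) class sample counts."""
    d = S.shape[1]
    N = n.sum()
    Theta = np.eye(d)                       # start at the identity (det > 0)
    for _ in range(iters):
        # G = sum_j n_j diag(Theta S_j Theta^T)^{-1} Theta S_j - N Theta^{-T}
        G = -N * np.linalg.inv(Theta).T
        for Sj, nj in zip(S, n):
            TS = Theta @ Sj
            G += nj * TS / np.diag(TS @ Theta.T)[:, None]
        Theta -= lr * G                     # naive step; assumes det(Theta) stays positive
    return Theta
```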


Page 34: Feature Decorrelation on Speech Recognition

Outline

• Introduction to Feature Decorrelation (FD)
  – FD on Speech Recognition
• Feature-based Decorrelation (DCT, PCA, LDA, F-MLLT)
• Model-based Decorrelation (M-MLLT, EMLLT, Semi-tied covariance matrix, MLT)
• Common Principal Component (CPC)

Page 35: Feature Decorrelation on Speech Recognition

Model-based MLLT -1

• In model-based MLLT, instead of having a distinct covariance matrix for every component in the recognizer, each covariance matrix consists of two elements:
  – A non-singular linear transformation matrix $\Theta$, shared over a set of components
  – The diagonal elements in the matrix $\boldsymbol{\Lambda}_j$ for each class $j$:

$$\boldsymbol{\Lambda}_j = \begin{pmatrix} \lambda_{j1} & & 0 \\ & \ddots & \\ 0 & & \lambda_{jd} \end{pmatrix}$$

• The precision matrices are constrained to be of the form

$$\mathbf{P}_j = \boldsymbol{\Sigma}_j^{-1} = \Theta^T \boldsymbol{\Lambda}_j \Theta = \sum_{k=1}^{d} \lambda_{jk} \, \boldsymbol{\theta}_k \boldsymbol{\theta}_k^T$$

where $\mathbf{P}_j$ is the precision matrix and $\boldsymbol{\Lambda}_j$ a diagonal matrix.

Ref. (Olsen, 2004)

Page 36: Feature Decorrelation on Speech Recognition

Model-based MLLT -2

• M-MLLT fits within the standard maximum-likelihood criterion used for training HMMs, with each state represented by a GMM:

$$f(\mathbf{x} \,|\, \Theta) = \sum_{j=1}^{m} \pi_j \, N(\mathbf{x}; \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$$

where $\pi_j$ is the $j$-th component weight.

• Log-likelihood function:

$$L(\Theta : \mathbf{X}) = \frac{1}{N} \sum_{i=1}^{N} \log f(\mathbf{x}_i \,|\, \Theta)$$

Page 37: Feature Decorrelation on Speech Recognition

Model-based MLLT -3

• The parameters $\Theta = \{\pi_1, \boldsymbol{\mu}_1, \boldsymbol{\Lambda}_1, \ldots, \Theta\}$ of the M-MLLT model for each HMM state are estimated using a generalized expectation-maximization (EM) algorithm*
• The auxiliary "Q-function" should be introduced
  – The Q-function satisfies the inequality

$$L(\hat{\Theta} : \mathbf{X}) - L(\Theta : \mathbf{X}) \geq Q(\Theta, \hat{\Theta}) - Q(\Theta, \Theta)$$

where $\hat{\Theta} = \{\hat{\pi}_1, \hat{\boldsymbol{\mu}}_1, \hat{\boldsymbol{\Lambda}}_1, \ldots, \hat{\Theta}\}$ denotes the new parameters and $\Theta$ the old ones.

* A. P. Dempster et al., "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc. B, vol. 39, pp. 1-38, 1977.

Page 38: Feature Decorrelation on Speech Recognition

Model-based MLLT -4

• The auxiliary Q-function is

$$Q(\Theta; \hat{\Theta}) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{m} \gamma_{ij} \, L_{ij}(\mathbf{x}_i)$$

where the component log-likelihood of observation $i$ is

$$L_{ij}(\mathbf{x}_i) = \log\!\left( \pi_j N(\mathbf{x}_i; \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) \right) = \frac{1}{2} \left( 2 \log \pi_j - d \log(2\pi) + \log|\mathbf{P}_j| - \mathrm{trace}\!\left( \mathbf{P}_j (\mathbf{x}_i - \boldsymbol{\mu}_j)(\mathbf{x}_i - \boldsymbol{\mu}_j)^T \right) \right)$$

and the a posteriori probability of Gaussian component $j$ given observation $i$ is

$$\gamma_{ij} = \frac{\exp(\hat{L}_{ij}(\mathbf{x}_i))}{\sum_{k=1}^{m} \exp(\hat{L}_{ik}(\mathbf{x}_i))}$$

Page 39: Feature Decorrelation on Speech Recognition

Model-based MLLT -5

• Gales' Approach for deriving $\Theta$
  1. Estimate the mean and the component weight, which are independent of the other model parameters:

$$\pi_j = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ij}, \qquad \boldsymbol{\mu}_j = \frac{\sum_{i=1}^{N} \gamma_{ij} \, \mathbf{x}_i}{\sum_{i=1}^{N} \gamma_{ij}}$$

  2. Use the current estimate of the transform $\Theta$, and estimate the set of class-specific diagonal variances:

$$\lambda_{ji} = \boldsymbol{\theta}_i \mathbf{S}_j \boldsymbol{\theta}_i^T$$

where $\boldsymbol{\theta}_i$ is the $i$-th row vector of $\Theta$ and $\lambda_{ji}$ the $i$-th entry of the diagonal variance of component $j$.

Ref. (Gales, 1999)

Page 40: Feature Decorrelation on Speech Recognition

Model-based MLLT -6

• Gales' Approach for deriving $\Theta$ (cont.)
  3. Estimate the transform $\Theta$ using the current set of diagonal covariances (a sketch follows):

$$\hat{\boldsymbol{\theta}}_i = \mathbf{c}_i \mathbf{G}_i^{-1} \sqrt{\frac{N}{\mathbf{c}_i \mathbf{G}_i^{-1} \mathbf{c}_i^T}}, \qquad \mathbf{G}_i = \sum_{j} \frac{n_j}{\hat{\sigma}_{ji}^2} \mathbf{S}_j$$

where $\mathbf{c}_i$ is the $i$-th row vector of the cofactors of the current $\Theta$ (see Appendix D).

  4. Go to step 2 until convergence, or an appropriate criterion is satisfied
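A minimal sketch (assumed inputs) of step 3, the closed-form row-by-row update from Gales (1999); in the full algorithm it alternates with re-estimating the diagonal variances in step 2:

```python
import numpy as np

def update_transform_rows(Theta, S, n, sigma2, iters=10):
    """Theta: (d, d); S: (C, d, d) class covariances; n: (C,) counts;
    sigma2: (C, d) current diagonal variances, sigma2[j, i] = theta_i S_j theta_i^T."""
    d = Theta.shape[0]
    N = n.sum()
    for _ in range(iters):
        for i in range(d):
            G_i = sum(nj / s2[i] * Sj for Sj, nj, s2 in zip(S, n, sigma2))
            # Row i of the cofactor matrix: cof(Theta) = det(Theta) * Theta^{-T}
            c_i = np.linalg.det(Theta) * np.linalg.inv(Theta).T[i]
            cG = c_i @ np.linalg.inv(G_i)
            Theta[i] = cG * np.sqrt(N / (cG @ c_i))
    return Theta
```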


Page 41: Feature Decorrelation on Speech Recognition

Semi-Tied Covariance Matrices -1

• Introduction to Semi-Tied Covariance Matrices
  – A natural extension of the state-specific rotation scheme
  – The transform is estimated in a maximum-likelihood (ML) fashion given the current model parameters
  – The optimization is performed using a simple iterative scheme, which is guaranteed to increase the likelihood of the training data
  – An alternative approach to solving the optimization problem of DHLDA

Page 42: Feature Decorrelation on Speech Recognition

Semi-Tied Covariance Matrices -2

• State-Specific Rotation
  – A full covariance matrix is calculated for each state in the system and decomposed into its eigenvectors and eigenvalues:

$$\boldsymbol{\Sigma}^{(s)}_{\mathrm{full}} = \mathbf{U}^{(s)} \boldsymbol{\Lambda}^{(s)} \mathbf{U}^{(s)T}$$

  – All data from that state is then decorrelated using the calculated eigenvectors:

$$\mathbf{o}^{(s)}(\tau) = \mathbf{U}^{(s)T} \mathbf{o}(\tau)$$

  – Multiple diagonal-covariance Gaussian components are then trained

Ref. (Gales, 1999), (Kumar, 1998)

Page 43: Feature Decorrelation on Speech Recognition

Semi-Tied Covariance Matrices -3

• Drawbacks of State-Specific Rotation
  – It does not fit within the standard ML estimation framework for training HMMs
  – The transforms are not related to the multiple-component models being used to model the data

Page 44: Feature Decorrelation on Speech Recognition

Semi-Tied Covariance Matrices -4

• Semi-Tied Covariance Matrices
  – Instead of having a distinct covariance matrix for every component in the recognizer, each covariance matrix consists of two elements:

$$\boldsymbol{\Sigma}^{(m)} = \mathbf{H}^{(r)} \boldsymbol{\Sigma}^{(m)}_{\mathrm{diag}} \mathbf{H}^{(r)T}$$

where $\boldsymbol{\Sigma}^{(m)}_{\mathrm{diag}}$ is the component-specific diagonal covariance element, and $\mathbf{H}^{(r)}$ is the semi-tied class-dependent, non-diagonal matrix, which may be tied over a set of components.

Page 45: Feature Decorrelation on Speech Recognition

Semi-Tied Covariance Matrices -5

• Semi-Tied Covariance Matrices (cont.)
  – It is very complex to optimize these parameters directly, so an expectation-maximization approach is adopted
  – Parameters for each component $m$:

$$M = \{\pi^{(m)}, \boldsymbol{\mu}^{(m)}, \boldsymbol{\Sigma}^{(m)}_{\mathrm{diag}}, \Theta^{(r)}\}, \qquad \Theta^{(r)} = \mathbf{H}^{(r)-1}$$

Page 46: Feature Decorrelation on Speech Recognition

Semi-Tied Covariance Matrices -6

• The auxiliary Q-function:

$$Q(M, \hat{M}) = K - \frac{1}{2} \sum_{\tau,\, m \in M^{(r)}} \gamma_m(\tau) \left( \log\frac{|\hat{\boldsymbol{\Sigma}}^{(m)}_{\mathrm{diag}}|}{|\hat{\Theta}^{(r)}|^2} + (\mathbf{o}(\tau) - \hat{\boldsymbol{\mu}}^{(m)})^T \hat{\Theta}^{(r)T} \hat{\boldsymbol{\Sigma}}^{(m)-1}_{\mathrm{diag}} \hat{\Theta}^{(r)} (\mathbf{o}(\tau) - \hat{\boldsymbol{\mu}}^{(m)}) \right)$$

where $\gamma_m(\tau) = p(q_m(\tau) \,|\, M, \mathbf{O}_T)$ is the posterior probability of component $m$ at time $\tau$.

Page 47: Feature Decorrelation on Speech Recognition

Semi-Tied Covariance Matrices -7

• If all the model parameters are to be simultaneously optimized, then the Q-function may be rewritten as

$$Q(M, \hat{M}) = K - \frac{1}{2} \sum_{\tau,\, m \in M^{(r)}} \gamma_m(\tau) \log\frac{|\mathrm{diag}(\hat{\Theta}^{(r)} \mathbf{W}^{(m)} \hat{\Theta}^{(r)T})|}{|\hat{\Theta}^{(r)}|^2}$$

where

$$\mathbf{W}^{(m)} = \frac{\sum_\tau \gamma_m(\tau) \, (\mathbf{o}(\tau) - \hat{\boldsymbol{\mu}}^{(m)})(\mathbf{o}(\tau) - \hat{\boldsymbol{\mu}}^{(m)})^T}{\sum_\tau \gamma_m(\tau)}, \qquad \hat{\boldsymbol{\mu}}^{(m)} = \frac{\sum_\tau \gamma_m(\tau) \, \mathbf{o}(\tau)}{\sum_\tau \gamma_m(\tau)}, \qquad \beta = \sum_{\tau,\, m \in M^{(r)}} \gamma_m(\tau)$$

Page 48: Feature Decorrelation on Speech Recognition

Semi-Tied Covariance Matrices -8

• The ML estimate of the diagonal element of the covariance matrix is given by

$$\hat{\boldsymbol{\Sigma}}^{(m)}_{\mathrm{diag}} = \mathrm{diag}\!\left( \hat{\Theta}^{(r)} \mathbf{W}^{(m)} \hat{\Theta}^{(r)T} \right)$$

• The re-estimation formulae for the component weights and transition probabilities are identical to the standard HMM cases (Rabiner, 1989)
• Unfortunately, optimizing the new Q-function is nontrivial and more complicated, so an alternative approach is proposed next

Page 49: Feature Decorrelation on Speech Recognition

Semi-Tied Covariance Matrices -9

• Gales' approach for optimizing the Q-function
  1. Estimate the mean, which is independent of the other model parameters
  2. Using the current estimate of the semi-tied transform $\hat{\Theta}^{(r)}$, estimate the set of component-specific diagonal variances $\hat{\boldsymbol{\Sigma}}^{(m)}_{\mathrm{diag}} = \mathrm{diag}(\hat{\Theta}^{(r)} \mathbf{W}^{(m)} \hat{\Theta}^{(r)T})$. This set of parameters will be denoted as $\{\hat{\boldsymbol{\Sigma}}^{(r)}_{\mathrm{diag}}\} = \{\hat{\boldsymbol{\Sigma}}^{(m)}_{\mathrm{diag}}, \ m \in M^{(r)}\}$
  3. Estimate the transform $\hat{\Theta}^{(r)}$ using the current set $\{\hat{\boldsymbol{\Sigma}}^{(r)}_{\mathrm{diag}}\}$
  4. Go to (2) until convergence, or an appropriate criterion is satisfied

Page 50: Feature Decorrelation on Speech Recognition

Semi-Tied Covariance Matrices -10

• How to carry out step (3)?
  – Optimizing the semi-tied transform requires an iterative estimation scheme, even after fixing all the other model parameters
  – Selecting a particular row $\hat{\boldsymbol{\theta}}^{(r)}_i$ of $\hat{\Theta}^{(r)}$, the former Q-function can be rewritten using the current set $\{\hat{\boldsymbol{\Sigma}}^{(r)}_{\mathrm{diag}}\}$ as

$$Q(M, \hat{M}; \{\hat{\boldsymbol{\Sigma}}^{(r)}_{\mathrm{diag}}\}) = K + \frac{1}{2} \sum_{\tau,\, m \in M^{(r)}} \gamma_m(\tau) \left( \log\!\left( (\hat{\boldsymbol{\theta}}^{(r)}_i \mathbf{c}_i^T)^2 \right) - \sum_{j=1}^{d} \frac{\left( \hat{\boldsymbol{\theta}}^{(r)}_j (\mathbf{o}(\tau) - \hat{\boldsymbol{\mu}}^{(m)}) \right)^2}{\hat{\sigma}^{(m)2}_{\mathrm{diag},\,j}} \right)$$

where $\mathbf{c}_i$ is the $i$-th row vector of the cofactors of $\hat{\Theta}^{(r)}$ and $\hat{\sigma}^{(m)2}_{\mathrm{diag},\,j}$ is the $j$-th leading diagonal element of $\hat{\boldsymbol{\Sigma}}^{(m)}_{\mathrm{diag}}$.

Page 51: Feature Decorrelation on Speech Recognition

Semi-Tied Covariance Matrices -11

• How to carry out step (3)? (cont.)
  – It can be shown that the ML estimate for the $i$-th row of the semi-tied transform, $\hat{\boldsymbol{\theta}}^{(r)}_i$, is given by

$$\hat{\boldsymbol{\theta}}^{(r)}_i = \mathbf{c}_i \mathbf{G}^{(r)-1}_i \sqrt{\frac{\beta}{\mathbf{c}_i \mathbf{G}^{(r)-1}_i \mathbf{c}_i^T}}, \qquad \mathbf{G}^{(r)}_i = \sum_{m \in M^{(r)}} \frac{\sum_\tau \gamma_m(\tau)}{\hat{\sigma}^{(m)2}_{\mathrm{diag},\,i}} \mathbf{W}^{(m)}$$

This has the same form as the M-MLLT row update sketched earlier.

Page 52: Feature Decorrelation on Speech Recognition

Semi-Tied Covariance Matrices -12

• It can be shown that

$$Q(M, \hat{M}) \geq Q(M, \hat{M}; \{\hat{\boldsymbol{\Sigma}}^{(r)}_{\mathrm{diag}}\})$$

with equality when the diagonal elements of the covariance matrix are given by $\hat{\boldsymbol{\Sigma}}^{(m)}_{\mathrm{diag}} = \mathrm{diag}(\hat{\Theta}^{(r)} \mathbf{W}^{(m)} \hat{\Theta}^{(r)T})$

• During recognition the log-likelihood is based on

$$\log\!\left( N(\mathbf{o}(\tau); \boldsymbol{\mu}^{(m)}, \boldsymbol{\Sigma}^{(m)}, \Theta^{(r)}) \right) = \log\!\left( N(\Theta^{(r)} \mathbf{o}(\tau); \Theta^{(r)} \boldsymbol{\mu}^{(m)}, \boldsymbol{\Sigma}^{(m)}_{\mathrm{diag}}) \right) + \frac{1}{2} \log\!\left( |\Theta^{(r)}|^2 \right)$$
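A minimal sketch (assumed shapes) of this recognition-time computation: the observation is projected by $\Theta^{(r)}$, a diagonal Gaussian is evaluated, and the constant Jacobian term $\frac{1}{2}\log|\Theta^{(r)}|^2$ is added back:

```python
import numpy as np

def semitied_loglik(o, mu, var_diag, Theta_r):
    """o, mu: (d,); var_diag: (d,) diagonal variances; Theta_r: (d, d).
    In practice Theta_r @ mu and the Jacobian term are precomputed once."""
    z = Theta_r @ o - Theta_r @ mu       # decorrelated residual
    d = o.shape[0]
    log_gauss = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(var_diag))
                        + np.sum(z ** 2 / var_diag))
    # 0.5 * log(|det(Theta_r)|^2), via slogdet for numerical safety
    return log_gauss + 0.5 * np.linalg.slogdet(Theta_r @ Theta_r.T)[1]
```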

Page 53: Feature Decorrelation on Speech Recognition

Extended MLLT -1

• The extended MLLT (EMLLT) is very similar to M-MLLT; the only difference is the precision matrix modeling:

$$\mathbf{P}_j = \boldsymbol{\Sigma}_j^{-1} = \Theta^T \boldsymbol{\Lambda}_j \Theta = \sum_{k=1}^{D} \lambda_{jk} \, \boldsymbol{\theta}_k \boldsymbol{\theta}_k^T$$

where the rank-one matrices $\boldsymbol{\theta}_k \boldsymbol{\theta}_k^T$ form a basis.

  – Note that in EMLLT $D \geq d$, while in M-MLLT $D = d$
  – The $\lambda_{jk}$ are not required to be positive!
  – The $\{\lambda_{jk}\}$ have to be chosen such that $\mathbf{P}_j$ is positive definite
  – The authors provide two algorithms for iterative update of the parameters

Ref. (Olsen, 2002), (Olsen, 2004)
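A minimal sketch (hypothetical dimensions) of constructing an EMLLT precision matrix from $D \geq d$ rank-one basis elements and checking positive definiteness:

```python
import numpy as np

d, D = 13, 26                            # hypothetical feature dim and basis size
rng = np.random.default_rng(0)
Theta = rng.standard_normal((D, d))      # rows theta_k of the (non-square) basis
lam_j = rng.standard_normal(D)           # per-class coefficients; may be negative

# P_j = sum_k lam_jk * outer(theta_k, theta_k) = Theta^T Lambda_j Theta
P_j = Theta.T @ np.diag(lam_j) @ Theta

# The lambdas must keep P_j positive definite for a valid Gaussian
is_pd = np.all(np.linalg.eigvalsh(P_j) > 0)
```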

Page 54: Feature Decorrelation on Speech Recognition

Extended MLLT -2

• The highlight of EMLLT:
  – More flexible: the number of basis elements $\{\boldsymbol{\theta}_k \boldsymbol{\theta}_k^T\}$ can be gradually varied from $d$ (MLLT) to $d(d+1)/2$ (full covariance) by controlling the value of $D$

Page 55: Feature Decorrelation on Speech Recognition

Outline

• Introduction to Feature Decorrelation (FD)
  – FD on Speech Recognition
• Feature-based Decorrelation (DCT, PCA, LDA, F-MLLT)
• Model-based Decorrelation (M-MLLT, EMLLT, Semi-tied covariance matrix, MLT)
• Common Principal Component (CPC)

Page 56: Feature Decorrelation on Speech Recognition

Common Principal Components -1

• Why Common Principal Components (CPC)?
  – We often deal with the situation of the same variables being measured on objects from different groups, and the covariance structure may vary from group to group
  – But sometimes the covariance matrices of different groups look somehow similar, and it seems reasonable to assume that they have a common basic structure
• Goal of CPC
  – To find a rotation that diagonalizes the covariance matrices simultaneously

Ref. (Flury, 1984), (Flury, 1986)

Page 57: Feature Decorrelation on Speech Recognition

Common Principal Components -2

• Hypothesis of CPCs:

$$H_c: \ \mathbf{B}^T \boldsymbol{\Sigma}_i \mathbf{B} = \boldsymbol{\Lambda}_i, \quad i = 1, \ldots, k$$

where each $\boldsymbol{\Lambda}_i$ is a diagonal matrix.

  – The common principal components (CPCs): $\mathbf{U}_i = \mathbf{B}^T \mathbf{x}_i$
  – Note that no canonical ordering of the columns of $\mathbf{B}$ need be given, since the rank order of the diagonal elements of the $\boldsymbol{\Lambda}_i$ is not necessarily the same for all groups

Page 58: Feature Decorrelation on Speech Recognition

Common Principal Components -3

• Assume the $n_i \mathbf{S}_i$ are independently distributed as $W_p(n_i, \boldsymbol{\Sigma}_i)$. The common likelihood function is

$$L(\boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_k) = C \times \prod_{i=1}^{k} \exp\!\left( \mathrm{trace}\!\left( -\frac{n_i}{2} \boldsymbol{\Sigma}_i^{-1} \mathbf{S}_i \right) \right) |\boldsymbol{\Sigma}_i|^{-n_i/2}$$

• Instead of maximizing the likelihood function, minimize

$$g(\boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_k) = -2 \log L(\boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_k) + 2 \log C = \sum_{i=1}^{k} n_i \left( \log|\boldsymbol{\Sigma}_i| + \mathrm{trace}(\boldsymbol{\Sigma}_i^{-1} \mathbf{S}_i) \right)$$

Page 59: Feature Decorrelation on Speech Recognition

Common Principal Components -4

• Assume $H_c$ holds for some orthogonal matrix $\mathbf{B} = [\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_p]$, and $\boldsymbol{\Lambda}_i = \mathrm{diag}(\lambda_{i1}, \ldots, \lambda_{ip})$; then

$$\log|\boldsymbol{\Sigma}_i| = \sum_{j=1}^{p} \log \lambda_{ij}, \quad i = 1, \ldots, k$$

$$\mathrm{trace}(\boldsymbol{\Sigma}_i^{-1} \mathbf{S}_i) = \mathrm{trace}(\mathbf{B} \boldsymbol{\Lambda}_i^{-1} \mathbf{B}^T \mathbf{S}_i) = \mathrm{trace}(\boldsymbol{\Lambda}_i^{-1} \mathbf{B}^T \mathbf{S}_i \mathbf{B}) = \sum_{j=1}^{p} \frac{\boldsymbol{\beta}_j^T \mathbf{S}_i \boldsymbol{\beta}_j}{\lambda_{ij}}$$

• Therefore

$$g(\boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_k) = g(\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_p, \lambda_{11}, \ldots, \lambda_{1p}, \ldots, \lambda_{k1}, \ldots, \lambda_{kp}) = \sum_{i=1}^{k} n_i \sum_{j=1}^{p} \left( \log \lambda_{ij} + \frac{\boldsymbol{\beta}_j^T \mathbf{S}_i \boldsymbol{\beta}_j}{\lambda_{ij}} \right)$$

Page 60: Feature Decorrelation on Speech Recognition

Common Principal Components -5

• The function $g$ is to be minimized under the restrictions

$$\boldsymbol{\beta}_h^T \boldsymbol{\beta}_j = \begin{cases} 0 & \text{if } h \neq j \\ 1 & \text{if } h = j \end{cases}$$

• Thus we wish to minimize the function

$$G(\boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_k) = g(\boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_k) - \sum_{h=1}^{p} \gamma_h \left( \boldsymbol{\beta}_h^T \boldsymbol{\beta}_h - 1 \right) - 2 \sum_{h < j}^{p} \gamma_{hj} \, \boldsymbol{\beta}_h^T \boldsymbol{\beta}_j$$

Page 61: Feature Decorrelation on Speech Recognition

Common Principal Components -6

• Minimization:

$$\frac{\partial G(\boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_k)}{\partial \lambda_{ij}} = n_i \left( \frac{1}{\lambda_{ij}} - \frac{\boldsymbol{\beta}_j^T \mathbf{S}_i \boldsymbol{\beta}_j}{\lambda_{ij}^2} \right) = 0 \ \Rightarrow \ \lambda_{ij} = \boldsymbol{\beta}_j^T \mathbf{S}_i \boldsymbol{\beta}_j, \quad i = 1, \ldots, k; \ j = 1, \ldots, p$$

Key point! Keep it in mind:

$$\mathrm{trace}(\boldsymbol{\Sigma}_i^{-1} \mathbf{S}_i) = p, \quad i = 1, \ldots, k$$

Page 62: Feature Decorrelation on Speech Recognition

Common Principal Components -7

• Minimization (cont.):

$$\frac{\partial G(\boldsymbol{\Sigma}_1, \ldots, \boldsymbol{\Sigma}_k)}{\partial \boldsymbol{\beta}_j} = \frac{\partial}{\partial \boldsymbol{\beta}_j} \left( \sum_{i=1}^{k} n_i \sum_{j=1}^{p} \left( \log \lambda_{ij} + \frac{\boldsymbol{\beta}_j^T \mathbf{S}_i \boldsymbol{\beta}_j}{\lambda_{ij}} \right) - \sum_{h=1}^{p} \gamma_h (\boldsymbol{\beta}_h^T \boldsymbol{\beta}_h - 1) - 2 \sum_{h < j}^{p} \gamma_{hj} \, \boldsymbol{\beta}_h^T \boldsymbol{\beta}_j \right) = \mathbf{0}$$

$$\Rightarrow \ \sum_{i=1}^{k} \frac{n_i}{\lambda_{ij}} \mathbf{S}_i \boldsymbol{\beta}_j - \gamma_j \boldsymbol{\beta}_j - \sum_{h \neq j} \gamma_{hj} \, \boldsymbol{\beta}_h = \mathbf{0}, \quad j = 1, \ldots, p$$

  – Multiplying on the left by $\boldsymbol{\beta}_j^T$ gives

$$\gamma_j = \sum_{i=1}^{k} n_i \frac{\boldsymbol{\beta}_j^T \mathbf{S}_i \boldsymbol{\beta}_j}{\lambda_{ij}} = \sum_{i=1}^{k} n_i, \quad j = 1, \ldots, p$$

Page 63: Feature Decorrelation on Speech Recognition

Common Principal Components -8

• Minimization (cont.): Thus

$$\sum_{i=1}^{k} \frac{n_i}{\lambda_{ij}} \mathbf{S}_i \boldsymbol{\beta}_j = \left( \sum_{i=1}^{k} n_i \right) \boldsymbol{\beta}_j + \sum_{h \neq j} \gamma_{hj} \, \boldsymbol{\beta}_h, \quad j = 1, \ldots, p$$

  – Multiplying on the left by $\boldsymbol{\beta}_l^T$ ($l \neq j$) implies

$$\sum_{i=1}^{k} n_i \frac{\boldsymbol{\beta}_l^T \mathbf{S}_i \boldsymbol{\beta}_j}{\lambda_{ij}} = \gamma_{lj}, \quad j = 1, \ldots, p, \ l \neq j$$

  – Note that $\boldsymbol{\beta}_j^T \mathbf{S}_i \boldsymbol{\beta}_l = \boldsymbol{\beta}_l^T \mathbf{S}_i \boldsymbol{\beta}_j$ and $\gamma_{jl} = \gamma_{lj}$, so

$$\sum_{i=1}^{k} n_i \frac{\boldsymbol{\beta}_l^T \mathbf{S}_i \boldsymbol{\beta}_j}{\lambda_{il}} = \gamma_{jl}, \quad j = 1, \ldots, p, \ l \neq j$$

Page 64: Feature Decorrelation on Speech Recognition

Common Principal Components -9

• Minimization (cont.):

$$\sum_{i=1}^{k} n_i \frac{\boldsymbol{\beta}_l^T \mathbf{S}_i \boldsymbol{\beta}_j}{\lambda_{ij}} - \sum_{i=1}^{k} n_i \frac{\boldsymbol{\beta}_l^T \mathbf{S}_i \boldsymbol{\beta}_j}{\lambda_{il}} = 0$$

$$\Rightarrow \ \boldsymbol{\beta}_l^T \left( \sum_{i=1}^{k} n_i \frac{\lambda_{il} - \lambda_{ij}}{\lambda_{il} \lambda_{ij}} \mathbf{S}_i \right) \boldsymbol{\beta}_j = 0, \quad l, j = 1, \ldots, p; \ l \neq j \qquad \text{(the optimization objective)}$$

  – These $p(p-1)/2$ equations have to be solved under the orthonormality conditions $\mathbf{B}^T \mathbf{B} = \mathbf{I}$ and $\lambda_{ij} = \boldsymbol{\beta}_j^T \mathbf{S}_i \boldsymbol{\beta}_j$

Page 65: Feature Decorrelation on Speech Recognition

Common Principal Components -10

• Solving procedure of CPC - the FG Algorithm
  – F-Algorithm (a rough sketch is given below)

[Figure: listing of the F-Algorithm]
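The original algorithm listings are figures and are not recoverable from this transcript. The following is a rough paraphrase (my own sketch, not Flury and Gautschi's exact listing) of the FG idea: Jacobi-like sweeps over pairs of columns of $\mathbf{B}$ (F-Algorithm), where each pair is rotated by an inner fixed-point iteration (G-Algorithm) until $\boldsymbol{\beta}_l^T \left( \sum_i n_i \frac{\lambda_{il} - \lambda_{ij}}{\lambda_{il}\lambda_{ij}} \mathbf{S}_i \right) \boldsymbol{\beta}_j = 0$:

```python
import numpy as np

def fg_algorithm(S, n, sweeps=20, inner=10):
    """S: (k, p, p) group covariances; n: (k,) group sizes. Returns orthogonal B."""
    p = S.shape[1]
    B = np.eye(p)
    for _ in range(sweeps):                      # F-Algorithm: sweep over column pairs
        for l in range(p - 1):
            for j in range(l + 1, p):
                Q = B[:, [l, j]]
                T = np.einsum('ab,kbc,cd->kad', Q.T, S, Q)  # 2x2 projection of each S_i
                q, r = np.array([1.0, 0.0]), np.array([0.0, 1.0])
                for _ in range(inner):           # G-Algorithm: fixed point on the rotation
                    lam_l = np.einsum('a,kab,b->k', q, T, q)
                    lam_j = np.einsum('a,kab,b->k', r, T, r)
                    w = n * (lam_l - lam_j) / (lam_l * lam_j)
                    M = np.einsum('k,kab->ab', w, T)         # 2x2 symmetric matrix
                    _, V = np.linalg.eigh(M)     # rotate (q, r) so that q^T M r = 0
                    if abs(V[0, 0]) < abs(V[0, 1]):
                        V = V[:, ::-1]           # keep the rotation closest to identity
                    if V[0, 0] < 0:
                        V = -V
                    q, r = V[:, 0], V[:, 1]
                B[:, [l, j]] = Q @ np.column_stack([q, r])
    return B
```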


Page 66: Feature Decorrelation on Speech Recognition

Common Principal Components -11

  – G-Algorithm

[Figure: listing of the G-Algorithm]

Page 67: Feature Decorrelation on Speech Recognition

Common Principal Components -12

• Likelihood Ratio Test
  – The sample common principal components: $\mathbf{U}_i = \hat{\mathbf{B}}^T \mathbf{X}_i, \ i = 1, \ldots, k$
  – For the $i$-th group, the transformed covariance matrix is $\mathbf{F}_i = \hat{\mathbf{B}}^T \mathbf{S}_i \hat{\mathbf{B}}, \ i = 1, \ldots, k$
  – Since $\hat{\boldsymbol{\Lambda}}_i = \mathrm{diag}(\mathbf{F}_i), \ i = 1, \ldots, k$, the statistic can be written as a function of the $\mathbf{F}_i$ alone:

$$X^2_{(k)} = \sum_{i=1}^{k} n_i \log\frac{|\mathrm{diag}(\mathbf{F}_i)|}{|\mathbf{F}_i|} = \sum_{i=1}^{k} n_i \log \prod_{j=1}^{p} \frac{f_{ijj}}{l_{ij}}$$

where the $f_{ijj}$ are the diagonal elements of $\mathbf{F}_i$ and the $l_{ij}$ its eigenvalues.
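A minimal sketch (assumed inputs) of this statistic, which measures how far a common basis $\hat{\mathbf{B}}$ is from diagonalizing all the group covariances simultaneously:

```python
import numpy as np

def cpc_lr_statistic(B_hat, S, n):
    """B_hat: (p, p) orthogonal; S: (k, p, p) group covariances; n: (k,) sizes."""
    stat = 0.0
    for Si, ni in zip(S, n):
        F = B_hat.T @ Si @ B_hat
        # n_i * log(|diag(F_i)| / |F_i|) >= 0 by Hadamard's inequality
        stat += ni * (np.sum(np.log(np.diag(F))) - np.linalg.slogdet(F)[1])
    return stat
```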

Page 68: Feature Decorrelation on Speech Recognition

Common Principal Components -13

• Likelihood Ratio Test (cont.)
  – The likelihood ratio criterion is a measure of the simultaneous diagonalizability of $k$ positive definite symmetric matrices
  – The CPCs can be viewed as obtained by a simultaneous transformation, yielding variables that are as uncorrelated as possible
  – It can also be seen from the viewpoint of Hadamard's inequality:

$$|\mathbf{F}_i| \leq |\mathrm{diag}(\mathbf{F}_i)|$$

Page 69: Feature Decorrelation on Speech Recognition

Common Principal Components -14

• Actually, CPC can also be viewed through another measure of "deviation from diagonality":

$$\varphi(\mathbf{F}_i) = \frac{|\mathrm{diag}(\mathbf{F}_i)|}{|\mathbf{F}_i|} \geq 1$$

  – The CPC criterion can then be written as

$$\Phi(\mathbf{F}_1, \ldots, \mathbf{F}_k; n_1, \ldots, n_k) = \prod_{i=1}^{k} \left( \varphi(\mathbf{F}_i) \right)^{n_i}$$

  – Let $\mathbf{F}_i = \mathbf{B}^T \mathbf{A}_i \mathbf{B}$ for a given orthogonal matrix $\mathbf{B}$; then

$$\Phi_0(\mathbf{A}_1, \ldots, \mathbf{A}_k; n_1, \ldots, n_k) = \min_{\mathbf{B}} \Phi(\mathbf{B}^T \mathbf{A}_1 \mathbf{B}, \ldots, \mathbf{B}^T \mathbf{A}_k \mathbf{B}; n_1, \ldots, n_k)$$

Page 70: Feature Decorrelation on Speech Recognition

Comparison Between CPC and F-MLLT

• CPC tries to maximize

$$-\sum_{i=1}^{C} n_i \left( \log|\boldsymbol{\Sigma}_i| + \mathrm{trace}(\boldsymbol{\Sigma}_i^{-1} \mathbf{S}_i) \right)$$

with $\boldsymbol{\Sigma}_i = \mathbf{B} \boldsymbol{\Lambda}_i \mathbf{B}^T, \ i = 1, \ldots, C$, in the original space. The estimates are KNOWN.

• F-MLLT tries to maximize

$$-\frac{1}{2} \sum_{j=1}^{C} n_j \log|\mathrm{diag}(\mathbf{A} \mathbf{S}_j \mathbf{A}^T)| + N \log|\mathbf{A}|$$

The estimates are UNKNOWN.

Page 71: Feature Decorrelation on Speech Recognition

Appendix A -1

• Show

$$\prod_{i=1}^{N} \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}_{l_i}|^{1/2}} \, e^{-\frac{1}{2} (\mathbf{x}_i - \boldsymbol{\mu}_{l_i})^T \boldsymbol{\Sigma}_{l_i}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_{l_i})} = \frac{e^{-\frac{1}{2} \sum_j n_j \left( (\mathbf{m}_j - \boldsymbol{\mu}_j)^T \boldsymbol{\Sigma}_j^{-1} (\mathbf{m}_j - \boldsymbol{\mu}_j) + \mathrm{trace}(\boldsymbol{\Sigma}_j^{-1} \mathbf{S}_j) + \log|\boldsymbol{\Sigma}_j| \right)}}{(2\pi)^{Nd/2}}$$

Page 72: Feature Decorrelation on Speech Recognition

Appendix A -2

For each class $j$ with $n_j$ samples:

$$\sum_{i=1}^{n_j} (\mathbf{x}_i - \boldsymbol{\mu}_j)^T \boldsymbol{\Sigma}_j^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_j) = \sum_{i=1}^{n_j} \left( (\mathbf{x}_i - \mathbf{m}_j) + (\mathbf{m}_j - \boldsymbol{\mu}_j) \right)^T \boldsymbol{\Sigma}_j^{-1} \left( (\mathbf{x}_i - \mathbf{m}_j) + (\mathbf{m}_j - \boldsymbol{\mu}_j) \right)$$

Expanding the quadratic form, the cross terms vanish because $\sum_{i=1}^{n_j} (\mathbf{x}_i - \mathbf{m}_j) = \mathbf{0}$:

$$= \sum_{i=1}^{n_j} (\mathbf{x}_i - \mathbf{m}_j)^T \boldsymbol{\Sigma}_j^{-1} (\mathbf{x}_i - \mathbf{m}_j) + n_j (\mathbf{m}_j - \boldsymbol{\mu}_j)^T \boldsymbol{\Sigma}_j^{-1} (\mathbf{m}_j - \boldsymbol{\mu}_j)$$

With the sample covariance $\mathbf{S}_j = \frac{1}{n_j} \sum_{i=1}^{n_j} (\mathbf{x}_i - \mathbf{m}_j)(\mathbf{x}_i - \mathbf{m}_j)^T$, the first term equals

$$\sum_{i=1}^{n_j} \mathrm{trace}\!\left( \boldsymbol{\Sigma}_j^{-1} (\mathbf{x}_i - \mathbf{m}_j)(\mathbf{x}_i - \mathbf{m}_j)^T \right) = n_j \, \mathrm{trace}(\boldsymbol{\Sigma}_j^{-1} \mathbf{S}_j)$$

so that

$$\sum_{i=1}^{n_j} (\mathbf{x}_i - \boldsymbol{\mu}_j)^T \boldsymbol{\Sigma}_j^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_j) = n_j \left( \mathrm{trace}(\boldsymbol{\Sigma}_j^{-1} \mathbf{S}_j) + (\mathbf{m}_j - \boldsymbol{\mu}_j)^T \boldsymbol{\Sigma}_j^{-1} (\mathbf{m}_j - \boldsymbol{\mu}_j) \right)$$

Summing over the classes and collecting the normalization factors $(2\pi)^{-d/2} |\boldsymbol{\Sigma}_j|^{-1/2}$ (whose product contributes the $\log|\boldsymbol{\Sigma}_j|$ terms in the exponent) gives the result.

Page 73: Feature Decorrelation on Speech Recognition

Appendix B

• ML estimators for $\log p(\{\mathbf{x}_i\}_1^N, \{\boldsymbol{\mu}_j\}, \{\boldsymbol{\Sigma}_j\})$:

$$\frac{\partial \log p(\{\mathbf{x}_i\}_1^N, \{\boldsymbol{\mu}_j\}, \{\boldsymbol{\Sigma}_j\})}{\partial \boldsymbol{\mu}_j} = n_j \boldsymbol{\Sigma}_j^{-1} (\mathbf{m}_j - \boldsymbol{\mu}_j) = \mathbf{0} \ \Rightarrow \ \hat{\boldsymbol{\mu}}_j = \mathbf{m}_j$$

$$\frac{\partial \log p(\{\mathbf{x}_i\}_1^N, \{\boldsymbol{\mu}_j\}, \{\boldsymbol{\Sigma}_j\})}{\partial \boldsymbol{\Sigma}_j^T} = \frac{\partial \left( -\frac{n_j}{2} \left( \mathrm{trace}(\boldsymbol{\Sigma}_j^{-1} \mathbf{S}_j) + \log|\boldsymbol{\Sigma}_j| \right) \right)}{\partial \boldsymbol{\Sigma}_j^T} = \frac{n_j}{2} \left( \boldsymbol{\Sigma}_j^{-T} \mathbf{S}_j^T \boldsymbol{\Sigma}_j^{-T} - \boldsymbol{\Sigma}_j^{-T} \right) = \mathbf{0} \ \Rightarrow \ \hat{\boldsymbol{\Sigma}}_j = \mathbf{S}_j$$

(the second derivative is evaluated at $\boldsymbol{\mu}_j = \mathbf{m}_j$).

Page 74: Feature Decorrelation on Speech Recognition

Appendix C -1

• Change of Variable Theorem
  – Consider a one-to-one mapping $g: \mathbb{R}^n \to \mathbb{R}^n, \ \mathbf{Y} = f(\mathbf{X})$
  – Equal probability: the probability of falling in a region in space $X$ should be the same as the probability of falling in the corresponding region in space $Y$
  – Suppose the region $dx_1 dx_2 \cdots dx_n$ maps to the region $dA$ in the $Y$ space. Equating probabilities, we have

$$f_{\mathbf{Y}}(y_1, \ldots, y_n) \, dA = f_{\mathbf{X}}(x_1, \ldots, x_n) \, dx_1 \cdots dx_n$$

  – The region $dA$ is a hyperparallelepiped described by the vectors

$$\left( \frac{\partial y_1}{\partial x_1} dx_1, \ldots, \frac{\partial y_n}{\partial x_1} dx_1 \right), \ \left( \frac{\partial y_1}{\partial x_2} dx_2, \ldots, \frac{\partial y_n}{\partial x_2} dx_2 \right), \ \ldots, \ \left( \frac{\partial y_1}{\partial x_n} dx_n, \ldots, \frac{\partial y_n}{\partial x_n} dx_n \right)$$

Page 75: Feature Decorrelation on Speech Recognition

Appendix C -2

• Change of Variable Theorem (cont.)
  – The volume of the hyperparallelepiped $dA$ can be calculated by the determinant

$$dA = \left| \det \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_n}{\partial x_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_1}{\partial x_n} & \cdots & \frac{\partial y_n}{\partial x_n} \end{pmatrix} \right| dx_1 \cdots dx_n = |\mathbf{J}| \, dx_1 \cdots dx_n$$

where $\mathbf{J}$ is the Jacobian of the function $g$.

  – So

$$f_{\mathbf{Y}}(y_1, \ldots, y_n) \, |\mathbf{J}| \, dx_1 \cdots dx_n = f_{\mathbf{X}}(x_1, \ldots, x_n) \, dx_1 \cdots dx_n \ \Rightarrow \ f_{\mathbf{Y}}(y_1, \ldots, y_n) = |\mathbf{J}|^{-1} f_{\mathbf{X}}(x_1, \ldots, x_n)$$

Page 76: Feature Decorrelation on Speech Recognition

Appendix D -1

• If $\mathbf{A}$ is a square matrix, then the minor of entry $a_{ij}$ is denoted by $M_{ij}$ and is defined to be the determinant of the submatrix that remains after the $i$-th row and the $j$-th column are deleted from $\mathbf{A}$. The number $(-1)^{i+j} M_{ij}$ is denoted by $c_{ij}$ and is called the cofactor of $a_{ij}$.
• Given the matrix

$$\mathbf{B} = \begin{pmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \end{pmatrix}$$

we have, for example,

$$c_{23} = (-1)^{2+3} M_{23}, \quad M_{23} = \det\begin{pmatrix} b_{11} & b_{12} \\ b_{31} & b_{32} \end{pmatrix} = b_{11} b_{32} - b_{12} b_{31}, \qquad c_{33} = (-1)^{3+3} M_{33}$$

Page 77: Feature Decorrelation on Speech Recognition

Appendix D -2

• Given the $n \times n$ matrix

$$\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}$$

  – The determinant of $\mathbf{A}$ can be written as the sum of its cofactors multiplied by the entries that generated them:

$$\det(\mathbf{A}) = a_{1j} c_{1j} + a_{2j} c_{2j} + \cdots + a_{nj} c_{nj} \quad \text{(cofactor expansion along the } j\text{-th column)}$$

$$\det(\mathbf{A}) = a_{i1} c_{i1} + a_{i2} c_{i2} + \cdots + a_{in} c_{in} \quad \text{(cofactor expansion along the } i\text{-th row)}$$
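A minimal sketch verifying the row expansion numerically with NumPy (the full cofactor matrix can also be obtained as $\det(\mathbf{A}) \, \mathbf{A}^{-T}$):

```python
import numpy as np

def cofactor(A, i, j):
    minor = np.delete(np.delete(A, i, axis=0), j, axis=1)
    return (-1) ** (i + j) * np.linalg.det(minor)

A = np.random.randn(4, 4)
i = 2
det_by_expansion = sum(A[i, j] * cofactor(A, i, j) for j in range(4))
assert np.isclose(det_by_expansion, np.linalg.det(A))
```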