REVISED CONTEXTUAL LRT FOR VOICE ACTIVITY DETECTION

Javier Ramírez, José C. Segura and J.M. Górriz
Dept. of Signal Theory, Networking and Communications
University of Granada (Spain)

Presenter: Chen, Hung-Bin

ICASSP 2007



Outline

• Introduction

• Multiple Observation Likelihood Ratio Test

• Analysis Of The Proposed Algorithm (IEEE SIGNAL PROCESSING 2005)

• Revised MO-LRT (ICASSP 2007)

• Experimental Results


Introduction

• This paper presents a revised contextual likelihood ratio test (LRT) defined over a multiple observation window

• The new approach evaluates not only the two hypotheses in which all the observations are speech or all are non-speech, but all the possible hypotheses defined over the individual observations

• The proposed method discriminates speech from non-speech over a wide range of SNR conditions


Likelihood Ratio Test

• Two-hypothesis test

– Given an observation vector $\tilde{y}$ to be classified, the problem is reduced to selecting the class ($G_0$ or $G_1$) with the largest posterior probability $P(G_i \mid \tilde{y})$

– The likelihood ratio test (LRT) is defined as:

$$L(\tilde{y}) = \frac{p(\tilde{y} \mid G_1)}{p(\tilde{y} \mid G_0)} \;\underset{G_0}{\overset{G_1}{\gtrless}}\; \frac{P(G_0)}{P(G_1)}$$

– The observation vector is classified as $G_1$ if the likelihood ratio $L(\tilde{y})$ is greater than the ratio $P(G_0)/P(G_1)$ between the a priori class probabilities; otherwise it is classified as $G_0$
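As a rough illustration of the rule above, the sketch below applies the LRT to a scalar observation. The two Gaussian class densities, their means and variances, and the function names are assumptions chosen only for the example; they are not taken from the slides.

```python
import math


def gaussian_pdf(y, mean, var):
    """Density of a real Gaussian with the given mean and variance."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def lrt_classify(y, prior_g0=0.5, prior_g1=0.5):
    """Return 1 (class G1) if L(y) > P(G0)/P(G1), else 0 (class G0).

    Hypothetical class models: G0 ~ N(0, 1) and G1 ~ N(2, 1).
    """
    ratio = gaussian_pdf(y, 2.0, 1.0) / gaussian_pdf(y, 0.0, 1.0)
    threshold = prior_g0 / prior_g1
    return 1 if ratio > threshold else 0
```

With equal priors the threshold is 1, so the observation is assigned to whichever class gives it the larger likelihood; a larger prior for $G_0$ raises the threshold and makes a $G_1$ decision harder to reach.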


Multiple Observation Likelihood Ratio Test

• The multiple observation LRT (MO-LRT) improves on an LRT that detects the presence of speech in a noisy signal using a Gaussian model

– The MO-LRT considers not just a single observation vector $\tilde{y}_t$ measured at frame $t$, but also its N-frame neighborhood $\{\tilde{y}_{t-N}, \ldots, \tilde{y}_t, \ldots, \tilde{y}_{t+N}\}$

– This test involves the evaluation of an N-th order LRT that incorporates contextual information into the decision rule and exhibits significant improvements in speech/pause discrimination over the original LRT

– The multiple observation LRT (MO-LRT) is defined as:

$$L(\tilde{y}_{t-N}, \ldots, \tilde{y}_t, \ldots, \tilde{y}_{t+N}) = \frac{p_{y_{t-N},\ldots,y_{t+N} \mid G_1}(\tilde{y}_{t-N}, \ldots, \tilde{y}_{t+N} \mid G_1)}{p_{y_{t-N},\ldots,y_{t+N} \mid G_0}(\tilde{y}_{t-N}, \ldots, \tilde{y}_{t+N} \mid G_0)}$$

IEEE SIGNAL PROCESSING 2005


Multiple Observation Likelihood Ratio Test

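• This test involves the evaluation of an N-th order LRT, which can be computed efficiently when the individual measurements $\tilde{y}_t$ are independent

• In this case

$$L(\tilde{y}_{t-N}, \ldots, \tilde{y}_t, \ldots, \tilde{y}_{t+N}) = \prod_{k=t-N}^{t+N} \frac{p_{y_k \mid G_1}(\tilde{y}_k \mid G_1)}{p_{y_k \mid G_0}(\tilde{y}_k \mid G_0)}$$

• An equivalent log LRT can be defined by taking logarithms

$$\ell(\tilde{y}_{t-N}, \ldots, \tilde{y}_t, \ldots, \tilde{y}_{t+N}) = \sum_{k=t-N}^{t+N} \ln \frac{p_{y_k \mid G_1}(\tilde{y}_k \mid G_1)}{p_{y_k \mid G_0}(\tilde{y}_k \mid G_0)}$$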


Multiple Observation Likelihood Ratio Test

• The use of the MO-LRT for voice activity detection is mainly motivated by the improved robustness against environmental acoustic noise that the multiple observation vector provides

• The MO-LRT is defined over the observation vectors $\{\tilde{y}_{t-N}, \ldots, \tilde{y}_t, \ldots, \tilde{y}_{t+N}\}$:

$$\ell_{t,m} = \sum_{k=t-N}^{t+N} \ln \frac{p_{y_k \mid G_1}(\tilde{y}_k \mid G_1)}{p_{y_k \mid G_0}(\tilde{y}_k \mid G_0)}$$

where $t$ denotes the frame being classified as speech ($G_1$) or nonspeech ($G_0$)

• The decision rule is defined by

$$\ell_{t,m} \ge \eta: \text{frame } t \text{ is classified as speech}, \qquad \ell_{t,m} < \eta: \text{frame } t \text{ is classified as nonspeech}$$

where $\eta$ is the decision threshold, which is experimentally tuned for the best trade-off between speech and nonspeech classification
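The decision rule can be sketched in a few lines of Python, assuming the per-frame log-likelihood ratios have already been computed. The function name, the argument names and the truncation of the window at the start and end of the signal are assumptions made for this example.

```python
def mo_lrt_decisions(frame_llrs, n, threshold):
    """Classify each frame t by summing the log-likelihood ratios of frames t-n..t+n.

    frame_llrs[k] is ln p(y_k | G1) - ln p(y_k | G0) for frame k.
    Returns a list with 1 for speech and 0 for nonspeech.
    """
    decisions = []
    last = len(frame_llrs) - 1
    for t in range(len(frame_llrs)):
        lo = max(0, t - n)        # window is truncated at the signal edges (assumption)
        hi = min(last, t + n)
        score = sum(frame_llrs[lo:hi + 1])
        decisions.append(1 if score >= threshold else 0)
    return decisions
```

With `n = 0` the rule falls back to the single-observation LRT; a larger `n` lets neighboring frames support a weak central frame, at the cost of a decision delay of `n` frames.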


Multiple Observation Likelihood Ratio Test

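• In order to evaluate the proposed MO-LRT VAD on an incoming signal, an adequate statistical model for the feature vectors in the presence and absence of speech is needed

• The model selected is similar to one that assumes the discrete Fourier transform (DFT) coefficients of the clean speech and the noise to be asymptotically independent Gaussian random variables, with variances $\lambda_S(j)$ and $\lambda_N(j)$ in frequency bin $j$:

$$p(\tilde{y}_t \mid G_0) = \prod_{j=0}^{J-1} \frac{1}{\pi \lambda_N(j)} \exp\left\{ -\frac{|Y_j|^2}{\lambda_N(j)} \right\}$$

$$p(\tilde{y}_t \mid G_1) = \prod_{j=0}^{J-1} \frac{1}{\pi [\lambda_N(j) + \lambda_S(j)]} \exp\left\{ -\frac{|Y_j|^2}{\lambda_N(j) + \lambda_S(j)} \right\}$$

where $Y_j$ is the $j$-th DFT coefficient of the noisy frame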
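Taking the logarithm of the ratio of the two densities above gives, bin by bin, $|Y_j|^2/\lambda_N(j) - |Y_j|^2/(\lambda_N(j)+\lambda_S(j)) - \ln[(\lambda_N(j)+\lambda_S(j))/\lambda_N(j)]$. A minimal sketch of that per-frame log-likelihood ratio follows; the function and argument names are assumptions, and the variances are taken as already estimated.

```python
import math


def frame_log_lr(power, noise_var, speech_var):
    """Log-likelihood ratio ln p(y|G1) - ln p(y|G0) for one frame.

    power[j]      : |Y_j|^2, squared magnitude of DFT bin j
    noise_var[j]  : lambda_N(j), noise variance in bin j (must be > 0)
    speech_var[j] : lambda_S(j), speech variance in bin j (>= 0)
    """
    total = 0.0
    for p, lam_n, lam_s in zip(power, noise_var, speech_var):
        lam_sum = lam_n + lam_s
        total += p / lam_n - p / lam_sum - math.log(lam_sum / lam_n)
    return total
```

When the speech variance is zero in every bin the two hypotheses coincide and the ratio is 0, which is a quick sanity check on the signs.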


Analysis Of The Proposed Algorithm

• For this experiment

– The AURORA subset of the Spanish SpeechDat-Car (SDC) was used

– This database consists of recordings from distant and close-talking microphones in car environments at different driving conditions

– The most unfavorable scenario (i.e., distant microphone at high speed over good road) with a 5 dB average SNR was considered


Analysis Of The Proposed Algorithm

• Fig. 1(a) shows the distributions of the LRT during presence and absence of speech for increasing values of m.

(a) Speech/Nonspeech distributions of the multiple observation likelihood ratio for different numbers of observations m


Analysis Of The Proposed Algorithm

• Fig. 1(b) shows the speech, nonspeech, and global detection errors as a function of m

• Note that the speech detection error is clearly reduced when increasing the value of m

• The global error exhibits a minimum value for m = 6 frames

(b) Error probabilities as a function of m


Analysis Of The Proposed Algorithm

• Fig. 2 shows the nonspeech hit rate (HR0) versus the false alarm rate (FAR0=1- HR1, where HR1 denotes the speech hit rate) for recordings from the distant microphone at an average SNR of about 5 dB.

• It is clearly shown that increasing the number of observation vectors in the MO-LRT improves the performance of the proposed VAD
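The quantities plotted on the ROC axes can be computed from frame-level labels as in the sketch below. The function name and the 0/1 label encoding (0 = nonspeech, 1 = speech) are assumptions for this example, and both classes are assumed to occur in the reference labeling.

```python
def hit_rates(reference, decisions):
    """Return (HR0, HR1, FAR0) from reference labels and VAD decisions.

    HR0  : fraction of nonspeech frames correctly detected as nonspeech
    HR1  : fraction of speech frames correctly detected as speech
    FAR0 : 1 - HR1, as defined on the slide
    """
    n0 = sum(1 for r in reference if r == 0)
    n1 = sum(1 for r in reference if r == 1)
    hits0 = sum(1 for r, d in zip(reference, decisions) if r == 0 and d == 0)
    hits1 = sum(1 for r, d in zip(reference, decisions) if r == 1 and d == 1)
    hr0 = hits0 / n0
    hr1 = hits1 / n1
    return hr0, hr1, 1.0 - hr1
```

Sweeping the decision threshold and recording one (FAR0, HR0) pair per value traces out a ROC curve of the kind described here.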


Revised MO-LRT

• The MO-LRT evaluates the probability that all the observations in the N-frame neighborhood of the central frame are non-speech, or that all are speech

– the decision is made in favor of one of the two hypotheses:

$$G_0: \hat{y}_l = \hat{n}_l, \qquad G_1: \hat{y}_l = \hat{n}_l + \hat{s}_l, \qquad \text{for } l = t-N, \ldots, t, \ldots, t+N$$

• This is the reason to revise the method so that it evaluates not just the two previous hypotheses $G_0$ and $G_1$


Revised MO-LRT

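• The multiple observation vector $\{\tilde{y}_{t-N}, \ldots, \tilde{y}_t, \ldots, \tilde{y}_{t+N}\}$ is reindexed as $\hat{Y} = \{\hat{y}_1, \ldots, \hat{y}_{N+1}, \ldots, \hat{y}_{2N+1}\}$ for convenience of presentation

• Each hypothesis $H_m$ can be defined in terms of a binary integer representation:

$$m = \sum_{k=1}^{2N+1} b_k \, 2^{k-1}$$

where $b_k$ defines whether observation $k$ is non-speech (0) or speech (1):

$$b_k = 0: \hat{y}_k = \hat{n}_k, \qquad b_k = 1: \hat{y}_k = \hat{n}_k + \hat{s}_k, \qquad k = 1, 2, \ldots, 2N+1$$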


Revised MO-LRT

• Thus, each hypothesis $H_m$ consists of 2N+1 individual hypotheses involving the 2N+1 observations

• The classification problem is then reformulated as selecting the class $G_i$ for the current frame: the bit $b_{N+1}$ assigns it to speech ($G_1$) or non-speech ($G_0$)

• The set of all possible hypotheses is split according to the value of the central frame bit $b_{N+1}$:

$$M_1 = \{H_m : b_{N+1} = 1\}, \qquad M_0 = \{H_m : b_{N+1} = 0\}$$

• The posterior probabilities are then defined to be:

$$P(G_1 \mid \hat{Y}) = \sum_{H_m \in M_1} p(H_m \mid \hat{Y}), \qquad P(G_0 \mid \hat{Y}) = \sum_{H_m \in M_0} p(H_m \mid \hat{Y})$$
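To make the bookkeeping concrete, the sketch below enumerates every hypothesis index for a given N and splits the indices by the central bit. It assumes the weighting $m = \sum_k b_k 2^{k-1}$, so that $b_k$ is bit $k-1$ of $m$; the function names are illustrative only.

```python
def hypothesis_bits(m, n):
    """Bits (b_1, ..., b_{2N+1}) of hypothesis H_m, assuming m = sum_k b_k * 2**(k - 1)."""
    return [(m >> (k - 1)) & 1 for k in range(1, 2 * n + 2)]


def split_by_central_bit(n):
    """Return (M1, M0): hypothesis indices whose central bit b_{N+1} is 1 or 0."""
    m1, m0 = [], []
    for m in range(2 ** (2 * n + 1)):
        central = hypothesis_bits(m, n)[n]  # b_{N+1} sits at list index N
        if central:
            m1.append(m)
        else:
            m0.append(m)
    return m1, m0
```

For N = 1 there are 8 hypotheses, 4 in each set. The count doubles with every added frame, which gives a practical reason to restrict the set of hypotheses that is actually evaluated.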


Revised MO-LRT

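• Using the Bayes rule:

$$P(G_1 \mid \hat{Y}) = \frac{1}{P(\hat{Y})} \sum_{H_m \in M_1} p(\hat{Y} \mid H_m)\, p(H_m), \qquad P(G_0 \mid \hat{Y}) = \frac{1}{P(\hat{Y})} \sum_{H_m \in M_0} p(\hat{Y} \mid H_m)\, p(H_m)$$

• A revised LRT can be defined as:

$$\frac{P(G_1 \mid \hat{Y})}{P(G_0 \mid \hat{Y})} = \frac{\sum_{H_m \in M_1} p(\hat{Y} \mid H_m)\, p(H_m)}{\sum_{H_m \in M_0} p(\hat{Y} \mid H_m)\, p(H_m)}$$

• An approximation to the statistical test replaces the summation by the maximum value of the probability of the hypotheses in $M_1$ and $M_0$:

$$\Lambda^* = \frac{\max_{H_m \in M_1} p(\hat{Y} \mid H_m)\, p(H_m)}{\max_{H_m \in M_0} p(\hat{Y} \mid H_m)\, p(H_m)}$$

• By taking logarithms, this test is expressed in a more compact form:

$$\log \Lambda^* = \max_{H_m \in M_1} l_m - \max_{H_m \in M_0} l_m, \qquad l_m = \log\left[ p(\hat{Y} \mid H_m)\, p(H_m) \right]$$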


Revised MO-LRT

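• The set of possible hypotheses is restricted to those with at most one speech-to-non-speech or non-speech-to-speech transition in the N-frame neighborhood

– The log-likelihoods of the retained hypotheses are then

$$L = K B_1 + (I - K) B_0$$

where:

$$L = [l_1, l_2, \ldots, l_{2N+1}]^T$$

$$B_0 = [\log p(\hat{y}_1 \mid 0), \ldots, \log p(\hat{y}_{2N+1} \mid 0)]^T$$

$$B_1 = [\log p(\hat{y}_1 \mid 1), \ldots, \log p(\hat{y}_{2N+1} \mid 1)]^T$$

– K is the Hankel matrix whose rows are the bit patterns $(b_1, \ldots, b_{2N+1})$ of the retained hypotheses

– The priors $p(H_m)$ do not appear in $L$, which corresponds to taking them equal over the retained hypotheses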
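One way to enumerate the restricted hypotheses is sketched below: every pattern is a run of one label followed by a run of the other, so there are 2(2N+1) of them in total. The function names and the row ordering are assumptions for this example, and the ordering does not affect the maxima used by the test.

```python
def single_transition_patterns(n):
    """Bit patterns of length 2N+1 with at most one speech/non-speech transition."""
    size = 2 * n + 1
    patterns = []
    for a in range(size + 1):
        # a non-speech frames followed by speech frames (includes all-1 and all-0)
        patterns.append([0] * a + [1] * (size - a))
    for a in range(1, size):
        # a speech frames followed by non-speech frames
        patterns.append([1] * a + [0] * (size - a))
    return patterns


def split_patterns(n):
    """Split the patterns by the central bit into (rows for M1, rows for M0)."""
    rows = single_transition_patterns(n)
    k1 = [r for r in rows if r[n] == 1]
    k0 = [r for r in rows if r[n] == 0]
    return k1, k0
```

The number of hypotheses now grows linearly with N instead of exponentially, and the two sets always have 2N+1 rows each.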


Revised MO-LRT

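• The matrix K can be split into two submatrices $K_0$ and $K_1$; the test then reduces to:

$$\log \Lambda^* = \max\{L_1\} - \max\{L_0\}$$

where

$$L_1 = K_1 B_1 + (I - K_1) B_0$$

$$L_0 = K_0 B_1 + (I - K_0) B_0$$

• Note that $I - K_1 = K_0$ (reading $I$ as the all-ones matrix, so that $I - K$ flips every bit of $K$), and the equations reduce to:

$$L_1 = K_1 B_1 + (I - K_1) B_0$$

$$L_0 = (I - K_1) B_1 + K_1 B_0$$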


Revised MO-LRT

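• As an example, for N = 1, the matrices K, $K_0$ and $K_1$ are defined to be:

$$K = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \qquad K_1 = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}, \qquad K_0 = \begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}$$

• and $L_1$ and $L_0$ are computed by:

$$L_1 = K_1 B_1 + (I - K_1) B_0 = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} \log p(\hat{y}_1 \mid 1) \\ \log p(\hat{y}_2 \mid 1) \\ \log p(\hat{y}_3 \mid 1) \end{bmatrix} + \begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} \log p(\hat{y}_1 \mid 0) \\ \log p(\hat{y}_2 \mid 0) \\ \log p(\hat{y}_3 \mid 0) \end{bmatrix}$$

$$L_0 = (I - K_1) B_1 + K_1 B_0 = \begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} \log p(\hat{y}_1 \mid 1) \\ \log p(\hat{y}_2 \mid 1) \\ \log p(\hat{y}_3 \mid 1) \end{bmatrix} + \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} \log p(\hat{y}_1 \mid 0) \\ \log p(\hat{y}_2 \mid 0) \\ \log p(\hat{y}_3 \mid 0) \end{bmatrix}$$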
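The N = 1 computation can be checked numerically with a short sketch. The matrices are those of the example; the helper names and the log-likelihood values used in the usage note are made up for illustration.

```python
# Rows are bit patterns (b_1, b_2, b_3); K0 is the bitwise complement of K1.
K1 = [[1, 1, 0], [0, 1, 1], [1, 1, 1]]
K0 = [[0, 0, 1], [1, 0, 0], [0, 0, 0]]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def revised_statistic(b0, b1):
    """Return max(L1) - max(L0) for N = 1.

    b0[k] = log p(y_k | non-speech), b1[k] = log p(y_k | speech), k = 0, 1, 2.
    """
    l1 = [p + q for p, q in zip(matvec(K1, b1), matvec(K0, b0))]
    l0 = [p + q for p, q in zip(matvec(K0, b1), matvec(K1, b0))]
    return max(l1) - max(l0)
```

For example, with `b0 = [-1.0, -2.0, -3.0]` and `b1 = [-3.0, -1.0, -1.0]` the best hypothesis with a speech central frame is 011 (score -3) and the best with a non-speech central frame is 001 (score -4), so the statistic is 1.0 and a threshold of 0 would label the central frame as speech.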


Revised MO-LRT

• Finally

– The algorithm for voice activity detection is based on a comparison of a likelihood ratio to a given threshold $\eta$:

$$\log \Lambda^* = \max\{L_1\} - \max\{L_0\} \;\underset{G_0}{\overset{G_1}{\gtrless}}\; \eta$$


Experimental Results

• The AURORA subset of the Spanish SpeechDat-Car (SDC) was used

– Fig. 1 shows an utterance of the database in clean conditions (25 dB SNR)

– Fig. 2 under the noisiest conditions (5 dB SNR)


Experimental Results

• ROC curves in quiet noise conditions (stopped car and engine running) and close talking microphone


Experimental Results

• ROC curves in high noise conditions (high speed over a good road) and distant talking microphone


Conclusion

• The new approach evaluates not only the two hypotheses in which all the observations are speech or all are non-speech, but all the possible hypotheses defined over the individual observations

• A Hankel matrix was introduced into the revised statistical test; it acts as a smoothing process and reduces the variance of the decision variable


Revised MO-LRT

• The algorithm is adaptive and suitable for non-stationary noise environments since the statistical properties are updated when the frame is classified as a non-speech frame. In this way, the variance of the noise is updated as:

$$\lambda_N(j) = \alpha\, \lambda_N(j) + (1 - \alpha)\, |Y_j|^2$$

with smoothing factor $\alpha$
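A minimal sketch of such a recursive update is given below, assuming a first-order smoothing with factor `alpha` applied only to frames the detector labels as non-speech. The function name and the default value of `alpha` are assumptions and do not come from the slides.

```python
def update_noise_variance(noise_var, power, is_speech, alpha=0.95):
    """Recursively smooth the per-bin noise variance on non-speech frames.

    noise_var[j] : current lambda_N(j)
    power[j]     : |Y_j|^2 of the current frame
    is_speech    : VAD decision for the current frame
    alpha        : smoothing factor in [0, 1] (assumed value)
    """
    if is_speech:
        # Speech frames leave the noise estimate untouched.
        return list(noise_var)
    return [alpha * lam + (1.0 - alpha) * p for lam, p in zip(noise_var, power)]
```

A value of `alpha` close to 1 gives a slowly varying estimate; a smaller value tracks changing noise faster but is noisier.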