REVISED CONTEXTUAL LRT FOR VOICE ACTIVITY DETECTION

Javier Ramírez, José C. Segura and J.M. Górriz

Dept. of Signal Theory, Networking and Communications

University of Granada (SPAIN)

Presenter: Chen, Hung-Bin

ICASSP 2007

Outline

• Introduction

• Multiple Observation Likelihood Ratio Test

• Analysis Of The Proposed Algorithm (IEEE SIGNAL PROCESSING 2005)

• Revised MO-LRT (ICASSP 2007)

• Experimental Results

Introduction

• This paper presents a voice activity detector (VAD) based on a revised contextual likelihood ratio test (LRT) defined over a multiple-observation window

• The new approach evaluates not only the two hypotheses in which all the observations are speech or non-speech, but all the possible hypotheses defined over the individual observations

• The proposed method discriminates speech from non-speech over a wide range of SNR conditions

Likelihood Ratio Test

• Two-hypothesis test

– Given an observation vector $\tilde{y}$ to be classified, the problem is reduced to selecting the class ($G_0$ or $G_1$) with the largest posterior probability $P(G_i|\tilde{y})$

– The likelihood ratio test (LRT) is defined as:

$$L(\tilde{y}) = \frac{p(\tilde{y}|G_1)}{p(\tilde{y}|G_0)} \underset{G_0}{\overset{G_1}{\gtrless}} \frac{P(G_0)}{P(G_1)}$$

– where the observation vector $\tilde{y}$ is classified as $G_1$ if the likelihood ratio $L(\tilde{y})$ is greater than the ratio between the a priori class probabilities $P(G_0)/P(G_1)$; otherwise it is classified as $G_0$
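The decision rule of the LRT can be sketched in a few lines of Python. This is an illustrative toy rather than the authors' implementation: the names `gaussian_pdf` and `lrt_classify`, and the two scalar Gaussian class models (variance 1 for G0, variance 4 for G1), are assumptions chosen only so that the ratio can be computed.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a scalar Gaussian, used here only as a toy class model."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def lrt_classify(y, p_g0=0.5, p_g1=0.5):
    """Return 1 (class G1) if L(y) exceeds P(G0)/P(G1), otherwise 0 (class G0)."""
    # Toy assumption: G0 ~ N(0, 1) and G1 ~ N(0, 4).
    likelihood_ratio = gaussian_pdf(y, 0.0, 4.0) / gaussian_pdf(y, 0.0, 1.0)
    return 1 if likelihood_ratio > p_g0 / p_g1 else 0
```

Raising the prior of G0 raises the threshold, so the same observation can move from G1 to G0 when non-speech is assumed to be more likely a priori.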

Multiple Observation Likelihood Ratio Test

• To improve on the LRT for detecting the presence of speech in a noisy signal based on a Gaussian model:

– The multiple observation LRT (MO-LRT) considers not just a single observation vector $\tilde{y}_t$ measured at frame $t$, but also an N-frame neighborhood $\{\tilde{y}_{t-N},\dots,\tilde{y}_t,\dots,\tilde{y}_{t+N}\}$

– This test involves the evaluation of an N-th order LRT that incorporates contextual information into the decision rule and exhibits significant improvements in speech/pause discrimination over the original LRT

– The multiple observation LRT (MO-LRT) is defined as:

$$L(\tilde{y}_{t-N},\dots,\tilde{y}_t,\dots,\tilde{y}_{t+N}) = \frac{p_{y_{t-N},\dots,y_{t+N}|G_1}(\tilde{y}_{t-N},\dots,\tilde{y}_{t+N}|G_1)}{p_{y_{t-N},\dots,y_{t+N}|G_0}(\tilde{y}_{t-N},\dots,\tilde{y}_{t+N}|G_0)}$$

IEEE SIGNAL PROCESSING 2005

• This test involves the evaluation of an N-th order LRT that can be computed efficiently when the individual measurements $\tilde{y}_t$ are independent

• In this case

$$L(\tilde{y}_{t-N},\dots,\tilde{y}_t,\dots,\tilde{y}_{t+N}) = \prod_{k=t-N}^{t+N} \frac{p_{y_k|G_1}(\tilde{y}_k|G_1)}{p_{y_k|G_0}(\tilde{y}_k|G_0)}$$

• An equivalent log LRT can be defined by taking logarithms

$$l(\tilde{y}_{t-N},\dots,\tilde{y}_t,\dots,\tilde{y}_{t+N}) = \sum_{k=t-N}^{t+N} \ln\frac{p_{y_k|G_1}(\tilde{y}_k|G_1)}{p_{y_k|G_0}(\tilde{y}_k|G_0)}$$
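A small numerical check of the independence case: the log of the product of per-frame likelihood ratios equals the sum of the per-frame log ratios. The function name `mo_log_lrt` and the sample ratios are hypothetical; only the sum over the frames t-N to t+N comes from the slides.

```python
import math

def mo_log_lrt(frame_log_lrs, t, n):
    """Sum the per-frame log likelihood ratios over frames t-n .. t+n."""
    if t - n < 0 or t + n >= len(frame_log_lrs):
        raise ValueError("window does not fit inside the signal")
    return sum(frame_log_lrs[t - n : t + n + 1])

# Hypothetical per-frame likelihood ratios p(y_k|G1) / p(y_k|G0).
frame_lrs = [0.5, 2.0, 4.0, 1.0, 0.25]
frame_log_lrs = [math.log(r) for r in frame_lrs]

# Product form over the window centred on frame 2 with N = 1.
window_product = math.prod(frame_lrs[1:4])
```

Working in the log domain avoids the underflow that a long product of small densities would cause.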

• The use of the MO-LRT for voice activity detection is mainly motivated by the improved robustness against environmental acoustic noise that the multiple observation vector provides

• The MO-LRT is defined over the observation vectors $\{\tilde{y}_{t-N},\dots,\tilde{y}_t,\dots,\tilde{y}_{t+N}\}$

• The decision rule for a window of $N = m$ frames on each side of frame $t$ is defined by

$$l_{t,m} = \sum_{k=t-m}^{t+m} \ln\frac{p_{y_k|G_1}(\tilde{y}_k|G_1)}{p_{y_k|G_0}(\tilde{y}_k|G_0)} \underset{G_0}{\overset{G_1}{\gtrless}} \eta$$

where $t$ denotes the frame being classified as speech ($G_1$) or nonspeech ($G_0$): the frame is classified as speech when $l_{t,m}$ exceeds $\eta$ and as nonspeech otherwise, and $\eta$ is the decision threshold, experimentally tuned for the best trade-off between speech and nonspeech classification errors
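The decision rule can be sketched as a sliding window over per-frame log likelihood ratios. This is a minimal illustration under stated assumptions: the name `mo_lrt_vad`, the clipping of the window at the signal edges, and the threshold value used in the example are not taken from the slides.

```python
def mo_lrt_vad(frame_log_lrs, n, threshold):
    """Label each frame 1 (speech) or 0 (non-speech) from the windowed log LRT."""
    decisions = []
    last = len(frame_log_lrs) - 1
    for t in range(len(frame_log_lrs)):
        # Assumption: the window is clipped where it runs past the signal edges.
        lo, hi = max(0, t - n), min(last, t + n)
        statistic = sum(frame_log_lrs[lo : hi + 1])
        decisions.append(1 if statistic > threshold else 0)
    return decisions

# Hypothetical per-frame log ratios: three speech-like frames inside noise.
example = [-1.0, -1.0, -1.0, 3.0, 3.0, 3.0, -1.0, -1.0]
```

With `n=0` the rule is the single-frame LRT; with `n=1` the speech region extends by one frame on each side in this example, which shows how contextual information shifts the decision near speech boundaries.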

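• In order to evaluate the proposed MO-LRT VAD on an incoming signal, an adequate statistical model for the feature vectors in the presence and absence of speech is needed

• The model selected is similar to one that assumes the discrete Fourier transform (DFT) coefficients of the clean speech ($S_j$) and the noise ($N_j$) to be asymptotically independent Gaussian random variables

$$p(\tilde{y}_t|G_0) = \prod_{j=0}^{J-1} \frac{1}{\pi\lambda_N(j)} \exp\left\{-\frac{|Y_j|^2}{\lambda_N(j)}\right\}$$

$$p(\tilde{y}_t|G_1) = \prod_{j=0}^{J-1} \frac{1}{\pi[\lambda_N(j)+\lambda_S(j)]} \exp\left\{-\frac{|Y_j|^2}{\lambda_N(j)+\lambda_S(j)}\right\}$$

where $Y_j$ is the $j$-th DFT coefficient of the noisy frame, and $\lambda_N(j)$ and $\lambda_S(j)$ are the variances of the noise and clean speech coefficients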
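Under the Gaussian model, the per-frame log likelihood ratio has a closed form. The sketch below assumes zero-mean complex Gaussian DFT coefficients with variance λ_N(j) under G0 and λ_N(j) + λ_S(j) under G1, so the normalising constant π cancels in the ratio. The function name `frame_log_lr` and its argument names are hypothetical.

```python
import math

def frame_log_lr(power_spectrum, noise_var, speech_var):
    """ln[p(y|G1) / p(y|G0)] for one frame under the complex Gaussian model.

    power_spectrum[j] = |Y_j|^2, noise_var[j] = lambda_N(j),
    speech_var[j] = lambda_S(j).
    """
    total = 0.0
    for y2, lam_n, lam_s in zip(power_spectrum, noise_var, speech_var):
        # Per bin: ln[lam_n / (lam_n + lam_s)] + |Y|^2 * (1/lam_n - 1/(lam_n + lam_s))
        total += math.log(lam_n / (lam_n + lam_s))
        total += y2 * (1.0 / lam_n - 1.0 / (lam_n + lam_s))
    return total
```

A bin with no speech variance contributes zero, and the contribution grows linearly with the observed power once speech variance is present.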


Analysis Of The Proposed Algorithm

• For this experiment

– The AURORA subset of the Spanish SpeechDat-Car (SDC) was used

– This database consists of recordings from distant and close-talking microphones in car environments at different driving conditions

– The most unfavorable scenario (i.e., distant microphone at high speed over good road) with a 5 dB average SNR was considered


• Fig. 1(a) shows the distributions of the LRT during presence and absence of speech for increasing values of m.

(a) Speech/Nonspeech distributions of the multiple observation likelihood ratio for different numbers of observations m


• Fig. 1(b) shows the speech, nonspeech, and global detection errors as a function of m

• Note that the speech detection error is clearly reduced when increasing the value of m

• The global error exhibits a minimum value for m = 6 frames

(b) Error probabilities as a function of m


• Fig. 2 shows the nonspeech hit rate (HR0) versus the false alarm rate (FAR0 = 1 - HR1, where HR1 denotes the speech hit rate) for recordings from the distant microphone at an average SNR of about 5 dB.

• It is clearly shown that increasing the number of observation vectors in the MO-LRT improves the performance of the proposed VAD
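The quantities plotted on the ROC curve can be computed from frame-level labels as follows. This is a minimal sketch: the function name `hit_rates` and the 0/1 label encoding are assumptions, and both classes are assumed to occur in the reference.

```python
def hit_rates(reference, decisions):
    """Return (HR0, HR1, FAR0) from frame labels, 1 = speech and 0 = non-speech."""
    nonspeech = [d for r, d in zip(reference, decisions) if r == 0]
    speech = [d for r, d in zip(reference, decisions) if r == 1]
    hr0 = nonspeech.count(0) / len(nonspeech)  # non-speech frames detected as non-speech
    hr1 = speech.count(1) / len(speech)        # speech frames detected as speech
    return hr0, hr1, 1.0 - hr1                 # FAR0 = 1 - HR1

# Hypothetical reference labels and VAD decisions for eight frames.
reference = [0, 0, 0, 0, 1, 1, 1, 1]
decisions = [0, 0, 0, 1, 1, 1, 1, 0]
```

Sweeping the decision threshold and recomputing the pair (FAR0, HR0) at each value traces out one ROC curve.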


Revised MO-LRT

• In the MO-LRT, the test evaluates the probability that "all" the observations in the N-frame neighborhood of the central frame are non-speech or speech

• This is the reason to revise the method in order to evaluate not just the two previous hypotheses $G_0$ and $G_1$

– In the MO-LRT, the decision is made in favor of one of the two hypotheses:

$$G_0: \hat{y}_l = \hat{n}_l \qquad G_1: \hat{y}_l = \hat{s}_l + \hat{n}_l \qquad \text{for } l = t-N,\dots,t,\dots,t+N$$

Revised MO-LRT

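• The multiple observation vector $\{\tilde{y}_{t-N},\dots,\tilde{y}_t,\dots,\tilde{y}_{t+N}\}$ is reindexed as $\hat{Y} = \{\hat{y}_1,\dots,\hat{y}_{N+1},\dots,\hat{y}_{2N+1}\}$ for convenience of presentation

• Each hypothesis $H_m$ can be defined in terms of a binary integer representation:

$$m = \sum_{k=1}^{2N+1} b_k 2^{k-1}$$

where $b_k$ defines whether the $k$-th observation is non-speech (0) or speech (1):

$$b_k = 0: \hat{y}_k = \hat{n}_k \qquad b_k = 1: \hat{y}_k = \hat{s}_k + \hat{n}_k \qquad k = 1, 2, \dots, 2N+1$$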
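The mapping between a hypothesis index m and its per-frame bits can be sketched as below. The helper names are hypothetical, and the bit order is an assumption: the usual least-significant-bit-first convention is used, so that b_1 carries weight 1.

```python
def hypothesis_index(bits):
    """Integer m for bits = [b_1, ..., b_{2N+1}], with b_1 as least significant bit."""
    return sum(b << k for k, b in enumerate(bits))

def hypothesis_bits(m, n):
    """Inverse map: the 2N+1 bits [b_1, ..., b_{2N+1}] of hypothesis H_m."""
    return [(m >> k) & 1 for k in range(2 * n + 1)]

def central_bit(m, n):
    """Bit b_{N+1}, which labels the central frame as speech (1) or non-speech (0)."""
    return hypothesis_bits(m, n)[n]
```

For N = 1 there are 2^3 = 8 hypotheses, and the central bit splits them into two groups of four.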

Revised MO-LRT

• Thus, each hypothesis $H_m$ consists of 2N+1 individual hypotheses involving the 2N+1 observations

• The classification problem is then reformulated as selecting the class $G_i$ of the current frame: speech ($G_1$) or non-speech ($G_0$) is assigned depending on the bit $b_{N+1}$

• The set of all the possible hypotheses is split depending on the value of the central frame bit $b_{N+1}$ as:

$$M_1 = \{m : H_m \text{ has } b_{N+1} = 1\} \qquad M_0 = \{m : H_m \text{ has } b_{N+1} = 0\}$$

• The posterior probabilities are defined to be:

$$P(G_1|\hat{Y}) = \sum_{m \in M_1} p(H_m|\hat{Y}) \qquad P(G_0|\hat{Y}) = \sum_{m \in M_0} p(H_m|\hat{Y})$$

Revised MO-LRT

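• Using the Bayes rule:

$$P(G_1|\hat{Y}) = \frac{1}{P(\hat{Y})} \sum_{m \in M_1} p(\hat{Y}|H_m)\,p(H_m) \qquad P(G_0|\hat{Y}) = \frac{1}{P(\hat{Y})} \sum_{m \in M_0} p(\hat{Y}|H_m)\,p(H_m)$$

• A revised LRT can be defined as:

$$\frac{P(G_1|\hat{Y})}{P(G_0|\hat{Y})} = \frac{\sum_{m \in M_1} p(\hat{Y}|H_m)\,p(H_m)}{\sum_{m \in M_0} p(\hat{Y}|H_m)\,p(H_m)}$$

• The statistical test is approximated by replacing each summation with the maximum value of the probability over the hypotheses in $M_1$ and $M_0$:

$$\Lambda^* = \frac{\max_{m \in M_1} p(\hat{Y}|H_m)\,p(H_m)}{\max_{m \in M_0} p(\hat{Y}|H_m)\,p(H_m)}$$

• By taking logarithms, this test is expressed in a more compact form:

$$\log\Lambda^* = \max_{m \in M_1} l_m - \max_{m \in M_0} l_m \qquad \text{with } l_m = \log\left[p(\hat{Y}|H_m)\,p(H_m)\right]$$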
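The exact ratio of sums and its maximum-based approximation can be compared by brute-force enumeration for a small window. This is a sketch under stated assumptions: frames are taken to be independent given the hypothesis, and the priors p(H_m) are supplied through a hypothetical `log_prior` callable, since the slides do not specify them. The names `revised_log_lrt`, `flat_prior` and `all_equal_prior` are illustrative.

```python
import itertools
import math

def revised_log_lrt(b0, b1, log_prior):
    """Return (exact, approx) log ratio of central-frame speech vs non-speech.

    b0[k], b1[k]: log p(y_k | non-speech) and log p(y_k | speech).
    log_prior(bits): hypothetical callable giving log p(H_m) for a bit tuple.
    """
    centre = len(b0) // 2
    scores = {0: [], 1: []}
    for bits in itertools.product((0, 1), repeat=len(b0)):
        log_lik = sum(b1[k] if b else b0[k] for k, b in enumerate(bits))
        scores[bits[centre]].append(log_prior(bits) + log_lik)

    def log_sum_exp(values):
        top = max(values)
        return top + math.log(sum(math.exp(v - top) for v in values))

    exact = log_sum_exp(scores[1]) - log_sum_exp(scores[0])  # sums over M1 and M0
    approx = max(scores[1]) - max(scores[0])                 # maxima over M1 and M0
    return exact, approx

def flat_prior(bits):
    """All hypotheses equally likely."""
    return 0.0

def all_equal_prior(bits):
    """Only 'all speech' or 'all non-speech' allowed, as in the original MO-LRT."""
    return 0.0 if len(set(bits)) == 1 else -math.inf
```

Under these assumptions the two limiting priors behave as expected: with `all_equal_prior` both values equal the MO-LRT sum of per-frame log ratios, while with `flat_prior` the neighbouring frames cancel and both values reduce to the log ratio of the central frame alone. The contextual gain therefore depends on which hypotheses are allowed and how they are weighted.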

Revised MO-LRT

• Restrict the possible hypotheses

– Only a speech to non-speech or a non-speech to speech transition is considered in the N-frame neighborhood

$$L = K B_1 + (I - K) B_0$$

where:

$$L = [l_1, l_2, \dots, l_{2N+1}]^T$$

$$B_0 = [\log p(\hat{y}_1|0), \dots, \log p(\hat{y}_{2N+1}|0)]^T$$

$$B_1 = [\log p(\hat{y}_1|1), \dots, \log p(\hat{y}_{2N+1}|1)]^T$$

– $K$ is a Hankel matrix

Revised MO-LRT

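• The matrix $K$ can be split into two submatrices $K_0$ and $K_1$; the test is then easily reduced to:

$$\log\Lambda = \max L_1 - \max L_0$$

where

$$L_1 = K_1 B_1 + (I - K_1) B_0$$

$$L_0 = K_0 B_1 + (I - K_0) B_0$$

Note that $I - K_1 = K_0$, so the equations are reduced to:

$$L_1 = K_1 B_1 + (I - K_1) B_0$$

$$L_0 = (I - K_1) B_1 + K_1 B_0$$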
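The reduced test can be sketched in plain Python. Several points here are assumptions rather than facts from the slides: the actual entries of K are not reproduced in this transcript, so `hankel_k1` builds one plausible K_1 whose rows are the 2N+1 bit patterns with central bit 1 and at most one speech/non-speech transition; I is taken to be an all-ones matrix, so that I - K_1 flips every bit; and the function names are hypothetical.

```python
def hankel_k1(n):
    """Assumed K_1: entry (i, j) is 1 when n <= i + j <= 3n, else 0.

    Entries depend only on i + j, so the matrix is Hankel. Every row has a 1
    in the central column and at most one 0/1 transition.
    """
    width = 2 * n + 1
    return [[1 if n <= i + j <= 3 * n else 0 for j in range(width)]
            for i in range(width)]

def revised_statistic(b0, b1, n):
    """log Lambda = max(L_1) - max(L_0), using K_0 = (all ones) - K_1."""
    k1 = hankel_k1(n)
    # L_1 = K_1 B_1 + (I - K_1) B_0: speech log-likelihood where the bit is 1.
    l1 = [sum(b1[j] if bit else b0[j] for j, bit in enumerate(row)) for row in k1]
    # L_0 = (I - K_1) B_1 + K_1 B_0: the complementary hypotheses.
    l0 = [sum(b0[j] if bit else b1[j] for j, bit in enumerate(row)) for row in k1]
    return max(l1) - max(l0)
```

Only 2N+1 hypotheses per class are scored instead of all 2^(2N) of them, which is what makes the restricted test cheap to evaluate.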

Revised MO-LRT

• Finally

– The algorithm for voice activity detection is based on a comparison of a likelihood ratio to a given threshold $\eta$:

$$\log\Lambda = \max L_1 - \max L_0 \underset{G_0}{\overset{G_1}{\gtrless}} \eta$$

Experimental Results

• The AURORA subset of the Spanish SpeechDat-Car (SDC) was used

– Fig. 1 shows an utterance of the database in clean conditions (25 dB SNR)

– Fig. 2 shows an utterance under the noisiest conditions (5 dB SNR)

• ROC curves in quiet noise conditions (stopped car, engine running) with the close-talking microphone

• ROC curves in high noise conditions (high speed over a good road) with the distant-talking microphone

Conclusion

• The new approach evaluates not only the two hypotheses in which all the observations are speech or non-speech, but all the possible hypotheses defined over the individual observations

• A Hankel matrix was introduced into the revised statistical test, yielding a smoothing process and a reduced variance of the decision variable

Revised MO-LRT

• The algorithm is adaptive and suitable for non-stationary noise environments, since the noise statistics are updated whenever a frame is classified as non-speech. In this way, the variance of the noise is updated as:

$$\lambda_N(j) = \alpha\,\lambda_N(j) + (1-\alpha)\,|Y_j|^2$$

where $\alpha$ is a smoothing factor
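The noise adaptation step can be sketched as a recursive update applied only on frames classified as non-speech. The update formula on the slide is only partly legible, so the exponential smoothing form used here, the function name `update_noise_variance`, and the default value of `alpha` are all assumptions.

```python
def update_noise_variance(noise_var, power_spectrum, is_speech, alpha=0.98):
    """Per-bin noise variance update, applied only when the frame is non-speech.

    noise_var[j] is the current estimate of lambda_N(j) and power_spectrum[j]
    is |Y_j|^2 for the frame that was just classified.
    """
    if is_speech:
        # Speech frames leave the noise estimate untouched.
        return list(noise_var)
    return [alpha * lam + (1.0 - alpha) * y2
            for lam, y2 in zip(noise_var, power_spectrum)]
```

A value of `alpha` close to 1 makes the estimate change slowly, while a smaller value tracks non-stationary noise faster at the cost of a noisier estimate.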
