HIWIRE MEETING Granada, June 9-10, 2005

JOSÉ C. SEGURA, LUZ GARCÍA, JAVIER RAMÍREZ
GSTC UGR

Page 1: HIWIRE MEETING Granada, June 9-10, 2005

HIWIRE MEETING
Granada, June 9-10, 2005

JOSÉ C. SEGURA, LUZ GARCÍA, JAVIER RAMÍREZ

GSTC UGR

Page 2: HIWIRE MEETING Granada, June 9-10, 2005

2 HIWIRE Meeting – Granada, 9-10 June, 2005

Schedule

Non-linear feature normalization: ECDF segmental implementation; Progressive equalization; 2-class normalization

Non-linear speaker adaptation/independence: Non-linear feature normalization; Non-linear model adaptation

VAD and technique combination: MO-LRT; Bi-spectrum based VAD; Combined Front-End

Page 4: HIWIRE MEETING Granada, June 9-10, 2005

ECDF-based nonlinear transformation (1)

CDF-matching nonlinear transformation

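\begin{align*}
& x \sim p_X(x), \qquad y = T[x] \sim p_Y(y) \\
& C_X(x) = \int_{-\infty}^{x} p_X(u)\,du, \qquad C_Y(y) = \int_{-\infty}^{y} p_Y(u)\,du \\
& C_X(x) = C_Y(y) \;\Rightarrow\; x = T^{-1}[y] = C_X^{-1}(C_Y(y)) \\
& y = T[x] = C_Y^{-1}(C_X(x))
\end{align*}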

In previous work we modeled the CDFs using histograms
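The histogram-based CDF matching mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in the earlier work: the function name `histogram_equalize`, the number of bins, and the choice of a standard Gaussian as the reference distribution are all hypothetical.

```python
from statistics import NormalDist


def histogram_equalize(y, n_bins=100, reference=NormalDist(0.0, 1.0)):
    """Map each value through C_X^{-1}(C_Y(y)), with C_Y estimated from a histogram.

    Assumption: the reference CDF C_X is a standard Gaussian.
    """
    lo, hi = min(y), max(y)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for constant input
    counts = [0] * n_bins
    for v in y:
        counts[min(int((v - lo) / width), n_bins - 1)] += 1

    # Cumulative histogram evaluated at the bin centres
    # (half of each bin's own mass is included).
    cdf, running = [], 0
    for c in counts:
        cdf.append((running + 0.5 * c) / len(y))
        running += c

    out = []
    for v in y:
        b = min(int((v - lo) / width), n_bins - 1)
        p = min(max(cdf[b], 1e-6), 1.0 - 1e-6)  # keep p strictly inside (0, 1)
        out.append(reference.inv_cdf(p))
    return out
```

All values that fall in the same bin receive the same output, so the resolution of the transformation depends on the bin count.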

Page 5: HIWIRE MEETING Granada, June 9-10, 2005

ECDF-based nonlinear transformation (2)

An alternative algorithm based on Order Statistics

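\begin{align*}
& Y_T = \{y_1, y_2, \ldots, y_T\}, \qquad y_{(1)} \le y_{(2)} \le \cdots \le y_{(r)} \le \cdots \le y_{(T)} \\
& \text{ECDF:} \quad \hat{C}_Y(y_{(r)}) = \frac{r - 0.5}{T}, \qquad r = 1, \ldots, T \\
& \hat{x}_t = C_X^{-1}(\hat{C}_Y(y_t)) = C_X^{-1}\!\left(\frac{r(t) - 0.5}{T}\right), \qquad t = 1, \ldots, T
\end{align*}

where r(t) is the rank of y_t in the sorted sequence.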

It is faster: it requires only sorting and table indexing
Results are almost equal to those obtained with histograms
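A minimal sketch of the order-statistics approach, assuming a standard Gaussian reference CDF and using the ECDF value (r - 0.5)/T for the sample of rank r. The function name `ecdf_equalize` is hypothetical; ties are broken by position because Python's sort is stable.

```python
from statistics import NormalDist


def ecdf_equalize(y, reference=NormalDist(0.0, 1.0)):
    """Replace each y_t by C_X^{-1}((r(t) - 0.5) / T), where r(t) is the rank of y_t.

    Assumption: the reference CDF C_X is a standard Gaussian.
    """
    T = len(y)
    # Indices of the samples in ascending order of value.
    order = sorted(range(T), key=lambda t: y[t])
    x_hat = [0.0] * T
    for r, t in enumerate(order, start=1):  # r is the 1-based rank of y[t]
        x_hat[t] = reference.inv_cdf((r - 0.5) / T)
    return x_hat
```

Only one sort and one inverse-CDF lookup per sample are needed, which matches the "sorting and table indexing" cost noted on the slide (the inverse CDF could equally be a precomputed table).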

Page 6: HIWIRE MEETING Granada, June 9-10, 2005

ECDF Segmental implementation

Based on a sliding window

\[
Y_t = \{y_{t-T}, \ldots, y_t, \ldots, y_{t+T}\}, \qquad y_{(1)} \le y_{(2)} \le \cdots \le y_{(r)} \le \cdots \le y_{(2T+1)}
\]

José C. Segura, M. Carmen Benítez, Ángel de la Torre, Antonio J. Rubio, Javier Ramírez, “Cepstral domain segmental nonlinear feature transformations for robust speech recognition”, IEEE Signal Processing Letters, Vol. 11, No. 5, pp. 517-520, 2004
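The sliding-window variant can be sketched as below: each frame is equalized using only its rank inside a window of 2T+1 frames centred on it. The function name, the default half-window of 60 frames, the Gaussian reference, and the truncation of the window at the utterance edges are assumptions made for illustration.

```python
from statistics import NormalDist


def segmental_ecdf_equalize(y, half_window=60, reference=NormalDist(0.0, 1.0)):
    """Equalize y_t from its rank within {y_{t-T}, ..., y_t, ..., y_{t+T}}.

    Assumptions: Gaussian reference CDF; the window is truncated at the edges.
    """
    n = len(y)
    out = []
    for t in range(n):
        window = y[max(0, t - half_window): min(n, t + half_window + 1)]
        # 1-based rank of the centre value inside the window.
        r = 1 + sum(1 for v in window if v < y[t])
        out.append(reference.inv_cdf((r - 0.5) / len(window)))
    return out
```

Because frame t needs the T frames that follow it, this scheme introduces an algorithmic delay of T frames.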

Page 7: HIWIRE MEETING Granada, June 9-10, 2005

Progressive normalization

Not all MFCCs offer equal discrimination, and HEQ introduces some distortion
Normalizing only up to a certain MFCC gives the best performance
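One way to picture progressive normalization is to equalize only the first few feature columns and leave the higher-order coefficients untouched. The sketch below assumes rank-based equalization to a Gaussian reference; the names `equalize_column`, `progressive_equalize` and `last_index` are hypothetical.

```python
from statistics import NormalDist


def equalize_column(values, reference=NormalDist(0.0, 1.0)):
    """Rank-based equalization of one feature column to a Gaussian reference."""
    T = len(values)
    order = sorted(range(T), key=lambda t: values[t])
    out = [0.0] * T
    for r, t in enumerate(order, start=1):
        out[t] = reference.inv_cdf((r - 0.5) / T)
    return out


def progressive_equalize(frames, last_index):
    """Equalize feature columns 0..last_index; higher columns are left as they are.

    frames: list of feature vectors (one list of coefficients per frame).
    """
    columns = [list(col) for col in zip(*frames)]
    for k in range(min(last_index + 1, len(columns))):
        columns[k] = equalize_column(columns[k])
    return [list(row) for row in zip(*columns)]
```

The cut-off `last_index` is the tuning parameter: it trades the mismatch reduction from equalization against the distortion that equalization introduces in the less discriminative coefficients.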

Page 8: HIWIRE MEETING Granada, June 9-10, 2005

ECDF-based normalization results

Page 9: HIWIRE MEETING Granada, June 9-10, 2005

2-class normalization (1)

A first approach to parametric non-linear equalization
PDFs are modeled as two-class Gaussian mixtures for each MFCC
In practice we use speech-like and noise-like classes
EM is run on each sentence to obtain the Gaussian classes

[Figure: classes C0 and C1 for Test01 and Test02]

Page 10: HIWIRE MEETING Granada, June 9-10, 2005

2-class normalization (2)

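\[
\hat{x} = P(1 \mid y)\left[\mu_{x,1} + \frac{\sigma_{x,1}}{\sigma_{y,1}}\,(y - \mu_{y,1})\right] + P(2 \mid y)\left[\mu_{x,2} + \frac{\sigma_{x,2}}{\sigma_{y,2}}\,(y - \mu_{y,2})\right]
\]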

Equalization of C1 between Test02(Car) and Test01(Clean) of WSJ0 data

Nonlinear parametric transformation
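A sketch of a two-class parametric transformation of this kind, for a single feature value. It assumes the class Gaussians have already been estimated (for example by EM) and that each class is mapped linearly by matching mean and standard deviation, with the two mappings mixed by the class posteriors. The function name and argument layout are hypothetical.

```python
from statistics import NormalDist


def two_class_equalize(y, weights, noisy, reference):
    """Posterior-weighted sum of per-class linear mappings.

    weights:   prior probability of each class in the observed data
    noisy:     NormalDist of each class in the observed (test) data
    reference: NormalDist of each class in the reference (clean) data
    """
    # Posterior P(class | y) from Bayes' rule on the two-Gaussian mixture.
    joint = [w * g.pdf(y) for w, g in zip(weights, noisy)]
    total = sum(joint)
    posteriors = [j / total for j in joint]
    # Per-class linear map matching mean and standard deviation.
    mapped = [
        ref.mean + (ref.stdev / g.stdev) * (y - g.mean)
        for g, ref in zip(noisy, reference)
    ]
    return sum(p * m for p, m in zip(posteriors, mapped))
```

Near a class centre the posterior is close to one and the mapping is almost linear; between the classes the posteriors blend the two linear maps, which is what makes the overall transformation non-linear.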

Page 11: HIWIRE MEETING Granada, June 9-10, 2005

2-class normalization results

Page 12: HIWIRE MEETING Granada, June 9-10, 2005

Schedule

Non-linear feature normalization: ECDF segmental implementation; Progressive equalization; 2-class normalization

Non-linear speaker adaptation/independence: Non-linear feature normalization; Non-linear model adaptation

VAD and technique combination: MO-LRT; Bi-spectrum based VAD; Combined Front-End

Page 13: HIWIRE MEETING Granada, June 9-10, 2005

ECDF Features Normalization

HEQ as a non-linear speaker normalization technique using ECDF

[Figure: a) Reference (blue) and estimated (red) ECDF; b) Transformation; c) Original features; d) Transformed features]

Page 14: HIWIRE MEETING Granada, June 9-10, 2005

ECDF Norm. for SA

[Figure: bar chart of Test01 WAC (%) for MLLR, BASELINE, AFE and ECDF NORM]

Method     Test01 WER (%)   Test01 WAC (%)
MLLR       10.97            89.03
BASELINE   13.22            86.78
AFE        12.74            87.26
ECDF       11.23            88.77

Page 15: HIWIRE MEETING Granada, June 9-10, 2005

ECDF Models Adaptation

2 APPROACHES

Pure Equalization: “HEQ MOD”

new Gaussian distributions:

- shift on the means: X -> X_HEQ

- scale factor on the variances

Equalization mixed with linear transformation: “HEQ PLIN”

LT: X_A = M*X + B

M’, B’ such that

D(X_A, X_HEQ) = || M’X + B’ - X_HEQ ||^2 = minimum
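For a single feature dimension, the M’ and B’ that minimize the squared distance to the equalized features have the usual closed-form least-squares solution. The sketch below assumes this scalar (per-coefficient) case; the slide's M may well be a full matrix, in which case a matrix least-squares solver would be needed. The function name is hypothetical.

```python
def fit_linear_to_heq(x, x_heq):
    """Least-squares M', B' minimizing sum_t (M' * x_t + B' - x_heq_t)^2.

    Assumption: one feature dimension, so M' and B' are scalars.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_h = sum(x_heq) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxh = sum((a - mean_x) * (b - mean_h) for a, b in zip(x, x_heq))
    m = sxh / sxx              # slope M'
    b = mean_h - m * mean_x    # offset B'
    return m, b
```

For example, `fit_linear_to_heq([0, 1, 2, 3], [1, 3, 5, 7])` recovers a slope of 2 and an offset of 1, since the targets are an exact linear function of the inputs.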

[Figure labels: Speaker Independent Features; Speaker Specific Features]

Page 16: HIWIRE MEETING Granada, June 9-10, 2005

Models Adaptation

[Figure: bar chart of Test01 WAC (%) for MLLR, BASELINE, HEQ MOD and HEQ PLIN]

Method     Test01 WER (%)   Test01 WAC (%)
MLLR       10.97            89.03
BASELINE   13.22            86.78
HEQ MOD    12.95            87.05
HEQ PLIN   13.31            86.52

Page 17: HIWIRE MEETING Granada, June 9-10, 2005

SA methods. Comparison

[Figure: bar chart of Test01 WAC (%) for the speaker adaptation methods Baseline, MLLR, HEQ MOD, HEQ PLIN and ECDF NORM]

Page 18: HIWIRE MEETING Granada, June 9-10, 2005

Future Work 1/2

[Figure: bar chart of WAC (%) for MLLR, MRC, BASELINE, AFE and ECDF NORM]

SA models using MLLR are not robust against noise

Feature Normalization + MLLR

Page 19: HIWIRE MEETING Granada, June 9-10, 2005

Future Work 2/2

Non-linear feature normalization and model adaptation

Further experiments with more complex tasks on the WSJ1 database (spoke3 and spoke4)

Page 20: HIWIRE MEETING Granada, June 9-10, 2005

Schedule

Non-linear feature normalization: ECDF segmental implementation; Progressive equalization; 2-class normalization

Non-linear speaker adaptation/independence: Non-linear feature normalization; Non-linear model adaptation

VAD and technique combination: MO-LRT; Bi-spectrum based VAD; Combined Front-End

Page 21: HIWIRE MEETING Granada, June 9-10, 2005

Previous work on VAD

Voice activity detection:

Kullback-Leibler divergence
J. Ramírez, J. C. Segura, C. Benítez, A. de la Torre, A. Rubio, “A New Kullback-Leibler VAD for Robust Speech Recognition”, IEEE Signal Processing Letters, Vol. 11, No. 2, pp. 666-669, Feb. 2004

Long-term spectral divergence
J. Ramírez, J. C. Segura, C. Benítez, A. de la Torre, A. Rubio, “Efficient Voice Activity Detection Algorithms Using Long-Term Speech Information”, Speech Communication, Vol. 42/3-4, pp. 271-287, 2004

Subband SNR estimation using OS filters
J. Ramírez, J. C. Segura, C. Benítez, A. de la Torre, A. Rubio, “An Effective Subband OSF-based VAD with Noise Reduction for Robust Speech Recognition”, to appear in IEEE Transactions on Speech and Audio Processing, 2005/2006.

Multiple observation likelihood ratio test
J. Ramírez, J. C. Segura, C. Benítez, L. García, A. Rubio, “Statistical Voice Activity Detection using a Multiple Observation Likelihood Ratio Test”, to appear in IEEE Signal Processing Letters

Page 22: HIWIRE MEETING Granada, June 9-10, 2005

Likelihood ratio test

Generalization of Sohn’s VAD:
J. Sohn, N. S. Kim, W. Sung, “A statistical model-based voice activity detection”, IEEE Signal Processing Letters, vol. 6 (1), pp. 1-3, 1999.

Two hypotheses are considered:
H0: y = n (absence of speech, silence)
H1: y = s + n (speech presence)

Optimum decision rule (Bayes classifier):

\[
L(\hat{\mathbf{y}}_l) = \frac{p_{\mathbf{y}_l \mid H_1}(\hat{\mathbf{y}}_l \mid H_1)}{p_{\mathbf{y}_l \mid H_0}(\hat{\mathbf{y}}_l \mid H_0)} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \frac{P(H_0)}{P(H_1)}
\]

where \hat{\mathbf{y}}_l is the l-frame observation vector.

LRT evaluation requires an adequate signal model.

LRT: Likelihood ratio test

Page 23: HIWIRE MEETING Granada, June 9-10, 2005

Multiple observation likelihood ratio test

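MO-LRT (multiple observation LRT):
Given a set of N = 2m+1 consecutive observations:

\[
\hat{\mathbf{y}}_{l-m}, \ldots, \hat{\mathbf{y}}_{l-1}, \hat{\mathbf{y}}_{l}, \hat{\mathbf{y}}_{l+1}, \ldots, \hat{\mathbf{y}}_{l+m}
\]

LRT:

\[
L_N(\hat{\mathbf{y}}_{l-m}, \ldots, \hat{\mathbf{y}}_{l}, \ldots, \hat{\mathbf{y}}_{l+m}) = \frac{p_{\mathbf{y}_{l-m}, \ldots, \mathbf{y}_{l}, \ldots, \mathbf{y}_{l+m} \mid H_1}(\hat{\mathbf{y}}_{l-m}, \ldots, \hat{\mathbf{y}}_{l}, \ldots, \hat{\mathbf{y}}_{l+m} \mid H_1)}{p_{\mathbf{y}_{l-m}, \ldots, \mathbf{y}_{l}, \ldots, \mathbf{y}_{l+m} \mid H_0}(\hat{\mathbf{y}}_{l-m}, \ldots, \hat{\mathbf{y}}_{l}, \ldots, \hat{\mathbf{y}}_{l+m} \mid H_0)} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \frac{P(H_0)}{P(H_1)}
\]

Under statistical independence:

\[
L_N(\hat{\mathbf{y}}_{l-m}, \ldots, \hat{\mathbf{y}}_{l}, \ldots, \hat{\mathbf{y}}_{l+m}) = \prod_{k=l-m}^{l+m} \frac{p_{\mathbf{y}_k \mid H_1}(\hat{\mathbf{y}}_k \mid H_1)}{p_{\mathbf{y}_k \mid H_0}(\hat{\mathbf{y}}_k \mid H_0)}
\]

Recursive Log-LRT:

\[
\ell_N(\hat{\mathbf{y}}_{l-m}, \ldots, \hat{\mathbf{y}}_{l}, \ldots, \hat{\mathbf{y}}_{l+m}) = \sum_{k=l-m}^{l+m} \ln \frac{p_{\mathbf{y}_k \mid H_1}(\hat{\mathbf{y}}_k \mid H_1)}{p_{\mathbf{y}_k \mid H_0}(\hat{\mathbf{y}}_k \mid H_0)}
\]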
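The log-domain sum over 2m+1 frames can be sketched as follows. To keep the example self-contained it assumes a scalar feature per frame (for instance a log-energy) with a Gaussian likelihood under each hypothesis; this is a simplification for illustration and not the signal model of the slides. The function names, the parameter layout and the truncation of the window at the edges are hypothetical. Setting m = 0 gives the single-observation LRT.

```python
from math import log, pi


def log_gauss(y, mean, std):
    """Log of a scalar Gaussian density."""
    return -0.5 * ((y - mean) / std) ** 2 - log(std) - 0.5 * log(2.0 * pi)


def mo_lrt(frames, m, h0, h1, p_h0=0.5):
    """Multiple-observation log-LRT over a window of 2m+1 frames.

    frames: one scalar feature per frame (assumption)
    h0, h1: (mean, std) of the feature under non-speech and speech (assumption)
    Returns (scores, decisions); a decision is True when speech is detected.
    """
    # Per-frame log-likelihood ratio ln p(y|H1) - ln p(y|H0).
    llr = [log_gauss(y, *h1) - log_gauss(y, *h0) for y in frames]
    threshold = log(p_h0 / (1.0 - p_h0))  # ln(P(H0) / P(H1))

    # Prefix sums: each window sum is then a difference of two entries,
    # equivalent to updating the sum recursively as the window slides.
    prefix = [0.0]
    for v in llr:
        prefix.append(prefix[-1] + v)

    n = len(frames)
    scores, decisions = [], []
    for l in range(n):
        lo, hi = max(0, l - m), min(n, l + m + 1)  # window truncated at the edges
        s = prefix[hi] - prefix[lo]
        scores.append(s)
        decisions.append(s > threshold)
    return scores, decisions
```

With m > 0, a single high-likelihood frame also pulls its neighbours above the threshold, which illustrates the smoothing effect of the multiple-observation window, at the cost of an m-frame delay.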

Page 24: HIWIRE MEETING Granada, June 9-10, 2005

Analysis: Optimum delay

[Figure, panel “Probability distributions”: non-speech and speech distributions for m = 1, m = 4, m = 6 and m = 8]

Increasing m (the number of observations) reduces the overlap between the distributions.
Misclassification errors are reduced for speech, at the cost of a moderate increase for non-speech.

[Figure, panel “Classification errors”: non-speech, speech and total error as a function of m, for m = 1 to 12]

Page 25: HIWIRE MEETING Granada, June 9-10, 2005

Analysis: Optimum delay

ROC analysis: AURORA 3 Spanish (High-Ch1, 5 dB)

[Figure: ROC curves for Sohn’s VAD and MO-LRT]

Page 26: HIWIRE MEETING Granada, June 9-10, 2005

Speech recognition experiments

AURORA 2: average Wacc (%) for CT and MCT

VAD        MO-LRT    G.729   AMR1    AMR2        AFE
Wacc (%)   86.14     70.32   74.29   82.89       83.29

VAD        Ref. VAD  Woo     Li      Marzinzik   Sohn
Wacc (%)   86.86     81.09   82.11   85.23       83.80

[Block diagram: VAD, Noise estimation, Wiener Filtering (WF), Frame dropping (FD), MFCC, HTK]

Page 27: HIWIRE MEETING Granada, June 9-10, 2005

Speech recognition experiments

AURORA 3: Spanish SpeechDat-Car

WACC (%)   MO-LRT   G.729   AMR1    AMR2    AFE
WM         96.33    88.62   94.65   95.67   95.28
MM         91.61    72.84   80.59   90.91   90.23
HM         87.43    65.50   62.41   85.77   77.53
Average    91.79    75.65   74.33   90.78   87.68

WACC (%)   MO-LRT   Woo     Li      Marzinzik   Sohn
WM         96.33    95.35   91.82   94.29       96.07
MM         91.61    89.30   77.45   89.81       91.64
HM         87.43    83.64   78.52   79.43       84.03
Average    91.79    89.43   82.60   87.84       90.58

Page 28: HIWIRE MEETING Granada, June 9-10, 2005

Work in progress

Statistical tests in the bispectrum domain:

J. M. Górriz, et al., “Voice Activity Detection Based on HOS”, 8th International Work-Conference on Artificial Neural Networks (IWANN'2005)

J. M. Górriz, et al., “Statistical Tests for Voice Activity Detection”, Non-linear Speech Processing (NOLISP’2005), 2005.

J. M. Górriz, et al., “Bispectra analysis-based VAD for robust speech recognition”, First International Work-Conference on the Interplay Between Natural and Artificial Computation (IWINAC’2005)

Bispectrum LRT (application of MO-LRT on the bispectra)

J. M. Górriz, et al., “An Improved MO-LRT VAD Based on a Bispectra Gaussian Model”, submitted to Electronics Letters.

Page 29: HIWIRE MEETING Granada, June 9-10, 2005

GSTC-UGR speech recognition results

[Block diagram: Noise reduction, LTSE VAD, Frame dropping, Segmental ECDF (Gaussian ref.) Progressive, HTK]

LTSE VAD:
J. Ramírez, et al., “Efficient Voice Activity Detection Algorithms Using Long-Term Speech Information”, Speech Communication, Vol. 42/3-4, pp. 271-287, 2004

Segmental ECDF: 60 frame delay
J. C. Segura, et al., “Cepstral Domain Segmental Nonlinear Feature Transformations for Robust Speech Recognition”, IEEE Signal Processing Letters, Vol. 11, No. 5, pp. 517-520, 2004

Progressive: Log-E + Up to the 4th cepstral coefficient

Page 30: HIWIRE MEETING Granada, June 9-10, 2005

GSTC-UGR speech recognition results

AURORA 2, WACC (%)

Training                  System            SET A   SET B   SET C   Average
Multicondition training   GSTC-UGR          90.58   90.23   89.10   90.14
Multicondition training   HIWIRE baseline   88.40   88.96   88.97   88.74
Clean training            GSTC-UGR          86.01   86.84   85.00   86.14
Clean training            HIWIRE baseline   64.00   69.10   64.73   66.18

WER Relative Improvements: 12% (MCT) 59% (CT)

AURORA 3, WACC (%)

                  Italian                 Spanish                 Average
System            WM      MM      HM      WM      MM      HM      WM      MM      HM
GSTC-UGR          96.94   91.89   86.19   96.52   92.03   89.95   96.73   91.96   88.07
HIWIRE baseline   94.40   87.14   46.75   89.30   83.18   65.50   91.85   85.16   56.13

WER Relative Improvements: 60% (WM) 46% (MM) 73% (HM)
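The relative improvements quoted for AURORA 2 and AURORA 3 can be reproduced from the average accuracies, assuming WER = 100 - WACC (the convention that the Test01 WER/WAC pairs elsewhere in these slides follow). The function name is hypothetical.

```python
def relative_wer_improvement(wacc_baseline, wacc_system):
    """Relative WER reduction in percent, assuming WER = 100 - WACC."""
    wer_baseline = 100.0 - wacc_baseline
    wer_system = 100.0 - wacc_system
    return 100.0 * (wer_baseline - wer_system) / wer_baseline


# AURORA 2 averages (baseline, GSTC-UGR)
print(round(relative_wer_improvement(88.74, 90.14)))  # multicondition training: 12
print(round(relative_wer_improvement(66.18, 86.14)))  # clean training: 59

# AURORA 3 averages over Italian and Spanish
print(round(relative_wer_improvement(91.85, 96.73)))  # WM: 60
print(round(relative_wer_improvement(85.16, 91.96)))  # MM: 46
print(round(relative_wer_improvement(56.13, 88.07)))  # HM: 73
```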

Page 31: HIWIRE MEETING Granada, June 9-10, 2005

GSTC-UGR speech recognition results

AURORA 4 WER (%) (clean training experiments)

Test              1       2       3       4       5       6       7       Avg.
GSTC-UGR          13.37   19.52   37.53   40.22   39.19   37.16   39.30   32.33
HIWIRE baseline   13.22   24.68   46.00   47.62   52.67   44.79   54.73   40.53

Test              8       9       10      11      12      13      14      Avg.
GSTC-UGR          21.40   30.76   45.49   48.43   50.46   45.30   48.77   41.52
HIWIRE baseline   22.58   36.21   55.40   58.31   65.34   54.11   62.28   50.60

WER Relative Improvements: 20% (Test sets 1-7) 17% (Test sets 8-14)
