SRIV2006, Toulouse, France, May 20, 2006

Acoustic Modeling of Accented English Speech for Large-Vocabulary Speech Recognition

ATR Spoken Language Communication Research Laboratories (ATR-SLC), Kyoto, Japan

Konstantin Markov and Satoshi Nakamura


Outline

Motivation and previous studies.

HMM-based accent acoustic modeling.

Hybrid HMM/BN acoustic model for accented speech.

Evaluation and results.

Conclusion.


Motivation and Previous Studies

Accent variability:
- Causes performance degradation due to training / testing conditions mismatch.
- Becomes a major factor for public ASR applications.

Differences due to accent variability are mainly:
- Phonetic:
  - lexicon modification (Liu, ICASSP'98)
  - accent-dependent dictionary (Humphries, ICASSP'98)
- Acoustic (addressed in this work):
  - pooled-data HMM (Chengalvarayan, Eurospeech'01)
  - accent identification (Huang, ICASSP'05)


HMM-based approaches (1): multi-accent model

[Diagram] Accent-dependent data from accents A, B and C are pooled and used to train a single multi-accent acoustic model (MA-HMM). Recognition: input speech → feature extraction → decoder with the multi-accent AM → recognition result.


HMM-based approaches (2): parallel accent-dependent models

[Diagram] Accent-dependent data from accents A, B and C are used to train separate accent-dependent HMMs (A-HMM, B-HMM, C-HMM), which are combined in parallel into one acoustic model (PA-HMM). Recognition: input speech → feature extraction → decoder with the parallel AM → recognition result (a decoding sketch follows below).
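The slides do not spell out how the parallel accent-dependent models are combined at decoding time; one common realization is to decode with each accent-dependent HMM and keep the highest-scoring hypothesis. A minimal sketch under that assumption (the decoder itself is passed in as a function, since its implementation is not part of the slides):

```python
def decode_parallel(features, accent_models, decode_fn):
    """Hypothetical PA-HMM decoding: run each accent-dependent model and
    keep the hypothesis with the highest score.

    decode_fn(features, model) is assumed to return (hypothesis, log_score).
    """
    best_hyp, best_score = None, float("-inf")
    for accent, model in accent_models.items():
        hyp, score = decode_fn(features, model)  # one decoding pass per accent HMM
        if score > best_score:
            best_hyp, best_score = hyp, score
    return best_hyp

# Example call (models and decoder are placeholders):
# result = decode_parallel(feats, {"US": us_hmm, "BRT": brt_hmm, "AUS": aus_hmm}, decode_fn)
```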


HMM-based approaches (3): gender-dependent models

[Diagram] The accent-dependent data A, B and C are regrouped by gender to train gender-dependent HMMs (F-HMM, M-HMM), which are combined in parallel into one acoustic model (GD-HMM). Recognition: input speech → feature extraction → decoder with the parallel AM → recognition result.


Hybrid HMM/BN: Background

HMM/BN structure:
- HMM at the top level: models the temporal characteristics of speech by state transitions.
- Bayesian network (BN) at the bottom level: represents the state PDFs.

BN topologies, simple BN example: Q = HMM state, M = mixture component index, X = observation.
[Figure] A left-to-right HMM with states q1, q2, q3; each state's output distribution is given by the BN over Q, M and X.

State PDF (BN joint probability):
$P(X, M, Q) = P(X \mid M, Q)\, P(M \mid Q)\, P(Q)$

State output probability:
$P(X \mid Q) = \frac{P(X, Q)}{P(Q)} = \frac{\sum_m P(X, M=m, Q)}{P(Q)}$

If M is hidden, then:
$P(X \mid Q) = \sum_m P(M=m \mid Q)\, P(X \mid M=m, Q)$

i.e., the conventional Gaussian-mixture state model.
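As a small numerical illustration of the last formula, the state output probability with hidden M is just the familiar mixture sum; the 1-D Gaussians and toy parameters below are assumptions made only for the example:

```python
import math

def gauss(x, mean, var):
    """1-D Gaussian density N(x; mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def state_output_prob(x, mix_weights, means, variances):
    """P(x | Q) = sum_m P(M=m | Q) * P(x | M=m, Q), with M hidden."""
    return sum(w * gauss(x, mu, v)
               for w, mu, v in zip(mix_weights, means, variances))

# Toy three-component state PDF
print(state_output_prob(0.2, [0.5, 0.3, 0.2], [0.0, 1.0, -1.0], [1.0, 0.5, 2.0]))
```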


HMM/BN-based Accent Model

Accent and gender are modeled as additional variables of the BN.

The BN topology: variables Q (HMM state), A (accent), G (gender) and X (observation), with G = {F, M} and A = {A, B, C}.

When G and A are hidden:
$P(X \mid Q) = \frac{P(X, Q)}{P(Q)} = \frac{\sum_a \sum_g P(X, A=a, G=g, Q)}{P(Q)} = \sum_a \sum_g P(A=a)\, P(G=g)\, P(X \mid A=a, G=g, Q)$
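The same computation extended with the accent and gender variables looks as follows; the diagonal-covariance Gaussian form of P(X | A, G, Q) and all names below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def gaussian_diag(x, mean, var):
    """Diagonal-covariance Gaussian density N(x; mean, diag(var))."""
    return float(np.exp(-0.5 * np.sum((x - mean) ** 2 / var + np.log(2 * np.pi * var))))

def state_output_prob(x, components, p_accent, p_gender):
    """P(x | Q) = sum_a sum_g P(A=a) P(G=g) P(x | A=a, G=g, Q).

    components[(a, g)] holds the (mean, var) tied to accent a and gender g
    for this HMM state Q.
    """
    total = 0.0
    for (a, g), (mean, var) in components.items():
        total += p_accent[a] * p_gender[g] * gaussian_diag(x, mean, var)
    return total

# Toy usage: 2-dim features, accents {US, BRT, AUS}, genders {F, M}
p_accent = {"US": 1 / 3, "BRT": 1 / 3, "AUS": 1 / 3}
p_gender = {"F": 0.5, "M": 0.5}
components = {(a, g): (np.zeros(2), np.ones(2)) for a in p_accent for g in p_gender}
print(state_output_prob(np.zeros(2), components, p_accent, p_gender))
```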


HMM/BN Training

Initial conditions:
- Bootstrap HMM: gives the (tied) state structure.
- Labelled data: each feature vector has an accent and a gender label.

Training algorithm:
- Step 1: Viterbi alignment of the training data using the bootstrap HMM to obtain state labels.
- Step 2: Initialization of the BN parameters.
- Step 3: Forward-Backward based embedded HMM/BN training.
- Step 4: If the convergence criterion is met, stop; otherwise go to Step 3 (a loop skeleton is sketched below).
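The slide gives only the control flow, not the re-estimation formulas, so the following is merely a loop skeleton of Steps 1-4; every callable passed in (viterbi_align, init_bn_params, forward_backward_update, log_likelihood) is a hypothetical placeholder for the corresponding routine:

```python
def train_hmm_bn(bootstrap_hmm, data, viterbi_align, init_bn_params,
                 forward_backward_update, log_likelihood, tol=1e-4, max_iter=20):
    """Skeleton of the HMM/BN training procedure (Steps 1-4 of the slide)."""
    # Step 1: Viterbi alignment with the bootstrap HMM -> a state label per frame.
    state_labels = viterbi_align(bootstrap_hmm, data)

    # Step 2: initialize the BN parameters from the labelled frames
    # (each frame already carries its accent and gender label).
    model = init_bn_params(bootstrap_hmm, data, state_labels)

    prev_ll = float("-inf")
    for _ in range(max_iter):
        # Step 3: Forward-Backward based embedded HMM/BN re-estimation.
        model = forward_backward_update(model, data)
        ll = log_likelihood(model, data)
        # Step 4: stop when the convergence criterion is met, else repeat Step 3.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return model
```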


HMM/BN approach

[Diagram] Accent-dependent and gender-dependent data, A(F), A(M), B(F), B(M), C(F) and C(M), are used together to train a single HMM/BN acoustic model. Recognition: input speech → feature extraction → decoder with the HMM/BN AM → recognition result.


Comparison of state distributions

[Figure] Example state output distributions of the MA-HMM, PA-HMM, GD-HMM and HMM/BN models.


Database and speech pre-processing

Database:
- Accents: American (US), British (BRT), Australian (AUS).
- Speakers: 100 per accent (90 for training + 10 for evaluation).
- Utterances: 300 per speaker.
- Speech material is the same for each accent: travel arrangement dialogs.

Speech feature extraction:
- 20 ms frames at a 10 ms rate.
- 25-dimensional feature vectors (12 MFCC + 12 ΔMFCC + ΔE).
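For reference, a rough sketch of a comparable front end using librosa; the library choice, the 16 kHz sampling rate and the use of c0 as an energy term are assumptions, since the slide only fixes the frame size, frame rate and the 12 MFCC + 12 ΔMFCC + ΔE layout:

```python
import librosa
import numpy as np

def extract_features(wav_path, sr=16000):
    """25-dim features: 12 MFCC + 12 delta-MFCC + delta-energy (20 ms frames, 10 ms shift)."""
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(0.020 * sr)        # 20 ms analysis frame
    hop = int(0.010 * sr)          # 10 ms frame rate
    # 13 coefficients: c0 serves here as a rough log-energy term (an assumption).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    static = mfcc[1:13]                         # 12 MFCC (c1..c12)
    delta = librosa.feature.delta(static)       # 12 delta-MFCC
    delta_e = librosa.feature.delta(mfcc[0:1])  # delta of the energy-like c0 term
    return np.vstack([static, delta, delta_e]).T  # shape: (frames, 25)
```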


Models

Acoustic models:
- All HMM-based AMs have three states, left-to-right topology, triphone contexts; 3,275 states (MDL-SSS).
- Variants with 6, 18, 30 and 42 total Gaussians per state.
- HMM/BN model: same state structure as the HMM models, same number of Gaussian components.

Language model:
- Bigram and trigram (600,000 training sentences).
- 35,000-word vocabulary.
- Test data perplexity: 116.5 and 27.8.

Pronunciation lexicon: American English.



Evaluation results

Model type    Test data accent            Average
              US      BRT     AUS
US-HMM        91.5    51.1    68.6        70.4
BRT-HMM       77.9    84.7    83.8        82.1
AUS-HMM       81.6    73.5    90.9        82.0
MA-HMM        90.9    82.1    89.5        87.5
PA-HMM        89.6    81.7    86.4        85.9
GD-HMM        90.9    82.5    89.3        87.6
HMM/BN        91.4    83.1    90.3        88.2

Word accuracies (%); all models have a total of 42 Gaussians per state.


Conclusions

In the matched accent case, accent-dependent models are the best choice.

The HMM/BN model performs best overall, almost matching the results of the accent-dependent models, but it requires more mixture components.

The multi-accent HMM is the most efficient in terms of performance and complexity.

The different performance levels of the accent-dependent models are apparently caused by phonetic differences between the accents.