SRIV2006, Toulouse, France, May 20, 2006

Acoustic Modeling of Accented English Speech for Large-Vocabulary Speech Recognition

ATR Spoken Language Communication Research Laboratories (ATR-SLC), Kyoto, Japan

Konstantin Markov and Satoshi Nakamura


Outline

Motivation and previous studies.

HMM-based accent acoustic modeling.

Hybrid HMM/BN acoustic model for accented speech.

Evaluation and results.

Conclusion.


Motivation and Previous Studies

Accent variability:
- Causes performance degradation due to training / testing conditions mismatch.
- Becomes a major factor for public ASR applications.

Differences due to accent variability are mainly:
- Phonetic:
  - lexicon modification (Liu, ICASSP'98)
  - accent-dependent dictionary (Humphries, ICASSP'98)
- Acoustic (addressed in this work):
  - pooled-data HMM (Chengalvarayan, Eurospeech'01)
  - accent identification (Huang, ICASSP'05)


HMM-based approaches (1): multi-accent model

[Diagram] Accent-dependent data from accents A, B and C are pooled and used to train a single multi-accent acoustic model (MA-HMM). Recognition: input speech → feature extraction → decoder with the multi-accent AM → recognition result.


HMM-based approaches (2): parallel accent-dependent models

[Diagram] Accent-dependent data from accents A, B and C are used to train separate accent-dependent HMMs (A-HMM, B-HMM, C-HMM), which are combined in parallel into one acoustic model (PA-HMM). Recognition: input speech → feature extraction → decoder with the parallel AM → recognition result (a decoding sketch follows below).
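The slides do not spell out how the parallel accent-dependent models are combined at decoding time; one common realization is to decode with each accent-dependent HMM and keep the highest-scoring hypothesis. A minimal sketch under that assumption (the decoder itself is passed in as a function, since its implementation is not part of the slides):

```python
def decode_parallel(features, accent_models, decode_fn):
    """Hypothetical PA-HMM decoding: run each accent-dependent model and
    keep the hypothesis with the highest score.

    decode_fn(features, model) is assumed to return (hypothesis, log_score).
    """
    best_hyp, best_score = None, float("-inf")
    for accent, model in accent_models.items():
        hyp, score = decode_fn(features, model)  # one decoding pass per accent HMM
        if score > best_score:
            best_hyp, best_score = hyp, score
    return best_hyp

# Example call (models and decoder are placeholders):
# result = decode_parallel(feats, {"US": us_hmm, "BRT": brt_hmm, "AUS": aus_hmm}, decode_fn)
```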


HMM-based approaches (3): gender-dependent models

[Diagram] The accent-dependent data A, B and C are regrouped by gender to train gender-dependent HMMs (F-HMM, M-HMM), which are combined in parallel into one acoustic model (GD-HMM). Recognition: input speech → feature extraction → decoder with the parallel AM → recognition result.


Hybrid HMM/BN: Background

HMM/BN structure:
- HMM at the top level: models the temporal characteristics of speech by state transitions.
- Bayesian network (BN) at the bottom level: represents the state PDFs.

BN topologies, simple BN example: Q = HMM state, M = mixture component index, X = observation.
[Figure] A left-to-right HMM with states q1, q2, q3; each state's output distribution is given by the BN over Q, M and X.

State PDF (BN joint probability):
$P(X, M, Q) = P(X \mid M, Q)\, P(M \mid Q)\, P(Q)$

State output probability:
$P(X \mid Q) = \frac{P(X, Q)}{P(Q)} = \frac{\sum_m P(X, M=m, Q)}{P(Q)}$

If M is hidden, then:
$P(X \mid Q) = \sum_m P(M=m \mid Q)\, P(X \mid M=m, Q)$

i.e., the conventional Gaussian-mixture state model.
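As a small numerical illustration of the last formula, the state output probability with hidden M is just the familiar mixture sum; the 1-D Gaussians and toy parameters below are assumptions made only for the example:

```python
import math

def gauss(x, mean, var):
    """1-D Gaussian density N(x; mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def state_output_prob(x, mix_weights, means, variances):
    """P(x | Q) = sum_m P(M=m | Q) * P(x | M=m, Q), with M hidden."""
    return sum(w * gauss(x, mu, v)
               for w, mu, v in zip(mix_weights, means, variances))

# Toy three-component state PDF
print(state_output_prob(0.2, [0.5, 0.3, 0.2], [0.0, 1.0, -1.0], [1.0, 0.5, 2.0]))
```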


HMM/BN-based Accent Model

Accent and gender are modeled as additional variables of the BN.

The BN topology: variables Q (HMM state), A (accent), G (gender) and X (observation), with G = {F, M} and A = {A, B, C}.

When G and A are hidden:
$P(X \mid Q) = \frac{P(X, Q)}{P(Q)} = \frac{\sum_a \sum_g P(X, A=a, G=g, Q)}{P(Q)} = \sum_a \sum_g P(A=a)\, P(G=g)\, P(X \mid A=a, G=g, Q)$
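The same computation extended with the accent and gender variables looks as follows; the diagonal-covariance Gaussian form of P(X | A, G, Q) and all names below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def gaussian_diag(x, mean, var):
    """Diagonal-covariance Gaussian density N(x; mean, diag(var))."""
    return float(np.exp(-0.5 * np.sum((x - mean) ** 2 / var + np.log(2 * np.pi * var))))

def state_output_prob(x, components, p_accent, p_gender):
    """P(x | Q) = sum_a sum_g P(A=a) P(G=g) P(x | A=a, G=g, Q).

    components[(a, g)] holds the (mean, var) tied to accent a and gender g
    for this HMM state Q.
    """
    total = 0.0
    for (a, g), (mean, var) in components.items():
        total += p_accent[a] * p_gender[g] * gaussian_diag(x, mean, var)
    return total

# Toy usage: 2-dim features, accents {US, BRT, AUS}, genders {F, M}
p_accent = {"US": 1 / 3, "BRT": 1 / 3, "AUS": 1 / 3}
p_gender = {"F": 0.5, "M": 0.5}
components = {(a, g): (np.zeros(2), np.ones(2)) for a in p_accent for g in p_gender}
print(state_output_prob(np.zeros(2), components, p_accent, p_gender))
```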


HMM/BN Training

Initial conditions:
- Bootstrap HMM: gives the (tied) state structure.
- Labelled data: each feature vector has an accent and a gender label.

Training algorithm:
- Step 1: Viterbi alignment of the training data using the bootstrap HMM to obtain state labels.
- Step 2: Initialization of the BN parameters.
- Step 3: Forward-Backward based embedded HMM/BN training.
- Step 4: If the convergence criterion is met, stop; otherwise go to Step 3 (a loop skeleton is sketched below).
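The slide gives only the control flow, not the re-estimation formulas, so the following is merely a loop skeleton of Steps 1-4; every callable passed in (viterbi_align, init_bn_params, forward_backward_update, log_likelihood) is a hypothetical placeholder for the corresponding routine:

```python
def train_hmm_bn(bootstrap_hmm, data, viterbi_align, init_bn_params,
                 forward_backward_update, log_likelihood, tol=1e-4, max_iter=20):
    """Skeleton of the HMM/BN training procedure (Steps 1-4 of the slide)."""
    # Step 1: Viterbi alignment with the bootstrap HMM -> a state label per frame.
    state_labels = viterbi_align(bootstrap_hmm, data)

    # Step 2: initialize the BN parameters from the labelled frames
    # (each frame already carries its accent and gender label).
    model = init_bn_params(bootstrap_hmm, data, state_labels)

    prev_ll = float("-inf")
    for _ in range(max_iter):
        # Step 3: Forward-Backward based embedded HMM/BN re-estimation.
        model = forward_backward_update(model, data)
        ll = log_likelihood(model, data)
        # Step 4: stop when the convergence criterion is met, else repeat Step 3.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return model
```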


HMM/BN approach

[Diagram] Accent-dependent and gender-dependent data, A(F), A(M), B(F), B(M), C(F) and C(M), are used together to train a single HMM/BN acoustic model. Recognition: input speech → feature extraction → decoder with the HMM/BN AM → recognition result.


Comparison of state distributions

[Figure] Example state output distributions of the MA-HMM, PA-HMM, GD-HMM and HMM/BN models.


Database and speech pre-processing

Database:
- Accents: American (US), British (BRT), Australian (AUS).
- Speakers: 100 per accent (90 for training + 10 for evaluation).
- Utterances: 300 per speaker.
- Speech material is the same for each accent: travel arrangement dialogs.

Speech feature extraction:
- 20 ms frames at a 10 ms rate.
- 25-dimensional feature vectors (12 MFCC + 12 ΔMFCC + ΔE).
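For reference, a rough sketch of a comparable front end using librosa; the library choice, the 16 kHz sampling rate and the use of c0 as an energy term are assumptions, since the slide only fixes the frame size, frame rate and the 12 MFCC + 12 ΔMFCC + ΔE layout:

```python
import librosa
import numpy as np

def extract_features(wav_path, sr=16000):
    """25-dim features: 12 MFCC + 12 delta-MFCC + delta-energy (20 ms frames, 10 ms shift)."""
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(0.020 * sr)        # 20 ms analysis frame
    hop = int(0.010 * sr)          # 10 ms frame rate
    # 13 coefficients: c0 serves here as a rough log-energy term (an assumption).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    static = mfcc[1:13]                         # 12 MFCC (c1..c12)
    delta = librosa.feature.delta(static)       # 12 delta-MFCC
    delta_e = librosa.feature.delta(mfcc[0:1])  # delta of the energy-like c0 term
    return np.vstack([static, delta, delta_e]).T  # shape: (frames, 25)
```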


Models

Acoustic models:
- All HMM-based AMs have three states, left-to-right topology, triphone contexts; 3,275 states (MDL-SSS).
- Variants with 6, 18, 30 and 42 total Gaussians per state.
- HMM/BN model: same state structure as the HMM models, same number of Gaussian components.

Language model:
- Bigram and trigram (600,000 training sentences).
- 35,000-word vocabulary.
- Test data perplexity: 116.5 and 27.8.

Pronunciation lexicon: American English.



Evaluation results

Model type    Test data accent            Average
              US      BRT     AUS
US-HMM        91.5    51.1    68.6        70.4
BRT-HMM       77.9    84.7    83.8        82.1
AUS-HMM       81.6    73.5    90.9        82.0
MA-HMM        90.9    82.1    89.5        87.5
PA-HMM        89.6    81.7    86.4        85.9
GD-HMM        90.9    82.5    89.3        87.6
HMM/BN        91.4    83.1    90.3        88.2

Word accuracies (%); all models have a total of 42 Gaussians per state.


Conclusions

In the matched accent case, accent-dependent models are the best choice.

The HMM/BN model performs best overall, almost matching the results of the accent-dependent models, but it requires more mixture components.

The multi-accent HMM is the most efficient in terms of performance and complexity.

The different performance levels of the accent-dependent models are apparently caused by phonetic differences between the accents.