May 20, 2006 SRIV2006, Toulouse, France 1
Acoustic Modeling of Accented English Speech for Large-Vocabulary Speech Recognition
ATR Spoken Language Communication Research Laboratories (ATR-SLC), Kyoto, Japan
Konstantin Markov and Satoshi Nakamura
Outline
Motivation and previous studies.
HMM based accent acoustic modeling.
Hybrid HMM/BN acoustic model for accented speech.
Evaluation and results.
Conclusion.
Motivation and Previous Studies
Accent variability causes performance degradation due to a mismatch between training and testing conditions.
It is becoming a major factor for public ASR applications.
The differences due to accent variability are mainly:
Phonetic:
lexicon modification (Liu, ICASSP'98).
accent-dependent dictionary (Humphries, ICASSP'98).
Acoustic (addressed in this work):
pooled-data HMM (Chengalvarayan, Eurospeech'01).
accent identification (Huang, ICASSP'05).
HMM based approaches (1): Multi-accent HMM (MA-HMM)
[Diagram: accent-dependent data from accents A, B, and C are pooled and used to train a single multi-accent acoustic model (MA-HMM); input speech → Feature Extraction → Decoder → recognition result.]
HMM based approaches (2): Parallel accent-dependent HMMs (PA-HMM)
[Diagram: accent-dependent data from accents A, B, and C train separate accent-dependent HMMs (A-HMM, B-HMM, C-HMM), which are used in parallel as the acoustic model; input speech → Feature Extraction → Decoder → recognition result.]
HMM based approaches (3): Parallel gender-dependent HMMs (GD-HMM)
[Diagram: the accent-dependent data from accents A, B, and C train gender-dependent HMMs (F-HMM, M-HMM), which are used in parallel as the acoustic model; input speech → Feature Extraction → Decoder → recognition result.]
Hybrid HMM/BN Background
HMM/BN structure:
HMM at the top level: models the speech temporal characteristics by state transitions.
BN at the bottom level: represents the state PDFs.
[Diagram: HMM states q1, q2, q3 at the top level; below, a simple example BN with HMM state Q, mixture component index M, and observation X.]
State PDF:
P(X, M, Q) = P(X | M, Q) P(M | Q) P(Q)
State output probability:
P(X | Q) = P(X, Q) / P(Q) = Σ_m P(X, M=m, Q) / P(Q)
If M is hidden, then:
P(X | Q) = Σ_m P(M=m | Q) P(X | M=m, Q)
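Marginalizing out the hidden mixture index M reduces the state output probability to a familiar Gaussian mixture. A minimal numerical sketch for one state Q (all weights and Gaussian parameters are invented illustration values, not from the paper):

```python
import math

def gauss_pdf(x, mean, var):
    """Univariate Gaussian density N(x; mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# P(M=m | Q): mixture weights for one HMM state Q (illustrative values)
weights = [0.3, 0.7]
# P(X | M=m, Q): one Gaussian (mean, variance) per component (illustrative values)
params = [(-1.0, 1.0), (2.0, 0.5)]

def state_output_prob(x):
    """P(X=x | Q) = sum_m P(M=m | Q) * P(X=x | M=m, Q)."""
    return sum(w * gauss_pdf(x, m, v) for w, (m, v) in zip(weights, params))

print(state_output_prob(0.0))  # ≈ 0.0798
```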
HMM/BN based Accent Model
Accent and gender are modeled as additional variables of the BN.
The BN topology: [Diagram: BN with HMM state Q, accent A, gender G, and observation X.]
G = {F, M}
A = {A, B, C}
When G and A are hidden:
P(X | Q) = P(X, Q) / P(Q)
         = Σ_a Σ_g P(X, A=a, G=g, Q) / P(Q)
         = Σ_a Σ_g P(A=a) P(G=g) P(X | A=a, G=g, Q)
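With accent and gender as hidden BN variables, each accent/gender-specific Gaussian is weighted by the priors P(A=a) and P(G=g) and summed out. A toy sketch for a single state Q (all priors and Gaussian parameters are invented for illustration):

```python
import math

def gauss_pdf(x, mean, var):
    """Univariate Gaussian density N(x; mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Priors over accent and gender (illustrative values)
p_accent = {"A": 0.4, "B": 0.3, "C": 0.3}
p_gender = {"F": 0.5, "M": 0.5}
# P(X | A=a, G=g, Q): one Gaussian (mean, variance) per accent/gender pair
gauss = {("A", "F"): (-2.0, 1.0), ("A", "M"): (-1.0, 1.0),
         ("B", "F"): (0.0, 1.0),  ("B", "M"): (1.0, 1.0),
         ("C", "F"): (2.0, 1.0),  ("C", "M"): (3.0, 1.0)}

def state_output_prob(x):
    """P(X=x | Q) = sum_a sum_g P(A=a) P(G=g) P(X=x | A=a, G=g, Q)."""
    return sum(p_accent[a] * p_gender[g] * gauss_pdf(x, *gauss[(a, g)])
               for a in p_accent for g in p_gender)

print(state_output_prob(0.0))
```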
HMM/BN Training
Initial conditions:
Bootstrap HMM: gives the (tied) state structure.
Labelled data: each feature vector has an accent and gender label.
Training algorithm:
Step 1: Viterbi alignment of the training data using the bootstrap HMM to obtain state labels.
Step 2: Initialization of the BN parameters.
Step 3: Forward-Backward based embedded HMM/BN training.
Step 4: If the convergence criterion is met, stop; otherwise go to Step 3.
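The shape of this procedure (alignment assumed done, parameter initialization, then iterative re-estimation until the likelihood stops improving) can be sketched with a toy 1-D EM loop for a single state's mixture. This is not the paper's embedded Forward-Backward training, just a minimal stand-in with made-up data:

```python
import math, random

random.seed(0)

def gauss_pdf(x, mean, var):
    """Univariate Gaussian density N(x; mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Step 1 (assumed done): Viterbi alignment has attached a state label to each
# feature vector; here, toy 1-D vectors all belonging to one state.
data = [random.gauss(-2.0, 1.0) for _ in range(200)] + \
       [random.gauss(3.0, 1.0) for _ in range(200)]

# Step 2: initialize the per-state mixture parameters.
weights, means, vars_ = [0.5, 0.5], [-1.0, 1.0], [2.0, 2.0]

def log_likelihood():
    return sum(math.log(sum(w * gauss_pdf(x, m, v)
                            for w, m, v in zip(weights, means, vars_)))
               for x in data)

init_ll = log_likelihood()

# Steps 3-4: re-estimate until the convergence criterion is met.
prev = -float("inf")
for it in range(100):
    # E-step: posterior P(M=m | x) for each vector
    resp = []
    for x in data:
        p = [w * gauss_pdf(x, m, v) for w, m, v in zip(weights, means, vars_)]
        s = sum(p)
        resp.append([pi / s for pi in p])
    # M-step: re-estimate weights, means, and variances
    for m in range(2):
        n_m = sum(r[m] for r in resp)
        weights[m] = n_m / len(data)
        means[m] = sum(r[m] * x for r, x in zip(resp, data)) / n_m
        vars_[m] = sum(r[m] * (x - means[m]) ** 2
                       for r, x in zip(resp, data)) / n_m
    ll = log_likelihood()
    if ll - prev < 1e-6:  # convergence criterion
        break
    prev = ll
```

EM guarantees that the data log-likelihood never decreases between iterations, which is what makes the simple "stop when the improvement is tiny" criterion in Step 4 safe.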
HMM/BN approach
[Diagram: accent- and gender-dependent data, A(M), B(M), C(M), A(F), B(F), C(F), all train a single HMM/BN acoustic model; input speech → Feature Extraction → Decoder → recognition result.]
Comparison of state distributions
[Figure: state output distributions of the MA-HMM, PA-HMM, GD-HMM, and HMM/BN models.]
Database and speech pre-processing
Database:
Accents: American (US), British (BRT), Australian (AUS).
Speakers: 100 per accent (90 for training + 10 for evaluation).
Utterances: 300 per speaker.
Speech material is the same for each accent: travel arrangement dialogs.
Speech feature extraction:
20 ms frames at a 10 ms rate.
25-dim. feature vectors (12 MFCC + 12 ΔMFCC + ΔE).
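The 20 ms / 10 ms framing can be sketched as follows; the 16 kHz sampling rate is an assumption, since the slide does not state it:

```python
# Frame a waveform into overlapping analysis windows: 20 ms frames
# taken every 10 ms (assuming a 16 kHz sampling rate, not stated on the slide).
SAMPLE_RATE = 16000
FRAME_LEN = int(0.020 * SAMPLE_RATE)    # 320 samples per frame
FRAME_SHIFT = int(0.010 * SAMPLE_RATE)  # 160 samples between frame starts

def frame_signal(samples):
    """Split samples into all full 20 ms frames at a 10 ms rate."""
    return [samples[i:i + FRAME_LEN]
            for i in range(0, len(samples) - FRAME_LEN + 1, FRAME_SHIFT)]

one_second = [0.0] * SAMPLE_RATE
frames = frame_signal(one_second)
print(len(frames))  # 99 frames in one second of audio
```

Each frame would then be passed through the MFCC pipeline to produce one 25-dimensional feature vector.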
Models
Acoustic models:
All HMM-based AMs have three states, left-to-right, triphone contexts.
3,275 states (MDL-SSS).
Variants with 6, 18, 30 and 42 total Gaussians per state.
HMM/BN model:
Same state structure as the HMM models.
Same number of Gaussian components.
Language model:
Bi-gram and tri-gram (600,000 training sentences).
35,000-word vocabulary.
Test data perplexity: 116.5 and 27.8.
Pronunciation lexicon: American English.
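Perplexity, quoted above for the test data, is the exponentiated average negative log-probability per word under the language model. A minimal sketch with made-up per-word probabilities:

```python
import math

def perplexity(word_probs):
    """PP = exp(-(1/N) * sum_i log p(w_i)) over N test words."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# A uniform per-word probability of 0.1 gives a perplexity of exactly 10,
# i.e. the model is as uncertain as a uniform choice among 10 words.
print(perplexity([0.1] * 4))  # ≈ 10.0
```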
Evaluation results
Model type   US    BRT   AUS   Average
US-HMM       91.5  51.1  68.6  70.4
BRT-HMM      77.9  84.7  83.8  82.1
AUS-HMM      81.6  73.5  90.9  82.0
MA-HMM       90.9  82.1  89.5  87.5
PA-HMM       89.6  81.7  86.4  85.9
GD-HMM       90.9  82.5  89.3  87.6
HMM/BN       91.4  83.1  90.3  88.2
Word accuracies (%) per test data accent; all models have a total of 42 Gaussians per state.
Conclusions
In the matched accent case, accent-dependent models are the best choice.
The HMM/BN model performs best overall, almost matching the results of the accent-dependent models, but it requires more mixture components.
The multi-accent HMM is the most efficient in terms of performance versus complexity.
The different performance levels of the accent-dependent models are apparently caused by the phonetic differences between the accents.