6.345 Automatic Speech Recognition
Lecture # 18, Session 2003

ASR for Spoken-Dialogue Systems
• Introduction
• Speech recognition issues
  – Example using SUMMIT system for weather information
• Reducing computation
• Model aggregation
• Committee-based classifiers
Example Dialogue-based Systems
• Vocabularies typically have 1000s of words
• Widely deployed systems tend to be more conservative
• Directed dialogues have fewer words per utterance
• Word averages lowered by more confirmations
• Human-human conversations use more words
[Figure: average words per utterance and average utterances per call (0–30) for CSELT, SW/F, Philips, CMU/M, CMU/F, LIMSI, MIT/W, MIT/F, AT&T, and human-human dialogues]
Telephone-based, Conversational ASR
• Telephone bandwidths with variable handsets
• Noisy background conditions
• Novice users with small number of interactions
  – Men, women, children
  – Native and non-native speakers
  – Genuine queries, browsers, hackers
• Spontaneous speech effects – e.g., filled pauses, partial words, non-speech artifacts
• Out-of-vocabulary words and out-of-domain queries
• Full vocabulary needed for complete understanding
  – Word and phrase spotting are not primary strategies
  – Mixed-initiative dialog provides little constraint to recognizer
• Real-time decoding
Data Collection Issues
• System development is a chicken-and-egg problem
• Data collection has evolved considerably
  – Wizard-based → system-based data collection
  – Laboratory deployment → public deployment
  – 100s of users → thousands → millions
• Data from real users solving real problems accelerates technology development
  – Significantly different from laboratory environment
  – Highlights weaknesses, allows continuous evaluation
  – But, requires systems providing real information!
• Expanding corpora requires unsupervised training or adaptation to unlabelled data
Data Collection (Weather Domain)
• Over 756K utterances from 112K calls since May, 1997
[Figure: cumulative calls and utterances (0–70,000) collected per month from May 1997 onward]
• Initial collection of 3,500 read utterances and 1,000 wizard utterances
Weather Corpus Characteristics
• Approximately 11% of data contained significant noises
• Over 6% of data contained spontaneous speech effects
• At least 5% of data from speakerphones
[Pie charts: gender: Male 70%, Female 21%, Child 9%; native language: Native 86%, Non-Native 14%]
• Corpus dominated by American male speakers
Vocabulary Selection
• Constrained domains naturally limit vocabulary sizes
• 2000 word vocabulary gives good coverage for weather
• ~2% out-of-vocabulary rate on test sets (a coverage-curve sketch follows the figure)
[Figure: % coverage vs. vocabulary size (0–3000 words)]
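The coverage curve above can be computed directly from training-data word frequencies. Below is a minimal Python sketch, assuming only a list of training words; the commented file name and the cutoff sizes are illustrative, not part of the system.

    from collections import Counter

    def coverage_curve(training_words, vocab_sizes):
        """Percent of training tokens covered by the top-k most frequent words."""
        counts = Counter(training_words)
        total = sum(counts.values())
        # Most frequent words first, so a size-k vocabulary is the top-k list.
        freqs = [c for _, c in counts.most_common()]
        return {k: 100.0 * sum(freqs[:k]) / total for k in vocab_sizes}

    # words = open("train.txt").read().split()   # hypothetical training text
    # coverage_curve(words, [1000, 2000, 3000])  # e.g. {2000: ~98.0, ...}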
Vocabulary
• Lexicon based on syllabified LDC PRONLEX dictionary
• Incorporation of common reduced words & word pairs
• Current vocabulary consists of nearly 2000 words
• Based on system capabilities and user queries

Type        Size   Examples
Geography    933   boston, alberta, france, africa
Weather      217   temperature, snow, sunny, smog
Basic        815   i, what, january, tomorrow

Type        Examples
Reduction   give_me, going_to, want_to, what_is, i_would
Compound    clear_up, heat_wave, pollen_count
Example Vocabulary File (sorted alphabetically)

<>*
<pause1>
<pause2>
<uh>
<um>
<unknown>*
aa_m
am
don+t
new_york_city
sixty
today
today+s

• Numbers tend to be spelled out (e.g., sixty)
• Each word form has a separate entry (today, today+s)
• <> is the utterance start & end marker
• <uh>, <um> are filled-pause models; <pause1>, <pause2> model pauses at utterance start & end
• *'d items have no acoustic realization
• <unknown> is the out-of-vocabulary word model
• <>'d words don't count as errors
• Underbars distinguish letter sequences (aa_m) from actual words
• Lower case is a common convention; the + symbol is conventionally used for '
(A parsing sketch for these markers follows.)
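As a sketch, the marker conventions above can be decoded per entry; the function below and its returned fields are invented for illustration and are not SUMMIT's actual format reader.

    def parse_vocab_entry(entry):
        """Decode one vocabulary entry using the slide's conventions:
        trailing '*' = no acoustic realization, '<...>' = not scored as
        an error, '+' stands for an apostrophe."""
        no_acoustics = entry.endswith("*")
        word = entry[:-1] if no_acoustics else entry
        unscored = word.startswith("<") and word.endswith(">")
        display = word.replace("+", "'")    # today+s -> today's
        return {"word": word, "display": display,
                "no_acoustics": no_acoustics, "unscored": unscored}

    # parse_vocab_entry("today+s")    -> display "today's"
    # parse_vocab_entry("<unknown>*") -> unscored, no acoustic realization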
Example Baseform File
<pause1>  : - +
<pause2>  : - +
<uh>      : ah_fp
<um>      : ah_fp m
aa_m      : ey & eh m
either    : ( iy , ay ) th er
laptop    : l ae pd t aa pd
new_york  : n uw & y ao r kd
northwest : n ao r th w eh s td
trenton   : tr r eh n tq en
winter    : w ih nt er

• + : previous symbol can repeat
• & : word break allowing pause
• ah_fp : special filled-pause vowel
• ( , ) : alternate pronunciations (an expansion sketch follows)
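The parenthesized alternates can be expanded mechanically. A small sketch, assuming the notation above (comma-separated alternates inside parentheses); the helper is hypothetical, not SUMMIT's parser.

    import itertools
    import re

    def expand_baseform(pron):
        """Expand '( iy , ay ) th er' into every alternate pronunciation.
        '&' (word break) and '+' (repeatable symbol) pass through as
        ordinary tokens here."""
        # Parenthesized groups become choice lists; other tokens are fixed.
        parts = re.findall(r"\([^)]*\)|\S+", pron)
        choices = [[a.strip() for a in p[1:-1].split(",")] if p.startswith("(")
                   else [p] for p in parts]
        return [" ".join(c) for c in itertools.product(*choices)]

    # expand_baseform("( iy , ay ) th er") -> ['iy th er', 'ay th er']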
Editing Generated Baseforms
• Automatically generated baseform files should be manually checked for the following problems:
  – Missing pronunciation variants that are needed
  – Unwanted pronunciation variants that are present
  – Vocabulary words missing in PRONLEX

Before:
  going_to : g ow ix ng & t uw
  reading  : ( r iy df ix ng , r eh df ix ng )
  woburn   : <???>

After:
  going_to : g ( ow ix ng & t uw , ah n ax )
  reading  : r eh df ix ng
  woburn   : w ( ow , uw ) b er n
Applying Phonological Rules
• Phonemic baseforms are a canonical representation
• Baseforms may have multiple acoustic realizations
• Acoustic realizations are phones or phonetic units
• Example:

  batter : b ae tf er

  This can be realized phonetically as:
    bcl b ae tcl t er   (standard /t/)
  or as:
    bcl b ae dx er      (flapped /t/)
Example Phonological Rules
• Example rule for /t/ deletion (“destination”):
{s} t {ax ix} => [tcl t];
(rule format: {left context} phoneme {right context} => phonetic realization; an application sketch follows the second example)
• Example rule for palatalization of /s/ (“miss you”):
{} s {y} => s | sh;
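A toy illustration of applying such a rule to a phoneme string. The rule representation here (sets of left/right context phonemes, a list of alternate realizations, an empty string for deletion) is an assumption made for this sketch, not the actual rule engine.

    def apply_rule(phonemes, target, left, right, realizations):
        """Wherever `target` occurs with left context in `left` (empty set =
        any) and right context in `right`, emit the alternate realizations
        instead of the phoneme itself."""
        out = []
        for i, ph in enumerate(phonemes):
            l = phonemes[i - 1] if i > 0 else None
            r = phonemes[i + 1] if i + 1 < len(phonemes) else None
            match = ph == target and (not left or l in left) \
                                 and (not right or r in right)
            out.append(list(realizations) if match else [ph])
        return out

    # /t/ deletion context in "destination" (... s t ax ...); '' = deleted:
    # apply_rule("d eh s t ax n ey sh ax n".split(), "t",
    #            {"s"}, {"ax", "ix"}, ["tcl t", ""])
    # /s/ palatalization: apply_rule([...], "s", set(), {"y"}, ["s", "sh"])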
Language Modelling

• Class bi- and trigrams used to produce 10-best outputs
• Training data augmented with city and state constraints
• Relative entropy measure used to help select classes
• 200 word classes reduced perplexities and error rates

Example classes:
  raining, snowing       humidity, temperature
  cold, hot, warm        advisories, warnings
  extended, general      conditions, forecast, report

Type              Perplexity   % Word Error Rate
word bigram          18.4            16.0
+ word trigram       17.8            15.5
class bigram         17.6            15.6
+ class trigram      16.1            14.9
Defining N-gram Word Classes
• Class definitions have the class name on the left and a word on the right
• A "<U>_" prefix on a class name forces all words in the class to be equally likely
• Alternate words in a class can be placed on the same line with a "|" separator
  (a parsing sketch follows the examples)

CITY ==> boston
CITY ==> chicago
CITY ==> seattle

<U>_DIGIT ==> one
<U>_DIGIT ==> two
<U>_DIGIT ==> three
DAY ==> today | tomorrow
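Such definitions are easy to read into a table. A hedged sketch; the function below and the way it collects members are illustrative assumptions, not the toolkit's actual loader.

    def parse_class_defs(lines):
        """Collect {class_name: [member words]}; '|' separates alternates
        on one line, and a '<U>_' prefix marks an equal-probability class."""
        classes = {}
        for line in lines:
            name, _, words = line.partition("==>")
            members = [w.strip() for w in words.split("|")]
            classes.setdefault(name.strip(), []).extend(members)
        return classes

    defs = ["CITY ==> boston", "CITY ==> chicago", "CITY ==> seattle",
            "DAY ==> today | tomorrow"]
    # parse_class_defs(defs)
    #   -> {'CITY': ['boston', 'chicago', 'seattle'],
    #       'DAY': ['today', 'tomorrow']}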
The Training Sentence File
• An n-gram model is estimated from training data
• Training file contains one utterance per line
• Words in training file must have same case and form as words in vocabulary file
• Training file uses the following conventions:
  – Each clean utterance begins with <pause1> and ends with <pause2>
  – Compound word underbars are typically removed before training
  – Underbars automatically re-inserted during training based on compound words present in vocabulary file
• Special artifact units may be used for noises and other significant non-speech events:
  – <clipped1>, <clipped2>, <hangup>, <cough>, <laugh>
Example Training Sentence File
<pause1> when is the next flight to chicago <pause2>
<pause1> to san <partial> san francisco <pause2>
<pause1> <um> boston <pause2>
<clipped1> it be in time <pause2>
<pause1> good bye <hangup>
<pause1> united flight two oh four <pause2>
<pause1> <cough> excuse me <laugh> <pause2>
Notes: <partial> marks a partial word (e.g., "san die(go)"); <clipped1> marks a clipped word (e.g., "~(w)ill it"); all significant sounds are transcribed. (A normalization sketch for these conventions follows.)
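A minimal sketch of applying the training-file conventions from the previous slide to one transcript line. The greedy two-word compound merge and the simple start/end handling are simplifications assumed here, not the actual training tools.

    def normalize_utterance(text, compounds):
        """Re-insert vocabulary compounds (e.g. 'new york' -> 'new_york')
        and wrap a clean utterance in <pause1> ... <pause2>."""
        words = text.split()
        merged, i = [], 0
        while i < len(words):
            pair = "_".join(words[i:i + 2])
            if pair in compounds:       # adjacent pair forms a known compound
                merged.append(pair)
                i += 2
            else:
                merged.append(words[i])
                i += 1
        if not merged or not merged[0].startswith("<"):
            merged.insert(0, "<pause1>")
        if not merged[-1].startswith("<"):
            merged.append("<pause2>")
        return " ".join(merged)

    # normalize_utterance("flight to new york", {"new_york"})
    #   -> '<pause1> flight to new_york <pause2>'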
Composing FST Lexical Networks
• Four basic FST networks are composed to form the full search network:
  – G : Language model
  – L : Lexical model
  – P : Pronunciation model
  – C : Context-dependent acoustic model mapping
• Mathematically, the full network is composed using the expression C ∘ P ∘ L ∘ G (a toy composition sketch follows):

[Diagram, top to bottom: CD acoustic model labels → C: CD model mapping → phonetic units → P: pronunciation model → phonemic units → L: lexical model → words → G: language model → words]
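A toy composition over transducers stored as Python dicts, just to show how an arc of one network pairs with a matching arc of the next. Real lexical networks handle epsilon labels, determinization, and minimization, all omitted here.

    from collections import defaultdict

    def compose(f, g):
        """Compose transducers given as {state: [(in, out, next, weight)]}:
        an output label of `f` must match an input label of `g`; weights
        (log-probabilities) add."""
        result = defaultdict(list)
        for qf, arcs_f in f.items():
            for (i, m, qf2, wf) in arcs_f:
                for qg, arcs_g in g.items():
                    for (m2, o, qg2, wg) in arcs_g:
                        if m == m2:
                            result[(qf, qg)].append((i, o, (qf2, qg2), wf + wg))
        return dict(result)

    # L maps a phoneme string to a word; G scores the word (toy weights):
    L = {0: [("b ao s t ax n", "boston", 1, 0.0)]}
    G = {0: [("boston", "boston", 1, -0.35)]}
    # compose(L, G)[(0, 0)] -> [('b ao s t ax n', 'boston', (1, 1), -0.35)]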
FST Example

[Figure: FST fragment in which words share arcs. Each arc carries an input label, an output label, and a score, e.g. the arc "ix|n : IN / -3.57" has input label ix|n, output label IN, and score -3.57. Arcs shown:
  ix|n : IN / -3.57
  n|bcl : ε / 0
  bcl|b : ε / -2.35
  b|ao : BOSTON / 0, followed by ao|s : ε / 0
  b|aa : ε / 0, followed by aa|s : BOSTON / -0.35 or aa|z : BOSNIA / -1.23
The b|ao and b|aa branches are alternate pronunciations; BOSTON and BOSNIA share their initial arcs.]
Acoustic Models
• Models can be built for segments and boundaries
  – Best accuracy can be achieved when both are used
  – Current real-time recognition uses only boundary models
• Boundary labels combined into classes (a sizing and PCA sketch follows)
  – Classes determined using decision tree clustering
  – One Gaussian mixture model trained per class
  – 112-dimension feature vector reduced to 50 dimensions via PCA
  – 1 Gaussian component for every 50 training tokens (based on # dims)
• Models trained on over 100 hours of spontaneous telephone speech collected from several domains
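A sketch in plain NumPy of the two sizing choices above: one mixture component per 50 training tokens, and a PCA rotation from the 112-dimension boundary vector down to 50. The function name and defaults are illustrative.

    import numpy as np

    def size_and_project(features, tokens_per_component=50, out_dim=50):
        """Return (number of mixture components, PCA-reduced features)."""
        n_tokens, _ = features.shape
        n_components = max(1, n_tokens // tokens_per_component)
        centered = features - features.mean(axis=0)
        # Eigenvectors of the covariance, ascending; keep the top out_dim.
        _, vecs = np.linalg.eigh(np.cov(centered, rowvar=False))
        return n_components, centered @ vecs[:, -out_dim:]

    # toy = np.random.randn(500, 112)      # 500 boundary tokens, 112 dims
    # n, reduced = size_and_project(toy)   # n == 10, reduced.shape == (500, 50)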
Search Details
• Search uses forward and backward passes:
  – Forward Viterbi search using bigram
  – Backwards A* search using bigram to create a word graph
  – Rescore word graph with trigram (i.e., subtract bigram scores)
  – Backwards A* search using trigram to create N-best outputs
• Search relies on two types of pruning (a sketch follows):
  – Pruning based on relative likelihood score
  – Pruning based on maximum number of hypotheses
  – Pruning provides tradeoff between speed and accuracy
• Search can control tradeoff between insertions and deletions
  – Language model biased towards short sentences
  – Word transition weight (wtw) heuristic adjusted to remove bias
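A sketch of the two prunings, assuming each active hypothesis carries a log score; the beam width and hypothesis cap below are illustrative knobs, not the system's settings.

    def prune(hypotheses, beam=10.0, max_hyps=1000):
        """Keep hypotheses within `beam` of the best log score, then cap
        the count at `max_hyps`. Both knobs trade speed for search errors."""
        if not hypotheses:
            return []
        best = max(score for score, _ in hypotheses)
        kept = [(s, h) for (s, h) in hypotheses if best - s <= beam]
        kept.sort(key=lambda sh: sh[0], reverse=True)
        return kept[:max_hyps]

    # prune([(-5.0, "a"), (-30.0, "b"), (-7.5, "c")], beam=10.0)
    #   -> [(-5.0, 'a'), (-7.5, 'c')]    ("b" falls outside the beam)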
Recognition Experiments
• Collecting real data improves performance:
  – Enables increased complexity and improved robustness for acoustic and language models
  – Better match than laboratory recording conditions
[Figure: sentence and word error rates (%, 0–80) falling as training data grows (×1000 utterances, log scale) over successive collection periods]
Error Analysis (2506 Utterance Test Set)
[Figure: word error rate (%, 0–70) for the entire set and for subsets: In Domain (ID), Male (ID), Female (ID), Child (ID), Non-native (ID), Out of Domain, Expert (ID), each bar labeled with that subset's share of the data]

• 70% of speakers are male; 70% of test set is in domain
• Females ~50% worse than males; children ~3× worse than males
• Experienced users adapt to system!
• Out-of-domain results are from a different test set
A* Search Latency

[Figure: distribution of A* search latency, 0–4 seconds]

• Average latency 0.62 seconds
• 85% < 1 second; 99% < 2 seconds
• Latency not dependent on utterance length
Gaussian Selection
• ~50% of total computation is evaluation of Gaussian densities
• Can use binary VQ to select mixture components to evaluate
• Component selection criteria for each VQ codeword (a sketch follows):
  – Those within a distance threshold (components outside it are skipped)
  – Those within the codeword (i.e., every component used at least once)
• Can significantly reduce computation with small error loss
  – At least one component/model per codeword (i.e., only if necessary)
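A sketch of precomputing the per-codeword shortlist under the two criteria above. The data layout (NumPy arrays of component means and VQ codewords) is an assumption for illustration.

    import numpy as np

    def build_shortlists(means, codewords, threshold):
        """For each codeword keep (a) components whose mean lies within
        `threshold` of it and (b) components that quantize to it, so every
        component is used at least once."""
        d = np.linalg.norm(means[:, None, :] - codewords[None, :, :], axis=2)
        owner = d.argmin(axis=1)            # codeword each component falls in
        return {k: sorted(set(np.nonzero(d[:, k] <= threshold)[0]) |
                          set(np.nonzero(owner == k)[0]))
                for k in range(len(codewords))}

    # At decode time: quantize the frame to its codeword, then evaluate
    # only shortlists[codeword] instead of every component in the mixture.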
Model Aggregation

• K-means and EM algorithms converge to different local minima from different initialization points
• Performance on development data is not necessarily a strong indicator of performance on test data
  – TIMIT phonetic recognition error for 24 training trials

[Figure: test set error rate (%) vs. dev set error rate (%) for 24 training trials; the trial that is best on dev is worst on test! Correlation coefficient = 0.16]
Aggregation Experiments
• Combining different training runs can improve performance
• Three experimental systems: phonetic classification, phonetic recognition (TIMIT), and word recognition (RM)
• Acoustic models:
  – Mixture Gaussian densities, randomly initialized K-means
  – 24 different training trials
• Measure average performance of M unique N-fold aggregated models (starting from 24 separate models)

% Error       Phone Classification   Phone Recognition   Word Rec.
M=24, N=1            22.1                  29.3             4.5
M=6,  N=4            20.7                  28.4             4.2
M=1,  N=24           20.2                  28.1             4.0
% Reduction           8.3                   4.0            12.0
Model Aggregation

• Aggregation combines N classifiers, with equal weighting, to form one aggregate classifier:

  $\varphi_A(\vec{x}) = \frac{1}{N} \sum_{n=1}^{N} \varphi_n(\vec{x})$

[Figure: mixtures of 4, 2, and 6 Gaussians as constituent models]

• The expected error of an aggregate classifier is less than the expected error of any randomly chosen constituent
• N-fold aggregate classifier has N times more computation
• Gaussian kernels of aggregate model can be hierarchically clustered and selectively pruned
  – Experiment: prune 24-fold model back to size of smaller N-fold models
(An aggregation sketch follows.)
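A direct rendering of the equation above; the constituent classifiers are stand-in callables returning class scores, since any trained model would do.

    import numpy as np

    def aggregate(classifiers, x):
        """phi_A(x) = (1/N) * sum_n phi_n(x): equal-weight combination."""
        return np.mean([phi(x) for phi in classifiers], axis=0)

    # Stand-ins for mixtures trained from three different random starts:
    constituents = [lambda x: np.array([0.7, 0.3]),
                    lambda x: np.array([0.6, 0.4]),
                    lambda x: np.array([0.8, 0.2])]
    # aggregate(constituents, None) -> array([0.7, 0.3])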
Aggregation Experiments

[Figure: error rate (%, 27–28.5) on TIMIT dev set vs. number of aggregated training trials N (1, 2, 4, 6, 8, 12, 24), comparing the average N-fold trial, the best and worst trials, and the pruned 24-fold model]
Phonetic Classification Confusions
• Most confusions occur within manner class
Committee-based Classification
• Combining information sources can reduce error
• Change of temporal basis affects within-class error
  – Smoothly varying cosine basis better for vowels and nasals
  – Piecewise-constant basis better for fricatives and stops
[Figure: % error (20–30) overall and by manner class (Vowel, Nasal, Weak Fricative, Stop) for S1 (5 averages) vs. S3 (5 cosines)]
Committee-based Classifiers (Halberstadt, 1998)
• Uses multiple acoustic feature vectors and classifiers to incorporate different sources of information
• Explored 3 combination methods (voting, linear, independence-based); a sketch follows the table
• Obtains state-of-the-art phonetic classification and recognition results (TIMIT)
• Combining 3 boundary models in Jupiter weather domain:
  – Word error rate: 10–16% relative reduction over baseline
  – Substitution error rate: 14–20% relative reduction over baseline

Acoustic Measurements                            % Error   % Sub
B1 (30 ms, 12 MFCC, telescoping avg)              11.3      6.4
B2 (30 ms, 12 MFCC+ZC+E+LFE, 4 cos ±50 ms)        12.0      6.7
B3 (10 ms, 12 MFCC, 5 cos ±75 ms)                 12.1      6.9
B1 + B2 + B3                                      10.1      5.5
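A toy version of two of the three combination schemes (voting and a linear average); the independence-based product rule is noted in a comment. Function and argument names are illustrative only.

    import numpy as np

    def combine(posteriors, method="linear"):
        """Combine per-classifier class posteriors (one row per classifier,
        e.g. B1/B2/B3) and return the winning class index."""
        P = np.asarray(posteriors)
        if method == "linear":              # average the posteriors
            return int(P.mean(axis=0).argmax())
        if method == "vote":                # majority over top choices
            return int(np.bincount(P.argmax(axis=1)).argmax())
        # An independence-based rule would multiply posteriors instead.
        raise ValueError(method)

    # combine([[.5, .3, .2], [.4, .5, .1], [.6, .2, .2]], "vote") -> 0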
Related Work
• ROVER system developed at NIST [Fiscus, 1997]
  – "Recognizer Output Voting Error Reduction"
  – 1997 LVCSR Hub-5E Benchmark test
  – Combines confidence-tagged word recognition output from multiple recognizers
  – Produced 12.5% relative reduction in WER
• Notion of combining multiple information sources:
  – Syllable-based and word-based [Wu, Morgan et al., 1998]
  – Different phonetic inventories [AT&T]
  – 80, 100, or 125 frames per second [BBN]
  – Triphone and quinphone [HTK]
  – Subband-based speech recognition [Bourlard & Dupont, 1997]
References
• E. Bocchieri. Vector quantization for the efficient computation of continuous density likelihoods. Proc. ICASSP, 1993.
• T. Hazen and A. Halberstadt. Using aggregation to improve the performance of mixture Gaussian acoustic models. Proc. ICASSP, 1998.
• J. Glass, T. Hazen, and L. Hetherington. Real-time telephone-based speech recognition in the Jupiter domain. Proc. ICASSP, 1999.
• A. Halberstadt. Heterogeneous acoustic measurements and multiple classifiers for speech recognition. Ph.D. Thesis, MIT, 1998.
• T. Watanabe et al. Speech recognition using tree-structured probability density function. Proc. ICSLP, 1994.