Hidden Markov Models for speech recognition
SGN-24006, Lecture 9
Contents:
- Viterbi training
- Acoustic modeling aspects
- Isolated-word recognition
- Connected-word recognition
- Token passing algorithm
- Language models
Phoneme HMM
Each phoneme is represented by a left-to-right HMM with 3 states.
Word and sentence HMMs are constructed by concatenating the phoneme-level HMMs (e.g., the phoneme sequence W AX N).
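As a sketch of the concatenation idea, the following builds the transition matrix of a word-level HMM from 3-state left-to-right phoneme HMMs. The self-loop probability of 0.6 and the function name are illustrative assumptions, not values from the lecture.

```python
def word_hmm(phonemes, n_states=3, self_loop=0.6):
    """Concatenate left-to-right phoneme HMMs into one word-level HMM.
    Returns the composite transition matrix and, per state, the phoneme
    it belongs to. The self-loop probability is an illustrative value."""
    total = n_states * len(phonemes)
    trans = [[0.0] * total for _ in range(total)]
    labels = []
    for p_idx, ph in enumerate(phonemes):
        base = p_idx * n_states
        for i in range(n_states):
            labels.append(ph)
            s = base + i
            trans[s][s] = self_loop
            if s + 1 < total:                      # last state of a phoneme
                trans[s][s + 1] = 1.0 - self_loop  # links into the next one
            # (the final state of the word keeps only its self-loop here)
    return trans, labels

trans, labels = word_hmm(["W", "AX", "N"])  # a 9-state composite HMM
```

Sentence HMMs are built the same way, by chaining the word-level models.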
Viterbi training
The forward-backward algorithm assigns a probability that a feature vector was emitted from an HMM state. In Viterbi training, we instead construct the composite HMM from the phoneme units and use the Viterbi algorithm to find the single best state sequence.
1. For each training example, use the current HMM models to assign feature vectors to HMM states.
2. Using the Viterbi algorithm, find the most likely path through the composite HMM model (this is called Viterbi forced alignment).
3. Group the feature vectors assigned to each HMM state and estimate new parameters for each HMM (for example, using the GMM update equations).
4. Repeat the alignment and parameter re-estimation.
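The alignment step above can be sketched as follows, assuming a simple left-to-right topology where each frame either stays in its state or advances by one (transition log-probabilities are omitted for brevity, and the helper names are hypothetical):

```python
def forced_align(log_emit):
    """Viterbi forced alignment through a left-to-right HMM.
    log_emit[t][s] = log-likelihood of frame t under state s.
    Allowed moves: stay in the same state or advance by one.
    Must start in state 0 and end in the last state."""
    T, S = len(log_emit), len(log_emit[0])
    NEG = float("-inf")
    score = [[NEG] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_emit[0][0]
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1][s]
            advance = score[t - 1][s - 1] if s > 0 else NEG
            if advance > stay:
                score[t][s], back[t][s] = advance + log_emit[t][s], s - 1
            else:
                score[t][s], back[t][s] = stay + log_emit[t][s], s
    # Back-trace from the final state to recover the state sequence
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

def reestimate_means(frames, path, n_states):
    """Group frames by aligned state and re-estimate per-state means
    (stand-in for the full GMM update)."""
    groups = [[] for _ in range(n_states)]
    for x, s in zip(frames, path):
        groups[s].append(x)
    return [sum(g) / len(g) if g else None for g in groups]
```

Iterating `forced_align` and `reestimate_means` over the training data is one round of Viterbi training.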
Acoustic models
An ideal acoustic model is:
- Accurate: it accounts for context dependency (phonetic context).
- Compact: it provides a compact representation, trainable from finite amounts of data.
- General: it is a general representation that allows new words to be modeled, even if they were not seen in the training data.
Whole-word HMMs
Each word is modeled as a whole: each word is assigned an HMM with a number of states.

Is it a good acoustic model?
- Accurate: Yes, if there is enough data and the system has a small vocabulary; No, if trying to model context changes between words.
- Compact: No. More states are needed as the vocabulary increases, and there might not be enough training data to model EVERY word.
- General: No. It cannot be used to build new words.
Phoneme HMMs
Each phoneme is modeled using an HMM with M states.

Is it a good acoustic model?
- Accurate: No. It does not model coarticulation well.
- Compact: Yes. With N phonemes of M states each, the complete system has a total of M×N states, not so many parameters to estimate.
- General: Yes. Any new word can be formed by concatenating the units.
Modeling phonetic context
- Monophone: a single model is used to represent a phoneme in all contexts.
- Biphone: one model represents a particular left or right context. Notation: left-context biphone (a-b), right-context biphone (b+c).
- Triphone: one model represents a particular left and right context. Notation: (a-b+c).
Context-dependent model examples
Monophone: SPEECH → S P IY CH
Biphone, left context: SIL-S S-P P-IY IY-CH
Biphone, right context: S+P P+IY IY+CH CH+SIL
Triphone: SIL-S+P S-P+IY P-IY+CH IY-CH+SIL
Word-internal context-dependent triphones back off to left and right biphone models at the word boundary:
SPEECH RECOGNITION → SIL S-P S-P+IY P-IY+CH IY+CH R-EH R-EH+K EH-K+AH K-AH+G ..

Cross-word context-dependent triphones:
SIL-S+P S-P+IY P-IY+CH IY-CH+R CH-R+EH R-EH+K EH-K
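The a-b+c notation can be generated mechanically from a monophone string. This small helper (an illustrative sketch; `to_triphones` is not a name from the lecture) reproduces the SPEECH triphone example above for the cross-sentence-boundary case:

```python
def to_triphones(phones, boundary="SIL"):
    """Expand a monophone sequence into context-dependent triphones
    using the a-b+c notation; the sequence edges take the boundary
    symbol (here SIL) as their missing context."""
    ctx = [boundary] + phones + [boundary]
    return [f"{ctx[i - 1]}-{ctx[i]}+{ctx[i + 1]}"
            for i in range(1, len(ctx) - 1)]

print(to_triphones(["S", "P", "IY", "CH"]))
# ['SIL-S+P', 'S-P+IY', 'P-IY+CH', 'IY-CH+SIL']
```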
Context-dependent triphone HMMs
Each phoneme unit within its immediate left and right context is modeled using an HMM with M states.

Is it a good acoustic model?
- Accurate: Yes. It takes coarticulation into account.
- Compact: Yes.
- Trainable: No. For N phonemes there are N×N×N triphone models, too many parameters to estimate!
- General: Yes. New words can be formed by concatenating units.
Training issues
Many triphones occur infrequently, so there is not enough training data.
Solution: cluster HMM states which have similar statistical distributions, and estimate the HMM parameters using the pooled data.
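A minimal sketch of why the pooling helps, assuming scalar features and a single shared Gaussian per cluster of tied states; the function name and data layout are illustrative assumptions:

```python
def pooled_gaussian(tied_states, data_per_state):
    """Estimate one shared (mean, variance) from the data of several
    tied HMM states. A rare triphone state alone may have too few
    frames for a reliable estimate; pooling fixes that."""
    pooled = [x for s in tied_states for x in data_per_state[s]]
    n = len(pooled)
    mean = sum(pooled) / n
    var = sum((x - mean) ** 2 for x in pooled) / n
    return mean, var
```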
Isolated word recognition
Whole-word model:
- Collect many examples of each word spoken in isolation.
- Assign a number of states to each word model based on word duration.
- Estimate the HMM model parameters.

Subword-unit model:
- Collect a large corpus of speech and estimate phonetic unit HMMs.
- Construct word-level HMMs from the phoneme-level HMMs.
- This is more general than the whole-word approach.
Whole-word HMM

Viterbi algorithm through a model

Isolated word recognition system
P(O|W) is calculated using the Viterbi algorithm rather than the forward algorithm. Viterbi provides the probability of the path represented by the most likely state sequence.
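The two recursions differ only in replacing the sum over predecessor states with a max. A toy sketch with a discrete-observation HMM (plain probabilities rather than logs, for brevity; all names and numbers are illustrative):

```python
def forward_prob(init, trans, emit, obs):
    """P(O | model): sums over ALL state sequences (forward algorithm)."""
    n = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][o]
                 for j in range(n)]
    return sum(alpha)

def viterbi_prob(init, trans, emit, obs):
    """Probability of only the single best state sequence (max replaces sum)."""
    n = len(init)
    delta = [init[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        delta = [max(delta[i] * trans[i][j] for i in range(n)) * emit[j][o]
                 for j in range(n)]
    return max(delta)
```

Because Viterbi keeps only the single best path, its score never exceeds the forward probability.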
Connected-word recognition
The boundaries of the utterance are unknown, and the number of words spoken is unknown; the position of word boundaries is often unclear and difficult to determine.
Example: a two-word network.
Connected-words Viterbi search

Beam pruning
At each node we must compute the probability of the best state sequence up to that point, and keep the information about where it came from; this will allow back-tracing to find the best state sequence. During back-tracing we will find the word boundaries.

Beam pruning: at each time step, determine the log-probability of the absolute best Viterbi path, and deactivate any state j whose score falls more than a fixed beam width below it.
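A minimal sketch of the pruning rule, assuming `scores` holds the per-state Viterbi log-probabilities at the current frame and `beam_width` is the allowed log-score gap (names are illustrative):

```python
def beam_prune(scores, beam_width):
    """Keep only hypotheses whose log score is within beam_width of
    the current best; everything else is deactivated."""
    best = max(scores.values())
    return {s: v for s, v in scores.items() if v >= best - beam_width}
```

With a wide beam, pruning rarely discards the true best path; with a narrow beam, search is fast but risks search errors.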
Beam pruning illustration

Token passing approach
Assume each HMM state can hold multiple tokens. A token is an object that can move from state to state in the HMM network. Each token carries with it the log-scale Viterbi path score s.

At each time t we examine the tokens assigned to the nodes and propagate them to reachable positions at time t+1:
- Make a copy of the token.
- Adjust its path score to account for the transition within the HMM network and the observation probability.
- Merge tokens according to the Viterbi algorithm: select the token with the maximum score and discard all other competing tokens.
Token passing algorithm
Initialization (t = 0):
- Each initial state holds a token with score s = 0.
- All other states are initialized with a token with score s = −∞.

Algorithm (t > 0):
- Propagate tokens to all possible next states (all connecting states) and increment their scores.
- In each state, find the token with the largest s and discard the rest of the tokens in that state (Viterbi).

Termination (t = T):
- Examine the tokens in all possible final states and find the one with the largest Viterbi path score. This is the probability of the most likely state sequence.
Token propagation illustration
Token passing for connected-word recognition
Individual word models are connected into a composite model: tokens can transition from the final state of word m to the initial state of word n. Path scores are maintained by the tokens. The path sequence is also maintained by the tokens, allowing recovery of the best word sequence.

Tokens emitted from the last state of each word propagate to the initial state of each word. The probability of entering the initial state of a word W1 is the probability of that word given by the language model:
s = s + log P(W1)
Bayes formulation revisited
Recall the Bayes rule applied to speech recognition:
W* = argmax_W P(W|O) = argmax_W P(O|W) P(W)

In practice, we use log-probabilities:
W* = argmax_W [ log P(O|W) + log P(W) ]

where P(W), the probability of the word sequence, is given by the language model.
Language models
Usually the language model is also scaled by a grammar scale factor s, and a word transition penalty p is added.
Language models assign probabilities to word sequences, P(W).
This additional information helps to reduce the search space.
Language models also resolve homonyms:
"Write a letter to Mr. Wright right away."
There is a tradeoff between constraint and flexibility.
Statistical language models
We want to estimate P(W) = P(w1, w2, ..., wn).
We can decompose this probability left-to-right:
P(W) = P(w1) P(w2 | w1) P(w3 | w1, w2) ... P(wn | w1, ..., wn−1)
How does this work?
P(W) = P(analysis of audio, speech and music signals)
     = P(analysis) P(of | analysis) P(audio | analysis of) ...

How can we model the entire word sequence? There is never enough training data! Consider restricting the word history.
Practical training
Consider word histories ending in the same last N−1 words, and treat the sequence as a Markov model:
- N = 1: P(wn)
- N = 2: P(wn | wn−1)
- N = 3: P(wn | wn−2, wn−1)
n-gram language models
The probability of a word is conditioned on the previous N−1 words:
- N = 1: unigram
- N = 2: bigram
- N = 3: trigram

Training: the probabilities are estimated from a corpus of training data (a large amount of text). Once the model is trained, it can be used to generate new sentences randomly. Syntax is roughly encoded by the obtained model, but the generated sentences are often ungrammatical and semantically strange.
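A toy sketch of both steps (estimation and random generation) using a bigram model over an invented three-sentence corpus; all names and the corpus are illustrative:

```python
import random
from collections import Counter

def train_bigram(corpus):
    """ML bigram estimates from a tiny corpus:
    P(w2 | w1) = C(w1, w2) / C(w1). <s> and </s> mark sentence
    start and end."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        words = ["<s>"] + sent.split() + ["</s>"]
        uni.update(words[:-1])
        bi.update(zip(words[:-1], words[1:]))
    return {(w1, w2): c / uni[w1] for (w1, w2), c in bi.items()}

def generate(bigram, rng, max_len=10):
    """Sample a sentence from the model; output is often ungrammatical."""
    w, out = "<s>", []
    for _ in range(max_len):
        cands = [(w2, p) for (w1, w2), p in bigram.items() if w1 == w]
        w = rng.choices([c for c, _ in cands], [p for _, p in cands])[0]
        if w == "</s>":
            break
        out.append(w)
    return out

lm = train_bigram(["the cat sat", "the dog sat", "the cat ran"])
```

For example, `lm[("the", "cat")]` is 2/3, since "the" occurs three times and is followed by "cat" twice.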
Trigram example
P(states | the united) = ..
P(America | states of) = ..
Estimating the n-gram probabilities
Given a text corpus, define:
- C(wn): the count of occurrences of word wn
- C(wn−1, wn): the count of occurrences of word wn−1 followed by word wn
- C(wn−2, wn−1, wn): the count of occurrences of word wn−2 followed by words wn−1 and wn
Estimating the n-gram probabilities
Based on the count frequencies of occurrence of the word sequences, the maximum likelihood estimates of the word probabilities are calculated.
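The maximum likelihood estimates implied by these counts take the standard form (reconstructed here, since the original formulas did not survive extraction):

```latex
P(w_n) = \frac{C(w_n)}{\sum_k C(w_k)}, \qquad
P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n)}{C(w_{n-1})}, \qquad
P(w_n \mid w_{n-2}, w_{n-1}) = \frac{C(w_{n-2}, w_{n-1}, w_n)}{C(w_{n-2}, w_{n-1})}
```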
n-grams in the decoding process
The goal of the search is to find the most likely string of symbols (phonemes, words, etc.) to account for the observed speech waveform.

Connected-word example:
Connected-word log-Viterbi search
At each node we must compute the Viterbi path score; on a transition from word i into word j we add
s · log P(wj | wi) + p
where log P(wj | wi) is the log language model score, s is the grammar scale factor, and p is the (log) word transition penalty.
Beam search revisited

Language model in the search
The language model scores are applied at the point where there is a transition INTO a word. As the number of words increases, the number of states and interconnections increases too. N-grams are easy to incorporate into the token passing algorithm:
s = s + g · log P(W1) + p
(Note: here g is the grammar scale factor, since s was used to denote the path score.)
The language model score is added to the path score upon word entry, so the token keeps the combined acoustic and language model information.
Lyrics recognition from singing
Y EH S T ER D EY  vs  Y EH S . T AH D EY
M AY . M AY  vs  M AA M AH
AO L . DH AH . W EY  vs  AO L . AH W EY