Methods Pattern Recognition - LIACS
Transcript of Methods Pattern Recognition - LIACS
LML Speech Recognition 2009
Speech Recognition: Signal Processing and Analysis
E.M. Bakker
Features for Speech Recognition and Audio Indexing
Parametric Representations
– Short Time Energy
– Zero Crossing Rates
– Level Crossing Rates
– Short Time Spectral Envelope
Spectral Analysis
– Filter Design
– Filter Bank Spectral Analysis Model
– Linear Predictive Coding (LPC)
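The first two parametric representations above can be sketched as simple per-frame computations. A minimal illustration (the frame length, test tones, and function names are my own choices, not from the slides):

```python
import numpy as np

# Illustrative sketch of short-time energy and zero-crossing rate, two of
# the parametric representations listed above, computed frame by frame.
def short_time_energy(s, frame_len=200):
    frames = s[: len(s) // frame_len * frame_len].reshape(-1, frame_len)
    return (frames ** 2).sum(axis=1)

def zero_crossing_rate(s, frame_len=200):
    frames = s[: len(s) // frame_len * frame_len].reshape(-1, frame_len)
    signs = np.sign(frames)
    # fraction of adjacent sample pairs whose sign changes
    return (np.abs(np.diff(signs, axis=1)) > 0).mean(axis=1)

Fs = 8000
n = np.arange(2000)
low = np.sin(2 * np.pi * 120 * n / Fs)     # voiced-like tone: low ZCR
high = np.sin(2 * np.pi * 1900 * n / Fs)   # fricative-like tone: high ZCR
# high-frequency content shows up as a much higher zero-crossing rate
```

This is why ZCR is a cheap cue for voiced/unvoiced decisions: unvoiced (noisy) segments cross zero far more often per frame than voiced ones.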
Methods
Vector Quantization
– Finite code book of spectral shapes
– The code book codes for 'typical' spectral shapes
– Method for all spectral representations (e.g. Filter Banks, LPC, ZCR, etc.)
Ensemble Interval Histogram (EIH) Model
– Auditory-based spectral analysis model
– More robust to noise and reverberation
– Expected to be an inherently better representation of the relevant spectral information because it models the human cochlea mechanics
Pattern Recognition
[Block diagram] Speech, Audio, … → Parameter Measurements → Test Pattern (Query Pattern) → Pattern Comparison (against Reference Patterns) → Decision Rules → Recognized Speech, Audio, …
Pattern Recognition
[Block diagram] Speech, Audio, … → Feature Detector 1, …, Feature Detector n → Feature Combiner and Decision Logic → Hypothesis Tester (against Reference Vocabulary Features) → Recognized Speech, Audio, …
Spectral Analysis Models
Pattern Recognition Approach
1. Parameter Measurement => Pattern
2. Pattern Comparison
3. Decision Making
Parameter Measurements
– Bank of Filters Model
– Linear Predictive Coding Model
Band Pass Filter
[Block diagram] Audio Signal s(n) → Bandpass Filter F(·) → Result Audio Signal F(s(n))
Note that the bandpass filter can be defined as:
• a convolution with a filter response function in the time domain,
• a multiplication with a filter response function in the frequency domain.
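A hypothetical sketch of this equivalence (the test signal and band edges are mine): an ideal bandpass filter applied once by multiplication with a frequency response, and once by circular convolution with the corresponding impulse response.

```python
import numpy as np

Fs = 8000
N = 512
n = np.arange(N)
# two tones: 250 Hz inside the pass band, 2500 Hz outside it
s = np.sin(2 * np.pi * 250 * n / Fs) + np.sin(2 * np.pi * 2500 * n / Fs)

# ideal bandpass frequency response: pass 100-1000 Hz, block the rest
freqs = np.abs(np.fft.fftfreq(N, d=1 / Fs))
H = ((freqs >= 100) & (freqs <= 1000)).astype(float)

# (1) multiplication with the filter response in the frequency domain
y_freq = np.real(np.fft.ifft(np.fft.fft(s) * H))

# (2) circular convolution with the impulse response h(n) = IFFT(H)
h = np.real(np.fft.ifft(H))
y_time = np.array([s @ h[(k - np.arange(N)) % N] for k in range(N)])

# both give the same output: the 250 Hz tone passes, the 2500 Hz tone is gone
```

The two outputs agree to numerical precision, which is exactly the convolution theorem the slide's two bullet points rely on.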
Bank of Filters Analysis Model
Bank of Filters Analysis Model
Speech Signal: s(n), n = 0, 1, …
– Digital, with Fs the sampling frequency of s(n)
Bank of q Band Pass Filters: BPF1, …, BPFq
– Spanning a frequency range of, e.g., 100-3000 Hz or 100 Hz-16 kHz
– BPFi(s(n)) = x_n(e^{jωi}), where ωi = 2πfi/Fs is the normalized frequency fi, for i = 1, …, q
– x_n(e^{jωi}) is the short-time spectral representation of s(n) at time n, as seen through BPFi with centre frequency ωi, for i = 1, …, q
Note: each BPF independently processes s to produce the spectral representation x.
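A sketch of this model: q ideal band-pass filters, each realized here as an FFT-domain mask, turning s(n) into q filtered signals. The band edges are illustrative choices within the 100-3000 Hz range mentioned above.

```python
import numpy as np

# Toy bank-of-filters front end: each BPF_i is an ideal FFT-domain mask.
def filter_bank(s, Fs, edges):
    """Apply q = len(edges) - 1 ideal band-pass filters to s."""
    N = len(s)
    freqs = np.abs(np.fft.fftfreq(N, d=1 / Fs))
    S = np.fft.fft(s)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = ((freqs >= lo) & (freqs < hi)).astype(float)   # BPF_i
        bands.append(np.real(np.fft.ifft(S * mask)))
    return np.array(bands)

Fs = 8000
n = np.arange(1024)
s = np.sin(2 * np.pi * 500 * n / Fs)          # all energy in one band
x = filter_bank(s, Fs, edges=[100, 400, 800, 1600, 3000])
energies = (x ** 2).sum(axis=1)               # the 400-800 Hz band dominates
```

Each row of `x` is one filter's output; per-band energies over short frames are the kind of feature vector the later slides quantize.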
Bank of Filters Front End Processor
Typical Speech Wave Forms
MFCCs
[Block diagram] Speech, Audio, … → Preemphasis → Windowing → Fast Fourier Transform → Mel-Scale Filter Bank → Log(·) → Discrete Cosine Transform → MFCCs (first 12 most significant coefficients)
MFCCs are calculated using the formula:

C_i = Σ_{k=1..N} X_k · cos(π · i · (k − 0.5) / N),  i = 1, 2, …, P

where
• C_i is the i-th cepstral coefficient
• P is the order (12 in our case)
• K is the number of discrete Fourier transform magnitude coefficients
• X_k is the k-th order log-energy output from the Mel-scale filter bank
• N is the number of filters
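The formula above is a DCT of the N log filter-bank energies. A small sketch (the input energies are made up; a real front end would take them from the mel filter bank):

```python
import numpy as np

# C_i = sum_{k=1..N} X_k * cos(pi * i * (k - 0.5) / N), i = 1..P
def mfcc_from_log_energies(X, P=12):
    N = len(X)                                  # number of mel filters
    k = np.arange(1, N + 1)
    return np.array([np.sum(X * np.cos(np.pi * i * (k - 0.5) / N))
                     for i in range(1, P + 1)])

X = np.log(np.linspace(1.0, 10.0, 20))          # hypothetical log energies
C = mfcc_from_log_energies(X)                   # the 12 MFCCs of this frame
```

Because the cosines form a DCT-II basis, a flat (constant) log spectrum yields all-zero coefficients; the MFCCs measure the shape, not the level, of the spectral envelope.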
Linear Predictive Coding Model
Filter Response Functions
Some Examples of Ideal Band Filters
Perceptually Based Critical Band Scale
Short Time Fourier Transform

X_n(e^{jω}) = Σ_m s(m) · w(n − m) · e^{−jωm}

where
• s(m) is the signal
• w(n − m) is a fixed low-pass window
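A minimal sketch of one STFT analysis frame: multiply the signal by a Hamming window and take the FFT. The 500-sample window matches the "50 msec" of the following slides (so Fs = 10 kHz); the test tone is my own choice.

```python
import numpy as np

def stft_frame(s, n, win_len=500):
    """Short-time spectrum of s for the frame ending at sample n."""
    w = np.hamming(win_len)                     # fixed low-pass window w
    frame = s[n - win_len + 1 : n + 1] * w      # s(m) * w(n - m)
    return np.fft.fft(frame)

Fs = 10000
m = np.arange(4000)
s = np.sin(2 * np.pi * 1000 * m / Fs)           # a 1 kHz tone
X = stft_frame(s, 2000)
peak_bin = int(np.argmax(np.abs(X[: len(X) // 2])))
# the spectral peak lands at bin 1000 / Fs * win_len = 50
```

Longer windows give finer frequency resolution (good for seeing pitch harmonics in voiced speech); shorter windows give finer time resolution, which is exactly the trade-off the next four slides illustrate.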
Short Time Fourier Transform
Long Hamming Window: 500 samples (= 50 msec)
Voiced Speech
Short Time Fourier Transform
Short Hamming Window: 50 samples (= 5 msec)
Voiced Speech
Short Time Fourier Transform
Long Hamming Window: 500 samples (= 50 msec)
Unvoiced Speech
Short Time Fourier Transform
Short Hamming Window: 50 samples (= 5 msec)
Unvoiced Speech
Short Time Fourier Transform: Linear Filter Interpretation
Linear Predictive Coding (LPC) Model
Speech Signal: s(n), n = 0, 1, …
– Digital, with Fs the sampling frequency of s(n)
Spectral analysis on blocks of speech with an all-pole modeling constraint; LPC analysis of order p
– s(n) is blocked into frames [n, m]
– Again consider x_n(e^{jω}), the short-time spectral representation of s(n) at time n (where ω = 2πf/Fs is the normalized frequency f)
– Now the spectral representation x_n(e^{jω}) is constrained to be of the form σ/A(e^{jω}), where A(e^{jω}) is the p-th order polynomial with z-transform:
  A(z) = 1 + a1·z^-1 + a2·z^-2 + … + ap·z^-p
– The output of the LPC parametric conversion on block [n, m] is the vector [a1, …, ap]
– It specifies parametrically the spectrum of an all-pole model that best matches the signal spectrum over the period of time in which the frame of speech samples was accumulated (p-th order polynomial approximation of the signal).
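A hedged sketch of LPC analysis for one frame via the autocorrelation method (Levinson-Durbin recursion), fitting A(z) = 1 + a1·z^-1 + … + ap·z^-p. The test signal is synthetic; a real front end would use windowed speech frames.

```python
import numpy as np

def lpc(frame, p):
    """Return [a1, ..., ap] of the best p-th order all-pole model."""
    # autocorrelation values r[0..p]
    r = np.array([frame[: len(frame) - k] @ frame[k:] for k in range(p + 1)])
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):                   # Levinson-Durbin recursion
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a[1:]

# synthesize an all-pole (AR) signal: s(n) = 0.6 s(n-1) - 0.4 s(n-2) + e(n)
rng = np.random.default_rng(1)
excitation = rng.standard_normal(5000)
s = np.zeros(5000)
for n in range(5000):
    s[n] = excitation[n] + 0.6 * s[n - 1] - 0.4 * s[n - 2]

coeffs = lpc(s, 2)   # should recover approximately [-0.6, 0.4]
```

With the A(z) sign convention above, the predictor is s(n) ≈ −a1·s(n−1) − … − ap·s(n−p), so the generating coefficients (0.6, −0.4) come back as roughly (−0.6, 0.4).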
Vector Quantization
Data is represented as feature vectors.
A VQ training set is used to determine a set of code words that constitute a code book.
Code words are centroids under a similarity or distance measure d.
The code words together with d divide the space into Voronoi regions.
A query vector falls into a Voronoi region and will be represented by the respective code word.
Vector Quantization
Distance measures d(x, y):
– Euclidean distance
– Taxi cab distance
– Hamming distance
– etc.
Vector Quantization: Clustering the Training Vectors
1. Initialize: choose M arbitrary vectors of the L vectors of the training set. This is the initial code book.
2. Nearest-neighbour search: for each training vector, find the code word in the current code book that is closest, and assign that vector to the corresponding cell.
3. Centroid update: update the code word in each cell using the centroid of the training vectors that are assigned to that cell.
4. Iteration: repeat steps 2-3 until the average distance falls below a preset threshold.
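The loop above is essentially k-means. A sketch with Euclidean distance (the 2-D training vectors, codebook size M, and fixed iteration count are illustrative choices; the slides iterate to a distortion threshold instead):

```python
import numpy as np

def train_codebook(train, M, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: initial code book = M arbitrary training vectors
    book = train[rng.choice(len(train), size=M, replace=False)].copy()
    for _ in range(iters):
        # step 2: nearest-neighbour search over all code words
        d = np.linalg.norm(train[:, None, :] - book[None, :, :], axis=2)
        cell = d.argmin(axis=1)
        # step 3: centroid update per non-empty cell
        for m in range(M):
            if np.any(cell == m):
                book[m] = train[cell == m].mean(axis=0)
    return book

rng = np.random.default_rng(1)
train = np.vstack([rng.normal(0.0, 0.1, (50, 2)),    # two well separated
                   rng.normal(5.0, 0.1, (50, 2))])   # clusters of features
book = train_codebook(train, M=2)
```

The trained code words land near the two cluster centroids; classifying a query vector is then the `argmin` over its distances to `book`, as the next slide states.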
Vector Classification
For an M-vector code book CB with codes CB = {y_i | 1 ≤ i ≤ M}, the index m* of the best codebook entry for a given vector v is:

m* = argmin_{1 ≤ i ≤ M} d(v, y_i)
VQ for Classification
A code book CB_k = {y_k,i | 1 ≤ i ≤ M} can be used to define a class C_k.
Example audio classification:
– Classes 'crowd', 'car', 'silence', 'scream', 'explosion', etc.
– Determine the class by using VQ code books CB_k for each of the classes.
VQ is very often used as a baseline method for classification problems.
Sound, DNA: Sequences!
DNA: a helix-shaped molecule whose constituents are two parallel strands of nucleotides.
DNA is usually represented by sequences of these four nucleotides.
This assumes only one strand is considered; the second strand is always derivable from the first by pairing A's with T's and C's with G's and vice versa.
Nucleotides (bases)
– Adenine (A)
– Cytosine (C)
– Guanine (G)
– Thymine (T)
Biological Information: From Genes to Proteins
[Diagram] Gene (DNA) –transcription→ RNA –translation→ Protein → protein folding
(fields along the way: genomics, molecular biology, structural biology, biophysics)
From Amino Acids to Protein Functions
DNA / amino acid sequence → 3D structure → protein functions
DNA (gene) –[RNA-polymerase]→ pre-RNA –[Spliceosome]→ RNA –[Ribosome]→ Protein
Example DNA sequence:
CGCCAGCTGGACGGGCACACCATGAGGCTGCTGACCCTCCTGGGCCTTCTG…
Example amino acid sequence:
TDQAAFDTNIVTLTRFVMEQGRKARGTGEMTQLLNSLCTAVKAISTAVRKAGIAHLYGIAGSTNVTGDQVKKLDVLSNDLVINVLKSSFATCVLVTEEDKNAIIVEPEKRGKYVVCFDPLDGSSNIDCLVSIGTIFGIYRKNSTDEPSEKDALQPGRNLVAAGYALYGSATML
Motivation for Markov Models
There are many cases in which we would like to represent the statistical regularities of some class of sequences:
– genes
– proteins in a given family
– sequences of audio features
Markov models are well suited to this type of task.
A Markov Chain Model
Transition probabilities
– Pr(x_i = a | x_i-1 = g) = 0.16
– Pr(x_i = c | x_i-1 = g) = 0.34
– Pr(x_i = g | x_i-1 = g) = 0.38
– Pr(x_i = t | x_i-1 = g) = 0.12
Σ_x Pr(x_i = x | x_i-1 = g) = 1
Definition of Markov Chain Model
A Markov chain [1] model is defined by
– a set of states
  • some states emit symbols
  • other states (e.g., the begin state) are silent
– a set of transitions with associated probabilities
  • the transitions emanating from a given state define a distribution over the possible next states

[1] Markov A. A., "Extension of the law of large numbers to quantities depending on each other" (in Russian), Izvestiya fiziko-matematicheskogo obshchestva pri Kazanskom universitete, 2nd series, Vol. 15 (1906), pp. 135-156.
Markov Chain Models: Properties
Given some sequence x of length L, we can ask how probable the sequence is given our model.
For any probabilistic model of sequences, we can write this probability as

Pr(x) = Pr(x_L, x_L-1, …, x_1)
      = Pr(x_L | x_L-1, …, x_1) · Pr(x_L-1 | x_L-2, …, x_1) · … · Pr(x_1)

Key property of a (1st order) Markov chain: the probability of each x_i depends only on the value of x_i-1, so

Pr(x) = Pr(x_L | x_L-1) · Pr(x_L-1 | x_L-2) · … · Pr(x_2 | x_1) · Pr(x_1)
      = Pr(x_1) · ∏_{i=2..L} Pr(x_i | x_i-1)
The Probability of a Sequence for a Markov Chain Model
Pr(cggt) = Pr(c) · Pr(g|c) · Pr(g|g) · Pr(t|g)
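This calculation can be sketched directly. The g-row below matches the transition probabilities of the earlier slide; the other rows and the initial distribution are made-up uniform values:

```python
# probability of a DNA sequence under a first-order Markov chain
init = {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25}
trans = {'a': {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25},
         'c': {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25},
         'g': {'a': 0.16, 'c': 0.34, 'g': 0.38, 't': 0.12},
         't': {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25}}

def sequence_prob(x):
    # Pr(x) = Pr(x_1) * prod_i Pr(x_i | x_{i-1})
    p = init[x[0]]
    for prev, cur in zip(x, x[1:]):
        p *= trans[prev][cur]
    return p

# Pr(cggt) = Pr(c) * Pr(g|c) * Pr(g|g) * Pr(t|g) = 0.25 * 0.25 * 0.38 * 0.12
```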
Example Application: CpG Islands
CG di-nucleotides are rarer in eukaryotic genomes than expected given the marginal probabilities of C and G,
but the regions upstream of genes are richer in CG di-nucleotides than elsewhere: CpG islands
– useful evidence for finding genes
Application: predict CpG islands with Markov chains
– one Markov chain to represent CpG islands
– another Markov chain to represent the rest of the genome
Markov Chains for Discrimination
Suppose we want to distinguish CpG islands from other sequence regions.
Given sequences from CpG islands, and sequences from other regions, we can construct
– a model to represent CpG islands
– a null model to represent the other regions
We can then score a test sequence by:

score(x) = log [ Pr(x | CpG model) / Pr(x | null model) ]
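This log-odds score can be sketched with two toy chains. Both transition tables below are invented numbers, not trained values; the "CpG" chain simply makes the c → g transition much more likely than the "null" chain does:

```python
import math

def make_model(p_cg):
    """First-order chain, uniform except for the probability of c -> g."""
    other = (1.0 - p_cg) / 3.0
    return {prev: {cur: (p_cg if (prev, cur) == ('c', 'g')
                         else other if prev == 'c' else 0.25)
                   for cur in 'acgt'}
            for prev in 'acgt'}

cpg_model = make_model(0.45)     # inside CpG islands, c -> g is common
null_model = make_model(0.05)    # elsewhere, c -> g is rare

def log_prob(x, model):
    return sum(math.log(model[prev][cur]) for prev, cur in zip(x, x[1:]))

def score(x):
    return log_prob(x, cpg_model) - log_prob(x, null_model)

# CG-rich sequences score positive, CG-poor sequences score negative
```

Working in log space also avoids the numerical underflow that multiplying many small probabilities would cause on long sequences.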
Markov Chains for Discrimination
Why can we use
score(x) = log [ Pr(x | CpG model) / Pr(x | null model) ] ?
According to Bayes' rule:

Pr(CpG | x) = Pr(x | CpG) · Pr(CpG) / Pr(x)
Pr(null | x) = Pr(x | null) · Pr(null) / Pr(x)

If we are not taking into account the prior probabilities Pr(CpG) and Pr(null) of the two classes, then from Bayes' rule it is clear that we just need to compare Pr(x | CpG) and Pr(x | null), as is done in our scoring function score().
Higher Order Markov Chains
The Markov property specifies that the probability of a state depends only on the probability of the previous state.
But we can build more "memory" into our states by using a higher order Markov model.
In an n-th order Markov model, the probability of the current state depends on the previous n states:

Pr(x_i | x_i-1, x_i-2, …, x_1) = Pr(x_i | x_i-1, …, x_i-n)
Selecting the Order of a Markov Chain Model
But the number of parameters we need to estimate grows exponentially with the order
– for modeling DNA we need O(4^(n+1)) parameters for an n-th order model
The higher the order, the less reliable we can expect our parameter estimates to be
– estimating the parameters of a 2nd order Markov chain from the complete genome of E. coli (5.44 × 10^6 bases), we'd see each word ~85,000 times on average (divide by 4^3)
– estimating the parameters of a 9th order chain, we'd see each word ~5 times on average (divide by 4^10 ≈ 10^6)
Higher Order Markov Chains
An n-th order Markov chain over some alphabet A is equivalent to a first order Markov chain over the alphabet A^n of n-tuples.
Example: a 2nd order Markov model for DNA can be treated as a 1st order Markov model over the alphabet
AA, AC, AG, AT
CA, CC, CG, CT
GA, GC, GG, GT
TA, TC, TG, TT
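The equivalence can be sketched concretely. Below is one conditional distribution of a made-up 2nd order DNA chain, walked as a 1st order chain whose states are dinucleotides (a full model would hold all 16 conditional rows):

```python
# Pr(next | previous two symbols) for one context of a 2nd order chain
second_order = {('c', 'g'): {'a': 0.1, 'c': 0.2, 'g': 0.3, 't': 0.4}}

def first_order_transition(state_from, state_to, model):
    """Transition prob. between pair-states: (x,y) -> (y,z) with Pr(z | x,y)."""
    (x, y), (y2, z) = state_from, state_to
    if y != y2:
        return 0.0     # pair-states must overlap in their shared symbol
    return model[(x, y)][z]

p = first_order_transition(('c', 'g'), ('g', 't'), second_order)  # Pr(t | c, g)
```

Note that most of the 16 × 16 first-order transitions are structurally zero: state (x, y) can only be followed by a state that begins with y.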
A Fifth Order Markov Chain
Pr(gctaca) = Pr(gctac) · Pr(a | gctac)
Hidden Markov Model: A Simple HMM
Given the observed sequence AGGCT, which state emits each item?
[Diagram] Model 1    Model 2
12
Tutorial on HMMs
L.R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, Vol. 77, No. 2, February 1989.
HMM for Hidden Coin Tossing
[Diagram: hidden states emitting H/T; only the toss outcomes are observed]
Observed sequence: … H H T T H T H H T T H
Hidden State
We'll distinguish between the observed parts of a problem and the hidden parts.
In the Markov models we've considered previously, it is clear which state accounts for each part of the observed sequence.
In the model above, there are multiple states that could account for each part of the observed sequence
– this is the hidden part of the problem.
Learning and Prediction Tasks
(in general, i.e., these apply to both MMs and HMMs)
Learning
– Given: a model, a set of training sequences
– Do: find model parameters that explain the training sequences with relatively high probability (the goal is to find a model that generalizes well to sequences we haven't seen before)
Classification
– Given: a set of models representing different sequence classes, and a test sequence
– Do: determine which model/class best explains the sequence
Segmentation
– Given: a model representing different sequence classes, and a test sequence
– Do: segment the sequence into subsequences, predicting the class of each subsequence
Algorithms for Learning & Prediction
Learning
– correct path known for each training sequence → simple maximum likelihood or Bayesian estimation
– correct path not known → Forward-Backward algorithm + ML or Bayesian estimation
Classification
– simple Markov model → calculate the probability of the sequence along the single path for each model
– hidden Markov model → Forward algorithm to calculate the probability of the sequence along all paths for each model
Segmentation
– hidden Markov model → Viterbi algorithm to find the most probable path for the sequence
The Parameters of an HMM
Transition probabilities
– probability of a transition from state k to state l:
  a_kl = Pr(π_i = l | π_i-1 = k)
Emission probabilities
– probability of emitting character b in state k:
  e_k(b) = Pr(x_i = b | π_i = k)
Note: HMMs can also be formulated using an emission probability associated with a transition from state k to state l.
An HMM Example
[Diagram of a small HMM]
Emission probabilities: within each state, Σ p_i = 1
Transition probabilities: out of each state, Σ p_i = 1
Three Important Questions
(See also L.R. Rabiner (1989))
How likely is a given sequence?
– the Forward algorithm
What is the most probable "path" for generating a given sequence?
– the Viterbi algorithm
How can we learn the HMM parameters given a set of sequences?
– the Forward-Backward (Baum-Welch) algorithm
How Likely is a Given Sequence?
The probability that a given path is taken and the sequence is generated:

Pr(x_1 … x_L, π_0 … π_N) = a_{0,π1} · ∏_{i=1..L} e_{πi}(x_i) · a_{πi,πi+1}

For example:
Pr(AAC, π) = a_01 · e_1(A) · a_11 · e_1(A) · a_13 · e_3(C) · a_35
           = 0.5 × 0.4 × 0.2 × 0.4 × 0.8 × 0.3 × 0.6
How Likely is a Given Sequence?
The probability over all paths is

Pr(x) = Σ_π Pr(x, π)

but the number of paths can be exponential in the length of the sequence…
The Forward algorithm enables us to compute this efficiently.
The Forward Algorithm
Define f_k(i) to be the probability of being in state k having observed the first i characters of sequence x of length L.
We want to compute f_N(L), the probability of being in the end state having observed all of sequence x.
f_k(i) can be defined recursively, and computed using dynamic programming.
The Forward Algorithm
f_k(i): the probability of being in state k having observed the first i characters of sequence x.
Initialization
– f_0(0) = 1 for the start state; f_k(0) = 0 for every other state k
Recursion
– for an emitting state (i = 1, …, L):  f_l(i) = e_l(x_i) · Σ_k f_k(i-1) · a_kl
– for a silent state:  f_l(i) = Σ_k f_k(i) · a_kl
Termination

Pr(x) = Pr(x_1 … x_L) = f_N(L) = Σ_k f_k(L) · a_kN
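The recursion can be sketched on a fully specified toy HMM: a silent start state 0, two emitting states, and an end state. All probabilities below are made up for illustration:

```python
states = [1, 2]
a = {(0, 1): 0.5, (0, 2): 0.5,                 # transition probabilities a_kl
     (1, 1): 0.7, (1, 2): 0.2, (1, 'end'): 0.1,
     (2, 1): 0.3, (2, 2): 0.6, (2, 'end'): 0.1}
e = {1: {'H': 0.9, 'T': 0.1},                  # emission probabilities e_k(b)
     2: {'H': 0.4, 'T': 0.6}}

def forward(x):
    # initialization: f_0(0) = 1, folded into the first recursion step
    f = {k: a[(0, k)] * e[k][x[0]] for k in states}
    # recursion: f_l(i) = e_l(x_i) * sum_k f_k(i-1) * a_kl
    for c in x[1:]:
        f = {l: e[l][c] * sum(f[k] * a[(k, l)] for k in states) for l in states}
    # termination: Pr(x) = sum_k f_k(L) * a_k,end
    return sum(f[k] * a[(k, 'end')] for k in states)
```

For instance, `forward('H')` is 0.5·0.9·0.1 + 0.5·0.4·0.1 = 0.065, the sum over both one-state paths, computed without ever enumerating paths explicitly.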
Forward Algorithm Example
Given the sequence x = TAGA
Forward Algorithm Example
Initialization
– f_0(0) = 1, f_1(0) = 0, …, f_5(0) = 0
Computing other values
– f_1(1) = e_1(T) · (f_0(0)·a_01 + f_1(0)·a_11) = 0.3 · (1·0.5 + 0·0.2) = 0.15
– f_2(1) = 0.4 · (1·0.5 + 0·0.8)
– f_1(2) = e_1(A) · (f_0(1)·a_01 + f_1(1)·a_11) = 0.4 · (0·0.5 + 0.15·0.2)
– …
– Pr(TAGA) = f_5(4) = f_3(4)·a_35 + f_4(4)·a_45
Three Important Questions
How likely is a given sequence?
What is the most probable "path" for generating a given sequence?
How can we learn the HMM parameters given a set of sequences?
Finding the Most Probable Path: The Viterbi Algorithm
Define v_k(i) to be the probability of the most probable path accounting for the first i characters of x and ending in state k.
We want to compute v_N(L), the probability of the most probable path accounting for all of the sequence and ending in the end state.
v_k(i) can be defined recursively.
Again we can use dynamic programming to compute v_N(L) and find the most probable path efficiently.
Finding the Most Probable Path: The Viterbi Algorithm
Define v_k(i) to be the probability of the most probable path π accounting for the first i characters of x and ending in state k.
The Viterbi Algorithm:
1. Initialization (i = 0):
   v_0(0) = 1, v_k(0) = 0 for k > 0
2. Recursion (i = 1, …, L):
   v_l(i) = e_l(x_i) · max_k (v_k(i-1) · a_kl)
   ptr_i(l) = argmax_k (v_k(i-1) · a_kl)
3. Termination:
   P(x, π*) = max_k (v_k(L) · a_k0)
   π*_L = argmax_k (v_k(L) · a_k0)
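A sketch of this recursion on a small made-up two-state HMM ('F' = fair-like, 'B' = biased-towards-heads); all probabilities are illustrative, 0 is the silent start state, and this toy model omits the explicit end state of the slides:

```python
states = ['F', 'B']
a = {(0, 'F'): 0.5, (0, 'B'): 0.5,
     ('F', 'F'): 0.9, ('F', 'B'): 0.1,
     ('B', 'F'): 0.1, ('B', 'B'): 0.9}
e = {'F': {'H': 0.5, 'T': 0.5},
     'B': {'H': 0.9, 'T': 0.1}}

def viterbi(x):
    # initialization: best one-step "paths" for the first character
    v = {k: a[(0, k)] * e[k][x[0]] for k in states}
    ptr = []                                    # back-pointers ptr_i(l)
    for c in x[1:]:
        # recursion: v_l(i) = e_l(x_i) * max_k v_k(i-1) * a_kl
        best = {l: max(states, key=lambda k, l=l: v[k] * a[(k, l)])
                for l in states}
        v = {l: e[l][c] * v[best[l]] * a[(best[l], l)] for l in states}
        ptr.append(best)
    # termination and traceback along the stored back-pointers
    last = max(states, key=lambda k: v[k])
    path = [last]
    for back in reversed(ptr):
        path.append(back[path[-1]])
    return path[::-1], v[last]

path, p = viterbi('HHHHTTTT')   # heads-rich prefix -> 'B', tails -> 'F'
```

The sticky self-transitions (0.9) make the decoder prefer long runs in one state, so the heads-rich half is labelled 'B' and the tails-rich half 'F'.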
Three Important Questions
How likely is a given sequence?
What is the most probable "path" for generating a given sequence?
How can we learn the HMM parameters given a set of sequences?
Learning Without Hidden State
Learning is simple if we know the correct path for each sequence in our training set:
estimate parameters by counting the number of times each parameter is used across the training set.
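For a plain Markov chain (no hidden state) this counting estimate is a few lines: maximum-likelihood transition probabilities are just normalized transition counts. The two training sequences are made up:

```python
from collections import Counter

def estimate_transitions(sequences):
    counts = Counter()
    for x in sequences:
        counts.update(zip(x, x[1:]))            # count each observed transition
    totals = Counter()
    for (prev, cur), n in counts.items():
        totals[prev] += n                       # total transitions out of prev
    return {(prev, cur): n / totals[prev] for (prev, cur), n in counts.items()}

trans = estimate_transitions(['acgg', 'ggca'])
# out of 'g' we saw g->g twice and g->c once: Pr(g|g) = 2/3, Pr(c|g) = 1/3
```

For an HMM with known paths, the same idea applies to state-transition and emission counts along the given paths.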
Learning With Hidden State
If we don't know the correct path for each sequence in our training set, consider all possible paths for the sequence.
Estimate parameters through a procedure that counts the expected number of times each parameter is used across the training set.
Learning Parameters: The Baum-Welch Algorithm
Also known as the Forward-Backward algorithm.
An Expectation Maximization (EM) algorithm
– EM is a family of algorithms for learning probabilistic models in problems that involve hidden states.
In this context, the hidden state is the path that best explains each training sequence.
Learning Parameters: The Baum-Welch Algorithm
Algorithm sketch:
– initialize the parameters of the model
– iterate until convergence:
  • calculate the expected number of times each transition or emission is used
  • adjust the parameters to maximize the likelihood of these expected values
Computational Complexity of HMM Algorithms
Given an HMM with S states and a sequence of length L, the complexity of the Forward, Backward and Viterbi algorithms is

O(S² · L)

– This assumes that the states are densely interconnected.
Given M sequences of length L, the complexity of Baum-Welch on each iteration is

O(M · S² · L)
Markov Models Summary
We considered models that vary in terms of order and hidden state.
Three DP-based algorithms for HMMs: Forward, Backward and Viterbi.
We discussed three key tasks: learning, classification and segmentation.
The algorithms used for each task depend on whether there is hidden state in the problem (correct path unknown) or not.
Summary
Markov chains and hidden Markov models are probabilistic models in which the probability of a state depends only on that of the previous state.
– Given a sequence of symbols x, the Forward algorithm finds the probability of obtaining x in the model.
– The Viterbi algorithm finds the most probable path (corresponding to x) through the model.
– The Baum-Welch algorithm learns or adjusts the model parameters (transition and emission probabilities) to best explain a set of training sequences.