FEATURES FOR SPEECH RECOGNITION AND AUDIO INDEXING …
Transcript of FEATURES FOR SPEECH RECOGNITION AND AUDIO INDEXING …
1
AUDIO FEATURES &MACHINE LEARNINGE.M. Bakker
API2016
FEATURES FOR SPEECH RECOGNITION AND AUDIO
INDEXING• Parametric Representations
• Short Time Energy
• Zero Crossing Rates
• Level Crossing Rates
• Short Time Spectral Envelope
• Spectral Analysis
• Filter Design
• Filter Bank Spectral Analysis Model
• Linear Predictive Coding (LPC)
• MFCCs
2
FEATURES FOR SPEECH RECOGNITION AND AUDIO
INDEXING• Parametric Representations
• Short Time Energy
• Zero Crossing Rates
• Level Crossing Rates
Example: Speech of length 0.01 sec.
FEATURES FOR SPEECH RECOGNITION AND AUDIO
INDEXING• Spectral Analysis
• Fourier Transform
• Filter Design
• Filter Bank Spectral Analysis Model
• Linear Predictive Coding (LPC)
• MFCCs
3
FOURIER TRANSFORM PRACTICALITIES
Fourier Transform equations:
𝐻 𝑓 = −∞
∞
ℎ(𝑡)𝑒2𝜋𝑖𝑓𝑡𝑑𝑡
ℎ 𝑡 = −∞
∞
𝐻(𝑓)𝑒−2𝜋𝑖𝑓𝑡𝑑𝑓
If t is in seconds, then f is in cycles per seconds.
[3, pp 501-503; pp 563-564]
FOURIER TRANSFORM PAIRS
Homogeneity a· h[t] a· H(f)
Additivity g[t] + h[t] G(f) + H(f)
Linearity a·g[t] + b·h[t] a·G(f) + b·H(f)
Multiplication g[t] · h[t] G ⋇ H (f) = −∞∞𝐺 𝜑 𝐻 𝑓 − 𝜑 𝑑𝜑
Convolution g ⋇ h (t) = −∞∞𝑔 𝜏 ℎ 𝑡 − 𝜏 𝑑𝜏 G(f)· H(f)
Time shifting h[ t – t0 ] H(f) ∙ 𝑒2𝜋𝑖𝑓𝑡0
Frequency shifting h(t) ∙ 𝑒−2𝜋𝑖𝑓0𝑡 H( f - f0 )
4
SHORT TIME FOURIER TRANSFORMSHORT HAMMING WINDOW:
50 SAMPLES (=5MSEC)
Voiced Speech
SHORT TIME FOURIER TRANSFORMLONG HAMMING WINDOW:
500 SAMPLES (=50MSEC)
Voiced Speech
5
SHORT TIME FOURIER TRANSFORMSHORT HAMMING WINDOW:
50 SAMPLES (=5MSEC)
Unvoiced Speech
SHORT TIME FOURIER TRANSFORMLONG HAMMING WINDOW:
500 SAMPLES (=50MSEC)
Unvoiced Speech
6
FOURIER TRANSFORM CONVOLUTION AND CORRELATION
• [3, pp. 501-505]
g[t] · h[t] G ⋇ H (f) = −∞∞𝐺 𝜑 𝐻 𝑓 − 𝜑 𝑑𝜑
g ⋇ h (t) = −∞∞𝑔 𝜏 ℎ 𝑡 − 𝜏 𝑑𝜏 G(f)· H(f)
BAND PASS FILTERAudio Signal
g(t)
Bandpass Filter
h(t)
Result Audio Signal
g ⋇ h (t)
Note that the band pass filter can be
defined as:
• a convolution with a filter response
function h(t) in the time domain
• a multiplication with a filter response
H(f) function in the frequency domain
g ⋇ h (t) = −∞∞𝑔 𝜏 ℎ 𝑡 − 𝜏 𝑑𝜏 ↔G(f)· H(f)
7
BANK OF FILTERS ANALYSIS MODEL
Central Frequency
BANK OF FILTERS ANALYSIS MODEL
• Speech Signal: s(n), n=0,1,…• Sampling frequency Fs
• Band Pass Filters: • Central frequencies: f1, …,fq
• Filterk(s(n)) = Xn(𝑒𝑖𝜔𝑘), where ωk = 2πfk/Fs is equal to the normalized frequency fk, where k =1, …, q.
• Xn(𝑒𝑖𝜔𝑘) is the short time spectral representation of s(n)at time n, as seen through the Filterk with centrefrequency ωk, where k =1, …, q.
• Note: Each band pass filter independently processes the speech signal s to produce the spectral representation x
8
MEL-CEPSTRUM [4]
Auditory characteristics
• Mel-scaled filter banks
De-correlating properties
• by applying a discrete cosine transform (which is close to a Karhunen-Loeve transform) a de-correlation of the mel-scale filter log-energies results
• => probabilistic modeling on these de-correlated coefficients will be more effective.
One of the most successful feature for speech recognition, speaker recognition, and other speech related recognition tasks.
[1, pp 712-717]API2014
MFCCS
Mel-Scale
Filter Bank
MFCC’s
first 12 most
Significant
coefficients
Log()
Speech
Audio, …Preemphasis Windowing
Fast Fourier
Transform
Direct Cosine
Transform
9
JFA, I-VECTORS [5, 6]
Both very successful in Speaker Verification Systems
Joint Factor Analysis (JFA)
• Defines a low dimensional space: total variability factor space.
I-Vectors
• Combine channel and speaker variability at the same time.
API2015
MACHINE LEARNING METHODS
• Vector Quantization• Finite code book of spectral shapes• The code book codes for ‘typical’ spectral
shape• Method for all spectral representations (e.g.
Filter Banks, LPC, ZCR, etc. …)
• Support Vector Machines
• Markov Models
• Hidden Markov Models
• Neural Networks Etc.
10
PATTERN RECOGNITION
Reference
Patterns
Parameter
Measurements
Decision
Rules
Pattern
Comparison
Speech
Audio, …
Recognized
Speech, Audio, …
Test Pattern
Query Pattern
VECTOR QUANTIZATION
• Data represented as feature vectors.
• Use a Vector Quantization (VQ) Training set to determine a set of code words that constitute a code book.
• Code words are centroids using a similarity or distance measure d.
• Code words together with measure d divide the space into a Voronoi regions.
• A query vector falls into a Voronoi region and will be represented by the respective code word.
[2, pp. 466 – 467]
11
VECTOR QUANTIZATION
Distance measures d(x,y):
• Euclidean distance
• Taxi cab distance
• Hamming distance
• etc.
VECTOR QUANTIZATIONLet a training set of L vectors be given for a certain class of objects.
Assume a codebook of M code words is wanted for this class.
Initialize:
• choose M arbitrary vectors of the L vectors of the training set.
• This is the initial code book.
Nearest Neighbor Search:
• for each training vector v, find the code word w in the current code book that is closest and assign v to the corresponding cell of w.
Centroid Update:
• For each cell with code word w determine the centroid c of the training vectors that are assigned to the cell of w.
• Update the code word w with the new vector c.
Iteration:
• repeat the steps Nearest Neighbor Search and Centroid Update until the average distance between the new and previous code words falls below a preset threshold.
12
VECTOR CLASSIFICATION
For an M-vector code book CB with codes
CB = {yi | 1 ≤ i ≤ M} ,
the index m* of the best codebook entry for a given vector v is:
m* = arg min d(v, yi)
1 ≤ i ≤ M
VQ FOR CLASSIFICATION
A code book CBk = {yki | 1 ≤ i ≤ M}, can be used
to define a class Ck.
Example Audio Classification:
• Classes ‘crowd’, ‘car’, ‘silence’, ‘scream’, ‘explosion’, etc.
• Determine by using VQ code books CBk for each of the respective classes Ck.
• VQ is very often used as a baseline method for classification problems.
13
SUPPORT VECTOR MACHINES
• A generalization of linear decision boundaries for classification.
• Necessary when classes overlap when using linear decision boundaries (non separable classes).
Find hyper plane P: xTβ + β0 = 0, such that
𝛽 is minimized over 𝑦𝑖 𝑥𝑖
𝑇𝛽 + 𝛽0 ≥ 1 − 𝜀𝑖 ∀𝑖
𝜀𝑖 ≥ 0, 𝜀𝑖 ≤ 𝑐𝑜𝑛𝑠𝑡𝑎𝑛𝑡
Where ε = (ε1 , ε2 , …, εN ) are the slack variables, i.e., the amount
that points are on the wrong side of their margin C = 1
𝛽from the
hyper plane P.
=> Problem is quadratic with linear inequalities constraint.
[2, pp 377-389]
SUPPORT VECTOR MACHINE (SVM)
In this method so called support vectors define decision boundaries for classification and regression.
An example where
a straight line
separates the two
Classes: a linear
classifier
Images from: www.statsoft.com.
14
SUPPORT VECTOR MACHINE (SVM)
In general classification is not that simple.
SVM is a method that can handle the more complex cases where the decision boundary requires a curve.
SVM uses a set of mapping
functions (kernels) to map
the feature space into
a transformed space so
that hyperplanes can be
used for the classification.
SUPPORT VECTOR MACHINE (SVM)
SVM uses a set of mapping functions (kernels)to map the feature space into a transformed space so that hyperplanes can be used for the classification.
15
SUPPORT VECTOR MACHINE (SVM)
Training of an SVM is an iterative process: • optimize the mapping function while minimizing an
error function
• The error function should capture the penalties for misclassified, i.e., non separable data points.
SUPPORT VECTOR MACHINE (SVM)
SVM uses kernels that define the mapping function used in the method. Kernels can be:
• Linear
• Polynomial
• RBF
• Sigmoid
• Etc.
• RBF (radial basis function) is the most popular kernel, again with different possible base functions.
• The final choice depends on characteristics of the classification task.
16
From: T.Joachims, SIGIR 2003 Tutorial.From: T.Joachims, SIGIR 2003 Tutorial.
17
NEURAL NETWORKS
API2015
RECENT BREAKTHROUGHS
Conversational Speech Transcription Using Context-Dependent Deep Neural Networks.
By F. Seide, G. Li, D. Yu,
INTERSPEECH (August) 2011.
• Out-of-the-box speaker-independent speech-recognition is especially important for mobile devices like smart phones, etc.
• Artificial Neural Networks until recently did not show state of the art performance for automatic speech recognition task like context dependent Gaussian mixture model HMMs.
• Context dependent deep neural network HMMs (CD-DNN-HMMs).
• DNN’s are able to extract discriminative internal representations that are robust to the many sources of variability in speech signals. [D.Yu et al. 2013]
• On the Switchboard benchmark a word error rate of 18.5 percent was obtained, which is a 33% relative improvement wrt the performance of conventional state-of-the-art speech recognition systems.
18
NEURAL NETWORKS FOR SPEECH RECOGNITION
Neural Networks have been visited many times:
• Adaline digit recognition, L.R. Talbert et al, Stanford 1963
• Speaker dependent digit recognizer
• Multi-layer Perceptrons (end 80’s, begin 90’s: Kohonen, Makino, Kammerer, and many more)
• A NN ASR around 2000 did not improve the now matured Gaussian mixture approaches
• G. Hinton et al general unsupervised approach using multiple layers: new initiation of DNN for ASR
N. Morgan, Multilayer perceptrons for speech recognition: There and Back Again. See: http://www.superlectures.com/asru2013/
GEORGE SAON, TOM SERCU, STEVEN RENNIE AND HONG-KWANG J. KUO, THE IBM 2016 ENGLISH CONVERSATIONAL TELEPHONE SPEECH RECOGNITION
SYSTEM, INTERSPEECH 2016.
In the past to build good ASR systems the following was needed:
• Speaker adaptation
• Phonetic context modeling
• Discriminative feature processing, etc.
Now neural network architectures ease the work by:
• ASR to be trained agnostically on a lot of data.
• Neural network toolkits which can implement models out-of-the-box.
• Availability of powerful hard-ware based on GPUs.
19
End-to-end ASR will become readily available to non-ASR specialists as:
1. Front-end processing has been simplified by CNNs which treat the log-mel spectral representation as an image and don’t require any extra processing steps.
2. End-to-end ASR systems such as Deepspeech, and Eesen, do not use phonetic context decision trees and HMMs and directly map the sequence of acoustic features to a sequence of characters or context independent phones.
3. New training algorithms don’t require an initial alignment of the training data which is typically done with a GMM-based base-line model.
• Three different acoustic models that were trained on 2000 hours of English conversational telephone speech
1. Recurrent nets with maxout activations and annealed dropout.
2. Very deep convolutional nets with 3×3 kernels,
3. Bidirectional long short-term memory nets operating on FM-LLR and i-vector features.
• All models are used in a hybrid HMM decoding scenario.
Result:
The word error rate of the presented English conversational telephone LVCSR system is equal to a record 6.6% on the Switchboard subset of the Hub5 2000 evaluation test-set.
20
HTTPS://ARXIV.ORG/PDF/1610.05256V1.PDF
W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu and G. Zweig ACHIEVING HUMAN PARITY IN CONVERSATIONAL SPEECH RECOGNITION, October 2016.
Human WER NIST 2000 Test Set for professional transcriptionists:
• 5.9% for the Switchboard portion of the data, in which newly acquainted pairs of people discuss an assigned topic,
• 11.3% for the CallHome portion where friends and family members have open-ended conversations.
• the human error rate on the widely used NIST 2000 test set, and find that our latest automated system has reached human parity.
In this study:
• use of convolutional and LSTM (Long short-term memory) neural networks, combined with a novel spatial smoothing method and lattice-free acoustic training give on par WER’s
SEQUENCES: SOUND AND DNA
• DNA: helix-shaped molecule
whose constituents are two
parallel strands of nucleotides
• DNA is usually represented by
sequences of these four
nucleotides
• This assumes only one strand is
considered; the second strand
is always derivable from the first
by pairing A’s with T’s and C’s
with G’s and vice-versa
Nucleotides (bases)
Adenine (A)
Cytosine (C)
Guanine (G)
Thymine (T)
21
BIOLOGICAL INFORMATION: FROM GENES TO PROTEINS
GeneDNA
RNA
Transcription
Translation
Protein Protein folding
genomics
molecular
biology
structural
biology
biophysics
MOTIVATION FOR MARKOV MODELS
• There are many cases in which we would like to represent the statistical regularities of some class of sequences
• genes
• proteins in a given family
• Sequences of audio features
• Markov models are well suited to this type of task
22
A MARKOV CHAIN MODEL
Transition probabilities
• Pr(xi=a|xi-1=g)=0.16
• Pr(xi=c|xi-1=g)=0.34
• Pr(xi=g|xi-1=g)=0.38
• Pr(xi=t|xi-1=g)=0.12
API2014
1)|Pr( 1 gxx ii
DEFINITION OF MARKOV CHAIN MODEL
• A Markov chain[1] model is defined by
• a set of states
• some states emit symbols (for each symbol only one state)
• other states (e.g., the begin state) are silent
• a set of transitions with associated probabilities
• the transitions emanating from a given state define a
distribution over the possible next states
[1] Марков А. А., Распространение закона больших чисел на величины, зависящие
друг от друга. — Известия физико-математического общества при Казанском
университете. — 2-я серия. — Том 15. (1906) — С. 135—156
23
MARKOV CHAIN MODELS: PROPERTIES
Given some sequence x of length L
How probable the sequence is given our model?
For any probabilistic model of sequences, we can write this probability as
Key property of a (1st order) Markov chain:
The probability of each xi depends only on the value of xi-1
)Pr()...,...,|Pr(),...,|Pr(
),...,,Pr()Pr(
112111
11
xxxxxxx
xxxx
LLLL
LL
L
i
ii
LLLL
xxx
xxxxxxxx
2
11
112211
)|Pr()Pr(
)Pr()|Pr()...|Pr()|Pr()Pr(
THE PROBABILITY OF A SEQUENCE FOR A MARKOV
CHAIN MODEL
Pr(cggt)=Pr(c)Pr(g|c)Pr(g|g)Pr(t|g)
24
EXAMPLE APPLICATIONS
Isolated Word Recognition
• A spectral vector is modeled by a state in a Markov chain
• An utterance is represented by a sequence of states
Algorithmic Music Composition
• States are note or pitch values
• Probabilities for transitions to other notes are given (can be 1st
order or second order etc.)
MARKOV CHAINS FOR DISCRIMINATION
Suppose we want to distinguish CpG islands from other sequence regions
• Given sequences from CpG islands, and sequences from other regions, we can construct
• a model to represent CpG islands
• a null model to represent the other regions
• Using the models we can now score a test sequence by:
)|Pr(
)|Pr(log)(
nullModelx
CpGModelxxscore
25
MARKOV CHAINS FOR DISCRIMINATION
• Why can we use
• According to Bayes’ rule:
• If we are not taking into account prior probabilities (Pr(CpG) and Pr(null)) of the two classes, then from Bayes’ rule it is clear that we just need to compare Pr(x|CpG) and Pr(x|null) as is done in our scoring function score().
)Pr(
)Pr()|Pr()|Pr(
x
CpGCpGxxCpG
)Pr(
)Pr()|Pr()|Pr(
x
nullnullxxnull
)|Pr(
)|Pr(log)(
nullModelx
CpGModelxxscore
As the real score should be:
Assume x is observed what is the
probability that x was modeled by
CpG compared to the probability
that x was modeled by the null
model, i.e., Pr(CpG | x) / Pr(null | x)
HIGHER ORDER MARKOV CHAINS
• The Markov property specifies that the probability of a state
depends only on the probability of the previous state
• But we can build more “memory” into our states by using a
higher order Markov model
• In an n-th order Markov model
The probability of the current state depends on the previous n
states.
),...,|Pr(),...,,|Pr( 1121 niiiiii xxxxxxx
26
SELECTING THE ORDER OF AMARKOV CHAIN MODEL
• But the number of parameters we need to
estimate grows exponentially with the order
• for modeling DNA we need parameters for an
n-th order model
• The higher the order, the less reliable we can
expect our parameter estimates to be
• estimating the parameters of a 2nd order Markov chain
from the complete genome of E. Coli (5.44 x 106 bases) ,
we’d see each word ~ 85.000 times on average (divide
by 43)
• estimating the parameters of a 9th order chain, we’d see
each word ~ 5 times on average (divide by 410 ~ 106)
)4( 1nO
HIGHER ORDER MARKOV CHAINS
• An n-th order Markov chain over some alphabet A is
equivalent to a first order Markov chain over the alphabet of
n-tuples: An
• Example: A 2nd order Markov model for DNA can be treated
as a 1st order Markov model over alphabet
AA, AC, AG, AT
CA, CC, CG, CT
GA, GC, GG, GT
TA, TC, TG, TT
27
A FIFTH ORDER MARKOV CHAIN
Pr(gctaca) = Pr(gctac) . Pr(a | gctac)
HIDDEN MARKOV MODEL: A SIMPLE HMM
Given observed sequence AGGCT, which state emits every item?
Model 1 Model 2
28
TUTORIAL ON HMM
L.R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceeding of the IEEE, Vol. 77, No. 22, February 1989.
HMM FOR HIDDEN COIN TOSSING
HT
T
T T
T
H
T
……… H H T T H T H H T T H
29
HIDDEN STATE
• We’ll distinguish between the observed parts of a
problem and the hidden parts
• In the Markov models we’ve considered previously, it is
clear which state accounts for each part of the
observed sequence
• In the model above, there are multiple states that could
account for each part of the observed sequence
• this is the hidden part of the problem
LEARNING AND PREDICTION TASKS(IN GENERAL, I.E., APPLIES ON BOTH MM AS HMM)
• Learning• Given: a model, a set of training sequences
• Do: find model parameters that explain the training sequences withrelatively high probability (goal is to find a model that generalizes well to sequences we haven’t seen before)
• Classification• Given: a set of models representing different sequence classes, and
given a test sequence
• Do: determine which model/class best explains the sequence
• Segmentation• Given: a model representing different sequence classes, and given
a test sequence
• Do: segment the sequence into subsequences, predicting the class of each subsequence
30
ALGORITHMS FOR LEARNING & PREDICTION
• Learning• correct path known for each training sequence -> simple maximum
likelihood or Bayesian estimation
• correct path not known -> Forward-Backward algorithm + ML or
Bayesian estimation
• Classification• simple Markov model -> calculate probability of sequence along
single path for each model
• hidden Markov model -> Forward algorithm to calculate
probability of sequence along all paths for each model
• Segmentation
• hidden Markov model -> Viterbi algorithm to find most probable
path for sequence
THE PARAMETERS OF AN HMM
• Transition Probabilities
• Probability of transition from state k to state l
• Emission Probabilities
• Probability of emitting character b in state k
Note: HMM’s can also be formulated using an emission probability associated with a transition from state k to state l.
)|Pr( 1 kla iikl
)|Pr()( kbxbe iik
31
AN HMM EXAMPLE
Emission probabilities∑ pi = 1
Transition probabilities∑ pi = 1
THREE IMPORTANT QUESTIONS
(SEE ALSO L.R. RABINER (1989))
• How likely is a given sequence?
• The Forward algorithm
• What is the most probable “path” for generating a given sequence?
• The Viterbi algorithm
• How can we learn the HMM parameters
given a set of sequences?
• The Forward-Backward (Baum-Welch) algorithm
32
HOW LIKELY IS A GIVEN SEQUENCE?
• The probability that a given path is taken and the sequence is generated:
L
i
iNL iiiaxeaxx
1
001 11)()...,...Pr(
6.3.8.4.2.4.5.
)(
)()(
),Pr(
35313
111101
aCea
AeaAea
AAC
HOW LIKELY IS A GIVEN SEQUENCE?
• But we need the probability over all paths:
• The number of paths can be exponential in the length of the sequence...
• The Forward algorithm enables us to compute this efficiently
33
THE FORWARD ALGORITHM
• Define to be the probability of being in state k having observed the first icharacters of sequence x of length L
• To compute , the probability of being in the end state having observed all of sequence x
• Can be defined recursively
• Compute using dynamic programming
)(ifk
)(LfN
THE FORWARD ALGORITHM
• fk(i) equal to the probability of being in state khaving observed the first i characters of sequence x
• Initialization
• f0(0) = 1 for start state; fi(0) = 0 for other state
• Recursion
• For emitting state (i = 1, … L)
• For silent state
• Termination
k
klkl aifif )()(
k
klkll aifieif )1()()(
k
kNkNL aLfLfxxx )()()...Pr()Pr( 1
34
FORWARD ALGORITHM EXAMPLE
Given the sequence x=TAGA
FORWARD ALGORITHM EXAMPLE
• Initialization• f0(0)=1, f1(0)=0…f5(0)=0
• Computing other values• f1(1)=e1(T)*(f0(0)a01+f1(0)a11)
=0.3*(1*0.5+0*0.2)=0.15• f2(1)=0.4*(1*0.5+0*0.8)• f1(2)=e1(A)*(f0(1)a01+f1(1)a11)
=0.4*(0*0.5+0.15*0.2)…• Pr(TAGA)= f5(4)=f3(4)a35+f4(4)a45
35
THREE IMPORTANT QUESTIONS
• How likely is a given sequence?
• What is the most probable “path” for
generating a given sequence?
• How can we learn the HMM parameters
given a set of sequences?
FINDING THE MOST PROBABLE
PATH: THE VITERBI ALGORITHM
• Define vk(i) to be the probability of the most probable
path accounting for the first i characters of x and
ending in state k
• We want to compute vN(L), the probability of the most
probable path accounting for all of the sequence and
ending in the end state
• Can be defined recursively
• Again we can use use Dynamic Programming to
compute vN(L) and find the most probable path
efficiently
36
FINDING THE MOST PROBABLE
PATH: THE VITERBI ALGORITHM
Define vk(i) to be the probability of the most probablepath π accounting for the first i characters of x and ending in state k
The Viterbi Algorithm:
1. Initialization (i = 0)v0(0) = 1, vk(0) = 0 for k>0
2. Recursion (i = 1,…,L)vl(i) = el(xi) .maxk(vk(i-1).akl)
ptri(i) = argmaxk(vk(i-1).akl) keep pointer to previous max
3. Termination:
P(x,π*) = maxk(vk(L).ak0)
π*L = argmaxk(vk(L).ak0)
THREE IMPORTANT QUESTIONS
• How likely is a given sequence?
• What is the most probable “path” for generating a
given sequence?
• How can we learn the HMM parameters given a set of
sequences?
37
LEARNING WITHOUT HIDDEN STATE
• Learning is simple if we know the correct path for each sequence in our training set
• estimate parameters by counting the number of times each parameter is used across the training set
LEARNING WITH HIDDEN STATE
• If we don’t know the correct path for each sequence in our training set, consider all possible paths for the sequence
• Estimate parameters through a procedure that counts the expected number of times each parameter is used across the training set
38
LEARNING PARAMETERS: THE BAUM-WELCH ALGORITHM
• Also known as the Forward-Backward
algorithm
• An Expectation Maximization (EM) algorithm
• EM is a family of algorithms for learning
probabilistic models in problems that involve
hidden states
• In this context, the hidden state is the path
that best explains each training sequence
LEARNING PARAMETERS: THE BAUM-WELCH ALGORITHM
• Algorithm sketch:
• initialize parameters of model
• iterate until convergence
• calculate the expected number of
times each transition or emission is
used
• adjust the parameters to maximize the
likelihood of these expected values
39
COMPUTATIONAL COMPLEXITY OF HMM ALGORITHMS
• Given an HMM with S states and a sequence of
length L, the complexity of the Forward, Backward
and Viterbi algorithms is
• This assumes that the states are densely interconnected
• Given M sequences of length L, the complexity of
Baum Welch on each iteration is
)( 2LSO
)( 2LMSO
REFERENCES1. T.F. Quatieri, Discrete-Time Speech Signal Processing,
Principles and Practice, Prentice-Hall, Inc. 2002.
2. T. Hastc, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Data Mining, Inference, and Prediction, Springer, 2001.
3. W.H. Press, S.A.Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical Recipies in C++, The Art of Scientific Computing, 2nd Edition, Cambridge University Press, 2002.
4. S.B. Davies, P. Mermelstein, Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences, IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-28, no.4, pp. 357-366, Aug. 1980.
API2014
40
REFERENCES5. P. Kenny, “Joint Factor Analysis of Speaker and
Session Variability: Theory and Algorithms, Tech. Report CRIM-06/08-13,” 2005.
Available: http://www.crim.ca/perso/patrick.kenny
6. N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Frontend factor analysis for speaker verification,” IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 4, pp. 788–798, May 2011.
API2014