Paper Review Seminar: Research Issues in Speech Recognition
Bartosz Ziolko
Computer speech recognition system
Automatic speech recognition system
Acoustic signal → Sequence of symbols
1870 – Alexander Graham Bell
Definition & classification
Speech recognition allows computers equipped with a microphone to
interpret human speech, e.g. for transcription. It is an alternative method
of interacting with a computer.
Classification:
• system requires or does not require the user to "train" the system
to recognise speech patterns,
• system is trained for one user only or is speaker independent,
• system can recognise continuous speech or discrete words only,
• system is intended for clear speech material (no distorted speech,
background noise or other speaker talking simultaneously) or not,
• vocabulary is small or large.
Applications
Computer users can create and edit documents and interact with a computer more
quickly, because people can speak faster than they can type.
People who are poor typists (especially people with a sight disability) can
greatly increase their productivity.
Speaking to a computer is much faster and easier than typing!
Is speech recognition more than 100 years old?
1. 1870 - Alexander Graham Bell - phonautograph
2. The Swiss linguist Ferdinand de Saussure - Course in General Linguistics (1916)
3. Radio Rex - 1920
Approaches
Isolated word recognition
constrains the phrases that can be recognized
to a small set of possible responses.
Dictation
transcribes speech word by word, does not require semantic understanding,
the goal is to identify the exact words.
Natural language recognition
allows the speaker to provide natural, sentence-length patterns.
L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, vol. 77, no. 2, February 1989.
S. Young, "Large Vocabulary Continuous Speech Recognition", IEEE Signal Processing Magazine, vol. 13, no. 5, pp. 45-57, 1996.
Scheme of the speech recognition system
1. Time-frequency analysis
2. Speech segmentation
3. Segment parameterization
4. Fitting the nearest basis element
5. Transcription and building the words
6. Lexical decoding
7. Syntactic analysis
8. Semantic analysis
L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, vol. 77, no. 2, February 1989.
Pronunciation
English language: Afghanistan, agency, heighten
Polish language: Afganistan, agencja, wzmagać
German language: Afghanistan, Agentur, steigen
Many words in the English language sound alike (e.g. "night" and "knight").
Phonemes are context dependent: the same phoneme has different realizations
when its left and right contexts differ.
"I helped Apple wreck a nice beach" sounds like "I helped Apple recognize speech".
A general solution requires human knowledge and experience as well as advanced
pattern recognition and artificial intelligence.
Difficulties
• Co-articulation of phonemes and words makes the task of speech
recognition difficult,
• Intonation and sentence stress play an important role in the
interpretation. The utterances "go!", "go?" and "go." can clearly be
distinguished by a human but are difficult for a computer,
• In naturally spoken language there are no pauses between words.
It is difficult for a computer to decide where word boundaries lie.
Speech audibility
[Figure: acoustic pressure [dB] (-20 to 140) versus frequency [kHz] (0.1 to 10), with the speech area, the pain threshold and the stimulation threshold marked.]
R. Tadeusiewicz, Sygnał mowy (Speech Signal), Wydawnictwa Komunikacji i Łączności, Warszawa, Poland, 1988.
Jean Baptiste Joseph Fourier
On the Propagation of Heat in Solid Bodies – 1807
Fourier spectrum
[Figure: Fourier spectrum, amplitude (0 to 1) versus frequency [kHz] (0 to 5.5).]
$$\hat{s}(f) = \int_{-\infty}^{\infty} s(t)\,\exp(-2\pi j f t)\,dt$$
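For a sampled signal, the Fourier spectrum on this slide can be approximated by the discrete Fourier transform. The following is a minimal pure-Python sketch, not the method used to produce the plot; the function names `dft` and `amplitude_spectrum`, the 1 kHz test tone and the 8 kHz sampling rate are assumptions chosen only for illustration.

```python
import cmath
import math


def dft(signal):
    """Discrete Fourier transform: X[k] = sum_n x[n] * exp(-2*pi*j*k*n/N)."""
    n_samples = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * n / n_samples)
            for n, x in enumerate(signal))
        for k in range(n_samples)
    ]


def amplitude_spectrum(signal):
    """Amplitude of each DFT bin, normalised so that the largest bin equals 1."""
    magnitudes = [abs(c) for c in dft(signal)]
    peak = max(magnitudes) or 1.0  # avoid dividing by zero for a silent signal
    return [m / peak for m in magnitudes]


# Hypothetical test tone: a 1 kHz sine sampled at 8 kHz (64 samples).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(64)]

spectrum = amplitude_spectrum(tone)
# Only the first half of the bins is unique for a real-valued signal.
peak_bin = max(range(len(spectrum) // 2), key=lambda k: spectrum[k])
peak_frequency_hz = peak_bin * sample_rate / len(tone)
print(peak_frequency_hz)
```

The direct sum costs O(N²) operations; a practical system would use an FFT, which computes the same values faster.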
Nonlinear scale
[Figure: two spectrograms of the same utterance, frequency [Mel] (0 to 2500) versus time [s] and frequency [Hz] (0 to 5000) versus time [s].]
$$f_{mel} = 1000 \cdot \log_2\left(1 + \frac{f_{Hz}}{1000}\right)$$
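The nonlinear frequency scale can be sketched in a few lines of Python. The sketch assumes that the formula on this slide reads f_mel = 1000 · log2(1 + f_Hz / 1000), which agrees with the axis ranges of the two plots (5000 Hz maps to about 2585 mel); the function names are illustrative.

```python
import math


def hz_to_mel(frequency_hz):
    """Map Hz to mel, assuming f_mel = 1000 * log2(1 + f_Hz / 1000)."""
    return 1000.0 * math.log2(1.0 + frequency_hz / 1000.0)


def mel_to_hz(frequency_mel):
    """Inverse mapping: f_Hz = 1000 * (2 ** (f_mel / 1000) - 1)."""
    return 1000.0 * (2.0 ** (frequency_mel / 1000.0) - 1.0)


for frequency in (0, 1000, 5000):
    print(frequency, round(hz_to_mel(frequency), 1))
```

With this form of the scale, 1000 Hz maps to exactly 1000 mel, and higher frequencies are progressively compressed.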
Cepstrum
The term cepstrum was introduced by Bogert et al. and has come to be accepted
terminology for the (inverse) Fourier transform of the logarithm of the power spectrum of
a signal. (L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice
Hall, Englewood Cliffs, NJ, 1978)
A cepstrum is the result of taking the Fourier transform of the decibel spectrum as if it
were a signal. There is a complex cepstrum and a real cepstrum.
The cepstrum was defined in a 1963 paper:
B. P. Bogert, M. J. R. Healy and J. W. Tukey: "The quefrency alanysis of time series for
echoes: cepstrum, pseudo-autocovariance, cross-cepstrum, and saphe-cracking". Proceedings of the
Symposium on Time Series Analysis (M. Rosenblatt, Ed.), Chapter 15, 209-243. New York: Wiley.
Etymology: "cepstrum" is an anagram of "spectrum", formed by reversing the first
four letters.
Cepstrum
Verbally: the cepstrum is the FT of the log of the power spectrum.
Signal → FFT → Frequency spectrum → Squaring → Power spectrum → Smoothing → Logarithm → FFT → Cepstrum
Many texts incorrectly state that the process is FT → log → IFT, i.e. that the cepstrum is
the "inverse Fourier transform of the log of the spectrum".
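The processing chain on this slide can be sketched in pure Python as follows. This is only an illustration under stated assumptions: the smoothing step is omitted, a small `floor` value (an assumption of this sketch) guards against taking the logarithm of zero, and the names `dft` and `power_cepstrum` are hypothetical.

```python
import cmath
import math


def dft(values):
    """Discrete Fourier transform of a real or complex sequence."""
    n_samples = len(values)
    return [
        sum(v * cmath.exp(-2j * math.pi * k * n / n_samples)
            for n, v in enumerate(values))
        for k in range(n_samples)
    ]


def power_cepstrum(signal, floor=1e-12):
    """FT -> squared magnitude -> logarithm -> FT (no smoothing step)."""
    power_spectrum = [abs(c) ** 2 for c in dft(signal)]
    # The floor avoids log(0) for empty spectral bins.
    log_spectrum = [math.log(max(p, floor)) for p in power_spectrum]
    return [abs(c) for c in dft(log_spectrum)]


# A unit impulse has a flat power spectrum, so its log spectrum is zero
# everywhere and the cepstrum vanishes.
impulse = [1.0] + [0.0] * 7
impulse_cepstrum = power_cepstrum(impulse)
print(impulse_cepstrum)
```

For a real-valued signal the log power spectrum is real and symmetric, so applying a forward or an inverse transform in the last step changes the result only by a constant scale factor.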
Mel-Frequency Cepstrum Coefficients
S.B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences", IEEE Trans. on Acoustics, Speech and Signal Processing, vol. ASSP-28, no. 4, 1980.
S. Young, "Large Vocabulary Continuous Speech Recognition", IEEE Signal Processing Magazine, vol. 13, no. 5, pp. 45-57, 1996.
M is the number of cepstrum coefficients;
$X_k$ ($k = 1, 2, \ldots, 12$) represents the
log-energy output of the $k$th filter.
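The formula itself did not survive on this slide, so the sketch below assumes the cosine-transform form commonly attributed to the cited Davis and Mermelstein paper: c_i = Σ_k X_k · cos(i (k − 1/2) π / K) for i = 1..M, with K filters. The filter count of 12 follows the range of k given above; the function name and the sample log-energies are hypothetical.

```python
import math


def mel_cepstrum_coefficients(log_energies, num_coefficients):
    """Cosine transform of mel filter-bank log-energies.

    c_i = sum_{k=1..K} X_k * cos(i * (k - 0.5) * pi / K), for i = 1..M,
    where K = len(log_energies) and M = num_coefficients.
    """
    num_filters = len(log_energies)
    return [
        sum(x_k * math.cos(i * (k - 0.5) * math.pi / num_filters)
            for k, x_k in enumerate(log_energies, start=1))
        for i in range(1, num_coefficients + 1)
    ]


# Hypothetical input: identical log-energy in all 12 filters.
flat_log_energies = [2.0] * 12
flat_coefficients = mel_cepstrum_coefficients(flat_log_energies, 6)
print(flat_coefficients)
```

A flat filter-bank output carries no spectral shape, so every coefficient with i ≥ 1 comes out as (numerically) zero; only deviations from flatness contribute.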
Other parameters
D. Zhu, K.K. Paliwal, "Product of Power Spectrum and Group Delay Function For Speech Recognition", Proceedings of ICASSP 2004, pp. I-125-8.
Mel-frequency Product Spectrum Cepstral Coefficients, which include phase spectrum information.
K. Ishizuka and N. Miyazaki, "Speech Feature Extraction Method Representing Periodicity and Aperiodicity in Sub Bands for Robust Speech Recognition", Proceedings of ICASSP 2004, pp. I-141-4.
It focuses on feature extraction that represents the aperiodicity of speech. The method is based on Gammatone filter banks, framing, autocorrelation and comb filters.
H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech", J. of Acoust. Soc. Amer., vol. 87, no. 4, pp. 1738-52, 1990.
H. Misra, S. Ikbal, H. Bourlard, H. Hermansky, "Spectral Entropy Based Feature for Robust ASR", Proceedings of ICASSP 2004, pp. I-193-6.
Normalizing a spectrum into a function resembling a probability mass function (PMF) allows its entropy to be calculated.
Yoshizawa, N. Hayasaka, N. Wada and Y. Miyanaga, "Cepstral Gain Normalization For Noise Robust Speech Recognition", Proceedings of ICASSP 2004, pp. I-209-12.
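The spectral entropy idea mentioned on this slide (normalize a spectrum into a PMF, then compute its entropy) can be sketched as below. This is a minimal illustration rather than the exact feature of the cited paper; the function name, the use of base-2 logarithms and the two sample spectra are assumptions.

```python
import math


def spectral_entropy(power_spectrum):
    """Entropy in bits of a power spectrum normalised to a PMF."""
    total = sum(power_spectrum)
    if total <= 0:
        raise ValueError("power spectrum must contain positive energy")
    pmf = [p / total for p in power_spectrum]
    # Bins with zero probability contribute nothing to the entropy.
    return -sum(p * math.log2(p) for p in pmf if p > 0)


# Hypothetical spectra with 8 bins each.
flat_spectrum = [1.0] * 8           # noise-like: energy spread evenly
peaky_spectrum = [0.0] * 7 + [5.0]  # tone-like: all energy in one bin

print(spectral_entropy(flat_spectrum), spectral_entropy(peaky_spectrum))
```

A flat, noise-like spectrum gives the maximum entropy (log2 of the number of bins), while a spectrum with all its energy in one bin gives zero, which is what makes the measure a candidate feature for separating noisy frames from clearly structured ones.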
Hidden Markov Model
A Hidden Markov Model (HMM) is a statistical model where the system being
modelled is assumed to be a Markov process with unknown parameters, and the
challenge is to determine the hidden parameters, from the observable parameters,
based on this assumption. The extracted model parameters can then be used to
perform further analysis, for example for speech recognition applications.
Speech recognition systems are generally based on HMMs or on hybrid solutions with
artificial neural networks. The statistical model gives the probability of a word given
an observed sequence of acoustic data by applying Bayes' rule:
$$P(\text{word} \mid \text{acoustic}) = \frac{p(\text{acoustic} \mid \text{word})\,P(\text{word})}{p(\text{acoustic})}$$
P(mushroom soup) > P(much rooms hope)
It can be applied similarly at the level of phonemes, words, syntax and semantics.
L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, vol. 77, no. 2, February 1989.
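The Bayes' rule decision on this slide can be illustrated with a toy example. Because p(acoustic) is the same for every hypothesis, it is enough to compare the products p(acoustic | word) · P(word). All numbers and function names below are hypothetical and serve only to show how a language-model prior can outweigh a slightly better acoustic match.

```python
def posterior_scores(likelihoods, priors):
    """Unnormalised posterior p(acoustic | word) * P(word) per hypothesis."""
    return {word: likelihoods[word] * priors[word] for word in likelihoods}


def recognise(likelihoods, priors):
    """Return the hypothesis with the highest posterior score."""
    scores = posterior_scores(likelihoods, priors)
    return max(scores, key=scores.get)


# Hypothetical values: the acoustics slightly favour the wrong phrase,
# but the prior P(word) strongly favours the sensible one.
acoustic_likelihood = {"mushroom soup": 0.020, "much rooms hope": 0.025}
language_prior = {"mushroom soup": 1e-6, "much rooms hope": 1e-10}

best_hypothesis = recognise(acoustic_likelihood, language_prior)
print(best_hypothesis)
```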
Wavelet spectra
STFT versus continuous and discrete wavelet spectrum
[Figure: discrete wavelet spectrum (resolution m versus time 2^-m n), STFT spectrogram (frequency [Hz] versus time [s]) and continuous wavelet spectrum (scale a versus time b); Daubechies phi and Daubechies psi of order 12, with their Fourier transforms F(d12_phi(w)) and F(d12_psi(w)).]
I. Daubechies, "Orthonormal bases of compactly supported wavelets", Commun. Pure Appl. Math., pp. 909-996, 1988.
O. Farooq, S. Datta, "Wavelet based robust subband features for phoneme recognition", IEE Proceedings: Vision, Image & Signal Processing, vol. 151, no. 3, pp. 187-93, 2004.
O. Rioul, M. Vetterli, "Wavelets and signal processing", IEEE Signal Processing Mag., vol. 8, pp. 14-38, October 1991.
Speech signal and its discrete wavelet transform
[Figure: speech signal of the word "Andrzej" (amplitude versus time) and its discrete wavelet transform (reverse scale versus time).]
The frequency band splitting
Decomposition level   Frequency [Hz]   Discretization density
D1                    2756 – 5512      2Δt
D2                    1378 – 2756      4Δt
D3                    689 – 1378       8Δt
D4                    345 – 689        16Δt
D5                    172 – 345        32Δt
D6                    86 – 172         64Δt = 5.805 ms
Sampling frequency f0 = 11025 Hz means discretization density Δt = 90.7 μs
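The values in this table follow from dyadic splitting of the band below the Nyquist frequency: each decomposition level halves the frequency range and doubles the time step. A short sketch that reproduces them, assuming the 11025 Hz sampling frequency stated on the slide (the function name `dyadic_bands` is illustrative):

```python
def dyadic_bands(sampling_frequency_hz, levels):
    """Frequency band and time step of each dyadic detail level D1..Dn."""
    nyquist = sampling_frequency_hz / 2.0
    sample_period = 1.0 / sampling_frequency_hz
    table = []
    for level in range(1, levels + 1):
        high = nyquist / 2 ** (level - 1)   # upper band edge in Hz
        low = nyquist / 2 ** level          # lower band edge in Hz
        step = sample_period * 2 ** level   # time step in seconds
        table.append((f"D{level}", low, high, step))
    return table


bands = dyadic_bands(11025, 6)
for name, low, high, step in bands:
    print(f"{name}: {low:.0f}-{high:.0f} Hz, step {step * 1e3:.3f} ms")
```

At level D6 the time step is 64 sample periods, about 5.805 ms, which matches the last row of the table.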
Other topics in speech recognition
R. Sarikaya, J.H.L. Hansen, "High Resolution Speech Feature Parametrization for Monophone-Based Stressed Speech Recognition", IEEE Signal Processing Letters, vol. 7, no. 7, pp. 182-5, July 2000.
The paper examines the impact of stress (neutral, angry, loud, Lombard) on monophone speech recognition accuracy and compares sets of parameters: MFCC, Wavelet Packet Parameters (continuous time) and SBC (subband-based cepstral).
M. Wester, J. Frankel, S. King, "Asynchronous Articulatory Feature Recognition Using Dynamic Bayesian Networks", Proc. IEICE Beyond HMM Workshop, Kyoto, December 2004.
Waveforms are parameterised as 12 MFCCs and energy, with 1st and 2nd derivatives appended. The articulatory features are: manner, place, voicing, rounding, front-back and static.
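The parameterisation described here (12 MFCCs plus energy, with 1st and 2nd derivatives appended) yields 13 × 3 = 39 values per frame. The slide does not say how the derivatives are computed, so the sketch below assumes a simple central difference with replicated edge frames; toolkits typically use a regression over a longer window instead. All names and the sample data are hypothetical.

```python
def deltas(frames):
    """Central-difference time derivative of each feature (edges replicated)."""
    last = len(frames) - 1
    return [
        [(nxt - prv) / 2.0
         for prv, nxt in zip(frames[max(t - 1, 0)], frames[min(t + 1, last)])]
        for t in range(len(frames))
    ]


def append_derivatives(frames):
    """Concatenate static features with their 1st and 2nd derivatives."""
    first = deltas(frames)
    second = deltas(first)
    return [s + d1 + d2 for s, d1, d2 in zip(frames, first, second)]


# Hypothetical input: 5 frames of 13 static features (12 MFCCs + energy),
# where every feature increases by 1 from one frame to the next.
static_frames = [[float(t)] * 13 for t in range(5)]
full_frames = append_derivatives(static_frames)
print(len(full_frames), len(full_frames[0]))
```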
Other topics in speech recognition
M. Bacchiani and B. Roark, "Meta-data Conditional Language Modeling", Proceedings of ICASSP 2004, pp. I-241-4.
It describes an algorithm that uses meta-data, such as the calling phone number, to recognise the speaker and adapt the ASR system to the user.
G. Evermann, H.Y. Chan, M.J.F. Gales, T. Hain, X. Liu, D. Mrva, L. Wang, P.C. Woodland, "Development of the 2003 CU-HTK Conversational Telephone Speech Transcription System", Proceedings of ICASSP 2004, pp. I-249-52.
HTK is the best-known academic toolkit for building automatic speech recognition systems, based on HMM and MFCC. It was designed at the University of Cambridge by the Machine Intelligence Laboratory.
http://htk.eng.cam.ac.uk/
H. Van hamme, "Robust Speech Recognition using Cepstral Domain Missing Data Techniques and Noisy Masks", Proceedings of ICASSP 2004, pp. I-213-6.
It describes Missing Data Techniques and an improved Missing Data Detector (MDD). The MDD can compute missing data masks from the noisy signal using harmonic decomposition, without long-term noise averaging.
Open issues and research topics
Large vocabulary
Semantic analysis
Phoneme segmentation
Different languages
Dialect support
Segmentation
[Figure: segmentation of the word "Andrzej", showing the entire segments.]
Thank you for your attention