Paper Review Seminar Research Issues in Speech Recognitionpaperreview2005.pdf · Paper Review...

1

Paper Review Seminar: Research Issues in Speech Recognition

Bartosz Ziolko

2

Computer speech recognition system (automatic speech recognition system):

Acoustic signal → Sequence of symbols

1870 – Alexander Graham Bell

3

Definition & classification

Speech recognition allows computers equipped with a microphone to

interpret human speech, e.g. for transcription. It is an alternative method

of interacting with a computer.

Classification:

• system requires or does not require the user to "train" the system

to recognise speech patterns,

• system is trained for one user only or is speaker independent,

• system can recognise continuous speech or discrete words only,

• system is intended for clear speech material (no distorted speech,

background noise or other speaker talking simultaneously) or not,

• vocabulary is small or large.

4

Applications

Computer users can create and edit documents and interact with a computer more quickly, because people are able to speak faster than anyone can type.

People who are poor typists (especially people with a visual impairment) can greatly increase their productivity.

Speaking to a computer is much faster and easier than typing!

5

Is speech recognition more than 100 years old?

1. 1870 - Alexander Graham Bell - phonautograph

2. The Swiss linguist Ferdinand de Saussure – Course in General Linguistics (1916)

3. Radio Rex - 1920

6

Approaches

Isolated word recognition

constrains the recognized phrases to a small set of possible responses.

Dictation

transcribes speech word by word, does not require semantic understanding,

the goal is to identify the exact words.

Natural language recognition

allows the speaker to provide natural, sentence-length patterns.

L. Rabiner, "A Tutorial on Hidden

Markov Models and Selected Applications

in Speech Recognition", Proceedings of

the IEEE, vol. 77, no. 2 February 1989.

S. Young, "Large Vocabulary Continuous

Speech Recognition." IEEE Signal Processing

Magazine 13(5): 45-57, (1996).

7

Scheme of the speech recognition system

1. Time-frequency analysis

2. Speech segmentation

3. Segment parameterization

4. Fitting the nearest basis element

5. Transcription and building the words

6. Lexical decoding

7. Syntactic analysis

8. Semantic analysis

L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, vol. 77, no. 2, February 1989.

8

Pronunciation

Language    Example words
English     Afghanistan, agency, heighten
Polish      Afganistan, agencja, wzmagać
German      Afghanistan, Agentur, steigen

Many words in the English language sound alike (e.g. night and knight).

Phonemes are context dependent: the same phoneme has different realizations in different left and right contexts.

"I helped Apple wreck a nice beach" sounds like "I helped Apple recognize speech".

A general solution requires human knowledge and experience as well as advanced pattern recognition and artificial intelligence.

9

Difficulties

• Co-articulation of phonemes and words makes the task of speech

recognition difficult,

• Intonation and sentence stress play an important role in the

interpretation. Utterances "go!", "go?" and "go." can clearly be

recognized by a human but are difficult for a computer,

• In naturally spoken language there are no pauses between words.

It is difficult for a computer to decide where word boundaries lie.

10

Speech audibility

[Figure: speech audibility diagram. Horizontal axis: frequency [kHz]; vertical axis: acoustic pressure [dB]. Marked on the plot: speech area, pain threshold, stimulation threshold.]

Tadeusiewicz R., Sygnał

mowy (Speech Signal),

Wydawnictwa Komunikacji i

Łączności, Warszawa, Poland,

1988.

11

Jean Baptiste Joseph Fourier

On the Propagation of Heat in Solid Bodies – 1807

Fourier spectrum

[Figure: amplitude spectrum. Horizontal axis: frequency [kHz]; vertical axis: amplitude.]

$$\hat{s}(f) = \int_{-\infty}^{\infty} s(t) \exp(-2\pi j f t)\, dt$$
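As a minimal sketch of the Fourier spectrum on this slide, the discrete counterpart of the transform can be evaluated with NumPy's FFT. The helper name `amplitude_spectrum` and the 440 Hz test tone are assumptions made for illustration; the 11025 Hz sampling rate is the one quoted later in the deck.

```python
import numpy as np

def amplitude_spectrum(signal, sample_rate):
    """Return (frequencies in Hz, amplitude spectrum scaled to a maximum of 1)."""
    spectrum = np.abs(np.fft.rfft(signal))
    peak = spectrum.max()
    if peak > 0:
        spectrum = spectrum / peak  # normalize so the largest component equals 1
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return freqs, spectrum

sample_rate = 11025                        # Hz
t = np.arange(sample_rate) / sample_rate   # one second of samples
tone = np.sin(2 * np.pi * 440.0 * t)       # hypothetical 440 Hz test tone
freqs, amp = amplitude_spectrum(tone, sample_rate)
print("strongest component at", freqs[np.argmax(amp)], "Hz")
```

With a one-second window the frequency resolution is 1 Hz, so the tone falls exactly on one FFT bin.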

12

Nonlinear scale

[Figure: two spectrograms over time [s], one with the frequency axis in [Mel] and one with the frequency axis in [Hz].]

$$f_{mel} = 1000 \log_2\left(1 + \frac{f_{Hz}}{1000}\right)$$
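A small sketch of the nonlinear scale, assuming the formula on this slide reads f_mel = 1000 log2(1 + f_Hz / 1000). The function names are illustrative, and the inverse mapping is derived here rather than taken from the slides.

```python
import math

def hz_to_mel(f_hz):
    """Map a frequency in Hz to mel using f_mel = 1000 * log2(1 + f_hz / 1000)."""
    return 1000.0 * math.log2(1.0 + f_hz / 1000.0)

def mel_to_hz(f_mel):
    """Inverse of hz_to_mel (derived for this sketch)."""
    return 1000.0 * (2.0 ** (f_mel / 1000.0) - 1.0)

for f in (100, 500, 1000, 2000, 5000):
    print(f, "Hz ->", round(hz_to_mel(f), 1), "mel")
```

By construction 1000 Hz maps to 1000 mel; above that point equal steps in Hz become progressively smaller steps in mel.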

13

Cepstrum

The term cepstrum was introduced by Bogert et al. and has come to be accepted terminology for the (inverse) Fourier transform of the logarithm of the power spectrum of a signal (L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice Hall, Englewood Cliffs, NJ, 1978).

A cepstrum is the result of taking the Fourier transform of the decibel spectrum as if it

were a signal. There is a complex cepstrum and a real cepstrum.

The cepstrum was defined in a 1963 paper:

Tukey, J. W., B. P. Bogert and M. J. R. Healy: "The quefrency alanysis of time series for

echoes: cepstrum, pseudo-autocovariance, cross-cepstrum, and saphe-cracking". Proceedings of the

Symposium on Time Series Analysis (M. Rosenblatt, Ed) Chapter 15, 209-243. New York: Wiley.

Etymology: "cepstrum" is an anagram of "spectrum", formed by reversing the first

four letters.

14

Cepstrum

Verbally: the cepstrum is the FT of the log of the power spectrum.

Signal → FFT → frequency spectrum → Squaring → power spectrum → Smoothing → Logarithm → FFT → cepstrum

Many texts incorrectly state that the process is FT → log → IFT, i.e. that the cepstrum is

the "inverse Fourier transform of the log of the spectrum".
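The chain on this slide (FFT, squaring, logarithm, FFT) can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: the smoothing step is omitted, a small `eps` is added before the logarithm to avoid log(0), and the test signal (an impulse plus a weaker echo) is made up to show why the cepstrum was introduced for echo detection.

```python
import numpy as np

def power_cepstrum(signal, eps=1e-12):
    """Magnitude of the FT of the log power spectrum (smoothing step omitted)."""
    spectrum = np.fft.fft(signal)           # FFT: frequency spectrum
    power = np.abs(spectrum) ** 2           # squaring: power spectrum
    log_power = np.log(power + eps)         # logarithm; eps guards against log(0)
    return np.abs(np.fft.fft(log_power))    # second FFT: cepstrum

# Toy signal: a unit impulse followed by an echo of half the amplitude.
n, delay = 1024, 50
x = np.zeros(n)
x[0] = 1.0
x[delay] = 0.5

cep = power_cepstrum(x)
peak = 1 + int(np.argmax(cep[1 : n // 2]))  # skip the zero-quefrency bin
print("largest cepstral peak at quefrency", peak, "samples")
```

The echo shows up as a peak at a quefrency equal to the echo delay, with weaker peaks at its multiples.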

15

Mel-Frequency Cepstrum Coefficients

S.B. Davis and P. Mermelstein, "Comparison of

parametric representations for monosyllabic word

recognition in continuously spoken sentences",

IEEE Trans. on Acoustics, Speech and Signal

Processing, vol. ASSP-28, No.4, 1980.

S. Young, "Large Vocabulary Continuous

Speech Recognition." IEEE Signal Processing

Magazine 13(5): 45-57, (1996).

M is the number of cepstrum coefficients.

X_k (k = 1, 2, …, 12) represents the log-energy output of the kth filter.
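The formula these symbols belong to is not reproduced in this transcript. A commonly quoted form of the cosine transform from the Davis and Mermelstein paper is c_i = sum over k of X_k cos(i (k - 1/2) pi / K), for i = 1, ..., M, where K is the number of filters. The sketch below assumes that form; the function name, the choice of 20 filters and the input values are illustrative only.

```python
import numpy as np

def mfcc_from_log_energies(log_energies, num_coeffs):
    """Cosine transform of log filter-bank energies X_k into M cepstral coefficients."""
    x = np.asarray(log_energies, dtype=float)
    num_filters = len(x)                              # K, taken from the input length
    k = np.arange(1, num_filters + 1)                 # filter index k = 1..K
    i = np.arange(1, num_coeffs + 1)[:, None]         # coefficient index i = 1..M
    basis = np.cos(i * (k - 0.5) * np.pi / num_filters)
    return basis @ x

# Hypothetical log energies from 20 filters (assumed count), 12 coefficients.
k = np.arange(1, 21)
log_energies = 3.0 + np.cos(2 * (k - 0.5) * np.pi / 20)
coeffs = mfcc_from_log_energies(log_energies, 12)
print(np.round(coeffs, 3))
```

Because the cosine basis is orthogonal, the constant offset in the input contributes nothing to coefficients 1 to M, and only the second coefficient is non-zero for this input.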

16

Other parameters

D.Zhu, K.K.Paliwal, "Product of Power Spectrum

and Group Delay Function For Speech

Recognition", Proceedings of ICASSP 2004, pp.I-

125-8

Mel-frequency Product Spectrum

Cepstral Coefficients

phase spectrum information

K. Ishizuka and N. Miyazaki, "Speech Feature

Extraction Method Representing Periodicity

and Aperiodicity in Sub Bands for Robust

Speech Recognition", Proceedings of ICASSP

2004, pp.I-141-4.

It focuses on feature extraction that represents the aperiodicity of speech. The method is based on Gammatone filter banks, framing, autocorrelation and comb filters.

H. Hermansky, "Perceptual linear

predictive (PLP) analysis of speech", J.

of Acoust. Soc. Amer., vol. 87, no.4,

pp. 1738-52, 1990

H. Misra, S. Ikbal, H. Bourlard, H.

Hermansky, "Spectral Entropy Based

Feature for Robust ASR",

Proceedings of ICASSP 2004, pp.I-

193-6.

Normalizing a spectrum into a function resembling a probability mass function (PMF) makes it possible to calculate its entropy.
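A minimal sketch of the idea: the power spectrum is normalized so that it sums to one and is then treated as a PMF. The function name, the use of the full-band power spectrum and the `eps` guard are assumptions; the cited paper may define the feature differently (for example per sub-band).

```python
import numpy as np

def spectral_entropy(signal, eps=1e-12):
    """Entropy (in nats) of the power spectrum normalized to a PMF."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    pmf = power / (power.sum() + eps)       # normalize so the bins sum to 1
    return float(-np.sum(pmf * np.log(pmf + eps)))

n = 1024
tone = np.sin(2 * np.pi * 64 * np.arange(n) / n)   # energy in a single bin
impulse = np.zeros(n)
impulse[0] = 1.0                                   # perfectly flat spectrum

print("tone:   ", spectral_entropy(tone))
print("impulse:", spectral_entropy(impulse))
```

A peaky spectrum gives an entropy near zero, while a flat spectrum gives the maximum, log(number of bins).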

Yoshizawa, N. Hayasaka, N. Wada and

Y. Miyanaga, "Cepstral Gain

Normalization For Noise Robust

Speech Recognition", Proceedings of

ICASSP 2004, pp.I-209-12.

17

Hidden Markov Model

A Hidden Markov Model (HMM) is a statistical model where the system being

modelled is assumed to be a Markov process with unknown parameters, and the

challenge is to determine the hidden parameters, from the observable parameters,

based on this assumption. The extracted model parameters can then be used to

perform further analysis, for example for speech recognition applications.
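One basic HMM computation described in the Rabiner tutorial is the likelihood of an observation sequence given a model, obtained with the forward algorithm. The sketch below assumes a discrete HMM; all probabilities are made-up illustrative values.

```python
import numpy as np

def forward_likelihood(initial, transition, emission, observations):
    """P(observations | model) for a discrete HMM via the forward algorithm."""
    # alpha[i] = P(observations so far, current state = i)
    alpha = initial * emission[:, observations[0]]
    for symbol in observations[1:]:
        alpha = (alpha @ transition) * emission[:, symbol]
    return float(alpha.sum())

# Hypothetical two-state model with two observable symbols.
initial = np.array([0.6, 0.4])
transition = np.array([[0.7, 0.3],
                       [0.4, 0.6]])
emission = np.array([[0.9, 0.1],
                     [0.2, 0.8]])

print(forward_likelihood(initial, transition, emission, [0, 1, 1]))
```

The recursion costs on the order of N^2 T operations for N states and T observations, instead of summing over all N^T state paths.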

Speech recognition systems are generally based on HMMs or on hybrid solutions with artificial neural networks. A statistical model gives the probability of a word given an observed sequence of acoustic data by the application of Bayes' rule:

$$P(\text{word} \mid \text{acoustic}) = \frac{p(\text{acoustic} \mid \text{word})\, P(\text{word})}{p(\text{acoustic})}$$

P(mushroom soup) > P(much rooms hope)

It can be applied similarly to phonemes, words, syntax and semantics.
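A toy illustration of Bayes' rule choosing between the two transcriptions on this slide. The likelihoods p(acoustic | words) and priors P(words) below are invented numbers, and p(acoustic) is approximated by summing over just these two candidates.

```python
def posterior(likelihoods, priors):
    """P(words | acoustic) for each candidate, using Bayes' rule."""
    joint = {w: likelihoods[w] * priors[w] for w in likelihoods}
    evidence = sum(joint.values())   # p(acoustic), summed over the candidates only
    return {w: joint[w] / evidence for w in joint}

# Invented values: the acoustics fit both phrases about equally well,
# but the language model makes one phrase far more probable.
likelihoods = {"mushroom soup": 0.30, "much rooms hope": 0.35}
priors = {"mushroom soup": 1e-6, "much rooms hope": 1e-9}

post = posterior(likelihoods, priors)
best = max(post, key=post.get)
print(best, post[best])
```

Even though the acoustic score slightly favors the second phrase, the prior dominates and the first phrase wins.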

L. Rabiner, "A Tutorial on Hidden

Markov Models and Selected Applications

in Speech Recognition", Proceedings of

the IEEE, vol. 77, no. 2 February 1989.

18

Wavelet spectra

STFT versus continuous and discrete

wavelet spectrum

[Figure: three time-frequency panels: the discrete wavelet spectrum (time 2^{-m} n vs. resolution m), the STFT spectrogram (time [s] vs. frequency [Hz]) and the continuous wavelet spectrum (time b vs. scale a). Four further panels: Daubechies phi of order 12, Daubechies psi of order 12, F(d12_phi(w)) and F(d12_psi(w)).]

I. Daubechies, “Orthonormal bases of compactly

supported wavelets”, Commun. Pure Appl. Math.,

pp. 909-996, 1988

O. Farooq, S. Datta, “Wavelet based robust

subband features for phoneme recognition”, IEE

Proceedings: Vision, Image & Signal Processing,

vol.151, no.3, pp. 187-93, 2004.

O. Rioul, M. Vetterli, “Wavelets and signal

processing”, IEEE Signal Processing Mag.,

vol.8, pp. 14-38, October 1991.

19

“Andrzej”

Speech signal and its discrete wavelet transform

[Figure: speech signal (amplitude vs. time) and its discrete wavelet transform (reverse scale vs. time).]
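To make the discrete wavelet transform concrete, here is a minimal multi-level decomposition. It uses the Haar wavelet, swapped in for brevity, not the Daubechies wavelet of order 12 shown in the slides; the function name and the test signal are assumptions.

```python
import numpy as np

def haar_dwt(signal, levels):
    """Multi-level Haar DWT. Returns (approximation, [D1, D2, ..., D_levels])."""
    approx = np.asarray(signal, dtype=float)
    details = []
    for _ in range(levels):
        if len(approx) % 2:
            approx = np.append(approx, approx[-1])   # pad odd lengths with the last sample
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))  # high-pass half: detail coefficients
        approx = (even + odd) / np.sqrt(2.0)         # low-pass half: input to the next level
    return approx, details

x = np.sin(2 * np.pi * 3 * np.arange(64) / 64)       # hypothetical 64-sample signal
approx, details = haar_dwt(x, 3)
print([len(d) for d in details], len(approx))
```

Each level halves the number of coefficients, which is why the time resolution becomes coarser as the decomposition level grows.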

20

The frequency band splitting

Decomposition level    Frequency [Hz]    Discretization density
D1                     2756-5512         2Δt
D2                     1378-2756         4Δt
D3                     689-1378          8Δt
D4                     345-689           16Δt
D5                     172-345           32Δt
D6                     86-172            64Δt = 5.805 ms

Sampling frequency f_0 = 11025 Hz means discretization density Δt = 90.7 μs.
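The band edges and time steps on this slide follow from repeatedly halving the band below the Nyquist frequency. A short sketch, assuming the 11025 Hz sampling frequency quoted on the slide and six levels (the function name is illustrative):

```python
def dwt_bands(sample_rate, levels):
    """Return (level name, low Hz, high Hz, time step in seconds) per detail level."""
    dt = 1.0 / sample_rate
    bands = []
    for m in range(1, levels + 1):
        high = sample_rate / 2 ** m          # upper edge halves at every level
        low = sample_rate / 2 ** (m + 1)
        bands.append((f"D{m}", low, high, 2 ** m * dt))
    return bands

for name, low, high, step in dwt_bands(11025, 6):
    print(f"{name}: {low:7.1f} to {high:7.1f} Hz, step {step * 1e3:.3f} ms")
```

At level 6 the time step is 64 sampling periods, which matches the 5.805 ms quoted for D6.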

21

Other topics in speech recognition

R. Sarikaya, J.H.L. Hansen, “High Resolution Speech Feature Parametrization for Monophone-Based Stressed Speech Recognition”, IEEE Signal Processing Letters, vol. 7, no. 7, pp. 182-5, July 2000.

Impact of stress (neutral, angry, loud, Lombard) on monophone speech recognition accuracy. The paper compares three parameter sets: MFCC, Wavelet Packet Parameters (continuous time) and SBC (subband-based cepstral) parameters.

M. Wester, J. Frankel, S. King, "Asynchronous Articulatory Feature Recognition Using Dynamic Bayesian Networks", Proc. IEICE Beyond HMM Workshop, Kyoto, December 2004.

Waveforms are parameterised as 12 MFCCs and energy, with 1st and 2nd derivatives appended. The articulatory features are: manner, place, voicing, rounding, front-back and static.
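The parameterisation described here (12 MFCCs plus energy, with 1st and 2nd derivatives appended) gives 39 values per frame. The sketch below uses `np.gradient` as a simple finite-difference stand-in; the cited paper may compute derivatives differently, and the random input only stands in for real MFCC frames.

```python
import numpy as np

def append_deltas(features):
    """Append 1st and 2nd time derivatives to a (frames, dims) feature matrix."""
    delta = np.gradient(features, axis=0)    # 1st derivative along the frame axis
    delta2 = np.gradient(delta, axis=0)      # 2nd derivative
    return np.hstack([features, delta, delta2])

rng = np.random.default_rng(0)
static = rng.standard_normal((100, 13))      # stand-in for 12 MFCCs + energy per frame
full = append_deltas(static)
print(full.shape)
```

The derivatives add information about how the spectrum changes between neighbouring frames, which the static coefficients alone do not capture.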

22

Other topics in speech recognition

M. Bacchiani and B. Roark, "Meta-

data Conditional Language

Modeling", Proceedings of ICASSP

2004, pp.I-241-4.

It describes an algorithm that uses meta-data, such as the calling phone number, to recognise the speaker and adapt the ASR system to the user.

G. Evermann, H.Y. Chan, M.J.F. Gales, T. Hain, X. Liu, D. Mrva, L. Wang, P.C. Woodland, "Development of the 2003 CU-HTK Conversational Telephone Speech Transcription System", Proceedings of ICASSP 2004, pp. I-249-52.

HTK is the most widely recognized academic toolkit for building automatic speech recognition systems, based on HMMs and MFCCs. It was developed at the University of Cambridge by the Machine Intelligence Laboratory.

http://htk.eng.cam.ac.uk/

H. Van hamme, "Robust Speech

Recognition using Cepstral Domain

Missing Data Techniques and Noisy

Masks", Proceedings of ICASSP

2004, pp.I-213-6.

It describes Missing Data Techniques and an improved Missing Data Detector (MDD). The MDD can compute missing data masks from the noisy signal using harmonic decomposition, without long-term noise averaging.

23

Open issues and research topics

Large vocabulary

Semantic analysis

Phoneme segmentation

Different languages

Dialect support

24

Segmentation

[Figure: “Andrzej”, entire segments.]

25

Thank you for your attention
