8 - Speech Recognition

Speech Recognition Concepts, Speech Recognition Approaches, Recognition Theories, Bayes Rule, Simple Language Model, P(A|W), Network Types

8 - Speech Recognition (Cont’d)

HMM Calculating Approaches, Neural Components, Three Basic HMM Problems, Viterbi Algorithm, State Duration Modeling, Training in HMM

Recognition Tasks

Isolated Word Recognition (IWR), Connected Word (CW), and Continuous Speech Recognition (CSR)

Speaker Dependent, Multiple Speaker, and Speaker Independent

Vocabulary size:
Small: < 20
Medium: > 100, < 1,000
Large: > 1,000, < 10,000
Very Large: > 10,000

Speech Recognition Concepts

[Diagram: speech processing plus NLP yields speech understanding; speech synthesis maps text to a phone sequence and then to speech; speech recognition maps speech back to text.]

Speech recognition is the inverse of speech synthesis.

Speech Recognition Approaches

Bottom-Up Approach
Top-Down Approach
Blackboard Approach

Bottom-Up Approach

[Diagram: the signal flows bottom-up through Signal Processing, Feature Extraction, and Segmentation stages (including Voiced/Unvoiced/Silence classification). Knowledge sources — Sound Classification Rules, Phonotactic Rules, Lexical Access, and the Language Model — are applied at successive levels to produce the Recognized Utterance.]

Top-Down Approach

[Diagram: Feature Analysis feeds a Unit Matching System, followed by Lexical, Syntactic, and Semantic Hypothesis stages and an Utterance Verifier/Matcher that outputs the Recognized Utterance. Knowledge sources: an inventory of speech recognition units, a word dictionary, a grammar, and a task model.]

Blackboard Approach

[Diagram: Environmental, Acoustic, Lexical, Syntactic, and Semantic processes all read from and write to a shared Blackboard.]

Recognition Theories

Articulatory-Based Recognition: uses the articulatory system for recognition. This theory has been the most successful so far.

Auditory-Based Recognition: uses the auditory system for recognition.

Hybrid-Based Recognition: a hybrid of the above theories.

Motor Theory: models the intended gestures of the speaker.

Recognition Problem

We have a sequence of acoustic symbols, and we want to find the words expressed by the speaker.

Solution: find the most probable word sequence given the acoustic symbols.

Recognition Problem

A: acoustic symbols; W: word sequence.

We should find $\hat{W}$ such that

$$P(\hat{W}|A) = \max_{W} P(W|A)$$

Bayes Rule

$$P(x|y)\,P(y) = P(x,y)$$

$$P(x|y) = \frac{P(y|x)\,P(x)}{P(y)}$$

$$P(W|A) = \frac{P(A|W)\,P(W)}{P(A)}$$

Bayes Rule (Cont’d)

$$P(\hat{W}|A) = \max_{W} P(W|A) = \max_{W} \frac{P(A|W)\,P(W)}{P(A)}$$

Since $P(A)$ does not depend on $W$:

$$\hat{W} = \arg\max_{W} P(W|A) = \arg\max_{W} P(A|W)\,P(W)$$
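To make the decision rule concrete, here is a minimal sketch that scores a small candidate list in the log domain, where maximizing $\log P(A|W) + \log P(W)$ is equivalent to maximizing $P(A|W)\,P(W)$; the candidate sequences and their scores are purely illustrative:

```python
# Hypothetical candidates with illustrative log-domain model scores:
# acoustic[w] ~ log P(A|W), language[w] ~ log P(W). Logs avoid underflow.
acoustic = {"recognize speech": -120.5, "wreck a nice beach": -118.9}
language = {"recognize speech": -8.1, "wreck a nice beach": -14.7}

def map_decode(candidates):
    """Return the candidate maximizing log P(A|W) + log P(W)."""
    return max(candidates, key=lambda w: acoustic[w] + language[w])

print(map_decode(acoustic))  # -> "recognize speech"
```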

Simple Language Model

$$W = w_1 w_2 w_3 \ldots w_n$$

$$P(W) = P(w_1)\,P(w_2|w_1)\,P(w_3|w_1,w_2)\,P(w_4|w_1,w_2,w_3)\cdots P(w_n|w_1,w_2,\ldots,w_{n-1}) = P(w_1)\prod_{i=2}^{n} P(w_i|w_1,\ldots,w_{i-1})$$

Computing this probability is very difficult and requires a very large database, so trigram and bigram models are used instead.

Simple Language Model (Cont’d)

Trigram:

$$P(W) \approx \prod_{i} P(w_i|w_{i-1},w_{i-2})$$

Bigram:

$$P(W) \approx \prod_{i} P(w_i|w_{i-1})$$

Monogram:

$$P(W) \approx \prod_{i} P(w_i)$$

Simple Language Model (Cont’d)

Computing method for $P(w_3|w_1,w_2)$:

$$P(w_3|w_1,w_2) = \frac{\text{number of occurrences of } w_3 \text{ after } w_1 w_2}{\text{total number of occurrences of } w_1 w_2}$$

Ad hoc method (interpolation of relative frequencies $f$):

$$P(w_3|w_1,w_2) = p_1\,f(w_3|w_1,w_2) + p_2\,f(w_3|w_2) + p_3\,f(w_3)$$
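A minimal counting sketch of these estimates (the toy corpus and the interpolation weights $p_i$ are illustrative, not from the slides):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def f_trigram(w1, w2, w3):
    """Relative frequency: count(w1 w2 w3) / count(w1 w2)."""
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0

def f_bigram(w2, w3):
    return bigrams[(w2, w3)] / unigrams[w2] if unigrams[w2] else 0.0

def f_unigram(w3):
    return unigrams[w3] / len(corpus)

def p_adhoc(w1, w2, w3, p=(0.6, 0.3, 0.1)):
    """Ad hoc smoothing: weighted mix of trigram, bigram, and unigram frequencies."""
    return p[0] * f_trigram(w1, w2, w3) + p[1] * f_bigram(w2, w3) + p[2] * f_unigram(w3)

print(p_adhoc("the", "cat", "sat"))  # ~0.46 on this toy corpus
```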

Error Production Factors

Prosody (recognition should be prosody-independent)

Noise (noise should be prevented)

Spontaneous speech

P(A|W) Computing Approaches

Dynamic Time Warping (DTW)
Hidden Markov Model (HMM)
Artificial Neural Network (ANN)
Hybrid Systems

Dynamic Time Warping

[Several figure slides: DTW aligns a test utterance to a reference template by warping the time axes against each other.]

Search limitations:

First and end intervals
Global limitation
Local limitation

Global limitation: [figure of the allowed region of the warping plane around the diagonal]

Local limitation: [figure of the permitted local path moves]
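A minimal DTW sketch with both constraint types (the absolute-difference local cost and the Sakoe-Chiba-style band are illustrative assumptions; the slides' figures are not reproduced):

```python
import numpy as np

def dtw(x, y, band=None):
    """DTW distance between 1-D sequences x and y.
    band: optional global limitation (max |i - j|), a Sakoe-Chiba-style band.
    Local limitation: allowed predecessor steps (i-1,j), (i,j-1), (i-1,j-1)."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if band is not None and abs(i - j) > band:
                continue  # outside the global search region
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]  # endpoint constraint: paths start at (0,0), end at (n,m)

print(dtw([1, 2, 3, 4], [1, 1, 2, 3, 4], band=2))  # -> 0.0
```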

Artificial Neural Network

Simple computation element of a neural network: inputs $x_0, x_1, \ldots, x_{N-1}$ are weighted by $w_0, w_1, \ldots, w_{N-1}$ and passed through a nonlinearity $\varphi(\cdot)$:

$$y = \varphi\left(\sum_{i=0}^{N-1} w_i x_i\right)$$
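A minimal numpy sketch of this computation element (the sigmoid nonlinearity is an illustrative choice for $\varphi$):

```python
import numpy as np

def neuron(x, w):
    """Simple computation element: y = phi(sum_i w_i * x_i)."""
    phi = lambda a: 1.0 / (1.0 + np.exp(-a))  # sigmoid nonlinearity (assumed)
    return phi(np.dot(w, x))

x = np.array([0.5, -1.0, 2.0])   # inputs x_0 .. x_{N-1}
w = np.array([0.8, 0.2, -0.5])   # weights w_0 .. w_{N-1}
print(neuron(x, w))
```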

Artificial Neural Network (Cont’d)

Neural network types:

Perceptron
Time Delay Neural Network (TDNN) computational element

Artificial Neural Network (Cont’d)

[Figure: Single Layer Perceptron — inputs $x_0, \ldots, x_{N-1}$ fully connected to outputs $y_0, \ldots, y_{M-1}$.]

Artificial Neural Network (Cont’d)

[Figure: Three Layer Perceptron.]

2.5.4.2 Neural Network Topologies

[Figure]

TDNN

[Figure]

2.5.4.6 Neural Network Structures for Speech Recognition

[Figures, two slides]

Hybrid Methods

Hybrid neural network and matched filter for recognition:

[Figure: speech acoustic features pass through delay elements into a pattern classifier whose output units produce the recognition result.]

Neural Network Properties

The system is simple, but training requires many iterations

It does not impose a specific structure

Despite its simplicity, the results are good

The training set is large, so training should be done offline

Accuracy is relatively good

Pre-processing

Different preprocessing techniques are employed as the front end of speech recognition systems.

The choice of preprocessing method depends on the task, the noise level, the modeling tool, etc.


The MFCC Method

The MFCC method is based on how the human ear perceives sounds.

MFCC features perform better than other features in noisy environments.

MFCC was proposed primarily for speech recognition applications, but it also performs well in speaker recognition.

The mel, the unit of human auditory pitch perception, is obtained from a frequency-to-mel relation, commonly written as

$$\text{mel}(f) = 2595\,\log_{10}\left(1 + \frac{f}{700}\right)$$

Steps of the MFCC Method

Step 1: Map the signal from the time domain to the frequency domain using the short-time FFT:

$$X(m) = \sum_{n=0}^{F-1} Z(n)\,W(n)\,W_F^{mn}, \qquad W_F = e^{-j2\pi/F}, \quad m = 0,\ldots,F-1$$

where $Z(n)$ is the speech signal, $W(n)$ is a window function such as the Hamming window, and $F$ is the speech frame length.

Steps of the MFCC Method

Step 2: Find the energy of each filter-bank channel, where $M$ is the number of mel-scale filter banks and $W_k(j)$, $k = 0,1,\ldots,M-1$, are the filter-bank transfer functions.

[Figure: distribution of the filters on the mel scale.]

Steps of the MFCC Method

Step 4: Compress the spectrum and apply the DCT to obtain the MFCC coefficients; a common form of this step is

$$c_n = \sum_{k=1}^{M} \log E_k\,\cos\!\left(\frac{\pi\,n\,(k - \tfrac{1}{2})}{M}\right)$$

where $E_k$ is the energy of channel $k$ and $n = 0,\ldots,L$ is the order of the MFCC coefficients.

The Mel-Cepstrum Method

[Block diagram: time signal → framing → |FFT|² → mel-scaling → logarithm → IDCT → cepstra (keep the low-order coefficients) → differentiator → delta and delta-delta cepstra, giving the mel-cepstrum (MFCC) coefficients.]
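A minimal numpy sketch of this pipeline for a single frame (the triangular filter-bank construction and all parameter values are illustrative assumptions):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)   # common mel relation

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr=16000, n_filters=26, n_ceps=13):
    """MFCC of one frame: window -> |FFT|^2 -> mel filter bank -> log -> DCT."""
    F = len(frame)
    spec = np.abs(np.fft.rfft(frame * np.hamming(F))) ** 2    # |FFT|^2
    # Triangular filters spaced uniformly on the mel scale (simplified).
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((F + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, len(spec)))
    for k in range(n_filters):
        lo, c, hi = bins[k], bins[k + 1], bins[k + 2]
        fbank[k, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)  # rising edge
        fbank[k, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)  # falling edge
    log_e = np.log(fbank @ spec + 1e-10)                           # logarithm
    # DCT-II decorrelates the log energies; keep the low-order coefficients.
    n, k = np.arange(n_ceps)[:, None], np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * n * (2 * k + 1) / (2 * n_filters))
    return dct @ log_e

print(mfcc_frame(np.random.randn(400)).shape)  # -> (13,)
```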

Properties of the Mel-Cepstrum (MFCC)

Maps the mel filter-bank energies onto the directions of maximum variance using the DCT

Makes the speech features partially (not completely) independent of one another (an effect of the DCT)

Gives a good response in clean environments

Its performance degrades in noisy environments

Time-Frequency Analysis

Short-Term Fourier Transform

The standard way of frequency analysis: decompose the incoming signal into its constituent frequency components,

$$X(n,k) = \sum_{m=0}^{N-1} x(np + m)\,W(m)\,e^{-j2\pi km/N}$$

where $W(n)$ is the windowing function, $N$ is the frame length, and $p$ is the step size.
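A minimal sketch of this analysis (the Hamming window and the parameter values are illustrative):

```python
import numpy as np

def stft(x, N=256, p=128):
    """Short-term Fourier transform: slide a window by p, FFT each frame.
    N: frame length, p: step size, W: Hamming window (assumed)."""
    W = np.hamming(N)
    frames = [x[i:i + N] * W for i in range(0, len(x) - N + 1, p)]
    return np.array([np.fft.rfft(f) for f in frames])  # (num_frames, N//2 + 1)

X = stft(np.random.randn(4000))
print(X.shape)  # -> (30, 129)
```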

Critical Band Integration

Related to the masking phenomenon: the threshold of a sinusoid is elevated when its frequency is close to the center frequency of a narrow-band noise.

Frequency components within a critical band are not resolved; the auditory system interprets the signals within a critical band as a whole.

Bark Scale

[Figure: the Bark scale.]

Feature Orthogonalization

Spectral values in adjacent frequency channels are highly correlated.

This correlation leads to a Gaussian model with many parameters: all the elements of the covariance matrix have to be estimated.

Decorrelation is useful to improve the parameter estimation.
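A small numpy sketch of this point (the synthetic correlated "channels" are illustrative): neighboring channels are correlated, and a DCT along the channel axis largely decorrelates them, so a diagonal-covariance Gaussian with far fewer parameters becomes a reasonable fit:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "log filter-bank" features: neighboring channels are correlated
# because each channel mixes in its left neighbor's noise.
raw = rng.standard_normal((1000, 20))
feats = raw + 0.9 * np.roll(raw, 1, axis=1)

def mean_offdiag(x):
    """Mean absolute off-diagonal correlation between channels."""
    c = np.corrcoef(x, rowvar=False)
    return np.abs(c - np.diag(np.diag(c))).mean()

# DCT-II along the channel axis approximately decorrelates the features.
n, k = np.arange(20)[:, None], np.arange(20)[None, :]
dct = np.cos(np.pi * n * (2 * k + 1) / 40)

print(mean_offdiag(feats), mean_offdiag(feats @ dct.T))  # second value is smaller
```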

Language Models for LVCSR

Word Pair Model: specify which word pairs are valid,

$$P(w_j|w_k) = \begin{cases} 1 & \text{if } (w_k, w_j) \text{ is valid} \\ 0 & \text{otherwise} \end{cases}$$

Statistical Language Modeling: for $W = w_1 w_2 \cdots w_Q$,

$$P(W) = P(w_1)\,P(w_2|w_1)\,P(w_3|w_1,w_2)\cdots P(w_Q|w_1,w_2,\ldots,w_{Q-1})$$

and an N-gram model approximates each term as

$$P(w_j|w_1,\ldots,w_{j-1}) \approx P(w_j|w_{j-N+1},\ldots,w_{j-1})$$

The N-gram probabilities are estimated from counts $F(\cdot)$ in a training corpus:

$$\hat{P}(w_i|w_{i-N+1},\ldots,w_{i-1}) = \frac{F(w_{i-N+1},\ldots,w_{i-1},w_i)}{F(w_{i-N+1},\ldots,w_{i-1})}$$

Trigram:

$$\hat{P}(w_3|w_1,w_2) = \frac{F(w_1,w_2,w_3)}{F(w_1,w_2)}$$

Bigram:

$$\hat{P}(w_2|w_1) = \frac{F(w_1,w_2)}{F(w_1)}$$

Unigram:

$$\hat{P}(w_1) = \frac{F(w_1)}{\sum_{w} F(w)}$$

Perplexity of the Language Model

Entropy of the source:

$$H = -\lim_{Q\to\infty} \frac{1}{Q} \sum_{w_1,\ldots,w_Q} P(w_1,w_2,\ldots,w_Q)\,\log_2 P(w_1,w_2,\ldots,w_Q)$$

First-order entropy of the source:

$$H_1 = -\sum_{w \in V} P(w)\,\log_2 P(w)$$

If the source is ergodic, meaning its statistical properties can be completely characterized in a sufficiently long sequence that the source puts out, then

$$H = -\lim_{Q\to\infty} \frac{1}{Q}\,\log_2 P(w_1,w_2,\ldots,w_Q)$$

We often compute H based on a finite but sufficiently large Q:

$$H \approx -\frac{1}{Q}\,\log_2 P(w_1,w_2,\ldots,w_Q)$$

H is the degree of difficulty that the recognizer encounters, on average, when it is to determine a word from the same source.

If the N-gram language model $P_N(W)$ is used, an estimate of H is:

$$\hat{H} = -\frac{1}{Q}\sum_{i=1}^{Q}\log_2 P_N(w_i|w_{i-1},\ldots,w_{i-N+1})$$

In general:

$$\hat{H} = -\frac{1}{Q}\,\log_2 \hat{P}(w_1,w_2,\ldots,w_Q)$$

Perplexity is defined as:

$$PP = 2^{\hat{H}} = \hat{P}(w_1,w_2,\ldots,w_Q)^{-1/Q}$$
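A minimal sketch computing $\hat{H}$ and perplexity from a bigram model (the toy corpus and the add-one smoothing are illustrative assumptions):

```python
import math
from collections import Counter

train = "the cat sat on the mat".split()
test = "the cat sat".split()

V = sorted(set(train))
uni = Counter(train)
bi = Counter(zip(train, train[1:]))

def p_bigram(w_prev, w):
    """Add-one smoothed bigram estimate based on F(w_prev, w) / F(w_prev)."""
    return (bi[(w_prev, w)] + 1) / (uni[w_prev] + len(V))

# H_hat = -(1/Q) * sum_i log2 P(w_i | w_{i-1}); perplexity PP = 2^H_hat
Q = len(test) - 1
H_hat = -sum(math.log2(p_bigram(a, b)) for a, b in zip(test, test[1:])) / Q
print(H_hat, 2 ** H_hat)
```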

Overall Recognition System Based on Subword Units

[Figure]