Speech Recognition Using Hidden Markov Models



SPEECH RECOGNITION USING HIDDEN MARKOV MODELS


    OUTLINE

I. THE SPEECH SIGNAL

II. THE HIDDEN MARKOV MODEL

III. SPEECH RECOGNITION USING HMM


    INTRODUCTION

APPLICATIONS:

I. HANDS-FREE COMPUTING

II. AUTOMATIC TRANSLATION


    EARLY HISTORY

1952: Isolated digit recognition for a single speaker.

1959: Vowel recognition program.

1970s: Isolated word recognition became a usable technology. Pattern recognition ideas were applied to speech recognition, and ideas from LPC were employed in speech recognition.

1980s: Introduction of the HMM.


    I. THE SPEECH SIGNAL


    OUTLINE:

THE SPEECH SIGNAL

  SPEECH PRODUCTION

  SPEECH REPRESENTATION

    3-STATE REPRESENTATION

    SPECTRAL REPRESENTATION

SPEECH TO FEATURE VECTORS

  PRE-PROCESSING

  WINDOWING

  FEATURE EXTRACTION

  POST-PROCESSING


    SPEECH PRODUCTION


What does each block represent?

Voiced components:

  Impulse train generator  -  Lungs
  Glottal pulse model      -  Epiglottis
  Vocal tract model        -  Vocal tract
  Radiation model          -  Lips

Unvoiced components:

  Random noise generator   -  Unvoiced sounds


    SPEECH REPRESENTATION

Speech is short-time stationary (quasi-stationary): over a short frame its statistics are approximately constant.

Types:

Time-domain representation

Frequency-domain representation


Time-domain representation:


Frequency-domain representation:


OBTAINING FEATURE VECTORS

Preprocessing -> Frame Blocking and Windowing -> Feature Extraction -> Postprocessing

Why do we need feature vectors? The raw waveform is high-dimensional and highly variable; feature vectors give a compact representation that keeps the phonetically relevant information.


Pre-processing:

  Noise cancellation

  Pre-emphasis

  Voice Activation Detection (VAD)

Purpose: to modify the raw speech signal so that it is more suitable for feature extraction.


Noise Cancelling and Pre-emphasis

Methods for noise cancellation:

  Spectral subtraction

  Adaptive noise cancellation

Pre-emphasis: to emphasize high-frequency components, because high-frequency components often have low SNR.

  H(z) = 1 - 0.5z^{-1};  S_1(z) = H(z) S(z)
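
A minimal sketch of this filter, assuming a NumPy array of samples; the 0.5 coefficient is the slide's (values near 0.95-0.97 are also common):

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Apply H(z) = 1 - alpha*z^{-1}, i.e. s1[n] = s[n] - alpha*s[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```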


Voice Activation Detection (VAD)

Finds the end-points of the utterances; the non-speech parts of the signal are chopped off.

Why? Silence and background noise carry no word information, so removing them saves computation and keeps non-speech frames out of the models.


This is for a single chunk:

  W_s1(m) = P_s1(m) (1 - Z_s1(m)) S_c

  P_s1 = short-term power estimate
  Z_s1 = zero-crossing rate
  S_c  = scaling factor


The threshold t_w is decided by some function of the mean and variance of W_s1 itself.
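
A sketch of this measure and threshold, assuming the signal has already been split into chunks; the particular power/ZCR estimators and the threshold rule are illustrative choices, not the deck's:

```python
import numpy as np

def vad_measure(frame: np.ndarray, sc: float = 1000.0) -> float:
    """W = P * (1 - Z) * Sc for one chunk of samples."""
    power = np.mean(frame ** 2)                          # short-term power estimate P
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2   # zero-crossing rate Z in [0, 1]
    return power * (1.0 - zcr) * sc

def is_speech(frames: list[np.ndarray]) -> np.ndarray:
    """Threshold t_w computed from the mean/spread of W over all chunks."""
    w = np.array([vad_measure(f) for f in frames])
    t_w = w.mean() + 0.2 * w.std()                       # illustrative function of mean and variance
    return w > t_w
```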


Windowing

A window function such as the Hamming window is applied to reduce the discontinuity at the edges of the blocks.

Hamming window:

  w(k) = 0.54 - 0.46 cos( 2*pi*k / (K - 1) ),  k = 0, ..., K-1

  K = number of samples in the window
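
A minimal sketch of frame blocking plus Hamming windowing, assuming a NumPy signal array; the 400-sample frame and 160-sample hop (25 ms / 10 ms at 16 kHz) are illustrative:

```python
import numpy as np

def frame_and_window(signal: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Block the signal into overlapping frames and apply a Hamming window."""
    k = np.arange(frame_len)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * k / (frame_len - 1))  # Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i*hop : i*hop + frame_len] * w for i in range(n_frames)])
```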


Feature Extraction:

  LPC

  MFCC


Linear Predictive Coding (LPC)

  Encodes speech at a low bit-rate.

  Assumption: the speech sample at the current time can be approximated from past samples.

  The glottal, vocal-tract and lip-radiation transfer functions are integrated into a single all-pole LPC filter.

  The feature vectors are the filter coefficients a_k.
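
A minimal sketch of the autocorrelation method for estimating the a_k; the order and the Toeplitz solver are assumptions, not the deck's prescription:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_coefficients(frame: np.ndarray, order: int = 10) -> np.ndarray:
    """Estimate all-pole LPC coefficients a_k for one windowed frame
    by solving the autocorrelation normal equations R a = r."""
    # Autocorrelation values r[0..order]
    r = np.array([np.dot(frame[:len(frame)-k], frame[k:]) for k in range(order + 1)])
    # Symmetric Toeplitz system; the a_k predict s[n] from s[n-1..n-order]
    return solve_toeplitz(r[:order], r[1:order+1])
```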


Mel Frequency Cepstral Coefficients (MFCC)

  A non-linear frequency scale is used:

    linear up to 1 kHz,

    logarithmic above that.

  This is similar to the frequency resolution of the human cochlea.


X_t[n] is the DFT of the t-th input speech frame, H_m[n] is the frequency response of the m-th filter in the filter bank, N is the window size of the transform, and M is the total number of filters.
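
In the standard MFCC computation, these symbols combine as follows (textbook form, reconstructed from the definitions above rather than copied from the slide):

$$e_t(m) = \sum_{n=0}^{N-1} \left|X_t[n]\right|^2 H_m[n], \qquad m = 1,\dots,M$$

$$\mathrm{MFCC}_t(l) = \sum_{m=1}^{M} \ln e_t(m)\,\cos\!\left(\frac{\pi l\,(m - 1/2)}{M}\right)$$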


Advantages

  MFCC reduces the information in speech to a small number of coefficients.

  MFCC tries to model loudness.

  MFCC resembles the human auditory model, and it is easy to compute.

But for better accuracy in speech recognition, both models are used simultaneously.


Post Processing

  Weight function: to give more weight to certain features.

  Normalization: to re-scale the numerical values of the features so that they stay in the same numerical range.
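
As one concrete normalization choice (an assumption; the deck names no specific method), cepstral mean and variance normalization:

```python
import numpy as np

def cmvn(features: np.ndarray) -> np.ndarray:
    """Re-scale each feature dimension to zero mean and unit variance
    across the utterance. features: (n_frames, n_features)."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-8   # avoid division by zero
    return (features - mu) / sigma
```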


II. THE HIDDEN MARKOV MODEL


MARKOV CHAINS:

  Markov process: a stochastic process whose future evolution depends only on the current state, not on the path taken to reach it.

  First-order Markov process: the next state depends only on the current state.

  Markov chain: a Markov process with finite states.


HIDDEN MARKOV MODEL

  HMM: a Markov model whose states cannot be observed.

  If the states are visible, it is termed an observable Markov model.

  In a hidden Markov model the state is not directly visible, but the output, which depends on the state, is visible.

  An HMM is denoted compactly by λ.


    HMM example

Imagine that you are a climatologist in the year 2999 studying the history of global warming. You cannot find any records of the weather for the summer of 2007, but you do find Jason's diary, which lists how many ice-creams Jason ate every day that summer. Our goal is to use these observations to estimate the temperature every day. Assume there are only two kinds of days: cold (C) and hot (H).



Notation:

T = length of the observation sequence

N = number of states in the model

M = number of distinct observation symbols, i.e., the number of symbols observed

Q = {q_0, q_1, ..., q_{N-1}} = the distinct states of the Markov process

V = {0, 1, ..., M-1} = the discrete set of possible observations

A = {a_ij}, where a_ij = P(q_{t+1} = j | q_t = i): the probability of being in state j at time t+1 given that we were in state i at time t. We assume the a_ij are independent of time. These are also referred to as the state transition probabilities.

B = {b_j(k)}, where b_j(k) = P(O_t = v_k | q_t = j): the probability of observing symbol v_k given that we are in state j. Also termed the observation probability matrix.

π = {π_i}, where π_i = P(q_1 = i): the initial state distribution, i.e., the probability of being in state i at the beginning of the experiment (t = 1).

O = (O_0, O_1, ..., O_{T-1}) = the observation sequence; O_t denotes the observation symbol observed at time t.

λ = (A, B, π) will be used as a compact notation to denote the HMM.
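
To make the notation concrete, the two-state ice-cream HMM can be written out directly; all numerical values below are illustrative assumptions:

```python
import numpy as np

# Illustrative ice-cream HMM lambda = (A, B, pi); all numbers are assumed.
states = ["H", "C"]                      # hot, cold
observations = [1, 2, 3]                 # ice-creams eaten per day
A  = np.array([[0.8, 0.2],               # P(next | current): H->H, H->C
               [0.3, 0.7]])              #                    C->H, C->C
B  = np.array([[0.2, 0.4, 0.4],          # P(obs | H) for 1, 2, 3 ice-creams
               [0.5, 0.4, 0.1]])         # P(obs | C)
pi = np.array([0.5, 0.5])                # initial state distribution
```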


The three problems for HMMs

Problem 1: Given the observation sequence O = O_1, O_2, ..., O_T and a model λ = (A, B, π), how do we compute P(O|λ), the probability of the observation sequence given the model?


Problem 1

The evaluation problem: it tells us how well a given model matches the observation sequence.

Application in speech recognition? Scoring how well each word's model explains an utterance.


Problem 2

Given the observation sequence O = O_1, O_2, ..., O_T and a model λ = (A, B, π), how do we choose a corresponding state sequence Q = q_1 q_2 ... q_T that is optimal in some meaningful sense (i.e., best explains the observation sequence)?


Problem 2

We attempt to uncover the hidden state sequence. We can never recover the exact hidden state sequence, only an optimal estimate of it.

Application in speech recognition? What if a phoneme is lost in a word?


Problem 3

How do we adjust the model parameters λ = (A, B, π) to maximize P(O|λ)?

This is associated with the training of the HMM.


Solution to Problem 1

Recall the ice-cream example: Jason's diary tells us how many ice-creams he ate each day of the summer of 2007, and each day was either cold (C) or hot (H).


[HMM state diagram; the transition probabilities shown include 0.8 and 0.2]

Given the HMM, what is the probability of the observation sequence {3, 1, 3}?


We want to compute P(O|λ), written P(O) for short.

This task is not straightforward, because we don't know which states produced the observation sequence.


For the state sequence Q = {H, H, C} and the given O = {3, 1, 3}, compute the joint probability P(O, Q).
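
By the standard HMM factorization of the joint probability:

$$P(O, Q \mid \lambda) \;=\; \pi_{H}\, b_{H}(3)\;\cdot\; a_{HH}\, b_{H}(1)\;\cdot\; a_{HC}\, b_{C}(3)$$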


We have shown this for one particular state sequence, but there are 8 different state sequences, such as {C,C,C}, {C,C,H}, etc. We sum over all 8 possible state sequences:
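
That is, the quantity being evaluated is

$$P(O \mid \lambda) \;=\; \sum_{\text{all } Q} P(O, Q \mid \lambda) \;=\; \sum_{Q} \pi_{q_1} b_{q_1}(O_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(O_t)$$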


This brute-force enumeration quickly becomes infeasible: for N hidden states and T observations there are N^T combinations of state sequences.

So we move on to a recursive algorithm called the Forward Algorithm.
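
A minimal sketch of the forward recursion, reusing the illustrative A, B, pi arrays defined earlier; the mapping of ice-cream counts to symbol indices is part of that illustration:

```python
import numpy as np

def forward_likelihood(A, B, pi, obs):
    """Compute P(O | lambda) with the forward recursion in O(N^2 T),
    instead of enumerating all N^T state sequences.
    A: (N, N) transitions, B: (N, M) emissions, pi: (N,) initial, obs: symbol indices."""
    alpha = pi * B[:, obs[0]]             # alpha_1(i) = pi_i * b_i(O_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # alpha_{t+1}(j) = (sum_i alpha_t(i) a_ij) * b_j(O_{t+1})
    return float(alpha.sum())             # P(O | lambda) = sum_i alpha_T(i)

# Ice-cream example: O = {3, 1, 3} maps to symbol indices [2, 0, 2]
# print(forward_likelihood(A, B, pi, [2, 0, 2]))
```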




Solution to Problem 2

Given an HMM, we are trying to find the most likely state sequence for a particular observation sequence.

Enumerating every hidden state combination, we would look for the sequence of hidden states that maximizes Pr(observed seq., hidden state comb. | λ).


Problem: computationally expensive!

Solution: Viterbi decoding.

Logic: it is an inductive algorithm in which, at each instant, you keep the best possible state sequence ending in each of the N states as the intermediate result for the desired observation sequence O = o_1, o_2, ..., o_T.


Our goal is to maximize P(O, Q | λ).

  P(O, Q | λ) = P(O | Q, λ) · P(Q | λ)
              = π_{q_1} b_{q_1}(o_1) · a_{q_1 q_2} b_{q_2}(o_2) ··· a_{q_{T-1} q_T} b_{q_T}(o_T)

Now define

  U(q_1, q_2, ..., q_T) = -ln( π_{q_1} b_{q_1}(o_1) ) - Σ_{t=2}^{T} ln( a_{q_{t-1} q_t} b_{q_t}(o_t) )


It can be seen that P(O, Q | λ) = exp( -U(q_1, q_2, ..., q_T) ).

Initially our goal was to maximize P(O, Q | λ); equivalently, we now want to minimize U(Q).

U(Q) is a re-scaling of the probability values into the log domain, and each term -ln( a_{q_j q_k} b_{q_k}(O_t) ) can be viewed as a cost function.
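
A sketch of Viterbi decoding in exactly this negative-log (cost) domain, assuming the same A, B, pi arrays as the earlier sketches:

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Most likely state sequence argmax_Q P(O, Q | lambda),
    found by minimizing the cost U(Q) = -ln P(O, Q | lambda)."""
    T = len(obs)
    cost = -np.log(pi * B[:, obs[0]])                 # U for length-1 paths
    back = np.zeros((T, A.shape[0]), dtype=int)       # best predecessor per state
    for t in range(1, T):
        # step[i, j]: cost of the best path ending in state i, extended to state j
        step = cost[:, None] - np.log(A) - np.log(B[:, obs[t]])[None, :]
        back[t] = step.argmin(axis=0)
        cost = step.min(axis=0)
    path = [int(cost.argmin())]                       # lowest-cost final state
    for t in range(T - 1, 0, -1):                     # trace back predecessors
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# e.g. viterbi(A, B, pi, [2, 0, 2]) -> most likely H/C sequence for O = {3, 1, 3}
```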


Solution to Problem 3

Deals with training the HMM: adjusting the HMM parameters to fit the observations.

Two methods to solve this:

  Segmental K-means algorithm

  Baum-Welch re-estimation formulae


Segmental K-means algorithm:

  Tries to adjust the model parameters to maximize P(O, Q* | λ), where Q* is the optimum state sequence found by the solution to Problem 2.

Baum-Welch re-estimation formulae:

  Tries to adjust the model parameters to maximize P(O | λ), summed over all state sequences.

  Finds a more general solution.

So which is preferred? Segmental K-means.


Segmental K-means algorithm

Let:

  T = length of each observation sequence

  D = dimension of each observation symbol

The training data can be pictured as a grid of observation vectors, D dimensions by T time steps, one grid per observation sequence.

Consider first a single observation sequence:


Choose N symbols (of dimension D) as initial centres, and assign each of the remaining symbols to one of the N chosen ones according to Euclidean distance.

Calculate the initial and transition probabilities.

Calculate the observation symbol probabilities.
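
A sketch of the counting step for the initial and transition probabilities, assuming frame-to-state assignments are available as integer sequences (an illustration, not the deck's exact formulae):

```python
import numpy as np

def estimate_pi_A(state_seqs, N):
    """Count-based estimates of pi and A from K-means state assignments,
    one assignment per frame (assumes every state is visited)."""
    pi = np.zeros(N)
    A = np.zeros((N, N))
    for seq in state_seqs:
        pi[seq[0]] += 1                        # count initial states
        for i, j in zip(seq[:-1], seq[1:]):
            A[i, j] += 1                       # count i -> j transitions
    pi /= pi.sum()
    A /= A.sum(axis=1, keepdims=True)          # normalize each row to a distribution
    return pi, A
```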


Calculate the observation symbol probabilities.

Assumption: the symbol probability distributions are assumed to be Gaussian.
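
With the Gaussian assumption, the usual estimates for this step are the per-state sample mean and covariance (the textbook form, reconstructed rather than copied from the slide):

$$\hat{\mu}_j = \frac{1}{N_j} \sum_{t:\, q_t = j} O_t, \qquad \hat{\Sigma}_j = \frac{1}{N_j} \sum_{t:\, q_t = j} (O_t - \hat{\mu}_j)(O_t - \hat{\mu}_j)^{\top}$$

$$b_j(O) = \mathcal{N}\!\left(O;\; \hat{\mu}_j,\, \hat{\Sigma}_j\right)$$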


Find the optimal state sequence Q* for each training sequence, as given by the solution to Problem 2, using the λ computed above. A vector is reassigned to a new state if its original assignment differs from the corresponding estimated optimum state.

This process is continued until no new assignment operations occur.


Isolated word recognizer:

Assume we have a vocabulary of V words, and K utterances of each word.

Training an HMM:

For each word v in the vocabulary we must build an HMM λ_v, i.e., we must estimate the model parameters (A, B, π) that optimize the likelihood of the training-set observation vectors of the v-th word.


Testing:

For each unknown word to be recognized: first measure the observation sequence O = O_1, O_2, ..., O_T via feature analysis of the speech corresponding to the word; then calculate the model likelihoods P(O | λ_v) for all possible models; finally select the word whose model likelihood is highest.
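
A sketch of this decision rule, reusing forward_likelihood from the earlier sketch; word_models and its contents are hypothetical:

```python
# Illustrative isolated-word recognizer built on the forward algorithm above.
# word_models maps each vocabulary word to its trained (A, B, pi) triple.
def recognize(word_models, obs):
    """Return argmax_v P(O | lambda_v) over the vocabulary."""
    return max(word_models,
               key=lambda v: forward_likelihood(*word_models[v], obs))

# word_models = {"yes": (A_yes, B_yes, pi_yes), "no": (A_no, B_no, pi_no)}
# print(recognize(word_models, observed_symbols))
```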


A simple yes/no example.


Continuous Speech Recognition

We connect the HMMs in a sequence. Instead of taking the single hypothesis with maximum probability, we try to minimize the expectation of a given loss function.

Reason: we are predicting multiple words here, so the loss can be counted per word rather than per whole sequence.
