Transcript of speaker-D (uploaded by fikrul-hakim, 8/13/2019)
A Project Report
On
Speaker Recognition System
Implemented in MATLAB
Abstract

Speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. This technique makes it possible to use the speaker's voice to verify their identity and control access to services such as voice dialling, banking by telephone, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers.

Speaker recognition can be classified into identification and verification. Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker.

Speaker recognition methods can also be divided into text-independent and text-dependent methods. In a text-independent system, speaker models capture characteristics of somebody's speech which show up irrespective of what one is saying. In a text-dependent system, on the other hand, the recognition of the speaker's identity is based on his or her speaking one or more specific phrases, like passwords, card numbers, PIN codes, etc. All these technologies of speaker recognition (identification and verification, text-independent and text-dependent) have their own advantages and disadvantages and may require different treatments and techniques. The choice of which technology to use is application-specific. The system that we will develop is classified as a text-independent speaker identification system, since its task is to identify the person who speaks regardless of what is being said.
Overview

Speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. This technique makes it possible to use the speaker's voice to verify their identity and control access to services such as voice dialing, banking by telephone, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers.
Principles of Speaker Recognition
Speaker recognition can "e classified into identification and #erification$ Speaker
identification is the process of determining !hich registered speaker pro#ides a gi#en
utterance$ Speaker verification' on the other hand' is the process of accepting or rejecting
the identity claim of a speaker$ ,igure - sho!s the "asic structures of speaker
identification and #erification systems$
Speaker recognition methods can also "e di#ided into text-independent and text-
dependent methods$ In a te(t)independent system' speaker models capture characteristics
of some"ody*s speech !hich sho! up irrespective of what one is saying$ In a te(t)
dependent system' on the other hand' the recognition of the speaker*s identity is "ased on
-
8/13/2019 speaker-D
4/48
his or herspeaking one or more specific phrases' like pass!ords' card num"ers' PI+
codes' etc$
All technologies of speaker recognition' identification and #erification' te(t)
independent and te(t)dependent' each has its o!n ad#antages and disad#antages and may
re%uires different treatments and techni%ues$ The choice of !hich technology to use is
application)specific$ The system that !e !ill de#elop is classified as text-independent
speaker identification system since its task is to identify the person !ho speaks regardless
of !hat is saying$
At the highest le#el' all speaker recognition systems contain t!o main modules .refer
to ,igure -/0feature extraction andfeature matching$ ,eature e(traction is the process
that e(tracts a small amount of data from the #oice signal that can later "e used to
represent each speaker$ ,eature matching in#ol#es the actual procedure to identify the
unkno!n speaker "y comparing e(tracted features from his1her #oice input !ith the ones
from a set of kno!n speakers$ 2e !ill discuss each module in detail in later sections$
Figure 1. Basic structures of speaker recognition systems

All speaker recognition systems have to serve two distinct phases. The first is referred to as the enrollment session or training phase, while the second is referred to as the operation session or testing phase. In the training phase, each registered speaker has to provide samples of their speech so that the system can build or train a reference model for that speaker. In the case of speaker verification systems, a speaker-specific threshold is additionally computed from the training samples. During the testing (operational) phase (see Figure 1), the input speech is matched with the stored reference model(s) and a recognition decision is made.
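The two phases can be sketched in a few lines. This is a minimal Python illustration (the report's system is implemented in MATLAB), with toy three-dimensional "feature vectors" and a simple averaged-template model standing in for real feature extraction and matching; the names `enroll` and `identify` are hypothetical, not from the report:

```python
import numpy as np

def enroll(samples):
    """Training phase: build a reference model for one speaker by
    averaging feature vectors from that speaker's enrollment utterances."""
    return np.mean(np.vstack(samples), axis=0)

def identify(models, features):
    """Testing phase: match the input features against every stored
    reference model and return the closest registered speaker."""
    test = np.mean(features, axis=0)
    dists = {name: np.linalg.norm(test - ref) for name, ref in models.items()}
    return min(dists, key=dists.get)

# Toy 3-dimensional "feature vectors" for two registered speakers.
models = {
    "alice": enroll([np.array([[1.0, 0.0, 0.0]])]),
    "bob":   enroll([np.array([[0.0, 1.0, 0.0]])]),
}
print(identify(models, np.array([[0.9, 0.1, 0.0]])))  # closest to "alice"
```

A verification system would instead compare the distance for a single claimed speaker against the speaker-specific threshold mentioned above.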
Speaker recognition is a difficult task and it is still an active research area. Automatic speaker recognition works on the premise that a person's speech exhibits characteristics that are unique to the speaker. However, this task is challenged by the high variability of input speech signals. The principal source of variance is the speaker himself: speech signals in training and testing sessions can differ greatly because, among other things, people's voices change with time, health conditions (e.g. the speaker has a cold), speaking rates, etc. There are also other factors, beyond speaker variability, that present a challenge to speaker recognition technology. Examples of these are acoustical noise and variations in recording environments (e.g. the speaker uses different telephone handsets).
General Idea of Speech Recognition

Human speech presents a formidable pattern classification task for a speech recognition system. Numerous speech recognition techniques have been formulated, yet the very best techniques used today have recognition capabilities well below those of a child. This is due to the fact that human speech is highly dynamic and complex. Several disciplines are generally involved in the study of human speech, and a basic understanding of them is needed in order to create an effective system. The following provides a brief description of the disciplines that have been applied to speech recognition problems.
Signal Processing
This process extracts the important information from the speech signal in a well-organised manner. In signal processing, spectral analysis is used to characterize the time-varying properties of the speech signal. Several other types of processing are also needed prior to the spectral analysis stage to make the speech signal more accurate and robust.

Acoustics
The science of understanding the relationship between the physical speech signal and the human vocal tract mechanisms that produce the speech and with which the speech is distinguished.

Pattern Recognition
A set of coding algorithms used to compute data to create prototypical patterns of a data ensemble. It is used to compare a pair of patterns based on the features extracted from the speech signal.

Communication and Information Theory
The procedures for estimating parameters of statistical models and the methods for recognizing the presence of speech patterns.

Linguistics
This refers to the relationships between sounds, words in a sentence, meaning and logic of spoken words.

Physiology
This refers to the comprehension of the higher-order mechanisms within the human central nervous system that are responsible for the production and perception of speech in human beings.

Computer Science
The study of effective algorithms for application in software and hardware; for example, the various methods used in a speech recognition system.

Psychology
The science of understanding the aspects that enable the technology to be used by human beings.
Speech Production

Speech is the acoustic product of voluntary and well-controlled movement of the human vocal mechanism. During the generation of speech, air is inhaled into the human lungs by expanding the rib cage and drawing it in via the nasal cavity, velum and trachea. It is then expelled back into the air by contracting the rib cage and increasing the lung pressure. During the expulsion of air, the air travels from the lungs and passes through the vocal cords, which are two symmetric pieces of ligaments and muscles located in the larynx on the trachea. Speech is produced by the vibration of the vocal cords. Before the expulsion of air, the larynx is initially closed. When the pressure produced by the expelled air is sufficient, the vocal cords are pushed apart, allowing air to pass through. The vocal cords close upon the decrease in air flow. This relaxation cycle is repeated, with generation frequencies in the range of 80 Hz to 300 Hz. The generated frequency depends on the speaker's age, sex, stress and emotions. This succession of glottis openings and closures generates quasi-periodic pulses of air after the vocal cords. The figure below shows a schematic view of the human speech apparatus.

The speech signal is a time-varying signal whose characteristics represent the different speech sounds produced. There are three ways of labelling events in speech. First is the silence state, in which no speech is produced. Second is the unvoiced state, in which the vocal cords are not vibrating, so the output speech waveform is aperiodic and random in nature. The last is the voiced state, in which the vocal cords vibrate periodically when air is expelled from the lungs, resulting in output speech that is quasi-periodic. Figure 2 below shows a speech waveform with unvoiced and voiced states.

Speech is produced as a sequence of sounds. The type of sound produced depends on the shape of the vocal tract, which extends from the opening of the vocal cords to the end of the lips. Its cross-sectional area depends on the position of the tongue, lips, jaw and velum; therefore the tongue, lips, jaw and velum play an important part in the production of speech.
Block Diagram (Engineering Model) of the Human Speech Production System

Factors associated with speech:

Formants:
It has been known from research that the vocal tract and nasal tract are tubes with non-uniform cross-sectional area. As generated sound propagates through these tubes, the frequency spectrum is shaped by the frequency selectivity of the tube. This effect is very similar to the resonance effects observed in organ pipes and wind instruments. In the context of speech production, the resonance frequencies of the vocal tract are called formant frequencies or simply formants. In our engineered model, the poles of the transfer function are called formants. The human auditory system is much more sensitive to poles than to zeros.

Phonemes:
Phonemes can be defined as the symbols from which every sound can be classified or produced. For speech, a crude estimate of the information rate, considering the physical limitations on articulatory motion, is about 10 phonemes per second.
Types of Phonemes:

Speech sounds can be classified into 3 distinct classes according to the mode of excitation:

1. Plosive Sounds
2. Voiced Sounds
3. Unvoiced Sounds

1. Plosive Sounds:
Plosive sounds result from making a complete closure (again toward the front end of the vocal tract), building up pressure behind the closure, and abruptly releasing it.

2. Voiced Sounds:
Voiced sounds are produced by forcing air through the glottis with the tension of the vocal cords adjusted so that they vibrate in a relaxation oscillation, thereby producing quasi-periodic pulses of air which excite the vocal tract.

Voiced sounds are characterized by:
1. High energy levels
2. Very distinct resonant and formant frequencies

The rate at which the vocal cords vibrate determines the pitch. These vibrations are periodic in time, so voiced sounds are approximated by an impulse train; the spacing between impulses is the pitch period, corresponding to the pitch frequency F0.
3. Unvoiced Sounds:
Unvoiced sounds, also known as fricatives, are generated by forming a constriction at some point in the vocal tract (usually toward the mouth end) and forcing air through the constriction at high enough velocity to produce turbulence. This creates a broad-spectrum noise source to excite the vocal tract.

Unvoiced sounds are characterized by:
1. Lower energy levels than voiced sounds
2. Higher frequencies than voiced sounds

In other words, we can say that unvoiced sounds (e.g. /sh/, /s/, /p/) are generated without vocal cord vibration. The excitation is modeled by a White Gaussian Noise source. Unvoiced sounds have no pitch, since they are excited by a non-periodic signal.
Spectra of Typical Voiced and Unvoiced Speech

By passing the speech through a predictor filter A(z), the spectrum is flattened (whitened) considerably, but it still contains some fine details.
Special Types of Voiced and Unvoiced Sounds:

There are, however, some special types of voiced and unvoiced sounds which are briefly discussed here. The purpose of their discussion is only to give the reader an idea of the further types of voiced and unvoiced speech.

Vowels:
Vowels are produced by exciting a fixed vocal tract with quasi-periodic pulses of air caused by vibration of the vocal cords. The way in which the cross-sectional area varies along the vocal tract determines the resonant frequencies of the tract (formants) and thus the sound that is produced. The dependence of cross-sectional area upon distance along the tract is called the area function of the vocal tract. The area function of a particular vowel is determined primarily by the position of the tongue, but the positions of the jaw and lips also affect the resulting sound to a small extent.

Examples: a, e, i, o, u

Diphthongs:
Although there is some ambiguity and disagreement as to what is and what is not a diphthong, a reasonable definition is that a diphthong is a gliding monosyllabic speech item that starts at or near the articulatory position for one vowel and moves to or toward the position for another. According to this definition, there are 6 diphthongs in American English.

Diphthongs are produced by varying the vocal tract smoothly between vowel configurations appropriate to the diphthong. Based on these data, the diphthongs can be characterized by a time-varying vocal tract area function which varies between two vowel configurations.

Examples: /eI/ (as in bay), /oU/ (as in boat), /aI/ (as in buy), /aU/ (as in how)
Semivowels:
The group of sounds consisting of /w/, /l/, /r/ and /y/ is quite difficult to characterize. These sounds are called semivowels because of their vowel-like nature. They are generally characterized by a gliding transition in the vocal tract area function between adjacent phonemes; thus the acoustic characteristics of these sounds are strongly influenced by the context in which they occur. For our purposes they are just considered transitional vowel-like sounds and hence are similar in nature to vowels and diphthongs.

Voiced Fricatives:
The voiced fricatives /v/, /th/, /z/ and /zh/ are the counterparts of the unvoiced fricatives /f/, /th/, /s/ and /sh/ respectively, in that the place of constriction for each of the corresponding phonemes is essentially identical. However, the voiced fricatives differ from their unvoiced counterparts in that two excitation sources are involved in their production. The spectra of voiced fricatives can therefore be expected to display two distinct components.

Voiced Stops:
The voiced stops /b/, /d/ and /g/ are transient, non-continuant sounds which are produced by building up pressure behind a total constriction somewhere in the oral tract and suddenly releasing the pressure. For /b/ the constriction is at the lips; for /d/ the constriction is at the back of the teeth; and for /g/ it is near the velum. During the period in which there is a total constriction in the tract, no sound is radiated from the lips. Since the stop sounds are dynamic in nature, their properties are highly influenced by the vowel which follows the stop consonant.

Unvoiced Stops:
The unvoiced stop consonants /p/, /t/ and /k/ are similar to their voiced counterparts /b/, /d/ and /g/ with one major exception: during the period of total closure of the tract, as the pressure builds up, the vocal cords do not vibrate. Thus, following the period of closure, as the air pressure is released there is a brief interval of friction (due to the sudden turbulence of the escaping air), followed by a period of aspiration (steady flow of air from the glottis exciting the resonances of the vocal tract) before voiced excitation begins.
Hearing and Perception

Audible sounds are transmitted to the human ears through the vibration of particles in the air. The human ear consists of three parts: the outer ear, the middle ear and the inner ear. The function of the outer ear is to direct speech pressure variations toward the eardrum, where the middle ear converts the pressure variations into mechanical motion. The mechanical motion is then transmitted to the inner ear, which transforms these motions into electrical potentials that pass through the auditory nerve and cortex to the brain. The figure below shows a schematic diagram of the human ear.

Schematic Diagram of the Human Ear
The Engineered Model:

The speech mechanism can be modeled as a time-varying filter (the vocal tract) excited by an oscillator (the vocal folds), with different outputs. When voiced sound is produced, the filter is excited by an impulse chain in a range of frequencies (60-400 Hz). When unvoiced sound is produced, the filter is excited by random white noise, without any observed periodicity. These attributes can be observed when the speech signal is examined in the time domain.
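A rough sketch of this source-filter model, assuming Python with NumPy/SciPy rather than the report's MATLAB, and an arbitrary single-resonance all-pole filter standing in for the vocal tract (the 125 Hz pitch and 500 Hz resonance are made-up illustrative values):

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000
n = int(0.05 * fs)   # 50 ms of excitation

# Voiced excitation: impulse train at a 125 Hz pitch.
voiced_src = np.zeros(n)
voiced_src[::fs // 125] = 1.0

# Unvoiced excitation: white Gaussian noise.
rng = np.random.default_rng(0)
unvoiced_src = rng.standard_normal(n)

# Crude all-pole "vocal tract": one stable resonance near 500 Hz.
r, theta = 0.97, 2 * np.pi * 500 / fs
a = [1.0, -2 * r * np.cos(theta), r * r]   # denominator coefficients

voiced = lfilter([1.0], a, voiced_src)     # quasi-periodic output
unvoiced = lfilter([1.0], a, unvoiced_src) # noise-like output
```

The same filter shapes both sources; only the excitation decides whether the output sounds voiced or unvoiced, which is exactly the switch described above.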
(a): The Human Speech Production Figure
(b): Speech Production by a machine
Why Encode Speech?

Speech coding has been, and still is, a major issue in the area of digital speech processing. Speech coding is the act of transforming the speech signal at hand into a more compact form, which can then be transmitted with considerably smaller memory. The motivation behind this is the fact that access to an unlimited amount of bandwidth is not possible. Therefore, there is a need to code and compress speech signals. Speech compression is required in long-distance communication, high-quality speech storage, and message encryption. For example, in digital cellular technology many users need to share the same frequency bandwidth; utilizing speech compression makes it possible for more users to share the available system. Another example where speech compression is needed is digital voice storage: for a fixed amount of available memory, compression makes it possible to store longer messages.

Speech coding is a lossy type of coding, which means that the output signal does not sound exactly like the input; the input and output signals can be distinguished as different. Coding of audio, however, is a different kind of problem from speech coding. Audio coding tries to code the audio in a perceptually lossless way: even though the input and output signals are not mathematically equivalent, the sound at the output is perceived as the same as the input. This type of coding is used in applications for audio storage, broadcasting, and Internet streaming.

Several techniques of speech coding exist, such as Linear Predictive Coding (LPC), Waveform Coding and Sub-band Coding. The problem at hand is to use LPC to code given speech sentences. The speech signals that need to be coded are wideband signals with frequencies ranging from 0 to 8 kHz. The sampling frequency should be 8 kHz. Different types of applications have different time-delay constraints: for example, in network telephony only a delay of 1 ms is acceptable, whereas a delay of 500 ms is permissible in video telephony. Another constraint at hand is not to exceed an overall bit rate of 8 kbps.
The speech coder that will be developed is going to be analyzed using both subjective and objective analysis. Subjective analysis will consist of listening to the encoded speech signal and making judgments on its quality, based solely on the opinion of the listener; the speech might be rated as impossible to understand, intelligible, or natural sounding. Even though this is a valid measure of quality, an objective analysis will be introduced to technically assess the speech quality and to minimize human bias. Furthermore, an analysis of the effects of bit rate, complexity and end-to-end delay on the output speech quality will be made. The report will be concluded with a summary of results and some ideas for future work.
Speech Processing

The speech waveform needs to be converted into digital format before it is suitable for processing in the speech recognition system; the raw speech waveform is analog before conversion. The conversion of an analog signal to a digital signal involves three phases: sampling, quantisation and coding. In the sampling phase, the analog signal is transformed from a waveform that is continuous in time into a discrete signal, that is, a sequence of samples that are discrete in time. In the quantisation phase, each approximate sampled value of the variable is converted into one of the finite values contained in a code set. These two stages allow the speech waveform to be represented by a sequence of values, each belonging to a finite set. After passing through the sampling and quantisation stages, the signal is then coded in the coding phase, usually as binary code. These three phases need to be carried out with caution, as any miscalculation, improper sampling or quantization noise will result in loss of information. Below are the problems faced by the three phases.
Sampling

According to the Nyquist Theorem, the minimum sampling rate required is two times the bandwidth of the signal. This minimum sampling frequency is needed for the reconstruction of a band-limited waveform without error. Aliasing distortion will occur if the minimum sampling rate is not met. The figure below shows the comparison between a properly sampled case and an improperly sampled case.

Aliasing Distortion by Improper Sampling
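The aliasing effect can be demonstrated numerically: at an 8 kHz sampling rate, a 5 kHz tone (above the 4 kHz Nyquist limit) produces exactly the same samples, up to sign, as a 3 kHz tone. A small Python check (illustrative, not from the report):

```python
import numpy as np

fs = 8000                                    # sampling rate, Hz
n = np.arange(64)                            # sample indices
x_high = np.sin(2 * np.pi * 5000 * n / fs)   # 5 kHz tone, above fs/2
x_alias = np.sin(2 * np.pi * 3000 * n / fs)  # its alias at 8000 - 5000 = 3 kHz

# sin(2*pi*0.625*n) == -sin(2*pi*0.375*n) for integer n, so the two
# tones are indistinguishable after sampling.
print(np.allclose(x_high, -x_alias))  # True
```

This is exactly why an anti-aliasing low-pass filter must remove content above half the sampling rate before conversion.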
Quantization

Speech signals are more likely to have amplitude values near zero than at the extreme peak values allowed. For example, in digitizing voice, if the peak value allowed is 1 V, weak passages may have voltage levels on the order of 0.1 V. Speech signals with a non-uniform amplitude distribution are likely to suffer quantizing noise if the step size is not reduced for amplitude values near zero and increased for extremely large values. This quantizing noise is known as granular and slope-overload noise. Granular noise occurs when the step size is large for amplitude values near zero. Slope-overload noise occurs when the step size is small and cannot keep up with extremely large amplitude values. To solve the above quantizing noise problem, Delta Modulation (DM) is used. Delta Modulation works by reducing the step size for amplitude values near zero and increasing the step size for extremely large amplitude values. The figure below shows a diagram of the two types of noise.

Analog Input and Accumulator Output Waveform
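A linear (fixed-step) delta modulator makes both noise types easy to see: too small a step causes slope overload, too large a step causes granular noise. The following Python sketch is illustrative only; the adaptive scheme described above varies the step size, which this fixed-step fragment deliberately does not:

```python
import numpy as np

def delta_modulate(x, step):
    """1-bit delta modulation: the accumulator chases the input in fixed
    steps; the decoder rebuilds the same staircase from the bit stream."""
    bits, acc, approx = [], 0.0, []
    for sample in x:
        bit = 1 if sample >= acc else 0
        acc += step if bit else -step
        bits.append(bit)
        approx.append(acc)
    return np.array(bits), np.array(approx)

fs = 8000
t = np.arange(200) / fs
x = 0.5 * np.sin(2 * np.pi * 100 * t)        # 100 Hz test tone

# The signal's maximum slope is ~0.039 per sample, so a 0.001 step
# suffers slope overload while a 0.05 step merely adds granular noise.
_, overloaded = delta_modulate(x, step=0.001)
_, granular = delta_modulate(x, step=0.05)
print(np.mean((x - overloaded) ** 2) > np.mean((x - granular) ** 2))  # True
```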
Approaches to Speech Recognition

Human beings are the best "machine" to recognize and understand speech. We are able to combine a wide variety of linguistic knowledge concerning syntax and semantics and adaptively use this knowledge according to the difficulties and characteristics of the sentences. Speech recognition systems are built with this aim in mind: to match or exceed human performance. There are generally three approaches to speech recognition, namely the acoustic-phonetic, pattern recognition and artificial intelligence approaches. These three approaches will be explained in greater detail in the following sections.
Speec "oding
A digital speech coder can "e classified into t!o main categories' mainly !a#eform
coders and #ocoders$ 2a#eform coders employ algorithms to encode and decode speech
signals so that the system output is an appro(imation to the input !a#eform$ ?ocoders
encode speech signals "y e(tracting a set of parameters that are digitied and transmitted
to the recei#er$ This set of digitied parameters is used to set #alues for parameters in
function generators and filters' !hich in turn synthesie the output speech signals$ The
-
8/13/2019 speaker-D
23/48
#ocoder output !a#eform does not appro(imate the input !a#eform signals and may
produce an unnatural sound$
Speech Feature Extraction

Introduction

The purpose of this module is to convert the speech waveform to some type of parametric representation (at a considerably lower information rate) for further analysis and processing. This is often referred to as the signal-processing front end.

The speech signal is a slowly time-varying signal (it is called quasi-stationary). An example of a speech signal is shown in Figure 2. When examined over a sufficiently short period of time (between 5 and 100 msec), its characteristics are fairly stationary. However, over longer periods of time (on the order of 1/5 second or more) the signal characteristics change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most common way to characterize the speech signal.
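Short-time analysis starts by cutting the signal into overlapping frames, over each of which the speech is assumed quasi-stationary. A possible Python sketch (the 30 ms frame and 10 ms hop are typical values, not taken from the report, and `frame_signal` is a hypothetical helper):

```python
import numpy as np

def frame_signal(x, fs, frame_ms=30, hop_ms=10):
    """Split a signal into overlapping short-time frames; each frame is
    short enough for the quasi-stationary assumption to hold."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

x = np.zeros(8000)                 # one second of audio at 8 kHz
frames = frame_signal(x, fs=8000)  # 30 ms frames, 10 ms hop
print(frames.shape)                # (98, 240)
```

In practice each frame is then windowed (e.g. with a Hamming window) before the spectral analysis described next.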
-
8/13/2019 speaker-D
24/48
Figure 2. An example of speech signal
A wide range of possibilities exists for parametrically representing the speech signal for the speaker recognition task, such as Linear Prediction Coding (LPC), Mel-Frequency Cepstrum Coefficients (MFCC), and others. MFCC is perhaps the best known and most popular, and it will be used in this project.

MFCCs are based on the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies have been used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which is a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz. The process of computing MFCCs is described in more detail next.
Pattern Recognition

This direct approach involves manipulating the speech signals directly, without explicit feature extraction of the speech signals. There are two stages in this approach: the training of speech patterns, and the recognition of patterns via pattern comparison. Several identical speech signals are collected and sent to the system via the training procedure. With adequate training, the system is able to characterize the acoustic properties of the pattern. This type of classification is known as pattern classification. The recognition stage does a direct comparison between the unknown speech signal and the speech signal patterns learned in the training phase. It generates an "accept" or "reject" decision based on the similarity of the two patterns.

The merits of this approach are:
1. It is simple to use and the method is fairly easy to understand.
2. It is robust to different speech vocabularies, users, feature sets, pattern comparison algorithms and decision rules.
3. It has been proven that this method generates the most accurate results.
Acoustic-Phonetic Approach

The acoustic-phonetic approach has been studied in depth for more than 40 years. It is based on the theory of acoustic phonetics, which suggests that there exist finite, distinctive phonetic units of spoken language, and that the phonetic units are broadly characterized by a set of properties that are manifest in the speech signal, or its spectrum, over time. The first step in this approach is to segment the speech signal into discrete time regions where the acoustic properties of the speech signal are represented by one phonetic unit. The next step is to attach one or more phonetic labels to each segmented region according to the acoustic properties. Finally, the last step attempts to determine a valid word from the phonetic labels generated in the previous steps, consistent with the constraints of the speech recognition task.
Artificial Intelligence

This approach is a combination of the acoustic-phonetic approach and the pattern recognition approach, using the concepts and ideas of both. The artificial intelligence approach attempts to mechanize the speech recognition process according to the way a person applies intelligence in visualizing and analyzing speech. In particular, among the techniques used within this class of methods is the use of an expert system for segmentation and labeling, so that this crucial and most complicated step can be performed with more than just the acoustic information used by pure acoustic-phonetic methods. Neural networks are often used in this approach to learn the relationships between phonetic events and all the known inputs. They can also be used to differentiate similar sound classes.
Dynamic Time Warping

Dynamic Time Warping (DTW) is one of the pioneering approaches to speech recognition. It operates by first storing a prototypical version of each word in the vocabulary in the database, then comparing incoming speech signals with each word and taking the closest match. But this poses a problem, because it is unlikely that the incoming signals will fall into the constant window spacing defined by the host. For example, suppose the password to a verification system is "Queensland". When a user utters "Queeeeensland", a rigid frame-by-frame comparison fails due to the longer constant window spacing of the speech "Queeeeensland"; dynamic time warping resolves this by non-linearly stretching and compressing the time axis so that corresponding sounds line up.
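The warping idea can be implemented with a simple dynamic-programming recurrence. The following Python sketch compares two scalar sequences with absolute difference as the local cost; a real recognizer would compare frame-level feature vectors instead:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping: minimum cumulative cost of aligning two
    sequences, allowing non-linear stretching along the time axis."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: match, insertion, deletion.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

fast = [0, 1, 2, 1, 0]
slow = [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]   # same "word", spoken twice as slowly
print(dtw_distance(fast, slow))          # 0.0: warping absorbs the stretch
```

This is why the drawn-out "Queeeeensland" can still match the stored "Queensland" template, whereas a rigid sample-by-sample comparison would fail.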
Speech recognition is one of the daunting challenges facing researchers throughout the world. The complete solution is far from over, and enormous efforts have been spent by companies to reach the ultimate goal. One of the techniques that has gained acceptance from researchers is the state-of-the-art Hidden Markov Model (HMM) technique. This model can also be combined with other techniques, such as Neural Networks, to form a formidable technique.

The Hidden Markov Model approach is widely used in sequence processing and speech recognition. The key feature of the Hidden Markov Model lies in its ability to model the temporal statistics of data by introducing a discrete hidden variable that undergoes a transition from one time step to the next according to a stochastic transition matrix. The distribution of the emission symbols is embodied in the assumed emission probability density.

A Hidden Markov Model may be viewed as a finite state machine where the transitions between the states depend upon the occurrence of some symbol. Each state transition is associated with an output probability distribution, which determines the probability that a symbol will occur during the transition, and a transition probability indicating the likelihood of the transition itself. Several analytical techniques have been developed for estimating these probabilities, and they have enabled HMMs to become more computationally efficient, robust and flexible.

In speech recognition, the HMM optimises the probability of the training set to detect a particular speech. The probability computation is performed by the Viterbi algorithm, a procedure used to determine an optimal state sequence from a given observation sequence.
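Although the thesis implements its system in MATLAB, the Viterbi recursion just described can be sketched in a few lines of Python; the toy two-state, two-symbol model below is an illustrative assumption, not parameters from the thesis.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state sequence for a discrete-emission HMM.

    pi : (S,) initial state probabilities
    A  : (S, S) transition matrix, A[i, j] = P(next state j | state i)
    B  : (S, K) emission matrix, B[i, k] = P(symbol k | state i)
    obs: sequence of observed symbol indices
    """
    S, T = len(pi), len(obs)
    # delta[t, i]: log-probability of the best path ending in state i at time t
    delta = np.full((T, S), -np.inf)
    psi = np.zeros((T, S), dtype=int)  # back-pointers
    with np.errstate(divide="ignore"):
        logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta[0] = logpi + logB[:, obs[0]]
    for t in range(1, T):
        for j in range(S):
            scores = delta[t - 1] + logA[:, j]
            psi[t, j] = int(np.argmax(scores))
            delta[t, j] = scores[psi[t, j]] + logB[j, obs[t]]
    # Trace the optimal state sequence backwards through the back-pointers
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Toy 2-state, 2-symbol model (illustrative values only)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(pi, A, B, [0, 0, 1, 1]))  # [0, 0, 1, 1]
```

With these values the optimal state sequence simply tracks the dominant emitter of each observed symbol.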
Neural Networks

As you read these words, your brain is using its complex network of some 10^11 neurons to facilitate your reading. Each of these neurons has the blazing processing speed of a microprocessor, which allows us to read, think and write simultaneously. Scientists have found that all biological neural functions, including memory, are stored in the neurons and in the connections between them. As we learn new things every day, new connections are made or modified. Some of these neural structures are defined at birth, while others are created every day and others waste away. In this thesis, the Neural Network algorithm refers to Artificial Neural Networks, not the actual neurons in our brain. A picture illustrating biological neurons is shown in the Figure below.

Schematic Drawing of Biological Neurons
Neuron Model
The Figure below shows a single-input neuron. The scalar input p is multiplied by the scalar weight w to form wp, which is then sent to the summer. In the summer, the product wp is added to the bias b. The summer output n then goes into the transfer function f, which generates the scalar neuron output a. The value of the neuron output "a" depends on the type of transfer function used. This whole idea of the artificial neuron is similar to the biological neurons shown in the Figure above: the weight w corresponds to the strength of the synapse, the cell body is equivalent to the summation and the transfer function, and finally the neuron output "a" corresponds to the signal travelling in the axon.

Summer output: n = wp + b
Neuron output: a = f(wp + b)

Single Input Neuron
Hard Limit Transfer Function
The hard limit transfer function, shown on the left side of the Figure below, sets the neuron output a to 0 if the summer output n is less than 0. If the summer output n is greater than or equal to 0, it sets the neuron output a to 1. This transfer function is useful for classifying inputs into two categories, and in this thesis it is used to determine true or false detection of the speech signal. The figure on the right shows the effect of the weight and the bias combined together.

Hard Limit Transfer Function
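The neuron just described can be sketched in a few lines (Python rather than the thesis's MATLAB; the weight, bias and input values are illustrative assumptions):

```python
def hardlim(n):
    # Hard limit transfer function: 0 if n < 0, else 1
    return 0 if n < 0 else 1

def neuron(p, w, b):
    # Single-input neuron: scale the input, add the bias,
    # then pass the summer output through the transfer function
    n = w * p + b          # summer output
    return hardlim(n)      # neuron output a = f(wp + b)

# Illustrative weight and bias
print(neuron(0.5, w=2.0, b=-0.4))  # n = 0.6  -> a = 1
print(neuron(0.1, w=2.0, b=-0.4))  # n = -0.2 -> a = 0
```

The same two-valued output is what classifies a speech frame as a true or false detection in this thesis.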
Decision Boundary
A single-layer perceptron consists of the input neuron, the weight, the bias, the summer and the transfer function. Figure 1 below shows a diagram of a single-layer perceptron. A single-layer perceptron can be used to classify input vectors into two categories. The weight is always orthogonal to the decision boundary. For example, in Figure 2 below, the weight w is set to [-2 3]. The decision boundary corresponding to the graph in Figure 2 is indicated. We can use any point on the decision boundary to find the bias as follows: wp + b = 0. Once the bias is set, any point in the plane can be classified as lying inside the shaded region (wp + b > 0) or outside the shaded region (wp + b < 0).
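Using the weight w = [-2 3] from the text, the bias can be found from any point on the boundary and used to classify the plane; the boundary point chosen below is an illustrative assumption, since the figure itself is not reproduced here.

```python
import numpy as np

w = np.array([-2.0, 3.0])          # weight from the text; orthogonal to the boundary
p_boundary = np.array([0.0, 2.0])  # assumed point lying on the decision boundary
b = -float(w @ p_boundary)         # solve w.p + b = 0  ->  b = -6

def classify(p):
    # Points with w.p + b >= 0 lie inside the shaded region (output 1),
    # all others outside (output 0) -- the hard limit transfer function.
    return 1 if float(w @ np.asarray(p)) + b >= 0 else 0

print(b)                 # -6.0
print(classify([0, 3]))  # w.p + b = 9 - 6 = 3   -> 1
print(classify([3, 0]))  # w.p + b = -6 - 6 = -12 -> 0
```

Moving the assumed boundary point changes only the bias; the orientation of the boundary is fixed by w.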
Figure 1: Multiple Input Neuron
Figure 2: Perceptron Decision Boundary
Mel-Frequency Cepstrum Coefficients Processor
A block diagram of the structure of an MFCC processor is given in Figure 3. The speech input is typically recorded at a sampling rate above 10000 Hz. This sampling frequency was chosen to minimise the effects of aliasing in the analog-to-digital conversion. These sampled signals can capture all frequencies up to 5 kHz, which covers most of the energy of sounds generated by humans. As discussed previously, the main purpose of the MFCC processor is to mimic the behaviour of the human ears. In addition, MFCCs are shown to be less susceptible to the variations mentioned than the speech waveforms themselves.
Frame Blocking
In this step the continuous speech signal is blocked into frames of N samples, with adjacent frames separated by M (M < N). The first frame consists of the first N samples. The second frame begins M samples after the first frame, and overlaps it by N - M samples. Similarly, the third frame begins 2M samples after the first frame (or M samples after the second frame) and overlaps it by N - 2M samples. This process continues until all the speech is accounted for within one or more frames. Typical values for N and M are N = 256 (which is equivalent to roughly 30 msec of windowing and facilitates the fast radix-2 FFT) and M = 100.
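The blocking scheme can be sketched as follows (a Python illustration with the typical N = 256, M = 100; for simplicity this version drops any trailing samples that do not fill a complete frame):

```python
import numpy as np

def frame_blocking(signal, N=256, M=100):
    """Block a 1-D signal into overlapping frames of N samples,
    with consecutive frames starting M samples apart (overlap N - M)."""
    signal = np.asarray(signal)
    num_frames = 1 + max(0, (len(signal) - N) // M)
    return np.stack([signal[i * M : i * M + N] for i in range(num_frames)])

frames = frame_blocking(np.arange(1000))
print(frames.shape)  # (8, 256)
print(frames[1][0])  # second frame starts M = 100 samples in
```

Note that sample 100 of the first frame equals sample 0 of the second frame, which is exactly the N - M sample overlap described above.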
Windowing
The next step in the processing is to window each individual frame so as to minimise the signal discontinuities at the beginning and end of each frame. The concept here is to minimise the spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame. If we define the window as w(n), 0 <= n <= N - 1, where N is the number of samples in each frame, then the result of windowing is the signal

y(n) = x(n) w(n), 0 <= n <= N - 1

Typically the Hamming window is used, which has the form:

w(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1)), 0 <= n <= N - 1
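A sketch of applying the Hamming window to one frame (Python illustration; the frame content is random stand-in data, not speech from the thesis):

```python
import numpy as np

N = 256
n = np.arange(N)
# Hamming window: w(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1)), 0 <= n <= N-1
w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

frame = np.random.randn(N)  # stand-in for one speech frame
windowed = frame * w        # y(n) = x(n) * w(n)

# The window tapers both ends of the frame toward 0.08 and is near 1
# in the middle, which suppresses the edge discontinuities.
print(round(float(w[0]), 2), round(float(w[-1]), 2))
```

The tapered ends are what reduce the spectral leakage that abrupt frame boundaries would otherwise introduce.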
Fast Fourier Transform (FFT)
The next processing step is the Fast Fourier Transform, which converts each frame of N samples from the time domain into the frequency domain. The FFT is a fast algorithm implementing the Discrete Fourier Transform (DFT), which is defined on the set of N samples x_k as follows:

X_n = sum_{k=0}^{N-1} x_k e^(-2*pi*j*k*n/N), n = 0, 1, 2, ..., N - 1

Note that we use j here to denote the imaginary unit, i.e. j = sqrt(-1). In general the X_n are complex numbers. The resulting sequence X_n is interpreted as follows: the zero frequency corresponds to n = 0, positive frequencies 0 < f < Fs/2 correspond to values 1 <= n <= N/2 - 1, while negative frequencies -Fs/2 < f < 0 correspond to N/2 + 1 <= n <= N - 1, where Fs denotes the sampling frequency.
The result after this step is often referred to as the spectrum or periodogram.
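A small numeric check of this interpretation (a sketch using NumPy's FFT, not code from the thesis): a cosine with two cycles per frame places its energy at one positive-frequency bin and its mirrored negative-frequency bin.

```python
import numpy as np

N = 8
x = np.cos(2 * np.pi * 2 * np.arange(N) / N)  # two cycles over the frame

# DFT via the FFT: X[n] = sum_k x[k] * exp(-2j*pi*k*n/N)
X = np.fft.fft(x)

# Energy appears at n = 2 (positive frequency) and n = N - 2 = 6
# (the corresponding negative frequency), each with magnitude N/2.
print(np.round(np.abs(X)).astype(int))  # [0 0 4 0 0 0 4 0]
```

The symmetric pair of peaks is the complex-exponential decomposition of a real cosine, which is why real speech frames always yield conjugate-symmetric spectra.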
Mel-Frequency Wrapping
As mentioned above, psychophysical studies have shown that human perception of the frequency content of sounds for speech signals does not follow a linear scale. Thus for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the "mel" scale. The mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels. Therefore we can use the following approximate formula to compute the mels for a given frequency f in Hz:

mel(f) = 2595 * log10(1 + f/700)
One approach to simulating the subjective spectrum is to use a filter bank, spaced uniformly on the mel scale (see Figure 4). That filter bank has a triangular bandpass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval. The modified spectrum of S(w) thus consists of the output power of these filters when S(w) is the input. The number of mel spectrum coefficients, K, is typically chosen as 20.

Note that this filter bank is applied in the frequency domain; therefore it simply amounts to taking those triangle-shaped windows in Figure 4 on the spectrum. A useful way of thinking about this mel-wrapping filter bank is to view each filter as a histogram bin (where bins have overlap) in the frequency domain.
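The mel formula and a set of K = 20 filter edges spaced uniformly on the mel scale can be sketched as follows (Python illustration; the 5 kHz upper edge follows the sampling discussion above):

```python
import numpy as np

def hz_to_mel(f):
    # Approximate subjective pitch in mels for a frequency f in Hz
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse of the mel formula
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

K = 20          # number of mel spectrum coefficients, as in the text
f_max = 5000.0  # upper frequency edge (half the >10 kHz sampling rate)

# K + 2 filter edges spaced uniformly on the mel scale: roughly linear
# below 1 kHz and logarithmic above, so the triangular filters widen
# with increasing frequency.
edges = mel_to_hz(np.linspace(0.0, hz_to_mel(f_max), K + 2))

print(round(float(hz_to_mel(1000.0))))  # a 1 kHz tone sits near 1000 mels
```

The widening edge spacing at high frequencies is exactly the "histogram bins with overlap" picture described above.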
Figure 4: An example of a mel-spaced filter bank
Cepstrum
In this final step, we convert the log mel spectrum back to time. The result is called the mel frequency cepstrum coefficients (MFCC). The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given frame analysis. Because the mel spectrum coefficients (and so their logarithm) are real numbers, we can convert them to the time domain using the Discrete Cosine Transform (DCT). Therefore, if we denote the mel power spectrum coefficients that are the result of the last step as S_k, k = 1, 2, ..., K, we can calculate the MFCCs as:

c_n = sum_{k=1}^{K} (log S_k) cos[n (k - 1/2) pi / K], n = 1, 2, ..., K
Note that we exclude the first component, c_0, from the DCT, since it represents the mean value of the input signal, which carries little speaker-specific information.
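The DCT step can be sketched as follows (Python illustration; keeping 12 coefficients is a common choice, not a value fixed by the text):

```python
import numpy as np

def mfcc_from_mel_power(S, num_coeffs=12):
    """DCT of the log mel power spectrum S (K values):
    c_n = sum_{k=1..K} log(S_k) * cos(n * (k - 0.5) * pi / K),
    keeping n = 1..num_coeffs and dropping c_0 (the mean term)."""
    S = np.asarray(S, dtype=float)
    K = len(S)
    k = np.arange(1, K + 1)
    return np.array([np.sum(np.log(S) * np.cos(n * (k - 0.5) * np.pi / K))
                     for n in range(1, num_coeffs + 1)])

# A flat mel spectrum (all ones) has log S = 0 everywhere, so every
# cepstral coefficient vanishes -- a quick sanity check of the formula.
mel_power = np.ones(20)
print(mfcc_from_mel_power(mel_power))
```

Any spectral tilt or formant structure in S shows up as non-zero low-order coefficients, which is what the classifier ultimately compares.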
The LBG algorithm proceeds by the following recursive procedure:

1. Design a 1-vector codebook; this is the centroid of the entire set of training vectors (hence, no iteration is required here).
2. Double the size of the codebook by splitting each current codeword y_n according to the rule y_n+ = y_n(1 + e) and y_n- = y_n(1 - e), where e is a small splitting parameter (e.g. e = 0.01).
3. Nearest-Neighbour Search: for each training vector, find the codeword in the current codebook that is closest, and assign that vector to the corresponding cell.
4. Centroid Update: update the codeword in each cell using the centroid of the training vectors assigned to that cell.
5. Iteration 1: repeat steps 3 and 4 until the average distortion falls below a preset threshold.
6. Iteration 2: repeat steps 2, 3 and 4 until a codebook size of M is designed.
Intuitively, the LBG algorithm designs an M-vector codebook in stages. It starts by designing a 1-vector codebook, then uses a splitting technique on the codewords to initialise the search for a 2-vector codebook, and continues the splitting process until the desired M-vector codebook is obtained.
Figure 6 shows, in a flow diagram, the detailed steps of the LBG algorithm. "Cluster vectors" is the nearest-neighbour search procedure, which assigns each training vector to the cluster associated with the closest codeword. "Find centroids" is the centroid update procedure. "Compute D (distortion)" sums the distances of all training vectors in the nearest-neighbour search so as to determine whether the procedure has converged.
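The split-and-refine loop above can be sketched as follows; this is an illustrative Python implementation of the listed steps, not the thesis's MATLAB code, and the two-cluster training data is synthetic.

```python
import numpy as np

def lbg(training, M, eps=0.01, tol=1e-3):
    """LBG codebook design: start from a 1-vector codebook (the global
    centroid), then repeatedly split each codeword by (1 +/- eps) and
    refine with nearest-neighbour / centroid-update iterations."""
    data = np.asarray(training, dtype=float)
    codebook = data.mean(axis=0, keepdims=True)   # step 1
    while len(codebook) < M:
        # Step 2: split every codeword
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        prev = np.inf
        while True:
            # Step 3 (cluster vectors): assign each training vector
            # to its closest codeword
            d = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            distortion = d.min(axis=1).mean()
            # Step 4 (find centroids): move each codeword to the mean
            # of the vectors assigned to it
            for i in range(len(codebook)):
                if np.any(labels == i):
                    codebook[i] = data[labels == i].mean(axis=0)
            # Step 5: stop refining once the distortion stops improving
            if prev - distortion < tol * max(distortion, 1e-12):
                break
            prev = distortion
    return codebook

# Synthetic training vectors drawn from two well-separated clusters
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(3, 0.1, (50, 2))])
cb = lbg(pts, M=2)
print(cb.shape)  # (2, 2)
```

On this data the two codewords settle near the two cluster centres, which is the behaviour the flow diagram in Figure 6 formalises.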
Obtaining Speech Waveform
The first task is to record the speech waveform from the speaker and load it into the program. The Sound Recorder program in Microsoft Windows was chosen to record the speech waveform. The recorded speech is automatically filtered, sampled at a sampling rate of 22.05 kHz and then saved as a wave file. The wave file format was chosen because it is highly compatible with the MATLAB program, as it can be easily retrieved via a single command.
Pre-Extraction Process
The speech wave file is loaded into the MATLAB program using the "wavread" function, which limits the amplitude of the speech signal to a magnitude of 1. The signal is then saved as an M x 1 vector, where M refers to the total number of samples in the speech signal. Each element in the vector contains the amplitude of the speech signal at a particular sampling instant. The speech signal is now ready to go through the compatibility and quality process.
Quality Process
Before the actual extraction process takes place, the wave file is subjected to a series of processes to ensure the compatibility and quality of the signal. When the speech signal is loaded into the MATLAB program, it is not centred on the y = 0 axis. In order to bring the whole signal to centre on the zero line, special program code was written. This code finds the mean of the signal and then subtracts this mean from each of the sample values of the signal. This is shown in Figure 1 below. The reason for shifting the whole signal to the y = 0 axis will be explained in a later section.

Speech Waveform
The next process is to suppress the noise present in the speech waveform. Although the Sound Recorder program performs some initial filtering, some noise is still present in the speech waveform. Another section of the MATLAB program code is used to set a threshold value on the speech signal. Any value of the speech signal that falls below this threshold value is set to zero. This greatly suppresses the unwanted noise while preserving the content of the main speech signal. This is illustrated in Figure 2 and Figure 3 below. After some testing, it was found that a threshold value of 0.02 is most suitable.

Fig 2: Speech waveform before filtering
Fig 3: Speech waveform after filtering
The final compatibility and quality process is to determine the area of interest of the speech signal. This is done by detecting the first rise point and the final drop point of the speech waveform. This can be done easily since the speech signal has been cleared of unwanted noise. Hence the area of interest of the speech lies between the first rise and final drop points of the speech waveform. This area of interest is later used for the extraction and coding processes.
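The three conditioning steps described above (mean removal, thresholding at 0.02, and trimming to the first rise and final drop) can be sketched as follows; the ten-sample "signal" is an illustrative stand-in for a recorded waveform.

```python
import numpy as np

def condition(signal, threshold=0.02):
    """Centre the signal on the zero line, suppress low-level noise,
    and keep only the region between the first rise and final drop."""
    s = np.asarray(signal, dtype=float)
    s = s - s.mean()                 # subtract the mean from every sample
    s[np.abs(s) < threshold] = 0.0   # zero out values below the threshold
    nonzero = np.flatnonzero(s)
    return s[nonzero[0] : nonzero[-1] + 1]   # area of interest

# Tiny stand-in waveform: a 0.05 DC offset with a short burst of "speech"
raw = np.array([0.05, 0.05, 0.05, 0.55, -0.45, 0.35, -0.25, 0.05, 0.05, 0.05])
trimmed = condition(raw)
print(trimmed)  # ~ [0.5, -0.5, 0.3, -0.3]
```

After conditioning, only the four speech samples survive: the DC offset is removed by the mean subtraction and the silent tails fall below the threshold.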
Database
Before the classification takes place, speech samples from each speaker have to be collected, coded, converted to an MFCC code and stored in the database. The utterances of these speech samples have to be of the same phrase. The speech samples for each speaker are then averaged to produce the mean reference matrix. This averaging process is necessary as it reduces the inconsistency of the speaker's speech. The speech samples must also go through a process to find the standard deviation of the samples.
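A sketch of this averaging step, assuming each utterance has been reduced to a 12 x 40 MFCC matrix (illustrative dimensions with random stand-in values; the text does not fix these sizes):

```python
import numpy as np

# Several MFCC matrices (coefficients x frames) from the same speaker,
# all uttering the same phrase; values here are random stand-ins.
rng = np.random.default_rng(1)
samples = [rng.normal(size=(12, 40)) for _ in range(5)]

stack = np.stack(samples)            # (utterances, coeffs, frames)
mean_reference = stack.mean(axis=0)  # mean reference matrix for the speaker
std_reference = stack.std(axis=0)    # per-element standard deviation

print(mean_reference.shape, std_reference.shape)  # (12, 40) (12, 40)
```

The element-wise mean smooths out utterance-to-utterance variation, while the standard deviation records how consistent the speaker is at each coefficient.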
Neural Network Process
After the mean reference matrix of each individual speaker is obtained, the elements in the matrix are passed through the Neural Network to construct the weight and bias for each individual speaker. Once the weight and bias of each individual speaker are generated, the program is ready to classify any unknown user.
Conclusion
With the positive results collected, speech recognition using MFCC and Neural Networks has proven to be excellent at classifying speech signals. Unlike traditional speech recognition techniques, which involve complex Fourier transformations, the method used by the mel frequency cepstrum in coding the signal is simple and accurate. The acoustic characteristics of the speaker's speech can easily be detected from visual inspection of the MFCC code. The Neural Network classification method used is also reliable and uncomplicated to implement.

From the results, it is obvious that single-syllable words are more reliable in terms of training. This is probably because humans' pronunciations of single-syllable words are more consistent.
Despite the positive results collected, there are still a few "false" acceptances and "false" rejections being detected. This may be considered a serious issue when the system is applied to a high-security room. The main reason behind these errors is the inconsistency of human speech. Although this system is a formidable combination, the single-layer perceptron technique is unable to reduce the inconsistency of the speech signals. Therefore, more robust and powerful methods have to be employed to reduce the inconsistency of the speech signals. This will be further explained in the following chapter.
Finally, to conclude, mel frequency cepstrum processing has the ability to discriminate signals that remain indistinguishable in the frequency domain. Furthermore, due to their economy, robustness and flexibility, these two combined techniques can easily be implemented on cost-effective machines which require speech verification or identification.
Other Methods in Neural Networks
Many methods of classifying speech signals can be found in the Neural Network literature. One method is based on the Auto-Associative Neural Network model, in which the distribution-capturing ability of the network is exploited to model the speaker's speech signal. Another high-performance Neural Network based approach uses a State Transition Matrix; this method has the ability to handle inconsistent speech signals. An unsupervised learning method such as the Kohonen Self-Organising Map can also be employed.
Speaker Identification
The current system is focused on speaker verification, which tests an unknown speaker against a known speaker. The method presented in this thesis is still not reliable enough to be used in speaker identification applications. In speaker identification, the aim is to determine whether the utterance of an unknown speaker belongs to any of the speakers in a known group. Of these two applications, speaker identification is generally more difficult to achieve, due to the larger speaker populations, which produce more errors. Future work should concentrate on speaker identification, as it will increase the commercial value of the system.