Transcript of speaker-D (uploaded by fikrul-hakim, 8/13/2019)
A Project Report
On
Speaker Recognition System
Implemented in MATLAB
Abstract

Speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. This technique makes it possible to use the speaker's voice to verify their identity and control access to services such as voice dialling, banking by telephone, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers.

Speaker recognition can be classified into identification and verification. Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker.

Speaker recognition methods can also be divided into text-independent and text-dependent methods. In a text-independent system, speaker models capture characteristics of somebody's speech which show up irrespective of what one is saying. In a text-dependent system, on the other hand, the recognition of the speaker's identity is based on his or her speaking one or more specific phrases, like passwords, card numbers, PIN codes, etc. All these technologies of speaker recognition (identification and verification, text-independent and text-dependent) have their own advantages and disadvantages and may require different treatments and techniques. The choice of which technology to use is application-specific. The system that we will develop is classified as a text-independent speaker identification system, since its task is to identify the person who speaks regardless of what is being said.
Overview

Speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. This technique makes it possible to use the speaker's voice to verify their identity and control access to services such as voice dialing, banking by telephone, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers.
Principles of Speaker Recognition
Speaker recognition can "e classified into identification and #erification$ Speaker
identification is the process of determining !hich registered speaker pro#ides a gi#en
utterance$ Speaker verification' on the other hand' is the process of accepting or rejecting
the identity claim of a speaker$ ,igure - sho!s the "asic structures of speaker
identification and #erification systems$
Speaker recognition methods can also "e di#ided into text-independent and text-
dependent methods$ In a te(t)independent system' speaker models capture characteristics
of some"ody*s speech !hich sho! up irrespective of what one is saying$ In a te(t)
dependent system' on the other hand' the recognition of the speaker*s identity is "ased on
-
8/13/2019 speaker-D
4/48
his or herspeaking one or more specific phrases' like pass!ords' card num"ers' PI+
codes' etc$
All technologies of speaker recognition' identification and #erification' te(t)
independent and te(t)dependent' each has its o!n ad#antages and disad#antages and may
re%uires different treatments and techni%ues$ The choice of !hich technology to use is
application)specific$ The system that !e !ill de#elop is classified as text-independent
speaker identification system since its task is to identify the person !ho speaks regardless
of !hat is saying$
At the highest le#el' all speaker recognition systems contain t!o main modules .refer
to ,igure -/0feature extraction andfeature matching$ ,eature e(traction is the process
that e(tracts a small amount of data from the #oice signal that can later "e used to
represent each speaker$ ,eature matching in#ol#es the actual procedure to identify the
unkno!n speaker "y comparing e(tracted features from his1her #oice input !ith the ones
from a set of kno!n speakers$ 2e !ill discuss each module in detail in later sections$
Figure 1. Basic structures of speaker recognition systems

All speaker recognition systems have to serve two distinct phases. The first is referred to as the enrollment session or training phase, while the second is referred to as the operation session or testing phase. In the training phase, each registered speaker has to provide samples of their speech so that the system can build or train a reference model for that speaker. In the case of speaker verification systems, a speaker-specific threshold is additionally computed from the training samples. During the testing (operational) phase (see Figure 1), the input speech is matched with the stored reference model(s) and a recognition decision is made.
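The two phases can be sketched in a few lines. This is a minimal Python illustration (the report's system is implemented in MATLAB), with toy three-dimensional "feature vectors" and a simple averaged-template model standing in for real feature extraction and matching; the names `enroll` and `identify` are hypothetical, not from the report:

```python
import numpy as np

def enroll(samples):
    """Training phase: build a reference model for one speaker by
    averaging feature vectors from that speaker's enrollment utterances."""
    return np.mean(np.vstack(samples), axis=0)

def identify(models, features):
    """Testing phase: match the input features against every stored
    reference model and return the closest registered speaker."""
    test = np.mean(features, axis=0)
    dists = {name: np.linalg.norm(test - ref) for name, ref in models.items()}
    return min(dists, key=dists.get)

# Toy 3-dimensional "feature vectors" for two registered speakers.
models = {
    "alice": enroll([np.array([[1.0, 0.0, 0.0]])]),
    "bob":   enroll([np.array([[0.0, 1.0, 0.0]])]),
}
print(identify(models, np.array([[0.9, 0.1, 0.0]])))  # closest to "alice"
```

A verification system would instead compare the distance for a single claimed speaker against the speaker-specific threshold mentioned above.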
Speaker recognition is a difficult task and it is still an active research area. Automatic speaker recognition works on the premise that a person's speech exhibits characteristics that are unique to the speaker. However, this task is challenged by the high variability of input speech signals. The principal source of variance is the speaker himself: speech signals in training and testing sessions can differ greatly because, among other things, people's voices change with time, health conditions (e.g. the speaker has a cold), speaking rates, etc. There are also other factors, beyond speaker variability, that present a challenge to speaker recognition technology. Examples of these are acoustical noise and variations in recording environments (e.g. the speaker uses different telephone handsets).
General Idea of Speech Recognition

Human speech presents a formidable pattern classification task for a speech recognition system. Numerous speech recognition techniques have been formulated, yet the very best techniques used today have recognition capabilities well below those of a child. This is due to the fact that human speech is highly dynamic and complex. Several disciplines are generally involved in the study of human speech, and a basic understanding of them is needed in order to create an effective system. The following provides a brief description of the disciplines that have been applied to speech recognition problems.
Signal Processing
This process extracts the important information from the speech signal in a well-organised manner. In signal processing, spectral analysis is used to characterize the time-varying properties of the speech signal. Several other types of processing are also needed prior to the spectral analysis stage to make the speech signal more accurate and robust.

Acoustics
The science of understanding the relationship between the physical speech signal and the human vocal tract mechanisms that produce the speech and with which the speech is distinguished.

Pattern Recognition
A set of coding algorithms used to compute data to create prototypical patterns of a data ensemble. It is used to compare a pair of patterns based on the features extracted from the speech signal.

Communication and Information Theory
The procedures for estimating parameters of statistical models and the methods for recognizing the presence of speech patterns.

Linguistics
This refers to the relationships between sounds, words in a sentence, meaning and logic of spoken words.

Physiology
This refers to the comprehension of the higher-order mechanisms within the human central nervous system that are responsible for the production and perception of speech in human beings.

Computer Science
The study of effective algorithms for application in software and hardware; for example, the various methods used in a speech recognition system.

Psychology
The science of understanding the aspects that enable the technology to be used by human beings.
Speech Production

Speech is the acoustic product of voluntary and well-controlled movement of the human vocal mechanism. During the generation of speech, air is inhaled into the human lungs by expanding the rib cage and drawing it in via the nasal cavity, velum and trachea. It is then expelled back into the air by contracting the rib cage and increasing the lung pressure. During the expulsion of air, the air travels from the lungs and passes through the vocal cords, which are two symmetric pieces of ligaments and muscles located in the larynx on the trachea. Speech is produced by the vibration of the vocal cords. Before the expulsion of air, the larynx is initially closed. When the pressure produced by the expelled air is sufficient, the vocal cords are pushed apart, allowing air to pass through. The vocal cords close upon the decrease in air flow. This relaxation cycle is repeated, with generation frequencies in the range of 80 Hz to 300 Hz. The generated frequency depends on the speaker's age, sex, stress and emotions. This succession of glottis openings and closures generates quasi-periodic pulses of air after the vocal cords. The figure below shows a schematic view of the human speech apparatus.

The speech signal is a time-varying signal whose characteristics represent the different speech sounds produced. There are three ways of labelling events in speech. First is the silence state, in which no speech is produced. Second is the unvoiced state, in which the vocal cords are not vibrating, so the output speech waveform is aperiodic and random in nature. The last is the voiced state, in which the vocal cords vibrate periodically when air is expelled from the lungs, resulting in output speech that is quasi-periodic. Figure 2 below shows a speech waveform with unvoiced and voiced states.

Speech is produced as a sequence of sounds. The type of sound produced depends on the shape of the vocal tract, which extends from the opening of the vocal cords to the end of the lips. Its cross-sectional area depends on the position of the tongue, lips, jaw and velum; therefore the tongue, lips, jaw and velum play an important part in the production of speech.
Block Diagram (Engineering Model) of the Human Speech Production System

Factors associated with speech:

Formants:
It has been known from research that the vocal tract and nasal tract are tubes with non-uniform cross-sectional area. As generated sound propagates through these tubes, the frequency spectrum is shaped by the frequency selectivity of the tube. This effect is very similar to the resonance effects observed in organ pipes and wind instruments. In the context of speech production, the resonance frequencies of the vocal tract are called formant frequencies or simply formants. In our engineered model, the poles of the transfer function are called formants. The human auditory system is much more sensitive to poles than to zeros.

Phonemes:
Phonemes can be defined as the symbols from which every sound can be classified or produced. For speech, a crude estimate of the information rate, considering the physical limitations on articulatory motion, is about 10 phonemes per second.
Types of Phonemes:

Speech sounds can be classified into 3 distinct classes according to the mode of excitation:

1. Plosive Sounds
2. Voiced Sounds
3. Unvoiced Sounds

1. Plosive Sounds:
Plosive sounds result from making a complete closure (again toward the front end of the vocal tract), building up pressure behind the closure, and abruptly releasing it.

2. Voiced Sounds:
Voiced sounds are produced by forcing air through the glottis with the tension of the vocal cords adjusted so that they vibrate in a relaxation oscillation, thereby producing quasi-periodic pulses of air which excite the vocal tract.

Voiced sounds are characterized by:
1. High energy levels
2. Very distinct resonant and formant frequencies

The rate at which the vocal cords vibrate determines the pitch. These vibrations are periodic in time, so voiced sounds are approximated by an impulse train; the spacing between impulses is the pitch period, corresponding to the pitch frequency F0.
3. Unvoiced Sounds:
Unvoiced sounds, also known as fricatives, are generated by forming a constriction at some point in the vocal tract (usually toward the mouth end) and forcing air through the constriction at high enough velocity to produce turbulence. This creates a broad-spectrum noise source to excite the vocal tract.

Unvoiced sounds are characterized by:
1. Lower energy levels than voiced sounds
2. Higher frequencies than voiced sounds

In other words, we can say that unvoiced sounds (e.g. /sh/, /s/, /p/) are generated without vocal cord vibration. The excitation is modeled by a White Gaussian Noise source. Unvoiced sounds have no pitch, since they are excited by a non-periodic signal.
Spectra of Typical Voiced and Unvoiced Speech

By passing the speech through a predictor filter A(z), the spectrum is flattened (whitened) considerably, but it still contains some fine details.
Special Types of Voiced and Unvoiced Sounds:

There are, however, some special types of voiced and unvoiced sounds which are briefly discussed here. The purpose of their discussion is only to give the reader an idea of the further types of voiced and unvoiced speech.

Vowels:
Vowels are produced by exciting a fixed vocal tract with quasi-periodic pulses of air caused by vibration of the vocal cords. The way in which the cross-sectional area varies along the vocal tract determines the resonant frequencies of the tract (formants) and thus the sound that is produced. The dependence of cross-sectional area upon distance along the tract is called the area function of the vocal tract. The area function of a particular vowel is determined primarily by the position of the tongue, but the positions of the jaw and lips also affect the resulting sound to a small extent.

Examples: a, e, i, o, u

Diphthongs:
Although there is some ambiguity and disagreement as to what is and what is not a diphthong, a reasonable definition is that a diphthong is a gliding monosyllabic speech item that starts at or near the articulatory position for one vowel and moves to or toward the position for another. According to this definition, there are 6 diphthongs in American English.

Diphthongs are produced by varying the vocal tract smoothly between vowel configurations appropriate to the diphthong. Based on these data, the diphthongs can be characterized by a time-varying vocal tract area function which varies between two vowel configurations.

Examples: /eI/ (as in bay), /oU/ (as in boat), /aI/ (as in buy), /aU/ (as in how)
Semivowels:
The group of sounds consisting of /w/, /l/, /r/ and /y/ is quite difficult to characterize. These sounds are called semivowels because of their vowel-like nature. They are generally characterized by a gliding transition in the vocal tract area function between adjacent phonemes; thus the acoustic characteristics of these sounds are strongly influenced by the context in which they occur. For our purposes they are just considered transitional vowel-like sounds and hence are similar in nature to vowels and diphthongs.

Voiced Fricatives:
The voiced fricatives /v/, /th/, /z/ and /zh/ are the counterparts of the unvoiced fricatives /f/, /th/, /s/ and /sh/ respectively, in that the place of constriction for each of the corresponding phonemes is essentially identical. However, the voiced fricatives differ from their unvoiced counterparts in that two excitation sources are involved in their production. The spectra of voiced fricatives can therefore be expected to display two distinct components.

Voiced Stops:
The voiced stops /b/, /d/ and /g/ are transient, non-continuant sounds which are produced by building up pressure behind a total constriction somewhere in the oral tract and suddenly releasing the pressure. For /b/ the constriction is at the lips; for /d/ the constriction is at the back of the teeth; and for /g/ it is near the velum. During the period in which there is a total constriction in the tract, no sound is radiated from the lips. Since the stop sounds are dynamic in nature, their properties are highly influenced by the vowel which follows the stop consonant.

Unvoiced Stops:
The unvoiced stop consonants /p/, /t/ and /k/ are similar to their voiced counterparts /b/, /d/ and /g/ with one major exception: during the period of total closure of the tract, as the pressure builds up, the vocal cords do not vibrate. Thus, following the period of closure, as the air pressure is released there is a brief interval of friction (due to the sudden turbulence of the escaping air), followed by a period of aspiration (steady flow of air from the glottis exciting the resonances of the vocal tract) before voiced excitation begins.
Hearing and Perception

Audible sounds are transmitted to the human ears through the vibration of particles in the air. The human ear consists of three parts: the outer ear, the middle ear and the inner ear. The function of the outer ear is to direct speech pressure variations toward the eardrum, where the middle ear converts the pressure variations into mechanical motion. The mechanical motion is then transmitted to the inner ear, which transforms these motions into electrical potentials that pass through the auditory nerve and cortex to the brain. The figure below shows a schematic diagram of the human ear.

Schematic Diagram of the Human Ear
The Engineered Model:

The speech mechanism can be modeled as a time-varying filter (the vocal tract) excited by an oscillator (the vocal folds), with different outputs. When voiced sound is produced, the filter is excited by an impulse chain in a range of frequencies (60-400 Hz). When unvoiced sound is produced, the filter is excited by random white noise, without any observed periodicity. These attributes can be observed when the speech signal is examined in the time domain.
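A rough sketch of this source-filter model, assuming Python with NumPy/SciPy rather than the report's MATLAB, and an arbitrary single-resonance all-pole filter standing in for the vocal tract (the 125 Hz pitch and 500 Hz resonance are made-up illustrative values):

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000
n = int(0.05 * fs)   # 50 ms of excitation

# Voiced excitation: impulse train at a 125 Hz pitch.
voiced_src = np.zeros(n)
voiced_src[::fs // 125] = 1.0

# Unvoiced excitation: white Gaussian noise.
rng = np.random.default_rng(0)
unvoiced_src = rng.standard_normal(n)

# Crude all-pole "vocal tract": one stable resonance near 500 Hz.
r, theta = 0.97, 2 * np.pi * 500 / fs
a = [1.0, -2 * r * np.cos(theta), r * r]   # denominator coefficients

voiced = lfilter([1.0], a, voiced_src)     # quasi-periodic output
unvoiced = lfilter([1.0], a, unvoiced_src) # noise-like output
```

The same filter shapes both sources; only the excitation decides whether the output sounds voiced or unvoiced, which is exactly the switch described above.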
(a): The Human Speech Production Figure
(b): Speech Production by a machine
Why Encode Speech?

Speech coding has been, and still is, a major issue in the area of digital speech processing. Speech coding is the act of transforming the speech signal at hand into a more compact form, which can then be transmitted with considerably smaller memory. The motivation behind this is the fact that access to an unlimited amount of bandwidth is not possible. Therefore, there is a need to code and compress speech signals. Speech compression is required in long-distance communication, high-quality speech storage, and message encryption. For example, in digital cellular technology many users need to share the same frequency bandwidth; utilizing speech compression makes it possible for more users to share the available system. Another example where speech compression is needed is digital voice storage: for a fixed amount of available memory, compression makes it possible to store longer messages.

Speech coding is a lossy type of coding, which means that the output signal does not sound exactly like the input; the input and output signals can be distinguished as different. Coding of audio, however, is a different kind of problem from speech coding. Audio coding tries to code the audio in a perceptually lossless way: even though the input and output signals are not mathematically equivalent, the sound at the output is perceived as the same as the input. This type of coding is used in applications for audio storage, broadcasting, and Internet streaming.

Several techniques of speech coding exist, such as Linear Predictive Coding (LPC), Waveform Coding and Sub-band Coding. The problem at hand is to use LPC to code given speech sentences. The speech signals that need to be coded are wideband signals with frequencies ranging from 0 to 8 kHz. The sampling frequency should be 8 kHz. Different types of applications have different time-delay constraints: for example, in network telephony only a delay of 1 ms is acceptable, whereas a delay of 500 ms is permissible in video telephony. Another constraint at hand is not to exceed an overall bit rate of 8 kbps.
The speech coder that will be developed is going to be analyzed using both subjective and objective analysis. Subjective analysis will consist of listening to the encoded speech signal and making judgments on its quality, based solely on the opinion of the listener; the speech might be rated as impossible to understand, intelligible, or natural sounding. Even though this is a valid measure of quality, an objective analysis will be introduced to technically assess the speech quality and to minimize human bias. Furthermore, an analysis of the effects of bit rate, complexity and end-to-end delay on the output speech quality will be made. The report will be concluded with a summary of results and some ideas for future work.
Speech Processing

The speech waveform needs to be converted into digital format before it is suitable for processing in the speech recognition system; the raw speech waveform is analog before conversion. The conversion of an analog signal to a digital signal involves three phases: sampling, quantisation and coding. In the sampling phase, the analog signal is transformed from a waveform that is continuous in time into a discrete signal, that is, a sequence of samples that are discrete in time. In the quantisation phase, each approximate sampled value of the variable is converted into one of the finite values contained in a code set. These two stages allow the speech waveform to be represented by a sequence of values, each belonging to a finite set. After passing through the sampling and quantisation stages, the signal is then coded in the coding phase, usually as binary code. These three phases need to be carried out with caution, as any miscalculation, improper sampling or quantization noise will result in loss of information. Below are the problems faced by the three phases.
Sampling

According to the Nyquist Theorem, the minimum sampling rate required is two times the bandwidth of the signal. This minimum sampling frequency is needed for the reconstruction of a band-limited waveform without error. Aliasing distortion will occur if the minimum sampling rate is not met. The figure below shows the comparison between a properly sampled case and an improperly sampled case.

Aliasing Distortion by Improper Sampling
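The aliasing effect can be demonstrated numerically: at an 8 kHz sampling rate, a 5 kHz tone (above the 4 kHz Nyquist limit) produces exactly the same samples, up to sign, as a 3 kHz tone. A small Python check (illustrative, not from the report):

```python
import numpy as np

fs = 8000                                    # sampling rate, Hz
n = np.arange(64)                            # sample indices
x_high = np.sin(2 * np.pi * 5000 * n / fs)   # 5 kHz tone, above fs/2
x_alias = np.sin(2 * np.pi * 3000 * n / fs)  # its alias at 8000 - 5000 = 3 kHz

# sin(2*pi*0.625*n) == -sin(2*pi*0.375*n) for integer n, so the two
# tones are indistinguishable after sampling.
print(np.allclose(x_high, -x_alias))  # True
```

This is exactly why an anti-aliasing low-pass filter must remove content above half the sampling rate before conversion.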
Quantization

Speech signals are more likely to have amplitude values near zero than at the extreme peak values allowed. For example, in digitizing voice, if the peak value allowed is 1 V, weak passages may have voltage levels on the order of 0.1 V. Speech signals with a non-uniform amplitude distribution are likely to suffer quantizing noise if the step size is not reduced for amplitude values near zero and increased for extremely large values. This quantizing noise is known as granular and slope-overload noise. Granular noise occurs when the step size is large for amplitude values near zero. Slope-overload noise occurs when the step size is small and cannot keep up with extremely large amplitude values. To solve the above quantizing noise problem, Delta Modulation (DM) is used. Delta Modulation works by reducing the step size for amplitude values near zero and increasing the step size for extremely large amplitude values. The figure below shows a diagram of the two types of noise.

Analog Input and Accumulator Output Waveform
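A linear (fixed-step) delta modulator makes both noise types easy to see: too small a step causes slope overload, too large a step causes granular noise. The following Python sketch is illustrative only; the adaptive scheme described above varies the step size, which this fixed-step fragment deliberately does not:

```python
import numpy as np

def delta_modulate(x, step):
    """1-bit delta modulation: the accumulator chases the input in fixed
    steps; the decoder rebuilds the same staircase from the bit stream."""
    bits, acc, approx = [], 0.0, []
    for sample in x:
        bit = 1 if sample >= acc else 0
        acc += step if bit else -step
        bits.append(bit)
        approx.append(acc)
    return np.array(bits), np.array(approx)

fs = 8000
t = np.arange(200) / fs
x = 0.5 * np.sin(2 * np.pi * 100 * t)        # 100 Hz test tone

# The signal's maximum slope is ~0.039 per sample, so a 0.001 step
# suffers slope overload while a 0.05 step merely adds granular noise.
_, overloaded = delta_modulate(x, step=0.001)
_, granular = delta_modulate(x, step=0.05)
print(np.mean((x - overloaded) ** 2) > np.mean((x - granular) ** 2))  # True
```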
Approaches to Speech Recognition

Human beings are the best "machine" to recognize and understand speech. We are able to combine a wide variety of linguistic knowledge concerning syntax and semantics and adaptively use this knowledge according to the difficulties and characteristics of the sentences. Speech recognition systems are built with this aim in mind: to match or exceed human performance. There are generally three approaches to speech recognition, namely the acoustic-phonetic, pattern recognition and artificial intelligence approaches. These three approaches will be explained in greater detail in the following sections.
Speec "oding
A digital speech coder can "e classified into t!o main categories' mainly !a#eform
coders and #ocoders$ 2a#eform coders employ algorithms to encode and decode speech
signals so that the system output is an appro(imation to the input !a#eform$ ?ocoders
encode speech signals "y e(tracting a set of parameters that are digitied and transmitted
to the recei#er$ This set of digitied parameters is used to set #alues for parameters in
function generators and filters' !hich in turn synthesie the output speech signals$ The
-
8/13/2019 speaker-D
23/48
#ocoder output !a#eform does not appro(imate the input !a#eform signals and may
produce an unnatural sound$
Speech Feature Extraction

Introduction

The purpose of this module is to convert the speech waveform to some type of parametric representation (at a considerably lower information rate) for further analysis and processing. This is often referred to as the signal-processing front end.

The speech signal is a slowly time-varying signal (it is called quasi-stationary). An example of a speech signal is shown in Figure 2. When examined over a sufficiently short period of time (between 5 and 100 msec), its characteristics are fairly stationary. However, over longer periods of time (on the order of 1/5 second or more) the signal characteristics change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most common way to characterize the speech signal.
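Short-time analysis starts by cutting the signal into overlapping frames, over each of which the speech is assumed quasi-stationary. A possible Python sketch (the 30 ms frame and 10 ms hop are typical values, not taken from the report, and `frame_signal` is a hypothetical helper):

```python
import numpy as np

def frame_signal(x, fs, frame_ms=30, hop_ms=10):
    """Split a signal into overlapping short-time frames; each frame is
    short enough for the quasi-stationary assumption to hold."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

x = np.zeros(8000)                 # one second of audio at 8 kHz
frames = frame_signal(x, fs=8000)  # 30 ms frames, 10 ms hop
print(frames.shape)                # (98, 240)
```

In practice each frame is then windowed (e.g. with a Hamming window) before the spectral analysis described next.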
-
8/13/2019 speaker-D
24/48
Figure 2. An example of speech signal
A wide range of possibilities exists for parametrically representing the speech signal for the speaker recognition task, such as Linear Prediction Coding (LPC), Mel-Frequency Cepstrum Coefficients (MFCC), and others. MFCC is perhaps the best known and most popular, and it will be used in this project.

MFCCs are based on the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies have been used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which is a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz. The process of computing MFCCs is described in more detail next.
Pattern Recognition

This direct approach involves manipulating the speech signals directly, without explicit feature extraction of the speech signals. There are two stages in this approach: the training of speech patterns, and the recognition of patterns via pattern comparison. Several identical speech signals are collected and sent to the system via the training procedure. With adequate training, the system is able to characterize the acoustic properties of the pattern. This type of classification is known as pattern classification. The recognition stage does a direct comparison between the unknown speech signal and the speech signal patterns learned in the training phase. It generates an "accept" or "reject" decision based on the similarity of the two patterns.

The merits of this approach are:
1. It is simple to use and the method is fairly easy to understand.
2. It is robust to different speech vocabularies, users, feature sets, pattern comparison algorithms and decision rules.
3. It has been proven that this method generates the most accurate results.
Acoustic-Phonetic Approach

The acoustic-phonetic approach has been studied in depth for more than 40 years. It is based on the theory of acoustic phonetics, which suggests that there exist finite, distinctive phonetic units of spoken language, and that the phonetic units are broadly characterized by a set of properties that are manifest in the speech signal, or its spectrum, over time. The first step in this approach is to segment the speech signal into discrete time regions where the acoustic properties of the speech signal are represented by one phonetic unit. The next step is to attach one or more phonetic labels to each segmented region according to the acoustic properties. Finally, the last step attempts to determine a valid word from the phonetic labels generated in the previous steps, consistent with the constraints of the speech recognition task.
Artificial Intelligence

This approach is a combination of the acoustic-phonetic approach and the pattern recognition approach, using the concepts and ideas of both. The artificial intelligence approach attempts to mechanize the speech recognition process according to the way a person applies intelligence in visualizing and analyzing speech. In particular, among the techniques used within this class of methods is the use of an expert system for segmentation and labeling, so that this crucial and most complicated step can be performed with more than just the acoustic information used by pure acoustic-phonetic methods. Neural networks are often used in this approach to learn the relationships between phonetic events and all the known inputs. They can also be used to differentiate similar sound classes.
Dynamic Time Warping

Dynamic Time Warping (DTW) is one of the pioneering approaches to speech recognition. It operates by first storing a prototypical version of each word in the vocabulary in the database, then comparing incoming speech signals with each word and taking the closest match. But this poses a problem, because it is unlikely that the incoming signals will fall into the constant window spacing defined by the host. For example, suppose the password to a verification system is "Queensland". When a user utters "Queeeeensland", a rigid frame-by-frame comparison fails due to the longer constant window spacing of the speech "Queeeeensland"; dynamic time warping resolves this by non-linearly stretching and compressing the time axis so that corresponding sounds line up.
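The warping idea can be implemented with a simple dynamic-programming recurrence. The following Python sketch compares two scalar sequences with absolute difference as the local cost; a real recognizer would compare frame-level feature vectors instead:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping: minimum cumulative cost of aligning two
    sequences, allowing non-linear stretching along the time axis."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: match, insertion, deletion.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

fast = [0, 1, 2, 1, 0]
slow = [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]   # same "word", spoken twice as slowly
print(dtw_distance(fast, slow))          # 0.0: warping absorbs the stretch
```

This is why the drawn-out "Queeeeensland" can still match the stored "Queensland" template, whereas a rigid sample-by-sample comparison would fail.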
Speech recognition is one of the daunting challenges facing researchers throughout the world. The complete solution is far from over, and enormous efforts have been spent by companies to reach the ultimate goal. One of the techniques that has gained acceptance from researchers is the state-of-the-art Hidden Markov Model (HMM) technique. This model can also be combined with other techniques, such as Neural Networks, to form a formidable technique.

The Hidden Markov Model approach is widely used in sequence processing and speech recognition. The key feature of the Hidden Markov Model lies in its ability to model the temporal statistics of data by introducing a discrete hidden variable that undergoes a transition from one time step to the next according to a stochastic transition matrix. The distribution of the emission symbols is embodied in the assumed emission probability density.

A Hidden Markov Model may be viewed as a finite state machine where the transitions between the states depend upon the occurrence of some symbol. Each state transition is associated with an output probability distribution, which determines the probability that a symbol will occur during the transition, and a transition probability indicating the likelihood of the transition itself. Several analytical techniques have been developed for estimating these probabilities, and they have enabled HMMs to become more computationally efficient, robust and flexible.

In speech recognition, the HMM optimises the probability of the training set to detect a particular speech. The probability computation is performed by the Viterbi algorithm, a procedure used to determine an optimal state sequence from a given observation sequence.
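Although the thesis implements its system in MATLAB, the Viterbi recursion just described can be sketched in a few lines of Python; the toy two-state, two-symbol model below is an illustrative assumption, not parameters from the thesis.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state sequence for a discrete-emission HMM.

    pi : (S,) initial state probabilities
    A  : (S, S) transition matrix, A[i, j] = P(next state j | state i)
    B  : (S, K) emission matrix, B[i, k] = P(symbol k | state i)
    obs: sequence of observed symbol indices
    """
    S, T = len(pi), len(obs)
    # delta[t, i]: log-probability of the best path ending in state i at time t
    delta = np.full((T, S), -np.inf)
    psi = np.zeros((T, S), dtype=int)  # back-pointers
    with np.errstate(divide="ignore"):
        logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta[0] = logpi + logB[:, obs[0]]
    for t in range(1, T):
        for j in range(S):
            scores = delta[t - 1] + logA[:, j]
            psi[t, j] = int(np.argmax(scores))
            delta[t, j] = scores[psi[t, j]] + logB[j, obs[t]]
    # Trace the optimal state sequence backwards through the back-pointers
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Toy 2-state, 2-symbol model (illustrative values only)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(pi, A, B, [0, 0, 1, 1]))  # [0, 0, 1, 1]
```

With these values the optimal state sequence simply tracks the dominant emitter of each observed symbol.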
Neural Networks

As you read these words, your brain is using its complex network of some 10^11 neurons to facilitate your reading. Each of these neurons has the blazing processing speed of a microprocessor, which allows us to read, think and write simultaneously. Scientists have found that all biological neural functions, including memory, are stored in the neurons and in the connections between them. As we learn new things every day, new connections are made or modified. Some of these neural structures are defined at birth, while others are created every day and others waste away. In this thesis, the Neural Network algorithm refers to Artificial Neural Networks, not the actual neurons in our brain. A picture illustrating biological neurons is shown in the Figure below.

Schematic Drawing of Biological Neurons
Neuron Model
The Figure below shows a single-input neuron. The scalar input p is multiplied by the scalar weight w to form wp, which is then sent to the summer. In the summer, the product wp is added to the bias b. The summer output n then goes into the transfer function f, which generates the scalar neuron output a. The value of the neuron output "a" depends on the type of transfer function used. This whole idea of the artificial neuron is similar to the biological neurons shown in the Figure above: the weight w corresponds to the strength of the synapse, the cell body is equivalent to the summation and the transfer function, and finally the neuron output "a" corresponds to the signal travelling in the axon.

Summer output: n = wp + b
Neuron output: a = f(wp + b)

Single Input Neuron
Hard Limit Transfer Function
The hard limit transfer function, shown on the left side of the Figure below, sets the neuron output a to 0 if the summer output n is less than 0. If the summer output n is greater than or equal to 0, it sets the neuron output a to 1. This transfer function is useful for classifying inputs into two categories, and in this thesis it is used to determine true or false detection of the speech signal. The figure on the right shows the effect of the weight and the bias combined together.

Hard Limit Transfer Function
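The neuron just described can be sketched in a few lines (Python rather than the thesis's MATLAB; the weight, bias and input values are illustrative assumptions):

```python
def hardlim(n):
    # Hard limit transfer function: 0 if n < 0, else 1
    return 0 if n < 0 else 1

def neuron(p, w, b):
    # Single-input neuron: scale the input, add the bias,
    # then pass the summer output through the transfer function
    n = w * p + b          # summer output
    return hardlim(n)      # neuron output a = f(wp + b)

# Illustrative weight and bias
print(neuron(0.5, w=2.0, b=-0.4))  # n = 0.6  -> a = 1
print(neuron(0.1, w=2.0, b=-0.4))  # n = -0.2 -> a = 0
```

The same two-valued output is what classifies a speech frame as a true or false detection in this thesis.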
Decision Boundary
A single-layer perceptron consists of the input neuron, the weight, the bias, the summer and the transfer function. Figure 1 below shows a diagram of a single-layer perceptron. A single-layer perceptron can be used to classify input vectors into two categories. The weight is always orthogonal to the decision boundary. For example, in Figure 2 below, the weight w is set to [-2 3]. The decision boundary corresponding to the graph in Figure 2 is indicated. We can use any point on the decision boundary to find the bias as follows: wp + b = 0. Once the bias is set, any point in the plane can be classified as lying inside the shaded region (wp + b > 0) or outside the shaded region (wp + b < 0).
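Using the weight w = [-2 3] from the text, the bias can be found from any point on the boundary and used to classify the plane; the boundary point chosen below is an illustrative assumption, since the figure itself is not reproduced here.

```python
import numpy as np

w = np.array([-2.0, 3.0])          # weight from the text; orthogonal to the boundary
p_boundary = np.array([0.0, 2.0])  # assumed point lying on the decision boundary
b = -float(w @ p_boundary)         # solve w.p + b = 0  ->  b = -6

def classify(p):
    # Points with w.p + b >= 0 lie inside the shaded region (output 1),
    # all others outside (output 0) -- the hard limit transfer function.
    return 1 if float(w @ np.asarray(p)) + b >= 0 else 0

print(b)                 # -6.0
print(classify([0, 3]))  # w.p + b = 9 - 6 = 3   -> 1
print(classify([3, 0]))  # w.p + b = -6 - 6 = -12 -> 0
```

Moving the assumed boundary point changes only the bias; the orientation of the boundary is fixed by w.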
Figure 1: Multiple Input Neuron
Figure 2: Perceptron Decision Boundary
Mel-Frequency Cepstrum Coefficients Processor
A block diagram of the structure of an MFCC processor is given in Figure 3. The speech input is typically recorded at a sampling rate above 10000 Hz. This sampling frequency was chosen to minimise the effects of aliasing in the analog-to-digital conversion. These sampled signals can capture all frequencies up to 5 kHz, which covers most of the energy of sounds generated by humans. As discussed previously, the main purpose of the MFCC processor is to mimic the behaviour of the human ears. In addition, MFCCs are shown to be less susceptible to the variations mentioned than the speech waveforms themselves.
Frame Blocking
In this step the continuous speech signal is blocked into frames of N samples, with adjacent frames separated by M (M < N). The first frame consists of the first N samples. The second frame begins M samples after the first frame, and overlaps it by N - M samples. Similarly, the third frame begins 2M samples after the first frame (or M samples after the second frame) and overlaps it by N - 2M samples. This process continues until all the speech is accounted for within one or more frames. Typical values for N and M are N = 256 (which is equivalent to roughly 30 msec of windowing and facilitates the fast radix-2 FFT) and M = 100.
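The blocking scheme can be sketched as follows (a Python illustration with the typical N = 256, M = 100; for simplicity this version drops any trailing samples that do not fill a complete frame):

```python
import numpy as np

def frame_blocking(signal, N=256, M=100):
    """Block a 1-D signal into overlapping frames of N samples,
    with consecutive frames starting M samples apart (overlap N - M)."""
    signal = np.asarray(signal)
    num_frames = 1 + max(0, (len(signal) - N) // M)
    return np.stack([signal[i * M : i * M + N] for i in range(num_frames)])

frames = frame_blocking(np.arange(1000))
print(frames.shape)  # (8, 256)
print(frames[1][0])  # second frame starts M = 100 samples in
```

Note that sample 100 of the first frame equals sample 0 of the second frame, which is exactly the N - M sample overlap described above.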
Windowing
The next step in the processing is to window each individual frame so as to minimise the signal discontinuities at the beginning and end of each frame. The concept here is to minimise the spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame. If we define the window as w(n), 0 <= n <= N - 1, where N is the number of samples in each frame, then the result of windowing is the signal

y(n) = x(n) w(n), 0 <= n <= N - 1

Typically the Hamming window is used, which has the form:

w(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1)), 0 <= n <= N - 1
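A sketch of applying the Hamming window to one frame (Python illustration; the frame content is random stand-in data, not speech from the thesis):

```python
import numpy as np

N = 256
n = np.arange(N)
# Hamming window: w(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1)), 0 <= n <= N-1
w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

frame = np.random.randn(N)  # stand-in for one speech frame
windowed = frame * w        # y(n) = x(n) * w(n)

# The window tapers both ends of the frame toward 0.08 and is near 1
# in the middle, which suppresses the edge discontinuities.
print(round(float(w[0]), 2), round(float(w[-1]), 2))
```

The tapered ends are what reduce the spectral leakage that abrupt frame boundaries would otherwise introduce.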
Fast Fourier Transform (FFT)
The next processing step is the Fast Fourier Transform, which converts each frame of N samples from the time domain into the frequency domain. The FFT is a fast algorithm implementing the Discrete Fourier Transform (DFT), which is defined on the set of N samples x_k as follows:

X_n = sum_{k=0}^{N-1} x_k e^(-2*pi*j*k*n/N), n = 0, 1, 2, ..., N - 1

Note that we use j here to denote the imaginary unit, i.e. j = sqrt(-1). In general the X_n are complex numbers. The resulting sequence X_n is interpreted as follows: the zero frequency corresponds to n = 0, positive frequencies 0 < f < Fs/2 correspond to values 1 <= n <= N/2 - 1, while negative frequencies -Fs/2 < f < 0 correspond to N/2 + 1 <= n <= N - 1, where Fs denotes the sampling frequency.
The result after this step is often referred to as the spectrum or periodogram.
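A small numeric check of this interpretation (a sketch using NumPy's FFT, not code from the thesis): a cosine with two cycles per frame places its energy at one positive-frequency bin and its mirrored negative-frequency bin.

```python
import numpy as np

N = 8
x = np.cos(2 * np.pi * 2 * np.arange(N) / N)  # two cycles over the frame

# DFT via the FFT: X[n] = sum_k x[k] * exp(-2j*pi*k*n/N)
X = np.fft.fft(x)

# Energy appears at n = 2 (positive frequency) and n = N - 2 = 6
# (the corresponding negative frequency), each with magnitude N/2.
print(np.round(np.abs(X)).astype(int))  # [0 0 4 0 0 0 4 0]
```

The symmetric pair of peaks is the complex-exponential decomposition of a real cosine, which is why real speech frames always yield conjugate-symmetric spectra.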
Mel-Frequency Wrapping
As mentioned above, psychophysical studies have shown that human perception of the frequency content of sounds for speech signals does not follow a linear scale. Thus for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the "mel" scale. The mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels. Therefore we can use the following approximate formula to compute the mels for a given frequency f in Hz:

mel(f) = 2595 * log10(1 + f/700)
One approach to simulating the subjective spectrum is to use a filter bank, spaced uniformly on the mel scale (see Figure 4). That filter bank has a triangular bandpass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval. The modified spectrum of S(w) thus consists of the output power of these filters when S(w) is the input. The number of mel spectrum coefficients, K, is typically chosen as 20.

Note that this filter bank is applied in the frequency domain; therefore it simply amounts to taking those triangle-shaped windows in Figure 4 on the spectrum. A useful way of thinking about this mel-wrapping filter bank is to view each filter as a histogram bin (where bins have overlap) in the frequency domain.
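The mel formula and a set of K = 20 filter edges spaced uniformly on the mel scale can be sketched as follows (Python illustration; the 5 kHz upper edge follows the sampling discussion above):

```python
import numpy as np

def hz_to_mel(f):
    # Approximate subjective pitch in mels for a frequency f in Hz
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse of the mel formula
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

K = 20          # number of mel spectrum coefficients, as in the text
f_max = 5000.0  # upper frequency edge (half the >10 kHz sampling rate)

# K + 2 filter edges spaced uniformly on the mel scale: roughly linear
# below 1 kHz and logarithmic above, so the triangular filters widen
# with increasing frequency.
edges = mel_to_hz(np.linspace(0.0, hz_to_mel(f_max), K + 2))

print(round(float(hz_to_mel(1000.0))))  # a 1 kHz tone sits near 1000 mels
```

The widening edge spacing at high frequencies is exactly the "histogram bins with overlap" picture described above.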
Figure 4: An example of a mel-spaced filter bank
Cepstrum
In this final step, we convert the log mel spectrum back to time. The result is called the mel frequency cepstrum coefficients (MFCC). The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given frame analysis. Because the mel spectrum coefficients (and so their logarithm) are real numbers, we can convert them to the time domain using the Discrete Cosine Transform (DCT). Therefore, if we denote the mel power spectrum coefficients that are the result of the last step as S_k, k = 1, 2, ..., K, we can calculate the MFCCs as:

c_n = sum_{k=1}^{K} (log S_k) cos[n (k - 1/2) pi / K], n = 1, 2, ..., K
Note that we exclude the first component, c_0, from the DCT, since it represents the mean value of the input signal, which carries little speaker-specific information.
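The DCT step can be sketched as follows (Python illustration; keeping 12 coefficients is a common choice, not a value fixed by the text):

```python
import numpy as np

def mfcc_from_mel_power(S, num_coeffs=12):
    """DCT of the log mel power spectrum S (K values):
    c_n = sum_{k=1..K} log(S_k) * cos(n * (k - 0.5) * pi / K),
    keeping n = 1..num_coeffs and dropping c_0 (the mean term)."""
    S = np.asarray(S, dtype=float)
    K = len(S)
    k = np.arange(1, K + 1)
    return np.array([np.sum(np.log(S) * np.cos(n * (k - 0.5) * np.pi / K))
                     for n in range(1, num_coeffs + 1)])

# A flat mel spectrum (all ones) has log S = 0 everywhere, so every
# cepstral coefficient vanishes -- a quick sanity check of the formula.
mel_power = np.ones(20)
print(mfcc_from_mel_power(mel_power))
```

Any spectral tilt or formant structure in S shows up as non-zero low-order coefficients, which is what the classifier ultimately compares.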
The LBG algorithm proceeds by the following recursive procedure:

1. Design a 1-vector codebook; this is the centroid of the entire set of training vectors (hence, no iteration is required here).
2. Double the size of the codebook by splitting each current codeword y_n according to the rule y_n+ = y_n(1 + e) and y_n- = y_n(1 - e), where e is a small splitting parameter (e.g. e = 0.01).
3. Nearest-Neighbour Search: for each training vector, find the codeword in the current codebook that is closest, and assign that vector to the corresponding cell.
4. Centroid Update: update the codeword in each cell using the centroid of the training vectors assigned to that cell.
5. Iteration 1: repeat steps 3 and 4 until the average distortion falls below a preset threshold.
6. Iteration 2: repeat steps 2, 3 and 4 until a codebook size of M is designed.
Intuitively, the LBG algorithm designs an M-vector codebook in stages. It starts by designing a 1-vector codebook, then uses a splitting technique on the codewords to initialise the search for a 2-vector codebook, and continues the splitting process until the desired M-vector codebook is obtained.
Figure 6 shows, in a flow diagram, the detailed steps of the LBG algorithm. "Cluster vectors" is the nearest-neighbour search procedure, which assigns each training vector to the cluster associated with the closest codeword. "Find centroids" is the centroid update procedure. "Compute D (distortion)" sums the distances of all training vectors in the nearest-neighbour search so as to determine whether the procedure has converged.
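The split-and-refine loop above can be sketched as follows; this is an illustrative Python implementation of the listed steps, not the thesis's MATLAB code, and the two-cluster training data is synthetic.

```python
import numpy as np

def lbg(training, M, eps=0.01, tol=1e-3):
    """LBG codebook design: start from a 1-vector codebook (the global
    centroid), then repeatedly split each codeword by (1 +/- eps) and
    refine with nearest-neighbour / centroid-update iterations."""
    data = np.asarray(training, dtype=float)
    codebook = data.mean(axis=0, keepdims=True)   # step 1
    while len(codebook) < M:
        # Step 2: split every codeword
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        prev = np.inf
        while True:
            # Step 3 (cluster vectors): assign each training vector
            # to its closest codeword
            d = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            distortion = d.min(axis=1).mean()
            # Step 4 (find centroids): move each codeword to the mean
            # of the vectors assigned to it
            for i in range(len(codebook)):
                if np.any(labels == i):
                    codebook[i] = data[labels == i].mean(axis=0)
            # Step 5: stop refining once the distortion stops improving
            if prev - distortion < tol * max(distortion, 1e-12):
                break
            prev = distortion
    return codebook

# Synthetic training vectors drawn from two well-separated clusters
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(3, 0.1, (50, 2))])
cb = lbg(pts, M=2)
print(cb.shape)  # (2, 2)
```

On this data the two codewords settle near the two cluster centres, which is the behaviour the flow diagram in Figure 6 formalises.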
Obtaining Speech Waveform
The first task is to record the speech waveform from the speaker and load it into the program. The Sound Recorder program in Microsoft Windows was chosen to record the speech waveform. The recorded speech is automatically filtered, sampled at a sampling rate of 22.05 kHz and then saved as a wave file. The wave file format was chosen because it is highly compatible with the MATLAB program, as it can be easily retrieved via a single command.
Pre-Extraction Process
The speech wave file is loaded into the MATLAB program using the "wavread" function, which limits the amplitude of the speech signal to a magnitude of 1. The signal is then saved as an M x 1 vector, where M refers to the total number of samples in the speech signal. Each element in the vector contains the amplitude of the speech signal at a particular sampling instant. The speech signal is now ready to go through the compatibility and quality process.
Quality Process
Before the actual extraction process takes place, the wave file is subjected to a series of processes to ensure the compatibility and quality of the signal. When the speech signal is loaded into the MATLAB program, it is not centred on the y = 0 axis. In order to bring the whole signal to centre on the zero line, special program code was written. This code finds the mean of the signal and then subtracts this mean from each of the sample values of the signal. This is shown in Figure 1 below. The reason for shifting the whole signal to the y = 0 axis will be explained in a later section.

Speech Waveform
The next process is to suppress the noise present in the speech waveform. Although the Sound Recorder program performs some initial filtering, some noise is still present in the speech waveform. Another section of the MATLAB program code is used to set a threshold value on the speech signal. Any value of the speech signal that falls below this threshold value is set to zero. This greatly suppresses the unwanted noise while preserving the content of the main speech signal. This is illustrated in Figure 2 and Figure 3 below. After some testing, it was found that a threshold value of 0.02 is most suitable.

Fig 2: Speech waveform before filtering
Fig 3: Speech waveform after filtering
The final compatibility and quality process is to determine the area of interest of the speech signal. This is done by detecting the first rise point and the final drop point of the speech waveform. This can be done easily since the speech signal has been cleared of unwanted noise. Hence the area of interest of the speech lies between the first rise and final drop points of the speech waveform. This area of interest is later used for the extraction and coding processes.
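The three conditioning steps described above (mean removal, thresholding at 0.02, and trimming to the first rise and final drop) can be sketched as follows; the ten-sample "signal" is an illustrative stand-in for a recorded waveform.

```python
import numpy as np

def condition(signal, threshold=0.02):
    """Centre the signal on the zero line, suppress low-level noise,
    and keep only the region between the first rise and final drop."""
    s = np.asarray(signal, dtype=float)
    s = s - s.mean()                 # subtract the mean from every sample
    s[np.abs(s) < threshold] = 0.0   # zero out values below the threshold
    nonzero = np.flatnonzero(s)
    return s[nonzero[0] : nonzero[-1] + 1]   # area of interest

# Tiny stand-in waveform: a 0.05 DC offset with a short burst of "speech"
raw = np.array([0.05, 0.05, 0.05, 0.55, -0.45, 0.35, -0.25, 0.05, 0.05, 0.05])
trimmed = condition(raw)
print(trimmed)  # ~ [0.5, -0.5, 0.3, -0.3]
```

After conditioning, only the four speech samples survive: the DC offset is removed by the mean subtraction and the silent tails fall below the threshold.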
Database
Before the classification takes place, speech samples from each speaker have to be collected, coded, converted to an MFCC code and stored in the database. The utterances of these speech samples have to be of the same phrase. The speech samples for each speaker are then averaged to produce the mean reference matrix. This averaging process is necessary as it reduces the inconsistency of the speaker's speech. The speech samples must also go through a process to find the standard deviation of the samples.
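A sketch of this averaging step, assuming each utterance has been reduced to a 12 x 40 MFCC matrix (illustrative dimensions with random stand-in values; the text does not fix these sizes):

```python
import numpy as np

# Several MFCC matrices (coefficients x frames) from the same speaker,
# all uttering the same phrase; values here are random stand-ins.
rng = np.random.default_rng(1)
samples = [rng.normal(size=(12, 40)) for _ in range(5)]

stack = np.stack(samples)            # (utterances, coeffs, frames)
mean_reference = stack.mean(axis=0)  # mean reference matrix for the speaker
std_reference = stack.std(axis=0)    # per-element standard deviation

print(mean_reference.shape, std_reference.shape)  # (12, 40) (12, 40)
```

The element-wise mean smooths out utterance-to-utterance variation, while the standard deviation records how consistent the speaker is at each coefficient.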
Neural Network Process
After the mean reference matrix of each individual speaker is obtained, the elements in the matrix are passed through the Neural Network to construct the weight and bias for each individual speaker. Once the weight and bias of each individual speaker are generated, the program is ready to classify any unknown user.
Conclusion
With the positive results collected, speech recognition using MFCC and Neural Networks has proven to be excellent at classifying speech signals. Unlike traditional speech recognition techniques, which involve complex Fourier transformations, the method used by the mel frequency cepstrum in coding the signal is simple and accurate. The acoustic characteristics of the speaker's speech can easily be detected from visual inspection of the MFCC code. The Neural Network classification method used is also reliable and uncomplicated to implement.

From the results, it is obvious that single-syllable words are more reliable in terms of training. This is probably because humans' pronunciations of single-syllable words are more consistent.
Despite the positive results collected, there are still a few "false" acceptances and "false" rejections being detected. This may be considered a serious issue when the system is applied to a high-security room. The main reason behind these errors is the inconsistency of human speech. Although this system is a formidable combination, the single-layer perceptron technique is unable to reduce the inconsistency of the speech signals. Therefore, more robust and powerful methods have to be employed to reduce the inconsistency of the speech signals. This will be further explained in the following chapter.
Finally, to conclude, mel frequency cepstrum processing has the ability to discriminate signals that remain indistinguishable in the frequency domain. Furthermore, due to their economy, robustness and flexibility, these two combined techniques can easily be implemented on cost-effective machines which require speech verification or identification.
Other Methods in Neural Networks
Many methods of classifying speech signals can be found in the Neural Network literature. One method is based on the Auto-Associative Neural Network model, in which the distribution-capturing ability of the network is exploited to model the speaker's speech signal. Another high-performance Neural Network based approach uses a State Transition Matrix; this method has the ability to handle inconsistent speech signals. An unsupervised learning method such as the Kohonen Self-Organising Map can also be employed.
Speaker Identification
The current system is focused on speaker verification, which tests an unknown speaker against a known speaker. The method presented in this thesis is still not reliable enough to be used in speaker identification applications. In speaker identification, the aim is to determine whether the utterance of an unknown speaker belongs to any of the speakers in a known group. Of these two applications, speaker identification is generally more difficult to achieve, due to the larger speaker populations, which produce more errors. Future work should concentrate on speaker identification, as it will increase the commercial value of the system.