
Transcript of speaker-D

  • 8/13/2019 speaker-D


    A Project Report

    On

    Speaker Recognition System

    Implemented in MATLAB

    Abstract

Speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. This technique makes it possible to use the speaker's voice to verify their identity and control access to services such as voice dialling, banking by telephone, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers.

Speaker recognition can be classified into identification and verification. Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker.

Speaker recognition methods can also be divided into text-independent and text-dependent methods. In a text-independent system, speaker models capture characteristics of somebody's speech which show up irrespective of what one is saying. In a text-dependent system, on the other hand, the recognition of the speaker's identity is based on his or her speaking one or more specific phrases, like passwords, card numbers, PIN codes, etc. All of these technologies (identification and verification, text-independent and text-dependent) have their own advantages and disadvantages and may require different treatments and techniques. The choice of which technology to use is application-specific. The system that we will develop is classified as a text-independent speaker identification system, since its task is to identify the person who speaks regardless of what is being said.

    Overview

Speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. This technique makes it possible to use the speaker's voice to verify their identity and control access to services such as voice dialing, banking by telephone, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers.

    Principles of Speaker Recognition

Speaker recognition can be classified into identification and verification. Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker. Figure 1 shows the basic structures of speaker identification and verification systems.

Speaker recognition methods can also be divided into text-independent and text-dependent methods. In a text-independent system, speaker models capture characteristics of somebody's speech which show up irrespective of what one is saying. In a text-dependent system, on the other hand, the recognition of the speaker's identity is based on his or her speaking one or more specific phrases, like passwords, card numbers, PIN codes, etc.

All of these technologies (identification and verification, text-independent and text-dependent) have their own advantages and disadvantages and may require different treatments and techniques. The choice of which technology to use is application-specific. The system that we will develop is classified as a text-independent speaker identification system, since its task is to identify the person who speaks regardless of what is being said.

At the highest level, all speaker recognition systems contain two main modules (refer to Figure 1): feature extraction and feature matching. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker. Feature matching involves the actual procedure to identify the unknown speaker by comparing the features extracted from his/her voice input with the ones from a set of known speakers. We will discuss each module in detail in later sections.
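Although the report's implementation is in MATLAB, the two-module structure can be sketched in a few lines of Python. The mean-vector "model" and Euclidean matching below are deliberate simplifications for illustration only; the system described later uses MFCC features and a more elaborate matcher:

```python
import numpy as np

def enroll(samples):
    """Build a reference model for one speaker: here, simply the
    mean of that speaker's feature vectors (a deliberate simplification)."""
    return np.mean(samples, axis=0)

def identify(models, feature):
    """Return the enrolled speaker whose model is closest
    (Euclidean distance) to the input feature vector."""
    return min(models, key=lambda name: np.linalg.norm(models[name] - feature))

# Toy fixed-length "feature vectors" standing in for real front-end output.
models = {
    "alice": enroll(np.array([[1.0, 0.1], [1.2, 0.0]])),
    "bob":   enroll(np.array([[-1.0, 0.9], [-0.8, 1.1]])),
}
print(identify(models, np.array([1.1, 0.05])))  # prints: alice
```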


Figure 1. Basic structures of speaker recognition systems

All speaker recognition systems have to serve two distinct phases. The first is referred to as the enrollment or training phase, while the second is referred to as the operational or testing phase. In the training phase, each registered speaker has to provide samples of their speech so that the system can build or train a reference model for that speaker. In the case of speaker verification systems, a speaker-specific threshold is also computed from the training samples. During the testing (operational) phase (see Figure 1), the input speech is matched with the stored reference model(s) and a recognition decision is made.

Speaker recognition is a difficult task, and it is still an active research area. Automatic speaker recognition works on the premise that a person's speech exhibits characteristics that are unique to the speaker. However, this task is challenged by the high variability of input speech signals. The principal source of variance is the speaker himself: speech signals in training and testing sessions can be greatly different due to many factors, such as voice changing with time, health conditions (e.g. the speaker has a cold), speaking rates, etc. There are also other factors, beyond speaker variability, that present a challenge to speaker recognition technology. Examples of these are acoustical noise and variations in recording environments (e.g. the speaker uses different telephone handsets).

General Idea of Speech Recognition

Human speech presents a formidable pattern classification task for a speech recognition system. Numerous speech recognition techniques have been formulated, yet the very best techniques used today have recognition capabilities well below those of a child. This is due to the fact that human speech is highly dynamic and complex. Several disciplines are generally involved in the study of human speech, and a basic understanding of them is needed in order to create an effective system. The following provides a brief description of the disciplines that have been applied to speech recognition problems.

• Signal Processing

This process extracts the important information from the speech signal in a well-organised manner. In signal processing, spectral analysis is used to characterize the time-varying properties of the speech signal. Several other types of processing are also needed prior to the spectral analysis stage to make the speech signal more accurate and robust.

• Acoustics

The science of understanding the relationship between the physical speech signal and the human vocal tract mechanisms that produce the speech and with which the speech is distinguished.

• Pattern Recognition

A set of coding algorithms used to compute data to create prototypical patterns of a data ensemble. It is used to compare a pair of patterns based on the features extracted from the speech signal.

• Communication and Information Theory

The procedures for estimating the parameters of statistical models and the methods for recognizing the presence of speech patterns.

• Linguistics

This refers to the relationships between sounds, words in a sentence, meaning, and the logic of spoken words.


• Physiology

This refers to the comprehension of the higher-order mechanisms within the human central nervous system that are responsible for the production and perception of speech in human beings.

• Computer Science

The study of effective algorithms for application in software and hardware, for example the various methods used in a speech recognition system.

• Psychology

The science of understanding the aspects that enable the technology to be used by human beings.

Speech Production

Speech is the acoustic product of voluntary and well-controlled movement of the human vocal mechanism. During the generation of speech, air is inhaled into the lungs by expanding the rib cage, drawing it in via the nasal cavity, velum and trachea. It is then expelled back into the air by contracting the rib cage and increasing the lung pressure. During the expulsion of air, the air travels from the lungs and passes through the vocal cords, the two symmetric pieces of ligaments and muscles located in the larynx on the trachea. Speech is produced by the vibration of the vocal cords. Before the expulsion of air, the larynx is initially closed. When the pressure produced by the expelled air is sufficient, the vocal cords are pushed apart, allowing air to pass through. The vocal cords close upon the decrease in air flow. This relaxation cycle is repeated, with generation frequencies in the range of 80 Hz to 300 Hz. The generated frequency depends on the speaker's age, sex, stress and emotions. This succession of glottal openings and closures generates quasi-periodic pulses of air after the vocal cords.

The figure below shows the schematic view of the human speech apparatus.

The speech signal is a time-varying signal whose characteristics represent the different speech sounds produced. There are three ways of labelling events in speech. First is the silence state, in which no speech is produced. The second is the unvoiced state, in which the vocal cords are not vibrating, so the output speech waveform is aperiodic and random in nature. The last is the voiced state, in which the vocal cords vibrate periodically as air is expelled from the lungs, so the output speech is quasi-periodic. Figure 2 below shows a speech waveform with unvoiced and voiced states.


Speech is produced as a sequence of sounds. The type of sound produced depends on the shape of the vocal tract, which extends from the opening of the vocal cords to the end of the lips. Its cross-sectional area depends on the positions of the tongue, lips, jaw and velum; these articulators therefore play an important part in the production of speech.

Block Diagram (Engineering Model) of the Human Speech Production System

Factors associated with speech:

Formants:

It is known from research that the vocal tract and nasal tract are tubes with non-uniform cross-sectional area. As the generated sound propagates through these tubes, its frequency spectrum is shaped by the frequency selectivity of the tube. This effect is very similar to the resonance effects observed in organ pipes and wind instruments. In the context of speech production, the resonance frequencies of the vocal tract are called formant frequencies, or simply formants. In our engineering model, the poles of the transfer function correspond to the formants. The human auditory system is much more sensitive to poles than to zeros.
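Since formants correspond to the poles of the vocal tract transfer function, they can be read off an all-pole model: a complex pole pair at angle theta maps to a resonance near f = theta * fs / (2*pi). A minimal numpy sketch with a hypothetical all-pole filter (the pole positions below are illustrative, not measured values):

```python
import numpy as np

fs = 8000  # assumed sampling rate, Hz

# Hypothetical vocal-tract model: two resonances placed at 500 Hz and 1500 Hz.
poles = []
for f, r in [(500, 0.97), (1500, 0.95)]:           # (formant freq, pole radius)
    theta = 2 * np.pi * f / fs
    poles += [r * np.exp(1j * theta), r * np.exp(-1j * theta)]

a = np.poly(poles).real          # denominator coefficients of H(z) = 1/A(z)

# Recover the formants from A(z): pole angles -> frequencies in Hz.
roots = np.roots(a)
roots = roots[roots.imag > 0]    # keep one pole of each conjugate pair
formants = sorted(np.angle(roots) * fs / (2 * np.pi))
print([round(f) for f in formants])  # [500, 1500]
```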

Phonemes:

Phonemes can be defined as the symbols from which every sound can be classified or produced. For speech, a crude estimate of the information rate, considering the physical limitations on articulatory motion, is about 10 phonemes per second.

Types of Phonemes:

Speech sounds can be classified into 3 distinct classes according to the mode of excitation:

1. Plosive sounds
2. Voiced sounds
3. Unvoiced sounds

1. Plosive Sounds:

Plosive sounds result from making a complete closure (again toward the front end of the vocal tract), building up pressure behind the closure, and abruptly releasing it.

2. Voiced Sounds:

Voiced sounds are produced by forcing air through the glottis with the tension of the vocal cords adjusted so that they vibrate in a relaxation oscillation, thereby producing quasi-periodic pulses of air which excite the vocal tract.

Voiced sounds are characterized by:

1. High energy levels
2. Very distinct resonant and formant frequencies.

The rate at which the vocal cords vibrate determines the pitch. These vibrations are periodic in time, so voiced sounds are approximated by an impulse train. The spacing between impulses is the pitch period, whose inverse is the fundamental frequency F0.

3. Unvoiced Sounds:

Unvoiced sounds, also known as fricatives, are generated by forming a constriction at some point in the vocal tract (usually toward the mouth end) and forcing air through the constriction at high enough velocity to produce turbulence. This creates a broad-spectrum noise source to excite the vocal tract.

Unvoiced sounds are characterized by:

1. Lower energy levels than voiced sounds.
2. Higher frequencies than voiced sounds.

In other words, unvoiced sounds (e.g. /sh/, /s/, /p/) are generated without vocal cord vibration. The excitation is modeled by a White Gaussian Noise source. Unvoiced sounds have no pitch, since they are excited by a non-periodic signal.


Spectra of Typical Voiced and Unvoiced Speech

By passing the speech through the prediction-error filter A(z), the spectrum is considerably flattened (whitened), although it still contains some fine detail.
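The whitening step can be sketched with a low-order linear-prediction analysis: the coefficients of A(z) are obtained from the autocorrelation sequence via the Levinson-Durbin recursion, and filtering the signal through A(z) leaves a residual with much smaller variance. A minimal numpy sketch, assuming an order p = 8 and a synthetic correlated input (both illustrative choices, not values from the report):

```python
import numpy as np

def levinson_durbin(r, p):
    """Solve for prediction-error filter A(z) = 1 + a1*z^-1 + ... + ap*z^-p
    from autocorrelation r[0..p] (Levinson-Durbin recursion)."""
    a = np.zeros(p + 1); a[0] = 1.0
    e = r[0]
    for i in range(1, p + 1):
        k = -(r[i] + a[1:i] @ r[i-1:0:-1]) / e   # reflection coefficient
        a[1:i] = a[1:i] + k * a[i-1:0:-1]        # symmetric coefficient update
        a[i] = k
        e *= (1 - k * k)                         # prediction-error power
    return a

# Synthetic "voiced-like" signal: strong correlation between samples.
rng = np.random.default_rng(0)
x = np.zeros(4000); drive = rng.standard_normal(4000)
for n in range(2, 4000):                         # second-order resonator
    x[n] = 1.8 * x[n-1] - 0.9 * x[n-2] + drive[n]

p = 8
r = np.array([x[:len(x)-k] @ x[k:] for k in range(p + 1)]) / len(x)
a = levinson_durbin(r, p)

residual = np.convolve(x, a)[:len(x)]            # pass x through A(z) (FIR)
print(np.var(residual) < 0.1 * np.var(x))        # spectrum flattened: True
```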


Special Types of Voiced and Unvoiced Sounds:

There are, however, some special types of voiced and unvoiced sounds, which are briefly discussed here to give the reader an idea of the further types of voiced and unvoiced speech.

Vowels:

Vowels are produced by exciting a fixed vocal tract with quasi-periodic pulses of air caused by vibration of the vocal cords. The way in which the cross-sectional area varies along the vocal tract determines the resonant frequencies of the tract (formants) and thus the sound that is produced. The dependence of cross-sectional area upon distance along the tract is called the area function of the vocal tract. The area function of a particular vowel is determined primarily by the position of the tongue, but the positions of the jaw and lips also affect the resulting sound to a small extent.

Examples: a, e, i, o, u

Diphthongs:

Although there is some ambiguity and disagreement as to what is and what is not a diphthong, a reasonable definition is that a diphthong is a gliding monosyllabic speech item that starts at or near the articulatory position for one vowel and moves to or toward the position for another. According to this definition, there are 6 diphthongs in American English.

Diphthongs are produced by varying the vocal tract smoothly between vowel configurations appropriate to the diphthong. Accordingly, a diphthong can be characterized by a time-varying vocal tract area function that varies between two vowel configurations.

Examples: /eI/ (as in bay), /oU/ (as in boat), /aI/ (as in buy), /aU/ (as in how)

Semivowels:

The group of sounds consisting of /w/, /l/, /r/ and /y/ is quite difficult to characterize. These sounds are called semivowels because of their vowel-like nature. They are generally characterized by a gliding transition in the vocal tract area function between adjacent phonemes, so the acoustic characteristics of these sounds are strongly influenced by the context in which they occur. For our purposes they are simply considered transitional vowel-like sounds, similar in nature to vowels and diphthongs.

Voiced Fricatives:

The voiced fricatives /v/, /th/, /z/ and /zh/ are the counterparts of the unvoiced fricatives /f/, /θ/, /s/ and /sh/ respectively, in that the place of constriction for each of the corresponding phonemes is essentially identical. However, the voiced fricatives differ from their unvoiced counterparts in that two excitation sources are involved in their production, so the spectra of voiced fricatives can be expected to display two distinct components.

Voiced Stops:

The voiced stops /b/, /d/ and /g/ are transient, non-continuant sounds produced by building up pressure behind a total constriction somewhere in the oral tract and suddenly releasing the pressure. For /b/ the constriction is at the lips; for /d/ it is at the back of the teeth; and for /g/ it is near the velum. During the period in which there is a total constriction in the tract, no sound is radiated from the lips. Since the stop sounds are dynamic in nature, their properties are highly influenced by the vowel which follows the stop consonant.

Unvoiced Stops:

The unvoiced stop consonants /p/, /t/ and /k/ are similar to their voiced counterparts /b/, /d/ and /g/, with one major exception: during the period of total closure of the tract, as the pressure builds up, the vocal cords do not vibrate. Following the period of closure, as the air pressure is released, there is a brief interval of friction (due to the sudden turbulence of the escaping air), followed by a period of aspiration (steady flow of air from the glottis exciting the resonances of the vocal tract) before voiced excitation begins.

Hearing and Perception

Audible sounds are transmitted to the human ear through the vibration of particles in the air. The human ear consists of three parts: the outer ear, the middle ear and the inner ear. The function of the outer ear is to direct speech pressure variations toward the eardrum, where the middle ear converts the pressure variations into mechanical motion. The mechanical motion is then transmitted to the inner ear, which transforms it into electrical potentials that pass through the auditory nerve and cortex to the brain. The figure below shows a schematic diagram of the human ear.


Schematic Diagram of the Human Ear

The Engineered Model:

The speech mechanism can be modeled as a time-varying filter (the vocal tract) excited by an oscillator (the vocal folds), with different outputs. When voiced sound is produced, the filter is excited by an impulse train in a range of frequencies (60-400 Hz). When unvoiced sound is produced, the filter is excited by random white noise, without any observed periodicity. These attributes can be observed when the speech signal is examined in the time domain.
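This source-filter model is easy to sketch in code: a periodic impulse train (voiced) or white noise (unvoiced) drives the same all-pole "vocal tract" filter. The filter coefficients and the F0 of 100 Hz below are illustrative assumptions, not values from the report:

```python
import numpy as np

fs = 8000                       # assumed sampling rate, Hz
n = 2000                        # samples to synthesize (0.25 s)

def vocal_tract(excitation, a=(1.0, -1.3, 0.8)):
    """Apply an all-pole filter y[t] = x[t] - a1*y[t-1] - a2*y[t-2],
    standing in for a (here time-invariant) vocal tract."""
    y = np.zeros(len(excitation))
    for t in range(len(excitation)):
        y[t] = excitation[t]
        for k in range(1, len(a)):
            if t - k >= 0:
                y[t] -= a[k] * y[t - k]
    return y

# Voiced source: impulse train at F0 = 100 Hz, one pulse every fs/F0 samples.
f0 = 100
period = fs // f0               # 80 samples between glottal pulses
voiced_src = np.zeros(n)
voiced_src[::period] = 1.0

# Unvoiced source: white noise, no periodicity.
rng = np.random.default_rng(0)
unvoiced_src = rng.standard_normal(n)

voiced = vocal_tract(voiced_src)
unvoiced = vocal_tract(unvoiced_src)
print(period)  # 80: the pitch period in samples
```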


(a): The Human Speech Production
(b): Speech Production by a Machine

Why Encode Speech?

Speech coding has been, and still is, a major issue in the area of digital speech processing. Speech coding is the act of transforming the speech signal at hand into a more compact form, which can then be transmitted or stored with considerably less memory. The motivation behind this is the fact that access to an unlimited amount of bandwidth is not possible; therefore there is a need to code and compress speech signals. Speech compression is required in long-distance communication, high-quality speech storage, and message encryption. For example, in digital cellular technology many users need to share the same frequency bandwidth, and utilizing speech compression makes it possible for more users to share the available system. Another example where speech compression is needed is digital voice storage: for a fixed amount of available memory, compression makes it possible to store longer messages.

Speech coding is a lossy type of coding, which means that the output signal does not sound exactly like the input; the input and output signals can be distinguished as different. Coding of audio, however, is a different kind of problem from speech coding. Audio coding tries to code the audio in a perceptually lossless way: even though the input and output signals are not mathematically equivalent, the sound at the output is perceived to be the same as the input. This type of coding is used in applications for audio storage, broadcasting, and Internet streaming.

Several techniques of speech coding exist, such as Linear Predictive Coding (LPC), waveform coding and sub-band coding. The problem at hand is to use LPC to code given speech sentences. The speech signals that need to be coded are band-limited to frequencies from 0 to 4 kHz, so the sampling frequency should be 8 kHz. Different types of applications have different time-delay constraints: for example, in network telephony only a delay of about 1 ms is acceptable, whereas a delay of 500 ms is permissible in video telephony. Another constraint at hand is not to exceed an overall bit rate of 8 kbps.
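The bit-rate constraint fixes how many bits each analysis frame may consume. As a worked example, with the caveat that the 20 ms frame length and the parameter split below are illustrative assumptions in the spirit of LPC vocoders, not figures from the report:

```python
# Bit-budget arithmetic for a hypothetical LPC coder at 8 kbps.
bit_rate = 8000                              # bits per second (overall constraint)
frame_ms = 20                                # assumed analysis frame length
frames_per_sec = 1000 // frame_ms            # 50 frames every second
bits_per_frame = bit_rate // frames_per_sec  # 160 bits available per frame

# One illustrative way to spend those bits in each frame:
alloc = {
    "10 LPC coefficients @ 12 bits": 10 * 12,  # 120
    "pitch period": 7,
    "voiced/unvoiced flag": 1,
    "gain": 5,
}
spent = sum(alloc.values())
print(bits_per_frame, spent, spent <= bits_per_frame)  # 160 133 True
```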

The speech coder that will be developed is going to be analyzed using both subjective and objective analysis. Subjective analysis will consist of listening to the encoded speech signal and judging its quality; the quality of the played-back speech will be based solely on the opinion of the listener, who may rate the speech as impossible to understand, intelligible, or natural sounding. Even though this is a valid measure of quality, an objective analysis will be introduced to assess the speech quality technically and to minimize human bias. Furthermore, an analysis of the effects of bit rate, complexity and end-to-end delay on the output speech quality will be made. The report will conclude with a summary of results and some ideas for future work.

Speech Processing

The speech waveform needs to be converted into digital format before it is suitable for processing in the speech recognition system; the raw speech waveform is in analog format before conversion. The conversion of an analog signal to a digital signal involves three phases: sampling, quantisation and coding. In the sampling phase, the analog signal is transformed from a waveform that is continuous in time to a discrete signal, that is, a sequence of samples that are discrete in time. In the quantisation phase, each sampled value is approximated by one of the finite set of values contained in a code set. These two stages allow the speech waveform to be represented by a sequence of values, with each value belonging to the finite set. After passing through the sampling and quantisation stages, the signal is then coded in the coding phase, usually as a binary code. These three phases need to be carried out with caution, as any miscalculation, under-sampling or quantisation noise will result in a loss of information. Below are the problems faced in the three phases.

    Sampling

According to the Nyquist theorem, the minimum sampling rate required is two times the bandwidth of the signal. This minimum sampling frequency is needed for the reconstruction of a band-limited waveform without error. Aliasing distortion will occur if the minimum sampling rate is not met. The figure below shows the comparison between a properly sampled case and an improperly sampled case.

Aliasing Distortion due to Improper Sampling
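Aliasing is easy to demonstrate numerically: a 5 kHz cosine sampled at 8 kHz (below the 10 kHz the Nyquist theorem demands) produces exactly the same samples as a 3 kHz cosine, so the two tones are indistinguishable after sampling. The frequencies are chosen here only for illustration:

```python
import numpy as np

fs = 8000                      # sampling rate, Hz
n = np.arange(64)              # sample indices

x_high = np.cos(2 * np.pi * 5000 * n / fs)   # 5 kHz tone, under-sampled
x_alias = np.cos(2 * np.pi * 3000 * n / fs)  # its alias at fs - 5000 = 3 kHz

# After sampling, the two tones are sample-for-sample identical.
print(np.allclose(x_high, x_alias))  # True
```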

Quantisation

Speech signals are more likely to have amplitude values near zero than at the extreme peak values allowed. For example, in digitizing voice, if the peak value allowed is 1 V, weak passages may have voltage levels on the order of 0.1 V. Speech signals with a non-uniform amplitude distribution are likely to experience quantising noise if the step size is not reduced for amplitude values near zero and increased for extremely large values. This quantising noise is known as granular and slope-overload noise. Granular noise occurs when the step size is large for amplitude values near zero. Slope-overload noise occurs when the step size is small and cannot keep up with extremely large amplitude changes. To solve this quantising-noise problem, Delta Modulation (DM) is used. Delta modulation works by reducing the step size for amplitude values near zero and increasing the step size for extremely large amplitude values. The figure below shows a diagram of the two types of noise.


Analog Input and Accumulator Output Waveform
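The accumulator behaviour in the figure can be sketched as a basic one-bit delta modulator: at each sample the coder transmits only whether the input is above or below the running approximation, which then moves up or down by one step. The fixed step size and the test signal are illustrative choices (an adaptive coder would vary the step as the text describes):

```python
import numpy as np

def delta_modulate(x, step=0.1):
    """One-bit delta modulation: transmit sign bits, reconstruct by
    accumulating +/-step. Too large a step causes granular noise;
    too small a step causes slope overload on fast-rising input."""
    approx = 0.0
    bits, recon = [], []
    for sample in x:
        bit = 1 if sample > approx else 0       # the only transmitted bit
        approx += step if bit else -step        # accumulator update
        bits.append(bit)
        recon.append(approx)
    return np.array(bits), np.array(recon)

t = np.linspace(0, 1, 200)
x = 0.5 * np.sin(2 * np.pi * 2 * t)             # slowly varying input
bits, recon = delta_modulate(x, step=0.05)

# The accumulator output tracks the input to within about one step.
print(np.max(np.abs(recon - x)) < 0.1)  # True
```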

Approaches to Speech Recognition

Human beings are the best "machine" to recognize and understand speech. We are able to combine a wide variety of linguistic knowledge concerning syntax and semantics, and to use this knowledge adaptively according to the difficulties and characteristics of the sentences. A speech recognition system is built with this aim in mind: to match or exceed human performance. There are generally three approaches to speech recognition, namely the acoustic-phonetic, pattern recognition and artificial intelligence approaches. These three approaches will be explained in greater detail in the following sections.

Speech Coding

A digital speech coder can be classified into two main categories: waveform coders and vocoders. Waveform coders employ algorithms to encode and decode speech signals so that the system output is an approximation of the input waveform. Vocoders encode speech signals by extracting a set of parameters that are digitized and transmitted to the receiver. This set of digitized parameters is used to set values for the parameters of function generators and filters, which in turn synthesize the output speech signal. The vocoder output waveform does not approximate the input waveform and may produce an unnatural sound.

Speech Feature Extraction

    Introduction

The purpose of this module is to convert the speech waveform to some type of parametric representation (at a considerably lower information rate) for further analysis and processing. This is often referred to as the signal-processing front end.

The speech signal is a slowly time-varying signal (it is called quasi-stationary). An example of a speech signal is shown in Figure 2. When examined over a sufficiently short period of time (between 5 and 100 msec), its characteristics are fairly stationary. However, over long periods of time (on the order of 1/5 second or more), the signal characteristics change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most common way to characterize the speech signal.
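Short-time analysis begins by cutting the signal into short overlapping frames within which it is approximately stationary. A minimal sketch; the 25 ms frame and 10 ms hop are common choices assumed here, not figures from the report:

```python
import numpy as np

def frame_signal(x, fs, frame_ms=25, hop_ms=10):
    """Split signal x into overlapping short-time analysis frames."""
    frame_len = int(fs * frame_ms / 1000)    # samples per frame
    hop = int(fs * hop_ms / 1000)            # samples between frame starts
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i*hop : i*hop + frame_len] for i in range(n_frames)])

fs = 8000
x = np.random.default_rng(0).standard_normal(fs)  # 1 second of "speech"
frames = frame_signal(x, fs)
print(frames.shape)  # (98, 200): 98 frames of 200 samples (25 ms) each
```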


    Figure 2. An example of speech signal

A wide range of possibilities exist for parametrically representing the speech signal for the speaker recognition task, such as Linear Prediction Coding (LPC), Mel-Frequency Cepstrum Coefficients (MFCC), and others. MFCC is perhaps the best known and most popular, and it will be used in this project.

MFCCs are based on the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies have been used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which is a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz. The process of computing MFCCs is described in more detail next.
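A commonly used formula for the mel scale (one of several variants in the literature, calibrated so that 1000 Hz maps to about 1000 mels, roughly linear below 1000 Hz and logarithmic above) is m = 2595 * log10(1 + f/700):

```python
import math

def hz_to_mel(f):
    """Convert frequency in Hz to mels (common 2595*log10 variant)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping, used when placing triangular filterbank edges."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# The scale is roughly linear at low frequencies, compressive at high ones.
print(round(hz_to_mel(1000)))             # ~1000: the calibration point
print(round(mel_to_hz(hz_to_mel(4000))))  # 4000: the round trip is exact
```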

    Pattern Recognition


This direct approach involves manipulating the speech signals directly, without explicit feature extraction. There are two stages in this approach: the training of speech patterns, and the recognition of patterns via pattern comparison. Several identical speech signals are collected and sent to the system via the training procedure. With adequate training, the system is able to characterize the acoustic properties of the pattern; this type of classification is known as pattern classification. The recognition stage performs a direct comparison between the unknown speech signal and the speech patterns learned in the training phase, and generates an "accept" or "reject" decision based on the similarity of the two patterns. The advantages of this approach are:

1. It is simple to use and the method is fairly easy to understand.
2. It is robust to different speech vocabularies, users, feature sets, pattern comparison algorithms and decision rules.
3. It has been proven that this method generates the most accurate results.

Acoustic-Phonetic Approach

The acoustic-phonetic approach has been studied in depth for more than 40 years. It is based on the theory of acoustic phonetics, which suggests that there exist finite, distinctive phonetic units of spoken language, and that these phonetic units are broadly characterized by a set of properties that are manifest in the speech signal, or its spectrum, over time. The first step in this approach is to segment the speech signal into discrete time regions where the acoustic properties of the signal are represented by one phonetic unit. The next step is to attach one or more phonetic labels to each segmented region according to the acoustic properties. Finally, the last step attempts to determine a valid word from the phonetic labels generated in the previous steps, consistent with the constraints of the speech recognition task.

    Artificial Intelligence

This approach is a combination of the acoustic-phonetic approach and the pattern recognition approach, using the concepts and ideas of both. The artificial intelligence approach attempts to mechanize the speech recognition process according to the way a person applies intelligence in visualizing and analyzing speech. In particular, among the techniques used within this class of methods is the use of an expert system for segmentation and labeling, so that this crucial and most complicated step can be performed with more than just the acoustic information used by pure acoustic-phonetic methods. Neural networks are often used in this approach to learn the relationships between phonetic events and all the known inputs; they can also be used to differentiate between similar sound classes.

Dynamic Time Warping

Dynamic Time Warping (DTW) is one of the pioneering approaches to speech recognition. It operates by storing a prototypical version of each word in the vocabulary in a database, comparing each incoming speech signal with every stored word, and taking the closest match. This poses a problem, however, because it is unlikely that an incoming signal will fall into the constant window spacing defined by the host. For example, suppose the password to a verification system is "Queensland". When a user utters "Queeeeensland",


the longer window spacing of the stretched utterance would cause a rigid frame-by-frame comparison to fail. DTW overcomes this by non-linearly warping the time axis of the incoming signal so that corresponding sounds are aligned before the distance is computed.
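The alignment idea just described can be sketched as a small dynamic-programming routine. This is an illustrative pure-Python sketch over 1-D feature sequences, not the report's MATLAB code; the toy sequences below are invented.

```python
def dtw_distance(a, b):
    """Dynamic-time-warping distance between two 1-D sequences.

    The cost of aligning a[i] with b[j] is |a[i] - b[j]|; each step may
    advance either sequence or both, so a stretched utterance such as
    'Queeeeensland' can still align perfectly with a stored 'Queensland'."""
    INF = float("inf")
    n, m = len(a), len(b)
    # D[i][j] = best cost of aligning a[:i] with b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # stretch b
                                 D[i][j - 1],      # stretch a
                                 D[i - 1][j - 1])  # advance both
    return D[n][m]

template = [1.0, 5.0, 2.0]              # stored prototype
stretched = [1.0, 5.0, 5.0, 5.0, 2.0]   # same word, one segment drawn out
print(dtw_distance(template, stretched))  # 0.0: perfect warped alignment
```

A fixed-spacing comparison of these two sequences would fail outright because they do not even have the same length; the warped distance is zero.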


Hidden Markov Model

Speech recognition is one of the daunting challenges facing researchers throughout the world. A complete solution is still far off, and companies have spent enormous effort in pursuit of that ultimate goal. One technique that has gained acceptance among researchers is the state-of-the-art Hidden Markov Model (HMM). This model can also be combined with other techniques, such as neural networks, to form a formidable approach.

The Hidden Markov Model approach is widely used in sequence processing and speech recognition. The key feature of the HMM lies in its ability to model the temporal statistics of data by introducing a discrete hidden variable that makes a transition from one time step to the next according to a stochastic transition matrix. The distribution of the emitted symbols is embodied in the assumed emission probability density.

A Hidden Markov Model may be viewed as a finite state machine in which transitions between states are associated with the occurrence of symbols. Each state transition carries an output probability distribution, which determines the probability that a given symbol will be emitted during the transition, and a transition probability indicating the likelihood of the transition itself. Several analytical techniques have been developed for estimating these probabilities, and they have made the HMM computationally efficient, robust and flexible.

    In speech recognition' the 4MM model optimises the pro"a"ility of the training set to

    detect a particular speech$ The pro"a"ility function is performed "y the ?iter"i algorithm$

    This algorithm is a procedure used to determine an optimal state se%uence from a gi#en

    o"ser#ation se%uence$
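As a concrete illustration, here is a minimal Viterbi decoder over a toy two-state HMM. The state names and the transition and emission probabilities below are invented for the example; they are not taken from the report.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for an observation sequence.

    Works in log-space to avoid numerical underflow on long sequences."""
    # best[s] = (log-prob of best path ending in state s, that path)
    best = {s: (math.log(start_p[s]) + math.log(emit_p[s][obs[0]]), [s])
            for s in states}
    for o in obs[1:]:
        new_best = {}
        for s in states:
            # pick the predecessor state that maximizes the path probability
            lp, path = max(
                (best[p][0] + math.log(trans_p[p][s]), best[p][1])
                for p in states)
            new_best[s] = (lp + math.log(emit_p[s][o]), path + [s])
        best = new_best
    return max(best.values())[1]

states = ("voiced", "unvoiced")
start_p = {"voiced": 0.5, "unvoiced": 0.5}
trans_p = {"voiced": {"voiced": 0.8, "unvoiced": 0.2},
           "unvoiced": {"voiced": 0.2, "unvoiced": 0.8}}
emit_p = {"voiced": {"hi": 0.9, "lo": 0.1},
          "unvoiced": {"hi": 0.2, "lo": 0.8}}
print(viterbi(["hi", "hi", "lo"], states, start_p, trans_p, emit_p))
```

The decoder prefers to stay in a state (self-transitions are likelier here), so a single low-energy observation at the end is enough to switch the final state but not to rewrite the whole path.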


Neural Networks

As you read these words, your brain is using its complex network of roughly 10^11 neurons to facilitate your reading. These neurons operate massively in parallel, which allows us to read, think and write simultaneously. Scientists have found that all biological neural functions, including memory, are stored in the neurons and in the connections between them. As we learn new things every day, new connections are made or existing ones are modified. Some of these neural structures are defined at birth, while others are created daily and still others waste away. In this thesis, the neural network algorithm refers to artificial neural networks, not to the actual neurons in our brain. A picture illustrating biological neurons is shown in the figure below.

Schematic Drawing of Biological Neurons

Neural Model


The figure below shows a single-input neuron. The scalar input p is multiplied by the scalar weight w to form wp, which is sent to the summer. In the summer, the product wp is added to the bias b. The summer output n then goes into the transfer function f, which generates the scalar neuron output a. The value of the output a depends on the type of transfer function used. This artificial neuron is analogous to the biological neuron shown in the figure above: the weight w corresponds to the strength of the synapse, the cell body corresponds to the summation and the transfer function, and the neuron output a corresponds to the signal travelling down the axon.

Summer output: n = wp + b
Neuron output: a = f(wp + b)

Single-Input Neuron

Hard Limit Transfer Function

The hard limit transfer function, shown on the left side of the figure below, sets the neuron output a to 0 if the summer output n is less than 0, and to 1 if n is greater than or equal to 0. This transfer function is useful for classifying inputs into two categories, and in this thesis it is used to declare a true or false detection of the speech signal. The figure on the right shows the effect of the weight and the bias combined.

Hard Limit Transfer Function
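The single-input neuron and the hard limit transfer function described above can be sketched in a few lines; the weight, bias and input values used here are arbitrary examples, not parameters from the report.

```python
def hardlim(n):
    """Hard limit transfer function: 0 if n < 0, else 1."""
    return 0 if n < 0 else 1

def neuron(p, w, b, f=hardlim):
    """Single-input neuron: scale the input by the weight, add the bias,
    then pass the summer output through the transfer function f."""
    n = w * p + b          # summer output
    return f(n)            # neuron output a = f(wp + b)

print(neuron(2.0, w=1.5, b=-1.0))   # n = 2.0,  a = 1
print(neuron(-2.0, w=1.5, b=-1.0))  # n = -4.0, a = 0
```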

Decision Boundary

A single-layer perceptron consists of the input, the weight, the bias, the summer and the transfer function; Figure 1 below shows a diagram of one. A single-layer perceptron can be used to classify input vectors into two categories. The weight vector is always orthogonal to the decision boundary. For example, in Figure 2 below the weight is set to w = [-2 3], and the corresponding decision boundary is indicated. We can use any point on the decision boundary to find the bias, by solving wp + b = 0. Once the bias is set, any point in the plane can be classified as lying inside the shaded region (wp + b > 0) or outside it (wp + b < 0).
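Continuing the example with w = [-2 3]: the bias can be recovered from any point assumed to lie on the boundary, and new points are then classified by the sign of wp + b. The boundary point chosen below is one illustrative choice, not taken from the report's figure.

```python
def dot(w, p):
    """Inner product of the weight vector with an input point."""
    return sum(wi * pi for wi, pi in zip(w, p))

w = [-2.0, 3.0]              # weight vector, orthogonal to the boundary
boundary_point = [3.0, 2.0]  # assumed to lie on the decision boundary
b = -dot(w, boundary_point)  # solve wp + b = 0 for the bias

def classify(p):
    """1 if p lies inside the shaded region (wp + b >= 0), else 0."""
    return 1 if dot(w, p) + b >= 0 else 0

print(b)                    # bias recovered from the boundary point
print(classify([0.0, 1.0])) # wp + b =  3 -> inside:  1
print(classify([2.0, 0.0])) # wp + b = -4 -> outside: 0
```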


Figure 1: Multiple-Input Neuron

Figure 2: Perceptron Decision Boundary

Mel-Frequency Cepstrum Coefficients Processor

A block diagram of the structure of an MFCC processor is given in Figure 3. The speech input is typically recorded at a sampling rate above 10000 Hz. This sampling frequency is chosen to minimize the effects of aliasing in the analog-to-digital conversion. Signals sampled at this rate can capture all frequencies up to 5 kHz, which covers most of the energy of sounds generated by humans. As discussed previously, the main purpose of the MFCC processor is to mimic the behavior of the human ear. In addition, MFCCs have been shown to be less susceptible to the variations mentioned earlier than the raw speech waveforms themselves.


Frame Blocking

In this step the continuous speech signal is blocked into frames of N samples, with adjacent frames separated by M samples (M < N). The first frame consists of the first N samples. The second frame begins M samples after the first frame, and overlaps it by N - M samples. Similarly, the third frame begins 2M samples after the first frame (or M samples after the second frame) and overlaps it by N - 2M samples. This process continues until all the speech is accounted for within one or more frames. Typical values are N = 256 (equivalent to roughly 30 msec of windowing, and convenient for the fast radix-2 FFT) and M = 100.
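With N = 256 and M = 100 as above, frame blocking amounts to taking overlapping slices of the signal. A short sketch (this version keeps only frames that fit wholly inside the signal, which is one simplifying choice):

```python
def frame_block(signal, N=256, M=100):
    """Split a signal into frames of N samples whose start points are
    M samples apart, so consecutive frames overlap by N - M samples."""
    frames = []
    start = 0
    while start + N <= len(signal):
        frames.append(signal[start:start + N])
        start += M
    return frames

signal = list(range(1000))
frames = frame_block(signal)
print(len(frames))                       # 8 full frames fit in 1000 samples
print(frames[1][0])                      # second frame begins M = 100 in
print(frames[0][-1] - frames[1][0] + 1)  # overlap of N - M = 156 samples
```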

Windowing

The next step in the processing is to window each individual frame so as to minimize the signal discontinuities at its beginning and end. The concept is to minimize the spectral distortion by using the window to taper the signal to zero at the beginning and end of each frame. If we define the window as w(n), 0 <= n <= N-1, where N is the number of samples in each frame, then the result of windowing is the signal

    y(n) = x(n) w(n),  0 <= n <= N-1.

Typically the Hamming window is used, which has the form:

    w(n) = 0.54 - 0.46 cos(2*pi*n / (N-1)),  0 <= n <= N-1.
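The Hamming formula above can be checked directly: the window tapers to 0.08 at both frame edges and rises to nearly 1 at the centre. A stdlib sketch:

```python
import math

def hamming(N):
    """Hamming window: w(n) = 0.54 - 0.46 * cos(2*pi*n / (N-1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1))
            for n in range(N)]

def window_frame(frame):
    """y(n) = x(n) * w(n): taper the frame toward zero at both ends."""
    w = hamming(len(frame))
    return [x * wn for x, wn in zip(frame, w)]

w = hamming(256)
print(round(w[0], 2), round(w[-1], 2))  # 0.08 0.08 at the frame edges
print(round(max(w), 2))                 # 1.0 near the frame centre
```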

Fast Fourier Transform (FFT)

The next processing step is the Fast Fourier Transform, which converts each frame of N samples from the time domain into the frequency domain. The FFT is a fast algorithm for implementing the Discrete Fourier Transform (DFT), which is defined on the set of N samples x_n as follows:

    X_n = sum_{k=0}^{N-1} x_k e^{-2*pi*j*k*n/N},  n = 0, 1, ..., N-1.

Note that we use j here to denote the imaginary unit, i.e. j = sqrt(-1). In general the X_n are complex numbers. The resulting sequence {X_n} is interpreted as follows: zero frequency corresponds to n = 0, positive frequencies 0 < f < Fs/2 correspond to values 1 <= n <= N/2 - 1, and negative frequencies -Fs/2 < f < 0 correspond to N/2 + 1 <= n <= N - 1, where Fs denotes the sampling frequency. The result of this step is often referred to as the spectrum or periodogram.


Mel-Frequency Wrapping

As mentioned above, psychophysical studies have shown that human perception of the frequency content of sounds, for speech signals, does not follow a linear scale. Thus for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the 'mel' scale. The mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels. We can therefore use the following approximate formula to compute the mels for a given frequency f in Hz:

    mel(f) = 2595 * log10(1 + f / 700).
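The formula above can be checked directly: by construction, a 1 kHz tone maps to approximately 1000 mels, and higher frequencies are compressed.

```python
import math

def hz_to_mel(f):
    """Approximate mel value for a frequency f in Hz:
    roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

print(round(hz_to_mel(1000)))  # ~1000 mels, the scale's reference point
print(round(hz_to_mel(4000)))  # well below 4000: high frequencies compress
```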

One approach to simulating the subjective spectrum is to use a filter bank spaced uniformly on the mel scale (see Figure 4). Each filter has a triangular bandpass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval. The modified spectrum of S(w) thus consists of the output power of these filters when S(w) is the input. The number of mel spectrum coefficients, K, is typically chosen as 20.

Note that this filter bank is applied in the frequency domain, so it simply amounts to applying the triangle-shaped windows of Figure 4 to the spectrum. A useful way of thinking about this mel-wrapping filter bank is to view each filter as a histogram bin (where bins overlap) in the frequency domain.


Figure 4. An example of a mel-spaced filter bank

Cepstrum

In this final step, we convert the log mel spectrum back to the time domain. The result is called the mel-frequency cepstrum coefficients (MFCC). The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given frame analysis. Because the mel spectrum coefficients (and hence their logarithms) are real numbers, we can convert them to the time domain using the Discrete Cosine Transform (DCT). Therefore, if we denote the mel power spectrum coefficients resulting from the last step as S_k, k = 1, 2, ..., K, we can calculate the MFCCs c_n as:

    c_n = sum_{k=1}^{K} (log S_k) * cos( n * (k - 1/2) * pi / K ),  n = 1, 2, ..., K.


Note that we exclude the first component, c_0, from the DCT, since it represents the mean value of the input signal, which carries little speaker-specific information.
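The conversion from log mel spectrum to cepstral coefficients, keeping c_1 .. c_K and dropping c_0, can be sketched as follows; the toy four-filter spectrum is invented for illustration.

```python
import math

def mfcc(mel_powers):
    """DCT of the log mel power spectrum S_k, k = 1..K:

        c_n = sum_k log(S_k) * cos(n * (k - 1/2) * pi / K),  n = 1..K

    c_0 (the mean, carrying little speaker information) is excluded."""
    K = len(mel_powers)
    logs = [math.log(s) for s in mel_powers]
    return [sum(logs[k] * math.cos(n * (k + 0.5) * math.pi / K)
                for k in range(K))
            for n in range(1, K + 1)]

coeffs = mfcc([2.0, 4.0, 8.0, 4.0])  # toy 4-filter mel power spectrum
print(len(coeffs))                   # K coefficients, c_1 .. c_K
```

A flat spectrum (all filter outputs equal to 1) has log values of zero everywhere, so every cepstral coefficient comes out zero — a quick sanity check on the formula.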

The LBG algorithm proceeds through the following recursive steps:

1. Design a 1-vector codebook; its single codeword is the centroid of the entire set of training vectors, so no iteration is required.
2. Double the size of the codebook by splitting each current codeword.
3. Nearest-neighbor search: assign each training vector to the cluster of the closest codeword.
4. Centroid update: replace each codeword by the centroid of the training vectors assigned to it.
5. Repeat steps 3 and 4 until the average distortion falls below a preset threshold.
6. Repeat steps 2 through 5 until a codebook of size M is designed.

    Intuiti#ely' the LB algorithm designs anM)#ector code"ook in stages$ It starts first "y

  • 8/13/2019 speaker-D

    42/48

    designing a -)#ector code"ook' then uses a splitting techni%ue on the code!ords to

    initialie the search for a :)#ector code"ook' and continues the splitting process until the

    desiredM)#ector code"ook is o"tained$

Figure 6 shows, in a flow diagram, the detailed steps of the LBG algorithm. "Cluster vectors" is the nearest-neighbor search procedure, which assigns each training vector to the cluster associated with the closest codeword. "Find centroids" is the centroid update procedure. "Compute D (distortion)" sums the distances of all training vectors in the nearest-neighbor search, so as to determine whether the procedure has converged.
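A compact sketch of the split-and-refine loop in the flow diagram. For brevity this uses 1-D training values, a simple +/- epsilon split, and a fixed number of refinement passes instead of the distortion-threshold test; all of these are simplifications of the procedure described above.

```python
def lbg(training, M, eps=0.01, passes=10):
    """Design an M-codeword codebook by repeatedly splitting and refining.

    Starts from the centroid of all training values, doubles the codebook
    by perturbing each codeword by +/- eps, then alternates nearest-neighbor
    clustering with centroid updates (1-D values for brevity)."""
    codebook = [sum(training) / len(training)]   # the 1-vector codebook
    while len(codebook) < M:
        # split: double the size of the codebook
        codebook = ([c * (1 + eps) for c in codebook] +
                    [c * (1 - eps) for c in codebook])
        for _ in range(passes):
            # cluster vectors: assign each sample to its closest codeword
            clusters = [[] for _ in codebook]
            for x in training:
                i = min(range(len(codebook)),
                        key=lambda i: abs(x - codebook[i]))
                clusters[i].append(x)
            # find centroids: update each non-empty cluster's codeword
            codebook = [sum(cl) / len(cl) if cl else c
                        for cl, c in zip(clusters, codebook)]
    return sorted(codebook)

data = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9]
print(lbg(data, 2))   # codewords settle near the two cluster centres
```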

Obtaining the Speech Waveform

The first task is to record the speech waveform from the speaker and load it into the program. The Sound Recorder program in Microsoft Windows is used to record the speech waveform. The recorded speech is automatically filtered, sampled at a rate of 22.05 kHz and then saved as a wave file. The wave file format is chosen because it is highly compatible with MATLAB, which can retrieve it with a single command.

Pre-Extraction Process

The speech wave file is loaded into MATLAB using the wavread function, which limits the amplitude of the speech signal to a magnitude of 1. The signal is then saved as an M x 1 vector, where M is the total number of samples in the speech signal. Each element of the vector contains the amplitude of the speech signal at a particular sampling instant. The speech signal is now ready to go through the compatibility and quality process.

Quality Process

Before the actual extraction takes place, the wave file is subjected to a series of processes to ensure the compatibility and quality of the signal. When the speech signal is loaded into MATLAB, it is not centred on the y = 0 axis. To bring the whole signal onto the zero line, special program code was written: it finds the mean of the signal and then subtracts this mean from each of the signal's sample values. This is shown in Figure 1 below. The reason for shifting the whole signal onto the y = 0 axis is explained in a later section.

    Speech 2a#eform


The next process is to suppress the noise present in the speech waveform. Although the Sound Recorder program performs initial filtering, some noise is still present. Another section of the MATLAB code sets a threshold value on the speech signal: any sample whose value falls below this threshold is set to zero. This greatly suppresses the unwanted noise while preserving the content of the main speech signal, as illustrated in Figures 2 and 3 below. After some testing, a threshold value of 0.02 was found to be most suitable.

Fig 2: Speech waveform before filtering

Fig 3: Speech waveform after filtering

The final compatibility and quality process is to determine the region of interest of the speech signal. This is done by detecting the first rise point and the final drop point of the speech waveform, which is straightforward once the signal has been cleared of unwanted noise. The region of interest of the speech thus lies between the first rise and the final drop point of the waveform, and is later used for the extraction and coding processes.
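The three quality steps — mean removal, noise thresholding at 0.02, and trimming to the region between the first rise and final drop — can be sketched together. This is a pure-Python stand-in for the report's MATLAB code, with an invented toy signal.

```python
def preprocess(samples, threshold=0.02):
    """Centre the signal on zero, zero out sub-threshold noise, then trim
    to the region between the first rise and the final drop point."""
    # 1. subtract the mean so the signal is centred on the y = 0 axis
    mean = sum(samples) / len(samples)
    centred = [s - mean for s in samples]
    # 2. suppress noise: samples below the threshold magnitude become zero
    cleaned = [s if abs(s) >= threshold else 0.0 for s in centred]
    # 3. region of interest: between the first and last non-zero sample
    nonzero = [i for i, s in enumerate(cleaned) if s != 0.0]
    if not nonzero:
        return []
    return cleaned[nonzero[0]:nonzero[-1] + 1]

raw = [0.01, -0.01, 0.3, -0.3, 0.25, -0.25, 0.01, -0.01]
print(preprocess(raw))   # low-level noise removed, speech region kept
```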

Database

Before classification takes place, speech samples from each speaker have to be collected, coded, converted to MFCC codes and stored in the database. The utterances of these speech samples must all be of the same phrase. The speech samples for each speaker are then averaged to produce a mean reference matrix. This averaging is necessary because it reduces the inconsistency of the speaker's speech. The speech samples also go through a process to find their standard deviation.

Neural Network Process

After the mean reference matrix of each individual speaker is obtained, the elements of the matrix are passed through the neural network to construct the weight and bias for that speaker. Once the weight and bias of each speaker have been generated, the program is ready to classify any unknown user.

Conclusion


With the positive results collected, speech recognition using MFCC and a neural network has proven excellent at classifying speech signals. Unlike traditional speech recognition techniques that involve complex Fourier transformations, the method used by the mel-frequency cepstrum to code the signal is simple and accurate. The acoustic characteristics of the speaker's speech can easily be detected by visual inspection of the MFCC code. The neural network classification method used is also reliable and uncomplicated to implement.

From the results, it is evident that single-syllable words are more reliable in terms of training. This is probably because human pronunciation of single-syllable words is more consistent.

Despite the positive results collected, a few false acceptances and false rejections were still detected. This could be a serious issue if the system were applied to a high-security room. The main reason behind these errors is the inconsistency of human speech. Although this system is a formidable combination, the single-layer perceptron technique is unable to absorb the inconsistency of the speech signals. More robust and powerful methods therefore have to be employed to reduce this inconsistency, as explained further in the following chapter.

Finally, to conclude, mel-frequency cepstrum processing has the ability to discriminate between signals that remain indistinguishable in the frequency domain. Furthermore, owing to their economy, robustness and flexibility, these two combined techniques can easily be implemented on cost-effective machines requiring speech verification or identification.

Other Methods in Neural Networks

Many methods of classifying speech signals can be found in the neural network literature. One is based on the auto-associative neural network model, in which the distribution-capturing ability of the network is exploited to model the speaker's speech signal. Another high-performance neural-network-based approach uses a state transition matrix, and has the ability to address inconsistent speech signals. An unsupervised learning method such as the Kohonen Self-Organising Map can also be employed.

    Speaker Identification

The current system is focused on speaker verification, which tests an unknown speaker against a known speaker. The method presented in this thesis is not yet reliable enough for speaker identification applications. In speaker identification, the aim is to determine whether the utterance of an unknown speaker belongs to any of the speakers in a known group. Of the two applications, speaker identification is generally more difficult to achieve, because the larger speaker population


produces more errors. Future work should concentrate on speaker identification, as it will increase the commercial value of the system.